R equivalent to SQL select … group … by … having

In principle,  you can use the R ‘sqldf’ package for embedding SQL in R.  However, sqldf does not allow you to use R functions as aggregation functions.  This post is to help me remember how to translate SQL statements that include groupoing into base R.  Suppose the following test data:

test_df <- data.frame(
  sample      = c(1,2,3,4,5,6)
, sample_type = c(1,1,1,2,2,2)
)

A single aggregation can be performed to produce a single data.frame.  For example, this SQL:

select sample_type, count(sample) as sample_count
  from test_df
 group by sample_type 
 order by sample_type
having sample_count > 4

can be expressed in R as:

with(
  x <- with(
    # FROM test_df
    test_df
  , aggregate(
      # SELECT COUNT(sample) AS sample_count
      x = data.frame(sample_count = sample)
          # second column(s) of resulting data.frame
    , # SELECT sample_type
      # GROUP BY sample_type
      # ORDER BY sample_type
      by = list(sample_type = sample_type)
          # first column(s) of resulting data.frame
          # 'by' determines both the groups to be 
          #   aggregated and the order of the result
    , # SELECT COUNT(sample)
      FUN = length
          # R function to mimic SQL 'count' function
    )
  )
, # HAVING sample_count > 4
  x[ sample_count > 4, ]
)

This gets rather busy when aggregating multiple columns because each aggregation produces a data frame, so you need to “merge” the data frames (analogous to a SQL join):

pseudo SQL (pretending that SQL has statistical aggregation functions)

select sample_type
     , count(sample) as sample_count
     , mean(sample)  as sample_mean
     , var(sample)   as sample_var
  from test_df
 group by sample_type 
 order by sample_type
having sample_count > 4

R

with(
  x <-  with(
    test_df
  , {
      by <- list(sample_type = sample_type)
      Reduce(
        f = function(dtf1, dtf2) merge(dtf1, dtf2)
      , x = list(
          aggregate(
            x = data.frame(sample_count=sample)
          , by = by , FUN = length)
        , aggregate(
            x = data.frame(sample_mean =sample)
          , by = by, FUN = mean)
        , aggregate(
            x = data.frame(sample_var =sample),
            by = by, FUN = var)
        )
      )
    }
  )
, x[ sample_count > 4, ]
)

 

R S4 objects and overloaded data.frame

Here is some code extracted from this answer http://stackoverflow.com/a/14607290 to the question “How to create a dataframe of user defined S4 classes in R“. The answer and question are well worth reading, but I wanted to have the code example in one place without the intervening comments:

# Create S4 class person(name,age)
setClass("person", 
  slots = c(
    name="character", 
    age="numeric"
  )
)

# Create subsetting operator
setMethod("[", 
  "person",
  function(x, i, j) {
    initialize(x, name=x@name[i], age=x@age[i])
  }
)

# Create overload for format()
format.person <- function(x) {
  paste0(x@name, ", ", x@age)
}

# Create overload for as.data.frame()
as.data.frame.person <-
  function(x, row.names=NULL, optional=FALSE)
{
  if (is.null(row.names))
    row.names <- x@name
    value <- list(x)
    attr(value, "row.names") <- row.names
    class(value) <- "data.frame"
    return(value)
}

# Create overload for c()
c.person = function(...) {
  args = list(...)
  return(
    new("person",
      name=sapply(args, function(x) x@name),  
      age=sapply(args, function(x) x@age)
    )
  )
}

# Demonstrate the code above; writes "John, 20"
format(
    data.frame(
        c(
            new("person", name="Tom", age=30),
            new("person", name="John", age=20)
        )
    )[2,]
)