R equivalent to SQL select … group … by … having

In principle,  you can use the R ‘sqldf’ package for embedding SQL in R.  However, sqldf does not allow you to use R functions as aggregation functions.  This post is to help me remember how to translate SQL statements that include groupoing into base R.  Suppose the following test data:

test_df <- data.frame(
  sample      = c(1,2,3,4,5,6)
, sample_type = c(1,1,1,2,2,2)
)

A single aggregation can be performed to produce a single data.frame.  For example, this SQL:

select sample_type, count(sample) as sample_count
  from test_df
 group by sample_type 
 order by sample_type
having sample_count > 4

can be expressed in R as:

with(
  x <- with(
    # FROM test_df
    test_df
  , aggregate(
      # SELECT COUNT(sample) AS sample_count
      x = data.frame(sample_count = sample)
          # second column(s) of resulting data.frame
    , # SELECT sample_type
      # GROUP BY sample_type
      # ORDER BY sample_type
      by = list(sample_type = sample_type)
          # first column(s) of resulting data.frame
          # 'by' determines both the groups to be 
          #   aggregated and the order of the result
    , # SELECT COUNT(sample)
      FUN = length
          # R function to mimic SQL 'count' function
    )
  )
, # HAVING sample_count > 4
  x[ sample_count > 4, ]
)

This gets rather busy when aggregating multiple columns because each aggregation produces a data frame, so you need to “merge” the data frames (analogous to a SQL join):

pseudo SQL (pretending that SQL has statistical aggregation functions)

select sample_type
     , count(sample) as sample_count
     , mean(sample)  as sample_mean
     , var(sample)   as sample_var
  from test_df
 group by sample_type 
 order by sample_type
having sample_count > 4

R

with(
  x <-  with(
    test_df
  , {
      by <- list(sample_type = sample_type)
      Reduce(
        f = function(dtf1, dtf2) merge(dtf1, dtf2)
      , x = list(
          aggregate(
            x = data.frame(sample_count=sample)
          , by = by , FUN = length)
        , aggregate(
            x = data.frame(sample_mean =sample)
          , by = by, FUN = mean)
        , aggregate(
            x = data.frame(sample_var =sample),
            by = by, FUN = var)
        )
      )
    }
  )
, x[ sample_count > 4, ]
)