In principle, you can use the R ‘sqldf’ package to embed SQL in R. However, sqldf does not allow you to use R functions as aggregation functions. This post is to help me remember how to translate SQL statements that include grouping into base R. Suppose the following test data:
```r
test_df <- data.frame(
    sample      = c(1, 2, 3, 4, 5, 6)
  , sample_type = c(1, 1, 1, 2, 2, 2)
)
```
A single aggregation can be performed to produce a single data.frame. For example, this SQL:
```sql
select sample_type, count(sample) as sample_count
from test_df
group by sample_type
having count(sample) > 4
order by sample_type
```

(Note that in standard SQL the `having` clause comes before `order by` and cannot reference the `sample_count` alias, so the aggregate is repeated there.)
can be expressed in R as:
```r
with(
  x <- with(  # FROM test_df
    test_df
    , aggregate(
        # SELECT COUNT(sample) AS sample_count
        x = data.frame(sample_count = sample)  # second column(s) of resulting data.frame
        ,
        # SELECT sample_type
        # GROUP BY sample_type
        # ORDER BY sample_type
        by = list(sample_type = sample_type)   # first column(s) of resulting data.frame
                                               # 'by' determines both the groups to be
                                               # aggregated and the order of the result
        ,
        # SELECT COUNT(sample)
        FUN = length  # R function to mimic SQL 'count' function
      )
  )
  ,
  # HAVING sample_count > 4
  x[ sample_count > 4, ]
)
```
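For comparison, a sketch of the same query using `aggregate`'s formula interface and `subset` instead of the nested `with` calls (this is an alternative phrasing, not part of the translation pattern above):

```r
test_df <- data.frame(
    sample      = c(1, 2, 3, 4, 5, 6)
  , sample_type = c(1, 1, 1, 2, 2, 2)
)

# GROUP BY sample_type, COUNT(sample); the formula interface names the
# grouping column automatically and orders the result by the groups
agg <- aggregate(sample ~ sample_type, data = test_df, FUN = length)
names(agg)[2] <- "sample_count"  # SELECT ... AS sample_count

# HAVING sample_count > 4
subset(agg, sample_count > 4)
```

With this particular test data each group has a count of 3, so the `HAVING` filter returns zero rows; the same is true of the original query.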
This gets rather busy when aggregating multiple columns, because each aggregation produces its own data frame, so you need to merge the data frames (analogous to a SQL join):
Pseudo-SQL (pretending that SQL has statistical aggregation functions):
```sql
select sample_type
  , count(sample) as sample_count
  , mean(sample)  as sample_mean
  , var(sample)   as sample_var
from test_df
group by sample_type
having count(sample) > 4
order by sample_type
```
R:
```r
with(
  x <- with(
    test_df
    , {
        by <- list(sample_type = sample_type)
        Reduce(
          f = function(dtf1, dtf2) merge(dtf1, dtf2)
          , x = list(
              aggregate(x = data.frame(sample_count = sample), by = by, FUN = length)
            , aggregate(x = data.frame(sample_mean  = sample), by = by, FUN = mean)
            , aggregate(x = data.frame(sample_var   = sample), by = by, FUN = var)
          )
        )
      }
  )
  ,
  # HAVING sample_count > 4
  x[ sample_count > 4, ]
)
```
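A possible way to avoid the `Reduce`/`merge` dance is to let `FUN` return a named vector of all three statistics in one pass. `aggregate` then stores that vector as a matrix column, which can be flattened with `do.call(data.frame, ...)`. This is a sketch of an alternative, not the pattern the post is documenting:

```r
test_df <- data.frame(
    sample      = c(1, 2, 3, 4, 5, 6)
  , sample_type = c(1, 1, 1, 2, 2, 2)
)

# one aggregate call computing count, mean, and var per group
stats <- aggregate(
    sample ~ sample_type
  , data = test_df
  , FUN  = function(v) c(count = length(v), mean = mean(v), var = var(v))
)

# 'stats$sample' is a matrix column; flatten it into ordinary columns
stats <- do.call(data.frame, stats)
names(stats) <- c("sample_type", "sample_count", "sample_mean", "sample_var")

# HAVING sample_count > 4
stats[ stats$sample_count > 4, ]
```

The trade-off is that all statistics share one source column, whereas the merge-based version lets each aggregation draw on a different column.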