Aggregate Function in R: Making your life easier, one mean at a time

I previously posted about calculating medians using R. I used tapply to do it, but I’ve since found something that feels easier to use (at least to me).

aggregated_output = aggregate(DV ~ IV1 * IV2,
                data=data_to_aggregate, FUN=median)
aggregated_output

The above code saves an aggregated dataset to aggregated_output, with the median for each group in its own column. The function to apply (median, mean, or whatever else you like) is specified by FUN=. The variable the medians are calculated over is given on the left-hand side of the formula: DV (the dependent variable).

The aggregate function also gives you a column for each IV (independent variable). You can have as many of these as you like; they go on the right-hand side of the formula. Here, I have two, specified by IV1 * IV2.
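
To make that concrete, here is a minimal, self-contained sketch. The data frame and its values are made up purely for illustration; only the column names DV, IV1, and IV2 are carried over from the code above.

# A small made-up dataset: two grouping factors and one numeric outcome
data_to_aggregate = data.frame(IV1 = rep(c("a", "b"), each = 4),
                               IV2 = rep(c("x", "y"), times = 4),
                               DV  = c(3, 5, 2, 8, 4, 6, 1, 9))

# One row per IV1/IV2 combination, with the median of DV in the final column
aggregated_output = aggregate(DV ~ IV1 * IV2,
                data=data_to_aggregate, FUN=median)
aggregated_output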

Those of you who are familiar with relational databases will see immediately that this function is somewhat similar to GROUP BY (in MySQL). The bonus is that, with aggregate, you don't need to SELECT the IV columns you want returned; they are included in the output automatically. For example, the equivalent query would look something like this:

SELECT IV1, IV2, AVG(DV) FROM data_to_aggregate GROUP BY IV1, IV2
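
As an aside, if you would rather write the SQL itself from within R, one option is the sqldf package, which runs queries against data frames (via SQLite by default). A rough sketch, assuming sqldf is installed and reusing the toy data frame from above:

library(sqldf)
# Same grouping as aggregate(), but expressed as SQL; AVG gives the mean per group
sqldf("SELECT IV1, IV2, AVG(DV) AS mean_dv FROM data_to_aggregate GROUP BY IV1, IV2")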

There is apparently more than one way to skin a cat (even if it’s a cat that’s made of data).

5 thoughts on “Aggregate Function in R: Making your life easier, one mean at a time”

  1. Thanks for the post. I have often wondered what is the best way to do this in R, especially when there is a very large dataset.

    I am wondering which is the best: aggregate(), split() using RSQLite, or using the data.table package.

  2. Good point, eric – I, too, have been confused about the best method to use. I’m currently working on a new post that compares several different methods for aggregating data. I assume most people stick to the one they know, or use different functions depending on the task and its requirements (e.g., large amounts of data to process).

  3. Are you familiar with the plyr package? http://had.co.nz/plyr/

    It’s basically a replacement/unification of R’s grouping functions (aggregate, by, the sapply/lapply/tapply/etc. family), which IMO makes it a lot easier to learn and use. Plus, it’s by Hadley Wickham, the same guy who created ggplot2.

    Your function would look something like
    aggregated_output = ddply(data_to_aggregate, c("IV1", "IV2"), summarise, median_dv = median(DV))
    [Admittedly, that may not seem simpler than the aggregate function at first.]
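
    A self-contained version might look something like this (with a small made-up data frame, and assuming plyr is installed):

    library(plyr)
    # Toy data: one DV value per IV1/IV2 combination, just for illustration
    data_to_aggregate = data.frame(IV1 = rep(c("a", "b"), each = 2),
                                   IV2 = rep(c("x", "y"), times = 2),
                                   DV  = c(3, 5, 2, 8))
    # One row per IV1/IV2 combination, with the median of DV as median_dv
    ddply(data_to_aggregate, c("IV1", "IV2"), summarise, median_dv = median(DV))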

  4. Thanks for the tip! I have been using plyr a lot lately, and I agree that it’s easier to use than aggregate(). I’ve just put up a post comparing some more methods, including plyr (http://www.psychwire.co.uk/2011/04/data-aggregation-in-r-plyr-sqldf-and-data-table/). It should have gone up sooner (I thought I had scheduled it for yesterday), but it’s there today!

    It seems like the different aggregation methods all have their different pros and cons, but, once you get into them, they are all pretty versatile!
