I’ve previously put up a couple of posts about aggregating data in R. In this post, I’m going to try out some alternative methods for aggregating the dataset. Before I begin, I’d like to thank Matthew Dowle for highlighting these to me. It’s a bit daunting at first, deciding which method of aggregating data is best. So I decided to give them all a go to see what they were like. Let’s go for it!
For this post, I’m going to be using the lexdec dataset that comes with the languageR package (see the languageR package documentation for more details). I’ve called it full_list here, in order to play around with it. The details of the dataset are not that important; it’s just a case of getting hold of some data from human subjects (i.e., what I’m used to!).
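In case you want to follow along, loading the data looks something like this (a minimal sketch, assuming the languageR package is installed; full_list is just my working name for it):
library(languageR)                              # provides the lexdec dataset
full_list <- lexdec                             # work on a copy under a friendlier name
str(full_list[, c("Subject", "Class", "RT")])   # the three columns used below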
The Target
Before we get into the functions themselves, let’s take a look at the aggregated data that we want. It has the mean, median and standard error of the RT variable (RT stands for Response Time, or time taken to press a button). I want to get these summary statistics for every participant (the Subject column) crossed with every level of the Class column. So, ultimately, the target is the following summary table (note that I’ve truncated this as there are lots of participants):
A tool by any other name: plyr
Let’s begin with plyr. The power of plyr comes from the fact that it splits up data, runs a function on the split-up data, and then sticks it all back together. It has a wide variety of useful aggregation functions, but here I’m going to use ddply. This function takes a dataframe as its input and gives another dataframe as its output. The plyr functions are named with the pattern XYply, where X is the input object type and Y is the output object type; in this case, both d’s of ddply stand for dataframe. Let’s look at some initial code:
ddply(full_list, c("Subject","Class"), function(df)mean(df$RT))
This is fine, and gives us mean RT values for each subject and class. But there’s a problem. The “mean” column is labelled V1, which isn’t that helpful, especially if we have multiple columns computed (i.e., ending up with V1, V2, V3 makes it hard to remember which column is which). So let’s get the column labelled:
ddply(full_list, c("Subject","Class"), function(df) return(c(AVERAGE=mean(df$RT))))
Great! Now let’s add some more columns to output:
ddply(full_list, c("Subject","Class"), function(df) return(c(AVERAGE=mean(df$RT),
MEDIAN=median(df$RT),SE=sqrt(var(df$RT)/length(df$RT)))))
This then gives us the target aggregated table pictured above.
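As a quick aside on the XYply naming convention mentioned earlier: swapping the second letter changes the output type. For example (just an illustration, not needed for the target table):
dlply(full_list, c("Subject","Class"), function(df) mean(df$RT))   # 'l' for list: one element per Subject/Class combination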
It needs no sequel: sqldf
Next up is sqldf. The name gives it away slightly: it’s a library for running SQL statements on data frames. SQL stands for Structured Query Language, and it works with data stored in tables in a database. There are a number of SQL database types, which are all reasonably similar, and sqldf uses the incredibly popular SQLite by default. To get the target aggregated data, it’s a case of running a simple query:
sqldf("SELECT SUBJECT, CLASS, AVG(RT) AS AVERAGE, MEDIAN(RT) AS MEDIAN,
SQRT((VARIANCE(RT)/COUNT(RT))) AS SE
FROM full_list
GROUP BY SUBJECT, CLASS")
Note that to get the number of rows involved, we need to use COUNT rather than LENGTH. Easy!
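If you want to keep the result around, sqldf() hands back an ordinary data frame, so something like the following works (a minimal sketch; agg_sql is just an arbitrary name, and the packages are assumed to be installed):
library(sqldf)
agg_sql <- sqldf("SELECT Subject, Class, AVG(RT) AS AVERAGE
                  FROM full_list
                  GROUP BY Subject, Class")
head(agg_sql)   # a plain data.frame, ready for further work in R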
How the tables have turned: data.table
The last library to look at here is data.table. This has the benefit of being considered the roadrunner of aggregation functions. It’s damn fast! The target aggregation can be achieved as follows:
full_dt = data.table(full_list)
full_dt[, list(AVERAGE=.Internal(mean(RT)), MEDIAN=median(RT),
    SE=sqrt(var(RT)/length(RT))), by=list(Subject,Class)]
Note that the first line takes our data.frame called full_list and casts it as a data.table object type. Here, two lists are used to do two things: (1) create the columns of summary statistics and (2) group the data by Subject and Class. The first list call sets up the column names and the calculations that need to be run. The second list gets fed to the by argument, which then aggregates by Subject and Class.
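One extra trick worth knowing (and which comes up again in the comments below): if you’ll be grouping on the same columns repeatedly, you can set a key on the table up front, roughly like this:
setkey(full_dt, Subject, Class)   # sorts the table by these columns; subsequent grouped operations can be faster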
Summary
So, there we have three additional ways to aggregate data using R, to be added to tapply() and aggregate() which I have covered previously. Whichever one you end up using will probably depend on your own experience with using them (or, for example, whether you are familiar with SQL in the case of sqldf), what needs you have, and how fast you need your aggregation processing to be.
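For reference, the base-R aggregate() route from those earlier posts would look roughly like this (a sketch; it only handles one summary function at a time, and the column naming differs from the target table):
aggregate(RT ~ Subject + Class, data = full_list, FUN = mean)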
Great post! I always see plyr, ddply, and data.table used in responses on both Stack Overflow and the R-help list. It’s great to have an explanation of these packages and functions.
There is a simpler way to return named columns with plyr, by using ddply() in conjunction with summarize():
ddply(full_list, c("Subject","Class"), summarize, AVERAGE=mean(RT), MEDIAN=median(RT), SE=sqrt(var(RT)/length(RT)))
Cool, thanks for posting! I didn’t know about that – it’s certainly much cleaner, will use it from now on 🙂
Glad it was helpful 🙂 They are all useful tools!
Did you find that there are big speed differences among these approaches? Tests I’ve seen in the past suggest that data.table is the fastest . . .
Matthieu already pointed out how to use summarize() with ddply. If all you want to do is add a column to each row, you can also use transform().
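Presumably something along these lines (a sketch; RT_mean is just an illustrative column name):
ddply(full_list, c("Subject","Class"), transform, RT_mean = mean(RT))   # keeps every row and adds a per-group column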
A quick test with rbenchmark shows that data.table is 10-20 times faster than plyr.
test elapsed relative
4 data.table.key 0.514 1.000000
3 data.table 0.520 1.011673
2 sqldf 2.888 5.618677
5 plyr.aggregate 6.705 13.044747
1 plyr 7.931 15.429961
Here functions 1, 2, 3 are from the article, 4 is like 3 except the data table is created with data.table(full_list, key=c("Subject","Class")), and 5 is the form suggested by Matthieu.
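A rough sketch of how such a comparison could be set up with rbenchmark (the replication count and exact expressions here are assumptions rather than the original script):
library(rbenchmark)
benchmark(
  plyr       = ddply(full_list, c("Subject","Class"), function(df) mean(df$RT)),
  data.table = full_dt[, list(AVERAGE = mean(RT)), by = list(Subject, Class)],
  replications = 100,
  columns = c("test", "elapsed", "relative"))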
Allan, thanks for the benchmarks! There are some great written pieces on how to use data.table effectively: http://datatable.r-forge.r-project.org/.
Excellent benchmarks! You make a good point about using keys as well. I’ve been meaning to get into data.table a bit more, as speed is always a massive plus.
Cool! Didn’t realise that was possible before, another one to add to my toolbox 🙂