General problem I have very often: I want to perform some operation on a data.frame, which for each factor level will produce one number, and for this it uses information from multiple columns. How to write that in R?
I considered these functions:
- tapply - doesn't operate on multiple columns
- aggregate - the function is given the columns separately
- ave - the result has the same number of rows as input, not as the number of factors' levels
- by - this was the hottest candidate, but I hate the format returned - the list. I want
data.frame
as result, I know I can convert it but it is ugly, I prefer another solution!
The OP is asking for a general answer, so I think the 'plyr' package is the most appropriate. The 'plyr' package has limitations when approaching large data sets, but for everyday use (implied in the original post), the 'plyr' functions are wonderful assets for any R user.
Setup: Here is a quick data sample for us to work with.
data <- data.frame(id=1:50, group=sample(letters[1:3], 50, rep=TRUE), x_Value=sample(1:500, 50), y_Value=sample(2:5, 50, rep=TRUE)*100)
How to use plyr: I'm just going to address the basic uses here as an example to get things started. First, load up the package.
library(plyr)
Now, let's start calculating things. With the 'plyr' functions, you choose the first two letters of the function based on your input and output. In this example, I will be inputting a data frame (d) and outputting a data frame (d), so I will use the 'ddply" function.
The 'ddply' function uses this syntax:
ddply(
data_source,
.(grouping_variables),
function,
column_definitions)
First, let's quickly find out how many entries belong to groups a, b, and c:
ddply(
data,
.(group),
summarize,
N=length(id))
# group N
# 1 a 17
# 2 b 16
# 3 c 17
Here, we specified the data source first, and then specified that we wanted to group lines by the 'group' variable. We use the 'summarize' function to trash all of the columns except those in our grouping_variables and column_definitions. Using the 'length' function is basically just a count for this purpose.
Now, let's add a column to the data that shows the group means for the x and y values.
ddply(
data,
.(group),
mutate,
group_mean_x=mean(x_Value),
group_mean_y=mean(y_Value))
# id group x_Value y_Value group_mean_x group_mean_y
# 1 8 a 301 300 218.7059 394.1176
# 2 13 a 38 500 218.7059 394.1176
# 3 14 a 425 300 218.7059 394.1176
# .....................................................
# 17 47 a 191 300 218.7059 394.1176
# 18 5 b 411 500 235.1875 325.0000
# 19 6 b 121 400 235.1875 325.0000
# 20 11 b 151 200 235.1875 325.0000
# .....................................................
# 33 49 b 354 200 235.1875 325.0000
# 34 1 c 482 400 246.1765 400.0000
# 35 2 c 43 300 246.1765 400.0000
# .....................................................
# 50 50 c 248 500 246.1765 400.0000
I've truncated the results to make it shorter. Here, we used the same data source and grouping variable, but the 'mutate' function preserves all of the data in the data source while adding columns.
Now, let's do a two-step effort with the previous data. Let's show the means and the difference between the x and y mean values in a summary table.
ddply(
data,
.(group),
summarize,
group_mean_x=mean(x_Value),
group_mean_y=mean(y_Value),
difference=group_mean_x - group_mean_y)
# group group_mean_x group_mean_y difference
# 1 a 218.7059 394.1176 -175.4118
# 2 b 235.1875 325.0000 -89.8125
# 3 c 246.1765 400.0000 -153.8235
I show you this example, because there is something important going on... we're using columns that we just defined as part of a different column's definition. This is very, very useful when creating summary tables.
Finally, let's group by two factors: the group and the digit in the 10^2 place of the x value. Let's create a summary table that shows the mean x and y values for each group and 10^2 digit x value.
ddply(
data,
.(group, x_100=as.integer(x_Value/100)),
summarize,
mean_x=mean(x_Value),
mean_y=mean(y_Value))
# group x_100 mean_x mean_y
# 1 a 0 20.0000 425.0000
# 2 a 1 145.6667 333.3333
# 3 a 2 272.0000 400.0000
# 4 a 3 328.6667 433.3333
# 5 a 4 427.5000 350.0000
# 6 b 0 37.0000 200.0000
# 7 b 1 148.6667 383.3333
# 8 b 2 230.0000 325.0000
# 9 b 3 363.0000 200.0000
# 10 b 4 412.5000 400.0000
# 11 c 0 55.6000 360.0000
# 12 c 1 173.5000 350.0000
# 13 c 2 262.5000 450.0000
# 14 c 3 355.6667 400.0000
# 15 c 4 481.0000 433.3333
This example is important, because it shows us two things: we can create grouping columns using vectorized statements and we can group by more than one column by separating the list of columns with a comma.
This quick set of examples should be enough to get started using the 'plyr' packages. More details can be found in help(plyr)
.