Multiple column operation on data.frame

advertisements

General problem I have very often: I want to perform some operation on a data.frame, which for each factor level will produce one number, and for this it uses information from multiple columns. How to write that in R?

I considered these functions:

  • tapply - doesn't operate on multiple columns
  • aggregate - the function is given the columns separately
  • ave - the result has the same number of rows as input, not as the number of factors' levels
  • by - this was the hottest candidate, but I hate the format returned - the list. I want data.frame as result, I know I can convert it but it is ugly, I prefer another solution!

The OP is asking for a general answer, so I think the 'plyr' package is the most appropriate. The 'plyr' package has limitations when approaching large data sets, but for everyday use (implied in the original post), the 'plyr' functions are wonderful assets for any R user.

Setup: Here is a quick data sample for us to work with.

data <- data.frame(id=1:50, group=sample(letters[1:3], 50, rep=TRUE), x_Value=sample(1:500, 50), y_Value=sample(2:5, 50, rep=TRUE)*100)

How to use plyr: I'm just going to address the basic uses here as an example to get things started. First, load up the package.

library(plyr)

Now, let's start calculating things. With the 'plyr' functions, you choose the first two letters of the function based on your input and output. In this example, I will be inputting a data frame (d) and outputting a data frame (d), so I will use the 'ddply" function.

The 'ddply' function uses this syntax:

ddply(
    data_source,
    .(grouping_variables),
    function,
    column_definitions)

First, let's quickly find out how many entries belong to groups a, b, and c:

ddply(
    data,
    .(group),
    summarize,
    N=length(id))
#   group  N
# 1     a 17
# 2     b 16
# 3     c 17

Here, we specified the data source first, and then specified that we wanted to group lines by the 'group' variable. We use the 'summarize' function to trash all of the columns except those in our grouping_variables and column_definitions. Using the 'length' function is basically just a count for this purpose.

Now, let's add a column to the data that shows the group means for the x and y values.

ddply(
    data,
    .(group),
    mutate,
    group_mean_x=mean(x_Value),
    group_mean_y=mean(y_Value))
#    id group x_Value y_Value group_mean_x group_mean_y
# 1   8     a     301     300     218.7059     394.1176
# 2  13     a      38     500     218.7059     394.1176
# 3  14     a     425     300     218.7059     394.1176
# .....................................................
# 17 47     a     191     300     218.7059     394.1176
# 18  5     b     411     500     235.1875     325.0000
# 19  6     b     121     400     235.1875     325.0000
# 20 11     b     151     200     235.1875     325.0000
# .....................................................
# 33 49     b     354     200     235.1875     325.0000
# 34  1     c     482     400     246.1765     400.0000
# 35  2     c      43     300     246.1765     400.0000
# .....................................................
# 50 50     c     248     500     246.1765     400.0000

I've truncated the results to make it shorter. Here, we used the same data source and grouping variable, but the 'mutate' function preserves all of the data in the data source while adding columns.

Now, let's do a two-step effort with the previous data. Let's show the means and the difference between the x and y mean values in a summary table.

ddply(
    data,
    .(group),
    summarize,
    group_mean_x=mean(x_Value),
    group_mean_y=mean(y_Value),
    difference=group_mean_x - group_mean_y)
#   group group_mean_x group_mean_y difference
# 1     a     218.7059     394.1176  -175.4118
# 2     b     235.1875     325.0000   -89.8125
# 3     c     246.1765     400.0000  -153.8235

I show you this example, because there is something important going on... we're using columns that we just defined as part of a different column's definition. This is very, very useful when creating summary tables.

Finally, let's group by two factors: the group and the digit in the 10^2 place of the x value. Let's create a summary table that shows the mean x and y values for each group and 10^2 digit x value.

ddply(
    data,
    .(group, x_100=as.integer(x_Value/100)),
    summarize,
    mean_x=mean(x_Value),
    mean_y=mean(y_Value))
#    group x_100   mean_x   mean_y
# 1      a     0  20.0000 425.0000
# 2      a     1 145.6667 333.3333
# 3      a     2 272.0000 400.0000
# 4      a     3 328.6667 433.3333
# 5      a     4 427.5000 350.0000
# 6      b     0  37.0000 200.0000
# 7      b     1 148.6667 383.3333
# 8      b     2 230.0000 325.0000
# 9      b     3 363.0000 200.0000
# 10     b     4 412.5000 400.0000
# 11     c     0  55.6000 360.0000
# 12     c     1 173.5000 350.0000
# 13     c     2 262.5000 450.0000
# 14     c     3 355.6667 400.0000
# 15     c     4 481.0000 433.3333

This example is important, because it shows us two things: we can create grouping columns using vectorized statements and we can group by more than one column by separating the list of columns with a comma.

This quick set of examples should be enough to get started using the 'plyr' packages. More details can be found in help(plyr).