Combine tables of aggregate values ​​with variables summarized from a "parent" dataset

advertisements

I have a data set along these lines:

df<-data.frame(sp=c(100, 100, 100, 101, 101, 101, 102, 102, 102),
type=c("C","C","C","H","H","H","C","C","C"),
country=c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
vals=c(1,2,3,4,5,6,7,8,9)
)

I want to aggregate df$vals and bring the other variables through as well

At the moment I'm doing it like this:

multi.func<- function(x){
c(
n = length(x),
min = min(x, na.rm=TRUE),
max = max(x, na.rm=TRUE),
mean = mean(x, na.rm=TRUE)
)}

aggVals<-as.data.frame(do.call(rbind, by(df$vals, df$sp, FUN=multi.func, simplify=TRUE)))
aggVals$sp<-row.names(aggVals)

aggDescrip<-aggregate(cbind(as.character(type), as.character(country)) ~ sp, data=df, FUN=unique)

result<-merge(aggDescrip,aggVals)

This works well enough but I wondered if there's an easier way.

Thanks


Perhaps you should look into the data.table package.

library(data.table)
DT <- data.table(df, key="sp")
DT[, list(type = unique(as.character(type)),
          country = unique(as.character(country)),
          n = .N, min = min(vals), max = max(vals),
          mean = mean(vals)), by=key(DT)]
#     sp type country n min max mean
# 1: 100    C       A 3   1   3    2
# 2: 101    H       B 3   4   6    5
# 3: 102    C       C 3   7   9    8


If you want to stick with base R, here is another approach that might be of use (though aggregate is probably more common):

unique(within(df, {
    mean <- ave(vals, sp, FUN=mean)
    max <- ave(vals, sp, FUN=max)
    min <- ave(vals, sp, FUN=min)
    n <- ave(vals, sp, FUN=length)
    rm(vals)
}))
#    sp type country n min max mean
# 1 100    C       A 3   1   3    2
# 4 101    H       B 3   4   6    5
# 7 102    C       C 3   7   9    8


Update: A variation on your initial attempt

I would suggest sticking with data.table if possible, because the resulting code is easy to follow and the process of aggregation is quick.

However, with a little bit of modification, you can have (yet another) base R approach that is somewhat more direct.

First, modify your function so that instead of using c(), use data.frame. Also, add an argument that specifies which column needs to be aggregated.

multi.func <- function(x, value_column) {
    data.frame(
        n = length(x[[value_column]]),
        min = min(x[[value_column]], na.rm=TRUE),
        max = max(x[[value_column]], na.rm=TRUE),
        mean = mean(x[[value_column]], na.rm=TRUE))
}

Second, use lapply on your dataset, split up by your grouping variable, merge the output with your original dataset, and return the unique values.

unique(merge(df[-4],
             do.call(rbind, lapply(split(df, df$sp),
                                   multi.func, value_column = "vals")),
             by.x = "sp", by.y = "row.names"))