More efficient ways of using R than the 'for'

advertisements

I'm a relative newcomer to R so I'm sorry if there's an obvious answer to this. I've looked at other questions and I think 'apply' is the answer but I can't work out how to use it in this case.

I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they die. I need to identify which participants have taken part for a consistent 'streak' since from the start of the survey (i.e. if they stop, they stop for good).

I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and the loop is very slow. Is there a faster approach I could use?

In the example, TRUE means they participated in that year. The loop creates two vectors - 'finalyear' for the last year they took part, and 'streak' to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).

dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T), "2000" = c(T, F, T, F, T), "2001" = c(T, T, T, T, T), "2002" = c(F, T, T, T, T), "2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
    x <- as.numeric(dat[i,2:6])
    y <- max(grep(1, x))
    finalyear[i] <- y
    streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak

Thanks!


We could use max.col and rowSums as a vectorized approach.

dat$finalyear <- max.col(dat[-1], 'last')

If there are rows without TRUE values, we can make sure to return 0 for that row by multiplying with the double negation of rowSums. The FALSE will be coerced to 0 and multiplying with 0 returns 0 for that row.

dat$finalyear <- max.col(dat[-1], 'last')*!!rowSums(dat[-1])

Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with that of 'finalyear'

dat$streak <-  rowSums(dat[,2:6])==dat$finalyear
dat
#   ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1   1  TRUE  TRUE  TRUE FALSE FALSE         3   TRUE
#2   2  TRUE FALSE  TRUE  TRUE  TRUE         5  FALSE
#3   3  TRUE  TRUE  TRUE  TRUE  TRUE         5   TRUE
#4   4 FALSE FALSE  TRUE  TRUE  TRUE         5  FALSE
#5   5  TRUE  TRUE  TRUE  TRUE FALSE         4   TRUE

Or a one-line code (it could fit in one-line, but decided to make it obvious by 2-lines ) suggested by @ColonelBeauvel

library(dplyr)
mutate(dat, finalyear=max.col(dat[-1], 'last'),
            streak=rowSums(dat[-1])==finalyear)