I am trying to transform a long data frame into a wide one with flagged cases. I pivot it and use a temporary column that serves as the flag. It works perfectly on small data sets (see the example below; copy and paste it into RStudio), but when I try it on real data it reports an error:
    churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
    Error: Duplicate identifiers for rows (169, 249), (57, 109), (11, 226)
The wide structure of the data set is needed for further processing. Is there a workaround for this problem? I bet a lot of people run into the same problem when cleaning data. Please help.

Here is the code. The first chunk, "example", builds a small data set to show how the result is supposed to look. The second chunk, "real data", uses a slice of the churn data set (loaded with data(churn)).
    library(caret)
    library(tidyr)

    #example
    #============
    df <- data.frame(var1 = (1:6),
                     var2 = (7:12),
                     factors = c("facto1", "facto2", "facto3", "facto3", "facto5", "facto1"),
                     flags = c(1, 1, 1, 1, 1, 1))
    df
    df2 <- spread(data = df, key = "factors", value = flags, fill = " ")
    df2
    #=============

    # real data
    #============
    data(churn)
    str(churnTrain)
    churnTrain <- churnTrain[1:250, 1:4]
    churnTrain$temporary <- 1
    churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
    str(churnTrain)
    head(churnTrain3)
    str(churnTrain3)
    #============
spread() can only put one value in each 'cell', that is, at the intersection of a value of the spread 'key' and a unique combination of the remaining columns (in the churn example, account_length, area_code and international_plan). Your real data contains rows that are identical in all of those columns and in state, so several values compete for the same cell, and that is what the "Duplicate identifiers" error reports. So the real question is how to manage these duplicate entries. The answer depends on what you are trying to do. I provide one possible solution below. Instead of making a dummy 'temporary' variable, I count the number of rows for each combination and use that count as the value. This can be done very easily with dplyr:
    library(tidyr)
    library(dplyr)
    library(C50) # this is one source for the churn data

    data(churn)
    churnTrain <- churnTrain[1:250, 1:4]

    # count identical rows, then use the count in place of the dummy flag
    churnTrain2 <- churnTrain %>%
      group_by(state, account_length, area_code, international_plan) %>%
      tally %>%
      dplyr::rename(temporary = n)

    churnTrain3 <- spread(churnTrain2, key = "state", value = "temporary", fill = 0)
Spread now works.
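If you would rather keep one row per original case with a 0/1 flag (instead of a count per combination), another option is to give every row its own identifier before spreading. The sketch below is only an illustration under assumptions: toy is a made-up stand-in for churnTrain, and row_id is a hypothetical column name I introduce here. It first lists the duplicated combinations, then spreads with the extra id column.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for churnTrain: rows 1 and 2 are identical.
toy <- data.frame(
  account_length = c(10, 10, 20),
  area_code      = c("a", "a", "b"),
  state          = c("KS", "KS", "OH"),
  stringsAsFactors = FALSE
)

# Step 1: list the combinations that occur more than once.
# These are the rows spread() cannot place in a single cell.
dupes <- toy %>%
  count(account_length, area_code, state) %>%
  filter(n > 1)

# Step 2: add a row id (assumed name: row_id) so every row is uniquely
# identified, then spread a constant flag across the state columns.
wide <- toy %>%
  mutate(row_id = row_number(), temporary = 1) %>%
  spread(key = state, value = temporary, fill = 0)
```

Here wide keeps all three original rows, each flagged in exactly one state column. Which of the two approaches fits depends on whether duplicated rows should stay separate cases or be collapsed into a count.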