How to extract data from the data frame and create a new data frame

advertisements

I've got a df (300000 obs by 15 var). A simplified version looks like this:

    Col1        Col2
1   TRGV3*01    YCAVWIGSTSWVKIFA

2   TRGV7*02    YCASWAGQSGSGFHKVFA

3   TRGV7*02    YCASWAGRGSGFHKVFA

4   TRGV3*01    YCAVWINLGTSWVKIFA

5   TRGV3*01    YCAVWSGTSWVKIFA

6   TRGV7*02    YCASWAGRESSGFHKVFA

7   TRGV7*02    YCASWAVYSSGFHKVFA

8   TRGV7*02    YCASWSSSGFHKVFA

9   TRGV2*01    YCAVWICGTSWVKIFA

I want to subset the df and create a new df such that for a particular value in col1 (TRGV7*02), all possible values of Col2 are subsetted and the output is something like this:

TRGV7*02                 TRGV3*01                     TRGV2*01

YCASWAGQSGSGFHKVFA       YCAVWIGSTSWVKIFA             YCAVWICGTSWVKIFA

YCASWAGRGSGFHKVFA        YCAVWSGTSWVKIFA    

YCAVWINLGTSWVKIFA       

YCASWAGRESSGFHKVFA      

YCASWAVYSSGFHKVFA       

YCASWSSSGFHKVFA

Once I do this I need to do some analysis and I can't figure out how to do that either... 1. Spit out another df from the one just created above with the same format but only with all unique values in that particular column (lot of repeats in the original dataset) 2. Compare the values generated in each column above and figure out which values in each column are unique compared to the values in all the columns or when just trying to compare it with one other column.

Hope that made sense. Thank you so much in advance!


Here's a possible data.table solution (if the column order doesn't metter, you can skip the setcolorder part)

library(data.table)
setcolorder(dcast(setDT(df), Col2 ~ Col1)[,
          lapply(.SD, sort, na.last = TRUE), .SDcols = -c("Col2")],
          c("TRGV7*02", "TRGV3*01", "TRGV2*01"))[]
#              TRGV7*02          TRGV3*01         TRGV2*01
# 1: YCASWAGQSGSGFHKVFA  YCAVWIGSTSWVKIFA YCAVWICGTSWVKIFA
# 2: YCASWAGRESSGFHKVFA YCAVWINLGTSWVKIFA               NA
# 3:  YCASWAGRGSGFHKVFA   YCAVWSGTSWVKIFA               NA
# 4:  YCASWAVYSSGFHKVFA                NA               NA
# 5:    YCASWSSSGFHKVFA                NA               NA
# 6:                 NA                NA               NA
# 7:                 NA                NA               NA
# 8:                 NA                NA               NA
# 9:                 NA                NA               NA