Retrieve rows from the data frame for partial matching in the column of the data frame with elements in the list

advertisements

I have a dataframe df:

df:
    chr          gene_name

    1           ARF3
    1           ABC
    1           ARF3,ENSG123
    1           ENSG1245,ARF3, ENSG89
    1           ENSG,ARF3
    1           ANG
    2           XVY
    2           PQR
    3           RST
    4           TAC

and a gene_list

gene_list = ['ARF3','ABC' ]

Now, I need to get the rows from the data frame (df) for which the the gene name is either an exact match with elements in gene_list .

So, I tried :

df2 = df1[df.gene_name.isin(gene_list)]

I retrieved: chr gene_name

    1           ARF3
    1           ABC

but what I am expecting is: chr gene_name

     1           ARF3
     1           ABC
     1           ARF3, ENSG123
     1           ENSG1245,ARF3, ENSG89
     1           ENSG,ARF3
     1           ABC

so basically all the rows in the data frame where the element in gene_list is a substring of gene_name in the data frame.

I thought of using ".contains()" had it been I was looking the other way that is gene_name in the data frame would have been a substring on element in gene_list.

All the help appreciated


Another simplier solution with str.split and DataFrame.isin with boolean indexing:

gene_list = ['ARF3', 'ABC']

df1 = df.gene_name.str.split(',', expand=True)
mask = df1.isin(gene_list)
s = df1[mask].dropna(how='all').apply(lambda x: x[x.first_valid_index()], axis=1)
s.name='new'

print (s)
0    ARF3
1     ABC
2    ARF3
3    ARF3
4    ARF3
Name: new, dtype: object

print (df.join(s).dropna(subset=['new']))
   chr             gene_name   new
0    1                  ARF3  ARF3
1    1                   ABC   ABC
2    1          ARF3,ENSG123  ARF3
3    1  ENSG1245,ARF3,ENSG89  ARF3
4    1             ENSG,ARF3  ARF3

Another solution:

gene_list = ['ARF3', 'ABC']

#new dafarame with splited values
df1 = df.gene_name.str.split(',', expand=True)
#mask - True where is desired value
mask = df1.isin(gene_list)
#find first valid value in dataframe and create serie by these values
s = df1[mask].dropna(how='all').apply(lambda x: x[x.first_valid_index()], axis=1)
s.name='new'
print (s)
0    ARF3
1     ABC
2    ARF3
3    ARF3
4    ARF3
Name: new, dtype: object

#join series to filtered dataframe - create new column
print (df[mask.any(1)].join(s))
   chr             gene_name   new
0    1                  ARF3  ARF3
1    1                   ABC   ABC
2    1          ARF3,ENSG123  ARF3
3    1  ENSG1245,ARF3,ENSG89  ARF3
4    1             ENSG,ARF3  ARF3