I have a dataframe df:
df:
chr gene_name
1 ARF3
1 ABC
1 ARF3,ENSG123
1 ENSG1245,ARF3, ENSG89
1 ENSG,ARF3
1 ANG
2 XVY
2 PQR
3 RST
4 TAC
and a gene_list
gene_list = ['ARF3','ABC' ]
Now, I need to get the rows from the data frame (df) for which the the gene name is either an exact match with elements in gene_list .
So, I tried :
df2 = df1[df.gene_name.isin(gene_list)]
I retrieved: chr gene_name
1 ARF3
1 ABC
but what I am expecting is: chr gene_name
1 ARF3
1 ABC
1 ARF3, ENSG123
1 ENSG1245,ARF3, ENSG89
1 ENSG,ARF3
1 ABC
so basically all the rows in the data frame where the element in gene_list is a substring of gene_name in the data frame.
I thought of using ".contains()"
had it been I was looking the other way that is gene_name
in the data frame would have been a substring on element in gene_list
.
All the help appreciated
Another simplier solution with str.split
and DataFrame.isin
with boolean indexing
:
gene_list = ['ARF3', 'ABC']
df1 = df.gene_name.str.split(',', expand=True)
mask = df1.isin(gene_list)
s = df1[mask].dropna(how='all').apply(lambda x: x[x.first_valid_index()], axis=1)
s.name='new'
print (s)
0 ARF3
1 ABC
2 ARF3
3 ARF3
4 ARF3
Name: new, dtype: object
print (df.join(s).dropna(subset=['new']))
chr gene_name new
0 1 ARF3 ARF3
1 1 ABC ABC
2 1 ARF3,ENSG123 ARF3
3 1 ENSG1245,ARF3,ENSG89 ARF3
4 1 ENSG,ARF3 ARF3
Another solution:
gene_list = ['ARF3', 'ABC']
#new dafarame with splited values
df1 = df.gene_name.str.split(',', expand=True)
#mask - True where is desired value
mask = df1.isin(gene_list)
#find first valid value in dataframe and create serie by these values
s = df1[mask].dropna(how='all').apply(lambda x: x[x.first_valid_index()], axis=1)
s.name='new'
print (s)
0 ARF3
1 ABC
2 ARF3
3 ARF3
4 ARF3
Name: new, dtype: object
#join series to filtered dataframe - create new column
print (df[mask.any(1)].join(s))
chr gene_name new
0 1 ARF3 ARF3
1 1 ABC ABC
2 1 ARF3,ENSG123 ARF3
3 1 ENSG1245,ARF3,ENSG89 ARF3
4 1 ENSG,ARF3 ARF3