The pandas are equivalent to vlookup with one or more relationships, returning a result

advertisements

I would like to know the pandas alternative to an Excel Vlookup when it involves two dataframes and a one to many relationship. I have searched for this answer and don't think i found one that solves my use-case. below are a few post that are relevant, but not exactly what I need.

Excel VLOOKUP equivalent in pandas

vlookup between 2 Pandas dataframes

Situation:

I have two data frames and a key that links both (site). I would use the pandas merge function, but I don't want multiple records returned for one key (location B in this example).

Specifically, I want to return the status of a site from the loc_status dataframe if it exists. If the location has both a status of "inactive" and "active" then I only want to return "active".

Here is a basic example:

loc_status = [{'site':'A', 'status':'ACTIVE'}, {'site':'B',         'status':'ACTIVE'},{'site':'B', 'status':'INACTIVE'},{'site':'C', 'status':'INACTIVE'} ]

loc = [{'site':'A'}, {'site':'B'},{'site':'C'}, {'site':'D'} ]

df_status = pd.DataFrame(loc_status)

+----+-------+----------+
|    | site  |  status  |
+----+-------+----------+
| 0  | A     | ACTIVE   |
| 1  | B     | ACTIVE   |
| 2  | B     | INACTIVE |
| 3  | C     | INACTIVE |
+----+-------+----------+

df_loc = pd.DataFrame(loc)

+------+---+
| site |   |
+------+---+
|   0  | A |
|   1  | B |
|   2  | C |
|   3  | D |
+------+---+

result = [{'site':'A', 'status':'ACTIVE'}, {'site':'B', 'status':'ACTIVE'},{'site':'C', 'status':'INACTIVE'}, {'site': 'D'}]

df_result = pd.DataFrame(result)

+----+-------+----------+
|    | site  |  status  |
+----+-------+----------+
| 0  | A     | ACTIVE   |
| 1  | B     | ACTIVE   |
| 2  | C     | INACTIVE |
| 3  | D     | NaN      |
+----+-------+----------+

Thank you.


First sort_values by both columns, so for each site if duplicates status always ACTIVE is first. Then drop_duplicates with default keep=first - get only ACTIVE if duplicates.

Last map by Series created with set_index:

df_status = df_status.sort_values(['site','status']).drop_duplicates(['site'])
print (df_status)
  site    status
0    A    ACTIVE
1    B    ACTIVE
3    C  INACTIVE

df_loc['status'] = df_loc['site'].map(df_status.set_index('site')['status'])
print (df_loc)
  site    status
0    A    ACTIVE
1    B    ACTIVE
2    C  INACTIVE
3    D       NaN

Timings:

#jezrael solution
In [136]: %timeit df_loc['status_jez'] = df_loc['site'].map(df_status.sort_values(['site','status']).drop_duplicates(['site']).set_index('site')['status'])
10 loops, best of 3: 67.3 ms per loop

#Allen solution
In [137]: %timeit pd.merge(df_loc,df_status.sort_values(['site','status']).groupby(by='site').first().reset_index(),how='left')
10 loops, best of 3: 114 ms per loop

#piRSquared solution
In [138]: %timeit df_loc.assign(status_pir=df_loc.site.map(df_status.loc[(df_status.status == 'ACTIVE').groupby(df_status.site).idxmax()].set_index('site').status))
1 loop, best of 3: 3.37 s per loop

Code for timings:

np.random.seed(123)
N = 100000
L = np.random.randint(100000,size=N)
df_status = pd.DataFrame({'site': np.random.choice(L, N),
                         'status':np.random.choice(['ACTIVE','INACTIVE'],N)})
print (df_status.head(10))

df_loc = pd.DataFrame({'site':L})
print (df_loc.head(10))