python - Pandas: Filter dataframe for values that are too frequent or too rare


On a pandas DataFrame, I know I can group by one or more columns and then filter out values that occur more or less often than a given number.

But I want to do this on every column of the DataFrame. I want to remove values that are too infrequent (let's say they occur less than 5% of the time) or too frequent. As an example, consider a DataFrame with the following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.

    import string
    import numpy as np
    import pandas as pd

    cols = ['city of origin', 'city of destination',
            'distance, type of transport (air/car/foot)',
            'time of day, price-interval']
    # 100 random lowercase letters per column
    vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True))
            for c in cols]
    df = pd.DataFrame(dict(vals))

    >>> df.head()
      city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
    0                   f              p                                                                      n
    1                   k              b                                                                      f
    2                   q              s                                          n                           j
    3                   h              c                                          g                           u
    4                   w              d                                          m                           h

If this is a big DataFrame, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if the foot mode of transport is rare, and so on.

I want to remove all such values from all columns (or from a list of columns). One idea I have is to do a value_counts on every column, transform, and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
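A minimal sketch of the value_counts idea described above, assuming a small made-up frame (the column names, values, and the 15% threshold are illustrative, not taken from the real data). Instead of adding helper columns, it builds a separate frame of per-cell relative frequencies and filters on that:

```python
import pandas as pd

# Small hypothetical frame; the column names and values are illustrative only.
df = pd.DataFrame({
    "transport": ["car"] * 6 + ["air"] * 3 + ["foot"],
    "time_of_day": ["day"] * 8 + ["night"] * 2,
})

threshold = 0.15  # keep values that make up more than 15% of a column

# For each column, map every cell to the relative frequency of its value.
freq = df.apply(lambda s: s.map(s.value_counts(normalize=True)))

# Keep only the rows whose value is frequent enough in every column.
filtered = df[(freq > threshold).all(axis=1)]
```

Note that all frequencies here are computed once on the original frame, so a row is judged against the full data rather than against a frame that has already been shrunk by an earlier column.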

This procedure will go through each column of the DataFrame and eliminate rows where the given category occurs less often than a given threshold percentage, shrinking the DataFrame on each loop.

This answer is similar to the one provided by @Ami Tavory, with a few subtle differences:

  • It normalizes the value counts, so you can use a fractional (percentage) threshold instead of an absolute count.
  • It calculates the counts once per column instead of twice, which results in faster execution.

Code:

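    threshold = 0.03
    for col in df:
        # Relative frequency of each value in this column
        counts = df[col].value_counts(normalize=True)
        # Keep only rows whose value is above the threshold
        df = df.loc[df[col].isin(counts[counts > threshold].index), :]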
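The loop above only removes values that are too rare. Since the question also asks about values that are too frequent and about restricting the filter to a list of columns, here is one possible sketch. The function name `filter_by_frequency` and its `columns`, `low`, and `high` parameters are hypothetical, introduced only for illustration:

```python
import pandas as pd


def filter_by_frequency(df, columns=None, low=0.05, high=1.0):
    """Drop rows whose value in any of `columns` has a relative frequency
    at or below `low`, or above `high` (both are fractions of the rows)."""
    columns = df.columns if columns is None else columns
    for col in columns:
        # Relative frequencies are recomputed on the already-filtered frame.
        freq = df[col].value_counts(normalize=True)
        keep = freq[(freq > low) & (freq <= high)].index
        df = df[df[col].isin(keep)]
    return df


# Hypothetical data for illustration.
trips = pd.DataFrame({
    "transport": ["car"] * 7 + ["air"] * 2 + ["foot"],
    "time_of_day": ["day"] * 5 + ["night"] * 5,
})
result = filter_by_frequency(trips, columns=["transport"], low=0.15, high=0.6)
```

With these made-up numbers, "car" (70%) is dropped as too frequent and "foot" (10%) as too rare, leaving only the "air" rows. As in the loop above, `value_counts` ignores missing values by default, so rows with NaN in a filtered column are dropped by the `isin` step as well.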

Code timing:

    df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), (10**6, 4), replace=True),
                       columns=list('abcd'))

    %%timeit df = df2.copy()
    threshold = 0.03
    for col in df:
        counts = df[col].value_counts(normalize=True)
        df = df.loc[df[col].isin(counts[counts > threshold].index), :]

    1 loops, best of 3: 485 ms per loop

    %%timeit df = df2.copy()
    m = 0.03 * len(df)
    for c in df:
        df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

    1 loops, best of 3: 688 ms per loop
