python - Pandas: Filter dataframe for values that are too frequent or too rare
On a pandas DataFrame, I know I can group by one or more columns and then filter out values that occur more or less often than a given number.
But I want to do this on every column of the DataFrame. I want to remove values that are too infrequent (let's say values that occur less than 5% of the time) or too frequent. As an example, consider a DataFrame with the following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
    import pandas as pd
    import string
    import numpy as np

    # Four columns of 100 random lowercase letters each
    vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True))
            for c in ('city of origin', 'city of destination',
                      'distance, type of transport (air/car/foot)',
                      'time of day, price-interval')]
    df = pd.DataFrame(dict(vals))

    >> df.head()
      city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
    0 f p n
    1 k b f
    2 q s n j
    3 h c g u
    4 w d m h
If this is a big DataFrame, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if the foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or from a list of columns). One idea I have is to do a value_counts on every column, transform, and add one column for each value_counts; then filter based on whether the counts are above or below a threshold. But I think there must be a better way to achieve this?
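For reference, the idea described in the question (per-column value counts mapped back onto the rows, then a threshold filter) might look roughly like the sketch below. The frame, its column names, and the `low` bound are made up for illustration; frequencies here are computed once against the original frame.

```python
import pandas as pd

# Small hypothetical frame; column names and values are illustrative only.
df = pd.DataFrame({
    "transport": ["car"] * 6 + ["air"] * 3 + ["foot"],
    "time_of_day": ["day"] * 5 + ["evening"] * 4 + ["night"],
})

# One frequency column per original column: each cell holds the share of
# rows in which that cell's value occurs in its column.
freqs = pd.DataFrame(
    {col: df[col].map(df[col].value_counts(normalize=True)) for col in df}
)

low = 0.15  # assumed lower bound on relative frequency
keep = (freqs >= low).all(axis=1)  # row survives only if every value is common enough
filtered = df[keep]
print(filtered)
```

This keeps an extra frame of the same shape as `df` in memory, which is part of why a column-by-column loop may be preferable on large data.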
This procedure will go through each column of the DataFrame and eliminate rows where the given category occurs less often than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to the one provided by @Ami Tavory, but with a few subtle differences:
- It normalizes the value counts so you can use a percentage threshold directly.
- It calculates the counts once per column instead of twice, which results in faster execution.
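Code:

    threshold = 0.03
    for col in df:
        # Relative frequency of each value in this column
        counts = df[col].value_counts(normalize=True)
        # Keep only rows whose value is above the threshold
        df = df.loc[df[col].isin(counts[counts > threshold].index), :]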
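The loop above only drops rare values, while the question also asks about values that are too frequent and about restricting the filter to a list of columns. A minimal sketch of one way to extend it follows; `filter_by_frequency`, its parameters, and the sample frame are hypothetical names introduced here, and the bounds are inclusive (the loop above uses a strict `>`).

```python
import pandas as pd

def filter_by_frequency(df, low=0.0, high=1.0, columns=None):
    """Keep rows whose value in each checked column has a relative
    frequency within [low, high]. Hypothetical helper for illustration."""
    cols = df.columns if columns is None else columns
    for col in cols:
        # Frequencies are recomputed on the already-filtered frame,
        # so the order of the columns can affect the result.
        freq = df[col].value_counts(normalize=True)
        allowed = freq[(freq >= low) & (freq <= high)].index
        df = df[df[col].isin(allowed)]
    return df

# Illustrative data: "car" is very common, "foot" is rare.
sample = pd.DataFrame({
    "transport": ["car"] * 7 + ["air"] * 2 + ["foot"],
    "city": list("aabbccddee"),
})
result = filter_by_frequency(sample, low=0.15, high=0.6, columns=["transport"])
print(result)
```

With the default bounds nothing is removed, so the helper can be applied to a subset of columns without touching the rest.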
Code timing:

    df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), [10**6, 4], replace=True),
                       columns=list('abcd'))

    %%timeit df=df2.copy()
    threshold = 0.03
    for col in df:
        counts = df[col].value_counts(normalize=True)
        df = df.loc[df[col].isin(counts[counts > threshold].index), :]

    1 loops, best of 3: 485 ms per loop

    %%timeit df=df2.copy()
    m = 0.03 * len(df)
    for c in df:
        df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

    1 loops, best of 3: 688 ms per loop
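Outside IPython, a similar comparison can be sketched with the standard-library `timeit` module. The function names `counts_once` and `counts_twice` are labels invented here for the two loops, and the frame is smaller than the one timed above so the script runs quickly; absolute numbers will differ by machine and pandas version.

```python
import string
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df2 = pd.DataFrame(
    rng.choice(list(string.ascii_lowercase), size=(100_000, 4)),
    columns=list("abcd"),
)

def counts_once(df, threshold=0.03):
    # Normalized counts, computed once per column
    for col in df:
        counts = df[col].value_counts(normalize=True)
        df = df.loc[df[col].isin(counts[counts > threshold].index), :]
    return df

def counts_twice(df, threshold=0.03):
    # Absolute counts, computed twice per column
    m = threshold * len(df)
    for c in df:
        df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
    return df

# Best of three single runs for each variant
t_once = min(timeit.repeat(lambda: counts_once(df2), number=1, repeat=3))
t_twice = min(timeit.repeat(lambda: counts_twice(df2), number=1, repeat=3))
print(f"counts once:  {t_once:.3f} s")
print(f"counts twice: {t_twice:.3f} s")
```

Note that the two variants are not strictly equivalent: the first applies the threshold to frequencies of the frame as it shrinks, while the second fixes the cutoff `m` from the original row count.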