r - How to perform further grouping and lookups on aggregated values in .SD for a data.table - -
this related previous question on grouping/lookups on data.table additional output.
i trying perform grouping on subset .sd , select values each subset. example, in flights dataset, find out : each airport , month, uniquecarrier , destination has worst avg arrdelay. there 2 levels of aggregation.
i have working solution follows.. understand if there better solutions.
library(data.table) library(hflights) dt <- as.data.table(hflights) setkey(dt, origin, month) #the solution code... dt[, { t1 <- .sd[, .(mean(na.omit(arrdelay))) , by=uniquecarrier]; max1 <- which.max(t1$v1); t2 <- .sd[, .(mean(na.omit(arrdelay))) , by=dest]; max2 <- which.max(t2$v1); list( maxavgdelayforcarrier = t1$uniquecarrier[max1], maxavgdelaybycarrier = t1$v1[max1], maxavgdelaybydest= t2$dest[max2], maxavgdelayfordest= t2$v1[max2] ) }, = .(origin, month)] # checking correctness head(dt[ .("hou", 1), .(maxavgdelaybycarrier=mean(na.omit(arrdelay))), by=uniquecarrier][order(-maxavgdelaybycarrier)],1) head(dt[ .("iah", 2), .(maxavgdelayfordest=mean(na.omit(arrdelay))), by=dest][order(-maxavgdelayfordest)],1)
i think code fine, write way:
dt[,c( .sd[, .(cmaxval = mean(na.omit(arrdelay))), by=.(cmax = uniquecarrier)][which.max(cmaxval)], .sd[, .(dmaxval = mean(na.omit(arrdelay))), by=.(dmax = dest)][which.max(dmaxval)] ),by=key(dt)]
which gives
origin month cmax cmaxval dmax dmaxval 1: hou 1 f9 13.725806 phl 20.12500 2: hou 2 b6 17.822222 ecp 20.17308 3: hou 3 ev 23.088889 phl 46.06452 4: hou 4 ev 27.847826 phl 67.93333 5: hou 5 ev 25.436620 phl 75.61290 6: hou 6 ev 16.930233 ewr 34.87755 7: hou 7 b6 20.016129 chs 21.54839 8: hou 8 b6 30.163636 jfk 30.16364 9: hou 9 dl 18.625000 ewr 14.32143 10: hou 10 dl 17.803279 phl 22.51613 11: hou 11 f9 3.000000 ewr 18.46429 12: hou 12 mq 13.554502 ewr 28.17857 13: iah 1 ev 15.682353 hnl 21.52632 14: iah 2 mq 19.946809 bpt 29.00000 15: iah 3 15.354839 sfo 27.43590 16: iah 4 mq 16.263441 sea 22.48515 17: iah 5 mq 25.179104 day 25.96154 18: iah 6 ua 24.453125 anc 34.06667 19: iah 7 oo 15.117419 dsm 32.39286 20: iah 8 ua 17.297561 anc 37.96552 21: iah 9 ua 11.620000 sju 16.76923 22: iah 10 ua 11.601266 cid 16.88462 23: iah 11 mq 8.445545 cid 18.04167 24: iah 12 xe 11.376852 hob 25.95556 origin month cmax cmaxval dmax dmaxval
there's no need store many intermediate objects (t1
, max1
, etc.).
the approach above requires manually coding each grouping variable. instead do...
dt[,do.call(c,lapply(c("uniquecarrier","dest"), function(colname) setnames( .sd[,mean(na.omit(arrdelay)),by=c(colname)][which.max(v1)], paste0(c("max_","maxval_"),colname) ) )),by=key(dt)]
Comments
Post a Comment