r - How to perform further grouping and lookups on aggregated values in .SD for a data.table - -
this related previous question on grouping/lookups on data.table additional output.
i trying perform grouping on subset .sd , select values each subset. example, in flights dataset, find out : each airport , month, uniquecarrier , destination has worst avg arrdelay. there 2 levels of aggregation.
i have working solution follows.. understand if there better solutions.
library(data.table) library(hflights) dt <- as.data.table(hflights) setkey(dt, origin, month) #the solution code... dt[, { t1 <- .sd[, .(mean(na.omit(arrdelay))) , by=uniquecarrier]; max1 <- which.max(t1$v1); t2 <- .sd[, .(mean(na.omit(arrdelay))) , by=dest]; max2 <- which.max(t2$v1); list( maxavgdelayforcarrier = t1$uniquecarrier[max1], maxavgdelaybycarrier = t1$v1[max1], maxavgdelaybydest= t2$dest[max2], maxavgdelayfordest= t2$v1[max2] ) }, = .(origin, month)] # checking correctness head(dt[ .("hou", 1), .(maxavgdelaybycarrier=mean(na.omit(arrdelay))), by=uniquecarrier][order(-maxavgdelaybycarrier)],1) head(dt[ .("iah", 2), .(maxavgdelayfordest=mean(na.omit(arrdelay))), by=dest][order(-maxavgdelayfordest)],1)
i think code fine, write way:
dt[,c( .sd[, .(cmaxval = mean(na.omit(arrdelay))), by=.(cmax = uniquecarrier)][which.max(cmaxval)], .sd[, .(dmaxval = mean(na.omit(arrdelay))), by=.(dmax = dest)][which.max(dmaxval)] ),by=key(dt)] which gives
origin month cmax cmaxval dmax dmaxval 1: hou 1 f9 13.725806 phl 20.12500 2: hou 2 b6 17.822222 ecp 20.17308 3: hou 3 ev 23.088889 phl 46.06452 4: hou 4 ev 27.847826 phl 67.93333 5: hou 5 ev 25.436620 phl 75.61290 6: hou 6 ev 16.930233 ewr 34.87755 7: hou 7 b6 20.016129 chs 21.54839 8: hou 8 b6 30.163636 jfk 30.16364 9: hou 9 dl 18.625000 ewr 14.32143 10: hou 10 dl 17.803279 phl 22.51613 11: hou 11 f9 3.000000 ewr 18.46429 12: hou 12 mq 13.554502 ewr 28.17857 13: iah 1 ev 15.682353 hnl 21.52632 14: iah 2 mq 19.946809 bpt 29.00000 15: iah 3 15.354839 sfo 27.43590 16: iah 4 mq 16.263441 sea 22.48515 17: iah 5 mq 25.179104 day 25.96154 18: iah 6 ua 24.453125 anc 34.06667 19: iah 7 oo 15.117419 dsm 32.39286 20: iah 8 ua 17.297561 anc 37.96552 21: iah 9 ua 11.620000 sju 16.76923 22: iah 10 ua 11.601266 cid 16.88462 23: iah 11 mq 8.445545 cid 18.04167 24: iah 12 xe 11.376852 hob 25.95556 origin month cmax cmaxval dmax dmaxval there's no need store many intermediate objects (t1, max1, etc.).
the approach above requires manually coding each grouping variable. instead do...
dt[,do.call(c,lapply(c("uniquecarrier","dest"), function(colname) setnames( .sd[,mean(na.omit(arrdelay)),by=c(colname)][which.max(v1)], paste0(c("max_","maxval_"),colname) ) )),by=key(dt)]
Comments
Post a Comment