r - How to perform further grouping and lookups on aggregated values in .SD for a data.table - -


this related previous question on grouping/lookups on data.table additional output.

i trying perform grouping on subset .sd , select values each subset. example, in flights dataset, find out : each airport , month, uniquecarrier , destination has worst avg arrdelay. there 2 levels of aggregation.

i have working solution follows.. understand if there better solutions.

library(data.table) library(hflights)  dt <- as.data.table(hflights)  setkey(dt, origin, month)  #the solution code... dt[, {  t1 <- .sd[, .(mean(na.omit(arrdelay))) , by=uniquecarrier];  max1 <- which.max(t1$v1);  t2 <- .sd[, .(mean(na.omit(arrdelay))) , by=dest];  max2 <- which.max(t2$v1);  list( maxavgdelayforcarrier = t1$uniquecarrier[max1], maxavgdelaybycarrier = t1$v1[max1],  maxavgdelaybydest= t2$dest[max2], maxavgdelayfordest= t2$v1[max2] ) },  = .(origin, month)]  # checking correctness head(dt[ .("hou", 1), .(maxavgdelaybycarrier=mean(na.omit(arrdelay))), by=uniquecarrier][order(-maxavgdelaybycarrier)],1) head(dt[ .("iah", 2), .(maxavgdelayfordest=mean(na.omit(arrdelay))), by=dest][order(-maxavgdelayfordest)],1) 

i think code fine, write way:

dt[,c(    .sd[,      .(cmaxval = mean(na.omit(arrdelay))),   by=.(cmax = uniquecarrier)][which.max(cmaxval)],    .sd[,      .(dmaxval = mean(na.omit(arrdelay))),   by=.(dmax = dest)][which.max(dmaxval)]  ),by=key(dt)] 

which gives

    origin month cmax   cmaxval dmax  dmaxval  1:    hou     1   f9 13.725806  phl 20.12500  2:    hou     2   b6 17.822222  ecp 20.17308  3:    hou     3   ev 23.088889  phl 46.06452  4:    hou     4   ev 27.847826  phl 67.93333  5:    hou     5   ev 25.436620  phl 75.61290  6:    hou     6   ev 16.930233  ewr 34.87755  7:    hou     7   b6 20.016129  chs 21.54839  8:    hou     8   b6 30.163636  jfk 30.16364  9:    hou     9   dl 18.625000  ewr 14.32143 10:    hou    10   dl 17.803279  phl 22.51613 11:    hou    11   f9  3.000000  ewr 18.46429 12:    hou    12   mq 13.554502  ewr 28.17857 13:    iah     1   ev 15.682353  hnl 21.52632 14:    iah     2   mq 19.946809  bpt 29.00000 15:    iah     3   15.354839  sfo 27.43590 16:    iah     4   mq 16.263441  sea 22.48515 17:    iah     5   mq 25.179104  day 25.96154 18:    iah     6   ua 24.453125  anc 34.06667 19:    iah     7   oo 15.117419  dsm 32.39286 20:    iah     8   ua 17.297561  anc 37.96552 21:    iah     9   ua 11.620000  sju 16.76923 22:    iah    10   ua 11.601266  cid 16.88462 23:    iah    11   mq  8.445545  cid 18.04167 24:    iah    12   xe 11.376852  hob 25.95556     origin month cmax   cmaxval dmax  dmaxval 

there's no need store many intermediate objects (t1, max1, etc.).


the approach above requires manually coding each grouping variable. instead do...

dt[,do.call(c,lapply(c("uniquecarrier","dest"),   function(colname) setnames(     .sd[,mean(na.omit(arrdelay)),by=c(colname)][which.max(v1)],     paste0(c("max_","maxval_"),colname)   ) )),by=key(dt)] 

Comments

Popular posts from this blog

python - No exponential form of the z-axis in matplotlib-3D-plots -

php - Best Light server (Linux + Web server + Database) for Raspberry Pi -

c# - "Newtonsoft.Json.JsonSerializationException unable to find constructor to use for types" error when deserializing class -