regex - Lookup table with subset/grepl in R
I'm analyzing a set of URLs and values extracted using a crawler. While I could extract substrings from the URL, I'd rather not bother with regex, so is there a simple way to do a lookup table-style replacement using subset/grepl without resorting to dplyr (doing a conditional mutate on variables)?
My current process:
    test <- data.frame(
      url = c('google.com/testing/duck', 'google.com/evaluating/dog', 'google.com/analyzing/cat'),
      content = c(1, 2, 3),
      subdir = NA
    )
    test[grepl('testing', test$url), ]$subdir <- 'testing'
    test[grepl('evaluating', test$url), ]$subdir <- 'evaluating'
    test[grepl('analyzing', test$url), ]$subdir <- 'analyzing'
Obviously, this is a little clumsy and doesn't scale well. With dplyr, I'd be able to do conditionals like:
    library(dplyr)
    library(magrittr)  # provides %<>%

    test %<>%
      tbl_df() %>%
      mutate(subdir = ifelse(
        # match against url; subdir is still all NA at this point
        grepl('testing', url), 'test r',
        ifelse(
          grepl('evaluating', url), 'eval r',
          ifelse(grepl('analyzing', url), 'anal r', NA)
        )
      ))
But, again, this is goofy, and I don't want to incur a package dependency if at all possible. Is there a way to do regex-based subsetting with some sort of lookup table?
Edit: A few clarifications:
- For extracting subdirectories, yes, regex would be efficient; however, I was hoping for a more general pattern that could match a dictionary-like struct of strings to other, arbitrary values.
- Of course, the nested ifelse is ugly and prone to error; I just wanted to get a quick-and-dirty dplyr example up.
Edit 2: Thought I'd close the loop and post what I ended up with, based upon BondedDust's approach. I decided to practice mapping and non-standard eval while I was at it:
    test <- data.frame(
      url = c(
        'google.com/testing/duck', 'google.com/testing/dog', 'google.com/testing/cat',
        'google.com/evaluating/duck', 'google.com/evaluating/dog', 'google.com/evaluating/cat',
        'google.com/analyzing/duck', 'google.com/analyzing/dog', 'google.com/analyzing/cat',
        'banana'
      ),
      content = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
      subdir = NA
    )

    # Named vector used as a key/value lookup; names can be regex
    lookup <- c(
      "testing" = "testing important",
      "eval.*" = 'eval in r',
      "analy(z|s)ing" = 'r fun'
    )

    # Dumb test for error handling:
    # lookup <- c('test', 'hey')

    # Defining new lookup function
    regexlookup <- function(data, dict, searchcolumn, targetcolumn, ignore.case = TRUE) {
      # Basic check; need separate errors/handling
      if (is.null(names(dict)) || is.null(dict[[1]])) {
        stop("Not a valid replacement value; use a key/value store for `dict`.")
      }

      # Non-standard eval of column names; not sure if I should
      # add safety/type checks for these
      searchcolumn <- eval(substitute(searchcolumn), data)
      targetcolumn <- deparse(substitute(targetcolumn))

      # Define find-and-replace utility
      findandreplace <- function(key, val) {
        data[grepl(key, searchcolumn, ignore.case = ignore.case), targetcolumn] <- val
        data <<- data  # write the modified copy back to regexlookup's `data`
      }

      # Map over the key/value store
      mapply(findandreplace, names(dict), dict)

      # Return result, non-matching rows preserved
      return(data)
    }

    regexlookup(test, lookup, url, subdir, ignore.case = FALSE)
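For comparison, here is a minimal sketch of the same idea without non-standard evaluation or `<<-`. It takes the search vector directly and returns the looked-up values. The name `lookup_first_match` is hypothetical, and unlike the function above (where a later pattern overwrites an earlier match), this sketch assumes the first matching pattern should win:

```r
# Hypothetical helper: map each element of `x` to the value of the first
# regex in names(dict) that matches it; unmatched elements stay NA.
lookup_first_match <- function(x, dict, ignore.case = TRUE) {
  stopifnot(is.character(x), !is.null(names(dict)))
  out <- rep(NA_character_, length(x))
  for (i in seq_along(dict)) {
    # Only fill positions that no earlier pattern has claimed
    hit <- is.na(out) & grepl(names(dict)[i], x, ignore.case = ignore.case)
    out[hit] <- dict[[i]]
  }
  out
}

urls <- c('google.com/testing/duck', 'google.com/evaluating/dog',
          'google.com/analysing/cat', 'banana')
lookup <- c(
  "testing" = "testing important",
  "eval.*" = "eval in r",
  "analy(z|s)ing" = "r fun"
)
result <- lookup_first_match(urls, lookup)
```

The result can then be assigned to a column in the usual way, e.g. `test$subdir <- lookup_first_match(test$url, lookup)`.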
    for (target in c('testing', 'evaluating', 'analyzing')) {
      test[grepl(target, test$url), 'subdir'] <- target
    }
    test
                            url content     subdir
    1   google.com/testing/duck       1    testing
    2 google.com/evaluating/dog       2 evaluating
    3  google.com/analyzing/cat       3  analyzing
The vector of targets could instead have been the name of a vector in the workspace.
    targets <- c('testing', 'evaluating', 'analyzing')
    for (target in targets) { ... }
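If the replacement should be an arbitrary value rather than the pattern itself, the same loop could run over a named vector, with the names as patterns and the elements as values. This is only a sketch; the replacement labels below are made up for illustration:

```r
test <- data.frame(
  url = c('google.com/testing/duck', 'google.com/evaluating/dog', 'google.com/analyzing/cat'),
  content = c(1, 2, 3),
  subdir = NA
)

# Names are the regex patterns; values are hypothetical replacement labels
targets <- c(testing = 'Test', evaluating = 'Eval', analyzing = 'Analysis')

for (pattern in names(targets)) {
  # Rows whose url matches the pattern get the corresponding value
  test[grepl(pattern, test$url), 'subdir'] <- targets[[pattern]]
}
```

Rows that match none of the patterns keep their NA in subdir.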