r - Fast reading of different types of data with the same command, better separator guessing


I have LD data, either as a raw output file from PLINK as below (notice the spaces, used to make the output pretty; notice the leading and trailing spaces, too):

write.table(read.table(text="
 chr_a         bp_a          snp_a  chr_b         bp_b          snp_b           r2 
     1    154834183      rs1218582      1    154794318      rs9970364    0.0929391 
     1    154834183      rs1218582      1    154795033     rs56744813      0.10075 
     1    154834183      rs1218582      1    154797272     rs16836414     0.106455 
     1    154834183      rs1218582      1    154798550    rs200576863    0.0916789 
     1    154834183      rs1218582      1    154802379     rs11264270     0.176911 
", sep = "x"),
"type1.txt", col.names = FALSE, row.names = FALSE, quote = FALSE)

Or as a nicely tab-separated file:

write.table(read.table(text="
chr_a bp_a snp_a chr_b bp_b snp_b r2
1 154834183 rs1218582 1 154794318 rs9970364 0.0929391
1 154834183 rs1218582 1 154795033 rs56744813 0.10075
1 154834183 rs1218582 1 154797272 rs16836414 0.106455
1 154834183 rs1218582 1 154798550 rs200576863 0.0916789
1 154834183 rs1218582 1 154802379 rs11264270 0.176911",
sep = " "),
"type2.txt", col.names = FALSE, row.names = FALSE, quote = FALSE, sep = "\t")

read.csv works for both types of data:

read.csv("type1.txt", sep = "")
read.csv("type2.txt", sep = "")

fread works only for type2:

fread("type1.txt")
fread("type2.txt")

The files are big, with millions of rows, hence I can't use the read.csv option. Is there a way to make fread guess better? Any other package/function suggestions?

I could use readLines to guess the type of file, or tidy up the file using a system call in fread, but that would add an overhead which I am trying to avoid.
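A minimal sketch of the readLines idea mentioned above, assuming that a tab anywhere in the first line is enough to tell the two layouts apart. The helper name guess_layout and the small temporary demo files are hypothetical and not part of the original question. Only the first line is read, so the check itself does not grow with file size.

```r
# Hypothetical helper: peek at the first line only and classify the layout.
guess_layout <- function(path) {
  first <- readLines(path, n = 1L)
  if (grepl("\t", first, fixed = TRUE)) "tab" else "whitespace"
}

# Self-contained demo on two small temporary files (made-up subsets of the data).
tab_file <- tempfile(fileext = ".txt")
ws_file  <- tempfile(fileext = ".txt")
writeLines(c("chr_a\tbp_a\tr2", "1\t154834183\t0.0929391"), tab_file)
writeLines(c(" chr_a      bp_a         r2 ", "     1 154834183  0.0929391 "), ws_file)

layout_tab <- guess_layout(tab_file)
layout_ws  <- guess_layout(ws_file)
print(c(layout_tab, layout_ws))
```

The result could then be used to pick a reader or a separator per file. It still costs one extra file open per input, which is the overhead the question is trying to avoid.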

Edit: sessionInfo

R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

This is fixed in the devel version, v1.9.5. Either use the devel version (upgrade) or wait a while for it to hit CRAN as v1.9.6:

require(data.table) # v1.9.5+
ans <- fread("type1.txt")
#    chr_a      bp_a     snp_a chr_b      bp_b       snp_b        r2
# 1:     1 154834183 rs1218582     1 154794318   rs9970364 0.0929391
# 2:     1 154834183 rs1218582     1 154795033  rs56744813 0.1007500
# 3:     1 154834183 rs1218582     1 154797272  rs16836414 0.1064550
# 4:     1 154834183 rs1218582     1 154798550 rs200576863 0.0916789
# 5:     1 154834183 rs1218582     1 154802379  rs11264270 0.1769110

fread() has gained strip.white (default=TRUE) amidst other arguments / bug fixes. Please see the README file on the project page for more info.


The types are recognised correctly as well.

sapply(ans, class)
#       chr_a        bp_a       snp_a       chr_b        bp_b       snp_b          r2
#   "integer"   "integer" "character"   "integer"   "integer" "character"   "numeric"
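The strip.white behaviour can be tried on a small padded file without the full PLINK output. This is only a sketch: it assumes data.table v1.9.6 or later is installed, and the temporary file and the variable names tmp and padded are made up for illustration. strip.white is passed explicitly only to make the setting visible, since TRUE is already the default.

```r
library(data.table)  # assumed to be v1.9.6 or later

# Made-up three-column subset of the padded layout, with leading and
# trailing spaces on every line.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  " chr_a         bp_a          snp_a ",
  "     1    154834183      rs1218582 ",
  "     1    154794318      rs9970364 ",
  "     1    154795033     rs56744813 "
), tmp)

# Let fread guess the separator; strip.white = TRUE is the default.
padded <- fread(tmp, strip.white = TRUE)
print(padded)
print(sapply(padded, class))
```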
