performance - What's the fastest way to merge/join data.frames in R? -

June 15, 2013

for example (not sure if representative example though):

n <- 1e6 d1 <- data.frame(x=sample(n,n), y1=rnorm(n)) d2 <- data.frame(x=sample(n,n), y2=rnorm(n))

this i've got far:

d <- merge(d1,d2) # 7.6 sec  library(plyr) d <- join(d1,d2) # 2.9 sec  library(data.table) dt1 <- data.table(d1, key="x") dt2 <- data.table(d2, key="x") d <- data.frame( dt1[dt2,list(x,y1,y2=dt2$y2)] ) # 4.9 sec  library(sqldf) sqldf() sqldf("create index ix1 on d1(x)") sqldf("create index ix2 on d2(x)") d <- sqldf("select * d1 inner join d2 on d1.x=d2.x") sqldf() # 17.4 sec

the match approach works when there unique key in second data frame each key value in first. if there duplicates in second data frame match , merge approaches not same. match is, of course, faster since not doing much. in particular never looks duplicate keys. (continued after code)

df1 = data.frame(a = c(1, 1, 2, 2), b = 1:4) df2 = data.frame(b = c(1, 2, 3, 3, 4), c = letters[1:5]) merge(df1, df2)     b c   1 1 1   2 2 1 b   3 3 2 c   4 3 2 d   5 4 2 e df1$c = df2$c[match(df1$b, df2$b)] df1$c [1] b c e levels: b c d e  > df1   b c 1 1 1 2 1 2 b 3 2 3 c 4 2 4 e

in sqldf code posted in question, might appear indexes used on 2 tables but, in fact, placed on tables overwritten before sql select ever runs , that, in part, accounts why slow. idea of sqldf data frames in r session constitute data base, not tables in sqlite. each time code refers unqualified table name in r workspace -- not in sqlite's main database. select statement shown reads d1 , d2 workspace sqlite's main database clobbering ones there indexes. result join no indexes. if wanted make use of versions of d1 , d2 in sqlite's main database have refer them main.d1 , main.d2 , not d1 , d2. also, if trying make run fast possible note simple join can't make use of indexes on both tables can save time of creating 1 of indexes. in code below illustrate these points.

its worthwhile notice precise computation can make huge difference on package fastest. example, merge , aggregate below. see results reversed two. in first example fastest slowest get: data.table, plyr, merge , sqldf whereas in second example sqldf, aggregate, data.table , plyr -- reverse of first one. in first example sqldf 3x slower data.table , in second 200x faster plyr , 100 times faster data.table. below show input code, output timings merge , output timings aggregate. worthwhile noting sqldf based on database , therefore can handle objects larger r can handle (if use dbname argument of sqldf) while other approaches limited processing in main memory. have illustrated sqldf sqlite supports h2 , postgresql databases well.

library(plyr) library(data.table) library(sqldf)  set.seed(123) n <- 1e5 d1 <- data.frame(x=sample(n,n), y1=rnorm(n)) d2 <- data.frame(x=sample(n,n), y2=rnorm(n))  g1 <- sample(1:1000, n, replace = true) g2<- sample(1:1000, n, replace = true) d <- data.frame(d1, g1, g2)  library(rbenchmark)  benchmark(replications = 1, order = "elapsed",    merge = merge(d1, d2),    plyr = join(d1, d2),    data.table = {        dt1 <- data.table(d1, key = "x")       dt2 <- data.table(d2, key = "x")       data.frame( dt1[dt2,list(x,y1,y2=dt2$y2)] )       },    sqldf = sqldf(c("create index ix1 on d1(x)",       "select * main.d1 join d2 using(x)")) )  set.seed(123) n <- 1e5 g1 <- sample(1:1000, n, replace = true) g2<- sample(1:1000, n, replace = true) d <- data.frame(x=sample(n,n), y=rnorm(n), g1, g2)  benchmark(replications = 1, order = "elapsed",    aggregate = aggregate(d[c("x", "y")], d[c("g1", "g2")], mean),     data.table = {       dt <- data.table(d, key = "g1,g2")       dt[, colmeans(cbind(x, y)), = "g1,g2"]    },    plyr = ddply(d, .(g1, g2), summarise, avx = mean(x), avy=mean(y)),    sqldf = sqldf(c("create index ix on d(g1, g2)",       "select g1, g2, avg(x), avg(y) main.d group g1, g2")) )

the outputs 2 benchmark call comparing merge calculations are:

joining by: x         test replications elapsed relative user.self sys.self user.child sys.child 3 data.table            1    0.34 1.000000      0.31     0.01         na        na 2       plyr            1    0.44 1.294118      0.39     0.02         na        na 1      merge            1    1.17 3.441176      1.10     0.04         na        na 4      sqldf            1    3.34 9.823529      3.24     0.04         na        na

the output benchmark call comparing aggregate calculations are:

        test replications elapsed  relative user.self sys.self user.child sys.child 4      sqldf            1    2.81  1.000000      2.73     0.02         na        na 1  aggregate            1   14.89  5.298932     14.89     0.00         na        na 2 data.table            1  132.46 47.138790    131.70     0.08         na        na 3       plyr            1  212.69 75.690391    211.57     0.56         na        na

Search This Blog

shell

performance - What's the fastest way to merge/join data.frames in R? -

Comments

Post a Comment

Popular posts from this blog

Add email recipient to all new Trac tickets -

400 Bad Request on Apache/PHP AddHandler wrapper -

asp.net - repeatedly call AddImageUrl(url) to assemble pdf document -