performance - What's the fastest way to merge/join data.frames in R? -
for example (not sure if representative example though):
n <- 1e6 d1 <- data.frame(x=sample(n,n), y1=rnorm(n)) d2 <- data.frame(x=sample(n,n), y2=rnorm(n))
this i've got far:
d <- merge(d1,d2) # 7.6 sec library(plyr) d <- join(d1,d2) # 2.9 sec library(data.table) dt1 <- data.table(d1, key="x") dt2 <- data.table(d2, key="x") d <- data.frame( dt1[dt2,list(x,y1,y2=dt2$y2)] ) # 4.9 sec library(sqldf) sqldf() sqldf("create index ix1 on d1(x)") sqldf("create index ix2 on d2(x)") d <- sqldf("select * d1 inner join d2 on d1.x=d2.x") sqldf() # 17.4 sec
the match approach works when there unique key in second data frame each key value in first. if there duplicates in second data frame match , merge approaches not same. match is, of course, faster since not doing much. in particular never looks duplicate keys. (continued after code)
df1 = data.frame(a = c(1, 1, 2, 2), b = 1:4) df2 = data.frame(b = c(1, 2, 3, 3, 4), c = letters[1:5]) merge(df1, df2) b c 1 1 1 2 2 1 b 3 3 2 c 4 3 2 d 5 4 2 e df1$c = df2$c[match(df1$b, df2$b)] df1$c [1] b c e levels: b c d e > df1 b c 1 1 1 2 1 2 b 3 2 3 c 4 2 4 e
in sqldf code posted in question, might appear indexes used on 2 tables but, in fact, placed on tables overwritten before sql select ever runs , that, in part, accounts why slow. idea of sqldf data frames in r session constitute data base, not tables in sqlite. each time code refers unqualified table name in r workspace -- not in sqlite's main database. select statement shown reads d1 , d2 workspace sqlite's main database clobbering ones there indexes. result join no indexes. if wanted make use of versions of d1 , d2 in sqlite's main database have refer them main.d1 , main.d2 , not d1 , d2. also, if trying make run fast possible note simple join can't make use of indexes on both tables can save time of creating 1 of indexes. in code below illustrate these points.
its worthwhile notice precise computation can make huge difference on package fastest. example, merge , aggregate below. see results reversed two. in first example fastest slowest get: data.table, plyr, merge , sqldf whereas in second example sqldf, aggregate, data.table , plyr -- reverse of first one. in first example sqldf 3x slower data.table , in second 200x faster plyr , 100 times faster data.table. below show input code, output timings merge , output timings aggregate. worthwhile noting sqldf based on database , therefore can handle objects larger r can handle (if use dbname argument of sqldf) while other approaches limited processing in main memory. have illustrated sqldf sqlite supports h2 , postgresql databases well.
library(plyr) library(data.table) library(sqldf) set.seed(123) n <- 1e5 d1 <- data.frame(x=sample(n,n), y1=rnorm(n)) d2 <- data.frame(x=sample(n,n), y2=rnorm(n)) g1 <- sample(1:1000, n, replace = true) g2<- sample(1:1000, n, replace = true) d <- data.frame(d1, g1, g2) library(rbenchmark) benchmark(replications = 1, order = "elapsed", merge = merge(d1, d2), plyr = join(d1, d2), data.table = { dt1 <- data.table(d1, key = "x") dt2 <- data.table(d2, key = "x") data.frame( dt1[dt2,list(x,y1,y2=dt2$y2)] ) }, sqldf = sqldf(c("create index ix1 on d1(x)", "select * main.d1 join d2 using(x)")) ) set.seed(123) n <- 1e5 g1 <- sample(1:1000, n, replace = true) g2<- sample(1:1000, n, replace = true) d <- data.frame(x=sample(n,n), y=rnorm(n), g1, g2) benchmark(replications = 1, order = "elapsed", aggregate = aggregate(d[c("x", "y")], d[c("g1", "g2")], mean), data.table = { dt <- data.table(d, key = "g1,g2") dt[, colmeans(cbind(x, y)), = "g1,g2"] }, plyr = ddply(d, .(g1, g2), summarise, avx = mean(x), avy=mean(y)), sqldf = sqldf(c("create index ix on d(g1, g2)", "select g1, g2, avg(x), avg(y) main.d group g1, g2")) )
the outputs 2 benchmark call comparing merge calculations are:
joining by: x test replications elapsed relative user.self sys.self user.child sys.child 3 data.table 1 0.34 1.000000 0.31 0.01 na na 2 plyr 1 0.44 1.294118 0.39 0.02 na na 1 merge 1 1.17 3.441176 1.10 0.04 na na 4 sqldf 1 3.34 9.823529 3.24 0.04 na na
the output benchmark call comparing aggregate calculations are:
test replications elapsed relative user.self sys.self user.child sys.child 4 sqldf 1 2.81 1.000000 2.73 0.02 na na 1 aggregate 1 14.89 5.298932 14.89 0.00 na na 2 data.table 1 132.46 47.138790 131.70 0.08 na na 3 plyr 1 212.69 75.690391 211.57 0.56 na na
Comments
Post a Comment