r - Increasing speed of string comparison using mapply
I have data frames containing unique n-grams in the form of:
> head(ngram4, 3)
  term1 term2 term3 term4 freq
1 end of 3457
2 rest of 2974
3 @ end of 2950
> head(ngram3, 3)
  term1 term2 term3 freq
1 one of the 15268
2 lot of 13365
3 10709
and so on down to ngram1 (term1 and freq only). The term columns are character and the freq column is integer.
For each row, I am trying to pull in the frequency from the lower-order n-gram table that corresponds to all but the last term column. For example, for row 1 in ngram3, "one of the", I need to pull ngram2$freq for the row where term1="one" and term2="of". Something like:
ngram2[ngram2$term1==ngram3$term1 & ngram2$term2==ngram3$term2, "freq"]
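The expression above compares the two columns element by element, so it only conveys the intent. A minimal sketch of the lookup for a single row, using toy tables whose terms and frequencies are made up for illustration (they are not taken from the real data):

```r
# Toy tables; the terms and frequencies are invented for illustration only.
ngram2 <- data.frame(term1 = c("one", "lot", "end"),
                     term2 = c("of", "of", "of"),
                     freq  = c(500L, 400L, 300L),
                     stringsAsFactors = FALSE)
ngram3 <- data.frame(term1 = c("one", "end"),
                     term2 = c("of", "of"),
                     term3 = c("the", "the"),
                     freq  = c(50L, 30L),
                     stringsAsFactors = FALSE)

# Look up the bigram frequency for one trigram by indexing that row explicitly.
i <- 1
lower_freq <- ngram2[ngram2$term1 == ngram3$term1[i] &
                     ngram2$term2 == ngram3$term2[i], "freq"]
lower_freq
```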
I'm trying to do this using mapply over each row of ngram3, as follows:
mapply(function(xfreq, xterm1, xterm2) ngram2[ngram2$term1==xterm1 & ngram2$term2==xterm2,"freq"], ngram3$freq, ngram3$term1, ngram3$term2)
The problem is that I have about 750,000 rows in both ngram2 and ngram3, and the process is terribly slow. I timed it on a small sample of 100 rows of ngram3, and it takes 7.612 sec.
mapply(function(xfreq, xterm1, xterm2) ngram2[ngram2$term1==xterm1 & ngram2$term2==xterm2,"freq"], ngram3$freq[1:100], ngram3$term1[1:100], ngram3$term2[1:100])
At that rate, it would take about 16 hours to run through all 750,000 rows. I don't know if there's anything I can do to speed it up. Any thoughts?
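The 16-hour figure is a straight-line extrapolation from the 100-row timing. A quick check of the arithmetic, assuming the cost per row stays constant as the sample grows:

```r
# Figures from the post: 100 rows took 7.612 seconds; the full table has 750,000 rows.
secs_per_100 <- 7.612
n_rows <- 750000

# Linear extrapolation to the full table, converted from seconds to hours.
est_hours <- secs_per_100 / 100 * n_rows / 3600
round(est_hours, 1)
```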
======== tldr =========
ngram2 and ngram3 are big data frames. How can I speed up the following expression:
mapply(function(xfreq, xterm1, xterm2) ngram2[ngram2$term1==xterm1 & ngram2$term2==xterm2,"freq"], ngram3$freq, ngram3$term1, ngram3$term2)
where term1 and term2 are of type character and freq is of type integer? As is, it would take 16 hrs to run.
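One possible direction, sketched here on toy data rather than the real tables, is to replace the per-row subsetting with a single vectorized lookup: build one key per row from term1 and term2, then use match() to find each ngram3 key in ngram2. The column name lower_freq, the "\r" separator, and all terms and frequencies below are assumptions for illustration. It also relies on ngram2 holding unique n-grams, since match() returns only the first hit.

```r
# Toy tables; the terms and frequencies are invented for illustration only.
ngram2 <- data.frame(term1 = c("one", "lot", "end"),
                     term2 = c("of", "of", "of"),
                     freq  = c(500L, 400L, 300L),
                     stringsAsFactors = FALSE)
ngram3 <- data.frame(term1 = c("one", "rest", "end"),
                     term2 = c("of", "of", "of"),
                     term3 = c("the", "the", "the"),
                     freq  = c(50L, 40L, 30L),
                     stringsAsFactors = FALSE)

# One key per row. "\r" is an assumed separator that never occurs inside a term.
key2 <- paste(ngram2$term1, ngram2$term2, sep = "\r")
key3 <- paste(ngram3$term1, ngram3$term2, sep = "\r")

# match() gives, for each ngram3 key, the position of its first match in key2
# (NA when the bigram is absent), so the lookup is done in one vectorized call.
ngram3$lower_freq <- ngram2$freq[match(key3, key2)]
ngram3
```

Because the keys are matched in one call instead of scanning ngram2 once per row, this should scale far better than the mapply version, though it has not been timed on the 750,000-row tables.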
If you read these in as character-valued columns (using stringsAsFactors=FALSE) rather than as factors, you can test whether a merge operation would work (and it does):
ngram4[4,] <- c(ngram3[3,], "fish")
merge(ngram4, ngram3, by=1:3)  # use all.x=TRUE or all.y=TRUE
#   term1 term2 term3 term4 freq.x freq.y
# 1 10709 fish 10709
I notice a puzzling switch of column positions, which I don't understand.
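The switch most likely comes from the order of the values in the new row: c() on a row of ngram3 followed by "fish" yields term1, term2, term3, freq, "fish", and the assignment fills ngram4's columns by position, so the frequency lands under term4 and "fish" under freq. A minimal sketch with toy tables (all terms and frequencies are invented) that lists the values in ngram4's column order before merging:

```r
# Toy tables; the terms and frequencies are invented for illustration only.
ngram3 <- data.frame(term1 = c("one", "lot"), term2 = c("of", "of"),
                     term3 = c("the", "fun"), freq = c(50L, 40L),
                     stringsAsFactors = FALSE)
ngram4 <- data.frame(term1 = "end", term2 = "of", term3 = "the",
                     term4 = "day", freq = 5L, stringsAsFactors = FALSE)

# Build the new row in ngram4's column order: three terms, then term4, then freq.
ngram4[2, ] <- list(ngram3$term1[2], ngram3$term2[2], ngram3$term3[2],
                    "fish", ngram3$freq[2])

# Join on the three shared term columns; all.x = TRUE keeps ngram4 rows
# that have no matching trigram (their freq.y is NA).
merged <- merge(ngram4, ngram3, by = c("term1", "term2", "term3"), all.x = TRUE)
merged
```

Merging by column names rather than by=1:3 is a small safeguard in case the two tables ever order their columns differently.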