r - Increasing speed of string comparison using mapply


I have data frames containing unique n-grams in the form of:

> head(ngram4, 3)
  term1 term2 term3 term4 freq
1     end    of   3457
2    rest    of   2974
3    @     end    of 2950

> head(ngram3, 3)
  term1 term2 term3  freq
1   one    of   the 15268
2        lot    of 13365
3     10709

and so on down to ngram1 (term1 and freq only). The term columns are character, and the freq column is integer.

For each row, I am trying to pull in the frequency from the lower-order n-gram table that matches every term column except the last. For example, for row 1 in ngram3, "one of the", I need to pull ngram2$freq from the row where term1 = "one" and term2 = "of". Something like:

ngram2[ngram2$term1==ngram3$term1 & ngram2$term2==ngram3$term2, "freq"] 
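To make the intended lookup concrete, here is a minimal sketch for a single trigram row. The toy tables and their frequencies are invented for illustration and are not the real data; only the column names follow the description above.

```r
# Toy stand-ins for the real tables; all values here are invented for illustration.
ngram2 <- data.frame(term1 = c("one", "lot", "end"),
                     term2 = c("of", "of", "of"),
                     freq  = c(500L, 400L, 300L),
                     stringsAsFactors = FALSE)
ngram3 <- data.frame(term1 = c("one", "lot"),
                     term2 = c("of", "of"),
                     term3 = c("the", "people"),
                     freq  = c(50L, 40L),
                     stringsAsFactors = FALSE)

# Lookup for one trigram row: compare its first two terms against every row of ngram2.
i <- 1
lower_freq <- ngram2[ngram2$term1 == ngram3$term1[i] &
                     ngram2$term2 == ngram3$term2[i], "freq"]
lower_freq
```

Each such lookup scans the whole of ngram2, which is why repeating it once per row gets expensive on large tables.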

I'm trying to do this using mapply on each row of ngram3, as follows:

mapply(function(xfreq, xterm1, xterm2)
         ngram2[ngram2$term1 == xterm1 & ngram2$term2 == xterm2, "freq"],
       ngram3$freq, ngram3$term1, ngram3$term2)

The problem is that I have 750,000 rows in both ngram2 and ngram3, so the process is terribly slow. I timed a small sample of 100 rows of ngram3, and it takes 7.612 sec.

mapply(function(xfreq, xterm1, xterm2)
         ngram2[ngram2$term1 == xterm1 & ngram2$term2 == xterm2, "freq"],
       ngram3$freq[1:100], ngram3$term1[1:100], ngram3$term2[1:100])

At that rate, it would take about 16 hours to run through 750,000 rows. I don't know if there's anything I can do to speed it up. Any thoughts?
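The 16-hour figure follows from a simple extrapolation of the 100-row timing, assuming the cost per row stays constant as the number of rows grows:

```r
secs_per_row <- 7.612 / 100   # measured: 100 rows took 7.612 seconds
total_rows   <- 750000        # rows in ngram3
est_hours    <- secs_per_row * total_rows / 3600
round(est_hours, 1)           # 15.9
```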

======== tldr =========

ngram2 and ngram3 are big data frames. How can I speed up the following expression:

mapply(function(xfreq, xterm1, xterm2)
         ngram2[ngram2$term1 == xterm1 & ngram2$term2 == xterm2, "freq"],
       ngram3$freq, ngram3$term1, ngram3$term2)

where term1 and term2 are of type character and freq is of type integer? As it is, it would take about 16 hrs to run.

If you read these in as character-valued columns (using stringsAsFactors=FALSE) rather than as factors, you can test whether a merge operation will work (and it does):

ngram4[4,] <- c(ngram3[3,], "fish")
merge(ngram4, ngram3, by=1:3)  # use all.x=TRUE or all.y=TRUE
#   term1 term2 term3 term4 freq.x freq.y
#1     10709   fish  10709

I notice a puzzling switch of column positions, which I don't understand.
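Applied to the original question, the same idea might look like the sketch below: join ngram3 to ngram2 on term1 and term2 instead of scanning ngram2 once per row. The toy tables are invented for illustration, and the sketch assumes each (term1, term2) pair occurs at most once in ngram2, as the "unique n-grams" description suggests. A second variant uses match() on pasted keys, which keeps the row order of ngram3; the separator and the freq.lower column name are arbitrary choices made for this example.

```r
# Toy stand-ins for the real tables; all values are invented for illustration.
ngram2 <- data.frame(term1 = c("one", "lot", "end"),
                     term2 = c("of", "of", "of"),
                     freq  = c(500L, 400L, 300L),
                     stringsAsFactors = FALSE)
ngram3 <- data.frame(term1 = c("one", "lot", "rest"),
                     term2 = c("of", "of", "of"),
                     term3 = c("the", "people", "us"),
                     freq  = c(50L, 40L, 30L),
                     stringsAsFactors = FALSE)

# Approach 1: left join on the shared term columns.
# all.x = TRUE keeps trigrams with no matching bigram (their freq.y is NA).
# freq.x is the trigram frequency, freq.y the bigram frequency.
# Note that merge() may reorder the rows.
merged <- merge(ngram3, ngram2, by = c("term1", "term2"), all.x = TRUE)

# Approach 2: build one key per row and look it up with match(),
# which preserves the row order of ngram3.
sep  <- "\r"  # assumed not to occur inside any term
key3 <- paste(ngram3$term1, ngram3$term2, sep = sep)
key2 <- paste(ngram2$term1, ngram2$term2, sep = sep)
ngram3$freq.lower <- ngram2$freq[match(key3, key2)]
ngram3
```

Both variants do the comparison in a single vectorised pass rather than one scan of ngram2 per row, so they should scale far better than the mapply loop, though actual timings on the 750,000-row tables would need to be measured.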

