Using Apache Spark, reduce or fold an RDD depending on a condition
I'm working with Apache Spark and Scala. I have an RDD of (String, Int):

val counts = words.map(word => (word, 1)).reduceByKey((a, b) => a + b)

Now that I have the RDD reduced by key, I'd like to add a feature to also reduce words that are similar. I thought of using the Levenshtein distance, the Euclidean distance, or the cosine distance.
So, how can I apply one of those functions to reduce my RDD?
Example:

rdd -> (forks,12), (fork,4), (chair,15), (table,1), (tables,11)

Admitting that the similarity algorithm works, how can I obtain a reduced RDD like:

rdd -> (fork,16), (table,12), (chair,15)

I tried something like:

counts.foldLeft(){ (x, y) => if (x._1.euclideanDistance(y._1) > 0.9) (x, x._2 + y._2) }

But what I am trying does not work.
If you only have a distance(a, b) function, it is inefficient and complicated to solve this problem. You would need to use rdd.cartesian to generate all possible (word1, word2) pairs, then filter out those with a great distance. Now you have the similar word pairs, let's say (fox, fix), (fix, six), and their reversals. You want to sum the counts for fox, fix, and six together. For that you need to find the connected components in the graph defined by the similar word pairs. Once you have a component ID for each word, you can sum the counts by component ID. A sketch of this approach is shown below.
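Here is a rough sketch of that approach, assuming GraphX is on the classpath and the vocabulary is small enough that an all-pairs cartesian is acceptable. The levenshtein, mergeSimilar, and maxDistance names are mine, not anything from your code:

import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

// Plain Levenshtein edit distance; swap in whatever similarity measure you prefer.
def levenshtein(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1), dp(i - 1)(j - 1) + cost)
  }
  dp(a.length)(b.length)
}

def mergeSimilar(counts: RDD[(String, Int)], maxDistance: Int = 1): RDD[(String, Int)] = {
  // Give each word a Long id, as required by GraphX vertices.
  val withIds: RDD[(String, VertexId)] = counts.keys.zipWithUniqueId()

  // All candidate pairs; O(n^2), only viable for a small vocabulary.
  val similarPairs = withIds.cartesian(withIds)
    .filter { case ((w1, id1), (w2, id2)) => id1 < id2 && levenshtein(w1, w2) <= maxDistance }
    .map { case ((_, id1), (_, id2)) => (id1, id2) }

  // Connected components group transitively similar words.
  val components: RDD[(VertexId, VertexId)] =
    Graph.fromEdgeTuples(similarPairs, defaultValue = 0).connectedComponents().vertices

  // Words with no similar neighbour keep their own id as component id.
  val wordToComponent = withIds.map(_.swap).leftOuterJoin(components)
    .map { case (id, (word, comp)) => (word, comp.getOrElse(id)) }

  // Sum counts per component and keep one representative word (here: the shortest).
  counts.join(wordToComponent)
    .map { case (word, (count, comp)) => (comp, (word, count)) }
    .reduceByKey { case ((w1, c1), (w2, c2)) => (if (w1.length <= w2.length) w1 else w2, c1 + c2) }
    .values
}

On your example this would merge (forks,12) and (fork,4) into (fork,16) and (table,1) and (tables,11) into (table,12), leaving (chair,15) untouched.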
I think the better solution is to write a function that turns a word into a "canonical" form. It would turn forks, forking, and forked all into fork. Then you can apply this function to the keys and reduceByKey again, as in the sketch below.
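A minimal sketch of that idea, using a crude plural-stripping canonicalize as a stand-in for a real stemmer (e.g. a Porter stemmer):

// Hypothetical canonicalization; a real stemmer would handle far more cases.
def canonicalize(word: String): String = {
  val w = word.toLowerCase
  if (w.endsWith("ies")) w.dropRight(3) + "y"
  else if (w.endsWith("s") && !w.endsWith("ss")) w.dropRight(1)
  else w
}

// Re-key the already-reduced counts by canonical form and sum again.
val merged = counts
  .map { case (word, n) => (canonicalize(word), n) }
  .reduceByKey(_ + _)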
It may be fastest to do this step without Spark. Once you have calculated the counts with Spark, you have a tiny data set: one integer for each distinct word. It's easiest to collect it and do the map and groupBy on the counts locally.
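For example, assuming the same canonicalize helper as above, the local version could look like:

// Collect the small (word, count) data set to the driver, then merge similar words locally.
val localCounts: Array[(String, Int)] = counts.collect()

val mergedLocally: Map[String, Int] = localCounts
  .groupBy { case (word, _) => canonicalize(word) }
  .map { case (canon, group) => canon -> group.map(_._2).sum }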