Using Apache Spark, reduce or fold an RDD depending on a condition
I'm working with Apache Spark and Scala. I have an RDD of (String, Int):
val counts = words.map(word => (word, 1)).reduceByKey((a, b) => a + b)
Now that the RDD is reduced by key, I'd like to add a feature to also reduce words that are similar.
I thought of using Levenshtein distance, Euclidean distance or cosine distance.
So, how can I apply one of these functions to reduce my RDD?
Example:
RDD -> (forks, 12), (fork, 4), (chair, 15), (table, 1), (tables, 11)
Assuming the similarity algorithm works, how can I obtain a reduced RDD like:
RDD -> (fork, 16), (table, 12), (chair, 15)
I tried something like:
counts.foldLeft() { (x, y) => if (x._1.euclideanDistance(y._1) > 0.9) (x, x._2 + y._2) }
but what I'm trying doesn't work.
If you have a distance(a, b) function, it is quite inefficient and complicated to solve the problem that way. You would need to use RDD.cartesian to generate all possible (word1, word2) pairs, then filter out those with too great a distance. Now you have the similar word pairs, say (fox, fix), (fix, six), and their reversals. You want to sum up the counts for fox, fix, and six. For that you need to find the connected components in the graph defined by the similar word pairs. Once you have a component ID for each word, you sum the counts by component ID.
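A rough sketch of that pipeline, assuming a distance(a, b) helper (a plain Levenshtein here, purely for illustration) and using GraphX's connectedComponents for the component step:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Illustrative distance function (plain Levenshtein); any string distance would do.
def distance(a: String, b: String): Int = {
  val d = Array.tabulate(a.length + 1, b.length + 1)((i, j) => if (i == 0) j else if (j == 0) i else 0)
  for (i <- 1 to a.length; j <- 1 to b.length)
    d(i)(j) = Seq(d(i - 1)(j) + 1, d(i)(j - 1) + 1,
                  d(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)).min
  d(a.length)(b.length)
}

def sumBySimilarity(counts: RDD[(String, Int)], maxDist: Int): RDD[(String, Int)] = {
  // Give every distinct word a numeric ID so it can serve as a graph vertex.
  val wordIds: RDD[(String, Long)] = counts.keys.zipWithUniqueId()

  // All pairs via cartesian, keeping only the similar ones (the expensive O(n^2) part).
  val similar = wordIds.cartesian(wordIds)
    .filter { case ((w1, _), (w2, _)) => w1 < w2 && distance(w1, w2) <= maxDist }

  // Connected components over the similarity graph: (wordId, componentId) pairs.
  val edges = similar.map { case ((_, id1), (_, id2)) => Edge(id1, id2, 1) }
  val components = Graph.fromEdges(edges, 0).connectedComponents().vertices

  // Sum the counts per component, keeping one word of each component as its key.
  wordIds.join(counts)                             // (word, (id, count))
    .map { case (word, (id, count)) => (id, (word, count)) }
    .leftOuterJoin(components)                     // words with no similar partner keep their own ID
    .map { case (id, ((word, count), comp)) => (comp.getOrElse(id), (word, count)) }
    .reduceByKey { case ((w1, c1), (w2, c2)) => (w1, c1 + c2) }
    .values
}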
I think the simpler solution is rather to write a function that can turn a word into a "canonical" form, so that it turns forks, forking and forked into fork. Then you can apply this function and reduceByKey again.
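For example, with a deliberately crude canonicalize helper (purely illustrative; a real stemmer would do a better job), the whole thing is just one more map and reduceByKey:

// Hypothetical canonical-form function: strips a few common suffixes.
def canonicalize(word: String): String =
  word.stripSuffix("ing").stripSuffix("ed").stripSuffix("s")

// Re-key every count by its canonical form and reduce again.
val canonicalCounts = counts
  .map { case (word, count) => (canonicalize(word), count) }
  .reduceByKey(_ + _)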
It is probably fastest to do this step without Spark. Once you have calculated counts with Spark, you have a tiny data set: one integer for each distinct word. It's easiest to collect it and then map and groupBy the counts locally.
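A sketch of that local step, reusing the hypothetical canonicalize helper from above:

// The reduced counts are tiny, so pulling them to the driver is cheap.
val local: Array[(String, Int)] = counts.collect()

// Group by canonical form and sum locally, with no further Spark jobs.
val merged: Map[String, Int] = local
  .groupBy { case (word, _) => canonicalize(word) }
  .map { case (canon, pairs) => (canon, pairs.map(_._2).sum) }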