Using Apache Spark, reduce or fold an RDD depending on a condition
I'm working with Apache Spark and Scala. I have an RDD of (String, Int):

val counts = words.map(word => (word, 1)).reduceByKey((a, b) => a + b)

Now that I have the RDD reduced by key, I'd like to add a feature to also reduce words that are similar. I thought of using the Levenshtein distance, the Euclidean distance, or the cosine distance.
So, how can I apply one of those functions to reduce my RDD?
Example:

rdd -> (forks,12), (fork,4), (chair,15), (table,1), (tables,11)

Admitting that the similarity algorithm works, how can I obtain a reduced RDD like:

rdd -> (fork,16), (table,12), (chair,15)

I tried something like:

counts.foldLeft(){ (x, y) => if (x._1.euclideanDistance(y._1) > 0.9) (x, x._2 + y._2) }

But what I am trying does not work.
If you only have a distance(a, b) function, it is inefficient and complicated to solve this problem. You would need to use rdd.cartesian to generate all possible (word1, word2) pairs, then filter out those with a great distance. Now you have the similar word pairs, let's say (fox, fix), (fix, six), and their reversals. You want to sum the counts for fox, fix, and six together. For that you need to find the connected components in the graph defined by the similar word pairs. Once you have a component ID for each word, you can sum the counts by component ID. A sketch of this approach is shown below.
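Here is a rough sketch of that approach, assuming GraphX is on the classpath and the vocabulary is small enough that an all-pairs cartesian is acceptable. The levenshtein, mergeSimilar, and maxDistance names are mine, not anything from your code:

import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

// Plain Levenshtein edit distance; swap in whatever similarity measure you prefer.
def levenshtein(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1), dp(i - 1)(j - 1) + cost)
  }
  dp(a.length)(b.length)
}

def mergeSimilar(counts: RDD[(String, Int)], maxDistance: Int = 1): RDD[(String, Int)] = {
  // Give each word a Long id, as required by GraphX vertices.
  val withIds: RDD[(String, VertexId)] = counts.keys.zipWithUniqueId()

  // All candidate pairs; O(n^2), only viable for a small vocabulary.
  val similarPairs = withIds.cartesian(withIds)
    .filter { case ((w1, id1), (w2, id2)) => id1 < id2 && levenshtein(w1, w2) <= maxDistance }
    .map { case ((_, id1), (_, id2)) => (id1, id2) }

  // Connected components group transitively similar words.
  val components: RDD[(VertexId, VertexId)] =
    Graph.fromEdgeTuples(similarPairs, defaultValue = 0).connectedComponents().vertices

  // Words with no similar neighbour keep their own id as component id.
  val wordToComponent = withIds.map(_.swap).leftOuterJoin(components)
    .map { case (id, (word, comp)) => (word, comp.getOrElse(id)) }

  // Sum counts per component and keep one representative word (here: the shortest).
  counts.join(wordToComponent)
    .map { case (word, (count, comp)) => (comp, (word, count)) }
    .reduceByKey { case ((w1, c1), (w2, c2)) => (if (w1.length <= w2.length) w1 else w2, c1 + c2) }
    .values
}

On your example this would merge (forks,12) and (fork,4) into (fork,16) and (table,1) and (tables,11) into (table,12), leaving (chair,15) untouched.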
I think the better solution is to write a function that turns a word into a "canonical" form. It would turn forks, forking, and forked all into fork. Then you can apply this function to the keys and reduceByKey again, as in the sketch below.
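A minimal sketch of that idea, using a crude plural-stripping canonicalize as a stand-in for a real stemmer (e.g. a Porter stemmer):

// Hypothetical canonicalization; a real stemmer would handle far more cases.
def canonicalize(word: String): String = {
  val w = word.toLowerCase
  if (w.endsWith("ies")) w.dropRight(3) + "y"
  else if (w.endsWith("s") && !w.endsWith("ss")) w.dropRight(1)
  else w
}

// Re-key the already-reduced counts by canonical form and sum again.
val merged = counts
  .map { case (word, n) => (canonicalize(word), n) }
  .reduceByKey(_ + _)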
It may be fastest to do this step without Spark. Once you have calculated the counts with Spark, you have a tiny data set: one integer for each distinct word. It's easiest to collect it and do the map and groupBy on the counts locally.
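For example, assuming the same canonicalize helper as above, the local version could look like:

// Collect the small (word, count) data set to the driver, then merge similar words locally.
val localCounts: Array[(String, Int)] = counts.collect()

val mergedLocally: Map[String, Int] = localCounts
  .groupBy { case (word, _) => canonicalize(word) }
  .map { case (canon, group) => canon -> group.map(_._2).sum }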