scala - How can I avoid a for loop for KNN search?
My goal is to get the k nearest neighbours of each data point. I would like to avoid using a for loop with lookup, and use something else that works on every point of rdd_distance simultaneously, but I can't figure out how to do this.
    // parsedData: RDD[Object], where each Object has an id and a vector attribute
    // sqdist1 returns a Double
    val rdd_distance = parsedData.cartesian(parsedData)
      .flatMap { case (x, y) =>
        // Keep every ordered pair of distinct points, keyed by the first point's id
        if (x.get_id != y.get_id)
          Some((x.get_id, (y.get_id, sqdist1(x.get_vector, y.get_vector))))
        else None
      }

    for (ind1 <- 1 to size) {
      val ind2 = ind1.toString
      val tab1 = rdd_distance.lookup(ind2)   // one Spark job per point
      val rdd_knn0 = sc.parallelize(tab1)
      val tab_knn = rdd_knn0.takeOrdered(k)(Ordering[Double].on(x => x._2))
    }
Is it possible to do this without using a loop with lookup?
This code solves your question (but it is inefficient when parsedData is large, because every distance for a key is collected and sorted before the k smallest are taken).
    rdd_distance.groupByKey().map {
      case (x, iterable) => x -> iterable.toSeq.sortBy(_._2).take(k)
    }
So this is a more appropriate solution.
    import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

    rdd_distance.topByKey(k)(Ordering.by(-_._2)) // negated because smaller is better
Note that this code is included in Spark 1.4.0. If you use an earlier version, use this code instead: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala
The idea of topByKey is to use a BoundedPriorityQueue with aggregateByKey, so that only the top k items are retained for each key.
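To illustrate that idea, here is a minimal sketch of a bounded queue driven by a pair of aggregate functions. It is not Spark's implementation: a plain scala.collection.mutable.PriorityQueue stands in for BoundedPriorityQueue, the names TopKSketch, insert and merge are made up for this example, and the data is a small local Seq rather than an RDD, so it runs without a cluster.

```scala
import scala.collection.mutable

object TopKSketch {
  type Neighbour = (String, Double) // (neighbour id, squared distance)
  type Queue = mutable.PriorityQueue[Neighbour]

  // Max-heap on distance: the head is the farthest neighbour currently kept.
  private val byDistance: Ordering[Neighbour] = Ordering.by[Neighbour, Double](_._2)

  def emptyQueue: Queue = mutable.PriorityQueue.empty[Neighbour](byDistance)

  // Plays the role of seqOp: add one neighbour, then evict the farthest
  // one if the queue has grown beyond k elements.
  def insert(k: Int, q: Queue, n: Neighbour): Queue = {
    q.enqueue(n)
    if (q.size > k) q.dequeue()
    q
  }

  // Plays the role of combOp: fold one partial queue into another.
  def merge(k: Int, a: Queue, b: Queue): Queue = {
    b.foreach(n => insert(k, a, n))
    a
  }

  // Hypothetical sample data with the same shape as rdd_distance:
  // (point id, (neighbour id, squared distance)).
  val distances: Seq[(String, Neighbour)] = Seq(
    ("1", ("2", 4.0)), ("1", ("3", 1.0)), ("1", ("4", 9.0)),
    ("2", ("1", 4.0)), ("2", ("3", 2.0)), ("2", ("4", 0.5))
  )

  val k = 2

  // Local stand-in for aggregateByKey: one bounded queue per key,
  // never holding more than k neighbours at a time.
  val knn: Map[String, List[Neighbour]] =
    distances
      .foldLeft(Map.empty[String, Queue]) { case (acc, (id, n)) =>
        acc.updated(id, insert(k, acc.getOrElse(id, emptyQueue), n))
      }
      .map { case (id, q) => id -> q.toList.sortBy(_._2) }

  def main(args: Array[String]): Unit =
    knn.toSeq.sortBy(_._1).foreach(println)
}
```

On an actual RDD the same two functions could presumably be passed to aggregateByKey as its seqOp and combOp, with emptyQueue as the zero value. The point of the design is that each key never holds more than k candidates, unlike the groupByKey version, which materializes every distance before sorting.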