scala - How can I avoid a for loop for KNN search?


My goal is to get the k nearest neighbours of each data point. I would like to avoid using a for loop with lookup and instead process every point in rdd_distance simultaneously, but I can't figure out how to do this.

    // parseddata is an RDD[Object]; each object has an id and a vector attribute.
    // sqdist1 returns a Double.
    var rdd_distance = parseddata.cartesian(parseddata)
      .flatMap { case (x, y) =>
        if (x.get_id != y.get_id)
          Some((x.get_id, (y.get_id, sqdist1(x.get_vector, y.get_vector))))
        else None
      }

    // One lookup job per id: this is the loop I want to avoid.
    for (ind1 <- 1 to size) {
      val ind2 = ind1.toString
      val tab1 = rdd_distance.lookup(ind2)
      val rdd_knn0 = sc.parallelize(tab1)
      val tab_knn = rdd_knn0.takeOrdered(k)(Ordering[Double].on(x => x._2))
    }

Is it possible to do this without using a loop with lookup?

This code answers the question, but it is inefficient when parseddata is large, because every distance for a key is grouped and sorted before all but k are discarded:

    rdd_distance.groupByKey().map {
      case (x, iterable) =>
        // sort all neighbours of x by distance and keep the k closest
        x -> iterable.toSeq.sortBy(_._2).take(k)
    }

So a more appropriate solution is:

    import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

    // negate the distance because smaller is better
    rdd_distance.topByKey(k)(Ordering.by(-_._2))

Note that this code is included in Spark 1.4.0. If you use an earlier version, use the code from https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala instead.

The idea of topByKey is to use a BoundedPriorityQueue with aggregateByKey, so that only the top k items are retained for each key.
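The bounded-queue idea can be sketched without Spark. The following is a minimal illustration in plain Scala, not Spark's actual implementation: it uses the standard library's mutable.PriorityQueue as a stand-in for BoundedPriorityQueue, and the names TopKSketch, insert, merge and kNearest are hypothetical, chosen only for this example.

```scala
import scala.collection.mutable

object TopKSketch {
  // (neighbourId, squaredDistance), mirroring the values of rdd_distance.
  type Neighbour = (String, Double)
  type Queue = mutable.PriorityQueue[Neighbour]

  // Max-heap on distance: the head is the farthest neighbour kept so far.
  private val byDistance: Ordering[Neighbour] =
    Ordering.by[Neighbour, Double](_._2)

  // Add one candidate, then evict the farthest if more than k are held.
  def insert(k: Int)(queue: Queue, n: Neighbour): Queue = {
    queue.enqueue(n)
    if (queue.size > k) queue.dequeue()
    queue
  }

  // Merge two partial queues, never holding more than k items in the result.
  def merge(k: Int)(a: Queue, b: Queue): Queue = {
    b.foreach(n => insert(k)(a, n))
    a
  }

  // Local stand-in for the per-key aggregation: one bounded queue per id.
  def kNearest(pairs: Seq[(String, Neighbour)], k: Int): Map[String, List[Neighbour]] = {
    val queues = mutable.Map.empty[String, Queue]
    pairs.foreach { case (id, n) =>
      val q = queues.getOrElseUpdate(id, mutable.PriorityQueue.empty[Neighbour](byDistance))
      insert(k)(q, n)
    }
    // Return each id's neighbours sorted from nearest to farthest.
    queues.map { case (id, q) => id -> q.toList.sortBy(_._2) }.toMap
  }
}
```

Here insert and merge have the shape of the two functions that aggregateByKey expects (fold a value into an accumulator, and combine two accumulators), so each key never holds more than k candidates at a time, whereas the groupByKey version materialises and sorts every distance for a key.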

