indexing - How does More Like This in Elasticsearch work (against the whole index)?


As I understand it, More Like This first gets the list of term vectors, which contain the tokens, and builds a map<token, frequency in document>. The createQueue method then determines the scores: it drops stopwords and words that do not occur often enough, computes the IDF, and multiplies the IDF by the token's frequency in the document to get the token's score, keeping the 25 best ones. But how does it work after that? How is the document compared with the whole index? I read http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/, but it didn't explain this, or I missed the point.
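The term selection described above can be sketched roughly as follows. This is not the Lucene source, only a minimal illustration under stated assumptions: the class and method names (TermSelectionSketch, selectTerms) are hypothetical, the IDF formula is a classic 1 + ln(numDocs / (docFreq + 1)) variant, and the thresholds are passed in as parameters rather than read from any configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of tf * idf term selection; not Lucene's actual code.
final class TermSelectionSketch {

    /** A candidate term together with its tf * idf score. */
    static final class ScoredTerm {
        final String word;
        final double score;

        ScoredTerm(String word, double score) {
            this.word = word;
            this.score = score;
        }
    }

    /** Assumed IDF variant: rarer terms (low docFreq) get a higher value. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /**
     * Drops stopwords and terms that are too infrequent, scores the rest
     * by tf * idf, and returns the best maxQueryTerms terms, best first.
     */
    static List<ScoredTerm> selectTerms(Map<String, Integer> termFreqs,
                                        Map<String, Integer> docFreqs,
                                        int numDocs,
                                        Set<String> stopWords,
                                        int minTermFreq,
                                        int minDocFreq,
                                        int maxQueryTerms) {
        List<ScoredTerm> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : termFreqs.entrySet()) {
            String word = entry.getKey();
            int tf = entry.getValue();
            if (stopWords.contains(word)) continue;   // stopword
            if (tf < minTermFreq) continue;           // too rare in this document
            int df = docFreqs.getOrDefault(word, 0);
            if (df < minDocFreq) continue;            // too rare in the index
            scored.add(new ScoredTerm(word, tf * idf(df, numDocs)));
        }
        // Highest score first.
        scored.sort((a, b) -> Double.compare(b.score, a.score));
        int keep = Math.min(maxQueryTerms, scored.size());
        return new ArrayList<>(scored.subList(0, keep));
    }
}
```

With maxQueryTerms set to 25 this keeps the 25 best terms mentioned in the question; only those terms are carried forward into the query.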

It creates a TermQuery out of each of those terms and chucks them all into a simple BooleanQuery, boosting each term by its calculated tf-idf score (boostFactor * myScore / bestScore, where boostFactor can be set by the user).
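As a small worked example of that boost formula, the sketch below normalizes each score by the best one. It assumes the scores arrive best first, as the name bestScore suggests; the class and method names (BoostSketch, boosts) are made up for illustration.

```java
// Hypothetical helper illustrating boostFactor * myScore / bestScore.
final class BoostSketch {

    /**
     * Returns one boost per term. The input is assumed to be ordered
     * best score first, so the first term always gets boostFactor itself.
     */
    static float[] boosts(float[] scoresBestFirst, float boostFactor) {
        float[] result = new float[scoresBestFirst.length];
        if (scoresBestFirst.length == 0) {
            return result;
        }
        float bestScore = scoresBestFirst[0];
        for (int i = 0; i < scoresBestFirst.length; i++) {
            result[i] = boostFactor * scoresBestFirst[i] / bestScore;
        }
        return result;
    }
}
```

For scores 8, 4 and 2 with a boostFactor of 1, the boosts come out as 1.0, 0.5 and 0.25, so the boost only expresses how important a term is relative to the best term.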

Here is the source (version 5.0):

    private Query createQuery(PriorityQueue<ScoreTerm> q) {
      BooleanQuery query = new BooleanQuery();
      ScoreTerm scoreTerm;
      float bestScore = -1;

      while ((scoreTerm = q.pop()) != null) {
        TermQuery tq = new TermQuery(new Term(scoreTerm.topField, scoreTerm.word));

        if (boost) {
          if (bestScore == -1) {
            bestScore = (scoreTerm.score);
          }
          float myScore = (scoreTerm.score);
          tq.setBoost(boostFactor * myScore / bestScore);
        }

        try {
          query.add(tq, BooleanClause.Occur.SHOULD);
        }
        catch (BooleanQuery.TooManyClauses ignore) {
          break;
        }
      }
      return query;
    }
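So the comparison with the whole index is just an ordinary search: the resulting BooleanQuery is run like any other query, and since every clause is SHOULD, a document matches if it contains at least one of the selected terms. The toy sketch below illustrates the idea with an in-memory inverted index. It is a deliberate simplification, not what Lucene does internally: the names (ShouldQuerySketch, search) are hypothetical, and the score is simply the sum of the boosts of the matching terms, whereas Lucene scores each clause with its configured similarity.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of running a SHOULD-only boolean query over an inverted index.
final class ShouldQuerySketch {

    /**
     * postings:     term -> ids of the documents containing that term
     * boostedTerms: query term -> boost
     * Returns doc id -> score for every document matching at least one term.
     */
    static Map<Integer, Double> search(Map<String, List<Integer>> postings,
                                       Map<String, Double> boostedTerms) {
        Map<Integer, Double> scores = new HashMap<>();
        for (Map.Entry<String, Double> clause : boostedTerms.entrySet()) {
            // Only the postings of the query terms are visited,
            // never every document in the index.
            List<Integer> docs = postings.getOrDefault(clause.getKey(), List.of());
            for (int docId : docs) {
                scores.merge(docId, clause.getValue(), Double::sum);
            }
        }
        return scores;
    }
}
```

Documents that share more of the selected terms, and especially the highly boosted ones, end up with higher scores, which is what makes them "more like this".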
