indexing - How does more_like_this in Elasticsearch work (across the whole index)?
So first it gets the list of term vectors, which contain the tokens, and creates a map<token, frequency in document>.
Then the createQueue method determines a score for each token: it drops stopwords and words that do not occur often enough, computes the IDF, and multiplies the IDF by the frequency of the token in the document, which gives the token's score. It keeps only the 25 best ones. After that, how does it work? How does it compare against the whole index? I read http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ but it didn't explain that, or I missed the point.
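The selection step described above could be sketched roughly as follows. This is only an illustration, not the Lucene code: the class TermSelectionSketch, the method selectTerms, its parameter names, and the idf formula (1 + ln(numDocs / (docFreq + 1)), in the style of Lucene's classic TF-IDF similarity) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TermSelectionSketch {

    /** A token paired with its tf * idf score (hypothetical helper type). */
    static final class ScoredTerm {
        final String word;
        final double score;
        ScoredTerm(String word, double score) { this.word = word; this.score = score; }
    }

    // Assumed idf formula for this sketch: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // termFreqs: token -> frequency in the source document.
    // docFreqs:  token -> number of documents in the index containing the token.
    static List<ScoredTerm> selectTerms(Map<String, Integer> termFreqs,
                                        Map<String, Integer> docFreqs,
                                        int numDocs,
                                        Set<String> stopWords,
                                        int minTermFreq,
                                        int minDocFreq,
                                        int maxTerms) {
        List<ScoredTerm> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            String word = e.getKey();
            int tf = e.getValue();
            int df = docFreqs.getOrDefault(word, 0);
            if (stopWords.contains(word)) continue;   // drop stopwords
            if (tf < minTermFreq) continue;           // too rare in the source document
            if (df < minDocFreq) continue;            // too rare in the index
            scored.add(new ScoredTerm(word, tf * idf(df, numDocs)));
        }
        // Highest score first, then keep at most maxTerms (25 in the question).
        scored.sort((a, b) -> Double.compare(b.score, a.score));
        return new ArrayList<>(scored.subList(0, Math.min(maxTerms, scored.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreqs = new HashMap<>();
        termFreqs.put("lucene", 3);
        termFreqs.put("the", 10);
        termFreqs.put("query", 2);
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("lucene", 9);
        docFreqs.put("the", 1000);
        docFreqs.put("query", 99);
        List<ScoredTerm> best = selectTerms(termFreqs, docFreqs, 1000,
                Collections.singleton("the"), 2, 5, 25);
        for (ScoredTerm t : best) {
            System.out.println(t.word + " " + t.score);
        }
    }
}
```

Note that the only index-wide statistics used at this stage are the document frequencies and the document count; no other document is read yet.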
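It creates a TermQuery out of each of those terms and chucks them into a simple BooleanQuery, boosting each term by its calculated TF-IDF score (boostFactor * myScore / bestScore, where boostFactor can be set by the user). That BooleanQuery is then run against the index like any other query, so the document is never compared with the other documents one by one.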
Here is the source (version 5.0):
    private Query createQuery(PriorityQueue<ScoreTerm> q) {
        BooleanQuery query = new BooleanQuery();
        ScoreTerm scoreTerm;
        float bestScore = -1;

        while ((scoreTerm = q.pop()) != null) {
            // One TermQuery per selected term
            TermQuery tq = new TermQuery(new Term(scoreTerm.topField, scoreTerm.word));

            if (boost) {
                // The first term popped sets bestScore, which normalizes every boost
                if (bestScore == -1) {
                    bestScore = (scoreTerm.score);
                }
                float myScore = (scoreTerm.score);
                tq.setBoost(boostFactor * myScore / bestScore);
            }

            try {
                // SHOULD: a document may match any subset of the terms
                query.add(tq, BooleanClause.Occur.SHOULD);
            } catch (BooleanQuery.TooManyClauses ignore) {
                break;
            }
        }
        return query;
    }
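To make the boost formula and the SHOULD semantics concrete, here is a small toy model. It is only a sketch under assumptions: BoostedShouldQuerySketch, boosts and search are made-up names, bestScore is taken as the highest term score, and a document's score is simply the sum of the boosts of the terms it contains. Real Lucene scoring also applies the similarity (term frequency, idf, norms) to every clause and looks the terms up in the inverted index rather than scanning documents.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Arrays;
import java.util.Map;
import java.util.Set;

class BoostedShouldQuerySketch {

    // boost = boostFactor * myScore / bestScore, as in the formula above.
    // Assumes all scores are positive; bestScore is the maximum score (sketch assumption).
    static Map<String, Double> boosts(Map<String, Double> termScores, double boostFactor) {
        double bestScore = 0.0;
        for (double s : termScores.values()) {
            bestScore = Math.max(bestScore, s);
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : termScores.entrySet()) {
            out.put(e.getKey(), boostFactor * e.getValue() / bestScore);
        }
        return out;
    }

    // Toy stand-in for running the BooleanQuery: every matching SHOULD clause
    // adds its boost; documents matching no clause are not returned at all.
    static Map<String, Double> search(Map<String, Set<String>> docs, Map<String, Double> boosts) {
        Map<String, Double> hits = new HashMap<>();
        for (Map.Entry<String, Set<String>> doc : docs.entrySet()) {
            double score = 0.0;
            for (Map.Entry<String, Double> clause : boosts.entrySet()) {
                if (doc.getValue().contains(clause.getKey())) {
                    score += clause.getValue();
                }
            }
            if (score > 0.0) {
                hits.put(doc.getKey(), score);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Double> termScores = new LinkedHashMap<>();
        termScores.put("lucene", 16.0);
        termScores.put("query", 4.0);

        Map<String, Set<String>> docs = new HashMap<>();
        docs.put("d1", new HashSet<>(Arrays.asList("lucene", "query")));
        docs.put("d2", new HashSet<>(Arrays.asList("query")));
        docs.put("d3", new HashSet<>(Arrays.asList("other")));

        System.out.println(search(docs, boosts(termScores, 2.0)));
    }
}
```

In this model a document that shares more of the high-scoring terms ranks higher, and a document that shares none of them never shows up, which is how the "similar" documents come out of an ordinary search.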