
Solr Similarity Queries: MoreLikeThis

Preface

Basic Concepts

What Is a Similarity Query?

When you run a query on Google, every search result includes a "Similar pages" link. Clicking it sends another search request that finds documents similar to the original result.

Solr's similarity query (MoreLikeThis) implements the same functionality.

Why Do We Need Similarity Queries?

The documentation explains it this way:

Generate "more like this" similarity queries.
Based on this mail:

Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
Term frequencies can be computed by re-tokenizing the text, which, for a single document,
is usually fast enough.  But looking up the docFreq() of every term in the document is
probably too slow.

You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
or at all.  Since you're trying to maximize a tf*idf score, you're probably most interested
in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
reduce the number of terms under consideration.  Another heuristic is that terms with a
high idf (i.e., a low df) tend to be longer.  So you could threshold the terms by the
number of characters, not selecting anything less than, e.g., six or seven characters.
With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms
that do a pretty good job of characterizing a document.

It all depends on what you're trying to do.  If you're trying to eek out that last percent
of precision and recall regardless of computational difficulty so that you can win a TREC
competition, then the techniques I mention above are useless.  But if you're trying to
provide a "more like this" button on a search results page that does a decent job and has
good performance, such techniques might be useful.

An efficient, effective "more-like-this" query generator would be a great contribution, if
anyone's interested.  I'd imagine that it would take a Reader or a String (the document's
text), analyzer Analyzer, and return a set of representative terms using heuristics like those
above.  The frequency and length thresholds could be parameters, etc.

Doug

A few key points (the core ideas) from the text above:

  • Lucene lets you access a term's document frequency (the number of documents that contain it) via IndexReader.docFreq()
  • Term frequencies can be computed quickly by re-tokenizing a single document, but calling docFreq() for every term is too slow, so the candidate set of terms is pruned using term-frequency and word-length heuristics
  • Tuning the term-frequency and document-frequency thresholds is what makes "more like this" work
  • Input can be a field in a Lucene index, or a standalone Reader or String
  • It needs an Analyzer and a Similarity (for scoring)
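Doug's pruning heuristics can be sketched in a few lines of self-contained Java. The thresholds MIN_TF and MIN_LEN below are illustrative values picked for the example, not Lucene's defaults:

```java
import java.util.*;
import java.util.stream.*;

public class TermPruning {
    // Illustrative thresholds, not Lucene's defaults
    static final int MIN_TF = 2;
    static final int MIN_LEN = 6;

    /** Keep only terms that are frequent enough (high tf) and long enough (length as a cheap idf proxy). */
    static Map<String, Integer> prune(Map<String, Integer> termFreqs) {
        return termFreqs.entrySet().stream()
            .filter(e -> e.getValue() >= MIN_TF)          // tf threshold
            .filter(e -> e.getKey().length() >= MIN_LEN)  // longer terms tend to have higher idf
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = Map.of(
            "lucene", 5, "the", 40, "docfreq", 3, "a", 30, "query", 1);
        // only "lucene" and "docfreq" pass both thresholds
        System.out.println(prune(tf));
    }
}
```

With such cheap filters, docFreq() only needs to be called for the small surviving set of terms.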

How MoreLikeThis Is Implemented

The MoreLikeThis built into Lucene 5 computes similarity by scoring: terms are placed into a priority queue ordered by their final score, so the highest-scoring terms naturally sit at the top of the queue.

How to Use MoreLikeThis

The official documentation explains:

There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a "similar documents" link).

The second is to use it as a search component. This is less desirable since it performs the MoreLikeThis analysis on every document returned. This may slow search results.

The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.

  1. MoreLikeThisHandler: a dedicated request handler. This is the most common approach, and it supports complex operations such as filtering.
  2. MoreLikeThisComponent inside a SearchHandler: a search component, suitable for simple applications. However, it runs the MoreLikeThis analysis on every returned document, which can slow down search results.
  3. MoreLikeThisHandler with externally supplied text: a request handler similar to the first approach, except that it finds similar documents in the index based on the text of an input document.
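For the first (and most common) approach, the handler is registered in solrconfig.xml. A minimal sketch, assuming the schema has title and content fields (the field names and defaults here are illustrative):

```xml
<!-- solrconfig.xml: register MoreLikeThis as a dedicated request handler -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">title,content</str> <!-- fields to mine terms from -->
    <int name="mlt.mintf">2</int>          <!-- minimum term frequency -->
    <int name="mlt.mindf">5</int>          <!-- minimum document frequency -->
  </lst>
</requestHandler>
```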

How Does MoreLikeThis Work?

The official documentation explains:

MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields ( see the mlt.fl parameter, below). For best results, the fields should have stored term vectors in schema.xml. For example:

<field name="cat" ... termVectors="true" />

If term vectors are not stored, MoreLikeThis will generate terms from stored fields. A uniqueKey must also be stored in order for MoreLikeThis to work properly.

The next phase filters terms from the original document using thresholds defined with the MoreLikeThis parameters. Finally, a query is run with these terms, and any other query parameters that have been defined (see the mlt.qf parameter, below) and a new document set is returned.

MoreLikeThis constructs a Lucene Query from the terms of an indexed document. The terms are pulled from a defined list of fields, which is specified with the mlt.fl parameter. For best results those fields should store term vector information, i.e. their termVectors attribute should be set to true.

If a field's term vectors are not stored, MoreLikeThis automatically generates the terms from its stored value (a field with stored="true"). For MoreLikeThis to work properly, the uniqueKey must also be stored.

Fine, we still do not know exactly how MoreLikeThis is implemented. But we can pull a few key points out of the description above:

  1. MoreLikeThis builds a Lucene Query from terms
  2. The terms are taken from a given list of fields
  3. Those fields are specified with the mlt.fl parameter
  4. Those fields should store term vector information; if they don't, terms are generated from the stored fields
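Putting these points together, a request to a MoreLikeThis handler might look like the following; the host, core name, and field list are illustrative:

```
http://localhost:8983/solr/mycore/mlt?q=id:12345&mlt.fl=title,content&mlt.mintf=2&mlt.mindf=5&mlt.maxqt=25
```

mlt.mintf, mlt.mindf, and mlt.maxqt map onto the minTermFreq, minDocFreq, and maxQueryTerms parameters of the underlying class.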

Now let's see how the source code implements this.

Source Code

Key MoreLikeThis Parameters

Parameter                            Explanation
private Analyzer analyzer            the analyzer used to tokenize text
private int minTermFreq              minimum term frequency (default 2)
private int minDocFreq               minimum document frequency (default 5)
private int maxDocFreq               maximum document frequency (default 2147483647, i.e. Integer.MAX_VALUE)
private int maxQueryTerms            maximum number of query terms (default 25)
private TFIDFSimilarity similarity   used to compute relevance (idf)

Key MoreLikeThis Methods

  • public Query like(int docNum): query by Lucene document id
  • public Query like(String fieldName, Reader... readers): query from String/Reader input

Source Code Analysis

To interoperate well with Lucene, MoreLikeThis extends it: it provides a method that returns a Query object, Lucene's own query object, and as soon as Lucene searches with that object it retrieves the similar results; MoreLikeThis and Lucene therefore combine seamlessly, and Solr provides a good example of this. We will use the public Query like(int docNum) method to explain how the similarity query is implemented:

Entry point

    public Query like(int docNum) throws IOException {
        if (this.fieldNames == null) {
            Collection<String> fields = MultiFields.getIndexedFields(this.ir);
            this.fieldNames = (String[])fields.toArray(new String[fields.size()]);
        }

        return this.createQuery(this.retrieveTerms(docNum));
    }

The docNum parameter is the Lucene document id of the search result you start from: the document whose similar results you want to find. fieldNames can be understood as the set of fields we selected; the values this document holds in those fields are extracted and used to measure similarity. As the code shows, these fields are optional: when none are given, all indexed fields are used.

Flow

The code above works as follows:

  1. private PriorityQueue retrieveTerms(int docNum): extracts the values that document docNum holds in the specified fields fieldNames.
  2. private void addTermFrequencies(Map<String, Int> termFreqMap, Terms vector): adds each term and its frequency to termFreqMap; called from (1).
  3. private PriorityQueue createQueue(Map<String, Int> words): takes the entries out of the map, runs the similarity computations, and builds the PriorityQueue; also called from (1).
  4. private Query createQuery(PriorityQueue q): builds the final Query object from the PriorityQueue (priority queue).

Analysis of the Individual Functions

termVector is the term vector: a mathematical model built from how often each term occurs within a document and how many documents contain each term. The similarity of two term vectors is judged by computing the angle between them.

  /**
   * Find words for a more-like-this query former.
   *
   * @param docNum the id of the lucene document from which to find terms
   */
  private PriorityQueue<ScoreTerm> retrieveTerms(int docNum) throws IOException {
    Map<String, Int> termFreqMap = new HashMap<>();
    for (String fieldName : fieldNames) {
      final Fields vectors = ir.getTermVectors(docNum);
      final Terms vector;
      if (vectors != null) {
        vector = vectors.terms(fieldName);
      } else {
        vector = null;
      }

      // field does not store term vector info
      // if the field does not store term vectors we must recompute: tokenize the value
      // and count the term frequencies ourselves (this requires an Analyzer to be set)
      if (vector == null) {
        Document d = ir.document(docNum);
        IndexableField fields[] = d.getFields(fieldName);
        for (IndexableField field : fields) {
          final String stringValue = field.stringValue();
          if (stringValue != null) {
            addTermFrequencies(new StringReader(stringValue), termFreqMap, fieldName);
          }
        }
      } else { // term vectors were stored, so just add them directly
        addTermFrequencies(termFreqMap, vector);
      }
    }

    return createQueue(termFreqMap);
  }

Terms in the frequency map are not tied to a field: whether a term occurs in the title or in the body, occurrences of the same term text are simply accumulated. That is exactly what addTermFrequencies does!

The accumulated counts end up in termFreqMap.

    private void addTermFrequencies(Reader r, Map<String, MoreLikeThis.Int> termFreqMap, String fieldName) throws IOException {
        if (this.analyzer == null) {
            throw new UnsupportedOperationException("To use MoreLikeThis without term vectors, you must provide an Analyzer");
        } else {
            TokenStream ts = this.analyzer.tokenStream(fieldName, r);

            try {
                int tokenCount = 0;
                CharTermAttribute termAtt = (CharTermAttribute)ts.addAttribute(CharTermAttribute.class);
                ts.reset();

                while(true) {
                    if (ts.incrementToken()) {
                        String word = termAtt.toString();
                        ++tokenCount;
                        if (tokenCount <= this.maxNumTokensParsed) {
                            if (!this.isNoiseWord(word)) {
                                MoreLikeThis.Int cnt = (MoreLikeThis.Int)termFreqMap.get(word);
                                if (cnt == null) {
                                    termFreqMap.put(word, new MoreLikeThis.Int());
                                } else {
                                    ++cnt.x;
                                }
                            }
                            continue;
                        }
                    }

                    ts.end();
                    return;
                }
            } finally {
                IOUtils.closeWhileHandlingException(new Closeable[]{ts});
            }
        }
    }

The step above also performs noise filtering, which follows a few rules:

  1. Is the term one of the configured stop words?
  2. Is the term too long or too short? This range is controlled by minWordLen and maxWordLen.

  /**
   * determines if the passed term is likely to be of interest in "more like" comparisons
   *
   * @param term The word being considered
   * @return true if should be ignored, false if should be used in further analysis
   */
  private boolean isNoiseWord(String term) {
    int len = term.length();
    if (minWordLen > 0 && len < minWordLen) {
      return true;
    }
    if (maxWordLen > 0 && len > maxWordLen) {
      return true;
    }
    return stopWords != null && stopWords.contains(term);
  }

The queue here is a priority queue. The previous step produced every <term, frequency> pair; even after noise filtering there are still too many terms, so we need to pick out the N most important ones.

Here every term is scored and ranked, essentially using tf and idf.

  /**
   * Create a PriorityQueue from a word->tf map.
   *
   * @param words a map of words keyed on the word(String) with Int objects as the values.
   */
  private PriorityQueue<ScoreTerm> createQueue(Map<String, Int> words) throws IOException {
    // have collected all words in doc and their freqs
    // total number of documents in the current index
    int numDocs = ir.numDocs();
    final int limit = Math.min(maxQueryTerms, words.size());
    FreqQ queue = new FreqQ(limit); // holds terms ordered by score

    for (String word : words.keySet()) { // for every term
      int tf = words.get(word).x; // the term's frequency
      if (minTermFreq > 0 && tf < minTermFreq) {
        continue; // like the noise filtering: skip terms whose tf is too small
      }

      // walk all the fields and find the one with the highest document frequency
      String topField = fieldNames[0];
      int docFreq = 0;
      for (String fieldName : fieldNames) {
        int freq = ir.docFreq(new Term(fieldName, word));
        topField = (freq > docFreq) ? fieldName : topField;
        docFreq = (freq > docFreq) ? freq : docFreq;
      }

      if (minDocFreq > 0 && docFreq < minDocFreq) {
        continue; // skip terms whose df is too small
      }

      if (docFreq > maxDocFreq) {
        continue; // skip terms whose df is too large
      }

      if (docFreq == 0) {
        continue; // index update problem? terms with df == 0 are skipped as well
      }

      float idf = similarity.idf(docFreq, numDocs);
      float score = tf * idf;

      // put the result into the priority queue
      if (queue.size() < limit) {
        // there is still space in the queue
        queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));
      } else {
        ScoreTerm term = queue.top();
        if (term.score < score) { // update the smallest in the queue in place and update the queue.
          term.update(word, topField, score, idf, docFreq, tf);
          queue.updateTop();
        }
      }
    }
    return queue;
  }

At this point we have the ranked term scores; the higher a term's score, the better it represents the main content of the document. (With this in mind, the idea from the start of the article makes more sense: when looking for articles similar to a given one, the more important terms they share, the better.)
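The top-N selection in createQueue can be mimicked in self-contained Java. This is only a sketch: the document frequencies below are made up, and the idf formula (1 + ln(numDocs / (docFreq + 1))) follows Lucene's classic TF-IDF similarity:

```java
import java.util.*;

public class TopTerms {
    record ScoredTerm(String word, float score) {}

    /** Keep the maxQueryTerms highest tf*idf terms, in the spirit of MoreLikeThis.createQueue. */
    static List<ScoredTerm> topTerms(Map<String, Integer> tf,
                                     Map<String, Integer> df,
                                     int numDocs, int maxQueryTerms) {
        // min-heap: the lowest-scoring term sits on top, ready to be evicted
        PriorityQueue<ScoredTerm> queue =
            new PriorityQueue<>(Comparator.comparingDouble(ScoredTerm::score));
        for (var e : tf.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            if (docFreq == 0) continue;                  // skip df == 0 terms
            float idf = (float) (1 + Math.log((double) numDocs / (docFreq + 1)));
            float score = e.getValue() * idf;            // tf * idf
            if (queue.size() < maxQueryTerms) {
                queue.add(new ScoredTerm(e.getKey(), score));
            } else if (queue.peek().score() < score) {   // better than the current smallest
                queue.poll();
                queue.add(new ScoredTerm(e.getKey(), score));
            }
        }
        List<ScoredTerm> result = new ArrayList<>(queue);
        result.sort((a, b) -> Float.compare(b.score(), a.score())); // best first
        return result;
    }
}
```

A very common word ("the" in a toy corpus) gets a high tf but a tiny idf, so rarer, more characteristic terms win the queue slots.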

 /**
   * Create the More like query from a PriorityQueue
   */
  private Query createQuery(PriorityQueue<ScoreTerm> q) {
    BooleanQuery query = new BooleanQuery();
    ScoreTerm scoreTerm;
    float bestScore = -1;

    while ((scoreTerm = q.pop()) != null) {
      TermQuery tq = new TermQuery(new Term(scoreTerm.topField, scoreTerm.word));
      // each TermQuery can optionally be boosted here; boost defaults to false
      if (boost) {
        if (bestScore == -1) {
          bestScore = (scoreTerm.score);
        }
        float myScore = (scoreTerm.score);
        tq.setBoost(boostFactor * myScore / bestScore);
      }
      // build the BooleanQuery, joining the clauses with SHOULD
      try {
        query.add(tq, BooleanClause.Occur.SHOULD);
      }
      catch (BooleanQuery.TooManyClauses ignore) {
        break;
      }
    }
    return query;
  }

Thus, from one document and the specified fields, we have obtained a query. Carrying the essence of the document, this query is then used to find documents similar to it.
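As a closing illustration, the final BooleanQuery is just a flat SHOULD-combination of the selected terms, so for a toy document it renders roughly as below; the field name and terms are invented for the example:

```java
import java.util.*;

public class MltQueryShape {
    /** Render the SHOULD-joined query that createQuery would build, as a string. */
    static String render(String field, List<String> topTerms) {
        StringJoiner q = new StringJoiner(" "); // SHOULD clauses are simply space-separated
        for (String t : topTerms) q.add(field + ":" + t);
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("content", List.of("solr", "lucene", "mlt")));
        // content:solr content:lucene content:mlt
    }
}
```

Any document that matches even one of these clauses is a candidate, and the more high-value terms it shares, the higher it scores.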
