Implementing Autosuggest in Elasticsearch | REA Group Tech Blog

Original article: rea.tech

In this article I look at several different ways to use Elasticsearch to implement autosuggest.

Most of us make use of some sort of autosuggest functionality several times per day, to the point of barely even noticing it. Autosuggest functionality can help a user fill in form input fields by prompting them with likely completions of, and even alternatives to, the text they are typing as they type it. In what follows, I will use the term “autocomplete” in the strict sense of completing what the user has typed so far, and “autosuggest” in the wider sense of suggesting not just completions, but also alternatives that the user may have intended.

There are various important points to consider when implementing autosuggest:

  • Do you wish to allow for minor misspellings of words (fuzzy matching)?
  • Do you wish to allow for minor re-ordering of words? For example: “Brunswick East” versus “East Brunswick”. I will refer to this as “sloppy phrase matching” or just “sloppy matching” for short.
  • How do you want the suggestions sorted?
  • Do you want to highlight the matching words?
  • Do you need to be able to start matching from a word in the middle of the suggestion, or only match from the beginning? For example, would you like “Repetitive Strain Injury” to be a suggestion for “strai”?

Elasticsearch is a document store designed to support fast searches. It is built on top of Apache Lucene, so it supports a nice range of natural language text analysis options as well as geo-spatial features. All this makes it possible to use Elasticsearch as part of an autosuggest system.

In what follows, I describe several different ways to use Elasticsearch to support autosuggest, starting with simple “completion suggesters” through to more sophisticated queries.  While hopefully not essential, it will help if the reader already has some familiarity with Elasticsearch, in particular the concepts of “index”, “index mapping”, and some of the basics of the query DSL.  For an introduction to these, it is hard to beat the “Getting Started” chapter of the online “Elasticsearch Definitive Guide”.

Completion Suggesters

Like all software that uses Elasticsearch, an autosuggest system needs to interact with Elasticsearch via the Elasticsearch API.  The Elasticsearch API supports quite complex general searches, but also has specific support for autosuggest in the form of its “Completion Suggesters”.

I will show in later sections how autosuggest can be implemented using standard queries, but you should always consider using a Completion Suggester first as they are faster than standard queries (at least in theory). Also, they automatically filter out duplicate results.

The autosuggest items must be indexed using a document field whose “mapping” has been declared to be a “completion” type. I won’t go into the details of what a mapping is except to say it is a bit like a table schema in a relational database. So typically you would declare the type of a field to be a “string” or “integer” for example, but in this instance we need to use the special “completion” type.

For example, here I am declaring that my “customer” documents will have a special “nameSuggest” field whose type is the special “completion” type (mappings are typically defined using JSON):

{"mappings": {
  "customer": {
    "properties": {
      "nameSuggest": { "type": "completion" }
}}}}
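
Before any suggestions can be returned, documents containing the completion field need to be indexed. Here is a minimal sketch of indexing one such document, assuming an index called “myindex” (the same name used in the query below), a “customer” document type, and a document id of 1, all of which are illustrative. The “input”, “output”, and “weight” options are features of the completion field that are discussed further below:

PUT /myindex/customer/1
{
  "name": "Fred Smith",
  "nameSuggest": {
    "input": ["Fred Smith", "Smith, Fred"],
    "output": "Fred Smith",
    "weight": 10
  }
}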

What follows is a request for suggestions that are completions of the text “Smi”.  Like most requests on Elasticsearch, the request URL starts with the index name (in this case “myindex”) and ends with the desired operation (in this case “_suggest”).  The body of the request is a JSON description of the request.  For suggestion requests, you give the request a name (in this case “my-test-suggest”), the query text (“Smi”), and the name of the document field to be searched (in this case “nameSuggest”).

GET /myindex/_suggest
{"my-test-suggest": {
  "text": "Smi",
  "completion": {"field": "nameSuggest"}
}}

For those already familiar with Elasticsearch, it is worth noting that the API end point is “/_suggest” and not the normal “/_search” end point.  The body of the suggestion query is also very specific to the suggest API and not part of the normal search query DSL.

Here are some additional features of completion suggesters:

  • When indexing the suggestions, you may list multiple “inputs” that map to the same “output” suggestion (e.g. “East Melbourne” and “Melbourne East” to both map to “Melbourne East” as the suggestion).
  • When indexing the suggestions, you may specify weights for ranking.
  • Fuzzy matching is supported (i.e. minor spelling mistakes are tolerated); see the sketch after this list.
  • Context Suggestion filtering can be enabled to support filtering by simple “context” strings, or geo-spatial filtering.
  • You can optionally specify that spaces are to be ignored when matching. For example, “Box Hill” would match “Boxhill”.
  • Suggester queries can be submitted together with ordinary queries in the one request.
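
As an example of the fuzzy matching option, here is a sketch of a suggestion request that should still find suggestions beginning with “Smith” even though the query text “Smoth” contains a one-letter typo (the fuzziness value of 1 is just an illustrative choice):

GET /myindex/_suggest
{"my-fuzzy-suggest": {
  "text": "Smoth",
  "completion": {
    "field": "nameSuggest",
    "fuzzy": {"fuzziness": 1}
  }
}}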

Some completion suggester restrictions:

  • Matching always starts at the beginning of the text. So, for example, “Smi” will match “Smith, Fed” but not “Fed Smith”. However, you could list both “Smith, Fed” and “Fed Smith” as two different inputs for the one output.
  • No “sloppy” phrase matching is supported. i.e. words must appear in the right order.
  • The only sorting mechanism available is via integer weights that you may optionally provide when indexing the suggestions. This would be a problem if, for example, you are required to sort the suggestions alphabetically, or wish the sort order to depend on some sort of context.
  • Highlighting of the matching words is not supported.

Prefix Queries and Filters

If the inflexible sorting options are the only reason completion suggesters are not suitable for you, then you may want to consider using a prefix query or prefix filter. In theory, prefix queries and filters will be slower than completion suggesters, but the reality may depend on your data and query traffic.

What follows is an example of a prefix search. Like the completion suggestion request in the last section, the URL starts with the index name (in this case “/movie”). Unlike the completion suggestion request, a search request ends with “/_search”, and may include the document mapping type (in this case “/actor”).  The body of the request contains the details of the query in the form needed by a “request body search”. In this case, we are requesting that the query be a “prefix” query that matches documents whose “name” fields start with the letters “fre”.

GET /movie/actor/_search
{
  "query": {
    "prefix": {
      "name": {"value": "fre"}
    }}}

The most important thing to realize about both the prefix query and the prefix filter is that the query text is not “analyzed”. In other words, it is not split into words (tokenized), lowercased, stemmed, etc. For example, if the field was indexed using the “standard” string analyzer, then the document text “Fred Smith” would be indexed as two separate lowercased terms: “fred” and “smith”, and only prefixes of “fred” and “smith” will match in a prefix query. A prefix query term like “fred smi” will not match either of those terms. Nor would “Fre”, because the query text is not lowercased for you (the match is case sensitive). However, prefix queries like “fre” and “smi” would match.
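
A quick way to see how an analyzer breaks text into terms is the _analyze API. Something along these lines (the exact _analyze request syntax varies a little between Elasticsearch versions) shows what the standard analyzer does with “Fred Smith”:

GET /_analyze?analyzer=standard&text=Fred+Smith

The response lists just the two lowercased tokens “fred” and “smith”, which is why an un-analyzed prefix such as “fred smi” or “Fre” cannot match anything.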

If you wish to use prefix queries/filters for autosuggest on multi-word text, then there are a couple of options to work around the fact that the query strings are not analyzed:

  • Index the field using a custom analyzer based on the keyword tokenizer so that, for example, “Fred Smith” is indexed as a single term (“fred smith”).  Then all you need to do is lowercase the query string before submitting it to Elasticsearch and, for example “fred smi” will match.  Here is an example of such a mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keywordLowercase": { 
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"] 
        } } } }, 
  "mappings": { 
    "suburb": { 
      "properties": { 
        "name": { 
          "type": "string", "analyzer": "keywordLowercase"}
      }}}}
  • Index the field using a standard analyzer, but tokenize and lowercase the query yourself.  For example, if the query string is “Tim Smi”, then you will need to create a composite query consisting of “tim” and “smi” as two separate items (in a bool query for example).  Here is an example of such a query:
{"query": { 
  "bool": { 
    "must": [ 
      {"prefix": {"name": {"value": "tim"}}}, 
      {"prefix": {"name": {"value": "smi"}}}
    ] } } }

The main advantage a prefix query/filter has over a completion suggester is that you have all the usual flexibility over ranking that other queries have.  The fact that the prefix query/filter can be combined with other queries and filters in the usual manner (e.g. via bool queries) may also be a big advantage over a completion suggester, as shown in the sketch below.
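
For example, here is a sketch of a prefix query whose results are sorted alphabetically rather than by relevance score. It assumes the mapping also defines a not_analyzed sub-field called “name.raw” to sort on (that sub-field is an assumption, not something shown in the mappings above):

GET /movie/actor/_search
{
  "query": {
    "prefix": {"name": {"value": "fre"}}
  },
  "sort": [{"name.raw": "asc"}]
}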

Additional restrictions of prefix queries/filters:

  • Fuzzy matching is not supported.
  • Sloppy matching is not supported.
  • Duplicate results are not filtered out (this is true for all queries).
  • Matching always starts at the beginning of the text.
  • Highlighting of the matching terms is of limited use.

You can usually work around these restrictions by tokenizing the query string yourself and constructing a compound query involving a prefix query or filter on the last query term and some other sort of match query on the remainder of the query string.
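
For example, if the user has typed “Fred Smi”, a sketch of such a compound query (with the tokenizing and lowercasing done by your application before the request is built) might look like this: a normal match on the completed term and a prefix query on the partially typed last term.

{"query": {
  "bool": {
    "must": [
      {"match": {"name": "fred"}},
      {"prefix": {"name": {"value": "smi"}}}
    ]
  }}}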

Phrase prefix matchers

Buried deep in the documentation for “match” queries is the description of the very useful phrase prefix matcher. A phrase prefix matcher differs from a prefix query by supporting analysis of the query text.  This has many nice consequences.

  • The query can automatically be tokenized, lowercased, and so on.
  • Sloppy phrase matching is supported (words can be out of order and interspersed with other words up to a specified slop factor).
  • Matching can start from a word in the middle of the field.
  • Highlighting will work.

For example:

{"query": { "match_phrase_prefix": {"name": "Tim Arnold Smi"}}}

You will also have the usual control over sorting available to queries.

Curiously, fuzzy matching does not seem to be supported by phrase prefix matching.

Like all the “match” query types, phrase prefix match queries will not be cached by default. You may be able to work around this by embedding it in a query filter with caching explicitly enabled.

Given that the query string is analyzed, it should follow that phrase prefix matchers will be slower than the simple prefix queries/filters, and slower still than completion suggesters. This will be even more likely if you make use of match highlighting and sloppy matching. However, like all the performance claims in this article, the reality may depend on your particular circumstances.

Edge NGrams

The Edge NGram token filter takes the term to be indexed and indexes prefix strings up to a configurable length. For example, the text “smith” would be indexed as “s”, “sm”, “smi”, “smit”, and “smith”.

Supporting autosuggest with Edge NGrams moves a lot of the work that Elasticsearch has to do from query time to index time.  This should result in faster queries, but may result in slower indexing and in larger index storage.

It may seem crazy to index all the prefixes.  However, for many scenarios it is perfectly fine.  For example, with Elasticsearch running on my laptop, it took less than one second to create an Edge NGram index of all of the eight thousand distinct suburb and town names of Australia. The resulting index used less than a megabyte of storage.

The trick to using the edge NGrams is to NOT use the edge NGram token filter on the query.  Otherwise you will find a query like “smit” will be tokenized into “s”, “sm”, “smi”, “smit”, and your results will include all terms starting with “s”.

Instead, you should specify a more standard analyzer (such as the standard or the simple analyzer) for the query.

For example:

{ 
  "settings": { 
    "analysis": { 
      "filter": { 
        "myNGramFilter": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 40 
      }}, 
      "analyzer": { 
        "myNGramAnalyzer": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "myNGramFilter"]
      }}}}, 
  "mappings": { 
    "customer": { 
      "properties": { 
        "name": {
          "type": "string",
          "index_analyzer": "myNGramAnalyzer",
          "search_analyzer": "standard"
        }
        ...
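
If you want to verify what the custom analyzer actually indexes, the _analyze API can be pointed at it. This is a sketch only: it assumes the index was created with the settings above under the name “myindex”, and the exact _analyze syntax varies a little between Elasticsearch versions:

GET /myindex/_analyze?analyzer=myNGramAnalyzer&text=Smith

This should return the terms “s”, “sm”, “smi”, “smit”, and “smith”.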

Another issue to consider when using edge NGrams is what will happen when you use a multi-word query.  For example, if you have indexed “Fred Smith” using edge NGrams, but specify that the query should use the standard analyzer, then not only will “Fre” and “Smi” match, but so will “Fre Smith”, “Fred Smi”, and “Smi Fre”.  If this is not the behaviour that you want, then you might want to use a similar workaround to that suggested for prefix queries: index the field using both a standard analyzer and an edge NGram analyzer, split the query string just before the last term, and then use a compound query that matches everything before the last term against the standard-analyzed field and the last term against the edge NGram analyzed field.

In our example mapping, you could change the mapping for “name” to use the standard analyzer by default and an NGram analyzer within a list of “multi-fields”:

{
  "mappings": {
    "customer": {
      "properties": {
        "name": {
          "type": "string", "analyzer": "standard", 
          "fields": { 
            "prefix": {"type": "string", "index_analyzer": "myNGramAnalyzer", "search_analyzer": "standard"} 
          } 
          ...

And a query for “Tim George Smi” might look like this:

{"query": { 
  "bool": { 
    "must": [ 
      {"match_phrase": {"name": "Tim George"}}, 
      {"match": {"name.prefix": "Smi"}} 
    ] 
  }...

Summary

There are many ways to implement auto-suggest with Elasticsearch, each with different tradeoffs with respect to functionality, query speed, indexing speed, and resulting index storage size.

Completion suggesters are likely to provide the most performant solution, but they do suffer some restrictions, such as the lack of support for “sloppy phrase matching”, and limited sort order options. Ordinary Elasticsearch queries are much more flexible, but may result in slower response times and lower throughputs.

The following table summarises the four main options. The restrictions indicated in the table for the various options can often be worked around with varying degrees of difficulty.

                      Completion    Prefix Queries/   Phrase Prefix   Edge
                      Suggesters    Filters           Matchers        NGrams
Query is analyzed     Yes           No                Yes             Yes
Fuzzy match           Yes           No                No              Yes
Sloppy match          No            No                Yes             No
Duplicate filtering   Yes           No                No              No
Match from middle     No            No                Yes             Yes
Flexible sorting      No*           Yes               Yes             Yes
Highlighting          No            No                Yes             Yes
Cached                ?             Yes               Restricted**    Restricted**

* Completion suggester results can be sorted using predefined weights

** Queries can be cached if wrapped in query filters, but this limits scoring and highlighting.