I've recently started using ElasticSearch and I can't seem to make it search for a part of a word.

Example: I have three documents from my CouchDB indexed in ElasticSearch:

{
  "_id" : "1",
  "name" : "John Doeman",
  "function" : "Janitor"
}
{
  "_id" : "2",
  "name" : "Jane Doewoman",
  "function" : "Teacher"
}
{
  "_id" : "3",
  "name" : "Jimmy Jackal",
  "function" : "Student"
} 

So now, I want to search for all documents containing "Doe"

curl http://localhost:9200/my_idx/my_type/_search?q=Doe

That doesn't return any hits. But if I search for

curl http://localhost:9200/my_idx/my_type/_search?q=Doeman

It does return one document (John Doeman).

I've tried setting different analyzers and different filters as properties of my index. I've also tried using a full-blown query, for example:

{
  "query": {
    "term": {
      "name": "Doe"
    }
  }
}

But nothing seems to work.

How can I make ElasticSearch find both John Doeman and Jane Doewoman when I search for "Doe"?

UPDATE

I tried using the nGram tokenizer and filter, as Igor proposed, like this:

{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "bulk_size": "100",
    "bulk_timeout": "10ms",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": [
            "my_ngram_filter"
          ]
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      }
    }
  }
}

The problem I'm having now is that each and every query returns ALL documents. Any pointers? The ElasticSearch documentation on using nGram isn't great...
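
To see what that analyzer actually produces, you can run some text through the _analyze API (the exact syntax depends on your Elasticsearch version; this is the recent JSON form):

curl -s -H 'Content-Type: application/json' 'http://localhost:9200/my_idx/_analyze' -d '{"analyzer": "my_analyzer", "text": "Doe"}'

With min_gram and max_gram both set to 1, this returns the single-letter tokens d, o and e, so practically every document shares a token with every query, which would explain why everything matches.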

  • No wonder: you have min/max ngram set to 1, so 1 letter :)
    – Martin B., May 14, 2014
  • I'm actually surprised ES doesn't make this easier. It's ElasticSearch, not ElasticExactMatchUnlessIDoSomeCeremony.
    – Jun 2, 2021

11 Answers

I'm using nGram too, but with the standard tokenizer and nGram just as a filter. Here is my setup:

{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "analysis": {
      "index_analyzer": {
        "my_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "mynGram"
          ]
        }
      },
      "search_analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "mynGram"
          ]
        }
      },
      "filter": {
        "mynGram": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 50
        }
      }
    }
  }
}

This lets you find word parts up to 50 letters long. Adjust max_gram as needed. German words can get really long, so I set it to a high value.
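
To sanity-check the setup, you can feed a sample value through the _analyze API (a sketch, assuming the settings above were applied when creating my_idx; on current Elasticsearch versions both analyzers would be declared under analysis.analyzer):

curl -s -H 'Content-Type: application/json' 'http://localhost:9200/my_idx/_analyze' -d '{"analyzer": "my_index_analyzer", "text": "John Doeman"}'

The token list should include grams such as do, doe and doem, which is why a search for "Doe" can now hit "John Doeman".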


I think there's no need to change any mapping. Try query_string; it works perfectly. All of the following scenarios will work with the default standard analyzer:

We have data:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}
{"_id" : "2","name" : "Jane Doewoman","function" : "Teacher"}

Scenario 1:

{"query": {
    "query_string" : {"default_field" : "name", "query" : "*Doe*"}
} }

Response:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}
{"_id" : "2","name" : "Jane Doewoman","function" : "Teacher"}

Scenario 2:

{"query": {
    "query_string" : {"default_field" : "name", "query" : "*Jan*"}
} }

Response:

{"_id" : "2","name" : "Jane Doewoman","function" : "Teacher"}

Scenario 3:

{"query": {
    "query_string" : {"default_field" : "name", "query" : "*oh* *oe*"}
} }

Response:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}
{"_id" : "2","name" : "Jane Doewoman","function" : "Teacher"}

EDIT: the same implementation with Spring Data Elasticsearch: https://stackoverflow.com/a/43579948/2357869

One more explanation of how query_string is better than the others: https://stackoverflow.com/a/43321606/2357869
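
For reference, each scenario can be run with a plain curl call like this one for scenario 1 (reusing the question's my_idx index name; adjust the URL to your setup):

curl -s -H 'Content-Type: application/json' 'http://localhost:9200/my_idx/_search' -d '{"query": {"query_string": {"default_field": "name", "query": "*Doe*"}}}'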

  • I think this is the easiest.
    – May 26, 2017
  • Try this: { "query": { "query_string" : { "fields" : ["content", "name"], "query" : "this AND that" } } }
    – Vijay, Jun 2, 2017
  • This worked for me. Thanks.
    – DJ Burb, Oct 19, 2021
  • Allowing a wildcard at the beginning of a word (e.g. "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting allow_leading_wildcard to false. Taken from: elastic.co/guide/en/elasticsearch/reference/7.16/…
    – Aug 3, 2023

Searching with leading and trailing wildcards is going to be extremely slow on a large index. If you want to be able to search by word prefix, remove the leading wildcard. If you really need to find a substring in the middle of a word, you are better off using the ngram tokenizer.
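
For example, keeping the question's URL-style search, dropping the leading wildcard turns this into a much cheaper prefix-style query:

curl 'http://localhost:9200/my_idx/my_type/_search?q=name:Doe*'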

  • Igor is right. At least remove the leading *. For an NGram ElasticSearch example, see this gist: gist.github.com/988923
    – karmi, Jun 24, 2011
  • @karmi: Thanks for your complete example! Perhaps you want to add your comment as an actual answer; it's what got it working for me and what I would want to upvote.
    – Nov 12, 2012

Without changing your index mappings, you can use a simple prefix query to do the kind of partial search you are hoping for, e.g.:

{
  "query": { 
    "prefix" : { "name" : "Doe" }
  }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html

  • Can you do a multi-field search using the prefix query?
    – Emil, Oct 22, 2018
  • Thanks, just what I was looking for! Any thoughts on performance impact?
    – Vingtoft, Aug 28, 2019

A lot of answers here focus on solving the issue at hand but don't say much about the trade-offs you need to weigh before choosing a particular approach, so let me add a few details from that perspective.

Partial search is a very common and important feature nowadays; implemented improperly, it can lead to poor user experience and bad performance. So first work out the functional and non-functional requirements your application has for this feature, which I cover in a separate detailed SO answer of mine.

There are various approaches, such as query-time, index-time, the completion suggester, and the search_as_you_type data type added in recent versions of Elasticsearch.
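
As a taste of the last option, here is a minimal sketch of the search_as_you_type approach (available since Elasticsearch 7.2; my_idx and title are just this thread's example names). The mapping is nothing more than:

{
  "mappings": {
    "properties": {
      "title": { "type": "search_as_you_type" }
    }
  }
}

and it is queried with a bool_prefix multi_match over the subfields that Elasticsearch generates automatically:

{
  "query": {
    "multi_match": {
      "query": "Doe",
      "type": "bool_prefix",
      "fields": ["title", "title._2gram", "title._3gram"]
    }
  }
}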

If you just want to implement a solution quickly, you can use the end-to-end working solution below.

Index mapping

{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    },
    "index.max_ngram_diff" : 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete", 
        "search_analyzer": "standard" 
      }
    }
  }
}

Index the given sample docs:

{ "title" : "John Doeman" }

{ "title" : "Jane Doewoman" }

{ "title" : "Jimmy Jackal" }

And the search query:

{
    "query": {
        "match": {
            "title": "Doe"
        }
    }
}

which returns the expected search results:

 "hits": [
            {
                "_index": "6467067",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.76718915,
                "_source": {
                    "title": "John Doeman"
                }
            },
            {
                "_index": "6467067",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.76718915,
                "_source": {
                    "title": "Jane Doewoman"
                }
            }
        ]

  • In the mapping, the "search_analyzer": "standard" setting is important. I was using a "lowercase" filter, and the values I was searching had digits in them; digits are ignored by the "lowercase" filter.
    – JessieinAg, Nov 7, 2022

I am using this and it worked:

"query": {
        "query_string" : {
            "query" : "*test*",
            "fields" : ["field1","field2"],
            "analyze_wildcard" : true,
            "allow_leading_wildcard": true
        }
    }
  • Allowing a wildcard at the beginning of a word (e.g. "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting allow_leading_wildcard to false. Taken from: elastic.co/guide/en/elasticsearch/reference/7.16/…
    – Aug 3, 2023

Try the solution that is described here: Exact Substring Searches in ElasticSearch

{
    "mappings": {
        "my_type": {
            "index_analyzer":"index_ngram",
            "search_analyzer":"search_ngram"
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 8
                }
            },
            "analyzer": {
                "index_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [ "ngram_filter", "lowercase" ]
                },
                "search_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": "lowercase"
                }
            }
        }
    }
}

To solve the disk usage problem and the too-long-search-term problem, short ngrams of at most 8 characters are used (configured with "max_gram": 8). To search for terms with more than 8 characters, turn your search into a boolean AND query looking for every distinct 8-character substring in that string. For example, if a user searched for large yard (a 10-character string), the search would be:

"large ya" AND "arge yar" AND "rge yard"

  • Dead link, please fix.
    – DarkMukke, Sep 12, 2017
  • I have been looking for something like this for a while. Thank you! Do you know how the memory scales with min_gram and max_gram? It seems like it would depend linearly on the size of the field values and the range of min and max. How frowned upon is using something like this?
    – Oct 26, 2019
  • Also, is there any reason the ngram is a filter rather than a tokenizer? Could you not just have it as a tokenizer and then apply a lowercase filter, i.e. index_ngram: { type: "custom", tokenizer: "ngram_tokenizer", filter: [ "lowercase" ] }? I tried it and it seems to give the same results using the analyzer test API.
    – Oct 26, 2019
  • Used the Wayback Machine: web.archive.org/web/20131216221809/http://blog.rnf.me/2013/…
    – Pants, Oct 3, 2021

If you want to implement autocomplete functionality, the Completion Suggester is the neatest solution. This blog post contains a very clear description of how it works.

In short, it's an in-memory data structure called an FST, which contains valid suggestions and is optimised for fast retrieval and low memory usage. Essentially, it is just a graph. For instance, an FST containing the words hotel, marriot, mercure, munchen and munich would look like this:

[Image: FST graph for the words hotel, marriot, mercure, munchen and munich]
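
A minimal sketch of that approach (the field name suggest and the suggester name are just examples): map a completion field, then query it through the suggest API:

{
  "mappings": {
    "properties": {
      "suggest": { "type": "completion" }
    }
  }
}

{
  "suggest": {
    "name-suggest": {
      "prefix": "doe",
      "completion": { "field": "suggest" }
    }
  }
}

Note that the FST matches from the start of each indexed input, so to find "Doeman" by typing "doe" you would index the surname as its own input, e.g. "suggest": { "input": ["John", "Doeman"] }.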


You can use regexp:

{ "_id" : "1", "name" : "John Doeman" , "function" : "Janitor"}
{ "_id" : "2", "name" : "Jane Doewoman","function" : "Teacher"  }
{ "_id" : "3", "name" : "Jimmy Jackal" ,"function" : "Student"  } 

If you use this query:

{
  "query": {
    "regexp": {
      "name": "J.*"
    }
  }
}

you will get all documents whose name starts with "J". Now suppose you want only the first two records, whose names end with "man"; then you can use this query:

{
  "query": { 
    "regexp": {
      "name": ".*man"
    }
  }
}

And if you want to receive all records that have an "m" somewhere in their name, you can use this query:

{
  "query": { 
    "regexp": {
      "name": ".*m.*"
    }
  }
}

This works for me, and I hope it helps solve your problem.


Using wildcards (*) prevents the calculation of a relevance score.
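
For example (a sketch reusing the question's name field), every hit from this query comes back with the same constant _score instead of a relevance ranking, because wildcard and other multi-term queries use a constant_score rewrite by default:

{
  "query": {
    "wildcard": {
      "name": "*doe*"
    }
  }
}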

  • Could you add more details to your answer? Provide sample code or a reference to documentation on what this does.
    – Cray, Jul 1, 2019

Never mind.

I had to look at the Lucene documentation. Seems I can use wildcards! :-)

curl http://localhost:9200/my_idx/my_type/_search?q=*Doe*

does the trick!

  • See @imotov's answer. The use of wildcards is not going to scale well at all.
    – Jun 5, 2012
  • @Idx - See how your own answer is downvoted. Downvotes represent the quality and relevancy of an answer. Could you spare a minute to accept the right answer? At least new users would be grateful to you.
    – asyncwait, Dec 26, 2013
  • Enough downvotes. The OP made clear what the best answer is now. +1 for sharing what seemed to be the best answer before someone posted a better one.
    – s.Daniel, Mar 17, 2015
