Boosting Query Clauses

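Of course, the bool query is not limited to combining simple one-word match queries. It can combine any other query, including other bool queries. It is commonly used to fine-tune the relevance _score of each document by combining the scores from several distinct queries.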

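Imagine that we want to search for documents about "full-text search", but we want to give more weight to documents that also mention "Elasticsearch" or "Lucene". By more weight, we mean that documents mentioning "Elasticsearch" or "Lucene" receive a higher relevance _score than those that do not, so they appear higher in the list of results.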

A simple bool query lets us express this fairly complex logic as follows:

    GET /_search
    {
      "query": {
        "bool": {
          "must": {
            "match": {
              "content": { (1)
                "query": "full text search",
                "operator": "and"
              }
            }
          },
          "should": [ (2)
            { "match": { "content": "Elasticsearch" }},
            { "match": { "content": "Lucene" }}
          ]
        }
      }
    }

  1. The content field must contain all three words: full, text, and search.
  2. If the content field also contains "Elasticsearch" or "Lucene", the document receives a higher _score.
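
To make the effect of the optional clauses concrete, here is a deliberately simplified toy model in Python. It assumes that a bool query simply adds up the scores of its matching clauses, and it ignores the coordination factor and query normalization that the real scoring applies. The function name `toy_bool_score` and all per-clause scores are hypothetical values chosen for illustration, not output from Elasticsearch.

```python
from typing import Optional, Sequence


def toy_bool_score(must: Sequence[Optional[float]],
                   should: Sequence[Optional[float]]) -> Optional[float]:
    """Toy model of bool scoring. None means 'this clause did not match'."""
    # A document is excluded entirely if any must clause fails to match.
    if any(score is None for score in must):
        return None
    # Matching clauses, required or optional, add their scores together.
    return sum(must) + sum(score for score in should if score is not None)


# Hypothetical per-clause scores for three documents that all satisfy
# the must clause but match a different number of should clauses.
only_required = toy_bool_score(must=[1.0], should=[None, None])
one_optional = toy_bool_score(must=[1.0], should=[0.4, None])
both_optional = toy_bool_score(must=[1.0], should=[0.4, 0.3])

print(only_required, one_optional, both_optional)
```

In this sketch each additional matching should clause raises the total, while a document that misses the must clause is dropped no matter how many optional clauses it matches.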

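The more should clauses that match, the more relevant the document. So far, so good. But what if we want to give more weight to documents that contain "Lucene", and even more weight to documents that contain "Elasticsearch"?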

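We can control the relative weight of any query clause by specifying a boost value, which defaults to 1. A boost value greater than 1 increases the relative weight of that clause. So we could rewrite the preceding query as follows: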

    GET /_search
    {
      "query": {
        "bool": {
          "must": {
            "match": { (1)
              "content": {
                "query": "full text search",
                "operator": "and"
              }
            }
          },
          "should": [
            { "match": {
              "content": {
                "query": "Elasticsearch",
                "boost": 3 (2)
              }
            }},
            { "match": {
              "content": {
                "query": "Lucene",
                "boost": 2 (3)
              }
            }}
          ]
        }
      }
    }

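  1. These clauses use the default boost of 1.
  2. This clause is the most important, as it has the highest boost.
  3. This clause is more important than the default, but not as important as the "Elasticsearch" clause.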
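
When the boosted terms come from application code, the same request body can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_boosted_query` and its parameters are hypothetical, and sending the body to the /_search endpoint is left to whichever HTTP or Elasticsearch client you use.

```python
import json
from typing import Dict


def build_boosted_query(field: str, required_text: str,
                        optional_boosts: Dict[str, float]) -> dict:
    """Build a bool query body: every word of required_text must match,
    and each optional term contributes to the score with its own boost."""
    should = [
        {"match": {field: {"query": term, "boost": boost}}}
        for term, boost in optional_boosts.items()
    ]
    return {
        "query": {
            "bool": {
                "must": {
                    "match": {
                        field: {"query": required_text, "operator": "and"}
                    }
                },
                "should": should,
            }
        }
    }


# Same field, text, and boosts as the request shown above.
body = build_boosted_query("content", "full text search",
                           {"Elasticsearch": 3, "Lucene": 2})
print(json.dumps(body, indent=2))
```

Keeping the boosts in a single mapping makes it easy to adjust relative weights without editing the query structure by hand.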

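Note:

  1. The boost parameter is used to increase the relative weight of a clause (with a boost greater than 1) or decrease it (with a boost between 0 and 1), but the increase or decrease is not linear. In other words, a boost of 2 does not result in double the _score.

  2. Instead, the new _score is normalized after the boost is applied. Each type of query has its own normalization algorithm, and the details are beyond the scope of this book. Suffice it to say that a higher boost value results in a higher _score.

  3. If you are implementing your own scoring model that is not based on TF/IDF and you need more control over the boosting process, you can use the function_score query to manipulate a document's boost without the normalization step.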
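
As a rough illustration of the function_score approach mentioned in the last note, the sketch below builds a request body in which each optional term is a filter with an explicit weight. It assumes an Elasticsearch version in which a function's `filter` accepts an ordinary query clause and the `weight` function is available; the helper name `build_function_score_query` and the weights are hypothetical, and the chosen `score_mode` and `boost_mode` are only one of several possible combinations.

```python
import json
from typing import Dict


def build_function_score_query(field: str, required_text: str,
                               term_weights: Dict[str, float]) -> dict:
    """Build a function_score body that applies an explicit weight for
    each optional term found in the document."""
    functions = [
        {"filter": {"match": {field: term}}, "weight": weight}
        for term, weight in term_weights.items()
    ]
    return {
        "query": {
            "function_score": {
                # Base query: all words of required_text must be present.
                "query": {
                    "match": {
                        field: {"query": required_text, "operator": "and"}
                    }
                },
                "functions": functions,
                # Add up the weights of all functions whose filter matches...
                "score_mode": "sum",
                # ...and multiply the base query score by that total.
                "boost_mode": "multiply",
            }
        }
    }


body = build_function_score_query("content", "full text search",
                                  {"Elasticsearch": 3, "Lucene": 2})
print(json.dumps(body, indent=2))
```

Unlike the boost parameter, the weights here are applied to the score as written, so the effect of changing them is easier to predict.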

We present other ways of combining queries in the next chapter (multi-field-search). But first, let's take a look at the other important feature of queries: text analysis.