寻找相关词

寻找相关词

短语查询和邻近查询都很好用，但仍有一个缺点。它们过于严格了：为了匹配短语查询，所有词项都必须存在，即使使用了 slop 。

用 slop 得到的单词顺序的灵活性也需要付出代价，因为失去了单词对之间的联系。即使可以识别 sue 、 alligator 和 ate 相邻出现的文档，但无法分辨是 Sue ate 还是 alligator ate 。

当单词相互结合使用的时候，表达的含义比单独使用更丰富。两个子句 I’m not happy I’m working 和 I’m happy I’m not working 包含相同的单词，也拥有相同的邻近度，但含义截然不同。

如果索引单词对而不是索引独立的单词，就能对这些单词的上下文尽可能多的保留。

对句子 Sue ate the alligator ，不仅要将每一个单词（或者 unigram ）作为词项索引(((“unigrams”)))

    ["sue", "ate", "the", "alligator"]

也要将每个单词 以及它的邻近词 作为单个词项索引：

    ["sue ate", "ate the", "the alligator"]

这些单词对（或者 bigrams ）被称为 shingles 。


Tip	Shingles 不限于单词对；你也可以索引三个单词（ trigrams ）： `["sue ate the", "ate the alligator"]` Trigrams 提供了更高的精度，但是也大大增加了索引中唯一词项的数量。在大多数情况下，Bigrams 就够了。

当然，只有当用户输入的查询内容和在原始文档中顺序相同时，shingles 才是有用的；对 sue alligator 的查询可能会匹配到单个单词，但是不会匹配任何 shingles 。

幸运的是，用户倾向于使用和搜索数据相似的构造来表达搜索意图。但这一点很重要：只是索引 bigrams 是不够的；我们仍然需要 unigrams ，但可以将匹配 bigrams 作为增加相关度评分的信号。

生成 Shingles

Shingles 需要在索引时作为分析过程的一部分被创建。我们可以将 unigrams 和 bigrams 都索引到单个字段中，但将它们分开保存在能被独立查询的字段会更清晰。unigrams 字段将构成我们搜索的基础部分，而 bigrams 字段用来提高相关度。

首先，我们需要在创建分析器时使用 shingle 语汇单元过滤器：

DELETE /my_index
PUT /my_index
{
    "settings": {
        "number_of_shards": 1,  <1>
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2, <2>
                    "max_shingle_size": 2, <2>
                    "output_unigrams":  false   <3>
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":             "custom",
                    "tokenizer":        "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" <4>
                    ]
                }
            }
        }
    }
}

<1> 参考 relevance-is-broken。

<2> 默认最小/最大的 shingle 大小是 2 ，所以实际上不需要设置。

<3> shingle 语汇单元过滤器默认输出 unigrams ，但是我们想让 unigrams 和 bigrams 分开。

<4> my_shingle_analyzer 使用我们常规的 my_shingles_filter 语汇单元过滤器。

首先，用 analyze API 测试下分析器：

GET /my_index/_analyze?analyzer=my_shingle_analyzer
Sue ate the alligator

果然，我们得到了 3 个词项：

sue ate
ate the
the alligator

现在我们可以继续创建一个使用新的分析器的字段。

多字段

我们曾谈到将 unigrams 和 bigrams 分开索引更清晰，所以 title 字段(((“multifields”)))将创建成一个多字段（参考 multi-fields ）：

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "title": {
                "type": "string",
                "fields": {
                    "shingles": {
                        "type":     "string",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}

通过这个映射， JSON 文档中的 title 字段将会被以 unigrams (title)和 bigrams (title.shingles)被索引，这意味着可以独立地查询这些字段。

最后，我们可以索引以下示例文档:

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }

搜索 Shingles

为了理解添加 shingles 字段的好处(((“shingles”, “searching for”)))，让我们首先来看 The hungry alligator ate Sue 进行简单 match 查询的结果：

GET /my_index/my_type/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}

这个查询返回了所有的三个文档，但是注意文档 1 和 2 有相同的相关度评分因为他们包含了相同的单词：

{
  "hits": [
     {
        "_id": "1",
        "_score": 0.44273707, <1>
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "2",
        "_score": 0.44273707, <1>
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "3", <2>
        "_score": 0.046571054,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

<1> 两个文档都包含 the 、 alligator 和 ate ，所以获得相同的评分。

<2> 我们可以通过设置 minimum_should_match 参数排除文档 3 ，参考 match-precision。

现在在查询里添加 shingles 字段。不要忘了在 shingles 字段上的匹配是充当一种信号—为了提高相关度评分—所以我们仍然需要将基本 title 字段包含到查询中：

GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}

仍然匹配到了所有的 3 个文档，但是文档 2 现在排到了第一名因为它匹配了 shingled 词项 ate sue.

{
  "hits": [
     {
        "_id": "2",
        "_score": 0.4883322,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "1",
        "_score": 0.13422975,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "3",
        "_score": 0.014119488,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

即使查询包含的单词 hungry 没有在任何文档中出现，我们仍然使用单词邻近度返回了最相关的文档。

Performance性能

shingles 不仅比短语查询更灵活，(((“shingles”,”better performance than phrase queries”)))而且性能也更好。 shingles 查询跟一个简单的 match 查询一样高效，而不用每次搜索花费短语查询的代价。只是在索引期间因为更多词项需要被索引会付出一些小的代价，这也意味着有 shingles 的字段会占用更多的磁盘空间。然而，大多数应用写入一次而读取多次，所以在索引期间优化我们的查询速度是有意义的。

这是一个在 Elasticsearch 里会经常碰到的话题：不需要任何前期进行过多的设置，就能够在搜索的时候有很好的效果。一旦更清晰的理解了自己的需求，就能在索引时通过正确的为你的数据建模获得更好结果和性能。