Elasticsearch搜索引擎学习之六——索引雇员文档

高新技术,ElasticSearch

2017-06-04

152

0

目录


目前我们已经在 Elasticsearch 中存储了一些数据, 接下来就能专注于实现应用的业务需求了。

检索单个文档

第一个需求是可以检索到单个雇员的数据。这在 Elasticsearch 中很简单。简单地执行 一个 HTTP GET 请求并指定文档的地址——索引库、类型和ID。

使用这三个信息可以返回原始的 JSON 文档:

GET /megacorp/employee/1

执行情况如下:

$ curl -XGET 'localhost:9200/megacorp/employee/1?pretty'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   294  100   294    0     0  18375      0 --:--:-- --:--:-- --:--:--  287k{
  "_index" : "megacorp",
  "_type" : "employee",
  "_id" : "1",
  "_version" : 2,
  "found" : true,
  "_source" : {
    "first_name" : "John",
    "last_name" : "Smith",
    "age" : 25,
    "about" : "I love to go rock climbing",
    "interests" : [
      "sports",
      "music"
    ]
  }
}

将 HTTP 命令由 PUT 改为 GET 可以用来检索文档,同样的,可以使用 DELETE 命令来删除文档,以及使用 HEAD 指令来检查文档是否存在。如果想更新已存在的文档,只需再次 PUT(最好按照RESTful标准,使用POST新增,PUT修改)。

更新文档:

belonk@DESKTOP-7TSK7UJ MINGW64 /d/opensource
$ curl -XPUT -i 'http://localhost:9200/megacorp/employee/3' -d '
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry1" ]
}
'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   315  100   146  100   169   9125  10562 --:--:-- --:--:-- --:--:-- 10562HTTP/1.1 200 OK
Warning: 299 Elasticsearch-5.4.0-780f8c4 "Content type detection for rest requests is deprecated. Specify the content type using the [Content-Type] header." "Fri, 02 Jun 2017 10:07:01 GMT"
content-type: application/json; charset=UTF-8
content-length: 146

{"_index":"megacorp","_type":"employee","_id":"3","_version":3,"result":"updated","_shards":{"total":2,"successful":1,"failed":0},"created":false}

查询更新:

belonk@DESKTOP-7TSK7UJ MINGW64 /d/opensource
$ curl -XGET -i 'http://localhost:9200/megacorp/employee/3?pretty'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   281  100   281    0     0    281      0  0:00:01 --:--:--  0:00:01  274kHTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 281

{
  "_index" : "megacorp",
  "_type" : "employee",
  "_id" : "3",
  "_version" : 3,
  "found" : true,
  "_source" : {
    "first_name" : "Douglas",
    "last_name" : "Fir",
    "age" : 35,
    "about" : "I like to build cabinets",
    "interests" : [
      "forestry1"
    ]
  }
}

轻量搜索

一个 GET 是相当简单的,可以直接得到指定的文档。 现在尝试点儿稍微高级的功能,比如一个简单的搜索!第一个尝试的几乎是最简单的搜索了。我们使用下列请求来搜索所有雇员:

GET /megacorp/employee/_search

可以看到,我们仍然使用索引库 megacorp 以及类型 employee,但与指定一个文档 ID 不同,这次使用_search 。返回结果包括了所有三个文档,放在数组 hits 中。一个搜索默认返回十条结果:

belonk@DESKTOP-7TSK7UJ MINGW64 /d/opensource
$ curl -XGET 'localhost:9200/megacorp/employee/_search?pretty'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1277  100  1277    0     0   8185      0 --:--:-- --:--:-- --:--:--  8185{
  "took" : 67,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "first_name" : "Douglas",
          "last_name" : "Fir",
          "age" : 35,
          "about" : "I like to build cabinets",
          "interests" : [
            "forestry1"
          ]
        }
      }
    ]
  }
}

注意:返回结果不仅告知匹配了哪些文档,还包含了整个文档本身:显示搜索结果给最终用户所需的全部信息。

接下来,尝试下搜索姓氏为Smith的雇员。为此,我们将使用一个高亮搜索,很容易通过命令行完成。这个方法一般涉及到一个查询字符串 (_query-string_)搜索,因为我们通过一个URL参数来传递查询信息给搜索接口:

GET /megacorp/employee/_search?q=last_name:Smith

我们仍然在请求路径中使用 _search 端点,并将查询本身赋值给参数 q= 。返回结果给出了所有的 Smith:

belonk@DESKTOP-7TSK7UJ MINGW64 /d/opensource
$ curl -XGET 'localhost:9200/megacorp/employee/_search?q=last_name:Smith&pretty'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   941  100   941    0     0  20021      0 --:--:-- --:--:-- --:--:-- 29406{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.2876821,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

使用查询表达式搜索

Elasticsearch 提供一个丰富灵活的查询语言叫做 查询表达式 , 它支持构建更加复杂和健壮的查询。领域特定语言 (DSL), 指定了使用一个 JSON 请求。我们可以像这样重写之前的查询所有 Smith 的搜索 :

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

返回结果与之前的查询一样,但还是可以看到有一些变化。其中之一是,不再使用 query-string 参数,而是一个请求体替代。这个请求使用 JSON 构造,并使用了一个 match 查询(属于查询类型之一,后续将会了解)。执行情况如下:

$ curl -XGET 'localhost:9200/megacorp/employee/_search?pretty' -H 'Content-Type:application/json' -d'
> {
>     "query" : {
>         "match" : {
    }
>             "last_name" : "Smith"
>         }
>     }
> }
> '
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1031  100   940  100    91  15161   1467 --:--:-- --:--:-- --:--:-- 15161{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.2876821,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

更复杂的搜索

现在尝试下更复杂的搜索。 同样搜索姓氏为 Smith 的雇员,但这次我们只需要年龄大于 30 的。查询需要稍作调整,使用过滤器 filter ,它支持高效地执行一个结构化查询。

GET /megacorp/employee/_search
{
    "query" : {
        "bool": {
            "must": {
                "match" : {
                    "last_name" : "smith" // 这部分与我们之前使用的 match 查询 一样。
                }
            },
            "filter": {
                "range" : {
                    "age" : { "gt" : 30 } // 这部分是一个 range 过滤器 , 它能找到年龄大于 30 的文档,其中 gt 表示_大于(_great than)。
                }
            }
        }
    }
}

 

执行情况如下:

$ curl -XGET 'localhost:9200/megacorp/employee/_search?pretty' -H 'Content-Type: application/json' -d'
> {
>     "query" : {
>         "bool": {
>             "must": {
>                 "match" : {
>                     "last_name" : "smith"
>                 }
>             },
>             "filter": {
>                 "range" : {
>                     "age" : { "gt" : 30 }
>                 }
>             }
>         }
>     }
> }
> '
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   871  100   563  100   308   3608   1974 --:--:-- --:--:-- --:--:--  3608{
  "took" : 117,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.2876821,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      }
    ]
  }
}

全文搜索

截止目前的搜索相对都很简单:单个姓名,通过年龄过滤。现在尝试下稍微高级点儿的全文搜索——一项传统数据库确实很难搞定的任务。搜索下所有喜欢攀岩(rock climbing)的雇员:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

显然我们依旧使用之前的 match 查询在about 属性上搜索 “rock climbing” 。得到两个匹配的文档:

Administrator@PC-201610162213 MINGW64 ~
$ curl -XGET 'localhost:9200/megacorp/employee/_search?pretty' -H 'Content-Type: application/json' -d'
> {
>     "query" : {
>         "match" : {
>             "about" : "rock climbing"
>         }
>     }
> }
> '
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1039  100   944  100    95  29500   2968 --:--:-- --:--:-- --:--:-- 29500{
  "took" : 17,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.53484553,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.53484553, // 相关性得分
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.26742277, // 相关性得分
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      }
    ]
  }
}

Elasticsearch 默认按照相关性得分排序,即每个文档跟查询的匹配程度。第一个最高得分的结果很明显:John Smith 的 about 属性清楚地写着 “rock climbing” 。

但为什么 Jane Smith 也作为结果返回了呢?原因是她的 about 属性里提到了 “rock” 。因为只有 “rock” 而没有 “climbing” ,所以她的相关性得分低于 John 的。

这是一个很好的案例,阐明了 Elasticsearch 如何  全文属性上搜索并返回相关性最强的结果。Elasticsearch中的 相关性 概念非常重要,也是完全区别于传统关系型数据库的一个概念(传统关系型数据库中的一条记录要么匹配要么不匹配)。

短语搜索

短语搜索,即是在搜索时匹配一系列单词或短语。比如, 我们想执行这样一个查询,仅匹配同时包含 “rock”  “climbing” ,并且 二者以短语 “rock climbing” 的形式紧挨着的雇员记录。

为此对 match 查询稍作调整,使用一个叫做 match_phrase 的查询:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}

请求结果如下:

belonk@DESKTOP-7TSK7UJ MINGW64 ~
$ curl -XGET 'localhost:9200/megacorp/employee/_search?pretty' -H 'Content-Type: application/json' -d'
> {
>     "query" : {
>         "match_phrase" : {
>             "about" : "rock climbing"
>         }
>     }
> }
> '
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   684  100   582  100   102  18774   3290 --:--:-- --:--:-- --:--:-- 36375{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.53484553,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.53484553,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

高亮搜索

高亮搜索即是在检索结果中将匹配的内容高亮显示。在 Elasticsearch 中检索出高亮片段也很容易:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

当执行该查询时,中还多了一个叫做 highlight 的部分。这个部分包含了 about 属性匹配的文本片段,并以 HTML 标签em封装:

belonk@DESKTOP-7TSK7UJ MINGW64 ~
$ curl -XGET 'localhost:9200/megacorp/employee/_search?pretty' -H 'Content-Type: application/json' -d'
> {
>     "query" : {
>         "match_phrase" : {
>             "about" : "rock climbing"
    },
>         }
>     },
>     "highlight": {
>         "fields" : {
>             "about" : {}
>         }
>     }
> }
> '
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   894  100   710  100   184  44375  11500 --:--:-- --:--:-- --:--:--  693k{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.53484553,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.53484553,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        },
        "highlight" : {
          "about" : [
            "I love to go rock climbing"
          ]
        }
      }
    ]
  }
}

关于高亮搜索片段,可以在 highlighting reference documentation 了解更多信息。

总结

本文从简单开始,尝试了简单搜索、轻量搜索、表达式搜索以及全文搜索、短语搜索、高亮搜索,对Elasticsearch的搜索有了初步的认识,全文搜索有搜索命中、结果评分等搜索相关的概念,关于这些概念和检索类型、方式等我们将在后续继续深入学习。


前一篇:Elasticsearch搜索引擎学习之五——创建雇员索引
后一篇:Elasticsearch搜索服务学习之七——分析雇员信息

belonk

轻轻地我走了,正如我轻轻地来,我挥一挥衣袖,不带走一片云彩