Lucene

2023-12-20 21:29:05

1. Lucene概述

1.1 什么是Lucene

1.2 Lucene的原理

1. Lucene概述

1.1 什么是Lucene

Lucene是一个全文搜索框架，而不是应用产品。因此它并不像baidu或者google那样拿来就能用，它只是提供了一些工具让你能实现这些产品。

Lucene的发明者Doug Cutting也同样是Hadoop的创造者。

Lucene能做什么？

要回答这个问题，先要了解Lucene的本质。

实际上Lucene的功能很单一，说到底，就是你给它若干个字符串，然后它为你提供一个全文搜索服务，告诉你你要搜索的关键词出现在哪里。

知道了这个本质，你就可以发挥想象做任何符合这个条件的事情了。

你可以把站内新闻都索引了，做个资料库；

你可以把一个数据库表的若干个字段索引起来，那就不用再担心因为"%like%"而锁表了；

你也可以写个自己的搜索引擎。

Lucene效率如何？

下面给出一些N年前的测试数据，如果你觉得可以接受，那么可以选择。
- 测试一：250万记录，300M左右文本，生成索引380M左右，800线程下平均处理时间300ms。
- 测试二：37000记录，索引数据库中的两个varchar字段，索引文件2.6M，800线程下平均处理时间1.5ms。

1.2 Lucene的原理

Lucene为什么这么快？
- 倒排索引
- 压缩算法
- 二元搜索

倒排索引

先要了解一下数据库的like搜索为什么那么慢：

执行like搜索时，数据库要遍历每条数据，并在每次遍历的过程中匹配关键词。

也就是说，慢的原因是在记录中搜索关键词，所以可以把这个过程反过来，以关键词搜索记录。

倒排索引是将每个文档（数据库中的记录）中的词汇提前取出，建立“词汇-文档索引”。

以上表中数据为例，将其中所有词汇取出，并建立索引：
Lucene的工作方式

Lucene提供的服务实际包含两部分：一入一出。

所谓入是写入，即将你提供的源（本质是字符串）写入索引或者将其从索引中删除；

所谓出是读出，即向用户提供全文搜索服务，让用户可以通过关键词定位源。
- 写入流程：
  
  源字符串首先经过analyzer处理，包括：分词（分成一个个单词）、去除停用词（stopword，可选）。
  
  将源中需要的信息加入Document的各个Field中，并把需要索引的Field索引起来，把需要存储的Field存储起来。
  
  将索引写入存储器，存储器可以是内存或磁盘。
- 读出流程：
  
  用户提供搜索关键词，经过analyzer处理。
  
  对处理后的关键词搜索索引找出对应的Document。
  
  用户根据需要从找到的Document中提取需要的Field。

2. Lucene的使用

2.1 准备

引入依赖

 ? ? ? ?<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-highlighter -->
 ? ? ? ?<dependency>
 ? ? ? ? ? ?<groupId>org.apache.lucene</groupId>
 ? ? ? ? ? ?<artifactId>lucene-highlighter</artifactId>
 ? ? ? ? ? ?<version>9.5.0</version>
 ? ? ? ?</dependency>
?
 ? ? ? ?<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
 ? ? ? ?<dependency>
 ? ? ? ? ? ?<groupId>org.apache.lucene</groupId>
 ? ? ? ? ? ?<artifactId>lucene-queryparser</artifactId>
 ? ? ? ? ? ?<version>9.5.0</version>
 ? ? ? ?</dependency>
?
?
 ? ? ? ?<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
 ? ? ? ?<dependency>
 ? ? ? ? ? ?<groupId>org.apache.lucene</groupId>
 ? ? ? ? ? ?<artifactId>lucene-analyzers-common</artifactId>
 ? ? ? ? ? ?<version>8.11.2</version>
 ? ? ? ?</dependency>
?
             ?<!--流操作 -->
             <dependency>
 ? ? ? ? ? ?<groupId>commons-io</groupId>
 ? ? ? ? ? ?<artifactId>commons-io</artifactId>
 ? ? ? ? ? ?<version>2.11.0</version>
 ? ? ? ?</dependency>
?

准备基础数据

知乎热榜 - 知乎

创建两个目录，例如 D:/data 和 D:/index 在data目录中添加一些数据，可以保存几个网页

2.2 生成索引

private static final String INDEX_DIR = "/Users/whitecamellia/Desktop/lucene/index";
private static final String DATA_DIR = "/Users/whitecamellia/Desktop/lucene/data";
?
@Test
public void createIndex() throws Exception {
?
 ? ?// 获取存放索引的目录
 ? ?Directory directory = FSDirectory.open(Paths.get(INDEX_DIR));
 ? ?// 创建IndexWriter的默认配置
 ? ?IndexWriterConfig indexWriterConfig = new IndexWriterConfig();
 ? ?// 创建IndexWriter
 ? ?IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
?
 ? ?// 获取存放原始数据的目录
 ? ?File dataDir = new File(DATA_DIR);
 ? ?// 遍历目录中的文件
 ? ?for (File data : dataDir.listFiles()) {
 ? ? ? ?if (data.isFile()) {
 ? ? ? ? ? ?// 创建文档
 ? ? ? ? ? ?Document document = new Document();
?
 ? ? ? ? ? ?String title = data.getName();
 ? ? ? ? ? ?// 向title Field域中加入文件名
 ? ? ? ? ? ?document.add(new TextField("title", title, Field.Store.YES));
?
 ? ? ? ? ? ?String content = FileUtils.readFileToString(data, "utf-8");
 ? ? ? ? ? ?// 向content Field域中加入文件内容
 ? ? ? ? ? ?document.add(new TextField("content", content, Field.Store.YES));
?
 ? ? ? ? ? ?indexWriter.addDocument(document);
 ? ? ?  }
 ?  }
 ? ?indexWriter.close();
}

2.3 全文检索

@Test
public void search() throws Exception {
 ? ?String keyword = "何平";
?
 ? ?// 获取索引目录
 ? ?Directory directory = FSDirectory.open(Paths.get(INDEX_DIR));
 ? ?// 创建标准分词器
 ? ?Analyzer analyzer = new StandardAnalyzer();
 ? ?// 通过索引创建出IndexSearcher
 ? ?DirectoryReader reader = DirectoryReader.open(directory);
 ? ?IndexSearcher indexSearcher = new IndexSearcher(reader);
?
 ? ?// 指定搜索域
 ? ?QueryParser queryParser = new QueryParser("title", analyzer);
 ? ?// QueryParser queryParser = new QueryParser("content", analyzer);
 ? ?// 处理搜索关键词
 ? ?Query query = queryParser.parse(keyword);
?
 ? ?// 搜索整个索引，并获取前5条
 ? ?TopDocs topDocs = indexSearcher.search(query, 5);
 ? ?System.out.println("共搜索出 " + topDocs.totalHits + " 条数据");
?
 ? ?System.out.println("#################################################################");
 ? ?// 遍历搜索结果
 ? ?for (ScoreDoc doc : topDocs.scoreDocs) {
 ? ? ? ?// 通过文档的索引获取文档
 ? ? ? ?Document document = indexSearcher.doc(doc.doc);
 ? ? ? ?// 获取文档指定域的内容
 ? ? ? ?// System.out.println(document.get("title"));
 ? ? ? ?System.out.println(document.getField("title").stringValue());
 ? ? ? ?// 获取该文档的score，决定了结果顺序，代表与关键词的相关性（相关性计算所占百分比）
 ? ? ? ?System.out.println("搜索评分：" + doc.score);
 ?  }
}

2.4 多Field检索

以上只是在单个Field（title）中搜索，也可以在多个Field中搜索。

只需将

QueryParser queryParser = new QueryParser("title", analyzer);

替换为

QueryParser queryParser = new MultiFieldQueryParser(new String[]{"title", "content"}, analyzer);

2.5 中文分词器

Lucene默认使用StandardAnalyzer进行分词，该分词器逻辑较为简单，尤其是对于中文，只是单纯对每个字进行拆分，没办法使用。

Lucene也提供了中文分词器，需引入依赖：org.apache.lucene:lucene-analyzers-smartcn，然后将上面程序中的分词器修改成SmartChineseAnalyzer，注意，在生成索引时也要指定分词器，不然还是默认的StandardAnalyzer：

// search 创建中文分词器
Analyzer analyzer = new SmartChineseAnalyzer();
// createIndex 
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);

<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-smartcn -->
<dependency>
 ? ?<groupId>org.apache.lucene</groupId>
 ? ?<artifactId>lucene-analyzers-smartcn</artifactId>
 ? ?<version>8.11.2</version>
</dependency>

2.6 停用词

文档中有很多词对于检索来说是没有实际意义的，比如“我”、“是”、“的”、“了”……

这就需要在生成索引和检索时排除掉这些，Lucene默认有一个停用词库，不过里面只包含了几十个标点符号。

我们也可以在创建分词器时自己指定停用词：

List<String> STOP_WORD_LIST = Arrays.asList(new String[]{"，", "。", ",", ".", "?", "我", "的", "了"});
CharArraySet set = new CharArraySet(STOP_WORD_LIST, true);
Analyzer analyzer = new SmartChineseAnalyzer(set);

也可以去网上找一个通用的停用词库。

中文停用词表

2.7 是否索引,是否储存

// 创建文档
Document document = new Document();
String title = data.getName();
//  是否索引==》是否要对title的数据进行分词，生成索引表
//  TextField("title") 是索引
//  new StringField("author")
//  是否储存  生成索引表的时候要不要在我们的索引库中保存原数据
//  保存就可以获取到对应的内容 document.getField("content").stringValue()
//  Field.Store.No
//  通常，对于标题，我们是存储的，但是对于内容，我们并不需要，因为内容的数据太了，
//  内容的数据，我们可以去查表获取
document.add(new TextField("title", title, Field.Store.YES));
?
String content = FileUtils.readFileToString(data, "utf-8");
// 向content Field域中加入文件内容
document.add(new TextField("content", content, Field.Store.NO));
?
indexWriter.addDocument(document);

文章来源:https://blog.csdn.net/2301_77351901/article/details/135117523
本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若内容造成侵权/违法违规/事实不符，请联系我的编程经验分享网邮箱：veading@qq.com进行投诉反馈，一经查实，立即删除！