LangChain (0.0.340) Official Docs X: Retrieval - Retrievers

2023-12-16 22:29:16

Links: LangChain website · LangChain official documentation · langchain GitHub · langchain API docs · llm-universe


A Retriever is an interface for returning documents relevant to an unstructured query. Retrievers come in many flavors, implementing different retrieval algorithms and strategies, for example:

  • Vector store-backed retriever: uses a vector store as its backing store to perform retrieval.
  • Parent Document Retriever: if a document is too large, its embedding becomes too diffuse to reflect its meaning accurately; if it is too small, it lacks context. The Parent Document Retriever splits documents at multiple granularities to balance accuracy against context (it can return small chunks, larger chunks, or full documents).
  • MultiQueryRetriever: uses an LLM to expand the user's query into multiple queries, runs a retrieval for each generated query, and returns the union of all relevant documents. Generating queries from multiple angles yields richer results.
  • MultiVector Retriever: stores multiple vector representations per document, such as smaller chunks of the same document, a summary, or pre-written hypothetical questions, to obtain more accurate retrieval results.
  • Contextual compression: compresses and filters the retrieved document set, dropping irrelevant documents and extracting only the relevant content from the rest, to improve retrieval efficiency and response quality.
  • Ensemble Retriever: combines multiple retrievers and merges their results; by exploiting the strengths of different algorithms, the EnsembleRetriever can outperform any single one.
  • Time-weighted vector store retriever: scores document relevance by combining semantic similarity with time decay, so frequently accessed documents keep high relevance scores.
  • Self-querying: uses an LLM to turn a natural-language question into a structured query (a search string plus metadata filters) that is executed against the vector store, sparing you from writing filter expressions by hand.
  • WebResearchRetriever: retrieves relevant documents directly from the web, for example by running a Google search, loading the result pages, and then applying embeddings and similarity search.

Each of these is demonstrated below.

1. Vector store-backed retriever

A common setup uses a vector store as the retriever's backing store. The examples below use Chroma.

# pip install chromadb

1.1 Basic examples

1.1.1 Creating a vector store from raw text
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

full_text = open("state_of_the_union.txt", "r").read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_text(full_text)

embeddings = OpenAIEmbeddings()
db = Chroma.from_texts(texts, embeddings)
retriever = db.as_retriever()
retrieved_docs = retriever.invoke(
    "What did the president say about Ketanji Brown Jackson?"
)
print(retrieved_docs[0].page_content)
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. 

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. 

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. 

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.
1.1.2 Creating a vector store from documents
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

embeddings = OpenAIEmbeddings()
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

db = FAISS.from_documents(texts, embeddings)
retriever = db.as_retriever()
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
1.1.3 MMR search

By default, the vector store retriever uses similarity search; you can switch to maximal marginal relevance (MMR) search if the underlying vector store supports it.

retriever = db.as_retriever(search_type="mmr")
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
1.1.4 Setting a similarity score threshold

You can also set a similarity score threshold, returning only documents whose score exceeds it:

retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .5})
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
1.1.5 Specifying top k
retriever = db.as_retriever(search_kwargs={"k": 1})
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
len(docs)
1

1.2 LCEL

Retrievers also implement the Runnable interface, so they support invoke, ainvoke, stream, astream, batch, abatch, astream_log and the other standard methods, and can be composed with other Runnable components using LCEL syntax.
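For example, here is a quick sketch that reuses the retriever built above to run two queries in one batch call:

docs_per_query = retriever.batch(
    [
        "What did the president say about Ketanji Brown Jackson?",
        "What did the president say about technology?",
    ]
)
print([len(docs) for docs in docs_per_query])  # one list of documents per query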

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI()

# Extract the page_content of each document and join the chunks with blank lines
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

The dictionary {"context": retriever | format_docs, "question": RunnablePassthrough()} contains two key-value pairs, each a runnable component. In LCEL syntax, a dict literal can serve as the input to a RunnableParallel, which runs its components on the same input and returns their outputs as a dict.

  • retriever | format_docs: the retriever fetches the relevant documents, and the format_docs function formats the document list into a single string. The pipe operator | composes the two steps into one runnable component.
  • RunnablePassthrough(): a runnable that passes its input through to the next component unchanged; here it forwards the question as-is.

There are two common ways to compose Runnables (see LangChain (0.0.339) Official Docs II: LCEL); a minimal sketch follows the example output below:

  • RunnableSequence: runs components in order, feeding each Runnable's output into the next; built with the | operator or by passing a list of Runnables.
  • RunnableParallel: runs components in parallel, each receiving the same input; built inside a sequence with a dict literal or by passing a dict.
chain.invoke("What did the president say about technology?")
'The president said that technology plays a crucial role in the future and that passing the Bipartisan Innovation Act will make record investments in emerging technologies and American manufacturing. The president also mentioned Intel\'s plans to build a semiconductor "mega site" and increase their investment from $20 billion to $100 billion, which would be one of the biggest investments in manufacturing in American history.'
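Here is that minimal sketch of the two composition styles; the toy runnables are illustrative only, not part of the original example:

from langchain_core.runnables import RunnableLambda, RunnableParallel

add_one = RunnableLambda(lambda x: x + 1)
double = RunnableLambda(lambda x: x * 2)

# RunnableSequence: add_one's output feeds into double
sequence = add_one | double
print(sequence.invoke(3))  # 8

# RunnableParallel: both components receive the same input
parallel = RunnableParallel(added=add_one, doubled=double)
print(parallel.invoke(3))  # {'added': 4, 'doubled': 6}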

2. Parent Document Retriever

The Parent Document Retriever balances accuracy against context by splitting large documents into small chunks and storing them. During retrieval it first fetches the matching small chunks, then looks up their parent documents and returns those larger documents; a parent can be the entire original document or a larger chunk.

from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

from langchain.retrievers import ParentDocumentRetriever

loaders = [
    TextLoader("../../paul_graham_essay.txt"),
    TextLoader("../../state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

The data files paul_graham_essay.txt and state_of_the_union.txt can be found under langchain/docs/docs/modules; if that path no longer works, just search the langchain repository for them.

2.1 Retrieving full documents

# This text splitter creates the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# Initialize the vector store
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

retriever.add_documents(docs, ids=None)
list(store.yield_keys())

This should produce two keys, because we added two documents.

['f73cb162-5eb2-4118-abcf-d87aa6a1b564',
 '8a2478e0-ac7d-4abf-811a-33a8ace3e3b8']

Searching the vector store directly returns small chunks, since that is what it stores:

sub_docs = vectorstore.similarity_search("justice breyer")
print(sub_docs[0].page_content)
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

Querying through the retriever instead returns the whole document the small chunk belongs to:

retrieved_docs = retriever.get_relevant_documents("justice breyer")
len(retrieved_docs[0].page_content)
38540

2.2 Retrieving larger chunks

Sometimes the full documents are too large to use directly. In that case we can split the raw documents first into larger chunks and then into smaller chunks. We index the smaller chunks, but on retrieval return their larger parent chunks.

# create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
len(list(store.yield_keys()))
66

We can see that there are now far more than two keys; these correspond to the larger chunks.

Now we run the same operations:

sub_docs = vectorstore.similarity_search("justice breyer")
print(sub_docs[0].page_content)
Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
retrieved_docs = retriever.get_relevant_documents("justice breyer")
len(retrieved_docs[0].page_content)
1849

The retriever no longer returns the entire document. Printing the result:

print(retrieved_docs[0].page_content)
In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. 

We cannot let this happen. 

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. 

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. 

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. 

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.  

We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. 

We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.

3. MultiQueryRetriever

Retrieval results can change when a query is worded slightly differently, or when a single embedding does not capture the semantics of the data well. Prompt engineering / tuning can work around this, but it is tedious. The MultiQueryRetriever automates that optimization: it uses an LLM to generate multiple queries from the user's input, each from a different perspective.

For each generated query, a set of relevant documents is retrieved, and the union across all queries forms a larger pool of potentially relevant documents. By generating multiple perspectives on the same question, the MultiQueryRetriever can overcome some of the limitations of distance-based retrieval and produce a richer result set.

3.1 Basic example

# Build a sample vectorDB
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load blog post
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# VectorDB
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

Next we use the MultiQueryRetriever.from_llm method to create a MultiQueryRetriever named retriever_from_llm, which uses vectordb as the underlying retriever together with the ChatOpenAI model llm.

from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

question = "What are the approaches to Task Decomposition?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

unique_docs = retriever_from_llm.get_relevant_documents(query=question)
len(unique_docs)							# Output: 5
INFO:langchain.retrievers.multi_query:Generated queries: ['1. How can Task Decomposition be approached?', '2. What are the different methods for Task Decomposition?', '3. What are the various approaches to decomposing tasks?']

3.2 LCEL

Building on the basic example, the following combines a custom prompt with an output parser that parses the LLM result into a list of strings.

from typing import List

from langchain.chains import LLMChain
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field


# Output parser will split the LLM result into a list of queries
class LineList(BaseModel):
    # "lines" is the key (attribute name) of the parsed output
    lines: List[str] = Field(description="Lines of text")


class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = text.strip().split("\n")
        return LineList(lines=lines)


output_parser = LineListOutputParser()

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five 
    different versions of the given user question to retrieve relevant documents from a vector 
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search. 
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
llm = ChatOpenAI(temperature=0)

# Chain
llm_chain = LLMChain(llm=llm, prompt=QUERY_PROMPT, output_parser=output_parser)

# Other inputs
question = "What are the approaches to Task Decomposition?"
  • LineList: a Pydantic model representing the parsed LLM output, holding a list of strings in its lines attribute.
  • LineListOutputParser: a custom output parser that parses the LLM result into a LineList object.
# Run
retriever = MultiQueryRetriever(
    retriever=vectordb.as_retriever(), llm_chain=llm_chain, parser_key="lines"
)  # "lines" is the key (attribute name) of the parsed output

# Results
unique_docs = retriever.get_relevant_documents(
    query="What does the course say about regression?"
)
len(unique_docs)					# Output: 11
INFO:langchain.retrievers.multi_query:Generated queries: ["1. What is the course's perspective on regression?", '2. Can you provide information on regression as discussed in the course?', '3. How does the course cover the topic of regression?', "4. What are the course's teachings on regression?", '5. In relation to the course, what is mentioned about regression?']

4. MultiVector Retriever

It is often useful to store multiple vectors per document to serve different retrieval needs. Common ways to create the extra vectors include:

  • Smaller chunks: split each document into smaller chunks and embed those (as the ParentDocumentRetriever does).
  • Summary: create a summary of each document and embed it alongside (or instead of) the document.
  • Hypothetical questions: create hypothetical questions each document could answer and embed them alongside (or instead of) the document.

The MultiVectorRetriever makes these multi-vector representations straightforward to build and ships with methods for working with them. You can also explicitly add your own questions or queries, together with their document associations, for finer control over retrieval.

from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.storage import InMemoryByteStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever

loaders = [
    TextLoader("../../paul_graham_essay.txt"),
    TextLoader("../../state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

4.1 Smaller chunks

import uuid

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in docs]

# create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Vectorstore alone retrieves the small chunks
retriever.vectorstore.similarity_search("justice breyer")[0]
Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '3f826cfe-78bd-468d-adb8-f5c2719255df', 'source': '../../state_of_the_union.txt'})
# Retriever returns larger chunks
len(retriever.get_relevant_documents("justice breyer")[0].page_content)
9875

The retriever performs similarity search by default; you can also enable MMR search:

from langchain.retrievers.multi_vector import SearchType

retriever.search_type = SearchType.mmr
len(retriever.get_relevant_documents("justice breyer")[0].page_content)
9875

4.2 Summary

A summary can often distill a chunk's content more accurately, yielding better retrieval. Here we show how to create summaries and then embed them.

import uuid

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})
# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

sub_docs = vectorstore.similarity_search("justice breyer")
sub_docs[0]
Document(page_content="The document is a speech given by the President of the United States, highlighting various issues and priorities. The President discusses the nomination of Judge Ketanji Brown Jackson for the Supreme Court and emphasizes the importance of securing the border and fixing the immigration system. The President also mentions the need to protect women's rights, support LGBTQ+ Americans, pass the Equality Act, and sign bipartisan bills into law. Additionally, the President addresses the opioid epidemic, mental health, support for veterans, and the fight against cancer. The speech concludes with a message of unity and optimism for the future of the United States.", metadata={'doc_id': '1f0bb74d-4878-43ae-9a5d-4c63fb308ca1'})
retrieved_docs = retriever.get_relevant_documents("justice breyer")
len(retrieved_docs[0].page_content)
9194

4.3 Hypothetical Queries

An LLM can also be used to generate a list of hypothetical questions that a given document could answer. These questions can then be embedded.

functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

chain.invoke(docs[0])
["What was the author's initial career choice before deciding to switch to AI?",
 'Why did the author become disillusioned with AI during his first year of grad school?',
 'What realization did the author have when visiting the Carnegie Institute?']
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

sub_docs = vectorstore.similarity_search("justice breyer")
sub_docs
[Document(page_content='Who is the nominee for the United States Supreme Court, and what is their background?', metadata={'doc_id': 'd4a82bd9-9001-4bd7-bff1-d8ba2dca9692'}),
 Document(page_content='Why did Robert Morris suggest the narrator to quit Y Combinator?', metadata={'doc_id': 'aba9b00d-860b-4b93-8e80-87dc08fa461d'}),
 Document(page_content='What events led to the narrator deciding to hand over Y Combinator to someone else?', metadata={'doc_id': 'aba9b00d-860b-4b93-8e80-87dc08fa461d'}),
 Document(page_content="How does the Bipartisan Infrastructure Law aim to improve America's infrastructure?", metadata={'doc_id': '822c2ba8-0abe-4f28-a72e-7eb8f477cc3d'})]
retrieved_docs = retriever.get_relevant_documents("justice breyer")
len(retrieved_docs[0].page_content)
9194

5. Contextual compression

One challenge in document retrieval is that the information most relevant to a query may be buried in documents full of irrelevant text. Passing everything to the LLM makes the calls more expensive and the responses worse. Contextual compression compresses and filters the retrieved documents to improve retrieval efficiency and accuracy: it either reduces a document's content to the query-relevant parts or drops the document entirely.

This requires a base retriever and a Document Compressor. The Contextual Compression Retriever passes the base retriever's document list to the Document Compressor, which compresses and filters the set before the final documents are returned.

5.1 Plain retrieval

Using the 2023 State of the Union speech as an example: a plain retriever returns one or two relevant documents along with several irrelevant ones, and even the relevant documents contain plenty of irrelevant content.

# Helper function for printing docs

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.vectorstores import FAISS

documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()

docs = retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
pretty_print_docs(docs)
Document 1:

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.

We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.

We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.
----------------------------------------------------------------------------------------------------
Document 3:

And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong.

As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.

While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.

And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things.

So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.

First, beat the opioid epidemic.
----------------------------------------------------------------------------------------------------
Document 4:

Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers.

And as Wall Street firms take over more nursing homes, quality in those homes has gone down and costs have gone up.

That ends on my watch.

Medicare is going to set higher standards for nursing homes and make sure your loved ones get the care they deserve and expect.

We’ll also cut costs and keep the economy going strong by giving workers a fair shot, provide more training and apprenticeships, hire them based on their skills not degrees.

Let’s pass the Paycheck Fairness Act and paid leave.

Raise the minimum wage to $15 an hour and extend the Child Tax Credit, so no one has to raise a family in poverty.

Let’s increase Pell Grants and increase our historic support of HBCUs, and invest in what Jill—our First Lady who teaches full-time—calls America’s best-kept secret: community colleges.

5.2 LLMChainExtractor

(See the langchain.retrievers API reference.)

This section demonstrates LLMChainExtractor and ContextualCompressionRetriever:

  • LLMChainExtractor: uses an LLM chain to iterate over the initially returned documents and extract from each only the content relevant to the query.

  • ContextualCompressionRetriever: combines a base_compressor (here the LLMChainExtractor) with a base_retriever to perform compression-aware retrieval, improving efficiency and accuracy.

from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
pretty_print_docs(compressed_docs)
Document 1:

"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence."
----------------------------------------------------------------------------------------------------
Document 2:

"A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans."

5.3 LLMChainFilter

LLMChainFilter uses an LLM chain to decide which of the initially retrieved documents to keep and which to drop, without altering their contents.

from langchain.retrievers.document_compressors import LLMChainFilter

_filter = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
pretty_print_docs(compressed_docs)
Document 1:

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

5.4 EmbeddingsFilter

Making an LLM call to filter each document is expensive. EmbeddingsFilter instead filters by embedding similarity between the query and the retrieved documents, which is cheaper and faster.

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter

embeddings = OpenAIEmbeddings()
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
pretty_print_docs(compressed_docs)
Document 1:

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.

We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.

We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.
----------------------------------------------------------------------------------------------------
Document 3:

And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong.

As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.

While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.

And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things.

So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.

First, beat the opioid epidemic.

5.5 Combining compressors and document transformers

DocumentCompressorPipeline combines multiple compressors in sequence, and BaseDocumentTransformers (such as a TextSplitter) can be added to the pipeline to transform documents along the way.

Below we build a compressor pipeline that first splits the documents into smaller chunks, then removes redundant documents, and finally filters by relevance to the query.

from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=". ")
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)
compression_retriever = ContextualCompressionRetriever(base_compressor=pipeline_compressor, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Jackson Brown")
pretty_print_docs(compressed_docs)
Document 1:

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson
----------------------------------------------------------------------------------------------------
Document 2:

As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.

While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year
----------------------------------------------------------------------------------------------------
Document 3:

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder

6. Ensemble Retriever (hybrid search)

The EnsembleRetriever aggregates multiple retrievers, merging the documents they return via Reciprocal Rank Fusion. The most common pattern combines a sparse retriever (such as BM25) with a dense retriever (such as embedding similarity), because their strengths are complementary: the former excels at finding documents by keywords, the latter at finding them by semantic similarity. This combination is also known as hybrid search.

The core idea of Reciprocal Rank Fusion is to re-rank the documents returned by each retriever and compute each document's final score from the reciprocals of its ranks. A document ranked highly by several retrievers accumulates a higher reciprocal-rank score and is considered more relevant. The sketch below illustrates the scoring idea.
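A simplified RRF sketch with equal weights and the conventional smoothing constant c=60 (EnsembleRetriever's internal implementation differs in its details, and the document IDs here are hypothetical):

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, c=60):
    # Each document's score is the sum of 1 / (c + rank) over every ranking it appears in
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_a", "doc_b", "doc_c"]   # hypothetical sparse ranking
dense_results = ["doc_a", "doc_d", "doc_b"]  # hypothetical dense ranking
print(reciprocal_rank_fusion([bm25_results, dense_results]))
# ['doc_a', 'doc_b', 'doc_d', 'doc_c']: doc_a, ranked first by both, wins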

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import FAISS

doc_list = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]

# Initialize the bm25 retriever and the faiss retriever; each returns two relevant documents
bm25_retriever = BM25Retriever.from_texts(doc_list)
bm25_retriever.k = 2

embedding = OpenAIEmbeddings()
faiss_vectorstore = FAISS.from_texts(doc_list, embedding)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

# Initialize the ensemble retriever, weighting bm25_retriever and faiss_retriever equally
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)
docs = ensemble_retriever.get_relevant_documents("apples")
docs
[Document(page_content='I like apples', metadata={}),
 Document(page_content='Apples and oranges are fruits', metadata={})]

7. Self-querying retriever

A self-querying retriever executes searches against a vector store from a natural-language query plus filter conditions. What sets it apart is that it uses an LLM to turn the user's input into a structured query: a semantic search string together with metadata filters, which it then runs against the vector store. This makes complex, filtered searches easy to express and yields more accurate, personalized results, with no need to write the filter expressions by hand.

7.1 Basic example

# The self-querying retriever requires the lark package
# !pip install lark chromadb
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Chroma

docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={
            "year": 1979,
            "director": "Andrei Tarkovsky",
            "genre": "thriller",
            "rating": 9.9,
        },
    ),
]

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

Now we can create the self-querying retriever, but first we need to describe the metadata fields of our documents and give a short description of the document contents.

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]

document_content_description = "Brief summary of a movie"
llm = ChatOpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
)
  • metadata_field_info is a list describing the metadata fields. Each field is represented by an AttributeInfo object with the following attributes:
    • name: the field's name, e.g. "genre", "year", "director", and "rating".
    • description: what the field means, e.g. the movie's genre, release year, director's name, and rating.
    • type: the field's data type, e.g. "string", "integer", and "float".

Calling SelfQueryRetriever.from_llm() with this information creates a SelfQueryRetriever. It can run searches against the vector store from natural-language queries and filter conditions and return the matching documents; in other words, we can search and filter in natural language without hand-writing filter expressions. Some examples:

  1. Specifying a filter only
# Retrieve movies rated above 8.5
retriever.invoke("I want to watch a movie rated higher than 8.5")
[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979}),
 Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006})]
  2. Specifying a query and a filter
# Has Greta Gerwig directed any movies about women?
retriever.invoke("Has Greta Gerwig directed any movies about women")
[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019})]
  3. Specifying a composite filter
# Retrieve science-fiction films rated above 8.5
retriever.invoke("What's a highly rated (above 8.5) science fiction film?")
[Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}),
 Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'year': 1979})]
  4. Specifying a query and a composite filter
# Retrieve movies about toys released after 1990 but before 2005, preferably animated
retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)
[Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995})]

Here the query is the semantic search string, while the filter is a condition used to further narrow the results.

7.2 Filter k

By setting enable_limit=True, the query itself can specify the number of documents k to retrieve:

retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    enable_limit=True,
)

# This example only specifies a relevant query
retriever.invoke("What are two movies about dinosaurs")
[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993}),
 Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995})]

7.3 Building it with LCEL

To see what is happening under the hood, and to gain more control, we can rebuild the retriever from scratch.

First we need a query-construction chain. This chain takes a user query and produces a StructuredQuery object that captures the filters the user specified. Helper functions are provided for creating the prompt and output parser; they have many tunable parameters, which we ignore here for simplicity.

from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)


prompt = get_query_constructor_prompt(
    document_content_description,
    metadata_field_info,
)
output_parser = StructuredQueryOutputParser.from_components()
query_constructor = prompt | llm | output_parser

Let's take a look at the prompt:

print(prompt.format(query="dummy question"))
Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte | contain | like | in | nin): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or | not): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling timestamp data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.

<< Example 1. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre

Structured Request:
```json
{
    "query": "teenager love",
    "filter": "and(or(eq(\"artist\", \"Taylor Swift\"), eq(\"artist\", \"Katy Perry\")), lt(\"length\", 180), eq(\"genre\", \"pop\"))"
}
```


<< Example 2. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs that were not published on Spotify

Structured Request:
```json
{
    "query": "",
    "filter": "NO_FILTER"
}
```


<< Example 3. >>
Data Source:
```json
{
    "content": "Brief summary of a movie",
    "attributes": {
    "genre": {
        "description": "The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        "type": "string"
    },
    "year": {
        "description": "The year the movie was released",
        "type": "integer"
    },
    "director": {
        "description": "The name of the movie director",
        "type": "string"
    },
    "rating": {
        "description": "A 1-10 rating for the movie",
        "type": "float"
    }
}
}
```

User Query:
dummy question

Structured Request:
Invoking the query constructor:

query_constructor.invoke(
    {
        "query": "What are some sci-fi movies from the 90's directed by Luc Besson about taxi drivers"
    }
)
StructuredQuery(query='taxi driver', filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction'), Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='year', value=1990), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2000)]), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Luc Besson')]), limit=None)

The query constructor is the key element of the self-query retriever. Making it work well usually means tuning the prompt, the examples within it, the attribute descriptions, and so on. For a walkthrough of refining a query constructor on hotel inventory data, see self_query_hotel_search.ipynb.

The other key element is the structured query translator, which translates the generic StructuredQuery object into the syntax of the specific vector store. For example, ChromaTranslator converts a StructuredQuery into Chroma's filter syntax, ensuring that SelfQueryRetriever executes its queries using Chroma's native syntax.

from langchain.retrievers.self_query.chroma import ChromaTranslator

retriever = SelfQueryRetriever(
    query_constructor=query_constructor,
    vectorstore=vectorstore,
    structured_query_translator=ChromaTranslator(),
)

retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated"
)
[Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995})]

8. Time-weighted vector store retriever

The time-weighted vector store retriever scores document relevance by combining semantic similarity with time decay. Its scoring formula is:

semantic_similarity + (1.0 - decay_rate) ^ hours_passed
  • semantic_similarity: the semantic similarity score
  • decay_rate: the decay rate
  • hours_passed: the number of hours since the document was last accessed (i.e. retrieved), not since it was created.

This keeps frequently accessed documents "fresh", so they retain high relevance scores.
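A quick numeric check with hypothetical values shows how much the decay rate dominates the recency term:

semantic_similarity = 0.8  # hypothetical similarity score
hours_passed = 24          # document last accessed a day ago

# Low decay rate: the recency term stays near 1, so the score stays high
low_decay_score = semantic_similarity + (1.0 - 1e-10) ** hours_passed   # ~1.8
# High decay rate: the recency term collapses toward 0 within hours
high_decay_score = semantic_similarity + (1.0 - 0.999) ** hours_passed  # ~0.8
print(low_decay_score, high_decay_score)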

8.1 Low decay rate

With a low decay rate (in the extreme, close to 0), a document's relevance score stays nearly constant over time, so frequently accessed documents remain easy to surface in the results.

import faiss

from datetime import datetime, timedelta
from langchain.docstore import InMemoryDocstore
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.schema import Document
from langchain.vectorstores import FAISS
# Define your embedding model
embeddings_model = OpenAIEmbeddings()
# Initialize the vectorstore as empty
embedding_size = 1536
index = faiss.IndexFlatL2(embedding_size)
vectorstore = FAISS(embeddings_model.embed_query, index, InMemoryDocstore({}), {})
retriever = TimeWeightedVectorStoreRetriever(vectorstore=vectorstore, decay_rate=.0000000000000000000000001, k=1)

yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents([Document(page_content="hello world", metadata={"last_accessed_at": yesterday})])
retriever.add_documents([Document(page_content="hello foo")])

# "Hello World" is returned first because it is most salient, and the decay rate is close to 0., meaning it's still recent enough
retriever.get_relevant_documents("hello world")
[Document(page_content='hello world', metadata={'last_accessed_at': datetime.datetime(2023, 5, 13, 21, 0, 27, 678341), 'created_at': datetime.datetime(2023, 5, 13, 21, 0, 27, 279596), 'buffer_idx': 0})]

In the code above, we initialize a TimeWeightedVectorStoreRetriever with a vectorstore, a decay_rate, and k, then add two documents via retriever.add_documents(). One of the documents carries last_accessed_at metadata recording when it was last accessed.

When we search for "hello world", that document was retrieved recently and the decay rate is close to 0, so it ranks first and is returned.

8.2 High decay rate

With a high decay rate (in the extreme, close to 1), a document's relevance score decays rapidly to a low value over time, so documents accessed longer ago tend to be filtered out of the results as "stale".

vectorstore = FAISS(embeddings_model.embed_query, index, InMemoryDocstore({}), {})
retriever = TimeWeightedVectorStoreRetriever(vectorstore=vectorstore, decay_rate=.999, k=1)

yesterday = datetime.now() - timedelta(days=1)
retriever.add_documents([Document(page_content="hello world", metadata={"last_accessed_at": yesterday})])
retriever.add_documents([Document(page_content="hello foo")])

# "Hello Foo" is returned first because "hello world" is mostly forgotten
retriever.get_relevant_documents("hello world")
[Document(page_content='hello foo', metadata={'last_accessed_at': datetime.datetime(2023, 4, 16, 22, 9, 2, 494798), 'created_at': datetime.datetime(2023, 4, 16, 22, 9, 2, 178722), 'buffer_idx': 1})]

8.3 Virtual time

Using some of LangChain's utils you can simulate virtual time, for example using the mock_now function to set the current time to 2011-02-03 10:11.

from langchain.utils import mock_now
import datetime

# Notice the last access time is that date time
with mock_now(datetime.datetime(2011, 2, 3, 10, 11)):
    print(retriever.get_relevant_documents("hello world"))
[Document(page_content='hello world', metadata={'last_accessed_at': MockDateTime(2011, 2, 3, 10, 11), 'created_at': datetime.datetime(2023, 5, 13, 21, 0, 27, 279596), 'buffer_idx': 0})]

9. WebResearchRetriever (omitted)
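The original post omits this section. As a rough sketch of how this retriever is typically wired up, based on the LangChain documentation of this era (GoogleSearchAPIWrapper assumes the GOOGLE_API_KEY and GOOGLE_CSE_ID environment variables are set):

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.web_research import WebResearchRetriever
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.vectorstores import Chroma

# Vector store used to cache embedded chunks of the fetched web pages
vectorstore = Chroma(
    embedding_function=OpenAIEmbeddings(), persist_directory="./chroma_db_oai"
)
llm = ChatOpenAI(temperature=0)
search = GoogleSearchAPIWrapper()

# Generates search queries with the LLM, runs the Google search, loads and
# splits the result pages, embeds them, and then runs similarity search
web_retriever = WebResearchRetriever.from_llm(
    vectorstore=vectorstore, llm=llm, search=search
)
docs = web_retriever.get_relevant_documents("How do LLM-powered autonomous agents work?")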

Source: https://blog.csdn.net/qq_56591814/article/details/135007557