自定义存储

默认情况下,LlamaIndex隐藏了复杂性,让您在不到5行代码中查询数据:

from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()
index = GPTVectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the documents.")

在幕后,LlamaIndex还支持可替换的存储层,允许您自定义摄取的文档(即Node对象),嵌入向量和索引元数据的存储位置。

低级API

为此,我们使用更低级的API,而不是高级API,

index = GPTVectorStoreIndex.from_documents(documents)

这样可以提供更细粒度的控制:

from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index.vector_stores import SimpleVectorStore
from llama_index.node_parser import SimpleNodeParser

# create parser and parse document into nodes
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(documents)

# create storage context using default stores
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
    index_store=SimpleIndexStore(),
)

# create (or load) docstore and add nodes
storage_context.docstore.add_documents(nodes)

# build index
index = GPTVectorStoreIndex(nodes, storage_context=storage_context)

# save index
index.storage_context.persist(persist_dir="<persist_dir>")

# can also set index_id to save multiple indexes to the same folder
index.set_index_id = "<index_id>"
index.storage_context.persist(persist_dir="<persist_dir>")

# to load index later, make sure you setup the storage context
# this will loaded the persisted stores from persist_dir
storage_context = StorageContext.from_defaults(
    persist_dir="<persist_dir>"
)

# then load the index object
from llama_index import load_index_from_storage
loaded_index = load_index_from_storage(storage_context)

# if loading an index from a persist_dir containing multiple indexes
loaded_index = load_index_from_storage(storage_context, index_id="<index_id>")

# if loading multiple indexes from a persist dir
loaded_indicies = load_index_from_storage(storage_context, index_ids=["<index_id>", ...])

您可以通过一行更改来自定义底层存储,以实例化不同的文档存储,索引存储和向量存储。请参阅大多数我们的向量存储集成都将整个索引(向量+文本)存储在向量存储本身中。这具有不必显式持久化索引的主要好处,因为向量存储已经托管并将数据持久化到我们的索引中。支持此做法的向量存储包括:ChatGPTRetrievalPluginClient、ChromaVectorStore、LanceDBVectorStore、MetalVectorStore、MilvusVectorStore、MyScaleVectorStore、OpensearchVectorStore、PineconeVectorStore、QdrantVectorStore、RedisVectorStore和WeaviateVectorStore。以Pinecone为例的一个小示例如下:导入pinecone,从llama_index中导入GPTVectorStoreIndex和SimpleDirectoryReader,从llama_index.vector_stores中导入PineconeVectorStore,创建Pinecone索引,定义特定于此向量索引的过滤器,构造向量存储,创建存储上下文,加载文档,创建索引,重建/加载索引。