Retriever Query Engine with Custom Retrievers - Simple Hybrid Search

In this tutorial, we show you how to define a very simple version of hybrid search!

Combine keyword lookup retrieval with vector retrieval using "AND" and "OR" conditions.

Setup

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import (
    GPTVectorStoreIndex, 
    GPTSimpleKeywordTableIndex, 
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext
)
from IPython.display import Markdown, display

Load Data

We first show how to convert a Document into a set of Nodes, and insert into a DocumentStore.

# load documents
documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()

# initialize service context (set chunk size)
service_context = ServiceContext.from_defaults(chunk_size_limit=1024)
node_parser = service_context.node_parser

nodes = node_parser.get_nodes_from_documents(documents)

# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

Define Vector Index and Keyword Table Index over Same Data

We build a vector index and keyword index over the same DocumentStore

vector_index = GPTVectorStoreIndex(nodes, storage_context=storage_context)
keyword_index = GPTSimpleKeywordTableIndex(nodes, storage_context=storage_context)

Define Custom Retriever

We now define a custom retriever class that can implement basic hybrid search with both keyword lookup and semantic search.

setting "AND" means we take the intersection of the two retrieved sets
setting "OR" means we take the union

# import QueryBundle
from llama_index import QueryBundle
# import NodeWithScore
from llama_index.data_structs import NodeWithScore
# Retrievers 
from llama_index.retrievers import BaseRetriever, VectorIndexRetriever, KeywordTableSimpleRetriever

from typing import List

class CustomRetriever(BaseRetriever):
    """Custom retriever that performs both semantic search and hybrid search."""
    
    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
        keyword_retriever: KeywordTableSimpleRetriever,
        mode: str = "AND"
    ) -> None:
        """Init params."""
        
        self._vector_retriever = vector_retriever
        self._keyword_retriever = keyword_retriever
        if mode not in ("AND", "OR"):
            raise ValueError("Invalid mode.")
        self._mode = mode
        
    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]: 
        """Retrieve nodes given query."""
        
        vector_nodes = self._vector_retriever.retrieve(query_bundle)
        keyword_nodes = self._keyword_retriever.retrieve(query_bundle)
        
        vector_ids = {n.node.get_doc_id() for n in vector_nodes}
        keyword_ids = {n.node.get_doc_id() for n in keyword_nodes}
        
        combined_dict = {n.node.get_doc_id(): n for n in vector_nodes}
        combined_dict.update({n.node.get_doc_id(): n for n in keyword_nodes})
        
        if self._mode == "AND":
            retrieve_ids = vector_ids.intersection(keyword_ids)
        else:
            retrieve_ids = vector_ids.union(keyword_ids)
            
        retrieve_nodes = [combined_dict[rid] for rid in retrieve_ids]
        return retrieve_nodes
        
        

Plugin Retriever into Query Engine

Plugin retriever into a query engine, and run some queries

from llama_index import ResponseSynthesizer
from llama_index.query_engine import RetrieverQueryEngine

# define custom retriever
vector_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=2)
keyword_retriever = KeywordTableSimpleRetriever(index=keyword_index)
custom_retriever = CustomRetriever(vector_retriever, keyword_retriever)

# define response synthesizer
response_synthesizer = ResponseSynthesizer.from_args()

# assemble query engine
custom_query_engine = RetrieverQueryEngine(
    retriever=custom_retriever,
    response_synthesizer=response_synthesizer,
)

# vector query engine
vector_query_engine = RetrieverQueryEngine(
    retriever=vector_retriever,
    response_synthesizer=response_synthesizer,
)
# keyword query engine
keyword_query_engine = RetrieverQueryEngine(
    retriever=keyword_retriever,
    response_synthesizer=response_synthesizer,
)

response = custom_query_engine.query("What did the author do during his time at YC?")

INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 12 tokens
> [retrieve] Total embedding token usage: 12 tokens
> [retrieve] Total embedding token usage: 12 tokens
INFO:llama_index.indices.keyword_table.retrievers:> Starting query: What did the author do during his time at YC?
> Starting query: What did the author do during his time at YC?
> Starting query: What did the author do during his time at YC?
INFO:llama_index.indices.keyword_table.retrievers:query keywords: ['time', 'yc', 'author']
query keywords: ['time', 'yc', 'author']
query keywords: ['time', 'yc', 'author']
INFO:llama_index.indices.keyword_table.retrievers:> Extracted keywords: ['time', 'yc']
> Extracted keywords: ['time', 'yc']
> Extracted keywords: ['time', 'yc']
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 2250 tokens
> [get_response] Total LLM token usage: 2250 tokens
> [get_response] Total LLM token usage: 2250 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens
> [get_response] Total embedding token usage: 0 tokens

print(response) 

The author worked on YC, wrote essays, hacked, and worked on a new version of Arc with Robert. He also organized a summer program for undergrads to start startups, funded a batch of 8 startups, and provided free air conditioners to the founders. He also noticed the advantages of funding startups in batches, such as the tight alumni community and the startups becoming each other's customers.

# hybrid search can allow us to not retrieve nodes that are irrelevant
# Yale is never mentioned in the essay
response = custom_query_engine.query("What did the author do during his time at Yale?")

print(str(response))
len(response.source_nodes)

None

# in contrast, vector search will return an answer
response = vector_query_engine.query("What did the author do during his time at Yale?")

print(str(response))
len(response.source_nodes)

The author attended Harvard for his PhD program in computer science and took art classes there. He then applied to the Rhode Island School of Design (RISD) for a Bachelor of Fine Arts (BFA) program and the Accademia di Belli Arti in Florence for an entrance exam. He eventually went to RISD and passed the entrance exam in Florence. During his time at Harvard, he also co-founded the Summer Founders Program, which invited undergrads to apply to start their own startups. He also worked on a new version of Arc with Robert Morris and wrote essays to promote the program.