SQL Index Guide (Setting context)

This guide is an extension to the core SQL guide. We show how you can additionally inject unstructured context through the SQLContextContainer. This unstructured context is used during the text-to-SQL prompt, which can better inform the LLM in generating the correct SQL query.

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index import GPTSQLStructStoreIndex, SQLDatabase, SimpleDirectoryReader, WikipediaReader, Document
from llama_index.indices.struct_store import SQLContextContainerBuilder
from IPython.display import Markdown, display

Load Wikipedia Data

We use our WikipediaReader to load in data from various cities.

# install wikipedia python package
!pip install wikipedia

wiki_docs = WikipediaReader().load_data(pages=['Toronto', 'Berlin', 'Tokyo'])

Create Database Schema

We use sqlalchemy, a popular SQL database toolkit, to create an empty city_stats Table

from sqlalchemy import create_engine, MetaData, Table, Column, String, Integer, select, column

engine = create_engine("sqlite:///:memory:")
metadata_obj = MetaData()

# create city SQL table
table_name = "city_stats"
city_stats_table = Table(
    table_name,
    metadata_obj,
    Column("city_name", String(16), primary_key=True),
    Column("population", Integer),
    Column("country", String(16), nullable=False),
)
metadata_obj.create_all(engine)

Build Index with Context

We then build the SQL Index (GPTSQLStructStoreIndex). We first define our SQLDatabase abstraction (a light wrapper around SQLAlchemy).

In this example, we also define a SQLContextContainerBuilder object, where we can set additional unstructured context and generate a ContextContainer.

from llama_index import GPTSQLStructStoreIndex, SQLDatabase
from llama_index.indices.struct_store import SQLContextContainerBuilder

sql_database = SQLDatabase(engine, include_tables=["city_stats"])

sql_database.table_info

NOTE: We can either set the context manually, or have the LLM extract the context for us.

Option 1: Manually set context

# manually set context text
city_stats_text = (
    "This table gives information regarding the population and country of a given city.\n"
    "The user will query with codewords, where 'foo' corresponds to population and 'bar'"
    "corresponds to city."
)
table_context_dict={"city_stats": city_stats_text}
context_builder = SQLContextContainerBuilder(sql_database, context_dict=table_context_dict)
context_container = context_builder.build_context_container()

index = GPTSQLStructStoreIndex.from_documents(
    wiki_docs, 
    sql_database=sql_database, 
    table_name="city_stats",
    sql_context_container=context_container
)

Option 2: Extract context using LLM

# extract context from a raw Document using LLM
city_stats_text = (
    "This table gives information regarding the population and country of a given city.\n"
)
context_documents_dict = {"city_stats": [Document(city_stats_text)]}
context_builder = SQLContextContainerBuilder.from_documents(
    context_documents_dict, 
    sql_database
)
context_container = context_builder.build_context_container()

index = GPTSQLStructStoreIndex.from_documents(
    wiki_docs, 
    sql_database=sql_database, 
    table_name="city_stats",
    sql_context_container=context_container,
)

# view current table
stmt = select(
    city_stats_table.c["city_name", "population", "country"]
).select_from(city_stats_table)

with engine.connect() as connection:
    results = connection.execute(stmt).fetchall()
    print(results)

Query Index

Here we show a natural language query, which is translated to a SQL query under the hood.

# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine(
    query_mode="nl"
)
response = query_engine.query("Which city has the highest population?")

Since we set the appropriate context, we can also use codewords during the NL query!

# set Logging to DEBUG for more detailed outputs
response = query_engine.query("Which bar has the highest foo?")

display(Markdown(f"<b>{response}</b>"))