Pandas Index

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index import SimpleDirectoryReader
from IPython.display import Markdown, display
from llama_index.indices.struct_store import GPTPandasIndex
import pandas as pd

Let's start with a Toy DataFrame

Here is a very simple DataFrame containing city and population pairs.

# Test on some sample data 
df = pd.DataFrame(
    {
        "city": ["Toronto", "Tokyo", "Berlin"], 
        "population": [2930000, 13960000, 3645000]
    }
)
index = GPTPandasIndex(df=df)
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
> [build_index_from_nodes] Total embedding token usage: 0 tokens
query_engine = index.as_query_engine(
    verbose=True
)
response = query_engine.query(
    "What is the city with the highest population?",
)
> Pandas Instructions:
```
df['city'][df['population'].idxmax()]
```
> Pandas Output: Tokyo
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 180 tokens
> [query] Total LLM token usage: 180 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens
> [query] Total embedding token usage: 0 tokens
display(Markdown(f"<b>{response}</b>"))

Tokyo

# get pandas python instructions
print(response.extra_info["pandas_instruction_str"])
df['city'][df['population'].idxmax()]
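
The instruction string is plain pandas code that the query engine evaluated against the DataFrame. As a rough sketch (assuming `df` and `response` from the cells above are still in scope), we can re-run it directly to confirm the answer; the eval call here is only an illustration, not necessarily the library's exact execution path.

# Re-evaluate the generated pandas instruction against the DataFrame
# (illustrative only; assumes `df` and `response` are still in scope).
instruction = response.extra_info["pandas_instruction_str"]
result = eval(instruction, {"df": df, "pd": pd})
print(result)  # expected: Tokyo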

Analyzing the Titanic Dataset

The Titanic dataset is one of the most popular tabular datasets in introductory machine learning. Source: https://www.kaggle.com/c/titanic
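
Before handing the full dataset to the index, a quick look at its shape and columns is useful; the sketch below assumes the same local titanic_train.csv file loaded in the next cell.

# Quick inspection of the Titanic CSV (assumes titanic_train.csv is available locally).
preview_df = pd.read_csv("titanic_train.csv")
print(preview_df.shape)
print(preview_df.columns.tolist())
print(preview_df[["Survived", "Age"]].head())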

df = pd.read_csv("titanic_train.csv")
index = GPTPandasIndex(df=df)
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
> [build_index_from_nodes] Total embedding token usage: 0 tokens
query_engine = index.as_query_engine(
    verbose=True
)
response = query_engine.query(
    "What is the correlation between survival and age?",
)
> Pandas Instructions:
```
df['Survived'].corr(df['Age'])
```
> Pandas Output: -0.07722109457217755
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 829 tokens
> [query] Total LLM token usage: 829 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 0 tokens
> [query] Total embedding token usage: 0 tokens
display(Markdown(f"<b>{response}</b>"))

-0.07722109457217755

# get pandas python instructions
print(response.extra_info["pandas_instruction_str"])
df['Survived'].corr(df['Age'])
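
As a sanity check, the same correlation can be reproduced with plain pandas; a minimal sketch, assuming `df` still holds the Titanic DataFrame loaded above.

# Reproduce the LLM-generated instruction directly.
# Series.corr() skips rows with missing Age values by default.
direct_corr = df["Survived"].corr(df["Age"])
print(direct_corr)  # expected to match the query output (~ -0.077)

# Survival rate by age band gives a more interpretable view of the same weak
# negative relationship (bin edges chosen arbitrarily for illustration).
age_bands = pd.cut(df["Age"], bins=[0, 12, 18, 40, 60, 80])
print(df.groupby(age_bands)["Survived"].mean())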