Knowledge Graph Index

# My OpenAI Key
import os
os.environ['OPENAI_API_KEY'] = "INSERT OPENAI KEY"
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

Building the Knowledge Graph

from llama_index import SimpleDirectoryReader, LLMPredictor, ServiceContext
from llama_index.indices.knowledge_graph.base import GPTKnowledgeGraphIndex
from langchain import OpenAI
from IPython.display import Markdown, display
documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()
# define LLM
# NOTE: at the time of demo, text-davinci-002 did not have rate-limit errors
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-002"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)
# NOTE: can take a while! 
new_index = GPTKnowledgeGraphIndex.from_documents(
    documents, 
    max_triplets_per_chunk=2,
    service_context=service_context
)
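Conceptually, the index extracts up to `max_triplets_per_chunk` (subject, predicate, object) triplets from each text chunk and files them under their subject keyword. A minimal pure-Python sketch of that store (all names here are hypothetical illustrations, not llama_index APIs):

```python
# Minimal sketch of a triplet store like the one the knowledge graph index builds.
# TinyTripletStore and its methods are hypothetical, not llama_index APIs.
from collections import defaultdict

class TinyTripletStore:
    def __init__(self, max_triplets_per_chunk=2):
        self.max_triplets_per_chunk = max_triplets_per_chunk
        # subject keyword -> list of (subject, predicate, object) triplets
        self.by_subject = defaultdict(list)

    def add_chunk_triplets(self, triplets):
        # Cap extraction per chunk, mirroring max_triplets_per_chunk above.
        for subj, pred, obj in triplets[: self.max_triplets_per_chunk]:
            self.by_subject[subj].append((subj, pred, obj))

    def lookup(self, keyword):
        return self.by_subject.get(keyword, [])

store = TinyTripletStore(max_triplets_per_chunk=2)
store.add_chunk_triplets([
    ("Interleaf", "made software for", "creating documents"),
    ("Interleaf", "added", "scripting language"),
    ("software", "generate", "web sites"),  # dropped: over the per-chunk cap
])
print(store.lookup("Interleaf"))
```

With the cap set to 2, the third triplet from the chunk is discarded, which is why a low `max_triplets_per_chunk` trades recall for a faster, cheaper build.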

[Optional] Try building the graph and manually adding triplets!

Querying the Knowledge Graph

query_engine = new_index.as_query_engine(
    include_text=False, 
    response_mode="tree_summarize"
)
response = query_engine.query(
    "Tell me more about Interleaf", 
)
INFO:root:> Starting query: Tell me more about Interleaf
INFO:root:> Query keywords: ['history', 'company', 'Interleaf', 'software']
ERROR:root:Index was not constructed with embeddings, skipping embedding usage...
INFO:root:> Extracted relationships: The following are knowledge triplets in the form of (subset, predicate, object):
('Interleaf', 'made software for', 'creating documents')
('Interleaf', 'added', 'scripting language')
('software', 'generate', 'web sites')
INFO:root:> Building index from nodes: 0 chunks
INFO:root:> [query] Total LLM token usage: 312 tokens
INFO:root:> [query] Total embedding token usage: 0 tokens
display(Markdown(f"<b>{response}</b>"))
Interleaf was a software company that made software for creating documents. They later added a scripting language to their software, which allowed users to generate web sites.
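With `include_text=False`, only the matched triplets (not the source chunks) are passed to the LLM. A sketch of the keyword-matching retrieval step, using the triplets and the query keywords from the log above (hypothetical illustration, not the llama_index implementation):

```python
# Sketch of keyword-based triplet retrieval: look each extracted query keyword
# up as a subject and collect its triplets, deduplicating. Hypothetical names.
triplets_by_subject = {
    "Interleaf": [
        ("Interleaf", "made software for", "creating documents"),
        ("Interleaf", "added", "scripting language"),
    ],
    "software": [("software", "generate", "web sites")],
}

def retrieve_triplets(query_keywords):
    seen, results = set(), []
    for kw in query_keywords:
        for trip in triplets_by_subject.get(kw, []):
            if trip not in seen:
                seen.add(trip)
                results.append(trip)
    return results

# Keywords as extracted in the log: ['history', 'company', 'Interleaf', 'software']
context = retrieve_triplets(["history", "company", "Interleaf", "software"])
print(context)
```

Keywords with no matching subject ("history", "company") simply contribute nothing; the three matched triplets become the LLM's entire context, which is why the answer stays close to the graph.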
query_engine = new_index.as_query_engine(
    include_text=True, 
    response_mode="tree_summarize"
)
response = query_engine.query(
    "Tell me more about what the author worked on at Interleaf", 
)
INFO:root:> Starting query: Tell me more about what the author worked on at Interleaf
INFO:root:> Query keywords: ['author', 'Interleaf', 'work']
ERROR:root:Index was not constructed with embeddings, skipping embedding usage...
INFO:root:> Querying with idx: ed39a830-a116-41b9-a551-bdd348dba61d: life, we aren't consciously aware of much we're seeing. Most visual perceptio...
INFO:root:> Querying with idx: fa1cfbb9-782b-4352-b610-cdae080b8f4f: painting that looks like a certain kind of cartoon, you know it's by Roy Lich...
INFO:root:> Extracted relationships: The following are knowledge triplets in the form of (subset, predicate, object):
('Interleaf', 'made software for', 'creating documents')
('Interleaf', 'added', 'scripting language')
INFO:root:> Building index from nodes: 0 chunks
INFO:root:> [query] Total LLM token usage: 1254 tokens
INFO:root:> [query] Total embedding token usage: 0 tokens
display(Markdown(f"<b>{response}</b>"))
The author worked on software that allowed users to create documents, similar to Microsoft Word. The software also had a scripting language that was based on Lisp.

Query with embeddings

# NOTE: can take a while! 
new_index = GPTKnowledgeGraphIndex.from_documents(
    documents, 
    max_triplets_per_chunk=2,
    service_context=service_context,
    include_embeddings=True
)
INFO:root:> [build_index_from_documents] Total LLM token usage: 24724 tokens
INFO:root:> [build_index_from_documents] Total embedding token usage: 1216 tokens
# query using top 5 triplets plus keywords (duplicate triplets are removed)
query_engine = new_index.as_query_engine(
    include_text=True, 
    response_mode="tree_summarize",
    embedding_mode='hybrid',
    similarity_top_k=5
)
response = query_engine.query(
    "Tell me more about what the author worked on at Interleaf", 
)
INFO:root:> Starting query: Tell me more about what the author worked on at Interleaf
INFO:root:> Query keywords: ['author', 'Interleaf', 'work']
INFO:root:> Querying with idx: 16514e26-ef09-404a-87ec-71492ffaeafd: that I could write essays again, I wrote a bunch about topics I'd had stacked...
INFO:root:> Querying with idx: e9978fb2-df4c-4d20-8c33-1a0d8a6f2ef9: went in 1988 to visit Rich Draves at CMU, where he was in grad school. One da...
INFO:root:> Querying with idx: ec114363-5cf9-4204-94ab-b7117e3475c4: me.

So I tried to paint, but I just didn't seem to have any energy or ambiti...
INFO:root:> Querying with idx: 59b1330b-95d8-41d0-8e89-b020847ccf70: gradually dragged me down. After a few months it felt disconcertingly like wo...
INFO:root:> Querying with idx: 7f832bd0-5aed-49f9-80b7-5e39c3fc39e8: 		

What I Worked On

February 2021

Before college the two main things I wor...
INFO:root:> Querying with idx: dcb05717-1d6f-40a8-8748-ec4ffc75e77b: from writing essays during most of this time, or I'd never have finished. In ...
INFO:root:> Querying with idx: f311317c-6794-4b7e-918d-e68c82747a0e: making paintings and living in New York.

I was nervous about money, because ...
INFO:root:> Querying with idx: 4bbb384a-e989-4d24-8710-22b547d1686e: had not.) So although Robert had his graduate student stipend, I needed that ...
INFO:root:> Querying with idx: f045cfbf-5b62-4717-bfe7-b7a6c9c493f3: urls showed that someone had posted it on Slashdot. [10]

Wow, I thought, the...
INFO:root:> Querying with idx: b314b8fa-0502-40a5-a318-6b2b4b413cbc: taking philosophy courses and they kept being boring. So I decided to switch ...
INFO:root:> Extracted relationships: The following are knowledge triplets in the form of (subset, predicate, object):
('I', 'wrote', 'essays')
('I', 'wrote', 'programs')
('I', 'wrote', 'short stories')
('Paul Graham', 'worked on', 'Bel')
('Interleaf', 'got crushed by', "Moore's Law")
INFO:root:> Building index from nodes: 1 chunks
INFO:root:> [query] Total LLM token usage: 5607 tokens
INFO:root:> [query] Total embedding token usage: 12 tokens
display(Markdown(f"<b>{response}</b>"))
The author worked on software at Interleaf, specifically the Interleaf 6 program. This program was a sort of graphical user interface for Unix that featured windows and a mouse. The author was the lead engineer on the team that ported the program to the Macintosh, which required rewriting a lot of code and designing a new user interface. In addition to this, the author wrote code for the Interleaf 6 Lisp system, which was used to extend the program.
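The embedding half of `embedding_mode='hybrid'` ranks triplets by similarity to the query embedding and keeps the top `similarity_top_k`, merging them with the keyword matches. A toy sketch with 3-d stand-in vectors (the real index uses OpenAI embeddings; nothing here is llama_index API):

```python
# Sketch of embedding-based triplet retrieval: rank stored triplets by cosine
# similarity to a query embedding and keep the top-k. Toy 3-d vectors only.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

triplet_embeddings = {
    ("Interleaf", "made software for", "creating documents"): [0.9, 0.1, 0.0],
    ("Interleaf", "got crushed by", "Moore's Law"): [0.7, 0.3, 0.1],
    ("I", "wrote", "essays"): [0.0, 0.2, 0.9],
}

def top_k_triplets(query_emb, k=2):
    ranked = sorted(
        triplet_embeddings,
        key=lambda t: cosine(query_emb, triplet_embeddings[t]),
        reverse=True,
    )
    return ranked[:k]

query_emb = [1.0, 0.2, 0.0]  # toy stand-in for the embedded query
print(top_k_triplets(query_emb, k=2))
```

This is why the hybrid query above surfaced triplets like ('Interleaf', 'got crushed by', "Moore's Law") that exact keyword matching alone would miss.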

Visualizing the Graph

## create graph
from pyvis.network import Network

g = new_index.get_networkx_graph()
net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(g)
net.show("example.html")

[Optional] Try building the graph and manually adding triplets!

from llama_index.node_parser import SimpleNodeParser
node_parser = SimpleNodeParser()
nodes = node_parser.get_nodes_from_documents(documents)
# initialize an empty index for now 
index = GPTKnowledgeGraphIndex(
    [],
    service_context=service_context,
)
# add keyword mappings and nodes manually
# add triplets (subject, relationship, object) 

# for node 0
node_0_tups = [("author", "worked on", "writing"), ("author", "worked on", "programming")]
for tup in node_0_tups:
    index.upsert_triplet_and_node(tup, nodes[0])
    
# for node 1
node_1_tups = [
    ('Interleaf', 'made software for', 'creating documents'),
    ('Interleaf', 'added', 'scripting language'),
    ('software', 'generate', 'web sites')
]
for tup in node_1_tups:
    index.upsert_triplet_and_node(tup, nodes[1])
query_engine = index.as_query_engine(
    include_text=False, 
    response_mode="tree_summarize"
)
response = query_engine.query(
    "Tell me more about Interleaf", 
)
str(response)
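In spirit, each `upsert_triplet_and_node` call records the triplet and maps its subject keyword to the backing node, so a later query for that subject can pull both. A pure-Python sketch of that bookkeeping (class and attribute names are hypothetical, not the llama_index internals):

```python
# Sketch of what upsert_triplet_and_node amounts to: store the triplet and
# map its subject keyword to the backing node's id. Hypothetical names only.
from collections import defaultdict

class TinyKGIndex:
    def __init__(self):
        self.triplets = []
        self.subject_to_node_ids = defaultdict(set)

    def upsert_triplet_and_node(self, triplet, node_id):
        if triplet not in self.triplets:  # upsert: skip exact duplicates
            self.triplets.append(triplet)
        self.subject_to_node_ids[triplet[0]].add(node_id)

kg = TinyKGIndex()
kg.upsert_triplet_and_node(("author", "worked on", "writing"), "node-0")
kg.upsert_triplet_and_node(("author", "worked on", "programming"), "node-0")
kg.upsert_triplet_and_node(("Interleaf", "added", "scripting language"), "node-1")
print(len(kg.triplets), sorted(kg.subject_to_node_ids))
```

Because the subject-to-node mapping is kept alongside the triplets, a query engine built on this index can answer from triplets alone (`include_text=False`) or also fetch the mapped node text (`include_text=True`).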