
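Note: Our data connectors are now offered through `LlamaHub <https://llamahub.ai/>`_ 🦙. LlamaHub is an open-source repository containing data loaders that you can easily plug into any LlamaIndex application.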


Data Connectors for LlamaIndex.

This module contains the data connectors for LlamaIndex. Each connector inherits from a BaseReader class, connects to a data source, and loads Document objects from that data source.

You may also construct Document objects manually, as shown in our Insert How-To Guide. See below for the API definition of a Document; the bare minimum is a text property.

class llama_index.readers.BeautifulSoupWebReader(website_extractor: Optional[Dict[str, Callable]] = None)

BeautifulSoup web page reader.

Reads pages from the web. Requires the bs4 and urllib packages.


website_extractor (Optional[Dict[str, Callable]]) -- A mapping of website hostname (e.g. google.com) to a function that specifies how to extract text from the BeautifulSoup object. See DEFAULT_WEBSITE_EXTRACTOR.

load_data(urls: List[str], custom_hostname: Optional[str] = None) List[Document]

Load data from the urls.

  • urls (List[str]) -- List of URLs to scrape.

  • custom_hostname (Optional[str]) -- Force a certain hostname in the case a website is displayed under custom URLs (e.g. Substack blogs)


List of documents.
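Because website_extractor is keyed by hostname, it can help to check which hostname a URL carries before registering an extractor. The sketch below uses only the standard library. `hostname_of` and `extractor_key` are hypothetical helpers, the "substack.com" key is an assumed example, and how the reader matches hostnames internally is not specified here, so this illustrates the idea rather than the library's logic.

```python
from typing import Optional
from urllib.parse import urlparse


def hostname_of(url: str) -> Optional[str]:
    """Return the hostname part of a URL, e.g. 'google.com'."""
    return urlparse(url).hostname


def extractor_key(url: str, custom_hostname: Optional[str] = None) -> Optional[str]:
    """Pick the mapping key: a forced hostname wins over the URL's own (assumed behavior)."""
    return custom_hostname or hostname_of(url)


# A blog served under its own domain could be looked up under another
# key by forcing the hostname, which is what custom_hostname is for.
print(extractor_key("https://google.com/search?q=llama"))
print(extractor_key("https://blog.example.org/p/post", custom_hostname="substack.com"))
```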



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.ChatGPTRetrievalPluginReader(endpoint_url: str, bearer_token: Optional[str] = None, retries: Optional[Retry] = None, batch_size: int = 100)

ChatGPT Retrieval Plugin reader.

load_data(query: str, top_k: int = 10, separate_documents: bool = True, **kwargs: Any) List[Document]

Load data from ChatGPT Retrieval Plugin.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.ChromaReader(collection_name: str, persist_directory: Optional[str] = None, host: str = 'localhost', port: int = 8000)

Chroma reader.

Retrieve documents from existing persisted Chroma collections.

  • collection_name -- Name of the persisted collection.

  • persist_directory -- Directory where the collection is persisted.

create_documents(results: Any) List[Document]

Create documents from the results.


results -- Results from the query.


List of documents.

load_data(query_embedding: Optional[List[float]] = None, limit: int = 10, where: Optional[dict] = None, where_document: Optional[dict] = None, query: Optional[Union[str, List[str]]] = None) Any

Load data from the collection.

  • limit -- Number of results to return.

  • where -- Filter results by metadata. {"metadata_field": "is_equal_to_this"}

  • where_document -- Filter results by document. {"$contains":"search_string"}


List of documents.
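A minimal sketch of the two filter shapes described above, assuming a persisted collection already exists. The collection name, directory, and the "source" metadata field are hypothetical. The reader import is kept inside the function so the filter dictionaries can be inspected without llama_index or chromadb installed.

```python
from typing import Any, Dict, List

# Metadata filter: keep entries whose metadata field equals the given value.
where: Dict[str, Any] = {"source": "handbook"}  # "source" is a hypothetical field

# Document filter: keep entries whose text contains the search string.
where_document: Dict[str, Any] = {"$contains": "vacation policy"}


def query_chroma(query_embedding: List[float]) -> Any:
    """Run a filtered query against a persisted collection."""
    from llama_index.readers import ChromaReader  # imported lazily on purpose

    reader = ChromaReader(
        collection_name="my_collection",  # hypothetical collection name
        persist_directory="./chroma_db",  # hypothetical path
    )
    return reader.load_data(
        query_embedding=query_embedding,
        limit=5,
        where=where,
        where_document=where_document,
    )
```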

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.DeepLakeReader(token: Optional[str] = None)

DeepLake reader.

Retrieve documents from existing DeepLake datasets.


dataset_name -- Name of the deeplake dataset.

load_data(query_vector: List[float], dataset_path: str, limit: int = 4, distance_metric: str = 'l2') List[Document]

Load data from DeepLake.

  • dataset_path (str) -- Path of the DeepLake dataset.

  • query_vector (List[float]) -- Query vector.

  • limit (int) -- Number of results to return.


A list of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.DiscordReader(discord_token: Optional[str] = None)

Discord reader.

Reads conversations from channels.


discord_token (Optional[str]) -- Discord token. If not provided, we assume the environment variable DISCORD_TOKEN is set.

load_data(channel_ids: List[int], limit: Optional[int] = None, oldest_first: bool = True) List[Document]

Load data from the given Discord channels.

  • channel_ids (List[int]) -- List of channel ids to read.

  • limit (Optional[int]) -- Maximum number of messages to read.

  • oldest_first (bool) -- Whether to read oldest messages first. Defaults to True.


List of documents.
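The token fallback described for discord_token can be sketched as follows. `resolve_discord_token` and `load_channels` are hypothetical helpers written for illustration and are not part of the library. The reader import is deferred so the token logic runs without llama_index installed.

```python
import os
from typing import List, Optional


def resolve_discord_token(discord_token: Optional[str] = None) -> str:
    """Return the explicit token, else DISCORD_TOKEN from the environment."""
    token = discord_token or os.environ.get("DISCORD_TOKEN")
    if not token:
        raise ValueError("Pass discord_token or set DISCORD_TOKEN.")
    return token


def load_channels(channel_ids: List[int], limit: Optional[int] = 100):
    """Read at most `limit` messages, oldest first."""
    from llama_index.readers import DiscordReader  # imported lazily on purpose

    reader = DiscordReader(discord_token=resolve_discord_token())
    return reader.load_data(channel_ids=channel_ids, limit=limit, oldest_first=True)
```

Note that channel_ids are integers here, unlike the string channel IDs some other chat readers take.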



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.Document(text: Optional[str] = None, doc_id: Optional[str] = None, embedding: Optional[List[float]] = None, doc_hash: Optional[str] = None, extra_info: Optional[Dict[str, Any]] = None)

Generic interface for a data document.

This document connects to data sources.

doc_hash: Optional[str] = None

Metadata fields: injected as part of the text shown to LLMs as context, and used by vector DBs for metadata filtering.

This must be a flat dictionary with only str keys and (str, int, float) values.
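The flat-dictionary rule can be checked before a Document is built. The helper below is a hypothetical sketch of that rule, not a function provided by the library.

```python
from typing import Any, Dict


def is_flat_metadata(extra_info: Dict[Any, Any]) -> bool:
    """True if every key is a str and every value is a str, int, or float."""
    # Note: bool is a subclass of int in Python, so True/False also pass.
    return all(
        isinstance(key, str) and isinstance(value, (str, int, float))
        for key, value in extra_info.items()
    )


# Flat dictionary: satisfies the rule.
print(is_flat_metadata({"author": "Ada", "year": 1843, "score": 0.9}))
# Nested dictionary value: violates the rule.
print(is_flat_metadata({"author": {"first": "Ada"}}))
```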

property extra_info_str: Optional[str]

Extra info string.

classmethod from_langchain_format(doc: Document) Document

Convert struct from LangChain document format.

get_doc_hash() str

Get doc_hash.

get_doc_id() str

Get doc_id.

get_embedding() List[float]

Get embedding.

Errors if embedding is None.

get_text() str

Get text.

classmethod get_type() str

Get Document type.

classmethod get_types() List[str]

Get Document types.

property is_doc_id_none: bool

Check if doc_id is None.

property is_text_none: bool

Check if text is None.

to_langchain_format() Document

Convert struct to LangChain document format.

class llama_index.readers.ElasticsearchReader(endpoint: str, index: str, httpx_client_args: Optional[dict] = None)

Read documents from an Elasticsearch/Opensearch index.

These documents can then be used in a downstream LlamaIndex data structure.

  • endpoint (str) -- URL (http/https) of cluster

  • index (str) -- Name of the index (required)

  • httpx_client_args (dict) -- Optional additional args to pass to the httpx.Client

load_data(field: str, query: Optional[dict] = None, embedding_field: Optional[str] = None) List[Document]

Read data from the Elasticsearch index.

  • field (str) -- Field in the document to retrieve text from

  • query (Optional[dict]) -- Elasticsearch JSON query DSL object. For example: {"query": {"match": {"message": {"query": "this is a test"}}}}

  • embedding_field (Optional[str]) -- If there are embeddings stored in this index, this field can be used to set the embedding field on the returned Document list.


A list of documents.
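A short sketch of building the query DSL object shown above and passing it to load_data. `match_query` is a hypothetical helper, and the endpoint and index name are placeholders. The reader import is deferred so the query can be built and inspected without a cluster.

```python
import json
from typing import Any, Dict, List, Optional


def match_query(field: str, text: str) -> Dict[str, Any]:
    """Build a query DSL object that matches `text` against `field`."""
    return {"query": {"match": {field: {"query": text}}}}


query = match_query("message", "this is a test")
print(json.dumps(query))


def load_from_elasticsearch(embedding_field: Optional[str] = None) -> List[Any]:
    """Fetch matching documents, taking text from the 'message' field."""
    from llama_index.readers import ElasticsearchReader  # imported lazily on purpose

    reader = ElasticsearchReader(
        endpoint="http://localhost:9200",  # placeholder cluster URL
        index="logs",                      # placeholder index name
    )
    return reader.load_data(
        field="message", query=query, embedding_field=embedding_field
    )
```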



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.FaissReader(index: Any)

Faiss reader.

Retrieves documents through an existing in-memory Faiss index. These documents can then be used in a downstream LlamaIndex data structure. If you wish to use Faiss itself as an index to organize documents, insert documents, and perform queries on them, please use GPTVectorStoreIndex with FaissVectorStore.


index (faiss.Index) -- A Faiss Index object (required)

load_data(query: ndarray, id_to_text_map: Dict[str, str], k: int = 4, separate_documents: bool = True) List[Document]

Load data from Faiss.

  • query (np.ndarray) -- A 2D numpy array of query vectors.

  • id_to_text_map (Dict[str, str]) -- A map from IDs to text.

  • k (int) -- Number of nearest neighbors to retrieve. Defaults to 4.

  • separate_documents (Optional[bool]) -- Whether to return separate documents. Defaults to True.


A list of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.GithubRepositoryReader(owner: str, repo: str, use_parser: bool = True, verbose: bool = False, github_token: Optional[str] = None, concurrent_requests: int = 5, ignore_file_extensions: Optional[List[str]] = None, ignore_directories: Optional[List[str]] = None)

Github repository reader.

Retrieves the contents of a Github repository and returns a list of documents. The documents are either the contents of the files in the repository or the text extracted from the files using the parser.


>>> reader = GithubRepositoryReader("owner", "repo")
>>> branch_documents = reader.load_data(branch="branch")
>>> commit_documents = reader.load_data(commit_sha="commit_sha")

load_data(commit_sha: Optional[str] = None, branch: Optional[str] = None) List[Document]

Load data from a commit or a branch.

Loads github repository data from a specific commit sha or a branch.

  • commit_sha -- commit sha

  • branch -- branch name


List of documents.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.GoogleDocsReader

Google Docs reader.

Reads a page from Google Docs.

load_data(document_ids: List[str]) List[Document]

Load data from the given Google Docs documents.


document_ids (List[str]) -- a list of document ids.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.JSONReader(levels_back: Optional[int] = None, collapse_length: Optional[int] = None)

JSON reader.

Reads JSON documents with options to help suss out relationships between nodes.

  • levels_back (int) -- the number of levels to go back in the JSON tree, 0 if you want all levels. If levels_back is None, then we just format the JSON and make each line an embedding.

  • collapse_length (int) -- the maximum number of characters a JSON fragment would be collapsed in the output. Ex: if collapse_length = 10, and input is {a: [1, 2, 3], b: {"hello": "world", "foo": "bar"}}, then a would be collapsed into one line, while b would not. Recommend starting around 100 and then adjusting from there.
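The collapse_length example can be reproduced with a small standalone sketch. This is not the reader's implementation. It assumes a fragment is collapsed when its serialized JSON text is no longer than collapse_length, and `fits_on_one_line` is a hypothetical helper.

```python
import json
from typing import Any


def fits_on_one_line(fragment: Any, collapse_length: int) -> bool:
    """True if the fragment's JSON text is at most collapse_length characters."""
    return len(json.dumps(fragment)) <= collapse_length


data = {"a": [1, 2, 3], "b": {"hello": "world", "foo": "bar"}}

for key, fragment in data.items():
    collapsed = fits_on_one_line(fragment, collapse_length=10)
    print(key, "collapsed" if collapsed else "expanded")
```

With collapse_length = 10, the list under a serializes to 9 characters and fits, while the object under b does not.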

load_data(input_file: str) List[Document]

Load data from the input file.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MakeWrapper

Make reader.

load_data(*args: Any, **load_kwargs: Any) List[Document]

Load data from the input directory.

NOTE: This is not implemented.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pass_response_to_webhook(webhook_url: str, response: Response, query: Optional[str] = None) None

Pass response object to webhook.

  • webhook_url (str) -- Webhook URL.

  • response (Response) -- Response object.

  • query (Optional[str]) -- Query. Defaults to None.

class llama_index.readers.MboxReader

Mbox e-mail reader.

Reads a set of e-mails saved in the mbox format.

load_data(input_dir: str, **load_kwargs: Any) List[Document]

Load data from the input directory.


  • max_count (int) -- Maximum number of messages to read.

  • message_format (str) -- Message format overriding default.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MetalReader(api_key: str, client_id: str, index_id: str)

Metal reader.

  • api_key (str) -- Metal API key.

  • client_id (str) -- Metal client ID.

  • index_id (str) -- Metal index ID.

load_data(limit: int, query_embedding: Optional[List[float]] = None, filters: Optional[Dict[str, Any]] = None, separate_documents: bool = True, **query_kwargs: Any) List[Document]

Load data from Metal.

  • query_embedding (Optional[List[float]]) -- Query embedding for search.

  • limit (int) -- Number of results to return.

  • filters (Optional[Dict[str, Any]]) -- Filters to apply to the search.

  • separate_documents (Optional[bool]) -- Whether to return separate documents per retrieved entry. Defaults to True.

  • **query_kwargs -- Keyword arguments to pass to the search.


A list of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MilvusReader(host: str = 'localhost', port: int = 19530, user: str = '', password: str = '', use_secure: bool = False)

Milvus reader.

load_data(query_vector: List[float], collection_name: str, expr: Optional[Any] = None, search_params: Optional[dict] = None, limit: int = 10) List[Document]

Load data from Milvus.

  • collection_name (str) -- Name of the Milvus collection.

  • query_vector (List[float]) -- Query vector.

  • limit (int) -- Number of results to return.


A list of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MyScaleReader(myscale_host: str, username: str, password: str, myscale_port: Optional[int] = 8443, database: str = 'default', table: str = 'llama_index', index_type: str = 'IVFLAT', metric: str = 'cosine', batch_size: int = 32, index_params: Optional[dict] = None, search_params: Optional[dict] = None, **kwargs: Any)

MyScale reader.

  • myscale_host (str) -- A URL to connect to the MyScale backend.

  • username (str) -- Username to log in.

  • password (str) -- Password to log in.

  • myscale_port (int) -- URL port to connect with HTTP. Defaults to 8443.

  • database (str) -- Database name to find the table. Defaults to 'default'.

  • table (str) -- Table name to operate on. Defaults to 'llama_index'.

  • index_type (str) -- Index type string. Defaults to "IVFLAT".

  • metric (str) -- Metric to compute distance, supported are ('l2', 'cosine', 'ip'). Defaults to 'cosine'

  • batch_size (int, optional) -- the size of documents to insert. Defaults to 32.

  • index_params (dict, optional) -- The index parameters for MyScale. Defaults to None.

  • search_params (dict, optional) -- The search parameters for a MyScale query. Defaults to None.

load_data(query_vector: List[float], where_str: Optional[str] = None, limit: int = 10) List[Document]

Load data from MyScale.

  • query_vector (List[float]) -- Query vector.

  • where_str (Optional[str], optional) -- where condition string. Defaults to None.

  • limit (int) -- Number of results to return.


A list of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.NotionPageReader(integration_token: Optional[str] = None)

Notion Page reader.

Reads a set of Notion pages.


integration_token (str) -- Notion integration token.

load_data(page_ids: List[str] = [], database_id: Optional[str] = None) List[Document]

Load data from the given Notion pages.


page_ids (List[str]) -- List of page ids to load.


List of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

query_database(database_id: str, query_dict: Dict[str, Any] = {}) List[str]

Get all the pages from a Notion database.

read_page(page_id: str) str

Read a page.

search(query: str) List[str]

Search Notion page given a text query.

class llama_index.readers.ObsidianReader(input_dir: str)

Utilities for loading data from an Obsidian Vault.


input_dir (str) -- Path to the vault.

load_data(*args: Any, **load_kwargs: Any) List[Document]

Load data from the input directory.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.PineconeReader(api_key: str, environment: str)

Pinecone reader.

  • api_key (str) -- Pinecone API key.

  • environment (str) -- Pinecone environment.

load_data(index_name: str, id_to_text_map: Dict[str, str], vector: Optional[List[float]], top_k: int, separate_documents: bool = True, include_values: bool = True, **query_kwargs: Any) List[Document]

Load data from Pinecone.

  • index_name (str) -- Name of the index.

  • id_to_text_map (Dict[str, str]) -- A map from IDs to text.

  • separate_documents (Optional[bool]) -- Whether to return separate documents per retrieved entry. Defaults to True.

  • vector (List[float]) -- Query vector.

  • top_k (int) -- Number of results to return.

  • include_values (bool) -- Whether to include the embedding in the response. Defaults to True.

  • **query_kwargs -- Keyword arguments to pass to the query. Arguments are the exact same as those found in Pinecone's reference documentation for the query method.


A list of documents.
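A minimal sketch of preparing id_to_text_map and calling load_data. The "doc-N" ID scheme and the index name are hypothetical. The example assumes the keys of the map correspond to the vector IDs already stored in the index. The reader import is deferred so the map can be built without the Pinecone client installed.

```python
from typing import Dict, List


def build_id_to_text_map(texts: List[str], prefix: str = "doc") -> Dict[str, str]:
    """Assign each text a string ID such as 'doc-0' (hypothetical ID scheme)."""
    return {f"{prefix}-{i}": text for i, text in enumerate(texts)}


id_to_text_map = build_id_to_text_map(["I went to the store", "I bought an apple"])
print(id_to_text_map)


def query_pinecone(api_key: str, environment: str, vector: List[float]):
    """Return the top 3 matches as separate documents."""
    from llama_index.readers import PineconeReader  # imported lazily on purpose

    reader = PineconeReader(api_key=api_key, environment=environment)
    return reader.load_data(
        index_name="quickstart",  # hypothetical index name
        id_to_text_map=id_to_text_map,
        vector=vector,
        top_k=3,
        separate_documents=True,
    )
```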



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.PsychicReader(psychic_key: Optional[str] = None)

Psychic reader.

Psychic is a platform that allows syncing data from many SaaS apps through one universal API.

This reader connects to an instance of Psychic and reads data from it, given a connector ID, account ID, and API key.

Learn more at docs.psychic.dev.


psychic_key (str) -- Secret key for Psychic. Get one at https://dashboard.psychic.dev/api-keys.

load_data(connector_id: Optional[str] = None, account_id: Optional[str] = None) List[Document]

Load data from a Psychic connection.

  • connector_id (str) -- The connector ID to connect to

  • account_id (str) -- The account ID to connect to


List of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.QdrantReader(location: Optional[str] = None, url: Optional[str] = None, port: Optional[int] = 6333, grpc_port: int = 6334, prefer_grpc: bool = False, https: Optional[bool] = None, api_key: Optional[str] = None, prefix: Optional[str] = None, timeout: Optional[float] = None, host: Optional[str] = None, path: Optional[str] = None)

Qdrant reader.

Retrieve documents from existing Qdrant collections.

  • location -- If :memory: - use in-memory Qdrant instance. If str - use it as a url parameter. If None - use default values for host and port.

  • url -- either host or str of "Optional[scheme], host, Optional[port], Optional[prefix]". Default: None

  • port -- Port of the REST API interface. Default: 6333

  • grpc_port -- Port of the gRPC interface. Default: 6334

  • prefer_grpc -- If true - use gRPC interface whenever possible in custom methods.

  • https -- If true - use HTTPS(SSL) protocol. Default: false

  • api_key -- API key for authentication in Qdrant Cloud. Default: None

  • prefix -- If not None - add prefix to the REST URL path. Example: service/v1 will result in http://localhost:6333/service/v1/{qdrant-endpoint} for REST API. Default: None

  • timeout -- Timeout for REST and gRPC API requests. Default: 5.0 seconds for REST and unlimited for gRPC

  • host -- Host name of Qdrant service. If url and host are None, set to 'localhost'. Default: None

load_data(collection_name: str, query_vector: List[float], should_search_mapping: Optional[Dict[str, str]] = None, must_search_mapping: Optional[Dict[str, str]] = None, must_not_search_mapping: Optional[Dict[str, str]] = None, rang_search_mapping: Optional[Dict[str, Dict[str, float]]] = None, limit: int = 10) List[Document]

Load data from Qdrant.

  • collection_name (str) -- Name of the Qdrant collection.

  • query_vector (List[float]) -- Query vector.

  • should_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.

  • must_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.

  • must_not_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.

  • rang_search_mapping (Optional[Dict[str, Dict[str, float]]]) -- Mapping from field name to range query.

  • limit (int) -- Number of results to return.


reader = QdrantReader()
reader.load_data(
    collection_name="test_collection",
    query_vector=[0.1, 0.2, 0.3],
    should_search_mapping={"text_field": "text"},
    must_search_mapping={"text_field": "text"},
    must_not_search_mapping={"text_field": "text"},
    # gte, lte, gt, lt supported
    rang_search_mapping={"text_field": {"gte": 0.1, "lte": 0.2}},
    limit=10
)



A list of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.RssReader(html_to_text: bool = False)

RSS reader.

Reads content from an RSS feed.

load_data(urls: List[str]) List[Document]

Load data from RSS feeds.


urls (List[str]) -- List of RSS URLs to load.


List of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SimpleDirectoryReader(input_dir: Optional[str] = None, input_files: Optional[List] = None, exclude: Optional[List] = None, exclude_hidden: bool = True, errors: str = 'ignore', recursive: bool = False, required_exts: Optional[List[str]] = None, file_extractor: Optional[Dict[str, BaseReader]] = None, num_files_limit: Optional[int] = None, file_metadata: Optional[Callable[[str], Dict]] = None)

Simple directory reader.

Can read files into separate documents, or concatenate files into one document text.

  • input_dir (str) -- Path to the directory.

  • input_files (List) -- List of file paths to read (Optional; overrides input_dir, exclude)

  • exclude (List) -- glob of python file paths to exclude (Optional)

  • exclude_hidden (bool) -- Whether to exclude hidden files (dotfiles).

  • errors (str) -- how encoding and decoding errors are to be handled, see https://docs.python.org/3/library/functions.html#open

  • recursive (bool) -- Whether to recursively search in subdirectories. False by default.

  • required_exts (Optional[List[str]]) -- List of required extensions. Default is None.

  • file_extractor (Optional[Dict[str, BaseReader]]) -- A mapping of file extension to a BaseReader class that specifies how to convert that file to text. If not specified, use default from DEFAULT_FILE_READER_CLS.

  • num_files_limit (Optional[int]) -- Maximum number of files to read. Default is None.

  • file_metadata (Optional[Callable[[str], Dict]]) -- A function that takes in a filename and returns a Dict of metadata for the Document. Default is None.
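A sketch of a file_metadata callable and how it could be passed to the reader. The metadata keys and the "./notes" directory are hypothetical choices. The reader import is deferred so the callable can be tried on its own.

```python
import os
from typing import Dict


def file_metadata(filename: str) -> Dict[str, str]:
    """Return flat metadata derived from the file path alone."""
    base = os.path.basename(filename)
    stem, ext = os.path.splitext(base)
    return {"file_name": base, "title": stem, "extension": ext}


print(file_metadata("notes/2023/meeting.md"))


def load_markdown_notes():
    """Recursively load only .md files, attaching metadata to each Document."""
    from llama_index.readers import SimpleDirectoryReader  # imported lazily on purpose

    reader = SimpleDirectoryReader(
        input_dir="./notes",  # hypothetical directory
        recursive=True,
        required_exts=[".md"],
        file_metadata=file_metadata,
    )
    return reader.load_data()
```

Returning only str values keeps the result compatible with the flat-dictionary requirement on Document metadata.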

load_data() List[Document]

Load data from the input directory.


concatenate (bool) -- whether to concatenate all text docs into a single doc. If set to True, file metadata is ignored. False by default. This setting does not apply to image docs (always one doc per image).


A list of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SimpleMongoReader(host: Optional[str] = None, port: Optional[int] = None, uri: Optional[str] = None, max_docs: int = 1000)

Simple mongo reader.

Concatenates each Mongo doc into a Document used by LlamaIndex.

  • host (str) -- Mongo host.

  • port (int) -- Mongo port.

  • max_docs (int) -- Maximum number of documents to load.

load_data(db_name: str, collection_name: str, field_names: List[str] = ['text'], query_dict: Optional[Dict] = None) List[Document]

Load data from the given Mongo collection.

  • db_name (str) -- name of the database.

  • collection_name (str) -- name of the collection.

  • field_names (List[str]) -- names of the fields to be concatenated. Defaults to ["text"]

  • query_dict (Optional[Dict]) -- query to filter documents. Defaults to None


A list of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SimpleWebPageReader(html_to_text: bool = False)

Simple web page reader.

Reads pages from the web.


html_to_text (bool) -- Whether to convert HTML to text. Requires html2text package.

load_data(urls: List[str]) List[Document]

Load data from the given URLs.


urls (List[str]) -- List of URLs to scrape.


List of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SlackReader(slack_token: Optional[str] = None, ssl: Optional[SSLContext] = None, earliest_date: Optional[datetime] = None, latest_date: Optional[datetime] = None)

Slack reader.

Reads conversations from channels. If an earliest_date is provided, an optional latest_date can also be provided. If no latest_date is provided, we assume the latest date is the current timestamp.

  • slack_token (Optional[str]) -- Slack token. If not provided, we assume the environment variable SLACK_BOT_TOKEN is set.

  • ssl (Optional[SSLContext]) -- Custom SSL context. If not provided, it is assumed there is already an SSL context available.

  • earliest_date (Optional[datetime]) -- Earliest date from which to read conversations. If not provided, we read all messages.

  • latest_date (Optional[datetime]) -- Latest date from which to read conversations. If not provided, defaults to current timestamp in combination with earliest_date.
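The date window can be prepared with the standard library before the reader is constructed. `last_n_days` and `load_recent_messages` are hypothetical helpers. The sketch assumes SLACK_BOT_TOKEN is set in the environment, and defers the reader import so the date arithmetic runs on its own.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def last_n_days(days: int, now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Return (earliest_date, latest_date) covering the last `days` days."""
    latest = now or datetime.now()
    return latest - timedelta(days=days), latest


# A fixed "now" keeps the example reproducible.
earliest_date, latest_date = last_n_days(7, now=datetime(2023, 6, 15, 12, 0))
print(earliest_date, latest_date)


def load_recent_messages(channel_ids: List[str]):
    """Read only conversations inside the window computed above."""
    from llama_index.readers import SlackReader  # imported lazily on purpose

    reader = SlackReader(earliest_date=earliest_date, latest_date=latest_date)
    return reader.load_data(channel_ids=channel_ids)
```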

load_data(channel_ids: List[str], reverse_chronological: bool = True) List[Document]

Load data from the given Slack channels.


channel_ids (List[str]) -- List of channel ids to read.


List of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SteamshipFileReader(api_key: Optional[str] = None)

Reads persistent Steamship Files and converts them to Documents.


api_key -- Steamship API key. Defaults to STEAMSHIP_API_KEY value if not provided.


Requires install of steamship package and an active Steamship API Key. To get a Steamship API Key, visit: https://steamship.com/account/api. Once you have an API Key, expose it via an environment variable named STEAMSHIP_API_KEY or pass it as an init argument (api_key).

load_data(workspace: str, query: Optional[str] = None, file_handles: Optional[List[str]] = None, collapse_blocks: bool = True, join_str: str = '\n\n') List[Document]

Load data from persistent Steamship Files into Documents.

  • workspace -- the handle for a Steamship workspace (see: https://docs.steamship.com/workspaces/index.html)

  • query -- a Steamship tag query for retrieving files (ex: 'filetag and value("import-id")="import-001"')

  • file_handles -- a list of Steamship File handles (ex: smooth-valley-9kbdr)

  • collapse_blocks -- whether to merge individual File Blocks into a single Document, or separate them.

  • join_str -- when collapse_blocks is True, this is how the block texts will be concatenated.


The collection of Files from both query and file_handles will be combined. There is no (current) support for deconflicting the collections (meaning that if a file appears both in the result set of the query and as a handle in file_handles, it will be loaded twice).

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.StringIterableReader

String Iterable Reader.

Gets a list of documents, given an iterable (e.g. list) of strings.


from llama_index import StringIterableReader, GPTTreeIndex

documents = StringIterableReader().load_data(
    texts=["I went to the store", "I bought an apple"])
index = GPTTreeIndex.from_documents(documents)
query_engine = index.as_query_engine()
query_engine.query("what did I buy?")

# response should be something like "You bought an apple."

load_data(texts: List[str]) List[Document]

Load the data.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.TrafilaturaWebReader(error_on_missing: bool = False)

Trafilatura web page reader.

Reads pages from the web. Requires the trafilatura package.

load_data(urls: List[str]) List[Document]

Load data from the urls.


urls (List[str]) -- List of URLs to scrape.


List of documents.



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.TwitterTweetReader(bearer_token: str, num_tweets: Optional[int] = 100)

Twitter tweets reader.

Reads tweets from user Twitter handles.

See 'https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api' for how to get access to the Twitter API.

  • bearer_token (str) -- bearer_token that you get from twitter API.

  • num_tweets (Optional[int]) -- Number of tweets for each user twitter handle. Default is 100 tweets.

load_data(twitterhandles: List[str], **load_kwargs: Any) List[Document]

Load tweets of twitter handles.


twitterhandles (List[str]) -- List of user twitter handles to read tweets.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.WeaviateReader(host: str, auth_client_secret: Optional[Any] = None)

Weaviate reader.

Retrieves documents from Weaviate through vector lookup. Allows option to concatenate retrieved documents into one Document, or to return separate Document objects per document.

  • host (str) -- host.

  • auth_client_secret (Optional[weaviate.auth.AuthCredentials]) -- auth_client_secret.

load_data(class_name: Optional[str] = None, properties: Optional[List[str]] = None, graphql_query: Optional[str] = None, separate_documents: Optional[bool] = True) List[Document]

Load data from Weaviate.

If graphql_query is not found in load_kwargs, we assume that class_name and properties are provided.

  • class_name (Optional[str]) -- class_name to retrieve documents from.

  • properties (Optional[List[str]]) -- properties to retrieve from documents.

  • graphql_query (Optional[str]) -- Raw GraphQL Query. We assume that the query is a Get query.

  • separate_documents (Optional[bool]) -- Whether to return separate documents. Defaults to True.


A list of documents.
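The two ways of calling load_data can be sketched as follows. `build_get_query` is a hypothetical helper that writes a raw GraphQL Get query by hand. The "Article" class, its properties, and the host URL are placeholders, and whether the reader expects additional fields in a raw query is not stated here. The reader import is deferred so the query string can be inspected without a Weaviate instance.

```python
from typing import List


def build_get_query(class_name: str, properties: List[str]) -> str:
    """Build a raw GraphQL Get query selecting the given properties."""
    fields = " ".join(properties)
    return f"{{ Get {{ {class_name} {{ {fields} }} }} }}"


graphql_query = build_get_query("Article", ["title", "content"])
print(graphql_query)


def load_articles(use_raw_query: bool = False):
    """Load documents either by class_name/properties or by a raw query."""
    from llama_index.readers import WeaviateReader  # imported lazily on purpose

    reader = WeaviateReader(host="http://localhost:8080")  # placeholder host
    if use_raw_query:
        return reader.load_data(graphql_query=graphql_query)
    return reader.load_data(class_name="Article", properties=["title", "content"])
```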



load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.WikipediaReader

Wikipedia reader.

Reads a page.

load_data(pages: List[str], **load_kwargs: Any) List[Document]

Load data from the given Wikipedia pages.


pages (List[str]) -- List of pages to read.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.YoutubeTranscriptReader

Youtube Transcript reader.

load_data(ytlinks: List[str], **load_kwargs: Any) List[Document]

Load transcripts from the given YouTube links.


ytlinks (List[str]) -- List of youtube links for which transcripts are to be read.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.