Data Connectors

Note: our data connectors are now offered through `LlamaHub <https://llamahub.ai/>`_ 🦙. LlamaHub is an open-source repository of data loaders that you can easily plug into any LlamaIndex application.

The data connectors below are still available in the core repository.

Data Connectors for LlamaIndex.

This module contains the data connectors for LlamaIndex. Each connector inherits from a BaseReader class, connects to a data source, and loads Document objects from that data source.

You may also choose to construct Document objects manually, for instance in our Insert How-To Guide. See below for the API definition of a Document - the bare minimum is a text property.
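For instance, a minimal Document can be built from raw text alone; the doc_id and extra_info values below are illustrative:

from llama_index.readers import Document

# The bare minimum is the text property; everything else is optional.
doc = Document(
    text="LlamaIndex is a data framework for LLM applications.",
    doc_id="doc-001",  # hypothetical ID; optional
    extra_info={"source": "manual"},  # flat dict: str keys, (str, int, float) values
)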

class llama_index.readers.BeautifulSoupWebReader(website_extractor: Optional[Dict[str, Callable]] = None)

BeautifulSoup web page reader.

Reads pages from the web. Requires the bs4 and urllib packages.

Parameters

website_extractor (Optional[Dict[str, Callable]]) -- A mapping of website hostname (e.g. google.com) to a function that specifies how to extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR.

load_data(urls: List[str], custom_hostname: Optional[str] = None) List[Document]

Load data from the URLs.

Parameters
  • urls (List[str]) -- List of URLs to scrape.

  • custom_hostname (Optional[str]) -- Force a certain hostname in case a website is displayed under custom URLs (e.g. Substack blogs)

Returns

List of documents.

Return type

List[Document]
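
A minimal usage sketch (the URL below is illustrative):

from llama_index.readers import BeautifulSoupWebReader

reader = BeautifulSoupWebReader()
documents = reader.load_data(urls=["https://example.com/blog/post"])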

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.ChatGPTRetrievalPluginReader(endpoint_url: str, bearer_token: Optional[str] = None, retries: Optional[Retry] = None, batch_size: int = 100)

ChatGPT Retrieval Plugin reader.

load_data(query: str, top_k: int = 10, separate_documents: bool = True, **kwargs: Any) List[Document]

Load data from ChatGPT Retrieval Plugin.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.ChromaReader(collection_name: str, persist_directory: Optional[str] = None, host: str = 'localhost', port: int = 8000)

Chroma reader.

Retrieve documents from existing persisted Chroma collections.

Parameters
  • collection_name -- Name of the persisted collection.

  • persist_directory -- Directory where the collection is persisted.

create_documents(results: Any) List[Document]

Create documents from the results.

Parameters

results -- Results from the query.

Returns

List of documents.

load_data(query_embedding: Optional[List[float]] = None, limit: int = 10, where: Optional[dict] = None, where_document: Optional[dict] = None, query: Optional[Union[str, List[str]]] = None) Any

Load data from the collection.

Parameters
  • limit -- Number of results to return.

  • where -- Filter results by metadata. {"metadata_field": "is_equal_to_this"}

  • where_document -- Filter results by document. {"$contains":"search_string"}

Returns

List of documents.
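
A usage sketch, assuming an existing persisted collection and a precomputed query embedding (all values illustrative):

from llama_index.readers import ChromaReader

reader = ChromaReader(
    collection_name="my_collection",  # hypothetical collection
    persist_directory="./chroma_db",
)
documents = reader.load_data(
    query_embedding=[0.1, 0.2, 0.3],  # must match the collection's embedding dimension
    limit=10,
)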

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.DeepLakeReader(token: Optional[str] = None)

DeepLake reader.

Retrieve documents from existing DeepLake datasets.

Parameters

token (Optional[str]) -- DeepLake token.

load_data(query_vector: List[float], dataset_path: str, limit: int = 4, distance_metric: str = 'l2') List[Document]

Load data from DeepLake.

Parameters
  • dataset_path (str) -- Path to the DeepLake dataset.

  • query_vector (List[float]) -- Query vector.

  • limit (int) -- Number of results to return.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.DiscordReader(discord_token: Optional[str] = None)

Discord reader.

Reads conversations from channels.

Parameters

discord_token (Optional[str]) -- Discord token. If not provided, we assume the environment variable DISCORD_TOKEN is set.

load_data(channel_ids: List[int], limit: Optional[int] = None, oldest_first: bool = True) List[Document]

Load data from the specified channels.

Parameters
  • channel_ids (List[int]) -- List of channel ids to read.

  • limit (Optional[int]) -- Maximum number of messages to read.

  • oldest_first (bool) -- Whether to read oldest messages first. Defaults to True.

Returns

List of documents.

Return type

List[Document]
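
A usage sketch, assuming the DISCORD_TOKEN environment variable is set (the channel ID is illustrative):

from llama_index.readers import DiscordReader

reader = DiscordReader()  # token read from DISCORD_TOKEN
documents = reader.load_data(channel_ids=[123456789012345678], limit=100)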

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.Document(text: Optional[str] = None, doc_id: Optional[str] = None, embedding: Optional[List[float]] = None, doc_hash: Optional[str] = None, extra_info: Optional[Dict[str, Any]] = None)

Generic interface for a data document.

This document connects to data sources.

doc_hash: Optional[str] = None

" metadata fields - injected as part of the text shown to LLMs as context - used by vector DBs for metadata filtering

This must be a flat dictionary, and only uses str keys, and (str, int, float) values.

property extra_info_str: Optional[str]

Extra info string.

classmethod from_langchain_format(doc: Document) Document

Convert struct from LangChain document format.

get_doc_hash() str

Get doc_hash.

get_doc_id() str

Get doc_id.

get_embedding() List[float]

Get embedding.

Errors if embedding is None.

get_text() str

Get text.

classmethod get_type() str

Get Document type.

classmethod get_types() List[str]

Get Document types.

property is_doc_id_none: bool

Check if doc_id is None.

property is_text_none: bool

Check if text is None.

to_langchain_format() Document

Convert struct to LangChain document format.

class llama_index.readers.ElasticsearchReader(endpoint: str, index: str, httpx_client_args: Optional[dict] = None)

Read documents from an Elasticsearch/Opensearch index.

These documents can then be used in a downstream Llama Index data structure.

Parameters
  • endpoint (str) -- URL (http/https) of cluster

  • index (str) -- Name of the index (required)

  • httpx_client_args (dict) -- Optional additional args to pass to the httpx.Client

load_data(field: str, query: Optional[dict] = None, embedding_field: Optional[str] = None) List[Document]

Read data from the Elasticsearch index.

Parameters
  • field (str) -- Field in the document to retrieve text from

  • query (Optional[dict]) -- Elasticsearch JSON query DSL object. For example: {"query": {"match": {"message": {"query": "this is a test"}}}}

  • embedding_field (Optional[str]) -- If there are embeddings stored in this index, this field can be used to set the embedding field on the returned Document list.

Returns

A list of documents.

Return type

List[Document]
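
A usage sketch against a local cluster (index name and field are illustrative):

from llama_index.readers import ElasticsearchReader

reader = ElasticsearchReader(endpoint="http://localhost:9200", index="my-index")
documents = reader.load_data(
    field="message",  # document field containing the text
    query={"query": {"match_all": {}}},
)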

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.FaissReader(index: Any)

Faiss reader.

Retrieves documents through an existing in-memory Faiss index. These documents can then be used in a downstream LlamaIndex data structure. If you wish to use Faiss itself as an index to organize documents, insert documents, and perform queries on them, please use GPTVectorStoreIndex with FaissVectorStore.

Parameters

index (faiss.Index) -- A Faiss Index object (required)

load_data(query: ndarray, id_to_text_map: Dict[str, str], k: int = 4, separate_documents: bool = True) List[Document]

Load data from Faiss.

Parameters
  • query (np.ndarray) -- A 2D numpy array of query vectors.

  • id_to_text_map (Dict[str, str]) -- A map from IDs to text.

  • k (int) -- Number of nearest neighbors to retrieve. Defaults to 4.

  • separate_documents (Optional[bool]) -- Whether to return separate documents. Defaults to True.

Returns

A list of documents.

Return type

List[Document]
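
A usage sketch that builds a tiny in-memory index first (vectors and texts are illustrative):

import faiss
import numpy as np
from llama_index.readers import FaissReader

dim = 3
index = faiss.IndexFlatL2(dim)
index.add(np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype="float32"))
id_to_text_map = {"0": "first document", "1": "second document"}  # keys per documented Dict[str, str]

reader = FaissReader(index)
query = np.array([[0.1, 0.2, 0.3]], dtype="float32")  # 2D: one query vector
documents = reader.load_data(query=query, id_to_text_map=id_to_text_map, k=2)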

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.GithubRepositoryReader(owner: str, repo: str, use_parser: bool = True, verbose: bool = False, github_token: Optional[str] = None, concurrent_requests: int = 5, ignore_file_extensions: Optional[List[str]] = None, ignore_directories: Optional[List[str]] = None)

Github repository reader.

Retrieves the contents of a Github repository and returns a list of documents. The documents are either the contents of the files in the repository or the text extracted from the files using the parser.

Example

>>> reader = GithubRepositoryReader("owner", "repo")
>>> branch_documents = reader.load_data(branch="branch")
>>> commit_documents = reader.load_data(commit_sha="commit_sha")

load_data(commit_sha: Optional[str] = None, branch: Optional[str] = None) List[Document]

Load data from a commit or a branch.

Loads github repository data from a specific commit sha or a branch.

Parameters
  • commit_sha -- commit SHA

  • branch -- branch name

Returns

list of documents

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.GoogleDocsReader

Google Docs reader.

Reads a page from Google Docs.

load_data(document_ids: List[str]) List[Document]

Load data from the given document IDs.

Parameters

document_ids (List[str]) -- a list of document ids.
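
A usage sketch, assuming Google credentials are already configured (the document ID is hypothetical):

from llama_index.readers import GoogleDocsReader

reader = GoogleDocsReader()
documents = reader.load_data(document_ids=["your-google-doc-id"])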

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.JSONReader(levels_back: Optional[int] = None, collapse_length: Optional[int] = None)

JSON reader.

Reads JSON documents with options to help suss out relationships between nodes.

Parameters
  • levels_back (int) -- the number of levels to go back in the JSON tree; 0 if you want all levels. If levels_back is None, then we just format the JSON and make each line an embedding.

  • collapse_length (int) -- the maximum number of characters a JSON fragment would be collapsed into in the output. For example, if collapse_length = 10 and the input is {a: [1, 2, 3], b: {"hello": "world", "foo": "bar"}}, then a would be collapsed into one line, while b would not. Recommend starting around 100 and then adjusting from there.

load_data(input_file: str) List[Document]

Load data from the input file.
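
A usage sketch (the file path is illustrative):

from llama_index.readers import JSONReader

reader = JSONReader(levels_back=0, collapse_length=100)
documents = reader.load_data(input_file="./data/records.json")  # hypothetical path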

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MakeWrapper

Make reader.

load_data(*args: Any, **load_kwargs: Any) List[Document]

Load data from the input directory.

NOTE: This is not implemented.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pass_response_to_webhook(webhook_url: str, response: Response, query: Optional[str] = None) None

Pass response object to webhook.

Parameters
  • webhook_url (str) -- Webhook URL.

  • response (Response) -- Response object.

  • query (Optional[str]) -- Query. Defaults to None.

class llama_index.readers.MboxReader

Mbox e-mail reader.

Reads a set of e-mails saved in the mbox format.

load_data(input_dir: str, **load_kwargs: Any) List[Document]

Load data from the input directory.

load_kwargs:
  • max_count (int) -- Maximum number of messages to read.

  • message_format (str) -- Message format, overriding the default.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MetalReader(api_key: str, client_id: str, index_id: str)

Metal reader.

Parameters
  • api_key (str) -- Metal API key.

  • client_id (str) -- Metal client ID.

  • index_id (str) -- Metal index ID.

load_data(limit: int, query_embedding: Optional[List[float]] = None, filters: Optional[Dict[str, Any]] = None, separate_documents: bool = True, **query_kwargs: Any) List[Document]

Load data from Metal.

Parameters
  • query_embedding (Optional[List[float]]) -- Query embedding for search.

  • limit (int) -- Number of results to return.

  • filters (Optional[Dict[str, Any]]) -- Filters to apply to the search.

  • separate_documents (Optional[bool]) -- Whether to return separate documents per retrieved entry. Defaults to True.

  • **query_kwargs -- Keyword arguments to pass to the search.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MilvusReader(host: str = 'localhost', port: int = 19530, user: str = '', password: str = '', use_secure: bool = False)

Milvus reader.

load_data(query_vector: List[float], collection_name: str, expr: Optional[Any] = None, search_params: Optional[dict] = None, limit: int = 10) List[Document]

Load data from Milvus.

Parameters
  • collection_name (str) -- Name of the Milvus collection.

  • query_vector (List[float]) -- Query vector.

  • limit (int) -- Number of results to return.

Returns

A list of documents.

Return type

List[Document]
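
A usage sketch against a local Milvus instance (collection name and vector values are illustrative):

from llama_index.readers import MilvusReader

reader = MilvusReader(host="localhost", port=19530)
documents = reader.load_data(
    query_vector=[0.1, 0.2, 0.3],  # must match the collection's vector dimension
    collection_name="my_collection",
    limit=10,
)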

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MyScaleReader(myscale_host: str, username: str, password: str, myscale_port: Optional[int] = 8443, database: str = 'default', table: str = 'llama_index', index_type: str = 'IVFLAT', metric: str = 'cosine', batch_size: int = 32, index_params: Optional[dict] = None, search_params: Optional[dict] = None, **kwargs: Any)

MyScale reader.

Parameters
  • myscale_host (str) -- A URL to connect to the MyScale backend.

  • username (str) -- Username to log in.

  • password (str) -- Password to log in.

  • myscale_port (int) -- URL port to connect with HTTP. Defaults to 8443.

  • database (str) -- Database name to find the table. Defaults to 'default'.

  • table (str) -- Table name to operate on. Defaults to 'llama_index'.

  • index_type (str) -- Index type string. Defaults to "IVFLAT".

  • metric (str) -- Metric to compute distance; supported are ('l2', 'cosine', 'ip'). Defaults to 'cosine'.

  • batch_size (int, optional) -- the number of documents to insert per batch. Defaults to 32.

  • index_params (dict, optional) -- The index parameters for MyScale. Defaults to None.

  • search_params (dict, optional) -- The search parameters for a MyScale query. Defaults to None.

load_data(query_vector: List[float], where_str: Optional[str] = None, limit: int = 10) List[Document]

Load data from MyScale.

Parameters
  • query_vector (List[float]) -- Query vector.

  • where_str (Optional[str], optional) -- where condition string. Defaults to None.

  • limit (int) -- Number of results to return.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.NotionPageReader(integration_token: Optional[str] = None)

Notion Page reader.

Reads a set of Notion pages.

Parameters

integration_token (str) -- Notion integration token.

load_data(page_ids: List[str] = [], database_id: Optional[str] = None) List[Document]

Load data from the given Notion page ids and/or database.

Parameters

page_ids (List[str]) -- List of page ids to load.

Returns

List of documents.

Return type

List[Document]
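
A usage sketch (the token and page ID are hypothetical):

from llama_index.readers import NotionPageReader

reader = NotionPageReader(integration_token="your-notion-integration-token")
documents = reader.load_data(page_ids=["your-page-id"])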

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

query_database(database_id: str, query_dict: Dict[str, Any] = {}) List[str]

Get all the pages from a Notion database.

read_page(page_id: str) str

Read a page.

search(query: str) List[str]

Search Notion page given a text query.

class llama_index.readers.ObsidianReader(input_dir: str)

Utilities for loading data from an Obsidian Vault.

Parameters

input_dir (str) -- Path to the vault.

load_data(*args: Any, **load_kwargs: Any) List[Document]

Load data from the input directory.
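
A usage sketch (the vault path is illustrative):

from llama_index.readers import ObsidianReader

reader = ObsidianReader(input_dir="/path/to/vault")
documents = reader.load_data()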

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.PineconeReader(api_key: str, environment: str)

Pinecone reader.

Parameters
  • api_key (str) -- Pinecone API key.

  • environment (str) -- Pinecone environment.

load_data(index_name: str, id_to_text_map: Dict[str, str], vector: Optional[List[float]], top_k: int, separate_documents: bool = True, include_values: bool = True, **query_kwargs: Any) List[Document]

Load data from Pinecone.

Parameters
  • index_name (str) -- Name of the index.

  • id_to_text_map (Dict[str, str]) -- A map from IDs to text.

  • separate_documents (Optional[bool]) -- Whether to return separate documents per retrieved entry. Defaults to True.

  • vector (List[float]) -- Query vector.

  • top_k (int) -- Number of results to return.

  • include_values (bool) -- Whether to include the embedding in the response. Defaults to True.

  • **query_kwargs -- Keyword arguments to pass to the query. Arguments are the exact same as those found in Pinecone's reference documentation for the query method.

Returns

A list of documents.

Return type

List[Document]
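
A usage sketch (the key, names, and vector values are hypothetical):

from llama_index.readers import PineconeReader

reader = PineconeReader(api_key="your-api-key", environment="us-west1-gcp")
documents = reader.load_data(
    index_name="my-index",
    id_to_text_map={"id1": "text blob 1"},
    vector=[0.1, 0.2, 0.3],
    top_k=3,
)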

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.PsychicReader(psychic_key: Optional[str] = None)

Psychic reader.

Psychic is a platform that allows syncing data from many SaaS apps through one universal API.

This reader connects to an instance of Psychic and reads data from it, given a connector ID, account ID, and API key.

Learn more at docs.psychic.dev.

Parameters

psychic_key (str) -- Secret key for Psychic. Get one at https://dashboard.psychic.dev/api-keys.

load_data(connector_id: Optional[str] = None, account_id: Optional[str] = None) List[Document]

Load data from a Psychic connection.

Parameters
  • connector_id (str) -- The connector ID to connect to.

  • account_id (str) -- The account ID to connect to.

Returns

List of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.QdrantReader(location: Optional[str] = None, url: Optional[str] = None, port: Optional[int] = 6333, grpc_port: int = 6334, prefer_grpc: bool = False, https: Optional[bool] = None, api_key: Optional[str] = None, prefix: Optional[str] = None, timeout: Optional[float] = None, host: Optional[str] = None, path: Optional[str] = None)

Qdrant reader.

Retrieve documents from existing Qdrant collections.

Parameters
  • location -- If :memory: - use in-memory Qdrant instance. If str - use it as a url parameter. If None - use default values for host and port.

  • url -- either host or str of "Optional[scheme], host, Optional[port], Optional[prefix]". Default: None

  • port -- Port of the REST API interface. Default: 6333

  • grpc_port -- Port of the gRPC interface. Default: 6334

  • prefer_grpc -- If true - use gRPC interface whenever possible in custom methods.

  • https -- If true - use HTTPS(SSL) protocol. Default: false

  • api_key -- API key for authentication in Qdrant Cloud. Default: None

  • prefix -- If not None - add prefix to the REST URL path. Example: service/v1 will result in http://localhost:6333/service/v1/{qdrant-endpoint} for REST API. Default: None

  • timeout -- Timeout for REST and gRPC API requests. Default: 5.0 seconds for REST and unlimited for gRPC

  • host -- Host name of Qdrant service. If url and host are None, set to 'localhost'. Default: None

load_data(collection_name: str, query_vector: List[float], should_search_mapping: Optional[Dict[str, str]] = None, must_search_mapping: Optional[Dict[str, str]] = None, must_not_search_mapping: Optional[Dict[str, str]] = None, rang_search_mapping: Optional[Dict[str, Dict[str, float]]] = None, limit: int = 10) List[Document]

Load data from Qdrant.

Parameters
  • collection_name (str) -- Name of the Qdrant collection.

  • query_vector (List[float]) -- Query vector.

  • should_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.

  • must_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.

  • must_not_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.

  • rang_search_mapping (Optional[Dict[str, Dict[str, float]]]) -- Mapping from field name to range query.

  • limit (int) -- Number of results to return.

Example

reader = QdrantReader()
reader.load_data(
    collection_name="test_collection",
    query_vector=[0.1, 0.2, 0.3],
    should_search_mapping={"text_field": "text"},
    must_search_mapping={"text_field": "text"},
    must_not_search_mapping={"text_field": "text"},
    # gte, lte, gt, lt supported
    rang_search_mapping={"text_field": {"gte": 0.1, "lte": 0.2}},
    limit=10,
)

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.RssReader(html_to_text: bool = False)

RSS reader.

Reads content from an RSS feed.

load_data(urls: List[str]) List[Document]

Load data from RSS feeds.

Parameters

urls (List[str]) -- List of RSS URLs to load.

Returns

List of documents.

Return type

List[Document]
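
A usage sketch (the feed URL is illustrative):

from llama_index.readers import RssReader

reader = RssReader(html_to_text=True)
documents = reader.load_data(urls=["https://example.com/feed.xml"])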

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SimpleDirectoryReader(input_dir: Optional[str] = None, input_files: Optional[List] = None, exclude: Optional[List] = None, exclude_hidden: bool = True, errors: str = 'ignore', recursive: bool = False, required_exts: Optional[List[str]] = None, file_extractor: Optional[Dict[str, BaseReader]] = None, num_files_limit: Optional[int] = None, file_metadata: Optional[Callable[[str], Dict]] = None)

Simple directory reader.

Can read files into separate documents, or concatenate files into one document's text.

Parameters
  • input_dir (str) -- Path to the directory.

  • input_files (List) -- List of file paths to read (Optional; overrides input_dir, exclude)

  • exclude (List) -- glob of python file paths to exclude (Optional)

  • exclude_hidden (bool) -- Whether to exclude hidden files (dotfiles).

  • errors (str) -- how encoding and decoding errors are to be handled, see https://docs.python.org/3/library/functions.html#open

  • recursive (bool) -- Whether to recursively search in subdirectories. False by default.

  • required_exts (Optional[List[str]]) -- List of required extensions. Default is None.

  • file_extractor (Optional[Dict[str, BaseReader]]) -- A mapping of file extension to a BaseReader class that specifies how to convert that file to text. If not specified, use default from DEFAULT_FILE_READER_CLS.

  • num_files_limit (Optional[int]) -- Maximum number of files to read. Default is None.

  • file_metadata (Optional[Callable[[str], Dict]]) -- A function that takes in a filename and returns a Dict of metadata for the Document. Default is None.

load_data() List[Document]

Load data from the input directory.

Parameters

concatenate (bool) -- whether to concatenate all text docs into a single doc. If set to True, file metadata is ignored. False by default. This setting does not apply to image docs (always one doc per image).

Returns

A list of documents.

Return type

List[Document]
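
A usage sketch (the directory path and extensions are illustrative):

from llama_index.readers import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_dir="./data",
    recursive=True,
    required_exts=[".md", ".txt"],
)
documents = reader.load_data()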

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SimpleMongoReader(host: Optional[str] = None, port: Optional[int] = None, uri: Optional[str] = None, max_docs: int = 1000)

Simple mongo reader.

Concatenates each Mongo doc into a Document used by LlamaIndex.

Parameters
  • host (str) -- Mongo host.

  • port (int) -- Mongo port.

  • max_docs (int) -- Maximum number of documents to load.

load_data(db_name: str, collection_name: str, field_names: List[str] = ['text'], query_dict: Optional[Dict] = None) List[Document]

Load data from the specified MongoDB database and collection.

Parameters
  • db_name (str) -- name of the database.

  • collection_name (str) -- name of the collection.

  • field_names (List[str]) -- names of the fields to be concatenated. Defaults to ["text"].

  • query_dict (Optional[Dict]) -- query to filter documents. Defaults to None.

Returns

A list of documents.

Return type

List[Document]
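
A usage sketch against a local MongoDB instance (database and collection names are illustrative):

from llama_index.readers import SimpleMongoReader

reader = SimpleMongoReader(host="localhost", port=27017)
documents = reader.load_data(db_name="my_db", collection_name="posts", field_names=["text"])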

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SimpleWebPageReader(html_to_text: bool = False)

Simple web page reader.

Reads pages from the web.

Parameters

html_to_text (bool) -- Whether to convert HTML to text. Requires html2text package.

load_data(urls: List[str]) List[Document]

Load data from the given URLs.

Parameters

urls (List[str]) -- List of URLs to scrape.

Returns

List of documents.

Return type

List[Document]
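
A usage sketch (the URL is illustrative):

from llama_index.readers import SimpleWebPageReader

reader = SimpleWebPageReader(html_to_text=True)  # requires the html2text package
documents = reader.load_data(urls=["https://example.com"])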

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SlackReader(slack_token: Optional[str] = None, ssl: Optional[SSLContext] = None, earliest_date: Optional[datetime] = None, latest_date: Optional[datetime] = None)

Slack reader.

Reads conversations from channels. If an earliest_date is provided, an optional latest_date can also be provided. If no latest_date is provided, we assume the latest date is the current timestamp.

Parameters
  • slack_token (Optional[str]) -- Slack token. If not provided, we assume the environment variable SLACK_BOT_TOKEN is set.

  • ssl (Optional[str]) -- Custom SSL context. If not provided, it is assumed there is already an SSL context available.

  • earliest_date (Optional[datetime]) -- Earliest date from which to read conversations. If not provided, we read all messages.

  • latest_date (Optional[datetime]) -- Latest date from which to read conversations. If not provided, defaults to current timestamp in combination with earliest_date.

load_data(channel_ids: List[str], reverse_chronological: bool = True) List[Document]

Load data from the specified Slack channels.

Parameters

channel_ids (List[str]) -- List of channel ids to read.

Returns

List of documents.

Return type

List[Document]
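
A usage sketch, assuming the SLACK_BOT_TOKEN environment variable is set (the channel ID is hypothetical):

from llama_index.readers import SlackReader

reader = SlackReader()  # token read from SLACK_BOT_TOKEN
documents = reader.load_data(channel_ids=["C01234ABCDE"])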

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SteamshipFileReader(api_key: Optional[str] = None)

Reads persistent Steamship Files and converts them to Documents.

Parameters

api_key -- Steamship API key. Defaults to STEAMSHIP_API_KEY value if not provided.

Note

Requires install of steamship package and an active Steamship API Key. To get a Steamship API Key, visit: https://steamship.com/account/api. Once you have an API Key, expose it via an environment variable named STEAMSHIP_API_KEY or pass it as an init argument (api_key).

load_data(workspace: str, query: Optional[str] = None, file_handles: Optional[List[str]] = None, collapse_blocks: bool = True, join_str: str = '\n\n') List[Document]

Load data from persistent Steamship Files into Documents.

Parameters
  • workspace -- the handle for a Steamship workspace (see: https://docs.steamship.com/workspaces/index.html)

  • query -- a Steamship tag query for retrieving files (ex: 'filetag and value("import-id")="import-001"')

  • file_handles -- a list of Steamship File handles (ex: smooth-valley-9kbdr)

  • collapse_blocks -- whether to merge individual File Blocks into a single Document, or separate them.

  • join_str -- when collapse_blocks is True, this is how the block texts will be concatenated.

Note

The collection of Files from both query and file_handles will be combined. There is no (current) support for deconflicting the collections (meaning that if a file appears both in the result set of the query and as a handle in file_handles, it will be loaded twice).

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.StringIterableReader

String Iterable Reader.

Gets a list of documents, given an iterable (e.g. list) of strings.

Example

from llama_index import StringIterableReader, GPTTreeIndex

documents = StringIterableReader().load_data(
    texts=["I went to the store", "I bought an apple"])
index = GPTTreeIndex.from_documents(documents)
query_engine = index.as_query_engine()
query_engine.query("what did I buy?")

# response should be something like "You bought an apple."

load_data(texts: List[str]) List[Document]

Load the data.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.TrafilaturaWebReader(error_on_missing: bool = False)

Trafilatura web page reader.

Reads pages from the web. Requires the trafilatura package.

load_data(urls: List[str]) List[Document]

Load data from the URLs.

Parameters

urls (List[str]) -- List of URLs to scrape.

Returns

List of documents.

Return type

List[Document]
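
A usage sketch (the URL is illustrative; requires the trafilatura package):

from llama_index.readers import TrafilaturaWebReader

reader = TrafilaturaWebReader()
documents = reader.load_data(urls=["https://example.com/article"])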

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.TwitterTweetReader(bearer_token: str, num_tweets: Optional[int] = 100)

Twitter tweets reader.

Reads tweets of a user's Twitter handle.

See https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api for how to get access to the Twitter API.

Parameters
  • bearer_token (str) -- bearer token that you get from the Twitter API.

  • num_tweets (Optional[int]) -- Number of tweets to read for each Twitter handle. Defaults to 100 tweets.

load_data(twitterhandles: List[str], **load_kwargs: Any) List[Document]

Load tweets for the given Twitter handles.

Parameters

twitterhandles (List[str]) -- List of Twitter handles to read tweets from.
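
A usage sketch (the bearer token and handle are hypothetical):

from llama_index.readers import TwitterTweetReader

reader = TwitterTweetReader(bearer_token="your-bearer-token", num_tweets=50)
documents = reader.load_data(twitterhandles=["example_handle"])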

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.WeaviateReader(host: str, auth_client_secret: Optional[Any] = None)

Weaviate reader.

Retrieves documents from Weaviate through vector lookup. Allows option to concatenate retrieved documents into one Document, or to return separate Document objects per document.

Parameters
  • host (str) -- host.

  • auth_client_secret (Optional[weaviate.auth.AuthCredentials]) -- auth_client_secret.

load_data(class_name: Optional[str] = None, properties: Optional[List[str]] = None, graphql_query: Optional[str] = None, separate_documents: Optional[bool] = True) List[Document]

Load data from Weaviate.

If graphql_query is not found in load_kwargs, we assume that class_name and properties are provided.

Parameters
  • class_name (Optional[str]) -- class_name to retrieve documents from.

  • properties (Optional[List[str]]) -- properties to retrieve from documents.

  • graphql_query (Optional[str]) -- Raw GraphQL Query. We assume that the query is a Get query.

  • separate_documents (Optional[bool]) -- Whether to return separate documents. Defaults to True.

Returns

A list of documents.

Return type

List[Document]
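
A usage sketch (host, class name, and properties are illustrative):

from llama_index.readers import WeaviateReader

reader = WeaviateReader(host="http://localhost:8080")
documents = reader.load_data(class_name="Article", properties=["title", "content"])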

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.WikipediaReader

Wikipedia reader.

Reads Wikipedia pages.

load_data(pages: List[str], **load_kwargs: Any) List[Document]

Load data from the given Wikipedia pages.

Parameters

pages (List[str]) -- List of pages to read.
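
A usage sketch (page titles are illustrative):

from llama_index.readers import WikipediaReader

reader = WikipediaReader()
documents = reader.load_data(pages=["Large language model", "Vector database"])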

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.YoutubeTranscriptReader

Youtube Transcript reader.

load_data(ytlinks: List[str], **load_kwargs: Any) List[Document]

Load transcripts for the given YouTube links.

Parameters

ytlinks (List[str]) -- List of YouTube links for which transcripts are to be read.
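
A usage sketch (the video URL is hypothetical):

from llama_index.readers import YoutubeTranscriptReader

reader = YoutubeTranscriptReader()
documents = reader.load_data(ytlinks=["https://www.youtube.com/watch?v=abc123xyz00"])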

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.