Data Connectors

Note: our data connectors are now offered through `LlamaHub <https://llamahub.ai/>`_ 🦙. LlamaHub is an open-source repository of data loaders that you can easily plug into any LlamaIndex application.

The data connectors below are still available in the core repository.

Data Connectors for LlamaIndex.

This module contains the data connectors for LlamaIndex. Each connector inherits from a BaseReader class, connects to a data source, and loads Document objects from that data source.

You may also choose to construct Document objects manually, for instance in our Insert How-To Guide. See below for the API definition of a Document - the bare minimum is a text property.
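For instance, a minimal Document can be built from raw text alone; the doc_id and extra_info values below are illustrative:

from llama_index.readers import Document

# The bare minimum is the text property; everything else is optional.
doc = Document(
    text="LlamaIndex is a data framework for LLM applications.",
    doc_id="doc-001",  # hypothetical ID; optional
    extra_info={"source": "manual"},  # flat dict: str keys, (str, int, float) values
)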

class llama_index.readers.BeautifulSoupWebReader(website_extractor: Optional[Dict[str, Callable]] = None)

BeautifulSoup web page reader.

Reads pages from the web. Requires the bs4 and urllib packages.

Parameters

website_extractor (Optional[Dict[str, Callable]]) -- A mapping of website hostname (e.g. google.com) to a function that specifies how to extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR.

load_data(urls: List[str], custom_hostname: Optional[str] = None) List[Document]

Load data from the URLs.

Parameters
  • urls (List[str]) -- List of URLs to scrape.

  • custom_hostname (Optional[str]) -- Force a certain hostname in case a website is displayed under custom URLs (e.g. Substack blogs)

Returns

List of documents.

Return type

List[Document]
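
A minimal usage sketch (the URL below is illustrative):

from llama_index.readers import BeautifulSoupWebReader

reader = BeautifulSoupWebReader()
documents = reader.load_data(urls=["https://example.com/blog/post"])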

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.ChatGPTRetrievalPluginReader(endpoint_url: str, bearer_token: Optional[str] = None, retries: Optional[Retry] = None, batch_size: int = 100)

ChatGPT Retrieval Plugin reader.

load_data(query: str, top_k: int = 10, separate_documents: bool = True, **kwargs: Any) List[Document]

Load data from ChatGPT Retrieval Plugin.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.ChromaReader(collection_name: str, persist_directory: Optional[str] = None, host: str = 'localhost', port: int = 8000)

Chroma reader.

Retrieve documents from existing persisted Chroma collections.

Parameters
  • collection_name -- Name of the persisted collection.

  • persist_directory -- Directory where the collection is persisted.

create_documents(results: Any) List[Document]

Create documents from the results.

Parameters

results -- Results from the query.

Returns

List of documents.

load_data(query_embedding: Optional[List[float]] = None, limit: int = 10, where: Optional[dict] = None, where_document: Optional[dict] = None, query: Optional[Union[str, List[str]]] = None) Any

Load data from the collection.

Parameters
  • limit -- Number of results to return.

  • where -- Filter results by metadata. {"metadata_field": "is_equal_to_this"}

  • where_document -- Filter results by document. {"$contains":"search_string"}

Returns

List of documents.
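
A usage sketch, assuming an existing persisted collection and a precomputed query embedding (all values illustrative):

from llama_index.readers import ChromaReader

reader = ChromaReader(
    collection_name="my_collection",  # hypothetical collection
    persist_directory="./chroma_db",
)
documents = reader.load_data(
    query_embedding=[0.1, 0.2, 0.3],  # must match the collection's embedding dimension
    limit=10,
)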

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.DeepLakeReader(token: Optional[str] = None)

DeepLake reader.

Retrieve documents from existing DeepLake datasets.

Parameters

token (Optional[str]) -- DeepLake token.

load_data(query_vector: List[float], dataset_path: str, limit: int = 4, distance_metric: str = 'l2') List[Document]

Load data from DeepLake.

Parameters
  • dataset_path (str) -- Path to the DeepLake dataset.

  • query_vector (List[float]) -- Query vector.

  • limit (int) -- Number of results to return.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.DiscordReader(discord_token: Optional[str] = None)

Discord reader.

Reads conversations from channels.

Parameters

discord_token (Optional[str]) -- Discord token. If not provided, we assume the environment variable DISCORD_TOKEN is set.

load_data(channel_ids: List[int], limit: Optional[int] = None, oldest_first: bool = True) List[Document]

Load data from the specified channels.

Parameters
  • channel_ids (List[int]) -- List of channel ids to read.

  • limit (Optional[int]) -- Maximum number of messages to read.

  • oldest_first (bool) -- Whether to read oldest messages first. Defaults to True.

Returns

List of documents.

Return type

List[Document]
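
A usage sketch, assuming the DISCORD_TOKEN environment variable is set (the channel ID is illustrative):

from llama_index.readers import DiscordReader

reader = DiscordReader()  # token read from DISCORD_TOKEN
documents = reader.load_data(channel_ids=[123456789012345678], limit=100)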

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.Document(text: Optional[str] = None, doc_id: Optional[str] = None, embedding: Optional[List[float]] = None, doc_hash: Optional[str] = None, extra_info: Optional[Dict[str, Any]] = None)

Generic interface for a data document.

This document connects to data sources.

doc_hash: Optional[str] = None

" metadata fields - injected as part of the text shown to LLMs as context - used by vector DBs for metadata filtering

This must be a flat dictionary, and only uses str keys, and (str, int, float) values.

property extra_info_str: Optional[str]

Extra info string.

classmethod from_langchain_format(doc: Document) Document

Convert struct from LangChain document format.

get_doc_hash() str

Get doc_hash.

get_doc_id() str

Get doc_id.

get_embedding() List[float]

Get embedding.

Errors if embedding is None.

get_text() str

Get text.

classmethod get_type() str

Get Document type.

classmethod get_types() List[str]

Get Document types.

property is_doc_id_none: bool

Check if doc_id is None.

property is_text_none: bool

Check if text is None.

to_langchain_format() Document

Convert struct to LangChain document format.

class llama_index.readers.ElasticsearchReader(endpoint: str, index: str, httpx_client_args: Optional[dict] = None)

Read documents from an Elasticsearch/Opensearch index.

These documents can then be used in a downstream Llama Index data structure.

Parameters
  • endpoint (str) -- URL (http/https) of cluster

  • index (str) -- Name of the index (required)

  • httpx_client_args (dict) -- Optional additional args to pass to the httpx.Client

load_data(field: str, query: Optional[dict] = None, embedding_field: Optional[str] = None) List[Document]

Read data from the Elasticsearch index.

Parameters
  • field (str) -- Field in the document to retrieve text from

  • query (Optional[dict]) -- Elasticsearch JSON query DSL object. For example: {"query": {"match": {"message": {"query": "this is a test"}}}}

  • embedding_field (Optional[str]) -- If there are embeddings stored in this index, this field can be used to set the embedding field on the returned Document list.

Returns

A list of documents.

Return type

List[Document]
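
A usage sketch against a local cluster (index name and field are illustrative):

from llama_index.readers import ElasticsearchReader

reader = ElasticsearchReader(endpoint="http://localhost:9200", index="my-index")
documents = reader.load_data(
    field="message",  # document field containing the text
    query={"query": {"match_all": {}}},
)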

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.FaissReader(index: Any)

Faiss reader.

Retrieves documents through an existing in-memory Faiss index. These documents can then be used in a downstream LlamaIndex data structure. If you wish to use Faiss itself as an index to organize documents, insert documents, and perform queries on them, please use GPTVectorStoreIndex with FaissVectorStore.

Parameters

index (faiss.Index) -- A Faiss Index object (required)

load_data(query: ndarray, id_to_text_map: Dict[str, str], k: int = 4, separate_documents: bool = True) List[Document]

Load data from Faiss.

Parameters
  • query (np.ndarray) -- A 2D numpy array of query vectors.

  • id_to_text_map (Dict[str, str]) -- A map from IDs to text.

  • k (int) -- Number of nearest neighbors to retrieve. Defaults to 4.

  • separate_documents (Optional[bool]) -- Whether to return separate documents. Defaults to True.

Returns

A list of documents.

Return type

List[Document]
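
A usage sketch that builds a tiny in-memory index first (vectors and texts are illustrative):

import faiss
import numpy as np
from llama_index.readers import FaissReader

dim = 3
index = faiss.IndexFlatL2(dim)
index.add(np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype="float32"))
id_to_text_map = {"0": "first document", "1": "second document"}  # keys per documented Dict[str, str]

reader = FaissReader(index)
query = np.array([[0.1, 0.2, 0.3]], dtype="float32")  # 2D: one query vector
documents = reader.load_data(query=query, id_to_text_map=id_to_text_map, k=2)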

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.GithubRepositoryReader(owner: str, repo: str, use_parser: bool = True, verbose: bool = False, github_token: Optional[str] = None, concurrent_requests: int = 5, ignore_file_extensions: Optional[List[str]] = None, ignore_directories: Optional[List[str]] = None)

Github repository reader.

Retrieves the contents of a Github repository and returns a list of documents. The documents are either the contents of the files in the repository or the text extracted from the files using the parser.

Example

>>> reader = GithubRepositoryReader("owner", "repo")
>>> branch_documents = reader.load_data(branch="branch")
>>> commit_documents = reader.load_data(commit_sha="commit_sha")

load_data(commit_sha: Optional[str] = None, branch: Optional[str] = None) List[Document]

Load data from a commit or a branch.

Loads github repository data from a specific commit sha or a branch.

Parameters
  • commit_sha -- commit SHA

  • branch -- branch name

Returns

list of documents

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.GoogleDocsReader

Google Docs reader.

Reads a page from Google Docs.

load_data(document_ids: List[str]) List[Document]

Load data from the given document IDs.

Parameters

document_ids (List[str]) -- a list of document ids.
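
A usage sketch, assuming Google credentials are already configured (the document ID is hypothetical):

from llama_index.readers import GoogleDocsReader

reader = GoogleDocsReader()
documents = reader.load_data(document_ids=["your-google-doc-id"])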

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.JSONReader(levels_back: Optional[int] = None, collapse_length: Optional[int] = None)

JSON reader.

Reads JSON documents with options to help suss out relationships between nodes.

Parameters
  • levels_back (int) -- the number of levels to go back in the JSON tree; 0 if you want all levels. If levels_back is None, then we just format the JSON and make each line an embedding.

  • collapse_length (int) -- the maximum number of characters a JSON fragment would be collapsed into in the output. For example, if collapse_length = 10 and the input is {a: [1, 2, 3], b: {"hello": "world", "foo": "bar"}}, then a would be collapsed into one line, while b would not. Recommend starting around 100 and then adjusting from there.

load_data(input_file: str) List[Document]

Load data from the input file.
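
A usage sketch (the file path is illustrative):

from llama_index.readers import JSONReader

reader = JSONReader(levels_back=0, collapse_length=100)
documents = reader.load_data(input_file="./data/records.json")  # hypothetical path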

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MakeWrapper

Make reader.

load_data(*args: Any, **load_kwargs: Any) List[Document]

Load data from the input directory.

NOTE: This is not implemented.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

pass_response_to_webhook(webhook_url: str, response: Response, query: Optional[str] = None) None

Pass response object to webhook.

Parameters
  • webhook_url (str) -- Webhook URL.

  • response (Response) -- Response object.

  • query (Optional[str]) -- Query. Defaults to None.

class llama_index.readers.MboxReader

Mbox e-mail reader.

Reads a set of e-mails saved in the mbox format.

load_data(input_dir: str, **load_kwargs: Any) List[Document]

Load data from the input directory.

load_kwargs:
  • max_count (int) -- Maximum number of messages to read.

  • message_format (str) -- Message format, overriding the default.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MetalReader(api_key: str, client_id: str, index_id: str)

Metal reader.

Parameters
  • api_key (str) -- Metal API key.

  • client_id (str) -- Metal client ID.

  • index_id (str) -- Metal index ID.

load_data(limit: int, query_embedding: Optional[List[float]] = None, filters: Optional[Dict[str, Any]] = None, separate_documents: bool = True, **query_kwargs: Any) List[Document]

Load data from Metal.

Parameters
  • query_embedding (Optional[List[float]]) -- Query embedding for search.

  • limit (int) -- Number of results to return.

  • filters (Optional[Dict[str, Any]]) -- Filters to apply to the search.

  • separate_documents (Optional[bool]) -- Whether to return separate documents per retrieved entry. Defaults to True.

  • **query_kwargs -- Keyword arguments to pass to the search.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MilvusReader(host: str = 'localhost', port: int = 19530, user: str = '', password: str = '', use_secure: bool = False)

Milvus reader.

load_data(query_vector: List[float], collection_name: str, expr: Optional[Any] = None, search_params: Optional[dict] = None, limit: int = 10) List[Document]

Load data from Milvus.

Parameters
  • collection_name (str) -- Name of the Milvus collection.

  • query_vector (List[float]) -- Query vector.

  • limit (int) -- Number of results to return.

Returns

A list of documents.

Return type

List[Document]
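
A usage sketch against a local Milvus instance (collection name and vector values are illustrative):

from llama_index.readers import MilvusReader

reader = MilvusReader(host="localhost", port=19530)
documents = reader.load_data(
    query_vector=[0.1, 0.2, 0.3],  # must match the collection's vector dimension
    collection_name="my_collection",
    limit=10,
)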

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.MyScaleReader(myscale_host: str, username: str, password: str, myscale_port: Optional[int] = 8443, database: str = 'default', table: str = 'llama_index', index_type: str = 'IVFLAT', metric: str = 'cosine', batch_size: int = 32, index_params: Optional[dict] = None, search_params: Optional[dict] = None, **kwargs: Any)

MyScale reader.

Parameters
  • myscale_host (str) -- A URL to connect to the MyScale backend.

  • username (str) -- Username to log in.

  • password (str) -- Password to log in.

  • myscale_port (int) -- URL port to connect with HTTP. Defaults to 8443.

  • database (str) -- Database name to find the table. Defaults to 'default'.

  • table (str) -- Table name to operate on. Defaults to 'llama_index'.

  • index_type (str) -- Index type string. Defaults to "IVFLAT".

  • metric (str) -- Metric to compute distance; supported are ('l2', 'cosine', 'ip'). Defaults to 'cosine'.

  • batch_size (int, optional) -- the number of documents to insert per batch. Defaults to 32.

  • index_params (dict, optional) -- The index parameters for MyScale. Defaults to None.

  • search_params (dict, optional) -- The search parameters for a MyScale query. Defaults to None.

load_data(query_vector: List[float], where_str: Optional[str] = None, limit: int = 10) List[Document]

Load data from MyScale.

Parameters
  • query_vector (List[float]) -- Query vector.

  • where_str (Optional[str], optional) -- where condition string. Defaults to None.

  • limit (int) -- Number of results to return.

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.NotionPageReader(integration_token: Optional[str] = None)

Notion Page reader.

Reads a set of Notion pages.

Parameters

integration_token (str) -- Notion integration token.

load_data(page_ids: List[str] = [], database_id: Optional[str] = None) List[Document]

Load data from the given Notion page ids and/or database.

Parameters

page_ids (List[str]) -- List of page ids to load.

Returns

List of documents.

Return type

List[Document]
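
A usage sketch (the token and page ID are hypothetical):

from llama_index.readers import NotionPageReader

reader = NotionPageReader(integration_token="your-notion-integration-token")
documents = reader.load_data(page_ids=["your-page-id"])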

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

query_database(database_id: str, query_dict: Dict[str, Any] = {}) List[str]

Get all the pages from a Notion database.

read_page(page_id: str) str

Read a page.

search(query: str) List[str]

Search Notion page given a text query.

class llama_index.readers.ObsidianReader(input_dir: str)

Utilities for loading data from an Obsidian Vault.

Parameters

input_dir (str) -- Path to the vault.

load_data(*args: Any, **load_kwargs: Any) List[Document]

Load data from the input directory.
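
A usage sketch (the vault path is illustrative):

from llama_index.readers import ObsidianReader

reader = ObsidianReader(input_dir="/path/to/vault")
documents = reader.load_data()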

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.PineconeReader(api_key: str, environment: str)

Pinecone reader.

Parameters
  • api_key (str) -- Pinecone API key.

  • environment (str) -- Pinecone environment.

load_data(index_name: str, id_to_text_map: Dict[str, str], vector: Optional[List[float]], top_k: int, separate_documents: bool = True, include_values: bool = True, **query_kwargs: Any) List[Document]

Load data from Pinecone.

Parameters
  • index_name (str) -- Name of the index.

  • id_to_text_map (Dict[str, str]) -- A map from IDs to text.

  • separate_documents (Optional[bool]) -- Whether to return separate documents per retrieved entry. Defaults to True.

  • vector (List[float]) -- Query vector.

  • top_k (int) -- Number of results to return.

  • include_values (bool) -- Whether to include the embedding in the response. Defaults to True.

  • **query_kwargs -- Keyword arguments to pass to the query. Arguments are the exact same as those found in Pinecone's reference documentation for the query method.

Returns

A list of documents.

Return type

List[Document]
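
A usage sketch (the key, names, and vector values are hypothetical):

from llama_index.readers import PineconeReader

reader = PineconeReader(api_key="your-api-key", environment="us-west1-gcp")
documents = reader.load_data(
    index_name="my-index",
    id_to_text_map={"id1": "text blob 1"},
    vector=[0.1, 0.2, 0.3],
    top_k=3,
)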

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.PsychicReader(psychic_key: Optional[str] = None)

Psychic reader.

Psychic is a platform that allows syncing data from many SaaS apps through one universal API.

This reader connects to an instance of Psychic and reads data from it, given a connector ID, account ID, and API key.

Learn more at docs.psychic.dev.

Parameters

psychic_key (str) -- Secret key for Psychic. Get one at https://dashboard.psychic.dev/api-keys.

load_data(connector_id: Optional[str] = None, account_id: Optional[str] = None) List[Document]

Load data from a Psychic connection.

Parameters
  • connector_id (str) -- The connector ID to connect to.

  • account_id (str) -- The account ID to connect to.

Returns

List of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.QdrantReader(location: Optional[str] = None, url: Optional[str] = None, port: Optional[int] = 6333, grpc_port: int = 6334, prefer_grpc: bool = False, https: Optional[bool] = None, api_key: Optional[str] = None, prefix: Optional[str] = None, timeout: Optional[float] = None, host: Optional[str] = None, path: Optional[str] = None)

Qdrant reader.

Retrieve documents from existing Qdrant collections.

Parameters
  • location -- If :memory: - use in-memory Qdrant instance. If str - use it as a url parameter. If None - use default values for host and port.

  • url -- either host or str of "Optional[scheme], host, Optional[port], Optional[prefix]". Default: None

  • port -- Port of the REST API interface. Default: 6333

  • grpc_port -- Port of the gRPC interface. Default: 6334

  • prefer_grpc -- If true - use gRPC interface whenever possible in custom methods.

  • https -- If true - use HTTPS(SSL) protocol. Default: false

  • api_key -- API key for authentication in Qdrant Cloud. Default: None

  • prefix -- If not None - add prefix to the REST URL path. Example: service/v1 will result in http://localhost:6333/service/v1/{qdrant-endpoint} for REST API. Default: None

  • timeout -- Timeout for REST and gRPC API requests. Default: 5.0 seconds for REST and unlimited for gRPC

  • host -- Host name of Qdrant service. If url and host are None, set to 'localhost'. Default: None

load_data(collection_name: str, query_vector: List[float], should_search_mapping: Optional[Dict[str, str]] = None, must_search_mapping: Optional[Dict[str, str]] = None, must_not_search_mapping: Optional[Dict[str, str]] = None, rang_search_mapping: Optional[Dict[str, Dict[str, float]]] = None, limit: int = 10) List[Document]

Load data from Qdrant.

Parameters
  • collection_name (str) -- Name of the Qdrant collection.

  • query_vector (List[float]) -- Query vector.

  • should_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.

  • must_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.

  • must_not_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.

  • rang_search_mapping (Optional[Dict[str, Dict[str, float]]]) -- Mapping from field name to range query.

  • limit (int) -- Number of results to return.

Example

reader = QdrantReader()
reader.load_data(
    collection_name="test_collection",
    query_vector=[0.1, 0.2, 0.3],
    should_search_mapping={"text_field": "text"},
    must_search_mapping={"text_field": "text"},
    must_not_search_mapping={"text_field": "text"},
    # gte, lte, gt, lt supported
    rang_search_mapping={"text_field": {"gte": 0.1, "lte": 0.2}},
    limit=10,
)

Returns

A list of documents.

Return type

List[Document]

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.RssReader(html_to_text: bool = False)

RSS reader.

Reads content from an RSS feed.

load_data(urls: List[str]) List[Document]

Load data from RSS feeds.

Parameters

urls (List[str]) -- List of RSS URLs to load.

Returns

List of documents.

Return type

List[Document]
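
A usage sketch (the feed URL is illustrative):

from llama_index.readers import RssReader

reader = RssReader(html_to_text=True)
documents = reader.load_data(urls=["https://example.com/feed.xml"])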

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SimpleDirectoryReader(input_dir: Optional[str] = None, input_files: Optional[List] = None, exclude: Optional[List] = None, exclude_hidden: bool = True, errors: str = 'ignore', recursive: bool = False, required_exts: Optional[List[str]] = None, file_extractor: Optional[Dict[str, BaseReader]] = None, num_files_limit: Optional[int] = None, file_metadata: Optional[Callable[[str], Dict]] = None)

Simple directory reader.

Can read files into separate documents, or concatenate files into one document's text.

Parameters
  • input_dir (str) -- Path to the directory.

  • input_files (List) -- List of file paths to read (Optional; overrides input_dir, exclude)

  • exclude (List) -- glob of python file paths to exclude (Optional)

  • exclude_hidden (bool) -- Whether to exclude hidden files (dotfiles).

  • errors (str) -- how encoding and decoding errors are to be handled, see https://docs.python.org/3/library/functions.html#open

  • recursive (bool) -- Whether to recursively search in subdirectories. False by default.

  • required_exts (Optional[List[str]]) -- List of required extensions. Default is None.

  • file_extractor (Optional[Dict[str, BaseReader]]) -- A mapping of file extension to a BaseReader class that specifies how to convert that file to text. If not specified, use default from DEFAULT_FILE_READER_CLS.

  • num_files_limit (Optional[int]) -- Maximum number of files to read. Default is None.

  • file_metadata (Optional[Callable[[str], Dict]]) -- A function that takes in a filename and returns a Dict of metadata for the Document. Default is None.

load_data() List[Document]

Load data from the input directory.

Parameters

concatenate (bool) -- whether to concatenate all text docs into a single doc. If set to True, file metadata is ignored. False by default. This setting does not apply to image docs (always one doc per image).

Returns

A list of documents.

Return type

List[Document]
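
A usage sketch (the directory path and extensions are illustrative):

from llama_index.readers import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_dir="./data",
    recursive=True,
    required_exts=[".md", ".txt"],
)
documents = reader.load_data()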

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SimpleMongoReader(host: Optional[str] = None, port: Optional[int] = None, uri: Optional[str] = None, max_docs: int = 1000)

Simple mongo reader.

Concatenates each Mongo doc into a Document used by LlamaIndex.

Parameters
  • host (str) -- Mongo host.

  • port (int) -- Mongo port.

  • max_docs (int) -- Maximum number of documents to load.

load_data(db_name: str, collection_name: str, field_names: List[str] = ['text'], query_dict: Optional[Dict] = None) List[Document]

Load data from the specified MongoDB database and collection.

Parameters
  • db_name (str) -- name of the database.

  • collection_name (str) -- name of the collection.

  • field_names (List[str]) -- names of the fields to be concatenated. Defaults to ["text"].

  • query_dict (Optional[Dict]) -- query to filter documents. Defaults to None.

Returns

A list of documents.

Return type

List[Document]
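
A usage sketch against a local MongoDB instance (database and collection names are illustrative):

from llama_index.readers import SimpleMongoReader

reader = SimpleMongoReader(host="localhost", port=27017)
documents = reader.load_data(db_name="my_db", collection_name="posts", field_names=["text"])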

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SimpleWebPageReader(html_to_text: bool = False)

Simple web page reader.

Reads pages from the web.

Parameters

html_to_text (bool) -- Whether to convert HTML to text. Requires html2text package.

load_data(urls: List[str]) List[Document]

Load data from the given URLs.

Parameters

urls (List[str]) -- List of URLs to scrape.

Returns

List of documents.

Return type

List[Document]
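
A usage sketch (the URL is illustrative):

from llama_index.readers import SimpleWebPageReader

reader = SimpleWebPageReader(html_to_text=True)  # requires the html2text package
documents = reader.load_data(urls=["https://example.com"])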

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SlackReader(slack_token: Optional[str] = None, ssl: Optional[SSLContext] = None, earliest_date: Optional[datetime] = None, latest_date: Optional[datetime] = None)

Slack reader.

Reads conversations from channels. If an earliest_date is provided, an optional latest_date can also be provided. If no latest_date is provided, we assume the latest date is the current timestamp.

Parameters
  • slack_token (Optional[str]) -- Slack token. If not provided, we assume the environment variable SLACK_BOT_TOKEN is set.

  • ssl (Optional[str]) -- Custom SSL context. If not provided, it is assumed there is already an SSL context available.

  • earliest_date (Optional[datetime]) -- Earliest date from which to read conversations. If not provided, we read all messages.

  • latest_date (Optional[datetime]) -- Latest date from which to read conversations. If not provided, defaults to current timestamp in combination with earliest_date.

load_data(channel_ids: List[str], reverse_chronological: bool = True) List[Document]

Load data from the specified Slack channels.

Parameters

channel_ids (List[str]) -- List of channel ids to read.

Returns

List of documents.

Return type

List[Document]
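
A usage sketch, assuming the SLACK_BOT_TOKEN environment variable is set (the channel ID is hypothetical):

from llama_index.readers import SlackReader

reader = SlackReader()  # token read from SLACK_BOT_TOKEN
documents = reader.load_data(channel_ids=["C01234ABCDE"])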

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.SteamshipFileReader(api_key: Optional[str] = None)

Reads persistent Steamship Files and converts them to Documents.

Parameters

api_key -- Steamship API key. Defaults to STEAMSHIP_API_KEY value if not provided.

Note

Requires install of steamship package and an active Steamship API Key. To get a Steamship API Key, visit: https://steamship.com/account/api. Once you have an API Key, expose it via an environment variable named STEAMSHIP_API_KEY or pass it as an init argument (api_key).

load_data(workspace: str, query: Optional[str] = None, file_handles: Optional[List[str]] = None, collapse_blocks: bool = True, join_str: str = '\n\n') List[Document]

Load data from persistent Steamship Files into Documents.

Parameters
  • workspace -- the handle for a Steamship workspace (see: https://docs.steamship.com/workspaces/index.html)

  • query -- a Steamship tag query for retrieving files (ex: 'filetag and value("import-id")="import-001"')

  • file_handles -- a list of Steamship File handles (ex: smooth-valley-9kbdr)

  • collapse_blocks -- whether to merge individual File Blocks into a single Document, or separate them.

  • join_str -- when collapse_blocks is True, this is how the block texts will be concatenated.

Note

The collection of Files from both query and file_handles will be combined. There is no (current) support for deconflicting the collections (meaning that if a file appears both in the result set of the query and as a handle in file_handles, it will be loaded twice).

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.StringIterableReader

String Iterable Reader.

Gets a list of documents, given an iterable (e.g. list) of strings.

Example

from llama_index import StringIterableReader, GPTTreeIndex

documents = StringIterableReader().load_data(
    texts=["I went to the store", "I bought an apple"])
index = GPTTreeIndex.from_documents(documents)
query_engine = index.as_query_engine()
query_engine.query("what did I buy?")

# response should be something like "You bought an apple."

load_data(texts: List[str]) List[Document]

Load the data.

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.TrafilaturaWebReader(error_on_missing: bool = False)

Trafilatura web page reader.

Reads pages from the web. Requires the trafilatura package.

load_data(urls: List[str]) List[Document]

Load data from the URLs.

Parameters

urls (List[str]) -- List of URLs to scrape.

Returns

List of documents.

Return type

List[Document]
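
A usage sketch (the URL is illustrative; requires the trafilatura package):

from llama_index.readers import TrafilaturaWebReader

reader = TrafilaturaWebReader()
documents = reader.load_data(urls=["https://example.com/article"])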

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.TwitterTweetReader(bearer_token: str, num_tweets: Optional[int] = 100)

Twitter tweets reader.

Reads tweets of a user's Twitter handle.

See https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api for how to get access to the Twitter API.

Parameters
  • bearer_token (str) -- bearer token that you get from the Twitter API.

  • num_tweets (Optional[int]) -- Number of tweets to read for each Twitter handle. Defaults to 100 tweets.

load_data(twitterhandles: List[str], **load_kwargs: Any) List[Document]

Load tweets for the given Twitter handles.

Parameters

twitterhandles (List[str]) -- List of Twitter handles to read tweets from.
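
A usage sketch (the bearer token and handle are hypothetical):

from llama_index.readers import TwitterTweetReader

reader = TwitterTweetReader(bearer_token="your-bearer-token", num_tweets=50)
documents = reader.load_data(twitterhandles=["example_handle"])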

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.WeaviateReader(host: str, auth_client_secret: Optional[Any] = None)

Weaviate reader.

Retrieves documents from Weaviate through vector lookup. Allows option to concatenate retrieved documents into one Document, or to return separate Document objects per document.

Parameters
  • host (str) -- host.

  • auth_client_secret (Optional[weaviate.auth.AuthCredentials]) -- auth_client_secret.

load_data(class_name: Optional[str] = None, properties: Optional[List[str]] = None, graphql_query: Optional[str] = None, separate_documents: Optional[bool] = True) List[Document]

Load data from Weaviate.

If graphql_query is not found in load_kwargs, we assume that class_name and properties are provided.

Parameters
  • class_name (Optional[str]) -- class_name to retrieve documents from.

  • properties (Optional[List[str]]) -- properties to retrieve from documents.

  • graphql_query (Optional[str]) -- Raw GraphQL Query. We assume that the query is a Get query.

  • separate_documents (Optional[bool]) -- Whether to return separate documents. Defaults to True.

Returns

A list of documents.

Return type

List[Document]
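
A usage sketch (host, class name, and properties are illustrative):

from llama_index.readers import WeaviateReader

reader = WeaviateReader(host="http://localhost:8080")
documents = reader.load_data(class_name="Article", properties=["title", "content"])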

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.WikipediaReader

Wikipedia reader.

Reads Wikipedia pages.

load_data(pages: List[str], **load_kwargs: Any) List[Document]

Load data from the given Wikipedia pages.

Parameters

pages (List[str]) -- List of pages to read.
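
A usage sketch (page titles are illustrative):

from llama_index.readers import WikipediaReader

reader = WikipediaReader()
documents = reader.load_data(pages=["Large language model", "Vector database"])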

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.

class llama_index.readers.YoutubeTranscriptReader

Youtube Transcript reader.

load_data(ytlinks: List[str], **load_kwargs: Any) List[Document]

Load transcripts for the given YouTube links.

Parameters

ytlinks (List[str]) -- List of YouTube links for which transcripts are to be read.
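
A usage sketch (the video URL is hypothetical):

from llama_index.readers import YoutubeTranscriptReader

reader = YoutubeTranscriptReader()
documents = reader.load_data(ytlinks=["https://www.youtube.com/watch?v=abc123xyz00"])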

load_langchain_documents(**load_kwargs: Any) List[Document]

Load data in LangChain document format.