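Data Connectors
Note: Our data connectors are now offered through `LlamaHub <https://llamahub.ai/>`_ 🦙. LlamaHub is an open-source repository containing data loaders that you can easily plug into any LlamaIndex application.
The following data connectors can still be found in the core repository.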
Data Connectors for LlamaIndex.
This module contains the data connectors for LlamaIndex. Each connector inherits from a BaseReader class, connects to a data source, and loads Document objects from that data source.
You may also choose to construct Document objects manually, for instance in our Insert How-To Guide. See below for the API definition of a Document - the bare minimum is a text property.
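The reader pattern described above can be sketched in a few lines. This is a minimal illustration only: `SketchDocument` and `SketchReader` are hypothetical stand-ins, not the library's `Document` or `BaseReader` classes, and the real classes carry more fields and behavior.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for Document; only text is required."""

    text: str
    doc_id: Optional[str] = None
    extra_info: Optional[Dict[str, Any]] = None


class SketchReader:
    """Hypothetical stand-in for a reader: turns a data source into documents."""

    def load_data(self, lines: List[str]) -> List[SketchDocument]:
        # One document per non-empty input string.
        return [SketchDocument(text=line) for line in lines if line.strip()]


docs = SketchReader().load_data(["I went to the store", "", "I bought an apple"])
print([d.text for d in docs])
```

Constructing a document manually amounts to supplying the text yourself, as `SketchDocument(text="...")` does here.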
- class llama_index.readers.BeautifulSoupWebReader(website_extractor: Optional[Dict[str, Callable]] = None)
BeautifulSoup web page reader.
Reads pages from the web. Requires the bs4 and urllib packages.
- Parameters
website_extractor (Optional[Dict[str, Callable]]) -- A mapping of website hostname (e.g. google.com) to a function that specifies how to extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR.
- load_data(urls: List[str], custom_hostname: Optional[str] = None) List[Document]
Load data from the urls.
- Parameters
urls (List[str]) -- List of URLs to scrape.
custom_hostname (Optional[str]) -- Force a certain hostname in the case a website is displayed under custom URLs (e.g. Substack blogs)
- Returns
List of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
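The hostname-to-extractor mapping and the `custom_hostname` override for BeautifulSoupWebReader can be pictured as a simple lookup. The sketch below is an assumption about the general idea, not the reader's actual code; `pick_extractor` and the example extractor functions are hypothetical names.

```python
from typing import Any, Callable, Dict, Optional
from urllib.parse import urlparse


def pick_extractor(
    url: str,
    website_extractor: Dict[str, Callable[[Any], str]],
    custom_hostname: Optional[str] = None,
) -> Optional[Callable[[Any], str]]:
    """Choose an extractor by hostname; custom_hostname overrides the URL's own host."""
    hostname = custom_hostname or urlparse(url).hostname
    return website_extractor.get(hostname)


def extract_google(soup: Any) -> str:  # hypothetical extractor
    return "google text"


def extract_substack(soup: Any) -> str:  # hypothetical extractor
    return "substack text"


extractors = {"google.com": extract_google, "substack.com": extract_substack}
by_url = pick_extractor("https://google.com/search", extractors)
forced = pick_extractor(
    "https://blog.example.com/p/post", extractors, custom_hostname="substack.com"
)
```

Forcing the hostname is what lets a blog served under its own domain still use the Substack extractor.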
- class llama_index.readers.ChatGPTRetrievalPluginReader(endpoint_url: str, bearer_token: Optional[str] = None, retries: Optional[Retry] = None, batch_size: int = 100)
ChatGPT Retrieval Plugin reader.
- load_data(query: str, top_k: int = 10, separate_documents: bool = True, **kwargs: Any) List[Document]
Load data from ChatGPT Retrieval Plugin.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.ChromaReader(collection_name: str, persist_directory: Optional[str] = None, host: str = 'localhost', port: int = 8000)
Chroma reader.
Retrieve documents from existing persisted Chroma collections.
- Parameters
collection_name -- Name of the persisted collection.
persist_directory -- Directory where the collection is persisted.
- create_documents(results: Any) List[Document]
Create documents from the results.
- Parameters
results -- Results from the query.
- Returns
List of documents.
- load_data(query_embedding: Optional[List[float]] = None, limit: int = 10, where: Optional[dict] = None, where_document: Optional[dict] = None, query: Optional[Union[str, List[str]]] = None) Any
Load data from the collection.
- Parameters
limit -- Number of results to return.
where -- Filter results by metadata. {"metadata_field": "is_equal_to_this"}
where_document -- Filter results by document. {"$contains":"search_string"}
- Returns
List of documents.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
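To make the two ChromaReader filter shapes concrete, here is a plain-Python sketch of what `{"metadata_field": "is_equal_to_this"}` and `{"$contains": "search_string"}` express. It only mimics the meaning on in-memory dictionaries; the `matches` helper and the record layout are assumptions for illustration and do not call Chroma.

```python
from typing import Any, Dict, Optional


def matches(
    record: Dict[str, Any],
    where: Optional[Dict[str, Any]] = None,
    where_document: Optional[Dict[str, str]] = None,
) -> bool:
    """Return True if the record passes both the metadata and the document filter."""
    # Metadata filter: every listed field must equal the given value.
    metadata_ok = all(
        record["metadata"].get(key) == value for key, value in (where or {}).items()
    )
    # Document filter: the text must contain the search string, if one is given.
    needle = (where_document or {}).get("$contains")
    document_ok = needle is None or needle in record["text"]
    return metadata_ok and document_ok


records = [
    {"text": "apples are red", "metadata": {"source": "fruit"}},
    {"text": "bananas are yellow", "metadata": {"source": "fruit"}},
    {"text": "cars are fast", "metadata": {"source": "auto"}},
]
filtered = [
    r["text"]
    for r in records
    if matches(r, where={"source": "fruit"}, where_document={"$contains": "red"})
]
```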
- class llama_index.readers.DeepLakeReader(token: Optional[str] = None)
DeepLake reader.
Retrieve documents from existing DeepLake datasets.
- Parameters
dataset_name -- Name of the deeplake dataset.
- load_data(query_vector: List[float], dataset_path: str, limit: int = 4, distance_metric: str = 'l2') List[Document]
Load data from DeepLake.
- Parameters
dataset_path (str) -- Path to the DeepLake dataset.
query_vector (List[float]) -- Query vector.
limit (int) -- Number of results to return.
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.DiscordReader(discord_token: Optional[str] = None)
Discord reader.
Reads conversations from channels.
- Parameters
discord_token (Optional[str]) -- Discord token. If not provided, we assume the environment variable DISCORD_TOKEN is set.
- load_data(channel_ids: List[int], limit: Optional[int] = None, oldest_first: bool = True) List[Document]
Load data from the given channels.
- Parameters
channel_ids (List[int]) -- List of channel ids to read.
limit (Optional[int]) -- Maximum number of messages to read.
oldest_first (bool) -- Whether to read oldest messages first. Defaults to True.
- Returns
List of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.Document(text: Optional[str] = None, doc_id: Optional[str] = None, embedding: Optional[List[float]] = None, doc_hash: Optional[str] = None, extra_info: Optional[Dict[str, Any]] = None)
Generic interface for a data document.
This document connects to data sources.
- doc_hash: Optional[str] = None
extra_info holds the metadata fields: they are injected as part of the text shown to LLMs as context, and they are used by vector DBs for metadata filtering.
This must be a flat dictionary with str keys and str, int, or float values only.
- property extra_info_str: Optional[str]
Extra info string.
- classmethod from_langchain_format(doc: Document) Document
Convert struct from LangChain document format.
- get_doc_hash() str
Get doc_hash.
- get_doc_id() str
Get doc_id.
- get_embedding() List[float]
Get embedding.
Errors if embedding is None.
- get_text() str
Get text.
- classmethod get_type() str
Get Document type.
- classmethod get_types() List[str]
Get Document types.
- property is_doc_id_none: bool
Check if doc_id is None.
- property is_text_none: bool
Check if text is None.
- to_langchain_format() Document
Convert struct to LangChain document format.
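The flat-dictionary requirement on the Document metadata can be checked before a document is built. The helper below is a hypothetical sketch (`is_flat_metadata` is not part of the library) that tests exactly the stated rule: str keys and str, int, or float values.

```python
from typing import Any, Dict


def is_flat_metadata(extra_info: Dict[Any, Any]) -> bool:
    """Return True if every key is a str and every value is a str, int, or float.

    Note: bool is a subclass of int in Python, so True/False values also pass.
    """
    return all(
        isinstance(key, str) and isinstance(value, (str, int, float))
        for key, value in extra_info.items()
    )


flat = {"author": "someone", "page": 3, "score": 0.5}
nested = {"tags": ["a", "b"]}
```

Nested values such as lists or dictionaries fail the check, so they would need to be serialized to a string first.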
- class llama_index.readers.ElasticsearchReader(endpoint: str, index: str, httpx_client_args: Optional[dict] = None)
Read documents from an Elasticsearch/Opensearch index.
These documents can then be used in a downstream Llama Index data structure.
- Parameters
endpoint (str) -- URL (http/https) of cluster
index (str) -- Name of the index (required)
httpx_client_args (dict) -- Optional additional args to pass to the httpx.Client
- load_data(field: str, query: Optional[dict] = None, embedding_field: Optional[str] = None) List[Document]
Read data from the Elasticsearch index.
- Parameters
field (str) -- Field in the document to retrieve text from
query (Optional[dict]) -- Elasticsearch JSON query DSL object. For example: {"query": {"match": {"message": {"query": "this is a test"}}}}
embedding_field (Optional[str]) -- If there are embeddings stored in this index, this field can be used to set the embedding field on the returned Document list.
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
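The `field`, `query`, and `embedding_field` arguments of ElasticsearchReader.load_data can be related to the shape of an Elasticsearch search response, in which each hit carries an `_id` and a `_source` object. The sketch below works on a hand-written response dictionary; `hits_to_rows` is a hypothetical helper and the reader's internals may differ.

```python
from typing import Any, Dict, List, Optional

# The query DSL object from the parameter description above.
query = {"query": {"match": {"message": {"query": "this is a test"}}}}


def hits_to_rows(
    response: Dict[str, Any], field: str, embedding_field: Optional[str] = None
) -> List[Dict[str, Any]]:
    """Pull the text field (and optional embedding) out of each search hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        rows.append(
            {
                "doc_id": hit["_id"],
                "text": source[field],
                "embedding": source.get(embedding_field) if embedding_field else None,
            }
        )
    return rows


# Hand-written sample response, for illustration only.
response = {
    "hits": {
        "hits": [{"_id": "1", "_source": {"message": "this is a test", "vec": [0.1, 0.2]}}]
    }
}
rows = hits_to_rows(response, field="message", embedding_field="vec")
```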
- class llama_index.readers.FaissReader(index: Any)
Faiss reader.
Retrieves documents through an existing in-memory Faiss index. These documents can then be used in a downstream LlamaIndex data structure. If you wish to use Faiss itself as an index to organize documents, insert documents, and perform queries on them, please use GPTVectorStoreIndex with FaissVectorStore.
- Parameters
index (faiss.Index) -- A Faiss Index object (required)
- load_data(query: ndarray, id_to_text_map: Dict[str, str], k: int = 4, separate_documents: bool = True) List[Document]
Load data from Faiss.
- Parameters
query (np.ndarray) -- A 2D numpy array of query vectors.
id_to_text_map (Dict[str, str]) -- A map from IDs to text.
k (int) -- Number of nearest neighbors to retrieve. Defaults to 4.
separate_documents (Optional[bool]) -- Whether to return separate documents. Defaults to True.
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
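The role of `id_to_text_map` and `k` in FaissReader.load_data can be illustrated without Faiss: a vector search returns the IDs of the nearest vectors, and the map turns those IDs back into text. The brute-force search below is a stand-in for the index, assuming L2 distance and string keys formed from each vector's position; `nearest_texts` is a hypothetical name.

```python
import math
from typing import Dict, List, Sequence


def nearest_texts(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    id_to_text_map: Dict[str, str],
    k: int = 4,
) -> List[str]:
    """Return the texts of the k vectors closest to the query (L2 distance)."""
    # Rank every stored vector by distance; a real index does this far faster.
    ranked = sorted((math.dist(query, vec), idx) for idx, vec in enumerate(vectors))
    return [id_to_text_map[str(idx)] for _, idx in ranked[:k]]


vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
id_to_text_map = {"0": "origin", "1": "near one", "2": "far away"}
top_two = nearest_texts([0.9, 0.9], vectors, id_to_text_map, k=2)
```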
- class llama_index.readers.GithubRepositoryReader(owner: str, repo: str, use_parser: bool = True, verbose: bool = False, github_token: Optional[str] = None, concurrent_requests: int = 5, ignore_file_extensions: Optional[List[str]] = None, ignore_directories: Optional[List[str]] = None)
Github repository reader.
Retrieves the contents of a Github repository and returns a list of documents. The documents are either the contents of the files in the repository or the text extracted from the files using the parser.
Example
>>> reader = GithubRepositoryReader("owner", "repo")
>>> branch_documents = reader.load_data(branch="branch")
>>> commit_documents = reader.load_data(commit_sha="commit_sha")
- load_data(commit_sha: Optional[str] = None, branch: Optional[str] = None) List[Document]
Load data from a commit or a branch.
Loads github repository data from a specific commit sha or a branch.
- Parameters
commit_sha -- commit sha
branch -- branch name
- Returns
List of documents.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.GoogleDocsReader
Google Docs reader.
Reads a page from Google Docs.
- load_data(document_ids: List[str]) List[Document]
Load data from the given Google Docs documents.
- Parameters
document_ids (List[str]) -- a list of document ids.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.JSONReader(levels_back: Optional[int] = None, collapse_length: Optional[int] = None)
JSON reader.
Reads JSON documents with options to help suss out relationships between nodes.
- Parameters
levels_back (int) -- the number of levels to go back in the JSON tree, 0 if you want all levels. If levels_back is None, then we just format the JSON and make each line an embedding.
collapse_length (int) -- the maximum number of characters a JSON fragment would be collapsed in the output. For example, if collapse_length = 10 and the input is {a: [1, 2, 3], b: {"hello": "world", "foo": "bar"}}, then a would be collapsed into one line, while b would not. Recommend starting around 100 and then adjusting from there.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
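The `collapse_length` behavior of JSONReader can be reproduced with a small recursive walk. This is a simplified sketch under the assumption that a fragment is collapsed whenever its serialized form fits within `collapse_length` characters; it ignores `levels_back`, and `flatten_json` is a hypothetical name rather than the reader's own function.

```python
import json
from typing import Any, Iterator, Tuple


def flatten_json(
    data: Any, collapse_length: int, path: Tuple[str, ...] = ()
) -> Iterator[str]:
    """Yield one line per leaf, collapsing fragments short enough to fit on one line."""
    if isinstance(data, (dict, list)):
        as_text = json.dumps(data)
        if len(as_text) <= collapse_length:
            # Short fragment: keep it whole on a single line.
            yield " ".join([*path, as_text])
        elif isinstance(data, dict):
            for key, value in data.items():
                yield from flatten_json(value, collapse_length, (*path, str(key)))
        else:
            for item in data:
                yield from flatten_json(item, collapse_length, path)
    else:
        yield " ".join([*path, str(data)])


sample = {"a": [1, 2, 3], "b": {"hello": "world", "foo": "bar"}}
lines = list(flatten_json(sample, collapse_length=10))
# lines == ["a [1, 2, 3]", "b hello world", "b foo bar"]
```

With a limit of 10 characters, `[1, 2, 3]` (9 characters) stays on one line while the longer `b` object is expanded key by key.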
- class llama_index.readers.MakeWrapper
Make reader.
- load_data(*args: Any, **load_kwargs: Any) List[Document]
Load data from the input directory.
NOTE: This is not implemented.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.MboxReader
Mbox e-mail reader.
Reads a set of e-mails saved in the mbox format.
- load_data(input_dir: str, **load_kwargs: Any) List[Document]
Load data from the input directory.
- load_kwargs:
max_count (int): Maximum number of messages to read.
message_format (str): Message format overriding default.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
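Reading an mbox file with a message cap, as MboxReader does with `max_count`, can be sketched with Python's standard `mailbox` module. This is not the reader's implementation: `read_mbox_subjects` is a hypothetical helper, and treating a non-positive `max_count` as "no limit" is an assumption made for this sketch.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage
from typing import List


def read_mbox_subjects(path: str, max_count: int = 0) -> List[str]:
    """Return message subjects from an mbox file, stopping after max_count if positive."""
    subjects = []
    box = mailbox.mbox(path)
    try:
        for index, message in enumerate(box):
            if max_count > 0 and index >= max_count:
                break
            subjects.append(message["subject"])
    finally:
        box.close()
    return subjects


# Build a small throwaway mbox file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    mbox_path = os.path.join(tmp, "sample.mbox")
    writer = mailbox.mbox(mbox_path)
    for n in range(3):
        msg = EmailMessage()
        msg["From"] = "sender@example.com"
        msg["Subject"] = f"message {n}"
        msg.set_content("hello")
        writer.add(msg)
    writer.flush()
    writer.close()
    subjects = read_mbox_subjects(mbox_path, max_count=2)
```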
- class llama_index.readers.MetalReader(api_key: str, client_id: str, index_id: str)
Metal reader.
- Parameters
api_key (str) -- Metal API key.
client_id (str) -- Metal client ID.
index_id (str) -- Metal index ID.
- load_data(limit: int, query_embedding: Optional[List[float]] = None, filters: Optional[Dict[str, Any]] = None, separate_documents: bool = True, **query_kwargs: Any) List[Document]
Load data from Metal.
- Parameters
query_embedding (Optional[List[float]]) -- Query embedding for search.
limit (int) -- Number of results to return.
filters (Optional[Dict[str, Any]]) -- Filters to apply to the search.
separate_documents (Optional[bool]) -- Whether to return separate documents per retrieved entry. Defaults to True.
**query_kwargs -- Keyword arguments to pass to the search.
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.MilvusReader(host: str = 'localhost', port: int = 19530, user: str = '', password: str = '', use_secure: bool = False)
Milvus reader.
- load_data(query_vector: List[float], collection_name: str, expr: Optional[Any] = None, search_params: Optional[dict] = None, limit: int = 10) List[Document]
Load data from Milvus.
- Parameters
collection_name (str) -- Name of the Milvus collection.
query_vector (List[float]) -- Query vector.
limit (int) -- Number of results to return.
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.MyScaleReader(myscale_host: str, username: str, password: str, myscale_port: Optional[int] = 8443, database: str = 'default', table: str = 'llama_index', index_type: str = 'IVFLAT', metric: str = 'cosine', batch_size: int = 32, index_params: Optional[dict] = None, search_params: Optional[dict] = None, **kwargs: Any)
MyScale reader.
- Parameters
myscale_host (str) -- A URL to connect to the MyScale backend.
username (str) -- Username to log in.
password (str) -- Password to login.
myscale_port (int) -- URL port to connect with HTTP. Defaults to 8443.
database (str) -- Database name to find the table. Defaults to 'default'.
table (str) -- Table name to operate on. Defaults to 'llama_index'.
index_type (str) -- index type string. Defaults to "IVFLAT".
metric (str) -- Metric to compute distance, supported are ('l2', 'cosine', 'ip'). Defaults to 'cosine'
batch_size (int, optional) -- the size of documents to insert. Defaults to 32.
index_params (dict, optional) -- The index parameters for MyScale. Defaults to None.
search_params (dict, optional) -- The search parameters for a MyScale query. Defaults to None.
- load_data(query_vector: List[float], where_str: Optional[str] = None, limit: int = 10) List[Document]
Load data from MyScale.
- Parameters
query_vector (List[float]) -- Query vector.
where_str (Optional[str], optional) -- where condition string. Defaults to None.
limit (int) -- Number of results to return.
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.NotionPageReader(integration_token: Optional[str] = None)
Notion Page reader.
Reads a set of Notion pages.
- Parameters
integration_token (str) -- Notion integration token.
- load_data(page_ids: List[str] = [], database_id: Optional[str] = None) List[Document]
Load data from the given Notion pages or database.
- Parameters
page_ids (List[str]) -- List of page ids to load.
- Returns
List of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- query_database(database_id: str, query_dict: Dict[str, Any] = {}) List[str]
Get all the pages from a Notion database.
- read_page(page_id: str) str
Read a page.
- search(query: str) List[str]
Search Notion page given a text query.
- class llama_index.readers.ObsidianReader(input_dir: str)
Utilities for loading data from an Obsidian Vault.
- Parameters
input_dir (str) -- Path to the vault.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.PineconeReader(api_key: str, environment: str)
Pinecone reader.
- Parameters
api_key (str) -- Pinecone API key.
environment (str) -- Pinecone environment.
- load_data(index_name: str, id_to_text_map: Dict[str, str], vector: Optional[List[float]], top_k: int, separate_documents: bool = True, include_values: bool = True, **query_kwargs: Any) List[Document]
Load data from Pinecone.
- Parameters
index_name (str) -- Name of the index.
id_to_text_map (Dict[str, str]) -- A map from IDs to text.
separate_documents (Optional[bool]) -- Whether to return separate documents per retrieved entry. Defaults to True.
vector (List[float]) -- Query vector.
top_k (int) -- Number of results to return.
include_values (bool) -- Whether to include the embedding in the response. Defaults to True.
**query_kwargs -- Keyword arguments to pass to the query. Arguments are the exact same as those found in Pinecone's reference documentation for the query method.
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.PsychicReader(psychic_key: Optional[str] = None)
Psychic reader.
- Psychic is a platform that allows syncing data from many SaaS apps through one universal API.
- This reader connects to an instance of Psychic and reads data from it, given a connector ID, account ID, and API key.
Learn more at docs.psychic.dev.
- Parameters
psychic_key (str) -- Secret key for Psychic. Get one at https://dashboard.psychic.dev/api-keys.
- load_data(connector_id: Optional[str] = None, account_id: Optional[str] = None) List[Document]
Load data from a Psychic connection.
- Parameters
connector_id (str) -- The connector ID to connect to
account_id (str) -- The account ID to connect to
- Returns
List of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.QdrantReader(location: Optional[str] = None, url: Optional[str] = None, port: Optional[int] = 6333, grpc_port: int = 6334, prefer_grpc: bool = False, https: Optional[bool] = None, api_key: Optional[str] = None, prefix: Optional[str] = None, timeout: Optional[float] = None, host: Optional[str] = None, path: Optional[str] = None)
Qdrant reader.
Retrieve documents from existing Qdrant collections.
- Parameters
location -- If :memory: - use in-memory Qdrant instance. If str - use it as a url parameter. If None - use default values for host and port.
url -- either host or str of "Optional[scheme], host, Optional[port], Optional[prefix]". Default: None
port -- Port of the REST API interface. Default: 6333
grpc_port -- Port of the gRPC interface. Default: 6334
prefer_grpc -- If true - use gRPC interface whenever possible in custom methods.
https -- If true - use HTTPS(SSL) protocol. Default: false
api_key -- API key for authentication in Qdrant Cloud. Default: None
prefix -- If not None - add prefix to the REST URL path. Example: service/v1 will result in http://localhost:6333/service/v1/{qdrant-endpoint} for REST API. Default: None
timeout -- Timeout for REST and gRPC API requests. Default: 5.0 seconds for REST and unlimited for gRPC
host -- Host name of Qdrant service. If url and host are None, set to 'localhost'. Default: None
- load_data(collection_name: str, query_vector: List[float], should_search_mapping: Optional[Dict[str, str]] = None, must_search_mapping: Optional[Dict[str, str]] = None, must_not_search_mapping: Optional[Dict[str, str]] = None, rang_search_mapping: Optional[Dict[str, Dict[str, float]]] = None, limit: int = 10) List[Document]
Load data from Qdrant.
- Parameters
collection_name (str) -- Name of the Qdrant collection.
query_vector (List[float]) -- Query vector.
should_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.
must_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.
must_not_search_mapping (Optional[Dict[str, str]]) -- Mapping from field name to query string.
rang_search_mapping (Optional[Dict[str, Dict[str, float]]]) -- Mapping from field name to range query.
limit (int) -- Number of results to return.
Example
reader = QdrantReader()
reader.load_data(
    collection_name="test_collection",
    query_vector=[0.1, 0.2, 0.3],
    should_search_mapping={"text_field": "text"},
    must_search_mapping={"text_field": "text"},
    must_not_search_mapping={"text_field": "text"},
    # gte, lte, gt, lt supported
    rang_search_mapping={"text_field": {"gte": 0.1, "lte": 0.2}},
    limit=10,
)
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
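The four bounds accepted in `rang_search_mapping` for QdrantReader (gte, lte, gt, lt) can be read as ordinary comparisons. The sketch below spells that out in plain Python; it does not call Qdrant, and `in_range` is a hypothetical helper for illustration.

```python
import operator
from typing import Dict

# Map each supported bound name to its comparison.
RANGE_OPS = {
    "gte": operator.ge,
    "lte": operator.le,
    "gt": operator.gt,
    "lt": operator.lt,
}


def in_range(value: float, bounds: Dict[str, float]) -> bool:
    """Return True if value satisfies every bound in the mapping."""
    return all(RANGE_OPS[name](value, bound) for name, bound in bounds.items())


inside = in_range(0.15, {"gte": 0.1, "lte": 0.2})
on_open_edge = in_range(0.2, {"gte": 0.1, "lt": 0.2})
```

The inclusive bounds (gte, lte) accept the endpoint itself, while the exclusive ones (gt, lt) reject it.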
- class llama_index.readers.RssReader(html_to_text: bool = False)
RSS reader.
Reads content from an RSS feed.
- load_data(urls: List[str]) List[Document]
Load data from RSS feeds.
- Parameters
urls (List[str]) -- List of RSS URLs to load.
- Returns
List of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.SimpleDirectoryReader(input_dir: Optional[str] = None, input_files: Optional[List] = None, exclude: Optional[List] = None, exclude_hidden: bool = True, errors: str = 'ignore', recursive: bool = False, required_exts: Optional[List[str]] = None, file_extractor: Optional[Dict[str, BaseReader]] = None, num_files_limit: Optional[int] = None, file_metadata: Optional[Callable[[str], Dict]] = None)
Simple directory reader.
Can read files into separate documents, or concatenate files into one document text.
- Parameters
input_dir (str) -- Path to the directory.
input_files (List) -- List of file paths to read (Optional; overrides input_dir, exclude)
exclude (List) -- glob of python file paths to exclude (Optional)
exclude_hidden (bool) -- Whether to exclude hidden files (dotfiles).
errors (str) -- how encoding and decoding errors are to be handled, see https://docs.python.org/3/library/functions.html#open
recursive (bool) -- Whether to recursively search in subdirectories. False by default.
required_exts (Optional[List[str]]) -- List of required extensions. Default is None.
file_extractor (Optional[Dict[str, BaseReader]]) -- A mapping of file extension to a BaseReader class that specifies how to convert that file to text. If not specified, use default from DEFAULT_FILE_READER_CLS.
num_files_limit (Optional[int]) -- Maximum number of files to read. Default is None.
file_metadata (Optional[Callable[[str], Dict]]) -- A function that takes in a filename and returns a Dict of metadata for the Document. Default is None.
- load_data() List[Document]
Load data from the input directory.
- Parameters
concatenate (bool) -- whether to concatenate all text docs into a single doc. If set to True, file metadata is ignored. False by default. This setting does not apply to image docs (always one doc per image).
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
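How the SimpleDirectoryReader options `recursive`, `exclude_hidden`, `required_exts`, and `num_files_limit` narrow down a directory listing can be sketched with `pathlib`. This is an approximation for illustration: `select_files` is a hypothetical helper, it only checks the file's own name for the dotfile test, and the reader's real selection logic may differ in ordering and details.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def select_files(
    input_dir: str,
    recursive: bool = False,
    exclude_hidden: bool = True,
    required_exts: Optional[List[str]] = None,
    num_files_limit: Optional[int] = None,
) -> List[Path]:
    """Return the files that pass the hidden-file, extension, and count filters."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in sorted(candidates):
        if not path.is_file():
            continue
        if exclude_hidden and path.name.startswith("."):
            continue
        if required_exts is not None and path.suffix not in required_exts:
            continue
        selected.append(path)
    if num_files_limit is not None:
        selected = selected[:num_files_limit]
    return selected


# Build a small throwaway directory tree so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "a.txt").write_text("a")
    (base / "b.md").write_text("b")
    (base / ".hidden.txt").write_text("h")
    (base / "sub").mkdir()
    (base / "sub" / "c.txt").write_text("c")
    top_level = [p.name for p in select_files(tmp, required_exts=[".txt"])]
    nested = [p.name for p in select_files(tmp, recursive=True, required_exts=[".txt"])]
    limited = [p.name for p in select_files(tmp, recursive=True, num_files_limit=1)]
```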
- class llama_index.readers.SimpleMongoReader(host: Optional[str] = None, port: Optional[int] = None, uri: Optional[str] = None, max_docs: int = 1000)
Simple mongo reader.
Concatenates each Mongo doc into Document used by LlamaIndex.
- Parameters
host (str) -- Mongo host.
port (int) -- Mongo port.
max_docs (int) -- Maximum number of documents to load.
- load_data(db_name: str, collection_name: str, field_names: List[str] = ['text'], query_dict: Optional[Dict] = None) List[Document]
Load data from the given Mongo collection.
- Parameters
db_name (str) -- name of the database.
collection_name (str) -- name of the collection.
field_names (List[str]) -- names of the fields to be concatenated. Defaults to ["text"]
query_dict (Optional[Dict]) -- query to filter documents. Defaults to None
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
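The idea of concatenating selected fields of each Mongo record into one document text for SimpleMongoReader can be sketched on plain dictionaries. The separator and the way `max_docs` caps the result are assumptions made for this illustration, and `concat_fields` is a hypothetical helper that does not talk to MongoDB.

```python
from typing import Any, Dict, List, Sequence


def concat_fields(
    records: List[Dict[str, Any]],
    field_names: Sequence[str] = ("text",),
    max_docs: int = 1000,
    separator: str = " ",  # assumed separator, chosen for readability
) -> List[str]:
    """Join the chosen fields of each record into a single text, up to max_docs records."""
    texts = []
    for record in records[:max_docs]:
        texts.append(separator.join(str(record[name]) for name in field_names))
    return texts


records = [
    {"title": "Intro", "text": "Hello world", "views": 3},
    {"title": "Next", "text": "More text", "views": 5},
]
default_texts = concat_fields(records)
combined = concat_fields(records, field_names=["title", "text"], max_docs=1)
```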
- class llama_index.readers.SimpleWebPageReader(html_to_text: bool = False)
Simple web page reader.
Reads pages from the web.
- Parameters
html_to_text (bool) -- Whether to convert HTML to text. Requires html2text package.
- load_data(urls: List[str]) List[Document]
Load data from the given URLs.
- Parameters
urls (List[str]) -- List of URLs to scrape.
- Returns
List of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.SlackReader(slack_token: Optional[str] = None, ssl: Optional[SSLContext] = None, earliest_date: Optional[datetime] = None, latest_date: Optional[datetime] = None)
Slack reader.
Reads conversations from channels. If an earliest_date is provided, an optional latest_date can also be provided. If no latest_date is provided, we assume the latest date is the current timestamp.
- Parameters
slack_token (Optional[str]) -- Slack token. If not provided, we assume the environment variable SLACK_BOT_TOKEN is set.
ssl (Optional[SSLContext]) -- Custom SSL context. If not provided, it is assumed there is already an SSL context available.
earliest_date (Optional[datetime]) -- Earliest date from which to read conversations. If not provided, we read all messages.
latest_date (Optional[datetime]) -- Latest date from which to read conversations. If not provided, defaults to current timestamp in combination with earliest_date.
- load_data(channel_ids: List[str], reverse_chronological: bool = True) List[Document]
Load data from the given channels.
- Parameters
channel_ids (List[str]) -- List of channel ids to read.
- Returns
List of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
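The SlackReader date rules (no `earliest_date` means all messages; a missing `latest_date` falls back to the current time) can be written as a small function. This is a sketch of the documented rule only: `resolve_window` and its `now` argument are hypothetical, added so the behavior can be checked with fixed dates.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def resolve_window(
    earliest_date: Optional[datetime] = None,
    latest_date: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> Optional[Tuple[float, float]]:
    """Return (oldest, latest) timestamps, or None when every message should be read."""
    if earliest_date is None:
        # No lower bound given: read all messages.
        return None
    if latest_date is None:
        # No upper bound given: use the current time.
        latest_date = now or datetime.now(timezone.utc)
    return earliest_date.timestamp(), latest_date.timestamp()


start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2023, 1, 2, tzinfo=timezone.utc)
window = resolve_window(start, now=end)
```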
- class llama_index.readers.SteamshipFileReader(api_key: Optional[str] = None)
Reads persistent Steamship Files and converts them to Documents.
- Parameters
api_key -- Steamship API key. Defaults to STEAMSHIP_API_KEY value if not provided.
Note
Requires install of steamship package and an active Steamship API Key. To get a Steamship API Key, visit: https://steamship.com/account/api. Once you have an API Key, expose it via an environment variable named STEAMSHIP_API_KEY or pass it as an init argument (api_key).
- load_data(workspace: str, query: Optional[str] = None, file_handles: Optional[List[str]] = None, collapse_blocks: bool = True, join_str: str = '\n\n') List[Document]
Load data from persistent Steamship Files into Documents.
- Parameters
workspace -- the handle for a Steamship workspace (see: https://docs.steamship.com/workspaces/index.html)
query -- a Steamship tag query for retrieving files (ex: 'filetag and value("import-id")="import-001"')
file_handles -- a list of Steamship File handles (ex: smooth-valley-9kbdr)
collapse_blocks -- whether to merge individual File Blocks into a single Document, or separate them.
join_str -- when collapse_blocks is True, this is how the block texts will be concatenated.
Note
The collection of Files from both query and file_handles will be combined. There is no (current) support for deconflicting the collections (meaning that if a file appears both in the result set of the query and as a handle in file_handles, it will be loaded twice).
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
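The effect of `collapse_blocks` and `join_str` in SteamshipFileReader.load_data can be shown on a list of block texts. This sketch covers only the merge-or-separate choice described above; `blocks_to_texts` is a hypothetical helper and does not use the steamship package.

```python
from typing import List


def blocks_to_texts(
    block_texts: List[str], collapse_blocks: bool = True, join_str: str = "\n\n"
) -> List[str]:
    """Merge a file's blocks into one text, or keep one text per block."""
    if collapse_blocks:
        return [join_str.join(block_texts)]
    return list(block_texts)


blocks = ["First block.", "Second block."]
merged = blocks_to_texts(blocks)
separate = blocks_to_texts(blocks, collapse_blocks=False)
```

Because files matched by both `query` and `file_handles` are loaded twice, a caller who needs unique results would have to deduplicate the returned documents afterwards.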
- class llama_index.readers.StringIterableReader
String Iterable Reader.
Gets a list of documents, given an iterable (e.g. list) of strings.
Example
from llama_index import StringIterableReader, GPTTreeIndex

documents = StringIterableReader().load_data(
    texts=["I went to the store", "I bought an apple"]
)
index = GPTTreeIndex.from_documents(documents)
query_engine = index.as_query_engine()
query_engine.query("what did I buy?")
# response should be something like "You bought an apple."
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.TrafilaturaWebReader(error_on_missing: bool = False)
Trafilatura web page reader.
Reads pages from the web. Requires the trafilatura package.
- load_data(urls: List[str]) List[Document]
Load data from the urls.
- Parameters
urls (List[str]) -- List of URLs to scrape.
- Returns
List of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.TwitterTweetReader(bearer_token: str, num_tweets: Optional[int] = 100)
Twitter tweets reader.
Reads tweets for the given Twitter user handles.
See https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api for how to get access to the Twitter API.
- Parameters
bearer_token (str) -- bearer_token that you get from twitter API.
num_tweets (Optional[int]) -- Number of tweets for each user twitter handle. Default is 100 tweets.
- load_data(twitterhandles: List[str], **load_kwargs: Any) List[Document]
Load tweets of twitter handles.
- Parameters
twitterhandles (List[str]) -- List of user twitter handles to read tweets.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.WeaviateReader(host: str, auth_client_secret: Optional[Any] = None)
Weaviate reader.
Retrieves documents from Weaviate through vector lookup. Allows option to concatenate retrieved documents into one Document, or to return separate Document objects per document.
- Parameters
host (str) -- host.
auth_client_secret (Optional[weaviate.auth.AuthCredentials]) -- auth_client_secret.
- load_data(class_name: Optional[str] = None, properties: Optional[List[str]] = None, graphql_query: Optional[str] = None, separate_documents: Optional[bool] = True) List[Document]
Load data from Weaviate.
If graphql_query is not found in load_kwargs, we assume that class_name and properties are provided.
- Parameters
class_name (Optional[str]) -- class_name to retrieve documents from.
properties (Optional[List[str]]) -- properties to retrieve from documents.
graphql_query (Optional[str]) -- Raw GraphQL Query. We assume that the query is a Get query.
separate_documents (Optional[bool]) -- Whether to return separate documents. Defaults to True.
- Returns
A list of documents.
- Return type
List[Document]
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.WikipediaReader
Wikipedia reader.
Reads a page.
- load_data(pages: List[str], **load_kwargs: Any) List[Document]
Load data from the given Wikipedia pages.
- Parameters
pages (List[str]) -- List of pages to read.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.
- class llama_index.readers.YoutubeTranscriptReader
Youtube Transcript reader.
- load_data(ytlinks: List[str], **load_kwargs: Any) List[Document]
Load data from the given YouTube links.
- Parameters
ytlinks (List[str]) -- List of YouTube links for which transcripts are to be read.
- load_langchain_documents(**load_kwargs: Any) List[Document]
Load data in LangChain document format.