Files
DocumentIngestor
Bases: BaseComponent
Ingests common office document types into Document objects for indexing.
Document types
- xlsx, xls
- docx, doc
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pdf_mode | | mode for pdf extraction, one of "normal", "mathpix", "ocr". normal: parse pdf text; mathpix: parse pdf text using mathpix; ocr: parse pdf image using flax | required |
doc_parsers | | list of document parsers to parse the document | required |
text_splitter | | splitter to split the document into text nodes | required |
override_file_extractors | | override file extractors for specific file extensions. The default file extractors are stored in | required |
Source code in libs/kotaemon/kotaemon/indices/ingests/files.py
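The interaction between `pdf_mode` and `override_file_extractors` can be illustrated with a small standalone sketch. It is not kotaemon's implementation: `DEFAULT_EXTRACTORS`, `resolve_extractors`, and the string placeholders standing in for extractor objects are hypothetical names chosen for illustration. Only the three `pdf_mode` values come from the parameter description.

```python
from typing import Dict, Optional

# The three modes listed in the parameter description.
VALID_PDF_MODES = ("normal", "mathpix", "ocr")

# Hypothetical defaults: strings stand in for real extractor objects.
DEFAULT_EXTRACTORS: Dict[str, str] = {
    ".docx": "docx-reader",
    ".xlsx": "excel-reader",
}


def resolve_extractors(
    pdf_mode: str, overrides: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return an extension -> extractor mapping (illustrative only)."""
    if pdf_mode not in VALID_PDF_MODES:
        raise ValueError(f"pdf_mode must be one of {VALID_PDF_MODES}, got {pdf_mode!r}")

    # Copy the defaults so repeated calls never mutate the shared mapping.
    extractors = dict(DEFAULT_EXTRACTORS)
    # Assumption: the PDF extractor is selected by pdf_mode.
    extractors[".pdf"] = f"pdf-{pdf_mode}-reader"

    # Per-extension overrides win over the defaults; keys are lowercased.
    for ext, extractor in (overrides or {}).items():
        extractors[ext.lower()] = extractor
    return extractors
```

Copying the defaults before applying overrides keeps one ingestor's configuration from leaking into another's, which is the main design point of the sketch.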
run
Ingest the file paths into Document
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_paths | `list[str \| Path] \| str \| Path` | list of file paths or a single file path | required |
Returns:
Type | Description |
---|---|
list[Document] | list of parsed Documents |
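Because `file_paths` accepts either a single path or a list of paths, a caller-side helper can make the accepted shapes explicit. The following is a minimal sketch using only the standard library. The helper name `normalize_file_paths` is hypothetical and is not part of kotaemon; whether `run` normalizes its input this way internally is an assumption.

```python
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def normalize_file_paths(file_paths: Union[List[PathLike], PathLike]) -> List[Path]:
    """Coerce a single path or a list of paths into a list of Path objects."""
    # A lone str or Path is wrapped so the rest of the code sees one shape.
    if isinstance(file_paths, (str, Path)):
        file_paths = [file_paths]
    return [Path(p) for p in file_paths]
```

With this helper, `normalize_file_paths("report.docx")` and `normalize_file_paths(["report.docx"])` yield the same one-element list, so downstream code only needs to handle the list case.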