Loaders
AdobeReader ¶
Bases: BaseReader
Read PDFs using Adobe's PDF Services. Extracts text, tables, and figures with high accuracy.
Example
Args:
endpoint: URL to the Vision Language Model endpoint. If not provided,
will use the default kotaemon.loaders.adobe_loader.DEFAULT_VLM_ENDPOINT
Source code in libs/kotaemon/kotaemon/loaders/adobe_loader.py
load_data ¶
Load data by calling Adobe's API
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file | Path | Path to the PDF file | required |
Returns:
Type | Description |
---|---|
List[Document] | List of documents extracted from the PDF file, including 3 types: text, table, and image |
Source code in libs/kotaemon/kotaemon/loaders/adobe_loader.py
AzureAIDocumentIntelligenceLoader ¶
Bases: BaseReader
Use Azure AI Document Intelligence to parse documents.
As of April 24, the supported file formats are: pdf, jpeg/jpg, png, bmp, tiff, heif, docx, xlsx, pptx and html.
Source code in libs/kotaemon/kotaemon/loaders/azureai_document_intelligence_loader.py
load_data ¶
Extract the input file, allowing multi-modal extraction
Source code in libs/kotaemon/kotaemon/loaders/azureai_document_intelligence_loader.py
AutoReader ¶
Bases: BaseReader
General auto reader for a variety of files. (based on llama-hub)
Source code in libs/kotaemon/kotaemon/loaders/base.py
BaseReader ¶
DirectoryReader ¶
Bases: LIReaderMixin, BaseReader
Wrapper around the llama-index SimpleDirectoryReader
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_dir | str | Path to the directory. | required |
input_files | List | List of file paths to read (Optional; overrides input_dir, exclude) | required |
exclude | List | glob of python file paths to exclude (Optional) | required |
exclude_hidden | bool | Whether to exclude hidden files (dotfiles). | required |
encoding | str | Encoding of the files. Default is utf-8. | required |
errors | str | how encoding and decoding errors are to be handled, see https://docs.python.org/3/library/functions.html#open | required |
recursive | bool | Whether to recursively search in subdirectories. False by default. | required |
filename_as_id | bool | Whether to use the filename as the document id. False by default. | required |
required_exts | Optional[List[str]] | List of required extensions. Default is None. | required |
file_extractor | Optional[Dict[str, BaseReader]] | A mapping of file extension to a BaseReader class that specifies how to convert that file to text. If not specified, use default from DEFAULT_FILE_READER_CLS. | required |
num_files_limit | Optional[int] | Maximum number of files to read. Default is None. | required |
file_metadata | Optional[Callable[str, Dict]] | A function that takes in a filename and returns a Dict of metadata for the Document. Default is None. | required |
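The selection parameters above can be read as a filter over a directory listing. The following is a minimal sketch of that reading, using only the standard library. `select_files` is a hypothetical helper, not part of kotaemon or llama-index, and the real SimpleDirectoryReader may differ in details such as ordering or how extensions are matched.

```python
from pathlib import Path
from typing import List, Optional


def select_files(
    input_dir: str,
    recursive: bool = False,
    exclude_hidden: bool = True,
    required_exts: Optional[List[str]] = None,
    num_files_limit: Optional[int] = None,
) -> List[Path]:
    """Hypothetical helper: pick files the way the parameters above describe."""
    root = Path(input_dir)
    # rglob walks subdirectories; glob stays in the top-level directory.
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in sorted(candidates):
        if not path.is_file():
            continue
        # Assumption: any path component starting with "." counts as hidden.
        if exclude_hidden and any(
            part.startswith(".") for part in path.relative_to(root).parts
        ):
            continue
        # Assumption: extensions are given with the leading dot, e.g. ".txt".
        if required_exts is not None and path.suffix not in required_exts:
            continue
        selected.append(path)
    # Apply the limit after filtering; None means no limit.
    if num_files_limit is not None:
        selected = selected[:num_files_limit]
    return selected
```

For example, `select_files("docs", recursive=True, required_exts=[".txt"])` would return only the `.txt` files under a hypothetical `docs` directory, including those in subdirectories.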
Source code in libs/kotaemon/kotaemon/loaders/composite_loader.py
DocxReader ¶
Bases: BaseReader
Read Docx files while preserving tables, using the python-docx library
Reader behavior
- All paragraphs are extracted as a Document
- Each table is extracted as a Document, rendered as a CSV string
- The output is a list of Documents, concatenating the above (tables + paragraphs)
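To illustrate what "rendered as a CSV string" can look like, here is a minimal sketch using the standard `csv` module. `table_to_csv` and the sample rows are hypothetical; the reader itself may build the string differently, and reading cells from a real .docx table is left out.

```python
import csv
import io
from typing import List


def table_to_csv(rows: List[List[str]]) -> str:
    """Render a table (list of rows, each a list of cell texts) as CSV."""
    buffer = io.StringIO()
    # lineterminator="\n" avoids the csv module's default "\r\n".
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical cell texts, as they might be read from a .docx table.
rows = [["Name", "Role"], ["Ada", "Engineer, lead"]]
csv_text = table_to_csv(rows)
```

Cells that contain a comma are quoted by the `csv` writer, so the table structure survives in the resulting text.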
Source code in libs/kotaemon/kotaemon/loaders/docx_loader.py
load_data ¶
Load data using Docx reader
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path | Path to .docx file | required |
Returns:
Type | Description |
---|---|
List[Document] | List of documents extracted from the .docx file |
Source code in libs/kotaemon/kotaemon/loaders/docx_loader.py
ExcelReader ¶
Bases: BaseReader
Spreadsheet exporter respecting multiple worksheets
Parses CSVs using the separator detection from the Pandas read_csv function.
If special parameters are required, use the pandas_config dict.
Args:
Source code in libs/kotaemon/kotaemon/loaders/excel_loader.py
load_data ¶
Parse file and extract values from a specific column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file | Path | The path to the Excel file to read. | required |
include_sheetname | bool | Whether to include the sheet name in the output. | True |
sheet_name | Union[str, int, None] | The specific sheet to read from, default is None which reads all sheets. | None |
Returns:
Type | Description |
---|---|
List[Document] | A list of Document objects containing the values from the specified column in the Excel file. |
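The interaction between `sheet_name` and `include_sheetname` can be sketched without any spreadsheet library. In the sketch below the workbook is a plain dict, `render_sheets` is a hypothetical helper, and the "(Sheet ...)" prefix is an illustrative format only; none of this is taken from the reader's source. An integer `sheet_name` is treated as a zero-based position, which is how pandas `read_excel` interprets it.

```python
from typing import Dict, List, Union

# Hypothetical in-memory stand-in for a workbook: sheet name -> rows of cells.
Workbook = Dict[str, List[List[str]]]


def render_sheets(
    workbook: Workbook,
    include_sheetname: bool = True,
    sheet_name: Union[str, int, None] = None,
) -> List[str]:
    """Return one text block per selected sheet."""
    names = list(workbook)
    if sheet_name is None:
        selected = names  # None selects every sheet
    elif isinstance(sheet_name, int):
        selected = [names[sheet_name]]  # integer: zero-based position
    else:
        selected = [sheet_name]  # string: sheet name

    blocks = []
    for name in selected:
        body = "\n".join(" ".join(row) for row in workbook[name])
        if include_sheetname:
            # The "(Sheet ...)" prefix is an illustrative format only.
            body = f"(Sheet {name})\n{body}"
        blocks.append(body)
    return blocks


workbook = {"Q1": [["a", "1"], ["b", "2"]], "Q2": [["c", "3"]]}
all_sheets = render_sheets(workbook)
second_only = render_sheets(workbook, include_sheetname=False, sheet_name=1)
```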
Source code in libs/kotaemon/kotaemon/loaders/excel_loader.py
PandasExcelReader ¶
Bases: BaseReader
Pandas-based CSV parser.
Parses CSVs using the separator detection from the Pandas read_csv function.
If special parameters are required, use the pandas_config dict.
Args:
Source code in libs/kotaemon/kotaemon/loaders/excel_loader.py
load_data ¶
Parse file and extract values from a specific column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file | Path | The path to the Excel file to read. | required |
include_sheetname | bool | Whether to include the sheet name in the output. | False |
sheet_name | Union[str, int, None] | The specific sheet to read from, default is None which reads all sheets. | None |
Returns:
Type | Description |
---|---|
List[Document] | A list of Document objects containing the values from the specified column in the Excel file. |
Source code in libs/kotaemon/kotaemon/loaders/excel_loader.py
HtmlReader ¶
Bases: BaseReader
Read HTML using html2text
Reader behavior
- HTML is read with html2text.
- All of the text is split by page_break_pattern
- Each page is extracted as a Document
- The output is a list of Documents
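The splitting step above can be sketched as follows. `split_pages` is a hypothetical helper, and the sketch assumes the pattern is matched as a literal substring; whether the reader treats `page_break_pattern` as a literal or as a regular expression is not stated here.

```python
from typing import List, Optional


def split_pages(text: str, page_break_pattern: Optional[str] = None) -> List[str]:
    """Split converted text into pages; no pattern means a single page."""
    if not page_break_pattern:
        return [text]
    # Plain substring split; this sketch does not treat the pattern as a regex.
    return text.split(page_break_pattern)


# Hypothetical marker and text, standing in for html2text output.
pages = split_pages("Intro\n<!-- pagebreak -->\nDetails", "<!-- pagebreak -->")
```

Each element of `pages` would then become one Document.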
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page_break_pattern | str | Pattern to split the HTML into pages | None |
Source code in libs/kotaemon/kotaemon/loaders/html_loader.py
load_data ¶
Load data using Html reader
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path \| str | path to HTML file | required |
extra_info | Optional[dict] | extra information passed to this reader during extracting data | None |
Returns:
Type | Description |
---|---|
list[Document] | List of documents extracted from the HTML file |
Source code in libs/kotaemon/kotaemon/loaders/html_loader.py
MhtmlReader ¶
Bases: BaseReader
Parse MHTML files with BeautifulSoup.
Source code in libs/kotaemon/kotaemon/loaders/html_loader.py
load_data ¶
Load MHTML document into document objects.
Source code in libs/kotaemon/kotaemon/loaders/html_loader.py
MathpixPDFReader ¶
Bases: BaseReader
Load PDF files using the Mathpix service.
Source code in libs/kotaemon/kotaemon/loaders/mathpix_loader.py
wait_for_processing ¶
Wait for processing to complete.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pdf_id | str | a PDF id. | required |
Returns: None
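Waiting for a remote job typically means polling its status until it finishes. Below is a minimal sketch of such a loop under stated assumptions: `wait_until_done`, the `get_status` callable, and the status strings "completed" and "error" are all hypothetical and are not taken from the Mathpix API or from this loader's source.

```python
import time
from typing import Callable


def wait_until_done(
    get_status: Callable[[], str],
    timeout: float = 10.0,
    interval: float = 0.5,
) -> None:
    """Poll get_status() until it returns "completed" or the timeout passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == "completed":  # hypothetical success status
            return
        if status == "error":  # hypothetical failure status
            raise ValueError("processing failed")
        time.sleep(interval)
    raise TimeoutError("processing did not finish in time")
```

Using a monotonic clock for the deadline keeps the timeout correct even if the system clock is adjusted while waiting.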
Source code in libs/kotaemon/kotaemon/loaders/mathpix_loader.py
clean_pdf ¶
Clean the PDF file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
contents | str | a PDF file contents. | required |
Returns:
Source code in libs/kotaemon/kotaemon/loaders/mathpix_loader.py
ImageReader ¶
Bases: BaseReader
Read PDFs using OCR, with a strong focus on table extraction
Example
Parameters:
Name | Type | Description | Default |
---|---|---|---|
endpoint | Optional[str] | URL to FullOCR endpoint. If not provided, will look for environment variable | None |
use_ocr | | Whether to use OCR to read text (e.g. from images, tables) in the PDF. If False, only the table and text within table cells will be extracted. | required |
Source code in libs/kotaemon/kotaemon/loaders/ocr_loader.py
load_data ¶
Load data using OCR reader
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path | Path to PDF file | required |
debug_path | Path | Path to store debug image output | required |
artifact_path | Path | Path to OCR endpoints artifacts directory | required |
Returns:
Type | Description |
---|---|
List[Document] | List of documents extracted from the PDF file |
Source code in libs/kotaemon/kotaemon/loaders/ocr_loader.py
OCRReader ¶
Bases: BaseReader
Read PDFs using OCR, with a strong focus on table extraction
Example
Parameters:
Name | Type | Description | Default |
---|---|---|---|
endpoint | Optional[str] | URL to FullOCR endpoint. If not provided, will look for environment variable | None |
use_ocr | | Whether to use OCR to read text (e.g. from images, tables) in the PDF. If False, only the table and text within table cells will be extracted. | True |
Source code in libs/kotaemon/kotaemon/loaders/ocr_loader.py
load_data ¶
Load data using OCR reader
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path | Path to PDF file | required |
debug_path | Path | Path to store debug image output | required |
artifact_path | Path | Path to OCR endpoints artifacts directory | required |
Returns:
Type | Description |
---|---|
List[Document] | List of documents extracted from the PDF file |
Source code in libs/kotaemon/kotaemon/loaders/ocr_loader.py
PDFThumbnailReader ¶
Bases: PDFReader
PDF parser with thumbnail for each page.
Source code in libs/kotaemon/kotaemon/loaders/pdf_loader.py
load_data ¶
Parse file.
Source code in libs/kotaemon/kotaemon/loaders/pdf_loader.py
UnstructuredReader ¶
Bases: BaseReader
General unstructured text reader for a variety of files.
Source code in libs/kotaemon/kotaemon/loaders/unstructured_loader.py
load_data ¶
If api is set, parse the file through the API