Html Loader
HtmlReader
Bases: BaseReader
Reads HTML using html2text.
Reader behavior
- HTML is read with html2text.
- All of the text is split by page_break_pattern
- Each page is extracted as a Document
- The output is a list of Documents
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page_break_pattern | str | Pattern to split the HTML into pages | None |
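The page-splitting behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the reader's actual code: `split_pages` is a hypothetical helper, the input string stands in for text that html2text might have produced, and treating `page_break_pattern` as a literal separator (rather than a regular expression) is an assumption.

```python
from typing import Optional


def split_pages(text: str, page_break_pattern: Optional[str] = None) -> list[str]:
    """Split converted text into pages (hypothetical helper, not kotaemon's API)."""
    # With no pattern, the whole text is treated as a single page.
    if not page_break_pattern:
        return [text.strip()]
    # Assumption: the pattern is a literal separator, not a regular expression.
    return [page.strip() for page in text.split(page_break_pattern)]


# Stand-in for text already converted from HTML.
converted = "Intro paragraph\n\n---PAGE---\n\nSecond page text"

pages = split_pages(converted, page_break_pattern="---PAGE---")
single = split_pages(converted)
```

Each element of `pages` would then become one Document in the output list, while `single` shows the default case where no pattern is given.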
Source code in libs\kotaemon\kotaemon\loaders\html_loader.py
load_data
Load data using the HTML reader.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path \| str | Path to the HTML file | required |
extra_info | Optional[dict] | Extra information passed to this reader when extracting data | None |
Returns:
Type | Description |
---|---|
list[Document] | List of documents extracted from the HTML file |
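The call contract of load_data (a path given as `Path` or `str`, an optional `extra_info` dict, and a list of documents as the result) can be illustrated with a standard-library stand-in. Everything named here is hypothetical: `SimpleDocument` replaces the real Document class, `_TextExtractor` is a crude substitute for html2text, and copying `extra_info` into each document's metadata is an assumption about how the reader uses it.

```python
import tempfile
from dataclasses import dataclass, field
from html.parser import HTMLParser
from pathlib import Path
from typing import Optional, Union


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for the Document class."""
    text: str
    metadata: dict = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Crude stand-in for html2text: keeps text nodes, drops tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def load_html(
    file_path: Union[Path, str], extra_info: Optional[dict] = None
) -> list[SimpleDocument]:
    path = Path(file_path)  # accepts both str and Path
    extractor = _TextExtractor()
    extractor.feed(path.read_text(encoding="utf-8"))
    text = " ".join(c.strip() for c in extractor.chunks if c.strip())
    # Assumption: extra_info is copied into the document's metadata.
    return [SimpleDocument(text=text, metadata=dict(extra_info or {}))]


# Usage: write a small HTML file, then load it by string path.
with tempfile.TemporaryDirectory() as tmp:
    html_file = Path(tmp) / "sample.html"
    html_file.write_text(
        "<html><body><h1>Title</h1><p>Body text</p></body></html>",
        encoding="utf-8",
    )
    docs = load_html(str(html_file), extra_info={"source": "sample.html"})
```

The result is always a list, even for a single page, so callers can iterate over it without special-casing.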
Source code in libs\kotaemon\kotaemon\loaders\html_loader.py
MhtmlReader
Bases: BaseReader
Parse MHTML files with BeautifulSoup.
Source code in libs\kotaemon\kotaemon\loaders\html_loader.py
load_data
Load an MHTML document into Document objects.
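An MHTML file is a MIME multipart archive, so the HTML has to be pulled out of it before a parser such as BeautifulSoup can work on it. The sketch below shows one way to do that extraction with Python's standard `email` package. It is an illustration under assumptions, not the reader's actual implementation: `extract_html_parts` is a hypothetical helper and `SAMPLE` is a hand-made archive.

```python
import email


def extract_html_parts(mhtml_bytes: bytes) -> list[str]:
    """Return the decoded text/html parts of an MHTML archive (hypothetical helper)."""
    message = email.message_from_bytes(mhtml_bytes)
    html_parts = []
    for part in message.walk():
        if part.get_content_type() != "text/html":
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        html_parts.append(payload.decode(charset, errors="replace"))
    return html_parts


# A tiny hand-written MHTML archive with one quoted-printable HTML part.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<html><body><p>Hello =3D world</p></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)

html_parts = extract_html_parts(SAMPLE)
```

Each string in `html_parts` could then be passed to BeautifulSoup (for example `BeautifulSoup(html, "html.parser")`) to extract the text that ends up in the Document objects.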