Html Loader
HtmlReader
Bases: BaseReader
Reads HTML using html2text.
Reader behavior:
- The HTML is read with html2text.
- The resulting text is split by page_break_pattern.
- Each page is extracted as a Document.
- The output is a list of Documents.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
page_break_pattern | str | Pattern to split the HTML into pages | None |
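The splitting behavior described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `Page` is a hypothetical stand-in for the Document class, `split_into_pages` is a made-up helper name, the `page_label` metadata key is illustrative, and the sketch assumes the pattern is treated as a literal separator rather than a regular expression. The html2text conversion step is left out; the input is assumed to be text that has already been converted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    """Hypothetical stand-in for the Document class returned by the reader."""

    text: str
    metadata: dict = field(default_factory=dict)


def split_into_pages(text: str, page_break_pattern: Optional[str] = None) -> list[Page]:
    """Split already-converted text into one Page per page break.

    Assumption: the pattern is a literal separator, not a regex.
    With no pattern, the whole text becomes a single page.
    """
    parts = text.split(page_break_pattern) if page_break_pattern else [text]
    return [
        Page(text=part.strip(), metadata={"page_label": index + 1})
        for index, part in enumerate(parts)
    ]


# Two pages separated by an illustrative marker string.
pages = split_into_pages("Intro\n---PAGE---\nDetails", page_break_pattern="---PAGE---")
single = split_into_pages("No breaks here")
```

With the default of None, the sketch returns a single page, which matches the documented output shape of a list of Documents either way.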
Source code in libs/kotaemon/kotaemon/loaders/html_loader.py
load_data
Load data using the HTML reader.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path \| str | Path to the HTML file | required |
extra_info | Optional[dict] | Extra information passed to this reader while extracting data | None |
Returns:
Type | Description |
---|---|
list[Document] | List of documents extracted from the HTML file |
Source code in libs/kotaemon/kotaemon/loaders/html_loader.py
MhtmlReader
Bases: BaseReader
Parse MHTML files with BeautifulSoup.
Source code in libs/kotaemon/kotaemon/loaders/html_loader.py
load_data
Load MHTML document into document objects.