An efficient tool for processing (mainly) ARC files from Heritrix. It works as an HTML stripper, entity converter, boilerplate remover, and language filter, and converts everything to UTF-8 using ICU. It also includes a conservative w-shingling implementation for near-duplicate detection, and uses multi-threading for single-machine parallelization.
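To illustrate what near-duplicate detection by w-shingling means in general, here is a minimal Python sketch. It is not taken from the tool itself: the function names `shingles` and `resemblance`, the whitespace tokenization, and the default `w=4` are all assumptions made for illustration. Each document is reduced to its set of overlapping w-token sequences, and two documents are compared by the Jaccard coefficient of those sets.

```python
def shingles(text, w=4):
    """Return the set of w-token shingles of a text (hypothetical helper).

    Tokenization here is a plain lowercase whitespace split, which is an
    assumption; a real pipeline would likely normalize text more carefully.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Documents shorter than w tokens yield a single short shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard coefficient of the two shingle sets, in the range 0.0 to 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        # Two empty documents are treated as identical.
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumps over the lazy cat"
    print(resemblance(doc1, doc2))
```

A pair of documents would then be flagged as near-duplicates when the resemblance exceeds some threshold. Large-scale systems usually avoid comparing full shingle sets and instead keep a small sample of hashed shingles per document; that step is omitted here.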
Commercial Use
Modify
Distribute
Place Warranty
Use Patent Claims
Sub-License
Hold Liable
Distribute Original
Disclose Source
Include Copyright
State Changes
Include License
Include Install Instructions
These details are provided for information only. Nothing here is legal advice, and it should not be used as such.