An efficient tool for processing (mainly) ARC files produced by the Heritrix crawler. It combines an HTML stripper, an entity converter, a boilerplate remover, and a language filter, and converts all text to UTF-8 using ICU. It also includes a conservative w-shingling implementation for near-duplicate detection, and uses multi-threading for single-machine parallelization.
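For readers unfamiliar with w-shingling, the following is a minimal Python sketch of the general technique, not this tool's implementation. The function names `shingles` and `resemblance`, the whitespace tokenizer, and the default `w=4` are all assumptions made for illustration. Each document is reduced to the set of its contiguous w-token sequences, and two documents are compared by the Jaccard coefficient of those sets.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-token sequences (shingles) in text.

    Tokenization by lowercasing and whitespace splitting is an
    illustrative assumption, not the tool's actual tokenizer.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # A document shorter than w tokens yields a single short shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(doc_a, doc_b, w=4):
    """Jaccard coefficient of the two documents' shingle sets (0.0 to 1.0)."""
    set_a, set_b = shingles(doc_a, w), shingles(doc_b, w)
    if not set_a and not set_b:
        # Two empty documents are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumps over the lazy cat"
    print(resemblance(a, b))
```

A near-duplicate filter would typically flag a pair of documents when their resemblance exceeds some threshold. Production systems usually hash the shingles and compare only a sample of the hashes rather than the full sets, which this sketch omits.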