Managed Projects

texrex

  Analyzed about 2 months ago

An efficient tool for processing (mainly) ARC files produced by the Heritrix crawler: it strips HTML, converts entities, removes boilerplate, and filters by language. All text is converted to UTF-8 using ICU. It also includes a conservative w-shingling implementation for near-duplicate detection and uses multi-threading for single-machine parallelization.
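
For readers unfamiliar with w-shingling, the general idea can be sketched as follows. This is a minimal illustration in Python, not texrex's own code: the function names, the shingle width `w=4`, and the `threshold=0.9` cutoff are all assumptions chosen for the example. A high threshold is used here only as one plausible reading of "conservative" (flagging a pair only when the overlap is very large).

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word sequences) in text.

    Hypothetical helper: tokenization is a plain whitespace split.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, w=4, threshold=0.9):
    """Flag two documents as near-duplicates when their shingle sets overlap heavily.

    Both w and threshold are illustrative values, not texrex defaults.
    """
    return jaccard(shingles(doc_a, w), shingles(doc_b, w)) >= threshold
```

Comparing every pair of shingle sets directly becomes expensive on large crawls, so practical systems usually compare compact sketches of the shingle sets instead; the exact-set version above is only meant to show what is being measured.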

21K lines of code

0 current contributors

almost 9 years since last commit

0 users on Open Hub

Activity Not Available