Activity Not Available
Analyzed about 2 months ago, based on code collected about 2 months ago.

Project Summary

An efficient tool for processing (mainly) ARC files produced by the Heritrix web crawler. It combines an HTML stripper, an entity converter, a boilerplate remover, and a language filter, and converts all text to UTF-8 using ICU. It also includes a conservative w-shingling implementation for near-duplicate detection and uses multi-threading for single-machine parallelization.
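
The w-shingling idea mentioned in the summary can be illustrated with a minimal sketch. This is not texrex's own code: the function names `shingles` and `resemblance`, whitespace tokenization, and the default `w=4` are assumptions made for illustration only. Each document is reduced to its set of contiguous w-token sequences, and two documents are compared by the Jaccard coefficient of those sets.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-token shingles of a text.

    Tokenization by whitespace is an assumption for this sketch.
    """
    tokens = text.split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # A text shorter than w tokens yields a single, shorter shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard coefficient of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        # Two empty documents are treated as identical.
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

A near-duplicate filter built on this would flag a pair of documents when `resemblance` exceeds a chosen threshold. A stricter threshold gives a more conservative filter, in the sense that fewer documents are discarded as duplicates.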

In a Nutshell, texrex...

GNU General Public License v3.0 or later

Permitted

  • Commercial Use
  • Modify
  • Distribute
  • Place Warranty
  • Use Patent Claims

Forbidden

  • Sub-License
  • Hold Liable

Required

  • Distribute Original
  • Disclose Source
  • Include Copyright
  • State Changes
  • Include License
  • Include Install Instructions

These details are provided for information only. Nothing here is legal advice, and it should not be used as such.

This project has no vulnerabilities reported against it

30 Day Summary

Dec 18 2024 — Jan 17 2025

12 Month Summary

Jan 17 2024 — Jan 17 2025
