Tags: Browse Projects

Select a tag to browse associated projects and drill deeper into the tag cloud.

Apache Flume

Claimed by Apache Software Foundation. Analyzed 26 days ago.

Apache Flume is a system for reliably collecting high-throughput data from streaming data sources such as logs (a client-side sketch follows below).

72.8K lines of code

3 current contributors

6 months since last commit

4 users on Open Hub

Very Low Activity
Rating: 0.0
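
A rough sketch of what feeding data into Flume looks like from an application: the snippet below pushes a single event to an agent over Flume's RpcClient API. The host, port, and event body are placeholder assumptions, and the agent would need an Avro source listening at that address.

    import java.nio.charset.StandardCharsets;

    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeClientSketch {
        public static void main(String[] args) {
            // Assumption: a Flume agent with an Avro source on localhost:41414
            RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
            try {
                Event event = EventBuilder.withBody("one log line", StandardCharsets.UTF_8);
                client.append(event); // blocks until the agent acknowledges the event
            } catch (EventDeliveryException e) {
                // Delivery failed; a production client would rebuild the connection and retry
            } finally {
                client.close();
            }
        }
    }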

Apache Ignite

Claimed by Apache Software Foundation. Analyzed 27 days ago.

Apache Ignite In-Memory Data Fabric is a high-performance, integrated, and distributed in-memory platform for computing and transacting on large-scale data sets in real time, orders of magnitude faster than is possible with traditional disk-based or flash technologies (see the short example below).

1.52M lines of code

0 current contributors

2 months since last commit

4 users on Open Hub

High Activity
Rating: 0.0
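
A minimal sketch of using the platform as a distributed in-memory key-value store, assuming Ignite's standard Ignition and IgniteCache entry points with default configuration; the cache name and values are made up.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class IgniteSketch {
        public static void main(String[] args) {
            // Start a node with default configuration; Ignite is AutoCloseable
            try (Ignite ignite = Ignition.start()) {
                // Create (or get) a distributed cache held in memory across the cluster
                IgniteCache<Integer, String> cache = ignite.getOrCreateCache("demo");
                cache.put(1, "hello");
                cache.put(2, "ignite");
                System.out.println(cache.get(1) + " " + cache.get(2));
            }
        }
    }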

StreamSets Data Collector

Claimed by StreamSets. No analysis available.

Open source software for the rapid development and reliable operation of complex data flows.

lines of code not counted (no analysis available)

60 current contributors

time since last commit not available

4 users on Open Hub

Activity Not Available
Rating: 5.0
Primary language not available
License: Apache-2.0

Cascading

Analyzed 26 days ago

Cascading is a feature-rich API for defining and executing complex, fault-tolerant data processing workflows on a Hadoop cluster (a word-count sketch appears below).

106K lines of code

0 current contributors

over 11 years since last commit

2 users on Open Hub

Inactive
Rating: 0.0
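
To give a feel for the API, here is a hedged word-count sketch in the style of Cascading 2.x: lines are split into words, grouped by word, and counted. The input and output paths are placeholders, and the class locations assume the 2.x Hadoop planner.

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class CascadingWordCount {
        public static void main(String[] args) {
            Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
            Tap sink = new Hfs(new TextLine(), args[1]);

            // Split each line into words, group by word, count each group
            Pipe pipe = new Pipe("wordcount");
            pipe = new Each(pipe, new Fields("line"),
                    new RegexSplitGenerator(new Fields("word"), "\\s+"));
            pipe = new GroupBy(pipe, new Fields("word"));
            pipe = new Every(pipe, new Count());

            Flow flow = new HadoopFlowConnector(new Properties())
                    .connect(source, sink, pipe);
            flow.complete(); // plans and runs the underlying MapReduce job(s)
        }
    }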

Apache Whirr

Claimed by Apache Software Foundation. Analyzed 26 days ago.

Apache Whirr is a set of libraries for running cloud services. Whirr provides:

* A cloud-neutral way to run services. You don't have to worry about the idiosyncrasies of each provider.
* A common service API. The details of provisioning are particular to the service.
* Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.

You can also use Whirr as a command-line tool for deploying clusters.

26.9K lines of code

0 current contributors

over 9 years since last commit

2 users on Open Hub

Inactive
Rating: 0.0

Disco

Analyzed 26 days ago

Disco is an open-source implementation of the MapReduce framework for distributed computing. Like the original framework, Disco supports parallel computation over large data sets on unreliable clusters of computers. The Disco core is written in Erlang, a functional language designed for building robust, fault-tolerant distributed applications. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms or data processing tasks in tens of lines of code, so you can quickly write scripts to process massive amounts of data.

29.8K lines of code

0 current contributors

over 8 years since last commit

2 users on Open Hub

Inactive
Rating: 0.0

Apache Crunch

Analyzed 27 days ago

Apache Crunch is a Java library for writing, testing, and running MapReduce pipelines, based on Google's FlumeJava. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run (an example pipeline is sketched below).

125K lines of code

5 current contributors

about 4 years since last commit

2 users on Open Hub

Inactive
Rating: 5.0
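
A short sketch of the FlumeJava-style model: a word count in which the splitting step is a user-defined DoFn and the counting is a built-in aggregation. The input and output paths are placeholders.

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class CrunchWordCount {
        public static void main(String[] args) throws Exception {
            Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
            PCollection<String> lines = pipeline.readTextFile(args[0]);

            // User-defined function: split each line into words
            PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
                @Override
                public void process(String line, Emitter<String> emitter) {
                    for (String word : line.split("\\s+")) {
                        emitter.emit(word);
                    }
                }
            }, Writables.strings());

            // Built-in aggregation: count occurrences of each word
            PTable<String, Long> counts = words.count();

            pipeline.writeTextFile(counts, args[1]);
            pipeline.done(); // triggers planning and execution of the MapReduce jobs
        }
    }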

WikiHadoop

Analyzed 26 days ago

WikiHadoop is a set of Hadoop modules focused on processing Wikipedia's terabyte-scale XML dump files. Wikipedia XML dumps with complete edit histories have been difficult to process because of their exceptional size and structure: while a "page" is the natural processing unit, a single page can contain gigabytes of text when its edit history is very long. WikiHadoop provides an InputFormat for the Hadoop Streaming interface that processes Wikipedia bzip2 XML dumps in a streaming manner. Using this InputFormat, the content of every page is fed to a mapper via standard input and output without using too much memory. Thanks to Hadoop Streaming, mappers can be implemented in any language (an illustrative mapper follows below).

4.92K lines of code

0 current contributors

almost 12 years since last commit

1 user on Open Hub

Inactive
Rating: 0.0
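
Because Hadoop Streaming hands each record to the mapper over standard input and reads key/value lines back on standard output, a mapper can be written in any language. Purely as an illustration of that protocol (the exact record layout WikiHadoop emits is an assumption here), a Java mapper could tally <revision> tags in the page XML it receives:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // A Hadoop Streaming mapper is just a program reading records from stdin
    // and writing key<TAB>value lines to stdout.
    public class RevisionTally {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(System.in, StandardCharsets.UTF_8));
            long revisions = 0;
            String line;
            while ((line = in.readLine()) != null) {
                // Count <revision> open tags in the page XML fed by the InputFormat
                int from = 0;
                while ((from = line.indexOf("<revision>", from)) != -1) {
                    revisions++;
                    from += "<revision>".length();
                }
            }
            System.out.println("revisions\t" + revisions);
        }
    }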

Pangool

Analyzed 26 days ago

Pangool is a low-level Java MapReduce API that aims to be a replacement for the Hadoop Java MapReduce API. By implementing an intermediate tuple-based schema and making job configuration convenient, it removes many of the accidental complexities that arise from using the Hadoop Java MapReduce API: things like secondary sort and reduce-side joins become extremely easy to implement and understand. Pangool's performance is comparable to that of the Hadoop Java MapReduce API, and it also augments Hadoop's API by making multiple outputs and inputs first-class and by allowing instance-based configuration.

27.8K lines of code

0 current contributors

about 4 years since last commit

1 user on Open Hub

Inactive
Rating: 0.0

Shark - Hive on Spark

Analyzed 27 days ago

Hive on Spark

17K lines of code

0 current contributors

over 10 years since last commit

1 user on Open Hub

Inactive
Rating: 0.0