parbash 0.1

parbash is an open-source extension to the Bash scripting language that enables scalable text processing over large data sets on distributed or multicore systems.

parbash uses familiar shell text-processing commands to process large files, transparently distributing the processing pipeline over multiple processors or multiple machines using Apache Hadoop, a map-reduce middleware for computer clusters.
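As a rough illustration of the kind of pipeline this describes, here is a minimal sketch in plain Bash of a word count split into a map stage and a reduce stage. This is not parbash syntax (which this page does not show), and the function names `map_words`, `reduce_counts`, and `word_count` are hypothetical, chosen only for this example. It runs locally on one machine; a map-reduce framework would run the map stage on many input chunks in parallel and sort the results before the reduce stage.

```shell
#!/usr/bin/env bash
# Illustrative only: ordinary Bash, NOT parbash syntax.
# All function names below are hypothetical.

# Map stage: emit one lowercase word per line.
map_words() {
  tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | grep -v '^$'
}

# Reduce stage: count each word; expects its input sorted by word.
# Output format: word<TAB>count
reduce_counts() {
  uniq -c | awk '{ print $2 "\t" $1 }'
}

# Local stand-in for the framework: map, sort (the "shuffle"), reduce.
word_count() {
  map_words | LC_ALL=C sort | reduce_counts
}
```

For example, `word_count < big.txt` (with `big.txt` standing in for any text file) prints one word and its count per line. The point of the split is that the map and reduce stages are independent commands reading standard input and writing standard output, which is the shape a map-reduce system can distribute.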

The main task of parbash is to bridge the tedious interface gap between Hadoop (map-reduce) and Bash, letting the programmer focus on writing text-processing scripts.

parbash is best suited to computationally expensive processing, especially over multi-gigabyte (or larger) files. At the time of writing, processing smaller files with Hadoop is not practical because of the large overhead of the Hadoop framework.

Detailed installation and usage instructions are available here and here.