Chapter 10: Batch processing

This chapter starts by discussing Unix tools and pipes and how they can be composed to solve larger problems. The same principles are then applied to explain how MapReduce and other, more recent dataflow engines work. Between two Unix processes, the pipe is the main channel for moving data; in MapReduce, that role is played by files in a distributed file system.
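The composition idea can be sketched in Python with generators standing in for Unix processes: each stage does one small job and is chained to the next the way `cmd1 | cmd2 | cmd3` chains programs. The stage names and the sample log are illustrative, not from the chapter.

```python
# A minimal sketch of the Unix pipe idea: each stage is a small,
# single-purpose generator, and stages compose like shell pipes.

def lines(text):
    # analogous to `cat`: emit the input one line at a time
    for line in text.splitlines():
        yield line

def grep(pattern, stream):
    # analogous to `grep PATTERN`: pass through matching lines only
    for line in stream:
        if pattern in line:
            yield line

def count(stream):
    # analogous to `wc -l`: consume the stream and count its lines
    return sum(1 for _ in stream)

log = "GET /index\nPOST /login\nGET /about\n"
total_gets = count(grep("GET", lines(log)))
print(total_gets)  # → 2
```

Because each stage only reads its input stream and writes its output, stages can be rearranged or swapped freely, which is exactly the property the chapter carries over to batch jobs.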

There are two main problems that MapReduce needs to solve: partitioning and fault tolerance.

Partitioning consists of assigning mappers to input file blocks. The mappers' output is then repartitioned by key, sorted, and merged into output blocks, and this process can repeat itself across stages to build up larger results.
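A single-process sketch of that flow, using the classic word-count job as the example (the function names and toy input blocks are assumptions for illustration): each mapper processes one input block, the emitted pairs are repartitioned by key hash, and each reducer merges its sorted partition.

```python
from collections import defaultdict

def map_block(block):
    # mapper: runs over one input file block, emits (word, 1) pairs
    for line in block:
        for word in line.split():
            yield (word, 1)

def partition(pairs, num_reducers):
    # repartition: hash each key to decide which reducer receives it
    parts = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        parts[hash(key) % num_reducers][key].append(value)
    return parts

def reduce_part(part):
    # reducer: merge its partition's sorted keys into final counts
    return {key: sum(values) for key, values in sorted(part.items())}

blocks = [["the cat sat"], ["the dog sat"]]
pairs = [p for b in blocks for p in map_block(b)]
parts = partition(pairs, num_reducers=2)
result = {k: v for part in parts for k, v in reduce_part(part).items()}
print(result["the"])  # → 2
```

In a real cluster each mapper and reducer runs on a different node, but the data movement is the same: map, repartition by key, sort, merge.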

Fault tolerance comes from redundancy in those input blocks: HDFS typically keeps several copies of each file spread across different nodes, so the system can tolerate network and other hardware failures.

Other batch processing engines, descendants of MapReduce, rely on the callback function pattern to hide the distributed nature of the data from the developer. Because those callbacks must be functions with no side effects, the execution can tolerate failures: a failed task can simply be re-executed without causing any problems. The crucial idea is to keep data immutable, in other words, to read data from one place, process it, and write the result somewhere else. The other key constraint is that the input data must be bounded, meaning it has a known, fixed size.
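The safety of re-execution can be shown in a few lines: a pure callback over immutable input is deterministic, so running it again after a failure produces exactly the same output (the mapper below is a made-up example, not from the chapter).

```python
# Why side-effect-free callbacks make retries safe: the same pure
# function over the same immutable input yields identical output.

input_records = ("alice", "bob", "alice")  # immutable input (a tuple)

def mapper(record):
    # pure callback: output depends only on the input record,
    # nothing outside the function is read or modified
    return (record.upper(), 1)

first_run = [mapper(r) for r in input_records]
retry_run = [mapper(r) for r in input_records]  # e.g. after a node failure
assert first_run == retry_run  # deterministic, so the retry is harmless
```

If the mapper instead appended to a shared file or mutated its input in place, a retry would duplicate or corrupt data, which is exactly why these engines forbid side effects.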