Nicole Han

MapReduce

2019/03/27

MapReduce Introduction

MapReduce is designed to be applied to Big Data, often held in NoSQL databases, in a data- and disk-parallel fashion, resulting in dramatic processing gains.

MapReduce is a programming framework that lets us perform distributed, parallel processing on large data sets across a cluster of machines.

MapReduce works like this:

  1. the input data is split into file segments, held in a compute cluster made up of nodes (aka partitions)
  2. a mapper task is run in parallel on all the segments (i.e. in each node/partition, on each of its segments); each mapper produces output in the form of multiple (key, value) pairs
  3. the (key, value) output pairs from all mappers are forwarded to a shuffler, which consolidates each key’s values into a list (and associates that list with the key)
  4. the shuffler forwards the keys and their value lists to multiple reducer tasks; each reducer processes its incoming key/value-list pairs and emits a single value for each key (a runnable sketch of this flow follows)
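
As a minimal single-process sketch of that four-step flow, here is a word-count job in plain Python. The segment data and function names are made up for illustration; a real framework would run the mapper and reducer tasks in parallel across the nodes of a cluster:

```python
# Single-process simulation of map -> shuffle -> reduce (word count).
from collections import defaultdict

def mapper(segment):
    # Step 2: emit a (key, value) pair for every word in this segment.
    for word in segment.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # Step 3: consolidate each key's values into a list.
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Step 4: emit a single value per key; for word count, the sum.
    return key, sum(values)

# Step 1: toy "file segments" (illustrative data).
segments = ["the quick brown fox", "the lazy dog", "the fox"]

mapped = [pair for seg in segments for pair in mapper(seg)]
grouped = shuffle(mapped)
counts = dict(reducer(k, v) for k, v in grouped.items())
print(counts)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```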
As the original MapReduce paper (Dean & Ghemawat, 2004) defines it:

    MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

In summary:

  • MapReduce consists of two distinct tasks: Map and Reduce.
  • As the name suggests, the reducer phase takes place only after the mapper phase has completed.
  • In a map job, a block of data is read and processed to produce key-value pairs as intermediate output.
  • The output of a mapper (those key-value pairs) becomes the input to the reducer.
  • A reducer receives key-value pairs from multiple map jobs.
  • The reducer then aggregates those intermediate tuples into a smaller set of key-value pairs, which is the final output.

Hadoop?

Hadoop is modeled after the MapReduce paradigm and is used in the same way: users write mapper and reducer tasks and run them over (big) data stored in the cluster.
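
For example, Hadoop Streaming lets any executable act as the mapper or reducer by reading lines from stdin and writing tab-separated key/value lines to stdout. Below is a minimal, hypothetical word-count pair in Python; the file names mapper.py and reducer.py are illustrative:

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper sketch (word count).
# Streaming feeds input splits on stdin and expects
# "key<TAB>value" lines on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming reducer sketch.
# The framework sorts mapper output by key, so all counts for a
# given word arrive on consecutive lines.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

A job like this would typically be submitted with the hadoop-streaming JAR, passing the two scripts via -mapper and -reducer along with -input and -output paths.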

The following data stores and processing engines are most commonly used alongside a Hadoop cluster:

  1. MongoDB

  2. Spark

  3. Solr

HDFS concepts

HDFS (the Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware: files are split into large blocks that are replicated across nodes for fault tolerance.
