Word Count using Hadoop MapReduce
Introduction
Word count is a simple program that counts the number of times each word appears in a file. In this article, it is implemented using the MapReduce paradigm. The Mapper's role is to emit each word as a key paired with a value, and the Reducer's role is to aggregate the values that share a common key. As a result, everything is represented as a key-value pair. First, let's understand what happens in the mapper and reducer functions.
Pre-requisite
- Java Installation
- Hadoop Installation
If either of them is not installed on your system, install it before proceeding.
Mapper
The mapper does the following:
- Read each line of the given file into a variable named line.
- Substitute the special characters with an empty string.
- Split the line into words and store them in a list named words.
- Loop through the list of words, outputting each word as the key and one as the value.
- Example outputs: (Hi,1), (Hello,1), (Medium,1)
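The steps above can be sketched as a streaming mapper script. The original code is not reproduced here, so this mapper.py is a minimal illustration; the tab-separated output format is an assumption, matching Hadoop Streaming's default key-value separator.

```python
#!/usr/bin/env python3
# mapper.py -- a minimal sketch of the mapper described above.
import re
import sys

def map_line(line):
    """Strip special characters, split into words, and return (word, 1) pairs."""
    # Substitute special characters with an empty string, keeping letters,
    # digits, and whitespace.
    line = re.sub(r"[^A-Za-z0-9\s]", "", line)
    # Split the cleaned line into words and pair each with a count of one.
    return [(word, 1) for word in line.split()]

if __name__ == "__main__":
    # Hadoop Streaming feeds the input file to the mapper on stdin.
    for line in sys.stdin:
        for word, count in map_line(line):
            print(f"{word}\t{count}")
```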
Sort and shuffle
In this stage, all the keys are sorted, which means any word occurring multiple times will appear on adjacent lines. The reducer therefore never needs to scan the whole output to count a word; it only compares each key with the previous one. This step makes our life easy.
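Outside Hadoop, the effect of sort-and-shuffle can be imitated simply by sorting the mapper's output pairs; the sample pairs below are illustrative, not from the article.

```python
# Simulate the sort-and-shuffle phase: sorting mapper output by key
# places repeated words on adjacent lines, so the reducer only has to
# compare each key with the one before it.
pairs = [("Hi", 1), ("Hello", 1), ("Hi", 1), ("Medium", 1), ("Hello", 1)]
shuffled = sorted(pairs)  # Hadoop sorts by key between the map and reduce phases
print(shuffled)
# Duplicates are now side by side:
# [('Hello', 1), ('Hello', 1), ('Hi', 1), ('Hi', 1), ('Medium', 1)]
```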
Reducer
The reducer does the following:
- Maintain two variables, previous word and previous count, to hold the previous state.
- Split each line of the mapper's output into word and count.
- Typecast the count to an integer.
- If the word equals the previous word, increment the previous count by count.
- Otherwise, output the previous word as the key and the previous count as the value, then set the previous word to word and the previous count to count.
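A minimal sketch of such a streaming reducer follows; as with the mapper, this reducer.py is an illustration rather than the article's original code. Note the final flush after the loop, which the step list leaves implicit but which is needed to emit the last word.

```python
#!/usr/bin/env python3
# reducer.py -- a minimal sketch of the reducer described above.
# Assumes sorted, tab-separated "word\tcount" lines from the shuffle phase.
import sys

def reduce_pairs(lines):
    """Aggregate counts for adjacent identical words in sorted mapper output."""
    results = []
    previous_word = None
    previous_count = 0
    for line in lines:
        # Split the mapper's output into word and count.
        word, count = line.strip().split("\t")
        # Typecast the count to an integer.
        count = int(count)
        if word == previous_word:
            # Same word as before: accumulate into the previous count.
            previous_count += count
        else:
            # New word: emit the finished (previous word, previous count) pair.
            if previous_word is not None:
                results.append((previous_word, previous_count))
            previous_word = word
            previous_count = count
    # Flush the last word after the loop ends.
    if previous_word is not None:
        results.append((previous_word, previous_count))
    return results

if __name__ == "__main__":
    # Hadoop Streaming feeds the sorted mapper output to the reducer on stdin.
    for word, count in reduce_pairs(sys.stdin):
        print(f"{word}\t{count}")
```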
Steps to execute the word count MapReduce job
1. Create a text file in your local file system and write some text into it.
2. Create a directory in HDFS for the input, and upload the local text file into that directory.
3. Keep the Mapper and Reducer in a specific directory of the local file system.
4. Copy the Hadoop streaming jar file path.
5. Run the jar file using the following command:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.2.jar \
  -file /home/huser/wordcount/mapper.py -mapper mapper.py \
  -file /home/huser/wordcount/reducer.py -reducer reducer.py \
  -input /wordcount/input \
  -output /wordcount/output
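The preparatory steps (1 and 2) might look like the following shell session; the file name and sample text are illustrative, while the HDFS paths match the ones used in the command above.

```shell
# 1. Create a text file in the local file system (sample content is made up).
echo "Hi Hello Hi Medium Hello" > /home/huser/wordcount/input.txt

# 2. Create the HDFS input directory and upload the local file into it.
hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put /home/huser/wordcount/input.txt /wordcount/input

# (Optional) Test the scripts locally before submitting the job:
# a Unix pipeline with sort imitates Hadoop's sort-and-shuffle phase.
cat /home/huser/wordcount/input.txt | python3 mapper.py | sort | python3 reducer.py
```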
Besides the jar file path, this command has four parts:
- -file <mapper path> -mapper <mapper name>
- -file <reducer path> -reducer <reducer name>
- -input <input path>
- -output <output path>
6. To see the output, use the HDFS cat command with the output path, e.g. hdfs dfs -cat /wordcount/output/part-00000
References
Source Code: https://github.com/krishnakanth-G/HadooACp-works/tree/main/wordcount%20with%20python%20MapReduce