Word Count using Hadoop MapReduce
Introduction
Word count is a simple program that counts the number of times each word appears in a file. In this article, it is implemented using the MapReduce paradigm. The Mapper's role is to emit each word as a key paired with a value, and the Reducer's role is to aggregate the values that share a common key. As a result, everything is represented as a key-value pair. First, let's understand what happens in the mapper and reducer functions.
Pre-requisite
- Java Installation
- Hadoop Installation
If either of them is not installed on your system, install it before proceeding.
Mapper
The mapper does the following:
- Read each line of the given file into a variable named line.
- Substitute the special characters with an empty string.
- Split the line into words and store them in a list named words.
- Loop through the list of words, outputting each word as the key and one as the value.
- Example outputs: (Hi,1), (Hello,1), (Medium,1)
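The steps above can be sketched as a streaming mapper script. The original code is not reproduced here, so this mapper.py is a minimal illustration; the tab-separated output format is an assumption, matching Hadoop Streaming's default key-value separator.

```python
#!/usr/bin/env python3
# mapper.py -- a minimal sketch of the mapper described above.
import re
import sys

def map_line(line):
    """Strip special characters, split into words, and return (word, 1) pairs."""
    # Substitute special characters with an empty string, keeping letters,
    # digits, and whitespace.
    line = re.sub(r"[^A-Za-z0-9\s]", "", line)
    # Split the cleaned line into words and pair each with a count of one.
    return [(word, 1) for word in line.split()]

if __name__ == "__main__":
    # Hadoop Streaming feeds the input file to the mapper on stdin.
    for line in sys.stdin:
        for word, count in map_line(line):
            print(f"{word}\t{count}")
```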
Sort and shuffle
In this stage, all the keys are sorted, which means any word occurring multiple times will appear on adjacent lines. The reducer therefore never needs to scan the whole output to count a word; it only compares each key with the previous one. This step makes our life easy.
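Outside Hadoop, the effect of sort-and-shuffle can be imitated simply by sorting the mapper's output pairs; the sample pairs below are illustrative, not from the article.

```python
# Simulate the sort-and-shuffle phase: sorting mapper output by key
# places repeated words on adjacent lines, so the reducer only has to
# compare each key with the one before it.
pairs = [("Hi", 1), ("Hello", 1), ("Hi", 1), ("Medium", 1), ("Hello", 1)]
shuffled = sorted(pairs)  # Hadoop sorts by key between the map and reduce phases
print(shuffled)
# Duplicates are now side by side:
# [('Hello', 1), ('Hello', 1), ('Hi', 1), ('Hi', 1), ('Medium', 1)]
```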
Reducer
The reducer does the following:
- Maintain two variables, previous word and previous count, to hold the previous state.
- Split each line of the mapper's output into word and count.
- Typecast the count to an integer.
- If the word equals the previous word, increment the previous count by count.
- Otherwise, output the previous word as the key and the previous count as the value, then set the previous word to word and the previous count to count.
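A minimal sketch of such a streaming reducer follows; as with the mapper, this reducer.py is an illustration rather than the article's original code. Note the final flush after the loop, which the step list leaves implicit but which is needed to emit the last word.

```python
#!/usr/bin/env python3
# reducer.py -- a minimal sketch of the reducer described above.
# Assumes sorted, tab-separated "word\tcount" lines from the shuffle phase.
import sys

def reduce_pairs(lines):
    """Aggregate counts for adjacent identical words in sorted mapper output."""
    results = []
    previous_word = None
    previous_count = 0
    for line in lines:
        # Split the mapper's output into word and count.
        word, count = line.strip().split("\t")
        # Typecast the count to an integer.
        count = int(count)
        if word == previous_word:
            # Same word as before: accumulate into the previous count.
            previous_count += count
        else:
            # New word: emit the finished (previous word, previous count) pair.
            if previous_word is not None:
                results.append((previous_word, previous_count))
            previous_word = word
            previous_count = count
    # Flush the last word after the loop ends.
    if previous_word is not None:
        results.append((previous_word, previous_count))
    return results

if __name__ == "__main__":
    # Hadoop Streaming feeds the sorted mapper output to the reducer on stdin.
    for word, count in reduce_pairs(sys.stdin):
        print(f"{word}\t{count}")
```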
Steps to execute the word count MapReduce job
1. Create a text file in your local file system and write some text into it.
2. Create a directory in HDFS for the input, and upload the local text file into that directory.
3. Keep the Mapper and Reducer in a specific directory of the local file system.
4. Copy the Hadoop streaming jar file path.
5. Run the jar file using the following command:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.2.jar \
  -file /home/huser/wordcount/mapper.py -mapper mapper.py \
  -file /home/huser/wordcount/reducer.py -reducer reducer.py \
  -input /wordcount/input \
  -output /wordcount/output
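The preparatory steps (1 and 2) might look like the following shell session; the file name and sample text are illustrative, while the HDFS paths match the ones used in the command above.

```shell
# 1. Create a text file in the local file system (sample content is made up).
echo "Hi Hello Hi Medium Hello" > /home/huser/wordcount/input.txt

# 2. Create the HDFS input directory and upload the local file into it.
hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put /home/huser/wordcount/input.txt /wordcount/input

# (Optional) Test the scripts locally before submitting the job:
# a Unix pipeline with sort imitates Hadoop's sort-and-shuffle phase.
cat /home/huser/wordcount/input.txt | python3 mapper.py | sort | python3 reducer.py
```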
Besides the jar file path, this command has four parts:
- -file <mapper path> -mapper <mapper name>
- -file <reducer path> -reducer <reducer name>
- -input <input path>
- -output <output path>
6. To see the output, use the HDFS cat command with the output path, e.g. hdfs dfs -cat /wordcount/output/part-00000
References
Source Code: https://github.com/krishnakanth-G/HadooACp-works/tree/main/wordcount%20with%20python%20MapReduce