Join operations in Hadoop

Krishnakanth G
5 min read · Jun 16, 2022

Introduction

The join operation is used to combine two or more database tables based on a common key (typically a foreign key). In general, companies maintain separate tables for customer details and transaction records in their database, and they often need to generate analytic reports from the data spread across these separate tables. Therefore, they perform a join operation on these tables using a common column (the foreign key), such as customer ID, to generate a combined table, and then analyze the combined table to get the desired analytic reports. Joining tables in Python or R is easy, but if the data is large we need a big data framework. In this article, I will take you through the process of joining tables at big data scale using Hadoop MapReduce.

MapReduce Joins

Just like SQL join, we can also perform join operations in MapReduce on different data sets. There are two types of join operations in MapReduce:

  • Map Side Join: As the name implies, the join operation is performed in the map phase itself. Therefore, in the map side join, the mapper performs the join and it is mandatory that the input to each map is partitioned and sorted according to the keys.
  • Reduce Side Join: As the name suggests, in the reduce side join, the reducer is responsible for performing the join operation. It is comparatively simple and easier to implement than the map side join as the sorting and shuffling phase sends the values having identical keys to the same reducer and therefore, by default, the data is organized for us.

Let's see both of them with an example.

Data to join

Let’s make two datasets: one with bank customer data and the other with their credit transaction data, both having the customer ID as a common key. The attributes in the datasets are:

  • Customer Details: Customer ID, Name, Age
  • Credit transaction Details: Transaction ID, Date, Customer ID, Transaction amount

I would like to calculate the number of credit transactions made by each customer along with the total amount. So, the output after joining is the customer name, the transaction count, and the total transaction amount.

This is how the sample data for customer details and Credit transaction details will look.
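Since the original screenshots of the sample data are not reproduced here, a hypothetical version of the two datasets, consistent with the examples used later in this article, could look like this (as comma-separated lines, the way a MapReduce job would read them):

```python
# Hypothetical sample rows, modeled on the examples used later in this article.
# Customer details: Customer ID, Name, Age
customers = [
    "1,John,32",
    "2,Ravi,28",
    "3,Meena,45",
]

# Credit transaction details: Transaction ID, Date, Customer ID, Transaction amount
transactions = [
    "t01,2022-06-01,1,50",
    "t02,2022-06-02,2,150",
    "t03,2022-06-05,1,200",
    "t04,2022-06-07,3,75",
]
```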

Reduce side Join

Each of the two datasets will have its own mapper, with one for customer details input and the other for transaction details input. Also, we have one reducer to produce our desired output.

1. Customer details mapper:

  • It will read the input, taking one tuple at a time.
  • Then, it will tokenize each word in that tuple and fetch the customer ID along with the name of the person.
  • Make customer ID as the key.
  • Make customer's name as value.
  • We are appending the “customers” string before the customer name to indicate it is from customer details.
  • Therefore, my mapper for customer details will produce the following intermediate key-value pair: [customer ID, customers name]
  • Example: [1, customers John], [2, customers Ravi], etc.
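The steps above can be sketched as a plain-Python function. This is only an illustration of the logic; in a real Hadoop job it would be a Java `Mapper` (or a Hadoop Streaming script), and the CSV layout `Customer ID,Name,Age` and the function name are assumptions for this sketch:

```python
def customer_mapper(line):
    """Emit (customer_id, 'customers <name>') for one customer tuple.

    Assumes CSV input of the form: Customer ID,Name,Age
    """
    fields = line.strip().split(",")
    customer_id = fields[0]
    name = fields[1]
    # Tag the value with "customers" so the reducer can tell
    # which dataset it came from.
    return (customer_id, "customers " + name)

# customer_mapper("1,John,32") -> ("1", "customers John")
```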

2. Credit transaction details mapper:

  • It will read the input, taking one tuple at a time.
  • Then, it will tokenize each word in that tuple and fetch the customer ID along with the transaction amount of the customer.
  • Make customer ID as the key
  • Make credit amount as value.
  • We are appending the “transaction” string before the credit amount to indicate it is from transaction details.
  • Therefore, my mapper for transaction details will produce the following intermediate key-value pair: [customer ID, transaction Credit amount]
  • Example: [1, transaction 50], [2, transaction 150], etc.
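The transaction mapper follows the same pattern; again, this is a plain-Python sketch of the logic, assuming the CSV layout `Transaction ID,Date,Customer ID,Transaction amount`:

```python
def transaction_mapper(line):
    """Emit (customer_id, 'transaction <amount>') for one transaction tuple.

    Assumes CSV input of the form:
    Transaction ID,Date,Customer ID,Transaction amount
    """
    fields = line.strip().split(",")
    customer_id = fields[2]
    amount = fields[3]
    # Tag the value with "transaction" so the reducer can tell
    # which dataset it came from.
    return (customer_id, "transaction " + amount)

# transaction_mapper("t01,2022-06-01,1,50") -> ("1", "transaction 50")
```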

3. Sort and shuffle: The sorting and shuffling phase will generate an array list of values corresponding to each key.
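Hadoop performs this grouping automatically between the map and reduce phases; a plain-Python stand-in for the sort-and-shuffle step might look like:

```python
from collections import defaultdict

def shuffle(mapper_outputs):
    """Group (key, value) pairs by key, mimicking Hadoop's sort/shuffle."""
    grouped = defaultdict(list)
    for key, value in mapper_outputs:
        grouped[key].append(value)
    return dict(grouped)

# ("1", "customers John") and ("1", "transaction 50") both land under key "1",
# so the reducer for customer ID 1 sees values from both datasets.
```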

4. Reduce join reducer:

  • It will read the input from the sort and shuffle phase as a key & list of values where the key is nothing but the customer ID. The list of values will have the input from both datasets.
  • Now, it will loop through the values present in the list of values in the reducer.
  • Then, it will split each value and check whether it is of the transaction details type or the customer details type.
  • If it is of the credit transaction details type, then perform the following steps:
    1. Increase a counter by one to count the number of transactions made by the customer.
    2. Add the transaction amount to a running total to get the customer's total transaction amount.
  • On the other hand, if the value is of customer details type, then store it in a string variable.
  • After that, assign the name as the key and the number of transactions along with the total amount as the value in the output key-value pair.
  • Finally, write the output key-value pair in the output folder in HDFS.
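The reducer steps above, again as an illustrative plain-Python sketch (the `customers` and `transaction` tags match the ones the mappers emit; in a real job this would be a Java `Reducer` writing to HDFS):

```python
def join_reducer(customer_id, values):
    """Combine the tagged values for one customer into (name, (count, total))."""
    name = ""
    count = 0
    total = 0.0
    for value in values:
        tag, _, payload = value.partition(" ")
        if tag == "transaction":
            count += 1                # one more transaction for this customer
            total += float(payload)   # accumulate the transaction amount
        elif tag == "customers":
            name = payload            # remember the customer's name
    return (name, (count, total))

# join_reducer("1", ["customers John", "transaction 50", "transaction 200"])
# -> ("John", (2, 250.0))
```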

Map side Join

In a map-side join, all of the work is performed by the mapper alone; there is no reduce phase. The map-side join is most suitable when at least one of the tables is small, as it optimizes the task by avoiding the shuffle. In this type, the join is performed before the data is actually consumed by the map function. It is mandatory that the inputs to the maps are partitioned and in sorted order: there must be an equal number of partitions, and each must be sorted by the join key.
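One common flavor of the map-side join (sometimes called a broadcast join) ships the small table to every mapper, typically via Hadoop's distributed cache, and loads it into memory in the mapper's setup step. A plain-Python sketch of that idea, with illustrative function names and the same CSV layouts as before:

```python
def build_lookup(customer_lines):
    """Load the small (customer) table into an in-memory dict, as a mapper's
    setup step would do after reading it from the distributed cache."""
    lookup = {}
    for line in customer_lines:
        fields = line.strip().split(",")
        lookup[fields[0]] = fields[1]  # customer_id -> name
    return lookup

def map_side_join_mapper(transaction_line, lookup):
    """Join one transaction record against the in-memory customer table."""
    fields = transaction_line.strip().split(",")
    customer_id, amount = fields[2], fields[3]
    name = lookup.get(customer_id)     # None if the customer is unknown
    return (name, float(amount))

# map_side_join_mapper("t01,2022-06-01,1,50", build_lookup(["1,John,32"]))
# -> ("John", 50.0)
```

Because the join happens inside each mapper, no reducer (and no shuffle of the large table) is needed, which is exactly where the advantages below come from.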

Advantages of using map side join:

  • Map-side join helps in minimizing the cost that is incurred for sorting and merging in the shuffle and reduce stage.
  • Map-side join also helps in improving the performance of the task by decreasing the time to finish the task.

Disadvantages of Map-side join:

  • A map-side join is adequate only when one of the tables being joined is small enough to fit into memory. Hence, it is not suitable when both tables are huge.

References

https://www.edureka.co/blog/mapreduce-example-reduce-side-join/

https://www.edureka.co/blog/map-side-join-vs-join/
