Matrix Multiplication in Big Data Frameworks

Krishnakanth G
4 min read · Jun 13, 2022

Introduction

For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix. The resulting matrix has the number of rows of the first and the number of columns of the second matrix. Below is the pictorial representation of matrix multiplication.


c12 = a11*b12 + a12*b22
c33 = a31*b13 + a32*b23

In this exercise, I used both Hadoop MapReduce and Spark to implement matrix multiplication.
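As a quick sanity check of the entry formulas above, they can be verified with NumPy (a small sketch; the matrix values here are my own example, assuming A is 3×2 and B is 2×3):

```python
import numpy as np

# Example matrices matching the dimensions implied above: A is 3x2, B is 2x3
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
B = np.array([[7,  8,  9],
              [10, 11, 12]])

C = A @ B  # resultant matrix is 3x3: rows of A, columns of B

# c12 = a11*b12 + a12*b22  (1-based indices in the text, 0-based in NumPy)
assert C[0, 1] == A[0, 0] * B[0, 1] + A[0, 1] * B[1, 1]
# c33 = a31*b13 + a32*b23
assert C[2, 2] == A[2, 0] * B[0, 2] + A[2, 1] * B[1, 2]
print(C.shape)  # (3, 3)
```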

Spark vs Hadoop

Spark

  • Spark is a cutting-edge cluster computing system that expands on the MapReduce architecture to handle a wider range of computations.
  • Spark speeds up processing by reducing the number of read/write cycles to disk and storing intermediate data in memory.
  • Spark is built to efficiently handle real-time data.
  • Spark is a low-latency computing platform that allows you to analyze data in real-time.
  • Spark can handle real-time data from sources such as Twitter and Facebook.
  • To run in-memory, Spark requires a lot of RAM, which increases the cluster size and cost.

Hadoop

  • Hadoop is an open-source platform that uses the MapReduce algorithm.
  • Hadoop’s MapReduce architecture reads and writes data from disk, slowing down processing.
  • Hadoop is built to efficiently handle batch processing.
  • Hadoop is a high-throughput computing system that lacks an interactive mode.
  • A developer can only process data in batch mode with Hadoop MapReduce.
  • When it comes to cost, Hadoop is the most cost-effective solution.

Requirements

To test the programs, you’ll need the Hadoop HDFS and PySpark frameworks. I’m using Cloudera’s Hadoop distribution, since it’s wonderful for learning Hadoop without having to deal with installation issues, and Databricks is the easiest way to avoid all installations for PySpark.

Matrix Multiplication with Hadoop Map Reduce

While thinking about how to attack this problem, I first thought it could be solved by chaining two map and two reduce stages, but I was inspired by the idea of doing it in a single map and reduce. With this idea, I wrote the map and reduce functions.

Map

Mapper class

The above map function reads each line of the input file, which is in the format (matrix name A/B, index 1, index 2, value), and emits (index 1 of A, index 2 of B) as the key and (matrix name, common index, value) as the value. conf.get(“m”) and conf.get(“p”) fetch the number of rows and columns of the resultant matrix; m, n, and p are initialized in the main method.
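Since the original Java mapper is shown only as a screenshot, the same logic can be sketched in plain Python (the line format and the names m and p follow the description above; this is an illustration, not the author’s exact code):

```python
def map_matrix_line(line, m, p):
    """Emit (key, value) pairs for one input line of the form
    'A,i,j,value' or 'B,i,j,value' (1-based indices).
    m and p are the rows and columns of the resultant matrix."""
    name, i, j, value = line.split(",")
    i, j, value = int(i), int(j), float(value)
    pairs = []
    if name == "A":
        # a(i, j) contributes to every cell (i, col) of the result,
        # so replicate it across the p columns of the resultant matrix
        for col in range(1, p + 1):
            pairs.append(((i, col), ("A", j, value)))
    else:
        # b(i, j) contributes to every cell (row, j) of the result,
        # so replicate it across the m rows of the resultant matrix
        for row in range(1, m + 1):
            pairs.append(((row, j), ("B", i, value)))
    return pairs

# element a11 = 5 of A, with a 3x3 resultant matrix (m=3, p=3)
print(map_matrix_line("A,1,1,5", m=3, p=3))
# [((1, 1), ('A', 1, 5.0)), ((1, 2), ('A', 1, 5.0)), ((1, 3), ('A', 1, 5.0))]
```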

Reduce

Reduce class

The above reduce function reads each record from the output of sort and shuffle, which is in the format ((index 1 of A, index 2 of B), &lt;list of (matrix name, common index, value)&gt;). It places each A element into list A at its common-index position and each B element into list B at its common-index position, then multiplies the corresponding values in the two lists, sums the products into a variable called value, and outputs (key, value).
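The reducer from the screenshot can likewise be sketched in plain Python (n, the common dimension, is an assumption matching the m/n/p naming above):

```python
def reduce_cell(key, values, n):
    """Combine all shuffled values for one result cell (i, j).
    values is the list of (matrix_name, common_index, value) tuples;
    n is the common dimension (columns of A = rows of B)."""
    # place each element at its common-index position (1-based, slot 0 unused)
    list_a = [0.0] * (n + 1)
    list_b = [0.0] * (n + 1)
    for name, common_index, value in values:
        if name == "A":
            list_a[common_index] = value
        else:
            list_b[common_index] = value
    # multiply corresponding entries of the two lists and sum them up
    total = sum(list_a[k] * list_b[k] for k in range(1, n + 1))
    return key, total

# cell (1, 2) with a11=1, a12=2 and b12=8, b22=11: 1*8 + 2*11 = 30
print(reduce_cell((1, 2),
                  [("A", 1, 1.0), ("A", 2, 2.0),
                   ("B", 1, 8.0), ("B", 2, 11.0)], n=2))
# ((1, 2), 30.0)
```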

If the input matrices are given as in figure-1, then figure-2 shows the resultant matrix.

Example

Matrix Multiplication with Spark

Spark code

The functionality of the above code is identical to Hadoop MapReduce; however, the programming language and platform are different. This also produces the same results as Hadoop MapReduce. The input (figure-3) and output (figure-4) of this program are shown below which are identical to Hadoop MapReduce,

Example

Performance

The primary difference between Spark and MapReduce is that Spark processes and retains data in memory for subsequent steps, whereas MapReduce processes data on disk. As a result, for smaller workloads, Spark’s data processing speeds are reported to be up to 100x faster than MapReduce. When I gave a large input to Hadoop MapReduce it took a few minutes, whereas Spark finished in seconds.

References

https://docs.databricks.com/languages/python.html

https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations

https://en.wikipedia.org/wiki/Matrix_multiplication

Source code: https://github.com/krishnakanth-G/Big-Data-Matrix-Multiplication
