# Sorting in Parallel

## The problem

The last assignment of my parallel programming class in Spring 2018 involved parallelizing a recursive algorithm. We had no constraints on how we were to parallelize the algorithm other than using C and MPI to do it. Without going into too much detail about the assignment, the main idea of the algorithm was to optimize a cost function based on the datapoint of interest and other "near" points. However, the input data provided was completely unsorted.

## Approach

So, the approach my partner and I took had two phases:

1. Sort the data in parallel
2. Calculate the cost function, dividing work among the processors

I'm only going into detail on the parallel sorting for a couple of reasons: first, sorting in parallel was my major contribution, and second, I can talk about that aspect without going into any specifics of the assignment, since it will likely be used again.

## Sorting

Sorting data is a very common topic in Computer Science classes, so I'm not going to spend much time here talking about it, but there are a couple of reasons why I chose to parallelize mergesort.

1. Mergesort lends itself easily to parallelization because it is a divide-and-conquer algorithm.
2. Our main goal was to show a performance improvement at many different problem sizes, and mergesort's computational complexity (best, worst, and average) is $O(n \log n)$. In order to achieve this performance profile, mergesort trades off memory space, but this was not a concern for our application.
3. Mergesort is very easy to visualize, and is pretty simple to write.

### Mergesort

Mergesort is typically written as a recursive function that creates a binary tree in the call stack, as shown in the image below from this Techie Delight tutorial on Mergesort.

As the algorithm divides up the work, it creates smaller and smaller instances of the exact same problem to solve. That’s why mergesort is so easy to describe recursively. Once it reaches a “base case” (e.g. an array of a single element) that is “trivially” sorted, the process of merging elements back together can begin.

In terms of parallelization, each level of the tree in the image above describes a different number of processors that can be used to sort. As the algorithm breaks the problem down toward the base case, more and more processors can be used. As it builds the sorted array back up, the number of sections that can be executed in parallel decreases by a factor of 2 each time.

The following function defines the number of processors, assigns them IDs, and determines how much work should be assigned to each processor. For our use case, each processor was assigned a subset of the total array based on its processor ID. This is called static block scheduling.

Each processor performs a mergesort on its assigned section of the array, then the sorted segments are sent to another processor to build a larger sorted array. This approach to work distribution was chosen because the smallest input was guaranteed to have more elements than our maximum number of processors, so each processor must be responsible for at least several elements. The dataflow we adopted is that each processor gets a unique slice of the input and sorts it as shown in Figure 1; then all of these segments are merged together efficiently in a tree structure, as in the bottom half of Figure 1.

Top level Mergesort Function (C with MPI)

This function manages the single-threaded mergesort. If a processor is assigned more than two elements, it will recursively divide and merge the input array until the processor has sorted its work allocation.

Recursive Function to divide the problem (C)

The merge function is where the sorting actually happens, and it is unchanged from a typical single-threaded implementation. Two sorted halves of an array are given as input, and the halves are copied into temporary arrays so as to not overwrite the values. As long as there are items in both halves, the two top items are compared, and the smaller item is taken and placed into the sorted array.

If one of the halves becomes empty, the remainder of the other half is appended to the sorted array. Now twice the number of items are sorted.

Merge 2 sorted arrays together (C)

As the merging continues in parallel at each level across processors, the number of sorted elements doubles until finally there is a single sorted array.