
Checkpoint report

Summary

We first familiarized ourselves with the PageRank algorithm and the various methods of implementing it. We then implemented a simple serial baseline that applies the power method iteratively, until the vector of page ranks (one entry per webpage, fed by that page's incoming links) converges.

Several datasets are publicly available, and we chose the Google, Stanford, and Berkeley web graphs from http://snap.stanford.edu/data/#web. Because these graphs are widely used, we will be able to cross-verify the correctness of our implementation.

The basic structure of the code is as follows:
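In outline, the serial baseline repeatedly multiplies the rank vector by the link structure until successive iterations agree within a tolerance. The following is an illustrative sketch of that structure only, not our actual code; the damping factor d = 0.85 and the even redistribution of rank from dangling pages are standard choices we assume here:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Serial PageRank via the power method (illustrative sketch).
// links[j] lists the pages that page j links out to.
std::vector<double> pagerank(const std::vector<std::vector<int>>& links,
                             double d = 0.85, double tol = 1e-10) {
    const std::size_t n = links.size();
    std::vector<double> rank(n, 1.0 / n), next(n);
    double diff = 1.0;
    while (diff > tol) {
        // Every page receives the (1 - d)/n teleportation mass.
        std::fill(next.begin(), next.end(), (1.0 - d) / n);
        for (std::size_t j = 0; j < n; ++j) {
            if (links[j].empty()) {
                // Dangling page: spread its rank evenly over all pages.
                for (std::size_t i = 0; i < n; ++i) next[i] += d * rank[j] / n;
            } else {
                // Otherwise split its rank among its out-links.
                for (int i : links[j]) next[i] += d * rank[j] / links[j].size();
            }
        }
        // L1 distance between iterations decides convergence.
        diff = 0.0;
        for (std::size_t i = 0; i < n; ++i) diff += std::fabs(next[i] - rank[i]);
        rank.swap(next);
    }
    return rank;
}
```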

Challenges

We need to figure out a way to accurately measure the correctness of our algorithm on larger datasets. On further research we found some implementations online that simply print a PageRank score for each website, and we plan to compare our results against theirs. Our serial version is basic: it still uses a naive sparse representation of the graph that is very inefficient in space. We need to refactor the code to use a compressed sparse row (CSR) matrix representation.
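For reference, CSR stores the graph as two flat arrays: a row-pointer array of n + 1 offsets and a column-index array holding all out-links back to back. A small sketch of the refactor we have in mind (the names and the edge-list input format are hypothetical, not from our current code):

```cpp
#include <utility>
#include <vector>

// Compressed Sparse Row storage for the link graph.
// The out-links of page i are col_idx[row_ptr[i] .. row_ptr[i+1]-1].
struct CSRGraph {
    std::vector<int> row_ptr;  // n + 1 offsets into col_idx
    std::vector<int> col_idx;  // one entry per edge
};

CSRGraph build_csr(int n, const std::vector<std::pair<int, int>>& edges) {
    CSRGraph g;
    g.row_ptr.assign(n + 1, 0);
    // Count each page's out-degree, shifted by one slot.
    for (const auto& e : edges) ++g.row_ptr[e.first + 1];
    // Prefix sum turns counts into starting offsets.
    for (int i = 0; i < n; ++i) g.row_ptr[i + 1] += g.row_ptr[i];
    // Scatter edges into their rows.
    g.col_idx.resize(edges.size());
    std::vector<int> next(g.row_ptr.begin(), g.row_ptr.end() - 1);
    for (const auto& e : edges) g.col_idx[next[e.first]++] = e.second;
    return g;
}
```

Compared to an edge list, this costs O(n + m) space and gives O(1) access to any page's out-links, which is what the inner loop of the power method needs.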

Initially we had planned to have our parallel version working by now, but we are a few days behind schedule and still need to figure out how different nodes can work on separate pieces of the graph and synchronize at regular intervals. The naive approach simply divides up the work and exchanges results via message passing, as in the last assignment. However, we are trying to start from a better parallel version and then upgrade our code to handle smarter partitioning, as well as dynamic scheduling using a work queue.
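Under the naive division of work mentioned above, each node would own a contiguous block of rows of the rank vector. A small helper sketching that block partition (illustrative only; our eventual MPI version may partition differently):

```cpp
#include <utility>

// Contiguous block partition of n rows over p ranks.
// Earlier ranks take one extra row when n is not divisible by p.
// Returns a half-open range [start, end) for the given rank.
std::pair<int, int> row_range(int n, int p, int rank) {
    int base = n / p, extra = n % p;
    int start = rank * base + (rank < extra ? rank : extra);
    int count = base + (rank < extra ? 1 : 0);
    return {start, start + count};
}
```

After each local sweep over its rows, a node would exchange its block of the updated rank vector with the others (e.g., an all-gather) before the next iteration's convergence check.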

Goals and Deliverables

Our proposal stated that we wanted to compare the parallel performance of the algorithm using MPI and OpenMP, which we are on track for. In addition, time permitting, we are also interested in exploring a solution using a domain-specific language such as GraphLab.

During the poster presentation, we aim to demonstrate the various versions by running them in person and to display the speedup results on our poster. So far we only have numbers for our baseline version: we ran it on several datasets and also collected numbers for a couple of implementations found online.

New schedule

Week 1 (11/22 - 11/24) Get baseline parallel version working
Week 2 (11/25 - 11/28) Thanksgiving break, Start Implementation of OpenMP and Message Passing version
Week 3 (11/29 - 12/1) All code should be complete by now, profile code and collect preliminary results
Week 4 (12/2 - 12/5) Fix any inefficiencies caught by profiling the code, try implementing a working version in a DSL like GraphLab
Week 5 (12/6 - 12/8) Collect results, prepare report and poster