Ante Tonkovic-capin

Ashu Kumar

Drishan Poovaya

Pratik Sanghavi

Part 1: Environment Setup

We followed the instructions on the course website to set up Apache Spark on the 3-node cluster.

Figure 1: Experiment after setup

Figure 2: Ready nodes

The master and name node processes run on node0. The data node and worker processes run on all 3 nodes.

To complete the setup, we additionally performed the following steps:

  1. Added the following paths to /users/ashuk/.bashrc:

    export PATH="/mnt/data/hadoop-3.3.6/sbin:$PATH"
    export PATH="/mnt/data/hadoop-3.3.6/bin:$PATH"
    export PATH="/mnt/data/spark-3.3.4-bin-hadoop3/sbin:$PATH"
    
  2. Loaded the changes with source /users/ashuk/.bashrc


  3. Confirmed that Spark was up and running with curl -v --silent 10.10.1.1:8080

  4. Modified spark-env.sh to set the Python interpreter, node IP addresses, Hadoop home, native library path, and scratch data location:

    export PYSPARK_PYTHON=/usr/bin/python3.7
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.7
    export SPARK_LOCAL_IP=10.10.1.1   # 10.10.1.2 and 10.10.1.3 on the other machines
    export SPARK_MASTER_HOST=10.10.1.1
    export HADOOP_HOME=/mnt/data/hadoop-3.3.6
    export LD_LIBRARY_PATH=/mnt/data/hadoop-3.3.6/lib/native
    export SPARK_LOCAL_DIRS=/mnt/data/local_dir
  5. Modified spark-3.3.4-bin-hadoop3/conf/workers to include the IP addresses of all the worker machines (a sketch of this file is shown after this list).
  6. Loaded the data into HDFS from the local file system, using hdfs dfs -mkdir /user/ashuk/ to create a new HDFS directory and hdfs dfs -put /users/ashuk/hw1/export.csv /user/ashuk/data/export.csv to load the data.
  7. Started the Spark standalone cluster using spark-3.3.4-bin-hadoop3/sbin/start-all.sh
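
For reference, the conf/workers file simply lists one worker IP address per line. A minimal sketch of what ours would contain, assuming (as noted above) that a worker process runs on all three nodes:

    10.10.1.1
    10.10.1.2
    10.10.1.3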

Part 2: A simple Spark application

We downloaded data from here and loaded it into HDFS using hdfs dfs -put. We sorted the data first by the country code in alphabetical order and then by the timestamp. The output is stored in another file, whose path is provided as a command-line argument. The code is present in the sort.py file.
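The following is a minimal sketch of the approach taken in sort.py, assuming the CSV has a header row and that the country-code and timestamp columns are named cca2 and timestamp; the actual column names depend on the dataset's schema:

    import sys
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        # Input and output HDFS paths are passed as command-line arguments
        input_path, output_path = sys.argv[1], sys.argv[2]

        spark = SparkSession.builder.appName("sort").getOrCreate()

        # Read the CSV with its header so columns can be referenced by name
        df = spark.read.csv(input_path, header=True)

        # Sort by country code (alphabetically), then by timestamp
        sorted_df = df.orderBy(["cca2", "timestamp"])

        # Write the sorted result back to HDFS
        sorted_df.write.csv(output_path, header=True)

        spark.stop()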

In order to run the script, we used spark-submit --master spark://10.10.1.1:7077 /users/ashuk/hw1/sort.py hdfs://10.10.1.1:9000/user/ashuk/data/export.csv hdfs://10.10.1.1:9000/user/ashuk/data/output -v

The sort application for this section can be run through the run.sh script by $ ./run.sh "<spark master ip>" "<path/to/input/file>" "<path/to/output/file>" <optional verbose flag -v> <optional rename flag -r>

For our experiment, the arguments were $ ./run.sh "10.10.1.1" "/user/ashuk/data/export.csv" "/user/ashuk/data/output". The application reads the input data from HDFS (export.csv), sorts it, and saves the output into HDFS as output. The sort took 28 seconds to run.

Part 3: PageRank