Ante Tonkovic-capin
Ashu Kumar
Drishan Poovaya
Pratik Sanghavi
We followed the instructions on the course website to set up Apache Spark on the 3-node cluster.
Figure 1: Experiment after setup
Figure 2: Ready nodes
The master and the NameNode processes run on node0; the DataNode and worker processes run on all three nodes.
To perform the setup, we additionally did the following:
1. Added the paths below to /users/ashuk/.bashrc:
export PATH="/mnt/data/hadoop-3.3.6/sbin:$PATH"
export PATH="/mnt/data/hadoop-3.3.6/bin:$PATH"
export PATH="/mnt/data/spark-3.3.4-bin-hadoop3/sbin:$PATH"
2. Loaded the changes with source /users/ashuk/.bashrc.
3. Confirmed Spark was up and running with curl -v --silent 10.10.1.1:8080.
4. Modified spark-env.sh to include the Python, Hadoop home, and library paths and the scratch data location:
export PYSPARK_PYTHON=/usr/bin/python3.7
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.7
export SPARK_LOCAL_IP=10.10.1.1  # 10.10.1.2 and 10.10.1.3 on the other machines
export SPARK_MASTER_HOST=10.10.1.1
export HADOOP_HOME=/mnt/data/hadoop-3.3.6
export LD_LIBRARY_PATH=/mnt/data/hadoop-3.3.6/lib/native
export SPARK_LOCAL_DIRS=/mnt/data/local_dir
5. Modified spark-3.3.4-bin-hadoop3/conf/workers to list the IP addresses of all the worker machines (see the sketch after this list).
6. Ran hdfs dfs -mkdir -p /user/ashuk/data to create the HDFS data directory and hdfs dfs -put /users/ashuk/hw1/export.csv /user/ashuk/data/export.csv to load the data.
7. Started the cluster with spark-3.3.4-bin-hadoop3/sbin/start-all.sh.
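Since worker processes run on all three nodes, the workers file in our setup simply lists each node's private IP, one per line (a sketch of its contents):
10.10.1.1
10.10.1.2
10.10.1.3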
We downloaded the data from here and loaded it into HDFS using hdfs dfs -put. We sorted the data first by the country code in alphabetical order and then by the timestamp. The output is stored in another file, whose path is provided as a command-line argument. The code is in the sort.py file.
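The core logic can be sketched as follows (a minimal sketch, not the full script; the column names cca2 for the country code and timestamp are assumptions about the header of export.csv):

import sys
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for the job.
spark = SparkSession.builder.appName("sort").getOrCreate()

# The input and output paths are taken from the command line.
input_path, output_path = sys.argv[1], sys.argv[2]

# Read the CSV from HDFS, sort by country code and then by timestamp,
# and write the sorted rows back to HDFS.
df = spark.read.csv(input_path, header=True)
df.sort("cca2", "timestamp").write.csv(output_path, header=True)

spark.stop()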
To run the script directly, we used:
spark-submit --master spark://10.10.1.1:7077 /users/ashuk/hw1/sort.py hdfs://10.10.1.1:9000/user/ashuk/data/export.csv hdfs://10.10.1.1:9000/user/ashuk/data/output -v
The sort application can also be run through the run.sh script for this section:
$ ./run.sh "<spark master ip>" "<path/to/input/file>" "<path/to/output/file>" <optional verbose flag -v> <optional rename flag -r>
For our experiment, the arguments were:
$ ./run.sh "10.10.1.1" "/user/ashuk/data/export.csv" "/user/ashuk/data/output"
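run.sh itself can be a thin wrapper around the spark-submit invocation above; a minimal sketch, assuming sort.py lives at /users/ashuk/hw1/sort.py and any optional flags are simply forwarded:

#!/bin/bash
# $1 = spark master ip, $2 = input path in HDFS, $3 = output path in HDFS;
# any remaining arguments (e.g. -v, -r) are passed through to sort.py.
spark-submit --master "spark://$1:7077" /users/ashuk/hw1/sort.py \
    "hdfs://$1:9000$2" "hdfs://$1:9000$3" "${@:4}"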
The application reads the input data from HDFS (export.csv), performs the sort, and saves the output into HDFS as output. The sort took 28 seconds to run.