Course on Big Data 2016

Prerequisites

  1. Install VirtualBox and the Oracle VM VirtualBox Extension Pack

    Make sure the versions of the Extension Pack and VirtualBox match.

  2. Install Vagrant
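
To confirm that both VirtualBox and Vagrant are installed and available on your PATH, you can check their versions from a terminal:

    VBoxManage --version
    vagrant --version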

Instructions

Installing the Pyspark VM and all its dependencies consists of two steps:

  1. Download and install the Virtual Machine (VM)
  2. Download all the datasets required by the course

Installing the VM

If you have installed an old version, delete the existing VM first by following the removal instructions (see the Removal instructions section below).
  1. Create an empty directory and save this file (Vagrantfile) into it: right-click the link and choose "Save target as..."

    Make sure that your browser keeps the file name intact. The downloaded file must be named exactly "Vagrantfile" and not "Vagrantfile.txt", which some browsers produce.

  2. Open up your terminal/command prompt at this new directory and run the following commands:

    vagrant box add http://semantica.cs.lth.se/pyspark/vm.json
    vagrant up
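
The first command registers the course box with Vagrant; the second creates and boots the VM described by the Vagrantfile. The first vagrant up needs to download the box image, so it can take a while. If you are unsure whether the VM is running, you can check from the same directory:

    vagrant status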

Installing the datasets

  1. When the vagrant up command finishes, navigate to http://localhost:8082 and press the Update All button; all datasets will then be downloaded automatically.
  2. Navigate to http://localhost:8081 and you should see a running Jupyter Notebook in the virtual machine. A quick way to check that Spark works is shown after this list.
  3. To stop the virtual machine from running, execute:
    vagrant halt
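
To confirm that Spark itself works inside the notebook, a minimal job such as the one below should run in a new cell. This assumes the notebook provides a ready-made SparkContext named sc; if yours does not, create one as shown in the local installation section below.

    # Sum the squares of the numbers 0-999 using the local Spark workers.
    rdd = sc.parallelize(range(1000))
    print(rdd.map(lambda x: x * x).sum())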

Removal instructions

If you have installed an old version of the Pyspark VM, delete the existing VM by following the removal instructions below.
  1. Delete your current VM in the directory where you put Vagrantfile by executing:
    vagrant destroy
  2. Delete the box by executing:
    vagrant box remove lth-pysparkvm
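
You can verify that the box is gone by listing the boxes Vagrant still knows about:

    vagrant box list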

Content

The Pyspark VM comes with the following software preinstalled:

Information

The Virtual Machine (VM) is based on Ubuntu Server 15.04 (64-bit).

The configuration of the VM

If you wish to change the number of CPUs or the amount of RAM, open the Vagrantfile in a text editor and edit this information. Do this before running vagrant up for the first time.
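
As a rough sketch, the VirtualBox provider section of a Vagrantfile usually looks something like the block below; the exact settings in the course Vagrantfile may differ, so look for the provider block and adjust the values there.

    config.vm.provider "virtualbox" do |vb|
      vb.cpus = 2        # number of virtual CPUs
      vb.memory = "4096" # RAM in MB
    end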

Pyspark Local Installation

We strongly recommend using the Pyspark VM because it works out of the box. An alternative is a local installation.
The instructions below are provided in the hope that they will be useful; we cannot give full support for them due to time constraints.

  1. Download and install Anaconda or an equivalent distribution: http://continuum.io/downloads
  2. Start up a command prompt and run:
    pip install jupyter
  3. Download a Spark distribution (link to mirrors)
  4. Unpack the Spark distribution to a directory and take note of the path
  5. Run this to install the extended regex capabilities needed by the notebooks:
    pip install regex
  6. Run the command
    jupyter notebook
  7. Add the following as the first cell in your notebook and run it:
    import sys
    import os
    import os.path
    
    # Point this at the directory where you unpacked the Spark distribution.
    SPARK_HOME = r"C:\spark-1.5.0-bin-hadoop2.6"  # CHANGE THIS PATH TO YOURS!
    
    # Make the bundled Py4J and PySpark libraries importable.
    # The py4j zip name depends on your Spark version; check SPARK_HOME/python/lib.
    sys.path.append(os.path.join(SPARK_HOME, "python", "lib", "py4j-0.9-src.zip"))
    sys.path.append(os.path.join(SPARK_HOME, "python", "lib", "pyspark.zip"))
    os.environ["SPARK_HOME"] = SPARK_HOME
    
    # The following is already prepended to up-to-date course notebooks.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    
    # Start a local driver that uses all available cores.
    sc = SparkContext(master="local[*]")
    sqlContext = SQLContext(sc)
    
  8. You should now have a working Pyspark installation with a driver running in the background and its web UI at http://localhost:4040. A quick sanity check is shown below.
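
To verify the local setup, you can run a classic word count in the next notebook cell. It reuses the sc (and the os and SPARK_HOME names) defined in the cell above, and assumes that the README.md file shipped with the Spark distribution is still present under SPARK_HOME.

    # Count word occurrences in Spark's own README and print the ten most common.
    lines = sc.textFile(os.path.join(SPARK_HOME, "README.md"))
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.takeOrdered(10, key=lambda pair: -pair[1]))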

Lecture slides

Notebooks

Required reading

Each lecture will feature one compulsory reading, for which each participant will hand in a one-page review and commentary on the article.

  1. Week 1: Jeffrey Dean, Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM - 50th anniversary issue: 1958-2008, 51(1), 2008. [link]
  2. Week 2: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012. [link]
  3. Alternative: Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, Spark: Cluster Computing with Working Sets, Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010. [link]
  4. Week 3: Chieh-Yen Lin, Cheng-Hao Tsai, Ching-Pei Lee, Chih-Jen Lin, Large-scale Logistic Regression and Linear Support Vector Machines Using Spark, 2014 IEEE International Conference on Big Data, 2014. [link]
  5. Week 4: Jiawei Han, Micheline Kamber, Jian Pei, Cluster Analysis: Basic Concepts and Methods, in Data Mining: Concepts and Techniques, 3rd ed., Chapter 10, Elsevier, 2012. [link]

Recommended reading

The books we recommend:

  1. Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, Learning Spark, O'Reilly Media, 2015. ISBN: 978-1-449-35862-4 [link]
  2. Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills, Advanced Analytics with Spark, O'Reilly, 2015. ISBN: 978-1-491-91276-8 [link]

Lecture schedule

The course will be taught at LTH in the E building. The maps to find the rooms are available here: http://cs.lth.se/om/salar-i-e-huset/

All lectures run from 9:00 to 16:00, with a lunch break from 12:00 to 13:00. Coffee will be served at 10:00 and 14:00.

Lecturers

  1. Peter Exner
  2. Marcus Klang
  3. Dennis Medved
  4. Håkan Jonsson and Vedran Sekara

Last Edit: 2016-01-25