Instructor: Alex Dekhtyar, dekhtyar@calpoly.edu, 14-210
Office Hours:
Day | Time | Who | Where |
Monday | 1:10pm - 2:00pm | Alex | Our Lecture Zoom |
Tuesday | 1:10pm - 3:00pm | Alex | Office Hours Zoom |
Friday | 1:10pm - 2:00pm | Alex | Our Lecture Zoom |
Note: Zoom links are not shown on this page to prevent web crawlers from finding them. They were emailed to you and are available on request by email or Slack.
Additional appointments: send email.
All lectures will be recorded, and the recordings, together with raw transcripts, will be made available via our Canvas site. [April 5, 2020]
Syllabus | Postscript | |
Canvas site | Canvas | |
Cal Poly Zoom | HTML | |
CSC 369 Slack Channel | HTML | |
Campus VPN Instructions | MacOS and Windows | Linux (.txt) |
Survey 2 | Google Form |
Lab 1 | Due: April 13 | JSON Manipulation | Postscript | Lab Data (daily.json) | Source | [April 5, 2020] | |
Lab 2 | Due: April 17 | MongoDB: first steps | Postscript | Lab Data (daily.json) | Source | [April 13, 2020] | |
Lab 3 | Due: April 24 | MongoDB aggregation pipelines | Postscript | [April 21, 2020] | |||
Lab 4 | Due: May 10 | MongoDB Application | Lab Info | [April 28, 2020] | |||
Lab 5 | Due: May 6 | First Hadoop Program | Postscript | Lab Info | [May 4, 2020] | ||
Lab 6 | Due: May 18 | Hadoop Programs | Postscript | [May 8, 2020, 2019] | |||
Lab 7 | Due: | Hadoop Research Mini-project | Postscript | [May 20, 2020] | |||
Lab 8-1 | Due: June 9 | Spark | [May 30, 2020] | ||||
Lab 8-1 | Due: June 9 | Spark Data Frames | lab08.py | README | [June 4, 2020] | ||
Optional Mini-Project | Due: June 14 (noon) | Distributed Computing with Real Data | [June 6, 2020] |
JSON home page/List of JSON Libraries | json.org | Java, Python libraries: scroll to the bottom of the page |
JSON specification | ECMA-404: The JSON Data Interchange Format (PDF) | |
org.json Javadocs | Javadoc | |
Local copy of org.json library | org.json-20120521.jar | Put in the same directory as your code for now |
Sample Code
org.json JSONArray demo | jsonArrayTest.java |
data file for jsonArrayTest.java | p.json |
Bash script for running jsonArrayTest.java | jsonArray-run.sh |
Reading from a JSON array file one object at a time | json7.java |
April 4, 2020 | General welcome, and MongoDB login process | Slides |
MongoDB Documentation
MongoDB 4.2 Documentation | HTML |
mongo shell | HTML |
Create, Read, Update, Delete (CRUD) | HTML |
db. | HTML |
Aggregation Pipeline | Overview, db.collection.aggregate(), Aggregation pipeline stages |
Python MongoDB API
PyMongo 3.10.1 Documentation | HTML |
Tutorial | HTML |
Authentication example | HTML |
Aggregation pipeline example | HTML |
Additional examples | HTML |
Code and Queries
MongoDB Python API example | example.py | example.out (example.py output) |
Hadoop resources and code are posted here.
Monitor Hadoop jobs here | http://ambari-head.csc.calpoly.edu:8088/cluster |
The Original MapReduce paper | ||
org.apache.hadoop version 3.2.1 Javadocs | API | |
org.apache.hadoop Version 2.7 Jar file | hadoop-core-1.2.1.jar | |
Bash local variable settings | bashrc-commands.txt | Paste into the bottom of your .bashrc file |
MapReduce (Hadoop v. 2.7) tutorial | HTML |
Code samples discussed in class are posted here
Hadoop program template | template.java |
Our first Hadoop program | NickCage.java |
Data file for NickCage.java | data.csv |
Input Format Tests | ||
---|---|---|
TextInputFormat test | FITest.java | |
KeyValueTextInputFormat test | KeyValueTest.java | |
FixedRecordInputFormat test | FixedRecordTest.java | |
NLineInputFormat test | NLTest.java | |
NLineInputFormat test | NLgroup.java | |
One-mapper/One-reducer version of NLgroup.java | SeqGroup.java | |
Multiple chained MapReduce jobs | filter.java | words (input file) |
Multiple Input Files/Multiple Mappers | multiInMR.java | users.in, messages.in (input files) |
Map-Side Join with Distributed Cache | dCacheDemo.java | users.in, messages.in (input files) |
Use of JSON | ||
Using JSON objects | JsonJob.java | json.in,simple.json (input files) |
Multiline JSON | MultilineJsonJob.java | test.json (input file) |
Multiline JSON Input Format | json-mapreduce-1.0.jar | |
Advanced Hadoop Features | ||
Finding Max | FindMax.java | numbers.txt |
Combiner Test: graph scan with no Combiner | TwitterTest.java | |
Combiner Test: graph scan with Combiner | CombinerTest.java |
The Original Spark paper (USENIX Cloud Computing'2010) | |
Resilient Distributed Datasets (USENIX NSDI'2012) | |
PySpark Documentation (version 2.4.5) | HTML |
Wenqiang Feng: Learning Apache Spark with Python | |
Running PySpark Applications on ambari-head.csc.calpoly.edu | Googledoc |
PySpark RDD API annotated | Googledoc |
In-class Example (March 1 lecture) | inClass.py |
Use of Hadoop Files | htest.py |
April 6 (Monday) | Introduction | Slides (PDF) | Notes (PDF) |
April 8 (Wednesday) | Key Value stores | Slides (PDF) | Notes #1(PDF), Notes #2(PDF) |
April 10 (Friday) | Distributed DBMS/The CAP Theorem | Slides (PDF) | Notes (PDF) |
April 13 (Monday) | MongoDB Basics | Slides (PDF) | Notes (PDF) |
April 15 (Wednesday) | Problem Decomposition, Data Manipulation Algebra | Slides (PDF) | |
April 17 (Friday) | MongoDB Aggregation Pipeline | Slides (PDF) | Notes (PDF) |
April 20 (Monday) | MongoDB Aggregation Pipeline | Slides (PDF) | Notes (PDF) |
April 22 (Wednesday) | MongoDB Aggregation Pipeline (continued) | Slides (PDF) | |
April 24 (Friday) | Quiz 1 | ||
April 27 (Monday) | Lab Exam 1 | ||
April 29 (Wednesday) | Overview of Distributed Systems | Slides (PDF) | Notes (PDF) |
May 1 (Friday) | MapReduce | Slides (PDF) | Notes (PDF) |
May 4 (Monday) | Introduction to Hadoop | Slides (PDF) | Notes #1(PDF), Notes #2(PDF) |
May 6 (Wednesday) | Hadoop Input Data Types | Slides (PDF) | Notes (PDF) |
May 8 (Friday) | Midterm postmortem, Hadoop API | Slides (PDF) | |
May 11 (Monday) | Joins in MapReduce | Slides (PDF) | Notes (PDF) |
Lecture 1 | What's in this class? | Postscript | [January 4, 2016] | |
Lecture 2 | Motivating Examples | Postscript | [January 4, 2016] | |
Lecture 2-1 | JSON | Postscript | [January 10, 2017] | |
Lecture 3-1 | Distributed Databases and The CAP Theorem | Postscript | [April 12, 2020] | |
Lecture 3-2 | Maps, Dictionaries, Key-Value Pairs | Postscript | [January 12, 2016] | |
Lecture 4 | MongoDB Basics | Postscript | [January 18, 2016] | |
Lecture 5 | MongoDB Java Connectivity | Postscript | [January 28, 2016] | |
Lecture 6 | MongoDB Aggregation Pipeline | Postscript | [January 27, 2017] | |
Lecture 7 | MongoDB Aggregation Pipeline: Part 2 | Postscript | [February 3, 2017] | |
Lecture 8 | Overview of Distributed Systems | Postscript | [February 4, 2017] | |
Lecture 9 | MapReduce | Postscript | [January 28, 2016] | |
Lecture 10 | Hadoop on our cluster | Postscript | [February 4, 2019] | |
Lecture 11 | HDFS commands primer | Postscript | [February 13, 2017] | |
Lecture 12 | Hadoop Input Data Formats | Postscript | [February 21, 2017] | |
Lecture 13 | Joins in MapReduce | Postscript | [May 11, 2020] | |
Lecture 14 | Matrix Multiplication in MapReduce | Postscript | [March 5, 2017] | |
Lecture 15 | MapReduce for Top K Problem | Postscript | [March 10, 2017] | |
Lecture 16 | Resilient Distributed Datasets | Postscript | [February 28, 2019] |