Instructor: Alexander Dekhtyar, dekhtyar@calpoly.edu, 14-210
Office Hours:
|
| Who | Where | |
| Monday | 9:10am - 10:00am | Alex | 14-210 |
| Tuesday | 1:10pm - 3:00pm | Alex | 14-210 |
| Friday | 9:30am - 10:00am | Alex | 14-210 |
Additional appoinments: send email.
| Syllabus | Postscript | ||
| Jupyter Server | ambari-head.csc.calpoly.edu | ||
| Lab 1 | Due: January 14 | JSON Manipulation | Postscript | Lab Data | [January 9, 2019] | |
| Lab 2 | Due: January 16 | MongoDB: first steps | Postscript | Lab Data | [January 14, 2019] | |
| Lab 3 | Due: January 28 | MongoDB find() queries/aggregate pipelines | Postscript | [January 23, 2019] | ||
| Lab 4 | Due: February 4 | MongoDB Aggregate Pipelines | Postscript | [January 31, 2019] | ||
| Lab 5 | Due: February 13 | Simple Hadoop Programs | Postscript | Lab Info | [February 6, 2019] | |
| Lab 6 | Due: March 1 | Intricate Hadoop Programs | Postscript | Lab Info (coming up) | [February 15, 2019] | |
| Lab 7 | Due: March 7 | Simple Spark | [March 6, 2019] | |||
| Lab 8 | Due: March 19 | Real Spark | Postscript | [March 8, 2019] |
| January 14 | Jan14-01.log |
| PyMongo 3.7.2 Documentation | HTML | Tutorial | HTML |
| Authentication example | HTML |
| Aggregation pipeline example | HTML |
Code and Queries
| MongoDB Python API example | example.py | example.out (example.py output) |
Hadoop Resources and code is posted here.
| Monitor hadoop jobs here | http://ambari-head.csc.calpoly.edu:8088/cluster |
| The Original MapReduce paper | ||
| org.apache.hadoop Version 2.7 javadocs | API | |
| org.apache.hadoop Version 2.7 Jar file | hadoop-core-1.2.1.jar | |
| Bash local variable settings | bashrc-commands.txt | Paste into the bottom of your .bashrc file |
| MapReduce (Hadoop v. 2.7) tutorial | HTML |
Code samples discussed in class are posted here
| Hadoop program template | template.java | |
| Our first Hadoop program | switchMR.java | |
| Data file for switchMR.java | data | |
| Input Format Tests | ||
|---|---|---|
| TextInputFormat test | FITest.java | |
| KeyValueTextInputFormat test | KeyValueTest.java | |
| FixedRecordInputFormat test | FixedRecordTest.java | |
| NLineInputFormat test | NLTest.java | |
| NLineInputFormat test | NLgroup.java | |
| One-mapper/One-reducer version of NLgroup.java | SeqGroup.java | |
| Multiple chained MapReduce jobs | filter.java | words (input file) |
| Multiple Input Files/Multiple Mappers | multiInMR.java | users.in, messages.in (input files) |
| Use of JSON | ||
| Using JSON objects | JsonJob.java | json.in,simple.json (input files) |
| Multiline JSON | MultilineJsonJob.java | test.json (input file) |
| Multiline JSON Input Format | json-mapreduce-1.0.jar | Advanced Hadoop Features |
| Finding Max | FindMax.java | numbers.txt |
| Map-Side Join with Distributed Cache | dCacheDemo.java | users.in, messages.in (input files) |
| Combiner Test: graph scan with no Combiner | TwitterTest.java | |
| Combiner Test: graph scan with Combiner | CombinerTest.java |
| The Original Spark paper (USENIX Cloud Computing'2010) | |
| Resilient Distributed Datasets (USENIX NSDI'2012) | |
| PySpark Documentation | HTML |
| Wienqiang Feng: Learning Apache Spark with Python | |
| Running PySpark Applications on ambari-head.csc.calpoly.edu | Googledoc |
| In-class Example (March 1 lecture) | inClass.py |
| Use of Hadoop Files | htest.py |
| Lecture 1 | What's in this class? | Postscript | [January 4, 2016] | |
| Lecture 2 | Motivating Examples | Postscript | [January 4, 2016] | |
| Lecture 3-1 | JSON | Postscript | [January 10, 2017] | |
| Lecture 3-2 | Maps, Dictionaries, Key-Value Pairs | Postscript | [January 12, 2016] | |
| Lecture 4 | MongoDB Basics | Postscript | [January 18, 2016] | |
| Lecture 5 | MongoDB Java Connectivity | Postscript | [January 28, 2016] | |
| Lecture 6 | MongoDB Aggregation Pipeline | Postscript | [January 27, 2017] | |
| Lecture 7 | MongoDB Aggregation Pipeline: Part 2 | Postscript | [Feb 3, 2017] | |
| Lecture 8 | Overview of Distributed Systems | Postscript | [February 4, 2017] | |
| Lecture 9 | MapReduce | Postscript | [January 28, 2016] | |
| Lecture 10 | Hadoop on our cluster | Postscript | [February 4, 2019] | |
| Lecture 11 | HDFS commands primer | Postscript | [February 13, 2017] | |
| Lecture 12 | Hadoop Input Data Formats | Postscript | [February 21, 2017] | |
| Lecture 14 | Matrix Multiplication in MapReduce | Postscript | [March 5, 2017] | |
| Lecture 15 | MapReduce for Top K Problem | Postscript | [March 10, 2017] | |
| Lecture 16 | Resilient Distributed Datasts | Postscript | [February 28, 2019] |
| JSON home page | json.org |
| JSON specification | ECMA-404: The JSON Data Interchange Format (PDF) |
| org.json Javadocs | Javadoc |