Week |
Date |
Topic |
Readings |
Week 1
|
Mar 27
|
Introduction, Architectures, and Tradeoffs
|
|
Mar 29
|
Shared Nothing vs. Shared Memory vs. Shared Whatever
|
|
Week 2
|
Apr 3
|
Cloud-based Query Processing
|
Assigned:
|
Apr 5
|
Distributed Query Processing
|
Assigned:
|
Week 3
|
Apr 10
|
Distributed Query Processing (Continued)
|
Assigned:
Kossmann Survey
|
Apr 12
|
Geographic Distribution
|
Assigned:
Global Analytics in the Face of Bandwidth and Regulatory Constraints. Ashish Vulimiri, Carlo Curino, Brighten Godfrey, Thomas Jungblut, Jitu Padhye, and George Varghese - NSDI 2015.
Transparency In its Place: The case against transparent access to geographically distributed data. Jim Gray, Tandem TR 89.1, Feb 1989
|
Week 4
|
Apr 17
|
Dealing with Heterogeneity
|
Assigned:
Just-In-Time Data Virtualization: Lightweight Data Management with ViDa.
Manos Karpathiotakis, Ioannis Alagiannis, Thomas Heinis, Miguel Branco, Anastasia Ailamaki - CIDR 2015 (note: added late - no write ups expected)
Also:
Look at SparkSQL Remote Access Operators and JSON Schema Inference:
Spark SQL: Relational Data Processing in Spark. Michael Armbrust, Reynold Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph Bradley, Xiangrui Meng, Tomer Kaftan, Michael Franklin, Ali Ghodsi, Matei Zaharia - SIGMOD 2015 (see mostly sections 4.4 and 5)
|
Apr 19
|
Heterogeneity II
|
Assigned:
Weld: A Common Runtime for High Performance Data Analytics.
Palkar et al. - CIDR 2017
Read Also (no summary needed):
Adaptive Query Processing on Raw Data.
Manos Karpathiotakis, Miguel Branco, Ioannis Alagiannis, Anastasia Ailamaki - VLDB 2014.
|
Week 5
|
Apr 24
|
Stream Processing I
|
Assigned:
Discretized Streams: Fault-Tolerant Streaming Computation at Scale
Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica - SOSP 2013.
Read Also (no summary needed):
Structured Streaming In Apache Spark: A new high-level API for streaming.
Matei Zaharia, Tathagata Das, Michael Armbrust and Reynold Xin - Databricks Blog Post - July 28, 2016.
Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3.
Joseph Torres, Michael Armbrust, Tathagata Das and Shixiong Zhu - Databricks Blog Post - March 20, 2018.
|
Apr 26
|
Stream Processing II
|
Assigned:
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. Akidau et al. VLDB 2016.
Read Also (no summary needed):
MillWheel: Fault-Tolerant Stream Processing at Internet Scale. Akidau et al. VLDB 2013.
|
Week 6
|
May 1
|
Distributed ML - Model Serving
|
Assigned:
Clipper: A Low-Latency Online Prediction Serving System. Dan Crankshaw, Giulio Zhou, Michael Franklin, Joseph Gonzalez, Ion Stoica - NSDI 2017.
|
May 3
|
Learned Indexes (Why not?)
|
Assigned:
The Case for Learned Index Structures. Tim Kraska, Alex Beutel, Ed Chi, Jeff Dean, Neoklis Polyzotis - SIGMOD 2018 (to appear).
Read Also (Various responses - no summary needed):
The Case for B-Tree Index Structures. Thomas Neumann.
Don't Throw Out Your Algorithms Book Just Yet: Classical Data Structures That Can Outperform Learned Indexes. Peter Bailis, Kai Sheng Tai, Pratiksha Thaker, and Matei Zaharia.
|
Week 7
|
May 8
|
Parameter Servers
|
Assigned:
Scaling Distributed Machine Learning with the Parameter Server.
Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, Bor-Yiing Su. OSDI 2014.
|
May 10
|
Distributed Deep Learning (Why not?)
|
Assigned:
GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server.
Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, Eric P. Xing. EuroSys 2016.
Also take a look at (Two other systems - no summary needed):
SPARKNET: Training Deep Networks in Spark
Philipp Moritz, Robert Nishihara, Ion Stoica, Michael I. Jordan. ICLR 2016
BigDL: A Distributed Deep Learning Framework for Big Data
Jason Dai et al. (Intel, Tencent, Alibaba). arXiv Paper 2018.
|
Week 8
|
May 15
|
Graph Systems
|
Assigned:
GraphX: Graph Processing in a Distributed Dataflow Framework.
Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica. OSDI 2014.
Also take a look at (Two other systems - no summary needed):
Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud.
Yucheng Low et al., VLDB 2012
Pregel: A System for Large-Scale Graph Processing
Grzegorz Malewicz et al. SIGMOD 2010.
|
May 17
|
Towards Data Markets
|
Assigned (2 short papers):
Data Markets in the Cloud: An Opportunity for the Database Community. Magda Balazinska, Bill Howe, and Dan Suciu, VLDB 2011.
Why Data Citation is a Computational Problem. Peter Buneman, Susan Davidson, James Frew, CACM Sept. 2016.
|
Week 9
|
May 22
|
Provenance and Lineage
|
Assigned:
Smoke: Fine-grained Lineage at Interactive Speed.
Fotis Psallidas and Eugene Wu, VLDB 2018.
Also take a look at (no summary needed):
Diagnosing Machine Learning Pipelines with Fine-grained Lineage.
Zhao Zhang, Evan R. Sparks, Michael J. Franklin. HPDC 2017.
Provenance in Databases: Why, How, and Where.
James Cheney, Laura Chiticariu, Wang-Chiew Tan. Now Publishers, 2009.
(This is a long, comprehensive survey focused on semantic issues - check out the first section for an overview.)
|
May 24
|
Systems for ML and Advanced Analytics (brainstorming)
|
Assigned:
A Berkeley View of Systems Challenges for AI.
Ion Stoica et al., UC Berkeley Technical Report No. UCB/EECS-2017-159, October 2017.
Also take a look at (no summary needed):
Proceedings of the First SysML Conference. Stanford, CA. February 2018.
Skim the posters for interesting topics; some of the videos are interesting too if you have time.
|
Week 10
|
May 29
|
NO MEETING
|
Prof. Franklin Away - No Meeting Today
|
May 31
|
PROJECT REPORTS
|
Assigned:
PROJECT REPORTS (15 min each)
|