VVZ API is not affiliated with ETH Zurich. Data might be outdated or incorrect. Please view the official ETHZ Vorlesungsverzeichnis for binding information.

263-3000-00L 6 Credits DS, MSC, WBZ D-INFK

Massively Parallel Data Analysis with MapReduce

Lecturers & Examiners: Prof. Dr. Gustavo Alonso, Dr. Donald Kossmann, Dr. Nesime Tatbul Bitim, Prof. Dr. Timothy Roscoe

Last Updated: 2026-02-05 15:24:45

Abstract

The purpose of this course is to teach students how to carry out massively parallel data analysis, using MapReduce as the programming abstraction and Hadoop on top of a (large) cluster of machines, in order to get hands-on experience and solve real problems.

Content

Many applications involve processing and analyzing huge amounts of data. Typical examples are Web-scale search engines (such as Google, MSN, or Yahoo), newer Web applications such as Flickr or Google Maps, and scientific applications (e.g., in the life sciences or physics). A typical analysis of such data would, for instance, detect certain behavior patterns in a Web log or star constellations in telescope images. Given the amounts of data that need to be analyzed, parallelization on large clusters of machines is a must in order to get acceptable response times. The idea is to partition the data into "chunks" and process a large set of chunks in parallel.

The first large-scale realization of this idea on thousands of machines was built by Google using the so-called MapReduce paradigm. MapReduce is a programming framework designed for the analysis of masses of data. Its implementation makes use of the Google File System (GFS), a distributed file system designed to store petabytes of data on thousands of machines. Recently, Yahoo and the Apache Foundation launched an open-source implementation of MapReduce and a distributed file system. This implementation is called Hadoop and has been shown to scale up to 2000 machines. Google is establishing a data center for academic use with 1000 machines that operates using Hadoop. This data center can potentially be used to run programs as part of this course.

The purpose of this course is to teach students how to carry out massively parallel data analysis, using MapReduce as the programming abstraction and Hadoop on top of a (large) cluster of machines, in order to get hands-on experience and solve real problems.

The course will have two parts:
a.) Six weeks of classes in order to understand the underlying technology (distributed file system, scheduling in warehouse-size data centers, and the Sawzall programming language used in the MapReduce framework).
b.) Projects: solving a big data analysis problem (e.g., Web log mining, discovering intelligent life in space, etc.)
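
To make the "partition into chunks and process them in parallel" idea concrete, the following is a minimal single-machine sketch of the MapReduce pattern applied to Web log mining (counting requests per URL). It is not the Hadoop or Google API: the function names, the two-field log format, and the use of a thread pool as a stand-in for a cluster are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def chunked(lines, size):
    """Partition the input into fixed-size chunks (stand-in for file-system blocks)."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_chunk(chunk):
    """Map phase: emit a (url, 1) pair for every well-formed log line in one chunk."""
    pairs = []
    for line in chunk:
        fields = line.split()
        if len(fields) >= 2:  # assumed log format: "<client> <url>"
            pairs.append((fields[1], 1))
    return pairs


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce phase: combine the values for one key into a single result."""
    return key, sum(values)


def map_reduce(lines, chunk_size=2, workers=4):
    """Run map over chunks in parallel, then shuffle and reduce sequentially."""
    chunks = chunked(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    groups = shuffle(chain.from_iterable(mapped))
    return dict(reduce_group(k, v) for k, v in groups.items())


# Hypothetical sample data, invented for this sketch.
LOG = [
    "10.0.0.1 /index.html",
    "10.0.0.2 /search",
    "10.0.0.1 /search",
    "10.0.0.3 /index.html",
    "10.0.0.2 /search",
]

if __name__ == "__main__":
    print(map_reduce(LOG))
```

In a real deployment the chunks would live in a distributed file system and the map and reduce tasks would be scheduled across many machines; the sketch only shows how the three phases fit together.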

General Information

Language: English
Levels: DS, MSC, WBZ
Frequency: Yearly recurring

Examination

Type: graded semester performance

Course Components

Type	Title	Time & Place	Hours
lecture	Massively Parallel Data Analysis with MapReduce	Fri 10:15-12:00 (CAB G 51)	2 h weekly
independent project	Massively Parallel Data Analysis with MapReduce (project work, no fixed class)	No time listed	2 h weekly
