EEC 171, Parallel Computer Architecture @ UC Davis

John Owens, Associate Professor, Electrical and Computer Engineering, UC Davis

At UC Davis in 2006, our undergraduate computer architecture sequence had two quarter-long courses: EEC 170, the standard Patterson and Hennessy material, and EEC 171, titled Parallel Computer Architecture. According to some of the students who had taken it, the course was "10 weeks of cache coherence protocols". I revamped this course with the support of an NVIDIA Teaching Fellowship to cover more modern material. I had twenty 2-hour class periods to cover the material, plus a 2-hour final exam.

My philosophy in creating the course was to teach the students the what and why of parallel architecture, but not the how. As an example, I teach them what out-of-order execution is, and why it's important and useful, but I don't teach them the Tomasulo algorithm. Many more of our undergrads will use parallel processors than build them, and our graduate courses teach the implementation details for students who want to go deeper.

This course is taught in spring quarter. Enrollment has been roughly 12 students each spring (I have now taught this course 3 times). In general these students are the best computer engineering students in the department (since they're taking a senior elective), so they have been bright and interested in the material, and attend and actively participate in class. In other words, I'm teaching to great students, and my materials reflect that.

I use Hennessy and Patterson's senior book (CA4) as a text. It doesn't necessarily correspond to the order or the material I use, but it has lots of great info and is a reference I'm comfortable asking the students to buy.

Course Organization

I divide the course into 3 roughly equal parts: instruction-level parallelism (ILP), thread-level parallelism (TLP), and data-level parallelism (DLP). Each part has several lectures, one project, one homework assignment, and one (two-hour) exam. After 3 years of teaching this, I'm happy with this organization, and with the ordering.

Roughly, I'm covering:

Introduction: 2 lectures, (1) a course overview, (2) benchmarks, economics, and technology trends
ILP: What is ILP? Superpipelining vs. superscalar. In-order vs. out-of-order. Pentium and Pentium Pro as examples. VLIW. Branch prediction. Predication. Speculation. EPIC. Trace scheduling. Transmeta Crusoe. Limits to ILP.
TLP: SIMD, MIMD. Threads and processes. Coarse and fine grain. Supercomputing as a market. Symmetric multiprocessing. Simultaneous multithreading. The Google cluster. Sun T1 (Niagara). Parallel hardware and programming models; shared memory vs. message passing. Atomicity. Cache coherence (snoopy and directory protocols). Interconnection networks.
DLP: SIMD instruction sets. Vector machines. Massively parallel machines. GPUs. CUDA. Stream processors. Algorithms and programming models.
Case Studies: Have used in the past: MIT RAW/Tilera; IBM Cell; Stream Processors Inc. Storm; Texas TRIPS.
Wrapup: Berkeley View, plus a review.

Homeworks & Exams

One of the hard things about teaching the class the way I do is that my treatment is more qualitative than quantitative. Consequently it's difficult to write homework and exam questions. For instance, in the thread-parallel part of the class, I always ask questions on interconnection networks and something having to do with cache coherence, simply because a lot of the other material just doesn't lend itself well to exam questions.

H&P have good ILP questions (but not good answers to them). Their TLP questions and answers are both good. I supplement them with an exercise from the Hill and Marty 2008 IEEE Computer paper. They have no DLP questions (I wish I had a good set of vector questions). For all three of these sections I use old exam questions as homework questions, so it gets easier every year to have enough questions.
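For readers who haven't seen the Hill and Marty paper ("Amdahl's Law in the Multicore Era"), here is a minimal Python sketch of its symmetric and asymmetric multicore speedup models. It is not the homework exercise itself. The function names and the `best_r` helper are illustrative, and `perf(r) = sqrt(r)` is the assumption the paper uses for its examples. Here `f` is the parallelizable fraction, `n` is the chip budget in base-core equivalents (BCEs), and `r` is the number of BCEs spent on each core (symmetric) or on the one big core (asymmetric).

```python
import math


def perf(r: float) -> float:
    """Sequential performance of a core built from r BCEs (assumed sqrt(r))."""
    return math.sqrt(r)


def speedup_symmetric(f: float, n: int, r: int) -> float:
    """n/r identical cores of r BCEs each; serial code runs on one of them."""
    serial = (1.0 - f) / perf(r)
    parallel = f * r / (perf(r) * n)
    return 1.0 / (serial + parallel)


def speedup_asymmetric(f: float, n: int, r: int) -> float:
    """One big core of r BCEs plus (n - r) single-BCE cores."""
    serial = (1.0 - f) / perf(r)
    parallel = f / (perf(r) + n - r)
    return 1.0 / (serial + parallel)


def best_r(f: float, n: int) -> int:
    """Illustrative helper: power-of-two core size that maximizes symmetric speedup."""
    candidates = [2 ** k for k in range(int(math.log2(n)) + 1)]
    return max(candidates, key=lambda r: speedup_symmetric(f, n, r))
```

With `r = 1` both models reduce to classic Amdahl's law, `1 / ((1 - f) + f / n)`. Sweeping `f` and `r` for a fixed `n` shows the tradeoff between a few big cores and many small ones.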

Projects

One of my goals in projects is to give the students more open-ended assignments. Undergraduates get good at doing small, bite-sized, perfectly specified problems. I'd like to make their problems larger, harder, and more nebulous.

Each project has a warmup (to get familiar with the tools) then a main project. Usually they have a week for the warmup and then a week-and-a-half to two weeks for the main project.

I also think their writing is important, so all three projects have had the same final deliverable: "write a 3-page report to make a recommendation to your manager".

Finally, it's hard to give a hardware assignment. All three assignments are software assignments, but they are designed so the students learn as much about the hardware as possible.

ILP: We use the Trimaran toolchain to compile benchmarks to different machine configurations (different numbers and distributions of functional units). We give them a cost model for assembling a processor and ask them to recommend a machine configuration for (a) general-purpose and (b) high-performance processors. This can all be run on a standard single-core machine.
TLP: The students compare shared-memory (pthreads on a quad-core machine) vs. message-passing (MPI across a cluster) parallel implementations. One of the tricky bits was finding code that was not clearly network bound. We are using a Mandelbrot-set computation that has a high compute-to-network ratio.
DLP: Using CUDA and NVIDIA GPUs, the students first maximize the number of FLOPS they can achieve on a GPU, then look at scalability as a function of the number of threads and the number of blocks, and also branch granularity.
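To show why a Mandelbrot-set computation suits the TLP comparison, here is a minimal Python sketch of the row decomposition. It is not the course code (the real assignments use pthreads and MPI). The image bounds, the cyclic row assignment, and all function names are assumptions made for illustration. Each pixel costs up to `max_iter` iterations but yields a single integer, so the work per value communicated is high.

```python
def escape_time(c: complex, max_iter: int) -> int:
    """Iterations of z = z*z + c before |z| exceeds 2 (capped at max_iter)."""
    z = 0j
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter


def mandelbrot_row(y: int, width: int, height: int, max_iter: int) -> list:
    """Compute one image row; rows are independent, so no worker needs another's data."""
    im = -1.5 + 3.0 * y / (height - 1)
    return [
        escape_time(complex(-2.0 + 3.0 * x / (width - 1), im), max_iter)
        for x in range(width)
    ]


def rows_for_worker(rank: int, num_workers: int, height: int) -> range:
    """Cyclic row assignment, which spreads the expensive rows across workers."""
    return range(rank, height, num_workers)


def render(width: int, height: int, max_iter: int, num_workers: int) -> list:
    """Serial stand-in: each outer iteration plays the role of one thread or MPI rank."""
    image = [None] * height
    for rank in range(num_workers):
        for y in rows_for_worker(rank, num_workers, height):
            # Shared memory: write the row in place. Message passing: send it to rank 0.
            image[y] = mandelbrot_row(y, width, height, max_iter)
    return image
```

In a shared-memory version each thread would write its rows directly into `image`. In a message-passing version each rank would send its finished rows to a collector. That final gather is the only communication in either case.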

What Worked and What Didn't Work

First, I'm happy with both the order and scope of the material, and the overall what-and-why-but-not-how philosophy. I will continue to teach this with 3 main thrusts and have a homework, a project, and an exam per thrust.

In general the projects have been successful with a good scope and time to complete them. Having a warmup is important because it works out any tool issues (in particular, it eliminates the 9 pm email the night before that says "This tool is broken").

ILP assignment: Would be good to have more benchmarks (like, run the whole SPEC suite) and vary them every year. Also, the optimization part tends to lend itself to students using dozens of hours of computer time exhaustively searching every possibility. This is OK, I guess, but sets apart the students who do this from those who don't.
TLP assignment: This seems to work pretty well. Would be nice to have multiple interesting programs (not just Mandelbrot) to better explore the space. However, since the students generally have to learn both MPI and pthreads to finish the assignment, adding more programs might be too much work.
The #1 problem with the DLP (CUDA) assignment is the branch granularity part. To make it work, we need large grains of work, but large grains of work blow out the CUDA compiler; the students generally aren't able to get past 32 different branch paths. I need to figure this out for future years.
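For context on what "branch granularity" measures, here is a minimal Python model of SIMT divergence. It is a sketch under stated assumptions, not the assignment's kernel. The branch selector `branch_path` is hypothetical. The model assumes every path costs the same and that a warp executes each distinct path in turn. It covers hardware divergence only and says nothing about the compiler problem described above.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def branch_path(thread_id: int, granularity: int, num_paths: int) -> int:
    """Hypothetical selector: each run of `granularity` consecutive threads shares a path."""
    return (thread_id // granularity) % num_paths


def paths_per_warp(warp_id: int, granularity: int, num_paths: int) -> int:
    """Number of distinct branch paths taken by the threads of one warp."""
    first = warp_id * WARP_SIZE
    return len({
        branch_path(t, granularity, num_paths)
        for t in range(first, first + WARP_SIZE)
    })


def serialization_factor(num_threads: int, granularity: int, num_paths: int) -> float:
    """Average distinct paths per warp: modeled slowdown vs. a divergence-free run.

    Assumes num_threads is a multiple of WARP_SIZE.
    """
    num_warps = num_threads // WARP_SIZE
    total = sum(paths_per_warp(w, granularity, num_paths) for w in range(num_warps))
    return total / num_warps
```

In this model a granularity of 32 or more keeps every warp on a single path, while a granularity of 1 with 32 paths makes each warp run all 32 paths one after another.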

I usually write a one-hour exam and give the students 2 hours to do the exam. Like my exams in general, the questions are make-you-think questions, but the students do quite well. One thing that worries me is that the questions to some extent reward just thinking about the question logically and do not rely heavily on stuff we talked about in class; I figure a smart student who had never seen the material before could probably do well on my exams. However, I do feel like the students are learning the material, which is what counts.

Lecture parts that I like: Benchmarks and technology lecture I think is very important; integration of Pentium and derivatives with discussion of ILP; Intel's coding guide and how it relates to ILP; discussion of Crusoe as an alternative to Intel x86 and what's good/bad about the approach; discussion of supercomputing, what it's for, and what kind of machines are used; discussion of the Google cluster; the Hill and Marty problem on the TLP homework; case studies throughout as good examples; a self-contained 2-hour CUDA lecture that's an intro to CUDA [I do this so I can invite other colleagues who want to learn about CUDA]; parallel algorithms discussion.

Things I need to work on: better top-level overview in each lecture and wrapup of big ideas at end; would like to compress speculation/predication discussion and perhaps introduce it before branch prediction; compress programming model/machine parallelism lecture; cache coherence discussion is really boring and long; better discussion of consistency vs. coherence, which I personally don't understand as well as I should; given my research focus, a better tie between historical data-parallel machines and today's machines would be really instructive (what is the same? what is different and why?); better discussion of Cray 1 and the big ideas in it; need a better discussion of CM-2 (my notes say "Maybe mix up programming model and hw more? 'How do we support this?'"); case study at end should have Larrabee, and possibly discuss its SIMD instruction set instead of AltiVec; Imagine discussion is just pasting in a bunch of my Imagine slides instead of integrating it more into the material.

Files

[ Lecture slides (directory of PDFs) | Homework assignments (directory of PDFs) | Project assignments and slides (directory of PDFs) and code ]

Because this material is publicly available, I don't want to post anything that might have solutions. I am happy, upon request from instructors, to provide any of (a) Keynote source for slides; (b) homework solutions; and (c) past exams and solutions.

Credits

I used slide material (with permission!) from many sources, including:

Computer Organization and Design (Patterson & Hennessy) © 2005, Computer Architecture (Hennessy & Patterson) © 2007, Inside the Machine (Jon Stokes) © 2007, © Dan Connors / University of Colorado 2007, © Kathy Yelick / UCB 2007, © Wen-Mei Hwu/David Kirk, University of Illinois 2007, © David Patterson / UCB 2003–7, © John Lazzaro / UCB 2006, © Mary Jane Irwin / Penn State 2005, © John Kubiatowicz / UCB 2002, © Krste Asanović/Arvind / MIT 2002, © Morgan Kaufmann Publishers 1998.

Many thanks to all of them.