Assignment2

Assignment 2: Heterogeneous Scheduling - Matrix multiply


Due date: Sunday May 15th, 2011 by 11:59PM PDT.

The purpose of this assignment is to give you experience in tuning a well known algorithm, single precision matrix multiply (SGEMM), for the CPU and GPU and to distribute and load balance the workload. The goal is for 50% of your system's theoretical peak. You are only required to achieve this for a single matrix size, likely reasonably large. As a warning, this project is harder than it may seem on face. This is not something you can do the day the assignment is due.

Step 1 - CPU:


You are allowed to use any single threaded optimized CPU implementation. Examples are MKL, ACML, GotoBLAS, your EE282 code, etc. If you have trouble finding a reasonable CPU implementation, let me know. The point on the CPU side is for you to figure out how to distribute a matrix multiply among multiple CPU cores. You may use any parallel runtime system you wish for managing tasks, examples are TBB, GCD, ConcRT, or write your own. You may also choose to use OpenCL for this task (you will need a CPU implementation from AMD, Intel, or Apple).

As a hint if trying to write your own CPU SGEMM implementation: SSE is not where you should start, start with optimizing for the memory system, e.g. block into registers, L1, L2, etc will get you most of the way.

To estimate peak for the CPU, you need to know how many cores you have, the width of the SIMD units, the number of flops per clock of the SIMD unit, and the clock rate. For example, for a quad core AMD Phenom II at 3GHz: 4 cores * 4 floats per clock * 2 flops per clock * 3GHz = 96GFlops.

In your writeup, describe what external code/libraries you used, how you chose to distribute work, and what total CPU performance and percentage of peak theoretical you were able to achieve.

Step 2 - GPU:


You are not allowed to use any GPU library. You can choose to write the matrix multiply in whatever GPU language you wish. OpenCL, OpenGL, Direct Compute, Direct X, CUDA (on Nvidia), CAL (on AMD) are all fair game. I would tend to suggest a modern compute language, like OpenCL, to better aid you in your final projects.

As an implementation hint, just like the CPU, think about how you are touching memory and blocking data and execution. You will want to look at using buffers as well as images in OpenCL and explore the differences in the memory system. Same when thinking about using local memory. You also may want to think about a compute API vs a graphics API and the advantages and disadvantages. Also think carefully about how you are going to move data onto the device. Remember that total performance is measures as round trip including transfer costs not kernel only performance (kernel only performance is referred to as a "GPGPU lie").

To estimate peak for the GPU, you need to know how many cores you have, the number of ALUs per core, the number of flops per clock of the SIMD unit, and the clock rate. For example, for ATI Radeon HD 5870, there are 20 cores with 80 ALUs per core, 2 flops per clock (MAD) at 850MHz = 2.72TFlops.

In your writeup, describe what your algorithm strategy is, how you chose to distribute work, and what total GPU performance and percentage of peak theoretical you were able to achieve, and what you believe the bottleneck is (memory bandwidth, memory latency, ALU, etc) and why.

Step 3 - Putting it all together:


Now put the GPU and CPU pieces together. You will need to decide how you are going to distribute the work between the CPU and GPU to get maximal performance.

In your writeup, describe how you chose to distribute work, and what total CPU+GPU performance and percentage of peak theoretical you were able to achieve, and what you believe the bottleneck is (memory bandwidth, memory latency, ALU, transfer latency, transfer overlap, etc) and why.

Step 4 (optional):

Explore the benefits and drawbacks of the technique(s) you have chosen more deeply and how you would deal with dynamic load balancing for different matrix sizes, or if power state transitions are changing the performance aspects of the CPU and/or GPU. When does it make sense to schedule work on the GPU? etc.

Submission and Grading:


Remember the purpose of this assignment is to get you to understand the platform you have using a simple algorithm (that can be challenging to optimize) and get you some experience with using the CPU and GPU together in a simple way. What you choose to learn from the assignment will be what you put into it. Only the minimum is required, but I would like you to at least think about the optional parts.

This assignment will be graded on a credit/partial credit/no credit basis. Credit will be given if you achieve 50% of your platform theoretical peak performance. Partial credit will be given if you make a reasonable attempt and explain where and how you got hung up. Note that "out of time" it not a reasonable attempt! In your writeup I want you to describe the CPU and GPU algorithms chosen, what you believe the bottlenecks on each are, and what your strategy for load balancing is. This should be put into a simple writeup in pdf form.

To submit your work, email the course staff at cs448s-spr1011-staff@lists.stanford.edu. As a reminder, for smooth submission, you need to submit from your Stanford email account (not cs.stanford.edu, gmail, etc).

Recent