Homework #2

Due: Friday, April 11th at 11:59pm

In this homework, we will start to explore the CUDA execution model by implementing and timing a CUDA program that performs grayscaling on an image.

Midway3

For the remainder of the course we will be using the Midway3 cluster. Midway3 is the University of Chicago’s newest high-performance computing (HPC) cluster, maintained by the Research Computing Center (RCC). It’s a powerful shared system designed to support computational research across campus, and it’s particularly useful for GPU-accelerated workloads, which makes it a great match for the kinds of CUDA projects we’ll be working on in this course.

Midway3 is organized into two main types of nodes:

  • Login Nodes: These are the machines you connect to when you SSH into Midway3. They’re meant for lightweight tasks like editing files, compiling code, managing jobs, or exploring the filesystem. Do not run any computationally intensive programs on the login nodes. They’re shared resources, and running heavy jobs here can slow down the system for everyone.

  • Compute Nodes: These are the powerful machines where actual computations take place — including GPU-accelerated workloads. You don’t log into these directly; instead, you access them by submitting jobs through the SLURM scheduler.

Just like with the Peanut cluster in Homework 1, you’ll use SLURM to submit your CUDA programs to the compute nodes. SLURM will handle queueing, resource allocation (like GPUs), and execution.

Using GPUs on UChicago RCC Midway3

For this assignment, we will be using the Midway3 gpu partition to run our CUDA applications.

To inspect the available GPU nodes on Midway3, run the following command:

scontrol show partition gpu

Submitting Your Job

You must submit your CUDA application using a batch script via sbatch. Below is a sample script that you will need to adapt:

#!/bin/bash

#SBATCH --job-name=hw2-CNET_ID         # Replace CNET_ID with your own
#SBATCH --account=mpcs53113            # Course account
#SBATCH --partition=gpu                # Submit to GPU partition
#SBATCH --gres=gpu:1                   # Request 1 GPU
#SBATCH --time=00:10:00                # Set wall time limit (10 minutes)
#SBATCH --constraint=v100             # Request V100 GPU (change if needed)
#SBATCH --ntasks-per-node=1            # One task per GPU
#SBATCH --cpus-per-task=1              # One CPU thread per task

# Load CUDA
module load cuda/11.7

# Compile and run your code here
make
./grayscale test_img.png test_img_out.png

Important Notes

  1. Replace ``CNET_ID`` in the --job-name line with your actual CNetID. This helps with job tracking and management.

  2. The --account=mpcs53113 line ensures your job is charged to our course account. This is required for any job running on Midway3 compute nodes. Note that this is not our actual course code; RCC still needs to fix this, but until then we will use mpcs53113.

  3. Do not compile or run code directly on the login node. Always edit and develop from the login node, and submit your jobs to the compute nodes using sbatch.

  4. If you need to use a specific GPU type (e.g., RTX6000), modify the --constraint flag accordingly:

    #SBATCH --constraint=rtx6000
    

If you have questions or run into any issues, please don’t hesitate to ask!

Connecting to Midway3 Using VSCode

You can connect to Midway3 from VSCode just like you did for the CS Linux Servers. Here’s how:

  1. Connect to Midway3

  • Open VSCode, and use the Remote - SSH extension.

  • When prompted for the host address, enter:

    CNET-ID@midway3.rcc.uchicago.edu
    

    Replace CNET-ID with your actual UChicago CNetID.

  • You’ll be prompted to enter your CNet password.

  • After that, you’ll be prompted for 2FA using DUO. When it asks, type 1 to send a DUO push notification to your phone and approve it.

  2. Once Connected

  • You’ll land on one of Midway3’s login nodes. Your terminal prompt will look something like:

    [your-cnetid@midway3-login3 ~]
    
  • From here, you can now:

    • Clone the course repository

    • Clone your individual homework/project repositories

Note

Before cloning, you’ll need to set up an SSH key on Midway3 — just like you did on the CS Linux Servers — to authenticate with GitHub.

Follow the GitHub SSH setup instructions in the section below to complete this step.

  3. Clone Your Repository

Once GitHub access is set up:

  • Clone your homework or project repository for the assignment.

  • Now you’re ready to get started working on your CUDA code!

Creating Your Private Repository

To actually get your private repository, you will need this invitation URL:

  • HW2 invitation (please check the “HW 2 is ready” post on Ed)

When you click on an invitation URL, you will have to complete the following steps:

  1. You will need to select your CNetID from a list. This will allow us to know which student is associated with each GitHub account. This step is only done for the very first invitation you accept.

Note

If you are on the waiting list for this course, you will not have a repository made for you until you are admitted into the course. I will post the starter code on Ed so you can work on the assignment in the meantime.

  2. You must click “Accept this assignment” or your repository will not actually be created.

  3. After accepting the assignment, GitHub will take a few minutes to create your repository. You should receive an email from GitHub when your repository is ready. Normally, it’s ready within seconds and you can just refresh the page.

  4. You now need to clone your repository (i.e., download it to your machine).
    • Make sure you’ve set up SSH access on your GitHub account.

    • For each repository, you will need to get the SSH URL of the repository. To get this URL, log into GitHub and navigate to your project repository (take into account that you will have a different repository per project). Then, click on the green “Code” button, and make sure the “SSH” tab is selected. Your repository URL should look something like this: git@github.com:mpcs52072-sum24/hw2-GITHUB-USERNAME.git.

    • If you do not know how to use git clone to clone your repository, then follow this guide that GitHub provides: Cloning a Repository

If you run into any issues, or need us to make any manual adjustments to your registration, please let us know via Ed Discussion.

Programming Problem: Grayscaling an Image

For this homework, you will implement a CUDA program that takes a PNG image and converts it to a grayscale representation. To help you get started, we have provided code that:

  1. Includes a function that reads in a png file and returns a flattened 1D array of pixels along with the image width and height (see png_flatten_load inside hw2/png_flatten.h).

  2. Includes a host-side function, image_to_grayscale (inside hw2/grayscale.cu), that converts the flattened image of pixels to grayscale.

  3. Includes a main file (i.e., grayscale.cu) that takes in two arguments: an input and an output png file. The program loads the input png file, converts it to grayscale, and saves it to the output file path (see png_flatten_save inside hw2/png_flatten.h).

For the remainder of the assignment, you will augment grayscale.cu and write code to perform timing measurements on your program.

Task 0: Understanding, Compiling, and Running the Starter Code

Make sure you fully understand how to use the provided starter code files:

  • grayscale.cu

  • png_flatten.h

  • png_flatten.c

The flattened array (i.e., unsigned char *image inside grayscale.cu) places all pixels adjacent to each other. Since each pixel is composed of 4 components (RGBA), the start of a new pixel in the array is offset by 4. You can see how to access each pixel by looking over the image_to_grayscale function and how it performs grayscaling; a small illustration is also sketched below. Do not modify the files png_flatten.h and png_flatten.c! These files are provided to help with easily loading and saving png files. For this assignment, you must use the flattened version of the png file.
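
For reference, here is a minimal host-side sketch of how the flattened RGBA layout can be indexed. It is not the starter code; the luminance weights below are a common convention and are only an assumption about what image_to_grayscale computes.

// Sketch only (not the starter code): indexing the flattened RGBA array.
// The weights 0.299/0.587/0.114 are a common luminance convention (assumed).
void grayscale_pixel_sketch(unsigned char *image, unsigned int width, unsigned int height) {
    unsigned int num_pixels = width * height;
    for (unsigned int i = 0; i < num_pixels; i++) {
        unsigned char r = image[4 * i + 0];   // red component of pixel i
        unsigned char g = image[4 * i + 1];   // green component
        unsigned char b = image[4 * i + 2];   // blue component
        // image[4 * i + 3] is the alpha channel and is left unchanged here.
        unsigned char gray = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
        image[4 * i + 0] = gray;
        image[4 * i + 1] = gray;
        image[4 * i + 2] = gray;
    }
}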

Grayscale Program: Usage and Behavior

Your grayscale.cu program must support the following command-line usage:

./grayscale [-p] [-d GRID BLOCK] input.png output.png

The purpose of this program is to convert a color image (input.png) into a grayscale image (output.png), using either a sequential CPU implementation or a parallel GPU implementation.

The command-line arguments modify how the program runs:

  • Default (no flags):

    Runs the sequential (CPU) version of the grayscale algorithm. This version will read input.png, convert it to grayscale using host code, and save the result in output.png.

  • ``-p`` flag:

    Enables the GPU version of the grayscale algorithm. This version uses CUDA and launches a kernel with default grid and block dimensions. You may choose reasonable defaults based on the size of the input image and/or by querying the device properties.

  • ``-d GRID BLOCK`` flag:

    This flag must be used in combination with ``-p``, and allows the user to explicitly specify grid and block dimensions via the command line.

    For example:

    ./grayscale -p -d 64 16 input.png output.png
    

    This would run the GPU version with a grid size of 64 and a block size of 16.

Assumptions

  • You do not need to perform error checking on the command-line arguments.

  • You can assume all inputs (including grid/block sizes and filenames) are valid.

  • Your program should apply the appropriate logic based on the presence or absence of the -p and -d flags.

Make sure to document your default grid and block settings somewhere in your code or README so we know what configuration is being used when -d is not provided.
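
If it helps, here is a minimal sketch of one way to choose defaults by covering every pixel and respecting the device limits. The block size of 256 and the use of device 0 are assumptions, not requirements.

// Sketch only: one possible way to choose default grid/block dimensions
// when -d is not given. A block size of 256 and device 0 are assumptions.
#include <cuda_runtime.h>

void pick_default_dims(unsigned int num_pixels, int *grid, int *block) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0
    *block = 256;                                 // well under prop.maxThreadsPerBlock
    *grid = (num_pixels + *block - 1) / *block;   // enough blocks to cover all pixels
    if (*grid > prop.maxGridSize[0])              // clamp to the device's grid-size limit
        *grid = prop.maxGridSize[0];
}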

Compiling and Running

We have provided a Makefile to easily compile and generate your CUDA program named grayscale. Update the sbatch script file named hw2-job.sh with information about your directory structure and CNet credentials (similar to hw1) and submit it with sbatch:

$ sbatch hw2-job.sh

The script builds and executes the grayscale program on a GPU partition. The grayscale program uses the provided test file test_img.png to produce a grayscale version that is saved to the test_img_out.png file. If you logged into Midway3 using Visual Studio Code, you should be able to open test_img_out.png to see the grayscaled image.

Task 1: Implement a GPU Version

Inside grayscale.cu, implement code that performs grayscaling on the GPU. Call your kernel image_to_grayscale_kern. Your goal is to ensure you are utilizing the GPU efficiently, using the image data along with the GPU’s queried capabilities to determine the grid and block dimensions. You will need to think about how to map CUDA’s thread execution model to this problem (i.e., how to assign threads to pixels). This is the challenge of the assignment, so you will be on your own for this portion. However, feel free to ask general questions on Ed or during office hours about the execution model.

Verifying Correctness

To test your implementation, implement a function verify_gpu_results that uses the image_to_grayscale function to verify that the GPU’s output matches the expected CPU output. You can determine the function arguments and return type for this function; one possible shape is sketched at the end of this subsection. The main goal is to make sure your GPU code is working as expected.

When comparing the outputs of your CPU and GPU implementations, you may notice that some values differ by a small amount — typically by 1 when working with 8-bit image data (i.e., values in the range 0–255). This is expected behavior and is generally not a cause for concern.

Why does this happen?

  • Floating point arithmetic on CPUs and GPUs can produce slightly different results due to differences in hardware design, precision, rounding modes, and optimization strategies.

  • Even when performing the exact same operations in the same order, these differences may result in small rounding discrepancies.

  • In image processing or other applications involving floating point math, this can lead to final outputs that differ by a value of 1 in a few locations.

Note

Differences of ±1 between the CPU and GPU versions are acceptable and expected. You do not need to modify your code to eliminate them.

As long as your GPU implementation produces consistent results and behaves correctly overall, it is considered correct — even if some output values differ slightly from the CPU version.
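
As a reference point, here is a minimal sketch of one possible shape for verify_gpu_results, assuming you keep the original image and the GPU output in host memory. The argument order of image_to_grayscale is an assumption; adapt it to the actual signature in the starter code.

// Sketch only: one possible shape for verify_gpu_results.
// Assumes image_to_grayscale(image, width, height); returns 1 on success.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int verify_gpu_results(const unsigned char *original, const unsigned char *gpu_out,
                       unsigned int width, unsigned int height) {
    size_t nbytes = (size_t)width * height * 4;    // 4 bytes (RGBA) per pixel
    unsigned char *expected = (unsigned char *)malloc(nbytes);
    memcpy(expected, original, nbytes);
    image_to_grayscale(expected, width, height);   // recompute the expected result on the CPU

    int ok = 1;
    for (size_t i = 0; i < nbytes; i++) {
        int diff = (int)expected[i] - (int)gpu_out[i];
        if (diff < -1 || diff > 1) {               // allow the expected +/-1 rounding difference
            fprintf(stderr, "Mismatch at byte %zu: CPU=%d GPU=%d\n", i, expected[i], gpu_out[i]);
            ok = 0;
            break;
        }
    }
    free(expected);
    return ok;
}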

Task 2: GPU vs CPU Execution Timing Task

On Midway3, a shared class directory has been created at:

/project/mpcs52072/

This directory contains all necessary data, test cases, and code materials for the course. You all have read access to this location. For this assignment, you will find a subdirectory:

/project/mpcs52072/hw2-data/

This directory holds several test images that you will use for benchmarking your grayscale conversion program.

Your Task

You will analyze the computational speedup of the GPU implementation compared to the CPU, focusing only on the core image processing functions:

  • image_to_grayscale (CPU version)

  • image_to_grayscale_kern (GPU kernel)

Do not measure the total program runtime — focus specifically on the portion that performs the grayscale conversion.
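
One common approach is sketched below: wall-clock timing around the CPU function and CUDA events around the kernel launch. This fragment is meant to live inside main after the image has been loaded and copied to the device; the names image, d_image, grid, and block, and the kernel’s argument list, are assumptions you should adapt to your own code.

// Sketch only: time just the conversion functions, not the whole program.
// Needs <cuda_runtime.h>, <time.h>, and <stdio.h> at the top of the file.

// CPU version: wall-clock time around image_to_grayscale.
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
image_to_grayscale(image, width, height);
clock_gettime(CLOCK_MONOTONIC, &t1);
double cpu_sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

// GPU version: CUDA events bracket only the kernel launch.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
image_to_grayscale_kern<<<grid, block>>>(d_image, width, height);  // assumed argument list
cudaEventRecord(stop);
cudaEventSynchronize(stop);                    // wait for the kernel to finish
float gpu_ms = 0.0f;
cudaEventElapsedTime(&gpu_ms, start, stop);
printf("CPU: %f s, GPU: %f s\n", cpu_sec, gpu_ms / 1000.0f);
// For your graph, repeat each measurement at least 10 times and average.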

Using the test images in the hw2-data directory, perform the following steps:

  1. Run and time both the CPU and GPU grayscale conversion functions on each test image.

  2. Generate a bar graph comparing the average execution time of each version per image.

Graph Requirements

  • The x-axis should list the filenames of the test images.

  • The y-axis should represent the execution time in seconds.

  • For each image, show two bars side-by-side:

    • One for the CPU time

    • One for the GPU time

  • The graph must be saved with the filename: execution.png.

  • Add an appropriate title and axis labels.

  • Adjust the y-axis range for visibility — e.g., if your execution times are mostly between 0 and 1 seconds, don’t use a y-axis that goes to 14.

Additional Requirements

  1. Occupancy Check:

    • Use NVIDIA Nsight Compute (ncu) to profile your kernel (an example invocation is shown after this list).

    • Ensure the occupancy is above 80%.

    • You may need to experiment with different grid and block dimensions to achieve this.

  2. Stable Timings:

    • Due to normal variation in execution time, run each experiment at least 10 times.

    • Use the average time from those runs in your graph.

  3. Optional Automation:

    • You are not required to write a script that automates both the timing and graph generation for this assignment.

    • However, future assignments will require this, so you are strongly encouraged to write one now that you can reuse later.

  4. Submission Notes:

    • Ensure your final output includes the execution.png graph as specified.

    • Make sure all your results and measurements are based on the image data in the shared class directory at /project/mpcs52072/hw2-data.
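
For reference, a profiling run can be launched from your batch script with a command along these lines (the file names and dimensions here are placeholders); the report that ncu prints includes an achieved occupancy figure you can check against the 80% target:

ncu ./grayscale -p -d 64 256 input.png output.png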

README.md file

Inside the hw2/README.md file, provide an explanation of your results. Focus on answering the following:

  • Where are you getting speedups in your graphs and why?

  • What areas are you not getting a speedup and why?

  • What effect do the grid and block sizes have on the performance of your GPU implementation?

One or two paragraphs are sufficient for answering these questions.

Grading

Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our Assignment Rubric page.)

The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:

  • Task 1: 50%

  • Task 2: 50%

Submission

Before submitting, make sure you’ve added, committed, and pushed all your code to GitHub. You must submit your final work through Gradescope (linked from our Canvas site) in the “Homework #2” assignment page in one of two ways:

  1. Uploading from GitHub directly (recommended way): You can link your GitHub account to your Gradescope account and upload the correct repository based on the homework assignment. When you submit your homework, a pop-up window will appear. Click on “GitHub” and then “Connect to GitHub” to connect your GitHub account to Gradescope. Once you connect (you will only need to do this once), you can select the repository you wish to upload and the branch (which should always be “main” or “master”) for this course.

  2. Uploading via a Zip file: You can also upload a zip file of the homework directory. Please make sure you upload the entire directory and keep the initial structure the same as the starter code; otherwise, you run the risk of not passing the automated tests.

Note

For either option, you must upload the entire directory structure; otherwise, your automated tests will not run correctly and you will be penalized if we have to manually run the tests. Going with the first option will do this automatically for you. You can always add additional directories and files (and even files/directories inside the starter directories), but the default directory/file structure must not change.

Depending on the assignment, once you submit your work, an “autograder” will run. This autograder should produce the same test results as when you run the code yourself; if it doesn’t, please let us know so we can look into it. A few other notes:

  • You are allowed to make as many submissions as you want before the deadline.

  • Please make sure you have read and understood our Late Submission Policy.

  • Your completeness score is determined solely based on the automated tests, but we may adjust your score if you attempt to pass tests by rote (e.g., by writing code that hard-codes the expected output for each possible test input).

  • Gradescope will report the test score it obtains when running your code. If there is a discrepancy between the score you get when running our grader script, and the score reported by Gradescope, please let us know so we can take a look at it.