Homework #3¶
Due: Friday, April 18th at 11:59pm
In this assignment, you will implement an edge detection algorithm on grayscale images using 2D convolution on the GPU. Your implementation will support:
Arbitrary stencil sizes: 3×3, 5×5, 7×7, loaded at runtime from a file
Zero-padding boundary handling
Both global memory and shared memory kernel variants
This assignment is designed to help you understand how stencil-based image operations interact with CUDA memory models and how optimization strategies like shared memory tiling can accelerate performance.
Using a GPU Resource¶
Due to ongoing issues with job submission on the Midway3 servers, you are permitted to complete this assignment (and possibly future assignments) using either:
The Midway3 cluster (via RCC), or
The Peanut GPU cluster on the CS Linux servers
Please choose the environment that works best for you.
If you choose to work on the Peanut cluster, note that we are temporarily suspending GPU profiling requirements for this assignment. This decision was made to allow the technical staff time to update and configure the profiling tools (e.g., nsys, ncu) on the Peanut cluster.
We expect to return to profiling-based performance analysis in future assignments once the tooling is fully configured and consistent across both platforms.
In your repository, please include a short README.md file that clearly indicates which GPU cluster you used to run and test your code.
Creating Your Private Repository¶
To actually get your private repository, you will need this invitation URL:
HW3 invitation (please see the "HW 3 is ready" post on Ed)
When you click on an invitation URL, you will have to complete the following steps:
You will need to select your CNetID from a list. This will allow us to know what student is associated with each GitHub account. This step is only done for the very first invitation you accept.
Note
If you are on the waiting list for this course you will not have a repository made for you until you are admitted into the course. I will post the starter code on Ed so you can work on the assignment until you are admitted into the course.
You must click “Accept this assignment” or your repository will not actually be created.
After accepting the assignment, GitHub will take a few minutes to create your repository. You should receive an email from GitHub when your repository is ready. Normally, it's ready within seconds and you can just refresh the page.
You now need to clone your repository (i.e., download it to your machine).
Make sure you’ve set up SSH access on your GitHub account.
For each repository, you will need to get the SSH URL of the repository. To get this URL, log into GitHub and navigate to your project repository (take into account that you will have a different repository per project). Then, click on the green “Code” button, and make sure the “SSH” tab is selected. Your repository URL should look something like this: git@github.com:mpcs52072-spr25/hw3-GITHUB-USERNAME.git.
If you do not know how to use git clone to clone your repository, then follow this guide that GitHub provides: Cloning a Repository
If you run into any issues, or need us to make any manual adjustments to your registration, please let us know via Ed Discussion.
Programming Problem: Convolution-Based Edge Detection¶
Convolution is a common image processing operation used to detect edges, smooth images, or extract features. It works by sliding a filter (also called a kernel or stencil) over an image and computing a weighted sum of neighboring pixel values.
Given an image I and a convolution kernel K of size n×n, the output at pixel position (x, y) is:

O(x, y) = Σ_{i=-r}^{r} Σ_{j=-r}^{r} K(i + r, j + r) · I(x + i, y + j)

where r = n // 2 is the stencil radius. At the edges, some of the neighborhood may fall outside the image bounds. In this assignment, you will handle this case using zero-padding, i.e., treat out-of-bounds pixels as having value 0.
Tasks: Here is what you will need to accomplish for this assignment:
Implement both global and shared memory convolution kernels. The global kernel will only use global memory and the shared memory convolution kernel will use a mixture of shared and global memory. You are required to use 2D grids and blocks for this assignment.
Load the stencil from a file on the host
Copy the stencil to device constant memory
Use your existing image_to_grayscale_kern to convert the input image to grayscale
Implement zero-padding in both kernels
Dynamically allocate shared memory for the shared memory version
Measure and report kernel execution time using CUDA events
The following sections will provide guidance on how to implement these tasks.
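One task above calls for dynamically allocated shared memory. As a reminder, the size of an `extern __shared__` buffer is supplied as the third argument of the launch configuration. The sketch below shows only this mechanism; the tile size, names, and signatures are illustrative assumptions, not starter code:

```cuda
// Illustrative sketch: TILE and the kernel name are assumptions.
#define TILE 16

__global__ void conv_shared_kern(const unsigned char *in, unsigned char *out,
                                 int w, int h, int n) {
    // Dynamically allocated shared memory; its size is set at launch time.
    extern __shared__ unsigned char tile[];
    // ... cooperatively load (TILE + 2*r)^2 pixels, __syncthreads(), convolve ...
}

// Host side: the third <<<>>> argument is the dynamic shared-memory byte count.
void launch_shared(const unsigned char *d_in, unsigned char *d_out,
                   int w, int h, int n) {
    int r = n / 2;
    dim3 block(TILE, TILE);
    dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    size_t smem = (size_t)(TILE + 2 * r) * (TILE + 2 * r) * sizeof(unsigned char);
    conv_shared_kern<<<grid, block, smem>>>(d_in, d_out, w, h, n);
}
```

Sizing the buffer as (TILE + 2r)² accounts for the halo of neighboring pixels each tile needs beyond its own output region.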
Directory Structure¶
After cloning your repository, the assignment should follow the structure outlined below:
├── bin
├── data
│ ├── stencil3x3.txt
│ ├── stencil5x5.txt
│ ├── stencil7x7.txt
│ └── test_img.png
├── include
│ └── png_flatten.h
├── scripts
│ └── hw3-job.sh
├── src
│ ├── edge_detect.cu
│ └── png_flatten.c
├── Makefile
├── report.md
└── README.md
src/ contains all source code files, including your CUDA and C utilities
include/ contains any header files your project uses
data/ holds input images or other data (e.g., stencils)
scripts/ may include any SLURM or job submission scripts
bin/ is where the final executable binary (e.g., edge_detect) will be placed after building
Makefile automates compilation and linking
report.md is the report file that includes answers to the questions described below
README.md should describe how to build and run your project, and indicate which GPU cluster (Midway3 or Peanut) you used
Feel free to add additional files as needed. Additionally, feel free to use the /project/mpcs52072/hw2-data/ images while working on this assignment.
Stencil Input Format¶
The stencil will be stored in a text file of space-separated integers, one row per line. The stencil must be square (the number of rows equals the number of columns). The first line specifies the dimension n; the following n lines each contain a row of n space-separated integers.
Example: A 3×3 Laplacian stencil (`data/stencil3x3.txt`)
3
0 -1 0
-1 4 -1
0 -1 0
Your CUDA kernel must use __constant__
memory to store the stencil coefficients.
Example:
__constant__ int stencil[49]; // Supports up to 7×7
Then copy data to the device like this:
cudaMemcpyToSymbol(stencil, h_stencil, stencil_size * stencil_size * sizeof(int));
You may assume the maximum stencil size is 7×7 (49 elements total). You may include a runtime check to validate that the loaded stencil is 3×3, 5×5, or 7×7, but this is not required. Inside the data directory, we have provided example stencils for all dimensions. We will only use these stencil files when grading.
Preprocessing: Grayscale Conversion¶
Before launching your convolution kernel, you must first convert the input image to grayscale on the GPU using your kernel from the previous homework assignment:
image_to_grayscale_kern<<<grid, block>>>(...);
Only the grayscale image should be passed into your convolution kernels.
Program Usage: Command-Line Interface¶
Your program should support the following CLI:
./edge_detect -m [global|shared] stencil.txt input.png output.png
-m: run a single convolution kernel (global or shared) and produce its execution time. The default execution (no -m) runs both kernels and produces both execution times.
input.png: path to the input RGB PNG image
output.png: path to the output PNG image
stencil.txt: path to the stencil file (e.g., data/stencil3x3.txt)
As with the prior assignments, you do not need to have error-handling code for your arguments. You can assume we will run your program as specified by the usage statement above and will provide it with valid command-line arguments.
Output¶
An edge-detected output PNG image after convolution
A printed performance summary, e.g.:
shared memory version (stencil 5x5): 3.215 ms
global memory version (stencil 5x5): 6.848 ms
If -m is passed, then the program should only produce one of the above printouts, based on the kernel chosen for execution.
Measuring Kernel Execution Time with CUDA Events¶
For this assignment, you will measure and report the execution time of both your global memory and shared memory convolution kernels using CUDA Events.
CUDA events are a simple and effective way to measure elapsed time between two points in your GPU code. Although we will revisit CUDA Events in more detail in a later module (to explore advanced uses like stream synchronization and multi-kernel coordination), for now we will only use them for basic performance timing.
Basic Timing Workflow¶
Here’s how you can time a kernel using CUDA events:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
// Kernel launch
convolution_kernel<<<grid, block>>>(...);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("Kernel execution time: %.3f ms\n", elapsed_ms);
// Clean up the events once you are done with them
cudaEventDestroy(start);
cudaEventDestroy(stop);
You should wrap the timing around only the kernel launch, not memory allocations or file I/O.
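Because the report asks for average runtimes, a single measurement may be noisy (the first launch in particular can include one-time setup costs). One reasonable pattern, sketched here with an illustrative helper name and run count, is a warm-up launch followed by timing several runs and dividing:

```cuda
// Sketch: average kernel time over several runs (names are illustrative).
float time_kernel_avg(dim3 grid, dim3 block, int runs) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    convolution_kernel<<<grid, block>>>(...);  // warm-up launch, not timed

    cudaEventRecord(start);
    for (int i = 0; i < runs; i++)
        convolution_kernel<<<grid, block>>>(...);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / runs;  // average per-launch time in milliseconds
}
```

The timing still wraps only the kernel launches, as required, not allocations or file I/O.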
Report (Submit as report.md)¶
Answer the following:
What is the average runtime for each stencil using both global and shared memory?
How does performance scale with stencil size?
What speedup (if any) does shared memory provide?
What implementation issues did you encounter with:
Loading the stencil from file?
Allocating shared memory?
Managing kernel launch arguments?
Using constant memory?
You are not required to produce graphs or provide the timing measurements for this specific assignment. Please provide a brief overview of what’s happening in your implementation and execution when answering the above questions. However, feel free to provide timings and graphs in your report to justify your answers.
Grading¶
Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our Assignment Rubric page.)
The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:
Completeness: 60%
Correctness: 15%
Design/Style: 10%
report.md: 15%
Submission¶
Before submitting, make sure you've added, committed, and pushed all your code to GitHub. You must submit your final work through Gradescope (linked from our Canvas site) in the "Homework #3" assignment page in one of two ways:
Uploading from GitHub directly (recommended): You can link your GitHub account to your Gradescope account and upload the correct repository based on the homework assignment. When you submit your homework, a pop-up window will appear. Click on "GitHub" and then "Connect to GitHub" to connect your GitHub account to Gradescope. Once you connect (you will only need to do this once), you can select the repository you wish to upload and the branch (which should always be "main" or "master") for this course.
Uploading via a Zip file: You can also upload a zip file of the homework directory. Please make sure you upload the entire directory and keep the initial structure the same as the starter code; otherwise, you run the risk of not passing the automated tests.
Note
For either option, you must upload the entire directory structure; otherwise, your automated tests will not run correctly and you will be penalized if we have to run the tests manually. Going with the first option will do this automatically for you. You can always add additional directories and files (and even files/directories inside the starter directories), but the default directory/file structure must not change.
Depending on the assignment, once you submit your work, an “autograder” will run. This autograder should produce the same test results as when you run the code yourself; if it doesn’t, please let us know so we can look into it. A few other notes:
You are allowed to make as many submissions as you want before the deadline.
Please make sure you have read and understood our Late Submission Policy.
Your completeness score is determined solely based on the automated tests, but we may adjust your score if you attempt to pass tests by rote (e.g., by writing code that hard-codes the expected output for each possible test input).
Gradescope will report the test score it obtains when running your code. If there is a discrepancy between the score you get when running our grader script, and the score reported by Gradescope, please let us know so we can take a look at it.