Project #3: Your Choice!

Due: Friday, May 30th at 11:59pm. No extensions can be used for Project #3

Getting started

For each assignment, a Git repository will be created for you on GitHub. However, before that repository can be created for you, you need to have a GitHub account. If you do not yet have one, you can get an account here: https://github.com/join.

To actually get your private repository, you will need this invitation URL:

  • Project #3 invitation (please check the “Project #3 is ready” post on Ed)

When you click on an invitation URL, you will have to complete the following steps:

  1. You will need to select your CNetID from a list. This will allow us to know which student is associated with each GitHub account. This step is only done for the very first invitation you accept.

Note

If you are on the waiting list for this course, you will not have a repository made for you until you are admitted into the course. I will post the starter code on Ed so you can work on the assignment while you wait to be admitted.

  2. You must click “Accept this assignment” or your repository will not actually be created.

  3. After accepting the assignment, GitHub will take a few minutes to create your repository. You should receive an email from GitHub when your repository is ready. Normally, it’s ready within seconds and you can just refresh the page.

  4. You now need to clone your repository (i.e., download it to your machine).
    • Make sure you’ve set up SSH access on your GitHub account.

    • For each repository, you will need to get the SSH URL of the repository. To get this URL, log into GitHub and navigate to your project repository (take into account that you will have a different repository per project). Then, click on the green “Code” button, and make sure the “SSH” tab is selected. Your repository URL should look something like this: git@github.com:mpcs52072-spr25/proj3-GITHUB-USERNAME.git.

    • If you do not know how to use git clone to clone your repository, follow the guide that GitHub provides: Cloning a Repository
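
      For example, cloning the Project #3 repository would look like this (substituting your own GitHub username):

        $ git clone git@github.com:mpcs52072-spr25/proj3-GITHUB-USERNAME.git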

If you run into any issues, or need us to make any manual adjustments to your registration, please let us know via Ed Discussion.

Assignment

This final project gives you the freedom to explore and apply your GPU programming knowledge to a problem of your own choosing. You will propose, implement, and analyze a CUDA-based application that leverages parallel computation for meaningful performance gains.

There are no restrictions on domain—your application can involve graphics, machine learning, audio processing, scientific simulation, finance, data analytics, or anything else that can benefit from parallelism.

While open-ended, your project must meet the following minimum requirements.

Project Requirements

  1. Input/Output Component: Your application must be able to read data from an input source (e.g., file, generated dataset, user input) and produce results to an output (e.g., file, visualization, log, etc.).

  2. Substantial Data Size: Your dataset should be large enough that GPU acceleration is justified. If the dataset is too small, you may not see significant benefit from using the GPU.

  3. Optimization Considerations: You must evaluate and apply GPU-specific optimizations appropriate to your system. Use the Checklist of Common Optimizations (m4_week5.pdf) as a reference. While there is no strict requirement for the number of optimizations you apply, we will evaluate your project by reviewing your implementation and looking for opportunities where a GPU optimization should have been applied. If there are clear places where an optimization (e.g., shared memory reuse, memory coalescing, occupancy tuning) would have significantly improved performance and it was not considered or mentioned in your write-up, that may impact your score. Be thoughtful about the design of your system, and make sure to justify your choices—even if you decide not to use a certain optimization, explain why it wasn’t suitable for your case.

  4. Advanced GPU Patterns (Minimum of 3): Your project must incorporate at least three patterns and/or advanced CUDA features from the following list:

    • CUDA Streams

    • Dynamic Parallelism

    • Tensor Core APIs

    • Standard GPU Patterns: map, scatter, gather, stencil, scan, reduce, etc.

    You are encouraged to use third-party libraries (e.g., Thrust, cuBLAS, cuDNN) for these patterns, but you must implement your application’s core computation in custom CUDA kernels. (A brief sketch of one such pattern appears after this requirements list.)

  5. System Write-Up: You must submit a detailed report documenting your project, your design decisions, and your analysis. See the “System Write-Up Requirements” section below.

  6. Dataset Submission: Submit any dataset used in your analysis. If the file(s) are large, you may provide a public link (e.g., Google Drive, Dropbox, etc.) where we can download them easily.

  7. System Modularity: Organize your project as a modular system, with logical separation into files/modules/functions. Avoid placing all logic in one monolithic file.

  8. Execution Script: You must provide a command or script that reproduces the main results of your project. This should be as simple as:

    $ ./run_project.sh
    

    or

    $ python3 run_project.py
    

    Failure to provide a working, single-step execution path will result in significant point deductions. If your project does not run easily with a single script call, then your README.md must clearly explain how to execute the various runs of your project. Provide us with the exact command lines to enter.

  9. Use of CUDA Libraries: You may use standard CUDA libraries; however, the main application computation must be written in CUDA kernels authored by you. Projects that only wrap CUDA libraries will not receive full credit.
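
To make requirements 3 and 4 more concrete, here is a minimal, application-agnostic sketch of the standard “reduce” pattern using shared-memory reuse and coalesced global loads. The kernel and variable names are illustrative only; your own kernels should be tailored to your chosen problem.

  // Minimal sketch: block-level sum reduction (the "reduce" pattern).
  // Each block loads a coalesced chunk of the input into shared memory,
  // performs a tree reduction, and writes one partial sum per block.
  // Assumes blockDim.x is a power of two.
  __global__ void block_sum(const float *in, float *block_sums, int n) {
      extern __shared__ float smem[];
      int tid = threadIdx.x;
      int idx = blockIdx.x * blockDim.x + threadIdx.x;

      // Coalesced global load; out-of-range threads contribute zero.
      smem[tid] = (idx < n) ? in[idx] : 0.0f;
      __syncthreads();

      // Tree reduction in shared memory.
      for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
          if (tid < stride) smem[tid] += smem[tid + stride];
          __syncthreads();
      }

      if (tid == 0) block_sums[blockIdx.x] = smem[0];
  }

  // Example launch:
  // block_sum<<<numBlocks, threadsPerBlock,
  //             threadsPerBlock * sizeof(float)>>>(d_in, d_block_sums, n);

The per-block partial sums can then be reduced again (or summed on the host) to obtain the final result; the same structure extends to other reductions such as min, max, or histogram-style accumulations.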

System Write-Up Requirements

Your final report should include the following sections, each answering the questions below:

  1. Problem Description

    • What is your application doing?

    • What real-world problem does it address?

    • Why does it benefit from parallelism?

    • Include any helpful diagrams to explain your system architecture or kernel pipeline.

  2. Implementation Details

    • How is the problem decomposed and parallelized?

    • Which CUDA kernels did you implement? Where in the code are they located?

    • Which GPU patterns or advanced features did you use, and why were they appropriate?

    • Which optimizations did you apply? How did they affect performance?

  3. Profiling and Performance Analysis

    • Use Nsight Compute or nvprof to evaluate your system (example profiler invocations are shown after this list).

    • Discuss occupancy, memory throughput, instruction throughput, and latency.

    • What bottlenecks were identified?

    • What does your profiler output tell you about your system?

  4. Challenges and Reflections

    • What were the most difficult aspects of implementing this project?

    • Which parts were hard to parallelize and why?

    • What did you learn by working on this system?

    • What would you do differently if given more time?

  5. Scalability Discussion

    • Test and comment on how performance scales with data size or number of threads/blocks.

Provide profiling metrics to help justify your answer when applicable!
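
For reference, profiles can be collected from the command line; the executable name proj3 below is just a placeholder for whatever your build produces:

  $ ncu --set full -o proj3_profile ./proj3    # Nsight Compute (writes proj3_profile.ncu-rep)
  $ nvprof ./proj3                             # legacy profiler

Note that nvprof is deprecated and does not support recent GPU architectures; on newer hardware, use Nsight Compute (ncu).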

Example Project Ideas

The following project examples are intended to spark inspiration. Each one is feasible to complete within approximately one week and aligns with the required features for this assignment.

You are not required to choose from these two. Feel free to implement something original!

CUDA-Accelerated K-Means Clustering

In this project, you will implement the K-Means clustering algorithm using CUDA to accelerate the iterative assignment and update phases. K-Means is a classic unsupervised learning algorithm that partitions a dataset into K distinct clusters by minimizing intra-cluster variance. The algorithm is computationally intensive for large datasets, making it an excellent candidate for GPU parallelization.
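
As a rough illustration only (the data layout, names, and dimensions below are assumptions, not a required design), the assignment phase might map to a kernel like this:

  #include <cfloat>

  // Sketch of the K-Means assignment step: one thread per data point computes
  // the squared distance to every centroid and records the nearest one.
  // Assumes `points` is n x d and `centroids` is k x d, both row-major.
  __global__ void assign_clusters(const float *points, const float *centroids,
                                  int *labels, int n, int d, int k) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;

      float best_dist = FLT_MAX;
      int best_c = 0;
      for (int c = 0; c < k; ++c) {
          float dist = 0.0f;
          for (int j = 0; j < d; ++j) {
              float diff = points[i * d + j] - centroids[c * d + j];
              dist += diff * diff;
          }
          if (dist < best_dist) { best_dist = dist; best_c = c; }
      }
      labels[i] = best_c;
  }

The update phase (recomputing each centroid as the mean of its assigned points) is a natural fit for the reduce/scatter patterns, and caching the centroids in shared memory is a common optimization worth evaluating.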

or

k-Nearest Neighbors (k-NN) on the GPU

Implement the k-Nearest Neighbors algorithm using CUDA to accelerate classification or regression tasks on large datasets. The goal is to compute distances from a query point (or multiple queries) to all points in a training dataset and identify the k closest neighbors efficiently.
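
One possible decomposition (again, a sketch under assumed names and row-major layout, not a prescribed design) computes the full query-by-training distance matrix with a 2D grid:

  // Sketch of the k-NN distance step: thread (t, q) computes the squared
  // Euclidean distance between query q and training point t.
  // Assumes `queries` is m x d, `train` is n x d, and `dist` is m x n.
  __global__ void pairwise_dist(const float *queries, const float *train,
                                float *dist, int m, int n, int d) {
      int t = blockIdx.x * blockDim.x + threadIdx.x;   // training point index
      int q = blockIdx.y * blockDim.y + threadIdx.y;   // query index
      if (q >= m || t >= n) return;

      float sum = 0.0f;
      for (int j = 0; j < d; ++j) {
          float diff = queries[q * d + j] - train[t * d + j];
          sum += diff * diff;
      }
      dist[q * n + t] = sum;
  }

Selecting the k smallest distances per query row could then be done with a library routine (e.g., a per-row thrust::sort_by_key) or a custom selection kernel; tiling the training points through shared memory is an optimization worth considering here.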

If you’re unsure whether your proposed idea meets the scope of the assignment, please reach out to me on Ed for feedback or approval.

Submission Instructions

You must submit the following:

  • Source code and all necessary scripts to build and run your project.

  • Your final system write-up as a .pdf or .md file.

  • Any datasets or external links to large datasets.

  • A short README file explaining how to compile and run your project.

Grading Criteria

  • 50% Completeness. Your code should implement the required features without deadlocks or race conditions.

  • 20% Performance. Does your code scale? Did you avoid unnecessary data copies, remove obvious performance bottlenecks, and apply appropriate optimizations?

  • 20% Write-up. Is the report detailed and reasonably well written, and does it contain all the parts we asked for?

  • 10% Design and Style.

Submission

Before submitting, make sure you’ve added, committed, and pushed all your code to GitHub (a typical command sequence is sketched after the list below). You must submit your final work through Gradescope (linked from our Canvas site) on the “Project #3” assignment page in one of two ways:

  1. Uploading from GitHub directly (recommended): You can link your GitHub account to your Gradescope account and upload the correct repository for the assignment. When you submit your homework, a pop-up window will appear. Click on “GitHub” and then “Connect to GitHub” to connect your GitHub account to Gradescope. Once you connect (you will only need to do this once), you can select the repository you wish to upload and the branch (which should always be “main” or “master”) for this course.

  2. Uploading via a Zip file: You can also upload a zip file of the homework directory. Please make sure you upload the entire directory and keep the initial structure the same as the starter code; otherwise, you run the risk of not passing the automated tests.
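
A typical sequence for pushing your final work before submitting (the commit message is just an example) is:

  $ git add -A
  $ git commit -m "Project #3 final submission"
  $ git push origin main    # or master, depending on your repository’s default branch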

As a reminder, for this assignment there will be no autograder on Gradescope. We will run your program on the CS Peanut cluster and manually enter your grade into Gradescope. However, you must still submit your final commit to Gradescope.