Homework #1

Due: Friday, April 4th at 11:59pm

This homework is intended to server as an introduction to CUDA programming model. It also allows for you to understand how to submit and run your CUDA programs on clusters using SLURM.

Getting started

The following two sections explain the process of setting up your homework workspace for the course and how to retrieve your private Github repository.

Peanut Cluster and GPU Partitions

If you are not familiar with submitting jobs to a cluster (i.e., the Peanut cluster) please make sure to watch the following video and read over the SLURM documentation provided by our tech-staff before beginning the assignment:

We will grade all assignments using the Peanut cluster’s GPUs and all programming assignments must work correctly on one of these machines. If you have a NVIDIA GPU installed on your local machine then you can work; however, your program must still work on one of the cluster’s GPU nodes, no exceptions.

You have the option batching your programs to the following cluster nodes

  • PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST

  • gpu-all      up    4:00:00      2   idle gpu[2-3]

  • titan        up    4:00:00      1   idle gpu3

  • pascal       up    4:00:00      1   idle gpu2

  • quadro       up    4:00:00      1   idle gpu1

Please be cautious of using titan since it’s been known to have stalling problems so we advise not using it until it’s been reliably fixed.

Creating Your Private Repository

For each assignment, a Git repository will be created for you on GitHub. However, before that repository can be created for you, you need to have a GitHub account. If you do not yet have one, you can get an account here: https://github.com/join.

To actually get your private repository, you will need this invitation URL:

  • HW1 invitation (Please check the Post “HW 1 is ready” Ed)

When you click on an invitation URL, you will have to complete the following steps:

  1. You will need to select your CNetID from a list. This will allow us to know what student is associated with each GitHub account. This step is only done for the very first invitation you accept.

Note

If you are on the waiting list for this course you will not have a repository made for you until you are admitted into the course. I will post the starter code on Ed so you can work on the assignment until you are admitted into the course.

  1. You must click “Accept this assignment” or your repository will not actually be created.

  2. After accepting the assignment, Github will take a few minutes to create your repository. You should receive an email from Github when your repository is ready. Normally, it’s ready within seconds and you can just refresh the page.

  3. You now need to clone your repository (i.e., download it to your machine).
    • Make sure you’ve set up SSH access on your GitHub account.

    • For each repository, you will need to get the SSH URL of the repository. To get this URL, log into GitHub and navigate to your project repository (take into account that you will have a different repository per project). Then, click on the green “Code” button, and make sure the “SSH” tab is selected. Your repository URL should look something like this: git@github.com:mpcs52072-sum24/hw1-GITHUB-USERNAME.git.

    • If you do not know how to use git clone to clone your repository then follow this guide that Github provides: Cloning a Repository

If you run into any issues, or need us to make any manual adjustments to your registration, please let us know via Ed Discussion.

Problem 1: Estimating Pi using the Leibniz formula

Inside the p1 directory, open the file called pi.cu and write a CUDA program that uses an infinite series method to estimate pi. One of the most widely known infinite series that can be used for this is the Leibniz formula, as described here:

Your program should use this formula, and must use the following usage and have the these required command-line arguments:

Usage: pi terms num_blocks
terms = the number of iterations (i.e., intervals) to perform
num_blocks = the number of blocks to launch

The program should read in the terms and num_blocks argument argument and only print the estimate of pi. The program will run in two phases:

Phase 1: Calculate local estimates

Define a kernel named compute_estimates that launches with a grid dimension equal to num_blocks and a block dimension of one. The single thread in each block computes its own local estimate of pi and stores the result in an output array named estimates. The single thread in each block must perform an equal amount of terms computations. This means for each block you will need to compute its start (inclusive) and end (exclusive) indices based on how much work each block must compute (i.e., terms\num_blocks ). For example, if terms=1000 and num_blocks=4 then the chart shows the breakdown for each block’s start and end interval.

Block ID

Start (Inclusive)

End (Exclusive)

Number of iterations

0

0

250

250

1

250

500

250

2

500

750

250

3

750

1000

250

If terms/num_blocks does not divide evenly then have one block perform the necessary additional iterations to equal terms.

You have full control over the arguments passed to compute_estimates, how memory is allocated, and how the interval is divided among the blocks. Please keep in mind that we are not looking at performance in this first homework assignment, so you do not need to come up with the most optimal solution.

Phase 2: Aggregation

Each block will produce a partial local-sum of pi. After the compute_estimates completes, the host-side adds up the partial sums and outputs the final computation for pi.

Requirements:

  1. All local-sum computations must be done on the GPU. The host-side code only reads in the data, prepares it to be transferred to the GPU, launches kernels, and finally aggregates the local-sums to output the final estimate of pi.

  2. Please make sure you are verifying your CUDA calls by checking the error codes as I described in the videos.

Assumptions: You can assume that terms and num_blocks will always be given as integers and be provided. Thus, you do not need to perform error checking on those arguments. You also will not need to print out the usage statement for this problem. The usage is given here for clarification purposes.

Problem 2: 2D Translations

2D translation is a process of moving an object from one position to another in a two dimensional plane. In this problem, you will write a program that performs 2D translation on a set of points using 3D homogeneous transformation matrices.

What is 2D Translation with Homogeneous Coordinates?

In computer graphics, homogeneous coordinates allow for the representation of translations, rotations, and other transformations using matrix multiplication. A 2D point \((x,y)\) can be represented in homogeneous coordinates as \((x,y,1)\).

A 2D translation can be represented as a \(3x3\) matrix:

\[\begin{split}\begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}\end{split}\]

When this matrix multiplies the homogeneous coordinate \((x,y,1)\), the result is a translated point \((x',y',1)\):

\[\begin{split}\begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} x + t_x \\ y + t_y \\ 1 \end{bmatrix}\end{split}\]

For example, consider we had the points \((1, 1)\), \((2,3)\), and \((3, 1)\). If we apply a translation vector \((2, 1)\) to this shape using the homogeneous transformation matrix, each point of the triangle will move 2 units to the right and 1 unit up. The new positions of the vertices will be \((3, 2)\), \((4, 4)\), and \((5, 2)\), respectively.

Make sure you understand how to perform matrix and vector multiplication. You may want to do a google search as refresher on that formula.

Task

Inside the p2/translate.cu directory, implement a CUDA program that performs 2D translation on a set of points using 3D homogeneous transformation matrices. The program will take in three command line arguments: points_file, matrix_file, and iterations with the program having the following usage statement:

Usage: translate points_file matrix_file iterations
   points_file - the filepath to the file containing all the points
   matrix_file - the filepath to the file containing the 3D matrix
   iterations  - an integer representing how many times to translate the points

The points_file file has the following format

  • The first line represents the total number of points in the file.

  • All remaining lines represent a 2D point where each integer coordinate is separated by a single whitespace.

Inside the p2 directory, there is a file named points-001.in that provides an example of this format:

3
1 1
2 3
3 1

The matrix_file contains a single \(3x3\) matrix, where each line represents a single row of integers, where the columns are separated by a single whitespace. For example, inside p2/matrix-001.in provides an example of this format:

1 0 2
0 1 1
0 0 1

Finally, iterations represents how many times to translate a point by the provided matrix. For example, if iterations=3 then given point \((1, 1)\) and a translation vector \((2, 1)\) then the following represents the final location for the given point:

  1. \((1, 1)\) + \((2, 1)\) = \((3, 2)\)

  2. \((3, 2)\) + \((2, 1)\) = \((5, 3)\)

  3. \((5, 3)\) + \((2, 1)\) = \((7, 4)\)

The program outputs the final translated location for each point after applying the matrix to it iterations number of times. The order of the output must be the same order of how the points appeared in the original points_file. Each point in the output is on a single line and each coordinate is separated by a single whitespace.

Requirements

  1. All computation must be done on the GPU. The host-side code only reads in the data, prepares it to be transferred to the GPU and finally prints out the final locations of the points.

  2. There can only be a single kernel launch in this program unlike in problem 1.

  3. How you format the data in memory and determine the grid and block dimensions is up to you.

  4. You must write the CUDA code for performing matrix and vector multiplication. You can not just retrieve the tx and ty values directly from the matrix and add it to the vector coordinates.

  5. Please make sure you are verifying your CUDA calls by checking the error codes as I described in the videos.

To help think about what to make your block and grid sizes for problem 2. You may want to query the device for the maximum number of blocks and threads per block in CUDA.

int device;
cudaDeviceProp prop;

// Get the current CUDA device
cudaGetDevice(&device);

// Get device properties
cudaGetDeviceProperties(&prop, device);

// Query max threads per block
int maxThreadsPerBlock = prop.maxThreadsPerBlock;

// Query max blocks per dimension (grid size)
int maxGridSizeX = prop.maxGridSize[0];
int maxGridSizeY = prop.maxGridSize[1];
int maxGridSizeZ = prop.maxGridSize[2];

Assumptions: We will always provide translate with the correct number of arguments and valid input. The file arguments points_file and matrix_file will always be formatted as specified in the above description and iterations will always be a positive integer.

Testing

Professor Samuels, will provide a few more test-cases at the beginning of the second week. I first want you to test your own code and then will provide some additional tests later on to help you further debug.

Grading

Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our Assignment Rubric page.)

The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:

  • Problem 1: 50%

  • Problem 2: 50%

Submission

Before submitting, make sure you’ve added, committed, and pushed all your code to GitHub. You must submit your final work through Gradescope (linked from our Canvas site) in the “Homework #1” assignment page via two ways,

  1. Uploading from Github directly (recommended way): You can link your Github account to your Gradescope account and upload the correct repository based on the homework assignment. When you submit your homework, a pop window will appear. Click on “Github” and then “Connect to Github” to connect your Github account to Gradescope. Once you connect (you will only need to do this once), then you can select the repository you wish to upload and the branch (which should always be “main” or “master”) for this course.

  2. Uploading via a Zip file: You can also upload a zip file of the homework directory. Please make sure you upload the entire directory and keep the initial structure the same as the starter code; otherwise, you run the risk of not passing the automated tests.

Note

For either option, you must upload the entire directory structure; otherwise, your automated test grade will not run correctly and you will be penalized if we have to manually run the tests. Going with the first option will do this automatically for you. You can always add additional directories and files (and even files/directories inside the stater directories) but the default directory/file structure must not change.

Depending on the assignment, once you submit your work, an “autograder” will run. This autograder should produce the same test results as when you run the code yourself; if it doesn’t, please let us know so we can look into it. A few other notes:

  • You are allowed to make as many submissions as you want before the deadline.

  • Please make sure you have read and understood our Late Submission Policy.

  • Your completeness score is determined solely based on the automated tests, but we may adjust your score if you attempt to pass tests by rote (e.g., by writing code that hard-codes the expected output for each possible test input).

  • Gradescope will report the test score it obtains when running your code. If there is a discrepancy between the score you get when running our grader script, and the score reported by Gradescope, please let us know so we can take a look at it.