Homework #2¶
Due: Thursday, April 4th at 11:59pm
This homework is intended to serve as an introduction to using atomic operations and importance of understanding the architecture of the parallel hardware a parallel application is running on.
Getting started¶
For each assignment, a Git repository will be created for you on GitHub. However, before that repository can be created for you, you need to have a GitHub account. If you do not yet have one, you can get an account here: https://github.com/join.
To actually get your private repository, you will need this invitation URL:
Homework #2 invitation (Please check the Post “Homework #2 is ready” Ed)
When you click on an invitation URL, you will have to complete the following steps:
You will need to select your CNetID from a list. This will allow us to know what student is associated with each GitHub account. This step is only done for the very first invitation you accept.
Note
If you are on the waiting list for this course you will not have a repository made for you until you are admitted into the course. I will post the starter code on Ed so you can work on the assignment until you are admitted into the course.
You must click “Accept this assignment” or your repository will not actually be created.
After accepting the assignment, Github will take a few minutes to create your repository. You should receive an email from Github when your repository is ready. Normally, it’s ready within seconds and you can just refresh the page.
- You now need to clone your repository (i.e., download it to your machine).
Make sure you’ve set up SSH access on your GitHub account.
For each repository, you will need to get the SSH URL of the repository. To get this URL, log into GitHub and navigate to your project repository (take into account that you will have a different repository per project). Then, click on the green “Code” button, and make sure the “SSH” tab is selected. Your repository URL should look something like this: git@github.com:mpcs52060-spr24/hw1-GITHUB-USERNAME.git.
If you do not know how to use
git clone
to clone your repository then follow this guide that Github provides: Cloning a Repository
If you run into any issues, or need us to make any manual adjustments to your registration, please let us know via Ed Discussion.
Programming questions¶
For this assignment you are only allowed to use the following Go concurrent constructs:
go
statementsync/atomic
package. You may use any of the atomic operations.sync.WaitGroup
and its associated methods.
As the course progresses, you will be able to use other constructs provided by Go such as channels, conditional variables, mutexes, etc. I want you to learn the basics and build up from there. This way you’ll have a good picture of how parallel programs and their constructs are built from each other. If you are unsure about whether you are able to use a language feature then please ask on Ed before using it.
Running the problems¶
You cannot run your solutions on the CS Linux Servers for this assignment. You must run either:
Execute your code locally, run the grader, and then verify that results on Gradescope.
Go physically into Computer Science Instructional Laboratory (CSIL) and work on your assignment on a physical machine.
We will discuss in the third week a method for working on the CS Linux Servers for executing parallel programs. Please let me know if you have any questions.
Static task decomposition¶
For problems 1 and 2, you will be implementing your first parallel programs in the course. Each goroutine will need to perform a certain amount of work to complete. But how should the work be distributed to each goroutine? For this assignment, we use the task distribution technique we saw during class, which equally distributes the amount of work to each goroutine.
Problem 1: Estimating Pi using the Leibniz formula¶
Inside the pi directory, open the file called pi.go and write a concurrent Go program that uses an infinite series method to estimate pi. One of the most widely known infinite series that can be used for this is the Leibniz formula, as described here:
Your program should use this formula, and must use the following usage and have the these required command-line arguments:
const usage = "Usage: pi interval threads\n" +
" interval = the number of iterations to perform\n" +
" threads = the number of threads (i.e., goroutines to spawn)\n"
The main goroutine should read in the interval
and threads
argument argument and only print the estimate of pi. The program must use the fork-join pattern described in class and a static task distribution technique. The threshold for running the sequential version is 1000 intervals (even though the threads value is always set); otherwise, you must run the parallel implementation.
Assumptions: You can assume that interval
and threads
will always be given as integers and be provided. Thus, you do not need to perform error checking on those arguments. You also will not need to print out the usage statement for this problem. The usage is given here for clarification purposes.
Sample Runs ($: is just mimicking the command line)
$: go run hw2/pi 100 2
3.1315929036
$: go run hw2/pi 10 1
3.0418396189
$: go run hw2/pi 10 3
3.0418396189
The tests for this problem is inside pi_test.go
. Please go back to homework #1 if you are having trouble running tests.
Problem 2: Waitgroup Implementation¶
Read the following about go interfaces before starting this problem:
Inside the waitgroup
directory, open the file called waitgroup.go
. Consider the following interface and function:
type WaitGroup interface {
Add(amount uint)
Done()
Wait()
}
// NewWaitGroup returns a instance of a waitgroup
// This instance must be a pointer and should not
// be copied after creation.
func NewWaitGroup() WaitGroup {
//IMPLEMENT ME!
}
Provide an implementation of a WaitGroup
that mimics the Go implementation:
The only difference is that our implementation requires uint (rather than just int) to be passed to the Add method. Remember, you can only use atomic operations in this implementation. Don’t overthink this problem. Many of these methods can be implemented in very few lines. Note that there are various types of int (64 or 32 bit, signed or unsigned), and there are special atomic operations for each type. The test does not check which type you use inside your waitgroup implementation.
The tests for this problem is inside waitgroup_test.go
. Please go back to homework #1 if you are having trouble running tests.
Problem 3: Wrangling COVID data¶
One of the tedious tasks data scientists have to perform is data wrangling, which is the process of gathering, and transforming data to get ready for doing data analytics. The raw data may have empty fields, duplicated records, be dispersed among many files, etc. For this problem, we will take a look at zip code COVID data from the Chicago Data Portal:
Unfortunately, for you, data you will be processing is the same data provided by the link above but needs to be wrangled. Make a directory called data
inside the covid
directory, and look at the README.txt
to see how you can download the data. Place the data you download into the covid/data
directory.
DO NOT INCLUDE THE DATA FILES TO YOUR REPO!! The gitignore file thats included in your repositories should already handle this so please make sure to not modify or force the commit of the data files.
Here is additional information about the data:
The data is dispersed between multiple files with the format
covid_NUM.csv
, whereNUM
is an integer. There is a static number of files that is equal to500
.Each file may contain a subset of the original file records. The original file that comes from the link above is named
covid_sample.csv
and is included in the data directory. You can use this file to help with testing your code.The files also have duplicated records between them, which you will need to ignore when performing the main computation.
Each file begins with the original header.
Main task: Your task is to write a concurrent program inside covid/covid.go
that calculates the total number of weekly cases, and tests for a zip code in a given month and year. Please note: the data is only for the years 2020-2021. The output of the program must be a string that of the format: TOTAL_NUM_CASES,TOTAL_NUM_TESTS
, where TOTAL_NUM_CASES
, and TOTAL_NUM_TESTS
are the cumulative counts for each category, which are all separated by a single comma. Nothing else should be printed to the console. For guidance, you should only be looking at the columns:
ZIP Code
Week Start
Cases - Weekly
Tests - Weekly
For verifying if a record is part of the month and year specified, the program just checks the month component of the "Week Start"
column. If it matches the month and year given to the program then count that record as part of the total calculations. The format for the date will always be "MONTH/DAY/YEAR"
. We know some weeks overlap the beginning of other months but to keep this problem simple we will just look at the "Week Start"
column. If a record for any of those specified columns do not contain a value OR does not match the zip code, month, year specified by the user then your program skips that record. To help you know the indices for these columns, you can use the following code:
const zipcodeCol = 0
const weekStart = 2
const casesWeek = 4
const testsWeek = 8
const deathsWeek = 14
Your program must use a static task decomposition by assigning the goroutines a static amount of files to work on in order to calculate these numbers.
The program must have the following usage statement:
const usage = "Usage: covid threads zipcode month year\n" +
" threads = the number of threads (i.e., goroutines to spawn)\n" +
" zipcode = a possible Chicago zipcode\n" +
" month = the month to display for that zipcode \n" +
" year = the year to display for that zipcode \n"
Sample Runs ($: is just mimicking the command line)
$: go run covid.go 0 60603 5 2020
2,48
$: go run covid.go 4 60640 2 2021
182,9961
$; go run covid.go 3 89149 2 2020
0,0
Assumptions: You can assume that threads
, zipcode
, month
and year
will always be provided. threads
, month
and year
are always given as integers. month
will always be a value between [1-12] and year will always be [2020-2021]. Thus, you do not need to perform error checking on those arguments. However, the program could be given an invalid zipcode, which should result in the string "0,0"
.
- Helpful Hints:
Use
fmt.Sprintf
to create a formatted string andstrings.Join
to join together a slice of strings into a single string.Use CSV package csv
State that empty column values are empty strings
Go Maps (e.g.,
map[TYPE]type
) will be helpful here.
The tests for this problem is inside covid_test.go
. Please go back to homework #1 if you are having trouble running tests.
Program performance¶
Do not worry about the performance of your programs for this week. We will reexamine these implementations for the next homework assignment once you a better understanding about parallel performance, which is covered in module 3.
Grading¶
Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our Assignment Rubric page.)
The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:
Note
Your code must be deterministic. This note means that if your code is running and passing the tests on your local machine or on CS remote machine but not on Gradescope this means you have a race condition that must be fixed in order for you to get full completeness points.
Completeness: 80%
Correctness: 10%
Design & Style 10%
Obtaining your test score¶
The completeness part of your score will be determined using automated
tests. To get your score for the automated tests, simply run the
following from the Terminal. (Remember to leave out the
$
prompt when you type the command.)
$ cd grader
$ go run hw2/grader hw2
This should print total score after running all test cases inside the individual problems. This printout will not show the tests you failed. You must run the problem’s individual test file to see the failures.
Design, style and cleaning up¶
Before you submit your final solution, you should, remove
any
Printf
statements that you added for debugging purposes andall in-line comments of the form: “YOUR CODE HERE” and “TODO …”
Think about your function decomposition. No code duplication. This homework assignment is relatively small so this shouldn’t be a major problem but could be in certain problems.
Go does not have a strict style guide. However, use your best judgment from prior programming experience about style. Did you use good variable names? Do you have any lines that are too long, etc.
As you clean up, you should periodically save your file and run your code through the tests to make sure that you have not broken it in the process.
Submission¶
Before submitting, make sure you’ve added, committed, and pushed all your code to GitHub. You must submit your final work through Gradescope (linked from our Canvas site) in the “Homework #2” assignment page via two ways,
Uploading from Github directly (recommended way): You can link your Github account to your Gradescope account and upload the correct repository based on the homework assignment. When you submit your homework, a pop window will appear. Click on “Github” and then “Connect to Github” to connect your Github account to Gradescope. Once you connect (you will only need to do this once), then you can select the repository you wish to upload and the branch (which should always be “main” or “master”) for this course.
Uploading via a Zip file: You can also upload a zip file of the homework directory. Please make sure you upload the entire directory and keep the initial structure the same as the starter code; otherwise, you run the risk of not passing the automated tests.
Note
For either option, you must upload the entire directory structure; otherwise, your automated test grade will not run correctly and you will be penalized if we have to manually run the tests. Going with the first option will do this automatically for you. You can always add additional directories and files (and even files/directories inside the stater directories) but the default directory/file structure must not change.
Depending on the assignment, once you submit your work, an “autograder” will run. This autograder should produce the same test results as when you run the code yourself; if it doesn’t, please let us know so we can look into it. A few other notes:
You are allowed to make as many submissions as you want before the deadline.
Please make sure you have read and understood our Late Submission Policy.
Your completeness score is determined solely based on the automated tests, but we may adjust your score if you attempt to pass tests by rote (e.g., by writing code that hard-codes the expected output for each possible test input).
Gradescope will report the test score it obtains when running your code. If there is a discrepancy between the score you get when running our grader script, and the score reported by Gradescope, please let us know so we can take a look at it.