Homework #3 (Short): Due: Friday August 9th, 2013 @ 5:00pm
In this assignment you will write a series of functions to handle basic statistical manipulations of one-dimensional arrays, namely the mean, variance, standard deviation, z-scores, and outliers.
This assignment is intended to teach you how to use
- Arrays
- Loops
- Functions
- Basic Testing
The context is that you are given a large quantity of data as an array of numbers (floats). From this data you will compute an average (mean), spread (standard deviation), z-scores, and find values that are oddly large or small (outliers). You will do this by building a series of functions that make use of arrays and for loops. These operations are very common in data analysis.
In this assignment you’ll need to include the header files stdlib.h and math.h at the top of your .c file.
All your functions from this homework assignment should be implemented in a file called hw3.c, which should be placed in a hw3 directory in your repository.
Problem 1: Mean
Given an array of numbers float data[] and the length of the array int len, compute the average of the array. This can be done by summing all the elements in the array and then dividing by the number of elements. The formula for the mean mu of an array x of length n is as follows:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Write a function which takes in an array of floats and the length of the array and returns the mean.
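The summing-and-dividing procedure described above can be sketched as follows. This is only one possible shape, not a required one: the name hw3_mean and the choice to return 0 for an empty array are assumptions, since the assignment does not specify either.

```c
/* Hypothetical name and signature; the assignment does not fix them. */
float hw3_mean(float data[], int len)
{
    float sum = 0.0f;
    int i;

    /* Assumption: an empty array has no mean, so report 0. */
    if (len <= 0) {
        return 0.0f;
    }
    /* Add up every element, then divide by the element count. */
    for (i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum / len;
}
```

Called on the twelve test scores from Problem 3, a function like this should return a value close to 67.166667. Because floats are inexact, a test is better written as a range check than as an exact comparison.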
Problem 2: Standard Deviation (Include Test Cases in the main function)
The sample standard deviation is (roughly) the expected difference between the numbers in the list and the mean. If the numbers are all very close to the mean then it will be low. If there is a large spread in the numbers then the sample standard deviation will be high. For example the following two lists have the same mean (2) but very different sample standard deviations:
- The list [1, 2, 0, 2, 3, 1, 3, 4] has a sample standard deviation of about 1.3. The numbers tend to deviate from the mean (here equal to 2) by roughly 1.
- The list [2, 2, 2, 2, 2, 2, 2, 2] has sample standard deviation 0 (there is no deviation from the mean).
The formula for the sample standard deviation sigma of an array x of length n with mean mu is as follows:
$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2}$$
Write a function which takes an array and its length and returns the standard deviation.
Problem 3: Z-Scores (Include Test Cases in the main function)
If these are new to you I’ll explain them below. You should also consider the explanation of z-scores on Wikipedia.
We want to normalize an array so that it has mean zero and standard deviation one. Consider the grades on a test for a small class: [64 98 66 60 67 64 70 73 59 75 35 75 ]
This dataset has mean 67.166667 and standard deviation 14.427457
I would like to scale and recenter this dataset so that the mean is zero and the standard deviation is 1 but preserve the relative distance between all the numbers. The resulting test scores would be z-scores. In this example they are as follows
Z Scores: [-0.219489 2.137129 -0.080864 -0.496738 -0.011552 -0.219489 0.196385 0.404322 -0.566050 0.542946 -2.229545 0.542946 ]
The first score, 64, has z-score -0.21. This means that it is 0.21 standard deviations below the mean. We can verify this by checking that the mean is about 67, that 64 - 67 is -3, and that -3 is a little more than a negative fifth (-0.21) of a standard deviation (14.4).
Z-scores give us a good way to quickly judge how good a score is without thinking about the average or spread of the data.
We computed the z-score of 64 by subtracting the mean from the value and then dividing by the standard deviation.
The z-score of a data point is the number of standard deviations it is above the mean. For example if the mean is 10 and the standard deviation is 2 then the value 11 has z-score .5 and 6 has a z-score of -2. Z-scores are a convenient way to normalize data and quickly see which values are above average, below average, and by how much.
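The rule stated in words above can be written as a formula and checked against the worked values in the text. The symbol z is introduced here only for illustration; mu and sigma are the mean and sample standard deviation as before.

```latex
z = \frac{x - \mu}{\sigma}, \qquad
\frac{11 - 10}{2} = 0.5, \qquad
\frac{6 - 10}{2} = -2, \qquad
\frac{64 - 67.166667}{14.427457} \approx -0.219
```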
Write a function which takes an array of floats (representing the data), an array of floats that will store the z-scores, and the length of the arrays. The function computes the z-score of each data point and stores it at the same index in the z-scores array.
Problem 4: Outliers (Include Test Cases in the main function)
Outliers are values in our dataset which seem very high or very low given the mean and standard deviation. If we already know the Z-scores of our data then it is easy to define outliers as having a z-score with absolute value greater than some constant. For this exercise, we say that any data point with z-score greater than 2 or less than -2 is an outlier.
Write two functions:
- One that returns the number of outliers a dataset contains
- One that prints out the outliers to the user
Were there any outliers in the class test scores example?
[64 98 66 60 67 64 70 73 59 75 35 75 ]
Z Scores: [-0.219489 2.137129 -0.080864 -0.496738 -0.011552 -0.219489 0.196385 0.404322 -0.566050 0.542946 -2.229545 0.542946 ]
Yes! There were two outliers (result from function 1). They had the values [98.0 35.0] (result from function 2). This will be printed to screen. You do not need to save the outliers in an array.
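The two outlier functions might be sketched as follows, assuming the z-scores have already been computed and are passed in alongside the data. The names hw3_count_outliers, hw3_print_outliers, is_outlier, and OUTLIER_Z are all hypothetical, and the bracketed print format simply mirrors the example above.

```c
#include <stdio.h>

/* Threshold from the assignment: z > 2 or z < -2 marks an outlier. */
#define OUTLIER_Z 2.0f

/* Hypothetical helper: nonzero when z lies strictly outside [-2, 2]. */
static int is_outlier(float z)
{
    return z > OUTLIER_Z || z < -OUTLIER_Z;
}

/* Count how many entries of zscores are outliers. */
int hw3_count_outliers(float zscores[], int len)
{
    int count = 0;
    int i;

    for (i = 0; i < len; i++) {
        if (is_outlier(zscores[i])) {
            count++;
        }
    }
    return count;
}

/* Print the data values whose z-scores are outliers, e.g. [98.0 35.0]. */
void hw3_print_outliers(float data[], float zscores[], int len)
{
    int first = 1;
    int i;

    printf("[");
    for (i = 0; i < len; i++) {
        if (is_outlier(zscores[i])) {
            if (!first) {
                printf(" ");
            }
            printf("%.1f", data[i]);
            first = 0;
        }
    }
    printf("]\n");
}
```

With the test scores and z-scores listed above, the counting function returns 2 and the printing function writes [98.0 35.0]. A z-score of exactly 2 or -2 is not counted, because the comparison is strict.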
Style
At the top of your C file, write a comment with your name, etc., in the following form:
/* Jane Doe, jdoe */
/* CS152, Summer 2013 */
/* Homework 3 */
This information is not strictly necessary, since your files are already identified by their names and the repository they reside in. Nevertheless, the redundancy is a helpful convenience for us when we are browsing and/or grading your work.
Comments, where they occur, should be helpful and precise. Omit superfluous comments:
int a = b + c; /* I'm adding b and c and naming it a! */
Yes, we can see that.
Your code should be no more than 80 columns wide.
Do not write more than one statement on a line.
Do not submit code that does not compile. If a function is broken, and makes compilation impossible, comment out that function and submit the rest. Non-compiling code will receive little to no credit.
Submitting Your Work
Save and commit your code in YOUR-REPOSITORY/hw3/hw3.c. Recall that you will need to add your work before you commit it. (Also, notice that in the -m message you include at commit time, -m is simply a command-line option.)
Commit your work early and often. We do not grade intermediate commits, only the work as it stands at the deadline. If you have any issues with subversion, not only your instructors but your classmates can help you. Most of the students in this class have at least one full quarter of experience running subversion.
If, for any reason, perhaps due to a late add, you do not have a CS repository, save your work somewhere you can easily get it and send mail to Adam. We'll get you set up with a repository in time to make the deadline.