When looking at multiple variables in a dataset, such as the prices of stocks or the number of crimes in a given area, it can be illuminating to compute the correlation between every possible pair of variables. When this data is provided in a time series (where the value of the variables changes over time), the samples for the correlation are taken from an interval of time.
However, this can result in a deluge of numbers that can be difficult to make sense of. A useful way of visualizing this information is as a heat map, where positively correlated variables are "hot" and negatively correlated variables are "cold". For example, this heat map shows the correlations between 49 stocks, sampling the five days following the downgrade of the US's credit rating in early August 2011:
As you can see, all stocks are positively correlated with each other, except gold, which is negatively correlated with every other stock.
In this assignment, we will give you two datasets (historical stock data and crime data from the City of Chicago), and you will compute the correlation between various variables during an interval of time. We provide a visualizer that will generate a heat map like the one above, assuming you correctly implement the functions that compute the correlation between variables.
With N sample measurements, the sample covariance of two variables a and b is defined as follows:

\[ \mathrm{cov}(a, b) = \frac{1}{N - 1} \sum_{i=1}^{N} (a_i - \bar{a})(b_i - \bar{b}) \]

where \(a_i\) and \(b_i\) are the individual measurements, and \(\bar{a}\) and \(\bar{b}\) are the means of the measurements.

The sample standard deviation is defined as follows:

\[ s_a = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (a_i - \bar{a})^2} \]

The correlation r between a and b is then the covariance normalized by the standard deviations:

\[ r = \frac{\mathrm{cov}(a, b)}{s_a \, s_b} \]
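As a concrete illustration, here is a minimal Python sketch of these three formulas (the function names are ours, purely illustrative):

import math

def sample_covariance(a, b):
    # cov(a, b) = sum((a_i - mean_a) * (b_i - mean_b)) / (N - 1)
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def sample_std(a):
    # s_a is the square root of the covariance of a with itself
    return math.sqrt(sample_covariance(a, a))

def correlation(a, b):
    # r = cov(a, b) / (s_a * s_b)
    return sample_covariance(a, b) / (sample_std(a) * sample_std(b))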
Computing a time series correlation
To compute correlations in time series data, the measurements we include in the sample are determined by a time interval. We will specify this interval using two values: the first day in the interval, and the number of days in the interval (including that first day). In one of the datasets we provide, there is already a single data point per day; the other contains multiple data points per day, which you will need to aggregate into a single value per day.
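In other words, if a variable's daily values are stored in a list indexed by day number, the sample for an interval is simply a slice of that list. For example (with hypothetical values):

values = [0.5, 1.2, -0.3, 0.8, 1.1, 0.2]   # one value per day; index 0 is day 0
interval_start, interval_length = 1, 3
sample = values[interval_start : interval_start + interval_length]   # [1.2, -0.3, 0.8]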
Generating a correlation matrix
The correlation matrix of M variables (x_1, x_2, ..., x_M) is an M x M matrix, where the entry at position (i, j) is defined as follows:

\[ c_{ij} = r(x_i, x_j) = \frac{\mathrm{cov}(x_i, x_j)}{s_{x_i} \, s_{x_j}} \]

that is, the correlation between variables x_i and x_j.
Dataset 1: Historical stock data
This dataset is a directory with one CSV file per stock. The name of each file is <symbol>.csv, where <symbol> is the stock's ticker symbol. So, for example, the file with the data for Google is GOOG.csv. Each file looks like this:
Date,Open,High,Low,Close,Volume,Adj Close
2010-11-09,17.22,17.60,16.86,16.97,56218900,16.97
2010-11-10,17.00,17.01,16.75,16.94,17012600,16.94
2010-11-11,16.63,16.86,16.52,16.80,15310600,16.80
2010-11-12,16.65,16.75,16.40,16.55,17703400,16.55
2010-11-15,16.56,16.89,16.33,16.60,18934600,16.60
2010-11-16,16.45,16.49,16.10,16.24,23484100,16.24
2010-11-17,16.21,16.33,16.11,16.15,10305800,16.15
2010-11-18,16.40,17.17,16.29,16.99,46500100,16.99
2010-11-19,16.97,16.97,16.52,16.57,24036200,16.57
For each date, the following information is provided:

- Open: the stock's opening price on that date
- High: the highest price reached on that date
- Low: the lowest price reached on that date
- Close: the closing price on that date
- Volume: the number of shares traded on that date
- Adj Close: the closing price, adjusted for dividends and stock splits
Additionally, the directory also contains a file called variables.txt with the list of symbols (one per line) that should be included in the correlation matrix, in the same order as they should appear in the matrix. This file allows you to control which stocks are included, and in what order, without having to remove or rearrange the stock files themselves.
You may assume that, for a given directory with stock data files, all the files contain the same number of entries and for the same dates. In fact, you are not expected to work with the dates directly; instead, you can simply consider the first entry in the file to be "day 0", the second entry to be "day 1", etc. For the purposes of this assignment, you are dealing with a sequence of T days. You should not need to use Python's datetime data type for this assignment.
To compute the correlation between stocks, you will need to compute a new piece of data based on the data contained in the dataset. More specifically, we will want to see the correlation between the returns of each stock. If \(P_i\) is the price of a stock on day i, the raw return of the stock on day i is simply:

\[ R_i = \frac{P_i - P_{i-1}}{P_{i-1}} \]
However, we will use the log return instead:

\[ r_i = \ln\left(\frac{P_i}{P_{i-1}}\right) \]
Notice that, when reading in this dataset, you should discard the entry for day 0, since there is no previous day from which to compute its log return (its price is still needed, however, to compute the log return for day 1).
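For instance, assuming the adjusted closing prices have been read into a list with one entry per day (prices below is a hypothetical name, and we use the Adj Close column as the price), the log returns could be computed along these lines:

import math

# Hypothetical Adj Close values, starting at day 0
prices = [16.97, 16.94, 16.80, 16.55]

# Day 0 has no log return; day i's return uses the prices on days i-1 and i
log_returns = [math.log(prices[i] / prices[i - 1]) for i in range(1, len(prices))]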
Dataset 2: City of Chicago data
This dataset contains information on several types of events (311 calls, reported crimes, etc.). Like the stock dataset, this dataset is a directory with multiple CSV files (in this case, one file per type of event). The city data is quite large, so we have elided many of the fields. For example, here are the first few lines from the file ASSAULT.csv:
CREATION DATE, LATITUDE, LONGITUDE
01/01/2011,41.69086778069725,-87.62431771445297
01/01/2011,41.815175023517334,-87.66622787195554
01/01/2011,41.675309884758434,-87.62873824037594
01/01/2011,41.97422126077659,-87.65147882822946
01/01/2011,41.764133641959404,-87.68826950986085
Each line represents a single event, and the following information is provided:

- CREATION DATE: the date on which the report or request was created
- LATITUDE and LONGITUDE: the geographic coordinates of the event
The lines are sorted by date, with the earliest date first, and there is at least one report/request for each day.
To compute the correlation between events, you will need to compute a new piece of data based on the data contained in the dataset. More specifically, we will want to see the correlation between the number of events of each kind per day.
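Given the guarantees above (the lines are sorted by date, and every day has at least one event), one possible way to aggregate the events into per-day counts, without parsing the dates, is to count runs of consecutive identical dates. A minimal sketch (counts_per_day is our own, illustrative name):

def counts_per_day(dates):
    # dates: the CREATION DATE column in file order,
    # e.g. ["01/01/2011", "01/01/2011", "01/02/2011", ...]
    counts = []
    current_date, current_count = None, 0
    for date in dates:
        if date == current_date:
            current_count += 1
        else:
            if current_date is not None:
                counts.append(current_count)
            current_date, current_count = date, 1
    if current_date is not None:
        counts.append(current_count)
    return counts   # counts[i] is the number of events on day i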
Like the stock dataset, the directory contains a variables.txt file with the list of event types (one per line) that should be included in the correlation matrix, in the same order as they should appear in the matrix. You should also regard the dataset as specifying a sequence of T days: in this dataset, 01/01/2011 is day 0, 01/02/2011 is day 1, etc. You do not need to use Python's datetime data type for this assignment.
In this assignment, we are providing you with code that will produce a correlation matrix visualization like the one shown at the beginning of this document. This code assumes that you have implemented a series of functions that read in the data and compute the correlation matrix. Your task will be to implement those functions.
def read_data_from_dir(path, variables)
This function reads in a dataset (either the stocks dataset or the city dataset, as described above). It takes two parameters:

- path: the path to the directory containing the dataset
- variables: the list of variable names (as listed in variables.txt) whose data should be read
The return type is up to you. However, take into account that whatever you return from this function is what our code will pass to the two functions described below.
Since reading in a dataset involves reading multiple CSV files, we recommend you add an auxiliary function that reads in a single CSV file.
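For instance, a helper for the stock dataset might read a single file and return that stock's list of log returns (read_stock_csv is our own, hypothetical name, and we assume here that the Adj Close column is the price used for returns):

import csv
import math

def read_stock_csv(filename):
    # Read one stock file and return its log returns, one per day
    # (day 0 is discarded, as described above)
    with open(filename) as f:
        prices = [float(row["Adj Close"]) for row in csv.DictReader(f)]
    return [math.log(prices[i] / prices[i - 1]) for i in range(1, len(prices))]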
def get_number_of_days(dataset)
This function takes in the dataset, as returned by your implementation of read_data_from_dir, and returns the number of days T in the dataset.
def compute_correlation_matrix(dataset, variables, interval_start, interval_length)
This function computes the correlation matrix between several variables during a given time period. It takes four parameters:

- dataset: the dataset, as returned by read_data_from_dir
- variables: the list of variable names to include in the matrix
- interval_start: the first day of the interval
- interval_length: the number of days in the interval (including the first day)
This function returns the correlation matrix as a list of lists. More specifically, if the number of variables in the variables list is N, then this function must return a list of size N, where each element is a list of N floats (i.e., essentially a 2D array represented as lists). The (i,j) position is the correlation between the variable named variables[i] and the variable named variables[j].
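For illustration, if your dataset representation were a dictionary mapping each variable name to its list of daily values (one possible choice; the representation is up to you), the function could be structured as a nested traversal of the variables:

def compute_correlation_matrix(dataset, variables, interval_start, interval_length):
    # Entry (i, j) is the correlation between variables[i] and variables[j]
    return [[compute_correlation(dataset[var_i], dataset[var_j],
                                 interval_start, interval_length)
             for var_j in variables]
            for var_i in variables]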
def compute_correlation(datalist1, datalist2, interval_start, interval_length)
This is an auxiliary function that you must call from compute_correlation_matrix. It computes the correlation between two variables during a given time period. It takes four parameters:

- datalist1: the list of daily values for the first variable
- datalist2: the list of daily values for the second variable
- interval_start: the first day of the interval
- interval_length: the number of days in the interval (including the first day)
The function must return a floating point number with the correlation (r, as described earlier) between the specified variables during the specified interval.
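Putting the earlier formulas together, a minimal, self-contained sketch of this function might look as follows (one reasonable approach, not the only one):

import math

def compute_correlation(datalist1, datalist2, interval_start, interval_length):
    # Restrict both series to the requested interval of days
    a = datalist1[interval_start : interval_start + interval_length]
    b = datalist2[interval_start : interval_start + interval_length]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample covariance and sample standard deviations (N - 1 denominator)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
    s_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    s_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    # r = cov(a, b) / (s_a * s_b)
    return cov / (s_a * s_b)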
Do not modify any of the files we provide. Your work should be confined to either the stocks.py file or the city.py file. We have provided barebones versions of these two files, containing only the function declarations.
Once you implement stocks.py or city.py, you can run the visualizer like this:
python visualizer.py (stock|city) <data-directory>
Where the first parameter must be either stock or city, depending on the type of data you're working with, and <data-directory> is a directory containing the data.
Make sure your name and CNET ID are at the top of either stocks.py or city.py and then check your code into PhoenixForge.
We strongly recommend that you check your code into PhoenixForge at regular intervals!