Computer Science with Applications 1 & 2

Computing Correlations in Time Series Data

Due: Nov 29th at 6pm

When looking at multiple variables in a dataset, such as the prices of stocks or the number of crimes in a given area, it can be illuminating to compute the correlation between every possible pair of variables. When this data is provided as a time series (where the values of the variables change over time), the samples for the correlation are taken from an interval of time.

However, this can result in a deluge of numbers that can be difficult to make sense of. A useful way of visualizing this information is as a heat map, where positively correlated variables are "hot" and negatively correlated variables are "cold". For example, this heat map shows the correlations between 49 stocks, sampling the five days following the downgrade of the US's credit rating in early August 2011:

As you can see, all stocks are positively correlated with each other, except gold, which is negatively correlated with every other stock.

In this assignment, we will give you two datasets (historical stock data and crime data from the City of Chicago), and you will compute the correlation between various variables during an interval of time. We provide a visualizer that will generate a heat map like the one above, assuming you correctly implement the functions that compute the correlation between variables.

Computing the sample correlation

There are various measures of correlation, but we will use a fairly well-known one: the Pearson product-moment correlation coefficient. When computing correlation in a time series, we are usually interested in observing the correlation between two variables during a specific time range, so we use a sample correlation coefficient, where the sample values are the measurements taken during that time range. The sample correlation coefficient of two variables A and B is defined as the sample covariance of the two variables divided by the product of their sample standard deviations:

    corr(A, B) = cov(A, B) / (s_A * s_B)

With N sample measurements, the sample covariance is defined as:

    cov(A, B) = (1 / (N - 1)) * Σ_{i=1}^{N} (a_i - ā)(b_i - b̄)

where a_i and b_i are the individual measurements, and ā and b̄ are the means of the measurements.

The sample standard deviation is defined as:

    s_A = sqrt((1 / (N - 1)) * Σ_{i=1}^{N} (a_i - ā)²)

Note that this is simply the square root of a variable's sample covariance with itself: s_A = sqrt(cov(A, A)).
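
As an illustration, the definitions above can be implemented directly in Python. This is only a sketch: the function names are illustrative, not the signatures required by the assignment, and it assumes the N - 1 (unbiased) form of the sample covariance.

```python
import math

def sample_covariance(a, b):
    """Sample covariance of two equal-length lists of measurements (N - 1 form)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)

def sample_std_dev(a):
    """Sample standard deviation: square root of a variable's covariance with itself."""
    return math.sqrt(sample_covariance(a, a))

def sample_correlation(a, b):
    """Pearson sample correlation coefficient of two lists of measurements."""
    return sample_covariance(a, b) / (sample_std_dev(a) * sample_std_dev(b))
```

For example, sample_correlation([1, 2, 3], [2, 4, 6]) is 1.0, and reversing one of the two lists flips the sign to -1.0.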

Computing a time series correlation

To compute correlations in time series data, the measurements we include in the sample are determined by a time interval. We will specify this interval using two values: the first day in the interval and the number of days in the interval (including that first day). In one of the datasets we provide, there is already a single value per day, whereas the other contains multiple records per day, which you will need to aggregate into a single value per day.
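
Once you have one value per day, selecting the sample for an interval amounts to a simple slice. A minimal sketch (sample_interval is a hypothetical helper, not a required function):

```python
def sample_interval(daily_values, first_day, num_days):
    """Return the measurements for days first_day through first_day + num_days - 1."""
    return daily_values[first_day:first_day + num_days]
```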

Generating a correlation matrix

The correlation matrix of M variables (x_1, x_2, ..., x_M) is an M x M matrix, where entry (i, j) is defined as follows:

    c_ij = corr(x_i, x_j)

Take into account that this will be a symmetric matrix, since:

    corr(x_i, x_j) = corr(x_j, x_i)
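
Because the matrix is symmetric and every variable is perfectly correlated with itself, each off-diagonal correlation only needs to be computed once. A sketch of this idea (correlation_matrix and its corr parameter are illustrative, not the interface required by the assignment):

```python
def correlation_matrix(variables, corr):
    """Build the M x M correlation matrix for a list of per-variable sample lists.

    corr is a function computing the correlation of two sample lists; each
    off-diagonal entry is computed once and mirrored across the diagonal.
    """
    m = len(variables)
    matrix = [[0.0] * m for _ in range(m)]
    for i in range(m):
        matrix[i][i] = 1.0  # every variable is perfectly correlated with itself
        for j in range(i + 1, m):
            c = corr(variables[i], variables[j])
            matrix[i][j] = c
            matrix[j][i] = c  # corr(x_i, x_j) = corr(x_j, x_i)
    return matrix
```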

Datasets

In this programming assignment, we are providing two datasets that contain information on multiple variables over time. Although the code to compute the correlations will be nearly identical for both datasets, each dataset has a different format and has to be manipulated in different ways before you can compute correlations in the data. You must choose one dataset to work with. You will not get extra credit for using both datasets.

Dataset 1: Historical stock data

This dataset is a directory with one CSV file per stock. The name of each file is symbol.csv. So, for example, the file with the data for Google is GOOG.csv. Each file looks like this:

Date,Open,High,Low,Close,Volume,Adj Close
2010-11-09,17.22,17.60,16.86,16.97,56218900,16.97
2010-11-10,17.00,17.01,16.75,16.94,17012600,16.94
2010-11-11,16.63,16.86,16.52,16.80,15310600,16.80
2010-11-12,16.65,16.75,16.40,16.55,17703400,16.55
2010-11-15,16.56,16.89,16.33,16.60,18934600,16.60
2010-11-16,16.45,16.49,16.10,16.24,23484100,16.24
2010-11-17,16.21,16.33,16.11,16.15,10305800,16.15
2010-11-18,16.40,17.17,16.29,16.99,46500100,16.99
2010-11-19,16.97,16.97,16.52,16.57,24036200,16.57

For each date, the following information is provided:

- Open: the price of the stock at the start of the trading day
- High: the highest price during the trading day
- Low: the lowest price during the trading day
- Close: the price of the stock at the end of the trading day
- Volume: the number of shares traded during the day
- Adj Close: the closing price, adjusted for dividends and stock splits

Additionally, the directory contains a file called variables.txt with the list of symbols (one per line) that should be included in the correlation matrix (in the same order as they would appear in the matrix). The purpose of this file is to allow you to work with smaller sets of stocks, and to control the order in which they appear in the matrix, without having to remove or rearrange the stock files.

You may assume that, for a given directory with stock data files, all the files contain the same number of entries and for the same dates. In fact, you are not expected to work with the dates directly; instead, you can simply consider the first entry in the file to be "day 0", the second entry to be "day 1", etc. For the purposes of this assignment, you are dealing with a sequence of T days. You should not need to use Python's datetime data type for this assignment.
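
For illustration, reading the symbols file and one stock's prices might look like the sketch below. The helper names are hypothetical, and the choice of the Adj Close column is an assumption; adapt this to the function declarations in stocks.py.

```python
import csv
import os

def read_variables(data_dir):
    """Read variables.txt: one stock symbol per line, in matrix order."""
    with open(os.path.join(data_dir, "variables.txt")) as f:
        return [line.strip() for line in f if line.strip()]

def read_prices(data_dir, symbol):
    """Read one stock's prices in file order (day 0, day 1, ...).

    Assumes the Adj Close column is the price of interest.
    """
    with open(os.path.join(data_dir, symbol + ".csv")) as f:
        return [float(row["Adj Close"]) for row in csv.DictReader(f)]
```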

To compute the correlation between stocks, you will need to compute a new piece of data based on the data contained in the dataset. More specifically, we will want to see the correlation between the returns of each stock. If p_i is the price of a stock on day i, the raw return of the stock on day i is simply:

    R_i = (p_i - p_{i-1}) / p_{i-1}

However, we will use the log return instead:

    r_i = ln(p_i / p_{i-1})
Notice that, when reading in this dataset, you should discard the entry for day 0, since there is no previous day from which to compute its return; you will, however, still need day 0's price to compute the log return for day 1.
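
The day-0 handling above suggests one way to compute the log returns: consume the day-0 price but emit returns only for days 1 through T - 1. A sketch (log_returns is an illustrative name, not a required function):

```python
import math

def log_returns(prices):
    """Log return for each day i >= 1: ln(p_i / p_{i-1}). Day 0 has no return."""
    return [math.log(prices[i] / prices[i - 1]) for i in range(1, len(prices))]
```

Note that a sequence of T prices yields T - 1 returns.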

Dataset 2: City of Chicago data

This dataset contains information on several types of events (311 calls, reported crimes, etc.). Like the stock dataset, this dataset is a directory with multiple CSV files (in this case, one file per type of event). The city data is quite large, so we have elided many of the fields. For example, here are the first few lines from the file ASSAULT.csv:

CREATION DATE, LATITUDE, LONGITUDE
01/01/2011,41.69086778069725,-87.62431771445297
01/01/2011,41.815175023517334,-87.66622787195554
01/01/2011,41.675309884758434,-87.62873824037594
01/01/2011,41.97422126077659,-87.65147882822946
01/01/2011,41.764133641959404,-87.68826950986085

Each line represents an event, and the following information is provided:

- CREATION DATE: the date the event was reported, in MM/DD/YYYY format
- LATITUDE, LONGITUDE: the location where the event occurred

The lines are sorted by date, with the earliest date first, and there is at least one report/request for each day.

To compute the correlation between events, you will need to compute a new piece of data based on the data contained in the dataset. More specifically, we will want to see the correlation between the number of events of each kind per day.

Like the stock dataset, the directory contains a variables.txt file with the list of event types (one per line) that should be included in the correlation matrix (in the same order as they would appear in the matrix). You should also regard the dataset as specifying a sequence of T days. In this dataset, 01/01/2011 is day 0, 01/02/2011 is day 1, etc. You do not need to use Python's datetime data type for this assignment.
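
One possible way to aggregate the events into per-day counts without datetime relies on the guarantees stated above: the lines are sorted by date and every day has at least one event, so each distinct date string, in order, corresponds to the next day index. A sketch (daily_counts is a hypothetical helper):

```python
def daily_counts(dates):
    """Count events per day from the CREATION DATE column, sorted ascending.

    Because every day has at least one event, each new date string encountered
    in order starts the next day index (day 0, day 1, ...).
    """
    counts = []
    current_date = None
    for date in dates:
        if date != current_date:  # first event of a new day
            counts.append(0)
            current_date = date
        counts[-1] += 1
    return counts
```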

Your task

In this assignment, we are providing you with code that will produce a correlation matrix visualization like the one shown at the beginning of this document. This code assumes that you have implemented a series of functions that read in the data and compute the correlation matrix. Your task will be to implement those functions.

Getting started

We have seeded your PhoenixForge accounts with a directory named pa6. It contains:

Do not modify any of these files. Your work should be constrained to either the stocks.py file or the city.py file. We have provided barebones versions of these two files, containing only the function declarations.

Once you implement stocks.py or city.py, you can run the visualizer like this:

python visualizer.py (stock|city) <data-directory>

Where the first parameter must be either stock or city, depending on the type of data you're working with, and <data-directory> is a directory containing the data.

Submission

Make sure your name and CNET ID are at the top of either stocks.py or city.py and then check your code into PhoenixForge.

We strongly recommend that you check your code into PhoenixForge at regular intervals!