.. |R2| replace:: *R*\ :sup:`2`

=================
Linear Regression
=================

**Due: Friday, Nov 10 at 4pm**

In this assignment, you will fit linear regression models and implement a few
simple variable selection algorithms. The assignment will give you experience
with Numpy and more practice with using classes and functions to support code
reuse.

**You must work alone on this assignment.**

At the heart of the assignment is a table, where each column is a variable and
each row is a sample unit. As an example, in a health study, each sample unit
might be a person, with variables like height, weight, sex, etc. In your
analysis, you will build models that, with varying levels of accuracy, can
predict the value of one of the variables as a function of the others.

Predictions are only possible if variables are related somehow. As an example,
look at this plot of recorded crimes against complaint calls about garbage
logged with 311.

.. image:: img/ex-intro.svg
   :align: center

Each point describes a sample unit, which in this example represents a
geographical region of Chicago. Each region is associated with variables, such
as the number of crimes or complaint calls during a fixed time frame. Given
this plot, if you were asked how many crimes you think were recorded for a
region that had 150 complaint calls about garbage, you would follow the
general trend and probably answer something like 3,000 recorded crimes.

To formalize this prediction, we need a model for the data that relates a
*dependent variable* (e.g., crimes) to a set of *predictor variables* (e.g.,
complaint calls). We will assume a linear dependence as follows:

.. math::

   y_n = \beta_0 + \beta_1 x_{n1} + \dots + \beta_K x_{nK} + \varepsilon_n,

or, using vector notation,

.. math::

   y_n = \mathbf{x}_n^T \beta + \varepsilon_n.

We have introduced the following notation:

:math:`n`
   The sample unit that we are considering out of :math:`N` total.

:math:`y_n`
   An observation of the *dependent variable*, e.g., the total number of
   crimes.

:math:`x_{nk}`
   An observation of predictor variable :math:`k` for sample unit :math:`n`,
   e.g., the number of complaint calls about garbage. We assume a total of
   :math:`K` different predictors.

:math:`\beta = (\beta_0, \beta_1, \dots, \beta_K)^T`
   A column vector of the regression coefficients, where :math:`\beta_0` is
   the intercept and :math:`\beta_k` is the coefficient associated with the
   :math:`k\mathrm{th}` predictor. This vector describes the red line in the
   figure above. Note that a positive coefficient suggests a positive
   correlation with the dependent variable, and a negative coefficient
   suggests a negative correlation.

:math:`\varepsilon_n`
   Not all sample units lie on the red line, which means the equation
   :math:`y_n = \mathbf{x}_n^T \beta` is rarely satisfied exactly. In our
   model, this discrepancy is assumed to be caused by random error. We assume
   that the random error is zero-mean and drawn from a normal distribution
   with a fixed variance. We want as much as possible to be explained
   deterministically, which means we will favor values of :math:`\beta` that
   result in small errors.

:math:`\mathbf{x}_n = (1, x_{n1}, \dots, x_{nK})^T`
   A column vector representation of all the predictors for a given sample
   unit. As a convenience, a 1 has been prepended to the vector to incorporate
   the intercept :math:`\beta_0` into the vector notation.

..
   In our example we only have one predictor. This means that :math:`K = 1`
   and :math:`x_{n1}` is the weight of the :math:`n\mathrm{th}` person. The
   dependent variable, height, is denoted :math:`y_n`. The relationship
   between height and weight is parameterized by an intercept
   (:math:`\beta_0`), a coefficient (:math:`\beta_1`), and a random error
   :math:`\varepsilon_n` drawn from a zero-mean normal distribution.
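To make the vector form concrete, here is a small numerical illustration of
evaluating :math:`\mathbf{x}_n^T \beta` with Numpy. The coefficients and
values below are made up for this illustration and are not part of the
assignment:

.. code:: python

   import numpy as np

   # Made-up coefficients for a model with a single predictor:
   # beta_0 (intercept) and beta_1 (coefficient for garbage calls).
   beta = np.array([144.0, 17.4])

   # One sample unit with x_1 = 150 complaint calls; note the prepended 1.
   x = np.array([1.0, 150.0])

   # x^T beta = 144.0 + 17.4 * 150 = 2754.0, close to the "roughly 3,000
   # crimes" read off the plot above.
   print(x @ beta)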
When making predictions, we are really asking for the expected value of the
dependent variable given a set of predictors. Since we have assumed the errors
have zero mean, to make a prediction we simply drop the error term:

.. math::

   \hat y = f(\mathbf{x}) = \mathbf{x}^T \beta.

This equation can be written for several sample units at the same time as:

.. math::

   \mathbf{\hat y} = \mathbf{X} \beta,

where :math:`\mathbf{\hat y}` is a column vector of predictions and
:math:`\mathbf{X}` is an :math:`N \times (K+1)` matrix in which each row is
the vector :math:`\mathbf{x}_n^T` for one sample unit.

Model fitting
-------------

There are many possible candidates for :math:`\beta`, some of which fit the
data better than others. Finding the best value of :math:`\beta` is referred
to as *fitting the model*.

To simplify your job, we have provided code for fitting a linear model. The
function ``linear_regression`` finds a value of :math:`\beta` that minimizes
:math:`\mathbf{\hat y} - \mathbf{y}` in a least-squares sense. That is, it
finds values for :math:`\beta` such that the predicted values
:math:`\mathbf{\hat y} = \mathbf{X} \beta` are as close to the observed values
as possible (in a statistically motivated way using `maximum likelihood`_).

This function accepts a two-dimensional Numpy array of floats containing the
predictor variables (``X``) and a one-dimensional Numpy array of floats
(``y``) containing the dependent variable as input and returns :math:`\beta`.
Note that ``X`` does not correspond exactly with :math:`\mathbf{X}`, since it
should not have the column of ones (we will add that for you).

We have also provided a function, ``apply_beta(beta, X)``, for making
predictions using the model. It takes ``beta``, the array generated by our
linear regression function, and ``X``, a two-dimensional Numpy array of floats
that contains values for the predictor variables (again, without the added
column of 1's), applies the function described by ``beta`` to each row in
``X``, and returns a one-dimensional Numpy array of the resulting values.

Here are sample calls to ``linear_regression`` and ``apply_beta``:

.. code:: python

   >>> import numpy as np
   >>> from model import *
   >>> X = np.array([[5, 2], [3, 2], [6, 2.1], [7, 3]])  # predictors
   >>> y = np.array([5, 2, 6, 6])                        # dependent
   >>> beta = linear_regression(X, y)
   >>> beta                                 # f(x) = 1.2 + 1.4*x1 - 1.7*x2
   array([ 1.20104895,  1.41083916, -1.6958042 ])
   >>> apply_beta(beta, X)                  # predictions
   array([ 4.86363636,  2.04195804,  6.1048951 ,  5.98951049])

Both of these functions require ``X`` to be two-dimensional, even in the case
of *K* = 1. Numpy makes a distinction between one-dimensional and
two-dimensional arrays, even if they contain the same information. For
instance, if we want to extract the second column of ``X`` as a
one-dimensional array, we do the following:

.. code:: python

   >>> X
   array([[ 5. ,  2. ],
          [ 3. ,  2. ],
          [ 6. ,  2.1],
          [ 7. ,  3. ]])

   >>> X[:, 1]
   array([ 2. ,  2. ,  2.1,  3. ])

   >>> X[:, 1].shape
   (4,)

The resulting shape will not be accepted by ``linear_regression`` and
``apply_beta``. To retrieve a 2D column subset of ``X``, you can use a list of
integers as the index. This mechanism keeps ``X`` two-dimensional, even if the
list of indices has only one value:

.. code:: python

   >>> X[:, [1]]
   array([[ 2. ],
          [ 2. ],
          [ 2.1],
          [ 3. ]])

   >>> X[:, [1]].shape
   (4, 1)
In general, you can specify a subset of columns by using a list as the index.
For example, let ``Z`` be:

.. code:: python

   >>> Z = np.array([[1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24]])
   >>> Z
   array([[ 1,  2,  3,  4],
          [11, 12, 13, 14],
          [21, 22, 23, 24]])

Evaluating the expression ``Z[:, [0, 2, 3]]`` yields a new 2D array with
columns 0, 2, and 3 from the array ``Z``:

.. code:: python

   >>> Z[:, [0, 2, 3]]
   array([[ 1,  3,  4],
          [11, 13, 14],
          [21, 23, 24]])

More generally, we can use any expression for the index that yields a list:

.. code:: python

   >>> l = [0, 2, 3]
   >>> Z[:, l]
   array([[ 1,  3,  4],
          [11, 13, 14],
          [21, 23, 24]])

Data
----

In this assignment you will write code that can be used on many different
datasets. We have provided a couple of sample datasets for testing purposes.

:City_: Predicting crime rate from the number of complaint calls to 311.

:Stock_: Predicting a stock index from a set of stocks.

**More information about each dataset can be found by clicking the links
above.**

For this assignment, a dataset is stored in a directory that contains two
files: a CSV (comma-separated values) file and a JSON file. The CSV file
contains a table where each column corresponds to a variable and each row
corresponds to a sample unit. The first row contains the column names. The
JSON file contains a few parameters:

* ``name``: the name of the dataset,

* ``predictor_vars``: the column indices of the predictor variables,

* ``dependent_var``: the column index of the dependent variable,

* ``training_fraction``: the fraction of the data that should be used to
  train models, and

* ``seed``: a random number generator seed.

The last two parameters will be used to help split the data into two sets: one
that you will use to fit or *train* models, and one that you will use to
evaluate how well different models predict outcomes on new-to-the-model data.
We describe this process below.

Submission requirements
-----------------------

You will do six tasks in this assignment. Tasks 1-5 list specific submission
requirements. Code to generate the required output text should be added to the
file ``output.py``. When we test your code, we will run the commands::

   $ python3 output.py data/city
   $ python3 output.py data/stock

and then manually inspect the output printed to the terminal. Submissions that
do not include the necessary code in ``output.py`` will lose substantial
credit.

All output must be labeled with both the name of the dataset and the task
number, and it should be written using ``print``. Most tasks require you to
include one or more models in the output. These models should be output in the
following form: the name of the dependent variable, followed by a tilde
(``~``) and the regression equation with the constant first. For example, here
is the expected output for the model shown above for predicting crime from
calls about garbage::

   CRIME_TOTALS ~ 144.003954 + 17.365791 * GARBAGE

Getting started
---------------

We have seeded your repository with a directory for this assignment. To pick
it up, change to your ``cmsc12100-aut-17-username`` directory (where the
string ``username`` should be replaced with your username) and then run the
command ``git pull upstream master``. You should also run ``git pull`` to make
sure your local copy of your repository is in sync with the server.

The ``pa5`` directory contains the following files:

:``dataset.py``: Python file that contains code for reading data from the
   JSON and CSV files. You will complete the ``DataSet`` class in this file.
:``model.py``: Python file that contains code for computing linear regressions
   and applying them to data. You should add your ``Model`` class to this
   file, along with your general-purpose model-selection functions.

:``output.py``: Python file where you will put your code to compute the
   required output. This file should rely heavily on the general functions
   that you add to ``model.py``.

The ``pa5`` directory also contains a ``data`` directory which, in turn,
contains two sub-directories: ``city`` and ``stock``.

We will be using the `scikit-learn`_ machine learning library in this
assignment. It is installed on the department's Linux machines, but not on the
class VM. To install it on your VM, run::

   sudo pip3 install sklearn

You will be asked for a password. Use the student account password: ``uccs``.

Task 0: Representations
-----------------------

*Representing datasets*

Recall that each dataset has two files: a CSV file with the data to be used
for training and testing models, and a JSON file with the parameters for the
assignment. In the file ``dataset.py``, we have provided two functions: one to
read the sample data from the CSV file and one to read the parameters from the
JSON file. In this file, we have also defined a skeleton for a class, named
``DataSet``, for representing a dataset.

Your first task is to complete the ``DataSet`` constructor, which takes the
name of the directory holding the dataset as its only parameter. In addition
to saving the values necessary for the computation from the JSON and CSV files
(dataset name, predictor variable indices, dependent variable index, and the
column labels), your constructor should also split the sample data into two
Numpy arrays: one, called the training data, that you will use to construct
models, and another, called the testing data, that you will use to evaluate
how well different models predict new-to-the-model data.

Why not use all the data to fit the models? It is easy to fit a model that
does a good job of predicting the dependent variable for the sample units that
were used to train it, but does a terrible job of predicting the dependent
variable for other data. How do we determine how well a model does with
new-to-it data? We can hold out some of the data and use the model to predict
the values of the dependent variable for the held-out sample units. Since we
know the actual value of the dependent variable for this held-out data, we can
use it to evaluate the effectiveness of the model on previously-unseen data.

We will use the `train_test_split`_ function from the Scikit-learn `model
selection library`_ to make the split. This function takes a Numpy array and
splits it into two arrays: one to use for training models and another to use
for testing them. It has two keyword parameters that can be used to control
the size of each array: ``train_size`` and ``test_size``. We'll set
``train_size`` using the ``training_fraction`` parameter from the parameters
file, and we will set ``test_size`` to ``None``. Using ``None`` for
``test_size`` indicates that the rows not chosen for the training set should
be included in the test set, which is the default behavior. Note that you
could omit this parameter, but the function generates an annoying warning
message if you do.

The ``train_test_split`` function decides randomly which rows to include in
which set. To support repeatability and simplify testing, it includes a
``random_state`` parameter that can be used to set the seed for the underlying
random process.
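For concreteness, here is a minimal sketch of how the split might be made. The
variable names below are placeholders (they are not required by the
assignment), standing in for the values your constructor would read from the
CSV and JSON files:

.. code:: python

   import numpy as np
   from sklearn.model_selection import train_test_split

   # Stand-ins for values read from the CSV and JSON files.
   data = np.arange(20.0).reshape(10, 2)   # 10 sample units, 2 columns
   training_fraction = 0.8                 # the "training_fraction" parameter
   seed = 486                              # the "seed" parameter (made up here)

   train_data, test_data = train_test_split(data,
                                            train_size=training_fraction,
                                            test_size=None,
                                            random_state=seed)

   print(train_data.shape, test_data.shape)   # (8, 2) (2, 2)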
Starting from a given seed, the "random" (actually pseudo-random) number
generator will produce the same numbers in the same order each time, causing
the same rows to be assigned to the same sets each time you run your program.
We'll set this parameter using the ``seed`` parameter from the JSON file.

Your implementation of the ``DataSet`` constructor should be quite simple.
Ours is less than 10 lines of code. The heavy lifting is done by the IO
functions we provided and by the ``train_test_split`` function from
Scikit-learn.

*Representing models*

In this assignment, you will be creating many different models for a given
dataset using different subsets of the dataset's predictor variables. You are
*required* to add a class to ``model.py`` for representing such models. Each
model will be constructed from a dataset and a subset of the dataset's
predictor variables. We encourage you to think carefully about what attributes
and methods to include in your class.

Don't be discouraged if you find yourself reworking the design of your class
several times as you work through the tasks for this assignment.

Other than the requirement to add classes for representing the dataset and the
models, you have free rein over the structure of your code. Think carefully
about what functions to build, since these choices, along with the design of
your model class, will largely determine the quality of your code. We suggest
that you create functions that are as general as possible and break them up
into several smaller functions if they get too big.

Task 1: Model evaluation using the Coefficient of Determination (|R2|)
----------------------------------------------------------------------

Blindly regressing on all available predictors is bad practice and can lead to
*over-fit* models. Over-fitting happens when your model picks up on random
structure in the training data that does not represent the underlying
phenomenon. This situation is made even worse if some of your predictors are
highly correlated (see `multicollinearity
<https://en.wikipedia.org/wiki/Multicollinearity>`__), since that can ruin the
interpretation of your coefficients. As an example, consider a study that
looks at the effects of smoking and drinking on developing lung cancer.
Smoking and drinking are highly correlated, so the regression can arbitrarily
pick up either one as the explanation for a high prevalence of lung cancer. As
a result, drinking could be assigned a high coefficient while smoking a low
one, leading to the questionable conclusion that drinking causes lung cancer
while smoking does not.

This example illustrates the importance of variable selection. In order to
compare two different models (subsets of predictors) against each other, we
need a measurement of their `goodness of fit
<https://en.wikipedia.org/wiki/Goodness_of_fit>`__, i.e., how well they
explain the data. A popular choice is the coefficient of determination, |R2|,
which first considers the variance of the dependent variable,
:math:`\mathrm{Var}(y)`:

.. image:: img/expl-fit.svg
   :width: 500px
   :height: 250px
   :align: center

In this example we revisit the regression of the total number of crimes, *y*,
on the number of calls about garbage to 311, *x*. To the right, a histogram
over *y* is shown and its variance is calculated. The variance is a measure of
how spread out the data is. It is estimated from the sample units as:

.. math::

   \mathrm{Var}(y) = \frac{1}{N}\sum_{n=1}^N (y_n - \bar y)^2,

where :math:`\bar y` denotes the mean of all *y*\ 's.

We now subtract the red line from each point.
These new points are called the *residuals* and represent the deviations from
the expected values under our model. If the variance of the residuals,
:math:`\mathrm{Var}(y - \hat y)`, is zero, it means that all residuals are
zero and thus all sample units lie perfectly on the fitted line. On the other
hand, if the variance of the residuals is the same as the variance of *y*, it
means that the model did not explain anything away and we can consider the
model useless. These two extremes represent |R2| values of 1 and 0,
respectively. All |R2| values in between these two extremes are cases where
the variance was reduced somewhat, but not entirely. Our example is one such
case:

.. image:: img/expl-res.svg
   :width: 500px
   :height: 250px
   :align: center

As we have hinted, the coefficient of determination, |R2|, measures how much
the variance was reduced by subtracting the model. More specifically, it
calculates this reduction as a fraction of the original variance:

.. math::

   R^2 = \frac{\mathrm{Var}(y) - \mathrm{Var}(y - \hat y)}{\mathrm{Var}(y)}
       = 1 - \frac{\mathrm{Var}(y - \hat y)}{\mathrm{Var}(y)},

which finally becomes:

.. math::

   R^2 = 1 - \frac{\sum_{n=1}^N (y_n - \hat y_n)^2}
                  {\sum_{n=1}^N (y_n - \bar y)^2}. \quad (1)

In these steps we omitted the normalization constants (:math:`1/N`) since they
cancel each other out. We also did not subtract the mean when calculating the
variance of the residuals, since it can be shown mathematically that if the
model has been fit by least squares, the sum of the residuals is always zero.

There are two properties of |R2| that are good to keep in mind when checking
the correctness of your calculations:

#. 0 ≤ |R2| ≤ 1

#. If model *A* contains a superset of the predictors of model *B*, then model
   *A* cannot have a lower |R2| value than *B*.

**These properties are only true if** |R2| **is computed using the same data
that was used to train the model parameters.** If you calculate |R2| on a
held-out testing set, the |R2| value could decrease with more predictors (due
to over-fitting) and |R2| is no longer guaranteed to be greater than or equal
to zero. The intuition behind |R2| would also need to be reconsidered, but
since the final formula in (1) remains the same, we will omit the details
here.

A good model should explain as much of the variance as possible, so we will be
favoring higher |R2| values. However, since using all predictors gives the
highest |R2|, we must balance this goal with a desire to use few predictors.
In general it is important to be cautious and not blindly interpret |R2|; take
a look at these two generated examples:

.. image:: img/good-bad.svg
   :width: 700px
   :height: 250px
   :align: center

To the left, we see that the fitted model is close to the one we used to
generate the data. However, the variance of the error is still high, so the
|R2| is rather low. To the right is a model that clearly is not right for the
data, but still manages to record a high |R2| value. This example also shows
that linear regression is not always the right model choice. However, in many
cases, a simple transformation of the predictors or the dependent variable can
resolve this problem, making the linear regression paradigm still relevant.

In this task, you will construct *P* univariate (*i.e.*, one-variable) models
of the dataset's dependent variable, one for each of the possible predictor
variables, and compute their associated |R2| values.
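As a concrete illustration of equation (1), the computation can be sketched
with basic Numpy array operations roughly as follows. Here ``y`` holds
observed values of the dependent variable and ``yhat`` the model's
predictions; both the names and the numbers are made up for this example:

.. code:: python

   import numpy as np

   # Made-up observed values and predictions, just to illustrate equation (1).
   y = np.array([4.0, 2.0, 6.0, 6.0])
   yhat = np.array([4.5, 2.5, 5.5, 6.5])

   # R^2 = 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
   r_squared = 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)
   print(r_squared)   # roughly 0.909 for these numbers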
Note that when we ask for the |R2| value for a model, we mean the value
obtained by computing |R2| using the model's :math:`\beta` on the data that
was used to *train* it.

**Submission requirement 1a** - Output a table of *P* models, one per possible
predictor variable, and their associated |R2| values.

.. note::

   You may **not** use the Numpy ``var`` method to compute |R2|, for two
   reasons: (1) our computation uses the biased variance, and mixing it up
   with the unbiased variance will lead you to generate the wrong answers, and
   (2) we want you to get practice with using operations on Numpy arrays.

From this table, you will see that none of the predictor variables
individually is a particularly good predictor for the dependent variable. In
subsequent tasks, we will construct better models using multiple variables.
For example, we could predict crime using not only complaints about garbage,
but complaints about graffiti as well. In this example, we want to find
:math:`\beta = (\beta_0, \beta_\mathrm{garbage}, \beta_\mathrm{graffiti})` in
the equation

.. math::

   f(\mathbf{x}) = \beta_0 + \beta_\mathrm{garbage} x_\mathrm{garbage} +
   \beta_\mathrm{graffiti} x_\mathrm{graffiti},

where :math:`f(\mathbf{x})` is a prediction of the number of crimes given the
number of complaint calls about garbage (:math:`x_\mathrm{garbage}`) and
graffiti (:math:`x_\mathrm{graffiti}`).

In terms of the |R2| calculation, this new equation only changes how each
prediction :math:`\hat y_n` is calculated. At this point, you should make sure
that your code to calculate a model's |R2| value is general enough to handle
models with multiple predictor variables (*i.e.*, multivariate models).

**Submission requirement 1b** - Construct a single model that uses all of the
dataset's predictor variables to predict the dependent variable. Output the
model along with its |R2| value. According to the second property of |R2|,
this value should be the largest |R2| value you will see for the training
data.

Task 2: Building and selecting bivariate models
-----------------------------------------------

In Task 1, we found that none of our single-variable models were particularly
accurate (they all had low |R2| values compared to the model that uses all of
the predictors). In this and subsequent tasks, we're going to build more
complex models that use multiple variables as predictors, without using all of
them and risking over-fitting.

For this task, you will test all possible bivariate models (*K* = 2) and
determine the one with the highest |R2| value. We suggest that you use two
nested for-loops to iterate over all combinations of two predictors, calculate
the |R2| value for each combination, and keep track of the one with the
highest |R2| value.

How does your result compare to the single-variable models in Task 1? Which
pair of variables performs best together? Does this result make sense given
the results from Task 1?

**Submission requirement** - Output the bivariate model with the highest |R2|
value along with its |R2| value.

Task 3: Building models of arbitrary complexity
-----------------------------------------------

How do we determine how many and which variables will generate the best model?
It turns out that this question is unsolved and is of interest to both
computer scientists and statisticians. To find the best model of three
variables, we could repeat the process in Task 2 with three nested for-loops
instead of two. For *K* variables we would have to use *K* nested for-loops.
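To make the nested-loop idea concrete, the exhaustive two-loop search from
Task 2 might be sketched roughly as follows. The helper ``compute_r2`` is
hypothetical (it is not part of the code we provide): assume it fits a model
on the training data using the given list of predictor column indices and
returns that model's |R2| value.

.. code:: python

   def best_bivariate(num_predictors, compute_r2):
       """Exhaustively search all pairs of predictors, Task 2 style.

       compute_r2 is assumed to be a function that takes a list of predictor
       column indices, trains a model on the training data, and returns the
       model's R^2 value.
       """
       best_pair, best_r2 = None, float("-inf")
       for i in range(num_predictors):
           for j in range(i + 1, num_predictors):
               r2 = compute_r2([i, j])
               if r2 > best_r2:
                   best_pair, best_r2 = [i, j], r2
       return best_pair, best_r2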
Nesting for-loops quickly becomes computationally expensive and infeasible for
even the most powerful computers. Keep in mind that models in genetics
research easily reach into the thousands of variables. How then do we find
optimal models of several variables? We'll approach this problem using
heuristics, which will give us an approximate (as opposed to an exact) result.

We are going to split this task into two pieces: (1) determine the best *K*
variables to use for a fixed *K*, and (2) determine the best value for *K*.

How do we determine which *K* variables will yield the best model for a fixed
*K*? As noted, the naive approach of testing all possible combinations, as in
Task 2, is intractable for large *K*. A simpler alternative is to choose the
*K* variables with the *K* highest |R2| values in the table from Task 1. This
approach does not work well if your predictors are correlated. As you've
already seen from the first two tasks, neither of the top two individual
predictors for crime (GARBAGE and STREET_LIGHTS) is included in the best
bivariate model (CRIME_TOTALS ~ -36.151629 + 3.300180 * POT_HOLES + 7.129337 *
ABANDONED_BUILDINGS).

Instead, you'll implement a heuristic known as `Backward elimination`_, which
is an example of a `greedy algorithm`_. Backward elimination starts with a set
that contains all potential predictor variables and then repeatedly eliminates
variables until the set contains *K* variables. At each step, the algorithm
requires you to identify the variable in the model that, when removed, yields
the model with the best |R2| value, and to eliminate that variable from the
set.

**Submission requirement** - Let *P* be the number of predictor variables in
your dataset. Output a table that lists *K*, the best *K*-variable model
chosen by the backward elimination algorithm, and the |R2| value for this
model, for :math:`1 \leq K \leq P`.

Task 4: Selecting the best K
----------------------------

In Task 3, you computed a list of the best *K*-variable models for
:math:`1 \leq K \leq P` using backward elimination. How do we choose among
these models? The calculation of |R2| does not take into account the number of
variables in a model and so, while it is a valid way to choose among different
models that each have *K* variables, it is not a good way to choose among
different values of *K*. Adjusted |R2|, which does take the number of
predictor variables in the models into account, is a reasonable alternative.

Essentially, adjusted |R2| replaces the biased variances used in |R2| with
their unbiased counterparts. Here's a formulation of it that highlights the
adjustment, rather than the change in the definition of variance:

.. math::

   R^2_{\mathrm{adj}} = R^2 - (1 - R^2) \cdot \frac{K}{N - K - 1},

where :math:`N` is the sample size (*i.e.*, the number of rows in the training
data) and :math:`K` is the number of predictor variables in the model.

**Submission requirement** - Output the model computed using the backward
elimination heuristic that has the largest adjusted |R2| value, along with
both its |R2| and adjusted |R2| values.

Task 5: Training vs. test data
------------------------------

*Please read this section very carefully.*

Until now, you have evaluated a model using the data that was used to train
it. The resulting model may be quite good for that particular dataset, but it
may not be particularly good at predicting novel data. This is the problem
that we have been referring to as *over-fitting*.

Up to this point, we have only computed |R2| values for the training data that
was used to construct the model.
It is also valid to compute an |R2| value for a model when applied to other
data.

**For this task, you will use the testing data to evaluate the model
identified in Task 4. You will receive zero credit for this task if you train
a new model using the testing data.**

**Submission requirements** - Output the model chosen for Task 4 along with
its |R2| value computed using the training data and its |R2| value computed
using the *testing data*.

Submission requirement round-up
-------------------------------

Here is a round-up of all the submission requirements. Your implementation of
``output.py`` must produce these results when called from the Linux command
line with a path to the dataset. Recall that you should include the dataset
name and task numbers in your output. See the `City`_ dataset for sample
output.

*Task 1:*

- Output a table of *P* models, one per possible predictor variable, and their
  associated |R2| values.

- Construct a single model that uses all of the dataset's predictor variables
  to predict the dependent variable. Output the model along with its |R2|
  value. According to the second property of |R2|, this value should be the
  largest |R2| value you will see for the training data.

*Task 2:*

- Output the bivariate model with the highest |R2| value along with its |R2|
  value.

*Task 3:*

- Let *P* be the number of predictor variables in your dataset. Output a table
  that lists *K*, the best *K*-variable model chosen by the backward
  elimination algorithm, and the |R2| value for this model, for
  :math:`1 \leq K \leq P`.

*Task 4:*

- Output the model computed using the backward elimination heuristic that has
  the largest adjusted |R2| value, along with both its |R2| and adjusted |R2|
  values.

*Task 5:*

- Output the model chosen for Task 4 along with its |R2| value computed using
  the training data and its |R2| value computed using the *testing data*.

Submission
----------

To submit your assignment, make sure that you have:

- put your name at the top of your files,
- registered for the assignment using chisubmit,
- added, committed, and pushed your code to the git server, and
- run the chisubmit submission command.

.. code::

   chisubmit student assignment register pa5

   git add dataset.py
   git add model.py
   git add output.py

   git commit -m "final version of pa5 ready for submission"
   git push

   chisubmit student assignment submit pa5

**Acknowledgments:** This assignment has been worked on by many people over
the years, including Matthew Rocklin and Gustav Larsson.

.. _City: dataset-city.html
.. _Stock: dataset-stock.html
.. _Election: dataset-election.html
.. _maximum likelihood: http://en.wikipedia.org/wiki/Maximum_likelihood
.. _greedy algorithm: http://en.wikipedia.org/wiki/Greedy_algorithm
.. _Forward selection: http://en.wikipedia.org/wiki/Stepwise_regression
.. _Backward elimination: http://en.wikipedia.org/wiki/Stepwise_regression
.. _train_test_split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split
.. _model selection library: http://scikit-learn.org/stable/model_selection.html
.. _scikit-learn: http://scikit-learn.org/stable