======
Pandas
======

Pandas is a Python library for working with data.  Pandas is a large library
with many useful features.  This lab is adapted from a similar lab from last
quarter (CMSC 12100), with some additional exercises useful for PA #4.

Getting started
---------------

To get started, if you are using your VM and have not already done so, open up
a terminal window and run::

    sudo pip3 install pandas

to install the Pandas library on your VM.  If instead you are on a CSIL
computer, you should upgrade its pandas installation by running::

    pip3 install --user --upgrade pandas

Once Pandas is installed, navigate (``cd``) to your ``cs122-win-17-username``
directory.  Run ``git pull upstream master`` to collect the lab materials and
``git pull`` to sync with your personal repository.

Reading the data
----------------

We will be using one of the sample data sets from an assignment from last
quarter.  Please look at the `Data `__ and `MORG Files `__ sections to get a
quick overview of this data.

**Task 1**: Use ``pd.read_csv`` to read the sample data into a pandas
dataframe and save the result in a variable named ``morg_df``.  Use ``h_id``,
which uniquely identifies each row, as the row index.

Basics
------

Read `10 Minutes to pandas `__ for an overview of how to select items from a
dataframe using simple and boolean indexing, do simple element-wise
arithmetic, and deal with missing data.

**Task 2**: Extract the ``"age"`` column from ``morg_df``.

**Task 3**: Extract the row that corresponds to ``h_id`` 1_2_2 from
``morg_df``.

**Task 4**: Use slicing to extract the first four rows of ``morg_df``.

**Task 5**: Use boolean indexing to extract all rows that correspond to a
person who works 35 or more hours per week.

.. **Task 6**: Use ``fillna`` to convert the missing values in ethnicity,
   which are currently represented using ``NaN``, which stands for *Not a
   Number*, to zero.

Extracting strings
------------------

Read `Working with Text Data `_ for a description of the functions pandas
provides to manipulate text data.

**Task 6**: Each ``h_id`` value has three numbers separated by underscores.
Create a dataframe with two columns containing the first and last of these
numbers.

**Task 7**: Now use the pandas ``strip()`` function to create a series that
contains the substring that remains in ``h_id`` after you remove the first and
last numbers.

Concatenating objects
---------------------

Read `Concatenating objects `_ to learn how to combine dataframes and series.

**Task 8**: Use the pandas ``concat`` function to create a new dataframe of
three columns by concatenating the dataframe and the series created in Tasks 6
and 7 above.

[The remaining tasks are from the original lab, and are optional.  You should
complete them if you feel rusty with pandas.]

Categorical values
------------------

Categoricals are a pandas data type used to represent variables that can take
on only a limited, and usually fixed, number of possible values.  We saw a way
to convert strings to categorical values using ``.astype`` in lecture.  This
method is less useful for the Current Population Survey data, because the
categorical values are encoded as small integers.  We could work with these
small integers, but it is much better to work with meaningful values.  The
Pandas library function ``pd.Categorical.from_codes`` allows you to convert
codes into meaningful categorical values.
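Before applying this to the survey data, here is a minimal sketch of how
``from_codes`` behaves on a small made-up column (the codes and category names
below are purely illustrative and are not part of the survey data)::

    import pandas as pd

    # hypothetical integer codes, one per row
    codes = pd.Series([0, 2, 1, 0])

    # human-readable names, listed in code order (0, 1, 2)
    names = ['red', 'green', 'blue']

    # from_codes maps each zero-based integer code to its category name
    colors = pd.Categorical.from_codes(codes, names)
    # -> [red, blue, green, red], with categories [red, green, blue]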
For example, we could convert the ``ethnicity_code`` column (after having
replaced the ``NaN`` values with zeros using the ``fillna`` method) from an
integer encoding into more meaningful values using the following::

    # convert NaN to 0
    morg_df["ethnicity_code"] = morg_df["ethnicity_code"].fillna(0)

    # convert codes to categorical values
    ethnic_categories = ['Non-Hispanic', 'Mexican', 'PuertoRican', 'Cuban',
                         'Dominican', 'Salvadoran',
                         'CentralAmericanExcludingSalvadoran',
                         'SouthAmerican', 'OtherSpanish']
    morg_df["ethnicity_code"] = pd.Categorical.from_codes(morg_df["ethnicity_code"],
                                                          ethnic_categories)

Having done so, we would probably also want to change the name of the column::

    morg_df.rename(columns={"ethnicity_code": "ethnicity"}, inplace=True)

to reflect the fact that it no longer contains integer codes.

**Task 9**: Convert the status code column, which has values that range from 1
through 7, to categorical values and rename it "status".

.. COMMENT We have defined a variable, ``status_categories``, to simplify this
   task for the purposes of this lab.  For the assignment, you will need to
   read this information from a file.

Keep in mind that ``from_codes`` expects the values to be zero-based, so you
will need to transform the values to range from 0 through 6 before you can
convert them to categorical values.

**Task 10**: Now that you have converted the status column into categorical
values, use boolean indexing to extract the rows that correspond to the people
who are not working.

**Task 11**: Use boolean indexing to extract the rows that correspond to
people who worked at least 35 hours per week and have a status of "Working".

Converting floats to bins
-------------------------

On occasion, it is useful to group floating point values, such as ages or
salaries, into bins.  Pandas provides a function `cut `__ that serves this
purpose.

Let's look at some code for adding an "age bin" column to the MORG dataframe.
But first, let's look at the range of ages seen in our data::

    In [19]: morg_df["age"].min()
    Out[19]: 16

    In [20]: morg_df["age"].max()
    Out[20]: 85

We'd like to group the data into 10 bins.  For this lab, we'll use
``range(16, 89, 8)`` for this purpose.  This approach is not ideal, since the
last bin includes four years that do not correspond to the ages of any of the
people in our dataset.  For PA #6, we recommend using Numpy's `linspace `__
function, which allows you to get a list of values equally spaced between a
lower bound and an upper bound.

The bins need to be labelled.  For simplicity, we'll label the bins with
integers.  Keep in mind that we need :math:`N+1` boundaries for :math:`N` bins
(and :math:`N` labels).

Once we have defined the bin boundaries and the labels, we can use ``cut`` to
compute a new column where the ith entry corresponds to the appropriate bin
for the age of the person in the ith row of ``morg_df``::

    boundaries = range(16, 89, 8)
    morg_df["age_bin"] = pd.cut(morg_df["age"],
                                bins=boundaries,
                                labels=range(len(boundaries)-1),
                                include_lowest=True,
                                right=False)

We use ``include_lowest=True`` because we have at least one person who is 16
and we want to include his age in a bin.  We use ``right=False``, because we
do not want to include 89 in our last bin.

**Task 12**: Add a column to ``morg_df`` for hours worked per week that
assigns a person's hours worked per week to one of ten buckets, evenly spaced
from the fewest number of hours worked (0) to the most (99) inclusive.
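One possible way to set up Task 12 is sketched below.  This is only a sketch:
the column name ``hours_per_week`` is a placeholder guess at this dataset's
schema, so check the actual MORG column name before using it::

    import numpy as np

    # ten evenly spaced bins from 0 to 99 hours require eleven boundaries
    hour_boundaries = np.linspace(0, 99, 11)

    # "hours_per_week" is a placeholder column name, not necessarily the
    # name used in the MORG files
    morg_df["hours_bin"] = pd.cut(morg_df["hours_per_week"],
                                  bins=hour_boundaries,
                                  labels=range(10),
                                  include_lowest=True)

Here the default ``right=True`` keeps 99 in the last bin, and
``include_lowest=True`` keeps 0 in the first, so both endpoints are included
as the task requires.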
Counting
--------

Counting the number of occurrences of different values is a very common
operation.  We saw two ways to do this task in class: using ``value_counts``
and using ``groupby`` plus ``size``::

    In [59]: morg_df["age_bin"].value_counts()
    Out[59]:
    3    41893
    4    40857
    0    39945
    2    37416
    1    37368
    5    32701
    6    23293
    7    16891
    8    13656
    dtype: int64

    In [60]: morg_df.groupby("age_bin").size()
    Out[60]:
    age_bin
    0    39945
    1    37368
    2    37416
    3    41893
    4    40857
    5    32701
    6    23293
    7    16891
    8    13656
    dtype: int64

**Task 13**: Use these methods to count the number of people in each of your
hours-worked-per-week bins.

Note that you can get the values sorted by bin number, for either approach,
using the ``sort_index`` method::

    In [63]: morg_df["age_bin"].value_counts().sort_index()
    Out[63]:
    0    39945
    1    37368
    2    37416
    3    41893
    4    40857
    5    32701
    6    23293
    7    16891
    8    13656
    dtype: int64

    In [62]: morg_df.groupby("age_bin").size().sort_index()
    Out[62]:
    age_bin
    0    39945
    1    37368
    2    37416
    3    41893
    4    40857
    5    32701
    6    23293
    7    16891
    8    13656
    dtype: int64
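For Task 13, either approach above applies directly to the hours bins.  A
short sketch, assuming the bin column is named ``hours_bin`` as in the earlier
Task 12 sketch::

    # both lines count the number of people per hours-worked bin;
    # "hours_bin" is the placeholder column name from the Task 12 sketch
    morg_df["hours_bin"].value_counts().sort_index()
    morg_df.groupby("hours_bin").size()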