==========================
Analyzing Candidate Tweets
==========================

**Due: Friday, October 21st at 4pm**

You may work alone or in a pair on this assignment.

The purpose of this assignment is to give you experience with using
dictionaries to represent data and as a mechanism for mapping keys to
values that change over time.

Introduction
============

The candidates in the 2016 presidential primaries and general election
have used Twitter as a way to communicate with their supporters (and
detractors) and gain media attention. In this assignment, you will
analyze tweets from Hillary Clinton's (HRC) and Donald Trump's (DJT)
Twitter feeds to gain some insight into how they use Twitter as a
campaign tool. We'll ask questions such as:

- What is HRC's favorite hashtag? [#demdebate]
- Who has DJT mentioned at least 100 times? [realdonaldtrump, cnn, foxnews]
- What words occur most often in HRC's tweets? [our, with, are, have, not]
- What two-word phrases occur most often in DJT's tweets? [i will,
  make america, great again, i am, america great]
- How do the candidates' feeds change over time? [Not as much as you
  might think!]

Getting started
===============

See `these start-up instructions `__ if you intend to work alone.

See `these start-up instructions `__ if you intend to work with the
same partner as in PA #2.

See `these start-up instructions `__ if you intend to work in a *NEW*
pair.

Here is a description of the contents of this directory:

- ``basic_algorithms.py`` -- you will add code for the basic analysis
  algorithms to this file.

- ``test_basic_algorithms.py`` -- this file contains very basic test
  code for the basic algorithms.

- ``analyze.py`` -- you will add your code for extracting information
  from tweets and analyzing it to this file. Note that this file
  includes several constants and two functions that you may find
  useful. They are documented with code comments, and you should make
  sure to read and understand them before you start writing code.

- ``test_analyze_writeup.py`` -- test code for the examples in the
  writeup.

- ``util.py`` -- this file contains a few useful functions.

- ``data/`` -- a directory for the data files. Since the data files
  are very large, they are not included in the Git repository. See
  below for instructions on how to obtain the data files.

The ``data/`` directory contains a file called ``get_files.sh`` that
will download the data files. To download the data files, go into the
``data`` directory and run this command::

    ./get_files.sh

This script will download the complete HRC dataset (``hrc-all.json``),
the complete DJT dataset (``djt-all.json``), and several smaller
datasets that may be useful for testing purposes. Please note that you
must be connected to the network to use this script.

**Do not add the data to your repository!** If you wish to use both
CSIL & the Linux servers and your VM, you will need to run the
``get_files.sh`` script twice: once for CSIL & the Linux servers and
once for your VM.

Part 1: Basic Algorithms
========================

In this part, you will implement three algorithms: `top k`, `min
count`, and `frequent`. In Part 2, you will use these algorithms to
analyze data extracted from tweets.

Before we describe the algorithms you will implement, we first discuss
a function that we have provided for ordering tuples.

*Ordering tuples*

All three of the algorithms described in this section compute an
ordered list of (key, value) pairs. We provide a function, named
``sort_count_pairs``, that will handle sorting the pairs for you. It
takes a list of the form:

.. code-block:: python

    [(k0, v0), (k1, v1), ...]

and sorts it. The natural sort order for pairs uses the first item in
the pair as the primary sort key and the second item as the secondary
sort key, which is not the ordering we want. Instead, our function uses
the values (``v0``, ``v1``, etc.) as the primary sort key, placing
pairs with larger values before pairs with smaller values. If two
values are the same, then the keys (``k0``, ``k1``, etc.) are used as
the secondary sort key and are ordered in increasing order. For
example, given the list:

.. code-block:: python

    [('D', 1), ('C', 2), ('A', 5), ('B', 2)]

our function would yield:

.. code-block:: python

    [('A', 5), ('B', 2), ('C', 2), ('D', 1)]
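If you are curious how such an ordering can be expressed, here is a
minimal sketch using Python's built-in ``sorted``. It illustrates the
ordering described above; it is not necessarily how the provided
``sort_count_pairs`` is implemented, and you should use the provided
function in your solutions.

.. code-block:: python

    def sort_count_pairs_sketch(pairs):
        # Primary key: the count, in decreasing order (hence the negation).
        # Secondary key: the item itself, in increasing order.
        return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

    # Matches the example above:
    # sort_count_pairs_sketch([('D', 1), ('C', 2), ('A', 5), ('B', 2)])
    # => [('A', 5), ('B', 2), ('C', 2), ('D', 1)]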
*Top K*

The first algorithm, `Top K`, which will be familiar from class,
computes the :math:`K` items that occur most frequently in the list.
To do this computation, you will need to:

#. use a dictionary to count the number of times each unique value
   occurs,
#. extract a list of (key, count) pairs from the dictionary,
#. sort the pairs using the supplied function, and finally,
#. pick off the first :math:`K` pairs.

The first step can and should be done with one pass over the data.

Given the following list:

.. code-block:: python

    l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']

and a value of two for :math:`K`, this algorithm would yield:

.. code-block:: python

    [('A', 5), ('B', 2)]

*Minimum number of occurrences*

The second algorithm, `Min Count`, finds the items in a list that
occur at least some specified minimum number of times. To perform this
algorithm, you will need to:

#. compute the counts,
#. build a list of the items and associated counts that meet the
   threshold, and then
#. sort it using the supplied function.

Running this algorithm on the list ``l`` defined above with a minimum
count of two would yield the following list:

.. code-block:: python

    [('A', 5), ('B', 2), ('C', 2)]
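To make the steps above concrete, here is a hedged sketch of these two
algorithms. The ``count_items`` helper is our own name, and we assume
``sort_count_pairs`` can be imported from ``util.py`` (check the
provided files for its actual location):

.. code-block:: python

    from util import sort_count_pairs   # assumed location; adjust as needed

    def count_items(items):
        # One pass over the data: map each distinct item to its count.
        counts = {}
        for item in items:
            counts[item] = counts.get(item, 0) + 1
        return counts

    def find_top_k(items, k):
        # Count, extract the (key, count) pairs, sort, keep the first k.
        pairs = list(count_items(items).items())
        return sort_count_pairs(pairs)[:k]

    def find_min_count(items, min_count):
        # Keep only the pairs whose count meets the threshold, then sort.
        pairs = [(key, c) for key, c in count_items(items).items()
                 if c >= min_count]
        return sort_count_pairs(pairs)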
*Frequent items*

The previous two algorithms require space proportional to the number
of unique items in the list. The third algorithm, `Frequent`, finds
the items in a sequence with a frequency that `exceeds` a :math:`1/K`
fraction of :math:`N`, the total number of items. This algorithm uses
space proportional to :math:`K`, which is likely to be much smaller
than :math:`N`.

The `frequent` algorithm uses a data structure :math:`D` with up to
:math:`K-1` counters, each associated with a particular item, to track
an approximation to the frequency of the corresponding items in the
list. For a given list item :math:`I`, you will update the counters
using the following rules:

- If item :math:`I` occurs in :math:`D`, then increment by one the
  count associated with :math:`I`.
- If item :math:`I` does not occur in :math:`D` and there are fewer
  than :math:`K-1` items in :math:`D`, then add :math:`I` with a value
  of 1 to :math:`D`.
- If item :math:`I` does not occur in :math:`D` and there are
  :math:`K-1` items in :math:`D`, then decrement all of the counters
  by one and remove any items with a count of zero from :math:`D`.

Once this data structure is computed, you will need to extract the
(key, count) pairs from it and sort them using our sort method.

Before we look at the result of applying this algorithm to the list
``l``, let's look at a more straightforward example. Given the list:

.. code-block:: python

    ["A", "A", "B", "B", "A", "B", "C", "B", "A"]

and a value of three for :math:`K`, the `frequent` algorithm will
yield:

.. code-block:: python

    [('A', 3), ('B', 3)]

Notice that while the items identified by the `frequent` algorithm are
correct, the counts are not. The algorithm computes an approximation,
not the exact count.

Returning to our earlier example, running the algorithm on the list
``l`` defined above and a value of three for :math:`K` would yield:

.. code-block:: python

    [('A', 3), ('D', 1)]

This result may seem a bit odd. The algorithm guarantees that the
result will include any item whose frequency `exceeds` :math:`N/K`,
which is why ``'A'`` occurs in the result. If there are fewer than
:math:`K-1` items with a frequency that exceeds :math:`N/K`, then the
result may include items that occur less frequently, which explains
the presence of ``'D'`` in the result.

It is instructive to think through what happens when running this
algorithm with the list ``l`` defined above and two as the value of
:math:`K`. The data structure :math:`D` will contain one key, ``'A'``,
with an associated value of one after processing the last ``'A'`` in
the list. When the last value (``'D'``) in the list is processed, the
third rule (decrement all the counts) will be invoked and, as a
result, the data structure :math:`D` will be empty when the algorithm
finishes.

This algorithm and other similar algorithms can also be used with
streaming data. See the paper `Finding Frequent Items in Data Streams
`__ by Cormode and Hadjieleftheriou for a good summary and an
explanation of the relationship between the frequencies reported in
the results and the true frequencies.
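Here is a hedged sketch of how the three update rules might be
realized in code. The dictionary mirrors the :math:`D` described
above, and we again assume ``sort_count_pairs`` is importable from
``util.py``:

.. code-block:: python

    from util import sort_count_pairs   # assumed location; adjust as needed

    def find_frequent(items, k):
        d = {}   # item -> approximate counter; holds at most k-1 entries
        for item in items:
            if item in d:
                d[item] += 1                 # rule 1: known item
            elif len(d) < k - 1:
                d[item] = 1                  # rule 2: room for a new item
            else:
                # Rule 3: decrement every counter; drop items that hit zero.
                for key in list(d):
                    d[key] -= 1
                    if d[key] == 0:
                        del d[key]
        return sort_count_pairs(list(d.items()))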
*Task 0*

Your first task is to implement the three algorithms described above.
We have provided code that runs a few test cases for each of the
algorithms (see ``test_basic_algorithms.py``). Our code assumes that
you have defined three functions---``find_top_k``, ``find_min_count``,
and ``find_frequent``---each of which takes a list and an integer as
parameters and returns a sorted list of tuples. We encourage you to
inspect the tests and ask yourself "Why did they choose these tests?"
and "What tests are obviously missing?"

As in the previous assignments, our test code uses the ``pytest``
testing framework. You can use the ``-k`` flag and the name of the
algorithm (``top_k``, ``min_count``, and ``frequent``) to run the test
code for a specific algorithm. For example::

    py.test test_basic_algorithms.py -k min_count

will run the tests for the `min count` algorithm.

Data
====

Twitter makes it possible to search for tweets with particular
properties, say, from a particular user, containing specific terms,
and within a given range of dates. There are several Python libraries
that simplify the process of using this feature of Twitter. We used
the `TwitterSearch `_ library to gather tweets from HRC's and DJT's
Twitter feeds from mid-November 2015 through mid-October 2016 and
stored the resulting data in JSON files.

We have provided code that reads the data from the JSON files back
into a list of dictionaries, one per tweet. The representation of a
tweet contains a lot of information: creation time, hashtags used,
users mentioned, text of the tweet, etc. For example, here is the
fourth tweet from HRC from Feb 1st::

    RT @HillaryforIA: The #iacaucus starts in 24 hours! #ImWithHer https://t.co/bF5fyfKbFt

and here is an abbreviated version of the corresponding tweet
dictionary that includes a few of the 20+ key/value pairs:

.. code-block:: python

    {'created_at': 'Mon Feb 01 01:18:37 +0000 2016',
     'entities': {'hashtags': [{'indices': [22, 31], 'text': 'iacaucus'},
                               {'indices': [52, 62], 'text': 'ImWithHer'}],
                  'symbols': [],
                  'urls': [],
                  'user_mentions': [{'id': 3056142064,
                                     'id_str': '3056142064',
                                     'indices': [3, 16],
                                     'name': 'Hillary for Iowa',
                                     'screen_name': 'HillaryforIA'}]},
     'text': 'RT @HillaryforIA: The #iacaucus starts in 24 hours! #ImWithHer https://t.co/bF5fyfKbFt'}

Notice that the value associated with ``entities``, which contains
information about hashtags, users mentioned, URLs included, etc., is
itself a dictionary that maps strings to lists of dictionaries. The
structure of the different entity dictionaries depends on the entity
type. For example, if we wanted to find the users mentioned in a
tweet, we'd extract the ``"screen_name"`` values from the dictionaries
in the list associated with the ``"user_mentions"`` key in the
``"entities"`` dictionary.
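As an illustration of navigating this nested structure, here is a
hedged sketch of a helper that pulls the values for one kind of entity
out of a single tweet dictionary. The function name is ours, not part
of the provided code; the lower-casing reflects the fact that the
tasks below ignore case differences:

.. code-block:: python

    def extract_entity_values(tweet, entity_key, value_key):
        # E.g., entity_key="user_mentions" and value_key="screen_name".
        # tweet['entities'][entity_key] is a list of dictionaries; pull
        # one field out of each and normalize its case.
        return [entity[value_key].lower()
                for entity in tweet['entities'][entity_key]]

    # For the tweet shown above:
    # extract_entity_values(tweet, 'user_mentions', 'screen_name')
    # => ['hillaryforia']
    # extract_entity_values(tweet, 'hashtags', 'text')
    # => ['iacaucus', 'imwithher']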
Part 2: Analyzing Tweets
========================

Now that you have implemented the basic analysis algorithms, you can
move on to analyzing the Twitter feeds. Your code for this part and
all subsequent parts should be added to the file ``analyze.py``. This
file contains a main block to allow the program to be run from the
command-line. Run::

    python3 analyze.py --help

to get a description of the arguments.

While we provide function headers for each of the required tasks, the
structure of the rest of the code is entirely up to you. We will note
that while some tasks can be done cleanly with one function, others
cannot. Also, we recommend looking for sub-tasks that are common to
multiple tasks and re-using previously completed functions.

Finding commonly-occurring entities
-----------------------------------

What are DJT's favorite hashtags? Which URLs are referenced at least 5
times by HRC? Are there users who represent at least 5% of the
mentions in DJT's tweets? Answering these questions requires us to
extract the desired entities (hashtags, user mentions, etc.) from the
candidates' tweets and process them appropriately. For these tasks, we
will ignore case differences and so, for example, we will consider
"demdebate" to be the same as "DemDebate".

*Common Parameters*

The next three tasks will use many of the same parameters. To avoid
repetition later, we describe them here:

- ``tweets`` will be a list of tweet dictionaries,
- ``entity_key`` will be one of ``"hashtags"``, ``"urls"``, or
  ``"user_mentions"``, and
- ``value_key`` will be one of ``"text"``, ``"url"``, or
  ``"screen_name"``.

*Task 1: Top K entities*

For Task 1, you must complete the function:

.. code-block:: python

    def find_top_k_entities(tweets, entity_key, value_key, k):

where the first three parameters are as described above and ``k`` is
an integer. This function should return a sorted list of (entity,
count) pairs with the ``k`` most frequently occurring entities and
their associated counts.

Running this function on the tweets in ``hrc-feb.json`` with
``"hashtags"`` as the entity key, ``"text"`` as the value key, and a
value of three for ``k``, for example, would yield:

.. code-block:: python

    [('imwithher', 70), ('demdebate', 49), ('demtownhall', 28)]

*Task 2: Minimum number of occurrences*

For Task 2, you must complete the function:

.. code-block:: python

    def find_min_count_entities(tweets, entity_key, value_key, min_count):

where the first three parameters are as described above and
``min_count`` is an integer that specifies the minimum number of times
an entity must occur to be included in the result. This function
should return a sorted list of (entity, count) pairs.

Running this function on the tweets in ``djt-feb.json`` with
``"hashtags"`` as the entity key, ``"text"`` as the value key, and a
value of 15 for ``min_count``, for example, would yield:

.. code-block:: python

    [('trump2016', 54), ('makeamericagreatagain', 50), ('fitn', 16)]

FYI, #FITN stands for "First In The Nation".

*Task 3: Frequent entities*

For Task 3, you must implement the function:

.. code-block:: python

    def find_frequent_entities(tweets, entity_key, value_key, k):

where the first three parameters are as described above and ``k`` is
an integer. This function should return a sorted list of (entity,
count) pairs computed using the `frequent` algorithm.

Running this function on the tweets in ``hrc-feb.json`` with
``"hashtags"`` as the entity key, ``"text"`` as the value key, and a
value of three for ``k``, for example, would yield:

.. code-block:: python

    [('imwithher', 6), ('iacaucus', 3)]

*Testing*

We have provided test cases for the basic analysis algorithms and for
the tweet-based examples shown in the writeup. Beyond the few examples
that we provided, you are on your own for testing these and all
subsequent tasks.

Analyzing N-grams
-----------------

If looking at frequently occurring hashtags or finding all the users
mentioned some minimum number of times yields some insight into a
candidate, what might we learn by identifying commonly occurring pairs
of words, word triples, etc.? In this part, you will apply the three
algorithms described earlier to contiguous sequences of :math:`N`
words, which are known as `n-grams`. Before you can apply these
algorithms to a candidate's tweets, you will pre-process the tweets to
try to reveal salient words.

*Pre-processing step*

The pre-processing step converts the text of a tweet into a list or
tuple of strings. We will define a `word` to be any sequence of
characters delimited by whitespace. For example, "abc", "7xy", and
"#2016election." are all words by this definition.

Once you extract the words from a tweet, you should remove leading and
trailing punctuation and convert the word to lower case. We have
defined a constant, ``PUNCTUATION``, that specifies what constitutes
punctuation for this assignment. Note that apostrophes, as in the word
"it's", should not be removed, because they occur in the middle of the
word.

Finally, you should eliminate words that appear in a specified set of
`stop words` (for example, "an", "of", and "to") or that start with a
specified set of `stop prefixes` (for example, ``"#"``, ``"@"``, and
``"http"``).

Here is an example: pre-processing the text from the fourth tweet from
DJT on Feb 1st::

    'RT @kevcirilli: CEDAR RAPIDS --\n\nTRUMP\'S DAUGHT IVANKA:\n\n"I can just say without equivocation, my father will make America great again."'

with the sets ``STOP_WORDS["djt"]`` and ``STOP_PREFIXES["default"]``
should yield:

.. code-block:: python

    ('cedar', 'rapids', "trump's", 'daught', 'ivanka', 'i', 'can',
     'just', 'say', 'without', 'equivocation', 'my', 'father', 'will',
     'make', 'america', 'great', 'again')

You will find the ``lower``, ``split``, ``strip``, and ``startswith``
methods from the `str API `_ useful for this step.
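Here is a hedged sketch of the pre-processing step. The function name
is ours, ``string.punctuation`` stands in for the provided
``PUNCTUATION`` constant, the ordering of the checks is one plausible
choice (the prefix check happens before stripping, so that "#" and "@"
are still visible), and the example above additionally assumes that
``STOP_WORDS["djt"]`` contains "rt":

.. code-block:: python

    import string

    def pre_process(text, stop_words, stop_prefixes):
        words = []
        for raw in text.split():
            word = raw.lower()
            # Drop words that begin with a stop prefix ("#", "@", "http", ...).
            if any(word.startswith(p) for p in stop_prefixes):
                continue
            # Remove leading and trailing punctuation only; "it's"
            # keeps its internal apostrophe.
            word = word.strip(string.punctuation)
            if word and word not in stop_words:
                words.append(word)
        return tuple(words)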
*Representing N-grams*

N-grams should be represented as tuples of strings. Given a value of 2
for :math:`N`, the above DJT tweet would yield the following bi-grams
(2-grams):

.. code-block:: python

    ('cedar', 'rapids')
    ('rapids', "trump's")
    ("trump's", 'daught')
    ('daught', 'ivanka')
    ('ivanka', 'i')
    ('i', 'can')
    ('can', 'just')
    ('just', 'say')
    ('say', 'without')
    ('without', 'equivocation')
    ('equivocation', 'my')
    ('my', 'father')
    ('father', 'will')
    ('will', 'make')
    ('make', 'america')
    ('america', 'great')
    ('great', 'again')

You might ask, "Why use tuples and not lists to represent n-grams?"
Tuples of strings are immutable (that is, they cannot be modified)
and, as a result, can be used as keys for dictionaries. Lists, on the
other hand, cannot be used as dictionary keys, because the data
structure used to implement dictionaries relies on the fact that keys
are not mutable.

*Common parameters*

As in Tasks 1-3, Tasks 4-6 have several parameters in common:

- ``tweets`` is a list of tweet dictionaries,
- ``n`` is the number of words in a sequence,
- ``stop_words`` is a set of words (strings) to ignore, and
- ``stop_prefixes`` is a set of strings that describe prefixes for
  words that should be ignored.

*Task 4: Top K N-grams*

Your task is to implement the function:

.. code-block:: python

    def find_top_k_ngrams(tweets, n, stop_words, stop_prefixes, k):

where the first four parameters are as described above and ``k`` is
the desired number of n-grams. This function should return a sorted
list of (n-gram, integer count of occurrences) pairs for the ``k``
most frequently occurring n-grams.

Running this function on the tweets in ``hrc-feb.json`` with two as
the value for ``n``, the Clinton stop words (``STOP_WORDS["hrc"]``),
the default stop prefixes (``STOP_PREFIXES["default"]``), and three as
the value for ``k`` would yield:

.. code-block:: python

    [(('new', 'hampshire'), 22),
     (('south', 'carolina'), 14),
     (('stand', 'with'), 12)]

*Task 5: Minimum number of n-gram occurrences*

Your task is to implement the function:

.. code-block:: python

    def find_min_count_ngrams(tweets, n, stop_words, stop_prefixes, min_count):

where the first four parameters are as described above and
``min_count`` specifies the minimum number of times an n-gram must
occur to be included in the result. This function should return a
sorted list of (n-gram, integer count of occurrences) pairs.

Running this function on the tweets in ``djt-feb.json`` with two as
the value for ``n``, the Trump stop words (``STOP_WORDS["djt"]``), the
default stop prefixes (``STOP_PREFIXES["default"]``), and 30 as the
value for ``min_count`` would yield:

.. code-block:: python

    [(('ted', 'cruz'), 36),
     (('south', 'carolina'), 35),
     (('i', 'will'), 32),
     (('new', 'hampshire'), 31)]

*Task 6: Frequent N-grams*

Your task is to implement the function:

.. code-block:: python

    def find_frequent_ngrams(tweets, n, stop_words, stop_prefixes, k):

where the first four parameters are as described above and ``k`` is an
integer. This function should return a sorted list of (n-gram,
approximate integer count) pairs as computed using the `Frequent`
algorithm.

Running this function on the tweets in ``hrc-feb.json`` with two as
the value for ``n``, the Clinton stop words (``STOP_WORDS["hrc"]``),
the default stop prefixes (``STOP_PREFIXES["default"]``), and three as
the value for ``k`` would yield:

.. code-block:: python

    [(('few', 'iowans'), 1)]
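The sub-task common to Tasks 4-6 is converting each pre-processed
tweet into its n-grams. Here is a minimal sketch, assuming a word
sequence produced by a ``pre_process`` helper like the one sketched
earlier (both names are ours):

.. code-block:: python

    def make_ngrams(words, n):
        # Slide a window of length n across the word sequence; each
        # window becomes a tuple so it can serve as a dictionary key.
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    # make_ngrams(('cedar', 'rapids', "trump's"), 2)
    # => [('cedar', 'rapids'), ('rapids', "trump's")]

With this in hand, ``find_top_k_ngrams`` is essentially ``find_top_k``
applied to the n-grams collected from every tweet, and similarly for
the other two tasks.

By Month Analysis
-----------------

A candidate's focus shifts over the course of an election.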
The number of voters who participate in the primaries is relatively
small, and their values are often not in line with those of the much
larger general election electorate. The oft-discussed need to "pivot
to the center" for the general election reflects this difference. An
interesting question is whether this shift is visible in the
candidates' Twitter feeds. (In retrospect, 2016 may not be the best
year to illustrate this principle.)

*Task 7*

We'll look at this question by calculating the `Top K` n-grams by
month. Specifically, your last task is to implement the function:

.. code-block:: python

    def find_top_k_ngrams_per_month(tweets, n, stop_words, stop_prefixes, k):

which takes the same parameters as ``find_top_k_ngrams`` and returns a
list of tuples of the form ``((year, month), list of the top k n-grams
and their counts sorted using our function)``. The resulting list
should be sorted using the standard sort function.

Running this function on the tweets in ``djt-all.json`` with two as
the value for ``n``, the combined stop words (``STOP_WORDS["both"]``),
the default stop prefixes (``STOP_PREFIXES["default"]``), and three as
the value for ``k`` would yield::

    [((2015, 11), [(('make', 'america'), 10),
                   (('south', 'carolina'), 9),
                   (('america', 'great'), 8)]),
     ((2015, 12), [(('i', 'will'), 34),
                   (('i', 'am'), 26),
                   (('great', 'again'), 19)]),
     ((2016, 1), [(('i', 'will'), 33),
                  (('ted', 'cruz'), 28),
                  (('i', 'am'), 16)]),
     ((2016, 2), [(('ted', 'cruz'), 36),
                  (('south', 'carolina'), 35),
                  (('i', 'will'), 32)]),
     ((2016, 3), [(('i', 'am'), 24),
                  (('i', 'will'), 24),
                  (('ted', 'cruz'), 19)]),
     ((2016, 4), [(('new', 'york'), 28),
                  (('i', 'will'), 22),
                  (('great', 'again'), 15)]),
     ((2016, 5), [(('i', 'will'), 25),
                  (('elizabeth', 'warren'), 16),
                  (('goofy', 'elizabeth'), 16)]),
     ((2016, 6), [(('i', 'will'), 15),
                  (('join', 'me'), 14),
                  (('i', 'am'), 12)]),
     ((2016, 7), [(('bernie', 'sanders'), 19),
                  (('our', 'country'), 17),
                  (('make', 'america'), 12)]),
     ((2016, 8), [(('i', 'will'), 14),
                  (('i', 'am'), 13),
                  (('join', 'me'), 10)]),
     ((2016, 9), [(('i', 'will'), 16),
                  (('join', 'me'), 16),
                  (('make', 'america'), 14)]),
     ((2016, 10), [(('i', 'will'), 5),
                   (('join', 'me'), 5),
                   (('about', 'her'), 4)])]

Because the results can be a bit hard to decipher in this form, we
have provided a function, ``util.pretty_print_by_month``, that takes
the output of ``find_top_k_ngrams_per_month`` and prints it in an
easier-to-read form. For example, here is the result of calling
``pretty_print_by_month`` on the list above::

    November 2015:
        make america        10
        south carolina       9
        america great        8
    December 2015:
        i will              34
        i am                26
        great again         19
    January 2016:
        i will              33
        ted cruz            28
        i am                16
    February 2016:
        ted cruz            36
        south carolina      35
        i will              32
    March 2016:
        i am                24
        i will              24
        ted cruz            19
    April 2016:
        new york            28
        i will              22
        great again         15
    May 2016:
        i will              25
        elizabeth warren    16
        goofy elizabeth     16
    June 2016:
        i will              15
        join me             14
        i am                12
    July 2016:
        bernie sanders      19
        our country         17
        make america        12
    August 2016:
        i will              14
        i am                13
        join me             10
    September 2016:
        i will              16
        join me             16
        make america        14
    October 2016:
        i will               5
        join me              5
        about her            4

Submission
==========

See `these submission instructions `__ if you are working alone.

See `these submission instructions `__ if you are working in the same
pair as for PA #2.

See `these submission instructions `__ if you are working in a *NEW*
pair.