==========================
Analyzing Candidate Tweets
==========================

**Due: Friday, October 20th at 4pm**

You may work alone or in a pair on this assignment.

The purpose of this assignment is to give you experience with using
dictionaries to represent data and as a mechanism for mapping keys to
values that change over time.  You will also get a chance to practice
using functions to avoid repeated code.

Introduction
============

On April 18th, Theresa May, the Prime Minister of the United Kingdom,
announced that she would call a snap election.  This news came as a
bit of a surprise since the next election was not due to be held
until 2020.  Her stated reason for calling the election was a desire
to strengthen the UK Government's hand in negotiations with the
European Union (EU) over Britain's exit from the EU (colloquially
referred to as Brexit).  While the election did not necessarily play
out to Prime Minister May's satisfaction, it did yield a trove of
tweets that we can mine for insight into British politics.

Unlike US Presidential Elections, which seem to last forever, the
period between when a general election is officially called in the UK
and the date the election is held is quite short, typically six weeks
or so.  During this pre-election period, known as `purdah `__, civil
servants are restricted from certain activities that might be viewed
as biased towards the party currently in power.  Purdah ran from
April 22nd until June 9th for the 2017 General Election.

For this assignment, we'll be analyzing tweets sent from the official
Twitter feeds of four parties: the Conservative and Unionist Party
(``@Conservatives``), the Labour Party (``@UKLabour``), the Liberal
Democrats (``@LibDems``), and the Scottish National Party
(``@theSNP``) during purdah.  We'll ask questions such as:

- What was ``@Conservatives``'s favorite hashtag during purdah?
  [``#bbcqt``]
- Who was mentioned at least 50 times by ``@UKLabour``?
  [``@jeremycorbyn``]
- What words occurred most often in ``@theSNP``'s tweets?
  [``snp, scotland, our, have, more``]
- What two-word phrases occurred most often in ``@LibDems``'s tweets?
  [``stand up, stop tories, will stand, theresa may, lib dems``]
- How do the parties' feeds change over time?  [It is a short
  election season, so not too much, especially for the
  Conservatives.]

  .. parsed-literal::

     April 2017:
       strong stable leadership        19
       stable leadership national       7
       leadership national interest     6
       can provide strong               5
       provide strong stable            5

     May 2017:
       strong stable leadership        79
       through brexit beyond           34
       only theresa may                30
       best brexit deal                26
       hand brexit negotiations        26

     June 2017:
       best brexit deal                78
       12-point plan brexit            32
       our 12-point plan               32
       get best brexit                 28
       polls are open                  25

For those of you who do not follow British politics, a few notes:

#. The hashtag ``#bbcqt`` refers to *BBC Question Time*, a political
   debate program on the British Broadcasting Corporation.
#. Jeremy Corbyn is the leader of the Labour Party.
#. Theresa May leads the Conservatives.
#. Nicola Sturgeon is the leader of the Scottish National Party.
#. Tim Farron was the leader of the Liberal Democrats during the 2017
   election.
#. The Conservatives are also known as the Tories.

Getting started
===============

Students working alone should use their personal repositories.  Pairs
should work exclusively in their pair repositories.  See the
instructions linked below for how to get a pair repository.

See `these start-up instructions `__ if you intend to work alone.
See `these start-up instructions `__ if you intend to work with the
same partner as in PA #2.

See `these start-up instructions `__ if you intend to work in a
*NEW* pair.

Here is a description of the contents of the ``pa3`` directory:

- ``basic_algorithms.py`` -- you will add code for Part 1 to this
  file.
- ``test_basic_algorithms.py`` -- this file contains very basic test
  code for the algorithms you will implement for Task 0.
- ``analyze.py`` -- you will add code for the tasks in Part 2 to this
  file.
- ``test_analyze.py`` -- test code for Part 2 of this writeup.
- ``util.py`` -- this file contains a few useful functions.
- ``load_tweets.py`` -- this file contains code to load the tweets
  for the four different parties for use during testing.
- ``data/`` -- a directory for the tweet files.  Some of the data
  files are large and are not included in the Git repository.  See
  the Data_ section below for instructions on how to obtain these
  files.

Recall that you can set up ``ipython3`` to reload code automatically
when it is modified by running the following commands after you fire
up ``ipython3``::

   In [1]: %load_ext autoreload

   In [2]: %autoreload 2

   In [3]: import basic_algorithms, analyze

Part 1: Basic Algorithms
========================

In this part, you will implement three algorithms: `top k`, `min
count`, and `frequent`.  In Part 2, you will use these algorithms to
analyze data extracted from tweets.  Before we describe the
algorithms you will implement, we first discuss a function that we
have provided for ordering tuples.

Ordering tuples
~~~~~~~~~~~~~~~

All three of the algorithms described in this section compute an
ordered list of (key, value) pairs.  We provide a function, named
``sort_count_pairs``, that will handle sorting the pairs for you.  It
takes a list of the form:

.. code-block:: python

   [(k0, v0), (k1, v1), ...]

and sorts it.  The natural sort order for pairs uses the first item
in the pair as the primary sort key and the second item as the
secondary sort key, and sorts both in ascending order.  This ordering
is not desirable for our purposes.  Instead, we use the values
(``v0``, ``v1``, etc.) as the primary sort key and order them in
descending order.  We use the keys (``k0``, ``k1``, etc.) as the
secondary sort key (that is, to break ties) and order them in
ascending lexicographic order.  For example, given the list:

.. code-block:: python

   [('D', 1), ('C', 2), ('A', 5), ('B', 2)]

our function would yield:

.. code-block:: python

   [('A', 5), ('B', 2), ('C', 2), ('D', 1)]

Top K
~~~~~

The first algorithm, `Top K`, computes the :math:`K` items that occur
most frequently in the list.  To do this computation, you will need
to:

#. use a dictionary to count the number of times each unique value
   occurs,
#. extract a list of (key, count) pairs from the dictionary,
#. sort the pairs using the supplied function, and finally,
#. pick off the first :math:`K` pairs.

The first step *must* be done with one pass over the data.

Here is an example use of this function::

   In [4]: l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']

   In [5]: basic_algorithms.find_top_k(l, 2)
   Out[5]: [('A', 5), ('B', 2)]

Minimum number of occurrences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The second algorithm, `Min Count`, finds the items in a list that
occur at least some specified minimum number of times.  To implement
this algorithm, you will need to:

#. compute the counts,
#. build a list of the items and associated counts that meet the
   threshold, and then
#. sort it using the supplied function (see the sketch below).
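Both algorithms share the counting step.  Here is a minimal sketch of
how the pieces might fit together; it assumes ``sort_count_pairs``
can be imported from ``util.py``, and the helper name ``count_items``
is ours, not part of the handout:

.. code-block:: python

   from util import sort_count_pairs

   def count_items(items):
       # Illustrative helper: count occurrences in one pass.
       counts = {}
       for item in items:
           counts[item] = counts.get(item, 0) + 1
       return counts

   def find_top_k(items, k):
       # Count, sort with the supplied function, keep the first k pairs.
       return sort_count_pairs(list(count_items(items).items()))[:k]

   def find_min_count(items, min_count):
       # Keep only the pairs that meet the threshold, then sort.
       pairs = [(item, n) for item, n in count_items(items).items()
                if n >= min_count]
       return sort_count_pairs(pairs)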
Here is an example use of this function::

   In [6]: l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']

   In [7]: basic_algorithms.find_min_count(l, 2)
   Out[7]: [('A', 5), ('B', 2), ('C', 2)]

Frequent items
~~~~~~~~~~~~~~

The previous two algorithms require space proportional to the number
of unique items in the list.  The third algorithm, `Frequent`, finds
the items in a sequence with a frequency that `exceeds` a :math:`1/K`
fraction of :math:`N`, the total number of items.  This algorithm
uses space proportional to :math:`K`, which is likely to be much
smaller than :math:`N`.  (Note: the value :math:`K` used in this
algorithm is unrelated to the value :math:`K` used in the Top K
algorithm.)

The `frequent` algorithm uses a data structure :math:`D` with up to
:math:`K-1` counters, each associated with a particular item, to
track an approximation to the frequency of the corresponding items
in the list.  For a given list item :math:`I`, you will update the
counters using the following rules:

- If item :math:`I` occurs in :math:`D`, then increment the count
  associated with :math:`I` by one.
- If item :math:`I` does not occur in :math:`D` and there are fewer
  than :math:`K-1` items in :math:`D`, then add :math:`I` with a
  value of one to :math:`D`.
- If item :math:`I` does not occur in :math:`D` and there are
  :math:`K-1` items in :math:`D`, then decrement all of the counters
  by one and remove any items with a count of zero from :math:`D`.

When implementing the third rule, it is important to remember that
you *cannot* safely modify a dictionary (or list) as you iterate
over it.

Once this data structure is computed, you will need to extract the
(key, count) pairs from it and sort them using our sort method.

Before we look at the result of applying this algorithm to the list
``l``, let's look at a more straightforward example::

   In [8]: l0 = ['A', 'A', 'B', 'B', 'A', 'B', 'C', 'B', 'A']

   In [9]: basic_algorithms.find_frequent(l0, 3)
   Out[9]: [('A', 3), ('B', 3)]

Notice that while the items identified by the `frequent` algorithm
are correct, the counts are not.  The algorithm computes an
approximation, not the exact count.

To help you understand what is happening in this algorithm, here's
some output from a call to an augmented version of ``find_frequent``
that shows the state of :math:`D` after processing each item in the
list:

.. parsed-literal::

   Input:
     items: ['A', 'A', 'B', 'B', 'A', 'B', 'C', 'B', 'A']
     k: 3

   After processing...'A', the state of D is...{'A': 1}
   After processing...'A', the state of D is...{'A': 2}
   After processing...'B', the state of D is...{'A': 2, 'B': 1}
   After processing...'B', the state of D is...{'A': 2, 'B': 2}
   After processing...'A', the state of D is...{'A': 3, 'B': 2}
   After processing...'B', the state of D is...{'A': 3, 'B': 3}
   After processing...'C', the state of D is...{'A': 2, 'B': 2}
   After processing...'B', the state of D is...{'A': 2, 'B': 3}
   After processing...'A', the state of D is...{'A': 3, 'B': 3}

   Output: [('A', 3), ('B', 3)]

Let's return to our earlier example::

   In [10]: l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']

   In [11]: basic_algorithms.find_frequent(l, 3)
   Out[11]: [('A', 3), ('D', 1)]

This result may seem a bit odd.  The algorithm guarantees that the
result will include any item whose frequency `exceeds` :math:`N/K`,
which is why ``'A'`` occurs in the result.  If there are fewer than
:math:`K-1` values with a frequency that exceeds :math:`N/K`, then
the result may include values that occur less frequently, which
explains the presence of ``'D'`` in the result.

Again, here is a step-by-step depiction of the process:

.. parsed-literal::

   Input:
     items: ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']
     k: 3

   After processing... 'A', the state of D is... {'A': 1}
   After processing... 'B', the state of D is... {'B': 1, 'A': 1}
   After processing... 'C', the state of D is... {}
   After processing... 'A', the state of D is... {'A': 1}
   After processing... 'A', the state of D is... {'A': 2}
   After processing... 'B', the state of D is... {'B': 1, 'A': 2}
   After processing... 'A', the state of D is... {'B': 1, 'A': 3}
   After processing... 'C', the state of D is... {'A': 2}
   After processing... 'A', the state of D is... {'A': 3}
   After processing... 'D', the state of D is... {'A': 3, 'D': 1}

   Output: [('A', 3), ('D', 1)]

(Note: you should **not** augment your code to produce this output.
We included it for explanatory purposes only.)

This algorithm and other similar algorithms can also be used with
streaming data, that is, data that arrives continuously.  See the
paper `Finding Frequent Items in Data Streams `__ by Cormode and
Hadjieleftheriou for a good summary and an explanation of the
relationship between the frequencies reported in the results and the
true frequencies.
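To tie the three update rules together, here is a minimal sketch of
the `frequent` algorithm, again assuming ``sort_count_pairs`` comes
from ``util.py``:

.. code-block:: python

   from util import sort_count_pairs

   def find_frequent(items, k):
       # D holds at most k - 1 counters with approximate counts.
       counter = {}
       for item in items:
           if item in counter:
               counter[item] += 1
           elif len(counter) < k - 1:
               counter[item] = 1
           else:
               # Decrement every counter; iterate over a copy of the
               # keys because we cannot modify D while iterating over it.
               for key in list(counter):
                   counter[key] -= 1
                   if counter[key] == 0:
                       del counter[key]
       return sort_count_pairs(list(counter.items()))

Tracing this sketch on ``l0`` and ``l`` reproduces the step-by-step
output shown above.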
Task 0
~~~~~~

Your first task is to add code to ``basic_algorithms.py`` to
implement the three algorithms described above.  We have provided
code, ``test_basic_algorithms.py``, that runs a few test cases for
each of the algorithms.  You will want to augment this code with some
tests that you run by hand in ``ipython3``.  To figure out what tests
you might want to run by hand, you should inspect our tests and ask
yourself "Why did they choose these tests?" and "What tests are
obviously missing?"

As in the previous assignments, our test code uses the ``pytest``
testing framework.  You can use the ``-k`` flag and the name of the
algorithm (``top_k``, ``min_count``, and ``frequent``) to run the
test code for a specific algorithm.  For example, running the
following command from the **Linux command-line**::

   $ py.test -xv -k min_count test_basic_algorithms.py

will run the tests for the `min count` algorithm.  (As usual, we use
``$`` to signal the **Linux command-line prompt**.  You should not
type the ``$``.)

Part 2: Analyzing Tweets
========================

Now that you have implemented the basic analysis algorithms, you can
move on to analyzing the Twitter feeds.  Your code for this part and
all subsequent parts should be added to the file ``analyze.py``.
This file contains a main block to allow the program to be run from
the command-line.  Run the following command from the **Linux
command-line**::

   $ python3 analyze.py --help

to get a description of the arguments.

While we provide function headers for each of the required tasks, the
structure of the rest of the code is entirely up to you.  We will
note that while some tasks can be done cleanly with one function,
others cannot.  We expect you to look for sub-tasks that are common
to multiple tasks and to reuse previously completed functions.

Testing
~~~~~~~

We have provided code for testing this part in ``test_analyze.py``.
Our test suite contains tests for checking your code on the examples
shown below and on various corner cases.
Data
~~~~

Files
-----

Twitter makes it possible to search for tweets with particular
properties, say, from a particular user, containing specific terms,
and within a given range of dates.  There are several Python
libraries that simplify the process of using this feature of
Twitter.  We used the `TwitterSearch `_ library to gather tweets from
the Twitter feeds of ``@Conservatives``, ``@UKLabour``, ``@theSNP``,
and ``@LibDems`` from the purdah period and stored the resulting data
in JSON files.

These files are large, so we did not include them in the
distribution.  Instead, we provided a shell script called
``get_large_files.sh`` to download them.  To grab these files, change
to your ``pa3/data`` directory and run this command from the Linux
command-line::

   $ ./get_large_files.sh

This script will download one file per party:

- ``Conservatives.json``
- ``UKLabour.json``
- ``LibDems.json``
- ``theSNP.json``

Please note that you must be connected to the network to use this
script.  Also, if you wish to use both CSIL & the Linux servers and
your VM, you will need to run the ``get_large_files.sh`` script
twice, once for CSIL & the Linux servers and once for your VM.

**Do not add the data to your repository!**

Representing tweets
-------------------

The representation of a tweet contains a lot of information:
creation time, hashtags used, users mentioned, text of the tweet,
etc.  For example, here is a tweet sent in mid-May by ``@UKLabour``:

.. code-block:: python

   'RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…'

and here is an abridged version of the corresponding tweet dictionary
that includes a few of the 20+ key/value pairs:

.. code-block:: python

   {'created_at': 'Thu May 18 19:44:01 +0000 2017',
    'entities': {'hashtags': [{'indices': [62, 74],
                               'text': 'GetInvolved'},
                              {'indices': [132, 139],
                               'text': 'GE2017'}],
                 'symbols': [],
                 'urls': [{'display_url': 'gov.uk/register-to-vo…',
                           'expanded_url': 'http://gov.uk/register-to-vote',
                           'indices': [108, 131],
                           'url': 'https://t.co/2Lf9M2q3mP'}],
                 'user_mentions': [{'id': 1597669326,
                                    'id_str': '1597669326',
                                    'indices': [3, 15],
                                    'name': 'Patchwork Foundation',
                                    'screen_name': 'UKPatchwork'},
                                   {'id': 105573429,
                                    'id_str': '105573429',
                                    'indices': [18, 30],
                                    'name': 'Iain McNicol',
                                    'screen_name': 'IainMcNicol'},
                                   {'id': 14291684,
                                    'id_str': '14291684',
                                    'indices': [35, 44],
                                    'name': 'The Labour Party',
                                    'screen_name': 'UKLabour'}]},
    'text': 'RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…'}

Collectively, hashtags, user mentions, URLs, etc. are referred to as
entities.  Notice that the value associated with the key
``"entities"`` in the tweet dictionary is itself a dictionary that
maps strings to lists of dictionaries.  The structure of the
different types of entity dictionaries depends on the entity type.
For example, if we wanted to find the users mentioned in a tweet,
we'd extract the ``"screen_name"`` value from the dictionaries in the
list associated with the ``"user_mentions"`` key in the
``"entities"`` dictionary.

We encourage you to play around with extracting information about
hashtags, user mentions, etc. for a few tweets before you start
working on the tasks for Part 2.  To simplify the process of
exploring the data and testing, we provided code
(``load_tweets.py``) that loads the tweet files and assigns the
results to variables, one per party.  We also defined variables for a
couple of specific tweets.  The one above, for example, is named
``tweet0``.

The sample code below assumes that ``load_tweets.py`` has been run as
follows::

   In [12]: run load_tweets.py
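For instance, here is one way to pull out the screen names mentioned
in ``tweet0``.  This snippet is just for exploration; the expected
result follows from the abridged dictionary shown above:

.. code-block:: python

   # Each entry in the "user_mentions" list is itself a dictionary;
   # the screen name is stored under the "screen_name" key.
   mentions = tweet0["entities"]["user_mentions"]
   screen_names = [m["screen_name"] for m in mentions]
   # screen_names is now ['UKPatchwork', 'IainMcNicol', 'UKLabour']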
Finding commonly-occurring entities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What are ``@theSNP``'s favorite hashtags?  Which URLs are referenced
at least 5 times by ``@LibDems``?  Are there users who represent at
least 5% of the mentions in ``@Conservatives`` tweets?  Answering
these questions requires us to extract the desired entities
(hashtags, user mentions, etc.) from the parties' tweets and process
them appropriately.

For the tasks that use tweets, we will ignore case differences and
so, for example, we will consider ``"#bbcqt"`` to be the same as
``"#BBCQT"``.

Common Parameters
-----------------

The next three tasks share two parameters.  To avoid repetition, we
describe them here:

- ``tweets`` will be a list of dictionaries representing tweets.
- ``entity_key`` will be one of the following pairs: ``("hashtags",
  "text")``, ``("urls", "url")``, or ``("user_mentions",
  "screen_name")``.  The first item in the pair tells us the type of
  entity and the second tells us the value of interest from the
  entities of that type.

Task 1: Top K entities
----------------------

For Task 1, you must complete the function:

.. code-block:: python

   def find_top_k_entities(tweets, entity_key, k):

where the first two parameters are as described above and ``k`` is an
integer.  This function, which is in ``analyze.py``, should return a
sorted list of (entity, count) pairs with the ``k`` most frequently
occurring entities and their associated counts.

Here's a sample call using the tweets from ``@theSNP``,
``("hashtags", "text")`` as the entity key, and a value of three for
``k``:

.. code-block:: python

   In [13]: analyze.find_top_k_entities(theSNP, ("hashtags", "text"), 3)
   Out[13]: [('votesnp', 625), ('ge17', 428), ('snpbecause', 195)]
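All three entity tasks can reuse the same extraction step before
handing off to the Part 1 algorithms.  Here is one possible shape for
this function, sketched under the assumption that a small helper
(the name ``extract_entities`` is ours) pulls out and lowercases the
requested entity values, per the case-folding rule above:

.. code-block:: python

   import basic_algorithms

   def extract_entities(tweets, entity_key):
       # Illustrative helper: collect the requested value from each
       # entity of the requested type, ignoring case.
       entity_type, value_key = entity_key
       values = []
       for tweet in tweets:
           for entity in tweet["entities"][entity_type]:
               values.append(entity[value_key].lower())
       return values

   def find_top_k_entities(tweets, entity_key, k):
       return basic_algorithms.find_top_k(
           extract_entities(tweets, entity_key), k)

Tasks 2 and 3 below can follow the same pattern, delegating to
``find_min_count`` and ``find_frequent`` respectively.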
Task 2: Minimum number of occurrences
-------------------------------------

For Task 2, you must complete the function:

.. code-block:: python

   def find_min_count_entities(tweets, entity_key, min_count):

where the first two parameters are as described above and
``min_count`` is an integer that specifies the minimum number of
times an entity must occur to be included in the result.  This
function should return a sorted list of (entity, count) pairs.

Here's a sample use of this function using the tweets from
``@LibDems`` with ``("user_mentions", "screen_name")`` as the entity
key and a value of 100 for ``min_count``:

.. code-block:: python

   In [14]: analyze.find_min_count_entities(LibDems, ("user_mentions", "screen_name"), 100)
   Out[14]:
   [('libdems', 568),
    ('timfarron', 547),
    ('liberalbritain', 215),
    ('libdempress', 115)]

Task 3: Frequent entities
-------------------------

For Task 3, you must implement the function:

.. code-block:: python

   def find_frequent_entities(tweets, entity_key, k):

where the first two parameters are as described above and ``k`` is an
integer.  This function should return a sorted list of (entity,
count) pairs computed using the `frequent` algorithm.

Here's a sample use of this function on the tweets from
``@Conservatives`` with ``("hashtags", "text")`` as the entity key
and a value of five for ``k``:

.. code-block:: python

   In [15]: analyze.find_frequent_entities(Conservatives, ("hashtags", "text"), 5)
   Out[15]: [('bbcqt', 94), ('ge2017', 64), ('chaos', 33), ('voteconservative', 1)]

Analyzing N-grams
~~~~~~~~~~~~~~~~~

If looking at frequently occurring hashtags or finding all the users
mentioned some minimum number of times yields some insight into a
party, what might we learn by identifying commonly occurring words,
pairs of words, word triples, etc.?  In this part, you will apply the
three algorithms described earlier to contiguous sequences of
:math:`N` words, which are known as `n-grams`.  Before you apply
these algorithms to a party's tweets, you will pre-process the
tweets to try to reveal salient words and then extract the n-grams.

Pre-processing step
-------------------

The pre-processing step converts the text of a tweet into a list of
strings.  We will define a `word` to be any sequence of characters
delimited by whitespace.  For example, "abc", "10", and
"#2017election." are all words by this definition.

Once you have extracted the words from a tweet, you should remove
leading and trailing punctuation and convert the words to lower
case.  We have defined a constant, ``PUNCTUATION``, that specifies
which characters constitute punctuation for this assignment.  Note
that apostrophes, as in the word "it's", should not be removed,
because they occur in the middle of the word.

Finally, you should eliminate the stop words, symbols, and emoji
that appear in the defined constant ``STOP_WORDS`` and words that
start with any of the strings in the defined constant
``STOP_PREFIXES`` (which includes ``"#"``, ``"@"``, and ``"http"``).
*Stop words* are commonly occurring words, such as "and", "of", and
"to", that convey little insight into what makes one tweet different
from another.  We included symbols, such as less than (``<``), and
emoji in the constant ``STOP_WORDS`` to focus attention on the
important words in the tweets.

Here is an example: pre-processing the text (see ``tweet1`` defined
in ``load_tweets.py``)

.. parsed-literal::

   Things you need to vote Polling card? ✘ ID? ✘ A dog? ✘ Just yourself ✔ You've got until 10pm – #VoteLabour now… https://t.co/sCDJY1Pxc9

would yield:

.. code-block:: python

   ('things', 'need', 'vote', 'polling', 'card', 'id', 'dog', 'just',
    'yourself', "you've", 'got', 'until', '10pm', 'now')

You will find the ``lower``, ``split``, ``strip``, and
``startswith`` methods from the `string API `_ useful for this step.
You will save yourself a lot of time and unnecessary work if you
read about these methods in detail *before* you start writing code.

Representing N-grams
--------------------

Your implementation should compute the n-grams of a tweet after
pre-processing the tweet's text.  These n-grams should be represented
as tuples of strings.  Given a value of 2 for :math:`N`, the above
``@UKLabour`` tweet would yield the following bi-grams (2-grams):

.. code-block:: python

   [('things', 'need'),
    ('need', 'vote'),
    ('vote', 'polling'),
    ('polling', 'card'),
    ('card', 'id'),
    ('id', 'dog'),
    ('dog', 'just'),
    ('just', 'yourself'),
    ('yourself', "you've"),
    ("you've", 'got'),
    ('got', 'until'),
    ('until', '10pm'),
    ('10pm', 'now')]
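Putting the two previous sections together, here is a rough sketch of
the pre-processing and n-gram extraction steps.  The function names
are ours, and the sketch assumes the ``PUNCTUATION``,
``STOP_WORDS``, and ``STOP_PREFIXES`` constants behave as described
above (in particular, that stripping ``PUNCTUATION`` does not remove
the stop-prefix characters such as ``"#"`` and ``"@"``):

.. code-block:: python

   def preprocess(text):
       # Split on whitespace, strip leading/trailing punctuation,
       # lowercase, then drop stop words and stop-prefixed words.
       words = []
       for word in text.split():
           word = word.strip(PUNCTUATION).lower()
           if (word and word not in STOP_WORDS and
                   not word.startswith(tuple(STOP_PREFIXES))):
               words.append(word)
       return words

   def to_ngrams(words, n):
       # Contiguous length-n windows, represented as tuples.
       return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

With ``n`` equal to 2, ``to_ngrams(preprocess(...), 2)`` applied to
the text of ``tweet1`` should reproduce the bi-gram list above.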
Common parameters
-----------------

As in Tasks 1-3, Tasks 4-6 have two parameters in common:

- ``tweets`` is a list of dictionaries representing tweets.
- ``n`` is the number of words in an n-gram.

Task 4: Top K N-grams
---------------------

Your task is to implement the function:

.. code-block:: python

   def find_top_k_ngrams(tweets, n, k):

where the first two parameters are as described above and ``k`` is
the desired number of n-grams.  This function should return a sorted
list of (n-gram, integer count of occurrences) pairs for the ``k``
most frequently occurring n-grams.

Here's a sample use of this function using the tweets from
``@theSNP`` with two as the value for ``n`` and three as the value
for ``k``:

.. code-block:: python

   In [16]: analyze.find_top_k_ngrams(theSNP, 2, 3)
   Out[16]: [(('nicola', 'sturgeon'), 81), (('read', 'more'), 67), (('stand', 'up'), 55)]

Task 5: Minimum number of n-gram occurrences
--------------------------------------------

Your task is to implement the function:

.. code-block:: python

   def find_min_count_ngrams(tweets, n, min_count):

where the first two parameters are as described above and
``min_count`` specifies the minimum number of times an n-gram must
occur to be included in the result.  This function should return a
sorted list of (n-gram, integer count of occurrences) pairs.

Here's a sample use of this function using the tweets from
``@LibDems`` with two as the value for ``n`` and 100 as the value
for ``min_count``:

.. code-block:: python

   In [17]: analyze.find_min_count_ngrams(LibDems, 2, 100)
   Out[17]:
   [(('stand', 'up'), 189),
    (('stop', 'tories'), 189),
    (('will', 'stand'), 166),
    (('theresa', 'may'), 125),
    (('lib', 'dems'), 116),
    (('can', 'stop'), 104),
    (('only', 'can'), 100)]

Task 6: Frequent N-grams
------------------------

Your task is to implement the function:

.. code-block:: python

   def find_frequent_ngrams(tweets, n, k):

where the first two parameters are as described above and ``k`` is
an integer.  This function should return a sorted list of (n-gram,
approximate count) pairs as computed using the `Frequent` algorithm.

For example, using this function on the tweets from
``@Conservatives`` with two as the value for ``n`` and three as the
value for ``k`` would yield:

.. code-block:: python

   In [18]: analyze.find_frequent_ngrams(Conservatives, 2, 3)
   Out[18]: [(('session', 'mk'), 1)]

By Month Analysis
~~~~~~~~~~~~~~~~~

Task 7
------

Your last task is to implement the function:

.. code-block:: python

   def find_top_k_ngrams_per_month(tweets, n, k):

which takes the same parameters as ``find_top_k_ngrams`` and returns
a list of tuples of the form ((year, month), sorted list of the top
``k`` ``n``-grams and their counts).  The resulting list should be
sorted using the built-in ``sort`` method rather than
``sort_count_pairs``.

Here's a sample use of this function using the tweets from
``@UKLabour`` with two as the value for ``n`` and five as the value
for ``k``::

   In [19]: analyze.find_top_k_ngrams_per_month(UKLabour, 2, 5)
   Out[19]:
   [((2017, 4),
     [(('nhs', 'staff'), 15),
      (('labour', 'will'), 13),
      (('labour', 'government'), 11),
      (('register', 'vote'), 11),
      (('2', 'mins'), 9)]),
    ((2017, 5),
     [(('labour', 'will'), 51),
      (('register', 'vote'), 39),
      (('our', 'manifesto'), 26),
      (('tories', 'have'), 25),
      (('not', 'few'), 24)]),
    ((2017, 6),
     [(('labour', 'will'), 39),
      (('labour', 'gain'), 36),
      (('8', 'june'), 22),
      (('voting', 'labour'), 20),
      (('vote', 'labour'), 17)])]

Because the results can be a bit hard to decipher in this form, we
have provided a function, ``util.pretty_print_by_month``, that takes
the output of ``find_top_k_ngrams_per_month`` and prints it in an
easier-to-read form.  For example, here is a sample use of this
function::

   In [20]: util.pretty_print_by_month(analyze.find_top_k_ngrams_per_month(UKLabour, 2, 5))
   April 2017:
     nhs staff            15
     labour will          13
     labour government    11
     register vote        11
     2 mins                9
   May 2017:
     labour will          51
     register vote        39
     our manifesto        26
     tories have          25
     not few              24
   June 2017:
     labour will          39
     labour gain          36
     8 june               22
     voting labour        20
     vote labour          17

The function ``util.grab_year_month`` will be useful for this task.
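As a rough outline, Task 7 can group the tweets by month and then
reuse ``find_top_k_ngrams``.  The sketch below assumes that
``util.grab_year_month`` maps a tweet's ``'created_at'`` string to a
(year, month) pair; check ``util.py`` for its actual interface:

.. code-block:: python

   import util

   def find_top_k_ngrams_per_month(tweets, n, k):
       # Group the tweets by (year, month)...
       by_month = {}
       for tweet in tweets:
           # Assumption: grab_year_month turns a created_at string
           # into a (year, month) pair -- verify against util.py.
           ym = util.grab_year_month(tweet["created_at"])
           by_month.setdefault(ym, []).append(tweet)
       # ...then compute the top k n-grams within each month and sort
       # the result by (year, month) using the built-in sort.
       result = [(ym, find_top_k_ngrams(group, n, k))
                 for ym, group in by_month.items()]
       result.sort()
       return result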
Submission
==========

See `these submission instructions `__ if you are working alone.

See `these submission instructions `__ if you are working in the
same pair as for PA #2.

See `these submission instructions `__ if you are working in a *NEW*
pair.