==========================
Analyzing Candidate Tweets
==========================

**Due: Friday, October 21st at 4pm**

You may work alone or in a pair on this assignment.

The purpose of this assignment is to give you experience with using
dictionaries to represent data and as a mechanism for mapping keys to
values that change over time.

Introduction
============

The candidates in the 2016 presidential primaries and general election
have used Twitter as a way to communicate with their supporters (and
detractors) and gain media attention. In this assignment, you will
analyze tweets from Hillary Clinton's (HRC) and Donald Trump's (DJT)
Twitter feeds to gain some insight into how they use Twitter as a
campaign tool. We'll ask questions such as:

- What is HRC's favorite hashtag? [#demdebate]
- Who has DJT mentioned at least 100 times? [realdonaldtrump, cnn, foxnews]
- What words occur most often in HRC's tweets? [our, with, are, have, not]
- What two-word phrases occur most often in DJT's tweets? [i will,
  make america, great again, i am, america great]
- How do the candidates' feeds change over time? [Not as much as you
  might think!]

Getting started
===============

See `these start-up instructions `__ if you intend to work alone.

See `these start-up instructions `__ if you intend to work with the
same partner as in PA #2.

See `these start-up instructions `__ if you intend to work in a *NEW*
pair.

Here is a description of the contents of this directory:

- ``basic_algorithms.py`` -- you will add code for the basic analysis
  algorithms to this file.

- ``test_basic_algorithms.py`` -- this file contains very basic test
  code for the basic algorithms.

- ``analyze.py`` -- you will add your code for extracting information
  from tweets and analyzing it to this file. Note that this file
  includes several constants and two functions that you may find
  useful. They are documented with code comments, and you should make
  sure to read and understand them before you start writing code.

- ``test_analyze_writeup.py`` -- test code for the examples in the
  writeup.

- ``util.py`` -- this file contains a few useful functions.

- ``data/`` -- a directory for the data files. Since the data files
  are very large, they are not included in the Git repository. See
  below for instructions on how to obtain the data files.

The ``data/`` directory contains a file called ``get_files.sh`` that
will download the data files. To download the data files, go into the
``data`` directory and run this command::

    ./get_files.sh

This script will download the complete HRC dataset (``hrc-all.json``),
the complete DJT dataset (``djt-all.json``), and several smaller
datasets that may be useful for testing purposes. Please note that you
must be connected to the network to use this script.

**Do not add the data to your repository!** If you wish to use both
CSIL & the Linux servers and your VM, you will need to run the
``get_files.sh`` script twice: once for CSIL & the Linux servers and
once for your VM.

Part 1: Basic Algorithms
========================

In this part, you will implement three algorithms: `top k`, `min
count`, and `frequent`. In Part 2, you will use these algorithms to
analyze data extracted from tweets.

Before we describe the algorithms you will implement, we first discuss
a function that we have provided for ordering tuples.

*Ordering tuples*

All three of the algorithms described in this section compute an
ordered list of (key, value) pairs. We provide a function, named
``sort_count_pairs``, that will handle sorting the pairs for you. It
takes a list of the form:

.. code-block:: python

    [(k0, v0), (k1, v1), ...]

and sorts it. The natural sort order for pairs uses the first item in
the pair as the primary sort key and the second item as the secondary
sort key, which is not the ordering we want. Instead, our function uses
the values (``v0``, ``v1``, etc.) as the primary sort key, placing
pairs with larger values before pairs with smaller values. If two
values are the same, then the keys (``k0``, ``k1``, etc.) are used as
the secondary sort key and are ordered in increasing order. For
example, given the list:

.. code-block:: python

    [('D', 1), ('C', 2), ('A', 5), ('B', 2)]

our function would yield:

.. code-block:: python

    [('A', 5), ('B', 2), ('C', 2), ('D', 1)]
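If you are curious how such an ordering can be expressed, here is a
minimal sketch using Python's built-in ``sorted``. It illustrates the
ordering described above; it is not necessarily how the provided
``sort_count_pairs`` is implemented, and you should use the provided
function in your solutions.

.. code-block:: python

    def sort_count_pairs_sketch(pairs):
        # Primary key: the count, in decreasing order (hence the negation).
        # Secondary key: the item itself, in increasing order.
        return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

    # Matches the example above:
    # sort_count_pairs_sketch([('D', 1), ('C', 2), ('A', 5), ('B', 2)])
    # => [('A', 5), ('B', 2), ('C', 2), ('D', 1)]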
*Top K*

The first algorithm, `Top K`, which will be familiar from class,
computes the :math:`K` items that occur most frequently in the list.
To do this computation, you will need to:

#. use a dictionary to count the number of times each unique value
   occurs,
#. extract a list of (key, count) pairs from the dictionary,
#. sort the pairs using the supplied function, and finally,
#. pick off the first :math:`K` pairs.

The first step can and should be done with one pass over the data.

Given the following list:

.. code-block:: python

    l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']

and a value of two for :math:`K`, this algorithm would yield:

.. code-block:: python

    [('A', 5), ('B', 2)]

*Minimum number of occurrences*

The second algorithm, `Min Count`, finds the items in a list that
occur at least some specified minimum number of times. To perform this
algorithm, you will need to:

#. compute the counts,
#. build a list of the items and associated counts that meet the
   threshold, and then
#. sort it using the supplied function.

Running this algorithm on the list ``l`` defined above with a minimum
count of two would yield the following list:

.. code-block:: python

    [('A', 5), ('B', 2), ('C', 2)]
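To make the steps above concrete, here is a hedged sketch of these two
algorithms. The ``count_items`` helper is our own name, and we assume
``sort_count_pairs`` can be imported from ``util.py`` (check the
provided files for its actual location):

.. code-block:: python

    from util import sort_count_pairs   # assumed location; adjust as needed

    def count_items(items):
        # One pass over the data: map each distinct item to its count.
        counts = {}
        for item in items:
            counts[item] = counts.get(item, 0) + 1
        return counts

    def find_top_k(items, k):
        # Count, extract the (key, count) pairs, sort, keep the first k.
        pairs = list(count_items(items).items())
        return sort_count_pairs(pairs)[:k]

    def find_min_count(items, min_count):
        # Keep only the pairs whose count meets the threshold, then sort.
        pairs = [(key, c) for key, c in count_items(items).items()
                 if c >= min_count]
        return sort_count_pairs(pairs)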
*Frequent items*

The previous two algorithms require space proportional to the number
of unique items in the list. The third algorithm, `Frequent`, finds
the items in a sequence with a frequency that `exceeds` a :math:`1/K`
fraction of :math:`N`, the total number of items. This algorithm uses
space proportional to :math:`K`, which is likely to be much smaller
than :math:`N`.

The `frequent` algorithm uses a data structure :math:`D` with up to
:math:`K-1` counters, each associated with a particular item, to track
an approximation to the frequency of the corresponding items in the
list. For a given list item :math:`I`, you will update the counters
using the following rules:

- If item :math:`I` occurs in :math:`D`, then increment by one the
  count associated with :math:`I`.
- If item :math:`I` does not occur in :math:`D` and there are fewer
  than :math:`K-1` items in :math:`D`, then add :math:`I` with a value
  of 1 to :math:`D`.
- If item :math:`I` does not occur in :math:`D` and there are
  :math:`K-1` items in :math:`D`, then decrement all of the counters
  by one and remove any items with a count of zero from :math:`D`.

Once this data structure is computed, you will need to extract the
(key, count) pairs from it and sort them using our sort method.

Before we look at the result of applying this algorithm to the list
``l``, let's look at a more straightforward example. Given the list:

.. code-block:: python

    ["A", "A", "B", "B", "A", "B", "C", "B", "A"]

and a value of three for :math:`K`, the `frequent` algorithm will
yield:

.. code-block:: python

    [('A', 3), ('B', 3)]

Notice that while the items identified by the `frequent` algorithm are
correct, the counts are not. The algorithm computes an approximation,
not the exact count.

Returning to our earlier example, running the algorithm on the list
``l`` defined above and a value of three for :math:`K` would yield:

.. code-block:: python

    [('A', 3), ('D', 1)]

This result may seem a bit odd. The algorithm guarantees that the
result will include any item whose frequency `exceeds` :math:`N/K`,
which is why ``'A'`` occurs in the result. If there are fewer than
:math:`K-1` items with a frequency that exceeds :math:`N/K`, then the
result may include items that occur less frequently, which explains
the presence of ``'D'`` in the result.

It is instructive to think through what happens when running this
algorithm with the list ``l`` defined above and two as the value of
:math:`K`. The data structure :math:`D` will contain one key, ``'A'``,
with an associated value of one after processing the last ``'A'`` in
the list. When the last value (``'D'``) in the list is processed, the
third rule (decrement all the counts) will be invoked and, as a
result, the data structure :math:`D` will be empty when the algorithm
finishes.

This algorithm and other similar algorithms can also be used with
streaming data. See the paper `Finding Frequent Items in Data Streams
`__ by Cormode and Hadjieleftheriou for a good summary and an
explanation of the relationship between the frequencies reported in
the results and the true frequencies.
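Here is a hedged sketch of how the three update rules might be
realized in code. The dictionary mirrors the :math:`D` described
above, and we again assume ``sort_count_pairs`` is importable from
``util.py``:

.. code-block:: python

    from util import sort_count_pairs   # assumed location; adjust as needed

    def find_frequent(items, k):
        d = {}   # item -> approximate counter; holds at most k-1 entries
        for item in items:
            if item in d:
                d[item] += 1                 # rule 1: known item
            elif len(d) < k - 1:
                d[item] = 1                  # rule 2: room for a new item
            else:
                # Rule 3: decrement every counter; drop items that hit zero.
                for key in list(d):
                    d[key] -= 1
                    if d[key] == 0:
                        del d[key]
        return sort_count_pairs(list(d.items()))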
*Task 0*

Your first task is to implement the three algorithms described above.
We have provided code that runs a few test cases for each of the
algorithms (see ``test_basic_algorithms.py``). Our code assumes that
you have defined three functions---``find_top_k``, ``find_min_count``,
and ``find_frequent``---each of which takes a list and an integer as
parameters and returns a sorted list of tuples. We encourage you to
inspect the tests and ask yourself "Why did they choose these tests?"
and "What tests are obviously missing?"

As in the previous assignments, our test code uses the ``pytest``
testing framework. You can use the ``-k`` flag and the name of the
algorithm (``top_k``, ``min_count``, and ``frequent``) to run the test
code for a specific algorithm. For example::

    py.test test_basic_algorithms.py -k min_count

will run the tests for the `min count` algorithm.

Data
====

Twitter makes it possible to search for tweets with particular
properties, say, from a particular user, containing specific terms,
and within a given range of dates. There are several Python libraries
that simplify the process of using this feature of Twitter. We used
the `TwitterSearch `_ library to gather tweets from HRC's and DJT's
Twitter feeds from mid-November 2015 through mid-October 2016 and
stored the resulting data in JSON files.

We have provided code that reads the data from the JSON files back
into a list of dictionaries, one per tweet. The representation of a
tweet contains a lot of information: creation time, hashtags used,
users mentioned, text of the tweet, etc. For example, here is the
fourth tweet from HRC from Feb 1st::

    RT @HillaryforIA: The #iacaucus starts in 24 hours! #ImWithHer https://t.co/bF5fyfKbFt

and here is an abbreviated version of the corresponding tweet
dictionary that includes a few of the 20+ key/value pairs:

.. code-block:: python

    {'created_at': 'Mon Feb 01 01:18:37 +0000 2016',
     'entities': {'hashtags': [{'indices': [22, 31], 'text': 'iacaucus'},
                               {'indices': [52, 62], 'text': 'ImWithHer'}],
                  'symbols': [],
                  'urls': [],
                  'user_mentions': [{'id': 3056142064,
                                     'id_str': '3056142064',
                                     'indices': [3, 16],
                                     'name': 'Hillary for Iowa',
                                     'screen_name': 'HillaryforIA'}]},
     'text': 'RT @HillaryforIA: The #iacaucus starts in 24 hours! #ImWithHer https://t.co/bF5fyfKbFt'}

Notice that the value associated with ``entities``, which contains
information about hashtags, users mentioned, URLs included, etc., is
itself a dictionary that maps strings to lists of dictionaries. The
structure of the different entity dictionaries depends on the entity
type. For example, if we wanted to find the users mentioned in a
tweet, we'd extract the ``"screen_name"`` values from the dictionaries
in the list associated with the ``"user_mentions"`` key in the
``"entities"`` dictionary.
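As an illustration of navigating this nested structure, here is a
hedged sketch of a helper that pulls the values for one kind of entity
out of a single tweet dictionary. The function name is ours, not part
of the provided code; the lower-casing reflects the fact that the
tasks below ignore case differences:

.. code-block:: python

    def extract_entity_values(tweet, entity_key, value_key):
        # E.g., entity_key="user_mentions" and value_key="screen_name".
        # tweet['entities'][entity_key] is a list of dictionaries; pull
        # one field out of each and normalize its case.
        return [entity[value_key].lower()
                for entity in tweet['entities'][entity_key]]

    # For the tweet shown above:
    # extract_entity_values(tweet, 'user_mentions', 'screen_name')
    # => ['hillaryforia']
    # extract_entity_values(tweet, 'hashtags', 'text')
    # => ['iacaucus', 'imwithher']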
Part 2: Analyzing Tweets
========================

Now that you have implemented the basic analysis algorithms, you can
move on to analyzing the Twitter feeds. Your code for this part and
all subsequent parts should be added to the file ``analyze.py``. This
file contains a main block to allow the program to be run from the
command-line. Run::

    python3 analyze.py --help

to get a description of the arguments.

While we provide function headers for each of the required tasks, the
structure of the rest of the code is entirely up to you. We will note
that while some tasks can be done cleanly with one function, others
cannot. Also, we recommend looking for sub-tasks that are common to
multiple tasks and re-using previously completed functions.

Finding commonly-occurring entities
-----------------------------------

What are DJT's favorite hashtags? Which URLs are referenced at least 5
times by HRC? Are there users who represent at least 5% of the
mentions in DJT's tweets? Answering these questions requires us to
extract the desired entities (hashtags, user mentions, etc.) from the
candidates' tweets and process them appropriately. For these tasks, we
will ignore case differences and so, for example, we will consider
"demdebate" to be the same as "DemDebate".

*Common Parameters*

The next three tasks will use many of the same parameters. To avoid
repetition later, we describe them here:

- ``tweets`` will be a list of tweet dictionaries,
- ``entity_key`` will be one of ``"hashtags"``, ``"urls"``, or
  ``"user_mentions"``, and
- ``value_key`` will be one of ``"text"``, ``"url"``, or
  ``"screen_name"``.

*Task 1: Top K entities*

For Task 1, you must complete the function:

.. code-block:: python

    def find_top_k_entities(tweets, entity_key, value_key, k):

where the first three parameters are as described above and ``k`` is
an integer. This function should return a sorted list of (entity,
count) pairs with the ``k`` most frequently occurring entities and
their associated counts.

Running this function on the tweets in ``hrc-feb.json`` with
``"hashtags"`` as the entity key, ``"text"`` as the value key, and a
value of three for ``k``, for example, would yield:

.. code-block:: python

    [('imwithher', 70), ('demdebate', 49), ('demtownhall', 28)]

*Task 2: Minimum number of occurrences*

For Task 2, you must complete the function:

.. code-block:: python

    def find_min_count_entities(tweets, entity_key, value_key, min_count):

where the first three parameters are as described above and
``min_count`` is an integer that specifies the minimum number of times
an entity must occur to be included in the result. This function
should return a sorted list of (entity, count) pairs.

Running this function on the tweets in ``djt-feb.json`` with
``"hashtags"`` as the entity key, ``"text"`` as the value key, and a
value of 15 for ``min_count``, for example, would yield:

.. code-block:: python

    [('trump2016', 54), ('makeamericagreatagain', 50), ('fitn', 16)]

FYI, #FITN stands for "First In The Nation".

*Task 3: Frequent entities*

For Task 3, you must implement the function:

.. code-block:: python

    def find_frequent_entities(tweets, entity_key, value_key, k):

where the first three parameters are as described above and ``k`` is
an integer. This function should return a sorted list of (entity,
count) pairs computed using the `frequent` algorithm.

Running this function on the tweets in ``hrc-feb.json`` with
``"hashtags"`` as the entity key, ``"text"`` as the value key, and a
value of three for ``k``, for example, would yield:

.. code-block:: python

    [('imwithher', 6), ('iacaucus', 3)]

*Testing*

We have provided test cases for the basic analysis algorithms and for
the tweet-based examples shown in the writeup. Beyond the few examples
that we provided, you are on your own for testing these and all
subsequent tasks.

Analyzing N-grams
-----------------

If looking at frequently occurring hashtags or finding all the users
mentioned some minimum number of times yields some insight into a
candidate, what might we learn by identifying commonly occurring pairs
of words, word triples, etc.? In this part, you will apply the three
algorithms described earlier to contiguous sequences of :math:`N`
words, which are known as `n-grams`. Before you can apply these
algorithms to a candidate's tweets, you will pre-process the tweets to
try to reveal salient words.

*Pre-processing step*

The pre-processing step converts the text of a tweet into a list or
tuple of strings. We will define a `word` to be any sequence of
characters delimited by whitespace. For example, "abc", "7xy", and
"#2016election." are all words by this definition.

Once you extract the words from a tweet, you should remove leading and
trailing punctuation and convert the word to lower case. We have
defined a constant, ``PUNCTUATION``, that specifies what constitutes
punctuation for this assignment. Note that apostrophes, as in the word
"it's", should not be removed, because they occur in the middle of the
word.

Finally, you should eliminate words that appear in a specified set of
`stop words` (for example, "an", "of", and "to") or that start with a
specified set of `stop prefixes` (for example, ``"#"``, ``"@"``, and
``"http"``).

Here is an example: pre-processing the text from the fourth tweet from
DJT on Feb 1st::

    'RT @kevcirilli: CEDAR RAPIDS --\n\nTRUMP\'S DAUGHT IVANKA:\n\n"I can just say without equivocation, my father will make America great again."'

with the sets ``STOP_WORDS["djt"]`` and ``STOP_PREFIXES["default"]``
should yield:

.. code-block:: python

    ('cedar', 'rapids', "trump's", 'daught', 'ivanka', 'i', 'can',
     'just', 'say', 'without', 'equivocation', 'my', 'father', 'will',
     'make', 'america', 'great', 'again')

You will find the ``lower``, ``split``, ``strip``, and ``startswith``
methods from the `str API `_ useful for this step.
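Here is a hedged sketch of the pre-processing step. The function name
is ours, ``string.punctuation`` stands in for the provided
``PUNCTUATION`` constant, the ordering of the checks is one plausible
choice (the prefix check happens before stripping, so that "#" and "@"
are still visible), and the example above additionally assumes that
``STOP_WORDS["djt"]`` contains "rt":

.. code-block:: python

    import string

    def pre_process(text, stop_words, stop_prefixes):
        words = []
        for raw in text.split():
            word = raw.lower()
            # Drop words that begin with a stop prefix ("#", "@", "http", ...).
            if any(word.startswith(p) for p in stop_prefixes):
                continue
            # Remove leading and trailing punctuation only; "it's"
            # keeps its internal apostrophe.
            word = word.strip(string.punctuation)
            if word and word not in stop_words:
                words.append(word)
        return tuple(words)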
*Representing N-grams*

N-grams should be represented as tuples of strings. Given a value of 2
for :math:`N`, the above DJT tweet would yield the following bi-grams
(2-grams):

.. code-block:: python

    ('cedar', 'rapids')
    ('rapids', "trump's")
    ("trump's", 'daught')
    ('daught', 'ivanka')
    ('ivanka', 'i')
    ('i', 'can')
    ('can', 'just')
    ('just', 'say')
    ('say', 'without')
    ('without', 'equivocation')
    ('equivocation', 'my')
    ('my', 'father')
    ('father', 'will')
    ('will', 'make')
    ('make', 'america')
    ('america', 'great')
    ('great', 'again')

You might ask, "Why use tuples and not lists to represent n-grams?"
Tuples of strings are immutable (that is, they cannot be modified)
and, as a result, can be used as keys for dictionaries. Lists, on the
other hand, cannot be used as dictionary keys, because the data
structure used to implement dictionaries relies on the fact that keys
are not mutable.

*Common parameters*

As in Tasks 1-3, Tasks 4-6 have several parameters in common:

- ``tweets`` is a list of tweet dictionaries,
- ``n`` is the number of words in a sequence,
- ``stop_words`` is a set of words (strings) to ignore, and
- ``stop_prefixes`` is a set of strings that describe prefixes for
  words that should be ignored.

*Task 4: Top K N-grams*

Your task is to implement the function:

.. code-block:: python

    def find_top_k_ngrams(tweets, n, stop_words, stop_prefixes, k):

where the first four parameters are as described above and ``k`` is
the desired number of n-grams. This function should return a sorted
list of (n-gram, integer count of occurrences) pairs for the ``k``
most frequently occurring n-grams.

Running this function on the tweets in ``hrc-feb.json`` with two as
the value for ``n``, the Clinton stop words (``STOP_WORDS["hrc"]``),
the default stop prefixes (``STOP_PREFIXES["default"]``), and three as
the value for ``k`` would yield:

.. code-block:: python

    [(('new', 'hampshire'), 22),
     (('south', 'carolina'), 14),
     (('stand', 'with'), 12)]

*Task 5: Minimum number of n-gram occurrences*

Your task is to implement the function:

.. code-block:: python

    def find_min_count_ngrams(tweets, n, stop_words, stop_prefixes, min_count):

where the first four parameters are as described above and
``min_count`` specifies the minimum number of times an n-gram must
occur to be included in the result. This function should return a
sorted list of (n-gram, integer count of occurrences) pairs.

Running this function on the tweets in ``djt-feb.json`` with two as
the value for ``n``, the Trump stop words (``STOP_WORDS["djt"]``), the
default stop prefixes (``STOP_PREFIXES["default"]``), and 30 as the
value for ``min_count`` would yield:

.. code-block:: python

    [(('ted', 'cruz'), 36),
     (('south', 'carolina'), 35),
     (('i', 'will'), 32),
     (('new', 'hampshire'), 31)]

*Task 6: Frequent N-grams*

Your task is to implement the function:

.. code-block:: python

    def find_frequent_ngrams(tweets, n, stop_words, stop_prefixes, k):

where the first four parameters are as described above and ``k`` is an
integer. This function should return a sorted list of (n-gram,
approximate integer count) pairs as computed using the `Frequent`
algorithm.

Running this function on the tweets in ``hrc-feb.json`` with two as
the value for ``n``, the Clinton stop words (``STOP_WORDS["hrc"]``),
the default stop prefixes (``STOP_PREFIXES["default"]``), and three as
the value for ``k`` would yield:

.. code-block:: python

    [(('few', 'iowans'), 1)]
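The sub-task common to Tasks 4-6 is converting each pre-processed
tweet into its n-grams. Here is a minimal sketch, assuming a word
sequence produced by a ``pre_process`` helper like the one sketched
earlier (both names are ours):

.. code-block:: python

    def make_ngrams(words, n):
        # Slide a window of length n across the word sequence; each
        # window becomes a tuple so it can serve as a dictionary key.
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    # make_ngrams(('cedar', 'rapids', "trump's"), 2)
    # => [('cedar', 'rapids'), ('rapids', "trump's")]

With this in hand, ``find_top_k_ngrams`` is essentially ``find_top_k``
applied to the n-grams collected from every tweet, and similarly for
the other two tasks.

By Month Analysis
-----------------

A candidate's focus shifts over the course of an election.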
The number of voters who participate in the primaries is relatively
small, and their values are often not in line with those of the much
larger general election electorate. The oft-discussed need to "pivot
to the center" for the general election reflects this difference. An
interesting question is whether this shift is visible in the
candidates' Twitter feeds. (In retrospect, 2016 may not be the best
year to illustrate this principle.)

*Task 7*

We'll look at this question by calculating the `Top K` n-grams by
month. Specifically, your last task is to implement the function:

.. code-block:: python

    def find_top_k_ngrams_per_month(tweets, n, stop_words, stop_prefixes, k):

which takes the same parameters as ``find_top_k_ngrams`` and returns a
list of tuples of the form ``((year, month), list of the top k n-grams
and their counts sorted using our function)``. The resulting list
should be sorted using the standard sort function.

Running this function on the tweets in ``djt-all.json`` with two as
the value for ``n``, the combined stop words (``STOP_WORDS["both"]``),
the default stop prefixes (``STOP_PREFIXES["default"]``), and three as
the value for ``k`` would yield::

    [((2015, 11), [(('make', 'america'), 10),
                   (('south', 'carolina'), 9),
                   (('america', 'great'), 8)]),
     ((2015, 12), [(('i', 'will'), 34),
                   (('i', 'am'), 26),
                   (('great', 'again'), 19)]),
     ((2016, 1), [(('i', 'will'), 33),
                  (('ted', 'cruz'), 28),
                  (('i', 'am'), 16)]),
     ((2016, 2), [(('ted', 'cruz'), 36),
                  (('south', 'carolina'), 35),
                  (('i', 'will'), 32)]),
     ((2016, 3), [(('i', 'am'), 24),
                  (('i', 'will'), 24),
                  (('ted', 'cruz'), 19)]),
     ((2016, 4), [(('new', 'york'), 28),
                  (('i', 'will'), 22),
                  (('great', 'again'), 15)]),
     ((2016, 5), [(('i', 'will'), 25),
                  (('elizabeth', 'warren'), 16),
                  (('goofy', 'elizabeth'), 16)]),
     ((2016, 6), [(('i', 'will'), 15),
                  (('join', 'me'), 14),
                  (('i', 'am'), 12)]),
     ((2016, 7), [(('bernie', 'sanders'), 19),
                  (('our', 'country'), 17),
                  (('make', 'america'), 12)]),
     ((2016, 8), [(('i', 'will'), 14),
                  (('i', 'am'), 13),
                  (('join', 'me'), 10)]),
     ((2016, 9), [(('i', 'will'), 16),
                  (('join', 'me'), 16),
                  (('make', 'america'), 14)]),
     ((2016, 10), [(('i', 'will'), 5),
                   (('join', 'me'), 5),
                   (('about', 'her'), 4)])]

Because the results can be a bit hard to decipher in this form, we
have provided a function, ``util.pretty_print_by_month``, that takes
the output of ``find_top_k_ngrams_per_month`` and prints it in an
easier-to-read form. For example, here is the result of calling
``pretty_print_by_month`` on the list above::

    November 2015:
        make america        10
        south carolina       9
        america great        8
    December 2015:
        i will              34
        i am                26
        great again         19
    January 2016:
        i will              33
        ted cruz            28
        i am                16
    February 2016:
        ted cruz            36
        south carolina      35
        i will              32
    March 2016:
        i am                24
        i will              24
        ted cruz            19
    April 2016:
        new york            28
        i will              22
        great again         15
    May 2016:
        i will              25
        elizabeth warren    16
        goofy elizabeth     16
    June 2016:
        i will              15
        join me             14
        i am                12
    July 2016:
        bernie sanders      19
        our country         17
        make america        12
    August 2016:
        i will              14
        i am                13
        join me             10
    September 2016:
        i will              16
        join me             16
        make america        14
    October 2016:
        i will               5
        join me              5
        about her            4

Submission
==========

See `these submission instructions `__ if you are working alone.

See `these submission instructions `__ if you are working in the same
pair as for PA #2.

See `these submission instructions `__ if you are working in a *NEW*
pair.