“The Markovian Candidate”: Speaker Attribution Using Markov Models¶
Due: Friday, January 22nd at 9pm
The goal of this assignment is to give you practice building a hash table from scratch, working with Markov models, and exercising the skills you learned last quarter, specifically dunder methods.
You must work alone for this assignment.
Introduction¶
In lecture, we saw how Markov models capture the statistical relationships present in a language like English. These models allow us to go beyond simplistic observations, like the frequency with which specific letters or words appear in a language, and instead to capture the relationships between words or letters in sequences. As a result, we can not only appreciate that the letter “q” appears in text at a certain rate, but also that it is virtually always followed by a “u.” Similarly, “to be” is a much more likely word sequence than “to is.”
In class, we discussed how to generate realistic-looking, if nonsensical, text, by randomly choosing words while keeping in mind the influence of the preceding words. Another application of Markov models is in analyzing actual text and assessing the likelihood that a particular person uttered it. That is one objective of this assignment.
In lecture, we also talked about hash tables: data structures that store associations between keys and values (exactly like dictionaries in Python) and provide an efficient means of looking up the value associated with a given key. Hash tables find a desired entry rapidly by limiting the set of places where it can be. They avoid “hot spots,” even with data that might otherwise all seem to belong in the same place, by dispersing the data through hashing.
These benefits are why Python, in fact, uses hash tables behind the scenes in its implementation of dictionaries. The concept of hashing is so fundamental and useful in so many ways, however, that now is the time to peel back the curtain and see how dictionaries work by building your own. That is the other objective of this assignment.
Apart from developing an appreciation of how hashing and hash tables work, you will also be better prepared if ever you need to write your own hash table in the future. For instance, you may use a language (like C) that does not feature a built-in hash table. Or, the hash table that is used by a language, like Python, may interact poorly with your particular data, obligating you to design a custom-tailored one. After completing this assignment, you should consider hash tables to be in your programming repertoire.
Getting started¶
Before you start working on the assignment’s tasks, please take a moment to follow the steps described on the Coursework Basics page to get the files for this assignment (these steps will be the same as the ones you followed last quarter). Specifically, you need to set up your course repository for the class. You should only need to do the steps defined in the Setting up your repository section.
Please note that you will not be able to start working on the assignment until you fetch the assignment files (which will appear in a pa1 directory in your repository).
Hash tables and Linear Probing¶
In the interest of building and testing code one component at a time, you will start by building a hash table. Once you have gotten this module up and running, you will use it in your construction of a Markov model for speaker attribution.
There are different types of hash tables; for this assignment, we will use the type that is implemented with linear probing. We described how these work in lecture; here is a reference on the topic.
Please look at hash_table.py. You will modify this file to implement a hash table using the linear probing algorithm. The class Hashtable must have:
A hash function that takes in a string and returns a hash value. Use the standard string hashing function discussed in class. This function should be “private” to the class.
A constructor that takes in the initial number of cells to use and a default value, of your choosing, to return when looking up a key that has not been inserted. It must create a list of empty cells with the specified length. You can assume the value passed in for the initial number of cells will be greater than 0.
__getitem__, which takes in a key and returns its associated value: either the one stored in the hash table, if present, or the default value, if not.

__setitem__, which takes in a key and a value, and either updates the existing value for that key with the new value or inserts the key-value pair into the hash table, as appropriate.

__delitem__, which takes in a key and removes the key-value pair from the hash table. Deleting an item in a hash table that uses linear probing is tricky: simply removing the key-value pair could potentially invalidate the entire table. For our implementation, we will “logically” delete the key-value pairings. This means that you should have a marker for each key-value pair that indicates whether or not it has been “deleted” (e.g., a third component of the tuple to represent the marker). If the marker is True, then the key-value pair is still inside the table; otherwise, a False marker says that the pair was deleted at some point. You should think about how this helps with implementing your __getitem__ and __setitem__ methods. You must only “physically” remove (i.e., actually delete from the data structure) the “logically” removed key-value pairings during rehashing. If the key is not inside the hash table, then you will need to raise the following error: raise RuntimeError("Key was not found in table").

__contains__, which takes in a key and returns True if the key is inside the hash table; otherwise, it returns False.

keys, which returns a list of all the keys inside the hash table. Do not include the “logically” deleted keys.

values, which returns a list of all the values inside the hash table. Do not include the “logically” deleted values.

__len__, which returns the number of key-value pairings inside the hash table. Do not count the “logically” deleted pairings.
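To make the probing and logical-deletion mechanics concrete, here is a minimal sketch, not the required implementation: the class name, the (key, value, live) cell layout, and the assumption that the table never fills completely are all illustrative choices.

```python
# Minimal sketch of linear probing with logical deletion.
# Illustrative only: cells are (key, value, live) tuples, and we assume
# the table never fills completely (the real class must rehash).
class ProbeSketch:
    def __init__(self, cells, defval):
        self.defval = defval
        self.table = [None] * cells      # None marks a never-used cell

    def _hash(self, key):
        h = 0
        for ch in key:
            h = h * 37 + ord(ch)
        return h % len(self.table)

    def __setitem__(self, key, value):
        i = self._hash(key)
        # Probe forward until we find this key or an empty cell
        while self.table[i] is not None and self.table[i][0] != key:
            i = (i + 1) % len(self.table)
        self.table[i] = (key, value, True)   # live marker

    def __getitem__(self, key):
        i = self._hash(key)
        while self.table[i] is not None:
            k, v, live = self.table[i]
            if k == key:
                # A False marker means the pair was logically deleted
                return v if live else self.defval
            i = (i + 1) % len(self.table)
        return self.defval
```

Notice that lookup must keep probing past cells holding other keys, and that a logically deleted cell still occupies its slot so later probes are not cut short.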
A hash table built on linear probing does not have unlimited capacity, since each cell can only contain a single value and there are a fixed number of cells. But, we do not want to have to anticipate how many cells might be used for a given hash table in advance and hard-code that number in for the capacity of the hash table. Therefore, we will take the following approach: the cell capacity passed into the constructor will represent the initial size of the table. If the fraction of occupied cells grows beyond the constant TOO_FULL after an update, then we will perform an operation called rehashing: we will expand the size of our hash table, and migrate all the data into their proper locations in the newly-expanded hash table (i.e., each key-value pair is hashed again, with the hash function now considering the new size of the table). We will grow the size of the table by GROWTH_RATIO; for instance, if this is 2, the size of the hash table will double each time it becomes too full.
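The growth step might look like the following sketch. The constant values, the function name, and the reinsert callback are hypothetical stand-ins; your class will use the provided TOO_FULL and GROWTH_RATIO constants and its own probing code.

```python
# Sketch of the load-factor check and rehash. TOO_FULL, GROWTH_RATIO,
# and the reinsert callback are illustrative stand-ins; cells are
# (key, value, live) tuples as in the logical-deletion scheme.
TOO_FULL = 0.5
GROWTH_RATIO = 2

def maybe_rehash(table, occupied, reinsert):
    """Return a grown table if too full, else the table unchanged.

    reinsert(new_table, key, value) is a hypothetical helper that
    hashes the key against the new table's length and probes it in.
    """
    if occupied / len(table) <= TOO_FULL:
        return table
    new_table = [None] * (len(table) * GROWTH_RATIO)
    for cell in table:
        if cell is not None and cell[2]:           # keep live pairs only;
            reinsert(new_table, cell[0], cell[1])  # deleted ones vanish here
    return new_table
```

Rehashing is the one moment when logically deleted pairs are physically dropped: only live pairs are migrated into the new table.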
You must use the hash function presented in these slides, and use 37 as the constant in that function. The ord() function in Python is an option for computing the numerical value of a character. Let \(f()\) be a function that maps a character \(c_i\) in a string \(S = c_1 c_2 \ldots c_j\) to a number, let \(k = 37\) be a constant, and let \(M\) be the size of the hash table. Then a possible hash function \(h()\), which you must use in this assignment, can be defined as (see slide #6):

\(h(S) = g(c_j) \bmod M\)

and

\(g(c_i) = \begin{cases} f(c_1) & \text{if } i = 1 \\ g(c_{i-1}) \cdot k + f(c_i) & \text{if } i > 1 \end{cases}\)

where \(g(c_i)\) is the function that actually computes the integer value accumulated over all characters of the string \(S\) up to \(c_i\).
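The recurrence above is just Horner’s method for evaluating a polynomial in \(k\). Here is a sketch; the function name is illustrative, and in your class this should be a private method:

```python
# Polynomial string hash via Horner's method, with k = 37.
# The name string_hash is illustrative; your class's version is private.
def string_hash(s, table_size):
    value = 0
    for ch in s:
        value = value * 37 + ord(ch)   # g(c_i) = g(c_{i-1}) * 37 + f(c_i)
    return value % table_size          # h(S) = g(c_j) mod M
```

Note that the modulus uses the current table size, which is why every pair must be hashed again after the table grows.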
Testing - We have provided test code for your hash table class in test_hash_table.py.
Note
You will most likely only use the __getitem__ and __setitem__ methods in the following sections. However, the other methods are there so you can get practice building a more complex data structure that uses the built-in operators of Python.
Markov model for sequences of letters¶
Next, we will use this hash table to track the number of times letter sequences appear in a text.
A k-th order Markov model tracks the last \(k\) letters as the context for the present letter. We will build a class called Markov that will work for any positive value of \(k\) provided. This class, naturally, resides in markov.py.
While we use the term “letter,” we will actually work with all characters, whether they be letters, digits, spaces, or punctuation, and will distinguish between upper- and lower-case letters.
We will discuss how we will use this class to help identify the speaker of a particular text. Then, we will circle back and explain how it learns the characteristics of a given speaker in the first place.
Determining the likelihood of unidentified text¶
Given a string of text from an unidentified speaker, we will use a Markov model for a known speaker to assess the likelihood that the text was uttered by that speaker. If we have built models for different speakers, then we will have likelihood values for each, and will choose the speaker with the highest likelihood as the probable source.
As we saw in lecture, these probabilities can be very small, since they take into account every possible phrase of a certain length that a speaker could have uttered. Therefore, we expect that all likelihoods are low in an absolute sense, but will still find their relative comparisons to be meaningful. Very small numbers are problematic, however, because they tax the precision available for floating-point values. The solution we will adopt for this problem is to use log probabilities instead of the probabilities themselves. This way, even a very small number is represented by a negative value in the range between zero and, for instance, -20. If our log probabilities are negative and we want to determine which probability is more likely, will the greater number (the one closer to zero) represent the higher or lower likelihood, based on how logarithms work?
Note that when we use Python’s math.log function, we will calculate natural logarithms. Your code should use this base for its logarithms. While any base would suffice to ameliorate our real number precision problem, we will be comparing your results to the results from our implementation, which itself uses natural logs.
As we scan down the string of text from the unidentified speaker, at every character, we will consider it and the \(k\) characters that preceded it in our \(k\)-th order Markov model. We will use the model to calculate the probability that the current character would have been uttered by the modeled speaker, given the context of the previous \(k\) characters. This represents a likelihood for that particular letter. Were we calculating a probability for the entire phrase, we would multiply together all the per-letter probabilities to yield the value for the entire phrase. But, since we are working with log probabilities, we add them.
What is the probability of a given character in the context of the last \(k\)? One reasonable way to calculate this is to look at how many times we have seen the modeled speaker utter the \(k\) preceding characters followed by the current character (a \(k\) + 1 length sequence), divided by the number of times we have observed the speaker to have uttered the \(k\) preceding characters (a \(k\)-letter sequence). In the count we use for the denominator, we only look at the \(k\) preceding characters in a row, but each time they appeared, they were followed by any character. Thus, we are taking the ratio of all the times they were followed by the current character over all the times they were followed by some character, to yield the likelihood that the current character was that “some character.” As an example (\(k=2\)), if we have the sequence “The”, we divide the number of times we have seen “The” by the number of times we have seen “Th”. Each of those occurrences of “Th” may have been part of “The”, but also could have been part of “Tha”, “Thi”, “Tho”, and “Thu”.
While this seems reasonable, we need to keep in mind that we are constructing the model with one set of text and using it to evaluate other text. The specific letter sequences that appear in that new text are not necessarily guaranteed ever to have appeared in the original text. Consequently, we are at risk of dividing by zero.
It turns out that there is a theoretically-justifiable solution to this issue, called Laplace smoothing. We modify the simple equation above by adding to the denominator the number of unique characters that appeared in the original text we used for modeling. For instance, if every letter in the alphabet appeared in the text, we add 26. (In practice, the number is likely to be greater, because we distinguish between upper- and lower-case letters, and consider spaces, digits, and punctuation as well.) Because we have added this constant to the denominator, it will never be zero. Next, we must compensate for the fact that we have modified the denominator; a theoretically sound way to balance this is to add one to the numerator. Symbolically, if \(N\) is the number of times we have observed the \(k\) preceding letters and \(M\) is the number of times we have observed those letters followed by the present letter, and \(S\) is the size of the “alphabet” of possible characters, then our probability is \((M+1)/(N+S)\).
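As a sketch of the formula, where plain dicts stand in for your hash table purely for brevity, and the variable names mirror \(M\), \(N\), and \(S\) from the text:

```python
import math

# Laplace-smoothed log probability of one character given its context.
# A plain dict of sequence counts stands in for the required hash table.
def char_log_probability(counts, context, ch, s):
    """log((M + 1) / (N + S)) for character ch following context."""
    m = counts.get(context + ch, 0)   # M: times context was followed by ch
    n = counts.get(context, 0)        # N: times context appeared at all
    return math.log((m + 1) / (n + s))
```

With the example above (\(k = 2\)), if “Th” was seen 10 times, “The” 7 times, and 26 distinct characters appeared in the training text, the result is log(8/36); an entirely unseen context yields log(1/S) rather than a division by zero.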
One final complication: what do we do when we do not have \(k\) characters of context available? For instance, the first \(k\) characters in the sequence will have fewer preceding letters than we need for our analysis. To address this, we will “wrap around”: we will think of the string circularly, and glue the end of the string on to the beginning to provide a source of the needed context. For instance, if \(k\) = 2, and we have the string "ABCD", then the two letters of context for D will be BC, but the two letters of context for A will wrap around and be CD.
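One way to realize the wrap-around is to append the first \(k\) characters to the end of the string, so that every position has a full context. This helper is an illustrative sketch, not a required function:

```python
# Treat the text as circular: glue the first k characters onto the end,
# so the character at every position has k characters of context.
def kgrams(text, k):
    wrapped = text + text[:k]                    # "ABCD", k=2 -> "ABCDAB"
    for i in range(len(text)):
        yield wrapped[i:i + k], wrapped[i + k]   # (context, character)
```

For "ABCD" with \(k = 2\) this yields ("AB", "C"), ("BC", "D"), ("CD", "A"), ("DA", "B"): every character, including the leading A, gets a full two-character context.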
Learning the statistics of text¶
Now that we have identified the information we need for our end goal of calculating probabilities, we will briefly discuss how to collect it.
To learn the statistical characteristics of a particular speaker, we will be given a string of text that they have uttered. Using this string, and based on the previous section, determine the information you need to collect about the frequencies of characters, and the contexts in which they appear. Then, consider the data structure to use to hold this information.
You may find it helpful to think back to the example of modeling rigged dice in class and the tables we used to capture their statistics. (Later in the lecture, we discussed text generation, and an exotic optimization involving duplicate list entries. This optimization helped with generation of text, not recognition of text, so do not be misled into trying to apply that more general approach here.)
There was one relevant insight we encountered when we switched from dice to text, however: because dice rolls are numeric, a simple matrix was suitable. Text, on the other hand, lends itself to being the key in a dictionary, and so we used a dictionary where the key was the previous word or words of context and the value was information on what might follow.
Hash tables and Python dictionaries support the same operations and benefit from the same highly-efficient implementation. Because the goal of this assignment is to give you practice both implementing and using hash tables, you should not use a stock Python dictionary, but rather, your own hash table implementation, for your data structure.
As we determine the counts of occurrences of sequences, we again will encounter the issue of insufficient context for the early letters in a string. As before, the solution is to “wrap around.”
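The counting pass can then be sketched as follows; a plain dict stands in for your hash table purely to keep the sketch short, and the function name is illustrative:

```python
# Count every k-gram (a context) and every (k+1)-gram (a context plus
# the character that follows it), wrapping around the end of the string.
# A plain dict stands in for the required hash table in this sketch.
def count_sequences(text, k):
    counts = {}
    wrapped = text + text[:k]
    for i in range(len(text)):
        for gram in (wrapped[i:i + k], wrapped[i:i + k + 1]):
            counts[gram] = counts.get(gram, 0) + 1
    return counts
```

For the small string "abcabd" with \(k = 2\), "ab" is counted twice and each 3-gram once; the k-gram counts supply the denominators and the (k+1)-gram counts the numerators of the probability formula.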
Structuring your code¶
Your Markov class will have the following methods:

A constructor that takes in a value of \(k\) and a string of text to create the model. It will create one or more hash tables with HASH_CELLS many cells; we have provided this constant as a good starting size for your hash tables, although they will have to grow significantly to accommodate all the statistics you will learn as you scan over the sample text we have provided.

log_probability, a method that takes in a new string and returns the log probability that the modeled speaker uttered it, using the approach described in the prior section.

While these are the only methods whose details we are dictating, your code will benefit dramatically from a careful selection of helper functions.
Using the Markov class¶
While the log_probability method yields the likelihood that a particular speaker uttered a specified string of text, this probability may be misleading because it depends upon the length of the string. Because there are many more possible strings of greater length, and these probabilities are calculated across the universe of all possible strings of the same length, we should expect to see significant variance in these values for different phrases. To reduce this effect, we will divide all of our probabilities by the length of the unknown string, to yield normalized ones. To be clear, this division does not occur in log_probability, but rather, in the code we are about to write. Also, note that we will be calculating log probabilities, under different models, for the same string; thus, string-length differences are not a factor during a single run of our program. Rather, we choose to normalize in this fashion because we want outputs from different runs to be more directly comparable when they pertain to different-length quotes.
Your final task is to complete the function identify_speaker outside of the Markov class. This function is called by the main block with three strings and a value of \(k\). You must learn models for the speakers that uttered the first two strings, calculate the normalized log probabilities that those two speakers uttered the third string, and return these two probabilities in a tuple (with the first entry being the probability of the first speaker). Finally, you must compare the probabilities and place, in the third slot of your returned tuple, a conclusion of which speaker was more likely. This conclusion should be either the string "A" or "B". In the rare and highly unlikely event of a tie, conclude "B". We will call your function and print the result, which will have the form:
Speaker A: -2.49045540512
Speaker B: -2.44619679933
Conclusion: Speaker B is most likely
Testing your code¶
We have provided a set of files containing text from United States presidential debates from the 2004 and 2008 general elections. In the 2004 election, George W. Bush debated John Kerry; in the 2008 debates, Barack Obama went up against John McCain. We have provided single files for Bush, Kerry, Obama, and McCain to use to build models. These files contain all the text uttered by the corresponding candidate from two debates. We have also provided directories from the third debates of each election year, containing many files, appropriately labeled, that have remarks made by one of the candidates.
The following command line will allow you to test with the specified files:
$ python3 Markov.py speakerA.txt speakerB.txt unidentified.txt k
where the first two file names represent the text files that will be used to build the two models, and the third file name is that of the file containing text whose speaker we wish to determine using the Markov approach. The final argument is the order (\(k\)) of the Markov models to use, an integer.
Here’s a sample use:
$ python3 Markov.py speeches/bush1+2.txt speeches/kerry1+2.txt speeches/bush-kerry3/BUSH-0.txt 2
Speaker A: -2.1670591295191572
Speaker B: -2.2363636778055525
Conclusion: Speaker A is most likely
We have also provided the usual style of test code in test_markov.py.
Debugging suggestions¶
A few debugging suggestions:
Make sure you have chosen the right data structure for the Markov model!
Check your code for handling the wrap-around carefully. It is a common source of errors.
Test the code for constructing a Markov model using a very small string, such as "abcabd". Check your data structure to make sure that you have the right set of keys and the correct counts.

Make sure you are using the correct value for \(S\): the number of unique characters that appeared in the training text.
Grading¶
Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our PA Rubric page.)
The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:
Completeness: 70%
Correctness: 10%
Design: 10%
Style: 10%
The completeness part of your score will be determined using automated tests. To get your score for the automated tests, simply run the grader script, as described in our Testing Your Code page.
Cleaning up¶
Before you submit your final solution, you should remove any print statements that you added for debugging purposes, as well as all in-line comments of the form “YOUR CODE HERE” and “REPLACE …”.
Also, check your code against the style guide. Did you use good variable names? Do you have any lines that are too long? And so on.
Do not remove header comments, that is, the triple-quote strings that describe the purpose, inputs, and return values of each function.
As you clean up, you should periodically save your file and run your code through the tests to make sure that you have not broken it in the process.
Submission¶
You must submit your work through Gradescope (linked from our Canvas site). In the “Programming Assignment #1” assignment, simply upload files hash_table.py
and markov.py
(do not upload any other file!). Please note:
You are allowed to make as many submissions as you want before the deadline.
Please make sure you have read and understood our Late Submission Policy.
Your completeness score is determined solely based on the automated tests, but we may adjust your score if you attempt to pass tests by rote (e.g., by writing code that hard-codes the expected output for each possible test input).
Gradescope will report the test score it obtains when running your code. If there is a discrepancy between the score you get when running our grader script, and the score reported by Gradescope, please let us know so we can take a look at it.
Acknowledgment¶
This assignment is based on one developed by Rob Schapire with contributions from Kevin Wayne.