CMSC 35100 - Natural Language Processing
Spring 2003
Assignment #2: Due May 1, 2003
Goals:
This assignment explores the application of n-gram models to language
identification (aka language ID). Language ID attempts to figure out
which of several known languages a new text was written in.
Character n-gram models have been used effectively for this task.
In this assignment you will build character n-gram models of a
variety of texts based on a relatively large corpus. You will
then apply the n-gram models to perform language ID by determining
which language model provides the best fit to a new test text.
Tools:
Building a simple n-gram model - without sophisticated smoothing - is
fairly straightforward, and you are welcome to do so in this assignment
if you wish. Alternatively, techstaff has installed the CMU-Cambridge
Statistical Language Modeling Toolkit, which builds and evaluates
language models. The documentation for the toolkit is available
on-line at Cambridge's website. The programs are located in
/opt/CMU-Cam-Toolkit/default/
Data:
Since you will be training and testing language models, you will need
some texts in different languages. The data is available on the
course website.
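If you use the toolkit, note that it builds models over
whitespace-separated tokens, so to get character n-grams you will want
to rewrite each text with one character per token first. A minimal
Python sketch of that preprocessing (the "_" stand-in for the space
character and the file names in the comment are my own choices, not
part of the toolkit):

    import sys

    def charify(line):
        # One token per character: join characters with single spaces,
        # encoding the space character itself as "_" so it survives
        # as a token rather than as a token boundary.
        return " ".join("_" if c == " " else c for c in line.rstrip("\n"))

    if __name__ == "__main__":
        # Hypothetical usage: python charify.py < english.txt > english.chars
        for line in sys.stdin:
            print(charify(line))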
Step 1: Model training
Compute character bigram, trigram, and 4-gram language models
for each of the large language samples.
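If you build the models yourself, the counting step is short. A
minimal sketch, assuming each training corpus has been read into a
plain Python string (function names are mine; no smoothing is applied):

    from collections import Counter

    def ngram_counts(text, n):
        # Slide a width-n window over the text and count every n-gram.
        return Counter(text[i:i+n] for i in range(len(text) - n + 1))

    def train(text, n):
        # Keep both the n-gram counts and the (n-1)-gram context
        # counts; Step 2's conditional MLE needs the pair.
        return ngram_counts(text, n), ngram_counts(text, n - 1)

For example, train(text, 3) yields the trigram and bigram count tables
for one language; repeat for n = 2, 3, 4 and for each language sample.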
Step 2: Language ID: MLE
For each of the test samples and each of the language models,
determine the most likely language to have produced the test sample:
compute P(sample|model) using maximum likelihood estimates of the
n-gram probabilities, and choose the language whose model gives the
highest value. (You may prefer to work with the negative log
probability to avoid underflow problems.)
A maximum likelihood estimate (MLE) of probability is computed by
taking the raw count and dividing by the total, so for a general
N-gram model the relevant maximum likelihood estimates are:
(1) P(w[n-N+1],...,w[n-1],w[n]) = Count(w[n-N+1],...,w[n-1],w[n])/Total
(2) P(w[n-N+1],...,w[n-1]) = Count(w[n-N+1],...,w[n-1])/Total
What we want is the conditional probability:

                                    P(w[n-N+1],...,w[n-1],w[n])
(3) P(w[n] | w[n-N+1],...,w[n-1]) = ---------------------------
                                      P(w[n-N+1],...,w[n-1])
Substituting (1) and (2) into the numerator and denominator of (3),
the Totals cancel, giving the general formula for the MLE of
probability in an N-gram model:

                                    Count(w[n-N+1],...,w[n-1],w[n])
(4) P(w[n] | w[n-N+1],...,w[n-1]) = -------------------------------
                                      Count(w[n-N+1],...,w[n-1])
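Formula (4) translates directly into a scoring routine. A sketch,
assuming the (ngram_counts, context_counts) pairs from the Step 1
sketch (neg_logprob and identify are my names; note that any n-gram
unseen in training has MLE probability zero, which this sketch reports
as an infinite negative log probability rather than smoothing away):

    import math

    def neg_logprob(sample, ngrams, contexts, n):
        # -log2 P(sample|model): sum -log2 of formula (4) at each position.
        total = 0.0
        for i in range(len(sample) - n + 1):
            gram = sample[i:i+n]
            num, den = ngrams.get(gram, 0), contexts.get(gram[:-1], 0)
            if num == 0 or den == 0:
                return math.inf    # unseen n-gram: MLE probability is zero
            total -= math.log2(num / den)
        return total

    def identify(sample, models, n):
        # models maps language -> (ngram_counts, context_counts);
        # the smallest negative log probability is the best fit.
        return min(models, key=lambda lang: neg_logprob(sample, *models[lang], n))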
Report your results. What do you observe?
Step 3: Language ID: perplexity
Using the SLM toolkit's evallm program, perform language ID with
your earlier language models and the test samples. Compare your
results to those from the MLE approach in Step 2.
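Perplexity and the Step 2 scores are two views of the same quantity:
perplexity is 2 raised to the average per-character negative log2
probability, so for a fixed test sample both rank the candidate
languages identically. A sketch of the conversion, reusing the
hypothetical neg_logprob from Step 2 (your evallm numbers may differ
in value, since the toolkit smooths its models, but the rankings
should be comparable):

    def perplexity(sample, ngrams, contexts, n):
        # Perplexity = 2 ** (average -log2 P per predicted character).
        events = len(sample) - n + 1
        return 2.0 ** (neg_logprob(sample, ngrams, contexts, n) / events)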