CMSC 35100 - Natural Language Processing
Spring 2003
Assignment #2: Due May 1, 2003


Goals:

This assignment explores the application of n-gram models to language identification (language ID): determining which of several known languages a new text was written in. Character n-gram models have been used effectively for this task. In this assignment you will build character n-gram models for a variety of languages from a relatively large corpus. You will then perform language ID by determining which language model provides the best fit to a new test text.

Tools:

Building a simple n-gram model (without sophisticated smoothing) is fairly straightforward, and you are welcome to do so in this assignment if you wish. Alternatively, techstaff has installed the CMU-Cambridge Statistical Language Modeling toolkit, which builds and evaluates language models. The documentation for the toolkit is available on-line at Cambridge's website. The programs are located in

/opt/CMU-Cam-Toolkit/default/
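
If you decide to build your own model, a minimal sketch of unsmoothed character n-gram counting in Python might look like the following (the file name english.train is a placeholder for one of the course data files, and the latin-1 encoding is an assumption about the data):

  from collections import Counter

  def char_ngram_counts(text, n):
      """Count each character n-gram and its (n-1)-character context."""
      ngrams, contexts = Counter(), Counter()
      for i in range(len(text) - n + 1):
          ngrams[text[i:i + n]] += 1
          contexts[text[i:i + n - 1]] += 1
      return ngrams, contexts

  # Example: collect trigram counts from one training file.
  with open("english.train", encoding="latin-1") as f:
      trigrams, bigram_contexts = char_ngram_counts(f.read(), 3)

These counts are all an unsmoothed model needs; Step 2 shows how they combine into MLE probabilities.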

Data:

Since you will be training and testing language models, you will need some texts in different languages. The data is available on the course website.

Step 1: Model training

Compute character bigram, trigram, and 4-gram language models for each of the large language samples.
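
If you use the toolkit, note that its programs build models over whitespace-separated tokens, so one way to obtain character models is to first rewrite each character of a sample as its own token. A sketch of that preprocessing (file names are placeholders, and the <sp> token is just one arbitrary way to keep the space character visible):

  # Rewrite a text so each character becomes a whitespace-separated token,
  # which lets a word-n-gram tool build character n-gram models.
  with open("english.train", encoding="latin-1") as src, \
       open("english.chars", "w", encoding="latin-1") as dst:
      for line in src:
          tokens = ["<sp>" if c == " " else c for c in line.rstrip("\n")]
          dst.write(" ".join(tokens) + "\n")

The same rewritten file can then be fed to the toolkit once per n-gram order; if you build your own models instead, the counting sketch in the Tools section already handles any order.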

Step 2: Language ID: MLE

For each of the test samples, and each of the language models, determine the most likely language to have produced the test sample.

Simply compute the likelihood of the sample under each model, P(sample|model), and choose the model that maximizes it. (You may prefer to work with the negative log probability to avoid underflow; the best model then has the smallest value.) A maximum likelihood estimate (MLE) of a probability is computed by taking the raw count and dividing by the total, so for a general N-gram model the relevant maximum likelihood estimates are:

  (1)  P(w[n-N+1],...,w[n-1],w[n]) = Count(w[n-N+1],...,w[n-1],w[n])/Total
  (2)  P(w[n-N+1],...,w[n-1])      = Count(w[n-N+1],...,w[n-1])/Total

What we want is the conditional probability

                                          P(w[n-N+1],...,w[n-1],w[n])
  (3)  P(w[n] | w[n-N+1],...,w[n-1])  =  -----------------------------
                                          P(w[n-N+1],...,w[n-1])

Substituting (1) and (2) into the numerator and denominator of (3),
the Total terms cancel, and we get the general formula for the MLE
probability in an N-gram model:

                                         Count(w[n-N+1],...,w[n-1],w[n])
  (4)  P(w[n] | w[n-N+1],...,w[n-1]) =  ---------------------------------
                                          Count(w[n-N+1],...,w[n-1])
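
To make the procedure concrete, here is a minimal sketch of MLE language ID built on equation (4). The language names, file names, and use of natural logarithms are illustrative assumptions; note that an unsmoothed model assigns probability zero to any sample containing an n-gram unseen in training, which is worth watching for in your observations:

  import math
  from collections import Counter

  def char_ngram_counts(text, n):
      """n-gram and (n-1)-gram context counts, as in the Tools sketch."""
      ngrams, contexts = Counter(), Counter()
      for i in range(len(text) - n + 1):
          ngrams[text[i:i + n]] += 1
          contexts[text[i:i + n - 1]] += 1
      return ngrams, contexts

  def neg_logprob(sample, ngrams, contexts, n):
      """-log P(sample | model), multiplying equation (4) across the sample."""
      total = 0.0
      for i in range(len(sample) - n + 1):
          count = ngrams[sample[i:i + n]]
          context = contexts[sample[i:i + n - 1]]
          if count == 0 or context == 0:
              return math.inf  # unseen n-gram: the MLE probability is zero
          total -= math.log(count / context)
      return total

  # Train one trigram model per language, then report the model that gives
  # the test sample the smallest negative log probability.
  models = {}
  for lang in ("english", "french", "german"):
      with open(lang + ".train", encoding="latin-1") as f:
          models[lang] = char_ngram_counts(f.read(), 3)

  with open("test1.txt", encoding="latin-1") as f:
      sample = f.read()
  best = min(models, key=lambda lang: neg_logprob(sample, *models[lang], 3))
  print("most likely language:", best)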

Report your results. What do you observe?

Step 3: Language ID: perplexity

Using the SLM toolkit's evallm program, your earlier language models, and the test samples, perform language ID: compute the perplexity of each test sample under each language model, and choose the language whose model yields the lowest perplexity.
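
evallm reports perplexity rather than a raw probability, but the two are directly related: perplexity is the inverse probability of the sample normalized per prediction, PP = P(sample)^(-1/N). If you want to sanity-check your Step 2 numbers against the toolkit's output, a small conversion sketch (assuming natural logarithms, as in the Step 2 sketch):

  import math

  def perplexity(neg_logprob, num_predictions):
      """PP = P(sample)^(-1/N), computed from -ln P(sample) over N predictions."""
      return math.exp(neg_logprob / num_predictions)

  # A trigram model over a sample of length L makes L - 2 predictions.
  # For example, -ln P = 2000.0 over 1000 predictions gives PP = e**2, about 7.39.
  print(perplexity(2000.0, 1000))

Keep in mind that the toolkit counts predictions over its own tokenization, so the two figures may differ in scale even when they agree on the ranking.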

Compare your results to those derived from the MLE approach in Step 2.