[CS logo]

CMSC 35100 - Natural Language Processing
Spring 2003
Assignment #1: Due April, 2003


Goals:

This assignment covers approaches to morphological and pronunciation variation in the two-level morphology framework using finite state transducers in general and an implementation called PC-Kimmo in particular. You will implement rules to model several phenomena and analyze some of the strengths and limitations of both the general approach and some specific morphological models.

Implementation:

We'll be using the PC-Kimmo system for two level morphology. The code for this system is installed in /opt/pc-parse/pc-parse-20.../bin on CS department machines. The two programs you will likely need are pckimmo and kgen. pckimmo takes a specification of lexicon and rule files and then implements the finite state transduction process according to the rules. Since pckimmo expects rules to be specified as transition tables - which should be written by machines and not by humans, we include the program kgen that compiles the Chomsky and Halle-style phonological rules we have used in class into the necessary transition tables. The invocation is as follows:

kgen < rules.in > rules.out

test1.txt and test2.txt are examples of valid input files to kgen.

The auxiliary model files appear here. You will want to use the English grammar specification files for comparison.

NEW: Additional pckimmo documentation: http://www.sil.org/pckimmo/v2/doc/guide.html

Problem 1

Use the English lexicon and rule files provided by invoking pckimmo and giving the command:

take englex.tak

with the paths updated to reflect your directory structure.

Part A

Execute generate refer+ing. What happens? What is the problem and why does it occur?

Part B

Execute generate picnic+ing What happens? What is the problem and why does it occur?

Problem 2

Implement the SOUNDEX algorithm with PC-Kimmo. (This is exercise 3.6 in the text.) The SOUNDEX algorithm (Odell and Russell, 1922; Knuth,1973) is a method commonly used in libraries and older Census records for representing people's names. It has the advantage that versions of the names that are slightly misspelled or otherwise modified (common, for example, in hand-written census records) will stil have the same representation as correctly-spelled names. (e.g. Jurafsky, Jarofsky, Jarovsky, and Jarovski all map to J612).
  1. Keep the first letter of the name, and drop all occurrences of non-initial a, e, h, i, o, u, w, y
  2. Replace the remaining letters with the following numbers:
    • b,f,p,v: 1
    • c,g,j,k,q,s,x,z: 2
    • d,t: 3,
    • l: 4
    • m,n: 5
    • r: 6
  3. Replace any sequences of identical numbers with a single number (i.e., 666: 6)
  4. Convert to the form Letter Digit Digit Digit by dropping digits past the third (if necessary) or padding with trailing zeroes (if necessary).