CMSC 35100 - Natural Language Processing
Spring 2003
Assignment #1: Due April, 2003
Goals:
This assignment covers approaches to morphological and pronunciation
variation in the two-level morphology framework using finite state
transducers in general and an implementation called PC-Kimmo in
particular. You will implement rules to model several phenomena
and analyze some of the strengths and limitations of both the
general approach and some specific morphological models.
Implementation:
We'll be using the PC-Kimmo system for two level morphology. The
code for this system is installed in /opt/pc-parse/pc-parse-20.../bin
on CS department machines. The two programs you will likely
need are pckimmo and kgen.
pckimmo takes a specification of
lexicon and rule files and then implements the finite state
transduction process according to the rules. Since pckimmo
expects rules to be specified as transition tables - which should be
written by machines and not by humans, we include the program
kgen that compiles the Chomsky and Halle-style
phonological rules we have used in class into the necessary
transition tables. The invocation is as follows:
kgen < rules.in > rules.out
test1.txt and test2.txt are examples of valid input files to kgen.
The auxiliary model files appear here.
You will want to use the English grammar specification files
for comparison.
NEW: Additional pckimmo documentation: http://www.sil.org/pckimmo/v2/doc/guide.html
Problem 1
Use the English lexicon and rule files provided by invoking pckimmo and giving the command:
take englex.tak
with the paths updated to reflect your directory structure.
Part A
Execute generate refer+ing. What happens? What
is the problem and why does it occur?
Part B
Execute generate picnic+ing What happens? What
is the problem and why does it occur?
Problem 2
Implement the SOUNDEX algorithm with PC-Kimmo.
(This is exercise 3.6 in the text.)
The SOUNDEX algorithm (Odell and Russell, 1922; Knuth,1973) is a method
commonly used in libraries and older Census records for representing
people's names. It has the advantage that versions of the names that
are slightly misspelled or otherwise modified (common, for example,
in hand-written census records) will stil have the same representation
as correctly-spelled names. (e.g. Jurafsky, Jarofsky, Jarovsky, and
Jarovski all map to J612).
- Keep the first letter of the name, and drop all occurrences of non-initial
a, e, h, i, o, u, w, y
- Replace the remaining letters with the following numbers:
- b,f,p,v: 1
- c,g,j,k,q,s,x,z: 2
- d,t: 3,
- l: 4
- m,n: 5
- r: 6
- Replace any sequences of identical numbers with a single number (i.e.,
666: 6)
- Convert to the form Letter Digit Digit Digit by dropping
digits past the third (if necessary) or padding with trailing zeroes (if necessary).