The goal of this assignment is to create a token-level bigram
language model and use it to assign probabilities to sentences. This
assignment is to be done individually.
Part 1
Write a program to collect bigram counts from a set of newspaper
articles similar to those used in the first two assignments. I will
provide a test set. In the meantime, use the articles from the last
assignment.
In addition to the usual counts of word pairs, collect counts for bigrams
that include the beginning and end of sentences, using <s> and </s> markers.
Treat clitics as separate tokens as described in Ch. 3.
Retain periods that occur with abbreviations or honorifics ("Mr.",
"Dr.", "etc.", etc.)
Treat numbers as words
Treat hyphenated words as single words (as in "life-affirming",
"18-by-30")
All remaining punctuation should be treated as separate tokens.
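The rules above can be sketched as a small regex-based tokenizer. The pattern and the abbreviation list here are illustrative assumptions, not a required implementation; extend both for real newspaper text.

```python
import re

# Illustrative abbreviation/honorific list (assumption, not exhaustive).
ABBREVS = {"Mr.", "Mrs.", "Ms.", "Dr.", "etc."}

# Order matters: clitics first, then hyphenated words/numbers,
# then any single punctuation character.
TOKEN_RE = re.compile(r"n't|'\w+|[\w-]+|[^\w\s]")

def tokenize(sentence):
    """Split a sentence into tokens: clitics as separate tokens,
    hyphenated words and numbers kept whole, other punctuation
    split off as its own token."""
    tokens = []
    for word in sentence.split():
        if word in ABBREVS:          # keep the period on honorifics
            tokens.append(word)
            continue
        word = re.sub(r"n't\b", " n't", word)  # detach the n't clitic
        for piece in word.split():
            tokens.extend(TOKEN_RE.findall(piece))
    return tokens
```

For example, tokenize("He didn't pay, Mr. Smith said.") yields He / did / n't / pay / , / Mr. / Smith / said / . as separate tokens.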
For this part, you should generate a simple list of the bigrams found in the
collection with a frequency greater than 1, along
with their counts, sorted in descending order of count.
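One minimal way to collect and report these counts, assuming the sentences have already been tokenized (the function names here are my own, not part of the assignment):

```python
from collections import Counter

def bigram_counts(sentences):
    """Count bigrams over pre-tokenized sentences, padding each
    sentence with <s> and </s> markers as required above."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts

def report(counts):
    """Bigrams with count > 1, sorted by descending count
    (ties broken alphabetically)."""
    kept = [(bg, c) for bg, c in counts.items() if c > 1]
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```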
Part 2
Use a standard implementation of
Simple Good-Turing
to produce a smoothed set of bigram log-probabilities for
the counts generated in Part 1. Output a list of bigrams sorted
alphabetically according to the first word in the bigram and the
log-probabilities.
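As a rough sketch of what the smoothing does, the basic (unsmoothed) Turing estimate replaces each raw count r with r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of bigram types seen exactly r times. A real Simple Good-Turing implementation (Gale and Sampson) additionally fits a log-log regression to the N_r values so that large, sparse counts stay well behaved; the sketch below omits that step and simply falls back to the raw count when N_{r+1} is zero, so it is an illustration, not a substitute for the standard implementation.

```python
import math
from collections import Counter

def turing_adjusted_counts(counts):
    """Basic Good-Turing adjusted counts: r* = (r+1) * N_{r+1} / N_r.
    Falls back to the raw count when N_{r+1} is zero; Simple
    Good-Turing would instead use regression-smoothed N_r values."""
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for bigram, r in counts.items():
        if freq_of_freq[r + 1] > 0:
            adjusted[bigram] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            adjusted[bigram] = float(r)  # fallback: raw count
    return adjusted

def bigram_log_probs(counts):
    """Convert adjusted counts into log-probabilities by dividing
    by the total number of observed bigram tokens."""
    total = sum(counts.values())
    adjusted = turing_adjusted_counts(counts)
    return {bg: math.log(c / total) for bg, c in adjusted.items()}
```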
Part 3
Using the smoothed log-probabilities from Part 2, write a program to
assign log-probabilities to sentences for new texts. Assume that the
input is similar to that from the last assignment.
Be prepared to deal with unknown words.
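Sentence scoring then reduces to summing smoothed bigram log-probabilities over a padded token sequence. The unknown-word handling sketched here (falling back to a single floor log-probability for any bigram missing from the table) is one simple option among several; in a fuller solution that mass would come from the Good-Turing estimate for unseen events, or rare words could be mapped to an <UNK> token at training time.

```python
import math

def sentence_logprob(tokens, bigram_logprobs, unseen_logprob):
    """Sum smoothed bigram log-probabilities over a sentence padded
    with <s> and </s>. Bigrams absent from the table (including any
    containing unknown words) receive `unseen_logprob`, an assumed
    floor value supplied by the caller."""
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for bigram in zip(padded, padded[1:]):
        total += bigram_logprobs.get(bigram, unseen_logprob)
    return total
```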
Details
If and when you run into issues that you can't resolve, feel free to
send mail to the class mailing list to discuss the issue.
You should email me your code and answers to the test cases by March 20.
As usual, the deadline for remote CAETE students is one week after the
in-class deadline.