The goal of this assignment is to create a token-level bigram
language model and use it to assign probabilities to sentences. This
assignment is to be done individually.
Part 1
Write a program to collect bigram counts from a set of newspaper
articles similar to those used in the first two assignments. I will
provide a test set. In the meantime, use the articles from the last
assignment.
In addition to the usual counts of word pairs, collect counts for bigrams
that include the beginning and end of sentences, using <s> and </s> markers.
Treat clitics as separate tokens as described in Ch. 3.
Retain periods that occur with abbreviations or honorifics ("Mr.",
"Dr.", "etc.", etc.)
Treat numbers as words
Treat hyphenated words as single words (as in "life-affirming",
"18-by-30")
All remaining punctuation should be treated as separate tokens.
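The rules above can be sketched as a small regex-based tokenizer. The pattern and the abbreviation list here are illustrative assumptions, not a required implementation; extend both for real newspaper text.

```python
import re

# Illustrative abbreviation/honorific list (assumption, not exhaustive).
ABBREVS = {"Mr.", "Mrs.", "Ms.", "Dr.", "etc."}

# Order matters: clitics first, then hyphenated words/numbers,
# then any single punctuation character.
TOKEN_RE = re.compile(r"n't|'\w+|[\w-]+|[^\w\s]")

def tokenize(sentence):
    """Split a sentence into tokens: clitics as separate tokens,
    hyphenated words and numbers kept whole, other punctuation
    split off as its own token."""
    tokens = []
    for word in sentence.split():
        if word in ABBREVS:          # keep the period on honorifics
            tokens.append(word)
            continue
        word = re.sub(r"n't\b", " n't", word)  # detach the n't clitic
        for piece in word.split():
            tokens.extend(TOKEN_RE.findall(piece))
    return tokens
```

For example, tokenize("He didn't pay, Mr. Smith said.") yields He / did / n't / pay / , / Mr. / Smith / said / . as separate tokens.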
For this part, you should generate a simple list of the bigrams found in the
collection with a frequency greater than 1, along
with their counts, sorted in descending order of count.
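One minimal way to collect and report these counts, assuming the sentences have already been tokenized (the function names here are my own, not part of the assignment):

```python
from collections import Counter

def bigram_counts(sentences):
    """Count bigrams over pre-tokenized sentences, padding each
    sentence with <s> and </s> markers as required above."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts

def report(counts):
    """Bigrams with count > 1, sorted by descending count
    (ties broken alphabetically)."""
    kept = [(bg, c) for bg, c in counts.items() if c > 1]
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```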
Part 2
Use a standard implementation of
Simple Good-Turing
to produce a smoothed set of bigram log-probabilities for
the counts generated in Part 1. Output a list of bigrams sorted
alphabetically according to the first word in the bigram and the
log-probabilities.
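As a rough sketch of what the smoothing does, the basic (unsmoothed) Turing estimate replaces each raw count r with r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of bigram types seen exactly r times. A real Simple Good-Turing implementation (Gale and Sampson) additionally fits a log-log regression to the N_r values so that large, sparse counts stay well behaved; the sketch below omits that step and simply falls back to the raw count when N_{r+1} is zero, so it is an illustration, not a substitute for the standard implementation.

```python
import math
from collections import Counter

def turing_adjusted_counts(counts):
    """Basic Good-Turing adjusted counts: r* = (r+1) * N_{r+1} / N_r.
    Falls back to the raw count when N_{r+1} is zero; Simple
    Good-Turing would instead use regression-smoothed N_r values."""
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for bigram, r in counts.items():
        if freq_of_freq[r + 1] > 0:
            adjusted[bigram] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            adjusted[bigram] = float(r)  # fallback: raw count
    return adjusted

def bigram_log_probs(counts):
    """Convert adjusted counts into log-probabilities by dividing
    by the total number of observed bigram tokens."""
    total = sum(counts.values())
    adjusted = turing_adjusted_counts(counts)
    return {bg: math.log(c / total) for bg, c in adjusted.items()}
```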
Part 3
Using the smoothed log-probabilities from Part 2, write a program to
assign log-probabilities to sentences for new texts. Assume that the
input is similar to that from the last assignment.
Be prepared to deal with unknown words.
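Sentence scoring then reduces to summing smoothed bigram log-probabilities over a padded token sequence. The unknown-word handling sketched here (falling back to a single floor log-probability for any bigram missing from the table) is one simple option among several; in a fuller solution that mass would come from the Good-Turing estimate for unseen events, or rare words could be mapped to an <UNK> token at training time.

```python
import math

def sentence_logprob(tokens, bigram_logprobs, unseen_logprob):
    """Sum smoothed bigram log-probabilities over a sentence padded
    with <s> and </s>. Bigrams absent from the table (including any
    containing unknown words) receive `unseen_logprob`, an assumed
    floor value supplied by the caller."""
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for bigram in zip(padded, padded[1:]):
        total += bigram_logprobs.get(bigram, unseen_logprob)
    return total
```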
Details
If and when you run into issues that you can't resolve, feel free to
send mail to the class mailing list to discuss the issue.
You should email me your code and answers to the test cases by March 20.
As usual, the deadline for remote CAETE students is one week after the
in-class deadline.