Assignments

Assignments will be posted here as we go along. Programming assignments should be completed using the Python programming language.

Assignment 3: Information Extraction

In this assignment you are to implement an HMM-based approach to named entity recognition.  In this approach, we can cast the problem of finding named entities as a tagging task using IOB tags. The framework for the HMM-based solution is identical to the one used for the POS tagging assignment.  The particular NER task we’ll be tackling is to find all the references to genes in a set of biomedical journal article abstracts. 

Sample GENE tags

Structure O
, O
promoter O
analysis O
and O
chromosomal O
assignment O
of O
the O
human B
APEX I
gene I
. O

The training material consists of around 13,000 sentences with gene references tagged with IOB tags.  Since we’re only dealing with one kind of entity here, there are only three tag types in the data.  The format of the data is identical to the POS tagging HW: one token per line, followed by a tab and its tag.  An example is shown under “Sample GENE tags” above.  It contains one gene mention, “human APEX gene”, with the corresponding tag sequence B I I.
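To make the IOB encoding concrete, here is a minimal sketch that recovers entity spans from a tag sequence. The function name `iob_spans` is hypothetical, and treating an I tag with no preceding B as the start of a new mention is an assumption, not something the assignment specifies.

```python
def iob_spans(tags):
    """Return (start, end) token index pairs, end exclusive, one per entity."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            # A B tag always opens a new mention, closing any open one.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            # Assumption: tolerate an I with no preceding B.
            if start is None:
                start = i
        else:  # "O"
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# The sample sentence from this page.
tokens = ("Structure , promoter analysis and chromosomal "
          "assignment of the human APEX gene .").split()
tags = ["O"] * 9 + ["B", "I", "I", "O"]
genes = [" ".join(tokens[s:e]) for s, e in iob_spans(tags)]
print(genes)
```

On the sample sentence this yields the single mention “human APEX gene”.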

Although the structure of this problem is the same as POS tagging, the characteristics of the problem are quite different.  In particular, there are far fewer transition probabilities to learn, since there are only three tags.  However, the vocabulary is much larger than in the BERP domain, and unknown words will be far more prevalent.  Both of these considerations may lead you to different strategies from those you used in Assignment 2.
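As a rough illustration of how small the transition model is, the sketch below estimates bigram tag transition log probabilities from tag sequences. Add-k smoothing, the `<s>` start symbol, and the function name `transition_log_probs` are all assumptions chosen for the example; they are one option among several, not a required approach.

```python
import math
from collections import Counter


def transition_log_probs(tag_sequences, tags, k=1.0, start="<s>"):
    """Add-k smoothed log P(tag | previous tag) for a bigram tag model."""
    bigrams = Counter()
    context = Counter()
    for seq in tag_sequences:
        prev = start
        for tag in seq:
            bigrams[(prev, tag)] += 1
            context[prev] += 1
            prev = tag
    probs = {}
    for prev in [start] + list(tags):
        denom = context[prev] + k * len(tags)
        for tag in tags:
            probs[(prev, tag)] = math.log((bigrams[(prev, tag)] + k) / denom)
    return probs


toy_sequences = [["O", "B", "I", "O"], ["O", "O"]]
log_trans = transition_log_probs(toy_sequences, ["B", "I", "O"])
print(len(log_trans))  # 4 contexts (start, B, I, O) times 3 tags
```

With three tags plus a start symbol there are only 12 transition parameters. Note that smoothing gives some probability to the transition from O to I, which never occurs in well-formed IOB data, so you may prefer to leave such transitions at zero.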

Assignment 2

In this assignment you are to implement an HMM-based approach to POS tagging. Specifically, you are to implement the Viterbi algorithm using a bigram tag/state model.  As training data, I am providing a POS-tagged section of the BERP corpus. Your systems will be evaluated against an unseen test set drawn from the same corpus.
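For orientation, here is a minimal sketch of Viterbi decoding with a bigram tag model, working in log space. The dictionary layout (`log_trans[(prev_tag, tag)]`, `log_emit[(tag, word)]`), the `<s>` start symbol, and the toy probabilities are assumptions made for the example; how you estimate and store the probabilities is up to you.

```python
import math


def viterbi(words, tags, log_trans, log_emit, start="<s>"):
    """Most probable tag sequence for words under a bigram HMM.

    Missing dictionary entries are treated as log(0).
    """
    if not words:
        return []
    neg_inf = float("-inf")

    def lt(prev, tag):
        return log_trans.get((prev, tag), neg_inf)

    def le(tag, word):
        return log_emit.get((tag, word), neg_inf)

    # best[i][t]: log score of the best path ending in tag t at position i.
    best = [{t: lt(start, t) + le(t, words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + lt(p, t))
            best[i][t] = best[i - 1][prev] + lt(prev, t) + le(t, words[i])
            back[i][t] = prev

    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Toy model with made-up probabilities, for illustration only.
toy_tags = ["N", "V"]
toy_trans = {("<s>", "N"): math.log(0.8), ("<s>", "V"): math.log(0.2),
             ("N", "N"): math.log(0.3), ("N", "V"): math.log(0.7),
             ("V", "N"): math.log(0.6), ("V", "V"): math.log(0.4)}
toy_emit = {("N", "fish"): math.log(0.6), ("V", "fish"): math.log(0.4),
            ("N", "swim"): math.log(0.2), ("V", "swim"): math.log(0.8)}
print(viterbi(["fish", "swim"], toy_tags, toy_trans, toy_emit))
```

Working with log probabilities avoids numeric underflow on long sentences, since products of many small probabilities become sums.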

Sample sentence

i       PRP
'd      MD
like    VB
french  JJ
food    NN
.       .


Training

The training data consists of around 15,000 POS-tagged sentences from the BERP corpus.  The sentences are arranged as one word-tag pair per line, with a blank line between sentences.  Words and tags are tab-separated.  Contractions are split out into separate tokens with separate tags.  An example is shown under “Sample sentence” above.
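A minimal sketch of reading this format follows. The function name `read_tagged_sentences` is hypothetical, and the code assumes every non-blank line contains a word, a tab, and a tag.

```python
def read_tagged_sentences(lines):
    """Group tab-separated word/tag lines into sentences.

    lines is any iterable of strings, such as an open file.
    Returns a list of sentences, each a list of (word, tag) pairs.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag = line.rsplit("\t", 1)
        current.append((word, tag))
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


sample_lines = ["i\tPRP\n", "'d\tMD\n", "\n", "food\tNN\n", ".\t.\n"]
print(read_tagged_sentences(sample_lines))
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.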

You should assume that the tags that appear in the training data constitute all the tags that exist (no new tags will appear in testing).  On the other hand, new words may appear in the test set.
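Because new words may appear at test time, the emission model needs some way to score words it has never seen. One common option, sketched below, is to map rare training words to a placeholder token so that the model learns emission counts for it. The `<UNK>` token, the `min_count` threshold, and the function names are assumptions for illustration; other strategies may work better.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder for rare and unseen words


def build_vocab(sentences, min_count=2):
    """Words that occur at least min_count times in the training data."""
    counts = Counter(word for sent in sentences for word, _ in sent)
    return {w for w, c in counts.items() if c >= min_count}


def normalize(word, vocab):
    """Map out-of-vocabulary words to the placeholder token."""
    return word if word in vocab else UNK


def emission_counts(sentences, vocab):
    """Count (tag, word) pairs after replacing rare words with UNK."""
    counts = Counter()
    for sent in sentences:
        for word, tag in sent:
            counts[(tag, normalize(word, vocab))] += 1
    return counts


toy_training = [
    [("i", "PRP"), ("like", "VB"), ("food", "NN")],
    [("i", "PRP"), ("like", "VB"), ("sushi", "NN")],
]
toy_vocab = build_vocab(toy_training)
toy_counts = emission_counts(toy_training, toy_vocab)
print(normalize("thai", toy_vocab), toy_counts[("NN", UNK)])
```

At test time, apply the same `normalize` step to each input word before looking up emission probabilities, and emit the original word in the output.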


Decoding

For decoding, your system will read in sentences from a file with the same format minus the tags.  That is, the input has one word per line; each sentence ends with a period, followed by a blank line before the next sentence.  As output, you should emit an appropriate tag for each word in the same format as the training data.
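The input and output handling might look like the sketch below. The function names are hypothetical, and `dummy_tagger` is a stand-in that tags every word NN; in your system it would be replaced by your decoder.

```python
import io


def read_untagged_sentences(lines):
    """Group one-word-per-line input into sentences split on blank lines."""
    sentences, current = [], []
    for line in lines:
        word = line.strip()
        if word:
            current.append(word)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def write_tagged(sentences, tagger, out):
    """Write word<TAB>tag lines with a blank line after each sentence."""
    for words in sentences:
        for word, tag in zip(words, tagger(words)):
            out.write(f"{word}\t{tag}\n")
        out.write("\n")


def dummy_tagger(words):
    """Placeholder tagger for the example only."""
    return ["NN"] * len(words)


test_input = io.StringIO("i\n'd\nlike\nfrench\nfood\n.\n\n")
out = io.StringIO()
write_tagged(read_untagged_sentences(test_input), dummy_tagger, out)
print(out.getvalue())
```

The in-memory `io.StringIO` objects here stand in for real files opened with `open`.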

Assignment 1: 50 Points

Your task in this assignment is to write a Python program that accepts as input a plain text newspaper article and returns the number of paragraphs, sentences, and words contained in the article.  You should try to be appropriately modular and forward-looking in this homework, as you may want to reuse the code in future assignments.
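As a deliberately naive starting point, the sketch below counts the three units with regular expressions. It assumes that paragraphs are separated by blank lines, that sentences end in a period, exclamation mark, or question mark followed by whitespace, and that words are runs of letters or digits with optional internal apostrophes or hyphens. Real articles will break these assumptions (abbreviations such as “Dr.” are one example), which is part of what the assignment asks you to think about.

```python
import re


def count_units(text):
    """Return (paragraphs, sentences, words) under simple assumptions."""
    # Assumption: a blank line separates paragraphs.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: ., ! or ? followed by whitespace ends a sentence.
    sentences = [
        s
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p.strip())
        if s
    ]
    # Assumption: a word is alphanumeric, with internal ' or - allowed.
    words = re.findall(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*", text)
    return len(paragraphs), len(sentences), len(words)


toy_article = "The cat sat. It purred!\n\nDogs don't care."
print(count_units(toy_article))
```

Keeping each unit's rule in one place makes it easier to swap in a better paragraph, sentence, or word definition later.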

Use this file as a development example. Don't try to fit your code too closely to this single example.  Try to generalize (a bit) from what you encounter in it.

What to Turn In

I will post a test file shortly before the due date.  Email me as attachments:

1. Your code as <lastname-firstname>-assgn1.py

2. A short report of what you did and all the various assumptions you made, including the result of running your code on the development article and the test article.  As part of this description, you should discuss how you think a system such as yours should be evaluated.  This should be <lastname-firstname>-assgn1-report.pdf.

This is due on September 17 by the start of class.