Assignment 2: Probabilistic Hashtag Segmentation

Your task in this assignment is to improve on the performance of the deterministic methods you used in Assignment 1 by using a probabilistic language model (the details of what you try are up to you).

Your primary resource in tackling this problem is a large list of bigrams, with frequency counts, derived from the Google N-gram corpus. The most obvious approach is to use these counts to derive a bigram language model and then use that model to find the most probable segmentation given some hashtag.
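
As a starting point, here is a minimal sketch of that obvious approach, not a required design. It assumes the bigram list has already been loaded into a dictionary mapping (word1, word2) pairs to counts; the toy counts, the "<s>" start marker, the add-alpha smoothing, and all function names are made up for illustration. It scores candidate segmentations with smoothed bigram log-probabilities and uses dynamic programming to find the best one.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical start-of-hashtag marker; your bigram list may not have one

def build_model(bigram_counts):
    """Derive per-context totals and a vocabulary from a {(w1, w2): count} table."""
    context_totals = defaultdict(int)
    vocab = set()
    for (w1, w2), count in bigram_counts.items():
        context_totals[w1] += count
        vocab.update((w1, w2))
    return context_totals, vocab

def log_prob(prev, word, bigram_counts, context_totals, vocab_size, alpha=0.01):
    """Add-alpha smoothed log P(word | prev). The smoothing scheme is an assumption."""
    numerator = bigram_counts.get((prev, word), 0) + alpha
    denominator = context_totals.get(prev, 0) + alpha * vocab_size
    return math.log(numerator / denominator)

def segment(hashtag, bigram_counts, max_word_len=20):
    """Return the most probable segmentation of a hashtag under the bigram model."""
    text = hashtag.lstrip("#").lower()
    context_totals, vocab = build_model(bigram_counts)
    n = len(text)
    # best[i][w] = (score, start, prev): best analysis of text[:i] whose last word is w,
    # where that word begins at index `start` and is preceded by the word `prev`.
    best = [dict() for _ in range(n + 1)]
    best[0][START] = (0.0, None, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word not in vocab:
                continue
            for prev, (score, _, _) in best[start].items():
                cand = score + log_prob(prev, word, bigram_counts,
                                        context_totals, len(vocab))
                if word not in best[end] or cand > best[end][word][0]:
                    best[end][word] = (cand, start, prev)
    if not best[n]:
        return text  # no analysis found: fall back to the unsegmented string
    # Pick the best final word, then follow the back-pointers to the start.
    word = max(best[n], key=lambda w: best[n][w][0])
    words, end = [], n
    while end > 0:
        _, start, prev = best[end][word]
        words.append(word)
        end, word = start, prev
    return " ".join(reversed(words))

# Toy counts, invented for this example only.
TOY_COUNTS = {(START, "new"): 50, ("new", "york"): 100, ("york", "city"): 80}
print(segment("#newyorkcity", TOY_COUNTS))  # prints "new york city"
```

Note that this sketch simply gives up on hashtags containing strings that are not in the vocabulary. How you handle unknown words, and how you smooth, are exactly the kinds of details that are up to you.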

Improving performance in this context means reducing the error rate from the previous assignment, where error rate is WER (length-normalized minimum edit distance).
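
For reference, here is one way that measure could be computed. This is a sketch under assumptions, not the official scorer: it assumes the edit distance is taken over whitespace-separated words, that insertions, deletions, and substitutions each cost 1, and that the distance is normalized by the length of the reference segmentation. If you used a different convention in Assignment 1, stick with that one so your numbers are comparable. The function names are illustrative.

```python
def min_edit_distance(ref, hyp):
    """Levenshtein distance between two token lists, with unit costs (an assumption)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # delete r
                            curr[j - 1] + 1,          # insert h
                            prev[j - 1] + (r != h)))  # substitute, or match at cost 0
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Edit distance over words, divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference segmentation is empty")
    return min_edit_distance(ref, hyp) / len(ref)

print(wer("new york city", "newyork city"))  # 2 edits over 3 reference words
```

In the example, turning "newyork city" into "new york city" takes one substitution and one insertion, so the score is 2/3.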

How much credit you get for this will be based in large part on how much your system improves things (reduces WER) and on how complete and clever your approach is. Note that I am specifically interested in approaches that employ a probabilistic language model.

This assignment is due on Thursday, February 24. 


As with the first assignment, I will post a new test set with answers. You should run your system (as it stands at that time) on that test set and write the output to a plain text file with one segmentation per line. Don't include any extraneous output or decoration in this file. Then send me an email with the following attachments:
1. Your code as <yourlastname>
2. Output as <yourlastname>-out-extra.txt
3. A short description of what you did, including the WER of the new system.