Implementing a Hidden Markov Model Toolkit

In this assignment, you will implement the main algorthms associated with Hidden Markov Models, and become comfortable with dynamic programming and expectation maximization. You will also apply your HMM for part-of-speech tagging, linguistic analysis, and decipherment.

All code is to be written in hmm.py. Functions with I/O essentials for interacting with the program are provided, as well as headers for the methods you need to define. Read and understand the provided code.

Test your implementatons by following the directions under the Verify labels.

You may work on the sections in any order as long as you complete the prerequisites, sketched in the dependency diagram below. (Sections b, c, d, and e are reliant on Section a, Section f is reliant on Section e, and Section g is reliant on all.)

a: read and write model
b: supervised learning	c: random generation	d: viterbi	e: forward & backward
			f: unsupervised learning
g: experimentation

Section a: Load+Write the model files

Models

We have provided the beginnings of a class called HMM to represent a Hidden Markov Model, parametrized by transition and emission probabilities.

HMM transition and emission parameters are specified in a pair of files, like models/two_english.trans and models/two_english.emit. Each row of a .trans file is of the form

fromstate tostate P(tostate|fromstate)

# is a special state that denotes the start state. Nothing is emitted from the # state.

Each row of a .emit file is of the form

state output P(output|state)

Write a method, HMM.load, to load the parameters from the model files.

Note that some possible transitions or emissions may be omitted in the files. These parameters should maintain a value of 0.
Optionally, the probabilities may be left out when we don't know them, as in models/two_armenian.trans and models/two_armenian.emit. If the model files don't specify the conditional probabilities, the constructor should initialize them randomly.

Your constructor should take the basename of the file pair (such as models/two_english) as an argument.

Additionally, define the inverse method, HMM.dump, which takes a basename of the output file pair and writes the parameters of the model to the corresponding .trans and .emit files. For brevity, transition or emission probabilities that are 0 should not be written.

Observations (no code needed)

Observations are represented in a .obs file such as data/browntags.obs, where each observation is a list of 2 lines in order:

state1 state2 state3 ...
output1 output2 output3 ...

The first line is the space-delimited state sequence, and the second is the space-delimited output sequence. The first line may be empty when the state sequence is unknown, as in data/armenian.obs

Verify

Create a model by loading from models/two_english and then writing it two_english_test, like so:



model = HMM() 

model.load(models/two_english)

model.dump(two_english_test)

Verify that models/two_english.emit and two_english_test.emit are identical (others than line ordering), as are models/two_english.trans and two_english_test.trans.

We have provided a class called Observation to represent each observation, as well as a load_observations function to load a list of observations from a file.

Section b: Supervised Learning

Define HMM.learn_supervised that takes a list of observations with known state sequences, such as the observations in data/browntags.obs, as well as booleans specifying whether the transition and emission paramaters should be "locked" (unchanged during training). The method should estimate the parameters of the HMM using maximum likelihood estimation. No smoothing is required.

Verify

python hmm.py models/partofspeech sup data/browntags.obs

which estimates the model using the state and output sequences in data/browntags.obs (a portion of the Brown corpus). The program writes the trained model to models/partofspeech.browntags.trained{.emit, .trans}. This model should be identical to gold/partofspeech.browntags.trained{.emit, .trans}.

Note: If you are skipping this section or leaving it for later, copy over partofspeech.browntags.trained{.trans, .emit} from the gold directory to the models directory, since you will need them for the following sections.

Section c: Random Observation Generation

Define a method, HMM.generate, that takes an integer n, and returns a random observation of length n, generated by sampling from the HMM.

Test the output of this method for the part-of-speech model you learned in section b by running

python hmm.py models/partofspeech.browntags.trained g generated.obs

which writes 20 random observations generated by the HMM into generated.obs

Here are two example observations generated by our run. (Of course, due to randomness, yours will differ.)

DET ADJ . ADV ADJ VERB ADP DET ADJ NOUN VERB ADJ NOUN the semi-catatonic , quite several must of an western bridge cannot spectacular analyses DET NOUN ADP NOUN CONJ DET VERB DET NOUN NOUN NOUN ADP DET NOUN whose light for wall and the learned the hull postmaster trash in his peters

Section d: Viterbi Algorithm for the Best State Sequence

Define a method, HMM.viterbi, that implements the Viterbi algorithm to find the best state sequence for the output sequence of a given observation. The method should set the state sequence of the observation to be this Viterbi state sequence.

Your algorithm can break ties however it likes. (In practice, HMM parameter values tend to differ enough that you won't run into many ties.)

Verify

python hmm.py models/partofspeech.browntags.trained v data/ambiguous_sents.txt

This uses the HMM parameters in models/partofspeech.{trans,emit} to compute the best sequence of part-of-speech tags for each sentence in data/ambiguous_sents.obs, and writes it to data/ambiguous_sents.tagged.obs.

Compare the output file to gold/ambiguous_sents.tagged.obs. They should be identical.

Section e: Forward and Backward Algorithms

Define methods HMM.forward and HMM.backward to implement the forward and backward algorithms respectively. Both methods should take an output sequence, and return a data structure containing the values for \(\alpha_i(t)\) and \(\beta_i(t)\) respectively.

Additionally, define HMM.forward_probability and HMM.backward_probability which return the probability of a given output sequence using the values computed by the above methods.

Verify

python hmm.py models/partofspeech.browntags.trained f data/ambiguous_sents.obs

computes the total probability of each sentence in data/ambiguous_sents.obs using HMM.forward, and writes it to data/ambiguous_sents.forwardprob.

Similarly,

python hmm.py models/partofspeech b data/ambiguous_sents.obs

computes the total probability of each sentence using HMM.backward, and writes it to data/ambiguous_sents.backwardprob.

The numbers in both these files should be the same, and identical to gold/ambiguous_sents.prob.

Section f: Unsupervised Learning with Baum Welch

Define HMM.learn_unsupervised that takes a list of observations where the state sequences may be unknown, a convergence threshold, booleans specifying whether the transition and emission paramaters should be "locked", and the number of random restarts. This method should use the Baum Welch EM algorithm (which draws upon HMM.forward and HMM.backward) to estimate the model parameters, starting from the current model.

The function should also return the log likehood of the trained model.

If the number of restarts is greater than 0, re-initialize the model with random values (keeping the locked parameters the same) and run EM again. Keep the model with the best log likelihood from all the restarts.

Verify

python hmm.py models/two_english unsup data/english_words.obs

which will print out the trained model's log likelihood

The final model's log likelihood is -152860.669251

The program will also write the trained model to models/two_english.english_words.trained{.emit,.trans}.

Check them against gold/two_english.english_words.trained{.emit, .trans}.

Section g: Experimentation and Analysis

You're done! Now on to play with your cool new toy.

(section c) Examine the sentences produced in generated.obs. Ignore the state sequences (part of speech tags) and only look at the output sequences. Are these reasonable English sentences?
(section d) data/ambiguous_sents.obs contains some words with ambiguous parts of speech. The Viterbi algorithm resolves some of the ambiguities correctly but not others. Which ones are incorrect? Looking at the partofspeech{.trans, .emit} HMM parameters, can you explain why? [reference on part-of-speech tags]
(section f) What did the models/two_english HMM learn when trained unsupervised on the list of English words in Section f?
(section d, section f)
Run hmm.py with the appropriate command line arguments to fit the models/two_armenian HMM on the list of Armenian words provided in data/armenian_words.obs, followed by Viterbi-tagging the data with the learned model.

Did the program find a separation of vowels and consonants? (Keep in mind that the "C" and "V" names for the states are arbitrary; what's important is the separation.) Look online for information about Armenian characters.

If the separation is not what you expect, and your code is correct, perhaps you got stuck in low local maximum. Re-run EM with restarts or a lower convergence threshold.

Briefly describe your experiments and findings.
(section d, section f)
The secret message in data/message.obs has been encoded by translating English characters into numbers. Each number corresponds to a unique character, but the same character could map to multiple possible numbers. Your task is to decode this message.

As per the noisy channel, we say that the message was created by generating English characters with some language model probability, and then encoding each English character into this secret language with some channel model probability. So, we're going to use English character bigrams, estimated from English text, as the HMM state transitions in models/encoding.trans. We will keep them locked because we're certain about them; by doing so, we only learn the emission (English-character state -> number) parameters.

Run hmm.py with the appropriate command line arguments to learn updated parameters for models/encoding{.trans,.emit} and then decode the message. As before, you may have to experiment with random restarts and varying the convergence threshold.