In this assignment, you will implement the main algorithms associated with Hidden Markov Models, and become comfortable with dynamic programming and expectation maximization. You will also apply your HMM to part-of-speech tagging, linguistic analysis, and decipherment.
All code is to be written in hmm.py. Functions with I/O essentials for interacting with the program are provided, as well as headers for the methods you need to define. Read and understand the provided code.
Test your implementations by following the directions under the Verify labels.
You may work on the sections in any order as long as you complete the prerequisites, sketched in the dependency diagram below. (Sections b, c, d, and e depend on Section a; Section f depends on Section e; and Section g depends on all of the others.)
a: read and write model
b: supervised learning | c: random generation | d: viterbi | e: forward & backward
f: unsupervised learning
g: experimentation
We have provided the beginnings of a class called HMM to represent a Hidden Markov Model, parametrized by transition and emission probabilities.
HMM transition and emission parameters are specified in a pair of files, like models/two_english.trans and models/two_english.emit. Each row of a .trans file is of the form
fromstate tostate P(tostate|fromstate)
# is a special state that denotes the start state. Nothing is emitted from the # state.
Each row of a .emit file is of the form
state output P(output|state)
Write a method, HMM.load, to load the parameters from the model files. The method should take the basename of the file pair (such as models/two_english) as an argument.
Additionally, define the inverse method, HMM.dump, which takes a basename for the output file pair and writes the parameters of the model to the corresponding .trans and .emit files.
For brevity, transition or emission probabilities that are 0 should not be written.
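For concreteness, here is a minimal sketch of the loading and dumping logic, assuming the parameters are kept in nested dictionaries of the form {from_state: {to_state_or_output: probability}}. Your HMM.load and HMM.dump should read from and write to the instance instead; the standalone function names here are only illustrative.

from collections import defaultdict

def load_tables(basename):
    # Sketch only: read <basename>.trans and <basename>.emit into nested
    # dicts {from_state: {to_state_or_output: probability}}.
    def read(path):
        table = defaultdict(dict)
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) == 3:            # skip blank or malformed lines
                    src, dst, prob = fields
                    table[src][dst] = float(prob)
        return dict(table)
    return read(basename + ".trans"), read(basename + ".emit")

def dump_tables(basename, transitions, emissions):
    # Inverse sketch: write the non-zero parameters back out, one per line.
    for suffix, table in ((".trans", transitions), (".emit", emissions)):
        with open(basename + suffix, "w") as f:
            for src, row in table.items():
                for dst, prob in row.items():
                    if prob > 0:                # zeros are omitted for brevity
                        f.write(f"{src} {dst} {prob}\n")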
Observations are represented in a .obs file such as data/browntags.obs, where each observation is a list of 2 lines, in order:
state1 state2 state3 ...
output1 output2 output3 ...
The first line is the space-delimited state sequence, and the second is the space-delimited output sequence. The first line may be empty when the state sequence is unknown, as in data/armenian.obs.
Verify
Create a model by loading from models/two_english and then writing it to two_english_test, like so:
model = HMM()
model.load("models/two_english")
model.dump("two_english_test")
Verify that models/two_english.emit and two_english_test.emit are identical (other than line ordering), as are models/two_english.trans and two_english_test.trans.
We have provided a class called Observation to represent each observation, as well as a load_observations function to load a list of observations from a file.
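The provided loader handles this for you, but for clarity, a sketch of reading the two-line .obs format (returning plain (states, outputs) pairs rather than Observation objects) might look like this:

def read_observations(path):
    # Sketch only: each observation is two lines, a (possibly empty)
    # state line followed by an output line.
    observations = []
    with open(path) as f:
        lines = f.read().split("\n")
    for i in range(0, len(lines) - 1, 2):
        states = lines[i].split()        # empty list when states are unknown
        outputs = lines[i + 1].split()
        observations.append((states, outputs))
    return observations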
Define HMM.learn_supervised, which takes a list of observations with known state sequences, such as the observations in data/browntags.obs, as well as booleans specifying whether the transition and emission parameters should be "locked" (unchanged during training). The method should estimate the parameters of the HMM using maximum likelihood estimation. No smoothing is required.
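As a sketch of the counting involved (ignoring the locking flags, and using plain (states, outputs) pairs with the dict-of-dicts parameter shape assumed in the earlier snippet), maximum likelihood estimation is just count-and-normalize:

from collections import defaultdict

def mle_estimate(observations, start_state="#"):
    # Sketch only: count transitions (including '#' -> first state) and
    # emissions, then normalize each row into a probability distribution.
    trans_counts = defaultdict(lambda: defaultdict(float))
    emit_counts = defaultdict(lambda: defaultdict(float))
    for states, outputs in observations:
        prev = start_state
        for state, output in zip(states, outputs):
            trans_counts[prev][state] += 1
            emit_counts[state][output] += 1
            prev = state

    def normalize(counts):
        return {src: {dst: c / sum(row.values()) for dst, c in row.items()}
                for src, row in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)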
Verify
python hmm.py models/partofspeech sup data/browntags.obs
which estimates the model using the state and output sequences in data/browntags.obs (a portion of the Brown corpus). The program writes the trained model to models/partofspeech.browntags.trained{.emit,.trans}. This model should be identical to gold/partofspeech.browntags.trained{.emit,.trans}.
Note: If you are skipping this section or leaving it for later, copy over partofspeech.browntags.trained{.trans, .emit} from the gold directory to the models directory, since you will need them for the following sections.
Define a method, HMM.generate, that takes an integer n and returns a random observation of length n, generated by sampling from the HMM.
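One way to implement this (a sketch under the same dict-of-dicts assumption as above) is to repeatedly sample a next state from the current state's transition distribution and then sample an output from that state's emission distribution:

import random

def generate_sequence(transitions, emissions, n, start_state="#"):
    # Sketch only: roulette-wheel sampling from a {item: probability} dict.
    def sample(dist):
        r, cumulative = random.random(), 0.0
        for item, p in dist.items():
            cumulative += p
            if r < cumulative:
                return item
        return item                      # guard against floating-point rounding

    states, outputs = [], []
    state = start_state
    for _ in range(n):
        state = sample(transitions[state])
        states.append(state)
        outputs.append(sample(emissions[state]))
    return states, outputs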
Test the output of this method for the part-of-speech model you learned in section b by running
python hmm.py models/partofspeech.browntags.trained g generated.obs
which writes 20 random observations generated by the HMM into generated.obs.
Here are two example observations generated by our run. (Of course, due to randomness, yours will differ.)
DET ADJ . ADV ADJ VERB ADP DET ADJ NOUN VERB ADJ NOUN
Define a method, HMM.viterbi, that implements the Viterbi algorithm to find the best state sequence for the output sequence of a given observation. The method should set the state sequence of the observation to be this Viterbi state sequence.
Your algorithm can break ties however it likes. (In practice, HMM parameter values tend to differ enough that you won't run into many ties.)
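Here is a sketch of the dynamic program, again over dict-of-dicts parameters: best[t][s] holds the probability of the best path that ends in state s after emitting the first t+1 outputs, and back[t][s] remembers which predecessor achieved it.

def viterbi_path(transitions, emissions, outputs, start_state="#"):
    # Sketch only: probabilities for unseen transitions/emissions are 0.
    states = list(emissions)
    best = [{s: transitions.get(start_state, {}).get(s, 0.0) *
                emissions[s].get(outputs[0], 0.0) for s in states}]
    back = [{s: None for s in states}]
    for t in range(1, len(outputs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max((best[t - 1][p] * transitions.get(p, {}).get(s, 0.0), p)
                             for p in states)
            best[t][s] = prob * emissions[s].get(outputs[t], 0.0)
            back[t][s] = prev
    # Pick the best final state and trace the backpointers to the start.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(outputs) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))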
Verify
python hmm.py models/partofspeech.browntags.trained v data/ambiguous_sents.obs
This uses the HMM parameters in models/partofspeech.browntags.trained{.trans,.emit} to compute the best sequence of part-of-speech tags for each sentence in data/ambiguous_sents.obs, and writes it to data/ambiguous_sents.tagged.obs.
Compare the output file to gold/ambiguous_sents.tagged.obs. They should be identical.
Define methods HMM.forward and HMM.backward to implement the forward and backward algorithms, respectively. Both methods should take an output sequence, and return a data structure containing the values for \(\alpha_i(t)\) and \(\beta_i(t)\), respectively. Additionally, define HMM.forward_probability and HMM.backward_probability, which return the probability of a given output sequence using the values computed by the above methods.
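For reference, a sketch of the forward recurrence under the same dict-of-dicts assumption: alpha[t][s] here plays the role of \(\alpha_s(t)\), the probability of emitting the first t+1 outputs and ending in state s. Summing the final column gives the value HMM.forward_probability should return, and HMM.backward mirrors the same structure from the right-hand end of the sequence.

def forward_values(transitions, emissions, outputs, start_state="#"):
    # Sketch only: alpha[t][s] = P(outputs[0..t], state at time t = s).
    states = list(emissions)
    alpha = [{s: transitions.get(start_state, {}).get(s, 0.0) *
                 emissions[s].get(outputs[0], 0.0) for s in states}]
    for t in range(1, len(outputs)):
        alpha.append({s: emissions[s].get(outputs[t], 0.0) *
                         sum(alpha[t - 1][p] * transitions.get(p, {}).get(s, 0.0)
                             for p in states)
                      for s in states})
    return alpha

def forward_probability_of(alpha):
    # Total probability of the output sequence: sum over the final time step.
    return sum(alpha[-1].values())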
Verify
python hmm.py models/partofspeech.browntags.trained f data/ambiguous_sents.obs
computes the total probability of each sentence in data/ambiguous_sents.obs using HMM.forward, and writes it to data/ambiguous_sents.forwardprob. Similarly,
python hmm.py models/partofspeech.browntags.trained b data/ambiguous_sents.obs
computes the total probability of each sentence using HMM.backward, and writes it to data/ambiguous_sents.backwardprob. The numbers in both these files should be the same, and identical to gold/ambiguous_sents.prob.
Define HMM.learn_unsupervised, which takes a list of observations where the state sequences may be unknown, a convergence threshold, booleans specifying whether the transition and emission parameters should be "locked", and the number of random restarts. This method should use the Baum-Welch EM algorithm (which draws upon HMM.forward and HMM.backward) to estimate the model parameters, starting from the current model. The function should also return the log likelihood of the trained model.
If the number of restarts is greater than 0, re-initialize the model with random values (keeping the locked parameters the same) and run EM again. Keep the model with the best log likelihood across all the restarts.
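To make the E-step concrete, here is a sketch of a single Baum-Welch pass over a set of output sequences, under the same dict-of-dicts assumptions as the earlier snippets and ignoring parameter locking; your learn_unsupervised would wrap something like this inside the convergence loop and random restarts described above.

import math
from collections import defaultdict

def baum_welch_pass(transitions, emissions, sequences, start="#"):
    # Sketch only: one E-step + M-step over a list of output sequences.
    states = list(emissions)
    t_counts = defaultdict(lambda: defaultdict(float))
    e_counts = defaultdict(lambda: defaultdict(float))
    log_likelihood = 0.0
    for outputs in sequences:
        T = len(outputs)
        # Forward: alpha[t][s] = P(outputs[0..t], state at time t = s).
        alpha = [{s: transitions[start].get(s, 0.0) * emissions[s].get(outputs[0], 0.0)
                  for s in states}]
        for t in range(1, T):
            alpha.append({s: emissions[s].get(outputs[t], 0.0) *
                             sum(alpha[t - 1][p] * transitions.get(p, {}).get(s, 0.0)
                                 for p in states)
                          for s in states})
        # Backward: beta[t][s] = P(outputs[t+1..T-1] | state at time t = s).
        beta = [dict.fromkeys(states, 1.0) for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for s in states:
                beta[t][s] = sum(transitions.get(s, {}).get(n, 0.0) *
                                 emissions[n].get(outputs[t + 1], 0.0) * beta[t + 1][n]
                                 for n in states)
        total = sum(alpha[T - 1].values())          # P(outputs) under the model
        log_likelihood += math.log(total)
        # E-step: accumulate expected transition and emission counts.
        for s in states:
            t_counts[start][s] += alpha[0][s] * beta[0][s] / total
        for t in range(T):
            for s in states:
                gamma = alpha[t][s] * beta[t][s] / total
                e_counts[s][outputs[t]] += gamma
                if t + 1 < T:
                    for n in states:
                        t_counts[s][n] += (alpha[t][s] *
                                           transitions.get(s, {}).get(n, 0.0) *
                                           emissions[n].get(outputs[t + 1], 0.0) *
                                           beta[t + 1][n]) / total
    # M-step: normalize the expected counts into new probability tables.
    def normalize(counts):
        return {src: {dst: c / sum(row.values()) for dst, c in row.items()}
                for src, row in counts.items()}
    return normalize(t_counts), normalize(e_counts), log_likelihood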
Verify
python hmm.py models/two_english unsup data/english_words.obs
which will print out the trained model's log likelihood:
The final model's log likelihood is -152860.669251
The program will also write the trained model to models/two_english.english_words.trained{.emit,.trans}.
Check them against gold/two_english.english_words.trained{.emit, .trans}.
You're done! Now on to play with your cool new toy.
Run hmm.py with the appropriate command line arguments to fit the models/two_armenian HMM on the list of Armenian words provided in data/armenian_words.obs, followed by Viterbi-tagging the data with the learned model.
Did the program find a separation of vowels and consonants? (Keep in mind that the "C" and "V" names for the states are arbitrary; what's important is the separation.) Look online for information about Armenian characters.
If the separation is not what you expect and your code is correct, perhaps you got stuck in a poor local maximum. Re-run EM with restarts or a lower convergence threshold.
Briefly describe your experiments and findings.
The secret message in data/message.obs has been encoded by translating English characters into numbers. Each number corresponds to a unique character, but the same character could map to multiple possible numbers. Your task is to decode this message.
As per the noisy channel model, we say that the message was created by generating English characters with some language model probability, and then encoding each English character into this secret language with some channel model probability. So, we're going to use English character bigrams, estimated from English text, as the HMM state transitions in models/encoding.trans. We will keep them locked because we're certain about them; by doing so, we only learn the emission (English-character state -> number) parameters.
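For example, the locking might look something like the hypothetical invocation below; the keyword names and values are only illustrative, since the actual signature is whatever you gave learn_unsupervised.

from hmm import HMM, load_observations

model = HMM()
model.load("models/encoding")
observations = load_observations("data/message.obs")
# Keep the English bigram transitions fixed; re-estimate only the emissions.
log_likelihood = model.learn_unsupervised(
    observations,
    0.001,                  # hypothetical convergence threshold
    lock_transitions=True,  # hypothetical flag name: transitions stay locked
    lock_emissions=False,   # hypothetical flag name: emissions are learned
    restarts=5)             # hypothetical number of random restarts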
Run hmm.py with the appropriate command line arguments to learn updated parameters for models/encoding{.trans,.emit} and then decode the message. As before, you may have to experiment with random restarts and varying the convergence threshold.