Summary This is a series of two assignments in which students build an algorithm to fix spelling errors in user-entered text.
  1. Part 1: Using a Hidden Markov Model (HMM) and the Viterbi algorithm to "correct" words that are not in the dictionary. This part includes instructions for calculating emission and transition probabilities and using them in the Viterbi algorithm; a minimal sketch of the decoding step appears after this list. (The HMM "emissions" are the typed letters, and the "states" are the ground-truth characters.)
  2. Part 2: Improving the algorithm by using BERT to take context into account. In this part, students choose how to design a final algorithm that can combine context (via the language model), probabilities from the Hidden Markov Model, and text edit distance (via Levenshtein distance); a second sketch, showing BERT usage, follows the summary. There is ample PyTorch starter code to help students with BERT.
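To make Part 1 concrete, here is a minimal sketch of Viterbi decoding over characters. The function name, the dictionary-based log-probability tables, and the overall structure are illustrative assumptions, not the assignment's starter code; the emission and transition tables would be estimated from typo/ground-truth pairs as the instructions describe.

    import numpy as np

    def viterbi(observed, states, log_init, log_trans, log_emit):
        """Most likely hidden character sequence for one typed word.

        observed:  list of typed characters (the HMM emissions)
        states:    list of candidate ground-truth characters
        log_init:  {s: log P(first state is s)}
        log_trans: {(s_prev, s): log P(s | s_prev)}
        log_emit:  {(s, o): log P(o | s)}
        """
        n, k = len(observed), len(states)
        score = np.full((n, k), -np.inf)    # best log-prob of any path ending in state j at step i
        back = np.zeros((n, k), dtype=int)  # backpointers for recovering that path

        for j, s in enumerate(states):
            score[0, j] = log_init[s] + log_emit[(s, observed[0])]
        for i in range(1, n):
            for j, s in enumerate(states):
                cand = [score[i - 1, p] + log_trans[(states[p], s)] for p in range(k)]
                best = int(np.argmax(cand))
                back[i, j] = best
                score[i, j] = cand[best] + log_emit[(s, observed[i])]

        # Follow backpointers from the best final state to reconstruct the word.
        j = int(np.argmax(score[-1]))
        path = [states[j]]
        for i in range(n - 1, 0, -1):
            j = back[i, j]
            path.append(states[j])
        return "".join(reversed(path))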
Students also answer questions analyzing the types of errors in their algorithm's output and the impact of their choice of training dataset.
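For Part 2, one natural use of BERT is to rescore candidate corrections with its masked-language-model head. Below is a minimal sketch using the Hugging Face Transformers library; the model choice (bert-base-uncased), the helper name context_scores, and the assumption that each candidate is a single vocabulary token are mine, not the assignment's.

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    def context_scores(masked_sentence, candidates):
        """Log-probability BERT assigns to each candidate at the [MASK] position."""
        inputs = tokenizer(masked_sentence, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
        with torch.no_grad():
            logits = model(**inputs).logits[0, mask_pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        return {w: log_probs[tokenizer.convert_tokens_to_ids(w)].item() for w in candidates}

    # Rank candidate corrections for a typo by how well they fit the context.
    print(context_scores("I spilled my cup of [MASK] this morning.", ["coffee", "cotton", "copper"]))

A final corrector might combine these context scores with the HMM probabilities from Part 1 and a Levenshtein-distance penalty, for example as a weighted sum of log-scores; choosing that combination is part of the design work left to students.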
Topics Hidden Markov Models, BERT, algorithm design, dataset analysis
Audience Undergraduate or graduate introductory course in AI. Students should be interested in the practical aspects of implementing AI algorithms.
Difficulty Students who are interested in the practical aspects of AI (such as writing code and analyzing data) will likely find this assignment more engaging. This assignment has two parts, each on a different AI topic. In my courses, each part is one weekly assignment. Students' self-reported times varied widely, but averaged around 10 hours for each part.
Strengths
  • Solving this one problem covers multiple learning objectives, reducing overall workload.
  • It provides some relevant Python syntax to help students get started using BERT.
  • It asks students to think deeply about the algorithm design and the impacts of the training dataset choice.
Weaknesses
  • The assignment illustrates how recent tools like large language models outperform "classical" AI techniques like Hidden Markov Models, which may lead students to question the value of the classical methods.
  • Part 2 of the assignment should yield a much more accurate algorithm than Part 1. However, Part 2 uses BERT, which runs slowly without a GPU.
  • While Google Colab (https://colab.google) provides some free GPU resources, students with regular access to a GPU may finish the assignment more quickly.
Dependencies
  • The first part of this assignment uses Hidden Markov Models, which we assume have recently been covered in class.
  • Similarly, the second part is best assigned after language models like BERT have been covered.
  • This assignment uses Python and the Transformers library, and is designed to be completed in a notebook environment such as Google Colab (https://colab.google).
Variants
  • As further developments in language models continue to improve spelling correction, a third part could be added that uses generative language models to improve accuracy and enable further analysis.
  • Instructors could also omit the example training dataset and require students to create their own. This would further incentivize analysis of the training-dataset choice, though it would take students more time to complete the assignment.
Part 1
Part 2 (including example Python BERT usage)