Text Autoencoder for Embedding News Headlines

Lisa Zhang and Pouria Fewzee

Overview

In this assignment, students combine the techniques they learned throughout a deep learning course to build a denoising autoencoder for news headlines. Students then use this denoising autoencoder to retrieve similar headlines and to interpolate between headlines. The assignment combines students' understanding of autoencoders, language modelling via recurrent neural networks (RNNs), data augmentation, and working with embeddings.
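As a rough illustration of these two downstream uses, the sketch below ranks headlines by the cosine similarity of their embeddings and linearly interpolates between two embeddings. It is not taken from the handout: the function names `cosine_similarity`, `most_similar`, and `interpolate` are hypothetical, the three-dimensional vectors are made-up values, and cosine similarity is an assumed choice of metric. In the assignment, the embeddings would come from the trained encoder, and each interpolated point would be passed to the decoder to generate a headline.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, embeddings, k=1):
    """Return the indices of the k embeddings most similar to `query`."""
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: cosine_similarity(query, embeddings[i]),
                    reverse=True)
    return ranked[:k]

def interpolate(u, v, steps):
    """Return `steps` evenly spaced points from u to v, inclusive (steps >= 2)."""
    points = []
    for s in range(steps):
        t = s / (steps - 1)
        points.append([(1 - t) * a + t * b for a, b in zip(u, v)])
    return points

# Toy embeddings standing in for encoder outputs (made-up values).
embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
nearest = most_similar([1.0, 0.0, 0.1], embeddings, k=2)
path = interpolate(embeddings[0], embeddings[2], steps=3)
```

Here `nearest` holds the indices of the two stored embeddings closest to the query, and `path` holds the two endpoints plus their midpoint.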

The cumulative nature of this assignment makes it a good choice for a final assignment in an introductory Artificial Intelligence, Machine Learning, or Deep Learning course that covers the pre-requisite topics.

For a traditional AI course, it is possible to modify the assignment to use various other Information Retrieval techniques on the provided dataset.

Meta Information

Summary

Students combine their understanding of autoencoders, language modelling, data augmentation, and embeddings to build a denoising recurrent neural network autoencoder for news headlines. Students use this model to retrieve similar headlines and to interpolate between headlines.

Topics

Natural Language Processing
Recurrent Neural Networks
Denoising Autoencoder
Data Augmentation
Embeddings
Information retrieval

Audience

Third- and fourth-year students in an introductory artificial intelligence, machine learning, or deep learning course.

Difficulty

The assignment is of moderate difficulty. How hard it feels depends on students' comfort with programming, their understanding of the pre-requisite materials, and their ability to debug neural network code. Some of the common debugging issues are listed in the final section of this page.

Still, the assignment is heavily scaffolded so that a student can tackle some of the questions without completing all of them.

Strengths

The assignment covers a diverse set of deep learning techniques in a single piece of work.
The dataset consists of real news headlines, and can be used for related information retrieval assignments.
We provide a pre-trained model, so interested instructors can use the model for classroom demonstrations or for a smaller exercise.
The scaffolding allows students to create a more involved model without spending an overwhelming amount of time debugging. Common debugging issues are listed in the “Lessons Learned” section.

Weaknesses

Descriptions and explanations are platform-specific (PyTorch and torchtext).
The assignment requires knowledge of many pre-requisite topics. Some of the pre-requisite references are provided in the “Materials” section.
Scaffolding is a double-edged sword: such a heavily scaffolded assignment may not be ideal for a cumulative assignment.

Dependencies

Software: We use Google Colab with Python, PyTorch, and torchtext for this assignment. Although the use of Google Colab is not necessary, the scaffolding in the assignment is specific to Google Colab, PyTorch and torchtext. In particular, the assignment contains instructions for how to use a GPU in Google Colab.

Prior Material: The handout assumes that students have seen autoencoders, recurrent neural networks, data augmentation, and embeddings, and that they have programmed using Python, Google Colab, and PyTorch.

Variants

The dataset can be used for a similar information retrieval task using traditional AI techniques.

A pre-trained model is available as part of the assignment. Instructors can instead use this pre-trained model for a lecture demonstration or a smaller exercise.

Materials

Assignment Handout:

Pre-requisite Materials:

Autoencoders Demo
- Source (prereq/autoencoder_notes.md)
- Jupyter Notebook (prereq/autoencoder_notes.ipynb)
Recurrent Neural Networks Demo
- Source (prereq/rnn_notes.md)
- Jupyter Notebook (prereq/rnn_notes.ipynb)
Generative Recurrent Neural Network Exercise
- Source (prereq/gen_rnn_tut.md)
- Jupyter Notebook (prereq/gen_rnn_tut.ipynb)

Makefile for building the Jupyter notebooks from the markdown source.

Lessons Learned

Using a cloud environment like Google Colab helps students avoid common installation issues and helps ensure equitable access to computational resources.

Debugging is difficult for students. Overfitting on a single headline helps students see that there is an issue, but it is difficult for them to identify what the issue is. Here are some of the most common observations and their underlying issues:

Observation: when overfitting on a single headline, the loss decreases very quickly, but the model does not generate the correct headline.
- Issue: Off-by-one error in the decode function. In particular, the first token that the decoder outputs should not be the beginning-of-sequence token, but rather the token that follows it. The decode function is replicating the input token, rather than predicting the next token. Since it is very easy to replicate the input token, the loss decreases very quickly.
Observation: when overfitting on a single headline, the same token is always generated.
- Issue: The order of the dimensions of the input tensor is incorrect. PyTorch is interpreting the input as N different sequences, each of length 1, rather than 1 sequence of length N. Then, when making predictions for these N sequences, since PyTorch is not conditioning the first token on any information, the prediction for that first token is the same for all N sequences.
Observation: Google Colab crashes when initializing the model.
- Issue: The sizes passed to the AutoEncoder.__init__ method are incorrect, so Colab is asked to initialize a very large number of weights, which requires too much memory.
Observation: Google Colab crashes when computing the embeddings of the validation data.
- Issue: The batch size is too large. In fact, to avoid padding, we should be using a batch size of 1 to compute the embeddings of the validation data.
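
The off-by-one issue in the first observation can be illustrated without any model code. The sketch below is not from the handout: it assumes headlines are wrapped in start and end markers, here written as the hypothetical strings `<BOS>` and `<EOS>`, and it uses a hypothetical helper named `make_decoder_pair` to build the decoder inputs and the next-token targets used during training.

```python
BOS, EOS = "<BOS>", "<EOS>"  # assumed marker strings, for illustration only

def make_decoder_pair(tokens):
    """Split a headline into decoder inputs and next-token targets."""
    sequence = [BOS] + tokens + [EOS]
    inputs = sequence[:-1]   # what the decoder reads at each step
    targets = sequence[1:]   # what it should predict at that step
    return inputs, targets

inputs, targets = make_decoder_pair(["markets", "rally", "today"])
# inputs  == ["<BOS>", "markets", "rally", "today"]
# targets == ["markets", "rally", "today", "<EOS>"]

# The buggy variant compares predictions against the unshifted sequence,
# so every target equals the token the decoder has just read.
buggy_targets = inputs
```

With `buggy_targets`, the model can drive the loss down simply by copying its input, which matches the symptom of a quickly decreasing loss paired with incorrect generated headlines.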

Some of these issues may not relate directly to the kind of machine learning concepts commonly taught in courses, but these issues are extremely common when developing one’s own machine learning model. This assignment is appropriate if debugging neural networks is a course learning objective, but some students will need help identifying issues with their code.