Text Autoencoder for Embedding News Headlines

Lisa Zhang and Pouria Fewzee

Overview

In this assignment, students combine the techniques they have learned throughout a deep learning course to build a denoising autoencoder for news headlines. Students then use this denoising autoencoder to query similar headlines and to interpolate between headlines. The assignment combines students’ understanding of autoencoders, language modelling via recurrent neural networks (RNNs), data augmentation, and working with embeddings.

The cumulative nature of this assignment makes it a good choice for a final assignment in an introductory Artificial Intelligence, Machine Learning, or Deep Learning course that covers the pre-requisite topics.

For a traditional AI course, it is possible to modify the assignment to use various other Information Retrieval techniques on the provided dataset.
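To make the "denoising" part of the overview concrete, here is a minimal sketch of one way to corrupt a tokenized headline before feeding it to the autoencoder, which is then trained to reconstruct the clean headline. The function name `add_noise`, the `replace_prob` parameter, and the toy vocabulary are assumptions made for illustration; the handout may use a different noise scheme.

```python
import random

def add_noise(tokens, vocab, replace_prob=0.2, rng=None):
    """Return a copy of `tokens` where each token is independently
    replaced by a random vocabulary word with probability `replace_prob`.
    (Hypothetical helper; not taken from the assignment handout.)"""
    if rng is None:
        rng = random.Random()
    noisy = []
    for tok in tokens:
        if rng.random() < replace_prob:
            noisy.append(rng.choice(vocab))  # corrupt this position
        else:
            noisy.append(tok)                # keep the original token
    return noisy

# Toy data, assumed for illustration only.
headline = "markets rally after surprise rate cut".split()
vocab = ["markets", "rally", "after", "surprise", "rate", "cut", "stocks", "fall"]

noisy = add_noise(headline, vocab, replace_prob=0.3, rng=random.Random(0))
print(noisy)     # model input
print(headline)  # reconstruction target
```

The noisy sequence has the same length as the original, so the clean headline can serve directly as the reconstruction target.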

Meta Information

Summary

Students combine their understanding of autoencoders, language modelling, data augmentation, and embeddings to build a denoising recurrent neural network autoencoder for news headlines. Students use this model to retrieve similar headlines and to interpolate between headlines.
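The two uses of the trained model named above, retrieval and interpolation, both operate on the embedding vectors produced by the encoder. The sketch below illustrates them with plain Python lists. The three-dimensional vectors and the headlines are made up, and cosine similarity and linear interpolation are assumed here as the similarity measure and interpolation scheme; in the assignment the vectors would come from the encoder, and each interpolated vector would be passed to the decoder to generate a headline.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, embeddings, k=1):
    """Return the k headlines whose embeddings are closest to `query`."""
    ranked = sorted(
        embeddings,
        key=lambda text: cosine_similarity(query, embeddings[text]),
        reverse=True,
    )
    return ranked[:k]

def interpolate(u, v, num_steps=5):
    """Evenly spaced points on the line segment from u to v (inclusive)."""
    path = []
    for i in range(num_steps):
        t = i / (num_steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(u, v)])
    return path

# Made-up embeddings, for illustration only.
embeddings = {
    "stocks climb on earnings": [0.9, 0.1, 0.0],
    "markets rally after rate cut": [0.8, 0.2, 0.1],
    "local team wins final": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]

top2 = most_similar(query, embeddings, k=2)
path = interpolate(embeddings["stocks climb on earnings"],
                   embeddings["local team wins final"])
print(top2)
print(path)
```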

Topics

Audience

Third- and fourth-year students in an introductory artificial intelligence, machine learning, or deep learning course.

Difficulty

The assignment is of moderate difficulty. How difficult a student finds it depends on their comfort with programming, their understanding of the pre-requisite materials, and their ability to debug neural network code. Some of the common debugging issues are listed in the final section of this page.

Still, the assignment is heavily scaffolded so that a student can tackle some of the questions without completing all of them.

Strengths

Weaknesses

Dependencies

Software: We use Google Colab with Python, PyTorch, and torchtext for this assignment. Although the use of Google Colab is not necessary, the scaffolding in the assignment is specific to Google Colab, PyTorch and torchtext. In particular, the assignment contains instructions for how to use a GPU in Google Colab.

Prior Material: The handout assumes that students have seen autoencoders, recurrent neural networks, data augmentation, and embeddings, and that they can program in Python using Google Colab and PyTorch.

Variants

The dataset can be used for a similar information retrieval task using traditional AI techniques.
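As one possible illustration of such a variant, the sketch below retrieves the most similar headline with TF-IDF weighting and cosine similarity, using only the Python standard library. TF-IDF is an assumption chosen here as a representative information retrieval technique; the assignment materials do not prescribe it, and the headlines and helper names are made up.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Build sparse TF-IDF vectors (dicts) and the IDF table for `documents`."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({term: n * idf[term] for term, n in counts.items()})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query, documents, vectors, idf):
    """Return the best-matching document and its similarity score."""
    counts = Counter(query.lower().split())
    # Query terms never seen in the collection are ignored.
    q = {term: n * idf[term] for term, n in counts.items() if term in idf}
    scores = [cosine(q, vec) for vec in vectors]
    best = max(range(len(documents)), key=scores.__getitem__)
    return documents[best], scores[best]

# Made-up headlines, for illustration only.
headlines = [
    "markets rally after rate cut",
    "central bank announces rate cut",
    "local team wins championship final",
]
vectors, idf = tfidf_vectors(headlines)
best, score = search("bank rate decision", headlines, vectors, idf)
print(best, round(score, 3))
```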

A pre-trained model is available as part of the assignment. Instructors can instead use this pre-trained model for a lecture demonstration or a smaller exercise.

Materials

Assignment Handout:

Pre-requisite Materials:

Makefile for building the Jupyter notebooks from the markdown source.

Lessons Learned

Using a cloud environment like Google Colab helps students avoid common installation issues and helps ensure equitable access to computational resources.

Debugging is difficult for students. Overfitting on a single headline helps students understand that there is an issue, but it is difficult for them to identify what the issue is. Here are some of the most common observations and the underlying issues:

Some of these issues may not relate directly to the kind of machine learning concepts commonly taught in courses, but these issues are extremely common when developing one’s own machine learning model. This assignment is appropriate if debugging neural networks is a course learning objective, but some students will need help identifying issues with their code.
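The single-headline overfitting check mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the model from the handout: the toy vocabulary, the `TinyAutoEncoder` class, and every hyperparameter are hypothetical, and noise is omitted to keep the check small. The idea is that a correct model and training loop should drive the loss on one example close to zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Use the Colab GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy vocabulary and a single tokenized headline.
vocab = ["<bos>", "<eos>", "markets", "rally", "after", "rate", "cut"]
stoi = {word: i for i, word in enumerate(vocab)}
headline = ["<bos>", "markets", "rally", "after", "rate", "cut", "<eos>"]
tokens = torch.tensor([[stoi[w] for w in headline]], device=device)  # (1, T)

class TinyAutoEncoder(nn.Module):
    """Hypothetical GRU autoencoder, kept small for the sanity check."""
    def __init__(self, vocab_size, emb_size=16, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.encoder = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, inp):
        emb = self.embed(inp)
        _, hidden = self.encoder(emb)               # headline embedding
        # Teacher forcing: the decoder sees tokens 0..T-2 and predicts 1..T-1.
        out, _ = self.decoder(emb[:, :-1], hidden)
        return self.proj(out)                       # (1, T-1, vocab_size)

model = TinyAutoEncoder(len(vocab)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

losses = []
for step in range(300):
    optimizer.zero_grad()
    logits = model(tokens)
    loss = criterion(logits.reshape(-1, len(vocab)), tokens[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

If the loss plateaus well above zero on a single example, the problem more likely lies in the model or training loop (for example, misaligned targets or a missing `zero_grad` call) than in the amount of data.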