{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Denoising Autoencoder for News Headlines\n",
    "\n",
    "In this assignment, we'll explore a more advanced use of deep learning on a\n",
    "natural language process (NLP) task involving news headlines. \n",
    "In particular, we'll be working with a dataset of Reuters news headlines\n",
    "collected over a span of 15 months, covering some of 2018, 2019, and early 2020.\n",
    "This assignment will combine several of the concepts that we discussed in class,\n",
    "including autoencoders, recurrent neural networks, data augmentation,\n",
    "and working with embeddings.\n",
    "\n",
    "To be more specific, we'll be building an **autoencoder** of news headlines.\n",
    "This idea is similar the kind of image autoencoder we saw/built in lecture:\n",
    "we will build an **encoder** model that maps a news headline to a vector embedding,\n",
    "and a **decoder** that reconstructs the news headline. Both our encoder and decoder\n",
    "networks will be Recurrent Neural Networks. You'll have a chance build\n",
    "networks that takes a sequence as an input, and a network that generates a\n",
    "sequence as an output.\n",
    "\n",
    "This project is organized as follows:\n",
    "\n",
    "- Question 1. Data exploration\n",
    "- Question 2. Building the autoencoder\n",
    "- Question 3. Training the autoencoder using *data augmentation*\n",
    "- Question 4. Analyzing the embeddings (interpolating between headlines)\n",
    "\n",
    "Much of the idea behind this assignment is movtivated by Shen et al [1].\n",
    "We'll use the data augmentation rules proposed in that work to improve\n",
    "the robustness of the autoencoder.\n",
    "\n",
    "[1] Shen et al (2019) \"Educating Text Autoencoders: Latent Representation Guidance via Denoising\" https://arxiv.org/pdf/1905.12777.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "import torch.optim as optim\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import random\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 1\n",
    "\n",
    "Download the files `reuters_train.txt` and `reuters_valid.txt`, and upload them to Google Drive.\n",
    "\n",
    "Then, mount Google Drive from your Google Colab notebook:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.colab import drive\n",
    "drive.mount('/content/gdrive')\n",
    "\n",
    "train_path = '/content/gdrive/My Drive/CSC321/reuters_train.txt' # Update me\n",
    "valid_path = '/content/gdrive/My Drive/CSC321/reuters_valid.txt' # Update me"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will be using PyTorch's `torchtext` utilities to help us load, process,\n",
    "and batch the data. This package is useful, but takes a bit of time to get\n",
    "used to. \n",
    "\n",
    "We'll be using a `TabularDataset` to load our data, which works well on structured\n",
    "CSV data with fixed columns (e.g. a column for the sequence, a column for the label). Our tabular dataset\n",
    "is even simpler: we have no labels, just some text. So, we are treating our data as a table with one field\n",
    "representing our sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchtext\n",
    "\n",
    "# Tokenization function to separate a headline into words\n",
    "def tokenize_headline(headline):\n",
    "    \"\"\"Returns the sequence of words in the string headline. We also\n",
    "    prepend the \"<bos>\" or beginning-of-string token, and append the\n",
    "    \"<eos>\" or end-of-string token to the headline.\n",
    "    \"\"\"\n",
    "    return (\"<bos> \" + headline + \" <eos>\").split()\n",
    "\n",
    "# Data field (column) representing our *text*.\n",
    "text_field = torchtext.data.Field(\n",
    "    sequential=True,            # this field consists of a sequence\n",
    "    tokenize=tokenize_headline, # how to split sequences into words\n",
    "    include_lengths=True,       # to track the length of sequences, for batching\n",
    "    batch_first=True,           # similar to batch_first=True in nn.RNN demonstrated in lecture\n",
    "    use_vocab=True)             # to turn each character into an integer index\n",
    "train_data = torchtext.data.TabularDataset(\n",
    "    path=train_path,                # data file path\n",
    "    format=\"tsv\",                   # fields are separated by a tab\n",
    "    fields=[('title', text_field)]) # list of fields (we have only one)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (a) -- 2 points\n",
    "\n",
    "Draw histograms of the number of words per headline in our training set.\n",
    "Excluding the `<bos>` and `<eos>` tags in your computation.\n",
    "Explain why we would be interested in such histograms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Include your histogram and your written explanations\n",
    "\n",
    "# Here is an example of how to plot a histogram in matplotlib:\n",
    "# plt.hist(np.random.normal(0, 1, 40), bins=20)\n",
    "\n",
    "# Here are some sample code that uses the train_data object:\n",
    "print(train_data[5].title)\n",
    "for example in train_data:\n",
    "    print(example.title)\n",
    "    break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (b) -- 2 points\n",
    "\n",
    "How many distinct words appear in the training data?\n",
    "Exclude the `<bos>` and `<eos>` tags in your computation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Report your values here. Make sure that you report the actual values,\n",
    "# and not just the code used to get those values\n",
    "\n",
    "# You might find the python class Counter from the collections package useful\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (c) -- 2 points\n",
    "\n",
    "The distribution of *words* will have a long tail, meaning that there are some words\n",
    "that will appear very often, and many words that will appear infrequently. How many words\n",
    "appear exactly once in the training set? Exactly twice?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Report your values here. Make sure that you report the actual values,\n",
    "# and not just the code used to get those values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (d) -- 2 points\n",
    "\n",
    "Explain why we may wish to replace these infrequent\n",
    "words with an `<unk>` tag, instead of learning embeddings for these rare words.\n",
    "(Hint: Consider words in the validation set that might not appear in training)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Include your explanation here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (e) -- 2 points\n",
    "\n",
    "We will only model the top 9995 words in the training set, excluding the tags\n",
    "`<bos>`, `<eos>`, and other possible tags we haven't mentioned yet\n",
    "(including those, we will have a vocabulary size of exactly 10000 tokens).\n",
    "\n",
    "What percentage of word occurrences will be supported? Alternatively, what percentage\n",
    "of word occurrences in the training set will be set to the `<unk>` tag?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Report your values here. Make sure that you report the actual values,\n",
    "# and not just the code used to get those values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our `torchtext` package will help us keep track of our list of unique words, known\n",
    "as a **vocabulary**. A vocabulary also assigns a unique integer index to each word.\n",
    "You can interpret these indices as sparse representations of one-hot vectors."
   ]
  },
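  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the one-hot interpretation concrete, here is a minimal sketch using a\n",
    "made-up 4-word vocabulary and a made-up embedding matrix `E` (both are\n",
    "illustrative assumptions, not the assignment's data). Multiplying a one-hot\n",
    "vector by `E` selects the same row as indexing `E` by the word's integer index,\n",
    "which is why storing only the index loses no information.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical toy vocabulary and embedding matrix (not the real ones)\n",
    "toy_itos = ['<unk>', '<pad>', 'oil', 'prices']\n",
    "toy_stoi = {w: i for i, w in enumerate(toy_itos)}\n",
    "E = np.arange(12, dtype=float).reshape(4, 3)  # one 3-d row per word\n",
    "\n",
    "idx = toy_stoi['oil']              # integer index of the word\n",
    "one_hot = np.zeros(len(toy_itos))\n",
    "one_hot[idx] = 1.0                 # the equivalent one-hot vector\n",
    "\n",
    "# Both routes pick out the same embedding row\n",
    "by_index = E[idx]\n",
    "by_matmul = one_hot @ E\n",
    "print(idx, by_index, by_matmul)\n",
    "```"
   ]
  },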
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build the vocabulary based on the training data. The vocabulary\n",
    "# can have at most 9997 words (9995 words + the <bos> and <eos> token)\n",
    "text_field.build_vocab(train_data, max_size=9997)\n",
    "\n",
    "# This vocabulary object will be helpful for us\n",
    "vocab = text_field.vocab\n",
    "print(vocab.stoi[\"hello\"]) # for instances, we can convert from string to (unique) index\n",
    "print(vocab.itos[10])      # ... and from word index to string\n",
    "\n",
    "# The size of our vocabulary is actually 10000\n",
    "vocab_size = len(text_field.vocab.stoi)\n",
    "print(vocab_size) # should be 10000\n",
    "\n",
    "# The reason is that torchtext adds two more tokens for us:\n",
    "print(vocab.itos[0]) # <unk> represents an unknown word not in our vocabulary\n",
    "print(vocab.itos[1]) # <pad> will be used to pad short sequences for batching"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 2\n",
    "\n",
    "Building a text autoencoder is a little more complicated than an image autoencoder, so\n",
    "we'll need to thoroughly understand the model that we want to build before actually building\n",
    "our model. Note that the best and fastest way to complete this assignment is to spend a *lot*\n",
    "of time upfront understanding the architecture. The explanations are quite dense, and you\n",
    "might need to stop every sentence or two to understand what's going on.\n",
    "You won't feel productive for a while since you won't be writing code,\n",
    "but this initial investment will help you become more productive later on.\n",
    "Understanding this architecture will also help you understand other machine learning\n",
    "papers you might come across. So, take a deep breath, and let's do this!\n",
    "\n",
    "Here is a diagram showing our desired architecture:\n",
    "\n",
    "![](p4model.png){ width=90% }\n",
    "\n",
    "<img src=\"p4model.png\" width=\"95%\" />\n",
    "\n",
    "There are two main components to the model: the **encoder** and the **decoder**.\n",
    "As always with neural networks, we'll first describe how to make\n",
    "**predictions** with of these components. Let's get started:\n",
    "\n",
    "The **encoder** will take a sequence of words (a headline) as *input*, and produce an\n",
    "embedding (a vector) that represents the entire headline. In the diagram above,\n",
    "the vector ${\\bf h}^{(7)}$ is the vector embedding containing information about \n",
    "the entire headline.  This portion is very similar\n",
    "to the sentiment analysis RNN that we discussed in lecture (but without the fully-connected\n",
    "layer that makes a prediction).\n",
    "\n",
    "The **decoder** will take an embedding (in the diagram, the vector ${\\bf h}^{(7)}$) as input,\n",
    "and uses a separate RNN to **generate a sequence of words**. To generate a sequence of words,\n",
    "the decoder needs to do the following:\n",
    "\n",
    "1) Determine the previous word that was generated. This previous word will act as ${\\bf x}^{(t)}$\n",
    "   to our RNN, and will be used to update the hidden state ${\\bf m}^{(t)}$. Since each of our\n",
    "   sequences begin with the `<bos>` token, we'll set ${\\bf x}^{(1)}$ to be the `<bos>` token.\n",
    "2) Compute the updates to the hidden state ${\\bf m}^{(t)}$ based on the previous hidden state\n",
    "   ${\\bf m}^{(t-1)}$ and ${\\bf x}^{(t)}$. Intuitively, this hidden state vector ${\\bf m}^{(t)}$\n",
    "   is a representation of *all the words we still need to generate*.\n",
    "3) We'll use a fully-connected layer to take a hidden state ${\\bf m}^{(t)}$, and determine\n",
    "   *what the next word should be*. This fully-connected layer solves a *classification problem*,\n",
    "   since we are trying to choose a word out of $K=10000$ distinct words. As in a classification\n",
    "   problem, the fully-connected neural network will compute a *probability distribution* over\n",
    "   these 10,000 words. In the diagram, we are using ${\\bf z}^{(t)}$ to represent the logits,\n",
    "   or the pre-softmax activation values representing the probability distribution.\n",
    "4) We will need to *sample* an actual word from this probability distribution ${\\bf z}^{(t)}$.\n",
    "   We can do this in a number of ways, which we'll discuss in question 3. For now, you can \n",
    "   imagine your favourite way of picking a word given a distribution over words.\n",
    "5) This word we choose will become the next input ${\\bf x}^{(t+1)}$ to our RNN, which is used\n",
    "   to update our hidden state ${\\bf m}^{(t+1)}$---i.e. to determine what are the remaining\n",
    "   words to be generated.\n",
    "\n",
    "We can repeat this process until we see an `<eos>` token generated, or until the generated\n",
    "sequence becomes too long.\n",
    "\n",
    "Unfortunately, we can't *train* this autoencoder in the way we just described. That is,\n",
    "we can't just compare our generated sequence with our ground-truth sequence, and get\n",
    "gradients. Both sequences are **discrete** entities, so we won't be able to compute\n",
    "gradients at all! In particular, **sampling is a discrete process**, and so we won't be\n",
    "able to back-propagate through any kind of sampling that we do.\n",
    "\n",
    "You might wonder whether we can get away with computing gradients by comparing the\n",
    "distributions ${\\bf z}^{(t)}$ with the ground truth words at each time step. Like any\n",
    "multi-class classification problem, we can represent the ground-truth words as a one-hot\n",
    "vector, and use the cross-entropy loss.\n",
    "\n",
    "In theory, we can do this. In practice, there are a few issues. One is that the generated\n",
    "sequence might be longer or shorter than the actual sequence, meaning that there may\n",
    "be more/fewer ${\\bf z}^{(t)}$s than ground-truth words. Another more insidious issue\n",
    "is that the **gradients will become very high-variance and unstable**, because\n",
    "**early mistakes will easily throw the model off-track**. Early in training,\n",
    "our model is unlikely to produce the right answer in step $t=1$, so the gradients\n",
    "we obtain based on the other time steps will not be very useful.\n",
    "\n",
    "At this point, you might have some ideas about \"hacks\" we can use to make training\n",
    "work. Fortunately, there is one very well-established solution called\n",
    "**teacher forcing** which we can use for training:\n",
    "instead of *sampling* the next word based on ${\\bf z}^{(t)}$, we will forgo sampling,\n",
    "and use the **ground truth** ${\\bf x}^{(t)}$ in the next step.\n",
    "\n",
    "Here is a diagram showing how we can use **teacher forcing** to train our model:\n",
    "\n",
    "![](p4model_tf.png){ width=90% }\n",
    "\n",
    "<img src=\"p4model_tf.png\" width=\"95%\" />\n",
    "\n",
    "We will use the RNN generator to compute the logits\n",
    "${\\bf z}^{(1)},{\\bf z}^{(2)},  \\cdots {\\bf z}^{(T)}$. These distributions\n",
    "can be compared to the ground-truth words using the cross-entropy loss.\n",
    "The loss function for this model will be the sum of the losses across each $t$.\n",
    "(This is similar to what we did in a pixel-wise prediction problem.)\n",
    "\n",
    "We'll train the encoder and decoder model simultaneously. There are several components\n",
    "to our model that contain tunable weights:\n",
    "\n",
    "- The word embedding that maps a word to a vector representation.\n",
    "  In theory, we could use GloVe embeddings, or initialize our parameters to\n",
    "  GloVe embeddings. To prevent students who don't have Colab access\n",
    "  from having to download a 1GB file, we won't do that.\n",
    "  The word embedding component is represented with blue arrows in the diagram.\n",
    "- The encoder RNN (which will use Gated Recurrent Units) that computes the\n",
    "  embedding over the entire headline. The encoder RNN \n",
    "  is represented with black arrows in the diagram.\n",
    "- The decoder RNN (which will also use Gated Recurrent Units) that computes\n",
    "  hidden states, which are vectors representing what words are to be generated.\n",
    "  The decoder RNN is represented with gray arrows in the diagram.\n",
    "- The **projection MLP** (one fully-connected layer) that computes\n",
    "  a distribution over the next word to generate, given a decoder RNN hidden\n",
    "  state.\n",
    "\n",
    "## Part (a) -- 10 pts\n",
    "\n",
    "Complete the code for the AutoEncoder class below by:\n",
    "\n",
    "1. Filling in the missing numbers in the `__init__` method using\n",
    "   the parameters `vocab_size`, `emb_size`, and `hidden_size`. (4 points)\n",
    "2. Complete the `forward` method, which uses teacher forcing\n",
    "   and computes the logits $z^{(t)}$ of the reconstruction of\n",
    "   the sequence. (4 points)\n",
    "\n",
    "You should first try to understand the `encode` and `decode` methods,\n",
    "which are written for you. The `encode` method mimics a discriminative\n",
    "RNN (see the sentiment analysis notebook).  The `decode` method is\n",
    "a generative RNN and is a bit more complex (see the text generation\n",
    "tutorial notebook).  You might want to scroll down to the\n",
    "`sample_sequence` function to see how this function will be called.\n",
    "\n",
    "You can (but don't have to) use the `encode` and `decode` method in\n",
    "your `forward` method. In either case, be very careful of the input\n",
    "that you feed into ether `decode` or to `self.decoder_rnn`.\n",
    "Refer to the teacher-forcing diagram."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class AutoEncoder(nn.Module):\n",
    "    def __init__(self, vocab_size, emb_size, hidden_size):\n",
    "        \"\"\"\n",
    "        A text autoencoder. The parameters \n",
    "            - vocab_size: number of unique words/tokens in the vocabulary\n",
    "            - emb_size: size of the word embeddings $x^{(t)}$\n",
    "            - hidden_size: size of the hidden states in both the\n",
    "                           encoder RNN ($h^{(t)}$) and the\n",
    "                           decoder RNN ($m^{(t)}$)\n",
    "        \"\"\"\n",
    "        super().__init__()\n",
    "        self.embed = nn.Embedding(num_embeddings=None, # TODO\n",
    "                                  embedding_dim=None)  # TODO\n",
    "        self.encoder_rnn = nn.GRU(input_size=None, #TODO\n",
    "                                  hidden_size=None, #TODO\n",
    "                                  batch_first=True)\n",
    "        self.decoder_rnn = nn.GRU(input_size=None, #TODO\n",
    "                                  hidden_size=None, #TODO\n",
    "                                  batch_first=True)\n",
    "        self.proj = nn.Linear(in_features=None, # TODO\n",
    "                              out_features=None) # TODO\n",
    "\n",
    "    def encode(self, inp):\n",
    "        \"\"\"\n",
    "        Computes the encoder output given a sequence of words.\n",
    "        \"\"\"\n",
    "        emb = self.embed(inp)\n",
    "        out, last_hidden = self.encoder_rnn(emb)\n",
    "        return last_hidden\n",
    "\n",
    "    def decode(self, inp, hidden=None):\n",
    "        \"\"\"\n",
    "        Computes the decoder output given a sequence of words, and\n",
    "        (optionally) an initial hidden state.\n",
    "        \"\"\"\n",
    "        emb = self.embed(inp)\n",
    "        out, last_hidden = self.decoder_rnn(emb, hidden)\n",
    "        out_seq = self.proj(out)\n",
    "        return out_seq, last_hidden\n",
    "\n",
    "    def forward(self, inp):\n",
    "        \"\"\"\n",
    "        Compute both the encoder and decoder forward pass\n",
    "        given an integer input sequence inp with shape [batch_size, seq_length],\n",
    "        with inp[a,b] representing the (index in our vocabulary of) the b-th word\n",
    "        of the a-th training example.\n",
    "\n",
    "        This function should return the logits $z^{(t)}$ in a tensor of shape\n",
    "        [batch_size, seq_length - 1, vocab_size], computed using *teaching forcing*.\n",
    "\n",
    "        The (seq_length - 1) part is not a typo. If you don't understand why\n",
    "        we need to subtract 1, refer to the teacher-forcing diagram above.\n",
    "        \"\"\"\n",
    "\n",
    "        # TODO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (b) -- 5 pts\n",
    "\n",
    "To check that your model is set up correctly, we'll train our AutoEncoder\n",
    "neural network for at least 300 iterations to memorize this sequence:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "headline = train_data[42].title\n",
    "input_seq = torch.Tensor([vocab.stoi[w] for w in headline]).long().unsqueeze(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are looking for the way that you set up your loss function\n",
    "corresponding to the figure above.\n",
    "**Be very careful of off-by-ones.**\n",
    "\n",
    "Note that the Cross Entropy Loss expects a rank-2 tensor as its first\n",
    "argument, and a rank-1 tensor as its second argument. You will\n",
    "need to properly reshape your data to be able to compute the loss."
   ]
  },
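  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the reshaping only, random tensors with made-up sizes stand in\n",
    "for real logits and targets below (`toy_logits` and `toy_targets` are\n",
    "hypothetical names). Flattening the batch and time dimensions together turns\n",
    "every time step of every sequence into one row of a classification problem:\n",
    "\n",
    "```python\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "batch_size, steps, num_classes = 2, 4, 7   # made-up sizes\n",
    "\n",
    "toy_logits = torch.randn(batch_size, steps, num_classes)          # rank 3\n",
    "toy_targets = torch.randint(0, num_classes, (batch_size, steps))  # rank 2\n",
    "\n",
    "toy_criterion = nn.CrossEntropyLoss()\n",
    "toy_loss = toy_criterion(toy_logits.reshape(-1, num_classes),  # [8, 7]: rank 2\n",
    "                         toy_targets.reshape(-1))              # [8]:    rank 1\n",
    "print(toy_loss.shape)  # the loss is a scalar tensor\n",
    "```"
   ]
  },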
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = AutoEncoder(vocab_size, 128, 128)\n",
    "optimizer = optim.Adam(model.parameters(), lr=0.001)\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "\n",
    "for it in range(300):\n",
    "\n",
    "    # TODO\n",
    "\n",
    "    if (it+1) % 50 == 0:\n",
    "        print(\"[Iter %d] Loss %f\" % (it+1, float(loss)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (c) -- 2 pt\n",
    "\n",
    "Once you are satisfied with your model, encode your input using\n",
    "the RNN encoder, and sample some sequences from the decoder. The \n",
    "sampling code is provided to you, and performs the computation\n",
    "from the first diagram (without teacher forcing).\n",
    "\n",
    "Note that we are sampling from a multi-nomial distribution described\n",
    "by the logits $z^{(t)}$. For example, if our distribution is [80%, 20%]\n",
    "over a vocabulary of two words, then we will choose the first word\n",
    "with 80% probability and the second word with 20% probability.\n",
    "\n",
    "Call `sample_sequence` at least 5 times, with the default temperature\n",
    "value. Make sure to include the generated sequences in your PDF\n",
    "report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample_sequence(model, hidden, max_len=20, temperature=1):\n",
    "    \"\"\"\n",
    "    Return a sequence generated from the model's decoder\n",
    "        - model: an instance of the AutoEncoder model\n",
    "        - hidden: a hidden state (e.g. computed by the encoder)\n",
    "        - max_len: the maximum length of the generated sequence\n",
    "        - temperature: described in Part (d)\n",
    "    \"\"\"\n",
    "    # We'll store our generated sequence here\n",
    "    generated_sequence = []\n",
    "    # Set input to the <BOS> token\n",
    "    inp = torch.Tensor([text_field.vocab.stoi[\"<bos>\"]]).long()\n",
    "    for p in range(max_len):\n",
    "        # compute the output and next hidden unit\n",
    "        output, hidden = model.decode(inp.unsqueeze(0), hidden)\n",
    "        # Sample from the network as a multinomial distribution\n",
    "        output_dist = output.data.view(-1).div(temperature).exp()\n",
    "        top_i = int(torch.multinomial(output_dist, 1)[0])\n",
    "        # Add predicted word to string and use as next input\n",
    "        word = text_field.vocab.itos[top_i]\n",
    "        # Break early if we reach <eos>\n",
    "        if word == \"<eos>\":\n",
    "            break\n",
    "        generated_sequence.append(word)\n",
    "        inp = torch.Tensor([top_i]).long()\n",
    "    return generated_sequence\n",
    "\n",
    "# Your solutions go here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (d) -- 3 pt\n",
    "\n",
    "The multi-nomial distribution can be manipulated using the `temperature`\n",
    "setting. This setting can be used to make the distribution \"flatter\" (e.g.\n",
    "more likely to generate different words) or \"peakier\" (e.g. less likely\n",
    "to generate different words).\n",
    "\n",
    "Call `sample_sequence` at least 5 times each for at least 3 different\n",
    "temperature settings (e.g. 1.5, 2, and 5). Explain why we generally\n",
    "don't want the temperature setting to be too **large**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Include the generated sequences and explanation in your PDF report."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 3\n",
    "\n",
    "It turns out that getting good results from a text auto-encoder is very difficult,\n",
    "and that it is very easy for our model to **overfit**. We have discussed several methods\n",
    "that we can use to prevent overfitting, and we'll introduce one more today:\n",
    "**data augmentation**.\n",
    "\n",
    "The idea behind data augmentation is to artificially increase the number of training\n",
    "examples by \"adding noise\" to the image. For example, during AlexNet training,\n",
    "the authors randomly cropped $224\\times 224$\n",
    "regions of a $256 \\times 256$ pixel image to increase the amount of training data.\n",
    "The authors also flipped the image left/right (but not up/down---why?).\n",
    "Machine learning practitioners can also add Gaussian noise to the image.\n",
    "\n",
    "When we use data augmentation to train an *autoencoder*, we typically to only add\n",
    "the noise to the input, and expect the reconstruction to be *noise free*.\n",
    "This makes the task of the autoencoder even more difficult. An autoencoder trained\n",
    "with noisy inputs is called a **denoising auto-encoder**. For simplicity, we will\n",
    "*not* build a denoising autoencoder today.\n",
    "\n",
    "### Part (a) -- 3pt\n",
    "\n",
    "Give three more examples of data augmentation techniques that we could use if\n",
    "we were training an **image** autoencoder. What are different ways that we can\n",
    "change our input?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Include your three answers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (b) -- 2pt\n",
    "\n",
    "We will add noise to our headlines using a few different techniques:\n",
    "\n",
    "1. Shuffle the words in the headline, taking care that words don't end up too far from where they were initially\n",
    "2. Drop (remove) some words \n",
    "3. Replace some words with a blank word (a `<pad>` token)\n",
    "4. Replace some words with a random word \n",
    "\n",
    "The code for adding these types of noise is provided for you:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tokenize_and_randomize(headline,\n",
    "                           drop_prob=0.1,  # probability of dropping a word\n",
    "                           blank_prob=0.1, # probability of \"blanking\" out a word\n",
    "                           sub_prob=0.1,   # probability of substituting a word with a random one\n",
    "                           shuffle_dist=3): # maximum distance to shuffle a word\n",
    "    \"\"\"\n",
    "    Add 'noise' to a headline by slightly shuffling the word order,\n",
    "    dropping some words, blanking out some words (replacing with the <pad> token)\n",
    "    and substituting some words with random ones.\n",
    "    \"\"\"\n",
    "    headline = [vocab.stoi[w] for w in headline.split()]\n",
    "    n = len(headline)\n",
    "    # shuffle\n",
    "    headline = [headline[i] for i in get_shuffle_index(n, shuffle_dist)]\n",
    "\n",
    "    new_headline = [vocab.stoi['<bos>']]\n",
    "    for w in headline:\n",
    "        if random.random() < drop_prob:\n",
    "            # drop the word\n",
    "            pass\n",
    "        elif random.random() < blank_prob:\n",
    "            # replace with blank word\n",
    "            new_headline.append(vocab.stoi[\"<pad>\"])\n",
    "        elif random.random() < sub_prob:\n",
    "            # substitute word with another word\n",
    "            new_headline.append(random.randint(0, vocab_size - 1))\n",
    "        else:\n",
    "            # keep the original word\n",
    "            new_headline.append(w)\n",
    "    new_headline.append(vocab.stoi['<eos>'])\n",
    "    return new_headline\n",
    "\n",
    "def get_shuffle_index(n, max_shuffle_distance):\n",
    "    \"\"\" This is a helper function used to shuffle a headline with n words,\n",
    "    where each word is moved at most max_shuffle_distance. The function does\n",
    "    the following: \n",
    "       1. start with the *unshuffled* index of each word, which\n",
    "          is just the values [0, 1, 2, ..., n]\n",
    "       2. perturb these \"index\" values by a random floating-point value between\n",
    "          [0, max_shuffle_distance]\n",
    "       3. use the sorted position of these values as our new index\n",
    "    \"\"\"\n",
    "    index = np.arange(n)\n",
    "    perturbed_index = index + np.random.rand(n) * 3\n",
    "    new_index = sorted(enumerate(perturbed_index), key=lambda x: x[1])\n",
    "    return [index for (index, pert) in new_index]"
   ]
  },
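  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The perturb-and-sort idea in `get_shuffle_index` can be checked on its own.\n",
    "The sketch below is a standalone re-implementation in plain Python\n",
    "(`local_shuffle_index` is a hypothetical name, not part of the provided code).\n",
    "Because each index is perturbed by less than `max_dist`, a word can only trade\n",
    "places with nearby words, so it lands fewer than `max_dist` positions from\n",
    "where it started:\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "def local_shuffle_index(n, max_dist):\n",
    "    # Perturb each position i by a random amount in [0, max_dist),\n",
    "    # then read off the original positions in sorted order\n",
    "    perturbed = [i + random.random() * max_dist for i in range(n)]\n",
    "    return sorted(range(n), key=lambda i: perturbed[i])\n",
    "\n",
    "words = 'a b c d e f g h'.split()\n",
    "order = local_shuffle_index(len(words), max_dist=3)\n",
    "print([words[i] for i in order])\n",
    "\n",
    "# Largest distance between a word's new and old position\n",
    "print(max(abs(new - old) for new, old in enumerate(order)))\n",
    "```"
   ]
  },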
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Call the function `tokenize_and_randomize` 5 times on a headline of your\n",
    "choice. Make sure to include both your original headline, and the five new\n",
    "headlines in your report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Report your values here. Make sure that you report the actual values,\n",
    "# and not just the code used to get those values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (c) -- 3 pt\n",
    "\n",
    "The training code that we use to train the model is mostly provided for you. \n",
    "The only parts we left blank are those from Q2(b). Complete the code,\n",
    "and train a new AutoEncoder model for 1 epoch. You can train your model\n",
    "for longer if you want, but training tends to take a long time,\n",
    "so we're only checking to see that your training loss is trending down.\n",
    "\n",
    "If you are using Google Colab, you can use a GPU for this portion.\n",
    "Go to \"Runtime\" => \"Change runtime type\" and set \"Hardware accelerator\" to GPU.\n",
    "Your Colab session will restart.\n",
    "You can move your model to the GPU by typing `model.cuda()`, and move\n",
    "other tensors to GPU (e.g. `xs = xs.cuda()`). To move a model back to CPU,\n",
    "type `model.cpu()`. To move a tensor back, use `xs = xs.cpu()`. For training,\n",
    "your model and inputs need to be on the *same device*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_autoencoder(model, batch_size=64, learning_rate=0.001, num_epochs=10):\n",
    "    optimizer = optim.Adam(model.parameters(), lr=learning_rate)\n",
    "    criterion = nn.CrossEntropyLoss()\n",
    "\n",
    "    for ep in range(num_epochs):\n",
    "        # We will perform data augmentation by re-reading the input each time\n",
    "        field = torchtext.data.Field(sequential=True,\n",
    "                                     tokenize=tokenize_and_randomize, # <-- data augmentation\n",
    "                                     include_lengths=True,\n",
    "                                     batch_first=True,\n",
    "                                     use_vocab=False, # <-- the tokenization function replaces this\n",
    "                                     pad_token=vocab.stoi['<pad>'])\n",
    "        dataset = torchtext.data.TabularDataset(train_path, \"tsv\", [('title', field)])\n",
    "\n",
    "        # This BucketIterator will handle padding of sequences that are not of the same length\n",
    "        train_iter = torchtext.data.BucketIterator(dataset,\n",
    "                                                   batch_size=batch_size,\n",
    "                                                   sort_key=lambda x: len(x.title), # to minimize padding\n",
    "                                                   repeat=False)\n",
    "        for it, ((xs, lengths), _) in enumerate(train_iter):\n",
    "\n",
    "            # Fill in the training code here\n",
    "\n",
    "            if (it+1) % 100 == 0:\n",
    "                print(\"[Iter %d] Loss %f\" % (it+1, float(loss)))\n",
    "\n",
    "        # Optional: Compute and track validation loss\n",
    "        #val_loss = 0\n",
    "        #val_n = 0\n",
    "        #for it, ((xs, lengths), _) in enumerate(valid_iter):\n",
    "        #    zs = model(xs)\n",
    "        #    loss = None # TODO\n",
    "        #    val_loss += float(loss)\n",
    "\n",
    "# Include your training curve or output to show that your training loss is trending down"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (d) -- 2 pt\n",
    "\n",
    "This model requires many epochs (>50) to train, and is quite slow without using a GPU.\n",
    "You can train a model yourself, or you can load the model weights that we have trained,\n",
    "which are available on the course website https://www.cs.toronto.edu/~lczhang/321/files/p4model.pk (11MB).\n",
    "\n",
    "Assuming that your `AutoEncoder` is set up correctly, the following code should run without\n",
    "error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = AutoEncoder(10000, 128, 128)\n",
    "checkpoint_path = '/content/gdrive/My Drive/CSC321/p4model.pk' # Update me\n",
    "model.load_state_dict(torch.load(checkpoint_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, repeat your code from Q2(d), for `train_data[10].title`\n",
    "with temperature settings 0.7, 0.9, and 1.5.\n",
    "Explain why we generally don't want the temperature setting to\n",
    "be too **small**."
   ]
  },
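  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, a common way to apply a temperature $T$ when sampling is to divide\n",
    "the decoder's output logits $z$ by $T$ before the softmax. This is a sketch that\n",
    "assumes your Q2(d) sampling code follows this convention:\n",
    "\n",
    "\\begin{align*}\n",
    "p_i = \\frac{\\exp(z_i / T)}{\\sum_j \\exp(z_j / T)}\n",
    "\\end{align*}\n",
    "\n",
    "Setting $T = 1$ recovers the ordinary softmax distribution."
   ]
  },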
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Include the generated sequences and explanation in your PDF report.\n",
    "\n",
    "headline = train_data[10].title\n",
    "input_seq = torch.tensor([vocab.stoi[w] for w in headline], dtype=torch.long).unsqueeze(0)\n",
    "\n",
    "# ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 4\n",
    "\n",
    "In Questions 2-3, we explored the decoder portion of the autoencoder. In this section,\n",
    "let's explore the **encoder**. In particular, the encoder RNN gives us \n",
    "embeddings of news headlines!\n",
    "\n",
    "First, let's load the **validation** data set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "valid_data = torchtext.data.TabularDataset(\n",
    "    path=valid_path,                # data file path\n",
    "    format=\"tsv\",                   # fields are separated by a tab\n",
    "    fields=[('title', text_field)]) # list of fields (we have only one)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (a) -- 2 pt\n",
    "\n",
    "Compute the embeddings of every item in the validation set. Then, store the\n",
    "result in a single PyTorch tensor of shape `[19046, 128]`, since there are\n",
    "19,046 headlines in the validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here\n",
    "# Show that your resulting PyTorch tensor has shape `[19046, 128]`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (b) -- 2 pt\n",
    "\n",
    "Find the 5 closest headlines to the headline `valid_data[13]`. Use the\n",
    "cosine similarity to determine closeness. (Hint: You can use code from Project 2)"
   ]
  },
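  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder, the cosine similarity between two embedding vectors $u$ and $v$ is\n",
    "\n",
    "\\begin{align*}\n",
    "\\cos(u, v) = \\frac{u \\cdot v}{\\|u\\| \\, \\|v\\|}\n",
    "\\end{align*}\n",
    "\n",
    "so values near 1 indicate embeddings that point in nearly the same direction."
   ]
  },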
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here. Make sure to include the actual 5 closest headlines."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (c) -- 2 pt\n",
    "\n",
    "Find the 5 closest headlines to another headline of your choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here. \n",
    "# Make sure to include the original headline and the 5 closest headlines."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part (d) -- 4 pt\n",
    "\n",
    "Choose two headlines from the validation set, and find their embeddings.\n",
    "We will **interpolate** between the two embeddings.\n",
    "\n",
    "Find 3 points, equally spaced between the embeddings of your headlines.\n",
    "If we let $e_0$ be the embedding of your first headline and $e_4$ be\n",
    "the embedding of your second headline, your three points should be:\n",
    "\n",
    "\\begin{align*}\n",
    "e_1 &=  0.75 e_0 + 0.25 e_4 \\\\\n",
    "e_2 &=  0.50 e_0 + 0.50 e_4 \\\\\n",
    "e_3 &=  0.25 e_0 + 0.75 e_4 \\\\\n",
    "\\end{align*}\n",
    "\n",
    "Decode each of $e_1$, $e_2$ and $e_3$ five times, with a temperature setting\n",
    "that shows some variation in the generated sequences, while generating sequences\n",
    "that make sense."
   ]
  },
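  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three points above are instances of a single linear interpolation formula.\n",
    "Writing $t = k/4$ for $k = 1, 2, 3$:\n",
    "\n",
    "\\begin{align*}\n",
    "e_k &= (1 - t) \\, e_0 + t \\, e_4, \\qquad t = \\tfrac{k}{4}\n",
    "\\end{align*}\n",
    "\n",
    "For example, $k = 1$ gives $e_1 = 0.75 e_0 + 0.25 e_4$."
   ]
  },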
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here. Include your generated sequences."
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}