To write legible answers you will need to be familiar with both Markdown and Latex
Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel→→Restart) and then run all cells (in the menubar, select Cell→→Run All).
Make sure you fill in any place that says "YOUR CODE HERE" or "YOUR ANSWER HERE", as well as your name below:
NAME = ""
STUDENT_ID = ""
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
# Helper libraries
import numpy as np
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
Keras has lots of datsets that you can just load right into python numpy arrays, see: https://keras.io/datasets/
We will be using the Fashion-MNIST dataset, which is a cool little dataset with gray scale $28\times28$ images of articles of clothing. Keep in mind that they will be downloaded from the internet, so it may take a while.
fashion_mnist = keras.datasets.fashion_mnist
# splitting training and test data and corresponding labels
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
class_dict = {i:class_name for i,class_name in enumerate(class_names)}
def show_image(index):
plt.figure()
# cmap=plt.cm.binary allows us to show the picture in grayscale
plt.imshow(train_images[index], cmap=plt.cm.binary)
plt.title(class_names[train_labels[index]])
plt.colorbar() # adds a bar to the side with values
plt.show()
show_image(0)
As you can see above, the images are valued from $[0,255]$. This is the normal range for images and excercises that we need to normalize our dataset.
In order to normalize our data to $[0,1]$ we use the equation:
$$x_{norm}=\frac{x-x_{min}}{x_{max}-x_{min}}$$In our case we can assume that $x_{min}=0$ and $x_{max}=255$, this is a safe assumption since we are working with image data.
This means that for image data, if we want to normalize to $[0,1]$ the equation simplifies to:
$$img_{norm}=\frac{img}{255}$$Anytime you work with image data in any kind of model you will be normalizing with this equation. Unless the range you want to normalize is different. Sometimes you want to normalize between $[-1,1]$, for that you would use a slightly different equation.
Normalize BOTH the training and testing images using the above equation.
train_images = # YOUR CODE HERE
test_images = # YOUR CODE HERE
If we show the image again, you will see the values are all scaled correctly.
show_image(0)
# Lets sample our data to see what kind of images are stored.
# see documentation for subplot here:
# https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplot.html
plt.figure(figsize=(10,10))
for i in range(25):
plt.subplot(5,5,i+1)
plt.xticks([])
plt.yticks([])
plt.grid(False)
plt.imshow(train_images[i], cmap=plt.cm.binary)
plt.xlabel(class_names[train_labels[i]])
plt.show()
Since our data is composed of grayscale images (one channel) with a resolution of $28\times28$, we can think of this as the images existing in a $28\times28=784$ dimensional space. This means that every single image in our dataset can be represented by a vector of length $784$.
Please reshape BOTH the training and testing images to be $784D$.
Hint: look into numpy.reshape().
print(f'Before reshape, train_images shape: {train_images.shape} test_images shape: {test_images.shape}')
train_images = # YOUR CODE HERE
test_images = # YOUR CODE HERE
print(f'Before reshape, train_images shape: {train_images.shape} test_images shape: {test_images.shape}')
We create a dataframe using our training and testing data to keep everything tidy.
# Add training data into a dataframe
img_data = {f"z{i}":train_images[:,i] for i in range(784)}
img_data["label"] = train_labels
df_img_train = pd.DataFrame(img_data)
df_img_train["class"] = df_img_train["label"].map(class_dict)
df_img_train.head()
# Add test data into a dataframe
img_data = {f"z{i}":test_images[:,i] for i in range(784)}
img_data["label"] = test_labels
df_img_test = pd.DataFrame(img_data)
df_img_test["class"] = df_img_test["label"].map(class_dict)
df_img_test.head()
Now we have our data reshaped into the $784D$ vectors, using these we can try and visualize the space they live in. However, accurately visualizing high dimensional spaces like these would be practically impossible. We use tools like principle component analysis (PCA), and tSNE to create a projection of this $784D$ space into a more digestible $2D$ or even $3D$.
Don't worry about how they work just yet, we are just using them to visualize our data. Just run the cells to see some cool visuals.
You can direcly use the built in PCA and TSNE models from sklearn, you import them like so:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
These tools require our data to be scaled correctly.
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(train_images)
n_comps = 50
pca = PCA(n_components=n_comps)
pca_features = pca.fit_transform(standardized_data)
# Add data into a dataframe
pca_data = {f"z{i}":pca_features[:,i] for i in range(n_comps)}
pca_data["label"] = train_labels
df_pca = pd.DataFrame(pca_data)
df_pca["class"] = df_pca["label"].map(class_dict)
df_pca.head()
print(f'Compressed dimension of {train_images.shape[1]} to {n_comps} maintaining {pca.explained_variance_ratio_.sum()*100:.2f}% of information.')
model = TSNE(n_components=2, random_state=0, perplexity=30, learning_rate=200, n_iter=1000)
print('This may take a few minutes...')
# We are only using the first 10000 data points, this is sufficient for this application.
visualization_data = model.fit_transform(pca_features[:10000])
print('Done.')
data_to_visualize = {"z1":visualization_data[:,0], "z2":visualization_data[:,1], "label":train_labels[:10000]}
df_visualize = pd.DataFrame(data_to_visualize)
df_visualize["class"] = df_visualize["label"].map(class_dict)
df_visualize.head()
These images, which started out as $784$ dimensional vectors, are now being projected into a $2D$ space. Don't worry about how tSNE works, it essentially tries to estimate and project distances in a high dimensional space to $2D$. Below we can see a pretty good representation of the space where our data lives.
sns.lmplot(x='z1',
y='z2',
data=df_visualize,
fit_reg=False,
hue='class',
height=9,
scatter_kws={"s":50,"alpha":0.5})
Some of the data is easily to separate with a line, this concept is called linear separability. Below we plot only the Ankle Boot and Trouser class. See? It's easy to draw a line between them. This makes sense because it's easy to distinguish between a shoe and pants.
sns.lmplot(x='z1',
y='z2',
data=df_visualize[(df_visualize["class"] == "Ankle boot") | (df_visualize["class"] == "Trouser")],
fit_reg=False,
hue='class',
height=9,
scatter_kws={"s":50,"alpha":0.5})
How do you think our logistic regression model will fare at distinguishing Ankle boots from Trousers? Please explain.
Now lets plot something harder to classify, the Pullover and Coat classes which are basically mixed together. You can't easily draw a line between them. This makes sense since a coat and a pullover are pretty similar.
sns.lmplot(x='z1',
y='z2',
data=df_visualize[(df_visualize["class"] == "Pullover") | (df_visualize["class"] == "Coat")],
fit_reg=False,
hue='class',
height=9,
scatter_kws={"s":50,"alpha":0.5})
How do you think our logistic regression model will fare at distinguishing Pullovers from Coats? Please explain.
Recall that each image is $28\times28\times1$ matrix which we flatten to a $784$-dimensional row vector. Image $i$ looks like:
\begin{equation} \textbf x_{i} = \begin{bmatrix} x_{1} & x_{2} & \cdots & x_{784} \end{bmatrix} \tag{1} \end{equation}Our dataset $\textbf X$ is then the collection of all these $n$ images.
\begin{equation} \textbf X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,784} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,784} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,784}\\ \end{bmatrix} \tag{2} \end{equation}Logistic regression uses a bias term, we can easily incorporate this by adding a column of ones at the beginning.
\begin{equation} \textbf X = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,784} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,784} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,784} \end{bmatrix} \tag{3} \end{equation}For each image we have a corresponding label $y_i$. This is the class "Coat", "Ankle boot", "Pullover", etc. Each of which is mapped to a unique number.
\begin{equation} \textbf Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n} \end{bmatrix} \tag{4} \end{equation}We will try to find the optimal parameter values $\theta = [\theta_0, \theta_1, \cdots, \theta_{784}]^T$ of our logistic regression model, where $\theta_0$ is the bias weight. To simplify our notation, let
\begin{equation} Z = X \theta = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,784} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,784} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,784} \end{bmatrix} \begin{bmatrix}\theta_0 \\ \theta_1 \\ \vdots \\ \theta_{784} \end{bmatrix}= \begin{bmatrix} \theta_0 + x_{1,1}\theta_1 + \cdots + x_{1,784}\theta_{748} \\ \theta_0 + x_{2,1}\theta_1 + \cdots + x_{2,784}\theta_{784} \\ \vdots \\ \theta_0 + X_{n,1}\theta_1 + \cdots + x_{n,784}\theta_{784} \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n} \end{bmatrix} \tag{5} \end{equation}Since each $z_i$ is in the range $(-\infty, \infty)$ and our labels are $[0,1]$ we pass $z_i$ into the sigmoid function.
\begin{equation} \sigma(z_i) = \frac{1}{1+e^{-z_i}} = h_i \tag{6} \end{equation}Now we can make predictions using $h_i$ by simply rounding it to 0 or 1.
\begin{equation} pred_i = round(h_i) \tag{7} \end{equation}In order to train our logistic regression model we need to find the parameter vector $\theta$ that minimizes our cost function $J$, we will be Binary Cross Entropy (BCE).
\begin{equation} \text{J} = -\frac{1}{n} \sum_{i=1}^{n} \left [ y_i\log (h_i) + (1-y_i)\log (1-h_i) \right ] \tag{8} \end{equation}Where $y_i$ is your true label and $h_i = \sigma (z_i)$.
In order to know whether we want to increment or decrement our weights to minimize the loss, we calculate its partial derivative with respect to weights, we call this the gradient, the matrix form of the gradient is:
\begin{equation} \nabla \text{J} = \frac{1}{n}X^T (H-Y)\tag{9} \end{equation}with $X = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,784} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,784} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,784} \end{bmatrix}$, $ \textbf Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n} \end{bmatrix}$, and $ \textbf H = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_{n} \end{bmatrix}$
We then use this gradient in an iterative algorithm called gradient descent where every iteration we change the weights like so:
\begin{equation} \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla \text{J} \tag{10} \end{equation}Where $t$ is the iteration number, $\alpha$ is your learning rate, $\theta^{(t)}$ is your current weight column vector, and $\theta^{(t+1)}$ is your updated weight column vector.
For this question you will implement some of the necessary functions for logistic regression:
You will also complete the weight update within the gradient descent algorithm.
def get_data_subset(df, classes=[], shuffle=True, shuffle_seed=42):
if classes == []:
return None
else:
df_filtered = df[(df["class"] == classes[0]) | (df["class"] == classes[1])].copy()
df_filtered["binary_label"] = 0
df_filtered.loc[df["class"] == classes[1], "binary_label"] = 1
data = df_filtered.filter(regex=("z[0-9]+")).values
labels = df_filtered["binary_label"].values
if shuffle:
np.random.seed(shuffle_seed)
np.random.shuffle(data)
np.random.seed(shuffle_seed)
np.random.shuffle(labels)
return data, labels.reshape(-1,1)
def train_test_validation_split(X, y, test_size=.20, validation_size=.20):
trainIdx = int((1. - test_size - validation_size)*X.shape[0])
testIdx = int((1. - test_size)*X.shape[0])
validationIdx = int(1.0 * X.shape[0])
X_train, y_train = X[:trainIdx], y[:trainIdx]
X_test, y_test = X[trainIdx:testIdx], y[trainIdx:testIdx]
X_val, y_val = X[testIdx:validationIdx], y[testIdx:validationIdx]
return X_train, y_train, X_test, y_test, X_val, y_val
def bias_trick(X):
return np.hstack((np.ones((X.shape[0],1)), X))
# applying sigmoid function (equation 6) to the vector input of z_i's
def sigmoid(z):
sig_z = # YOUR CODE HERE
return sig_z
# implementing equation 8
def binary_cross_entropy(y, h):
n = len(y)
return # YOUR CODE HERE
# returns the derivation of the BCE function based on equation 9
def binary_cross_entropy_derivative(X, y, theta):
z = # YOUR CODE HERE
h = # YOUR CODE HERE
n = len(y)
return # YOUR CODE HERE
# returns percentage of correct predictions
def accuracy(y, h):
return # YOUR CODE HERE
# returns predictions for all inputs. Given input is h_i based on equations 7.
def predict(h):
return # YOUR CODE HERE
def logistic_regression(X, y, learning_rate, num_steps):
# split your data into train and test subsets. Don't forget to apply the bias trick before you split!
X = bias_trick(X)
X_train, y_train, X_test, y_test, X_val, y_val = train_test_validation_split(X,y)
# start with intial parameters theta_i = 0, how many parameters/weights do we need?
theta = # YOUR CODE HERE
z = np.dot(X_train,theta)
h = sigmoid(z)
print(f'Initial Accuracy:{accuracy(y_train, predict(h)):.4f}')
val_losses = []
train_losses = []
val_accuracies = []
train_accuracies = []
for step in range(num_steps):
# Calculate the current output of your logistic network (h_train and h_val)
z_train = # YOUR CODE HERE
h_train = # YOUR CODE HERE
z_val = # YOUR CODE HERE
h_val = # YOUR CODE HERE
# Calculate your current training/validation accuracy and BCE loss
training_accuracy = # YOUR CODE HERE
training_loss = # YOUR CODE HERE
validation_accuracy = # YOUR CODE HERE
validation_loss = # YOUR CODE HERE
val_losses.append(validation_loss)
val_accuracies.append(validation_accuracy)
train_losses.append(training_loss)
train_accuracies.append(training_accuracy)
# Calculate the gradient using the derivative of your loss function.
gradient = # YOUR CODE HERE
# Adjust your weights
theta = # YOUR CODE HERE
print(f'Epoch [{step+1}/{num_steps}] '.ljust(20) + f'loss: {training_loss:.4f} - accuracy: {training_accuracy:.4f} - ', end='')
print(f"val_loss: {validation_loss:.4f} - val_accuracy: {validation_accuracy:.4f}")
return theta, train_losses, train_accuracies, val_losses, val_accuracies
Now that we have our logistic implementation complete, let's try and classify Ankle boots against Trousers.
We get the subset of the data containing these classes using the function defined above.
Don't change the learning rate or number of steps.
X, y = get_data_subset(df_img_train, classes=["Trouser", "Ankle boot"])
# train the model until it converges, use the plotted losses below to verify
learning_rate = 0.05
num_steps = 100
theta, train_losses, train_accuracies, val_losses, val_accuracies = logistic_regression(X, y, learning_rate, num_steps)
import matplotlib.pyplot as plt
# plot your training accuracy and validation accuracy curves together
plt.plot(train_accuracies)
plt.plot(val_accuracies)
plt.show()
# plot your training losses and validation losses curves together
plt.plot(train_losses)
plt.plot(val_losses)
plt.show()
What accuracy did it reach? Why do you think it reached this accuracy? How long did it take?
Now train it to classify Pullover vs. Coat.
X, y = get_data_subset(df_img_train, classes=["Pullover", "Coat"])
# train the model until it converges, use the plotted losses below to verify
# modify learning_rate and num_steps to accomplish this
learning_rate = 0.01
num_steps = 100
theta, train_losses, train_accuracies, val_losses, val_accuracies = logistic_regression(X, y, learning_rate, num_steps)
import matplotlib.pyplot as plt
# plot your training accuracy and validation accuracy curves together
plt.plot(train_accuracies)
plt.plot(val_accuracies)
plt.show()
# plot your training losses and validation losses curves together
plt.plot(train_losses)
plt.plot(val_losses)
plt.show()
What accuracy did it reach? Why do you think it reached this accuracy? How long did it take?
Compare and contrast both runs above, are they the same? Different? Then explain why this is the case.
In this question we will build a neural network with Keras that can beat our logistic regression model at classifying Pullovers vs. Coats.
from tensorflow.keras.layers import Input, Dense # only use these layers
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import * # you can use any optimizer
# Define an input layer with the correct shape for your data
input_layer = # YOUR CODE HERE
# Pass the input layer's output to a dense of size 100, choose whatever activation you think is suitable
x = # YOUR CODE HERE
# Pass the previouses hidden layer's output to a dense of size 1 for classification, what activation do we need to use?
output = # YOUR CODE HERE
# Define a model with it's input as your input layer and output as your output layer
model = # YOUR CODE HERE
# Show a summary of your model
model.summary()
# Compile your model with your chosen optimizer, binary cross entropy for the loss, and accuracy as the metric
# YOUR CODE HERE
X, y = get_data_subset(df_img_train, classes=["Pullover", "Coat"])
# Call fit on your model passing in the X, y data above with validation split of 0.2 and train for 100 epochs
hist = # YOUR CODE HERE
def plot_losses(hist):
plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'])
plt.show()
def plot_accuracies(hist):
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'])
plt.show()
# plot your losses and accuracies
plot_losses(hist)
plot_accuracies(hist)
How did your neural network perform? What hyperparameters and optimizer did you choose to beat logistic regression? Why do you think your neural network beat your logistic regression model?