ScalarFlow - Automatic Differentiation in Python

Learning Objectives

After completing this assignment, students should be able to:

Part 1: Implementation

Complete the stubbed-out ScalarFlow library (scalarflow.py) so that all public methods and attributes behave as described in the provided docstring comments.

You can read a nicely formatted version of the scalarflow documentation here.

The scalarflow.py module depends on the networkx library to maintain the graph structure. You'll need to install networkx:

pip install networkx
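For reference, networkx represents directed graphs with its DiGraph class. The sketch below is only an illustration of how a computation graph's structure can be stored and queried; the node labels and the choice to store nodes by name are illustrative, not part of the starter code's design:

import networkx as nx

# Illustrative only: the starter code already manages the graph for you.
dag = nx.DiGraph()
dag.add_node('x')          # an input node
dag.add_node('y')          # another input node
dag.add_node('add')        # an operation node
dag.add_edge('x', 'add')   # edges point from operands to the operations that use them
dag.add_edge('y', 'add')

# The operands of a node can be recovered from its predecessors:
print(list(dag.predecessors('add')))   # ['x', 'y']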

Note that the starter code already includes functionality for building and visualizing computation graphs. For example, the following snippet should work without making any modifications to scalarflow.py:

import scalarflow as sf

graph = sf.Graph()

with graph:
    x = sf.Variable(2.0, name='x')
    y = sf.Variable(4.0, name='y')

    x_squared = sf.Pow(x, 2)
    y_squared = sf.Pow(y, 2)

    xy_sum = sf.Add(x_squared, y_squared)

    func = sf.Pow(xy_sum, .5) # (Square root)

graph.gen_dot("sample.dot")

This code creates a computation graph corresponding to the formula \(\sqrt{x^2 + y^2}\). The resulting dot file can be used to generate a nicely formatted image representing the structure of the graph.
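For example, if Graphviz is installed, the dot file can be rendered to an image from the command line:

dot -Tpng sample.dot -o sample.png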

The starter code does not include any functionality for actually performing the calculations represented by the graph. You are free to add any helper methods or private instance variables that you find useful for implementing the required functionality.
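As a loose illustration of one possible approach (not a required design), a forward pass can visit the nodes in topological order so that every operand's value is available before the operation that uses it. The helper below is hypothetical, as is the compute() method it assumes each operation node provides; only the networkx call is real:

import networkx as nx

def forward_sketch(nx_graph, values):
    # Hypothetical helper, not part of the starter code's API.
    # `values` maps leaf nodes (e.g. Variables) to their float values.
    # Each operation node is assumed to expose a compute() method that
    # combines the values of its operands.
    for node in nx.topological_sort(nx_graph):
        if node not in values:
            operands = [values[parent] for parent in nx_graph.predecessors(node)]
            values[node] = node.compute(*operands)
    return values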

Implementation Tips

Part 2: Machine Learning with ScalarFlow

The files sf_classifiers.py and sf_classifier_examples.py provide a simple machine learning library built on top of the ScalarFlow toolkit.

If scalarflow.py is completed correctly, the demo methods in sf_classifier_examples.py should reliably learn the two classification tasks.

Even without completing the classes in scalarflow.py, it is possible to initialize the models in sf_classifiers.py and visualize the structure of the corresponding computation graphs:

Logistic regression loss function with two input features
Three-layer neural network loss function with two input units and ten hidden units

Exploring ReLUs

We have discussed the fact that sigmoid nonlinearities can lead to vanishing gradients when used in deep neural networks (more than three layers). Rectified linear units, or ReLUs, are widely used because they help to avoid the vanishing gradient problem: while sigmoids have near-zero derivatives across much of their domain, ReLUs have a derivative of exactly 1 for all positive inputs.

Logistic (sigmoid) function and its derivative

ReLU function and its derivative
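To make the comparison concrete, the short (purely illustrative) snippet below evaluates both derivatives at a few points. The sigmoid derivative peaks at 0.25 and shrinks rapidly away from zero, while the relu derivative is exactly 1 for any positive input:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value is 0.25, reached at x = 0

def relu_derivative(x):
    return 1.0 if x > 0 else 0.0

for x in (0.0, 2.0, 5.0):
    print(x, sigmoid_derivative(x), relu_derivative(x))
# sigmoid'(0) = 0.25, sigmoid'(2) is about 0.105, sigmoid'(5) is about 0.0066
# relu'(x) = 1.0 for every positive x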

For Part 2 of this assignment, complete the following steps:

  1. Add a Relu node type to scalarflow.py. (A rough, non-authoritative sketch of one possible Relu node appears after this list.)

  2. Update the MLP class in sf_classifiers.py to accept 'relu' as an option for the activation argument of the constructor. Make sure to change the weight initialization code to He initialization for relu units. (The sketch after this list also illustrates one way to draw He-initialized weights.)

  3. Test your modified MLP implementation to make sure that it can reliably learn the XOR task with ten hidden units using both sigmoid and relu activation functions. I found that it took some experimentation with learning rates to get both versions of the network to work well; the sigmoid activation function seems to require a significantly higher learning rate than relu.

  4. Create two figures illustrating the learning curves for each classifier across 10 training runs. Each figure should show the epoch number on the x-axis and training loss on the y-axis. The first figure should include ten lines, each representing a single training run with sigmoid nonlinearities. The second figure should show the same information for the relu network. The captions should explain the data in the figures and should include the learning rate that was used. (A minimal plotting sketch appears after this list.)

    The point of these figures is to illustrate that either activation function is effective for a three-layer network.

  5. Create two additional figures by replicating the experiment above using networks with five hidden layers, each with 10 hidden units. The question here is whether the relu network does a better job of learning with a much deeper network. Again, the captions for the figures should explain the data and include the learning rates. (You should use the same learning rates here as in the previous experiment.)

  6. Combine your figures into a single PDF document for submission.
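The sketch below shows one possible shape for a Relu node (step 1) and a He-initialization helper (step 2). Apart from the relu formula itself, the names here are assumptions: the real Relu class must follow the docstrings in scalarflow.py and inherit from whatever base class the other operation nodes use, and the weight-initialization code belongs wherever the MLP class currently creates its weights.

import math
import random

class ReluSketch:
    # Illustrative only -- the real Relu node must match the interfaces
    # defined by the scalarflow.py docstrings.

    @staticmethod
    def value(x):
        # relu(x) = max(0, x)
        return x if x > 0 else 0.0

    @staticmethod
    def derivative(x):
        # d/dx relu(x) is 1 for positive inputs and 0 otherwise
        return 1.0 if x > 0 else 0.0

def he_initialized_weight(fan_in):
    # He initialization: draw each weight from a normal distribution
    # with mean 0 and standard deviation sqrt(2 / fan_in), where fan_in
    # is the number of inputs feeding the unit.
    return random.gauss(0.0, math.sqrt(2.0 / fan_in))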
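For the learning-curve figures in steps 4 and 5, a minimal matplotlib sketch along these lines may be a useful starting point; how the per-epoch losses are collected is up to your own training code:

import matplotlib.pyplot as plt

def plot_learning_curves(loss_histories, title, filename):
    # loss_histories: one list per training run (ten in this assignment),
    # each containing the training loss recorded after every epoch.
    for losses in loss_histories:
        plt.plot(range(1, len(losses) + 1), losses)
    plt.xlabel('Epoch')
    plt.ylabel('Training loss')
    plt.title(title)
    plt.savefig(filename)
    plt.close()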

Grading

Grades will be calculated according to the following distribution.