Lab 2: Cats vs Dogs

Late Penalty: There is a penalty-free grace period of one hour past the deadline. Any work that is submitted between 1 hour and 24 hours past the deadline will receive a 20% grade deduction. No other late work is accepted. Quercus submission time will be used, not your local computer time. You can submit your labs as many times as you want before the deadline, so please submit often and early.

This lab is partially based on an assignment developed by Prof. Jonathan Rose and Harris Chan.

In this lab, you will train a convolutional neural network to classify an image into one of two classes: "cat" or "dog". The code for the neural networks you train will be written for you, and you are not (yet!) expected to understand all provided code. However, by the end of the lab, you should be able to:

  1. Understand at a high level the training loop for a machine learning model.
  2. Understand the distinction between training, validation, and test data.
  3. Understand the concepts of overfitting and underfitting.
  4. Investigate how different hyperparameters, such as learning rate and batch size, affect the success of training.
  5. Compare an ANN (aka Multi-Layer Perceptron) with a CNN.

What to submit

Submit a PDF file containing all your code, outputs, and write-up from parts 1-5. You can produce a PDF of your Google Colab file by going to File > Print and then save as PDF. The Colab instructions have more information.

Do not submit any other files produced by your code.

Include a link to your colab file in your submission.

Please use Google Colab to complete this assignment. If you want to use Jupyter Notebook, please complete the assignment and upload your Jupyter Notebook file to Google Colab for submission.

With Colab, you can export a PDF file using the menu option File -> Print and save as PDF file. Adjust the scaling to ensure that the text is not cut off at the margins.

Include a link to your colab file here

Colab Link: https://drive.google.com/file/d/1MgHLy8mqeaCT31Jl-Tv6X_oQlyoaTOew/view?usp=sharing

Part 0. Helper Functions

We will be making use of the following helper functions. You will be asked to look at and possibly modify some of these, but you are not expected to understand all of them.

You should look at the function names and read the docstrings. If you are curious, come back and explore the code after making some progress on the lab.

Part 1. Visualizing the Data [7 pt]

We will make use of some of the CIFAR-10 data set, which consists of colour images of size 32x32 pixels belonging to 10 categories. You can find out more about the dataset at https://www.cs.toronto.edu/~kriz/cifar.html

For this assignment, we will only be using the cat and dog categories. We have included code that automatically downloads the dataset the first time that the main script is run.

Part (a) -- 1 pt

Visualize some of the data by running the code below. Include the visualization in your writeup.

(You don't need to submit anything else.)

Part (b) -- 3 pt

How many training examples do we have for the combined cat and dog classes? What about validation examples? What about test examples?
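
If you want to sanity-check your counts against the raw dataset, the sketch below tallies the cat and dog images in CIFAR-10 via torchvision. This is only the total before splitting; the lab's own train/validation split determines the final answer.

```python
# Sketch: count cat and dog images in the raw CIFAR-10 splits using torchvision.
# The lab's own train/validation split will divide the training portion further.
import torchvision

classes = ('airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True)
test_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True)

cat_dog = {classes.index('cat'), classes.index('dog')}   # CIFAR-10 labels 3 and 5
n_train = sum(label in cat_dog for label in train_set.targets)
n_test = sum(label in cat_dog for label in test_set.targets)
print(n_train, "cat/dog training images,", n_test, "cat/dog test images")
```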

Part (c) -- 3pt

Why do we need a validation set when training our model? What happens if we judge the performance of our models using the training set loss/error instead of the validation set loss/error?

Answer:

A validation set is needed when training a model because we require a way to measure how the model performs on data it was not trained on. By tracking the validation set loss/error, we can make more informed decisions when modifying the model architecture and tuning hyperparameters. If we judge the performance of our models using the training set loss/error, we may overfit our models to the training set, and they will not generalize well to a brand-new data set.

Part 2. Training [15 pt]

We define two neural networks, a LargeNet and SmallNet. We'll be training the networks in this section.

You won't understand fully what these networks are doing until the next few classes, and that's okay. For this assignment, please focus on learning how to train networks, and how hyperparameters affect training.

Part (a) -- 2pt

The methods small_net.parameters() and large_net.parameters() produce an iterator of all the trainable parameters of the network. These parameters are torch tensors containing many scalar values.

We haven't learned how the parameters in these high-dimensional tensors will be used, but we should be able to count the number of parameters. Measuring the number of parameters in a network is one way of measuring the "size" of a network.

What is the total number of parameters in small_net and in large_net? (Hint: how many numbers are in each tensor?)
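
One way to count them is sketched below, assuming small_net and large_net have already been constructed by the provided code:

```python
# Sketch: count trainable parameters by summing the number of elements (numel)
# in each tensor returned by .parameters().
def count_parameters(net):
    return sum(p.numel() for p in net.parameters() if p.requires_grad)

print("small_net:", count_parameters(small_net))
print("large_net:", count_parameters(large_net))
```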

The function train_net

The function train_net below takes an untrained neural network (like small_net and large_net) and several other parameters. You should be able to understand how this function works. The figure below shows the high level training loop for a machine learning model:

[Figure: high-level training loop for a machine learning model]
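
As a rough guide, the loop inside train_net follows the usual pattern: for each mini-batch, a forward pass, a loss computation, backpropagation, and a weight update, followed by a validation pass at the end of each epoch. The sketch below illustrates that structure; the names net, train_loader, and num_epochs, as well as the loss function and optimizer, are assumptions here rather than the exact choices made in the provided code.

```python
# Simplified sketch of the per-epoch training loop (not the lab's exact code).
import torch
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()                 # placeholder loss; the provided code defines its own
optimizer = optim.SGD(net.parameters(), lr=0.01)  # placeholder optimizer and learning rate

for epoch in range(num_epochs):
    for imgs, labels in train_loader:             # iterate over mini-batches
        optimizer.zero_grad()                     # clear gradients from the previous step
        outputs = net(imgs)                       # forward pass
        loss = criterion(outputs, labels)         # compare predictions with targets
        loss.backward()                           # backpropagation
        optimizer.step()                          # update the weights
    # at the end of each epoch: compute validation error/loss and save a checkpoint
```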

Part (b) -- 1pt

The parameters to the function train_net are hyperparameters of our neural network. We made these hyperparameters easy to modify so that we can tune them later on.

What are the default values of the parameters batch_size, learning_rate, and num_epochs?

Answer:

The default values of the parameters are batch_size=64, learning_rate=0.01, and num_epochs=30.

Part (c) -- 3 pt

What files are written to disk when we call train_net with small_net, and train for 5 epochs? Provide a list of all the files written to disk, and what information the files contain.

Answer:

The files written to disk are:

Part (d) -- 2pt

Train both small_net and large_net using the function train_net and its default parameters. The function will write many files to disk, including a model checkpoint (saved values of model weights) at the end of each epoch.

If you are using Google Colab, you will need to mount Google Drive so that the files generated by train_net get saved. We will be using these files in later parts. (See the Google Colab tutorial for more information about this.)
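
In Colab, mounting Drive typically looks like this:

```python
# Mount Google Drive so files written by train_net persist across Colab sessions.
from google.colab import drive
drive.mount('/content/gdrive')
```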

Report the total time elapsed when training each network. Which network took longer to train? Why?

Answer:

As shown in the code above, the total time elapsed for training small_net is 109.83 seconds and the total time elapsed for training large_net is 122.63 seconds. large_net took longer to train because it has significantly more parameters to update than small_net, as shown in Part (a), which also means it has higher model complexity (capacity).

Part (e) - 2pt

Use the function plot_training_curve to display the trajectory of the training/validation error and the training/validation loss. You will need to use the function get_model_name to generate the argument to the plot_training_curve function.

Do this for both the small network and the large network. Include both plots in your writeup.
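
A sketch of those calls is below; the exact signatures of get_model_name and plot_training_curve, and the names "small" and "large", are assumptions, so check the docstrings of the Part 0 helpers.

```python
# Sketch: plot the saved training curves for both networks after 30 epochs of
# training with the default hyperparameters (signatures assumed; see Part 0).
small_path = get_model_name("small", batch_size=64, learning_rate=0.01, epoch=29)
plot_training_curve(small_path)

large_path = get_model_name("large", batch_size=64, learning_rate=0.01, epoch=29)
plot_training_curve(large_path)
```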

Part (f) - 5pt

Describe what you notice about the training curve. How do the curves differ for small_net and large_net? Identify any occurrences of underfitting and overfitting.

Answer:

The main differences between the training curves of small_net and large_net are that:

  1. The small_net training error decreases much more rapidly at lower epochs compared to large_net.
  2. The large_net training curves are much less noisy compared to small_net.

For small_net, underfitting occurs at lower epochs (0-17) since both training error/loss and validation error/loss decrease as the number of epochs increases. Underfitting becomes less prominent for small_net as the number of epochs approaches 30 since the error and loss curves start to flatten out.

For large_net, underfitting again occurs at lower epochs (0-17) since both training error/loss and validation error/loss decrease as the number of epochs increases. However, the model starts to overfit for larger numbers of epochs (18-29) since validation error/loss flattens out and increases as training error/loss continues to decrease.

Part 3. Optimization Parameters [12 pt]

For this section, we will work with large_net only.

Part (a) - 3pt

Train large_net with all default parameters, except set learning_rate=0.001. Does the model take longer/shorter to train? Plot the training curve. Describe the effect of lowering the learning rate.

Answer:

The model takes roughly the same amount of time to train compared to the default settings.

By lowering the learning rate from 0.01 to 0.001, the size of each gradient descent step is smaller, which makes the error/loss decrease more slowly than with the default settings. As a result, the model no longer overfits for larger numbers of epochs.

Part (b) - 3pt

Train large_net with all default parameters, except set learning_rate=0.1. Does the model take longer/shorter to train? Plot the training curve. Describe the effect of increasing the learning rate.

Answer:

The model takes roughly the same amount of time to train compared to the default settings.

By increasing the learning rate from 0.01 to 0.1, the size of each gradient descent step is larger, which makes the error/loss decrease faster than with the default settings. As a result, the model starts to overfit much earlier compared to the default model.

Part (c) - 3pt

Train large_net with all default parameters, including with learning_rate=0.01. Now, set batch_size=512. Does the model take longer/shorter to train? Plot the training curve. Describe the effect of increasing the batch size.

Answer:

The model takes less time to train compared to the default settings.

By increasing the batch size from 64 to 512, the model no longer overfits for larger numbers of epochs. However, the new model results in slightly higher training error/loss and validation error compared to the default settings.

Part (d) - 3pt

Train large_net with all default parameters, including with learning_rate=0.01. Now, set batch_size=16. Does the model take longer/shorter to train? Plot the training curve. Describe the effect of decreasing the batch size.

Answer:

The model takes more time to train compared to the default settings.

By decreasing the batch size from 64 to 16, the model starts to overfit much earlier. However, the new model results in a lower training error/loss but a much higher validation loss compared to the default settings.

Part 4. Hyperparameter Search [6 pt]

Part (a) - 2pt

Based on the plots from above, choose another set of values for the hyperparameters (network, batch_size, learning_rate) that you think would help you improve the validation accuracy. Justify your choice.

Answer:

The set of hyperparameter values I have chosen is (large_net, batch_size=512, learning_rate=0.001).

I chose large_net because based on Part 2(f), it is observed that large_net is less susceptible to noise compared to small_net. However, it will overfit at larger numbers of epochs so the remaining hyperparameters will need to be adjusted to compensate for this.

I chose batch_size=512 and learning_rate=0.001 because based on Part 3(a) and Part 3(c), increasing the batch size and decreasing the learning rate can help reduce overfitting at larger numbers of epochs.

Part (b) - 1pt

Train the model with the hyperparameters you chose in part(a), and include the training curve.
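
Assuming train_net accepts the hyperparameters as keyword arguments (as the defaults reported in Part 2(b) suggest) and that LargeNet has a no-argument constructor, the run looks roughly like:

```python
# Sketch: retrain large_net with the hyperparameters chosen in Part 4(a).
large_net = LargeNet()   # re-initialize so training starts from scratch
train_net(large_net, batch_size=512, learning_rate=0.001, num_epochs=30)
```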

Part (c) - 2pt

Based on your result from Part(a), suggest another set of hyperparameter values to try. Justify your choice.

Answer:

Based on the results of Part (a), the improved set of hyperparameter values I have chosen is (large_net, batch_size=512, learning_rate=0.05, num_epochs=19).

Despite the fact that increasing the batch size and decreasing the learning rate helped reduce overfitting, the combined effect also greatly reduced the rate at which error/loss decreases. As a result, I decided to increase the learning rate to 0.05 and let the error/loss decrease more rapidly.

Increasing the learning rate introduced some minor overfitting at the larger epoch range, specifically epochs 20-30. So, I chose to reduce the number of epochs to 19 as well.

Part (d) - 1pt

Train the model with the hyperparameters you chose in part(c), and include the training curve.

Part 5. Evaluating the Best Model [15 pt]

Part (a) - 1pt

Choose the best model that you have so far. This means choosing the best model checkpoint, including the choice of small_net vs large_net, the batch_size, learning_rate, and the epoch number.

Modify the code below to load your chosen set of weights to the model object net.
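
A sketch of loading the chosen checkpoint is below; it assumes the checkpoints were written with torch.save(net.state_dict(), path) at the end of each epoch and that get_model_name reconstructs the matching file name. The epoch index for the chosen checkpoint may differ depending on how the files are numbered.

```python
# Sketch: load the chosen checkpoint (large_net, batch_size=512, lr=0.05,
# epoch 18, i.e. the 19th epoch) into a fresh model object.
import torch

net = LargeNet()
model_path = get_model_name("large", batch_size=512, learning_rate=0.05, epoch=18)
net.load_state_dict(torch.load(model_path))
```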

Part (b) - 2pt

Justify your choice of model from part (a).

Answer:

I chose large_net because based on Part 2(f), it is observed that large_net is less susceptible to noise compared to small_net. However, it will overfit at larger numbers of epochs so the remaining hyperparameters will need to be adjusted to compensate for this.

I chose batch_size=512 because based on Part 3(a), increasing the batch size can help reduce overfitting at larger numbers of epochs and reduce model training time.

I chose learning_rate=0.05 so the model can learn faster by letting the error/loss decrease more rapidly.

I chose num_epochs=19 because the model overfits at greater numbers of epochs.

Part (c) - 2pt

Using the code in Part 0, any code from lecture notes, or any code that you write, compute and report the test classification error for your chosen model.
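
A sketch of the evaluation loop is below, assuming a test_loader DataLoader is available (for example, built from the Part 0 helpers) and that net outputs one score per class.

```python
# Sketch: compute the test classification error for the loaded model.
import torch

net.eval()                                    # disable training-only behaviour (e.g. dropout)
errors, total = 0, 0
with torch.no_grad():                         # no gradients needed during evaluation
    for imgs, labels in test_loader:
        outputs = net(imgs)
        preds = outputs.argmax(dim=1)         # predicted class for each image
        errors += (preds != labels).sum().item()
        total += labels.size(0)
print("Test classification error:", errors / total)
```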

Part (d) - 3pt

How does the test classification error compare with the validation error? Explain why you would expect the test error to be higher than the validation error.

Answer:

The test classification error is 0.304 and the validation error is 0.2995. While the errors are quite similar, the test classification error is indeed slightly higher than the validation error. This is expected because the test error indicates how the model will perform on a new data set: the model is seeing the test set for the first time, whereas the validation set was used extensively while searching for the best hyperparameters, so those choices are indirectly fit to the validation data.

Part (e) - 2pt

Why did we only use the test data set at the very end? Why is it important that we use the test data as little as possible?

Answer:

The test data set is used only at the very end because it provides a realistic estimate of how the model will perform on a brand-new data set. It is important to use the test data as little as possible because no neural network architecture or hyperparameter decisions should be made based on the test data or test accuracy. Otherwise, bias towards the test data set is introduced and the model effectively overfits to the test data.

Part (f) - 5pt

How does your best CNN model compare with a 2-layer ANN model (no convolutional layers) on classifying cat and dog images? You can use a 2-layer ANN architecture similar to what you used in Lab 1. You should explore different hyperparameter settings to determine how well you can do on the validation dataset. Once satisfied with the performance, you may test it out on the test data.

Hint: The ANN in Lab 1 was applied to greyscale images. The cat and dog images are colour (RGB), so you will need to flatten and concatenate all three colour layers before feeding them into an ANN.
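
A minimal sketch of such an ANN is below; the hidden size and the two-class output layer are assumptions to be tuned against the validation set.

```python
# Sketch: a 2-layer ANN (no convolutions) for 3x32x32 colour images.
import torch
import torch.nn as nn

class TwoLayerANN(nn.Module):
    def __init__(self, hidden_size=100):
        super().__init__()
        self.fc1 = nn.Linear(3 * 32 * 32, hidden_size)  # all three colour channels, flattened
        self.fc2 = nn.Linear(hidden_size, 2)            # two outputs: cat vs dog

    def forward(self, x):
        x = x.view(x.size(0), -1)        # flatten and concatenate the RGB planes
        x = torch.relu(self.fc1(x))
        return self.fc2(x)
```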

Answer:

As shown in the code above, the best test classification error achieved using a 2-layer ANN architecture is 0.381. This is much higher compared to the 0.304 test classification error produced using the CNN model. Therefore, a CNN architecture is more suited for this problem of cats vs dogs classification.