Lab 5: Spam Detection

Deadline: June 18th, 11:59pm

Late Penalty: There is a penalty-free grace period of one hour past the deadline. Any work that is submitted between 1 hour and 24 hours past the deadline will receive a 20% grade deduction. No other late work is accepted. Quercus submission time will be used, not your local computer time. You can submit your labs as many times as you want before the deadline, so please submit often and early.

In this assignment, we will build a recurrent neural network to classify an SMS text message as "spam" or "not spam". In the process, you will

  1. Clean and process text data for machine learning.
  2. Understand and implement a character-level recurrent neural network.
  3. Use torchtext to build recurrent neural network models.
  4. Understand batching for a recurrent neural network, and use torchtext to implement RNN batching.

What to submit

Submit a PDF file containing all your code, outputs, and write-up. You can produce a PDF of your Google Colab file by going to File > Print and then save as PDF. The Colab instructions have more information (.html files are also acceptable).

Do not submit any other files produced by your code.

Include a link to your Colab file in your submission. If you would like the TA to look at your Colab file in case your solutions are cut off, please make sure that your Colab file is publicly accessible at the time of submission.

Colab Link: https://drive.google.com/file/d/1vDnjUn0OESVJuJyYXkYvqRSQnodb9hgY/view?usp=sharing

Part 1. Data Cleaning [15 pt]

We will be using the "SMS Spam Collection Data Set" available at http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

There is a link to download the "Data Folder" at the very top of the webpage. Download the zip file, unzip it, and upload the file SMSSpamCollection to Colab.

Part (a) [2 pt]

Open up the file in Python, and print out one example of a spam SMS, and one example of a non-spam SMS.

What is the label value for a spam message, and what is the label value for a non-spam message?
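For reference, a minimal sketch of how this could be done, assuming the SMSSpamCollection file (tab-separated, one label and message per line) has been uploaded to the Colab working directory:

# Print the first spam and the first non-spam ("ham") message in the file.
# Each line of SMSSpamCollection has the form "<label>\t<message>".
for line in open('SMSSpamCollection'):
    label, message = line.strip().split('\t', 1)
    if label == 'spam':
        print(label, message)
        break

for line in open('SMSSpamCollection'):
    label, message = line.strip().split('\t', 1)
    if label == 'ham':
        print(label, message)
        break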

Answer:

As shown in the printouts above, the label for a spam message is "spam" and the label for a non-spam message is "ham".

Part (b) [1 pt]

How many spam messages are there in the data set? How many non-spam messages are there in the data set?
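A minimal counting sketch, under the same assumption about the file layout as in Part (a):

# Count messages per label by reading the first (tab-separated) column.
labels = [line.split('\t')[0] for line in open('SMSSpamCollection')]
print("spam:", labels.count('spam'))  # number of spam messages
print("ham:", labels.count('ham'))    # number of non-spam messages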

Part (c) [4 pt]

We will be using the package torchtext to load, process, and batch the data. A tutorial on torchtext is available below. This tutorial uses the same Sentiment140 data set that we explored during lecture.

https://medium.com/@sonicboom8/sentiment-analysis-torchtext-55fb57b1fab8

Unlike what we did during lecture, we will be building a character level RNN. That is, we will treat each character as a token in our sequence, rather than each word.

Identify two advantages and two disadvantages of modelling SMS text messages as a sequence of characters rather than a sequence of words.

Answer:

Advantages:

  1. A much smaller vocabulary (fewer than 100 characters vs. tens of thousands of words), so the embedding and output layers need far less memory.
  2. Able to recognize and interpret misspelled words, typos, and slang, which are common in SMS messages.

Disadvantages:

  1. Sequences are much longer (one token per character rather than per word), so the RNN must run for many more time steps; this increases computational cost and makes long-range dependencies harder to learn.
  2. May result in lower accuracy compared to a word-level RNN, since individual characters carry far less meaning than whole words.

Part (d) [1 pt]

We will be loading our data set using torchtext.data.TabularDataset. The constructor will read directly from the SMSSpamCollection file.

For the data file to be read successfully, we need to specify the fields (columns) in the file. In our case, the dataset has two fields: a label field ("spam" or "ham") and a text field containing the raw message.

Split the dataset into train, valid, and test. Use a 60-20-20 split. You may find this torchtext API page helpful: https://torchtext.readthedocs.io/en/latest/data.html#dataset

Hint: There is a Dataset method that can perform the random split for you.
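A sketch of the loading and splitting steps, using the legacy torchtext (version 0.8 and earlier) API that TabularDataset belongs to. The field options shown here (character-level tokenization, batch_first, 0/1 labels) are assumptions carried through the later sketches in this write-up, not the only valid choices:

import torchtext

# Character-level field: tokenize each message into a list of characters.
text_field = torchtext.data.Field(sequential=True,
                                  tokenize=lambda x: list(x),
                                  include_lengths=True,
                                  batch_first=True)
# Label field: map "spam" -> 1 and "ham" -> 0.
label_field = torchtext.data.Field(sequential=False,
                                   use_vocab=False,
                                   is_target=True,
                                   preprocessing=lambda x: int(x == 'spam'))

fields = [('label', label_field), ('text', text_field)]
dataset = torchtext.data.TabularDataset('SMSSpamCollection',  # file path
                                        'tsv',                # tab-separated
                                        fields)

# Dataset.split is the method mentioned in the hint above.
train, valid, test = dataset.split(split_ratio=[0.6, 0.2, 0.2])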

Part (e) [2 pt]

You saw in part (b) that there are many more non-spam messages than spam messages. This imbalance in our training data will be problematic for training. We can fix this disparity by duplicating spam messages in the training set, so that the training set is roughly balanced.

Explain why having a balanced training set is helpful for training our neural network.

Note: if you are not sure, try removing the code below and training your model.
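The balancing code is not reproduced here; a minimal sketch of one way to do it, assuming the train split and 0/1 labels from Part (d):

# Duplicate the spam examples so the two classes are roughly balanced.
spam = [e for e in train.examples if e.label == 1]
ham = [e for e in train.examples if e.label == 0]
n_copies = len(ham) // len(spam)         # ham outnumbers spam by roughly this factor
train.examples += spam * (n_copies - 1)  # append extra copies of each spam example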

Answer:

It is important to have a balanced training set because an imbalanced one biases the model towards the majority class. In this case, since there are many more non-spam messages, the model could achieve a high training accuracy simply by predicting "not spam" for every message (the large majority of the data), without learning anything about what distinguishes spam. A high accuracy on an imbalanced set therefore does not indicate that we have learned a good model for this problem.

Part (f) [1 pt]

We need to build the vocabulary on the training data by running the below code. This finds all the possible character tokens in the training set.
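The code in question amounts to a call like the following (assuming the text_field object from Part (d)):

# Build the character vocabulary from the training split only, so that
# characters never seen in training will map to <unk> at test time.
text_field.build_vocab(train)
print(len(text_field.vocab.itos))  # vocabulary size, including <unk> and <pad>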

Explain what the variables text_field.vocab.stoi and text_field.vocab.itos represent.

Answer:

As shown in the code above, the variable text_field.vocab.stoi is a dictionary mapping character tokens to integer indices; stoi stands for "string to integer".

The variable text_field.vocab.itos is a list mapping integer indices back to character tokens; itos stands for "integer to string".

Part (g) [2 pt]

The tokens <unk> and <pad> were not in our SMS text messages. What do these two values represent?

Answer:

The <unk> token represents an unknown token: any character that does not appear in the vocabulary built from the training set (for example, a character that occurs only in the validation or test data) is mapped to <unk>.

The <pad> token represents padding. Since the SMS text messages vary in length, padding is appended to shorter messages so that all sequences in the same batch have equal length.

Part (h) [2 pt]

Since text sequences are of variable length, torchtext provides a BucketIterator data loader, which batches similar-length sequences together. The iterator can also pad sequences automatically.

Take a look at 10 batches in train_iter. What is the maximum length of the input sequence in each batch? How many <pad> tokens are used in each of the 10 batches?
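A sketch of such an inspection, assuming the fields from Part (d); the batch size and sorting settings here are illustrative assumptions:

# Batch similar-length sequences together, padding within each batch.
train_iter = torchtext.data.BucketIterator(train,
                                           batch_size=32,
                                           sort_key=lambda e: len(e.text),
                                           sort_within_batch=True,
                                           repeat=False)
pad_idx = text_field.vocab.stoi['<pad>']
for i, batch in enumerate(train_iter):
    if i >= 10:
        break
    text, lengths = batch.text  # include_lengths=True gives (data, lengths)
    # With batch_first=True, text has shape [batch_size, max_seq_len].
    print("max length:", text.shape[1],
          "  <pad> tokens:", int((text == pad_idx).sum()))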

Part 2. Model Building [8 pt]

Build a recurrent neural network model, using an architecture of your choosing. Use the one-hot embedding of each character as input to your recurrent network. Use one or more fully-connected layers to make the prediction based on your recurrent network output.

Instead of using the RNN output value for the final token, another often used strategy is to max-pool over the entire output array. That is, instead of calling something like:

out, _ = self.rnn(x)
self.fc(out[:, -1, :])

where self.rnn is an nn.RNN, nn.GRU, or nn.LSTM module, and self.fc is a fully-connected layer, we use:

out, _ = self.rnn(x)
self.fc(torch.max(out, dim=1)[0])

This works reasonably well in practice. An even better alternative is to concatenate the max-pooling and average-pooling of the RNN outputs:

out, _ = self.rnn(x)
out = torch.cat([torch.max(out, dim=1)[0], 
                 torch.mean(out, dim=1)], dim=1)
self.fc(out)

We encourage you to try out all these options. The way you pool the RNN outputs is one of the "hyperparameters" that you can choose to tune later on.
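As one hypothetical starting point (an assumption, not the required architecture), here is a sketch of a GRU classifier that one-hot encodes the character indices and uses the concatenated max/average pooling described above:

import torch
import torch.nn as nn

class SpamRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=64, num_classes=2):
        super().__init__()
        # Rows of the identity matrix serve as fixed one-hot embeddings.
        # (Move this to the same device as the inputs if training on a GPU.)
        self.ident = torch.eye(vocab_size)
        self.rnn = nn.GRU(vocab_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)  # *2: max + mean pooling

    def forward(self, x):
        # x: [batch, seq_len] of character indices
        emb = self.ident[x]                 # [batch, seq_len, vocab_size]
        out, _ = self.rnn(emb)              # [batch, seq_len, hidden_size]
        out = torch.cat([torch.max(out, dim=1)[0],
                         torch.mean(out, dim=1)], dim=1)
        return self.fc(out)                 # [batch, num_classes] logits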

Part 3. Training [16 pt]

Part (a) [4 pt]

Complete the get_accuracy function, which will compute the accuracy (rate) of your model across a dataset (e.g. validation set). You may modify torchtext.data.BucketIterator to make your computation faster.
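A sketch of what get_accuracy might look like, assuming the model and batch conventions from the earlier sketches (two-class logits, batch.text as a (data, lengths) pair):

def get_accuracy(model, data_iter):
    # Fraction of messages in data_iter that the model classifies correctly.
    correct, total = 0, 0
    for batch in data_iter:
        text, lengths = batch.text
        pred = model(text).argmax(dim=1)  # predicted class per message
        correct += int((pred == batch.label).sum())
        total += text.shape[0]
    return correct / total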

Part (b) [4 pt]

Train your model. Plot the training curve of your final model. Your training curve should have the training/validation loss and accuracy plotted periodically.

Note: Not all of your batches will have the same batch size. In particular, if your training set does not divide evenly by your batch size, there will be a batch that is smaller than the rest.

Part (c) [4 pt]

Choose at least 4 hyperparameters to tune. Explain how you tuned the hyperparameters. You don't need to include your training curve for every model you trained. Instead, explain what hyperparameters you tuned, what the best validation accuracy was, and the reasoning behind the hyperparameter decisions you made.

For this assignment, you should tune more than just your learning rate and number of epochs. Choose at least 2 hyperparameters that are unrelated to the optimizer.

Answer:

model_4 produces the best results, with a final validation accuracy of 98.3%.

Part (d) [2 pt]

Before we deploy a machine learning model, we usually want to have a better understanding of how our model performs beyond its validation accuracy. An important metric to track is how well our model performs in certain subsets of the data.

In particular, what is the model's error rate amongst data with negative labels? This is called the false positive rate.

What about the model's error rate amongst data with positive labels? This is called the false negative rate.

Report your final model's false positive and false negative rate across the validation set.
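A sketch of how these rates could be computed, assuming "spam" is the positive class (label 1) and the conventions from the sketches above:

def error_rates(model, data_iter):
    # Returns (false positive rate, false negative rate) over data_iter.
    fp, fn, n_neg, n_pos = 0, 0, 0, 0
    for batch in data_iter:
        text, lengths = batch.text
        pred = model(text).argmax(dim=1)
        fp += int(((pred == 1) & (batch.label == 0)).sum())  # ham marked as spam
        fn += int(((pred == 0) & (batch.label == 1)).sum())  # spam marked as ham
        n_neg += int((batch.label == 0).sum())
        n_pos += int((batch.label == 1).sum())
    return fp / n_neg, fn / n_pos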

Part (e) [2 pt]

The impact of a false positive vs a false negative can be drastically different. If our spam detection algorithm was deployed on your phone, what is the impact of a false positive on the phone's user? What is the impact of a false negative?

Answer:

If the spam detection algorithm were deployed on my phone, a high false positive rate would mean that normal texts I receive could be marked as spam. This could cause important texts to be hidden, overlooked, or deleted without my noticing. On the other hand, a high false negative rate would mean that the algorithm often labels spam texts as normal, so it would do little to reduce the amount of spam I receive.

Part 4. Evaluation [11 pt]

Part (a) [1 pt]

Report the final test accuracy of your model.

Part (b) [3 pt]

Report the false positive rate and false negative rate of your model across the test set.

Part (c) [3 pt]

What is your model's prediction of the probability that the SMS message "machine learning is sooo cool!" is spam?

Hint: To begin, use text_field.vocab.stoi to look up the index of each character in the vocabulary.
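A sketch of scoring a single message, assuming the model and text_field from the earlier sketches (with class 1 = spam):

import torch
import torch.nn.functional as F

msg = "machine learning is sooo cool!"
# Look up each character's index; stoi maps unseen characters to <unk>.
idxs = torch.tensor([[text_field.vocab.stoi[c] for c in msg]])  # [1, seq_len]
probs = F.softmax(model(idxs), dim=1)  # convert logits to probabilities
print("P(spam) =", float(probs[0, 1]))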

Part (d) [4 pt]

Do you think detecting spam is an easy or difficult task?

Since machine learning models are expensive to train and deploy, it is very important to compare our models against baseline models: a simple model that is easy to build and inexpensive to run that we can compare our recurrent neural network model against.

Explain how you might build a simple baseline model. This baseline model can be a simple neural network (with very few weights), a hand-written algorithm, or any other strategy that is easy to build and test.

Do not actually build a baseline model. Instead, provide instructions on how to build it.

Answer:

In my opinion, it is not difficult to achieve a high overall accuracy for spam detection using a recurrent neural network model. However, even with a high overall accuracy, the impact of false positives and false negatives can still be significant, so the difficult part is driving these error rates down as far as possible.

A simple baseline model could target keywords that commonly occur in spam messages (e.g. "free", "winner", "urgent") and label any message containing one of these keywords as spam. Such a model is easy to build, inexpensive to run, and gives a concrete point of comparison for the recurrent neural network.