Sentiment140 Dataset - Using embeddings and handling overfitting

You can get the dataset from here: http://help.sentiment140.com/home

In [1]:
import csv
import random
import pickle
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
from scipy.stats import linregress

Defining some useful global variables

  • EMBEDDING_DIM: Dimension of the dense embedding, will be used in the embedding layer of the model. Defaults to 100.
  • MAXLEN: Maximum length of all sequences. Defaults to 16.
  • TRUNCATING: Truncating strategy (truncate either before or after each sequence). Defaults to 'post'.
  • PADDING: Padding strategy (pad either before or after each sequence). Defaults to 'post'.
  • OOV_TOKEN: Token to replace out-of-vocabulary words during texts_to_sequences calls. Defaults to "<OOV>".
  • MAX_EXAMPLES: Max number of examples to use. Defaults to 160000 (10% of the original number of examples).
  • TRAINING_SPLIT: Proportion of data used for training. Defaults to 0.9
In [2]:
EMBEDDING_DIM = 100
MAXLEN = 16
TRUNCATING = 'post'
PADDING = 'post'
OOV_TOKEN = "<OOV>"
MAX_EXAMPLES = 160000
TRAINING_SPLIT = 0.9

Explore the dataset

The dataset is provided in a csv file.

Each row of this file contains the following values separated by commas:

  • target: the polarity of the tweet (0 = negative, 4 = positive)

  • ids: The id of the tweet

  • date: the date of the tweet

  • flag: The query. If there is no query, then this value is NO_QUERY.

  • user: the user that tweeted

  • text: the text of the tweet

Take a look at the first two examples:

In [3]:
SENTIMENT_CSV = "./data/training_cleaned.csv"

with open(SENTIMENT_CSV, 'r') as csvfile:
    print(f"First data point looks like this:\n\n{csvfile.readline()}")
    print(f"Second data point looks like this:\n\n{csvfile.readline()}")
First data point looks like this:

"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

Second data point looks like this:

"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"

Parsing the raw data

In [4]:
sentences = []
labels = []

with open(SENTIMENT_CSV, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        # Map the raw polarity (0 = negative, 4 = positive) to a 0/1 label
        label = int(int(row[0]) * 1/4)
        labels.append(label)
        sentence = row[5]
        sentences.append(sentence)
In [5]:
print(f"dataset contains {len(sentences)} examples\n")

print(f"Text of second example should look like this:\n{sentences[1]}\n")
print(f"Text of fourth example should look like this:\n{sentences[3]}")

print(f"\nLabels of last 5 examples should look like this:\n{labels[-5:]}")
dataset contains 1600000 examples

Text of second example should look like this:
is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!

Text of fourth example should look like this:
my whole body feels itchy and like its on fire 

Labels of last 5 examples should look like this:
[1, 1, 1, 1, 1]

You might have noticed that this dataset contains a lot of examples. To keep the execution time low, we will use only 10% of the original data. The next cell does this while also randomizing the datapoints that will be used:

In [6]:
# Bundle the two lists into a single one
sentences_and_labels = list(zip(sentences, labels))

# Perform random sampling
random.seed(42)
sentences_and_labels = random.sample(sentences_and_labels, MAX_EXAMPLES)

# Unpack back into separate lists
sentences, labels = zip(*sentences_and_labels)

print(f"There are {len(sentences)} sentences and {len(labels)} labels after random sampling\n")
There are 160000 sentences and 160000 labels after random sampling

Training - Validation Split

In [7]:
# Compute the number of sentences that will be used for training (should be an integer)
train_size = int(len(sentences) * TRAINING_SPLIT)

# Split the sentences and labels into train/validation splits
train_sentences = sentences[0:train_size]
train_labels = labels[0:train_size]

val_sentences = sentences[train_size:]
val_labels = labels[train_size:]
In [8]:
print(f"There are {len(train_sentences)} sentences for training.\n")
print(f"There are {len(train_labels)} labels for training.\n")
print(f"There are {len(val_sentences)} sentences for validation.\n")
print(f"There are {len(val_labels)} labels for validation.")
There are 144000 sentences for training.

There are 144000 labels for training.

There are 16000 sentences for validation.

There are 16000 labels for validation.

Tokenization - Sequences, truncating and padding

In [9]:
# Instantiate the Tokenizer class, passing in the correct value for oov_token
tokenizer = Tokenizer(oov_token=OOV_TOKEN)

# Fit the tokenizer to the training sentences
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
VOCAB_SIZE = len(word_index)
In [10]:
print(f"Vocabulary contains {VOCAB_SIZE} words\n")
print("<OOV> token included in vocabulary" if "<OOV>" in word_index else "<OOV> token NOT included in vocabulary")
print(f"\nindex of word 'i' should be {word_index['i']}")
Vocabulary contains 128293 words

<OOV> token included in vocabulary

index of word 'i' should be 2
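
Since the tokenizer was built with an OOV token, any word it did not see during fit_on_texts is mapped to that token's index. A quick sketch (not part of the original notebook; the last word below is made up) to see this in action:

In [ ]:
# 'qqqxyz' is a made-up word, so it should map to the index of the OOV token
sample_seq = tokenizer.texts_to_sequences(["i love my qqqxyz"])
print(f"OOV token index: {word_index[OOV_TOKEN]}")
print(f"Sequence: {sample_seq}")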
In [11]:
# Convert the training and validation sentences to sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
val_sequences = tokenizer.texts_to_sequences(val_sentences)

# Pad the sequences using the correct padding, truncating and maxlen
train_pad_trunc_seq = pad_sequences(train_sequences, maxlen=MAXLEN, padding=PADDING, truncating=TRUNCATING)
val_pad_trunc_seq = pad_sequences(val_sequences, maxlen=MAXLEN, padding=PADDING, truncating=TRUNCATING)
In [12]:
print(f"Padded and truncated training sequences have shape: {train_pad_trunc_seq.shape}\n")
print(f"Padded and truncated validation sequences have shape: {val_pad_trunc_seq.shape}")
Padded and truncated training sequences have shape: (144000, 16)

Padded and truncated validation sequences have shape: (16000, 16)

Remember that the pad_sequences function returns numpy arrays, so your training and validation sequences are already in this format.

However, the labels are still Python lists. Before going forward we should convert them to numpy arrays as well.

In [13]:
train_labels = np.array(train_labels)
val_labels = np.array(val_labels)

Using pre-defined Embeddings

In [14]:
# Define path to file containing the embeddings
GLOVE_FILE = './data/glove.6B.100d.txt'

# Initialize an empty embeddings index dictionary
GLOVE_EMBEDDINGS = {}

# Read file and fill GLOVE_EMBEDDINGS with its contents
with open(GLOVE_FILE) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        GLOVE_EMBEDDINGS[word] = coefs

Now you have access to GloVe's pre-trained word vectors. Isn't that cool?

Let's take a look at the vector for the word dog:

In [15]:
test_word = 'dog'

test_vector = GLOVE_EMBEDDINGS[test_word]

print(f"Vector representation of word {test_word} looks like this:\n\n{test_vector}")
Vector representation of word dog looks like this:

[ 0.30817    0.30938    0.52803   -0.92543   -0.73671    0.63475
  0.44197    0.10262   -0.09142   -0.56607   -0.5327     0.2013
  0.7704    -0.13983    0.13727    1.1128     0.89301   -0.17869
 -0.0019722  0.57289    0.59479    0.50428   -0.28991   -1.3491
  0.42756    1.2748    -1.1613    -0.41084    0.042804   0.54866
  0.18897    0.3759     0.58035    0.66975    0.81156    0.93864
 -0.51005   -0.070079   0.82819   -0.35346    0.21086   -0.24412
 -0.16554   -0.78358   -0.48482    0.38968   -0.86356   -0.016391
  0.31984   -0.49246   -0.069363   0.018869  -0.098286   1.3126
 -0.12116   -1.2399    -0.091429   0.35294    0.64645    0.089642
  0.70294    1.1244     0.38639    0.52084    0.98787    0.79952
 -0.34625    0.14095    0.80167    0.20987   -0.86007   -0.15308
  0.074523   0.40816    0.019208   0.51587   -0.34428   -0.24525
 -0.77984    0.27425    0.22418    0.20164    0.017431  -0.014697
 -1.0235    -0.39695   -0.0056188  0.30569    0.31748    0.021404
  0.11837   -0.11319    0.42456    0.53405   -0.16717   -0.27185
 -0.6255     0.12883    0.62529   -0.52086  ]
In [16]:
print(f"Each word vector has shape: {test_vector.shape}")
Each word vector has shape: (100,)
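
Because these are GloVe vectors, words that appear in similar contexts tend to get similar vectors. As a quick sanity check (not part of the original notebook; it assumes the words used are present in the GloVe file, which should hold for common words like these), you can compare cosine similarities:

In [ ]:
def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"dog vs cat:   {cosine_similarity(GLOVE_EMBEDDINGS['dog'], GLOVE_EMBEDDINGS['cat']):.3f}")
print(f"dog vs piano: {cosine_similarity(GLOVE_EMBEDDINGS['dog'], GLOVE_EMBEDDINGS['piano']):.3f}")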

Represent the words in your vocabulary using the embeddings

In [17]:
# Initialize an empty numpy array with the appropriate size
EMBEDDINGS_MATRIX = np.zeros((VOCAB_SIZE+1, EMBEDDING_DIM))

# Iterate over all of the words in the vocabulary and, if GloVe has a vector
# representation for a word, save it in the EMBEDDINGS_MATRIX array
for word, i in word_index.items():
    embedding_vector = GLOVE_EMBEDDINGS.get(word)
    if embedding_vector is not None:
        EMBEDDINGS_MATRIX[i] = embedding_vector

Now you have the pre-trained embeddings ready to use!
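
Any word in your vocabulary that GloVe does not know keeps an all-zeros row in EMBEDDINGS_MATRIX. A short sketch (not part of the original notebook) to check how much of the vocabulary is actually covered:

In [ ]:
covered = sum(1 for word in word_index if word in GLOVE_EMBEDDINGS)
print(f"{covered} of {VOCAB_SIZE} vocabulary words ({covered / VOCAB_SIZE:.1%}) have a GloVe vector")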

Define a model that does not overfit

In [48]:
model = tf.keras.Sequential([ 
    # This is how you need to set the Embedding layer when using pre-trained embeddings
    tf.keras.layers.Embedding(VOCAB_SIZE+1, EMBEDDING_DIM, input_length=MAXLEN, weights=[EMBEDDINGS_MATRIX], trainable=False),
    #tf.keras.layers.Conv1D(filters=128, kernel_size=5, activation='relu'),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy']) 
In [49]:
model.summary()

# Train the model and save the training history
history = model.fit(train_pad_trunc_seq, train_labels, epochs=20, validation_data=(val_pad_trunc_seq, val_labels))
Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_11 (Embedding)    (None, 16, 100)           12829400  
                                                                 
 bidirectional_11 (Bidirecti  (None, 256)              234496    
 onal)                                                           
                                                                 
 dropout_11 (Dropout)        (None, 256)               0         
                                                                 
 dense_22 (Dense)            (None, 64)                16448     
                                                                 
 dense_23 (Dense)            (None, 1)                 65        
                                                                 
=================================================================
Total params: 13,080,409
Trainable params: 251,009
Non-trainable params: 12,829,400
_________________________________________________________________
Epoch 1/20
4500/4500 [==============================] - 65s 14ms/step - loss: 0.5437 - accuracy: 0.7195 - val_loss: 0.5071 - val_accuracy: 0.7511
Epoch 2/20
4500/4500 [==============================] - 61s 14ms/step - loss: 0.4992 - accuracy: 0.7540 - val_loss: 0.4956 - val_accuracy: 0.7584
Epoch 3/20
4500/4500 [==============================] - 62s 14ms/step - loss: 0.4817 - accuracy: 0.7664 - val_loss: 0.4824 - val_accuracy: 0.7656
Epoch 4/20
4500/4500 [==============================] - 61s 14ms/step - loss: 0.4717 - accuracy: 0.7730 - val_loss: 0.4770 - val_accuracy: 0.7706
Epoch 5/20
4500/4500 [==============================] - 62s 14ms/step - loss: 0.4624 - accuracy: 0.7797 - val_loss: 0.4796 - val_accuracy: 0.7670
Epoch 6/20
4500/4500 [==============================] - 60s 13ms/step - loss: 0.4537 - accuracy: 0.7847 - val_loss: 0.4705 - val_accuracy: 0.7739
Epoch 7/20
4500/4500 [==============================] - 58s 13ms/step - loss: 0.4482 - accuracy: 0.7887 - val_loss: 0.4728 - val_accuracy: 0.7726
Epoch 8/20
4500/4500 [==============================] - 57s 13ms/step - loss: 0.4407 - accuracy: 0.7926 - val_loss: 0.4732 - val_accuracy: 0.7725
Epoch 9/20
4500/4500 [==============================] - 57s 13ms/step - loss: 0.4357 - accuracy: 0.7963 - val_loss: 0.4780 - val_accuracy: 0.7720
Epoch 10/20
4500/4500 [==============================] - 57s 13ms/step - loss: 0.4301 - accuracy: 0.7996 - val_loss: 0.4729 - val_accuracy: 0.7752
Epoch 11/20
4500/4500 [==============================] - 57s 13ms/step - loss: 0.4256 - accuracy: 0.8031 - val_loss: 0.4800 - val_accuracy: 0.7736
Epoch 12/20
4500/4500 [==============================] - 57s 13ms/step - loss: 0.4212 - accuracy: 0.8044 - val_loss: 0.4754 - val_accuracy: 0.7751
Epoch 13/20
4500/4500 [==============================] - 57s 13ms/step - loss: 0.4174 - accuracy: 0.8080 - val_loss: 0.4764 - val_accuracy: 0.7758
Epoch 14/20
4500/4500 [==============================] - 58s 13ms/step - loss: 0.4132 - accuracy: 0.8091 - val_loss: 0.4783 - val_accuracy: 0.7730
Epoch 15/20
4500/4500 [==============================] - 59s 13ms/step - loss: 0.4096 - accuracy: 0.8109 - val_loss: 0.4781 - val_accuracy: 0.7741
Epoch 16/20
4500/4500 [==============================] - 57s 13ms/step - loss: 0.4057 - accuracy: 0.8131 - val_loss: 0.4894 - val_accuracy: 0.7683
Epoch 17/20
4500/4500 [==============================] - 60s 13ms/step - loss: 0.4029 - accuracy: 0.8157 - val_loss: 0.4811 - val_accuracy: 0.7714
Epoch 18/20
4500/4500 [==============================] - 60s 13ms/step - loss: 0.3991 - accuracy: 0.8174 - val_loss: 0.4868 - val_accuracy: 0.7708
Epoch 19/20
4500/4500 [==============================] - 57s 13ms/step - loss: 0.3963 - accuracy: 0.8181 - val_loss: 0.4839 - val_accuracy: 0.7735
Epoch 20/20
4500/4500 [==============================] - 58s 13ms/step - loss: 0.3943 - accuracy: 0.8198 - val_loss: 0.4919 - val_accuracy: 0.7746

Run the following cell to check your loss curves:

In [50]:
#-----------------------------------------------------------
# Retrieve the training and validation loss recorded for
# each training epoch
#-----------------------------------------------------------
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = [*range(20)]

#------------------------------------------------
# Plot training and validation loss per epoch
#------------------------------------------------
plt.plot(epochs, loss, 'r')
plt.plot(epochs, val_loss, 'b')
plt.title('Training and validation loss')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend(["Loss", "Validation Loss"])
plt.show()

If you wish, you can also check the training and validation accuracies of your model:

In [51]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

#------------------------------------------------
# Plot training and validation accuracy per epoch
#------------------------------------------------
plt.plot(epochs, acc, 'r')
plt.plot(epochs, val_acc, 'b')
plt.title('Training and validation accuracy')
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend(["Accuracy", "Validation Accuracy"])
plt.show()
In [52]:
# Test the slope of your val_loss curve
slope, *_ = linregress(epochs, val_loss)
print(f"The slope of your validation loss curve is {slope:.5f}")
The slope of your validation loss curve is -0.00010
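
A slope this close to zero (and slightly negative) means the validation loss is not trending upward across epochs, which is exactly the overfitting symptom this notebook set out to avoid. If you also want to keep the training history around without retraining, pickle (imported at the top but unused so far) can serialize it; the filename below is just an example:

In [ ]:
# Save the history dictionary so the curves can be re-plotted later
with open('history.pkl', 'wb') as f:
    pickle.dump(history.history, f)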