Shakespeare Sonnets Dataset - Predict Next Word

You can get the dataset from here: https://www.opensourceshakespeare.org/views/sonnets/sonnet_view.php?range=viewrange&sonnetrange1=1&sonnetrange2=154

In [1]:
import numpy as np 
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical 
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
In [3]:
# Define path for file with sonnets
SONNETS_FILE = './sonnets.txt'

# Read the data
with open('./sonnets.txt') as f:
    data = f.read()

# Convert to lower case and save as a list
corpus = data.lower().split("\n")

print(f"There are {len(corpus)} lines of sonnets\n")
print(f"The first 5 lines look like this:\n")
for i in range(5):
  print(corpus[i])
There are 2159 lines of sonnets

The first 5 lines look like this:

from fairest creatures we desire increase,
that thereby beauty's rose might never die,
but as the riper should by time decease,
his tender heir might bear his memory:
but thou, contracted to thine own bright eyes,

Tokenizing the text

In [4]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
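
As a quick optional check, you can look at the vocabulary size (the +1 makes room for index 0, which Keras reserves for padding and never assigns to a word) and a few entries of the word index:

# Optional check: vocabulary size and a few word-index entries
print(f"Total words (including the padding index): {total_words}")
print(list(tokenizer.word_index.items())[:5])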

When converting the text into sequences you can use the texts_to_sequences method, as you have done throughout this course.

It is important to keep in mind that the way you feed the data into this method affects the result. Check the following example to make this clearer.

The first example of the corpus is a string and looks like this:

In [5]:
corpus[0]
Out[5]:
'from fairest creatures we desire increase,'

If you pass this text directly into the texts_to_sequences method you will get an unexpected result:

In [6]:
tokenizer.texts_to_sequences(corpus[0])
Out[6]:
[[],
 [],
 [58],
 [],
 [],
 [],
 [17],
 [6],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [17],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [6],
 [],
 [],
 [],
 [6],
 [],
 [],
 [],
 [],
 [17],
 [],
 [],
 []]

This happened because texts_to_sequences expects a list and you provided a string. A string is still an iterable in Python, so every character was treated as a separate text: characters that happen to be words in the vocabulary (such as 'a', 'i' and 'o') got an index, while the rest returned empty lists.

Instead, you need to place the example within a list before passing it to the method:

In [7]:
tokenizer.texts_to_sequences([corpus[0]])
Out[7]:
[[34, 417, 877, 166, 213, 517]]

Notice that the sequence comes back wrapped inside a list, so to get only the desired sequence you need to explicitly take the first item of that list, like this:

In [8]:
tokenizer.texts_to_sequences([corpus[0]])[0]
Out[8]:
[34, 417, 877, 166, 213, 517]

Generating n_grams

This function receives the corpus (a list of strings) and the fitted tokenizer, and should return a list containing the n_gram sequences for each line in the corpus:

In [23]:
def n_gram_seqs(corpus, tokenizer):
    """
    Generates a list of n-gram sequences
    
    Args:
        corpus (list of string): lines of text to generate n-grams for
        tokenizer (object): an instance of the Tokenizer class containing the word-index dictionary
    
    Returns:
        input_sequences (list of list of int): the n-gram sequences for each line in the corpus
    """
    input_sequences = []

    # Loop over every line
    for line in corpus:

        # Tokenize the current line
        token_list = tokenizer.texts_to_sequences([line])[0]

        # Loop over the line several times to generate the subphrases
        for i in range(1, len(token_list)):

            # Generate the subphrase (the first i+1 tokens)
            n_gram_sequence = token_list[:i+1]

            # Append the subphrase to the sequences list
            input_sequences.append(n_gram_sequence)

    return input_sequences
In [24]:
# Test our function with one example
first_example_sequence = n_gram_seqs([corpus[0]], tokenizer)

print("n_gram sequences for first example look like this:\n")
first_example_sequence
n_gram sequences for first example look like this:

Out[24]:
[[34, 417],
 [34, 417, 877],
 [34, 417, 877, 166],
 [34, 417, 877, 166, 213],
 [34, 417, 877, 166, 213, 517]]
In [25]:
# Test your function with a bigger corpus
next_3_examples_sequence = n_gram_seqs(corpus[1:4], tokenizer)

print("n_gram sequences for next 3 examples look like this:\n")
next_3_examples_sequence
n_gram sequences for next 3 examples look like this:

Out[25]:
[[8, 878],
 [8, 878, 134],
 [8, 878, 134, 351],
 [8, 878, 134, 351, 102],
 [8, 878, 134, 351, 102, 156],
 [8, 878, 134, 351, 102, 156, 199],
 [16, 22],
 [16, 22, 2],
 [16, 22, 2, 879],
 [16, 22, 2, 879, 61],
 [16, 22, 2, 879, 61, 30],
 [16, 22, 2, 879, 61, 30, 48],
 [16, 22, 2, 879, 61, 30, 48, 634],
 [25, 311],
 [25, 311, 635],
 [25, 311, 635, 102],
 [25, 311, 635, 102, 200],
 [25, 311, 635, 102, 200, 25],
 [25, 311, 635, 102, 200, 25, 278]]

Apply the n_gram_seqs transformation to the whole corpus and save the maximum sequence length to use it later:

In [26]:
# Apply the n_gram_seqs transformation to the whole corpus
input_sequences = n_gram_seqs(corpus, tokenizer)

# Save max length 
max_sequence_len = max([len(x) for x in input_sequences])

print(f"n_grams of input_sequences have length: {len(input_sequences)}")
print(f"maximum length of sequences is: {max_sequence_len}")
n_grams of input_sequences have length: 15462
maximum length of sequences is: 11

Add padding to the sequences

Now we will code the pad_seqs function, which pads a given list of sequences to the desired maximum length. Notice that this function receives a list of sequences and should return a numpy array with the padded sequences:

In [29]:
def pad_seqs(input_sequences, maxlen):
    """
    Pads tokenized sequences to the same length
    
    Args:
        input_sequences (list of list of int): tokenized sequences to pad
        maxlen (int): maximum length of the token sequences
    
    Returns:
        padded_sequences (array of int): tokenized sequences padded to the same length
    """
    padded_sequences = np.array(pad_sequences(input_sequences, maxlen=maxlen, padding='pre'))
    
    return padded_sequences
In [30]:
# Test your function with the n_grams_seq of the first example
first_padded_seq = pad_seqs(first_example_sequence, len(first_example_sequence))
first_padded_seq
Out[30]:
array([[  0,   0,   0,  34, 417],
       [  0,   0,  34, 417, 877],
       [  0,  34, 417, 877, 166],
       [ 34, 417, 877, 166, 213],
       [417, 877, 166, 213, 517]], dtype=int32)
In [31]:
# Test your function with the n_grams_seq of the next 3 examples
next_3_padded_seq = pad_seqs(next_3_examples_sequence, max([len(s) for s in next_3_examples_sequence]))
next_3_padded_seq
Out[31]:
array([[  0,   0,   0,   0,   0,   0,   8, 878],
       [  0,   0,   0,   0,   0,   8, 878, 134],
       [  0,   0,   0,   0,   8, 878, 134, 351],
       [  0,   0,   0,   8, 878, 134, 351, 102],
       [  0,   0,   8, 878, 134, 351, 102, 156],
       [  0,   8, 878, 134, 351, 102, 156, 199],
       [  0,   0,   0,   0,   0,   0,  16,  22],
       [  0,   0,   0,   0,   0,  16,  22,   2],
       [  0,   0,   0,   0,  16,  22,   2, 879],
       [  0,   0,   0,  16,  22,   2, 879,  61],
       [  0,   0,  16,  22,   2, 879,  61,  30],
       [  0,  16,  22,   2, 879,  61,  30,  48],
       [ 16,  22,   2, 879,  61,  30,  48, 634],
       [  0,   0,   0,   0,   0,   0,  25, 311],
       [  0,   0,   0,   0,   0,  25, 311, 635],
       [  0,   0,   0,   0,  25, 311, 635, 102],
       [  0,   0,   0,  25, 311, 635, 102, 200],
       [  0,   0,  25, 311, 635, 102, 200,  25],
       [  0,  25, 311, 635, 102, 200,  25, 278]], dtype=int32)
In [32]:
# Pad the whole corpus
input_sequences = pad_seqs(input_sequences, max_sequence_len)

print(f"padded corpus has shape: {input_sequences.shape}")
padded corpus has shape: (15462, 11)

Split the data into features and labels

Before feeding the data into the neural network you should split it into features and labels. In this case the features will be the padded n_gram sequences with the last word removed, and the labels will be that removed word.

The function below expects the padded n_gram sequences as input and should return a tuple containing the features and the one-hot encoded labels.

In [37]:
def features_and_labels(input_sequences, total_words):
    """
    Generates features and labels from n-grams
    
    Args:
        input_sequences (array of int): padded n-gram sequences to split into features and labels
        total_words (int): vocabulary size
    
    Returns:
        features, one_hot_labels (array of int, array of int): arrays of features and one-hot encoded labels
    """
    features = input_sequences[:,:-1]
    labels = input_sequences[:,-1]
    one_hot_labels = to_categorical(labels, num_classes=total_words)

    return features, one_hot_labels
In [38]:
# Test your function with the padded n_grams_seq of the first example
first_features, first_labels = features_and_labels(first_padded_seq, total_words)

print(f"labels have shape: {first_labels.shape}")
print("\nfeatures look like this:\n")
first_features
labels have shape: (5, 3211)

features look like this:

Out[38]:
array([[  0,   0,   0,  34],
       [  0,   0,  34, 417],
       [  0,  34, 417, 877],
       [ 34, 417, 877, 166],
       [417, 877, 166, 213]], dtype=int32)
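
The training cell further below fits the model on features and labels computed from the whole corpus, so apply the same split to all of the padded input_sequences. A minimal version of that step, reusing the variable names that model.fit expects, looks like this:

# Split the whole padded corpus into features and one-hot encoded labels
features, labels = features_and_labels(input_sequences, total_words)

print(f"features have shape: {features.shape}")
print(f"labels have shape: {labels.shape}")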

Create the model

In [40]:
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
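
If you want a quick sanity check of the architecture before training, you can print the layer output shapes and parameter counts:

# Optional: inspect layer output shapes and parameter counts
model.summary()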

Train the model

In [41]:
# Train the model
history = model.fit(features, labels, epochs=50, verbose=1)
Epoch 1/50
484/484 [==============================] - 12s 9ms/step - loss: 6.8825 - accuracy: 0.0248
Epoch 2/50
484/484 [==============================] - 4s 8ms/step - loss: 6.4208 - accuracy: 0.0327
Epoch 3/50
484/484 [==============================] - 4s 8ms/step - loss: 6.1828 - accuracy: 0.0392
Epoch 4/50
484/484 [==============================] - 4s 9ms/step - loss: 5.9334 - accuracy: 0.0507
Epoch 5/50
484/484 [==============================] - 4s 8ms/step - loss: 5.6410 - accuracy: 0.0634
Epoch 6/50
484/484 [==============================] - 4s 8ms/step - loss: 5.3036 - accuracy: 0.0748
Epoch 7/50
484/484 [==============================] - 4s 7ms/step - loss: 4.9402 - accuracy: 0.0887
Epoch 8/50
484/484 [==============================] - 4s 8ms/step - loss: 4.5413 - accuracy: 0.1168
Epoch 9/50
484/484 [==============================] - 4s 8ms/step - loss: 4.1359 - accuracy: 0.1627
Epoch 10/50
484/484 [==============================] - 4s 7ms/step - loss: 3.7459 - accuracy: 0.2241
Epoch 11/50
484/484 [==============================] - 4s 8ms/step - loss: 3.3735 - accuracy: 0.2949
Epoch 12/50
484/484 [==============================] - 4s 8ms/step - loss: 3.0341 - accuracy: 0.3558
Epoch 13/50
484/484 [==============================] - 4s 8ms/step - loss: 2.7328 - accuracy: 0.4150
Epoch 14/50
484/484 [==============================] - 4s 7ms/step - loss: 2.4685 - accuracy: 0.4712
Epoch 15/50
484/484 [==============================] - 4s 7ms/step - loss: 2.2365 - accuracy: 0.5192
Epoch 16/50
484/484 [==============================] - 4s 7ms/step - loss: 2.0284 - accuracy: 0.5677
Epoch 17/50
484/484 [==============================] - 4s 7ms/step - loss: 1.8402 - accuracy: 0.6071
Epoch 18/50
484/484 [==============================] - 4s 7ms/step - loss: 1.6764 - accuracy: 0.6442
Epoch 19/50
484/484 [==============================] - 4s 7ms/step - loss: 1.5287 - accuracy: 0.6784
Epoch 20/50
484/484 [==============================] - 4s 7ms/step - loss: 1.3995 - accuracy: 0.7079
Epoch 21/50
484/484 [==============================] - 4s 7ms/step - loss: 1.2834 - accuracy: 0.7337
Epoch 22/50
484/484 [==============================] - 4s 7ms/step - loss: 1.1852 - accuracy: 0.7548
Epoch 23/50
484/484 [==============================] - 4s 8ms/step - loss: 1.0928 - accuracy: 0.7722
Epoch 24/50
484/484 [==============================] - 4s 8ms/step - loss: 1.0254 - accuracy: 0.7850
Epoch 25/50
484/484 [==============================] - 4s 7ms/step - loss: 0.9565 - accuracy: 0.7984
Epoch 26/50
484/484 [==============================] - 4s 7ms/step - loss: 0.8920 - accuracy: 0.8099
Epoch 27/50
484/484 [==============================] - 4s 7ms/step - loss: 0.8529 - accuracy: 0.8161
Epoch 28/50
484/484 [==============================] - 4s 8ms/step - loss: 0.8080 - accuracy: 0.8238
Epoch 29/50
484/484 [==============================] - 4s 7ms/step - loss: 0.7689 - accuracy: 0.8289
Epoch 30/50
484/484 [==============================] - 4s 7ms/step - loss: 0.7438 - accuracy: 0.8335
Epoch 31/50
484/484 [==============================] - 4s 9ms/step - loss: 0.7162 - accuracy: 0.8377
Epoch 32/50
484/484 [==============================] - 4s 7ms/step - loss: 0.6899 - accuracy: 0.8412
Epoch 33/50
484/484 [==============================] - 4s 7ms/step - loss: 0.6789 - accuracy: 0.8415
Epoch 34/50
484/484 [==============================] - 4s 7ms/step - loss: 0.6587 - accuracy: 0.8443
Epoch 35/50
484/484 [==============================] - 4s 7ms/step - loss: 0.6535 - accuracy: 0.8433
Epoch 36/50
484/484 [==============================] - 4s 7ms/step - loss: 0.6417 - accuracy: 0.8448
Epoch 37/50
484/484 [==============================] - 4s 7ms/step - loss: 0.6286 - accuracy: 0.8465
Epoch 38/50
484/484 [==============================] - 4s 7ms/step - loss: 0.6155 - accuracy: 0.8473
Epoch 39/50
484/484 [==============================] - 4s 7ms/step - loss: 0.6100 - accuracy: 0.8475
Epoch 40/50
484/484 [==============================] - 4s 8ms/step - loss: 0.6040 - accuracy: 0.8481
Epoch 41/50
484/484 [==============================] - 4s 7ms/step - loss: 0.5988 - accuracy: 0.8478
Epoch 42/50
484/484 [==============================] - 4s 7ms/step - loss: 0.5945 - accuracy: 0.8478
Epoch 43/50
484/484 [==============================] - 4s 7ms/step - loss: 0.5908 - accuracy: 0.8458
Epoch 44/50
484/484 [==============================] - 4s 7ms/step - loss: 0.5928 - accuracy: 0.8474
Epoch 45/50
484/484 [==============================] - 4s 7ms/step - loss: 0.5847 - accuracy: 0.8474
Epoch 46/50
484/484 [==============================] - 4s 7ms/step - loss: 0.5805 - accuracy: 0.8476
Epoch 47/50
484/484 [==============================] - 4s 7ms/step - loss: 0.5689 - accuracy: 0.8508
Epoch 48/50
484/484 [==============================] - 4s 7ms/step - loss: 0.5713 - accuracy: 0.8507
Epoch 49/50
484/484 [==============================] - 4s 7ms/step - loss: 0.5703 - accuracy: 0.8500
Epoch 50/50
484/484 [==============================] - 4s 7ms/step - loss: 0.5668 - accuracy: 0.8492
In [42]:
# Take a look at the training curves of your model

acc = history.history['accuracy']
loss = history.history['loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'b', label='Training accuracy')
plt.title('Training accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'b', label='Training Loss')
plt.title('Training loss')
plt.legend()

plt.show()

See our model in action

After all your work it is finally time to see your model generating text.

Run the cell below to generate the next 100 words of a seed text.

In [44]:
seed_text = "Help me Obi Wan Kenobi, you're my only hope"
next_words = 100
  
for _ in range(next_words):
    # Convert the text into sequences
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # Pad the sequences
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Get the probabilities of predicting a word
    predicted = model.predict(token_list, verbose=0)
    # Choose the next word based on the maximum probability
    predicted = np.argmax(predicted, axis=-1).item()
    # Get the actual word from the word index
    output_word = tokenizer.index_word[predicted]
    # Append to the current text
    seed_text += " " + output_word

print(seed_text)
Help me Obi Wan Kenobi, you're my only hope should stay the treasure of laws ' it blind contains ' can love now it live bright best ' still old toil'd you or thee still sessions eyes good best assured you can hide his wrong alone nought of thee thy heart in thee trophies carved me strife some had dead seen spent spent spent seen date of no growth of pride ' to be gone of well ' you dearer tell so rare me ' be kind ' be my ill live his style near slain of ride glass poet's dwell in thine eyes ' so bad my friend
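
Greedy argmax decoding tends to loop on the same high-probability words (note the repeated "spent spent spent" above). A common variation, not part of this assignment, is to sample the next word from the predicted distribution with a temperature; the sketch below assumes the same model, tokenizer and max_sequence_len defined earlier:

def sample_next_word(text, temperature=0.8):
    # Same preprocessing as the greedy loop above
    token_list = tokenizer.texts_to_sequences([text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')

    # Predicted probability distribution over the vocabulary
    probs = model.predict(token_list, verbose=0)[0]

    # Rescale with a temperature, renormalize, then sample an index
    scaled = np.log(probs + 1e-9) / temperature
    scaled = np.exp(scaled) / np.sum(np.exp(scaled))
    predicted = np.random.choice(len(scaled), p=scaled)

    # Index 0 is the padding index and has no word associated with it
    return tokenizer.index_word.get(predicted, "")

# Example usage: print(sample_next_word(seed_text))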