BBC News Archive - Text Classification - First Model (Overfitted)

You can get the dataset from here: https://www.kaggle.com/c/learn-ai-bbc/overview

In [1]:
import io
import csv
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt

You probably remember the structure of the CSV file:

In [2]:
with open("./bbc-text.csv", 'r') as csvfile:
    print(f"First line (header) looks like this:\n\n{csvfile.readline()}")
    print(f"Each data point looks like this:\n\n{csvfile.readline()}")
First line (header) looks like this:

category,text

Each data point looks like this:

tech,tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high-definition tv sets  which are big business in japan and the us  but slower to take off in europe because of the lack of high-definition programming. not only can people forward wind through adverts  they can also forget about abiding by network and channel schedules  putting together their own a-la-carte entertainment. but some us networks and cable and satellite companies are worried about what it means for them in terms of advertising revenues as well as  brand identity  and viewer loyalty to channels. although the us leads in this technology at the moment  it is also a concern that is being raised in europe  particularly with the growing uptake of services like sky+.  what happens here today  we will see in nine months to a years  time in the uk   adam hume  the bbc broadcast s futurologist told the bbc news website. for the likes of the bbc  there are no issues of lost advertising revenue yet. 
it is a more pressing issue at the moment for commercial uk broadcasters  but brand loyalty is important for everyone.  we will be talking more about content brands rather than network brands   said tim hanlon  from brand communications firm starcom mediavest.  the reality is that with broadband connections  anybody can be the producer of content.  he added:  the challenge now is that it is hard to promote a programme with so much choice.   what this means  said stacey jolna  senior vice president of tv guide tv group  is that the way people find the content they want to watch has to be simplified for tv viewers. it means that networks  in us terms  or channels could take a leaf out of google s book and be the search engine of the future  instead of the scheduler to help people find what they want to watch. this kind of channel model might work for the younger ipod generation which is used to taking control of their gadgets and what they play on them. but it might not suit everyone  the panel recognised. older generations are more comfortable with familiar schedules and channel brands because they know what they are getting. they perhaps do not want so much of the choice put into their hands  mr hanlon suggested.  on the other end  you have the kids just out of diapers who are pushing buttons already - everything is possible and available to them   said mr hanlon.  ultimately  the consumer will tell the market they want.   of the 50 000 new gadgets and technologies being showcased at ces  many of them are about enhancing the tv-watching experience. high-definition tv sets are everywhere and many new models of lcd (liquid crystal display) tvs have been launched with dvr capability built into them  instead of being external boxes. one such example launched at the show is humax s 26-inch lcd tv with an 80-hour tivo dvr and dvd recorder. 
one of the us s biggest satellite tv companies  directtv  has even launched its own branded dvr at the show with 100-hours of recording capability  instant replay  and a search function. the set can pause and rewind tv for up to 90 hours. and microsoft chief bill gates announced in his pre-show keynote speech a partnership with tivo  called tivotogo  which means people can play recorded programmes on windows pcs and mobile devices. all these reflect the increasing trend of freeing up multimedia so that people can watch what they want  when they want.

As you can see, each data point is composed of the category of the news article followed by a comma and then the actual text of the article.
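
The row layout described above can be checked with a few lines of standard-library code. This is only a minimal sketch on a made-up two-row sample: the `sample_csv` string and its article text are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample that mimics the header plus "category,text" layout.
sample_csv = (
    "category,text\n"
    "tech,tv future in the hands of viewers\n"
    "sport,home side wins the final\n"
)

reader = csv.reader(io.StringIO(sample_csv), delimiter=",")
header = next(reader)  # first row holds the column names
rows = [(category, text) for category, text in reader]

print(header)   # ['category', 'text']
print(rows[0])  # ('tech', 'tv future in the hands of viewers')
```

The same pattern (skip the header with `next`, then read the label from column 0 and the text from column 1) is what the loading cell further down applies to the real file.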

Defining useful global variables

  • NUM_WORDS: The maximum number of words to keep, based on word frequency. Defaults to 1000.
  • EMBEDDING_DIM: Dimension of the dense embedding used in the model's embedding layer. Defaults to 16.
  • MAXLEN: Maximum length of all sequences. Defaults to 120.
  • PADDING: Padding strategy (pad either before or after each sequence). Defaults to 'post'.
  • OOV_TOKEN: Token to replace out-of-vocabulary words during texts_to_sequences calls. Defaults to "<OOV>".
  • TRAINING_SPLIT: Proportion of data used for training. Defaults to 0.8.
In [3]:
NUM_WORDS = 1000
EMBEDDING_DIM = 16
MAXLEN = 120
PADDING = 'post'
OOV_TOKEN = "<OOV>"
TRAINING_SPLIT = .8

Loading and pre-processing the data

In [4]:
def remove_stopwords(sentence):
    """
    Removes a list of stopwords
    
    Args:
        sentence (string): sentence to remove the stopwords from
    
    Returns:
        sentence (string): lowercase sentence without the stopwords
    """
    # List of stopwords
    stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]
    
    # Sentence converted to lowercase-only
    sentence = sentence.lower()

    words = sentence.split()
    no_words = [w for w in words if w not in stopwords]
    sentence = " ".join(no_words)

    return sentence

filename = "./bbc-text.csv"
sentences = []
labels = []
with open(filename, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader:
        labels.append(row[0])
        sentence = row[1]
        sentence = remove_stopwords(sentence)
        sentences.append(sentence)
In [5]:
print(f"There are {len(sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0].split())} words (after removing stopwords).\n")
print(f"There are {len(labels)} labels in the dataset.\n")
print(f"The first 5 labels are {labels[:5]}")
There are 2225 sentences in the dataset.

First sentence has 436 words (after removing stopwords).

There are 2225 labels in the dataset.

The first 5 labels are ['tech', 'business', 'sport', 'sport', 'entertainment']
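
To see what the stopword filter does to a single sentence, here is a small self-contained sketch. It assumes a reduced stopword set (`STOPWORDS` below is an illustrative subset, not the full list) and uses a hypothetical helper name, `remove_stopwords_sketch`, so it does not shadow the function defined above.

```python
# Illustrative subset of the stopword list; a frozenset gives fast membership tests.
STOPWORDS = frozenset(["a", "and", "in", "is", "of", "the", "with"])

def remove_stopwords_sketch(sentence):
    """Lowercase the sentence and drop whitespace-separated stopwords."""
    words = sentence.lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)

print(remove_stopwords_sketch("TV future in the hands of viewers"))
# tv future hands viewers
```

Because the sentence is split on whitespace only, a stopword with punctuation attached (for example "the,") would not match and would be kept.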

Training - Validation Split

In [8]:
# Compute the number of sentences that will be used for training (should be an integer)
train_size = int(len(sentences) * TRAINING_SPLIT)

# Split the sentences and labels into train/validation splits
train_sentences = sentences[0:train_size]
train_labels = labels[0:train_size]

validation_sentences = sentences[train_size:]
validation_labels = labels[train_size:]
In [9]:
print(f"There are {len(train_sentences)} sentences for training.\n")
print(f"There are {len(train_labels)} labels for training.\n")
print(f"There are {len(validation_sentences)} sentences for validation.\n")
print(f"There are {len(validation_labels)} labels for validation.")
There are 1780 sentences for training.

There are 1780 labels for training.

There are 445 sentences for validation.

There are 445 labels for validation.
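
Note that the split above simply slices the data in file order; nothing is shuffled. The sketch below reproduces the arithmetic on stand-in data of the same size. `sequential_split` is a hypothetical helper name introduced only for this illustration.

```python
def sequential_split(items, labels, training_split):
    """Split two parallel lists in their original order (no shuffling)."""
    train_size = int(len(items) * training_split)
    return (items[:train_size], labels[:train_size],
            items[train_size:], labels[train_size:])

# Stand-in data with the same number of rows as the dataset (2225).
items = [f"article {i}" for i in range(2225)]
labels = ["tech"] * 2225

train_x, train_y, val_x, val_y = sequential_split(items, labels, 0.8)
print(len(train_x), len(val_x))  # 1780 445
```

A sequential split is only representative if the file is not ordered by category; if it were, shuffling sentences and labels together before slicing would be the safer choice.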

Tokenization - Sequences and padding

Now that we have training and validation sets, it is time to begin the tokenization process.

In [10]:
# Instantiate the Tokenizer class, passing in the values defined for NUM_WORDS and OOV_TOKEN
tokenizer = Tokenizer(num_words=NUM_WORDS, oov_token=OOV_TOKEN)

# Fit the tokenizer to the training sentences
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
In [11]:
print(f"Vocabulary contains {len(word_index)} words\n")
print("<OOV> token included in vocabulary" if "<OOV>" in word_index else "<OOV> token NOT included in vocabulary")
Vocabulary contains 27285 words

<OOV> token included in vocabulary
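
The vocabulary printed above (27285 words) is much larger than NUM_WORDS (1000) because `word_index` records every word seen during fitting; the `num_words` limit is only applied later, when texts are converted to sequences. The following is a simplified pure-Python sketch of that behaviour, not the Keras implementation. `build_word_index` and `to_sequence` are hypothetical helper names, and plain whitespace tokenization is assumed.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Index 1 is the OOV token; the remaining words are ranked by frequency."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov = word_index[oov_token]
    return [
        word_index[w] if w in word_index and word_index[w] < num_words else oov
        for w in text.split()
    ]

texts = ["tv tv tv film film music", "tv film"]
word_index = build_word_index(texts)
print(len(word_index))  # 4 (the OOV token plus three words)
print(to_sequence("tv film music radio", word_index, num_words=4))
# [2, 3, 1, 1]
```

In the toy run, "music" is in the vocabulary but its index is not below `num_words`, so it is replaced by the OOV index just like the unseen word "radio".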

Now that the tokenizer has been fitted to the training data, we need to convert each text data point into its padded sequence representation:

In [12]:
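# Convert the training and validation sentences to sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
val_sequences = tokenizer.texts_to_sequences(validation_sentences)

# Pad (or truncate) every sequence to MAXLEN using the configured padding strategy
train_padded_seq = pad_sequences(train_sequences, maxlen=MAXLEN, padding=PADDING)
val_padded_seq = pad_sequences(val_sequences, maxlen=MAXLEN, padding=PADDING)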
In [13]:
print(f"Padded training sequences have shape: {train_padded_seq.shape}\n")
print(f"Padded validation sequences have shape: {val_padded_seq.shape}")
Padded training sequences have shape: (1780, 120)

Padded validation sequences have shape: (445, 120)
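
Every row ends up with exactly MAXLEN (120) entries. Since only `padding` is passed to `pad_sequences`, truncation uses the Keras default (`truncating='pre'`), so sequences that are too long keep their last 120 tokens while short ones get zeros appended. The pure-Python sketch below mimics that combination on tiny lists; `pad_or_truncate` is a hypothetical helper, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="pre", value=0):
    """Return a list with exactly maxlen items."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return seq + pad if padding == "post" else pad + seq

print(pad_or_truncate([5, 8, 2], maxlen=5))              # [5, 8, 2, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

For this dataset that means an article such as the first one (436 words after stopword removal) is represented only by its final 120 tokens unless `truncating='post'` is passed explicitly.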

Finally, we need to tokenize the labels:

In [48]:
# Instantiate the Tokenizer (no additional arguments needed)
label_tokenizer = Tokenizer()

# Fit the tokenizer on all the labels
label_tokenizer.fit_on_texts(labels)

# Convert each split's labels to sequences and then to numpy arrays.
# Tokenizer indices start at 1, so subtract 1 to get class ids that start at 0.
train_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels)) - 1
val_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels)) - 1
In [49]:
print(f"First 5 labels of the training set should look like this:\n{train_label_seq[:5]}\n")
print(f"First 5 labels of the validation set should look like this:\n{val_label_seq[:5]}\n")
print(f"Tokenized labels of the training set have shape: {train_label_seq.shape}\n")
print(f"Tokenized labels of the validation set have shape: {val_label_seq.shape}\n")
First 5 labels of the training set should look like this:
[[3]
 [1]
 [0]
 [0]
 [4]]

First 5 labels of the validation set should look like this:
[[4]
 [3]
 [2]
 [0]
 [0]]

Tokenized labels of the training set have shape: (1780, 1)

Tokenized labels of the validation set have shape: (445, 1)
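
The subtraction of 1 matters because the Tokenizer never assigns index 0, while a 5-unit softmax trained with sparse categorical cross-entropy expects class ids 0 to 4. The sketch below imitates the label encoding in pure Python under the assumption that labels are ranked by frequency; `encode_labels` is a hypothetical helper and the label counts are invented.

```python
from collections import Counter

def encode_labels(all_labels, labels_to_encode):
    """Rank labels by frequency (most common gets 1), then shift so ids start at 0."""
    ranked = [label for label, _ in Counter(all_labels).most_common()]
    index = {label: i for i, label in enumerate(ranked, start=1)}
    return [[index[label] - 1] for label in labels_to_encode]

# Toy label list; the counts are made up for illustration.
all_labels = ["sport"] * 3 + ["business"] * 2 + ["tech"]
print(encode_labels(all_labels, ["tech", "business", "sport"]))
# [[2], [1], [0]]
```

Each label becomes a one-element list, which is why the arrays printed above have shape (n, 1).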

Creating a model for text classification

In [60]:
tf.random.set_seed(123)

model = tf.keras.Sequential([ 
    tf.keras.layers.Embedding(NUM_WORDS, EMBEDDING_DIM, input_length=MAXLEN),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(65, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy']) 
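
As a rough sanity check on the size of this architecture, the trainable parameters can be counted by hand. The sketch below is plain arithmetic based on the layer sizes in the cell above (the constant names `HIDDEN_UNITS` and `NUM_CLASSES` are introduced here for readability); `model.summary()` should report the same total in a live session.

```python
NUM_WORDS, EMBEDDING_DIM, HIDDEN_UNITS, NUM_CLASSES = 1000, 16, 65, 5

embedding = NUM_WORDS * EMBEDDING_DIM                 # one 16-d vector per token index
pooling = 0                                           # averaging over tokens has no weights
hidden = EMBEDDING_DIM * HIDDEN_UNITS + HIDDEN_UNITS  # weights plus biases
output = HIDDEN_UNITS * NUM_CLASSES + NUM_CLASSES     # weights plus biases

total = embedding + pooling + hidden + output
print(embedding, hidden, output, total)  # 16000 1105 330 17435
```

Most of the parameters sit in the embedding table, so NUM_WORDS and EMBEDDING_DIM are the main levers on model capacity here.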
In [62]:
history = model.fit(train_padded_seq, train_label_seq, epochs=30, validation_data=(val_padded_seq, val_label_seq))
Epoch 1/30
56/56 [==============================] - 1s 4ms/step - loss: 1.5997 - accuracy: 0.2517 - val_loss: 1.5836 - val_accuracy: 0.4472
Epoch 2/30
56/56 [==============================] - 0s 3ms/step - loss: 1.5421 - accuracy: 0.4343 - val_loss: 1.4858 - val_accuracy: 0.4652
Epoch 3/30
56/56 [==============================] - 0s 2ms/step - loss: 1.3633 - accuracy: 0.5601 - val_loss: 1.2466 - val_accuracy: 0.6045
Epoch 4/30
56/56 [==============================] - 0s 2ms/step - loss: 1.0608 - accuracy: 0.6921 - val_loss: 0.9566 - val_accuracy: 0.7034
Epoch 5/30
56/56 [==============================] - 0s 3ms/step - loss: 0.7700 - accuracy: 0.8112 - val_loss: 0.7174 - val_accuracy: 0.8382
Epoch 6/30
56/56 [==============================] - 0s 2ms/step - loss: 0.5541 - accuracy: 0.9000 - val_loss: 0.5429 - val_accuracy: 0.9034
Epoch 7/30
56/56 [==============================] - 0s 2ms/step - loss: 0.4010 - accuracy: 0.9360 - val_loss: 0.4212 - val_accuracy: 0.9124
Epoch 8/30
56/56 [==============================] - 0s 2ms/step - loss: 0.2987 - accuracy: 0.9506 - val_loss: 0.3427 - val_accuracy: 0.9191
Epoch 9/30
56/56 [==============================] - 0s 2ms/step - loss: 0.2312 - accuracy: 0.9607 - val_loss: 0.3029 - val_accuracy: 0.9191
Epoch 10/30
56/56 [==============================] - 0s 2ms/step - loss: 0.1832 - accuracy: 0.9697 - val_loss: 0.2583 - val_accuracy: 0.9213
Epoch 11/30
56/56 [==============================] - 0s 2ms/step - loss: 0.1498 - accuracy: 0.9770 - val_loss: 0.2386 - val_accuracy: 0.9303
Epoch 12/30
56/56 [==============================] - 0s 3ms/step - loss: 0.1254 - accuracy: 0.9787 - val_loss: 0.2239 - val_accuracy: 0.9303
Epoch 13/30
56/56 [==============================] - 0s 2ms/step - loss: 0.1065 - accuracy: 0.9831 - val_loss: 0.2113 - val_accuracy: 0.9303
Epoch 14/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0898 - accuracy: 0.9854 - val_loss: 0.2038 - val_accuracy: 0.9326
Epoch 15/30
56/56 [==============================] - 0s 3ms/step - loss: 0.0768 - accuracy: 0.9882 - val_loss: 0.1919 - val_accuracy: 0.9348
Epoch 16/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0655 - accuracy: 0.9916 - val_loss: 0.1847 - val_accuracy: 0.9371
Epoch 17/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0563 - accuracy: 0.9949 - val_loss: 0.1812 - val_accuracy: 0.9326
Epoch 18/30
56/56 [==============================] - 0s 3ms/step - loss: 0.0487 - accuracy: 0.9978 - val_loss: 0.1799 - val_accuracy: 0.9348
Epoch 19/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0422 - accuracy: 0.9966 - val_loss: 0.1766 - val_accuracy: 0.9371
Epoch 20/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0364 - accuracy: 0.9983 - val_loss: 0.1737 - val_accuracy: 0.9348
Epoch 21/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0320 - accuracy: 0.9983 - val_loss: 0.1754 - val_accuracy: 0.9348
Epoch 22/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0281 - accuracy: 0.9989 - val_loss: 0.1697 - val_accuracy: 0.9348
Epoch 23/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0244 - accuracy: 0.9989 - val_loss: 0.1715 - val_accuracy: 0.9348
Epoch 24/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0215 - accuracy: 0.9989 - val_loss: 0.1683 - val_accuracy: 0.9326
Epoch 25/30
56/56 [==============================] - 0s 3ms/step - loss: 0.0190 - accuracy: 0.9989 - val_loss: 0.1686 - val_accuracy: 0.9393
Epoch 26/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0168 - accuracy: 1.0000 - val_loss: 0.1696 - val_accuracy: 0.9371
Epoch 27/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0150 - accuracy: 1.0000 - val_loss: 0.1729 - val_accuracy: 0.9348
Epoch 28/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0135 - accuracy: 1.0000 - val_loss: 0.1683 - val_accuracy: 0.9326
Epoch 29/30
56/56 [==============================] - 0s 3ms/step - loss: 0.0120 - accuracy: 1.0000 - val_loss: 0.1703 - val_accuracy: 0.9371
Epoch 30/30
56/56 [==============================] - 0s 2ms/step - loss: 0.0109 - accuracy: 1.0000 - val_loss: 0.1684 - val_accuracy: 0.9393
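
The loss values in the log are easier to read with the definition in mind: for a single example, sparse categorical cross-entropy is the negative log of the probability the model assigns to the true class. The sketch below is a minimal pure-Python version for one example (`sparse_ce` is a hypothetical name, and the probability vectors are invented).

```python
import math

def sparse_ce(probs, label):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(probs[label])

uniform = [0.2] * 5  # an untrained model: roughly equal probability per class
print(round(sparse_ce(uniform, 3), 4))    # 1.6094

confident = [0.01, 0.01, 0.01, 0.96, 0.01]
print(round(sparse_ce(confident, 3), 4))  # 0.0408
```

With five classes, a model that guesses uniformly scores about 1.609, which is close to the 1.5997 logged for the first epoch.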

Once training has finished, you can run the following cell to plot the training and validation accuracy and loss achieved at the end of each epoch.

In [63]:
def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history[f'val_{metric}'])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, f'val_{metric}'])
    plt.show()
    
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
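
Besides the plots, the overfitting mentioned in the title can be quantified from the logged numbers. The sketch below hard-codes values copied from the training log so that it runs on its own; in a live session `history.history["val_loss"]` would supply the same list.

```python
# Validation loss per epoch, copied from the training log above.
val_loss = [1.5836, 1.4858, 1.2466, 0.9566, 0.7174, 0.5429, 0.4212, 0.3427,
            0.3029, 0.2583, 0.2386, 0.2239, 0.2113, 0.2038, 0.1919, 0.1847,
            0.1812, 0.1799, 0.1766, 0.1737, 0.1754, 0.1697, 0.1715, 0.1683,
            0.1686, 0.1696, 0.1729, 0.1683, 0.1703, 0.1684]
# Final-epoch training loss, training accuracy and validation accuracy from the log.
final_train_loss, final_train_acc, final_val_acc = 0.0109, 1.0000, 0.9393

best_epoch = val_loss.index(min(val_loss)) + 1  # epochs are 1-based in the log
loss_gap = val_loss[-1] - final_train_loss
acc_gap = final_train_acc - final_val_acc

print(best_epoch)          # 24
print(round(loss_gap, 4))  # 0.1575
print(round(acc_gap, 4))   # 0.0607
```

Validation loss stops improving around epoch 24 while training loss keeps falling, and training accuracy reaches 1.0 with validation accuracy near 0.94. That widening gap is the usual signature of overfitting.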