AI for Developers: Using RNNs
Training a simple RNN text generator to demonstrate the concept of text generation.
This post is a summary of this Google tutorial.
NOTE: This post is intended for developers. If you are an aspiring data scientist or AI researcher, it will not dig deep enough for you.
The RNN will learn to generate text character-by-character.
Overview
RNN - Recurrent Neural Networks are networks designed specifically to handle sequential data, like a time series or language. A regular neural network processes the input data in a single pass, while an RNN processes the input sequentially, repeatedly applying the same set of weights at each step in the sequence, hence "recurrent".
The key feature of RNNs is that they have a hidden state that stores information about what has been calculated so far. This hidden state is updated at each step in the sequence, so the processing of each new element is affected by the elements that came before it.
Let's say our input is "I love machine learning". A regular neural network receives the entire sentence at once as one fixed-size input, with no mechanism for carrying information from one word to the next. The RNN processes the sentence one word at a time: at each step the current word is fed into the RNN, the hidden state (memory) is updated based on the current word and the previous hidden state, and the updated hidden state captures information about the sequence seen so far.
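The hidden-state update described above can be sketched with a toy, single-number RNN. This is only an illustration of the recurrence idea, not how the GRU layer works internally; the function names and the weight values (w_x, w_h, b) are made up for the example.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    # The new hidden state mixes the current input with the previous state.
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    # The same weights are reused at every step, hence "recurrent".
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# A single non-zero input at step 0 keeps influencing later states.
with_signal = run_rnn([1.0, 0.0, 0.0])
all_zero = run_rnn([0.0, 0.0, 0.0])
print(with_signal)
print(all_zero)
```

In the first run the hidden state stays non-zero after the inputs go back to zero, which is the "memory" effect: the state at each step still carries a trace of what was seen earlier.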
In this post we will NOT go into the RNN layer itself. We will use tf.keras.layers.GRU (Gated Recurrent Unit), which is a type of RNN layer provided by TF. For an input like "I love machine learning", the step-by-step processing is handled by the GRU layer; in this post each step is a single character rather than a word.
Setup
The code uses TensorFlow 2.15.1 syntax; newer versions will give errors.
!pip uninstall tensorflow -y
!pip uninstall tf-keras -y
!pip install tensorflow==2.15.1
!pip install keras
Load training data
The training dataset is Shakespeare's writing from Andrej Karpathy's The Unreasonable Effectiveness of Recurrent Neural Networks.
Load the entire text into shakespeare_text and create a vocabulary. Since the model will generate text character-by-character, the vocabulary is essentially the set of all the characters in the text.
import tensorflow as tf
import numpy as np
import os
path_to_file = tf.keras.utils.get_file(
    "shakespeare.txt",
    "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt",
)
with open(path_to_file, "rb") as f:
    shakespeare_text = f.read().decode(encoding="utf-8")
vocab = sorted(set(shakespeare_text))
Vectorize the text
The model cannot work with characters directly; it needs numbers. We will use the tf.keras.layers.StringLookup layer, which maps strings to integer indices (we will call each index an id).
Important! StringLookup adds an "Unknown" [UNK] token (with id = 0 here, since no mask token is used) to represent tokens that are not in the vocabulary (OOV - out of vocabulary).
We will also define a reverse lookup layer that converts ids back to characters, since the model's output will be character ids that we need to convert back to characters.
Finally, we will define a function that takes an array of character ids and converts it back to a string.
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None
)
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None
)
def text_from_ids(ids):
    return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
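To make the lookup idea concrete, here is a rough plain-Python sketch of what the two lookup layers do conceptually. It is not how StringLookup is implemented; the names id_from_char, char_from_id, encode and decode are hypothetical and exist only for this illustration.

```python
text = "hello world"
vocab = sorted(set(text))

UNK = "[UNK]"
# Reserve id 0 for unknown characters; real characters start at id 1.
id_from_char = {ch: i + 1 for i, ch in enumerate(vocab)}
char_from_id = [UNK] + vocab

def encode(s):
    # Characters that are not in the vocabulary map to id 0.
    return [id_from_char.get(ch, 0) for ch in s]

def decode(ids):
    # Reverse lookup: ids back to characters, joined into one string.
    return "".join(char_from_id[i] for i in ids)

print(encode("hello"))
print(decode(encode("hello")))
print(encode("z"))
```

The character "z" never appears in the sample text, so it is encoded as id 0, the slot reserved for [UNK].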
Training
An RNN maintains an internal state that depends on the elements it has already seen, so the question we ask it is: given all the characters processed so far, what is the next character?
This is what we want the model to predict. We train the model on pairs of texts: the "input" is a chunk of text without its last character, and the "target" is the same chunk without its first character, i.e. the input shifted one character forward.
Another parameter is the length of the texts. Say we decide to work with a text length of 4: we break the entire Shakespeare text into chunks of 5 characters, and by dropping the last character for the input and the first character for the target we get a training pair of 4 characters each.
Example: The string "Hello" will be split into the input "Hell" and the target "ello".
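The chunking and splitting can be sketched on a plain Python string before doing it with tf.data. The helper name make_pairs is hypothetical and used only here.

```python
def make_pairs(text, seq_length):
    # Each chunk holds seq_length + 1 characters so that the input and
    # the target can both be seq_length characters long.
    chunk = seq_length + 1
    pairs = []
    # Step through the text chunk by chunk; an incomplete tail is dropped.
    for start in range(0, len(text) - chunk + 1, chunk):
        piece = text[start:start + chunk]
        pairs.append((piece[:-1], piece[1:]))  # (input, target)
    return pairs

pairs = make_pairs("Hello world!", 4)
for input_text, target_text in pairs:
    print(repr(input_text), "->", repr(target_text))
```

With a text length of 4, "Hello world!" yields two pairs, and the leftover "d!" is discarded because it does not fill a complete chunk.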
shakespeare_text_ids = ids_from_chars(tf.strings.unicode_split(shakespeare_text, "UTF-8"))
ids_dataset = tf.data.Dataset.from_tensor_slices(shakespeare_text_ids)
# Our training sequence will be 100 characters
seq_length = 100
# Calculate how many complete chunks of (seq_length + 1) characters we have
examples_per_epoch = len(shakespeare_text) // (seq_length + 1)
sequences = ids_dataset.batch(seq_length + 1, drop_remainder=True)
# Create the (input, target) pairs for training
def split_input_target(sequence):
    input_text = sequence[:-1]  # All chars but the LAST one
    target_text = sequence[1:]  # All chars but the FIRST one
    return input_text, target_text
train_dataset = sequences.map(split_input_target)
Next we need to shuffle the training data and group it into batches:
# Batch size
BATCH_SIZE = 64
# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000
train_dataset = (
    train_dataset.shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
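The buffer-based shuffling mentioned in the comment above can be illustrated with a small plain-Python sketch. This is a simplified model of the idea under my own assumptions (buffer_size of at least 1, a fixed seed), not the actual tf.data implementation; buffered_shuffle is a hypothetical name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    # Assumes buffer_size >= 1.
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # First fill the buffer.
            buffer.append(item)
            continue
        # Emit a random element from the buffer and put the new item
        # in its place, so memory use stays at buffer_size elements.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The input is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

Because an element can only be emitted once it has entered the buffer, a small buffer shuffles only locally; a buffer of size 1 does not shuffle at all.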
Build the model
The model consists of 3 layers:
- Embedding - This layer maps each character id in the vocabulary to a vector, hence it gets the vocab_size and the desired embedding vector size.
- GRU - Gated Recurrent Unit, this is the type of RNN we are using; one can use an LSTM layer as well.
- Dense - This is the output layer. Its size should be vocab_size, as the model outputs one logit (an unnormalized score) for EACH character in the vocabulary.
So what kind of model is this? Is it an encoder-decoder, encoder-only, or decoder-only?
I am not 100% sure how to classify it. My guess is: we have only a single RNN layer, so it is not an encoder-decoder model. The single layer can be seen as a decoder, as it is used to generate text by "decoding" the embeddings and the layer's hidden states.
class MyModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(rnn_units,
                                       return_sequences=True,
                                       return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    # TF will call this method with training batches.
    # On inference we will call this method in a loop to predict the next char and will maintain the "states".
    # The states are the "memory" of the model about all the sequences it has already processed.
    def call(self, inputs, states=None, return_state=False, training=False):
        x = inputs
        x = self.embedding(x, training=training)
        # The RNN should get the state from the previous step
        if states is None:
            states = self.gru.get_initial_state(x)
        x, states = self.gru(x, initial_state=states, training=training)
        x = self.dense(x, training=training)
        if return_state:
            return x, states
        else:
            return x

model = MyModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(ids_from_chars.get_vocabulary()),
    embedding_dim=256,
    rnn_units=1024,
)
Train the model
# Directory where the checkpoints will be saved
checkpoint_dir = "./training_checkpoints"
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}.weights.h5")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix, save_weights_only=True
)
EPOCHS = 10
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss)
history = model.fit(train_dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
Generating Text
To generate text, we call the model with an initial string (e.g. "ROMEO:") and an initial state (all zeros).
The model's response is a tensor of shape [1, input_len, vocab_size]; for the string "ROMEO:" it is [1, 6, 66]. That is, for each character in our sequence the model produces a vector of logits (unnormalized scores that turn into probabilities after a softmax) across the entire vocabulary. Since all we care about is the prediction after the last character, we take only the logits vector for the last position ([0,-1,:]) and from it choose the "best" next character (see below).
The model also returns an updated state.
Then, in a loop, we keep calling the model, each time passing the last predicted character and the updated state, until we decide to stop.
Choosing the "best" character
As described above, the model returns a logits vector, which holds an unnormalized score FOR EACH character in our vocabulary (a softmax turns these scores into probabilities). We need to choose 1 character.
One way is to take the character with the highest probability. This is a good approach for a classification task, but for text generation it can lead to "boring" and repetitive text.
Another approach is to sample randomly from the probabilities, which gives better results for text generation. While the sampling is random, characters with higher logits have a higher chance of being selected.
We can also adjust the logits before sampling: the temperature parameter divides every logit, which rescales the resulting probabilities.
A value lower than 1 amplifies the differences between logits, skewing the selection towards the most probable characters and making it less random.
A value higher than 1 "equalizes" the probabilities, giving less probable characters a higher chance of being selected.
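The effect of the temperature can be seen with a small standard-library sketch. The generator code in this post relies on tf.random.categorical for the sampling; the softmax and sample helpers below, as well as the example logits, are hypothetical and only illustrate the idea.

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Divide the logits by the temperature, then normalize to probabilities.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=1.0, rng=random):
    # Randomly pick an index, weighted by the temperature-scaled probabilities.
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
cold = softmax(logits, temperature=0.5)     # sharper distribution
neutral = softmax(logits, temperature=1.0)
hot = softmax(logits, temperature=2.0)      # flatter distribution
print(cold)
print(neutral)
print(hot)
print(sample(logits, temperature=0.5))
```

With a temperature of 0.5 the top character takes a larger share of the probability than at 1.0, and with a temperature of 2.0 the three probabilities move closer together.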
The text generator code
import time
class OneStep(tf.keras.Model):
    def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
        super().__init__()
        self.temperature = temperature
        self.model = model
        self.chars_from_ids = chars_from_ids
        self.ids_from_chars = ids_from_chars
        # Create a mask to prevent "[UNK]" from being generated.
        skip_ids = self.ids_from_chars(['[UNK]']).numpy()
        mask = np.zeros(len(self.ids_from_chars.get_vocabulary()))
        mask[skip_ids] = -float('inf')
        # prediction_mask will have the shape of the model prediction (logit value per character in vocabulary)
        # the [UNK] character will have a value of -Inf and all the rest will have a value of 0
        # i.e [-inf, 0., 0., 0., 0., 0., ...]
        self.prediction_mask = tf.convert_to_tensor(mask, dtype=np.float32)

    @tf.function
    def generate_one_step(self, inputs, states=None):
        # Convert strings to token IDs.
        input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
        input_ids = self.ids_from_chars(input_chars).to_tensor()
        # Run the model.
        # predicted_logits.shape is [batch, char, next_char_logits]
        predicted_logits, states = self.model(inputs=input_ids, states=states,
                                              return_state=True)
        # Only use the last prediction.
        predicted_logits = predicted_logits[:, -1, :]
        predicted_logits = predicted_logits / self.temperature
        # Apply the prediction mask: prevent "[UNK]" from being generated.
        predicted_logits = predicted_logits + self.prediction_mask
        # Sample the output logits to generate token IDs.
        predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
        predicted_ids = tf.squeeze(predicted_ids, axis=-1)
        # Convert from token ids to characters
        predicted_chars = self.chars_from_ids(predicted_ids)
        # Return the characters and model state.
        return predicted_chars, states
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)
start = time.time()
# Initial state: all zeros, shape (batch_size, rnn_units) = (1, 1024)
states = tf.convert_to_tensor(np.zeros((1, 1024)), dtype=np.float32)
next_char = tf.constant(['ROMEO:'])  # Starting text
result = [next_char]
# Generate 500 characters
for n in range(500):
    next_char, states = one_step_model.generate_one_step(next_char, states=states)
    result.append(next_char)
result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)
Here is a sample output:
ROMEO:
Where shall we do't; for this is morrow to
Foul devouring them when he shall find ourselves,
Your father shubless of friends, for marriage
Would not be lemamed your ill in less
Than more tastesing for wards are but a quarrel wit.
VALERIA:
O excreet thou the rade friar Lode?
Surrers:
Fie, fieed them, to be rid of Clifford's dagging.
RATCLIFF:
Either I have break our pawned bones,
And with the other bett breaks this body,
And terred her husbaster.
ALONSO:
Ert too things so redueded?
Note that while the sentences mean nothing, the model did generate many correctly spelled words. Quite impressive for a super simple model trained for only 10 epochs.
Author: Yuval