import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
# Dummy example: 1 sample, 3 timesteps, 2 features each
X = np.array([[[23.13, 21.12], [21.19, 24.02], [23.99, 23.98]]])
y = np.array([[24.10]]) # target: next close price
model = Sequential()
model.add(SimpleRNN(32, input_shape=(3, 2)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=20)
Memory in RNNs
- First timestep: uses a zero vector as the previous hidden state \(h_{t-1}\)
- Each timestep passes forward its hidden state
- Allows network to “remember” past inputs
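The recurrence described above can be sketched in plain Python. This is only an illustration, not how Keras implements the layer: the `rnn_step` helper, the weight values, and the tanh activation are assumptions chosen for the example.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One timestep: h_t = tanh(x_t . W_x + h_prev . W_h + b)."""
    units = len(b)
    h_t = []
    for j in range(units):
        total = b[j]
        total += sum(x_t[i] * W_x[i][j] for i in range(len(x_t)))
        total += sum(h_prev[k] * W_h[k][j] for k in range(units))
        h_t.append(math.tanh(total))
    return h_t

# Hypothetical weights: 2 input features, 2 hidden units
W_x = [[0.1, -0.2], [0.3, 0.4]]   # shape (features, units)
W_h = [[0.5, 0.0], [0.0, 0.5]]    # shape (units, units)
b = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 timesteps
h = [0.0, 0.0]  # the first timestep starts from a zero hidden state
states = []
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)  # each step receives the previous state
    states.append(h)
```

Each entry in `states` depends on all earlier inputs through `h`, which is the sense in which the network "remembers".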
Weight Structure
Instead of just one matrix:
- You get two: one for the input, one for the recurrent memory
- Internally: often concatenated into a single matrix
- In frameworks like Keras, these are stored separately as:
  - kernel → input weights
  - recurrent_kernel → recurrent weights
  - bias
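As a rough check on the weight structure, the sketch below lists the shapes of the three arrays and counts their entries for the dummy model above (2 input features, 32 units). The helper names are hypothetical; only the shape conventions are taken from Keras.

```python
def simple_rnn_weight_shapes(input_dim, units):
    """Shapes of the three weight arrays a SimpleRNN layer holds."""
    return {
        "kernel": (input_dim, units),        # input weights
        "recurrent_kernel": (units, units),  # recurrent weights
        "bias": (units,),
    }

def count_params(shapes):
    """Total number of scalar weights across all arrays."""
    total = 0
    for shape in shapes.values():
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total

shapes = simple_rnn_weight_shapes(input_dim=2, units=32)
total_params = count_params(shapes)  # 2*32 + 32*32 + 32
```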
MAIN: QTW - 7333
Module 13: Recurrent Neural Networks (RNNs)
Part 2: A Brief NLP Introduction
Why RNNs Are Natural for NLP
Language is sequential. Word order matters. RNNs process data one
step at a time, carrying forward learned context. This makes them a
perfect match for language-based tasks.
Three Core Problems in NLP:
1. Word-to-Number Conversion
- One-hot encoding = sparse, huge, mostly zeros
- Example: 10,000-word vocab → 1 word = [0, 0, 1, 0, …, 0]
- Wasteful. Not efficient for learning.
2. Variable-Length Sentences
- Batches must be rectangular, so every sequence in a batch needs the same length
- Solution:
  - Padding: add dummy “PAD” tokens
  - Truncating: cut long sentences
  - Choose length based on a quantile of data (e.g. 95%)
3. Semantic Similarity
- How do we measure if two sentences are alike?
- Need dense representations of words, not sparse ones
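To make the first problem concrete, here is a small sketch of one-hot encoding for a 10,000-word vocabulary. The `one_hot` helper is hypothetical and only meant to show how sparse the representation is.

```python
def one_hot(index, vocab_size):
    """Vector of zeros with a single 1 at the word's index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

vocab_size = 10_000
vec = one_hot(2, vocab_size)          # the word with ID 2
nonzero = sum(vec)                    # how many entries carry information
sparsity = 1 - nonzero / vocab_size   # fraction of entries that are zero
```

Every word costs 10,000 numbers, and any two different words are equally far apart, so the encoding says nothing about similarity.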
Word Vectors / Embeddings
Dense vectors trained to capture meaning.
You can:
- Let the model learn them
- Or use pretrained vectors (GloVe, Word2Vec, FastText)
Words like “king” and “queen” will have similar vectors.
They’re stored in an embedding matrix, where each row is a word.
Example (100-dim vector for “the”)
the → [0.12, -0.03, …, 0.47]
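The "each row is a word" idea can be sketched as a plain lookup table. The vocabulary, the 4-dimensional size, and the random values below are all placeholders standing in for learned or pretrained weights.

```python
import random

random.seed(0)
vocab = ["<PAD>", "the", "king", "queen"]  # hypothetical tiny vocabulary
embedding_dim = 4                           # real models often use 100-300

# One row per word; random values stand in for learned weights
embedding_matrix = [
    [random.uniform(-0.5, 0.5) for _ in range(embedding_dim)]
    for _ in vocab
]
word_to_index = {word: i for i, word in enumerate(vocab)}

def embed(word):
    """Look up the dense vector (matrix row) for a word."""
    return embedding_matrix[word_to_index[word]]

the_vector = embed("the")
```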
Vector Similarity: Cosine
We don’t compare vectors by their length, but by their direction:
Cosine similarity
\(\text{sim}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}\)
- Ranges: -1 to 1
- 1 = same direction → high similarity
- 0 = orthogonal → unrelated
- Used to find similar words, detect synonyms
Cosine distance = 1 - similarity
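A minimal sketch of the formula above in plain Python. The example vectors are made up to show the three reference cases (same direction, orthogonal, opposite).

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1 - cosine_similarity(a, b)

a = [1.0, 2.0]
same_direction = cosine_similarity(a, [2.0, 4.0])       # twice as long, same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
opposite = cosine_similarity(a, [-1.0, -2.0])           # reversed direction
```

Doubling a vector's length leaves the similarity essentially unchanged, which is why only direction matters.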
Python Snippet – Tokenizing, Padding, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
# Build embedding model
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=100, input_length=10))
Summary of Padding Strategy
- Choose a max sentence length (based on histogram or quantile)
- Short: add zeros (PAD tokens)
- Long: cut off back or front
- Better to lose a few long ones than break most inputs
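The padding strategy can be sketched without any library. Both helpers below are hypothetical; the `padding` and `truncating` options mimic the "front or back" choice, and the toy sequences use a 75% quantile only because there are four of them.

```python
import math

def choose_max_len(lengths, quantile=0.95):
    """Smallest length that covers at least `quantile` of the samples."""
    ordered = sorted(lengths)
    rank = math.ceil(quantile * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

def pad_or_truncate(seq, max_len, pad_value=0, padding="post", truncating="post"):
    if len(seq) > max_len:
        # "post" drops tokens from the end, "pre" drops them from the front
        seq = seq[:max_len] if truncating == "post" else seq[-max_len:]
    pad = [pad_value] * (max_len - len(seq))
    return seq + pad if padding == "post" else pad + seq

sequences = [[5, 8], [3, 9, 4, 7], [6, 2, 1], [4, 4, 4, 4, 4, 4, 4, 4, 4]]  # toy token IDs
max_len = choose_max_len([len(s) for s in sequences], quantile=0.75)
batch = [pad_or_truncate(s, max_len) for s in sequences]
```

Only the single long sequence loses tokens; the other three are kept whole and padded to the same length.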
Key Idea: Similar Words = Similar Vectors
- Vectors live in high-dimensional space (100–300D)
- Similarity is not based on matching values
- Only direction matters (cosine)
Part 3: Building RNNs for NLP
Step-by-Step Process
1. Convert Text to Integers
- Assign each word a unique integer (token ID)
- Frequent words usually get lower numbers
- Unknown words get a reserved token (e.g. <UNK> or 2)
Example:
“The movie was the best I have seen”
→ [3, 7, 15, 3, 9, 11, 19, 24] (illustrative IDs; both occurrences of “the” map to the same ID)
2. Handle Variable Length
- Pad shorter sentences (e.g., with 0)
- Truncate longer ones
- Padding value 0 is default but can be changed
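Step 1 can be sketched as a frequency-ranked vocabulary. The reserved IDs (0 = PAD, 1 = START, 2 = UNK), the helper names, and the two-sentence corpus are assumptions for illustration, not a fixed standard.

```python
from collections import Counter

PAD, START, UNK = 0, 1, 2  # reserved IDs (an assumed convention)

def build_vocab(texts, first_id=3):
    """Assign IDs by descending frequency, so common words get low numbers."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by count (descending), then alphabetically for a stable order
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: first_id + i for i, word in enumerate(ranked)}

def encode(text, vocab):
    """Map each word to its ID; unseen words fall back to UNK."""
    return [vocab.get(word, UNK) for word in text.lower().split()]

corpus = ["The movie was the best I have seen", "the movie was bad"]
vocab = build_vocab(corpus)
ids = encode("The movie was the worst", vocab)
```

Here "the" is the most frequent word and gets the lowest non-reserved ID, while "worst" was never seen and is encoded as UNK.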
3-Layer Architecture
Layer 1: Embedding Layer (lookup)
- Turns integers into dense vectors
- Example: word index 9 → embedding row 9
- Learns word meanings during training
- Must be the first layer if you’re using it
Layer 2: Recurrent Layer (SimpleRNN, LSTM, GRU)
- Processes sequences step-by-step
- Remembers past via hidden state
Layer 3: Dense Output Layer
- Uses final timestep output to predict target
- Example: Sentiment classification (0 = negative, 1 = positive)
Python Code – Simple Binary Classifier with RNN
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
vocab_size = 5000
embedding_dim = 100
input_length = 10 # length after padding
# Dummy stand-in for two tokenized, padded sentences (random token IDs)
padded = np.random.randint(1, vocab_size, size=(2, input_length))
labels = np.array([1, 0]) # dummy targets for demo
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=input_length))
model.add(SimpleRNN(256))
model.add(Dense(1, activation='sigmoid')) # binary classification
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(padded, labels, epochs=5)
Output Flow
For input [2, 4, 1, 19]:
- Embedding looks up vectors for each index
- Feeds them one at a time to the RNN layer
- RNN builds temporal context
- Final output goes to Dense layer for prediction
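The flow above can be traced end to end in plain Python. This is a toy sketch under stated assumptions: tiny sizes, random untrained weights, tanh in the recurrent step, and biases left out for brevity.

```python
import math
import random

random.seed(1)
vocab_size, embedding_dim, units = 20, 3, 2  # toy sizes, not the 5000/100/256 above

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embedding = rand_matrix(vocab_size, embedding_dim)  # one row per token ID
kernel = rand_matrix(embedding_dim, units)          # input weights
recurrent_kernel = rand_matrix(units, units)        # recurrent weights
dense_w = [random.uniform(-0.5, 0.5) for _ in range(units)]

def forward(token_ids):
    h = [0.0] * units                               # zero initial state
    for token in token_ids:
        x = embedding[token]                        # embedding lookup
        h = [
            math.tanh(
                sum(x[i] * kernel[i][j] for i in range(embedding_dim))
                + sum(h[k] * recurrent_kernel[k][j] for k in range(units))
            )
            for j in range(units)
        ]
    logit = sum(h[j] * dense_w[j] for j in range(units))
    return 1 / (1 + math.exp(-logit))               # sigmoid on the final state

probability = forward([2, 4, 1, 19])
```

Only the hidden state after the last token reaches the dense layer, which matches using the final timestep output for prediction.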
RNN Layer Types
| Layer | Memory | Speed | Parameters | Notes |
|---|---|---|---|---|
| RNN | Low (~5 steps) | Fast | Few | Can forget early input (vanishing gradient) |
| LSTM | High | Slower | 4x more | Controls memory with gates |
| GRU | Medium-High | Faster than LSTM | 3x more | Efficient and accurate |
Architectural Limitations
- RNNs struggle with long-range dependencies
- Vanishing gradients reduce ability to retain earlier inputs
- Simple RNNs tend to retain only the last few steps (roughly 5, as a rule of thumb)
- LSTMs/GRUs mitigate this using memory gates
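The vanishing-gradient effect can be illustrated with a one-unit RNN. The sketch below assumes the scalar recurrence h_t = tanh(w * h_{t-1} + x_t) and made-up values for `w` and the inputs; it tracks how strongly the latest state still depends on the initial state.

```python
import math

def gradient_wrt_first_state(inputs, w):
    """For h_t = tanh(w * h_{t-1} + x_t), return d h_t / d h_0 after each
    step, accumulated with the chain rule."""
    h, grad, history = 0.0, 1.0, []
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1 - h * h)  # local derivative d h_t / d h_{t-1}
        history.append(grad)
    return history

grads = gradient_wrt_first_state(inputs=[0.5] * 20, w=0.9)  # hypothetical values
```

Each step multiplies the gradient by a factor below 1, so the influence of early inputs shrinks roughly exponentially with distance.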
Summary
- NLP + RNN = encode text → padded sequence → embed → RNN → dense output
- Embedding layer learns meaning over time
- RNN layer builds sequence memory
- Padding lets all samples be fed in same batch
- Short-term memory is a challenge, mitigated by LSTM/GRU variants
- RNNs are powerful but can become very complex and slow to train
Part 4: Data Preparation
Objective
Prepare real-world text data (IMDB dataset) for use in a neural network. Focus is on:
- Loading pre-tokenized data
- Understanding word-index mapping
- Padding/truncation
- Vocabulary sizing
- Visualization of data length distribution
Step 1: Load IMDB Dataset and Define Vocabulary Size
from tensorflow.keras.datasets import imdb
vocab_size = 5000 # cap vocab for speed
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
- Dataset is pre-tokenized (words already mapped to integers)
- Only the 5,000 most frequent words are kept; rarer words are replaced by the UNK token (2)
Step 2: Inspect Word-Index Mapping
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}
- word_index maps words to integers
- reverse_word_index maps integers back to words
Special tokens:
- 0 = PAD
- 1 = START
- 2 = UNK
- 3 = UNUSED (reserved but never assigned)
Step 3: Decode Sequences for Human Readability
def decode_review(sequence):
    # Dataset IDs are offset by 3; reserved IDs 0-3 have no word and decode to '?'
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in sequence])
print(decode_review(x_train[0]))
- Shifts all indices by 3 because of the reserved special tokens
- Reconstructs original sentence from integer sequence
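The index shift is easier to see on a toy mapping. Below, `word_index` is a three-word stand-in for the real dictionary, and the `SPECIAL` labels and `decode_with_specials` helper are hypothetical additions that print reserved tokens by name instead of '?'.

```python
# Toy stand-in for imdb.get_word_index(): word ranks start at 1
word_index = {"the": 1, "and": 2, "movie": 17}
reverse_word_index = {rank: word for word, rank in word_index.items()}

SPECIAL = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}  # labels chosen for display
OFFSET = 3                                         # dataset ID = word rank + 3

def decode_with_specials(sequence):
    words = []
    for token in sequence:
        if token in SPECIAL:
            words.append(SPECIAL[token])
        else:
            # Undo the offset before looking the word up
            words.append(reverse_word_index.get(token - OFFSET, "?"))
    return " ".join(words)

decoded = decode_with_specials([1, 4, 20, 2, 0, 0])
```

Token 4 decodes to rank 1 ("the") and token 20 to rank 17 ("movie"), which is exactly the shift by 3.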
Step 4: Analyze Sequence Lengths
import matplotlib.pyplot as plt
review_lengths = [len(review) for review in x_train]
plt.hist(review_lengths, bins=100)
plt.title("IMDB Review Length Distribution")
plt.xlabel("Review Length")
plt.ylabel("Frequency")
plt.show()
- Longest review: ~2500 tokens
- Most reviews: < 1000 tokens
- 90% of reviews: < 467
- 95% of reviews: < 600
- 99% of reviews: < 1000
- Common cutoff: 500 tokens (captures ~92% of data)
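A quick way to check what a cutoff "captures" is to count the reviews that fit without truncation. The `coverage` helper and the ten lengths below are made up for illustration; with the real data you would pass the lengths of `x_train`.

```python
def coverage(lengths, cutoff):
    """Fraction of sequences that fit within `cutoff` tokens without truncation."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

# Hypothetical lengths; with the real data use [len(r) for r in x_train]
review_lengths = [120, 180, 240, 310, 450, 520, 760, 95, 230, 1400]
covered = coverage(review_lengths, cutoff=500)
```

Trying a few cutoffs this way shows the trade-off between keeping whole reviews and padding every sample to a long length.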
Step 5: Pad and Truncate to Fixed Length
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_len = 500
x_train_padded = pad_sequences(x_train, maxlen=max_len, padding='post', truncating='post')
x_test_padded = pad_sequences(x_test, maxlen=max_len, padding='post', truncating='post')
- padding='post': pads at end
- truncating='post': truncates from end
- All sequences now exactly 500 long
- Pads with 0 (PAD token)
Inspect Post-Padding Example
print(x_train_padded[0]) # e.g., [1, 14, 22, ..., 0, 0, 0]
print(decode_review(x_train_padded[0]))
- Still interpretable (padding zeros decode to '?')
- Start token: 1
- Actual words follow
- Remaining space filled with zeros
Summary
- IMDB dataset is already tokenized into integers
- Word-index and reverse-index let us decode
- Vocabulary capped at top 5,000 most frequent words
- Padding and truncating normalizes input size
- Final shape: [num_samples, 500]
- Data is now ready to be input into a Keras RNN model
Part 5: Building Your Network
Overview
Use the padded IMDB dataset to build a working RNN model for
sentiment analysis. The model has 3 layers:
- Embedding Layer
- SimpleRNN Layer
- Dense Output Layer
Step 1: Imports and Model Setup
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
Step 2: Hyperparameters
vocab_size = 5000 # Top 5,000 words only
embedding_dim = 100 # Size of each word vector
input_length = 500 # All reviews are padded to length 500
Step 3: Load and Prepare IMDB Data
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=input_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=input_length, padding='post', truncating='post')
Step 4: Define Model Architecture
model = Sequential()
# Layer 1: Embedding
model.add(Embedding(input_dim=vocab_size,
                    output_dim=embedding_dim,
                    input_length=input_length))
# Layer 2: Simple RNN
model.add(SimpleRNN(256)) # outputs 256 features
# Layer 3: Dense output
model.add(Dense(1, activation='sigmoid')) # binary classifier
Step 5: Compile the Model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
Step 6: Train the Model
history = model.fit(x_train, y_train,
                    epochs=3,
                    batch_size=64,
                    validation_data=(x_test, y_test))
Explanation of Shapes and Parameters
- Embedding Layer
- Input: (batch_size, 500)
- Output: (batch_size, 500, 100)
- Params = 5000 * 100 = 500,000
- No bias term
- RNN Layer
- Input: (batch_size, 500, 100)
- Output: (batch_size, 256)
- Params = (100 + 256) * 256 + 256 = 91,392
- Includes kernel weights, recurrent weights, and bias
- Dense Layer
- Input: 256 → Output: 1
- Activation: sigmoid
- Output is a probability between 0 and 1; thresholding at 0.5 gives the label: 0 (negative review), 1 (positive review)
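The parameter counts above can be reproduced with a few lines of arithmetic. The helper functions are hypothetical; the formulas are the ones given in the list, plus the dense layer's 256 weights and 1 bias.

```python
def embedding_params(vocab_size, embedding_dim):
    return vocab_size * embedding_dim           # lookup table, no bias

def simple_rnn_params(input_dim, units):
    return (input_dim + units) * units + units  # kernel + recurrent_kernel + bias

def dense_params(input_dim, outputs):
    return input_dim * outputs + outputs        # weights + bias

counts = {
    "embedding": embedding_params(5000, 100),
    "simple_rnn": simple_rnn_params(100, 256),
    "dense": dense_params(256, 1),
}
total = sum(counts.values())
```

These should match the per-layer numbers reported by `model.summary()`; the embedding table accounts for most of the total.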
Summary
- Model: Embedding → SimpleRNN → Dense(sigmoid)
- Embedding learns representations
- RNN learns sequence structure
- Dense layer produces prediction
- Gives a simple baseline for the IMDB sentiment classification task
Part 6: Advanced RNN Layers
Starting Point
We begin with a working model that used:
- Embedding layer: output_dim=100
- SimpleRNN layer: units=256
- Dense output: sigmoid
Now we explore two advanced memory layers that replace SimpleRNN:
- LSTM: Long Short-Term Memory
- GRU: Gated Recurrent Unit
Step 1: Import Advanced Layers
from tensorflow.keras.layers import LSTM, GRU
Step 2: Swap RNN Layer
Replace with LSTM
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=100, input_length=500))
model.add(LSTM(256)) # replaces SimpleRNN
model.add(Dense(1, activation='sigmoid'))
Or Replace with GRU
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=100, input_length=500))
model.add(GRU(256)) # replaces SimpleRNN
model.add(Dense(1, activation='sigmoid'))
- Everything else stays the same
- Compile and fit as before
Key Differences
| Layer | Structure | Parameters | Memory | Speed | Notes |
|---|---|---|---|---|---|
| RNN | Basic | Base | Low | Fast | No gates |
| LSTM | 4 sub-layers (input, forget, output, cell) | 4× | High | Slow | Excellent for long dependencies |
| GRU | 3 sub-layers (update, reset, new) | 3× | Medium | Faster than LSTM | Lighter alternative |
- LSTM = 4× SimpleRNN parameter count
- GRU ≈ 3× SimpleRNN parameter count (Keras adds a few extra bias terms by default)
- Embedding layer unchanged
- Output dimensions stay 256 → Dense(1)
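The 4× and 3× multipliers can be checked with the same arithmetic used for SimpleRNN. The `recurrent_params` helper is hypothetical. The last line models the Keras GRU default (`reset_after=True`), which keeps separate input and recurrent biases and so lands slightly above 3×.

```python
def recurrent_params(input_dim, units, gates=1, biases_per_gate=1):
    """Each gate (sub-layer) has its own kernel, recurrent kernel, and bias."""
    per_gate = (input_dim + units) * units + biases_per_gate * units
    return gates * per_gate

simple = recurrent_params(100, 256)                # SimpleRNN baseline
lstm = recurrent_params(100, 256, gates=4)         # 4 x SimpleRNN
gru_classic = recurrent_params(100, 256, gates=3)  # 3 x SimpleRNN
gru_keras_default = recurrent_params(100, 256, gates=3, biases_per_gate=2)
```

With 100-dimensional inputs and 256 units this gives 91,392 for SimpleRNN, 365,568 for LSTM, and 274,176 or 274,944 for GRU depending on the bias layout.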
Summary
- To use LSTM or GRU, just swap layer call
- LSTM has more memory, more params, slower
- GRU is leaner, nearly same performance
- Both typically outperform SimpleRNN on long sequences
- Final output behavior is the same (e.g., sentiment = 0/1)
There’s no single “best” choice—performance is dataset-dependent.
