---
title: "Module 13 - Recurrent Neural Networks"
author: "Jessica McPhaul"
output: html_notebook
---


MAIN: QTW - 7333  
Module 13: Recurrent Neural Networks (RNNs)

---

**Part 1: Introduction to Recurrent Neural Networks**

### Core Concept
Traditional dense and convolutional neural networks process input all at once. But many problems (e.g., time series, language) require **sequential** handling—where the **order of input matters**. RNNs handle this by using a memory of past inputs and feeding outputs from previous steps back into the model.

### When to Use RNNs
- Time-series prediction  
- Natural Language Processing  
- Sequence labeling and classification  
- Event prediction over time  

### Key Difference from CNNs
- CNNs care about spatial proximity, not order  
- RNNs are temporal: they depend on what came **before**

### Visual Models
- **Unrolled RNN**: Looks like multiple dense layers, one per timestep. Each passes info to the next.  
- **Rolled RNN**: Same layer reused with **shared weights**, passing forward hidden states.

---

### Stock Prediction Example (Unrolled RNN)
You input stock prices one day at a time:

| Day | Open  | Close  |
|-----|-------|--------|
| 1   | 23.13 | 21.12  |
| 2   | 21.19 | 24.02  |
| 3   | 23.99 | 23.98  |

Each day goes through a dense layer, and the **output of that layer** is passed to the next day along with the new input.

This chaining builds temporal context.

---

### How It Works Mechanically

At each time step \( t \), you compute:

 **Hidden state**  
 \( h_t = \sigma(W_x x_t + W_h h_{t-1} + b) \)  
  where:
  - \( x_t \): input at time t  
  - \( h_{t-1} \): hidden state from previous timestep  
  - \( W_x \): weights for current input  
  - \( W_h \): weights for recurrent input  
  - \( \sigma \): activation (e.g., tanh)

 **Output**  
 \( y_t = W_y h_t + b_y \)

At timestep 0, \( h_0 \) is a zero vector.
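
The update rule above can be traced by hand with a few lines of NumPy. This is a minimal sketch, not how Keras implements it: the weight values are random placeholders, the layer width of 4 is an arbitrary choice, and the function name `rnn_forward` is made up for illustration. It assumes tanh as the activation and reads the output from the final hidden state.

```python
import numpy as np

def rnn_forward(X, W_x, W_h, b, W_y, b_y):
    """Run a vanilla RNN over X, which has shape (timesteps, features)."""
    h = np.zeros(W_h.shape[0])            # h_0 is a zero vector
    hidden_states = []
    for x_t in X:                         # one timestep at a time
        # h_t = tanh(W_x x_t + W_h h_{t-1} + b)
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        hidden_states.append(h)
    y = W_y @ h + b_y                     # y_t = W_y h_t + b_y at the last step
    return np.array(hidden_states), y

rng = np.random.default_rng(0)
units, features = 4, 2                    # hypothetical layer sizes
W_x = rng.normal(scale=0.1, size=(units, features))
W_h = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)
W_y = rng.normal(scale=0.1, size=(1, units))
b_y = np.zeros(1)

# The three days of (open, close) prices from the table above
X = np.array([[23.13, 21.12], [21.19, 24.02], [23.99, 23.98]])
H, y = rnn_forward(X, W_x, W_h, b, W_y, b_y)
print(H.shape, y.shape)
```

The same `W_x` and `W_h` are reused at every timestep, which is what "shared weights" means in the rolled view.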

---

### Python Representation (Vanilla RNN)

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
import numpy as np

# Dummy example: 1 sample, 3 timesteps, 2 features each
X = np.array([[[23.13, 21.12], [21.19, 24.02], [23.99, 23.98]]])
y = np.array([[24.10]])  # target: next close price

model = Sequential()
model.add(SimpleRNN(32, input_shape=(3, 2)))
model.add(Dense(1))

model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=20)
```

---

### Memory in RNNs
- First timestep: uses a **zero vector** as h<sub>t-1</sub> (there is no earlier hidden state yet)
- Each timestep passes forward its **hidden state**
- Allows network to "remember" past inputs

---

### Weight Structure
Instead of just one matrix:
- You get two: one for the input, one for the recurrent memory
- Conceptually, the two can be **concatenated** into a single matrix applied to the stacked vector [x<sub>t</sub>, h<sub>t-1</sub>]  
- In frameworks like Keras, these are stored separately as:
  - `kernel` → input weights  
  - `recurrent_kernel` → recurrent weights  
  - `bias`
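
To see why the two views are interchangeable, the sketch below checks numerically that separate input and recurrent matrices give the same pre-activation as one stacked matrix. The shapes follow the Keras layout (`kernel` is input_dim × units, `recurrent_kernel` is units × units); the sizes and random values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, units = 2, 3                              # hypothetical sizes

kernel = rng.normal(size=(input_dim, units))         # input weights
recurrent_kernel = rng.normal(size=(units, units))   # recurrent weights
bias = rng.normal(size=units)

x_t = rng.normal(size=input_dim)
h_prev = rng.normal(size=units)

# Two separate matrices
separate = x_t @ kernel + h_prev @ recurrent_kernel + bias

# One stacked matrix applied to the concatenated vector [x_t, h_prev]
W_concat = np.vstack([kernel, recurrent_kernel])     # (input_dim + units, units)
combined = np.concatenate([x_t, h_prev]) @ W_concat + bias

print(np.allclose(separate, combined))  # True
```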

---


**Part 2: A Brief NLP Introduction**

### Why RNNs Are Natural for NLP
Language is sequential. Word order matters. RNNs process data one step at a time, carrying forward learned context. This makes them a perfect match for language-based tasks.

---

### Three Core Problems in NLP:

**1. Word-to-Number Conversion**  
- One-hot encoding = sparse, huge, mostly zeros  
- Example: 10,000-word vocab → 1 word = [0, 0, 1, 0, ..., 0]  
- Wasteful. Not efficient for learning.
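
A quick sketch of how sparse a one-hot vector is for a 10,000-word vocabulary. The word ID of 2 is an arbitrary choice that matches the example vector above.

```python
import numpy as np

vocab_size = 10_000
word_id = 2                      # hypothetical ID for one word

one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0           # a single 1, everything else 0

nonzero = int(np.count_nonzero(one_hot))
print(one_hot[:4], one_hot.size, nonzero)
```

Every word costs 10,000 numbers, only one of which carries information, and any two different words are equally far apart.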

**2. Variable-Length Sentences**  
- Batched training needs every sequence in a batch to have the same length  
- Solution:  
  - **Padding**: add dummy “PAD” tokens  
  - **Truncating**: cut long sentences  
  - Choose length based on a quantile of data (e.g. 95%)

**3. Semantic Similarity**  
- How do we measure if two sentences are alike?  
- Need dense representations of words, not sparse ones

---

### Word Vectors / Embeddings

Dense vectors trained to **capture meaning**.  
You can:
- Let the model learn them
- Or use pretrained (GloVe, Word2Vec, FastText)

Words like "king" and "queen" will have similar vectors.  
They’re stored in an **embedding matrix**, where each row is a word.

Example (100-dim vector for "the")  
the → [0.12, -0.03, ..., 0.47]
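
An embedding lookup is just row indexing into the embedding matrix. The sketch below uses a random matrix as a stand-in for learned weights (the token IDs are hypothetical) and confirms that indexing gives the same result as multiplying one-hot vectors by the matrix, without building the sparse vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5000, 100

# Random stand-in for a trained embedding matrix: one row per word
embedding_matrix = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))

token_ids = np.array([2, 4, 1, 19])          # hypothetical token IDs
vectors = embedding_matrix[token_ids]        # row lookup, shape (4, 100)

# The expensive equivalent: one-hot rows times the matrix
one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
via_matmul = one_hot @ embedding_matrix

print(vectors.shape, np.allclose(vectors, via_matmul))
```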

---

### Vector Similarity: Cosine

We don’t compare vectors by their length, but **direction**:

**Cosine similarity**  
\( \text{sim}(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|} \)

- Ranges: -1 to 1  
- 1 = same direction → high similarity  
- 0 = orthogonal → unrelated  
- Negative values (down to -1) = opposite direction  
- Used to find similar words, detect synonyms

**Cosine distance = 1 - similarity**
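
The formula translates directly into NumPy. This is a small sketch with made-up 2-D vectors; real word vectors would have 100 or more dimensions, and the function names are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """sim(A, B) = (A . B) / (||A|| * ||B||)"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

same_direction = cosine_similarity([1, 2], [2, 4])    # length differs, direction does not
orthogonal = cosine_similarity([1, 0], [0, 3])
opposite = cosine_similarity([1, 2], [-1, -2])
print(same_direction, orthogonal, opposite)
```

Note that `[1, 2]` and `[2, 4]` score as identical even though one is twice as long, which is the point of comparing direction rather than length.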

---

### Python Snippet – Tokenizing, Padding, Embedding

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

# Sentences
texts = ["The cow jumped over the moon", "The dog ran under the sun"]

# Tokenize
tokenizer = Tokenizer(num_words=5000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad to same length
padded = pad_sequences(sequences, maxlen=10, padding='post', truncating='post')

# Build embedding model
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=100, input_length=10))
```

---

### Summary of Padding Strategy

- Choose a max sentence length (based on histogram or quantile)  
- Short: add zeros (PAD tokens)  
- Long: cut off back or front  
- Better to lose a few long ones than break most inputs

---

### Key Idea: Similar Words = Similar Vectors

- Vectors live in high-dimensional space (100–300D)  
- Similarity is not based on matching values  
- Only **direction** matters (cosine)

---



**Part 3: Building RNNs for NLP**

---

### Step-by-Step Process

**1. Convert Text to Integers**  
- Assign each word a unique integer (token ID)  
- Frequent words usually get lower numbers  
- Unknown words get a reserved token (e.g. `<UNK>` or 2)

Example (illustrative IDs; both occurrences of "the" map to the same ID after lowercasing):  
"The movie was the best I have seen"  
→ `[2, 7, 15, 2, 9, 11, 19, 23]`
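
A frequency-ranked word index can be sketched in plain Python. This is not the Keras `Tokenizer` itself, only an illustration of the idea: the helper names `build_word_index` and `encode` are made up, whitespace splitting stands in for real tokenization, and reserving 0 for padding and 1 for out-of-vocabulary words is a choice made for this sketch.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integers, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}                    # 0 is left free for padding
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, oov_token="<OOV>"):
    """Replace each word by its ID, falling back to the OOV ID."""
    return [word_index.get(w, word_index[oov_token]) for w in text.lower().split()]

texts = ["The movie was the best I have seen", "A weak movie with no soul"]
word_index = build_word_index(texts)

ids = encode("The movie was great", word_index)
print(ids)  # [2, 3, 4, 1]
```

"the" and "movie" each appear twice, so they receive the lowest IDs; "great" was never seen and falls back to the OOV ID.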

**2. Handle Variable Length**  
- Pad shorter sentences (e.g., with 0)  
- Truncate longer ones  
- Padding value `0` is default but can be changed

---

### Input Processing Recap

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["The movie was the best I have seen", "A weak movie with no soul"]
tokenizer = Tokenizer(num_words=5000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(sequences, maxlen=10, padding='post', truncating='post')
```

---

### 3-Layer Architecture

**Layer 1: Embedding Layer (lookup)**  
- Turns integers into dense vectors  
- Example: word index 9 → embedding row 9  
- Learns word meanings during training  
- Must be the first layer if you're using it

**Layer 2: Recurrent Layer (SimpleRNN, LSTM, GRU)**  
- Processes sequences step-by-step  
- Remembers past via hidden state

**Layer 3: Dense Output Layer**  
- Uses final timestep output to predict target  
- Example: Sentiment classification (0 = negative, 1 = positive)
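
The three layers can be traced end to end in NumPy. This is a sketch under stated assumptions: the sizes are tiny, all weights are random placeholders rather than trained values, and `predict` is a hypothetical helper, so the resulting probability is meaningless beyond showing the data flow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
vocab_size, embedding_dim, units = 50, 8, 16     # tiny hypothetical sizes

E = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))   # embedding matrix
W_x = rng.normal(scale=0.1, size=(embedding_dim, units))      # input weights
W_h = rng.normal(scale=0.1, size=(units, units))              # recurrent weights
b = np.zeros(units)
w_out = rng.normal(scale=0.1, size=units)                     # dense weights
b_out = 0.0

def predict(token_ids):
    h = np.zeros(units)
    for vec in E[token_ids]:                     # Layer 1: embedding lookup
        h = np.tanh(vec @ W_x + h @ W_h + b)     # Layer 2: recurrent update
    return float(sigmoid(h @ w_out + b_out))     # Layer 3: dense + sigmoid

p = predict(np.array([2, 4, 1, 19]))
label = int(p >= 0.5)                            # 0 = negative, 1 = positive
print(p, label)
```

Only the hidden state after the last token reaches the dense layer, which is why the final timestep output drives the prediction.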

---

### Python Code – Simple Binary Classifier with RNN

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

vocab_size = 5000
embedding_dim = 100
input_length = 10  # length after padding

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=input_length))
model.add(SimpleRNN(256))
model.add(Dense(1, activation='sigmoid'))  # binary classification

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
labels = np.array([1, 0])  # dummy targets for demo, one per sentence in `padded`
model.fit(padded, labels, epochs=5)  # `padded` comes from the preprocessing block above
```

---

### Output Flow

For input `[2, 4, 1, 19]`:  
- Embedding looks up vectors for each index  
- Feeds them one at a time to the RNN layer  
- RNN builds temporal context  
- Final output goes to Dense layer for prediction

---

### RNN Layer Types

| Type   | Memory Depth | Speed | Params | Notes |
|--------|--------------|-------|--------|-------|
| RNN    | Low (~5 steps) | Fast  | Few    | Can forget early input (vanishing gradient) |
| LSTM   | High          | Slower| ≈4× SimpleRNN | Controls memory with gates |
| GRU    | Medium-High   | Faster than LSTM | ≈3× SimpleRNN | Efficient and accurate |

---

### Architectural Limitations

- RNNs struggle with long-range dependencies  
- Vanishing gradients reduce ability to retain earlier inputs  
- As a rule of thumb, simple RNNs retain only about 5 steps of context  
- LSTMs/GRUs mitigate this using memory gates
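
The vanishing gradient can be seen in a one-unit RNN. The sketch below assumes a scalar recurrence h_t = tanh(w · h_{t-1} + x_t) with a recurrent weight of 0.9 and a constant input, both chosen arbitrarily. By the chain rule, the sensitivity of h_t to the initial state is a product of one factor per timestep, and each factor here is smaller than 1 in magnitude.

```python
import numpy as np

def gradient_wrt_initial_state(w, xs):
    """Return |d h_t / d h_0| after each step of h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad, history = 0.0, 1.0, []
    for x in xs:
        h = np.tanh(w * h + x)
        grad *= w * (1.0 - h ** 2)     # chain rule: one factor per timestep
        history.append(abs(grad))
    return history

xs = np.full(50, 0.5)                   # 50 timesteps of a constant input
grads = gradient_wrt_initial_state(w=0.9, xs=xs)
print(grads[4], grads[49])              # influence after 5 steps vs. after 50 steps
```

The influence of the earliest state shrinks geometrically with sequence length, so weight updates barely "see" early inputs. Gated layers add paths along which this product does not have to shrink as quickly.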

---

### Summary

- NLP + RNN = encode text → padded sequence → embed → RNN → dense output  
- Embedding layer learns meaning over time  
- RNN layer builds sequence memory  
- Padding lets all samples be fed in same batch  
- Short-term memory is a challenge, mitigated by LSTM/GRU variants  
- RNNs are powerful but can become **very complex** and **slow to train**

---


**Part 4: Data Preparation**

---

### Objective
Prepare real-world text data (IMDB dataset) for use in a neural network. Focus is on:
- Loading pre-tokenized data
- Understanding word-index mapping
- Padding/truncation
- Vocabulary sizing
- Visualization of data length distribution

---

### Step 1: Load IMDB Dataset and Define Vocabulary Size

```python
from tensorflow.keras.datasets import imdb

vocab_size = 5000  # cap vocab for speed
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
```

- Dataset is pre-tokenized (words already mapped to integers)
- Only the 5,000 most frequent words keep their own index; rarer words are replaced by the out-of-vocabulary token (index 2)

---

### Step 2: Inspect Word-Index Mapping

```python
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}
```

- `word_index` maps words to integers
- `reverse_word_index` maps integers back to words

Special tokens:
- 0 = PAD  
- 1 = START  
- 2 = UNK  
- 3 = UNUSED (reserved, but never assigned to a word)

---

### Step 3: Decode Sequences for Human Readability

```python
def decode_review(sequence):
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in sequence])

print(decode_review(x_train[0]))
```

- Shifts all indices by 3 because of the reserved special tokens
- Reconstructs original sentence from integer sequence
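
The offset is easier to follow on a toy vocabulary. The sketch below uses a four-word stand-in for `imdb.get_word_index()` (the words and ranks are invented, and `encode_review` is a hypothetical helper) and assumes the default offset of 3 described above. It runs without downloading the dataset.

```python
INDEX_FROM = 3                 # word ranks are shifted by this much in the encoded data
PAD, START, UNK = 0, 1, 2      # reserved special tokens

# Hypothetical miniature of imdb.get_word_index(): word -> frequency rank
word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
reverse_word_index = {rank: word for word, rank in word_index.items()}

def encode_review(words):
    """Prefix START, shift known ranks by INDEX_FROM, map unknown words to UNK."""
    ids = [START]
    for w in words:
        rank = word_index.get(w)
        ids.append(UNK if rank is None else rank + INDEX_FROM)
    return ids

def decode_review(sequence):
    """Undo the shift; special tokens have no entry and come back as '?'."""
    return " ".join(reverse_word_index.get(i - INDEX_FROM, "?") for i in sequence)

encoded = encode_review(["the", "movie", "was", "great", "truly"])
decoded = decode_review(encoded + [PAD, PAD])
print(encoded)   # [1, 4, 5, 6, 7, 2]
print(decoded)   # ? the movie was great ? ? ?
```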

---

### Step 4: Analyze Sequence Lengths

```python
import matplotlib.pyplot as plt

# Token count of each training review
review_lengths = [len(review) for review in x_train]

plt.hist(review_lengths, bins=100)
plt.title("IMDB Review Length Distribution")
plt.xlabel("Review Length")
plt.ylabel("Frequency")
plt.show()
```

- Longest review: ~2500 tokens  
- Most reviews: < 1000 tokens  
- 90% of reviews: < 467  
- 95% of reviews: < 600  
- 99% of reviews: < 1000  
- Common cutoff: **500 tokens** (captures ~92% of data)
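
Quantiles and coverage like those listed above can be computed with `np.percentile`. To stay runnable without the dataset, this sketch uses synthetic right-skewed lengths as a stand-in for `review_lengths`, so its numbers will not match the IMDB figures. The `coverage` helper is a hypothetical name.

```python
import numpy as np

def coverage(lengths, max_len):
    """Fraction of sequences that fit within max_len without truncation."""
    return float(np.mean(np.asarray(lengths) <= max_len))

rng = np.random.default_rng(0)
# Synthetic stand-in for review_lengths (long right tail, like real reviews)
lengths = rng.lognormal(mean=5.2, sigma=0.6, size=10_000).astype(int) + 1

p90, p95, p99 = (int(np.percentile(lengths, q)) for q in (90, 95, 99))
print(p90, p95, p99)
print(coverage(lengths, max_len=500))
```

With the real data, replacing `lengths` by `review_lengths` gives the cutoff table directly.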

---

### Step 5: Pad and Truncate to Fixed Length

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 500
x_train_padded = pad_sequences(x_train, maxlen=max_len, padding='post', truncating='post')
x_test_padded = pad_sequences(x_test, maxlen=max_len, padding='post', truncating='post')
```

- `padding='post'`: pads at end  
- `truncating='post'`: truncates from end  
- All sequences now exactly 500 long  
- Pads with 0 (PAD token)
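
The difference between `'post'` and `'pre'` is easiest to see on short lists. The function below is a plain-Python sketch that mimics the behaviour described above for a single sequence. It is not the Keras implementation, and `pad_or_truncate` is a made-up name.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="post", value=0):
    """Force one sequence to exactly maxlen items."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'post' drops items from the end, 'pre' drops items from the start
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    # 'post' appends the padding, 'pre' prepends it
    return seq + pad if padding == "post" else pad + seq

short = [1, 14, 22]
long = [1, 14, 22, 16, 43, 530, 973]

print(pad_or_truncate(short, 5))                     # [1, 14, 22, 0, 0]
print(pad_or_truncate(short, 5, padding="pre"))      # [0, 0, 1, 14, 22]
print(pad_or_truncate(long, 5))                      # [1, 14, 22, 16, 43]
print(pad_or_truncate(long, 5, truncating="pre"))    # [22, 16, 43, 530, 973]
```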

---

### Inspect Post-Padding Example

```python
print(x_train_padded[0])  # e.g., [1, 14, 22, ..., 0, 0, 0]
print(decode_review(x_train_padded[0]))
```

- Still interpretable: real words decode normally, while special tokens (START, UNK, PAD) show up as `?`  
- Start token: 1  
- Actual words follow  
- Remaining space filled with zeros

---

### Summary
- IMDB dataset is already tokenized into integers  
- Word-index and reverse-index let us decode  
- Vocabulary capped at top 5,000 most frequent words  
- Padding and truncating normalize input size  
- Final shape: `[num_samples, 500]`  
- Data is now ready to be input into a Keras RNN model

---


**Part 5: Building Your Network**

---

### Overview

Use the padded IMDB dataset to build a working RNN model for sentiment analysis. The model has 3 layers:

- Embedding Layer  
- SimpleRNN Layer  
- Dense Output Layer  

---

### Step 1: Imports and Model Setup

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
```

---

### Step 2: Hyperparameters

```python
vocab_size = 5000         # Top 5,000 words only
embedding_dim = 100       # Size of each word vector
input_length = 500        # All reviews are padded to length 500
```

---

### Step 3: Load and Prepare IMDB Data

```python
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=input_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=input_length, padding='post', truncating='post')
```

---

### Step 4: Define Model Architecture

```python
model = Sequential()

# Layer 1: Embedding
model.add(Embedding(input_dim=vocab_size,
                    output_dim=embedding_dim,
                    input_length=input_length))

# Layer 2: Simple RNN
model.add(SimpleRNN(256))  # outputs 256 features

# Layer 3: Dense output
model.add(Dense(1, activation='sigmoid'))  # binary classifier
```

---

### Step 5: Compile the Model

```python
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
```

---

### Step 6: Train the Model

```python
history = model.fit(x_train, y_train,
                    epochs=3,
                    batch_size=64,
                    validation_data=(x_test, y_test))
```

---

### Explanation of Shapes and Parameters

- **Embedding Layer**  
  - Input: (batch_size, 500)  
  - Output: (batch_size, 500, 100)  
  - Params = 5000 * 100 = 500,000  
  - No bias term

- **RNN Layer**  
  - Input: (batch_size, 500, 100)  
  - Output: (batch_size, 256)  
  - Params = (100 + 256) * 256 + 256 = 91,392  
    - Includes kernel weights, recurrent weights, and bias

- **Dense Layer**  
  - Input: 256 → Output: 1  
  - Params = 256 + 1 = 257  
  - Activation: sigmoid  
  - Output is a probability between 0 and 1; thresholding at 0.5 gives the label: 0 (negative review), 1 (positive review)
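
The parameter counts above follow from the layer shapes and can be checked with plain arithmetic. The helper names are illustrative. With a real model, `model.summary()` reports the same per-layer numbers.

```python
def embedding_params(vocab_size, embedding_dim):
    return vocab_size * embedding_dim                  # one row per word, no bias

def simple_rnn_params(input_dim, units):
    # kernel + recurrent kernel + bias
    return (input_dim + units) * units + units

def dense_params(input_dim, outputs):
    return input_dim * outputs + outputs               # weights + bias

emb = embedding_params(5000, 100)
rnn = simple_rnn_params(100, 256)
out = dense_params(256, 1)
print(emb, rnn, out, emb + rnn + out)  # 500000 91392 257 591649
```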

---

### Performance & Runtime Notes

- Sequence length = 500 → RNN runs 500 timesteps per sample  
- Takes longer to train than CNNs because timesteps must be processed one after another rather than in parallel  
- After 3 epochs:
  - Training accuracy ~75–78%  
  - Validation accuracy ~76.5%  
- Acceptable baseline for simple RNN

---

### Summary

- Model: Embedding → SimpleRNN → Dense(sigmoid)  
- Embedding learns representations  
- RNN learns sequence structure  
- Dense layer produces prediction  
- Gives a reasonable baseline on the IMDB sentiment classification task

---



**Part 6: Advanced RNN Layers**

---

### Starting Point

We begin with a working model that used:

- Embedding layer: `output_dim=100`  
- SimpleRNN layer: `units=256`  
- Dense output: `sigmoid`

Now we explore **two advanced memory layers** that replace `SimpleRNN`:

- `LSTM`: Long Short-Term Memory  
- `GRU`: Gated Recurrent Unit

---

### Step 1: Import Advanced Layers

```python
from tensorflow.keras.layers import LSTM, GRU
```

---

### Step 2: Swap RNN Layer

#### Replace with LSTM

```python
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=100, input_length=500))
model.add(LSTM(256))  # replaces SimpleRNN
model.add(Dense(1, activation='sigmoid'))
```

#### Or Replace with GRU

```python
model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=100, input_length=500))
model.add(GRU(256))  # replaces SimpleRNN
model.add(Dense(1, activation='sigmoid'))
```

- Everything else stays the same  
- Compile and fit as before

---

### Key Differences

| Layer     | Structure                     | Param Count | Memory | Speed | Notes |
|-----------|-------------------------------|-------------|--------|-------|-------|
| RNN       | Basic                         | Base        | Low    | Fast  | No gates |
| LSTM      | 4 sub-layers (input, forget, output, cell) | 4×      | High   | Slow  | Excellent for long dependencies |
| GRU       | 3 sub-layers (update, reset, candidate)   | ≈3×     | Medium | Faster than LSTM | Lighter alternative |

- LSTM = 4× SimpleRNN parameter count  
- GRU ≈ 3× SimpleRNN parameter count (the default Keras GRU carries a second bias set, so slightly more)  
- Embedding layer unchanged  
- Output dimensions stay 256 → Dense(1)
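
The multipliers can be reproduced with a short calculation for the 100-dimensional input and 256 units used here. The GRU formula assumes the Keras default `reset_after=True`, which stores two bias vectors per gate; with `reset_after=False` the count is exactly 3× the SimpleRNN figure. The helper names are illustrative.

```python
def simple_rnn_params(input_dim, units):
    return units * (input_dim + units) + units

def lstm_params(input_dim, units):
    # Four weight sets (input, forget, output, cell), each shaped like a SimpleRNN
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units, reset_after=True):
    # Three weight sets (update, reset, candidate); reset_after=True doubles the bias
    biases = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases)

base = simple_rnn_params(100, 256)
print(base)                                       # 91392
print(lstm_params(100, 256))                      # 365568
print(gru_params(100, 256))                       # 274944
print(gru_params(100, 256, reset_after=False))    # 274176
```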

---

### Summary

- **To use LSTM or GRU**, just swap layer call  
- LSTM has **longer effective memory**, more params, slower  
- GRU is **leaner**, nearly same performance  
- Both typically outperform `SimpleRNN` on long sequences  
- Final output behavior is the same (e.g., sentiment = 0/1)

There’s no single "best" choice—performance is dataset-dependent.

---
