---
title: "7333 Module 13 - Recurrent Neural Networks"
author: "Jessica McPhaul"
output: html_notebook
---


**QTW - 7333**  
**Module 13: Recurrent Neural Networks**

---

**Part 1: Introduction to Recurrent Neural Networks**

Recurrent Neural Networks are designed for sequence-dependent data like language and time series. Unlike dense and convolutional neural networks that process data all at once, RNNs process inputs one step at a time while maintaining a memory of previous outputs.

At each timestep, the network receives the current input together with the output of the previous step. This is why RNNs are called "recurrent": they feed their outputs back into themselves.

At timestep 0 there is no previous output, so a matrix of zeros is used as the initial state. RNNs can be visualized in two ways:  
- Unrolled view: layers shown sequentially over time  
- Rolled view: the same layer reused repeatedly with shared weights

Each step’s output is used for prediction and is also passed forward as input for the next time step.

The structure involves two sets of inputs:  
- The actual data (e.g., stock prices)  
- The output of the previous timestep

These are treated as one concatenated input matrix, which simplifies implementation. The sequence length must be consistent for all input data, which presents a challenge in NLP where sentence lengths vary. This is solved with padding or truncating.

Each RNN layer has two sets of weights—one for the normal input and one for the recurrent input. These are treated separately in libraries like Keras, so two sets of weights are maintained and trained.
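
As a quick illustration (a sketch, not from the lecture), both weight sets are visible on a Keras `SimpleRNN` layer: the `kernel` multiplies the normal input and the `recurrent_kernel` multiplies the previous output.

```python
import numpy as np
from tensorflow.keras.layers import SimpleRNN

rnn = SimpleRNN(8)
rnn(np.zeros((1, 10, 4), dtype="float32"))  # build on a dummy (batch, timesteps, features) input

for w in rnn.weights:
    print(w.name, tuple(w.shape))
# kernel           (4, 8)  -- weights for the normal input
# recurrent_kernel (8, 8)  -- weights for the recurrent input
# bias             (8,)
```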

---

**Part 2: A Brief NLP Introduction**

NLP tasks require converting language into a numeric format RNNs can work with. One-hot encoding is inefficient for large vocabularies and results in sparse representations, so instead, we use word embeddings—dense vectors that capture word meaning.

Sentences must be converted to the same length using padding (adding a "pad" word token like 0) or truncation. The pad word does not carry meaning and is just a filler. Padding is applied to make the input shape consistent.

The position of a word in a sentence matters. RNNs process the sentence in order, making them suitable for language tasks like completing sentences or sentiment analysis.

Word vectors (embeddings) can be learned by the model or pre-trained. They represent words in dense numerical form and live in high-dimensional space. Words with similar meanings point in similar directions in this space.

Cosine similarity is used to compare these vectors:  
- Cosine similarity = dot product of two vectors divided by product of their magnitudes  
- Cosine distance = 1 - cosine similarity  
- Close direction = semantic similarity
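
A minimal NumPy sketch of these formulas (the vectors here are toy values, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king = np.array([0.9, 0.8, 0.1])    # toy "word vectors"
queen = np.array([0.85, 0.9, 0.05])

print(cosine_similarity(king, queen))      # near 1 -> similar direction, similar meaning
print(1 - cosine_similarity(king, queen))  # cosine distance
```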

---

**Part 3: Building RNN for NLP**

Text is converted to unique integers via tokenization, with each unique (lowercased) word receiving one integer. Example:  
"The movie was the best I have seen" → [3, 7, 15, 3, 9, 11, 19, 22] (both occurrences of "the" map to the same integer)

Each word is mapped to an integer. Common words are assigned lower numbers. Unknown words are handled with a special token (e.g., 2). Padding is added to make all sequences the same length (e.g., 500).

The network structure:

- Embedding layer (must be the first): learns dense representations  
- Simple RNN layer: processes sequence  
- Dense layer with sigmoid: outputs sentiment (0 or 1)

Embedding uses row lookup rather than matrix multiplication. For each input word integer, it retrieves the corresponding vector from the embedding matrix. Output of each step is passed to the RNN, which outputs to the final dense layer.
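
A small NumPy sketch of the lookup (toy sizes; the one-hot product is included only to show the two views agree):

```python
import numpy as np

vocab_size, embedding_dim = 6, 4
embedding_matrix = np.random.rand(vocab_size, embedding_dim)
tokens = np.array([2, 5, 1])  # a "sentence" of word integers

# Row lookup: plain indexing, no multiplication needed
vectors = embedding_matrix[tokens]    # shape (3, 4)

# Equivalent (but wasteful) one-hot matrix multiplication
one_hot = np.eye(vocab_size)[tokens]  # shape (3, 6)
assert np.allclose(vectors, one_hot @ embedding_matrix)
```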

Basic RNNs have short effective memory (~5 steps) because gradients vanish over long sequences. LSTM and GRU extend memory capability at the cost of more parameters, but even a basic RNN setup can perform well.

---

**Part 4: Data Preparation**

Use the IMDB dataset. Limit vocabulary size to 5000 for performance. Data is already tokenized (words to integers). Each review is a list of integers.

Use `get_word_index()` to get the word-to-integer mapping, then invert it into a reverse lookup for mapping integers back to words. Special tokens include:  
- 0: PAD  
- 1: START  
- 2: UNK  
- 3: UNUSED

Create a decode function to turn sequences into text.
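
A sketch of that pipeline; the offset of 3 accounts for the reserved PAD/START/UNK indices:

```python
from tensorflow.keras.datasets import imdb

VOCAB_SIZE = 5000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)

word_index = imdb.get_word_index()                         # word -> integer
reverse_index = {i + 3: w for w, i in word_index.items()}  # shift past reserved tokens
reverse_index.update({0: "<PAD>", 1: "<START>", 2: "<UNK>", 3: "<UNUSED>"})

def decode(sequence):
    return " ".join(reverse_index.get(i, "<UNK>") for i in sequence)

print(decode(x_train[0][:10]))
```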

Check review lengths:  
- Max: ~2500  
- 99% < 1000  
- 95% < 600  
- 90% < 467  
- Cutoff = 500 captures ~92% of data

Use `pad_sequences()` to trim/pad data to length 500.

Now all reviews are exactly 500 integers long, ready for input into the model.
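
A sketch of the length check and padding steps above:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

lengths = [len(review) for review in x_train]
print(max(lengths), np.percentile(lengths, [90, 95, 99]))

MAXLEN = 500
# Pads short reviews with 0s and truncates long ones (from the start, by default)
x_train = pad_sequences(x_train, maxlen=MAXLEN)
x_test = pad_sequences(x_test, maxlen=MAXLEN)
print(x_train.shape)  # (25000, 500)
```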

---

**Part 5: Building Your Network**

Model architecture:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=100, input_length=500))
model.add(SimpleRNN(256))
model.add(Dense(1, activation='sigmoid'))
```

Compile and train:

```python
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)
```

Explanation:

- Embedding outputs: (batch_size, 500, 100)  
- SimpleRNN outputs: (batch_size, 256)  
- Dense layer: compresses 256 to 1 output  
- Total parameters ≈ 592k: Embedding = 5000 × 100 = 500,000; SimpleRNN = (100 × 256) + (256 × 256) + 256 = 91,392; Dense = 256 + 1 = 257

Training takes time due to 500 steps per sample. After 3 epochs:  
- Accuracy = ~75–78%  
- Validation = ~76.5%  
- Decent for a simple 3-layer RNN model

---

**Part 6: Advanced RNN Layers**

You can replace `SimpleRNN` with `LSTM` or `GRU`.

```python
from tensorflow.keras.layers import LSTM, GRU

model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=100, input_length=500))
model.add(LSTM(256))  # or model.add(GRU(256)) -- the rest of the model is unchanged
model.add(Dense(1, activation='sigmoid'))
```

- LSTM has three gates (input, forget, output) plus a candidate cell state, giving four weight sets  
- GRU has two gates (update, reset) plus a candidate hidden state, giving three weight sets  
- LSTM: ~4× the parameter count of a SimpleRNN  
- GRU: ~3× the parameter count (both multipliers are checked in the sketch after this list)  
- Embedding layer unchanged  
- Dense layer unchanged
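
A quick sanity check of those multipliers (a sketch; the exact GRU count is slightly above 3× because TensorFlow's GRU uses an extra bias term by default):

```python
import numpy as np
from tensorflow.keras.layers import SimpleRNN, GRU, LSTM

# Build each layer on a dummy batch of shape (batch, timesteps, features)
for layer_cls in (SimpleRNN, GRU, LSTM):
    layer = layer_cls(256)
    layer(np.zeros((1, 500, 100), dtype="float32"))
    print(layer_cls.__name__, layer.count_params())
# SimpleRNN  91392   (1x)
# GRU       274944   (~3x)
# LSTM      365568   (4x)
```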

Speed and memory tradeoffs:

| Layer | Memory | Params | Notes |
|-------|--------|--------|-------|
| RNN   | Low    | 1×     | Fast, short memory |
| GRU   | Medium | 3×     | Efficient |
| LSTM  | High   | 4×     | Best for long dependencies |

All three are valid. Choice depends on resources and problem.

---

**Key Takeaways — QTW 7333 Module 13: Recurrent Neural Networks**

1. **RNNs handle sequential data**  
   - Memory is built by feeding previous outputs as inputs  
   - Used for tasks like language, time series, prediction

2. **Inputs must be uniform length**  
   - NLP input is padded/truncated to fixed size (commonly 500)  
   - Padding uses 0s; truncating drops from start or end

3. **Word representation matters**  
   - One-hot = inefficient  
   - Embeddings = dense, learned representations  
   - Can be trained or pre-loaded (e.g., GloVe)

4. **Basic RNNs have short memory (~5 steps)**  
   - Vanishing gradient limits long-term dependencies  
   - LSTM and GRU are drop-in replacements with gating memory

5. **Simple RNN architecture is three layers**  
   - Embedding → RNN → Dense(sigmoid)  
   - Works well for sentiment classification (~76% accuracy on IMDB)

6. **LSTM = 4× parameters of RNN**  
   **GRU = 3× parameters of RNN**  
   - Trade-off: GRU is faster, LSTM remembers more  
   - Both are better than basic RNN for long sequences

7. **Data pipeline matters**  
   - IMDB is pre-tokenized  
   - Word index and reverse index are used for encoding/decoding  
   - Proper preprocessing is required for model input

8. **Cosine similarity measures semantic similarity in vector space**  
   - Used in NLP tasks for comparing word vectors  
   - Only direction of vector matters, not magnitude

9. **Training time increases with sequence length**  
   - 500-timestep sequences take longer to run  
   - RNNs inherently run sequentially, limiting parallelization

10. **Output layer is a binary classifier using sigmoid**  
    - For sentiment: 0 = negative, 1 = positive  
    - Loss = binary crossentropy

---

**Mathematical Representation – RNN, LSTM, GRU**

---

**1. Basic RNN**

At each timestep \( t \):

- Hidden state:  
  \( h_t = \tanh(W_h x_t + U_h h_{t-1} + b_h) \)

- Output:  
  \( y_t = W_y h_t + b_y \)
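
A minimal NumPy sketch of one RNN step in these terms (toy dimensions; `@` is matrix multiplication):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, U_h, b_h):
    # h_t = tanh(W_h x_t + U_h h_{t-1} + b_h)
    return np.tanh(W_h @ x_t + U_h @ h_prev + b_h)

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W_h = rng.normal(size=(d_h, d_in))
U_h = rng.normal(size=(d_h, d_h))
b_h = np.zeros(d_h)

h = np.zeros(d_h)                       # timestep 0: the previous state is all zeros
for x_t in rng.normal(size=(5, d_in)):  # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_h, U_h, b_h)
print(h)
```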

---

**2. LSTM (Long Short-Term Memory)**

At each timestep \( t \):

- Forget gate:  
  \( f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \)

- Input gate:  
  \( i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \)

- Candidate cell state:  
  \( \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \)

- New cell state:  
  \( c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \)

- Output gate:  
  \( o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \)

- Hidden state:  
  \( h_t = o_t \odot \tanh(c_t) \)
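
The same gate equations as a self-contained NumPy sketch (toy weights; `*` is elementwise, matching \( \odot \)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell state
    c = f * c_prev + i * c_tilde                                # new cell state
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    return o * np.tanh(c), c                                    # hidden state, cell state

d_in, d_h = 4, 3
rng = np.random.default_rng(1)
W = {g: rng.normal(size=(d_h, d_in)) for g in "fico"}  # one weight set per gate/candidate
U = {g: rng.normal(size=(d_h, d_h)) for g in "fico"}
b = {g: np.zeros(d_h) for g in "fico"}

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
print(h, c)
```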

---

**3. GRU (Gated Recurrent Unit)**

At each timestep \( t \):

- Update gate:  
  \( z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \)

- Reset gate:  
  \( r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \)

- Candidate:  
  \( \tilde{h}_t = \tanh(W_h x_t + r_t \odot U_h h_{t-1} + b_h) \)

- Final hidden state:  
  \( h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \)
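
And the GRU equations in the same style (note the reset gate is applied to \( U_h h_{t-1} \) before the tanh):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])              # update gate
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])              # reset gate
    h_tilde = np.tanh(W["h"] @ x_t + r * (U["h"] @ h_prev) + b["h"])  # candidate
    return (1 - z) * h_prev + z * h_tilde                             # final hidden state

d_in, d_h = 4, 3
rng = np.random.default_rng(2)
W = {g: rng.normal(size=(d_h, d_in)) for g in "zrh"}  # three weight sets: z, r, candidate
U = {g: rng.normal(size=(d_h, d_h)) for g in "zrh"}
b = {g: np.zeros(d_h) for g in "zrh"}

print(gru_step(rng.normal(size=d_in), np.zeros(d_h), W, U, b))
```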

---

**3 Questions for Dr. Slater**

1. How do we determine whether to use LSTM or GRU for a specific NLP task in terms of performance versus resource efficiency?

2. In sequence prediction, when is it more appropriate to use the final timestep output versus the full sequence of outputs?

3. How do pretrained embeddings (like GloVe or Word2Vec) affect RNN performance compared to training embeddings from scratch?

---


