Discussion Questions
- What are the benefits and limitations of using a
Bag-of-Words model?
- Why does TF-IDF often outperform simple CountVectorizer for
text classification?
- How can we improve our custom CountVectorizer to include
stopword removal and stemming?
Study Guide: QTW Module 6 – Naive Bayes
- Overview of Bayes’ Rule
Bayes’ Rule updates an initial belief (prior) after observing new evidence (likelihood). It is expressed mathematically as:
P(A | B) = [ P(B | A) × P(A) ] / P(B)
• P(A | B): Posterior Probability (updated belief)
• P(B | A): Likelihood (probability of observing B given A is true)
• P(A): Prior Probability (initial belief)
• P(B): Normalizing Factor (ensures the result is a valid probability)
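Example: Drug Test
• True Positive Rate = 99%
• True Negative Rate = 97% (so the false positive rate is 3%)
• Prevalence = 5%
With A = “the person uses the drug” and B = “the test is positive”:
P(A | B) = [0.99 × 0.05] / [ (0.99 × 0.05) + (0.03 × 0.95) ] ≈ 63.5%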
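The drug test arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function name `posterior` and its parameter names are made up for illustration and do not come from the course material.

```python
def posterior(true_positive_rate, true_negative_rate, prevalence):
    """Return P(A | B): probability of drug use given a positive test."""
    false_positive_rate = 1.0 - true_negative_rate
    # P(B): total probability of a positive test (users + non-users)
    evidence = (true_positive_rate * prevalence
                + false_positive_rate * (1.0 - prevalence))
    # Bayes' Rule: likelihood x prior / evidence
    return true_positive_rate * prevalence / evidence


p = posterior(0.99, 0.97, 0.05)
print(f"P(user | positive test) = {p:.3f}")
```

Even with a 99% true positive rate, the low prevalence keeps the posterior well below 99%, because most of the tested population does not use the drug.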
- Bayes’ Rule for Multiple Variables
When dealing with multiple evidence variables (B, C, D, …), we can generalize Bayes’ Rule as:
P(A | B, C, D) ∝ P(A) × P(B | A) × P(C | A) × P(D | A)
The “naive” assumption is that all evidence variables B, C, D are
conditionally independent given A, allowing us to multiply probabilities
rather than consider their complex joint distributions.
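To make the product form concrete, here is a small sketch with two classes and word-presence evidence. All priors and likelihoods below are hypothetical numbers chosen for illustration, not values from the course.

```python
from math import prod

# Hypothetical priors P(A) and per-evidence likelihoods P(evidence | A)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.05, "winner": 0.20},
    "ham": {"free": 0.05, "meeting": 0.25, "winner": 0.01},
}


def naive_bayes_posterior(evidence):
    """Multiply prior and per-evidence likelihoods, then normalize."""
    # Unnormalized score: P(A) x P(B | A) x P(C | A) x ...
    scores = {
        label: priors[label] * prod(likelihoods[label][e] for e in evidence)
        for label in priors
    }
    total = sum(scores.values())  # plays the role of the normalizing factor
    return {label: score / total for label, score in scores.items()}


post = naive_bayes_posterior(["free", "winner"])
print(post)
```

The normalization step turns the proportionality (∝) into actual probabilities that sum to 1.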
- Bayes’ Rule with Continuous Variables
For continuous attributes (temperature, weight), the probability of any exact value is zero, so we use a probability density function (e.g., Gaussian) in place of P(X = x). If the data within a class is assumed normally distributed with mean μ and standard deviation σ, the density is:
p(x; μ, σ) = (1 / (σ√(2π))) × exp[−(x − μ)² / (2σ²)]
We then insert these density values into Bayes’ framework just as we do with probabilities for discrete or categorical data.
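A minimal sketch of plugging the Gaussian density into Bayes’ Rule follows. The two classes and their means, standard deviations, and priors are hypothetical; in practice they would be estimated from training data.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Gaussian density p(x; mu, sigma)."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# Hypothetical per-class temperature statistics (degrees Celsius)
classes = {
    "cold_day": {"prior": 0.5, "mu": 5.0, "sigma": 3.0},
    "warm_day": {"prior": 0.5, "mu": 20.0, "sigma": 4.0},
}


def class_posteriors(x):
    """Posterior over classes for a single continuous observation x."""
    scores = {
        name: c["prior"] * gaussian_pdf(x, c["mu"], c["sigma"])
        for name, c in classes.items()
    }
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


post = class_posteriors(18.0)
print(post)
```

Density values can exceed 1 for small σ, which is fine: only their ratio across classes matters once the scores are normalized.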
Naive Bayes in Practice (Demo)
• Often used for text classification with bag-of-words or TF-IDF features.
• MultinomialNB from sklearn is commonly applied:
– The alpha parameter (additive smoothing; alpha = 1 is Laplace smoothing) keeps words that never appear in a class from getting zero probability.
– Very fast, even on large datasets.
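The effect of alpha can be seen in a small sketch of additive smoothing for one class. The vocabulary and counts are hypothetical, and the helper `smoothed_likelihoods` is an illustrative name rather than part of sklearn.

```python
def smoothed_likelihoods(word_counts, vocabulary, alpha=1.0):
    """Estimate P(word | class) as (count + alpha) / (total + alpha * |V|)."""
    total = sum(word_counts.get(w, 0) for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (word_counts.get(w, 0) + alpha) / denom for w in vocabulary}


# Hypothetical counts for a "sports" class; "vote" never appears in it
vocabulary = ["ball", "game", "vote"]
sports_counts = {"ball": 3, "game": 2}

unsmoothed = smoothed_likelihoods(sports_counts, vocabulary, alpha=0.0)
smoothed = smoothed_likelihoods(sports_counts, vocabulary, alpha=1.0)
print("alpha=0:", unsmoothed)
print("alpha=1:", smoothed)
```

With alpha = 0, a single unseen word drives the whole product of likelihoods to zero; with alpha = 1, the unseen word gets a small but nonzero share.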
Naive Bayes vs. Logistic Regression
• Naive Bayes (generative model) and Logistic Regression (discriminative model) form a generative-discriminative pair.
• Naive Bayes converges faster to a near-optimal solution with limited data; Logistic Regression can yield a lower asymptotic error with sufficient data.
• The two are connected through the logit: for common feature models such as word counts, the log-odds of the Naive Bayes posterior is a linear function of the features, which is the same functional form that Logistic Regression fits directly.
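The logit connection can be illustrated for a two-class multinomial model. In this sketch the class priors and word probabilities are hypothetical; the point is only that the sigmoid of a linear score reproduces the normalized Naive Bayes posterior.

```python
import math

# Hypothetical two-class multinomial Naive Bayes parameters
priors = {"pos": 0.5, "neg": 0.5}
theta = {
    "pos": {"good": 0.6, "bad": 0.1, "film": 0.3},
    "neg": {"good": 0.1, "bad": 0.6, "film": 0.3},
}
counts = {"good": 2, "film": 1}  # word counts for one document

# Linear form: bias + sum of weight * count, as in logistic regression
bias = math.log(priors["pos"] / priors["neg"])
weights = {w: math.log(theta["pos"][w] / theta["neg"][w]) for w in theta["pos"]}
log_odds = bias + sum(weights[w] * n for w, n in counts.items())
p_linear = 1.0 / (1.0 + math.exp(-log_odds))  # sigmoid of the log-odds


# Direct form: normalize prior x product of likelihoods
def joint(label):
    return priors[label] * math.prod(theta[label][w] ** n for w, n in counts.items())


p_direct = joint("pos") / (joint("pos") + joint("neg"))
print(p_linear, p_direct)
```

The difference between the two models lies in how the weights are obtained: Naive Bayes derives them from per-class frequency estimates, while Logistic Regression optimizes them directly against the labels.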
Bag of Words Exercise
• Task: Build a custom CountVectorizer using Python.
• Compare with sklearn’s CountVectorizer to observe differences in tokenization, vocabulary building, and performance.
Mathematical Representation (Core Formulas)
1. Bayes’ Rule (Single Variable)
P(A | B) = [P(B | A) × P(A)] / P(B)
2. Bayes’ Rule (Multivariable)
P(A | B, C, D) ∝ P(A) × P(B | A) × P(C | A) × P(D | A)
3. Gaussian Density (Continuous Example)
p(x) = (1 / (σ√(2π))) × exp[−(x − μ)² / (2σ²)]
Code Snippets
A. CPU Implementation (Naive Bayes – Sklearn)
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
# 1) Load Data
train_data = fetch_20newsgroups(subset='train', remove=('headers','footers','quotes'))
test_data = fetch_20newsgroups(subset='test', remove=('headers','footers','quotes'))
# 2) Transform Text (CountVectorizer)
cv = CountVectorizer()
X_train_cv = cv.fit_transform(train_data.data)
X_test_cv = cv.transform(test_data.data)
# 3) Train Naive Bayes
model_cv = MultinomialNB()
model_cv.fit(X_train_cv, train_data.target)
# 4) Evaluate
preds_cv = model_cv.predict(X_test_cv)
print("Accuracy with CountVectorizer:", accuracy_score(test_data.target, preds_cv))
# 5) Using TF-IDF
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(train_data.data)
X_test_tfidf = tfidf.transform(test_data.data)
model_tfidf = MultinomialNB()
model_tfidf.fit(X_train_tfidf, train_data.target)
preds_tfidf = model_tfidf.predict(X_test_tfidf)
print("Accuracy with TF-IDF:", accuracy_score(test_data.target, preds_tfidf))
B. Simple GPU Implementation (Naive Bayes with cuML)
(You need an environment with RAPIDS/cuML installed for this.)
import cupy as cp
import cupyx.scipy.sparse as cpx_sparse
from cuml.naive_bayes import MultinomialNB as cuMultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
# Assume you have train_data & test_data from above
# Vectorization still runs on the CPU and returns SciPy sparse matrices
tfidf_gpu = TfidfVectorizer()
X_train_gpu = tfidf_gpu.fit_transform(train_data.data)
X_test_gpu = tfidf_gpu.transform(test_data.data)
# Copy the sparse matrices to the GPU
# (cupy.sparse is deprecated in favor of cupyx.scipy.sparse)
X_train_gpu_csr = cpx_sparse.csr_matrix(X_train_gpu)
X_test_gpu_csr = cpx_sparse.csr_matrix(X_test_gpu)
y_train_gpu = cp.asarray(train_data.target)
y_test_gpu = cp.asarray(test_data.target)
# Train
model_gpu = cuMultinomialNB()
model_gpu.fit(X_train_gpu_csr, y_train_gpu)
# Predict
preds_gpu = model_gpu.predict(X_test_gpu_csr)
accuracy_gpu = (preds_gpu.get() == y_test_gpu.get()).mean()
print("GPU-based Naive Bayes Accuracy:", accuracy_gpu)
Study Guide Format (Condensed)
I. Introduction
- Bayes’ Rule, Prior, Likelihood, Posterior
- Connection to logistic regression
II. Mathematical Foundations
- P(A|B) = (P(B|A)*P(A))/P(B)
- Multivariate Extension
- Continuous Variables
III. Practical Implementation
- Text Classification
- Bag-of-Words & TF-IDF
- Naive Bayes (MultinomialNB)
IV. Comparison
- Naive Bayes vs. Logistic Regression
- Generative vs. Discriminative
V. Key Takeaways
1. Naive Bayes uses Bayes’ Rule with a simplifying (naive) independence assumption, yet often performs competitively in classification tasks, especially text.
2. Combining TF-IDF with MultinomialNB frequently yields better accuracy than raw bag-of-words counts.
3. Generative (Naive Bayes) and Discriminative (Logistic Regression) models are mathematically related through the logit (log-odds) framework.
VI. Discussion Questions
- Why does Naive Bayes often perform well in high-dimensional text
classification, despite the naive independence assumption?
- How do smoothing parameters (like alpha) influence Naive Bayes
performance?
- In what cases might Logistic Regression eventually outperform Naive
Bayes, and why?
Below is a more detailed walkthrough of Part 6 – The Bag of
Words Exercise, showing how to build a custom CountVectorizer
and comparing it with sklearn’s version. This will help solidify your
understanding of how text is converted into numerical features for Naive
Bayes and other machine learning models.
Bag of Words Exercise – Step-by-Step
Understanding the Goal
We want to:
• Load the 20 Newsgroups dataset (without headers, footers, and quotes).
• Create our own Bag-of-Words approach (basic CountVectorizer).
• Compare against the sklearn implementation (CountVectorizer).
Loading the 20 Newsgroups Dataset
The dataset can be fetched with:
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
This returns a dictionary-like object, including .data (list of documents) and .target (array of numeric labels). With no subset argument, only the training split is loaded.
- Building a Custom CountVectorizer
Our custom version will:
• Tokenize each document into words (lowercase, strip every non-letter character, then split on whitespace).
• Build a vocabulary (unique words mapped to indices).
• Count how often each word appears in each document.
Below is sample code to illustrate these steps.
import re
from collections import Counter
from sklearn.datasets import fetch_20newsgroups

def custom_count_vectorizer(corpus):
    # 1. Lowercase the text and drop every non-letter character
    cleaned_texts = [re.sub(r'[^a-zA-Z\s]', '', doc.lower()) for doc in corpus]
    # 2. Tokenize by splitting on whitespace
    tokenized_docs = [doc.split() for doc in cleaned_texts]
    # 3. Build vocabulary (all unique words), sorted for consistent ordering
    vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
    # 4. Create a dict mapping each word to its column index
    word2index = {word: idx for idx, word in enumerate(vocabulary)}
    # 5. Construct the count matrix:
    #    each row represents a document, each column a word in our vocabulary
    count_matrix = []
    for doc in tokenized_docs:
        row_vector = [0] * len(vocabulary)
        for word, count in Counter(doc).items():
            row_vector[word2index[word]] = count
        count_matrix.append(row_vector)
    return vocabulary, count_matrix

# 1) Fetch data
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
corpus = data.data[:5]  # Using first 5 docs as a small example

# 2) Apply our custom CountVectorizer
vocab, matrix = custom_count_vectorizer(corpus)
print("Custom Vocabulary (first 20 words):", vocab[:20])
print("\nFirst Document's Vector Representation:")
print(matrix[0])
- Comparing with sklearn’s CountVectorizer
Now let’s use the sklearn version to see how it differs.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english') # optionally remove stopwords
X = cv.fit_transform(corpus)
print("\nSklearn Vocabulary Size:", len(cv.vocabulary_))
print("Sklearn First Document's Vector:")
print(X.toarray()[0])
Differences You Might Notice
• Stopword Removal: sklearn’s CountVectorizer removes common words like “the,” “and,” etc. only if you specify stop_words='english'; by default it keeps them. Our custom version does not remove them unless we code it.
• Tokenization and Punctuation: sklearn’s default token pattern keeps runs of two or more word characters (letters, digits, underscore) and treats punctuation as a separator, so digits survive and single-character tokens are dropped. Our custom regex deletes every non-letter character, so digits vanish and “don’t” becomes “dont”.
• Sparse Representation: sklearn returns a sparse matrix (efficient for large text). Ours is a plain list of lists (dense matrix).
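The dense versus sparse difference can be sketched in plain Python by storing only the nonzero counts of each row. This is an illustrative stand-in for a real sparse matrix format, using a tiny hypothetical corpus that is already tokenized.

```python
from collections import Counter

# Hypothetical pre-tokenized documents
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["a", "bird"],
]
vocabulary = sorted({word for doc in docs for word in doc})
word2index = {word: idx for idx, word in enumerate(vocabulary)}

# Dense: one cell per (document, vocabulary word), mostly zeros
dense = [[Counter(doc).get(word, 0) for word in vocabulary] for doc in docs]

# Sparse: per document, a dict {column index: count} for nonzero cells only
sparse = [
    {word2index[word]: count for word, count in Counter(doc).items()}
    for doc in docs
]

dense_cells = sum(len(row) for row in dense)
sparse_cells = sum(len(row) for row in sparse)
print("dense cells stored:", dense_cells)
print("sparse cells stored:", sparse_cells)
```

The dense cell count grows with documents × vocabulary size, while the sparse count grows only with the number of distinct words per document, which is why the gap widens sharply on real corpora.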
- (Optional) GPU-Accelerated Version
If you have the RAPIDS environment set up (with CuPy and cuML), you could adapt your code to build counts on the GPU. A minimal “homegrown” version keeps tokenization and vocabulary building on the CPU and stores only the count matrix on the GPU:
import re
from collections import Counter
import cupy as cp

def gpu_count_vectorizer(corpus):
    # Step 1: Preprocess & tokenize on CPU (or GPU if you prefer)
    cleaned_texts = [re.sub(r'[^a-zA-Z\s]', '', doc.lower()) for doc in corpus]
    tokenized_docs = [doc.split() for doc in cleaned_texts]
    # Step 2: Build vocabulary on CPU for simplicity
    vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
    word2index = {word: idx for idx, word in enumerate(vocabulary)}
    # Step 3: Initialize a GPU array to hold counts
    rows, cols = len(tokenized_docs), len(vocabulary)
    gpu_matrix = cp.zeros((rows, cols), dtype=cp.int32)
    # Step 4: Fill in the matrix (one element at a time; simple but slow)
    for i, doc in enumerate(tokenized_docs):
        doc_counts = Counter(doc)
        for word, count in doc_counts.items():
            if word in word2index:
                j = word2index[word]
                gpu_matrix[i, j] = count
    return vocabulary, gpu_matrix

# Example usage
vocab_gpu, matrix_gpu = gpu_count_vectorizer(corpus[:5])
print("GPU Count Matrix shape:", matrix_gpu.shape)
print("GPU Count Matrix (first row, back to CPU):", matrix_gpu[0].get())
Note: This example shows the concept. In practice, you’d want to
handle more steps on GPU to maximize speed.
Summary of the Exercise
• Objective: Implement a simplified Bag-of-Words manually.
• Key Observation: The core steps (tokenizing, building a vocabulary, counting words) are straightforward, but sklearn optimizes them for performance and adds many extra features (stopword removal, n-grams, etc.).
Three Key Takeaways (Part 6 Focus)
1. Implementation Control: Building CountVectorizer from scratch clarifies how text preprocessing works under the hood, but libraries like sklearn are more flexible and optimized.
2. Vocabulary Size: Real-world text data sets can contain tens or hundreds of thousands of unique tokens; efficient data structures (sparse matrices) are critical.
3. Preprocessing Matters: Removing stopwords, handling punctuation, or applying stemming/lemmatization can significantly change your results.
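To see how preprocessing choices change the features, here is a sketch that extends the custom vectorizer idea with stopword removal and bigrams. The stopword set is a tiny hypothetical list, and the function names are illustrative; stemming is left out because it would need an external library.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list; real lists are much longer
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def tokenize(doc, stopwords=STOPWORDS):
    """Lowercase, drop non-letters, split on whitespace, remove stopwords."""
    words = re.sub(r"[^a-zA-Z\s]", "", doc.lower()).split()
    return [w for w in words if w not in stopwords]


def with_ngrams(tokens, n_max=2):
    """Return unigrams plus all n-grams up to n_max, joined by spaces."""
    features = list(tokens)
    for n in range(2, n_max + 1):
        features += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return features


def count_vectorize(corpus, n_max=2):
    docs = [with_ngrams(tokenize(doc), n_max) for doc in corpus]
    vocabulary = sorted({feat for doc in docs for feat in doc})
    index = {feat: i for i, feat in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for feat, count in Counter(doc).items():
            row[index[feat]] = count
        matrix.append(row)
    return vocabulary, matrix


vocab, matrix = count_vectorize(["The cat sat in the hat.", "A cat is a cat!"])
print(vocab)
print(matrix)
```

Note that bigrams here are formed after stopword removal, so they can join words that were not adjacent in the original text (for example “sat hat”).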
Three Discussion Questions
1. How would you modify your custom CountVectorizer to remove stopwords, or to handle n-grams (like two-word phrases)?
2. What might be the tradeoffs between using a dense vs. sparse matrix representation for large corpora?
3. How could we extend this approach for advanced text processing, such as adding TF-IDF or word embeddings (GloVe, Word2Vec)?
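As a starting point for the TF-IDF part of question 3, a count matrix can be reweighted as follows. This sketch assumes the smoothed variant idf = ln((1 + N) / (1 + df)) + 1 followed by L2 row normalization; other TF-IDF definitions exist, and the count matrix is hypothetical.

```python
import math


def tfidf(count_matrix):
    """Reweight raw counts by smoothed IDF, then L2-normalize each row."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: number of documents containing each term
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    weighted = []
    for row in count_matrix:
        vec = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard empty rows
        weighted.append([v / norm for v in vec])
    return weighted


# Hypothetical counts: term 0 appears in every document, terms 1 and 2 are rare
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

The term that appears in every document gets the minimum IDF of 1, so rarer terms gain relative weight compared with their raw counts.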
