Study Guide: QTW Module 6 - Naive Bayes

1. Bayes’ Rule: Overview and Mathematical Representation

Bayes’ Rule is a fundamental theorem in probability that helps update prior beliefs based on new evidence. It is the backbone of Bayesian inference and plays a crucial role in probabilistic modeling.

Mathematical Formula

\[ P(A | B) = \frac{P(B | A) P(A)}{P(B)} \]

Where:
- \(P(A | B)\) = Posterior Probability (updated belief after seeing evidence)
- \(P(B | A)\) = Likelihood (how likely B occurs given A is true)
- \(P(A)\) = Prior Probability (initial belief before seeing evidence)
- \(P(B)\) = Normalization Factor (ensures valid probability distribution)

Example: Drug Testing

Given:
- True positive rate = 99% (\(P(B | A) = 0.99\))
- False positive rate = 3% (\(P(B | \neg A) = 0.03\))
- Prevalence of drug use = 5% (\(P(A) = 0.05\))

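The probability that a person who tested positive is actually a drug user is:

\[ P(A | B) = \frac{(0.99 \times 0.05)}{(0.99 \times 0.05) + (0.03 \times 0.95)} = \frac{0.0495}{0.078} \approx 0.635 \]

Even with a 99% true positive rate, only about 63.5% of positive tests belong to actual users, because non-users greatly outnumber users.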

Using Python:

P_A = 0.05  # Prior probability of being a drug user
P_B_given_A = 0.99  # True positive rate
P_B_given_not_A = 0.03  # False positive rate
P_not_A = 1 - P_A

P_B = (P_B_given_A * P_A) + (P_B_given_not_A * P_not_A)  # Normalization factor
P_A_given_B = (P_B_given_A * P_A) / P_B  # Bayes' Rule

print(f"Probability of being a drug user given a positive test: {P_A_given_B:.3f}")
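Bayes' Rule can also be applied repeatedly: the posterior from one test becomes the prior for the next. The following is a minimal sketch of that idea using the rates from the example above. The helper name `bayes_posterior` is hypothetical, and the sketch assumes the two tests are conditionally independent given the person's true status, which a real retest may not satisfy.

```python
def bayes_posterior(prior, true_positive_rate, false_positive_rate):
    """Return P(A | positive test) using Bayes' Rule."""
    # P(B): total probability of a positive test
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence

# First positive test: start from the 5% prevalence.
after_first = bayes_posterior(0.05, 0.99, 0.03)

# Second positive test: yesterday's posterior is today's prior
# (assumes the tests are independent given true drug-use status).
after_second = bayes_posterior(after_first, 0.99, 0.03)

print(f"After one positive test:  {after_first:.3f}")
print(f"After two positive tests: {after_second:.3f}")
```

The second positive result moves the estimate from roughly 63% to above 98%, which illustrates how quickly evidence accumulates when the independence assumption holds.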

2. Bayes’ Rule for Multiple Variables

When there are multiple evidence variables \(B, C, D, ...\), we generalize Bayes’ Rule:

\[ P(A | B, C, D) = \frac{P(B, C, D | A) P(A)}{P(B, C, D)} \]

By assuming conditional independence (naive assumption), we simplify:

\[ P(A | B, C, D) \propto P(A) P(B | A) P(C | A) P(D | A) \]

This forms the foundation of Naive Bayes Classification.
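As a small worked illustration of the proportionality above, the sketch below multiplies a prior by three per-variable likelihoods and then normalizes across the hypotheses. All numbers and the names `priors`, `likelihoods`, and `naive_posterior` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

# Hypothetical probabilities, for illustration only.
priors = {"A": 0.4, "not_A": 0.6}
likelihoods = {
    "A": {"B": 0.8, "C": 0.6, "D": 0.7},      # P(evidence | A)
    "not_A": {"B": 0.1, "C": 0.3, "D": 0.2},  # P(evidence | not A)
}

def naive_posterior(priors, likelihoods, evidence):
    """Posterior over hypotheses under the naive independence assumption."""
    scores = {}
    for hypothesis, prior in priors.items():
        # Unnormalized score: P(A) * P(B|A) * P(C|A) * P(D|A)
        scores[hypothesis] = prior * math.prod(
            likelihoods[hypothesis][e] for e in evidence
        )
    total = sum(scores.values())  # plays the role of P(B, C, D)
    return {h: s / total for h, s in scores.items()}

posterior = naive_posterior(priors, likelihoods, ["B", "C", "D"])
print(posterior)
```

Dividing by the sum of the scores is what turns the "proportional to" relation into actual probabilities; the denominator \(P(B, C, D)\) never has to be modeled directly.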

Python Example for Text Classification

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Load dataset
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(newsgroups.data)  # Convert text to bag-of-words
y = newsgroups.target

# Train Naive Bayes Classifier
nb = MultinomialNB()
nb.fit(X, y)

# Predict on new data
sample_text = ["Quantum computing is the future of AI"]
X_sample = vectorizer.transform(sample_text)
prediction = nb.predict(X_sample)
print(f"Predicted category: {newsgroups.target_names[prediction[0]]}")

3. Bayes’ Rule for Continuous Variables

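For continuous variables (e.g., temperature, weight), any single exact value has probability zero, so we replace discrete probabilities with Probability Density Functions (PDFs) such as the Gaussian Distribution:

\[ p(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]

where \(\mu\) is the mean and \(\sigma\) is the standard deviation. The value \(p(x)\) is a density, not a probability, but it can be used in Bayes’ Rule in place of the likelihood.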

Python Implementation

import numpy as np
from scipy.stats import norm

# Assume normal distribution with mean=70, std=10
mu, sigma = 70, 10
x_value = 75
probability = norm.pdf(x_value, mu, sigma)

print(f"Probability density of weight 75 given mean=70 and std=10: {probability:.4f}")
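To connect the density back to Bayes' Rule, the sketch below uses the Gaussian density as the likelihood for two classes and normalizes to get class posteriors. The class names, means, standard deviations, and equal priors are hypothetical values picked for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Hypothetical per-class (mean, std) for a single continuous feature.
class_params = {"class_0": (70, 10), "class_1": (85, 10)}
class_priors = {"class_0": 0.5, "class_1": 0.5}
x_value = 75

# Prior times density for each class, then normalize.
unnormalized = {
    c: class_priors[c] * gaussian_pdf(x_value, *class_params[c])
    for c in class_params
}
evidence = sum(unnormalized.values())
class_posterior = {c: v / evidence for c, v in unnormalized.items()}

print(class_posterior)
```

Because 75 is closer to the mean of class_0 than to the mean of class_1, class_0 ends up with the larger posterior, at a little under 60%.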

4. Naive Bayes Implementation for Text Classification

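Here we replace the CountVectorizer features from Section 2 with TF-IDF features and train a Multinomial Naive Bayes model. The code reuses newsgroups, y, MultinomialNB, and fetch_20newsgroups from Section 2.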

Python Implementation

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Convert dataset using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(newsgroups.data)

# Train Naive Bayes with TF-IDF
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_tfidf, y)

# Predict on test data
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
X_test_tfidf = tfidf_vectorizer.transform(newsgroups_test.data)
y_test = newsgroups_test.target

predictions = nb_tfidf.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, predictions)
print(f"Naive Bayes TF-IDF Accuracy: {accuracy:.2f}")
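TF-IDF reweights raw counts so that words appearing in many documents contribute less. The sketch below computes it by hand for a two-sentence corpus. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1, which we understand to be scikit-learn's default (smooth_idf=True), and it omits the L2 row normalization that TfidfVectorizer applies afterwards, so the numbers will not match sklearn's output exactly.

```python
import math
from collections import Counter

corpus = ["naive bayes is simple", "bayes models are powerful"]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency (assumed variant, see lead-in).
idf = {
    word: math.log((1 + n_docs) / (1 + count)) + 1
    for word, count in df.items()
}

# TF-IDF weight = term count in the document * idf of the term.
tfidf_rows = [
    {word: tf * idf[word] for word, tf in Counter(tokens).items()}
    for tokens in tokenized
]

for row in tfidf_rows:
    print(row)
```

The word "bayes" occurs in both documents, so it receives the minimum weight, while words unique to one document are weighted more heavily.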

5. CPU vs. GPU Implementation

CPU Implementation

from sklearn.naive_bayes import GaussianNB

# Simulating data
X_cpu = np.random.rand(10000, 10)
y_cpu = np.random.randint(2, size=10000)

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_cpu, y_cpu)

GPU Implementation with CuPy and cuML (requires a RAPIDS environment)

import cupy as cp
from cuml.naive_bayes import GaussianNB as cuGaussianNB

# Convert data to GPU
X_gpu = cp.asarray(X_cpu)
y_gpu = cp.asarray(y_cpu)

# Train on GPU
gnb_gpu = cuGaussianNB()
gnb_gpu.fit(X_gpu, y_gpu)

6. Bag of Words Exercise

Build a CountVectorizer from scratch and compare with sklearn.

Manual Count Vectorizer

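from collections import Counter

def count_vectorizer(corpus):
    # Tokenize each document once by splitting on whitespace
    tokenized = [text.split() for text in corpus]
    # Sort the vocabulary so the column order is reproducible
    vocab = sorted(set(word for tokens in tokenized for word in tokens))
    # One dict per document mapping every vocabulary word to its count
    vectorized = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectorized.append({word: counts[word] for word in vocab})
    return vectorized, vocab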

# Sample data
corpus = ["Naive Bayes is simple", "Bayes models are powerful"]
vectorized_data, vocab = count_vectorizer(corpus)

print(f"Vocabulary: {vocab}")
print(f"Vectorized Data: {vectorized_data}")

Comparison with Scikit-learn

cv = CountVectorizer()
X_cv = cv.fit_transform(corpus)
print(f"Sklearn CountVectorizer Output:\n{X_cv.toarray()}")

Key Takeaways

  1. Bayes’ Rule is a powerful probabilistic tool for updating beliefs based on evidence.
  2. Naive Bayes assumes features are conditionally independent given the class, yet it still performs well in many classification tasks.
  3. It is computationally efficient and well-suited for text classification, spam detection, and medical diagnostics.

Discussion Questions

  1. What are the limitations of the naive assumption in Naive Bayes?
  2. Why does Naive Bayes perform well in text classification even though the feature independence assumption rarely holds?
  3. How can we improve Naive Bayes for continuous data distributions?

Bag of Words Exercise: Step-by-Step Walkthrough

This exercise involves manually implementing CountVectorizer, which is a fundamental preprocessing step in Natural Language Processing (NLP). The goal is to convert text into a numerical representation before applying machine learning models.


Step 1: Understanding CountVectorizer

CountVectorizer is a method of converting text into a bag-of-words (BoW) representation, where:
- Each unique word in the dataset is assigned an index in a vocabulary.
- The text data is transformed into a matrix, where each row represents a document and each column represents a word.
- The values in the matrix indicate how many times a word appears in a document.

Example

For the text:

"Naive Bayes is simple"
"Bayes models are powerful"

The bag-of-words representation would count the occurrences of each word across documents.
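For these two sentences the resulting matrix is small enough to write out in full. The sketch below is a minimal illustration that lowercases the text and splits on whitespace; the variable names are illustrative only.

```python
docs = ["Naive Bayes is simple", "Bayes models are powerful"]

# Lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Sorted vocabulary: one column per unique word
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one count per vocabulary word
matrix = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
# ['are', 'bayes', 'is', 'models', 'naive', 'powerful', 'simple']
for row in matrix:
    print(row)
# [0, 1, 1, 0, 1, 0, 1]
# [1, 1, 0, 1, 0, 1, 0]
```

Only the "bayes" column is non-zero in both rows, since it is the only word the two documents share.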


Step 2: Loading the 20 Newsgroup Dataset

The 20 Newsgroups dataset is a collection of text documents categorized into 20 topics.

from sklearn.datasets import fetch_20newsgroups

# Load dataset without headers, footers, and quotes
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
documents = data.data[:5]  # Load only first 5 documents for demonstration

# Display sample data
for i, doc in enumerate(documents):
    print(f"Document {i+1}:\n{doc}\n{'-'*50}")

This dataset contains raw text that needs to be processed into a numerical format.


Step 3: Implementing Manual CountVectorizer

1. Tokenization

Convert text into words by splitting on spaces and removing punctuation.

2. Building Vocabulary

Create a dictionary mapping each unique word to an index.

3. Constructing the Count Matrix

For each document, count how many times each word appears.

from collections import Counter
import re

def custom_count_vectorizer(corpus):
    # Preprocess: Remove punctuation and lowercase
    corpus_clean = [re.sub(r'\W+', ' ', doc).lower() for doc in corpus]

    # Tokenize
    tokenized_docs = [doc.split() for doc in corpus_clean]

    # Build vocabulary
    vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
    word_index = {word: idx for idx, word in enumerate(vocabulary)}

    # Count word occurrences
    matrix = []
    for doc in tokenized_docs:
        word_counts = Counter(doc)
        row = [word_counts.get(word, 0) for word in vocabulary]
        matrix.append(row)

    return vocabulary, matrix

# Apply custom CountVectorizer
vocab, count_matrix = custom_count_vectorizer(documents[:5])

# Display results
print("Vocabulary:\n", vocab)
print("\nCount Matrix:\n", count_matrix)

Step 4: Comparing with Sklearn’s CountVectorizer

To validate our implementation, we compare it against Scikit-learn’s CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

# Apply Scikit-learn's CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents[:5])

# Display vocabulary
print("\nScikit-learn Vocabulary:\n", vectorizer.get_feature_names_out())

# Display transformed text representation
print("\nScikit-learn Count Matrix:\n", X.toarray())

Step 5: Key Differences Between Manual and Sklearn CountVectorizer

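| Feature | Custom Implementation | Sklearn CountVectorizer |
|---|---|---|
| Tokenization | Basic space-splitting and regex | Regex token pattern with optional stop-word removal and n-grams |
| Vocabulary | Built manually using a set | Built internally; can be pruned with min_df, max_df, and max_features |
| Count Matrix | Created with Counter (dense list of lists) | SciPy sparse matrix |
| Speed | Slower for large datasets | Generally faster and more memory-efficient on large datasets |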

Step 6: Performance Considerations

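Since CountVectorizer is a CPU-based method, the sketch below stores the count matrix in a CuPy array on the GPU. Tokenization and vocabulary building still run on the CPU, and filling a GPU array one element at a time is slow, so treat this as an illustration of the data flow and not as a guaranteed speedup. The code reuses re and Counter imported in Step 3.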

import cupy as cp

def gpu_count_vectorizer(corpus):
    corpus_clean = [re.sub(r'\W+', ' ', doc).lower() for doc in corpus]
    tokenized_docs = [doc.split() for doc in corpus_clean]
    
    vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
    word_index = {word: idx for idx, word in enumerate(vocabulary)}
    
    matrix = cp.zeros((len(tokenized_docs), len(vocabulary)), dtype=cp.int32)
    
    for doc_idx, doc in enumerate(tokenized_docs):
        word_counts = Counter(doc)
        for word, count in word_counts.items():
            if word in word_index:
                matrix[doc_idx, word_index[word]] = count
    
    return vocabulary, matrix

# Apply GPU CountVectorizer
gpu_vocab, gpu_count_matrix = gpu_count_vectorizer(documents[:5])

print("\nGPU Vocabulary:", gpu_vocab)
print("\nGPU Count Matrix:\n", gpu_count_matrix.get())

Key Takeaways

  1. Bag-of-Words is an essential text preprocessing step in NLP, transforming text into a numerical format.
  2. Naive Bayes can work effectively on text classification despite its simplifying assumptions.
  3. GPU acceleration can improve efficiency when handling large-scale text datasets.

Discussion Questions

  1. What are the benefits and limitations of using a Bag-of-Words model?
  2. Why does TF-IDF often outperform simple CountVectorizer for text classification?
  3. How can we improve our custom CountVectorizer to include stopword removal and stemming?


Study Guide: QTW Module 6 – Naive Bayes

  1. Overview of Bayes’ Rule

Bayes’ Rule helps update an initial belief (prior) after observing new evidence (likelihood). It is expressed mathematically as:

P(A | B) = [ P(B | A) × P(A) ] / P(B)

  • P(A | B): Posterior Probability (updated belief)
  • P(B | A): Likelihood (probability of observing B given A is true)
  • P(A): Prior Probability (initial belief)
  • P(B): Normalizing Factor (ensures the result is a valid probability)

Example: Drug Test
  - True Positive Rate = 99%
  - True Negative Rate = 97% (so the False Positive Rate is 3%)
  - Prevalence = 5%

P(A | B) = [0.99 × 0.05] / [ (0.99 × 0.05) + (0.03 × 0.95 ) ] ≈ 63.5%

  2. Bayes’ Rule for Multiple Variables

When dealing with multiple evidence variables (B, C, D, …), Bayes’ Rule under the naive assumption becomes:

P(A | B, C, D) ∝ P(A) × P(B | A) × P(C | A) × P(D | A)

The “naive” assumption is that all evidence variables B, C, D are conditionally independent given A, allowing us to multiply probabilities rather than consider their complex joint distributions.

  3. Bayes’ Rule with Continuous Variables

For continuous attributes (temperature, weight), P(X = x) is zero for any single value, so we use a probability density function (e.g., Gaussian) in its place. If the data is assumed normally distributed, we can compute:

p(x; μ, σ) = (1 / (σ√(2π))) × exp[−(x − μ)² / (2σ²)]

We then insert these densities into Bayes’ framework just as we do with probabilities for discrete or categorical data.

  4. Naive Bayes in Practice (Demo)
    • Often used for text classification with bag-of-words or tf-idf features.
    • MultinomialNB from sklearn is commonly applied:
      – Alpha parameter (Laplace smoothing) handles zero-frequency words.
      – Very fast, even on large datasets.

  5. Naive Bayes vs. Logistic Regression
    • Naive Bayes (generative model) and Logistic Regression (discriminative model) form a generative-discriminative pair.
    • Naive Bayes converges faster to a near-optimal solution with limited data; Logistic Regression can yield a lower asymptotic error with sufficient data.
    • Both are deeply connected through the logit function and ratio forms of probability.

  6. Bag of Words Exercise
    • Task: Build a custom CountVectorizer using Python.
    • Compare with sklearn’s CountVectorizer to observe differences in tokenization, vocabulary building, and performance.
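The link between Naive Bayes and Logistic Regression noted above can be made explicit. The following is a sketch of the standard argument for a binary class Y and features x_1, …, x_n: applying Bayes’ Rule to both classes and taking the log of the ratio cancels the shared denominator, and the naive assumption turns the remaining joint likelihood into a sum.

```latex
\[
\log \frac{P(Y = 1 \mid x_1, \dots, x_n)}{P(Y = 0 \mid x_1, \dots, x_n)}
= \log \frac{P(Y = 1)}{P(Y = 0)}
+ \sum_{i=1}^{n} \log \frac{P(x_i \mid Y = 1)}{P(x_i \mid Y = 0)}
\]
```

In the multinomial model with word counts x_i, the i-th term is x_i multiplied by a constant, so the log-odds are linear in the features. That is the same functional form, w · x + b, that Logistic Regression fits directly. The two models differ in how the weights are obtained: Naive Bayes derives them from class-conditional frequencies, while Logistic Regression optimizes them against the labels.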

Mathematical Representation (Core Formulas)

  1. Bayes’ Rule (Single Variable): P(A | B) = [P(B | A) P(A)] / P(B)

  2. Bayes’ Rule (Multivariable): P(A | B, C, D) ∝ P(A) × P(B | A) × P(C | A) × P(D | A)

  3. Gaussian Density (Continuous Example): p(x) = 1 / (σ√(2π)) × exp[−(x − μ)² / (2σ²)]

Code Snippets

A. CPU Implementation (Naive Bayes – Sklearn)

from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score

# 1) Load Data
train_data = fetch_20newsgroups(subset='train', remove=('headers','footers','quotes'))
test_data  = fetch_20newsgroups(subset='test',  remove=('headers','footers','quotes'))

# 2) Transform Text (CountVectorizer)
cv = CountVectorizer()
X_train_cv = cv.fit_transform(train_data.data)
X_test_cv  = cv.transform(test_data.data)

# 3) Train Naive Bayes
model_cv = MultinomialNB()
model_cv.fit(X_train_cv, train_data.target)

# 4) Evaluate
preds_cv = model_cv.predict(X_test_cv)
print("Accuracy with CountVectorizer:", accuracy_score(test_data.target, preds_cv))

# 5) Using TF-IDF
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(train_data.data)
X_test_tfidf  = tfidf.transform(test_data.data)

model_tfidf = MultinomialNB()
model_tfidf.fit(X_train_tfidf, train_data.target)

preds_tfidf = model_tfidf.predict(X_test_tfidf)
print("Accuracy with TF-IDF:", accuracy_score(test_data.target, preds_tfidf))

B. Simple GPU Implementation (Naive Bayes with cuML)

(You need an environment with RAPIDS/cuML installed for this.)

import cupy as cp
import cupyx.scipy.sparse as cpx_sparse  # GPU sparse matrices live here in current CuPy releases
from cuml.naive_bayes import MultinomialNB as cuMultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Assume you have train_data & test_data from above
tfidf_gpu = TfidfVectorizer()
X_train_gpu = tfidf_gpu.fit_transform(train_data.data)
X_test_gpu  = tfidf_gpu.transform(test_data.data)

# Convert the SciPy CSR matrices and labels to GPU arrays
X_train_gpu_csr = cpx_sparse.csr_matrix(X_train_gpu)
X_test_gpu_csr  = cpx_sparse.csr_matrix(X_test_gpu)
y_train_gpu     = cp.asarray(train_data.target)
y_test_gpu      = cp.asarray(test_data.target)

# Train
model_gpu = cuMultinomialNB()
model_gpu.fit(X_train_gpu_csr, y_train_gpu)

# Predict
preds_gpu = model_gpu.predict(X_test_gpu_csr)
accuracy_gpu = (preds_gpu.get() == y_test_gpu.get()).mean()
print("GPU-based Naive Bayes Accuracy:", accuracy_gpu)

Study Guide Format (Condensed)

I. Introduction
  - Bayes’ Rule, Prior, Likelihood, Posterior
  - Connection to logistic regression

II. Mathematical Foundations
III. Practical Implementation
IV. Comparison

V. Key Takeaways
  1. Naive Bayes uses Bayes’ Rule with a simplifying (naive) independence assumption, yet often performs competitively in classification tasks, especially text.
  2. Combining TF-IDF with MultinomialNB frequently yields better accuracy than raw bag-of-words counts.
  3. Generative (Naive Bayes) and Discriminative (Logistic Regression) models are mathematically related through the logit (odds ratio) framework.

VI. Discussion Questions
  1. Why does Naive Bayes often perform well in high-dimensional text classification, despite the naive independence assumption?
  2. How do smoothing parameters (like alpha) influence Naive Bayes performance?
  3. In what cases might Logistic Regression eventually outperform Naive Bayes, and why?

Below is a more detailed walkthrough of Part 6 – The Bag of Words Exercise, showing how to build a custom CountVectorizer and comparing it with sklearn’s version. This will help solidify your understanding of how text is converted into numerical features for Naive Bayes and other machine learning models.

Bag of Words Exercise – Step-by-Step

  1. Understanding the Goal

We want to:
    • Load the 20 Newsgroups dataset (without headers, footers, and quotes).
    • Create our own Bag-of-Words approach (basic CountVectorizer).
    • Compare against the sklearn implementation (CountVectorizer).

  2. Loading the 20 Newsgroups Dataset

The dataset can be fetched with:

data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

This returns a dictionary-like object, including .data (list of documents) and .target (array of numeric labels).

  3. Building a Custom CountVectorizer

Our custom version will:
    • Tokenize each document into words (split on whitespace, remove punctuation).
    • Build a vocabulary (unique words mapped to indices).
    • Count how often each word appears in each document.

Below is sample code to illustrate these steps.

import re
from collections import Counter
from sklearn.datasets import fetch_20newsgroups

def custom_count_vectorizer(corpus):
    # 1. Clean and lowercase text, remove punctuation
    cleaned_texts = [re.sub(r'[^a-zA-Z\s]', '', doc.lower()) for doc in corpus]
    
    # 2. Tokenize by splitting on whitespace
    tokenized_docs = [doc.split() for doc in cleaned_texts]
    
    # 3. Build vocabulary (all unique words)
    #    Convert to a sorted list to ensure consistent ordering.
    vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
    
    # 4. Create a dict mapping each word to its index
    word2index = {word: idx for idx, word in enumerate(vocabulary)}
    
    # 5. Construct the count matrix
    #    Each row represents a document, each column a word in our vocabulary.
    count_matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Build a row of length = size of vocabulary
        row_vector = [counts.get(word, 0) for word in vocabulary]
        count_matrix.append(row_vector)
    
    return vocabulary, count_matrix

# 1) Fetch data
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
corpus = data.data[:5]  # Using first 5 docs as a small example

# 2) Apply our custom CountVectorizer
vocab, matrix = custom_count_vectorizer(corpus)

print("Custom Vocabulary (first 20 words):", vocab[:20])
print("\nFirst Document's Vector Representation:")
print(matrix[0])

  4. Comparing with sklearn’s CountVectorizer

Now let’s use the sklearn version to see how it differs.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')  # optionally remove stopwords
X = cv.fit_transform(corpus)

print("\nSklearn Vocabulary Size:", len(cv.vocabulary_))
print("Sklearn First Document's Vector:")
print(X.toarray()[0])

Differences You Might Notice
  • Stopword Removal: sklearn’s CountVectorizer removes common words like “the” and “and” only if you pass stop_words='english', as in the code above; by default it keeps them. Our custom version keeps them unless we add that step ourselves.
  • Punctuation Handling: sklearn’s default tokenizer treats punctuation as a separator and keeps only tokens of two or more alphanumeric characters. Our custom version uses a simple regex that also drops digits.
  • Sparse Representation: sklearn returns a sparse matrix (efficient for large text). Ours is a plain list of lists (dense matrix).
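To narrow the first of these gaps, the custom tokenizer can be extended with stop word filtering and two-word phrases. The sketch below is one possible approach; the `STOP_WORDS` set is a tiny hypothetical list (scikit-learn's built-in English list is far longer), and the helper names `tokenize` and `add_bigrams` are illustrative.

```python
import re

# Tiny hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "on", "a", "is", "and"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, strip non-letters, split, and drop stop words."""
    words = re.sub(r"[^a-zA-Z\s]", "", text.lower()).split()
    return [w for w in words if w not in stop_words]

def add_bigrams(tokens):
    """Return the unigrams followed by all adjacent two-word phrases."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

tokens = tokenize("The cat sat on the mat.")
features = add_bigrams(tokens)
print(features)
# ['cat', 'sat', 'mat', 'cat sat', 'sat mat']
```

The resulting feature list can be passed to the same vocabulary-building and counting steps as before. Note that bigrams here are formed after stop words are removed, so "sat mat" spans words that were not adjacent in the original sentence.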

  5. (Optional) GPU-Accelerated Version

If you have the RAPIDS environment set up (with CuPy and cuML), you could adapt your code to build counts on the GPU. A “homegrown” version might look like this (it reuses re and Counter imported earlier):

import cupy as cp

def gpu_count_vectorizer(corpus):
    # Step 1: Preprocess & Tokenize on CPU (or GPU if you prefer)
    cleaned_texts = [re.sub(r'[^a-zA-Z\s]', '', doc.lower()) for doc in corpus]
    tokenized_docs = [doc.split() for doc in cleaned_texts]
    
    # Step 2: Build vocabulary on CPU for simplicity
    vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
    word2index = {word: idx for idx, word in enumerate(vocabulary)}

    # Step 3: Initialize a GPU array to hold counts
    rows, cols = len(tokenized_docs), len(vocabulary)
    gpu_matrix = cp.zeros((rows, cols), dtype=cp.int32)
    
    # Step 4: Fill in the matrix
    for i, doc in enumerate(tokenized_docs):
        doc_counts = Counter(doc)
        for word, count in doc_counts.items():
            if word in word2index:
                j = word2index[word]
                gpu_matrix[i, j] = count
    
    return vocabulary, gpu_matrix

# Example usage
vocab_gpu, matrix_gpu = gpu_count_vectorizer(corpus[:5])
print("GPU Count Matrix shape:", matrix_gpu.shape)
print("GPU Count Matrix (first row, back to CPU):", matrix_gpu[0].get())

Note: This example shows the concept. In practice, you’d want to handle more steps on GPU to maximize speed.

Summary of the Exercise
  • Objective: Implement a simplified Bag-of-Words manually.
  • Key Observation: The core steps – tokenizing, building a vocabulary, counting words – are straightforward, but sklearn optimizes them for performance and adds many extra features (stopword removal, n-grams, etc.).

Three Key Takeaways (Part 6 Focus)
  1. Implementation Control: Building CountVectorizer from scratch clarifies how text preprocessing works under the hood, but libraries like sklearn are more flexible and optimized.
  2. Vocabulary Size: Real-world text data sets can contain tens or hundreds of thousands of unique tokens; efficient data structures (sparse matrices) are critical.
  3. Preprocessing Matters: Removing stopwords, handling punctuation, or applying stemming/lemmatization can significantly change your results.

Three Discussion Questions
  1. How would you modify your custom CountVectorizer to remove stopwords, or to handle n-grams (like two-word phrases)?
  2. What might be the tradeoffs between using a dense vs. sparse matrix representation for large corpora?
  3. How could we extend this approach for advanced text processing, such as adding TF-IDF or word embeddings (GloVe, Word2Vec)?

Bayes’ Rule and Naive Bayes Study Guide - Part 1

I. Fundamentals of Bayes’ Rule

Core Equation

Bayes’ Rule is expressed as:

P(A|B) = P(B|A) × P(A) / P(B)

Where:
- P(A|B) = Posterior Probability
- P(B|A) = Likelihood
- P(A) = Prior Probability
- P(B) = Evidence/Normalization

Components Breakdown:

  1. Posterior Distribution - P(A|B)
    • The probability of event A occurring given B has occurred
    • What we’re typically trying to calculate
    • Updated probability after considering evidence
  2. Prior - P(A)
    • Initial probability before considering new evidence
    • Based on previous knowledge or assumptions
    • Starting point for Bayesian inference

II. Basic Python Implementation (CPU)

import numpy as np
from sklearn.naive_bayes import GaussianNB

class BayesianCalculator:
    def simple_bayes(self, prior_a, likelihood_b_given_a, evidence_b):
        """
        Calculate posterior probability using Bayes' Rule
        """
        posterior = (likelihood_b_given_a * prior_a) / evidence_b
        return posterior

    def naive_bayes_example(self):
        # Sample data
        X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
        y = np.array([0, 0, 1, 1])
        
        # Initialize and train model
        model = GaussianNB()
        model.fit(X, y)
        
        # Make prediction
        prediction = model.predict([[2, 3]])
        return prediction

III. GPU Implementation with CuPy

import cupy as cp

class GPUBayesianCalculator:
    def __init__(self):
        self.device = 'gpu' if cp.cuda.is_available() else 'cpu'
    
    def gpu_naive_bayes(self, X, y):
        """
        GPU-accelerated Naive Bayes implementation
        """
        if self.device == 'gpu':
            X = cp.array(X)
            y = cp.array(y)
            
            # Calculate class priors
            classes = cp.unique(y)
            class_probs = cp.array([cp.mean(y == c) for c in classes])
            
            return class_probs.get()  # Transfer back to CPU
        else:
            return "GPU not available"

Bayes’ Rule and Naive Bayes Study Guide - Part 2

IV. Naive Bayes Core Concepts

Critical Assumption

The fundamental assumption of Naive Bayes is feature independence. Mathematically:

P(X₁,X₂,...,Xₙ|Y) = P(X₁|Y) × P(X₂|Y) × ... × P(Xₙ|Y)

Multiclass Classification

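For a problem with k classes, we compute the unnormalized posterior of every class and predict the class with the largest value:

ŷ = argmax over c of P(Y = c) × P(X₁|Y = c) × ... × P(Xₙ|Y = c)

In scikit-learn, predict_proba returns the normalized posterior for each class: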

import numpy as np
from sklearn.naive_bayes import MultinomialNB

class MulticlassNaiveBayes:
    def multiclass_example(self):
        # Example with multiple classes
        X = np.array([[1,2], [2,3], [3,4], [4,5], [5,6]])
        y = np.array([0, 1, 2, 1, 2])  # Three classes: 0, 1, 2
        
        # Initialize and train
        clf = MultinomialNB()
        clf.fit(X, y)
        
        # Predict probabilities for each class
        probs = clf.predict_proba([[3,4]])
        # Returns probability for each class
        return probs

V. Advanced Implementation with GPU Acceleration

import cupy as cp

class GPUNaiveBayes:
    def __init__(self):
        self.classes_ = None
        self.class_priors_ = None
        self.class_means_ = None
        self.class_vars_ = None

    def fit(self, X, y):
        if not cp.cuda.is_available():
            raise RuntimeError("CUDA is not available")
            
        # Convert to GPU arrays
        X = cp.array(X)
        y = cp.array(y)
        
        self.classes_ = cp.unique(y)
        n_classes = len(self.classes_)
        n_features = X.shape[1]
        
        # Initialize parameters
        self.class_priors_ = cp.zeros(n_classes)
        self.class_means_ = cp.zeros((n_classes, n_features))
        self.class_vars_ = cp.zeros((n_classes, n_features))
        
        # Calculate parameters for each class
        for i, c in enumerate(self.classes_):
            X_c = X[y == c]
            self.class_priors_[i] = X_c.shape[0] / X.shape[0]
            self.class_means_[i] = X_c.mean(axis=0)
            self.class_vars_[i] = X_c.var(axis=0)
            
        return self
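The class above stops after estimating per-class priors, means, and variances. As a sketch of how those parameters are used at prediction time, the example below implements the same fit step plus a log-space predict step in plain Python, not CuPy, so that it runs without a GPU. The function names and the small additive `var_smoothing` constant are assumptions made for this illustration.

```python
import math

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate per-class prior, feature means, and feature variances."""
    params = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        # Population variance plus a small constant to avoid division by zero
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + var_smoothing
            for j in range(n_features)
        ]
        params[c] = (len(rows) / len(X), means, variances)
    return params

def predict_gaussian_nb(params, x):
    """Return the class with the highest log posterior for one sample."""
    best_class, best_score = None, -math.inf
    for c, (prior, means, variances) in params.items():
        # log P(Y=c) + sum of log Gaussian densities, one per feature
        score = math.log(prior)
        for value, mu, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 0, 1, 1]
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, [2, 3]))  # 0
print(predict_gaussian_nb(params, [4, 5]))  # 1
```

Working in log space keeps the product of many small densities from underflowing. A GPU version would replace the Python loops with array operations over all samples at once.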

VI. Comparison with Linear Regression

Key Differences:

  1. Purpose
    • Naive Bayes: Classification
    • Linear Regression: Continuous value prediction
  2. Mathematical Foundation
    • Naive Bayes: P(Y|X) ∝ P(X|Y)P(Y)
    • Linear Regression: Y = βX + ε
  3. Output Type
    • Naive Bayes: Probabilities for each class
    • Linear Regression: Continuous values

VII. Key Takeaways

  1. Naive Bayes derives its power from Bayes’ theorem while assuming feature independence
  2. It naturally handles multiclass problems without special modifications
  3. GPU implementation can significantly speed up calculations for large datasets

VIII. Practice Questions

  1. How does the independence assumption in Naive Bayes affect its performance on real-world datasets where features are often correlated?

  2. In what scenarios would you choose Naive Bayes over other classification algorithms?

  3. How does the computational complexity of Naive Bayes compare between CPU and GPU implementations for different dataset sizes?

Bayes’ Rule and Naive Bayes Study Guide - Part 3

IX. Practical Applications and Use Cases

Common Applications

  1. Text Classification
  2. Spam Detection
  3. Sentiment Analysis
  4. Medical Diagnosis

Implementation Example for Text Classification

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import cupy as cp

class TextClassificationNB:
    def __init__(self, use_gpu=False):
        self.vectorizer = CountVectorizer()
        self.classifier = MultinomialNB()
        self.use_gpu = use_gpu and cp.cuda.is_available()
        
    def preprocess_text(self, texts):
        # Convert text to numerical features
        X = self.vectorizer.fit_transform(texts)
        if self.use_gpu:
            return cp.array(X.toarray())
        return X
    
    def train(self, texts, labels):
        X = self.preprocess_text(texts)
        if self.use_gpu:
            labels = cp.array(labels)
        self.classifier.fit(X, labels)
        
    def predict(self, new_texts):
        X = self.vectorizer.transform(new_texts)
        if self.use_gpu:
            X = cp.array(X.toarray())
        return self.classifier.predict(X)

X. Performance Optimization Tips

1. Data Preprocessing

import numpy as np
import pandas as pd  # needed for get_dummies below

class DataPreprocessor:
    def __init__(self):
        self.scaler = None
        
    def handle_missing_values(self, X):
        # Replace missing values with mean
        return np.nan_to_num(X, nan=np.nanmean(X))
        
    def normalize_features(self, X):
        # Log transformation for skewed features
        return np.log1p(X)
        
    def handle_categorical_features(self, X):
        # One-hot encoding
        return pd.get_dummies(X).values

2. GPU Memory Management

class GPUMemoryManager:
    def __init__(self):
        self.memory_pool = cp.cuda.MemoryPool()
        
    def __enter__(self):
        cp.cuda.set_allocator(self.memory_pool.malloc)
        return self  # so "with GPUMemoryManager() as mgr:" binds the manager
        
    def __exit__(self, *args):
        self.memory_pool.free_all_blocks()

XI. Common Pitfalls and Solutions

1. Zero Probability Problem

class LaplaceSmoothingNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Smoothing parameter
        
    def calculate_probability(self, feature_counts, total_counts):
        # Add smoothing to prevent zero probabilities
        return (feature_counts + self.alpha) / (total_counts + self.alpha * len(feature_counts))
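A short worked example makes the effect of smoothing concrete. The sketch below applies the same formula to a hypothetical three-word vocabulary in which one word never appears in the training data for a class; the function name `laplace_smooth` and the counts are illustrative.

```python
def laplace_smooth(feature_counts, alpha=1.0):
    """Smoothed probabilities: (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(feature_counts)
    vocab_size = len(feature_counts)
    return [(count + alpha) / (total + alpha * vocab_size) for count in feature_counts]

counts = [3, 0, 1]  # hypothetical word counts for one class

unsmoothed = [c / sum(counts) for c in counts]
smoothed = laplace_smooth(counts, alpha=1.0)

print(unsmoothed)  # the second word gets probability 0
print(smoothed)    # 4/7, 1/7, 2/7
```

Without smoothing, a single unseen word would force the whole product of likelihoods to zero for that class. With alpha = 1 every word keeps a small non-zero probability, and the smoothed values still sum to 1.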

2. Numerical Stability

def log_sum_exp(x):
    """Numerically stable log sum exp."""
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))
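The sketch below shows where this helper matters in Naive Bayes: normalizing class scores that are stored as log probabilities. It restates the function with the standard library so that the example is self-contained, and the two log scores are hypothetical values of the size that long documents can produce.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    max_value = max(values)
    return max_value + math.log(sum(math.exp(v - max_value) for v in values))

# Hypothetical unnormalized log posteriors for two classes.
log_scores = [-1000.0, -1001.0]

# Naive approach: exponentiating directly underflows to zero.
naive_total = sum(math.exp(s) for s in log_scores)

# Stable approach: subtract the log of the evidence in log space.
log_evidence = log_sum_exp(log_scores)
posteriors = [math.exp(s - log_evidence) for s in log_scores]

print(naive_total)
print(posteriors)
```

The naive sum evaluates to 0.0, so dividing by it would fail, while the log-space version recovers posteriors of roughly 0.73 and 0.27.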

XII. Advanced Techniques

1. Feature Selection

from sklearn.feature_selection import SelectKBest, chi2

class FeatureSelector:
    def __init__(self, k=10):
        self.selector = SelectKBest(chi2, k=k)
        
    def select_features(self, X, y):
        return self.selector.fit_transform(X, y)

2. Cross-Validation

from sklearn.model_selection import cross_val_score

def evaluate_model(model, X, y, cv=5):
    scores = cross_val_score(model, X, y, cv=cv)
    return {
        'mean_accuracy': scores.mean(),
        'std_accuracy': scores.std()
    }

XIII. Final Key Takeaways

  1. Understanding the independence assumption is crucial for effective implementation
  2. GPU acceleration can provide significant speedup for large datasets
  3. Proper preprocessing and handling of edge cases is essential for robust models

XIV. Additional Practice Questions

  1. How would you handle highly imbalanced datasets when using Naive Bayes?
  2. What techniques can be used to determine the optimal smoothing parameter?
  3. In what scenarios might GPU acceleration actually slow down computation?

XV. Study Tips

  1. Focus on understanding the probabilistic foundations
  2. Practice with real-world datasets
  3. Experiment with both CPU and GPU implementations
  4. Pay attention to preprocessing steps
  5. Understand the limitations and assumptions

Three thought-provoking and logically comprehensive questions about Naive Bayes:

  1. “Why does Naive Bayes often perform well in high-dimensional text classification, despite the naive independence assumption?”
    • Addresses the fundamental paradox of Naive Bayes
    • Requires understanding of both theory and practical applications
    • Challenges students to think about the relationship between model assumptions and real-world performance
    • Connects to real-world applications in text classification
  2. “How does the independence assumption in Naive Bayes affect its performance on real-world datasets where features are often correlated?”
    • Forces consideration of the core assumptions
    • Requires analysis of real-world implications
    • Encourages critical thinking about model limitations
    • Connects theoretical concepts to practical applications
  3. “In what scenarios would you choose Naive Bayes over other classification algorithms?”
    • Requires comparative analysis
    • Demands understanding of various algorithms’ strengths and weaknesses
    • Focuses on practical decision-making
    • Encourages consideration of real-world constraints and requirements

Takeaways:

  1. The Independence Assumption’s Paradox
    • Despite its “naive” assumption that features are conditionally independent given the class (which is rarely true in real-world data), Naive Bayes often performs surprisingly well in practice
    • This is particularly true in text classification and high-dimensional problems
    • Understanding: classification only needs the correct class to receive the highest score, so the model’s simplicity and efficiency often outweigh the limitations of the independence assumption
    • Mathematical representation: P(x₁,x₂|y) = P(x₁|y) × P(x₂|y)
  2. Probabilistic Foundation and Scalability
    • Naive Bayes is fundamentally a probabilistic classifier based on Bayes’ Theorem
    • It’s highly scalable and can handle large datasets efficiently
    • Training requires a single pass over the data, roughly O(n × d) for n samples and d features, making it one of the fastest classifiers to train
    • Key equation: P(y|x) = P(x|y)P(y)/P(x)
    • This makes it particularly valuable for real-time applications and large-scale classification tasks
  3. Susceptibility to Zero Probability Problem
    • When a feature value in the test data never appears with a class in the training data, its estimated conditional probability is zero, which zeroes out the whole product for that class
    • This is solved through smoothing techniques (like Laplace/Additive smoothing)
    • Critical for practical implementation: P(x|y) = (count(x,y) + α)/(count(y) + α|V|), where |V| is the number of distinct feature values (e.g., the vocabulary size)
    • Understanding this limitation and its solutions is crucial for effective implementation

These takeaways are particularly important because they:
  • Cover both theoretical foundations and practical implications
  • Address common challenges and solutions
  • Explain why and when the algorithm works well
  • Provide essential knowledge for real-world applications
  • Help in making informed decisions about when to use Naive Bayes

Each of these points affects:
  • Model selection decisions
  • Implementation approaches
  • Performance optimization
  • Problem-solving strategies
  • Understanding of results

---
title: "7333 Module 6 - Naive Bayes"
author: "Jessica McPhaul - for QTW Spring 2025. Dr. Slater"
output: html_notebook
---


---

# **Study Guide: QTW Module 6 - Naive Bayes**

## **1. Bayes' Rule: Overview and Mathematical Representation**
Bayes' Rule is a fundamental theorem in probability that helps update prior beliefs based on new evidence. It is the backbone of Bayesian inference and plays a crucial role in probabilistic modeling.

### **Mathematical Formula**
\[
P(A | B) = \frac{P(B | A) P(A)}{P(B)}
\]
Where:
- \( P(A | B) \) = **Posterior Probability** (updated belief after seeing evidence)
- \( P(B | A) \) = **Likelihood** (how likely B occurs given A is true)
- \( P(A) \) = **Prior Probability** (initial belief before seeing evidence)
- \( P(B) \) = **Normalization Factor** (ensures valid probability distribution)

#### **Example: Drug Testing**
Given:
- **True positive rate** = 99% (\( P(B | A) = 0.99 \))
- **False positive rate** = 3% (\( P(B | \neg A) = 0.03 \))
- **Prevalence of drug use** = 5% (\( P(A) = 0.05 \))

The probability that a person who tested positive is actually a drug user is:
\[
P(A | B) = \frac{(0.99 \times 0.05)}{(0.99 \times 0.05) + (0.03 \times 0.95)}
\]

Using Python:

```python
P_A = 0.05  # Prior probability of being a drug user
P_B_given_A = 0.99  # True positive rate
P_B_given_not_A = 0.03  # False positive rate
P_not_A = 1 - P_A

P_B = (P_B_given_A * P_A) + (P_B_given_not_A * P_not_A)  # Normalization factor
P_A_given_B = (P_B_given_A * P_A) / P_B  # Bayes' Rule

print(f"Probability of being a drug user given a positive test: {P_A_given_B:.3f}")
```

---

## **2. Bayes' Rule for Multivariables**
When there are multiple evidence variables \( B, C, D, ... \), we generalize Bayes' Rule:
\[
P(A | B, C, D) = \frac{P(B, C, D | A) P(A)}{P(B, C, D)}
\]
By assuming conditional independence (naive assumption), we simplify:
\[
P(A | B, C, D) \propto P(A) P(B | A) P(C | A) P(D | A)
\]

This forms the foundation of **Naive Bayes Classification**.
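
To make the proportionality concrete, here is a small sketch with two classes and three binary evidence variables. All probabilities are invented for illustration; the point is only that the unnormalized scores \( P(A) P(B | A) P(C | A) P(D | A) \) are computed per class and then normalized so they sum to 1:

```python
# Hypothetical priors and per-feature likelihoods for two classes
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    # P(evidence is present | class) for evidence variables B, C, D
    "spam": {"B": 0.8, "C": 0.6, "D": 0.3},
    "ham":  {"B": 0.1, "C": 0.4, "D": 0.5},
}

def naive_bayes_posterior(priors, likelihoods, observed):
    """Return P(class | observed evidence) under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[cls][feature]  # multiply P(feature | class)
        scores[cls] = score
    total = sum(scores.values())  # plays the role of P(B, C, D)
    return {cls: s / total for cls, s in scores.items()}

posterior = naive_bayes_posterior(priors, likelihoods, observed=["B", "C", "D"])
print(posterior)
```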

#### **Python Example for Text Classification**
```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Load dataset
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(newsgroups.data)  # Convert text to bag-of-words
y = newsgroups.target

# Train Naive Bayes Classifier
nb = MultinomialNB()
nb.fit(X, y)

# Predict on new data
sample_text = ["Quantum computing is the future of AI"]
X_sample = vectorizer.transform(sample_text)
prediction = nb.predict(X_sample)
print(f"Predicted category: {newsgroups.target_names[prediction[0]]}")
```

---

## **3. Bayes' Rule for Continuous Variables**
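For continuous variables (e.g., temperature, weight), the class-conditional likelihood is modeled with a **Probability Density Function (PDF)** such as the **Gaussian Distribution**. Because \( P(X = x) = 0 \) for any single value of a continuous variable, we work with the density \( p(x) \) instead:
\[
p(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
\]
where \( \mu \) is the mean and \( \sigma \) is the standard deviation.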

#### **Python Implementation**
```python
import numpy as np
from scipy.stats import norm

# Assume normal distribution with mean=70, std=10
mu, sigma = 70, 10
x_value = 75
probability = norm.pdf(x_value, mu, sigma)

print(f"Probability density of weight 75 given mean=70 and std=10: {probability:.4f}")
```

---

## **4. Naive Bayes Implementation for Text Classification**
Using **TF-IDF** features in place of the raw counts from **CountVectorizer**, we train a **Multinomial Naive Bayes model** and evaluate it on the held-out test split.

#### **Python Implementation**
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Convert dataset using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(newsgroups.data)

# Train Naive Bayes with TF-IDF
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_tfidf, y)

# Predict on test data
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
X_test_tfidf = tfidf_vectorizer.transform(newsgroups_test.data)
y_test = newsgroups_test.target

predictions = nb_tfidf.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, predictions)
print(f"Naive Bayes TF-IDF Accuracy: {accuracy:.2f}")
```

---

## **5. CPU vs. GPU Implementation**
### **CPU Implementation**
```python
from sklearn.naive_bayes import GaussianNB

# Simulating data
X_cpu = np.random.rand(10000, 10)
y_cpu = np.random.randint(2, size=10000)

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_cpu, y_cpu)
```

### **GPU Implementation with CuPy and cuML**
```python
import cupy as cp
from cuml.naive_bayes import GaussianNB as cuGaussianNB

# Convert data to GPU
X_gpu = cp.asarray(X_cpu)
y_gpu = cp.asarray(y_cpu)

# Train on GPU
gnb_gpu = cuGaussianNB()
gnb_gpu.fit(X_gpu, y_gpu)
```

---

## **6. Bag of Words Exercise**
Build a **CountVectorizer** from scratch and compare with `sklearn`.

### **Manual Count Vectorizer**
```python
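from collections import Counter

def count_vectorizer(corpus):
    # Sort the vocabulary so the output order is reproducible across runs
    vocab = sorted(set(word for text in corpus for word in text.split()))
    vectorized = []
    for text in corpus:
        counts = Counter(text.split())  # Count each token once per document
        vectorized.append({word: counts.get(word, 0) for word in vocab})
    return vectorized, vocab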

# Sample data
corpus = ["Naive Bayes is simple", "Bayes models are powerful"]
vectorized_data, vocab = count_vectorizer(corpus)

print(f"Vocabulary: {vocab}")
print(f"Vectorized Data: {vectorized_data}")
```

**Comparison with Scikit-learn**
```python
cv = CountVectorizer()
X_cv = cv.fit_transform(corpus)
print(f"Sklearn CountVectorizer Output:\n{X_cv.toarray()}")
```

---

# **Key Takeaways**
1. **Bayes' Rule is a powerful probabilistic tool** for updating beliefs based on evidence.
2. **Naive Bayes assumes feature independence** but still performs well in many classification tasks.
3. **It is computationally efficient** and well-suited for text classification, spam detection, and medical diagnostics.

---

# **Discussion Questions**
1. **What are the limitations of the naive assumption in Naive Bayes?**
2. **Why does Naive Bayes perform well in text classification despite the feature independence assumption?**
3. **How can we improve Naive Bayes for continuous data distributions?**

---

### **Bag of Words Exercise: Step-by-Step Walkthrough**

This exercise involves manually implementing **CountVectorizer**, which is a fundamental preprocessing step in Natural Language Processing (NLP). The goal is to **convert text into a numerical representation** before applying machine learning models.

---

## **Step 1: Understanding CountVectorizer**
`CountVectorizer` is a method of converting text into a **bag-of-words (BoW)** representation, where:
- Each unique word in the dataset is assigned an index in a vocabulary.
- The text data is transformed into a matrix, where each row represents a document and each column represents a word.
- The values in the matrix indicate how many times a word appears in a document.

### **Example**
For the text:
```text
"Naive Bayes is simple"
"Bayes models are powerful"
```
The **bag-of-words representation** would count the occurrences of each word across documents.
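
As a quick sketch of what that representation looks like for these two sentences (lowercasing and whitespace splitting are simplifying assumptions here), each row below is one document and each column is one vocabulary word:

```python
from collections import Counter

docs = ["Naive Bayes is simple", "Bayes models are powerful"]

# Lowercase and split on whitespace (a simplification of real tokenizers)
tokenized = [doc.lower().split() for doc in docs]

# Sorted vocabulary gives a stable column order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row of counts per document
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
# ['are', 'bayes', 'is', 'models', 'naive', 'powerful', 'simple']
for row in matrix:
    print(row)
# [0, 1, 1, 0, 1, 0, 1]
# [1, 1, 0, 1, 0, 1, 0]
```

The only word shared by both documents is "bayes", which is why its column is the only one with a 1 in both rows.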

---

## **Step 2: Loading the 20 Newsgroup Dataset**
The **20 Newsgroups dataset** is a collection of text documents categorized into 20 topics.

```python
from sklearn.datasets import fetch_20newsgroups

# Load dataset without headers, footers, and quotes
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
documents = data.data[:5]  # Load only first 5 documents for demonstration

# Display sample data
for i, doc in enumerate(documents):
    print(f"Document {i+1}:\n{doc}\n{'-'*50}")
```
This dataset contains **raw text** that needs to be processed into a numerical format.

---

## **Step 3: Implementing Manual CountVectorizer**
### **1. Tokenization**
Convert text into words by splitting on spaces and removing punctuation.

### **2. Building Vocabulary**
Create a dictionary mapping each unique word to an index.

### **3. Constructing the Count Matrix**
For each document, count how many times each word appears.

```python
from collections import Counter
import re

def custom_count_vectorizer(corpus):
    # Preprocess: Remove punctuation and lowercase
    corpus_clean = [re.sub(r'\W+', ' ', doc).lower() for doc in corpus]

    # Tokenize
    tokenized_docs = [doc.split() for doc in corpus_clean]

    # Build vocabulary
    vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
    word_index = {word: idx for idx, word in enumerate(vocabulary)}

    # Count word occurrences
    matrix = []
    for doc in tokenized_docs:
        word_counts = Counter(doc)
        row = [word_counts.get(word, 0) for word in vocabulary]
        matrix.append(row)

    return vocabulary, matrix

# Apply custom CountVectorizer
vocab, count_matrix = custom_count_vectorizer(documents[:5])

# Display results
print("Vocabulary:\n", vocab)
print("\nCount Matrix:\n", count_matrix)
```

---

## **Step 4: Comparing with Sklearn's CountVectorizer**
To validate our implementation, we compare it against Scikit-learn's `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Apply Scikit-learn's CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents[:5])

# Display vocabulary
print("\nScikit-learn Vocabulary:\n", vectorizer.get_feature_names_out())

# Display transformed text representation
print("\nScikit-learn Count Matrix:\n", X.toarray())
```

---

## **Step 5: Key Differences Between Manual and Sklearn CountVectorizer**
| Feature | Custom Implementation | Sklearn CountVectorizer |
|---------|-----------------------|-------------------------|
| **Tokenization** | Regex cleanup, then whitespace splitting | Configurable `token_pattern`; optional stop-word removal and n-grams |
| **Vocabulary** | Built manually from a sorted set | Built automatically; can be pruned with `min_df`, `max_df`, `max_features` |
| **Count Matrix** | Dense list of lists built with `Counter` | SciPy sparse matrix |
| **Speed** | Slower for large datasets | Faster and more memory-efficient on large corpora |

---

## **Step 6: Performance Considerations**
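Since `CountVectorizer` runs on the **CPU**, the sketch below stores the count matrix on the **GPU** using **CuPy**. Tokenization and vocabulary building still happen on the CPU, and each element-wise write to a GPU array carries overhead, so treat this as an illustration of the data flow rather than a guaranteed speedup.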

```python
import cupy as cp

def gpu_count_vectorizer(corpus):
    corpus_clean = [re.sub(r'\W+', ' ', doc).lower() for doc in corpus]
    tokenized_docs = [doc.split() for doc in corpus_clean]
    
    vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
    word_index = {word: idx for idx, word in enumerate(vocabulary)}
    
    matrix = cp.zeros((len(tokenized_docs), len(vocabulary)), dtype=cp.int32)
    
    for doc_idx, doc in enumerate(tokenized_docs):
        word_counts = Counter(doc)
        for word, count in word_counts.items():
            if word in word_index:
                matrix[doc_idx, word_index[word]] = count
    
    return vocabulary, matrix

# Apply GPU CountVectorizer
gpu_vocab, gpu_count_matrix = gpu_count_vectorizer(documents[:5])

print("\nGPU Vocabulary:", gpu_vocab)
print("\nGPU Count Matrix:\n", gpu_count_matrix.get())
```

---

# **Key Takeaways**
1. **Bag-of-Words is an essential text preprocessing step** in NLP, transforming text into a numerical format.
2. **Naive Bayes can work effectively on text classification** despite its simplifying assumptions.
3. **GPU acceleration can improve efficiency** when handling large-scale text datasets.

---

# **Discussion Questions**
1. **What are the benefits and limitations of using a Bag-of-Words model?**
2. **Why does TF-IDF often outperform simple CountVectorizer for text classification?**
3. **How can we improve our custom CountVectorizer to include stopword removal and stemming?**

---

Study Guide: QTW Module 6 – Naive Bayes

1. Overview of Bayes’ Rule
Bayes’ Rule helps update an initial belief (prior) after observing new evidence (likelihood). It is expressed mathematically as:

P(A | B) = [ P(B | A) × P(A) ] / P(B)

• P(A | B): Posterior Probability (updated belief)
• P(B | A): Likelihood (probability of observing B given A is true)
• P(A): Prior Probability (initial belief)
• P(B): Normalizing Factor (ensures the result is a valid probability)

Example: Drug Test
- True Positive Rate = 99%
- True Negative Rate = 97%
- Prevalence = 5%

P(A | B) = [0.99 × 0.05] / [ (0.99 × 0.05) + (0.03 × 0.95 ) ] ≈ 63.5%

2. Bayes’ Rule for Multiple Variables
When dealing with multiple evidence variables (B, C, D, ...), we can generalize Bayes’ Rule as:

P(A | B, C, D) ∝ P(A) × P(B | A) × P(C | A) × P(D | A)

The “naive” assumption is that all evidence variables B, C, D are conditionally independent given A, allowing us to multiply probabilities rather than consider their complex joint distributions.

3. Bayes’ Rule with Continuous Variables
For continuous attributes (temperature, weight), P(X = x) is zero for any single value, so we use a probability density function (e.g., Gaussian) as the likelihood instead. If the data is assumed normally distributed, we can compute:

P(x; μ, σ) = (1 / (σ√(2π))) × exp[−(x − μ)² / (2σ²)]

We then insert these probabilities into Bayes’ framework just as we do with discrete or categorical data.
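
A minimal sketch of that last step, assuming one continuous feature and two classes whose priors, means, and standard deviations are invented for illustration: the Gaussian density replaces the discrete likelihood, and the posterior is obtained by normalizing prior × density across classes.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density p(x; mu, sigma)."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical per-class parameters: (prior, mean, standard deviation)
classes = {
    "class_a": (0.5, 70.0, 10.0),
    "class_b": (0.5, 90.0, 10.0),
}

x = 75.0  # observed value

# Unnormalized posterior: prior * density
scores = {c: prior * gaussian_pdf(x, mu, sigma) for c, (prior, mu, sigma) in classes.items()}
evidence = sum(scores.values())
posterior = {c: s / evidence for c, s in scores.items()}

print(posterior)  # class_a is favored because 75 is closer to its mean
```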

4. Naive Bayes in Practice (Demo)
• Often used for text classification with bag-of-words or tf-idf features.  
• MultinomialNB from sklearn is commonly applied:
  – Alpha parameter (Laplace smoothing) handles zero-frequency words.  
  – Very fast, even on large datasets.

5. Naive Bayes vs. Logistic Regression
• Naive Bayes (generative model) and Logistic Regression (discriminative model) form a generative-discriminative pair.  
• Naive Bayes converges faster to a near-optimal solution with limited data; Logistic Regression can yield a lower asymptotic error with sufficient data.  
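• Both are connected through the log-odds (logit): for Multinomial and Bernoulli Naive Bayes (and Gaussian Naive Bayes with class-shared variances), log[P(y=1 | x) / P(y=0 | x)] is a linear function of the features, which is the same functional form that Logistic Regression fits directly.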
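
One way to see the connection, sketched here for a two-class Naive Bayes model over word counts with invented parameters: the log-odds log P(y=1 | x) − log P(y=0 | x) reduces to a bias plus a weighted sum of the features, which is the linear form used by logistic regression.

```python
import math

# Hypothetical two-class multinomial Naive Bayes parameters
prior = {0: 0.5, 1: 0.5}
word_prob = {                      # P(word | class); each row sums to 1
    0: {"free": 0.1, "meeting": 0.6, "money": 0.3},
    1: {"free": 0.5, "meeting": 0.1, "money": 0.4},
}
x = {"free": 2, "meeting": 0, "money": 1}   # word counts in one document

def log_joint(cls):
    """log P(y) + sum_w count_w * log P(w | y)."""
    return math.log(prior[cls]) + sum(n * math.log(word_prob[cls][w]) for w, n in x.items())

# Log-odds computed the Naive Bayes way
nb_log_odds = log_joint(1) - log_joint(0)

# The same quantity written as a linear model: bias + sum_w weight_w * count_w
bias = math.log(prior[1] / prior[0])
weights = {w: math.log(word_prob[1][w] / word_prob[0][w]) for w in x}
linear_log_odds = bias + sum(weights[w] * n for w, n in x.items())

# The sigmoid of the log-odds gives P(y = 1 | x), as in logistic regression
p_class1 = 1.0 / (1.0 + math.exp(-linear_log_odds))
print(nb_log_odds, linear_log_odds, p_class1)
```

The difference between the two models lies in how the weights are obtained: Naive Bayes derives them from per-class frequency estimates, while logistic regression optimizes them directly.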

6. Bag of Words Exercise
• Task: Build a custom CountVectorizer using Python.  
• Compare with sklearn’s CountVectorizer to observe differences in tokenization, vocabulary building, and performance.

Mathematical Representation (Core Formulas)
1. Bayes’ Rule (Single Variable)
   P(A | B) = [P(B | A) P(A)] / P(B)

2. Bayes’ Rule (Multivariable)
   P(A | B, C, D) ∝ P(A) × P(B | A) × P(C | A) × P(D | A)

3. Gaussian Density (Continuous Example)
   p(x) = 1 / (σ√(2π)) × exp[−(x − μ)² / (2σ²)]

Code Snippets

A. CPU Implementation (Naive Bayes – Sklearn)

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score

# 1) Load Data
train_data = fetch_20newsgroups(subset='train', remove=('headers','footers','quotes'))
test_data  = fetch_20newsgroups(subset='test',  remove=('headers','footers','quotes'))

# 2) Transform Text (CountVectorizer)
cv = CountVectorizer()
X_train_cv = cv.fit_transform(train_data.data)
X_test_cv  = cv.transform(test_data.data)

# 3) Train Naive Bayes
model_cv = MultinomialNB()
model_cv.fit(X_train_cv, train_data.target)

# 4) Evaluate
preds_cv = model_cv.predict(X_test_cv)
print("Accuracy with CountVectorizer:", accuracy_score(test_data.target, preds_cv))

# 5) Using TF-IDF
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(train_data.data)
X_test_tfidf  = tfidf.transform(test_data.data)

model_tfidf = MultinomialNB()
model_tfidf.fit(X_train_tfidf, train_data.target)

preds_tfidf = model_tfidf.predict(X_test_tfidf)
print("Accuracy with TF-IDF:", accuracy_score(test_data.target, preds_tfidf))
```
B. Simple GPU Implementation (Naive Bayes with cuML)
(You need an environment with RAPIDS/cuml installed for this.)

```python
import cupy as cp
import cupyx.scipy.sparse as cpx_sparse
from cuml.naive_bayes import MultinomialNB as cuMultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Assume you have train_data & test_data from above
tfidf_gpu = TfidfVectorizer()
X_train_gpu = tfidf_gpu.fit_transform(train_data.data)
X_test_gpu  = tfidf_gpu.transform(test_data.data)

# Convert the SciPy sparse matrices to CuPy sparse matrices on the GPU
# (CuPy's sparse types live in cupyx.scipy.sparse, not cp.sparse)
X_train_gpu_csr = cpx_sparse.csr_matrix(X_train_gpu)
X_test_gpu_csr  = cpx_sparse.csr_matrix(X_test_gpu)
y_train_gpu     = cp.asarray(train_data.target)
y_test_gpu      = cp.asarray(test_data.target)

# Train
model_gpu = cuMultinomialNB()
model_gpu.fit(X_train_gpu_csr, y_train_gpu)

# Predict
preds_gpu = model_gpu.predict(X_test_gpu_csr)
accuracy_gpu = (preds_gpu.get() == y_test_gpu.get()).mean()
print("GPU-based Naive Bayes Accuracy:", accuracy_gpu)
```

Study Guide Format (Condensed)

I. Introduction
  - Bayes’ Rule, Prior, Likelihood, Posterior  
  - Connection to logistic regression  

II. Mathematical Foundations
  - P(A|B) = (P(B|A)*P(A))/P(B)  
  - Multivariate Extension  
  - Continuous Variables  

III. Practical Implementation
  - Text Classification  
  - Bag-of-Words & TF-IDF  
  - Naive Bayes (MultinomialNB)  

IV. Comparison
  - Naive Bayes vs. Logistic Regression  
  - Generative vs. Discriminative  

V. Key Takeaways
  1. Naive Bayes uses Bayes’ Rule with a simplifying (naive) independence assumption, yet often performs competitively in classification tasks, especially text.  
  2. Combining TF-IDF with MultinomialNB can yield better accuracy than raw bag-of-words counts, though the gain depends on the dataset, so compare both on held-out data.  
  3. Generative (Naive Bayes) and Discriminative (Logistic Regression) models are mathematically related through the log-odds (logit) framework.

VI. Discussion Questions
  1. Why does Naive Bayes often perform well in high-dimensional text classification, despite the naive independence assumption?  
  2. How do smoothing parameters (like alpha) influence Naive Bayes performance?  
  3. In what cases might Logistic Regression eventually outperform Naive Bayes, and why?


Below is a more detailed walkthrough of **Part 6 – The Bag of Words Exercise**, showing how to build a custom CountVectorizer and comparing it with sklearn’s version. This will help solidify your understanding of how text is converted into numerical features for Naive Bayes and other machine learning models.

Bag of Words Exercise – Step-by-Step

1) Understanding the Goal
We want to:
• Load the 20 Newsgroups dataset (without headers, footers, and quotes).  
• Create our **own** Bag-of-Words approach (basic CountVectorizer).  
• Compare against the **sklearn** implementation (CountVectorizer).  

2) Loading the 20 Newsgroups Dataset
The dataset can be fetched with:
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

This returns a dictionary-like Bunch object, including .data (list of documents) and .target (NumPy array of numeric labels).

3) Building a Custom CountVectorizer
Our custom version will:
• Tokenize each document into words (split on whitespace, remove punctuation).  
• Build a vocabulary (unique words mapped to indices).  
• Count how often each word appears in each document.  

Below is sample code to illustrate these steps.

```python
import re
from collections import Counter
from sklearn.datasets import fetch_20newsgroups

def custom_count_vectorizer(corpus):
    # 1. Clean and lowercase text, remove punctuation
    cleaned_texts = [re.sub(r'[^a-zA-Z\s]', '', doc.lower()) for doc in corpus]
    
    # 2. Tokenize by splitting on whitespace
    tokenized_docs = [doc.split() for doc in cleaned_texts]
    
    # 3. Build vocabulary (all unique words)
    #    Convert to a sorted list to ensure consistent ordering.
    vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
    
    # 4. Create a dict mapping each word to its index
    word2index = {word: idx for idx, word in enumerate(vocabulary)}
    
    # 5. Construct the count matrix
    #    Each row represents a document, each column a word in our vocabulary.
    count_matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Build a row of length = size of vocabulary
        row_vector = [counts.get(word, 0) for word in vocabulary]
        count_matrix.append(row_vector)
    
    return vocabulary, count_matrix

# 1) Fetch data
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
corpus = data.data[:5]  # Using first 5 docs as a small example

# 2) Apply our custom CountVectorizer
vocab, matrix = custom_count_vectorizer(corpus)

print("Custom Vocabulary (first 20 words):", vocab[:20])
print("\nFirst Document's Vector Representation:")
print(matrix[0])
```


4) Comparing with sklearn’s CountVectorizer
Now let’s use the sklearn version to see how it differs.

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')  # optionally remove stopwords
X = cv.fit_transform(corpus)

print("\nSklearn Vocabulary Size:", len(cv.vocabulary_))
print("Sklearn First Document's Vector:")
print(X.toarray()[0])
```


Differences You Might Notice
• **Stopword Removal**: sklearn’s CountVectorizer removes common words like “the,” “and,” etc., only if you specify `stop_words='english'`; by default it keeps them. Our custom version doesn’t remove them unless we code it.  
• **Punctuation Handling**: sklearn’s default `token_pattern` treats punctuation as a separator and keeps only tokens of two or more word characters (digits included). Our custom version uses a simple regex that deletes every non-letter character.  
• **Sparse Representation**: sklearn returns a sparse matrix (efficient for large text). Ours is a plain list of lists (dense matrix).  

5) (Optional) GPU-Accelerated Version
If you have the RAPIDS environment set up (with CuPy and cuML), you could adapt your code to build counts on the GPU. A minimal “homegrown” version, which still tokenizes and builds the vocabulary on the CPU, might look like this:

```python
import cupy as cp

def gpu_count_vectorizer(corpus):
    # Step 1: Preprocess & Tokenize on CPU (or GPU if you prefer)
    cleaned_texts = [re.sub(r'[^a-zA-Z\s]', '', doc.lower()) for doc in corpus]
    tokenized_docs = [doc.split() for doc in cleaned_texts]
    
    # Step 2: Build vocabulary on CPU for simplicity
    vocabulary = sorted(set(word for doc in tokenized_docs for word in doc))
    word2index = {word: idx for idx, word in enumerate(vocabulary)}

    # Step 3: Initialize a GPU array to hold counts
    rows, cols = len(tokenized_docs), len(vocabulary)
    gpu_matrix = cp.zeros((rows, cols), dtype=cp.int32)
    
    # Step 4: Fill in the matrix
    for i, doc in enumerate(tokenized_docs):
        doc_counts = Counter(doc)
        for word, count in doc_counts.items():
            if word in word2index:
                j = word2index[word]
                gpu_matrix[i, j] = count
    
    return vocabulary, gpu_matrix

# Example usage
vocab_gpu, matrix_gpu = gpu_count_vectorizer(corpus[:5])
print("GPU Count Matrix shape:", matrix_gpu.shape)
print("GPU Count Matrix (first row, back to CPU):", matrix_gpu[0].get())
```

Note: This example shows the concept. In practice, you’d want to handle more steps on GPU to maximize speed.

Summary of the Exercise
• **Objective**: Implement a simplified Bag-of-Words manually.  
• **Key Observation**: The core steps – tokenizing, building a vocabulary, counting words – are straightforward, but sklearn optimizes them for performance and adds many extra features (stopword removal, n-grams, etc.).  

Three Key Takeaways (Part 6 Focus)
1. **Implementation Control**: Building CountVectorizer from scratch clarifies how text preprocessing works under the hood, but libraries like sklearn are more flexible and optimized.  
2. **Vocabulary Size**: Real-world text data sets can contain tens or hundreds of thousands of unique tokens; efficient data structures (sparse matrices) are critical.  
3. **Preprocessing Matters**: Removing stopwords, handling punctuation, or applying stemming/lemmatization can significantly change your results.
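
To illustrate the second takeaway, here is a rough sketch (pure Python, toy documents) contrasting a dense row, which stores a number for every vocabulary word, with a dictionary-based sparse row that stores only the nonzero counts. Real libraries use SciPy sparse matrices rather than dictionaries; the dictionaries here are just a stand-in to show the idea:

```python
from collections import Counter

docs = ["bayes rule updates beliefs", "naive bayes is fast", "sparse rows save memory"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({w for tokens in tokenized for w in tokens})
index = {w: j for j, w in enumerate(vocabulary)}

# Dense: one entry per vocabulary word, mostly zeros
dense_rows = [[Counter(tokens)[w] for w in vocabulary] for tokens in tokenized]

# Sparse: only (column index -> count) pairs for words that actually occur
sparse_rows = [{index[w]: n for w, n in Counter(tokens).items()} for tokens in tokenized]

dense_cells = sum(len(row) for row in dense_rows)
stored_cells = sum(len(row) for row in sparse_rows)
print(f"dense cells: {dense_cells}, nonzero cells stored: {stored_cells}")
# dense cells: 33, nonzero cells stored: 12
```

With only three short documents the saving is modest, but the gap widens as the vocabulary grows, because each document uses only a small fraction of all words.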

Three Discussion Questions
1. **How would you modify your custom CountVectorizer to remove stopwords, or to handle n-grams (like two-word phrases)?**  
2. **What might be the tradeoffs between using a dense vs. sparse matrix representation for large corpora?**  
3. **How could we extend this approach for advanced text processing, such as adding TF-IDF or word embeddings (GloVe, Word2Vec)?**




# Bayes' Rule and Naive Bayes Study Guide - Part 1

## I. Fundamentals of Bayes' Rule

### Core Equation
Bayes' Rule is expressed as:

```
P(A|B) = P(B|A) × P(A) / P(B)
```

Where:
- P(A|B) = Posterior Probability
- P(B|A) = Likelihood
- P(A) = Prior Probability
- P(B) = Evidence/Normalization

### Components Breakdown:
1. **Posterior Distribution - P(A|B)**
   - The probability of event A occurring given B has occurred
   - What we're typically trying to calculate
   - Updated probability after considering evidence

2. **Prior - P(A)**
   - Initial probability before considering new evidence
   - Based on previous knowledge or assumptions
   - Starting point for Bayesian inference

## II. Basic Python Implementation (CPU)

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

class BayesianCalculator:
    def simple_bayes(self, prior_a, likelihood_b_given_a, evidence_b):
        """
        Calculate posterior probability using Bayes' Rule
        """
        posterior = (likelihood_b_given_a * prior_a) / evidence_b
        return posterior

    def naive_bayes_example(self):
        # Sample data
        X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
        y = np.array([0, 0, 1, 1])
        
        # Initialize and train model
        model = GaussianNB()
        model.fit(X, y)
        
        # Make prediction
        prediction = model.predict([[2, 3]])
        return prediction
```

## III. GPU Implementation with CuPy

```python
import cupy as cp

class GPUBayesianCalculator:
    def __init__(self):
        self.device = 'gpu' if cp.cuda.is_available() else 'cpu'
    
    def gpu_naive_bayes(self, X, y):
        """
        GPU-accelerated Naive Bayes implementation
        """
        if self.device == 'gpu':
            X = cp.array(X)
            y = cp.array(y)
            
            # Calculate class priors
            classes = cp.unique(y)
            class_probs = cp.array([cp.mean(y == c) for c in classes])
            
            return class_probs.get()  # Transfer back to CPU
        else:
            return "GPU not available"
```

# Bayes' Rule and Naive Bayes Study Guide - Part 2

## IV. Naive Bayes Core Concepts

### Critical Assumption
The fundamental assumption of Naive Bayes is that features are conditionally independent given the class. Mathematically:

```
P(X₁,X₂,...,Xₙ|Y) = P(X₁|Y) × P(X₂|Y) × ... × P(Xₙ|Y)
```

### Multiclass Classification
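For a problem with k classes, we compute the unnormalized posterior P(Y = c) × P(X₁|Y = c) × ... × P(Xₙ|Y = c) for each class c and predict the class with the largest value. A small scikit-learn example: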

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

class MulticlassNaiveBayes:
    def multiclass_example(self):
        # Example with multiple classes
        X = np.array([[1,2], [2,3], [3,4], [4,5], [5,6]])
        y = np.array([0, 1, 2, 1, 2])  # Three classes: 0, 1, 2
        
        # Initialize and train
        clf = MultinomialNB()
        clf.fit(X, y)
        
        # Predict probabilities for each class
        probs = clf.predict_proba([[3,4]])
        # Returns probability for each class
        return probs
```

## V. Advanced Implementation with GPU Acceleration

```python
import cupy as cp

class GPUNaiveBayes:
    def __init__(self):
        self.classes_ = None
        self.class_priors_ = None
        self.class_means_ = None
        self.class_vars_ = None

    def fit(self, X, y):
        if not cp.cuda.is_available():
            raise RuntimeError("CUDA is not available")
            
        # Convert to GPU arrays
        X = cp.array(X)
        y = cp.array(y)
        
        self.classes_ = cp.unique(y)
        n_classes = len(self.classes_)
        n_features = X.shape[1]
        
        # Initialize parameters
        self.class_priors_ = cp.zeros(n_classes)
        self.class_means_ = cp.zeros((n_classes, n_features))
        self.class_vars_ = cp.zeros((n_classes, n_features))
        
        # Calculate parameters for each class
        for i, c in enumerate(self.classes_):
            X_c = X[y == c]
            self.class_priors_[i] = X_c.shape[0] / X.shape[0]
            self.class_means_[i] = X_c.mean(axis=0)
            self.class_vars_[i] = X_c.var(axis=0)
            
        return self
```
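
The class above stops at `fit`. As a rough sketch of the matching prediction step, the NumPy version below estimates the same per-class statistics and then scores new rows with Gaussian log-likelihoods. It runs on the CPU so it can be tried without CUDA. The function names and the `eps` variance guard are assumptions made for illustration, not part of the original class.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, means and variances (same statistics as fit above)."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances, eps=1e-9):
    """Pick the class with the highest log posterior for each row of X."""
    variances = variances + eps  # assumed guard against division by zero
    # Broadcast to shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    log_likelihood = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :]
    ).sum(axis=2)
    log_posterior = np.log(priors)[None, :] + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]

# Two well-separated toy clusters (illustrative data)
X_train = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                    [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X_train, y_train)
predictions = predict_gaussian_nb(np.array([[1.1, 2.1], [5.1, 5.9]]), *params)
print(predictions)
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids underflow when there are many features.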

## VI. Comparison with Linear Regression

Key Differences:
1. **Purpose**
   - Naive Bayes: Classification
   - Linear Regression: Continuous value prediction

2. **Mathematical Foundation**
   ```
   Naive Bayes: P(Y|X) ∝ P(X|Y)P(Y)
   Linear Regression: Y = βX + ε
   ```

3. **Output Type**
   - Naive Bayes: Probabilities for each class
   - Linear Regression: Continuous values

## VII. Key Takeaways

1. Naive Bayes applies Bayes' theorem under the assumption that features are conditionally independent given the class
2. It naturally handles multiclass problems without special modifications
3. A GPU implementation can speed up the per-class statistics on large datasets, although host-to-device transfer overhead can outweigh the gain on small ones

## VIII. Practice Questions

1. How does the independence assumption in Naive Bayes affect its performance on real-world datasets where features are often correlated?

2. In what scenarios would you choose Naive Bayes over other classification algorithms?

3. How does the computational complexity of Naive Bayes compare between CPU and GPU implementations for different dataset sizes?


# Bayes' Rule and Naive Bayes Study Guide - Part 3

## IX. Practical Applications and Use Cases

### Common Applications
1. Text Classification
2. Spam Detection
3. Sentiment Analysis
4. Medical Diagnosis

### Implementation Example for Text Classification

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

class TextClassificationNB:
    def __init__(self, use_gpu=False):
        self.vectorizer = CountVectorizer()
        self.classifier = MultinomialNB()
        # scikit-learn's MultinomialNB runs on the CPU and expects NumPy/SciPy
        # inputs, not CuPy arrays, so the flag is kept only for interface
        # compatibility and the data stays on the CPU.
        if use_gpu:
            warnings.warn("use_gpu is ignored: scikit-learn estimators run on the CPU")
        self.use_gpu = False

    def preprocess_text(self, texts):
        # Learn the vocabulary and return a sparse word-count matrix
        return self.vectorizer.fit_transform(texts)

    def train(self, texts, labels):
        X = self.preprocess_text(texts)
        self.classifier.fit(X, labels)

    def predict(self, new_texts):
        # Reuse the vocabulary learned during training (transform, not fit_transform)
        X = self.vectorizer.transform(new_texts)
        return self.classifier.predict(X)
```
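
To show roughly what the vectorizer and classifier do together, here is a from-scratch sketch in plain Python. It is a simplification under stated assumptions: whitespace tokenization stands in for `CountVectorizer`, the helper names `train_nb` and `predict_nb` are hypothetical, and the four training sentences are invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Count words per class and collect the vocabulary."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter(labels)      # class -> number of documents
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab, alpha

def predict_nb(text, model):
    """Return the class with the highest log posterior for one document."""
    word_counts, class_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are skipped
            # Smoothed log likelihood of this word given the class
            count = word_counts[label][word]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

texts = ["win money now", "free prize money", "meeting at noon", "lunch meeting tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(texts, labels)
print(predict_nb("free money", model), predict_nb("noon meeting", model))
```

Training is a single counting pass over the documents, which is why this family of models scales well to large text corpora.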

## X. Performance Optimization Tips

### 1. Data Preprocessing
```python
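import numpy as np
import pandas as pd

class DataPreprocessor:
    def __init__(self):
        self.scaler = None

    def handle_missing_values(self, X):
        # Replace each missing value with the mean of its own column
        X = np.asarray(X, dtype=float)
        col_means = np.nanmean(X, axis=0)
        return np.where(np.isnan(X), col_means, X)

    def normalize_features(self, X):
        # Log transformation for skewed, non-negative features: log(1 + x)
        return np.log1p(X)

    def handle_categorical_features(self, X):
        # One-hot encoding; X is expected to be a pandas DataFrame or Series
        return pd.get_dummies(X).values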
```

### 2. GPU Memory Management
```python
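import cupy as cp

class GPUMemoryManager:
    def __init__(self):
        self.memory_pool = cp.cuda.MemoryPool()

    def __enter__(self):
        # Route CuPy allocations through this pool while the context is active
        cp.cuda.set_allocator(self.memory_pool.malloc)
        return self

    def __exit__(self, *args):
        # Release unused cached blocks and restore CuPy's default allocator
        self.memory_pool.free_all_blocks()
        cp.cuda.set_allocator(cp.get_default_memory_pool().malloc)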
```

## XI. Common Pitfalls and Solutions

### 1. Zero Probability Problem
```python
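import numpy as np

class LaplaceSmoothingNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Smoothing parameter

    def calculate_probability(self, feature_counts, total_counts):
        # Add alpha to every count so unseen features get a small, non-zero probability
        feature_counts = np.asarray(feature_counts, dtype=float)
        n_features = len(feature_counts)  # number of distinct features, |V|
        return (feature_counts + self.alpha) / (total_counts + self.alpha * n_features)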
```
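
A small numeric sketch of the effect, using invented word counts for a single class over a four-word vocabulary. The standalone `smoothed` helper is a hypothetical function that applies the same formula as the class above.

```python
import numpy as np

# Hypothetical counts for one class; the last word never occurred with it
feature_counts = np.array([3, 1, 2, 0])
total = feature_counts.sum()

def smoothed(counts, total_count, alpha):
    # (count + alpha) / (total + alpha * vocabulary size)
    return (counts + alpha) / (total_count + alpha * len(counts))

unsmoothed = smoothed(feature_counts, total, alpha=0.0)
laplace = smoothed(feature_counts, total, alpha=1.0)

print(unsmoothed)  # last entry is exactly 0, which would zero out the whole product
print(laplace)     # every entry is positive and the values still sum to 1
```

With α = 1 the estimates become 4/10, 2/10, 3/10 and 1/10, so a document containing the unseen word no longer forces the class score to zero.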

### 2. Numerical Stability
```python
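import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum(exp(x)))."""
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))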
```
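
One place this helper is useful is turning unnormalized log posteriors into probabilities. The sketch below repeats the function so it is self-contained, and uses made-up log scores that are small enough to underflow when exponentiated directly.

```python
import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum(exp(x)))."""
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

# Hypothetical unnormalized log posteriors for three classes
log_scores = np.array([-1000.0, -1001.0, -1002.0])

# Direct exponentiation underflows: every entry becomes 0.0
naive = np.exp(log_scores)

# Normalizing in log space first keeps the relative sizes intact
log_probs = log_scores - log_sum_exp(log_scores)
probs = np.exp(log_probs)

print(naive.sum())
print(probs)
```

The naive route would require dividing by zero to normalize, while the log-space route yields valid probabilities that sum to 1 and preserve the ranking of the classes.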

## XII. Advanced Techniques

### 1. Feature Selection
```python
from sklearn.feature_selection import SelectKBest, chi2

class FeatureSelector:
    def __init__(self, k=10):
        self.selector = SelectKBest(chi2, k=k)
        
    def select_features(self, X, y):
        # chi2 requires non-negative features (e.g., word counts)
        return self.selector.fit_transform(X, y)
```

### 2. Cross-Validation
```python
from sklearn.model_selection import cross_val_score

def evaluate_model(model, X, y, cv=5):
    scores = cross_val_score(model, X, y, cv=cv)
    return {
        'mean_accuracy': scores.mean(),
        'std_accuracy': scores.std()
    }
```

## XIII. Final Key Takeaways

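1. Understanding the conditional independence assumption is crucial for effective implementation
2. GPU acceleration can speed up training and prediction on large datasets, but data-transfer overhead can cancel the benefit on small ones
3. Proper preprocessing and careful handling of edge cases (zero probabilities, numerical underflow) are essential for robust models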

## XIV. Additional Practice Questions

1. How would you handle highly imbalanced datasets when using Naive Bayes?
2. What techniques can be used to determine the optimal smoothing parameter?
3. In what scenarios might GPU acceleration actually slow down computation?

## XV. Study Tips

1. Focus on understanding the probabilistic foundations
2. Practice with real-world datasets
3. Experiment with both CPU and GPU implementations
4. Pay attention to preprocessing steps
5. Understand the limitations and assumptions



Discussion Questions:

1. **"Why does Naive Bayes often perform well in high-dimensional text classification, despite the naive independence assumption?"**
   - Addresses the fundamental paradox of Naive Bayes
   - Requires understanding of both theory and practical applications
   - Challenges students to think about the relationship between model assumptions and real-world performance
   - Connects to real-world applications in text classification

2. **"How does the independence assumption in Naive Bayes affect its performance on real-world datasets where features are often correlated?"**
   - Forces consideration of the core assumptions
   - Requires analysis of real-world implications
   - Encourages critical thinking about model limitations
   - Connects theoretical concepts to practical applications

3. **"In what scenarios would you choose Naive Bayes over other classification algorithms?"**
   - Requires comparative analysis
   - Demands understanding of various algorithms' strengths and weaknesses
   - Focuses on practical decision-making
   - Encourages consideration of real-world constraints and requirements


Takeaways: 
1. **The Independence Assumption's Paradox**
   - Despite its "naive" assumption that features are conditionally independent given the class (which is rarely true in real-world data), Naive Bayes often performs surprisingly well in practice
   - This is particularly true in text classification and high-dimensional problems
   - Understanding: Classification only needs the correct class to receive the highest score, not well-calibrated probabilities, so the model's simplicity and efficiency often outweigh the cost of the independence assumption
   - Mathematical representation: P(x₁,x₂|y) = P(x₁|y) × P(x₂|y)

2. **Probabilistic Foundation and Scalability**
   - Naive Bayes is fundamentally a probabilistic classifier based on Bayes' Theorem
   - It's highly scalable and can handle large datasets efficiently
   - Training is a single counting pass over the data, O(n·d) for n samples and d features, making it one of the fastest classifiers to train
   - Key equation: P(y|x) = P(x|y)P(y)/P(x)
   - This makes it particularly valuable for real-time applications and large-scale classification tasks

3. **Susceptibility to Zero Probability Problem**
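   - When a feature value in the test data never appears with a given class in the training data, its estimated likelihood is zero, and the whole product for that class collapses to zero
   - This is solved through smoothing techniques (like Laplace/Additive smoothing)
   - Critical for practical implementation: P(x|y) = (count(x,y) + α)/(count(y) + α|V|), where α is the smoothing parameter and |V| is the number of distinct feature values (the vocabulary size for text)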
   - Understanding this limitation and its solutions is crucial for effective implementation

These takeaways are particularly important because they:
- Cover both theoretical foundations and practical implications
- Address common challenges and solutions
- Explain why and when the algorithm works well
- Provide essential knowledge for real-world applications
- Help in making informed decisions about when to use Naive Bayes

Each of these points affects:
- Model selection decisions
- Implementation approaches
- Performance optimization
- Problem-solving strategies
- Understanding of results
