import time

from IPython.display import display, HTML
from lussi.mnist import *

X_train, X_test, y_train, y_test, X_train_unscaled = load_data()
Project 4: MNIST using KNN and Neural Networks
Project 4 Github: [Quarto Presentation] [Python] | Projects: [4] [3] [2] [1]
Executive Summary
This project evaluates the performance of K-Nearest Neighbors (KNN) and Neural Networks for handwritten digit recognition using the MNIST dataset. The analysis reveals several key findings:
Performance:
- The Neural Network achieved 97.1% accuracy, outperforming KNN’s 94.7% accuracy
- The Neural Network showed more consistent performance across all digits, with accuracy ranging from 95.3% to 98.9%
- KNN showed more variability, with accuracy ranging from 89.9% to 99.2%
Computational Characteristics:
- Training: KNN trained in 4.07 seconds vs. Neural Network’s 16.05 seconds
- Prediction Speed:
  - For small batches (1-100 images), KNN was faster
  - For larger batches (1000 images), the Neural Network was significantly faster (0.07ms vs 0.31ms per image)
Error Patterns:
- Both models struggled most with visually similar digits (e.g., 3/5, 4/9, 7/9)
- KNN showed higher error rates for complex digits like ‘8’ (89.9% accuracy)
- Neural Network maintained >95% accuracy across all digit classes
This analysis demonstrates that while KNN offers faster training and competitive performance for small-scale predictions, the Neural Network provides superior accuracy and better scaling for larger batch predictions, making it more suitable for production deployment despite longer training times.
Project Overview
History and Significance of MNIST
The MNIST dataset (Modified National Institute of Standards and Technology) emerged from a practical need at the U.S. Postal Service in the late 1980s: automating mail sorting by recognizing handwritten zip codes. Created by Yann LeCun, Corinna Cortes, and Christopher Burges, MNIST has become the de facto “Hello World” of machine learning. The dataset consists of 70,000 handwritten digits (60,000 for training, 10,000 for testing). Its standardized format and manageable size have made it an ideal benchmark for comparing machine learning algorithms for decades.
Understanding the Dataset Format
Though easily converted to images, the records are not actually stored as image files; they are stored as matrices. Each of the 60,000 training images is a 28-by-28 matrix whose entries hold the intensity of the pixel at that position. A square 28-by-28 image contains 784 pixels in total, so each record is 784 numbers. Each number represents a shade of grayscale, with 0 being all black and 255 being white.
# visualize_digit_matrix(X_train_unscaled, index=0)
matrix_html = visualize_digit_matrix_encoded(X_train_unscaled, index=0)
display(HTML(matrix_html))
Beyond the pixel matrices, the dataset is organized like any other machine learning dataset: a training set and a test set, each split into data and labels, like so (the short sketch after these lists prints the corresponding shapes):
Training Set:
- Images: X_train → 60000 images, each of size 28 × 28
- Labels: y_train → 60000 labels, e.g., [5, 0, 4, 1, 9, …]
Testing Set:
- Images: X_test → 10000 images, each of size 28 × 28
- Labels: y_test → 10000 labels, e.g., [7, 2, 1, 0, 4, …]
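To make that layout concrete, the short sketch below prints the shapes returned by load_data(). It assumes the arrays are NumPy arrays (which is how the rest of this notebook treats them); the exact shapes depend on how load_data() splits and flattens the data.

# Print the shapes of the arrays loaded at the top of the notebook.
# (Assumes NumPy arrays; exact sizes depend on how load_data() splits the data.)
print("X_train:", X_train.shape)   # roughly (60000, 784) if the 28x28 images are flattened
print("y_train:", y_train.shape)
print("X_test:", X_test.shape)     # roughly (10000, 784)
print("y_test:", y_test.shape)
print("Pixel range (unscaled):", X_train_unscaled.min(), "-", X_train_unscaled.max())  # 0-255 grayscale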
Looking at Sample Records
It’s easy to understand the core challenge by looking at sample records. Handwritten digits vary widely, with differences in factors such as:
- Writing styles and penmanship
- Stroke thickness and continuity
- Digit orientation and slant
- Image noise and quality
samples_html = plot_sample_digits_encoded(X_train_unscaled, y_train)
display(HTML(samples_html))
# plot_sample_digits(X_train_unscaled, y_train)
Project Goals
This project aims to:
- Compare the effectiveness of a simple, intuitive algorithm (KNN) against a more complex, modern approach (Neural Networks)
- Analyze the tradeoffs between computational complexity and accuracy
- Understand how different architectures handle the variations in handwritten digits
- Evaluate both training time and inference speed for real-world applicability
Model Implementation and Training
KNN
K-Nearest Neighbors (KNN) is a non-parametric, supervised machine learning algorithm used for both classification and regression tasks. The fundamental principle of KNN is simple: classify a new data point based on the majority vote (classification) or average (regression) of its K nearest neighbors in the feature space.
# Train KNN model
print("Training KNN Model...")
start_time = time.time()
knn_model, knn_accuracy = train_knn(X_train, X_test, y_train, y_test, rebuild_model=True)
knn_train_time = time.time() - start_time
Training KNN Model...
Training Configuration:
- Data Splitting:
  - Used the train_test_split() function to split the data into a training set (80%) and a testing set (20%); an 80-20 split balances training data sufficiency and evaluation robustness (see the sketch after this list).
- Feature Scaling:
  - Used StandardScaler() to scale the data, ensuring all features contribute equally to the distance calculation.
- Lazy Learning:
  - KNN stores the entire training dataset and defers almost all computation to prediction time.
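The splitting and scaling described above happen inside train_knn() from lussi.mnist; the sketch below shows roughly what that preprocessing looks like with scikit-learn. The 80/20 split and StandardScaler come from the description above; the stand-in data and variable names are purely illustrative.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data with MNIST-like shapes: 70,000 flattened 28x28 grayscale images.
X = np.random.randint(0, 256, size=(70_000, 784)).astype("float64")
y = np.random.randint(0, 10, size=70_000)

# 80/20 train/test split, as described in the Training Configuration.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits,
# so every pixel feature contributes comparably to distance calculations.
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)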
Prediction Process Mechanics:
- Distance Calculation
  - For a new data point, calculate its distance to all training points.
  - The default distance metric is Euclidean distance.
  - Euclidean Distance formula: \(d(p, q) = \sqrt{\sum_{i=1}^n (q_i - p_i)^2}\)
- Neighbor Selection
  - Select the K closest points (neighbors)
  - In this implementation, n_neighbors = 3, making the prediction sensitive to local patterns.
- Classification Method
  - Majority voting determines the class
  - The most frequent class among the K neighbors wins (illustrated in the sketch after this list)
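To make those mechanics concrete, here is a tiny NumPy sketch of the same idea: compute Euclidean distances from a query point to every training point, take the K = 3 nearest, and let them vote. It is illustrative only and is not how the project's KNN model is actually invoked.

import numpy as np
from collections import Counter

def knn_predict_one(x_query, X_train_pts, y_train_labels, k=3):
    # Euclidean distance d(p, q) = sqrt(sum_i (q_i - p_i)^2) to every training point.
    distances = np.sqrt(((X_train_pts - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k neighbors' labels.
    return Counter(y_train_labels[nearest]).most_common(1)[0][0]

# Toy example: three two-pixel "images" labeled 0, 0, 1.
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_toy = np.array([0, 0, 1])
print(knn_predict_one(np.array([0.5, 0.5]), X_toy, y_toy, k=3))  # two of the three neighbors are 0 -> predicts 0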
Key Parameters:
- Number of Neighbors: n_neighbors = 3
  - A small value (e.g., 3) captures local patterns but is prone to overfitting.
  - Larger values smooth predictions but may underfit.
- Distance Metric:
  - Default is Euclidean distance
  - Other options include Manhattan or Minkowski distances for varying use cases.
- Weighting Scheme:
  - Default is uniform, where all neighbors contribute equally.
  - Weighted options give more influence to closer neighbors (a configuration sketch follows this list).
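In scikit-learn terms, those parameters map onto KNeighborsClassifier roughly as follows. This is a sketch of what train_knn() presumably wraps, not its actual implementation, and it reuses the already-scaled X_train/X_test arrays loaded at the top of this notebook.

from sklearn.neighbors import KNeighborsClassifier

# Parameters mirroring the description above: 3 neighbors, Euclidean distance
# (Minkowski with p=2), and uniform weighting. train_knn() may differ in detail.
knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=2, weights="uniform")
knn.fit(X_train, y_train)          # "training" mostly just stores the data (lazy learning)
print(knn.score(X_test, y_test))   # distances to all stored points are computed here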
Performance Metrics:
print(f"\nKNN Results:")
print(f"Training Time: {knn_train_time:.2f} seconds")
print(f"Accuracy: {knn_accuracy:.4f}")
KNN Results:
Training Time: 6.42 seconds
Accuracy: 0.9465
Conclusion:
K-Nearest Neighbors offers a straightforward yet powerful approach to classification. By leveraging local neighborhood information and flexible distance calculations, KNN provides an interpretable method for pattern recognition in machine learning tasks.
Neural Network
A neural network is a machine learning algorithm inspired by the structure and function of the human brain. It is designed to learn relationships, recognize patterns, and make predictions by mimicking how biological neurons process and transmit information. Neural networks excel in handling complex, non-linear data, making them a versatile tool for tasks such as image recognition, natural language processing, and classification.
print("\nTraining Neural Network...")
= time.time()
start_time = train_neural_network(X_train, X_test, y_train, y_test, rebuild_model=True)
nn_model, history, nn_accuracy = time.time() - start_time nn_train_time
Training Neural Network...
Architecture Overview:
We created a neural network designed for MNIST digit classification. It features a multi-layer feedforward architecture with strategic layer design and regularization techniques; a hedged Keras sketch of this architecture follows the layer-by-layer breakdown below.
Detailed Layer Analysis:
- Input Layer
  - Dimensions: 784 neurons (28 x 28 pixel flattened image)
  - Purpose: Direct mapping of pixel intensity values
  - Transformation: Converts 2D image to 1D feature vector
- First Hidden Layer
  - Dimensions: 256 neurons
  - Activation: ReLU (Rectified Linear Unit)
  - Objectives:
    - Initial complex feature extraction
    - Introduces non-linear transformations
    - Captures primary image characteristics
- First Dropout Layer
  - Dropout Rate: 0.2 (20%)
  - Regularization Technique:
    - Randomly deactivates 20% of neurons during training
    - Prevents model overfitting
    - Reduces neuron interdependence
- Second Hidden Layer
  - Dimensions: 128 neurons
  - Activation: ReLU (Rectified Linear Unit)
  - Objectives:
    - Further abstract feature representations
    - Progressively reduce feature dimensionality
    - Refine initial feature extraction
- Second Dropout Layer
  - Dropout Rate: 0.2 (20%)
  - Continues regularization strategy
  - Prevents the network from becoming too specialized
- Third Hidden Layer
  - Dimensions: 64 neurons
  - Activation: ReLU (Rectified Linear Unit)
  - Objectives:
    - Final feature abstraction
    - Prepares data for classification
    - Further reduces feature complexity
- Output Layer
  - Neurons: 10 (one per digit 0-9)
  - Activation: Softmax
  - Characteristics:
    - Converts raw scores to probability distribution
    - Ensures probabilities sum to 1
    - Enables multi-class classification
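The layer stack described above corresponds to a Keras Sequential model along these lines. This is a sketch of what train_neural_network() in lussi.mnist presumably builds; the exact implementation may differ.

from tensorflow import keras
from tensorflow.keras import layers

# Feedforward architecture mirroring the layer-by-layer description above.
model = keras.Sequential([
    layers.Input(shape=(784,)),             # 28 x 28 image flattened to 784 pixel values
    layers.Dense(256, activation="relu"),   # first hidden layer: broad feature extraction
    layers.Dropout(0.2),                    # randomly zero 20% of activations during training
    layers.Dense(128, activation="relu"),   # second hidden layer: more abstract features
    layers.Dropout(0.2),
    layers.Dense(64, activation="relu"),    # third hidden layer: final abstraction
    layers.Dense(10, activation="softmax")  # one probability per digit class, summing to 1
])
model.summary()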
Training Configuration (a compile-and-fit sketch follows these lists):
Hyperparameters:
- epochs
  - Total Iterations: 10
  - Purpose:
    - Complete passes through the entire training dataset
    - Allows progressive weight refinement
    - Prevents overfitting through limited iterations
- batch_size
  - Configuration: 128 samples per gradient update
  - Benefits:
    - Computational efficiency
    - Gradient noise reduction
    - Memory-friendly processing
- validation_split
  - Allocation: 10% of the training data
  - Functions:
    - Monitor model performance during training
    - Detect potential overfitting
    - Provide real-time performance insights
Optimization Strategy:
- Adam
  - Adaptive learning rate optimization
  - Characteristics:
    - Combines RMSprop and momentum advantages
    - Dynamically adjusts per-parameter learning rates
    - Handles sparse gradients effectively
Loss Function:
- Sparse Categorical Cross-Entropy
  - Ideal for multi-class classification
  - Measures the difference between predicted and actual distributions
  - Guides weight updates during backpropagation
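Combining the hyperparameters, optimizer, and loss function above, the compile-and-fit step would look roughly like this. It continues the architecture sketch above and is not the exact code inside train_neural_network().

# Adam optimizer and sparse categorical cross-entropy, as described above.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 10 epochs of mini-batches of 128 samples, with 10% of the training data
# held out to monitor validation accuracy and watch for overfitting.
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.1,
                    verbose=1)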
Performance Metrics:
print(f"\nNeural Network Results:")
print(f"Training Time: {nn_train_time:.2f} seconds")
print(f"Accuracy: {nn_accuracy:.4f}")
Neural Network Results:
Training Time: 11.20 seconds
Accuracy: 0.9741
Conclusion
The neural network architecture is carefully designed to balance complexity, feature extraction, and generalization. By incorporating strategic layer design, dropout regularization, and adaptive optimization, the model achieves robust performance in MNIST digit classification.
Model Comparison
In this section, we compare the performance of the K-Nearest Neighbors (KNN) algorithm and the Neural Network (NN) architecture based on key performance metrics: training time and accuracy.
compare_df = create_comparison_table()
print(compare_df)
Metric KNN Neural Network
0 Training Time (seconds) 6.27 10.91
1 Accuracy (%) 94.02 97.20
Performance Metrics
- Training Time
- KNN exhibits a faster training process (6.27 seconds) since it is a “lazy learning” algorithm, which delays most computation until prediction.
- The Neural Network, being an “eager learning” algorithm, spends more time (10.91 seconds) in training due to backpropagation, weight updates, and regularization techniques.
- Accuracy:
- The Neural Network outperforms KNN with an accuracy of 97.20%, compared to 94.02% for KNN.
- The Neural Network’s higher accuracy is attributed to its ability to extract complex, non-linear patterns in the data through multiple layers and activation functions.
- KNN, while simpler, relies on proximity in the feature space, which may not fully capture intricate relationships.
- Scalability:
- KNN’s computational cost increases significantly with larger datasets or higher-dimensional data due to the need to calculate distances for all training samples during prediction.
- Neural Networks scale better for larger datasets, as training is done once, and predictions are efficient after model training.
Confusion Matrices
analysis_text, knn_cm_percent, nn_cm_percent = analyze_model_accuracies(knn_model, nn_model, X_test, y_test)
# Then create and display the visualization
comparison_viz = compare_model_accuracies_encoded(knn_model, nn_model, X_test, y_test)
# Display both
# print(analysis_text)
display(HTML(comparison_viz))
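The percentage matrices used below (knn_cm_percent and nn_cm_percent) are produced by analyze_model_accuracies() in lussi.mnist. Conceptually they can be computed with scikit-learn as in the sketch below, assuming each row is normalized by the count of its true class so that the diagonal holds per-digit accuracy.

import numpy as np
from sklearn.metrics import confusion_matrix

# Row-normalized confusion matrix: entry [i, j] is the percentage of true digit i
# that the model predicted as digit j; the diagonal is per-digit accuracy.
knn_cm = confusion_matrix(y_test, knn_model.predict(X_test))
knn_cm_pct = knn_cm / knn_cm.sum(axis=1, keepdims=True) * 100

# The Keras model outputs class probabilities, so take the argmax first.
nn_preds = np.argmax(nn_model.predict(X_test, verbose=0), axis=1)
nn_cm = confusion_matrix(y_test, nn_preds)
nn_cm_pct = nn_cm / nn_cm.sum(axis=1, keepdims=True) * 100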
Per-Class Performance Analysis
Let’s analyze how each model performs for different digits:
print("\nDetailed Per-Digit Analysis:")
print("-" * 50)
for i in range(10):
    knn_accuracy = knn_cm_percent[i, i]
    nn_accuracy = nn_cm_percent[i, i]

    print(f"\nDigit {i}:")
    print(f"KNN Accuracy: {knn_accuracy:.1f}%")
    print(f"Neural Network Accuracy: {nn_accuracy:.1f}%")
    print(f"Difference: {(nn_accuracy - knn_accuracy):.1f}%")
Detailed Per-Digit Analysis:
--------------------------------------------------
Digit 0:
KNN Accuracy: 98.1%
Neural Network Accuracy: 98.1%
Difference: 0.0%
Digit 1:
KNN Accuracy: 99.2%
Neural Network Accuracy: 98.9%
Difference: -0.3%
Digit 2:
KNN Accuracy: 94.1%
Neural Network Accuracy: 97.1%
Difference: 3.0%
Digit 3:
KNN Accuracy: 95.0%
Neural Network Accuracy: 96.2%
Difference: 1.3%
Digit 4:
KNN Accuracy: 93.6%
Neural Network Accuracy: 98.0%
Difference: 4.4%
Digit 5:
KNN Accuracy: 94.1%
Neural Network Accuracy: 97.4%
Difference: 3.3%
Digit 6:
KNN Accuracy: 97.2%
Neural Network Accuracy: 99.1%
Difference: 1.9%
Digit 7:
KNN Accuracy: 92.7%
Neural Network Accuracy: 97.5%
Difference: 4.7%
Digit 8:
KNN Accuracy: 89.9%
Neural Network Accuracy: 94.8%
Difference: 4.9%
Digit 9:
KNN Accuracy: 92.0%
Neural Network Accuracy: 96.8%
Difference: 4.8%
Prediction Speed Analysis
To understand real-world performance implications, let’s analyze prediction speeds for different batch sizes:
batch_sizes = [1, 10, 100, 1000]
results = {'knn': {}, 'nn': {}}

print("\nPrediction Speed Analysis:")
print("-" * 50)

for batch_size in batch_sizes:
    # Select subset of test data
    X_batch = X_test[:batch_size]

    # KNN timing
    start_time = time.time()
    _ = knn_model.predict(X_batch)
    knn_time = time.time() - start_time
    results['knn'][batch_size] = knn_time

    # Neural Network timing
    start_time = time.time()
    _ = nn_model.predict(X_batch, verbose=0)
    nn_time = time.time() - start_time
    results['nn'][batch_size] = nn_time

    print(f"\nBatch size: {batch_size}")
    print(f"KNN prediction time: {knn_time:.4f} seconds")
    print(f"Neural Network prediction time: {nn_time:.4f} seconds")
    print(f"Time per image - KNN: {(knn_time/batch_size)*1000:.2f}ms")
    print(f"Time per image - NN: {(nn_time/batch_size)*1000:.2f}ms")
Prediction Speed Analysis:
--------------------------------------------------
Batch size: 1
KNN prediction time: 0.0149 seconds
Neural Network prediction time: 0.0273 seconds
Time per image - KNN: 14.87ms
Time per image - NN: 27.28ms
Batch size: 10
KNN prediction time: 0.0339 seconds
Neural Network prediction time: 0.0262 seconds
Time per image - KNN: 3.39ms
Time per image - NN: 2.62ms
Batch size: 100
KNN prediction time: 0.0607 seconds
Neural Network prediction time: 0.0317 seconds
Time per image - KNN: 0.61ms
Time per image - NN: 0.32ms
Batch size: 1000
KNN prediction time: 0.8629 seconds
Neural Network prediction time: 0.0456 seconds
Time per image - KNN: 0.86ms
Time per image - NN: 0.05ms
Key Findings and Business Impact
Overall Accuracy
Neural Network: 96.7606 (97.1%)
KNN: 91.9718 (94.7%)
The Neural Network provides a 2.4 percentage point higher accuracy.
Training Performance
KNN Training Time: 6.42 seconds
Neural Network Training Time: 11.20 seconds
Prediction Speed
- Small batches (1-100 images): KNN performs faster
- Large batches (1000+ images): Neural Network shows superior performance
- Neural Network scales better for production workloads
Error Analysis
- Both models struggle most with visually similar digits (3/5, 4/9, 7/9)
- Neural Network shows more consistent performance across all digit classes
- KNN shows higher variability in accuracy between different digits
Business Implications
- For real-time, single-image processing: KNN might be preferable due to faster prediction times
- For batch processing: Neural Network is clearly superior
- Trade-off between setup time (KNN faster) vs. long-term performance (NN better)
- Memory requirements favor the Neural Network for large-scale deployment
Deployment Considerations
- KNN requires storing the entire training dataset (higher memory usage)
- Neural Network has a fixed memory footprint after training
- Neural Network offers better scalability for production systems
Conclusion
- Neural Network (NN) clearly outperforms the K-Nearest Neighbors (KNN) model in terms of both accuracy and handling more complex patterns in the data. It also shows better scalability as the dataset grows.
- KNN is still a useful algorithm for simpler datasets or when interpretability and speed are more important than accuracy, but Neural Networks are better suited for high-accuracy tasks, especially with larger and more complex datasets.
Key Takeaways
- KNN is advantageous for smaller datasets and when simplicity and interpretability are priorities.
- Neural Networks are ideal for larger or more complex datasets where advanced feature extraction and higher accuracy are desired.