Project 4: MNIST using KNN and Neural Networks

Author

Team: I Love Lucy

Published

December 11, 2024

Project 4 Github: [Quarto Presentation]   [Python]

import time

from IPython.display import display, HTML
from lussi.mnist import *

X_train, X_test, y_train, y_test, X_train_unscaled = load_data() 

Executive Summary

This project evaluates the performance of K-Nearest Neighbors (KNN) and Neural Networks for handwritten digit recognition using the MNIST dataset. The analysis reveals several key findings:

Performance:

  • The Neural Network achieved 97.4% accuracy, outperforming KNN’s 94.7% accuracy
  • The Neural Network showed more consistent performance across all digits, with accuracy ranging from ~96.2% to ~98.9%
  • KNN showed more variability, with accuracy ranging from ~89.9% to ~99.2%

Computational Characteristics:

  • Training: KNN trained in 6.31 seconds vs. Neural Network’s 11.37 seconds
  • Prediction Speed:
    • For single-image prediction, KNN was faster (~20ms vs ~29ms per image)
    • For batches of 10 or more images the Neural Network pulled ahead; at 1000 images it was significantly faster (~0.07ms vs ~0.80ms per image)

Error Patterns:

  • Both models struggled most with visually similar digits (e.g., 3/5, 4/9, 7/9)
  • KNN showed higher error rates for complex digits like ‘8’ (~89.9% accuracy)
  • Neural Network maintained >95% accuracy across all digit classes

This analysis demonstrates that while KNN offers faster training and competitive performance for small-scale predictions, the Neural Network provides superior accuracy and better scaling for larger batch predictions, making it more suitable for production deployment despite longer training times.

Project Overview

History and Significance of MNIST

The MNIST dataset (Modified National Institute of Standards and Technology) emerged from a practical need at the U.S. Postal Service in the late 1980s. It was created to help automate mail sorting by recognizing handwritten zip codes. Created by Yann LeCun, Corinna Cortes, and Christopher Burges, MNIST has become the de facto “Hello World” of machine learning. The dataset consists of 70,000 handwritten digits (60,000 for training, 10,000 for testing). Its standardized format and manageable size have made it an ideal benchmark for comparing machine learning algorithms for over three decades.

Understanding the Dataset Format

Though easily converted to images, the records are not actually stored as images; each one is stored as a matrix. Every one of the 60,000 training records is a 28 by 28 matrix whose entries hold the grayscale value of the corresponding pixel. Since the image is square, 28 pixels wide by 28 pixels tall gives 784 pixels in total, or 784 numbers per record. Each number is a shade of gray, from 0 (black) to 255 (white).

#visualize_digit_matrix(X_train_unscaled, index=0)
matrix_html = visualize_digit_matrix_encoded(X_train_unscaled, index=0)
display(HTML(matrix_html))
[Figure: 28 × 28 pixel-value matrix of a single training digit]

Around those records, the dataset is organized like any other machine learning dataset: a training set and a testing set, each split into data and labels, like so:

Training Set:

  • Images: X_train → 60000 images, each of size 28 × 28
  • Labels: y_train → 60000 labels, e.g., [5, 0, 4, 1, 9, …]

Testing Set:

  • Images: X_test → 10000 images, each of size 28 × 28
  • Labels: y_test → 10000 labels, e.g., [7, 2, 1, 0, 4, …]
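
To make this structure concrete, the short sketch below loads the raw digits with keras.datasets.mnist (used here as a stand-in for the project's lussi.mnist loader, which appears to return pre-scaled, pre-split arrays) and prints the shapes and pixel range described above.

# Minimal sketch: inspect the raw MNIST arrays.
# keras.datasets.mnist stands in for the lussi.mnist loader used above.
from tensorflow.keras.datasets import mnist

(X_train_raw, y_train_raw), (X_test_raw, y_test_raw) = mnist.load_data()

print(X_train_raw.shape)                     # (60000, 28, 28) -> 60,000 images of 28 x 28 pixels
print(X_test_raw.shape)                      # (10000, 28, 28)
print(y_train_raw[:5])                       # [5 0 4 1 9]
print(X_train_raw.min(), X_train_raw.max())  # 0 (black) to 255 (white)

# Flattening one image yields the 784-value vector the models consume.
print(X_train_raw[0].reshape(-1).shape)      # (784,)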

Looking at Sample Records

Looking at individual records makes the core challenge easy to understand. Handwritten digits vary widely, with factors such as:

  • Writing styles and penmanship
  • Stroke thickness and continuity
  • Digit orientation and slant
  • Image noise and quality
samples_html = plot_sample_digits_encoded(X_train_unscaled, y_train)
display(HTML(samples_html))
# plot_sample_digits(X_train_unscaled, y_train)
[Figure: grid of sample training digits with their labels]

Project Goals

This project aims to:

  • Compare the effectiveness of a simple, intuitive algorithm (KNN) against a more complex, modern approach (Neural Networks)
  • Analyze the tradeoffs between computational complexity and accuracy
  • Understand how different architectures handle the variations in handwritten digits
  • Evaluate both training time and inference speed for real-world applicability

Model Implementation and Training

KNN

K-Nearest Neighbors (KNN) is a non-parametric, supervised machine learning algorithm used for both classification and regression tasks. The fundamental principle of KNN is simple: classify a new data point based on the majority vote (classification) or average (regression) of its K nearest neighbors in the feature space.

# Train KNN model
print("Training KNN Model...")
start_time = time.time()
knn_model, knn_accuracy, knn_train_time = train_knn(X_train, X_test, y_train, y_test, rebuild_model=True)
Training KNN Model...

Training Configuration

  • Data Splitting:
    • Used train_test_split() function to split the data:
      • Training (80%)
      • Testing (20%)
    • An 80-20 split balances training data sufficiency and evaluation robustness
  • Feature Scaling:
    • Used StandardScaler() to scale the data, ensuring all features contribute equally to the distance calculation.
  • Lazy Learning:
    • KNN stores the entire training dataset and only makes predictions during inference

Prediction Process Mechanics:

  1. Distance Calculation
    • For the new data point, calculate its distance to all training points.
    • The default distance calculation method is Euclidean distance.
    • Euclidean Distance formula: \(d(p, q) = \sqrt{\sum_{i=1}^n (q_i - p_i)^2}\)
  2. Neighbor Selection
    • Select the K closest points (neighbors)
    • In this implementation, n_neighbors = 3, making the prediction sensitive to local patterns.
  3. Classification Method
    • Majority voting determines the class
    • Most frequent class among K neighbors wins
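
The three steps above fit in a few lines of NumPy. The following is an illustrative from-scratch sketch rather than the scikit-learn implementation this project actually uses, and it assumes X_train and y_train are NumPy arrays of flattened images and integer labels.

import numpy as np
from collections import Counter

def knn_predict_one(x_new, X_train, y_train, k=3):
    """Classify a single flattened image by majority vote of its k nearest neighbors."""
    # 1. Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

Calling knn_predict_one(X_test[0], X_train, y_train) would perform a single 3-NN prediction; the full scan over the training set is exactly why per-image prediction cost grows with dataset size.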

Key Parameters:

  • Number of Neighbors: n_neighbors = 3
    • A small value (e.g., 3) captures local patterns but is prone to overfitting.
    • Larger values smooth predictions but may underfit.
  • Distance Metric:
    • Default is Euclidean distance
    • Other options include Manhattan or Minkowski distances for varying use cases.
  • Weighting Scheme:
    • Default is Uniform, where all neighbors contribute equally.
    • Weighted options give more influence to closer neighbors.
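
Putting this configuration together, a plausible scikit-learn setup looks like the sketch below. The project's train_knn helper wraps these steps, and its internals may differ in detail.

import time
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def train_knn_sketch(X, y):
    """Illustrative 80/20 split + scaling + 3-NN fit, mirroring the configuration above."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

    scaler = StandardScaler()
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)

    knn = KNeighborsClassifier(n_neighbors=3)  # Euclidean distance, uniform weights by default
    start = time.time()
    knn.fit(X_tr, y_tr)                        # "training" just stores the scaled data
    train_time = time.time() - start

    return knn, knn.score(X_te, y_te), train_time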

Performance Metrics

print(f"\nKNN Results:")
print(f"Training Time: {knn_train_time:.2f} seconds")
print(f"Accuracy: {knn_accuracy:.4f}")

KNN Results:
Training Time: 6.31 seconds
Accuracy: 0.9465

Conclusion

K-Nearest Neighbors offers a straightforward yet powerful approach to classification. By leveraging local neighborhood information and flexible distance calculations, KNN provides an interpretable method for pattern recognition in machine learning tasks.

Neural Network

A neural network is a machine learning algorithm inspired by the structure and function of the human brain. It is designed to learn relationships, recognize patterns, and make predictions by mimicking how biological neurons process and transmit information. Neural networks excel in handling complex, non-linear data, making them a versatile tool for tasks such as image recognition, natural language processing, and classification.

print("\nTraining Neural Network...")
start_time = time.time()
nn_model, history, nn_accuracy, nn_train_time = train_neural_network(X_train, X_test, y_train, y_test, rebuild_model=False)

Training Neural Network...

Architecture Overview

We created a neural network designed for MNIST digit classification. It features a multi-layer feedforward architecture with strategic layer design and regularization techniques.

Detailed Layer Analysis:

  1. Input Layer
    • Dimensions: 784 neurons (28 x 28 pixel flattened image)
    • Purpose: Direct mapping of pixel intensity values
    • Transformation: Converts 2D image to 1D feature vector
  2. First Hidden Layer
    • Dimensions: 256 neurons
    • Activation: ReLU (Rectified Linear Unit)
    • Objectives:
      • Initial complex feature extraction
      • Introduces non-linear transformations
      • Captures primary image characteristics
  3. First Dropout layer
    • Dropout Rate: 0.2 (20%)
    • Regularization Technique:
      • Randomly deactivates 20% of neurons during training
      • Prevents model overfitting
      • Reduces neuron interdependence
  4. Second Hidden Layer
    • Dimensions: 128 neurons
    • Activation: ReLU (Rectified Linear Unit)
    • Objectives:
      • Further abstract feature representations
      • Progressively reduce feature dimensionality
      • Refine initial feature extraction
  5. Second Dropout Layer
    • Dropout Rate: 0.2 (20%)
    • Continues regularization strategy
    • Prevents neural network from becoming too specialized
  6. Third Hidden Layer
    • Dimensions: 64 neurons
    • Activation: ReLU (Rectified Linear Unit)
    • Objectives:
      • Final feature abstraction
      • Prepares data for classification
      • Further reduces feature complexity
  7. Output Layer
    • Neurons: 10 (one per digit 0-9)
    • Activation: Softmax
    • Characteristics:
      • Converts raw scores to probability distribution
      • Ensures probabilities sum to 1
      • Enables multi-class classification
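
A Keras definition matching the layer-by-layer description above might look like the following sketch; the actual code inside train_neural_network may differ in details such as layer names or an explicit Flatten step.

from tensorflow import keras
from tensorflow.keras import layers

def build_mnist_model():
    """Feedforward network mirroring the architecture described above."""
    model = keras.Sequential([
        keras.Input(shape=(784,)),              # 28 x 28 image flattened to 784 values
        layers.Dense(256, activation="relu"),   # first hidden layer: broad feature extraction
        layers.Dropout(0.2),                    # drop 20% of activations during training
        layers.Dense(128, activation="relu"),   # second hidden layer: more abstract features
        layers.Dropout(0.2),
        layers.Dense(64, activation="relu"),    # third hidden layer: final feature abstraction
        layers.Dense(10, activation="softmax"), # one probability per digit; outputs sum to 1
    ])
    return model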

Training Configuration

  • Hyperparameters:

    1. epochs
      • Total Iterations: 10
      • Purpose:
        • Complete passes through entire training dataset
        • Allows progressive weight refinement
        • Prevents overfitting through limited iterations
    2. batch_size
      • Configuration: 128 samples per gradient update
      • Benefits:
        • Computational efficiency
        • Gradient noise reduction
        • Memory-friendly processing
    3. validation_split
      • Allocation: 10% of the training data
      • Functions:
        • Monitor model performance during training
        • Detect potential overfitting
        • Provide real-time performance insights
  • Optimization Strategy: Adam

    • Adaptive learning rate optimization
    • Characteristics:
      • Combines RMSprop and momentum advantages
      • Dynamically adjusts per-parameter learning rates
      • Handles sparse gradients effectively
  • Loss Function: Sparse Categorical Cross-Entropy

    • Ideal for multi-class classification
    • Measures:
      • Difference between predicted and actual distributions
      • Guides weight updates during backpropagation
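
Continuing the sketch above, these hyperparameters translate into a compile-and-fit call along the following lines; again, this approximates what train_neural_network does rather than reproducing its source.

# Assumes build_mnist_model() from the previous sketch and
# X_train / y_train as flattened, scaled arrays with integer labels.
model = build_mnist_model()

model.compile(
    optimizer="adam",                        # adaptive per-parameter learning rates
    loss="sparse_categorical_crossentropy",  # integer labels, multi-class output
    metrics=["accuracy"],
)

history = model.fit(
    X_train, y_train,
    epochs=10,             # complete passes over the training data
    batch_size=128,        # samples per gradient update
    validation_split=0.1,  # hold out 10% of training data to monitor overfitting
    verbose=1,
)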

Performance Metrics

print(f"\nNeural Network Results:")
print(f"Training Time: {nn_train_time:.2f} seconds")
print(f"Accuracy: {nn_accuracy:.4f}")

Neural Network Results:
Training Time: 11.37 seconds
Accuracy: 0.9740

Conclusion

The neural network architecture is carefully designed to balance complexity, feature extraction, and generalization. By incorporating strategic layer design, dropout regularization, and adaptive optimization, the model achieves robust performance in MNIST digit classification.

Model Comparison

In this section, we compare the performance of the K-Nearest Neighbors (KNN) algorithm and the Neural Network (NN) architecture based on key performance metrics: training time and accuracy.

Performance Metrics

compare_df = create_comparison_table(knn_accuracy, knn_train_time, nn_accuracy, nn_train_time)
print(compare_df)
                    Metric     KNN  Neural Network
0  Training Time (seconds)  6.3100          11.370
1             Accuracy (%)  0.9465           0.974
  1. Training Time
    • KNN exhibits a faster training process (6.31 seconds) since it is a “lazy learning” algorithm, which delays most computation until prediction.
    • The Neural Network, being an “eager learning” algorithm, spends more time (11.37 seconds) in training due to backpropagation, weight updates, and regularization techniques.
  2. Accuracy:
    • The Neural Network outperforms KNN with an accuracy of 97.4%, compared to 94.7% for KNN.
    • The Neural Network’s higher accuracy is attributed to its ability to extract complex, non-linear patterns in the data through multiple layers and activation functions.
    • KNN, while simpler, relies on proximity in the feature space, which may not fully capture intricate relationships.
  3. Scalability:
    • KNN’s computational cost increases significantly with larger datasets or higher-dimensional data due to the need to calculate distances for all training samples during prediction.
    • Neural Networks scale better for larger datasets, as training is done once, and predictions are efficient after model training.

Model Accuracy by Digit

The bar chart below compares the accuracy of two models, K-Nearest Neighbors (KNN) and a Neural Network, for classifying digits. Accuracy is presented for each digit (0–9) as a percentage.

analysis_text, knn_cm_percent, nn_cm_percent = analyze_model_accuracies(knn_model, nn_model, X_test, y_test)

# Then create and display the visualization
comparison_viz = compare_model_accuracies_encoded(knn_model, nn_model, X_test, y_test)

# Display both
# print(analysis_text)
display(HTML(comparison_viz))
[Figure: per-digit accuracy comparison, KNN vs. Neural Network]

Per-Class Performance Analysis

Let’s analyze how each model performs for different digits:

print("\nDetailed Per-Digit Analysis:")
print("-" * 50)
for i in range(10):
    knn_accuracy_per_class = knn_cm_percent[i,i]
    nn_accuracy_per_class = nn_cm_percent[i,i]
    
    print(f"\nDigit {i}:")
    print(f"KNN Accuracy: {knn_accuracy_per_class:.1f}%")
    print(f"Neural Network Accuracy: {nn_accuracy_per_class:.1f}%")
    print(f"Difference: {(nn_accuracy_per_class - knn_accuracy_per_class):.1f}%")

Detailed Per-Digit Analysis:
--------------------------------------------------

Digit 0:
KNN Accuracy: 98.1%
Neural Network Accuracy: 98.8%
Difference: 0.7%

Digit 1:
KNN Accuracy: 99.2%
Neural Network Accuracy: 98.6%
Difference: -0.6%

Digit 2:
KNN Accuracy: 94.1%
Neural Network Accuracy: 96.7%
Difference: 2.7%

Digit 3:
KNN Accuracy: 95.0%
Neural Network Accuracy: 96.7%
Difference: 1.7%

Digit 4:
KNN Accuracy: 93.6%
Neural Network Accuracy: 97.5%
Difference: 3.9%

Digit 5:
KNN Accuracy: 94.1%
Neural Network Accuracy: 96.5%
Difference: 2.4%

Digit 6:
KNN Accuracy: 97.2%
Neural Network Accuracy: 98.9%
Difference: 1.6%

Digit 7:
KNN Accuracy: 92.7%
Neural Network Accuracy: 97.4%
Difference: 4.7%

Digit 8:
KNN Accuracy: 89.9%
Neural Network Accuracy: 96.6%
Difference: 6.7%

Digit 9:
KNN Accuracy: 92.0%
Neural Network Accuracy: 96.2%
Difference: 4.2%

Prediction Speed Analysis

To understand real-world performance implications, let’s analyze prediction speeds for different batch sizes:

batch_sizes = [1, 10, 100, 1000]
results = {'knn': {}, 'nn': {}}
print("\nPrediction Speed Analysis:")
print("-" * 50)
for batch_size in batch_sizes:
    # Select subset of test data
    X_batch = X_test[:batch_size]
    
    # KNN timing
    start_time = time.time()
    _ = knn_model.predict(X_batch)
    knn_time = time.time() - start_time
    results['knn'][batch_size] = knn_time
    
    # Neural Network timing
    start_time = time.time()
    _ = nn_model.predict(X_batch, verbose=0)
    nn_time = time.time() - start_time
    results['nn'][batch_size] = nn_time
    
    print(f"\nBatch size: {batch_size}")
    print(f"KNN prediction time: {knn_time:.4f} seconds")
    print(f"Neural Network prediction time: {nn_time:.4f} seconds")
    print(f"Time per image - KNN: {(knn_time/batch_size)*1000:.2f}ms")
    print(f"Time per image - NN: {(nn_time/batch_size)*1000:.2f}ms")

Prediction Speed Analysis:
--------------------------------------------------

Batch size: 1
KNN prediction time: 0.0205 seconds
Neural Network prediction time: 0.0291 seconds
Time per image - KNN: 20.46ms
Time per image - NN: 29.10ms

Batch size: 10
KNN prediction time: 0.0344 seconds
Neural Network prediction time: 0.0266 seconds
Time per image - KNN: 3.44ms
Time per image - NN: 2.66ms

Batch size: 100
KNN prediction time: 0.0638 seconds
Neural Network prediction time: 0.0418 seconds
Time per image - KNN: 0.64ms
Time per image - NN: 0.42ms

Batch size: 1000
KNN prediction time: 0.7961 seconds
Neural Network prediction time: 0.0668 seconds
Time per image - KNN: 0.80ms
Time per image - NN: 0.07ms

Key Findings and Business Impact

Overall Accuracy

Neural Network: 0.9740 (97.4%)
KNN: 0.9465 (94.7%)

The Neural Network provides a 2.7 percentage point higher accuracy, making it a more reliable model for tasks requiring high precision.

Training Performance

KNN Training Time: 6.31 seconds
Neural Network Training Time: 11.37 seconds

KNN’s shorter training time reflects the fact that it does not perform iterative optimization or backpropagation; it simply stores the training data. This makes KNN a simpler model to implement, especially for smaller datasets or low-complexity tasks.

On the other hand, the Neural Network’s slightly longer training time is offset by its better classification performance and scalability to handle larger and more complex datasets. Its efficiency during frequent retraining makes it advantageous for dynamic datasets where the model needs to adapt continuously.

In summary, KNN may be preferred for simpler tasks with limited resources, while Neural Networks are better suited for tasks requiring higher accuracy and frequent updates.

Prediction Speed

  • Single images (batch size 1): KNN performs faster
  • Batches of 10+ images: Neural Network shows superior performance, with the gap widening as batch size grows
  • The Neural Network scales better for production workloads where large volumes of data need to be processed consistently.

Error Analysis

Common Challenges:

  • Both models struggle most with visually similar digits (3/5, 4/9, 7/9)
  • Misclassifications are more frequent when digits share overlapping features or stroke patterns
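
One way to confirm which pairs drive these errors is to rank the off-diagonal entries of the confusion matrices computed earlier. The sketch below assumes knn_cm_percent and nn_cm_percent (returned by analyze_model_accuracies above) are 10 × 10 row-normalized percentage matrices.

import numpy as np

def top_confusions(cm_percent, n=3):
    """Return the n largest off-diagonal entries of a row-normalized confusion matrix."""
    cm = np.array(cm_percent, dtype=float)
    np.fill_diagonal(cm, 0)                      # ignore correct classifications
    order = np.argsort(cm, axis=None)[::-1][:n]  # flat indices of the largest confusions
    return [(idx // 10, idx % 10, cm[idx // 10, idx % 10]) for idx in order]

for name, cm in [("KNN", knn_cm_percent), ("Neural Network", nn_cm_percent)]:
    print(f"\n{name} - most frequent confusions (true digit -> predicted digit):")
    for true_digit, predicted_digit, pct in top_confusions(cm):
        print(f"  {true_digit} -> {predicted_digit}: {pct:.1f}% of that digit's test images")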

Performance Consistency:

  • Neural Network: Shows more consistent performance across all digit classes due to its ability to generalize and extract deeper features
  • KNN: Displays higher variability in accuracy across digit classes, as it relies heavily on proximity to training samples

Business Insights: The Neural Network’s consistency ensures fewer outliers in predictions, making it a safer choice for critical applications where errors carry significant consequences.

Business Implications

For real-time, single-image processing:

  • KNN Advantage: Faster prediction times for small-scale tasks make KNN ideal for low-latency requirements.
    • Example: In a security camera system for facial recognition at entry points, KNN could quickly identify faces in real-time, where speed is crucial for granting access without delays.

Batch Processing:

  • Neural Network Advantage: Superior performance and scalability make it the preferred choice for batch processing tasks, such as automated document scanning or large-scale data classification.

Trade-Offs:

  • Setup Time: KNN offers a quicker setup and is easier to implement for quick turnaround projects.
    • Example: A retail business launching a new product line might use KNN to quickly build a recommendation system. Since KNN is easy to implement, the business can rapidly test and deploy a basic version of the recommendation engine to suggest products to customers, allowing them to gather early feedback and adjust their strategy without a lengthy development cycle.
  • Long-Term Performance: Neural Networks offer better reliability and scalability, justifying the initial investment in setup and training.
    • Example: A large insurance company with thousands of claims per day might prefer Neural Networks for fraud detection due to their ability to adapt as the dataset grows and improve performance over time.

Memory Usage:

  • KNN: Requires storing the entire training dataset in memory, leading to higher resource consumption as the dataset size increases.
    • Example: In a real-time recommendation system for a small retailer, KNN may be inefficient as it needs to store all user data for every prediction, leading to higher memory costs as the user base grows.
  • Neural Network: Fixed memory footprint post-training makes it more suitable for resource-constrained environments.
    • Example: A mobile app for language translation might use a pre-trained Neural Network that doesn’t require storing new training data, making it ideal for running efficiently on devices with limited memory.

Scalability:

  • Neural Networks demonstrate better scalability for production systems due to their ability to handle increasing dataset sizes and complexities without a significant drop in performance.
    • Example: As customer interactions and transactions grow, neural networks can continue to improve product recommendations or fraud detection systems, providing businesses with the agility to meet increasing demands without overhauling their infrastructure. Moreover, scalable models help companies maintain cost-efficiency by reducing the need for manual intervention or frequent model retraining as data volume increases.

Integration:

  • KNN’s simplicity makes it easier to integrate for basic applications or for systems with minimal computational capabilities.
    • Example: A small healthcare clinic could integrate KNN for simple predictive tasks like diagnosing basic diseases based on patient symptoms without requiring expensive computational resources.

Conclusion

In this analysis, the Neural Network (NN) demonstrated superior performance compared to the K-Nearest Neighbors (KNN) model, particularly in terms of accuracy and its ability to handle more complex patterns in the MNIST dataset. The Neural Network achieved an impressive 97.4% accuracy, with stable performance across all digit classes, making it highly reliable for high-accuracy tasks.

While KNN outperforms Neural Networks in terms of training speed and prediction time for small batches, its performance is more variable and less consistent, especially with visually similar or complex digits. KNN is more appropriate for scenarios where speed and simplicity are critical, such as real-time applications with small datasets or when interpretability is a priority.

On the other hand, Neural Networks offer a better balance of accuracy and scalability, making them well-suited for production environments dealing with larger and more complex datasets. Despite the longer training time, their ability to consistently improve performance as the dataset size increases makes them the more effective choice for applications requiring high precision and reliability.

In summary, while KNN remains a valuable tool for certain use cases, Neural Networks stand out as the more robust and scalable option for tasks that demand accuracy and complexity handling, especially in production-level applications.