Build a Handwritten Digit Classifier with Random Forests and Convolutional Neural Networks and Compare Approaches

Author

Clarke A. Homan

Published

May 13, 2026

1 Abstract

This study compared two machine learning approaches to handwritten digit classification using the MNIST dataset: a Random Forest classifier implemented via the tidymodels ecosystem and a Convolutional Neural Network (CNN) implemented using the torch and luz packages in R. The MNIST training set (n = 60,000) and test set (n = 10,000) were obtained in CSV format from Kaggle. Prior to Random Forest modeling, permutation-based variable importance scoring was used to reduce the 784-pixel feature space to 617 informative features, eliminating always-blank border pixels confirmed by visual inspection of a pixel importance heatmap. The Random Forest classifier achieved a test set accuracy of 97.05% following 5-fold cross-validated hyperparameter tuning. The CNN, trained across 10 epochs using a standard small architecture with two convolutional layers, achieved a test set accuracy of 99.27% — a 2.22 percentage point improvement over the Random Forest, representing a 75.3% reduction in error rate. The CNN outperformed the Random Forest on every digit class, with the largest advantages observed for digits 8, 9, 2, and 3 — digits characterized by complex curved strokes that benefit most from spatial pattern recognition. CNN training completed in 13.92 minutes compared to approximately 9 hours and 47 minutes for the full Random Forest pipeline, representing a roughly 42-fold reduction in compute time. These results demonstrate that CNNs are the superior approach for image classification tasks of this nature, offering both higher accuracy and substantially greater computational efficiency.

2 Introduction

Handwritten digit recognition is a foundational problem in machine learning and computer vision, serving as a benchmark for evaluating classification approaches across a wide range of methodologies. The MNIST dataset, comprising 70,000 grayscale images of handwritten digits (0–9) at 28×28 pixel resolution, has been widely adopted as a standard evaluation benchmark since its introduction by LeCun et al. (1998).

This study evaluated two fundamentally different classification approaches on the MNIST dataset: a Random Forest classifier, which treats each pixel as an independent tabular feature, and a Convolutional Neural Network (CNN), which exploits the spatial structure of the image grid through learned local filters. The primary objectives were to assess the relative classification accuracy of each approach, characterize per-digit-class performance differences, and compare computational efficiency. A secondary objective was to evaluate the utility of permutation-based variable importance scoring as a pixel feature selection method prior to Random Forest modeling.

The present study was motivated in part by Radečić (2021), which demonstrated a Random Forest approach to MNIST classification in R. The current analysis extends that work by incorporating permutation-based feature selection, formal cross-validated hyperparameter tuning via the tidymodels ecosystem, and a direct comparison against a CNN implementation using torch and luz.

3 Data

The MNIST dataset was obtained in CSV format from Kaggle (https://www.kaggle.com/datasets/oddrationale/mnist-in-csv), comprising a training set of 60,000 images and a test set of 10,000 images. Each image is represented as a single row containing a class label (0–9) and 784 pixel intensity values (one per pixel in the 28×28 grid), with pixel values ranging from 0 (black) to 255 (white).

Radečić, D. (2021, February 14). Build an MNIST Classifier With Random Forests. Appsilon. https://www.appsilon.com/post/r-mnist-random-forests

4 Methods

4.1 Software and Packages

All analyses were conducted in R (version 4.5.0). Key packages and versions are listed in Table 1. torch 0.11.0 and luz 0.4.0 were used in place of current releases because the current torch LibTorch binaries are incompatible with the C++ standard library (libc++) shipped with macOS 12.7.6 (Monterey) on the analysis machine. Project reproducibility was managed using renv.

Package	Version	Purpose
tidymodels	1.5.0	RF modeling pipeline
ranger	0.18.0	Random Forest engine
vip	0.4.6	Variable importance extraction
torch	0.11.0	CNN backend (older version required)
luz	0.4.0	High-level CNN training interface (older version required)
yardstick	1.4.0	Classification metrics
ggplot2	4.0.3	Visualization
patchwork	1.3.2	Multi-panel figure layout
renv	1.1.5	Project reproducibility
dplyr	1.2.1	Data manipulation
base	4.5.2	The R Base Package
graphics	4.5.2	The R Graphics Package

Table 1: R packages and versions used in this analysis.

4.2 Phase 1: Data Preparation

4.2.1 Random Forest Preprocessing

Pixel intensity values were normalized to the [0, 1] range by dividing by 255. An initial Random Forest model (500 trees, permutation-based importance, 3 parallel threads) was fitted on the normalized training data using the ranger package to extract pixel-level variable importance scores. Features with permutation importance scores of zero or below were treated as uninformative and excluded from further modeling. This reduced the feature space from 784 to 617 pixels; visual inspection of the pixel importance heatmap (Figure 1) confirmed that the 167 eliminated pixels lie in the always-blank border regions. The column mask was derived from the training data only and then applied unchanged to the test set, so that no test-set information entered feature selection.
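
The filtering step described above can be sketched on a toy data frame. This is a minimal illustration rather than the study's code: the column names (px_signal, px_noise, px_blank1, px_blank2) and the simulated values are hypothetical, and only the ranger settings (500 trees, permutation importance, 3 threads) follow the text.

```r
library(ranger)

set.seed(123)
n <- 200

# Toy stand-in for MNIST: one informative pixel, one noise pixel,
# and two always-blank pixels (hypothetical column names).
toy <- data.frame(
  label     = factor(rep(0:1, each = n / 2)),
  px_signal = c(rnorm(n / 2, 50, 10), rnorm(n / 2, 200, 10)),
  px_noise  = runif(n, 0, 255),
  px_blank1 = 0,
  px_blank2 = 0
)

# Scale the pixel columns to roughly [0, 1] by dividing by 255.
pixel_cols <- setdiff(names(toy), "label")
toy[pixel_cols] <- toy[pixel_cols] / 255

# Fit a forest only to obtain permutation importance scores.
rf_imp <- ranger(
  label ~ ., data = toy,
  num.trees = 500, importance = "permutation",
  num.threads = 3, seed = 123
)

# Keep pixels whose importance is strictly positive.
imp  <- rf_imp$variable.importance
keep <- names(imp)[imp > 0]
train_reduced <- toy[, c("label", keep)]
```

The same `keep` vector would then be used to subset the test set, so the mask is never re-estimated on test data. Permuting a constant column cannot change any prediction, which is why the blank pixels receive an importance of exactly zero and drop out.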

4.2.2 Convolutional Neural Network Preprocessing

A separate preprocessing path was applied to the CNN input data. Pixel intensity values were normalized to the [0, 1] range by dividing by 255. The normalized data were then reshaped from a flat 784-column row format into a four-dimensional tensor of shape [batch, 1, 28, 28], representing batch size, channel (grayscale = 1), image height, and image width respectively. Finally, pixel values were standardized using the known MNIST population mean (0.1307) and standard deviation (0.3081), yielding a training tensor mean of approximately 0.000 and standard deviation of approximately 1.000, confirming correct standardization.
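
The reshaping and standardization steps can be illustrated in base R. This is a sketch under the assumption that the CSV stores each image row by row (pixel 1 is the top-left corner and pixel 29 starts the second image row); the toy matrix `x` is simulated, and the study itself presumably performed the reshape on torch tensors.

```r
set.seed(123)
n <- 2

# Toy stand-in for the pixel columns: n images, 784 intensities in 0..255.
x <- matrix(sample(0:255, n * 784, replace = TRUE), nrow = n)

# Step 1: scale to [0, 1].
x01 <- x / 255

# Step 2: reshape to [batch, channel, height, width].
# R fills arrays column-major while the pixels are row-major, so the
# data are laid out as [width, height, channel, batch] and then permuted.
x4d <- aperm(array(t(x01), dim = c(28, 28, 1, n)), c(4, 3, 2, 1))

# Step 3: standardize with the MNIST mean and standard deviation.
x_std <- (x4d - 0.1307) / 0.3081

dim(x_std)
```

A plain `array(x01, dim = c(n, 1, 28, 28))` would silently scramble the images because of R's column-major fill order. With torch, converting the matrix with `torch_tensor()` and calling `$view(c(-1, 1, 28, 28))` should give the same layout, since torch reshapes in row-major order.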

4.3 Phase 2: Random Forest Classification

The Random Forest classification pipeline was constructed using the tidymodels ecosystem with the ranger engine. The mtry hyperparameter, which controls the number of features randomly sampled at each tree split, was the primary tuning target. The search grid explored four candidate values (15, 25, 35, 45) bracketing the standard sqrt(n_remaining_features) heuristic (\(\sqrt{617} \approx 25\)). Optimal mtry selection was performed via 5-fold stratified cross-validation, with digit class used as the stratification variable to ensure balanced fold composition. Cross-validated tuning produced mean accuracies exceeding 96% across all candidate values, with a range of only 0.11 percentage points, indicating that the model is robust to mtry selection within the explored range. The best-performing value, mtry = 35, was selected and used to fit the final model on the full normalized and reduced training set. A random seed of 123 was set prior to all random processes.
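
A tuning workflow of this shape might look as follows in tidymodels. This is a minimal sketch on simulated data, not the study's pipeline: the data frame `toy`, its column names, and the choice of 100 trees are assumptions made so that the example runs quickly; only the mtry grid, the 5 stratified folds, and the seed follow the text.

```r
library(tidymodels)

set.seed(123)

# Simulated stand-in for the reduced training set: 50 features, 3 classes.
n <- 300
p <- 50
toy <- as.data.frame(matrix(runif(n * p), nrow = n))
names(toy) <- paste0("px", seq_len(p))
toy$label <- factor(sample(0:2, n, replace = TRUE))

# Random Forest specification with mtry left open for tuning.
rf_spec <- rand_forest(mtry = tune(), trees = 100) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_formula(label ~ .) |>
  add_model(rf_spec)

# 5-fold cross-validation stratified on the class label.
folds <- vfold_cv(toy, v = 5, strata = label)

# The four candidate values described in the text.
mtry_grid <- tibble(mtry = c(15, 25, 35, 45))

tuned <- tune_grid(
  rf_wf,
  resamples = folds,
  grid = mtry_grid,
  metrics = metric_set(accuracy)
)

# Pick the best mtry and refit on the full training data.
best <- select_best(tuned, metric = "accuracy")
final_fit <- finalize_workflow(rf_wf, best) |>
  fit(data = toy)
```

On the real data, `round(sqrt(617))` gives the heuristic value of 25 that the grid brackets. Because the labels in this toy example are random, the accuracies it produces carry no meaning; only the structure of the workflow is of interest.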

4.4 Phase 3: Convolutional Neural Network

The CNN was implemented using the torch and luz packages in R. The architecture comprised two convolutional layers followed by two fully connected layers:

Conv2d(1, 32, kernel=3) + ReLU + MaxPool2d(2)
Conv2d(32, 64, kernel=3) + ReLU + MaxPool2d(2)
Flatten
Linear(1600, 128) + ReLU + Dropout(0.5)
Linear(128, 10)
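
The layer stack above can be written as a torch module along the following lines. This is a sketch, not the study's code: the module name `digit_cnn` and the field names are hypothetical, and padding and stride are left at their defaults, which is the assumption under which the flattened size works out to 1600 (28 to 26 to 13 to 11 to 5 per side, times 64 channels).

```r
library(torch)

digit_cnn <- nn_module(
  "digit_cnn",
  initialize = function() {
    self$conv1 <- nn_conv2d(1, 32, kernel_size = 3)
    self$conv2 <- nn_conv2d(32, 64, kernel_size = 3)
    self$fc1   <- nn_linear(1600, 128)
    self$drop  <- nn_dropout(p = 0.5)
    self$fc2   <- nn_linear(128, 10)
  },
  forward = function(x) {
    # Block 1: 1x28x28 -> 32x26x26 -> 32x13x13
    x <- nnf_max_pool2d(nnf_relu(self$conv1(x)), 2)
    # Block 2: 32x13x13 -> 64x11x11 -> 64x5x5
    x <- nnf_max_pool2d(nnf_relu(self$conv2(x)), 2)
    # Flatten everything except the batch dimension: 64 * 5 * 5 = 1600
    x <- torch_flatten(x, start_dim = 2)
    x <- self$drop(nnf_relu(self$fc1(x)))
    # Raw class scores; cross-entropy loss applies the softmax itself.
    self$fc2(x)
  }
)

# Shape check on a random batch of four images.
model <- digit_cnn()
model$eval()
out <- model(torch_randn(4, 1, 28, 28))
dim(out)
```

The forward pass returns one score per digit class for each image, so a batch of four images yields a 4 by 10 output.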

The network was trained using the Adam optimizer with cross-entropy loss. The training set was partitioned into an 80/20 train/validation split (48,000 training images, 12,000 validation images) prior to training. The model was trained for a maximum of 10 epochs with early stopping configured to halt training if validation loss failed to improve for 3 consecutive epochs (patience = 3). Early stopping did not trigger: validation loss never went more than 2 consecutive epochs without reaching a new minimum (Table 2). A random seed of 123 was set prior to training.

Results

4.5 Pixel Importance Analysis

Permutation-based variable importance scoring identified 617 of 784 pixels (78.7%) as having positive importance for digit classification. The remaining 167 pixels (21.3%) were assigned zero or negative importance and eliminated from the Random Forest feature set. Visual inspection of the pixel importance heatmap (Figure 1) confirmed that eliminated pixels correspond to the border regions of the 28×28 image grid, where digit strokes are consistently absent across all digit classes. The highest importance pixels were concentrated in the central region of the grid (approximately rows 7–22, columns 5–22), consistent with the known spatial distribution of MNIST digit strokes.

Figure 1: MNIST Training Data Pixel Importance Heatmap
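
Mapping a flat importance vector back onto the image grid is what makes a heatmap like Figure 1 possible. The sketch below uses simulated importance scores, not the study's values, and assumes the 784 pixels are ordered row by row; the central window of rows 7 to 22 and columns 5 to 22 is borrowed from the text purely to give the toy vector some structure.

```r
set.seed(123)

# Row and column of each flat pixel index, assuming row-major order.
idx    <- seq_len(784)
row_id <- (idx - 1) %/% 28 + 1
col_id <- (idx - 1) %% 28 + 1

# Simulated importance: positive inside a central window, zero elsewhere.
central <- row_id >= 7 & row_id <= 22 & col_id >= 5 & col_id <= 22
imp <- numeric(784)
imp[central] <- runif(sum(central), min = 0.001, max = 0.01)

# Lay the scores out on the 28 x 28 grid (byrow matches the pixel order).
grid <- matrix(imp, nrow = 28, ncol = 28, byrow = TRUE)

# Share of pixels that would survive an importance > 0 filter.
keep <- imp > 0
share_kept <- mean(keep)

# image(t(grid[28:1, ])) would draw the grid with image row 1 at the top.
```

In this toy case 288 of 784 pixels survive the filter. The study's own scores kept 617 pixels (78.7%), so its retained region extends well beyond the central window used here.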

4.6 Random Forest Hyperparameter Tuning

Cross-validated tuning of the mtry hyperparameter across four candidate values (15, 25, 35, and 45) produced mean classification accuracies exceeding 96% in all cases, with a range of only 0.11 percentage points across the entire grid. The optimal value was mtry = 35. The negligible variation in accuracy across the search grid suggests the Random Forest classifier is robust to mtry selection within this range.

4.7 CNN Training Performance

CNN training progressed steadily across all 10 epochs, with validation accuracy reaching 99.19% by Epoch 10 (Table 2). Validation loss fell from 0.0752 to 0.0335 overall, with small transient increases at Epochs 5, 6, and 8, and reached its minimum at the final epoch, so there was no sign of sustained overfitting. Training loss exceeded validation loss through Epoch 5, as is expected when dropout is active only during training, and fell slightly below it from Epoch 6 onward, ending about 0.01 lower.

Epoch	Train Loss	Train Acc	Val Loss	Val Acc
1	0.2332	92.91%	0.0752	97.62%
2	0.0863	97.41%	0.0469	98.57%
3	0.0652	98.04%	0.0395	98.78%
4	0.0511	98.40%	0.0389	98.84%
5	0.0428	98.69%	0.0393	98.82%
6	0.0358	98.88%	0.0424	98.84%
7	0.0317	98.97%	0.0370	98.89%
8	0.0307	99.05%	0.0374	98.95%
9	0.0245	99.18%	0.0341	99.05%
10	0.0238	99.22%	0.0335	99.19%

Table 2: CNN training and validation metrics across all 10 epochs.
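
The early-stopping rule from the Methods (patience = 3) can be replayed against the validation losses in Table 2. The loop below is a simplified stand-in for the rule, assuming that any decrease below the best loss so far counts as an improvement; the exact semantics of the callback used in the study may differ.

```r
# Validation loss per epoch, copied from Table 2.
val_loss <- c(0.0752, 0.0469, 0.0395, 0.0389, 0.0393,
              0.0424, 0.0370, 0.0374, 0.0341, 0.0335)
patience <- 3

best       <- Inf          # lowest validation loss seen so far
wait       <- 0            # consecutive epochs without a new minimum
max_wait   <- 0            # longest such run over the whole training
stopped_at <- NA_integer_  # epoch at which training would have halted

for (epoch in seq_along(val_loss)) {
  if (val_loss[epoch] < best) {
    best <- val_loss[epoch]
    wait <- 0
  } else {
    wait <- wait + 1
    max_wait <- max(max_wait, wait)
    if (wait >= patience) {
      stopped_at <- epoch
      break
    }
  }
}

max_wait    # 2
stopped_at  # NA
```

The longest run without a new minimum is two epochs (Epochs 5 and 6), one short of the patience threshold, so training ran for all 10 epochs.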

4.8 Test Set Comparison

Figure 2: MNIST Classification Model Performance Comparison

Test set classification performance is summarized in Figure 2. The CNN achieved a test set accuracy of 99.27%, compared to 97.05% for the Random Forest — a difference of 2.22 percentage points. Expressed in terms of error rate, the CNN reduced misclassifications by 75.3% relative to the Random Forest (0.73% vs. 2.95%).
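
The relationship between the accuracy gap and the relative error reduction is simple arithmetic on the two reported accuracies, shown here as a short check. The variable names are illustrative.

```r
acc_rf  <- 0.9705   # Random Forest test accuracy
acc_cnn <- 0.9927   # CNN test accuracy
n_test  <- 10000    # test set size

err_rf  <- 1 - acc_rf
err_cnn <- 1 - acc_cnn

# Absolute gain in percentage points.
gain_pp <- (acc_cnn - acc_rf) * 100

# Relative reduction in error rate.
rel_reduction <- (err_rf - err_cnn) / err_rf

# Implied number of misclassified test images per model.
errors_rf  <- round(err_rf * n_test)
errors_cnn <- round(err_cnn * n_test)

round(gain_pp, 2)              # 2.22
round(rel_reduction * 100, 1)  # 75.3
c(errors_rf, errors_cnn)       # 295 73
```

A gap of 2.22 percentage points looks small on the accuracy scale, but it removes about three quarters of the Random Forest's errors.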

The CNN outperformed the Random Forest on every digit class without exception. The largest CNN advantages were observed for digit 8 (4.00 percentage points), digit 9 (2.97 percentage points), digit 2 (2.81 percentage points), and digit 3 (2.77 percentage points). The smallest advantages were observed for digit 0 (0.71 percentage points) and digit 1 (0.97 percentage points). This pattern is consistent with the expectation that digits characterized by complex curved strokes benefit most from the CNN’s spatial pattern recognition capability.

Confusion matrix analysis revealed that both models exhibited classic MNIST confusion patterns, though with differing severity. The Random Forest most frequently confused digit 2 with digit 7 (19 instances), digit 4 with digit 9 (12 instances), and digit 3 with digit 9 (11 instances). The CNN reduced these confusions substantially — the most frequent misclassification was digit 5 predicted as digit 9 (8 instances). In both confusion matrices, rows represent predicted classes and columns represent true classes; diagonal elements indicate correct classifications and off-diagonal elements indicate misclassifications.
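
The row and column convention described above can be reproduced with a base R cross-tabulation. The labels below are a handful of made-up predictions, not the study's results; they only show how the most frequent off-diagonal cells are ranked.

```r
# Made-up true and predicted digits for eight images.
truth <- factor(c(2, 2, 7, 4, 9, 9, 3, 2), levels = 0:9)
pred  <- factor(c(2, 7, 7, 9, 9, 9, 3, 7), levels = 0:9)

# Rows are predicted classes, columns are true classes.
cm <- table(Prediction = pred, Truth = truth)

# Long format, keeping only non-empty off-diagonal cells.
cells <- as.data.frame(cm, stringsAsFactors = FALSE)
off   <- cells[cells$Prediction != cells$Truth & cells$Freq > 0, ]
off   <- off[order(-off$Freq), ]

# Overall accuracy is the diagonal share.
accuracy <- sum(diag(cm)) / sum(cm)

off
```

In this toy example the top row of `off` is a true 2 predicted as 7 (two instances), followed by a true 4 predicted as 9 (one instance), and 5 of the 8 images sit on the diagonal.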

4.9 Computational Efficiency

The computational cost of the two approaches differed substantially (Table 3). Random Forest training required approximately 9 hours and 47 minutes in total, comprising 9 hours and 11 minutes for 5-fold cross-validated hyperparameter tuning and 36 minutes for final model fitting, all performed on a 3-core CPU. CNN training across 10 epochs completed in 13.92 minutes on the same hardware, representing a roughly 42-fold reduction in total compute time.

Model	Phase	Time
Random Forest	CV Tuning (5-fold, 4 mtry values)	9hrs 11min
Random Forest	Final Model Fit	36min
Random Forest	Total	9hrs 47min
CNN	Training (10 epochs)	13.92min

Table 3: Computational time comparison between Random Forest and CNN pipelines.
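
The fold difference quoted above follows directly from the times in Table 3, converted to minutes. The variable names are illustrative.

```r
# Times from Table 3, in minutes.
rf_minutes  <- c(cv_tuning = 9 * 60 + 11, final_fit = 36)
cnn_minutes <- 13.92

rf_total <- sum(rf_minutes)

# Ratio of the full Random Forest pipeline to CNN training.
ratio_total <- rf_total / cnn_minutes

# Ratio for the final Random Forest fit alone, without tuning.
ratio_fit_only <- unname(rf_minutes["final_fit"]) / cnn_minutes

rf_total                  # 587
round(ratio_total, 1)     # 42.2
round(ratio_fit_only, 1)  # 2.6
```

Cross-validated tuning accounts for 551 of the 587 minutes. Even the single final fit, at 36 minutes, took about 2.6 times as long as the complete CNN training run.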

5 Discussion

The results of this study demonstrate a clear and consistent advantage for the CNN approach over the Random Forest classifier across all evaluated dimensions: overall accuracy, per-class accuracy, confusion matrix sparsity, and computational efficiency. The 2.22 percentage point accuracy advantage corresponds to a 75.3% reduction in error rate, or roughly 73 rather than 295 misclassified images in the 10,000-image test set, which is practically significant in the context of digit classification. No formal test of statistical significance is reported for this difference.

The pattern of per-class accuracy differences is theoretically coherent. Digits with complex curved stroke structures — particularly 8, 9, 2, and 3 — showed the largest CNN advantages, consistent with the CNN’s ability to detect spatial relationships between adjacent pixels through learned convolutional filters. Simpler digit forms such as 0 and 1 showed smaller advantages, suggesting that the spatial structure of these digits is sufficiently captured by the tabular feature representation used by the Random Forest.

The pixel importance analysis conducted in Phase 1 contributed meaningful analytical insight beyond its role as a preprocessing step. The resulting heatmap provided visual confirmation that MNIST classification signal is spatially concentrated in the central region of the 28×28 image grid, with border pixels consistently uninformative across all digit classes. This finding is consistent with the spatial assumptions underlying the CNN architecture, although it does not by itself validate them.

Several limitations of this study warrant acknowledgment. The Random Forest pipeline required approximately 42 times more compute time than the CNN, almost entirely because of the 5-fold cross-validated hyperparameter tuning in Phase 2, which accounted for 9 hours and 11 minutes of the 9 hours and 47 minutes reported in Table 3. The permutation-based importance scoring performed in Phase 1 is not itemized in that total. A simplified Random Forest pipeline omitting these steps would be faster, though potentially less accurate. Additionally, the analysis was constrained to older versions of torch (0.11.0) and luz (0.4.0) due to binary incompatibility with the macOS 12.7.6 C++ standard library on the analysis machine; current software versions may produce marginally different results.

6 Conclusions

The CNN approach is clearly superior to the Random Forest classifier for handwritten digit classification on the MNIST dataset, offering higher accuracy, lower error rates, and substantially greater computational efficiency. The Random Forest contributed meaningful analytical value through the pixel importance analysis, which characterized the spatial structure of the MNIST feature space and is consistent with the locality assumptions built into the CNN. For image classification tasks where spatial relationships between pixels are informative, as is clearly the case for handwritten digit recognition, convolutional architectures are the preferred approach.

7 References

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

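Kaggle MNIST CSV dataset: https://www.kaggle.com/datasets/oddrationale/mnist-in-csv

Radečić, D. (2021, February 14). Build an MNIST Classifier With Random Forests. Appsilon. https://www.appsilon.com/post/r-mnist-random-forests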

8 Attribution

Data preprocessing, modeling, and visualization code was independently authored by Clarke A. Homan with the support of the AI Claude (Anthropic). Report language was developed with assistance from Claude (Anthropic). Debugging and code optimization support was provided by Claude (Anthropic).