The MNIST dataset is one of the best-known datasets in machine learning and is widely used for training image processing systems. It contains grayscale images of handwritten digits, each 28x28 pixels. Each pixel value ranges from 0 (white) to 255 (black). The dataset is structured as follows:
Label: This variable denotes the class of the handwritten digit image. It represents the digit depicted in the image (0 through 9). In some variations of the MNIST dataset (though not standard), the label might represent letters if it’s modified and extended to alphabets.
Pix1, Pix2, …, Pix784: These variables represent the pixel values of the 28x28 pixel images. Each image is “flattened” into a single row with 784 columns (28 multiplied by 28), where each PixN corresponds to the grayscale value of a pixel. Each pixel value ranges from 0 to 255, where 0 corresponds to a completely white pixel and 255 corresponds to a completely black pixel.
Structure of the Data: The first column is Label, and the remaining 784 columns are the pixel intensity values, ordered from the top-left to the bottom-right of the image.
Goal: The goal with MNIST is to build a model that can predict the Label from the 784 pixel values, effectively allowing a computer to recognize handwritten digits.
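To make the flattened layout concrete, the following minimal sketch rebuilds a 28x28 grid from one row of 784 pixel values. It assumes the pixels are stored row by row, from the top-left to the bottom-right; the vector `pixels` is synthetic stand-in data, not a real MNIST image.

```r
# Synthetic stand-in for one flattened image: 784 values (Pix1 ... Pix784).
pixels <- 0:783

# Rebuild the 28x28 grid. byrow = TRUE because the pixels are assumed to be
# stored row by row, from the top-left to the bottom-right of the image.
img <- matrix(pixels, nrow = 28, ncol = 28, byrow = TRUE)

dim(img)     # dimensions of the rebuilt image
img[1, 2]    # second pixel of the first row (Pix2)
img[2, 1]    # first pixel of the second row (Pix29)
```

For a row of real data, `pixels` would be something like `as.numeric(DataMnist[1, -1])`, assuming the data frame is named `DataMnist` and Label is its first column.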
# Load necessary libraries
library(class)
# Install gmodels only if it is missing; a CRAN mirror must be given
# explicitly when the script runs non-interactively
if (!requireNamespace("gmodels", quietly = TRUE)) {
  install.packages("gmodels", repos = "https://cloud.r-project.org")
}
library(gmodels)
# Load MNIST dataset
DataMnist <- readRDS("MN500.rds")
DataMnist$Label <- factor(DataMnist$Label)
# Create Training and Testing Data
set.seed(123)
index <- sample(1:nrow(DataMnist), 0.7 * nrow(DataMnist))
DataTrain <- DataMnist[index, ]
DataTest <- DataMnist[-index, ]
# Prepare the data for KNN (excluding the label column for training/testing dataset)
train_labels <- DataTrain$Label
test_labels <- DataTest$Label
# Remove the label column from the datasets for KNN
DataTrain <- DataTrain[, -1]
DataTest <- DataTest[, -1]
# Perform KNN for k = 5
predicted_labels <- knn(train = DataTrain, test = DataTest, cl = train_labels, k = 5)
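For intuition, the `knn()` call above can be approximated by hand for a single test point: compute the Euclidean distance to every training row, take the k closest rows, and return the most common label among them. The sketch below uses a tiny made-up two-feature dataset. `knn_one` is a hypothetical helper, not part of the `class` package, and it does not reproduce the random tie-breaking that `knn()` performs.

```r
# Hypothetical helper: classify one point by majority vote among its
# k nearest training rows (Euclidean distance). Ties are not handled specially.
knn_one <- function(train, labels, point, k) {
  diffs <- sweep(as.matrix(train), 2, point)  # subtract the point from each row
  dists <- sqrt(rowSums(diffs^2))             # Euclidean distance to each row
  nearest <- order(dists)[seq_len(k)]         # indices of the k closest rows
  votes <- table(labels[nearest])             # label counts among the neighbours
  names(votes)[which.max(votes)]              # most frequent label
}

# Made-up toy data: two well-separated clusters in two dimensions.
toy_train <- data.frame(x = c(1, 1.2, 0.8, 5, 5.1, 4.9),
                        y = c(1, 0.9, 1.1, 5, 4.8, 5.2))
toy_labels <- factor(c("a", "a", "a", "b", "b", "b"))

res_a <- knn_one(toy_train, toy_labels, point = c(1.1, 1.0), k = 3)
res_b <- knn_one(toy_train, toy_labels, point = c(4.8, 5.0), k = 3)
print(c(res_a, res_b))
```

With MNIST, each "point" is a vector of 784 pixel values rather than two coordinates, but the distance and voting steps are the same.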
# Create Confusion Matrix using CrossTable
CrossTable(x = test_labels, y = predicted_labels, prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 150
| predicted_labels
test_labels | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 12 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 |
| 0.923 | 0.000 | 0.077 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.087 |
| 0.923 | 0.000 | 0.125 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
| 0.080 | 0.000 | 0.007 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 0 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23 |
| 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.153 |
| 0.000 | 0.742 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
| 0.000 | 0.153 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
2 | 0 | 2 | 6 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 10 |
| 0.000 | 0.200 | 0.600 | 0.000 | 0.100 | 0.000 | 0.000 | 0.100 | 0.000 | 0.000 | 0.067 |
| 0.000 | 0.065 | 0.750 | 0.000 | 0.067 | 0.000 | 0.000 | 0.059 | 0.000 | 0.000 | |
| 0.000 | 0.013 | 0.040 | 0.000 | 0.007 | 0.000 | 0.000 | 0.007 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
3 | 0 | 1 | 0 | 9 | 0 | 0 | 0 | 1 | 0 | 0 | 11 |
| 0.000 | 0.091 | 0.000 | 0.818 | 0.000 | 0.000 | 0.000 | 0.091 | 0.000 | 0.000 | 0.073 |
| 0.000 | 0.032 | 0.000 | 0.900 | 0.000 | 0.000 | 0.000 | 0.059 | 0.000 | 0.000 | |
| 0.000 | 0.007 | 0.000 | 0.060 | 0.000 | 0.000 | 0.000 | 0.007 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
4 | 0 | 0 | 1 | 0 | 10 | 0 | 0 | 0 | 0 | 2 | 13 |
| 0.000 | 0.000 | 0.077 | 0.000 | 0.769 | 0.000 | 0.000 | 0.000 | 0.000 | 0.154 | 0.087 |
| 0.000 | 0.000 | 0.125 | 0.000 | 0.667 | 0.000 | 0.000 | 0.000 | 0.000 | 0.118 | |
| 0.000 | 0.000 | 0.007 | 0.000 | 0.067 | 0.000 | 0.000 | 0.000 | 0.000 | 0.013 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
5 | 0 | 2 | 0 | 1 | 2 | 12 | 1 | 0 | 0 | 0 | 18 |
| 0.000 | 0.111 | 0.000 | 0.056 | 0.111 | 0.667 | 0.056 | 0.000 | 0.000 | 0.000 | 0.120 |
| 0.000 | 0.065 | 0.000 | 0.100 | 0.133 | 0.800 | 0.059 | 0.000 | 0.000 | 0.000 | |
| 0.000 | 0.013 | 0.000 | 0.007 | 0.013 | 0.080 | 0.007 | 0.000 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
6 | 1 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 17 |
| 0.059 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.941 | 0.000 | 0.000 | 0.000 | 0.113 |
| 0.077 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.941 | 0.000 | 0.000 | 0.000 | |
| 0.007 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.107 | 0.000 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 13 |
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.087 |
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.765 | 0.000 | 0.000 | |
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.087 | 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
8 | 0 | 2 | 0 | 0 | 1 | 3 | 0 | 0 | 6 | 1 | 13 |
| 0.000 | 0.154 | 0.000 | 0.000 | 0.077 | 0.231 | 0.000 | 0.000 | 0.462 | 0.077 | 0.087 |
| 0.000 | 0.065 | 0.000 | 0.000 | 0.067 | 0.200 | 0.000 | 0.000 | 0.857 | 0.059 | |
| 0.000 | 0.013 | 0.000 | 0.000 | 0.007 | 0.020 | 0.000 | 0.000 | 0.040 | 0.007 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
9 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 1 | 14 | 19 |
| 0.000 | 0.053 | 0.000 | 0.000 | 0.053 | 0.000 | 0.000 | 0.105 | 0.053 | 0.737 | 0.127 |
| 0.000 | 0.032 | 0.000 | 0.000 | 0.067 | 0.000 | 0.000 | 0.118 | 0.143 | 0.824 | |
| 0.000 | 0.007 | 0.000 | 0.000 | 0.007 | 0.000 | 0.000 | 0.013 | 0.007 | 0.093 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 13 | 31 | 8 | 10 | 15 | 15 | 17 | 17 | 7 | 17 | 150 |
| 0.087 | 0.207 | 0.053 | 0.067 | 0.100 | 0.100 | 0.113 | 0.113 | 0.047 | 0.113 | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
# Calculate accuracy
accuracy <- mean(predicted_labels == test_labels)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
[1] "Accuracy: 80.67 %"
The KNN model (k = 5) classifies 121 of the 150 test images correctly, an accuracy of 80.67%.
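Overall accuracy hides how performance varies by digit: in the table above, for example, only 6 of the 13 true 8s are classified correctly. A minimal sketch of computing per-class recall (the diagonal "N / Row Total" values) from true and predicted labels follows. The two short vectors are invented for illustration; in the analysis above they would correspond to `test_labels` and `predicted_labels`.

```r
# Invented labels for illustration only.
truth <- factor(c(0, 0, 1, 1, 1, 2, 2, 2, 2, 2), levels = 0:2)
pred  <- factor(c(0, 1, 1, 1, 1, 2, 2, 0, 2, 1), levels = 0:2)

# Confusion matrix: rows are true classes, columns are predictions.
cm <- table(truth = truth, pred = pred)

# Recall per class: correct predictions divided by the row total.
recall <- diag(cm) / rowSums(cm)
print(round(recall, 3))

# Overall accuracy: sum of the diagonal divided by the table total.
overall <- sum(diag(cm)) / sum(cm)
print(overall)
```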
The Titanic dataset is a classic dataset in data science and machine learning, typically used for demonstrating classification tasks. This dataset is available in the file `Titanic.csv`. Here’s a general description of the Titanic dataset and its variables:
The dataset contains data about the passengers who were onboard the ill-fated RMS Titanic. Your goal is to predict the survival of the passengers based on various features. Below are the variables you will use in this dataset:
Survived: Indicates whether the passenger survived (1) or did not survive (0).
Class: Passenger class, a proxy for socio-economic status (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
Name: Full name of the passenger.
Sex: Gender of the passenger (male or female).
Age: Age of the passenger in years. Some entries may have missing ages.
SibSp: Number of siblings and spouses aboard the Titanic.
Parch: Number of parents and children aboard the Titanic.
Fare: Passenger fare.
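Because some ages are missing, it can help to count missing values per column before modeling. The sketch below does this on a small invented data frame that mimics a few of the columns listed above; the values are not taken from `Titanic.csv`.

```r
# Invented rows mimicking the structure described above.
toy <- data.frame(
  Survived = c(1, 0, 0, 1),
  Sex      = c("female", "male", "male", "female"),
  Age      = c(29, NA, 40, NA),
  Fare     = c(211.3, 7.9, 26.0, 13.0)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of rows with a missing Age.
share_missing_age <- mean(is.na(toy$Age))
print(share_missing_age)
```

As far as I understand `rpart`'s default behavior, rows with a missing predictor such as Age are kept and routed through surrogate splits, so imputation is optional for a decision tree but would be needed for methods like KNN.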
# Load necessary libraries
library(rpart)
library(gmodels)
# Load the Titanic dataset (assumes Titanic.csv is in the working directory)
Titanic <- read.csv("Titanic.csv")
# Select relevant columns
Titanic <- Titanic[,c("Survived", "Sex", "Class", "Age", "Fare")]
# Factorize the Survived column
Titanic$Survived <- as.factor(Titanic$Survived)
# Set seed and split the data
set.seed(777)
train_indices <- sample(1:nrow(Titanic), 0.75 * nrow(Titanic))
DataTrain <- Titanic[train_indices, ]
DataTest <- Titanic[-train_indices, ]
# Train the decision tree model
ModelDesignDecTree <- rpart(Survived ~ ., data = DataTrain, method = "class", control = rpart.control(maxdepth = 3))
# Visualize the decision tree
library(rpart.plot)
rpart.plot(ModelDesignDecTree, yes.text = "YES", no.text = "NO", roundint = FALSE)
# Predict on the test set
predicted_labels <- predict(ModelDesignDecTree, DataTest, type = "class")
# Create confusion matrix using CrossTable
CrossTable(x = DataTest$Survived, y = predicted_labels, prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 222
| predicted_labels
DataTest$Survived | 0 | 1 | Row Total |
------------------|-----------|-----------|-----------|
0 | 115 | 12 | 127 |
| 0.906 | 0.094 | 0.572 |
| 0.846 | 0.140 | |
| 0.518 | 0.054 | |
------------------|-----------|-----------|-----------|
1 | 21 | 74 | 95 |
| 0.221 | 0.779 | 0.428 |
| 0.154 | 0.860 | |
| 0.095 | 0.333 | |
------------------|-----------|-----------|-----------|
Column Total | 136 | 86 | 222 |
| 0.613 | 0.387 | |
------------------|-----------|-----------|-----------|
# Calculate accuracy
accuracy <- mean(predicted_labels == DataTest$Survived)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
[1] "Accuracy: 85.14 %"
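Accuracy can be complemented with sensitivity and specificity, which follow directly from the four counts in the confusion matrix above. The sketch below re-enters those counts by hand (115, 12, 21, 74); the variable names are illustrative.

```r
# Counts copied from the CrossTable output above
# (rows: actual Survived, columns: predicted).
cm_titanic <- matrix(c(115, 12,
                        21, 74),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(actual = c("0", "1"),
                                     predicted = c("0", "1")))

acc  <- sum(diag(cm_titanic)) / sum(cm_titanic)        # (115 + 74) / 222
sens <- cm_titanic["1", "1"] / sum(cm_titanic["1", ])  # survivors found: 74 / 95
spec <- cm_titanic["0", "0"] / sum(cm_titanic["0", ])  # non-survivors found: 115 / 127

print(round(c(accuracy = acc, sensitivity = sens, specificity = spec), 3))
# accuracy 0.851, sensitivity 0.779, specificity 0.906
```

On this test set the tree identifies passengers who did not survive (about 91%) more reliably than those who did (about 78%).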
The decision tree predicts survival with 85.14% accuracy on the test set (189 of 222 passengers classified correctly).
Answer: 9% with 46% Data
Answer: 19% with 66% Data
Answer: 79%
Answer: 72% with 34% Data