The MNIST dataset is one of the most well-known datasets in machine learning and is widely used for training image processing systems. It contains images of handwritten digits, each a 28x28-pixel grayscale image whose pixel values range from 0 (white) to 255 (black). The dataset is structured as follows:
Label: This variable denotes the class of the handwritten digit image, i.e., the digit depicted in the image (0 through 9). In some non-standard variations of MNIST, the label may instead represent letters when the dataset is extended to alphabets.
Pix1, Pix2, …, Pix784: These variables represent the pixel values of the 28x28-pixel images. Each image is “flattened” into a single row with 784 columns (28 multiplied by 28), where each PixN corresponds to the grayscale value of one pixel. Each pixel value ranges from 0 to 255, where 0 corresponds to a completely white pixel and 255 corresponds to a completely black pixel.
Structure of the Data: The first column is Label and the remaining 784 columns are the pixel intensity values from the top-left to the bottom-right of the image.

Goal: The goal with MNIST is to build a model that can predict the Label from the 784 pixel values, effectively allowing a computer to recognize handwritten digits.

Tasks:

- Make sure the Label variable is set as a factor, as it represents categorical data.
- Use the knn function to classify the test data based on the training data, using k = 5 as the number of nearest neighbors.
- Use CrossTable from the gmodels library to construct a confusion matrix comparing the actual and predicted labels.

**Complete the code below by filling spaces ("___").**
# Load required libraries
library(class)    # provides knn()
## Warning: package 'class' was built under R version 4.3.3
library(gmodels)  # provides CrossTable()
## Warning: package 'gmodels' was built under R version 4.3.3
# Load MNIST dataset
DataMnist <- readRDS("MN500.rds")
DataMnist$Label <- factor(DataMnist$Label)
# Create Training and Testing Data
set.seed(123)
index <- sample(1:nrow(DataMnist), 0.7 * nrow(DataMnist))
DataTrain <- DataMnist[index, ]
DataTest <- DataMnist[-index, ]
# Store the class labels of the training and test sets separately for KNN
train_labels <- DataTrain$Label
test_labels <- DataTest$Label
# Remove the label column from the datasets for KNN
DataTrain <- DataTrain[, -1]
DataTest <- DataTest[, -1]
# Perform KNN for k = 5
predicted_labels <- knn(train = DataTrain, test = DataTest, cl = train_labels, k = 5)
print(predicted_labels)
## [1] 7 1 1 9 4 6 1 9 7 3 4 0 7 1 1 3 1 4 4 0 1 9 1 4 7 4 0 7 1 1 3 1 1 6 9 6 5
## [38] 1 9 4 7 6 9 5 6 6 1 6 7 1 8 0 1 5 6 6 4 1 2 8 1 8 5 2 5 1 2 1 1 3 5 0 5 3
## [75] 5 2 4 1 0 2 1 2 5 3 9 9 7 9 0 8 1 0 6 9 3 0 1 5 3 9 7 6 5 1 0 1 9 5 6 7 4
## [112] 1 4 6 6 5 9 4 0 7 7 9 2 4 7 1 9 8 3 2 9 8 3 0 4 6 6 7 7 5 8 9 7 1 7 1 5 4
## [149] 0 6
## Levels: 0 1 2 3 4 5 6 7 8 9
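As an optional sanity check (not part of the fill-in exercise), one flattened test row can be reshaped back into a 28x28 matrix and displayed next to its predicted label. This sketch assumes the pixel columns are ordered row-wise from the top-left to the bottom-right of the image, as described above.

# Optional: display the first test image with its predicted label
pixels <- as.numeric(unlist(DataTest[1, ]))                # 784 grayscale values
img <- matrix(pixels, nrow = 28, ncol = 28, byrow = TRUE)  # back to a 28x28 grid
image(t(apply(img, 2, rev)),                               # orient row 1 at the top
      col = gray(seq(1, 0, length.out = 256)),             # 0 = white, 255 = black
      axes = FALSE, main = paste("Predicted:", predicted_labels[1]))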
# Create Confusion Matrix using CrossTable
CrossTable(x = test_labels, y = predicted_labels, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 150
##
##
## | predicted_labels
## test_labels | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 0 | 12 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 |
## | 0.923 | 0.000 | 0.077 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.087 |
## | 0.923 | 0.000 | 0.125 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.080 | 0.000 | 0.007 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 0 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23 |
## | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.153 |
## | 0.000 | 0.742 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.153 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 0 | 2 | 6 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 10 |
## | 0.000 | 0.200 | 0.600 | 0.000 | 0.100 | 0.000 | 0.000 | 0.100 | 0.000 | 0.000 | 0.067 |
## | 0.000 | 0.065 | 0.750 | 0.000 | 0.067 | 0.000 | 0.000 | 0.059 | 0.000 | 0.000 | |
## | 0.000 | 0.013 | 0.040 | 0.000 | 0.007 | 0.000 | 0.000 | 0.007 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 0 | 1 | 0 | 9 | 0 | 0 | 0 | 1 | 0 | 0 | 11 |
## | 0.000 | 0.091 | 0.000 | 0.818 | 0.000 | 0.000 | 0.000 | 0.091 | 0.000 | 0.000 | 0.073 |
## | 0.000 | 0.032 | 0.000 | 0.900 | 0.000 | 0.000 | 0.000 | 0.059 | 0.000 | 0.000 | |
## | 0.000 | 0.007 | 0.000 | 0.060 | 0.000 | 0.000 | 0.000 | 0.007 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 0 | 0 | 1 | 0 | 10 | 0 | 0 | 0 | 0 | 2 | 13 |
## | 0.000 | 0.000 | 0.077 | 0.000 | 0.769 | 0.000 | 0.000 | 0.000 | 0.000 | 0.154 | 0.087 |
## | 0.000 | 0.000 | 0.125 | 0.000 | 0.667 | 0.000 | 0.000 | 0.000 | 0.000 | 0.118 | |
## | 0.000 | 0.000 | 0.007 | 0.000 | 0.067 | 0.000 | 0.000 | 0.000 | 0.000 | 0.013 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 0 | 2 | 0 | 1 | 2 | 12 | 1 | 0 | 0 | 0 | 18 |
## | 0.000 | 0.111 | 0.000 | 0.056 | 0.111 | 0.667 | 0.056 | 0.000 | 0.000 | 0.000 | 0.120 |
## | 0.000 | 0.065 | 0.000 | 0.100 | 0.133 | 0.800 | 0.059 | 0.000 | 0.000 | 0.000 | |
## | 0.000 | 0.013 | 0.000 | 0.007 | 0.013 | 0.080 | 0.007 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 1 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 17 |
## | 0.059 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.941 | 0.000 | 0.000 | 0.000 | 0.113 |
## | 0.077 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.941 | 0.000 | 0.000 | 0.000 | |
## | 0.007 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.107 | 0.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 13 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.087 |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.765 | 0.000 | 0.000 | |
## | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.087 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 8 | 0 | 2 | 0 | 0 | 1 | 3 | 0 | 0 | 6 | 1 | 13 |
## | 0.000 | 0.154 | 0.000 | 0.000 | 0.077 | 0.231 | 0.000 | 0.000 | 0.462 | 0.077 | 0.087 |
## | 0.000 | 0.065 | 0.000 | 0.000 | 0.067 | 0.200 | 0.000 | 0.000 | 0.857 | 0.059 | |
## | 0.000 | 0.013 | 0.000 | 0.000 | 0.007 | 0.020 | 0.000 | 0.000 | 0.040 | 0.007 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 9 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 1 | 14 | 19 |
## | 0.000 | 0.053 | 0.000 | 0.000 | 0.053 | 0.000 | 0.000 | 0.105 | 0.053 | 0.737 | 0.127 |
## | 0.000 | 0.032 | 0.000 | 0.000 | 0.067 | 0.000 | 0.000 | 0.118 | 0.143 | 0.824 | |
## | 0.000 | 0.007 | 0.000 | 0.000 | 0.007 | 0.000 | 0.000 | 0.013 | 0.007 | 0.093 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 13 | 31 | 8 | 10 | 15 | 15 | 17 | 17 | 7 | 17 | 150 |
## | 0.087 | 0.207 | 0.053 | 0.067 | 0.100 | 0.100 | 0.113 | 0.113 | 0.047 | 0.113 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
# Calculate accuracy
accuracy <- mean(predicted_labels == test_labels)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 80.67 %"
The Titanic dataset is a classic dataset in data science and machine learning, typically used for demonstrating classification tasks. This dataset is available in the file `Titanic.csv`. Here’s a general description of the Titanic dataset and its variables:
The dataset contains data about the passengers who were onboard the ill-fated RMS Titanic. Your goal is to predict the survival of the passengers based on various features. Below are the variables you will use in this dataset:
Survived: Indicates whether the passenger survived (1) or did not survive (0).
Class: Passenger class, a proxy for socio-economic status (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
Name: Full name of the passenger.
Sex: Gender of the passenger (male or female).
Age: Age of the passenger in years. Some entries may have missing ages.
SibSp: Number of siblings and spouses aboard the Titanic.
Parch: Number of parents and children aboard the Titanic.
Fare: Passenger fare.
Tasks:

- Load the rpart library for decision tree modeling and the gmodels library for creating a confusion matrix.
- Train a decision tree with the rpart function. Predict “Survived” based on all other features, with the maximum tree depth set to 3 for simplicity.
- Visualize the tree with the rpart.plot function, customizing the yes/no labels on each node.
- Use the CrossTable function from the gmodels package to generate a confusion matrix comparing actual and predicted outcomes.

**Complete the code below by filling spaces ("___").**
# Load necessary libraries
library(rpart)
#install.packages('gmodels')
library(gmodels)
# Load the Titanic dataset
Titanic <- read.csv("Titanic.csv")
# Select relevant columns
Titanic <- Titanic[,c("Survived", "Sex", "Class", "Age", "Fare")]
# Factorize the Survived column
Titanic$Survived <- as.factor(Titanic$Survived)
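The Age column may contain missing values, as noted in the variable description. rpart can handle missing predictor values via surrogate splits, but it is still useful to know how many there are; a minimal optional check (not part of the fill-in exercise):

# Optional: count passengers with a missing Age
sum(is.na(Titanic$Age))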
# Set seed and split the data
set.seed(777)
train_indices <- sample(1:nrow(Titanic), 0.75 * nrow(Titanic))
DataTrain <- Titanic[train_indices, ]
DataTest <- Titanic[-train_indices, ]
# Train the decision tree model
ModelDesignDecTree <- rpart(Survived ~ ., data = DataTrain, method = "class", control = rpart.control(cp=0, maxdepth = 3))
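Optionally, before plotting, the fitted tree can be inspected directly: print() lists the splits and printcp() shows the complexity-parameter table of the rpart object. This is not part of the fill-in exercise.

# Optional: inspect the fitted tree's splits and complexity table
print(ModelDesignDecTree)
printcp(ModelDesignDecTree)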
# Visualize the decision tree
#install.packages('rpart.plot')
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
rpart.plot(ModelDesignDecTree, yes.text = "YES", no.text = "NO", roundint = FALSE)
# Predict on the test set
predicted_labels <- predict(ModelDesignDecTree, DataTest, type = "class")
# Create confusion matrix using CrossTable
CrossTable(x = DataTest$Survived, y = predicted_labels, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 222
##
##
## | predicted_labels
## DataTest$Survived | 0 | 1 | Row Total |
## ------------------|-----------|-----------|-----------|
## 0 | 115 | 12 | 127 |
## | 0.906 | 0.094 | 0.572 |
## | 0.846 | 0.140 | |
## | 0.518 | 0.054 | |
## ------------------|-----------|-----------|-----------|
## 1 | 21 | 74 | 95 |
## | 0.221 | 0.779 | 0.428 |
## | 0.154 | 0.860 | |
## | 0.095 | 0.333 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 136 | 86 | 222 |
## | 0.613 | 0.387 | |
## ------------------|-----------|-----------|-----------|
##
##
# Calculate accuracy
accuracy <- mean(predicted_labels == DataTest$Survived)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy: 85.14 %"
Based on the plot of the decision tree, answer the following questions by typing your answer below each question.
Answer: 46%
Answer: 2%
Answer: 2%
Answer: 34%
Answer: 18%