Question 1

The MNIST dataset is one of the best-known datasets in machine learning and is widely used for training image processing systems. It contains images of handwritten digits, each a 28x28 pixel grayscale image. Each pixel value ranges from 0 (white) to 255 (black). The dataset is structured as follows:

  • Label: This variable denotes the class of the handwritten digit image, that is, the digit it depicts (0 through 9). In some non-standard variants of MNIST that have been extended to the alphabet, the label may represent a letter instead.

  • Pix1, Pix2, …, Pix784: These variables represent the pixel values of the 28x28 pixel images. Each image is “flattened” into a single row with 784 columns (28 multiplied by 28), where each PixN corresponds to the grayscale value of a pixel. Each pixel value ranges from 0 to 255, where 0 corresponds to a completely white pixel and 255 corresponds to a completely black pixel.

Structure of the Data:

  • Rows: Each row in the dataset corresponds to a single image (a single handwritten digit) along with its label.
  • Columns: The first column is typically the Label and the remaining 784 columns are the pixel intensity values from the top-left to the bottom-right of the image.

Goal: The goal with MNIST is to build a model that can predict the Label from the 784 pixel values, effectively allowing a computer to recognize handwritten digits.

# Load necessary libraries
library(class)
# Install gmodels only if it is missing; an explicit CRAN mirror is needed when knitting
if (!requireNamespace("gmodels", quietly = TRUE)) {
  install.packages("gmodels", repos = "https://cloud.r-project.org")
}
library(gmodels)
Warning: package 'gmodels' was built under R version 4.4.3
# Load MNIST dataset
DataMnist <- readRDS("MN500.rds")
DataMnist$Label <- factor(DataMnist$Label)

# Create Training and Testing Data
set.seed(123)
index <- sample(1:nrow(DataMnist), 0.7 * nrow(DataMnist))
DataTrain <- DataMnist[index, ]
DataTest <- DataMnist[-index, ]

# Prepare the data for KNN (excluding the label column for training/testing dataset)
train_labels <- DataTrain$Label
test_labels <- DataTest$Label

# Remove the label column from the datasets for KNN
DataTrain <- DataTrain[, -1]
DataTest <- DataTest[, -1]

# Perform KNN for k = 5
predicted_labels <- knn(train = DataTrain, test = DataTest, cl = train_labels, k = 5)

# Create Confusion Matrix using CrossTable
CrossTable(x = test_labels, y = predicted_labels, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  150 

 
             | predicted_labels 
 test_labels |         0 |         1 |         2 |         3 |         4 |         5 |         6 |         7 |         8 |         9 | Row Total | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
           0 |        12 |         0 |         1 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |        13 | 
             |     0.923 |     0.000 |     0.077 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.087 | 
             |     0.923 |     0.000 |     0.125 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
             |     0.080 |     0.000 |     0.007 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
           1 |         0 |        23 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |        23 | 
             |     0.000 |     1.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.153 | 
             |     0.000 |     0.742 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
             |     0.000 |     0.153 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
           2 |         0 |         2 |         6 |         0 |         1 |         0 |         0 |         1 |         0 |         0 |        10 | 
             |     0.000 |     0.200 |     0.600 |     0.000 |     0.100 |     0.000 |     0.000 |     0.100 |     0.000 |     0.000 |     0.067 | 
             |     0.000 |     0.065 |     0.750 |     0.000 |     0.067 |     0.000 |     0.000 |     0.059 |     0.000 |     0.000 |           | 
             |     0.000 |     0.013 |     0.040 |     0.000 |     0.007 |     0.000 |     0.000 |     0.007 |     0.000 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
           3 |         0 |         1 |         0 |         9 |         0 |         0 |         0 |         1 |         0 |         0 |        11 | 
             |     0.000 |     0.091 |     0.000 |     0.818 |     0.000 |     0.000 |     0.000 |     0.091 |     0.000 |     0.000 |     0.073 | 
             |     0.000 |     0.032 |     0.000 |     0.900 |     0.000 |     0.000 |     0.000 |     0.059 |     0.000 |     0.000 |           | 
             |     0.000 |     0.007 |     0.000 |     0.060 |     0.000 |     0.000 |     0.000 |     0.007 |     0.000 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
           4 |         0 |         0 |         1 |         0 |        10 |         0 |         0 |         0 |         0 |         2 |        13 | 
             |     0.000 |     0.000 |     0.077 |     0.000 |     0.769 |     0.000 |     0.000 |     0.000 |     0.000 |     0.154 |     0.087 | 
             |     0.000 |     0.000 |     0.125 |     0.000 |     0.667 |     0.000 |     0.000 |     0.000 |     0.000 |     0.118 |           | 
             |     0.000 |     0.000 |     0.007 |     0.000 |     0.067 |     0.000 |     0.000 |     0.000 |     0.000 |     0.013 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
           5 |         0 |         2 |         0 |         1 |         2 |        12 |         1 |         0 |         0 |         0 |        18 | 
             |     0.000 |     0.111 |     0.000 |     0.056 |     0.111 |     0.667 |     0.056 |     0.000 |     0.000 |     0.000 |     0.120 | 
             |     0.000 |     0.065 |     0.000 |     0.100 |     0.133 |     0.800 |     0.059 |     0.000 |     0.000 |     0.000 |           | 
             |     0.000 |     0.013 |     0.000 |     0.007 |     0.013 |     0.080 |     0.007 |     0.000 |     0.000 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
           6 |         1 |         0 |         0 |         0 |         0 |         0 |        16 |         0 |         0 |         0 |        17 | 
             |     0.059 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.941 |     0.000 |     0.000 |     0.000 |     0.113 | 
             |     0.077 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.941 |     0.000 |     0.000 |     0.000 |           | 
             |     0.007 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.107 |     0.000 |     0.000 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
           7 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |        13 |         0 |         0 |        13 | 
             |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     1.000 |     0.000 |     0.000 |     0.087 | 
             |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.765 |     0.000 |     0.000 |           | 
             |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.000 |     0.087 |     0.000 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
           8 |         0 |         2 |         0 |         0 |         1 |         3 |         0 |         0 |         6 |         1 |        13 | 
             |     0.000 |     0.154 |     0.000 |     0.000 |     0.077 |     0.231 |     0.000 |     0.000 |     0.462 |     0.077 |     0.087 | 
             |     0.000 |     0.065 |     0.000 |     0.000 |     0.067 |     0.200 |     0.000 |     0.000 |     0.857 |     0.059 |           | 
             |     0.000 |     0.013 |     0.000 |     0.000 |     0.007 |     0.020 |     0.000 |     0.000 |     0.040 |     0.007 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
           9 |         0 |         1 |         0 |         0 |         1 |         0 |         0 |         2 |         1 |        14 |        19 | 
             |     0.000 |     0.053 |     0.000 |     0.000 |     0.053 |     0.000 |     0.000 |     0.105 |     0.053 |     0.737 |     0.127 | 
             |     0.000 |     0.032 |     0.000 |     0.000 |     0.067 |     0.000 |     0.000 |     0.118 |     0.143 |     0.824 |           | 
             |     0.000 |     0.007 |     0.000 |     0.000 |     0.007 |     0.000 |     0.000 |     0.013 |     0.007 |     0.093 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total |        13 |        31 |         8 |        10 |        15 |        15 |        17 |        17 |         7 |        17 |       150 | 
             |     0.087 |     0.207 |     0.053 |     0.067 |     0.100 |     0.100 |     0.113 |     0.113 |     0.047 |     0.113 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|

 
# Calculate accuracy
accuracy <- mean(predicted_labels == test_labels)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
[1] "Accuracy: 80.67 %"

The KNN model (k = 5) classifies 121 of the 150 test images correctly, an accuracy of 80.67%.

Question 2

The Titanic dataset is a classic dataset in data science and machine learning, typically used to demonstrate classification tasks. It is available in the file Titanic.csv. Here’s a general description of the Titanic dataset and its variables:

Description of the Titanic Dataset

The dataset contains data about the passengers who were onboard the ill-fated RMS Titanic. Your goal is to predict the survival of the passengers based on various features. Below are the variables you will use in this dataset:

Survived: Indicates whether the passenger survived (1) or did not survive (0).

Class: Passenger class, a proxy for socio-economic status (1 = 1st class, 2 = 2nd class, 3 = 3rd class).

Name: Full name of the passenger.

Sex: Gender of the passenger (male or female).

Age: Age of the passenger in years. Some entries may have missing ages.

SibSp: Number of siblings and spouses aboard the Titanic.

Parch: Number of parents and children aboard the Titanic.

Fare: Passenger fare.

# Load necessary libraries
library(rpart)
library(gmodels)

# Load the Titanic dataset (assumes Titanic.csv is in the working directory)
Titanic <- read.csv("Titanic.csv")

# Select relevant columns
Titanic <- Titanic[,c("Survived", "Sex", "Class", "Age", "Fare")]

# Factorize the Survived column
Titanic$Survived <- as.factor(Titanic$Survived)

# Set seed and split the data
set.seed(777)
train_indices <- sample(1:nrow(Titanic), 0.75 * nrow(Titanic))
DataTrain <- Titanic[train_indices, ]
DataTest <- Titanic[-train_indices, ]

# Train the decision tree model
ModelDesignDecTree <- rpart(Survived ~ ., data = DataTrain, method = "class", control = rpart.control(maxdepth = 3))

# Visualize the decision tree
library(rpart.plot)
Warning: package 'rpart.plot' was built under R version 4.4.3
rpart.plot(ModelDesignDecTree, yes.text = "YES", no.text = "NO", roundint = FALSE)

# Predict on the test set
predicted_labels <- predict(ModelDesignDecTree, DataTest, type = "class")

# Create confusion matrix using CrossTable
CrossTable(x = DataTest$Survived, y = predicted_labels, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  222 

 
                  | predicted_labels 
DataTest$Survived |         0 |         1 | Row Total | 
------------------|-----------|-----------|-----------|
                0 |       115 |        12 |       127 | 
                  |     0.906 |     0.094 |     0.572 | 
                  |     0.846 |     0.140 |           | 
                  |     0.518 |     0.054 |           | 
------------------|-----------|-----------|-----------|
                1 |        21 |        74 |        95 | 
                  |     0.221 |     0.779 |     0.428 | 
                  |     0.154 |     0.860 |           | 
                  |     0.095 |     0.333 |           | 
------------------|-----------|-----------|-----------|
     Column Total |       136 |        86 |       222 | 
                  |     0.613 |     0.387 |           | 
------------------|-----------|-----------|-----------|

 
# Calculate accuracy
accuracy <- mean(predicted_labels == DataTest$Survived)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
[1] "Accuracy: 85.14 %"

The decision tree classifies 189 of the 222 test passengers correctly, an accuracy of 85.14%. Reading the fitted tree gives the following answers:

  1. What is the survival rate for adult male passengers 13 years or older, regardless of the class they traveled in and the fare they paid?

Answer: 9% (this node holds 46% of the training data)

  2. What is the survival rate for young male passengers (younger than 13 years), regardless of the class they traveled in and the fare they paid?

Answer: 19% (this node holds 66% of the training data)

  3. What is the survival rate for young male passengers (younger than 13 years) traveling in Third Class, regardless of the fare they paid?

Answer: 79%

  4. What is the survival rate for female passengers, regardless of age and not considering the class they traveled in or the fare they paid?

Answer: 72% (this node holds 34% of the training data)

  5. Once travel class is considered, female passengers in First or Second Class had a survival rate of 93%, regardless of age and the fare they paid.
---
title: "BANL 3200: Machine Learning -- Supervised"
subtitle: "Midterm"
author: "Joyston Fernandes"
date: "`r format(Sys.Date(), '%B %e, %Y')`"
output:
  html_document:
    theme: flatly
    toc: TRUE
    toc_float: TRUE
    toc_depth: 3
    number_sections: TRUE
    code_folding: show
    code_download: true
editor_options: 
  chunk_output_type: console
---

```{r setup, include = FALSE, cache = FALSE}
knitr::opts_chunk$set(eval = TRUE, error = TRUE, comment = NA, 
                      warning = FALSE, message = FALSE, tidy = FALSE, 
                      cache = FALSE)
# load libraries
library(tidyverse)
```


### Question 1 {-}

The MNIST dataset is one of the best-known datasets in machine learning and is widely used for training image processing systems. It contains images of handwritten digits, each a 28x28 pixel grayscale image. Each pixel value ranges from 0 (white) to 255 (black). The dataset is structured as follows:

- **Label**: This variable denotes the class of the handwritten digit image, that is, the digit it depicts (0 through 9). In some non-standard variants of MNIST that have been extended to the alphabet, the label may represent a letter instead.  

- **Pix1, Pix2, ..., Pix784**: These variables represent the pixel values of the 28x28 pixel images. Each image is "flattened" into a single row with 784 columns (28 multiplied by 28), where each `PixN` corresponds to the grayscale value of a pixel. Each pixel value ranges from 0 to 255, where 0 corresponds to a completely white pixel and 255 corresponds to a completely black pixel.

**Structure of the Data:**

- **Rows**: Each row in the dataset corresponds to a single image (a single handwritten digit) along with its label.  
- **Columns**: The first column is typically the `Label` and the remaining 784 columns are the pixel intensity values from the top-left to the bottom-right of the image.
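
The flattening described above can be illustrated with a small sketch. It uses a made-up 3x3 "image" in place of a real 28x28 digit, and it assumes the pixels are stored row by row (which matches the top-left to bottom-right ordering, but is an assumption about how `MN500.rds` was built).

```r
# Hypothetical 3x3 image standing in for a 28x28 MNIST digit (values are made up)
img <- matrix(c(0, 255, 0,
                0, 255, 0,
                0, 255, 0), nrow = 3, byrow = TRUE)

# Flatten row by row: top-left pixel first, bottom-right pixel last
flat <- as.vector(t(img))
length(flat)  # 9 values here; a real digit gives 28 * 28 = 784

# Rebuild the image from the flattened row (assumes row-wise storage)
restored <- matrix(flat, nrow = 3, byrow = TRUE)
identical(restored, img)  # TRUE
```

Under the same assumption, one row of the real data could be reshaped with `matrix(unlist(DataMnist[1, -1]), nrow = 28, byrow = TRUE)`.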

***Goal:*** 
The goal with MNIST is to build a model that can predict the `Label` from the 784 pixel values, effectively allowing a computer to recognize handwritten digits.
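
The model used for this question is k-nearest neighbours. As a rough sketch of the idea, not of how `class::knn` is implemented, the hypothetical helper `knn_one` below classifies a single point by majority vote among its `k` closest training points, using Euclidean distance. The 2-D points and labels are made up; with MNIST each point would have 784 coordinates.

```r
# Toy k-NN: majority vote among the k nearest training rows.
# knn_one is a hypothetical helper written for this illustration.
knn_one <- function(train, labels, query, k) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train, 2, query)^2))
  # Indices of the k closest rows
  nearest <- order(d)[seq_len(k)]
  # Count the labels among those neighbours and return the most frequent one
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Made-up training data: three points near (0, 0) and three near (5, 5)
train  <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
labels <- factor(c("a", "a", "a", "b", "b", "b"))

knn_one(train, labels, query = c(0.5, 0.5), k = 3)  # "a"
knn_one(train, labels, query = c(5.5, 5.5), k = 3)  # "b"
```

One difference worth noting: this sketch resolves tied votes by taking the first label, whereas `class::knn` is documented to break ties at random.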


```{r Q1}
# Load necessary libraries
library(class)
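# Install gmodels only if it is missing; an explicit CRAN mirror is needed when knitting
if (!requireNamespace("gmodels", quietly = TRUE)) {
  install.packages("gmodels", repos = "https://cloud.r-project.org")
}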
library(gmodels)

# Load MNIST dataset
DataMnist <- readRDS("MN500.rds")
DataMnist$Label <- factor(DataMnist$Label)

# Create Training and Testing Data
set.seed(123)
index <- sample(1:nrow(DataMnist), 0.7 * nrow(DataMnist))
DataTrain <- DataMnist[index, ]
DataTest <- DataMnist[-index, ]

# Prepare the data for KNN (excluding the label column for training/testing dataset)
train_labels <- DataTrain$Label
test_labels <- DataTest$Label

# Remove the label column from the datasets for KNN
DataTrain <- DataTrain[, -1]
DataTest <- DataTest[, -1]

# Perform KNN for k = 5
predicted_labels <- knn(train = DataTrain, test = DataTest, cl = train_labels, k = 5)

# Create Confusion Matrix using CrossTable
CrossTable(x = test_labels, y = predicted_labels, prop.chisq = FALSE)

# Calculate accuracy
accuracy <- mean(predicted_labels == test_labels)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
```

**The KNN model (k = 5) classifies 121 of the 150 test images correctly, an accuracy of 80.67%.**
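
The overall figure hides differences between digits. The sketch below recomputes the accuracy and a per-digit hit rate from counts copied by hand from the `CrossTable` output of this run (diagonal cells and row totals), so it does not need the fitted objects. The variable names are illustrative.

```r
# Counts copied from the CrossTable output for the k = 5 run
digits    <- as.character(0:9)
correct   <- c(12, 23, 6, 9, 10, 12, 16, 13, 6, 14)     # diagonal cells
row_total <- c(13, 23, 10, 11, 13, 18, 17, 13, 13, 19)  # actual images per digit
names(correct) <- names(row_total) <- digits

# Overall accuracy: correctly classified images over all test images
overall_accuracy <- sum(correct) / sum(row_total)
round(100 * overall_accuracy, 2)  # 80.67

# Share of each digit's test images that were classified correctly
recall <- correct / row_total
round(recall, 3)

# Digit with the lowest hit rate
names(which.min(recall))  # "8"
```

In this run the digits 1 and 7 are always recognised, while 8 is the weakest class with 6 of 13 images correct.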

### Question 2 {-}

The Titanic dataset is a classic dataset in data science and machine learning, typically used to demonstrate classification tasks. It is available in the file `Titanic.csv`. Here's a general description of the Titanic dataset and its variables:

### Description of the Titanic Dataset {-}

The dataset contains data about the passengers who were onboard the ill-fated RMS Titanic. Your goal is to predict the survival of the passengers based on various features. Below are the variables you will use in this dataset:

**Survived**: Indicates whether the passenger survived (1) or did not survive (0).

**Class**: Passenger class, a proxy for socio-economic status (1 = 1st class, 2 = 2nd class, 3 = 3rd class).

**Name**: Full name of the passenger.

**Sex**: Gender of the passenger (male or female).

**Age**: Age of the passenger in years. Some entries may have missing ages.

**SibSp**: Number of siblings and spouses aboard the Titanic.

**Parch**: Number of parents and children aboard the Titanic.

**Fare**: Passenger fare.
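
Because some ages may be missing, it can help to count the gaps before modelling. The sketch below does this on a made-up data frame, not on rows from `Titanic.csv`. As far as its documentation describes, `rpart` by default keeps rows with a missing predictor and falls back on surrogate splits, so the check is informative rather than required.

```r
# Made-up passengers for illustration; not rows from Titanic.csv
toy <- data.frame(
  Survived = c(1, 0, 1, 0),
  Age      = c(29, NA, 2, NA),
  Fare     = c(211.34, 7.25, 151.55, 8.05)
)

# Number of missing values in each column
colSums(is.na(toy))

# Share of rows with a missing Age
mean(is.na(toy$Age))  # 0.5
```

The same two calls can be applied to the `Titanic` data frame once it is loaded.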


```{r Q2}
# Load necessary libraries
library(rpart)
library(gmodels)

# Load the Titanic dataset (assumes Titanic.csv is in the working directory)
Titanic <- read.csv("Titanic.csv")

# Select relevant columns
Titanic <- Titanic[,c("Survived", "Sex", "Class", "Age", "Fare")]

# Factorize the Survived column
Titanic$Survived <- as.factor(Titanic$Survived)

# Set seed and split the data
set.seed(777)
train_indices <- sample(1:nrow(Titanic), 0.75 * nrow(Titanic))
DataTrain <- Titanic[train_indices, ]
DataTest <- Titanic[-train_indices, ]

# Train the decision tree model
ModelDesignDecTree <- rpart(Survived ~ ., data = DataTrain, method = "class", control = rpart.control(maxdepth = 3))

# Visualize the decision tree
library(rpart.plot)
rpart.plot(ModelDesignDecTree, yes.text = "YES", no.text = "NO", roundint = FALSE)

# Predict on the test set
predicted_labels <- predict(ModelDesignDecTree, DataTest, type = "class")

# Create confusion matrix using CrossTable
CrossTable(x = DataTest$Survived, y = predicted_labels, prop.chisq = FALSE)

# Calculate accuracy
accuracy <- mean(predicted_labels == DataTest$Survived)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
```
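
Accuracy alone does not show which kind of error the tree makes. As a worked example, the sketch below rebuilds the 2x2 confusion matrix from the counts in the `CrossTable` output of this run and derives accuracy, sensitivity (survivors correctly predicted), and specificity (non-survivors correctly predicted). The object names are illustrative.

```r
# Counts copied from the CrossTable output (rows = actual, columns = predicted)
cm <- matrix(c(115, 12,
                21, 74), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)       # (115 + 74) / 222
sensitivity <- cm["1", "1"] / sum(cm["1", ]) # 74 of 95 survivors
specificity <- cm["0", "0"] / sum(cm["0", ]) # 115 of 127 non-survivors

round(100 * c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 2)
# accuracy 85.14, sensitivity 77.89, specificity 90.55
```

The tree is therefore better at identifying passengers who did not survive than those who did: 21 of the 33 errors are survivors predicted as non-survivors.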

**The decision tree reaches 85.14% accuracy on the test set. Reading the fitted tree gives the following answers:**

1. What is the survival rate for adult male passengers 13 years or older, regardless of the class they traveled in and the fare they paid?

Answer: **9%** (this node holds 46% of the training data)

2. What is the survival rate for young male passengers (younger than 13 years), regardless of which class they traveled and the fare they paid?

Answer: **19%** (this node holds 66% of the training data)

3. What is the survival rate for young male passengers (younger than 13 years) traveling in Third Class regardless of the fare they paid?

Answer: **79%**

4. What is the survival rate for female passengers, regardless of age and not considering the class they traveled in or the fare they paid?

Answer: **72%** (this node holds 34% of the training data)

5. Once travel class is considered, female passengers in First or Second Class had a survival rate of **93%**, regardless of age and the fare they paid.
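
Each answer above pairs two numbers from a tree node: the survival rate among the passengers who reach that node, and the share of the data those passengers represent. The sketch below shows how both are computed for a given set of conditions. The data frame and the helper `node_summary` are made up for illustration; the resulting values are not the Titanic figures.

```r
# Made-up passengers for illustration; not rows from Titanic.csv
toy <- data.frame(
  Survived = c(0, 0, 1, 1, 1, 1, 0, 0),
  Sex      = c("male", "male", "male", "male",
               "female", "female", "female", "male"),
  Age      = c(30, 45, 8, 22, 28, 5, 60, 10)
)

# Hypothetical helper: summarise the rows selected by a logical condition
node_summary <- function(rows) {
  c(survival_rate = mean(toy$Survived[rows]),  # survivors among selected rows
    share_of_data = mean(rows))                # selected rows over all rows
}

# Adult males (13 or older), ignoring every other variable
node_summary(toy$Sex == "male" & toy$Age >= 13)

# All females, ignoring every other variable
node_summary(toy$Sex == "female")
```

Applying the same conditions to `DataTrain` would be one way to cross-check the percentages read off the plotted tree.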




