Question 1 - Spaceship Titanic

Go to the Spaceship Titanic competition on Kaggle, join the competition, and download the training set. Complete the classification problem at hand using logistic regression. In order to complete this question, you need to show your training classification error rate only. You don’t need to submit via Kaggle.

#Install Packages we may need
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.2
## Warning: package 'forcats' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mice)
## Warning: package 'mice' was built under R version 4.4.3
## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
# Load the training set
train <- read.csv("C:/Users/gianc/OneDrive/Desktop/BUA 302 HW-CODE/SPACE - Group Assignment/train.csv", stringsAsFactors = TRUE, na.strings = c("", "NA"), header = TRUE)

#Split 80/20
set.seed(123)

split_calc <- sample(1:nrow(train), size = 0.8 * nrow(train))

# Create the test set before overwriting train
test <- train[-split_calc, ]
train <- train[split_calc, ]

Exploratory Analysis

When bringing in the data, any blank values were converted to NA.

Cleaning Data:

  • Separated Cabin into 3 segments (deck, num, side)

  • Converted FALSE/TRUE values to 0/1 for CryoSleep, Transported, and VIP

  • Removed unnecessary variables (PassengerId, Name)

Missing Values:

  • Used the ‘mice’ package with random forest imputation to fill in missing values
glimpse(train)
## Rows: 6,954
## Columns: 14
## $ PassengerId  <fct> 2643_01, 2695_01, 3229_01, 1968_01, 3624_01, 5083_01, 711…
## $ HomePlanet   <fct> Earth, Mars, Europa, Earth, NA, Earth, Earth, Mars, Earth…
## $ CryoSleep    <fct> False, False, False, True, False, True, True, False, Fals…
## $ Cabin        <fct> F/511/S, E/186/S, B/118/S, G/319/P, G/594/P, G/820/P, G/1…
## $ Destination  <fct> PSO J318.5-22, TRAPPIST-1e, TRAPPIST-1e, PSO J318.5-22, T…
## $ Age          <dbl> 34, 18, 45, 38, 29, 22, 26, 25, 39, 6, NA, 58, 19, 15, 3,…
## $ VIP          <fct> False, False, False, False, False, False, False, False, F…
## $ RoomService  <dbl> 122, 947, 3, 0, 0, 0, 0, 1228, 0, 0, 0, 0, 0, 1554, 0, 0,…
## $ FoodCourt    <dbl> 0, 0, 2344, 0, 0, 0, 0, 0, 1, 0, 0, 82, 0, 0, 0, 1382, 4,…
## $ ShoppingMall <dbl> 6, 4, 1100, 0, 0, 0, 0, 132, 0, 0, 0, 0, 616, 646, 0, 0, …
## $ Spa          <dbl> 739, 4, 498, 0, 0, 0, 0, 0, 739, 0, 0, 758, 0, 20, 0, 179…
## $ VRDeck       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1241, 0, 588, 0, 0…
## $ Name         <fct> Terta Chasey, Honey Stike, Aino Chpeng, Delody Traverdy, …
## $ Transported  <fct> False, False, True, True, False, False, False, False, Fal…
#Number of NAs throughout train data
colSums(is.na(train))
##  PassengerId   HomePlanet    CryoSleep        Cabin  Destination          Age 
##            0          146          173          165          140          148 
##          VIP  RoomService    FoodCourt ShoppingMall          Spa       VRDeck 
##          156          152          144          163          146          155 
##         Name  Transported 
##          169            0
#Separate Cabin into 3 segments 
train <- separate(train, Cabin, c('deck','num','side'))
train$num <- as.numeric(train$num)
train <- as.data.frame(unclass(train), stringsAsFactors = TRUE)



#Change FALSE/TRUE to 0/1
train <- train %>% 
    mutate(CryoSleep = factor(case_when(CryoSleep == 'False' ~ 0, CryoSleep == 'True' ~ 1)),
           Transported = factor(case_when(Transported == 'False' ~ 0,Transported == 'True' ~ 1)),
           VIP = factor(case_when(VIP == 'False' ~ 0, VIP == 'True' ~ 1)))%>%
  select(-c(PassengerId,Name)) %>% #Taking out certain columns
  mice(method = "rf") %>%
  complete()
## 
##  iter imp variable
##   1   1  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   1   2  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   1   3  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   1   4  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   1   5  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   2   1  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   2   2  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   2   3  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   2   4  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   2   5  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   3   1  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   3   2  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   3   3  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   3   4  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   3   5  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   4   1  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   4   2  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   4   3  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   4   4  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   4   5  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   5   1  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   5   2  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   5   3  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   5   4  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
##   5   5  HomePlanet  CryoSleep  deck  num  side  Destination  Age  VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
#No more missing values
colSums(is.na(train))
##   HomePlanet    CryoSleep         deck          num         side  Destination 
##            0            0            0            0            0            0 
##          Age          VIP  RoomService    FoodCourt ShoppingMall          Spa 
##            0            0            0            0            0            0 
##       VRDeck  Transported 
##            0            0

CryoSleep

Takeaways:

  • About 35% of passengers elected to cryo-freeze

  • Proportionally, passengers in cryo-freeze were transported at a much higher rate

#CryoSleep 
table(train$CryoSleep)
## 
##    0    1 
## 4489 2465
ggplot(train, aes(x = Transported, fill = CryoSleep)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of Passengers by Transported Status and CryoSleep",
       x = "Transported",
       y = "Number of Passengers",
       fill = "CryoSleep") +
  theme_minimal()

HomePlanet

Takeaways:

  • Europa has the second-largest number of passengers by home planet and, relative to its share, a slightly higher proportion of passengers who were transported.
#HomePlanet
table(train$HomePlanet)
## 
##  Earth Europa   Mars 
##   3770   1727   1457
ggplot(train, aes(x = Transported, fill = HomePlanet)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of Transported Passengers by HomePlanet",
       x = "Transported",
       y = "Number of Passengers",
       fill = "HomePlanet") +
  theme_minimal()

Deck

Takeaways:

  • Decks F and G have the most passengers

  • Overall, the majority of transported passengers came from decks F, G, C, and B

    • Proportionally, decks B and C were the only decks with more transported than non-transported passengers
#Deck 
table(train$deck)
## 
##    A    B    C    D    E    F    G    T 
##  215  647  609  405  703 2258 2112    5
ggplot(train, aes(x = Transported, fill = deck)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of Transported Passengers by deck",
       x = "Transported",
       y = "Number of Passengers",
       fill = "deck") +
  theme_minimal()

Side

Takeaways:

  • Slightly more of the transported passengers were on the starboard side.
#Side 
table(train$side)
## 
##    P    S 
## 3463 3491
ggplot(train, aes(x = Transported, fill = side)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of Transported Passengers by side",
       x = "Transported",
       y = "Number of Passengers",
       fill = "side") +
  theme_minimal()

Destination

Takeaways:

  • Overall, the majority of transported passengers were headed to TRAPPIST-1e.
#Destination
table(train$Destination)
## 
##   55 Cancri e PSO J318.5-22   TRAPPIST-1e 
##          1448           631          4875
ggplot(train, aes(x = Transported, fill = Destination)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of Transported Passengers by Destination",
       x = "Transported",
       y = "Number of Passengers",
       fill = "Destination") +
  theme_minimal()

Age

Takeaways:

  • There is a significant spike in passengers who were not transported between ages 16 and 28

  • Very young children (ages 0-10) were transported more frequently.

#Age
ggplot(train, aes(x = Age, fill = Transported)) +
  geom_density(alpha = 0.5) +
  labs(title = "Age Distribution of Transported vs. Not Transported Passengers",
       x = "Age",
       y = "Density",
       fill = "Transported") +
  theme_minimal()

VIP

Takeaways:

  • There were only 162 VIP passengers

    • Of those, only 60 were transported
  • Non-VIP passengers, on the other hand, were split roughly 50/50 between transported and not transported

#VIP
train %>%
  group_by(VIP, Transported) %>%
  summarise(Count = n(), .groups = "drop")
## # A tibble: 4 × 3
##   VIP   Transported Count
##   <fct> <fct>       <int>
## 1 0     0            3371
## 2 0     1            3421
## 3 1     0             102
## 4 1     1              60
  • Thinking through the numerical (spending) variables: they only apply to passengers who were not in cryo-freeze, only a small number of VIP passengers would have the funds to spend heavily, and any spending likely depends on its relationship with the other variables. As a result, all variables will be included in the model development.

  • Our decision to use random forest imputation for the missing values gave us more educated estimates of the imputed values.

Model Development - Logistic Regression

  • Including all variables, we obtained a training classification error rate of a little under 21%.
# Fit logistic regression model
model1 <- glm(Transported ~ ., 
              family = "binomial", data = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Predict probabilities on the training data
train_probs <- predict(model1, newdata = train, type = "response")

# Convert probabilities to class predictions (threshold = 0.5)
train_preds <- ifelse(train_probs > 0.5, 1, 0)  # Use 1 and 0 for binary outcome

# Compute training classification error rate
misclassified <- sum(train_preds != train$Transported)  # Count misclassified instances
train_error_rate <- misclassified / length(train_preds)  # Compute error rate

# Print the training error rate
print(train_error_rate)
## [1] 0.2076503

Question 2 - Default data set

In Chapter 4, we used logistic regression to predict the probability of Default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

  (a) Fit a logistic regression model that uses income and balance to predict default.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.2
Default <- Default
head(Default)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
#Used to reproduce results
set.seed(1)

#Logistic Regression
log_regression <- glm(default~ income + balance, data = Default, family = "binomial")
summary(log_regression)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
  (b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

    i. Split the sample set into a training set and a validation set.
    #Split 80/20
    train_index <- sample(1:nrow(Default), size = 0.8*nrow(Default))
    
    #Calling training and validation sets
    
    train_data <- Default[train_index,]
    valid_data <- Default[-train_index,]
    ii. Fit a multiple logistic regression model using only the training observations.
    #Model
    multi_reg <- glm(default ~ income + balance, data = train_data, family = "binomial")
    summary(multi_reg)
    ## 
    ## Call:
    ## glm(formula = default ~ income + balance, family = "binomial", 
    ##     data = train_data)
    ## 
    ## Coefficients:
    ##               Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -1.168e+01  4.893e-01 -23.879  < 2e-16 ***
    ## income       2.547e-05  5.631e-06   4.523  6.1e-06 ***
    ## balance      5.613e-03  2.531e-04  22.176  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 2313.6  on 7999  degrees of freedom
    ## Residual deviance: 1239.2  on 7997  degrees of freedom
    ## AIC: 1245.2
    ## 
    ## Number of Fisher Scoring iterations: 8
    iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
    #Predictions & Probs
    multi_probs <- predict(multi_reg, newdata = valid_data, type = "response")
    multi_preds <- rep("No", 2000)
    multi_preds[multi_probs>0.5] = "Yes"
    iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
    mean(multi_preds != valid_data$default)
    ## [1] 0.026
  (c) Repeat the process in (b) again, using a different split of the observations into a training set and a validation set. Comment on the results obtained.

#50/50 split
train_repeat <- sample(dim(Default)[1], dim(Default)[1] / 2)

#Model 2
multi_repeat <- glm(default ~ income + balance, data = Default[train_repeat,], family = "binomial")
summary(multi_repeat)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default[train_repeat, ])
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.182e+01  6.464e-01 -18.286  < 2e-16 ***
## income       1.952e-05  7.426e-06   2.629  0.00856 ** 
## balance      5.757e-03  3.344e-04  17.218  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1395.76  on 4999  degrees of freedom
## Residual deviance:  736.66  on 4997  degrees of freedom
## AIC: 742.66
## 
## Number of Fisher Scoring iterations: 8
#Probs & Preds
repeat_probs <- predict(multi_repeat, newdata = Default[-train_repeat, ], type = "response")
repeat_preds <- rep("No",5000)
repeat_preds[repeat_probs>0.5] = "Yes"

#Validation set error 
mean(repeat_preds != Default[-train_repeat, ]$default)
## [1] 0.0278
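With this 50/50 split, the validation set error rate is about 2.8%, very close to the roughly 2.6% obtained with the earlier 80/20 split. The estimated test error therefore changes only slightly depending on which observations fall into the training and validation sets.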

Question 3 - Caravan Data set

This question uses the Caravan data set.

  (a) Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.

  (b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?
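
The code chunks for this question were not echoed in the knitted output. Below is a minimal sketch of how parts (a) and (b) might be set up; the seed and the object names (caravan, caravan_train, caravan_test, boost_model) are assumptions, not the original code.

library(ISLR)   # Caravan data set
library(gbm)    # boosting

set.seed(1)     # assumed seed

# gbm's bernoulli loss expects a 0/1 response, so recode Purchase
caravan <- Caravan
caravan$Purchase <- ifelse(caravan$Purchase == "Yes", 1, 0)

# (a) First 1,000 observations as the training set, the rest as the test set
caravan_train <- caravan[1:1000, ]
caravan_test  <- caravan[-(1:1000), ]

# (b) Boosting with 1,000 trees and a shrinkage value of 0.01
boost_model <- gbm(Purchase ~ ., data = caravan_train,
                   distribution = "bernoulli",
                   n.trees = 1000, shrinkage = 0.01)

# Relative influence of each predictor (table shown below)
summary(boost_model)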

## Warning: package 'gbm' was built under R version 4.4.2
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

##               var     rel.inf
## PPERSAUT PPERSAUT 7.472017283
## MGODGE     MGODGE 4.920268524
## PBRAND     PBRAND 4.562337369
## MKOOPKLA MKOOPKLA 4.455474976
## MOPLHOOG MOPLHOOG 4.380527334
## MOSTYPE   MOSTYPE 4.051368353
## MINK3045 MINK3045 3.950103516
## MGODPR     MGODPR 3.672167279
## MBERMIDD MBERMIDD 3.243664884
## MAUT2       MAUT2 3.234222424
## MBERARBG MBERARBG 2.649691655
## MSKB1       MSKB1 2.519943184
## MSKC         MSKC 2.458311336
## MOPLMIDD MOPLMIDD 2.265387824
## MSKA         MSKA 2.160331133
## PWAPART   PWAPART 2.124682559
## MBERARBO MBERARBO 2.033834610
## MINKM30   MINKM30 2.025470388
## MFWEKIND MFWEKIND 1.973111208
## MRELOV     MRELOV 1.811226920
## MBERHOOG MBERHOOG 1.779508314
## MAUT1       MAUT1 1.761319430
## MRELSA     MRELSA 1.720751942
## MRELGE     MRELGE 1.667203409
## MGODOV     MGODOV 1.649518856
## MINK7512 MINK7512 1.580946746
## MINKGEM   MINKGEM 1.495104835
## MFALLEEN MFALLEEN 1.445414183
## MSKB2       MSKB2 1.417953954
## MZFONDS   MZFONDS 1.401154197
## MAUT0       MAUT0 1.379730933
## MGODRK     MGODRK 1.346956354
## MINK4575 MINK4575 1.289878285
## MFGEKIND MFGEKIND 1.287102544
## ABRAND     ABRAND 1.269977976
## MGEMLEEF MGEMLEEF 1.210600254
## MHHUUR     MHHUUR 1.207518555
## MGEMOMV   MGEMOMV 1.188485795
## MSKD         MSKD 1.120963005
## MHKOOP     MHKOOP 1.097480654
## MZPART     MZPART 0.981066976
## APERSAUT APERSAUT 0.836917534
## MOSHOOFD MOSHOOFD 0.716215569
## MOPLLAAG MOPLLAAG 0.627775856
## MBERZELF MBERZELF 0.509784887
## PMOTSCO   PMOTSCO 0.498942450
## MBERBOER MBERBOER 0.440908428
## PLEVEN     PLEVEN 0.365185344
## PBYSTAND PBYSTAND 0.310737284
## MINK123M MINK123M 0.252407865
## MAANTHUI MAANTHUI 0.132457819
## ALEVEN     ALEVEN 0.025485709
## PFIETS     PFIETS 0.013999110
## PAANHANG PAANHANG 0.006402191
## PWABEDR   PWABEDR 0.000000000
## PWALAND   PWALAND 0.000000000
## PBESAUT   PBESAUT 0.000000000
## PTRACTOR PTRACTOR 0.000000000
## PWERKT     PWERKT 0.000000000
## PBROM       PBROM 0.000000000
## PPERSONG PPERSONG 0.000000000
## PGEZONG   PGEZONG 0.000000000
## PWAOREG   PWAOREG 0.000000000
## PZEILPL   PZEILPL 0.000000000
## PPLEZIER PPLEZIER 0.000000000
## PINBOED   PINBOED 0.000000000
## AWAPART   AWAPART 0.000000000
## AWABEDR   AWABEDR 0.000000000
## AWALAND   AWALAND 0.000000000
## ABESAUT   ABESAUT 0.000000000
## AMOTSCO   AMOTSCO 0.000000000
## AAANHANG AAANHANG 0.000000000
## ATRACTOR ATRACTOR 0.000000000
## AWERKT     AWERKT 0.000000000
## ABROM       ABROM 0.000000000
## APERSONG APERSONG 0.000000000
## AGEZONG   AGEZONG 0.000000000
## AWAOREG   AWAOREG 0.000000000
## AZEILPL   AZEILPL 0.000000000
## APLEZIER APLEZIER 0.000000000
## AFIETS     AFIETS 0.000000000
## AINBOED   AINBOED 0.000000000
## ABYSTAND ABYSTAND 0.000000000

Looking at the summary table of the boosting model, the most important predictor is PPERSAUT (7.472) by far. The next most important predictors are MGODGE (4.92) and PBRAND (4.562).

  (c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%.
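
The chunk for this part was not echoed either; a sketch under the same assumed object names:

# Predicted purchase probabilities on the test set
boost_probs <- predict(boost_model, newdata = caravan_test,
                       n.trees = 1000, type = "response")

# Predict a purchase when the estimated probability exceeds 20%
boost_preds <- ifelse(boost_probs > 0.2, 1, 0)

# Confusion matrix of predictions (rows) against actual purchases (columns)
table(boost_preds, caravan_test$Purchase)

# Overall accuracy
mean(boost_preds == caravan_test$Purchase)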
##            
## boost_preds    0    1
##           0 4346  254
##           1  187   35

The model correctly predicted 90.85% of the outcomes. Of the 222 people predicted to make a purchase, only 35 (about 15.8%) actually did. The model's main weakness is predicting "No" when the true outcome was "Yes" (254 false negatives), which means it misses many actual buyers.

Question 4

In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.

  (a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.

  (b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.
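
The simulation and PCA chunks were not echoed in the output. Below is a minimal sketch under assumed settings (the seed, the ±3 mean shifts, and the names x, true_class, and pca_out are assumptions):

set.seed(123)   # assumed seed

# (a) 60 observations x 50 variables: three classes of 20 with different mean shifts
x <- matrix(rnorm(60 * 50), nrow = 60, ncol = 50)
x[1:20, ]  <- x[1:20, ]  + 3    # class 1 shifted up
x[41:60, ] <- x[41:60, ] - 3    # class 3 shifted down
true_class <- rep(1:3, each = 20)

# (b) PCA on the 60 observations and plot of the first two score vectors
pca_out <- prcomp(x)
plot(pca_out$x[, 1:2], col = true_class, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "First Two Principal Component Scores by Class")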

The three classes do appear separated in the plot, so we can continue on to part (c).

  (c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?
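
A sketch of the K = 3 clustering, using the assumed objects from above (nstart = 20 is an assumption):

km3 <- kmeans(x, centers = 3, nstart = 20)

# Compare cluster assignments with the true class labels
table(Cluster = km3$cluster, Actual_Class = true_class)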
##        Actual_Class
## Cluster  1  2  3
##       1  0 20  0
##       2 20  0  0
##       3  0  0 20

The clusters align well with the true classes. The cluster labels differ from the class labels, but since cluster numbering is arbitrary this does not matter; every point was grouped with the rest of its class.

  (d) Perform K-means clustering with K = 2. Describe your results.
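
A sketch with K = 2, under the same assumptions:

km2 <- kmeans(x, centers = 2, nstart = 20)
table(Cluster_2 = km2$cluster, Actual_Class = true_class)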
##          Actual_Class
## Cluster_2  1  2  3
##         1 20 20  0
##         2  0  0 20

The results show that two of the true classes were merged into the first cluster, while the last class formed the second cluster. Compared to the K = 3 results in the previous part, this indicates that two clusters do not fit this data as well.

  (e) Now perform K-means clustering with K = 4, and describe your results.
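
A sketch with K = 4, under the same assumptions:

km4 <- kmeans(x, centers = 4, nstart = 20)
table(Cluster_4 = km4$cluster, Actual_Class = true_class)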
##          Actual_Class
## Cluster_4  1  2  3
##         1 10  0  0
##         2  0  0 20
##         3 10  0  0
##         4  0 20  0

Looking at the table, clusters one and three split class one between them (10 observations each). This model splits an existing class unnecessarily, meaning K = 4 over-complicates the clustering; K = 3 remains the best choice so far.

  (f) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
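
A sketch of K = 3 on the first two principal component score vectors (using the assumed pca_out from above):

km_pca <- kmeans(pca_out$x[, 1:2], centers = 3, nstart = 20)
table(Cluster_PCA = km_pca$cluster, Actual_Class = true_class)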

##            Actual_Class
## Cluster_PCA  1  2  3
##           1  0 20  0
##           2 20  0  0
##           3  0  0 20

The K-means clusters based on the first two principal component scores perform the same as clustering on the raw data with K = 3: we still recover the three classes exactly, meaning the class structure is preserved even after reducing to two dimensions.

  (g) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.
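
A sketch of K = 3 on the scaled data, under the same assumptions:

km_scaled <- kmeans(scale(x), centers = 3, nstart = 20)
table(Cluster_Scaled = km_scaled$cluster, Actual_Class = true_class)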
##               Actual_Class
## Cluster_Scaled  1  2  3
##              1  0 20  0
##              2 20  0  0
##              3  0  0 20

We see essentially identical results: three distinct clusters that each correspond to one of the true classes. This indicates that scaling the variables does not change the results for this data.