Go to the Spaceship Titanic competition on Kaggle, join the competition, and download the training set. Complete the classification problem at hand using logistic regression. In order to complete this question, you need to show your training classification error rate only. You don’t need to submit via Kaggle.
#Load packages we may need
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.2
## Warning: package 'forcats' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mice)
## Warning: package 'mice' was built under R version 4.4.3
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
# Load train & test set
train <- read.csv("C:/Users/gianc/OneDrive/Desktop/BUA 302 HW-CODE/SPACE - Group Assignment/train.csv", stringsAsFactors = TRUE, na.strings = c("", "NA"), header = TRUE)
#Split 80/20
set.seed(123)
split_calc <- sample(1:nrow(train), size = 0.8 * nrow(train))
test <- train[-split_calc, ] #Take the hold-out rows first, before train is overwritten
train <- train[split_calc, ]
When bringing in the data, any blank values were converted to NAs.
Cleaning Data:
Separated Cabin into three segments (deck, num, side)
Converted False/True to 0/1 (for CryoSleep, Transported, and VIP)
Removed unnecessary variables (PassengerId, Name)
Missing Values:
glimpse(train)
## Rows: 6,954
## Columns: 14
## $ PassengerId <fct> 2643_01, 2695_01, 3229_01, 1968_01, 3624_01, 5083_01, 711…
## $ HomePlanet <fct> Earth, Mars, Europa, Earth, NA, Earth, Earth, Mars, Earth…
## $ CryoSleep <fct> False, False, False, True, False, True, True, False, Fals…
## $ Cabin <fct> F/511/S, E/186/S, B/118/S, G/319/P, G/594/P, G/820/P, G/1…
## $ Destination <fct> PSO J318.5-22, TRAPPIST-1e, TRAPPIST-1e, PSO J318.5-22, T…
## $ Age <dbl> 34, 18, 45, 38, 29, 22, 26, 25, 39, 6, NA, 58, 19, 15, 3,…
## $ VIP <fct> False, False, False, False, False, False, False, False, F…
## $ RoomService <dbl> 122, 947, 3, 0, 0, 0, 0, 1228, 0, 0, 0, 0, 0, 1554, 0, 0,…
## $ FoodCourt <dbl> 0, 0, 2344, 0, 0, 0, 0, 0, 1, 0, 0, 82, 0, 0, 0, 1382, 4,…
## $ ShoppingMall <dbl> 6, 4, 1100, 0, 0, 0, 0, 132, 0, 0, 0, 0, 616, 646, 0, 0, …
## $ Spa <dbl> 739, 4, 498, 0, 0, 0, 0, 0, 739, 0, 0, 758, 0, 20, 0, 179…
## $ VRDeck <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1241, 0, 588, 0, 0…
## $ Name <fct> Terta Chasey, Honey Stike, Aino Chpeng, Delody Traverdy, …
## $ Transported <fct> False, False, True, True, False, False, False, False, Fal…
#Number of NAs throughout train data
colSums(is.na(train))
## PassengerId HomePlanet CryoSleep Cabin Destination Age
## 0 146 173 165 140 148
## VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 156 152 144 163 146 155
## Name Transported
## 169 0
#Separate Cabin in 3 segments
train <- separate(train, Cabin, c('deck','num','side'))
train$num <- as.numeric(train$num)
train <- as.data.frame(unclass(train), stringsAsFactors = TRUE)
#Change FALSE/TRUE to 0/1
train <- train %>%
  mutate(CryoSleep = factor(case_when(CryoSleep == 'False' ~ 0, CryoSleep == 'True' ~ 1)),
         Transported = factor(case_when(Transported == 'False' ~ 0, Transported == 'True' ~ 1)),
         VIP = factor(case_when(VIP == 'False' ~ 0, VIP == 'True' ~ 1))) %>%
  select(-c(PassengerId, Name)) %>% #Taking out certain columns
  mice(method = "rf") %>%
  complete()
##
## iter imp variable
## 1 1 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 1 2 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 1 3 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 1 4 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 1 5 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 2 1 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 2 2 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 2 3 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 2 4 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 2 5 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 3 1 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 3 2 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 3 3 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 3 4 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 3 5 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 4 1 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 4 2 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 4 3 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 4 4 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 4 5 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 5 1 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 5 2 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 5 3 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 5 4 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
## 5 5 HomePlanet CryoSleep deck num side Destination Age VIP RoomService FoodCourt ShoppingMall Spa VRDeck
#No more missing values
colSums(is.na(train))
## HomePlanet CryoSleep deck num side Destination
## 0 0 0 0 0 0
## Age VIP RoomService FoodCourt ShoppingMall Spa
## 0 0 0 0 0 0
## VRDeck Transported
## 0 0
Takeaways:
Almost 35% of passengers elected to cryo-sleep.
Proportionally, passengers in CryoSleep were transported at a much higher rate.
#CryoSleep
table(train$CryoSleep)
##
## 0 1
## 4489 2465
ggplot(train, aes(x = Transported, fill = CryoSleep)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of Passengers by Transported Status and CryoSleep",
       x = "Transported",
       y = "Number of Passengers",
       fill = "CryoSleep") +
  theme_minimal()
Takeaways:
Earth is by far the most common home planet (3,770 passengers), followed by Europa (1,727) and Mars (1,457).
#HomePlanet
table(train$HomePlanet)
##
## Earth Europa Mars
## 3770 1727 1457
ggplot(train, aes(x = Transported, fill = HomePlanet)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of Transported Passengers by HomePlanet",
       x = "Transported",
       y = "Number of Passengers",
       fill = "HomePlanet") +
  theme_minimal()
Takeaways:
Decks F and G have the most passengers.
Overall, the majority of transported passengers came from decks F, G, C, and B.
#Deck
table(train$deck)
##
## A B C D E F G T
## 215 647 609 405 703 2258 2112 5
ggplot(train, aes(x = Transported, fill = deck)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of Transported Passengers by deck",
       x = "Transported",
       y = "Number of Passengers",
       fill = "deck") +
  theme_minimal()
Takeaways:
Passengers are split almost evenly between the two sides of the ship (P: 3,463 vs. S: 3,491).
#Side
table(train$side)
##
## P S
## 3463 3491
ggplot(train, aes(x = Transported, fill = side)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of Transported Passengers by side",
       x = "Transported",
       y = "Number of Passengers",
       fill = "side") +
  theme_minimal()
Takeaways:
TRAPPIST-1e is by far the most common destination (4,875 of 6,954 passengers).
#Destination
table(train$Destination)
##
## 55 Cancri e PSO J318.5-22 TRAPPIST-1e
## 1448 631 4875
ggplot(train, aes(x = Transported, fill = Destination)) +
  geom_bar(position = "dodge") +
  labs(title = "Number of Transported Passengers by Destination",
       x = "Transported",
       y = "Number of Passengers",
       fill = "Destination") +
  theme_minimal()
Takeaways:
There is a significant spike in non-transported passengers between ages 16 and 28.
Very young children (ages 0-10) were transported more frequently.
#Age
ggplot(train, aes(x = Age, fill = Transported)) +
  geom_density(alpha = 0.5) +
  labs(title = "Age Distribution of Transported vs. Not Transported Passengers",
       x = "Age",
       y = "Density",
       fill = "Transported") +
  theme_minimal()
Takeaways:
There were only 162 VIP passengers, and most of them were not transported.
Non-VIP passengers, on the other hand, were split roughly 50/50 between transported and not transported.
#VIP
train %>%
  group_by(VIP, Transported) %>%
  summarise(Count = n(), .groups = "drop")
## # A tibble: 4 × 3
## VIP Transported Count
## <fct> <fct> <int>
## 1 0 0 3371
## 2 0 1 3421
## 3 1 0 102
## 4 1 1 60
Thinking through the numerical spending variables, these apply only to passengers who were not in cryo-sleep. Furthermore, only the small number of VIP passengers would have the funds to spend heavily. Lastly, any money spent depends on its relationship with several other variables. As a result, all variables will be included in the model development.
Our decision to use random forest imputation for the missing values gave us more educated estimates of the imputed values.
# Fit logistic regression model
model1 <- glm(Transported ~ ., family = "binomial", data = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Predict probabilities on the training data
train_probs <- predict(model1, newdata = train, type = "response")
# Convert probabilities to class predictions (threshold = 0.5)
train_preds <- ifelse(train_probs > 0.5, 1, 0) # Use 1 and 0 for binary outcome
# Compute training classification error rate
misclassified <- sum(train_preds != train$Transported) # Count misclassified instances
train_error_rate <- misclassified / length(train_preds) # Compute error rate
# Print the training error rate
print(train_error_rate)
## [1] 0.2076503
In Chapter 4, we used logistic regression to predict the probability of Default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.2
Default <- Default
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
#Used to reproduce results
set.seed(1)
#Logistic Regression
log_regression <- glm(default~ income + balance, data = Default, family = "binomial")
summary(log_regression)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps: split the sample into a training set and a validation set; fit the logistic regression using only the training observations; classify each validation observation as default if its predicted probability exceeds 0.5; and compute the validation set error.
#Split 80/20
train_index <- sample(1:nrow(Default), size = 0.8*nrow(Default))
#Calling training and validation sets
train_data <- Default[train_index,]
valid_data <- Default[-train_index,]
#Model
multi_reg <- glm(default ~ income + balance, data = train_data, family = "binomial")
summary(multi_reg)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.168e+01 4.893e-01 -23.879 < 2e-16 ***
## income 2.547e-05 5.631e-06 4.523 6.1e-06 ***
## balance 5.613e-03 2.531e-04 22.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2313.6 on 7999 degrees of freedom
## Residual deviance: 1239.2 on 7997 degrees of freedom
## AIC: 1245.2
##
## Number of Fisher Scoring iterations: 8
#Predictions & Probs
multi_probs <- predict(multi_reg, newdata = valid_data, type = "response")
multi_preds <- rep("No", length(multi_probs))
multi_preds[multi_probs > 0.5] <- "Yes"
mean(multi_preds != valid_data$default)
## [1] 0.026
Repeat the process in (b) again, using a different split of the observations into a training set and a validation set. Comment on the results obtained.
Residual deviance measures the variation left unexplained after accounting for our predictors. Proportionally, the second model achieves a slightly larger deviance reduction than the first.
AIC balances goodness of fit against model complexity. The second model's AIC is much lower, but keep in mind that AIC values are not directly comparable here, since the two models are fit on different numbers of observations (8,000 vs. 5,000).
As for the validation set errors (2.6% vs. 2.78%), the difference is not significant. Both fits use the same two predictors; the estimates differ only because of the different random splits, which illustrates the variability inherent in the validation set approach.
#50/50 split
train_repeat <- sample(dim(Default)[1], dim(Default)[1] / 2)
#Model 2
multi_repeat <- glm(default ~ income + balance, data = Default[train_repeat,], family = "binomial")
summary(multi_repeat)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default[train_repeat, ])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.182e+01 6.464e-01 -18.286 < 2e-16 ***
## income 1.952e-05 7.426e-06 2.629 0.00856 **
## balance 5.757e-03 3.344e-04 17.218 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1395.76 on 4999 degrees of freedom
## Residual deviance: 736.66 on 4997 degrees of freedom
## AIC: 742.66
##
## Number of Fisher Scoring iterations: 8
#Probs & Preds
repeat_probs <- predict(multi_repeat, newdata = Default[-train_repeat, ], type="response")
repeat_preds <- rep("No", length(repeat_probs))
repeat_preds[repeat_probs > 0.5] <- "Yes"
#Validation set error
mean(repeat_preds != Default[-train_repeat, ]$default)
## [1] 0.0278
This question uses the Caravan data set.
Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.
Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?
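The fitting chunk is not echoed in the output below; here is a minimal sketch of the likely setup, where the seed value and object names are assumptions (the exercise itself specifies 1,000 trees and shrinkage 0.01):
library(ISLR) #Caravan data
library(gbm)
set.seed(1) #assumed; no seed is shown in the report
#gbm's bernoulli distribution expects a numeric 0/1 response
Caravan$Purchase <- ifelse(Caravan$Purchase == "Yes", 1, 0)
caravan_train <- Caravan[1:1000, ]   #first 1,000 observations
caravan_test <- Caravan[-(1:1000), ] #remaining observations
boost_caravan <- gbm(Purchase ~ ., data = caravan_train,
                     distribution = "bernoulli",
                     n.trees = 1000, shrinkage = 0.01)
summary(boost_caravan) #relative influence table below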
## Warning: package 'gbm' was built under R version 4.4.2
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
## var rel.inf
## PPERSAUT PPERSAUT 7.472017283
## MGODGE MGODGE 4.920268524
## PBRAND PBRAND 4.562337369
## MKOOPKLA MKOOPKLA 4.455474976
## MOPLHOOG MOPLHOOG 4.380527334
## MOSTYPE MOSTYPE 4.051368353
## MINK3045 MINK3045 3.950103516
## MGODPR MGODPR 3.672167279
## MBERMIDD MBERMIDD 3.243664884
## MAUT2 MAUT2 3.234222424
## MBERARBG MBERARBG 2.649691655
## MSKB1 MSKB1 2.519943184
## MSKC MSKC 2.458311336
## MOPLMIDD MOPLMIDD 2.265387824
## MSKA MSKA 2.160331133
## PWAPART PWAPART 2.124682559
## MBERARBO MBERARBO 2.033834610
## MINKM30 MINKM30 2.025470388
## MFWEKIND MFWEKIND 1.973111208
## MRELOV MRELOV 1.811226920
## MBERHOOG MBERHOOG 1.779508314
## MAUT1 MAUT1 1.761319430
## MRELSA MRELSA 1.720751942
## MRELGE MRELGE 1.667203409
## MGODOV MGODOV 1.649518856
## MINK7512 MINK7512 1.580946746
## MINKGEM MINKGEM 1.495104835
## MFALLEEN MFALLEEN 1.445414183
## MSKB2 MSKB2 1.417953954
## MZFONDS MZFONDS 1.401154197
## MAUT0 MAUT0 1.379730933
## MGODRK MGODRK 1.346956354
## MINK4575 MINK4575 1.289878285
## MFGEKIND MFGEKIND 1.287102544
## ABRAND ABRAND 1.269977976
## MGEMLEEF MGEMLEEF 1.210600254
## MHHUUR MHHUUR 1.207518555
## MGEMOMV MGEMOMV 1.188485795
## MSKD MSKD 1.120963005
## MHKOOP MHKOOP 1.097480654
## MZPART MZPART 0.981066976
## APERSAUT APERSAUT 0.836917534
## MOSHOOFD MOSHOOFD 0.716215569
## MOPLLAAG MOPLLAAG 0.627775856
## MBERZELF MBERZELF 0.509784887
## PMOTSCO PMOTSCO 0.498942450
## MBERBOER MBERBOER 0.440908428
## PLEVEN PLEVEN 0.365185344
## PBYSTAND PBYSTAND 0.310737284
## MINK123M MINK123M 0.252407865
## MAANTHUI MAANTHUI 0.132457819
## ALEVEN ALEVEN 0.025485709
## PFIETS PFIETS 0.013999110
## PAANHANG PAANHANG 0.006402191
## PWABEDR PWABEDR 0.000000000
## PWALAND PWALAND 0.000000000
## PBESAUT PBESAUT 0.000000000
## PTRACTOR PTRACTOR 0.000000000
## PWERKT PWERKT 0.000000000
## PBROM PBROM 0.000000000
## PPERSONG PPERSONG 0.000000000
## PGEZONG PGEZONG 0.000000000
## PWAOREG PWAOREG 0.000000000
## PZEILPL PZEILPL 0.000000000
## PPLEZIER PPLEZIER 0.000000000
## PINBOED PINBOED 0.000000000
## AWAPART AWAPART 0.000000000
## AWABEDR AWABEDR 0.000000000
## AWALAND AWALAND 0.000000000
## ABESAUT ABESAUT 0.000000000
## AMOTSCO AMOTSCO 0.000000000
## AAANHANG AAANHANG 0.000000000
## ATRACTOR ATRACTOR 0.000000000
## AWERKT AWERKT 0.000000000
## ABROM ABROM 0.000000000
## APERSONG APERSONG 0.000000000
## AGEZONG AGEZONG 0.000000000
## AWAOREG AWAOREG 0.000000000
## AZEILPL AZEILPL 0.000000000
## APLEZIER APLEZIER 0.000000000
## AFIETS AFIETS 0.000000000
## AINBOED AINBOED 0.000000000
## ABYSTAND ABYSTAND 0.000000000
Looking at the summary table of the boosting model, PPERSAUT is by far the most important predictor (relative influence 7.47). The next most important predictors are MGODGE (4.92) and PBRAND (4.56).
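The prediction step is likewise not echoed; a sketch using the 20% probability threshold the exercise specifies (object names carried over from the sketch above):
boost_probs <- predict(boost_caravan, newdata = caravan_test,
                       n.trees = 1000, type = "response")
boost_preds <- ifelse(boost_probs > 0.2, 1, 0) #predict a purchase above 20%
table(boost_preds, caravan_test$Purchase)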
##
## boost_preds 0 1
## 0 4346 254
## 1 187 35
The model correctly predicted 90.85% of the outcomes. Where it struggled was predicting No when the true outcome was Yes: 254 false negatives against only 35 true positives. That is a real weakness, since the model misses most actual buyers.
In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data.
Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.
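The simulation chunk is not shown in the report; a minimal sketch that matches the description, separating the classes by adding a different mean shift to each class (the seed and the shift size of 3 are assumptions):
set.seed(123) #assumed seed
#60 observations x 50 variables, 20 observations per class
x <- matrix(rnorm(60 * 50), ncol = 50)
x[1:20, ] <- x[1:20, ] + 3   #shift class 1 up
x[41:60, ] <- x[41:60, ] - 3 #shift class 3 down; class 2 stays at 0
Actual_Class <- rep(1:3, each = 20)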
Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.
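A sketch of the PCA step and the score plot (object names as above):
pca_out <- prcomp(x)
plot(pca_out$x[, 1:2], col = Actual_Class, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "First Two Principal Component Scores by True Class")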
The three classes do appear separated in the plot, so we are good to move on to part (c).
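A sketch of the K = 3 clustering behind the table below:
set.seed(123)
km3 <- kmeans(x, centers = 3, nstart = 20)
Cluster <- km3$cluster
table(Cluster, Actual_Class)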
## Actual_Class
## Cluster 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
The clusters align well with the true classes. The cluster labels differ from the class labels, but we were warned that clusters are numbered arbitrarily, so that does not matter. All points were clustered with their respective groups.
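Repeating with K = 2 (sketch):
km2 <- kmeans(x, centers = 2, nstart = 20)
Cluster_2 <- km2$cluster
table(Cluster_2, Actual_Class)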
## Actual_Class
## Cluster_2 1 2 3
## 1 20 20 0
## 2 0 0 20
The results show that two of the true classes were merged into the first cluster, while the last class formed the second cluster. Compared to the previous part, which used k = 3, these results indicate that this number of clusters does not work as well with this data.
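And with K = 4 (sketch):
km4 <- kmeans(x, centers = 4, nstart = 20)
Cluster_4 <- km4$cluster
table(Cluster_4, Actual_Class)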
## Actual_Class
## Cluster_4 1 2 3
## 1 10 0 0
## 2 0 0 20
## 3 10 0 0
## 4 0 20 0
Looking at the table, clusters one and three split the observations from class one between them. The model divided an existing class unnecessarily, meaning k = 4 over-complicates the clustering. k = 3 is still the best choice so far.
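K = 3 on the first two principal component score vectors rather than the raw data (sketch, reusing pca_out from above):
km_pca <- kmeans(pca_out$x[, 1:2], centers = 3, nstart = 20)
Cluster_PCA <- km_pca$cluster
table(Cluster_PCA, Actual_Class)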
## Actual_Class
## Cluster_PCA 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
The PCA-based k-means clusters performed essentially the same as k-means on the raw data with k = 3. We still recover three distinct classes, meaning the class structure is preserved even after reducing to two dimensions.
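Finally, K = 3 after scaling each variable to have standard deviation one (sketch):
km_scaled <- kmeans(scale(x), centers = 3, nstart = 20)
Cluster_Scaled <- km_scaled$cluster
table(Cluster_Scaled, Actual_Class)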
## Actual_Class
## Cluster_Scaled 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
We see essentially identical results: three distinct clusters that correspond to the correct classes. This indicates that scaling does not change the results for this data.