6.9 (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
The test MSE from my PLS fit (about 1,588,895 with M = 5) was lower than the test MSE from my PCR fit (about 1,676,837 with M = 15). Both methods performed better than the lasso and ordinary least squares fits from the previous parts, so PLS looks like the best of the models considered here.
library(tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ dplyr 1.0.7
## ✓ tibble 3.1.2 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ISLR2)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
College <- na.omit(College)
sam.college <- sample(1:nrow(College), nrow(College)*.8)
x <- model.matrix(Apps~., data=College)[, -1]
y <- College$Apps
train.col <- College[sam.college,]
test.col <- College[-sam.college,]
y.test <- test.col$Apps
#PCR
pcr <- pcr(Apps ~ ., data=train.col, scale =TRUE, validation = 'CV')
validationplot(pcr, val.type = 'MSEP')
#M = 15 components chosen from the cross-validation plot
predict.pcr <- predict(pcr, test.col, ncomp=15)
acc.pcr <- mean((predict.pcr-test.col$Apps)^2)
acc.pcr
## [1] 1676837
#PLS
set.seed(1)
pls <- plsr(Apps~., data = train.col, scale= TRUE, validation= 'CV')
validationplot(pls, val.type='MSEP')
#M = 5 components chosen from the cross-validation plot
pred.pls <- predict(pls, test.col, ncomp = 5)
acc.pls <- mean((pred.pls - test.col$Apps)^2)
acc.pls
## [1] 1588895
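As a side note (not part of the graded output above), the pls package can also pick M from the cross-validation results directly instead of reading it off the validation plots. This is only a sketch using the one-sigma rule on the fitted objects above, and the values it selects may differ slightly from the 15 and 5 I chose by eye; m.pcr and m.pls are new names introduced here.
#Optional check: let pls pick M from the CV results (one-sigma rule)
m.pcr <- selectNcomp(pcr, method = "onesigma", plot = FALSE)
m.pls <- selectNcomp(pls, method = "onesigma", plot = FALSE)
c(M.pcr = m.pcr, M.pls = m.pls)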
In Section 12.2.3, a formula for calculating PVE was given in Equation 12.10. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate PVE in two ways: (a) Using the sdev output of the prcomp() function, as was done in Section 12.2.3. (b) By applying Equation 12.10 directly, using the principal component loadings computed by prcomp().
The first principal component explains about 62.0% of the variance, the second about 24.7%, the third about 8.9%, and the fourth about 4.3%.
library(dplyr)
library(tidyverse)
set.seed(1)
#part a
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
states <- row.names(USArrests)
pr.out <- prcomp(USArrests, scale = TRUE)
pr.var <- pr.out$sdev^2
pve <- pr.var / sum(pr.var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
#part b
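#A sketch of the direct calculation from Equation 12.10 for part (b), assuming pr.out
#from part (a); scaled.arrests, scores, and pve.direct are new names introduced here.
#Projecting the scaled data onto the loadings and dividing each component's sum of
#squared scores by the total sum of squares should match the pve vector above.
scaled.arrests <- scale(USArrests)
scores <- scaled.arrests %*% pr.out$rotation
pve.direct <- colSums(scores^2) / sum(scaled.arrests^2)
pve.direct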
#12.9
Part (a) is done in the code below.
The three-cluster assignment for part (b) can be seen in the cut3 output.
Part (c) is done in the code below, using the scaled data.
For part (d), what I can tell from the two dendrograms is that scaling has a significant effect on the heights at which observations merge. This makes sense, because scaling each variable to standard deviation one also shrinks the distances between observations. In the unscaled tree the branches form clusters at relatively low heights, while the scaled tree groups observations more gradually as the height increases. Since the variables are measured on very different scales (Assault is in the hundreds while Murder and Rape are much smaller), I think the variables probably should be scaled before the dissimilarities are computed.
#a
data.dist1 <- dist(USArrests)
hclust1 <- hclust(data.dist1)
plot(hclust1)
#b
cut3 <- cutree(hclust1, 3)
table(cut3)
## cut3
## 1 2 3
## 16 14 20
cut3
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
#c
sd.data <- scale(USArrests)
data.dist2 <- dist(sd.data)
hclust2 <- hclust(data.dist2)
plot(hclust2)
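#To make the part (d) comparison concrete, a small check (assuming hclust1 and
#hclust2 from above): cross-tabulate the three-cluster cuts of the two trees.
table(unscaled = cutree(hclust1, 3), scaled = cutree(hclust2, 3))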
#d: see the written discussion of scaling above
In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data. (a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.
Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes.
Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?
With K = 3, the K-means clustering recovers the true classes exactly: each cluster contains the 20 observations from one of the three classes I generated.
With K = 2, the method obviously forces the observations into two clusters. It puts my first 40 observations (original classes 1 and 2) in one group and my last 20 observations (class 3) in the other, which leads me to believe my middle class is more similar to my first class than to my third class.
K = 4 gave interesting results. My first 20 observations, which were all one class in the original data frame, are split between two clusters, 8 in one and 12 in the other, while my second and third classes are unchanged and each stay together in their own cluster.
Running K-means with K = 3 on the first two principal component score vectors again gives three clusters that match exactly how I created the classes.
Scaling does not change anything here: the K = 3 clustering on the scaled data assigns exactly the same three groups as before. To be honest, I am not sure how best to compare this to the PCA from part (b), but the PCA plot also shows three distinct groups, with PC1 ranging roughly from -40 to 40 and PC2 from roughly -15 to 15.
#a
set.seed(1)
a <- matrix(rnorm(20*50, 5, 5), ncol=50)
b <- matrix(rnorm(20*50, 10, 5), ncol=50)
c <- matrix(rnorm(20*50, 15, 5), ncol=50)
r <- rbind(a,b,c)
h <- cbind(r, class = as.factor(c(rep(1,20), rep(2,20), rep(3,20) )))
final <- data.frame(h)
#b
pca.imag <- prcomp(final)
ggplot(data.frame(pc1 = pca.imag$x[,1], pc2 = pca.imag$x[,2], class= final$class), aes(pc1, pc2, col= class)) +geom_point()
#c
sd.final <- scale(final)
km.final <- kmeans(sd.final, 3, nstart = 20)
km.clusters <- km.final$cluster
table(km.clusters, final$class)
##
## km.clusters 1 2 3
## 1 0 20 0
## 2 0 0 20
## 3 20 0 0
#d
km.2 <- kmeans(sd.final, 2, nstart= 20)
km.clus2 <- km.2$cluster
table(km.clus2, final$class)
##
## km.clus2 1 2 3
## 1 0 0 20
## 2 20 20 0
km.clus2
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [39] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#e
km.4 <- kmeans(sd.final, 4, nstart= 20)
km.clus4 <- km.4$cluster
table(km.clus4, final$class)
##
## km.clus4 1 2 3
## 1 8 0 0
## 2 12 0 0
## 3 0 0 20
## 4 0 20 0
km.clus4
## [1] 1 2 1 2 2 1 2 1 2 1 2 1 2 2 1 1 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [39] 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#f
km.pca <- kmeans(pca.imag$x[,1:2], 3, nstart= 20)
km.pca.clus <- km.pca$cluster
table(km.pca.clus, final$class)
##
## km.pca.clus 1 2 3
## 1 0 20 0
## 2 20 0 0
## 3 0 0 20
km.pca.clus
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#g
fin.scale <- scale(final)
km.scaled <- kmeans(fin.scale, 3, nstart= 20)
scale.clus <- km.scaled$cluster
scale.clus
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
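One caveat on the code above: the data frame final still contains the class column, so the label itself is one of the variables passed to prcomp() and kmeans(). As a sketch only (feat.only and km.feat are new names introduced here), the same K = 3 fit can be run on just the 50 simulated variables to confirm the clusters are driven by the features rather than the label:
#K = 3 K-means on the 50 simulated variables only, leaving out the class label
feat.only <- scale(final[, setdiff(names(final), "class")])
km.feat <- kmeans(feat.only, centers = 3, nstart = 20)
table(km.feat$cluster, final$class)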
6. For my project I decided to work with an airline passenger satisfaction data set that I found on Kaggle. I am working alone because I haven't really looked for a group, but I also have not heard of many people looking for groups. I don't have a problem doing the project either way, but if you happen to find someone who needs a group I would be happy to join. My basic vision for the project is to use the various classification techniques we have learned to classify whether or not a passenger was satisfied with their flight.
I started by creating dummy variables for the character variables so they would be easier to use in quantitative analysis. I made satisfaction a dummy variable where people who marked that they were satisfied got a 1 and people who marked unsatisfied/neutral got a 0. I think it is interesting that this data came with a combined unsatisfied/neutral category; grouping these two together could lead to some issues, since there could be a large difference between unsatisfied and neutral, so I will keep that in mind as I go further into my analysis. Next, I removed the character variables that I had replaced with dummies. I also took the log of flight distance because it looked to be roughly exponentially distributed. To describe most of the remaining variables: many of them are satisfaction ratings on a 1-5 scale, with 1 being unsatisfied and 5 being satisfied. Some of these had values of zero, which I treated as NAs and removed from my data.
For exploratory analysis, I did some simple summaries at the beginning and then tried my hand at clustering, which failed miserably at first. My correlation plot showed that none of the variables look to be too correlated besides class and type of travel, which makes sense because if you travel for business I am guessing you are more likely to have an upgraded seat. I also looked at the distribution of my response variable, and there is a decent-sized gap between the number of unsatisfied/neutral passengers (68,330) and satisfied passengers (50,874). I am not sure how this imbalance will affect my analysis, but it is a good thing to note.
For the clustering I had a slight misunderstanding at first. I ran K-means with 5 clusters and couldn't get much from them, so I tried many more clusters, which was not very interpretable either; 3 clusters made the results more visually satisfying and understandable (the cluster plots below are drawn on the first two principal components). I didn't do hierarchical clustering because my data set has about 120,000 rows, so I figured that plot would be a mess too.
My basic understanding of this data from my preliminary exploration is that I am not too worried about covariation. I made a few stacked bar plots that I think explain my data in an easy-to-understand way: it seems like, in general, the 1-5 ratings of specific services provided by the airline have a positive relationship with the likelihood of being satisfied overall. I will obviously have to use statistical tools rather than just visualization in the future, but I think my visualizations show that fairly well.
library(readr)
library(dplyr)
library(ggplot2)
df1 <- read_csv("~/Documents/Machine Learning/Project/test.csv")
## New names:
## * `` -> ...1
## Rows: 25976 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Customer Type, Type of Travel, Class, satisfaction
## dbl (20): ...1, id, Age, Flight Distance, Inflight wifi service, Departure/A...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2 <- read_csv("~/Documents/Machine Learning/Project/train.csv")
## New names:
## * `` -> ...1
## Rows: 103904 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Gender, Customer Type, Type of Travel, Class, satisfaction
## dbl (20): ...1, id, Age, Flight Distance, Inflight wifi service, Departure/A...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airline <- rbind(df1, df2)
airline <- na.omit(airline)
airline.clean <- airline %>%
select(-...1) %>%
arrange(id)
# Creating dummy variables that are easier to understand
airline.clean$personal <- ifelse(airline.clean$`Type of Travel` == 'Personal Travel', 1, 0)
airline.clean$loyal <- ifelse(airline.clean$`Customer Type` == 'Loyal Customer', 1, 0)
airline.clean$satisfied <- ifelse(airline.clean$satisfaction == 'satisfied', 1, 0)
airline.clean$male <- ifelse(airline.clean$Gender == 'Male', 1, 0)
#removing variables that I created dummy variables for
airline.clean1 <- select(airline.clean, -c(Gender, satisfaction, `Customer Type`, `Type of Travel`))
#noticing flight distance is not normally distributed so I am taking the log of flight distance to get a more normal distribution
airline.clean1$logDist <- log(airline.clean1$`Flight Distance`)
airline.clean1 <- select(airline.clean1, -`Flight Distance`)
#deciding to split up the class, I am going to split it into a dummy variable of 1 being economy and 2 being business and economy plus because economy is not a paid upgrade while the other two are
airline.clean1$UpgradedClass <- ifelse(airline.clean$Class == 'Eco', 0 ,1)
airline.clean2 <- select(airline.clean1, -c(Class))
#renaming columns so they are easier to work with
airline.clean2 <- airline.clean2 %>%
rename(wifi = `Inflight wifi service`,
time.convenient = `Departure/Arrival time convenient`,
ease.booking = `Ease of Online booking`,
food.drink = `Food and drink`,
online.boardingn = `Online boarding`,
seat.comfort = `Seat comfort`,
flight.entertainment = `Inflight entertainment`,
onboard.service = `On-board service`,
legroom = `Leg room service`,
baggage.handling = `Baggage handling`,
checkin.service = `Checkin service`,
inflight.service = `Inflight service`,
departure.delay.min = `Departure Delay in Minutes`,
arrival.delay.min = `Arrival Delay in Minutes`)
airline.clean2$gate.location <- airline.clean2$`Gate location`
airline.clean2$`Gate location` <- NULL
#removing zeros from certain columns because those are technically NAs
airline.clean3 <- airline.clean2 %>%
filter(wifi != 0, time.convenient != 0, ease.booking != 0, food.drink !=0, online.boardingn !=0, seat.comfort != 0, flight.entertainment != 0, onboard.service!= 0, legroom != 0, baggage.handling != 0, checkin.service != 0, inflight.service != 0, Cleanliness != 0)
#these character columns were already dropped above; any_of() keeps this step from erroring if they are absent
airline.clean3 <- airline.clean3 %>%
select(-any_of(c("Customer Type", "Gender", "Type of Travel", "satisfaction")))
#Going to attempt to do some Exploratory analysis
names(airline.clean3)
## [1] "id" "Age" "wifi"
## [4] "time.convenient" "ease.booking" "food.drink"
## [7] "online.boardingn" "seat.comfort" "flight.entertainment"
## [10] "onboard.service" "legroom" "baggage.handling"
## [13] "checkin.service" "inflight.service" "Cleanliness"
## [16] "departure.delay.min" "arrival.delay.min" "personal"
## [19] "loyal" "satisfied" "male"
## [22] "UpgradedClass" "gate.location"
dim(airline.clean3)
## [1] 119204 23
summary(airline.clean3)
## id Age wifi time.convenient
## Min. : 1 Min. : 7.00 Min. :1.000 Min. :1.000
## 1st Qu.: 33037 1st Qu.:28.00 1st Qu.:2.000 1st Qu.:2.000
## Median : 65618 Median :40.00 Median :3.000 Median :3.000
## Mean : 65315 Mean :39.86 Mean :2.818 Mean :3.207
## 3rd Qu.: 97643 3rd Qu.:51.00 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :129880 Max. :85.00 Max. :5.000 Max. :5.000
## ease.booking food.drink online.boardingn seat.comfort
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :4.000 Median :4.000
## Mean :2.879 Mean :3.214 Mean :3.331 Mean :3.457
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## flight.entertainment onboard.service legroom baggage.handling
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000 Median :4.000
## Mean :3.381 Mean :3.386 Mean :3.381 Mean :3.637
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## checkin.service inflight.service Cleanliness departure.delay.min
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 0.00
## Median :3.000 Median :4.000 Median :3.000 Median : 0.00
## Mean :3.295 Mean :3.647 Mean :3.294 Mean : 14.84
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.: 13.00
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :1592.00
## arrival.delay.min personal loyal satisfied
## Min. : 0.00 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median : 0.00 Median :0.0000 Median :1.0000 Median :0.0000
## Mean : 15.28 Mean :0.3084 Mean :0.8391 Mean :0.4268
## 3rd Qu.: 13.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1584.00 Max. :1.0000 Max. :1.0000 Max. :1.0000
## male UpgradedClass gate.location
## Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.0000 Median :1.0000 Median :3.000
## Mean :0.4932 Mean :0.5599 Mean :2.987
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000
## Max. :1.0000 Max. :1.0000 Max. :5.000
library(corrplot)
## corrplot 0.92 loaded
##
## Attaching package: 'corrplot'
## The following object is masked from 'package:pls':
##
## corrplot
cor <- cor(airline.clean3)
corrplot(cor, type='upper', order = 'hclust')
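#As a rough companion to the correlation plot (a sketch; cor.pairs is a new name and
#the 0.5 cutoff is an arbitrary choice), list the pairs with the strongest correlations.
cor.pairs <- as.data.frame(as.table(cor))
cor.pairs <- subset(cor.pairs, as.character(Var1) < as.character(Var2) & abs(Freq) > 0.5)
cor.pairs[order(-abs(cor.pairs$Freq)), ]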
hist(airline.clean3$satisfied)
table(airline.clean3$satisfied)
##
## 0 1
## 68330 50874
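#The same gap expressed as proportions: roughly 57% unsatisfied/neutral vs. 43% satisfied,
#which matches the mean of the satisfied dummy (0.4268) in the summary above.
prop.table(table(airline.clean3$satisfied))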
tab1 <- table(airline.clean3$satisfied, airline.clean3$legroom)
barplot(tab1, legend= TRUE)
tab2 <- table(airline.clean3$satisfied, airline.clean3$time.convenient)
barplot(tab2, legend= TRUE)
tab3 <- table(airline.clean3$satisfied, airline.clean3$wifi)
barplot(tab3, legend= TRUE)
tab4 <- table(airline.clean3$satisfied, airline.clean3$seat.comfort)
barplot(tab4, legend= TRUE)
tab5 <- table(airline.clean3$satisfied, airline.clean3$Cleanliness)
barplot(tab5, legend= TRUE)
tab6 <- table(airline.clean3$satisfied, airline.clean3$baggage.handling)
barplot(tab6, legend= TRUE)
library(ggpubr)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
scaled <- scale(airline.clean3)
km.5.air <- kmeans(scaled, 5, nstart=25)
fviz_cluster(km.5.air,
data=airline.clean3,
palette= c('yellow', 'red', 'blue', 'green', 'black'),
geom ='point',
ellipse.type = 'convex',
ggtheme = theme_classic())
km.15.air <- kmeans(scaled, 15, nstart=25)
fviz_cluster(km.15.air,
data=airline.clean3,
palette= c('yellow', 'red', 'blue', 'green', 'black', 'purple', 'khaki', 'cyan', 'tomato', 'plum', 'chocolate', 'pink', 'brown', 'seashell', 'deepskyblue'),
geom ='point',
ellipse.type = 'convex',
ggtheme = theme_classic())
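#A less ad hoc way to pick K (a sketch, run on a random subsample to keep it fast;
#idx and wss are new names, and the 5,000-row subsample and 1:8 range are arbitrary
#choices): plot the total within-cluster sum of squares against K and look for an elbow.
set.seed(1)
idx <- sample(nrow(scaled), 5000)
wss <- sapply(1:8, function(k) kmeans(scaled[idx, ], centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = 'b', xlab = 'K', ylab = 'Total within-cluster SS')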
km.3.air <- kmeans(scaled, 3, nstart=25)
fviz_cluster(km.3.air,
data=airline.clean3,
palette= c('yellow', 'red', 'blue'),
geom ='point',
ellipse.type = 'convex',
ggtheme = theme_classic())
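If I revisit hierarchical clustering later despite the size of the data, one workable sketch (sub.rows and hc.sub are new names, and the 2,000-row subsample size is an arbitrary choice) would be to cluster a random subsample of the scaled data:
#hierarchical clustering on a random subsample of the scaled airline data
set.seed(1)
sub.rows <- sample(nrow(scaled), 2000)
hc.sub <- hclust(dist(scaled[sub.rows, ]), method = 'complete')
plot(hc.sub, labels = FALSE, main = 'Complete linkage on a 2,000-row subsample')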