6.9 (e) Fit a PCR model on the training set, with M chosen by cross- validation. Report the test error obtained, along with the value of M selected by cross-validation.

  1. Fit a PLS model on the training set, with M chosen by cross- validation. Report the test error obtained, along with the value of M selected by cross-validation.

  2. It seems like the accuracy from my pls fit was better than pcr. Both methods performed better than the Lasso and regular OLS used in previous exercises. It seems like pls is the best fit model for all the listed models.

library(tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ dplyr   1.0.7
## ✓ tibble  3.1.2     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ISLR2)
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)
College <- na.omit(College)
sam.college <- sample(1:nrow(College), nrow(College)*.8)
x <- model.matrix(Apps~., data=College)[, -1]
y <- College$Apps
train.col <- College[sam.college,]
test.col <- College[-sam.college,]
y.test <- test.col$Apps

#PCR

pcr <- pcr(Apps ~ ., data=train.col, scale =TRUE, validation = 'CV')


validationplot(pcr, val.type = 'MSEP')

predict.pcr <- predict(pcr, test.col, ncomp=15)

acc.pcr <- mean((predict.pcr-test.col$Apps)^2)
acc.pcr
## [1] 1676837
#PLS
set.seed(1)

pls <- plsr(Apps~., data = train.col, scale= TRUE, validation= 'CV')

validationplot(pls, val.type='MSEP')

pred.pls <- predict(pls, test.col, ncomp = 5)

acc.pls <- mean((pred.pls - test.col$Apps)^2)
acc.pls
## [1] 1588895

12.8

In Section 12.2.3, a formula for calculating PVE was given in Equa- tion 12.10. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate PVE in two ways: (a) Using the sdev output of the prcomp() function, as was done in Section 12.2.3.

I got 62% of variance explaing by first PCA, 24.7% explained by second, 89.1% explaing by third, and 43.3% explained by fourth.

  1. By applying Equation 12.10 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 12.10 to obtain the PVE.
library(dplyr)
library(tidyverse)
set.seed(1)

#part a
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
states <- row.names(USArrests)

pr.out <- prcomp(USArrests, scale = TRUE)

pr.var <- pr.out$sdev^2

pve <- pr.var / sum(pr.var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
#part b

#12.9

  1. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.

Done in code.

  1. Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?

You can see this resulting from cut3 in output.

  1. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.

Done in code

  1. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

What I can tell from my graphs produced is that it seems like scaling has a significant effect on the height measure of clustering. This obviously makes sense because by scaling the data we are also shrinking it. For the unscaled data it seems as though the branches form clusters at lower levels while the scaled data has a more gradual grouping going up to height while compared the more quickly grouped unscaled data.

#a
data.dist1 <- dist(USArrests)

hclust1 <- hclust(data.dist1)

plot(hclust(data.dist1))

#b 
cut3 <- cutree(hclust1, 3)
table(cut3)
## cut3
##  1  2  3 
## 16 14 20
cut3
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2
#c
sd.data <- scale(USArrests)

par(mfrow = c(1,3))

data.dist2 <- dist(sd.data)

plot(hclust(data.dist2))

complete.clus <- hclust(dist(sd.data))


#d 

12.10

In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the data. (a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.

  1. Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes.

  2. Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?

My K=3 kmeans clustering has the correct observations in each of the three classes.

  1. Perform K-means clustering with K = 2. Describe your results.

With k=2, the method obvioiusly sorted my observations into two distinct clusters. This took my first 40 observations and put them in one group, then my last 20 observations were in the second group. Leads me to believe my middle group is more similar to my first group than my third group.

  1. Now perform K-means clustering with K = 4, and describe your results.

This ended up having interesting results. My first 20 observations, which were all of one class in my original data frame, were spread out between two groups. My second and third originally created classes did not change and stayed in the same class. When considering my original first assigned class, which as I mentioned is split into two groups, it is split between 8 in one group and 12 in another group which is interesting.

  1. Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.

This operation results in three clusters of which were classified exactly how I had created them.

  1. Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.

They are the same exact classes resulting from b. This clustering with scaling still correctly gets the same groups assigned at the start. To be honest I am not sure how to compare this to the PCA completed in part 2. PCA also shows three distinct groups, PC1 is between -40 and 40 while pc2 is between -15 and 15.

#a
set.seed(1)

a <- matrix(rnorm(20*50, 5, 5), ncol=50)
b <- matrix(rnorm(20*50, 10, 5), ncol=50)
c <- matrix(rnorm(20*50, 15, 5), ncol=50)
r <- rbind(a,b,c)
h <- cbind(r, class = as.factor(c(rep(1,20), rep(2,20), rep(3,20) )))
final <- data.frame(h)

#b
pca.imag <- prcomp(final)

ggplot(data.frame(pc1 = pca.imag$x[,1], pc2 = pca.imag$x[,2], class= final$class), aes(pc1, pc2, col= class)) +geom_point()

#c
sd.final <- scale(final)

km.final <- kmeans(sd.final, 3, nstart = 20)
km.clusters <- km.final$cluster
table(km.clusters, final$class)
##            
## km.clusters  1  2  3
##           1  0 20  0
##           2  0  0 20
##           3 20  0  0
#d 

km.2 <- kmeans(sd.final, 2, nstart= 20)
km.clus2 <- km.2$cluster
table(km.clus2, final$class)
##         
## km.clus2  1  2  3
##        1  0  0 20
##        2 20 20  0
km.clus2
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [39] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#e
km.4 <- kmeans(sd.final, 4, nstart= 20)
km.clus4 <- km.4$cluster
table(km.clus4, final$class)
##         
## km.clus4  1  2  3
##        1  8  0  0
##        2 12  0  0
##        3  0  0 20
##        4  0 20  0
km.clus4
##  [1] 1 2 1 2 2 1 2 1 2 1 2 1 2 2 1 1 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [39] 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#f 

km.pca <- kmeans(pca.imag$x[,1:2], 3, nstart= 20)

km.pca.clus <- km.pca$cluster
table(km.pca.clus, final$class)
##            
## km.pca.clus  1  2  3
##           1  0 20  0
##           2 20  0  0
##           3  0  0 20
km.pca.clus
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#g

fin.scale <- scale(final)

km.scaled <- kmeans(fin.scale, 3, nstart= 20)

scale.clus <- km.scaled$cluster
scale.clus
##  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
  1. The risk of using algorithmic data is that it can mess exclude groups and cause discrimination. There is a risk when you look at a person through a set of variables. When applying to a loan, financial history is a variable that is considered, but so is zip code. Banks will hold out on giving a lone based on a specific zip code which only furthers the poverty in said neighborhoods. This creates a system where people are discriminated against and keep poor areas poor. This is just one example of how a formula can lead to systematic issues of inclusion. Algorithms do not account for equity and inclusion, they have no context to situations and this can be dangerous. What I got from Oneil’s analysis is that the only way to fix this is to add our contextual knowledge to our models. We can make models so they make sure certain ethnicities or under represented groups are recognized. We need to add our human values into these systems. Another solution is to understand that models and algorithms are simplifying a vastly complex world. Just because they spit out a certain answer does not mean that it can’t be scrutanized. I think advances would be similar to the things I learned when studied natural resource economics. I forget the specific term but they would measure how a negative consequence like smog or pollution would affect the efficiency of an outcome. Creating models that can understand the unintended consequences involved in a model and interpret that in terms of efficient outcomes would be a great start.

6.For my project I decided to work with an airline passenger satisfaction data frame that I found from Kaggle. I am working alone because I haven’t really looked for a group, but also have not heard much in terms of people looking for groups. I don’t have a problem with doing this project either way, but if you happen to find someone who needs a group I would be happy to join. My basic vision for the project is to use various classification techniques we have learned in order to classify whether or not a passenger was satisfied with their flight. I started by created dummy variables for the character variables so they would be easier to do quantitative analysis on. I made satisfaction a dummy variable where people where people who marked that they were satisfied got a 1 and people who marked they were unsatisfied/neutral were given a 0. I think it is interesting that this data came with unsatisfied/neutral. Grouping these two together could lead to some issues as there could be a large difference in unsatisfied versus neutral, so I will keep that in mind as I go further into my analysis. Next, I removed the variables that I created a dummy variable for for obvious reasons. Next, I manipulated flight distance because it looked to be an exponentially distributed variable. A little description of most of my variables for them make more sense is that a lot of them are on a scale of 1-5, rating levels of satisfaction. 1 being unsatisfied and 5 being satisfied. Some of these had values of zero which I considered to be an NA, so I removed them from my data. For exploratory analysis, I did some simple stuff at the beginning and then tried my hand at PCA which failed miserably. My correlation plot showed me that none of the variables look to be too correlated besides class and type of travel. This makes sense because if you travel for business I am guessing you are more likely to have an upgraded seat. I looked at the distribution of my response variable satisfied and it seems there is a decent sized gap in the amount of people unsatisfied(68330) and satisfied(50874). I am not sure how this will affect my analysis but that is a good note to point out. I went onto PCA and had a slight misunderstanding. I did 5 clusters at first and couldn’t get much from them, so then I decided to do more clusters which was not very interpretable. Then 3 clusters made the results more visually satisfying and understandable. I didn’t do heirarchical clustering due to the size of my data set being about 120,000 so I figured that plot would be a mess too. My basic understanding of this data from my preliminary exploration is that I am not too worried about covariation. I made a few stacked histograms that I think explain my data in an easy to understand way. It seems like in general the categorical variables on satisfaction of certain services provided by the airline have a positive relationship with likelihood to become satisfied. I will have to obviously in the future use statistical tools rather than visualization but I think that my visualizaitions show that fairly well.

library(readr)
library(dplyr)
library(ggplot2)
df1 <- read_csv("~/Documents/Machine Learning/Project/test.csv")
## New names:
## * `` -> ...1
## Rows: 25976 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Gender, Customer Type, Type of Travel, Class, satisfaction
## dbl (20): ...1, id, Age, Flight Distance, Inflight wifi service, Departure/A...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2 <- read_csv("~/Documents/Machine Learning/Project/train.csv")
## New names:
## * `` -> ...1
## Rows: 103904 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Gender, Customer Type, Type of Travel, Class, satisfaction
## dbl (20): ...1, id, Age, Flight Distance, Inflight wifi service, Departure/A...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airline <- rbind(df1, df2)

airline <- na.omit(airline)

airline.clean <- airline %>%
  select(-...1) %>%
  arrange(id)


# Creating dummy variables that are easier to understand
airline.clean$personal <- ifelse(airline.clean$`Type of Travel` == 'Personal Travel', 1, 0)

airline.clean$loyal <- ifelse(airline.clean$`Customer Type` == 'Loyal Customer', 1, 0)

airline.clean$satisfied <- ifelse(airline.clean$satisfaction == 'satisfied', 1, 0)

airline.clean$male <- ifelse(airline.clean$Gender == 'Male', 1, 0)

#removing variables that I created dummy variables for
airline.clean1 <- select(airline.clean, -c(Gender, satisfaction, `Customer Type`, `Type of Travel`))


#noticing flight distance is not normally distributed so I am taking the log of flight distance to get a more normal distribution
airline.clean1$logDist <- log(airline.clean$`Flight Distance`)

airline.clean1 <- select(airline.clean, -(`Flight Distance`))

#deciding to split up the class, I am going to split it into a dummy variable of 1 being economy and  2 being business and economy plus because economy is not a paid upgrade while the other two are

airline.clean1$UpgradedClass <- ifelse(airline.clean$Class == 'Eco', 0 ,1)

airline.clean2 <- select(airline.clean1, -c(Class))

#renaming columns so they are easier to work with 

airline.clean2 <- airline.clean2 %>%
  rename(wifi = `Inflight wifi service`, 
         time.convenient = `Departure/Arrival time convenient`, 
         ease.booking = `Ease of Online booking`, 
         food.drink = `Food and drink`, 
         online.boardingn = `Online boarding`, 
         seat.comfort = `Seat comfort`, 
         flight.entertainment = `Inflight entertainment`, 
         onboard.service = `On-board service`, 
         legroom = `Leg room service`, 
         baggage.handling = `Baggage handling`, 
         checkin.service = `Checkin service`,
         inflight.service = `Inflight service`,
         departure.delay.min = `Departure Delay in Minutes`,
         arrival.delay.min = `Arrival Delay in Minutes`)
airline.clean2$gate.location <- airline.clean2$`Gate location`

airline.clean2$`Gate location` <- NULL


#removing zeros from certain columns because those are technically NAs

airline.clean3 <- airline.clean2 %>%
  filter(wifi != 0, time.convenient != 0, ease.booking != 0, food.drink !=0, online.boardingn !=0, seat.comfort != 0, flight.entertainment != 0, onboard.service!= 0, legroom != 0, baggage.handling != 0, checkin.service != 0, inflight.service != 0, Cleanliness != 0)

airline.clean3 <- airline.clean3 %>%
  select(-c(`Customer Type`, Gender, `Type of Travel`, satisfaction))



#Going to attempt to do some Exploratory analysis
names(airline.clean3)
##  [1] "id"                   "Age"                  "wifi"                
##  [4] "time.convenient"      "ease.booking"         "food.drink"          
##  [7] "online.boardingn"     "seat.comfort"         "flight.entertainment"
## [10] "onboard.service"      "legroom"              "baggage.handling"    
## [13] "checkin.service"      "inflight.service"     "Cleanliness"         
## [16] "departure.delay.min"  "arrival.delay.min"    "personal"            
## [19] "loyal"                "satisfied"            "male"                
## [22] "UpgradedClass"        "gate.location"
dim(airline.clean3)
## [1] 119204     23
summary(airline.clean3)
##        id              Age             wifi       time.convenient
##  Min.   :     1   Min.   : 7.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 33037   1st Qu.:28.00   1st Qu.:2.000   1st Qu.:2.000  
##  Median : 65618   Median :40.00   Median :3.000   Median :3.000  
##  Mean   : 65315   Mean   :39.86   Mean   :2.818   Mean   :3.207  
##  3rd Qu.: 97643   3rd Qu.:51.00   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :129880   Max.   :85.00   Max.   :5.000   Max.   :5.000  
##   ease.booking     food.drink    online.boardingn  seat.comfort  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000    Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000    1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :4.000    Median :4.000  
##  Mean   :2.879   Mean   :3.214   Mean   :3.331    Mean   :3.457  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000    3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000    Max.   :5.000  
##  flight.entertainment onboard.service    legroom      baggage.handling
##  Min.   :1.000        Min.   :1.000   Min.   :1.000   Min.   :1.000   
##  1st Qu.:2.000        1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000   
##  Median :4.000        Median :4.000   Median :4.000   Median :4.000   
##  Mean   :3.381        Mean   :3.386   Mean   :3.381   Mean   :3.637   
##  3rd Qu.:4.000        3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:5.000   
##  Max.   :5.000        Max.   :5.000   Max.   :5.000   Max.   :5.000   
##  checkin.service inflight.service  Cleanliness    departure.delay.min
##  Min.   :1.000   Min.   :1.000    Min.   :1.000   Min.   :   0.00    
##  1st Qu.:2.000   1st Qu.:3.000    1st Qu.:2.000   1st Qu.:   0.00    
##  Median :3.000   Median :4.000    Median :3.000   Median :   0.00    
##  Mean   :3.295   Mean   :3.647    Mean   :3.294   Mean   :  14.84    
##  3rd Qu.:4.000   3rd Qu.:5.000    3rd Qu.:4.000   3rd Qu.:  13.00    
##  Max.   :5.000   Max.   :5.000    Max.   :5.000   Max.   :1592.00    
##  arrival.delay.min    personal          loyal          satisfied     
##  Min.   :   0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:   0.00   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :   0.00   Median :0.0000   Median :1.0000   Median :0.0000  
##  Mean   :  15.28   Mean   :0.3084   Mean   :0.8391   Mean   :0.4268  
##  3rd Qu.:  13.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1584.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       male        UpgradedClass    gate.location  
##  Min.   :0.0000   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.0000   Median :1.0000   Median :3.000  
##  Mean   :0.4932   Mean   :0.5599   Mean   :2.987  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000
library(corrplot)
## corrplot 0.92 loaded
## 
## Attaching package: 'corrplot'
## The following object is masked from 'package:pls':
## 
##     corrplot
cor <- cor(airline.clean3)

corrplot(cor, type='upper', order = 'hclust')

hist(airline.clean3$satisfied)

table(airline.clean3$satisfied)
## 
##     0     1 
## 68330 50874
tab1 <- table(airline.clean3$satisfied, airline.clean3$legroom)
barplot(tab1, legend= TRUE)

tab2 <- table(airline.clean3$satisfied, airline.clean3$time.convenient)
barplot(tab2, legend= TRUE)

tab3 <- table(airline.clean3$satisfied, airline.clean3$wifi)
barplot(tab3, legend= TRUE)

tab4 <- table(airline.clean3$satisfied, airline.clean3$seat.comfort)
barplot(tab4, legend= TRUE)

tab5 <- table(airline.clean3$satisfied, airline.clean3$Cleanliness)
barplot(tab5, legend= TRUE)

tab6 <- table(airline.clean3$satisfied, airline.clean3$baggage.handling)
barplot(tab6, legend= TRUE)

library(ggpubr)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
scaled <- scale(airline.clean3)
km.5.air <- kmeans(scaled, 5, nstart=25)

fviz_cluster(km.5.air, 
             data=airline.clean3, 
             palette= c('yellow', 'red', 'blue', 'green', 'black'),
             geom ='point',
             ellipse.type = 'convex',
             ggtheme = theme_classic())

km.15.air <- kmeans(scaled, 15, nstart=25)

fviz_cluster(km.15.air, 
             data=airline.clean3, 
             palette= c('yellow', 'red', 'blue', 'green', 'black', 'purple', 'khaki', 'cyan', 'tomato', 'plum', 'chocolate', 'pink', 'brown', 'seashell', 'deepskyblue'),
             geom ='point',
             ellipse.type = 'convex',
             ggtheme = theme_classic())

km.3.air <- kmeans(scaled, 3, nstart=25)

fviz_cluster(km.3.air, 
             data=airline.clean3, 
             palette= c('yellow', 'red', 'blue', 'green', 'black', 'purple', 'khaki', 'cyan', 'tomato', 'plum', 'chocolate', 'pink', 'brown', 'seashell', 'deepskyblue'),
             geom ='point',
             ellipse.type = 'convex',
             ggtheme = theme_classic())