I subset my variables to the 1-5 satisfaction-rating variables plus dummy variables for personal versus business travel, gender, and loyal versus disloyal customers. I picked these so that all of my predictors sit on comparable, roughly normalized scales, which keeps the analysis clean. I had previously included age, departure delay, arrival delay, and flight distance, and the prediction accuracy stayed almost exactly the same, so I am confident that leaving those variables out did not hurt my models' accuracy. Every variable now ranges over either 1-5 or 0-1.
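For reference, here is a minimal sketch of how this subset could be built with dplyr. The raw column names used in the mutate() step (Type.of.Travel, Customer.Type, Gender, Class, satisfaction) are assumptions, since the cleaning code is not shown here; the selected column names come from the printed data set below.
library(dplyr)
airline.clean <- airline %>%
  mutate(personal = ifelse(Type.of.Travel == "Personal Travel", 1, 0),  # assumed raw labels
         loyal = ifelse(Customer.Type == "Loyal Customer", 1, 0),
         male = ifelse(Gender == "Male", 1, 0),
         UpgradedClass = ifelse(Class == "Business", 1, 0),
         satisfied = ifelse(satisfaction == "satisfied", 1, 0)) %>%
  dplyr::select(wifi, time.convenient, ease.booking, food.drink, online.boardingn,
                seat.comfort, flight.entertainment, onboard.service, legroom,
                baggage.handling, checkin.service, inflight.service, Cleanliness,
                gate.location, personal, loyal, satisfied, male, UpgradedClass)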
For classification I decided to use the methods from Chapter 4. I did not find it necessary to use any subset or bootstrap methods because I have around 120,000 observations, so my test set alone is larger than many entire data sets. I am also not using any non-linear models because my n is much larger than my p, so the linear methods should have low variance. The methods I used were logistic regression, linear discriminant analysis, quadratic discriminant analysis, and naive Bayes. Logistic regression at first appeared to do a terrible job of predicting airline passenger satisfaction, with only about a 10% accuracy rate, but that figure was an artifact of comparing the character labels 'satisfied'/'unsatisfied' against the 0/1 coding of the response; with the predictions coded 0/1 to match, its confusion matrix below gives roughly a 90% accuracy rate. LDA had an 89% accuracy rate, QDA 87%, and naive Bayes about 88%. I think I successfully used these models to predict airline satisfaction with great accuracy, and I am wondering what else I can say for the analysis, if you could help with that.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.2 ✓ stringr 1.4.0
## ✓ tidyr 1.1.3 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(ISLR2)
library(knitr)
set.seed(1)
train.idx <- sample(1:nrow(airline.clean), 0.8 * nrow(airline.clean))  # 80/20 split; avoid the name `sample`, which masks base::sample()
train <- airline.clean[train.idx, ]
test <- airline.clean[-train.idx, ]
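As a quick sanity check on the split (these counts follow from the 119,204 rows printed later):
nrow(train)  # 95363 rows, 80% of the data
nrow(test)   # 23841 rows, the denominator in the accuracies below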
logistic <- glm(satisfied~., data=train, family = binomial)
logistic.probs <- predict(logistic, test, type = "response")
logistic.pred <- rep(0, nrow(test))       # code predictions as 0/1 to match test$satisfied
logistic.pred[logistic.probs > .5] <- 1
table(logistic.pred, test$satisfied)
##
## logistic.pred     0     1
##             0 12448  1191
##             1  1259  8943
logistic.acc <- mean(logistic.pred == test$satisfied)
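Accuracy can also be read straight off a confusion matrix as the proportion of observations on the diagonal; this helper is just a small convenience sketch:
conf.acc <- function(tab) sum(diag(tab)) / sum(tab)   # correct predictions / total
conf.acc(table(logistic.pred, test$satisfied))        # (12448 + 8943) / 23841, about 0.90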
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
## The following object is masked from 'package:dplyr':
##
## select
lda <- lda(satisfied~., data= train)
lda.pred <- predict(lda, test)
lda.class <- lda.pred$class
lda.table <- table(lda.class, test$satisfied)
kable(lda.table, caption = "LDA Confusion Matrix" )
Table: LDA Confusion Matrix

|   | 0     | 1    |
|---|-------|------|
| 0 | 12307 | 1203 |
| 1 | 1400  | 8931 |
lda.acc <- mean(lda.class == test$satisfied)
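Beyond overall accuracy, a confusion matrix also gives sensitivity and specificity, which show where each model makes its mistakes. Treating 1 (satisfied) as the positive class, with predictions on the rows:
lda.sens <- lda.table["1", "1"] / sum(lda.table[, "1"])  # 8931 / 10134, about 0.88
lda.spec <- lda.table["0", "0"] / sum(lda.table[, "0"])  # 12307 / 13707, about 0.90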
#QDA
qda <- qda(satisfied~., data=train)
qda.pred <- predict(qda, test)
qda.class <- qda.pred$class
table(qda.class, test$satisfied)
##
## qda.class 0 1
## 0 12170 1575
## 1 1537 8559
qda.acc <- mean(qda.class == test$satisfied)
library(e1071)
nb <- naiveBayes(satisfied ~., data = train)
nb.class <- predict(nb, test)
table(nb.class, test$satisfied)
##
## nb.class 0 1
## 0 12617 1661
## 1 1090 8473
nb.acc <- mean( nb.class == test$satisfied)
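naiveBayes() can also return posterior class probabilities with type = "raw", which would let me move the decision threshold away from 0.5 if false positives and false negatives had different costs:
nb.probs <- predict(nb, test, type = "raw")  # one column of probabilities per class (0 and 1)
head(nb.probs)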
Accuracy.df <- data.frame(Model = c('Logistic Regression',
'Linear Discriminant Analysis',
'Quadratic Discriminant Analysis',
'Naive Bayes'),
Accuracy = c(logistic.acc, lda.acc, qda.acc, nb.acc))
ggplot(aes(x = Model, y = Accuracy), data = Accuracy.df) +
  geom_col(fill = c('LightBlue', 'Red', 'Yellow', 'Green')) +
  ggtitle('Comparison of Accuracies of Machine Learning Methods Used') +
  geom_text(aes(label = round(Accuracy, 2)), vjust = -0.3)
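One direction for further analysis: a single 80/20 split gives only one estimate of test accuracy, so k-fold cross-validation would give a more stable comparison between the models. A minimal 5-fold sketch for LDA:
k <- 5
folds <- sample(rep(1:k, length.out = nrow(airline.clean)))   # random fold labels
cv.acc <- sapply(1:k, function(i) {
  fit <- lda(satisfied ~ ., data = airline.clean[folds != i, ])
  pred <- predict(fit, airline.clean[folds == i, ])$class
  mean(pred == airline.clean$satisfied[folds == i])
})
mean(cv.acc)  # cross-validated accuracy estimate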
airline.clean
## # A tibble: 119,204 x 19
## wifi time.convenient ease.booking food.drink online.boardingn seat.comfort
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 3 3 5 3 5
## 2 2 2 2 3 5 4
## 3 4 4 4 5 5 5
## 4 2 2 2 4 4 5
## 5 3 3 3 4 5 4
## 6 4 4 4 3 5 4
## 7 3 3 3 5 4 5
## 8 4 3 4 4 4 4
## 9 4 1 1 3 2 3
## 10 2 2 5 2 5 4
## # … with 119,194 more rows, and 13 more variables: flight.entertainment <dbl>,
## # onboard.service <dbl>, legroom <dbl>, baggage.handling <dbl>,
## # checkin.service <dbl>, inflight.service <dbl>, Cleanliness <dbl>,
## # personal <dbl>, loyal <dbl>, satisfied <dbl>, male <dbl>,
## # UpgradedClass <dbl>, gate.location <dbl>
One measure of distance between points is geometric distance. This takes observations as points in a plane (or higher-dimensional space) and groups them based on how close they sit to one another, which shows how alike or unalike observations are and is important when clustering. There are many measures of geometric distance, such as Euclidean and Manhattan distance. On the other hand, there is correlation distance. Two observations whose profiles are perfectly correlated can still be far apart geometrically if their magnitudes differ, yet their correlation distance is zero. Correlation distance assesses how similar observations are regardless of magnitude: it compares the profiles of observations rather than their geometric locations. Pearson correlation distance is an example, defined from the strength of the linear relationship between two profiles (one minus their Pearson correlation).
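A small illustration of the difference, using two hypothetical observations whose profiles are perfectly correlated but whose magnitudes differ:
x <- c(1, 2, 3, 4, 5)
y <- c(10, 20, 30, 40, 50)   # y = 10 * x, so cor(x, y) = 1
dist(rbind(x, y))            # Euclidean distance is large (about 66.7)
1 - cor(x, y)                # Pearson correlation distance is 0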