I subset my variables to the 1-5 satisfaction-rating variables plus dummy variables for personal versus business travel, gender, and loyal versus disloyal customers. I picked these so that all of my predictors sit on comparable, roughly normalized scales, which keeps the analysis clean. I had previously included age, departure delay, arrival delay, and flight distance, and the prediction accuracy stayed almost exactly the same, so I am confident that leaving those variables out did not hurt my models' accuracy. Every variable now ranges over either 1-5 or 0-1.
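For reference, here is a minimal sketch of how this subset could be built with dplyr. The raw column names used in the mutate() step (Type.of.Travel, Customer.Type, Gender, Class, satisfaction) are assumptions, since the cleaning code is not shown here; the selected column names come from the printed data set below.
library(dplyr)
airline.clean <- airline %>%
  mutate(personal = ifelse(Type.of.Travel == "Personal Travel", 1, 0),  # assumed raw labels
         loyal = ifelse(Customer.Type == "Loyal Customer", 1, 0),
         male = ifelse(Gender == "Male", 1, 0),
         UpgradedClass = ifelse(Class == "Business", 1, 0),
         satisfied = ifelse(satisfaction == "satisfied", 1, 0)) %>%
  dplyr::select(wifi, time.convenient, ease.booking, food.drink, online.boardingn,
                seat.comfort, flight.entertainment, onboard.service, legroom,
                baggage.handling, checkin.service, inflight.service, Cleanliness,
                gate.location, personal, loyal, satisfied, male, UpgradedClass)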
For classification I decided to use the methods from Chapter 4. I did not find it necessary to use any subset or bootstrap methods because I have around 120,000 observations, so my test set alone is larger than many entire data sets. I am also not using any non-linear models because my n is much larger than my p, so the linear methods should have low variance. The methods I used were logistic regression, linear discriminant analysis, quadratic discriminant analysis, and naive Bayes. Logistic regression at first appeared to do a terrible job of predicting airline passenger satisfaction, with only about a 10% accuracy rate, but that figure was an artifact of comparing the character labels 'satisfied'/'unsatisfied' against the 0/1 coding of the response; with the predictions coded 0/1 to match, its confusion matrix below gives roughly a 90% accuracy rate. LDA had an 89% accuracy rate, QDA 87%, and naive Bayes about 88%. I think I successfully used these models to predict airline satisfaction with great accuracy, and I am wondering what else I can say for the analysis, if you could help with that.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.2 ✓ stringr 1.4.0
## ✓ tidyr 1.1.3 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(ISLR2)
library(knitr)
set.seed(1)
train.idx <- sample(1:nrow(airline.clean), 0.8 * nrow(airline.clean))  # 80/20 split; avoid the name `sample`, which masks base::sample()
train <- airline.clean[train.idx, ]
test <- airline.clean[-train.idx, ]
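As a quick sanity check on the split (these counts follow from the 119,204 rows printed later):
nrow(train)  # 95363 rows, 80% of the data
nrow(test)   # 23841 rows, the denominator in the accuracies below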
logistic <- glm(satisfied~., data=train, family = binomial)
logistic.probs <- predict(logistic, test, type = "response")
logistic.pred <- rep(0, nrow(test))       # code predictions as 0/1 to match test$satisfied
logistic.pred[logistic.probs > .5] <- 1
table(logistic.pred, test$satisfied)
##
## logistic.pred     0     1
##             0 12448  1191
##             1  1259  8943
logistic.acc <- mean(logistic.pred == test$satisfied)
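Accuracy can also be read straight off a confusion matrix as the proportion of observations on the diagonal; this helper is just a small convenience sketch:
conf.acc <- function(tab) sum(diag(tab)) / sum(tab)   # correct predictions / total
conf.acc(table(logistic.pred, test$satisfied))        # (12448 + 8943) / 23841, about 0.90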
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
## The following object is masked from 'package:dplyr':
##
## select
lda <- lda(satisfied~., data= train)
lda.pred <- predict(lda, test)
lda.class <- lda.pred$class
lda.table <- table(lda.class, test$satisfied)
kable(lda.table, caption = "LDA Confusion Matrix" )
Table: LDA Confusion Matrix

|   | 0     | 1    |
|---|-------|------|
| 0 | 12307 | 1203 |
| 1 | 1400  | 8931 |
lda.acc <- mean(lda.class == test$satisfied)
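Beyond overall accuracy, a confusion matrix also gives sensitivity and specificity, which show where each model makes its mistakes. Treating 1 (satisfied) as the positive class, with predictions on the rows:
lda.sens <- lda.table["1", "1"] / sum(lda.table[, "1"])  # 8931 / 10134, about 0.88
lda.spec <- lda.table["0", "0"] / sum(lda.table[, "0"])  # 12307 / 13707, about 0.90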
#QDA
qda <- qda(satisfied~., data=train)
qda.pred <- predict(qda, test)
qda.class <- qda.pred$class
table(qda.class, test$satisfied)
##
## qda.class 0 1
## 0 12170 1575
## 1 1537 8559
qda.acc <- mean(qda.class == test$satisfied)
library(e1071)
nb <- naiveBayes(satisfied ~., data = train)
nb.class <- predict(nb, test)
table(nb.class, test$satisfied)
##
## nb.class 0 1
## 0 12617 1661
## 1 1090 8473
nb.acc <- mean( nb.class == test$satisfied)
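naiveBayes() can also return posterior class probabilities with type = "raw", which would let me move the decision threshold away from 0.5 if false positives and false negatives had different costs:
nb.probs <- predict(nb, test, type = "raw")  # one column of probabilities per class (0 and 1)
head(nb.probs)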
Accuracy.df <- data.frame(Model = c('Logistic Regression',
'Linear Discriminant Analysis',
'Quadratic Discriminant Analysis',
'Naive Bayes'),
Accuracy = c(logistic.acc, lda.acc, qda.acc, nb.acc))
ggplot(aes(x = Model, y = Accuracy), data = Accuracy.df) +
  geom_col(fill = c('LightBlue', 'Red', 'Yellow', 'Green')) +
  ggtitle('Comparison of Accuracies of Machine Learning Methods Used') +
  geom_text(aes(label = round(Accuracy, 2)), vjust = -0.3)
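One direction for further analysis: a single 80/20 split gives only one estimate of test accuracy, so k-fold cross-validation would give a more stable comparison between the models. A minimal 5-fold sketch for LDA:
k <- 5
folds <- sample(rep(1:k, length.out = nrow(airline.clean)))   # random fold labels
cv.acc <- sapply(1:k, function(i) {
  fit <- lda(satisfied ~ ., data = airline.clean[folds != i, ])
  pred <- predict(fit, airline.clean[folds == i, ])$class
  mean(pred == airline.clean$satisfied[folds == i])
})
mean(cv.acc)  # cross-validated accuracy estimate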
airline.clean
## # A tibble: 119,204 x 19
## wifi time.convenient ease.booking food.drink online.boardingn seat.comfort
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 3 3 5 3 5
## 2 2 2 2 3 5 4
## 3 4 4 4 5 5 5
## 4 2 2 2 4 4 5
## 5 3 3 3 4 5 4
## 6 4 4 4 3 5 4
## 7 3 3 3 5 4 5
## 8 4 3 4 4 4 4
## 9 4 1 1 3 2 3
## 10 2 2 5 2 5 4
## # … with 119,194 more rows, and 13 more variables: flight.entertainment <dbl>,
## # onboard.service <dbl>, legroom <dbl>, baggage.handling <dbl>,
## # checkin.service <dbl>, inflight.service <dbl>, Cleanliness <dbl>,
## # personal <dbl>, loyal <dbl>, satisfied <dbl>, male <dbl>,
## # UpgradedClass <dbl>, gate.location <dbl>
One measure of distance between points is geometric distance. This takes observations as points in a plane (or higher-dimensional space) and groups them based on how close they sit to one another, which shows how alike or unalike observations are and is important when clustering. There are many measures of geometric distance, such as Euclidean and Manhattan distance. On the other hand, there is correlation distance. Two observations whose profiles are perfectly correlated can still be far apart geometrically if their magnitudes differ, yet their correlation distance is zero. Correlation distance assesses how similar observations are regardless of magnitude: it compares the profiles of observations rather than their geometric locations. Pearson correlation distance is an example, defined from the strength of the linear relationship between two profiles (one minus their Pearson correlation).
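A small illustration of the difference, using two hypothetical observations whose profiles are perfectly correlated but whose magnitudes differ:
x <- c(1, 2, 3, 4, 5)
y <- c(10, 20, 30, 40, 50)   # y = 10 * x, so cor(x, y) = 1
dist(rbind(x, y))            # Euclidean distance is large (about 66.7)
1 - cor(x, y)                # Pearson correlation distance is 0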