This exercise consists of three parts: Part A, prediction using Logistic Regression; Part B, the KNN algorithm; and Part C, analysis.
We use the Airline survey data set from Kaggle.com to investigate customer satisfaction with airline service.
Read data
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
kepuasan <- read.csv("Airline_survey.csv")
head(kepuasan)
dim(kepuasan)
## [1] 129880 23
str(kepuasan)
## 'data.frame': 129880 obs. of 23 variables:
## $ satisfaction : chr "satisfied" "satisfied" "satisfied" "satisfied" ...
## $ Gender : chr "Female" "Male" "Female" "Female" ...
## $ Customer.Type : chr "Loyal Customer" "Loyal Customer" "Loyal Customer" "Loyal Customer" ...
## $ Age : int 65 47 15 60 70 30 66 10 56 22 ...
## $ Type.of.Travel : chr "Personal Travel" "Personal Travel" "Personal Travel" "Personal Travel" ...
## $ Class : chr "Eco" "Business" "Eco" "Eco" ...
## $ Flight.Distance : int 265 2464 2138 623 354 1894 227 1812 73 1556 ...
## $ Seat.comfort : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Departure.Arrival.time.convenient: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Food.and.drink : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gate.location : int 2 3 3 3 3 3 3 3 3 3 ...
## $ Inflight.wifi.service : int 2 0 2 3 4 2 2 2 5 2 ...
## $ Inflight.entertainment : int 4 2 0 4 3 0 5 0 3 0 ...
## $ Online.support : int 2 2 2 3 4 2 5 2 5 2 ...
## $ Ease.of.Online.booking : int 3 3 2 1 2 2 5 2 4 2 ...
## $ On.board.service : int 3 4 3 1 2 5 5 3 4 2 ...
## $ Leg.room.service : int 0 4 3 0 0 4 0 3 0 4 ...
## $ Baggage.handling : int 3 4 4 1 2 5 5 4 1 5 ...
## $ Checkin.service : int 5 2 4 4 4 5 5 5 5 3 ...
## $ Cleanliness : int 3 3 4 1 2 4 5 4 4 4 ...
## $ Online.boarding : int 2 2 2 3 5 2 3 2 4 2 ...
## $ Departure.Delay.in.Minutes : int 0 310 0 0 0 0 17 0 0 30 ...
## $ Arrival.Delay.in.Minutes : int 0 305 0 0 0 0 15 0 0 26 ...
Variable description: (to be updated)
Next, some data wrangling is needed to convert the data types to factor, because these variables are categorical: they are survey responses from respondents. All variables are converted to factor except Age, Flight.Distance, Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes.
names(kepuasan)
## [1] "satisfaction" "Gender"
## [3] "Customer.Type" "Age"
## [5] "Type.of.Travel" "Class"
## [7] "Flight.Distance" "Seat.comfort"
## [9] "Departure.Arrival.time.convenient" "Food.and.drink"
## [11] "Gate.location" "Inflight.wifi.service"
## [13] "Inflight.entertainment" "Online.support"
## [15] "Ease.of.Online.booking" "On.board.service"
## [17] "Leg.room.service" "Baggage.handling"
## [19] "Checkin.service" "Cleanliness"
## [21] "Online.boarding" "Departure.Delay.in.Minutes"
## [23] "Arrival.Delay.in.Minutes"
kepuasan <- kepuasan %>%
mutate(satisfaction = as.factor(satisfaction),
Gender = as.factor(Gender),
Customer.Type = as.factor(Customer.Type),
Type.of.Travel = as.factor(Type.of.Travel),
Class = as.factor(Class),
Seat.comfort = as.factor(Seat.comfort),
Departure.Arrival.time.convenient = as.factor(Departure.Arrival.time.convenient),
Food.and.drink = as.factor(Food.and.drink),
Gate.location = as.factor(Gate.location),
Inflight.wifi.service = as.factor(Inflight.wifi.service),
Inflight.entertainment = as.factor(Inflight.entertainment),
Online.support = as.factor(Online.support),
Ease.of.Online.booking = as.factor(Ease.of.Online.booking),
On.board.service = as.factor(On.board.service),
Leg.room.service = as.factor(Leg.room.service),
Baggage.handling = as.factor(Baggage.handling),
Checkin.service = as.factor(Checkin.service),
Cleanliness = as.factor(Cleanliness),
Online.boarding = as.factor(Online.boarding))
glimpse(kepuasan)
## Rows: 129,880
## Columns: 23
## $ satisfaction <fct> satisfied, satisfied, satisfied, sat…
## $ Gender <fct> Female, Male, Female, Female, Female…
## $ Customer.Type <fct> Loyal Customer, Loyal Customer, Loya…
## $ Age <int> 65, 47, 15, 60, 70, 30, 66, 10, 56, …
## $ Type.of.Travel <fct> Personal Travel, Personal Travel, Pe…
## $ Class <fct> Eco, Business, Eco, Eco, Eco, Eco, E…
## $ Flight.Distance <int> 265, 2464, 2138, 623, 354, 1894, 227…
## $ Seat.comfort <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Departure.Arrival.time.convenient <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Food.and.drink <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Gate.location <fct> 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, …
## $ Inflight.wifi.service <fct> 2, 0, 2, 3, 4, 2, 2, 2, 5, 2, 3, 2, …
## $ Inflight.entertainment <fct> 4, 2, 0, 4, 3, 0, 5, 0, 3, 0, 3, 0, …
## $ Online.support <fct> 2, 2, 2, 3, 4, 2, 5, 2, 5, 2, 3, 2, …
## $ Ease.of.Online.booking <fct> 3, 3, 2, 1, 2, 2, 5, 2, 4, 2, 3, 2, …
## $ On.board.service <fct> 3, 4, 3, 1, 2, 5, 5, 3, 4, 2, 3, 3, …
## $ Leg.room.service <fct> 0, 4, 3, 0, 0, 4, 0, 3, 0, 4, 0, 2, …
## $ Baggage.handling <fct> 3, 4, 4, 1, 2, 5, 5, 4, 1, 5, 1, 5, …
## $ Checkin.service <fct> 5, 2, 4, 4, 4, 5, 5, 5, 5, 3, 2, 2, …
## $ Cleanliness <fct> 3, 3, 4, 1, 2, 4, 5, 4, 4, 4, 3, 5, …
## $ Online.boarding <fct> 2, 2, 2, 3, 5, 2, 3, 2, 4, 2, 5, 2, …
## $ Departure.Delay.in.Minutes <int> 0, 310, 0, 0, 0, 0, 17, 0, 0, 30, 47…
## $ Arrival.Delay.in.Minutes <int> 0, 305, 0, 0, 0, 0, 15, 0, 0, 26, 48…
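The long `mutate()` call above can also be written more compactly. The following is a minimal sketch in base R, assuming a small made-up data frame `toy` in place of `kepuasan`; the names `toy`, `keep_numeric` and `to_factor` are introduced here for illustration only.

```r
# Toy stand-in for the survey data (values are illustrative only)
toy <- data.frame(
  satisfaction    = c("satisfied", "dissatisfied", "satisfied"),
  Age             = c(65L, 47L, 15L),
  Flight.Distance = c(265L, 2464L, 2138L),
  Seat.comfort    = c(0L, 3L, 5L),
  stringsAsFactors = FALSE
)

# Columns that should stay numeric; every other column becomes a factor
keep_numeric <- c("Age", "Flight.Distance")
to_factor <- setdiff(names(toy), keep_numeric)
toy[to_factor] <- lapply(toy[to_factor], as.factor)

str(toy)
```

Listing the columns to keep, rather than the columns to convert, mirrors the rule stated in the text: everything becomes a factor except the age, distance and delay columns.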
Check for missing values
anyNA(kepuasan)
## [1] TRUE
is.na(kepuasan) %>% colSums()
## satisfaction Gender
## 0 0
## Customer.Type Age
## 0 0
## Type.of.Travel Class
## 0 0
## Flight.Distance Seat.comfort
## 0 0
## Departure.Arrival.time.convenient Food.and.drink
## 0 0
## Gate.location Inflight.wifi.service
## 0 0
## Inflight.entertainment Online.support
## 0 0
## Ease.of.Online.booking On.board.service
## 0 0
## Leg.room.service Baggage.handling
## 0 0
## Checkin.service Cleanliness
## 0 0
## Online.boarding Departure.Delay.in.Minutes
## 0 0
## Arrival.Delay.in.Minutes
## 393
Fill the missing values with the mean
kepuasan$Arrival.Delay.in.Minutes[is.na(kepuasan$Arrival.Delay.in.Minutes)] <-
mean(kepuasan$Arrival.Delay.in.Minutes, na.rm = TRUE)
Confirm that no missing values remain
anyNA(kepuasan)
## [1] FALSE
is.na(kepuasan) %>% colSums()
## satisfaction Gender
## 0 0
## Customer.Type Age
## 0 0
## Type.of.Travel Class
## 0 0
## Flight.Distance Seat.comfort
## 0 0
## Departure.Arrival.time.convenient Food.and.drink
## 0 0
## Gate.location Inflight.wifi.service
## 0 0
## Inflight.entertainment Online.support
## 0 0
## Ease.of.Online.booking On.board.service
## 0 0
## Leg.room.service Baggage.handling
## 0 0
## Checkin.service Cleanliness
## 0 0
## Online.boarding Departure.Delay.in.Minutes
## 0 0
## Arrival.Delay.in.Minutes
## 0
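The imputation step can be seen in isolation on a small example. This is a minimal sketch, assuming a made-up vector `delay` in place of `kepuasan$Arrival.Delay.in.Minutes`; `n_missing` and `fill_value` are names introduced for illustration.

```r
# Illustrative delays in minutes, with two missing entries
delay <- c(0, 305, NA, 15, NA, 26)

n_missing <- sum(is.na(delay))            # how many values will be filled
fill_value <- mean(delay, na.rm = TRUE)   # mean of the observed values only
delay[is.na(delay)] <- fill_value

anyNA(delay)   # FALSE once every NA has been replaced
```

Delay data tend to be right-skewed (many zeros and a few very large values), so the median is a common alternative to the mean; swapping `mean()` for `median()` in the sketch is enough to try it.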
Check class imbalance
table(kepuasan$satisfaction)
##
## dissatisfied satisfied
## 58793 71087
prop.table(table(kepuasan$satisfaction))
##
## dissatisfied satisfied
## 0.4526717 0.5473283
The target column shows no serious class imbalance (about 45% dissatisfied versus 55% satisfied).
Split the data into train and test sets
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
# your code here
index <- sample(nrow(kepuasan), nrow(kepuasan)*0.7)
kepuasan_train <- kepuasan[index,]
kepuasan_test <- kepuasan[-index,]
nrow(kepuasan)*0.7
## [1] 90916
nrow(kepuasan_train)
## [1] 90916
nrow(kepuasan)*0.3
## [1] 38964
nrow(kepuasan_test)
## [1] 38964
Check class imbalance in the target column of the train data
table(kepuasan_train$satisfaction)
##
## dissatisfied satisfied
## 41169 49747
prop.table(table(kepuasan_train$satisfaction))
##
## dissatisfied satisfied
## 0.4528246 0.5471754
There is no class imbalance in the train data.
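The 70/30 split used above can be sketched on a toy data frame to show that the two sets never share a row. This is an illustration only: `toy`, `n_train`, `toy_train` and `toy_test` are hypothetical names, and the explicit `round()` is an addition that guards against a fractional sample size.

```r
set.seed(100)

# Toy data: 10 rows with an id column so that overlap can be checked
toy <- data.frame(id = 1:10, y = rep(c("a", "b"), times = 5))

n_train <- round(nrow(toy) * 0.7)     # 7 of the 10 rows go to train
index <- sample(nrow(toy), n_train)   # row numbers drawn without replacement
toy_train <- toy[index, ]
toy_test <- toy[-index, ]             # every row that was not drawn goes to test

# The two sets should share no rows and together cover the whole data
length(intersect(toy_train$id, toy_test$id))
nrow(toy_train) + nrow(toy_test) == nrow(toy)
```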
# model_logistic <- glm()
model_kepuasan_all <- glm(formula = satisfaction ~.,
data = kepuasan_train,
family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_kepuasan_all)
##
## Call:
## glm(formula = satisfaction ~ ., family = "binomial", data = kepuasan_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4453 -0.2996 0.0127 0.2343 3.6532
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.727e+00 8.441e+03 0.001 0.99936
## GenderMale -9.083e-01 2.625e-02 -34.604 < 2e-16 ***
## Customer.TypeLoyal Customer 2.783e+00 4.391e-02 63.382 < 2e-16 ***
## Age -4.268e-03 9.088e-04 -4.696 2.65e-06 ***
## Type.of.TravelPersonal Travel -1.366e+00 3.908e-02 -34.967 < 2e-16 ***
## ClassEco -5.384e-01 3.408e-02 -15.802 < 2e-16 ***
## ClassEco Plus -6.907e-01 5.296e-02 -13.041 < 2e-16 ***
## Flight.Distance -8.687e-05 1.352e-05 -6.426 1.31e-10 ***
## Seat.comfort1 -2.360e+01 8.822e+01 -0.267 0.78910
## Seat.comfort2 -2.398e+01 8.822e+01 -0.272 0.78572
## Seat.comfort3 -2.406e+01 8.822e+01 -0.273 0.78508
## Seat.comfort4 -2.296e+01 8.822e+01 -0.260 0.79462
## Seat.comfort5 -1.805e+01 8.822e+01 -0.205 0.83786
## Departure.Arrival.time.convenient1 5.209e-01 8.649e-02 6.023 1.72e-09 ***
## Departure.Arrival.time.convenient2 6.251e-01 8.366e-02 7.472 7.92e-14 ***
## Departure.Arrival.time.convenient3 5.981e-01 8.171e-02 7.319 2.50e-13 ***
## Departure.Arrival.time.convenient4 -4.546e-01 7.611e-02 -5.972 2.34e-09 ***
## Departure.Arrival.time.convenient5 -1.572e+00 8.267e-02 -19.014 < 2e-16 ***
## Food.and.drink1 3.452e+00 6.849e-01 5.040 4.66e-07 ***
## Food.and.drink2 3.459e+00 6.849e-01 5.050 4.42e-07 ***
## Food.and.drink3 3.863e+00 6.846e-01 5.643 1.67e-08 ***
## Food.and.drink4 3.958e+00 6.842e-01 5.784 7.29e-09 ***
## Food.and.drink5 4.147e+00 6.846e-01 6.057 1.39e-09 ***
## Gate.location1 -1.971e+01 4.400e+03 -0.004 0.99643
## Gate.location2 -1.968e+01 4.400e+03 -0.004 0.99643
## Gate.location3 -2.001e+01 4.400e+03 -0.005 0.99637
## Gate.location4 -1.974e+01 4.400e+03 -0.004 0.99642
## Gate.location5 -1.980e+01 4.400e+03 -0.004 0.99641
## Inflight.wifi.service1 -2.120e-01 1.303e+00 -0.163 0.87069
## Inflight.wifi.service2 1.443e-01 1.302e+00 0.111 0.91177
## Inflight.wifi.service3 -8.046e-02 1.302e+00 -0.062 0.95071
## Inflight.wifi.service4 -1.047e-01 1.302e+00 -0.080 0.93589
## Inflight.wifi.service5 -2.851e-01 1.302e+00 -0.219 0.82666
## Inflight.entertainment1 -3.265e+00 6.913e-01 -4.722 2.33e-06 ***
## Inflight.entertainment2 -3.190e+00 6.907e-01 -4.618 3.87e-06 ***
## Inflight.entertainment3 -3.393e+00 6.903e-01 -4.915 8.87e-07 ***
## Inflight.entertainment4 -1.617e+00 6.897e-01 -2.345 0.01905 *
## Inflight.entertainment5 -3.401e-01 6.898e-01 -0.493 0.62200
## Online.support1 2.017e+01 6.523e+03 0.003 0.99753
## Online.support2 1.977e+01 6.523e+03 0.003 0.99758
## Online.support3 1.874e+01 6.523e+03 0.003 0.99771
## Online.support4 1.952e+01 6.523e+03 0.003 0.99761
## Online.support5 2.017e+01 6.523e+03 0.003 0.99753
## Ease.of.Online.booking1 3.875e+01 1.447e+03 0.027 0.97864
## Ease.of.Online.booking2 3.959e+01 1.447e+03 0.027 0.97817
## Ease.of.Online.booking3 4.061e+01 1.447e+03 0.028 0.97761
## Ease.of.Online.booking4 4.072e+01 1.447e+03 0.028 0.97755
## Ease.of.Online.booking5 3.993e+01 1.447e+03 0.028 0.97799
## On.board.service1 -2.308e+01 3.381e+03 -0.007 0.99455
## On.board.service2 -2.291e+01 3.381e+03 -0.007 0.99459
## On.board.service3 -2.246e+01 3.381e+03 -0.007 0.99470
## On.board.service4 -2.229e+01 3.381e+03 -0.007 0.99474
## On.board.service5 -2.182e+01 3.381e+03 -0.006 0.99485
## Leg.room.service1 -2.068e+00 7.470e-01 -2.769 0.00563 **
## Leg.room.service2 -1.807e+00 7.467e-01 -2.420 0.01551 *
## Leg.room.service3 -1.982e+00 7.465e-01 -2.655 0.00794 **
## Leg.room.service4 -1.297e+00 7.465e-01 -1.737 0.08237 .
## Leg.room.service5 -1.147e+00 7.466e-01 -1.537 0.12435
## Baggage.handling2 -9.120e-02 7.235e-02 -1.260 0.20749
## Baggage.handling3 -6.285e-01 6.750e-02 -9.310 < 2e-16 ***
## Baggage.handling4 -8.843e-02 6.563e-02 -1.347 0.17784
## Baggage.handling5 4.511e-01 6.911e-02 6.527 6.69e-11 ***
## Checkin.service1 -1.232e+00 4.996e-02 -24.662 < 2e-16 ***
## Checkin.service2 -1.090e+00 4.954e-02 -21.996 < 2e-16 ***
## Checkin.service3 -6.053e-01 3.910e-02 -15.483 < 2e-16 ***
## Checkin.service4 -6.030e-01 3.877e-02 -15.553 < 2e-16 ***
## Checkin.service5 NA NA NA NA
## Cleanliness1 -3.248e-01 7.043e-02 -4.611 4.01e-06 ***
## Cleanliness2 -4.044e-01 6.439e-02 -6.281 3.36e-10 ***
## Cleanliness3 -1.178e+00 5.242e-02 -22.481 < 2e-16 ***
## Cleanliness4 -5.832e-01 4.047e-02 -14.412 < 2e-16 ***
## Cleanliness5 NA NA NA NA
## Online.boarding1 -5.116e-01 6.922e-02 -7.392 1.45e-13 ***
## Online.boarding2 -6.438e-01 6.690e-02 -9.623 < 2e-16 ***
## Online.boarding3 -1.679e-01 5.572e-02 -3.014 0.00258 **
## Online.boarding4 -3.611e-01 5.375e-02 -6.717 1.85e-11 ***
## Online.boarding5 NA NA NA NA
## Departure.Delay.in.Minutes 1.351e-03 1.118e-03 1.208 0.22690
## Arrival.Delay.in.Minutes -6.296e-03 1.109e-03 -5.677 1.37e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 125226 on 90915 degrees of freedom
## Residual deviance: 41996 on 90840 degrees of freedom
## AIC: 42148
##
## Number of Fisher Scoring iterations: 17
AIC for the model with all predictors = 42148
# stepwise
model_kepuasan_step <- step(object=model_kepuasan_all, direction="backward", trace=F)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_kepuasan_step)
##
## Call:
## glm(formula = satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
## Class + Flight.Distance + Seat.comfort + Departure.Arrival.time.convenient +
## Food.and.drink + Gate.location + Inflight.wifi.service +
## Inflight.entertainment + Online.support + Ease.of.Online.booking +
## On.board.service + Leg.room.service + Baggage.handling +
## Checkin.service + Cleanliness + Online.boarding + Arrival.Delay.in.Minutes,
## family = "binomial", data = kepuasan_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4422 -0.2998 0.0127 0.2345 3.6801
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.708e+00 8.441e+03 0.001 0.99937
## GenderMale -9.082e-01 2.625e-02 -34.602 < 2e-16 ***
## Customer.TypeLoyal Customer 2.783e+00 4.391e-02 63.376 < 2e-16 ***
## Age -4.251e-03 9.086e-04 -4.678 2.89e-06 ***
## Type.of.TravelPersonal Travel -1.366e+00 3.908e-02 -34.959 < 2e-16 ***
## ClassEco -5.385e-01 3.408e-02 -15.802 < 2e-16 ***
## ClassEco Plus -6.907e-01 5.297e-02 -13.039 < 2e-16 ***
## Flight.Distance -8.640e-05 1.351e-05 -6.394 1.62e-10 ***
## Seat.comfort1 -2.360e+01 8.819e+01 -0.268 0.78899
## Seat.comfort2 -2.399e+01 8.819e+01 -0.272 0.78561
## Seat.comfort3 -2.406e+01 8.819e+01 -0.273 0.78496
## Seat.comfort4 -2.297e+01 8.819e+01 -0.260 0.79451
## Seat.comfort5 -1.806e+01 8.819e+01 -0.205 0.83777
## Departure.Arrival.time.convenient1 5.211e-01 8.649e-02 6.025 1.69e-09 ***
## Departure.Arrival.time.convenient2 6.251e-01 8.366e-02 7.472 7.91e-14 ***
## Departure.Arrival.time.convenient3 5.977e-01 8.171e-02 7.314 2.59e-13 ***
## Departure.Arrival.time.convenient4 -4.542e-01 7.611e-02 -5.968 2.40e-09 ***
## Departure.Arrival.time.convenient5 -1.572e+00 8.267e-02 -19.015 < 2e-16 ***
## Food.and.drink1 3.452e+00 6.849e-01 5.040 4.65e-07 ***
## Food.and.drink2 3.460e+00 6.849e-01 5.051 4.39e-07 ***
## Food.and.drink3 3.864e+00 6.846e-01 5.645 1.66e-08 ***
## Food.and.drink4 3.958e+00 6.842e-01 5.785 7.26e-09 ***
## Food.and.drink5 4.148e+00 6.846e-01 6.059 1.37e-09 ***
## Gate.location1 -1.971e+01 4.400e+03 -0.004 0.99643
## Gate.location2 -1.968e+01 4.400e+03 -0.004 0.99643
## Gate.location3 -2.001e+01 4.400e+03 -0.005 0.99637
## Gate.location4 -1.974e+01 4.400e+03 -0.004 0.99642
## Gate.location5 -1.980e+01 4.400e+03 -0.004 0.99641
## Inflight.wifi.service1 -2.148e-01 1.300e+00 -0.165 0.86873
## Inflight.wifi.service2 1.412e-01 1.299e+00 0.109 0.91345
## Inflight.wifi.service3 -8.293e-02 1.299e+00 -0.064 0.94909
## Inflight.wifi.service4 -1.067e-01 1.299e+00 -0.082 0.93452
## Inflight.wifi.service5 -2.872e-01 1.299e+00 -0.221 0.82502
## Inflight.entertainment1 -3.265e+00 6.913e-01 -4.722 2.33e-06 ***
## Inflight.entertainment2 -3.190e+00 6.907e-01 -4.618 3.87e-06 ***
## Inflight.entertainment3 -3.393e+00 6.903e-01 -4.915 8.87e-07 ***
## Inflight.entertainment4 -1.617e+00 6.897e-01 -2.344 0.01907 *
## Inflight.entertainment5 -3.399e-01 6.898e-01 -0.493 0.62221
## Online.support1 2.020e+01 6.523e+03 0.003 0.99753
## Online.support2 1.980e+01 6.523e+03 0.003 0.99758
## Online.support3 1.877e+01 6.523e+03 0.003 0.99770
## Online.support4 1.955e+01 6.523e+03 0.003 0.99761
## Online.support5 2.020e+01 6.523e+03 0.003 0.99753
## Ease.of.Online.booking1 3.876e+01 1.447e+03 0.027 0.97863
## Ease.of.Online.booking2 3.960e+01 1.447e+03 0.027 0.97816
## Ease.of.Online.booking3 4.062e+01 1.447e+03 0.028 0.97760
## Ease.of.Online.booking4 4.072e+01 1.447e+03 0.028 0.97754
## Ease.of.Online.booking5 3.994e+01 1.447e+03 0.028 0.97798
## On.board.service1 -2.309e+01 3.381e+03 -0.007 0.99455
## On.board.service2 -2.292e+01 3.381e+03 -0.007 0.99459
## On.board.service3 -2.247e+01 3.381e+03 -0.007 0.99470
## On.board.service4 -2.229e+01 3.381e+03 -0.007 0.99474
## On.board.service5 -2.182e+01 3.381e+03 -0.006 0.99485
## Leg.room.service1 -2.071e+00 7.469e-01 -2.773 0.00555 **
## Leg.room.service2 -1.810e+00 7.467e-01 -2.425 0.01533 *
## Leg.room.service3 -1.985e+00 7.464e-01 -2.659 0.00783 **
## Leg.room.service4 -1.300e+00 7.464e-01 -1.741 0.08163 .
## Leg.room.service5 -1.150e+00 7.466e-01 -1.541 0.12340
## Baggage.handling2 -9.124e-02 7.235e-02 -1.261 0.20727
## Baggage.handling3 -6.276e-01 6.749e-02 -9.299 < 2e-16 ***
## Baggage.handling4 -8.763e-02 6.562e-02 -1.335 0.18175
## Baggage.handling5 4.518e-01 6.910e-02 6.537 6.26e-11 ***
## Checkin.service1 -1.232e+00 4.996e-02 -24.659 < 2e-16 ***
## Checkin.service2 -1.090e+00 4.954e-02 -22.005 < 2e-16 ***
## Checkin.service3 -6.054e-01 3.910e-02 -15.484 < 2e-16 ***
## Checkin.service4 -6.031e-01 3.877e-02 -15.556 < 2e-16 ***
## Checkin.service5 NA NA NA NA
## Cleanliness1 -3.246e-01 7.044e-02 -4.609 4.05e-06 ***
## Cleanliness2 -4.041e-01 6.439e-02 -6.276 3.47e-10 ***
## Cleanliness3 -1.179e+00 5.242e-02 -22.488 < 2e-16 ***
## Cleanliness4 -5.832e-01 4.047e-02 -14.412 < 2e-16 ***
## Cleanliness5 NA NA NA NA
## Online.boarding1 -5.113e-01 6.922e-02 -7.386 1.51e-13 ***
## Online.boarding2 -6.431e-01 6.690e-02 -9.613 < 2e-16 ***
## Online.boarding3 -1.679e-01 5.572e-02 -3.013 0.00259 **
## Online.boarding4 -3.610e-01 5.375e-02 -6.715 1.88e-11 ***
## Online.boarding5 NA NA NA NA
## Arrival.Delay.in.Minutes -5.018e-03 3.338e-04 -15.036 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 125226 on 90915 degrees of freedom
## Residual deviance: 41998 on 90841 degrees of freedom
## AIC: 42148
##
## Number of Fisher Scoring iterations: 17
AIC for the stepwise model = 42148, the same as for model_kepuasan_all; backward elimination dropped only Departure.Delay.in.Minutes.
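The AIC values printed by `summary()` can be checked by hand. For a logistic regression on a 0/1 response, the residual deviance equals minus twice the log-likelihood, so AIC = residual deviance + 2 × (number of estimated coefficients). The sketch below uses the rounded figures printed in the two summaries, so its results are approximate; `aic_from_summary` is a hypothetical helper written for this check.

```r
# Figures copied from the two summary() outputs (rounded as printed)
null_df <- 90915   # degrees of freedom of the null deviance

aic_from_summary <- function(residual_deviance, residual_df) {
  k <- null_df - residual_df + 1   # estimated coefficients, intercept included
  residual_deviance + 2 * k
}

aic_all  <- aic_from_summary(41996, 90840)   # model with all predictors
aic_step <- aic_from_summary(41998, 90841)   # model after backward elimination

c(all = aic_all, step = aic_step)
```

With these rounded inputs both models come out at 42148: dropping one coefficient lowers the penalty by 2 but raises the deviance by about 2, so the two effects cancel.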
Add a column of predicted probabilities to the test data
kepuasan_test$pred_value <- predict(model_kepuasan_all, kepuasan_test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
#kepuasan_test$pred_value
Add predicted labels to the test data
kepuasan_test$pred.Label <- ifelse(kepuasan_test$pred_value > 0.5, "1", "0")
#kepuasan_test %>% select(pred_value, pred.Label)
Convert the label class to factor
kepuasan_test$pred.Label <- as.factor(kepuasan_test$pred.Label)
Add named labels for the predictions to the test data
# use the same level names as the satisfaction column so the two can be compared directly
kepuasan_test$pred.Name <- ifelse(kepuasan_test$pred.Label == 1, "satisfied", "dissatisfied")
kepuasan_test$pred.Name <- as.factor(kepuasan_test$pred.Name)
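The labelling steps can also be collapsed into one: apply the 0.5 cutoff and build a factor whose levels match those of `satisfaction`. This is a minimal sketch, assuming a short made-up vector of probabilities in place of `kepuasan_test$pred_value`; `cutoff` and `pred_label` are names introduced for illustration.

```r
# Illustrative predicted probabilities of "satisfied"
pred_value <- c(0.92, 0.31, 0.50, 0.77)
cutoff <- 0.5

# Strict comparison: a probability exactly at the cutoff is labelled "dissatisfied"
pred_label <- factor(
  ifelse(pred_value > cutoff, "satisfied", "dissatisfied"),
  levels = c("dissatisfied", "satisfied")
)

pred_label
table(pred_label)
```

Fixing the level order explicitly keeps the predicted factor aligned with the reference factor, which `confusionMatrix()` requires.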
#kepuasan_test %>% select(pred_value, pred.Label, pred.Name)
Quick comparison of the actual data (satisfaction) and the predicted labels
#kepuasan_test %>% select(satisfaction, pred.Name)
kepuasan_test[1:20 , c("satisfaction", "pred.Name" )]
kepuasan_test$satisfaction_label <- ifelse(kepuasan_test$satisfaction == "satisfied", "1", "0")
kepuasan_test$satisfaction_label <- as.factor(kepuasan_test$satisfaction_label)
Quick comparison of the actual data (satisfaction_label) and the predicted labels
#kepuasan_test %>% select(satisfaction_label, pred.Label)
kepuasan_test[1:20 , c("satisfaction_label", "pred.Label")]
# confusion matrix
#install.packages("caret")
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Loading required package: lattice
confusionMatrix(data = kepuasan_test$pred.Label,
reference = kepuasan_test$satisfaction_label,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 15894 1957
## 1 1730 19383
##
## Accuracy : 0.9054
## 95% CI : (0.9024, 0.9083)
## No Information Rate : 0.5477
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8092
##
## Mcnemar's Test P-Value : 0.0001977
##
## Sensitivity : 0.9083
## Specificity : 0.9018
## Pos Pred Value : 0.9181
## Neg Pred Value : 0.8904
## Prevalence : 0.5477
## Detection Rate : 0.4975
## Detection Prevalence : 0.5419
## Balanced Accuracy : 0.9051
##
## 'Positive' Class : 1
##
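The headline statistics in this output follow directly from the four cells of the confusion matrix. The sketch below recomputes three of them from the printed counts, treating class 1 (satisfied) as positive; the variable names are introduced for illustration.

```r
# Counts read from the confusion matrix above (rows = prediction, columns = reference)
tn <- 15894; fn <- 1957    # predicted 0: truly 0, truly 1
fp <- 1730;  tp <- 19383   # predicted 1: truly 0, truly 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
sensitivity <- tp / (tp + fn)                    # share of satisfied customers that are found
specificity <- tn / (tn + fp)                    # share of dissatisfied customers that are found

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```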
Distribution of the predicted probabilities
ggplot(kepuasan_test, aes(x=pred_value)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
Check the range of the predicted values
boxplot(kepuasan_test$pred_value)
> Based on the boxplot, the median predicted probability is about 0.7.
This LBB is part of a machine learning course. The data for this LBB come from Kaggle.com, under the title Airline survey. The goal of this LBB is to simulate the KNN algorithm on customer satisfaction with airline service.
Read data
library(dplyr)
kepuasanKNN <- read.csv("Airline_survey.csv")
head(kepuasanKNN)
dim(kepuasanKNN)
## [1] 129880 23
str(kepuasanKNN)
## 'data.frame': 129880 obs. of 23 variables:
## $ satisfaction : chr "satisfied" "satisfied" "satisfied" "satisfied" ...
## $ Gender : chr "Female" "Male" "Female" "Female" ...
## $ Customer.Type : chr "Loyal Customer" "Loyal Customer" "Loyal Customer" "Loyal Customer" ...
## $ Age : int 65 47 15 60 70 30 66 10 56 22 ...
## $ Type.of.Travel : chr "Personal Travel" "Personal Travel" "Personal Travel" "Personal Travel" ...
## $ Class : chr "Eco" "Business" "Eco" "Eco" ...
## $ Flight.Distance : int 265 2464 2138 623 354 1894 227 1812 73 1556 ...
## $ Seat.comfort : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Departure.Arrival.time.convenient: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Food.and.drink : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gate.location : int 2 3 3 3 3 3 3 3 3 3 ...
## $ Inflight.wifi.service : int 2 0 2 3 4 2 2 2 5 2 ...
## $ Inflight.entertainment : int 4 2 0 4 3 0 5 0 3 0 ...
## $ Online.support : int 2 2 2 3 4 2 5 2 5 2 ...
## $ Ease.of.Online.booking : int 3 3 2 1 2 2 5 2 4 2 ...
## $ On.board.service : int 3 4 3 1 2 5 5 3 4 2 ...
## $ Leg.room.service : int 0 4 3 0 0 4 0 3 0 4 ...
## $ Baggage.handling : int 3 4 4 1 2 5 5 4 1 5 ...
## $ Checkin.service : int 5 2 4 4 4 5 5 5 5 3 ...
## $ Cleanliness : int 3 3 4 1 2 4 5 4 4 4 ...
## $ Online.boarding : int 2 2 2 3 5 2 3 2 4 2 ...
## $ Departure.Delay.in.Minutes : int 0 310 0 0 0 0 17 0 0 30 ...
## $ Arrival.Delay.in.Minutes : int 0 305 0 0 0 0 15 0 0 26 ...
names(kepuasanKNN)
## [1] "satisfaction" "Gender"
## [3] "Customer.Type" "Age"
## [5] "Type.of.Travel" "Class"
## [7] "Flight.Distance" "Seat.comfort"
## [9] "Departure.Arrival.time.convenient" "Food.and.drink"
## [11] "Gate.location" "Inflight.wifi.service"
## [13] "Inflight.entertainment" "Online.support"
## [15] "Ease.of.Online.booking" "On.board.service"
## [17] "Leg.room.service" "Baggage.handling"
## [19] "Checkin.service" "Cleanliness"
## [21] "Online.boarding" "Departure.Delay.in.Minutes"
## [23] "Arrival.Delay.in.Minutes"
Convert several columns to factor
kepuasanKNN <- kepuasanKNN %>%
mutate(satisfaction = as.factor(satisfaction),
Gender = as.factor(Gender),
Customer.Type = as.factor(Customer.Type),
Type.of.Travel = as.factor(Type.of.Travel),
Class = as.factor(Class),
Seat.comfort = as.factor(Seat.comfort),
Departure.Arrival.time.convenient = as.factor(Departure.Arrival.time.convenient),
Food.and.drink = as.factor(Food.and.drink),
Gate.location = as.factor(Gate.location),
Inflight.wifi.service = as.factor(Inflight.wifi.service),
Inflight.entertainment = as.factor(Inflight.entertainment),
Online.support = as.factor(Online.support),
Ease.of.Online.booking = as.factor(Ease.of.Online.booking),
On.board.service = as.factor(On.board.service),
Leg.room.service = as.factor(Leg.room.service),
Baggage.handling = as.factor(Baggage.handling),
Checkin.service = as.factor(Checkin.service),
Cleanliness = as.factor(Cleanliness),
Online.boarding = as.factor(Online.boarding))
glimpse(kepuasanKNN)
## Rows: 129,880
## Columns: 23
## $ satisfaction <fct> satisfied, satisfied, satisfied, sat…
## $ Gender <fct> Female, Male, Female, Female, Female…
## $ Customer.Type <fct> Loyal Customer, Loyal Customer, Loya…
## $ Age <int> 65, 47, 15, 60, 70, 30, 66, 10, 56, …
## $ Type.of.Travel <fct> Personal Travel, Personal Travel, Pe…
## $ Class <fct> Eco, Business, Eco, Eco, Eco, Eco, E…
## $ Flight.Distance <int> 265, 2464, 2138, 623, 354, 1894, 227…
## $ Seat.comfort <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Departure.Arrival.time.convenient <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Food.and.drink <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Gate.location <fct> 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, …
## $ Inflight.wifi.service <fct> 2, 0, 2, 3, 4, 2, 2, 2, 5, 2, 3, 2, …
## $ Inflight.entertainment <fct> 4, 2, 0, 4, 3, 0, 5, 0, 3, 0, 3, 0, …
## $ Online.support <fct> 2, 2, 2, 3, 4, 2, 5, 2, 5, 2, 3, 2, …
## $ Ease.of.Online.booking <fct> 3, 3, 2, 1, 2, 2, 5, 2, 4, 2, 3, 2, …
## $ On.board.service <fct> 3, 4, 3, 1, 2, 5, 5, 3, 4, 2, 3, 3, …
## $ Leg.room.service <fct> 0, 4, 3, 0, 0, 4, 0, 3, 0, 4, 0, 2, …
## $ Baggage.handling <fct> 3, 4, 4, 1, 2, 5, 5, 4, 1, 5, 1, 5, …
## $ Checkin.service <fct> 5, 2, 4, 4, 4, 5, 5, 5, 5, 3, 2, 2, …
## $ Cleanliness <fct> 3, 3, 4, 1, 2, 4, 5, 4, 4, 4, 3, 5, …
## $ Online.boarding <fct> 2, 2, 2, 3, 5, 2, 3, 2, 4, 2, 5, 2, …
## $ Departure.Delay.in.Minutes <int> 0, 310, 0, 0, 0, 0, 17, 0, 0, 30, 47…
## $ Arrival.Delay.in.Minutes <int> 0, 305, 0, 0, 0, 0, 15, 0, 0, 26, 48…
Check for missing values
anyNA(kepuasanKNN)
## [1] TRUE
is.na(kepuasanKNN) %>% colSums()
## satisfaction Gender
## 0 0
## Customer.Type Age
## 0 0
## Type.of.Travel Class
## 0 0
## Flight.Distance Seat.comfort
## 0 0
## Departure.Arrival.time.convenient Food.and.drink
## 0 0
## Gate.location Inflight.wifi.service
## 0 0
## Inflight.entertainment Online.support
## 0 0
## Ease.of.Online.booking On.board.service
## 0 0
## Leg.room.service Baggage.handling
## 0 0
## Checkin.service Cleanliness
## 0 0
## Online.boarding Departure.Delay.in.Minutes
## 0 0
## Arrival.Delay.in.Minutes
## 393
Fill the missing values with the mean
kepuasanKNN$Arrival.Delay.in.Minutes[is.na(kepuasanKNN$Arrival.Delay.in.Minutes)] <-
mean(kepuasanKNN$Arrival.Delay.in.Minutes, na.rm = TRUE)
Confirm that no missing values remain
anyNA(kepuasanKNN)
## [1] FALSE
is.na(kepuasanKNN) %>% colSums()
## satisfaction Gender
## 0 0
## Customer.Type Age
## 0 0
## Type.of.Travel Class
## 0 0
## Flight.Distance Seat.comfort
## 0 0
## Departure.Arrival.time.convenient Food.and.drink
## 0 0
## Gate.location Inflight.wifi.service
## 0 0
## Inflight.entertainment Online.support
## 0 0
## Ease.of.Online.booking On.board.service
## 0 0
## Leg.room.service Baggage.handling
## 0 0
## Checkin.service Cleanliness
## 0 0
## Online.boarding Departure.Delay.in.Minutes
## 0 0
## Arrival.Delay.in.Minutes
## 0
Check class imbalance
table(kepuasanKNN$satisfaction)
##
## dissatisfied satisfied
## 58793 71087
prop.table(table(kepuasanKNN$satisfaction))
##
## dissatisfied satisfied
## 0.4526717 0.5473283
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
# your code here
index <- sample(nrow(kepuasanKNN), nrow(kepuasanKNN)*0.7)
kepuasanKNN_train <- kepuasanKNN[index,]
kepuasanKNN_test <- kepuasanKNN[-index,]
Separate the predictors and the target in both the TRAIN and TEST data
library(dplyr)
# separate the predictors
kepuasanKNN_train_x <- kepuasanKNN_train %>% select_if(is.numeric)
kepuasanKNN_test_x <- kepuasanKNN_test %>% select_if(is.numeric)
# separate the target column
kepuasanKNN_train_y <- kepuasanKNN_train[,"satisfaction"]
kepuasanKNN_test_y <- kepuasanKNN_test[,"satisfaction"]
library(gtools)
## Warning: package 'gtools' was built under R version 4.2.2
# scale train_x data
kepuasanKNN_train_xs <- scale(kepuasanKNN_train_x)
# scale test_x data
kepuasanKNN_test_xs <- scale(kepuasanKNN_test_x,
center = attr(kepuasanKNN_train_xs, "scaled:center"),
scale = attr(kepuasanKNN_train_xs, "scaled:scale"))
# find the value of k
sqrt(nrow(kepuasanKNN_train_xs))
## [1] 301.5228
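Two details of this step are worth spelling out: the test set is scaled with the mean and standard deviation of the train set, and k starts from the square root of the number of train rows. The sketch below shows both on made-up numbers. `pick_k` is a hypothetical helper, and rounding down to an odd number (to avoid tied votes between two classes) is a common convention rather than something the original code states.

```r
# Toy train and test predictors (two columns, values are illustrative only)
train_x <- matrix(c(10, 20, 30, 40,
                    1,  2,  3,  4),
                  ncol = 2, dimnames = list(NULL, c("a", "b")))
test_x <- matrix(c(25, 2.5), ncol = 2, dimnames = list(NULL, c("a", "b")))

# Scale train on its own statistics, then reuse those statistics for test
train_xs <- scale(train_x)
test_xs <- scale(test_x,
                 center = attr(train_xs, "scaled:center"),
                 scale  = attr(train_xs, "scaled:scale"))

# Square-root rule for k, rounded down to the nearest odd number
pick_k <- function(n_train) {
  k <- floor(sqrt(n_train))
  if (k %% 2 == 0) k <- k - 1
  max(k, 1)
}

pick_k(90916)   # number of train rows in this analysis
```

The toy test row sits exactly at the train means, so it scales to zero in both columns; this confirms that the train statistics, not the test statistics, were applied.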
library(class)
model_knn <- knn(train=kepuasanKNN_train_xs,
test=kepuasanKNN_test_xs,
cl=kepuasanKNN_train_y,
k=301)
#model_knn
model_knn[1:10]
## [1] satisfied dissatisfied satisfied dissatisfied satisfied
## [6] satisfied dissatisfied satisfied satisfied satisfied
## Levels: dissatisfied satisfied
library(caret)
confusionMatrix(data = as.factor(model_knn),
reference = kepuasanKNN_test_y,
positive = "satisfied")
## Confusion Matrix and Statistics
##
## Reference
## Prediction dissatisfied satisfied
## dissatisfied 9849 6756
## satisfied 7775 14584
##
## Accuracy : 0.6271
## 95% CI : (0.6222, 0.6319)
## No Information Rate : 0.5477
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2435
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6834
## Specificity : 0.5588
## Pos Pred Value : 0.6523
## Neg Pred Value : 0.5931
## Prevalence : 0.5477
## Detection Rate : 0.3743
## Detection Prevalence : 0.5738
## Balanced Accuracy : 0.6211
##
## 'Positive' Class : satisfied
##
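Because the rating columns were converted to factors earlier, `select_if(is.numeric)` keeps only four predictors for KNN: Age, Flight.Distance, Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes. One possible variation, which is not part of the original analysis and whose effect on accuracy is untested here, is to turn the 0 to 5 ratings back into numbers so that KNN can use them. A minimal sketch of that conversion on a made-up factor:

```r
# Illustrative rating column stored as a factor
rating <- factor(c("0", "3", "5", "3"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(rating)                   # 1 2 3 2

# Going through character first recovers the original rating values
values <- as.numeric(as.character(rating))    # 0 3 5 3
```

The intermediate `as.character()` matters: without it a rating of 0 would silently become 1.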
An initial analysis can be made by comparing AIC values.
For Logistic Regression, the model using all predictors and the model using feature selection (backward stepwise) give the same AIC, 42148.
The data contain 393 missing values, all in Arrival.Delay.in.Minutes, which cause an error when KNN is run. The missing values therefore have to be filled in, here with the mean.
The predictions can be evaluated with a confusion matrix.
For evaluating an airline customer satisfaction survey, in my view the business need is:
as few False Positives (FP) as possible, because management does not want to be misled by data suggesting that customers are satisfied when they are in fact dissatisfied.
The most appropriate metric is therefore Precision.
Precision (Pos Pred Value) is 0.9181 for Logistic Regression and 0.6523 for KNN.
The model to choose is therefore Logistic Regression rather than KNN.
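Precision is the share of customers predicted as satisfied who really are satisfied, TP / (TP + FP), so every false positive lowers it. The sketch below recomputes it for both models from the counts in their confusion matrices; `precision` is a small helper defined here for illustration.

```r
precision <- function(tp, fp) tp / (tp + fp)

# Counts read from the two confusion matrices, with "satisfied" as the positive class
precision_logistic <- precision(tp = 19383, fp = 1730)
precision_knn      <- precision(tp = 14584, fp = 7775)

round(c(logistic = precision_logistic, knn = precision_knn), 4)
```

On the same 38,964 test rows, KNN produces 7,775 false positives against 1,730 for Logistic Regression, which is what drives the gap in precision.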