This is the second classification exercise, using the same data as the
previous one (the airline survey from kaggle.com). It consists of
four parts: Part A, prediction using Naive Bayes; Part B, Decision Tree;
Part C, Random Forest; and Part D, Analysis.
Goal: to predict customer satisfaction with airline services.
Read data
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
airline <- read.csv("Airline_survey.csv")
head(airline)
dim(airline)
## [1] 129880 23
str(airline)
## 'data.frame': 129880 obs. of 23 variables:
## $ satisfaction : chr "satisfied" "satisfied" "satisfied" "satisfied" ...
## $ Gender : chr "Female" "Male" "Female" "Female" ...
## $ Customer.Type : chr "Loyal Customer" "Loyal Customer" "Loyal Customer" "Loyal Customer" ...
## $ Age : int 65 47 15 60 70 30 66 10 56 22 ...
## $ Type.of.Travel : chr "Personal Travel" "Personal Travel" "Personal Travel" "Personal Travel" ...
## $ Class : chr "Eco" "Business" "Eco" "Eco" ...
## $ Flight.Distance : int 265 2464 2138 623 354 1894 227 1812 73 1556 ...
## $ Seat.comfort : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Departure.Arrival.time.convenient: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Food.and.drink : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gate.location : int 2 3 3 3 3 3 3 3 3 3 ...
## $ Inflight.wifi.service : int 2 0 2 3 4 2 2 2 5 2 ...
## $ Inflight.entertainment : int 4 2 0 4 3 0 5 0 3 0 ...
## $ Online.support : int 2 2 2 3 4 2 5 2 5 2 ...
## $ Ease.of.Online.booking : int 3 3 2 1 2 2 5 2 4 2 ...
## $ On.board.service : int 3 4 3 1 2 5 5 3 4 2 ...
## $ Leg.room.service : int 0 4 3 0 0 4 0 3 0 4 ...
## $ Baggage.handling : int 3 4 4 1 2 5 5 4 1 5 ...
## $ Checkin.service : int 5 2 4 4 4 5 5 5 5 3 ...
## $ Cleanliness : int 3 3 4 1 2 4 5 4 4 4 ...
## $ Online.boarding : int 2 2 2 3 5 2 3 2 4 2 ...
## $ Departure.Delay.in.Minutes : int 0 310 0 0 0 0 17 0 0 30 ...
## $ Arrival.Delay.in.Minutes : int 0 305 0 0 0 0 15 0 0 26 ...
Next, some data wrangling is needed to convert the column types to
factor, because these columns hold categorical answers from the
respondent survey. Every variable is converted to factor except Age,
Flight.Distance, Departure.Delay.in.Minutes
and Arrival.Delay.in.Minutes.
airline <- airline %>%
  mutate(satisfaction = as.factor(satisfaction),
         Gender = as.factor(Gender),
         Customer.Type = as.factor(Customer.Type),
         Type.of.Travel = as.factor(Type.of.Travel),
         Class = as.factor(Class),
         Seat.comfort = as.factor(Seat.comfort),
         Departure.Arrival.time.convenient = as.factor(Departure.Arrival.time.convenient),
         Food.and.drink = as.factor(Food.and.drink),
         Gate.location = as.factor(Gate.location),
         Inflight.wifi.service = as.factor(Inflight.wifi.service),
         Inflight.entertainment = as.factor(Inflight.entertainment),
         Online.support = as.factor(Online.support),
         Ease.of.Online.booking = as.factor(Ease.of.Online.booking),
         On.board.service = as.factor(On.board.service),
         Leg.room.service = as.factor(Leg.room.service),
         Baggage.handling = as.factor(Baggage.handling),
         Checkin.service = as.factor(Checkin.service),
         Cleanliness = as.factor(Cleanliness),
         Online.boarding = as.factor(Online.boarding))
glimpse(airline)
## Rows: 129,880
## Columns: 23
## $ satisfaction <fct> satisfied, satisfied, satisfied, sat…
## $ Gender <fct> Female, Male, Female, Female, Female…
## $ Customer.Type <fct> Loyal Customer, Loyal Customer, Loya…
## $ Age <int> 65, 47, 15, 60, 70, 30, 66, 10, 56, …
## $ Type.of.Travel <fct> Personal Travel, Personal Travel, Pe…
## $ Class <fct> Eco, Business, Eco, Eco, Eco, Eco, E…
## $ Flight.Distance <int> 265, 2464, 2138, 623, 354, 1894, 227…
## $ Seat.comfort <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Departure.Arrival.time.convenient <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Food.and.drink <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Gate.location <fct> 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, …
## $ Inflight.wifi.service <fct> 2, 0, 2, 3, 4, 2, 2, 2, 5, 2, 3, 2, …
## $ Inflight.entertainment <fct> 4, 2, 0, 4, 3, 0, 5, 0, 3, 0, 3, 0, …
## $ Online.support <fct> 2, 2, 2, 3, 4, 2, 5, 2, 5, 2, 3, 2, …
## $ Ease.of.Online.booking <fct> 3, 3, 2, 1, 2, 2, 5, 2, 4, 2, 3, 2, …
## $ On.board.service <fct> 3, 4, 3, 1, 2, 5, 5, 3, 4, 2, 3, 3, …
## $ Leg.room.service <fct> 0, 4, 3, 0, 0, 4, 0, 3, 0, 4, 0, 2, …
## $ Baggage.handling <fct> 3, 4, 4, 1, 2, 5, 5, 4, 1, 5, 1, 5, …
## $ Checkin.service <fct> 5, 2, 4, 4, 4, 5, 5, 5, 5, 3, 2, 2, …
## $ Cleanliness <fct> 3, 3, 4, 1, 2, 4, 5, 4, 4, 4, 3, 5, …
## $ Online.boarding <fct> 2, 2, 2, 3, 5, 2, 3, 2, 4, 2, 5, 2, …
## $ Departure.Delay.in.Minutes <int> 0, 310, 0, 0, 0, 0, 17, 0, 0, 30, 47…
## $ Arrival.Delay.in.Minutes <int> 0, 305, 0, 0, 0, 0, 15, 0, 0, 26, 48…
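The column-by-column conversion above can also be written more compactly. The following is a minimal sketch in base R on a small made-up data frame; the toy values and the names `toy`, `keep_numeric` and `to_factor` are assumptions for illustration, not part of the original analysis.

```r
# Toy data frame standing in for the airline survey (values are made up)
toy <- data.frame(
  satisfaction = c("satisfied", "dissatisfied", "satisfied"),
  Class = c("Eco", "Business", "Eco"),
  Seat.comfort = c(0L, 3L, 5L),
  Age = c(65L, 47L, 15L),
  Flight.Distance = c(265L, 2464L, 2138L)
)

# Columns that stay numeric; every other column becomes a factor
keep_numeric <- c("Age", "Flight.Distance")
to_factor <- setdiff(names(toy), keep_numeric)

# lapply() applies as.factor() to each selected column in one step
toy[to_factor] <- lapply(toy[to_factor], as.factor)

str(toy)
```

Listing the columns to exclude rather than the ones to convert keeps the code short when most columns are survey ratings, as they are here.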
Check for missing values
anyNA(airline)
## [1] TRUE
is.na(airline) %>% colSums()
## satisfaction Gender
## 0 0
## Customer.Type Age
## 0 0
## Type.of.Travel Class
## 0 0
## Flight.Distance Seat.comfort
## 0 0
## Departure.Arrival.time.convenient Food.and.drink
## 0 0
## Gate.location Inflight.wifi.service
## 0 0
## Inflight.entertainment Online.support
## 0 0
## Ease.of.Online.booking On.board.service
## 0 0
## Leg.room.service Baggage.handling
## 0 0
## Checkin.service Cleanliness
## 0 0
## Online.boarding Departure.Delay.in.Minutes
## 0 0
## Arrival.Delay.in.Minutes
## 393
Fill the missing values with the column mean
airline$Arrival.Delay.in.Minutes[is.na(airline$Arrival.Delay.in.Minutes)] <-
  mean(airline$Arrival.Delay.in.Minutes, na.rm = TRUE)
Confirm that no missing values remain
anyNA(airline)
## [1] FALSE
is.na(airline) %>% colSums()
## satisfaction Gender
## 0 0
## Customer.Type Age
## 0 0
## Type.of.Travel Class
## 0 0
## Flight.Distance Seat.comfort
## 0 0
## Departure.Arrival.time.convenient Food.and.drink
## 0 0
## Gate.location Inflight.wifi.service
## 0 0
## Inflight.entertainment Online.support
## 0 0
## Ease.of.Online.booking On.board.service
## 0 0
## Leg.room.service Baggage.handling
## 0 0
## Checkin.service Cleanliness
## 0 0
## Online.boarding Departure.Delay.in.Minutes
## 0 0
## Arrival.Delay.in.Minutes
## 0
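To make the imputation step concrete, here is a minimal sketch on a made-up vector. The values and the names `delay` and `delay_mean` are assumptions for illustration only.

```r
# Toy delay vector with one missing entry (values are made up)
delay <- c(0, 305, 15, NA, 26)

# Mean of the observed values only: (0 + 305 + 15 + 26) / 4 = 86.5
delay_mean <- mean(delay, na.rm = TRUE)

# Logical indexing selects the NA positions and overwrites them with the mean
delay[is.na(delay)] <- delay_mean

delay
anyNA(delay)
```

Without `na.rm = TRUE`, `mean()` would itself return `NA` and the missing entries would stay missing.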
Check class imbalance
table(airline$satisfaction)
##
## dissatisfied satisfied
## 58793 71087
prop.table(table(airline$satisfaction))
##
## dissatisfied satisfied
## 0.4526717 0.5473283
There is no serious class imbalance in the target column: the classes split roughly 45% to 55%.
Split the data into train and test sets
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(airline), nrow(airline)*0.8)
airline_train <- airline[index,]
airline_test <- airline[-index,]
nrow(airline)*0.8
## [1] 103904
nrow(airline_train)
## [1] 103904
nrow(airline)*0.2
## [1] 25976
nrow(airline_test)
## [1] 25976
Check class imbalance of the target column in the training data
table(airline_train$satisfaction)
##
## dissatisfied satisfied
## 47100 56804
prop.table(table(airline_train$satisfaction))
##
## dissatisfied satisfied
## 0.453303 0.546697
The training data is not imbalanced either.
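The split relies on negative indexing: `[-index, ]` keeps exactly the rows that were not sampled, so the train and test sets cannot overlap. The following is a minimal sketch of that idea on toy row numbers; the size `n` and the names `train_rows` and `test_rows` are assumptions for illustration.

```r
set.seed(100)
n <- 50                          # toy number of rows
index <- sample(n, n * 0.8)      # 40 row indices drawn without replacement

train_rows <- seq_len(n)[index]  # rows selected for training
test_rows <- seq_len(n)[-index]  # every remaining row goes to the test set

length(train_rows)
length(test_rows)
length(intersect(train_rows, test_rows))  # 0 means the two sets share no rows
```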
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.2
model_airline_naive <- naiveBayes(formula = satisfaction ~ ., data = airline_train, laplace = 1)
#model_airline_naive
# predict
airline_test$pred_label <- predict(object = model_airline_naive,
                                   newdata = airline_test,
                                   type = "class") # returns the predicted class label; type = "raw" returns class probabilities
#airline_test
Evaluate the model with a confusion matrix:
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Loading required package: lattice
# confusion matrix
confusionMatrix(data = airline_test$pred_label,
                reference = airline_test$satisfaction,
                positive = "satisfied")
## Confusion Matrix and Statistics
##
## Reference
## Prediction dissatisfied satisfied
## dissatisfied 9615 1697
## satisfied 2078 12586
##
## Accuracy : 0.8547
## 95% CI : (0.8503, 0.8589)
## No Information Rate : 0.5499
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7056
##
## Mcnemar's Test P-Value : 6.218e-10
##
## Sensitivity : 0.8812
## Specificity : 0.8223
## Pos Pred Value : 0.8583
## Neg Pred Value : 0.8500
## Prevalence : 0.5499
## Detection Rate : 0.4845
## Detection Prevalence : 0.5645
## Balanced Accuracy : 0.8517
##
## 'Positive' Class : satisfied
##
Sensitivity = recall
Pos Pred Value = precision
Which metric do we prioritize?
Precision (Pos Pred Value), because we want to minimize false positives.
Is an accuracy of 0.8547 good enough? The Naive Bayes model is reasonably good at 0.8547, provided we take roughly 85% accuracy as the bar for a good model.
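The reported metrics can be reproduced by hand from the four cells of the Naive Bayes confusion matrix above. This is a minimal sketch; the counts are copied from that output, and the variable names are chosen here for illustration.

```r
# Counts from the Naive Bayes confusion matrix ("satisfied" is the positive class)
tp <- 12586  # predicted satisfied, actually satisfied
fp <- 2078   # predicted satisfied, actually dissatisfied
fn <- 1697   # predicted dissatisfied, actually satisfied
tn <- 9615   # predicted dissatisfied, actually dissatisfied

accuracy <- (tp + tn) / (tp + fp + fn + tn)
recall <- tp / (tp + fn)        # Sensitivity
specificity <- tn / (tn + fp)
precision <- tp / (tp + fp)     # Pos Pred Value

round(c(accuracy = accuracy, recall = recall,
        specificity = specificity, precision = precision), 4)
```

Precision has false positives in its denominator, which is why it is the metric to watch when the aim is to avoid labelling dissatisfied customers as satisfied.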
The ROC curve plots the True Positive Rate (Sensitivity, or Recall) against the False Positive Rate (1 - Specificity) at every threshold. A good model ideally has a high True Positive Rate and a low False Positive Rate. Note: Specificity is the True Negative Rate.
Let's build the ROC curve for the model model_airline_naive:
# get the test-set predictions as probabilities
airline_test$prob <- predict(model_airline_naive, airline_test, type="raw")
#airline_test$prob[,"satisfied"]
# prepare predicted vs actual
airline_test$actual <- ifelse(airline_test$satisfaction=="satisfied", 1, 0)
#airline_test$actual
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.2.2
# prediction object
roc_pred_airline <- prediction(predictions=airline_test$prob[,"satisfied"],
                               labels=airline_test$actual) # actual labels recoded as 0 and 1
# ROC curve
plot(performance(prediction.obj = roc_pred_airline,
                 measure = "tpr", # y axis
                 x.measure = "fpr")) # x axis
abline(0,1, lty=2) # diagonal line marking the worst-case model; must be run together with the plot() call above
AUC value
auc_pred <- performance(prediction.obj = roc_pred_airline,
                        measure="auc")
auc_pred@y.values
## [[1]]
## [1] 0.9309337
The AUC is 0.9309, close to 1, which means the model separates the positive and negative classes well.
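To show what these quantities measure, the sketch below computes one point of the ROC curve and the AUC by hand in base R. The scores and labels are made up, and the AUC is computed through its pairwise interpretation (the share of positive-negative pairs in which the positive case gets the higher score) rather than through ROCR.

```r
# Toy predicted probabilities and true labels (made-up values; 1 = positive class)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
label <- c(1, 1, 0, 1, 0, 1, 0, 0)

# One point on the ROC curve: TPR and FPR at a single threshold
threshold <- 0.5
pred <- as.integer(score >= threshold)
tpr <- sum(pred == 1 & label == 1) / sum(label == 1)
fpr <- sum(pred == 1 & label == 0) / sum(label == 0)

# AUC as the share of (positive, negative) pairs where the positive scores higher
pos <- score[label == 1]
neg <- score[label == 0]
auc <- mean(outer(pos, neg, ">"))

c(tpr = tpr, fpr = fpr, auc = auc)
```

Sweeping `threshold` from 1 down to 0 and plotting each (fpr, tpr) pair traces the full curve.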
The data for this LBB comes from Kaggle.com, under the title Airline survey. Goal of the LBB: to predict customer satisfaction with airline services.
library(partykit)
## Warning: package 'partykit' was built under R version 4.2.2
## Loading required package: grid
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.2.2
## Loading required package: mvtnorm
dtree_model_airline <- ctree(formula = satisfaction ~ .,
                             data = airline_train)
plot(dtree_model_airline, type = "simple")
The resulting tree is too complex to read, so the branching needs to be constrained.
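Branching in `ctree()` is constrained through `ctree_control()`. The sketch below is only an illustration on the built-in `iris` data: the parameter values are assumptions and are not the settings used to build `dtree_model_airline_complex`, whose definition is not shown here.

```r
library(partykit)

# Illustrative settings only (assumed values, not the ones from the original analysis)
ctrl <- ctree_control(mincriterion = 0.99,  # 1 - p-value that must be exceeded to split
                      minsplit = 40L,       # minimum weight in a node to consider a split
                      minbucket = 20L,      # minimum weight in each terminal node
                      maxdepth = 2L)        # at most two levels of splits

tree_small <- ctree(Species ~ ., data = iris, control = ctrl)

width(tree_small)                  # number of terminal nodes
plot(tree_small, type = "simple")
```

Raising `mincriterion`, `minsplit` or `minbucket`, or lowering `maxdepth`, generally gives a smaller tree that is easier to read, at the possible cost of some accuracy.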
Evaluate the performance of the complex model
# predict classes on the test data
pred_A <- predict(dtree_model_airline_complex, airline_test, type="response")
# confusion matrix on the test data
confusionMatrix(pred_A, airline_test$satisfaction, positive = "satisfied")
## Confusion Matrix and Statistics
##
## Reference
## Prediction dissatisfied satisfied
## dissatisfied 10856 1018
## satisfied 837 13265
##
## Accuracy : 0.9286
## 95% CI : (0.9254, 0.9317)
## No Information Rate : 0.5499
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8559
##
## Mcnemar's Test P-Value : 2.924e-05
##
## Sensitivity : 0.9287
## Specificity : 0.9284
## Pos Pred Value : 0.9406
## Neg Pred Value : 0.9143
## Prevalence : 0.5499
## Detection Rate : 0.5107
## Detection Prevalence : 0.5429
## Balanced Accuracy : 0.9286
##
## 'Positive' Class : satisfied
##