Customer Satisfaction Analysis
Introduction
This dataset, taken from Kaggle, contains information about airline customer satisfaction. We will analyze the data, build models, and predict customer satisfaction using Logistic Regression and K-Nearest Neighbor.
First, load the libraries we need.
library(dplyr)
library(tidyverse)
library(MLmetrics)
library(lmtest)
library(rsample)
library(class)
library(caret)
library(car)
Data Preparation
Input Data
Read the data into the airline object. We set stringsAsFactors = TRUE so that all character columns are automatically stored as factors.
airline <- read.csv("Invistico_Airline.csv", stringsAsFactors = TRUE)
Take a quick look at the data:
head(airline)
Data Structure
Check the number of columns and rows.
dim(airline)
## [1] 129880 23
The data contains 129,880 rows and 23 columns.
View all columns and the data types.
glimpse(airline)
## Rows: 129,880
## Columns: 23
## $ satisfaction <fct> satisfied, satisfied, satisfied, sat~
## $ Gender <fct> Female, Male, Female, Female, Female~
## $ Customer.Type <fct> Loyal Customer, Loyal Customer, Loya~
## $ Age <int> 65, 47, 15, 60, 70, 30, 66, 10, 56, ~
## $ Type.of.Travel <fct> Personal Travel, Personal Travel, Pe~
## $ Class <fct> Eco, Business, Eco, Eco, Eco, Eco, E~
## $ Flight.Distance <int> 265, 2464, 2138, 623, 354, 1894, 227~
## $ Seat.comfort <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Departure.Arrival.time.convenient <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Food.and.drink <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Gate.location <int> 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, ~
## $ Inflight.wifi.service <int> 2, 0, 2, 3, 4, 2, 2, 2, 5, 2, 3, 2, ~
## $ Inflight.entertainment <int> 4, 2, 0, 4, 3, 0, 5, 0, 3, 0, 3, 0, ~
## $ Online.support <int> 2, 2, 2, 3, 4, 2, 5, 2, 5, 2, 3, 2, ~
## $ Ease.of.Online.booking <int> 3, 3, 2, 1, 2, 2, 5, 2, 4, 2, 3, 2, ~
## $ On.board.service <int> 3, 4, 3, 1, 2, 5, 5, 3, 4, 2, 3, 3, ~
## $ Leg.room.service <int> 0, 4, 3, 0, 0, 4, 0, 3, 0, 4, 0, 2, ~
## $ Baggage.handling <int> 3, 4, 4, 1, 2, 5, 5, 4, 1, 5, 1, 5, ~
## $ Checkin.service <int> 5, 2, 4, 4, 4, 5, 5, 5, 5, 3, 2, 2, ~
## $ Cleanliness <int> 3, 3, 4, 1, 2, 4, 5, 4, 4, 4, 3, 5, ~
## $ Online.boarding <int> 2, 2, 2, 3, 5, 2, 3, 2, 4, 2, 5, 2, ~
## $ Departure.Delay.in.Minutes <int> 0, 310, 0, 0, 0, 0, 17, 0, 0, 30, 47~
## $ Arrival.Delay.in.Minutes <int> 0, 305, 0, 0, 0, 0, 15, 0, 0, 26, 48~
The data types of all columns are correct.
Pre-processing Data
Check for missing values.
colSums(is.na(airline))
## satisfaction Gender
## 0 0
## Customer.Type Age
## 0 0
## Type.of.Travel Class
## 0 0
## Flight.Distance Seat.comfort
## 0 0
## Departure.Arrival.time.convenient Food.and.drink
## 0 0
## Gate.location Inflight.wifi.service
## 0 0
## Inflight.entertainment Online.support
## 0 0
## Ease.of.Online.booking On.board.service
## 0 0
## Leg.room.service Baggage.handling
## 0 0
## Checkin.service Cleanliness
## 0 0
## Online.boarding Departure.Delay.in.Minutes
## 0 0
## Arrival.Delay.in.Minutes
## 393
Arrival.Delay.in.Minutes has 393 missing values. We handle them by setting them to zero, that is, by assuming no delay.
airline <- airline %>%
  mutate(
    # replace missing arrival delays with 0 (assumed no delay)
    Arrival.Delay.in.Minutes = replace(Arrival.Delay.in.Minutes, is.na(Arrival.Delay.in.Minutes), 0),
    Arrival.Delay.in.Minutes = as.numeric(Arrival.Delay.in.Minutes))
colSums(is.na(airline))
## satisfaction Gender
## 0 0
## Customer.Type Age
## 0 0
## Type.of.Travel Class
## 0 0
## Flight.Distance Seat.comfort
## 0 0
## Departure.Arrival.time.convenient Food.and.drink
## 0 0
## Gate.location Inflight.wifi.service
## 0 0
## Inflight.entertainment Online.support
## 0 0
## Ease.of.Online.booking On.board.service
## 0 0
## Leg.room.service Baggage.handling
## 0 0
## Checkin.service Cleanliness
## 0 0
## Online.boarding Departure.Delay.in.Minutes
## 0 0
## Arrival.Delay.in.Minutes
## 0
No missing values remain, so the data is ready to explore.
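The zero-fill step used above can be tried on a small toy data frame first. The sketch below uses base R only; the `delays` data frame and its values are hypothetical and only illustrate the mechanics.

```r
# Toy data frame with hypothetical delay values (not from the airline data)
delays <- data.frame(
  Departure.Delay.in.Minutes = c(0, 30, 12, 5),
  Arrival.Delay.in.Minutes   = c(0, NA, 15, NA)
)

# Count missing values per column before the fix
colSums(is.na(delays))

# Replace missing arrival delays with 0, i.e. assume no delay
delays$Arrival.Delay.in.Minutes[is.na(delays$Arrival.Delay.in.Minutes)] <- 0

# No missing values should remain
colSums(is.na(delays))
```

Filling with zero treats every missing arrival delay as an on-time arrival, which is an assumption worth keeping in mind when reading the delay coefficients later.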
Exploratory Data Analysis
Let’s look at the summary of all columns.
summary(airline)
## satisfaction Gender Customer.Type Age
## dissatisfied:58793 Female:65899 disloyal Customer: 23780 Min. : 7.00
## satisfied :71087 Male :63981 Loyal Customer :106100 1st Qu.:27.00
## Median :40.00
## Mean :39.43
## 3rd Qu.:51.00
## Max. :85.00
## Type.of.Travel Class Flight.Distance Seat.comfort
## Business travel:89693 Business:62160 Min. : 50 Min. :0.000
## Personal Travel:40187 Eco :58309 1st Qu.:1359 1st Qu.:2.000
## Eco Plus: 9411 Median :1925 Median :3.000
## Mean :1981 Mean :2.839
## 3rd Qu.:2544 3rd Qu.:4.000
## Max. :6951 Max. :5.000
## Departure.Arrival.time.convenient Food.and.drink Gate.location
## Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.00
## Median :3.000 Median :3.000 Median :3.00
## Mean :2.991 Mean :2.852 Mean :2.99
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00
## Max. :5.000 Max. :5.000 Max. :5.00
## Inflight.wifi.service Inflight.entertainment Online.support
## Min. :0.000 Min. :0.000 Min. :0.00
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.00
## Median :3.000 Median :4.000 Median :4.00
## Mean :3.249 Mean :3.383 Mean :3.52
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.00
## Max. :5.000 Max. :5.000 Max. :5.00
## Ease.of.Online.booking On.board.service Leg.room.service Baggage.handling
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000 Median :4.000
## Mean :3.472 Mean :3.465 Mean :3.486 Mean :3.696
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Checkin.service Cleanliness Online.boarding Departure.Delay.in.Minutes
## Min. :0.000 Min. :0.000 Min. :0.000 Min. : 0.00
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 0.00
## Median :3.000 Median :4.000 Median :4.000 Median : 0.00
## Mean :3.341 Mean :3.706 Mean :3.353 Mean : 14.71
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.: 12.00
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :1592.00
## Arrival.Delay.in.Minutes
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.00
## Mean : 15.05
## 3rd Qu.: 13.00
## Max. :1584.00
Before the analysis, we inspect the distribution of every variable.
Categorical variables:
ggplot(gather(airline %>% select_if(is.factor)), aes(value)) +
  geom_bar(fill = "maroon") +  # geom_bar counts each level; it takes no bins argument
  facet_wrap(~key, scales = 'free_x') +
  theme_minimal()
Numerical variables:
ggplot(gather(airline %>% select_if(is.numeric)), aes(value)) +
  geom_histogram(bins = 10, fill = "maroon") +
  facet_wrap(~key, scales = 'free_x') +
  theme_minimal()
Logistic Regression
Logistic regression is a statistical method for predicting a categorical outcome. When the target variable has two values, such as “yes” and “no”, we build a binomial logistic regression model; when it has more than two values, the model is a multinomial logistic regression. For this project we build a binomial logistic regression model because the target variable, satisfaction, takes the values “satisfied” and “dissatisfied”.
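To make the link between a linear predictor and a predicted probability concrete, here is a minimal sketch of the logistic function in base R. The intercept `b0`, slope `b1`, and the `rating` values are hypothetical and chosen only for illustration; they are not taken from the fitted model.

```r
# Logistic (sigmoid) function: maps a log-odds value to a probability
logistic <- function(log_odds) 1 / (1 + exp(-log_odds))

# Hypothetical intercept and slope, chosen only for illustration
b0 <- -1.5
b1 <- 0.8
rating <- 0:5

log_odds <- b0 + b1 * rating
prob <- logistic(log_odds)

# Probabilities always stay between 0 and 1 and rise with the rating
data.frame(rating, log_odds, prob = round(prob, 3))

# Base R ships the same function as plogis()
all.equal(prob, plogis(log_odds))
```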
Cross Validation
Check the proportion of the target variable.
prop.table(table(airline$satisfaction))
##
## dissatisfied satisfied
## 0.4526717 0.5473283
The target classes are reasonably balanced. Next, split the data into training and test sets.
RNGkind(sample.kind = "Rounding")
set.seed(100)
# sample 80% of the row indices for training; the rest form the test set
index <- sample(nrow(airline), nrow(airline)*0.8)
airline_train <- airline[index,]
airline_test <- airline[-index,]
After splitting the data, re-check the class balance of the training data.
prop.table(table(airline_train$satisfaction))
##
## dissatisfied satisfied
## 0.453303 0.546697
It is still balanced. Next, we can build the model.
Build Model
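# Fit a binomial logistic regression with all predictors.
# Note: this call fits on the full airline data (see the Call in the output below),
# so the test rows are also seen during fitting; use data = airline_train for a strict hold-out.
model_logistic <- glm(satisfaction~., data = airline, family = "binomial")
summary(model_logistic)
##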
## Call:
## glm(formula = satisfaction ~ ., family = "binomial", data = airline)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9805 -0.5776 0.1930 0.5192 3.6393
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -6.798554108 0.065301226 -104.111
## GenderMale -0.965355063 0.016444726 -58.703
## Customer.TypeLoyal Customer 1.980323291 0.024991761 79.239
## Age -0.007754971 0.000571093 -13.579
## Type.of.TravelPersonal Travel -0.780419597 0.023400688 -33.350
## ClassEco -0.729101632 0.021200175 -34.391
## ClassEco Plus -0.806549152 0.032585445 -24.752
## Flight.Distance -0.000110760 0.000008596 -12.884
## Seat.comfort 0.289617833 0.009196862 31.491
## Departure.Arrival.time.convenient -0.196737890 0.006770793 -29.057
## Food.and.drink -0.219463677 0.009344082 -23.487
## Gate.location 0.114537166 0.007638146 14.995
## Inflight.wifi.service -0.071701102 0.008874769 -8.079
## Inflight.entertainment 0.686584065 0.008294491 82.776
## Online.support 0.091800763 0.009015167 10.183
## Ease.of.Online.booking 0.221709025 0.011614610 19.089
## On.board.service 0.308841027 0.008242354 37.470
## Leg.room.service 0.223597185 0.007009861 31.898
## Baggage.handling 0.107907249 0.009282240 11.625
## Checkin.service 0.296327433 0.006931315 42.752
## Cleanliness 0.079310433 0.009654015 8.215
## Online.boarding 0.171242789 0.009945911 17.217
## Departure.Delay.in.Minutes 0.002551590 0.000741308 3.442
## Arrival.Delay.in.Minutes -0.007843130 0.000734262 -10.682
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## GenderMale < 0.0000000000000002 ***
## Customer.TypeLoyal Customer < 0.0000000000000002 ***
## Age < 0.0000000000000002 ***
## Type.of.TravelPersonal Travel < 0.0000000000000002 ***
## ClassEco < 0.0000000000000002 ***
## ClassEco Plus < 0.0000000000000002 ***
## Flight.Distance < 0.0000000000000002 ***
## Seat.comfort < 0.0000000000000002 ***
## Departure.Arrival.time.convenient < 0.0000000000000002 ***
## Food.and.drink < 0.0000000000000002 ***
## Gate.location < 0.0000000000000002 ***
## Inflight.wifi.service 0.000000000000000652 ***
## Inflight.entertainment < 0.0000000000000002 ***
## Online.support < 0.0000000000000002 ***
## Ease.of.Online.booking < 0.0000000000000002 ***
## On.board.service < 0.0000000000000002 ***
## Leg.room.service < 0.0000000000000002 ***
## Baggage.handling < 0.0000000000000002 ***
## Checkin.service < 0.0000000000000002 ***
## Cleanliness < 0.0000000000000002 ***
## Online.boarding < 0.0000000000000002 ***
## Departure.Delay.in.Minutes 0.000577 ***
## Arrival.Delay.in.Minutes < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178886 on 129879 degrees of freedom
## Residual deviance: 99982 on 129856 degrees of freedom
## AIC: 100030
##
## Number of Fisher Scoring iterations: 5
In the model above, every predictor is statistically significant for the target variable (satisfaction). To interpret each variable, we exponentiate its coefficient to obtain an odds ratio.
exp(model_logistic$coefficients)
## (Intercept) GenderMale
## 0.001115387 0.380847951
## Customer.TypeLoyal Customer Age
## 7.245084880 0.992275021
## Type.of.TravelPersonal Travel ClassEco
## 0.458213706 0.482342116
## ClassEco Plus Flight.Distance
## 0.446395856 0.999889246
## Seat.comfort Departure.Arrival.time.convenient
## 1.335916847 0.821405903
## Food.and.drink Gate.location
## 0.802949323 1.121354316
## Inflight.wifi.service Inflight.entertainment
## 0.930809071 1.986916749
## Online.support Ease.of.Online.booking
## 1.096146407 1.248208128
## On.board.service Leg.room.service
## 1.361845857 1.250567171
## Baggage.handling Checkin.service
## 1.113944421 1.344910453
## Cleanliness Online.boarding
## 1.082540326 1.186778851
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
## 1.002554848 0.992187547
Interpretation of the model (some examples):
Gender: The odds of a male customer being satisfied are 0.38 times the odds of a female customer.
Customer Type: The odds of a loyal customer being satisfied are 7.25 times the odds of a disloyal customer.
Age: Each 1-unit increase in age multiplies the odds of being satisfied by 0.99.
Class: The odds of an Eco class customer being satisfied are 0.48 times the odds of a Business class customer; for Eco Plus the factor is 0.45.
Flight distance: Each 1-unit increase in flight distance multiplies the odds of being satisfied by 0.9999.
Seat comfort: Each 1-point increase in the seat comfort rating multiplies the odds of being satisfied by 1.34.
Each interpretation above assumes that all other predictors are held constant.
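The exponentiated coefficients act on the odds, not directly on the probability. The sketch below works through one case with the Customer.TypeLoyal Customer coefficient from the summary; the 50% baseline probability is an assumption chosen only to make the arithmetic easy to follow.

```r
# Coefficient for Customer.TypeLoyal Customer, copied from the model summary
beta_loyal <- 1.980323291

# Exponentiating a coefficient gives the odds ratio
odds_ratio <- exp(beta_loyal)
odds_ratio

# Assumed baseline: a disloyal customer with a 50% chance of being satisfied
p_base <- 0.5
odds_base <- p_base / (1 - p_base)   # odds of 1

# Multiply the odds (not the probability) by the odds ratio
odds_loyal <- odds_base * odds_ratio
p_loyal <- odds_loyal / (1 + odds_loyal)
p_loyal
```

Under this assumed baseline the probability rises from 0.5 to well below 1, which shows why an odds ratio of about 7.25 does not mean "7.25 times the probability".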
Prediction
Predict on the test data with the logistic regression model and compare the predictions with the actual labels.
airline_test$pred_satisfaction <- predict(object = model_logistic, newdata = airline_test, type="response")
# label a customer as "satisfied" when the predicted probability exceeds 0.5
airline_test$pred_satisfaction_label <- ifelse(airline_test$pred_satisfaction>0.5, "satisfied", "dissatisfied") %>% as.factor()
airline_test %>% select(pred_satisfaction, pred_satisfaction_label, satisfaction)
Model Evaluation
We use confusionMatrix() to evaluate the model. A confusion matrix is a table that shows:
- TP (True Positive) = We predict the positive class, and the prediction is correct.
- TN (True Negative) = We predict the negative class, and the prediction is correct.
- FP (False Positive) = We predict the positive class, and the prediction is wrong.
- FN (False Negative) = We predict the negative class, and the prediction is wrong.
It gives us:
- Accuracy: The share of all predictions that are correct.
- Sensitivity/Recall: The share of actual positives that the model predicts as positive.
- Specificity: The share of actual negatives that the model predicts as negative.
- Pos Pred Value/Precision: The share of predicted positives that are actually positive.
confusionMatrix(data=airline_test$pred_satisfaction_label, reference=airline_test$satisfaction, positive="satisfied")
## Confusion Matrix and Statistics
##
## Reference
## Prediction dissatisfied satisfied
## dissatisfied 9564 2103
## satisfied 2129 12180
##
## Accuracy : 0.8371
## 95% CI : (0.8325, 0.8416)
## No Information Rate : 0.5499
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6708
##
## Mcnemar's Test P-Value : 0.7008
##
## Sensitivity : 0.8528
## Specificity : 0.8179
## Pos Pred Value : 0.8512
## Neg Pred Value : 0.8197
## Prevalence : 0.5499
## Detection Rate : 0.4689
## Detection Prevalence : 0.5509
## Balanced Accuracy : 0.8353
##
## 'Positive' Class : satisfied
##
The evaluation result of logistic regression model:
- Accuracy = 83.71%
- Sensitivity/ Recall = 85.28%
- Specificity = 81.79%
- Pos Pred Value/Precision = 85.12%
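These four figures can be reproduced by hand from the counts in the confusion matrix. The sketch below copies the counts from the output and applies the standard definitions, with "satisfied" as the positive class.

```r
# Counts copied from the confusion matrix above (positive class = "satisfied")
TP <- 12180  # predicted satisfied, actually satisfied
TN <- 9564   # predicted dissatisfied, actually dissatisfied
FP <- 2129   # predicted satisfied, actually dissatisfied
FN <- 2103   # predicted dissatisfied, actually satisfied

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```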
K-Nearest Neighbor
K-Nearest Neighbor (K-NN) classifies a new observation (test data) by comparing its characteristics with those of existing observations (training data). Closeness is measured by Euclidean distance, and the class is decided by majority vote among the k nearest neighbors. Because it classifies by distance, the method suits numerical predictors but not categorical ones.
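The distance-and-vote idea can be sketched in a few lines of base R. This is an illustrative implementation, not the one used by `class::knn()`; the function `knn_one` and the toy data `toy_x` and `toy_y` are hypothetical names introduced here.

```r
# Classify one new point by majority vote among its k nearest training points
knn_one <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  diffs <- sweep(train_x, 2, new_x)   # subtract new_x from each row
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote (ties go to the label that sorts first)
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

# Hypothetical toy data: two numeric predictors, two classes
toy_x <- matrix(c(1, 1,
                  1, 2,
                  2, 1,
                  6, 6,
                  7, 6,
                  6, 7), ncol = 2, byrow = TRUE)
toy_y <- c("dissatisfied", "dissatisfied", "dissatisfied",
           "satisfied", "satisfied", "satisfied")

knn_one(toy_x, toy_y, new_x = c(6.5, 6.5), k = 3)
knn_one(toy_x, toy_y, new_x = c(1.5, 1.5), k = 3)
```

Each new point takes the label of the cluster it sits in, because all three of its nearest neighbors carry that label.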
Cross Validation
RNGkind(sample.kind = "Rounding")
set.seed(100)
index2 <- sample(nrow(airline), nrow(airline)*0.8)
airline_train_knn <- airline[index2,]
airline_test_knn <- airline[-index2,]
# training predictors
train_x <- airline_train_knn %>% select_if(is.numeric) # keep all numeric columns because they will be scaled
# training target
train_y <- airline_train_knn %>% select(satisfaction) # target class kept separately
# test predictors
test_x <- airline_test_knn %>% select_if(is.numeric)
# test target
test_y <- airline_test_knn %>% select(satisfaction)
Scaling
Based on the earlier summary, the variables have different ranges, so we rescale the features with z-score standardization using scale().
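As a minimal sketch of what z-score standardization does, the example below scales a toy predictor by hand. The vectors `train_delay` and `test_delay` hold hypothetical values; the point is that the mean and standard deviation come from the training data only and are then reused for the test data.

```r
# Hypothetical toy values for one predictor
train_delay <- c(0, 10, 20, 30, 40)
test_delay  <- c(5, 50)

# Learn the mean and standard deviation from the training data only
mu    <- mean(train_delay)
sigma <- sd(train_delay)

# Apply the same transformation to both sets
train_scaled <- (train_delay - mu) / sigma
test_scaled  <- (test_delay - mu) / sigma

# scale() performs the same computation on the training values
all.equal(as.numeric(scale(train_delay)), train_scaled)
```

Reusing the training statistics keeps the test set on the same scale as the training set without letting test values influence the transformation.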
train_x <- scale(train_x)
test_x <- scale(test_x,
                center = attr(train_x, "scaled:center"), # mean of the training data
                scale = attr(train_x, "scaled:scale"))   # standard deviation of the training data
Prediction
Choose k as the square root of the number of training observations.
sqrt(nrow(train_x))
## [1] 322.3414
We get k = 322 and then predict with the K-NN method.
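The square-root rule can be written as a short calculation. The training size below is taken from the split (80% of 129,880 rows); the odd-k adjustment at the end is a common optional tweak and an assumption here, not something this analysis applies.

```r
# Number of training rows: 80% of the 129,880 observations
n_train <- 103904

# Rule of thumb: k is roughly the square root of the training size
k <- floor(sqrt(n_train))
k

# Optional tweak (an assumption, not used in this analysis):
# an odd k avoids tied votes when there are two classes
k_odd <- if (k %% 2 == 0) k + 1 else k
k_odd
```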
airline_pred_knn <- knn(train = train_x,
test = test_x,
cl = train_y$satisfaction,
k=322)
Model Evaluation
confusionMatrix(airline_pred_knn, reference = test_y$satisfaction, positive="satisfied")
## Confusion Matrix and Statistics
##
## Reference
## Prediction dissatisfied satisfied
## dissatisfied 10527 1989
## satisfied 1166 12294
##
## Accuracy : 0.8785
## 95% CI : (0.8745, 0.8825)
## No Information Rate : 0.5499
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.7562
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.8607
## Specificity : 0.9003
## Pos Pred Value : 0.9134
## Neg Pred Value : 0.8411
## Prevalence : 0.5499
## Detection Rate : 0.4733
## Detection Prevalence : 0.5182
## Balanced Accuracy : 0.8805
##
## 'Positive' Class : satisfied
##
The evaluation of K-NN:
- Accuracy = 87.85%
- Sensitivity/ Recall = 86.07%
- Specificity = 90.03%
- Pos Pred Value/Precision = 91.34%
Conclusion
- Loyal customers and customers flying in Business class are more likely to be satisfied.
- In addition, higher ratings for Seat comfort, Gate location, Inflight entertainment, Online support, Ease of Online booking, On board service, Leg room service, Baggage handling, Checkin service, Cleanliness, and Online boarding are associated with a higher likelihood of satisfaction.
- Based on the model evaluation, both models perform well. On Accuracy, Sensitivity, Specificity, and Pos Pred Value, the K-NN model scores higher than the logistic regression model.