Welcome back again, readers! It’s yours truly, Gissella Nadya, and this is my Classification 1 Learning By Building Project for Algoritma Academy. For this project, we are using a dataset from Kaggle (here or here) on Airline Customer Satisfaction, which records satisfaction alongside various factors. We are going to implement Logistic Regression and K-Nearest Neighbors models for this report. Let’s go!
library(tidyverse)
library(ggplot2)
library(class) # knn()
library(caret) # cm
library(ggmosaic)
library(kableExtra)
airline <- read_csv("airline.csv") %>% janitor::clean_names()
head(airline)
glimpse(airline)
#> Rows: 129,880
#> Columns: 23
#> $ satisfaction <chr> "satisfied", "satisfied", "satisfie…
#> $ gender <chr> "Female", "Male", "Female", "Female…
#> $ customer_type <chr> "Loyal Customer", "Loyal Customer",…
#> $ age <dbl> 65, 47, 15, 60, 70, 30, 66, 10, 56,…
#> $ type_of_travel <chr> "Personal Travel", "Personal Travel…
#> $ class <chr> "Eco", "Business", "Eco", "Eco", "E…
#> $ flight_distance <dbl> 265, 2464, 2138, 623, 354, 1894, 22…
#> $ seat_comfort <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ departure_arrival_time_convenient <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ food_and_drink <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ gate_location <dbl> 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4,…
#> $ inflight_wifi_service <dbl> 2, 0, 2, 3, 4, 2, 2, 2, 5, 2, 3, 2,…
#> $ inflight_entertainment <dbl> 4, 2, 0, 4, 3, 0, 5, 0, 3, 0, 3, 0,…
#> $ online_support <dbl> 2, 2, 2, 3, 4, 2, 5, 2, 5, 2, 3, 2,…
#> $ ease_of_online_booking <dbl> 3, 3, 2, 1, 2, 2, 5, 2, 4, 2, 3, 2,…
#> $ on_board_service <dbl> 3, 4, 3, 1, 2, 5, 5, 3, 4, 2, 3, 3,…
#> $ leg_room_service <dbl> 0, 4, 3, 0, 0, 4, 0, 3, 0, 4, 0, 2,…
#> $ baggage_handling <dbl> 3, 4, 4, 1, 2, 5, 5, 4, 1, 5, 1, 5,…
#> $ checkin_service <dbl> 5, 2, 4, 4, 4, 5, 5, 5, 5, 3, 2, 2,…
#> $ cleanliness <dbl> 3, 3, 4, 1, 2, 4, 5, 4, 4, 4, 3, 5,…
#> $ online_boarding <dbl> 2, 2, 2, 3, 5, 2, 3, 2, 4, 2, 5, 2,…
#> $ departure_delay_in_minutes <dbl> 0, 310, 0, 0, 0, 0, 17, 0, 0, 30, 4…
#> $ arrival_delay_in_minutes <dbl> 0, 305, 0, 0, 0, 0, 15, 0, 0, 26, 4…
To get to know our dataset better, here is an explanation of each variable:
clean_airline <- airline %>%
  mutate_if(~is.character(.), ~as.factor(.)) %>%
  mutate(age = as.numeric(age),
         flight_distance = as.numeric(flight_distance),
         departure_delay_in_minutes = as.numeric(departure_delay_in_minutes),
         arrival_delay_in_minutes = as.numeric(arrival_delay_in_minutes))
Checking NA Values
colSums(is.na(clean_airline))
#> satisfaction gender
#> 0 0
#> customer_type age
#> 0 0
#> type_of_travel class
#> 0 0
#> flight_distance seat_comfort
#> 0 0
#> departure_arrival_time_convenient food_and_drink
#> 0 0
#> gate_location inflight_wifi_service
#> 0 0
#> inflight_entertainment online_support
#> 0 0
#> ease_of_online_booking on_board_service
#> 0 0
#> leg_room_service baggage_handling
#> 0 0
#> checkin_service cleanliness
#> 0 0
#> online_boarding departure_delay_in_minutes
#> 0 0
#> arrival_delay_in_minutes
#> 393
After checking for NA values, we found that 393 observations are NA in the arrival_delay_in_minutes column. Here, we are going to assume that these NA values are 0.
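As a minimal sketch of this zero-fill step, the toy vector below (hypothetical values, not taken from the dataset) shows the replacement in base R. Keeping the fill value numeric means the result stays numeric without any conversion afterwards.

```r
# Toy delays in minutes with missing entries (hypothetical values)
delay <- c(0, 305, NA, 15, NA)

# Replace every NA with 0, keeping the vector numeric
delay_filled <- ifelse(is.na(delay), 0, delay)

delay_filled
sum(is.na(delay_filled))  # number of NA values left after the fill
```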
# Fill missing arrival delays with a numeric 0 so the column stays numeric
clean_airline$arrival_delay_in_minutes <- ifelse(is.na(clean_airline$arrival_delay_in_minutes), 0, clean_airline$arrival_delay_in_minutes)
To simplify our modeling, we are going to change the dissatisfied category into 0 and the satisfied category into 1. We recode customer_type the same way, with disloyal customers as 0 and loyal customers as 1.
clean_airline <- clean_airline %>%
  mutate(satisfaction = as.factor(case_when(satisfaction == "dissatisfied" ~ "0", TRUE ~ "1")),
         customer_type = as.factor(case_when(customer_type == "disloyal Customer" ~ "0",
                                             TRUE ~ "1")))
To get to know more about our data, we are going to do an Exploratory Data Analysis. We are not going to be thorough in this part as our focus is on the Log Reg and K-NN models.
summary(clean_airline)
#> satisfaction gender customer_type age
#> 0:58793 Female:65899 0: 23780 Min. : 7.00
#> 1:71087 Male :63981 1:106100 1st Qu.:27.00
#> Median :40.00
#> Mean :39.43
#> 3rd Qu.:51.00
#> Max. :85.00
#> type_of_travel class flight_distance seat_comfort
#> Business travel:89693 Business:62160 Min. : 50 Min. :0.000
#> Personal Travel:40187 Eco :58309 1st Qu.:1359 1st Qu.:2.000
#> Eco Plus: 9411 Median :1925 Median :3.000
#> Mean :1981 Mean :2.839
#> 3rd Qu.:2544 3rd Qu.:4.000
#> Max. :6951 Max. :5.000
#> departure_arrival_time_convenient food_and_drink gate_location
#> Min. :0.000 Min. :0.000 Min. :0.00
#> 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.00
#> Median :3.000 Median :3.000 Median :3.00
#> Mean :2.991 Mean :2.852 Mean :2.99
#> 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00
#> Max. :5.000 Max. :5.000 Max. :5.00
#> inflight_wifi_service inflight_entertainment online_support
#> Min. :0.000 Min. :0.000 Min. :0.00
#> 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.00
#> Median :3.000 Median :4.000 Median :4.00
#> Mean :3.249 Mean :3.383 Mean :3.52
#> 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.00
#> Max. :5.000 Max. :5.000 Max. :5.00
#> ease_of_online_booking on_board_service leg_room_service baggage_handling
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. :1.000
#> 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:3.000
#> Median :4.000 Median :4.000 Median :4.000 Median :4.000
#> Mean :3.472 Mean :3.465 Mean :3.486 Mean :3.696
#> 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:5.000
#> Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
#> checkin_service cleanliness online_boarding departure_delay_in_minutes
#> Min. :0.000 Min. :0.000 Min. :0.000 Min. : 0.00
#> 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 0.00
#> Median :3.000 Median :4.000 Median :4.000 Median : 0.00
#> Mean :3.341 Mean :3.706 Mean :3.353 Mean : 14.71
#> 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.: 12.00
#> Max. :5.000 Max. :5.000 Max. :5.000 Max. :1592.00
#> arrival_delay_in_minutes
#> Min. : 0.00
#> 1st Qu.: 0.00
#> Median : 0.00
#> Mean : 15.05
#> 3rd Qu.: 13.00
#> Max. :1584.00
ggplot(gather(clean_airline %>% select_if(is.numeric)), aes(value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~key, scales = 'free_x')
A fun insight indeed: compared to female passengers, more male passengers are dissatisfied with the airline's service.
Logistic regression is a classification algorithm that fits a regression curve, y = f(x), where y is a categorical variable. When the dependent variable is binary (for example, 1 for spam and 0 for not-spam), the model is called binomial logistic regression; when the dependent variable has more than two classes, it is called multinomial logistic regression.
Our case is binomial because the target variable takes the values 1 and 0, where 1 is satisfied and 0 is dissatisfied.
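To make the link between the linear predictor and the predicted probability concrete, here is a small sketch of the logistic (inverse logit) function in base R. The input values are arbitrary illustrations, not model output.

```r
# Logistic (inverse logit) function: maps log-odds to a probability in (0, 1)
logistic <- function(log_odds) 1 / (1 + exp(-log_odds))

log_odds <- c(-4, -1, 0, 1, 4)  # arbitrary example values
prob <- logistic(log_odds)
round(prob, 3)

# Base R provides the same function as plogis()
all.equal(prob, plogis(log_odds))
```

A log-odds of 0 corresponds to a probability of 0.5, which is why 0.5 is the natural default cutoff between the two classes.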
prop.table(table(clean_airline$satisfaction))
#>
#> 0 1
#> 0.4526717 0.5473283
A split of roughly 45:55 is reasonably balanced for our target variable.
RNGkind(sample.kind = "Rounding")
set.seed(598)
index <- sample(nrow(clean_airline),
                nrow(clean_airline) * 0.8)
airline_train <- clean_airline[index, ]
airline_test <- clean_airline[-index, ]
model_logreg <- glm(formula = satisfaction ~ .,
                    data = airline_train,
                    family = binomial("logit"))
summary(model_logreg)
#>
#> Call:
#> glm(formula = satisfaction ~ ., family = binomial("logit"), data = airline_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.9873 -0.5741 0.1912 0.5171 3.5917
#>
#> Coefficients:
#> Estimate Std. Error z value
#> (Intercept) -6.822591352 0.073216843 -93.183
#> genderMale -0.956407411 0.018407073 -51.959
#> customer_type1 1.988820649 0.028014059 70.994
#> age -0.007668534 0.000641260 -11.959
#> type_of_travelPersonal Travel -0.780669721 0.026200679 -29.796
#> classEco -0.748964211 0.023703356 -31.597
#> classEco Plus -0.783218049 0.036438417 -21.494
#> flight_distance -0.000114023 0.000009641 -11.827
#> seat_comfort 0.289408040 0.010300741 28.096
#> departure_arrival_time_convenient -0.201601106 0.007594677 -26.545
#> food_and_drink -0.215013110 0.010467912 -20.540
#> gate_location 0.115611700 0.008578729 13.477
#> inflight_wifi_service -0.078420308 0.009959073 -7.874
#> inflight_entertainment 0.689561404 0.009298993 74.154
#> online_support 0.094209823 0.010102754 9.325
#> ease_of_online_booking 0.223764332 0.013017619 17.189
#> on_board_service 0.307225318 0.009235947 33.264
#> leg_room_service 0.218333969 0.007850168 27.813
#> baggage_handling 0.110937290 0.010390525 10.677
#> checkin_service 0.298166736 0.007761246 38.417
#> cleanliness 0.084748768 0.010815271 7.836
#> online_boarding 0.172143142 0.011150624 15.438
#> departure_delay_in_minutes 0.002727487 0.000825623 3.304
#> arrival_delay_in_minutes -0.007837885 0.000817985 -9.582
#> Pr(>|z|)
#> (Intercept) < 0.0000000000000002 ***
#> genderMale < 0.0000000000000002 ***
#> customer_type1 < 0.0000000000000002 ***
#> age < 0.0000000000000002 ***
#> type_of_travelPersonal Travel < 0.0000000000000002 ***
#> classEco < 0.0000000000000002 ***
#> classEco Plus < 0.0000000000000002 ***
#> flight_distance < 0.0000000000000002 ***
#> seat_comfort < 0.0000000000000002 ***
#> departure_arrival_time_convenient < 0.0000000000000002 ***
#> food_and_drink < 0.0000000000000002 ***
#> gate_location < 0.0000000000000002 ***
#> inflight_wifi_service 0.00000000000000343 ***
#> inflight_entertainment < 0.0000000000000002 ***
#> online_support < 0.0000000000000002 ***
#> ease_of_online_booking < 0.0000000000000002 ***
#> on_board_service < 0.0000000000000002 ***
#> leg_room_service < 0.0000000000000002 ***
#> baggage_handling < 0.0000000000000002 ***
#> checkin_service < 0.0000000000000002 ***
#> cleanliness 0.00000000000000465 ***
#> online_boarding < 0.0000000000000002 ***
#> departure_delay_in_minutes 0.000955 ***
#> arrival_delay_in_minutes < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 143114 on 103903 degrees of freedom
#> Residual deviance: 79746 on 103880 degrees of freedom
#> AIC: 79794
#>
#> Number of Fisher Scoring iterations: 5
customer_type1 <- 1.988820649
exp(customer_type1)
#> [1] 7.306911
Example interpretation:
- For customer type 1 (Loyal Customer), the odds of being satisfied are about 7.31 times the odds for a disloyal customer, holding the other predictors constant.
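The same conversion applies to any coefficient. As a sketch, the estimates below are copied from the model summary above; exponentiating them gives odds ratios, where a value below 1 means lower odds of satisfaction and a value above 1 means higher odds.

```r
# Estimates copied from the model summary (log-odds scale)
log_odds <- c(customer_type1         = 1.988820649,
              genderMale             = -0.956407411,
              inflight_entertainment = 0.689561404)

# Exponentiate to move from log-odds to odds ratios
odds_ratio <- exp(log_odds)
round(odds_ratio, 2)
```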
log_prob <- predict(model_logreg,
                    newdata = airline_test,
                    type = "response")
log_label <- as.factor(ifelse(log_prob > 0.5,
                              yes = "1",
                              no = "0"))
head(log_label)
#> 1 2 3 4 5 6
#> 0 0 0 1 1 0
#> Levels: 0 1
class(log_label)
#> [1] "factor"
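The 0.5 cutoff is a choice rather than a requirement. As a sketch with made-up probabilities (not taken from `log_prob`), the snippet below shows how the predicted labels shift when the threshold moves. The helper name `to_label` is purely illustrative.

```r
# Hypothetical predicted probabilities of class "1"
prob <- c(0.10, 0.45, 0.55, 0.70, 0.95)

# Illustrative helper: turn probabilities into factor labels
to_label <- function(p, threshold = 0.5) {
  factor(ifelse(p > threshold, "1", "0"), levels = c("0", "1"))
}

to_label(prob)       # default 0.5 cutoff
to_label(prob, 0.6)  # stricter cutoff yields fewer "1" labels
```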
K-NN relies on distances between observations, so it works best with numeric predictors. Therefore, in this pre-processing step we drop the categorical variables when building the train and test predictor sets, then scale the remaining numeric predictors.
# predictor
airline_train_x <- airline_train %>% select(-c(satisfaction, gender, customer_type, type_of_travel, class))
airline_test_x <- airline_test %>% select(-c(satisfaction, gender, customer_type, type_of_travel, class))
# target
airline_train_y <- airline_train %>% select(satisfaction)
airline_test_y <- airline_test %>% select(satisfaction)
airline_train_xs <- scale(x = airline_train_x)
airline_test_xs <- scale(x = airline_test_x,
                         center = attr(airline_train_xs, "scaled:center"),
                         scale = attr(airline_train_xs, "scaled:scale"))
head(airline_test_xs)
#> age flight_distance seat_comfort departure_arrival_time_convenient
#> [1,] 0.5001273 0.4693603 -2.037774 -1.958847
#> [2,] 2.0248780 -1.5856046 -2.037774 -1.958847
#> [3,] -0.6268623 -0.0857724 -2.037774 -1.958847
#> [4,] 1.7597040 -1.7092920 -2.037774 -1.958847
#> [5,] 1.0967689 -1.8592752 -2.037774 -1.958847
#> [6,] 1.2293559 -1.8290838 -2.037774 -1.958847
#> food_and_drink gate_location inflight_wifi_service inflight_entertainment
#> [1,] -1.977142 0.007046417 -2.4659491 -1.0284106
#> [2,] -1.977142 0.007046417 0.5675381 -0.2859376
#> [3,] -1.977142 0.007046417 -0.9492055 -2.5133567
#> [4,] -1.977142 0.007046417 -0.9492055 1.1990085
#> [5,] -1.977142 0.007046417 1.3259099 -0.2859376
#> [6,] -1.977142 0.007046417 -0.1908337 -0.2859376
#> online_support ease_of_online_booking on_board_service leg_room_service
#> [1,] -1.1638199 -0.3594178 0.4231493 0.399446
#> [2,] 0.3670479 -1.1250754 -1.1504694 -2.696657
#> [3,] -1.1638199 -1.1250754 1.2099587 0.399446
#> [4,] 1.1324818 1.1718973 1.2099587 -2.696657
#> [5,] 1.1324818 0.4062398 0.4231493 -2.696657
#> [6,] -0.3983860 -0.3594178 -0.3636601 -2.696657
#> baggage_handling checkin_service cleanliness online_boarding
#> [1,] 0.2654027 -1.0643002 -0.6111931 -1.0418249
#> [2,] -1.4625503 0.5217398 -1.4794651 1.2679964
#> [3,] 1.1293792 1.3147598 0.2570788 -1.0418249
#> [4,] 1.1293792 1.3147598 1.1253508 -0.2718845
#> [5,] -2.3265268 1.3147598 0.2570788 0.4980560
#> [6,] -2.3265268 -1.0643002 -0.6111931 1.2679964
#> departure_delay_in_minutes arrival_delay_in_minutes
#> [1,] 7.73134704 7.5236118680
#> [2,] -0.38470109 -0.3898527594
#> [3,] -0.38470109 -0.3898527594
#> [4,] 0.06037252 -0.0006659745
#> [5,] -0.38470109 -0.3898527594
#> [6,] 0.84579653 0.8555449524
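The reason for reusing the training attributes is that the test set should be transformed with the training mean and standard deviation rather than its own. A toy sketch with hypothetical numbers:

```r
# Hypothetical one-column train and test matrices
train <- matrix(c(10, 20, 30, 40, 50), ncol = 1, dimnames = list(NULL, "x"))
test  <- matrix(c(30, 60), ncol = 1, dimnames = list(NULL, "x"))

train_s <- scale(train)

# Reuse the training center and scale for the test set
test_s <- scale(test,
                center = attr(train_s, "scaled:center"),
                scale  = attr(train_s, "scaled:scale"))

attr(train_s, "scaled:center")  # training mean
test_s[, "x"]                   # test values expressed in training units
```

A test value equal to the training mean maps to 0, regardless of what the test set's own mean happens to be.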
round(sqrt(nrow(airline_train)))
#> [1] 322
k = 322
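The square-root rule is only a starting heuristic. One common tweak, shown here as an assumption and not what this report uses, is to nudge an even k to the next odd number so that a two-class vote cannot tie. The function name `choose_k` is hypothetical.

```r
# Square-root heuristic for k, nudged to an odd number to avoid tied votes
choose_k <- function(n_train) {
  k <- round(sqrt(n_train))
  if (k %% 2 == 0) k <- k + 1
  k
}

# 103904 is the training row count implied by the model summary
choose_k(103904)
```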
knn_label <- knn(train = airline_train_xs,
                 test = airline_test_xs,
                 cl = airline_train_y$satisfaction,
                 k = 322)
head(knn_label)
#> [1] 0 1 0 1 1 1
#> Levels: 0 1
class(knn_label)
#> [1] "factor"
To evaluate our models, we can use the confusionMatrix() function from the caret library. A confusion matrix is a table that shows four categories: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). From it we can compute four metrics to evaluate a model: Accuracy, Sensitivity, Specificity, and Precision.
| | Actual No | Actual Yes |
|---|---|---|
| Predicted No | True Negative | False Negative |
| Predicted Yes | False Positive | True Positive |
\[ Accuracy = \frac{TP + TN} {TP + TN + FP + FN } \]
\[ Sensitivity = \frac{TP} {TP + FN}\]
\[ Specificity = \frac{TN}{TN+FP}\]
\[ Precision = \frac{TP} {TP + FP}\]
cm_log <- confusionMatrix(data = log_label,
                          reference = airline_test$satisfaction,
                          positive = "1")
cm_log
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 9587 2177
#> 1 2158 12054
#>
#> Accuracy : 0.8331
#> 95% CI : (0.8285, 0.8376)
#> No Information Rate : 0.5479
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.6632
#>
#> Mcnemar's Test P-Value : 0.7846
#>
#> Sensitivity : 0.8470
#> Specificity : 0.8163
#> Pos Pred Value : 0.8482
#> Neg Pred Value : 0.8149
#> Prevalence : 0.5479
#> Detection Rate : 0.4640
#> Detection Prevalence : 0.5471
#> Balanced Accuracy : 0.8316
#>
#> 'Positive' Class : 1
#>
cm_knn <- confusionMatrix(data = knn_label,
                          reference = airline_test$satisfaction,
                          positive = "1")
cm_knn
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 10595 1980
#> 1 1150 12251
#>
#> Accuracy : 0.8795
#> 95% CI : (0.8755, 0.8834)
#> No Information Rate : 0.5479
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.7583
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.8609
#> Specificity : 0.9021
#> Pos Pred Value : 0.9142
#> Neg Pred Value : 0.8425
#> Prevalence : 0.5479
#> Detection Rate : 0.4716
#> Detection Prevalence : 0.5159
#> Balanced Accuracy : 0.8815
#>
#> 'Positive' Class : 1
#>
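As a cross-check, the four metrics defined earlier can be recomputed by hand from the counts in the two confusion matrices above. The helper name `metrics_from_counts` is illustrative, not a caret function.

```r
# Compute the four metrics from confusion-matrix counts
metrics_from_counts <- function(tp, tn, fp, fn) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

# Counts copied from the confusion matrices above (positive class = "1")
logreg_metrics <- metrics_from_counts(tp = 12054, tn = 9587,  fp = 2158, fn = 2177)
knn_metrics    <- metrics_from_counts(tp = 12251, tn = 10595, fp = 1150, fn = 1980)

round(rbind(logreg_metrics, knn_metrics), 4)
```

Precision here is the same quantity that caret reports as Pos Pred Value.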
As the Model Evaluation shows, both of our models perform well across Accuracy, Recall (Sensitivity), Specificity, and Precision. More precisely, our K-Nearest Neighbors model performs better on every one of them: accuracy 0.8795 versus 0.8331, recall 0.8609 versus 0.8470, specificity 0.9021 versus 0.8163, and precision 0.9142 versus 0.8482. Which metric matters most depends on what we want to achieve. For example, if we focus only on the positive or “satisfied” class, we may prioritize the model with the higher precision. If instead we care about correct positive and negative outcomes alike, we might prioritize the model with the higher accuracy. In this case both criteria point to the K-NN model, as it is better in all aspects.
Thank you for taking the time to read my report. Feel free to comment or give me feedback. Get in touch with me here!