library(caret)
library(tidyverse)
library(recipes)
library(doParallel)
library(stringr)
library(kableExtra)
library(ggpubr)
library(ROSE)
setwd("E:/Estudio/R/Proyectos/Airplane_satisfaction")Clients are without a doubt, one of the top worries every company has. Without them, it would be impossible for any company to thrive in the market. For that reason, keeping said clients satisfied and happy with ones services is a top priority for any company that tries to be successful and recognized.
Based on this, we will be working with an airline that wants to keep as many customers happy as possible. We are therefore interested in two things. First, understanding which properties of a flight and/or passenger are related to the client being satisfied or dissatisfied with the service they received; knowing this, the company can invest in ways to address the main issues. Second, building a predictive model that can accurately predict whether a person will be satisfied after receiving that service, so that the airline can offer more personal attention when a specific customer is very likely to be dissatisfied.
To do this, a dataset containing 30000 passengers who flew with the company will be used for training, while a separate set of 10000 passengers will ultimately be used for the performance comparison between models. The data contains 23 variables describing characteristics of both the passenger and the flight they took, including whether that passenger was ultimately satisfied. Some of the variables that make up this dataset are:
Gender: Gender of the passenger.
Customer.Type: Loyal or not loyal customer.
Age: Age of the passenger.
Type.of.Travel: Personal or business.
Class: Eco, Eco plus or business.
Flight.Distance: Distance to be flown.
Departure.Delay.in.Minutes: Time of delay in minutes.
Satisfaction: Satisfied or not satisfied (response variable).
Apart from these variables, the dataset also contains a number of variables rating the client's satisfaction with specific services on the flight (like wi-fi service or onboard service). However, since we want to predict the satisfaction of a client before the flight has ended, most of these service-rating variables naturally have no place in our project.
Nonetheless, these variables do hold important value for the purpose of this project. Knowing that each of them is on a 0-5 scale (where 0 means the service was not available), we will create new binary variables indicating whether a person's flight offered that specific service (1) or not (0).
# Load datasets (train.csv for training; test.csv is assumed to hold the
# 10000 records reserved for the final comparison)
airplane = read.csv("train.csv")
airplaneTest = read.csv("test.csv")
# Dropping id variables
airplane1 = airplane[,-c(1,2,24)]
airplane1$Ease.of.Online.booking = as.numeric(airplane1$Ease.of.Online.booking)
# Function to binarize a service rating: 1 if the flight had the service,
# 0 if not (only applied when the variable has enough zeros)
HaveService = function(service){
if (sum(service == 0) >= 1000){
service = ifelse(service == 0, 0, 1)
}
else{service = rep("no data",length(service))}
return(service)
}
# Mutating every satisfaction variable to be 0 or 1 (where there are zeros),
# on both sets (assuming the test file shares the training layout)
airplane[,7:20] = apply(airplane[,7:20], 2, FUN = HaveService)
airplaneTest[,7:20] = apply(airplaneTest[,7:20], 2, FUN = HaveService)
# Dropping all satisfaction variables that can't be converted
col_noData = sapply(colnames(airplane), function(x) {"no data" %in% airplane[,x]})
airplane = airplane[,!col_noData]
airplaneTest = airplaneTest[,!col_noData]
airplane$Ease.of.Online.booking = as.numeric(airplane$Ease.of.Online.booking)
airplaneTest$Ease.of.Online.booking = as.numeric(airplaneTest$Ease.of.Online.booking)
# Outputting the table in a neat format
kbl(airplane1[1:6,1:7]) %>%
  kable_styling(latex_options = "HOLD_position")

|       | Gender | Customer.Type     | Age | Type.of.Travel  | Class    | Flight.Distance | Inflight.wifi.service |
|-------|--------|-------------------|-----|-----------------|----------|-----------------|-----------------------|
| 31334 | Female | Loyal Customer    | 56  | Business travel | Business | 373             | 2                     |
| 97785 | Male   | Loyal Customer    | 48  | Business travel | Business | 3895            | 2                     |
| 3974  | Male   | Loyal Customer    | 33  | Business travel | Business | 83              | 0                     |
| 32140 | Female | Loyal Customer    | 16  | Personal Travel | Eco Plus | 235             | 3                     |
| 87132 | Female | Loyal Customer    | 58  | Personal Travel | Eco      | 775             | 1                     |
| 93244 | Female | disloyal Customer | 79  | Business travel | Eco      | 102             | 2                     |
Now that we only have the variables we are really interested in, we can begin an exploratory and descriptive analysis that will give us a good graphical idea of how each of these properties relates to the subsequent satisfaction of the passenger.
Firstly, let’s look at the frequency distribution of the response variable, satisfaction.
ggplot(airplane1, aes(x = satisfaction, fill = satisfaction))+
geom_bar()+
theme_minimal(9)+
labs(title = "Frequency by satisfaction level")+
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5,size = 15,vjust = 0))+
  scale_fill_manual(values = c("#E1695D","#5DE19A"))

It is clear that, in general, more people end up dissatisfied than satisfied. To be specific, 22653 people out of the entire 40000 were dissatisfied with the provided service, while only the remaining 17347 were satisfied. This means that around 56.6% of the people do not feel satisfied with the service and 43.4% do. Based on this, any model we build has to be more than 56% accurate when classifying customers into these two classes since, otherwise, we would be better off always predicting dissatisfaction.
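As a quick sanity check, these proportions are easy to reproduce directly (a minimal sketch; note the 40000 figure covers the training and test records combined, while airplane1 holds the 30000 training rows).

# Class frequencies and proportions on the training records
table(airplane1$satisfaction)
prop.table(table(airplane1$satisfaction))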
It might be possible that one gender is, in general, easier to please or maybe a little more permissive than the other one. Let us take a look at the proportion for each gender that feels satisfied or unsatisfied.
# Plot for satisfaction by gender
ggplot(airplane1, aes(x = Gender, fill = satisfaction))+
geom_bar(position = "fill")+
theme_minimal(9)+
labs(y = "Proportion",title = "Satisfaction ratio by gender")+
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5,size = 15,vjust = 0))+
  scale_fill_manual(values = c("#E1695D","#5DE19A"))

Despite our initial suspicions, it seems clear that, marginally, gender does not make a significant difference in the satisfaction of a person: the proportion of satisfied males is almost exactly the same as the proportion of satisfied females.
This, however, does not mean that gender should not be taken into consideration when building the model, since its effect on the response variable might be masked by an interaction with a different variable.
Similar to what we mentioned about gender, it is to be expected that a loyal customer is one who is generally satisfied with the overall service of the airline, which means that said person is probably more likely to be satisfied on their next flight as well.
# Plot for satisfaction by customer type
ggplot(airplane1, aes(x = Customer.Type, fill = satisfaction))+
geom_bar(position = "fill")+
theme_minimal(9)+
labs(y = "Proportion",title = "Satisfaction ratio by customer type")+
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5,size = 15,vjust = 0))+
  scale_fill_manual(values = c("#E1695D","#5DE19A"))

Now, unlike the previous plot for gender, there seems to be a significant difference in how likely a person is to be satisfied if they are a loyal customer compared to a "disloyal" one. Specifically, a loyal client is about twice as likely to be satisfied with the service as their counterpart.
Now, does age make a difference when it comes to satisfaction? It might be reasonable to suggest that younger people are not easily bothered and can be satisfied with less effort.
# Plot for the age by satisfaction level
ggplot(airplane1, aes(x = satisfaction,y = Age, fill = satisfaction))+
geom_boxplot()+
theme_minimal(9)+
labs(y = "Age",
title = "Age distribution by satisfaction level")+
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5,size = 15,vjust = 0))+
  scale_fill_manual(values = c("#E1695D","#5DE19A"))

There might be a very slight difference in age between satisfied and dissatisfied customers. However, our initial hypothesis was wrong: satisfied people tend to be a little older than dissatisfied ones, with a median of around 45 years versus around 37 years for the dissatisfied group.
It is very logical to believe that people are far more likely to feel satisfied when they fly in business class in comparison to those who fly in eco or eco plus, simply because there are more and better services.
On the other hand, it is not crazy to assume that people who travel for personal reasons are a little pickier than those who travel for business, since the latter are probably only worried about making it to their destination and not so much about being exceptionally comfortable.
# Plot for the satisfaction by type of travel
p1 = ggplot(airplane1, aes(x = Type.of.Travel, fill = satisfaction))+
geom_bar(position = "fill")+
theme_minimal(9)+
labs(y = "Proportion",
title = "Satisfaction ratio by type of travel")+
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5,size = 15,vjust = 0))+
scale_fill_manual(values = c("#E1695D","#5DE19A"))
# Plot for the satisfaction by class
p2 = ggplot(airplane1, aes(x = Class, fill = satisfaction))+
geom_bar(position = "fill")+
theme_minimal(9)+
labs(y = "Proportion",
title = "Satisfaction ratio by flight class")+
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5,size = 15,vjust = 0))+
scale_fill_manual(values = c("#E1695D","#5DE19A"))
ggarrange(p1,p2)

As we suspected, it is extremely clear that people who travel for business are far more likely to be satisfied than those who travel for personal reasons. Specifically, around 65% of business travelers are satisfied, while only around 10% of personal travelers end up satisfied.
However, this might be due not to what we mentioned before, but to the fact that people traveling for business reasons tend to fly in business class.
ggplot(airplane1, aes(x = Type.of.Travel, fill = satisfaction))+
geom_bar(position = "fill")+
theme_minimal(9)+
labs(y = "Proportion",
title = "Satisfaction ratio by type of travel for each class")+
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5,size = 15,vjust = 0))+
scale_fill_manual(values = c("#E1695D","#5DE19A"))+
  facet_wrap(~Class)

It is true that business travelers in business class are more likely to be satisfied than business travelers in any other class. However, it is very clear that, across all classes, people traveling for personal reasons are more likely to be dissatisfied.
Also, it is very interesting that, for personal travel, the class does not really seem to matter: the probability of being satisfied stays roughly the same.
We might be inclined to believe that, on a longer flight, it will be easier for people to end up dissatisfied, simply because of the discomfort that traveling for too long represents.
# Plot for the flight distance by satisfaction level
ggplot(airplane1, aes(x = satisfaction,y = Flight.Distance, fill = satisfaction))+
geom_boxplot()+
theme_minimal(9)+
labs(y = "Flight Distance",
title = "Flight distance distribution by satisfaction level")+
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5,size = 15,vjust = 0))+
  scale_fill_manual(values = c("#E1695D","#5DE19A"))

Contrary to what we suggested, satisfied people tend to be traveling longer distances than their dissatisfied counterparts. This is possibly because people are more willing to pay for business class on longer flights, and because longer flights might be more likely to offer better services.
ggplot(airplane1, aes(x = satisfaction,y = Flight.Distance, fill = satisfaction))+
geom_boxplot()+
theme_minimal(9)+
labs(y = "Flight Distance",
title = "Flight distance distribution by satisfaction level for each class")+
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5,size = 15,vjust = 0))+
scale_fill_manual(values = c("#E1695D","#5DE19A"))+
  facet_wrap(~Class)

As we had suspected, longer flights only seem to deliver more satisfied customers when those customers fly in business class. For the other two classes the difference does not appear to be significant.
Finally, although it will not be portrayed here, even though we might believe the departure delay time could be extremely important for the satisfaction of a customer, there does not seem to be, at least marginally, a significant effect of the delay on said satisfaction. This might be because the vast majority of flights leave on time and only a handful actually have some sort of delay.
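For reference, a quick way to check this claim is to summarize the delay within each satisfaction group (a sketch; it assumes Departure.Delay.in.Minutes carries no special codes beyond possible missing values).

# Delay summaries by satisfaction group, and the share of on-time departures
tapply(airplane1$Departure.Delay.in.Minutes, airplane1$satisfaction, summary)
mean(airplane1$Departure.Delay.in.Minutes == 0, na.rm = TRUE)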
Next up, it is very important to get our data into a usable format for the models we will be fitting. This means imputing missing data and constructing dummy variables, which take the value 1 if a customer meets a certain criterion and 0 otherwise.
Let us then create a recipe object that will specify the pre-processing steps over the dataset.
obj_recipe = recipe(satisfaction ~ ., data = airplane)
obj_recipe = obj_recipe %>%
step_impute_bag(-satisfaction) %>% # Using bagging to perform the imputation
step_dummy(Gender,Customer.Type, Type.of.Travel, Class) # Creating dummy variables for the categories
obj_recipe

## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 9
##
## Operations:
##
## Bagged tree imputation for -satisfaction
## Dummy variables from Gender, Customer.Type, Type.of.Travel, Class
The recipe object now has all the desired specifications for the preprocessing of the data. Next, we apply said recipe to our original dataset, which results in the dataset we will ultimately use to fit our models.
train_recipe = prep(obj_recipe, training = airplane)
airplanePrep = bake(train_recipe, airplane)

Now that our data has been processed, we can start training and fitting the models that we will be implementing.
Before anything else, we will use recursive feature elimination with a bootstrapped random forest in order to find the set of variables that yields the best possible results when we build the models.
# Creating 6 instances on R
cs = makeCluster(6)
registerDoParallel(cs)
# Specifying the subset of variables to be used
subsets = 3:10
# Number of bootstrapping repetitions
reps = 10
# Creating seeds for the fitting
set.seed(886)
seeds = vector(mode = "list",length = reps + 1)
for (i in 1:reps){
seeds[[i]] = sample.int(1000, length(subsets))
}
seeds[[reps+1]] = sample.int(1000,1)
# Defining parameters to run
ctrl_rfe <- rfeControl(functions = rfFuncs, method = "boot", number = reps,
returnResamp = "all",
allowParallel = TRUE, verbose = FALSE,
seeds = seeds)
# Performing recursive feature elimination
set.seed(4232)
rf_rfe = rfe(satisfaction ~., data = airplanePrep,
             sizes = subsets,
             metric = "Accuracy",
             ntree = 200,
             rfeControl = ctrl_rfe)
rf_rfe

##
## Recursive feature selection
##
## Outer resampling method: Bootstrapped (10 reps)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 3 0.7825 0.5672 0.002273 0.004508
## 4 0.7849 0.5707 0.004043 0.006595
## 5 0.7895 0.5786 0.002100 0.004370
## 6 0.7924 0.5829 0.003757 0.006412
## 7 0.7951 0.5882 0.004079 0.006781
## 8 0.7996 0.5950 0.002967 0.005229
## 9 0.8074 0.6099 0.002021 0.005017 *
## 10 0.8073 0.6096 0.002252 0.005444
##
## The top 5 variables (out of 9):
## Type.of.Travel_Personal.Travel, Customer.Type_Loyal.Customer, Ease.of.Online.booking, Departure.Delay.in.Minutes, Age
stopCluster(cs)

Performing recursive feature elimination, with accuracy as the selection criterion, we end up keeping 9 of the 10 available variables for the models we will fit next. This basically means that dropping variables does not really yield any improvement for our models.
airplanePrep = airplanePrep[,c(rf_rfe$optVariables,"satisfaction")]

It is important to mention that, in order to select the best parameters for each model, we will use cross-validation on every iteration of a new model. That way we can get an accurate estimate of the accuracy of each particular model and understand which parameters are optimal.
Let us then start fitting the models. We will begin with a simple naive Bayes classifier and then work our way up to more complicated models. This initial model, as its name suggests, is based on Bayes' theorem which, let us remember, can be written as:
\[P(Y = j| x) = \frac{P(x|Y = j) \times P(Y = j)}{P(x)}\]
Where \(x\) is the vector containing the observed values for the p predictors of one customer.
Since the naive Bayes classifier assumes that all features are independent and equally important, the probability of observing a certain vector \(x\) given \(Y = j\) can be calculated as:
\[P(x|Y = j) = P(x_1|Y = j)\times P(x_2|Y = j)\times ... \times P(x_p|Y = j)\]
Where \(P(x_i|Y = j)\) represents the probability to get the observed value for the i-th feature given that \(Y = j\).
This ultimately leads to calculating the probability of belonging to the j-th class, given the p features, as:
\[P(Y = j | x ) = \frac{P(x_1|Y = j)\times P(x_2|Y = j)\times ... \times P(x_p|Y = j)\times P(Y = j)}{P(x)}\]
Where \(P(x)\) is a normalization constant and \(P(Y = j)\) is the prior probability of an observation being part of the j-th class.
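To make the mechanics concrete, here is a toy posterior computation on a single binary feature (the conditional probabilities are made up for illustration; only the prior matches the class frequencies we saw earlier).

# Hypothetical P(wifi | class) values; prior from the observed class balance
p_wifi_sat = 0.90   # assumed P(x = wifi | Y = satisfied)
p_wifi_dis = 0.70   # assumed P(x = wifi | Y = dissatisfied)
p_sat = 0.434       # prior P(Y = satisfied)
num = p_wifi_sat * p_sat
num / (num + p_wifi_dis * (1 - p_sat)) # P(Y = satisfied | wifi), about 0.496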
# Create 6 parallel instances in R
cs = makeCluster(6)
registerDoParallel(cs)
# Number of folds for cross-validation
parts = 10
# Number of times the model will be repeated
reps = 6
# Hyperparameters to be included in our model
hyperparameters = data.frame(usekernel = FALSE, fL = 0, adjust = 0)
# Creating seeds to maintain replicable work
set.seed(992)
seeds = vector(mode = "list", length = parts*reps+1)
for (i in 1:(parts*reps)){
seeds[[i]] = sample.int(1000, nrow(hyperparameters))
}
seeds[[parts*reps+1]] = sample.int(1000,1)
# Defining control options for the model
nb_control = trainControl(method = "repeatedcv",number = parts,
repeats = reps, verboseIter = FALSE,
returnResamp = "final",seeds = seeds,
allowParallel = TRUE)
# Fitting the model using "accuracy" as the performance metric
set.seed(8827)
nb_model = train(satisfaction ~. , data = airplanePrep,
method = "nb",tuneGrid = hyperparameters,
trControl = nb_control, metric = "Accuracy")
nb_model

## Naive Bayes
##
## 30000 samples
## 9 predictor
## 2 classes: 'neutral or dissatisfied', 'satisfied'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 6 times)
## Summary of sample sizes: 27000, 27000, 27000, 26999, 27000, 27000, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7808113 0.5521478
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'usekernel' was held constant at a value of FALSE
## Tuning
## parameter 'adjust' was held constant at a value of 0
# Stopping the open instances
stopCluster(cs)

With the most basic type of model, the naive Bayes classifier, and using repeated cross-validation, we obtain an estimated accuracy of around 78%, which implies that this model correctly classifies each customer as satisfied or dissatisfied 78% of the time. This is a very decent start for our classifier, but it is to be expected that more flexible models will return better results.
For the next step, we will fit a logistic regression. This model is composed of a linear component passed through an activation function (the logistic function in this case) in order to return a value between 0 and 1, which can be interpreted as the probability of success. Said model can be written as follows:
\[\log\left(\frac{p}{1-p} \right) = \beta_0 + \sum_{i=1}^p \beta_iX_i\]
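Solving for \(p\) gives the predicted probability directly:

\[p = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i=1}^p \beta_iX_i\right)}}\]

which is the value the model thresholds (at 0.5 by default) to classify a passenger as satisfied or dissatisfied.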
Now, we will fit a logistic regression as portrayed above and, just like with the naive Bayes model, 10-fold cross-validation will be used to obtain a first estimate of the accuracy of our model.
# Create 6 parallel instances in R
cs = makeCluster(6)
registerDoParallel(cs)
# Number of folds for cross-validation
parts = 10
# Number of times the model will be repeated
reps = 6
# Hyperparameters to be included in our model
hyperparameters = data.frame(parameter = "none")
# Creating seeds to maintain replicable work
set.seed(992)
seeds = vector(mode = "list", length = parts*reps+1)
for (i in 1:(parts*reps)){
seeds[[i]] = sample.int(1000, nrow(hyperparameters))
}
seeds[[parts*reps+1]] = sample.int(1000,1)
# Defining control options for the model
log_control = trainControl(method = "repeatedcv",number = parts,
repeats = reps, verboseIter = FALSE,
returnResamp = "final",seeds = seeds,
allowParallel = TRUE)
# Fitting the model with "accuracy" as performance metric, we also specify binomial family
# to work with a logistic regression
set.seed(8827)
log_model = train(satisfaction ~., data = airplanePrep,
method = "glm",tuneGrid = hyperparameters,
trControl = log_control, metric = "Accuracy",
family = "binomial")
log_model

## Generalized Linear Model
##
## 30000 samples
## 9 predictor
## 2 classes: 'neutral or dissatisfied', 'satisfied'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 6 times)
## Summary of sample sizes: 27000, 27000, 27000, 26999, 27000, 27000, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7897445 0.5752051
# Stopping the open instances
stopCluster(cs)

We then see that the logistic regression represents a slight improvement in accuracy, with an estimated value of 78.97%; similarly to the naive Bayes model, this means our model correctly classifies the satisfaction of a passenger 78.97% of the time. This is an increase of about 0.9% from one model to the next, which does not seem like much but has to be taken into account.
It was to be expected that we would not see an immense improvement since this is still a fairly inflexible model. Another interesting thing we can get from this model is the significance level of each predictor, as we will see next.
coefLog_model = summary(log_model)$coefficients
kbl(coefLog_model) %>%
  kable_styling(latex_options = "HOLD_position")

|                                |   Estimate | Std. Error |     z value | Pr(>\|z\|) |
|--------------------------------|-----------:|-----------:|------------:|-----------:|
| (Intercept)                    |  1.3546757 |  0.0908672 |  14.9082993 |  0.0000000 |
| Type.of.Travel_Personal.Travel | -2.5046449 |  0.0464644 | -53.9045589 |  0.0000000 |
| Customer.Type_Loyal.Customer   |  1.8271408 |  0.0435256 |  41.9785085 |  0.0000000 |
| Ease.of.Online.booking         | -1.9469513 |  0.0763949 | -25.4853703 |  0.0000000 |
| Departure.Delay.in.Minutes     | -0.0042655 |  0.0004148 | -10.2832963 |  0.0000000 |
| Age                            | -0.0004940 |  0.0011007 |  -0.4488417 |  0.6535459 |
| Flight.Distance                |  0.0000135 |  0.0000167 |   0.8057957 |  0.4203607 |
| Gender_Male                    |  0.0129126 |  0.0296442 |   0.4355853 |  0.6631376 |
| Class_Eco                      | -1.1924865 |  0.0374232 | -31.8648676 |  0.0000000 |
| Class_Eco.Plus                 | -1.3945038 |  0.0628364 | -22.1925972 |  0.0000000 |
Those variables with a p-value lower than \(\alpha = 0.1\) are the ones whose effect is significant in the model. Under that premise, gender, age and flight distance would not be significant. However, this does not mean those three variables should be dropped when fitting the upcoming models: first, the effect of age and gender might be masked by an interaction with a different variable and, second, we already saw that this set of variables results in the best estimated accuracy.
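A complementary way to read these coefficients is through odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of satisfaction per unit increase in that predictor (a small sketch built on the fitted model above).

# Odds ratios for the logistic regression; exp(beta) < 1 means the
# variable lowers the odds of being satisfied
exp(coef(log_model$finalModel))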
We can also visualize the importance of each of these variables in the fitted logistic regression by looking at their z-values: the higher the absolute value, the more important the variable.
# Turn Table with coefficients into DataFrame
coefLog_model = as.data.frame(coefLog_model)
# Plot for the importance of the variables
ggplot(coefLog_model[-1,], aes(x = abs(`z value`),
y = reorder(rownames(coefLog_model[-1,]),abs(`z value`)),
fill = ifelse(abs(`z value`) >
qnorm(0.95, lower.tail = T),
# Color by significance
"Significant","Not significant")))+
geom_col()+
theme_minimal(9)+
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5,size = 15,vjust = 0))+
labs(fill = "Significance",
y = "Predictor",title = "Variable importance in logistic regression")+
  scale_fill_manual(values = c("#FAA28F","#8FF3FA"))

Based on this plot, we can clearly see that Type of travel and Customer type are the variables with the biggest effect in this logistic regression when describing the probability of being satisfied or dissatisfied.
Up until this point, fitting the logistic regression, we assumed a linear relationship between the predictors and the response variable. However, it is reasonable to believe that the actual relationship is non-linear instead. Because of that, as a first non-linear approach, we will fit a quadratic discriminant analysis (QDA) model and see whether the results are better or worse.
For this model we have:
\[x \sim N(\mu_k, \Sigma_k)\] where k denotes the k-th class of the problem. Based on this, for every class we have:
\[f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\right)\]
Returning to Bayes' theorem, as seen next:
\[P(Y = k| X = x) = \frac{\pi_kf_k(x)}{\sum_{j=1}^K \pi_jf_j(x)} \]
Which ultimately results in:
\[P(Y = k| X = x) \propto \pi_k \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\right)\] And if we calculate the logarithm of the previous expression we have:
\[\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + log(\pi_k)- \frac{p}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_k|)\]
A new observation \(x_0\) will be classified as part of the class where \(\delta_k(x_0)\) results in the highest value.
# Create 6 parallel instances in R
cs = makeCluster(6)
registerDoParallel(cs)
# Number of folds for cross-validation
parts = 10
# Number of times the model will be repeated
reps = 6
# Hyperparameters to be included in our model
hyperparameters = data.frame(parameter = "none")
# Creating seeds to maintain replicable work
set.seed(992)
seeds = vector(mode = "list", length = parts*reps+1)
for (i in 1:(parts*reps)){
seeds[[i]] = sample.int(1000, nrow(hyperparameters))
}
seeds[[parts*reps+1]] = sample.int(1000,1)
# Defining control options for the model
qda_control = trainControl(method = "repeatedcv",number = parts,
repeats = reps, verboseIter = FALSE,
returnResamp = "final",seeds = seeds,
allowParallel = TRUE)
# Fitting the model with "accuracy" as performance metric
set.seed(8827)
qda_model = train(satisfaction ~., data = airplanePrep,
method = "qda",tuneGrid = hyperparameters,
trControl = qda_control, metric = "Accuracy"
)
qda_model

## Quadratic Discriminant Analysis
##
## 30000 samples
## 9 predictor
## 2 classes: 'neutral or dissatisfied', 'satisfied'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 6 times)
## Summary of sample sizes: 27000, 27000, 27000, 26999, 27000, 27000, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7859446 0.5693871
# Stopping the open instances
stopCluster(cs)

Quadratic discriminant analysis returns an accuracy, estimated using 10-fold cross-validation, equal to 78.6%, which is slightly worse than the one we estimated for the logistic regression. However, this small difference might be due to random factors.
Despite the very slight difference, it is important to remember that QDA assumes all predictors are normally distributed. This assumption probably does not hold for our variables (several are binary dummies), which would hurt the accuracy estimate; it does not necessarily mean a linear relation works better than a non-linear one.
Next up, let us fit a random forest model. This model is known as an ensemble method, which basically means it is constructed from multiple simpler models. In this case, a random forest is a group of decision trees where, additionally, each tree sees only a random subset of the predictor variables.
Let's also remember that a decision tree is built by creating nodes in a way that, in this case, minimizes the Gini impurity, defined as a measure of how often a randomly chosen element from the set would be incorrectly labeled if labels were assigned according to their distribution in the subset:
\[I_g(p) = 1 - \sum_{i = 1}^j p_i^2\]
where j is the number of classes and \(p_i\) is the fraction of items labeled with the class \(i\) in the set.
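For instance, a node holding 70% dissatisfied and 30% satisfied customers has \(I_g = 1 - (0.7^2 + 0.3^2) = 0.42\), while a perfectly pure node has \(I_g = 0\); each split is chosen to reduce this impurity as much as possible.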
If we have B decision trees, noted as:
\[\hat{f}^{*1}(x),\hat{f}^{*2}(x),...,\hat{f}^{*B}(x)\]
then the final predicted class for a new observation will be the class most frequently predicted among those B trees.
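As a tiny illustration of that majority vote (hypothetical predictions from three trees over five customers):

# Each row holds one tree's predicted classes; the final prediction per
# customer is the most frequent class down each column
votes = rbind(tree1 = c("sat","dis","dis","sat","sat"),
              tree2 = c("sat","dis","sat","sat","dis"),
              tree3 = c("dis","dis","sat","sat","sat"))
apply(votes, 2, function(v) names(which.max(table(v))))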
It is also important to mention that this is the first model we implement that has tunable parameters. Because of this, using 10-fold cross-validation, we want to estimate which parameter values return the best-performing model. These parameters are:
mtry: Number of randomly selected variables available to each decision tree.
min.node.size: Minimum size of a node for it to be split.
splitrule: Splitting rule (Gini impurity in our case).
# Create 6 parallel instances in R
cs = makeCluster(6)
registerDoParallel(cs)
# Number of folds for cross-validation
parts = 10
# Number of times the model will be repeated
reps = 3
# Hyperparameters to be included in our model
hyperparameters = expand.grid(mtry = c(4,6,8),
splitrule = "gini",
min.node.size = c(30,45))
# Creating seeds to maintain replicable work
set.seed(992)
seeds = vector(mode = "list", length = parts*reps+1)
for (i in 1:(parts*reps)){
seeds[[i]] = sample.int(1000, nrow(hyperparameters))
}
seeds[[parts*reps+1]] = sample.int(1000,1)
# Defining control options for the model
rf_control = trainControl(method = "repeatedcv",number = parts,
repeats = reps, verboseIter = FALSE,
returnResamp = "final",seeds = seeds,
allowParallel = TRUE)
# Fitting the model with "accuracy" as performance metric, we also need to pass
# the number of trees
set.seed(8827)
rf_model = train(satisfaction ~., data = airplanePrep,
                 method = "ranger",tuneGrid = hyperparameters,
                 trControl = rf_control, metric = "Accuracy",
                 num.trees = 150
                 )
rf_model

## Random Forest
##
## 30000 samples
## 9 predictor
## 2 classes: 'neutral or dissatisfied', 'satisfied'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 27000, 27000, 27000, 26999, 27000, 27000, ...
## Resampling results across tuning parameters:
##
## mtry min.node.size Accuracy Kappa
## 4 30 0.8081776 0.6115668
## 4 45 0.8086221 0.6123269
## 6 30 0.8031999 0.6010181
## 6 45 0.8048776 0.6044275
## 8 30 0.8012664 0.5970767
## 8 45 0.8037665 0.6021670
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
## and min.node.size = 45.
# Stopping the open instances
stopCluster(cs)

After performing 10-fold cross-validation for each of the candidate models, we find that the best random forest uses mtry equal to 4, meaning every decision tree draws from only 4 randomly selected predictors out of the 9 available, and min.node.size equal to 45, meaning a node must contain at least 45 observations in order to be split.
Using this combination of parameters, we estimated an accuracy of approximately 80.86%, which is a significant improvement over the less flexible models we had previously fitted.
Finally, we will fit a gradient boosting model in order to classify customers as satisfied or dissatisfied. Like the random forest, this is an ensemble model but, unlike the random forest, it is built by concatenating trees sequentially, using the residuals from the previous trees as the response variable, as sketched below.
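Here is a minimal sketch of that sequential idea on a toy regression problem (illustrative only, not the exact gbm algorithm): each tree is fitted to the residuals left by the ones before it, and its prediction is added in, scaled by a learning rate.

library(rpart)
set.seed(1)
x = runif(200); y = sin(2*pi*x) + rnorm(200, sd = 0.2)
pred = rep(0, 200); shrinkage = 0.1
for (b in 1:50){
  # Fit the next tree to the current residuals and accumulate its prediction
  fit = rpart(r ~ x, data = data.frame(x = x, r = y - pred))
  pred = pred + shrinkage * predict(fit)
}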
For a model of this type we have a number of different parameters that we can tune in order to tweak the performance of our model:
n.trees: Number of decision trees to be fitted.
interaction.depth: Number of divisions each tree has.
shrinkage: Controls the influence each individual tree has on the overall model; also known as the learning rate.
n.minobsinnode: Minimum number of observations in each node for it to be divided.
# Create 6 parallel instances in R
cs = makeCluster(6)
registerDoParallel(cs)
# Number of folds for cross-validation
parts = 10
# Number of times the model will be repeated
reps = 3
# Hyperparameters to be included in our model
hyperparameters = expand.grid(interaction.depth = c(4,6),
n.trees = c(500,1000),
shrinkage = c(0.02,0.1),
n.minobsinnode = c(15,30))
# Creating seeds to maintain replicable work
set.seed(992)
seeds = vector(mode = "list", length = parts*reps+1)
for (i in 1:(parts*reps)){
seeds[[i]] = sample.int(1000, nrow(hyperparameters))
}
seeds[[parts*reps+1]] = sample.int(1000,1)
# Defining control options for the model
gb_control = trainControl(method = "repeatedcv",number = parts,
repeats = reps, verboseIter = FALSE,
returnResamp = "final",seeds = seeds,
allowParallel = TRUE)
# Fitting the model with "accuracy" as performance metric
set.seed(8827)
gb_model = train(satisfaction ~., data = airplanePrep,
method = "gbm",tuneGrid = hyperparameters,
trControl = gb_control, metric = "Accuracy",
distribution = "adaboost",verbose = FALSE
)
gb_model

## Stochastic Gradient Boosting
##
## 30000 samples
## 9 predictor
## 2 classes: 'neutral or dissatisfied', 'satisfied'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 27000, 27000, 27000, 26999, 27000, 27000, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa
## 0.02 4 15 500 0.8052332 0.6054816
## 0.02 4 15 1000 0.8060665 0.6072441
## 0.02 4 30 500 0.8052443 0.6055476
## 0.02 4 30 1000 0.8061776 0.6074907
## 0.02 6 15 500 0.8067887 0.6087262
## 0.02 6 15 1000 0.8072776 0.6096681
## 0.02 6 30 500 0.8069110 0.6089848
## 0.02 6 30 1000 0.8073665 0.6098530
## 0.10 4 15 500 0.8063332 0.6076934
## 0.10 4 15 1000 0.8052110 0.6052764
## 0.10 4 30 500 0.8063442 0.6075579
## 0.10 4 30 1000 0.8051666 0.6051382
## 0.10 6 15 500 0.8055666 0.6059506
## 0.10 6 15 1000 0.8049110 0.6045667
## 0.10 6 30 500 0.8057109 0.6062346
## 0.10 6 30 1000 0.8047220 0.6040577
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
## 6, shrinkage = 0.02 and n.minobsinnode = 30.
# Stopping the open instances
stopCluster(cs)

The previous table shows the estimated accuracy for each parameter combination passed to the model. Looking closely, we can conclude that the best model is obtained with a learning rate of 0.02, an interaction depth of 6, a minimum node size of 30 and 1000 trees. With this specific combination, the estimated accuracy (using 10-fold cross-validation) is approximately 80.74%; slightly worse than the one we estimated for the random forest.
Despite already having accuracy estimates for each of these models, we still want to go over some further metrics that can help us gauge the real-world performance of each one.
airplaneTestPrep = bake(train_recipe, new_data = airplaneTest)
nb_pred = predict(nb_model, newdata = airplaneTestPrep)
log_pred = predict(log_model, newdata = airplaneTestPrep)
qda_pred = predict(qda_model, newdata = airplaneTestPrep)
rf_pred = predict(rf_model, newdata = airplaneTestPrep)
gb_pred = predict(gb_model, newdata = airplaneTestPrep)

First, let's look at the accuracy estimated using cross-validation alongside the accuracy computed on the 10000 testing records.
acc_nb = sum(nb_pred == airplaneTestPrep$satisfaction)/nrow(airplaneTestPrep)
acc_log = sum(log_pred == airplaneTestPrep$satisfaction)/nrow(airplaneTestPrep)
acc_qda = sum(qda_pred == airplaneTestPrep$satisfaction)/nrow(airplaneTestPrep)
acc_rf = sum(rf_pred == airplaneTestPrep$satisfaction)/nrow(airplaneTestPrep)
acc_gb = sum(gb_pred == airplaneTestPrep$satisfaction)/nrow(airplaneTestPrep)
acc_metrics = data.frame(model = rep(c("Naive Bayes","Logistic regression",
"Quadratic discriminant analysis",
"Random forest","Gradient boosting"),each = 2),
validation_type =
rep(c("10-fold cross validation","Test dataset"),5),
accuracy = c(max(nb_model$results$Accuracy),acc_nb,
max(log_model$results$Accuracy),acc_log,
max(qda_model$results$Accuracy),acc_qda,
max(rf_model$results$Accuracy),acc_rf,
max(gb_model$results$Accuracy),acc_gb)
)
ggplot(acc_metrics, aes(x = reorder(model,accuracy),y = accuracy,
col = validation_type))+
geom_point(size = 3)+
theme_minimal(9)+
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5,
size = 15,vjust = 0))+
scale_color_manual(values = c("#FAA28F","#8FF3FA"))+
labs(x = "model",title = "Accuracy for each model by validation type")We can then see how our three simplest models (naive bayes, qda and logistic regression) perform extremely poorly when using a test dataset to estimate the accuracy which might be a sign of a very poor fitting of said models for this specific problem. However, we see how both the random forest and the gradient boosting models perform very similarly regardless of the validation method used; this is a very good sign specially since these result in the highest scores for accuracy for both the 10-fold cross validation and the test dataset.
We already saw that the best model is quite possibly one of the two most flexible ones we fitted (random forest or gradient boosting). A good way to reach a more confident verdict on which of the two to select (or even to pick one of the other three options) is to look at the sensitivity of the models.
Sensitivity, also known as the true positive rate, is the proportion of positive observations that we actually classify as positive. It is defined as follows:
\[TPR = \frac{TP}{TP + FN}\]
Where TP is the number of true positives (positives correctly classified) and FN is the number of false negatives (negatives incorrectly classified).
In this specific context, we are more interested in correctly flagging dissatisfied customers than in correctly classifying satisfied ones. Because of this, a high sensitivity can be critical for the selection of the model; a quick way to inspect it alongside the rest of the confusion-matrix metrics is shown below.
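For a single model, caret's confusionMatrix() reports sensitivity together with the other confusion-matrix metrics (shown here, as an example, for the logistic regression predictions computed earlier).

# The positive class defaults to the first factor level,
# 'neutral or dissatisfied', which is exactly the class we care about
confusionMatrix(log_pred, airplaneTestPrep$satisfaction)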
# sensitivity() also treats the first factor level as the positive class
nb_sens = sensitivity(nb_pred, airplaneTestPrep$satisfaction)
log_sens = sensitivity(log_pred, airplaneTestPrep$satisfaction)
qda_sens = sensitivity(qda_pred, airplaneTestPrep$satisfaction)
rf_sens = sensitivity(rf_pred, airplaneTestPrep$satisfaction)
gb_sens = sensitivity(gb_pred, airplaneTestPrep$satisfaction)
sens_metrics = data.frame(model = c("Naive Bayes","Logistic regression",
"Quadratic discriminant analysis",
"Random forest","Gradient boosting"),
sensitivity = c(nb_sens, log_sens,
qda_sens, rf_sens, gb_sens))
ggplot(sens_metrics, aes(x = reorder(model,sensitivity), y = sensitivity))+
geom_point(size = 3, col = "red")+
geom_text(aes(x = model,y = sensitivity+0.05, label = round(sensitivity,4)))+
theme_minimal(9)+
labs(x = "Model",title = "Sensitivity for each model")+
theme(plot.title = element_text(hjust = 0.5,
        size = 15,vjust = 0))

Surprisingly, the previous plot shows that the highest sensitivity is returned by the logistic regression. From it, we can conclude that said model classifies almost all dissatisfied customers correctly; specifically, about 95% of them.
It is very clear that the variables with the biggest effect on the satisfaction of a customer are the type of travel, where people traveling for personal reasons are far more likely to be dissatisfied, maybe due to the stress that a family trip can cause on some people, and the loyalty of the customer, which is to be expected since being a loyal customer is closely related to being a satisfied one.
Based on the results we saw, it is logical to select either the random forest or the gradient boosting model as the final predictive model because of their all-around versatility and predictive power when classifying satisfied and dissatisfied customers.
Even though the logistic regression did not seem to be very accurate when predicting new observations, it turns out it has an almost perfect score when classifying dissatisfied customers. Since predicting dissatisfied clients is the real main goal of this project, it is not unreasonable at all to select said logistic regression as the predictive model of choice.
If we were eventually interested in a more general model, it might be a good idea to include more variables with predictive value since, as we saw, even the best model we could fit is still far from perfect, possibly because the available variables do not carry enough predictive information.
James, Gareth, et al. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics.
Kuhn, Max, and Kjell Johnson. Applied Predictive Modeling. Springer.