In this assignment, we consider a subset of airline customer satisfaction survey data (for the original data see https://www.ibm.com/communities/analytics/watson-analytics-blog/sample-data-airline-survey/)
Please see the data below:
# Load the packages used throughout this assignment
library(rpart)         # classification trees
library(rpart.plot)    # tree plots
library(ggplot2)       # bar plot
library(caret)         # confusionMatrix()
library(ipred)         # bagging()
library(randomForest)  # random forests
library(knitr)         # kable()
library(kableExtra)    # kable_styling()
survey <- read.table("C:/Users/cmart/OneDrive/Documents/Certificate in Business Analytics/Predictive Analytics/Module 06_Assignment/Satisfaction.csv", header = TRUE, sep=",", dec=".")
head(survey)
## Satisfaction Airline.Status Age Gender PriceSensitivity YearofFirstFlight
## 1 4 Blue 56 Male 2 2006
## 2 4 Blue 43 Male 1 2007
## 3 5 Silver 49 Male 1 2006
## 4 5 Gold 49 Female 1 2010
## 5 4 Silver 33 Male 1 2010
## 6 4 Blue 44 Female 1 2003
## FlighOtherAirlines TypeofTravel LoyaltyCards Class DelayMinutes
## 1 3 Business travel 0 Business 2
## 2 9 Business travel 2 Eco 26
## 3 10 Business travel 0 Eco 0
## 4 4 Business travel 1 Eco 0
## 5 17 Business travel 2 Eco 0
## 6 6 Business travel 0 Eco 0
## FlightCancelled FlightTime FlightDistance ArrivalDelay
## 1 No 120 821 no
## 2 No 141 821 yes
## 3 No 144 853 no
## 4 No 123 821 no
## 5 No 138 821 no
## 6 No 114 853 no
str(survey)
## 'data.frame': 1182 obs. of 15 variables:
## $ Satisfaction : int 4 4 5 5 4 4 4 4 2 5 ...
## $ Airline.Status : chr "Blue" "Blue" "Silver" "Gold" ...
## $ Age : int 56 43 49 49 33 44 51 28 39 46 ...
## $ Gender : chr "Male" "Male" "Male" "Female" ...
## $ PriceSensitivity : int 2 1 1 1 1 1 1 1 1 1 ...
## $ YearofFirstFlight : int 2006 2007 2006 2010 2010 2003 2005 2009 2005 2012 ...
## $ FlighOtherAirlines: int 3 9 10 4 17 6 7 33 8 24 ...
## $ TypeofTravel : chr "Business travel" "Business travel" "Business travel" "Business travel" ...
## $ LoyaltyCards : int 0 2 0 1 2 0 0 2 0 1 ...
## $ Class : chr "Business" "Eco" "Eco" "Eco" ...
## $ DelayMinutes : int 2 26 0 0 0 0 0 0 0 13 ...
## $ FlightCancelled : chr "No" "No" "No" "No" ...
## $ FlightTime : int 120 141 144 123 138 114 118 145 156 114 ...
## $ FlightDistance : int 821 821 853 821 821 853 821 853 853 853 ...
## $ ArrivalDelay : chr "no" "yes" "no" "no" ...
# We need a numeric copy of the response for boosting, as a boosting response cannot be a factor
survey$satisfaction_boosting <- as.integer(as.character(survey$Satisfaction))
# Code the response variable (Satisfaction) as a factor for classification
survey$Satisfaction <- as.factor(survey$Satisfaction)
str(survey)
## 'data.frame': 1182 obs. of 16 variables:
## $ Satisfaction : Factor w/ 5 levels "1","2","3","4",..: 4 4 5 5 4 4 4 4 2 5 ...
## $ Airline.Status : chr "Blue" "Blue" "Silver" "Gold" ...
## $ Age : int 56 43 49 49 33 44 51 28 39 46 ...
## $ Gender : chr "Male" "Male" "Male" "Female" ...
## $ PriceSensitivity : int 2 1 1 1 1 1 1 1 1 1 ...
## $ YearofFirstFlight : int 2006 2007 2006 2010 2010 2003 2005 2009 2005 2012 ...
## $ FlighOtherAirlines : int 3 9 10 4 17 6 7 33 8 24 ...
## $ TypeofTravel : chr "Business travel" "Business travel" "Business travel" "Business travel" ...
## $ LoyaltyCards : int 0 2 0 1 2 0 0 2 0 1 ...
## $ Class : chr "Business" "Eco" "Eco" "Eco" ...
## $ DelayMinutes : int 2 26 0 0 0 0 0 0 0 13 ...
## $ FlightCancelled : chr "No" "No" "No" "No" ...
## $ FlightTime : int 120 141 144 123 138 114 118 145 156 114 ...
## $ FlightDistance : int 821 821 853 821 821 853 821 853 853 853 ...
## $ ArrivalDelay : chr "no" "yes" "no" "no" ...
## $ satisfaction_boosting: int 4 4 5 5 4 4 4 4 2 5 ...
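For reference, a minimal sketch of where this numeric copy would be used, assuming the gbm package (and that the categorical columns are converted to factors, as in the optional chunk further below); it was not run as part of this assignment:
# Illustrative only (output not shown): gbm's gaussian loss requires a numeric
# response, which is why satisfaction_boosting was created above
library(gbm)
set.seed(123)
boost.sketch <- gbm(satisfaction_boosting ~ . - Satisfaction, data = survey,
                    distribution = "gaussian", n.trees = 500, interaction.depth = 2)
summary(boost.sketch)  # relative influence of each predictor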
Our variable to predict is Satisfaction, which is coded as a factor in the R chunk above. You can use summary(survey) to check whether the variables are categorical or numeric.
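A quick check along those lines might look like the following; the factor conversion is optional and was not part of the original chunk, but it is the representation tree packages such as rpart and randomForest typically expect for categorical predictors:
# Inspect variable types; the chr columns are the categorical predictors
summary(survey)
# Optional: convert character columns to factors
chr_cols <- sapply(survey, is.character)
survey[chr_cols] <- lapply(survey[chr_cols], as.factor)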
# 80% of the sample size
smp_size <- floor(0.80 * nrow(survey))
# Set the seed to make the partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(survey)), size = smp_size)
survey_train <- survey[train_ind, ]
survey_test <- survey[-train_ind, ]
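Since the satisfaction classes turn out to be imbalanced (see the bar plot below), a quick sanity check that the random split roughly preserves the class mix can be worthwhile; a small sketch (output not shown):
# Compare class proportions in the training and test partitions
round(prop.table(table(survey_train$Satisfaction)), 3)
round(prop.table(table(survey_test$Satisfaction)), 3)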
# Fit a classification tree on columns 1:15 (satisfaction_boosting is excluded)
model.ct <- rpart(formula=Satisfaction~., data=data.frame(survey_train[,1:15]), method="class", parms=list(split="gini"))
rpart.plot(model.ct, roundint=FALSE)
# Bar plot of the Satisfaction variable to check for class imbalance
ggplot(survey, aes(x = Satisfaction)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Satisfaction Classes",
       x = "Satisfaction Levels",
       y = "Frequency") +
  theme_minimal()
prop.table(table(survey$Satisfaction))
##
## 1 2 3 4 5
## 0.01353638 0.13959391 0.26649746 0.46362098 0.11675127
While no action was taken to address the class imbalance in the satisfaction scores, that imbalance is likely a key reason the original and pruned classification trees contain only one split. Only one predictor, “TypeofTravel”, was used in the model. That a single split suffices suggests “TypeofTravel” dominates the response: at the default complexity settings, no other variable reduces the cross-validated error enough to justify a further split. No further action was taken to bring additional predictors into the model, but a sketch of how the complexity controls could be relaxed follows.
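Purely for illustration (the parameter values below are untuned assumptions, and this was not run for the assignment), relaxing rpart's complexity controls would invite more splits:
# Sketch (output not shown): lowering cp and minsplit lets rpart consider more
# candidate splits, which may pull in predictors beyond TypeofTravel at the
# risk of overfitting
model.ct.deep <- rpart(Satisfaction ~ ., data = survey_train[, 1:15], method = "class",
                       parms = list(split = "gini"),
                       control = rpart.control(cp = 0.001, minsplit = 10))
printcp(model.ct.deep)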
printcp(model.ct) # note: the tree with 1 split has the lowest cross-validation error (xerror)
##
## Classification tree:
## rpart(formula = Satisfaction ~ ., data = data.frame(survey_train[,
## 1:15]), method = "class", parms = list(split = "gini"))
##
## Variables actually used in tree construction:
## [1] TypeofTravel
##
## Root node error: 513/945 = 0.54286
##
## n= 945
##
## CP nsplit rel error xerror xstd
## 1 0.14035 0 1.00000 1.00000 0.029852
## 2 0.01000 1 0.85965 0.85965 0.029895
# Prune back to the cp value with the lowest cross-validated error (xerror)
ptree <- prune(model.ct, cp=model.ct$cptable[which.min(model.ct$cptable[,"xerror"]),"CP"])
rpart.plot(ptree, roundint=FALSE, main="Pruned Classification Tree, Airline Satisfaction Survey")
(10 points) Predict the Satisfaction in the test
data, and calculate the following:
# Utilize survey_test data to predict values
pred.ct <- predict(object = ptree, newdata = survey_test[,1:15], type="class")
# Calculate the confusion matrix for the test set
confusionMatrix(data = pred.ct, reference = survey_test$Satisfaction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 0 0 0 0 0
## 2 0 19 18 6 2
## 3 0 0 0 0 0
## 4 2 8 50 110 22
## 5 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5443
## 95% CI : (0.4786, 0.6089)
## No Information Rate : 0.4895
## P-Value [Acc > NIR] : 0.05214
##
## Kappa : 0.2168
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.000000 0.70370 0.0000 0.9483 0.0000
## Specificity 1.000000 0.87619 1.0000 0.3223 1.0000
## Pos Pred Value NaN 0.42222 NaN 0.5729 NaN
## Neg Pred Value 0.991561 0.95833 0.7131 0.8667 0.8987
## Prevalence 0.008439 0.11392 0.2869 0.4895 0.1013
## Detection Rate 0.000000 0.08017 0.0000 0.4641 0.0000
## Detection Prevalence 0.000000 0.18987 0.0000 0.8101 0.0000
## Balanced Accuracy 0.500000 0.78995 0.5000 0.6353 0.5000
# Calculate the testing error (the accuracy rate [0.5443] and error rate [0.4556962] sum to 1)
1 - mean(survey_test$Satisfaction == pred.ct)
## [1] 0.4556962
(10 points) For the same data, train a bagging algorithm, predict
the Satisfaction in the test data, and calculate the
following two:
# Set a seed for reproducibility of the bagged classification trees
set.seed(888)
# Train a bagged model
satisfaction_model <- bagging(formula = Satisfaction ~ ., data = survey_train[,1:15], coob = TRUE)
print(satisfaction_model)
##
## Bagging classification trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = Satisfaction ~ ., data = survey_train[,
## 1:15], coob = TRUE)
##
## Out-of-bag estimate of misclassification error: 0.5365
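The 25 bootstrap replications reported above are ipred's default; raising nbagg typically stabilizes the OOB estimate. A sketch with an illustrative (untuned) value, not run here:
# Sketch (output not shown): more bootstrap replications for a steadier OOB estimate
satisfaction_model_100 <- bagging(formula = Satisfaction ~ ., data = survey_train[, 1:15],
                                  nbagg = 100, coob = TRUE)
print(satisfaction_model_100)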
# Utilize survey_test data to predict values
pred.bag <- predict(object = satisfaction_model, newdata = survey_test[,1:15], type = "class")
# Calculate the confusion matrix for the test set
confusionMatrix(data = pred.bag, reference = survey_test$Satisfaction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 0 0 1 0 0
## 2 0 9 12 4 2
## 3 0 10 15 17 1
## 4 2 8 35 94 19
## 5 0 0 5 1 2
##
## Overall Statistics
##
## Accuracy : 0.5063
## 95% CI : (0.4408, 0.5717)
## No Information Rate : 0.4895
## P-Value [Acc > NIR] : 0.3245
##
## Kappa : 0.1843
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.000000 0.33333 0.22059 0.8103 0.083333
## Specificity 0.995745 0.91429 0.83432 0.4711 0.971831
## Pos Pred Value 0.000000 0.33333 0.34884 0.5949 0.250000
## Neg Pred Value 0.991525 0.91429 0.72680 0.7215 0.903930
## Prevalence 0.008439 0.11392 0.28692 0.4895 0.101266
## Detection Rate 0.000000 0.03797 0.06329 0.3966 0.008439
## Detection Prevalence 0.004219 0.11392 0.18143 0.6667 0.033755
## Balanced Accuracy 0.497872 0.62381 0.52745 0.6407 0.527582
# Calculate the testing error (the accuracy rate [0.5063] and error rate [0.4936709] sum to 1)
1 - mean(survey_test$Satisfaction == pred.bag)
## [1] 0.4936709
(10 points) For the same data, train a random forest algorithm,
predict the Satisfaction in the test data, and calculate
the following:
# Note: no seed is set before this call, so random forest results vary slightly between runs
satisfaction.rf <- randomForest(Satisfaction ~ . , data = survey_train[,1:15], importance = TRUE, proximity = TRUE)
print(satisfaction.rf)
##
## Call:
## randomForest(formula = Satisfaction ~ ., data = survey_train[, 1:15], importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 49.21%
## Confusion matrix:
## 1 2 3 4 5 class.error
## 1 0 5 3 6 0 1.0000000
## 2 1 57 39 41 0 0.5869565
## 3 0 41 53 152 1 0.7854251
## 4 0 15 44 369 4 0.1458333
## 5 0 2 9 102 1 0.9912281
# Identify which variables had the largest impact on the predicted satisfaction score (random forest variable importance)
head(satisfaction.rf$importance)
## 1 2 3 4
## Airline.Status 0.0020333333 -0.0004787364 0.0061191534 -0.0007581262
## Age -0.0020222222 0.0007361151 0.0000644354 0.0155799741
## Gender 0.0031412698 -0.0048913144 0.0009058809 0.0009868869
## PriceSensitivity -0.0002976190 0.0004691513 0.0024171788 0.0032764053
## YearofFirstFlight 0.0009507937 -0.0008897768 -0.0030980038 0.0028270536
## FlighOtherAirlines -0.0038928571 -0.0034129429 -0.0016138300 0.0039582781
## 5 MeanDecreaseAccuracy MeanDecreaseGini
## Airline.Status 0.0007804311 0.0013248092 22.04699
## Age -0.0001660357 0.0072733192 92.61813
## Gender 0.0083199384 0.0010901851 17.90408
## PriceSensitivity 0.0033941825 0.0025932498 25.54298
## YearofFirstFlight -0.0008987827 0.0003362741 59.35590
## FlighOtherAirlines -0.0042260674 0.0003029533 72.66563
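The same information is easier to read graphically; randomForest ships a built-in importance plot (not included in the original output):
# Plot MeanDecreaseAccuracy and MeanDecreaseGini for all predictors
varImpPlot(satisfaction.rf, main = "Variable Importance, Airline Satisfaction")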
# Grab OOB error matrix
err <- satisfaction.rf$err.rate
plot(satisfaction.rf)
# Add a legend since it doesn't have one by default
legend(x = "right", legend = colnames(err), fill = 1:ncol(err))
# Calculate prediction performance using survey_test data
pred.rf <- predict(object = satisfaction.rf, newdata = survey_test[,1:15], type = "class")
# Calculate the confusion matrix
confusionMatrix(data = pred.rf, reference = survey_test$Satisfaction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5
## 1 0 0 0 0 0
## 2 0 11 11 5 2
## 3 0 8 14 8 1
## 4 2 8 43 102 20
## 5 0 0 0 1 1
##
## Overall Statistics
##
## Accuracy : 0.5401
## 95% CI : (0.4744, 0.6048)
## No Information Rate : 0.4895
## P-Value [Acc > NIR] : 0.06753
##
## Kappa : 0.2155
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.000000 0.40741 0.20588 0.8793 0.041667
## Specificity 1.000000 0.91429 0.89941 0.3967 0.995305
## Pos Pred Value NaN 0.37931 0.45161 0.5829 0.500000
## Neg Pred Value 0.991561 0.92308 0.73786 0.7742 0.902128
## Prevalence 0.008439 0.11392 0.28692 0.4895 0.101266
## Detection Rate 0.000000 0.04641 0.05907 0.4304 0.004219
## Detection Prevalence 0.000000 0.12236 0.13080 0.7384 0.008439
## Balanced Accuracy 0.500000 0.66085 0.55265 0.6380 0.518486
# Calculate the testing error (the accuracy rate [0.5401] and error rate [0.4599156] sum to 1)
1 - mean(survey_test$Satisfaction == pred.rf)
## [1] 0.4599156
The tree model exhibits the lowest error rate, and the highest accuracy, among the three models. I therefore recommend that the airline use the tree model, rather than the bagging or random forest models, to predict customer satisfaction. However, the underlying survey data are imbalanced, with most satisfaction scores being 2 or 4, so work should be done to improve model performance. Candidate remedies include class weighting, SMOTE, random forests with class weights, and boosting: these methods give the model more exposure to the minority classes (satisfaction scores 1, 3, and 5), reduce overfitting, and aggregate predictions in ways that improve performance on those classes. Note that while the error rate was relatively high for the tree model (0.456), it was higher still for bagging (0.494) and random forest (0.460); simply switching to bagging or random forests therefore does not lower the error rate.

Without further adjustments, satisfaction scores 1, 3, and 5 are never predicted by the standard tree model, and the only splitting variable used is “TypeofTravel”. This follows from (1) rpart's complexity pruning, which rejects splits that do not sufficiently reduce the cross-validated error, and (2) the dominant effect of “TypeofTravel” on the satisfaction score relative to the other, less informative predictors. Addressing the class imbalance and finding ways to bring in additional splitting variables (for example, encoding categorical predictors numerically, creating interactions between predictors such as “TypeofTravel” and “Class”, or lowering the cp in the rpart call to allow a more complex tree) would likely lower the error rate and improve model performance. A sketch of one such adjustment follows.
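As a concrete sketch of one option named above (illustrative, not run for this assignment), randomForest accepts class priors via its classwt argument; inverse-frequency weights give the minority classes more pull:
# Sketch (output not shown): inverse-frequency class weights to counter the
# imbalance toward satisfaction scores 2 and 4
cls.freq <- table(survey_train$Satisfaction)
rf.weighted <- randomForest(Satisfaction ~ ., data = survey_train[, 1:15],
                            classwt = as.numeric(1 / cls.freq))
print(rf.weighted)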
# Compare decision tree methods (tree, bagging, random forest) - error rate
Models <- c("tree", "bagging", "random forest")
Error.Rate <- c(0.4556962, 0.4936709, 0.4599156)
text_tbl <- data.frame(Models, Error.Rate)
kable(text_tbl) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
| Models | Error.Rate |
|---|---|
| tree | 0.4556962 |
| bagging | 0.4936709 |
| random forest | 0.4599156 |
Random forest, the third model created, combines bootstrap sampling (bagging) with random feature selection. It runs many decision trees, each on a different bootstrap variation of the same underlying survey dataset, and averages the outcomes (for regression models) or takes the most common outcome (for classification models) across the N trees. For the survey dataset above, the satisfaction.rf algorithm grew N = 500 trees, each constructed from a distinct bootstrap sample of the training data. This use of many variations of one underlying dataset is, in large part, what enables random forest models to achieve a lower error rate than a standard decision tree built from a single training set: it lowers the chance of overfitting, and it partially mitigates class imbalance by increasing the probability that minority classes are represented somewhere among the individual trees.

Each of the 500 trees in satisfaction.rf is trained independently. At every decision node, the splitting variable is chosen not from all predictors but from a small pool of randomly selected ones, which introduces a further element of randomness; in satisfaction.rf this pool contains 3 variables. Drawing splits from a random subset of predictors further reduces the risk of overfitting and ensures that more predictors are represented within the model. Once all trees are trained, their predictions are aggregated: for a classification model, the final prediction is assigned by majority vote across all 500 (N) trees. This per-node randomness is a key strength of random forests and explains why they often outperform a standard decision tree model.

Because of the dominance of “TypeofTravel” and the class imbalance in the satisfaction scores, the standard decision tree never predicted several satisfaction levels and used only 1 of the 14 available predictors; satisfaction.rf addressed that by spreading splits across many predictors. Despite this, the random forest model still performed slightly worse than the standard tree model in terms of accuracy, so further adjustment is evidently needed. Note that accuracy and error were determined by comparing the satisfaction.rf model's predictions against the actual observations in the survey_test data.
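The majority-vote aggregation described above can be inspected directly: predict() can return the per-class vote fractions from the 500 trees, and the predicted class is simply the column with the largest share (output not shown):
# Per-class vote fractions across the 500 trees for the first few test rows
votes <- predict(satisfaction.rf, newdata = survey_test[, 1:15], type = "vote")
head(round(votes, 3))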