Data

In this assignment, we consider a subset of an airline customer satisfaction survey (for the original data, see https://www.ibm.com/communities/analytics/watson-analytics-blog/sample-data-airline-survey/).

The data are read in and previewed below:

survey <- read.table("C:/Users/cmart/OneDrive/Documents/Certificate in Business Analytics/Predictive Analytics/Module 06_Assignment/Satisfaction.csv", header = TRUE, sep=",", dec=".")

head(survey)
##   Satisfaction Airline.Status Age Gender PriceSensitivity YearofFirstFlight
## 1            4           Blue  56   Male                2              2006
## 2            4           Blue  43   Male                1              2007
## 3            5         Silver  49   Male                1              2006
## 4            5           Gold  49 Female                1              2010
## 5            4         Silver  33   Male                1              2010
## 6            4           Blue  44 Female                1              2003
##   FlighOtherAirlines    TypeofTravel LoyaltyCards    Class DelayMinutes
## 1                  3 Business travel            0 Business            2
## 2                  9 Business travel            2      Eco           26
## 3                 10 Business travel            0      Eco            0
## 4                  4 Business travel            1      Eco            0
## 5                 17 Business travel            2      Eco            0
## 6                  6 Business travel            0      Eco            0
##   FlightCancelled FlightTime FlightDistance ArrivalDelay
## 1              No        120            821           no
## 2              No        141            821          yes
## 3              No        144            853           no
## 4              No        123            821           no
## 5              No        138            821           no
## 6              No        114            853           no
str(survey)
## 'data.frame':    1182 obs. of  15 variables:
##  $ Satisfaction      : int  4 4 5 5 4 4 4 4 2 5 ...
##  $ Airline.Status    : chr  "Blue" "Blue" "Silver" "Gold" ...
##  $ Age               : int  56 43 49 49 33 44 51 28 39 46 ...
##  $ Gender            : chr  "Male" "Male" "Male" "Female" ...
##  $ PriceSensitivity  : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ YearofFirstFlight : int  2006 2007 2006 2010 2010 2003 2005 2009 2005 2012 ...
##  $ FlighOtherAirlines: int  3 9 10 4 17 6 7 33 8 24 ...
##  $ TypeofTravel      : chr  "Business travel" "Business travel" "Business travel" "Business travel" ...
##  $ LoyaltyCards      : int  0 2 0 1 2 0 0 2 0 1 ...
##  $ Class             : chr  "Business" "Eco" "Eco" "Eco" ...
##  $ DelayMinutes      : int  2 26 0 0 0 0 0 0 0 13 ...
##  $ FlightCancelled   : chr  "No" "No" "No" "No" ...
##  $ FlightTime        : int  120 141 144 123 138 114 118 145 156 114 ...
##  $ FlightDistance    : int  821 821 853 821 821 853 821 853 853 853 ...
##  $ ArrivalDelay      : chr  "no" "yes" "no" "no" ...
# We need the variable satisfaction_boosting for boosting later, as the boosting response cannot be a factor
survey$satisfaction_boosting <- as.integer(as.character(survey$Satisfaction))
# code response variable (satisfaction) as factor 
survey$Satisfaction <- as.factor(survey$Satisfaction)
str(survey)
## 'data.frame':    1182 obs. of  16 variables:
##  $ Satisfaction         : Factor w/ 5 levels "1","2","3","4",..: 4 4 5 5 4 4 4 4 2 5 ...
##  $ Airline.Status       : chr  "Blue" "Blue" "Silver" "Gold" ...
##  $ Age                  : int  56 43 49 49 33 44 51 28 39 46 ...
##  $ Gender               : chr  "Male" "Male" "Male" "Female" ...
##  $ PriceSensitivity     : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ YearofFirstFlight    : int  2006 2007 2006 2010 2010 2003 2005 2009 2005 2012 ...
##  $ FlighOtherAirlines   : int  3 9 10 4 17 6 7 33 8 24 ...
##  $ TypeofTravel         : chr  "Business travel" "Business travel" "Business travel" "Business travel" ...
##  $ LoyaltyCards         : int  0 2 0 1 2 0 0 2 0 1 ...
##  $ Class                : chr  "Business" "Eco" "Eco" "Eco" ...
##  $ DelayMinutes         : int  2 26 0 0 0 0 0 0 0 13 ...
##  $ FlightCancelled      : chr  "No" "No" "No" "No" ...
##  $ FlightTime           : int  120 141 144 123 138 114 118 145 156 114 ...
##  $ FlightDistance       : int  821 821 853 821 821 853 821 853 853 853 ...
##  $ ArrivalDelay         : chr  "no" "yes" "no" "no" ...
##  $ satisfaction_boosting: int  4 4 5 5 4 4 4 4 2 5 ...

Our variable to predict is Satisfaction.

Questions

  1. (10 points) Split the data into a training set (80%) and a test set (20%).

You can use summary(survey) to see whether the variables are categorical or numeric.

# 80% of the sample size
smp_size <- floor(0.80 * nrow(survey))

# set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(survey)), size = smp_size)

survey_train <- survey[train_ind, ]
survey_test <- survey[-train_ind, ]
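Note that sample() draws rows uniformly at random, so the rare classes (e.g., Satisfaction = 1) can end up under-represented in one of the partitions. As an alternative sketch, assuming the caret package is available, createDataPartition() produces a stratified split that preserves the class proportions:

# Alternative (illustrative only, not used below): stratified 80/20 split that
# keeps the Satisfaction class proportions similar in both partitions
library(caret)
set.seed(123)
strat_ind <- createDataPartition(survey$Satisfaction, p = 0.80, list = FALSE)
survey_train_strat <- survey[strat_ind, ]
survey_test_strat  <- survey[-strat_ind, ]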
  2. (15 points) Run a tree model using all the variables and training data, and plot your tree. (Do not forget to use the Gini split.)
library(rpart)
library(rpart.plot)
model.ct <- rpart(formula=Satisfaction~., data=data.frame(survey_train[,1:15]), method="class", parms=list(split="gini"))
rpart.plot(model.ct, roundint=FALSE)

# Bar plot of the Satisfaction variable to check for class imbalance
library(ggplot2)
ggplot(survey, aes(x = Satisfaction)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Satisfaction Classes",
       x = "Satisfaction Levels",
       y = "Frequency") +
  theme_minimal()

prop.table(table(survey$Satisfaction))
## 
##          1          2          3          4          5 
## 0.01353638 0.13959391 0.26649746 0.46362098 0.11675127

No action was taken here to address the class imbalance in the satisfaction scores, but the imbalance is likely a key reason the original (and pruned) classification tree contains only one split. Only one predictor, “TypeofTravel”, is used in the model. That rpart selects a single predictor suggests “TypeofTravel” dominates the response: relative to it, the remaining variables reduce node impurity too little to justify additional splits at the default complexity parameter. No further action was taken to include additional predictors.
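One inexpensive way to probe the imbalance, sketched here with an assumed uniform prior rather than a tuned one, is to re-weight the classes inside rpart through its parms argument, so the rare satisfaction levels carry more weight at each split:

# Illustrative sketch (the prior values are an assumption, not part of the
# graded solution): give all five satisfaction levels equal prior weight
model.ct.weighted <- rpart(Satisfaction ~ ., data = data.frame(survey_train[,1:15]),
                           method = "class",
                           parms = list(split = "gini", prior = rep(1/5, 5)))
rpart.plot(model.ct.weighted, roundint = FALSE)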

  3. (15 points) Use cross validation to find your best pruned tree, estimate and plot that tree. How does your pruned tree compare to the original tree?
printcp(model.ct) # note: the tree with one split has the lowest cross-validated error (xerror)
## 
## Classification tree:
## rpart(formula = Satisfaction ~ ., data = data.frame(survey_train[, 
##     1:15]), method = "class", parms = list(split = "gini"))
## 
## Variables actually used in tree construction:
## [1] TypeofTravel
## 
## Root node error: 513/945 = 0.54286
## 
## n= 945 
## 
##        CP nsplit rel error  xerror     xstd
## 1 0.14035      0   1.00000 1.00000 0.029852
## 2 0.01000      1   0.85965 0.85965 0.029895
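The cp table can also be read off a plot; plotcp(), which ships with rpart, graphs the cross-validated error against each candidate cp value, making the minimum chosen below easy to verify by eye:

# Visualize the cross-validated error (xerror) for each candidate cp value
plotcp(model.ct)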
# Prune at the cp value that minimizes the cross-validated error
ptree <- prune(model.ct, cp=model.ct$cptable[which.min(model.ct$cptable[,"xerror"]),"CP"])
rpart.plot(ptree, roundint=FALSE, main="Pruned Classification Tree, Airline Satisfaction Survey")

Because the lowest cross-validated error already occurs at a single split, the pruned tree is identical to the original tree: one split on TypeofTravel.

  4. (10 points) Predict the Satisfaction in the test data, and calculate the following:

    • Confusion Matrix
    • Accuracy
# Use the survey_test data to predict satisfaction
pred.ct <- predict(object = ptree, newdata = survey_test[,1:15], type="class")
# Calculate the confusion matrix for the test set
library(caret)
confusionMatrix(data = pred.ct, reference = survey_test$Satisfaction)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4   5
##          1   0   0   0   0   0
##          2   0  19  18   6   2
##          3   0   0   0   0   0
##          4   2   8  50 110  22
##          5   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5443          
##                  95% CI : (0.4786, 0.6089)
##     No Information Rate : 0.4895          
##     P-Value [Acc > NIR] : 0.05214         
##                                           
##                   Kappa : 0.2168          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.000000  0.70370   0.0000   0.9483   0.0000
## Specificity          1.000000  0.87619   1.0000   0.3223   1.0000
## Pos Pred Value            NaN  0.42222      NaN   0.5729      NaN
## Neg Pred Value       0.991561  0.95833   0.7131   0.8667   0.8987
## Prevalence           0.008439  0.11392   0.2869   0.4895   0.1013
## Detection Rate       0.000000  0.08017   0.0000   0.4641   0.0000
## Detection Prevalence 0.000000  0.18987   0.0000   0.8101   0.0000
## Balanced Accuracy    0.500000  0.78995   0.5000   0.6353   0.5000
# Calculate the testing error (accuracy 0.5443 and error 0.4556962 sum to 1)
1 - mean(survey_test$Satisfaction == pred.ct)
## [1] 0.4556962
  5. (10 points) For the same data, train a bagging algorithm, predict the Satisfaction in the test data, and calculate the following:

    • Confusion Matrix
    • Accuracy
# Set a seed for reproducibility of the bagged classification trees
set.seed(888)
# Train a bagged model; coob = TRUE requests the out-of-bag error estimate
library(ipred)
satisfaction_model <- bagging(formula = Satisfaction ~ ., data = survey_train[,1:15], coob = TRUE)
print(satisfaction_model)
## 
## Bagging classification trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = Satisfaction ~ ., data = survey_train[, 
##     1:15], coob = TRUE)
## 
## Out-of-bag estimate of misclassification error:  0.5365
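As the printout notes, ipred's bagging() defaults to 25 bootstrap replications. A variant sketch, with nbagg = 100 chosen arbitrarily for illustration, trades extra computation for a more stable out-of-bag error estimate:

# Sketch (the nbagg value is an assumption): more bootstrap replications
# for a steadier out-of-bag error estimate
set.seed(888)
satisfaction_model_100 <- bagging(formula = Satisfaction ~ .,
                                  data = survey_train[,1:15],
                                  nbagg = 100, coob = TRUE)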
# Utilize survey_test data to predict values
pred.bag <- predict(object = satisfaction_model, newdata = survey_test[,1:15], type = "class")
# Calculate the confusion matrix for the test set
confusionMatrix(data = pred.bag, reference = survey_test$Satisfaction)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4  5
##          1  0  0  1  0  0
##          2  0  9 12  4  2
##          3  0 10 15 17  1
##          4  2  8 35 94 19
##          5  0  0  5  1  2
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5063          
##                  95% CI : (0.4408, 0.5717)
##     No Information Rate : 0.4895          
##     P-Value [Acc > NIR] : 0.3245          
##                                           
##                   Kappa : 0.1843          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.000000  0.33333  0.22059   0.8103 0.083333
## Specificity          0.995745  0.91429  0.83432   0.4711 0.971831
## Pos Pred Value       0.000000  0.33333  0.34884   0.5949 0.250000
## Neg Pred Value       0.991525  0.91429  0.72680   0.7215 0.903930
## Prevalence           0.008439  0.11392  0.28692   0.4895 0.101266
## Detection Rate       0.000000  0.03797  0.06329   0.3966 0.008439
## Detection Prevalence 0.004219  0.11392  0.18143   0.6667 0.033755
## Balanced Accuracy    0.497872  0.62381  0.52745   0.6407 0.527582
# Calculate the testing error (accuracy 0.5063 and error 0.4936709 sum to 1)
1 - mean(survey_test$Satisfaction == pred.bag)
## [1] 0.4936709
  6. (10 points) For the same data, train a random forest algorithm, predict the Satisfaction in the test data, and calculate the following:

    • Confusion Matrix
    • Accuracy
library(randomForest)
satisfaction.rf <- randomForest(Satisfaction ~ . , data = survey_train[,1:15], importance = TRUE, proximity = TRUE)
print(satisfaction.rf)
## 
## Call:
##  randomForest(formula = Satisfaction ~ ., data = survey_train[,      1:15], importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 49.21%
## Confusion matrix:
##   1  2  3   4 5 class.error
## 1 0  5  3   6 0   1.0000000
## 2 1 57 39  41 0   0.5869565
## 3 0 41 53 152 1   0.7854251
## 4 0 15 44 369 4   0.1458333
## 5 0  2  9 102 1   0.9912281
# Identify which variables most influence the predicted satisfaction score
head(satisfaction.rf$importance)
##                                1             2             3             4
## Airline.Status      0.0020333333 -0.0004787364  0.0061191534 -0.0007581262
## Age                -0.0020222222  0.0007361151  0.0000644354  0.0155799741
## Gender              0.0031412698 -0.0048913144  0.0009058809  0.0009868869
## PriceSensitivity   -0.0002976190  0.0004691513  0.0024171788  0.0032764053
## YearofFirstFlight   0.0009507937 -0.0008897768 -0.0030980038  0.0028270536
## FlighOtherAirlines -0.0038928571 -0.0034129429 -0.0016138300  0.0039582781
##                                5 MeanDecreaseAccuracy MeanDecreaseGini
## Airline.Status      0.0007804311         0.0013248092         22.04699
## Age                -0.0001660357         0.0072733192         92.61813
## Gender              0.0083199384         0.0010901851         17.90408
## PriceSensitivity    0.0033941825         0.0025932498         25.54298
## YearofFirstFlight  -0.0008987827         0.0003362741         59.35590
## FlighOtherAirlines -0.0042260674         0.0003029533         72.66563
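The same importance measures are easier to compare visually; varImpPlot(), which ships with randomForest, plots the mean decrease in accuracy and the mean decrease in Gini for every predictor:

# Plot mean decrease in accuracy and mean decrease in Gini per predictor
varImpPlot(satisfaction.rf, main = "Variable Importance, Airline Satisfaction")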
# Grab the OOB error matrix and plot per-class OOB error by number of trees
err <- satisfaction.rf$err.rate
plot(satisfaction.rf)
# Add a legend, since plot.randomForest does not draw one by default
legend(x = "right", legend = colnames(err), fill = 1:ncol(err))

# Evaluate prediction performance using the survey_test data
pred.rf <- predict(object = satisfaction.rf, newdata = survey_test[,1:15], type = "response")
# Calculate the confusion matrix
confusionMatrix(data = pred.rf, reference = survey_test$Satisfaction)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4   5
##          1   0   0   0   0   0
##          2   0  11  11   5   2
##          3   0   8  14   8   1
##          4   2   8  43 102  20
##          5   0   0   0   1   1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5401          
##                  95% CI : (0.4744, 0.6048)
##     No Information Rate : 0.4895          
##     P-Value [Acc > NIR] : 0.06753         
##                                           
##                   Kappa : 0.2155          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity          0.000000  0.40741  0.20588   0.8793 0.041667
## Specificity          1.000000  0.91429  0.89941   0.3967 0.995305
## Pos Pred Value            NaN  0.37931  0.45161   0.5829 0.500000
## Neg Pred Value       0.991561  0.92308  0.73786   0.7742 0.902128
## Prevalence           0.008439  0.11392  0.28692   0.4895 0.101266
## Detection Rate       0.000000  0.04641  0.05907   0.4304 0.004219
## Detection Prevalence 0.000000  0.12236  0.13080   0.7384 0.008439
## Balanced Accuracy    0.500000  0.66085  0.55265   0.6380 0.518486
# Calculate the testing error (accuracy 0.5401 and error 0.4599156 sum to 1)
1 - mean(survey_test$Satisfaction == pred.rf)
## [1] 0.4599156
  7. (15 points) Compare your tree, bagging, and random forest results. Which one would you recommend the airline use to predict customer satisfaction?

The tree model exhibits the lowest error rate (and thus the highest accuracy) of the three models, so I recommend the airline use the tree model, rather than the bagging or random forest models, to predict customer satisfaction.

However, the survey data are imbalanced, with most satisfaction scores being 2 or 4, so further work is needed to improve model performance. Options include class weighting, SMOTE, or methods such as random forest with class weights or boosting; these give the model additional exposure to the minority classes (satisfaction scores 1, 3, and 5), reduce overfitting, and average predictions, which improves predictions for those classes. Note that while the error rate of the tree model is relatively high (0.456), it is higher still for bagging (0.494) and random forest (0.460), so simply switching to bagging or random forest does not lower the error rate.

Without further adjustment, the standard tree model never predicts satisfaction scores 1, 3, or 5, and its only splitting variable is “TypeofTravel”, due to (1) potential overfitting and/or (2) the high impact of “TypeofTravel” on the satisfaction score relative to the other, possibly insufficiently informative, predictors. Addressing the class imbalance and bringing in additional splitting variables, for example by encoding categorical predictors numerically, creating interactions between predictors (perhaps “TypeofTravel” and “Class”), or lowering the cp in rpart to allow more complexity (a sketch follows the comparison table below), would likely lower the error rate and improve model performance.

# Compare the three methods (tree, bagging, random forest) by test error rate
Models <- c("tree", "bagging", "random forest")
Error.Rate <- c(0.4556962, 0.4936709, 0.4599156)
text_tbl <- data.frame(Models, Error.Rate)
  

library(knitr)
library(kableExtra)
kable(text_tbl) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Models Error.Rate
tree 0.4556962
bagging 0.4936709
random forest 0.4599156
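As a concrete illustration of the cp adjustment suggested above (a sketch with assumed control values, not a tuned model), rpart.control() can loosen the complexity penalty so the tree is allowed to consider more splits before pruning:

# Sketch (the cp and minsplit values are assumptions chosen for illustration):
# a smaller cp lets rpart grow a deeper tree, which can then be pruned back
model.ct.deep <- rpart(Satisfaction ~ ., data = data.frame(survey_train[,1:15]),
                       method = "class", parms = list(split = "gini"),
                       control = rpart.control(cp = 0.001, minsplit = 10))
printcp(model.ct.deep)  # prune at the cp that minimizes xerror, as before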
  8. (15 points) Pick one of the algorithms you worked on and explain what you did to a stakeholder who does not know anything about tree-based algorithms.

Random forest, the third model we built, combines bootstrap sampling (bagging) with random feature selection. It runs many decision trees, each on a different variation of the same underlying survey dataset, and then averages the outcomes (for regression models) or takes the most common outcome (for classification models). For our data, the satisfaction.rf algorithm grew N = 500 trees.

Each tree is trained independently on its own variation of the survey data. Using a different variation of the dataset for each of the 500 trees is, in large part, what lets a random forest achieve a lower error rate than a standard decision tree, which learns from a single training set: it lowers the chance of overfitting and partly offsets class imbalance, because the minority classes are more likely to be represented across the individual trees.

Randomness also enters at each split. The splitting variable at every decision node is chosen from a small pool of randomly selected predictors, in satisfaction.rf a pool of 3 variables per split, rather than from all predictors at once. Selecting splitting variables from random pools further reduces the risk of overfitting and ensures that more predictors are represented in the model. Once all 500 trees are trained, their predictions are aggregated, and for a classification model the final prediction is assigned by majority vote across the trees. This injected randomness is a key strength of random forests and explains why they often outperform a single decision tree.

In our case, the standard decision tree used only 1 of the 14 available predictors (“TypeofTravel”) and never predicted several of the satisfaction scores, partly because of the imbalance in those scores; satisfaction.rf addressed this by drawing on many more predictors across its trees. Despite that, the random forest still scored slightly lower on accuracy than the standard tree, so further tuning is needed. Accuracy and error were measured by comparing the satisfaction.rf predictions against the actual observations in the survey_test data.
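To make the majority-voting step concrete, the vote shares behind each prediction can be inspected directly; randomForest's predict method returns them with type = "vote" (by default, as fractions of the 500 trees voting for each level):

# Each row gives the fraction of trees voting for each satisfaction level;
# the predicted class is the column with the largest share
head(predict(satisfaction.rf, newdata = survey_test[,1:15], type = "vote"))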