Problem Description: The data in the Airlines data
file contains data from 3,999 airline customers enrolled in East-West
Airlines’ customer rewards program. (Note that while East-West Airlines
is clearly fictional, this is data from a real airlines reward program;
names have been changed to protect the innocent and not-so-innocent
alike.) East-West Airlines has two goals with this analysis: (1)
identifying if a customer will claim a travel award using their rewards,
and (2) identifying factors that lead to customers claiming a travel
award.
Data Preparation
(1) Set the seed 123.
# Load the data
library(readxl)
Airlines <- read_excel("~/Desktop/Airlines_Data.xlsx")
attach(Airlines)
set.seed(123)
(2) Randomly partition the data set into training,
validation, and testing data by:
First, splitting off 25% of the
data for testing.
# Calculate indices for testing data (25% of the original data)
test_index <- sample(seq_len(nrow(Airlines)), size = floor(0.25 * nrow(Airlines)))
# Create the testing set
testing_data <- Airlines[test_index, ]
Second, splitting off 25% of the remaining data for validation.
Third, setting the remaining data as training data.
# Exclude testing data to deal with the remaining data for training and validation
remaining_data <- Airlines[-test_index, ]
# Calculate indices for validation set (25% of the remaining data)
validation_index <- sample(seq_len(nrow(remaining_data)), size = floor(0.25 * nrow(remaining_data)))
# Create the validation set
validation_data <- remaining_data[validation_index, ]
# The rest is training data
training_data <- remaining_data[-validation_index, ]
# Print summary to check the structure
dim(training_data) # For the training dataset
## [1] 2250 12
dim(validation_data) # For the validation dataset
## [1] 750 12
dim(testing_data) # For the testing dataset
## [1] 999 12
Appropriate Data Analysis Techniques
Consider the problem of analyzing the East-West Airlines travel award
data. Without running any analysis (yet), of all the methods we’ve
learned in class, choose three that you think would be appropriate for
this problem and for each method explain:
(1) Why
you think this method is appropriate for the problem
(2) How you would use this method to address either
Goal #1 for East-West Airlines, Goal #2 for East-West Airlines, or both
I would select logistic regression, random forest, and CART.
Logistic Regression is suitable because it’s
tailored for binary outcomes, like determining if a customer will claim
a travel award (yes/no). Its output includes probabilities and
coefficient insights that relate directly to the influence of
predictors.
Goal #1: Predict whether a customer will claim an award
using predictors like miles earned and transactions.
Goal #2:
Examine the regression coefficients to understand which factors most
influence the likelihood of award claims.
Random
Forest is strong against overfitting and effective in handling
various types of data, making it great for classification tasks. It also
ranks features based on their importance in prediction, providing
insights into what drives outcomes.
Goal #1: Classify customers on
their likelihood to claim awards based on comprehensive feature sets.
Goal #2: Use the feature importance scores to identify and
prioritize the factors that most affect award claiming.
CART is an ideal choice due to their simplicity and
effectiveness in handling categorical and continuous data.
Goal #1:
Visually dissect the paths taken by customers who claim awards,
identifying key thresholds in behavior that predict higher likelihoods
of claiming.
Goal #2: The structure of the tree can help identify
the most influential factors leading to award claims, as the top nodes
in the tree represent the most important variables.
Goal #1: Identifying if a customer will claim a travel award
using their rewards
If East-West Airlines is most
concerned with identifying if a new customer will claim a travel award
using their rewards, what three methods would you suggest as the most
appropriate for this problem? Why?
Run each of your three methods
and decide which method East-West Airlines should use in practice. (Make
sure to support your answers!)
1. Logistic
Regression
names(Airlines)
## [1] "ID" "Balance" "Qual_miles"
## [4] "cc1_miles" "cc2_miles" "cc3_miles"
## [7] "Bonus_miles" "Bonus_trans" "Flight_miles_12mo"
## [10] "Flight_trans_12" "Days_since_enroll" "Award"
head(Airlines)
## # A tibble: 6 × 12
## ID Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles Bonus_trans
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 28143 0 1 1 1 174 1
## 2 2 19244 0 1 1 1 215 2
## 3 3 41354 0 1 1 1 4123 4
## 4 4 14776 0 1 1 1 500 1
## 5 5 97752 0 4 1 1 43300 26
## 6 6 16420 0 1 1 1 0 0
## # ℹ 4 more variables: Flight_miles_12mo <dbl>, Flight_trans_12 <dbl>,
## # Days_since_enroll <dbl>, Award <dbl>
# The reference level is set to "0" (not claimed), so the model coefficients will describe the effect of predictors on the probability of claiming an award ("1").
Airlines$Award <- factor(Airlines$Award, levels = c("0", "1"))
Airlines$Award <- relevel(Airlines$Award, ref = "0")
# Fit the logistic regression model
model <- glm(Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles + Bonus_miles + Bonus_trans + Flight_miles_12mo + Flight_trans_12 + Days_since_enroll, family = binomial(), data = Airlines)
# Display summary of the model
summary(model)
##
## Call:
## glm(formula = Award ~ Balance + Qual_miles + cc1_miles + cc2_miles +
## cc3_miles + Bonus_miles + Bonus_trans + Flight_miles_12mo +
## Flight_trans_12 + Days_since_enroll, family = binomial(),
## data = Airlines)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.164e+00 3.728e-01 -3.123 0.001789 **
## Balance -1.732e-06 4.576e-07 -3.785 0.000154 ***
## Qual_miles 1.787e-04 4.900e-05 3.647 0.000265 ***
## cc1_miles -5.574e-02 6.525e-02 -0.854 0.392961
## cc2_miles -6.942e-02 2.488e-01 -0.279 0.780238
## cc3_miles -7.613e-01 2.124e-01 -3.584 0.000338 ***
## Bonus_miles 4.041e-05 4.665e-06 8.663 < 2e-16 ***
## Bonus_trans 1.775e-02 5.753e-03 3.085 0.002033 **
## Flight_miles_12mo 1.365e-05 6.894e-05 0.198 0.843008
## Flight_trans_12 1.594e-01 2.786e-02 5.723 1.05e-08 ***
## Days_since_enroll 1.392e-04 1.888e-05 7.376 1.63e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5271.8 on 3998 degrees of freedom
## Residual deviance: 4362.5 on 3988 degrees of freedom
## AIC: 4384.5
##
## Number of Fisher Scoring iterations: 5
The model includes coefficients that are statistically significant,
such as Balance, Qual_miles, cc3_miles, Bonus_miles,
Bonus_trans, Flight_trans_12, and Days_since_enroll. These
predictors have low p-values, suggesting strong evidence against the
null hypothesis (that these coefficients are zero), indicating they are
important predictors in determining whether a customer will claim a
travel award.
The negative coefficients for Balance
and cc3_miles suggest that increases in these predictors are
associated with a decrease in the likelihood of claiming a travel award,
whereas positive coefficients for variables like Bonus_miles and
Days_since_enroll suggest an increase in the likelihood of
claiming an award.
# Calculate confidence intervals for the coefficients
confint(model)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.889730e+00 -4.228812e-01
## Balance -2.629152e-06 -8.330669e-07
## Qual_miles 8.442144e-05 2.773445e-04
## cc1_miles -1.843432e-01 7.144114e-02
## cc2_miles -5.774166e-01 4.075282e-01
## cc3_miles -1.185394e+00 -3.421540e-01
## Bonus_miles 3.142374e-05 4.970332e-05
## Bonus_trans 6.446666e-03 2.901256e-02
## Flight_miles_12mo -1.187238e-04 1.515353e-04
## Flight_trans_12 1.057396e-01 2.149284e-01
## Days_since_enroll 1.023096e-04 1.763207e-04
# Exponentiate coefficients to get odds ratios
exp(coef(model))
## (Intercept) Balance Qual_miles cc1_miles
## 0.3121305 0.9999983 1.0001787 0.9457873
## cc2_miles cc3_miles Bonus_miles Bonus_trans
## 0.9329380 0.4670429 1.0000404 1.0179095
## Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1.0000137 1.1728595 1.0001393
# Shows the estimates for the coefficients
model$coefficients
## (Intercept) Balance Qual_miles cc1_miles
## -1.164334e+00 -1.732249e-06 1.787109e-04 -5.573754e-02
## cc2_miles cc3_miles Bonus_miles Bonus_trans
## -6.941654e-02 -7.613342e-01 4.041425e-05 1.775103e-02
## Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1.365420e-05 1.594448e-01 1.392409e-04
# Predict probabilities for new data (e.g., a new customer scenario)
new_customer <- data.frame(Balance=30000, Qual_miles=500, cc1_miles=2, cc2_miles=1, cc3_miles=1, Bonus_miles=12000, Bonus_trans=12, Flight_miles_12mo=1500, Flight_trans_12=3, Days_since_enroll=2000)
# Probability of claiming the award
predict(model, new_customer, type="response")
## 1
## 0.3557426
Strong Predictors: The most substantial effects are
observed in cc3_miles (decreasing likelihood), Flight_trans_12, and
Bonus_trans (both increasing likelihood).
Minimal
Effects: Predictors such as Balance and Flight_miles_12mo have
minimal effects on the likelihood of claiming awards, suggesting that
these factors, while statistically significant, may not be as impactful
in practical terms.
Policy Implications: For
decision-makers at East-West Airlines, focusing on engaging customers
through more flight transactions and bonus transactions might be more
effective strategies for increasing travel award claims than merely
increasing balances or flight miles.
# Full model with all predictors
full_model <- glm(Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles + Bonus_miles + Bonus_trans + Flight_miles_12mo + Flight_trans_12 + Days_since_enroll, family = binomial(), data = Airlines)
# Reduced model excluding cc1_miles, cc2_miles, and Flight_miles_12mo
reduced_model <- glm(Award ~ Balance + Qual_miles + cc3_miles + Bonus_miles + Bonus_trans + Flight_trans_12 + Days_since_enroll, family = binomial(), data = Airlines)
# ANOVA to compare the full model and the reduced model
anova_result <- anova(reduced_model, full_model, test="Chisq")
print(anova_result)
## Analysis of Deviance Table
##
## Model 1: Award ~ Balance + Qual_miles + cc3_miles + Bonus_miles + Bonus_trans +
## Flight_trans_12 + Days_since_enroll
## Model 2: Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles +
## Bonus_miles + Bonus_trans + Flight_miles_12mo + Flight_trans_12 +
## Days_since_enroll
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 3991 4363.3
## 2 3988 4362.5 3 0.81288 0.8464
The high p-value (0.8464) in the ANOVA table strongly suggests that
the reduced model (Model 1) fits the data nearly as well as the full
model (Model 2), despite having fewer predictors. This result implies
that the predictors removed (cc1_miles, cc2_miles, and
Flight_miles_12mo) do not contribute significantly to the model, and
their exclusion simplifies the model without losing predictive power.
2. Random Forest
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(ROCR)
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
# Ensure the target variable 'Award' is a factor
training_data$Award <- as.factor(training_data$Award)
validation_data$Award <- as.factor(validation_data$Award)
# Fit the Random Forest model
rf_model <- randomForest(Award ~ ., data = training_data, ntree = 500, mtry = 4, importance = TRUE)
# Review model details and variable importance
print(rf_model)
##
## Call:
## randomForest(formula = Award ~ ., data = training_data, ntree = 500, mtry = 4, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 24.36%
## Confusion matrix:
## 0 1 class.error
## 0 1200 207 0.1471215
## 1 341 502 0.4045077
varImpPlot(rf_model)
From the plot, the variables like Bonus_miles, Balance, and
Bonus_trans stand out as the most significant predictors for
increasing the accuracy of the model. This indicates that these
features, when altered, affect the model’s accuracy the most
significantly.
# Prediction on validation data
rf_predictions <- predict(rf_model, validation_data)
# Generate a confusion matrix
conf_matrix <- table(Actual = validation_data$Award, Predicted = rf_predictions)
print(conf_matrix)
## Predicted
## Actual 0 1
## 0 399 57
## 1 100 194
Sensitivity: The model is able to identify 66% of
all actual award claims.
Specificity: The model
correctly identifies 87.5% of those who do not claim awards.
# Calculate overall prediction error
prediction_error <- 1 - sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Prediction Error: ", prediction_error))
## [1] "Prediction Error: 0.209333333333333"
# ROC curve analysis
rf_probabilities <- predict(rf_model, validation_data, type = "prob")[,2] # Probabilities for the positive class
pred_obj <- prediction(rf_probabilities, validation_data$Award)
perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
# Plot ROC curve
plot(perf, main = "ROC Curve", col = rainbow(10))
abline(0, 1, col = "red") # Adding reference line
# Calculate AUC
auc_value <- performance(pred_obj, measure = "auc")
auc_value <- auc_value@y.values[[1]]
print(paste("AUC: ", auc_value))
## [1] "AUC: 0.839658670485738"
The ROC curve shows good predictive performance, significantly above
the random guess line, indicating effective discrimination between
customers likely and unlikely to claim a travel award.
The AUC
value of 0.84 the model indicates a strong ability to distinguish
between customers who will and will not claim a travel award. This high
AUC value suggests that the model is highly effective in meeting Goal 1,
which aims to predict award claim behavior among customers.
3. CART
# Load necessary libraries
library(rpart)
library(rpart.plot)
library(caret)
# Ensure the target variable 'Award' is a factor
Airlines$Award <- as.factor(Airlines$Award)
# Splitting the data into training and validation sets
set.seed(123)
index <- sample(1:nrow(Airlines), 0.7 * nrow(Airlines))
training_data <- Airlines[index, ]
validation_data <- Airlines[-index, ]
# Fit the CART model
cart_model <- rpart(Award ~ ., data = training_data, method = "class", cp = 0.01)
# Plot the tree
rpart.plot(cart_model, type = 1, extra = 101)
This model illustrates the key factors influencing airline customers’
likelihood of claiming travel awards. The primary split is based on
Bonus_miles, indicating its significance as a
predictor. Customers with higher bonus miles consistently show a higher
probability of claiming awards. Further splits based on Balance
and Flight_miles_12mo suggest that financial status and travel
frequency also play crucial roles. The model effectively segments
customers into groups with varying probabilities of award claims, which
can guide strategies to enhance participation in the airline’s rewards
program.
# Evaluate the model on training data
training_predictions <- predict(cart_model, training_data, type = "class")
train_table <- table(Actual = training_data$Award, Predicted = training_predictions)
train_accuracy <- sum(diag(train_table)) / sum(train_table)
# Evaluate the model on validation data
validation_predictions <- predict(cart_model, validation_data, type = "class")
valid_table <- table(Actual = validation_data$Award, Predicted = validation_predictions)
valid_accuracy <- sum(diag(valid_table)) / sum(valid_table)
# Print the results
print(paste("Training Accuracy:", train_accuracy))
## [1] "Training Accuracy: 0.774205073240443"
print(paste("Validation Accuracy:", valid_accuracy))
## [1] "Validation Accuracy: 0.746666666666667"
The analysis of the CART model’s performance shows a training
accuracy of approximately 77.42% and a validation accuracy of 74.67%.
This indicates that the model fits the training data reasonably well and
maintains a similar level of performance on unseen validation data,
suggesting moderate generalizability. The slight drop in accuracy from
training to validation could be indicative of minor overfitting but
overall, the model seems robust enough for practical use in predicting
whether customers will claim travel awards based on their profile
characteristics captured in the training dataset.
# Plot ROC curve and calculate AUC
library(ROCR)
validation_prob <- predict(cart_model, validation_data, type = "prob")[, 2]
prediction_obj <- prediction(validation_prob, validation_data$Award)
perf <- performance(prediction_obj, "tpr", "fpr")
plot(perf, main = "ROC Curve")
abline(0, 1, col = "red")
auc <- performance(prediction_obj, "auc")
auc_value <- auc@y.values[[1]]
print(paste("AUC:", auc_value))
## [1] "AUC: 0.762723266745006"
The ROC curve plotted from the predictions shows a moderate level of
discriminative ability, with an AUC of approximately 0.763. This value
indicates a reasonable level of accuracy in the model’s ability to
differentiate between those customers who will claim a travel award and
those who will not. The curve significantly deviates from the line of no
discrimination, suggesting that the model performs better than random
guessing. However, the curve’s slope and the AUC value indicate that
there is still room for improvement, particularly in minimizing false
positive rates and enhancing true positive rates.
Based
on the results that each method provided:
Logistic
Regression offered significant insights into which variables
are important and how they affect the probability of a customer claiming
a reward. However, it might oversimplify the relationships by assuming
linearity.
Random Forest showed a high level of
predictive accuracy (AUC = 0.84) and was effective in differentiating
between customers likely to claim awards versus those who are not. The
model also provided insights into feature importance.
CART, while visually intuitive and useful for
generating decision rules, showed lower accuracy and AUC than Random
Forest.
Given these considerations, Random Forest
appears to be the most suitable method for Goal #1 at East-West
Airlines. It balances high predictive accuracy with the ability
to handle complex data structures and provides actionable insights
through feature importance scores. This method can robustly identify
patterns that are not readily apparent through simpler models, making it
ideal for practical use in predicting customer behavior concerning
travel award claims. This method will likely provide the most reliable
predictions while also offering insights that can help refine marketing
strategies and customer engagement initiatives.
Goal #2: Identifying factors that lead to customers claiming a
travel award
If East-West Airlines is most concerned with
identifying factors that lead to customers claiming a travel award, what
three methods would you suggest as the most appropriate for this
problem? Why?
Run each of your three methods and decide which
factors East-West Airlines should focus on for determining if customers
will claim a travel award. (Make sure to support your answers!)
1. Logistic Regression
model <- glm(Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles + Bonus_miles + Bonus_trans + Flight_miles_12mo + Flight_trans_12 + Days_since_enroll, family = binomial(), data = Airlines)
summary(model)
##
## Call:
## glm(formula = Award ~ Balance + Qual_miles + cc1_miles + cc2_miles +
## cc3_miles + Bonus_miles + Bonus_trans + Flight_miles_12mo +
## Flight_trans_12 + Days_since_enroll, family = binomial(),
## data = Airlines)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.164e+00 3.728e-01 -3.123 0.001789 **
## Balance -1.732e-06 4.576e-07 -3.785 0.000154 ***
## Qual_miles 1.787e-04 4.900e-05 3.647 0.000265 ***
## cc1_miles -5.574e-02 6.525e-02 -0.854 0.392961
## cc2_miles -6.942e-02 2.488e-01 -0.279 0.780238
## cc3_miles -7.613e-01 2.124e-01 -3.584 0.000338 ***
## Bonus_miles 4.041e-05 4.665e-06 8.663 < 2e-16 ***
## Bonus_trans 1.775e-02 5.753e-03 3.085 0.002033 **
## Flight_miles_12mo 1.365e-05 6.894e-05 0.198 0.843008
## Flight_trans_12 1.594e-01 2.786e-02 5.723 1.05e-08 ***
## Days_since_enroll 1.392e-04 1.888e-05 7.376 1.63e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5271.8 on 3998 degrees of freedom
## Residual deviance: 4362.5 on 3988 degrees of freedom
## AIC: 4384.5
##
## Number of Fisher Scoring iterations: 5
# The reference level is set to "0" (not claimed), so the model coefficients will describe the effect of predictors on the probability of claiming an award ("1").
Airlines$Award <- factor(Airlines$Award, levels = c("0", "1"))
Airlines$Award <- relevel(Airlines$Award, ref = "0")
step(model)
## Start: AIC=4384.51
## Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles +
## Bonus_miles + Bonus_trans + Flight_miles_12mo + Flight_trans_12 +
## Days_since_enroll
##
## Df Deviance AIC
## - Flight_miles_12mo 1 4362.5 4382.5
## - cc2_miles 1 4362.6 4382.6
## - cc1_miles 1 4363.2 4383.2
## <none> 4362.5 4384.5
## - Bonus_trans 1 4371.9 4391.9
## - cc3_miles 1 4374.6 4394.6
## - Qual_miles 1 4376.5 4396.5
## - Balance 1 4376.7 4396.7
## - Flight_trans_12 1 4398.7 4418.7
## - Days_since_enroll 1 4417.4 4437.4
## - Bonus_miles 1 4456.1 4476.1
##
## Step: AIC=4382.55
## Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles +
## Bonus_miles + Bonus_trans + Flight_trans_12 + Days_since_enroll
##
## Df Deviance AIC
## - cc2_miles 1 4362.6 4380.6
## - cc1_miles 1 4363.3 4381.3
## <none> 4362.5 4382.5
## - Bonus_trans 1 4372.0 4390.0
## - cc3_miles 1 4374.7 4392.7
## - Qual_miles 1 4376.6 4394.6
## - Balance 1 4376.7 4394.7
## - Days_since_enroll 1 4417.4 4435.4
## - Bonus_miles 1 4456.9 4474.9
## - Flight_trans_12 1 4481.0 4499.0
##
## Step: AIC=4380.63
## Award ~ Balance + Qual_miles + cc1_miles + cc3_miles + Bonus_miles +
## Bonus_trans + Flight_trans_12 + Days_since_enroll
##
## Df Deviance AIC
## - cc1_miles 1 4363.3 4379.3
## <none> 4362.6 4380.6
## - Bonus_trans 1 4372.0 4388.0
## - cc3_miles 1 4374.7 4390.7
## - Qual_miles 1 4376.7 4392.7
## - Balance 1 4376.8 4392.8
## - Days_since_enroll 1 4417.4 4433.4
## - Bonus_miles 1 4457.7 4473.7
## - Flight_trans_12 1 4481.8 4497.8
##
## Step: AIC=4379.32
## Award ~ Balance + Qual_miles + cc3_miles + Bonus_miles + Bonus_trans +
## Flight_trans_12 + Days_since_enroll
##
## Df Deviance AIC
## <none> 4363.3 4379.3
## - Bonus_trans 1 4372.0 4386.0
## - cc3_miles 1 4374.7 4388.7
## - Balance 1 4377.5 4391.5
## - Qual_miles 1 4377.5 4391.5
## - Days_since_enroll 1 4417.5 4431.5
## - Flight_trans_12 1 4500.9 4514.9
## - Bonus_miles 1 4611.8 4625.8
##
## Call: glm(formula = Award ~ Balance + Qual_miles + cc3_miles + Bonus_miles +
## Bonus_trans + Flight_trans_12 + Days_since_enroll, family = binomial(),
## data = Airlines)
##
## Coefficients:
## (Intercept) Balance Qual_miles cc3_miles
## -1.341e+00 -1.729e-06 1.799e-04 -7.028e-01
## Bonus_miles Bonus_trans Flight_trans_12 Days_since_enroll
## 3.720e-05 1.632e-02 1.676e-01 1.378e-04
##
## Degrees of Freedom: 3998 Total (i.e. Null); 3991 Residual
## Null Deviance: 5272
## Residual Deviance: 4363 AIC: 4379
step(model, trace=0)
##
## Call: glm(formula = Award ~ Balance + Qual_miles + cc3_miles + Bonus_miles +
## Bonus_trans + Flight_trans_12 + Days_since_enroll, family = binomial(),
## data = Airlines)
##
## Coefficients:
## (Intercept) Balance Qual_miles cc3_miles
## -1.341e+00 -1.729e-06 1.799e-04 -7.028e-01
## Bonus_miles Bonus_trans Flight_trans_12 Days_since_enroll
## 3.720e-05 1.632e-02 1.676e-01 1.378e-04
##
## Degrees of Freedom: 3998 Total (i.e. Null); 3991 Residual
## Null Deviance: 5272
## Residual Deviance: 4363 AIC: 4379
model_coefficients <- coef(model)
print(model_coefficients)
## (Intercept) Balance Qual_miles cc1_miles
## -1.164334e+00 -1.732249e-06 1.787109e-04 -5.573754e-02
## cc2_miles cc3_miles Bonus_miles Bonus_trans
## -6.941654e-02 -7.613342e-01 4.041425e-05 1.775103e-02
## Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1.365420e-05 1.594448e-01 1.392409e-04
odds_claim <- exp(model_coefficients)
print(odds_claim)
## (Intercept) Balance Qual_miles cc1_miles
## 0.3121305 0.9999983 1.0001787 0.9457873
## cc2_miles cc3_miles Bonus_miles Bonus_trans
## 0.9329380 0.4670429 1.0000404 1.0179095
## Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1.0000137 1.1728595 1.0001393
Non-significant predictors, Flight_miles_12mo, cc1_miles,
cc2_miles, were removed, resulting in a reduced model with a
lower AIC, suggesting a better model fit with fewer variables.
The coefficients of the final model indicate the log odds of claiming a
reward for each unit increase in the predictors. Coefficients closer to
zero have a smaller effect, whereas larger absolute values indicate a
stronger effect.
The final model is more parsimonious than the
initial model and retains variables that are most influential for
predicting an award claim. This model not only provides insights into
which factors are most important but also ensures that the model is not
overly complex, which can improve model performance and
interpretability.
2. Random Forest
# Load necessary libraries
library(randomForest)
library(caret) # for additional model evaluation metrics
# Fit the Random Forest model
set.seed(123)
rf_model <- randomForest(Award ~ ., data = training_data, ntree = 500, mtry = 4, importance = TRUE)
# Output the model details to review variable importance
print(rf_model)
##
## Call:
## randomForest(formula = Award ~ ., data = training_data, ntree = 500, mtry = 4, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 23.15%
## Confusion matrix:
## 0 1 class.error
## 0 1550 228 0.1282340
## 1 420 601 0.4113614
# Plotting the importance of variables
varImpPlot(rf_model, main = "Feature Importance in Predicting Travel Award Claiming")
# To get a more detailed view of the importance
importance_scores <- importance(rf_model)
print(importance_scores)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## ID 26.067203 -3.240115 26.3684866 159.757715
## Balance 94.524151 -15.891439 71.1609329 263.810688
## Qual_miles 10.268828 3.972043 10.1321805 17.788421
## cc1_miles 16.847852 11.374680 21.4202436 61.062159
## cc2_miles 6.143701 -5.721670 0.8294604 3.418842
## cc3_miles 1.375224 -5.987414 -4.9128705 1.638468
## Bonus_miles 69.789038 28.574894 82.0143089 300.942900
## Bonus_trans 24.905191 13.777174 32.2925804 132.680377
## Flight_miles_12mo 13.303305 28.305898 31.1254767 102.860463
## Flight_trans_12 8.583279 23.710049 24.0421978 58.821654
## Days_since_enroll 26.335174 2.837085 28.8745741 159.531097
Bonus_miles has the highest impact on both accuracy
and Gini decrease, indicating it is a critical predictor. This suggests
that the accumulation of bonus miles is a significant determinant in a
customer’s decision to claim a travel award.
Balance and
Days_since_enroll also show significant effects on model
accuracy and Gini decrease, suggesting that longer-term customers with
higher account balances are more likely to engage with the airline’s
award system.
Flight_miles_12mo and
Flight_trans_12 have notable impacts, especially in Gini
decrease, which highlights their roles in influencing award claims,
likely due to direct engagement with flying activities.
cc2_miles and cc3_miles, which might represent
affiliations with less frequent credit card programs or partnerships,
show minimal impact, suggesting they are less critical in determining
award claiming behavior.
3. CART
# Load necessary libraries
library(rpart)
library(rpart.plot)
library(caret)
# Ensure the target variable 'Award' is a factor
Airlines$Award <- as.factor(Airlines$Award)
# Splitting the data into training and validation sets
set.seed(123)
index <- sample(1:nrow(Airlines), 0.7 * nrow(Airlines))
training_data <- Airlines[index, ]
validation_data <- Airlines[-index, ]
# Fit the simplified CART model
cart_model <- rpart(Award ~ ., data = training_data, method = "class", cp = 0.02)
# Plot the tree with important splits
rpart.plot(cart_model, type = 4, extra = 102, under = TRUE, cex = 0.75)
The decision tree provides a clear visualization of how various
features contribute to the likelihood of a customer claiming a travel
award. The first split is on Bonus_miles, indicating
it’s a significant predictor of award claims. This suggests that
customers with a higher number of bonus miles are more likely to claim
awards.
Further splits on Balance and
Flight_miles_12mo suggest these variables also significantly
impact the decision to claim an award. For example, customers with bonus
miles greater than 7704 and a balance less than 29,000 show a higher
probability of claiming awards, pointing to specific customer segments
that might be targeted more effectively.
# Print the results of variable importance
var_importance <- as.data.frame(varImp(cart_model, scale = FALSE))
print(var_importance)
## Overall
## Balance 44.49856
## Bonus_miles 235.37386
## Bonus_trans 143.83443
## cc1_miles 130.86686
## Flight_miles_12mo 205.39944
## Flight_trans_12 201.45258
## ID 0.00000
## Qual_miles 0.00000
## cc2_miles 0.00000
## cc3_miles 0.00000
## Days_since_enroll 0.00000
The variable importance output highlights which variables most influence the model’s predictions. Bonus_miles, Bonus_trans, and Flight_miles_12mo are shown to have substantial importance, confirming their strong predictive power regarding award claims. This insight is crucial for understanding which features should be focused on to enhance customer engagement and reward claiming behavior.
# Validate the model's ability to differentiate factors influencing award claims
validation_predictions <- predict(cart_model, validation_data, type = "class")
valid_table <- table(Actual = validation_data$Award, Predicted = validation_predictions)
valid_accuracy <- sum(diag(valid_table)) / sum(valid_table)
# Print validation accuracy
print(paste("Validation Accuracy:", valid_accuracy))
## [1] "Validation Accuracy: 0.7375"
# Print simple decision rules from tree
print(rpart.rules(cart_model, style="tall"))
## Award is 0.19 when
## Bonus_miles < 7704
##
## Award is 0.37 when
## Bonus_miles >= 7704
## Balance >= 28682
## Flight_miles_12mo < 75
##
## Award is 0.69 when
## Bonus_miles >= 7704
## Balance >= 28682
## Flight_miles_12mo >= 75
##
## Award is 0.85 when
## Bonus_miles >= 7704
## Balance < 28682
# Plot ROC curve and calculate AUC for validation data
library(ROCR)
validation_prob <- predict(cart_model, validation_data, type = "prob")[, 2]
prediction_obj <- prediction(validation_prob, validation_data$Award)
perf <- performance(prediction_obj, "tpr", "fpr")
plot(perf, main = "ROC Curve for Simplified CART Model")
abline(0, 1, col = "red")
auc <- performance(prediction_obj, "auc")
auc_value <- auc@y.values[[1]]
print(paste("AUC:", auc_value))
## [1] "AUC: 0.756977085781434"
The ROC curve and an AUC value of approximately 0.757 demonstrate the
model’s ability to discriminate between those who will and won’t claim
awards. The curve is above the diagonal line of no discrimination, which
confirms the model’s effectiveness in distinguishing between the two
classes, but it has some room for improvement.
Based on
the results that each method provided:
Logistic
regression provided detailed coefficients for each predictor,
indicating how much each factor increases or decreases the odds of a
customer claiming a travel award.
Random Forest
performed well in terms of model accuracy and provided a comprehensive
ranking of feature importance. CART provided a
straightforward visual representation of how different features lead to
a travel award claim.
Given these considerations, CART
appears to be the most suitable method for Goal #2 at East-West
Airlines for its simplicity and direct applicability. The
clear, visual nature of the decision tree allows non-technical
stakeholders to easily understand and implement strategies based on its
results. CART effectively identifies critical thresholds and conditions
that influence award claiming, making it ideal for practical, actionable
business applications.