Project

Problem Description: The data in the Airlines data file contains data from 3,999 airline customers enrolled in East-West Airlines’ customer rewards program. (Note that while East-West Airlines is clearly fictional, this is data from a real airlines reward program; names have been changed to protect the innocent and not-so-innocent alike.) East-West Airlines has two goals with this analysis: (1) identifying if a customer will claim a travel award using their rewards, and (2) identifying factors that lead to customers claiming a travel award.

Data Preparation
(1) Set the seed 123.

# Load the data
library(readxl)
Airlines <- read_excel("~/Desktop/Airlines_Data.xlsx")
attach(Airlines)

set.seed(123)

(2) Randomly partition the data set into training, validation, and testing data by:
First, splitting off 25% of the data for testing.

# Calculate indices for testing data (25% of the original data)
test_index <- sample(seq_len(nrow(Airlines)), size = floor(0.25 * nrow(Airlines)))

# Create the testing set
testing_data <- Airlines[test_index, ]

Second, splitting off 25% of the remaining data for validation. Third, setting the remaining data as training data.

# Exclude testing data to deal with the remaining data for training and validation
remaining_data <- Airlines[-test_index, ]

# Calculate indices for validation set (25% of the remaining data)
validation_index <- sample(seq_len(nrow(remaining_data)), size = floor(0.25 * nrow(remaining_data)))

# Create the validation set
validation_data <- remaining_data[validation_index, ]

# The rest is training data
training_data <- remaining_data[-validation_index, ]

# Print summary to check the structure
dim(training_data)  # For the training dataset

## [1] 2250   12

dim(validation_data)  # For the validation dataset

## [1] 750  12

dim(testing_data)  # For the testing dataset

## [1] 999  12

Appropriate Data Analysis Techniques
Consider the problem of analyzing the East-West Airlines travel award data. Without running any analysis (yet), of all the methods we’ve learned in class, choose three that you think would be appropriate for this problem and for each method explain:
(1) Why you think this method is appropriate for the problem
(2) How you would use this method to address either Goal #1 for East-West Airlines, Goal #2 for East-West Airlines, or both

I would select logistic regression, random forest, and CART.

Logistic Regression is suitable because it’s tailored for binary outcomes, like determining if a customer will claim a travel award (yes/no). Its output includes probabilities and coefficient insights that relate directly to the influence of predictors.
Goal #1: Predict whether a customer will claim an award using predictors like miles earned and transactions.
Goal #2: Examine the regression coefficients to understand which factors most influence the likelihood of award claims.

Random Forest is strong against overfitting and effective in handling various types of data, making it great for classification tasks. It also ranks features based on their importance in prediction, providing insights into what drives outcomes.
Goal #1: Classify customers on their likelihood to claim awards based on comprehensive feature sets.
Goal #2: Use the feature importance scores to identify and prioritize the factors that most affect award claiming.

CART is an ideal choice due to their simplicity and effectiveness in handling categorical and continuous data.
Goal #1: Visually dissect the paths taken by customers who claim awards, identifying key thresholds in behavior that predict higher likelihoods of claiming.
Goal #2: The structure of the tree can help identify the most influential factors leading to award claims, as the top nodes in the tree represent the most important variables.

Goal #1: Identifying if a customer will claim a travel award using their rewards
If East-West Airlines is most concerned with identifying if a new customer will claim a travel award using their rewards, what three methods would you suggest as the most appropriate for this problem? Why?
Run each of your three methods and decide which method East-West Airlines should use in practice. (Make sure to support your answers!)

1. Logistic Regression

names(Airlines)

##  [1] "ID"                "Balance"           "Qual_miles"       
##  [4] "cc1_miles"         "cc2_miles"         "cc3_miles"        
##  [7] "Bonus_miles"       "Bonus_trans"       "Flight_miles_12mo"
## [10] "Flight_trans_12"   "Days_since_enroll" "Award"

head(Airlines)

## # A tibble: 6 × 12
##      ID Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles Bonus_trans
##   <dbl>   <dbl>      <dbl>     <dbl>     <dbl>     <dbl>       <dbl>       <dbl>
## 1     1   28143          0         1         1         1         174           1
## 2     2   19244          0         1         1         1         215           2
## 3     3   41354          0         1         1         1        4123           4
## 4     4   14776          0         1         1         1         500           1
## 5     5   97752          0         4         1         1       43300          26
## 6     6   16420          0         1         1         1           0           0
## # ℹ 4 more variables: Flight_miles_12mo <dbl>, Flight_trans_12 <dbl>,
## #   Days_since_enroll <dbl>, Award <dbl>

# The reference level is set to "0" (not claimed), so the model coefficients will describe the effect of predictors on the probability of claiming an award ("1").
Airlines$Award <- factor(Airlines$Award, levels = c("0", "1"))
Airlines$Award <- relevel(Airlines$Award, ref = "0")

# Fit the logistic regression model
model <- glm(Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles + Bonus_miles + Bonus_trans + Flight_miles_12mo + Flight_trans_12 + Days_since_enroll, family = binomial(), data = Airlines)

# Display summary of the model
summary(model)

## 
## Call:
## glm(formula = Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + 
##     cc3_miles + Bonus_miles + Bonus_trans + Flight_miles_12mo + 
##     Flight_trans_12 + Days_since_enroll, family = binomial(), 
##     data = Airlines)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -1.164e+00  3.728e-01  -3.123 0.001789 ** 
## Balance           -1.732e-06  4.576e-07  -3.785 0.000154 ***
## Qual_miles         1.787e-04  4.900e-05   3.647 0.000265 ***
## cc1_miles         -5.574e-02  6.525e-02  -0.854 0.392961    
## cc2_miles         -6.942e-02  2.488e-01  -0.279 0.780238    
## cc3_miles         -7.613e-01  2.124e-01  -3.584 0.000338 ***
## Bonus_miles        4.041e-05  4.665e-06   8.663  < 2e-16 ***
## Bonus_trans        1.775e-02  5.753e-03   3.085 0.002033 ** 
## Flight_miles_12mo  1.365e-05  6.894e-05   0.198 0.843008    
## Flight_trans_12    1.594e-01  2.786e-02   5.723 1.05e-08 ***
## Days_since_enroll  1.392e-04  1.888e-05   7.376 1.63e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5271.8  on 3998  degrees of freedom
## Residual deviance: 4362.5  on 3988  degrees of freedom
## AIC: 4384.5
## 
## Number of Fisher Scoring iterations: 5

The model includes coefficients that are statistically significant, such as Balance, Qual_miles, cc3_miles, Bonus_miles, Bonus_trans, Flight_trans_12, and Days_since_enroll. These predictors have low p-values, suggesting strong evidence against the null hypothesis (that these coefficients are zero), indicating they are important predictors in determining whether a customer will claim a travel award.

The negative coefficients for Balance and cc3_miles suggest that increases in these predictors are associated with a decrease in the likelihood of claiming a travel award, whereas positive coefficients for variables like Bonus_miles and Days_since_enroll suggest an increase in the likelihood of claiming an award.

# Calculate confidence intervals for the coefficients
confint(model)

## Waiting for profiling to be done...

##                           2.5 %        97.5 %
## (Intercept)       -1.889730e+00 -4.228812e-01
## Balance           -2.629152e-06 -8.330669e-07
## Qual_miles         8.442144e-05  2.773445e-04
## cc1_miles         -1.843432e-01  7.144114e-02
## cc2_miles         -5.774166e-01  4.075282e-01
## cc3_miles         -1.185394e+00 -3.421540e-01
## Bonus_miles        3.142374e-05  4.970332e-05
## Bonus_trans        6.446666e-03  2.901256e-02
## Flight_miles_12mo -1.187238e-04  1.515353e-04
## Flight_trans_12    1.057396e-01  2.149284e-01
## Days_since_enroll  1.023096e-04  1.763207e-04

# Exponentiate coefficients to get odds ratios
exp(coef(model))

##       (Intercept)           Balance        Qual_miles         cc1_miles 
##         0.3121305         0.9999983         1.0001787         0.9457873 
##         cc2_miles         cc3_miles       Bonus_miles       Bonus_trans 
##         0.9329380         0.4670429         1.0000404         1.0179095 
## Flight_miles_12mo   Flight_trans_12 Days_since_enroll 
##         1.0000137         1.1728595         1.0001393

# Shows the estimates for the coefficients
model$coefficients

##       (Intercept)           Balance        Qual_miles         cc1_miles 
##     -1.164334e+00     -1.732249e-06      1.787109e-04     -5.573754e-02 
##         cc2_miles         cc3_miles       Bonus_miles       Bonus_trans 
##     -6.941654e-02     -7.613342e-01      4.041425e-05      1.775103e-02 
## Flight_miles_12mo   Flight_trans_12 Days_since_enroll 
##      1.365420e-05      1.594448e-01      1.392409e-04

# Predict probabilities for new data (e.g., a new customer scenario)
new_customer <- data.frame(Balance=30000, Qual_miles=500, cc1_miles=2, cc2_miles=1, cc3_miles=1, Bonus_miles=12000, Bonus_trans=12, Flight_miles_12mo=1500, Flight_trans_12=3, Days_since_enroll=2000)

# Probability of claiming the award
predict(model, new_customer, type="response")

##         1 
## 0.3557426

Strong Predictors: The most substantial effects are observed in cc3_miles (decreasing likelihood), Flight_trans_12, and Bonus_trans (both increasing likelihood).
Minimal Effects: Predictors such as Balance and Flight_miles_12mo have minimal effects on the likelihood of claiming awards, suggesting that these factors, while statistically significant, may not be as impactful in practical terms.
Policy Implications: For decision-makers at East-West Airlines, focusing on engaging customers through more flight transactions and bonus transactions might be more effective strategies for increasing travel award claims than merely increasing balances or flight miles.

# Full model with all predictors
full_model <- glm(Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles + Bonus_miles + Bonus_trans + Flight_miles_12mo + Flight_trans_12 + Days_since_enroll, family = binomial(), data = Airlines)

# Reduced model excluding cc1_miles, cc2_miles, and Flight_miles_12mo
reduced_model <- glm(Award ~ Balance + Qual_miles + cc3_miles + Bonus_miles + Bonus_trans + Flight_trans_12 + Days_since_enroll, family = binomial(), data = Airlines)

# ANOVA to compare the full model and the reduced model
anova_result <- anova(reduced_model, full_model, test="Chisq")
print(anova_result)

## Analysis of Deviance Table
## 
## Model 1: Award ~ Balance + Qual_miles + cc3_miles + Bonus_miles + Bonus_trans + 
##     Flight_trans_12 + Days_since_enroll
## Model 2: Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles + 
##     Bonus_miles + Bonus_trans + Flight_miles_12mo + Flight_trans_12 + 
##     Days_since_enroll
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      3991     4363.3                     
## 2      3988     4362.5  3  0.81288   0.8464

The high p-value (0.8464) in the ANOVA table strongly suggests that the reduced model (Model 1) fits the data nearly as well as the full model (Model 2), despite having fewer predictors. This result implies that the predictors removed (cc1_miles, cc2_miles, and Flight_miles_12mo) do not contribute significantly to the model, and their exclusion simplifies the model without losing predictive power.

2. Random Forest

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

library(ROCR)
library(caret)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

## Loading required package: lattice

# Ensure the target variable 'Award' is a factor
training_data$Award <- as.factor(training_data$Award)
validation_data$Award <- as.factor(validation_data$Award)

# Fit the Random Forest model
rf_model <- randomForest(Award ~ ., data = training_data, ntree = 500, mtry = 4, importance = TRUE)

# Review model details and variable importance
print(rf_model)

## 
## Call:
##  randomForest(formula = Award ~ ., data = training_data, ntree = 500,      mtry = 4, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 24.36%
## Confusion matrix:
##      0   1 class.error
## 0 1200 207   0.1471215
## 1  341 502   0.4045077

varImpPlot(rf_model)

From the plot, the variables like Bonus_miles, Balance, and Bonus_trans stand out as the most significant predictors for increasing the accuracy of the model. This indicates that these features, when altered, affect the model’s accuracy the most significantly.

# Prediction on validation data
rf_predictions <- predict(rf_model, validation_data)

# Generate a confusion matrix
conf_matrix <- table(Actual = validation_data$Award, Predicted = rf_predictions)
print(conf_matrix)

##       Predicted
## Actual   0   1
##      0 399  57
##      1 100 194

Sensitivity: The model is able to identify 66% of all actual award claims.
Specificity: The model correctly identifies 87.5% of those who do not claim awards.

# Calculate overall prediction error
prediction_error <- 1 - sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Prediction Error: ", prediction_error))

## [1] "Prediction Error:  0.209333333333333"

# ROC curve analysis
rf_probabilities <- predict(rf_model, validation_data, type = "prob")[,2]  # Probabilities for the positive class
pred_obj <- prediction(rf_probabilities, validation_data$Award)
perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")

# Plot ROC curve
plot(perf, main = "ROC Curve", col = rainbow(10))
abline(0, 1, col = "red")  # Adding reference line

# Calculate AUC
auc_value <- performance(pred_obj, measure = "auc")
auc_value <- auc_value@y.values[[1]]
print(paste("AUC: ", auc_value))

## [1] "AUC:  0.839658670485738"

The ROC curve shows good predictive performance, significantly above the random guess line, indicating effective discrimination between customers likely and unlikely to claim a travel award.

The AUC value of 0.84 the model indicates a strong ability to distinguish between customers who will and will not claim a travel award. This high AUC value suggests that the model is highly effective in meeting Goal 1, which aims to predict award claim behavior among customers.

3. CART

# Load necessary libraries
library(rpart)
library(rpart.plot)
library(caret)

# Ensure the target variable 'Award' is a factor
Airlines$Award <- as.factor(Airlines$Award)

# Splitting the data into training and validation sets
set.seed(123)
index <- sample(1:nrow(Airlines), 0.7 * nrow(Airlines))
training_data <- Airlines[index, ]
validation_data <- Airlines[-index, ]

# Fit the CART model
cart_model <- rpart(Award ~ ., data = training_data, method = "class", cp = 0.01) 

# Plot the tree
rpart.plot(cart_model, type = 1, extra = 101)

This model illustrates the key factors influencing airline customers’ likelihood of claiming travel awards. The primary split is based on Bonus_miles, indicating its significance as a predictor. Customers with higher bonus miles consistently show a higher probability of claiming awards. Further splits based on Balance and Flight_miles_12mo suggest that financial status and travel frequency also play crucial roles. The model effectively segments customers into groups with varying probabilities of award claims, which can guide strategies to enhance participation in the airline’s rewards program.

# Evaluate the model on training data
training_predictions <- predict(cart_model, training_data, type = "class")
train_table <- table(Actual = training_data$Award, Predicted = training_predictions)
train_accuracy <- sum(diag(train_table)) / sum(train_table)

# Evaluate the model on validation data
validation_predictions <- predict(cart_model, validation_data, type = "class")
valid_table <- table(Actual = validation_data$Award, Predicted = validation_predictions)
valid_accuracy <- sum(diag(valid_table)) / sum(valid_table)

# Print the results
print(paste("Training Accuracy:", train_accuracy))

## [1] "Training Accuracy: 0.774205073240443"

print(paste("Validation Accuracy:", valid_accuracy))

## [1] "Validation Accuracy: 0.746666666666667"

The analysis of the CART model’s performance shows a training accuracy of approximately 77.42% and a validation accuracy of 74.67%. This indicates that the model fits the training data reasonably well and maintains a similar level of performance on unseen validation data, suggesting moderate generalizability. The slight drop in accuracy from training to validation could be indicative of minor overfitting but overall, the model seems robust enough for practical use in predicting whether customers will claim travel awards based on their profile characteristics captured in the training dataset.

# Plot ROC curve and calculate AUC
library(ROCR)
validation_prob <- predict(cart_model, validation_data, type = "prob")[, 2]
prediction_obj <- prediction(validation_prob, validation_data$Award)
perf <- performance(prediction_obj, "tpr", "fpr")
plot(perf, main = "ROC Curve")
abline(0, 1, col = "red")

auc <- performance(prediction_obj, "auc")
auc_value <- auc@y.values[[1]]
print(paste("AUC:", auc_value))

## [1] "AUC: 0.762723266745006"

The ROC curve plotted from the predictions shows a moderate level of discriminative ability, with an AUC of approximately 0.763. This value indicates a reasonable level of accuracy in the model’s ability to differentiate between those customers who will claim a travel award and those who will not. The curve significantly deviates from the line of no discrimination, suggesting that the model performs better than random guessing. However, the curve’s slope and the AUC value indicate that there is still room for improvement, particularly in minimizing false positive rates and enhancing true positive rates.

Based on the results that each method provided:
Logistic Regression offered significant insights into which variables are important and how they affect the probability of a customer claiming a reward. However, it might oversimplify the relationships by assuming linearity.
Random Forest showed a high level of predictive accuracy (AUC = 0.84) and was effective in differentiating between customers likely to claim awards versus those who are not. The model also provided insights into feature importance.
CART, while visually intuitive and useful for generating decision rules, showed lower accuracy and AUC than Random Forest.

Given these considerations, Random Forest appears to be the most suitable method for Goal #1 at East-West Airlines. It balances high predictive accuracy with the ability to handle complex data structures and provides actionable insights through feature importance scores. This method can robustly identify patterns that are not readily apparent through simpler models, making it ideal for practical use in predicting customer behavior concerning travel award claims. This method will likely provide the most reliable predictions while also offering insights that can help refine marketing strategies and customer engagement initiatives.

Goal #2: Identifying factors that lead to customers claiming a travel award
If East-West Airlines is most concerned with identifying factors that lead to customers claiming a travel award, what three methods would you suggest as the most appropriate for this problem? Why?
Run each of your three methods and decide which factors East-West Airlines should focus on for determining if customers will claim a travel award. (Make sure to support your answers!)

1. Logistic Regression

model <- glm(Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles + Bonus_miles + Bonus_trans + Flight_miles_12mo + Flight_trans_12 + Days_since_enroll, family = binomial(), data = Airlines)
summary(model)

## 
## Call:
## glm(formula = Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + 
##     cc3_miles + Bonus_miles + Bonus_trans + Flight_miles_12mo + 
##     Flight_trans_12 + Days_since_enroll, family = binomial(), 
##     data = Airlines)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -1.164e+00  3.728e-01  -3.123 0.001789 ** 
## Balance           -1.732e-06  4.576e-07  -3.785 0.000154 ***
## Qual_miles         1.787e-04  4.900e-05   3.647 0.000265 ***
## cc1_miles         -5.574e-02  6.525e-02  -0.854 0.392961    
## cc2_miles         -6.942e-02  2.488e-01  -0.279 0.780238    
## cc3_miles         -7.613e-01  2.124e-01  -3.584 0.000338 ***
## Bonus_miles        4.041e-05  4.665e-06   8.663  < 2e-16 ***
## Bonus_trans        1.775e-02  5.753e-03   3.085 0.002033 ** 
## Flight_miles_12mo  1.365e-05  6.894e-05   0.198 0.843008    
## Flight_trans_12    1.594e-01  2.786e-02   5.723 1.05e-08 ***
## Days_since_enroll  1.392e-04  1.888e-05   7.376 1.63e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5271.8  on 3998  degrees of freedom
## Residual deviance: 4362.5  on 3988  degrees of freedom
## AIC: 4384.5
## 
## Number of Fisher Scoring iterations: 5

# The reference level is set to "0" (not claimed), so the model coefficients will describe the effect of predictors on the probability of claiming an award ("1").
Airlines$Award <- factor(Airlines$Award, levels = c("0", "1"))
Airlines$Award <- relevel(Airlines$Award, ref = "0")

step(model)

## Start:  AIC=4384.51
## Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles + 
##     Bonus_miles + Bonus_trans + Flight_miles_12mo + Flight_trans_12 + 
##     Days_since_enroll
## 
##                     Df Deviance    AIC
## - Flight_miles_12mo  1   4362.5 4382.5
## - cc2_miles          1   4362.6 4382.6
## - cc1_miles          1   4363.2 4383.2
## <none>                   4362.5 4384.5
## - Bonus_trans        1   4371.9 4391.9
## - cc3_miles          1   4374.6 4394.6
## - Qual_miles         1   4376.5 4396.5
## - Balance            1   4376.7 4396.7
## - Flight_trans_12    1   4398.7 4418.7
## - Days_since_enroll  1   4417.4 4437.4
## - Bonus_miles        1   4456.1 4476.1
## 
## Step:  AIC=4382.55
## Award ~ Balance + Qual_miles + cc1_miles + cc2_miles + cc3_miles + 
##     Bonus_miles + Bonus_trans + Flight_trans_12 + Days_since_enroll
## 
##                     Df Deviance    AIC
## - cc2_miles          1   4362.6 4380.6
## - cc1_miles          1   4363.3 4381.3
## <none>                   4362.5 4382.5
## - Bonus_trans        1   4372.0 4390.0
## - cc3_miles          1   4374.7 4392.7
## - Qual_miles         1   4376.6 4394.6
## - Balance            1   4376.7 4394.7
## - Days_since_enroll  1   4417.4 4435.4
## - Bonus_miles        1   4456.9 4474.9
## - Flight_trans_12    1   4481.0 4499.0
## 
## Step:  AIC=4380.63
## Award ~ Balance + Qual_miles + cc1_miles + cc3_miles + Bonus_miles + 
##     Bonus_trans + Flight_trans_12 + Days_since_enroll
## 
##                     Df Deviance    AIC
## - cc1_miles          1   4363.3 4379.3
## <none>                   4362.6 4380.6
## - Bonus_trans        1   4372.0 4388.0
## - cc3_miles          1   4374.7 4390.7
## - Qual_miles         1   4376.7 4392.7
## - Balance            1   4376.8 4392.8
## - Days_since_enroll  1   4417.4 4433.4
## - Bonus_miles        1   4457.7 4473.7
## - Flight_trans_12    1   4481.8 4497.8
## 
## Step:  AIC=4379.32
## Award ~ Balance + Qual_miles + cc3_miles + Bonus_miles + Bonus_trans + 
##     Flight_trans_12 + Days_since_enroll
## 
##                     Df Deviance    AIC
## <none>                   4363.3 4379.3
## - Bonus_trans        1   4372.0 4386.0
## - cc3_miles          1   4374.7 4388.7
## - Balance            1   4377.5 4391.5
## - Qual_miles         1   4377.5 4391.5
## - Days_since_enroll  1   4417.5 4431.5
## - Flight_trans_12    1   4500.9 4514.9
## - Bonus_miles        1   4611.8 4625.8

## 
## Call:  glm(formula = Award ~ Balance + Qual_miles + cc3_miles + Bonus_miles + 
##     Bonus_trans + Flight_trans_12 + Days_since_enroll, family = binomial(), 
##     data = Airlines)
## 
## Coefficients:
##       (Intercept)            Balance         Qual_miles          cc3_miles  
##        -1.341e+00         -1.729e-06          1.799e-04         -7.028e-01  
##       Bonus_miles        Bonus_trans    Flight_trans_12  Days_since_enroll  
##         3.720e-05          1.632e-02          1.676e-01          1.378e-04  
## 
## Degrees of Freedom: 3998 Total (i.e. Null);  3991 Residual
## Null Deviance:       5272 
## Residual Deviance: 4363  AIC: 4379

step(model, trace=0)

## 
## Call:  glm(formula = Award ~ Balance + Qual_miles + cc3_miles + Bonus_miles + 
##     Bonus_trans + Flight_trans_12 + Days_since_enroll, family = binomial(), 
##     data = Airlines)
## 
## Coefficients:
##       (Intercept)            Balance         Qual_miles          cc3_miles  
##        -1.341e+00         -1.729e-06          1.799e-04         -7.028e-01  
##       Bonus_miles        Bonus_trans    Flight_trans_12  Days_since_enroll  
##         3.720e-05          1.632e-02          1.676e-01          1.378e-04  
## 
## Degrees of Freedom: 3998 Total (i.e. Null);  3991 Residual
## Null Deviance:       5272 
## Residual Deviance: 4363  AIC: 4379

model_coefficients <- coef(model)
print(model_coefficients)

##       (Intercept)           Balance        Qual_miles         cc1_miles 
##     -1.164334e+00     -1.732249e-06      1.787109e-04     -5.573754e-02 
##         cc2_miles         cc3_miles       Bonus_miles       Bonus_trans 
##     -6.941654e-02     -7.613342e-01      4.041425e-05      1.775103e-02 
## Flight_miles_12mo   Flight_trans_12 Days_since_enroll 
##      1.365420e-05      1.594448e-01      1.392409e-04

odds_claim <- exp(model_coefficients)
print(odds_claim)

##       (Intercept)           Balance        Qual_miles         cc1_miles 
##         0.3121305         0.9999983         1.0001787         0.9457873 
##         cc2_miles         cc3_miles       Bonus_miles       Bonus_trans 
##         0.9329380         0.4670429         1.0000404         1.0179095 
## Flight_miles_12mo   Flight_trans_12 Days_since_enroll 
##         1.0000137         1.1728595         1.0001393

Non-significant predictors, Flight_miles_12mo, cc1_miles, cc2_miles, were removed, resulting in a reduced model with a lower AIC, suggesting a better model fit with fewer variables.

The coefficients of the final model indicate the log odds of claiming a reward for each unit increase in the predictors. Coefficients closer to zero have a smaller effect, whereas larger absolute values indicate a stronger effect.

The final model is more parsimonious than the initial model and retains variables that are most influential for predicting an award claim. This model not only provides insights into which factors are most important but also ensures that the model is not overly complex, which can improve model performance and interpretability.

2. Random Forest

# Load necessary libraries
library(randomForest)
library(caret)  # for additional model evaluation metrics

# Fit the Random Forest model
set.seed(123) 
rf_model <- randomForest(Award ~ ., data = training_data, ntree = 500, mtry = 4, importance = TRUE)

# Output the model details to review variable importance
print(rf_model)

## 
## Call:
##  randomForest(formula = Award ~ ., data = training_data, ntree = 500,      mtry = 4, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 23.15%
## Confusion matrix:
##      0   1 class.error
## 0 1550 228   0.1282340
## 1  420 601   0.4113614

# Plotting the importance of variables
varImpPlot(rf_model, main = "Feature Importance in Predicting Travel Award Claiming")

# To get a more detailed view of the importance
importance_scores <- importance(rf_model)
print(importance_scores)

##                           0          1 MeanDecreaseAccuracy MeanDecreaseGini
## ID                26.067203  -3.240115           26.3684866       159.757715
## Balance           94.524151 -15.891439           71.1609329       263.810688
## Qual_miles        10.268828   3.972043           10.1321805        17.788421
## cc1_miles         16.847852  11.374680           21.4202436        61.062159
## cc2_miles          6.143701  -5.721670            0.8294604         3.418842
## cc3_miles          1.375224  -5.987414           -4.9128705         1.638468
## Bonus_miles       69.789038  28.574894           82.0143089       300.942900
## Bonus_trans       24.905191  13.777174           32.2925804       132.680377
## Flight_miles_12mo 13.303305  28.305898           31.1254767       102.860463
## Flight_trans_12    8.583279  23.710049           24.0421978        58.821654
## Days_since_enroll 26.335174   2.837085           28.8745741       159.531097

Bonus_miles has the highest impact on both accuracy and Gini decrease, indicating it is a critical predictor. This suggests that the accumulation of bonus miles is a significant determinant in a customer’s decision to claim a travel award.
Balance and Days_since_enroll also show significant effects on model accuracy and Gini decrease, suggesting that longer-term customers with higher account balances are more likely to engage with the airline’s award system.
Flight_miles_12mo and Flight_trans_12 have notable impacts, especially in Gini decrease, which highlights their roles in influencing award claims, likely due to direct engagement with flying activities.
cc2_miles and cc3_miles, which might represent affiliations with less frequent credit card programs or partnerships, show minimal impact, suggesting they are less critical in determining award claiming behavior.

3. CART

# Load necessary libraries
library(rpart)
library(rpart.plot)
library(caret)

# Ensure the target variable 'Award' is a factor
Airlines$Award <- as.factor(Airlines$Award)

# Splitting the data into training and validation sets
set.seed(123)
index <- sample(1:nrow(Airlines), 0.7 * nrow(Airlines))
training_data <- Airlines[index, ]
validation_data <- Airlines[-index, ]

# Fit the simplified CART model
cart_model <- rpart(Award ~ ., data = training_data, method = "class", cp = 0.02) 

# Plot the tree with important splits
rpart.plot(cart_model, type = 4, extra = 102, under = TRUE, cex = 0.75)

The decision tree provides a clear visualization of how various features contribute to the likelihood of a customer claiming a travel award. The first split is on Bonus_miles, indicating it’s a significant predictor of award claims. This suggests that customers with a higher number of bonus miles are more likely to claim awards.
Further splits on Balance and Flight_miles_12mo suggest these variables also significantly impact the decision to claim an award. For example, customers with bonus miles greater than 7704 and a balance less than 29,000 show a higher probability of claiming awards, pointing to specific customer segments that might be targeted more effectively.

# Print the results of variable importance
var_importance <- as.data.frame(varImp(cart_model, scale = FALSE))
print(var_importance)

##                     Overall
## Balance            44.49856
## Bonus_miles       235.37386
## Bonus_trans       143.83443
## cc1_miles         130.86686
## Flight_miles_12mo 205.39944
## Flight_trans_12   201.45258
## ID                  0.00000
## Qual_miles          0.00000
## cc2_miles           0.00000
## cc3_miles           0.00000
## Days_since_enroll   0.00000

The variable importance output highlights which variables most influence the model’s predictions. Bonus_miles, Bonus_trans, and Flight_miles_12mo are shown to have substantial importance, confirming their strong predictive power regarding award claims. This insight is crucial for understanding which features should be focused on to enhance customer engagement and reward claiming behavior.

# Validate the model's ability to differentiate factors influencing award claims
validation_predictions <- predict(cart_model, validation_data, type = "class")
valid_table <- table(Actual = validation_data$Award, Predicted = validation_predictions)
valid_accuracy <- sum(diag(valid_table)) / sum(valid_table)

# Print validation accuracy
print(paste("Validation Accuracy:", valid_accuracy))

## [1] "Validation Accuracy: 0.7375"

# Print simple decision rules from tree
print(rpart.rules(cart_model, style="tall"))

## Award is 0.19 when
##     Bonus_miles < 7704
## 
## Award is 0.37 when
##     Bonus_miles >= 7704
##     Balance >= 28682
##     Flight_miles_12mo < 75
## 
## Award is 0.69 when
##     Bonus_miles >= 7704
##     Balance >= 28682
##     Flight_miles_12mo >= 75
## 
## Award is 0.85 when
##     Bonus_miles >= 7704
##     Balance < 28682

# Plot ROC curve and calculate AUC for validation data
library(ROCR)
validation_prob <- predict(cart_model, validation_data, type = "prob")[, 2]
prediction_obj <- prediction(validation_prob, validation_data$Award)
perf <- performance(prediction_obj, "tpr", "fpr")
plot(perf, main = "ROC Curve for Simplified CART Model")
abline(0, 1, col = "red")

auc <- performance(prediction_obj, "auc")
auc_value <- auc@y.values[[1]]
print(paste("AUC:", auc_value))

## [1] "AUC: 0.756977085781434"

The ROC curve and an AUC value of approximately 0.757 demonstrate the model’s ability to discriminate between those who will and won’t claim awards. The curve is above the diagonal line of no discrimination, which confirms the model’s effectiveness in distinguishing between the two classes, but it has some room for improvement.

Based on the results that each method provided:
Logistic regression provided detailed coefficients for each predictor, indicating how much each factor increases or decreases the odds of a customer claiming a travel award.
Random Forest performed well in terms of model accuracy and provided a comprehensive ranking of feature importance. CART provided a straightforward visual representation of how different features lead to a travel award claim.

Given these considerations, CART appears to be the most suitable method for Goal #2 at East-West Airlines for its simplicity and direct applicability. The clear, visual nature of the decision tree allows non-technical stakeholders to easily understand and implement strategies based on its results. CART effectively identifies critical thresholds and conditions that influence award claiming, making it ideal for practical, actionable business applications.

Project

05-02-2024