The problem at hand is to develop a predictive model to forecast which customers will book with the company during the 2013–14 school year, spanning from fall 2013 to spring 2014. To achieve this, we need to utilize available data from the 2012–13 school year to train the model. This historical data includes information on customer profiles and trip-related data such as post-trip surveys, trip revenue, trip length, and school size. The ultimate goal is to deliver a reasonably accurate predictive model to enable the deployment of a new marketing strategy before the sales season begins later in the spring.
We are tasked with building a predictive model for customer retention. Specifically, we aim to predict which customers will book with the company during the 2013–14 school year. To accomplish this, we can consider using various machine learning algorithms, such as logistic regression, decision trees, random forests, or gradient boosting. However, given the context of the problem and the need to handle potentially high-dimensional data with many features, a suitable model to consider would be a penalized regression model, such as LASSO (Least Absolute Shrinkage and Selection Operator) regression. LASSO regression is particularly useful when dealing with datasets with a large number of predictors and helps in feature selection by shrinking the coefficients of less important variables to zero, effectively performing variable selection and regularization simultaneously. This can lead to a more interpretable and potentially more accurate predictive model for customer retention.
set.seed(1) #
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # contains createDataPartition
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(glmnet) # performed lasso regression
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
library(pROC) # Makes ROC curves
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(mltools) # One-hot encoding
## Warning: package 'mltools' was built under R version 4.3.2
##
## Attaching package: 'mltools'
##
## The following object is masked from 'package:tidyr':
##
## replace_na
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
travel_Exhibit_1 <- read.csv("C:/Users/DELL/Desktop/2024_Projects/Retention modeling project/Session 7 Outline and Case Studies/Retention Modeling Data A - Exhibit 1.csv")
travel_retention_b <- read.csv("C:/Users/DELL/Desktop/2024_Projects/Retention modeling project/Session 7 Outline and Case Studies/Retention Modeling Data - B.csv")
travel <- cbind(travel_Exhibit_1, travel_retention_b)
summary(travel)
## ID Program.Code From.Grade To.Grade
## Length:2392 Length:2392 Min. : 3.000 Min. : 3.000
## Class :character Class :character 1st Qu.: 7.000 1st Qu.: 8.000
## Mode :character Mode :character Median : 8.000 Median : 8.000
## Mean : 7.268 Mean : 7.913
## 3rd Qu.: 8.000 3rd Qu.: 8.000
## Max. :12.000 Max. :12.000
## NA's :130 NA's :153
## Group.State Is.Non.Annual. Days Travel.Type
## Length:2392 Min. :0.000 Min. : 1.000 Length:2392
## Class :character 1st Qu.:0.000 1st Qu.: 4.000 Class :character
## Mode :character Median :0.000 Median : 5.000 Mode :character
## Mean :0.154 Mean : 4.575
## 3rd Qu.:0.000 3rd Qu.: 5.000
## Max. :1.000 Max. :12.000
## NA's :3 NA's :3
## Departure.Date Return.Date Deposit.Date Special.Pay
## Length:2392 Length:2392 Length:2392 Length:2392
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Tuition FRP.Active FRP.Cancelled FRP.Take.up.percent.
## Min. : 79 Min. : 0.00 Min. : 0.000 Min. :0.0000
## 1st Qu.:1174 1st Qu.: 6.00 1st Qu.: 1.000 1st Qu.:0.4550
## Median :1700 Median : 12.00 Median : 2.000 Median :0.6000
## Mean :1615 Mean : 16.87 Mean : 3.306 Mean :0.5707
## 3rd Qu.:2048 3rd Qu.: 23.00 3rd Qu.: 4.000 3rd Qu.:0.7270
## Max. :4200 Max. :257.00 Max. :45.000 Max. :1.0000
## NA's :3 NA's :3 NA's :3 NA's :3
## Early.RPL Latest.RPL Cancelled.Pax Total.Discount.Pax
## Length:2392 Length:2392 Min. : 0.000 Min. : 0.000
## Class :character Class :character 1st Qu.: 2.000 1st Qu.: 1.000
## Mode :character Mode :character Median : 4.000 Median : 2.000
## Mean : 4.807 Mean : 2.954
## 3rd Qu.: 6.000 3rd Qu.: 4.000
## Max. :39.000 Max. :47.000
## NA's :3 NA's :3
## Initial.System.Date Poverty.Code Region CRM.Segment
## Length:2392 Length:2392 Length:2392 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 5.00
## Mode :character Mode :character Mode :character Median : 6.00
## Mean : 6.92
## 3rd Qu.:10.00
## Max. :11.00
## NA's :7
## School.Type Parent.Meeting.Flag MDR.Low.Grade MDR.High.Grade
## Length:2392 Min. :0.0000 Length:2392 Min. : 1.000
## Class :character 1st Qu.:1.0000 Class :character 1st Qu.: 8.000
## Mode :character Median :1.0000 Mode :character Median : 8.000
## Mean :0.8589 Mean : 8.392
## 3rd Qu.:1.0000 3rd Qu.: 8.000
## Max. :1.0000 Max. :12.000
## NA's :3 NA's :71
## Total.School.Enrollment Income.Level EZ.Pay.Take.Up.Rate
## Min. : 19.0 Length:2392 Min. :0.0000
## 1st Qu.: 360.0 Class :character 1st Qu.:0.1000
## Median : 597.0 Mode :character Median :0.2000
## Mean : 648.4 Mean :0.2079
## 3rd Qu.: 825.8 3rd Qu.:0.2920
## Max. :3990.0 Max. :1.7500
## NA's :94 NA's :3
## School.Sponsor SPR.Product.Type SPR.New.Existing FPP
## Min. :0.0000 Length:2392 Length:2392 Min. : 2.0
## 1st Qu.:0.0000 Class :character Class :character 1st Qu.: 12.0
## Median :0.0000 Mode :character Mode :character Median : 23.0
## Mean :0.1059 Mean : 31.3
## 3rd Qu.:0.0000 3rd Qu.: 41.0
## Max. :1.0000 Max. :286.0
## NA's :3 NA's :3
## Total.Pax SPR.Group.Revenue NumberOfMeetingswithParents
## Min. : 2.00 Min. : 79 Min. :0.000
## 1st Qu.: 14.00 1st Qu.:1174 1st Qu.:1.000
## Median : 26.00 Median :1700 Median :1.000
## Mean : 34.25 Mean :1615 Mean :1.102
## 3rd Qu.: 44.00 3rd Qu.:2048 3rd Qu.:1.000
## Max. :313.00 Max. :4200 Max. :2.000
## NA's :3 NA's :3 NA's :3
## FirstMeeting LastMeeting DifferenceTraveltoFirstMeeting
## Length:2392 Length:2392 Min. :-204.0
## Class :character Class :character 1st Qu.: 208.0
## Mode :character Mode :character Median : 250.0
## Mean : 262.1
## 3rd Qu.: 287.0
## Max. : 749.0
## NA's :340
## DifferenceTraveltoLastMeeting SchoolGradeTypeLow SchoolGradeTypeHigh
## Min. :-204.0 Length:2392 Length:2392
## 1st Qu.: 196.8 Class :character Class :character
## Median : 233.0 Mode :character Mode :character
## Mean : 229.0
## 3rd Qu.: 261.0
## Max. : 749.0
## NA's :340
## SchoolGradeType DepartureMonth GroupGradeTypeLow GroupGradeTypeHigh
## Length:2392 Length:2392 Length:2392 Length:2392
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## GroupGradeType MajorProgramCode SingleGradeTripFlag
## Length:2392 Length:2392 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :1.0000
## Mean :0.5567
## 3rd Qu.:1.0000
## Max. :1.0000
## NA's :3
## FPP.to.School.enrollment FPP.to.PAX Num.of.Non_FPP.PAX
## Min. :0.00092 Min. :0.6000 Min. : 0.000
## 1st Qu.:0.02079 1st Qu.:0.8824 1st Qu.: 1.000
## Median :0.04526 Median :0.9091 Median : 2.000
## Mean :0.06618 Mean :0.9007 Mean : 2.954
## 3rd Qu.:0.08752 3rd Qu.:0.9333 3rd Qu.: 4.000
## Max. :2.05263 Max. :1.0000 Max. :47.000
## NA's :94 NA's :3 NA's :3
## SchoolSizeIndicator Retained.in.2012. ID NPS.2011
## Length:2392 Min. :0.0000 Length:2392 Min. : 2.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 9.000
## Mode :character Median :1.0000 Mode :character Median :10.000
## Mean :0.6074 Mean : 9.405
## 3rd Qu.:1.0000 3rd Qu.:10.000
## Max. :1.0000 Max. :10.000
## NA's :3 NA's :580
## NPS.2010 NPS.2009 NPS.2008 X...3.FPP.Date
## Min. : 4.000 Min. : 2.000 Min. : 1.000 Length:2392
## 1st Qu.: 9.000 1st Qu.: 9.000 1st Qu.: 9.000 Class :character
## Median :10.000 Median :10.000 Median :10.000 Mode :character
## Mean : 9.466 Mean : 9.405 Mean : 9.359
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000
## Max. :10.000 Max. :10.000 Max. :10.000
## NA's :1232 NA's :1228 NA's :1415
## X...10.FPP.Date X...20.FPP.Date X...35.FPP.Date
## Length:2392 Length:2392 Length:2392
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
travel <- na.omit(travel[, 1:56])
Step 1: Variable classification & transformation into dummy variables
Step 2. Visualize potential predictors; make transformations as necessary (managing highly skewed variables, nonlinear relationships, etc)
Skipping this step; demonstrated in numerous other cases
Step 3. Select the variables to be included in the set of candidate predictors & place them into a new dataset with only variables you are going to use
# create dataset to use for regression
trav <- travel[ , c("Retained.in.2012.", "Is.Non.Annual.", "School.Sponsor", "SingleGradeTripFlag")]
travel$Days <- as.numeric(travel$Days)
travel$Tuition <- as.numeric(travel$Tuition)
travel$FRP.Active <- as.numeric(travel$FRP.Active)
travel$FRP.Cancelled <- as.numeric(travel$FRP.Cancelled)
travel$FRP.Take.up.percent. <- as.numeric(travel$FRP.Take.up.percent.)
travel$Cancelled.Pax <- as.numeric(travel$Cancelled.Pax)
travel$Total.Discount.Pax <- as.numeric(travel$Total.Discount.Pax)
travel$FPP.to.School.enrollment <- as.numeric(travel$FPP.to.School.enrollment)
travel$EZ.Pay.Take.Up.Rate <- as.numeric(travel$EZ.Pay.Take.Up.Rate)
travel$Total.School.Enrollment <- as.numeric(travel$Total.School.Enrollment)
travel$FPP <- as.numeric(travel$FPP)
travel$DifferenceTraveltoFirstMeeting <- as.numeric(travel$DifferenceTraveltoFirstMeeting)
travel$FPP.to.PAX <- as.numeric(travel$FPP.to.PAX)
trav$Days <- scale(travel$Days)
trav$Tuition <- scale(travel$Tuition)
trav$FRP.Active <- scale(travel$FRP.Active)
trav$FRP.Cancelled <- scale(travel$FRP.Cancelled)
trav$FRP.Take.up.percent. <- scale(travel$FRP.Take.up.percent.)
trav$Cancelled.Pax <- scale(travel$Cancelled.Pax)
trav$Total.Discount.Pax <- scale(travel$Total.Discount.Pax)
trav$FPP.to.School.enrollment <- scale(travel$FPP.to.School.enrollment)
trav$EZ.Pay.Take.Up.Rate <- scale(travel$EZ.Pay.Take.Up.Rate)
trav$Total.School.Enrollment <- scale(travel$Total.School.Enrollment)
trav$FPP <- scale(travel$FPP)
trav$DifferenceTraveltoFirstMeeting <- scale(travel$DifferenceTraveltoFirstMeeting)
trav$FPP.to.PAX <- scale(travel$FPP.to.PAX)
travel$Special.Pay <- as.factor(travel$Special.Pay)
travel$Group.State <- as.factor(travel$Group.State)
travel$Poverty.Code <- as.factor(travel$Poverty.Code)
travel$CRM.Segment <- as.factor(travel$CRM.Segment)
travel$School.Type <- as.factor(travel$School.Type)
travel$To.Grade <- as.factor(travel$To.Grade)
travel$MDR.Low.Grade <- as.factor(travel$MDR.Low.Grade)
travel$MDR.High.Grade <- as.factor(travel$MDR.High.Grade)
travel$SchoolSizeIndicator <- as.factor(travel$SchoolSizeIndicator)
travel$SPR.New.Existing <- as.factor(travel$SPR.New.Existing)
travel$DepartureMonth <- as.factor(travel$DepartureMonth)
travel$MajorProgramCode <- as.factor(travel$MajorProgramCode)
newdata <- one_hot(as.data.table(travel$DepartureMonth))
names(newdata) <- paste0("DepartMonth_", names(newdata))
trav <- cbind(trav, newdata)
newdata <- one_hot(as.data.table(travel$MajorProgramCode))
names(newdata) <- paste0("MajorProgram_", names(newdata))
trav <- cbind(trav, newdata)
train.size <- 0.7 ## use 70% of the data for training
train.index <- createDataPartition(as.factor(trav$Retained.in.2012.), p= train.size, list=F) # createDataPartition function is in the
trav.train <- trav[train.index, ]
trav.test <- trav[-train.index, ]
# weights for the training set
p.retain <- sum(trav.train$Retained.in.2012. == 1)/length(trav.train$Retained.in.2012.)
weights <- rep(NA, times=length(trav.train$Retained.in.2012.))
weights[trav.train$Retained.in.2012. == 0] <- p.retain
weights[trav.train$Retained.in.2012. == 1] <- 1-p.retain
Using only the training dataset, the cv.glmnet function is doing k-fold cross validation to identify the ‘best’ model for prediction within the training set. later we will evaluate this model using the testing set
cv.lasso <- cv.glmnet(y = as.matrix(trav.train$Retained.in.2012.), x = as.matrix(trav.train[, -trav.train$Retained.in.2012.]),
family="binomial", na.action = NULL, weights = weights, type.measure = "auc")
Set the model formula with interaction variables (‘+0’ here because glmnet adds an intercept by default)
interaction_formula <- as.formula(Retained.in.2012. ~ . + .*. + 0)
# combining the model formula and the data to create the design matrix for the regression
X.train <- model.matrix(interaction_formula, trav.train)
X.test <- model.matrix(interaction_formula, trav.test)
cv.lasso.withInt <- cv.glmnet(y = as.matrix(trav.train$Retained.in.2012.), x = as.matrix(X.train),
family="binomial", na.action = NULL, weights = weights, type.measure = "auc")
Evaluating model fit is Area-Under-the-Curve
probs <- predict(cv.lasso, as.matrix(trav.test[, -trav.test$Retained.in.2012.]), s=cv.lasso$lambda.min, type = "response")
roc_obj <- roc(as.matrix(trav.test$Retained.in.2012.) ~ probs, smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE, main="AUC Lasso Regression: No Interactions")
## Setting levels: control = 0, case = 1
## Warning in roc.default(response, predictors[, 1], ...): Deprecated use a matrix
## as predictor. Unexpected results may be produced, please pass a numeric vector.
## Setting direction: controls < cases
# Adding confidence intervals to the plot
sens.ci <- ci.se(roc_obj)
plot(sens.ci, type="shape", col="lightblue")
## Warning in plot.ci.se(sens.ci, type = "shape", col = "lightblue"): Low
## definition shape.
plot(sens.ci, type="bars")
probs <- predict(cv.lasso.withInt, as.matrix(X.test), s=cv.lasso.withInt$lambda.min, type = "response")
roc_obj <- roc(as.matrix(trav.test$Retained.in.2012.) ~ probs, smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE, main="AUC Lasso Regression: with Interactions")
## Setting levels: control = 0, case = 1
## Warning in roc.default(response, predictors[, 1], ...): Deprecated use a matrix
## as predictor. Unexpected results may be produced, please pass a numeric vector.
## Setting direction: controls < cases
# Adding confidence intervals to the plot
sens.ci <- ci.se(roc_obj)
plot(sens.ci, type="shape", col="lightblue")
## Warning in plot.ci.se(sens.ci, type = "shape", col = "lightblue"): Low
## definition shape.
plot(sens.ci, type="bars")
#Getting exponential of coefficients (odds ratios)
odds_ratios <- exp(coef(cv.lasso, cv.lasso$lambda.min))
# Converting odds ratios to an array for interpretability / formatting
odds_ratios <- array(odds_ratios, dimnames = list(c("Intercept", colnames(trav.train[, -1]))))
t(t(odds_ratios))
## [,1]
## Intercept 0.5969789
## Is.Non.Annual. 0.3280069
## School.Sponsor 1.0000000
## SingleGradeTripFlag 4.5851338
## Days 1.0000000
## Tuition 1.0000000
## FRP.Active 1.0000000
## FRP.Cancelled 1.0000000
## FRP.Take.up.percent. 1.0000000
## Cancelled.Pax 1.0000000
## Total.Discount.Pax 1.0000000
## FPP.to.School.enrollment 1.0000000
## EZ.Pay.Take.Up.Rate 1.0000000
## Total.School.Enrollment 1.0687806
## FPP 1.0000000
## DifferenceTraveltoFirstMeeting 1.0000000
## FPP.to.PAX 1.0000000
## DepartMonth_V1_April 1.0000000
## DepartMonth_V1_February 1.0000000
## DepartMonth_V1_January 1.0000000
## DepartMonth_V1_June 1.0000000
## DepartMonth_V1_March 1.0000000
## DepartMonth_V1_May 1.0000000
## MajorProgram_V1_C 1.0000000
## MajorProgram_V1_H 1.0000000
## MajorProgram_V1_S 1.0000000
odds_ratios <- exp(coef(cv.lasso.withInt, cv.lasso.withInt$lambda.min))
# Converting odds ratios to an array for interpretability / formatting out of dgeMatrix
odds_ratios <- array(odds_ratios, dimnames = list(c("Intercept", colnames(X.test))))
t(t(odds_ratios))
Interpretation of model with interaction terms is often best done using representative cases and marginal analysis For example, we can use the models to estimate the probability of Approved=1 for a sample individual
group1 <- data.frame(Retained.in.2012. = -1,
Is.Non.Annual. = 0,
School.Sponsor = 0,
SingleGradeTripFlag = c(0,1),
Days = 0,
Tuition = 0,
FRP.Active = 0,
FRP.Cancelled = 0,
FRP.Take.up.percent. = 0,
Cancelled.Pax = 0,
Total.Discount.Pax = 0,
FPP.to.School.enrollment = 0,
EZ.Pay.Take.Up.Rate = 0,
Total.School.Enrollment = 0,
FPP = 0,
DifferenceTraveltoFirstMeeting = 0,
FPP.to.PAX = 0,
MajorProgram_V1_C = 0,
MajorProgram_V1_H = 0,
MajorProgram_V1_S = 0,
DepartMonth_V1_January = 0,
DepartMonth_V1_February = 0,
DepartMonth_V1_March = 0,
DepartMonth_V1_April = 0,
DepartMonth_V1_May = 0,
DepartMonth_V1_June = 0
)
main_formula <- as.formula(Retained.in.2012. ~ . + 0)
group1_no_interactions <- model.matrix(main_formula, group1)
# Predict probability of Retained=1 no interactions
predicted_probs <- predict(cv.lasso, as.matrix(group1_no_interactions), s= cv.lasso$lambda.min, type = "response")
predicted_probs
## s1
## 1 0.3738177
## 2 0.7324220
# Create the design matrix for model with interactions
group1_with_interactions <- model.matrix(interaction_formula, group1)
# Predict probability of Retained=1
predicted_probs <- predict(cv.lasso.withInt, as.matrix(group1_with_interactions), s= cv.lasso.withInt$lambda.min, type = "response")
predicted_probs
## s1
## 1 0.4168507
## 2 0.8625626