1 Introduction

The problem at hand is to develop a predictive model to forecast which customers will book with the company during the 2013–14 school year, spanning from fall 2013 to spring 2014. To achieve this, we need to utilize available data from the 2012–13 school year to train the model. This historical data includes information on customer profiles and trip-related data such as post-trip surveys, trip revenue, trip length, and school size. The ultimate goal is to deliver a reasonably accurate predictive model to enable the deployment of a new marketing strategy before the sales season begins later in the spring.

We are tasked with building a predictive model for customer retention. Specifically, we aim to predict which customers will book with the company during the 2013–14 school year. To accomplish this, we can consider using various machine learning algorithms, such as logistic regression, decision trees, random forests, or gradient boosting. However, given the context of the problem and the need to handle potentially high-dimensional data with many features, a suitable model to consider would be a penalized regression model, such as LASSO (Least Absolute Shrinkage and Selection Operator) regression. LASSO regression is particularly useful when dealing with datasets with a large number of predictors and helps in feature selection by shrinking the coefficients of less important variables to zero, effectively performing variable selection and regularization simultaneously. This can lead to a more interpretable and potentially more accurate predictive model for customer retention.

2 Unpacking Required Packages For The Model

set.seed(1)  # 
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)  # contains createDataPartition
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(glmnet)   # performed lasso regression
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8
library(pROC)     # Makes ROC curves
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(mltools)  # One-hot encoding
## Warning: package 'mltools' was built under R version 4.3.2
## 
## Attaching package: 'mltools'
## 
## The following object is masked from 'package:tidyr':
## 
##     replace_na
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

3 Step 1. Read in the data file, and combine the two datasets

travel_Exhibit_1 <- read.csv("C:/Users/DELL/Desktop/2024_Projects/Retention modeling project/Session 7 Outline and Case Studies/Retention Modeling Data A - Exhibit 1.csv")
travel_retention_b <- read.csv("C:/Users/DELL/Desktop/2024_Projects/Retention modeling project/Session 7 Outline and Case Studies/Retention Modeling Data - B.csv")
travel <- cbind(travel_Exhibit_1, travel_retention_b)

3.1 Preliminary Analysis and Data cleaning

summary(travel)
##       ID            Program.Code         From.Grade        To.Grade     
##  Length:2392        Length:2392        Min.   : 3.000   Min.   : 3.000  
##  Class :character   Class :character   1st Qu.: 7.000   1st Qu.: 8.000  
##  Mode  :character   Mode  :character   Median : 8.000   Median : 8.000  
##                                        Mean   : 7.268   Mean   : 7.913  
##                                        3rd Qu.: 8.000   3rd Qu.: 8.000  
##                                        Max.   :12.000   Max.   :12.000  
##                                        NA's   :130      NA's   :153     
##  Group.State        Is.Non.Annual.       Days        Travel.Type       
##  Length:2392        Min.   :0.000   Min.   : 1.000   Length:2392       
##  Class :character   1st Qu.:0.000   1st Qu.: 4.000   Class :character  
##  Mode  :character   Median :0.000   Median : 5.000   Mode  :character  
##                     Mean   :0.154   Mean   : 4.575                     
##                     3rd Qu.:0.000   3rd Qu.: 5.000                     
##                     Max.   :1.000   Max.   :12.000                     
##                     NA's   :3       NA's   :3                          
##  Departure.Date     Return.Date        Deposit.Date       Special.Pay       
##  Length:2392        Length:2392        Length:2392        Length:2392       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     Tuition       FRP.Active     FRP.Cancelled    FRP.Take.up.percent.
##  Min.   :  79   Min.   :  0.00   Min.   : 0.000   Min.   :0.0000      
##  1st Qu.:1174   1st Qu.:  6.00   1st Qu.: 1.000   1st Qu.:0.4550      
##  Median :1700   Median : 12.00   Median : 2.000   Median :0.6000      
##  Mean   :1615   Mean   : 16.87   Mean   : 3.306   Mean   :0.5707      
##  3rd Qu.:2048   3rd Qu.: 23.00   3rd Qu.: 4.000   3rd Qu.:0.7270      
##  Max.   :4200   Max.   :257.00   Max.   :45.000   Max.   :1.0000      
##  NA's   :3      NA's   :3        NA's   :3        NA's   :3           
##   Early.RPL          Latest.RPL        Cancelled.Pax    Total.Discount.Pax
##  Length:2392        Length:2392        Min.   : 0.000   Min.   : 0.000    
##  Class :character   Class :character   1st Qu.: 2.000   1st Qu.: 1.000    
##  Mode  :character   Mode  :character   Median : 4.000   Median : 2.000    
##                                        Mean   : 4.807   Mean   : 2.954    
##                                        3rd Qu.: 6.000   3rd Qu.: 4.000    
##                                        Max.   :39.000   Max.   :47.000    
##                                        NA's   :3        NA's   :3         
##  Initial.System.Date Poverty.Code          Region           CRM.Segment   
##  Length:2392         Length:2392        Length:2392        Min.   : 1.00  
##  Class :character    Class :character   Class :character   1st Qu.: 5.00  
##  Mode  :character    Mode  :character   Mode  :character   Median : 6.00  
##                                                            Mean   : 6.92  
##                                                            3rd Qu.:10.00  
##                                                            Max.   :11.00  
##                                                            NA's   :7      
##  School.Type        Parent.Meeting.Flag MDR.Low.Grade      MDR.High.Grade  
##  Length:2392        Min.   :0.0000      Length:2392        Min.   : 1.000  
##  Class :character   1st Qu.:1.0000      Class :character   1st Qu.: 8.000  
##  Mode  :character   Median :1.0000      Mode  :character   Median : 8.000  
##                     Mean   :0.8589                         Mean   : 8.392  
##                     3rd Qu.:1.0000                         3rd Qu.: 8.000  
##                     Max.   :1.0000                         Max.   :12.000  
##                     NA's   :3                              NA's   :71      
##  Total.School.Enrollment Income.Level       EZ.Pay.Take.Up.Rate
##  Min.   :  19.0          Length:2392        Min.   :0.0000     
##  1st Qu.: 360.0          Class :character   1st Qu.:0.1000     
##  Median : 597.0          Mode  :character   Median :0.2000     
##  Mean   : 648.4                             Mean   :0.2079     
##  3rd Qu.: 825.8                             3rd Qu.:0.2920     
##  Max.   :3990.0                             Max.   :1.7500     
##  NA's   :94                                 NA's   :3          
##  School.Sponsor   SPR.Product.Type   SPR.New.Existing        FPP       
##  Min.   :0.0000   Length:2392        Length:2392        Min.   :  2.0  
##  1st Qu.:0.0000   Class :character   Class :character   1st Qu.: 12.0  
##  Median :0.0000   Mode  :character   Mode  :character   Median : 23.0  
##  Mean   :0.1059                                         Mean   : 31.3  
##  3rd Qu.:0.0000                                         3rd Qu.: 41.0  
##  Max.   :1.0000                                         Max.   :286.0  
##  NA's   :3                                              NA's   :3      
##    Total.Pax      SPR.Group.Revenue NumberOfMeetingswithParents
##  Min.   :  2.00   Min.   :  79      Min.   :0.000              
##  1st Qu.: 14.00   1st Qu.:1174      1st Qu.:1.000              
##  Median : 26.00   Median :1700      Median :1.000              
##  Mean   : 34.25   Mean   :1615      Mean   :1.102              
##  3rd Qu.: 44.00   3rd Qu.:2048      3rd Qu.:1.000              
##  Max.   :313.00   Max.   :4200      Max.   :2.000              
##  NA's   :3        NA's   :3         NA's   :3                  
##  FirstMeeting       LastMeeting        DifferenceTraveltoFirstMeeting
##  Length:2392        Length:2392        Min.   :-204.0                
##  Class :character   Class :character   1st Qu.: 208.0                
##  Mode  :character   Mode  :character   Median : 250.0                
##                                        Mean   : 262.1                
##                                        3rd Qu.: 287.0                
##                                        Max.   : 749.0                
##                                        NA's   :340                   
##  DifferenceTraveltoLastMeeting SchoolGradeTypeLow SchoolGradeTypeHigh
##  Min.   :-204.0                Length:2392        Length:2392        
##  1st Qu.: 196.8                Class :character   Class :character   
##  Median : 233.0                Mode  :character   Mode  :character   
##  Mean   : 229.0                                                      
##  3rd Qu.: 261.0                                                      
##  Max.   : 749.0                                                      
##  NA's   :340                                                         
##  SchoolGradeType    DepartureMonth     GroupGradeTypeLow  GroupGradeTypeHigh
##  Length:2392        Length:2392        Length:2392        Length:2392       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  GroupGradeType     MajorProgramCode   SingleGradeTripFlag
##  Length:2392        Length:2392        Min.   :0.0000     
##  Class :character   Class :character   1st Qu.:0.0000     
##  Mode  :character   Mode  :character   Median :1.0000     
##                                        Mean   :0.5567     
##                                        3rd Qu.:1.0000     
##                                        Max.   :1.0000     
##                                        NA's   :3          
##  FPP.to.School.enrollment   FPP.to.PAX     Num.of.Non_FPP.PAX
##  Min.   :0.00092          Min.   :0.6000   Min.   : 0.000    
##  1st Qu.:0.02079          1st Qu.:0.8824   1st Qu.: 1.000    
##  Median :0.04526          Median :0.9091   Median : 2.000    
##  Mean   :0.06618          Mean   :0.9007   Mean   : 2.954    
##  3rd Qu.:0.08752          3rd Qu.:0.9333   3rd Qu.: 4.000    
##  Max.   :2.05263          Max.   :1.0000   Max.   :47.000    
##  NA's   :94               NA's   :3        NA's   :3         
##  SchoolSizeIndicator Retained.in.2012.      ID               NPS.2011     
##  Length:2392         Min.   :0.0000    Length:2392        Min.   : 2.000  
##  Class :character    1st Qu.:0.0000    Class :character   1st Qu.: 9.000  
##  Mode  :character    Median :1.0000    Mode  :character   Median :10.000  
##                      Mean   :0.6074                       Mean   : 9.405  
##                      3rd Qu.:1.0000                       3rd Qu.:10.000  
##                      Max.   :1.0000                       Max.   :10.000  
##                      NA's   :3                            NA's   :580     
##     NPS.2010         NPS.2009         NPS.2008      X...3.FPP.Date    
##  Min.   : 4.000   Min.   : 2.000   Min.   : 1.000   Length:2392       
##  1st Qu.: 9.000   1st Qu.: 9.000   1st Qu.: 9.000   Class :character  
##  Median :10.000   Median :10.000   Median :10.000   Mode  :character  
##  Mean   : 9.466   Mean   : 9.405   Mean   : 9.359                     
##  3rd Qu.:10.000   3rd Qu.:10.000   3rd Qu.:10.000                     
##  Max.   :10.000   Max.   :10.000   Max.   :10.000                     
##  NA's   :1232     NA's   :1228     NA's   :1415                       
##  X...10.FPP.Date    X...20.FPP.Date    X...35.FPP.Date   
##  Length:2392        Length:2392        Length:2392       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 
travel <- na.omit(travel[, 1:56])

3.2 Feature Engineering:

  • Step 1: Variable classification & transformation into dummy variables

  • Step 2. Visualize potential predictors; make transformations as necessary (managing highly skewed variables, nonlinear relationships, etc)

  • Skipping this step; demonstrated in numerous other cases

  • Step 3. Select the variables to be included in the set of candidate predictors & place them into a new dataset with only variables you are going to use

# create dataset to use for regression
trav <- travel[ , c("Retained.in.2012.", "Is.Non.Annual.",  "School.Sponsor", "SingleGradeTripFlag")]

4 numeric variables

travel$Days <- as.numeric(travel$Days)
travel$Tuition <- as.numeric(travel$Tuition)
travel$FRP.Active <- as.numeric(travel$FRP.Active)
travel$FRP.Cancelled <- as.numeric(travel$FRP.Cancelled)
travel$FRP.Take.up.percent. <- as.numeric(travel$FRP.Take.up.percent.)
travel$Cancelled.Pax <- as.numeric(travel$Cancelled.Pax)
travel$Total.Discount.Pax <- as.numeric(travel$Total.Discount.Pax)
travel$FPP.to.School.enrollment <- as.numeric(travel$FPP.to.School.enrollment)
travel$EZ.Pay.Take.Up.Rate <- as.numeric(travel$EZ.Pay.Take.Up.Rate)
travel$Total.School.Enrollment <- as.numeric(travel$Total.School.Enrollment)
travel$FPP <- as.numeric(travel$FPP)
travel$DifferenceTraveltoFirstMeeting <- as.numeric(travel$DifferenceTraveltoFirstMeeting)
travel$FPP.to.PAX <- as.numeric(travel$FPP.to.PAX)

5 scale numeric variables and add them to the dataset for use

trav$Days <- scale(travel$Days)
trav$Tuition <- scale(travel$Tuition)
trav$FRP.Active <- scale(travel$FRP.Active)
trav$FRP.Cancelled <- scale(travel$FRP.Cancelled)
trav$FRP.Take.up.percent. <- scale(travel$FRP.Take.up.percent.)
trav$Cancelled.Pax <- scale(travel$Cancelled.Pax)
trav$Total.Discount.Pax <- scale(travel$Total.Discount.Pax)
trav$FPP.to.School.enrollment <- scale(travel$FPP.to.School.enrollment)
trav$EZ.Pay.Take.Up.Rate <- scale(travel$EZ.Pay.Take.Up.Rate)
trav$Total.School.Enrollment <- scale(travel$Total.School.Enrollment)
trav$FPP <- scale(travel$FPP)
trav$DifferenceTraveltoFirstMeeting <- scale(travel$DifferenceTraveltoFirstMeeting)
trav$FPP.to.PAX <- scale(travel$FPP.to.PAX)

6 categorical variables

travel$Special.Pay <- as.factor(travel$Special.Pay)
travel$Group.State <- as.factor(travel$Group.State)
travel$Poverty.Code <- as.factor(travel$Poverty.Code)
travel$CRM.Segment <- as.factor(travel$CRM.Segment)
travel$School.Type <- as.factor(travel$School.Type)
travel$To.Grade <- as.factor(travel$To.Grade)
travel$MDR.Low.Grade <- as.factor(travel$MDR.Low.Grade)
travel$MDR.High.Grade <- as.factor(travel$MDR.High.Grade)
travel$SchoolSizeIndicator <- as.factor(travel$SchoolSizeIndicator)
travel$SPR.New.Existing <- as.factor(travel$SPR.New.Existing)
travel$DepartureMonth <- as.factor(travel$DepartureMonth)
travel$MajorProgramCode <- as.factor(travel$MajorProgramCode)

7 One-hot-encode (create dummy variables) for two categorical variables; ignoring the rest

newdata <- one_hot(as.data.table(travel$DepartureMonth))
names(newdata) <- paste0("DepartMonth_", names(newdata))
trav <- cbind(trav, newdata)

newdata <- one_hot(as.data.table(travel$MajorProgramCode))
names(newdata) <- paste0("MajorProgram_", names(newdata))
trav <- cbind(trav, newdata)

8 Analysis: Step by step lasso logistic regression

  • Step 1: Split the data into training and testing sets
  • Step 2: Create a weight variable to correct for imbalance in the response variable
  • Step 3: Regression procedure for training, identify lambda and fit the model using cross-validation
  • Step 4: Evaluate the model fit
  • Step 5: Model interpretation

8.1 Split the data into training and testing sets

train.size <- 0.7   ## use 70% of the data for training
train.index <- createDataPartition(as.factor(trav$Retained.in.2012.), p= train.size, list=F)   # createDataPartition function is in the 
trav.train <- trav[train.index, ]
trav.test <- trav[-train.index, ]

8.2 Create a weight variable to correct for imbalance in the response variable

# weights for the training set
p.retain <- sum(trav.train$Retained.in.2012. == 1)/length(trav.train$Retained.in.2012.)
weights <- rep(NA, times=length(trav.train$Retained.in.2012.))
weights[trav.train$Retained.in.2012. == 0] <- p.retain
weights[trav.train$Retained.in.2012. == 1] <- 1-p.retain

8.3 Regression procedure for training, identify lambda and fit the model using cross-validation

8.3.1 MODEL 1: No interaction terms

Using only the training dataset, the cv.glmnet function is doing k-fold cross validation to identify the ‘best’ model for prediction within the training set. later we will evaluate this model using the testing set

cv.lasso <- cv.glmnet(y = as.matrix(trav.train$Retained.in.2012.), x = as.matrix(trav.train[, -trav.train$Retained.in.2012.]), 
                      family="binomial", na.action = NULL, weights = weights, type.measure = "auc")

8.4 MODEL 2: With interaction terms

Set the model formula with interaction variables (‘+0’ here because glmnet adds an intercept by default)

interaction_formula <- as.formula(Retained.in.2012. ~ . + .*. + 0) 

# combining the model formula and the data to create the design matrix for the regression 
X.train <- model.matrix(interaction_formula, trav.train)
X.test <- model.matrix(interaction_formula, trav.test)  

cv.lasso.withInt <- cv.glmnet(y = as.matrix(trav.train$Retained.in.2012.), x = as.matrix(X.train), 
                              family="binomial", na.action = NULL, weights = weights, type.measure = "auc")

8.5 Evaluate the model fit

Evaluating model fit is Area-Under-the-Curve

8.5.1 MODEL 1: No interaction terms

probs <- predict(cv.lasso, as.matrix(trav.test[, -trav.test$Retained.in.2012.]), s=cv.lasso$lambda.min, type = "response")
roc_obj <- roc(as.matrix(trav.test$Retained.in.2012.) ~ probs, smoothed = TRUE,
               # arguments for ci
               ci=TRUE, ci.alpha=0.9, stratified=FALSE,
               # arguments for plot
               plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE, 
               print.auc=TRUE, show.thres=TRUE, main="AUC Lasso Regression: No Interactions")
## Setting levels: control = 0, case = 1
## Warning in roc.default(response, predictors[, 1], ...): Deprecated use a matrix
## as predictor. Unexpected results may be produced, please pass a numeric vector.
## Setting direction: controls < cases
# Adding confidence intervals to the plot
sens.ci <- ci.se(roc_obj)
plot(sens.ci, type="shape", col="lightblue")
## Warning in plot.ci.se(sens.ci, type = "shape", col = "lightblue"): Low
## definition shape.
plot(sens.ci, type="bars")

8.5.2 MODEL 2: With interaction terms

probs <- predict(cv.lasso.withInt, as.matrix(X.test), s=cv.lasso.withInt$lambda.min, type = "response")
roc_obj <- roc(as.matrix(trav.test$Retained.in.2012.) ~ probs, smoothed = TRUE,
               # arguments for ci
               ci=TRUE, ci.alpha=0.9, stratified=FALSE,
               # arguments for plot
               plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
               print.auc=TRUE, show.thres=TRUE, main="AUC Lasso Regression: with Interactions")
## Setting levels: control = 0, case = 1
## Warning in roc.default(response, predictors[, 1], ...): Deprecated use a matrix
## as predictor. Unexpected results may be produced, please pass a numeric vector.
## Setting direction: controls < cases
# Adding confidence intervals to the plot
sens.ci <- ci.se(roc_obj)
plot(sens.ci, type="shape", col="lightblue")
## Warning in plot.ci.se(sens.ci, type = "shape", col = "lightblue"): Low
## definition shape.
plot(sens.ci, type="bars")

8.6 Model Interpretation

8.6.1 Coefficient interpretation is possible for Model 1 (because no intereaction terms)

#Getting exponential of coefficients (odds ratios)
odds_ratios <- exp(coef(cv.lasso, cv.lasso$lambda.min))

# Converting odds ratios to an array for interpretability / formatting
odds_ratios <- array(odds_ratios, dimnames = list(c("Intercept", colnames(trav.train[, -1]))))
t(t(odds_ratios))
##                                     [,1]
## Intercept                      0.5969789
## Is.Non.Annual.                 0.3280069
## School.Sponsor                 1.0000000
## SingleGradeTripFlag            4.5851338
## Days                           1.0000000
## Tuition                        1.0000000
## FRP.Active                     1.0000000
## FRP.Cancelled                  1.0000000
## FRP.Take.up.percent.           1.0000000
## Cancelled.Pax                  1.0000000
## Total.Discount.Pax             1.0000000
## FPP.to.School.enrollment       1.0000000
## EZ.Pay.Take.Up.Rate            1.0000000
## Total.School.Enrollment        1.0687806
## FPP                            1.0000000
## DifferenceTraveltoFirstMeeting 1.0000000
## FPP.to.PAX                     1.0000000
## DepartMonth_V1_April           1.0000000
## DepartMonth_V1_February        1.0000000
## DepartMonth_V1_January         1.0000000
## DepartMonth_V1_June            1.0000000
## DepartMonth_V1_March           1.0000000
## DepartMonth_V1_May             1.0000000
## MajorProgram_V1_C              1.0000000
## MajorProgram_V1_H              1.0000000
## MajorProgram_V1_S              1.0000000

8.6.2 Coefficient interpretation is difficult for Models 2 and 3 because of the numerous interaction terms.

  • Inspecting coefficients is sometimes still useful (specifically, for ensuring all expected terms are present)

8.6.3 Model 2 OR

9 Getting exponential of coefficients (odds ratios)

odds_ratios <- exp(coef(cv.lasso.withInt, cv.lasso.withInt$lambda.min))

# Converting odds ratios to an array for interpretability / formatting out of dgeMatrix
odds_ratios <- array(odds_ratios, dimnames = list(c("Intercept", colnames(X.test))))
t(t(odds_ratios))

Interpretation of model with interaction terms is often best done using representative cases and marginal analysis For example, we can use the models to estimate the probability of Approved=1 for a sample individual

9.0.1 Using MODEL 1/2

9.0.2 Sample individual #1:

group1 <- data.frame(Retained.in.2012. = -1,   
                         Is.Non.Annual.                = 0,
                         School.Sponsor                = 0,
                         SingleGradeTripFlag           = c(0,1),
                         Days                          = 0,
                         Tuition                       = 0,
                         FRP.Active                    = 0,
                         FRP.Cancelled                 = 0,
                         FRP.Take.up.percent.          = 0,
                         Cancelled.Pax                 = 0,
                         Total.Discount.Pax            = 0,
                         FPP.to.School.enrollment      = 0,
                         EZ.Pay.Take.Up.Rate           = 0,
                         Total.School.Enrollment       = 0,
                         FPP                           = 0,
                         DifferenceTraveltoFirstMeeting = 0,
                         FPP.to.PAX                    = 0,
                         MajorProgram_V1_C             = 0,
                         MajorProgram_V1_H             = 0,
                         MajorProgram_V1_S             = 0,
                         DepartMonth_V1_January        = 0,
                         DepartMonth_V1_February       = 0,
                         DepartMonth_V1_March          = 0,
                         DepartMonth_V1_April          = 0,                 
                         DepartMonth_V1_May            = 0,
                         DepartMonth_V1_June           = 0
                         )

9.0.3 Creating the design matrix for Model 1

main_formula <- as.formula(Retained.in.2012. ~ . + 0) 
group1_no_interactions <- model.matrix(main_formula, group1)

# Predict probability of Retained=1 no interactions
predicted_probs <- predict(cv.lasso, as.matrix(group1_no_interactions), s= cv.lasso$lambda.min, type = "response")
predicted_probs 
##          s1
## 1 0.3738177
## 2 0.7324220
# Create the design matrix for model with interactions
group1_with_interactions <- model.matrix(interaction_formula, group1)

# Predict probability of Retained=1
predicted_probs <- predict(cv.lasso.withInt, as.matrix(group1_with_interactions), s= cv.lasso.withInt$lambda.min, type = "response")
predicted_probs 
##          s1
## 1 0.4168507
## 2 0.8625626