Introduction
This notebook uses Lead Prediction data from Analytics Vidhya. The task of this exercise is to use supervised machine learning methods to conduct lead prediction, and evaluate models performance. The methods used in this notebook include GLM, elastic net, lasso and CART, and the evalution metrics used are accuracy, sensitivity, specificity, F1-score and AUC-ROC score
Data dictionary (from source)
| No. | Variable | Description |
|---|---|---|
| ID | Unique Identifier for a row | |
| Gender | Gender of the Customer | |
| Age | Age of the Customer (in Years) | |
| Region_Code | Code of the Region for the customers | |
| Occupation | Occupation Type for the customer | |
| Channel_Code | Acquisition Channel Code for the Customer (Encoded) | |
| Vintage | Vintage for the Customer (In Months) | |
| Credit_Product | If the Customer has any active credit product (Home loan, Personal loan, Credit Card etc.) | |
| AvgAccountBalance | Average Account Balance for the Customer in last 12 Months | |
| Is_Active | If the Customer is Active in last 3 Months | |
| Is_Lead(Target) | If the Customer is interested for the Credit Card. 0 : Customer is not interested. 1 : Customer is interested |
# load libaries
library(tidyverse)
library(skimr)
library(psych)
library(gghalves)
library(patchwork)
library(ggstatsplot)
# Import data
train = read.csv("train_s3TEQDk.csv",na.strings=c("","NA"), stringsAsFactors = T)
glimpse(train)
Rows: 245,725
Columns: 11
$ ID <fct> NNVBBKZB, IDD62UNG, HD3DSEMC, BF3NC7KV, TEASRWXV, ACUTYTWS, ETQCZFEJ, JJNJUQMQ…
$ Gender <fct> Female, Female, Female, Male, Female, Male, Male, Female, Female, Female, Male…
$ Age <int> 73, 30, 56, 34, 30, 56, 62, 48, 40, 55, 53, 27, 27, 31, 79, 33, 46, 59, 65, 37…
$ Region_Code <fct> RG268, RG277, RG268, RG270, RG282, RG261, RG282, RG265, RG283, RG268, RG254, R…
$ Occupation <fct> Other, Salaried, Self_Employed, Salaried, Salaried, Self_Employed, Other, Self…
$ Channel_Code <fct> X3, X1, X3, X1, X1, X1, X3, X3, X2, X2, X3, X1, X1, X1, X3, X2, X3, X3, X2, X1…
$ Vintage <int> 43, 32, 26, 19, 33, 32, 20, 13, 38, 49, 123, 14, 20, 31, 57, 69, 97, 15, 20, 6…
$ Credit_Product <fct> No, No, No, No, No, No, NA, No, No, Yes, No, Yes, No, Yes, No, NA, Yes, Yes, N…
$ Avg_Account_Balance <int> 1045696, 581988, 1484315, 470454, 886787, 544163, 1056750, 444724, 1274284, 20…
$ Is_Active <fct> No, No, Yes, No, No, Yes, Yes, Yes, No, No, Yes, No, Yes, No, Yes, Yes, No, No…
$ Is_Lead <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, …
test = read.csv("test_mSzZ8RL.csv",na.strings=c("","NA"), stringsAsFactors = T)
glimpse(test)
Rows: 105,312
Columns: 10
$ ID <fct> VBENBARO, CCMEWNKY, VK3KGA9M, TT8RPZVC, SHQZEYTZ, MZZAQMPT, Y88TW36I, 3UGOAQNU…
$ Gender <fct> Male, Male, Male, Male, Female, Male, Female, Female, Male, Female, Female, Ma…
$ Age <int> 29, 43, 31, 29, 29, 60, 69, 30, 43, 54, 30, 45, 42, 49, 30, 26, 49, 29, 51, 75…
$ Region_Code <fct> RG254, RG268, RG270, RG272, RG270, RG268, RG253, RG257, RG284, RG283, RG277, R…
$ Occupation <fct> Other, Other, Salaried, Other, Other, Self_Employed, Other, Salaried, Salaried…
$ Channel_Code <fct> X1, X2, X1, X1, X1, X3, X2, X1, X3, X2, X1, X3, X2, X3, X1, X1, X2, X1, X3, X2…
$ Vintage <int> 25, 49, 14, 33, 19, 110, 67, 33, 81, 37, 33, 63, 69, 27, 31, 21, 69, 25, 117, …
$ Credit_Product <fct> Yes, NA, No, No, No, No, No, No, NA, Yes, No, Yes, NA, Yes, No, No, Yes, No, N…
$ Avg_Account_Balance <int> 742366, 925537, 215949, 868070, 657087, 4624262, 1032764, 837009, 1001232, 166…
$ Is_Active <fct> No, No, No, No, No, No, No, No, Yes, No, No, No, No, No, No, Yes, Yes, No, Yes…
# Missing data in train data
mtrain = train %>% summarise(across(everything(), ~mean(!is.na(.)))) %>%
gather() %>%
mutate(key= fct_reorder(key, value)) %>% rename(train=value)
# Missing data in test data
mtest = test %>% summarise(across(everything(), ~mean(!is.na(.)))) %>%
gather() %>%
mutate(key= fct_reorder(key, value)) %>% rename(test=value)
# Missing data (test and train)
mtrain %>% left_join(mtest) %>% mutate_if(is.numeric, round, digits=3)
Joining, by = "key"
# bind test and train data
data = bind_rows(train, test)
dim(data)
[1] 351037 11
skim(data)
── Data Summary ────────────────────────
Values
Name data
Number of rows 351037
Number of columns 11
_______________________
Column type frequency:
factor 7
numeric 4
________________________
Group variables None
── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique top_counts
1 ID 0 1 FALSE 351037 222: 1, 222: 1, 222: 1, 224: 1
2 Gender 0 1 FALSE 2 Mal: 191902, Fem: 159135
3 Region_Code 0 1 FALSE 35 RG2: 51059, RG2: 42297, RG2: 38577, RG2: 27493
4 Occupation 0 1 FALSE 4 Sel: 144078, Sal: 102912, Oth: 100304, Ent: 3743
5 Channel_Code 0 1 FALSE 4 X1: 148202, X3: 97981, X2: 96902, X4: 7952
6 Credit_Product 41847 0.881 FALSE 2 No: 205965, Yes: 103225
7 Is_Active 0 1 FALSE 2 No: 214087, Yes: 136950
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
1 Age 0 1 43.9 14.8 23 30 43 54 85
2 Vintage 0 1 46.9 32.3 7 20 32 73 135
3 Avg_Account_Balance 0 1 1130141. 856953. 20790 604185 895162 1368152 10352009
4 Is_Lead 105312 0.700 0.237 0.425 0 0 0 0 1
hist
1 ▇▅▅▂▁
2 ▇▂▂▂▁
3 ▇▁▁▁▁
4 ▇▁▁▁▂
# Target variable distribution
Hmisc::describe(factor(train$Is_Lead))
factor(train$Is_Lead)
n missing distinct
245725 0 2
Value X0 X1
Frequency 187437 58288
Proportion 0.763 0.237
# Gender
train %>% group_by(Gender) %>%
summarise(n=n(), Is_Lead_0=sum(Is_Lead==0), Is_Lead_1=sum(Is_Lead==1)) %>%
mutate("Is_Lead_1_%"=scales::percent(Is_Lead_1/(Is_Lead_1+Is_Lead_0)))
# Region Code
train %>% group_by(Region_Code) %>%
summarise(n=n(), Is_Lead_0=sum(Is_Lead==0), Is_Lead_1=sum(Is_Lead==1)) %>%
mutate("n_%"=round(n/sum(n)*100,1),
"Is_Lead_1_%"=round(Is_Lead_1/(Is_Lead_1+Is_Lead_0)*100,1)) %>%
select(-Is_Lead_0) %>%
ungroup() %>%
DT::datatable(rownames=FALSE,options = list(order = list(list(2, 'desc'))))
# Occupation
train %>% group_by(Occupation) %>%
summarise(n=n(), Is_Lead_0=sum(Is_Lead==0), Is_Lead_1=sum(Is_Lead==1)) %>%
mutate("Is_Lead_1_%"=scales::percent(Is_Lead_1/(Is_Lead_1+Is_Lead_0)))
# Channel_Code
train %>% group_by(Channel_Code) %>%
summarise(n=n(), Is_Lead_0=sum(Is_Lead==0), Is_Lead_1=sum(Is_Lead==1)) %>%
mutate("Is_Lead_1_%"=scales::percent(Is_Lead_1/(Is_Lead_1+Is_Lead_0)))
# Credit_Product
train %>% group_by(Credit_Product) %>%
summarise(n=n(), Is_Lead_0=sum(Is_Lead==0), Is_Lead_1=sum(Is_Lead==1)) %>%
mutate("Is_Lead_1_%"=scales::percent(Is_Lead_1/(Is_Lead_1+Is_Lead_0)))
# Is_Active
train %>% group_by(Is_Active) %>%
summarise(n=n(), Is_Lead_0=sum(Is_Lead==0), Is_Lead_1=sum(Is_Lead==1)) %>%
mutate("Is_Lead_1_%"=scales::percent(Is_Lead_1/(Is_Lead_1+Is_Lead_0)))
# Age
psych::describeBy(train$Age, train$Is_Lead,mat=T, digits=2)
# Age group
train %>% mutate(age_group= cut(Age, c(17, 24, 34, 44, 54, 64, Inf),
labels = c("18-24", "25-34", "35-44","45-54","55-64","65+"))) %>%
group_by(age_group) %>%
summarise(n=n(), Is_Lead_0=sum(Is_Lead==0), Is_Lead_1=sum(Is_Lead==1)) %>%
mutate("Is_Lead_1_%"=scales::percent(Is_Lead_1/(Is_Lead_1+Is_Lead_0)))
# Vintage
psych::describeBy(train$Vintage, train$Is_Lead,mat=T, digits=2)
# Account
psych::describeBy(train$Avg_Account_Balance, train$Is_Lead,mat=T, digits=2)
# train data
train = train %>%
mutate(Credit_Product = fct_explicit_na(Credit_Product, "Unknown")) %>%
mutate_at(vars(Is_Lead),list(factor)) %>%
select(-ID) %>%
mutate(Is_Lead = factor(Is_Lead,labels = make.names(levels(Is_Lead)),ordered=TRUE))
glimpse(train)
Rows: 245,725
Columns: 10
$ Gender <fct> Female, Female, Female, Male, Female, Male, Male, Female, Female, Female, Male…
$ Age <int> 73, 30, 56, 34, 30, 56, 62, 48, 40, 55, 53, 27, 27, 31, 79, 33, 46, 59, 65, 37…
$ Region_Code <fct> RG268, RG277, RG268, RG270, RG282, RG261, RG282, RG265, RG283, RG268, RG254, R…
$ Occupation <fct> Other, Salaried, Self_Employed, Salaried, Salaried, Self_Employed, Other, Self…
$ Channel_Code <fct> X3, X1, X3, X1, X1, X1, X3, X3, X2, X2, X3, X1, X1, X1, X3, X2, X3, X3, X2, X1…
$ Vintage <int> 43, 32, 26, 19, 33, 32, 20, 13, 38, 49, 123, 14, 20, 31, 57, 69, 97, 15, 20, 6…
$ Credit_Product <fct> No, No, No, No, No, No, Unknown, No, No, Yes, No, Yes, No, Yes, No, Unknown, Y…
$ Avg_Account_Balance <int> 1045696, 581988, 1484315, 470454, 886787, 544163, 1056750, 444724, 1274284, 20…
$ Is_Active <fct> No, No, Yes, No, No, Yes, Yes, Yes, No, No, Yes, No, Yes, No, Yes, Yes, No, No…
$ Is_Lead <ord> X0, X0, X0, X0, X0, X0, X1, X0, X0, X0, X0, X0, X0, X0, X0, X1, X1, X1, X0, X0…
table(train$Is_Lead)
X0 X1
187437 58288
# plot
train %>% ggplot(aes(x=Is_Lead, y=Age)) +
geom_half_boxplot(aes(fill=Is_Lead), alpha=.7,show.legend = F,outlier.size = 0.7, outlier.alpha = 0.5) +
geom_half_violin(aes(fill=Is_Lead),side="r", color=NA,alpha=.7) +
scale_fill_manual(values=c("#1a759f","#ee9b00")) -> p1
train %>% ggplot(aes(x=Is_Lead, y=Vintage)) +
geom_half_boxplot(aes(fill=Is_Lead), alpha=.7,show.legend = F,outlier.size = 0.7, outlier.alpha = 0.5) +
geom_half_violin(aes(fill=Is_Lead),side="r", color=NA,alpha=.7) +
scale_fill_manual(values=c("#1a759f","#ee9b00")) -> p2
train %>% ggplot(aes(x=Is_Lead, y=Avg_Account_Balance)) +
geom_half_boxplot(aes(fill=Is_Lead), alpha=.7,show.legend = F,outlier.size = 0.7, outlier.alpha = 0.5) +
geom_half_violin(aes(fill=Is_Lead),side="r", color=NA,alpha=.7) +
scale_fill_manual(values=c("#1a759f","#ee9b00")) + coord_flip() -> p3
(p1 | p2) / p3 + plot_layout(guides='collect')
# Correlation matrix
cdata = train %>% select(Gender, Age,Channel_Code, Vintage, Credit_Product, Avg_Account_Balance, Is_Active,Is_Lead) %>%
mutate(Channel_Code= case_when(Channel_Code=="X1"~1,
Channel_Code=="X2"~2,
Channel_Code=="X3"~3,
Channel_Code=="X4"~4)) %>%
drop_na() %>%
fastDummies::dummy_cols(remove_first_dummy = TRUE) %>%
select(where(is.numeric))
str(cdata)
'data.frame': 216400 obs. of 8 variables:
$ Age : int 73 30 56 34 30 56 48 40 55 53 ...
$ Channel_Code : num 3 1 3 1 1 1 3 2 2 3 ...
$ Vintage : int 43 32 26 19 33 32 13 38 49 123 ...
$ Avg_Account_Balance: int 1045696 581988 1484315 470454 886787 544163 444724 1274284 2014239 980664 ...
$ Is_Lead : int 0 0 0 0 0 0 0 0 0 0 ...
$ Gender_Male : int 0 0 0 1 0 1 0 0 0 1 ...
$ Credit_Product_Yes : int 0 0 0 0 0 0 0 0 1 0 ...
$ Is_Active_Yes : int 0 0 1 0 0 1 1 0 0 1 ...
ggstatsplot::ggcorrmat(
data = cdata,
type = "spearman"
)
# Load libraries
library(Hmisc)
library(caret)
library(rattle)
library(pscl)
library(pROC)
library(MLmetrics)
library(rpart)
library(randomForest)
# Partition data based on shortcut variable
colnames(train) <- make.names(colnames(train)) #make valid col names
set.seed(123)
train.index <- createDataPartition(train$Is_Lead, p = .7, list = FALSE)
xtrain <- train[ train.index,]
xtest <- train[-train.index,]
# Check distribution after partitioning
Hmisc::describe(xtrain$Is_Lead)
xtrain$Is_Lead
n missing distinct
172008 0 2
Value X0 X1
Frequency 131206 40802
Proportion 0.763 0.237
Hmisc::describe(xtest$Is_Lead)
xtest$Is_Lead
n missing distinct
73717 0 2
Value X0 X1
Frequency 56231 17486
Proportion 0.763 0.237
# glm
fit.control <- trainControl(method = "repeatedcv", number = 3, repeats = 10,
summaryFunction = twoClassSummary, classProbs = TRUE, savePredictions = T)
set.seed(201)
glm_mod <- train(Is_Lead ~., data = xtrain, method = "glm",
family = "binomial", trControl = fit.control, metric="ROC")
glm_mod
Generalized Linear Model
172008 samples
9 predictor
2 classes: 'X0', 'X1'
No pre-processing
Resampling: Cross-Validated (3 fold, repeated 10 times)
Summary of sample sizes: 114673, 114672, 114671, 114672, 114672, 114672, ...
Resampling results:
ROC Sens Spec
0.8581344 0.9682408 0.4858659
#glm_mod$finalModel
summary(glm_mod)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5620 -0.5287 -0.3555 -0.1927 2.9403
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.693e+00 1.035e-01 -35.674 < 2e-16 ***
GenderMale 3.135e-02 1.486e-02 2.109 0.03491 *
Age 8.835e-03 8.112e-04 10.891 < 2e-16 ***
Region_CodeRG251 1.400e-01 9.199e-02 1.522 0.12813
Region_CodeRG252 1.187e-01 1.024e-01 1.159 0.24662
Region_CodeRG253 7.368e-02 1.130e-01 0.652 0.51448
Region_CodeRG254 5.434e-02 8.283e-02 0.656 0.51178
Region_CodeRG255 2.009e-01 1.114e-01 1.804 0.07117 .
Region_CodeRG256 3.684e-02 1.120e-01 0.329 0.74217
Region_CodeRG257 2.191e-01 9.331e-02 2.348 0.01885 *
Region_CodeRG258 1.175e-02 1.164e-01 0.101 0.91959
Region_CodeRG259 4.741e-02 1.081e-01 0.439 0.66098
Region_CodeRG260 8.930e-02 1.057e-01 0.845 0.39839
Region_CodeRG261 4.133e-02 9.130e-02 0.453 0.65077
Region_CodeRG262 -4.815e-02 1.226e-01 -0.393 0.69453
Region_CodeRG263 3.087e-01 9.920e-02 3.112 0.00186 **
Region_CodeRG264 8.877e-03 1.132e-01 0.078 0.93751
Region_CodeRG265 1.076e-01 1.174e-01 0.917 0.35911
Region_CodeRG266 -1.268e-01 1.333e-01 -0.951 0.34155
Region_CodeRG267 -1.825e-01 1.293e-01 -1.411 0.15832
Region_CodeRG268 1.587e-01 8.176e-02 1.941 0.05221 .
Region_CodeRG269 2.935e-01 8.966e-02 3.274 0.00106 **
Region_CodeRG270 1.375e-01 9.250e-02 1.486 0.13720
Region_CodeRG271 2.072e-02 1.269e-01 0.163 0.87036
Region_CodeRG272 1.199e-01 9.457e-02 1.267 0.20503
Region_CodeRG273 1.934e-01 9.593e-02 2.016 0.04382 *
Region_CodeRG274 -8.551e-03 9.590e-02 -0.089 0.92895
Region_CodeRG275 -6.622e-02 1.046e-01 -0.633 0.52668
Region_CodeRG276 8.261e-02 1.034e-01 0.799 0.42440
Region_CodeRG277 2.309e-01 8.578e-02 2.692 0.00710 **
Region_CodeRG278 -7.967e-02 1.181e-01 -0.675 0.49996
Region_CodeRG279 2.009e-01 9.814e-02 2.047 0.04067 *
Region_CodeRG280 2.267e-01 8.548e-02 2.652 0.00799 **
Region_CodeRG281 7.787e-02 9.393e-02 0.829 0.40706
Region_CodeRG282 5.179e-02 9.435e-02 0.549 0.58307
Region_CodeRG283 1.486e-01 8.218e-02 1.809 0.07050 .
Region_CodeRG284 1.716e-01 8.331e-02 2.060 0.03941 *
OccupationOther -7.741e-01 5.833e-02 -13.271 < 2e-16 ***
OccupationSalaried 2.408e-01 6.028e-02 3.994 6.48e-05 ***
OccupationSelf_Employed -6.570e-01 5.681e-02 -11.566 < 2e-16 ***
Channel_CodeX2 9.440e-01 2.591e-02 36.439 < 2e-16 ***
Channel_CodeX3 8.016e-01 2.745e-02 29.208 < 2e-16 ***
Channel_CodeX4 8.462e-01 4.892e-02 17.297 < 2e-16 ***
Vintage 8.574e-03 2.917e-04 29.397 < 2e-16 ***
Credit_ProductYes 1.619e+00 1.658e-02 97.682 < 2e-16 ***
Credit_ProductUnknown 3.997e+00 2.413e-02 165.686 < 2e-16 ***
Avg_Account_Balance -2.177e-08 8.942e-09 -2.434 0.01493 *
Is_ActiveYes 3.378e-01 1.549e-02 21.802 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 188467 on 172007 degrees of freedom
Residual deviance: 126047 on 171960 degrees of freedom
AIC: 126143
Number of Fisher Scoring iterations: 5
probsTest <- predict(glm_mod, xtest, type = "prob")
threshold <- 0.5
glm.pred <- factor( ifelse(probsTest[, "X1"] > threshold, "X1", "X0") ) # predict
confusionMatrix(glm.pred, xtest$Is_Lead) # confusion matrix
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 54394 9005
X1 1837 8481
Accuracy : 0.8529
95% CI : (0.8503, 0.8555)
No Information Rate : 0.7628
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5267
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9673
Specificity : 0.4850
Pos Pred Value : 0.8580
Neg Pred Value : 0.8220
Prevalence : 0.7628
Detection Rate : 0.7379
Detection Prevalence : 0.8600
Balanced Accuracy : 0.7262
'Positive' Class : X0
roc(response=xtest$Is_Lead, predictor=factor(glm.pred,ordered=T),plot=T,print.auc=T) # plot ROC
Setting levels: control = X0, case = X1
Ordered predictor converted to numeric vector. Threshold values will not correspond to values in predictor.Setting direction: controls < cases
Call:
roc.default(response = xtest$Is_Lead, predictor = factor(glm.pred, ordered = T), plot = T, print.auc = T)
Data: factor(glm.pred, ordered = T) in 56231 controls (xtest$Is_Lead X0) < 17486 cases (xtest$Is_Lead X1).
Area under the curve: 0.7262
paste("F1-score: ",(F1_Score(y_pred = glm.pred, y_true = xtest$Is_Lead, positive = "X1"))) # F1-score
[1] "F1-score: 0.610056107034959"
vip(glm_mod, num_features = 20, geom="point") + ggplot2::theme_bw() # plot variable importance of 20 features
# elastic net
set.seed(42)
cv_5 = trainControl(method = "cv", number = 5)
def_elnet = train(
Is_Lead ~ ., data = xtrain,
method = "glmnet",
trControl = cv_5,
tuneLength=5
)
def_elnet
glmnet
172008 samples
9 predictor
2 classes: 'X0', 'X1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 137607, 137607, 137607, 137606, 137605
Resampling results across tuning parameters:
alpha lambda Accuracy Kappa
0.100 0.0002102763 0.8536231 0.5273201
0.100 0.0009760160 0.8536173 0.5270821
0.100 0.0045302650 0.8526638 0.5197540
0.100 0.0210276283 0.8491931 0.4993516
0.100 0.0976016090 0.8455828 0.4832608
0.325 0.0002102763 0.8536231 0.5273484
0.325 0.0009760160 0.8535010 0.5262112
0.325 0.0045302650 0.8517685 0.5150578
0.325 0.0210276283 0.8470769 0.4900611
0.325 0.0976016090 0.8469315 0.4894358
0.550 0.0002102763 0.8535998 0.5273106
0.550 0.0009760160 0.8533673 0.5252584
0.550 0.0045302650 0.8511523 0.5110180
0.550 0.0210276283 0.8469315 0.4894358
0.550 0.0976016090 0.8469315 0.4894358
0.775 0.0002102763 0.8535591 0.5272213
0.775 0.0009760160 0.8531464 0.5238844
0.775 0.0045302650 0.8503674 0.5070501
0.775 0.0210276283 0.8469315 0.4894358
0.775 0.0976016090 0.8469315 0.4894358
1.000 0.0002102763 0.8535882 0.5273384
1.000 0.0009760160 0.8529952 0.5228643
1.000 0.0045302650 0.8497977 0.5039555
1.000 0.0210276283 0.8469315 0.4894358
1.000 0.0976016090 0.8469315 0.4894358
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.325 and lambda = 0.0002102763.
def_elnet_pred = predict(def_elnet, newdata = xtest) # predict
confusionMatrix(def_elnet_pred, xtest$Is_Lead) # confusion matrix
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 54439 9062
X1 1792 8424
Accuracy : 0.8528
95% CI : (0.8502, 0.8553)
No Information Rate : 0.7628
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5251
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9681
Specificity : 0.4818
Pos Pred Value : 0.8573
Neg Pred Value : 0.8246
Prevalence : 0.7628
Detection Rate : 0.7385
Detection Prevalence : 0.8614
Balanced Accuracy : 0.7249
'Positive' Class : X0
roc(response=xtest$Is_Lead, predictor=def_elnet_pred,plot=T,print.auc=T) # plot ROC
Ordered predictor converted to numeric vector. Threshold values will not correspond to values in predictor.
Call:
roc.default(response = xtest$Is_Lead, predictor = def_elnet_pred, plot = T, print.auc = T)
Data: def_elnet_pred in 56231 controls (xtest$Is_Lead X0) < 17486 cases (xtest$Is_Lead X1).
Area under the curve: 0.7249
paste("F1-score: ",(F1_Score(y_pred = def_elnet_pred, y_true = xtest$Is_Lead, positive = "X1"))) # F1-score
[1] "F1-score: 0.608187134502924"
vip(def_elnet, num_features = 20, geom="point") + ggplot2::theme_bw() # plot variable importance of 20 features
# glm reg via lasso penalty
X = model.matrix(data = xtrain[,1:9], ~ -1 + .) #matrix
y = xtrain$Is_Lead
lasso <- glmnet(x = X, y = y, family = "binomial")
plot(lasso, xvar = "lambda")
# lasso
lams <- expand.grid(alpha = 1, lambda = lasso$lambda) # extract lambda values
lasso_mod<- train(Is_Lead~., data = xtrain, method = "glmnet", family = "binomial",
trControl = fit.control, tuneGrid = lams)
lasso_pred = predict(lasso_mod, newdata = xtest) # predict
confusionMatrix(lasso_pred, xtest$Is_Lead) # confusion matrix
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 54525 9172
X1 1706 8314
Accuracy : 0.8524
95% CI : (0.8499, 0.855)
No Information Rate : 0.7628
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5219
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9697
Specificity : 0.4755
Pos Pred Value : 0.8560
Neg Pred Value : 0.8297
Prevalence : 0.7628
Detection Rate : 0.7397
Detection Prevalence : 0.8641
Balanced Accuracy : 0.7226
'Positive' Class : X0
roc(response=xtest$Is_Lead, predictor=lasso_pred,plot=T,print.auc=T) # plot ROC
Ordered predictor converted to numeric vector. Threshold values will not correspond to values in predictor.
Call:
roc.default(response = xtest$Is_Lead, predictor = lasso_pred, plot = T, print.auc = T)
Data: lasso_pred in 56231 controls (xtest$Is_Lead X0) < 17486 cases (xtest$Is_Lead X1).
Area under the curve: 0.7226
paste("F1-score: ",(F1_Score(y_pred = lasso_pred, y_true = xtest$Is_Lead, positive = "X1"))) # F1-score
[1] "F1-score: 0.604522649603723"
vip(lasso_mod, num_features = 20, geom="point") + ggplot2::theme_bw() # plot variable importance of 20 features
# rpart
cvCtrl <- trainControl(method = "repeatedcv", number = 3, repeats = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE)
set.seed(202)
rpart_mod <- train(Is_Lead ~ ., data = xtrain, method = "rpart",
tuneLength = 10,
metric = "ROC",
trControl = cvCtrl)
rpart_mod
CART
172008 samples
9 predictor
2 classes: 'X0', 'X1'
No pre-processing
Resampling: Cross-Validated (3 fold, repeated 10 times)
Summary of sample sizes: 114672, 114673, 114671, 114671, 114673, 114672, ...
Resampling results across tuning parameters:
cp ROC Sens Spec
0.0003186118 0.8642363 0.9592831 0.5367875
0.0003706926 0.8640522 0.9610810 0.5309912
0.0004820025 0.8635460 0.9637806 0.5223444
0.0005473588 0.8629182 0.9650252 0.5180284
0.0006862409 0.8620322 0.9673003 0.5100803
0.0015522115 0.8553868 0.9707346 0.4924097
0.0028307436 0.8509025 0.9699815 0.4869466
0.0034924759 0.8418342 0.9725149 0.4708591
0.0078754309 0.7543572 0.9770049 0.4377480
0.3547130043 0.5736293 0.9914623 0.1557963
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.0003186118.
plot(rpart_mod) # plot complexity parameter
unlist(rpart_mod$bestTune) # print best complexity parameter
cp
0.0003186118
fancyRpartPlot(rpart_mod$finalModel) # plot tree
rpart_pred = predict(rpart_mod,xtest) # predict
confusionMatrix(rpart_pred, xtest$Is_Lead) # confusion matrix
Confusion Matrix and Statistics
Reference
Prediction X0 X1
X0 54196 8389
X1 2035 9097
Accuracy : 0.8586
95% CI : (0.8561, 0.8611)
No Information Rate : 0.7628
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5533
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9638
Specificity : 0.5202
Pos Pred Value : 0.8660
Neg Pred Value : 0.8172
Prevalence : 0.7628
Detection Rate : 0.7352
Detection Prevalence : 0.8490
Balanced Accuracy : 0.7420
'Positive' Class : X0
roc(response=xtest$Is_Lead, predictor=rpart_pred ,plot=T,print.auc=T) # plot ROC
Call:
roc.default(response = xtest$Is_Lead, predictor = rpart_pred, plot = T, print.auc = T)
Data: rpart_pred in 56231 controls (xtest$Is_Lead X0) < 17486 cases (xtest$Is_Lead X1).
Area under the curve: 0.742
paste("F1-score: ",(F1_Score(y_pred = rpart_pred , y_true = xtest$Is_Lead, positive = "X1"))) # F1-score
[1] "F1-score: 0.635753721434063"
vip(rpart_mod, num_features = 20, geom="point") + ggplot2::theme_bw() # plot variable importance of 20 features