dta_bank = read.csv("Bank Case.csv")
[Ans]
Categorical variables: all variables other than age and duration, i.e. job, marital, education, default, housing, loan, contact, month, day_of_week, and y.
Numerical variables: age and duration.
glimpse(dta_bank)
## Rows: 41,188
## Columns: 12
## $ age <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,...
## $ job <fct> housemaid, services, services, admin., services, servic...
## $ marital <fct> married, married, married, married, married, married, m...
## $ education <fct> basic.4y, high.school, high.school, basic.6y, high.scho...
## $ default <fct> no, unknown, no, no, no, unknown, no, unknown, no, no, ...
## $ housing <fct> no, no, yes, no, no, no, no, no, yes, yes, no, yes, no,...
## $ loan <fct> no, no, no, no, yes, no, no, no, no, no, no, no, yes, n...
## $ contact <fct> telephone, telephone, telephone, telephone, telephone, ...
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, ...
## $ day_of_week <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, ...
## $ duration <int> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22...
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no,...
[Ans] Yes, both age and duration have outliers. Using the mean ± 4 standard deviations rule computed below, the age variable has 99 outliers and the duration variable has 386.
# calculate mean and standard deviation of age and duration
age_mean = mean(dta_bank$age)
age_sd = sd(dta_bank$age)
duration_mean = mean(dta_bank$duration)
duration_sd = sd(dta_bank$duration)
# count outliers for age and duration, defined as observations outside mean ± 4 SD
age_outliers = sum(dta_bank$age > age_mean + 4*age_sd |
                   dta_bank$age < age_mean - 4*age_sd)
duration_outliers = sum(dta_bank$duration > duration_mean + 4*duration_sd |
                        dta_bank$duration < duration_mean - 4*duration_sd)
# save the above results in a dataframe
n_outliers = as.data.frame(cbind(age_outliers, duration_outliers))
colnames(n_outliers) = c("age", "duration")
row.names(n_outliers) = "number of outliers"
n_outliers %>%
kable(booktabs = T) %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position="center")
| | age | duration |
|---|---|---|
| number of outliers | 99 | 386 |
# drop all outliers in the numerical variables
dta_bank = dta_bank %>%
  filter(
    age < age_mean + 4*age_sd,
    age > age_mean - 4*age_sd,
    duration < duration_mean + 4*duration_sd,
    duration > duration_mean - 4*duration_sd
  )
# transform all variables into numeric variables in order to create a corrplot
dta_bank_num = as.data.frame(lapply(dta_bank, as.numeric))
# create a corrplot (requires the corrplot and RColorBrewer packages)
library(corrplot)
library(RColorBrewer)
corrplot(cor(dta_bank_num), method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black", tl.srt = 45,
         sig.level = 0.01, insig = "blank", diag = FALSE,
         number.cex = 0.5, tl.cex = 0.7, cl.ratio = 0.2, cl.cex = 0.7,
         col = brewer.pal(n = 8, name = "BrBG"))
| | Dependent variable: as.numeric(y) |
|---|---|
| age | 0.001*** (0.0002) |
| jobblue-collar | -0.027*** (0.005) |
| jobentrepreneur | -0.027*** (0.008) |
| jobhousemaid | -0.008 (0.009) |
| jobmanagement | -0.020*** (0.006) |
| jobretired | 0.062*** (0.008) |
| jobself-employed | -0.024*** (0.008) |
| jobservices | -0.022*** (0.005) |
| jobstudent | 0.118*** (0.010) |
| jobtechnician | -0.015*** (0.005) |
| jobunemployed | 0.019** (0.009) |
| jobunknown | 0.010 (0.015) |
| maritalmarried | 0.008* (0.004) |
| maritalsingle | 0.023*** (0.005) |
| maritalunknown | -0.009 (0.030) |
| educationbasic.6y | 0.006 (0.007) |
| educationbasic.9y | -0.003 (0.006) |
| educationhigh.school | 0.002 (0.006) |
| educationilliterate | 0.116* (0.063) |
| educationprofessional.course | 0.008 (0.007) |
| educationuniversity.degree | 0.018*** (0.006) |
| educationunknown | 0.021*** (0.008) |
| defaultunknown | -0.040*** (0.003) |
| defaultyes | -0.042 (0.154) |
| housingunknown | 0.006 (0.009) |
| housingyes | 0.002 (0.003) |
| loanunknown | |
| loanyes | -0.001 (0.004) |
| contacttelephone | -0.078*** (0.003) |
| monthaug | -0.068*** (0.006) |
| monthdec | 0.237*** (0.021) |
| monthjul | -0.087*** (0.006) |
| monthjun | 0.001 (0.007) |
| monthmar | 0.312*** (0.013) |
| monthmay | -0.067*** (0.006) |
| monthnov | -0.071*** (0.007) |
| monthoct | 0.239*** (0.011) |
| monthsep | 0.228*** (0.013) |
| day_of_weekmon | -0.010** (0.004) |
| day_of_weekthu | 0.002 (0.004) |
| day_of_weektue | 0.004 (0.004) |
| day_of_weekwed | 0.005 (0.004) |
| duration | 0.001*** (0.00001) |
| Constant | 1.008*** (0.012) |
| Observations | 40,703 |
| R2 | 0.258 |
| Adjusted R2 | 0.258 |
| Residual Std. Error | 0.266 (df = 40660) |
| F Statistic | 337.361*** (df = 42; 40660) |
| Note: | standard errors in parentheses; *p<0.1; **p<0.05; ***p<0.01 |
$$
\begin{aligned}
Y = & \ 1.008 + 0.001\, AGE \\
& - 0.027\, JOB_{\text{blue collar}} - 0.029\, JOB_{\text{entrepreneur}} - 0.004\, JOB_{\text{housemaid}} \\
& - 0.021\, JOB_{\text{management}} + 0.062\, JOB_{\text{retired}} - 0.025\, JOB_{\text{self-employed}} \\
& - 0.023\, JOB_{\text{services}} + 0.119\, JOB_{\text{student}} - 0.015\, JOB_{\text{technician}} \\
& + 0.019\, JOB_{\text{unemployed}} + 0.006\, JOB_{\text{unknown}} \\
& + 0.008\, MARITAL_{\text{married}} + 0.023\, MARITAL_{\text{single}} - 0.009\, MARITAL_{\text{unknown}} \\
& + 0.116\, EDU_{\text{illiterate}} + 0.006\, EDU_{\text{basic.6y}} - 0.003\, EDU_{\text{basic.9y}} \\
& + 0.002\, EDU_{\text{high.school}} + 0.008\, EDU_{\text{professional}} + 0.018\, EDU_{\text{university}} + 0.021\, EDU_{\text{unknown}} \\
& - 0.040\, DEFAULT_{\text{unknown}} - 0.042\, DEFAULT_{\text{yes}} \\
& + 0.006\, HOUSING_{\text{unknown}} + 0.002\, HOUSING_{\text{yes}} \\
& - 0.001\, LOAN_{\text{yes}} - 0.078\, CONTACT_{\text{telephone}} \\
& - 0.068\, MONTH_{\text{Aug}} + 0.237\, MONTH_{\text{Dec}} - 0.087\, MONTH_{\text{Jul}} \\
& + 0.001\, MONTH_{\text{Jun}} + 0.312\, MONTH_{\text{Mar}} - 0.067\, MONTH_{\text{May}} \\
& - 0.071\, MONTH_{\text{Nov}} + 0.239\, MONTH_{\text{Oct}} + 0.228\, MONTH_{\text{Sep}} \\
& - 0.010\, DAY_{\text{Mon}} + 0.002\, DAY_{\text{Thu}} + 0.004\, DAY_{\text{Tue}} + 0.005\, DAY_{\text{Wed}} \\
& + 0.001\, DURATION
\end{aligned}
$$
[Ans] Wednesday is the best time to perform telemarketing tasks because its coefficient (0.005) is the highest among the weekday dummies; relative to the Friday baseline, a Wednesday call is associated with a roughly 0.5-percentage-point higher subscription probability.
# weekday coefficients from the regression (coefficient matrix rows 39-42
# are Mon, Thu, Tue, Wed); Friday is the omitted baseline, so its value is 0
weekday_coef = as.data.frame(
  cbind(c("Mon","Tue","Wed","Thu","Fri"),
        c(summary(reg_bank)$coefficients[c(39,41,42,40)], 0))
) %>% rename(day = V1, coef = V2) %>%
  mutate(coef = as.numeric(paste(coef)),
         color = ifelse(coef > 0, "pos", "neg"))
ggplot(weekday_coef,aes(x=reorder(day,coef), y=coef,fill=color)) +
geom_bar(stat = "identity",width=0.5)+
scale_fill_manual("legend", values = c("pos"="firebrick","neg"="darkblue")) +
geom_hline(yintercept = 0,color="dimgray",size=1, linetype='dashed') +
coord_flip()
[Ans] As we can see in the bar chart, student, retired, and unemployed have the highest coefficients, so groups without a full-time job are the best ones to target. This conclusion makes sense because these groups usually have more free time to answer calls than people with full-time jobs. Besides, time deposits are safe, nearly risk-free investments, so they appeal to students and retired people who don't have a stable salary to invest.
# job coefficients (coefficient matrix rows 3-13); admin is the omitted
# baseline, so its value is 0
job_coef = as.data.frame(
  cbind(c(row.names(summary(reg_bank)$coefficients)[3:13], "admin"),
        c(summary(reg_bank)$coefficients[c(3:13)], 0))
) %>% rename(job = V1, coef = V2) %>%
  mutate(job = str_replace(job, "job", ""),
         coef = as.numeric(paste(coef)),
         color = ifelse(coef > 0, "pos", "neg"))
ggplot(job_coef, aes(x = reorder(job, coef), y = coef, fill = color)) +
  geom_bar(stat = "identity", width = 0.8) +
  scale_fill_manual("legend", values = c("pos" = "firebrick", "neg" = "darkblue")) +
  geom_hline(yintercept = 0, color = "dimgray", size = 1, linetype = "dashed") +
  coord_flip() +
  xlab("Job") + ylab("Coefficient") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
[Ans]
* Consumer-related: region, family size
* Phone-call-related: number of contacts
* Advertising-related: interest rate, how many other kinds of ads are running at the same time
* Social-environment-related: GDP, consumer price index, unemployment rate, political elections
[Ans] The purpose of predictive modeling is to find a pattern that generalizes to unseen observations and to use that pattern to predict the outcome we care about. First, we use training data to fit candidate models. Next, we use validation data to tune each model and find the hyperparameters with the best performance. Subsequently, we compare the models' performance on test data to ensure they generalize. Lastly, we choose the model with the best performance, e.g. the highest accuracy, as our predictive model. Generally, we use 80% of the data for training, 10% for validation, and the remaining 10% for testing.
[Ans] All variables should be kept. Because the purpose of this case is to find out how to call the right customers at the right times, every variable that describes the customer or the timing of the call may carry predictive information.
[Ans] Overfitting means a model fits the training data much better than the test data, producing low bias but high variance, so it cannot be generalized well to other observations.
Conversely, underfitting means a model fits neither the training data nor the test data well, producing low variance but high bias; an underfitted model is over-generalized (too simple).
Both overfitting and underfitting can cause a model to perform poorly, as the sketch below illustrates.
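To see both failure modes concretely, here is a minimal sketch on simulated data (not the bank data; the sine signal and polynomial degrees are arbitrary illustrative choices):
# fit polynomials of increasing degree to noisy simulated data
set.seed(1)
x = runif(200)
y = sin(2*pi*x) + rnorm(200, sd = 0.3)
dta = data.frame(x, y)
train = 1:100; test = 101:200
for (deg in c(1, 3, 15)) {
  fit = lm(y ~ poly(x, deg), data = dta[train,])
  rmse = function(idx) sqrt(mean((dta$y[idx] - predict(fit, dta[idx,]))^2))
  cat(sprintf("degree %2d: train RMSE %.3f, test RMSE %.3f\n",
              deg, rmse(train), rmse(test)))
}
# degree 1 typically underfits (both RMSEs high); degree 15 typically overfits
# (train RMSE low, test RMSE clearly higher); degree 3 balances the two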
It means that there is no single best algorithm: the same algorithm may perform differently on different datasets. For example, we cannot say a decision tree always performs better than linear regression; see the sketch below.
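A minimal sketch of this point on simulated data (hypothetical, not the bank case): the same two algorithms trade places depending on the shape of the signal.
library(rpart)
set.seed(1)
n = 1000
x = runif(n, -2, 2)
datasets = list(linear = 2*x + rnorm(n),                  # a linear signal
                step   = ifelse(x > 0, 2, -2) + rnorm(n)) # a step signal
train = 1:500; test = 501:1000
for (nm in names(datasets)) {
  df = data.frame(x = x, y = datasets[[nm]])
  rmse_lm   = sqrt(mean((df$y[test] - predict(lm(y ~ x, df[train,]), df[test,]))^2))
  rmse_tree = sqrt(mean((df$y[test] - predict(rpart(y ~ x, df[train,]), df[test,]))^2))
  cat(sprintf("%s data: lm test RMSE %.2f, tree test RMSE %.2f\n",
              nm, rmse_lm, rmse_tree))
}
# lm typically wins on the linear data, while the tree wins on the step data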
# transform categorical variables into dummy variables
dta_bank_dummy = dta_bank[,c(1,11)]
cat_col = colnames(dta_bank)[-c(1,11)]
for (i in cat_col){
dummy_df = psych::dummy.code(dta_bank[,i])
colnames(dummy_df)=paste(i,colnames(dummy_df),sep="_")
dta_bank_dummy = as.data.frame(cbind(dta_bank_dummy,dummy_df))
}
# creating square and cube of age variable
dta_bank_dummy = dta_bank_dummy %>%
mutate(age_square = age^2,
age_cube = age^3)
# setting indices
set.seed(123456)
size = nrow(dta_bank_dummy)
inx = 1:size
train_inx = sample(inx,floor(size*0.8))
valid_inx = sample(inx[-train_inx],floor((size-length(train_inx))*0.5))
test_inx = inx[-c(train_inx,valid_inx)]
# creating data and labels for training, testing and validation
dta_bank_training = dta_bank_dummy[train_inx,]
dta_bank_validating = dta_bank_dummy[valid_inx,]
dta_bank_test = dta_bank_dummy[test_inx,]
# define needed columns for the different linear regressions
col_lm1 = c(1,38:47,54)    # age, month dummies, y_yes
col_lm2 = c(1,38:47,54:56) # adds age_square and age_cube
col_lm3 = c(1:52,54)       # all explanatory variables plus y_yes
# run regressions
lm1 = lm(y_yes~.,data=dta_bank_training[,col_lm1])
lm2 = lm(y_yes~.,data=dta_bank_training[,col_lm2])
lm3 = lm(y_yes~.,data=dta_bank_training[,col_lm3])
lm4 = lm(y_yes~.^2,data=dta_bank_training[,col_lm3])
lm1: one explanatory variable (age) plus time fixed effects (months) \[ Y = 0.4199 + 0.00008468\times AGE - 0.07191\times MONTH_{\text{Mar}} + 0.2309\times MONTH_{\text{Apr}} - 0.3642\times MONTH_{\text{May}} \ldots \]
lm2: cubic polynomial (non-linear) regression \[ \begin{aligned} Y = & \ 0.8063 - 0.01855\times AGE + 0.117594\times AGE^2 + 0.146243\times AGE^3 \\ & - 0.07191\times MONTH_{\text{Mar}} + 0.2309\times MONTH_{\text{Apr}} - 0.3642\times MONTH_{\text{May}} \ldots \end{aligned} \]
lm3: multiple linear regression
$$
\begin{aligned}
Y = & \ 0.1301 + 0.0006725\, AGE + 0.0005670\, DURATION \\
& - 0.02699\, JOB_{\text{blue collar}} - 0.03604\, JOB_{\text{entrepreneur}} - 0.005475\, JOB_{\text{housemaid}} \ldots \\
& - 0.0008322\, MARITAL_{\text{married}} + 0.01437\, MARITAL_{\text{single}} - 0.007168\, MARITAL_{\text{divorced}} \\
& + 0.1003\, EDU_{\text{illiterate}} - 0.01773\, EDU_{\text{basic.6y}} - 0.02618\, EDU_{\text{basic.9y}} \ldots \\
& - 0.008282\, DEFAULT_{\text{unknown}} - 0.04601\, DEFAULT_{\text{no}} \\
& + 0.003779\, HOUSING_{\text{unknown}} - 0.001886\, HOUSING_{\text{no}} \\
& + 0.003411\, LOAN_{\text{no}} + 0.07938\, CONTACT_{\text{cellular}} \\
& - 0.2794\, MONTH_{\text{Aug}} + 0.03\, MONTH_{\text{Dec}} - 0.2987\, MONTH_{\text{Jul}} \ldots \\
& - 0.01301\, DAY_{\text{Mon}} - 0.002929\, DAY_{\text{Thu}} + 0.002787\, DAY_{\text{Tue}} + 0.004607\, DAY_{\text{Fri}}
\end{aligned}
$$
lm4: multiple linear regression with interaction terms between variables
$$
\begin{aligned}
Y = & \ 0.2523245 - 0.007993195\, AGE - 0.00006590533\, DURATION \ldots \\
& - 0.002887967\, AGE \times JOB_{\text{blue collar}} - 0.002465323\, AGE \times JOB_{\text{entrepreneur}} \ldots
\end{aligned}
$$
# use regression models to predict probabilities
lm_list = list(lm1,lm2,lm3,lm4)
lm_prob_list = list()
for (i in 1:4){
list_name = paste("lm_pred",i, sep="")
lm_prob_list[[list_name]] =
predict(lm_list[[i]],newdata=dta_bank_validating)
}
## Warning in predict.lm(lm_list[[i]], newdata = dta_bank_validating): prediction
## from a rank-deficient fit may be misleading
## Warning in predict.lm(lm_list[[i]], newdata = dta_bank_validating): prediction
## from a rank-deficient fit may be misleading
## Warning in predict.lm(lm_list[[i]], newdata = dta_bank_validating): prediction
## from a rank-deficient fit may be misleading
## Warning in predict.lm(lm_list[[i]], newdata = dta_bank_validating): prediction
## from a rank-deficient fit may be misleading
# create a table with some performance to compare models
lm_eval = as.data.frame(matrix(NA,nrow=6,ncol=4))
colnames(lm_eval) = paste("lm",1:4,sep="")
row.names(lm_eval)= c("RMSPE_train",
"RMSPE_valid",
"RMSPE_valid/RMSPE_train",
"Accuracy",
"Sensitivity",
"Pos Pred Value")
conf_list = list()
for (i in 1:4){
# RMSPE of training data
lm_eval[1,i] = sqrt(mean(lm_list[[i]]$residuals^2,na.rm = TRUE))
# RMSPE of validating data
ssr_pred = (dta_bank_validating$y_yes-lm_prob_list[[i]])^2
lm_eval[2,i] = sqrt(mean(ssr_pred, na.rm = TRUE))
# ratio of validation RMSPE to training RMSPE (an overfitting indicator)
lm_eval[3,i] = lm_eval[2,i]/lm_eval[1,i]
# classify with a 0.5 threshold and calculate accuracy
lm_pred_class = ifelse(lm_prob_list[[i]] > 0.5, "1", "0")
conf_list[[i]] = confusionMatrix(
factor(lm_pred_class,levels=c("1","0")),
factor(dta_bank_validating$y_yes,levels=c("1","0"))
)
lm_eval[4,i] = conf_list[[i]]$overall['Accuracy']
lm_eval[5,i] = conf_list[[i]]$byClass['Sensitivity']
lm_eval[6,i] = conf_list[[i]]$byClass['Pos Pred Value']
}
# comparing models
round(lm_eval,3) %>%
kable(booktabs = T) %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position="center")
| | lm1 | lm2 | lm3 | lm4 |
|---|---|---|---|---|
| RMSPE_train | 0.297 | 0.294 | 0.266 | 0.253 |
| RMSPE_valid | 0.302 | 0.301 | 0.271 | 0.270 |
| RMSPE_valid/RMSPE_train | 1.017 | 1.022 | 1.020 | 1.065 |
| Accuracy | 0.887 | 0.888 | 0.892 | 0.894 |
| Sensitivity | 0.000 | 0.052 | 0.154 | 0.220 |
| Pos Pred Value | NaN | 0.533 | 0.582 | 0.580 |
[Ans] lm4 overfits the most: its ratio of validation RMSPE to training RMSPE (1.065) is the highest in the table above, and the gap between its validation and training RMSPE is the largest among the four linear models. That is, lm4's performance deteriorates the most when moving from the training data to the validation data.
[Ans] As we can see in the above table, lm1 underfits the most: it has the highest RMSPE on both the training and validation data, and its ratio of validation RMSPE to training RMSPE is the lowest. Therefore, lm1 is more underfitting than the other linear models.
[Ans] No. Although lm4 has the lowest RMSPE on the training data, it doesn't have the best predictive power. In fact, lm3 is the best one: it has accuracy and PPV similar to lm4 but much less of an overfitting problem.
The reason we use PPV (Pos Pred Value) to evaluate predictive power is that our purpose is to find the potential term-deposit customers and the appropriate times to contact them. If we can predict the positive class more precisely, we save the company effort and time by contacting the right people. For example, the PPV of lm3 is 0.582: when lm3 predicts that a customer will subscribe to a term deposit, 58.2% of such customers actually subscribe. If PPV increases, we can find the right customers more quickly, which further saves costs and increases efficiency.
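As a quick check of the arithmetic, PPV is simply TP / (TP + FP) from the confusion matrix; a minimal sketch using the lm3 test-set counts reported further below:
# PPV = TP / (TP + FP): of all predicted subscribers, the share who actually subscribed
TP = 78   # lm3 on the test set: predicted "yes", actually "yes"
FP = 45   # predicted "yes", actually "no"
TP / (TP + FP)   # 0.634, matching caret's "Pos Pred Value" for lm3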
[Ans] Yes, we can use a confusion matrix to analyze the underfitting problem. Take lm1 as an example; its confusion matrix follows. As we can see, lm1 predicts that 0 consumers in the validation data will subscribe to a term deposit, yet 460 consumers actually did. Although the overall accuracy is 0.887, the probability that the model correctly identifies positive cases is 0. That is, lm1 only predicts the negative class and cannot predict positive cases at all, which is why it has higher RMSPE on both the training and validation data and underfits.
conf_list[[1]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 0 0
## 0 460 3610
##
## Accuracy : 0.887
## 95% CI : (0.8768, 0.8965)
## No Information Rate : 0.887
## P-Value [Acc > NIR] : 0.5124
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000
## Specificity : 1.000
## Pos Pred Value : NaN
## Neg Pred Value : 0.887
## Prevalence : 0.113
## Detection Rate : 0.000
## Detection Prevalence : 0.000
## Balanced Accuracy : 0.500
##
## 'Positive' Class : 1
##
[Ans] We should use the training dataset to run these regressions.
[Ans] Only the relationship between Y and the duration variable looks linear; the other relationships do not look linear.
dta_bank_plot = dta_bank[valid_inx,]
dta_bank_plot$y_pred = lm_prob_list[[4]]
# plot the predicted probability against each explanatory variable
plot_vars = c("age","job","marital","education","default","housing",
              "loan","contact","month","day_of_week","duration")
plot_list = lapply(plot_vars, function(v){
  ggplot(dta_bank_plot, aes(x = .data[[v]], y = y_pred)) +
    geom_point(col = "darkblue") +
    xlab(v) +
    theme(axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.title.y = element_blank())
})
library(cowplot)
##
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
## default ggplot2 theme anymore. To recover the previous
## behavior, execute:
## theme_set(theme_cowplot())
## ********************************************************
plot_grid(plotlist = plot_list, align = "v", nrow = 3)
[Ans] In addition to the linear probability model, I also run six other models: ridge regression, lasso, KNN, naive Bayes, decision tree, and random forest.
lm_prob_final = predict(lm3,
newdata = dta_bank_test[,col_lm3])
## Warning in predict.lm(lm3, newdata = dta_bank_test[, col_lm3]): prediction from
## a rank-deficient fit may be misleading
lm_pred_final = ifelse(lm_prob_final>0.5,1,0)
conf_mat_lm3 = confusionMatrix(
factor(lm_pred_final,levels=c("1","0")),
factor(dta_bank_test[,54],levels=c("1","0"))
)
conf_mat_lm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 78 45
## 0 336 3612
##
## Accuracy : 0.9064
## 95% CI : (0.897, 0.9152)
## No Information Rate : 0.8983
## P-Value [Acc > NIR] : 0.0447
##
## Kappa : 0.2558
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.18841
## Specificity : 0.98769
## Pos Pred Value : 0.63415
## Neg Pred Value : 0.91489
## Prevalence : 0.10169
## Detection Rate : 0.01916
## Detection Prevalence : 0.03021
## Balanced Accuracy : 0.58805
##
## 'Positive' Class : 1
##
# create square and cube of the duration variable
dta_bank_dummy = dta_bank_dummy %>%
  mutate(duration_square = duration^2,
         duration_cube = duration^3)
# recode the outcome as +1/-1 for the penalized regressions
dta_bank_dummy_Y = dta_bank_dummy %>%
  select(y_yes) %>%
  mutate(y_yes = ifelse(y_yes==1,1,-1))
# keep the 47 predictor columns: drop one reference dummy per categorical
# variable and the two outcome columns
dta_bank_dummy_X = dta_bank_dummy[,-c(3,15,19,27,30,33,36,38,48,53,54)]
# create all pairwise interaction terms among the 47 predictors
for (i in 1:46){
  for (j in (i+1):47){
    var_name = paste("interact",i,j,sep="_")
    dta_bank_dummy_X[,var_name] = dta_bank_dummy_X[,i]*dta_bank_dummy_X[,j]
  }
}
train_Y_unstd = dta_bank_dummy_Y[c(train_inx,valid_inx),] %>% as.matrix()
test_Y_unstd = dta_bank_dummy_Y[-c(train_inx,valid_inx),] %>% as.matrix()
train_X_unstd = dta_bank_dummy_X[c(train_inx,valid_inx),] %>% as.matrix()
test_X_unstd = dta_bank_dummy_X[-c(train_inx,valid_inx),] %>% as.matrix()
preProc = preProcess(train_X_unstd, method = c("center", "scale"))
## Warning in preProcess.default(train_X_unstd, method = c("center", "scale")):
## These variables have zero variances: interact_3_4, interact_3_5, interact_3_6,
## interact_3_7, interact_3_8, interact_3_9, interact_3_10, interact_3_11,
## interact_3_12, interact_3_13, interact_3_25, interact_4_5, interact_4_6,
## interact_4_7, interact_4_8, interact_4_9, interact_4_10, interact_4_11,
## interact_4_12, interact_4_13, interact_4_23, interact_5_6, interact_5_7,
## interact_5_8, interact_5_9, interact_5_10, interact_5_11, interact_5_12,
## interact_5_13, interact_5_23, interact_5_25, interact_6_7, interact_6_8,
## interact_6_9, interact_6_10, interact_6_11, interact_6_12, interact_6_13,
## interact_6_23, interact_6_25, interact_7_8, interact_7_9, interact_7_10,
## interact_7_11, interact_7_12, interact_7_13, interact_7_25, interact_8_9,
## interact_8_10, interact_8_11, interact_8_12, interact_8_13, interact_8_25,
## interact_8_39, interact_9_10, interact_9_11, interact_9_12, interact_9_13,
## interact_9_25, interact_10_11, interact_10_12, interact_10_13, interact_10_25,
## interact_11_12, interact_11_13, interact_11_23, interact_12_13, interact_12_23,
## interact_12_25, interact_13_23, interact_13_25, interact_13_39, interact_14_15,
## interact_14_16, interact_14_25, interact_15_16, interact_15_25, interact_16_23,
## interact_16_25, interact_16_37, interact_16_39, interact_17_18, interact_17_19,
## interact_17_20, interact_17_21, interact_17_22, interact_17_23, interact_18_19,
## interact_18_20, interact_18_21, interact_18_22, interact_18_23, interact_18_25,
## interact_19_20, interact_19_21, interact_19_22, interact_19_23, interact_20_21,
## interact_20_22, interact_20_23, interact_20_25, interact_21_22, interact_21_23,
## interact_21_25, interact_21_39, interact_22_23, interact_22_25, interact_23_25,
## interact_23_27, interact_23_29, interact_23_33, interact_23_36, interact_23_37,
## interact_23_38, interact_23_39, interact_24_25, interact_25_27, interact_25_28,
## interact_25_29, interact_25_30, interact_25_31, interact_25_33, interact_25_35,
## interact_25_36, interact_25_37, interact_25_38, interact_25_39, interact_25_40,
## interact_25_41, interact_25_43, interact_26_27, interact_26_29, interact_27_28,
## interact_28_29, interact_31_32, interact_31_33, interact_31_34, interact_31_35,
## interact_31_36, interact_31_37, interact_31_38, interact_31_39, interact_32_33,
## interact_32_34, interact_32_35, interact_32_36, interact_32_37, interact_32_38,
## interact_32_39, interact_33_34, interact_33_35, interact_33_36, interact_33_37,
## interact_33_38, interact_33_39, interact_34_35, interact_34_36, interact_34_37,
## interact_34_38, interact_34_39, interact_35_36, interact_35_37, interact_35_38,
## interact_35_39, interact_36_37, interact_36_38, interact_36_39, interact_37_38,
## interact_37_39, interact_38_39, interact_40_41, interact_40_42, interact_40_43,
## interact_41_42, interact_41_43, interact_42_43
train_X_std = predict(preProc, train_X_unstd) %>% as.matrix()
test_X_std = predict(preProc, test_X_unstd) %>% as.matrix()
train_Y_std = train_Y_unstd-mean(train_Y_unstd)
test_Y_std = test_Y_unstd-mean(train_Y_unstd)
# Ridge regression
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.3
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:pracma':
##
## expm, lu, tril, triu
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.0
ridge_cv = cv.glmnet(train_X_std,
train_Y_std,
alpha = 0,
standardize = FALSE,
nfolds = 5)
lambda_cv = ridge_cv$lambda.min
Ridge_cv_unstd = glmnet(train_X_unstd,
train_Y_unstd,
alpha = 0,
lambda = lambda_cv,
standardize = TRUE)
y_hat_cv_is = predict(Ridge_cv_unstd, train_X_unstd)
y_hat_cv_oos = predict(Ridge_cv_unstd, test_X_unstd)
RMSPE_ridge_is = sqrt(mean((train_Y_unstd - y_hat_cv_is)^2))
RMSPE_ridge_oos = sqrt(mean((test_Y_unstd - y_hat_cv_oos)^2))
Ridge_RMSPE = data.frame(round(t(c(RMSPE_ridge_is,RMSPE_ridge_oos)),2))
colnames(Ridge_RMSPE) = c("In sample RMSPE",
"Out of sample RMSPE")
row.names(Ridge_RMSPE) = "Ridge"
Ridge_RMSPE %>%
kable(booktabs = T) %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position="center")
| | In sample RMSPE | Out of sample RMSPE |
|---|---|---|
| Ridge | 0.50 | 0.52 |
ridge_pred = ifelse(y_hat_cv_oos>0,1,0)
conf_mat_ridge = confusionMatrix(
factor(ridge_pred,levels=c("1","0")),
factor(dta_bank_dummy$y_yes[-c(train_inx,valid_inx)],levels=c("1","0"))
)
conf_mat_ridge
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 116 71
## 0 298 3586
##
## Accuracy : 0.9094
## 95% CI : (0.9001, 0.918)
## No Information Rate : 0.8983
## P-Value [Acc > NIR] : 0.009669
##
## Kappa : 0.3445
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.28019
## Specificity : 0.98059
## Pos Pred Value : 0.62032
## Neg Pred Value : 0.92327
## Prevalence : 0.10169
## Detection Rate : 0.02849
## Detection Prevalence : 0.04593
## Balanced Accuracy : 0.63039
##
## 'Positive' Class : 1
##
lasso_cv = cv.glmnet(train_X_std,
train_Y_std,
alpha = 1,
standardize = FALSE,
nfolds = 5)
lambda_lasso = lasso_cv$lambda.min
lasso_cv_unstd = glmnet(train_X_unstd,
train_Y_unstd,
alpha = 1,
lambda = lambda_lasso,
standardize = TRUE)
y_hat_lasso_is = predict(lasso_cv_unstd, train_X_unstd)
y_hat_lasso_oos = predict(lasso_cv_unstd, test_X_unstd)
RMSPE_lasso_is = sqrt(mean((train_Y_unstd - y_hat_lasso_is)^2))
RMSPE_lasso_oos = sqrt(mean((test_Y_unstd - y_hat_lasso_oos)^2))
Lasso_RMSPE = data.frame(round(t(c(RMSPE_lasso_is,RMSPE_lasso_oos)),2))
colnames(Lasso_RMSPE) = c("In sample RMSPE",
"Out of sample RMSPE")
row.names(Lasso_RMSPE) = "Lasso"
Lasso_RMSPE %>%
kable(booktabs = T) %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position="center")
| | In sample RMSPE | Out of sample RMSPE |
|---|---|---|
| Lasso | 0.51 | 0.51 |
lasso_pred = ifelse(y_hat_lasso_oos>0,1,0)
conf_mat_lasso = confusionMatrix(
factor(lasso_pred,levels=c("1","0")),
factor(dta_bank_dummy$y_yes[-c(train_inx,valid_inx)],levels=c("1","0"))
)
conf_mat_lasso
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 107 61
## 0 307 3596
##
## Accuracy : 0.9096
## 95% CI : (0.9004, 0.9182)
## No Information Rate : 0.8983
## P-Value [Acc > NIR] : 0.008365
##
## Kappa : 0.3283
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.25845
## Specificity : 0.98332
## Pos Pred Value : 0.63690
## Neg Pred Value : 0.92134
## Prevalence : 0.10169
## Detection Rate : 0.02628
## Detection Prevalence : 0.04127
## Balanced Accuracy : 0.62089
##
## 'Positive' Class : 1
##
# min-max normalize each column to [0,1]; leave constant columns unchanged
normalize = function(x){
  if (max(x) == min(x)) return(x)
  (x - min(x)) / (max(x) - min(x))
}
dta_bank_dummy =
  as.data.frame(lapply(dta_bank_dummy, normalize))
dta_bank_training = dta_bank_dummy[train_inx,]
dta_bank_validating = dta_bank_dummy[valid_inx,]
dta_bank_test = dta_bank_dummy[test_inx,]
# use training data to train models and validation data to tune the hyperparameter k
library(class)
range = 1:30
accuracy = rep(0, length(range))
set.seed(123456)
for (k in range) {
pred = knn(train = dta_bank_training[,c(1:52)],
test = dta_bank_validating[,c(1:52)],
cl = dta_bank_training[,54],
k = k)
conf = table(dta_bank_validating[,54], pred)
accuracy[k] = sum(diag(conf))/sum(conf)
}
# find the value of k with the best accuracy
k_max = which.max(accuracy)
# use the best k to predict test data
test_pred_knn = class::knn(train = dta_bank_training[,-c(53:56)],
test = dta_bank_test[,-c(53:56)],
cl = dta_bank_training[,54],
k = k_max)
# calculate accuracy
conf_mat_knn = confusionMatrix(
factor(test_pred_knn,levels=c("1","0")),
factor(dta_bank_test[,54],levels=c("1","0"))
)
conf_mat_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 30 24
## 0 384 3633
##
## Accuracy : 0.8998
## 95% CI : (0.8901, 0.9088)
## No Information Rate : 0.8983
## P-Value [Acc > NIR] : 0.3902
##
## Kappa : 0.1073
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.072464
## Specificity : 0.993437
## Pos Pred Value : 0.555556
## Neg Pred Value : 0.904406
## Prevalence : 0.101695
## Detection Rate : 0.007369
## Detection Prevalence : 0.013265
## Balanced Accuracy : 0.532951
##
## 'Positive' Class : 1
##
# combine training and validation data to train the naive Bayes model (requires e1071)
dta_bank_training_NB =
  as.data.frame(rbind(dta_bank_training,dta_bank_validating))
NB_classified = naiveBayes(as.factor(y_yes)~.,
                           data=dta_bank_training_NB[,col_lm3],
                           laplace=1)
NB_pred = predict(NB_classified,
                  newdata = dta_bank_test[,col_lm3],
                  type="class")
# calculate accuracy
conf_mat_NB = confusionMatrix(
factor(NB_pred,levels=c("1","0")),
factor(dta_bank_test[,54],levels=c("1","0"))
)
conf_mat_NB
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 191 393
## 0 223 3264
##
## Accuracy : 0.8487
## 95% CI : (0.8373, 0.8596)
## No Information Rate : 0.8983
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2994
##
## Mcnemar's Test P-Value : 9.814e-12
##
## Sensitivity : 0.46135
## Specificity : 0.89253
## Pos Pred Value : 0.32705
## Neg Pred Value : 0.93605
## Prevalence : 0.10169
## Detection Rate : 0.04692
## Detection Prevalence : 0.14345
## Balanced Accuracy : 0.67694
##
## 'Positive' Class : 1
##
# prepare data for decision tree
dta_bank_training_dt = dta_bank[train_inx,]
dta_bank_validating_dt = dta_bank[valid_inx,]
dta_bank_test_dt = dta_bank[test_inx,]
# build decision tree model
tree_model = rpart(y~.,
data=dta_bank_training_dt,
method = "class")
# choosing cp to prune the tree model
plotcp(tree_model)
tree_model_prune = prune(tree_model, cp=0.017)
# predict data
tree_pred = predict(tree_model_prune,
newdata = dta_bank_test_dt,
type = "class")
# make decision tree plot
rpart.plot(tree_model_prune)
# calculate confusion matrix and accuracy
conf_mat_tree = confusionMatrix(
factor(tree_pred,levels=c("yes","no")),
factor(dta_bank_test_dt$y,levels=c("yes","no"))
)
conf_mat_tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 114 78
## no 300 3579
##
## Accuracy : 0.9071
## 95% CI : (0.8978, 0.9159)
## No Information Rate : 0.8983
## P-Value [Acc > NIR] : 0.03158
##
## Kappa : 0.3333
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.27536
## Specificity : 0.97867
## Pos Pred Value : 0.59375
## Neg Pred Value : 0.92266
## Prevalence : 0.10169
## Detection Rate : 0.02800
## Detection Prevalence : 0.04716
## Balanced Accuracy : 0.62702
##
## 'Positive' Class : yes
##
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# tune the number of trees on the validation set
n_tree = seq(500,800, by=50)
rf_acc = c(rep(NA,length(n_tree)))
names(rf_acc) = paste(n_tree)
for (i in 1:length(n_tree)){
  model_rf = randomForest(y~.,
                          data = dta_bank_training_dt,
                          ntree = n_tree[i])
  rf_pred = predict(model_rf,
                    newdata = dta_bank_validating_dt,
                    type = "class")
  # use a separate name so conf_mat_tree (the decision tree result) is not overwritten
  conf_mat_rf_val = confusionMatrix(rf_pred,
                                    dta_bank_validating_dt$y)
  rf_acc[i] = conf_mat_rf_val$overall["Accuracy"]
}
# refit with the best-performing number of trees
n_tree_best = n_tree[which.max(rf_acc)]
model_rf = randomForest(y~.,
                        data = dta_bank_training_dt,
                        ntree = n_tree_best)
rf_pred = predict(model_rf,
newdata = dta_bank_test_dt,
type = "class")
conf_mat_rf = confusionMatrix(
factor(rf_pred,levels=c("yes","no")),
factor(dta_bank_test_dt$y,levels=c("yes","no"))
)
conf_mat_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction yes no
## yes 157 145
## no 257 3512
##
## Accuracy : 0.9013
## 95% CI : (0.8917, 0.9102)
## No Information Rate : 0.8983
## P-Value [Acc > NIR] : 0.277
##
## Kappa : 0.3859
##
## Mcnemar's Test P-Value : 3.091e-08
##
## Sensitivity : 0.37923
## Specificity : 0.96035
## Pos Pred Value : 0.51987
## Neg Pred Value : 0.93181
## Prevalence : 0.10169
## Detection Rate : 0.03857
## Detection Prevalence : 0.07418
## Balanced Accuracy : 0.66979
##
## 'Positive' Class : yes
##
[Ans] Some models do better and some do worse. Overall, as the following table shows, every model other than naive Bayes reaches roughly the same accuracy as the linear regression model, about 90%. The lasso model has both the highest accuracy and the best PPV among these models, so to achieve our purpose of contacting the right customers, I suggest choosing the lasso model as our predictive model.
all_conf_list = list(conf_mat_lm3,
conf_mat_lasso,
conf_mat_ridge,
conf_mat_NB,
conf_mat_knn,
conf_mat_tree,
conf_mat_rf
)
models_evaluation = as.data.frame(matrix(NA,nrow=length(all_conf_list),ncol=2))
colnames(models_evaluation) = c('Accuracy','PPV')
row.names(models_evaluation) = c("LM","Lasso","Ridge","Naive Bayes",
"KNN","Decision Tree","Random Forest")
for (i in 1:length(all_conf_list)){
models_evaluation[i,1] = all_conf_list[[i]]$overall['Accuracy']
models_evaluation[i,2] = all_conf_list[[i]]$byClass['Pos Pred Value']
}
round(models_evaluation,3) %>%
kable(booktabs = T) %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position="center") %>%
row_spec(2, bold = T, color = "red")  # highlight the lasso row
| | Accuracy | PPV |
|---|---|---|
| LM | 0.906 | 0.634 |
| Lasso | 0.910 | 0.637 |
| Ridge | 0.909 | 0.620 |
| Naive Bayes | 0.849 | 0.327 |
| KNN | 0.900 | 0.556 |
| Decision Tree | 0.907 | 0.594 |
| Random Forest | 0.901 | 0.520 |
[Ans] Because if we can find the reasons behind consumer behavior, we can adjust our marketing campaign or offer more attractive marketing content to consumers and thereby influence their purchasing behavior. For example, if we find that positive words on product packaging, such as "happy", stimulate people to buy more, that helps the marketing team decide which words to print on packages in the future.
[Ans] When a linear regression has biased estimates, it might suffer from omitted-variable bias. If some variables are correlated with both the dependent variable and some explanatory variables in the regression, but we do not include them, the error term becomes correlated with the included explanatory variables. Therefore, the coefficient estimates are biased, which makes our predictions inaccurate.
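To make the mechanism concrete, here is a minimal simulation (hypothetical data, not the bank case): leaving a confounder z out of the regression biases the coefficient on x.
set.seed(1)
n = 10000
z = rnorm(n)               # omitted variable (confounder)
x = 0.8*z + rnorm(n)       # explanatory variable correlated with z
y = 1*x + 2*z + rnorm(n)   # true effect of x on y is 1
coef(lm(y ~ x))["x"]       # biased (around 2) because z ends up in the error term
coef(lm(y ~ x + z))["x"]   # close to 1 once the confounder is included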
[Ans] Marriage might be a variable we can analyze from a causal point of view. For example, compared to married people, single people might not need to spend a lot of money in the short term because they have no expenditures such as children's tuition and living expenses. As for job, people without full-time jobs (such as students and the retired) might subscribe more because they have more free time to answer calls, as argued above.
[Ans] For the marriage variable, annual salary and the number of houses owned might be omitted variables, because both can influence the dependent variable as well as the marriage variable.