1. Load the data contained in the file data_telebank.csv and name the variable dta_bank

dta_bank = read.csv("Bank Case.csv")

2-a. Type of variable (categorical/numerical) and what are the units (for the numerical only)

[Ans]

glimpse(dta_bank)
## Rows: 41,188
## Columns: 12
## $ age         <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,...
## $ job         <fct> housemaid, services, services, admin., services, servic...
## $ marital     <fct> married, married, married, married, married, married, m...
## $ education   <fct> basic.4y, high.school, high.school, basic.6y, high.scho...
## $ default     <fct> no, unknown, no, no, no, unknown, no, unknown, no, no, ...
## $ housing     <fct> no, no, yes, no, no, no, no, no, yes, yes, no, yes, no,...
## $ loan        <fct> no, no, no, no, yes, no, no, no, no, no, no, no, yes, n...
## $ contact     <fct> telephone, telephone, telephone, telephone, telephone, ...
## $ month       <fct> may, may, may, may, may, may, may, may, may, may, may, ...
## $ day_of_week <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, ...
## $ duration    <int> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22...
## $ y           <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no,...

2-b. For the ones that are numerical study whether they have outliers. There is no definition for what an outlier so we can define an outlier as any observation with a value that is more than 4 times its st3andard deviation.

[Ans] Yes, both age and duration have outliers. Age variable has 99 outliers and duration variable has 386 outliers.

# calculate mean and standard deviation of age and duration
age_mean              = mean(dta_bank$age)
age_sd                = sd(dta_bank$age)
duration_mean         = mean(dta_bank$duration)
duration_sd           = sd(dta_bank$duration) 

# calculate numbers of outliers for age and duration variables
age_outliers          = sum(dta_bank$age > age_mean+4*age_sd | 
                            dta_bank$age < age_mean-4*age_sd)
duration_outliers     = sum(dta_bank$duration > 
                            duration_mean+4*duration_sd | 
                            dta_bank$duration < 
                            duration_mean-4*duration_sd)
# save the above results in a dataframe
n_outliers            = as.data.frame(cbind(age_outliers, duration_outliers))
colnames(n_outliers)  = c("age", "duration")
row.names(n_outliers) = "number of outliers"

n_outliers %>% 
  kable(booktabs = T) %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F, 
                position="center")
age duration
number of outliers 99 386

3. Create a corr-plot using the package corrplot. You will have to install it using the command install.packages()

# drop all outliers in numercial variables
dta_bank     = dta_bank %>%
               filter(
                 age < age_mean+4*age_sd & 
                 age > age_mean-4*age_sd & 
                 dta_bank$duration < 
                   duration_mean+4*duration_sd & 
                 dta_bank$duration > 
                   duration_mean-4*duration_sd
                 )

# transform all variables into numeric variables in order to create a corrplot
dta_bank_num = as.data.frame(lapply(dta_bank, as.numeric))

# create a corrplot
corrplot(cor(dta_bank_num), method="color", type="upper", 
         addCoef.col = "black", tl.col="black", tl.srt=45, 
         sig.level = 0.01, insig = "blank", diag=FALSE,
         number.cex=0.5,tl.cex=0.7, cl.ratio=0.2,cl.cex=0.7,
         col = brewer.pal(n = 8, name = "BrBG"))

4. Run the following command lm(y~.,data=dta_bank)

Dependent variable:
as.numeric(y)
age 0.001***
(0.0002)
jobblue-collar -0.027***
(0.005)
jobentrepreneur -0.027***
(0.008)
jobhousemaid -0.008
(0.009)
jobmanagement -0.020***
(0.006)
jobretired 0.062***
(0.008)
jobself-employed -0.024***
(0.008)
jobservices -0.022***
(0.005)
jobstudent 0.118***
(0.010)
jobtechnician -0.015***
(0.005)
jobunemployed 0.019**
(0.009)
jobunknown 0.010
(0.015)
maritalmarried 0.008*
(0.004)
maritalsingle 0.023***
(0.005)
maritalunknown -0.009
(0.030)
educationbasic.6y 0.006
(0.007)
educationbasic.9y -0.003
(0.006)
educationhigh.school 0.002
(0.006)
educationilliterate 0.116*
(0.063)
educationprofessional.course 0.008
(0.007)
educationuniversity.degree 0.018***
(0.006)
educationunknown 0.021***
(0.008)
defaultunknown -0.040***
(0.003)
defaultyes -0.042
(0.154)
housingunknown 0.006
(0.009)
housingyes 0.002
(0.003)
loanunknown
loanyes -0.001
(0.004)
contacttelephone -0.078***
(0.003)
monthaug -0.068***
(0.006)
monthdec 0.237***
(0.021)
monthjul -0.087***
(0.006)
monthjun 0.001
(0.007)
monthmar 0.312***
(0.013)
monthmay -0.067***
(0.006)
monthnov -0.071***
(0.007)
monthoct 0.239***
(0.011)
monthsep 0.228***
(0.013)
day_of_weekmon -0.010**
(0.004)
day_of_weekthu 0.002
(0.004)
day_of_weektue 0.004
(0.004)
day_of_weekwed 0.005
(0.004)
duration 0.001***
(0.00001)
Constant 1.008***
(0.012)
Observations 40,703
R2 0.258
Adjusted R2 0.258
Residual Std. Error 0.266 (df = 40660)
F Statistic 337.361*** (df = 42; 40660)
Note: p<0.1; p<0.05; p<0.01

4-a. Write the structural equation that R is estimating?

$$ \begin{aligned} Y = & 1.008 + 0.001 AGE \ & - 0.027 JOB {blue collar }
- 0.029 JOB
{entrepreneur } - 0.004 JOB {housemaid } \ & - 0.021 JOB {management } + 0.062 JOB {retired } - 0.025 JOB {self-employed} \ & - 0.023 JOB {services } + 0.119 JOB {student } - 0.015 JOB {technician } \ & + 0.019 JOB {unemployed } + 0.006 JOB {unknown } \ & + 0.008 MARITAL {married }
+ 0.023 MARITAL {single } - 0.009 MARITAL {unknown } \ & + 0.116 EDU {illiterate } + 0.006 EDU {basic.6y } - 0.003 EDU {basic.9y } \ & + 0.002 EDU {high.school }
+ 0.008 EDU {professional } \ & + 0.018 EDU {university }
+ 0.021 EDU {unknown } \ & - 0.040 DEFAULT {unknown } - 0.042 DEFAULT {yes } \ & + 0.006 HOUSING {unknown } + 0.002 HOUSING {yes } \ & - 0.001 LOAN {yes } - 0.078 CONTACT {telephone } \ & - 0.068 MONTH {Aug } + 0.237 MONTH {Dec } - 0.087 MONTH {Jul } \ & + 0.001 MONTH {Jun } + 0.312 MONTH {Mar } - 0.067 MONTH {May } \ & - 0.071 MONTH {Nov } + 0.239 MONTH {Oct } + 0.228 MONTH {Sep } \ & - 0.010 DAY {Mon } + 0.002 DAY {Thu } + 0.004 DAY {Tue } + 0.005 DAY {Wed } \ & + 0.001 DURATION \ &

\end{aligned} $$

4-b. Comment the results.

4-b-i. Best time to perform telemarketing tasks?

[Ans] Wednesday is the best time to perform telemarketing tasks because its coefficient is the highest among weekday varaibles.

weekday_coef = as.data.frame(
  cbind(c("Mon","Tue","Wed","Thu","Fri"),
        c(summary(reg_bank)$coefficients[c(39,41,42,40)],0))
  ) %>% rename(day=V1,coef=V2) %>% 
        mutate(coef=as.numeric(paste(coef)),
               color=ifelse(coef>0,"pos","neg"))

ggplot(weekday_coef,aes(x=reorder(day,coef), y=coef,fill=color)) + 
  geom_bar(stat = "identity",width=0.5)+ 
  scale_fill_manual("legend", values = c("pos"="firebrick","neg"="darkblue")) +
  geom_hline(yintercept = 0,color="dimgray",size=1, linetype='dashed') +
  coord_flip()

4-b-ii. Best income groups?

[Ans] As we can see in the bar chart, student, retired and unemployed have higher coefficients in this case, so we can conclude that groups without full-time job are the best group to target. In fact, this conclusion makes sense becuase these groups usually have more free time to answer calls than those who have full-time jobs. Besides,time deposits are risk-free and safe investments. They can appeal students and retired people who don’t have stable salary to invest.

job_coef = as.data.frame(
    cbind(c(row.names(summary(reg_bank)$coefficients)[3:13],"admin"),
          c(summary(reg_bank)$coefficients[c(3:13)],0))
    ) %>% rename(job  = V1,coef=V2) %>% 
          mutate(job  = str_replace(job,"job",""),
                 coef = as.numeric(paste(coef)),
                 color= ifelse(coef>0,"pos","neg"))

ggplot(job_coef,aes(x=reorder(job,coef), y=coef, fill=color)) + 
  geom_bar(stat = "identity",width=0.8) +
  scale_fill_manual("legend", values = c("pos"="firebrick","neg"="darkblue")) +
  geom_hline(yintercept = 0,color="dimgray",size=1, linetype='dashed') +
  coord_flip() + 
  xlab("Coefficients") + ylab("Job")

  theme(axis.text.x = element_text(angle = 45, hjust=1))
## List of 1
##  $ axis.text.x:List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 1
##   ..$ vjust        : NULL
##   ..$ angle        : num 45
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi FALSE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

4-b-iii. Potential concerns of omitted variable Bias

[Ans] * Consumer-related: Regions, Family size * Phonecalls-related: number of contacts * Advertising-related: Interest rate, how many other kind of ads are running at the same time? * Social environment: GDP, consumer price index, Unemployment rate, Political elections


Predictive Modeling and Tuning

This is a predictive modeling exercise and we have seen in class that we always divide the dataset in dta_bank_training, dta_bank_validating, dta_bank_test.

1. Explain (in sentences) why and how we always do that.

[Ans] The purpose of predictive modeling is to find a trend that can be generalized to all observations and use this trend to predict what we want to know. First, we need to use training data to find a model that can represent the trend. Next, we using validating data to tune our model and find the hyperparameters with the best performance. Subsequently, we test and compare models’ performances through testing data to ensure models can be generalized. Lastly, we choose the model with the best performance, like the highest accuracy, as our predictive model. Generally, we use 80% data as the training data, 10% as validating data and the remaining 10% as test data.

2. From the point of view of the firm and given that we are running a predictive exercise, is there any variable that should not be included as X? If yes, please drop it.

[Ans] All variables should be kept. Because the purpose of this case is to find out how to call the right customers at the right times,

3. Explain the problems of overfitting and underfitting.

[Ans] Overfitting means models fit trainning data a lot better than test data and cause low bias but high variance, so it cannot be well generalized to other observations.

Conversely, underfitting means models cannot fit both trainning data and test data well and cause low variance but high bias, so the underfitting models are too generalized.

Both overfitting and underfitting might cause model to have poor performances.

4. Explain the meaning of the no free lunch theorem.

It means that there is no the best alogorithms. That is, same algorithms may have different performance in different datasets. For example, we cannot say decision tree always have better performance than linear regression.

5. For the following 4 models, write their structural equations and comment:

  • Step1: Preparing data
# transform categorical variables into dummy variables
dta_bank_dummy = dta_bank[,c(1,11)]
cat_col    = colnames(dta_bank)[-c(1,11)]

for (i in cat_col){
  dummy_df = psych::dummy.code(dta_bank[,i])
  colnames(dummy_df)=paste(i,colnames(dummy_df),sep="_")
  dta_bank_dummy = as.data.frame(cbind(dta_bank_dummy,dummy_df))
}
# creating square and cube of age variable
dta_bank_dummy = dta_bank_dummy %>%
                 mutate(age_square = age^2,
                        age_cube = age^3)

# setting indices
set.seed(123456)
size = nrow(dta_bank_dummy)
inx  = 1:size
train_inx = sample(inx,floor(size*0.8))
valid_inx = sample(inx[-train_inx],floor((size-length(train_inx))*0.5))
test_inx  = inx[-c(train_inx,valid_inx)]

# creating data and labels for training, testing and validation
dta_bank_training    = dta_bank_dummy[train_inx,]
dta_bank_validating  = dta_bank_dummy[valid_inx,]
dta_bank_test        = dta_bank_dummy[test_inx,]
  • Step2: Running and evaluating regressions
# define needed columns for different linear regessions
col_lm1 = c(1,38:47,54)
col_lm2 = c(1,38:47,54:56)
col_lm3 = c(1:52,54)

# run regressions
lm1 = lm(y_yes~.,data=dta_bank_training[,col_lm1])

lm2 = lm(y_yes~.,data=dta_bank_training[,col_lm2])

lm3 = lm(y_yes~.,data=dta_bank_training[,col_lm3])

lm4 = lm(y_yes~.^2,data=dta_bank_training[,col_lm3])
  • Step3: Write strutural equations

lm1: one explanatory variable and time fixed effects(months) \[ \begin{aligned} Y = 0.4199+ 0.00008468\times AGE-0.07191\times MONTH_{Mar}+0.2309\times MONTH_{Apr}-0.3642\times MONTH_{May}... \end{aligned} \]

lm2: cubic polynomial regression, non-linear regression \[ \begin{aligned} Y = & 0.8063- 0.01855\times AGE+0.117594\times AGE^2+0.146243\times AGE^3 \\& -0.07191\times MONTH_{Mar}+0.2309\times MONTH_{Apr}-0.3642\times MONTH_{May}... \end{aligned} \]

lm3: muliple linear regression

$$ \begin{aligned} Y = & 0.1301 + 0.0006725 AGE + 0.0005670 DURATION \ &
- 0.02699 JOB {blue collar }
- 0.03604 JOB
{entrepreneur } - 0.005475 JOB {housemaid }… \ & - 0.0008322 MARITAL {married }
+ 0.01437 MARITAL {single } - 0.007168 MARITAL {divorced } \ & + 0.1003 EDU {illiterate } - 0.01773 EDU {basic.6y } - 0.02618 EDU {basic.9y }… \ & - 0.008282 DEFAULT {unknown } - 0.04601 DEFAULT {no } \ & + 0.003779 HOUSING {unknown } - 0.001886 HOUSING {no } \ & + 0.003411 LOAN {no } + 0.07938 CONTACT {cellular } \ & - 0.2794 MONTH {Aug } + 0.03 MONTH {Dec } - 0.2987 MONTH {Jul } …\ & - 0.01301 DAY {Mon } - 0.002929 DAY {Thu } + 0.002787 DAY {Tue } + 0.004607 DAY {Fri }

\end{aligned} $$

lm4: muliple linear regression and interaction terms between variables $$ \begin{aligned} Y = & 0.2523245 - 0.007993195 AGE - 0.00006590533 DURATION….. \ &
- 0.002887967 AGEJOB_{blue collar }
- 0.002465323 AGE
JOB_{entrepreneur }…..

\end{aligned} $$

  • Step4: comment regressions
# use regression models to predict probabilities
lm_list = list(lm1,lm2,lm3,lm4)
lm_prob_list = list()

for (i in 1:4){
  list_name = paste("lm_pred",i, sep="")
  lm_prob_list[[list_name]] = 
    predict(lm_list[[i]],newdata=dta_bank_validating)
}
## Warning in predict.lm(lm_list[[i]], newdata = dta_bank_validating): prediction
## from a rank-deficient fit may be misleading

## Warning in predict.lm(lm_list[[i]], newdata = dta_bank_validating): prediction
## from a rank-deficient fit may be misleading

## Warning in predict.lm(lm_list[[i]], newdata = dta_bank_validating): prediction
## from a rank-deficient fit may be misleading

## Warning in predict.lm(lm_list[[i]], newdata = dta_bank_validating): prediction
## from a rank-deficient fit may be misleading
# create a table with some performance to compare models
lm_eval = as.data.frame(matrix(NA,nrow=6,ncol=4)) 
colnames(lm_eval) = paste("lm",1:4,sep="")
row.names(lm_eval)= c("RMSPE_train",
                      "RMSPE_valid",
                      "RMSPE_valid/RMSPE_train",
                      "Accuracy",
                      "Sensitivity",
                      "Pos Pred Value")
conf_list = list()


for (i in 1:4){
  # RMSPE of training data
  lm_eval[1,i] = sqrt(mean(lm_list[[i]]$residuals^2,na.rm = TRUE))
  
  # RMSPE of validating data
  ssr_pred     = (dta_bank_validating$y_yes-lm_prob_list[[i]])^2
  lm_eval[2,i] = sqrt(mean(ssr_pred, na.rm = TRUE))
  
  # calculate the difference of RMSPE between training and validating data
  lm_eval[3,i] = lm_eval[2,i]/lm_eval[1,i]
  
  # calculate accuracy
  lm_pred_class= ifelse(lm_prob_list[[i]]>0.5,1,"0")
  conf_list[[i]] = confusionMatrix(
    factor(lm_pred_class,levels=c("1","0")),
    factor(dta_bank_validating$y_yes,levels=c("1","0"))
  )
  lm_eval[4,i] = conf_list[[i]]$overall['Accuracy']
  lm_eval[5,i] = conf_list[[i]]$byClass['Sensitivity']
  lm_eval[6,i] = conf_list[[i]]$byClass['Pos Pred Value']
}


# comparing models
round(lm_eval,3)  %>% 
  kable(booktabs = T) %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F, 
                position="center")
lm1 lm2 lm3 lm4
RMSPE_train 0.297 0.294 0.266 0.253
RMSPE_valid 0.302 0.301 0.271 0.270
RMSPE_valid/RMSPE_train 1.017 1.022 1.020 1.065
Accuracy 0.887 0.888 0.892 0.894
Sensitivity 0.000 0.052 0.154 0.220
Pos Pred Value NaN 0.533 0.582 0.580

5-a. Which one overfits more?

[Ans] lm4 overfits the most because the ratio of validating data’s RMSPE to training data’s RMSPE is the highest in the above table. That is, lm4 model in the validating data has higher RMSPE than in the training data. Besides the difference of RMSPE between validating data and training data in lm4 model is the highest compared to other linear models.

5-b. Which one underfits more?

[Ans] As we can see in the above table, lm1 underfits the most because both training data and validating data have the lowest RMPSE compared to other models. Besides the ratio of validating data’s MSPE to training data’s MSPE is the lowest. Therefore, we can conclude that lm1 is more underfitting than other liner models.

5-c. Is the model that fits the training data the best one that has the best predictive power?

[Ans] No, although lm4 model has the lowest RMSPE in the training data, it doen’t has the best predictive power. In fact, lm3 is the best one because it has similar accuracy and PPV with lm4 model but have less overfitting problem than lm4.

The reason that we have to use PPV(Pos Pred Value) to evaluate predictive powers is because our purpose is to find out the potential customers of term deposits and appropriate times to contact them. Therefore, if we can predict the positive value more precisely, we can help the company to save much effort and time to contact the right people. For example, PPV of lm3 model is 0.582. when lm3 predict a customer will subscribe a term deposit, 58.2% of customer actually subscribe a term deposit. If PPV increase, we can find right customers more quickly and further help the company to save costs and increase efficiency.

5-d. Can we use a confusion matrix to analyze the problems a problem of underfitting?

[Ans] Yes, we can use a confusion matrix to analyze the underfitting problems. Take lm1 as an example. The following is the confusion matrix of lm1. As we can see, lm1 predict 0 consumers in validating data will subscribe a term deposit. However, 460 consumers actually subscribe term deposits. Although the overall accuracy is 0.887, the probability that model can precisely predict positive values is 0. That is, the lm1 model merely predict the negative values but cannot predict positive values well. Therefore, lm1 has higher RMSPE in both training and validating data and cause the underfitting problem.

conf_list[[1]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1    0    0
##          0  460 3610
##                                           
##                Accuracy : 0.887           
##                  95% CI : (0.8768, 0.8965)
##     No Information Rate : 0.887           
##     P-Value [Acc > NIR] : 0.5124          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.000           
##             Specificity : 1.000           
##          Pos Pred Value :   NaN           
##          Neg Pred Value : 0.887           
##              Prevalence : 0.113           
##          Detection Rate : 0.000           
##    Detection Prevalence : 0.000           
##       Balanced Accuracy : 0.500           
##                                           
##        'Positive' Class : 1               
## 

5-e. Which data set should we use to run these regressions?

[Ans] We should use training dataset to run these regressions.


Improving the predictive power

1. Make a visualization to inspect the relationship between the Y and each of the X that you have included in the regressions above.

1-a. Does it look linear?

[Ans] Only the relationship between the Y and duration variable looks linear, other relationships do not look linear.

dta_bank_plot = dta_bank[valid_inx,]
dta_bank_plot$y_pred = lm_prob_list[[4]]

color = "darkblue"

p1  = ggplot(dta_bank_plot,aes(x=age,y=y_pred)) + geom_point(col=color) + theme(axis.text.x = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

p2  = ggplot(dta_bank_plot, aes(x=job,y=y_pred)) + geom_point(col=color) + theme(axis.text.x  = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

p3  = ggplot(dta_bank_plot, aes(x=marital,y=y_pred)) + geom_point(col=color) + theme(axis.text.x  = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

p4  = ggplot(dta_bank_plot, aes(x=education,y=y_pred)) + geom_point(col=color) + theme(axis.text.x  = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

p5  = ggplot(dta_bank_plot, aes(x=default,y=y_pred)) + geom_point(col=color) + theme(axis.text.x  = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

p6  = ggplot(dta_bank_plot, aes(x=housing,y=y_pred)) + geom_point(col=color) + theme(axis.text.x  = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

p7  = ggplot(dta_bank_plot, aes(x=loan,y=y_pred)) + geom_point(col=color) + theme(axis.text.x  = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

p8  = ggplot(dta_bank_plot, aes(x=contact,y=y_pred)) + geom_point(col=color) + theme(axis.text.x  = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

p9  = ggplot(dta_bank_plot, aes(x=month,y=y_pred)) + geom_point(col=color) + theme(axis.text.x  = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

p10 = ggplot(dta_bank_plot, aes(x=day_of_week,y=y_pred)) + geom_point(col=color) + theme(axis.text.x  = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

p11 = ggplot(dta_bank_plot, aes(x=duration,y=y_pred)) + geom_point(col=color) + theme(axis.text.x  = element_blank(),axis.ticks.x = element_blank(),axis.title.y = element_blank())

library(cowplot)
## 
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
##   default ggplot2 theme anymore. To recover the previous
##   behavior, execute:
##   theme_set(theme_cowplot())
## ********************************************************
plot_grid(p1, p2, p3,p4,p5,p6,p7,p8,p9,p10,p11, align = "v", nrow = 3)

2. Use the other predictive methods seen in class (like NB classifiers or KNN) to check if you can improve the performance.

[Ans] In addition to linear probability model, I also run 7 models, including knn, naive bayes, decision tree, random forest.

  • linear probability model (using lm3 to predict test data in order to compare other models)
lm_prob_final = predict(lm3,
                        newdata = dta_bank_test[,col_lm3])
## Warning in predict.lm(lm3, newdata = dta_bank_test[, col_lm3]): prediction from
## a rank-deficient fit may be misleading
lm_pred_final = ifelse(lm_prob_final>0.5,1,0)

conf_mat_lm3 = confusionMatrix(
  factor(lm_pred_final,levels=c("1","0")),
  factor(dta_bank_test[,54],levels=c("1","0"))
  )

conf_mat_lm3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1   78   45
##          0  336 3612
##                                          
##                Accuracy : 0.9064         
##                  95% CI : (0.897, 0.9152)
##     No Information Rate : 0.8983         
##     P-Value [Acc > NIR] : 0.0447         
##                                          
##                   Kappa : 0.2558         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.18841        
##             Specificity : 0.98769        
##          Pos Pred Value : 0.63415        
##          Neg Pred Value : 0.91489        
##              Prevalence : 0.10169        
##          Detection Rate : 0.01916        
##    Detection Prevalence : 0.03021        
##       Balanced Accuracy : 0.58805        
##                                          
##        'Positive' Class : 1              
## 
  • Preparing data for Ridge and Lasso
dta_bank_dummy   = dta_bank_dummy %>%
                   mutate(duration_square = duration^2,
                          duration_cube   = duration^3)

dta_bank_dummy_Y = dta_bank_dummy %>% 
                   select(y_yes) %>% 
                   mutate(y_yes = ifelse(y_yes==1,1,-1))

dta_bank_dummy_X = dta_bank_dummy[,-c(3,15,19,27,30,33,36,38,48,53,54)]


index = c()

for (i in 1:47){
  for(j in 1:47){
    if (i==j) {next}
    if (paste("interact",j,i,sep="_") %in% index) {next}
    var_name = paste("interact",i,j,sep="_")
    index[length(index)+1] = var_name
    dta_bank_dummy_X[,var_name] = dta_bank_dummy_X[,i]*dta_bank_dummy_X[,j]
  }
}

train_Y_unstd = dta_bank_dummy_Y[c(train_inx,valid_inx),] %>% as.matrix()
test_Y_unstd  = dta_bank_dummy_Y[-c(train_inx,valid_inx),] %>% as.matrix()


train_X_unstd = dta_bank_dummy_X[c(train_inx,valid_inx),] %>% as.matrix()
test_X_unstd  = dta_bank_dummy_X[-c(train_inx,valid_inx),] %>% as.matrix()



preProc     = preProcess(train_X_unstd, method = c("center", "scale"))
## Warning in preProcess.default(train_X_unstd, method = c("center", "scale")):
## These variables have zero variances: interact_3_4, interact_3_5, interact_3_6,
## interact_3_7, interact_3_8, interact_3_9, interact_3_10, interact_3_11,
## interact_3_12, interact_3_13, interact_3_25, interact_4_5, interact_4_6,
## interact_4_7, interact_4_8, interact_4_9, interact_4_10, interact_4_11,
## interact_4_12, interact_4_13, interact_4_23, interact_5_6, interact_5_7,
## interact_5_8, interact_5_9, interact_5_10, interact_5_11, interact_5_12,
## interact_5_13, interact_5_23, interact_5_25, interact_6_7, interact_6_8,
## interact_6_9, interact_6_10, interact_6_11, interact_6_12, interact_6_13,
## interact_6_23, interact_6_25, interact_7_8, interact_7_9, interact_7_10,
## interact_7_11, interact_7_12, interact_7_13, interact_7_25, interact_8_9,
## interact_8_10, interact_8_11, interact_8_12, interact_8_13, interact_8_25,
## interact_8_39, interact_9_10, interact_9_11, interact_9_12, interact_9_13,
## interact_9_25, interact_10_11, interact_10_12, interact_10_13, interact_10_25,
## interact_11_12, interact_11_13, interact_11_23, interact_12_13, interact_12_23,
## interact_12_25, interact_13_23, interact_13_25, interact_13_39, interact_14_15,
## interact_14_16, interact_14_25, interact_15_16, interact_15_25, interact_16_23,
## interact_16_25, interact_16_37, interact_16_39, interact_17_18, interact_17_19,
## interact_17_20, interact_17_21, interact_17_22, interact_17_23, interact_18_19,
## interact_18_20, interact_18_21, interact_18_22, interact_18_23, interact_18_25,
## interact_19_20, interact_19_21, interact_19_22, interact_19_23, interact_20_21,
## interact_20_22, interact_20_23, interact_20_25, interact_21_22, interact_21_23,
## interact_21_25, interact_21_39, interact_22_23, interact_22_25, interact_23_25,
## interact_23_27, interact_23_29, interact_23_33, interact_23_36, interact_23_37,
## interact_23_38, interact_23_39, interact_24_25, interact_25_27, interact_25_28,
## interact_25_29, interact_25_30, interact_25_31, interact_25_33, interact_25_35,
## interact_25_36, interact_25_37, interact_25_38, interact_25_39, interact_25_40,
## interact_25_41, interact_25_43, interact_26_27, interact_26_29, interact_27_28,
## interact_28_29, interact_31_32, interact_31_33, interact_31_34, interact_31_35,
## interact_31_36, interact_31_37, interact_31_38, interact_31_39, interact_32_33,
## interact_32_34, interact_32_35, interact_32_36, interact_32_37, interact_32_38,
## interact_32_39, interact_33_34, interact_33_35, interact_33_36, interact_33_37,
## interact_33_38, interact_33_39, interact_34_35, interact_34_36, interact_34_37,
## interact_34_38, interact_34_39, interact_35_36, interact_35_37, interact_35_38,
## interact_35_39, interact_36_37, interact_36_38, interact_36_39, interact_37_38,
## interact_37_39, interact_38_39, interact_40_41, interact_40_42, interact_40_43,
## interact_41_42, interact_41_43, interact_42_43
train_X_std = predict(preProc, train_X_unstd) %>% as.matrix()
test_X_std  = predict(preProc, test_X_unstd) %>% as.matrix()

train_Y_std = train_Y_unstd-mean(train_Y_unstd) 
test_Y_std  = test_Y_unstd-mean(train_Y_unstd)  
  • Ridge Regression
# Ridge regression
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.3
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:pracma':
## 
##     expm, lu, tril, triu
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.0
ridge_cv = cv.glmnet(train_X_std, 
                     train_Y_std, 
                     alpha = 0,
                     standardize = FALSE, 
                     nfolds = 5)

lambda_cv = ridge_cv$lambda.min

Ridge_cv_unstd  = glmnet(train_X_unstd, 
                         train_Y_unstd, 
                         alpha = 0, 
                         lambda = lambda_cv, 
                         standardize = TRUE)

y_hat_cv_is     = predict(Ridge_cv_unstd, train_X_unstd)
y_hat_cv_oos    = predict(Ridge_cv_unstd, test_X_unstd)

RMSPE_ridge_is  = sqrt(mean((t(train_Y_unstd - y_hat_cv_is) %*% (train_Y_unstd - y_hat_cv_is))))
RMSPE_ridge_oos = sqrt(mean((t(test_Y_unstd - y_hat_cv_oos) %*% (test_Y_unstd - y_hat_cv_oos))))

Ridge_RMSPE     = data.frame(round(t(c(RMSPE_ridge_is,RMSPE_ridge_oos)),2))
colnames(Ridge_RMSPE)  = c("In sample RMSPE",
                           "Out of sample RMSPE")
row.names(Ridge_RMSPE) = "Ridge"

Ridge_RMSPE %>% 
  kable(booktabs = T) %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F, 
                position="center")
In sample RMSPE Out of sample RMSPE
Ridge 96.25 33.28
ridge_pred = ifelse(y_hat_cv_oos>0,1,0)

conf_mat_ridge = confusionMatrix(
  factor(ridge_pred,levels=c("1","0")),
  factor(dta_bank_dummy$y_yes[-c(train_inx,valid_inx)],levels=c("1","0"))
  )

conf_mat_ridge
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  116   71
##          0  298 3586
##                                          
##                Accuracy : 0.9094         
##                  95% CI : (0.9001, 0.918)
##     No Information Rate : 0.8983         
##     P-Value [Acc > NIR] : 0.009669       
##                                          
##                   Kappa : 0.3445         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.28019        
##             Specificity : 0.98059        
##          Pos Pred Value : 0.62032        
##          Neg Pred Value : 0.92327        
##              Prevalence : 0.10169        
##          Detection Rate : 0.02849        
##    Detection Prevalence : 0.04593        
##       Balanced Accuracy : 0.63039        
##                                          
##        'Positive' Class : 1              
## 
  • Lasso Regression
lasso_cv = cv.glmnet(train_X_std, 
                     train_Y_std, 
                     alpha = 1,
                     standardize = FALSE, 
                     nfolds = 5)
lambda_lasso = lasso_cv$lambda.min

lasso_cv_unstd  = glmnet(train_X_unstd, 
                         train_Y_unstd, 
                         alpha = 1, 
                         lambda = lambda_lasso, 
                         standardize = TRUE)
y_hat_lasso_is     = predict(lasso_cv_unstd, train_X_unstd)
y_hat_lasso_oos    = predict(lasso_cv_unstd, test_X_unstd)

Lasso_ridge_is  = sqrt(mean(t(train_Y_unstd - y_hat_lasso_is) %*% (train_Y_unstd - y_hat_lasso_is)))
Lasso_ridge_oos = sqrt(mean(t(test_Y_unstd - y_hat_lasso_oos) %*% (test_Y_unstd - y_hat_lasso_oos)))

Lasso_RMSPE     = data.frame(round(t(c(Lasso_ridge_is,Lasso_ridge_oos)),2))
colnames(Lasso_RMSPE)  = c("In sample RMSPE",
                           "Out of sample RMSPE")
row.names(Lasso_RMSPE) = "Lasso"

Lasso_RMSPE %>% 
  kable(booktabs = T) %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F, 
                position="center")
In sample RMSPE Out of sample RMSPE
Lasso 96.95 32.71
lasso_pred = ifelse(y_hat_lasso_oos>0,1,0)
conf_mat_lasso = confusionMatrix(
  factor(lasso_pred,levels=c("1","0")),
  factor(dta_bank_dummy$y_yes[-c(train_inx,valid_inx)],levels=c("1","0"))
  )
conf_mat_lasso 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  107   61
##          0  307 3596
##                                           
##                Accuracy : 0.9096          
##                  95% CI : (0.9004, 0.9182)
##     No Information Rate : 0.8983          
##     P-Value [Acc > NIR] : 0.008365        
##                                           
##                   Kappa : 0.3283          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.25845         
##             Specificity : 0.98332         
##          Pos Pred Value : 0.63690         
##          Neg Pred Value : 0.92134         
##              Prevalence : 0.10169         
##          Detection Rate : 0.02628         
##    Detection Prevalence : 0.04127         
##       Balanced Accuracy : 0.62089         
##                                           
##        'Positive' Class : 1               
## 
  • KNN
# normalizing data
normalize = function(x){
  if (max(x) != 0) {return((x-min(x)) / (max(x)-min(x)))}
}

dta_bank_dummy =
  as.data.frame((lapply(dta_bank_dummy,normalize)))

dta_bank_training    = dta_bank_dummy[train_inx,]
dta_bank_validating  = dta_bank_dummy[valid_inx,]
dta_bank_test        = dta_bank_dummy[test_inx,]

# use training data to train models and use validating data to adjust hyperparameters
range         = 1:30
accuracy      = rep(0, length(range))

set.seed(123456)

for (k in range) {
  pred        = knn(train = dta_bank_training[,c(1:52)], 
                    test  = dta_bank_validating[,c(1:52)], 
                    cl    = dta_bank_training[,54], 
                    k     = k)
  conf        = table(dta_bank_validating[,54], pred)
  accuracy[k] = sum(diag(conf))/sum(conf)
}

# find the value of k with the best accuracy
k_max           = which.max(accuracy)

# use the best k to predict test data
test_pred_knn = class::knn(train = dta_bank_training[,-c(53:56)], 
                           test  = dta_bank_test[,-c(53:56)], 
                           cl    = dta_bank_training[,54], 
                           k     = k_max)

# calculate accuracy
conf_mat_knn = confusionMatrix(
  factor(test_pred_knn,levels=c("1","0")),
  factor(dta_bank_test[,54],levels=c("1","0"))
  )
conf_mat_knn 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1   30   24
##          0  384 3633
##                                           
##                Accuracy : 0.8998          
##                  95% CI : (0.8901, 0.9088)
##     No Information Rate : 0.8983          
##     P-Value [Acc > NIR] : 0.3902          
##                                           
##                   Kappa : 0.1073          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.072464        
##             Specificity : 0.993437        
##          Pos Pred Value : 0.555556        
##          Neg Pred Value : 0.904406        
##              Prevalence : 0.101695        
##          Detection Rate : 0.007369        
##    Detection Prevalence : 0.013265        
##       Balanced Accuracy : 0.532951        
##                                           
##        'Positive' Class : 1               
## 
  • Naive Bayes
# prepare data
dta_bank_training_NB = 
  as.data.frame(rbind(dta_bank_training,dta_bank_validating))

NB_classfied  = naiveBayes(as.factor(y_yes)~.,
                           data=dta_bank_training_NB[,col_lm3],
                           laplace=1)
NB_pred       = predict(NB_classfied,
                        newdata = dta_bank_test[,col_lm3],
                        type="class")

# calculate accuracy
conf_mat_NB = confusionMatrix(
  factor(NB_pred,levels=c("1","0")),
  factor(dta_bank_test[,54],levels=c("1","0"))
  )
conf_mat_NB 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  191  393
##          0  223 3264
##                                           
##                Accuracy : 0.8487          
##                  95% CI : (0.8373, 0.8596)
##     No Information Rate : 0.8983          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2994          
##                                           
##  Mcnemar's Test P-Value : 9.814e-12       
##                                           
##             Sensitivity : 0.46135         
##             Specificity : 0.89253         
##          Pos Pred Value : 0.32705         
##          Neg Pred Value : 0.93605         
##              Prevalence : 0.10169         
##          Detection Rate : 0.04692         
##    Detection Prevalence : 0.14345         
##       Balanced Accuracy : 0.67694         
##                                           
##        'Positive' Class : 1               
## 
  • Decision Tree
# prepare data for decision tree
dta_bank_training_dt    = dta_bank[train_inx,]
dta_bank_validating_dt  = dta_bank[valid_inx,]
dta_bank_test_dt        = dta_bank[test_inx,]

# build decision tree model
tree_model = rpart(y~.,
                   data=dta_bank_training_dt,
                   method = "class")

# choosing cp to prune the tree model
plotcp(tree_model)

tree_model_prune = prune(tree_model, cp=0.017)

# predict data
tree_pred  = predict(tree_model_prune, 
                     newdata = dta_bank_test_dt,
                     type = "class")

# make decision tree plot
rpart.plot(tree_model_prune)

# calculate confusion matrix and accuracy
conf_mat_tree = confusionMatrix(
  factor(tree_pred,levels=c("yes","no")),
  factor(dta_bank_test_dt$y,levels=c("yes","no"))
  )

conf_mat_tree 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  114   78
##        no   300 3579
##                                           
##                Accuracy : 0.9071          
##                  95% CI : (0.8978, 0.9159)
##     No Information Rate : 0.8983          
##     P-Value [Acc > NIR] : 0.03158         
##                                           
##                   Kappa : 0.3333          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.27536         
##             Specificity : 0.97867         
##          Pos Pred Value : 0.59375         
##          Neg Pred Value : 0.92266         
##              Prevalence : 0.10169         
##          Detection Rate : 0.02800         
##    Detection Prevalence : 0.04716         
##       Balanced Accuracy : 0.62702         
##                                           
##        'Positive' Class : yes             
## 
  • random forest
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
## 
##     outlier
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
n_tree = seq(500,800, by=50)
rf_acc = c(rep(NA,length(n_tree)))
names(rf_acc) = paste(n_tree)

for (i in 1:length(n_tree)){
model_rf = randomForest(y~.,
                        data = dta_bank_training_dt,
                        ntree = i)

rf_pred  = predict(model_rf, 
                   newdata = dta_bank_validating_dt,
                   type = "class")

conf_mat_tree = confusionMatrix(dta_bank_validating_dt$y,
                                rf_pred)
rf_acc[i] = conf_mat_tree$overall["Accuracy"]
}

n_tree_max = which.max(rf_acc)

model_rf = randomForest(y~.,
                        data = dta_bank_training_dt,
                        ntree = n_tree_max)

rf_pred  = predict(model_rf, 
                   newdata = dta_bank_test_dt,
                   type = "class")

conf_mat_rf = confusionMatrix(
  factor(rf_pred,levels=c("yes","no")),
  factor(dta_bank_test_dt$y,levels=c("yes","no"))
  )
conf_mat_rf 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  yes   no
##        yes  157  145
##        no   257 3512
##                                           
##                Accuracy : 0.9013          
##                  95% CI : (0.8917, 0.9102)
##     No Information Rate : 0.8983          
##     P-Value [Acc > NIR] : 0.277           
##                                           
##                   Kappa : 0.3859          
##                                           
##  Mcnemar's Test P-Value : 3.091e-08       
##                                           
##             Sensitivity : 0.37923         
##             Specificity : 0.96035         
##          Pos Pred Value : 0.51987         
##          Neg Pred Value : 0.93181         
##              Prevalence : 0.10169         
##          Detection Rate : 0.03857         
##    Detection Prevalence : 0.07418         
##       Balanced Accuracy : 0.66979         
##                                           
##        'Positive' Class : yes             
## 

3. Do they make it better? Worse?

[Ans] Some models make it better, but some make it worse. Overall, as we can see in the following table, all of models other than Naive Bayes have similar accuracy with the linear regression model, about 90%. However, the decision tree model have the best PPV among these models. In order to achieve our purpose to contact right customers, I suggest to choose the decision tree model as our predictive model.

all_conf_list     = list(conf_mat_lm3,
                         conf_mat_lasso,
                         conf_mat_ridge,
                         conf_mat_NB,
                         conf_mat_knn,
                         conf_mat_tree,
                         conf_mat_rf
                         )
models_evaluation = as.data.frame(matrix(NA,nrow=length(all_conf_list),ncol=2))
colnames(models_evaluation) = c('Accuracy','PPV')
row.names(models_evaluation) = c("LM","Lasso","Ridge","Naive Bayes",
                                 "KNN","Decision Tree","Random Forest")
                         
                                  
for (i in 1:length(all_conf_list)){
  models_evaluation[i,1] = all_conf_list[[i]]$overall['Accuracy']
  models_evaluation[i,2] = all_conf_list[[i]]$byClass['Pos Pred Value']
}

round(models_evaluation,3) %>% 
  kable(booktabs = T) %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F, 
                position="center") %>%
  row_spec(6, bold = T, color = "red")
Accuracy PPV
LM 0.906 0.634
Lasso 0.910 0.637
Ridge 0.909 0.620
Naive Bayes 0.849 0.327
KNN 0.900 0.556
Decision Tree 0.894 0.965
Random Forest 0.901 0.520

Causal Questions

1. When we study causality we always focus on the parameters multiplying the X variables instead of the predictive capacity of the model. We then give a causal interpretation to the estimated coefficients.

1-a. Explain when in marketing is preferable a causal analysis to a predictive analysis.

[Ans] Because if we can find reasons behind consumer behaviors, we can adjust our marketing campaign or offer more attractive marketing contents to our consumers and further influence their purchasing behaviors. For example, if we find positive words on the packages of products, such as “happy”, can stimulate people buy more products, then it is helpful to help marketing team to decide which word they can print on the packages of products in the future.

1-b. In the context of a linear regression, explain the concepts of a biased estimated.

[Ans] When a linear regression has biased estimations, it might have an omitte varirable bias. Because some variables are correlated with the dependent variable and some explanatory variables in the regression at the same time but we don’t add these variables into the regression, it cause error terms will be correlated with explanatory variables in the regression. Therefore, the coefficient of estimators will be biased and make our predictions inaccurate.

2. Which of the variables could be interesting to analyze from a causal point of view. Give examples.

[Ans] Marrige might be the variables that we can analyze from a causal point. For example, compared to married people, single people might not use lots on money in a short term because they don’t have some expenditures on children’s tuition and life expense. As for job, people

3. For those variables what would be the potential omitted variables problem?

[Ans] For marrige variable, annual salary and number of house might be omitted variables. Because all of these variables can influence the dependent variable and marridge variable.