The goal of the project is to analyze the impact of the Portuguese banking campaign conducted by the Marketing Department to promote term deposits.
The sample used for this project is 10% of the full bank data: 4,521 observations randomly selected from the full dataset.
My focus is on understanding the features of the data and attempting to predict term deposit subscription, which is a categorical outcome variable.
Specifically, I would like to know the relationship between age and average yearly balance, and to answer questions about the population in terms of age distribution, employment, marital status, banking relationship, education and housing status. This project will also attempt to identify the significant variables for predicting term deposit.
Task 1. Data Exploration (summary statistics, means, medians, quartiles, or any other relevant information about the dataset.
Please include some conclusions in the R Markdown text.)
getwd()
## [1] "C:/Users/Emahayz_Pro/Desktop/CUNY_Bridge/R-Class/Week3"
setwd("C:/Users/Emahayz_Pro/Desktop/CUNY_Bridge/R-Class/Week3")
Port_Bank <- read.csv("bank.csv", sep = ",")
head(Port_Bank)
## age job marital education default balance housing loan contact
## 1 30 unemployed married primary no 1787 no no cellular
## 2 33 services married secondary no 4789 yes yes cellular
## 3 35 management single tertiary no 1350 yes no cellular
## 4 30 management married tertiary no 1476 yes yes unknown
## 5 59 blue-collar married secondary no 0 yes no unknown
## 6 35 management single tertiary no 747 no no cellular
## day month duration campaign pdays previous poutcome y
## 1 19 oct 79 1 -1 0 unknown no
## 2 11 may 220 1 339 4 failure no
## 3 16 apr 185 1 330 1 failure no
## 4 3 jun 199 4 -1 0 unknown no
## 5 5 may 226 1 -1 0 unknown no
## 6 23 feb 141 2 176 3 failure no
summary(Port_Bank) # See Task 4 for answers
## age job marital education
## Min. :19.00 management :969 divorced: 528 primary : 678
## 1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306
## Median :39.00 technician :768 single :1196 tertiary :1350
## Mean :41.17 admin. :478 unknown : 187
## 3rd Qu.:49.00 services :417
## Max. :87.00 retired :230
## (Other) :713
## default balance housing loan contact
## no :4445 Min. :-3313 no :1962 no :3830 cellular :2896
## yes: 76 1st Qu.: 69 yes:2559 yes: 691 telephone: 301
## Median : 444 unknown :1324
## Mean : 1423
## 3rd Qu.: 1480
## Max. :71188
##
## day month duration campaign
## Min. : 1.00 may :1398 Min. : 4 Min. : 1.000
## 1st Qu.: 9.00 jul : 706 1st Qu.: 104 1st Qu.: 1.000
## Median :16.00 aug : 633 Median : 185 Median : 2.000
## Mean :15.92 jun : 531 Mean : 264 Mean : 2.794
## 3rd Qu.:21.00 nov : 389 3rd Qu.: 329 3rd Qu.: 3.000
## Max. :31.00 apr : 293 Max. :3025 Max. :50.000
## (Other): 571
## pdays previous poutcome y
## Min. : -1.00 Min. : 0.0000 failure: 490 no :4000
## 1st Qu.: -1.00 1st Qu.: 0.0000 other : 197 yes: 521
## Median : -1.00 Median : 0.0000 success: 129
## Mean : 39.77 Mean : 0.5426 unknown:3705
## 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :871.00 Max. :25.0000
##
str(Port_Bank) # See Task 4 for answers
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
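The exploration questions above (quartiles of age, group means, the relationship between age and balance) can be sketched on a small scale. The block below is only an illustration: `bank_head` is a hypothetical data frame holding three columns of the six rows printed by head(Port_Bank), so its numbers say nothing about the full sample.

```r
# Hypothetical mini data set: the six rows shown by head(Port_Bank)
bank_head <- data.frame(
  age     = c(30, 33, 35, 30, 59, 35),
  marital = c("married", "married", "single", "married", "married", "single"),
  balance = c(1787, 4789, 1350, 1476, 0, 747)
)

# Quartiles of age (the same statistics summary() reports)
age_quartiles <- quantile(bank_head$age, probs = c(0.25, 0.5, 0.75))

# Mean balance within each marital status
mean_balance_by_marital <- tapply(bank_head$balance, bank_head$marital, mean)

# Pearson correlation between age and balance
age_balance_cor <- cor(bank_head$age, bank_head$balance)

print(age_quartiles)
print(mean_balance_by_marital)
print(age_balance_cor)
```

On the real data, the same three calls would be applied to Port_Bank instead of `bank_head`.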
Task 2. Data wrangling (Perform some basic transformations,
to include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example summing two columns together)).
names(Port_Bank)[names(Port_Bank)== 'y'] <- 'term_deposit' # I renamed y categorical variable as term_deposit.
head(Port_Bank) #View the new name
## age job marital education default balance housing loan contact
## 1 30 unemployed married primary no 1787 no no cellular
## 2 33 services married secondary no 4789 yes yes cellular
## 3 35 management single tertiary no 1350 yes no cellular
## 4 30 management married tertiary no 1476 yes yes unknown
## 5 59 blue-collar married secondary no 0 yes no unknown
## 6 35 management single tertiary no 747 no no cellular
## day month duration campaign pdays previous poutcome term_deposit
## 1 19 oct 79 1 -1 0 unknown no
## 2 11 may 220 1 339 4 failure no
## 3 16 apr 185 1 330 1 failure no
## 4 3 jun 199 4 -1 0 unknown no
## 5 5 may 226 1 -1 0 unknown no
## 6 23 feb 141 2 176 3 failure no
Port_Bank$term_deposit <- ifelse(Port_Bank$term_deposit=="yes",1,0)
str(Port_Bank) #View the new number
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education : Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ term_deposit: num 0 0 0 0 0 0 0 0 0 0 ...
# I renamed the categorical variable y as term_deposit for the purpose of analysis.
# It is a factor with two levels, “yes” and “no”.
# I also converted the term_deposit factor values to numeric binary values, with yes = 1 and no = 0.
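A quick way to confirm that a rename and recode like this one did not lose or flip any values is to compare counts before and after. The sketch below uses a hypothetical three-row data frame `toy` (not the bank data) and `as.integer()` on a logical comparison, which is an alternative to ifelse() that gives the same 0/1 coding.

```r
# Hypothetical toy data standing in for Port_Bank
toy <- data.frame(age = c(30, 33, 20), y = factor(c("no", "no", "yes")))

# Rename the outcome column
names(toy)[names(toy) == "y"] <- "term_deposit"

# Counts per level before recoding
before <- table(toy$term_deposit)

# Recode: "yes" becomes 1, everything else becomes 0
toy$term_deposit <- as.integer(toy$term_deposit == "yes")

# Counts per value after recoding
after <- table(toy$term_deposit)

# The number of "yes" rows must equal the number of 1s
stopifnot(before[["yes"]] == after[["1"]])
print(toy)
```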
# Creating a subset of the data:
set.seed(101)
train.size <- 0.7 # Each row is assigned to the training set with probability 0.7, giving a train subset of roughly 70% of the data.
Port_train <- runif(nrow(Port_Bank))< train.size
Bank_train <- Port_Bank[Port_train, ]
Bank_test <- Port_Bank[!Port_train, ]
head(Bank_train) #Viewing the new dataframe for Bank_train
## age job marital education default balance housing loan
## 1 30 unemployed married primary no 1787 no no
## 2 33 services married secondary no 4789 yes yes
## 4 30 management married tertiary no 1476 yes yes
## 5 59 blue-collar married secondary no 0 yes no
## 6 35 management single tertiary no 747 no no
## 7 36 self-employed married tertiary no 307 yes no
## contact day month duration campaign pdays previous poutcome
## 1 cellular 19 oct 79 1 -1 0 unknown
## 2 cellular 11 may 220 1 339 4 failure
## 4 unknown 3 jun 199 4 -1 0 unknown
## 5 unknown 5 may 226 1 -1 0 unknown
## 6 cellular 23 feb 141 2 176 3 failure
## 7 cellular 14 may 341 1 330 2 other
## term_deposit
## 1 0
## 2 0
## 4 0
## 5 0
## 6 0
## 7 0
head(Bank_test) #Viewing the new dataframe for Bank_test
## age job marital education default balance housing loan contact
## 3 35 management single tertiary no 1350 yes no cellular
## 11 39 services married secondary no 9374 yes no unknown
## 12 43 admin. married secondary no 264 yes no cellular
## 13 36 technician married tertiary no 1109 no no cellular
## 14 20 student single secondary no 502 no no cellular
## 17 56 technician married secondary no 4073 no no cellular
## day month duration campaign pdays previous poutcome term_deposit
## 3 16 apr 185 1 330 1 failure 0
## 11 20 may 273 1 -1 0 unknown 0
## 12 17 apr 113 2 -1 0 unknown 0
## 13 13 aug 328 2 -1 0 unknown 0
## 14 30 apr 261 1 -1 0 unknown 1
## 17 27 aug 239 5 -1 0 unknown 0
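Because the split above draws one uniform number per row, the training share is only approximately 70% (3,146 of 4,521 rows here, about 69.6%). If an exact 70/30 split is wanted, one option is to sample row indices instead. This is a sketch of that alternative, not the method used in this report; `train_idx` and `test_idx` are names introduced for illustration.

```r
set.seed(101)

n <- 4521          # number of rows in the bank sample
train_size <- 0.7  # desired training share

# Draw exactly floor(0.7 * n) distinct row numbers for training
train_idx <- sample(seq_len(n), size = floor(train_size * n))

# Every remaining row goes to the test set
test_idx <- setdiff(seq_len(n), train_idx)

print(length(train_idx))
print(length(test_idx))
```

With a data frame loaded, the subsets would then be taken as Port_Bank[train_idx, ] and Port_Bank[test_idx, ].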
Task 3. Graphics (Please make sure to display at least one scatter plot, box plot and histogram.
Don’t be limited to this.
Please explore the many other options in R packages such as ggplot2).
Visualization:
library(ggplot2)
# Histogram using age variable
hist(Port_Bank$age, main = "Age Distribution of Portuguese Bank Term Deposit Campaign", xlab = "Age", ylab = "Frequency", col = "pink") # The histogram shows that the most common ages in the sample are between 30 and 35.
# Scatter plot using age and account balance variables
ggplot(Port_Bank, aes(x = age, y = balance))+
geom_point() # Scatter Plot
ggplot(Port_Bank,aes(age,balance))+ geom_bin2d(bins = 120) # Improved scatter plot with 2d bins
# Box plot
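# Just exploring here by plotting a box plot using a factor variable (month)
# levels = puts the months in calendar order without renaming them;
# labels = would rename the alphabetically sorted levels (apr, aug, dec, ...) and mislabel the data.
Port_Bank$month <- factor(Port_Bank$month,
levels = c("jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"))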
ggplot(Port_Bank,aes(x = month, y = balance))+
geom_boxplot()+ scale_x_discrete(name = "Month")+
scale_y_continuous(name = "Average Yearly Balance")
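The points a box plot draws as outliers can also be listed numerically with boxplot.stats(), which applies the same 1.5 x IQR rule. The sketch below uses a hypothetical vector `balance_demo`: the six balances printed by head(Port_Bank) plus the maximum balance reported by summary(Port_Bank). It is not the full balance column.

```r
# Illustrative balances: six head() rows plus the reported maximum
balance_demo <- c(1787, 4789, 1350, 1476, 0, 747, 71188)

# boxplot.stats() returns the five-number summary and the flagged outliers
stats <- boxplot.stats(balance_demo)
outliers <- stats$out

print(stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(outliers)     # values beyond 1.5 * IQR from the hinges
```

On the real data, applying the same call within each month (for example through tapply()) would list the outlying balances per month.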
Task 4. Meaningful question for analysis (Please state at the beginning a meaningful question for analysis.
Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose.
Please write a brief conclusion paragraph in R markdown at the end.).
Data Exploration: There are 10 factor variables and 7 integer variables in this dataset. Please see the Conclusion below for details.
Data Wrangling: I renamed the categorical variable y as term_deposit for the purpose of analysis. It is a factor with two levels, “yes” and “no”. I also converted the term_deposit factor values to numeric binary values, with yes = 1 and no = 0.
Building a Logistic Regression Model to Predict term deposit. I will use the train dataset, which is the roughly 70% sample created earlier.
# Bank_train: about 70% of the Port_Bank data frame
# Bank_test: about 30% of the Port_Bank data frame
Port_Bank_logit <- glm(term_deposit ~., data = Bank_train, family = binomial(), maxit = 100)
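# The AIC of this model is 1524.9. AIC has no absolute scale, so on its own it does not say whether the fit is good.
# It is useful for comparing models fitted to the same data, where the smaller AIC is better.
summary(Port_Bank_logit)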
##
## Call:
## glm(formula = term_deposit ~ ., family = binomial(), data = Bank_train,
## maxit = 100)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6957 -0.3742 -0.2395 -0.1407 3.1441
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.304e+00 7.448e-01 -3.093 0.00198 **
## age -7.800e-03 8.869e-03 -0.879 0.37914
## jobblue-collar -6.222e-01 3.068e-01 -2.028 0.04254 *
## jobentrepreneur -2.095e-01 4.914e-01 -0.426 0.66989
## jobhousemaid 6.294e-02 4.675e-01 0.135 0.89290
## jobmanagement 6.870e-04 3.009e-01 0.002 0.99818
## jobretired 8.641e-01 3.800e-01 2.274 0.02296 *
## jobself-employed -3.514e-01 4.455e-01 -0.789 0.43031
## jobservices -2.854e-01 3.448e-01 -0.828 0.40788
## jobstudent 8.543e-02 5.159e-01 0.166 0.86847
## jobtechnician -3.427e-01 2.930e-01 -1.170 0.24213
## jobunemployed -7.978e-01 5.047e-01 -1.581 0.11393
## jobunknown 8.720e-01 7.208e-01 1.210 0.22636
## maritalmarried -2.887e-01 2.162e-01 -1.336 0.18166
## maritalsingle -2.162e-01 2.533e-01 -0.854 0.39332
## educationsecondary -8.938e-03 2.467e-01 -0.036 0.97110
## educationtertiary 2.459e-01 2.867e-01 0.858 0.39099
## educationunknown -6.552e-01 4.763e-01 -1.376 0.16895
## defaultyes 6.288e-01 4.679e-01 1.344 0.17900
## balance -6.330e-06 1.991e-05 -0.318 0.75055
## housingyes -1.902e-01 1.711e-01 -1.111 0.26641
## loanyes -4.789e-01 2.383e-01 -2.010 0.04448 *
## contacttelephone 5.375e-02 2.723e-01 0.197 0.84350
## contactunknown -1.558e+00 2.913e-01 -5.350 8.81e-08 ***
## day 7.922e-03 1.006e-02 0.788 0.43089
## monthaug -2.754e-01 3.003e-01 -0.917 0.35919
## monthdec -1.459e+00 1.131e+00 -1.290 0.19704
## monthfeb 2.811e-01 3.413e-01 0.824 0.41018
## monthjan -1.407e+00 5.260e-01 -2.676 0.00746 **
## monthjul -7.890e-01 3.062e-01 -2.577 0.00997 **
## monthjun 5.619e-01 3.619e-01 1.552 0.12056
## monthmar 1.462e+00 4.538e-01 3.222 0.00127 **
## monthmay -6.204e-01 2.890e-01 -2.147 0.03183 *
## monthnov -1.156e+00 3.600e-01 -3.210 0.00133 **
## monthoct 1.272e+00 4.373e-01 2.908 0.00364 **
## monthsep 8.196e-01 4.893e-01 1.675 0.09390 .
## duration 4.170e-03 2.421e-04 17.226 < 2e-16 ***
## campaign -8.963e-02 3.611e-02 -2.482 0.01306 *
## pdays -3.485e-04 1.316e-03 -0.265 0.79117
## previous -2.118e-02 4.679e-02 -0.453 0.65081
## poutcomeother 7.728e-01 3.324e-01 2.325 0.02006 *
## poutcomesuccess 2.385e+00 3.402e-01 7.011 2.36e-12 ***
## poutcomeunknown -1.950e-03 4.075e-01 -0.005 0.99618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2163.2 on 3145 degrees of freedom
## Residual deviance: 1438.9 on 3103 degrees of freedom
## AIC: 1524.9
##
## Number of Fisher Scoring iterations: 6
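The AIC in this output can be reproduced by hand, which helps show what it measures. For a logistic regression on a 0/1 response, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below only uses numbers printed in the summary above; the variable names are introduced for illustration.

```r
# Values copied from the summary(Port_Bank_logit) output
residual_deviance <- 1438.9
null_df <- 3145      # n - 1
residual_df <- 3103  # n - number of coefficients

# Number of estimated coefficients, including the intercept
n_coef <- null_df - residual_df + 1

# AIC = residual deviance + 2 * number of coefficients
aic <- residual_deviance + 2 * n_coef

print(n_coef)
print(aic)
```

The penalty term 2 * n_coef is why a model with more predictors needs a larger drop in deviance to earn a lower AIC.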
# Test of variable significance using chi-square
anova(Port_Bank_logit, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: term_deposit
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 3145 2163.2
## age 1 10.46 3144 2152.8 0.001217 **
## job 11 48.36 3133 2104.4 1.231e-06 ***
## marital 2 8.61 3131 2095.8 0.013514 *
## education 3 7.96 3128 2087.8 0.046802 *
## default 1 0.29 3127 2087.6 0.592701
## balance 1 0.00 3126 2087.6 0.952796
## housing 1 16.76 3125 2070.8 4.245e-05 ***
## loan 1 8.86 3124 2061.9 0.002916 **
## contact 2 47.10 3122 2014.8 5.909e-11 ***
## day 1 5.02 3121 2009.8 0.025063 *
## month 11 87.72 3110 1922.1 4.657e-14 ***
## duration 1 400.56 3109 1521.5 < 2.2e-16 ***
## campaign 1 7.98 3108 1513.6 0.004721 **
## pdays 1 7.56 3107 1506.0 0.005959 **
## previous 1 0.65 3106 1505.3 0.421488
## poutcome 3 66.48 3103 1438.9 2.414e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
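Each row of the table above tests the drop in deviance from adding one term, after the terms listed before it, against a chi-square distribution with that row's degrees of freedom. The sketch below recomputes the p-values from the printed table as a check. Because the printed deviances are rounded to two decimals, the results agree with the table only approximately.

```r
# Deviance drops and degrees of freedom copied from the anova() table
deviance_drop <- c(age = 10.46, job = 48.36, marital = 8.61, education = 7.96,
                   default = 0.29, balance = 0.00, housing = 16.76, loan = 8.86,
                   contact = 47.10, day = 5.02, month = 87.72, duration = 400.56,
                   campaign = 7.98, pdays = 7.56, previous = 0.65, poutcome = 66.48)
df <- c(age = 1, job = 11, marital = 2, education = 3, default = 1, balance = 1,
        housing = 1, loan = 1, contact = 2, day = 1, month = 11, duration = 1,
        campaign = 1, pdays = 1, previous = 1, poutcome = 3)

# Upper-tail chi-square probability for each deviance drop
p_values <- pchisq(deviance_drop, df = df, lower.tail = FALSE)

# The drops should add up to null deviance minus residual deviance
null_deviance <- 2163.2
residual_from_table <- null_deviance - sum(deviance_drop)

print(signif(p_values, 3))
print(residual_from_table)
```

Since the terms are added sequentially, reordering the predictors in the formula can change these p-values.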
# Predicting the term deposit
Predict_term <- predict(Port_Bank_logit,type = "response")
table(Bank_train$term_deposit,Predict_term > 0.5)
##
## FALSE TRUE
## 0 2744 60
## 1 226 116
# Using the term_deposit variable from the train dataset to generate a confusion matrix.
# On the training data, the model correctly classified 2,744 true negatives (TN) and 116 true positives (TP); it also produced 60 false positives and 226 false negatives.
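The four cells of the confusion matrix can be turned into the usual summary rates. The sketch below does the arithmetic for the training table above. The variable names are introduced for illustration, and the baseline is the accuracy of always predicting "no".

```r
# Counts copied from the training confusion matrix
tn <- 2744  # actual 0, predicted FALSE
fp <- 60    # actual 0, predicted TRUE
fn <- 226   # actual 1, predicted FALSE
tp <- 116   # actual 1, predicted TRUE

total <- tn + fp + fn + tp

accuracy    <- (tp + tn) / total  # share of all rows classified correctly
sensitivity <- tp / (tp + fn)     # share of actual subscribers that were found
specificity <- tn / (tn + fp)     # share of non-subscribers correctly rejected
precision   <- tp / (tp + fp)     # share of predicted subscribers that are real
baseline    <- (tn + fp) / total  # accuracy of always predicting 0

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision,
              baseline = baseline), 3))
```

Accuracy is about 90.9% against a baseline of about 89.1%, and only about a third of the actual subscribers (116 of 342) are identified at the 0.5 cutoff, so accuracy alone overstates how well the model finds subscribers.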
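I want to validate the model using the test data. Note that the call below fits a second logistic regression on Bank_test rather than scoring Bank_test with Port_Bank_logit, so its results describe how well the same model specification fits the test sample, not out-of-sample predictions.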
Port_Bank_Val <- glm(term_deposit ~., data = Bank_test, family = binomial(), maxit = 100)
summary(Port_Bank_Val)
##
## Call:
## glm(formula = term_deposit ~ ., family = binomial(), data = Bank_test,
## maxit = 100)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3762 -0.3931 -0.2570 -0.1526 2.8618
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.722e+00 1.104e+00 -2.465 0.013690 *
## age -2.914e-03 1.289e-02 -0.226 0.821077
## jobblue-collar 1.668e-01 4.181e-01 0.399 0.689953
## jobentrepreneur -2.461e-01 6.469e-01 -0.380 0.703605
## jobhousemaid -2.009e+00 1.208e+00 -1.663 0.096405 .
## jobmanagement -1.502e-01 4.275e-01 -0.351 0.725350
## jobretired 1.226e-01 6.239e-01 0.196 0.844264
## jobself-employed 2.098e-01 6.022e-01 0.348 0.727619
## jobservices 3.186e-01 4.792e-01 0.665 0.506160
## jobstudent 7.205e-01 6.193e-01 1.163 0.244689
## jobtechnician 1.658e-01 3.979e-01 0.417 0.676838
## jobunemployed -8.070e-02 7.851e-01 -0.103 0.918127
## jobunknown -1.883e-01 1.065e+00 -0.177 0.859704
## maritalmarried -8.800e-01 3.151e-01 -2.793 0.005222 **
## maritalsingle -4.868e-01 3.683e-01 -1.322 0.186221
## educationsecondary 2.056e-01 3.729e-01 0.551 0.581314
## educationtertiary 4.295e-01 4.298e-01 0.999 0.317613
## educationunknown -1.642e-01 5.984e-01 -0.274 0.783820
## defaultyes 9.417e-01 1.244e+00 0.757 0.449229
## balance 2.715e-05 3.851e-05 0.705 0.480867
## housingyes -4.705e-01 2.544e-01 -1.849 0.064389 .
## loanyes -9.185e-01 3.838e-01 -2.393 0.016708 *
## contacttelephone -1.820e-01 4.630e-01 -0.393 0.694223
## contactunknown -1.302e+00 3.816e-01 -3.412 0.000646 ***
## day 3.765e-02 1.498e-02 2.513 0.011957 *
## monthaug -5.603e-01 4.720e-01 -1.187 0.235215
## monthdec 1.365e+00 1.156e+00 1.181 0.237719
## monthfeb -3.142e-01 6.572e-01 -0.478 0.632549
## monthjan -1.153e+00 6.125e-01 -1.883 0.059713 .
## monthjul -7.383e-01 4.482e-01 -1.647 0.099518 .
## monthjun 6.291e-01 5.488e-01 1.146 0.251691
## monthmar 1.651e+00 8.210e-01 2.010 0.044394 *
## monthmay -2.663e-01 4.189e-01 -0.636 0.524894
## monthnov -5.131e-01 4.540e-01 -1.130 0.258418
## monthoct 1.405e+00 5.495e-01 2.556 0.010576 *
## monthsep 1.475e-01 8.253e-01 0.179 0.858184
## duration 4.592e-03 3.925e-04 11.700 < 2e-16 ***
## campaign -3.839e-02 4.752e-02 -0.808 0.419151
## pdays 3.228e-04 1.624e-03 0.199 0.842475
## previous 3.608e-02 8.040e-02 0.449 0.653595
## poutcomeother 5.338e-02 5.200e-01 0.103 0.918230
## poutcomesuccess 3.109e+00 6.017e-01 5.168 2.37e-07 ***
## poutcomeunknown -3.609e-01 5.593e-01 -0.645 0.518768
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1063.51 on 1374 degrees of freedom
## Residual deviance: 680.35 on 1332 degrees of freedom
## AIC: 766.35
##
## Number of Fisher Scoring iterations: 6
# Predicting the term deposit in the test data
Predict_term_Val <- predict(Port_Bank_Val,type = "response")
table(Bank_test$term_deposit,Predict_term_Val > 0.5)
##
## FALSE TRUE
## 0 1168 28
## 1 104 75
# Using the term_deposit variable from the test dataset to generate a confusion matrix.
# The model fitted on the test data correctly classified 1,168 true negatives (TN) and 75 true positives (TP), with 28 false positives and 104 false negatives.
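An out-of-sample check keeps the model fitted on the training rows and passes the held-out rows to predict() through its newdata argument. The sketch below shows the pattern on simulated data, since the bank file is not bundled here: `sim`, its single predictor and the coefficients used to simulate it are assumptions made for illustration only.

```r
set.seed(101)

# Simulated stand-in for the bank data: one predictor and a 0/1 outcome
n <- 500
sim <- data.frame(duration = rexp(n, rate = 1 / 260))
sim$term_deposit <- rbinom(n, size = 1,
                           prob = plogis(-3 + 0.004 * sim$duration))

# Same style of split as in the report
in_train <- runif(n) < 0.7
sim_train <- sim[in_train, ]
sim_test  <- sim[!in_train, ]

# Fit on the training rows only
fit <- glm(term_deposit ~ duration, data = sim_train, family = binomial())

# Score the held-out rows with the training-set model
test_prob <- predict(fit, newdata = sim_test, type = "response")

# Confusion matrix on data the model has not seen
confusion <- table(actual = sim_test$term_deposit, predicted = test_prob > 0.5)
print(confusion)
```

Applied to this report, the corresponding call would be predict(Port_Bank_logit, newdata = Bank_test, type = "response").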
BONUS –place the original .csv in a github file and have R read from the link.
This will be a very useful skill as you progress in your data science education and career.
library(RCurl) # Loading the RCurl package will enable me to read the csv file using the link from my Github
## Loading required package: bitops
Port_Bank <- read.csv(text = getURL("https://raw.githubusercontent.com/Emahayz/MSDS_R_Class/master/bank.csv"), header = TRUE, sep = ",")
head(Port_Bank) # The original bank.csv file is successfully read.
## age job marital education default balance housing loan contact
## 1 30 unemployed married primary no 1787 no no cellular
## 2 33 services married secondary no 4789 yes yes cellular
## 3 35 management single tertiary no 1350 yes no cellular
## 4 30 management married tertiary no 1476 yes yes unknown
## 5 59 blue-collar married secondary no 0 yes no unknown
## 6 35 management single tertiary no 747 no no cellular
## day month duration campaign pdays previous poutcome y
## 1 19 oct 79 1 -1 0 unknown no
## 2 11 may 220 1 339 4 failure no
## 3 16 apr 185 1 330 1 failure no
## 4 3 jun 199 4 -1 0 unknown no
## 5 5 may 226 1 -1 0 unknown no
## 6 23 feb 141 2 176 3 failure no
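read.csv() also accepts a URL directly as its file argument, so RCurl is optional for a public raw GitHub link. The sketch below wraps the call in a hypothetical helper, `read_bank()`, and demonstrates it on a temporary local file so that it runs offline. Setting stringsAsFactors = TRUE is an assumption made to reproduce the Factor columns shown by str() above, because R 4.0 and later read text columns as character by default.

```r
# Hypothetical helper: `source` may be a local path or a URL
read_bank <- function(source) {
  read.csv(source, header = TRUE, sep = ",", stringsAsFactors = TRUE)
}

# Offline demonstration with a small temporary CSV file
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,job,y",
             "30,unemployed,no",
             "33,services,no"), tmp)

demo <- read_bank(tmp)
str(demo)

unlink(tmp)  # remove the temporary file
```

With network access, the same helper could be called with the raw GitHub URL used above in place of `tmp`.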
Conclusion
Data Exploration:
There are 10 factor variables and 7 integer variables in this dataset. From the summary statistics, the average age of this population is about 41 years and the median age is 39; the lower and upper quartiles are 33 and 49 respectively.
The histogram shows that the most common ages are between 30 and 35. Management is the largest job category, with 969 people, while 230 people are retired. There are 2,797 married people, 1,196 single people and 528 divorced people.
About half of the population (2,306) has secondary education, while 1,350 have tertiary education. Only 691 people have an existing loan with the bank, while most (3,830) do not. Separately, 76 people have credit in default; the summary does not show how many of them are among the 691 borrowers.
About 2,559 people have a housing loan (housing = yes), and 521 people subscribed to a term deposit (y = yes). The scatter plots show that the average yearly balance does not increase with age, and that a number of clients older than 30 had a negative average yearly balance. However, there are outliers at about 42 and 60 years of age, with average yearly balances of over €40,000 and €70,000 respectively. The box plot shows that the outliers occurred in the months of February and November.
Data Wrangling:
I renamed the categorical variable y as term_deposit for the purpose of analysis. It is a factor with two levels, “yes” and “no”. I also converted the term_deposit factor values to numeric binary values, with yes = 1 and no = 0.
Test of variable Significance:
The chi-square test (a sequential analysis of deviance, in which each variable is tested after the ones listed before it) shows that the following variables are strongly significant (p < 0.001) for predicting term deposit: job, housing, type of contact used (cellular, telephone or unknown), the month of the campaign, duration (the length of the last contact, in seconds) and poutcome (the outcome of the previous marketing campaign).
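Significance says which variables matter, not how large their effects are. One common way to read effect sizes from a logistic regression is to exponentiate the coefficients into odds ratios. The sketch below does this for four coefficients copied from the summary(Port_Bank_logit) output; the choice of these four and the per-minute rescaling of duration are for illustration only.

```r
# Coefficients copied from summary(Port_Bank_logit)
coefs <- c(duration        = 4.170e-03,
           poutcomesuccess = 2.385,
           contactunknown  = -1.558,
           loanyes         = -0.4789)

# Odds ratio for a one-unit increase in each predictor
odds_ratio <- exp(coefs)

# duration is measured per second; rescale to the effect of one extra minute
duration_per_minute <- exp(60 * coefs[["duration"]])

print(round(odds_ratio, 3))
print(round(duration_per_minute, 3))
```

An odds ratio above 1 raises the odds of a term deposit and one below 1 lowers them, holding the other predictors fixed. The values for the categorical terms are relative to each factor's reference level (for poutcome, "failure"; for contact, "cellular").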