DATA606 Final Project

Part 1 - Abstract

The data in the project is related to telephone-based direct marketing activities of a Portuguese banking institution.The marketing campaigns were based on phone calls. In order to determine if the product (bank term deposit) would be subscribed to with (‘yes’) or not (‘no’), it was frequently necessary to make more than one contact with the same client.

Part 1a - Introduction

A term deposit is one that a bank or other financial institution offers with the fixed rate (generally preferable to just opening a deposit account) that guarantees the return of your funds at a certain maturity date. The reseacrh question is if the client will subscribe to a deposit from a Portuguese banking institution or not.

Part 2 - Data

The multivariate dataset contains 16 features and 1 target variable (17 attributes) with 45211 numbers of instances. The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. This dataset is public available for research. The details are described in [Moro et al., 2014].

Below is the list of libraries used:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.1     ✔ rsample      1.1.0
## ✔ dials        1.1.0     ✔ tune         1.0.1
## ✔ infer        1.0.3     ✔ workflows    1.1.2
## ✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
## ✔ parsnip      1.0.3     ✔ yardstick    1.1.0
## ✔ recipes      1.0.3     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.

library(tidyr)
library(skimr)
library(reshape)

## 
## Attaching package: 'reshape'
## 
## The following object is masked from 'package:dplyr':
## 
##     rename
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths

library(dplyr)
library(openintro)

## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
## 
## Attaching package: 'openintro'
## 
## The following object is masked from 'package:reshape':
## 
##     tips
## 
## The following object is masked from 'package:modeldata':
## 
##     ames

library(infer)
library(foreign)
library(statsr)

## Loading required package: BayesFactor
## Loading required package: coda
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following object is masked from 'package:reshape':
## 
##     expand
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## ************
## Welcome to BayesFactor 0.9.12-4.4. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
## 
## Type BFManual() to open the manual.
## ************
## 
## Attaching package: 'statsr'
## 
## The following objects are masked from 'package:openintro':
## 
##     calc_streak, evals, nycflights, present
## 
## The following object is masked from 'package:modeldata':
## 
##     ames
## 
## The following object is masked from 'package:infer':
## 
##     rep_sample_n

library(airports)
library(cherryblossom)
library(usdata)
library(describer)
library(BayesFactor)
library(coda)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:openintro':
## 
##     densityPlot
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(psych)

## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:car':
## 
##     logit
## 
## The following object is masked from 'package:describer':
## 
##     describe
## 
## The following objects are masked from 'package:scales':
## 
##     alpha, rescale
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(carData)
library(Matrix)
library(heatmaply)

## Loading required package: plotly
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:reshape':
## 
##     rename
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
## 
## Loading required package: viridis
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## 
## The following object is masked from 'package:scales':
## 
##     viridis_pal
## 
## 
## ======================
## Welcome to heatmaply version 1.4.0
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags: 
##   https://stackoverflow.com/questions/tagged/heatmaply
## ======================

library(viridis)
library(viridisLite)
library(corrplot)

## corrplot 0.92 loaded

library(shape)

## 
## Attaching package: 'shape'
## 
## The following object is masked from 'package:corrplot':
## 
##     colorlegend

library(ggplot2)

rm(list = ls())
getwd()

## [1] "C:/Users/Ivan/OneDrive/Desktop"

theme_set(theme_light())

The.CSV file can be easily obtained because I’ve kept it on github.

##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no

bank %>%
  summarize(across(.cols = everything(),
                   ~sum(is.na(.x))))

##   age job marital education default balance housing loan contact day month
## 1   0   0       0         0       0       0       0    0       0   0     0
##   duration campaign pdays previous poutcome y
## 1        0        0     0        0        0 0

bank <- bank %>%
  filter(!is.na(education))

sum(is.na(bank))

## [1] 0

I will convert jobs to a factor for prediction purposes

bank$job <- as_factor(bank$job)

Data visualizations

The variables that are being predicted are housing and jobs

ggplot(bank, aes(x=housing, y=job)) +
  geom_point()

More variables that are being predicted are age and loans

bank2 <- bank %>% 
  mutate(age = ifelse(bank$marital > 21, "yes", "no")) %>%
  na.exclude()
ggplot(bank2, aes(x=loan, y=age)) + geom_boxplot() + theme_bw()

### Part 3 - Exploratory data analysis

Here is the matrix

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

bank %>%
  ggscatmat(columns = 3:16, color="housing",alpha = 0.7)

## Warning in ggscatmat(., columns = 3:16, color = "housing", alpha = 0.7): Factor
## variables are omitted in plot

## Warning: The dot-dot notation (`..scaled..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(scaled)` instead.
## ℹ The deprecated feature was likely used in the GGally package.
##   Please report the issue at <]8;;https://github.com/ggobi/ggally/issueshttps://github.com/ggobi/ggally/issues]8;;>.

Let’s plot the jobs per their average balance. As expected people who are retired have the highest balance.

bank2 %>%
  group_by(job) %>%
  summarise(avg_balance = mean(balance)) %>%
  arrange(desc(avg_balance)) %>%
  na.omit() %>%
  slice(1:25) %>%
  ggplot(aes(x = reorder(job, avg_balance),avg_balance)) +
  geom_bar(stat="identity") +
  coord_flip()

bank2 %>%
  group_by(job) %>%
  summarise(avg_balance = mean(balance)) %>%
  arrange(desc(avg_balance)) %>%
  na.omit() %>%
  slice(1:25) %>%
  ggplot(aes(x = reorder(job, -avg_balance),avg_balance)) +
  geom_bar(stat="identity") +
  coord_flip()

bank3 <- bank[, c(1,2,4,6,8,12,17)]
str(bank3)

## 'data.frame':    45211 obs. of  7 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "management","technician",..: 1 2 3 4 5 1 1 3 6 2 ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ y        : chr  "no" "no" "no" "no" ...

head(bank3)

##   age          job education balance loan duration  y
## 1  58   management  tertiary    2143   no      261 no
## 2  44   technician secondary      29   no      151 no
## 3  33 entrepreneur secondary       2  yes       76 no
## 4  47  blue-collar   unknown    1506   no       92 no
## 5  33      unknown   unknown       1   no      198 no
## 6  35   management  tertiary     231   no      139 no

summary(bank3$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   33.00   39.00   40.94   48.00   95.00

describe(bank3$age)

##    vars     n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 45211 40.94 10.62     39   40.25 10.38  18  95    77 0.68     0.32 0.05

summary(bank3$duration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   103.0   180.0   258.2   319.0  4918.0

describe(bank3$duration)

##    vars     n   mean     sd median trimmed    mad min  max range skew kurtosis
## X1    1 45211 258.16 257.53    180  210.87 137.88   0 4918  4918 3.14    18.15
##      se
## X1 1.21

round(prop.table(table(bank3$education))*100,digit = 1)

## 
##   primary secondary  tertiary   unknown 
##      15.2      51.3      29.4       4.1

boxplot(bank3$age)

ggplot(data = bank3, aes(bank3$age))+geom_histogram(binwidth = 2, position="identity", alpha=0.5)+
  labs(title="Age of customer",x="Age", y = "balance")

## Warning: Use of `bank3$age` is discouraged.
## ℹ Use `age` instead.

Part 4 - Inference

H0: There is no association between balance and the variables of Age.

HA: There is an association with at least one of the variables of Age.

The Normal Probability Plot

The normal probability plot shows that the balance and age are not normally distributed as can be seen that the data points does not closely follow the line.

qqnorm(bank3$age)
qqline(bank3$age)

I will use the Linear Regression to address my research question.

bank3$education <- ifelse(bank3$education == "Y", 1, 0)
str(bank3)

## 'data.frame':    45211 obs. of  7 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "management","technician",..: 1 2 3 4 5 1 1 3 6 2 ...
##  $ education: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ y        : chr  "no" "no" "no" "no" ...

The Linear Regression Model

bank3Model <- lm(duration~ age + balance, data = bank3)
summary(bank3Model)

## 
## Call:
## lm(formula = duration ~ age + balance, data = bank3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -355.9 -155.0  -77.8   60.3 4653.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.624e+02  4.823e+00  54.397  < 2e-16 ***
## age         -1.654e-01  1.146e-01  -1.444    0.149    
## balance      1.880e-03  3.996e-04   4.705 2.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 257.5 on 45208 degrees of freedom
## Multiple R-squared:  0.0005109,  Adjusted R-squared:  0.0004667 
## F-statistic: 11.56 on 2 and 45208 DF,  p-value: 9.614e-06

The Equation for the Regression is thus:

                                                  y^=262.37471−0.16545∗AGE+0.00188∗balance

R^2 suggesting that the alternative hypothesis is a better explanation for my data.

anova(bank3Model)

## Analysis of Variance Table
## 
## Response: duration
##              Df     Sum Sq Mean Sq F value    Pr(>F)    
## age           1      64788   64788  0.9773    0.3229    
## balance       1    1467200 1467200 22.1332 2.551e-06 ***
## Residuals 45208 2996821166   66290                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA result I should reject the null hypothesis in favor of the alternative hypothesis.

Check of Conditions:

Test for constant variance - The NcvTest

ncvTest(bank3Model)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 31.4982, Df = 1, p = 1.9962e-08

The test shows that constant error of variance does not exist.

Test for linearity - The crPlots

crPlots(bank3Model)

The test indicates that none of the predictors shows linearity.

Test for Independence of error- Autocorrelation- Durbin- Watson Test

durbinWatsonTest(bank3Model)

##  lag Autocorrelation D-W Statistic p-value
##    1      0.04669605      1.906605       0
##  Alternative hypothesis: rho != 0

Predicting balance with Age variables

plot(bank3$balance ~ bank3$duration + bank3$age)

bank3 %>%
  group_by(y) %>%
  summarise(avg_balance = mean(balance)) %>%
  arrange(desc(avg_balance)) %>%
  na.omit() %>%
  slice(1:2) %>%
  ggplot(aes(x = reorder(y, avg_balance),avg_balance)) +
  geom_bar(stat="identity") +
  coord_flip()

Part 5 - Conclusion

The answer to research question if the client will subscribe to a deposit from a Portuguese banking institution is Yes.This finding is importatant because it indicates that direct marketing activities of a Portuguese banking institution are successful and could be implemented by other financial institutions.

References

Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014 https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#