Here we load the packages needed for our analysis, such as ROCR, which we use to compute ROC curves and AUC and to find the optimal cutoff point for our model. We also load the dataset we are working with, bank-additional.csv, which is stored as bank_additional for convenience.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(lattice)
library(ggplot2)
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-5
library(readr)
library(ROCR)
bank_additional <- read_delim("bank-additional.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
## Rows: 4119 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, pdays, previous, emp.var.rate, cons.price...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(bank_additional)
Here we run some exploratory data analysis (EDA).
summary(bank_additional)
## age job marital education
## Min. :18.00 Length:4119 Length:4119 Length:4119
## 1st Qu.:32.00 Class :character Class :character Class :character
## Median :38.00 Mode :character Mode :character Mode :character
## Mean :40.11
## 3rd Qu.:47.00
## Max. :88.00
## default housing loan contact
## Length:4119 Length:4119 Length:4119 Length:4119
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## month day_of_week duration campaign
## Length:4119 Length:4119 Min. : 0.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 103.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 181.0 Median : 2.000
## Mean : 256.8 Mean : 2.537
## 3rd Qu.: 317.0 3rd Qu.: 3.000
## Max. :3643.0 Max. :35.000
## pdays previous poutcome emp.var.rate
## Min. : 0.0 Min. :0.0000 Length:4119 Min. :-3.40000
## 1st Qu.:999.0 1st Qu.:0.0000 Class :character 1st Qu.:-1.80000
## Median :999.0 Median :0.0000 Mode :character Median : 1.10000
## Mean :960.4 Mean :0.1903 Mean : 0.08497
## 3rd Qu.:999.0 3rd Qu.:0.0000 3rd Qu.: 1.40000
## Max. :999.0 Max. :6.0000 Max. : 1.40000
## cons.price.idx cons.conf.idx euribor3m nr.employed
## Min. :92.20 Min. :-50.8 Min. :0.635 Min. :4964
## 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.334 1st Qu.:5099
## Median :93.75 Median :-41.8 Median :4.857 Median :5191
## Mean :93.58 Mean :-40.5 Mean :3.621 Mean :5166
## 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
## Max. :94.77 Max. :-26.9 Max. :5.045 Max. :5228
## y
## Length:4119
## Class :character
## Mode :character
##
##
##
str(bank_additional)
## spc_tbl_ [4,119 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:4119] 30 39 25 38 47 32 32 41 31 35 ...
## $ job : chr [1:4119] "blue-collar" "services" "services" "services" ...
## $ marital : chr [1:4119] "married" "single" "married" "married" ...
## $ education : chr [1:4119] "basic.9y" "high.school" "high.school" "basic.9y" ...
## $ default : chr [1:4119] "no" "no" "no" "no" ...
## $ housing : chr [1:4119] "yes" "no" "yes" "unknown" ...
## $ loan : chr [1:4119] "no" "no" "no" "unknown" ...
## $ contact : chr [1:4119] "cellular" "telephone" "telephone" "telephone" ...
## $ month : chr [1:4119] "may" "may" "jun" "jun" ...
## $ day_of_week : chr [1:4119] "fri" "fri" "wed" "fri" ...
## $ duration : num [1:4119] 487 346 227 17 58 128 290 44 68 170 ...
## $ campaign : num [1:4119] 2 4 1 3 1 3 4 2 1 1 ...
## $ pdays : num [1:4119] 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : num [1:4119] 0 0 0 0 0 2 0 0 1 0 ...
## $ poutcome : chr [1:4119] "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num [1:4119] -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
## $ cons.price.idx: num [1:4119] 92.9 94 94.5 94.5 93.2 ...
## $ cons.conf.idx : num [1:4119] -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
## $ euribor3m : num [1:4119] 1.31 4.86 4.96 4.96 4.19 ...
## $ nr.employed : num [1:4119] 5099 5191 5228 5228 5196 ...
## $ y : chr [1:4119] "no" "no" "no" "no" ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. job = col_character(),
## .. marital = col_character(),
## .. education = col_character(),
## .. default = col_character(),
## .. housing = col_character(),
## .. loan = col_character(),
## .. contact = col_character(),
## .. month = col_character(),
## .. day_of_week = col_character(),
## .. duration = col_double(),
## .. campaign = col_double(),
## .. pdays = col_double(),
## .. previous = col_double(),
## .. poutcome = col_character(),
## .. emp.var.rate = col_double(),
## .. cons.price.idx = col_double(),
## .. cons.conf.idx = col_double(),
## .. euribor3m = col_double(),
## .. nr.employed = col_double(),
## .. y = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
We can see that none of the character variables were loaded as factors, i.e. they have no levels. To address this I will convert each relevant variable explicitly, but first I check what the levels of each will be and how they will look.
levels(as.factor(bank_additional$y))
## [1] "no" "yes"
levels(as.factor(bank_additional$job))
## [1] "admin." "blue-collar" "entrepreneur" "housemaid"
## [5] "management" "retired" "self-employed" "services"
## [9] "student" "technician" "unemployed" "unknown"
levels(as.factor(bank_additional$marital))
## [1] "divorced" "married" "single" "unknown"
levels(as.factor(bank_additional$education))
## [1] "basic.4y" "basic.6y" "basic.9y"
## [4] "high.school" "illiterate" "professional.course"
## [7] "university.degree" "unknown"
levels(as.factor(bank_additional$default))
## [1] "no" "unknown" "yes"
levels(as.factor(bank_additional$housing))
## [1] "no" "unknown" "yes"
levels(as.factor(bank_additional$loan))
## [1] "no" "unknown" "yes"
levels(as.factor(bank_additional$contact))
## [1] "cellular" "telephone"
levels(as.factor(bank_additional$month))
## [1] "apr" "aug" "dec" "jul" "jun" "mar" "may" "nov" "oct" "sep"
levels(as.factor(bank_additional$day_of_week))
## [1] "fri" "mon" "thu" "tue" "wed"
bank_additional$y = as.factor(bank_additional$y)
str(bank_additional$y)
## Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
bank_additional$job = as.factor(bank_additional$job)
bank_additional$marital = as.factor(bank_additional$marital)
bank_additional$education = as.factor(bank_additional$education)
bank_additional$default = as.factor(bank_additional$default)
bank_additional$housing = as.factor(bank_additional$housing)
bank_additional$loan = as.factor(bank_additional$loan)
bank_additional$contact = as.factor(bank_additional$contact)
bank_additional$month = as.factor(bank_additional$month)
bank_additional$day_of_week = as.factor(bank_additional$day_of_week)
str(bank_additional$job)
## Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
str(bank_additional$day_of_week)
## Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
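An equivalent, more concise way to convert the same columns in one step is sketched below using dplyr (already loaded); it touches only the columns converted above, leaving poutcome as character to match the output that follows.
# One-step conversion of the same character columns to factors
bank_additional <- bank_additional %>%
  mutate(across(c(job, marital, education, default, housing, loan,
                  contact, month, day_of_week), as.factor))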
The pdays variable heavily skews the dataset, since most clients were never previously contacted (coded as 999), so we recode it as a categorical variable. We also remove duration from the analysis, since call duration is only known after the call and cannot be used for realistic prediction.
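The pdays recoding itself is not shown in the original output; a minimal sketch consistent with the two-level factor seen in the later summary, assuming 999 encodes "not previously contacted":
# Recode pdays as a two-level factor: 999 means the client was never
# contacted in a previous campaign
bank_additional$pdays <- as.factor(ifelse(bank_additional$pdays == 999, "no", "yes"))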
bank_additional <- dplyr::select(bank_additional, -duration)
summary(bank_additional)
## age job marital education
## Min. :18.00 admin. :1012 divorced: 446 university.degree :1264
## 1st Qu.:32.00 blue-collar: 884 married :2509 high.school : 921
## Median :38.00 technician : 691 single :1153 basic.9y : 574
## Mean :40.11 services : 393 unknown : 11 professional.course: 535
## 3rd Qu.:47.00 management : 324 basic.4y : 429
## Max. :88.00 retired : 166 basic.6y : 228
## (Other) : 649 (Other) : 168
## default housing loan contact month
## no :3315 no :1839 no :3349 cellular :2652 may :1378
## unknown: 803 unknown: 105 unknown: 105 telephone:1467 jul : 711
## yes : 1 yes :2175 yes : 665 aug : 636
## jun : 530
## nov : 446
## apr : 215
## (Other): 203
## day_of_week campaign pdays previous poutcome
## fri:768 Min. : 1.000 no :3959 Min. :0.0000 Length:4119
## mon:855 1st Qu.: 1.000 yes: 160 1st Qu.:0.0000 Class :character
## thu:860 Median : 2.000 Median :0.0000 Mode :character
## tue:841 Mean : 2.537 Mean :0.1903
## wed:795 3rd Qu.: 3.000 3rd Qu.:0.0000
## Max. :35.000 Max. :6.0000
##
## emp.var.rate cons.price.idx cons.conf.idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.635
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.334
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08497 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr.employed y
## Min. :4964 no :3668
## 1st Qu.:5099 yes: 451
## Median :5191
## Mean :5166
## 3rd Qu.:5228
## Max. :5228
##
All variables that were originally character variables are now factors with levels, stored back in the original dataset. We now create the training and testing datasets, which we can alter and use freely.
set.seed(2024)
index <- sample(1:nrow(bank_additional),nrow(bank_additional)*0.80) #using a 80 20 split
bank_train = bank_additional[index,]
bank_test = bank_additional[-index,]
#a) duration was already removed from the dataset above; age is also excluded from the model
glm_model <- glm(y ~ . - age, family = binomial, data = bank_train)
backward_model <- step(glm_model, direction = "backward")
## Start: AIC=1891.53
## y ~ (age + job + marital + education + default + housing + loan +
## contact + month + day_of_week + campaign + pdays + previous +
## poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## euribor3m + nr.employed) - age
##
## Df Deviance AIC
## - job 11 1796.7 1876.7
## - education 7 1792.3 1880.3
## - day_of_week 4 1790.7 1884.7
## - marital 3 1789.9 1885.9
## - default 2 1789.7 1887.7
## - loan 1 1789.6 1889.6
## - housing 1 1789.7 1889.7
## - euribor3m 1 1789.7 1889.7
## - pdays 1 1790.2 1890.2
## - previous 1 1790.5 1890.5
## - nr.employed 1 1791.1 1891.1
## <none> 1789.5 1891.5
## - campaign 1 1794.5 1894.5
## - poutcome 2 1796.5 1894.5
## - cons.conf.idx 1 1796.4 1896.4
## - emp.var.rate 1 1797.8 1897.8
## - cons.price.idx 1 1798.0 1898.0
## - month 9 1823.0 1907.0
## - contact 1 1808.8 1908.8
##
## Step: AIC=1876.69
## y ~ marital + education + default + housing + loan + contact +
## month + day_of_week + campaign + pdays + previous + poutcome +
## emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m +
## nr.employed
##
## Df Deviance AIC
## - education 7 1799.6 1865.6
## - day_of_week 4 1797.7 1869.7
## - marital 3 1796.9 1870.9
## - default 2 1796.9 1872.9
## - loan 1 1796.8 1874.8
## - housing 1 1796.8 1874.8
## - euribor3m 1 1796.9 1874.9
## - previous 1 1797.5 1875.5
## - pdays 1 1797.6 1875.6
## - nr.employed 1 1798.1 1876.1
## <none> 1796.7 1876.7
## - campaign 1 1801.8 1879.8
## - poutcome 2 1803.9 1879.9
## - cons.conf.idx 1 1803.6 1881.6
## - emp.var.rate 1 1804.3 1882.3
## - cons.price.idx 1 1804.8 1882.8
## - month 9 1830.6 1892.6
## - contact 1 1815.3 1893.3
##
## Step: AIC=1865.64
## y ~ marital + default + housing + loan + contact + month + day_of_week +
## campaign + pdays + previous + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + euribor3m + nr.employed
##
## Df Deviance AIC
## - day_of_week 4 1800.6 1858.6
## - marital 3 1799.9 1859.9
## - default 2 1800.1 1862.1
## - housing 1 1799.7 1863.7
## - loan 1 1799.7 1863.7
## - euribor3m 1 1799.8 1863.8
## - pdays 1 1800.5 1864.5
## - previous 1 1800.5 1864.5
## - nr.employed 1 1801.0 1865.0
## <none> 1799.6 1865.6
## - campaign 1 1804.7 1868.7
## - poutcome 2 1807.0 1869.0
## - cons.conf.idx 1 1806.6 1870.6
## - emp.var.rate 1 1807.4 1871.4
## - cons.price.idx 1 1807.7 1871.7
## - month 9 1833.8 1881.8
## - contact 1 1819.3 1883.3
##
## Step: AIC=1858.59
## y ~ marital + default + housing + loan + contact + month + campaign +
## pdays + previous + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + euribor3m + nr.employed
##
## Df Deviance AIC
## - marital 3 1800.9 1852.9
## - default 2 1801.1 1855.1
## - housing 1 1800.7 1856.7
## - loan 1 1800.7 1856.7
## - euribor3m 1 1800.7 1856.7
## - pdays 1 1801.4 1857.4
## - previous 1 1801.5 1857.5
## - nr.employed 1 1801.9 1857.9
## <none> 1800.6 1858.6
## - poutcome 2 1807.8 1861.8
## - campaign 1 1805.9 1861.9
## - cons.conf.idx 1 1807.6 1863.6
## - emp.var.rate 1 1808.3 1864.3
## - cons.price.idx 1 1808.6 1864.6
## - month 9 1834.2 1874.2
## - contact 1 1820.3 1876.3
##
## Step: AIC=1852.86
## y ~ default + housing + loan + contact + month + campaign + pdays +
## previous + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## euribor3m + nr.employed
##
## Df Deviance AIC
## - default 2 1801.4 1849.4
## - loan 1 1800.9 1850.9
## - housing 1 1800.9 1850.9
## - euribor3m 1 1801.0 1851.0
## - pdays 1 1801.7 1851.7
## - previous 1 1801.8 1851.8
## - nr.employed 1 1802.1 1852.1
## <none> 1800.9 1852.9
## - poutcome 2 1808.1 1856.1
## - campaign 1 1806.2 1856.2
## - cons.conf.idx 1 1807.8 1857.8
## - emp.var.rate 1 1808.5 1858.5
## - cons.price.idx 1 1808.8 1858.8
## - month 9 1834.5 1868.5
## - contact 1 1820.6 1870.6
##
## Step: AIC=1849.39
## y ~ housing + loan + contact + month + campaign + pdays + previous +
## poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## euribor3m + nr.employed
##
## Df Deviance AIC
## - loan 1 1801.5 1847.5
## - housing 1 1801.5 1847.5
## - euribor3m 1 1801.5 1847.5
## - pdays 1 1802.2 1848.2
## - previous 1 1802.3 1848.3
## - nr.employed 1 1802.7 1848.7
## <none> 1801.4 1849.4
## - campaign 1 1806.7 1852.7
## - poutcome 2 1808.7 1852.7
## - cons.conf.idx 1 1808.3 1854.3
## - emp.var.rate 1 1809.0 1855.0
## - cons.price.idx 1 1809.3 1855.3
## - month 9 1835.3 1865.3
## - contact 1 1821.2 1867.2
##
## Step: AIC=1847.45
## y ~ housing + contact + month + campaign + pdays + previous +
## poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## euribor3m + nr.employed
##
## Df Deviance AIC
## - housing 2 1802.9 1844.9
## - euribor3m 1 1801.6 1845.6
## - pdays 1 1802.3 1846.3
## - previous 1 1802.3 1846.3
## - nr.employed 1 1802.7 1846.7
## <none> 1801.5 1847.5
## - campaign 1 1806.7 1850.7
## - poutcome 2 1808.7 1850.7
## - cons.conf.idx 1 1808.4 1852.4
## - emp.var.rate 1 1809.1 1853.1
## - cons.price.idx 1 1809.4 1853.4
## - month 9 1835.3 1863.3
## - contact 1 1821.4 1865.4
##
## Step: AIC=1844.87
## y ~ contact + month + campaign + pdays + previous + poutcome +
## emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m +
## nr.employed
##
## Df Deviance AIC
## - euribor3m 1 1803.0 1843.0
## - previous 1 1803.6 1843.6
## - pdays 1 1803.8 1843.8
## - nr.employed 1 1804.1 1844.1
## <none> 1802.9 1844.9
## - poutcome 2 1810.1 1848.1
## - campaign 1 1808.2 1848.2
## - cons.conf.idx 1 1809.6 1849.6
## - emp.var.rate 1 1810.8 1850.8
## - cons.price.idx 1 1811.0 1851.0
## - month 9 1836.4 1860.4
## - contact 1 1823.0 1863.0
##
## Step: AIC=1842.97
## y ~ contact + month + campaign + pdays + previous + poutcome +
## emp.var.rate + cons.price.idx + cons.conf.idx + nr.employed
##
## Df Deviance AIC
## - previous 1 1803.7 1841.7
## - pdays 1 1803.9 1841.9
## - nr.employed 1 1804.7 1842.7
## <none> 1803.0 1843.0
## - poutcome 2 1810.1 1846.1
## - campaign 1 1808.2 1846.2
## - emp.var.rate 1 1811.2 1849.2
## - cons.price.idx 1 1811.9 1849.9
## - cons.conf.idx 1 1814.4 1852.4
## - month 9 1837.5 1859.5
## - contact 1 1823.0 1861.0
##
## Step: AIC=1841.69
## y ~ contact + month + campaign + pdays + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + nr.employed
##
## Df Deviance AIC
## - pdays 1 1805.5 1841.5
## - nr.employed 1 1805.6 1841.6
## <none> 1803.7 1841.7
## - campaign 1 1809.0 1845.0
## - poutcome 2 1811.8 1845.8
## - emp.var.rate 1 1812.6 1848.6
## - cons.price.idx 1 1813.7 1849.7
## - cons.conf.idx 1 1815.2 1851.2
## - month 9 1838.1 1858.1
## - contact 1 1824.6 1860.6
##
## Step: AIC=1841.53
## y ~ contact + month + campaign + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + nr.employed
##
## Df Deviance AIC
## - nr.employed 1 1807.4 1841.4
## <none> 1805.5 1841.5
## - campaign 1 1810.9 1844.9
## - emp.var.rate 1 1814.3 1848.3
## - cons.price.idx 1 1815.5 1849.5
## - cons.conf.idx 1 1817.0 1851.0
## - month 9 1840.6 1858.6
## - contact 1 1826.9 1860.9
## - poutcome 2 1848.8 1880.8
##
## Step: AIC=1841.38
## y ~ contact + month + campaign + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx
##
## Df Deviance AIC
## <none> 1807.4 1841.4
## - campaign 1 1812.9 1844.9
## - cons.conf.idx 1 1817.2 1849.2
## - month 9 1841.3 1857.3
## - contact 1 1827.0 1859.0
## - poutcome 2 1850.8 1880.8
## - cons.price.idx 1 1861.0 1893.0
## - emp.var.rate 1 1902.6 1934.6
glm_model = backward_model
summary(glm_model)
##
## Call:
## glm(formula = y ~ contact + month + campaign + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx, family = binomial, data = bank_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.185e+02 1.569e+01 -7.552 4.28e-14 ***
## contacttelephone -1.004e+00 2.398e-01 -4.187 2.83e-05 ***
## monthaug -2.159e-02 3.394e-01 -0.064 0.94928
## monthdec 8.387e-01 5.784e-01 1.450 0.14705
## monthjul -3.349e-03 3.208e-01 -0.010 0.99167
## monthjun 1.338e-01 2.965e-01 0.451 0.65176
## monthmar 1.546e+00 4.144e-01 3.731 0.00019 ***
## monthmay -3.649e-01 2.548e-01 -1.432 0.15203
## monthnov -5.069e-01 3.150e-01 -1.609 0.10758
## monthoct -1.238e-01 4.037e-01 -0.307 0.75910
## monthsep 2.681e-02 4.229e-01 0.063 0.94945
## campaign -8.187e-02 3.798e-02 -2.156 0.03110 *
## poutcomenonexistent 3.560e-01 1.921e-01 1.853 0.06390 .
## poutcomesuccess 1.705e+00 2.694e-01 6.330 2.45e-10 ***
## emp.var.rate -7.020e-01 6.938e-02 -10.119 < 2e-16 ***
## cons.price.idx 1.266e+00 1.702e-01 7.440 1.01e-13 ***
## cons.conf.idx 5.218e-02 1.679e-02 3.108 0.00188 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2294.2 on 3294 degrees of freedom
## Residual deviance: 1807.4 on 3278 degrees of freedom
## AIC: 1841.4
##
## Number of Fisher Scoring iterations: 6
#b)
pred_prob_bank_test <- predict.glm(glm_model, newdata = bank_test, type = "response")
summary(pred_prob_bank_test)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00701 0.03980 0.05851 0.10639 0.09419 0.91357
hist(pred_prob_bank_test)

#d)
pred_test <- prediction(pred_prob_bank_test, as.numeric(bank_test$y) - 1) #need the y variable to be 0 and 1
pred_test
## A prediction instance
## with 824 data points
ROC <- performance(pred_test, "tpr", "fpr")
plot(ROC, colorize=TRUE)
#Get the AUC
auc_test = unlist(slot(performance(pred_test, "auc"), "y.values")) #unlist makes it into a number
print(auc_test)
## [1] 0.7981896
text(0.5,0.5, paste("AUC:", auc_test))

# Extract performance metrics
perf_sens <- performance(pred_test, "sens")
perf_spec <- performance(pred_test, "spec")
# Extract y-values (sensitivity and specificity)
sens <- unlist(perf_sens@y.values)
spec <- unlist(perf_spec@y.values)
cutoffs <- unlist(perf_sens@x.values) # Sensitivity and specificity should share same cutoffs
# Ensure proper lengths
if (length(sens) == length(spec) && length(sens) == length(cutoffs)) {
  # Plot sensitivity
  plot(cutoffs, sens, type = "l", lwd = 2, ylab = "Sensitivity", xlab = "Cutoff", col = "blue",
       main = paste("Maximized Cutoff\n", "AUC: ", round(auc_test, 3)))
  par(new = TRUE) # Overlay specificity plot
  # Plot specificity
  plot(cutoffs, spec, type = "l", lwd = 2, col = "red", ylab = "", xlab = "", axes = FALSE)
  axis(4, at = seq(0, 1, 0.2)) # Specificity axis labels
  mtext("Specificity", side = 4, col = "red")
  # Find optimal cutoff where sensitivity ≈ specificity
  min.diff <- which.min(abs(sens - spec))
  optimal_cutoff <- cutoffs[min.diff]
  # Draw reference lines
  abline(h = sens[min.diff], lty = 3)
  abline(v = optimal_cutoff, lty = 3)
  text(optimal_cutoff, 0, paste("Optimal threshold=", round(optimal_cutoff, 2)), pos = 3)
} else {
  print("Error: Sensitivity, specificity, and cutoffs lengths do not match.")
}

# Use predicted probabilities, not `pred_test`
pr_class <- ifelse(pred_prob_bank_test > optimal_cutoff, "1", "0") # Use optimal threshold
# Ensure bank_test$y is converted correctly
true_labels <- as.factor(ifelse(bank_test$y == "yes", "1", "0")) # Convert "yes"/"no" to "1"/"0"
# Convert `pr_class` to factor
pr_class <- as.factor(pr_class)
# Compute confusion matrix
caret::confusionMatrix(pr_class, true_labels, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 563 25
## 1 175 61
##
## Accuracy : 0.7573
## 95% CI : (0.7265, 0.7862)
## No Information Rate : 0.8956
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2667
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.70930
## Specificity : 0.76287
## Pos Pred Value : 0.25847
## Neg Pred Value : 0.95748
## Prevalence : 0.10437
## Detection Rate : 0.07403
## Detection Prevalence : 0.28641
## Balanced Accuracy : 0.73609
##
## 'Positive' Class : 1
##
From the confusion matrix we can see that, across the 824 test observations, the model is accurate 76% of the time (Accuracy), correctly predicting those who will subscribe 71% of the time (Sensitivity) and those who will not subscribe 76% of the time (Specificity).
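These percentages follow directly from the confusion matrix counts above:
# Metrics recomputed from the confusion matrix counts
(563 + 61) / (563 + 25 + 175 + 61)  # accuracy, about 0.757
61 / (61 + 25)                      # sensitivity, about 0.709 (subscribers correctly flagged)
563 / (563 + 175)                   # specificity, about 0.763 (non-subscribers correctly flagged)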
We now use the variables selected by the backward stepwise regression to build the model that best predicts the data. For the LDA model we split the data so that the response variable y has an equal number of yes and no observations.
# Select only relevant variables plus y
selected_vars <- c("emp.var.rate", "cons.price.idx", "cons.conf.idx",
"contact", "month", "poutcome", "campaign", "y")
balanced_data <- bank_additional[selected_vars]
# Convert categorical variables to factors
balanced_data$contact <- as.factor(balanced_data$contact)
balanced_data$month <- as.factor(balanced_data$month)
balanced_data$poutcome <- as.factor(balanced_data$poutcome)
# Create a 50-50 split for y
yes_data <- balanced_data[balanced_data$y == "yes", ]
no_data <- balanced_data[balanced_data$y == "no", ]
set.seed(2024)
min_n <- min(nrow(yes_data), nrow(no_data))
balanced_yes <- yes_data[sample(1:nrow(yes_data), min_n), ]
balanced_no <- no_data[sample(1:nrow(no_data), min_n), ]
# Combine and shuffle
balanced_data <- rbind(balanced_yes, balanced_no)
balanced_data <- balanced_data[sample(1:nrow(balanced_data)), ]
#making a training and testing data sets
set.seed(2024)
index_2 <- sample(1:nrow(balanced_data),nrow(balanced_data)*0.80) #using a 80 20 split
lda_train = balanced_data[index_2,]
lda_test = balanced_data[-index_2,]
# Check the distribution of 'y'
cat("Distribution of no and yes in y:\n")
## Distribution of no and yes in y:
print(table(bank_additional$y))
##
## no yes
## 3668 451
# Verify the new distribution
cat("New balance between No and Yes for LDA model: \n")
## New balance between No and Yes for LDA model:
print(table(balanced_data$y))
##
## no yes
## 451 451
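The LDA fitting step is not shown in the original output; a minimal sketch consistent with the predict() call below, using MASS::lda (already loaded) on the balanced training data:
# Fit the LDA model on the balanced training set; predict() below expects an
# object named lda.model
lda.model <- lda(y ~ ., data = lda_train)
lda.model  # prior probabilities, group means, and discriminant coefficients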
From the fitted model we can see how the probability of subscribing versus not subscribing shifts with different criteria: for example, people contacted by telephone are unlikely to subscribe, while people contacted in October are more likely to.
# Load necessary library
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Predicting
predictions.lda = predict(lda.model, lda_test)
# Extract the predicted probabilities for class "yes"
lda_probabilities <- predictions.lda$posterior[, "yes"]
# Create ROC curve
roc_curve <- roc(lda_test$y, lda_probabilities, levels = c("no", "yes"), direction = "<")
# Plot the ROC curve
plot(roc_curve, col = "blue", main = "ROC Curve for LDA Model")
# Calculate and print AUC value
auc_value <- auc(roc_curve)
cat("AUC Value: ", auc_value, "\n")
## AUC Value: 0.7818071
text(0.5,0.5, paste("AUC:", auc_value))

We now compute the confusion matrix for the LDA model.
# Make confusion matrix for the LDA predictions to compare accuracy
#class has the actual predictions, so we can compare to our guessed
caret::confusionMatrix(as.factor(predictions.lda$class), as.factor(lda_test$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 79 35
## yes 11 56
##
## Accuracy : 0.7459
## 95% CI : (0.6759, 0.8075)
## No Information Rate : 0.5028
## P-Value [Acc > NIR] : 1.971e-11
##
## Kappa : 0.4924
##
## Mcnemar's Test P-Value : 0.000696
##
## Sensitivity : 0.6154
## Specificity : 0.8778
## Pos Pred Value : 0.8358
## Neg Pred Value : 0.6930
## Prevalence : 0.5028
## Detection Rate : 0.3094
## Detection Prevalence : 0.3702
## Balanced Accuracy : 0.7466
##
## 'Positive' Class : yes
##
Here we can see that the LDA model is accurate only 75% of the time, correctly identifying those who will subscribe 62% of the time and those who will not 88% of the time.
Results
The GLM model had higher accuracy, better-balanced sensitivity and specificity, and a slightly higher AUC (a measure of how well the model separates the two classes). The LDA model had slightly lower accuracy, more unbalanced sensitivity and specificity (weighted toward specificity), and a slightly lower AUC.
How to weigh the confusion matrices is left to the user's or company's discretion. The GLM model predicts "no" better but produces more false positives, the undesirable outcome: we predict people will subscribe when they actually won't, so we spend extra money advertising to the wrong market segment.
The LDA model's confusion matrix shows a better balance between actual yes and no predictions, with fewer false positives (people we think will subscribe but who don't). The trade-off is more false negatives, where we predict people won't subscribe but they actually do, which is the less costly mistake.
Both models should be used to check each other, but at present it stands to reason that the GLM model is better than the LDA model: if we changed the seed or introduced new data, the GLM model would likely produce results close to those above, while the LDA model would show more variation.
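One way to check this robustness claim would be to compare the two models under repeated cross-validation rather than a single split; a sketch (not run here) using caret, with the backward-selected formula for the GLM and the balanced data for the LDA:
# Hypothetical robustness check: repeated 10-fold cross-validation,
# comparing the two models by cross-validated ROC (AUC)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
glm_cv <- train(y ~ contact + month + campaign + poutcome + emp.var.rate +
                  cons.price.idx + cons.conf.idx,
                data = bank_train, method = "glm",
                metric = "ROC", trControl = ctrl)
lda_cv <- train(y ~ ., data = lda_train, method = "lda",
                metric = "ROC", trControl = ctrl)
summary(resamples(list(GLM = glm_cv, LDA = lda_cv)))  # spread of ROC across resamples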