Here we load the packages needed for the analysis, including ROCR, which we will use for ROC and AUC calculations and for finding the optimal cutoff point for our model. We also load the dataset we will be working with, bank-additional.csv, which is stored as bank_additional for convenience.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(lattice)
library(ggplot2)
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-5
library(readr)
library(ROCR)
bank_additional <- read_delim("bank-additional.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)
## Rows: 4119 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, pdays, previous, emp.var.rate, cons.price...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(bank_additional)

Here we run some exploratory data analysis (EDA).

summary(bank_additional)
##       age            job              marital           education        
##  Min.   :18.00   Length:4119        Length:4119        Length:4119       
##  1st Qu.:32.00   Class :character   Class :character   Class :character  
##  Median :38.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.11                                                           
##  3rd Qu.:47.00                                                           
##  Max.   :88.00                                                           
##    default            housing              loan             contact         
##  Length:4119        Length:4119        Length:4119        Length:4119       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     month           day_of_week           duration         campaign     
##  Length:4119        Length:4119        Min.   :   0.0   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 103.0   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 181.0   Median : 2.000  
##                                        Mean   : 256.8   Mean   : 2.537  
##                                        3rd Qu.: 317.0   3rd Qu.: 3.000  
##                                        Max.   :3643.0   Max.   :35.000  
##      pdays          previous        poutcome          emp.var.rate     
##  Min.   :  0.0   Min.   :0.0000   Length:4119        Min.   :-3.40000  
##  1st Qu.:999.0   1st Qu.:0.0000   Class :character   1st Qu.:-1.80000  
##  Median :999.0   Median :0.0000   Mode  :character   Median : 1.10000  
##  Mean   :960.4   Mean   :0.1903                      Mean   : 0.08497  
##  3rd Qu.:999.0   3rd Qu.:0.0000                      3rd Qu.: 1.40000  
##  Max.   :999.0   Max.   :6.0000                      Max.   : 1.40000  
##  cons.price.idx  cons.conf.idx     euribor3m      nr.employed  
##  Min.   :92.20   Min.   :-50.8   Min.   :0.635   Min.   :4964  
##  1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.334   1st Qu.:5099  
##  Median :93.75   Median :-41.8   Median :4.857   Median :5191  
##  Mean   :93.58   Mean   :-40.5   Mean   :3.621   Mean   :5166  
##  3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   3rd Qu.:5228  
##  Max.   :94.77   Max.   :-26.9   Max.   :5.045   Max.   :5228  
##       y            
##  Length:4119       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
str(bank_additional)
## spc_tbl_ [4,119 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age           : num [1:4119] 30 39 25 38 47 32 32 41 31 35 ...
##  $ job           : chr [1:4119] "blue-collar" "services" "services" "services" ...
##  $ marital       : chr [1:4119] "married" "single" "married" "married" ...
##  $ education     : chr [1:4119] "basic.9y" "high.school" "high.school" "basic.9y" ...
##  $ default       : chr [1:4119] "no" "no" "no" "no" ...
##  $ housing       : chr [1:4119] "yes" "no" "yes" "unknown" ...
##  $ loan          : chr [1:4119] "no" "no" "no" "unknown" ...
##  $ contact       : chr [1:4119] "cellular" "telephone" "telephone" "telephone" ...
##  $ month         : chr [1:4119] "may" "may" "jun" "jun" ...
##  $ day_of_week   : chr [1:4119] "fri" "fri" "wed" "fri" ...
##  $ duration      : num [1:4119] 487 346 227 17 58 128 290 44 68 170 ...
##  $ campaign      : num [1:4119] 2 4 1 3 1 3 4 2 1 1 ...
##  $ pdays         : num [1:4119] 999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : num [1:4119] 0 0 0 0 0 2 0 0 1 0 ...
##  $ poutcome      : chr [1:4119] "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num [1:4119] -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
##  $ cons.price.idx: num [1:4119] 92.9 94 94.5 94.5 93.2 ...
##  $ cons.conf.idx : num [1:4119] -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
##  $ euribor3m     : num [1:4119] 1.31 4.86 4.96 4.96 4.19 ...
##  $ nr.employed   : num [1:4119] 5099 5191 5228 5228 5196 ...
##  $ y             : chr [1:4119] "no" "no" "no" "no" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   job = col_character(),
##   ..   marital = col_character(),
##   ..   education = col_character(),
##   ..   default = col_character(),
##   ..   housing = col_character(),
##   ..   loan = col_character(),
##   ..   contact = col_character(),
##   ..   month = col_character(),
##   ..   day_of_week = col_character(),
##   ..   duration = col_double(),
##   ..   campaign = col_double(),
##   ..   pdays = col_double(),
##   ..   previous = col_double(),
##   ..   poutcome = col_character(),
##   ..   emp.var.rate = col_double(),
##   ..   cons.price.idx = col_double(),
##   ..   cons.conf.idx = col_double(),
##   ..   euribor3m = col_double(),
##   ..   nr.employed = col_double(),
##   ..   y = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

We can see that none of the character variables were loaded as factors, i.e. none of them have levels. To fix this I will convert each relevant variable to a factor explicitly, but first I check what the levels will be and how they will look.

levels(as.factor(bank_additional$y))
## [1] "no"  "yes"
levels(as.factor(bank_additional$job))
##  [1] "admin."        "blue-collar"   "entrepreneur"  "housemaid"    
##  [5] "management"    "retired"       "self-employed" "services"     
##  [9] "student"       "technician"    "unemployed"    "unknown"
levels(as.factor(bank_additional$marital))
## [1] "divorced" "married"  "single"   "unknown"
levels(as.factor(bank_additional$education))
## [1] "basic.4y"            "basic.6y"            "basic.9y"           
## [4] "high.school"         "illiterate"          "professional.course"
## [7] "university.degree"   "unknown"
levels(as.factor(bank_additional$default))
## [1] "no"      "unknown" "yes"
levels(as.factor(bank_additional$housing))
## [1] "no"      "unknown" "yes"
levels(as.factor(bank_additional$loan))
## [1] "no"      "unknown" "yes"
levels(as.factor(bank_additional$contact))
## [1] "cellular"  "telephone"
levels(as.factor(bank_additional$month))
##  [1] "apr" "aug" "dec" "jul" "jun" "mar" "may" "nov" "oct" "sep"
levels(as.factor(bank_additional$day_of_week))
## [1] "fri" "mon" "thu" "tue" "wed"
bank_additional$y = as.factor(bank_additional$y)
str(bank_additional$y)
##  Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
bank_additional$job = as.factor(bank_additional$job)
bank_additional$marital = as.factor(bank_additional$marital)
bank_additional$education = as.factor(bank_additional$education)
bank_additional$default = as.factor(bank_additional$default)
bank_additional$housing = as.factor(bank_additional$housing)
bank_additional$loan = as.factor(bank_additional$loan)
bank_additional$contact = as.factor(bank_additional$contact)
bank_additional$month = as.factor(bank_additional$month)
bank_additional$day_of_week = as.factor(bank_additional$day_of_week)
str(bank_additional$job)
##  Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
str(bank_additional$day_of_week)
##  Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
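
Incidentally, the repeated as.factor() calls above can be collapsed into a single step. A minimal sketch using dplyr's across(), shown for reference only, since it would also convert poutcome, which the code above leaves as character until later:

# One-step alternative (sketch, not run): convert every remaining character
# column to a factor with dplyr. Note this would also convert poutcome.
bank_additional <- bank_additional %>%
  mutate(across(where(is.character), as.factor))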

The placeholder coding of pdays heavily skews its distribution, so we convert it to a categorical variable.

If the value is 998 or less, the individual was contacted in a previous campaign (coded "yes"); the placeholder value 999 means there was no previous contact (coded "no").

# Warning: run this recode only once, since it overwrites pdays
bank_additional$pdays = ifelse(bank_additional$pdays <= 998, "yes", "no")
bank_additional$pdays = as.factor(bank_additional$pdays)
str(bank_additional$pdays)
##  Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Remove duration from the analysis, since its value is not known before a call is made and would leak the outcome.

bank_additional <- dplyr::select(bank_additional, -duration)
summary(bank_additional)
##       age                 job           marital                   education   
##  Min.   :18.00   admin.     :1012   divorced: 446   university.degree  :1264  
##  1st Qu.:32.00   blue-collar: 884   married :2509   high.school        : 921  
##  Median :38.00   technician : 691   single  :1153   basic.9y           : 574  
##  Mean   :40.11   services   : 393   unknown :  11   professional.course: 535  
##  3rd Qu.:47.00   management : 324                   basic.4y           : 429  
##  Max.   :88.00   retired    : 166                   basic.6y           : 228  
##                  (Other)    : 649                   (Other)            : 168  
##     default        housing          loan           contact         month     
##  no     :3315   no     :1839   no     :3349   cellular :2652   may    :1378  
##  unknown: 803   unknown: 105   unknown: 105   telephone:1467   jul    : 711  
##  yes    :   1   yes    :2175   yes    : 665                    aug    : 636  
##                                                                jun    : 530  
##                                                                nov    : 446  
##                                                                apr    : 215  
##                                                                (Other): 203  
##  day_of_week    campaign      pdays         previous        poutcome        
##  fri:768     Min.   : 1.000   no :3959   Min.   :0.0000   Length:4119       
##  mon:855     1st Qu.: 1.000   yes: 160   1st Qu.:0.0000   Class :character  
##  thu:860     Median : 2.000              Median :0.0000   Mode  :character  
##  tue:841     Mean   : 2.537              Mean   :0.1903                     
##  wed:795     3rd Qu.: 3.000              3rd Qu.:0.0000                     
##              Max.   :35.000              Max.   :6.0000                     
##                                                                             
##   emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
##  Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.635  
##  1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.334  
##  Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
##  Mean   : 0.08497   Mean   :93.58   Mean   :-40.5   Mean   :3.621  
##  3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
##  Max.   : 1.40000   Max.   :94.77   Max.   :-26.9   Max.   :5.045  
##                                                                    
##   nr.employed     y       
##  Min.   :4964   no :3668  
##  1st Qu.:5099   yes: 451  
##  Median :5191             
##  Mean   :5166             
##  3rd Qu.:5228             
##  Max.   :5228             
## 

All of the variables that were originally character variables are now factors with levels, attached to the “original” dataset. We will now create the training and testing datasets, which we can alter and use freely.

set.seed(2024)
index <- sample(1:nrow(bank_additional), nrow(bank_additional)*0.80) # using an 80/20 split
bank_train = bank_additional[index,]
bank_test = bank_additional[-index,]
# Fit the full logistic regression model (duration was already removed above;
# age is excluded here as well), then run backward stepwise selection.
glm_model <- glm(y ~. -age, family=binomial, data=bank_train)
backward_model <- step(glm_model, direction = "backward")
## Start:  AIC=1891.53
## y ~ (age + job + marital + education + default + housing + loan + 
##     contact + month + day_of_week + campaign + pdays + previous + 
##     poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + 
##     euribor3m + nr.employed) - age
## 
##                  Df Deviance    AIC
## - job            11   1796.7 1876.7
## - education       7   1792.3 1880.3
## - day_of_week     4   1790.7 1884.7
## - marital         3   1789.9 1885.9
## - default         2   1789.7 1887.7
## - loan            1   1789.6 1889.6
## - housing         1   1789.7 1889.7
## - euribor3m       1   1789.7 1889.7
## - pdays           1   1790.2 1890.2
## - previous        1   1790.5 1890.5
## - nr.employed     1   1791.1 1891.1
## <none>                1789.5 1891.5
## - campaign        1   1794.5 1894.5
## - poutcome        2   1796.5 1894.5
## - cons.conf.idx   1   1796.4 1896.4
## - emp.var.rate    1   1797.8 1897.8
## - cons.price.idx  1   1798.0 1898.0
## - month           9   1823.0 1907.0
## - contact         1   1808.8 1908.8
## 
## Step:  AIC=1876.69
## y ~ marital + education + default + housing + loan + contact + 
##     month + day_of_week + campaign + pdays + previous + poutcome + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + 
##     nr.employed
## 
##                  Df Deviance    AIC
## - education       7   1799.6 1865.6
## - day_of_week     4   1797.7 1869.7
## - marital         3   1796.9 1870.9
## - default         2   1796.9 1872.9
## - loan            1   1796.8 1874.8
## - housing         1   1796.8 1874.8
## - euribor3m       1   1796.9 1874.9
## - previous        1   1797.5 1875.5
## - pdays           1   1797.6 1875.6
## - nr.employed     1   1798.1 1876.1
## <none>                1796.7 1876.7
## - campaign        1   1801.8 1879.8
## - poutcome        2   1803.9 1879.9
## - cons.conf.idx   1   1803.6 1881.6
## - emp.var.rate    1   1804.3 1882.3
## - cons.price.idx  1   1804.8 1882.8
## - month           9   1830.6 1892.6
## - contact         1   1815.3 1893.3
## 
## Step:  AIC=1865.64
## y ~ marital + default + housing + loan + contact + month + day_of_week + 
##     campaign + pdays + previous + poutcome + emp.var.rate + cons.price.idx + 
##     cons.conf.idx + euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - day_of_week     4   1800.6 1858.6
## - marital         3   1799.9 1859.9
## - default         2   1800.1 1862.1
## - housing         1   1799.7 1863.7
## - loan            1   1799.7 1863.7
## - euribor3m       1   1799.8 1863.8
## - pdays           1   1800.5 1864.5
## - previous        1   1800.5 1864.5
## - nr.employed     1   1801.0 1865.0
## <none>                1799.6 1865.6
## - campaign        1   1804.7 1868.7
## - poutcome        2   1807.0 1869.0
## - cons.conf.idx   1   1806.6 1870.6
## - emp.var.rate    1   1807.4 1871.4
## - cons.price.idx  1   1807.7 1871.7
## - month           9   1833.8 1881.8
## - contact         1   1819.3 1883.3
## 
## Step:  AIC=1858.59
## y ~ marital + default + housing + loan + contact + month + campaign + 
##     pdays + previous + poutcome + emp.var.rate + cons.price.idx + 
##     cons.conf.idx + euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - marital         3   1800.9 1852.9
## - default         2   1801.1 1855.1
## - housing         1   1800.7 1856.7
## - loan            1   1800.7 1856.7
## - euribor3m       1   1800.7 1856.7
## - pdays           1   1801.4 1857.4
## - previous        1   1801.5 1857.5
## - nr.employed     1   1801.9 1857.9
## <none>                1800.6 1858.6
## - poutcome        2   1807.8 1861.8
## - campaign        1   1805.9 1861.9
## - cons.conf.idx   1   1807.6 1863.6
## - emp.var.rate    1   1808.3 1864.3
## - cons.price.idx  1   1808.6 1864.6
## - month           9   1834.2 1874.2
## - contact         1   1820.3 1876.3
## 
## Step:  AIC=1852.86
## y ~ default + housing + loan + contact + month + campaign + pdays + 
##     previous + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + 
##     euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - default         2   1801.4 1849.4
## - loan            1   1800.9 1850.9
## - housing         1   1800.9 1850.9
## - euribor3m       1   1801.0 1851.0
## - pdays           1   1801.7 1851.7
## - previous        1   1801.8 1851.8
## - nr.employed     1   1802.1 1852.1
## <none>                1800.9 1852.9
## - poutcome        2   1808.1 1856.1
## - campaign        1   1806.2 1856.2
## - cons.conf.idx   1   1807.8 1857.8
## - emp.var.rate    1   1808.5 1858.5
## - cons.price.idx  1   1808.8 1858.8
## - month           9   1834.5 1868.5
## - contact         1   1820.6 1870.6
## 
## Step:  AIC=1849.39
## y ~ housing + loan + contact + month + campaign + pdays + previous + 
##     poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + 
##     euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - loan            1   1801.5 1847.5
## - housing         1   1801.5 1847.5
## - euribor3m       1   1801.5 1847.5
## - pdays           1   1802.2 1848.2
## - previous        1   1802.3 1848.3
## - nr.employed     1   1802.7 1848.7
## <none>                1801.4 1849.4
## - campaign        1   1806.7 1852.7
## - poutcome        2   1808.7 1852.7
## - cons.conf.idx   1   1808.3 1854.3
## - emp.var.rate    1   1809.0 1855.0
## - cons.price.idx  1   1809.3 1855.3
## - month           9   1835.3 1865.3
## - contact         1   1821.2 1867.2
## 
## Step:  AIC=1847.45
## y ~ housing + contact + month + campaign + pdays + previous + 
##     poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + 
##     euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - housing         2   1802.9 1844.9
## - euribor3m       1   1801.6 1845.6
## - pdays           1   1802.3 1846.3
## - previous        1   1802.3 1846.3
## - nr.employed     1   1802.7 1846.7
## <none>                1801.5 1847.5
## - campaign        1   1806.7 1850.7
## - poutcome        2   1808.7 1850.7
## - cons.conf.idx   1   1808.4 1852.4
## - emp.var.rate    1   1809.1 1853.1
## - cons.price.idx  1   1809.4 1853.4
## - month           9   1835.3 1863.3
## - contact         1   1821.4 1865.4
## 
## Step:  AIC=1844.87
## y ~ contact + month + campaign + pdays + previous + poutcome + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + 
##     nr.employed
## 
##                  Df Deviance    AIC
## - euribor3m       1   1803.0 1843.0
## - previous        1   1803.6 1843.6
## - pdays           1   1803.8 1843.8
## - nr.employed     1   1804.1 1844.1
## <none>                1802.9 1844.9
## - poutcome        2   1810.1 1848.1
## - campaign        1   1808.2 1848.2
## - cons.conf.idx   1   1809.6 1849.6
## - emp.var.rate    1   1810.8 1850.8
## - cons.price.idx  1   1811.0 1851.0
## - month           9   1836.4 1860.4
## - contact         1   1823.0 1863.0
## 
## Step:  AIC=1842.97
## y ~ contact + month + campaign + pdays + previous + poutcome + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + nr.employed
## 
##                  Df Deviance    AIC
## - previous        1   1803.7 1841.7
## - pdays           1   1803.9 1841.9
## - nr.employed     1   1804.7 1842.7
## <none>                1803.0 1843.0
## - poutcome        2   1810.1 1846.1
## - campaign        1   1808.2 1846.2
## - emp.var.rate    1   1811.2 1849.2
## - cons.price.idx  1   1811.9 1849.9
## - cons.conf.idx   1   1814.4 1852.4
## - month           9   1837.5 1859.5
## - contact         1   1823.0 1861.0
## 
## Step:  AIC=1841.69
## y ~ contact + month + campaign + pdays + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx + nr.employed
## 
##                  Df Deviance    AIC
## - pdays           1   1805.5 1841.5
## - nr.employed     1   1805.6 1841.6
## <none>                1803.7 1841.7
## - campaign        1   1809.0 1845.0
## - poutcome        2   1811.8 1845.8
## - emp.var.rate    1   1812.6 1848.6
## - cons.price.idx  1   1813.7 1849.7
## - cons.conf.idx   1   1815.2 1851.2
## - month           9   1838.1 1858.1
## - contact         1   1824.6 1860.6
## 
## Step:  AIC=1841.53
## y ~ contact + month + campaign + poutcome + emp.var.rate + cons.price.idx + 
##     cons.conf.idx + nr.employed
## 
##                  Df Deviance    AIC
## - nr.employed     1   1807.4 1841.4
## <none>                1805.5 1841.5
## - campaign        1   1810.9 1844.9
## - emp.var.rate    1   1814.3 1848.3
## - cons.price.idx  1   1815.5 1849.5
## - cons.conf.idx   1   1817.0 1851.0
## - month           9   1840.6 1858.6
## - contact         1   1826.9 1860.9
## - poutcome        2   1848.8 1880.8
## 
## Step:  AIC=1841.38
## y ~ contact + month + campaign + poutcome + emp.var.rate + cons.price.idx + 
##     cons.conf.idx
## 
##                  Df Deviance    AIC
## <none>                1807.4 1841.4
## - campaign        1   1812.9 1844.9
## - cons.conf.idx   1   1817.2 1849.2
## - month           9   1841.3 1857.3
## - contact         1   1827.0 1859.0
## - poutcome        2   1850.8 1880.8
## - cons.price.idx  1   1861.0 1893.0
## - emp.var.rate    1   1902.6 1934.6
glm_model = backward_model
summary(glm_model)
## 
## Call:
## glm(formula = y ~ contact + month + campaign + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx, family = binomial, data = bank_train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.185e+02  1.569e+01  -7.552 4.28e-14 ***
## contacttelephone    -1.004e+00  2.398e-01  -4.187 2.83e-05 ***
## monthaug            -2.159e-02  3.394e-01  -0.064  0.94928    
## monthdec             8.387e-01  5.784e-01   1.450  0.14705    
## monthjul            -3.349e-03  3.208e-01  -0.010  0.99167    
## monthjun             1.338e-01  2.965e-01   0.451  0.65176    
## monthmar             1.546e+00  4.144e-01   3.731  0.00019 ***
## monthmay            -3.649e-01  2.548e-01  -1.432  0.15203    
## monthnov            -5.069e-01  3.150e-01  -1.609  0.10758    
## monthoct            -1.238e-01  4.037e-01  -0.307  0.75910    
## monthsep             2.681e-02  4.229e-01   0.063  0.94945    
## campaign            -8.187e-02  3.798e-02  -2.156  0.03110 *  
## poutcomenonexistent  3.560e-01  1.921e-01   1.853  0.06390 .  
## poutcomesuccess      1.705e+00  2.694e-01   6.330 2.45e-10 ***
## emp.var.rate        -7.020e-01  6.938e-02 -10.119  < 2e-16 ***
## cons.price.idx       1.266e+00  1.702e-01   7.440 1.01e-13 ***
## cons.conf.idx        5.218e-02  1.679e-02   3.108  0.00188 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2294.2  on 3294  degrees of freedom
## Residual deviance: 1807.4  on 3278  degrees of freedom
## AIC: 1841.4
## 
## Number of Fisher Scoring iterations: 6
# Predict probabilities on the test set
pred_prob_bank_test <- predict.glm(glm_model, newdata = bank_test, type = "response")
summary(pred_prob_bank_test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00701 0.03980 0.05851 0.10639 0.09419 0.91357
hist(pred_prob_bank_test)

# Evaluate the model with an ROC curve and AUC on the test set

pred_test <- prediction(pred_prob_bank_test, as.numeric(bank_test$y) - 1) #need the y variable to be 0 and 1
pred_test
## A prediction instance
##   with 824 data points
ROC <- performance(pred_test, "tpr", "fpr")

plot(ROC, colorize=TRUE)

#Get the AUC
auc_test = unlist(slot(performance(pred_test, "auc"), "y.values")) #unlist makes it into a number
print(auc_test)
## [1] 0.7981896
text(0.5, 0.5, paste("AUC:", round(auc_test, 3)))

# Extract performance metrics
perf_sens <- performance(pred_test, "sens")
perf_spec <- performance(pred_test, "spec")

# Extract y-values (sensitivity and specificity)
sens <- unlist(perf_sens@y.values)
spec <- unlist(perf_spec@y.values)
cutoffs <- unlist(perf_sens@x.values)  # Sensitivity and specificity should share same cutoffs

# Ensure proper lengths
if (length(sens) == length(spec) && length(sens) == length(cutoffs)) {
  
  # Plot sensitivity
  plot(cutoffs, sens, type="l", lwd=2, ylab="Sensitivity", xlab="Cutoff", col="blue",
       main = paste("Maximized Cutoff\n","AUC: ", round(auc_test, 3)))
  
  par(new=TRUE) # Overlay specificity plot
  
  # Plot specificity
  plot(cutoffs, spec, type="l", lwd=2, col='red', ylab="", xlab="", axes=FALSE)
  axis(4, at=seq(0,1,0.2)) # Specificity axis labels
  mtext("Specificity", side=4, col='red')
  
  # Find optimal cutoff where sensitivity ≈ specificity
  min.diff <- which.min(abs(sens - spec))
  optimal_cutoff <- cutoffs[min.diff]
  
  # Draw reference lines
  abline(h = sens[min.diff], lty = 3)
  abline(v = optimal_cutoff, lty = 3)
  text(optimal_cutoff, 0, paste("Optimal threshold=", round(optimal_cutoff, 2)), pos=3)
  
} else {
  print("Error: Sensitivity, specificity, and cutoffs lengths do not match.")
}

# Use predicted probabilities, not `pred_test`
pr_class <- ifelse(pred_prob_bank_test > optimal_cutoff, "1", "0")  # Use optimal threshold

# Ensure bank_test$y is converted correctly
true_labels <- as.factor(ifelse(bank_test$y == "yes", "1", "0"))  # Convert "yes"/"no" to "1"/"0"

# Convert `pr_class` to factor
pr_class <- as.factor(pr_class)

# Compute confusion matrix
caret::confusionMatrix(pr_class, true_labels, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 563  25
##          1 175  61
##                                           
##                Accuracy : 0.7573          
##                  95% CI : (0.7265, 0.7862)
##     No Information Rate : 0.8956          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2667          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.70930         
##             Specificity : 0.76287         
##          Pos Pred Value : 0.25847         
##          Neg Pred Value : 0.95748         
##              Prevalence : 0.10437         
##          Detection Rate : 0.07403         
##    Detection Prevalence : 0.28641         
##       Balanced Accuracy : 0.73609         
##                                           
##        'Positive' Class : 1               
## 

From the confusion matrix we can see that, across the 824 test observations, the model predicted correctly 76% of the time (accuracy), identified those who will subscribe 71% of the time (sensitivity), and identified those who will not 76% of the time (specificity).
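
As a sanity check, these rates can be recomputed directly from the cells of the confusion matrix above:

# Recompute the reported rates from the confusion-matrix cells.
TN <- 563; FN <- 25; FP <- 175; TP <- 61
(TP + TN) / (TP + TN + FP + FN)  # accuracy    ~ 0.757
TP / (TP + FN)                   # sensitivity ~ 0.709
TN / (TN + FP)                   # specificity ~ 0.763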

Next we use the variables selected by the backward stepwise regression to build an LDA model, to determine which model best predicts the data.

We now split the data for the LDA model, which predicts whether someone will subscribe, balancing the variable y so that it has an equal number of yes and no observations.

# Select only relevant variables plus y
selected_vars <- c("emp.var.rate", "cons.price.idx", "cons.conf.idx", 
                   "contact", "month", "poutcome", "campaign", "y")

balanced_data <- bank_additional[selected_vars]

# Convert categorical variables to factors
balanced_data$contact <- as.factor(balanced_data$contact)
balanced_data$month <- as.factor(balanced_data$month)
balanced_data$poutcome <- as.factor(balanced_data$poutcome)

# Create a 50-50 split for y
yes_data <- balanced_data[balanced_data$y == "yes", ]
no_data <- balanced_data[balanced_data$y == "no", ]

set.seed(2024)
min_n <- min(nrow(yes_data), nrow(no_data))

balanced_yes <- yes_data[sample(1:nrow(yes_data), min_n), ]
balanced_no <- no_data[sample(1:nrow(no_data), min_n), ]

# Combine and shuffle
balanced_data <- rbind(balanced_yes, balanced_no)
balanced_data <- balanced_data[sample(1:nrow(balanced_data)), ]

# Make training and testing data sets
set.seed(2024)
index_2 <- sample(1:nrow(balanced_data), nrow(balanced_data)*0.80) # using an 80/20 split
lda_train = balanced_data[index_2,]
lda_test = balanced_data[-index_2,]

# Check the distribution of 'y'
cat("Distribution of no and yes in y:\n")
## Distribution of no and yes in y:
print(table(bank_additional$y))
## 
##   no  yes 
## 3668  451
# Verify the new distribution
cat("New balance between No and Yes for LDA model: \n")
## New balance between No and Yes for LDA model:
print(table(balanced_data$y))
## 
##  no yes 
## 451 451

We now have balanced_data, which contains only 902 observations and 8 variables (contact, month, campaign, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, and the response y). We then split balanced_data into training and testing sets so we can evaluate the model's predictions.
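
A quick check (a sketch, output not shown) confirms that the 80/20 split roughly preserves the 50/50 balance:

# Verify the class balance carries over into the training and testing sets.
table(lda_train$y)
table(lda_test$y)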

lda.model = lda(formula = y ~ contact + month + campaign + poutcome + emp.var.rate + 
    cons.price.idx + cons.conf.idx, data = lda_train)
lda.model
## Call:
## lda(y ~ contact + month + campaign + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx, data = lda_train)
## 
## Prior probabilities of groups:
##        no       yes 
## 0.5006935 0.4993065 
## 
## Group means:
##     contacttelephone  monthaug    monthdec  monthjul  monthjun   monthmar
## no         0.3933518 0.1440443 0.002770083 0.1662050 0.1329640 0.00000000
## yes        0.1694444 0.1416667 0.030555556 0.1194444 0.1555556 0.06111111
##      monthmay   monthnov   monthoct    monthsep campaign poutcomenonexistent
## no  0.3684211 0.10803324 0.01108033 0.005540166 2.603878           0.8947368
## yes 0.2083333 0.09444444 0.05833333 0.055555556 1.955556           0.6555556
##     poutcomesuccess emp.var.rate cons.price.idx cons.conf.idx
## no       0.01108033    0.3022161       93.62327     -40.61994
## yes      0.19722222   -1.2044444       93.40978     -39.65583
## 
## Coefficients of linear discriminants:
##                             LD1
## contacttelephone    -0.08523481
## monthaug             1.06545514
## monthdec             0.87666010
## monthjul             0.89259700
## monthjun             0.49739861
## monthmar             1.78760315
## monthmay            -0.02384648
## monthnov             0.22667641
## monthoct             0.65495384
## monthsep             0.81160916
## campaign            -0.08608671
## poutcomenonexistent  0.36926337
## poutcomesuccess      0.70668850
## emp.var.rate        -0.82877364
## cons.price.idx       1.04165663
## cons.conf.idx        0.01185530

The group means show how the predictors differ between those who do not subscribe and those who do: for example, people contacted by telephone are less likely to subscribe, while contacts made in October are more likely to end in a subscription.
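
To make that comparison systematic, here is a small sketch that ranks the predictors by the gap between the two group means (note the raw differences mix scales, dummies versus continuous predictors):

# Rank predictors by the difference between the "yes" and "no" group means;
# positive gaps lean toward subscribing, negative gaps away from it.
gm <- lda.model$means
sort(gm["yes", ] - gm["no", ], decreasing = TRUE)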

# Load necessary library
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Predicting
predictions.lda = predict(lda.model, lda_test)

# Extract the predicted probabilities for class "yes"
lda_probabilities <- predictions.lda$posterior[, "yes"]

# Create ROC curve
roc_curve <- roc(lda_test$y, lda_probabilities, levels = c("no", "yes"), direction = "<")

# Plot the ROC curve
plot(roc_curve, col = "blue", main = "ROC Curve for LDA Model")

# Calculate and print AUC value
auc_value <- auc(roc_curve)
cat("AUC Value: ", auc_value, "\n")
## AUC Value:  0.7818071
text(0.5, 0.5, paste("AUC:", round(auc_value, 3)))

We now build the confusion matrix for the LDA model.

# Make a confusion matrix for the LDA predictions to compare accuracy.
# predictions.lda$class holds the predicted classes, which we compare with the actual labels.

caret::confusionMatrix(as.factor(predictions.lda$class), as.factor(lda_test$y), positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  79  35
##        yes 11  56
##                                           
##                Accuracy : 0.7459          
##                  95% CI : (0.6759, 0.8075)
##     No Information Rate : 0.5028          
##     P-Value [Acc > NIR] : 1.971e-11       
##                                           
##                   Kappa : 0.4924          
##                                           
##  Mcnemar's Test P-Value : 0.000696        
##                                           
##             Sensitivity : 0.6154          
##             Specificity : 0.8778          
##          Pos Pred Value : 0.8358          
##          Neg Pred Value : 0.6930          
##              Prevalence : 0.5028          
##          Detection Rate : 0.3094          
##    Detection Prevalence : 0.3702          
##       Balanced Accuracy : 0.7466          
##                                           
##        'Positive' Class : yes             
## 

Here we can see that the LDA model is accurate only 75% of the time, correctly identifying those who will subscribe 62% of the time and those who will not 88% of the time.

Results

The GLM model had higher accuracy, better-balanced sensitivity and specificity, and a slightly higher AUC (a measure of how well the model separates the two classes), while the LDA model had slightly lower accuracy, a more unbalanced sensitivity and specificity (weighted toward specificity), and a slightly lower AUC.
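
For reference, here are the test-set metrics reported above side by side (values copied from the two confusion matrices and AUC calculations):

# Side-by-side summary of the test metrics reported above.
data.frame(model       = c("GLM", "LDA"),
           accuracy    = c(0.757, 0.746),
           sensitivity = c(0.709, 0.615),
           specificity = c(0.763, 0.878),
           auc         = c(0.798, 0.782))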

Which confusion matrix is preferable is left to the user's or company's discretion. The GLM model is better at predicting NO, but it produces far more false positives, the undesirable outcome: we predict people will subscribe who actually won't, and therefore spend extra money advertising to the wrong market segment.

The LDA model's confusion matrix strikes a better balance between actual yeses and nos and yields fewer false positives (people we think will subscribe but who won't). In exchange it increases the false negatives, meaning we predict people won't subscribe when they actually will, which is the less costly mistake.

Both models should be used to check and clarify each other, but as it stands the GLM model looks better than the LDA model: if we changed the seed or introduced new data, the GLM model would likely give results close to those above, while the LDA model, fit on a much smaller balanced sample, would show more variation.
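
One way to test that stability claim (a sketch, not part of the original analysis) is to refit the final GLM over several random splits and look at the spread of test AUCs; a narrow spread supports the claim, and the same loop could be repeated for the LDA model:

# Refit the final GLM over several random splits and collect the test AUCs.
# Assumes every month level appears in each training split, which holds in
# practice for this dataset at an 80/20 split.
aucs <- sapply(1:5, function(s) {
  set.seed(s)
  idx   <- sample(1:nrow(bank_additional), nrow(bank_additional) * 0.80)
  train <- bank_additional[idx, ]
  test  <- bank_additional[-idx, ]
  fit   <- glm(y ~ contact + month + campaign + poutcome + emp.var.rate +
                 cons.price.idx + cons.conf.idx, family = binomial, data = train)
  p     <- predict(fit, newdata = test, type = "response")
  unlist(slot(performance(prediction(p, as.numeric(test$y) - 1), "auc"), "y.values"))
})
summary(aucs)  # a narrow spread indicates stable out-of-sample performance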