Before proceeding with the analysis, it is critical to ensure the integrity of the Bank Marketing dataset. As per the project requirements, I will perform a programmatic audit of the data to identify missing values, out-of-range entries, and hidden placeholders without relying on visual inspection of the raw file.
1. Structure and Data Type Consistency
First, I verify that the 11,162 observations have been read correctly and that each variable has an appropriate data type. For instance, balance and age must be numeric to support statistical operations, while the target variable deposit, which is read in as a character column, should be treated as a factor.
# Check internal structure
bank_data <- read.csv("bank.csv")
str(bank_data)
## 'data.frame': 11162 obs. of 17 variables:
## $ age : int 59 56 41 55 54 42 56 60 37 28 ...
## $ job : chr "admin." "admin." "technician" "services" ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education: chr "secondary" "secondary" "secondary" "secondary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2343 45 1270 2476 184 0 830 545 1 5090 ...
## $ housing : chr "yes" "no" "yes" "yes" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 6 6 6 6 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 1042 1467 1389 579 673 562 1201 1030 608 1297 ...
## $ campaign : int 1 1 1 1 2 2 1 1 1 3 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ deposit : chr "yes" "yes" "yes" "yes" ...
knitr::kable(head(bank_data))
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | deposit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | admin. | married | secondary | no | 2343 | yes | no | unknown | 5 | may | 1042 | 1 | -1 | 0 | unknown | yes |
| 56 | admin. | married | secondary | no | 45 | no | no | unknown | 5 | may | 1467 | 1 | -1 | 0 | unknown | yes |
| 41 | technician | married | secondary | no | 1270 | yes | no | unknown | 5 | may | 1389 | 1 | -1 | 0 | unknown | yes |
| 55 | services | married | secondary | no | 2476 | yes | no | unknown | 5 | may | 579 | 1 | -1 | 0 | unknown | yes |
| 54 | admin. | married | tertiary | no | 184 | no | no | unknown | 5 | may | 673 | 2 | -1 | 0 | unknown | yes |
| 42 | management | single | tertiary | no | 0 | yes | yes | unknown | 5 | may | 562 | 2 | -1 | 0 | unknown | yes |
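Since str() reports every categorical variable as chr, the conversion to factors still has to be done explicitly. The following is a minimal base R sketch of one way to do it, not necessarily the approach used later in this report. The data frame toy and its rows are hypothetical stand-ins for bank_data so that the snippet runs on its own.

```r
# Toy stand-in for bank_data (hypothetical rows, not read from bank.csv)
toy <- data.frame(
  age     = c(59L, 56L, 41L),
  job     = c("admin.", "admin.", "technician"),
  deposit = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert each of them to a factor
chr_cols <- vapply(toy, is.character, logical(1))
toy[chr_cols] <- lapply(toy[chr_cols], factor)

# Report the resulting class of every column
col_classes <- vapply(toy, function(x) class(x)[1], character(1))
print(col_classes)
```

Applying the same two conversion lines to bank_data would leave the integer columns untouched and turn deposit into a two-level factor.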
# Confirm that no 'Age' values are logically impossible (e.g., < 18 or > 96)
age_range <- range(bank_data$age)
print(paste("Age range detected:", age_range[1], "to", age_range[2]))
## [1] "Age range detected: 18 to 95"
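The same idea can be extended from age to the other numeric columns. The sketch below counts out-of-range values per column in base R. The bounds list is an assumption made for illustration (for example, day of month between 1 and 31), and the toy data frame contains deliberately invalid, made-up values so that the check has something to flag.

```r
# Toy data with one deliberately invalid value per column (hypothetical)
toy <- data.frame(
  age      = c(59L, 17L, 41L),
  day      = c(5L, 32L, 6L),
  duration = c(1042L, 579L, -3L)
)

# Assumed plausibility bounds as c(lower, upper); adjust to the study's definitions
bounds <- list(
  age      = c(18, 100),
  day      = c(1, 31),
  duration = c(0, Inf)
)

# Count the values that fall outside [lower, upper] for each listed column
out_of_range <- vapply(names(bounds), function(col) {
  x <- toy[[col]]
  sum(x < bounds[[col]][1] | x > bounds[[col]][2], na.rm = TRUE)
}, integer(1))

print(out_of_range)
```

A result of zero for every column would indicate that no value violates the assumed bounds.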
2. Identifying Missing Values and “Unknown” Placeholders
A standard check for NA values is performed; however, this specific dataset is known to use the string “unknown” for missing categorical data. I will calculate the frequency of these placeholders to determine if any column is too sparse to be useful.
library(tidyverse)
# Standard NA check
na_count <- colSums(is.na(bank_data))
# Placeholder check for categorical 'unknowns'
# We focus on categorical columns to see where data is missing
unknown_summary <- bank_data %>%
  summarise(across(where(is.character), ~ sum(.x == "unknown"))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Unknown_Count")
list(Total_NAs = na_count, Placeholder_Summary = unknown_summary)
## $Total_NAs
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##   deposit 
##         0 
##
## $Placeholder_Summary
## # A tibble: 10 × 2
## Variable Unknown_Count
## <chr> <int>
## 1 job 70
## 2 marital 0
## 3 education 497
## 4 default 0
## 5 housing 0
## 6 loan 0
## 7 contact 2346
## 8 month 0
## 9 poutcome 8326
## 10 deposit 0
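Raw counts are easier to judge as shares of the sample, and the numeric column pdays deserves the same scrutiny because its value of -1 in the rows shown earlier appears to act as a placeholder for clients with no previous contact. The sketch below computes both percentages in base R on a hypothetical toy data frame. The values are made up, and treating -1 as a sentinel is an assumption to confirm against the dataset documentation.

```r
# Toy stand-in for bank_data (hypothetical values)
toy <- data.frame(
  job      = c("admin.", "unknown", "technician", "services"),
  poutcome = c("unknown", "unknown", "unknown", "success"),
  pdays    = c(-1L, -1L, -1L, 92L),
  stringsAsFactors = FALSE
)

# Names of the character columns
chr_cols <- names(toy)[vapply(toy, is.character, logical(1))]

# Percentage of "unknown" entries in each character column
unknown_pct <- vapply(
  toy[chr_cols],
  function(x) 100 * mean(x == "unknown"),
  numeric(1)
)

# Percentage of rows where pdays holds the assumed -1 sentinel
pdays_sentinel_pct <- 100 * mean(toy$pdays == -1)

print(round(unknown_pct, 1))
print(pdays_sentinel_pct)
```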
3. Cleaning Decisions
The verification reveals that while there are zero explicit NA values, the poutcome (previous outcome) variable contains 8,326 “unknown” entries.
Because this accounts for roughly 75% of the 11,162 observations, I will exclude poutcome from my final regression model: a predictor that is unknown for most of the sample is too sparse to be useful.
However, job and education have comparatively few unknowns (70 and 497 rows, about 0.6% and 4.5% of the data), so those rows will be retained to preserve the sample size.
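These decisions could be applied as in the following base R sketch. It assumes that “unknown” is kept as its own factor level for the retained variables, which is one possible reading of "retained" and not necessarily the encoding used in the final model. The toy data frame and the name toy_clean are hypothetical.

```r
# Toy stand-in for bank_data (hypothetical rows)
toy <- data.frame(
  job      = c("admin.", "unknown", "technician"),
  poutcome = c("unknown", "unknown", "success"),
  deposit  = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Drop the sparse column; every row is kept
toy_clean <- toy[, setdiff(names(toy), "poutcome"), drop = FALSE]

# Keep "unknown" as an explicit factor level so those rows stay in the model
toy_clean$job <- factor(toy_clean$job)
toy_clean$deposit <- factor(toy_clean$deposit)

str(toy_clean)
```

Keeping “unknown” as a level preserves the sample size, at the cost of one extra coefficient per affected variable in a regression.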
In the Bank Marketing dataset, the job variable is one of the most diverse. By analyzing the subscription rate across different professions, we can start to answer our research question about which demographics are most likely to convert.