The dataset is derived from a Portuguese bank's direct marketing campaigns, which were conducted over the phone. Often, several calls to the same client were needed to determine whether he or she would subscribe to the product (a bank term deposit). The classification goal is to predict whether or not a client will open a term deposit with the bank.
# Load the required packages and the data
library(tidyverse)
bank <- read.csv("bank-full.csv", sep = ';')
# Summary statistics for each variable
summary(bank)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y
## Length:45211
## Class :character
## Mode :character
# Column types and example values
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
As we can see, the dataset contains 17 attributes and 45,211 instances, and the majority of the features are categorical. Using glimpse(), we can confirm that every column has the correct data type. The data does, however, contain some "unknown" entries, which will be treated as missing values. First, let's check whether the data has any duplicate or missing values.
Replacing the "unknown" entries with NA, then checking whether the data has any missing values.
bank <- bank %>%
  mutate(across(.cols = everything(),
                .fns = ~replace(., . == "unknown", NA)))
# Count missing values per column
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 288 0 1857 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 13020 0 0 0 0 0 0 36959
## y
## 0
The "contact" and "poutcome" fields contain a large number of missing values; in "poutcome", most of the values are unknown. I will therefore drop these variables from the analysis. The "day" and "month" features also concern me, because there is no indication of whether the data was gathered within a single year. Since exact dates are not given, these attributes would be meaningless if the data spans multiple years, so "day" and "month" will be dropped as well. The "job" and "education" variables also have some missing entries, but they account for less than 5% of the data, so I will only remove the observations with missing values in those two columns (this happens via na.omit() in the preprocessing section below). The column removals are sketched next.
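A minimal sketch of these column removals using dplyr's select():
# Drop variables with heavy missingness or ambiguous meaning
bank <- bank %>%
  select(-contact, -poutcome, -day, -month)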
Now that the missing values have been handled, let's check whether the data contains any duplicate rows. In general, I don't believe duplicate observations should be kept in a dataset like this: each observation should be distinct, especially when the data includes attributes such as the account balance and the number of days since the last campaign contact. The count below was produced along the lines of the following sketch.
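The exact call used to produce the output below is not shown in the original, so this is an assumed reconstruction:
# Count fully duplicated rows
sum(duplicated(bank))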
## [1] 1
There is only one duplicated row, and it will be removed, e.g. with dplyr's distinct() as sketched below.
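A one-line removal of exact duplicates (hypothetical, since the original removal code is not shown):
# Keep only distinct rows
bank <- distinct(bank)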
# Checking correlation among numeric variables
numeric_data <- select_if(bank, is.numeric)
correlations <- cor(numeric_data)
print(correlations)
## age balance duration campaign pdays
## age 1.000000000 0.097616881 -0.0049477989 0.004047283 -0.023236544
## balance 0.097616881 1.000000000 0.0200491909 -0.016249772 0.003923373
## duration -0.004947799 0.020049191 1.0000000000 -0.083117961 -0.002404361
## campaign 0.004047283 -0.016249772 -0.0831179607 1.000000000 -0.088918971
## pdays -0.023236544 0.003923373 -0.0024043612 -0.088918971 1.000000000
## previous 0.001106066 0.016561228 0.0002943088 -0.032380311 0.452951784
## previous
## age 0.0011060662
## balance 0.0165612281
## duration 0.0002943088
## campaign -0.0323803106
## pdays 0.4529517840
## previous 1.0000000000
The correlation matrix shows that the numeric variables in the Portuguese bank's marketing data are, for the most part, only weakly related. There is a modest positive correlation between age and balance (about 0.10), suggesting that older clients tend to hold slightly higher account balances, consistent with financial stability accumulated over time. Call duration has a weak negative correlation with the number of contacts per campaign (about -0.08), hinting that clients contacted more often tend to have shorter calls, possibly reflecting an efficiency-driven communication strategy or diminishing engagement over successive contacts. The number of days since the last contact (pdays) and the number of contacts before the current campaign (previous) are both negatively correlated with the campaign variable, implying that clients contacted less frequently or less recently in the past are targeted more intensively in the current campaign. The one clearly substantial correlation is between pdays and previous (about 0.45), which is expected, since both variables describe contact history from earlier campaigns. Overall, the weak pairwise correlations suggest little multicollinearity among the numeric predictors, which is convenient for modeling, while the contact-history patterns highlight opportunities to tailor future campaigns to client engagement history.
# Filtering numeric variables and plotting histograms
bank %>%
select_if(is.numeric) %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = 'blue', color = 'black') +
facet_wrap(~variable, scales = 'free') +
labs(title = "Distribution of Numeric Variables") +
  theme_minimal()
The histograms of the numeric variables reveal several patterns. The age distribution peaks between 30 and 40 years, indicating a client base composed primarily of middle-aged individuals, a group that may be more financially stable and interested in products like term deposits. The balance histogram is markedly right-skewed: most clients hold low to moderate balances, with a few outliers at the high end, highlighting the economic diversity of the client base. The campaign histogram shows that most clients are contacted one to three times, suggesting a strategy of minimal contact per client, possibly to avoid over-solicitation. Duration is similarly right-skewed; most calls are brief, under 500 seconds, indicating either efficient client interactions or quick assessments of client interest. The pdays histogram has a large spike at -1 (the value that encodes clients never contacted in a previous campaign), and the previous histogram likewise peaks at 0; both underline a focus on new or previously uncontacted clients, suggesting either an influx of new clients or a missed opportunity in follow-up engagement. These findings point to areas for strategic adjustment, particularly in re-engaging existing clients and optimizing contact strategies for future campaigns.
# Filtering numeric variables
numeric_vars <- select_if(bank, is.numeric)
# Plotting boxplots for each numeric variable
numeric_vars %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = variable, y = value, fill = variable)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Boxplots for Numeric Variables", x = "Variable", y = "Value") +
  guides(fill = "none")
The boxplots of the numeric variables highlight the presence of outliers across the different metrics. Notably, 'balance' shows a large number of outliers, with values reaching up to about 100,000, far above the majority clustered near zero; while most clients maintain low balances, a few hold exceptionally high ones and may be influential or high-net-worth individuals. The 'age' variable has a few outliers at the upper end, indicating some clients are considerably older than the bank's typical clientele. The 'campaign' and 'duration' outliers correspond to cases where the number of contacts or the call length was exceptionally high, potentially indicating intense follow-up or unusually long conversations. The 'pdays' and 'previous' variables, which track days since the last campaign contact and the number of contacts before the current campaign respectively, also display a few extreme values, reflecting rare cases of frequent previous contact or a long gap since the last contact. These outliers represent atypical cases and could noticeably affect the analysis and model performance if not appropriately managed or excluded.
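If capping were desired, one simple option is IQR-based winsorizing. The helper below is a hypothetical sketch, not something applied in this analysis (tree-based models such as Random Forest are fairly robust to outliers anyway):
# Hypothetical IQR-based capping (winsorizing) for a skewed variable
cap_iqr <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])               # the 1.5 * IQR rule
  pmin(pmax(x, q[1] - fence), q[2] + fence)  # clamp values to the fences
}
# Example (not applied here):
# bank$balance <- cap_iqr(bank$balance)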
To train the model, the two machine learning algorithms I selected are Logistic Regression and Random Forest. Logistic regression is well suited to binary outcomes and is straightforward and interpretable, whereas Random Forest handles large data with many variables; it is not as easily interpretable, but it is very powerful.
Logistic Regression:
Pros: Easy to implement and understand.
Cons: Assumes linearity between dependent and independent variables.
Random Forest:
Pros: Does not assume linear relationships; can handle complex interactions and classification.
Cons: More computationally intensive and can overfit if not tuned properly.
For this problem, I would recommend the Random Forest algorithm. I chose it for its robustness to outliers and its ability to handle imbalanced datasets effectively. Random Forest performs well when the data includes categorical variables and can model complex nonlinear relationships without the need for extensive data preprocessing.
Yes, the data includes a label, y, which indicates whether the client subscribed to a term deposit (yes or no). This is a classic setup for supervised learning and strongly influences the choice of algorithm. The presence of a clear binary label directs the selection towards classification algorithms, with Random Forest being particularly advantageous due to its ensemble approach, which enhances prediction accuracy and stability.
Random Forest is particularly well-suited to this dataset, which features a mix of numeric and categorical variables. This algorithm can inherently handle such mixed data types and is less sensitive to the scale of features, meaning that minimal preprocessing is required. Additionally, Random Forest can manage the high dimensionality and potential multicollinearity in the data due to its feature selection capability at each split in the decision trees making up the forest.
If the dataset were smaller, containing fewer than 1,000 records, I might switch to a simpler model such as Logistic Regression. The rationale is to avoid overfitting, which is more likely with complex models like Random Forest when data is limited. Logistic Regression, being simpler and more interpretable, requires less data to capture the underlying trends without overfitting. With a small dataset, the computational and memory advantages of simpler models also become more significant, making Logistic Regression a practical choice for faster training and easier interpretation. A baseline along these lines is sketched below.
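For reference, a logistic regression baseline is a one-liner in base R. The sketch below is illustrative rather than part of the analysis; it assumes the cleaning steps already shown and the preprocessing in the next section have been applied, and that the outcome y has been converted to a factor:
# Hypothetical baseline: logistic regression on all predictors
bank$y <- factor(bank$y)                  # "no"/"yes" becomes a 2-level factor
logit_fit <- glm(y ~ ., data = bank, family = binomial)
summary(logit_fit)                        # coefficients, significance, deviance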
# Handling missing values
bank <- na.omit(bank) # Remove the remaining rows with NAs (in job and education)
# Feature engineering
bank$interaction_term <- bank$age * bank$balance # Example interaction term
# Data transformation: scale() returns a matrix, so coerce the result to a vector
bank$normalized_age <- as.numeric(scale(bank$age, center = TRUE, scale = TRUE))
# Handling imbalanced data: check the class balance of the outcome
table(bank$y)
##
## no yes
## 38171 5021
The output shows that the data is imbalanced, with a significant skew toward the "no" class (38,171 "no" versus 5,021 "yes"). In predictive modeling, such imbalance can bias the model toward predicting "no", since it is the more common outcome in the training data. Handling it could involve either oversampling the minority class ("yes") or undersampling the majority class ("no"), as sketched below.
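As one illustration, a simple random undersampling of the majority class can be done with dplyr. This is a sketch rather than a tuned resampling strategy; packages such as ROSE or themis offer more principled alternatives:
# Undersample the majority class so both classes have equal size
set.seed(123)
yes_cases <- bank %>% filter(y == "yes")
no_cases  <- bank %>% filter(y == "no") %>% slice_sample(n = nrow(yes_cases))
bank_balanced <- bind_rows(yes_cases, no_cases)
table(bank_balanced$y) # both classes now have 5,021 rows
A Random Forest could then be fit on the balanced data. The call below is a minimal sketch using the randomForest package (character predictors must first be converted to factors):
library(randomForest)
bank_balanced <- bank_balanced %>%
  mutate(across(where(is.character), factor)) # randomForest requires factors
set.seed(123)
rf_fit <- randomForest(y ~ ., data = bank_balanced, ntree = 200)
rf_fit # prints the OOB error estimate and confusion matrix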