Introduction

The dataset comes from the direct marketing campaigns of a Portuguese bank. The campaigns were conducted over the phone, and more than one contact with the same client was often required to determine whether the client had subscribed to the product (a bank term deposit). The classification goal is to predict whether or not a client will open a term deposit with the bank.

Exploratory Data Analysis

Initial Exploration

# Load the required packages
library(tidyverse)   # dplyr, tidyr, ggplot2, and the pipe operator
library(corrplot)    # correlation plot used later

# Load the data
bank <- read.csv("bank-full.csv", sep = ';')

# View the first few rows of the dataset
head(bank)
# Summary of the data
summary(bank)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Check data types
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital   <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

As we can see, the dataset contains 17 attributes and 45,211 instances. The majority of the features are categorical. Using glimpse(), we can confirm that every column is read in with an appropriate data type. The data does contain some "unknown" entries, however, and these will be treated as missing values. First, let's check whether the data has any duplicate or missing values.

Missing and Duplicated Values

Replacing “unknowns” with NA, then checking whether the data has any missing values.

bank <- bank %>% 
  mutate(across(.cols = everything(),
                .fns = ~replace(., . == "unknown", NA)))

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0       288         0      1857         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##     13020         0         0         0         0         0         0     36959 
##         y 
##         0

There are many missing values in the "contact" and "poutcome" fields; in fact, most of the values in "poutcome" are unknown, so I will drop both variables from the analysis. The "day" and "month" features are also a concern because there is no indication of whether the data was collected within a single year; without exact dates, these attributes would be meaningless if the observations span multiple years, so "day" and "month" will be removed as well. The "job" and "education" variables also have some missing entries, but these account for fewer than 5% of the observations, so I will simply drop the rows where either of these two features is missing.

bank <- bank %>% 
  select(-c(contact, poutcome, day, month)) %>% 
  drop_na()

Now that the missing values have been handled, let's check whether the data contains any duplicate rows. Redundant records do not belong in a dataset like this: each observation should be distinct, especially when the data includes attributes such as the account balance and the number of days since the last campaign contact.

sum(duplicated(bank))
## [1] 1

There is only one duplicate row, and it will be removed.

bank <- bank[!duplicated(bank), ]

Correlation Analysis

# Checking correlation among numeric variables
numeric_data <- select_if(bank, is.numeric)
correlations <- cor(numeric_data)
print(correlations)
##                   age      balance      duration     campaign        pdays
## age       1.000000000  0.097616881 -0.0049477989  0.004047283 -0.023236544
## balance   0.097616881  1.000000000  0.0200491909 -0.016249772  0.003923373
## duration -0.004947799  0.020049191  1.0000000000 -0.083117961 -0.002404361
## campaign  0.004047283 -0.016249772 -0.0831179607  1.000000000 -0.088918971
## pdays    -0.023236544  0.003923373 -0.0024043612 -0.088918971  1.000000000
## previous  0.001106066  0.016561228  0.0002943088 -0.032380311  0.452951784
##               previous
## age       0.0011060662
## balance   0.0165612281
## duration  0.0002943088
## campaign -0.0323803106
## pdays     0.4529517840
## previous  1.0000000000
# Visualizing correlations
corrplot(correlations, method = 'circle')

The correlation plot for the Portuguese bank's marketing campaign data shows that most of the numeric variables are only weakly related. There is a modest positive correlation between age and balance, suggesting that older clients tend to hold somewhat higher account balances, consistent with financial stability accumulated over time. Call duration has a small negative correlation with the number of contacts per campaign, so clients contacted more often within a campaign tend to have shorter calls, possibly reflecting an efficiency-driven contact strategy or diminishing engagement over successive contacts. Both the number of days since the last contact (pdays) and the number of contacts before the current campaign (previous) are slightly negatively correlated with the campaign variable, implying that clients contacted less frequently or less recently in the past are targeted more intensively in the current campaign. The strongest relationship in the matrix is the positive correlation of about 0.45 between pdays and previous, which is expected because both variables only take meaningful values for clients who were contacted in earlier campaigns. These correlations provide useful insight into client behavior and campaign effectiveness, and suggest tailoring future interactions based on client engagement history and demographic factors.
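For reference, the same information can be pulled out programmatically rather than read off the plot. The short sketch below is a minimal illustration that assumes the correlations matrix computed above and uses an arbitrary cutoff of 0.1; it flattens the matrix and lists the stronger pairs, which surfaces the pdays–previous relationship directly.

# Flatten the correlation matrix and keep the stronger pairs
# (0.1 is an arbitrary illustrative threshold)
cor_pairs <- as.data.frame(as.table(correlations))
names(cor_pairs) <- c("var1", "var2", "r")
cor_pairs %>%
  filter(as.character(var1) < as.character(var2),   # keep each pair once
         abs(r) > 0.1) %>%
  arrange(desc(abs(r)))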

Distribution of Variables

# Filtering numeric variables and plotting histograms
bank %>%
  select_if(is.numeric) %>%
  gather(key = "variable", value = "value") %>%
  ggplot(aes(x = value)) +
    geom_histogram(bins = 30, fill = 'blue', color = 'black') +
    facet_wrap(~variable, scales = 'free') +
    labs(title = "Distribution of Numeric Variables") +
    theme_minimal()

The histograms of the numeric variables reveal several informative distributions. The age distribution peaks between 30 and 40 years, indicating a client base composed mainly of middle-aged individuals, who are potentially more financially stable and more likely to be interested in products such as term deposits. The balance histogram is markedly right-skewed: most clients hold low to moderate balances, with a few outliers at the high end. The campaign histogram shows that most clients are contacted one to three times, suggesting a strategy of minimal contact per client, possibly to avoid over-solicitation. Duration is similarly right-skewed; most calls are brief, under roughly 500 seconds, indicating either efficient interactions or quick assessments of client interest. The pdays histogram has a large spike at -1, the value used when a client was not contacted in a previous campaign, and the previous histogram likewise peaks at 0. Together these show that the campaign focuses largely on clients with no prior contact, which may reflect an influx of new clients or a missed opportunity for follow-up engagement. These findings point to areas for strategic adjustment, particularly in strengthening engagement with existing clients and optimizing contact strategies for future campaigns.
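Because balance is so strongly right-skewed, a log-scale view can make its shape easier to read. The snippet below is purely illustrative: it shifts the variable by its minimum (balances can be negative) so that a log10 axis is usable.

# Re-plot balance on a log10 scale after shifting it to be positive
bank %>%
  mutate(balance_shifted = balance - min(balance) + 1) %>%
  ggplot(aes(x = balance_shifted)) +
    geom_histogram(bins = 30, fill = 'blue', color = 'black') +
    scale_x_log10() +
    labs(title = "Balance (shifted, log10 scale)",
         x = "balance - min(balance) + 1") +
    theme_minimal()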

Outliers Detection

# Filtering numeric variables
numeric_vars <- select_if(bank, is.numeric)

# Plotting boxplots for each numeric variable
numeric_vars %>%
  gather(key = "variable", value = "value") %>%
  ggplot(aes(x = variable, y = value, fill = variable)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = "Boxplots for Numeric Variables", x = "Variable", y = "Value") +
    guides(fill = "none")

The boxplots highlight the presence of outliers across the numeric variables. Notably, balance shows a large number of outliers, with values reaching up to roughly 100,000, far above the majority of observations clustered near zero; while most clients maintain low balances, a few hold exceptionally high ones and may be influential, high-net-worth individuals. The age variable has a few outliers at the higher end, indicating that some clients are considerably older than the bank's typical clientele. The campaign, duration, pdays, and previous variables show outliers on a smaller scale: for campaign and duration, these are instances where the number of contacts or the call length was exceptionally high, potentially indicating intense follow-up or unusually long conversations, while for pdays and previous they reflect rare cases of frequent prior contact or a long gap since the last contact. These outliers represent atypical cases and could significantly affect the analysis and model performance if not appropriately handled or excluded.
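To put rough numbers on what the boxplots show, the points flagged by the usual 1.5 × IQR rule (the same rule geom_boxplot() uses for its whiskers) can be counted per variable. This is only a quick sketch, not a formal outlier treatment.

# Count points outside the 1.5 * IQR whiskers for each numeric variable
numeric_vars %>%
  gather(key = "variable", value = "value") %>%
  group_by(variable) %>%
  summarise(n_outliers = sum(value < quantile(value, 0.25) - 1.5 * IQR(value) |
                             value > quantile(value, 0.75) + 1.5 * IQR(value)))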

Algorithm Selection

Select two or more machine learning algorithms presented so far that could be used to train a model

The two machine learning algorithms I selected to train the model are Logistic Regression and Random Forest. Logistic regression is well suited to binary outcomes and is straightforward and interpretable, whereas Random Forest handles large datasets with many variables; it is not as easily interpretable but is very powerful.

What are the pros and cons of each algorithm you selected?

Logistic Regression:

  • Pros: Easy to implement and understand.

  • Cons: Assumes linearity between dependent and independent variables.

Random Forest:

  • Pros: Does not assume linear relationships; can handle complex interactions and classification.

  • Cons: More computationally intensive and can overfit if not tuned properly.

Which algorithm would you recommend, and why?

For this model, I would recommend the Random Forest algorithm because of its robustness to outliers and its ability to handle imbalanced datasets effectively. Random Forest performs well when the data includes categorical variables and can model complex nonlinear relationships without extensive data preprocessing.

Are there labels in your data? Did that impact your choice of algorithm?

Yes, the data includes a label, y, which indicates whether the client subscribed to a term deposit (yes or no). This is a classic setup for supervised learning and strongly influences the choice of algorithm. The presence of a clear binary label directs the selection towards classification algorithms, with Random Forest being particularly advantageous due to its ensemble approach, which enhances prediction accuracy and stability.

How does your choice of algorithm relate to the dataset?

Random Forest is particularly well-suited to this dataset, which features a mix of numeric and categorical variables. The algorithm can inherently handle such mixed data types and is less sensitive to the scale of features, so minimal preprocessing is required. Additionally, Random Forest copes well with higher dimensionality and potential multicollinearity, because only a random subset of features is considered at each split in the trees that make up the forest.

Would your choice of algorithm change if there were fewer than 1,000 data records, and why?

If the dataset were smaller, containing fewer than 1,000 records, I might switch to a simpler model such as Logistic Regression. The rationale is to avoid overfitting, which is more likely with complex models like Random Forest when data is limited. Logistic Regression, being simpler and more interpretable, requires less data to capture the underlying trends without overfitting. With a smaller dataset, the computational and memory advantages of a simpler model also become more noticeable, making Logistic Regression a practical choice for faster training and easier interpretation.
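For reference, a minimal sketch of how the two candidate models could be fit is shown below. It assumes the randomForest package is installed; character columns are converted to factors first, and no tuning or train/test split is shown here.

# Convert character columns to factors so both models treat them as categorical
library(randomForest)

bank_model <- bank %>%
  mutate(across(where(is.character), as.factor))

# Logistic regression: interpretable baseline for the binary outcome y
logit_fit <- glm(y ~ ., data = bank_model, family = binomial)

# Random forest: handles mixed variable types and nonlinear relationships
set.seed(123)
rf_fit <- randomForest(y ~ ., data = bank_model, ntree = 200)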

Pre-Processing

# Handling missing values
bank <- na.omit(bank)  # Removing rows with NA values

# Feature engineering
bank$interaction_term <- bank$age * bank$balance  # Example interaction term

# Data transformation
bank$normalized_age <- scale(bank$age, center = TRUE, scale = TRUE)

# Handling imbalanced data
table(bank$y)  # Check class imbalance
## 
##    no   yes 
## 38171  5021

The output shows that the data is imbalanced, with a strong skew towards the "no" class. In predictive modeling, such imbalance can bias the model towards predicting "no", since it is by far the more common outcome in the training data. To address the imbalance, one could either oversample the minority class ("yes") or undersample the majority class ("no").
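A minimal sketch of the undersampling option, using only base R, is shown below; the seed and the simple random draw are illustrative, and in practice the resampling would be applied to the training split only rather than the full dataset.

# Undersample the majority class ("no") to match the size of the minority class
set.seed(123)
yes_rows <- which(bank$y == "yes")
no_rows  <- sample(which(bank$y == "no"), length(yes_rows))
bank_balanced <- bank[c(yes_rows, no_rows), ]
table(bank_balanced$y)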