The data-set

The data relate to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed (‘yes’) or not (‘no’).

Variable Information

• 1 - age (numeric)

• 2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)

• 3 - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)

• 4 - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)

• 5 - default: has credit in default? (binary: “yes”,“no”)

• 6 - balance: average yearly balance, in euros (numeric)

• 7 - housing: has housing loan? (binary: “yes”,“no”)

• 8 - loan: has personal loan? (binary: “yes”,“no”)

• 9 - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)

• 10 - day: last contact day of the month (numeric)

• 11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

• 12 - duration: last contact duration, in seconds (numeric)

• 13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

• 14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

• 15 - previous: number of contacts performed before this campaign and for this client (numeric)

• 16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

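• 17 - y: has the client subscribed to a term deposit? (binary: “yes”,“no”)

Note that this list does not match the loaded data frame exactly. The structure printed below has 21 variables: it has no balance or day column, it adds day_of_week and five economic indicators (emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed), its pdays column takes the value 999 on most rows rather than -1, and it uses different category labels for education and poutcome (for example “basic.4y” and “nonexistent”).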

'data.frame':   41188 obs. of  21 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : chr  "housemaid" "services" "services" "admin." ...
 $ marital       : chr  "married" "married" "married" "married" ...
 $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
 $ default       : chr  "no" "unknown" "no" "no" ...
 $ housing       : chr  "no" "no" "yes" "no" ...
 $ loan          : chr  "no" "no" "no" "no" ...
 $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
 $ month         : chr  "may" "may" "may" "may" ...
 $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
 $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
 $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : chr  "no" "no" "no" "no" ...

Dimension of the data-set

[1] 41188    21
[1] 30488    21

We have some ‘unknown’ values in the columns: job, marital, education, default, housing, and loan. Therefore, we removed the rows containing ‘unknown’ values. After this, we have 30,488 rows and 21 columns.
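The filtering step described above could be written roughly as follows. This is a minimal base R sketch on a toy data frame; the names `bank`, `check_cols`, and `bank_clean` are hypothetical, and only three of the affected columns are included.

```r
# Toy stand-in for the bank data (values are illustrative only).
bank <- data.frame(
  age     = c(56, 57, 37, 40),
  job     = c("housemaid", "services", "unknown", "admin."),
  default = c("no", "unknown", "no", "no"),
  loan    = c("no", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Columns screened for the "unknown" placeholder (hypothetical subset).
check_cols <- c("job", "default", "loan")

# Keep a row only if none of the screened columns equals "unknown".
keep <- rowSums(bank[check_cols] == "unknown") == 0
bank_clean <- bank[keep, ]

dim(bank)        # dimensions before filtering
dim(bank_clean)  # dimensions after filtering
```

Printing `dim()` before and after the filter mirrors the two dimension lines shown in this section.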

Column Names

 [1] "age"            "job"            "marital"        "education"     
 [5] "default"        "housing"        "loan"           "contact"       
 [9] "month"          "day_of_week"    "duration"       "campaign"      
[13] "pdays"          "previous"       "poutcome"       "emp.var.rate"  
[17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"   
[21] "y"             

Summary of the data-set

      age            job              marital           education        
 Min.   :17.00   Length:30488       Length:30488       Length:30488      
 1st Qu.:31.00   Class :character   Class :character   Class :character  
 Median :37.00   Mode  :character   Mode  :character   Mode  :character  
 Mean   :39.03                                                           
 3rd Qu.:45.00                                                           
 Max.   :95.00                                                           
   default            housing              loan             contact         
 Length:30488       Length:30488       Length:30488       Length:30488      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    month           day_of_week           duration         campaign     
 Length:30488       Length:30488       Min.   :   0.0   Min.   : 1.000  
 Class :character   Class :character   1st Qu.: 103.0   1st Qu.: 1.000  
 Mode  :character   Mode  :character   Median : 181.0   Median : 2.000  
                                       Mean   : 259.5   Mean   : 2.521  
                                       3rd Qu.: 321.0   3rd Qu.: 3.000  
                                       Max.   :4918.0   Max.   :43.000  
     pdays          previous        poutcome          emp.var.rate     
 Min.   :  0.0   Min.   :0.0000   Length:30488       Min.   :-3.40000  
 1st Qu.:999.0   1st Qu.:0.0000   Class :character   1st Qu.:-1.80000  
 Median :999.0   Median :0.0000   Mode  :character   Median : 1.10000  
 Mean   :956.3   Mean   :0.1943                      Mean   :-0.07151  
 3rd Qu.:999.0   3rd Qu.:0.0000                      3rd Qu.: 1.40000  
 Max.   :999.0   Max.   :7.0000                      Max.   : 1.40000  
 cons.price.idx  cons.conf.idx     euribor3m      nr.employed  
 Min.   :92.20   Min.   :-50.8   Min.   :0.634   Min.   :4964  
 1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.313   1st Qu.:5099  
 Median :93.44   Median :-41.8   Median :4.856   Median :5191  
 Mean   :93.52   Mean   :-40.6   Mean   :3.460   Mean   :5161  
 3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   3rd Qu.:5228  
 Max.   :94.77   Max.   :-26.9   Max.   :5.045   Max.   :5228  
      y            
 Length:30488      
 Class :character  
 Mode  :character  
                   
                   
                   

Checking for the missing values

[1] 0

• There are no missing values in the data-set.
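A count like the one above is typically obtained with `is.na()`. The sketch below uses a small hypothetical data frame `df` to show the check, and also shows that the string “unknown” is not counted as missing, which is why those rows had to be handled separately.

```r
# Hypothetical example: one real NA per column plus one "unknown" string.
df <- data.frame(
  job = c("admin.", "unknown", NA),
  age = c(30, NA, 45),
  stringsAsFactors = FALSE
)

total_na   <- sum(is.na(df))      # total number of NA cells
na_per_col <- colSums(is.na(df))  # NA count for each column

total_na
na_per_col
```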

Transform character variables into factor variables.

There are 11 character variables in the data-set

'data.frame':   30488 obs. of  21 variables:
 $ age           : int  56 37 40 56 59 24 25 25 29 57 ...
 $ job           : Factor w/ 11 levels "admin.","blue-collar",..: 4 8 1 8 1 10 8 8 2 4 ...
 $ marital       : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 3 3 3 1 ...
 $ education     : Factor w/ 7 levels "basic.4y","basic.6y",..: 1 4 2 4 6 6 4 4 4 1 ...
 $ default       : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ housing       : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 2 2 1 2 ...
 $ loan          : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 2 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ duration      : int  261 226 151 307 139 380 50 222 137 293 ...
 $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
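One way to perform this conversion in base R is sketched below. The report's own code is not shown, so the toy data frame `bank` and the helper vector `is_chr` are assumptions made for illustration.

```r
# Toy data frame with one integer and two character columns.
bank <- data.frame(
  age = c(56L, 37L, 40L),
  job = c("housemaid", "services", "admin."),
  y   = c("no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert only those to factors.
is_chr <- vapply(bank, is.character, logical(1))
bank[is_chr] <- lapply(bank[is_chr], factor)

sum(is_chr)  # number of character columns that were converted
str(bank)
```

Factor levels are sorted alphabetically by default, which matches the level order visible in the structure above (for example “no” before “yes”).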

Plot the ‘int’ and ‘num’ variables using histogram and their corresponding density curve
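A histogram on the density scale with an overlaid density curve can be sketched in base R graphics as follows. This is an illustration under assumptions: the `age` vector is synthetic, the choice of 24 equal-width bins is arbitrary, and base graphics are used here only to keep the example free of package dependencies.

```r
set.seed(1)
# Synthetic stand-in for a numeric column such as age.
age <- round(rnorm(500, mean = 39, sd = 10))

# 25 break points give exactly 24 equal-width bins.
breaks <- seq(min(age), max(age), length.out = 25)

# freq = FALSE puts the bars on the density scale so the curve is comparable.
h <- hist(age, breaks = breaks, freq = FALSE,
          col = "grey85", border = "black",
          main = "age", xlab = "age")

# Kernel density estimate drawn over the histogram.
lines(density(age), lwd = 2)
```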

Plot the factor variables

Bar chart, column chart, donut chart

Boxplot

Detection of outlier

An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, may indicate novel data, or may be the result of experimental error; outliers of the last kind are sometimes excluded from the data set. An outlier can point to an interesting possibility, but it can also cause serious problems in statistical analyses.

The interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the mid spread, middle 50%, fourth spread, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data. To calculate the IQR, the data set is divided into quartiles, or four rank-ordered even parts via linear interpolation. These quartiles are denoted by Q1 (also called the lower quartile), Q2 (the median), and Q3 (also called the upper quartile). The lower quartile corresponds with the 25th percentile and the upper quartile corresponds with the 75th percentile, so IQR = Q3 − Q1.
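The IQR definition above is often turned into an outlier rule by flagging points that lie more than 1.5 × IQR beyond the quartiles. The multiplier of 1.5 is the conventional boxplot choice and is an assumption here, since the report does not state which fences it used. A minimal sketch on a toy vector:

```r
# Toy vector with one obvious extreme value.
x <- c(1, 2, 3, 4, 5, 100)

# Q1 and Q3 (R's default quantile method uses linear interpolation).
q   <- quantile(x, probs = c(0.25, 0.75))
q1  <- q[[1]]
q3  <- q[[2]]
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (assumed multiplier).
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

is_outlier <- x < lower | x > upper
x_clean    <- x[!is_outlier]

x[is_outlier]  # flagged values
x_clean        # values kept
```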

Boxplot after removing the outliers from the dataset

Correlation Matrix

We select the numeric variables and then calculate their correlations

We can see that euribor3m and emp.var.rate are highly correlated, with a correlation of 0.97. Additionally, euribor3m and nr.employed are highly correlated, with a correlation of 0.95, and nr.employed and emp.var.rate are also highly correlated, with a correlation of 0.90. Because these three variables are so strongly correlated with one another, we removed euribor3m, nr.employed, and emp.var.rate from the dataset.
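Highly correlated pairs can also be listed programmatically instead of being read off a plot. The sketch below uses three synthetic columns (`a`, `b`, `c`) and an assumed cutoff of 0.9; neither the column names nor the cutoff come from the report.

```r
# Synthetic numeric data: b is almost a multiple of a, c is unrelated to both.
a <- 1:20
b <- 2 * a + rep(c(0.1, -0.1), 10)
c <- rep(c(1, -1, -1, 1), 5)
num <- data.frame(a = a, b = b, c = c)

# Pearson correlation matrix of the numeric columns.
cm <- cor(num)

# Positions in the upper triangle whose absolute correlation exceeds the cutoff.
cutoff <- 0.9  # assumed threshold
high <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high],
  stringsAsFactors = FALSE
)
high_pairs
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal.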

Dividing the data-set into training data and test data

Training data is a subset of a dataset used to train a machine learning model.

Test data refers to a subset of a dataset that is used to evaluate the performance of a machine learning model after it has been trained.

We randomly assign 90% of the rows to the training data and the remaining 10% to the test data.
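A random 90/10 split of this kind can be sketched in base R as below. The seed, the toy data frame `dat`, and the index name `train_idx` are assumptions for illustration; the report does not show how its split was drawn.

```r
set.seed(42)  # fixed seed so the split is reproducible (assumed value)

# Toy data frame with an id column to make the split easy to check.
n   <- 100
dat <- data.frame(id = seq_len(n), x = rnorm(n))

# Draw 90% of the row indices without replacement for training.
train_idx <- sample(seq_len(n), size = floor(0.9 * n))

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]  # the rows that were not drawn

nrow(train)
nrow(test)
```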

Decision tree classifier

Each node is described by its predicted class, a probability, and the share of the sample it covers. Because nodes labelled no carry low values and nodes labelled yes carry high ones, the probability is read here as the fitted probability of yes.

Root Node (Top Node):

pdays >= 513 is the first decision point in the tree. The root node itself covers 100% of the sample and predicts no, with a probability of 0.09 (or 9%). If the condition is met (yes branch), the tree moves to a split on month. If it is not met (no branch), the tree moves to a split on duration.

Left Subtree (pdays >= 513 and month):

The yes branch of pdays >= 513 leads to a node that predicts no, with a probability of 0.07 (7%) and a sample distribution of 96%. The next decision is based on the month. If the month is April, August, July, June, May, or November, the predicted outcome is no. If the month is not one of these, the tree further splits based on duration.

Further Splits:

For the branch where the month condition is not met, the node predicts no with a probability of 0.39 (39%) and a sample distribution of 4%. If duration < 184, the prediction is no with a probability of 0.19 (19%) and a sample distribution of 2%. If duration >= 184, the prediction is yes with a probability of 0.60 (60%) and a sample distribution of 2%.

Right Subtree (pdays < 513 and duration):

If pdays >= 513 is not met (that is, pdays < 513), the decision moves to duration < 165. This branch covers the remaining 4% of the sample. If duration < 165, the predicted outcome is no with a probability of 0.26 (26%) and a sample distribution of 1%. If duration >= 165, the prediction is yes with a probability of 0.74 (74%) and a sample distribution of 3%.
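The splits described above can be written out as nested rules. The function below is one reading of the tree, not the fitted model itself: the name `predict_rule` is hypothetical, the month labels are assumed to be the lowercase three-letter codes used in the data, and the thresholds are taken from the description.

```r
# Hand-written version of the described splits (illustrative only).
predict_rule <- function(pdays, month, duration) {
  listed_months <- c("apr", "aug", "jul", "jun", "may", "nov")
  ifelse(
    pdays >= 513,
    # Left subtree: split on month, then on duration.
    ifelse(month %in% listed_months,
           "no",
           ifelse(duration < 184, "no", "yes")),
    # Right subtree: split on duration only.
    ifelse(duration < 165, "no", "yes")
  )
}

# Four hypothetical clients, one per path that can be checked by hand.
predict_rule(
  pdays    = c(999, 999, 6, 6),
  month    = c("may", "mar", "may", "may"),
  duration = c(300, 200, 100, 200)
)
```

Writing the rules this way makes it easy to see that every client follows exactly one path from the root to a leaf.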

Actual vs Predicted Value of the variable y

Confusion Matrix

Confusion Matrix and Statistics

          Reference
Prediction   no  yes
       no  2467   37
       yes  148  109
                                        
               Accuracy : 0.933         
                 95% CI : (0.923, 0.942)
    No Information Rate : 0.9471        
    P-Value [Acc > NIR] : 0.9994        
                                        
                  Kappa : 0.5077        
                                        
 Mcnemar's Test P-Value : 6.097e-16     
                                        
            Sensitivity : 0.9434        
            Specificity : 0.7466        
         Pos Pred Value : 0.9852        
         Neg Pred Value : 0.4241        
             Prevalence : 0.9471        
         Detection Rate : 0.8935        
   Detection Prevalence : 0.9069        
      Balanced Accuracy : 0.8450        
                                        
       'Positive' Class : no            
                                        

With “no” as the positive class, the four cells of the matrix above are:

True Positives (TP): 2467 (Predicted “no” and actually “no”)

False Positives (FP): 37 (Predicted “no” but actually “yes”)

False Negatives (FN): 148 (Predicted “yes” but actually “no”)

True Negatives (TN): 109 (Predicted “yes” and actually “yes”)

The model correctly predicts the outcome 93.3% of the time. This is the proportion of true results (both true positives and true negatives) among the total number of cases examined. The 95% confidence interval for the accuracy runs from 0.923 to 0.942. However, this accuracy is below the No Information Rate of 0.9471, which is the accuracy obtained by always predicting “no”, so accuracy alone overstates how useful the model is.

Also known as recall, sensitivity is the ability of the model to correctly identify positive instances (those who are actually “no”). Here, 94.34% of actual “no” cases are correctly identified.

Specificity is the ability of the model to correctly identify negative instances (those who are actually “yes”). Here, 74.66% of actual “yes” cases are correctly identified. This is noticeably lower than the sensitivity, indicating that the model struggles more with identifying the “yes” cases correctly. In addition, only 42.41% of the “yes” predictions are correct (Neg Pred Value).

The model performs well in predicting the majority class (“no”) but struggles with the minority class (“yes”), leading to an imbalance in prediction accuracy.
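The headline statistics can be recomputed by hand from the four counts in the confusion matrix printed above. The sketch below does this in base R, treating “no” as the positive class as in that output; the variable names are illustrative.

```r
# Counts copied from the printed confusion matrix (rows = prediction, columns = reference).
cm <- matrix(
  c(2467, 148,   # Reference "no":  predicted no, predicted yes
    37,   109),  # Reference "yes": predicted no, predicted yes
  nrow = 2,
  dimnames = list(Prediction = c("no", "yes"), Reference = c("no", "yes"))
)

# "no" is the positive class.
tp <- cm["no", "no"]    # predicted no, actually no
fp <- cm["no", "yes"]   # predicted no, actually yes
fn <- cm["yes", "no"]   # predicted yes, actually no
tn <- cm["yes", "yes"]  # predicted yes, actually yes

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
nir         <- max(colSums(cm)) / sum(cm)  # share of the largest reference class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, nir = nir), 4)
```

Comparing `accuracy` with `nir` shows directly whether the model beats the baseline of always predicting the majority class.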