• 1 - age (numeric)
• 2 - job: type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
• 3 - marital: marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)
• 4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
• 5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)
• 6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
• 7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
• 8 - contact: contact communication type (categorical: “cellular”,“telephone”)
• 9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
• 10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)
• 11 - duration: last contact duration, in seconds (numeric)
• 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
• 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, 999 means client was not previously contacted)
• 14 - previous: number of contacts performed before this campaign and for this client (numeric)
• 15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)
• 16 - emp.var.rate: employment variation rate, quarterly indicator (numeric)
• 17 - cons.price.idx: consumer price index, monthly indicator (numeric)
• 18 - cons.conf.idx: consumer confidence index, monthly indicator (numeric)
• 19 - euribor3m: Euribor 3-month rate, daily indicator (numeric)
• 20 - nr.employed: number of employees, quarterly indicator (numeric)
• 21 - y: has the client subscribed to a term deposit? (binary: “yes”,“no”)
'data.frame': 41188 obs. of 21 variables:
$ age : int 56 57 37 40 56 45 59 41 24 25 ...
$ job : chr "housemaid" "services" "services" "admin." ...
$ marital : chr "married" "married" "married" "married" ...
$ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
$ default : chr "no" "unknown" "no" "no" ...
$ housing : chr "no" "no" "yes" "no" ...
$ loan : chr "no" "no" "no" "no" ...
$ contact : chr "telephone" "telephone" "telephone" "telephone" ...
$ month : chr "may" "may" "may" "may" ...
$ day_of_week : chr "mon" "mon" "mon" "mon" ...
$ duration : int 261 149 226 151 307 198 139 217 380 50 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
$ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
$ cons.price.idx: num 94 94 94 94 94 ...
$ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
$ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
$ nr.employed : num 5191 5191 5191 5191 5191 ...
$ y : chr "no" "no" "no" "no" ...
[1] 41188 21
[1] 30488 21
Several columns contain ‘unknown’ values: job, marital, education, default, housing, and loan. We removed every row that contains an ‘unknown’ value, which reduces the data from 41,188 rows to 30,488 rows; the number of columns stays at 21.
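The filtering step can be sketched in base R as follows. This is a minimal illustration on a toy data frame, not the code used for the report: the object name bank and the helper drop_unknown_rows() are hypothetical.

```r
# Toy stand-in for the bank data; `bank` is a hypothetical name.
bank <- data.frame(
  age     = c(56, 57, 37, 40),
  job     = c("housemaid", "services", "unknown", "admin."),
  default = c("no", "unknown", "no", "no"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows in which no column equals "unknown".
drop_unknown_rows <- function(df) {
  has_unknown <- rowSums(df == "unknown") > 0
  df[!has_unknown, , drop = FALSE]
}

bank_clean <- drop_unknown_rows(bank)
dim(bank_clean)   # rows and columns left after filtering
```

Comparing dim() before and after the call shows how many rows were dropped.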
[1] "age" "job" "marital" "education"
[5] "default" "housing" "loan" "contact"
[9] "month" "day_of_week" "duration" "campaign"
[13] "pdays" "previous" "poutcome" "emp.var.rate"
[17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
[21] "y"
age job marital education
Min. :17.00 Length:30488 Length:30488 Length:30488
1st Qu.:31.00 Class :character Class :character Class :character
Median :37.00 Mode :character Mode :character Mode :character
Mean :39.03
3rd Qu.:45.00
Max. :95.00
default housing loan contact
Length:30488 Length:30488 Length:30488 Length:30488
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
month day_of_week duration campaign
Length:30488 Length:30488 Min. : 0.0 Min. : 1.000
Class :character Class :character 1st Qu.: 103.0 1st Qu.: 1.000
Mode :character Mode :character Median : 181.0 Median : 2.000
Mean : 259.5 Mean : 2.521
3rd Qu.: 321.0 3rd Qu.: 3.000
Max. :4918.0 Max. :43.000
pdays previous poutcome emp.var.rate
Min. : 0.0 Min. :0.0000 Length:30488 Min. :-3.40000
1st Qu.:999.0 1st Qu.:0.0000 Class :character 1st Qu.:-1.80000
Median :999.0 Median :0.0000 Mode :character Median : 1.10000
Mean :956.3 Mean :0.1943 Mean :-0.07151
3rd Qu.:999.0 3rd Qu.:0.0000 3rd Qu.: 1.40000
Max. :999.0 Max. :7.0000 Max. : 1.40000
cons.price.idx cons.conf.idx euribor3m nr.employed
Min. :92.20 Min. :-50.8 Min. :0.634 Min. :4964
1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.313 1st Qu.:5099
Median :93.44 Median :-41.8 Median :4.856 Median :5191
Mean :93.52 Mean :-40.6 Mean :3.460 Mean :5161
3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
Max. :94.77 Max. :-26.9 Max. :5.045 Max. :5228
y
Length:30488
Class :character
Mode :character
[1] 0
• There are no missing values in the dataset.
There are 11 character variables in the dataset; we convert them to factors.
'data.frame': 30488 obs. of 21 variables:
$ age : int 56 37 40 56 59 24 25 25 29 57 ...
$ job : Factor w/ 11 levels "admin.","blue-collar",..: 4 8 1 8 1 10 8 8 2 4 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 3 3 3 1 ...
$ education : Factor w/ 7 levels "basic.4y","basic.6y",..: 1 4 2 4 6 6 4 4 4 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ housing : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 2 2 1 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 2 1 ...
$ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
$ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
$ duration : int 261 226 151 307 139 380 50 222 137 293 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
$ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
$ cons.price.idx: num 94 94 94 94 94 ...
$ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
$ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
$ nr.employed : num 5191 5191 5191 5191 5191 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
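Counting the character columns and converting them to factors could look like the sketch below. It uses a small toy data frame with a hypothetical name (bank) rather than the real data, and only base R functions.

```r
# Toy stand-in for the cleaned data; `bank` is a hypothetical name.
bank <- data.frame(
  age     = c(56L, 37L, 40L),
  marital = c("married", "single", "divorced"),
  y       = c("no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert each of them to a factor.
is_chr <- vapply(bank, is.character, logical(1))
bank[is_chr] <- lapply(bank[is_chr], factor)

sum(is_chr)   # number of columns that were converted
str(bank)     # numeric columns are left untouched
```

Factor levels are sorted alphabetically by default, which is why “no” comes before “yes” in the target variable.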
An outlier is a data point that differs significantly from the other observations. It may reflect variability in the measurement, point to genuinely novel data, or result from experimental error; outliers of the last kind are sometimes excluded from the dataset. An outlier can signal something worth investigating, but it can also cause serious problems in statistical analyses.
The interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the mid spread, middle 50%, fourth spread, or H‑spread. It is defined as the difference between the 75th and 25th percentiles of the data. To calculate the IQR, the data set is divided into quartiles, or four rank-ordered even parts via linear interpolation. These quartiles are denoted by Q1 (also called the lower quartile), Q2 (the median), and Q3 (also called the upper quartile). The lower quartile corresponds with the 25th percentile and the upper quartile corresponds with the 75th percentile, so IQR = Q3 − Q1.
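As a small worked example, the sketch below computes Q1, Q3, and the IQR for a toy vector and flags values outside the usual fences. The multiplier 1.5 (Tukey's rule) is an assumption; the text above does not state which cutoff was used.

```r
# Toy data with one obviously extreme value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# quantile() interpolates linearly between order statistics by default.
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1                     # same value as IQR(x)

# Fences at 1.5 * IQR beyond the quartiles (assumed convention).
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

outliers <- x[x < lower | x > upper]
outliers
```

Here Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5 and the fences are -3.5 and 14.5; only the value 100 falls outside them.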
We select the numeric variables and then calculate their pairwise correlations.
We can see that euribor3m and emp.var.rate are highly correlated (0.97), as are euribor3m and nr.employed (0.95) and nr.employed and emp.var.rate (0.90). Because these three variables are so strongly correlated with one another, we removed euribor3m, nr.employed, and emp.var.rate from the dataset.
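The procedure of selecting the numeric columns, computing correlations, and dropping strongly correlated variables can be sketched in base R. The toy data, the column names a, b, c, and the cutoff of 0.9 are all assumptions made for illustration; this is not the code behind the figures quoted above.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),   # nearly a copy of `a`
  c = rnorm(n),                 # unrelated to `a` and `b`
  label = sample(c("no", "yes"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Keep the numeric columns only and compute their correlation matrix.
num <- toy[vapply(toy, is.numeric, logical(1))]
cm  <- cor(num)

# List each pair once (upper triangle) whose |r| exceeds the cutoff.
cutoff <- 0.9   # assumed threshold for "highly correlated"
idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx],
  stringsAsFactors = FALSE
)
high_pairs

# Drop every variable that appears in a highly correlated pair.
drop_vars   <- unique(c(high_pairs$var1, high_pairs$var2))
toy_reduced <- toy[, !(names(toy) %in% drop_vars), drop = FALSE]
names(toy_reduced)
```

Dropping every variable in a correlated group mirrors the choice described above; a common alternative is to keep one representative per group.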
Training data is a subset of a dataset used to train a machine learning model.
Test data refers to a subset of a dataset that is used to evaluate the performance of a machine learning model after it has been trained.
We randomly assign 90% of the rows to the training data and the remaining 10% to the test data.
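A random 90/10 split can be sketched in base R as shown below. The seed value and the toy data frame named bank are assumptions for illustration only.

```r
set.seed(123)                      # arbitrary seed, for reproducibility
bank <- data.frame(id = 1:1000)    # toy stand-in; `bank` is a hypothetical name

n <- nrow(bank)
train_idx <- sample(n, size = floor(0.9 * n))   # row numbers for training

train <- bank[train_idx, , drop = FALSE]        # 90% of the rows
test  <- bank[-train_idx, , drop = FALSE]       # the remaining 10%

c(train = nrow(train), test = nrow(test))
```

Sampling row indices without replacement guarantees that no row appears in both subsets.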
Root Node (Top Node):
The root node contains 100% of the sample. It predicts no, with a probability of yes of 0.09 (9%). The first decision point in the tree is pdays >= 513: observations that meet this condition follow the yes branch, and all others follow the no branch. In every node described below, the probability shown is the estimated probability of yes.
Left Subtree (pdays >= 513 and month):
The yes branch of pdays >= 513 holds 96% of the sample and predicts no, with a probability of 0.07 (7%). The next decision is based on the month. If the month is April, August, July, June, May, or November, the predicted outcome is no. If the month is not one of these, the tree splits further on duration.
Further Splits:
The node for the remaining months holds 4% of the sample and predicts no, with a probability of 0.39 (39%). It splits on duration < 184. If duration < 184, the prediction is no with a probability of 0.19 (19%) and a sample share of 2%. If duration >= 184, the prediction is yes with a probability of 0.60 (60%) and a sample share of 2%.
Right Subtree (pdays < 513 and duration):
If pdays < 513, the decision moves to duration < 165. If duration < 165, the predicted outcome is no with a probability of 0.26 (26%) and a sample share of 1%. If duration >= 165, the prediction is yes with a probability of 0.74 (74%) and a sample share of 3%.
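One way to read the splits above is as a set of nested rules. The function below is a hand-written sketch of that reading, not the fitted model: the name predict_tree() is hypothetical, and the branch directions are an interpretation of the description (longer calls lead to “yes”).

```r
# Hypothetical helper that encodes one reading of the tree as nested rules.
predict_tree <- function(pdays, month, duration) {
  no_months <- c("apr", "aug", "jul", "jun", "may", "nov")
  if (pdays >= 513) {
    if (month %in% no_months) {
      "no"                      # large leaf on the left of the tree
    } else if (duration < 184) {
      "no"                      # other months, short call
    } else {
      "yes"                     # other months, longer call
    }
  } else {
    # pdays < 513: the decision depends on call duration only.
    if (duration < 165) "no" else "yes"
  }
}

predict_tree(pdays = 999, month = "may", duration = 261)
predict_tree(pdays = 999, month = "oct", duration = 300)
predict_tree(pdays = 6,   month = "may", duration = 120)
```

Writing the tree out this way makes it easy to check that every combination of conditions ends in exactly one prediction.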
Confusion Matrix and Statistics
Reference
Prediction no yes
no 2467 37
yes 148 109
Accuracy : 0.933
95% CI : (0.923, 0.942)
No Information Rate : 0.9471
P-Value [Acc > NIR] : 0.9994
Kappa : 0.5077
Mcnemar's Test P-Value : 6.097e-16
Sensitivity : 0.9434
Specificity : 0.7466
Pos Pred Value : 0.9852
Neg Pred Value : 0.4241
Prevalence : 0.9471
Detection Rate : 0.8935
Detection Prevalence : 0.9069
Balanced Accuracy : 0.8450
'Positive' Class : no
True Positives (TP): 2467 (Predicted “no” and actually “no”)
False Positives (FP): 37 (Predicted “no” but actually “yes”)
False Negatives (FN): 148 (Predicted “yes” but actually “no”)
True Negatives (TN): 109 (Predicted “yes” and actually “yes”)
The model correctly predicts the outcome 93.3% of the time. This is the proportion of correct predictions (true positives and true negatives) among the total number of cases examined. The 95% confidence interval for the accuracy runs from 0.923 to 0.942. The accuracy is nevertheless below the no-information rate of 94.71%, which is the accuracy obtained by always predicting “no”, so accuracy alone overstates how useful the model is.
Also known as recall, sensitivity is the ability of the model to correctly identify positive instances (those that are actually “no”). Here, 94.34% of actual “no” cases are correctly identified.
Specificity is the ability of the model to correctly identify negative instances (those that are actually “yes”). Here, 74.66% of actual “yes” cases are correctly identified. This is noticeably lower than the sensitivity, indicating that the model struggles more with identifying the “yes” cases correctly.
Although the model performs well in predicting the majority class (“no”), it struggles with the minority class (“yes”): only 42.41% of its “yes” predictions are correct (the negative predictive value).
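The main statistics can be recomputed by hand from the four cell counts of the confusion matrix shown above, with “no” as the positive class. The sketch below uses base R only; the variable names are chosen here for illustration.

```r
# Cell counts copied from the confusion matrix; the positive class is "no".
tp <- 2467  # predicted "no",  actually "no"
fp <- 37    # predicted "no",  actually "yes"
fn <- 148   # predicted "yes", actually "no"
tn <- 109   # predicted "yes", actually "yes"

total       <- tp + fp + fn + tn
accuracy    <- (tp + tn) / total
sensitivity <- tp / (tp + fn)            # recall for the "no" class
specificity <- tn / (tn + fp)            # recall for the "yes" class
balanced    <- (sensitivity + specificity) / 2
nir         <- (tp + fn) / total         # share of the majority class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced, nir = nir), 4)
```

The rounded values agree with the reported Accuracy (0.933), Sensitivity (0.9434), Specificity (0.7466), Balanced Accuracy (0.8450), and No Information Rate (0.9471).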