library(ggplot2)
ggplot(bank_marketing_training, aes(x = marital)) +
geom_bar() +
theme_minimal() +
labs(title = "Distribution of Marital Status",
x = "Marital Status",
y = "Count")
Strength: Clearly shows that married customers dominate the dataset,
followed by single, then divorced.
Weakness: Doesn’t show response rate information.
ggplot(bank_marketing_training, aes(marital)) +
geom_bar(aes(fill = response)) +
coord_flip() +
labs(title = "Distribution of Marital Status by Response",
x = "Marital Status",
y = "Count")
Strength: Shows absolute counts and response (yes/no) split for each
marital status.
Weakness: Hard to compare response rates across categories due to
different group sizes.
ggplot(bank_marketing_training, aes(marital)) +
geom_bar(aes(fill = response), position = "fill") +
coord_flip() +
labs(title = "Normalized Distribution of Marital Status by Response",
x = "Marital Status",
y = "Proportion") +
scale_y_continuous(labels = scales::percent)
Strength: Makes it easy to compare response rates (proportions)
across marital statuses.
Weakness: Loses information about total sample size in each
category.
The relationship between marital status and response is relatively
weak.
Single people show a slightly higher proportion of “yes” responses,
while divorced individuals have a marginally lower proportion.
All marital status categories show a similar pattern with approximately
10-20% “yes” responses and 80-90% “no” responses.
The differences between categories are not substantial enough to suggest
marital status strongly influences response outcomes.
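The comparison above comes down to the share of “yes” responses within each marital status. A minimal sketch of that calculation follows; the data frame toy and its rows are made up for illustration and are not taken from bank_marketing_training.

```r
# Illustrative data only: these rows are invented, not the bank marketing data.
toy <- data.frame(
  marital  = c("married", "married", "married", "married",
               "single", "single", "divorced", "divorced"),
  response = c("no", "no", "no", "yes", "no", "yes", "no", "no")
)

# Share of "yes" responses within each marital status:
# the mean of a logical vector is the proportion of TRUE values.
yes_rate <- tapply(toy$response == "yes", toy$marital, mean)

print(round(yes_rate * 100, 1))
```

Running the same tapply() call on the real response and marital columns would give the exact rates that the normalized bar chart only shows visually.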
t.v1 <- table(bank_marketing_training$response, bank_marketing_training$previous_outcome)
t.v2 <- addmargins(A = t.v1, FUN = list(total = sum), quiet = TRUE)
print(t.v2)
##
## failure nonexistent success total
## no 2390 21176 320 23886
## yes 385 2034 569 2988
## total 2775 23210 889 26874
percentages <- round(prop.table(t.v1, margin = 2)*100, 1)
print(percentages)
##
## failure nonexistent success
## no 86.1 91.2 36.0
## yes 13.9 8.8 64.0
Customers with previous successful campaigns were much more likely to say yes (64% success rate) compared to those with failures (13.9%) or no previous contact (8.8%).
# Row percentages reuse t.v1 from above: each response row sums to 100%.
row_percentages <- round(prop.table(t.v1, margin = 1)*100, 1)
print(row_percentages)
##
## failure nonexistent success
## no 10.0 88.7 1.3
## yes 12.9 68.1 19.0
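Column percentages show the response split within each previous outcome
category (for example, 64.0% of customers with a previous success said “yes”).
Row percentages show how each response group is distributed across previous
outcomes (for example, 19.0% of all “yes” responses came from customers with a previous success).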
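The difference between the two margins can be reproduced without the data frame. The sketch below rebuilds the contingency table from the counts printed earlier; the object names counts, col_pct and row_pct are introduced here for illustration.

```r
# Counts copied from the printed response x previous_outcome table.
counts <- matrix(
  c(2390, 21176, 320,
     385,  2034, 569),
  nrow = 2, byrow = TRUE,
  dimnames = list(response = c("no", "yes"),
                  previous_outcome = c("failure", "nonexistent", "success"))
)

# margin = 2: each column sums to 100 (yes/no split within a previous outcome).
col_pct <- round(prop.table(counts, margin = 2) * 100, 1)

# margin = 1: each row sums to 100 (where each response group comes from).
row_pct <- round(prop.table(counts, margin = 1) * 100, 1)

print(col_pct)
print(row_pct)
```

When the question is "how likely is a yes for this kind of customer", the column percentages are the ones to read; the row percentages answer "what do the yes responders look like".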
ggplot(bank_marketing_training, aes(x = duration)) +
geom_histogram() +
labs(title = "Distribution of Call Duration",
x = "Duration (seconds)",
y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Strength: Shows overall distribution and frequency of call
durations.
Weakness: Cannot see relationship with response variable.
ggplot(bank_marketing_training, aes(x = duration)) +
geom_histogram(aes(fill = response)) +
labs(title = "Distribution of Call Duration by Response",
x = "Duration (seconds)",
y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Strength: Shows absolute counts for both yes/no responses by duration.
Weakness: Hard to compare success rates at different durations due to varying counts.
ggplot(bank_marketing_training, aes(x = duration)) +
geom_histogram(aes(fill = response), position = "fill") +
labs(title = "Normalized Distribution of Call Duration by Response",
x = "Duration (seconds)",
y = "Proportion") +
scale_y_continuous(labels = scales::percent)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_bar()`).
Strength: Clearly shows success rate trends across different durations.
Weakness: Loses information about sample size at each duration point.
Shorter calls (< 500 seconds) mostly result in “no”
responses.
As call duration increases, “yes” responses become more common.
After 2000 seconds, “yes” responses become majority.
Very long calls (> 3000 seconds) are highly correlated with “yes”
responses.
This suggests longer conversations typically lead to more successful
outcomes.
The main cutoff points for duration appear to be:
Around 500 seconds (very low success rate below this)
Around 2000 seconds (higher success rate above this)
bank_marketing_training$duration_binned <- cut(x = bank_marketing_training$duration,
breaks = c(0, 500, 2000, max(bank_marketing_training$duration)),
right = FALSE,
include.lowest = TRUE, # keep the longest call, whose duration equals the top break
labels = c("Short (<500s)",
"Medium (500-2000s)",
"Long (>2000s)"))
t.v1 <- table(bank_marketing_training$duration_binned, bank_marketing_training$response)
t.v2 <- addmargins(A = t.v1, FUN = list(total = sum), quiet = TRUE)
print(t.v2)
##
## no yes total
## Short (<500s) 22005 1650 23655
## Medium (500-2000s) 1865 1312 3177
## Long (>2000s) 16 26 42
## total 23886 2988 26874
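One detail of cut() is worth checking whenever the top break is the observed maximum: with right = FALSE the intervals are closed on the left and open on the right, so a value equal to the last break falls outside every bin unless include.lowest = TRUE is set. A small sketch with made-up durations (the vectors below are illustrative, not the call data):

```r
# Made-up durations; 3000 plays the role of the observed maximum.
durations <- c(0, 499, 500, 1999, 2000, 3000)
breaks <- c(0, 500, 2000, max(durations))
bin_labels <- c("Short", "Medium", "Long")

# right = FALSE gives [0, 500), [500, 2000), [2000, 3000): the maximum becomes NA.
open_top <- cut(durations, breaks = breaks, right = FALSE, labels = bin_labels)

# include.lowest = TRUE closes the last interval to [2000, 3000], keeping the maximum.
closed_top <- cut(durations, breaks = breaks, right = FALSE, labels = bin_labels,
                  include.lowest = TRUE)

print(data.frame(durations, open_top, closed_top))
```

A quick sum(is.na(...)) on the binned column is a cheap way to confirm that no rows were dropped before tabulating.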
percentages <- round(prop.table(t.v1, margin = 2)*100, 1)
print(percentages)
##
## no yes
## Short (<500s) 92.1 55.2
## Medium (500-2000s) 7.8 43.9
## Long (>2000s) 0.1 0.9
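The column percentages above describe how each response is spread across the duration bins (92.1% of all “no” responses came from short calls); they are not success rates. The success rate within each bin comes from the counts:
Most calls are short (<500s), and only 7.0% of them (1,650 of 23,655) end in “yes”
Medium calls (500-2000s) have a much better success rate at 41.3% (1,312 of 3,177)
Long calls (>2000s) are rare but have the highest success rate, with more than
60% ending in “yes”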
ggplot(bank_marketing_training, aes(x = duration_binned)) +
geom_bar(aes(fill = response)) +
coord_flip() +
labs(title = "Distribution of Call Duration (Binned) by Response",
x = "Duration Category",
y = "Count")
Shows overwhelming majority of calls are short duration
Medium and long duration calls are much less frequent
Raw counts make it hard to compare success rates
ggplot(bank_marketing_training, aes(x = duration_binned)) +
geom_bar(aes(fill = response), position = "fill") +
coord_flip() +
labs(title = "Normalized Distribution of Call Duration (Binned) by Response",
x = "Duration Category",
y = "Proportion") +
scale_y_continuous(labels = scales::percent)
Clear trend: longer calls have higher success rates
Short calls show lowest proportion of “yes” responses
Medium and long calls show progressively better success rates
t.v1 <- table(bank_marketing_training$job, bank_marketing_training$response)
t.v2 <- addmargins(A = t.v1, FUN = list(total = sum), quiet = TRUE)
print(t.v2)
##
## no yes total
## admin. 5903 854 6757
## blue-collar 5631 420 6051
## entrepreneur 842 72 914
## housemaid 639 70 709
## management 1680 209 1889
## retired 852 291 1143
## self-employed 825 93 918
## services 2380 201 2581
## student 404 194 598
## technician 3972 465 4437
## unemployed 573 94 667
## unknown 185 25 210
## total 23886 2988 26874
percentages <- round(prop.table(t.v1, margin = 2)*100, 1)
print(percentages)
##
## no yes
## admin. 24.7 28.6
## blue-collar 23.6 14.1
## entrepreneur 3.5 2.4
## housemaid 2.7 2.3
## management 7.0 7.0
## retired 3.6 9.7
## self-employed 3.5 3.1
## services 10.0 6.7
## student 1.7 6.5
## technician 16.6 15.6
## unemployed 2.4 3.1
## unknown 0.8 0.8
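# Caution: as.numeric(factor(job)) returns each job's alphabetical level code (1-12),
# not a response rate, so these bins group jobs by name order rather than by percentage.
bank_marketing_training$job2 <- cut(as.numeric(factor(bank_marketing_training$job)),
breaks = c(0, 10, 25, 33),
right = FALSE,
labels = c("Low (<10%)", "Medium (10-25%)", "High (25-33%)"))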
t.v1 <- table(bank_marketing_training$job2, bank_marketing_training$response)
t.v2 <- addmargins(A = t.v1, FUN = list(total = sum), quiet = TRUE)
print(t.v2)
##
## no yes total
## Low (<10%) 19156 2404 21560
## Medium (10-25%) 4730 584 5314
## High (25-33%) 0 0 0
## total 23886 2988 26874
percentages <- round(prop.table(t.v1, margin = 2)*100, 1)
print(percentages)
##
## no yes
## Low (<10%) 80.2 80.5
## Medium (10-25%) 19.8 19.5
## High (25-33%) 0.0 0.0
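The Low (<10%) category contains the majority of cases with 21,560 total, followed by Medium (10-25%) with 5,314 cases, and the High (25-33%) category has zero cases. This points to a problem with the categorization: cut() was applied to the alphabetical level codes of job (1 to 12), so “Low” is simply the first nine job levels (admin. through student), “Medium” is the last three (technician, unemployed, unknown), and no code reaches 25.
The column percentages show that for both “no” and “yes” responses, about 80% come from the Low group and 20% from the Medium group. Because the groups are arbitrary, this says little about job as a predictor. The per-job counts above tell a different story: the “yes” rate ranges from about 7% for blue-collar (420 of 6,051) to about 32% for students (194 of 598), so job does appear to be related to response.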
ggplot(bank_marketing_training, aes(x = job2)) +
geom_bar(aes(fill = response), position = "fill") +
coord_flip() +
labs(title = "Normalized Distribution of Job Categories by Response",
x = "Job Category",
y = "Proportion") +
scale_y_continuous(labels = scales::percent)
The normalized graph shows similar response patterns across both job
groups.
Both Low and Medium groups have approximately 89% “no” responses and
11% “yes” responses (2,404 of 21,560 and 584 of 5,314 said “yes”).
The High category is not visible as it contains no data.
The two groups do not differ in response rate, but because they follow the
alphabetical order of job names rather than any percentage, this does not show that job is unrelated to response.
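A grouping that matches the Low/Medium/High labels would bin each job by its own “yes” rate rather than by its factor code. The sketch below does this with the per-job counts printed in the job table above; job_counts, yes_rate and rate_group are names introduced here for illustration, and reading the cutpoints as response-rate percentages is an assumption about the original intent.

```r
# Per-job counts copied from the printed job x response table.
job_counts <- data.frame(
  job = c("admin.", "blue-collar", "entrepreneur", "housemaid", "management",
          "retired", "self-employed", "services", "student", "technician",
          "unemployed", "unknown"),
  no  = c(5903, 5631, 842, 639, 1680, 852, 825, 2380, 404, 3972, 573, 185),
  yes = c(854, 420, 72, 70, 209, 291, 93, 201, 194, 465, 94, 25)
)

# "yes" rate within each job, in percent.
job_counts$yes_rate <- 100 * job_counts$yes / (job_counts$no + job_counts$yes)

# Bin the rate itself (not the factor code), assuming the labels refer to this rate.
job_counts$rate_group <- cut(job_counts$yes_rate,
                             breaks = c(0, 10, 25, 33),
                             right = FALSE,
                             labels = c("Low (<10%)", "Medium (10-25%)", "High (25-33%)"))

# Jobs ordered from lowest to highest "yes" rate.
print(job_counts[order(job_counts$yes_rate), c("job", "yes_rate", "rate_group")])
```

To carry the grouping back to the row-level data, one option is a lookup with match(), for example job_counts$rate_group[match(bank_marketing_training$job, job_counts$job)].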