Import data, using the Import Dataset button.

library(readr)

## Warning: package 'readr' was built under R version 4.3.3

bank_marketing_training <- read_csv("C:/Users/cisco/Downloads/bank_marketing_training")

## Rows: 26874 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, days_since_previous, previous, emp.var.ra...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1. Produce the following graphs. What is the strength of each graph? Weakness?

1.1: Bar graph of marital.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

ggplot(bank_marketing_training, aes(x = marital)) +
  geom_bar() +
  theme_minimal() +
  labs(title = "Distribution of Marital Status",
       x = "Marital Status",
       y = "Count")

Strength: Clearly shows married status dominates the dataset, followed by single, then divorced.
Weakness: Doesn’t show response rate information.

1.2: Bar graph of marital, with overlay of response.

ggplot(bank_marketing_training, aes(marital)) + 
  geom_bar(aes(fill = response)) + 
  coord_flip() +
  labs(title = "Distribution of Marital Status by Response",
       x = "Marital Status",
       y = "Count")

Strength: Shows absolute counts and response (yes/no) split for each marital status.
Weakness: Hard to compare response rates across categories due to different group sizes.

1.3: Normalized bar graph of marital, with overlay of response.

ggplot(bank_marketing_training, aes(marital)) + 
  geom_bar(aes(fill = response), position = "fill") + 
  coord_flip() +
  labs(title = "Normalized Distribution of Marital Status by Response",
       x = "Marital Status",
       y = "Proportion") +
  scale_y_continuous(labels = scales::percent)

Strength: Makes it easy to compare response rates (proportions) across marital statuses.
Weakness: Loses information about total sample size in each category.

2. Using the graph from Exercise 1.3, describe the relationship between marital and response.

The relationship between marital status and response is relatively weak.
Single people show a slightly higher proportion of “yes” responses, while divorced individuals have a marginally lower proportion.
All marital status categories show a similar pattern with approximately 10-20% “yes” responses and 80-90% “no” responses.
The differences between categories are not substantial enough to suggest marital status strongly influences response outcomes.

3. Do the following with the variables marital and response.

3.1: Build a contingency table, being careful to have the correct variables representing the rows and columns. Report the counts and the column percentages.

t.v1 <- table(bank_marketing_training$response, bank_marketing_training$previous_outcome)
t.v2 <- addmargins(A = t.v1, FUN = list(total = sum), quiet = TRUE)
print(t.v2)

##        
##         failure nonexistent success total
##   no       2390       21176     320 23886
##   yes       385        2034     569  2988
##   total    2775       23210     889 26874

percentages <- round(prop.table(t.v1, margin = 2)*100, 1)
print(percentages)

##      
##       failure nonexistent success
##   no     86.1        91.2    36.0
##   yes    13.9         8.8    64.0

3.2: Describe what the contingency table is telling you.

Customers whose previous campaign was a success were far more likely to say yes (64.0%) than those whose previous campaign failed (13.9%) or who had no previous contact (8.8%). Note that this table crosses response with previous_outcome rather than marital.
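The exercise asks for marital and response, while the table above crosses response with previous_outcome. A minimal sketch of the same steps for marital follows. The `toy` data frame and its values are hypothetical stand-ins so the block runs on its own; in practice `bank_marketing_training` would take its place, and no output for the real data is implied here.

```r
# Hypothetical toy data standing in for bank_marketing_training
toy <- data.frame(
  marital  = c("married", "married", "married", "single", "single", "divorced"),
  response = c("no", "no", "yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Response in the rows, predictor (marital) in the columns
t.m1 <- table(toy$response, toy$marital)
t.m2 <- addmargins(A = t.m1, FUN = list(total = sum), quiet = TRUE)
print(t.m2)

# Column percentages: share of "no"/"yes" within each marital status
col_pct <- round(prop.table(t.m1, margin = 2) * 100, 1)
print(col_pct)
```

With the predictor in the columns, each column of `col_pct` sums to 100, so the "yes" row can be read directly as the response rate for each marital status.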

4. Repeat the previous exercise, this time reporting the row percentages. Explain the difference between the interpretation of this table and the previous contingency table.

t.v1 <- table(bank_marketing_training$response, bank_marketing_training$previous_outcome)
t.v2 <- addmargins(A = t.v1, FUN = list(total = sum), quiet = TRUE)
row_percentages <- round(prop.table(t.v1, margin = 1)*100, 1)
print(row_percentages)

##      
##       failure nonexistent success
##   no     10.0        88.7     1.3
##   yes    12.9        68.1    19.0

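Column percentages condition on the previous outcome: within each previous outcome category, they show the share of “no” and “yes” responses (for example, 64.0% of customers with a previous success said yes).
Row percentages condition on the response: within each response, they show how customers are distributed across previous outcomes (for example, 19.0% of the “yes” responders had a previous success).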

5. Produce the following graphs. What is the strength of each graph? Weakness?

5.1: Histogram of duration.

ggplot(bank_marketing_training, aes(x = duration)) +
  geom_histogram() +
  labs(title = "Distribution of Call Duration",
       x = "Duration (seconds)",
       y = "Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Strength: Shows overall distribution and frequency of call durations.
Weakness: Cannot see relationship with response variable.

5.2: Histogram of duration, with overlay of response.

ggplot(bank_marketing_training, aes(x = duration)) +
  geom_histogram(aes(fill = response)) +
  labs(title = "Distribution of Call Duration by Response",
       x = "Duration (seconds)",
       y = "Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Strength: Shows absolute counts for both yes/no responses by duration.
Weakness: Hard to compare success rates at different durations due to varying counts.

5.3: Normalized histogram of duration, with overlay of response.

ggplot(bank_marketing_training, aes(x = duration)) +
  geom_histogram(aes(fill = response), position = "fill") +
  labs(title = "Normalized Distribution of Call Duration by Response",
       x = "Duration (seconds)",
       y = "Proportion") +
  scale_y_continuous(labels = scales::percent)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Strength: Clearly shows success rate trends across different durations.
Weakness: Loses information about sample size at each duration point.

6. Using the graph from Exercise 5.3, describe the relationship between duration and response.

Shorter calls (< 500 seconds) mostly result in “no” responses.
As call duration increases, “yes” responses become more common.
Beyond roughly 2000 seconds, “yes” responses become the majority.
Very long calls (> 3000 seconds) are mostly “yes” responses, although very few calls are that long.
This suggests that longer conversations are associated with more successful outcomes.

7. Examine the non‐normalized and normalized histograms of duration, with overlay of response. Identify cutoff point(s) for duration, which separate low values of response from high values. Define a new categorical variable, duration_binned, using the cutoff points you identified.

The main cutoff points for duration appear to be:
Around 500 seconds (very low success rate below this)
Around 2000 seconds (higher success rate above this)

bank_marketing_training$duration_binned <- cut(x = bank_marketing_training$duration, 
                                breaks = c(0, 500, 2000, max(bank_marketing_training$duration)),
                                right = FALSE,
                                labels = c("Short (<500s)", 
                                         "Medium (500-2000s)",
                                         "Long (>2000s)"))

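With right = FALSE the intervals are closed on the left and open on the right, so a call whose duration equals the top break is not assigned to any bin and becomes NA. This would explain a total of 26873 rather than 26874 in tables built from duration_binned. A hedged sketch of one way to keep that call, using include.lowest = TRUE (which, combined with right = FALSE, closes the last interval on the right); the `dur` vector is hypothetical toy data, not taken from the dataset:

```r
# Hypothetical durations in seconds; 3000 is the maximum
dur <- c(0, 120, 499, 500, 1999, 2000, 3000)

# Same pattern as the call above: the maximum falls outside [2000, 3000) and becomes NA
binned_open <- cut(dur, breaks = c(0, 500, 2000, max(dur)), right = FALSE)

# include.lowest = TRUE closes the last interval, giving [2000, 3000]
binned_closed <- cut(dur,
                     breaks = c(0, 500, 2000, max(dur)),
                     right = FALSE,
                     include.lowest = TRUE,
                     labels = c("Short (<500s)",
                                "Medium (500-2000s)",
                                "Long (>=2000s)"))

sum(is.na(binned_open))    # one value left unbinned
sum(is.na(binned_closed))  # no value left unbinned
table(binned_closed)
```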
8. Provide the following. Describe each of the results.

8.1: Contingency table of duration_binned with response, with counts and column percentages.

t.v1 <- table(bank_marketing_training$duration_binned, bank_marketing_training$response)
t.v2 <- addmargins(A = t.v1, FUN = list(total = sum), quiet = TRUE)
print(t.v2)

##                     
##                         no   yes total
##   Short (<500s)      22005  1650 23655
##   Medium (500-2000s)  1865  1312  3177
##   Long (>2000s)         15    26    41
##   total              23885  2988 26873

percentages <- round(prop.table(t.v1, margin = 2)*100, 1)
print(percentages)

##                     
##                        no  yes
##   Short (<500s)      92.1 55.2
##   Medium (500-2000s)  7.8 43.9
##   Long (>2000s)       0.1  0.9

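Most calls are short (<500s): this bin holds 92.1% of the “no” responses and 55.2% of the “yes” responses.
Medium calls (500-2000s) hold only 7.8% of the “no” responses but 43.9% of the “yes” responses.
Long calls (>2000s) are rare, with 0.1% of the “no” and 0.9% of the “yes” responses.
These column percentages are shares of each response, not success rates. From the counts, the success rate within each bin is 1650/23655 ≈ 7.0% for short, 1312/3177 ≈ 41.3% for medium, and 26/41 ≈ 63.4% for long calls.
The total is 26873 rather than 26874: with right = FALSE, the call whose duration equals the top break falls outside the last bin and becomes NA.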
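Because duration_binned is in the rows of this table, margin = 2 gives the share of each response that falls in each bin. The success rate within each bin is the row percentage (margin = 1) instead. The sketch below rebuilds the count table by hand; the matrix entries are copied from the printed output above, not recomputed from the data.

```r
# Counts copied from the contingency table printed above
counts <- matrix(c(22005, 1650,
                    1865, 1312,
                      15,   26),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("Short (<500s)",
                                   "Medium (500-2000s)",
                                   "Long (>2000s)"),
                                 c("no", "yes")))

# Row percentages: share of "no"/"yes" within each duration bin
row_pct <- round(prop.table(counts, margin = 1) * 100, 1)
print(row_pct)
```

On the real table object the equivalent call would be prop.table(t.v1, margin = 1).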

8.2: Non‐normalized bar graph of duration_binned, with response overlay.

ggplot(bank_marketing_training, aes(x = duration_binned)) +
  geom_bar(aes(fill = response)) +
  coord_flip() +
  labs(title = "Distribution of Call Duration (Binned) by Response",
       x = "Duration Category",
       y = "Count")

Shows overwhelming majority of calls are short duration
Medium and long duration calls are much less frequent
Raw counts make it hard to compare success rates

8.3: Normalized bar graph of duration_binned, with response overlay.

ggplot(bank_marketing_training, aes(x = duration_binned)) +
  geom_bar(aes(fill = response), position = "fill") +
  coord_flip() +
  labs(title = "Normalized Distribution of Call Duration (Binned) by Response",
       x = "Duration Category",
       y = "Proportion") +
  scale_y_continuous(labels = scales::percent)

Clear trend: longer calls have higher success rates
Short calls show lowest proportion of “yes” responses
Medium and long calls show progressively better success rates

9. Construct a contingency table of job with response, with counts and column percentages.

t.v1 <- table(bank_marketing_training$job, bank_marketing_training$response)
t.v2 <- addmargins(A = t.v1, FUN = list(total = sum), quiet = TRUE)
print(t.v2)

##                
##                    no   yes total
##   admin.         5903   854  6757
##   blue-collar    5631   420  6051
##   entrepreneur    842    72   914
##   housemaid       639    70   709
##   management     1680   209  1889
##   retired         852   291  1143
##   self-employed   825    93   918
##   services       2380   201  2581
##   student         404   194   598
##   technician     3972   465  4437
##   unemployed      573    94   667
##   unknown         185    25   210
##   total         23886  2988 26874

percentages <- round(prop.table(t.v1, margin = 2)*100, 1)
print(percentages)

##                
##                   no  yes
##   admin.        24.7 28.6
##   blue-collar   23.6 14.1
##   entrepreneur   3.5  2.4
##   housemaid      2.7  2.3
##   management     7.0  7.0
##   retired        3.6  9.7
##   self-employed  3.5  3.1
##   services      10.0  6.7
##   student        1.7  6.5
##   technician    16.6 15.6
##   unemployed     2.4  3.1
##   unknown        0.8  0.8

10. Referring to the previous exercise, do the following:

10.1: Combine the job categories according to the following response percentages: 0 < 10, 10 < 25, 25 < 33. Name the new variable job2.

bank_marketing_training$job2 <- cut(as.numeric(factor(bank_marketing_training$job)), 
                                  breaks = c(0, 10, 25, 33), 
                                  right = FALSE, 
                                  labels = c("Low (<10%)", "Medium (10-25%)", "High (25-33%)"))
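The call above bins the integer factor codes of job (1 to 12, in alphabetical order) rather than the response percentages, so the cutoffs 10, 25, and 33 are applied to the wrong quantity. A minimal sketch of one way to bin on the percentage of “yes” responses per job follows. The `toy` data frame and the helper names `yes_pct` and `job_group` are hypothetical; `bank_marketing_training` would replace `toy` in practice.

```r
# Hypothetical toy data standing in for bank_marketing_training
toy <- data.frame(
  job = c(rep("blue-collar", 20), rep("admin.", 10), rep("student", 10)),
  response = c(rep(c("yes", "no"), times = c(1, 19)),   # 5% yes
               rep(c("yes", "no"), times = c(2, 8)),    # 20% yes
               rep(c("yes", "no"), times = c(3, 7))),   # 30% yes
  stringsAsFactors = FALSE
)

# Percentage of "yes" responses within each job (row percentages)
job_tab <- table(toy$job, toy$response)
yes_pct <- prop.table(job_tab, margin = 1)[, "yes"] * 100

# Bin the percentages, not the factor codes, with the cutoffs from the exercise
job_group <- cut(yes_pct,
                 breaks = c(0, 10, 25, 33),
                 right = FALSE,
                 labels = c("Low (<10%)", "Medium (10-25%)", "High (25-33%)"))
names(job_group) <- names(yes_pct)  # cut() drops names, so restore them

# Look up each row's group by job name
toy$job2 <- unname(job_group[as.character(toy$job)])
table(toy$job2, toy$response)
```

Applied to the real data, the counts in Exercise 9 suggest that retired (291/1143 ≈ 25.5%) and student (194/598 ≈ 32.4%) would land in the High group, so that category would no longer be empty.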

10.2: Provide a contingency table of job2 with response, with counts and column percentages. Describe what you see.

t.v1 <- table(bank_marketing_training$job2, bank_marketing_training$response)
t.v2 <- addmargins(A = t.v1, FUN = list(total = sum), quiet = TRUE)
print(t.v2)

##                  
##                      no   yes total
##   Low (<10%)      19156  2404 21560
##   Medium (10-25%)  4730   584  5314
##   High (25-33%)       0     0     0
##   total           23886  2988 26874

percentages <- round(prop.table(t.v1, margin = 2)*100, 1)
print(percentages)

##                  
##                     no  yes
##   Low (<10%)      80.2 80.5
##   Medium (10-25%) 19.8 19.5
##   High (25-33%)    0.0  0.0

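The Low (<10%) category contains the majority of cases with 21,560 total, followed by Medium (10-25%) with 5,314 cases. The High (25-33%) category has zero cases, which points to a problem in how job2 was built: cut() was applied to the integer factor codes of job (1 to 12, in alphabetical order), not to the response percentages. Codes 1 to 9 (admin. through student) fall in [0, 10) and codes 10 to 12 (technician, unemployed, unknown) fall in [10, 25), so the labels do not describe response rates.

The column percentages show that for both “no” and “yes” responses, about 80% come from the Low category and 20% from the Medium category. Because the grouping is alphabetical rather than based on response, this similarity says little about whether job predicts response. The counts in Exercise 9 point the other way: the “yes” share ranges from about 6.9% for blue-collar (420/6051) to about 32.4% for student (194/598).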

10.3: Provide a normalized histogram of job2 with response. Describe the relationship.

ggplot(bank_marketing_training, aes(x = job2)) +
  geom_bar(aes(fill = response), position = "fill") +
  coord_flip() +
  labs(title = "Normalized Distribution of Job Categories by Response",
       x = "Job Category",
       y = "Proportion") +
  scale_y_continuous(labels = scales::percent)

The normalized graph shows nearly identical response patterns in the two populated job categories.
Both Low and Medium have approximately 89% “no” responses and 11% “yes” responses (2404/21560 and 584/5314 in the table from 10.2).
The High category is not visible as it contains no data.
Since job2 was built from the alphabetical factor codes of job rather than from response percentages, this graph does not show whether job is related to response.

Exercise1_Dataprep_Part2

Francisco Javier Estrada

2025-01-20