Banking Assignment 3

Author

Aakash Vashisth C00313452

Introdution

This report discusses two distinct data analysis tasks, one involving consumer insights and the other using data analytics. The initial task involves creating a classification model to determine if customers of a Portuguese bank would want to invest in a term deposit marketing campaign based on some relevant features of theirs. Using the data stemming from the marketing campaign, we will develop and present a classification tree model to help the bank better reach potential subscribers. The focus of the second task is K-means clustering, which examines consumers’ attitudes and behaviors surrounding recycling, in conjunction with segmentation. The goal here is to appreciate whether Carlow County Council can identify clusters in the responses from residents of Carlow about their behavioral pattern towards recycling, so as to better shape the recycling campaigns including the visions from more segments of people. Both tasks serve to illustrate the use of predictive modeling alongside consumer segmentation in marketing and other activities enhancing consumer awareness of the issues at hand.

Running Code

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.2
Warning: package 'ggplot2' was built under R version 4.4.2
Warning: package 'tibble' was built under R version 4.4.2
Warning: package 'tidyr' was built under R version 4.4.2
Warning: package 'readr' was built under R version 4.4.2
Warning: package 'purrr' was built under R version 4.4.2
Warning: package 'dplyr' was built under R version 4.4.2
Warning: package 'stringr' was built under R version 4.4.2
Warning: package 'forcats' was built under R version 4.4.2
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cluster)
library(ggplot2)
library(grid)
library(gridExtra)
Warning: package 'gridExtra' was built under R version 4.4.2

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine
comm <- read_csv("bank_sub_test.csv")
View(comm)
dom <- read_csv("bank_sub_test.csv")
View(dom)
comm_2 <- select(comm, age, loan, duration, prev_outcome, subscribed)
d1 <- dist(comm_2)
dom_2 <- select(dom, loan, duration, prev_outcome, subscribed)
d2 <- dist(dom_2)
h1 <- hclust(d1)
plot(h1, hang = -1)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv')

h2 <- hclust(d2)
plot(h2, hang = -1)

heatmap(as.matrix(d2), Rowv = as.dendrogram(h2), Colv = 'Rowv')

From these dendograms, the model is not overfitting but the nodes are impure.

clusters1 <- cutree(h1, k = 3)
clusters2 <- cutree(h2, k = 3)
sil1 <- silhouette(clusters1, d1)
summary(sil1)
Silhouette of 1334 units in 3 clusters from silhouette.default(x = clusters1, dist = d1) :
 Cluster sizes and average silhouette widths:
     1254        76         4 
0.7433888 0.5597983 0.7735888 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.4641  0.7555  0.8301  0.7330  0.8499  0.8579 
sil2 <- silhouette(clusters2, d2)
summary(sil2)
Silhouette of 1334 units in 3 clusters from silhouette.default(x = clusters2, dist = d2) :
 Cluster sizes and average silhouette widths:
     1274        56         4 
0.7476409 0.5772877 0.7516790 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.5404  0.7751  0.8381  0.7405  0.8569  0.8629 
comm_clus <- cbind(comm, clusters1)
comm_clus <- mutate(comm_clus, cluster = case_when(clusters1 == 1 ~ 'C1',
                                                   clusters1 == 2 ~ 'C2',
                                                   clusters1 == 3 ~ 'C3'))
 
dom_clus <- cbind(dom, clusters1)
dom_clus <- mutate(dom_clus, cluster = case_when(clusters1 == 1 ~ 'C1',
                                                   clusters1 == 2 ~ 'C2',
                                                   clusters1 == 3 ~ 'C3'))
size_rev <- comm_clus %>%
  group_by(cluster) %>%
  summarise(num_prev_contacts = n())
size_rev <- dom_clus %>%
  group_by(cluster) %>%
  summarise(num_prev_contacts = n())
ggplot(comm_clus, aes(x = age, group = subscribed)) + 
  geom_bar(aes(y = ..prop..), stat = "count", show.legend = FALSE) +
  facet_grid(~ cluster) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percentage of Subscription") + 
  xlab("Age Group") +
  ggtitle("Age Breakdown by Subscription") 
Warning: The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(prop)` instead.

ggplot(dom_clus, aes(x = age, group = subscribed)) + 
  geom_bar(aes(y = ..prop..), stat = "count", show.legend = FALSE) +
  facet_grid(~ cluster) +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percentage of Subscription") + 
  xlab("Age Group") +
  ggtitle("Age Breakdown by Subscription") 

comm_clus_means <- comm_clus %>%
  group_by(cluster) %>%
  summarise(Loan = mean(loan),
          Subscribed = mean(subscribed),
            Poutcome = mean(prev_outcome),
            Duration = mean(duration))
Warning: There were 9 warnings in `summarise()`.
The first warning was:
ℹ In argument: `Loan = mean(loan)`.
ℹ In group 1: `cluster = "C1"`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
ℹ Run `dplyr::last_dplyr_warnings()` to see the 8 remaining warnings.
comm_clus_means
# A tibble: 3 × 5
  cluster  Loan Subscribed Poutcome Duration
  <chr>   <dbl>      <dbl>    <dbl>    <dbl>
1 C1         NA         NA       NA     211.
2 C2         NA         NA       NA     963.
3 C3         NA         NA       NA    1779 
dom_clus_means <- dom_clus %>%
  group_by(cluster) %>%
  summarise(Loan = mean(loan),
          Subscribed = mean(subscribed),
            Poutcome = mean(prev_outcome),
            Duration = mean(duration))
Warning: There were 9 warnings in `summarise()`.
The first warning was:
ℹ In argument: `Loan = mean(loan)`.
ℹ In group 1: `cluster = "C1"`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
ℹ Run `dplyr::last_dplyr_warnings()` to see the 8 remaining warnings.
dom_clus_means
# A tibble: 3 × 5
  cluster  Loan Subscribed Poutcome Duration
  <chr>   <dbl>      <dbl>    <dbl>    <dbl>
1 C1         NA         NA       NA     211.
2 C2         NA         NA       NA     963.
3 C3         NA         NA       NA    1779 
comm_clus_tidy <- comm_clus_means %>%
  pivot_longer(cols = c( Loan, Subscribed, Poutcome, Duration), names_to = "Subscription_Method", values_to = "Average_Value")

comm_clus_tidy$Subscription_Method <- factor(comm_clus_tidy$Subscription_Method, levels = c("loan", "duration", "prev_outcome", "subscribed"))

comm_clus_tidy
# A tibble: 12 × 3
   cluster Subscription_Method Average_Value
   <chr>   <fct>                       <dbl>
 1 C1      <NA>                          NA 
 2 C1      <NA>                          NA 
 3 C1      <NA>                          NA 
 4 C1      <NA>                         211.
 5 C2      <NA>                          NA 
 6 C2      <NA>                          NA 
 7 C2      <NA>                          NA 
 8 C2      <NA>                         963.
 9 C3      <NA>                          NA 
10 C3      <NA>                          NA 
11 C3      <NA>                          NA 
12 C3      <NA>                        1779 
dom_clus_tidy <- dom_clus_means %>%
  pivot_longer( cols = c( Loan, Subscribed, Poutcome, Duration), names_to = "Subscription_Method", values_to = "Average_Value")

dom_clus_tidy$Subscription_Method <- factor(dom_clus_tidy$Subscription_Method, levels = c("loan", "duration", "prev_outcome", "subscribed"))
ggplot(comm_clus, mapping = aes(x = subscribed, y = age, group = cluster, colour = cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1) +
  scale_colour_manual(values = c("#A752A0", "#FCCA3A", "#378B84")) +
  ylab("Mean Satisfaction Score") + 
  xlab("Subscription Label") +
  ggtitle("Line Graph showing subscription by age")

#Question 2

#Import dataset

tab <- read_csv('recycling.csv')
Rows: 366 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): age
dbl (7): id, pos_impact, environ, money, bins, local, avoid_waste

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(tab)

#Compute distances between each pair of players

tab_2 <- select(tab, pos_impact:avoid_waste)
tab_2_scale <- scale(tab_2)
d3 <- dist(tab_2_scale)

#Carry out the hierarchical clustering.

h3 <- hclust(d3, method = "ward.D")
plot(h3, hang = -1)

heatmap(as.matrix(d3), Rowv = as.dendrogram(h3), Colv = 'Rowv', labRow = F, labCol = F)

#Decide on number of clusters

clusters3 <- cutree(h3, k = 4)

#Assess the quality of the segmentation

sil3 <- silhouette(clusters3, d3)
summary(sil3)
Silhouette of 366 units in 4 clusters from silhouette.default(x = clusters3, dist = d3) :
 Cluster sizes and average silhouette widths:
       119         82        138         27 
 0.2670730  0.3183990 -0.0221163  0.4981124 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.34602  0.02214  0.19958  0.18658  0.37769  0.62491 
tab_clus <- cbind(tab, clusters3)
tab_clus <- mutate(tab_clus, cluster = case_when(clusters3 == 1 ~ 'C1',
                                                 clusters3 == 2 ~ 'C2',
                                                 clusters3 == 3 ~ 'C3',
                                                 clusters3 == 4 ~ 'C4'))
tab_clus_means <- tab_clus %>%
  group_by(cluster) %>%
  summarise(Age = mean(age),
            Pos_Impact = mean(pos_impact),
            Environ = mean(environ),
            Money = mean(money),
            Bins = mean(bins),
            Local = mean(local),
            Avoid_Waste = mean(avoid_waste))
Warning: There were 4 warnings in `summarise()`.
The first warning was:
ℹ In argument: `Age = mean(age)`.
ℹ In group 1: `cluster = "C1"`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
tab_clus_means
# A tibble: 4 × 8
  cluster   Age Pos_Impact Environ Money  Bins Local Avoid_Waste
  <chr>   <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>       <dbl>
1 C1         NA       3.86    3.63  3.97  3.18  3.56        1.34
2 C2         NA       3.91    3.67  3.95  2.83  3.49        3.18
3 C3         NA       3.43    3.36  3.72  2.54  2.08        2.36
4 C4         NA       1.48    1.15  1.19  1.07  2.48        2.81
tab_clus_tidy <- tab_clus_means %>%
  pivot_longer(cols = c(Pos_Impact, Environ, Money, Bins, Local, Avoid_Waste), 
               names_to = "Factors", values_to = "Average_Value")

tab_clus_tidy
# A tibble: 24 × 4
   cluster   Age Factors     Average_Value
   <chr>   <dbl> <chr>               <dbl>
 1 C1         NA Pos_Impact           3.86
 2 C1         NA Environ              3.63
 3 C1         NA Money                3.97
 4 C1         NA Bins                 3.18
 5 C1         NA Local                3.56
 6 C1         NA Avoid_Waste          1.34
 7 C2         NA Pos_Impact           3.91
 8 C2         NA Environ              3.67
 9 C2         NA Money                3.95
10 C2         NA Bins                 2.83
# ℹ 14 more rows

#Visualise the mean satisfaction score for each contact method by cluster.

ggplot(tab_clus_tidy, mapping = aes(x = Factors, y = Average_Value, group = cluster, colour = cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  ylab("Mean Score") + 
  xlab("Non-Financial Factors") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + #Rotates the angle of the labels and moves them down so they don't overlap the bottom of the graph
  ggtitle("Average Rating of Non-Financial Variables by Cluster")

#Cluster Analysis and Strategic Recommendations for Carlow County Council:

The evaluation of the four clusters leads us to the conclusion that each cluster contains subjects with distinctly different views and approaches to recycling. It is noted that Cluster 1 encounters and agrees with most of the statements made about recycling such as ‘this creates a positive impact on the environment’ and ‘it is cost-effective’ therefore all of them are relatively knowledgeable about the entire process and do it actively. This group can be called “Engaged Recyclers”.

On the other hand, Cluster 2 feels more skeptical about their recycling contribution and also feels less knowledgeable regarding the details of recycling their plastic materials such as what what bins to use. Yet, this group shares the belief that recycling is indispensable to the solution of environmental problems. This group can be categorized as the “Environmentally Conscious, But Uncertain” cluster.

Cluster 3 describes people who are challenged with recycling and at times find ways to avoid waste but packaging seems difficult – they are not too active in recycling, however, they get its importance for the environment. For this group the term “Reluctant Recyclers” seems to fit well. Cluster 4, however, reported the least agreeable responses in particular the expected benefits of recycling itself and the reduced cost benefits. They are more inclined to complain that’s it’s hard not to waste and find it hard to get the thought of recycling. This group can be referred to as “Recycling Skeptics”. Strategies for Carlow County Council: Engaged Recyclers (Cluster 1). Strategy: Carlow County Council should this group actively engage by routinely informing them of the good benefits achieved as a result of their recycling efforts. They can be invited to join ecological leadership projects or act as ambassadors of the council’s campaigns. It might even encourage them to talk about their sharing to promote other people.

Conflicted Yet Environmentally Aware (Cluster 2):

Approach: An implementation of educational campaigns has also emerged as a stronger need among this group in terms of what can be recycled and how recycling and waste separation can be done. Tactics such as group sessions, laminated cards, or mobile applications that inform them about what to do with garbage would help boost their confidence. In addition, trying to demonstrate how doing it right helps communities in everyday life could be beneficial.

Cluster 3: The Defeated Recyclers

Approach: For people in this group, solutions go beyond just the infrastructure that supports recycling which includes the use of waste separation facilities. Providing easier to change solutions, such as more clearly stated instructions on recycling and what can be single use can assist to lessen confusion. Explaining that recycling does not require any special efforts and that there are regular neighborhood collections of recyclable materials may lead to greater participation.

Reserve Recyclers (Cluster 4):

Approach: The foremost recommendation for landfill diversion is to reduce the number of ‘Recycling Skeptics’ prominently. Reinventing the understanding of recycling specifically for Carlow County Council members to advocate the idea of long term better impact on the environment and cost/revenue aspects when the recycling is being performed. Custom messages that emphasize the aspects of their community and daily life might be more effective in reducing the number of “recycling skeptics.” In addition, seeing the shift in attitude toward the idea of recycling along discussing misconceptions with clarity and frequency may assist.

Having introduced such focused policies, Carlow County Council will be in a position to satisfy the needs and worries of every single person in the community and as a result enhance the level of engagement in recycling initiatives throughout the society.