Stratton AE-Banking is a newly founded online bank in the US market. The e-banking service is a joint venture between a young fintech start-up and the long-standing New York private banking house Stratton & Fils. The joint venture was founded in 2020 and has since attracted great interest with its digital private banking services. It benefits from an AI-driven recommender engine that combines customers' past investment information with a market and finance machine learning engine to derive investment tips and portfolio suggestions. So far, the fintech start-up has been successful in approaching young investors and customers. Through the joint venture with Stratton & Fils, the fintech also hopes to attract existing customers of the established bank.
However, the conservative management of Stratton & Fils is very reluctant to simply approach all of its customers. It fears that the data-driven, digital customer experience of Stratton AE may put off some long-standing customers and harm the long-established, very personal customer relationships that are believed to be an essential factor in the bank’s success.
The management therefore approaches you, the head of the data science team, and asks you to conduct a segmentation analysis of the bank’s existing customer base and to identify customer segments that might be open to trying out the Stratton & Fils joint venture. As a basis for your segmentation analysis, the CRM manager provides you with the following data.
| | Variable | Description | Measurement |
|---|---|---|---|
| 1 | Age | Customer Age | Age in Years |
| 2 | Income | Household Net Income | Net Income in USD |
| 3 | HouseholdSize | Number of People Living in Household | Integer number |
| 4 | CityAreaSize | City or Main Area Population | Integer number |
| 5 | MeanCityIncome | Average Income on ZIP-Code and Street Level | Average Income in USD |
| 6 | MeanCityHousePrize | Average House Prices on ZIP-Code and Street Level from last 5 years | Average Prices in USD |
| 7 | MeanCityHouseHoldSize | Average Household Size on ZIP-Code and Street Level from last 10 years | Average Number Inhabitants |
| 8 | MeanCitySqFtPrice | Average Prices per Square Foot on ZIP-Code and Street Level | USD |
| 9 | NumberCars | Number of registered cars of customer | Number of Cars |
| 10 | InternetTrafficVolume | Volume of Internet Traffic per customer household | GB |
| 11 | MortageVolume | Mortgage to be paid by Customer | USD |
| 12 | AccountSpending | Monthly average spending from bank account | USD |
| 13 | CreditCardSpending | Monthly average spending from Credit Card | USD |
| 14 | HelpHotlineTime | Number of Minutes with Banking Hotline | Minutes |
| 15 | CustomerSince | Time since opening bank account | Months |
| 16 | GrocerySpending | Average grocery related spendings from bank account | USD |
| 17 | StockVolume | Stock Investment | USD |
| 18 | CreditVolume | Credits with the bank | USD |
| 19 | NASDAQInvest | Amount of money invested in NASDAQ listed companies | USD |
| 20 | USAXSFundInvest | Amount of money invested in Stratton owned share fund for mid sized US companies | USD |
| 21 | BranchVisits | Number of recorded branch visits within the last 8 weeks | Integer number |
| 22 | AppLogins | Number of customer logins in mobile banking app within the last 8 weeks | Integer number |
| 23 | ATMVisits | Number of times customer used an ATM service point within the last 8 weeks | Integer number |
| 24 | TimeOnlineBanking | Time logged into the Online Banking System | Minutes |
| 25 | ServiceFees | Extra Fees paid for banking services | USD |
| 26 | SocialMediaInter | Number of Finance Specific Social Media Profiles a customer follows | Integer number |
| 27 | Bitcoins | Number of Bitcoins held by customer | Number |
| 28 | NFTs | Number of NFTs bought by customer | Integer number |
We can now load the data in R with the read_csv() command and then inspect the data frame with the summary() command.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Import Data
BankinCRMData <- read_csv("Data/StrattonAEBankingCRM.csv")
## Rows: 10750 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (28): Age, Income, HouseholdSize, CityAreaSize, MeanCityIncome, MeanCity...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(BankinCRMData)
## Age Income HouseholdSize CityAreaSize
## Min. :18.00 Min. : 35202 Min. :1.00 Min. : 61613
## 1st Qu.:23.00 1st Qu.: 42803 1st Qu.:2.00 1st Qu.:121704
## Median :30.00 Median : 71268 Median :3.00 Median :450100
## Mean :35.16 Mean : 84700 Mean :2.81 Mean :372196
## 3rd Qu.:45.00 3rd Qu.:125870 3rd Qu.:4.00 3rd Qu.:459418
## Max. :74.00 Max. :181863 Max. :8.00 Max. :708729
## MeanCityIncome MeanCityHousePrize MeanCityHouseHoldSize MeanCitySqFtPrice
## Min. : 35372 Min. : 125011 Min. :1.000 Min. :1871
## 1st Qu.:116253 1st Qu.: 444817 1st Qu.:2.000 1st Qu.:2627
## Median :140458 Median : 614601 Median :3.000 Median :5778
## Mean :163706 Mean : 942505 Mean :3.023 Mean :5318
## 3rd Qu.:235000 3rd Qu.:1849915 3rd Qu.:4.000 3rd Qu.:6741
## Max. :286996 Max. :1850000 Max. :8.000 Max. :9886
## NumberCars InternetTrafficVolume MortageVolume AccountSpending
## Min. :0.000 Min. : 6.00 Min. : 14898 Min. : 500.0
## 1st Qu.:1.000 1st Qu.: 45.00 1st Qu.:120462 1st Qu.: 560.1
## Median :1.000 Median : 60.00 Median :232414 Median : 898.0
## Mean :1.384 Mean : 67.57 Mean :202824 Mean :1275.7
## 3rd Qu.:2.000 3rd Qu.: 86.00 3rd Qu.:287298 3rd Qu.:1647.6
## Max. :4.000 Max. :118.00 Max. :605846 Max. :4257.1
## CreditCardSpending HelpHotlineTime CustomerSince GrocerySpending
## Min. : 501.1 Min. : 0.00606 Min. : 0.00 Min. : 150.1
## 1st Qu.: 651.1 1st Qu.: 4.57751 1st Qu.: 3.00 1st Qu.: 293.4
## Median : 785.7 Median : 8.77448 Median :11.00 Median : 426.5
## Mean :1013.3 Mean :12.81641 Mean :19.25 Mean : 535.8
## 3rd Qu.:1451.4 3rd Qu.:16.59390 3rd Qu.:36.00 3rd Qu.: 627.9
## Max. :2041.9 Max. :60.75499 Max. :74.00 Max. :1253.5
## StockVolume CreditVolume NASDAQInvest USAXSFundInvest
## Min. : 388 Min. : 117.3 Min. : 228.4 Min. : 69.95
## 1st Qu.:1059 1st Qu.: 161.7 1st Qu.: 401.4 1st Qu.: 149.80
## Median :1537 Median : 802.7 Median :1498.1 Median : 313.82
## Mean :2142 Mean :1330.1 Mean :1828.5 Mean : 761.64
## 3rd Qu.:2505 3rd Qu.:2488.0 3rd Qu.:3056.1 3rd Qu.:1060.23
## Max. :5738 Max. :3532.3 Max. :4532.4 Max. :3396.61
## BranchVisits AppLogins ATMVisits TimeOnlineBanking
## Min. : 0.000 Min. : 1.0 Min. : 0.000 Min. : 22.77
## 1st Qu.: 2.000 1st Qu.: 18.0 1st Qu.: 3.000 1st Qu.: 69.28
## Median : 3.000 Median : 64.0 Median : 5.000 Median : 88.26
## Mean : 3.913 Mean : 55.7 Mean : 4.928 Mean :113.88
## 3rd Qu.: 5.000 3rd Qu.: 82.0 3rd Qu.: 7.000 3rd Qu.:152.92
## Max. :20.000 Max. :130.0 Max. :11.000 Max. :232.21
## ServiceFees SocialMediaInter Bitcoins NFTs
## Min. : 0.1343 Min. : 0.00 Min. :0.0000 Min. : 0.000
## 1st Qu.: 17.8442 1st Qu.: 5.00 1st Qu.:0.0005 1st Qu.: 1.000
## Median : 27.2386 Median :16.00 Median :0.0998 Median : 3.000
## Mean : 40.9382 Mean :19.03 Mean :0.1937 Mean : 3.317
## 3rd Qu.: 50.1652 3rd Qu.:31.00 3rd Qu.:0.4005 3rd Qu.: 4.000
## Max. :124.2613 Max. :60.00 Max. :0.6014 Max. :12.000
To identify segments of similar customers, let us first focus on the question of how to measure similarity. Table 2 shows some observations for customers from another banking database. The columns show the values of some customer-related attributes. We can use these attribute values to calculate a so-called distance measure, which shows how similar or dissimilar two customers are. The higher the distance, the more dissimilar they are. For continuous variables, we can use the basic Euclidean distance to measure similarity. The Euclidean distance between two customers A and B, described by n attributes f_1 to f_n, can be expressed by the following equation.
\[ED_{A,B}= \sqrt{(f_{1,A}-f_{1,B})^{2}+(f_{2,A}-f_{2,B})^{2}+...+(f_{n,A}-f_{n,B})^{2}} \]
Table2_1
## # A tibble: 5 × 5
## Customer Age Income Debt `Household Size`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Hawkeye 32 45000 25000 1
## 2 Potter 64 75000 10000 3
## 3 Burns 49 42000 20000 5
## 4 Hotlips 33 22000 2000 1
## 5 Klinger 29 16000 6000 4
We can now use the formula of the Euclidean distance to calculate, for example, the distance between Hawkeye and Potter. To keep the numbers small, Income and Debt are entered in thousands of USD.
ED_Hawkeye_Potter = sqrt((32-64)^2 + (45-75)^2+ (25-10)^2 + (1-3)^2)
ED_Hawkeye_Potter
## [1] 46.40043
Question: Repeat the calculations for Hawkeye and Burns as well as Hawkeye and Hotlips.
While this is a great exercise, calculating the distances among all members of a large customer database with, e.g., 200,000 entries by hand is impractical. In this case we can use the distance() function from the philentropy package. We simply pass it a data frame with all observations we would like to compare, and it returns a table with the corresponding pairwise distances. Note that Income and Debt enter here in USD rather than in thousands of USD, so the values are much larger than in the manual calculation.
library(philentropy)
distance(Table2_1[,2:5], method = "euclidean")
## Metric: 'euclidean'; comparing: 5 vectors.
## v1 v2 v3 v4 v5
## v1 0.000 33541.03 5830.978 32526.912 34669.872
## v2 33541.035 0.00 34481.883 53600.382 59135.448
## v3 5830.978 34481.88 0.000 26907.253 29529.653
## v4 32526.912 53600.38 26907.253 0.000 7211.104
## v5 34669.872 59135.45 29529.653 7211.104 0.000
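The same matrix can also be obtained with base R's dist() function, which has the advantage of keeping the customer names as labels. The following is a minimal sketch that re-enters Table 2 by hand; the object names `customers` and `customers_k` are made up for this illustration. It also repeats the calculation with Income and Debt in thousands of USD, as in the manual example.

```r
# Table 2 re-entered by hand; values copied from the printout above
customers <- data.frame(
  Age           = c(32, 64, 49, 33, 29),
  Income        = c(45000, 75000, 42000, 22000, 16000),
  Debt          = c(25000, 10000, 20000, 2000, 6000),
  HouseholdSize = c(1, 3, 5, 1, 4),
  row.names     = c("Hawkeye", "Potter", "Burns", "Hotlips", "Klinger")
)

# Base R: pairwise Euclidean distances, labelled by customer name
d_raw <- as.matrix(dist(customers, method = "euclidean"))
round(d_raw, 2)

# Same data with Income and Debt in thousands of USD (assumption: this is
# the unit used in the manual Hawkeye/Potter calculation)
customers_k <- transform(customers, Income = Income / 1000, Debt = Debt / 1000)
d_k <- as.matrix(dist(customers_k))
round(d_k, 2)
```

The Hawkeye/Potter entry is 33541.03 in the first matrix and 46.40 in the second, so the choice of units alone changes the magnitude of the distances considerably.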
While the distances help us understand similarities and dissimilarities, they do not yet help us form subgroups, and from the distances alone you do not know which threshold separates similar from dissimilar customers. Hawkeye may be closest to Burns, but is a distance of about 18 (with Income and Debt in thousands of USD, or about 5,831 in the table above) still large? Or is it already fairly small? Who should be paired with whom?
This implies that grouping consumers into homogeneous subgroups requires care, balance, and more information than similarity measures alone. In addition, we saw with our 5 customers that manual grouping takes time and effort, which makes it impractical for forming larger groups or for segmenting data sets with hundreds of thousands of customers. It is therefore time to discover a method that uses the distances between subjects to form groups automatically. Such methods are commonly referred to as cluster analysis. Cluster analysis is a well-known and established family of statistical methods that has been used in marketing research for the last 30-40 years. With the advent of machine learning and artificial intelligence applications, cluster analysis has become popular again in data science, where it is often referred to as unsupervised learning.
k-means cluster analysis uses distances to form clusters within data. Once the user has determined the number of clusters k the algorithm should find, the algorithm randomly places k starting points within the data (Step 1). It then calculates the distance of each observation to each starting point. As illustrated in the figure below, the algorithm then assigns each observation to the closest starting point (Step 2). This leads to an initial cluster solution. For each of these clusters, the algorithm then calculates the new center point of the cluster, called the centroid (Step 3). The centroid is the mean of all observations within the cluster. Step 4 repeats the procedure of Step 2: the new centroids are used to recalculate the distances between all observations and all centroids, and the observations are again assigned to the closest centroid. This may change cluster memberships and thus the shape of the clusters. In the subsequent step, the algorithm calculates the resulting new centroids (Step 5), then recalculates the distances and reassigns observations to clusters. The algorithm stops once no observation is reassigned to another cluster or after a user-specified maximum number of iterations.
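To make these steps concrete, here is a deliberately simple sketch of the procedure on simulated data. The function `simple_kmeans()` and the object `toy` are hypothetical names introduced for this illustration only; this is not how R's built-in kmeans() is implemented, and for real analyses the built-in function should be used.

```r
# Toy k-means, for illustration only; use stats::kmeans() in practice
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k observations at random as starting points
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Steps 2 and 4: squared Euclidean distance of every observation
    # to every center (one column per center)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once no observation changes its cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Steps 3 and 5: recompute each centroid as the mean of its members
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Simulated data: two well-separated groups of 20 points each
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
fit$centers
```

Because the two simulated groups are far apart, the loop settles after a few iterations with one centroid near (0, 0) and the other near (5, 5).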
Before running a cluster analysis, we should take care of scale heterogeneity. k-means clustering in particular is sensitive to variables measured on different scales, because variables with large numeric ranges dominate the distance calculation, which may ultimately lead to biased results. A quick fix is to transform the variables so that they all share a similar range. This procedure is commonly referred to as standardization.
R can standardize all variables for us with the scale() function, which centers each variable at a mean of 0 and rescales it to a standard deviation of 1. We can then inspect the resulting object scaled.crm (a matrix) with the head() function.
scaled.crm = scale(BankinCRMData)
head(scaled.crm)
## Age Income HouseholdSize CityAreaSize MeanCityIncome
## [1,] 0.3576515 -0.1028962 -0.6110203 0.3617815 -1.1309409
## [2,] 0.1361200 -0.2651878 1.6529279 0.3520407 -0.1078328
## [3,] 2.6468104 -0.1251715 -1.3656697 0.3701496 -1.7221925
## [4,] 1.3176214 -0.2991988 0.1436291 0.3701496 -0.7011903
## [5,] 0.3576515 -0.2119217 -1.3656697 0.3500189 -1.9739204
## [6,] -0.2330992 -0.1816402 3.1622266 0.3681716 -1.1212168
## MeanCityHousePrize MeanCityHouseHoldSize MeanCitySqFtPrice NumberCars
## [1,] 1.271530 -0.01716176 -0.62863020 0.6657393
## [2,] 1.270998 1.47667561 -0.02266382 0.6657393
## [3,] 1.271494 -0.01716176 -0.79901909 -0.4146800
## [4,] 1.270582 0.72975692 -1.32689056 -1.4950993
## [5,] 1.270505 2.22359429 -1.31895578 -0.4146800
## [6,] 1.271217 -0.01716176 -0.26404808 -0.4146800
## InternetTrafficVolume MortageVolume AccountSpending CreditCardSpending
## [1,] -0.2889997 1.6471061 -0.3450280 0.8636732
## [2,] -0.9537126 1.2700695 -0.1509407 -0.6818782
## [3,] -0.3192140 0.5749801 -0.3528008 0.5714726
## [4,] 0.1037852 1.3859733 -0.2801354 0.2591261
## [5,] -0.8630700 1.0690862 -0.1072321 0.2034089
## [6,] -0.4702851 0.9330313 -0.3513856 -0.1526760
## HelpHotlineTime CustomerSince GrocerySpending StockVolume CreditVolume
## [1,] -0.8632445 0.7723802 -0.34104672 -0.6842514 -0.4354132
## [2,] -0.8427150 0.7723802 0.19341832 -0.5016177 -0.4407973
## [3,] -0.9302276 0.7723802 0.12646708 -0.6849314 -0.4507881
## [4,] -0.6214784 0.7723802 0.08311691 -0.5265523 -0.4594416
## [5,] -0.6937764 0.8184972 -0.75375557 -0.7384732 -0.4396339
## [6,] -0.5917007 0.7723802 0.09847517 -0.3919688 -0.4472727
## NASDAQInvest USAXSFundInvest BranchVisits AppLogins ATMVisits
## [1,] -0.2353360 -0.3417909 -0.28345927 -1.319352 1.751956
## [2,] -0.2240967 -0.3244452 0.02716123 -1.059550 1.321701
## [3,] -0.2269506 -0.3549125 -0.28345927 -1.203885 1.321701
## [4,] -0.2294082 -0.3270950 0.02716123 -1.203885 1.321701
## [5,] -0.2268875 -0.3721992 -0.28345927 -1.146151 1.751956
## [6,] -0.2259212 -0.2313141 0.02716123 -1.232751 1.321701
## TimeOnlineBanking ServiceFees SocialMediaInter Bitcoins NFTs
## [1,] -0.7152344 0.02878044 0.4675133 -0.8542812 -0.4895549
## [2,] -0.7897758 0.36121889 0.3502121 -0.8520388 -0.8611631
## [3,] -0.9289516 0.44046295 0.5261639 -0.8076386 -0.8611631
## [4,] -0.8941460 0.02794600 0.8780673 -0.8614570 -0.8611631
## [5,] -0.7894912 0.52092062 0.8780673 -0.8349963 -1.2327713
## [6,] -0.9766564 0.51830595 0.2915616 -0.8551782 -0.1179467
As you can see, all variables now take values in similar ranges. We can thus proceed with our analysis.
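What scale() does can be verified by hand on a small example. The sketch below uses two made-up variables (`toy` is not part of the CRM data) and compares a manual z-score calculation with the output of scale().

```r
# Two toy variables on very different scales (made-up values)
toy <- data.frame(
  Age    = c(25, 40, 33, 58, 46),
  Income = c(30000, 85000, 52000, 120000, 64000)
)

# Manual z-score: subtract the column mean, divide by the column standard deviation
manual <- as.data.frame(lapply(toy, function(v) (v - mean(v)) / sd(v)))

# scale() does the same and returns a matrix, not a data frame
scaled <- scale(toy)

# Compare the two results, ignoring the extra attributes scale() attaches
all.equal(as.matrix(manual), scaled, check.attributes = FALSE)

# After standardization, each column has mean 0 and standard deviation 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

The means and standard deviations used for the transformation are stored in the attributes "scaled:center" and "scaled:scale" of the returned matrix, which is useful when new observations need to be put on the same scale later.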
We can now start with the cluster analysis. Let us first try out solutions with different numbers of clusters. Because k-means starts from randomly chosen centroids, we call set.seed() first so that every run of this code starts from the same centroids and returns the same result. If you do not call set.seed() ahead of the cluster analysis, you may receive solutions that are close to each other but not identical. We run a k-means cluster analysis with R’s kmeans() function. We simply tell kmeans() which object contains our customer data (here the standardized matrix scaled.crm) and specify the number of clusters k we want. Here we set k to 4.
To see how many customers are assigned to each cluster, we furthermore plot the cluster sizes with the help of ggplot and a simple barplot.
library(ggplot2)
set.seed(123)
StrattonCluster_4k <- kmeans(scaled.crm, 4)
StrattonCluster_4k[["size"]]
## [1] 1000 1250 2996 5504
sizes4k <- data.frame(Size = StrattonCluster_4k[["size"]],
Cluster = c("Cluster1", "Cluster2", "Cluster3", "Cluster4"))
ggplot(sizes4k, aes(x=factor(Cluster), y=Size)) +
geom_col(fill=hcl(195, 100, 65)) +
xlab("Cluster") + ylab("Size") + geom_text(aes(label=Size), vjust=0) +
ggtitle("Cluster sizes k-means 4-cluster solution")
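The effect of the seed can be checked directly. The following sketch uses simulated data (`toy` is a made-up object, not the CRM data) to show that two runs with the same seed give identical assignments. It also uses the `nstart` argument of kmeans(), which is not used in the analysis above: it runs the algorithm from several random starts and keeps the best solution, which makes the result less dependent on one particular starting configuration.

```r
# Toy data: three loose groups in two dimensions (simulated, not the CRM data)
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2)
)

# Same seed, same call: identical cluster assignments
set.seed(123); fit_a <- kmeans(toy, centers = 3)
set.seed(123); fit_b <- kmeans(toy, centers = 3)
identical(fit_a$cluster, fit_b$cluster)

# nstart = 25 tries 25 random starts and keeps the solution with the
# lowest total within-cluster sum of squares
set.seed(123)
fit_multi <- kmeans(toy, centers = 3, nstart = 25)
c(single_start = fit_a$tot.withinss, multi_start = fit_multi$tot.withinss)
```

Note that the cluster numbers themselves carry no meaning: a different seed can produce the same groups under different labels.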
We can now inspect the different clusters and check their mean values. We achieve this with the following code, which first attaches the estimated cluster to each observation in our data frame. Subsequently, we use dplyr’s group_by() command to calculate the mean of each variable per cluster. You can then inspect the resulting data frame. You will see that some of the clusters show substantially different mean values for specific variables, while in other cases the means do not vary across the clusters.
#Build Cluster Specific Means for all Variables
BankinCRMData$k4Cluster = StrattonCluster_4k[["cluster"]]
summarystats.percluster_4k = BankinCRMData %>% group_by(k4Cluster) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
head(summarystats.percluster_4k)
## # A tibble: 4 × 29
## k4Cluster Age Income HouseholdSize CityAreaSize MeanCityIncome
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 41.9 67991. 3.44 450058. 109775.
## 2 2 37.4 74948. 2.11 453890. 235000
## 3 3 43.7 157931. 2.98 160005. 115723.
## 4 4 28.8 50088. 2.76 454998. 183432.
## # ℹ 23 more variables: MeanCityHousePrize <dbl>, MeanCityHouseHoldSize <dbl>,
## # MeanCitySqFtPrice <dbl>, NumberCars <dbl>, InternetTrafficVolume <dbl>,
## # MortageVolume <dbl>, AccountSpending <dbl>, CreditCardSpending <dbl>,
## # HelpHotlineTime <dbl>, CustomerSince <dbl>, GrocerySpending <dbl>,
## # StockVolume <dbl>, CreditVolume <dbl>, NASDAQInvest <dbl>,
## # USAXSFundInvest <dbl>, BranchVisits <dbl>, AppLogins <dbl>,
## # ATMVisits <dbl>, TimeOnlineBanking <dbl>, ServiceFees <dbl>, …
Another approach to assessing the quality of our segmentation is to plot the different clusters. A key challenge here is dimensionality. Given that our clusters depend on a multitude of variables, we cannot plot them all together. To arrive at something we can plot, we need to reduce the dimensions to two main components, which allows us to plot the points in two-dimensional space. A common technique to achieve this is principal component analysis (PCA), which condenses all variables into a few components, of which we plot the first two. The plot then lets us see whether clusters overlap or whether we end up with a meaningful separation between the identified clusters. R’s factoextra package offers various functions that achieve this with a single command, so that we need to code neither the PCA nor the plot.
#Plot Clusters for 4k solution
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_cluster(StrattonCluster_4k, scaled.crm, ellipse.type = "norm")
A quick inspection of the plot already reveals that our 4-cluster solution is not optimal, as we can see further separable groups of observations that lie close to each other. Especially for clusters 3 and 4 (the larger ones), it looks like we can still split each of these groups into two subgroups.
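For readers who want to see the dimension reduction spelled out, the sketch below performs the PCA step by hand with base R's prcomp() on simulated data and colors the points by their k-means cluster. The objects `toy`, `pca`, `scores` and `explained` are made up for this illustration; it is meant as a rough equivalent of what the convenience function does, not as a replica of its output.

```r
# Simulated stand-in for a scaled customer matrix (4 variables, 2 groups)
set.seed(1)
toy <- scale(rbind(
  matrix(rnorm(200, mean = 0), ncol = 4),
  matrix(rnorm(200, mean = 3), ncol = 4)
))
colnames(toy) <- c("V1", "V2", "V3", "V4")

fit <- kmeans(toy, centers = 2)

# PCA on the already standardized data; keep the first two components
pca <- prcomp(toy, center = FALSE, scale. = FALSE)
scores <- as.data.frame(pca$x[, 1:2])
scores$cluster <- factor(fit$cluster)

# Share of total variance captured by the two plotted dimensions
explained <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
explained

# Scatter plot of the observations in PCA space, colored by cluster
plot(scores$PC1, scores$PC2, col = scores$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Clusters in PCA space")
```

The share of explained variance is worth checking: if the first two components capture only a small part of the total variance, clusters that overlap in the plot may still be well separated in the full variable space.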
Question: Repeat the cluster analysis with k = 5 and k = 6
Trying out different solutions may point you toward something, but you will realize that determining the right number of clusters can be tricky.
To find the “best” number of clusters, there are different approaches and measures available. Before we discuss these, let us first reflect again on what we want to achieve with a cluster analysis.
We want to obtain subgroups that are homogeneous within. In other words, we try to maximize within-group homogeneity, which means we try to reduce the variance among the members of a cluster. The overall level of within-cluster variance across all identified clusters can thus be used to describe the total degree of homogeneity obtained with a specific cluster solution. This gives us a way to compare cluster analyses with different numbers of clusters, as we can look for the solution that keeps the overall variance low.
Using the within-cluster variance values, we can determine which solution works best and then focus on this cluster analysis. To do so, we first estimate a series of cluster solutions with the number of clusters running from 1 up to some maximum k. Subsequently, we plot the total within-cluster variance for each solution.
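As a minimal sketch of this procedure, the following code runs kmeans() for 1 to 8 clusters on simulated data and collects the total within-cluster sum of squares, which kmeans() reports as `tot.withinss`. The objects `toy`, `ks` and `wss` are made up for this illustration, and `nstart = 10` is an assumption chosen to make each solution less dependent on its random start.

```r
# Simulated data with three groups (not the CRM data)
set.seed(7)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 8
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
data.frame(k = ks, wss = round(wss, 1))

# Line plot of the within-cluster sum of squares against k
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Since adding clusters generally keeps lowering this measure, the smallest value is not the target. What matters is the point after which further clusters bring only minor improvements.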
Again, R can do this for us with a few short lines of code. Below you find two measures: the total within-cluster sum of squares (elbow method) and the average silhouette width. We ask R to estimate k-means models with up to 15 clusters and to plot the respective measure for each solution. Don’t worry if this takes some time.
# Obtain elbow and silhouette plots to determine optimal k
factoextra::fviz_nbclust(na.omit(scaled.crm), kmeans, method = "wss", k.max = 15)
factoextra::fviz_nbclust(na.omit(scaled.crm), kmeans, method = "silhouette", k.max = 15)
The elbow plot (first plot) shows the total within-cluster sum of squares for all 15 estimated solutions. The rule of thumb states that the optimal number of clusters lies at the “elbow” of the curve, i.e. the point after which adding further clusters yields only small improvements. This is rather tricky here, as the curve drops immediately and shows very low summed variances for 2 to 15 clusters. Therefore, we rely on a second method, the silhouette plot. The silhouette coefficient measures how close an observation is to the other members of its own cluster compared with the members of the nearest neighbouring cluster. The coefficient ranges from −1 to +1. High values indicate strong separation, low values poor separation. We thus want to select the cluster solution with the highest average silhouette coefficient. In our case, the plot suggests 8 clusters. Looking again at the elbow plot, 8 seems rather high, as the “elbow” lies somewhere between 5 and 7. The silhouette plot, however, suggests that the 7-cluster solution is inferior to the 6- and 8-cluster solutions. We may thus enrich our insights by plotting all three solutions with the following commands.
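The average silhouette coefficient can also be computed by hand with the silhouette() function from the cluster package. The sketch below does so on simulated data with three groups; `toy`, `ks` and `avg_sil` are made-up names, and `nstart = 10` is an assumption for this illustration.

```r
library(cluster)  # recommended package that ships with standard R installations

# Simulated data with three groups (not the CRM data)
set.seed(7)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Pairwise Euclidean distances, needed by silhouette()
d <- dist(toy)

# Average silhouette width for k = 2, ..., 6 (not defined for k = 1)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})

data.frame(k = ks, avg_silhouette = round(avg_sil, 3))

# Number of clusters with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
best_k
```

On large data sets the full distance matrix can become very big, so this manual route is mainly useful for understanding the measure or for working on a sample.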
#Plot Cluster Solutions
#k6
StrattonCluster_6k <- kmeans(scaled.crm, 6)
fviz_cluster(StrattonCluster_6k, scaled.crm, ellipse.type = "norm")
#k7
StrattonCluster_7k <- kmeans(scaled.crm, 7)
fviz_cluster(StrattonCluster_7k, scaled.crm, ellipse.type = "norm")
#k8
StrattonCluster_8k <- kmeans(scaled.crm, 8)
fviz_cluster(StrattonCluster_8k, scaled.crm, ellipse.type = "norm")
Let us start by looking in more detail at our 8-cluster k-means model and see how big each cluster is, with the following code.
# 8 cluster k-mean cluster size plot
sizes8k <- data.frame(Size = StrattonCluster_8k[["size"]],
Cluster = c("Cluster1", "Cluster2", "Cluster3", "Cluster4",
"Cluster5", "Cluster6", "Cluster7", "Cluster8"))
ggplot(sizes8k, aes(x=factor(Cluster), y=Size)) +
geom_col(fill=hcl(195, 100, 65)) +
xlab("Cluster") + ylab("Size") + geom_text(aes(label=Size), vjust=0) +
ggtitle("Cluster sizes k-means 8-cluster solution")
To gain deeper insights related to spending behavior as well as the digital affinity of the different segments, we want to plot the means of the different variables. To achieve this we first again assemble a descriptive data set with all variable means per cluster with the help of dplyr’s group_by function.
# Build Mean per Cluster DataFrame
BankinCRMData$k8Cluster = StrattonCluster_8k[["cluster"]]
summarystats.percluster_8k = BankinCRMData %>% group_by(k8Cluster) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
We can now generate barplots of the different variables of interest and see if we find promising segments of Stratton & Fils customers who might be open and suitable for Stratton AE Banking. Let us first focus on spending behavior, as indicated by the ServiceFees variable. Note that we adapted some of the commands in ggplot. By leaving geom_col() blank we do not specify a color and the bars remain grey. In addition, we ask ggplot in geom_text() to add labels with the values of ServiceFees rounded to two digits, in white and in font size 2. With position_stack(vjust = 0.5) we put the values in the middle of the bars.
#Barplot of Service Fees
ggplot(summarystats.percluster_8k, aes(x=factor(k8Cluster), y=ServiceFees)) +
geom_col() +
xlab("Clusters") + ylab("Spending") +
geom_text(aes(label = round(ServiceFees, digits = 2)),
size = 2, colour = "white",
position = position_stack(vjust = 0.5)) +
ggtitle("Average Spending in Service Fees per Cluster")
A visual inspection indicates that clusters 4 and 6 show the highest spending, with clusters 8 and 2 following, while the remaining clusters show rather low service fee spending. This makes at least the four high-spending segments attractive for AE Banking. However, to be sure that the rather novel and highly digital app service appeals to these segments, we need to understand how digitally active and interested these segments are.
Let us first focus on the latest developments in fintech such as Bitcoin and NFT investments. We can again compare the segment-specific means for both variables. This time we want to show Bitcoins and NFTs in one figure. We can arrange this with ggplot’s facet_wrap() function, which combines plots of different variables. The only “complication” is that we need to rearrange the data set we want to plot. We first select the variables of interest (cluster, NFTs and Bitcoins) with dplyr’s select() and then reshape the data frame from wide to long format with tidyr’s gather(). We can then use ggplot again. This time we use geom_bar(stat = 'identity'), which is equivalent to geom_col(). facet_wrap() now tells ggplot to make two plots and stack them on top of each other (ncol = 1). By setting scales to “free_y” we allow different y-axis ranges, given that the scales vary substantially across the two variables.
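The reshaping step on its own can be illustrated with a small made-up table. The sketch below uses tidyr's pivot_longer(), which the tidyr documentation recommends over gather() for new code; the data frame `cluster_means` and its values are hypothetical.

```r
library(tidyr)

# Hypothetical cluster means for three clusters (made-up numbers)
cluster_means <- data.frame(
  Cluster  = 1:3,
  NFTs     = c(1.2, 6.5, 3.1),
  Bitcoins = c(0.01, 0.45, 0.12)
)

# Wide to long: one row per cluster-variable pair, which is the shape
# needed to facet a plot by variable
long <- pivot_longer(cluster_means,
                     cols      = c(NFTs, Bitcoins),
                     names_to  = "variable",
                     values_to = "value")
long
```

The wide table has one row per cluster and one column per variable, whereas the long table has one row per combination of cluster and variable, with the variable name stored in its own column.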
#Barplots of Fintech Investments
FinTech <- summarystats.percluster_8k %>% select(k8Cluster, NFTs, Bitcoins) %>%
gather(key = "variable", value = "value", -k8Cluster)
ggplot(FinTech, aes(factor(k8Cluster), value))+
geom_bar(stat='identity') + xlab("Clusters") +
facet_wrap(~variable, ncol=1, scales = "free_y") +
geom_text(aes(label = round(value, digits = 1)), size = 2, colour = "white",
position = position_stack(vjust = 0.5)) +
ggtitle("FinTech Cluster Means")
From the inspection, we can see that clusters 4 and 6 both show the most activity in NFT acquisitions and are also the most invested in Bitcoins, which makes them even more suitable for AE Banking. Let us now compare digital and offline activities. With the following code we can inspect the means for BranchVisits, AppLogins, ATMVisits, TimeOnlineBanking, SocialMediaInter and InternetTrafficVolume. As you can see from facet_wrap(), we now arrange the plots in two columns.
#Plots for Digital vs. Offline Life
DigLife = summarystats.percluster_8k %>%
select(k8Cluster, BranchVisits, AppLogins,
ATMVisits, TimeOnlineBanking, SocialMediaInter,
InternetTrafficVolume) %>%
gather(key = "variable", value = "value", -k8Cluster)
ggplot(DigLife, aes(factor(k8Cluster), value))+
geom_bar(stat='identity') + xlab("Clusters") +
facet_wrap(~variable, ncol=2, scales = "free_y") +
geom_text(aes(label = round(value, digits = 1)), size = 2, colour = "white",
position = position_stack(vjust = 0.5)) +
ggtitle("Digital Life vs. Offline Life Cluster Means")
The plot further confirms the strong digital affinity of clusters 4 and 6. Both show the lowest numbers of branch and ATM visits, while showing strong activity in online banking, internet traffic, social media interest, and banking app logins. While we can now be fairly confident that customers from segments 4 and 6 have a strong digital affinity and are thus likely to be interested in AE Banking, we should in the next step check the financial situation of these customers. Let us first focus on average age, income and household size.
#Plots for Socio Economic Factors
SocioEcon <- summarystats.percluster_8k %>%
select(k8Cluster, Age, Income, HouseholdSize) %>%
gather(key = "variable", value = "value", -k8Cluster)
ggplot(SocioEcon, aes(factor(k8Cluster), value))+
geom_bar(stat='identity') + xlab("Clusters") +
facet_wrap(~variable, ncol=1, scales = "free_y") +
geom_text(aes(label = round(value, digits = 1)), size = 2, colour = "white",
position = position_stack(vjust = 0.5)) +
ggtitle("Socio-Economic Cluster Means")
The plots reveal the problems with socio-economic clustering, as the results for age and household size do not vary much across the 8 clusters. We see some variation for income, where clusters 4 and 6 remain close to the overall mean of the dataset, indicating that the digitally inclined users we identified are neither poor nor rich, which still makes them a suitable target group. Age-wise, we similarly see that both segments consist of well-established adults in their late 30s or early 40s. Given that the socio-economic information indicates that these users benefit from stable incomes, we should in the next step focus on spending and investment behavior to understand whether these segments offer sufficient business volume and growth potential.
#Plots for Spending and Investments
Invest <- summarystats.percluster_8k %>%
select(k8Cluster, MortageVolume, StockVolume, NASDAQInvest, USAXSFundInvest) %>%
gather(key = "variable", value = "value", -k8Cluster)
ggplot(Invest, aes(factor(k8Cluster), value))+
geom_bar(stat='identity') + xlab("Clusters") +
facet_wrap(~variable, ncol=2, scales = "free_y") +
geom_text(aes(label = round(value, digits = 1)), size = 2, colour = "white",
position = position_stack(vjust = 0.5)) +
ggtitle("Investment Cluster Means")
Spending <- summarystats.percluster_8k %>%
select(k8Cluster, AccountSpending, CreditCardSpending, GrocerySpending) %>%
gather(key = "variable", value = "value", -k8Cluster)
ggplot(Spending, aes(factor(k8Cluster), value))+
geom_bar(stat='identity') + xlab("Clusters") +
facet_wrap(~variable, ncol=1, scales = "free_y") +
geom_text(aes(label = round(value, digits = 1)), size = 2, colour = "white",
position = position_stack(vjust = 0.5)) +
ggtitle("Spending Cluster Means")
From the inspection of the two plots, it becomes evident that clusters 4 and 6 are more invested in stocks than their counterparts and, compared to the other clusters, also carry lower mortgage volumes. Looking at the types of investments, we see that cluster 4 is more invested in NASDAQ-listed companies than all other clusters, while cluster 6 is strongly invested in Stratton’s fund for small and mid-size US companies. The spending information tells us that both segments are among the lower-spending customers, with cluster 4 showing the lowest credit card turnover of all clusters. For grocery expenditures, cluster 6 shows the second-highest average spending.
Last, we can enrich our insights by looking at the living conditions of the different segments and at where they are located. To achieve this, we finally compare residential information.
#Plots Residential Information
Life <- summarystats.percluster_8k %>%
select(k8Cluster, CityAreaSize, MeanCitySqFtPrice, MeanCityHouseHoldSize, MeanCityIncome) %>%
gather(key = "variable", value = "value", -k8Cluster)
ggplot(Life, aes(factor(k8Cluster), value))+
geom_bar(stat='identity') + xlab("Clusters") +
facet_wrap(~variable, ncol=2, scales = "free_y") +
geom_text(aes(label = round(value, digits = 1)), size = 2, colour = "white",
position = position_stack(vjust = 0.5)) +
ggtitle("Life Conditions Cluster Means")
From the plot we learn that clusters 4 and 6 both live in city areas with mid-to-lower population levels. For cluster 4, the average household sizes in the residential areas are rather small, while for cluster 6 we observe larger households with on average 4 members. Looking at income distributions and the areas’ prices per square foot, we learn that cluster 4 lives in richer neighborhoods with higher square-foot prices, whereas cluster 6 members live in middle-class neighborhoods with affordable, low square-foot prices.
Question: Combining the information at hand, how would you describe the members of clusters 4 and 6, and how do you believe they differ from each other? Can you similarly come up with personas for other clusters?
The results of the cluster analysis allow Stratton AE Banking to take several important marketing actions. First, a profound understanding of the available market segments allows the joint venture to understand the different types of customers and to determine which segments of the existing customer base should form the basis for future marketing activities.
To develop a suitable positioning for each cluster and subsequently develop communication campaigns, one can use the further insights from the cluster analysis and the comparison of the cluster-specific means of the remaining variables.
Furthermore, the results of the cluster analysis can be used to predict the interests and preferences of newly incoming customers. Here, one may use the information available and calculate the Euclidean distances between the new customer and the center of each cluster (i.e. the cluster means on each dimension). The customer will likely belong to the cluster with the lowest distance.
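A minimal sketch of this assignment step is shown below on simulated data. All objects (`train`, `new_customer`, `z_new`) and values are made up for the illustration. One detail matters when the clustering was run on standardized data: the new customer has to be standardized with the means and standard deviations of the original data before the distances to the cluster centers are computed.

```r
# Simulated stand-in for the customer table (two variables, two groups)
set.seed(3)
train <- data.frame(
  Age       = c(rnorm(50, 30, 4), rnorm(50, 60, 4)),
  AppLogins = c(rnorm(50, 90, 10), rnorm(50, 10, 5))
)

scaled_train <- scale(train)
fit <- kmeans(scaled_train, centers = 2, nstart = 10)

# A hypothetical new customer, standardized with the TRAINING means and SDs
new_customer <- c(Age = 28, AppLogins = 85)
z_new <- (new_customer - attr(scaled_train, "scaled:center")) /
  attr(scaled_train, "scaled:scale")

# Euclidean distance from the new customer to every cluster center
dists <- sqrt(rowSums(sweep(fit$centers, 2, z_new)^2))
dists

# The new customer is assigned to the cluster with the smallest distance
assigned <- unname(which.min(dists))
assigned
```

In this simulated example the new customer is young and logs in often, so the nearest center is that of the young, app-heavy group.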