How can a magazine company effectively target key demographics?
This business question is what motivated the following cluster analysis of magazine subscriber data. The insights generated from this analysis could help the magazine company advertise more effectively to specific groups, or even include more articles suited to specific demographics and their needs.
The dataset used for this cluster analysis was provided by the Eastern University course “Data Science for Business”. Below is a list of the attributes and their descriptions as depicted in the dataset.
- Age: age of subscriber
- Female: 1 for yes (female); 0 for no (male)
- Real.Estate.Purchases: number of real estate purchases
- Value.of.Investments: value of investments in US dollars
- Number.of.Transactions: number of purchases made
- Graduate.Degree: 1 for having a degree; 0 for not having a degree
- Household.Income: income for each household in US dollars
- Have.Children: 1 for yes; 0 for no

Because the data contains a mix of numerical and categorical types, it was separated along those lines and a distinct clustering technique was applied to each subset: hierarchical clustering for the categorical variables and K-means for the continuous numerical variables.
Below are the packages that were utilized to carry out this specific cluster analysis. The packages are in order of use and are broken up by which step in the process they were utilized.
# Packages needed for data manipulation, visualization
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# Packages needed for clustering the data
library(cluster)
library(fpc)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
The first step in this process was to import the data and assign it to an object for manipulation.
# set working directory
setwd("/Users/nickwinters/desktop/MSDS/DS for Buisness")
# assign csv file to an object
magdf <- read.csv("young_professional_magazine.csv")
Before diving into the analysis, surface-level exploration was conducted on the data.
First, the general structure of the data was obtained using the head() function. This allowed for a quick peek at the data so that the variables and their data types could be viewed in a table format.
# display the first five rows
head(magdf, 5)
## Age Female Real.Estate.Purchases Value.of.Investments Number.of.Transactions
## 1 38 1 0 12200 4
## 2 30 0 0 12400 4
## 3 41 1 0 26800 5
## 4 28 1 1 19600 6
## 5 31 1 1 15100 5
## Graduate.Degree Household.Income Have.Children
## 1 1 75200 1
## 2 1 70300 1
## 3 1 48200 0
## 4 0 95300 0
## 5 0 73300 1
After getting a better sense of the structure of the data, summary statistics for the observations were generated. For manipulation purposes, and because the clustering methods require it, the data was separated into categorical and numerical subsets.
Here, the mean and distribution of the numerical observations were calculated. These calculations (particularly the means) were used for comparative purposes during the K-means cluster analysis.
# assign numerical variables to a new data frame
mag_num <- magdf |>
select(
Age,
Value.of.Investments,
Number.of.Transactions,
Household.Income
)
# Generate summary statistics for each numerical variable
summary(mag_num)
## Age Value.of.Investments Number.of.Transactions Household.Income
## Min. :19.00 Min. : 0 Min. : 0.000 Min. : 16200
## 1st Qu.:28.00 1st Qu.: 18300 1st Qu.: 4.000 1st Qu.: 51625
## Median :30.00 Median : 24800 Median : 6.000 Median : 66050
## Mean :30.11 Mean : 28538 Mean : 5.973 Mean : 74460
## 3rd Qu.:33.00 3rd Qu.: 34275 3rd Qu.: 7.000 3rd Qu.: 88775
## Max. :42.00 Max. :133400 Max. :21.000 Max. :322500
# age
mag_num |>
ggplot(aes(x=Age)) +
geom_histogram(binwidth = 1)
# value of investments
mag_num |>
ggplot(aes(x=Value.of.Investments)) +
geom_histogram(binwidth = 5000)
# number of transactions
mag_num |>
ggplot(aes(x=Number.of.Transactions)) +
geom_histogram(binwidth = 1)
# household income
mag_num |>
ggplot(aes(x=Household.Income)) +
geom_histogram(binwidth = 5000)
Observation: Age is tightly concentrated around 30, while Value.of.Investments, Number.of.Transactions, and Household.Income are all right-skewed, with means exceeding medians and a handful of high outliers (e.g., a maximum household income of $322,500 against a median of $66,050).
Here, the frequency of the categorical observations was calculated. These frequencies were used for comparative purposes during the hierarchical cluster analysis.
# assign binary variables to new dataframe
mag_bin <- magdf |>
select(
Female,
Real.Estate.Purchases,
Graduate.Degree,
Have.Children
)
#Calculate counts and proportions for each binary variable
tabyl(mag_bin$Female)
## mag_bin$Female n percent
## 0 229 0.5585366
## 1 181 0.4414634
tabyl(mag_bin$Real.Estate.Purchases)
## mag_bin$Real.Estate.Purchases n percent
## 0 236 0.5756098
## 1 174 0.4243902
tabyl(mag_bin$Graduate.Degree)
## mag_bin$Graduate.Degree n percent
## 0 155 0.3780488
## 1 255 0.6219512
tabyl(mag_bin$Have.Children)
## mag_bin$Have.Children n percent
## 0 191 0.4658537
## 1 219 0.5341463
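As an aside, the four tabyl() calls above could be collapsed into a single pass over the columns; a minimal sketch:
# Sketch: build all four frequency tables at once
lapply(mag_bin, tabyl)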
Observation: The binary attributes are all fairly balanced: roughly 44% of subscribers are female, 42% have made real estate purchases, 62% hold a graduate degree, and 53% have children.
Overall Approach: The binary variables were clustered hierarchically using matching distance and average linkage, with the resulting dendrogram used to choose the number of clusters.
For this particular case, the first step in the hierarchical cluster analysis was to calculate a similarity measure between observations. Below, the matching distance was computed by setting the method to “manhattan”; on binary data, Manhattan distance reduces to a count of the attributes on which two observations differ.
# calculate distance between each pair of observations
match_dist<-dist(mag_bin, method="manhattan")
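As a quick illustration with two hypothetical binary vectors, Manhattan distance on 0/1 data simply counts the attributes on which two observations disagree, which is why it serves as the matching distance here:
# Two hypothetical subscribers described by four binary attributes
a <- c(1, 0, 1, 1)
b <- c(0, 0, 1, 0)
dist(rbind(a, b), method="manhattan") # 2: the vectors disagree on two attributes
sum(a != b)                           # the same mismatch count, computed directly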
Once the data’s similarity measurements had been calculated, the hclust() function grouped the data into clusters based on similarity. The dissimilarity between the generated clusters was measured with the “average linkage” approach, which defines the distance between two clusters as the average of all pairwise distances between observations in one cluster and observations in the other.
# use the hclust function and group average linkage
match_avg<-hclust(match_dist, method="average")
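To make average linkage concrete, here is a small hand-built sketch using hypothetical points: the distance between two clusters is the mean of every cross-cluster pairwise distance.
# Hypothetical example: two clusters of two binary observations each
cluster_a <- rbind(c(0, 0), c(0, 1))
cluster_b <- rbind(c(1, 1), c(1, 0))
d <- as.matrix(dist(rbind(cluster_a, cluster_b), method="manhattan"))
mean(d[1:2, 3:4]) # average linkage: the mean of the four cross-cluster distances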
Once the candidate clusters had been generated, a dendrogram was used to visualize them. This plot was used to determine the best number of clusters for the data.
#plot the dendrogram
plot(match_avg)
Observation: Based on the above dendrogram, four clusters were determined to be ideal.
The four clusters that were chosen to represent the data were next superimposed onto the plot for better visualization.
#Create 4 clusters using the cutree function
match_avg_4<-cutree(match_avg, k=4)
#plot the dendrogram
plot(match_avg)
#visualize clusters on the dendrogram
rect.hclust(match_avg, k=4, border=2:4)
After the clusters were identified, they were analyzed to extract meaningful insights about subscriber segmentation.
First, the number of subscribers that populated each cluster was identified.
# link cluster assignments to original categorical data frame
hcl4<-cbind(mag_bin, clusterID=match_avg_4)
# write data frame to CSV file to analyze in Excel
# write.csv(hcl4, "magazine_hier4_clusters.csv")
hcl4 |>
group_by(clusterID) |>
summarize(n())
## # A tibble: 4 × 2
## clusterID `n()`
## <int> <int>
## 1 1 140
## 2 2 115
## 3 3 69
## 4 4 86
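For reference, the same per-cluster tally can be written more compactly with dplyr’s count():
# Equivalent one-liner
hcl4 |> count(clusterID)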
Observation: In descending order, the clusters are populated as follows: cluster 1 (140 subscribers), cluster 2 (115), cluster 4 (86), and cluster 3 (69).
To get a better idea of the subscriber segmentation, the overall frequency of each subscriber attribute was determined for comparative purposes.
#attach value labels to binary variables
hcl4$Female<-factor(hcl4$Female,levels=c(0,1),labels=c("no","yes"))
hcl4$Real.Estate.Purchases<-factor(hcl4$Real.Estate.Purchases,levels=c(0,1),labels=c("No","Yes"))
hcl4$Graduate.Degree<-factor(hcl4$Graduate.Degree,levels=c(0,1),labels=c("No","Yes"))
hcl4$Have.Children<-factor(hcl4$Have.Children,levels=c(0,1),labels=c("No","Yes"))
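# Note (hypothetical sketch): the four factor() calls above could be condensed
# with dplyr::across(); this variant applies the same capitalized "No"/"Yes"
# labels to every column, unlike the lowercase labels used for Female above.
# hcl4 <- hcl4 |>
#   mutate(across(c(Female, Real.Estate.Purchases, Graduate.Degree, Have.Children),
#                 ~ factor(.x, levels=c(0,1), labels=c("No","Yes"))))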
#Create frequency tables for each variable overall
tabyl(hcl4$Female)
## hcl4$Female n percent
## no 229 0.5585366
## yes 181 0.4414634
tabyl(hcl4$Real.Estate.Purchases)
## hcl4$Real.Estate.Purchases n percent
## No 236 0.5756098
## Yes 174 0.4243902
tabyl(hcl4$Graduate.Degree)
## hcl4$Graduate.Degree n percent
## No 155 0.3780488
## Yes 255 0.6219512
tabyl(hcl4$Have.Children)
## hcl4$Have.Children n percent
## No 191 0.4658537
## Yes 219 0.5341463
The frequencies of the subscriber attributes were then determined for each cluster to identify each cluster’s unique properties.
#Create frequency tables for each variable by cluster
tabyl(hcl4,Female,clusterID) |>
adorn_percentages("col") |>
adorn_pct_formatting(digits=2) |>
adorn_ns()
## Female 1 2 3 4
## no 49.29% (69) 64.35% (74) 0.00% (0) 100.00% (86)
## yes 50.71% (71) 35.65% (41) 100.00% (69) 0.00% (0)
tabyl(hcl4,Real.Estate.Purchases,clusterID) |>
adorn_percentages("col") |>
adorn_pct_formatting(digits=2) |>
adorn_ns()
## Real.Estate.Purchases 1 2 3 4
## No 57.86% (81) 53.91% (62) 57.97% (40) 61.63% (53)
## Yes 42.14% (59) 46.09% (53) 42.03% (29) 38.37% (33)
tabyl(hcl4,Graduate.Degree,clusterID) |>
adorn_percentages("col") |>
adorn_pct_formatting(digits=2) |>
adorn_ns()
## Graduate.Degree 1 2 3 4
## No 0.00% (0) 0.00% (0) 100.00% (69) 100.00% (86)
## Yes 100.00% (140) 100.00% (115) 0.00% (0) 0.00% (0)
tabyl(hcl4,Have.Children,clusterID) |>
adorn_percentages("col") |>
adorn_pct_formatting(digits=2) |>
adorn_ns()
## Have.Children 1 2 3 4
## No 0.00% (0) 100.00% (115) 43.48% (30) 53.49% (46)
## Yes 100.00% (140) 0.00% (0) 56.52% (39) 46.51% (40)
Interpretation of clusters: Cluster 1 consists entirely of graduate-degree holders with children; cluster 2 consists entirely of graduate-degree holders without children; cluster 3 is made up entirely of women without graduate degrees; and cluster 4 is made up entirely of men without graduate degrees. Real estate purchases are spread fairly evenly across all four clusters, so that attribute does little to distinguish them.
Overall Approach: The numerical variables were normalized and clustered with K-means, with an elbow plot used to choose the number of clusters.
In order to run a K-means analysis that fairly measures similarity among observations, the data was first normalized using z-scores. This was necessary because the numerical variables span very different ranges and are measured in different units.
# use the scale() function to normalize the data
mag_scaled<-scale(mag_num)
head(mag_scaled)
## Age Value.of.Investments Number.of.Transactions Household.Income
## [1,] 1.96017891 -1.0333608 -0.636327424 0.02126726
## [2,] -0.02788133 -1.0207112 -0.636327424 -0.11946370
## [3,] 2.70570151 -0.1099432 -0.313837629 -0.75418902
## [4,] -0.52489639 -0.5653272 0.008652165 0.59855137
## [5,] 0.22062620 -0.8499422 -0.313837629 -0.03330189
## [6,] 0.46913373 0.7059532 -0.958817218 1.40560031
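As a quick sanity check (a sketch using the Age column), scale() with its default arguments computes exactly the z-score (x - mean(x)) / sd(x):
# Verify the z-score formula against scale() for the Age column
z_age <- (mag_num$Age - mean(mag_num$Age)) / sd(mag_num$Age)
all.equal(as.numeric(mag_scaled[, "Age"]), z_age) # TRUE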
Two metrics were obtained in order to construct an elbow plot: the candidate number of clusters (k_values) and the total within-cluster sum of squared deviations (wss_values) at each k.
# set random number seed in order to replicate the analysis
set.seed(42)
# calculate total within-cluster sum of squared deviations
wss<-function(k){kmeans(mag_scaled, k, nstart=10)$tot.withinss}
# range of k values for elbow plot
k_values<- 1:10
# run the function to create the range of values for the elbow plot
wss_values<-map_dbl(k_values, wss)
#create a new data frame containing both k_values and wss_values
elbowdf<- data.frame(k_values, wss_values)
An elbow plot was then constructed to determine the number of clusters that best fits the data.
#graph the elbow plot
ggplot(elbowdf, mapping = aes(x = k_values, y = wss_values)) +
geom_line() + geom_point()
Observation: A noticeable bend is present at around 4 to 5 clusters. Given the small size of the data and the fact that four clusters were used for the hierarchical analysis, four was chosen as the optimal number of clusters for K-means.
The K-means algorithm was then deployed to segment the data into four clusters.
# run k-means clustering with 4 clusters (k=4) and 1000 random restarts
k4<-kmeans(mag_scaled, 4, nstart=1000)
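The large nstart value matters because each K-means run converges only to a local optimum; restarting from many random initializations and keeping the best solution guards against a poor starting configuration. A sketch of the comparison (using a hypothetical seed):
# Each restart can land in a different local optimum; more restarts, better odds
set.seed(1)
kmeans(mag_scaled, 4, nstart=1)$tot.withinss    # a single random start
kmeans(mag_scaled, 4, nstart=1000)$tot.withinss # best of 1000 starts, typically lower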
To visualize the clusters, the fviz_cluster() function was used to generate a scatter plot that marks the points by cluster identity. Because the scaled data has four dimensions, fviz_cluster() projects the observations onto their first two principal components before plotting.
# generate plot using the k4 object on the scaled data
fviz_cluster(k4, data = mag_scaled, geom = "point", ellipse.type = "convex")
Observation: All four clusters overlap somewhat in this projection. This makes sense given that subscribers likely share some attributes by virtue of their common interest in the same magazine.
After the clusters were identified, they were analyzed to extract meaningful insights about subscriber segmentation.
The cluster.stats() function was used to detail the structure of the clusters. The key metrics used for the analysis were cluster.size, average.distance, and ave.between.matrix.
# display cluster statistics
cluster.stats(dist(mag_scaled, method="euclidean"), k4$cluster)
## $n
## [1] 410
##
## $cluster.number
## [1] 4
##
## $cluster.size
## [1] 63 46 175 126
##
## $min.cluster.size
## [1] 46
##
## $noisen
## [1] 0
##
## $diameter
## [1] 8.731119 6.459981 5.047733 4.948840
##
## $average.distance
## [1] 2.712944 2.289854 1.702407 1.821043
##
## $median.distance
## [1] 2.503044 2.096996 1.633555 1.748860
##
## $separation
## [1] 0.3072036 0.4262397 0.3094109 0.3072036
##
## $average.toother
## [1] 3.260644 3.249676 2.735785 2.717608
##
## $separation.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.0000000 0.4262397 0.4010144 0.3072036
## [2,] 0.4262397 0.0000000 0.4721009 0.6744707
## [3,] 0.4010144 0.4721009 0.0000000 0.3094109
## [4,] 0.3072036 0.6744707 0.3094109 0.0000000
##
## $ave.between.matrix
## [,1] [,2] [,3] [,4]
## [1,] 0.000000 3.717902 3.154172 3.241587
## [2,] 3.717902 0.000000 3.096728 3.227990
## [3,] 3.154172 3.096728 0.000000 2.394819
## [4,] 3.241587 3.227990 2.394819 0.000000
##
## $average.between
## [1] 2.903973
##
## $average.within
## [1] 1.960052
##
## $n.between
## [1] 57757
##
## $n.within
## [1] 26088
##
## $max.diameter
## [1] 8.731119
##
## $min.separation
## [1] 0.3072036
##
## $within.cluster.ss
## [1] 956.4456
##
## $clus.avg.silwidths
## 1 2 3 4
## 0.07470313 0.18916895 0.26707357 0.21641395
##
## $avg.silwidth
## [1] 0.2132051
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.4197272
##
## $dunn
## [1] 0.0351849
##
## $dunn2
## [1] 0.882738
##
## $entropy
## [1] 1.25922
##
## $wb.ratio
## [1] 0.6749552
##
## $ch
## [1] 96.15431
##
## $cwidegap
## [1] 3.413033 3.493278 1.616582 1.376527
##
## $widestgap
## [1] 3.493278
##
## $sindex
## [1] 0.4509849
##
## $corrected.rand
## NULL
##
## $vi
## NULL
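Since cluster.stats() returns a long list, the three metrics of interest can also be extracted directly instead of scanning the full printout; a small convenience sketch:
# Store the statistics once and pull out only the metrics used in the analysis
cs <- cluster.stats(dist(mag_scaled, method="euclidean"), k4$cluster)
cs$cluster.size       # number of observations per cluster
cs$average.distance   # average within-cluster distance (cohesion)
cs$ave.between.matrix # average between-cluster distances (separation)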
Observation: The four clusters contain 63, 46, 175, and 126 subscribers, respectively. Cluster 1 is the least cohesive (average within-cluster distance of 2.71 versus 1.70 to 2.29 for the others), while the ave.between.matrix shows that clusters 3 and 4 are the most similar pair (average between-cluster distance of 2.39) and clusters 1 and 2 the most separated (3.72).
The cluster identity for each observation was bound to the numeric data frame, and each attribute was then summarized by cluster to determine each cluster’s unique composition.
# combining each observation's cluster assignment with unscaled data frame
mag_k4 <- cbind(mag_num, clusterID=k4$cluster)
head(mag_k4)
## Age Value.of.Investments Number.of.Transactions Household.Income clusterID
## 1 38 12200 4 75200 3
## 2 30 12400 4 70300 3
## 3 41 26800 5 48200 3
## 4 28 19600 6 95300 4
## 5 31 15100 5 73300 3
## 6 32 39700 3 123400 2
# write data frame to CSV file to analyze in Excel
# write.csv(mag_k4, "magazine_kmeans_4clusters.csv")
#Calculate variable averages for each cluster
mag_k4 |>
group_by(clusterID) |>
summarize_all(mean)
## # A tibble: 4 × 5
## clusterID Age Value.of.Investments Number.of.Transactions Household.Income
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 31.7 53433. 8.27 72208.
## 2 2 30.2 28417. 5.39 146491.
## 3 3 32.5 21406. 5.05 63217.
## 4 4 26.0 26041. 6.32 64903.
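Incidentally, summarize_all() has been superseded in recent dplyr releases; the same table can be produced with the current across() idiom:
# Equivalent using across(), the current dplyr idiom
mag_k4 |>
  group_by(clusterID) |>
  summarize(across(everything(), mean))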
Interpretation of Clusters: Cluster 1 holds the heaviest investors (average investments of about $53,400 and 8.3 transactions); cluster 2 is the high-income group (average household income of about $146,500); cluster 3 skews slightly older with the lowest investment values and transaction counts; and cluster 4 is the youngest group (average age 26).
This analysis successfully segmented the magazine subscriber base into distinct demographic groups using both hierarchical and K-means clustering. These clusters can now be used either to capitalize on the largest demographics present or to help bolster subscriber counts among the less represented ones.