We are presented with data from the Occupational Employment Statistics (OES) program which produces employment and wage estimates annually. This data contains the yearly average income from 2001 to 2016 for 22 occupation groups. You would like to use this data to identify clusters of occupations that maintained similar income trends.
Before we begin to cluster this data we should determine whether any pre-processing steps (such as scaling and imputation) are necessary.
Leverage the functions head() and summary() to explore the oes data in order to determine which of the pre-processing steps below are necessary: there are no missing values, no categorical and the features are on the same scale.
We will take the necessary steps to build a dendrogram of occupations based on their yearly average salaries and propose clusters using a height of 100,000.
oes <- readRDS("oes.rds")
# Calculate Euclidean distance between the occupations
dist_oes <- dist(oes, method = "euclidean")
# Generate an average linkage analysis
hc_oes <- hclust(dist_oes, method = "average")
# Create a dendrogram object from the hclust variable
dend_oes <- as.dendrogram(hc_oes)
# Plot the dendrogram
plot(dend_oes)
# Color branches by cluster formed from the cut at a height of 100000
dend_colored <- color_branches(dend_oes, h = 100000)
# Plot the colored dendrogram
plot(dend_colored)
Based on the dendrogram it may be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another.
We have now created a potential clustering for the oes data, before we can explore these clusters with ggplot2 we will need to process the oes data matrix into a tidy data frame with each occupation assigned its cluster.
dist_oes <- dist(oes, method = 'euclidean')
hc_oes <- hclust(dist_oes, method = 'average')
library(tibble)
Attaching package: ‘tibble’
The following object is masked from ‘package:wrapr’:
view
library(tidyr)
# Use rownames_to_column to move the rownames into a column of the data frame
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')
# Create a cluster assignment vector at h = 100,000
cut_oes <- cutree(hc_oes, h = 100000)
# Generate the segmented the oes data frame
clust_oes <- mutate(df_oes, cluster = cut_oes)
# Create a tidy data frame by gathering the year and values into two columns
gathered_oes <- gather(data = clust_oes,
key = year,
value = mean_salary,
-occupation, -cluster)
head(gathered_oes)
We now have the data frames necessary to explore the results of this clustering
You have succesfully created all the parts necessary to explore the results of this hierarchical clustering work. Now, we will leverage the named assignment vector cut_oes and the tidy data frame gathered_oes to analyze the resulting clusters.
# View the clustering assignments by sorting the cluster assignment vector
sort(cut_oes)
Management Legal
1 1
Business Operations Computer Science
2 2
Architecture/Engineering Life/Physical/Social Sci.
2 2
Healthcare Practitioners Community Services
2 3
Education/Training/Library Arts/Design/Entertainment
3 3
Healthcare Support Protective Service
3 3
Food Preparation Grounds Cleaning & Maint.
3 3
Personal Care Sales
3 3
Office Administrative Farming/Fishing/Forestry
3 3
Construction Installation/Repair/Maint.
3 3
Production Transportation/Moving
3 3
# Plot the relationship between mean_salary and year and color the lines by the assigned cluster
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) +
geom_line(aes(group = occupation))
From this work it looks like both Management & Legal professions (cluster 1) experienced the most rapid growth in these 15 years.
We will leverage the k-means elbow plot to propose the “best” number of clusters.
We use map_dbl() to run kmeans() using the oes data for k values ranging from 1 to 10 and extract the total within-cluster sum of squares value from each model: model$tot.withinss
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = oes, centers = k)
model$tot.withinss
})
We store the resulting vector as tot_withinss
The new data frame elbow_df containing the values of k and the vector of total within-cluster sum of squares
# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)
# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)
So the elbow analysis proposes a different value of k.
So hierarchical clustering resulting in 3 clusters and the elbow method suggests 2. We will use average silhouette widths to explore what the “best” value of k should be.
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k){
model <- pam(oes, k = k)
model$silinfo$avg.width
})
# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
)
# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)
It seems that this analysis results in another value of k, this time 7 is the top contender (although 2 comes very close).
We ran three different methods for finding the optimal number of clusters and their assignments and we arrived with three different answers.
What can you say about the “best” way to cluster this data?