Load Packages and Dataset
dataset: oes (occupational employment statistics) that produces employment and wage estimates annually. The data contains the yearly average income from 2001 to 2016 for 22 occupation groups.
objectives: use this dataset to identify clusters of occupations that maintained similar income trends.
oes <- readRDS("~/Documents/R programming/Datacamp/DC_Cluster/Data/oes.rds")
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ stringr 1.4.0
## ✓ tidyr 1.1.4 ✓ forcats 0.5.1
## ✓ readr 2.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(dendextend)##
## ---------------------
## Welcome to dendextend version 1.15.2
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
library(cluster)determine pre-processing steps. Below is trying to convert first column (index) to column with function “rownames_to_column”. The reverse function is “column to rownames” is we want to convert first column to index.
from the summary we can see that there are no missing values, no categorical and the features are on the same scale
head(oes)## 2001 2002 2003 2004 2005 2006 2007 2008
## Management 70800 78870 83400 87090 88450 91930 96150 100310
## Business Operations 50580 53350 56000 57120 57930 60000 62410 64720
## Computer Science 60350 61630 64150 66370 67100 69240 72190 74500
## Architecture/Engineering 56330 58020 60390 63060 63910 66190 68880 71430
## Life/Physical/Social Sci. 49710 52380 54930 57550 58030 59660 62020 64280
## Community Services 34190 34630 35800 37050 37530 39000 40540 41790
## 2010 2011 2012 2013 2014 2015 2016
## Management 105440 107410 108570 110550 112490 115020 118020
## Business Operations 67690 68740 69550 71020 72410 73800 75070
## Computer Science 77230 78730 80180 82010 83970 86170 87880
## Architecture/Engineering 75550 77120 79000 80100 81520 82980 84300
## Life/Physical/Social Sci. 66390 67470 68360 69400 70070 71220 72930
## Community Services 43180 43830 44240 44710 45310 46160 47200
summary(oes)## 2001 2002 2003 2004
## Min. :16720 Min. :17180 Min. :17400 Min. :17620
## 1st Qu.:26728 1st Qu.:27392 1st Qu.:27858 1st Qu.:28535
## Median :34575 Median :35205 Median :36180 Median :37335
## Mean :37850 Mean :39701 Mean :41018 Mean :42275
## 3rd Qu.:49875 3rd Qu.:53108 3rd Qu.:55732 3rd Qu.:57442
## Max. :70800 Max. :78870 Max. :83400 Max. :87090
## 2005 2006 2007 2008
## Min. :17840 Min. :18430 Min. :19440 Min. : 20220
## 1st Qu.:29042 1st Qu.:29688 1st Qu.:30810 1st Qu.: 31642
## Median :37790 Median :39030 Median :40235 Median : 41510
## Mean :42775 Mean :44329 Mean :46074 Mean : 47763
## 3rd Qu.:58005 3rd Qu.:59915 3rd Qu.:62312 3rd Qu.: 64610
## Max. :88450 Max. :91930 Max. :96150 Max. :100310
## 2010 2011 2012 2013
## Min. : 21240 Min. : 21430 Min. : 21380 Min. : 21580
## 1st Qu.: 32862 1st Qu.: 33430 1st Qu.: 33795 1st Qu.: 34120
## Median : 42995 Median : 43610 Median : 44055 Median : 44565
## Mean : 49758 Mean : 50555 Mean : 51077 Mean : 51800
## 3rd Qu.: 67365 3rd Qu.: 68422 3rd Qu.: 69252 3rd Qu.: 70615
## Max. :105440 Max. :107410 Max. :108570 Max. :110550
## 2014 2015 2016
## Min. : 21980 Min. : 22850 Min. : 23850
## 1st Qu.: 34718 1st Qu.: 35425 1st Qu.: 36350
## Median : 45265 Median : 46075 Median : 46945
## Mean : 52643 Mean : 53785 Mean : 55117
## 3rd Qu.: 71825 3rd Qu.: 73155 3rd Qu.: 74535
## Max. :112490 Max. :115020 Max. :118020
oes_df<- as.data.frame(oes)
oes_tbl <- rownames_to_column(oes_df, var = "occupation") %>% as_tibble()
oes_clean <- oes_tbl %>%
rowwise() %>%
mutate(m = mean(c_across(starts_with("2"))))
#below we have done a plot to showcase the salary by occupation by descending order.
#re-ordered the occupation by values (avg salaries across all years)
ggplot(oes_clean, aes(m, fct_reorder(occupation,m), fill = occupation)) +
geom_col(alpha = 0.8) +
theme(legend.position = "none") #removed legendPractice
from above results, we learned that the oes data is ready for hierarchical clustering without any preprocessing steps necessary. next we will build a dendrogram of occupations based on their yearly average salaries (hence the linkage) and propose clusters using a height of 100,000 (can use other height as well and may do trial and error multiple times)
Hierarchical clustering: Occupation trees - steps to follow:
calculate euclidean distance between the occupations
generate an average linkage analysis (or complete or single linkage if preferred)
create a dendrogram object from the results of linkage analysis
plot the dendrogram
color the branches with a height of 100000 and plot the colored dendrogram
dist_oes <- dist(oes, method = "euclidean")
hc_oes <- hclust(dist_oes, method = "average")
dend_oes <- as.dendrogram(hc_oes)
plot(dend_oes)dend_colored <- color_branches(dend_oes, h = 100000)
plot(dend_colored)based on the plot of dendrogram, it might be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another.
Hierarchical clustering: preparing for exploration
create rownames (we have done it already)
build cluster assignment vector with h = 100,000
append cluster assignment to original dataframe
reshape the data with pivot longer
df_oes <- rownames_to_column(as.data.frame(oes), var = "occupation")
cut_oes <- cutree(hc_oes, h = 100000)
clust_oes <- mutate(df_oes, cluster = cut_oes)
gathered_oes <- clust_oes %>%
pivot_longer(cols = starts_with("2"), names_to = "year", values_to = "mean_salary" )- now we have created all parts necessary to explore the results of the hierarchical clustering work. We will now visualize the results
sort(cut_oes)## Management Legal
## 1 1
## Business Operations Computer Science
## 2 2
## Architecture/Engineering Life/Physical/Social Sci.
## 2 2
## Healthcare Practitioners Community Services
## 2 3
## Education/Training/Library Arts/Design/Entertainment
## 3 3
## Healthcare Support Protective Service
## 3 3
## Food Preparation Grounds Cleaning & Maint.
## 3 3
## Personal Care Sales
## 3 3
## Office Administrative Farming/Fishing/Forestry
## 3 3
## Construction Installation/Repair/Maint.
## 3 3
## Production Transportation/Moving
## 3 3
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) +
geom_line(aes(group = occupation))results look visually reasonable
let’s proceed to k-means
elbow analysis
tot_withinss <- map_dbl(1:10, function(k) {
model <- kmeans(x = oes, centers = k)
model$tot.withinss
})
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)
ggplot(elbow_df, aes(k, tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)- k-means - average silhouette widths
sil_width <- map_dbl(2:10, function(k) {
model <- pam(oes, k = k)
model$silinfo$avg.width
})
sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
)
ggplot(sil_df, aes(k, sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)looks like k of 7 gives the best result.
ultimately best way to cluster is highly dependent on how we want to use the data after and business side of knowledge.