Clustering_case study

Kate C

2022-02-09

Load Packages and Dataset

  • dataset: oes (occupational employment statistics) that produces employment and wage estimates annually. The data contains the yearly average income from 2001 to 2016 for 22 occupation groups.

  • objectives: use this dataset to identify clusters of occupations that maintained similar income trends.

oes <- readRDS("~/Documents/R programming/Datacamp/DC_Cluster/Data/oes.rds")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ tidyr   1.1.4     ✓ forcats 0.5.1
## ✓ readr   2.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.15.2
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree
library(cluster)
  • determine pre-processing steps. Below is trying to convert first column (index) to column with function “rownames_to_column”. The reverse function is “column to rownames” is we want to convert first column to index.

  • from the summary we can see that there are no missing values, no categorical and the features are on the same scale

head(oes)
##                            2001  2002  2003  2004  2005  2006  2007   2008
## Management                70800 78870 83400 87090 88450 91930 96150 100310
## Business Operations       50580 53350 56000 57120 57930 60000 62410  64720
## Computer Science          60350 61630 64150 66370 67100 69240 72190  74500
## Architecture/Engineering  56330 58020 60390 63060 63910 66190 68880  71430
## Life/Physical/Social Sci. 49710 52380 54930 57550 58030 59660 62020  64280
## Community Services        34190 34630 35800 37050 37530 39000 40540  41790
##                             2010   2011   2012   2013   2014   2015   2016
## Management                105440 107410 108570 110550 112490 115020 118020
## Business Operations        67690  68740  69550  71020  72410  73800  75070
## Computer Science           77230  78730  80180  82010  83970  86170  87880
## Architecture/Engineering   75550  77120  79000  80100  81520  82980  84300
## Life/Physical/Social Sci.  66390  67470  68360  69400  70070  71220  72930
## Community Services         43180  43830  44240  44710  45310  46160  47200
summary(oes)
##       2001            2002            2003            2004      
##  Min.   :16720   Min.   :17180   Min.   :17400   Min.   :17620  
##  1st Qu.:26728   1st Qu.:27392   1st Qu.:27858   1st Qu.:28535  
##  Median :34575   Median :35205   Median :36180   Median :37335  
##  Mean   :37850   Mean   :39701   Mean   :41018   Mean   :42275  
##  3rd Qu.:49875   3rd Qu.:53108   3rd Qu.:55732   3rd Qu.:57442  
##  Max.   :70800   Max.   :78870   Max.   :83400   Max.   :87090  
##       2005            2006            2007            2008       
##  Min.   :17840   Min.   :18430   Min.   :19440   Min.   : 20220  
##  1st Qu.:29042   1st Qu.:29688   1st Qu.:30810   1st Qu.: 31642  
##  Median :37790   Median :39030   Median :40235   Median : 41510  
##  Mean   :42775   Mean   :44329   Mean   :46074   Mean   : 47763  
##  3rd Qu.:58005   3rd Qu.:59915   3rd Qu.:62312   3rd Qu.: 64610  
##  Max.   :88450   Max.   :91930   Max.   :96150   Max.   :100310  
##       2010             2011             2012             2013       
##  Min.   : 21240   Min.   : 21430   Min.   : 21380   Min.   : 21580  
##  1st Qu.: 32862   1st Qu.: 33430   1st Qu.: 33795   1st Qu.: 34120  
##  Median : 42995   Median : 43610   Median : 44055   Median : 44565  
##  Mean   : 49758   Mean   : 50555   Mean   : 51077   Mean   : 51800  
##  3rd Qu.: 67365   3rd Qu.: 68422   3rd Qu.: 69252   3rd Qu.: 70615  
##  Max.   :105440   Max.   :107410   Max.   :108570   Max.   :110550  
##       2014             2015             2016       
##  Min.   : 21980   Min.   : 22850   Min.   : 23850  
##  1st Qu.: 34718   1st Qu.: 35425   1st Qu.: 36350  
##  Median : 45265   Median : 46075   Median : 46945  
##  Mean   : 52643   Mean   : 53785   Mean   : 55117  
##  3rd Qu.: 71825   3rd Qu.: 73155   3rd Qu.: 74535  
##  Max.   :112490   Max.   :115020   Max.   :118020
oes_df<- as.data.frame(oes)
oes_tbl <- rownames_to_column(oes_df, var = "occupation") %>% as_tibble()

oes_clean <- oes_tbl %>% 
rowwise() %>% 
mutate(m = mean(c_across(starts_with("2"))))

#below we have done a plot to showcase the salary by occupation by descending order.
#re-ordered the occupation by values (avg salaries across all years)
ggplot(oes_clean, aes(m, fct_reorder(occupation,m), fill = occupation)) +
  geom_col(alpha = 0.8) +
  theme(legend.position = "none") #removed legend

Practice

  • from above results, we learned that the oes data is ready for hierarchical clustering without any preprocessing steps necessary. next we will build a dendrogram of occupations based on their yearly average salaries (hence the linkage) and propose clusters using a height of 100,000 (can use other height as well and may do trial and error multiple times)

  • Hierarchical clustering: Occupation trees - steps to follow:

    • calculate euclidean distance between the occupations

    • generate an average linkage analysis (or complete or single linkage if preferred)

    • create a dendrogram object from the results of linkage analysis

    • plot the dendrogram

    • color the branches with a height of 100000 and plot the colored dendrogram

dist_oes <- dist(oes, method = "euclidean")
hc_oes <- hclust(dist_oes, method = "average")
dend_oes <- as.dendrogram(hc_oes)
plot(dend_oes)

dend_colored <- color_branches(dend_oes, h = 100000)
plot(dend_colored)

  • based on the plot of dendrogram, it might be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another.

  • Hierarchical clustering: preparing for exploration

    • create rownames (we have done it already)

    • build cluster assignment vector with h = 100,000

    • append cluster assignment to original dataframe

    • reshape the data with pivot longer

df_oes <- rownames_to_column(as.data.frame(oes), var = "occupation")
cut_oes <- cutree(hc_oes, h = 100000)
clust_oes <- mutate(df_oes, cluster = cut_oes)

gathered_oes <- clust_oes %>% 
  pivot_longer(cols = starts_with("2"), names_to = "year", values_to = "mean_salary" )
  • now we have created all parts necessary to explore the results of the hierarchical clustering work. We will now visualize the results
sort(cut_oes)
##                 Management                      Legal 
##                          1                          1 
##        Business Operations           Computer Science 
##                          2                          2 
##   Architecture/Engineering  Life/Physical/Social Sci. 
##                          2                          2 
##   Healthcare Practitioners         Community Services 
##                          2                          3 
## Education/Training/Library  Arts/Design/Entertainment 
##                          3                          3 
##         Healthcare Support         Protective Service 
##                          3                          3 
##           Food Preparation  Grounds Cleaning & Maint. 
##                          3                          3 
##              Personal Care                      Sales 
##                          3                          3 
##      Office Administrative   Farming/Fishing/Forestry 
##                          3                          3 
##               Construction Installation/Repair/Maint. 
##                          3                          3 
##                 Production      Transportation/Moving 
##                          3                          3
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) +
  geom_line(aes(group = occupation))

  • results look visually reasonable

  • let’s proceed to k-means

  • elbow analysis

tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = oes, centers = k)
  model$tot.withinss
})

elbow_df <- data.frame(
  k = 1:10, 
  tot_withinss = tot_withinss
)

ggplot(elbow_df, aes(k, tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

  • k-means - average silhouette widths
sil_width <- map_dbl(2:10, function(k) {
  model <- pam(oes, k = k)
  model$silinfo$avg.width
})

sil_df <- data.frame(
  k = 2:10, 
  sil_width = sil_width
)

ggplot(sil_df, aes(k, sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)

  • looks like k of 7 gives the best result.

  • ultimately best way to cluster is highly dependent on how we want to use the data after and business side of knowledge.