Clustering_case study

Load Packages and Dataset

dataset: oes (occupational employment statistics) that produces employment and wage estimates annually. The data contains the yearly average income from 2001 to 2016 for 22 occupation groups.
objectives: use this dataset to identify clusters of occupations that maintained similar income trends.

oes <- readRDS("~/Documents/R programming/Datacamp/DC_Cluster/Data/oes.rds")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ tidyr   1.1.4     ✓ forcats 0.5.1
## ✓ readr   2.1.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(dendextend)

## 
## ---------------------
## Welcome to dendextend version 1.15.2
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------

## 
## Attaching package: 'dendextend'

## The following object is masked from 'package:stats':
## 
##     cutree

library(cluster)

determine pre-processing steps. Below is trying to convert first column (index) to column with function “rownames_to_column”. The reverse function is “column to rownames” is we want to convert first column to index.
from the summary we can see that there are no missing values, no categorical and the features are on the same scale

head(oes)

##                            2001  2002  2003  2004  2005  2006  2007   2008
## Management                70800 78870 83400 87090 88450 91930 96150 100310
## Business Operations       50580 53350 56000 57120 57930 60000 62410  64720
## Computer Science          60350 61630 64150 66370 67100 69240 72190  74500
## Architecture/Engineering  56330 58020 60390 63060 63910 66190 68880  71430
## Life/Physical/Social Sci. 49710 52380 54930 57550 58030 59660 62020  64280
## Community Services        34190 34630 35800 37050 37530 39000 40540  41790
##                             2010   2011   2012   2013   2014   2015   2016
## Management                105440 107410 108570 110550 112490 115020 118020
## Business Operations        67690  68740  69550  71020  72410  73800  75070
## Computer Science           77230  78730  80180  82010  83970  86170  87880
## Architecture/Engineering   75550  77120  79000  80100  81520  82980  84300
## Life/Physical/Social Sci.  66390  67470  68360  69400  70070  71220  72930
## Community Services         43180  43830  44240  44710  45310  46160  47200

summary(oes)

##       2001            2002            2003            2004      
##  Min.   :16720   Min.   :17180   Min.   :17400   Min.   :17620  
##  1st Qu.:26728   1st Qu.:27392   1st Qu.:27858   1st Qu.:28535  
##  Median :34575   Median :35205   Median :36180   Median :37335  
##  Mean   :37850   Mean   :39701   Mean   :41018   Mean   :42275  
##  3rd Qu.:49875   3rd Qu.:53108   3rd Qu.:55732   3rd Qu.:57442  
##  Max.   :70800   Max.   :78870   Max.   :83400   Max.   :87090  
##       2005            2006            2007            2008       
##  Min.   :17840   Min.   :18430   Min.   :19440   Min.   : 20220  
##  1st Qu.:29042   1st Qu.:29688   1st Qu.:30810   1st Qu.: 31642  
##  Median :37790   Median :39030   Median :40235   Median : 41510  
##  Mean   :42775   Mean   :44329   Mean   :46074   Mean   : 47763  
##  3rd Qu.:58005   3rd Qu.:59915   3rd Qu.:62312   3rd Qu.: 64610  
##  Max.   :88450   Max.   :91930   Max.   :96150   Max.   :100310  
##       2010             2011             2012             2013       
##  Min.   : 21240   Min.   : 21430   Min.   : 21380   Min.   : 21580  
##  1st Qu.: 32862   1st Qu.: 33430   1st Qu.: 33795   1st Qu.: 34120  
##  Median : 42995   Median : 43610   Median : 44055   Median : 44565  
##  Mean   : 49758   Mean   : 50555   Mean   : 51077   Mean   : 51800  
##  3rd Qu.: 67365   3rd Qu.: 68422   3rd Qu.: 69252   3rd Qu.: 70615  
##  Max.   :105440   Max.   :107410   Max.   :108570   Max.   :110550  
##       2014             2015             2016       
##  Min.   : 21980   Min.   : 22850   Min.   : 23850  
##  1st Qu.: 34718   1st Qu.: 35425   1st Qu.: 36350  
##  Median : 45265   Median : 46075   Median : 46945  
##  Mean   : 52643   Mean   : 53785   Mean   : 55117  
##  3rd Qu.: 71825   3rd Qu.: 73155   3rd Qu.: 74535  
##  Max.   :112490   Max.   :115020   Max.   :118020

oes_df<- as.data.frame(oes)
oes_tbl <- rownames_to_column(oes_df, var = "occupation") %>% as_tibble()

oes_clean <- oes_tbl %>% 
rowwise() %>% 
mutate(m = mean(c_across(starts_with("2"))))

#below we have done a plot to showcase the salary by occupation by descending order.
#re-ordered the occupation by values (avg salaries across all years)
ggplot(oes_clean, aes(m, fct_reorder(occupation,m), fill = occupation)) +
  geom_col(alpha = 0.8) +
  theme(legend.position = "none") #removed legend

Practice

from above results, we learned that the oes data is ready for hierarchical clustering without any preprocessing steps necessary. next we will build a dendrogram of occupations based on their yearly average salaries (hence the linkage) and propose clusters using a height of 100,000 (can use other height as well and may do trial and error multiple times)
Hierarchical clustering: Occupation trees - steps to follow:
- calculate euclidean distance between the occupations
- generate an average linkage analysis (or complete or single linkage if preferred)
- create a dendrogram object from the results of linkage analysis
- plot the dendrogram
- color the branches with a height of 100000 and plot the colored dendrogram

dist_oes <- dist(oes, method = "euclidean")
hc_oes <- hclust(dist_oes, method = "average")
dend_oes <- as.dendrogram(hc_oes)
plot(dend_oes)

dend_colored <- color_branches(dend_oes, h = 100000)
plot(dend_colored)

based on the plot of dendrogram, it might be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another.
Hierarchical clustering: preparing for exploration
- create rownames (we have done it already)
- build cluster assignment vector with h = 100,000
- append cluster assignment to original dataframe
- reshape the data with pivot longer

df_oes <- rownames_to_column(as.data.frame(oes), var = "occupation")
cut_oes <- cutree(hc_oes, h = 100000)
clust_oes <- mutate(df_oes, cluster = cut_oes)

gathered_oes <- clust_oes %>% 
  pivot_longer(cols = starts_with("2"), names_to = "year", values_to = "mean_salary" )

now we have created all parts necessary to explore the results of the hierarchical clustering work. We will now visualize the results

sort(cut_oes)

##                 Management                      Legal 
##                          1                          1 
##        Business Operations           Computer Science 
##                          2                          2 
##   Architecture/Engineering  Life/Physical/Social Sci. 
##                          2                          2 
##   Healthcare Practitioners         Community Services 
##                          2                          3 
## Education/Training/Library  Arts/Design/Entertainment 
##                          3                          3 
##         Healthcare Support         Protective Service 
##                          3                          3 
##           Food Preparation  Grounds Cleaning & Maint. 
##                          3                          3 
##              Personal Care                      Sales 
##                          3                          3 
##      Office Administrative   Farming/Fishing/Forestry 
##                          3                          3 
##               Construction Installation/Repair/Maint. 
##                          3                          3 
##                 Production      Transportation/Moving 
##                          3                          3

ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) +
  geom_line(aes(group = occupation))

results look visually reasonable
let’s proceed to k-means
elbow analysis

tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = oes, centers = k)
  model$tot.withinss
})

elbow_df <- data.frame(
  k = 1:10, 
  tot_withinss = tot_withinss
)

ggplot(elbow_df, aes(k, tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

k-means - average silhouette widths

sil_width <- map_dbl(2:10, function(k) {
  model <- pam(oes, k = k)
  model$silinfo$avg.width
})

sil_df <- data.frame(
  k = 2:10, 
  sil_width = sil_width
)

ggplot(sil_df, aes(k, sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)

looks like k of 7 gives the best result.
ultimately best way to cluster is highly dependent on how we want to use the data after and business side of knowledge.

Clustering_case study

Kate C

2022-02-09

Load Packages and Dataset

Practice