National Occupational mean wage
Occupational wage data
Initial exploration of the data
We are presented with data from the Occupational Employment Statistics (OES) program which produces employment and wage estimates annually. This data contains the yearly average income from 2001 to 2016 for 22 occupation groups. You would like to use this data to identify clusters of occupations that maintained similar income trends.
Before we begin to cluster this data we should determine whether any pre-processing steps (such as scaling and imputation) are necessary.
Leverage the functions head() and summary() to explore the oes data in order to determine which of the pre-processing steps below are necessary: there are no missing values, no categorical and the features are on the same scale.
Hierarchical clustering
Occupation trees
We will take the necessary steps to build a dendrogram of occupations based on their yearly average salaries and propose clusters using a height of 100,000.
oes <- readRDS("oes.rds")
dist_oes <- dist(oes, method = "euclidean")
hc_oes <- hclust(dist_oes, method = "average")
dend_oes <- as.dendrogram(hc_oes)
plot(dend_oes)

dend_colored <- color_branches(dend_oes, h = 100000)
plot(dend_colored)

Based on the dendrogram it may be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another.
Preparing for exploration
We have now created a potential clustering for the oes data, before we can explore these clusters with ggplot2 we will need to process the oes data matrix into a tidy data frame with each occupation assigned its cluster.
dist_oes <- dist(oes, method = 'euclidean')
hc_oes <- hclust(dist_oes, method = 'average')
library(tibble)
Attaching package: ‘tibble’
The following object is masked from ‘package:wrapr’:
view
library(tidyr)
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')
cut_oes <- cutree(hc_oes, h = 100000)
clust_oes <- mutate(df_oes, cluster = cut_oes)
gathered_oes <- gather(data = clust_oes,
key = year,
value = mean_salary,
-occupation, -cluster)
| | | | |
---|
1 | Management | 1 | 2001 | 70800 |
2 | Business Operations | 2 | 2001 | 50580 |
3 | Computer Science | 2 | 2001 | 60350 |
4 | Architecture/Engineering | 2 | 2001 | 56330 |
5 | Life/Physical/Social Sci. | 2 | 2001 | 49710 |
6 | Community Services | 3 | 2001 | 34190 |
We now have the data frames necessary to explore the results of this clustering
Plotting occupational clusters
You have succesfully created all the parts necessary to explore the results of this hierarchical clustering work. Now, we will leverage the named assignment vector cut_oes and the tidy data frame gathered_oes to analyze the resulting clusters.
Management Legal
1 1
Business Operations Computer Science
2 2
Architecture/Engineering Life/Physical/Social Sci.
2 2
Healthcare Practitioners Community Services
2 3
Education/Training/Library Arts/Design/Entertainment
3 3
Healthcare Support Protective Service
3 3
Food Preparation Grounds Cleaning & Maint.
3 3
Personal Care Sales
3 3
Office Administrative Farming/Fishing/Forestry
3 3
Construction Installation/Repair/Maint.
3 3
Production Transportation/Moving
3 3
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) +
geom_line(aes(group = occupation))

From this work it looks like both Management & Legal professions (cluster 1) experienced the most rapid growth in these 15 years.
Kmeans
Elbow analysis
We will leverage the k-means elbow plot to propose the “best” number of clusters.
We use map_dbl() to run kmeans() using the oes data for k values ranging from 1 to 10 and extract the total within-cluster sum of squares value from each model: model$tot.withinss
tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = oes, centers = k)
model$tot.withinss
})
We store the resulting vector as tot_withinss
The new data frame elbow_df containing the values of k and the vector of total within-cluster sum of squares
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)

So the elbow analysis proposes a different value of k.
Average Silhouette Widths
So hierarchical clustering resulting in 3 clusters and the elbow method suggests 2. We will use average silhouette widths to explore what the “best” value of k should be.
sil_width <- map_dbl(2:10, function(k){
model <- pam(oes, k = k)
model$silinfo$avg.width
})
sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
)
ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)

It seems that this analysis results in another value of k, this time 7 is the top contender (although 2 comes very close).
The “best” number of clusters
We ran three different methods for finding the optimal number of clusters and their assignments and we arrived with three different answers.
What can you say about the “best” way to cluster this data?
- The clusters generated by the hierarchical clustering all have members with a Euclidean distance amongst one another less than 100,000 and hence is the best clustering method.
- The clusters generated using k-means with a k = 2 was identified using elbow analysis and hence is the best way to cluster this data.
- The clusters generated using k-means with a k = 7 has the largest Average Silhouette Widths among the cluster and hence is the best way to cluster this data. But the best way to cluster is highly dependent on how you would use this data after. There is no quantitative way to determine which of these clustering approaches is the right one without further exploration.
---
title: "Case Study: National Occupational mean wage"
output:
  html_notebook:
    toc: true
    toc_float: true
    toc_collapsed: false
    number_sections: true
    
toc_depth: 3
---
# National Occupational mean wage

## Occupational wage data

### Initial exploration of the data

We are presented with data from the Occupational Employment Statistics (OES) program which produces employment and wage estimates annually. This data contains the yearly average income from 2001 to 2016 for 22 occupation groups. You would like to use this data to identify clusters of occupations that maintained similar income trends.

Before we begin to cluster this data we should determine whether any pre-processing steps (such as scaling and imputation) are necessary.

Leverage the functions head() and summary() to explore the oes data in order to determine which of the pre-processing steps below are necessary: there are no missing values, no categorical and the features are on the same scale.

## Hierarchical clustering

### Occupation trees

We will take the necessary steps to build a dendrogram of occupations based on their yearly average salaries and propose clusters using a height of 100,000.
```{r}
oes <- readRDS("oes.rds")
```
```{r}
# Calculate Euclidean distance between the occupations
dist_oes <- dist(oes, method = "euclidean")

# Generate an average linkage analysis 
hc_oes <- hclust(dist_oes, method = "average")

# Create a dendrogram object from the hclust variable
dend_oes <- as.dendrogram(hc_oes)

# Plot the dendrogram
plot(dend_oes)

# Color branches by cluster formed from the cut at a height of 100000
dend_colored <- color_branches(dend_oes, h = 100000)

# Plot the colored dendrogram
plot(dend_colored)

```
Based on the dendrogram it may be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another.

### Preparing for exploration

We have now created a potential clustering for the oes data, before we can explore these clusters with ggplot2 we will need to process the oes data matrix into a tidy data frame with each occupation assigned its cluster.

```{r}
dist_oes <- dist(oes, method = 'euclidean')
hc_oes <- hclust(dist_oes, method = 'average')

library(tibble)
library(tidyr)

# Use rownames_to_column to move the rownames into a column of the data frame
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')

# Create a cluster assignment vector at h = 100,000
cut_oes <- cutree(hc_oes, h = 100000)

# Generate the segmented the oes data frame
clust_oes <- mutate(df_oes, cluster = cut_oes)

# Create a tidy data frame by gathering the year and values into two columns
gathered_oes <- gather(data = clust_oes, 
                       key = year, 
                       value = mean_salary, 
                       -occupation, -cluster)
```
```{r}
head(gathered_oes)
```
We now have the data frames necessary to explore the results of this clustering

### Plotting occupational clusters

You have succesfully created all the parts necessary to explore the results of this hierarchical clustering work. Now, we will leverage the named assignment vector cut_oes and the tidy data frame gathered_oes to analyze the resulting clusters.
```{r}
# View the clustering assignments by sorting the cluster assignment vector
sort(cut_oes)

# Plot the relationship between mean_salary and year and color the lines by the assigned cluster
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) + 
    geom_line(aes(group = occupation))
```
From this work it looks like both Management & Legal professions (cluster 1) experienced the most rapid growth in these 15 years.

## Kmeans

### Elbow analysis

We will leverage the k-means elbow plot to propose the "best" number of clusters.

We use map_dbl() to run kmeans() using the oes data for k values ranging from 1 to 10 and extract the total within-cluster sum of squares value from each model: model$tot.withinss
```{r}
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = oes, centers = k)
  model$tot.withinss
})
```
We store the resulting vector as tot_withinss

The new data frame elbow_df containing the values of k and the vector of total within-cluster sum of squares
```{r}
# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)
```
So the elbow analysis proposes a different value of k.

### Average Silhouette Widths

So hierarchical clustering resulting in 3 clusters and the elbow method suggests 2. We will use average silhouette widths to explore what the "best" value of k should be.
```{r}
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10,  function(k){
  model <- pam(oes, k = k)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
```
It seems that this analysis results in another value of k, this time 7 is the top contender (although 2 comes very close).

## The "best" number of clusters

We ran three different methods for finding the optimal number of clusters and their assignments and we arrived with three different answers.


What can you say about the "best" way to cluster this data?

-	The clusters generated by the hierarchical clustering all have members with a Euclidean distance amongst one another less than 100,000 and hence is the best clustering method.
-	The clusters generated using k-means with a k = 2 was identified using elbow analysis and hence is the best way to cluster this data.
-	The clusters generated using k-means with a k = 7 has the largest Average Silhouette Widths among the cluster and hence is the best way to cluster this data.
But the best way to cluster is highly dependent on how you would use this data after. There is no quantitative way to determine which of these clustering approaches is the right one without further exploration.
