library(carData)
jobs <- force(Prestige)
head(jobs)
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
library(tidyr)
jobs_clean <- drop_na(jobs)
head(jobs)
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
Unit of observation: one job
Initial sample size: 98 jobs
Definition of variables:
Source of data: dataset in R (Canada (1971) Census of Canada. Vol. 3, Part 6. Statistics Canada [pp. 19-1–19-21].)
### Data manipulation, creating factors
jobs_clean$type <- factor(jobs_clean$type,
levels = c ("bc", "prof", "wc"),
labels = c ("Blue_collar","Professional", "White_collar"))
### Descriptive statistic for numerical variables
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
round(stat.desc(jobs_clean[ , c (1:4) ]), 2)
## education income women prestige
## nbr.val 98.00 98.00 98.00 98.00
## nbr.null 0.00 0.00 5.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 6.38 1656.00 0.00 17.30
## max 15.97 25879.00 97.51 87.20
## range 9.59 24223.00 97.51 69.90
## sum 1057.92 680008.00 2840.60 4638.10
## median 10.61 6035.50 14.47 43.60
## mean 10.80 6938.86 28.99 47.33
## SE.mean 0.28 427.11 3.17 1.73
## CI.mean.0.95 0.55 847.69 6.29 3.43
## var 7.56 17877104.76 984.83 292.24
## std.dev 2.75 4228.13 31.38 17.09
## coef.var 0.25 0.61 1.08 0.36
The estimated average percentage of incumbents who are women in our sample of jobs is 28.99%.
50% of jobs in our sample had prestige score up to and including 43.60, and 50% of jobs had a higher prestige score.
The lowest average education of incubans in our samle was 6.38 years, and the higest was 15.97 years.
Can we group Canadian jobs into more and less desirable jobs, based on the required education, income, proportion of women and prestige?
jobs_std <- as.data.frame(scale(jobs_clean[c("education",
"income",
"women",
"prestige")]))
### Looking for outliers
jobs_clean$Dissimilarity <- sqrt(jobs_clean$education^2 +
jobs_clean$income^2 +
jobs_clean$women^2 +
jobs_clean$prestige^2)
head(jobs_clean[order(-jobs_clean$Dissimilarity), ], 10)
## education income women prestige census type
## general.managers 12.26 25879 4.02 69.1 1130 Professional
## physicians 15.96 25308 10.56 87.2 3111 Professional
## lawyers 15.77 19263 5.13 82.3 2343 Professional
## osteopaths.chiropractors 14.71 17498 6.91 68.4 3117 Professional
## veterinarians 15.94 14558 4.32 66.7 3115 Professional
## architects 15.44 14163 2.69 78.1 2141 Professional
## pilots 12.27 14032 0.58 66.1 9111 Professional
## university.teachers 15.97 12480 19.59 84.6 2711 Professional
## gov.administrators 13.11 12351 11.16 68.8 1113 Professional
## civil.engineers 14.52 11377 1.03 73.1 2143 Professional
## Dissimilarity
## general.managers 25879.10
## physicians 25308.16
## lawyers 19263.18
## osteopaths.chiropractors 17498.14
## veterinarians 14558.16
## architects 14163.22
## pilots 14032.16
## university.teachers 12480.31
## gov.administrators 12351.20
## civil.engineers 11377.24
Dissimilarity tells us how different each job is from the rest of the jobs (higher value means more different). From this case, it is evident that the jobs are quite different from one another, which makes sense, since very different jobs are included in the sample. “general.managers”, “physicians” and “lawyers” seem to be the most different form the others, so I will remove them to avoid issues with clustering.
### Removing outliers
jobs_clean <- jobs_clean[!row.names(jobs_clean) %in% c("general.managers", "physicians", "lawyers"), ]
jobs_std <- as.data.frame(scale(jobs_clean[c("education",
"income",
"women",
"prestige")]))
head(jobs_clean[order(-jobs_clean$Dissimilarity), ], 10)
## education income women prestige census type
## osteopaths.chiropractors 14.71 17498 6.91 68.4 3117 Professional
## veterinarians 15.94 14558 4.32 66.7 3115 Professional
## architects 15.44 14163 2.69 78.1 2141 Professional
## pilots 12.27 14032 0.58 66.1 9111 Professional
## university.teachers 15.97 12480 19.59 84.6 2711 Professional
## gov.administrators 13.11 12351 11.16 68.8 1113 Professional
## civil.engineers 14.52 11377 1.03 73.1 2143 Professional
## physicists 15.64 11030 5.13 77.6 2113 Professional
## mining.engineers 14.64 11023 0.94 68.8 2153 Professional
## pharmacists 15.21 10432 24.71 69.3 3151 Professional
## Dissimilarity
## osteopaths.chiropractors 17498.14
## veterinarians 14558.16
## architects 14163.22
## pilots 14032.16
## university.teachers 12480.31
## gov.administrators 12351.20
## civil.engineers 11377.24
## physicists 11030.29
## mining.engineers 11023.22
## pharmacists 10432.27
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(jobs_std,
method = "euclidian")
fviz_dist(Distances,
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
### Hopkins statistics
library(factoextra)
get_clust_tendency(jobs_std,
n = nrow(jobs_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7682074
##
## $plot
## NULL
From the graph of Euclidian distances and Hopkins statistics we can check if data can be clustered. Hopkins statistic is close to 1, which confirms that data can be used for clusters. The graph also shows the possibility of clustering, but maybe the number of clusters is not so clear, so I will continue the analysis to figure out how many clusters I should make.
I will check how many clusters I should make based on the Elbow method and Silhouette analysis.
library(factoextra)
library(NbClust)
fviz_nbclust(jobs_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
When using the elbow method to determine the optimal number of clusters, we are looking for the most evident breaks. For me, this seems to be at 3, but I will check additionally with silhouette analysis.
fviz_nbclust(jobs_std, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")
Silhouette methods measures how well data fits into their assigned cluster. In this case, the analysis indicates that 3 clusters will provide the best fit for my data.
I will also use k-means method to check:
library(NbClust)
nc <- NbClust(jobs_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 3 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Based on this method, 3 clusters are the best number of clusters for my data. So I will proceed my analysis with 3 clusters.
Clustering <- kmeans(jobs_std,
centers = 3,
nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 26, 46, 23
##
## Cluster means:
## education income women prestige
## 1 0.03542115 -0.8183123 1.4505122 -0.2066937
## 2 -0.67326365 -0.1331005 -0.6024839 -0.5373306
## 3 1.30648600 1.1912498 -0.4347416 1.3083149
##
## Clustering vector:
## gov.administrators accountants purchasing.officers
## 3 3 3
## chemists physicists biologists
## 3 3 3
## architects civil.engineers mining.engineers
## 3 3 3
## surveyors draughtsmen computer.programers
## 3 3 3
## economists psychologists social.workers
## 3 3 1
## librarians vocational.counsellors ministers
## 1 3 3
## university.teachers primary.school.teachers secondary.school.teachers
## 3 1 3
## veterinarians osteopaths.chiropractors nurses
## 3 3 1
## nursing.aides physio.therapsts pharmacists
## 1 1 3
## medical.technicians commercial.artists radio.tv.announcers
## 1 2 3
## secretaries typists bookkeepers
## 1 1 1
## tellers.cashiers computer.operators shipping.clerks
## 1 1 2
## file.clerks receptionsts mail.carriers
## 1 1 2
## postal.clerks telephone.operators collectors
## 1 1 1
## claim.adjustors travel.clerks office.clerks
## 1 2 1
## sales.supervisors commercial.travellers sales.clerks
## 2 2 1
## service.station.attendant insurance.agents real.estate.salesmen
## 2 2 2
## buyers firefighters policemen
## 2 2 2
## cooks bartenders funeral.directors
## 1 2 2
## launderers janitors elevator.operators
## 1 2 2
## farm.workers rotary.well.drillers bakers
## 2 2 2
## slaughterers.1 slaughterers.2 canners
## 2 2 1
## textile.weavers textile.labourers tool.die.makers
## 2 2 2
## machinists sheet.metal.workers welders
## 2 2 2
## auto.workers aircraft.workers electronic.workers
## 2 2 1
## radio.tv.repairmen sewing.mach.operators auto.repairmen
## 2 1 2
## aircraft.repairmen railway.sectionmen electrical.linemen
## 2 2 2
## electricians construction.foremen carpenters
## 2 2 2
## masons house.painters plumbers
## 2 2 2
## construction.labourers pilots train.engineers
## 2 3 2
## bus.drivers taxi.drivers longshoremen
## 2 2 2
## typesetters bookbinders
## 2 1
##
## Within cluster sum of squares by cluster:
## [1] 43.54207 52.73682 39.20547
## (between_SS / total_SS = 64.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = TRUE,
labelsize = 5,
ggtheme = theme_bw(),
data = jobs_std)
Based on above results, we have 3 distinct clusters that include 26, 46 and 23 jobs.
My clusters are:
Cluster 1: Displays higher than average values in the proportion of women and a bit above average values for education but lower than average values in all other categories.This indicates that this is a cluster of less desirable jobs, mostly held by women.
Cluster 2: Displays consistently lower than average values in all categories. This indicates a cluster of low desirable jobs, held more by men.
Cluster 3: Has lower than average values in proportion of women, while all other values are higher than average. This indicates that this cluster contains the most desirable and prestigeous jobs.
Averages <- Clustering$centers
Averages
## education income women prestige
## 1 0.03542115 -0.8183123 1.4505122 -0.2066937
## 2 -0.67326365 -0.1331005 -0.6024839 -0.5373306
## 3 1.30648600 1.1912498 -0.4347416 1.3083149
Let’s show the differences we are seeing in the average values on the graph:
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("education",
"income",
"women",
"prestige"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("education",
"income",
"women",
"prestige"),
labels = c("education",
"income",
"women",
"prestige"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
jobs_clean$Group <- Clustering$cluster
fit <- aov(cbind(education, income, women, prestige) ~ as.factor(Group),
data = jobs_clean)
summary(fit)
## Response education :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 433.80 216.899 81.712 < 2.2e-16 ***
## Residuals 92 244.21 2.654
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response income :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 465412098 232706049 54.242 2.746e-16 ***
## Residuals 92 394696807 4290183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response women :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 75707 37854 190.91 < 2.2e-16 ***
## Residuals 92 18242 198
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response prestige :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 14276 7138.1 61.458 < 2.2e-16 ***
## Residuals 92 10686 116.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response education - H0: Mean(education in Cluster 1) = Mean(education in Cluster 2) = Mean(education in Cluster 3)
H1: At least one mean is different.
(Same for other responses, I’m looking for statistical differences of means of each variable between my clusters)
We reject H0 for all tests (p<0.001). Very low p values indicate that my clustering successfully differentiates jobs between the clusters.
To validate my clusters, I will be using the variable type of job, as based on the cluster variables, the type should also vary between clusters. Since type is a categorical variable, I will perform a Pearson chi-square test.
chi_square <- chisq.test(jobs_clean$type, as.factor(jobs_clean$Group))
chi_square
##
## Pearson's Chi-squared test
##
## data: jobs_clean$type and as.factor(jobs_clean$Group)
## X-squared = 85.123, df = 4, p-value < 2.2e-16
H0: There is no association between variables (type of job and cluster)
H1: There is an association between variables (type of job and cluster)
We reject H0 at p<0.001 - there seems to be association between the type of job and the cluster.
addmargins(chi_square$observed)
## as.factor(jobs_clean$Group)
## jobs_clean$type 1 2 3 Sum
## Blue_collar 7 37 0 44
## Professional 5 1 22 28
## White_collar 14 8 1 23
## Sum 26 46 23 95
addmargins(round(chi_square$expected), 2)
## as.factor(jobs_clean$Group)
## jobs_clean$type 1 2 3 Sum
## Blue_collar 12 21 11 44
## Professional 8 14 7 29
## White_collar 6 11 6 23
round(chi_square$res, 2)
## as.factor(jobs_clean$Group)
## jobs_clean$type 1 2 3
## Blue_collar -1.45 3.40 -3.26
## Professional -0.96 -3.41 5.85
## White_collar 3.07 -0.94 -1.94
In the combination white collar and cluster 1, there is more than expected number of jobs (alpha = 1%).
In the combination blue collar and cluster 2, there is more than expected number of jobs (alpha = 0.1%).
In the combination professional and cluster 2, there is less than expected number of jobs (alpha = 0.1%).
In the combination blue collar and cluster 3, there is less than expected number of jobs (alpha = 0.1%).
In the combination professional and cluster 3, there is more than expected umber of jobs (alpha = 0.1%).
In the combination white collar and cluster 3, there is more than expected number of jobs (alpha = 1%).
My analysis aimed to reveal how jobs cluster into more and less desirable and prestigious, based on education, proportion of women, income and prestige. The results of clustering and chi-square test demonstrate that the jobs ca be divided into three meaningful clusters, with different characteristics:
Cluster 1: This cluster contains 26 jobs (about 27%).Here we can find jobs that have above average rate of women employed and are above average in income and prestige, while employment is about average. This cluster contains more white collar jobs than expected.
Cluster 2: This cluster contains 46 jobs (about 48%). Here we can find jobs that have a below average rate of employed women (jobs are mostly held by men) and all the other variables are also below average. It contains more than expected number of blue collar jobs and less than expected number of professional jobs.
Cluster 3: This cluster contains 23 jobs (about 24%). Here we can find jobs that have and above average values in all cluster variables except the rate of women (jobs mostly held by men). It contains more tahn expected number of professional jobs and less than expected blue and white collar jobs.