How does a grouping of Canadian jobs based on the gender ratio and education level of its employees, and prestige look like and what are the demographic characteristics of those groups
library(carData)
mydata <- force(Prestige)
head(mydata, 6)
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
Sample size: 102
Unit of observation: One job
Variables:
education: Average education of occupational incumbents, years, in 1971
income: Average income of incumbents, dollars, in 1971
women: Percentage of incumbents who are women
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s
census: Canadian Census occupational code
type: Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar
Source: Canada (1971) Census of Canada. Vol. 3, Part 6. Statistics Canada [pp. 19-1–19-21], Personal communication from B. Blishen, W. Carroll, and C. Moore, Departments of Sociology, York University and University of Victoria
mydata$typeF <- factor(mydata$type,
levels = c("bc", "prof","wc"),
labels = c("blue collar", "professional","white collar"))
mydata$ID <- seq(nrow(mydata))
library(tidyr)
mydata <- drop_na(mydata)
summary(mydata[ , c(1,2,3,4,7)])
## education income women prestige
## Min. : 6.380 Min. : 1656 Min. : 0.000 Min. :17.30
## 1st Qu.: 8.445 1st Qu.: 4250 1st Qu.: 3.268 1st Qu.:35.38
## Median :10.605 Median : 6036 Median :14.475 Median :43.60
## Mean :10.795 Mean : 6939 Mean :28.986 Mean :47.33
## 3rd Qu.:12.755 3rd Qu.: 8226 3rd Qu.:52.203 3rd Qu.:59.90
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## typeF
## blue collar :44
## professional:31
## white collar:23
##
##
##
income mean: The average income in dollars in 1971 of the workers based on this sample is $6939
3rd quartile education: 75% of the workers had upto 12.75 years in education based on this sample
median prestige: If arranged in ascending order in terms of the Pineo-Porter prestige score for occupation the middle most job has a score of 43.6
mydata$Edu_z <- scale(mydata$education)
mydata$women_z <- scale(mydata$women)
mydata$prestige_z <- scale(mydata$prestige)
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c("Edu_z", "women_z", "prestige_z")]),
type = "pearson")
## Edu_z women_z prestige_z
## Edu_z 1.00 0.06 0.87
## women_z 0.06 1.00 -0.11
## prestige_z 0.87 -0.11 1.00
##
## n= 98
##
##
## P
## Edu_z women_z prestige_z
## Edu_z 0.5447 0.0000
## women_z 0.5447 0.2803
## prestige_z 0.0000 0.2803
Based on this there is a high positive correlation between the variables Edu_z and prestige_z, and almost no correlation between Edu_z and women_z and women_z and prestige_z
mydata$Dissimilarity <- sqrt(mydata$Edu_z^2 + mydata$women_z^2 + mydata$prestige_z^2)
head(mydata[order(-mydata$Dissimilarity), ], 10)
## education income women prestige census type typeF
## physicians 15.96 25308 10.56 87.2 3111 prof professional
## university.teachers 15.97 12480 19.59 84.6 2711 prof professional
## lawyers 15.77 19263 5.13 82.3 2343 prof professional
## sewing.mach.operators 6.38 2847 90.67 28.2 8563 bc blue collar
## physicists 15.64 11030 5.13 77.6 2113 prof professional
## architects 15.44 14163 2.69 78.1 2141 prof professional
## physio.therapsts 13.62 5092 82.66 72.1 3137 prof professional
## nurses 12.46 4614 96.12 64.7 3131 prof professional
## launderers 7.33 3000 69.31 20.8 6162 bc blue collar
## veterinarians 15.94 14558 4.32 66.7 3115 prof professional
## ID Edu_z women_z prestige_z Dissimilarity
## physicians 24 1.8788713 -0.5871425 2.332417 3.052059
## university.teachers 21 1.8825091 -0.2993980 2.180325 2.896083
## lawyers 17 1.8097537 -0.7601715 2.045782 2.835188
## sewing.mach.operators 84 -1.6061128 1.9655935 -1.118904 2.774004
## physicists 6 1.7624626 -0.7601715 1.770846 2.611519
## architects 8 1.6897072 -0.8379230 1.800095 2.607214
## physio.therapsts 29 1.0276330 1.7103517 1.449113 2.466021
## nurses 27 0.6056517 2.1392598 1.016235 2.444582
## launderers 64 -1.2605246 1.2849489 -1.551781 2.376561
## veterinarians 25 1.8715958 -0.7859825 1.133229 2.324833
Based on this we can say none of the units have a high dissimilarity which warrants a removal
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
distance <- get_dist(mydata[c("Edu_z", "women_z", "prestige_z")],
method = "euclidian")
distance2 <- distance^2
fviz_dist(distance2,
gradient = list(low = "lightblue", mid = "pink", high = "purple"))
get_clust_tendency(mydata[c("Edu_z", "women_z", "prestige_z")],
n = nrow(mydata) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.705796
##
## $plot
## NULL
Since its above the 0.5 threshold we can say the data is clusterable
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
WARD <- mydata[c("Edu_z", "women_z", "prestige_z")] %>%
get_dist(method = "euclidian") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 98
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
fviz_dend(WARD,
k = 3,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
Based on the dendrogram we can say three clusters are appropriate for this dataset
mydata$ClusterWard <- cutree(WARD,
k = 3)
head(mydata[c(8,13)])
## ID ClusterWard
## gov.administrators 1 1
## general.managers 2 1
## accountants 3 1
## purchasing.officers 4 1
## chemists 5 1
## physicists 6 1
Initial_leaders <- aggregate(mydata[ , c("Edu_z", "women_z", "prestige_z")],
by = list(mydata$ClusterWard),
FUN = mean)
Initial_leaders
## Group.1 Edu_z women_z prestige_z
## 1 1 0.9599705 -0.1677223 1.0058522
## 2 2 -0.2381292 1.4360067 -0.5883361
## 3 3 -0.8851641 -0.5792432 -0.7491412
K_MEANS <- hkmeans(mydata[c("Edu_z", "women_z", "prestige_z")],
k = 3,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 3 clusters of sizes 28, 25, 45
##
## Cluster means:
## Edu_z women_z prestige_z
## 1 1.21095067 -0.3974638 1.2142894
## 2 -0.00622133 1.4939348 -0.2709316
## 3 -0.75002412 -0.5826530 -0.6050403
##
## Clustering vector:
## gov.administrators general.managers accountants
## 1 1 1
## purchasing.officers chemists physicists
## 1 1 1
## biologists architects civil.engineers
## 1 1 1
## mining.engineers surveyors draughtsmen
## 1 1 1
## computer.programers economists psychologists
## 1 1 1
## social.workers lawyers librarians
## 1 1 2
## vocational.counsellors ministers university.teachers
## 1 1 1
## primary.school.teachers secondary.school.teachers physicians
## 2 1 1
## veterinarians osteopaths.chiropractors nurses
## 1 1 2
## nursing.aides physio.therapsts pharmacists
## 2 2 1
## medical.technicians commercial.artists radio.tv.announcers
## 2 1 1
## secretaries typists bookkeepers
## 2 2 2
## tellers.cashiers computer.operators shipping.clerks
## 2 2 3
## file.clerks receptionsts mail.carriers
## 2 2 3
## postal.clerks telephone.operators collectors
## 2 2 2
## claim.adjustors travel.clerks office.clerks
## 2 2 2
## sales.supervisors commercial.travellers sales.clerks
## 3 3 2
## service.station.attendant insurance.agents real.estate.salesmen
## 3 3 3
## buyers firefighters policemen
## 3 3 3
## cooks bartenders funeral.directors
## 3 3 3
## launderers janitors elevator.operators
## 2 3 3
## farm.workers rotary.well.drillers bakers
## 3 3 3
## slaughterers.1 slaughterers.2 canners
## 3 3 2
## textile.weavers textile.labourers tool.die.makers
## 3 3 3
## machinists sheet.metal.workers welders
## 3 3 3
## auto.workers aircraft.workers electronic.workers
## 3 3 2
## radio.tv.repairmen sewing.mach.operators auto.repairmen
## 3 2 3
## aircraft.repairmen railway.sectionmen electrical.linemen
## 3 3 3
## electricians construction.foremen carpenters
## 3 3 3
## masons house.painters plumbers
## 3 3 3
## construction.labourers pilots train.engineers
## 3 1 3
## bus.drivers taxi.drivers longshoremen
## 3 3 3
## typesetters bookbinders
## 3 2
##
## Within cluster sum of squares by cluster:
## [1] 22.87224 34.26621 32.39664
## (between_SS / total_SS = 69.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
As we can see the ratio of between SS and total SS is 69.2% which means there is a sufficient difference between the clusters
fviz_cluster(K_MEANS,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("Edu_z", "women_z", "prestige_z")])
## Edu_z women_z prestige_z
## gov.administrators 0.8421067 -0.5680232 1.2560729
## general.managers 0.5328962 -0.7955421 1.2736220
## accountants 0.7184225 -0.4233544 0.9401893
## purchasing.officers 0.2273235 -0.6333473 0.5541094
## chemists 1.3914101 -0.5514532 1.5310086
## physicists 1.7624626 -0.7601715 1.7708461
table(mydata$ClusterWard)
##
## 1 2 3
## 40 20 38
table(mydata$ClusterK_Means)
##
## 1 2 3
## 28 25 45
table(mydata$ClusterWard, mydata$ClusterK_Means)
##
## 1 2 3
## 1 28 5 7
## 2 0 20 0
## 3 0 0 38
Interpretation of the second row and the second column: The number of observations in the second cluster as per hierarchical method was 20 but it received 5 observations from Cluster 1 as a part of reclassification in the K Means method and now it has 25 observations
Centroids <- K_MEANS$centers
Centroids
## Edu_z women_z prestige_z
## 1 1.21095067 -0.3974638 1.2142894
## 2 -0.00622133 1.4939348 -0.2709316
## 3 -0.75002412 -0.5826530 -0.6050403
library(ggplot2)
library(tidyr)
Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("Edu_z", "women_z", "prestige_z"))
Figure$Groups <- factor(Figure$id,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$nameFactor <- factor(Figure$name,
levels = c("Edu_z", "women_z", "prestige_z"),
labels = c("Edu_z", "women_z", "prestige_z"))
ggplot(Figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Groups, col = Groups), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
ylim(-1.5, 1.5)
#### Checking cluster variable fit
H1: At least one mean is different
We reject H0(p<0.001) and can say at least one mean is significantly different
H1: At least one mean is different
We reject H0(p<0.001) and can say at least one mean is significantly different
H1: At least one mean is different
We reject H0(p<0.001) and can say at least one mean is significantly different
fit <- aov(cbind(Edu_z, women_z, prestige_z) ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 66.374 33.187 102.95 < 2.2e-16 ***
## Residuals 95 30.626 0.322
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 75.496 37.748 166.76 < 2.2e-16 ***
## Residuals 95 21.504 0.226
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 59.594 29.7972 75.677 < 2.2e-16 ***
## Residuals 95 37.406 0.3937
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aggregate(mydata$income,
by = list(mydata$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 11194.250
## 2 2 3971.000
## 3 3 5939.867
fit <- aov(income ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 772147904 386073952 38.13 6.98e-13 ***
## Residuals 95 961931258 10125592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: Mean(income Cluster 1) = Mean(income Cluster 2) = Mean(income Cluster 3)
H1: At least one mean is different
We reject H0(p<0.001) and can say at least one mean is significantly different and as a result, we are able to validate our analysis using income
aggregate(mydata$education,
by = list(mydata$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 14.123929
## 2 2 10.778000
## 3 3 8.733333
fit <- aov(education ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 501.6 250.78 102.9 <2e-16 ***
## Residuals 95 231.4 2.44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: Mean(education Cluster 1) = Mean(education Cluster 2) = Mean(education Cluster 3)
H1: At least one mean is different
We reject H0(p<0.001) and can say at least one mean is significantly different and as a result, we are able to validate our analysis using education
chisq_results <- chisq.test(mydata$typeF, as.factor(mydata$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$typeF and as.factor(mydata$ClusterK_Means)
## X-squared = 104.15, df = 4, p-value < 2.2e-16
H0: There is no association between the type of job and cluster
H1: There is an association
We reject H0(p<0.001) and can say there is an association betweem the type of job and cluster and as a result, we are able to validate our analysis using job type
round(chisq_results$expected, 2)
##
## mydata$typeF 1 2 3
## blue collar 12.57 11.22 20.20
## professional 8.86 7.91 14.23
## white collar 6.57 5.87 10.56
round(chisq_results$observed, 2)
##
## mydata$typeF 1 2 3
## blue collar 0 6 38
## professional 27 4 0
## white collar 1 15 7
round(chisq_results$res, 2)
##
## mydata$typeF 1 2 3
## blue collar -3.55 -1.56 3.96
## professional 6.10 -1.39 -3.77
## white collar -2.17 3.77 -1.10
The number of professional jobs in Cluster 1 is significantly more than expected and the number of white and blue collar jobs is significantly less than expected
The number of professional jobs in Cluster 3 is significantly less than expected and the number of blue collar jobs is significantly more than expected
We can describe the groups as follows:
Group 1: It has 28.5%(28/98) of the jobs analysed by us. It has an above average number of the education years of the employees compared to the other clusters while having a below average proportion of women working in the jobs belonging to this cluster compared to the other clusters and has an above average prestige score compared to the other clusters. It has the highest average income and the highest average education years of the employees while having the most professional jobs
Group 2(the glass ceiling effect): It has 25.5%(25/98) of the jobs analysed by us. It has an exactly average number of the education years of the employees compared to the other clusters while having a significantly above average proportion of women working in the jobs belonging to this cluster compared to the other clusters and has a slightly below average prestige score compared to the other clusters. It has the lowest average income and the second highest average education years of the employees while having the most jobs in the white collar category
Group 3: It has 45.9%(45/98) of the jobs analysed by us. It has a below average number of the education years of the employees compared to the other clusters while having a significantly below average proportion of women working in the jobs belonging to this cluster compared to the other clusters and has a significantly below average prestige score compared to the other clusters. It has the second highest average income and the lowest average education years of the employees while having the most jobs in the blue collar category
There is a really important observation while analysing Group 2 and 3> While Group 2 has a higher number of education years and the prestige score compared with Group 3, it still has the lowest income and a reason for this could be the glass ceiling> because this cluster has jobs with the highest proportion of women, the average income is the lowest compared with other clusters. This is in confirmation with many researches which show women are paid lesser even while being more educated and working in jobs with a higher prestige than men