HW2_7

1 Importing dataset

library(carData)
jobs <- force(Prestige)
head(jobs)

##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof

library(tidyr)
jobs_clean <- drop_na(jobs)

head(jobs)

##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof

2 Data description

Unit of observation: one job

Initial sample size: 98 jobs

Definition of variables:

education: average education of occupational incumbents, years, in 1971
income: average income of incumbents, dollars, in 1971
women: percentage of incumbents who are women
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid- 1960s
census: Canadian Census occupational code
type: a factor for type of occupation with levels bc (Blue Collar), prof (Professional, Managerial and Technical) and wc (White Collar)

Source of data: dataset in R (Canada (1971) Census of Canada. Vol. 3, Part 6. Statistics Canada [pp. 19-1–19-21].)

### Data manipulation, creating factors

jobs_clean$type <- factor(jobs_clean$type,
                         levels = c ("bc", "prof", "wc"),
                         labels = c ("Blue_collar","Professional", "White_collar"))

### Descriptive statistic for numerical variables
library(pastecs)

## 
## Attaching package: 'pastecs'

## The following object is masked from 'package:tidyr':
## 
##     extract

round(stat.desc(jobs_clean[ , c (1:4) ]), 2)

##              education      income   women prestige
## nbr.val          98.00       98.00   98.00    98.00
## nbr.null          0.00        0.00    5.00     0.00
## nbr.na            0.00        0.00    0.00     0.00
## min               6.38     1656.00    0.00    17.30
## max              15.97    25879.00   97.51    87.20
## range             9.59    24223.00   97.51    69.90
## sum            1057.92   680008.00 2840.60  4638.10
## median           10.61     6035.50   14.47    43.60
## mean             10.80     6938.86   28.99    47.33
## SE.mean           0.28      427.11    3.17     1.73
## CI.mean.0.95      0.55      847.69    6.29     3.43
## var               7.56 17877104.76  984.83   292.24
## std.dev           2.75     4228.13   31.38    17.09
## coef.var          0.25        0.61    1.08     0.36

The estimated average percentage of incumbents who are women in our sample of jobs is 28.99%.

50% of jobs in our sample had prestige score up to and including 43.60, and 50% of jobs had a higher prestige score.

The lowest average education of incubans in our samle was 6.38 years, and the higest was 15.97 years.

3 Research question

Can we group Canadian jobs into more and less desirable jobs, based on the required education, income, proportion of women and prestige?

jobs_std <- as.data.frame(scale(jobs_clean[c("education",
                                                             "income",
                                                             "women",
                                                             "prestige")]))

### Looking for outliers
jobs_clean$Dissimilarity <- sqrt(jobs_clean$education^2 + 
                                        jobs_clean$income^2 + 
                                        jobs_clean$women^2 + 
                                        jobs_clean$prestige^2)

head(jobs_clean[order(-jobs_clean$Dissimilarity), ], 10)

##                          education income women prestige census         type
## general.managers             12.26  25879  4.02     69.1   1130 Professional
## physicians                   15.96  25308 10.56     87.2   3111 Professional
## lawyers                      15.77  19263  5.13     82.3   2343 Professional
## osteopaths.chiropractors     14.71  17498  6.91     68.4   3117 Professional
## veterinarians                15.94  14558  4.32     66.7   3115 Professional
## architects                   15.44  14163  2.69     78.1   2141 Professional
## pilots                       12.27  14032  0.58     66.1   9111 Professional
## university.teachers          15.97  12480 19.59     84.6   2711 Professional
## gov.administrators           13.11  12351 11.16     68.8   1113 Professional
## civil.engineers              14.52  11377  1.03     73.1   2143 Professional
##                          Dissimilarity
## general.managers              25879.10
## physicians                    25308.16
## lawyers                       19263.18
## osteopaths.chiropractors      17498.14
## veterinarians                 14558.16
## architects                    14163.22
## pilots                        14032.16
## university.teachers           12480.31
## gov.administrators            12351.20
## civil.engineers               11377.24

Dissimilarity tells us how different each job is from the rest of the jobs (higher value means more different). From this case, it is evident that the jobs are quite different from one another, which makes sense, since very different jobs are included in the sample. “general.managers”, “physicians” and “lawyers” seem to be the most different form the others, so I will remove them to avoid issues with clustering.

### Removing outliers 

jobs_clean <- jobs_clean[!row.names(jobs_clean) %in% c("general.managers", "physicians", "lawyers"), ]

jobs_std <- as.data.frame(scale(jobs_clean[c("education",
                                                             "income",
                                                             "women",
                                                             "prestige")]))
head(jobs_clean[order(-jobs_clean$Dissimilarity), ], 10)

##                          education income women prestige census         type
## osteopaths.chiropractors     14.71  17498  6.91     68.4   3117 Professional
## veterinarians                15.94  14558  4.32     66.7   3115 Professional
## architects                   15.44  14163  2.69     78.1   2141 Professional
## pilots                       12.27  14032  0.58     66.1   9111 Professional
## university.teachers          15.97  12480 19.59     84.6   2711 Professional
## gov.administrators           13.11  12351 11.16     68.8   1113 Professional
## civil.engineers              14.52  11377  1.03     73.1   2143 Professional
## physicists                   15.64  11030  5.13     77.6   2113 Professional
## mining.engineers             14.64  11023  0.94     68.8   2153 Professional
## pharmacists                  15.21  10432 24.71     69.3   3151 Professional
##                          Dissimilarity
## osteopaths.chiropractors      17498.14
## veterinarians                 14558.16
## architects                    14163.22
## pilots                        14032.16
## university.teachers           12480.31
## gov.administrators            12351.20
## civil.engineers               11377.24
## physicists                    11030.29
## mining.engineers              11023.22
## pharmacists                   10432.27

Is data suitable for clustering?

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.4.2

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Distances <- get_dist(jobs_std, 
                      method = "euclidian")

fviz_dist(Distances,
          gradient = list(low = "darkred",
                          mid = "grey95",
                          high = "white"))

### Hopkins statistics

library(factoextra) 

get_clust_tendency(jobs_std, 
                   n = nrow(jobs_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.7682074
## 
## $plot
## NULL

From the graph of Euclidian distances and Hopkins statistics we can check if data can be clustered. Hopkins statistic is close to 1, which confirms that data can be used for clusters. The graph also shows the possibility of clustering, but maybe the number of clusters is not so clear, so I will continue the analysis to figure out how many clusters I should make.

How many clusters?

I will check how many clusters I should make based on the Elbow method and Silhouette analysis.

library(factoextra)
library(NbClust)

fviz_nbclust(jobs_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

When using the elbow method to determine the optimal number of clusters, we are looking for the most evident breaks. For me, this seems to be at 3, but I will check additionally with silhouette analysis.

fviz_nbclust(jobs_std, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette analysis")

Silhouette methods measures how well data fits into their assigned cluster. In this case, the analysis indicates that 3 clusters will provide the best fit for my data.

I will also use k-means method to check:

library(NbClust)
nc <- NbClust(jobs_std, 
              distance = "euclidean",
              min.nc = 2, max.nc = 10,
              method = "kmeans",
              index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 11 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 3 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

Based on this method, 3 clusters are the best number of clusters for my data. So I will proceed my analysis with 3 clusters.

Clustering

Clustering <- kmeans(jobs_std,
centers = 3,
nstart = 25)

Clustering

## K-means clustering with 3 clusters of sizes 26, 46, 23
## 
## Cluster means:
##     education     income      women   prestige
## 1  0.03542115 -0.8183123  1.4505122 -0.2066937
## 2 -0.67326365 -0.1331005 -0.6024839 -0.5373306
## 3  1.30648600  1.1912498 -0.4347416  1.3083149
## 
## Clustering vector:
##        gov.administrators               accountants       purchasing.officers 
##                         3                         3                         3 
##                  chemists                physicists                biologists 
##                         3                         3                         3 
##                architects           civil.engineers          mining.engineers 
##                         3                         3                         3 
##                 surveyors               draughtsmen       computer.programers 
##                         3                         3                         3 
##                economists             psychologists            social.workers 
##                         3                         3                         1 
##                librarians    vocational.counsellors                 ministers 
##                         1                         3                         3 
##       university.teachers   primary.school.teachers secondary.school.teachers 
##                         3                         1                         3 
##             veterinarians  osteopaths.chiropractors                    nurses 
##                         3                         3                         1 
##             nursing.aides          physio.therapsts               pharmacists 
##                         1                         1                         3 
##       medical.technicians        commercial.artists       radio.tv.announcers 
##                         1                         2                         3 
##               secretaries                   typists               bookkeepers 
##                         1                         1                         1 
##          tellers.cashiers        computer.operators           shipping.clerks 
##                         1                         1                         2 
##               file.clerks              receptionsts             mail.carriers 
##                         1                         1                         2 
##             postal.clerks       telephone.operators                collectors 
##                         1                         1                         1 
##           claim.adjustors             travel.clerks             office.clerks 
##                         1                         2                         1 
##         sales.supervisors     commercial.travellers              sales.clerks 
##                         2                         2                         1 
## service.station.attendant          insurance.agents      real.estate.salesmen 
##                         2                         2                         2 
##                    buyers              firefighters                 policemen 
##                         2                         2                         2 
##                     cooks                bartenders         funeral.directors 
##                         1                         2                         2 
##                launderers                  janitors        elevator.operators 
##                         1                         2                         2 
##              farm.workers      rotary.well.drillers                    bakers 
##                         2                         2                         2 
##            slaughterers.1            slaughterers.2                   canners 
##                         2                         2                         1 
##           textile.weavers         textile.labourers           tool.die.makers 
##                         2                         2                         2 
##                machinists       sheet.metal.workers                   welders 
##                         2                         2                         2 
##              auto.workers          aircraft.workers        electronic.workers 
##                         2                         2                         1 
##        radio.tv.repairmen     sewing.mach.operators            auto.repairmen 
##                         2                         1                         2 
##        aircraft.repairmen        railway.sectionmen        electrical.linemen 
##                         2                         2                         2 
##              electricians      construction.foremen                carpenters 
##                         2                         2                         2 
##                    masons            house.painters                  plumbers 
##                         2                         2                         2 
##    construction.labourers                    pilots           train.engineers 
##                         2                         3                         2 
##               bus.drivers              taxi.drivers              longshoremen 
##                         2                         2                         2 
##               typesetters               bookbinders 
##                         2                         1 
## 
## Within cluster sum of squares by cluster:
## [1] 43.54207 52.73682 39.20547
##  (between_SS / total_SS =  64.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

library(factoextra)
fviz_cluster(Clustering,
             palette = "Set1",
             repel = TRUE,
             labelsize = 5,
             ggtheme = theme_bw(),
             data = jobs_std)

Based on above results, we have 3 distinct clusters that include 26, 46 and 23 jobs.

My clusters are:

Cluster 1: Displays higher than average values in the proportion of women and a bit above average values for education but lower than average values in all other categories.This indicates that this is a cluster of less desirable jobs, mostly held by women.
Cluster 2: Displays consistently lower than average values in all categories. This indicates a cluster of low desirable jobs, held more by men.
Cluster 3: Has lower than average values in proportion of women, while all other values are higher than average. This indicates that this cluster contains the most desirable and prestigeous jobs.

Averages <- Clustering$centers

Averages

##     education     income      women   prestige
## 1  0.03542115 -0.8183123  1.4505122 -0.2066937
## 2 -0.67326365 -0.1331005 -0.6024839 -0.5373306
## 3  1.30648600  1.1912498 -0.4347416  1.3083149

Let’s show the differences we are seeing in the average values on the graph:

Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)

library(tidyr)
Figure <- pivot_longer(Figure, cols = c("education",
                                        "income",
                                        "women",
                                        "prestige"))

Figure$Group <- factor(Figure$ID, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$NameF <- factor(Figure$name, 
                       levels = c("education",
                                        "income",
                                        "women",
                                        "prestige"),

                       labels = c("education",
                                        "income",
                                        "women",
                                        "prestige"))

library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

Checking if variables successfully differentiate between groups

jobs_clean$Group <- Clustering$cluster

fit <- aov(cbind(education, income, women, prestige) ~ as.factor(Group), 
           data = jobs_clean)

summary(fit)

##  Response education :
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)  2 433.80 216.899  81.712 < 2.2e-16 ***
## Residuals        92 244.21   2.654                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response income :
##                  Df    Sum Sq   Mean Sq F value    Pr(>F)    
## as.factor(Group)  2 465412098 232706049  54.242 2.746e-16 ***
## Residuals        92 394696807   4290183                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response women :
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)  2  75707   37854  190.91 < 2.2e-16 ***
## Residuals        92  18242     198                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response prestige :
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)  2  14276  7138.1  61.458 < 2.2e-16 ***
## Residuals        92  10686   116.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Response education - H0: Mean(education in Cluster 1) = Mean(education in Cluster 2) = Mean(education in Cluster 3)

H1: At least one mean is different.

(Same for other responses, I’m looking for statistical differences of means of each variable between my clusters)

We reject H0 for all tests (p<0.001). Very low p values indicate that my clustering successfully differentiates jobs between the clusters.

Criterion validity

To validate my clusters, I will be using the variable type of job, as based on the cluster variables, the type should also vary between clusters. Since type is a categorical variable, I will perform a Pearson chi-square test.

chi_square <- chisq.test(jobs_clean$type, as.factor(jobs_clean$Group))

chi_square

## 
##  Pearson's Chi-squared test
## 
## data:  jobs_clean$type and as.factor(jobs_clean$Group)
## X-squared = 85.123, df = 4, p-value < 2.2e-16

H0: There is no association between variables (type of job and cluster)

H1: There is an association between variables (type of job and cluster)

We reject H0 at p<0.001 - there seems to be association between the type of job and the cluster.

addmargins(chi_square$observed)

##                as.factor(jobs_clean$Group)
## jobs_clean$type  1  2  3 Sum
##    Blue_collar   7 37  0  44
##    Professional  5  1 22  28
##    White_collar 14  8  1  23
##    Sum          26 46 23  95

addmargins(round(chi_square$expected), 2)

##                as.factor(jobs_clean$Group)
## jobs_clean$type  1  2  3 Sum
##    Blue_collar  12 21 11  44
##    Professional  8 14  7  29
##    White_collar  6 11  6  23

round(chi_square$res, 2)

##                as.factor(jobs_clean$Group)
## jobs_clean$type     1     2     3
##    Blue_collar  -1.45  3.40 -3.26
##    Professional -0.96 -3.41  5.85
##    White_collar  3.07 -0.94 -1.94

In the combination white collar and cluster 1, there is more than expected number of jobs (alpha = 1%).

In the combination blue collar and cluster 2, there is more than expected number of jobs (alpha = 0.1%).

In the combination professional and cluster 2, there is less than expected number of jobs (alpha = 0.1%).

In the combination blue collar and cluster 3, there is less than expected number of jobs (alpha = 0.1%).

In the combination professional and cluster 3, there is more than expected umber of jobs (alpha = 0.1%).

In the combination white collar and cluster 3, there is more than expected number of jobs (alpha = 1%).

Conclusions

My analysis aimed to reveal how jobs cluster into more and less desirable and prestigious, based on education, proportion of women, income and prestige. The results of clustering and chi-square test demonstrate that the jobs ca be divided into three meaningful clusters, with different characteristics:

Cluster 1: This cluster contains 26 jobs (about 27%).Here we can find jobs that have above average rate of women employed and are above average in income and prestige, while employment is about average. This cluster contains more white collar jobs than expected.
Cluster 2: This cluster contains 46 jobs (about 48%). Here we can find jobs that have a below average rate of employed women (jobs are mostly held by men) and all the other variables are also below average. It contains more than expected number of blue collar jobs and less than expected number of professional jobs.
Cluster 3: This cluster contains 23 jobs (about 24%). Here we can find jobs that have and above average values in all cluster variables except the rate of women (jobs mostly held by men). It contains more tahn expected number of professional jobs and less than expected blue and white collar jobs.