Clustering Assignment

Research Question

How does a grouping of Canadian jobs based on the gender ratio and education level of its employees, and prestige look like and what are the demographic characteristics of those groups

library(carData)
mydata <- force(Prestige)
head(mydata, 6)

##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof

Description of the data

Sample size: 102

Unit of observation: One job

Variables:

education: Average education of occupational incumbents, years, in 1971
income: Average income of incumbents, dollars, in 1971
women: Percentage of incumbents who are women
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s
census: Canadian Census occupational code
type: Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar

Source: Canada (1971) Census of Canada. Vol. 3, Part 6. Statistics Canada [pp. 19-1–19-21], Personal communication from B. Blishen, W. Carroll, and C. Moore, Departments of Sociology, York University and University of Victoria

mydata$typeF <- factor(mydata$type,
                             levels = c("bc", "prof","wc"), 
                             labels = c("blue collar", "professional","white collar"))

mydata$ID <- seq(nrow(mydata))

Removing NA values

library(tidyr)
mydata <- drop_na(mydata)

Summary of the dataset

summary(mydata[ , c(1,2,3,4,7)])

##    education          income          women           prestige    
##  Min.   : 6.380   Min.   : 1656   Min.   : 0.000   Min.   :17.30  
##  1st Qu.: 8.445   1st Qu.: 4250   1st Qu.: 3.268   1st Qu.:35.38  
##  Median :10.605   Median : 6036   Median :14.475   Median :43.60  
##  Mean   :10.795   Mean   : 6939   Mean   :28.986   Mean   :47.33  
##  3rd Qu.:12.755   3rd Qu.: 8226   3rd Qu.:52.203   3rd Qu.:59.90  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##           typeF   
##  blue collar :44  
##  professional:31  
##  white collar:23  
##                   
##                   
##

Explaining a few parameters

income mean: The average income in dollars in 1971 of the workers based on this sample is $6939
3rd quartile education: 75% of the workers had upto 12.75 years in education based on this sample
median prestige: If arranged in ascending order in terms of the Pineo-Porter prestige score for occupation the middle most job has a score of 43.6

Scaling the variables to be used for clustering

mydata$Edu_z <- scale(mydata$education)
mydata$women_z <- scale(mydata$women)
mydata$prestige_z <- scale(mydata$prestige)

Checking the correlation

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[ , c("Edu_z", "women_z", "prestige_z")]),
      type = "pearson")

##            Edu_z women_z prestige_z
## Edu_z       1.00    0.06       0.87
## women_z     0.06    1.00      -0.11
## prestige_z  0.87   -0.11       1.00
## 
## n= 98 
## 
## 
## P
##            Edu_z  women_z prestige_z
## Edu_z             0.5447  0.0000    
## women_z    0.5447         0.2803    
## prestige_z 0.0000 0.2803

Based on this there is a high positive correlation between the variables Edu_z and prestige_z, and almost no correlation between Edu_z and women_z and women_z and prestige_z

Checking the dissimilarity

mydata$Dissimilarity <- sqrt(mydata$Edu_z^2 + mydata$women_z^2 + mydata$prestige_z^2)
head(mydata[order(-mydata$Dissimilarity), ], 10)

##                       education income women prestige census type        typeF
## physicians                15.96  25308 10.56     87.2   3111 prof professional
## university.teachers       15.97  12480 19.59     84.6   2711 prof professional
## lawyers                   15.77  19263  5.13     82.3   2343 prof professional
## sewing.mach.operators      6.38   2847 90.67     28.2   8563   bc  blue collar
## physicists                15.64  11030  5.13     77.6   2113 prof professional
## architects                15.44  14163  2.69     78.1   2141 prof professional
## physio.therapsts          13.62   5092 82.66     72.1   3137 prof professional
## nurses                    12.46   4614 96.12     64.7   3131 prof professional
## launderers                 7.33   3000 69.31     20.8   6162   bc  blue collar
## veterinarians             15.94  14558  4.32     66.7   3115 prof professional
##                       ID      Edu_z    women_z prestige_z Dissimilarity
## physicians            24  1.8788713 -0.5871425   2.332417      3.052059
## university.teachers   21  1.8825091 -0.2993980   2.180325      2.896083
## lawyers               17  1.8097537 -0.7601715   2.045782      2.835188
## sewing.mach.operators 84 -1.6061128  1.9655935  -1.118904      2.774004
## physicists             6  1.7624626 -0.7601715   1.770846      2.611519
## architects             8  1.6897072 -0.8379230   1.800095      2.607214
## physio.therapsts      29  1.0276330  1.7103517   1.449113      2.466021
## nurses                27  0.6056517  2.1392598   1.016235      2.444582
## launderers            64 -1.2605246  1.2849489  -1.551781      2.376561
## veterinarians         25  1.8715958 -0.7859825   1.133229      2.324833

Based on this we can say none of the units have a high dissimilarity which warrants a removal

library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

distance <- get_dist(mydata[c("Edu_z", "women_z", "prestige_z")],
                     method = "euclidian") 

distance2 <- distance^2 
fviz_dist(distance2,
          gradient = list(low = "lightblue", mid = "pink", high = "purple"))

Checking how clusterdable the data is

get_clust_tendency(mydata[c("Edu_z", "women_z", "prestige_z")],
                   n = nrow(mydata) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.705796
## 
## $plot
## NULL

Since its above the 0.5 threshold we can say the data is clusterable

Making the dendrogram

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

WARD <- mydata[c("Edu_z", "women_z", "prestige_z")] %>%
           get_dist(method = "euclidian") %>%
           hclust(method = "ward.D2")

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 98

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

fviz_dend(WARD,
          k = 3,
          cex = 0.5,
          palette = "jama",
          color_labels_by_k = TRUE,
          rect = TRUE)

Based on the dendrogram we can say three clusters are appropriate for this dataset

Listing the IDs by cluster numbers

mydata$ClusterWard <- cutree(WARD,
                             k = 3)
head(mydata[c(8,13)])

##                     ID ClusterWard
## gov.administrators   1           1
## general.managers     2           1
## accountants          3           1
## purchasing.officers  4           1
## chemists             5           1
## physicists           6           1

Setting initial leaders

Initial_leaders <- aggregate(mydata[ , c("Edu_z", "women_z", "prestige_z")],
                             by = list(mydata$ClusterWard),
                             FUN = mean)
Initial_leaders

##   Group.1      Edu_z    women_z prestige_z
## 1       1  0.9599705 -0.1677223  1.0058522
## 2       2 -0.2381292  1.4360067 -0.5883361
## 3       3 -0.8851641 -0.5792432 -0.7491412

Clustering the data using the K Means method after knowing the suitable number of clusters using hierarchical method

K_MEANS <- hkmeans(mydata[c("Edu_z", "women_z", "prestige_z")], 
                   k = 3,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS

## Hierarchical K-means clustering with 3 clusters of sizes 28, 25, 45
## 
## Cluster means:
##         Edu_z    women_z prestige_z
## 1  1.21095067 -0.3974638  1.2142894
## 2 -0.00622133  1.4939348 -0.2709316
## 3 -0.75002412 -0.5826530 -0.6050403
## 
## Clustering vector:
##        gov.administrators          general.managers               accountants 
##                         1                         1                         1 
##       purchasing.officers                  chemists                physicists 
##                         1                         1                         1 
##                biologists                architects           civil.engineers 
##                         1                         1                         1 
##          mining.engineers                 surveyors               draughtsmen 
##                         1                         1                         1 
##       computer.programers                economists             psychologists 
##                         1                         1                         1 
##            social.workers                   lawyers                librarians 
##                         1                         1                         2 
##    vocational.counsellors                 ministers       university.teachers 
##                         1                         1                         1 
##   primary.school.teachers secondary.school.teachers                physicians 
##                         2                         1                         1 
##             veterinarians  osteopaths.chiropractors                    nurses 
##                         1                         1                         2 
##             nursing.aides          physio.therapsts               pharmacists 
##                         2                         2                         1 
##       medical.technicians        commercial.artists       radio.tv.announcers 
##                         2                         1                         1 
##               secretaries                   typists               bookkeepers 
##                         2                         2                         2 
##          tellers.cashiers        computer.operators           shipping.clerks 
##                         2                         2                         3 
##               file.clerks              receptionsts             mail.carriers 
##                         2                         2                         3 
##             postal.clerks       telephone.operators                collectors 
##                         2                         2                         2 
##           claim.adjustors             travel.clerks             office.clerks 
##                         2                         2                         2 
##         sales.supervisors     commercial.travellers              sales.clerks 
##                         3                         3                         2 
## service.station.attendant          insurance.agents      real.estate.salesmen 
##                         3                         3                         3 
##                    buyers              firefighters                 policemen 
##                         3                         3                         3 
##                     cooks                bartenders         funeral.directors 
##                         3                         3                         3 
##                launderers                  janitors        elevator.operators 
##                         2                         3                         3 
##              farm.workers      rotary.well.drillers                    bakers 
##                         3                         3                         3 
##            slaughterers.1            slaughterers.2                   canners 
##                         3                         3                         2 
##           textile.weavers         textile.labourers           tool.die.makers 
##                         3                         3                         3 
##                machinists       sheet.metal.workers                   welders 
##                         3                         3                         3 
##              auto.workers          aircraft.workers        electronic.workers 
##                         3                         3                         2 
##        radio.tv.repairmen     sewing.mach.operators            auto.repairmen 
##                         3                         2                         3 
##        aircraft.repairmen        railway.sectionmen        electrical.linemen 
##                         3                         3                         3 
##              electricians      construction.foremen                carpenters 
##                         3                         3                         3 
##                    masons            house.painters                  plumbers 
##                         3                         3                         3 
##    construction.labourers                    pilots           train.engineers 
##                         3                         1                         3 
##               bus.drivers              taxi.drivers              longshoremen 
##                         3                         3                         3 
##               typesetters               bookbinders 
##                         3                         2 
## 
## Within cluster sum of squares by cluster:
## [1] 22.87224 34.26621 32.39664
##  (between_SS / total_SS =  69.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

As we can see the ratio of between SS and total SS is 69.2% which means there is a sufficient difference between the clusters

Showing the clusters

fviz_cluster(K_MEANS, 
             palette = "jama", 
             repel = FALSE,
             ggtheme = theme_classic())

Showing cluster means

mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("Edu_z", "women_z", "prestige_z")])

##                         Edu_z    women_z prestige_z
## gov.administrators  0.8421067 -0.5680232  1.2560729
## general.managers    0.5328962 -0.7955421  1.2736220
## accountants         0.7184225 -0.4233544  0.9401893
## purchasing.officers 0.2273235 -0.6333473  0.5541094
## chemists            1.3914101 -0.5514532  1.5310086
## physicists          1.7624626 -0.7601715  1.7708461

Showing tables with the observations from the hierarchical method and the K Means method

table(mydata$ClusterWard)

## 
##  1  2  3 
## 40 20 38

table(mydata$ClusterK_Means)

## 
##  1  2  3 
## 28 25 45

table(mydata$ClusterWard, mydata$ClusterK_Means)

##    
##      1  2  3
##   1 28  5  7
##   2  0 20  0
##   3  0  0 38

Interpretation of the second row and the second column: The number of observations in the second cluster as per hierarchical method was 20 but it received 5 observations from Cluster 1 as a part of reclassification in the K Means method and now it has 25 observations

Showing the centroids

Centroids <- K_MEANS$centers
Centroids

##         Edu_z    women_z prestige_z
## 1  1.21095067 -0.3974638  1.2142894
## 2 -0.00622133  1.4939348 -0.2709316
## 3 -0.75002412 -0.5826530 -0.6050403

Showing the cluster positions in terms of the cluster variables

library(ggplot2)
library(tidyr)

Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("Edu_z", "women_z", "prestige_z"))

Figure$Groups <- factor(Figure$id, 
                        levels = c(1, 2, 3), 
                        labels = c("1", "2", "3"))

Figure$nameFactor <- factor(Figure$name, 
                            levels = c("Edu_z", "women_z", "prestige_z"), 
                            labels = c("Edu_z", "women_z", "prestige_z"))

ggplot(Figure, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Groups, col = Groups), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  ylim(-1.5, 1.5)

#### Checking cluster variable fit

Response 1> H0: Mean(Edu_z Cluster 1) = Mean(Edu_z Cluster 2) = Mean(Edu_z Cluster 3)

H1: At least one mean is different

We reject H0(p<0.001) and can say at least one mean is significantly different

Response 2> H0: Mean(women_z Cluster 1) = Mean(women_z Cluster 2) = Mean(women_z Cluster 3)

H1: At least one mean is different

We reject H0(p<0.001) and can say at least one mean is significantly different

Response 1> H0: Mean(prestige_z Cluster 1) = Mean(prestige_z Cluster 2) = Mean(prestige_z Cluster 3)

H1: At least one mean is different

We reject H0(p<0.001) and can say at least one mean is significantly different

fit <- aov(cbind(Edu_z, women_z, prestige_z) ~ as.factor(ClusterK_Means), 
                 data = mydata)

summary(fit)

##  Response 1 :
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)  2 66.374  33.187  102.95 < 2.2e-16 ***
## Residuals                 95 30.626   0.322                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)  2 75.496  37.748  166.76 < 2.2e-16 ***
## Residuals                 95 21.504   0.226                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)  2 59.594 29.7972  75.677 < 2.2e-16 ***
## Residuals                 95 37.406  0.3937                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Validation:

Using Income

aggregate(mydata$income, 
          by = list(mydata$ClusterK_Means), 
          FUN = "mean")

##   Group.1         x
## 1       1 11194.250
## 2       2  3971.000
## 3       3  5939.867

fit <- aov(income ~ as.factor(ClusterK_Means), 
           data = mydata)

summary(fit)

##                           Df    Sum Sq   Mean Sq F value   Pr(>F)    
## as.factor(ClusterK_Means)  2 772147904 386073952   38.13 6.98e-13 ***
## Residuals                 95 961931258  10125592                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: Mean(income Cluster 1) = Mean(income Cluster 2) = Mean(income Cluster 3)

H1: At least one mean is different

We reject H0(p<0.001) and can say at least one mean is significantly different and as a result, we are able to validate our analysis using income

Using Education

aggregate(mydata$education, 
          by = list(mydata$ClusterK_Means), 
          FUN = "mean")

##   Group.1         x
## 1       1 14.123929
## 2       2 10.778000
## 3       3  8.733333

fit <- aov(education ~ as.factor(ClusterK_Means), 
           data = mydata)

summary(fit)

##                           Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(ClusterK_Means)  2  501.6  250.78   102.9 <2e-16 ***
## Residuals                 95  231.4    2.44                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: Mean(education Cluster 1) = Mean(education Cluster 2) = Mean(education Cluster 3)

H1: At least one mean is different

We reject H0(p<0.001) and can say at least one mean is significantly different and as a result, we are able to validate our analysis using education

Using type of job

chisq_results <- chisq.test(mydata$typeF, as.factor(mydata$ClusterK_Means))

chisq_results

## 
##  Pearson's Chi-squared test
## 
## data:  mydata$typeF and as.factor(mydata$ClusterK_Means)
## X-squared = 104.15, df = 4, p-value < 2.2e-16

H0: There is no association between the type of job and cluster

H1: There is an association

We reject H0(p<0.001) and can say there is an association betweem the type of job and cluster and as a result, we are able to validate our analysis using job type

round(chisq_results$expected, 2)

##               
## mydata$typeF       1     2     3
##   blue collar  12.57 11.22 20.20
##   professional  8.86  7.91 14.23
##   white collar  6.57  5.87 10.56

round(chisq_results$observed, 2)

##               
## mydata$typeF    1  2  3
##   blue collar   0  6 38
##   professional 27  4  0
##   white collar  1 15  7

round(chisq_results$res, 2)

##               
## mydata$typeF       1     2     3
##   blue collar  -3.55 -1.56  3.96
##   professional  6.10 -1.39 -3.77
##   white collar -2.17  3.77 -1.10

The number of professional jobs in Cluster 1 is significantly more than expected and the number of white and blue collar jobs is significantly less than expected

The number of professional jobs in Cluster 3 is significantly less than expected and the number of blue collar jobs is significantly more than expected

Conclusion and Answer to the question

We can describe the groups as follows:

Group 1: It has 28.5%(28/98) of the jobs analysed by us. It has an above average number of the education years of the employees compared to the other clusters while having a below average proportion of women working in the jobs belonging to this cluster compared to the other clusters and has an above average prestige score compared to the other clusters. It has the highest average income and the highest average education years of the employees while having the most professional jobs
Group 2(the glass ceiling effect): It has 25.5%(25/98) of the jobs analysed by us. It has an exactly average number of the education years of the employees compared to the other clusters while having a significantly above average proportion of women working in the jobs belonging to this cluster compared to the other clusters and has a slightly below average prestige score compared to the other clusters. It has the lowest average income and the second highest average education years of the employees while having the most jobs in the white collar category
Group 3: It has 45.9%(45/98) of the jobs analysed by us. It has a below average number of the education years of the employees compared to the other clusters while having a significantly below average proportion of women working in the jobs belonging to this cluster compared to the other clusters and has a significantly below average prestige score compared to the other clusters. It has the second highest average income and the lowest average education years of the employees while having the most jobs in the blue collar category

The glass ceiling effect

There is a really important observation while analysing Group 2 and 3> While Group 2 has a higher number of education years and the prestige score compared with Group 3, it still has the lowest income and a reason for this could be the glass ceiling> because this cluster has jobs with the highest proportion of women, the average income is the lowest compared with other clusters. This is in confirmation with many researches which show women are paid lesser even while being more educated and working in jobs with a higher prestige than men

Clustering Assignment

Ashish Bhambhaney

2024-02-01

Research Question

Description of the data

Removing NA values

Summary of the dataset

Explaining a few parameters

Scaling the variables to be used for clustering

Checking the correlation

Checking the dissimilarity

Checking how clusterdable the data is

Making the dendrogram

Listing the IDs by cluster numbers

Setting initial leaders

Clustering the data using the K Means method after knowing the suitable number of clusters using hierarchical method

Showing the clusters

Showing cluster means

Showing tables with the observations from the hierarchical method and the K Means method

Showing the centroids

Showing the cluster positions in terms of the cluster variables

Validation:

Conclusion and Answer to the question

The glass ceiling effect