Erdene Enkh, 4 February 2024

IMB; Multivariate Analysis;

Homework 4

Education related topic clustering

Research question: How can we group US states based on education-related statistics using the clustering method? In other words, how can we group US states based on high school students’ SAT math scores, SAT verbal scores, and the average teacher’s salary?

Variables Involved:

Region: (categorical) U. S. Census regions.

Pop: (numeric) Population: in 1,000s.

SATV: (numeric) Average score of graduating high-school students in the state on the verbal component of the Scholastic Aptitude Test (a standard university admission exam).

SATM: (numeric) Average score of graduating high-school students in the state on the math component of the Scholastic Aptitude Test.

Percent: (numeric) Percentage of graduating high-school students in the state who took the SAT exam.

Dollars: (numeric) State spending on public education, in $1000s per student.

Pay: (numeric) Average teacher’s salary in the state, in $1000s.

Unit of observation: One United States state.

Sample size: 51

Source of the dataset: United States (1992) Statistical Abstract of the United States. Bureau of the Census.

library(carData)
mydata <- force(States)

# Display the first few rows of the dataset
head(mydata, 6)

##    region   pop SATV SATM percent dollars pay
## AL    ESC  4041  470  514       8   3.648  27
## AK    PAC   550  438  476      42   7.887  43
## AZ    MTN  3665  445  497      25   4.231  30
## AR    WSC  2351  470  511       6   3.334  23
## CA    PAC 29760  419  484      45   4.826  39
## CO    MTN  3294  456  513      28   4.809  31

Let’s do the factoring of categorical variable “Region”.

mydata$regionF <- factor(mydata$region,
                             levels = c("ENC", "ESC","MA","MTN","NE","NE","SA","WNC","WSC","PAC"), 
                             labels = c("East North Central", "East South Central","Mid-Atlantic","Mountain","New England","Pacific","South Atlantic","West North Central","West South Central","Pacific"))

#Adding ID in the first column, so the units of observation have their IDs. 
mydata$ID <- seq(1, nrow(mydata))

Let’s do the summary of the data.

summary(mydata[ , c(1,2,3,4,7)])

##      region        pop             SATV            SATM            pay       
##  SA     : 9   Min.   :  454   Min.   :397.0   Min.   :437.0   Min.   :22.00  
##  MTN    : 8   1st Qu.: 1215   1st Qu.:422.5   1st Qu.:470.0   1st Qu.:27.50  
##  WNC    : 7   Median : 3294   Median :443.0   Median :490.0   Median :30.00  
##  NE     : 6   Mean   : 4877   Mean   :448.2   Mean   :497.4   Mean   :30.94  
##  ENC    : 5   3rd Qu.: 5780   3rd Qu.:474.5   3rd Qu.:522.5   3rd Qu.:33.50  
##  PAC    : 5   Max.   :29760   Max.   :511.0   Max.   :577.0   Max.   :43.00  
##  (Other):11

Let’s explain some of the numbers.

The state with the minimum population has 454 population, and the maximum population is 29760. The mean population is 4877.

The mean SAT math score is 497,4, while the mean SAT verbal score is 448,2. Apparently high school students or better at math in average or, maybe they just study math harder because they think it is more useful.

Let’s do the scaling.

mydata$SATV_z <- scale(mydata$SATV)
mydata$SATM_z <- scale(mydata$SATM)
mydata$pay_z <- scale(mydata$pay)

Let’s check the correlation.

library(Hmisc)
rcorr(as.matrix(mydata[ , c("SATV_z", "SATM_z", "pay_z")]),
      type = "pearson")

##        SATV_z SATM_z pay_z
## SATV_z   1.00   0.96 -0.56
## SATM_z   0.96   1.00 -0.49
## pay_z   -0.56  -0.49  1.00
## 
## n= 51 
## 
## 
## P
##        SATV_z SATM_z pay_z
## SATV_z        0e+00  0e+00
## SATM_z 0e+00         3e-04
## pay_z  0e+00  3e-04

There is a strong positive correlations between SAT Verbal Scores and SAT Math Scores with a coefficient of 0,96. There is a moderate negative correlation between SAT Verbal Scores and average teacher’s salary with a coefficient of -0.56. It means that the less the teachers are paid, the better the SAT Verbal Scores. The correlation between SAT Math Scores and average teacher’s salary is also negative but weaker at -0.49.

The p-values for all correlations are less than 0,001, which means that these correlations are statistically significant.

Now let’s check the dissimilarity.

mydata$Dissimilarity <- sqrt(mydata$SATM_z^2 + mydata$SATV_z^2 + mydata$pay_z^2)
head(mydata[order(-mydata$Dissimilarity), ], 10)

##    region   pop SATV SATM percent dollars pay            regionF ID     SATV_z
## IA    WNC  2777  511  577       5   4.839  28 West North Central 16  2.0389705
## ND    WNC   639  505  564       6   3.685  23 West North Central 35  1.8442981
## SD    WNC   696  506  555       5   3.730  22 West North Central 42  1.8767435
## DC     SA   607  409  441      68   8.210  39     South Atlantic  9 -1.2704599
## NY     MA 17990  412  470      70   8.500  42       Mid-Atlantic 33 -1.1731237
## SC     SA  3487  397  437      54   4.327  28     South Atlantic 41 -1.6598047
## CN     NE  3287  430  471      74   7.914  43        New England  7 -0.5891066
## AK    PAC   550  438  476      42   7.887  43            Pacific  2 -0.3295434
## NC     SA  6629  401  440      55   4.802  29     South Atlantic 34 -1.5300231
## GA     SA  6478  401  443      57   4.860  29     South Atlantic 11 -1.5300231
##        SATM_z      pay_z Dissimilarity
## IA  2.3028801 -0.5540868      3.125327
## ND  1.9268187 -1.4960343      3.058134
## SD  1.6664684 -1.6844238      3.022675
## DC -1.6313013  1.5181978      2.565178
## NY -0.7923950  2.0833663      2.518834
## SC -1.7470125 -0.5540868      2.472654
## CN -0.7634672  2.2717558      2.467955
## AK -0.6188281  2.2717558      2.377482
## NC -1.6602291 -0.3656973      2.287152
## GA -1.5734457 -0.3656973      2.224958

We see that none of the units being observed have much dissimilarities, so they do don’t have to be removed from our analysis.

Let’s make the visualization.

library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

distance <- get_dist(mydata[c("SATV_z", "SATM_z", "pay_z")],
                     method = "euclidian") 
distance2 <- distance^2 
fviz_dist(distance2,
          gradient = list(low = "red", mid = "lightblue", high = "green"))

Let’s check the clusterability of the data.

get_clust_tendency(mydata[c("SATV_z", "SATM_z", "pay_z")],
                   n = nrow(mydata) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.7259428
## 
## $plot
## NULL

Hopkins Statistic is 0,73 which is above 0,5. It means that the data can be clustered.

Now let’s build the dendrogram.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

WARD <- mydata[c("SATV_z", "SATM_z", "pay_z")] %>%
           get_dist(method = "euclidian") %>%
           hclust(method = "ward.D2")

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 51

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Let’s cut the dendrogram into 4 groups, because if we cut it into three, it will be too close to the lines.

fviz_dend(WARD,
          k = 4,
          cex = 0.5,
          palette = "jama",
          color_labels_by_k = TRUE,
          rect = TRUE)

We see that the four group clustering is perfect for the dataset.

mydata$ClusterWard <- cutree(WARD,
                             k = 4)
head(mydata[c(1,14)])

##    region ClusterWard
## AL    ESC           1
## AK    PAC           2
## AZ    MTN           3
## AR    WSC           1
## CA    PAC           2
## CO    MTN           3

Let’s set up the initial leaders.

Initial_leaders <- aggregate(mydata[ , c("SATV_z", "SATM_z", "pay_z")],
                             by = list(mydata$ClusterWard),
                             FUN = mean)
Initial_leaders

##   Group.1     SATV_z     SATM_z      pay_z
## 1       1  1.1095053  0.9824109 -0.9641110
## 2       2 -0.8014765 -0.8186930  1.4668188
## 3       3  0.1271879  0.2712582  0.1704882
## 4       4 -1.1698792 -1.1221719 -0.1961467

Now let’s use the K-means method.

K_MEANS <- hkmeans(mydata[c("SATV_z", "SATM_z", "pay_z")], 
                   k = 4,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS

## Hierarchical K-means clustering with 4 clusters of sizes 19, 11, 11, 10
## 
## Cluster means:
##       SATV_z      SATM_z      pay_z
## 1  1.0895157  1.01635409 -0.8217982
## 2 -0.8014765 -0.81869297  1.4668188
## 3 -0.0168878  0.08332858  0.1309660
## 4 -1.1698792 -1.12217194 -0.1961467
## 
## Clustering vector:
## AL AK AZ AR CA CO CN DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO 
##  1  2  3  1  2  3  2  2  2  4  4  4  1  3  4  1  1  1  1  4  2  2  3  1  1  1 
## MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY 
##  1  1  3  3  2  1  2  4  1  3  1  3  2  2  4  1  1  4  1  4  4  3  3  1  3 
## 
## Within cluster sum of squares by cluster:
## [1] 15.109951  4.761381  6.299115  3.779698
##  (between_SS / total_SS =  80.0 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

The ratio between_SS / total_SS is 80%, so the clusters have sufficient difference.

Let’s do the visualization of our four clusters.

library(factoextra)
fviz_cluster(K_MEANS, 
             palette = "jama", 
             repel = FALSE,
             ggtheme = theme_classic())

Let’s do the cluster means.

mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("SATV_z", "SATM_z", "pay_z")])

##        SATV_z      SATM_z       pay_z
## AL  0.7087092  0.48042844 -0.74247628
## AK -0.3295434 -0.61882814  2.27175579
## AZ -0.1024257 -0.01134424 -0.17730777
## AR  0.7087092  0.39364503 -1.49603430
## CA -0.9460060 -0.38740570  1.51819777
## CO  0.2544737  0.45150064  0.01108174

Let’s display the tables that contain observations generated from both the hierarchical and K Means methods.

table(mydata$ClusterWard)

## 
##  1  2  3  4 
## 17 11 13 10

table(mydata$ClusterK_Means)

## 
##  1  2  3  4 
## 19 11 11 10

table(mydata$ClusterWard, mydata$ClusterK_Means)

##    
##      1  2  3  4
##   1 17  0  0  0
##   2  0 11  0  0
##   3  2  0 11  0
##   4  0  0  0 10

Let’s display the centroids.

Centroids <- K_MEANS$centers
Centroids

##       SATV_z      SATM_z      pay_z
## 1  1.0895157  1.01635409 -0.8217982
## 2 -0.8014765 -0.81869297  1.4668188
## 3 -0.0168878  0.08332858  0.1309660
## 4 -1.1698792 -1.12217194 -0.1961467

Let’s illustrate the positions of the clusters relative to the cluster variables.

library(ggplot2)
library(tidyr)

Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("SATV_z", "SATM_z", "pay_z"))

Figure$Groups <- factor(Figure$id, 
                        levels = c(1, 2, 3, 4), 
                        labels = c("1", "2", "3", "4"))

Figure$nameFactor <- factor(Figure$name, 
                            levels = c("SATV_z", "SATM_z", "pay_z"), 
                            labels = c("SATV_z", "SATM_z", "pay_z"))

ggplot(Figure, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Groups, col = Groups), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  ylim(-4, 4)

Group 1: The high school students from the states which are included in group 1 have above average (the best) SAT verbal and maths scores, but their teachers have the lowest salary of all groups.

Group 2: This is a pretty average group.

Group 3: This group’s teachers have the highest pay, but the students’ scores are below average.

Group 4: This is the group with the worst indicators, the SAT scores are the lowest. However, their teachers’ salary is still below average but higher than group 1 teachers.

By looking at this figure, we can get a conclusion that the less you pay the teachers, the better the students will study.

Now let’s put units into groups by using ANOVA method. Hypothesis: H0: the means in all groups are same. H1: At least one differs.

fit <- aov(pop ~ as.factor(ClusterK_Means), 
           data = mydata)

summary(fit)

##                           Df    Sum Sq  Mean Sq F value Pr(>F)  
## as.factor(ClusterK_Means)  3 1.863e+08 62089622   2.257  0.094 .
## Residuals                 47 1.293e+09 27510158                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fit <- aov(percent ~ as.factor(ClusterK_Means), 
           data = mydata)

summary(fit)

##                           Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(ClusterK_Means)  3  24099    8033   77.39 <2e-16 ***
## Residuals                 47   4879     104                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s do validation by the variable Percent.

aggregate(mydata$percent, 
          by = list(mydata$ClusterK_Means), 
          FUN = "mean")

##   Group.1         x
## 1       1  9.736842
## 2       2 62.090909
## 3       3 28.636364
## 4       4 53.800000

We see that in the states which are included in Group 1, only around 10% of the students take the SAT exam. If we consider that the teachers from the Group 1 has the lowest salary, maybe this states are very poor and most of the high school students don’t care about college and only 10% of them take the SAT which eventually get the best scores in the whole country.

Group 2 states has around 62,1% of attendance of SAT exam. And the students of these states are average.

In group 3 around 29% of the students take SAT, and considering the teachers are the best payed, I think these states have some rich population, and maybe the students party, they have no motivation and the parents’ persuade around one third of them to at least take SAT and they get the below average score in the exams. The teachers are well payed so they don’t care that the students are doing anything else but studying. It is just my guess.

In group 4 around half of the high schoolers take SAT (53,8%), sadly they have the lowest scores in average.

Hw4

2024-02-01

Erdene Enkh, 4 February 2024

IMB; Multivariate Analysis;

Homework 4