Research question: How can we group US states based on education-related statistics using the clustering method? In other words, how can we group US states based on high school students’ SAT math scores, SAT verbal scores, and the average teacher’s salary?
Variables Involved:
Region: (categorical) U. S. Census regions.
Pop: (numeric) Population: in 1,000s.
SATV: (numeric) Average score of graduating high-school students in the state on the verbal component of the Scholastic Aptitude Test (a standard university admission exam).
SATM: (numeric) Average score of graduating high-school students in the state on the math component of the Scholastic Aptitude Test.
Percent: (numeric) Percentage of graduating high-school students in the state who took the SAT exam.
Dollars: (numeric) State spending on public education, in $1000s per student.
Pay: (numeric) Average teacher’s salary in the state, in $1000s.
Unit of observation: One United States state.
Sample size: 51
Source of the dataset: United States (1992) Statistical Abstract of the United States. Bureau of the Census.
library(carData)
mydata <- force(States)
# Display the first few rows of the dataset
head(mydata, 6)
## region pop SATV SATM percent dollars pay
## AL ESC 4041 470 514 8 3.648 27
## AK PAC 550 438 476 42 7.887 43
## AZ MTN 3665 445 497 25 4.231 30
## AR WSC 2351 470 511 6 3.334 23
## CA PAC 29760 419 484 45 4.826 39
## CO MTN 3294 456 513 28 4.809 31
Let’s do the factoring of categorical variable “Region”.
mydata$regionF <- factor(mydata$region,
levels = c("ENC", "ESC","MA","MTN","NE","NE","SA","WNC","WSC","PAC"),
labels = c("East North Central", "East South Central","Mid-Atlantic","Mountain","New England","Pacific","South Atlantic","West North Central","West South Central","Pacific"))
#Adding ID in the first column, so the units of observation have their IDs.
mydata$ID <- seq(1, nrow(mydata))
Let’s do the summary of the data.
summary(mydata[ , c(1,2,3,4,7)])
## region pop SATV SATM pay
## SA : 9 Min. : 454 Min. :397.0 Min. :437.0 Min. :22.00
## MTN : 8 1st Qu.: 1215 1st Qu.:422.5 1st Qu.:470.0 1st Qu.:27.50
## WNC : 7 Median : 3294 Median :443.0 Median :490.0 Median :30.00
## NE : 6 Mean : 4877 Mean :448.2 Mean :497.4 Mean :30.94
## ENC : 5 3rd Qu.: 5780 3rd Qu.:474.5 3rd Qu.:522.5 3rd Qu.:33.50
## PAC : 5 Max. :29760 Max. :511.0 Max. :577.0 Max. :43.00
## (Other):11
Let’s explain some of the numbers.
The state with the minimum population has 454 population, and the maximum population is 29760. The mean population is 4877.
The mean SAT math score is 497,4, while the mean SAT verbal score is 448,2. Apparently high school students or better at math in average or, maybe they just study math harder because they think it is more useful.
Let’s do the scaling.
mydata$SATV_z <- scale(mydata$SATV)
mydata$SATM_z <- scale(mydata$SATM)
mydata$pay_z <- scale(mydata$pay)
Let’s check the correlation.
library(Hmisc)
rcorr(as.matrix(mydata[ , c("SATV_z", "SATM_z", "pay_z")]),
type = "pearson")
## SATV_z SATM_z pay_z
## SATV_z 1.00 0.96 -0.56
## SATM_z 0.96 1.00 -0.49
## pay_z -0.56 -0.49 1.00
##
## n= 51
##
##
## P
## SATV_z SATM_z pay_z
## SATV_z 0e+00 0e+00
## SATM_z 0e+00 3e-04
## pay_z 0e+00 3e-04
There is a strong positive correlations between SAT Verbal Scores and SAT Math Scores with a coefficient of 0,96. There is a moderate negative correlation between SAT Verbal Scores and average teacher’s salary with a coefficient of -0.56. It means that the less the teachers are paid, the better the SAT Verbal Scores. The correlation between SAT Math Scores and average teacher’s salary is also negative but weaker at -0.49.
The p-values for all correlations are less than 0,001, which means that these correlations are statistically significant.
Now let’s check the dissimilarity.
mydata$Dissimilarity <- sqrt(mydata$SATM_z^2 + mydata$SATV_z^2 + mydata$pay_z^2)
head(mydata[order(-mydata$Dissimilarity), ], 10)
## region pop SATV SATM percent dollars pay regionF ID SATV_z
## IA WNC 2777 511 577 5 4.839 28 West North Central 16 2.0389705
## ND WNC 639 505 564 6 3.685 23 West North Central 35 1.8442981
## SD WNC 696 506 555 5 3.730 22 West North Central 42 1.8767435
## DC SA 607 409 441 68 8.210 39 South Atlantic 9 -1.2704599
## NY MA 17990 412 470 70 8.500 42 Mid-Atlantic 33 -1.1731237
## SC SA 3487 397 437 54 4.327 28 South Atlantic 41 -1.6598047
## CN NE 3287 430 471 74 7.914 43 New England 7 -0.5891066
## AK PAC 550 438 476 42 7.887 43 Pacific 2 -0.3295434
## NC SA 6629 401 440 55 4.802 29 South Atlantic 34 -1.5300231
## GA SA 6478 401 443 57 4.860 29 South Atlantic 11 -1.5300231
## SATM_z pay_z Dissimilarity
## IA 2.3028801 -0.5540868 3.125327
## ND 1.9268187 -1.4960343 3.058134
## SD 1.6664684 -1.6844238 3.022675
## DC -1.6313013 1.5181978 2.565178
## NY -0.7923950 2.0833663 2.518834
## SC -1.7470125 -0.5540868 2.472654
## CN -0.7634672 2.2717558 2.467955
## AK -0.6188281 2.2717558 2.377482
## NC -1.6602291 -0.3656973 2.287152
## GA -1.5734457 -0.3656973 2.224958
We see that none of the units being observed have much dissimilarities, so they do don’t have to be removed from our analysis.
Let’s make the visualization.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
distance <- get_dist(mydata[c("SATV_z", "SATM_z", "pay_z")],
method = "euclidian")
distance2 <- distance^2
fviz_dist(distance2,
gradient = list(low = "red", mid = "lightblue", high = "green"))
Let’s check the clusterability of the data.
get_clust_tendency(mydata[c("SATV_z", "SATM_z", "pay_z")],
n = nrow(mydata) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7259428
##
## $plot
## NULL
Hopkins Statistic is 0,73 which is above 0,5. It means that the data can be clustered.
Now let’s build the dendrogram.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
WARD <- mydata[c("SATV_z", "SATM_z", "pay_z")] %>%
get_dist(method = "euclidian") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 51
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Let’s cut the dendrogram into 4 groups, because if we cut it into three,
it will be too close to the lines.
fviz_dend(WARD,
k = 4,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
We see that the four group clustering is perfect for the dataset.
mydata$ClusterWard <- cutree(WARD,
k = 4)
head(mydata[c(1,14)])
## region ClusterWard
## AL ESC 1
## AK PAC 2
## AZ MTN 3
## AR WSC 1
## CA PAC 2
## CO MTN 3
Let’s set up the initial leaders.
Initial_leaders <- aggregate(mydata[ , c("SATV_z", "SATM_z", "pay_z")],
by = list(mydata$ClusterWard),
FUN = mean)
Initial_leaders
## Group.1 SATV_z SATM_z pay_z
## 1 1 1.1095053 0.9824109 -0.9641110
## 2 2 -0.8014765 -0.8186930 1.4668188
## 3 3 0.1271879 0.2712582 0.1704882
## 4 4 -1.1698792 -1.1221719 -0.1961467
Now let’s use the K-means method.
K_MEANS <- hkmeans(mydata[c("SATV_z", "SATM_z", "pay_z")],
k = 4,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 4 clusters of sizes 19, 11, 11, 10
##
## Cluster means:
## SATV_z SATM_z pay_z
## 1 1.0895157 1.01635409 -0.8217982
## 2 -0.8014765 -0.81869297 1.4668188
## 3 -0.0168878 0.08332858 0.1309660
## 4 -1.1698792 -1.12217194 -0.1961467
##
## Clustering vector:
## AL AK AZ AR CA CO CN DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO
## 1 2 3 1 2 3 2 2 2 4 4 4 1 3 4 1 1 1 1 4 2 2 3 1 1 1
## MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY
## 1 1 3 3 2 1 2 4 1 3 1 3 2 2 4 1 1 4 1 4 4 3 3 1 3
##
## Within cluster sum of squares by cluster:
## [1] 15.109951 4.761381 6.299115 3.779698
## (between_SS / total_SS = 80.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
The ratio between_SS / total_SS is 80%, so the clusters have sufficient difference.
Let’s do the visualization of our four clusters.
library(factoextra)
fviz_cluster(K_MEANS,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
Let’s do the cluster means.
mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("SATV_z", "SATM_z", "pay_z")])
## SATV_z SATM_z pay_z
## AL 0.7087092 0.48042844 -0.74247628
## AK -0.3295434 -0.61882814 2.27175579
## AZ -0.1024257 -0.01134424 -0.17730777
## AR 0.7087092 0.39364503 -1.49603430
## CA -0.9460060 -0.38740570 1.51819777
## CO 0.2544737 0.45150064 0.01108174
Let’s display the tables that contain observations generated from both the hierarchical and K Means methods.
table(mydata$ClusterWard)
##
## 1 2 3 4
## 17 11 13 10
table(mydata$ClusterK_Means)
##
## 1 2 3 4
## 19 11 11 10
table(mydata$ClusterWard, mydata$ClusterK_Means)
##
## 1 2 3 4
## 1 17 0 0 0
## 2 0 11 0 0
## 3 2 0 11 0
## 4 0 0 0 10
Let’s display the centroids.
Centroids <- K_MEANS$centers
Centroids
## SATV_z SATM_z pay_z
## 1 1.0895157 1.01635409 -0.8217982
## 2 -0.8014765 -0.81869297 1.4668188
## 3 -0.0168878 0.08332858 0.1309660
## 4 -1.1698792 -1.12217194 -0.1961467
Let’s illustrate the positions of the clusters relative to the cluster variables.
library(ggplot2)
library(tidyr)
Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("SATV_z", "SATM_z", "pay_z"))
Figure$Groups <- factor(Figure$id,
levels = c(1, 2, 3, 4),
labels = c("1", "2", "3", "4"))
Figure$nameFactor <- factor(Figure$name,
levels = c("SATV_z", "SATM_z", "pay_z"),
labels = c("SATV_z", "SATM_z", "pay_z"))
ggplot(Figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Groups, col = Groups), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
ylim(-4, 4)
Group 1: The high school students from the states which are included in group 1 have above average (the best) SAT verbal and maths scores, but their teachers have the lowest salary of all groups.
Group 2: This is a pretty average group.
Group 3: This group’s teachers have the highest pay, but the students’ scores are below average.
Group 4: This is the group with the worst indicators, the SAT scores are the lowest. However, their teachers’ salary is still below average but higher than group 1 teachers.
By looking at this figure, we can get a conclusion that the less you pay the teachers, the better the students will study.
Now let’s put units into groups by using ANOVA method. Hypothesis: H0: the means in all groups are same. H1: At least one differs.
fit <- aov(pop ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 3 1.863e+08 62089622 2.257 0.094 .
## Residuals 47 1.293e+09 27510158
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit <- aov(percent ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 3 24099 8033 77.39 <2e-16 ***
## Residuals 47 4879 104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s do validation by the variable Percent.
aggregate(mydata$percent,
by = list(mydata$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 9.736842
## 2 2 62.090909
## 3 3 28.636364
## 4 4 53.800000
We see that in the states which are included in Group 1, only around 10% of the students take the SAT exam. If we consider that the teachers from the Group 1 has the lowest salary, maybe this states are very poor and most of the high school students don’t care about college and only 10% of them take the SAT which eventually get the best scores in the whole country.
Group 2 states has around 62,1% of attendance of SAT exam. And the students of these states are average.
In group 3 around 29% of the students take SAT, and considering the teachers are the best payed, I think these states have some rich population, and maybe the students party, they have no motivation and the parents’ persuade around one third of them to at least take SAT and they get the below average score in the exams. The teachers are well payed so they don’t care that the students are doing anything else but studying. It is just my guess.
In group 4 around half of the high schoolers take SAT (53,8%), sadly they have the lowest scores in average.