In this homework, we will cluster data from a data set available in R (sat.act from the psych package).
Let’s begin by loading and describing the data set as well as some descriptive statistics.
#data()
#data(package = .packages(all.available = TRUE))
library(psych)
## Warning: package 'psych' was built under R version 4.3.2
mydata <- force(sat.act)
colnames(mydata) <- c("Gender", "Education", "Age", "ACT", "SATV", "SATQ") #changing column names to be capitalized because it bothers me
head(mydata, 6)
## Gender Education Age ACT SATV SATQ
## 29442 2 3 19 24 500 500
## 29457 2 3 23 35 600 500
## 29498 2 3 20 21 480 470
## 29503 1 4 27 26 550 520
## 29504 1 2 33 31 600 550
## 29518 1 5 26 28 640 640
Explanations of the data set:
Gender: A representation of a respective person’s gender - 1 for male and 2 for female.
Education: A representation of a respective person's self-reported level of education - 1 stands for high school and 5 stands for graduate work. The intermediate levels are not documented, so we will leave them as numbers; we know neither what they represent nor the precise intervals between them.
Age: A measurement of the respective person’s age.
ACT score: A measurement of the respective person’s composite American College Test (hereafter: ACT) score. The scores can range from 1-36; national norms have a mean of 20.
SATV score: A measurement of the respective person’s SAT verbal score. These scores can range from 200-800.
SATQ score: A measurement of the respective person’s SAT quantitative score. These scores can also range from 200-800.
The unit of observation is a subject; a person.
Our sample encompasses 700 units measured on the 6 variables delineated above (I did not count the row identifiers as a variable).
Explanations of the variables:
The first two variables measure a person's gender and self-reported level of education. Both are categorical: gender is a nominal variable and education is an ordinal variable (though we do not know what the specific levels represent beyond the scores of 1 and 5).
The third variable is age. It is a numerical variable that measures a person’s age.
The fourth variable measures a person’s ACT composite score on a scale from 1-36. It is a numerical variable.
The fifth and sixth variables measure a person’s SATV and SATQ scores. Both are numerical variables whose scores can range from 200-800.
Following is the data source citation:
Revelle, William, Wilt, Joshua, and Rosenthal, Allen (2009) Personality and Cognition: The Personality-Cognition Link. In Gruszka, Alexandra and Matthews, Gerald and Szymura, Blazej (Eds.) Handbook of Individual Differences in Cognition: Attention, Memory and Executive Control, Springer.
Before doing anything else, I will add ID as a variable in case I need to drop some units later. I will also put ID in the first column.
mydata$ID <- seq(1, nrow(mydata))
mydata <- mydata[c("ID", "Gender", "Education", "Age", "ACT", "SATV", "SATQ")]
head(mydata, 6)
## ID Gender Education Age ACT SATV SATQ
## 29442 1 2 3 19 24 500 500
## 29457 2 2 3 23 35 600 500
## 29498 3 2 3 20 21 480 470
## 29503 4 1 4 27 26 550 520
## 29504 5 1 2 33 31 600 550
## 29518 6 1 5 26 28 640 640
Since Education is a categorical variable, before factoring it I will check which of the 5 levels of self-reported education is the most common and use it as the reference category. To do that, let's check its distribution below.
table(mydata$Education)
##
## 0 1 2 3 4 5
## 57 45 44 275 138 141
As we can see, the most commonly selected level of education was "3". When factoring the variable, we will put "3" in the first place so that it is taken as the reference category; any model fitting would then create four dummy variables for the other four levels, each compared to "3" (illustrated with model.matrix after the factoring below). Some responses (57) are invalid (0 is not a valid score from 1 to 5); when we factor, those are converted to NA automatically, since 0 is not among the specified levels.
Later on in this assignment, I realized this was probably not necessary, but it did not affect the results, so we will leave it here.
Finally, I will factor the variables of Gender and Education.
mydata$Gender <- factor(mydata$Gender,
levels = c(1, 2),
labels = c("Male", "Female"))
mydata$Education <- factor(mydata$Education,
levels = c(3, 1, 2, 4, 5),
labels = c("Edu3", "Edu1", "Edu2", "Edu4", "Edu5"))
head(mydata)
## ID Gender Education Age ACT SATV SATQ
## 29442 1 Female Edu3 19 24 500 500
## 29457 2 Female Edu3 23 35 600 500
## 29498 3 Female Edu3 20 21 480 470
## 29503 4 Male Edu4 27 26 550 520
## 29504 5 Male Edu2 33 31 600 550
## 29518 6 Male Edu5 26 28 640 640
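As a quick illustration of the reference coding discussed above (a sketch of my own; no model is actually fit here), model.matrix shows the four dummy variables R would create, each contrasting a level against the Edu3 reference. Rows containing NA are dropped by the default na.action.
head(model.matrix(~ Education, data = mydata)) #dummy columns for Edu1, Edu2, Edu4, Edu5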
Factoring converted the invalid Education codes to NA; together with the missing SATQ scores already present in the data, we will now drop all rows containing NA values to avoid complications later on.
library(tidyr)
mydata <- drop_na(mydata)
With NA data dropped, we will now proceed with this clustering exercise.
Finally, let's display some descriptive statistics (for variables other than ID).
summary(mydata[ , -1])
## Gender Education Age ACT SATV
## Male :218 Edu3:269 Min. :17.00 Min. : 3.00 Min. :200.0
## Female:413 Edu1: 43 1st Qu.:19.00 1st Qu.:26.00 1st Qu.:550.0
## Edu2: 43 Median :23.00 Median :29.00 Median :620.0
## Edu4:137 Mean :26.41 Mean :28.64 Mean :611.8
## Edu5:139 3rd Qu.:30.00 3rd Qu.:32.00 3rd Qu.:700.0
## Max. :65.00 Max. :36.00 Max. :800.0
## SATQ
## Min. :200.0
## 1st Qu.:530.0
## Median :620.0
## Mean :609.3
## 3rd Qu.:700.0
## Max. :800.0
Explaining some descriptive statistics:
"Gender": We have 218 males and 413 females in our sample (after dropping NA values, our total sample size is 631 units),
Min. of "Age": the youngest person(s) in our sample is (are) 17 years old,
Mean of "ACT": the respondents in our sample scored 28.64 points on the ACT on average.
Finally, let’s define our research question: Can we group our units (people) based on their performance in ACT, SATV, and SATQ tests?
In clustering data, we want the units within groups to be as homogeneous as possible, while the groups themselves should remain as heterogeneous as possible between themselves.
For our cluster variables, we will stick to those that are numeric. This is not an ideal setup, but we work with the data set we have. As a rule of thumb, we would ideally look for about 5 cluster variables; I will stick to the 3 score variables (ACT score, verbal SAT score, and quantitative SAT score). I will use the "demographic" variables - gender, education, and age - to validate the clustering. We have no theoretical basis for the clustering, so I will use all three score variables and offer some reasoning: we should be able to cluster units by ACT, SATV, and SATQ scores, as they all in some way measure competency in certain topics. By the Gauss (normal) curve, there should theoretically be some who perform well in all, some who perform poorly in all, and some who are good at only some of them; but this remains to be seen. Let's then see if the Gauss curve holds for these tests.
To check whether our data is suitable for clustering, we will use the Hopkins statistic.
Let's first standardize our variables, because they are not on the same scale (ACT is measured from 1 to 36, while SATV and SATQ are measured from 200 to 800). We standardize them so that all variables contribute equally to the results. Standardized variables have an arithmetic mean of 0 and a standard deviation of 1; note that removing units afterwards would change the underlying means and standard deviations, so if we ever remove any units from the data, we must standardize again.
mydata$ACT_z <- scale(mydata$ACT)
mydata$SATV_z <- scale(mydata$SATV)
mydata$SATQ_z <- scale(mydata$SATQ)
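As a quick sanity check (a sketch of my own, not required by the assignment), scale() here is equivalent to subtracting the mean and dividing by the sample standard deviation:
all.equal(as.numeric(mydata$ACT_z),
          (mydata$ACT - mean(mydata$ACT)) / sd(mydata$ACT)) #should return TRUE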
We have added standardized variables to our data set. Now we should check that the cluster variables (the now-standardized variables we will use for clustering) are not too strongly correlated among themselves.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c("ACT_z", "SATV_z", "SATQ_z")]),
type = "pearson")
## ACT_z SATV_z SATQ_z
## ACT_z 1.00 0.55 0.58
## SATV_z 0.55 1.00 0.62
## SATQ_z 0.58 0.62 1.00
##
## n= 631
##
##
## P
## ACT_z SATV_z SATQ_z
## ACT_z 0 0
## SATV_z 0 0
## SATQ_z 0 0
We can see moderate correlations between the variables (0.55-0.62), but we will nevertheless keep them all as cluster variables. In our lecture exercises we kept even higher correlation values, and the data set is, as we've said, imperfect. Since the correlation is moderately strong between all pairs of cluster variables, there should not be issues; the method should still work.
But while the method will work, the problem with highly correlated cluster variables is that they "move together". We will essentially end up with rather "vanilla" cluster groups, where one group represents low results, one medium results, and one high results, because when one cluster variable increases, the other two generally increase as well (due to the correlation between all of them). There is not much we can do about this with our dataset. Performing PCA would not help much: because all three variables are mutually correlated, removing all correlations would leave us with a single principal component. Even if we extracted, say, a principal component from SATQ and SATV, that component would still correlate with ACT, and we would have solved nothing. With that in mind, we must keep in mind that our dataset is far from perfect for clustering; this exercise is meant to demonstrate understanding of the topic. Normally, we would cluster on different data (where only pairs of cluster variables were correlated, so that we could collapse each pair with PCA into a component uncorrelated with the remaining cluster variables). The logic throughout this exercise still holds and could be applied to a better-suited dataset; we keep this one because, although imperfectly suitable, it serves as a good illustration of the topic, as we will observe later.
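To illustrate the point (an exploratory sketch of my own; we do not use the components anywhere in the analysis), we can run a quick PCA on the three standardized scores and observe that the first component absorbs most of the shared variance, with all three variables loading on it in the same direction:
pca <- prcomp(mydata[c("ACT_z", "SATV_z", "SATQ_z")]) #variables are already standardized
summary(pca)   #proportion of variance captured by each component
pca$rotation   #loadings: PC1 mixes all three scores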
Next, we should check for outliers within our data. To do so, let's compute a measure of dissimilarity for each unit and display it in descending order below.
mydata$Dissimilarity <- sqrt(mydata$ACT_z^2 + mydata$SATV_z^2 + mydata$SATQ_z^2) #We don't have to subtract values from the mean, because the mean is 0, as the values are standardized. As such, we can simply apply the square root of the sum of their squares.
head(mydata[order(-mydata$Dissimilarity), c(1, 5, 6, 7, 11)], 10) #Showing the 10 units with the highest values of Dissimilarity
## ID ACT SATV SATQ Dissimilarity
## 403 440 3 200 400 6.752375
## 228 251 35 200 200 5.295688
## 297 324 17 300 300 4.578010
## 355 389 17 300 300 4.578010
## 202 220 21 245 350 4.290793
## 307 335 20 300 300 4.277905
## 337 370 35 350 230 4.263071
## 591 658 22 300 300 4.118902
## 75 79 15 375 375 4.096486
## 261 285 18 300 400 4.005519
Seeing as units with the ID values of 440 and 251 stand out the most from the others, we check them in our data to see what happened there.
The unit with ID 440 scored very low on the ACT (3 points out of 36) and on SATV (the minimum of 200 points out of 800), and rather poorly on the SATQ (400 out of 800 points). A low verbal SAT score is compatible with the other two, but it is somewhat odd to see such a low ACT score next to the (relatively) better SATQ score, since both tests cover mathematical material. Checking online what the respective tests measure: SATV measures reading and writing literacy; SATQ measures mathematics (levels 1 and 2); and the ACT measures English, math, reading, science, and writing.
As we can see, the (relatively) "high" SATQ score would indicate that the person is better at math than at other subjects. The low ACT score could then be justified by poor performance in English, reading, science, and writing (with math being the only stronger area). That would also coincide with the low SATV score, which measures reading and writing.
Let us not exclude this unit: we cannot claim the observation is erroneous, as such a person could easily exist in the population. The scores fall within valid ranges, and given these are formal tests, let's assume we trust their validity.
The same logic can be applied to the unit with ID of 251, who performed well in ACT (35 out of 36 points), but poorly in SATV and SATQ (200 out of 800 points in both) tests.
Since we said we want to observe whether the Gauss curve holds for performance in these tests, removing extremely poorly performing students would be counterintuitive.
As such, we will not exclude any units so far.
Next, let's observe Euclidean distances.
#install.packages("factoextra")
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
distance <- get_dist(mydata[c("ACT_z", "SATV_z", "SATQ_z")],
method = "euclidean") #Computing the matrix of Euclidean distances between units
distance2 <- distance^2 #Creating squared Euclidean distances
fviz_dist(distance2,
gradient = list(low = "yellow", mid = "lightgray", high = "black")) #Showing the dissimilarity matrix with oh-so-pretty colors that are more visible. Trust me, it looks terrible in any combination of colors - I tried a lot of them. This seemed like the best. It still gives me color blindness though.
As we can see above, the units along the diagonal really overlap with one another. But if we imagine three squares of roughly one third of the diagonal each, we can see that the groups are somewhat heterogeneous from one another in places. Let's look at the Hopkins statistic and see.
get_clust_tendency(mydata[c("ACT_z", "SATV_z", "SATQ_z")],
n = nrow(mydata) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8118829
##
## $plot
## NULL
We want the Hopkins value to be above 0.5 at the very least, as a rule-of-thumb cut-off. Seeing as our value is roughly 0.81, our data is clusterable according to the Hopkins statistic, but we will see just how well later on.
Let’s now compose a dendrogram and figure out how many groups (clusters) we should divide our units among.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
WARD <- mydata[c("ACT_z", "SATV_z", "SATQ_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 631
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We have 631 units in our data, so 630 (n-1) agglomeration steps would merge all units into one very heterogeneous group. To end up with, for example, units divided among 3 groups (clusters), we would have to stop two steps earlier; for that, we would need 628 steps.
Looking at the dendrogram above, we will divide data into 3 cluster groups. If we look at the y-axis - the measure of heterogeneity within groups, we would visually “cut” the dendrogram somewhere between the height values of approximately 13 and 20. As such, we would get three cluster groups.
The reasoning for dividing into 3 cluster groups is the following:
We do not want fewer than 3 groups. "Cutting" the dendrogram higher up, where we would end up with two cluster groups, would leave our groups much more heterogeneous (each would encompass more dissimilar units and be less specific, while we want our groups to be as homogeneous within as possible). Cutting into three groups instead allows for lower heterogeneity within groups (each will include units that are less different from one another; as such, each group will be more "specific", and the groups will differentiate more, which is what we want).
We do not want more than 3 cluster groups either. To divide the data into 4 groups, we would need to draw the line at approximately the height value of 13, so that it cuts just below only the right-most of the three "bars" at about that value. Doing so would not be sensible because, in the most intuitive terms, further cutting would make little difference in within-group heterogeneity (if you will, the horizontal lines of all the "bars" clustered around the height value of 12-13 are far too close together).
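As a cross-check of the visual cut (a small sketch; the exact height is specific to this dendrogram), cutree() can also cut by height rather than by the number of groups - any height in the 13-20 band discussed above should reproduce the three-group solution:
table(cutree(WARD, h = 15)) #cutting at height 15 should yield the same 3 groups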
Let's show how such division into three cluster groups would look in a dendrogram below, coloring by cluster group for easier visualization.
fviz_dend(WARD,
k = 3,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
As we can see, further division would barely make any difference between groups, so we settle for 3 cluster groups.
Let’s let R know we will be cutting into 3 cluster groups.
mydata$ClusterWard <- cutree(WARD,
k = 3)
head(mydata[c(1, 5, 6, 7, 12)])
## ID ACT SATV SATQ ClusterWard
## 1 1 24 500 500 1
## 2 2 35 600 500 1
## 3 3 21 480 470 2
## 4 4 26 550 520 1
## 5 5 31 600 550 1
## 6 6 28 640 640 3
The "ClusterWard" variable tells us which cluster group each unit has been assigned to. For example, units with IDs 1, 2, 4, and 5 have been assigned to group 1; unit 3 to group 2; and unit 6 to group 3. We cannot really tell how the groups differ from these units alone, but we will be able to confirm this later.
Now, we will specify initial leaders, which will later be reassigned using K-means.
Initial_leaders <- aggregate(mydata[ , c("ACT_z", "SATV_z", "SATQ_z")],
by = list(mydata$ClusterWard),
FUN = mean)
Initial_leaders
## Group.1 ACT_z SATV_z SATQ_z
## 1 1 -0.5646354 -0.2809746 -0.3785181
## 2 2 -0.9820800 -1.6645708 -1.5155391
## 3 3 0.6648568 0.6578849 0.6841063
We have obtained the initial leaders (the means of the standardized variables) in each cluster group. Now we will run K-means - via hkmeans, which seeds K-means with the hierarchical solution - and observe the reassignments.
K_MEANS <- hkmeans(mydata[c("ACT_z", "SATV_z", "SATQ_z")],
k = 3,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 3 clusters of sizes 256, 110, 265
##
## Cluster means:
## ACT_z SATV_z SATQ_z
## 1 -0.2658798 -0.1460056 -0.2015290
## 2 -1.3065932 -1.4295381 -1.4226987
## 3 0.7992093 0.7344401 0.7852388
##
## Clustering vector:
## [1] 2 1 2 1 1 1 1 2 2 3 3 1 1 3 1 1 3 3 1 3 1 1 1 1 3 3 2 2 1 1 1 1 3 1 1 1 1
## [38] 3 1 3 3 3 3 1 3 3 2 3 3 1 1 3 3 1 3 3 2 3 3 3 3 3 1 1 1 1 2 1 1 3 3 1 1 1
## [75] 2 1 2 1 3 1 1 3 2 1 3 2 3 3 3 3 1 3 1 3 2 3 1 3 3 1 1 1 1 2 3 1 3 3 1 1 3
## [112] 1 1 3 2 2 1 1 3 2 1 3 1 3 1 1 1 1 2 3 1 1 3 3 1 1 3 1 3 3 3 1 3 2 3 2 1 1
## [149] 1 3 2 1 3 3 3 3 3 3 3 1 1 1 1 3 3 1 1 3 3 2 1 3 3 1 1 1 3 3 1 3 2 2 3 1 1
## [186] 1 1 3 1 3 1 1 2 2 1 3 1 3 1 2 1 2 2 2 3 2 3 1 3 2 1 2 1 1 1 1 3 1 2 1 1 1
## [223] 3 2 1 1 1 2 1 1 2 3 3 3 3 2 1 3 2 1 2 1 3 3 1 1 3 1 3 2 1 1 1 3 2 2 3 2 3
## [260] 1 2 1 2 3 3 3 2 3 3 3 3 1 2 3 1 3 1 1 2 2 1 3 2 3 2 1 1 2 1 1 3 3 2 3 3 1
## [297] 2 1 2 3 1 3 3 2 1 1 2 3 1 1 2 1 3 1 3 1 1 1 3 1 2 1 1 3 3 3 1 3 3 1 1 2 3
## [334] 1 3 2 2 1 1 1 2 1 3 1 3 1 1 1 2 3 3 1 2 3 2 2 2 3 3 2 3 1 2 1 2 1 3 1 3 1
## [371] 1 2 2 3 1 1 1 1 2 3 3 3 3 1 3 3 2 1 3 1 3 1 3 1 3 3 1 2 2 3 1 1 2 2 2 1 3
## [408] 1 3 3 1 3 3 3 3 2 3 3 1 1 1 1 2 3 3 2 1 2 3 1 1 1 1 1 1 1 1 2 3 1 2 1 1 3
## [445] 2 3 3 2 3 1 3 1 3 1 1 2 3 3 1 1 2 2 1 3 3 3 1 3 1 3 3 1 1 3 2 1 1 3 3 3 1
## [482] 3 3 3 1 2 2 1 2 2 3 1 2 1 1 2 3 1 1 3 3 2 3 2 3 3 3 3 3 3 3 1 3 3 3 3 3 3
## [519] 3 1 3 3 1 3 3 3 3 1 3 3 3 3 3 3 3 3 1 3 3 3 1 1 3 3 1 1 3 3 1 3 2 1 3 3 2
## [556] 3 1 1 1 3 3 3 1 3 3 1 3 3 2 3 3 1 1 3 3 2 3 1 1 3 3 3 3 1 3 3 1 2 1 3 2 3
## [593] 3 3 3 3 2 1 3 1 1 1 1 2 1 3 3 3 3 3 1 1 1 3 1 3 1 1 1 3 3 3 1 3 1 1 1 1 3
## [630] 3 1
##
## Within cluster sum of squares by cluster:
## [1] 314.1316 230.7171 200.3622
## (between_SS / total_SS = 60.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
The first thing we can observe in the output above are the sizes of the respective cluster groups: 256, 110, and 265 units.
We can also see that after the K-means reassignment, the leaders of all variables across all three cluster groups changed; they were all updated to represent their groups better.
The "Clustering vector" output shows the group number of each unit. The numbers in square brackets on the left give the position of the first unit in each row, for easier tracking. As such, we can see that unit 1 now belongs to cluster group 2 (it belonged to cluster group 1 before K-means). Unit 586, for example, belongs to cluster group 3.
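Rather than counting through the printed vector, we can also look group membership up directly via the cluster component (listed among the available components above):
K_MEANS$cluster[c(1, 586)] #group membership of units 1 and 586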
Finally, we can observe the ratio of the between-group sum of squares to the total sum of squares. Because we want groups to differ from one another while being internally homogeneous, we want this ratio to be high: the higher it is, the larger the share of total variability that lies between groups rather than within them. Our value is 60.6%, which is decent. The cluster groups differentiate the units reasonably well.
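For reference, the 60.6% figure can be recomputed directly from the betweenss and totss components listed above:
K_MEANS$betweenss / K_MEANS$totss #between_SS / total_SS, roughly 0.606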
Let’s also plot this graphically.
fviz_cluster(K_MEANS,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
Our assortment looks pretty good, aside from group 2, whose within-group heterogeneity could be reduced quite substantially by removing the units with ID values of 403 and 228. In the display above, we are not paying much attention to whether groups (colors) overlap: the display is a simplified two-dimensional projection of three dimensions (we have three cluster variables), reduced using PCA - principal component analysis - so some units may appear closer to the leaders of groups that are not their own (while in reality they are not, since we cannot see the third dimension above).
When we checked for outliers above, we denoted units with IDs of 440 and 251 as potential outliers by their dissimilarity value. But we cannot really see them standing out in the display above (there really are a lot of points there). However, the ones we can notice are units with ID values of 403 and 228.
So far as removing those is concerned, we could use similar reasoning as with the dissimilarity outliers and argue that they are still representatives of the population. To maintain consistency in reasoning, I will not remove these, for I did not remove the ones above. If we did decide to remove them, we would simply filter them out by ID, compute the standardized variables again (removing units changes the standardized values, as noted above), and then redo the whole analysis until we arrived at a new display like the one above. Moreover, we said we want to see whether the Gauss curve holds for these tests; removing extremely low-performing students as "outliers" would be counterintuitive.
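For completeness, here is a minimal sketch of what that removal would look like (hypothetical; we do not run it, and mydata_trimmed is a name of my own choosing):
mydata_trimmed <- subset(mydata, !(ID %in% c(403, 228))) #filter the flagged units out by ID
mydata_trimmed$ACT_z <- scale(mydata_trimmed$ACT) #re-standardize: dropping units changes means and SDs
mydata_trimmed$SATV_z <- scale(mydata_trimmed$SATV)
mydata_trimmed$SATQ_z <- scale(mydata_trimmed$SATQ)
#...then repeat the distance, dendrogram, and K-means steps on mydata_trimmed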
As we said, I will not be removing the units; partially also for the sake of my poor professor, who will have to read all of this. If we wanted to, it would not really pose complications, however. We already showed understanding, though, so let’s not do it again.
We will proceed to check for reclassification of units that happened when we moved from the hierarchical approach (the dendrogram) to K-means. Before doing so, let’s compute a new variable for K-means.
mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("ACT_z", "SATV_z", "SATQ_z")])
## ACT_z SATV_z SATQ_z
## 1 -0.9703401 -0.9995309 -0.9528551
## 2 1.3283393 -0.1057140 -0.9528551
## 3 -1.5972527 -1.1782943 -1.2143636
## 4 -0.5523984 -0.5526225 -0.7785162
## 5 0.4924559 -0.1057140 -0.5170077
## 6 -0.1344567 0.2518127 0.2675178
table(mydata$ClusterWard)
##
## 1 2 3
## 222 89 320
Above, we can see the initial numbers of units in each group with hierarchical clustering.
These numbers changed to those shown in the output below after using K-means for each of the groups respectively (after reclassifications).
table(mydata$ClusterK_Means)
##
## 1 2 3
## 256 110 265
Finally, let's produce a cross-tabulation where the assignments of units by hierarchical clustering (dendrogram) are shown in rows and the reclassifications by K-means in columns.
table(mydata$ClusterWard, mydata$ClusterK_Means)
##
## 1 2 3
## 1 186 35 1
## 2 14 75 0
## 3 56 0 264
The table above holds a lot of information. I will explain one row and one column to demonstrate the reading:
row #1: When using hierarchical clustering, the first group initially contained 222 units (186 + 35 + 1). Later, using K-means, 35 units were reclassified into group 2, and 1 unit was reclassified into group 3, while 186 remained in group 1.
column #1: In group 1 we have 256 units (186 + 14 + 56) after having used K-means clustering. Out of these 256, 186 were there initially, 14 were reclassified from group 2, and 56 were reclassified from group 3.
We could read the table by similar logic for each row and column.
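One compact summary of the table above is the share of units that kept the same group label across both methods - the diagonal, (186 + 75 + 264) / 631:
mean(mydata$ClusterWard == mydata$ClusterK_Means) #share of units not reclassified by K-means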
Next, we will extract the centroids from the K-means solution.
Centroids <- K_MEANS$centers
Centroids
## ACT_z SATV_z SATQ_z
## 1 -0.2658798 -0.1460056 -0.2015290
## 2 -1.3065932 -1.4295381 -1.4226987
## 3 0.7992093 0.7344401 0.7852388
Finally, we will use those to plot a display that will allow us to observe and comment on the trends across the three cluster groups.
library(ggplot2)
library(tidyr)
Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("ACT_z", "SATV_z", "SATQ_z"))
Figure$Groups <- factor(Figure$id,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$nameFactor <- factor(Figure$name,
levels = c("ACT_z", "SATV_z", "SATQ_z"),
labels = c("ACT_z", "SATV_z", "SATQ_z"))
ggplot(Figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Groups, col = Groups), size = 3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
ylim(-1.5, 1.5)
We can now see the trends across all three groups. Let’s explain them one by one.
Group 1 In group 1 are clustered the individuals who performed below average in all of the tests (ACT, SATV, and SATQ). While these are below average, they are but slightly so.
Group 2 In group 2 are clustered the individuals who performed poorly in all of the tests (ACT, SATV, and SATQ). These individuals were far below average in performance across all three tests in relative terms.
Group 3 In group 3 are clustered the individuals who performed notably above average in all of the tests (ACT, SATV, and SATQ). These individuals performed well across all three tests.
We will comment on all of the groups with more specifics, but let’s first validate our results using our demographic variables - Gender, Education, and Age.
Firstly, let’s check whether all the cluster variables are successful at classifying units into groups by performing ANOVAs.
fit <- aov(cbind(ACT_z, SATV_z, SATQ_z) ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 375.15 187.576 462.23 < 2.2e-16 ***
## Residuals 628 254.85 0.406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 373.19 186.596 456.31 < 2.2e-16 ***
## Residuals 628 256.81 0.409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 396.44 198.222 532.99 < 2.2e-16 ***
## Residuals 628 233.56 0.372
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Each response represents the respective variable in sequence - response 1 for standardized ACT scores, response 2 for standardized SATV scores, and response 3 for standardized SATQ scores.
The ANOVA test for each of the respective variables has the following hypotheses:
H0: The means of the respective variable (for Response 1, this variable is ACT) are equal across all groups. For example, Response 1 claims: Mu(ACT g1) = Mu(ACT g2) = Mu(ACT g3).
H1: At least one of the means of the respective variable differs across the three groups (in our case, at least one group has a different mean of ACT than the others).
As we can see, the ANOVA tests are significant for all three variables. This means that the means of all three competency tests - ACT, SATV, and SATQ scores - differ across the cluster groups (p < 0.001).
Age
We will now validate our results with our demographic variables. Let’s start with Age (the numerical demographic variable).
aggregate(mydata$Age,
by = list(mydata$ClusterK_Means),
FUN = "mean")
## Group.1 x
## 1 1 27.42578
## 2 2 24.80000
## 3 3 26.10566
Above are displayed the means of age across all three cluster groups.
Let’s do an ANOVA check to see whether age has a statistically significant effect on groups as classified by K-means. We will assume the assumptions for ANOVA are met.
fit <- aov(Age ~ as.factor(ClusterK_Means),
data = mydata)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 574 286.9 3.188 0.0419 *
## Residuals 628 56521 90.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here the hypotheses are:
H0: The means of age of units inside the respective group are the same across all three cluster groups; Mu(age, g1) = Mu(age, g2) = Mu(age, g3)
H1: At least one cluster group differs in the mean of age of the units inside it.
We can reject the null hypothesis and conclude that at least one cluster group differs from the others in the mean age of its units (p = 0.042).
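To see which specific pair(s) of groups drive this result, we could follow up with Tukey's HSD on the same fit (a hedged extension on my part, not required by the assignment):
TukeyHSD(fit) #pairwise differences in mean Age between cluster groups, with family-wise confidence intervals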
We can see that cluster group 1 contains the oldest people (rounding the mean age to 27 for explanation), followed by group 3 (mean age 26). Group 2 contains the youngest people (mean age 25). In the performance analysis above, people in group 3 performed above average on the tests, followed by people in group 1, who performed slightly below average, and finally by people in group 2, who performed notably below average.
As such:
group 3, in which people performed the best - above average, contains the second youngest people on average (mean of age = 26),
group 1, in which people performed slightly below average, contains the oldest people on average (mean of age = 27).
group 2, in which people performed the worst - notably below average, contains the youngest people on average (mean of age = 25).
While the pattern is not fully monotonic (the best performers are not the oldest group), we can see that the youngest people on average performed notably worse than older people on the ACT, SATV, and SATQ tests.
Gender
Next, we will validate results using our demographic categorical variable of gender. Because this is a categorical variable, we will use the Pearson chi-square test of association between categorical variables. We will check only the assumption on expected frequencies: all should be above 5 (in tables larger than 2x2, up to 20% of expected frequencies may fall between 1 and 5, with none below 1). Our observations are independent, and our expected frequencies are all above 5, as we will see below, so the assumptions are met.
chisq_results <- chisq.test(mydata$Gender, as.factor(mydata$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$Gender and as.factor(mydata$ClusterK_Means)
## X-squared = 3.8886, df = 2, p-value = 0.1431
The hypotheses for chi-square above are the following:
H0: There is no association between gender and clusters by K-means.
H1: There is association between gender and clusters by K-means.
We lack sufficient statistical evidence to reject the null hypothesis, so we cannot claim an association between gender and the K-means clusters.
Since we cannot claim an association, I will not interpret the results in the tables below; I will only show the expected frequencies, because they validate the assumption of the test mentioned in the previous paragraph.
round(chisq_results$expected, 2)
##
## mydata$Gender 1 2 3
## Male 88.44 38 91.55
## Female 167.56 72 173.45
As we can see, all expected frequencies are above 5 - the test was valid to use, but its result was statistically insignificant.
Education
The last demographic variable we will use to validate our clustering results is the categorical variable of education, which we factored earlier with "Edu3" as the reference level.
We will once again see all expected frequencies above 5 below, so our assumptions are met.
chisq_results2 <- chisq.test(mydata$Education, as.factor(mydata$ClusterK_Means))
chisq_results2
##
## Pearson's Chi-squared test
##
## data: mydata$Education and as.factor(mydata$ClusterK_Means)
## X-squared = 17.539, df = 8, p-value = 0.02496
The hypotheses for chi-square above are the following:
H0: There is no association between education and clusters by K-means.
H1: There is association between education and clusters by K-means.
We can reject the null hypothesis and conclude that there is an association between education and the cluster groups as defined after reclassification by K-means (p = 0.025).
We will display the tables of observed frequencies, expected frequencies, and residuals below.
addmargins(chisq_results2$observed)
##
## mydata$Education 1 2 3 Sum
## Edu3 100 55 114 269
## Edu1 17 11 15 43
## Edu2 15 13 15 43
## Edu4 60 18 59 137
## Edu5 64 13 62 139
## Sum 256 110 265 631
We can, for example, see that of those people divided into cluster group 1, 100 of them self-reported their education level as “3”, 17 as “1”, 15 as “2”, 60 as “4”, and 64 as “5”. Let’s not go too deep into this; the third table really matters.
round(chisq_results2$expected, 2)
##
## mydata$Education 1 2 3
## Edu3 109.13 46.89 112.97
## Edu1 17.45 7.50 18.06
## Edu2 17.45 7.50 18.06
## Edu4 55.58 23.88 57.54
## Edu5 56.39 24.23 58.38
As we can see, all expected frequencies are above 5; our assumption is met and the test was valid. For example, we would expect 46.89 of the people in cluster group 2 to self-report their education level as "3". Let's take a look at the last table, which gives us the most concrete results.
round(chisq_results2$residuals, 2) #Pearson residuals of the chi-square test
##
## mydata$Education 1 2 3
## Edu3 -0.87 1.18 0.10
## Edu1 -0.11 1.28 -0.72
## Edu2 -0.59 2.01 -0.72
## Edu4 0.59 -1.20 0.19
## Edu5 1.01 -2.28 0.47
While we were able to claim an association between education and the cluster grouping, the omnibus test by itself gives no concrete insight into where that association lies; the residuals above are what tell us.
For reference before interpreting: the only statistically significant residuals occur in cluster group 2 (those cells fall within the critical zone). Group 2 was the group that performed the worst in terms of results - notably below average.
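As a rule-of-thumb check (my assumption about the cut-off being applied), residuals can be compared against the two-sided standard normal critical value of about 1.96 at alpha = 0.05; the adjusted standardized residuals stored in chisq_results2$stdres follow that reference distribution most closely:
abs(chisq_results2$stdres) > qnorm(0.975) #TRUE marks cells in the critical zone at alpha = 0.05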
Looking at the third and fifth rows in combination with the second column above, we can see that cluster group 2 contained more people than expected who self-reported their education level as "2" and fewer people than expected who self-reported it as "5" (alpha = 0.05 in both cases).
In other words, the people in group 2, who performed the worst in ACT, SATV, and SATQ scores, were less educated according to their self-reports (more people than expected with education "2" and fewer than expected with education "5" on the scale from 1 to 5).
We can affirmatively answer our research question, which asked: Can we group our units (people) based on their performance in ACT, SATV, and SATQ tests?
Our classification of 631 people (the sample size was reduced from 700 to 631 when we excluded NA units) was based on three standardized variables - ACT score, SATV score, and SATQ score.
For hierarchical clustering we used Ward’s algorithm. Based on the analysis of the dendrogram we decided to classify people into three cluster groups. We further optimized clustering by using K-Means clustering for reassigning units among cluster groups.
Group 1
After clustering by K-means, group 1 is second in size (40.6% (256/631) of all people), characterized by slightly lower than average values across all three cluster variables (ACT, SATV, and SATQ scores). The people in this cluster group have the middle mean scores on all three tests. On average, they are the oldest (rounded mean age 27). The self-reported education values in this group do not fall within the critical zone with respect to residuals (no level of education was self-reported more or less often than expected in cluster group 1 at alpha = 0.05).
Group 2
Group 2 contains the fewest people (17.4% (110/631) of all people), characterized by the worst performance - notably lower than average values across all three cluster variables (ACT, SATV, and SATQ scores). The people in this cluster group have the worst mean scores on all three tests; they performed the worst. On average, they are the youngest (rounded mean age 25). In this group, there were more people than expected who self-reported their education level as "2" and fewer than expected who self-reported it as "5" (alpha = 0.05) on the scale from 1 to 5; as we saw, people in this worst-performing group were less educated according to their self-reports.
Group 3
Group 3 contains the most people (42.0% (265/631) of all people), characterized by higher than average values across all three cluster variables (ACT, SATV, and SATQ scores). The people in this cluster group have the best mean scores on all three tests; they performed the best. On average, they are neither the youngest nor the oldest (rounded mean age 26). The self-reported education values in this group do not fall within the critical zone with respect to residuals (no level of education was self-reported more or less often than expected in cluster group 3 at alpha = 0.05).
We can see that, overall, students performed roughly according to the Gauss curve across the three tests, a pattern reflected in our low-, medium-, and high-performing groups. We did see that fewer people performed poorly than performed well (110 vs. 265), though more people scored below the overall average than above it (366 vs. 265). This could be explained (among many other factors) by students' levels of motivation and willingness to prepare well for the exams.