Student performance is a key indicator of academic success and can be influenced by various factors, including socioeconomic background, parental education, gender, and study habits. Understanding the underlying structure of these factors can help educators, policymakers, and researchers identify patterns and relationships that affect academic outcomes. In this paper, I perform Principal Component Analysis (PCA) and Factor Analysis to explore and reduce the dimensions of the ‘Students Performance in Exams’ dataset from Kaggle, uncovering key variables and patterns that drive student achievement.
The “Students Performance in Exams” dataset contains 1000 observations of students, each characterized by 8 variables. They divide into 5 categorical variables - gender, race/ethnicity, parental level of education, test preparation course and lunch type - and 3 numerical variables: math, reading and writing score. This dataset provides comprehensive information on student academic performance. The variables cover demographics, such as gender (male/female) and race/ethnicity, which categorizes students into different racial or ethnic groups. Parental influence is represented by the parental level of education, ranging from “some high school” to “master’s degree”. The dataset also contains school-related factors, such as whether the student has completed a test preparation course and even the type of lunch (standard or free/reduced) they receive, which can have an impact on academic outcomes. Academic performance is captured through three numerical variables: math score, reading score and writing score.
The primary objective of this analysis is to reduce the dimensionality of the dataset, while retaining as much information as possible. This will be achieved through:
Principal Component Analysis (PCA) - reduces the number of dimensions in large datasets to principal components that retain most of the original information. It does this by transforming potentially correlated variables into a smaller set of variables called principal components. It is very effective for visualizing and exploring datasets characterized by high dimensionality or many features, as it can identify patterns and trends.1
Factor Analysis - an unsupervised technique for dimensionality reduction that groups correlated variables into fewer latent factors sharing common variance. It simplifies data by uncovering meaningful relationships among variables.2 A small illustrative sketch contrasting the two techniques follows below.
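To make the distinction concrete, here is a minimal sketch (not part of the analysis that follows) run on R’s built-in mtcars data, used purely as illustrative input: prcomp() builds orthogonal components that maximize explained variance, while psych::fa() estimates latent factors that account only for the variance shared among the variables.
library(psych)
# Standardize a few illustrative variables from the built-in mtcars data
X <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])
# PCA: eigenvalues (variances) of the orthogonal principal components
prcomp(X)$sdev^2
# Factor Analysis: loadings of the variables on a single common factor
fa(X, nfactors = 1, fm = "ml")$loadings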
As a first step, I am assigning new variable names for a cleaner look of the dataset. In the next steps, I will preprocess the data to ensure it is suitable for dimension reduction. There are no NA values in this dataset.
Students_performance <- read.csv("StudentsPerformance.csv", sep=",", dec=".", header=TRUE, col.names=c('Gender', 'Race/Ethnicity', 'ParentsEducation', 'Lunch', 'TestPreparationCourse', 'MathScore', 'ReadingScore', 'WritingScore'))
head(Students_performance, 10)
## Gender Race.Ethnicity ParentsEducation Lunch TestPreparationCourse
## 1 female group B bachelor's degree standard none
## 2 female group C some college standard completed
## 3 female group B master's degree standard none
## 4 male group A associate's degree free/reduced none
## 5 male group C some college standard none
## 6 female group B associate's degree standard none
## 7 female group B some college standard completed
## 8 male group B some college free/reduced none
## 9 male group D high school free/reduced completed
## 10 female group B high school free/reduced none
## MathScore ReadingScore WritingScore
## 1 72 72 74
## 2 69 90 88
## 3 90 95 93
## 4 47 57 44
## 5 76 78 75
## 6 71 83 78
## 7 88 95 92
## 8 40 43 39
## 9 64 64 67
## 10 38 60 50
str(Students_performance)
## 'data.frame': 1000 obs. of 8 variables:
## $ Gender : chr "female" "female" "female" "male" ...
## $ Race.Ethnicity : chr "group B" "group C" "group B" "group A" ...
## $ ParentsEducation : chr "bachelor's degree" "some college" "master's degree" "associate's degree" ...
## $ Lunch : chr "standard" "standard" "standard" "free/reduced" ...
## $ TestPreparationCourse: chr "none" "completed" "none" "none" ...
## $ MathScore : int 72 69 90 47 76 71 88 40 64 38 ...
## $ ReadingScore : int 72 90 95 57 78 83 95 43 64 60 ...
## $ WritingScore : int 74 88 93 44 75 78 92 39 67 50 ...
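To back the earlier claim that the dataset contains no missing values, a quick check can be run on the loaded data frame (a small addition, not part of the original output):
# Count missing values per column - all zeros confirm there are no NAs
colSums(is.na(Students_performance))
anyNA(Students_performance)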
There are 5 categorical variables. I am converting them into numerical format.
Students_performance$Gender <- as.numeric(factor(Students_performance$Gender))
Students_performance$Race.Ethnicity <- as.numeric(factor(Students_performance$Race.Ethnicity))
unique(Students_performance$ParentsEducation)
## [1] "bachelor's degree" "some college" "master's degree"
## [4] "associate's degree" "high school" "some high school"
library(dplyr)
Students_performance$ParentsEducation <- recode(Students_performance$ParentsEducation,
"some high school" = 1,
"high school" = 2,
"some college" = 3,
"associate's degree" = 4,
"bachelor's degree" = 5,
"master's degree" = 6)
Students_performance$Lunch <- as.numeric(factor(Students_performance$Lunch))
Students_performance$TestPreparationCourse <- as.numeric(factor(Students_performance$TestPreparationCourse))
str(Students_performance)
## 'data.frame': 1000 obs. of 8 variables:
## $ Gender : num 1 1 1 2 2 1 1 2 2 1 ...
## $ Race.Ethnicity : num 2 3 2 1 3 2 2 2 4 2 ...
## $ ParentsEducation : num 5 3 6 4 3 4 3 3 2 2 ...
## $ Lunch : num 2 2 2 1 2 2 2 1 1 1 ...
## $ TestPreparationCourse: num 2 1 2 2 2 2 1 2 1 2 ...
## $ MathScore : int 72 69 90 47 76 71 88 40 64 38 ...
## $ ReadingScore : int 72 90 95 57 78 83 95 43 64 60 ...
## $ WritingScore : int 74 88 93 44 75 78 92 39 67 50 ...
All of the variables are numerical.
library("ggplot2")
library("reshape2")
Students_performance_plot <- melt(Students_performance, id.vars = NULL)
ggplot(data = Students_performance_plot) + geom_bar(aes(x = value), fill = "darkolivegreen4") + theme(plot.title = element_text(hjust = 0.5, size = 15)) +
facet_wrap(~ variable, scales = "free", ncol = 3)
summary(Students_performance)
## Gender Race.Ethnicity ParentsEducation Lunch
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000
## Median :1.000 Median :3.000 Median :3.000 Median :2.000
## Mean :1.482 Mean :3.174 Mean :3.081 Mean :1.645
## 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :2.000 Max. :5.000 Max. :6.000 Max. :2.000
## TestPreparationCourse MathScore ReadingScore WritingScore
## Min. :1.000 Min. : 0.00 Min. : 17.00 Min. : 10.00
## 1st Qu.:1.000 1st Qu.: 57.00 1st Qu.: 59.00 1st Qu.: 57.75
## Median :2.000 Median : 66.00 Median : 70.00 Median : 69.00
## Mean :1.642 Mean : 66.09 Mean : 69.17 Mean : 68.05
## 3rd Qu.:2.000 3rd Qu.: 77.00 3rd Qu.: 79.00 3rd Qu.: 79.00
## Max. :2.000 Max. :100.00 Max. :100.00 Max. :100.00
table(Students_performance$ParentsEducation)
##
## 1 2 3 4 5 6
## 179 196 226 222 118 59
According to the plots above, the female group (518 students) is somewhat larger than the male group (482). A substantial share of the students belongs to the third race/ethnicity group. When it comes to parental education, 226 students have parents whose highest level is ‘some college’. A considerably larger group of students (645) receives a standard lunch, and only 358 of the 1000 students observed have completed the test preparation course.
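The counts quoted above can be confirmed directly from the recoded variables (a small check; the level coding follows the factor conversion earlier, e.g. 1 = female / 2 = male for Gender):
# Frequency tables behind the figures quoted above
table(Students_performance$Gender)                 # 1 = female, 2 = male
table(Students_performance$Lunch)                  # 1 = free/reduced, 2 = standard
table(Students_performance$TestPreparationCourse)  # 1 = completed, 2 = none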
Before performing Principal Component Analysis and Factor Analysis, I am assessing whether the data are suitable for these techniques, using statistical tests commonly chosen to evaluate suitability for dimension reduction.
Kaiser-Meyer-Olkin (KMO) Test - a measure of sampling adequacy based on the proportion of variance among variables that might be common variance. Values closer to 1 indicate that the data are better suited to Factor Analysis.3
Bartlett’s Test of Sphericity - provides information about whether the correlations in the data are strong enough to use a dimension-reduction technique such as principal components or common factor analysis.4
library("psych")
library("corrplot")
#Correlation Matrix
cor_matrix <- cor(Students_performance)
corrplot(cor_matrix, type = "lower", order = "alphabet", tl.cex=0.6)
KMO(cor_matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor_matrix)
## Overall MSA = 0.58
## MSA for each item =
## Gender Race.Ethnicity ParentsEducation
## 0.17 0.66 0.49
## Lunch TestPreparationCourse MathScore
## 0.67 0.34 0.62
## ReadingScore WritingScore
## 0.73 0.61
cortest.bartlett(cor_matrix, 1000)
## $chisq
## [1] 4941.045
##
## $p.value
## [1] 0
##
## $df
## [1] 28
Overall MSA is equal to 0.58, which is not ideal but marginally acceptable for PCA and Factor Analysis. Nevertheless, the p-value from Bartlett’s test, equal to 0, indicates that the correlations are significant and that the correlation matrix is not an identity matrix, which supports the feasibility of factor analysis. The large chi-square statistic points to the same conclusion and also supports PCA, because it shows that there are relationships between the variables.
Considering the output, I proceed with the PCA and Factor analysis.
library("factoextra")
pca <- prcomp(Students_performance, center = TRUE, scale=TRUE)
pca
## Standard deviations (1, .., p=8):
## [1] 1.7461792 1.0870206 1.0257443 1.0026693 0.9470587 0.8351410 0.2851541
## [8] 0.1898786
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## Gender 0.1054102 -0.78224493 0.039417792 -0.39703348
## Race.Ethnicity -0.1508357 -0.14255833 0.567881344 -0.17923287
## ParentsEducation -0.1602277 0.19907728 0.638316197 -0.21204375
## Lunch -0.2129445 -0.43159310 -0.111757549 0.58566547
## TestPreparationCourse 0.1912390 -0.08833952 0.497819855 0.64751399
## MathScore -0.5065405 -0.31546866 0.009921791 -0.02851884
## ReadingScore -0.5442871 0.10510836 -0.064226936 0.04396515
## WritingScore -0.5529150 0.15327596 -0.063020240 0.01589860
## PC5 PC6 PC7 PC8
## Gender 0.24393886 0.1678883 0.34049942 -0.118994587
## Race.Ethnicity -0.76496151 -0.1193097 0.05356836 0.006449393
## ParentsEducation 0.57669920 -0.3834948 -0.02206310 0.046348655
## Lunch 0.00749719 -0.6370356 0.08298316 0.011468549
## TestPreparationCourse 0.12070674 0.5126830 0.06776566 -0.080737913
## MathScore 0.08124437 0.2623394 -0.73936901 0.144552251
## ReadingScore 0.03402103 0.2424660 0.53018884 0.587876146
## WritingScore 0.01900926 0.1193335 0.20361401 -0.781349244
Determining the optimal number of principal components
# Calculate cumulative variance explained
explained_variance <- summary(pca)$importance[2, ]
cumulative_variance <- cumsum(explained_variance)
# Plot cumulative variance
plot(cumulative_variance, type = "b", main = "Cumulative Variance Explained", xlab = "Number of Components", ylab = "Cumulative Variance")
cumulative_variance
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 0.38114 0.52884 0.66036 0.78603 0.89815 0.98533 0.99549 1.00000
I am calculating the cumulative variance explained by the principal components. The plot shows how much variance is explained as more components are added.
#Kaiser Criterion - Eigenvalues
eigenvalues <- pca$sdev^2
# Determine number of components with eigenvalue > 1
num_components <- sum(eigenvalues > 1)
num_components
## [1] 4
The Kaiser criterion (rule of thumb) decides how many components to retain. An eigenvalue > 1 indicates that the component explains more variance than an individual variable would (since the variance of a standardized variable is 1). In this example, it gives us 4 components to retain. The first 4 components explain 78.6% of the total variance in Students Performance dataset.
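As a quick sanity check, the variance share covered by the retained components can be computed directly from the eigenvalues (a small sketch; with standardized variables the eigenvalues sum to the number of variables, here 8):
# Proportion of total variance explained by the first 4 components
# (should match the cumulative value of about 0.786 reported above)
sum(eigenvalues[1:4]) / sum(eigenvalues)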
fviz_eig(pca, choice = "eigenvalue", barfill = "darkolivegreen4", barcolor = "darkolivegreen", addlabels = TRUE, main = "Eigenvalues")
fviz_pca_var(pca, col.var = "darkgreen")
The plots above visualize the eigenvalues of each principal component and the variables in the PCA variable plot. Variables such as WritingScore and ReadingScore are strongly positively correlated, since their arrows point in a similar direction. Longer arrows suggest that a variable contributes more to the principal components - here WritingScore, ReadingScore, MathScore, and Gender.
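The strong association between the reading and writing scores suggested by the variable plot can also be checked numerically (a small addition; the already computed correlation matrix could be used instead):
# Pairwise correlation behind the nearly parallel ReadingScore/WritingScore arrows
cor(Students_performance$ReadingScore, Students_performance$WritingScore)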
I am now analyzing the loadings, which represent the correlation or contribution of the original variables to each of the principal components. They help to show how each variable influences the principal components and to interpret the reduced dimensions.
loadings <- pca$rotation
# Loading PCA1
loadings_PC1 <- sort(abs(loadings[,1]), decreasing = TRUE)
library(ggplot2)
ggplot(data = data.frame(Variable = names(loadings_PC1), Loading = loadings_PC1), aes(x = Variable, y = Loading)) +
geom_bar(stat = "identity", fill = "darkolivegreen") +
labs(title = "Contributions to PC1", x = "Variable", y = "Loading") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Percentage contribution to PC1
contribution_PC1 <- (loadings[, 1]^2) / sum(loadings[, 1]^2) * 100
contribution_PC1
## Gender Race.Ethnicity ParentsEducation
## 1.111130 2.275140 2.567291
## Lunch TestPreparationCourse MathScore
## 4.534535 3.657234 25.658328
## ReadingScore WritingScore
## 29.624842 30.571501
# WritingScore has the strongest contribution to PC1 - 30.571501%
WritingScore contributes 30.57% to PC1, the largest contribution among all variables, and therefore has the strongest influence on the direction of PC1.
# Loading PC2
loadings_PC2 <- sort(abs(loadings[,2]), decreasing = TRUE)
ggplot(data = data.frame(Variable = names(loadings_PC2), Loading = loadings_PC2), aes(x = Variable, y = Loading)) +
geom_bar(stat = "identity", fill = "darkolivegreen") +
labs(title = "Contributions to PC2", x = "Variable", y = "Loading") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Percentage contribution to PC2
contribution_PC2 <- (loadings[, 2]^2) / sum(loadings[, 2]^2) * 100
contribution_PC2
## Gender Race.Ethnicity ParentsEducation
## 61.190713 2.032288 3.963176
## Lunch TestPreparationCourse MathScore
## 18.627260 0.780387 9.952047
## ReadingScore WritingScore
## 1.104777 2.349352
# Gender has the strongest contribution to PC2 - 61.190713%
PC2 is most influenced by the Gender variable - 61.19%.
# Loading PC3
loadings_PC3 <- sort(abs(loadings[,3]), decreasing = TRUE)
ggplot(data = data.frame(Variable = names(loadings_PC3), Loading = loadings_PC3), aes(x = Variable, y = Loading)) +
geom_bar(stat = "identity", fill = "darkolivegreen") +
labs(title = "Contributions to PC3", x = "Variable", y = "Loading") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Percentage contribution to PC3
contribution_PC3 <- (loadings[, 3]^2) / sum(loadings[, 3]^2) * 100
contribution_PC3
## Gender Race.Ethnicity ParentsEducation
## 0.155376236 32.248922045 40.744756732
## Lunch TestPreparationCourse MathScore
## 1.248974972 24.782460827 0.009844194
## ReadingScore WritingScore
## 0.412509936 0.397155059
# ParentsEducation has the strongest contribution to PC3 - 40.744756732%
ParentsEducation has the strongest contribution to PC3 - 40.74%.
# Loading PC4
loadings_PC4 <- sort(abs(loadings[,4]), decreasing = TRUE)
ggplot(data = data.frame(Variable = names(loadings_PC4), Loading = loadings_PC4), aes(x = Variable, y = Loading)) +
geom_bar(stat = "identity", fill = "darkolivegreen") +
labs(title = "Contributions to PC4", x = "Variable", y = "Loading") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Percentage contribution to PC4
contribution_PC4 <- (loadings[, 4]^2) / sum(loadings[, 4]^2) * 100
contribution_PC4
## Gender Race.Ethnicity ParentsEducation
## 15.76355856 3.21244211 4.49625525
## Lunch TestPreparationCourse MathScore
## 34.30040450 41.92743717 0.08133240
## ReadingScore WritingScore
## 0.19329348 0.02527654
# TestPreparationCourse has the strongest contribution to PC4 - 41.92743717%
Finally, PC4 is dominated by TestPreparationCourse - 41.93%.
The analysis effectively reduced the complexity of the dataset, making it easier to interpret and visualize without losing much information. From 8 original variables, 4 principal components were retained, explaining 78.6% of the total variance. The analysis revealed that certain variables, like WritingScore, Gender, ParentsEducation, and TestPreparationCourse, contribute significantly to explaining the variance in the dataset. This suggests that these factors are key drivers of differences in student performance.
fa_result <- fa(Students_performance, nfactors = 4, rotate = "varimax", fm="ml")
print(fa_result)
## Factor Analysis using method = ml
## Call: fa(r = Students_performance, nfactors = 4, rotate = "varimax",
## fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML2 ML3 ML4 h2 u2 com
## Gender -0.10 0.99 0.05 -0.03 0.995 0.005 1.0
## Race.Ethnicity 0.05 -0.01 0.24 0.05 0.063 0.937 1.2
## ParentsEducation 0.08 -0.03 0.08 0.26 0.079 0.921 1.4
## Lunch 0.10 0.01 0.38 0.02 0.157 0.843 1.1
## TestPreparationCourse -0.09 -0.03 -0.01 -0.46 0.223 0.777 1.1
## MathScore 0.58 0.20 0.75 0.24 0.995 0.005 2.3
## ReadingScore 0.84 -0.17 0.37 0.36 0.995 0.005 1.9
## WritingScore 0.66 -0.24 0.45 0.55 0.995 0.005 3.1
##
## ML1 ML2 ML3 ML4
## SS loadings 1.51 1.11 1.10 0.78
## Proportion Var 0.19 0.14 0.14 0.10
## Cumulative Var 0.19 0.33 0.47 0.56
## Proportion Explained 0.34 0.25 0.25 0.17
## Cumulative Proportion 0.34 0.58 0.83 1.00
##
## Mean item complexity = 1.6
## Test of the hypothesis that 4 factors are sufficient.
##
## df null model = 28 with the objective function = 4.96 with Chi Square = 4941.04
## df of the model are 2 and the objective function was 0.04
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.12
##
## The harmonic n.obs is 1000 with the empirical chi square 59.7 with prob < 1.1e-13
## The total n.obs was 1000 with Likelihood Chi Square = 41.94 with prob < 7.8e-10
##
## Tucker Lewis Index of factoring reliability = 0.886
## RMSEA index = 0.141 and the 90 % confidence intervals are 0.106 0.18
## BIC = 28.13
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## ML1 ML2 ML3 ML4
## Correlation of (regression) scores with factors 0.98 1.00 0.98 0.96
## Multiple R square of scores with factors 0.95 0.99 0.96 0.92
## Minimum correlation of possible factor scores 0.91 0.99 0.93 0.83
ML1 explains 19%, ML2 explains 14%, ML3 explains 14%, and ML4 explains 10% of the variance in the data. The first two factors (ML1 and ML2) explain 33% of the variance and all four factors explain 56% of the variance cumulatively, which is reasonable. Factors ML1 to ML4 have correlations close to 1 with their respective factor scores, indicating that the scores are well-represented by the factors.
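As an optional visual summary of the factor solution, psych’s fa.diagram() draws each variable next to the factor on which it loads most strongly (a small addition, not part of the original output):
# Path-style diagram of the varimax-rotated 4-factor solution
fa.diagram(fa_result)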
loadings_fa <- fa_result$loadings
# Loading ML1
loadings_ML1 <- sort(abs(loadings_fa[,1]), decreasing = TRUE)
ggplot(data = data.frame(Variable = names(loadings_ML1), Loading = loadings_ML1), aes(x = Variable, y = Loading)) +
geom_bar(stat = "identity", fill = "coral4") +
labs(title = "Contributions to ML1", x = "Variable", y = "Loading") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ReadingScore has a strong positive correlation with ML1 - the first factor in this factor analysis model.
# Loading ML2
loadings_ML2 <- sort(abs(loadings_fa[,2]), decreasing = TRUE)
ggplot(data = data.frame(Variable = names(loadings_ML2), Loading = loadings_ML2), aes(x = Variable, y = Loading)) +
geom_bar(stat = "identity", fill = "coral4") +
labs(title = "Contributions to ML2", x = "Variable", y = "Loading") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The primary contributor to ML2 is Gender, with a very strong positive correlation.
# Loading ML3
loadings_ML3 <- sort(abs(loadings_fa[,3]), decreasing = TRUE)
ggplot(data = data.frame(Variable = names(loadings_ML3), Loading = loadings_ML3), aes(x = Variable, y = Loading)) +
geom_bar(stat = "identity", fill = "coral4") +
labs(title = "Contributions to ML3", x = "Variable", y = "Loading") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ML3 is dominated by MathScore.
# Loading ML4
loadings_ML4 <- sort(abs(loadings_fa[,4]), decreasing = TRUE)
ggplot(data = data.frame(Variable = names(loadings_ML4), Loading = loadings_ML4), aes(x = Variable, y = Loading)) +
geom_bar(stat = "identity", fill = "coral4") +
labs(title = "Contributions to ML4", x = "Variable", y = "Loading") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Finally, WritingScore contributes significantly to the last factor, ML4.
Four factors retained in Factor Analysis represent the key underlying structures in the data, together explaining 56% of the variance. The high correlation of factor scores with the corresponding factors indicates a good model fit. The factor scores seem to capture the essential structure of the data, and the factors themselves appear to be clearly defined and well interpreted. Students’ reading performance seems to be an important underlying factor, as it contributes significantly to the first factor. On the other hand, Gender is a prominent contributor to the second factor, which suggests that gender differences play a role in the performance patterns across subjects. The third factor reflects the importance of MathScore in explaining the variation in student performance - math performance could be viewed as a distinct factor influencing overall performance. Finally, WritingScore is a critical contributor to the fourth factor, rounding out the overall performance model.
PCA explains a higher percentage of variance (78.6%) compared to FA (56%). This is expected, as PCA focuses on explaining the most variance in the data through a linear combination of variables, while FA focuses on uncovering latent factors that underlie the observed data. Together, they offer a comprehensive approach to analyzing provided Students Performance dataset.
I decided to proceed with clustering after PCA and Factor Analysis. This seems logical and beneficial, because clustering complements them by further uncovering patterns and grouping observations in the dataset. Clustering on the reduced Students Performance dataset can group students into distinct clusters based on shared characteristics (for example, performance levels, demographic factors, or study habits). This helps in identifying meaningful subgroups, such as “high-performing students with minimal parental education” or “students benefiting most from test preparation”. PCA ensures that clustering is applied to the most informative dimensions, improving the quality of the clusters.
I am choosing the previously defined first 4 principal components (they explain 78.6% of the variance).
pca_data <- as.data.frame(pca$x[, 1:4])
Determining the optimal number of clusters. I am using the Elbow Method and Silhouette Analysis.
library(factoextra)
# Elbow method
fviz_nbclust(pca_data, kmeans, method = "wss") +
labs(title = "Elbow Method for Optimal Clusters")
# Silhouette analysis
fviz_nbclust(pca_data, kmeans, method = "silhouette") +
labs(title = "Silhouette Analysis for Optimal Clusters")
I am now applying K-means clustering with the 2 clusters indicated by the methods above.
kmeans_result <- kmeans(pca_data, centers = 2, nstart = 25)
# Visualize the clusters
fviz_cluster(kmeans_result, data = pca_data, geom = "point") +
labs(title = "K-means Clustering Results")
Students_performance$Cluster <- kmeans_result$cluster
# View summary of each cluster
aggregate(. ~ Cluster, data = Students_performance, FUN = mean)
## Cluster Gender Race.Ethnicity ParentsEducation Lunch
## 1 1 1.398198 3.427027 3.358559 1.776577
## 2 2 1.586517 2.858427 2.734831 1.480899
## TestPreparationCourse MathScore ReadingScore WritingScore
## 1 1.515315 75.41081 79.11892 78.72072
## 2 1.800000 54.46292 56.75955 54.75056
Interpretation of the output:
Cluster 1
- Gender: 1.398 (closer to 1, which corresponds to females) - a higher proportion of females.
- Race/Ethnicity: 3.43 - students mainly from the third and fourth racial/ethnic groups (around group C).
- Parents’ Education: 3.36 - relatively higher than in Cluster 2, closer to “some college” or “associate’s degree”.
- Lunch: 1.78 (closer to 2) - most students receive a standard lunch.
- Test Preparation Course: 1.52 - roughly half of the students have completed the test preparation course, more than in Cluster 2.
- Scores - Math: 75.41, Reading: 79.12, Writing: 78.72. Students in this cluster tend to score higher on all three exams than those in the second cluster, indicating better academic performance.
Cluster 2
- Gender: 1.59 (closer to 2, which corresponds to males) - a higher proportion of males.
- Race/Ethnicity: 2.86 - lower than in Cluster 1, on average closer to group C.
- Parents’ Education: 2.73 - lower than in Cluster 1, closer to “high school” or “some college”.
- Lunch: 1.48 (closer to 1) - a larger share of students receives a free/reduced lunch.
- Test Preparation Course: 1.80 (closer to 2, i.e. “none”) - fewer students than in Cluster 1 have completed the test preparation course.
- Scores - Math: 54.46, Reading: 56.76, Writing: 54.75. Students in this cluster tend to score lower - they are comparatively weaker in terms of academic performance.
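To make the level-based interpretation above more concrete, the categorical variables can be cross-tabulated against the cluster assignment (a small check; it uses the Cluster column added after the 2-cluster K-means run):
# Cross-tabulations behind the cluster profiles (2-cluster K-means solution)
table(Cluster = Students_performance$Cluster, Gender = Students_performance$Gender)
table(Cluster = Students_performance$Cluster, Lunch = Students_performance$Lunch)
table(Cluster = Students_performance$Cluster, TestPrep = Students_performance$TestPreparationCourse)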
library(cluster)
silhouette_score <- silhouette(kmeans_result$cluster, dist(Students_performance))
summary(silhouette_score)
## Silhouette of 1000 units in 2 clusters from silhouette.default(x = kmeans_result$cluster, dist = dist(Students_performance)) :
## Cluster sizes and average silhouette widths:
## 555 445
## 0.4536389 0.4128297
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4211 0.3577 0.5243 0.4355 0.5867 0.6441
I am computing a Silhouette Score to measure how well each student fits within their cluster compared to the other cluster. Several observations can be drawn from this analysis. The first cluster contains 555 units (55.5% of the data), while the second contains 445 units (44.5%). The sizes are fairly balanced, which is a positive sign for meaningful clusters. The average silhouette width is 0.4536 for Cluster 1 and 0.4128 for Cluster 2. Both are above 0.4, which suggests that the clusters are moderately well separated, though there may still be some overlap. The minimum silhouette width is -0.4211, indicating that some points are likely misclassified. The median silhouette width is 0.5243, indicating that the majority of points are reasonably well clustered. The overall mean silhouette width is 0.4355 - acceptable and close to 0.5, but not a very strong clustering structure.
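The per-observation silhouette widths can also be inspected visually; factoextra’s fviz_silhouette() should accept the silhouette object computed above (a small optional plot, assuming this input type is supported by the installed version):
# Silhouette plot for the 2-cluster K-means solution
fviz_silhouette(silhouette_score)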
I will try to experiment with a larger number of clusters and see if this score improves.
# K-means with 3 clusters
kmeans_result_new <- kmeans(pca_data, centers = 3, nstart = 25)
# Visualize the clusters
fviz_cluster(kmeans_result_new, data = pca_data, geom = "point") +
labs(title = "K-means Clustering Results")
Students_performance$Cluster <- kmeans_result_new$cluster
# View summary of each cluster
aggregate(. ~ Cluster, data = Students_performance, FUN = mean)
## Cluster Gender Race.Ethnicity ParentsEducation Lunch
## 1 1 1.558659 2.796089 2.734637 1.427374
## 2 2 2.000000 3.498221 3.185053 1.786477
## 3 3 1.002770 3.296399 3.343490 1.750693
## TestPreparationCourse MathScore ReadingScore WritingScore
## 1 1.821229 51.74581 54.63128 52.55307
## 2 1.516014 77.94662 74.11388 72.46619
## 3 1.562327 71.08310 79.73684 79.99169
silhouette_score_new <- silhouette(kmeans_result_new$cluster, dist(Students_performance))
summary(silhouette_score_new)
## Silhouette of 1000 units in 3 clusters from silhouette.default(x = kmeans_result_new$cluster, dist = dist(Students_performance)) :
## Cluster sizes and average silhouette widths:
## 358 281 361
## 0.43104728 0.12795533 0.09858885
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.47774 0.07697 0.19798 0.22586 0.40457 0.62941
The addition of a third cluster has not improved the separation; it has worsened compared to the previous 2-cluster case. Cluster 1 (average silhouette width 0.43) shows the best separation, while Clusters 2 and 3 show weak clustering, with negative silhouette widths for some points. The low overall mean of 0.226 indicates weak clustering.
I am now trying Hierarchical Clustering in order to check if it offers better separation than the K-means clustering.
dist_matrix <- dist(pca_data, method = "euclidean")
hc <- hclust(dist_matrix, method = "ward.D2")
hc_clusters <- cutree(hc, k = 2)
Students_performance$Cluster <- hc_clusters
library(dendextend)
# Color the dendrogram by cluster
hc_dend <- as.dendrogram(hc)
hc_colored <- color_branches(hc_dend, k = 2)
plot(hc_colored, main = "Colored Dendrogram")
aggregate(. ~ Cluster, data = Students_performance, FUN = mean)
## Cluster Gender Race.Ethnicity ParentsEducation Lunch
## 1 1 1.383701 3.361630 3.331070 1.816638
## 2 2 1.622871 2.905109 2.722628 1.399027
## TestPreparationCourse MathScore ReadingScore WritingScore
## 1 1.602716 73.58744 77.10017 76.59253
## 2 1.698297 55.34307 57.80292 55.81752
library(cluster)
silhouette_result_hh <- silhouette(cutree(hc, k = 2), dist(Students_performance))
summary(silhouette_result_hh)
## Silhouette of 1000 units in 2 clusters from silhouette.default(x = cutree(hc, k = 2), dist = dist(Students_performance)) :
## Cluster sizes and average silhouette widths:
## 589 411
## 0.3150273 0.2800987
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5451 0.1701 0.4447 0.3007 0.5138 0.5709
The K-means solution has a higher mean silhouette width (0.4355) and median silhouette width (0.5243), which indicates better clustering quality overall compared to Hierarchical Clustering, whose mean (0.3007) and median (0.4447) silhouette widths are lower.
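This comparison can be reproduced directly from the two silhouette objects (a small sketch; a silhouette object is a matrix whose sil_width column holds the per-observation widths):
# Mean silhouette width: 2-cluster K-means vs. 2-cluster hierarchical solution
mean(silhouette_score[, "sil_width"])
mean(silhouette_result_hh[, "sil_width"])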
Cluster 1 students have higher parental education levels and more often receive standard lunches, possibly reflecting higher socioeconomic status. In contrast, Cluster 2 students tend to receive free/reduced lunches and have lower parental education levels, reflecting lower socioeconomic status. Cluster 1 leans toward females, while Cluster 2 leans toward males. Cluster 1 students outperform Cluster 2 students in all three exam scores (math, reading, and writing). A considerable number of students in both clusters completed the test preparation course, but this appears to coincide with higher scores mainly in Cluster 1. Cluster 1 students may benefit from their relatively stronger socioeconomic background and parental support; they may require advanced academic challenges to further develop their performance. Cluster 2 highlights the importance of interventions such as additional support for academic skills in math, reading, and writing.
K-means Clustering with 2 clusters demonstrates better overall clustering quality based on the silhouette analysis, though it is not perfect: it has higher mean and median silhouette widths, indicating more cohesive and better-separated clusters. Hierarchical Clustering with 2 clusters has lower average silhouette widths and more negative individual silhouette widths - poorer clustering quality. The clusters are not as well defined or cohesive in this case.