In this R Markdown session, I will revisit the “Students Performance in Exams” dataset and perform a few clustering techniques. Clustering is an unsupervised learning method, meaning algorithms are used to analyze and group unlabeled data. Clustering is effective for grouping data based on similar characteristics, as well as for finding trends and patterns within the data.
The dataset used can be found on the Kaggle website, an online platform for data scientists containing free datasets and code collaboration. Below is the link for the dataset:
https://www.kaggle.com/datasets/spscientist/students-performance-in-exams.
The data contains eight variables used to explore the effect of various factors on test scores. The variables are: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, and writing score.
Note: this is a fictional dataset used strictly to demonstrate beginner data analysis skills. The results are not official and should not be used to draw conclusions about actual relationships between the variables listed and education.
First, the packages used in this session are loaded (dplyr for the data wrangling and cluster for the agnes function), and the dataset is read into R and saved as “student_data”.
#Load the packages used in this session.
library(dplyr)
library(cluster)
#Read the dataset into R.
student_data <- read.csv("~/R datasets/StudentsPerformance.csv")
Now that the dataset is loaded into R, the next step is to view the data and see if it’s clean for analysis.
#Inspect the data frame.
head(student_data)
## gender race.ethnicity parental.level.of.education lunch
## 1 female group B bachelor's degree standard
## 2 female group C some college standard
## 3 female group B master's degree standard
## 4 male group A associate's degree free/reduced
## 5 male group C some college standard
## 6 female group B associate's degree standard
## test.preparation.course math.score reading.score writing.score
## 1 none 72 72 74
## 2 completed 69 90 88
## 3 none 90 95 93
## 4 none 47 57 44
## 5 none 76 78 75
## 6 none 71 83 78
#View the column names.
colnames(student_data)
## [1] "gender" "race.ethnicity"
## [3] "parental.level.of.education" "lunch"
## [5] "test.preparation.course" "math.score"
## [7] "reading.score" "writing.score"
#View summary of the data frame.
summary(student_data)
## gender race.ethnicity parental.level.of.education
## Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## lunch test.preparation.course math.score reading.score
## Length:1000 Length:1000 Min. : 0.00 Min. : 17.00
## Class :character Class :character 1st Qu.: 57.00 1st Qu.: 59.00
## Mode :character Mode :character Median : 66.00 Median : 70.00
## Mean : 66.09 Mean : 69.17
## 3rd Qu.: 77.00 3rd Qu.: 79.00
## Max. :100.00 Max. :100.00
## writing.score
## Min. : 10.00
## 1st Qu.: 57.75
## Median : 69.00
## Mean : 68.05
## 3rd Qu.: 79.00
## Max. :100.00
#View data types in the data frame.
str(student_data)
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : chr "female" "female" "female" "male" ...
## $ race.ethnicity : chr "group B" "group C" "group B" "group A" ...
## $ parental.level.of.education: chr "bachelor's degree" "some college" "master's degree" "associate's degree" ...
## $ lunch : chr "standard" "standard" "standard" "free/reduced" ...
## $ test.preparation.course : chr "none" "completed" "none" "none" ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
I don’t like the “.” in the column names, so each “.” is changed to a “_” for easier reading. Also, the “race.ethnicity” variable is shortened to “ethnicity”. The updated data frame is renamed “students” and is used for the rest of the session.
#Rename the columns in the student data frame.
students <- student_data %>%
  rename(ethnicity = race.ethnicity,
         parental_education = parental.level.of.education,
         test_prep_course = test.preparation.course,
         math_score = math.score,
         reading_score = reading.score,
         writing_score = writing.score)
#View the updated column names in the student data frame.
colnames(students)
## [1] "gender" "ethnicity" "parental_education"
## [4] "lunch" "test_prep_course" "math_score"
## [7] "reading_score" "writing_score"
Next, I want to check if there are any missing values in the dataset.
#Print the total number of missing values in the data frame.
sum(is.na(students))
## [1] 0
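For a per-column view, the missing values can also be counted column by column (an optional check using base R; all zeros are expected here):
#Count the missing values in each column of the data frame.
colSums(is.na(students))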
Now that I know there aren’t any missing values, the next check is for duplicate rows in the dataset.
#Create a variable storing a logical flag for each row: TRUE if the row is a duplicate.
duplicates <- students %>% duplicated()
#Display the counts as a table: rows that are not duplicates are counted under 'FALSE', duplicated rows under 'TRUE'.
duplicates_count <- duplicates %>% table()
duplicates_count
## .
## FALSE
## 1000
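As a side note, the same information can be obtained in one line, since TRUE values count as 1 when summed (an optional sketch):
#One-liner alternative: total number of duplicated rows (0 expected here).
sum(duplicated(students))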
The data frame is plotted as a scatterplot matrix to view the pairwise relationships between variables. From the plot, it is also easy to see which variables are categorical (gender, ethnicity, parental education, lunch, test prep course) and which are numerical (math score, reading score, writing score).
#Plot the data frame as a scatterplot matrix.
plot(students)
Now that the data has been viewed and cleaned, the clustering analysis can begin.
Hierarchical clustering builds a hierarchy (or tree) of clusters. The agglomerative approach used here starts with each data point as its own cluster and merges the two closest clusters at each iteration. In the end, a dendrogram is created, which shows the clustering in a tree-based representation.
There are various types of hierarchical clustering methods. In this case, average linkage clustering and complete linkage clustering will be used. Average linkage uses the average pairwise distance between the points of two clusters, while complete linkage uses the largest pairwise distance between them.
First, the average linkage clustering method is run:
#Average Linkage Clustering Method
dend_ave <- hclust(dist(students), method = "average")
## Warning in dist(students): NAs introduced by coercion
plot(dend_ave, main = "Average Link Clustering")
Next, the complete linkage clustering is run:
#Complete Linkage Clustering Method
dend_comp <- hclust(dist(students), method = "complete")
## Warning in dist(students): NAs introduced by coercion
plot(dend_comp, main = "Complete Link Clustering")
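The “NAs introduced by coercion” warning appears because dist() only accepts numeric data: the character columns (gender, ethnicity, etc.) are coerced to NA, so the distances above are effectively driven by the three score columns. One way to include the categorical variables explicitly is the daisy function in the cluster package, which supports Gower distance for mixed data. Below is a sketch (converting the character columns to factors is an assumption about how the categories would be encoded):
#Sketch: Gower distance handles mixed numeric/categorical data without coercion.
students_mixed <- students
students_mixed[] <- lapply(students_mixed, function(x) if (is.character(x)) factor(x) else x)
gower_dist <- daisy(students_mixed, metric = "gower")
dend_gower <- hclust(gower_dist, method = "average")
plot(dend_gower, main = "Average Link Clustering (Gower Distance)")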
Since clustering is an unsupervised learning method, the number of clusters appropriate for the data is defined by the user. From the plots, it appears the data can be split into four clusters. To help determine the number of clusters, a table is printed for both the average linkage and complete linkage clustering methods to see how many data points fall into clusters 1, 2, 3, and 4.
#Average Linkage Clustering Method tree cut
ave_cut <- cutree(dend_ave, k = 4)
#Table view for the 4 clusters
table(ave_cut)
## ave_cut
## 1 2 3 4
## 519 161 302 18
#Complete Linkage Clustering Method tree cut
com_cut <- cutree(dend_comp, k = 4)
#Table view for the 4 clusters
table(com_cut)
## com_cut
## 1 2 3 4
## 448 305 227 20
From the tables, there is a sharp drop-off between clusters 3 and 4 in both methods (only 18 and 20 points fall in the fourth cluster), which suggests most of the data points can be grouped into 3 clusters.
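To check this, the trees can also be cut at k = 3 and the group sizes compared (a quick sketch; output not shown):
#Sketch: cut both trees into 3 clusters and compare the group sizes.
table(cutree(dend_ave, k = 3))
table(cutree(dend_comp, k = 3))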
How do you determine which method is more effective? One way is to use the agnes function in the cluster package. The agnes function returns the agglomerative coefficient, which measures the strength of the clustering structure on a scale from 0 to 1, with values close to 1 suggesting a very strong clustering structure.
#Agnes function created for complete and average linkage clustering methods
agnes_comp <- agnes(students, method = "complete")
agnes_ave <- agnes(students, method = "average")
#Agglomerative Coefficient (AC) values printed
agnes_comp$ac
## [1] 0.9785117
agnes_ave$ac
## [1] 0.9531943
Both complete linkage and average linkage have a very high agglomerative coefficient (ac) value, which means both models have strong clustering structures.
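The same comparison can be extended to the other linkage methods that agnes supports (a sketch; the exact coefficients depend on the data):
#Sketch: compute the agglomerative coefficient for several linkage methods.
linkage_methods <- c("average", "single", "complete", "ward")
sapply(linkage_methods, function(m) agnes(students, method = m)$ac)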
Another clustering method is k-means clustering. K-means clustering places k centroids, allocates each data point to the nearest centroid, and then updates the centroids until the assignments stabilize.
Since k-means clustering works on numeric variables, a new data frame is created using only the numeric variables in the students data frame.
#New data frame with the numeric variables only
students1 <- students %>% select(math_score, reading_score, writing_score)
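One note before fitting the models: kmeans() starts from randomly chosen centroids, so the cluster numbering (and occasionally the solution itself) can vary between runs. Setting a seed first makes the results reproducible, e.g.:
#Sketch: fix the random seed so the k-means runs below are reproducible.
set.seed(123)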
Next, two k-means models are fit: one with 4 centers and another with 5 centers.
#K-means clustering with 4 and 5 centers
kmeans(students1, centers = 4, nstart = 10)
## K-means clustering with 4 clusters of sizes 356, 299, 159, 186
##
## Cluster means:
## math_score reading_score writing_score
## 1 71.36517 74.93820 73.99438
## 2 59.21739 61.86622 60.94314
## 3 44.50314 46.98113 44.56604
## 4 85.48925 88.83333 88.19355
##
## Clustering vector:
## [1] 1 4 4 3 1 1 4 3 2 3 2 3 1 1 2 1 4 3 3 2 2 1 3 1 1 1 2 1 1 1 1 2 2 3 4 1 1
## [38] 2 4 2 2 2 2 2 2 2 2 1 1 4 3 1 3 1 4 3 4 3 2 3 1 3 2 1 2 2 3 1 2 2 2 2 3 2
## [75] 3 3 3 1 1 2 3 3 3 2 3 1 4 1 2 1 1 3 1 3 4 1 2 1 2 2 1 1 4 3 4 2 4 2 2 1 4
## [112] 2 3 2 4 1 4 1 2 2 4 4 4 2 1 4 1 1 1 3 4 3 1 1 1 2 3 2 2 2 1 2 3 3 1 3 4 1
## [149] 1 4 2 1 2 3 2 4 1 2 1 2 1 4 3 3 4 4 3 1 1 1 1 4 4 2 3 4 3 1 2 4 1 2 3 1 3
## [186] 2 1 2 3 4 2 1 2 2 1 2 2 2 3 1 1 1 1 2 3 1 1 1 1 2 1 3 2 2 4 4 4 3 1 2 2 4
## [223] 2 1 2 3 1 2 1 4 1 3 1 4 4 1 2 2 3 4 1 4 2 3 1 1 1 2 2 2 3 1 2 1 3 1 1 1 1
## [260] 1 1 1 3 4 1 3 1 1 4 1 2 3 3 1 4 1 4 2 1 2 3 3 1 1 3 4 4 1 1 1 1 1 2 1 1 2
## [297] 3 1 3 4 1 2 1 1 1 1 4 3 2 3 1 2 2 1 2 1 4 1 1 2 1 1 1 3 3 4 2 3 1 3 2 3 2
## [334] 4 4 2 1 3 3 3 2 2 1 1 1 1 2 4 1 1 2 2 1 2 2 1 2 2 2 4 2 4 2 3 2 2 2 2 2 4
## [371] 1 2 1 4 1 3 4 4 1 2 4 4 1 3 3 1 1 1 2 1 2 1 1 2 1 3 2 4 2 2 2 3 2 4 2 1 2
## [408] 1 2 4 1 4 2 2 1 1 1 1 2 2 4 2 2 1 3 2 4 1 2 2 2 1 2 3 1 3 1 1 2 4 1 1 1 1
## [445] 1 1 2 4 3 1 1 4 1 2 2 3 4 3 4 1 2 3 1 4 2 4 3 1 4 1 4 1 1 2 4 1 1 1 2 1 1
## [482] 2 2 3 3 1 2 1 2 4 1 2 4 4 2 2 3 1 1 1 1 4 2 4 3 4 1 2 1 4 1 3 2 2 4 4 4 1
## [519] 1 1 2 4 2 2 3 2 2 3 3 1 1 3 2 4 1 1 2 2 1 4 1 1 1 4 2 1 4 1 2 1 1 4 3 1 3
## [556] 3 1 2 2 2 1 1 4 1 3 3 4 1 3 1 1 4 2 2 1 3 2 4 2 2 4 1 1 1 1 1 1 2 2 2 2 2
## [593] 2 1 4 1 3 3 1 1 2 3 1 2 4 1 4 3 2 2 2 2 4 1 4 2 3 4 4 1 3 2 2 4 2 4 1 3 3
## [630] 3 2 1 1 4 4 1 1 4 1 1 3 4 1 1 2 1 2 2 3 1 2 1 4 1 1 2 1 2 2 4 1 1 2 2 1 2
## [667] 1 4 1 1 1 3 1 1 1 2 1 1 1 2 1 2 2 3 2 4 1 1 3 4 3 1 1 1 2 4 4 1 1 2 4 2 4
## [704] 2 2 2 3 2 4 2 4 4 4 4 1 4 1 4 1 4 1 2 4 3 3 1 1 2 4 3 1 3 4 3 2 2 4 2 1 2
## [741] 1 3 4 1 3 1 1 2 2 4 1 1 1 4 2 4 2 1 2 1 2 2 4 2 2 1 1 2 1 2 3 1 2 1 2 2 1
## [778] 3 1 4 2 4 4 2 4 3 1 3 2 2 2 2 1 4 3 2 1 1 2 2 1 1 4 4 1 1 1 3 1 2 3 3 2 4
## [815] 1 4 2 1 2 4 4 4 3 1 3 2 2 1 1 2 2 4 2 4 3 2 2 1 2 1 3 2 3 1 2 4 4 2 2 1 1
## [852] 2 4 1 2 4 1 1 3 1 2 4 3 1 4 4 2 3 1 3 2 1 4 4 2 1 4 2 1 1 2 1 1 3 3 1 4 2
## [889] 1 3 4 4 2 1 2 3 3 1 2 1 4 1 3 4 1 4 3 4 1 1 3 1 2 2 2 1 4 3 1 4 1 3 1 2 1
## [926] 2 2 2 3 3 1 2 1 1 4 2 2 2 4 1 1 4 1 2 2 2 4 2 3 1 1 1 1 2 2 2 4 4 2 1 2 3
## [963] 4 1 2 1 2 2 1 1 4 1 2 2 2 1 2 2 3 4 3 1 4 4 1 2 3 1 3 1 4 1 1 1 2 4 2 2 1
## [1000] 4
##
## Within cluster sum of squares by cluster:
## [1] 39569.16 34401.55 38245.75 24147.34
## (between_SS / total_SS = 79.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
kmeans(students1, centers = 5, nstart = 10)
## K-means clustering with 5 clusters of sizes 220, 57, 271, 287, 165
##
## Cluster means:
## math_score reading_score writing_score
## 1 52.58636 54.70455 52.74545
## 2 35.98246 39.28070 36.92982
## 3 73.78229 77.40959 76.32841
## 4 63.67596 66.61324 66.02091
## 5 86.05455 89.69091 89.16364
##
## Clustering vector:
## [1] 3 5 5 1 3 3 5 2 4 1 1 2 3 3 1 3 5 2 2 1 4 4 1 3 3 3 1 4 4 4 3 4 4 2 5 3 3
## [38] 1 5 1 1 4 1 4 1 1 4 3 4 5 1 3 1 3 3 2 5 1 1 2 3 2 1 3 1 4 2 4 1 1 4 4 1 1
## [75] 1 2 2 3 4 4 1 1 1 4 2 3 5 3 4 3 4 2 3 1 5 3 4 4 4 4 3 3 5 1 5 4 5 4 4 4 5
## [112] 1 1 1 5 3 3 3 1 4 5 5 5 1 3 5 4 3 3 1 5 2 3 3 3 1 1 4 4 4 3 4 1 1 3 2 5 4
## [149] 3 5 4 4 4 1 4 5 3 4 3 4 3 5 1 1 5 5 1 3 3 3 3 5 3 4 1 5 1 3 4 5 4 1 1 4 1
## [186] 4 3 4 1 5 4 3 4 4 3 1 4 1 1 3 3 3 3 4 1 3 4 3 3 4 3 2 1 1 5 3 5 2 3 1 4 3
## [223] 4 3 4 1 3 1 4 5 4 2 3 5 5 3 4 4 1 3 4 3 1 1 3 3 3 4 4 4 1 3 4 3 1 4 4 3 3
## [260] 3 3 3 1 5 3 1 3 3 5 3 4 1 1 4 5 3 5 4 3 4 1 1 3 3 2 3 5 3 3 3 3 3 4 3 3 4
## [297] 2 3 2 5 3 1 3 3 3 4 5 1 4 1 3 4 4 4 4 4 5 3 4 4 3 3 3 1 2 5 4 2 3 1 4 2 1
## [334] 5 5 4 3 1 2 1 1 4 3 4 3 3 4 5 3 3 4 1 3 1 4 4 4 4 4 5 4 5 1 2 1 1 4 4 1 3
## [371] 3 4 3 5 4 2 5 5 3 4 5 5 3 2 2 3 4 4 4 3 4 3 4 1 3 2 4 5 4 1 4 1 1 5 1 4 4
## [408] 3 1 5 3 3 4 4 4 3 3 3 4 1 5 4 1 3 2 4 5 4 1 4 4 4 1 1 4 1 4 3 1 5 3 3 4 3
## [445] 3 3 4 5 2 3 3 5 3 1 1 1 5 1 5 4 1 1 3 5 4 5 2 4 5 3 5 4 3 4 5 3 4 3 4 3 4
## [482] 1 4 1 1 3 1 4 1 3 3 4 5 5 4 4 1 3 4 3 3 5 1 5 1 3 4 1 3 5 3 1 1 1 5 5 5 3
## [519] 3 3 1 5 4 1 1 4 1 2 2 4 4 1 4 5 4 3 1 4 3 5 4 3 3 5 4 3 5 4 4 3 3 5 1 4 1
## [556] 2 4 4 4 4 3 4 5 3 1 1 5 4 1 4 3 5 1 4 3 1 1 5 1 4 5 3 3 3 3 3 4 1 4 4 1 4
## [593] 4 3 5 4 2 1 3 3 1 2 3 1 5 4 5 1 4 4 4 4 5 3 5 4 2 5 5 4 1 4 1 5 4 5 4 1 1
## [630] 1 4 4 3 5 5 4 3 5 3 3 1 5 3 3 4 3 4 4 1 3 1 3 5 4 3 4 4 4 1 5 3 4 4 4 4 1
## [667] 4 5 4 3 4 1 3 3 3 4 3 3 3 4 3 4 1 2 4 5 3 3 1 5 1 3 3 3 1 5 5 4 3 4 3 4 5
## [704] 4 4 4 2 1 5 1 5 5 5 3 4 5 3 5 3 3 4 1 5 1 2 3 3 1 5 2 3 1 5 1 1 4 5 4 3 1
## [741] 3 1 5 4 1 3 3 4 1 5 4 4 3 5 1 5 1 4 1 3 4 1 3 4 4 3 4 4 3 1 1 3 4 3 4 1 4
## [778] 2 3 5 1 5 3 1 5 2 3 2 4 4 1 4 4 5 1 4 4 3 4 1 4 3 5 5 3 3 3 2 3 1 2 1 4 5
## [815] 3 5 1 4 4 5 5 5 1 3 1 4 4 4 3 1 1 5 4 5 1 4 4 3 1 3 1 1 2 3 1 5 5 1 4 3 4
## [852] 4 5 4 4 5 4 3 1 3 1 5 2 3 5 5 1 1 3 1 1 3 3 5 1 4 5 1 4 3 4 4 4 1 1 3 5 4
## [889] 4 1 5 5 4 3 4 2 2 3 4 3 5 3 2 5 3 3 1 5 3 4 1 3 4 1 1 4 5 1 3 5 4 2 4 4 3
## [926] 4 1 4 2 1 4 4 4 3 5 4 1 1 5 3 4 5 4 4 4 1 3 1 1 4 3 3 3 1 4 4 5 5 1 3 4 1
## [963] 5 3 4 4 4 4 4 3 5 3 1 1 4 3 4 4 1 5 2 3 5 5 3 1 1 3 2 3 3 3 4 4 4 5 1 4 3
## [1000] 5
##
## Within cluster sum of squares by cluster:
## [1] 24626.90 13218.21 24961.46 28190.81 20306.33
## (between_SS / total_SS = 83.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The summary for each model shows the cluster sizes, cluster means, within-cluster sum of squares, and the clustering vector. Based on the cluster sizes and sum of squares, grouping into 4 clusters appears more effective: the between_SS / total_SS ratio improves only modestly from 4 to 5 clusters (79.7% to 83.5%), and the cluster sizes are more evenly balanced with 4 clusters.
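A more systematic way to choose k is the elbow method: the total within-cluster sum of squares is computed for a range of k values and plotted, and the “elbow” where the curve flattens suggests a reasonable number of clusters. Below is a sketch (the range 1 to 10 is an arbitrary choice):
#Sketch: elbow plot of the total within-cluster sum of squares for k = 1 to 10.
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(students1, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters (k)", ylab = "Total within-cluster sum of squares", main = "Elbow Method")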
This session investigated the use of clustering on the “Students Performance in Exams” dataset. Two approaches were explored: hierarchical clustering and k-means clustering.
For hierarchical clustering, the average linkage and complete linkage methods were used. After viewing the plots and cutting the trees, tables were created to see the number of data points in each cluster. From the tables, the fourth cluster is very small in both methods, suggesting most of the data points fall into 3 clusters. Looking at the agglomerative coefficient, both hierarchical methods have a large coefficient (0.98 and 0.95), which implies a strong clustering structure.
Next, k-means clustering was examined using the numeric variables. K-means models with 4 and 5 centers were explored. After looking at the summaries, the model with 4 centers appears to be the more effective choice.
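As a possible follow-up (not part of the original analysis), the 4-center solution could be visualized by coloring a scatterplot of two of the score variables by cluster assignment, for example:
#Sketch: scatterplot of math vs. reading scores, colored by k-means cluster.
set.seed(123)
km4 <- kmeans(students1, centers = 4, nstart = 10)
plot(students1$math_score, students1$reading_score, col = km4$cluster, pch = 19, xlab = "math_score", ylab = "reading_score", main = "K-means Clustering (k = 4)")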
Thank you for viewing this R Markdown session.