In this R Markdown session, I will revisit the “Students Performance in Exams” dataset and apply a few clustering techniques. Clustering is an unsupervised learning method, meaning the algorithm analyzes and groups unlabeled data without predefined targets. Clustering is effective for grouping data based on similar characteristics, as well as for finding trends and patterns within the data.

About the Dataset

The dataset used can be found on Kaggle, an online platform for data scientists that hosts free datasets and supports code collaboration. Below is the link to the dataset:

https://www.kaggle.com/datasets/spscientist/students-performance-in-exams

The data contains eight variables used to explore the effect of various factors on test scores. The variables are: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, and writing score.

Note: this is a fictional dataset used strictly to demonstrate beginner data analysis skills. The results are not official and should not be used to draw conclusions about actual relationships between the variables listed and education.

First, the packages used in this session are loaded, then the dataset is read into R and saved as “student_data”.

#Load the required packages and read in the dataset.
library(dplyr)   #for the pipe operator, rename(), and select()
library(cluster) #for the agnes() function
student_data <- read.csv("~/R datasets/StudentsPerformance.csv")

Viewing and Cleaning the Data

Now that the dataset is loaded into R, the next step is to view the data and see if it’s clean for analysis.

#Inspect the data frame.
head(student_data)
##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78
#View the column names.
colnames(student_data)
## [1] "gender"                      "race.ethnicity"             
## [3] "parental.level.of.education" "lunch"                      
## [5] "test.preparation.course"     "math.score"                 
## [7] "reading.score"               "writing.score"
#View summary of the data frame.
summary(student_data)
##     gender          race.ethnicity     parental.level.of.education
##  Length:1000        Length:1000        Length:1000                
##  Class :character   Class :character   Class :character           
##  Mode  :character   Mode  :character   Mode  :character           
##                                                                   
##                                                                   
##                                                                   
##     lunch           test.preparation.course   math.score     reading.score   
##  Length:1000        Length:1000             Min.   :  0.00   Min.   : 17.00  
##  Class :character   Class :character        1st Qu.: 57.00   1st Qu.: 59.00  
##  Mode  :character   Mode  :character        Median : 66.00   Median : 70.00  
##                                             Mean   : 66.09   Mean   : 69.17  
##                                             3rd Qu.: 77.00   3rd Qu.: 79.00  
##                                             Max.   :100.00   Max.   :100.00  
##  writing.score   
##  Min.   : 10.00  
##  1st Qu.: 57.75  
##  Median : 69.00  
##  Mean   : 68.05  
##  3rd Qu.: 79.00  
##  Max.   :100.00
#View data types in the data frame.
str(student_data)
## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : chr  "female" "female" "female" "male" ...
##  $ race.ethnicity             : chr  "group B" "group C" "group B" "group A" ...
##  $ parental.level.of.education: chr  "bachelor's degree" "some college" "master's degree" "associate's degree" ...
##  $ lunch                      : chr  "standard" "standard" "standard" "free/reduced" ...
##  $ test.preparation.course    : chr  "none" "completed" "none" "none" ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...

I don’t like the “.” in the column names, so each “.” is changed to a “_” for easier reading. The “race.ethnicity” variable is also shortened to “ethnicity”. The renamed data frame is saved as “students”, which is used for the rest of the session.

#Rename the columns in the student data frame.
students <- student_data %>%
  rename(ethnicity = race.ethnicity,
         parental_education = parental.level.of.education,
         test_prep_course = test.preparation.course,
         math_score = math.score,
         reading_score = reading.score,
         writing_score = writing.score)
#View the updated column names in the student data frame.
colnames(students)
## [1] "gender"             "ethnicity"          "parental_education"
## [4] "lunch"              "test_prep_course"   "math_score"        
## [7] "reading_score"      "writing_score"

Next, I want to check if there are any missing values in the dataset.

#Print the total number of missing values in the data frame.
sum(is.na(students))
## [1] 0
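A per-column breakdown can confirm that no individual variable has missing entries; a quick sketch using base R:

#Count missing values per column (all zeros expected, given the total above).
colSums(is.na(students))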

Now that I know there aren’t any missing values, next I check to see if there are any duplicates in the dataset.

#Create a logical vector flagging duplicated rows in the data frame.
duplicates <- students %>% duplicated()
#Tabulate the result: unique rows are counted under 'FALSE', duplicated rows under 'TRUE'.
duplicates_count <- duplicates %>% table()
duplicates_count
## .
## FALSE 
##  1000
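For reference, the same check can be done in one line:

#Equivalent one-liner: count duplicated rows directly.
sum(duplicated(students))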

The data frame is plotted as a scatterplot matrix to examine how the variables relate; the three score variables show a clear linear relationship with one another. The plot also makes it easy to see which variables are categorical (gender, ethnicity, parental education, lunch, test prep course) and which are numerical (math score, reading score, writing score).

#Plot the data frame as a scatterplot matrix.
plot(students)

Now that the data has been viewed and cleaned, the clustering analysis can begin.

Hierarchical Clustering Analysis

Hierarchical clustering builds a hierarchy (or tree) of clusters. The agglomerative approach used here starts with each data point as its own cluster and merges the two closest clusters at each iteration until a single cluster remains. In the end, a dendrogram is created, which shows the clustering in a tree-based representation.

There are various types of hierarchical clustering methods. In this case, average linkage clustering and complete linkage clustering will be used. Average linkage defines the distance between two clusters as the average of all pairwise distances between their points, while complete linkage uses the largest pairwise distance.
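As a quick illustration of the difference, here is a minimal sketch on two made-up one-dimensional “clusters” (the values are hypothetical, chosen only to contrast the two criteria):

#Two tiny made-up clusters on the number line.
a <- c(1, 2)
b <- c(8, 9, 10)
#All pairwise distances between points in the two clusters.
d <- abs(outer(a, b, "-"))
mean(d)  #average linkage distance: 7.5
max(d)   #complete linkage distance: 9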

First, the average linkage clustering method is run:

#Average Linkage Clustering Method
dend_ave <- hclust(dist(students), method = "average")
## Warning in dist(students): NAs introduced by coercion
plot(dend_ave, main = "Average Link Clustering")

Next, the complete linkage clustering is run:

#Complete Linkage Clustering Method 
dend_comp <- hclust(dist(students), method = "complete")
## Warning in dist(students): NAs introduced by coercion
plot(dend_comp, main = "Complete Link Clustering")
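The coercion warnings appear because dist() only accepts numeric input: the character columns are coerced to NA and dropped from the calculation, so the distances are driven entirely by the three score columns. A warning-free alternative is to pass only the numeric columns explicitly; a minimal sketch (the helper name score_cols is mine):

#Compute distances from the numeric score columns only (no coercion warning).
score_cols <- students[, c("math_score", "reading_score", "writing_score")]
dend_ave_clean <- hclust(dist(score_cols), method = "average")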

Since clustering is an unsupervised learning method, the number of clusters appropriate for the data is defined by the user. From the plots, it appears the data can be split into four clusters. To help settle on a number, a table is printed for both the average linkage and complete linkage methods to see how many data points fall into clusters 1, 2, 3, and 4.

#Average Linkage Clustering Method tree cut
ave_cut <- cutree(dend_ave, k = 4)
#Table view for the 4 clusters
table(ave_cut)
## ave_cut
##   1   2   3   4 
## 519 161 302  18
#Complete Linkage Clustering Method tree cut
com_cut <- cutree(dend_comp, k = 4)
#Table view for the 4 clusters
table(com_cut)
## com_cut
##   1   2   3   4 
## 448 305 227  20

In both tables, there is a sharp drop-off between cluster 3 and cluster 4 (cluster 4 holds only 18 and 20 points, respectively), which suggests most of the data points can be grouped into 3 clusters.
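To get a feel for what the groups capture, the labels from cutree() can be attached back to the data and the mean scores compared per cluster; a sketch using the average-linkage cut from above (the helper name students_ave is mine):

#Attach the average-linkage cluster labels and compare mean scores per cluster.
students_ave <- students
students_ave$cluster <- ave_cut
aggregate(cbind(math_score, reading_score, writing_score) ~ cluster,
          data = students_ave, FUN = mean)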

How do you determine which method is more effective? One way is to use the agnes function in the cluster package. The agnes function returns the agglomerative coefficient, which measures the strength of the clustering structure on a scale of 0 to 1, with values close to 1 suggesting a very strong clustering structure.

#Agnes function created for complete and average linkage clustering methods
agnes_comp <- agnes(students, method = "complete")
agnes_ave <- agnes(students, method = "average")
#Agglomerative Coefficient (AC) values printed
agnes_comp$ac
## [1] 0.9785117
agnes_ave$ac
## [1] 0.9531943

Both complete linkage and average linkage have a very high agglomerative coefficient (ac) value, which means both models have strong clustering structures.
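The same comparison extends naturally to other linkage methods; a minimal sketch looping over several of the methods agnes() supports (single and ward are included purely for illustration):

#Compare agglomerative coefficients across several linkage methods.
linkages <- c("average", "single", "complete", "ward")
sapply(linkages, function(m) agnes(students, method = m)$ac)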

K-Means Clustering

Another clustering method is k-means clustering. K-means clustering places k centroids, assigns each data point to the nearest centroid, then recomputes each centroid as the mean of its assigned points, repeating until the assignments stabilize.

Since k-means clustering only works on numeric variables, a new data frame is created containing just the numeric variables from the students data frame.

#New data frame with the numeric variables only
students1 <- students%>%select(math_score, reading_score, writing_score)
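Before fixing the number of centers, a common check is the elbow method: fit k-means over a range of k and look for where the total within-cluster sum of squares stops dropping sharply. A minimal sketch (the seed is arbitrary and added only for reproducibility):

#Elbow method: total within-cluster sum of squares for k = 1 to 10.
set.seed(42)
wss <- sapply(1:10, function(k) kmeans(students1, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")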

Next, two k-means models are fit - one with 4 centers and another with 5 centers. Because the starting centroids are chosen at random, the exact cluster sizes and labels can vary slightly between runs.

#K-means clustering with 4 and 5 centers
kmeans(students1, centers = 4, nstart = 10)
## K-means clustering with 4 clusters of sizes 356, 299, 159, 186
## 
## Cluster means:
##   math_score reading_score writing_score
## 1   71.36517      74.93820      73.99438
## 2   59.21739      61.86622      60.94314
## 3   44.50314      46.98113      44.56604
## 4   85.48925      88.83333      88.19355
## 
## Clustering vector:
##    [1] 1 4 4 3 1 1 4 3 2 3 2 3 1 1 2 1 4 3 3 2 2 1 3 1 1 1 2 1 1 1 1 2 2 3 4 1 1
##   [38] 2 4 2 2 2 2 2 2 2 2 1 1 4 3 1 3 1 4 3 4 3 2 3 1 3 2 1 2 2 3 1 2 2 2 2 3 2
##   [75] 3 3 3 1 1 2 3 3 3 2 3 1 4 1 2 1 1 3 1 3 4 1 2 1 2 2 1 1 4 3 4 2 4 2 2 1 4
##  [112] 2 3 2 4 1 4 1 2 2 4 4 4 2 1 4 1 1 1 3 4 3 1 1 1 2 3 2 2 2 1 2 3 3 1 3 4 1
##  [149] 1 4 2 1 2 3 2 4 1 2 1 2 1 4 3 3 4 4 3 1 1 1 1 4 4 2 3 4 3 1 2 4 1 2 3 1 3
##  [186] 2 1 2 3 4 2 1 2 2 1 2 2 2 3 1 1 1 1 2 3 1 1 1 1 2 1 3 2 2 4 4 4 3 1 2 2 4
##  [223] 2 1 2 3 1 2 1 4 1 3 1 4 4 1 2 2 3 4 1 4 2 3 1 1 1 2 2 2 3 1 2 1 3 1 1 1 1
##  [260] 1 1 1 3 4 1 3 1 1 4 1 2 3 3 1 4 1 4 2 1 2 3 3 1 1 3 4 4 1 1 1 1 1 2 1 1 2
##  [297] 3 1 3 4 1 2 1 1 1 1 4 3 2 3 1 2 2 1 2 1 4 1 1 2 1 1 1 3 3 4 2 3 1 3 2 3 2
##  [334] 4 4 2 1 3 3 3 2 2 1 1 1 1 2 4 1 1 2 2 1 2 2 1 2 2 2 4 2 4 2 3 2 2 2 2 2 4
##  [371] 1 2 1 4 1 3 4 4 1 2 4 4 1 3 3 1 1 1 2 1 2 1 1 2 1 3 2 4 2 2 2 3 2 4 2 1 2
##  [408] 1 2 4 1 4 2 2 1 1 1 1 2 2 4 2 2 1 3 2 4 1 2 2 2 1 2 3 1 3 1 1 2 4 1 1 1 1
##  [445] 1 1 2 4 3 1 1 4 1 2 2 3 4 3 4 1 2 3 1 4 2 4 3 1 4 1 4 1 1 2 4 1 1 1 2 1 1
##  [482] 2 2 3 3 1 2 1 2 4 1 2 4 4 2 2 3 1 1 1 1 4 2 4 3 4 1 2 1 4 1 3 2 2 4 4 4 1
##  [519] 1 1 2 4 2 2 3 2 2 3 3 1 1 3 2 4 1 1 2 2 1 4 1 1 1 4 2 1 4 1 2 1 1 4 3 1 3
##  [556] 3 1 2 2 2 1 1 4 1 3 3 4 1 3 1 1 4 2 2 1 3 2 4 2 2 4 1 1 1 1 1 1 2 2 2 2 2
##  [593] 2 1 4 1 3 3 1 1 2 3 1 2 4 1 4 3 2 2 2 2 4 1 4 2 3 4 4 1 3 2 2 4 2 4 1 3 3
##  [630] 3 2 1 1 4 4 1 1 4 1 1 3 4 1 1 2 1 2 2 3 1 2 1 4 1 1 2 1 2 2 4 1 1 2 2 1 2
##  [667] 1 4 1 1 1 3 1 1 1 2 1 1 1 2 1 2 2 3 2 4 1 1 3 4 3 1 1 1 2 4 4 1 1 2 4 2 4
##  [704] 2 2 2 3 2 4 2 4 4 4 4 1 4 1 4 1 4 1 2 4 3 3 1 1 2 4 3 1 3 4 3 2 2 4 2 1 2
##  [741] 1 3 4 1 3 1 1 2 2 4 1 1 1 4 2 4 2 1 2 1 2 2 4 2 2 1 1 2 1 2 3 1 2 1 2 2 1
##  [778] 3 1 4 2 4 4 2 4 3 1 3 2 2 2 2 1 4 3 2 1 1 2 2 1 1 4 4 1 1 1 3 1 2 3 3 2 4
##  [815] 1 4 2 1 2 4 4 4 3 1 3 2 2 1 1 2 2 4 2 4 3 2 2 1 2 1 3 2 3 1 2 4 4 2 2 1 1
##  [852] 2 4 1 2 4 1 1 3 1 2 4 3 1 4 4 2 3 1 3 2 1 4 4 2 1 4 2 1 1 2 1 1 3 3 1 4 2
##  [889] 1 3 4 4 2 1 2 3 3 1 2 1 4 1 3 4 1 4 3 4 1 1 3 1 2 2 2 1 4 3 1 4 1 3 1 2 1
##  [926] 2 2 2 3 3 1 2 1 1 4 2 2 2 4 1 1 4 1 2 2 2 4 2 3 1 1 1 1 2 2 2 4 4 2 1 2 3
##  [963] 4 1 2 1 2 2 1 1 4 1 2 2 2 1 2 2 3 4 3 1 4 4 1 2 3 1 3 1 4 1 1 1 2 4 2 2 1
## [1000] 4
## 
## Within cluster sum of squares by cluster:
## [1] 39569.16 34401.55 38245.75 24147.34
##  (between_SS / total_SS =  79.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
kmeans(students1, centers = 5, nstart = 10)
## K-means clustering with 5 clusters of sizes 220, 57, 271, 287, 165
## 
## Cluster means:
##   math_score reading_score writing_score
## 1   52.58636      54.70455      52.74545
## 2   35.98246      39.28070      36.92982
## 3   73.78229      77.40959      76.32841
## 4   63.67596      66.61324      66.02091
## 5   86.05455      89.69091      89.16364
## 
## Clustering vector:
##    [1] 3 5 5 1 3 3 5 2 4 1 1 2 3 3 1 3 5 2 2 1 4 4 1 3 3 3 1 4 4 4 3 4 4 2 5 3 3
##   [38] 1 5 1 1 4 1 4 1 1 4 3 4 5 1 3 1 3 3 2 5 1 1 2 3 2 1 3 1 4 2 4 1 1 4 4 1 1
##   [75] 1 2 2 3 4 4 1 1 1 4 2 3 5 3 4 3 4 2 3 1 5 3 4 4 4 4 3 3 5 1 5 4 5 4 4 4 5
##  [112] 1 1 1 5 3 3 3 1 4 5 5 5 1 3 5 4 3 3 1 5 2 3 3 3 1 1 4 4 4 3 4 1 1 3 2 5 4
##  [149] 3 5 4 4 4 1 4 5 3 4 3 4 3 5 1 1 5 5 1 3 3 3 3 5 3 4 1 5 1 3 4 5 4 1 1 4 1
##  [186] 4 3 4 1 5 4 3 4 4 3 1 4 1 1 3 3 3 3 4 1 3 4 3 3 4 3 2 1 1 5 3 5 2 3 1 4 3
##  [223] 4 3 4 1 3 1 4 5 4 2 3 5 5 3 4 4 1 3 4 3 1 1 3 3 3 4 4 4 1 3 4 3 1 4 4 3 3
##  [260] 3 3 3 1 5 3 1 3 3 5 3 4 1 1 4 5 3 5 4 3 4 1 1 3 3 2 3 5 3 3 3 3 3 4 3 3 4
##  [297] 2 3 2 5 3 1 3 3 3 4 5 1 4 1 3 4 4 4 4 4 5 3 4 4 3 3 3 1 2 5 4 2 3 1 4 2 1
##  [334] 5 5 4 3 1 2 1 1 4 3 4 3 3 4 5 3 3 4 1 3 1 4 4 4 4 4 5 4 5 1 2 1 1 4 4 1 3
##  [371] 3 4 3 5 4 2 5 5 3 4 5 5 3 2 2 3 4 4 4 3 4 3 4 1 3 2 4 5 4 1 4 1 1 5 1 4 4
##  [408] 3 1 5 3 3 4 4 4 3 3 3 4 1 5 4 1 3 2 4 5 4 1 4 4 4 1 1 4 1 4 3 1 5 3 3 4 3
##  [445] 3 3 4 5 2 3 3 5 3 1 1 1 5 1 5 4 1 1 3 5 4 5 2 4 5 3 5 4 3 4 5 3 4 3 4 3 4
##  [482] 1 4 1 1 3 1 4 1 3 3 4 5 5 4 4 1 3 4 3 3 5 1 5 1 3 4 1 3 5 3 1 1 1 5 5 5 3
##  [519] 3 3 1 5 4 1 1 4 1 2 2 4 4 1 4 5 4 3 1 4 3 5 4 3 3 5 4 3 5 4 4 3 3 5 1 4 1
##  [556] 2 4 4 4 4 3 4 5 3 1 1 5 4 1 4 3 5 1 4 3 1 1 5 1 4 5 3 3 3 3 3 4 1 4 4 1 4
##  [593] 4 3 5 4 2 1 3 3 1 2 3 1 5 4 5 1 4 4 4 4 5 3 5 4 2 5 5 4 1 4 1 5 4 5 4 1 1
##  [630] 1 4 4 3 5 5 4 3 5 3 3 1 5 3 3 4 3 4 4 1 3 1 3 5 4 3 4 4 4 1 5 3 4 4 4 4 1
##  [667] 4 5 4 3 4 1 3 3 3 4 3 3 3 4 3 4 1 2 4 5 3 3 1 5 1 3 3 3 1 5 5 4 3 4 3 4 5
##  [704] 4 4 4 2 1 5 1 5 5 5 3 4 5 3 5 3 3 4 1 5 1 2 3 3 1 5 2 3 1 5 1 1 4 5 4 3 1
##  [741] 3 1 5 4 1 3 3 4 1 5 4 4 3 5 1 5 1 4 1 3 4 1 3 4 4 3 4 4 3 1 1 3 4 3 4 1 4
##  [778] 2 3 5 1 5 3 1 5 2 3 2 4 4 1 4 4 5 1 4 4 3 4 1 4 3 5 5 3 3 3 2 3 1 2 1 4 5
##  [815] 3 5 1 4 4 5 5 5 1 3 1 4 4 4 3 1 1 5 4 5 1 4 4 3 1 3 1 1 2 3 1 5 5 1 4 3 4
##  [852] 4 5 4 4 5 4 3 1 3 1 5 2 3 5 5 1 1 3 1 1 3 3 5 1 4 5 1 4 3 4 4 4 1 1 3 5 4
##  [889] 4 1 5 5 4 3 4 2 2 3 4 3 5 3 2 5 3 3 1 5 3 4 1 3 4 1 1 4 5 1 3 5 4 2 4 4 3
##  [926] 4 1 4 2 1 4 4 4 3 5 4 1 1 5 3 4 5 4 4 4 1 3 1 1 4 3 3 3 1 4 4 5 5 1 3 4 1
##  [963] 5 3 4 4 4 4 4 3 5 3 1 1 4 3 4 4 1 5 2 3 5 5 3 1 1 3 2 3 3 3 4 4 4 5 1 4 3
## [1000] 5
## 
## Within cluster sum of squares by cluster:
## [1] 24626.90 13218.21 24961.46 28190.81 20306.33
##  (between_SS / total_SS =  83.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The summary for each model shows the cluster sizes, cluster means, within-cluster sum of squares, and the clustering vector. Based on the cluster sizes and the sum of squares, grouping into 4 clusters appears more effective: the between-cluster share of the total sum of squares only rises from 79.7% to 83.5% when moving from 4 to 5 clusters, and the cluster sizes are more evenly balanced with 4 clusters.
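To visualize the chosen model, the 4-center fit can be refit and the pairwise score plots colored by cluster assignment; a sketch (the seed is arbitrary, and the exact labels may differ from the run shown above):

#Refit the 4-center model and color the pairwise score plots by cluster.
set.seed(42)
km4 <- kmeans(students1, centers = 4, nstart = 10)
pairs(students1, col = km4$cluster, main = "K-means clustering with 4 centers")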

Conclusion

This session investigated the use of clustering in the “Student Performance in Exams” dataset. In this session, hierarchical clustering and k-means clustering are explored.

For hierarchical clustering, average linkage and complete linkage clustering methods are used. After viewing the plots and cutting the trees, tables are created to see the number of data points in each cluster. From the tables, clustering into 4 groups appears reasonable, although the small fourth cluster suggests most points fall into 3 groups. Looking at the agglomerative coefficient, both hierarchical methods have a large coefficient (both above 0.95), which implies a strong clustering structure.

Next, k-means clustering was examined using the numeric variables. Models with 4 and 5 centers were explored. After comparing the summaries, the k-means model with 4 centers appears more effective.

Thank you for viewing this R Markdown session.
