Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a branch of machine learning that identifies patterns and relationships in data without the need for labels.
In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.
The data for this case study is generated with the simulation function below. The data contains the following features:
- Student ID: a unique identifier for each student
- Feature 1: a measure of student engagement
- Feature 2: a measure of student performance
simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  # Generate unique student IDs
  student_ids <- seq(1, n)
  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)
  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)
  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )
  # Return the data frame
  return(student_features)
}
This function takes the number of students to simulate as an input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are drawn from normal distributions with means of 50 and 60 and standard deviations of 10 and 15, respectively.
To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:
student_features <- simulate_student_features(n = 100)
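Note that because set.seed() is called inside the function, every call returns the same simulated data set. A minimal sketch to confirm this (not part of the original analysis):
# Two calls produce identical data because the seed is reset on each call
identical(simulate_student_features(n = 100), simulate_student_features(n = 100))
# expected: TRUE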
We can then use this data frame to perform unsupervised learning to identify groups of students with similar learning patterns.
## Explore
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#structure
str(student_features)
## 'data.frame': 100 obs. of 3 variables:
## $ student_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ student_engagement : num 35.5 51.8 62.4 35.2 59.4 ...
## $ student_performance: num 50.5 58.9 40.6 62.5 54.7 ...
#Check out 10 rows
slice_sample(student_features, n = 10)
## student_id student_engagement student_performance
## 1 52 73.07623 72.12643
## 2 20 43.16863 73.15011
## 3 53 62.47065 81.81560
## 4 92 50.25869 46.90754
## 5 59 42.12485 72.70400
## 6 89 33.68189 71.96778
## 7 91 51.90072 83.01396
## 8 6 57.00109 54.09745
## 9 39 24.56077 72.83616
## 10 15 60.90687 65.14477
#Summary
summary(student_features)
## student_id student_engagement student_performance
## Min. : 1.00 Min. :24.56 Min. :33.94
## 1st Qu.: 25.75 1st Qu.:43.56 1st Qu.:53.75
## Median : 50.50 Median :51.37 Median :64.45
## Mean : 50.50 Mean :50.43 Mean :62.42
## 3rd Qu.: 75.25 3rd Qu.:58.57 3rd Qu.:72.01
## Max. :100.00 Max. :73.08 Max. :97.40
Simulate the data (see above).
Perform dimensionality reduction on the data using PCA.
Check the column means and standard deviations.
# creating a dataframe of just features 1 and 2 since student ID is an identifying field
student_features_fields <- student_features[, 2:3]
#Mean
colMeans(student_features_fields)
## student_engagement student_performance
## 50.42965 62.41808
#Standard Deviation
apply(student_features_fields, 2, sd)
## student_engagement student_performance
## 10.14143 13.99781
#check first 10 rows
head(student_features_fields, n=10)
## student_engagement student_performance
## 1 35.47855 50.52231
## 2 51.79512 58.88396
## 3 62.41012 40.56755
## 4 35.20679 62.46033
## 5 59.37552 54.69326
## 6 57.00109 54.09745
## 7 34.81908 51.59185
## 8 66.43009 71.66933
## 9 53.12224 53.38812
## 10 58.66933 77.51403
#min values
min_se <- min(student_features_fields$student_engagement)
min_se
## [1] 24.56077
min_sp <- min(student_features_fields$student_performance)
min_sp
## [1] 33.93715
#max values
max_se <- max(student_features_fields$student_engagement)
max_se
## [1] 73.07623
max_sp <- max(student_features_fields$student_performance)
max_sp
## [1] 97.4003
#with scaling
with_scaling <- prcomp(student_features_fields, scale = TRUE, center = TRUE)
#without scaling
with_out_scaling <- prcomp(student_features_fields, scale = FALSE, center = TRUE)
#biplots
biplot(with_scaling)
biplot(with_out_scaling)
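Because the two features are on similar scales, scaling should make little difference here. As a quick check (a sketch, not part of the original analysis), the variance summaries of the two fits can be compared directly:
# Compare proportion of variance explained with and without scaling
summary(with_scaling)
summary(with_out_scaling)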
pr.out <- prcomp(student_features_fields, center = TRUE, scale = FALSE)
head(pr.out$x, 20)
## PC1 PC2
## [1,] -11.4144481 15.321721
## [2,] -3.5757493 -1.252429
## [3,] -22.2203607 -11.279760
## [4,] 0.5261828 15.213820
## [5,] -8.0053198 -8.695768
## [6,] -8.5253378 -6.303596
## [7,] -10.3244803 15.946858
## [8,] 8.7378929 -16.286463
## [9,] -9.1109985 -2.404154
## [10,] 14.8263640 -8.715435
## [11,] 3.7303841 -14.307701
## [12,] -20.3130736 -13.668677
## [13,] 18.8243797 -11.562512
## [14,] 8.8875479 6.682150
## [15,] 2.3922238 -10.558615
## [16,] 34.1068507 -12.976337
## [17,] -2.2059401 20.310874
## [18,] -17.5188961 -11.730859
## [19,] -7.7324306 -2.794577
## [20,] 10.9574455 6.916167
summary(pr.out)
## Importance of components:
## PC1 PC2
## Standard deviation 14.0012 10.1368
## Proportion of Variance 0.6561 0.3439
## Cumulative Proportion 0.6561 1.0000
# biplot of pca
biplot(pr.out)
# scatter plot
plot(pr.out$x,
xlab = "PC1", ylab = "PC2")
## Variance Explained
#variability of each principal component
pr.var <- pr.out$sdev^2
#variance explained by each principal component
pve <- pr.var / sum(pr.var)
#plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0,1), type = "b")
#plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0,1), type = "b")
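With only two features, both components are retained; if a cutoff were needed, a common rule of thumb is to keep enough components to reach a chosen share of cumulative variance. A minimal sketch with an assumed 90% threshold:
# Number of components needed to explain at least 90% of the variance
n_components <- which(cumsum(pve) >= 0.90)[1]
n_components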
set.seed(20240423)
# Perform K-means clustering on the student features with 3 centers and 15 random starts
kmeans_results <- stats::kmeans(student_features_fields, centers = 3, nstart = 15)
# Print the cluster assignments
kmeans_results$cluster
## [1] 1 3 3 1 3 3 1 2 3 2 2 3 2 1 2 2 1 3 3 1 1 1 3 2 2 1 3 2 1 2 2 2 1 3 1 3 2
## [38] 2 1 2 3 2 3 3 2 3 2 2 3 3 3 2 2 3 3 2 2 2 1 1 2 1 3 1 1 2 2 3 1 1 3 3 2 1
## [75] 2 2 1 1 3 2 3 2 3 1 1 3 3 3 1 2 2 3 2 2 2 2 2 2 1 2
# Print the kmeans_results object
kmeans_results
## K-means clustering with 3 clusters of sizes 27, 41, 32
##
## Cluster means:
## student_engagement student_performance
## 1 38.88027 63.76751
## 2 55.27903 73.69308
## 3 53.96117 46.83341
##
## Clustering vector:
## [1] 1 3 3 1 3 3 1 2 3 2 2 3 2 1 2 2 1 3 3 1 1 1 3 2 2 1 3 2 1 2 2 2 1 3 1 3 2
## [38] 2 1 2 3 2 3 3 2 3 2 2 3 3 3 2 2 3 3 2 2 2 1 1 2 1 3 1 1 2 2 3 1 1 3 3 2 1
## [75] 2 2 1 1 3 2 3 2 3 1 1 3 3 3 1 2 2 3 2 2 2 2 2 2 1 2
##
## Within cluster sum of squares by cluster:
## [1] 2453.993 5361.704 3765.955
## (between_SS / total_SS = 60.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# kmeans with 2 features
student_out <- kmeans(student_features_fields, centers = 3, nstart = 15)
student_out
## K-means clustering with 3 clusters of sizes 41, 27, 32
##
## Cluster means:
## student_engagement student_performance
## 1 55.27903 73.69308
## 2 38.88027 63.76751
## 3 53.96117 46.83341
##
## Clustering vector:
## [1] 2 3 3 2 3 3 2 1 3 1 1 3 1 2 1 1 2 3 3 2 2 2 3 1 1 2 3 1 2 1 1 1 2 3 2 3 1
## [38] 1 2 1 3 1 3 3 1 3 1 1 3 3 3 1 1 3 3 1 1 1 2 2 1 2 3 2 2 1 1 3 2 2 3 3 1 2
## [75] 1 1 2 2 3 1 3 1 3 2 2 3 3 3 2 1 1 3 1 1 1 1 1 1 2 1
##
## Within cluster sum of squares by cluster:
## [1] 5361.704 2453.993 3765.955
## (between_SS / total_SS = 60.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
#plot
plot(student_features_fields$student_engagement,
     student_features_fields$student_performance,
     col = student_out$cluster,
     main = "k-means with 3 clusters",
     xlab = "student engagement",
     ylab = "student performance")
#load packages
library(ggplot2)
library(dplyr)
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(ggforce)
#Create ggplot2 object, treating the cluster assignment as a factor so colors are discrete
ggplot(student_features, aes(x = student_engagement, y = student_performance, color = as.factor(kmeans_results$cluster))) +
  #add geom_point() to plot data points
  geom_point() +
  #add title and axis labels
  labs(title = "K-means clustering of student features", x = "student engagement", y = "student performance", color = "cluster")
## Selecting Number of Clusters
# Initialize total within sum of squares error: wss
wss <- 0
# Preview the feature data used in the elbow search
head(student_features_fields)
## student_engagement student_performance
## 1 35.47855 50.52231
## 2 51.79512 58.88396
## 3 62.41012 40.56755
## 4 35.20679 62.46033
## 5 59.37552 54.69326
## 6 57.00109 54.09745
# For 1 to 15 cluster centers
for (i in 1:15) {
  km.out <- kmeans(student_features_fields, centers = i, nstart = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}
# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
# Set k equal to the number of clusters corresponding to the elbow location
# Select number of clusters
k <- 2
# Build model with k clusters: km.out
student.out <- kmeans(student_features_fields, centers = k, nstart = 20, iter.max = 50)
#view model
student.out
## K-means clustering with 2 clusters of sizes 42, 58
##
## Cluster means:
## student_engagement student_performance
## 1 50.12445 49.08469
## 2 50.65065 72.07329
##
## Clustering vector:
## [1] 1 1 1 2 1 1 1 2 1 2 2 1 2 2 2 2 1 1 1 2 1 2 1 2 2 1 1 2 1 2 2 2 2 1 2 1 2
## [38] 2 2 2 1 2 1 1 2 1 2 2 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 1 1 2 2
## [75] 2 2 1 1 1 2 1 2 1 1 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 6950.383 9749.145
## (between_SS / total_SS = 43.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
#plot by cluster membership
plot(student_features_fields,
col = student.out$cluster,
main = paste("K-Means clustering of student features with", k, "clusters"),
xlab = "student engagement",
ylab = "student performance")
#view clusters
student.out$cluster
## [1] 1 1 1 2 1 1 1 2 1 2 2 1 2 2 2 2 1 1 1 2 1 2 1 2 2 1 1 2 1 2 2 2 2 1 2 1 2
## [38] 2 2 2 1 2 1 1 2 1 2 2 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 1 1 2 2
## [75] 2 2 1 1 1 2 1 2 1 1 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2
# Perform DBSCAN clustering on the student_features_fields data frame, specifying a value for the eps argument
dbscan_results <- dbscan(student_features_fields, eps = 2.5)
# Print the cluster assignments
dbscan_results$cluster
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 1 0 0 0 0 2 0 0 2 0 0 0 0 1 0 0 0 0
## [38] 3 0 0 0 0 0 0 3 0 0 2 0 0 0 0 0 0 0 3 0 0 1 0 1 0 0 0 0 3 2 0 0 0 0 0 0 0
## [75] 3 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
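Rather than guessing eps, a common heuristic is to plot the k-nearest-neighbour distances and look for a knee. A minimal sketch using dbscan::kNNdistplot(), with k set to 5 to match the dbscan() default minPts (an assumed choice):
# k-NN distance plot; the "knee" in the curve suggests a reasonable eps
kNNdistplot(student_features_fields, k = 5)
# Draw the candidate eps used above for comparison
abline(h = 2.5, lty = 2)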
# Define the variable eps to match the value passed to dbscan()
eps <- 2.5
# Create a ggplot2 object
ggplot(student_features_fields, aes(x = student_engagement, y = student_performance, color = as.factor(dbscan_results$cluster))) +
# Add geom_point
geom_point() +
# Add geom_circle to draw circles around each data point with radius equal to eps
geom_circle(aes(x0 = student_engagement, y0 = student_performance, r = eps), alpha = 0.05) +
# Manually specify colors for each cluster
scale_color_manual(values = c("red", "blue", "green", "orange", "purple", "black", "pink")) +
# Add title and axis labels
labs(title = "DBSCAN clustering of student features", x = "Student Engagement", y = "Student Performance")
k <- 3
# Get distance
student.dist <- dist(student_features_fields)
# Create Hierarchical clustering model
student.hclust <- hclust(student.dist, method = 'complete')
#plot
plot(student.hclust)
# Cut the tree down to k = 3 clusters
student.hclust.clusters <- cutree(student.hclust, k = k)
plot(student_features_fields,
col = student.hclust.clusters,
main = paste("Hierarchical clustering of student features with", k, "clusters"),
xlab = "student engagement",
ylab = "student performance")
## Compare K-means to Hierarchical
table(student.out$cluster, student.hclust.clusters )
## student.hclust.clusters
## 1 2 3
## 1 14 28 0
## 2 15 0 43
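The cross-tabulation can be summarized with a single agreement score. A sketch using the adjusted Rand index from the mclust package (assuming mclust is installed; any ARI implementation would do):
# Adjusted Rand index between the k-means and hierarchical partitions
# (1 = identical partitions, values near 0 = agreement no better than chance)
library(mclust)
adjustedRandIndex(student.out$cluster, student.hclust.clusters)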
K-means and hierarchical clustering produced very similar visualizations when using 2 clusters. I ended up changing the hierarchical model to 3 clusters because the 2 clusters were distinct enough that a third could comfortably explain further segmentation. I am less sure about the DBSCAN results; I experimented with the eps value and minPts, but the clusters did not seem to come out sensibly.
Submit a report containing the following:
My approach to dimensionality reduction was first to check whether the data needed to be scaled. Since the data is simulated and the features are of the same magnitude, scaling did not appear necessary; the means and standard deviations of the two features were similar. Next I ran prcomp() and plotted the data. I then examined the variability of each principal component and the proportion of variance each one explains, followed by visualizing these quantities. For clustering I chose to try three methods: k-means, hierarchical, and DBSCAN.
Number of clusters identified:
- K-means: 2
- Hierarchical: 3
- DBSCAN: 4
Characteristics of each cluster:
K-means with 2 clusters, I think, shows the general trend: highly engaged, high-performing students versus less engaged, lower-performing students.
Hierarchical clustering, since I went with 3 clusters, helps show further diversity in the data: students with low engagement and low performance (colored black), students with high engagement but low performance (colored red), and the remainder (colored green).
Of note, I also tried 4 clusters for hierarchical clustering, which did not model the data well; the fourth cluster was essentially outliers.
I am not sure how to interpret the 4 clusters from DBSCAN: three small groups and then everything else. I tried adjusting eps and minPts but could not achieve anything more interpretable.
I interpret these results to mean that, in general, the more engaged a student is, the higher their performance will be, which seems intuitive. While there are outliers, the more engaged students generally performed better.
The paper below concludes that student engagement is key to performance and participation in online courses; the study found a connection between online student engagement and student performance. Although the paper focuses on online learning, I expect its insights apply to other types of learning as well. Data for the study was collected from learning management systems (LMS), student questionnaires, and instructor interviews, and was analyzed using descriptive statistics.
Sahni, J. (2023). Is learning analytics the future of online education? International Journal of Emerging Technologies in Learning, 18(02), 33–49. https://doi.org/10.3991/ijet.v18i02.32167
Your report should include your code. Submit the published RPubs link to Blackboard.