Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a branch of machine learning that identifies patterns and relationships in data without the need for labels.
In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.
The data for this case study is generated with the simulation function below. The data contains the following features:
- Student ID: a unique identifier for each student
- Feature 1: a measure of student engagement
- Feature 2: a measure of student performance
simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  # Generate unique student IDs
  student_ids <- seq(1, n)
  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)
  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)
  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )
  # Return the data frame
  return(student_features)
}
This function takes the number of students to simulate as an input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are drawn from normal distributions with means of 50 and 60 and standard deviations of 10 and 15, respectively.
To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:
student_features <- simulate_student_features(n = 100)
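Note that because set.seed() is called inside the function, every call returns the same simulated data set. A minimal sketch to confirm this (not part of the original analysis):
# Two calls produce identical data because the seed is reset on each call
identical(simulate_student_features(n = 100), simulate_student_features(n = 100))
# expected: TRUE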
We can then use this data frame to perform unsupervised learning to identify groups of students with similar learning patterns.
## Explore
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#structure
str(student_features)
## 'data.frame': 100 obs. of 3 variables:
## $ student_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ student_engagement : num 35.5 51.8 62.4 35.2 59.4 ...
## $ student_performance: num 50.5 58.9 40.6 62.5 54.7 ...
#Check out 10 rows
slice_sample(student_features, n = 10)
## student_id student_engagement student_performance
## 1 52 73.07623 72.12643
## 2 20 43.16863 73.15011
## 3 53 62.47065 81.81560
## 4 92 50.25869 46.90754
## 5 59 42.12485 72.70400
## 6 89 33.68189 71.96778
## 7 91 51.90072 83.01396
## 8 6 57.00109 54.09745
## 9 39 24.56077 72.83616
## 10 15 60.90687 65.14477
#Summary
summary(student_features)
## student_id student_engagement student_performance
## Min. : 1.00 Min. :24.56 Min. :33.94
## 1st Qu.: 25.75 1st Qu.:43.56 1st Qu.:53.75
## Median : 50.50 Median :51.37 Median :64.45
## Mean : 50.50 Mean :50.43 Mean :62.42
## 3rd Qu.: 75.25 3rd Qu.:58.57 3rd Qu.:72.01
## Max. :100.00 Max. :73.08 Max. :97.40
Simulate the data (see above).
Perform dimensionality reduction on the data using PCA.
Check the column means and standard deviations.
# creating a dataframe of just features 1 and 2 since student ID is an identifying field
student_features_fields <- student_features[, 2:3]
#Mean
colMeans(student_features_fields)
## student_engagement student_performance
## 50.42965 62.41808
#Standard Deviation
apply(student_features_fields, 2, sd)
## student_engagement student_performance
## 10.14143 13.99781
#check first 10 rows
head(student_features_fields, n=10)
## student_engagement student_performance
## 1 35.47855 50.52231
## 2 51.79512 58.88396
## 3 62.41012 40.56755
## 4 35.20679 62.46033
## 5 59.37552 54.69326
## 6 57.00109 54.09745
## 7 34.81908 51.59185
## 8 66.43009 71.66933
## 9 53.12224 53.38812
## 10 58.66933 77.51403
#min values
min_se <- min(student_features_fields$student_engagement)
min_se
## [1] 24.56077
min_sp <- min(student_features_fields$student_performance)
min_sp
## [1] 33.93715
#max values
max_se <- max(student_features_fields$student_engagement)
max_se
## [1] 73.07623
max_sp <- max(student_features_fields$student_performance)
max_sp
## [1] 97.4003
#with scaling
with_scaling <- prcomp(student_features_fields, scale = TRUE, center = TRUE)
#without scaling
with_out_scaling <- prcomp(student_features_fields, scale = FALSE, center = TRUE)
#biplots
biplot(with_scaling)
biplot(with_out_scaling)
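Because the two features are on similar scales, scaling should make little difference here. As a quick check (a sketch, not part of the original analysis), the variance summaries of the two fits can be compared directly:
# Compare proportion of variance explained with and without scaling
summary(with_scaling)
summary(with_out_scaling)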
pr.out <- prcomp(student_features_fields, center = TRUE, scale = FALSE)
head(pr.out$x, 20)
## PC1 PC2
## [1,] -11.4144481 15.321721
## [2,] -3.5757493 -1.252429
## [3,] -22.2203607 -11.279760
## [4,] 0.5261828 15.213820
## [5,] -8.0053198 -8.695768
## [6,] -8.5253378 -6.303596
## [7,] -10.3244803 15.946858
## [8,] 8.7378929 -16.286463
## [9,] -9.1109985 -2.404154
## [10,] 14.8263640 -8.715435
## [11,] 3.7303841 -14.307701
## [12,] -20.3130736 -13.668677
## [13,] 18.8243797 -11.562512
## [14,] 8.8875479 6.682150
## [15,] 2.3922238 -10.558615
## [16,] 34.1068507 -12.976337
## [17,] -2.2059401 20.310874
## [18,] -17.5188961 -11.730859
## [19,] -7.7324306 -2.794577
## [20,] 10.9574455 6.916167
summary(pr.out)
## Importance of components:
## PC1 PC2
## Standard deviation 14.0012 10.1368
## Proportion of Variance 0.6561 0.3439
## Cumulative Proportion 0.6561 1.0000
# biplot of pca
biplot(pr.out)
# scatter plot
plot(pr.out$x,
xlab = "PC1", ylab = "PC2")
## Variance Explained
#variability of each principal component
pr.var <- pr.out$sdev^2
#variance explained by each principal component
pve <- pr.var / sum(pr.var)
#plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0,1), type = "b")
#plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0,1), type = "b")
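With only two features, both components are retained; if a cutoff were needed, a common rule of thumb is to keep enough components to reach a chosen share of cumulative variance. A minimal sketch with an assumed 90% threshold:
# Number of components needed to explain at least 90% of the variance
n_components <- which(cumsum(pve) >= 0.90)[1]
n_components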
set.seed(20240423)
# Perform K-means clustering on the student features with 3 centers and 15 random starts
kmeans_results <- stats::kmeans(student_features_fields, centers = 3, nstart = 15)
# Print the cluster assignments
kmeans_results$cluster
## [1] 1 3 3 1 3 3 1 2 3 2 2 3 2 1 2 2 1 3 3 1 1 1 3 2 2 1 3 2 1 2 2 2 1 3 1 3 2
## [38] 2 1 2 3 2 3 3 2 3 2 2 3 3 3 2 2 3 3 2 2 2 1 1 2 1 3 1 1 2 2 3 1 1 3 3 2 1
## [75] 2 2 1 1 3 2 3 2 3 1 1 3 3 3 1 2 2 3 2 2 2 2 2 2 1 2
# Print the kmeans_results object
kmeans_results
## K-means clustering with 3 clusters of sizes 27, 41, 32
##
## Cluster means:
## student_engagement student_performance
## 1 38.88027 63.76751
## 2 55.27903 73.69308
## 3 53.96117 46.83341
##
## Clustering vector:
## [1] 1 3 3 1 3 3 1 2 3 2 2 3 2 1 2 2 1 3 3 1 1 1 3 2 2 1 3 2 1 2 2 2 1 3 1 3 2
## [38] 2 1 2 3 2 3 3 2 3 2 2 3 3 3 2 2 3 3 2 2 2 1 1 2 1 3 1 1 2 2 3 1 1 3 3 2 1
## [75] 2 2 1 1 3 2 3 2 3 1 1 3 3 3 1 2 2 3 2 2 2 2 2 2 1 2
##
## Within cluster sum of squares by cluster:
## [1] 2453.993 5361.704 3765.955
## (between_SS / total_SS = 60.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# kmeans with 2 features
student_out <- kmeans(student_features_fields, centers = 3, nstart = 15)
student_out
## K-means clustering with 3 clusters of sizes 41, 27, 32
##
## Cluster means:
## student_engagement student_performance
## 1 55.27903 73.69308
## 2 38.88027 63.76751
## 3 53.96117 46.83341
##
## Clustering vector:
## [1] 2 3 3 2 3 3 2 1 3 1 1 3 1 2 1 1 2 3 3 2 2 2 3 1 1 2 3 1 2 1 1 1 2 3 2 3 1
## [38] 1 2 1 3 1 3 3 1 3 1 1 3 3 3 1 1 3 3 1 1 1 2 2 1 2 3 2 2 1 1 3 2 2 3 3 1 2
## [75] 1 1 2 2 3 1 3 1 3 2 2 3 3 3 2 1 1 3 1 1 1 1 1 1 2 1
##
## Within cluster sum of squares by cluster:
## [1] 5361.704 2453.993 3765.955
## (between_SS / total_SS = 60.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
#plot
plot(student_features_fields$student_engagement,
     student_features_fields$student_performance,
     col = student_out$cluster,
     main = "k-means with 3 clusters",
     xlab = "student engagement",
     ylab = "student performance")
#load packages
library(ggplot2)
library(dplyr)
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(ggforce)
#Create ggplot2 object, treating the cluster assignment as a factor so colors are discrete
ggplot(student_features, aes(x = student_engagement, y = student_performance, color = as.factor(kmeans_results$cluster))) +
  #add geom_point() to plot data points
  geom_point() +
  #add title and axis labels
  labs(title = "K-means clustering of student features", x = "student engagement", y = "student performance", color = "cluster")
## Selecting Number of Clusters
# Initialize total within sum of squares error: wss
wss <- 0
# Preview the feature data used in the elbow search
head(student_features_fields)
## student_engagement student_performance
## 1 35.47855 50.52231
## 2 51.79512 58.88396
## 3 62.41012 40.56755
## 4 35.20679 62.46033
## 5 59.37552 54.69326
## 6 57.00109 54.09745
# For 1 to 15 cluster centers
for (i in 1:15) {
  km.out <- kmeans(student_features_fields, centers = i, nstart = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}
# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
# Set k equal to the number of clusters corresponding to the elbow location
# Select number of clusters
k <- 2
# Build model with k clusters: km.out
student.out <- kmeans(student_features_fields, centers = k, nstart = 20, iter.max = 50)
#view model
student.out
## K-means clustering with 2 clusters of sizes 42, 58
##
## Cluster means:
## student_engagement student_performance
## 1 50.12445 49.08469
## 2 50.65065 72.07329
##
## Clustering vector:
## [1] 1 1 1 2 1 1 1 2 1 2 2 1 2 2 2 2 1 1 1 2 1 2 1 2 2 1 1 2 1 2 2 2 2 1 2 1 2
## [38] 2 2 2 1 2 1 1 2 1 2 2 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 1 1 2 2
## [75] 2 2 1 1 1 2 1 2 1 1 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 6950.383 9749.145
## (between_SS / total_SS = 43.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
#plot by cluster membership
plot(student_features_fields,
col = student.out$cluster,
main = paste("K-Means clustering of student features with", k, "clusters"),
xlab = "student engagement",
ylab = "student performance")
#view clusters
student.out$cluster
## [1] 1 1 1 2 1 1 1 2 1 2 2 1 2 2 2 2 1 1 1 2 1 2 1 2 2 1 1 2 1 2 2 2 2 1 2 1 2
## [38] 2 2 2 1 2 1 1 2 1 2 2 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 1 1 2 2
## [75] 2 2 1 1 1 2 1 2 1 1 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2
# Perform DBSCAN clustering on the student_features_fields data frame, specifying a value for the eps argument
dbscan_results <- dbscan(student_features_fields, eps = 2.5)
# Print the cluster assignments
dbscan_results$cluster
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 1 0 0 0 0 2 0 0 2 0 0 0 0 1 0 0 0 0
## [38] 3 0 0 0 0 0 0 3 0 0 2 0 0 0 0 0 0 0 3 0 0 1 0 1 0 0 0 0 3 2 0 0 0 0 0 0 0
## [75] 3 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
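Rather than guessing eps, a common heuristic is to plot the k-nearest-neighbour distances and look for a knee. A minimal sketch using dbscan::kNNdistplot(), with k set to 5 to match the dbscan() default minPts (an assumed choice):
# k-NN distance plot; the "knee" in the curve suggests a reasonable eps
kNNdistplot(student_features_fields, k = 5)
# Draw the candidate eps used above for comparison
abline(h = 2.5, lty = 2)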
# Define the variable eps to match the value passed to dbscan()
eps <- 2.5
# Create a ggplot2 object
ggplot(student_features_fields, aes(x = student_engagement, y = student_performance, color = as.factor(dbscan_results$cluster))) +
# Add geom_point
geom_point() +
# Add geom_circle to draw circles around each data point with radius equal to eps
geom_circle(aes(x0 = student_engagement, y0 = student_performance, r = eps), alpha = 0.05) +
# Manually specify colors for each cluster
scale_color_manual(values = c("red", "blue", "green", "orange", "purple", "black", "pink")) +
# Add title and axis labels
labs(title = "DBSCAN clustering of student features", x = "Student Engagement", y = "Student Performance")
k <- 3
# Get distance
student.dist <- dist(student_features_fields)
# Create Hierarchical clustering model
student.hclust <- hclust(student.dist, method = 'complete')
#plot
plot(student.hclust)
# Cut the tree down to k = 3 clusters
student.hclust.clusters <- cutree(student.hclust, k = k)
plot(student_features_fields,
col = student.hclust.clusters,
main = paste("Hierarchical clustering of student features with", k, "clusters"),
xlab = "student engagement",
ylab = "student performance")
## Compare K-means to Hierarchical
table(student.out$cluster, student.hclust.clusters )
## student.hclust.clusters
## 1 2 3
## 1 14 28 0
## 2 15 0 43
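The cross-tabulation can be summarized with a single agreement score. A sketch using the adjusted Rand index from the mclust package (assuming mclust is installed; any ARI implementation would do):
# Adjusted Rand index between the k-means and hierarchical partitions
# (1 = identical partitions, values near 0 = agreement no better than chance)
library(mclust)
adjustedRandIndex(student.out$cluster, student.hclust.clusters)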
K-means and hierarchical clustering produced very similar visualizations when using 2 clusters. I ended up changing the hierarchical model to 3 clusters because the 2 clusters were distinct enough that a third could comfortably explain further segmentation. I am less sure about the DBSCAN results; I experimented with the eps value and minPts, but the clusters did not seem to come out sensibly.
Submit a report containing the following:
My approach to dimensionality reduction was first to check whether the data needed to be scaled. Since the data is simulated and the features are of the same magnitude, scaling did not appear necessary; the means and standard deviations of the two features were similar. Next I ran prcomp() and plotted the data. I then examined the variability of each principal component and the proportion of variance each one explains, followed by visualizing these quantities. For clustering I chose to try three methods: k-means, hierarchical, and DBSCAN.
Number of clusters identified:
- K-means: 2
- Hierarchical: 3
- DBSCAN: 4
Characteristics of each cluster:
K-means with 2 clusters, I think, shows the general trend: highly engaged, high-performing students versus less engaged, lower-performing students.
Hierarchical clustering, since I went with 3 clusters, helps show further diversity in the data: students with low engagement and low performance (colored black), students with high engagement but low performance (colored red), and the remainder (colored green).
Of note, I also tried 4 clusters for hierarchical clustering, which did not model the data well; the fourth cluster was essentially outliers.
I am not sure how to interpret the 4 clusters from DBSCAN: three small groups and then everything else. I tried adjusting eps and minPts but could not achieve anything more interpretable.
I interpret these results to mean that, in general, the more engaged a student is, the higher their performance will be, which seems intuitive. While there are outliers, the more engaged students generally performed better.
The paper below concludes that student engagement is key to performance and participation in online courses; the study found a connection between online student engagement and student performance. Although the paper focuses on online learning, I expect its insights apply to other types of learning as well. Data for the study was collected from learning management systems (LMS), student questionnaires, and instructor interviews, and was analyzed using descriptive statistics.
Sahni, J. (2023). Is learning analytics the future of online education? International Journal of Emerging Technologies in Learning, 18(02), 33–49. https://doi.org/10.3991/ijet.v18i02.32167
Your report should include your code. Submit the published RPubs link to Blackboard.