Introduction

Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a type of machine learning that can be used to identify patterns and relationships in data without the need for labeled data.

In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.

Data

The data for this case study is generated with the simulation function below. The data contains the following features:

Student ID: a unique identifier for each student
Student engagement: a measure of how engaged the student is
Student performance: a measure of how well the student performs

The function below generates the student data with these three features: student ID, student engagement, and student performance.

This function takes the number of students to simulate as input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement feature is simulated from a normal distribution with mean 50 and standard deviation 10, and student_performance from a normal distribution with mean 60 and standard deviation 15.

simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  
  # Generate unique student IDs
  student_ids <- seq(1, n)

  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)

  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)

  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )

  # Return the data frame
  return(student_features)
}
summary(simulate_student_features())
##    student_id     student_engagement student_performance
##  Min.   :  1.00   Min.   :24.56      Min.   :33.94      
##  1st Qu.: 25.75   1st Qu.:43.56      1st Qu.:53.75      
##  Median : 50.50   Median :51.37      Median :64.45      
##  Mean   : 50.50   Mean   :50.43      Mean   :62.42      
##  3rd Qu.: 75.25   3rd Qu.:58.57      3rd Qu.:72.01      
##  Max.   :100.00   Max.   :73.08      Max.   :97.40

To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:

student_features <- simulate_student_features(n = 100)
student_features
##     student_id student_engagement student_performance
## 1            1           35.47855            50.52231
## 2            2           51.79512            58.88396
## 3            3           62.41012            40.56755
## 4            4           35.20679            62.46033
## 5            5           59.37552            54.69326
## 6            6           57.00109            54.09745
## 7            7           34.81908            51.59185
## 8            8           66.43009            71.66933
## 9            9           53.12224            53.38812
## 10          10           58.66933            77.51403
## 11          11           64.61152            66.60144
## 12          12           64.73720            42.54982
## 13          13           61.38786            81.60053
## 14          14           43.46833            71.08870
## 15          15           60.90687            65.14477
## 16          16           62.31512            96.92023
## 17          17           30.19917            59.56755
## 18          18           62.71153            45.28098
## 19          19           53.46864            54.77840
## 20          20           43.16863            73.15011
## 21          21           47.79988            57.94230
## 22          22           32.86718            64.51376
## 23          23           48.43080            44.12915
## 24          24           44.64360            81.96037
## 25          25           59.37263            67.10083
## 26          26           40.14190            57.07313
## 27          27           52.05290            33.93715
## 28          28           58.53192            66.30623
## 29          29           46.85526            60.08155
## 30          30           40.57498            83.21472
## 31          31           48.27392            68.24131
## 32          32           63.40857            64.74623
## 33          33           42.63505            74.19713
## 34          34           49.50695            36.55820
## 35          35           39.64492            63.26139
## 36          36           58.16632            42.53282
## 37          37           46.85609            77.93879
## 38          38           53.21176            71.84144
## 39          39           24.56077            72.83616
## 40          40           47.94104            72.70324
## 41          41           54.54657            57.30181
## 42          42           65.36895            64.50104
## 43          43           63.35530            43.93461
## 44          44           38.06179            40.12883
## 45          45           52.79802            70.87511
## 46          46           53.09701            57.98622
## 47          47           52.26786            75.90067
## 48          48           60.27817            65.06789
## 49          49           49.96451            35.62894
## 50          50           62.85455            53.86589
## 51          51           61.18422            35.28333
## 52          52           73.07623            72.12643
## 53          53           62.47065            81.81560
## 54          54           55.20634            46.58676
## 55          55           50.93514            36.66812
## 56          56           55.74387            72.98336
## 57          57           56.41645            80.03891
## 58          58           50.67616            64.95174
## 59          59           42.12485            72.70400
## 60          60           33.46542            63.30804
## 61          61           45.10251            73.36602
## 62          62           47.66047            64.39439
## 63          63           53.12859            46.91366
## 64          64           49.11046            61.00457
## 65          65           45.98572            66.17874
## 66          66           53.55687            70.29459
## 67          67           59.53438            68.26805
## 68          68           71.76380            50.57808
## 69          69           42.81577            70.02416
## 70          70           34.62981            60.46338
## 71          71           43.46502            43.93160
## 72          72           59.06670            56.56050
## 73          73           53.00539            60.76549
## 74          74           43.51405            67.50167
## 75          75           54.59768            71.64025
## 76          76           43.57840            79.52546
## 77          77           29.80201            57.75304
## 78          78           45.01510            55.98704
## 79          79           48.30114            37.44258
## 80          80           45.19776            97.40030
## 81          81           54.44051            52.26339
## 82          82           56.73032            66.93843
## 83          83           46.66086            54.04959
## 84          84           33.72874            51.90589
## 85          85           32.07322            74.45711
## 86          86           53.97599            57.86733
## 87          87           33.15555            40.26646
## 88          88           50.55680            43.10695
## 89          89           33.68189            71.96778
## 90          90           45.95642            72.64398
## 91          91           51.90072            83.01396
## 92          92           50.25869            46.90754
## 93          93           62.41476            62.63873
## 94          94           44.93745            80.00130
## 95          95           61.23368            69.66618
## 96          96           50.22648            82.94298
## 97          97           62.92548            67.67482
## 98          98           57.15390            90.19502
## 99          99           39.31421            65.78667
## 100        100           52.15625            62.67658

We can then use this data frame to perform unsupervised learning and identify groups of students with similar learning patterns.

Perform dimensionality reduction on the data using PCA.

# Standardize the features (student_id is dropped since it is just an identifier)
standardized_data <- scale(student_features[, -1])
# Run PCA; the data is already standardized, so centering and scaling again has no further effect
pca_results <- prcomp(standardized_data, center = TRUE, scale. = TRUE)
summary(pca_results)
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.0104 0.9895
## Proportion of Variance 0.5104 0.4896
## Cumulative Proportion  0.5104 1.0000
# Project the standardized data onto the first two principal components
projected_data <- predict(pca_results, newdata = standardized_data)[, 1:2]
projected_data
##                PC1          PC2
##   [1,] -0.44153737 -1.643380141
##   [2,]  0.27373527 -0.083321127
##   [3,]  1.93912437 -0.268457565
##   [4,] -1.06354148 -1.059273032
##   [5,]  1.01397062  0.233523648
##   [6,]  0.87851186  0.037870039
##   [7,] -0.54154734 -1.635333079
##   [8,]  0.64829243  1.582955387
##   [9,]  0.64389289 -0.268413613
##  [10,] -0.18807145  1.337087697
##  [11,]  0.77750056  1.200149779
##  [12,]  2.00124323 -0.006067892
##  [13,] -0.20495479  1.733068467
##  [14,] -0.92337600 -0.047374095
##  [15,]  0.59278007  0.868260136
##  [16,] -0.91418433  2.571603601
##  [17,] -1.26656513 -1.554557697
##  [18,]  1.72203836 -0.009340896
##  [19,]  0.59781455 -0.174030184
##  [20,] -1.04840601  0.035862475
##  [21,]  0.04273706 -0.409456358
##  [22,] -1.33039987 -1.118671184
##  [23,]  0.78450629 -1.063243379
##  [24,] -1.39061878  0.583759541
##  [25,]  0.38699415  0.860096644
##  [26,] -0.44730546 -0.987311759
##  [27,]  1.55191011 -1.325548742
##  [28,]  0.36851567  0.761338960
##  [29,] -0.13119181 -0.367253986
##  [30,] -1.73766561  0.363440854
##  [31,] -0.44447063  0.143855740
##  [32,]  0.78734268  1.022557491
##  [33,] -1.13849959  0.051550120
##  [34,]  1.24199096 -1.370659827
##  [35,] -0.79456054 -0.709360676
##  [36,]  1.54395060 -0.465078429
##  [37,] -1.03320196  0.534871689
##  [38,] -0.28204435  0.670007297
##  [39,] -2.32997121 -1.277421390
##  [40,] -0.69307708  0.346043162
##  [41,]  0.54550167  0.028598970
##  [42,]  0.93641443  1.146858327
##  [43,]  1.83493785 -0.032466206
##  [44,]  0.26360943 -1.988296693
##  [45,] -0.26207782  0.592344953
##  [46,]  0.40985830 -0.037896877
##  [47,] -0.55291109  0.809248538
##  [48,]  0.55282748  0.820540079
##  [49,]  1.32083585 -1.385698798
##  [50,]  1.29833932  0.434302672
##  [51,]  2.12058446 -0.620868278
##  [52,]  1.08860108  2.069445854
##  [53,] -0.14032196  1.819429500
##  [54,]  1.13278042 -0.466674376
##  [55,]  1.33601826 -1.265527334
##  [56,] -0.16317891  0.904241858
##  [57,] -0.47269802  1.307552968
##  [58,] -0.11080134  0.145177084
##  [59,] -1.09864725 -0.059450280
##  [60,] -1.22778046 -1.137867035
##  [61,] -0.92447290  0.181608381
##  [62,] -0.29291406 -0.093245066
##  [63,]  0.97139647 -0.595031595
##  [64,] -0.02057571 -0.163384636
##  [65,] -0.49982285 -0.119879423
##  [66,] -0.17984204  0.615930018
##  [67,]  0.33930949  0.930337496
##  [68,]  2.08561849  0.889411126
##  [69,] -0.91509958 -0.146649256
##  [70,] -1.00289393 -1.200379643
##  [71,]  0.44824836 -1.419459916
##  [72,]  0.89811332  0.306315631
##  [73,]  0.26307377  0.096110798
##  [74,] -0.73898746 -0.225386858
##  [75,] -0.17524823  0.756476952
##  [76,] -1.34188847  0.386487777
##  [77,] -1.20259668 -1.673910628
##  [78,] -0.05265961 -0.702394474
##  [79,]  1.11324109 -1.410059644
##  [80,] -2.13193645  1.402353560
##  [81,]  0.79262527 -0.233314031
##  [82,]  0.21096381  0.667659616
##  [83,]  0.15996139 -0.685516301
##  [84,] -0.63343513 -1.695492469
##  [85,] -1.88805183 -0.671735830
##  [86,]  0.47715093  0.017383608
##  [87,] -0.08542882 -2.323430058
##  [88,]  0.98437722 -0.966646429
##  [89,] -1.65013810 -0.685322516
##  [90,] -0.82846040  0.204672457
##  [91,] -0.93784186  1.142981079
##  [92,]  0.77160276 -0.795443159
##  [93,]  0.82451076  0.846802624
##  [94,] -1.27116671  0.505283953
##  [95,]  0.38716477  1.119447721
##  [96,] -1.05099176  1.022659895
##  [97,]  0.60572020  1.136813501
##  [98,] -0.93432117  1.872011687
##  [99,] -0.94518542 -0.604853601
## [100,]  0.10732844  0.133445056
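
Before clustering, it is worth eyeballing the projected data. The plot below is a minimal sketch (it assumes ggplot2, which is loaded for the elbow plot further down) showing the students in the space of the first two principal components.

# Sketch: scatter plot of the students in PCA space (PC1 vs PC2)
library(ggplot2)

pca_plot_data <- as.data.frame(projected_data)

ggplot(pca_plot_data, aes(x = PC1, y = PC2)) +
  geom_point() +
  labs(title = "Students Projected onto the First Two Principal Components",
       x = "PC1", y = "PC2") +
  theme_minimal()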

Here I apply the elbow method to find the optimal number of clusters.

# Elbow method to determine the optimal number of clusters (k)
wcss_values <- numeric(10)  # Vector to hold the WCSS for k = 1 to 10
for (k in 1:10) {
  kmeans_model <- kmeans(projected_data, centers = k)  # a larger nstart would give more stable WCSS values
  wcss_values[k] <- kmeans_model$tot.withinss
}

# Create a plot to visualize the elbow method
library(ggplot2)
elbow_data <- data.frame(K = 1:10, WCSS = wcss_values)

ggplot(elbow_data, aes(x = K, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of Clusters (K)", y = "Within-Cluster Sum of Squares (WCSS)") +
  ggtitle("Elbow Method for Optimal K")

# Pick the elbow point: the smallest k whose relative drop in WCSS from k - 1 exceeds 5%
# (a simple heuristic; the plotted curve above should also be checked visually)
optimal_k <- 1  # Default if no drop exceeds the threshold
for (i in 2:length(wcss_values)) {
  if ((wcss_values[i - 1] - wcss_values[i]) / wcss_values[i - 1] > 0.05) {
    optimal_k <- i
    break
  }
}

# Print the optimal number of clusters
cat("Optimal number of clusters (K) based on the elbow method:", optimal_k, "\n")
## Optimal number of clusters (K) based on the elbow method: 2
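
As a cross-check on this percentage-drop rule, the factoextra package (listed in the commented install lines further down) provides a one-line elbow plot. This is only a sketch and assumes factoextra is installed.

# Sketch: elbow curve via factoextra (assumes the package is installed)
library(factoextra)

fviz_nbclust(projected_data, kmeans, method = "wss", k.max = 10) +
  ggtitle("Elbow Method via factoextra")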

Cluster the data using KMeans

From the elbow method, the optimal number of clusters is 2.

set.seed(12)
kmeans_result <- kmeans(projected_data, centers = 2)
# the number of clusters has been chosen as 2 based on the elbow method

student_features$cluster <- kmeans_result$cluster
# adding cluster labels to the original data

library(ggplot2)

ggplot(student_features, aes(x = student_engagement, y = student_performance, color = factor(cluster))) +
  geom_point() +
  labs(title = "KMeans Clustering of Students",
       x = "Student Engagement",
       y = "Student Performance") +
  theme_minimal()

cluster_centers <- as.data.frame(kmeans_result$centers)  # centers are in PCA space (PC1, PC2)
cluster_centers
##          PC1        PC2
## 1  0.6303755  0.3334006
## 2 -0.8705186 -0.4604103
library(dplyr)
student_features %>%
  group_by(cluster) %>%
  summarise(
    Avg_Engagement = mean(student_engagement),
    Avg_Performance = mean(student_performance),
    Num_Students = n()
  )
## # A tibble: 2 × 4
##   cluster Avg_Engagement Avg_Performance Num_Students
##     <int>          <dbl>           <dbl>        <int>
## 1       1           57.3            59.5           58
## 2       2           40.9            66.5           42

Here I compute silhouette scores to check whether the clusters are well separated.

options(warn.conflicts = FALSE)

# install.packages("factoextra")   # uncomment if not already installed
# install.packages("fpc")

# Load necessary libraries
library(cluster)
library(fpc)
# Calculate silhouette scores
silhouette_scores <- silhouette(kmeans_result$cluster, dist(projected_data))

# Calculate the average silhouette score
average_silhouette_score <- mean(silhouette_scores[, "sil_width"])

# Print the average silhouette score
cat("Average Silhouette Score:", average_silhouette_score, "\n")
## Average Silhouette Score: 0.3332294

Interpreting the k-means clustering:

The average silhouette score indicates how well separated the clusters are. It ranges from -1 (poor clustering) to 1 (well-separated clusters), and higher averages generally mean better-defined clusters. The computed score of about 0.333 suggests the two clusters are only moderately separated, so there is likely room to improve the cluster quality.
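
To see where this average comes from, the per-student silhouette widths can be plotted with the cluster package's plot method for silhouette objects; the snippet below is a sketch that reuses the silhouette_scores object computed above.

# Sketch: per-student silhouette widths for the k-means solution
# Students with negative widths sit closer to the other cluster than to their own
plot(silhouette_scores,
     main = "Silhouette Plot for K-Means (k = 2)",
     col = c("steelblue", "tomato"),
     border = NA)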

Hierarchical clustering:

hierarchical_result <- hclust(dist(projected_data), method = "ward.D2")
# performing hierarchical clustering

cluster_assignments <- cutree(hierarchical_result, k = 2)
# cutting the tree to get a number of clusters as 2

student_features$cluster_hierarchical <- cluster_assignments
# adding cluster labels to the original data

ggplot(student_features, aes(x = student_engagement, y = student_performance, color = factor(cluster_hierarchical))) +
  geom_point() +
  labs(title = "Hierarchical Clustering of Students",
       x = "Student Engagement",
       y = "Student Performance") +
  theme_minimal()

I also plotted a dendrogram for the hierarchical clustering.

# Load necessary libraries
library(ggplot2)

# Perform hierarchical clustering
dist_matrix <- dist(projected_data)  # Calculate pairwise distances
hclust_result <- hclust(dist_matrix, method = "ward.D2")

# Create a dendrogram to visualize the hierarchical clustering
dendrogram <- as.dendrogram(hclust_result)

# Plot the dendrogram
plot(dendrogram)

# Heuristic for the number of clusters: scan the branches directly below the root and take
# the index of the first branch whose merge height exceeds the threshold as the cluster count
optimal_k <- 1  # Default if no branch exceeds the threshold
for (i in 2:length(dendrogram)) {
  if (attr(dendrogram[[i]], "height") > 5) {  # Adjust the height threshold as needed
    optimal_k <- i
    break
  }
}

# Print the optimal number of clusters
cat("Optimal number of clusters based on hierarchical clustering:", optimal_k, "\n")
## Optimal number of clusters based on hierarchical clustering: 2
# Count how many students fall into each hierarchical cluster
hierarchical_clusters <- data.frame(
  Cluster = unique(cluster_assignments),
  Num_Students = table(cluster_assignments)
)

hierarchical_clusters
##   Cluster Num_Students.cluster_assignments Num_Students.Freq
## 1       1                                1                72
## 2       2                                2                28
student_features %>%
  group_by(cluster_hierarchical) %>%
  summarise(
    Avg_Engagement = mean(student_engagement),
    Avg_Performance = mean(student_performance),
    Num_Students = n()
  )
## # A tibble: 2 × 4
##   cluster_hierarchical Avg_Engagement Avg_Performance Num_Students
##                  <int>          <dbl>           <dbl>        <int>
## 1                    1           48.4            68.6           72
## 2                    2           55.6            46.6           28
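
To make the two-cluster cut visible on the tree itself, base R's rect.hclust() can outline the branches kept at k = 2. This is a small sketch using the hclust_result object computed above.

# Sketch: re-plot the dendrogram and outline the k = 2 cut
plot(hclust_result, labels = FALSE, hang = -1,
     main = "Ward.D2 Dendrogram with the k = 2 Cut", xlab = "Students", sub = "")
rect.hclust(hclust_result, k = 2, border = c("steelblue", "tomato"))
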
# Load necessary libraries
library(cluster)

# Perform hierarchical clustering
dist_matrix <- dist(projected_data)
hclust_result <- hclust(dist_matrix, method = "ward.D2")

# Cut the tree to get a number of clusters as 2
cluster_assignments <- cutree(hclust_result, k = 2)

# Calculate the Dunn Index using cluster.stats() from the fpc package
# Install the fpc package if not already installed
if (!requireNamespace("fpc", quietly = TRUE)) {
  install.packages("fpc")
}
library(fpc)
dunn_index <- cluster.stats(dist_matrix, cluster_assignments)$dunn

# Calculate the Silhouette Score
silhouette_score <- silhouette(cluster_assignments, dist_matrix)

# Print the Dunn Index and Silhouette Score
cat("Dunn Index:", dunn_index, "\n")
## Dunn Index: 0.02890451
cat("Average Silhouette Score:", mean(silhouette_score[, "sil_width"]), "\n")
## Average Silhouette Score: 0.2951394

Interpreting the hierarchical clustering results

Dunn Index:

The Dunn Index (0.0289) is the ratio of the smallest inter-cluster distance to the largest intra-cluster distance, so higher values indicate compact, well-separated clusters. A value as low as 0.0289 suggests considerable overlap or close proximity between the two clusters, and that the separation could be improved.

Silhouette Score:

The average silhouette score (0.2951) measures how distinct the clusters are, on a scale from -1 (poor clustering) to 1 (well-separated clusters); higher values denote better separation. A score of 0.2951 indicates only weak to moderate separation: there is some structure in the data, but the clusters are not sharply defined and the solution could likely be refined.
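
Finally, since both algorithms were asked for two clusters, it is worth checking how far they agree. The sketch below cross-tabulates the two sets of labels and reuses fpc::cluster.stats() to compute the same indices for the k-means partition, for a like-for-like comparison; treat the exact numbers as illustrative.

# Sketch: agreement between the k-means and hierarchical partitions
table(kmeans = kmeans_result$cluster, hierarchical = cluster_assignments)

# Dunn Index and average silhouette width for the k-means partition
kmeans_stats <- cluster.stats(dist_matrix, kmeans_result$cluster)
cat("K-means Dunn Index:", kmeans_stats$dunn, "\n")
cat("K-means Average Silhouette Width:", kmeans_stats$avg.silwidth, "\n")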