Lab 3 Case Study: Unsupervised Learning in Learning Analytics

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(stats)

Introduction

Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a type of machine learning that can be used to identify patterns and relationships in data without the need for labeled data.

In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.

Data

The data for this case study is generated with the simulation function below. It contains the following features:

Student ID: A unique identifier for each student
Feature 1: A measure of student engagement
Feature 2: A measure of student performance

simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  
  # Generate unique student IDs
  student_ids <- seq(1, n)

  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)

  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)

  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )

  # Return the data frame
  return(student_features)
}

This function takes the number of students to simulate as an input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are simulated using normal distributions with mean values of 50 and 60, respectively, and standard deviations of 10 and 15, respectively.

To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:

student_features <- simulate_student_features(n = 100)

We can then use this data frame to perform unsupervised learning to identify groups of students with similar learning patterns.
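Before modeling, it can help to check that the simulated data looks as expected. The sketch below rebuilds the same data frame inline (mirroring the body of simulate_student_features() so that it runs on its own) and compares the sample means and standard deviations with the simulation parameters. The names feature_cols, sample_means, and sample_sds are introduced here for illustration only.

```r
# Rebuild the simulated data inline, mirroring simulate_student_features(n = 100)
set.seed(260923)
n <- 100
student_features <- data.frame(
  student_id = seq(1, n),
  student_engagement = rnorm(n, mean = 50, sd = 10),
  student_performance = rnorm(n, mean = 60, sd = 15)
)

# feature_cols is an illustrative helper name, not part of the lab code
feature_cols <- c("student_engagement", "student_performance")

# Structure and first rows of the data frame
str(student_features)
head(student_features)

# Sample means and standard deviations should be close to the simulation parameters
sample_means <- colMeans(student_features[, feature_cols])
sample_sds <- sapply(student_features[, feature_cols], sd)
print(round(rbind(mean = sample_means, sd = sample_sds), 2))
```

With only 100 students, the sample statistics will differ somewhat from the nominal values of 50/10 and 60/15.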

Tasks

Simulate the data.

The data was simulated in the previous chunk.

Perform dimensionality reduction on the data using PCA.

scaled_data <- scale(student_features[, c("student_engagement", "student_performance")])

# scaled_data is already standardized, so center and scale. are redundant here but harmless
pr.out <- prcomp(scaled_data, center = TRUE, scale. = TRUE)

summary(pr.out)

## Importance of components:
##                           PC1    PC2
## Standard deviation     1.0104 0.9895
## Proportion of Variance 0.5104 0.4896
## Cumulative Proportion  0.5104 1.0000

biplot(pr.out)

pr.var <- pr.out$sdev^2

pve <- pr.var/sum(pr.var)

plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
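With only two standardized features, the PCA result can be checked by hand. The correlation matrix has 1 on the diagonal and r off the diagonal, so its eigenvalues are 1 + |r| and 1 - |r|. A near-even split such as the 51%/49% above therefore indicates that the two features are almost uncorrelated, which is expected because they were simulated independently. A minimal sketch of that check, rebuilding the data inline so that it runs on its own (expected_var is an illustrative name):

```r
# Rebuild the simulated features inline, mirroring simulate_student_features(n = 100)
set.seed(260923)
n <- 100
engagement <- rnorm(n, mean = 50, sd = 10)
performance <- rnorm(n, mean = 60, sd = 15)
scaled_data <- scale(cbind(student_engagement = engagement,
                           student_performance = performance))

# Sample correlation between the two features
r <- cor(scaled_data[, 1], scaled_data[, 2])

# Variance of each principal component
pr.out <- prcomp(scaled_data)
pr.var <- pr.out$sdev^2

# For two standardized features the PC variances are 1 + |r| and 1 - |r|
expected_var <- c(1 + abs(r), 1 - abs(r))
print(rbind(pca = pr.var, from_correlation = expected_var))
```

Because both components carry roughly half of the variance, PCA cannot meaningfully reduce the dimensionality of this particular data set.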

Cluster the data using KMeans and other clustering algorithms.

k <- 4

km.out <- kmeans(scaled_data, centers = k, nstart = 20)

plot(scaled_data[, c("student_engagement", "student_performance")],
     col = km.out$cluster,
     main = paste("k-means clustering of students:", k, "clusters"),
     xlab = "Student Engagement (scaled)", ylab = "Student Performance (scaled)")
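Choosing k by eye can be complemented with an elbow plot of the total within-cluster sum of squares. This is a common heuristic and is offered here as a sketch, not as part of the original lab; k_values and wss are illustrative names, and the data is rebuilt inline so that the snippet runs on its own.

```r
# Rebuild the scaled features inline, mirroring simulate_student_features(n = 100)
set.seed(260923)
n <- 100
engagement <- rnorm(n, mean = 50, sd = 10)
performance <- rnorm(n, mean = 60, sd = 15)
scaled_data <- scale(cbind(student_engagement = engagement,
                           student_performance = performance))

# Total within-cluster sum of squares for a range of candidate k values
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(scaled_data, centers = k, nstart = 20)$tot.withinss
})

# Look for the k after which the curve flattens (the "elbow")
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

When the features are nearly uncorrelated and unimodal, as they are here, the curve often has no sharp elbow, so this plot is best read alongside other evidence.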

Cluster the data using hierarchical clustering.

hclust.out <- hclust(dist(scaled_data), method = "average")

plot(hclust.out, main = "Hierarchical")
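The dendrogram alone does not assign students to groups. One way to obtain assignments, sketched here under the assumption that 4 groups are wanted to match the k-means run, is to cut the tree with cutree() and cross-tabulate the result against the k-means labels. The name hc_clusters is illustrative, and the data is rebuilt inline so that the snippet runs on its own.

```r
# Rebuild the scaled features inline, mirroring simulate_student_features(n = 100)
set.seed(260923)
n <- 100
engagement <- rnorm(n, mean = 50, sd = 10)
performance <- rnorm(n, mean = 60, sd = 15)
scaled_data <- scale(cbind(student_engagement = engagement,
                           student_performance = performance))

# Average-linkage hierarchical clustering, cut into 4 groups
hclust.out <- hclust(dist(scaled_data), method = "average")
hc_clusters <- cutree(hclust.out, k = 4)

# k-means with the same number of clusters for comparison
km.out <- kmeans(scaled_data, centers = 4, nstart = 20)

# Rows: hierarchical groups; columns: k-means groups (label numbers are arbitrary)
agreement <- table(hierarchical = hc_clusters, kmeans = km.out$cluster)
print(agreement)
```

Cluster numbers are arbitrary in both methods, so agreement shows up as each row concentrating in a single column, not as matching labels.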

Interpret the results of your analysis.

library(cluster)

pam_kmstudent <- pam(scaled_data, k=3)

plot(silhouette(pam_kmstudent))

pam_kmstudent <- pam(scaled_data, k=4)

plot(silhouette(pam_kmstudent))

pam_kmstudent <- pam(scaled_data, k=5)

plot(silhouette(pam_kmstudent))
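The three silhouette chunks above can also be summarized numerically. The sketch below loops over a range of k values and extracts the average silhouette width that pam() stores in silinfo$avg.width. The range 2 to 6 and the names k_values, avg_sil, and best_k are assumptions made for illustration, and the data is rebuilt inline so that the snippet runs on its own.

```r
library(cluster)

# Rebuild the scaled features inline, mirroring simulate_student_features(n = 100)
set.seed(260923)
n <- 100
engagement <- rnorm(n, mean = 50, sd = 10)
performance <- rnorm(n, mean = 60, sd = 15)
scaled_data <- scale(cbind(student_engagement = engagement,
                           student_performance = performance))

# Average silhouette width of a PAM clustering for each candidate k
k_values <- 2:6
avg_sil <- sapply(k_values, function(k) {
  pam(scaled_data, k = k)$silinfo$avg.width
})
names(avg_sil) <- k_values
print(round(avg_sil, 2))

# The k with the largest average silhouette width
best_k <- k_values[which.max(avg_sil)]
print(best_k)
```

Note that these widths describe the PAM partitions, which need not coincide with the k-means partitions for the same k.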

Submission

Submit a report containing the following:

A brief description of your approach to dimensionality reduction and clustering.

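For pre-processing, I first standardized the data so that engagement and performance, which are measured on different scales, contribute equally to the distance calculations used in k-means clustering. Only the engagement and performance of each student were included in scaled_data; the student ID was left out because it is an identifier, not a feature. For dimensionality reduction, PCA on the two scaled features showed that PC1 and PC2 explain 51% and 49% of the variance, so neither component could be dropped without losing roughly half of the variance, and clustering was run on the scaled features directly.
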
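The effect of standardization can be checked directly. A minimal sketch, rebuilding the simulated features inline so that it runs on its own (raw_features and the other helper names are illustrative):

```r
# Rebuild the simulated features inline, mirroring simulate_student_features(n = 100)
set.seed(260923)
n <- 100
engagement <- rnorm(n, mean = 50, sd = 10)
performance <- rnorm(n, mean = 60, sd = 15)
raw_features <- cbind(student_engagement = engagement,
                      student_performance = performance)

# scale() subtracts each column mean and divides by each column standard deviation
scaled_data <- scale(raw_features)

# Before scaling the columns have different spreads; afterwards both have sd 1
raw_sds <- apply(raw_features, 2, sd)
scaled_sds <- apply(scaled_data, 2, sd)
scaled_means <- colMeans(scaled_data)
print(round(rbind(raw_sd = raw_sds, scaled_sd = scaled_sds, scaled_mean = scaled_means), 3))
```

Without this step, the feature with the larger spread would dominate the Euclidean distances that k-means and hierarchical clustering rely on.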
The results of your analysis, including the number of clusters identified, the characteristics of each cluster, and any other insights you gained from the data.

The k-means clustering used 4 clusters, chosen by visual inspection of the scatter plot. Hierarchical clustering was then used to view the data with a different clustering method. Average linkage was chosen because it tends to produce more balanced dendrograms than single linkage. Because engagement and performance are only weakly associated in this simulated data, the 4 clusters roughly correspond to low engagement/low performance, medium to high engagement/low performance, medium engagement/high performance, and high engagement/high performance. These cluster labels are subjective, but naming the groups aids interpretation.

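Cluster characteristics like these can be read from per-cluster averages on the original scale. The sketch below is one possible way to build such a profile; cluster_profile is an illustrative name, the data is rebuilt inline so that the snippet runs on its own, and cluster numbers are arbitrary labels assigned by kmeans().

```r
# Rebuild the simulated data inline, mirroring simulate_student_features(n = 100)
set.seed(260923)
n <- 100
student_features <- data.frame(
  student_id = seq(1, n),
  student_engagement = rnorm(n, mean = 50, sd = 10),
  student_performance = rnorm(n, mean = 60, sd = 15)
)
scaled_data <- scale(student_features[, c("student_engagement", "student_performance")])

# Cluster on the scaled features, then attach the labels to the original data
km.out <- kmeans(scaled_data, centers = 4, nstart = 20)
student_features$cluster <- km.out$cluster

# Mean engagement and performance per cluster, on the original (unscaled) scale
cluster_profile <- aggregate(
  cbind(student_engagement, student_performance) ~ cluster,
  data = student_features, FUN = mean
)

# Number of students in each cluster (same cluster order as the aggregate output)
cluster_profile$size <- as.vector(table(student_features$cluster))
print(cluster_profile)
```

Reporting the means on the original scale keeps the profile interpretable in the same units as the simulated engagement and performance measures.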
A discussion of the implications of your findings for learning analytics.

The silhouette plots further support the selection of 4 clusters: k=4 gives the highest average silhouette width (0.38) when compared to k=3 or k=5. With an analysis like this, teachers, administrators, and parents can begin to investigate the root causes behind each group's pattern. Possible responses include more engaging lesson plans or a specialized program for particular students; it may also turn out that engagement does not factor into certain students' performance. Additionally, were this study deemed relevant and useful, it could be applied to other schools.

Provide at least one scholarly reference.

Çam, E., & Özdağ, M. E. (2020). Discovery of Course Success Using Unsupervised Machine Learning Algorithms. Malaysian Online Journal of Educational Technology, 9(1), 26–47. https://doi.org/10.17220/mojet.2021.9.1.242

This study utilizes and evaluates k-means clustering and deep-embedded clustering, which is an advanced method of clustering that uses deep neural networks.
This paper evaluated 6 similar case studies that utilize

Your report should include your code. Submit the published RPubs link to Blackboard.