Lab 3 Case Study: Unsupervised Learning in Learning Analytics

Introduction

Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a type of machine learning that can be used to identify patterns and relationships in data without the need for labeled data.

In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.

Data

The data for this case study is generated with the simulation function below. The data contains the following features:

Student ID: A unique identifier for each student
Feature 1: A measure of student engagement
Feature 2: A measure of student performance

simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  
  # Generate unique student IDs
  student_ids <- seq(1, n)

  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)

  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)

  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )

  # Return the data frame
  return(student_features)
}

This function takes the number of students to simulate as an input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are simulated using normal distributions with mean values of 50 and 60, respectively, and standard deviations of 10 and 15, respectively.

To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:

student_features <- simulate_student_features(n = 100)

We can then use this data frame to perform unsupervised learning and identify groups of students with similar learning patterns.
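Before clustering, it can help to check how the two simulated features relate to each other. The sketch below recreates the data with the same seed and distribution parameters as simulate_student_features() so that it runs on its own; the name feature_cor is introduced here for illustration only.

```r
# Recreate the simulated data (same seed and parameters as simulate_student_features())
set.seed(260923)
n <- 100
student_engagement <- rnorm(n, mean = 50, sd = 10)
student_performance <- rnorm(n, mean = 60, sd = 15)
student_features <- data.frame(
  student_id = seq_len(n),
  student_engagement = student_engagement,
  student_performance = student_performance
)

# Summary statistics for the two features used in the analysis
summary(student_features[, c("student_engagement", "student_performance")])

# Pearson correlation between engagement and performance
# (feature_cor is an illustrative name, not part of the original lab code)
feature_cor <- cor(student_features$student_engagement,
                   student_features$student_performance)
feature_cor
```

Because the two features are drawn independently, a correlation near zero would be expected here.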

Tasks

Simulate the data.
Perform dimensionality reduction on the data using PCA.
Cluster the data using k-means and other clustering algorithms.
Interpret the results of your analysis.

# Features selected for PCA
pca_features <- student_features[, c("student_engagement", "student_performance")]

# Scaling PCA features
scaled_pca_features <- scale(pca_features)

# Execute dimensionality reduction using PCA
# (the features are already standardized above, so no further scaling is needed)
final_pca <- prcomp(scaled_pca_features)

# Viewing summary of results
summary(final_pca)

## Importance of components:
##                           PC1    PC2
## Standard deviation     1.0104 0.9895
## Proportion of Variance 0.5104 0.4896
## Cumulative Proportion  0.5104 1.0000
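The "Proportion of Variance" row can be reproduced by hand from the component standard deviations. The sketch below is a minimal, self-contained illustration that regenerates the data with the same seed. It also checks a general property of PCA on two standardized features: the component variances equal 1 + |r| and 1 - |r|, where r is the correlation between the features. The variable names are illustrative assumptions, not part of the original lab code.

```r
# Recreate the two simulated features (same seed and parameters as the lab function)
set.seed(260923)
engagement <- rnorm(100, mean = 50, sd = 10)
performance <- rnorm(100, mean = 60, sd = 15)

# Standardize and run PCA
scaled <- scale(cbind(student_engagement = engagement,
                      student_performance = performance))
pca <- prcomp(scaled)

# Component variances (eigenvalues) and their share of the total variance
eigenvalues <- pca$sdev^2
prop_var <- eigenvalues / sum(eigenvalues)
prop_var

# For two standardized features the eigenvalues are 1 + |r| and 1 - |r|
r <- cor(engagement, performance)
expected <- c(1 + abs(r), 1 - abs(r))
rbind(eigenvalues, expected)
```

A split close to 50/50, as in the summary above, therefore indicates that the two features are nearly uncorrelated.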

library(ggplot2)

# Building a kmeans model with 3 centers
kmeans_model <- kmeans(scaled_pca_features, centers = 3, nstart = 20)

# Extracting cluster assignments
clust_kmeans <- kmeans_model$cluster

# Plotting student features
ggplot(student_features, aes(x = student_engagement, y = student_performance, color = factor(clust_kmeans))) +
  geom_point() + 
  labs(title = "Student Data with K-Means Clustering",
       x = "Student Engagement",
       y = "Student Performance")

library(ggplot2)

# Building a kmeans model with 2 centers
kmeans_model <- kmeans(scaled_pca_features, centers = 2, nstart = 20)

# Extracting cluster assignments
clust_kmeans <- kmeans_model$cluster

# Plotting student features
ggplot(student_features, aes(x = student_engagement, y = student_performance, color = factor(clust_kmeans))) +
  geom_point() + 
  labs(title = "Student Data with K-Means Clustering",
       x = "Student Engagement",
       y = "Student Performance")

library(purrr)
library(ggplot2)

# Checking the value of k with an elbow plot
# (clustering the same scaled features used for the k-means models above)
tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = scaled_pca_features, centers = k, nstart = 20)
  model$tot.withinss
})

# Generating the data frame
elbow_df <- data.frame(
  k = 1:10 ,
  tot_withinss = tot_withinss
)
 
# Plotting the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

library(cluster)

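# Fitting k-medoids (PAM) with three clusters using the pam() function
# Note: student_features still includes student_id and the unscaled features
pam_features <- pam(student_features, k = 3)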

# Plotting to identify silhouette width for three centers
plot(silhouette(pam_features))

library(cluster)

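# Fitting k-medoids (PAM) with two clusters using the pam() function
# Note: student_features still includes student_id and the unscaled features
pam_features <- pam(student_features, k = 2)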

# Plotting to identify silhouette width for two centers
plot(silhouette(pam_features))
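Instead of reading the average silhouette width off each plot, the value can be extracted from the fitted object and compared across several values of k. The sketch below is one possible way to do this. It assumes the two standardized features as input, unlike the pam() calls above, so its widths will generally differ from those read from the plots. The names ks, sil_df, and best_k are hypothetical.

```r
library(cluster)

# Recreate and standardize the two simulated features
set.seed(260923)
engagement <- rnorm(100, mean = 50, sd = 10)
performance <- rnorm(100, mean = 60, sd = 15)
scaled <- scale(cbind(student_engagement = engagement,
                      student_performance = performance))

# Average silhouette width for k = 2, ..., 6 (illustrative range)
ks <- 2:6
avg_sil <- sapply(ks, function(k) pam(scaled, k = k)$silinfo$avg.width)

# Collect the results and pick the k with the largest average width
sil_df <- data.frame(k = ks, avg_sil_width = avg_sil)
sil_df
best_k <- sil_df$k[which.max(sil_df$avg_sil_width)]
best_k
```

Values closer to 1 indicate tighter, better-separated clusters, so the k with the largest average width is a reasonable candidate.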

# Displaying model results
kmeans_model

## K-means clustering with 2 clusters of sizes 53, 47
## 
## Cluster means:
##   student_engagement student_performance
## 1         -0.5644579            0.557169
## 2          0.6365164           -0.628297
## 
## Clustering vector:
##   [1] 1 2 2 1 2 2 1 2 2 1 2 2 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1 2 1 2 1 2 1
##  [38] 1 1 1 2 2 2 2 1 2 1 2 2 2 2 2 1 2 2 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 2 2 2 1
##  [75] 1 1 1 1 2 1 2 2 2 1 1 2 1 2 1 1 1 2 2 1 2 1 2 1 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 74.60182 52.46275
##  (between_SS / total_SS =  35.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
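The cluster means printed above are in standardized units, which makes them harder to read as engagement and performance scores. The sketch below shows one way to convert k-means centers back to the original scale using the centering and scaling attributes that scale() stores. It is self-contained and uses its own seed for the k-means starts (an assumption), so the cluster labels and sizes may not match the output above.

```r
# Recreate and standardize the two simulated features
set.seed(260923)
engagement <- rnorm(100, mean = 50, sd = 10)
performance <- rnorm(100, mean = 60, sd = 15)
features <- data.frame(student_engagement = engagement,
                       student_performance = performance)
scaled <- scale(features)

# Fit k-means with two clusters (seed chosen here only for reproducibility)
set.seed(1)
km <- kmeans(scaled, centers = 2, nstart = 20)

# Undo the standardization: multiply by the column SDs, then add the column means
centers_original <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), "*")
centers_original <- sweep(centers_original, 2, attr(scaled, "scaled:center"), "+")
centers_original

# Cross-check against the per-cluster means of the raw features
cluster_means <- aggregate(features, by = list(cluster = km$cluster), FUN = mean)
cluster_means

# Share of the total sum of squares explained by the clustering
explained <- km$betweenss / km$totss
explained
```

The back-transformed centers should agree with the per-cluster means of the raw features, which gives a quick sanity check on the conversion.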

Submission

Submit a report containing the following:

A brief description of your approach to dimensionality reduction and clustering.

Dimensionality reduction was performed using Principal Component Analysis (PCA). This method finds the directions of greatest variance in the features of a data set and can aid visualization. The student_engagement and student_performance features were selected, scaled, and passed to prcomp(). The first component explains 51.0% of the variance and the second 49.0%, so neither component can be dropped without losing about half of the information. The clustering was run on both scaled features.

A clustering method, k-means, was then used to identify clusters (groups of data points with similarities) within the data set. A k-means model was constructed to identify two distinct groups. An elbow plot was drawn to check the value of k. Additionally, silhouette widths were calculated and plotted to compare models with two and three clusters.

The results of your analysis, including the number of clusters identified, the characteristics of each cluster, and any other insights you gained from the data.

The optimal value of k (the optimal number of clusters) for this model is 2. The average silhouette width from the pam() fits is 0.44 when k = 2 and 0.31 when k = 3. Since values closer to 1 indicate better-separated clusters, the data is more appropriately clustered when k = 2.
A total of two clusters were identified using the k-means model. The clusters have 53 and 47 data points, respectively. In standardized units, the first cluster has below-average engagement (-0.56) and above-average performance (0.56). The second cluster has above-average engagement (0.64) and below-average performance (-0.63).

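Overall, student performance and student engagement appear to be nearly uncorrelated in this data set, which is expected because the two features were simulated independently; the almost even split of variance between the two principal components reflects this. The two clusters account for only 35.8% of the total sum of squares, so considerable variation remains within each cluster. Additional variables may contribute to optimized model performance and cluster selection.

A discussion of the implications of your findings for learning analytics.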

The goal of learning analytics is to improve education through data analysis. Because engagement and performance were simulated independently here, this data set cannot by itself demonstrate a link between the two. With real course data, the same workflow could identify groups such as highly engaged but lower-performing students, who may need different support than less engaged students. Where real data do show that participation is associated with better performance, I would suggest implementing measures to increase student engagement in the classroom.

Provide at least one scholarly reference.

A scholarly reference that supports the above discussion, particularly for higher education, is listed below.

Hooda, M., & Rana, C. (2020). Learning Analytics Lens: Improving Quality of Higher Education. International Journal of Emerging Trends in Engineering Research, 8(5). ISSN 2347-3983

Your report should include your code. Submit the published RPubs link to Blackboard.

Dominic Valdiserri

2024-04-23
