Lab 3 Case Study: Unsupervised Learning in Learning Analytics

Introduction

Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a type of machine learning that can be used to identify patterns and relationships in data without the need for labeled data.

In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.

Data

The data for this case study is generated with the simulation function below. The data contains the following features:

Student ID: A unique identifier for each student
Feature 1: A measure of student engagement
Feature 2: A measure of student performance

simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  
  # Generate unique student IDs
  student_ids <- seq(1, n)

  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)

  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)

  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )

  # Return the data frame
  return(student_features)
}

This function takes the number of students to simulate as an input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are simulated using normal distributions with mean values of 50 and 60, respectively, and standard deviations of 10 and 15, respectively.

To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:

# simulates 100 students
student_features <- simulate_student_features(n = 100)

We can then use this data frame to perform unsupervised learning and identify groups of students with similar learning patterns.
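
Because set.seed() is called inside the function, every call with the same n returns exactly the same data. The sketch below is one way to check this and to inspect the simulated features. It repeats a compact copy of the simulation so that it runs on its own, and the names first_run and second_run are introduced here only for illustration.

```r
# Compact copy of the lab's simulation (same seed, same order of draws),
# repeated here so that this sketch is self-contained.
simulate_student_features <- function(n = 100) {
  set.seed(260923)
  student_engagement <- rnorm(n, mean = 50, sd = 10)
  student_performance <- rnorm(n, mean = 60, sd = 15)
  data.frame(
    student_id = seq(1, n),
    student_engagement = student_engagement,
    student_performance = student_performance
  )
}

# Illustrative names: two independent calls with the same n
first_run <- simulate_student_features(n = 100)
second_run <- simulate_student_features(n = 100)

# Structure of the simulated data: 100 rows, 3 columns
str(first_run)

# The seed is reset on every call, so both runs hold identical values
identical(first_run, second_run)

# Sample means and standard deviations should sit near the
# simulation parameters (50/10 and 60/15) without matching them exactly
colMeans(first_run[, c("student_engagement", "student_performance")])
sapply(first_run[, c("student_engagement", "student_performance")], sd)
```

A fixed seed makes the report reproducible, but it also means that repeated calls cannot be used to draw fresh samples.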

Tasks

Simulate the data.
Perform dimensionality reduction on the data using PCA.
Cluster the data using KMeans and other clustering algorithms.
Interpret the results of your analysis.

CODE PORTION

# PRE-PROCESSING

# CHOOSE RELEVANT FEATURES
student_data <- student_features[, c("student_engagement", 
                                     "student_performance")]

# SCALING DATA
scaled_student_data <- scale(student_data)

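Student IDs carry no information about learning patterns, so they are dropped before analysis. The remaining features are standardized (centered to mean 0 and scaled to standard deviation 1) so that neither feature dominates the distance calculations in clustering or the variance calculations in PCA.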
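
As a rough illustration of what scale() does here, the sketch below rebuilds the two features with the same seed and parameters as the lab and compares scale() against a manual z-score. The name manual_engagement is an assumption made for this example.

```r
# Rebuild the two features with the lab's seed, parameters, and draw order
set.seed(260923)
student_engagement <- rnorm(100, mean = 50, sd = 10)
student_performance <- rnorm(100, mean = 60, sd = 15)
student_data <- data.frame(
  student_engagement = student_engagement,
  student_performance = student_performance
)

# scale() centers each column on its mean and divides by its standard deviation
scaled_student_data <- scale(student_data)

# Manual z-score for one column (illustrative name)
manual_engagement <- (student_engagement - mean(student_engagement)) /
  sd(student_engagement)

# The manual z-score agrees with the column produced by scale()
all.equal(manual_engagement,
          as.numeric(scaled_student_data[, "student_engagement"]))

# After scaling, every column has mean ~0 and standard deviation 1
colMeans(scaled_student_data)
apply(scaled_student_data, 2, sd)

# scale() stores the original means and standard deviations as attributes,
# which is useful for converting results back to the original units
attr(scaled_student_data, "scaled:center")
attr(scaled_student_data, "scaled:scale")
```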

# DIMENSIONALITY REDUCTION USING PCA

# perform PCA (the data are already standardized, so center and scale. are
# redundant here but harmless)
student_PCA_results <- prcomp(scaled_student_data, scale. = TRUE, 
                              center = TRUE)

# summarize PCA results
summary(student_PCA_results)

## Importance of components:
##                           PC1    PC2
## Standard deviation     1.0104 0.9895
## Proportion of Variance 0.5104 0.4896
## Cumulative Proportion  0.5104 1.0000

# plot PCA
biplot(student_PCA_results)

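PCA was performed using R's built-in prcomp() function, and the results were summarized and plotted. With only two standardized features, PC1 and PC2 explain 51.0% and 49.0% of the variance, so neither component can be dropped without losing about half of the information.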
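
For two standardized features, the PCA variances (the squared standard deviations reported by prcomp) are 1 + |r| and 1 - |r|, where r is the correlation between the features. A near 50/50 split of variance therefore signals a correlation near zero; the reported PC1 standard deviation of 1.0104 corresponds to |r| of roughly 0.02. The sketch below checks this relationship on data rebuilt with the lab's seed. The variable names are illustrative.

```r
# Rebuild the two features with the lab's seed, parameters, and draw order
set.seed(260923)
student_engagement <- rnorm(100, mean = 50, sd = 10)
student_performance <- rnorm(100, mean = 60, sd = 15)
scaled_student_data <- scale(cbind(student_engagement, student_performance))

# PCA on the standardized features
pca <- prcomp(scaled_student_data)

# Variance of each component and its share of the total
eigenvalues <- pca$sdev^2
prop_var <- eigenvalues / sum(eigenvalues)
prop_var

# Pearson correlation between the two features
r <- cor(student_engagement, student_performance)
r

# With two standardized features, the component variances are 1 + |r| and 1 - |r|
all.equal(eigenvalues, c(1 + abs(r), 1 - abs(r)))
```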

# ELBOW PLOT GENERATION FOR DETERMINING OPTIMAL K VALUE
library(purrr)
library(ggplot2)

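# Use map_dbl to run many models with varying value of k (centers).
# Fit on the scaled features so that student_id is excluded and the elbow
# plot is based on the same data as the final k-means model.
tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = scaled_student_data, centers = k, nstart = 20)
  model$tot.withinss
})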

# generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# generate elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

An elbow plot was generated (as in our DataCamp assignments) to choose the number of clusters for k-means. The plot shows an elbow at k=2, so 2 centers are used for our k-means model.
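
The elbow computation can also be sketched in base R without purrr or ggplot2. The version below is a minimal sketch under stated assumptions: the data are rebuilt with the lab's seed, the seed of 1 for the random starts is hypothetical, and iter.max = 50 is an added setting to reduce non-convergence warnings. It also checks that the value at k = 1 equals the total sum of squares, which is 2 * (n - 1) = 198 for two standardized features and 100 students.

```r
# Rebuild and standardize the two features with the lab's seed and draw order
set.seed(260923)
student_engagement <- rnorm(100, mean = 50, sd = 10)
student_performance <- rnorm(100, mean = 60, sd = 15)
scaled_student_data <- scale(cbind(student_engagement, student_performance))

# Hypothetical seed so that the random starts of kmeans() are repeatable
set.seed(1)

# Total within-cluster sum of squares for k = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(scaled_student_data, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})

# With a single cluster, the within-cluster sum of squares is the total sum
# of squares: (n - 1) per standardized column
total_ss <- sum(scaled_student_data^2)
all.equal(wss[1], total_ss)

# Base R elbow plot
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

An elbow is the value of k after which adding clusters yields only small further reductions in the within-cluster sum of squares.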

# KMEANS CLUSTERING FOR K = 2
library(ggplot2)
library(dplyr)

# building our kmeans model
kmeans_student <- kmeans(scaled_student_data, centers = 2, nstart = 20)

# extracting cluster labels, append them to original data
clust_km <- kmeans_student$cluster
clustered_student_features <- mutate(student_features, cluster = clust_km)

# plotting our model, colored by the cluster labels appended above
ggplot(clustered_student_features, aes(x = student_engagement, 
                                       y = student_performance, 
                                       color = factor(cluster))) +
  geom_point() + 
  labs(title = "K-Means Clustering on Student Data with K = 2",
       x = "Student Engagement",
       y = "Student Performance",
       color = "Cluster")

# print model results
kmeans_student

## K-means clustering with 2 clusters of sizes 53, 47
## 
## Cluster means:
##   student_engagement student_performance
## 1         -0.5644579            0.557169
## 2          0.6365164           -0.628297
## 
## Clustering vector:
##   [1] 1 2 2 1 2 2 1 2 2 1 2 2 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1 2 1 2 1 2 1
##  [38] 1 1 1 2 2 2 2 1 2 1 2 2 2 2 2 1 2 2 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 2 2 2 1
##  [75] 1 1 1 1 2 1 2 2 2 1 1 2 1 2 1 1 1 2 2 1 2 1 2 1 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 74.60182 52.46275
##  (between_SS / total_SS =  35.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

A k-means model was generated for k=2 and plotted. Our cluster labels were added to the original dataset.
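
The cluster means printed above are in standardized units, because the model was fit on scaled data. The sketch below shows one way to convert them back to the original units using the attributes stored by scale(), and how the between_SS / total_SS figure follows from the model components. It assumes data rebuilt with the lab's seed; the seed of 1 for the random starts and the names centers_original and raw_means are hypothetical. Cluster numbers are arbitrary labels and may be swapped between runs.

```r
# Rebuild and standardize the two features with the lab's seed and draw order
set.seed(260923)
student_engagement <- rnorm(100, mean = 50, sd = 10)
student_performance <- rnorm(100, mean = 60, sd = 15)
student_data <- data.frame(
  student_engagement = student_engagement,
  student_performance = student_performance
)
scaled_student_data <- scale(student_data)

# Hypothetical seed so that the random starts are repeatable
set.seed(1)
kmeans_student <- kmeans(scaled_student_data, centers = 2, nstart = 20)

# Share of the total sum of squares explained by the partition
kmeans_student$betweenss / kmeans_student$totss

# Convert the standardized centers back to the original units:
# multiply by each column's standard deviation, then add its mean
centers_original <- sweep(kmeans_student$centers, 2,
                          attr(scaled_student_data, "scaled:scale"), "*")
centers_original <- sweep(centers_original, 2,
                          attr(scaled_student_data, "scaled:center"), "+")
centers_original

# Cross-check: per-cluster means computed directly on the unscaled data
raw_means <- aggregate(student_data,
                       by = list(cluster = kmeans_student$cluster),
                       FUN = mean)
raw_means
```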

# SILHOUETTE ANALYSIS OF KMEANS MODEL
library(cluster)

# fit a k-medoids model (a close relative of k-means) with pam() for k=2
pam_kmstudent <- pam(student_data, k=2)

# silhouette plot of the k=2 model
plot(silhouette(pam_kmstudent))

# fit a k-medoids model with pam() for k=3
pam_kmstudent3 <- pam(student_data, k=3)

# silhouette plot of the k=3 model
plot(silhouette(pam_kmstudent3))

As demonstrated in our DataCamp assignments, silhouette analysis was used to evaluate the clustering. Note that pam() fits a k-medoids model, a close relative of k-means, and that it was run here on the unscaled student_data. The average silhouette width for k=2 is 0.38. Silhouette widths range from -1 to 1, so a value of 0.38 indicates that observations are closer to their own cluster than to the neighboring one, although the separation is modest. As a check, increasing the number of clusters to k=3 lowers the average silhouette width to 0.33, which supports k=2 as the better choice.
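
Since pam() fits k-medoids rather than k-means, one option for scoring the kmeans() fit itself is to pass its cluster labels and a distance matrix to silhouette() from the cluster package. The sketch below does this for k = 2 and k = 3 on data rebuilt with the lab's seed. The seed of 1 for the random starts and the name avg_sil_width are hypothetical, and the resulting widths are not expected to match the PAM values reported above.

```r
library(cluster)

# Rebuild and standardize the two features with the lab's seed and draw order
set.seed(260923)
student_engagement <- rnorm(100, mean = 50, sd = 10)
student_performance <- rnorm(100, mean = 60, sd = 15)
scaled_student_data <- scale(cbind(student_engagement, student_performance))

# Euclidean distances between students on the standardized features
student_dist <- dist(scaled_student_data)

# Hypothetical seed so that the random starts are repeatable
set.seed(1)

# Average silhouette width of the k-means partition for k = 2 and k = 3
avg_sil_width <- sapply(2:3, function(k) {
  km <- kmeans(scaled_student_data, centers = k, nstart = 20)
  sil <- silhouette(km$cluster, student_dist)
  mean(sil[, "sil_width"])
})
names(avg_sil_width) <- paste0("k=", 2:3)
avg_sil_width
```

Using the same scaled data and the same algorithm for both the model and its evaluation keeps the silhouette widths directly comparable to the k-means results.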

Submission

Submit a report containing the following:

A brief description of your approach to dimensionality reduction and clustering.

The data first went through pre-processing: student IDs were dropped and the two remaining features were scaled. Dimensionality reduction was then performed using principal component analysis, and the data were clustered using k-means. An elbow plot was generated to determine the optimal k-value before the k-means model was built, and silhouette analysis was used to evaluate the effectiveness of the clustering.

The results of your analysis, including the number of clusters identified, the characteristics of each cluster, and any other insights you gained from the data.

Our PCA showed that the two principal components explain similar proportions of the variance (51.0% for PC1 and 49.0% for PC2), so neither component summarizes the data much better than the other.

For our k-means model, the elbow plot showed an elbow at k=2, making that our optimal k-value. The k-means model was then generated, resulting in two clusters of 53 and 47 students. Silhouette analysis supported this choice with an average silhouette width of 0.38; increasing k to 3 lowered the average silhouette width to 0.33.

A discussion of the implications of your findings for learning analytics.

Our PCA biplot shows the student performance and engagement loadings both pointing in the direction of positive PC2, but the near-equal split of variance between PC1 and PC2 (51.0% and 49.0%) indicates that any association between the two features is weak. The two k-means clusters are consistent with this. In standardized units, one group has above-average performance and below-average engagement (cluster means of 0.56 and -0.56), while the other has above-average engagement and below-average performance (0.64 and -0.63). The clusters therefore do not show that more engaged students perform better, which is expected because simulate_student_features() draws the two features independently. With real course data, faculty could use the same workflow to identify groups of students who may benefit from different kinds of support, but how to act on the results depends on how engagement is measured. It is also worth pointing out that our data has a relatively small sample size and only two variables, so its findings should be taken with a grain of salt.

Provide at least one scholarly reference.

Roark, H., Carchedi, N., & Jeon, T. Unsupervised Learning in R. DataCamp. Retrieved April 20, 2024, from https://app.datacamp.com/learn/courses/unsupervised-learning-in-r

Gorenshteyn, D., Roy, Y., & Cotton, R. Cluster Analysis in R. DataCamp. Retrieved April 21, 2024, from https://app.datacamp.com/learn/courses/cluster-analysis-in-r

Your report should include your code. Submit the published RPubs link to Blackboard.

Benjamin Mauldin

2024-04-22
