Load necessary libraries

library(dplyr)      # data wrangling (select, mutate, group_by)
library(ggplot2)    # plotting
library(cluster)    # clustering utilities (silhouette)
library(factoextra) # PCA visualization (fviz_eig)

Simulate student features

simulate_student_features <- function(n = 100) {
  # Set the random seed for reproducibility
  set.seed(260923)

  # Generate unique student IDs
  student_ids <- seq(1, n)

  # Simulate student engagement and performance
  student_engagement <- rnorm(n, mean = 50, sd = 10)
  student_performance <- rnorm(n, mean = 60, sd = 15)

  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )

  # Return the data frame
  return(student_features)
}

# Simulate features for 100 students
student_features <- simulate_student_features(n = 100)

Exclude the “student_id” variable

student_data <- student_features %>%
  select(-student_id) # Exclude the "student_id" column

Perform dimensionality reduction using PCA

student_pca <- student_data %>%
  prcomp(center = TRUE, scale. = TRUE)

Plot variance explained by principal components

fviz_eig(student_pca, addlabels = TRUE, ylim = c(0, 50)) +
  labs(title = "Variance Explained by Principal Components")

Determine the number of principal components to retain (e.g., 2 components)

num_components <- 2

student_pca_data <- as.data.frame(predict(student_pca, newdata = student_data)[, 1:num_components])
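
To make the choice of two components less arbitrary, one option is to read the cumulative proportion of variance from summary(student_pca). A minimal sketch, assuming an illustrative 80% variance threshold (the threshold is not part of the original task):

# Proportion of variance explained by each principal component
summary(student_pca)

# Keep the smallest number of components that together explain at least
# 80% of the variance (the 80% cutoff is an illustrative assumption)
cum_var <- summary(student_pca)$importance["Cumulative Proportion", ]
num_components_auto <- which(cum_var >= 0.80)[1]
num_components_auto
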
# Initialize an empty vector to store the within-cluster sum of squares
wcss <- vector()

# Define the range of possible cluster numbers (e.g., from 1 to 10)
k_values <- 1:10

# Calculate the within-cluster sum of squares for different cluster numbers
set.seed(123) # for reproducible random starts
for (k in k_values) {
  kmeans_model <- kmeans(student_pca_data, centers = k, nstart = 25)
  wcss[k] <- kmeans_model$tot.withinss
}

# Create a data frame with the number of clusters and corresponding WCSS values
elbow_data <- data.frame(K = k_values, WCSS = wcss)

# Plot the elbow curve
ggplot(elbow_data, aes(x = K, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Method for Optimal Number of Clusters") +
  xlab("Number of Clusters (K)") +
  ylab("Within-Cluster Sum of Squares (WCSS)")

Perform KMeans clustering

set.seed(123)
kmeans_clusters <- kmeans(student_pca_data, centers = 3) # You can choose the number of clusters

Hierarchical clustering

hierarchical_clusters <- hclust(dist(student_pca_data))
hierarchical_clusters_cut <- cutree(hierarchical_clusters, k = 3) # You can choose the number of clusters
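
To see how the hierarchical solution forms three groups, we can plot the dendrogram and outline the three-cluster cut with base R (an optional visualization sketch):

# Plot the dendrogram and outline the three-cluster cut
plot(hierarchical_clusters, labels = FALSE,
     main = "Hierarchical Clustering Dendrogram")
rect.hclust(hierarchical_clusters, k = 3, border = "red")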

Visualize clustering results

ggplot(student_pca_data, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = factor(kmeans_clusters$cluster)), size = 3) +
  labs(title = "KMeans Clustering", color = "Cluster") +
  theme_minimal()

ggplot(student_pca_data, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = factor(hierarchical_clusters_cut)), size = 3) +
  labs(title = "Hierarchical Clustering", color = "Cluster") +
  theme_minimal()
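
Because the two algorithms label clusters independently, a contingency table is a quick way to check how closely the two partitions agree (a minimal sketch using base R):

# Cross-tabulate KMeans and hierarchical cluster assignments
table(KMeans = kmeans_clusters$cluster,
      Hierarchical = hierarchical_clusters_cut)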

kmeans_clusters$size
## [1] 35 29 36
kmeans_clusters$centers
##          PC1        PC2
## 1 -0.8160112  0.5711591
## 2 -0.1206254 -1.1703390
## 3  0.8905147  0.3874796
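
The centers above are expressed in principal-component units, which are hard to interpret directly. Because prcomp stores its centering and scaling vectors, the centers can be mapped back to the original feature scale; with only two features, two PCs capture all of the variance, so the back-projection below is exact (a sketch, not part of the original analysis):

# Map KMeans centers from PC space back to the original feature scale
centers_scaled <- kmeans_clusters$centers %*% t(student_pca$rotation[, 1:2])
centers_original <- sweep(centers_scaled, 2, student_pca$scale, `*`)
centers_original <- sweep(centers_original, 2, student_pca$center, `+`)
centers_original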

Interpretation of clustering results

cluster_summary <- student_data %>%
  mutate(KMeans_Cluster = kmeans_clusters$cluster,
         Hierarchical_Cluster = hierarchical_clusters_cut)

head(cluster_summary)
##   student_engagement student_performance KMeans_Cluster Hierarchical_Cluster
## 1           35.47855            50.52231              2                    1
## 2           51.79512            58.88396              3                    1
## 3           62.41012            40.56755              3                    2
## 4           35.20679            62.46033              2                    1
## 5           59.37552            54.69326              3                    2
## 6           57.00109            54.09745              3                    2
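
A straightforward way to characterize the clusters is to summarize the original features within each KMeans cluster; the same pattern works for the hierarchical labels (a dplyr sketch):

# Mean engagement and performance per KMeans cluster
cluster_summary %>%
  group_by(KMeans_Cluster) %>%
  summarise(n = n(),
            mean_engagement = mean(student_engagement),
            mean_performance = mean(student_performance))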

Introduction

Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a branch of machine learning that identifies patterns and relationships in data without requiring labeled examples.

In this case study, you will use unsupervised learning to analyze simulated learning data from a school course. You will apply dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.

Data

The data for this case study is generated with the simulation function below. The data contains the following features:

  • Student ID: a unique identifier for each student
  • Feature 1 (student_engagement): a measure of student engagement
  • Feature 2 (student_performance): a measure of student performance

simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  
  # Generate unique student IDs
  student_ids <- seq(1, n)

  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)

  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)

  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )

  # Return the data frame
  return(student_features)
}

This function takes the number of students to simulate as input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement feature is drawn from a normal distribution with mean 50 and standard deviation 10, and the student_performance feature from a normal distribution with mean 60 and standard deviation 15.

To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:

student_features <- simulate_student_features(n = 100)
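
As an optional sanity check, we can confirm that the simulated features have roughly the intended means and spreads:

# Inspect the first rows and summary statistics of the simulated data
head(student_features)
summary(student_features)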

We can then use this data frame to perform unsupervised learning to identify groups of students with similar learning patterns.

Tasks

  • Simulate the data.
  • Perform dimensionality reduction on the data using PCA.
  • Cluster the data using KMeans and other clustering algorithms.
  • Interpret the results of your analysis.

Submission

Submit a report containing the following:

  • A brief description of your approach to dimensionality reduction and clustering.
  • The results of your analysis, including the number of clusters identified, the characteristics of each cluster, and any other insights you gained from the data.
  • A discussion of the implications of your findings for learning analytics.
  • Provide at least one scholarly reference.

Your report should include your code. Submit the published RPubs link to Blackboard.