Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a family of machine learning methods that identifies patterns and relationships in data without requiring labeled examples.
In this case study, you will use unsupervised learning to analyze learning data from a simulated school course. You will first apply dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.
The data for this case study is generated with the simulation function below. The data contains the following features:
- Student ID: A unique identifier for each student
- Feature 1: A measure of student engagement
- Feature 2: A measure of student performance

## Loading libraries
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.1
library(cluster)
simulate_student_features <- function(n = 100) {
  # Set the random seed for reproducibility
  set.seed(260923)
  # Generate unique student IDs
  student_ids <- seq(1, n)
  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)
  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)
  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )
  # Return the data frame
  return(student_features)
}
This function takes the number of students to simulate as an input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are drawn from normal distributions with means of 50 and 60 and standard deviations of 10 and 15, respectively.
To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:
student_features <- simulate_student_features(n = 100)
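As a quick sanity check (an addition, not part of the original report), we might inspect the first few rows and confirm that the simulated column means fall near the specified values of 50 and 60:

# Peek at the simulated data and verify the distribution parameters
head(student_features)
colMeans(student_features[, c("student_engagement", "student_performance")])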
# Standardize the features
scaled_data <- scale(student_features[, c("student_engagement", "student_performance")])
# Perform Principal Component Analysis; since the data are already
# standardized, center = TRUE and scale. = TRUE are harmless but redundant here
pca_result <- prcomp(scaled_data, center = TRUE, scale. = TRUE)
summary(pca_result)
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.0104 0.9895
## Proportion of Variance 0.5104 0.4896
## Cumulative Proportion  0.5104 1.0000
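The two components together explain all of the variance, which is expected since we started with only two features; because the simulated engagement and performance are independent, each component carries roughly half the variance. As a small added sketch (not in the original report), the proportions in the summary can be recovered directly from the component standard deviations:

# Proportion of variance explained, computed from the component standard deviations
pca_result$sdev^2 / sum(pca_result$sdev^2)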
# Keep the first two principal components as the clustering input
pca_data <- as.data.frame(pca_result$x[, 1:2])
# Define the range of candidate cluster numbers (from 1 to 10)
k_values <- 1:10
# Initialize a vector to store the within-cluster sum of squares
wcss <- numeric(length(k_values))
# Calculate the within-cluster sum of squares for each candidate k
for (k in k_values) {
  kmeans_model <- kmeans(pca_data, centers = k)
  wcss[k] <- kmeans_model$tot.withinss
}
# Create a data frame with the number of clusters and corresponding WCSS values
elbow_data <- data.frame(K = k_values, WCSS = wcss)
# Plot the elbow curve
ggplot(elbow_data, aes(x = K, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Method for Optimal Number of Clusters") +
  xlab("Number of Clusters (K)") +
  ylab("Within-Cluster Sum of Squares (WCSS)")
## Clustering the data using KMeans
set.seed(10052023)
# The number of clusters has been chosen as 3 based on the elbow curve
kmeans_result <- kmeans(pca_data, centers = 3)
# Add the cluster labels to the original data
student_features$cluster <- kmeans_result$cluster
ggplot(student_features, aes(x = student_engagement, y = student_performance, color = factor(cluster))) +
  geom_point() +
  labs(title = "KMeans Clustering of Students",
       x = "Student Engagement",
       y = "Student Performance") +
  theme_minimal()
## Interpretation of KMeans clustering results
cluster_centers <- as.data.frame(kmeans_result$centers)
cluster_centers
##          PC1         PC2
## 1 -0.7582485  0.64689777
## 2  0.8586238  0.05710053
## 3 -0.8521969 -1.26781325
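These centers live in principal-component space, so they are not directly readable as engagement or performance values. The per-cluster summary below reports the means in the original units; as an alternative (an added sketch, relying on the attributes stored by scale() and prcomp() above), the centers can be mapped back through the PCA rotation and the scaling parameters:

# Map the cluster centers from PC space back to the original feature units
centers_std <- as.matrix(kmeans_result$centers) %*% t(pca_result$rotation)
centers_std <- sweep(centers_std, 2, pca_result$scale, "*")
centers_std <- sweep(centers_std, 2, pca_result$center, "+")
centers_raw <- sweep(centers_std, 2, attr(scaled_data, "scaled:scale"), "*")
centers_raw <- sweep(centers_raw, 2, attr(scaled_data, "scaled:center"), "+")
centers_raw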
# Summary of each cluster in the original feature units
student_features %>%
  group_by(cluster) %>%
  summarise(
    Avg_Engagement = mean(student_engagement),
    Avg_Performance = mean(student_performance),
    Num_Students = n()
  )
## # A tibble: 3 × 4
##   cluster Avg_Engagement Avg_Performance Num_Students
##     <int>          <dbl>           <dbl>        <int>
## 1       1           49.6            76.3           33
## 2       2           57.0            54.5           48
## 3       3           35.2            58.3           19
Cluster 1 contains students with roughly average engagement but high performance, cluster 2 contains more engaged students with somewhat lower performance, and cluster 3 contains the least engaged students with near-average performance.

## Clustering the data using Hierarchical clustering
# Ward linkage on Euclidean distances in PC space
hierarchical_result <- hclust(dist(pca_data), method = "ward.D2")
# Cut the tree to obtain 3 clusters
cluster_assignments <- cutree(hierarchical_result, k = 3)
# Add the cluster labels to the original data
student_features$cluster_hierarchical <- cluster_assignments
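Before settling on where to cut the tree, it is worth inspecting the dendrogram to see whether three clusters is a natural cut. This plot is an addition to the original report:

# Plot the dendrogram and outline the three-cluster solution
plot(hierarchical_result, labels = FALSE, hang = -1,
     main = "Ward Dendrogram of Students", xlab = "", sub = "")
rect.hclust(hierarchical_result, k = 3, border = "red")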
# Visualize the clustering results
ggplot(student_features, aes(x = student_engagement, y = student_performance, color = factor(cluster_hierarchical))) +
  geom_point() +
  labs(title = "Hierarchical Clustering of Students",
       x = "Student Engagement",
       y = "Student Performance") +
  theme_minimal()
## Interpretation of Hierarchical clustering results
# Count the students in each hierarchical cluster
hierarchical_clusters <- as.data.frame(table(cluster_assignments))
names(hierarchical_clusters) <- c("Cluster", "Num_Students")
hierarchical_clusters
##   Cluster Num_Students
## 1       1           17
## 2       2           55
## 3       3           28
## Summary of each hierarchical cluster
student_features %>%
  group_by(cluster_hierarchical) %>%
  summarise(
    Avg_Engagement = mean(student_engagement),
    Avg_Performance = mean(student_performance),
    Num_Students = n()
  )
## # A tibble: 3 × 4
##   cluster_hierarchical Avg_Engagement Avg_Performance Num_Students
##                  <int>          <dbl>           <dbl>        <int>
## 1                    1           34.2            59.3           17
## 2                    2           52.8            71.4           55
## 3                    3           55.6            46.6           28
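To see how far the two methods agree, we can cross-tabulate the two sets of labels; this check is an addition to the original analysis:

# Cross-tabulate KMeans labels against hierarchical labels
table(KMeans = student_features$cluster,
      Hierarchical = student_features$cluster_hierarchical)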
Both methods recover three broadly similar profiles: a low-engagement group, a high-performing group, and a higher-engagement but lower-performing group, although the cluster boundaries and group sizes differ between KMeans and hierarchical clustering.
Submit a report containing the following: your full analysis and your code. Submit the published RPubs link to Blackboard.