library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(stats)
Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a type of machine learning that can be used to identify patterns and relationships in data without the need for labeled data.
In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.
The data for this case study is generated with the simulation function below. The data contains the following features:
Student ID: A unique identifier for each student
Feature 1: A measure of student engagement
Feature 2: A measure of student performance
simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  # Generate unique student IDs
  student_ids <- seq(1, n)
  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)
  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)
  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )
  # Return the data frame
  return(student_features)
}
This function takes the number of students to simulate as an input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are simulated using normal distributions with mean values of 50 and 60, respectively, and standard deviations of 10 and 15, respectively.
To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:
student_features <- simulate_student_features(n = 100)
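A quick sanity check of the simulated data can confirm that the sample means and standard deviations land near the nominal values. The sketch below re-creates the data inline so that it runs on its own, assuming the same seed and draw order as simulate_student_features(); the name sim_check is hypothetical and not part of the original analysis.

```r
# Re-create the simulated data inline (same seed and draw order as the function above)
set.seed(260923)
n <- 100
engagement  <- rnorm(n, mean = 50, sd = 10)
performance <- rnorm(n, mean = 60, sd = 15)
sim_check <- data.frame(
  student_id = seq_len(n),
  student_engagement = engagement,
  student_performance = performance
)

# Structure and first rows
str(sim_check)
head(sim_check)

# Sample means and standard deviations, to compare with the nominal 50/10 and 60/15
sapply(sim_check[, c("student_engagement", "student_performance")],
       function(col) c(mean = mean(col), sd = sd(col)))
```

Sample statistics will differ somewhat from the nominal values because only 100 students are drawn.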
We can then use this data frame to perform unsupervised learning to identify groups of students with similar learning patterns.
The data were simulated in the previous chunk, so the analysis starts from student_features.
scaled_data <- scale(student_features[, c("student_engagement", "student_performance")])
pr.out <- prcomp(scaled_data, center = TRUE, scale. = TRUE)
summary(pr.out)
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.0104 0.9895
## Proportion of Variance 0.5104 0.4896
## Cumulative Proportion  0.5104 1.0000
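The near-even split between PC1 and PC2 follows from how weakly the two features are correlated. For two standardized variables, the PCA eigenvalues are 1 + |r| and 1 - |r|, where r is their correlation, so a PC1 share of 0.5104 corresponds to |r| of about 0.02. A minimal sketch of that check, assuming the same seed and draw order as simulate_student_features(); the names engagement, performance, x, r, and eig are illustrative and not part of the original analysis.

```r
# Re-create the two features (assumes the same seed and draw order as the function above)
set.seed(260923)
engagement  <- rnorm(100, mean = 50, sd = 10)
performance <- rnorm(100, mean = 60, sd = 15)
x <- scale(cbind(engagement, performance))

# Correlation between the two features
r <- cor(engagement, performance)

# Eigenvalues of the PCA on standardized data (variances of the components)
eig <- prcomp(x)$sdev^2

# For two standardized variables these should match 1 + |r| and 1 - |r|
print(r)
print(eig)
print(c(1 + abs(r), 1 - abs(r)))
```

With only two nearly uncorrelated features, PCA cannot compress the data: dropping either component would discard close to half of the variance.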
biplot(pr.out)
pr.var <- pr.out$sdev^2
pve <- pr.var/sum(pr.var)
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
k <- 4
km.out <- kmeans(scaled_data, centers = k, nstart = 20)
plot(scaled_data[, c("student_engagement", "student_performance")],
     col = km.out$cluster,
     main = paste("k-means clustering of students:", k, "clusters"),
     xlab = "Student Engagement", ylab = "Student Performance")
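A common complement to choosing k by eye is an elbow plot of the total within-cluster sum of squares across several values of k. The sketch below is an illustration rather than part of the original analysis; it assumes the same seed and draw order as simulate_student_features(), and the names x, ks, and wss are hypothetical.

```r
# Re-create the scaled features (assumes the same seed and draw order as above)
set.seed(260923)
engagement  <- rnorm(100, mean = 50, sd = 10)
performance <- rnorm(100, mean = 60, sd = 15)
x <- scale(cbind(engagement, performance))

# Total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

# Look for the k after which the curve flattens
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot for k-means")
```

If the curve shows no clear bend, that is itself informative: it suggests the data do not separate into well-defined groups.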
hclust.out <- hclust(dist(scaled_data), method = "average")
plot(hclust.out, main = "Hierarchical")
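The dendrogram alone does not assign students to groups. One way to compare the hierarchical result with k-means is to cut the tree into the same number of clusters and cross-tabulate the two assignments. This is a sketch under the assumption that the data are re-created with the same seed and draw order as above; hc, hc_groups, and km_groups are illustrative names.

```r
# Re-create the scaled features (assumes the same seed and draw order as above)
set.seed(260923)
engagement  <- rnorm(100, mean = 50, sd = 10)
performance <- rnorm(100, mean = 60, sd = 15)
x <- scale(cbind(engagement, performance))

# Average-linkage hierarchical clustering, cut into 4 groups
hc <- hclust(dist(x), method = "average")
hc_groups <- cutree(hc, k = 4)

# k-means with 4 clusters for comparison
km_groups <- kmeans(x, centers = 4, nstart = 20)$cluster

# Cross-tabulate the assignments; cluster numbers are arbitrary labels,
# so agreement shows up as one dominant cell per row, not as a diagonal
table(hierarchical = hc_groups, kmeans = km_groups)
```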
library(cluster)
pam_kmstudent <- pam(scaled_data, k=3)
plot(silhouette(pam_kmstudent))
library(cluster)
pam_kmstudent <- pam(scaled_data, k=4)
plot(silhouette(pam_kmstudent))
library(cluster)
pam_kmstudent <- pam(scaled_data, k=5)
plot(silhouette(pam_kmstudent))
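The three chunks above repeat the same steps for k = 3, 4, and 5. The average silhouette widths can also be collected in one loop, which makes the comparison across k easier to read. A minimal sketch, assuming the data are re-created with the same seed and draw order as above; ks and avg_width are illustrative names, and the range 2 to 6 is an arbitrary choice.

```r
library(cluster)

# Re-create the scaled features (assumes the same seed and draw order as above)
set.seed(260923)
engagement  <- rnorm(100, mean = 50, sd = 10)
performance <- rnorm(100, mean = 60, sd = 15)
x <- scale(cbind(engagement, performance))

# Average silhouette width of the PAM (k-medoids) solution for each k
ks <- 2:6
avg_width <- sapply(ks, function(k) pam(x, k = k)$silinfo$avg.width)

# One row per k; a larger average width means better-separated clusters
data.frame(k = ks, avg_silhouette_width = round(avg_width, 2))
```

Note that these silhouettes describe the PAM solution, not the k-means solution plotted earlier, although the two methods often produce similar partitions.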
Submit a report containing the following:
A brief description of your approach to dimensionality reduction and clustering.
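For pre-processing, I first standardized the data so that engagement and performance are on the same scale. K-means relies on Euclidean distance, so without scaling the feature with the larger variance would dominate the clustering. Only the engagement and performance of each student were included in scaled_data; the student ID was left out. For dimensionality reduction, I ran PCA on the scaled features. The two components explain about 51% and 49% of the variance, so neither can be dropped, and the clustering was performed on both scaled features.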
The results of your analysis, including the number of clusters
identified, the characteristics of each cluster, and any other insights
you gained from the data.
K-means clustering was run with 4 clusters, a number chosen by visual
inspection of the cluster plot. Hierarchical clustering was then used to
view the data with a different clustering method. Average linkage was
used because it tends to produce fairly balanced trees. Engagement and
performance were simulated independently, and the near-even PCA split
(51% vs. 49%) shows almost no association between them, so the 4
clusters divide the scatter plot into regions rather than follow a
trend: low engagement/low performance, medium to high engagement/low
performance, medium engagement/high performance, and high
engagement/high performance. These cluster labels are subjective, but
naming the groups helps interpretation.
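The characteristics of each cluster can be checked numerically by averaging the original, unscaled features within each k-means cluster. The sketch below is an illustration under the assumption that the data are re-created with the same seed and draw order as above; df, km, and profile are hypothetical names, and cluster numbers are arbitrary labels that may differ between runs.

```r
# Re-create the features (assumes the same seed and draw order as above)
set.seed(260923)
engagement  <- rnorm(100, mean = 50, sd = 10)
performance <- rnorm(100, mean = 60, sd = 15)
df <- data.frame(engagement, performance)

# k-means on the standardized features, 4 clusters
km <- kmeans(scale(df), centers = 4, nstart = 20)
df$cluster <- km$cluster

# Mean engagement and performance per cluster, on the original scale
profile <- aggregate(cbind(engagement, performance) ~ cluster, data = df, FUN = mean)
print(profile)

# Number of students in each cluster
table(df$cluster)
```

Comparing each row of the profile with the overall means (about 50 for engagement and 60 for performance by design) shows which label, such as low engagement/low performance, fits which cluster.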
A discussion of the implications of your findings for learning
analytics.
The silhouette plots, which were computed with PAM (k-medoids), further
support the selection of 4 clusters: the average silhouette width of
0.38 at k=4 is the highest when compared to k=3 or k=5. With a study
like this, teachers, administrators, and parents can begin to look for
root causes within each group. The response might be more engaging
lesson plans or a specialized program for a particular student, or it
may turn out that engagement does not factor into certain students’
performance. Additionally, were this study deemed relevant and useful,
the same approach could be applied to other schools.
Provide at least one scholarly reference.
ÇAM, E., & ÖZDAĞ, M. E. (2020). Discovery of Course Success Using
Unsupervised Machine Learning Algorithms. Malaysian Online Journal
of Educational Technology, 9(1), 26–47. https://doi.org/10.17220/mojet.2021.9.1.242
This study utilizes and evaluates k-means clustering and deep-embedded clustering, which is an advanced method of clustering that uses deep neural networks.
This paper evaluated 6 similar case studies that utilize
Your report should include your code. Submit the published RPubs link to Blackboard.