Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a type of machine learning that can be used to identify patterns and relationships in data without the need for labeled data.
In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.
## Data Simulation
The data for this case study is generated with the simulation function below. The data contains the following features:
- Student ID: A unique identifier for each student
- Feature 1: A measure of student engagement
- Feature 2: A measure of student performance
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(cluster)
simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(101523)
  # Generate unique student IDs
  student_ids <- seq(1, n)
  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)
  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)
  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )
  # Return the data frame
  return(student_features)
}
The groupings identified by the clustering can inform customized learning strategies. For instance, interventions aimed at improving both engagement and academic performance may benefit Cluster 2 students.
student_features <- simulate_student_features(n = 100)
head(student_features)
## student_id student_engagement student_performance
## 1 1 53.07439 67.33784
## 2 2 56.57569 72.86446
## 3 3 37.20003 50.11009
## 4 4 75.51590 73.82575
## 5 5 57.89940 49.79730
## 6 6 46.76411 62.10166
The simulate_student_features function takes the number of students to simulate, n, as input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are drawn independently from normal distributions with means of 50 and 60 and standard deviations of 10 and 15, respectively. Because the function sets the random seed internally, every call with the same n returns the same data.
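The effect of setting the seed inside the function can be checked directly. The sketch below uses a compact stand-in, simulate_sketch, which is a hypothetical name and not part of the report; it assumes the same seed and distribution parameters as the original function.

```r
# Compact stand-in for simulate_student_features(), with the same seed and
# distribution parameters as the report's function (hypothetical helper).
simulate_sketch <- function(n = 100) {
  set.seed(101523)  # fixed seed inside the function, as in the original
  data.frame(
    student_id = seq_len(n),
    student_engagement = rnorm(n, mean = 50, sd = 10),
    student_performance = rnorm(n, mean = 60, sd = 15)
  )
}

first_call <- simulate_sketch(100)
second_call <- simulate_sketch(100)

# Both calls restart the random number generator at the same seed,
# so the two data frames are identical.
print(identical(first_call, second_call))

# Sample means and standard deviations scatter around the target parameters.
print(round(sapply(first_call[, -1], mean), 2))
print(round(sapply(first_call[, -1], sd), 2))
```

One design consequence is that each call also resets the global random number stream, so any random step that runs afterwards starts from the same state.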
Dimensionality reduction is performed on the data using principal component analysis (PCA).
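# Standardize the features (mean 0, standard deviation 1)
scaled_data <- scale(student_features[, c("student_engagement", "student_performance")])
# Perform principal component analysis; the data are already standardized,
# so center = TRUE and scale. = TRUE leave the result unchanged
pca_result <- prcomp(scaled_data, center = TRUE, scale. = TRUE)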
summary(pca_result)
## Importance of components:
## PC1 PC2
## Standard deviation 1.0028 0.9972
## Proportion of Variance 0.5028 0.4972
## Cumulative Proportion 0.5028 1.0000
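The proportions in this summary follow from the component standard deviations: each proportion is a squared standard deviation divided by the sum of the squared values. For two standardized features, the squared values are 1 + |r| and 1 - |r|, where r is the correlation between the features, so a first-component share of 0.5028 corresponds to |r| of roughly 0.006. The sketch below illustrates the relationship on toy data; the names toy, pca_toy, and prop_var are hypothetical and not part of the report.

```r
set.seed(1)  # arbitrary seed for this illustration only
toy <- data.frame(x = rnorm(100), y = rnorm(100))

pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Squared standard deviations are the eigenvalues of the correlation matrix.
eigenvalues <- pca_toy$sdev^2
prop_var <- eigenvalues / sum(eigenvalues)

# With two standardized variables, the eigenvalues are 1 + |r| and 1 - |r|,
# so the proportions are (1 + |r|) / 2 and (1 - |r|) / 2.
r <- cor(toy$x, toy$y)
print(prop_var)
print(c((1 + abs(r)) / 2, (1 - abs(r)) / 2))
```

Because the two simulated features are nearly uncorrelated, both components carry about half of the variance, and keeping both amounts to a rotation of the standardized data rather than a reduction.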
Characteristics of Each Cluster: The students are clustered into three groups based on their engagement and performance levels.
# Keep the first two principal components
pca_data <- as.data.frame(pca_result$x[, 1:2])
# Fix the seed so the random initial cluster centers are reproducible
set.seed(101523)
# Run k-means with three clusters on the principal component scores
kmeans_result <- kmeans(pca_data, centers = 3)
# Add the cluster labels to the original data
student_features$cluster <- kmeans_result$cluster
# Visualize the clusters with ggplot2
library(ggplot2)
ggplot(student_features, aes(x = student_engagement, y = student_performance, color = factor(cluster))) +
  geom_point() +
  labs(title = "KMeans Clustering of Students",
       x = "Student Engagement",
       y = "Student Performance",
       color = "Cluster") +
  theme_minimal()
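The choice of three clusters is fixed in advance in this report. One common way to check such a choice, offered here only as a sketch, is to compare the average silhouette width across several values of k using silhouette() from the cluster package. The toy data and the names candidate_k and avg_width are hypothetical, and nstart = 25 is an assumption that differs from the single start used above.

```r
library(cluster)  # provides silhouette()

set.seed(1)  # arbitrary seed for this illustration only
toy <- scale(data.frame(x = rnorm(100), y = rnorm(100)))
distances <- dist(toy)

# Average silhouette width for k = 2..6; higher values indicate
# better-separated clusters.
candidate_k <- 2:6
avg_width <- sapply(candidate_k, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, distances)[, "sil_width"])
})

print(data.frame(k = candidate_k, avg_silhouette = round(avg_width, 3)))
```

If no value of k stands out with a clearly higher average width, the data may not contain well-separated groups, which is plausible for features drawn independently from normal distributions.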
cluster_centers <- as.data.frame(kmeans_result$centers)
cluster_centers
## PC1 PC2
## 1 1.1361634 0.2036687
## 2 -0.5855954 0.8837331
## 3 -0.2226003 -0.9765731
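The centers above are expressed in principal component coordinates, which are hard to read as engagement or performance levels. A possible way to describe each cluster in the original units is to average the features per cluster label. The sketch below does this on toy data; the names toy and cluster_profile are hypothetical, and the clustering here runs on scaled features rather than on the report's PCA scores.

```r
set.seed(1)  # arbitrary seed for this illustration only
toy <- data.frame(
  student_engagement = rnorm(100, mean = 50, sd = 10),
  student_performance = rnorm(100, mean = 60, sd = 15)
)
toy$cluster <- kmeans(scale(toy), centers = 3, nstart = 25)$cluster

# Mean of each feature per cluster, in the original units.
cluster_profile <- aggregate(
  cbind(student_engagement, student_performance) ~ cluster,
  data = toy,
  FUN = mean
)

# Number of students in each cluster, in the same cluster order.
cluster_profile$n_students <- as.vector(table(toy$cluster))
print(cluster_profile)
```

Applied to student_features, the same aggregate() call would show which cluster combines low engagement with low performance, which is the kind of evidence a targeted intervention would rest on.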
Thank you for reading my report.