Introduction

Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a type of machine learning that can be used to identify patterns and relationships in data without the need for labeled data.

In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.

Data

The data for this case study is generated by the simulation function below and contains the following features:

  • Student ID: A unique identifier for each student
  • Feature 1: A measure of student engagement
  • Feature 2: A measure of student performance

simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  
  # Generate unique student IDs
  student_ids <- seq(1, n)

  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)

  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)

  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )

  # Return the data frame
  return(student_features)
}

This function takes the number of students to simulate as an input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are simulated using normal distributions with mean values of 50 and 60, respectively, and standard deviations of 10 and 15, respectively.

To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:

student_features <- simulate_student_features(n = 100)
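As a quick sanity check, the simulated columns can be compared with the parameters stated above. The sketch below regenerates the data inline so that it runs on its own, using the same seed and draw order as simulate_student_features(). The name `sim` is only illustrative and is not part of the original analysis.

```r
# Recreate the simulated data inline (same seed and draw order as
# simulate_student_features()) so this block is self-contained.
set.seed(260923)
n <- 100
engagement  <- rnorm(n, mean = 50, sd = 10)
performance <- rnorm(n, mean = 60, sd = 15)
sim <- data.frame(
  student_id = seq_len(n),
  student_engagement = engagement,
  student_performance = performance
)

# Structure and first rows of the data frame
str(sim)
head(sim)

# Sample means and standard deviations. With n = 100 these should be
# close to, but not exactly, 50/60 and 10/15.
feature_cols <- c("student_engagement", "student_performance")
print(sapply(sim[, feature_cols], mean))
print(sapply(sim[, feature_cols], sd))

# The two features are drawn independently, so their sample
# correlation is expected to be near zero.
print(cor(sim$student_engagement, sim$student_performance))
```

Because the two features are simulated independently, any grouping found later reflects random variation rather than a built-in cluster structure.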

We can then use this data frame to perform unsupervised learning and identify groups of students with similar learning patterns.

Tasks

library(dplyr)     # For data manipulation
library(ggplot2)   # For data visualization
library(stats)     # For PCA and KMeans

Simulate the data

Simulate the data with 100 students

student_features <- simulate_student_features(n = 100)

Perform dimensionality reduction using PCA

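Assume you want to retain 2 principal components. Because the input has only two features, PCA yields exactly two components, so this step standardizes and rotates the data rather than reducing the number of dimensions.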

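# Standardize both features, then project them onto the principal components
pca_result <- prcomp(
  student_features[, c("student_engagement", "student_performance")],
  center = TRUE,
  scale. = TRUE
)
# Keep the scores on the first two principal components (PC1, PC2)
student_features_pca <- as.data.frame(pca_result$x[, 1:2])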
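To judge how much each principal component contributes, the variance explained and the loadings can be inspected. The following is a minimal sketch that regenerates the simulated features inline so it runs on its own; the names `features`, `pca`, and `var_explained` are illustrative assumptions, not part of the original report.

```r
# Regenerate the two simulated features (same seed and draw order
# as simulate_student_features()).
set.seed(260923)
n <- 100
engagement  <- rnorm(n, mean = 50, sd = 10)
performance <- rnorm(n, mean = 60, sd = 15)
features <- data.frame(
  student_engagement = engagement,
  student_performance = performance
)

# PCA on centered, unit-variance features
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: how each original feature contributes to PC1 and PC2
print(pca$rotation)

# The number of components equals the number of input features
print(ncol(pca$x))
```

With two nearly uncorrelated standardized features, both components are expected to carry a similar share of the variance, which is another sign that no dimension can be dropped here without losing information.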

Cluster the data using KMeans

Assume you want to identify 8 clusters

num_clusters <- 8
kmeans_result <- kmeans(student_features_pca, centers = num_clusters)
student_features$cluster <- as.factor(kmeans_result$cluster)
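The number of clusters is a modeling choice. One common heuristic is the elbow method, which compares the total within-cluster sum of squares across several values of k. The sketch below is an illustration under assumptions: the range 1 to 10 and `nstart = 25` are choices made for this example and do not come from the original analysis.

```r
# Regenerate the simulated features and their PCA scores so this
# block runs on its own.
set.seed(260923)
n <- 100
engagement  <- rnorm(n, mean = 50, sd = 10)
performance <- rnorm(n, mean = 60, sd = 15)
scores <- prcomp(cbind(engagement, performance),
                 center = TRUE, scale. = TRUE)$x

# Fit k-means for k = 1..10 (assumed range) and record the total
# within-cluster sum of squares. nstart = 25 (assumed) reruns each
# fit from 25 random starts and keeps the best one.
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(scores, centers = k, nstart = 25)$tot.withinss
})

elbow <- data.frame(k = k_values, tot_withinss = round(wss, 2))
print(elbow)

# To draw the elbow curve interactively:
# plot(k_values, wss, type = "b",
#      xlab = "Number of clusters k",
#      ylab = "Total within-cluster sum of squares")
```

A pronounced bend in the curve suggests a reasonable k. For data without real group structure, such as independent normal draws, the curve tends to decline smoothly, and no value of k stands out.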

Visualization of PCA and Clusters

# Plot the PCA scores, colored by k-means cluster assignment
ggplot(data = student_features_pca, aes(x = PC1, y = PC2, color = as.factor(kmeans_result$cluster))) +
  geom_point() +
  labs(x = "PC1", y = "PC2", color = "Cluster", title = "PCA and Clustering")

Interpretation of Clustering Results

summary(kmeans_result)
##              Length Class  Mode   
## cluster      100    -none- numeric
## centers       16    -none- numeric
## totss          1    -none- numeric
## withinss       8    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           8    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
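The summary() output above lists only the components of the k-means object and their lengths. To describe the clusters themselves, the sizes, centers, and per-cluster feature means are more informative. The following sketch regenerates the data inline; the names `km` and `cluster_means` are illustrative, and the cluster labels it produces are not guaranteed to match those in the original run.

```r
# Regenerate the simulated features and PCA scores so this block
# runs on its own.
set.seed(260923)
n <- 100
engagement  <- rnorm(n, mean = 50, sd = 10)
performance <- rnorm(n, mean = 60, sd = 15)
features <- data.frame(
  student_engagement = engagement,
  student_performance = performance
)
scores <- as.data.frame(prcomp(features, center = TRUE, scale. = TRUE)$x)

# k-means with 8 clusters, as in the analysis above
km <- kmeans(scores, centers = 8)

# Number of students in each cluster
print(km$size)

# Cluster centers in PCA space (one row per cluster)
print(km$centers)

# Share of total variance explained by the cluster assignment
print(km$betweenss / km$totss)

# Mean engagement and performance per cluster, in original units
features$cluster <- factor(km$cluster)
cluster_means <- aggregate(
  cbind(student_engagement, student_performance) ~ cluster,
  data = features,
  FUN = mean
)
print(cluster_means)
```

Reading the per-cluster means in the original units (for example, low engagement with high performance) is usually easier to interpret than the centers in PCA space.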

Discussion of Implications

Provide insights and implications of the clustering results for learning analytics.

Scholarly Reference

Cite a relevant scholarly reference here.

Submission

Submit a report containing the following:

  • A brief description of your approach to dimensionality reduction and clustering.

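-The two features were standardized and transformed with PCA (prcomp), and the resulting component scores were clustered with k-means using 8 clusters. More generally, clustering involves selecting the most appropriate algorithm for the specific dataset and problem at hand, assessing clustering quality using appropriate metrics, and visualizing the results to gain insights into the data’s underlying structure.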

  • The results of your analysis, including the number of clusters identified, the characteristics of each cluster, and any other insights you gained from the data.

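-K-means assigned each of the 100 students to one of 8 clusters. In the summary output, the cluster component has length 100 because it holds one cluster label per student.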

  • A discussion of the implications of your findings for learning analytics.

-Learning analytics can have a significant impact on educational practices and student outcomes. By utilizing data-driven insights from learning analytics, educators and institutions can make informed decisions to enhance teaching and learning experiences.

  • Provide at least one scholarly reference.

Lester, J. (2015). Big Data in Education: The Digital Future of Learning, Policy and Practice. Sage.

Your report should include your code. Submit the published RPubs link to Blackboard.