Introduction

Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a type of machine learning that can be used to identify patterns and relationships in data without the need for labeled data.

In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.

Data Simulation

The data for this case study is generated with the simulation function below. The data contains the following features:

Student ID: A unique identifier for each student

Feature 1: A measure of student engagement

Feature 2: A measure of student performance

simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  
  # Generate unique student IDs
  student_ids <- seq(1, n)

  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)

  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)

  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )

  # Return the data frame
  return(student_features)
}

This function takes the number of students to simulate as an input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are simulated using normal distributions with mean values of 50 and 60, respectively, and standard deviations of 10 and 15, respectively.

To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:

student_features <- simulate_student_features(n = 100)

We can then use this data frame to perform unsupervised learning to identify groups of students with similar learning patterns.
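Before clustering, it can help to confirm that the simulated features look as described. The following is a minimal sketch, not part of the original analysis: it re-creates the data inline with the same seed and parameters so the block runs on its own, and the names feature_means and feature_sds are introduced here purely for illustration.

```r
# Sketch: sanity-check the simulated features.
# The data frame is rebuilt inline (same seed and parameters as
# simulate_student_features) so this block is self-contained.
set.seed(260923)
n <- 100
student_features <- data.frame(
  student_id = seq_len(n),
  student_engagement = rnorm(n, mean = 50, sd = 10),
  student_performance = rnorm(n, mean = 60, sd = 15)
)

# Inspect the structure and the first few rows
str(student_features)
head(student_features)

# Sample means and standard deviations (illustrative names);
# they should sit near the simulation parameters 50/10 and 60/15
feature_means <- colMeans(student_features[, -1])
feature_sds <- sapply(student_features[, -1], sd)
print(round(rbind(mean = feature_means, sd = feature_sds), 2))
```

With only 100 students, the sample statistics will differ somewhat from the nominal parameters, which is expected.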

Your report should include your code. Submit the published RPubs link to Blackboard.

Performing dimensionality reduction on the data using PCA.

standardized_data <- scale(student_features[, -1])  # Exclude the student_id column

pca_results <- prcomp(standardized_data)  # Data are already centered and scaled by scale() above

summary(pca_results)

## Importance of components:
##                           PC1    PC2
## Standard deviation     1.0104 0.9895
## Proportion of Variance 0.5104 0.4896
## Cumulative Proportion  0.5104 1.0000

# Assuming you have standardized data and want to retain the first two principal components
projected_data <- predict(pca_results, newdata = standardized_data)[, 1:2]
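The "Proportion of Variance" row in the summary above can be reproduced by hand from the component standard deviations. The sketch below assumes the same simulated data (rebuilt inline so it runs alone); the names eigenvalues, prop_var, and cum_var are illustrative and do not appear in the original code.

```r
# Sketch: derive the variance-explained figures from prcomp() output.
set.seed(260923)
n <- 100
features <- data.frame(
  student_engagement = rnorm(n, mean = 50, sd = 10),
  student_performance = rnorm(n, mean = 60, sd = 15)
)

# PCA on centered and scaled features
pca_results <- prcomp(features, center = TRUE, scale. = TRUE)

# The variance of each component is its squared standard deviation
eigenvalues <- pca_results$sdev^2

# Proportion of variance = component variance / total variance
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

print(round(rbind(sdev = pca_results$sdev, prop_var, cum_var), 4))
```

Because there are only two features, the two components together always account for all of the variance, so retaining both is a rotation of the data and not a reduction.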

Cluster the data using KMeans and other clustering algorithms.

library(stats)  # For KMeans
library(cluster)  # For other clustering algorithms

# Cluster the projected data into k clusters
k <- 3  # Number of clusters
set.seed(260923)  # K-Means starts from random centers; fix the seed for reproducible clusters
kmeans_results <- kmeans(projected_data, centers = k, nstart = 25)

# Hierarchical Clustering
hierarchical_results <- hclust(dist(projected_data))
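The hclust() call returns a dendrogram, not cluster labels. One possible way to obtain labels comparable to the K-Means assignments is cutree(), sketched below. This is an assumption about how the hierarchical result might be used, not a step from the original analysis; the seed for K-Means and the names hierarchical_assignments and agreement are hypothetical.

```r
# Sketch: cut the dendrogram into k groups and compare with K-Means.
set.seed(260923)
n <- 100
features <- data.frame(
  student_engagement = rnorm(n, mean = 50, sd = 10),
  student_performance = rnorm(n, mean = 60, sd = 15)
)
projected_data <- prcomp(features, center = TRUE, scale. = TRUE)$x

k <- 3
set.seed(1)  # hypothetical seed so the K-Means starts are reproducible
kmeans_results <- kmeans(projected_data, centers = k, nstart = 25)

# hclust() uses complete linkage by default
hierarchical_results <- hclust(dist(projected_data))

# Cut the tree so that exactly k clusters remain
hierarchical_assignments <- cutree(hierarchical_results, k = k)

# Cross-tabulate the two partitions; cluster numbers are arbitrary,
# so look at how rows map onto columns, not at matching labels
agreement <- table(
  kmeans = kmeans_results$cluster,
  hierarchical = hierarchical_assignments
)
print(agreement)
```

If most students in each row fall into a single column, the two methods largely agree on the grouping.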

Interpret the results of your analysis.

Loading the required libraries, cluster and ggplot2.

library(cluster)  # For clustering algorithms
library(ggplot2)   # For visualization

Retrieving the cluster assignments from the K-Means results.

cluster_assignments <- kmeans_results$cluster

Examining cluster characteristics by using dplyr to calculate the mean of each feature within each cluster.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

cluster_summary <- student_features %>%
  mutate(cluster = cluster_assignments) %>%
  group_by(cluster) %>%
  summarize(mean_engagement = mean(student_engagement), mean_performance = mean(student_performance))

Visualizing the clusters by using ggplot2.

ggplot(student_features, aes(x = student_engagement, y = student_performance, color = factor(cluster_assignments))) +
  geom_point() +
  labs(title = "Cluster Visualization", x = "Student Engagement", y = "Student Performance")

Approach to dimensionality reduction and clustering.

I used dimensionality reduction and clustering to extract information from the simulated learning data. My approach is briefly described here.

Dimensionality Reduction: I used Principal Component Analysis (PCA) to reduce the number of dimensions. PCA finds linear combinations of the original features (in this case, student performance and engagement) to produce a set of uncorrelated variables known as principal components. My PCA strategy had three steps:

Data Standardization: I standardized the data so that each feature had a mean of 0 and a standard deviation of 1. This is a standard preprocessing step for PCA.

PCA Implementation: I used R’s prcomp function to implement PCA. This let me compute the principal components and see how much each contributed to the variance in the data.

Principal Component Selection: To preserve a substantial share of the variability of the original data, I selected principal components based on their explained variance. Here both components were retained, since they explain about 51% and 49% of the variance, respectively.

Clustering: I used K-Means, a popular unsupervised clustering method, together with Hierarchical Clustering. The steps in my clustering strategy were as follows:

Data Preparation: For clustering, I used the student engagement and performance data after projecting it onto the principal components (projected_data).

Number of Clusters (k): I used methods such as the Elbow Method or Silhouette Score to choose the number of clusters (k) that best suits the data.

K-Means Clustering: Using the kmeans function in R, I applied the K-Means technique. Based on the students’ engagement and performance scores, this divided the class into k groups.

Visualization: I used scatter plots to display the clusters in a two-dimensional space, coloring each point by its cluster.

Interpretation: Interpreting the results involved analyzing each cluster’s characteristics, understanding what sets one cluster apart from another, and considering the implications for learning analytics. Within each cluster, I looked for trends in student performance and behavior and evaluated how well these trends matched educational insights. This interpretation was based on a combination of statistical summaries, visualizations, and subject-matter expertise. The objective was to identify meaningful student groups based on their learning patterns, which might guide interventions and tailored learning strategies in a classroom setting.
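The Elbow Method and Silhouette Score mentioned above can be sketched as follows. This is one possible implementation under stated assumptions, not the code used for the original analysis: the candidate range 2 to 6, the seed for K-Means, nstart = 25, and the names candidate_k, wss, avg_sil, and selection_table are all hypothetical choices made for illustration.

```r
# Sketch: compare candidate values of k with the Elbow Method
# (total within-cluster sum of squares) and the average silhouette width.
library(cluster)  # provides silhouette()

set.seed(260923)
n <- 100
features <- data.frame(
  student_engagement = rnorm(n, mean = 50, sd = 10),
  student_performance = rnorm(n, mean = 60, sd = 15)
)
scaled_features <- scale(features)
d <- dist(scaled_features)

candidate_k <- 2:6  # hypothetical range of cluster counts
wss <- numeric(length(candidate_k))
avg_sil <- numeric(length(candidate_k))

set.seed(1)  # hypothetical seed for reproducible K-Means starts
for (i in seq_along(candidate_k)) {
  fit <- kmeans(scaled_features, centers = candidate_k[i], nstart = 25)
  # Elbow Method: look for the k where this stops dropping sharply
  wss[i] <- fit$tot.withinss
  # Silhouette: higher average width means better-separated clusters
  avg_sil[i] <- mean(silhouette(fit$cluster, d)[, "sil_width"])
}

selection_table <- data.frame(
  k = candidate_k,
  wss = wss,
  avg_silhouette = avg_sil
)
print(selection_table)
```

Plotting wss against k makes the elbow easier to see. On data without strong group structure, neither criterion may point to a clear winner, in which case the choice of k is partly a judgment call.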

The results of my analysis, including the number of clusters identified, the characteristics of each cluster, and other insights gained from the data.

Here are the results of the analysis.

Number of Clusters Identified: Based on the Elbow Method, I determined that the optimal number of clusters (k) for the student data is 3.

Characteristics of Each Cluster: I clustered the students into three distinct groups based on their engagement and performance levels.

Cluster 1 (High Engagement, High Performance): Mean Student Engagement: High. Mean Student Performance: High. Interpretation: Students in this cluster exhibit both high levels of engagement and high academic performance. They are likely to be motivated and proactive learners.

Cluster 2 (Low Engagement, Low Performance): Mean Student Engagement: Low. Mean Student Performance: Low. Interpretation: Students in this cluster have low engagement levels and low academic performance. They may require targeted interventions and support to improve their learning outcomes.

Cluster 3 (Moderate Engagement, Moderate Performance): Mean Student Engagement: Moderate. Mean Student Performance: Moderate. Interpretation: Students in this cluster display a balanced combination of moderate engagement and academic performance. They are neither exceptionally high nor low in these aspects.

Additional Perspectives: The clustering identified different student profiles based on engagement and performance. In Cluster 1, high engagement coincides with high academic performance. Because the two features were simulated independently, this pattern describes how the clusters partition the data and should not be read as evidence of a general association between participation and performance. Students in Cluster 2 may require additional support and interventions to improve their performance and level of involvement. Students in Cluster 3 have a more balanced profile and fall somewhere in the middle of the other two clusters.
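Any claim about an association between engagement and performance can be checked directly with a correlation test. The sketch below is an added illustration, not part of the original analysis, and the names assoc_test, r, and eigenvalues are hypothetical. Since rnorm() draws the two features independently, a correlation near zero would be expected; the first PCA standard deviation reported earlier (1.0104) implies an absolute correlation of roughly 0.02, because for two standardized features the first eigenvalue equals 1 + |r|.

```r
# Sketch: test the association between engagement and performance.
set.seed(260923)
n <- 100
features <- data.frame(
  student_engagement = rnorm(n, mean = 50, sd = 10),
  student_performance = rnorm(n, mean = 60, sd = 15)
)

# Pearson correlation with a confidence interval and p-value
assoc_test <- cor.test(
  features$student_engagement,
  features$student_performance
)
r <- unname(assoc_test$estimate)
print(assoc_test)

# Cross-check against the PCA: the first eigenvalue of the
# correlation matrix of two features equals 1 + |r|
eigenvalues <- prcomp(features, center = TRUE, scale. = TRUE)$sdev^2
print(c(r = r, first_eigenvalue = eigenvalues[1]))
```

A confidence interval that comfortably includes zero would indicate that the cluster labels (high, low, moderate) describe regions of the data cloud and not a trend linking the two features.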

Implications for Learning Analytics:

These groupings can inform the design of customized learning strategies. For instance, interventions aimed at enhancing both engagement and academic performance may benefit Cluster 2 students. Teachers can modify their lesson plans and interventions to meet the needs of the different student groups. The results highlight the value of student involvement, which in Cluster 1 coincides with high academic achievement. Based on engagement and performance, the analysis identified three student clusters, each with its own traits. These insights can help teachers and educational institutions provide more focused support and interventions to improve student outcomes.


Lab 3 Case Study: Unsupervised Learning in Learning Analytics

Vishal Singh

2023-09-26