Introduction

Learning analytics is the use of data to understand and improve learning. Unsupervised learning is a type of machine learning that can be used to identify patterns and relationships in data without the need for labeled data.

In this case study, you will use unsupervised learning to analyze learning data from a Simulated School course. You will use dimensionality reduction to reduce the number of features in the data, and then use clustering to identify groups of students with similar learning patterns.

Data

The data for this case study is generated with the simulation function below. The data contains the following features:

Student ID: A unique identifier for each student
Feature 1: A measure of student engagement
Feature 2: A measure of student performance

simulate_student_features <- function(n = 100) {
  # Set the random seed
  set.seed(260923)
  
  # Generate unique student IDs
  student_ids <- seq(1, n)

  # Simulate student engagement
  student_engagement <- rnorm(n, mean = 50, sd = 10)

  # Simulate student performance
  student_performance <- rnorm(n, mean = 60, sd = 15)

  # Combine the data into a data frame
  student_features <- data.frame(
    student_id = student_ids,
    student_engagement = student_engagement,
    student_performance = student_performance
  )

  # Return the data frame
  return(student_features)
}

This function takes the number of students to simulate as an input and returns a data frame with three columns: student_id, student_engagement, and student_performance. The student_engagement and student_performance features are simulated using normal distributions with mean values of 50 and 60, respectively, and standard deviations of 10 and 15, respectively.

To use the simulate_student_features() function, we can simply pass the desired number of students to simulate as the argument:

student_features <- simulate_student_features(n = 100)

We can then use this data frame to perform unsupervised learning and identify groups of students with similar learning patterns.

Tasks

pr.out <- prcomp(student_features, scale. = TRUE)

summary(pr.out)
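The call above passes every column to prcomp(), including student_id, which is an identifier rather than a learning measure; that is why three components are reported. A minimal sketch of an alternative follows, assuming the intent is to analyze only the engagement and performance columns. The names features_only and pr.features are illustrative and not part of the original analysis, and the data is rebuilt inline (same seed and draw order as simulate_student_features()) only so the snippet runs on its own.

```r
# Rebuild the simulated data inline so this snippet is self-contained
set.seed(260923)
n <- 100
student_features <- data.frame(
  student_id = seq_len(n),
  student_engagement = rnorm(n, mean = 50, sd = 10),
  student_performance = rnorm(n, mean = 60, sd = 15)
)

# Keep only the two measured features; student_id is just a label
# (assumption: the ID should not influence the components)
features_only <- student_features[, c("student_engagement", "student_performance")]

# PCA on standardized features
pr.features <- prcomp(features_only, scale. = TRUE)

# Variance explained by each component
summary(pr.features)

# Loadings: how each original feature contributes to each component
pr.features$rotation
```

With only two input features, PCA can return at most two components, so there is little dimensionality left to reduce in this simulated data set.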

km.out <- kmeans(student_features, centers = 3, nstart = 20)

summary(km.out)
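Calling summary() on a kmeans object only lists the length and type of each component, so it says little about the clusters themselves. The sketch below shows one way to inspect the result more directly. It assumes that clustering should use the standardized engagement and performance columns without student_id; the names features_scaled and km.scaled and the extra set.seed(1) call are illustrative choices, not part of the original analysis.

```r
# Rebuild the simulated data inline so this snippet is self-contained
set.seed(260923)
n <- 100
student_features <- data.frame(
  student_id = seq_len(n),
  student_engagement = rnorm(n, mean = 50, sd = 10),
  student_performance = rnorm(n, mean = 60, sd = 15)
)

# Standardize the two measured features so neither dominates the distances
# (assumption: student_id is excluded because it is only a label)
features_scaled <- scale(
  student_features[, c("student_engagement", "student_performance")]
)

# k-means uses random starting centers, so fix a seed for repeatability
set.seed(1)
km.scaled <- kmeans(features_scaled, centers = 3, nstart = 20)

# Number of students in each cluster
km.scaled$size

# Cluster centers, in standardized units
km.scaled$centers

# Cluster assignment counts (one assignment per student)
table(km.scaled$cluster)
```

The cluster component holds one label per student, so its length equals the number of rows, while the number of clusters is set by centers.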

Submission

Submit a report containing the following:

TC - My approach was to use PCA to reduce the dimensionality of the data while retaining as much of the information in the original data set as possible.

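TC - Results are shown below. k-means was run with centers = 3, so 3 clusters were identified; the cluster component has length 100 because it holds one cluster assignment for each of the 100 students.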

TC - My goal is to apply PCA to my project to predict hockey players' probability of scoring goals. With the algorithms covered so far, there are a number of ways to accomplish that.

DataCamp, Feb 2023. https://www.datacamp.com/tutorial/pca-analysis-r

Your report should include your code. Submit the published RPubs link to Blackboard.

> pr.out <- prcomp(student_features,scale. = TRUE)
>
> summary(pr.out)
Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.0825 0.9895 0.9214
Proportion of Variance 0.3906 0.3264 0.2830
Cumulative Proportion  0.3906 0.7170 1.0000
>
> km.out <- kmeans(student_features, centers = 3, nstart = 20)
>
> summary(km.out)
             Length Class  Mode
cluster      100    -none- numeric
centers        9    -none- numeric
totss          1    -none- numeric
withinss       3    -none- numeric
tot.withinss   1    -none- numeric
betweenss      1    -none- numeric
size           3    -none- numeric
iter           1    -none- numeric
ifault         1    -none- numeric
>
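
The "Proportion of Variance" row in the PCA output can be recovered from the "Standard deviation" row: each component's variance is its standard deviation squared, divided by the total variance. The short sketch below works through that arithmetic using the reported values; the variable names are illustrative, and the results should agree with the printed proportions up to rounding.

```r
# Standard deviations as reported by summary(pr.out) in the output above
sdev <- c(PC1 = 1.0825, PC2 = 0.9895, PC3 = 0.9214)

# Variance of each component is the squared standard deviation
variance <- sdev^2

# Share of the total variance explained by each component
prop_var <- variance / sum(variance)

# Running total of explained variance
cum_prop <- cumsum(prop_var)

round(prop_var, 4)
round(cum_prop, 4)
```

Read this way, the first two components together account for about 72% of the variance, so dropping the third would discard roughly 28%.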