Clustering.R

##### Clustering with initial data source
#install.packages("ggplot2") # This package offers excellent plots to visualise data
#install.packages("RColorBrewer") # This package offers attractive color palettes for use in plots

### Load the installed libraries
library(ggplot2)
library(RColorBrewer)

## Reading the data file. I call this file initialclust; it contains relatively few data points. E.g., let's say this is the data after Day 10 of introducing the Personalized Learning Platform
initialclust = read.csv("initialclust.csv", header=TRUE) # header=TRUE means that the first row of the dataset holds the column names

## Top few rows of the data:
head(initialclust)

##   QID HardLevel TotalTimesAnswered CorrectAttempts FA_MeanTime PerCorrect
## 1 151    Medium                 82              52   10.981585      0.634
## 2  93      Easy                122              98    6.682476      0.803
## 3  15      Easy                135             110    5.729358      0.815
## 4  37      Easy                117              95    3.385395      0.812
## 5  67      Easy                117              93    9.106476      0.795
## 6 194      Hard                 92              47   16.331469      0.511

# Number of rows in the dataset
nrow(initialclust)

## [1] 83

# There are 83 questions in the sample dataset

# Initial classification of difficulty level of the questions
table(initialclust$HardLevel)

## 
##   Easy   Hard Medium 
##     36     22     25

# According to our initial dataset, 36 questions are Easy, 25 are Medium and 22 are Hard. Let's say this is the Common Core assumption

# Description of Variable Names:
# QID - Question ID
# HardLevel - Difficulty level of the question, assigned based on Common Core/Other standards/Assumptions.
# TotalTimesAnswered - Number of times the question has been answered in total
# CorrectAttempts - Number of times the question has been answered correctly
# FA_MeanTime - Average time taken to answer the question
# PerCorrect - CorrectAttempts / TotalTimesAnswered
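
# A quick sanity check of the PerCorrect definition, using only the six rows printed by head() above. This is a minimal sketch; the data frame name sample_rows is a hypothetical helper and is not part of the original script.

```r
# Six rows copied from the head(initialclust) output (hypothetical helper frame)
sample_rows <- data.frame(
  QID                = c(151, 93, 15, 37, 67, 194),
  TotalTimesAnswered = c(82, 122, 135, 117, 117, 92),
  CorrectAttempts    = c(52, 98, 110, 95, 93, 47),
  PerCorrect         = c(0.634, 0.803, 0.815, 0.812, 0.795, 0.511)
)

# Recompute the ratio and compare it with the stored (3-decimal) PerCorrect
ratio   <- sample_rows$CorrectAttempts / sample_rows$TotalTimesAnswered
max_gap <- max(abs(ratio - sample_rows$PerCorrect))

# A gap below 0.0005 means the stored values agree after rounding to 3 decimals
max_gap < 0.0005
```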

# NOTE: This is a randomly generated dataset. In practice, the question database will have many more columns (e.g. time for second attempts, operation name, subtopic name)

# Algorithm/Technique Chosen: Clustering (K-means clustering)

# Preparing the input data for clustering. k-means works on numeric variables only, so IDs and categorical columns such as QID and HardLevel must be left out.
# Here the data is clustered on two variables, FA_MeanTime and PerCorrect (the 5th and 6th columns of the dataset). In practice we can, and will need to, cluster on many other factors (e.g. time taken for the second attempt, median time taken), but two variables keep the example easy to visualise.
# This creates a data frame with only the 5th and 6th columns of the initial dataset
clust_subset = initialclust[,5:6] 

## Top few rows of the data:
head(clust_subset)

##   FA_MeanTime PerCorrect
## 1   10.981585      0.634
## 2    6.682476      0.803
## 3    5.729358      0.815
## 4    3.385395      0.812
## 5    9.106476      0.795
## 6   16.331469      0.511

# For reproducibility of cluster information
set.seed(45) 

# Applying the kmeans algorithm. The kmeans function ships with R (stats package), so we don't have to install any package for it.
# The idea behind k-means clustering is to identify 'k' (in our case 3) centers (means). Each point is assigned to the cluster whose center is closest to it, with closeness measured by Euclidean distance.
# 3 is the number of clusters we have chosen to obtain. The number of clusters depends on the business need/domain. Here we choose 3 because our hypothesis is that there are 3 levels of difficulty: Easy, Medium and Hard

kmeans_inidata = kmeans(clust_subset,3) 
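
# In the rows shown above, FA_MeanTime ranges from about 3 to 16 while PerCorrect stays between 0 and 1, so an unscaled Euclidean distance is driven almost entirely by time. One common option, sketched here on made-up data, is to standardize the columns with scale() and to use several random starts. The toy frame and its values are hypothetical and only stand in for clust_subset; this is not what the script above does.

```r
set.seed(45)

# Hypothetical toy data standing in for clust_subset (not the real file)
toy <- data.frame(
  FA_MeanTime = c(rnorm(20, 5, 1), rnorm(20, 16, 2), rnorm(20, 35, 3)),
  PerCorrect  = c(runif(20, 0.7, 0.9), runif(20, 0.5, 0.7), runif(20, 0.3, 0.5))
)

# Standardize each column to mean 0 and standard deviation 1
toy_scaled <- scale(toy)

# nstart = 25 runs 25 random initialisations and keeps the best solution
kmeans_scaled <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Cluster sizes
table(kmeans_scaled$cluster)
```

# Whether scaling is appropriate depends on how much weight each variable should carry in the difficulty estimate.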

# The clustering algorithm has now been applied to the data. Let's look at how the clusters have been formed from the two parameters.

kmeans_inidata$centers

##   FA_MeanTime PerCorrect
## 1    5.210075  0.7353061
## 2   35.309795  0.4740000
## 3   15.989607  0.5728276

# Looking at the centers above, Cluster 1 is the one we would label 'Easy': users take 5.2 seconds on average to solve questions from this cluster and answer them correctly about 74% of the time. Cluster 2 is 'Hard': its questions take 35.3 seconds on average and are answered correctly only 47% of the time. Cluster 3 sits in between (16.0 seconds, 57% correct) and corresponds to 'Medium'.
# To summarize, Easy questions are those that are solved the quickest and answered correctly most often. Hard questions are those that take a lot more time and are difficult to answer correctly.
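
# The assignment rule (each point goes to the center with the smallest Euclidean distance) can be checked by hand. This is a small sketch that uses only the printed centers and the six rows shown by head(); the names centers, pts and nearest_center are hypothetical helpers, not part of the original script.

```r
# Cluster centers copied from the kmeans_inidata$centers output
centers <- matrix(c( 5.210075, 0.7353061,
                    35.309795, 0.4740000,
                    15.989607, 0.5728276),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("1", "2", "3"),
                                  c("FA_MeanTime", "PerCorrect")))

# The six rows printed by head(clust_subset)
pts <- data.frame(
  FA_MeanTime = c(10.981585, 6.682476, 5.729358, 3.385395, 9.106476, 16.331469),
  PerCorrect  = c(0.634, 0.803, 0.815, 0.812, 0.795, 0.511)
)

# Index of the center with the smallest Euclidean distance to (x, y)
nearest_center <- function(x, y) {
  d <- sqrt((centers[, "FA_MeanTime"] - x)^2 + (centers[, "PerCorrect"] - y)^2)
  unname(which.min(d))
}

assigned <- mapply(nearest_center, pts$FA_MeanTime, pts$PerCorrect)
assigned
```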


# The plot will look something like this:

cols <- rev(brewer.pal(11, 'RdYlBu'))
cluster = kmeans_inidata$cluster
ggplot(initialclust, aes(FA_MeanTime, PerCorrect, color=cluster)) +
  geom_point(size=4) +
  scale_colour_gradientn(colours = cols)

# Each point in the plot is a question (one QID). Whenever someone answers a question, its statistics change, and as they change the question may or may not remain in the same cluster.
# In the plot, you can see some blue and red points that lie very close to each other. Such questions may switch clusters after a few more attempts by the children.
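
# The cluster ids are categories rather than points on a continuous scale, so one alternative is to map them as a factor to a discrete palette. The sketch below uses six hypothetical toy rows (values rounded from the head() output) and the "Set1" palette as an arbitrary choice; its colors differ from those in the plot described above.

```r
library(ggplot2)

# Hypothetical toy rows; in the script this would be initialclust
# together with kmeans_inidata$cluster
toy <- data.frame(
  FA_MeanTime = c(10.98, 6.68, 5.73, 3.39, 9.11, 16.33),
  PerCorrect  = c(0.634, 0.803, 0.815, 0.812, 0.795, 0.511),
  cluster     = factor(c(3, 1, 1, 1, 1, 3), levels = 1:3)
)

# One distinct color per cluster; drop = FALSE keeps unused levels in the legend
p <- ggplot(toy, aes(FA_MeanTime, PerCorrect, colour = cluster)) +
  geom_point(size = 4) +
  scale_colour_brewer(palette = "Set1", drop = FALSE)
p
```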

# Adding the cluster information back to the original data frame
initialclust$cluster = kmeans_inidata$cluster 

# Now our dataset will look like this:
head(initialclust)

##   QID HardLevel TotalTimesAnswered CorrectAttempts FA_MeanTime PerCorrect
## 1 151    Medium                 82              52   10.981585      0.634
## 2  93      Easy                122              98    6.682476      0.803
## 3  15      Easy                135             110    5.729358      0.815
## 4  37      Easy                117              95    3.385395      0.812
## 5  67      Easy                117              93    9.106476      0.795
## 6 194      Hard                 92              47   16.331469      0.511
##   cluster
## 1       3
## 2       1
## 3       1
## 4       1
## 5       1
## 6       3

# A new column 'cluster' has been added. This is the difficulty level assigned by the clustering algorithm. 
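
# The cluster numbers returned by k-means are arbitrary labels and can change with a different seed, so it can help to derive readable names from the centers. The sketch below assumes that a lower mean answer time means an easier question; that ordering rule, and the helper names difficulty and cluster_label, are assumptions for illustration.

```r
# Cluster centers copied from the kmeans_inidata$centers output
centers <- matrix(c( 5.210075, 0.7353061,
                    35.309795, 0.4740000,
                    15.989607, 0.5728276),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("1", "2", "3"),
                                  c("FA_MeanTime", "PerCorrect")))

# Order the clusters from fastest to slowest mean answer time
ord <- order(centers[, "FA_MeanTime"])

# difficulty[i] holds the label for cluster i
difficulty <- character(3)
difficulty[ord] <- c("Easy", "Medium", "Hard")

# Translate the six cluster ids printed above into labels
cluster_ids   <- c(3, 1, 1, 1, 1, 3)
cluster_label <- difficulty[cluster_ids]
cluster_label
```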

# Let's compare the clustering results with our hard coded difficulty levels. 

table(initialclust$HardLevel, initialclust$cluster)

##         
##           1  2  3
##   Easy   30  0  6
##   Hard    4  5 13
##   Medium 15  0 10

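# Initially, we had classified 36 questions as 'Easy', but the clustering algorithm places only 30 of them in the 'Easy' cluster (Cluster 1) and puts the other 6 in the 'Medium' cluster (Cluster 3). Likewise, 15 of the 25 'Medium' questions fall in Cluster 1, and only 5 of the 22 'Hard' questions fall in Cluster 2.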
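
# The overall agreement between the hard-coded levels and the clusters can be read off the table. This sketch rebuilds the printed table by hand and assumes the mapping Cluster 1 = Easy, Cluster 2 = Hard, Cluster 3 = Medium; the names xtab, agree and total are hypothetical helpers.

```r
# Cross-tabulation copied from the table() output above
xtab <- matrix(c(30, 0,  6,
                  4, 5, 13,
                 15, 0, 10),
               nrow = 3, byrow = TRUE,
               dimnames = list(HardLevel = c("Easy", "Hard", "Medium"),
                               cluster   = c("1", "2", "3")))

# Questions whose hard-coded level matches the assumed cluster label
agree <- xtab["Easy", "1"] + xtab["Hard", "2"] + xtab["Medium", "3"]
total <- sum(xtab)

# Share of questions on which the two classifications agree (45 of 83)
agree / total
```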

# The advantage of clustering is that the difficulty levels keep adjusting to the users' comfort level with the questions. It can also help us record the deviations from the initial classification on a daily basis.

srsampath

Tue Sep 13 12:37:44 2016