Introduction

The purpose of this project is to apply unsupervised learning methods to the provided data set of students' performance. Several plots are used to represent the data visually. The elbow and silhouette methods are used to determine the optimal number of clusters in the data set, and the k-means and CLARA algorithms are then used to cluster the students according to their scores in maths, reading and writing.

Data Processing

The libraries below are used for plotting, clustering and data manipulation.

library(ClusterR)
library(gridExtra)
library(cluster)
library(ggplot2)
library(ggmosaic)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(corrplot)
library(pillar)

Summary of Student’s Performance Dataset

## Rows: 373
## Columns: 7
## $ gender                      <chr> "female", "female", "male", "female", "fem~
## $ ethnicity                   <chr> "group B", "group B", "group D", "group B"~
## $ parental.level.of.education <chr> "bachelor's degree", "master's degree", "h~
## $ test.preparation.course     <chr> "none", "none", "completed", "none", "none~
## $ math.score                  <int> 72, 90, 64, 38, 65, 50, 88, 46, 66, 74, 73~
## $ reading.score               <int> 72, 95, 64, 60, 81, 53, 89, 42, 69, 71, 74~
## $ writing.score               <int> 74, 93, 67, 50, 73, 58, 86, 46, 63, 80, 72~
##     gender           ethnicity         parental.level.of.education
##  Length:373         Length:373         Length:373                 
##  Class :character   Class :character   Class :character           
##  Mode  :character   Mode  :character   Mode  :character           
##                                                                   
##                                                                   
##                                                                   
##  test.preparation.course   math.score     reading.score    writing.score
##  Length:373              Min.   :  8.00   Min.   : 24.00   Min.   : 15  
##  Class :character        1st Qu.: 56.00   1st Qu.: 59.00   1st Qu.: 57  
##  Mode  :character        Median : 66.00   Median : 70.00   Median : 68  
##                          Mean   : 65.64   Mean   : 69.02   Mean   : 68  
##                          3rd Qu.: 76.00   3rd Qu.: 79.00   3rd Qu.: 79  
##                          Max.   :100.00   Max.   :100.00   Max.   :100

Including Plots

The following box plots compare the distribution of scores between male and female students. For math.score, the median for male students is higher than the median for female students. For reading.score and writing.score, female students have the higher median.

fig1<-ggplot(student_performance, aes(y = math.score, color = gender)) + geom_boxplot() +
  ggtitle("Gender wise distribution of \n Maths scores") + theme_bw()
fig2<-ggplot(student_performance, aes(y = reading.score, color = gender)) + geom_boxplot() +
  ggtitle("Gender wise distribution of \n Reading scores") + theme_bw()
fig3<-ggplot(student_performance, aes(y = writing.score, color = gender)) + geom_boxplot() +
  ggtitle("Gender wise distribution of \n Writing scores") + theme_bw()
grid.arrange(fig1, fig2, fig3, ncol = 2, nrow = 2)
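The medians that the box plots display can also be tabulated directly. The sketch below is only an illustration: it uses a small made-up data frame (the values are invented and are not taken from the data set) that borrows two column names from student_performance. On the real data, the same aggregate() call should apply with `data = student_performance`.

```r
# Toy data frame; the scores are invented for illustration only
toy_scores <- data.frame(
  gender        = c("female", "female", "female", "male", "male", "male"),
  math.score    = c(60, 65, 70, 68, 72, 80),
  reading.score = c(75, 80, 70, 60, 66, 71)
)

# Median of each score column within each gender
median_by_gender <- aggregate(cbind(math.score, reading.score) ~ gender,
                              data = toy_scores, FUN = median)
print(median_by_gender)
```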

The plot below shows the distribution of gender within each ethnicity group. Only in group A do male students outnumber female students; in every other group, female students outnumber male students.

ggplot(student_performance) +
  geom_mosaic(aes(x = product(gender, ethnicity), 
                  fill = gender)) +
  xlab("Ethnicity group") + ylab("Gender") +
  labs(fill = "Gender") + 
  ggtitle("Gender wise ethnicity distribution") + 
  theme(axis.text.x = element_text(angle = 25, vjust = 1, hjust=1))

The plots below compare the scores of students who completed the test preparation course with those of students who did not. For math.score and reading.score, the medians of the two groups differ only slightly. For writing.score, the difference between the medians is considerably larger.

fig4<-ggplot(student_performance, aes(y = math.score, color = test.preparation.course)) + 
  geom_boxplot()+ggtitle("Does completing a preparation course affect the scores?") + 
  theme_bw()
fig5<-ggplot(student_performance, aes(y = reading.score, color = test.preparation.course)) + geom_boxplot()+
  theme_bw()
fig6<-ggplot(student_performance, aes(y = writing.score, color = test.preparation.course)) + geom_boxplot()+
  theme_bw()
grid.arrange(fig4, fig5, fig6, nrow = 3 )

Checking Correlation

Correlation is a statistical measure of how strongly two variables are related to each other. These variables can be input features that are used to predict a target variable. A positive correlation means that when the value of one variable increases, the value of the other tends to increase as well; a negative correlation means that the other tends to decrease.

From the correlation matrix and the plot below, we can see that all three score variables are highly and positively correlated, with reading and writing scores being the most strongly related (0.95).

# Keep only the three numeric score columns (columns 5 to 7)
student_score <- student_performance[, 5:7]
# Pearson correlation matrix of the three scores
correlation_data <- cor(student_score, method = "pearson")
corrplot(correlation_data, type = "lower")

correlation_data
##               math.score reading.score writing.score
## math.score     1.0000000     0.8261012     0.8111747
## reading.score  0.8261012     1.0000000     0.9548454
## writing.score  0.8111747     0.9548454     1.0000000
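As a minimal sketch of what cor() computes, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The vectors below hold only the first ten math and reading scores printed in the data summary, so the resulting value is not expected to match the full-data coefficient in the matrix above.

```r
# First ten math and reading scores as printed in the data summary
math    <- c(72, 90, 64, 38, 65, 50, 88, 46, 66, 74)
reading <- c(72, 95, 64, 60, 81, 53, 89, 42, 69, 71)

# Pearson r = cov(x, y) / (sd(x) * sd(y))
r_manual  <- cov(math, reading) / (sd(math) * sd(reading))
r_builtin <- cor(math, reading, method = "pearson")

# Both routes give the same coefficient
print(c(manual = r_manual, builtin = r_builtin))
```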

Elbow Method

The elbow method is a heuristic for determining the number of clusters in a data set. It consists of plotting a measure of cluster compactness, here the total within-cluster sum of squares, as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.

In the graph below, the total within-cluster sum of squares drops sharply from k = 1 to k = 2, still falls noticeably from k = 2 to k = 3, and flattens out afterwards, so the elbow lies at 2 to 3 clusters.

maximum_clusters <- 7

elbow_data <- sapply(1:maximum_clusters, function(k) {
  kmeans(student_score, k, nstart = 50, iter.max = 1000)$tot.withinss
})

elbow_data
## [1] 255396.32 109521.82  64563.02  50984.47  41140.71  35523.26  30831.61
plot(1:maximum_clusters, elbow_data,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")
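To make the position of the elbow easier to judge, the reduction gained by each additional cluster can be computed from the values printed above. This is only a sketch of one way to read the curve; the variable names `wss_drop` and `wss_drop_pct` are introduced here for illustration.

```r
# Total within-cluster sum of squares for k = 1..7, copied from the output above
elbow_data <- c(255396.32, 109521.82, 64563.02, 50984.47,
                41140.71, 35523.26, 30831.61)

# Absolute reduction gained by each additional cluster
wss_drop <- -diff(elbow_data)

# Reduction relative to the WSS before adding that cluster, in percent
wss_drop_pct <- round(100 * wss_drop / head(elbow_data, -1), 1)

print(data.frame(k = 2:7, wss_drop, wss_drop_pct))
```

The reduction is largest when moving from one to two clusters and shrinks with every further cluster.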

Optimal Clusters for k-means and Clara

To support this finding about the optimal number of clusters, the same total within-cluster sum of squares (WSS) analysis is repeated with fviz_nbclust for both k-means and CLARA.

optimum_kmeans <- fviz_nbclust(student_score, FUNcluster = kmeans, method = "wss") + 
  ggtitle("Optimal clusters for \n K-means")

optimum_clara <- fviz_nbclust(student_score, FUNcluster = cluster::clara, method = "wss") + 
  ggtitle("Optimal clusters  for\n CLARA")

grid.arrange(optimum_kmeans, optimum_clara, ncol=2)

Silhouette plot

The silhouette method is another way to find the optimal number of clusters and to interpret and validate the consistency within clusters. For each point it computes a silhouette coefficient, which measures how similar the point is to its own cluster (cohesion) compared to other clusters (separation). The silhouette value ranges from -1 to 1, where a high value indicates that the point is well matched to its own cluster and poorly matched to neighboring clusters. If most points have a high value, the clustering configuration is appropriate. If many points have a low or negative value, the clustering configuration may have too many or too few clusters.

We can see from the plots below that the optimal number of clusters according to the silhouette method is 2.

sil_opt_kmeans <- fviz_nbclust(student_score, FUNcluster = kmeans, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n K-means")

sil_opt_clara <- fviz_nbclust(student_score, FUNcluster = cluster::clara, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n CLARA")

grid.arrange(sil_opt_kmeans, sil_opt_clara, ncol=2)
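As a sketch of how a single silhouette coefficient is obtained, the example below computes it by hand for one point and compares it with cluster::silhouette(). The four one-dimensional points and their cluster labels are made up for illustration and are not taken from the data set.

```r
library(cluster)

# Four made-up 1-D points forming two obvious groups
x  <- c(1, 2, 10, 11)
cl <- c(1L, 1L, 2L, 2L)

# Silhouette of the first point by hand:
# a = mean distance to the other points in its own cluster (cohesion)
# b = mean distance to the points in the nearest other cluster (separation)
a <- mean(abs(x[1] - x[2]))
b <- mean(abs(x[1] - x[3:4]))
s_manual <- (b - a) / max(a, b)

# The same quantity from cluster::silhouette()
sil   <- silhouette(cl, dist(x))
s_pkg <- unname(sil[1, "sil_width"])

print(c(manual = s_manual, package = s_pkg))
```

Because the point lies much closer to its own cluster than to the other one, its coefficient is close to 1.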

k-means plot

K-means partitions the data into k clusters by assigning each point to its nearest cluster centre and minimizing the within-cluster sum of squares, that is, the sum of squared distances between the points and their cluster centre. When k is 1, the within-cluster sum of squares is at its highest. As k increases, the within-cluster sum of squares decreases.

Below is the k-means plot for our data set with the optimal number of 2 clusters. The plot shows a slight overlap between the two clusters. In the silhouette plot, cluster 2 appears to be a good fit, as it has fewer points with negative silhouette values.

kmeans_data <- eclust(student_score, k = 2, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)

kmean_list <- fviz_cluster(kmeans_data, data = student_score, ellipse.type = "convex", geom = "point") + ggtitle("K-means with 2 clusters")
kmean_silhouette <- fviz_silhouette(kmeans_data)
##   cluster size ave.sil.width
## 1       1  207          0.47
## 2       2  166          0.45
grid.arrange(kmean_list, kmean_silhouette, ncol=2)
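The quantity that k-means minimizes can be illustrated on a small synthetic example. The sketch below uses made-up two-dimensional points (not the student scores) and an arbitrary seed; it shows that the total within-cluster sum of squares for k = 1 equals the total sum of squares around the overall mean and that it shrinks once the data are split into more clusters.

```r
set.seed(42)  # arbitrary seed so the made-up data and random starts are reproducible

# Two made-up, well-separated groups of 2-D points
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 1), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 1), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, 2, 3
wss <- sapply(1:3, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# For k = 1 the WSS is the total sum of squares around the overall mean
total_ss <- sum(scale(toy, scale = FALSE)^2)

print(round(wss, 2))
print(round(total_ss, 2))
```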

Clara

Below is the result of applying CLARA (Clustering Large Applications), a sampling-based variant of k-medoids clustering, to the data set.

clara_data <- eclust(student_score, k = 2, FUNcluster = "clara", hc_metric = "euclidean", graph = FALSE)

clara_list <- fviz_cluster(clara_data, data = student_score, ellipse.type = "norm", geom = "point") + ggtitle("CLARA with 2 clusters")
clara_silhouette <- fviz_silhouette(clara_data)
##   cluster size ave.sil.width
## 1       1  249          0.45
## 2       2  124          0.52
grid.arrange(clara_list, clara_silhouette, ncol=2)
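One practical difference from k-means is that CLARA represents each cluster by a medoid, which is an actual observation, rather than by a computed mean. The sketch below calls cluster::clara() directly on made-up two-dimensional points (not the student scores) to show this; the `samples = 5` setting is simply the function's default written out.

```r
library(cluster)

set.seed(42)  # arbitrary seed for generating the made-up data

# Two made-up groups of 2-D points
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 1), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 1), ncol = 2)
)

# CLARA runs k-medoids (PAM) on sub-samples and keeps the best set of medoids
fit <- clara(toy, k = 2, samples = 5)

# Unlike k-means centres, medoids are rows of the original data
print(fit$i.med)             # row indices of the medoids
print(fit$medoids)           # coordinates of the medoids
print(table(fit$clustering)) # cluster sizes
```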

Conclusion

Looking at the k-means and CLARA plots above and comparing the average silhouette widths, we can conclude that the optimal number of clusters is 2. K-means (0.47 and 0.45) and CLARA (0.45 and 0.52) give almost the same average silhouette width per cluster, which indicates that both methods can be used for the provided data set.

References

https://www.kaggle.com/spscientist/students-performance-in-exams
https://www.rdocumentation.org/packages/ggmosaic/versions/0.3.3
https://stackoverflow.com/questions