The purpose of this project is to apply unsupervised learning methods to the provided data set of student performance. Several plots are used to represent the data set visually. The elbow and silhouette methods are used to determine the optimal number of clusters, and I then use the k-means and CLARA methods to cluster the students according to their scores in math, reading and writing.
The following libraries are used for plotting and data manipulation.
library(ClusterR)
library(gridExtra)
library(cluster)
library(ggplot2)
library(ggmosaic)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(corrplot)
library(pillar)
## Rows: 373
## Columns: 7
## $ gender <chr> "female", "female", "male", "female", "fem~
## $ ethnicity <chr> "group B", "group B", "group D", "group B"~
## $ parental.level.of.education <chr> "bachelor's degree", "master's degree", "h~
## $ test.preparation.course <chr> "none", "none", "completed", "none", "none~
## $ math.score <int> 72, 90, 64, 38, 65, 50, 88, 46, 66, 74, 73~
## $ reading.score <int> 72, 95, 64, 60, 81, 53, 89, 42, 69, 71, 74~
## $ writing.score <int> 74, 93, 67, 50, 73, 58, 86, 46, 63, 80, 72~
##     gender           ethnicity         parental.level.of.education
##  Length:373         Length:373         Length:373
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##
##
##
##  test.preparation.course    math.score      reading.score     writing.score
##  Length:373               Min.   :  8.00   Min.   : 24.00   Min.   : 15
##  Class :character         1st Qu.: 56.00   1st Qu.: 59.00   1st Qu.: 57
##  Mode  :character         Median : 66.00   Median : 70.00   Median : 68
##                           Mean   : 65.64   Mean   : 69.02   Mean   : 68
##                           3rd Qu.: 76.00   3rd Qu.: 79.00   3rd Qu.: 79
##                           Max.   :100.00   Max.   :100.00   Max.   :100
The following box plots compare the score distributions of male and female students. For math.score, the median for male students is higher than the median for female students. For reading.score and writing.score, female students have the higher median.
fig1<-ggplot(student_performance, aes(y = math.score, color = gender)) + geom_boxplot() +
ggtitle("Gender wise distribution of \n Maths scores") + theme_bw()
fig2<-ggplot(student_performance, aes(y = reading.score, color = gender)) + geom_boxplot() +
ggtitle("Gender wise distribution of \n Reading scores") + theme_bw()
fig3<-ggplot(student_performance, aes(y = writing.score, color = gender)) + geom_boxplot() +
ggtitle("Gender wise distribution of \n Writing scores") + theme_bw()
grid.arrange(fig1, fig2, fig3, ncol = 2, nrow = 2)
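The medians that the box plots display can also be read off numerically. The following is a minimal sketch, assuming only the first four rows shown in the glimpse output above; the `excerpt` data frame and the name `median_by_gender` are introduced here for illustration, and four rows say nothing about the full data set.

```r
# First four rows of gender and the three score columns,
# copied from the glimpse output above (illustrative excerpt only).
excerpt <- data.frame(
  gender        = c("female", "female", "male", "female"),
  math.score    = c(72, 90, 64, 38),
  reading.score = c(72, 95, 64, 60),
  writing.score = c(74, 93, 67, 50)
)

# Median of each score per gender; cbind() lets aggregate()
# summarise several response columns in one call.
median_by_gender <- aggregate(
  cbind(math.score, reading.score, writing.score) ~ gender,
  data = excerpt,
  FUN  = median
)
print(median_by_gender)
```

Passing `student_performance` as `data` instead of `excerpt` should give the medians for all 373 students.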
The plot below shows the gender composition of each ethnicity group. Only in group A do male students outnumber female students; in all other groups, female students outnumber male students.
ggplot(student_performance) +
geom_mosaic(aes(x = product(gender, ethnicity),
fill = gender)) +
xlab("Ethnicity group") + ylab("Gender") +
labs(fill = "Gender") +
ggtitle("Gender wise ethnicity distribution") +
theme(axis.text.x = element_text(angle = 25, vjust = 1, hjust=1))
The plots below compare scores by completion of the test preparation course. For math.score and reading.score, the medians differ only slightly between students who completed the course and those who did not. For writing.score, the difference between the medians is much larger.
fig4<-ggplot(student_performance, aes(y = math.score, color = test.preparation.course)) +
geom_boxplot()+ggtitle("Does completing a preparation course affect the scores?") +
theme_bw()
fig5<-ggplot(student_performance, aes(y = reading.score, color = test.preparation.course)) + geom_boxplot()+
theme_bw()
fig6<-ggplot(student_performance, aes(y = writing.score, color = test.preparation.course)) + geom_boxplot()+
theme_bw()
grid.arrange(fig4, fig5, fig6, nrow = 3 )
Correlation is a statistical measure of how strongly two variables are linearly related. A positive correlation means that when one variable increases the other tends to increase as well, a negative correlation means that the other tends to decrease, and a value near zero indicates no linear relationship. Here the variables are the three score columns that serve as input features for clustering.
The correlation matrix and plot below show that all three score variables are highly correlated, with reading and writing scores the most strongly related (0.95).
student_score <- student_performance[,5:7]
# Pearson correlation matrix of the three score columns
correlation_data <- cor(student_score, method = "pearson", use = "everything")
corrplot(correlation_data, type ="lower")
correlation_data
## math.score reading.score writing.score
## math.score 1.0000000 0.8261012 0.8111747
## reading.score 0.8261012 1.0000000 0.9548454
## writing.score 0.8111747 0.9548454 1.0000000
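For reference, each entry of the matrix above is a Pearson correlation coefficient. A standard way to write it for two score columns x and y over n students is given below; the symbols are notation introduced here, not taken from the original code.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The coefficient lies between -1 and 1, and the diagonal entries equal 1 because every variable is perfectly correlated with itself.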
The elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.
In the graph below, the total within-cluster sum of squares falls most steeply between k = 1 and k = 2, falls less steeply between k = 2 and k = 3, and flattens out beyond that, so the elbow lies at 2 to 3 clusters.
maximum_clusters <- 7
elbow_data <- sapply(1:maximum_clusters, function(k){kmeans(student_score, k,
nstart=50,iter.max = 1000 )$tot.withinss})
elbow_data
## [1] 255396.32 109521.82 64563.02 50984.47 41140.71 35523.26 30831.61
plot(1:maximum_clusters, elbow_data,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
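One way to make the elbow easier to judge is to look at the relative reduction gained by each additional cluster. The sketch below assumes the values printed above for `elbow_data`; the names `wss` and `reduction_pct` are introduced here for illustration.

```r
# Total within-cluster sum of squares for k = 1..7,
# copied from the elbow_data output above.
wss <- c(255396.32, 109521.82, 64563.02, 50984.47,
         41140.71, 35523.26, 30831.61)

# Relative reduction (in percent) achieved by moving from k to k + 1.
reduction_pct <- 100 * (head(wss, -1) - tail(wss, -1)) / head(wss, -1)
names(reduction_pct) <- paste0("k=", 1:6, " to ", 2:7)
print(round(reduction_pct, 1))
```

The largest gain comes from the step from 1 to 2 clusters, and the gains shrink with every further cluster, which matches the shape of the plotted curve.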
To support this finding about the optimal number of clusters, I repeat the total within-cluster sum of squares analysis for both k-means and CLARA.
optimum_kmeans <- fviz_nbclust(student_score, FUNcluster = kmeans, method = "wss") +
ggtitle("Optimal clusters for \n K-means")
optimum_clara <- fviz_nbclust(student_score, FUNcluster = cluster::clara, method = "wss") +
ggtitle("Optimal clusters for\n CLARA")
grid.arrange(optimum_kmeans, optimum_clara, ncol=2)
The silhouette method is another way to find the optimal number of clusters and to interpret and validate the consistency within clusters. It computes a silhouette coefficient for each point, which measures how similar the point is to its own cluster (cohesion) compared to other clusters (separation). The silhouette value ranges from -1 to 1, where a high value indicates that the point is well matched to its own cluster and poorly matched to neighboring clusters. If most points have a high value, the clustering configuration is appropriate. If many points have a low or negative value, the clustering configuration may have too many or too few clusters.
The plots below show that the optimal number of clusters according to the silhouette method is 2.
sil_opt_kmeans <- fviz_nbclust(student_score, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
sil_opt_clara <- fviz_nbclust(student_score, FUNcluster = cluster::clara, method = "silhouette") +
ggtitle("Optimal number of clusters \n CLARA")
grid.arrange(sil_opt_kmeans, sil_opt_clara, ncol=2)
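The silhouette coefficient described above has a compact standard definition. In the notation below, which is introduced here for illustration, a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the smallest mean distance from point i to the points of any other cluster.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

A value close to 1 means a(i) is much smaller than b(i), so the point sits well inside its cluster; a negative value means the point is on average closer to another cluster than to its own. The average silhouette width plotted above is the mean of s(i) over all points.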
K-means partitions the observations into k clusters by assigning each observation to the cluster with the nearest centroid (mean), so as to minimize the total within-cluster sum of squares. When k is 1, the within-cluster sum of squares is at its highest. As k increases, the within-cluster sum of squares decreases.
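The quantity that k-means reduces can be checked by hand. The sketch below uses a small synthetic data set, not the student data, and recomputes the total within-cluster sum of squares from the fitted centroids; `toy`, `fit` and `manual_wss` are names introduced here for illustration.

```r
set.seed(1)  # make the synthetic data and the random starts reproducible

# Synthetic data: two groups of 20 points in two dimensions.
toy <- rbind(matrix(rnorm(40, mean = 50, sd = 5), ncol = 2),
             matrix(rnorm(40, mean = 80, sd = 5), ncol = 2))

fit <- kmeans(toy, centers = 2, nstart = 10)

# Sum of squared distances from every point to the centroid of its
# assigned cluster; fit$centers[fit$cluster, ] repeats the matching
# centroid row for each observation.
manual_wss <- sum((toy - fit$centers[fit$cluster, ])^2)

print(c(manual = manual_wss, kmeans = fit$tot.withinss))
```

Both numbers should agree, which confirms that `tot.withinss` is the sum of squared distances of the points to their own cluster centroids.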
Below is the k-means plot for our data set with the optimal number of 2 clusters. The plot shows a slight overlap between the two clusters, and cluster 2 appears to be a good fit because it has fewer observations with negative silhouette values.
# Fit k-means with 2 clusters; graph = FALSE suppresses eclust's automatic plot
kmeans_data <- eclust(student_score, k = 2, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)
# ellipse.type (not elipse.type) is the argument fviz_cluster() expects
kmean_list <- fviz_cluster(kmeans_data, data = student_score, ellipse.type = "convex", geom = c("point")) + ggtitle("K-means with 2 clusters")
kmean_silhouette <- fviz_silhouette(kmeans_data)
## cluster size ave.sil.width
## 1 1 207 0.47
## 2 2 166 0.45
grid.arrange(kmean_list, kmean_silhouette, ncol=2)
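Below is the result of applying CLARA to the same data set.
# Fit CLARA with 2 clusters; graph = FALSE suppresses eclust's automatic plot
clara_data <- eclust(student_score, k = 2, FUNcluster = "clara", hc_metric = "euclidean", graph = FALSE)
# ellipse.type (not elipse.type) is the argument fviz_cluster() expects
clara_list <- fviz_cluster(clara_data, data = student_score, ellipse.type = "norm", geom = c("point")) + ggtitle("CLARA with 2 clusters")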
clara_silhouette <- fviz_silhouette(clara_data)
## cluster size ave.sil.width
## 1 1 249 0.45
## 2 2 124 0.52
grid.arrange(clara_list, clara_silhouette, ncol=2)
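The per-cluster figures reported above can be combined into one overall average silhouette width per method by weighting each cluster by its size. This is a rough sketch that assumes the rounded values printed by `fviz_silhouette()`, so the results are approximate; `sil` and `overall` are names introduced here for illustration.

```r
# Cluster sizes and average silhouette widths, copied from the
# k-means and CLARA outputs above (widths are rounded to 2 decimals).
sil <- data.frame(
  method = c("kmeans", "kmeans", "clara", "clara"),
  size   = c(207, 166, 249, 124),
  width  = c(0.47, 0.45, 0.45, 0.52)
)

# Size-weighted mean of the per-cluster widths for each method.
overall <- sapply(split(sil, sil$method),
                  function(d) weighted.mean(d$width, d$size))
print(round(overall, 2))
```

The two overall widths differ only in the second decimal place, so the rounded inputs do not allow a firm ranking of the two methods.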
From the k-means and CLARA plots above and the average silhouette widths, we can conclude that the optimal number of clusters is 2. Both k-means and CLARA give almost the same average silhouette width, which indicates that either method can be used for the provided data set.