Every year, high school students take a final exam that determines whether or not they can enter college. After receiving a tremendous number of applications, a college typically sets an average passing grade to filter out those who fail, for instance 40/100. However, such an approach can be problematic, because a slight change in the passing grade may leave the college's capacity overloaded or underloaded (if a significant number of students have grades equal to the passing grade). In this paper, clustering methods are used to divide students into smaller groups with similar performance across three exams.
The [dataset](https://www.kaggle.com/spscientist/students-performance-in-exams) was taken from Kaggle in December 2018. It contains 1000 observations with 8 variables; however, since the main interest is the students' grades, only the three last columns, representing academic performance, are kept: the math, reading and writing scores.
The basic summary statistics of the three score columns:
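A minimal sketch of the data preparation and summary step (the CSV file name is an assumption; this code is not shown in the original):

students <- read.csv("StudentsPerformance.csv")   # file name assumed
df <- students[, c("math.score", "reading.score", "writing.score")]
summary(df)          # basic summary statistics of the three scores
colSums(is.na(df))   # check for missing values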
There are no missing values, and the distributions of the three columns look similar to one another.
The Hopkins statistic is used to determine how well the data can be clustered:
H0: no clusters are visible (the data are uniformly distributed)
H1: some clusters are visible
library(factoextra)   # get_clust_tendency()
get_clust_tendency(df, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.06402075
##
## $plot
The Hopkins statistic (0.064) is far below 0.5 (in this implementation a value close to 0 indicates a highly clusterable data set), so we reject the null hypothesis of uniformly distributed data.
The Duda-Hart test checks whether the data set should be split into 2 clusters.
H0: homogeneity (a single cluster)
H1: heterogeneity (two clusters)
library(fpc)   # dudahart2()
km1 <- kmeans(df, 2)
dudahart2(df, km1$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.4187496
##
## $compare
## [1] 0.7196301
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Since cluster1 = FALSE, the null hypothesis of homogeneity is rejected and H1 is accepted: the data set should be split into at least two clusters.
Having performed both tests above, we can assume the data set is suitable for clustering.
The factoextra library provides the function fviz_nbclust() to determine the optimal number of clusters for a given clustering method and criterion. This paper focuses on three methods: k-means, PAM and CLARA. The comparison metric is the average silhouette width.
library(cluster)      # pam(), clara()
fviz_nbclust(df, FUNcluster = kmeans, method = "silhouette")
As indicated by the graph, the number of clusters should be 2.
Other indices, however, tell a different story:
library(NbClust)
# Calinski-Harabasz index for k = 2..8 (complete-linkage hierarchical clustering)
how_many <- NbClust(df, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
cat('Number of cluster should be', how_many$Best.nc, "\n")
## Number of cluster should be 4 1096.157
how_many$All.index
## 2 3 4 5 6 7 8
## 157.5453 706.3257 1096.1565 833.0342 828.6204 795.0242 751.3601
The values above are the Calinski-Harabasz index, a measure of clustering quality, for k = 2 to 8.
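The index values could also be visualised against the candidate number of clusters; a minimal sketch using base graphics (not part of the original code):

# Calinski-Harabasz index for each candidate k
plot(2:8, how_many$All.index, type = "b",
     xlab = "Number of clusters k", ylab = "Calinski-Harabasz index")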
Since a higher Calinski-Harabasz value is better, it is possible that 4 clusters are better than 2. Therefore both cases, k = 2 and k = 4, are considered. This choice can also suit the Dean's Office: if they only want to know who passes or fails, 2 clusters are enough; if the college's capacity is the main constraint, 4 clusters are better:
Group 1: those who will certainly be accepted
Group 2: those who are highly likely to be accepted if there are available places
Group 3: those who have a lower chance of being accepted if there are available places
Group 4: those who will certainly not be accepted
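The table below reports the average silhouette width for k-means, PAM and CLARA with k = 2 to 6 under three distance metrics. The code that produced it is not shown in the original; a sketch of how such a grid could be computed is given here (the identical rows per method suggest the distance argument had no effect on these partitioning methods, so only Euclidean distance is used in the sketch):

# Average silhouette width for each method and each k (eclust() from factoextra;
# pam() and clara() come from the cluster package)
avg_sil <- function(method, k) {
  fit <- eclust(df, FUNcluster = method, k = k, graph = FALSE)
  fit$silinfo$avg.width
}
sil_grid <- sapply(2:6, function(k)
  sapply(c("kmeans", "pam", "clara"), avg_sil, k = k))
colnames(sil_grid) <- 2:6
sil_grid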
## 2 3 4 5 6
## kmeans euclidean 0.4730479 0.4056746 0.3737878 0.3328932 0.3276116
## kmeans manhattan 0.4730479 0.4056746 0.3737878 0.3328932 0.3276116
## kmeans canberra 0.4730479 0.4056746 0.3737878 0.3328932 0.3276116
## pam euclidean 0.4746317 0.3972226 0.3334809 0.3046521 0.2781847
## pam manhattan 0.4746317 0.3972226 0.3334809 0.3046521 0.2781847
## pam canberra 0.4746317 0.3972226 0.3334809 0.3046521 0.2781847
## clara euclidean 0.4743694 0.3959897 0.3135392 0.2687079 0.2403324
## clara manhattan 0.4743694 0.3959897 0.3135392 0.2687079 0.2403324
## clara canberra 0.4743694 0.3959897 0.3135392 0.2687079 0.2403324
From the results above, k = 2 has the highest average silhouette width and there is no difference between the distance metrics; all methods produce approximately identical results. To keep the subsequent analysis concise, k-means with Euclidean distance is used, for both 2 and 4 clusters (for the reasons mentioned above).
k = 2
# k-means with k = 2; eclust() also computes the silhouette information
km2 <- eclust(df, "kmeans", hc_metric = "euclidean", k = 2)
fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 556 0.49
## 2 2 444 0.45
k = 4
# k-means with k = 4
km4 <- eclust(df, "kmeans", hc_metric = "euclidean", k = 4)
fviz_silhouette(km4)
## cluster size ave.sil.width
## 1 1 247 0.41
## 2 2 76 0.36
## 3 3 275 0.37
## 4 4 402 0.36
In both cases the silhouette plot looks good: there are few negative values, which means the clusters are clearly separated and overlap only slightly.
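The per-cluster descriptive statistics below match the output format of describeBy() from the psych package; a sketch of how they were presumably generated (the package choice is an assumption):

library(psych)
# Descriptive statistics of the three scores within each cluster
describeBy(df, group = km2$cluster)   # k = 2 solution
describeBy(df, group = km4$cluster)   # k = 4 solution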
k = 2
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew
## math.score 1 556 75.95 10.01 75 75.60 10.38 51 100 49 0.28
## reading.score 2 556 79.38 8.70 78 78.93 8.90 60 100 40 0.43
## writing.score 3 556 78.59 8.98 78 78.08 8.90 61 100 39 0.47
## kurtosis se
## math.score -0.46 0.42
## reading.score -0.35 0.37
## writing.score -0.40 0.38
## --------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range
## math.score 1 444 53.74 10.87 55 54.52 10.38 0 76 76
## reading.score 2 444 56.38 9.53 58 57.25 8.90 17 76 59
## writing.score 3 444 54.86 10.30 56 55.54 8.90 10 75 65
## skew kurtosis se
## math.score -0.93 1.82 0.52
## reading.score -0.96 1.15 0.45
## writing.score -0.80 1.16 0.49
k = 4
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew
## math.score 1 247 83.69 7.85 83 83.56 7.41 67 100 33 0.11
## reading.score 2 247 86.68 6.62 86 86.44 5.93 73 100 27 0.29
## writing.score 3 247 85.91 7.17 85 85.73 7.41 69 100 31 0.20
## kurtosis se
## math.score -0.54 0.50
## reading.score -0.47 0.42
## writing.score -0.66 0.46
## --------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew
## math.score 1 76 38.91 10.82 40.0 39.68 9.64 0 59 59 -0.89
## reading.score 2 76 41.18 7.76 42.5 41.79 5.19 17 56 39 -0.79
## writing.score 3 76 38.96 8.02 41.0 39.92 5.93 10 50 40 -1.32
## kurtosis se
## math.score 1.35 1.24
## reading.score 0.51 0.89
## writing.score 1.89 0.92
## --------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range skew
## math.score 1 275 54.92 7.65 55 55.00 8.90 35 73 38 -0.10
## reading.score 2 275 57.28 5.40 58 57.39 5.93 41 72 31 -0.21
## writing.score 3 275 55.53 5.96 55 55.60 5.93 41 69 28 -0.07
## kurtosis se
## math.score -0.41 0.46
## reading.score -0.20 0.33
## writing.score -0.42 0.36
## --------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max range skew
## math.score 1 402 68.05 7.27 68 68.01 7.41 45 87 42 0.02
## reading.score 2 402 71.84 5.62 72 71.79 5.93 57 86 29 0.04
## writing.score 3 402 71.14 5.79 71 71.15 5.93 57 87 30 0.01
## kurtosis se
## math.score -0.22 0.36
## reading.score -0.48 0.28
## writing.score -0.57 0.29
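To connect the k = 4 solution back to the four admission groups listed earlier, one possible post-processing step (an assumption, not part of the original analysis) is to rank the clusters by their mean total score:

# Relabel the k = 4 clusters as admission groups 1-4, ranked by mean total score
totals <- rowSums(df)                               # total score per student
cluster_means <- tapply(totals, km4$cluster, mean)  # mean total per cluster
ranking <- order(cluster_means, decreasing = TRUE)  # cluster ids, best first
admission_group <- match(km4$cluster, ranking)      # 1 = strongest, 4 = weakest
table(admission_group)                              # group sizes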
From these summaries, the same pattern emerges whether the number of clusters is 2 or 4: students who score highly in one subject tend to score highly in the other subjects as well, and vice versa.
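A quick way to quantify this pattern (not in the original) is to look at the pairwise correlations between the three scores:

round(cor(df), 2)   # pairwise correlations between math, reading and writing
pairs(df)           # pairwise scatter plots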
This paper introduces a clustering-based approach to selecting applicants, which could potentially reduce the time and workload of the admission process significantly.