Every year, high school students take a final exam that determines whether or not they can enter college. After receiving a tremendous number of applications, a college typically sets an average passing grade to filter out those who fail, for instance 40/100. However, such an approach can be problematic, because a slight change in the passing grade may leave the college's capacity overloaded or underloaded (if a significant number of students have grades equal to the passing grade). In this paper, clustering methods are used to divide students into smaller groups with similar performance across three exams.
The [dataset](https://www.kaggle.com/spscientist/students-performance-in-exams) was taken from Kaggle in December 2018. It contains 1000 observations with 8 variables; however, since the main interest is the students' grades, only the three last columns, representing academic performance, are kept: the math, reading and writing scores.
The basic summary statistics of the three score columns:
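A minimal sketch of the data preparation and summary step (the CSV file name is an assumption; this code is not shown in the original):

students <- read.csv("StudentsPerformance.csv")   # file name assumed
df <- students[, c("math.score", "reading.score", "writing.score")]
summary(df)          # basic summary statistics of the three scores
colSums(is.na(df))   # check for missing values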
There are no missing values, and the distributions of the three columns look similar to one another.
The Hopkins statistic is used to determine how well the data can be clustered:
H0: no clusters are visible (the data are uniformly distributed)
H1: some clusters are visible
library(factoextra)   # get_clust_tendency()
get_clust_tendency(df, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.06402075
##
## $plot
The Hopkins statistic (0.064) is far below 0.5 (in this implementation a value close to 0 indicates a highly clusterable data set), so we reject the null hypothesis of uniformly distributed data.
The Duda-Hart test checks whether the data set should be split into 2 clusters.
H0: homogeneity (a single cluster)
H1: heterogeneity (two clusters)
library(fpc)   # dudahart2()
km1 <- kmeans(df, 2)
dudahart2(df, km1$cluster)
## $p.value
## [1] 0
##
## $dh
## [1] 0.4187496
##
## $compare
## [1] 0.7196301
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Since cluster1 = FALSE, the null hypothesis of homogeneity is rejected and H1 is accepted: the data set should be split into at least two clusters.
Having performed both tests above, we can assume the data set is suitable for clustering.
The factoextra library provides the function fviz_nbclust() to determine the optimal number of clusters for a given clustering method and criterion. This paper focuses on three methods: k-means, PAM and CLARA. The comparison metric is the average silhouette width.
library(cluster)      # pam(), clara()
fviz_nbclust(df, FUNcluster = kmeans, method = "silhouette")
As indicated by the graph, the number of clusters should be 2.
Other indices, however, tell a different story:
library(NbClust)
# Calinski-Harabasz index for k = 2..8 (complete-linkage hierarchical clustering)
how_many <- NbClust(df, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
cat('Number of cluster should be', how_many$Best.nc, "\n")
## Number of cluster should be 4 1096.157
how_many$All.index
## 2 3 4 5 6 7 8
## 157.5453 706.3257 1096.1565 833.0342 828.6204 795.0242 751.3601
The values above are the Calinski-Harabasz index, a measure of clustering quality, for k = 2 to 8.
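The index values could also be visualised against the candidate number of clusters; a minimal sketch using base graphics (not part of the original code):

# Calinski-Harabasz index for each candidate k
plot(2:8, how_many$All.index, type = "b",
     xlab = "Number of clusters k", ylab = "Calinski-Harabasz index")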
Since a higher Calinski-Harabasz value is better, it is possible that 4 clusters are better than 2. Therefore both cases, k = 2 and k = 4, are considered. This choice can also suit the Dean's Office: if they only want to know who passes or fails, 2 clusters are enough; if the college's capacity is the main constraint, 4 clusters are better:
Group 1: those who will certainly be accepted
Group 2: those who are highly likely to be accepted if there are available places
Group 3: those who have a lower chance of being accepted if there are available places
Group 4: those who will certainly not be accepted
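The table below reports the average silhouette width for k-means, PAM and CLARA with k = 2 to 6 under three distance metrics. The code that produced it is not shown in the original; a sketch of how such a grid could be computed is given here (the identical rows per method suggest the distance argument had no effect on these partitioning methods, so only Euclidean distance is used in the sketch):

# Average silhouette width for each method and each k (eclust() from factoextra;
# pam() and clara() come from the cluster package)
avg_sil <- function(method, k) {
  fit <- eclust(df, FUNcluster = method, k = k, graph = FALSE)
  fit$silinfo$avg.width
}
sil_grid <- sapply(2:6, function(k)
  sapply(c("kmeans", "pam", "clara"), avg_sil, k = k))
colnames(sil_grid) <- 2:6
sil_grid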
## 2 3 4 5 6
## kmeans euclidean 0.4730479 0.4056746 0.3737878 0.3328932 0.3276116
## kmeans manhattan 0.4730479 0.4056746 0.3737878 0.3328932 0.3276116
## kmeans canberra 0.4730479 0.4056746 0.3737878 0.3328932 0.3276116
## pam euclidean 0.4746317 0.3972226 0.3334809 0.3046521 0.2781847
## pam manhattan 0.4746317 0.3972226 0.3334809 0.3046521 0.2781847
## pam canberra 0.4746317 0.3972226 0.3334809 0.3046521 0.2781847
## clara euclidean 0.4743694 0.3959897 0.3135392 0.2687079 0.2403324
## clara manhattan 0.4743694 0.3959897 0.3135392 0.2687079 0.2403324
## clara canberra 0.4743694 0.3959897 0.3135392 0.2687079 0.2403324
From the results above, k = 2 has the highest average silhouette width and there is no difference between the distance metrics; all methods produce approximately identical results. To keep the subsequent analysis concise, k-means with Euclidean distance is used, for both 2 and 4 clusters (for the reasons mentioned above).
k = 2
# k-means with k = 2; eclust() also computes the silhouette information
km2 <- eclust(df, "kmeans", hc_metric = "euclidean", k = 2)
fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 556 0.49
## 2 2 444 0.45
k = 4
# k-means with k = 4
km4 <- eclust(df, "kmeans", hc_metric = "euclidean", k = 4)
fviz_silhouette(km4)
## cluster size ave.sil.width
## 1 1 247 0.41
## 2 2 76 0.36
## 3 3 275 0.37
## 4 4 402 0.36
In both cases the silhouette plot looks good: there are few negative values, which means the clusters are clearly separated and overlap only slightly.
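The per-cluster descriptive statistics below match the output format of describeBy() from the psych package; a sketch of how they were presumably generated (the package choice is an assumption):

library(psych)
# Descriptive statistics of the three scores within each cluster
describeBy(df, group = km2$cluster)   # k = 2 solution
describeBy(df, group = km4$cluster)   # k = 4 solution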
k = 2
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew
## math.score 1 556 75.95 10.01 75 75.60 10.38 51 100 49 0.28
## reading.score 2 556 79.38 8.70 78 78.93 8.90 60 100 40 0.43
## writing.score 3 556 78.59 8.98 78 78.08 8.90 61 100 39 0.47
## kurtosis se
## math.score -0.46 0.42
## reading.score -0.35 0.37
## writing.score -0.40 0.38
## --------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range
## math.score 1 444 53.74 10.87 55 54.52 10.38 0 76 76
## reading.score 2 444 56.38 9.53 58 57.25 8.90 17 76 59
## writing.score 3 444 54.86 10.30 56 55.54 8.90 10 75 65
## skew kurtosis se
## math.score -0.93 1.82 0.52
## reading.score -0.96 1.15 0.45
## writing.score -0.80 1.16 0.49
k = 4
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew
## math.score 1 247 83.69 7.85 83 83.56 7.41 67 100 33 0.11
## reading.score 2 247 86.68 6.62 86 86.44 5.93 73 100 27 0.29
## writing.score 3 247 85.91 7.17 85 85.73 7.41 69 100 31 0.20
## kurtosis se
## math.score -0.54 0.50
## reading.score -0.47 0.42
## writing.score -0.66 0.46
## --------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew
## math.score 1 76 38.91 10.82 40.0 39.68 9.64 0 59 59 -0.89
## reading.score 2 76 41.18 7.76 42.5 41.79 5.19 17 56 39 -0.79
## writing.score 3 76 38.96 8.02 41.0 39.92 5.93 10 50 40 -1.32
## kurtosis se
## math.score 1.35 1.24
## reading.score 0.51 0.89
## writing.score 1.89 0.92
## --------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range skew
## math.score 1 275 54.92 7.65 55 55.00 8.90 35 73 38 -0.10
## reading.score 2 275 57.28 5.40 58 57.39 5.93 41 72 31 -0.21
## writing.score 3 275 55.53 5.96 55 55.60 5.93 41 69 28 -0.07
## kurtosis se
## math.score -0.41 0.46
## reading.score -0.20 0.33
## writing.score -0.42 0.36
## --------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max range skew
## math.score 1 402 68.05 7.27 68 68.01 7.41 45 87 42 0.02
## reading.score 2 402 71.84 5.62 72 71.79 5.93 57 86 29 0.04
## writing.score 3 402 71.14 5.79 71 71.15 5.93 57 87 30 0.01
## kurtosis se
## math.score -0.22 0.36
## reading.score -0.48 0.28
## writing.score -0.57 0.29
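To connect the k = 4 solution back to the four admission groups listed earlier, one possible post-processing step (an assumption, not part of the original analysis) is to rank the clusters by their mean total score:

# Relabel the k = 4 clusters as admission groups 1-4, ranked by mean total score
totals <- rowSums(df)                               # total score per student
cluster_means <- tapply(totals, km4$cluster, mean)  # mean total per cluster
ranking <- order(cluster_means, decreasing = TRUE)  # cluster ids, best first
admission_group <- match(km4$cluster, ranking)      # 1 = strongest, 4 = weakest
table(admission_group)                              # group sizes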
From these summaries, the same pattern emerges whether the number of clusters is 2 or 4: students who score highly in one subject tend to score highly in the other subjects as well, and vice versa.
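A quick way to quantify this pattern (not in the original) is to look at the pairwise correlations between the three scores:

round(cor(df), 2)   # pairwise correlations between math, reading and writing
pairs(df)           # pairwise scatter plots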
This paper introduces a clustering-based approach to selecting applicants, which could potentially reduce the time and workload of the admission process significantly.