The objective of this report is to determine the main reasons for underperformance among students attending two Portuguese schools, Gabriel Pereira and Mousinho da Silveira, based on their passing and failing results and the factors associated with them. The target class is derived from the final grade, with a grade of 10 as the benchmark: grades below it are labelled Fail and grades above it are labelled Pass. The dataset was sourced from the UCI Machine Learning Repository. The project comprises two stages: Phase 1 emphasises data pre-processing and exploration, and Phase 2 focuses on modelling. Section 2 describes the datasets and their attributes, Section 3 covers data pre-processing, Section 4 explores each attribute and their inter-relationships, and the final section presents the summary.
We considered three classifiers: Naive Bayes (NB), Random Forest (RF), and Decision Tree (DT). The NB classifier served as the baseline. Each classifier was trained to output probability predictions so that we could adjust the prediction threshold to refine performance. We split the full dataset into an 80% training set and a 20% test set, each preserving the proportion of target classes in the full data. For the fine-tuning process, we ran stratified five-fold cross-validation on each classifier; stratified sampling was used to account for the slight class imbalance in the target feature.
Next, for each classifier, we determined the optimal probability threshold. Using the tuned hyperparameters and the optimal thresholds, we made predictions on the test data. During model training (hyperparameter tuning and threshold adjustment), we relied on the mean misclassification error rate (mmce). In addition to mmce, we used the confusion matrix on the test data to evaluate the classifiers' performance. The modelling was implemented in R with the mlr package [@mlr].
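A minimal sketch of this setup with mlr is shown below; the object and column names (e.g. train_data, final_result) are placeholders rather than the exact names used in our scripts:

library(mlr)
# Classification task on the training data; 'final_result' stands in for the Pass/Fail target
task <- makeClassifTask(data = train_data, target = "final_result", positive = "Pass")
# Learners set to output class probabilities so the prediction threshold can be adjusted later
lrn_nb <- makeLearner("classif.naiveBayes", predict.type = "prob")
lrn_rf <- makeLearner("classif.randomForest", predict.type = "prob")
lrn_dt <- makeLearner("classif.rpart", predict.type = "prob")
# Stratified five-fold cross-validation used during hyperparameter tuning
rdesc <- makeResampleDesc("CV", iters = 5, stratify = TRUE)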
Since the training set might have unwittingly excluded rare instances, the NB classifier might produce fitted zero probabilities as predictions. To mitigate this, we ran a grid search to determine the optimal value of the Laplace smoothing parameter. Using the stratified sampling discussed in the previous section, we experimented with values ranging from 0 to 30. The optimal Laplace parameter was 16.7, with a mean test error of 0.1892.
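In mlr, this grid search can be expressed roughly as follows (a sketch that reuses the objects above; the grid resolution is our assumption):

# Grid search over the Laplace smoothing parameter of the naive Bayes learner
ps_nb <- makeParamSet(makeNumericParam("laplace", lower = 0, upper = 30))
tuned_nb <- tuneParams(lrn_nb, task = task, resampling = rdesc,
                       par.set = ps_nb, control = makeTuneControlGrid(resolution = 10L),
                       measures = mmce)
tuned_nb$x  # optimal laplace value
tuned_nb$y  # corresponding mean test error (mmce)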
We fine-tuned the number of variables randomly sampled as candidates at each split of the random forest (i.e. mtry). For a classification problem, @Breiman suggests mtry = \(\sqrt{p}\), where \(p\) is the number of descriptive features. In our case, \(\sqrt{p} = \sqrt{11} \approx 3.32\). Therefore, we experimented with mtry = 2, 3, and 4, leaving the other hyperparameters, such as the number of trees to grow, at their default values. The optimal value was mtry = 2, with a mean test error of 0.185.
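A sketch of the corresponding tuning call (same caveats as above):

# Grid search over mtry = 2, 3, 4 for the random forest
ps_rf <- makeParamSet(makeDiscreteParam("mtry", values = c(2L, 3L, 4L)))
tuned_rf <- tuneParams(lrn_rf, task = task, resampling = rdesc,
                       par.set = ps_rf, control = makeTuneControlGrid(),
                       measures = mmce)
tuned_rf$x  # best mtry
tuned_rf$y  # mean test error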
For the Decision Tree, we tuned the maximum tree depth together with the complexity parameter (cp), searching cp values between a lower bound of 0.01 and an upper bound of 0.1. The optimal combination was a maximum depth of 4 and a cp of 0.03, with a mean test error of 0.182.
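This step might look roughly as follows; the maxdepth search range is our assumption, since only the cp bounds are stated above:

# Joint grid search over tree depth and complexity parameter for rpart
ps_dt <- makeParamSet(
  makeIntegerParam("maxdepth", lower = 2L, upper = 10L),  # assumed range
  makeNumericParam("cp", lower = 0.01, upper = 0.1)
)
tuned_dt <- tuneParams(lrn_dt, task = task, resampling = rdesc,
                       par.set = ps_dt, control = makeTuneControlGrid(resolution = 10L),
                       measures = mmce)
tuned_dt$x  # e.g. maxdepth = 4, cp = 0.03
tuned_dt$y  # mean test error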
The following plots depict the value of mmce against a range of probability thresholds. The optimal thresholds were approximately 0.95, 0.34, and 0.44 for the NB, RF, and DT classifiers respectively. These thresholds were used to convert each classifier's predicted probabilities into Pass/Fail labels.
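In mlr, such threshold-versus-performance curves and the subsequent threshold adjustment can be produced roughly as follows (a sketch; pred_nb stands for a probability prediction object for the NB classifier, and which class the threshold applies to is our assumption):

# mmce across candidate probability thresholds for one classifier
thr_data <- generateThreshVsPerfData(pred_nb, measures = mmce)
plotThreshVsPerf(thr_data)
# Apply the chosen threshold to the probability predictions
pred_nb_adj <- setThreshold(pred_nb, threshold = 0.95)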
library(rpart.plot)
# Train the tuned Decision Tree learner on its task and plot the fitted tree
mod  <- train(learner3, task_Dt)
tree <- getLearnerModel(mod)
rpart.plot(tree)
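A sketch of how the test-set predictions and confusion matrices can be computed (tunedMod1 and test_data_NV denote the tuned NB model and its test set; the threshold handling shown is our assumption):

# Predict on the test set; mlr expects a plain data.frame rather than a tibble
pred_nb_test <- predict(tunedMod1, newdata = as.data.frame(test_data_NV))
# Apply the tuned probability threshold before evaluating
pred_nb_test <- setThreshold(pred_nb_test, threshold = 0.95)
# Relative (row/column-normalised) and absolute confusion matrices
calculateConfusionMatrix(pred_nb_test, relative = TRUE)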
Using the tuned parameters and threshold levels, we calculated the confusion matrix for each classifier. The confusion matrix of the NB classifier is as follows:
## Relative confusion matrix (normalized by row/column):
## predicted
## true Fail Pass -err.-
## Fail 0.20/0.65 0.80/0.28 0.80
## Pass 0.05/0.35 0.95/0.72 0.05
## -err.- 0.35 0.28 0.28
##
##
## Absolute confusion matrix:
## predicted
## true Fail Pass -err.-
## Fail 13 52 52
## Pass 7 137 7
## -err.- 7 52 59
The confusion matrix of the RF classifier is as follows:
## Relative confusion matrix (normalized by row/column):
## predicted
## true Fail Pass -err.-
## Fail 0.35/0.61 0.65/0.25 0.65
## Pass 0.10/0.39 0.90/0.75 0.10
## -err.- 0.39 0.25 0.27
##
##
## Absolute confusion matrix:
## predicted
## true Fail Pass -err.-
## Fail 23 42 42
## Pass 15 129 15
## -err.- 15 42 57
The confusion matrix of the Decision Tree classifier is as follows:
## Relative confusion matrix (normalized by row/column):
## predicted
## true Fail Pass -err.-
## Fail 0.34/0.67 0.66/0.24 0.66
## Pass 0.08/0.33 0.92/0.76 0.08
## -err.- 0.33 0.24 0.26
##
##
## Absolute confusion matrix:
## predicted
## true Fail Pass -err.-
## Fail 22 43 43
## Pass 11 133 11
## -err.- 11 43 54
All classifiers predicted students in the Pass class accurately, but performed poorly on the Fail class. The difference in class accuracy was substantial. Based on class accuracy and mmce, we concluded that the DT classifier was the best model.
The previous section showed that, despite the stratified sampling, all classifiers performed poorly in predicting the failing students. This implies that the class imbalance problem remained prevalent.
The NB model assumes the numeric descriptive features to be normally distributed, which is not necessarily true; a possible remedy would be to transform the numeric features. Based on mmce, the NB classifier underperformed the RF and DT classifiers, which suggests that NB might not be appropriate given the many categorical features in the data. The DT outperformed the other models. Having said this, the comparison was arguably "unfair" to the NB and DT classifiers, because the RF aggregates multiple bagged trees in each model fit during resampling.
Among the three classifiers, the Decision Tree produced the best overall performance in predicting whether students pass. We split the data into training and test sets and, via stratified sampling, determined the optimal values of the selected hyperparameters and the probability threshold for each classifier. Despite this, the class imbalance issue persisted and reduced the class accuracy for failing students. For future work, we propose considering cost-sensitive classification and under/over-sampling methods to mitigate the class imbalance.
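As a pointer for that future work, mlr provides simple wrappers for such resampling strategies; a minimal sketch with placeholder rates:

# Over-sample the minority (Fail) class or under-sample the majority (Pass) class
task_over  <- oversample(task, rate = 2)     # duplicate minority-class observations
task_under <- undersample(task, rate = 0.5)  # drop part of the majority class
# Alternatively, wrap a learner so the re-balancing happens inside each resampling fold
lrn_dt_bal <- makeUndersampleWrapper(lrn_dt, usw.rate = 0.5)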