The aim is to study the effects of different kinds of features on students' academic achievement and to build a computational model that predicts future learning behavior using different classification methods. In particular, we look for the classification method that provides the highest recall for the unsuccessful students.
The data set is retrieved from https://www.kaggle.com/aljarah/xAPI-Edu-Data. It was collected with a learner activity tracker tool, the Experience API (xAPI), from a learning management system (LMS) called Kalboard 360. The data set consists of 480 student records with 16 attributes and 1 class label that identifies student success in 3 categories: low level (0-69), middle level (70-89) and high level (90-100).
library(ggplot2)
library(lattice)
library(caret)
library(C50)
library(kernlab)
library(mlbench)
library(randomForest)
library(caretEnsemble)
library(MASS)
library(klaR)
library(nnet)
education <- read.csv("xAPI-Edu-Data.csv", header = TRUE)
Summary of the data set, including the distribution of the success levels (the Class attribute):
summary(education)
## gender NationalITy PlaceofBirth StageID
## F:175 KW :179 KuwaIT :180 HighSchool : 33
## M:305 Jordan :172 Jordan :176 lowerlevel :199
## Palestine: 28 Iraq : 22 MiddleSchool:248
## Iraq : 22 lebanon : 19
## lebanon : 17 SaudiArabia: 16
## Tunis : 12 USA : 16
## (Other) : 50 (Other) : 51
## GradeID SectionID Topic Semester Relation
## G-02 :147 A:283 IT : 95 F:245 Father:283
## G-08 :116 B:167 French : 65 S:235 Mum :197
## G-07 :101 C: 30 Arabic : 59
## G-04 : 48 Science: 51
## G-06 : 32 English: 45
## G-11 : 13 Biology: 30
## (Other): 23 (Other):135
## raisedhands VisITedResources AnnouncementsView Discussion
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 1.00
## 1st Qu.: 15.75 1st Qu.:20.0 1st Qu.:14.00 1st Qu.:20.00
## Median : 50.00 Median :65.0 Median :33.00 Median :39.00
## Mean : 46.77 Mean :54.8 Mean :37.92 Mean :43.28
## 3rd Qu.: 75.00 3rd Qu.:84.0 3rd Qu.:58.00 3rd Qu.:70.00
## Max. :100.00 Max. :99.0 Max. :98.00 Max. :99.00
##
## ParentAnsweringSurvey ParentschoolSatisfaction StudentAbsenceDays Class
## No :210 Bad :188 Above-7:191 H:142
## Yes:270 Good:292 Under-7:289 L:127
## M:211
##
##
##
##
# Is there missing data?
sum(is.na(education))
## [1] 0
# calculate correlation matrix of the numeric attributes (columns 10-13)
correlationMatrix <- cor(education[,10:13])
# find attributes that are highly correlated (ideally > 0.75)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75)
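For reference, the correlation matrix and any flagged column indices can be inspected directly; a minimal check (output not reproduced here):
# inspect the pairwise correlations and the indices (if any) above the cutoff
print(round(correlationMatrix, 2))
print(highlyCorrelated)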
Feature Selection:
Features that are not required to build an accurate classification model can be identified with the Recursive Feature Elimination (RFE) method. The algorithm repeatedly fits a model, ranks the attributes by importance, and compares the accuracy of progressively smaller attribute subsets in order to select the best-performing set of predictors. A sketch of this step is shown below.
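A minimal sketch of how RFE can be run with caret's rfe(); the random-forest ranking functions, subset sizes and 10-fold cross-validation used here are assumptions and not necessarily the exact configuration behind the result reported below:
# RFE sketch: rank attributes with random-forest functions and compare subset sizes
set.seed(17)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfeResults <- rfe(x = education[, -17], y = education$Class,
                  sizes = c(4, 8, 12, 16), rfeControl = ctrl)
print(rfeResults)
predictors(rfeResults)  # attributes retained by the best subset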
The results show that better accuracy can be obtained by removing the 'Semester' attribute, so we eliminate it from the dataset.
education <-education[,-8]
In classification, the data set needs to be divided into training and test sets. Model training is performed on the training set, and the test set is used to assess the performance of the classification model. In our case, we split the data with 'stratified sampling', which divides the data set into subgroups based on the outcome and draws from these subgroups so that the training set has the same outcome distribution as the whole data set. We sample 75% of the data set as the training set and the remaining 25% as the test set.
Within the training set, 10-fold cross-validation repeated 10 times is used for model training and tuning.
set.seed(17)
# Stratified sampling
TrainingDataIndex <- createDataPartition(education$Class, p=0.75, list = FALSE)
# Create Training Data
trainingData <- education[TrainingDataIndex,]
testData <- education[-TrainingDataIndex,]
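As a quick sanity check (not part of the original write-up), the class proportions in the training set should mirror those of the full data set:
# compare outcome distributions to confirm the stratified split
round(prop.table(table(education$Class)), 2)
round(prop.table(table(trainingData$Class)), 2)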
# 10-fold cross validation repeated 10 times, used for model training and tuning
TrainingParameters <- trainControl(method = "repeatedcv", number = 10, repeats=10)
Four different classification algorithms will be used to predict student success.
Classification with Support Vector Machine
The support vector machine algorithm finds an optimal hyperplane that separates data points belonging to different classes. Different kernel functions can be used with SVM, such as radial basis, hyperbolic tangent (sigmoid) and linear kernels. We will use the polynomial kernel function (svmPoly) for this data set.
# training model with SVM
SVModel <- train(Class ~ ., data = trainingData,
method = "svmPoly",
trControl= TrainingParameters,
tuneGrid = data.frame(degree = 1,
scale = 1,
C = 1),
preProcess = c("pca","scale","center"),
na.action = na.omit
)
SVMPredictions <-predict(SVModel, testData)
# Create confusion matrix
cmSVM <-confusionMatrix(SVMPredictions, testData$Class)
print(cmSVM)
## Confusion Matrix and Statistics
##
## Reference
## Prediction H L M
## H 26 0 12
## L 0 27 6
## M 9 4 34
##
## Overall Statistics
##
## Accuracy : 0.7373
## 95% CI : (0.6483, 0.814)
## No Information Rate : 0.4407
## P-Value [Acc > NIR] : 6.116e-11
##
## Kappa : 0.5992
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: H Class: L Class: M
## Sensitivity 0.7429 0.8710 0.6538
## Specificity 0.8554 0.9310 0.8030
## Pos Pred Value 0.6842 0.8182 0.7234
## Neg Pred Value 0.8875 0.9529 0.7465
## Prevalence 0.2966 0.2627 0.4407
## Detection Rate 0.2203 0.2288 0.2881
## Detection Prevalence 0.3220 0.2797 0.3983
## Balanced Accuracy 0.7991 0.9010 0.7284
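Note that the tuneGrid above fixes degree = 1 (which makes the polynomial kernel effectively linear), scale = 1 and C = 1. As a sketch, caret could instead search over a small, assumed grid of polynomial-kernel parameters:
# optional: tune the polynomial kernel instead of fixing its parameters
svmGrid <- expand.grid(degree = 1:3, scale = c(0.01, 0.1, 1), C = c(0.25, 0.5, 1))
SVModelTuned <- train(Class ~ ., data = trainingData, method = "svmPoly",
                      trControl = TrainingParameters, tuneGrid = svmGrid,
                      preProcess = c("pca", "scale", "center"))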
Ranking Features by Importance
The importance of the attributes can be estimated from the trained classification model. As shown in the plot, the attributes are ranked differently for each class: the VisITedResources and raisedhands attributes have the highest rankings for the classification of low-level and middle-level success.
importance <- varImp(SVModel, scale=FALSE)
plot(importance)
Classification with Decision Tree
In decision tree classification, a series of test questions is organized in a tree structure: internal nodes test attribute values and class labels are assigned at the leaf nodes. We will use C5.0 decision tree classification, which is an extension of the C4.5 decision tree algorithm.
# Train a model with above parameters. We will use C5.0 algorithm
DecTreeModel <- train(Class ~ ., data = trainingData,
method = "C5.0",
preProcess=c("scale","center"),
trControl= TrainingParameters,
na.action = na.omit
)
#Predictions
DTPredictions <-predict(DecTreeModel, testData, na.action = na.pass)
# Print confusion matrix and results
cmTree <-confusionMatrix(DTPredictions, testData$Class)
print(cmTree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction H L M
## H 25 0 10
## L 1 28 3
## M 9 3 39
##
## Overall Statistics
##
## Accuracy : 0.7797
## 95% CI : (0.6941, 0.8507)
## No Information Rate : 0.4407
## P-Value [Acc > NIR] : 5.958e-14
##
## Kappa : 0.6612
## Mcnemar's Test P-Value : 0.7885
##
## Statistics by Class:
##
## Class: H Class: L Class: M
## Sensitivity 0.7143 0.9032 0.7500
## Specificity 0.8795 0.9540 0.8182
## Pos Pred Value 0.7143 0.8750 0.7647
## Neg Pred Value 0.8795 0.9651 0.8060
## Prevalence 0.2966 0.2627 0.4407
## Detection Rate 0.2119 0.2373 0.3305
## Detection Prevalence 0.2966 0.2712 0.4322
## Balanced Accuracy 0.7969 0.9286 0.7841
Classification with Naïve Bayes
The Naïve Bayes algorithm calculates the posterior probability of each class given the attribute values, assuming conditional independence between the attributes, and chooses the class with the highest probability.
# Naive Bayes: train on the predictor columns only (after dropping Semester,
# Class is column 16, so it is excluded here)
NaiveModel <- train(trainingData[, -16], trainingData$Class,
method = "nb",
preProcess=c("scale","center"),
trControl= TrainingParameters,
na.action = na.omit
)
#Predictions
NaivePredictions <-predict(NaiveModel, testData, na.action = na.pass)
cmNaive <-confusionMatrix(NaivePredictions, testData$Class)
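The resulting confusion matrix can be printed in the same way as for the other models (output omitted here):
print(cmNaive)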
Classification with Neural Networks
The last algorithm that we are going to use on the educational data set is neural networks. Neural networks mimic the way the brain solves problems: each neural unit combines all of its inputs with an activation function and propagates the result to other neurons.
# train model with neural networks (again excluding the Class column, which is at position 16)
NNModel <- train(trainingData[, -16], trainingData$Class,
method = "nnet",
trControl= TrainingParameters,
preProcess=c("scale","center"),
na.action = na.omit
)
NNPredictions <-predict(NNModel, testData)
# Create confusion matrix
cmNN <-confusionMatrix(NNPredictions, testData$Class)
print(cmNN)
## Confusion Matrix and Statistics
##
## Reference
## Prediction H L M
## H 35 0 0
## L 0 31 0
## M 0 0 52
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9692, 1)
## No Information Rate : 0.4407
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: H Class: L Class: M
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.2966 0.2627 0.4407
## Detection Rate 0.2966 0.2627 0.4407
## Detection Prevalence 0.2966 0.2627 0.4407
## Balanced Accuracy 1.0000 1.0000 1.0000
Classification with Ensemble Model
First, we will check the correlations between the different models. Selecting uncorrelated models with high individual accuracy for ensemble modeling tends to give the best results. Unfortunately, since the caretEnsemble package does not support multi-class predictions, we will not combine these models into an ensemble to make predictions.
# Create models
econtrol <- trainControl(method="cv", number=10, savePredictions=TRUE, classProbs=TRUE)
model_list <- caretList(Class ~., data=trainingData,
methodList=c("svmPoly", "nnet", "C5.0", "nb"),
preProcess=c("scale","center"),
trControl = econtrol
)
results <- resamples(model_list)
# What is model correlation?
mcr <-modelCor(results)
print (mcr)
## svmPoly nnet C5.0 nb
## svmPoly 1.000000000 0.2779276 0.12088623 0.003648899
## nnet 0.277927641 1.0000000 0.12788434 -0.215407799
## C5.0 0.120886230 0.1278843 1.00000000 -0.094577545
## nb 0.003648899 -0.2154078 -0.09457754 1.000000000
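The resampling performance of the four models fitted with caretList can also be summarised side by side; a quick sketch (output not shown here):
# side-by-side resampling summary and accuracy plot of the four models
summary(results)
dotplot(results, metric = "Accuracy")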
Four algorithms (decision tree, Naïve Bayes, support vector machine and neural networks) were used to classify the three success levels in the education data. The results show that the neural network classifier with 10-fold cross-validation predicts the test set with 100% accuracy.
The results therefore suggest that the best classification model to predict students' success is the neural network. This model could be used to make future predictions about students' learning process and to spot students who tend to be unsuccessful. Secondly, we identified the attributes that play the most important role in explaining students' success levels: how many times a student visits the online resources, raises their hand and views announcements are important indicators of the success level. Once weak students are identified, proactive approaches can be developed to support them.
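Since the stated aim is a high recall for unsuccessful students, the test-set recall (sensitivity) of the 'L' class can be extracted from the confusion matrices computed above; a small sketch:
# recall (sensitivity) for the low-level ("L") class, per model
recall_L <- c(SVM          = cmSVM$byClass["Class: L", "Sensitivity"],
              DecisionTree = cmTree$byClass["Class: L", "Sensitivity"],
              NaiveBayes   = cmNaive$byClass["Class: L", "Sensitivity"],
              NeuralNet    = cmNN$byClass["Class: L", "Sensitivity"])
print(round(recall_L, 3))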
References
Educational Data Mining. Retrieved from https://en.wikipedia.org/wiki/Educational_data_mining
Amrieh, E. A., Hamtini, T., & Aljarah, I. (2016). Mining Educational Data to Predict Student’s academic Performance using Ensemble Methods. International Journal of Database Theory and Application, 9(8), 119-136.
Amrieh, E. A., Hamtini, T., & Aljarah, I. (2015, November). Preprocessing and analyzing educational data set using X-API for improving student's performance. In 2015 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT) (pp. 1-5). IEEE.
Brownlee, J. Feature Selection with the Caret R Package. Retrieved from http://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
Kuhn, M. The caret package. Retrieved from http://topepo.github.io/caret/index.html
Artificial Neural Network. Retrieved from https://en.wikipedia.org/wiki/Artificial_neural_network