This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
The purpose of this assignment is to predict student performance using logistic regression, Neural networks and a machine learning algorithm of our choice. In our case the third choice would be an ensemble learning method called extreme gradient boosting
EnsurePackage <- function(x) {
x <- as.character(x)
if (!require(x,character.only = T))
install.packages(x,repos = "https://cran.cnr.berkeley.edu/")
require(x,character.only = T)
}
library(keras)
#install_keras()
EnsurePackage("caret")# set of functions that attempt to streamline the process for creating predictive models
EnsurePackage("tidyverse") #Manipulating datasets
EnsurePackage("MASS")
EnsurePackage("ggplot2")
EnsurePackage("keras") #deep learning models
EnsurePackage("lime") #feature importance of deep learning models
EnsurePackage("recipes") #new package for preprocessing machine learning datasets
EnsurePackage("xgboost") #A faster implementation of GBM
The dataset is from Kaggle and contains information on demographics, academic background, parents participation and behavioral features.
schoolData <- read.csv("xAPI-Edu-Data.csv")
str(schoolData)
## 'data.frame': 480 obs. of 17 variables:
## $ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 2 2 1 1 ...
## $ NationalITy : Factor w/ 14 levels "Egypt","Iran",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ PlaceofBirth : Factor w/ 14 levels "Egypt","Iran",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ StageID : Factor w/ 3 levels "HighSchool","lowerlevel",..: 2 2 2 2 2 2 3 3 3 3 ...
## $ GradeID : Factor w/ 10 levels "G-02","G-04",..: 2 2 2 2 2 2 5 5 5 5 ...
## $ SectionID : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 2 ...
## $ Topic : Factor w/ 12 levels "Arabic","Biology",..: 8 8 8 8 8 8 9 9 9 8 ...
## $ Semester : Factor w/ 2 levels "F","S": 1 1 1 1 1 1 1 1 1 1 ...
## $ Relation : Factor w/ 2 levels "Father","Mum": 1 1 1 1 1 1 1 1 1 1 ...
## $ raisedhands : int 15 20 10 30 40 42 35 50 12 70 ...
## $ VisITedResources : int 16 20 7 25 50 30 12 10 21 80 ...
## $ AnnouncementsView : int 2 3 0 5 12 13 0 15 16 25 ...
## $ Discussion : int 20 25 30 35 50 70 17 22 50 70 ...
## $ ParentAnsweringSurvey : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 2 1 2 2 2 ...
## $ ParentschoolSatisfaction: Factor w/ 2 levels "Bad","Good": 2 2 1 1 1 1 1 2 2 2 ...
## $ StudentAbsenceDays : Factor w/ 2 levels "Above-7","Under-7": 2 2 1 1 1 1 1 2 2 2 ...
## $ Class : Factor w/ 3 levels "H","L","M": 3 3 2 2 3 3 2 3 3 3 ...
To perform Proportion Odds Logistic Regression, some of the attributes need to be converted to ordinal factors. This can be done using the ordered function. Some attributes will also be excluded mainly because they are either redundant or convey very little information that would help in improving the machine learning model
#Converting attributes to ordinal attributes
#Class H > L
schoolData$Class <- ordered(schoolData$Class, levels = c("L","M","H"))
#Absence days Under-7 > Above-7
schoolData$StudentAbsenceDays <- ordered(schoolData$StudentAbsenceDays, levels = c("Above-7","Under-7"))
#Parent school satisfaction Bad <Good
schoolData$ParentschoolSatisfaction <- ordered(schoolData$ParentschoolSatisfaction,levels = c("Bad","Good"))
eduData <- schoolData
#Excluding place of birth,stage id,grade id, section id and relation
eduData_polr<- eduData[,-c(3,4,5,6,8)]
set.seed(8)
sample<-createDataPartition(eduData_polr$Class,p=0.7,list = FALSE)
trainedu<-eduData_polr[sample,]
testedu<-eduData_polr[-sample,]
#Below function will remove zero variance, near zero variance variables
trn_processed <- preProcess(trainedu,method = c("zv","nzv","scale","center"))
trn_f <- predict(trn_processed,trainedu)
test_f <- predict(trn_processed,testedu)
start_time <- Sys.time()
model_polr <- polr(Class~., data = trn_f,Hess = TRUE)
summary(model_polr,digits = 2)
## Call:
## polr(formula = Class ~ ., data = trn_f, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## genderM -0.861 3.3e-01 -2.6e+00
## NationalITyIran 2.105 1.6e+00 1.3e+00
## NationalITyIraq 2.667 1.5e+00 1.8e+00
## NationalITyJordan 0.971 1.2e+00 8.4e-01
## NationalITyKW 0.931 1.1e+00 8.2e-01
## NationalITylebanon 0.732 1.4e+00 5.3e-01
## NationalITyLybia -16.763 2.4e-06 -6.9e+06
## NationalITyMorocco -1.223 3.0e+00 -4.0e-01
## NationalITyPalestine 0.082 1.3e+00 6.5e-02
## NationalITySaudiArabia 3.314 1.4e+00 2.4e+00
## NationalITySyria 1.343 1.6e+00 8.4e-01
## NationalITyTunis 0.997 1.5e+00 6.5e-01
## NationalITyUSA 1.613 1.6e+00 1.0e+00
## NationalITyvenzuela 7.628 8.1e+01 9.4e-02
## TopicBiology 0.106 8.4e-01 1.3e-01
## TopicChemistry -1.076 9.1e-01 -1.2e+00
## TopicEnglish 0.058 6.6e-01 8.7e-02
## TopicFrench -0.948 6.7e-01 -1.4e+00
## TopicGeology -1.722 7.8e-01 -2.2e+00
## TopicHistory -1.341 8.9e-01 -1.5e+00
## TopicIT -0.446 6.0e-01 -7.4e-01
## TopicMath 0.328 8.6e-01 3.8e-01
## TopicQuran -0.404 8.9e-01 -4.5e-01
## TopicScience -0.748 6.6e-01 -1.1e+00
## TopicSpanish -1.108 7.6e-01 -1.5e+00
## RelationMum 1.128 3.5e-01 3.2e+00
## raisedhands 0.738 2.2e-01 3.4e+00
## VisITedResources 1.051 2.3e-01 4.5e+00
## AnnouncementsView 0.061 2.3e-01 2.7e-01
## Discussion 0.333 1.8e-01 1.8e+00
## ParentAnsweringSurveyYes 0.542 3.9e-01 1.4e+00
## ParentschoolSatisfaction.L 0.655 2.8e-01 2.4e+00
## StudentAbsenceDays.L 2.245 3.1e-01 7.3e+00
##
## Intercepts:
## Value Std. Error t value
## L|M -1.80 1.29 -1.40
## M|H 3.67 1.32 2.79
##
## Residual Deviance: 335.1986
## AIC: 405.1986
pred <- predict(model_polr,test_f)
polracc <- confusionMatrix(pred,test_f$Class)
#Calculating p value to find which features are significant
#Store coefficient table
ctable <- coef(summary(model_polr))
options(scipen = 999)
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
ctable <- cbind(ctable, "p value" = p)
round(ctable,digits = 3)
## Value Std. Error t value p value
## genderM -0.861 0.328 -2.624 0.009
## NationalITyIran 2.105 1.605 1.311 0.190
## NationalITyIraq 2.667 1.508 1.768 0.077
## NationalITyJordan 0.971 1.162 0.836 0.403
## NationalITyKW 0.931 1.141 0.816 0.415
## NationalITylebanon 0.732 1.384 0.529 0.597
## NationalITyLybia -16.763 0.000 -6931233.712 0.000
## NationalITyMorocco -1.223 3.023 -0.405 0.686
## NationalITyPalestine 0.082 1.257 0.065 0.948
## NationalITySaudiArabia 3.314 1.408 2.354 0.019
## NationalITySyria 1.343 1.596 0.841 0.400
## NationalITyTunis 0.997 1.525 0.654 0.513
## NationalITyUSA 1.613 1.611 1.002 0.317
## NationalITyvenzuela 7.628 81.207 0.094 0.925
## TopicBiology 0.106 0.841 0.127 0.899
## TopicChemistry -1.076 0.908 -1.186 0.236
## TopicEnglish 0.058 0.661 0.087 0.930
## TopicFrench -0.948 0.671 -1.412 0.158
## TopicGeology -1.722 0.781 -2.205 0.027
## TopicHistory -1.341 0.892 -1.503 0.133
## TopicIT -0.446 0.602 -0.742 0.458
## TopicMath 0.328 0.858 0.382 0.702
## TopicQuran -0.404 0.891 -0.454 0.650
## TopicScience -0.748 0.657 -1.139 0.255
## TopicSpanish -1.108 0.759 -1.460 0.144
## RelationMum 1.128 0.350 3.226 0.001
## raisedhands 0.738 0.220 3.354 0.001
## VisITedResources 1.051 0.232 4.534 0.000
## AnnouncementsView 0.061 0.230 0.267 0.790
## Discussion 0.333 0.181 1.841 0.066
## ParentAnsweringSurveyYes 0.542 0.390 1.390 0.165
## ParentschoolSatisfaction.L 0.655 0.276 2.374 0.018
## StudentAbsenceDays.L 2.245 0.306 7.326 0.000
## L|M -1.804 1.292 -1.396 0.163
## M|H 3.673 1.317 2.789 0.005
Based on p-values we can say that features such as gender, raised hands, visited resources, parent school satisfaction and student absence days are very significant. It was also observed that Nationality = libya is also very significant. Displaying the confusion matrix will give us more info on the performance of the model
polracc
## Confusion Matrix and Statistics
##
## Reference
## Prediction L M H
## L 35 6 0
## M 3 43 9
## H 0 14 33
##
## Overall Statistics
##
## Accuracy : 0.7762
## 95% CI : (0.699, 0.8416)
## No Information Rate : 0.4406
## P-Value [Acc > NIR] : 0.0000000000000002763
##
## Kappa : 0.6598
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: L Class: M Class: H
## Sensitivity 0.9211 0.6825 0.7857
## Specificity 0.9429 0.8500 0.8614
## Pos Pred Value 0.8537 0.7818 0.7021
## Neg Pred Value 0.9706 0.7727 0.9062
## Prevalence 0.2657 0.4406 0.2937
## Detection Rate 0.2448 0.3007 0.2308
## Detection Prevalence 0.2867 0.3846 0.3287
## Balanced Accuracy 0.9320 0.7663 0.8236
end_time <- Sys.time()
polrper <- end_time - start_time
polrper
## Time difference of 0.159441 secs
We will be using Keras package to develop a deep learning model in R. Neural networks have some apprehensions because of their black box nature, therefore we will be using package called lime to help us uncover feature importance out of these sophisticated models. First we use a new preprocessing technique called recipes. In our example we are using recipe to do one hot encoding, scaling and centering. Recipes is a pipeline that makes the preprocessing process quick and easy.To know more about recipe a link is provided in the code section below.
#Use Recipes for preprocessing - new type of preprocessing https://www.rdocumentation.org/packages/recipes/versions/0.1.5
#https://blogs.rstudio.com/tensorflow/posts/2018-01-11-keras-customer-churn/
#Split data
set.seed(8)
sample2<-createDataPartition(eduData$Class,p=0.7,list = FALSE)
edudata2 <- eduData[,-c(3,4,5,6,8)]
trainedu2<-edudata2[sample2,]
testedu2<-edudata2[-sample2,]
#Create Recipes
rec_obj <- recipe(Class ~ ., data = trainedu2) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_center(all_predictors(), -all_outcomes()) %>%
step_scale(all_predictors(), -all_outcomes()) %>%
prep(data = trainedu2)
#Bake
ann_trainedu <- bake(rec_obj, new_data = trainedu2) %>% dplyr::select(-Class)
ann_testedu <- bake(rec_obj, new_data = testedu2) %>% dplyr::select(-Class)
#Preprocessing and assigning target variables
y_trainvec <- as.character(trainedu2$Class) %>% recode('L'=0, 'M'=1, 'H'=2)
y_testvec <- as.character(testedu2$Class) %>% recode('L' = 0,'M' = 1,'H' = 2)
After Preprocessing the Ann model will be built using keras package. The model has 2 hidden layers and 2 dropout layers. Several changes in activation function and nodes were experimented. Increasing the units of hidden layers helped improve accuracy significantly till 100 units, after that it only improves the training accuracy but does not seem to influence the validation accuracy. Using “Relu” as activation function gave better performance and speed compared to sigmoid. Putting epoch as 40 gave us the best accuracy.
## Model
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense (Dense) (None, 200) 6800
## ___________________________________________________________________________
## dropout (Dropout) (None, 200) 0
## ___________________________________________________________________________
## dense_1 (Dense) (None, 100) 20100
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 100) 0
## ___________________________________________________________________________
## dense_2 (Dense) (None, 3) 303
## ===========================================================================
## Total params: 27,203
## Trainable params: 27,203
## Non-trainable params: 0
## ___________________________________________________________________________
## Confusion Matrix and Statistics
##
## Reference
## Prediction H L M
## H 36 1 17
## L 0 34 6
## M 6 3 40
##
## Overall Statistics
##
## Accuracy : 0.7692
## 95% CI : (0.6915, 0.8355)
## No Information Rate : 0.4406
## P-Value [Acc > NIR] : 0.000000000000001194
##
## Kappa : 0.6524
## Mcnemar's Test P-Value : 0.06403
##
## Statistics by Class:
##
## Class: H Class: L Class: M
## Sensitivity 0.8571 0.8947 0.6349
## Specificity 0.8218 0.9429 0.8875
## Pos Pred Value 0.6667 0.8500 0.8163
## Neg Pred Value 0.9326 0.9612 0.7553
## Prevalence 0.2937 0.2657 0.4406
## Detection Rate 0.2517 0.2378 0.2797
## Detection Prevalence 0.3776 0.2797 0.3427
## Balanced Accuracy 0.8395 0.9188 0.7612
## Time difference of 11.73451 secs
Neural networks are sophisticated models and hard to interpret for a human because of it “black box” nature, meaning these sophisticated models are difficult to explain using traditional methods. LIME package solves this problem by doing a very nice job of indentifying feature importance in a visual manner. Below is how a lime model is implemented.
class(model_keras)
## [1] "keras.engine.sequential.Sequential"
## [2] "keras.engine.training.Model"
## [3] "keras.engine.network.Network"
## [4] "keras.engine.base_layer.Layer"
## [5] "tensorflow.python.training.checkpointable.base.CheckpointableBase"
## [6] "python.builtin.object"
# Setup lime::model_type() function for keras
model_type.keras.engine.sequential.Sequential <- function(x, ...) {
"classification"
}
# Setup lime::predict_model() function for keras
predict_model.keras.engine.sequential.Sequential <- function(x, newdata,type,...) {
pred <- predict_proba(object = x, x = as.matrix(newdata))
data.frame(L = pred[,1],M=pred[,2],H=pred[,3])
}
# Test our predict_model() function
a <- predict_model(x = model_keras, newdata = (ann_testedu),type ='raw')%>%tibble::as_tibble()
# Run lime() on training set
explainer <- lime::lime(
x = ann_trainedu,
model = model_keras,
bin_continuous = TRUE,
n_bins = 5,
n_permutations = 1000
)
# Run explain() on explainer
explanation <- lime::explain(
ann_testedu[1:5, ],
explainer = explainer,
n_labels = 3,
n_features = 10
)
The below plot shows the feature importance for different cases. We have chosen 5 cases from our dataset and processed it through lime to find out the feature importance of the learning model.The green bar means that the feature supports the model conclusion and the red bars contradict. If you look at case 1 you can see it has a higher probability for M and the values of features mum, resources and raised hands contribute to the labelling, whereas for label H it contradicts because the value of features do not support the validity of case 1 being labelled as H.
plot_features(explanation,ncol=3)
## Extreme Gradient Boosting Extreme Gradient boosting is an ensemble learning technique that uses decision trees in a sequential manner. The error correction resulting from the first tree is applied to the second tree and so on.
xgb_train <- xgb.DMatrix(data = as.matrix(ann_trainedu), label=y_trainvec)
xgb_val <- xgb.DMatrix(data = as.matrix(ann_testedu), label = y_testvec)
params <- list(booster = "gbtree", objective = "multi:softmax", num_class = 3, eval_metric = "auc")
#Train a xgboost model
start_time <- Sys.time()
xgb <- xgb.train(params = params, data = xgb_train, nrounds = 300)
# predict values in test set
y_pred2 <- predict(xgb, xgb_val)
y_pred2fac <- y_pred2 %>% recode('0' ='L', '1' ='M', '2' = 'H')
xgbacc <- confusionMatrix(as.factor(y_testvec1),as.factor(y_pred2fac))
xgbacc
## Confusion Matrix and Statistics
##
## Reference
## Prediction H L M
## H 31 0 11
## L 0 38 0
## M 12 8 43
##
## Overall Statistics
##
## Accuracy : 0.7832
## 95% CI : (0.7066, 0.8477)
## No Information Rate : 0.3776
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6715
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: H Class: L Class: M
## Sensitivity 0.7209 0.8261 0.7963
## Specificity 0.8900 1.0000 0.7753
## Pos Pred Value 0.7381 1.0000 0.6825
## Neg Pred Value 0.8812 0.9238 0.8625
## Prevalence 0.3007 0.3217 0.3776
## Detection Rate 0.2168 0.2657 0.3007
## Detection Prevalence 0.2937 0.2657 0.4406
## Balanced Accuracy 0.8055 0.9130 0.7858
end_time <- Sys.time()
xgbper <- end_time - start_time
xgbper
## Time difference of 2.958709 secs
acc <- list(polracc$overall[1],annacc$overall[1],xgbacc$overall[1])
time <- list(polrper,annper,xgbper)
#Model Importance
importance <- xgb.importance(feature_names = colnames(xgb_train),
model = xgb)
head(importance,n = 15)
Gain is the improvement in accuracy brought by the feature to the branches it is on. Cover measures the relative quantity of observations covered by the feature. Visited Resource has the highest gain and cover and it makes sense why it is important to a students performance because it measures that the student is committed to the subject and often visits resource section to study materials, thereby improving his performance.
#Final Accuracy of all models
Accuracy <- data.frame(acc)
names(Accuracy) <- c("Polr","ANN","XGB")
Performance <- rbind(Accuracy,Time = time)
Performance
Surprisingly the neural network performed worst in both accuracy and time, which proves the no free lunch theorem. The theory states that no one algorithm is best for all cases. While neural networks have a strong reputation for image recognition, however in our current problem XGB performed better than other algorithms and slightly took more time than ordinal logistic regression.Logistic regression worked faster because it was a single model implementation and did not perform multiple model construction or back propagation as seen in the cases of neural networks and xgb.