Education is a key factor affecting long-term economic progress. Success in the core subjects provides a linguistic and numeric scaffold for other subjects later in students’ academic careers. The growth of school educational databases facilitates the use of data mining and machine learning practices to improve outcomes in these subjects by identifying factors that are indicative of failure (or success). Predicting outcomes allows educators to take corrective measures for weak students, mitigating the risk of failure.
The data were downloaded from the UCI Machine Learning Repository (see the readme); the analysis is inspired by Cortez et al., 2008. We use the maths results data only.
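As a minimal sketch of the loading step (the file name student-mat.csv comes from the UCI archive; the object name data is an assumption used throughout the sketches below), the maths results can be read in with:

# read the maths results file from the UCI "Student Performance" dataset;
# the file uses semicolons as separators
data <- read.table("student-mat.csv", sep = ";", header = TRUE)
str(data) # 395 students, 33 variables according to the codebook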
“Any sufficiently advanced technology is indistinguishable from magic.” - Arthur C. Clarke
As a scientist I find any computational methodology that is loosely based on how the brain works inherently interesting. Although somewhat derided for their complexity and computational expense, neural networks have seen a resurgence in popularity with deep learning problems, such as identifying cats in YouTube videos. We tackle a simpler problem here, one that I previously approached with the decision tree method. Let’s see how the default settings compare to the 95% classification accuracy of the decision tree, which also had the benefit of being readily intelligible.
Neural networks use concepts borrowed from an understanding of animal brains in order to model arbitrary functions. A feed-forward network with one or more hidden layers is commonly called a multilayer perceptron; stacking several hidden layers is the basis of deep learning.
From the codebook we know that G3 is the final grade of the students, standardised to range from 0 to 20. We can inspect its distribution with a histogram.
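A minimal sketch of that inspection (assuming the raw data frame is called data, as above):

# histogram of the final maths grade
hist(data$G3, breaks = 20,
     main = "Distribution of final grade (G3)", xlab = "G3 (0-20)")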
G3 is pretty normally distributed, despite the dodgy tail. Previously we converted it into a binary output and then used a decision tree approach to make predictions from associated student characteristics. Here we use the neural network approach while keeping G3 as an integer variable with a range of 0 to 20.
We start by identifying variables we think will be useful based on expert domain knowledge. We then normalise the continuous variables, so that variables on larger scales do not dominate during training, and check for missing values.
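A minimal sketch of this preprocessing, assuming the raw data frame data and a min-max rescaling of the predictors; the exact column selection and object names are assumptions, chosen to match the variables used in the models below:

# min-max normalisation to the 0-1 range
normalise <- function(x) (x - min(x)) / (max(x) - min(x))

predictors <- c("G1", "G2", "goout", "absences", "failures", "Fedu", "Medu")
data_interest <- data[c("G3", predictors)]
data_interest[predictors] <- lapply(data_interest[predictors], normalise)

sum(is.na(data_interest)) # 0 indicates no missing values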
We need to split the data so we can build the model and then test it to see whether it generalises well. The data arrived in a random order, so we split it in a way analogous to the decision tree method.
data_train <- data_interest[1:350, ] # we want to compare to decision tree method!
data_test <- data_interest[351:395, ]
#data_train <- sample_frac(data_interest, 0.8) # 80% train, 20% test
#data_test <- setdiff(data_interest, data_train)
Now we train the model on the data with the neuralnet() function, from the package of the same name, which uses backpropagation; we stick with the default settings. We specify a linear output as we are doing regression, not classification. First we fit a model using the relevant continuous normalised variables, omitting the factors that are not numerically encoded, such as gender, for now.
#TRAIN the model on the data
#n <- names(data_train)
#f <- as.formula(paste("G3 ~", paste(n[!n %in% "G3"], collapse = " + ")))
# as pointed out by an R-bloggers post, we must write the formula and pass it as an argument
# http://www.r-bloggers.com/fitting-a-neural-network-in-r-neuralnet-package/
library(neuralnet) # provides the neuralnet() function

net_model <- neuralnet(G3 ~ G1 + G2 + goout +
                         absences + failures + Fedu + Medu,
                       data = data_train, hidden = 1, linear.output = TRUE)
print(net_model)
## Call: neuralnet(formula = G3 ~ G1 + G2 + goout + absences + failures + Fedu + Medu, data = data_train, hidden = 1, linear.output = TRUE)
##
## 1 repetition was calculated.
##
## Error Reached Threshold Steps
## 1 1.594919534 0.008259923976 710
plot.nnet(net_model)
Generally, the input layer (I) is considered a distributor of the signals from the external world. Hidden layers (H) are considered to be categorizers or feature detectors of such signals. The output layer (O) is considered a collector of the features detected and producer of the response. While this view of the neural network may be helpful in conceptualizing the functions of the layers, you should not take this model too literally, as the functions described can vary widely. Bias layers (B) aren’t all that informative; they are analogous to intercept terms in a regression model.
Note how we use the compute() function to generate predictions on the testing dataset (rather than predict()). Also, rather than assessing whether we were right or wrong (as in classification), we need to compare our predicted G3 score with the actual score; we can achieve this by measuring how the predicted results covary with the real data.
model_results <- compute(net_model, data_test[c("G1", "G2", "goout", "absences",
"failures", "Fedu", "Medu")])
predicted_G3 <- model_results$net.result
cor(predicted_G3, data_test$G3)[ , 1] # can vary depending on random seed
## [1] 0.9216071625
plot(predicted_G3, data_test$G3,
     main = "1 hidden node", xlab = "Predicted G3", ylab = "Real G3") # line em up, aid visualisation
abline(a = 0, b = 1, col = "black")
Here we compare to a 1:1 abline in black. It would be interesting to see how this approach fares against a standard linear regression.
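As a rough aside (not part of the original analysis), ordinary least squares with the same predictors would look something like this:

# fit the same predictors with a standard linear model for comparison
lm_model <- lm(G3 ~ G1 + G2 + goout + absences + failures + Fedu + Medu,
               data = data_train)
predicted_G3_lm <- predict(lm_model, newdata = data_test)
cor(predicted_G3_lm, data_test$G3) # compare with the neural network's correlation

Now let’s add some extra complexity by adding some more hidden nodes.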
net_model2 <- neuralnet(G3 ~ G1 + G2 + goout +
absences + failures + Fedu + Medu,
data = data_train, hidden = 5, linear.output = TRUE)
print(net_model2)
## Call: neuralnet(formula = G3 ~ G1 + G2 + goout + absences + failures + Fedu + Medu, data = data_train, hidden = 5, linear.output = TRUE)
##
## 1 repetition was calculated.
##
## Error Reached Threshold Steps
## 1 1.183841225 0.009654936966 6281
plot.nnet(net_model2)
Now we evaluate as before.
model_results2 <- compute(net_model2, data_test[c("G1", "G2", "goout", "absences",
"failures", "Fedu", "Medu")])
predicted_G3_2 <- model_results2$net.result
cor(predicted_G3_2, data_test$G3)[ , 1] # can vary depending on random seed
## [1] 0.9591425577
plot(predicted_G3_2, data_test$G3,
     main = "5 hidden nodes", xlab = "Predicted G3", ylab = "Real G3") # line em up, aid visualisation
abline(a = 0, b = 1, col = "black")
A slight improvement, on a par with the decision tree approach, even though some variables that we know to be useful were excluded from this modelling exercise. We could likely improve things further by incorporating them into the model. Furthermore, this is not just a pass-or-fail classification; the model provides a predicted G3 exam score for any student.
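For instance, a prediction for a single hypothetical student might look like this; the values below are invented purely for illustration and sit on the same normalised 0-1 scale as the training predictors:

# one made-up student, with predictors in the same order used to train the model
new_student <- data.frame(G1 = 0.6, G2 = 0.65, goout = 0.5, absences = 0.1,
                          failures = 0, Fedu = 0.75, Medu = 0.75)
compute(net_model2, new_student)$net.result # predicted G3 for this student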
A caveat: prior to drawing any conclusions, the model should be validated using cross-validation, which provides some protection against under- or over-fitting (the risk of overfitting increases as we increase the number of hidden nodes). Furthermore, interpretability is an issue: I have a prediction but limited understanding of what is going on.
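A rough sketch of what a k-fold cross-validation could look like here (not part of the original analysis; the number of folds and the seed are arbitrary choices):

set.seed(123)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(data_interest))) # random fold labels
cv_cor <- numeric(k)
for (i in 1:k) {
  train_i <- data_interest[folds != i, ]
  test_i <- data_interest[folds == i, ]
  fit <- neuralnet(G3 ~ G1 + G2 + goout + absences + failures + Fedu + Medu,
                   data = train_i, hidden = 5, linear.output = TRUE)
  pred <- compute(fit, test_i[c("G1", "G2", "goout", "absences",
                                "failures", "Fedu", "Medu")])$net.result
  cv_cor[i] <- cor(pred[, 1], test_i$G3) # out-of-sample correlation for this fold
}
mean(cv_cor) # average correlation across folds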
session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.2.3 (2015-12-10)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United Kingdom.1252
## tz Europe/London
## date 2016-02-08
## Packages ------------------------------------------------------------------
## package * version date source
## abind 1.4-3 2015-03-13 CRAN (R 3.2.3)
## assertthat 0.1 2013-12-06 CRAN (R 3.2.2)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.3)
## curl 0.9.4 2015-11-20 CRAN (R 3.2.2)
## DBI 0.3.1 2014-09-24 CRAN (R 3.2.2)
## devtools * 1.9.1 2015-09-11 CRAN (R 3.2.3)
## digest 0.6.9 2016-01-08 CRAN (R 3.2.3)
## dplyr * 0.4.3 2015-09-01 CRAN (R 3.2.2)
## evaluate 0.8 2015-09-18 CRAN (R 3.2.2)
## formatR 1.2.1 2015-09-18 CRAN (R 3.2.3)
## htmltools 0.3 2015-12-29 CRAN (R 3.2.3)
## httr 1.0.0 2015-06-25 CRAN (R 3.2.2)
## knitr 1.12 2016-01-07 CRAN (R 3.2.3)
## lazyeval 0.1.10 2015-01-02 CRAN (R 3.2.2)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.2)
## MASS * 7.3-45 2015-11-10 CRAN (R 3.2.3)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.2)
## munsell 0.4.2 2013-07-11 CRAN (R 3.2.3)
## neuralnet * 1.32 2012-09-20 CRAN (R 3.2.3)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.2)
## R6 2.1.1 2015-08-19 CRAN (R 3.2.2)
## Rcpp 0.12.3 2016-01-10 CRAN (R 3.2.3)
## reshape * 0.8.5 2014-04-23 CRAN (R 3.2.3)
## RItools * 0.1-13 2016-01-18 CRAN (R 3.2.3)
## rmarkdown 0.9.2 2016-01-01 CRAN (R 3.2.2)
## scales * 0.3.0 2015-08-25 CRAN (R 3.2.3)
## SparseM 1.7 2015-08-15 CRAN (R 3.2.3)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.2)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.2)
## svd 0.3.3-2 2014-03-07 CRAN (R 3.2.3)
## xtable 1.8-0 2015-11-02 CRAN (R 3.2.3)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.3)