This is a personal project with the objective of learning about Artificial Neural Networks (ANN) and their implementation in R.
All the code is available in my github repo for statistical learning studies. Please take a look at the files named neuralnet*
A basic, succinct and clear explanation of how ANNs work can be found in Portilla, (2016), which I paraphrase here:
Neural networks are built from units called perceptrons (ptrons). Ptrons have one or more inputs, an activation function and an output. An ANN model is built up by combining ptrons in structured layers. The ptrons in a given layer are independent of each other, but they each connect to all the ptrons in the next layer.
The connections between ptrons are weighted. The magnitude of the weight controls the strength of that input’s influence on the receiving ptron, while the sign of the weight controls whether the influence stimulates or inhibits the signal to the next layer.
The weights are somewhat analogous to the coefficients of a linear model. There is also a bias adjustment that represents the base value of a ptron and is analogous to the intercept in a linear model. If the inputs are near zero, the bias ensures that the output is near average. However, the situation is much more complex due to the network-like nature of the ANN, which leads to complex, non-linear relationships between the predictors and the response.
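To make the ptron idea concrete, here is a minimal sketch in R of a single ptron with a logistic activation function (purely illustrative; the function name and values are my own, not from the repo):
# a single ptron: weighted sum of the inputs plus bias, passed through an activation
perceptron <- function(x, w, b, act = function(z) 1 / (1 + exp(-z))) {
  act(sum(w * x) + b)
}
# example: three inputs with a mix of stimulating (positive) and inhibiting (negative) weights
perceptron(x = c(0.2, -1.5, 0.7), w = c(0.8, -0.3, 1.1), b = 0.5)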
There are a handful of interesting libraries, though in this report I’ll focus on the neuralnet package, Fritsch & Guenther, (2016).
Hasheminia, (2015) provides a very useful introduction, covering the essential parameters and common steps.
This library is very flexible, allowing the analyst to build ANNs using a number of algorithms, activation functions and error functions, and with multiple hidden layers, covering all but the most esoteric scenarios. Its syntax is not entirely aligned with other model-fitting functions, so it requires a couple of minor tweaks compared with common modelling functions, e.g. lm and glm.
This example uses the Boston data from the MASS package which contains a number of predictors of median property values in suburbs of Boston, MA, USA. See help(Boston) for more information. The code used is based on Alice, (2015) and adapted to emphasise and deepen particular topics.
# Using my standard boiler plate for machine learning
# Run through some standard checks for data quality
dt <- setData(Boston, "medv")
# problem type: either classification or regression
dt$ptype
## [1] "regression"
# check for NA vals
na.vals.check(dt)
## crim zn indus chas nox rm age dis rad
## 0 0 0 0 0 0 0 0 0
## tax ptratio black lstat medv
## 0 0 0 0 0
# and near zero variance
nzv.check(dt)
## freqRatio percentUnique zeroVar nzv
## crim 1.000000 99.6047431 FALSE FALSE
## zn 17.714286 5.1383399 FALSE FALSE
## indus 4.400000 15.0197628 FALSE FALSE
## chas 13.457143 0.3952569 FALSE FALSE
## nox 1.277778 16.0079051 FALSE FALSE
## rm 1.000000 88.1422925 FALSE FALSE
## age 10.750000 70.3557312 FALSE FALSE
## dis 1.250000 81.4229249 FALSE FALSE
## rad 1.147826 1.7786561 FALSE FALSE
## tax 3.300000 13.0434783 FALSE FALSE
## ptratio 4.117647 9.0909091 FALSE FALSE
## black 40.333333 70.5533597 FALSE FALSE
## lstat 1.000000 89.9209486 FALSE FALSE
## medv 2.000000 45.2569170 FALSE FALSE
# and highly correlated variables
cor.vars.check(dt, 0.8)
## [1] "tax"
# and linear combinations
lin.comb.check(dt)
## $linearCombos
## list()
##
## $remove
## NULL
# Have a closer look at the correlation structure, and the tax variable in particular
corrplot(cor(dt$dt.frm))
It looks like the predictors tax and rad are highly correlated (r > 0.8). This can be a problem in some machine learning scenarios, but I’m going to ignore it as that’s not the focus of this report.
For the purposes of comparison, Alice, (2015) creates a linear model with ordinary least squares (OLS), which I reproduce here.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.7307885 5.5933446 6.7457 5.146e-11
## crim -0.1064088 0.0343137 -3.1011 0.0020602
## zn 0.0502403 0.0148834 3.3756 0.0008061
## indus 0.0163460 0.0663909 0.2462 0.8056426
## chas 2.6498239 0.8735158 3.0335 0.0025694
## nox -20.7693106 4.1995283 -4.9456 1.105e-06
## rm 4.0013561 0.4684868 8.5410 2.568e-16
## age -0.0054396 0.0141457 -0.3845 0.7007751
## dis -1.5159065 0.2107094 -7.1943 2.966e-12
## rad 0.3208897 0.0731548 4.3864 1.463e-05
## tax -0.0139571 0.0040943 -3.4089 0.0007161
## ptratio -0.9505748 0.1432791 -6.6344 1.022e-10
## black 0.0077192 0.0029105 2.6522 0.0083035
## lstat -0.4616203 0.0560346 -8.2381 2.322e-15
##
## n = 428, p = 14, Residual SE = 4.71283, R-Squared = 0.76
I’ll fit two models for comparison. The first has two hidden layers, with 5 and 3 ptrons respectively. The second has just one layer of 8 ptrons. One of the reasons for doing this is that some of the available utilities for analysing ANNs don’t work with multiple hidden layers; this is by design in some cases and harder to explain in others.
As a part of this project, I have created some new tools to get around this limitation.
Code is echoed to show a couple of quirks of the neuralnet library, namely the requirement to give a fully qualified formula. It is also necessary to scale the response variable to the interval \([0, 1]\) and the predictor variables to z-scores \(\sim N(0, 1)\), as discussed in Olden & Jackson, (2002).
Other than that, the default settings are used for algorithm (rprop+ = resilient back propagation) and error function (sse = sum of squared errors). See R help(neuralnet) for further details on default settings.
The number of ptrons in the hidden layers is given as a numeric vector and the linear.output is set to TRUE, indicating a regression problem.
# The predictor vars must be scaled data for the ANN fitting
Boston.scaled <- as.data.frame(scale(Boston))
min.medv <- min(Boston$medv)
max.medv <- max(Boston$medv)
# response var must be scaled to [0 < resp < 1]
Boston.scaled$medv <- scale(Boston$medv
, center = min.medv
, scale = max.medv - min.medv)
# Train-test split
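# (Boston.split is created earlier in the repo; a hedged equivalent would be a
#  logical vector marking roughly 75% of rows for training, e.g.
#  Boston.split <- sample(c(TRUE, FALSE), nrow(Boston), replace = TRUE, prob = c(0.75, 0.25)))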
Boston.train.scaled <- Boston.scaled[Boston.split, ]
Boston.test.scaled <- Boston.scaled[!Boston.split, ]
# neuralnet doesn't accept resp~. (dot) notation
# so a utility function to create a verbose formula is used
Boston.nn.fmla <- generate.full.fmla("medv", Boston)
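# (a hedged sketch of what such a helper might do; the repo version may differ)
# generate.full.fmla <- function(resp, dat) {
#   as.formula(paste(resp, "~",
#                    paste(setdiff(names(dat), resp), collapse = " + ")))
# }
# giving: medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + black + lstat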
# 2 models, one with 2 layers of 5 and 3
# the second with one layer of 8
# linear output is used for a regression problem
Boston.nn.5.3 <- neuralnet(Boston.nn.fmla
, data=Boston.train.scaled
, hidden=c(5,3)
, linear.output=TRUE)
Boston.nn.8 <- neuralnet(Boston.nn.fmla
, data=Boston.train.scaled
, hidden=8
, linear.output=TRUE)
To create performance metrics for comparison with the OLS model it’s necessary to reverse the scaling operation on the responses/predictions. I won’t echo that code here as it’s visible in the github repository.
I use a utility function to generate Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Median Absolute Deviation (MAD).
The RMSE and MAD are both on the same scale as the errors and therefore somewhat easier to interpret than the MSE. Large differences between RMSE and MAD are indicative of skew in the distribution of errors.
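For reference, a minimal sketch of what these measures compute, each taking observed and predicted values (the utility function in the repo may differ in detail):
# error measures on the original (unscaled) response scale
MSE  <- function(obs, pred) mean((obs - pred)^2)
RMSE <- function(obs, pred) sqrt(mean((obs - pred)^2))
MAD  <- function(obs, pred) median(abs(obs - pred))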
## [1] "OLS regression measures"
## MSE.measure RMSE.measure MAD.measure
## 1 25.4926546 5.049025114 2.664651504
## [1] "NN.5.3 regression measures"
## MSE.measure RMSE.measure MAD.measure
## 1 10.18324012 3.191118945 1.607836209
## [1] "NN.8 regression measures"
## MSE.measure RMSE.measure MAD.measure
## 1 17.39375447 4.170582031 1.781133818
Both ANN models are quite a lot better at predicting on new data than the OLS model, and the two-layer model appears to do moderately better. This is expected due to its greater flexibility/non-linearity. Some further analysis can be done to determine the better of the two more definitively (see the later section on cross validation).
The difference between RMSE and MAD for all three models indicates that the errors are skewed: the majority of errors must be smaller than the mean error, implying that there are a small number of large outliers.
Now that the two models have been trained it is possible to visualise them in a plot. The neuralnet library comes with a default plot method, called along these lines:
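# default plot method for nn objects; the resulting plots were exported manually as static images
plot(Boston.nn.5.3)
plot(Boston.nn.8)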
Beck, (2013) describes these plots as “leav[ing] much to be desired,” and I tend to agree. These plots are difficult to work with: visually they suffer from serious clutter, making the connection weights mostly illegible. In addition, the plot.nn function has some undesirable behaviour. In the RStudio environment, the plots appear as pop-outs rather than in the integrated viewer window. The function also doesn’t seem to be compatible with knitr, so the images above were exported manually and included as static .png files.
This issue has been addressed by the Neural Interpretation Diagram (NID) described in Olden & Jackson, (2002) and implemented as the plotnet() function in the NeuralNetTools library, Beck, (2015). This gives the following default plots for the same models:
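# NID plots produced with plotnet() at default settings (calls along these lines)
library(NeuralNetTools)
plotnet(Boston.nn.5.3)
plotnet(Boston.nn.8)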
This elegant plot solves the visual clutter problem by using line thickness to represent weight magnitude and line colour to represent weight sign (black = positive, grey = negative). Other useful features include a colour option for the input layer so that, for example, variable importance can be represented. This will be discussed next.
ANNs have been credited with “superior predictive power” while being labelled a “black box” in the same article, Olden & Jackson, (2002). This is because the ANN can model complex interactions, giving highly flexible non-linear responses, while this very flexibility complicates the assessment of the relative importance of each predictor.
Several approaches for illuminating the mechanics of these models are described and compared by Olden & Jackson, (2002) and Gevrey et al, (2003), and further discussed in Olden et al, (2004). The approach used in the NeuralNetTools library, Beck, (2015), is Garson’s algorithm in the garson() function. The resultant plot is shown next, although the garson() function only works for single hidden layer models.
# garson function on two hidden layer model returns an error
tryCatch.W.E(garson(Boston.nn.5.3))
## $value
## <simpleError in garson.default(wts_in, x_names, y_names, ...): Garsons algorithm not applicable for multiple hidden layers>
##
## $warning
## NULL
# works fine on single hidden layer model
garson(Boston.nn.8)
Garson’s algorithm was found by Olden et al, (2004) to be the least reliable method when tested against simulated data. They proposed a number of more reliable approaches, their preferred option being to calculate the product of the weights through the model from each predictor to the output. I have created this as a custom function in R which works for any number of hidden layers. I have also produced a simplified version that tallies the values of the input layer weights only.
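As a much-simplified, single hidden layer sketch of the connection-weights idea (the repo versions are more general; the function name here is my own and the weight/bias layout is as I understand the nn object to store it):
# product of input->hidden and hidden->output weights, summed over the hidden ptrons
conn.wts.importance <- function(nn) {
  w <- nn$weights[[1]]                  # weight matrices for the first repetition
  w.in  <- w[[1]][-1, , drop = FALSE]   # input -> hidden weights, bias row dropped
  w.out <- w[[2]][-1, , drop = FALSE]   # hidden -> output weights, bias row dropped
  setNames(as.vector(w.in %*% w.out), nn$model.list$variables)
}
conn.wts.importance(Boston.nn.8)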
I use the lattice plotting system rather than ggplot2 as this is my preference. I also opt for a dotplot, which better represents comparison of point values. According to Sarkar, (2008) barcharts are better reserved for counts, lengths or measures where the information is encoded in the height or area of the bar.
Comparing these plots, one is immediately struck by the inconsistency in both the ranking and sign of the variable effects with respect to each other. Intuitively one would expect generally similar results, but these models have little in common.
An empirical approach to this problem is to use profiling, which also has the benefit of illuminating the non-linear relationships of each predictor to the response and to each other.
The lekprofile method is broadly similar to an effects plot and is described in Gevrey et al, (2003). The method was devised with ANNs in mind but could equally be applied to any multivariate problem.
In essence, a profiling method uses the fitted model to predict new values, varying one predictor through its full range while holding all the other predictors constant. This is repeated at various levels of the held predictors (e.g. quantiles) and then iterated over all predictors (or simply those of interest).
The resulting graphs yield fascinating insight into the model behaviour.
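A minimal sketch of the profiling idea (my own simplified version, not the lekprofile implementation; the function and argument names are illustrative):
# vary one predictor across its range while holding the others at a fixed quantile
profile.predictor <- function(nn, dat, resp, var, hold.at = 0.5, n = 50) {
  x <- dat[, setdiff(names(dat), resp), drop = FALSE]
  grid <- as.data.frame(lapply(x, function(col) rep(quantile(col, hold.at), n)))
  grid[[var]] <- seq(min(x[[var]]), max(x[[var]]), length.out = n)
  data.frame(grid[var], pred = as.vector(neuralnet::compute(nn, grid)$net.result))
}
# e.g. the profile of lstat with the other predictors held at their medians
head(profile.predictor(Boston.nn.8, Boston.train.scaled, "medv", "lstat"))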
I’ve had trouble getting the NeuralNetTools lekprofile function to run with this data. It’s also documented as not working for multiple hidden layers.
# expected to fail: documented as not working for multiple hidden layers
tryCatch.W.E(lekprofile(Boston.nn.5.3))
## $value
## <simpleError in lekprofile.nn(Boston.nn.5.3): Cannot use lekprofile with multiple hidden layers>
##
## $warning
## NULL
# something wrong as there is one hidden layer
tryCatch.W.E(lekprofile(Boston.nn.8))
## $value
## <simpleError: object of type 'symbol' is not subsettable>
##
## $warning
## NULL
However, given the description of its implementation, I see no reason why it can’t be run for models with multiple hidden layers. Consequently I’ve produced my own version of this plotting function using the lattice graphics system.
An interesting side effect of calculating all the individual predictor effects in this way is that the resulting range of response values gives a novel measure of variable importance.
With this in mind I have constructed a new plot that shows the magnitude of the response range as each predictor is applied through its full range, while holding the other variables at their minima, maxima and whichever level gives the individual predictor its strongest effect (it’s not always the min or max of the others).
Here a barchart makes sense as we’re comparing spans, namely the range of the response when the individual predictor is run through its range.
The results differ again from the Garson and weights-accumulation methods. This plot is insightful because it’s possible to see which variables are only important when the other variables are at a minimum and which are still important with all the variables saturated. The interactions between variables are very apparent. Also interesting are the different profiles of the variables: some are unaffected by changes in the other predictors, while others are highly suppressed or have vastly different curves when the others are at their maxima, for example.
The most interesting ones can be plotted individually for further examination.
There is also a generalised weights plot function, gwplot, in the neuralnet package, which delivers an individual per-predictor plot, but it’s rather basic and doesn’t offer the richness of information of the lekprofile approach.
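For example, assuming interest in the lstat predictor:
# generalised weights of lstat for the single hidden layer model
gwplot(Boston.nn.8, selected.covariate = "lstat")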
Using the variable importance based on the profiled range of response values produced by each predictor, these values can be fed back into the NID plot from NeuralNetTools, as below. Note that I find this plot more intuitive if darker colours denote more important variables; this is the opposite of the convention used in the garson variable importance plot.
The usual fitted versus actual and fitted versus residual plots are also useful for visually comparing model performance. Plotting them separately or together can help to diagnose the differences.
The ANN model with a single hidden layer looks to have the best performance based on how close the points are to the unity line.
A more rigorous approach to determining model performance is considered good/best practice. The previous plots give a sense that these neuralnet models are better than OLS, but how do we choose between them? It’s also understood from the literature (see references) that ANNs are very sensitive to the training data, and simply changing the random seed could yield quite different results. The way to examine which model gives the best predictions is cross validation.
This process is a cinch with a linear model fit thanks to the boot library. For comparison, this process is run next.
# Linear model cross validation
# same seed value each time
seed.val <- 2016
Boston.glm.full <- glm(medv~.,data=Boston) # full data set
set.seed(seed.val)
Boston.cv.RMSE <- cv.glm(Boston, Boston.glm.full
, cost = RMSE
, K=10)$delta[1]
set.seed(seed.val)
Boston.cv.MAD <- cv.glm(Boston, Boston.glm.full
, cost = MAD
, K=10)$delta[1]
# delta[1] is unadjusted for fair comparison
Boston.cv.RMSE
## [1] 4.840892116
Boston.cv.MAD
## [1] 2.596723852
There is not, to my knowledge, a similar library for the nn class of objects, so cross validation is hand-coded using a standard pattern.
set.seed(seed.val)
k <- 10
cv.error <- matrix(nrow = k, ncol = 4)
folds <- sample(1:k, nrow(Boston)
, replace = TRUE)
for(i in 1:k){
Boston.train.cv <- Boston.scaled[folds != i,]
Boston.test.cv <- Boston.scaled[folds == i,]
nn.5.3 <- neuralnet(Boston.nn.fmla
, data=Boston.train.cv
, hidden=c(5,3)
, linear.output=TRUE)
nn.8 <- neuralnet(Boston.nn.fmla
, data=Boston.train.cv
, hidden=8
, linear.output=TRUE)
Boston.5.3.preds.scaled <- neuralnet::compute(nn.5.3, Boston.test.cv[, 1:13])
Boston.8.preds.scaled <- neuralnet::compute(nn.8, Boston.test.cv[, 1:13])
Boston.5.3.preds <- Boston.5.3.preds.scaled$net.result * (max(Boston$medv) - min(Boston$medv)) + min(Boston$medv)
Boston.8.preds <- Boston.8.preds.scaled$net.result * (max(Boston$medv) - min(Boston$medv)) + min(Boston$medv)
medv.unscaled <- (Boston.test.cv$medv) * (max(Boston$medv) - min(Boston$medv)) + min(Boston$medv)
cv.error[i, ] <- c(
RMSE(medv.unscaled, Boston.5.3.preds)
, MAD(medv.unscaled, Boston.5.3.preds)
, RMSE(medv.unscaled, Boston.8.preds)
, MAD(medv.unscaled, Boston.8.preds)
)
}
## [,1] [,2] [,3] [,4]
## [1,] 4.329520454 1.6929329924 4.316728731 2.159583755
## [2,] 2.615678392 0.8967653295 2.963626103 1.862072934
## [3,] 2.696146117 1.6135030948 2.646600097 1.648672111
## [4,] 4.696036606 2.1822030171 3.789964548 2.001242435
## [5,] 3.695304325 1.5498633770 4.509617464 2.290719114
## [6,] 7.617378451 2.4458471987 4.612057468 2.102816502
## [7,] 4.516907512 1.6214514369 3.840333522 1.692079883
## [8,] 8.854374959 1.9990646196 4.604622424 2.276381560
## [9,] 3.438679008 1.9254810880 2.995144806 1.838174465
## [10,] 4.773645805 1.9009483303 6.796696664 2.682105298
## [1] 4.723367163
## [1] 1.782806048
## [1] 4.107539183
## [1] 2.055384806
These results can be visualised to understand them more clearly.
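A hedged sketch of one way to do this with lattice (the labels are my own):
library(lattice)
# fold errors in long format: one row per fold, model and measure combination
cv.long <- data.frame(
  error = as.vector(cv.error)
  , measure = factor(rep(c("RMSE 5.3", "MAD 5.3", "RMSE 8", "MAD 8"), each = k))
)
bwplot(error ~ measure, data = cv.long
       , ylab = "10-fold CV error")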
A t-test can be used to determine if the test results are significant:
t.test(cv.error[,1], cv.error[,3]) # RMSE
##
## Welch Two Sample t-test
##
## data: cv.error[, 1] and cv.error[, 3]
## t = 0.82877621, df = 14.570162, p-value = 0.4206011
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9720397893 2.2036957494
## sample estimates:
## mean of x mean of y
## 4.723367163 4.107539183
t.test(cv.error[,2], cv.error[,4]) # MAD
##
## Welch Two Sample t-test
##
## data: cv.error[, 2] and cv.error[, 4]
## t = -1.6429406, df = 16.71342, p-value = 0.1190738
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.62307407545 0.07791656065
## sample estimates:
## mean of x mean of y
## 1.782806048 2.055384806
# neither RMSE nor MAD shows a significant difference between the models
# RMSE > MAD for both: the majority of errors fall within a smaller bound
# with a few poor predictions inflating the RMSE
The 95% confidence intervals from the t-tests contain zero in both cases. From this we can say that the differences in mean RMSE and MAD between the two models are not significant. The performance of both models is about the same, despite the early indications that the 2 hidden layer model was better.
It has not escaped my attention that, under cross validation, the OLS model also seems to have fared about the same, even though the literature suggests that in many cases ANNs are much more powerful algorithms.
In this report I have explored the neuralnet and NeuralNetTools libraries by adapting some excellent code examples available on public blogs. I have also created some potentially useful diagnostic and exploratory tools of my own.
Despite my ambiguous results compared with traditional OLS, the literature rates ANNs as one of the most useful and exciting machine learning algorithms, and I will continue to use them in my own work.
Portilla, J. (2016), A Beginner’s Guide to Neural Networks with R! Available at: https://www.r-bloggers.com/fitting-a-neural-network-in-r-neuralnet-package/ [Accessed September 2016]
Fritsch, S & Guenther, F. (2016). ‘neuralnet: Training of Neural Networks’ R package version 1.33 Available at: https://CRAN.R-project.org/package=neuralnet [Accessed September 2016]
Hasheminia, H. (2015). ‘R-Session 11 - Statistical Learning - Neural Networks’ Available at: https://www.youtube.com/watch?v=lTMqXSSjCvk [Accessed September 2016]
Alice, M. (2015). ‘Fitting a neural network in R; neuralnet package’ Available at: https://www.r-bloggers.com/fitting-a-neural-network-in-r-neuralnet-package/ [Accessed September 2016]
Beck, M. (2015). ‘NeuralNetTools: Visualization and Analysis Tools for Neural Networks’ R package version 1.4.0 Available at: http://CRAN.R-project.org/package=NeuralNetTools [Accessed September 2016]
Beck, M. (2013). ‘Visualizing neural networks in R – update’ Available at: https://beckmw.wordpress.com/tag/neural-network/ [Accessed September 2016]
Olden, J.D. & Jackson, D.A. (2002). ‘Illuminating the “black box”: a randomization approach for understanding variable contributions in artificial neural networks’ Ecological Modelling 154, pp. 135–150
Olden, J.D., Joy, M.K. & Death, R.G. (2004). ‘An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data’ Ecological Modelling 178, pp. 389–397
Sarkar, D. (2008). Multivariate Data Visualization with R, Seattle: Springer, 2nd Printing, p. 57
Gevrey, M., Dimopoulos, I., Lek, S. (2003). ‘Review and comparison of methods to study the contribution of variables in artificial neural network models’ Ecological Modelling 160, pp 249-264