Artificial Neural Nets (ANNs)

The idea

These need no introduction: they model biological neurons, and can be applied to most tasks - typically supervised learning (either classification or regression), but also unsupervised learning for discovering properties of unknown data (though that use is still quite rare).

Each node in the net has:

  • multiple input signals, each with its own weight (the x’s)
  • an activation function which (typically) sums all the inputs multiplied by their respective weights
  • a single output signal (the y)

So \(y(x) = f(\sum_{i=1}^{n}w_i x_i)\)
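
As a quick sketch of that formula (the inputs, weights and activation below are entirely made up for illustration), a single node could be expressed in R as:

# a single node: the weighted sum of the inputs, passed through an activation function f
node_output = function(x, w, f) { f(sum(w * x)) }
# e.g. three input signals, arbitrary weights, identity activation
node_output(x = c(0.2, 0.7, 1.0), w = c(0.5, -0.3, 0.8), f = identity)
## [1] 0.69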

The activation function

The activation function in the biological sense simply adds up the weighted inputs; if the sum exceeds a threshold, the neuron fires. This is a step function from y = 0 to y = 1:

zstep = function(x) { ifelse(x<=0, 0, 1) }
curve(zstep(x), -5, 5, n = 10000)

However, the step function is rarely of use in ANNs: a key reason is that whatever function is used needs to be differentiable (more on this later). As a result, sigmoid functions are typically used:

zsigmoid = function(x) { 1/(1+exp(-x)) }
curve(zsigmoid(x), -10, 10, n=10000)

Other alternatives are linear (think y = x), saturated linear, hyperbolic tangent, Gaussian etc. The key difference between the alternatives is the range of the output signal’s value: (0, 1), (-1, +1), (-inf, +inf) and so on.
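
For instance, two of those alternatives plotted in the same style as above (tanh is the hyperbolic tangent, built into base R; the linear one is just y = x):

# tanh squashes its input into (-1, +1)
curve(tanh(x), -5, 5, n = 10000)
# a purely linear activation is unbounded: (-inf, +inf)
zlinear = function(x) { x }
curve(zlinear(x), -5, 5, n = 10000)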

The choice of the activation function biases the ANN in such a way that particular types of data are fit more appropriately. These functions take in a wide range of inputs and (usually) produce an output within a very narrow range (like [0, 1]), so they’re sometimes called squashing functions. This introduces annoyances which are readily resolved by standardising / normalising the features of our data (avoiding complications where household income would otherwise dwarf number of children).
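
To illustrate (with arbitrary numbers): a raw feature on a large scale, such as household income, pushes the sigmoid straight into saturation and dominates the weighted sum, while a small-scale feature barely moves it:

zsigmoid(c(2, 50000))
## [1] 0.8807971 1.0000000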

The network topology

There are three properties which define a neural net’s architecture:

  • number of nodes in a given layer
  • number of layers in total
  • whether information in the network is allowed to flow backwards

The specifics are what determine the complexity a given network can handle - typically, the greater the numbers, the more complex a relationship the ANN can learn, but also the more likely it is to overfit the data.

The very first layer usually consists of input nodes: a single node per feature of our raw dataset. Networks are typically fully connected, meaning that each node in a given layer sends its output signal as an input signal to each node in the next layer. Layers in between the input nodes and the output node are called hidden layers. ANNs tend to be feedforward networks, i.e. information typically flows in one direction only, though feedback (or recurrent) networks exist as well - combined with node-local memory, these help learn about sequences of events over a period of time. Finally, there can be multiple output nodes.

The number of nodes in the hidden layers is important, and up to the user to decide: it should be a function of the number of features, the amount of training data, the amount of noise in the data, and the complexity of the task at hand. More nodes generally means more capacity, balanced against overfitting concerns and the consequent poor generalisation - not to mention compute time.
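
As a taster of the neuralnet package we’ll use below, the topology is set via its hidden argument - here’s a sketch with made-up data and two hidden layers of 5 and 3 nodes (purely to show the syntax):

library(neuralnet)
# toy data, only to illustrate how hidden layers are specified
toy = data.frame(x1 = runif(50), x2 = runif(50))
toy$y = toy$x1 + toy$x2
toy_model = neuralnet(y ~ x1 + x2, data = toy, hidden = c(5, 3))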

Learning mechanism - backpropagation

The network starts with no knowledge about the data, so the weights are chosen at random. This results in an output signal which will most likely be wrong; its error relative to the correct value is evaluated, and that error is backpropagated through the network to modify the weights between neurons and reduce future errors. This two-stage cycle is repeated until some stopping criteria are met.

So how does the algorithm decide how much a weight should be changed? By using gradient descent: it uses the derivative of each neuron’s activation function, which provides a gradient indicating how much effect a change in weight will have. This is why the activation function needs to be differentiable.
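
As a rough sketch of the idea (a single sigmoid neuron, with made-up inputs, target and learning rate - not how neuralnet implements it internally):

# one gradient-descent step for a single sigmoid neuron with squared-error loss
x      = c(0.5, 1.2, -0.3)       # input signals
w      = runif(3, -1, 1)         # random initial weights
target = 0.8                     # the known correct output
lr     = 0.1                     # learning rate

y     = zsigmoid(sum(w * x))     # forward pass
error = y - target               # how wrong the output is
grad  = error * y * (1 - y) * x  # chain rule: the sigmoid's derivative is y * (1 - y)
w     = w - lr * grad            # nudge each weight against its gradient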

Let’s look at an example

Example: Strength of Concrete

The data consists of 1030 examples of concrete compositions. We’re interested in the strength field (there are 8 other features).

concrete = read.csv("D://dev//R//mlwr//chap7-nn-svm//concrete.csv")
str(concrete)
## 'data.frame':    1030 obs. of  9 variables:
##  $ cement      : num  141 169 250 266 155 ...
##  $ slag        : num  212 42.2 0 114 183.4 ...
##  $ ash         : num  0 124.3 95.7 0 0 ...
##  $ water       : num  204 158 187 228 193 ...
##  $ superplastic: num  0 10.8 5.5 0 9.1 0 0 6.4 0 9 ...
##  $ coarseagg   : num  972 1081 957 932 1047 ...
##  $ fineagg     : num  748 796 861 670 697 ...
##  $ age         : int  28 14 28 28 28 90 7 56 28 28 ...
##  $ strength    : num  29.9 23.5 29.2 45.9 18.3 ...
summary(concrete)
##      cement           slag            ash             water      
##  Min.   :102.0   Min.   :  0.0   Min.   :  0.00   Min.   :121.8  
##  1st Qu.:192.4   1st Qu.:  0.0   1st Qu.:  0.00   1st Qu.:164.9  
##  Median :272.9   Median : 22.0   Median :  0.00   Median :185.0  
##  Mean   :281.2   Mean   : 73.9   Mean   : 54.19   Mean   :181.6  
##  3rd Qu.:350.0   3rd Qu.:142.9   3rd Qu.:118.30   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.4   Max.   :200.10   Max.   :247.0  
##   superplastic      coarseagg         fineagg           age        
##  Min.   : 0.000   Min.   : 801.0   Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000   1st Qu.: 932.0   1st Qu.:731.0   1st Qu.:  7.00  
##  Median : 6.400   Median : 968.0   Median :779.5   Median : 28.00  
##  Mean   : 6.205   Mean   : 972.9   Mean   :773.6   Mean   : 45.66  
##  3rd Qu.:10.200   3rd Qu.:1029.4   3rd Qu.:824.0   3rd Qu.: 56.00  
##  Max.   :32.200   Max.   :1145.0   Max.   :992.6   Max.   :365.00  
##     strength    
##  Min.   : 2.33  
##  1st Qu.:23.71  
##  Median :34.45  
##  Mean   :35.82  
##  3rd Qu.:46.13  
##  Max.   :82.60

We can see that the ranges are pretty wide across the features, so, per the above, some normalisation / standardisation needs to be done. Recall that if the data is normally distributed, we can use R’s built-in scale() function, and if it’s uniformly, or severely non-normally, distributed, we should normalise to [0, 1].
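
For reference, the scale() route would be a one-liner (not used here, since - as we’ll see - most of the features aren’t normally distributed):

# z-score standardisation: each feature ends up with mean 0 and standard deviation 1
concrete_std = as.data.frame(scale(concrete))
summary(concrete_std$cement)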

library(ggplot2)
library(reshape2)
library(gridExtra)
melted_concrete = melt(concrete)
## No id variables; using all as measure variables
tail(melted_concrete)
##      variable value
## 9265 strength 21.91
## 9266 strength 13.29
## 9267 strength 41.30
## 9268 strength 44.28
## 9269 strength 55.06
## 9270 strength 52.61
qplot(x=value, data=melted_concrete) + facet_wrap(~variable, scales='free')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

So a lot of these are non-normal; let’s normalise to [0, 1].

normalise = function(x) {
  return( (x-min(x)) / (max(x) - min(x)))
}
concrete_norm = as.data.frame(lapply(concrete, normalise))

# let's make sure:
summary(concrete_norm)
##      cement            slag              ash             water       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2063   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.3442  
##  Median :0.3902   Median :0.06121   Median :0.0000   Median :0.5048  
##  Mean   :0.4091   Mean   :0.20561   Mean   :0.2708   Mean   :0.4774  
##  3rd Qu.:0.5662   3rd Qu.:0.39775   3rd Qu.:0.5912   3rd Qu.:0.5607  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##   superplastic      coarseagg         fineagg            age         
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.3808   1st Qu.:0.3436   1st Qu.:0.01648  
##  Median :0.1988   Median :0.4855   Median :0.4654   Median :0.07418  
##  Mean   :0.1927   Mean   :0.4998   Mean   :0.4505   Mean   :0.12270  
##  3rd Qu.:0.3168   3rd Qu.:0.6640   3rd Qu.:0.5770   3rd Qu.:0.15110  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##     strength     
##  Min.   :0.0000  
##  1st Qu.:0.2664  
##  Median :0.4001  
##  Mean   :0.4172  
##  3rd Qu.:0.5457  
##  Max.   :1.0000

We can now split the data into training and test sets - we’ll go roughly 3:1.

c.train = concrete_norm[1:773, ]
c.test  = concrete_norm[774:1030, ]
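
This sequential split is fine as long as the rows are already in random order; if they weren’t, a random split along these lines would be safer (a sketch only - the results below use the sequential split):

# alternative: a random 3:1 split (the seed is arbitrary)
set.seed(123)
train_idx    = sample(nrow(concrete_norm), size = round(0.75 * nrow(concrete_norm)))
c.train_rand = concrete_norm[train_idx, ]
c.test_rand  = concrete_norm[-train_idx, ]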

We’ll use the neuralnet package as it’s standard, easy to use, and allows topology visualisation. There are many alternatives: nnet comes bundled with R, and the RSNNS package is more powerful but trickier to learn.

library(neuralnet)
## Loading required package: grid
## Loading required package: MASS

This library uses neuralnet() to train a model, and compute() to generate predictions on new data:

# Let's start by building the simplest possible multilayer feedforward network, with only a single hidden node
c.model = neuralnet(data=c.train,
                    strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age)
# Let's see the coefficients (plot.nnet() is sourced separately from an external script, not shown here):
plot.nnet(c.model)
## Loading required package: scales
## Loading required package: reshape
## 
## Attaching package: 'reshape'
## 
## The following objects are masked from 'package:reshape2':
## 
##     colsplit, melt, recast

# Note the error value of about 5.1

# Let's now run on the test set
model.results = compute(c.model, c.test[1:8])

# model.results is a list of 2 components: neurons and net.result. We want the latter.
predicted_strength = model.results$net.result

# because this is a numeric prediction, we can't use a confusion matrix, so we'll use cor() instead
cor(predicted_strength, c.test$strength)
##              [,1]
## [1,] 0.8061918147

Ok, so that’s the overall workflow. Can this be improved upon? We can certainly increase the number of nodes in the hidden layer - let’s see what effect that has:

c.model2 = neuralnet(data=c.train,
                    strength ~ cement + slag + ash + water + superplastic + coarseagg + fineagg + age,
                    hidden = 5)

plot.nnet(c.model2)

# great, error is 1.9, but computationally more intensive - almost 5x more steps
# Let's now run on the test set
model.results2 = compute(c.model2, c.test[1:8])
cor(model.results2$net.result, c.test$strength)
##              [,1]
## [1,] 0.9240877753
# ok, clearly a significant improvement
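
One last practical note: since strength was normalised to [0, 1], the raw predictions aren’t in the original units. A quick sketch of inverting the min-max normalisation (denormalise is a small helper defined here, not part of the neuralnet package):

# invert the min-max normalisation to get predictions back in the original strength units
denormalise = function(x, orig) { x * (max(orig) - min(orig)) + min(orig) }
strength_pred = denormalise(as.vector(model.results2$net.result), concrete$strength)
head(data.frame(predicted = strength_pred, actual = concrete$strength[774:1030]))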

The next installment of black box methods looks at Support Vector Machines, where we’ll revisit this concrete example so you can compare and contrast the results of the two learning methods.