We’ll first cover the basic concepts needed to understand a neural network and then go through some code examples.

Concepts

Basic Biological Motivation

The following is a simple schematic of a neuron:


Schematic of Neuron


In essence, each neuron takes information from other neurons, processes it, and then produces an output. One could imagine that certain neurons output information based on raw sensory inputs, other neurons build higher-level representations on top of that, and so on until one gets outputs that are significant at a higher level.

Activation Function

One important step in the previous description is how the neuron processes all of its inputs. Suppose that a node takes in \(l\) inputs; one could imagine taking a weighted average of these according to some weights \(w_i\) for \(i \in \{1, \dots, l\}\). Then, the neuron either fires or doesn’t fire depending on whether this weighted average is above a threshold; in the case of artificial neurons, one typically uses a smooth analog of those discrete cases. There are several choices of activation function, such as the sigmoid (logistic), hyperbolic tangent, rectified linear unit (ReLU), leaky ReLU, and maxout functions, but we’ll use the logistic function here: \[f(x) = \frac{1}{1 + e^{-x}}\]

Here’s a graph of the logistic function:
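As a quick sketch, one could generate such a graph in base R (the plotting range of -6 to 6 below is an arbitrary choice):

logistic <- function(x) 1 / (1 + exp(-x)) # The logistic activation function
curve(logistic, from = -6, to = 6, xlab = "x", ylab = "f(x)",
      main = "Logistic Function")
abline(h = c(0, 1), lty = 2) # The outputs are squashed between 0 and 1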

To put it all together, let’s examine this picture:


Mathematical Model of Neuron
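To make that picture concrete, here is a minimal sketch of a single artificial neuron in R; the input values, weights, and bias below are made up purely for illustration:

neuron.output <- function(inputs, weights, bias) {
  z <- sum(weights * inputs) + bias # Weighted sum of the inputs plus the bias
  1 / (1 + exp(-z))                 # Logistic activation applied to the sum
}
neuron.output(inputs = c(0.5, -1, 2), weights = c(0.8, 0.1, -0.4), bias = 0.3)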


Error Function

Another important component of a neural network that we can specify is the error function; while we typically use the squared error loss in most machine learning settings, it unfortunately has the property that neurons initialized to an output close to the extremes will be very slow to train. (This situation is called a saturated neuron.) Thus, we can introduce a different cost function named the logarithmic loss: \[L(y, \hat{y}) = -\left[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right].\] This error function turns out to keep the training process efficient even when a neuron is saturated. While we will not cover it here, one can understand the intuition of why one should use the logarithmic loss instead of the squared error loss here.

To get more intuition for this function, let’s look at a plot of the function.
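Such a plot can be sketched in base R by plotting the loss against the predicted probability \(\hat{y}\) for each possible true label (the exact styling here is an arbitrary choice):

y.hat <- seq(0.001, 0.999, by = 0.001)              # Predicted probabilities
plot(y.hat, -log(y.hat), type = "l", col = "blue",  # Loss when the true label is 1
     xlab = expression(hat(y)), ylab = "Loss", main = "Logarithmic Loss")
lines(y.hat, -log(1 - y.hat), col = "orange")       # Loss when the true label is 0
legend("top", legend = c("y = 1", "y = 0"), col = c("blue", "orange"), lty = 1)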

One can observe the high levels of loss at the points which correspond to poor predictions and very low levels of loss at the points which correspond to good predictions.

Connection Pattern

In a feed-forward neural network (neurons only affect neurons further along), one typically connects all units of one layer to all units of the next layer; one can also include a bias term (a constant) for each hidden layer.


Neural Network Architecture


Universal Approximation Theorem

One reason people are hopeful about neural networks is that they are universal function approximators. Specifically, if \(\varphi\) is a nonconstant, bounded, and monotonically-increasing continuous function, then for any continuous function on a compact subset of Euclidean space and any desired accuracy, there exist an \(N\), coefficients \(v_i\) and \(b_i\), and vectors \(w_i\) such that \[F(x) = \sum_{i = 1}^{N}{v_i \, \varphi(w_i^\intercal x + b_i)}\] approximates that function to within the desired accuracy.
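While not a proof, one can get a feel for this by noting that a difference of two shifted logistic functions forms a localized bump, and a weighted sum of such bumps can trace out an arbitrary continuous curve on an interval. Here is a rough sketch of that idea (the target function, number of bumps, and steepness are arbitrary choices):

logistic <- function(x) 1 / (1 + exp(-x))
bump <- function(x, center, width, steepness = 50) {
  # Difference of two shifted sigmoids: roughly 1 near the center, 0 elsewhere
  logistic(steepness * (x - (center - width / 2))) -
    logistic(steepness * (x - (center + width / 2)))
}
x <- seq(0, 2 * pi, len = 500)
centers <- seq(0, 2 * pi, len = 30)
width <- diff(centers)[1]
approximation <- rowSums(sapply(centers, function(ctr) sin(ctr) * bump(x, ctr, width)))
plot(x, sin(x), type = "l", main = "Approximating sin(x) with a sum of sigmoid bumps")
lines(x, approximation, col = "orange")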

Backpropagation

One question you all may have is how we can train such a network. The most fundamental algorithm is named backpropagation, and we will walk through an example of this here. Let’s analyze the network below:

An intuitive way to start deriving backpropagation is to look at the cost function evaluated at a single point. In mathematical terms, we have an input point \((x_1, x_2)\) from class \(y\); the neural network produces an estimate of the label, namely \(\hat{y}\). Additionally, we will use the logarithmic loss function as our cost function, \(C\). The ultimate goal is to determine the derivative of the cost function with respect to the weights, which we can index as \(w_{i,j}^{(l)}\); this particular weight refers to the weight of the connection between node \(i\) of layer \(l\) and node \(j\) of layer \(l + 1\). (The bias weights can be represented when \(i = 0\).) By determining the partial derivative, we may use optimization techniques such as gradient descent.

For further notation, one can denote \(s_l\) as the number of nodes in the \(l\)-th layer. Also, a node is defined by its layer, \(i\), and its index within that layer, \(j\). Each non-input node produces an output, \(n_{i,j}\), which comes from the weighted sum of its inputs \(z_{i,j}\); these two quantities are related by the activation function, \(f\): \(f(z_{i, j}) = n_{i,j}\).

Using all of that notation, we can label the above diagram.

We can start by defining how \(\hat{y}\) is calculated. For a node in the input layer, one can define its output as follows: \[n_{1,j} = x_j,\] where the bias node’s output \(n_{1,0}\) is defined to be 1. For the other layers, one can write each node’s output as \[n_{i,j} = f(z_{i,j}) = f\left(\sum_{k = 1}^{s_{i-1}}{w_{k,j}^{(i - 1)}n_{(i - 1),k}} + w_{0,j}^{(i - 1)}\right)\] for \(i \geq 2\). In this way, the inputs flow through the network to calculate the final output.
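As a concrete (if toy) illustration of this forward pass, the following sketch pushes a single input point through a network shaped like the one in our example (two inputs, two hidden layers of two nodes, one output); the weights below are random placeholders rather than trained values:

logistic <- function(x) 1 / (1 + exp(-x))
forward.layer <- function(inputs, weights, biases) {
  # weights has one column per node in the next layer; biases are the w_0 terms
  z <- as.vector(t(weights) %*% inputs + biases) # Weighted sums z_{i,j}
  logistic(z)                                    # Outputs n_{i,j} = f(z_{i,j})
}
set.seed(1)
x <- c(0.2, 0.7)                             # Input layer: n_{1,j} = x_j
W2 <- matrix(rnorm(4), 2, 2); b2 <- rnorm(2) # Weights into layer 2
W3 <- matrix(rnorm(4), 2, 2); b3 <- rnorm(2) # Weights into layer 3
W4 <- matrix(rnorm(2), 2, 1); b4 <- rnorm(1) # Weights into the output layer
layer2 <- forward.layer(x, W2, b2)
layer3 <- forward.layer(layer2, W3, b3)
y.hat <- forward.layer(layer3, W4, b4)       # Estimated label
y.hat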

To start deriving backpropagation, we’ll analyze how the cost function changes with respect to an arbitrary non-input node. One could define this as the rate of change of the cost function with respect to \(n_{i,j}\), but we will instead define it as \[\delta_{i,j} := \frac{\partial{C}}{\partial{z_{i,j}}}\] to make some of the algebra easier. It is easiest to start with \(\hat{y} = n_{4,1}\) for our example: \[\delta_{4,1} = \frac{\partial{C}}{\partial{z_{4,1}}} = \frac{\partial{C}}{\partial{n_{4,1}}} \cdot \frac{\partial{n_{4,1}}}{\partial{z_{4,1}}} = \frac{\partial{C}}{\partial{n_{4,1}}} \cdot f'(z_{4,1}).\] This is easily computed because we can take the derivative of the cost function with respect to the estimated label we produce and the derivative of the activation function with respect to the current value of \(z_{4,1}\).

Let’s now try it for an example in the second hidden layer: \(\delta_{3,1} = \frac{\partial{C}}{\partial{z_{3,1}}}\). Because the cost function depends on \(z_{3,1}\) only through \(z_{4,1}\), we may rewrite \(\delta_{3,1} = \frac{\partial{C}}{\partial{z_{4,1}}} \frac{\partial{z_{4,1}}}{\partial{z_{3,1}}} = \delta_{4,1} \frac{\partial{z_{4,1}}}{\partial{z_{3,1}}}\). Evaluating the second term, we get \[\frac{\partial{z_{4,1}}}{\partial{z_{3,1}}} = \frac{\partial{(\sum_{k = 1}^{2}{w_{k,1}^{(3)}n_{3,k}} + w_{0,1}^{(3)})}}{\partial{z_{3,1}}} = \frac{\partial{(w_{1,1}^{(3)}f(z_{3,1}))}}{\partial{z_{3,1}}} = w_{1,1}^{(3)} f'(z_{3,1}).\] Putting these two results together, \[\delta_{3,1} = \delta_{4,1} \cdot w_{1,1}^{(3)} f'(z_{3,1}).\] In a similar way, we can establish what \(\delta_{3,2}\) is because each of the terms can directly be calculated.

In such a way, one may continue to derive the \(\delta\) terms; in a shortened derivation, we will derive \(\delta_{2,1}\): \[\delta_{2,1} = \frac{\partial{C}}{\partial{z_{2,1}}} = \frac{\partial{C}}{\partial{z_{3,1}}} \frac{\partial{z_{3,1}}}{\partial{z_{2,1}}} + \frac{\partial{C}}{\partial{z_{3,2}}} \frac{\partial{z_{3,2}}}{\partial{z_{2,1}}} = \delta_{3,1} \frac{\partial{z_{3,1}}}{\partial{z_{2,1}}} + \delta_{3,2} \frac{\partial{z_{3,2}}}{\partial{z_{2,1}}}\] \[\frac{\partial{z_{3,1}}}{\partial{z_{2,1}}} = \frac{\partial{(\sum_{k = 1}^{2}{w_{k,1}^{(2)}n_{2,k}} + w_{0,1}^{(2)})}}{\partial{z_{2,1}}} = \frac{\partial{(w_{1,1}^{(2)}f(z_{2,1}))}}{\partial{z_{2,1}}} = w_{1,1}^{(2)} f'(z_{2,1})\] \[\frac{\partial{z_{3,2}}}{\partial{z_{2,1}}} = \frac{\partial{(\sum_{k = 1}^{2}{w_{k,2}^{(2)}n_{2,k}} + w_{0,2}^{(2)})}}{\partial{z_{2,1}}} = \frac{\partial{(w_{1,2}^{(2)}f(z_{2,1}))}}{\partial{z_{2,1}}} = w_{1,2}^{(2)} f'(z_{2,1})\] \[\delta_{2,1} = \delta_{3,1} \cdot w_{1,1}^{(2)}f'(z_{2,1}) + \delta_{3,2} \cdot w_{1,2}^{(2)} f'(z_{2,1})\] Similarly, one could do the same mechanical process for \(\delta_{2,2}\). Hence, we can calculate all reasonable \(\delta\) terms. (One wouldn’t want to calculate any \(\delta_{1,j}\) terms since we cannot change the value of input nodes.)
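To see the recursion in action numerically, here is a small sketch assuming the logistic activation (whose derivative is \(f'(z) = f(z)(1 - f(z))\)) and the logarithmic loss; the values of \(z\), the weight, and the label below are made up for illustration:

logistic <- function(x) 1 / (1 + exp(-x))
logistic.prime <- function(x) logistic(x) * (1 - logistic(x))
y <- 1                              # True label (made up)
z.41 <- 0.8; n.41 <- logistic(z.41) # Output node's weighted sum and output
# Derivative of the logarithmic loss with respect to the prediction n.41
dC.dn <- -(y / n.41 - (1 - y) / (1 - n.41))
delta.41 <- dC.dn * logistic.prime(z.41)
# Propagate one layer back using delta.31 = delta.41 * w * f'(z.31)
w.11.3 <- 0.5; z.31 <- -0.2
delta.31 <- delta.41 * w.11.3 * logistic.prime(z.31)
c(delta.41 = delta.41, delta.31 = delta.31)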

Let’s now go after the quantity of interest: the partial derivative of the cost function with respect to a weight. For any particular weight \(w_{i,j}^{(l)}\), the quantity \(C\) depends on that weight only through \(z_{(l + 1),j}\). Let’s look specifically at \(w_{1,1}^{(3)}\): \[\frac{\partial{C}}{\partial{w_{1,1}^{(3)}}} = \frac{\partial{C}}{\partial{z_{4,1}}} \frac{\partial{z_{4,1}}}{\partial{w_{1,1}^{(3)}}} = \delta_{4,1} \frac{\partial{(\sum_{k = 1}^{2}{w_{k,1}^{(3)}n_{3,k}} + w_{0,1}^{(3)})}}{\partial{w_{1,1}^{(3)}}} = \delta_{4,1} \frac{\partial{(w_{1,1}^{(3)}n_{3,1})}}{\partial{w_{1,1}^{(3)}}} = \delta_{4,1} n_{3,1}.\] This generalizes to: \[\frac{\partial{C}}{\partial{w_{i,j}^{(l)}}} = n_{l, i} \delta_{(l + 1), j}.\]

(Stochastic) Gradient Descent

The whole purpose of backpropagation is to compute the partial derivative of the cost function with respect to each weight. One can now use techniques like gradient descent to optimize the weights and thus optimize the network. An interesting point is that we can do this with even a single example, which is called stochastic gradient descent. One can take one example at a time and apply stochastic gradient descent updates until the weights aren’t changing much; this comes in handy when storing all of the data in memory isn’t feasible, because one can read a single data point into memory at a time and still optimize the network.
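As a minimal sketch of a single stochastic gradient descent step, one could update one weight using the backpropagation result \(\frac{\partial C}{\partial w_{i,j}^{(l)}} = n_{l,i}\,\delta_{(l+1),j}\); the learning rate and the numbers below are made up for illustration:

learning.rate <- 0.1
n.out <- 0.6    # Output n_{l,i} of the node feeding this weight (made up)
delta <- -0.25  # delta_{(l+1),j} from backpropagation (made up)
gradient <- n.out * delta
w <- 0.4                          # Current value of the weight
w <- w - learning.rate * gradient # Gradient descent step on one example
w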

Architecture

One can arbitrarily stack more and more layers, which leads to deep networks, hence deep learning. Additionally, there are fancier layers that one could add, such as convolutional layers, which are helpful for images. Also, one can connect neurons in loops, which leads to recurrent networks.

XOR Function Example

Let’s now walk through some code; we will start with an artificial dataset before moving on to a well-known and widely used one.

Specifically, we will be aiming to train a neural network on the boolean XOR function. Here is the truth table of the XOR function:

\(X\) \(Y\) \(X \oplus Y\)
0 0 0
0 1 1
1 0 1
1 1 0

We can get examples for this dataset by sampling points from a multivariate normal distribution centered on the coordinate points of each row and associating those with the truth value of that row. Let us generate some data:

set.seed(4400) # For identical results across all document evaluations
library(MASS) # Needed to sample multivariate Gaussian distributions 
library(neuralnet) # The package for neural networks in R
## Loading required package: grid
cov <- matrix(c(0.05, 0, 0, 0.05), 2, 2) # Diagonal covariance matrix
cov
##      [,1] [,2]
## [1,] 0.05 0.00
## [2,] 0.00 0.05
num.points <- 5000 # Number of points for each row 
first <- mvrnorm(num.points, c(0, 0), cov) # First row
second <- mvrnorm(num.points, c(0, 1), cov) # Second row
third <- mvrnorm(num.points, c(1, 0), cov) # Third row
fourth <- mvrnorm(num.points, c(1, 1), cov) # Fourth row
all.points <- rbind(first, second, third, fourth) # Stack all points together
labels <- rep(c(0, 1, 1, 0), each = num.points) # Truth values
xor.data <- as.data.frame(cbind(labels, all.points)) # Combine labels and points
colnames(xor.data) <- c("label", "x", "y")

Now, let’s look at a few of these:

num.sample.rows <- 10 # Number of rows to display
display.rows <- sample(1:nrow(xor.data), num.sample.rows)
xor.data[display.rows, ]
##       label           x           y
## 11454     1  1.42220464 -0.18766478
## 18167     0  1.03298011  1.09020888
## 9372      1 -0.16110025  1.10459225
## 1219      0 -0.03854581  0.41963209
## 17624     0  0.69464369  1.11301751
## 6381      1  0.42702536  1.05702197
## 9640      1 -0.07102788  1.36626535
## 4968      0 -0.36453981 -0.55909047
## 11795     1  1.28374293 -0.08614112
## 15739     0  0.59026886  1.15683830

Even better, we can visualize this data:

library(ggplot2) # Needed to plot points
ggplot(xor.data, aes(x = x, y = y, color = factor(label))) + geom_point() +
  scale_color_manual(name = "Labels", values = c("blue", "orange"), 
                     labels = c("False", "True")) +
  ggtitle("XOR Function Data") + xlab("X") + ylab("Y")

We are now ready to fit a neural network to this data using the neuralnet package (which we loaded earlier):

xor.nnet <- neuralnet("label ~ x + y", data = xor.data, threshold = 1,
                     hidden = c(20), # 1 hidden layer with 20 units
                     linear.output = F, # Classification
                     err.fct = "ce", #Error Function
                     act.fct = "logistic") #Activation Function
cat(sprintf("Best error reached: %f", xor.nnet$result.matrix[c('error'), ]))
## Best error reached: 1304.851925
test.data <- data.frame(x = c(0, 0, 1, 1),
                        y = c(0, 1, 0, 1),
                        true.label = c(0, 1, 1, 0))
prediction <- compute(xor.nnet, test.data[, c("x", "y")])$net.result
cbind(test.data, prediction)
##   x y true.label       prediction
## 1 0 0          0 0.00024752482164
## 2 0 1          1 0.99980195988870
## 3 1 0          1 0.99964127382315
## 4 1 1          0 0.00009631966684
more.test.data <- data.frame(x = runif(10), y = runif(10))
more.predictions <- compute(xor.nnet, more.test.data)$net.result
cbind(more.test.data, more.predictions)
##             x          y more.predictions
## 1  0.74616144 0.73052313      0.025273571
## 2  0.77842043 0.04074222      0.996814192
## 3  0.44390654 0.98305029      0.754318170
## 4  0.82129699 0.76295029      0.008318009
## 5  0.85883274 0.71210591      0.015474711
## 6  0.36138164 0.55601637      0.709415777
## 7  0.12423632 0.54866919      0.740658959
## 8  0.90422697 0.80150616      0.002573246
## 9  0.03837237 0.87045207      0.998875833
## 10 0.28665360 0.70683817      0.958190494

(One note is that the linear.output parameter controls whether the neural network is doing regression or classification: F implies classification, and T implies regression.

Additionally, we could have specified rep = 5 to run the fitting procedure 5 times and then use the results of the best repetition. We may want to do this since the neural network’s weights are randomly initialized, meaning a given run may not lead to a great fit.)

(Another note about fitting a neural network using the neuralnet package is that it doesn’t give a helpful message when its fit doesn’t converge. The error message looks a little like this:

Error in nrow[w] * ncol[w] : non-numeric argument to binary operator
In addition: Warning message:
In is.na(weights) : is.na() applied to non-(list or vector) of type 'NULL'

This means that the value of weights is NULL, which means that the network didn’t converge; you can try fixing this by manually raising the threshold (the stopping criterion on the partial derivatives of the error function) from its default value of 0.01.)

Wow, it looks like the neural network did a really good job! We can visualize its behavior by showing its predictions on a uniform grid:

num.interpolating.points <- 100
x.values <- seq(0, 1, len = num.interpolating.points)
y.values <- seq(0, 1, len = num.interpolating.points)
test.points <- as.data.frame(expand.grid(x.values, y.values))
colnames(test.points) <- c("x", "y")
predictions <- compute(xor.nnet, test.points)$net.result
ggplot() + geom_point(aes(x = test.points$x, y = test.points$y, color = predictions)) +
  scale_color_gradient("Prediction", low = "blue", high = "orange") +
  ggtitle("Visualizing the Neural Network's Decision Pattern") + xlab("X") + ylab("Y")

An important observation is that the neural network is able to find a nonlinear decision boundary, unlike its predecessor, the perceptron, which, as you may recall, is only able to find a linear decision boundary.

One can also try to uncover what the neural network is doing by plotting the nodes and seeing the edges. (Blue edges are positive, and orange edges are negative; the width represents the magnitude.)

library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
## SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef
plot.nnet(xor.nnet, pos.col = "blue", neg.col = "orange")
## Loading required package: scales
## Loading required package: reshape

Another attempt at variable importance can be seen here:

source_gist('6206737')
## Sourcing https://gist.githubusercontent.com/fawda123/6206737/raw/8e013bde5158f9861e92cb37e5bfd800ea9597db/gar_fun.r
## SHA-1 hash of file is 7aa33496459a2fe0d3b359351c44fd423928bcfd
gar.fun("y", xor.nnet)
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:reshape':
## 
##     rename, round_any

The example chosen here is historically significant, and I have paraphrased this Wikipedia page to explain that significance:

Marvin Minsky and Seymour Papert proved in 1969 that perceptrons are not able to learn the XOR function; it was incorrectly believed that they had also conjectured that multi-layer perceptrons (neural networks) would be unable to learn the XOR function. (In fact, they already knew that multi-layer perceptrons could learn the XOR function, and Stephen Grossberg published papers three years later introducing neural networks capable of learning XOR-like functions.) However, the misconception regarding Minsky and Papert’s results led to a significant decline in interest in and funding for neural network research. Neural networks finally experienced a resurgence in the 1980s.

Out of curiosity, let’s also fit a network with two hidden layers of 20 units each and compare its predictions to the single-hidden-layer network’s:

xor.2.nnet <- neuralnet("label ~ x + y", data = xor.data, threshold = 1,
                        hidden = c(20, 20), # 2 hidden layers with 20 units each
                        linear.output = F, # Classification
                        err.fct = "ce", #Error Function
                        act.fct = "logistic") #Activation Function
cat(sprintf("Best error reached: %f", xor.2.nnet$result.matrix[c('error'), ]))
## Best error reached: 1254.272770
simple.predictions.2 <- compute(xor.2.nnet, test.data[, c("x", "y")])$net.result
cbind(test.data, simple.predictions.2)
##   x y true.label simple.predictions.2
## 1 0 0          0     0.00002500060824
## 2 0 1          1     0.99983945068570
## 3 1 0          1     0.99968601624850
## 4 1 1          0     0.00008085807318
predictions.2 <- compute(xor.2.nnet, test.points)$net.result
ggplot() + geom_point(aes(x = test.points$x, y = test.points$y, color = predictions.2)) +
  scale_color_gradient("Prediction", low = "blue", high = "orange") +
  ggtitle("Visualizing the 2 Layer Neural Network's Decision Pattern") + xlab("X") + ylab("Y")

We can also visualize how the two networks’ predictions differ:

ggplot() + geom_point(aes(x = test.points$x, y = test.points$y, color = predictions.2 - predictions)) +
  scale_color_gradient("Difference", low = "blue", high = "orange") +
  ggtitle("Difference Between the 2 Layer and 1 Layer Networks' Predictions") + xlab("X") + ylab("Y")


Wisconsin Breast Cancer Data Example

The last example was artificially constructed, but let’s now try neural networks on a real dataset. The Wisconsin Breast Cancer dataset is a popular dataset for testing classification algorithms. First, let’s get the data:

wbcd.url <- "http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
wbcd.data <- read.csv(wbcd.url, header = F)
wbcd.data <- wbcd.data[, -c(1)] # Don't need patient ID
wbcd.data[, 1] <- as.numeric(wbcd.data[, 1] == "M") # Make patient label numeric
colnames(wbcd.data)[1] <- "label"
wbcd.data[1, ]
##   label    V3    V4    V5   V6     V7     V8     V9    V10    V11     V12
## 1     1 17.99 10.38 122.8 1001 0.1184 0.2776 0.3001 0.1471 0.2419 0.07871
##     V13    V14   V15   V16      V17     V18     V19     V20     V21
## 1 1.095 0.9053 8.589 153.4 0.006399 0.04904 0.05373 0.01587 0.03003
##        V22   V23   V24   V25  V26    V27    V28    V29    V30    V31
## 1 0.006193 25.38 17.33 184.6 2019 0.1622 0.6656 0.7119 0.2654 0.4601
##      V32
## 1 0.1189

The neural network can now be trained on the data because we have 1 output variable and 30 numeric input variables. (The 30 numeric covariates are the mean, standard error, and average of the 3 largest values of 10 different metrics for the patient; more information can be found here under Section 7, Attribute Information.) However, as good machine learning practitioners, let’s split our data into test and training data.

train.proportion <- 0.8
train.index <- sample(x = 1:nrow(wbcd.data),
                      size = floor(train.proportion * nrow(wbcd.data)),
                      replace = F)
wbcd.train.data <- wbcd.data[train.index, ]
wbcd.test.data <- wbcd.data[-train.index, ]
wbcd.test.labels <- wbcd.test.data$label
wbcd.test.data <- subset(wbcd.test.data, select = -c(label))

Just as before, we can fit neural networks now; however, we aren’t sure what the best number of layers is (or the number of hidden units in each layer, for that matter). We will try a few different combinations; one note is that we have to raise the threshold from 0.01 because otherwise the networks will not converge. The value 2 was found experimentally to lead to decent results.

formula <- sprintf("%s%s", "label ~ ", paste("V", 3:32, collapse = " + ", sep = ""))
formula #Specifying the output and inputs
## [1] "label ~ V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 + V21 + V22 + V23 + V24 + V25 + V26 + V27 + V28 + V29 + V30 + V31 + V32"
wbcd.first.net <- neuralnet(formula, data = wbcd.train.data,
                            hidden = c(5), # 1 hidden layer with 5 units
                            linear.output = F, rep = 5,
                            err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.second.net <- neuralnet(formula, data = wbcd.train.data,
                             hidden = c(10), # 1 hidden layer with 10 units
                             linear.output = F, rep = 5,
                             err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.third.net <- neuralnet(formula, data = wbcd.train.data,
                            hidden = c(15), # 1 hidden layer with 15 units 
                            linear.output = F, rep = 5,
                            err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.fourth.net <- neuralnet(formula, data = wbcd.train.data,
                            hidden = c(5, 5), # 2 hidden layers with 5 units each
                            linear.output = F, rep = 5,
                            err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.fifth.net <- neuralnet(formula, data = wbcd.train.data,
                            hidden = c(10, 10), # 2 hidden layers with 10 units each
                            linear.output = F, rep = 5,
                            err.fct = "ce", act.fct = "logistic", threshold = 2)
wbcd.sixth.net <- neuralnet(formula, data = wbcd.train.data,
                            hidden = c(15, 15), # 2 hidden layers with 15 units each
                            linear.output = F, rep = 5,
                            err.fct = "ce", act.fct = "logistic", threshold = 2)
train.scores <- sapply(list(wbcd.first.net, wbcd.second.net, wbcd.third.net,
                      wbcd.fourth.net, wbcd.fifth.net, wbcd.sixth.net),
                 function(x) {min(x$result.matrix[c("error"), ])})
cat(paste(c("Training Scores (Logarithmic Loss)\n1 Hidden Layer, 5 Hidden Units:", "1 Hidden Layer, 10 Hidden Units:", "1 Hidden Layer, 15 Hidden Units:", "2 Hidden Layers, 5 Hidden Units Each:", "2 Hidden Layers, 10 Hidden Units Each:", "2 Hidden Layers, 15 Hidden Units Each:"), train.scores, collapse = "\n"))
## Training Scores (Logarithmic Loss)
## 1 Hidden Layer, 5 Hidden Units: 21.9807633574602
## 1 Hidden Layer, 10 Hidden Units: 9.63435577789839
## 1 Hidden Layer, 15 Hidden Units: 2.02920638964649
## 2 Hidden Layers, 5 Hidden Units Each: 0.0246021559576084
## 2 Hidden Layers, 10 Hidden Units Each: 6.20278004725438
## 2 Hidden Layers, 15 Hidden Units Each: 4.87049402299849

Now, let’s examine how the neural networks do on predicting the test data:

percentage.correctly.classified <- function(nn, threshold = 0.5) {
  best <- which.min(nn$result.matrix[c("error"), ])
  net.predictions <- compute(nn, wbcd.test.data, rep = best)$net.result
  thresholded.net.predictions <- ifelse(net.predictions > threshold, 1, 0)
  num.correct <- sum(as.numeric(thresholded.net.predictions == wbcd.test.labels))
  num.correct / length(wbcd.test.labels)
}
scores <- sapply(list(wbcd.first.net, wbcd.second.net, wbcd.third.net,
                      wbcd.fourth.net, wbcd.fifth.net, wbcd.sixth.net),
                 percentage.correctly.classified)
cat(paste(c("Test Scores (Percentage Correctly Classified)\n1 Hidden Layer, 5 Hidden Units:", "1 Hidden Layer, 10 Hidden Units:", "1 Hidden Layer, 15 Hidden Units:", "2 Hidden Layers, 5 Hidden Units Each:", "2 Hidden Layers, 10 Hidden Units Each:", "2 Hidden Layers, 15 Hidden Units Each:"), scores, collapse = "\n"))
## Test Scores (Percentage Correctly Classified)
## 1 Hidden Layer, 5 Hidden Units: 0.964912280701754
## 1 Hidden Layer, 10 Hidden Units: 0.93859649122807
## 1 Hidden Layer, 15 Hidden Units: 0.947368421052632
## 2 Hidden Layers, 5 Hidden Units Each: 0.956140350877193
## 2 Hidden Layers, 10 Hidden Units Each: 0.894736842105263
## 2 Hidden Layers, 15 Hidden Units Each: 0.912280701754386

From these results, we can see that simply adding more nodes or layers isn’t necessarily helpful for more accurate predictions.
