Why multiple layers?
- We demonstrated the perceptron using a data set where classes were indeed separated by a linear boundary. But we are not always so lucky.
- Way back in 1969, it was known that the perceptron could not classify the XOR problem.
- People discovered that this problem can be overcome by stacking perceptrons one over the other so that the inputs of one of them are the outputs of the ones below. Hence the name Multi-layer Perceptrons. The layers between the input and output are called the hidden layers.
- A network with two or more hidden layers is called a deep neural network.
- The first MLPs were trained using back-propagation algorithm. However, there are many others now.
- We will demonstrate the use of multi-layer perceptron by using it on the iris data.
Preparing the data
- How many cases do we have for each class (species)?
aggregate(cbind(n.cases = Sepal.Length) ~ Species, iris, length)
## Species n.cases
## 1 setosa 50
## 2 versicolor 50
## 3 virginica 50
- There are an equal number of cases in each class. Let us prepare our sample so that we choose an equal number from each class.
set.seed(18121842)
iris.setosa <- iris[iris$Species == 'setosa',]
iris.versicolor <- iris[iris$Species == 'versicolor',]
iris.virginica <- iris[iris$Species == 'virginica',]
split_data <- function(N, p) {
stopifnot(p > 0 & p < 1)
n <- N * p
trn.index <- sample.int(N, n, replace = FALSE)
test.index <- setdiff(1:N, trn.index)
list(train = trn.index, test = test.index)
}
setosa.split <- split_data(nrow(iris.setosa), 0.2)
setosa.train <- iris.setosa[setosa.split[["train"]],]
setosa.test <- iris.setosa[setosa.split[["test"]],]
versicolor.split <- split_data(nrow(iris.versicolor), 0.2)
versicolor.train <- iris.versicolor[versicolor.split[["train"]],]
versicolor.test <- iris.versicolor[versicolor.split[["test"]],]
virginica.split <- split_data(nrow(iris.virginica), 0.2)
virginica.train <- iris.virginica[virginica.split[["train"]],]
virginica.test <- iris.virginica[virginica.split[["test"]],]
train.data <- rbind(setosa.train, versicolor.train, virginica.train)
test.data <- rbind(setosa.test, versicolor.test, virginica.test)
rm(
setosa.split,
versicolor.split,
virginica.split,
iris.setosa,
iris.versicolor,
iris.virginica
)
rm(setosa.train, versicolor.train, virginica.train)
rm(setosa.test, versicolor.test, virginica.test)
Training the network - 1
- We use the single hidden layer network provided in B.D. Ripley’s ‘nnet’ package.
- Recall that in the perceptron we had a hyperparameter \(\eta\) called the learning rate. It influenced the rate at which weights changed in each iteration. We use the relation \[w_{i+1} = w_i - \eta\frac{\partial E}{\partial w_i}\] to updates weights in gradient descent.
- Sometimes the weights can get extreme. To avoid that, we apply a regularizing parameter \(\lambda\) called the decay. With it, the gradient descent is modified to \[w_{i+1} = w_i - \eta\frac{\partial E}{\partial w_i} - \eta\lambda w_i.\]
- As in the case of the perceptron, we can also specify the maximum number of iterations.
- The hidden layer is specified by passing its size.
Training the network - 2
- The function call to build the single hidden layer neural network is
nn.1 <-
nnet(
Species ~ .,
data = train.data,
size = 2,
decay = 1e-5,
maxit = 50
)
## # weights: 19
## initial value 35.411445
## iter 10 value 32.955229
## iter 20 value 14.638913
## iter 30 value 13.888475
## iter 40 value 13.869530
## final value 13.868609
## converged
- We have not dropped any feature from the data for we do not know which of them are useful in a 3-way classification.
- One can see the network using the summary function.
summary(nn.1)
## a 4-2-3 network with 19 weights
## options were - softmax modelling decay=1e-05
## b->h1 i1->h1 i2->h1 i3->h1 i4->h1
## 0.02 1.00 4.49 -8.03 -3.67
## b->h2 i1->h2 i2->h2 i3->h2 i4->h2
## -0.20 -1.34 -0.28 -0.83 -0.53
## b->o1 h1->o1 h2->o1
## -6.14 14.48 0.09
## b->o2 h1->o2 h2->o2
## 3.48 -5.69 -0.12
## b->o3 h1->o3 h2->o3
## 3.48 -7.95 0.30
- We observe that there are four input neurons i1, i2, i3, i4, one for each feature. There is also a ‘bias’ neuron b.
- There are two neurons in the hidden layer h1 and h2, along with a bias b
- As expected, there are three output layers o1, o2, o3.
- The weights from one neuron to the others in the succeeding layer are also mentioned.
- The output layer is treated to a softmax function to convert it to a probability.
Testing the network
- Like other models in R we have the predict function for neural networks as well.
predicted.species <- predict(nn.1, test.data, type = "class")
comparison <- data.frame(actual = test.data$Species, predicted = predicted.species)
# How did we do?
table(comparison)
## predicted
## actual setosa versicolor
## setosa 40 0
## versicolor 0 40
## virginica 0 40
- We mistook a all virginica to be versicolor.
- Does increasing the number of iteration help?
Model with increasing the number of iterations
nn.2 <-
nnet(
Species ~ .,
data = train.data,
size = 2,
decay = 1e-5,
maxit = 100
)
## # weights: 19
## initial value 37.118395
## iter 10 value 3.354185
## iter 20 value 0.356276
## iter 30 value 0.070115
## iter 40 value 0.062318
## iter 50 value 0.055942
## iter 60 value 0.054585
## iter 70 value 0.052726
## iter 80 value 0.040042
## iter 90 value 0.033890
## iter 100 value 0.031711
## final value 0.031711
## stopped after 100 iterations
predicted.species <- predict(nn.2, test.data, type = "class")
comparison <- data.frame(actual = test.data$Species, predicted = predicted.species)
# How did we do?
table(comparison)
## predicted
## actual setosa versicolor virginica
## setosa 40 0 0
## versicolor 0 40 0
## virginica 0 12 28
- It did, but not too much. How about increasing the hidden layers.
Model with more units in the hidden layer.
nn.3 <-
nnet(
Species ~ .,
data = train.data,
size = 4,
decay = 1e-5,
maxit = 50
)
## # weights: 35
## initial value 39.909614
## iter 10 value 13.743053
## iter 20 value 2.588907
## iter 30 value 0.436368
## iter 40 value 0.027481
## iter 50 value 0.022585
## final value 0.022585
## stopped after 50 iterations
predicted.species <- predict(nn.3, test.data, type = "class")
comparison <- data.frame(actual = test.data$Species, predicted = predicted.species)
# How did we do?
table(comparison)
## predicted
## actual setosa versicolor virginica
## setosa 40 0 0
## versicolor 0 40 0
## virginica 0 11 29
- Adding hidden layers made no great difference.
Model with more units and more iterations.
nn.4 <-
nnet(
Species ~ .,
data = train.data,
size = 4,
decay = 1e-5,
maxit = 50
)
## # weights: 35
## initial value 35.989656
## iter 10 value 9.590075
## iter 20 value 0.263699
## iter 30 value 0.072545
## iter 40 value 0.065118
## iter 50 value 0.054472
## final value 0.054472
## stopped after 50 iterations
predicted.species <- predict(nn.4, test.data, type = "class")
comparison <- data.frame(actual = test.data$Species, predicted = predicted.species)
# How did we do?
table(comparison)
## predicted
## actual setosa versicolor virginica
## setosa 40 0 0
## versicolor 0 40 0
## virginica 0 11 29
Model with fewer features.
nn.5 <-
nnet(
Species ~ Petal.Length + Petal.Width,
data = train.data,
size = 4,
decay = 1e-5,
maxit = 50
)
## # weights: 27
## initial value 36.393243
## iter 10 value 11.117675
## iter 20 value 0.566856
## iter 30 value 0.127323
## iter 40 value 0.065351
## iter 50 value 0.055558
## final value 0.055558
## stopped after 50 iterations
predicted.species <- predict(nn.5, test.data, type = "class")
comparison <- data.frame(actual = test.data$Species, predicted = predicted.species)
table(comparison)
## predicted
## actual setosa versicolor virginica
## setosa 40 0 0
## versicolor 0 40 0
## virginica 0 7 33
- Dropping features made a great difference. Can we improve?
Model with fewer features and more iterations.
nn.6 <-
nnet(
Species ~ Petal.Length + Petal.Width,
data = train.data,
size = 4,
decay = 1e-5,
maxit = 100
)
## # weights: 27
## initial value 39.684126
## iter 10 value 8.099589
## iter 20 value 0.208046
## iter 30 value 0.082138
## iter 40 value 0.071433
## iter 50 value 0.066694
## iter 60 value 0.050155
## iter 70 value 0.047392
## iter 80 value 0.040535
## iter 90 value 0.036129
## iter 100 value 0.034321
## final value 0.034321
## stopped after 100 iterations
predicted.species <- predict(nn.6, test.data, type = "class")
comparison <- data.frame(actual = test.data$Species, predicted = predicted.species)
table(comparison)
## predicted
## actual setosa versicolor virginica
## setosa 40 0 0
## versicolor 0 40 0
## virginica 0 7 33