Artificial Neural Network

In machine learning and cognitive science, Artificial Neural Networks (ANNs) are a family of statistical learning models inspired by biological neural networks (the central nervous systems of animals, in particular the brain). They are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown (see https://en.wikipedia.org/wiki/Artificial_neural_network).

The motivation for, and benefit of, ANNs is that they allow the modeling of highly complex relationships between inputs/features and response variable(s), especially when those relationships are highly nonlinear. No underlying assumptions are required to create and evaluate the model, and it can be used with both qualitative and quantitative responses.
Neural Network Architecture

  1. A neural network is made of layers, each with many interconnected nodes (neurons)
  2. There are three main layers:
     - Input layer: receives the input features
     - Hidden layer: there can be one or more hidden layers, and accordingly the network is called shallow or deep
     - Output layer: produces the prediction/response

ANN - Neuron

  1. A neuron is an information processing unit that is fundamental to the operation of a neural network
  2. Three basic elements of the neuron model (illustrated in the sketch below):
     - A set of synapses (connecting links), each characterized by a weight
     - An adder (summing junction) that sums the weighted inputs
     - An activation function that limits the amplitude of the neuron's output
  3. An external bias input is used to increase or lower the net input to the activation function
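
The sketch below makes these elements concrete for a single neuron in base R. It is a minimal illustration only: the input values, weights, and bias are made up, not taken from the shuttle example used later.

# A single neuron: synaptic weights, a summing junction (adder), a bias, and an activation function
x <- c(0.5, -1.2, 0.3)      # inputs (illustrative values)
w <- c(0.8,  0.4, -0.6)     # synaptic weights, one per input
b <- 0.1                    # external bias term
v <- sum(w * x) + b         # summing junction: weighted sum of inputs plus bias
y <- 1 / (1 + exp(-v))      # activation function (sigmoid) produces the output
y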

How does a Neuron Work

The activation function is applied to the weighted sum of a neuron's inputs to produce the output.
Different types of activation function include the following (a short R sketch follows the list):
- Heaviside: when the output is 0 or 1
- Sigmoid (Logistic): when the output is continuous between 0 and 1
- Softmax (Generalized Logistic): when the outputs are between 0 and 1 and sum to 1
- Hyperbolic Tangent: when the output is continuous between -1 and 1
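
These functions are straightforward to write in base R. The following is a quick sketch; the input vector z is an arbitrary set of pre-activation values chosen for illustration.

z <- c(-2, -0.5, 0, 0.5, 2)                    # arbitrary pre-activation values
heaviside <- function(z) ifelse(z >= 0, 1, 0)  # output is 0 or 1
sigmoid   <- function(z) 1 / (1 + exp(-z))     # output between 0 and 1
softmax   <- function(z) exp(z) / sum(exp(z))  # outputs between 0 and 1 that sum to 1
heaviside(z); sigmoid(z); softmax(z); tanh(z)  # tanh() is built into R: output between -1 and 1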


Synaptic Weights

The synaptic weights of the neurons are determined by the neural network's learning process (the learning algorithm).

The most common measure of the error (cost function) is the squared error, E = (y - d)^2, where y is the network's output and d is the desired (target) output; averaged over the observations this gives the mean squared error.
Iterating this process of presenting the network with an input and updating the network's weights to reduce the error is how the network is trained.

To complete the cycle, known as an epoch, backpropagation is performed to train the model based on what was learned. To initiate backpropagation, an error is determined from a loss function such as the sum of squared errors or cross-entropy, among others. As the weights were set to initial random values, the initial error may be high. Working backward through the network, the weights are adjusted to minimize the error in the loss function.

This completes one epoch. The process then repeats, using gradient descent, until the algorithm converges to the minimum error or reaches a prespecified number of epochs.
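
As a toy illustration of one such training loop, the sketch below fits a single sigmoid neuron by gradient descent on a made-up data set. The data, learning rate, starting weights, and number of epochs are all arbitrary choices for illustration, not part of the shuttle example that follows.

set.seed(1)
X <- matrix(rnorm(20), ncol = 2)         # 10 made-up observations with 2 inputs
d <- ifelse(X[, 1] + X[, 2] > 0, 1, 0)   # desired (target) output
w <- runif(2, -0.5, 0.5)                 # random initial weights...
b <- 0                                   # ...and bias
lr <- 0.5                                # learning rate

for (epoch in 1:100) {
  y    <- 1 / (1 + exp(-(X %*% w + b)))  # forward pass: weighted sum + sigmoid activation
  grad <- (y - d) * y * (1 - y)          # error signal passed back through the sigmoid
  w    <- w - lr * colMeans(X * as.vector(grad))  # update weights to reduce the error
  b    <- b - lr * mean(grad)                     # update the bias
}

y <- 1 / (1 + exp(-(X %*% w + b)))       # final forward pass
mean((y - d)^2)                          # mean squared error after training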

Packages required

#install.packages("caret",repos = "http://cran.us.r-project.org")  
#install.packages("neuralnet",repos = "http://cran.us.r-project.org")  
#install.packages("vcd",repos = "http://cran.us.r-project.org")  
library(MASS)       #for the shuttle data set
library(caret)      #for data preparation
library(neuralnet)  #for NN model
library(vcd)        #for data visualization

Data Understanding

data(shuttle)
str(shuttle)
## 'data.frame':    256 obs. of  7 variables:
##  $ stability: Factor w/ 2 levels "stab","xstab": 2 2 2 2 2 2 2 2 2 2 ...
##  $ error    : Factor w/ 4 levels "LX","MM","SS",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sign     : Factor w/ 2 levels "nn","pp": 2 2 2 2 2 2 1 1 1 1 ...
##  $ wind     : Factor w/ 2 levels "head","tail": 1 1 1 2 2 2 1 1 1 2 ...
##  $ magn     : Factor w/ 4 levels "Light","Medium",..: 1 2 4 1 2 4 1 2 4 1 ...
##  $ vis      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ use      : Factor w/ 2 levels "auto","noauto": 1 1 1 1 1 1 1 1 1 1 ...

The data consists of 256 observations and 7 variables. Notice that all of the variables are categorical and the response is use with two levels, auto and noauto.
Data definition:
> stability: This is stable positioning or not (stab/xstab)
> error: This is the size of the error (LX/MM/SS/XL)
> sign: This is the sign of the error, positive or negative (pp/nn)
> wind: This is the wind direction (head/tail)
> magn: This is the wind strength (Light/Medium/Strong/Out of Range)
> vis: This is the visibility (yes/no)

table(shuttle$use)
## 
##   auto noauto 
##    145    111

The vcd package offers a number of table and plotting functions. One is structable(). This function will take a formula (column1 + column2 ~ column3), where column3 becomes the rows in the table.

table1 <- structable(wind + magn ~ use, shuttle)
table1
##        wind  head                    tail                  
##        magn Light Medium Out Strong Light Medium Out Strong
## use                                                        
## auto           19     19  16     18    19     19  16     19
## noauto         13     13  16     14    13     13  16     13
mosaic(table1, gp=shading_hcl)  #provides p-value for a chi-squared test

The p-value is not significant, so we fail to reject the hypothesis that the variables are independent, which means that knowing the levels of wind and/or magn does not help us predict the use of the autolander.

mosaic(use~error+vis, shuttle)

The shading of the table has changed, reflecting the rejection of the null hypothesis and dependence between the variables. The plot first splits on visibility: if the visibility is no, then the autolander is used. The next split is horizontal, by error: if error is SS or MM when vis is no, then the autolander is recommended; otherwise, it is not. A p-value is not necessary, as the shading indicates significance.

#The Chi-square test is applied to determine whether there is a significant association between the two categorical variables from the same population.
chisq.test(shuttle$use,shuttle$stability)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  shuttle$use and shuttle$stability
## X-squared = 4.0718, df = 1, p-value = 0.0436
chisq.test(shuttle$use,shuttle$error)
## 
##  Pearson's Chi-squared test
## 
## data:  shuttle$use and shuttle$error
## X-squared = 6.1555, df = 3, p-value = 0.1043
chisq.test(shuttle$use,shuttle$sign)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  shuttle$use and shuttle$sign
## X-squared = 0.25449, df = 1, p-value = 0.6139
chisq.test(shuttle$use,shuttle$wind)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  shuttle$use and shuttle$wind
## X-squared = 0, df = 1, p-value = 1
chisq.test(shuttle$use,shuttle$magn)
## 
##  Pearson's Chi-squared test
## 
## data:  shuttle$use and shuttle$magn
## X-squared = 1.5747, df = 3, p-value = 0.6652
chisq.test(shuttle$use,shuttle$vis)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  shuttle$use and shuttle$vis
## X-squared = 192.46, df = 1, p-value < 2.2e-16

Data Preparation

Preparing the data for a neural network is very important, as all the covariates and responses need to be numeric. In our case, all of the variables are categorical. The caret package allows us to quickly create dummy variables as our input features.

dummies<-dummyVars(use~.,shuttle, fullRank = TRUE)
dummies
## Dummy Variable Object
## 
## Formula: use ~ .
## 7 variables, 7 factors
## Variables and levels will be separated by '.'
## A full rank encoding is used

To put this into a data frame, we apply predict() to the dummies object with a data set (either the same one or a new one) and wrap the result in as.data.frame().

shuttle.2 <- as.data.frame(predict(dummies,newdata = shuttle))
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev
## = object$lvls): variable 'use' is not a factor
names(shuttle.2)
##  [1] "stability.xstab" "error.MM"        "error.SS"       
##  [4] "error.XL"        "sign.pp"         "wind.tail"      
##  [7] "magn.Medium"     "magn.Out"        "magn.Strong"    
## [10] "vis.yes"
head(shuttle.2)
##   stability.xstab error.MM error.SS error.XL sign.pp wind.tail magn.Medium
## 1               1        0        0        0       1         0           0
## 2               1        0        0        0       1         0           1
## 3               1        0        0        0       1         0           0
## 4               1        0        0        0       1         1           0
## 5               1        0        0        0       1         1           1
## 6               1        0        0        0       1         1           0
##   magn.Out magn.Strong vis.yes
## 1        0           0       0
## 2        0           0       0
## 3        0           1       0
## 4        0           0       0
## 5        0           0       0
## 6        0           1       0

We now have an input feature space of ten variables.
Stability is now either 0 for stab or 1 for xstab.
The base level of error is LX, and three dummy variables represent the other categories.

The response can be created using the ifelse() function:

shuttle.2$use <- ifelse(shuttle$use=="auto",1,0)
table (shuttle.2$use)
## 
##   0   1 
## 111 145

Create Training and Test Datasets

set.seed(123)
trainIndex <- createDataPartition(shuttle.2$use, p = 0.7, list = FALSE, times = 1)
shuttleTrain <- shuttle.2[trainIndex,]
table(shuttleTrain$use)
## 
##  0  1 
## 81 99
shuttleTest <- shuttle.2[-trainIndex,]
table(shuttleTest$use)
## 
##  0  1 
## 30 46

Modeling using “neuralnet()”

The package that we will use is neuralnet. Its fitting function takes a formula, as we have used elsewhere, such as y ~ x1 + x2 + x3 + x4 with data = df. In the past, we used y ~ . to specify all the other variables in the data as inputs; however, neuralnet does not accommodate this shortcut. The way around this limitation is to use the as.formula() function: after first creating an object of the variable names, we use it to paste the variables properly on the right-hand side of the formula.

n <- names(shuttleTrain)
form <- as.formula(paste("use~", paste(n[!n %in% "use"], collapse = "+")))
form
## use ~ stability.xstab + error.MM + error.SS + error.XL + sign.pp + 
##     wind.tail + magn.Medium + magn.Out + magn.Strong + vis.yes

Keep this function in mind for your own use as it may come in quite handy.

In the neuralnet package, the function that we will use is appropriately named neuralnet(). Other than the formula, there are four other critical arguments that we will need to examine:
1. hidden: This is the number of hidden neurons in each layer, which can be up to three layers; the default is 1
2. act.fct: This is the activation function with the default logistic and tanh available
3. err.fct: This is the function used to calculate the error with the default “sse”; as we are dealing with binary outcomes, we will use “ce” for cross-entropy
4. linear.output: This is a logical argument on whether or not to ignore act.fct with the default TRUE, so for our data, this will need to be FALSE

Other arguments include:
5. lifesign: a string specifying how much the function prints during the calculation of the neural network ('none', 'minimal', or 'full')
6. lifesign.step: an integer specifying the step size for printing the minimal threshold in 'full' lifesign mode
7. threshold: a numeric value specifying the threshold for the partial derivatives of the error function, used as the stopping criterion
8. stepmax: the maximum number of steps for training the neural network; reaching this maximum stops the training process
9. algorithm: a string naming the algorithm used to calculate the neural network; the possible types are 'backprop', 'rprop+', 'rprop-', 'sag', and 'slr'. The default is resilient backpropagation ('rprop+'), and we will use it along with the default of one hidden neuron. An illustrative call that sets several of these arguments explicitly is sketched below, before the actual fit.
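
For orientation only, a call that spells out several of these arguments might look like the following. The values chosen for lifesign, lifesign.step, and stepmax here are arbitrary examples (threshold = 0.01 and algorithm = "rprop+" are the package defaults); this is not the model fit used below.

fit_verbose <- neuralnet(form,
                         data = shuttleTrain,
                         hidden = 1,              # one hidden neuron (the default)
                         act.fct = "logistic",    # logistic activation (the default)
                         err.fct = "ce",          # cross-entropy error
                         linear.output = FALSE,   # do not ignore the activation function
                         lifesign = "minimal",    # print some progress while training
                         lifesign.step = 1000,    # print step size in 'full' lifesign mode
                         stepmax = 1e5,           # stop after at most 100,000 steps
                         threshold = 0.01,        # stopping threshold (default)
                         algorithm = "rprop+")    # resilient backpropagation (default)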

fit <- neuralnet(form, 
                data = shuttleTrain,
                err.fct = "ce",
                linear.output = FALSE
                )  

fit$result.matrix
##                                            1
## error                         0.009928587504
## reached.threshold             0.009905188403
## steps                       660.000000000000
## Intercept.to.1layhid1        -4.392654985479
## stability.xstab.to.1layhid1   1.957595172393
## error.MM.to.1layhid1         -1.596634090134
## error.SS.to.1layhid1         -2.519372079568
## error.XL.to.1layhid1         -0.371734253789
## sign.pp.to.1layhid1          -0.863963659357
## wind.tail.to.1layhid1         0.102077456260
## magn.Medium.to.1layhid1      -0.018170137582
## magn.Out.to.1layhid1          1.886928834123
## magn.Strong.to.1layhid1       0.140129588700
## vis.yes.to.1layhid1           6.209014123244
## Intercept.to.use             30.721652703205
## 1layhid.1.to.use            -65.084168998463
plot(fit, rep = "best")

Model Performance

## Assigning the Probabilities to Train dataset
shuttleTrain$Prob <- fit$net.result[[1]]

## The distribution of the estimated probabilities
quantile(shuttleTrain$Prob,c(0,1,5,10,25,50,75,90,95,98,99,100)/100)
##                         0%                         1% 
## 0.000000000000001494708979 0.000000000000001861635561 
##                         5%                        10% 
## 0.000000000000004575896810 0.000000000000011112555051 
##                        25%                        50% 
## 0.000000000065330068666304 0.999907798987664686585219 
##                        75%                        90% 
## 0.999999999999899635838574 0.999999999999945043960281 
##                        95%                        98% 
## 0.999999999999951150186916 0.999999999999953037566058 
##                        99%                       100% 
## 0.999999999999953148588361 0.999999999999953370632966
## Deciling code: assign each observation to a decile (1-10) of its estimated probability
decile <- function(x){
  deciles <- vector(length=10)
  for (i in seq(0.1,1,.1)){
    deciles[i*10] <- quantile(x, i, na.rm=T)
  }
  return (
    ifelse(x<deciles[1], 1,
    ifelse(x<deciles[2], 2,
    ifelse(x<deciles[3], 3,
    ifelse(x<deciles[4], 4,
    ifelse(x<deciles[5], 5,
    ifelse(x<deciles[6], 6,
    ifelse(x<deciles[7], 7,
    ifelse(x<deciles[8], 8,
    ifelse(x<deciles[9], 9, 10
    ))))))))))
}

## deciling
shuttleTrain$deciles <- decile(shuttleTrain$Prob)


## Ranking code: rank-order the deciles and compute response rates, cumulative captures, and the KS statistic
##install.packages("data.table")
library(data.table)
library(scales)

tmp_DT = data.table(shuttleTrain)

rank <- tmp_DT[, list(
  cnt = length(use), 
  cnt_resp = sum(use), 
  cnt_non_resp = sum(use == 0)) , 
  by=deciles][order(-deciles)]

rank$rrate <- round (rank$cnt_resp / rank$cnt,2);

rank$cum_resp <- cumsum(rank$cnt_resp)

rank$cum_non_resp <- cumsum(rank$cnt_non_resp)

rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);

rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),2);

rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp);

rank$rrate <- percent(rank$rrate)

rank$cum_rel_resp <- percent(rank$cum_rel_resp)

rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)

head(rank, 10)
##     deciles cnt cnt_resp cnt_non_resp rrate cum_resp cum_non_resp
##  1:      10  18       18            0  100%       18            0
##  2:       9  18       18            0  100%       36            0
##  3:       8  18       18            0  100%       54            0
##  4:       7  18       18            0  100%       72            0
##  5:       6  18       18            0  100%       90            0
##  6:       5  18        9            9   50%       99            9
##  7:       4  18        0           18    0%       99           27
##  8:       3  18        0           18    0%       99           45
##  9:       2  18        0           18    0%       99           63
## 10:       1  18        0           18    0%       99           81
##     cum_rel_resp cum_rel_non_resp   ks
##  1:          18%               0% 0.18
##  2:          36%               0% 0.36
##  3:          55%               0% 0.55
##  4:          73%               0% 0.73
##  5:          91%               0% 0.91
##  6:         100%              11% 0.89
##  7:         100%              33% 0.67
##  8:         100%              56% 0.44
##  9:         100%              78% 0.22
## 10:         100%             100% 0.00

We will use the compute() function, specifying the fitted model and the covariates. This syntax will be the same for the predictions on the train and test sets. Once computed, the predictions are extracted from the result with $net.result.

res <- compute(fit, shuttleTrain[,1:10])
predTrain <- res$net.result
# These results are probabilities; turn them into 0 or 1.
predTrain <- ifelse(predTrain >=0.5,1,0)
table(predTrain, shuttleTrain$use)
##          
## predTrain  0  1
##         0 81  0
##         1  0 99

The neural network model has achieved 100 percent accuracy. We will now see how it does on the test set.

res2 <- compute (fit, shuttleTest[,1:10])
predTest <- res2$net.result
predTest <- ifelse (predTest >=0.5,1,0)
table(predTest, shuttleTest$use)
##         
## predTest  0  1
##        0 29  0
##        1  1 46

There is only one false positive in the test data set. To identify which observation it was, use the which() function to single it out.

which(predTest==1 & shuttleTest$use==0)
## [1] 62
shuttleTest[62,]
##     stability.xstab error.MM error.SS error.XL sign.pp wind.tail
## 203               0        1        0        0       1         0
##     magn.Medium magn.Out magn.Strong vis.yes use
## 203           0        0           1       1   0

It is row 62 in the test set and observation 203 in the full dataset.
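
As a single summary number, the test-set accuracy can be computed directly from the predictions (a quick check using the objects created above):

mean(predTest == shuttleTest$use)   # 75 of 76 correct, roughly 0.987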

Can we improve on this result and achieve 100 percent accuracy in the test set? To do this, we will increase the complexity by specifying three neurons in the first layer and two neurons in the second layer.

fit2 = neuralnet(form, 
                 data=shuttleTrain, 
                 hidden=c(3,2), 
                 err.fct="ce",
                 linear.output=FALSE
                 )

The plot results now get quite busy and extremely hard to interpret.

plot(fit2, rep = "best")

Check the performance of the new model on the training and test data sets.

#Train data set
res <- compute (fit2, shuttleTrain[,1:10])
predTrain <- res$net.result
predTrain <- ifelse(predTrain>=0.5,1,0)
table(predTrain, shuttleTrain$use)
##          
## predTrain  0  1
##         0 81  0
##         1  0 99
#Test data set
res2 <- compute (fit2, shuttleTest[,1:10])
predTest <- res2$net.result
predTest <- ifelse(predTest>=0.5,1,0)
table(predTest, shuttleTest$use)
##         
## predTest  0  1
##        0 29  0
##        1  1 46
which(predTest==1 & shuttleTest$use==0)
## [1] 62
which(predTest==0 & shuttleTest$use==1)
## integer(0)

The false positive observation has not changed from the prior fit. In this case, adding complexity did not improve the test set's performance; if anything, the more complex model is less generalizable than the simpler one.

Summary

A neural network is constructed from an interconnected group of nodes, each involving inputs, connection weights, a processing element, and an output. Neural networks can be applied to many areas, such as classification, clustering, and prediction. To train a neural network in R, you can use the neuralnet package, which is built to train multilayer perceptrons in the context of regression analysis and contains many flexible functions for training feed-forward neural networks.