(Optional) References

  1. Machine Learning Course by Andrew Ng:
    https://www.coursera.org/learn/machine-learning
  2. Deep Learning Specialization by Andrew Ng:
    https://www.coursera.org/specializations/deep-learning
  3. Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville:
    http://www.deeplearningbook.org/

Packages

library(nnet)                                        # Feed-forward neural networks with one hidden layer
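
If the nnet package is not yet installed, it can be obtained from CRAN first (a one-time step):

install.packages("nnet")                             # Install the package (only needed once)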

Artificial Neural Network

The famous Fisher’s iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are iris setosa, versicolor, and virginica.

data(iris)                                           # data: load specified data sets
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
X <- iris[,1:4]
Y <- (iris[,5] == "setosa")*1 + (iris[,5] == "versicolor")*2 + (iris[,5] == "virginica")*3
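
As a quick sanity check (not part of the original output), the numeric coding of Y can be cross-tabulated against the species factor; setosa, versicolor and virginica should appear as 1, 2 and 3 respectively.

table(Y, iris$Species)                               # Check the 1/2/3 coding of the species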

Linear Output

# 4-2-1 Neural Network
iris.nn <- nnet(X, Y, size = 2, linout = T)          # 2 units in hidden layer; linear output
# weights:  13
initial  value 701.773203 
iter  10 value 8.108658
iter  20 value 5.732022
iter  30 value 5.387146
iter  40 value 5.033249
iter  50 value 4.847241
final  value 4.847189 
converged
summary(iris.nn)                                     # Summary of output
a 4-2-1 network with 13 weights
options were - linear output units 
 b->h1 i1->h1 i2->h1 i3->h1 i4->h1 
 -0.66  -2.22  -1.40  -1.06  -0.55 
 b->h2 i1->h2 i2->h2 i3->h2 i4->h2 
 -1.27  -0.32  -0.71   0.71   1.44 
 b->o h1->o h2->o 
 0.96  0.38  2.47 

The result is summarized as: \[\begin{aligned} h_1&=-0.66+(-2.22)x_1+(-1.4)x_2+(-1.06)x_3+(-0.55)x_4\\ h_2&=-1.27+(-0.32)x_1+(-0.71)x_2+(0.71)x_3+(1.44)x_4\\ h_1'&=\frac{\text{exp}(h_1)}{1+\text{exp}(h_1)} \\ h_2'&=\frac{\text{exp}(h_2)}{1+\text{exp}(h_2)}\\ v&=0.96+(0.38)h_1'+(2.47)h_2' \end{aligned}\]
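
As a check on the equations above, the fitted values can be reproduced directly from the weight vector iris.nn$wts, which stores the 13 weights in the same order as summary() prints them (b->h1, i1->h1, ..., h2->o). This is only a sketch: the weights depend on the random initial values, so the numbers will differ between runs.

sigmoid <- function(z) exp(z)/(1 + exp(z))           # Logistic activation of the hidden units
w  <- iris.nn$wts                                    # b->h1, i1:i4->h1, b->h2, i1:i4->h2, b->o, h1->o, h2->o
h1 <- sigmoid(w[1] + as.matrix(X) %*% w[2:5])        # Hidden unit 1 after activation (h1' above)
h2 <- sigmoid(w[6] + as.matrix(X) %*% w[7:10])       # Hidden unit 2 after activation (h2' above)
v  <- w[11] + w[12]*h1 + w[13]*h2                    # Linear output unit
max(abs(v - iris.nn$fitted.values))                  # Should be essentially zero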

pred <- round(iris.nn$fit)                           # Round the fitted values
table(iris[,5], levels(iris$Species)[pred])          # Classification table
            
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         46         4
  virginica       0          0        50
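
From the classification table, 4 of the 150 training observations are misclassified, so the training error rate is 4/150:

mean(levels(iris$Species)[pred] != iris[,5])         # Training error rate (4/150 for the fit above)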

Improved Version

To avoid the parameter estimates being trapped at a local minimum of the error function, we can run nnet() several times from different sets of random initial weights and keep the fit with the smallest value of the error function, which is hopefully the global minimum.

# Try nnet(x, y) several times (try trials) and output the best fit
# x: data frame of input variables
# y: output variable; y must be a factor if linout = FALSE is used

ann <- function(x, y, size, maxit = 100, linout = FALSE, try = 5, ...) {
    ann1 <- nnet(y ~ ., data = x, size = size, maxit = maxit, linout = linout, ...)
    v1 <- ann1$value                                 # Error-function value of the first trial

    for (i in seq_len(try - 1)) {                    # Remaining trials (handles try = 1 correctly)
        ann2 <- nnet(y ~ ., data = x, size = size, maxit = maxit, linout = linout, ...)
        if (ann2$value < v1) {                       # Keep the trial with the smallest value
            v1 <- ann2$value
            ann1 <- ann2
        }
    }
    return(ann1)
}
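
For example, the iris network above can be refitted with ten random restarts (a sketch; set.seed is used only to make the random initial weights reproducible, and the resulting weights will generally differ from the single run shown earlier):

set.seed(1)                                          # Reproducible random initial weights
iris.ann <- ann(X, Y, size = 2, linout = TRUE, try = 10)
iris.ann$value                                       # Smallest error-function value among the 10 trials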

Logistic Output

The csv file fin-ratio.csv contains financial ratios of 680 securities listed on the main board of the Hong Kong Stock Exchange in 2002. There are six financial variables, namely Earning Yield (EY), Cash Flow to Price (CFTP), logarithm of Market Value (ln MV), Dividend Yield (DY), Book to Market Equity (BTME), and Debt to Equity Ratio (DTE). Among these companies, 32 are blue chips, i.e. Hang Seng Index constituent stocks. The last column, HSI, is a binary variable indicating whether the stock is a blue chip or not.

d <- read.csv("./../Dataset/fin-ratio.csv")          

Y <- as.factor(d$HSI)                                # Output: Y

var <- names(d)[!names(d) %in% "HSI"]                # Exclude HSI
X <- d[,var]                                         # Input:  X
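
Before fitting, it is worth confirming the class balance described above: out of 680 stocks, 32 are HSI constituents, so the table should show 648 zeros and 32 ones.

table(Y)                                             # Class balance of the output variable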

# Default linout = FALSE: logistic output units (iteration log suppressed with results = 'hide')
fin.nn <- ann(X, Y, size = 2, maxit = 200, try = 10)
summary(fin.nn)
a 6-2-1 network with 17 weights
options were - entropy fitting 
  b->h1  i1->h1  i2->h1  i3->h1  i4->h1  i5->h1  i6->h1 
  -7.81   19.86   16.49    4.20   -2.62   -3.93   21.31 
  b->h2  i1->h2  i2->h2  i3->h2  i4->h2  i5->h2  i6->h2 
3935.72 -147.31  182.22 -407.57  -11.03  -83.71   29.85 
   b->o   h1->o   h2->o 
  -1.38    4.92 -337.07 
fin.nn$value                                         # Display the best value
[1] 7.117396
Prediction <- round(fin.nn$fit)
Reference <- d$HSI                                   # Ground-truth labels

table(Prediction, Reference)                         # Classification table
          Reference
Prediction   0   1
         0 647   1
         1   1  31

Measure of Performance

Note that: \[\begin{aligned} \text{Accuracy}&=\frac{\text{True Positive}+\text{True Negative}}{\text{Total No. of Observations}}=\frac{TP+TN}{TP+TN+FP+FN}\\ \\ \text{Precision}&=\frac{\text{True Positive}}{\text{True Positive}+\text{False Positive}}=\frac{TP}{TP+FP}\\ \\ \text{Recall}&=\frac{\text{True Positive}}{\text{True Positive}+\text{False Negative}}=\frac{TP}{TP+FN} \\ \\ F_1&=\bigg(\frac{\text{Recall}^{-1}+\text{Precision}^{-1}}{2}\bigg)^{-1}=2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}\end{aligned}\]

(Accuracy <- sum((Prediction == Reference))/length(Prediction))
[1] 0.9970588
(Precision <- sum(Prediction == 1 & Reference == 1)/sum(Prediction == 1))
[1] 0.96875
(Recall <- sum(Prediction == 1 & Reference == 1)/sum(Reference == 1))
[1] 0.96875
(F1 <- 1/((1/Precision + 1/Recall)/2))
[1] 0.96875
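
For convenience, these measures can be wrapped into a small helper function that works on any pair of 0/1 prediction and reference vectors (a sketch following the formulas above):

perf <- function(pred, ref) {                        # pred, ref: 0/1 vectors of the same length
    TP <- sum(pred == 1 & ref == 1)
    TN <- sum(pred == 0 & ref == 0)
    FP <- sum(pred == 1 & ref == 0)
    FN <- sum(pred == 0 & ref == 1)
    Precision <- TP/(TP + FP)
    Recall    <- TP/(TP + FN)
    c(Accuracy  = (TP + TN)/(TP + TN + FP + FN),
      Precision = Precision,
      Recall    = Recall,
      F1        = 2*Precision*Recall/(Precision + Recall))
}
perf(Prediction, Reference)                          # Same four values as computed above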

Remark

The training error rate does not accurately reflect how well the classifier performs on unseen data. A simple remedy is to randomly choose some observations as training data and keep the remaining observations as testing data, then fit the model on the training set and evaluate it on the testing set, as sketched below.
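
A minimal sketch of such a split for the fin-ratio data (the random seed and the 70/30 split ratio are arbitrary choices, not part of the original analysis):

set.seed(123)                                        # Reproducible split
id <- sample(nrow(d), round(0.7*nrow(d)))            # 70% of the observations for training
fit.tr  <- ann(X[id,], Y[id], size = 2, maxit = 200, try = 10)
pred.te <- round(predict(fit.tr, newdata = X[-id,])) # Predicted class on the testing data
table(Prediction = as.vector(pred.te), Reference = d$HSI[-id])   # Testing classification table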