Janpu


Loading Data from UCI ML repository

A typical ML project is to build a model that learns to identify a flower’s species from that flower’s measurements.

The point of this short R script is to show you how to get and use a dataset from UCI.

The UC Irvine (UCI) Machine Learning Repository currently maintains 436 data sets as a service to the machine learning community: https://archive.ics.uci.edu/ml/index.php. The well-known iris dataset can be found at https://archive.ics.uci.edu/ml/datasets/iris. If you click on the Data Folder link in the upper-right corner, you will get the address of the dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/.

In the iris dataset, the rows are measurements of 150 iris flowers, 50 from each of three species: setosa, versicolor, and virginica. The columns are sepal length, sepal width, petal length, petal width, and species.

In the first line of the dataset, notice that the first two values (sepal length and width) are larger than the next two (petal length and width).

# The first argument is the web address of the dataset.
# The second (header=FALSE) indicates that the first row of the dataset is a row of data and does not provide the names of the columns.
# The third (col.names) is a vector that assigns the column names.
# The column names come from the Data Set Description web page.
# That page gives class as the name for the last column, but species seems to be the correct name.

iris.uci <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"),
                     header=FALSE,
                     col.names=c("sepal.length","sepal.width","petal.length","petal.width","species"))

head(iris.uci)
##   sepal.length sepal.width petal.length petal.width     species
## 1          5.1         3.5          1.4         0.2 Iris-setosa
## 2          4.9         3.0          1.4         0.2 Iris-setosa
## 3          4.7         3.2          1.3         0.2 Iris-setosa
## 4          4.6         3.1          1.5         0.2 Iris-setosa
## 5          5.0         3.6          1.4         0.2 Iris-setosa
## 6          5.4         3.9          1.7         0.4 Iris-setosa
summary(iris.uci)
##   sepal.length    sepal.width     petal.length    petal.width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##             species  
##  Iris-setosa    :50  
##  Iris-versicolor:50  
##  Iris-virginica :50  
##                      
##                      
## 
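The same read.csv() arguments can be tried without a network connection. The sketch below is an illustration, not part of the original script: it feeds the first three rows from the head() output above as inline text. The names raw and first_rows, and the step that strips the Iris- prefix, are assumptions added here.

```r
# First three rows of iris.data, copied from the head() output above.
raw <- "5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa"

# header = FALSE: the first line is data, not column names.
# col.names supplies the names that the file itself lacks.
first_rows <- read.csv(text = raw, header = FALSE,
                       col.names = c("sepal.length", "sepal.width",
                                     "petal.length", "petal.width", "species"))

# Optional (assumption): drop the "Iris-" prefix so the labels match
# the ones used in R's built-in iris data frame.
first_rows$species <- sub("^Iris-", "", as.character(first_rows$species))

# Check that sepal values exceed petal values in these rows.
sepal_longer <- all(first_rows$sepal.length > first_rows$petal.length)
sepal_wider  <- all(first_rows$sepal.width  > first_rows$petal.width)
```

Both sepal_longer and sepal_wider are TRUE for these three rows.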

Visualization of Iris Data Set

A scatterplot matrix of the iris data, colored by species:

data("iris")
# Plot the four numeric measurements against each other; color encodes the species.
pairs(iris[, 1:4], col=iris$Species)

Setup and Train the Neural Network for Iris Data

A neural network is loosely modeled on how the human brain works: it is a network of interconnected neurons that send stimulating signals to each other.

In the neural network model, each neuron is equivalent to a logistic regression unit. Neurons are organized in layers, and every neuron at layer i connects to every neuron at layer i+1 and to nothing else.
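As a rough illustration of this layered structure, the sketch below pushes one flower's four measurements through a 4-3-3 network in base R. The helper names sigmoid and layer_forward and the random weights are assumptions made for the example; they are not taken from the neuralnet package or from a trained model.

```r
# Logistic (sigmoid) activation: the same function logistic regression uses.
sigmoid <- function(z) 1 / (1 + exp(-z))

# One fully connected layer: each row of W holds the incoming weights of
# one neuron, so every input feeds every neuron in the layer.
layer_forward <- function(x, W, b) sigmoid(W %*% x + b)

x <- c(5.1, 3.5, 1.4, 0.2)  # sepal length/width, petal length/width

set.seed(1)
W1 <- matrix(rnorm(3 * 4, sd = 0.1), nrow = 3)  # input (4) -> hidden (3)
b1 <- rep(0, 3)
W2 <- matrix(rnorm(3 * 3, sd = 0.1), nrow = 3)  # hidden (3) -> output (3)
b2 <- rep(0, 3)

hidden <- layer_forward(x, W1, b1)       # activations of layer i
output <- layer_forward(hidden, W2, b2)  # activations of layer i+1
```

With untrained random weights the three outputs are meaningless; training is what moves them toward the 0/1 species indicators.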

The tuning parameters of a neural network include the number of hidden layers, the number of neurons in each layer, and the learning rate.

There are no fixed rules for setting these parameters; the right values depend heavily on the problem domain. My default choice is a single hidden layer with as many neurons as there are input variables. The number of neurons in the output layer depends on how many binary outputs need to be learned. In a classification problem, this is typically the number of possible values of the output category.

Learning happens through an iterative feedback mechanism: the error between the network's output and the training data is used to adjust the corresponding input weights. This adjustment is propagated back to the previous layers, and the learning algorithm is known as back-propagation.
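The feedback loop is easiest to see for a single logistic unit, where there are no earlier layers to propagate to. The sketch below trains one unit to separate setosa from the other two species with plain gradient descent. The learning rate of 0.1 and the 500 iterations are arbitrary choices for illustration. In a multi-layer network the same error signal is passed backward layer by layer with the chain rule, and the neuralnet package defaults to a resilient back-propagation variant, not this plain update.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Inputs: an intercept column plus the four standardized measurements.
X <- cbind(1, scale(as.matrix(iris[, 1:4])))
y <- as.numeric(iris$Species == "setosa")  # 1 for setosa, 0 otherwise

# Average cross-entropy error of the unit for weights w.
loss <- function(w) {
  p <- sigmoid(X %*% w)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

w <- rep(0, ncol(X))  # start with all weights at zero
rate <- 0.1           # learning rate (assumed value)
initial_loss <- loss(w)

for (i in 1:500) {
  p <- sigmoid(X %*% w)               # forward pass: current predictions
  grad <- t(X) %*% (p - y) / nrow(X)  # error fed back onto each input weight
  w <- w - rate * grad                # adjust weights against the error
}

final_loss <- loss(w)
accuracy <- mean((sigmoid(X %*% w) > 0.5) == (y == 1))
```

Comparing initial_loss with final_loss shows the error shrinking as the weights are adjusted.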

library(neuralnet)
## Warning: package 'neuralnet' was built under R version 3.4.4
set.seed(101)
size.sample <- 50
iristrain <- iris[sample(1:nrow(iris), size.sample),] # get a training sample from iris
nnet_iristrain <- iristrain

# Binarize the categorical output
nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == 'setosa')
nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == 'versicolor')
nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == 'virginica')

names(nnet_iristrain)[6] <- 'setosa'
names(nnet_iristrain)[7] <- 'versicolor'
names(nnet_iristrain)[8] <- 'virginica'

head(nnet_iristrain) 
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species setosa
## 56           5.7         2.8          4.5         1.3 versicolor  FALSE
## 7            4.6         3.4          1.4         0.3     setosa   TRUE
## 106          7.6         3.0          6.6         2.1  virginica  FALSE
## 97           5.7         2.9          4.2         1.3 versicolor  FALSE
## 37           5.5         3.5          1.3         0.2     setosa   TRUE
## 44           5.0         3.5          1.6         0.6     setosa   TRUE
##     versicolor virginica
## 56        TRUE     FALSE
## 7        FALSE     FALSE
## 106      FALSE      TRUE
## 97        TRUE     FALSE
## 37       FALSE     FALSE
## 44       FALSE     FALSE
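The three cbind() calls and the renaming step can also be collapsed into one pass over the factor levels. This is a sketch of an equivalent encoding, with the names train, onehot and train_onehot chosen here for illustration. Because sample() changed its default algorithm in R 3.6.0, the rows drawn may differ from the output shown above even with the same seed.

```r
set.seed(101)
train <- iris[sample(nrow(iris), 50), ]

# One logical column per species level; sapply() names the columns
# after the levels, so no separate renaming is needed.
onehot <- sapply(levels(train$Species), function(lv) train$Species == lv)

train_onehot <- cbind(train, onehot)

# Each row belongs to exactly one species.
rows_ok <- all(rowSums(onehot) == 1)
```

The resulting data frame has the same eight columns as nnet_iristrain, and the number of indicator columns equals the number of output neurons the network needs.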

Visualization of the Neural Network on Iris Data

Here is the plot of the neural network we learn.

Neural networks are very good at learning non-linear functions, and multiple outputs can be learned at the same time. However, training time is relatively long, and training is susceptible to local minimum traps. This can be mitigated by running multiple rounds of training and picking the best learned model.

nn <- neuralnet(setosa+versicolor+virginica ~ Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=nnet_iristrain, hidden=c(3))

plot(nn) 
Neural Network Model on Iris Data
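The remark about running multiple rounds and keeping the best model can be sketched with the rep argument of neuralnet(), which repeats training from different random starting weights. The choice of five repetitions is arbitrary, and the block rebuilds its own training sample with 0/1 indicator columns so that it runs on its own; the names train, nn_multi, errors and best_rep are assumptions for this example.

```r
library(neuralnet)

set.seed(101)
train <- iris[sample(nrow(iris), 50), ]

# One 0/1 indicator column per species.
for (lv in levels(train$Species)) {
  train[[lv]] <- as.numeric(train$Species == lv)
}

# rep = 5: train five networks, each from fresh random weights.
nn_multi <- neuralnet(setosa + versicolor + virginica ~
                        Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                      data = train, hidden = 3, rep = 5)

# result.matrix has one column per repetition that converged;
# its "error" row holds the final training error of each.
errors <- nn_multi$result.matrix["error", ]
best_rep <- which.min(errors)
best_error <- unname(errors[best_rep])
```

Calling plot(nn_multi, rep = "best") then draws only the repetition with the lowest error. Repetitions that fail to converge within stepmax should be missing from result.matrix, so errors may hold fewer than five values.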
