Janpu
One typical ML project is to develop a mechanism that can learn to use an individual flower’s measurements to identify that flower’s species.
The point of this short R script is to show you how to get and use a dataset from UCI.
The UC Irvine Machine Learning Repository currently maintains 436 data sets as a service to the machine learning community: https://archive.ics.uci.edu/ml/index.php. The well-known iris dataset can be found at https://archive.ics.uci.edu/ml/datasets/iris. If you click on Data Folder in the upper right corner, you will get the address of the dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/.
In the iris dataset, each row holds the measurements of one of 150 iris flowers, 50 from each of three species: setosa, versicolor, and virginica. The columns are sepal length, sepal width, petal length, petal width, and species.
In the first line of the dataset (5.1, 3.5, 1.4, 0.2, Iris-setosa), notice that the first two values (sepal length and width) are larger than the second two (petal length and width).
# The first argument is the web address of the dataset.
# The second indicates that the first row of the dataset is a row of data and does not provide the names of the columns.
# The third argument is a vector that assigns the column names.
# The column names come from the Data Set Description web page.
# That page names the last column class, but species describes its contents more precisely.
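# stringsAsFactors=TRUE keeps species as a factor in R 4.0 and later, where the default changed to FALSE.
iris.uci <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header=FALSE, stringsAsFactors=TRUE, col.names=c("sepal.length","sepal.width","petal.length","petal.width","species"))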
head(iris.uci)
## sepal.length sepal.width petal.length petal.width species
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
summary(iris.uci)
## sepal.length sepal.width petal.length petal.width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## species
## Iris-setosa :50
## Iris-versicolor:50
## Iris-virginica :50
##
##
##
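The UCI file labels the species with an Iris- prefix, and the column names chosen above are lowercase, whereas R's built-in iris dataset (used in the rest of this script) has capitalized column names and bare species names. The following is a minimal sketch of one way to line the two up. It assumes you want the built-in naming; the name uci is a hypothetical stand-in for iris.uci, built from three rows of the head() output so that it runs without a network connection.

```r
# Three rows copied from the head() output above, in the same comma-separated
# layout as iris.data, so the example runs offline.
sample_rows <- "5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa"

# uci is a hypothetical stand-in for iris.uci.
uci <- read.csv(text = sample_rows, header = FALSE, stringsAsFactors = FALSE,
                col.names = c("sepal.length", "sepal.width", "petal.length",
                              "petal.width", "species"))

# Drop the "Iris-" prefix and reuse the built-in dataset's factor levels.
uci$species <- factor(sub("^Iris-", "", uci$species),
                      levels = levels(iris$Species))

# Rename the columns to match the built-in dataset.
names(uci) <- names(iris)

str(uci)
```

The same three statements should apply unchanged to the full iris.uci data frame.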
The built-in iris dataset can be plotted as a scatterplot matrix, colored by species:
data("iris")
pairs(Species~., data=iris, col=iris$Species)
A neural network is loosely modeled on the human brain: a network of interconnected neurons that send stimulating signals to each other.
In the neural network model, each neuron is equivalent to a logistic regression unit. Neurons are organized in layers, where every neuron at layer i connects to every neuron at layer i+1 and to nothing else.
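To make the "logistic regression unit" idea concrete, here is a small sketch of a forward pass through one hidden layer and one output layer. The function names (sigmoid, neuron, layer) and the random weights are hypothetical and chosen only for illustration. It also assumes a logistic activation on every neuron, as described above; the neuralnet package used later defaults to a linear output layer.

```r
# Logistic (sigmoid) activation: squashes any real number into (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# One neuron: a weighted sum of the inputs plus a bias, passed through the
# sigmoid. This has the same form as a logistic regression prediction.
neuron <- function(x, w, b) sigmoid(sum(w * x) + b)

# One fully connected layer: row j of W holds the weights of neuron j,
# so every neuron sees every output of the previous layer.
layer <- function(x, W, b) sigmoid(as.vector(W %*% x) + b)

x <- c(5.1, 3.5, 1.4, 0.2)   # the four measurements of the first iris flower

set.seed(1)
W_hidden <- matrix(rnorm(3 * 4, sd = 0.1), nrow = 3)  # 3 hidden neurons, 4 inputs
b_hidden <- rep(0, 3)
W_out <- matrix(rnorm(3 * 3, sd = 0.1), nrow = 3)     # 3 output neurons, 3 hidden inputs
b_out <- rep(0, 3)

hidden <- layer(x, W_hidden, b_hidden)     # layer i
output <- layer(hidden, W_out, b_out)      # layer i+1 sees only layer i
print(output)
```

Each layer is just a matrix of weights, which is why "every neuron at layer i connects to every neuron at layer i+1" reduces to one matrix product per layer.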
The tuning parameters of a neural network include the number of hidden layers, the number of neurons in each layer, and the learning rate.
There are no fixed rules for setting these parameters; the right values depend heavily on the problem domain. My default choice is to use a single hidden layer and set the number of neurons equal to the number of input variables. The number of neurons at the output layer depends on how many binary outputs need to be learned. In a classification problem, this is typically the number of possible values of the output category.
Learning happens via an iterative feedback mechanism in which the error on the training data output is used to adjust the corresponding input weights. This adjustment is propagated back to the previous layers, and the learning algorithm is known as back-propagation.
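The feedback loop can be sketched for the simplest possible case: a single logistic neuron that learns to separate setosa from the other two species. This is plain gradient descent on a sum-of-squares error, not the algorithm the neuralnet package runs by default, and the learning rate and iteration count are arbitrary assumptions. With more layers, back-propagation applies the chain rule to pass the same error signal on to the earlier layers.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Inputs: a column of 1s for the bias, then the four standardized measurements.
X <- cbind(1, scale(as.matrix(iris[, 1:4])))
y <- as.numeric(iris$Species == "setosa")   # binary target: 1 for setosa

w <- rep(0, ncol(X))   # start with all weights at zero
rate <- 0.1            # learning rate (assumed value)

# Half the sum of squared errors between targets and neuron outputs.
sse <- function(w) sum((y - sigmoid(X %*% w))^2) / 2
error_before <- sse(w)

for (i in 1:200) {
  p <- as.vector(sigmoid(X %*% w))              # forward pass
  delta <- (y - p) * p * (1 - p)                # output error times sigmoid slope
  grad <- -as.vector(t(X) %*% delta) / nrow(X)  # mean gradient of the error
  w <- w - rate * grad                          # move the weights downhill
}

error_after <- sse(w)
cat("error before:", error_before, " error after:", error_after, "\n")
```

The weights attached to inputs that contributed most to the error receive the largest corrections, which is the "adjust the corresponding weights" step described above.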
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 3.4.4
set.seed(101)
size.sample <- 50
iristrain <- iris[sample(1:nrow(iris), size.sample),] # get a training sample from iris
nnet_iristrain <- iristrain
# Binarize the categorical output
nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == 'setosa')
nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == 'versicolor')
nnet_iristrain <- cbind(nnet_iristrain, iristrain$Species == 'virginica')
names(nnet_iristrain)[6] <- 'setosa'
names(nnet_iristrain)[7] <- 'versicolor'
names(nnet_iristrain)[8] <- 'virginica'
head(nnet_iristrain)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species setosa
## 56 5.7 2.8 4.5 1.3 versicolor FALSE
## 7 4.6 3.4 1.4 0.3 setosa TRUE
## 106 7.6 3.0 6.6 2.1 virginica FALSE
## 97 5.7 2.9 4.2 1.3 versicolor FALSE
## 37 5.5 3.5 1.3 0.2 setosa TRUE
## 44 5.0 3.5 1.6 0.6 setosa TRUE
## versicolor virginica
## 56 TRUE FALSE
## 7 FALSE FALSE
## 106 FALSE TRUE
## 97 TRUE FALSE
## 37 FALSE FALSE
## 44 FALSE FALSE
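The three cbind calls and three renames above can be collapsed into one step. The sketch below is one possible base R alternative, not the approach the script takes; the names indicators and nnet_compact are hypothetical. It should produce the same three logical columns, one per species.

```r
set.seed(101)
size.sample <- 50
iristrain <- iris[sample(1:nrow(iris), size.sample), ]

# One logical column per species level. sapply names each column after the
# level it tests, so no separate renaming step is needed.
indicators <- sapply(levels(iristrain$Species),
                     function(lv) iristrain$Species == lv)

nnet_compact <- cbind(iristrain, indicators)
head(nnet_compact)
```

Because the columns are generated from levels(), the same code keeps working if the category has more or fewer than three values.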
The plot of the neural network we learn appears below.
A neural network is very good at learning non-linear functions, and it can learn multiple outputs at the same time. However, training time is relatively long, and training is susceptible to getting trapped in local minima. This can be mitigated by training for multiple rounds and picking the best learned model.
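One way to run multiple rounds is the rep argument of neuralnet, which trains the same network several times from different random starting weights. The sketch below assumes five rounds are enough and rebuilds the training sample so that it runs on its own; the names train, nn_multi, errors, and best are hypothetical.

```r
library(neuralnet)

set.seed(101)
train <- iris[sample(1:nrow(iris), 50), ]

# One 0/1 indicator column per species.
for (sp in levels(train$Species)) {
  train[[sp]] <- as.numeric(train$Species == sp)
}

# rep = 5 trains five networks, each from different random starting weights.
nn_multi <- neuralnet(setosa + versicolor + virginica ~
                        Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                      data = train, hidden = 3, rep = 5)

# result.matrix has one column per repetition that converged;
# its "error" row holds the final training error of each one.
errors <- nn_multi$result.matrix["error", ]
best <- which.min(errors)
print(errors)
print(best)
```

Repetitions that fail to converge are reported with a warning and left out of result.matrix, so best indexes the converged runs only. Calling plot(nn_multi, rep = "best") draws the repetition with the smallest error.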
nn <- neuralnet(setosa+versicolor+virginica ~ Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=nnet_iristrain, hidden=c(3))
plot(nn)
Neural Network Model on Iris Data
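The script stops once the network is trained. A natural next step, sketched here as an assumption about how one might continue, is to score the 100 flowers that were not sampled for training. The sketch uses compute() from the neuralnet package and predicts the species whose output neuron gives the highest value. It rebuilds the sample so that it runs on its own, and the names train, test, scores, and predicted are hypothetical.

```r
library(neuralnet)

set.seed(101)
idx <- sample(1:nrow(iris), 50)
train <- iris[idx, ]    # 50 flowers for training
test <- iris[-idx, ]    # the remaining 100 flowers, held out

species <- levels(iris$Species)
for (sp in species) {
  train[[sp]] <- as.numeric(train$Species == sp)   # 0/1 indicator columns
}

nn <- neuralnet(setosa + versicolor + virginica ~
                  Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = train, hidden = 3)

# compute() runs the held-out measurements through the trained network.
# net.result has one column per output neuron, in formula order.
scores <- compute(nn, test[, 1:4])$net.result

# Predict the species whose output neuron fires most strongly.
predicted <- species[apply(scores, 1, which.max)]

print(table(predicted, actual = test$Species))
accuracy <- mean(predicted == test$Species)
print(accuracy)
```

The confusion table shows which species the network mixes up; the exact counts depend on the random sample and on the starting weights.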