Summary
This R Markdown document shows how to train and validate a sample class prediction model. The code below, together with its comments, should be self-explanatory.
library(jsonlite)
library(randomForest)
# read the training data set 'D' for the class prediction model:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
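D = read.csv(url, header=FALSE)
names(D)[5] = 'label'
# randomForest needs a factor response for classification (read.csv returns strings by default in R >= 4.0)
D$label = factor(D$label)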
# train the model
set.seed(123)
model <- randomForest(label ~ ., data=D, na.action=na.omit)
# print performance metrics, including the out-of-bag (OOB) estimate of the error rate:
model
##
## Call:
## randomForest(formula = label ~ ., data = D, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4%
## Confusion matrix:
##                 Iris-setosa Iris-versicolor Iris-virginica class.error
## Iris-setosa              50               0              0        0.00
## Iris-versicolor           0              47              3        0.06
## Iris-virginica            0               3             47        0.06
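The figures in the printed summary can also be read from the fitted object. The following is a minimal sketch, assuming the randomForest package is installed. It uses R's built-in `iris` data as a stand-in for the UCI file so that it runs without network access, and the names `fit`, `oob_error` and `conf_mat` are illustrative only.

```r
library(randomForest)

# Stand-in for D: R's built-in iris data, with Species as the factor response
set.seed(123)
fit <- randomForest(Species ~ ., data = iris)

# err.rate has one row per tree; the last row is the OOB error of the full forest
oob_error <- fit$err.rate[fit$ntree, "OOB"]

# confusion holds the OOB confusion matrix plus a class.error column
conf_mat <- fit$confusion

print(oob_error)
print(conf_mat)
```

Reading the values this way makes it possible to log or threshold the OOB error instead of inspecting the printout by eye.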
# prepare the validation test set 'V'
V = fromJSON('example.json')
V = na.omit(V)
VI = as.data.frame(t(simplify2array(V$info)))
names(VI) = c('V1','V2','V3','V4')
V$info = NULL
V$id = NULL
VL = factor(V$label)
V$label = NULL
V = cbind(V,VI)
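V = na.omit(V)
# V now holds the predictors and VL the corresponding labels used to validate the model
# predict labels of V using the model:
P = predict(model,V)
# map the predicted species to 0-based class codes (assumes the labels in example.json are coded 0, 1, 2 in the factor-level order of the species)
PL = as.factor(as.numeric(P)-1)
# compare the predicted labels against VL: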
accuracy = sum(PL==VL)/length(VL)
# accuracy in range 0..1:
print(accuracy)
## [1] 1
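A single accuracy figure does not show which classes get confused with each other. The sketch below illustrates a per-class comparison on made-up label vectors: `truth` and `pred` are hypothetical stand-ins for `VL` and `PL`, not values taken from `example.json`. Comparing the labels as character strings avoids the error that `==` raises when two factors have different level sets.

```r
# Hypothetical stand-ins for VL (true labels) and PL (predicted labels)
truth <- factor(c("0", "0", "1", "1", "2", "2"))
pred  <- factor(c("0", "0", "1", "2", "2", "2"), levels = c("0", "1", "2"))

# Compare as strings so differing factor level sets cannot cause an error
accuracy <- mean(as.character(pred) == as.character(truth))

# Cross-tabulate: rows are true classes, columns are predicted classes
confusion <- table(truth = truth, predicted = pred)

print(accuracy)
print(confusion)
```

Off-diagonal cells of `confusion` count the misclassified samples for each pair of true and predicted class.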
What’s next
What is shown above is a very basic use of R. The remaining challenges include:
classifying samples from streaming data sources,
continuously retraining the prediction model on the actual species labels of the samples it has classified, and
providing an API for such a streaming, online-learning classifier.
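One possible starting point for the first two challenges is to classify incoming samples in small batches and refit the model once their true labels become available. The sketch below is an illustration under several assumptions, not a finished design: the stream is simulated by shuffling R's built-in `iris` data, the initial training size and batch size are arbitrary, and "retraining" means refitting the whole forest on all labeled samples seen so far, which is a simple substitute for true online learning. All names (`stream`, `seen`, `fit`, `batch_accuracy`) are hypothetical.

```r
library(randomForest)

set.seed(123)
stream <- iris[sample(nrow(iris)), ]   # simulated stream: shuffled iris rows
n_init <- 50                           # samples used for the initial model
batch_size <- 10                       # samples classified per step

seen <- stream[1:n_init, ]             # labeled samples available so far
fit <- randomForest(Species ~ ., data = seen)

starts <- seq(n_init + 1, nrow(stream), by = batch_size)
batch_accuracy <- numeric(length(starts))

for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + batch_size - 1, nrow(stream))
  batch <- stream[rows, ]

  # 1. classify the incoming batch with the current model
  predicted <- predict(fit, batch)
  batch_accuracy[i] <- mean(predicted == batch$Species)

  # 2. once the true labels are known, add the batch to the training data
  seen <- rbind(seen, batch)

  # 3. refit the model on everything seen so far
  fit <- randomForest(Species ~ ., data = seen)
}

print(round(batch_accuracy, 2))
```

Each batch is scored before its labels are used for training, so `batch_accuracy` tracks how the model performs on data it has not yet seen. Refitting from scratch grows more expensive as `seen` grows, so a real streaming service would likely bound the training window or retrain less often.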