Predicting Iris Species By Sepals

ak@dived.me
April 2016

What is the Iris dataset?

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.
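The structure described above can be checked directly in R, where the data set ships as the built-in `iris` data frame. A minimal sketch, using only base R:

```r
# iris is part of R's built-in datasets package, no download needed.
data(iris)

# 150 rows (samples) and 5 columns: four measurements plus Species.
print(dim(iris))

# Number of samples per species.
species_counts <- table(iris$Species)
print(species_counts)

# Column names as used in model formulas later on.
print(names(iris))
```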

Source: https://en.wikipedia.org/wiki/Iris_flower_data_set

Predicting the Iris species

But how many measurements do you need to predict the species of a new iris? The dataset contains only 150 samples. Is that enough for machine learning methods?

https://hokumski.shinyapps.io/predicting-iris-species-by-sepals/

gbm is an R package for Generalized Boosted Regression Models. Let's use only two features from the Iris dataset: sepal length and sepal width.

This is a classification problem: Species ~ Sepal.Length + Sepal.Width
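The model call in this deck uses two data frames, `irisTrain` and `irisTest`, but does not show how they were created. One possible way to build them is a simple random split. The seed and the 100/50 proportion below are assumptions (50 test rows would match the 50 rows of predictions printed in this deck), not the author's confirmed method:

```r
data(iris)

set.seed(42)  # assumed seed, used only to make the sketch reproducible

# Assumption: hold out 50 of the 150 rows for testing.
test_idx <- sample(nrow(iris), 50)

# Keep only the columns that the formula needs.
cols <- c("Sepal.Length", "Sepal.Width", "Species")
irisTest  <- iris[test_idx, cols]
irisTrain <- iris[-test_idx, cols]

print(dim(irisTrain))
print(dim(irisTest))
```

A purely random split like this does not guarantee the same number of flowers per species in each subset.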

Computing probabilities

library(gbm)  # irisTrain / irisTest: training and test subsets of iris
gbmmodel <- gbm(Species ~ Sepal.Length + Sepal.Width, distribution = "multinomial", data = irisTrain, n.trees = 20, shrinkage = 0.1, cv.folds = 5, n.minobsinnode = 2, verbose = FALSE, n.cores = 1)
pred <- predict(gbmmodel, irisTest, type = "response")
pred
Using 20 trees...
, , 20

          setosa versicolor  virginica
 [1,] 0.67395608 0.23598358 0.09006033
 [2,] 0.91568790 0.05751980 0.02679230
 [3,] 0.71543686 0.20596069 0.07860245
 [4,] 0.41606202 0.22415966 0.35977832
 [5,] 0.41606202 0.22415966 0.35977832
 [6,] 0.93598431 0.04093702 0.02307867
 [7,] 0.95103306 0.02968722 0.01927971
 [8,] 0.79983750 0.15130156 0.04886094
 [9,] 0.93555089 0.04194339 0.02250571
[10,] 0.85227398 0.10485617 0.04286985
[11,] 0.84371386 0.11391033 0.04237581
[12,] 0.82698086 0.12280927 0.05020987
[13,] 0.69249118 0.20571433 0.10179448
[14,] 0.71543686 0.20596069 0.07860245
[15,] 0.93598431 0.04093702 0.02307867
[16,] 0.26357595 0.67762392 0.05880012
[17,] 0.84371386 0.11391033 0.04237581
[18,] 0.71543686 0.20596069 0.07860245
[19,] 0.84371386 0.11391033 0.04237581
[20,] 0.07057198 0.16971222 0.75971580
[21,] 0.03692297 0.90334957 0.05972746
[22,] 0.02552216 0.24153717 0.73294068
[23,] 0.03193718 0.33323251 0.63483031
[24,] 0.06567396 0.39744336 0.53688267
[25,] 0.13200724 0.37465771 0.49333505
[26,] 0.04007096 0.38092899 0.57900005
[27,] 0.04296960 0.25887829 0.69815211
[28,] 0.04655007 0.48788653 0.46556340
[29,] 0.05458295 0.42601774 0.51939932
[30,] 0.04838338 0.37763037 0.57398625
[31,] 0.14448525 0.62179200 0.23372275
[32,] 0.06567396 0.39744336 0.53688267
[33,] 0.04457614 0.34791498 0.60750889
[34,] 0.07318610 0.57121465 0.35559924
[35,] 0.09385246 0.46632003 0.43982751
[36,] 0.06519569 0.56101408 0.37379023
[37,] 0.43529502 0.44659318 0.11811181
[38,] 0.05851996 0.45674600 0.48473403
[39,] 0.04296960 0.25887829 0.69815211
[40,] 0.08886494 0.25108438 0.66005068
[41,] 0.03932795 0.20166346 0.75900859
[42,] 0.11083070 0.35252586 0.53664344
[43,] 0.01395790 0.75434855 0.23169355
[44,] 0.05201904 0.12509591 0.82288505
[45,] 0.03830939 0.40814411 0.55354650
[46,] 0.03292186 0.35074599 0.61633215
[47,] 0.25999649 0.27461650 0.46538701
[48,] 0.11083070 0.35252586 0.53664344
[49,] 0.08886494 0.25108438 0.66005068
[50,] 0.06567396 0.39744336 0.53688267
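The printed result is a three-dimensional array (observations x classes x number of trees), with one probability per species in each row. To turn it into a predicted species, pick the column with the highest probability in each row. A minimal sketch, using three rows copied from the output above (rows 1, 16 and 20) instead of a fitted model, so that it runs without the gbm package:

```r
# Rows 1, 16 and 20 of the printed output, stored column by column
# in the same observations x classes x trees layout.
probs <- array(
  c(0.67395608, 0.26357595, 0.07057198,   # setosa column
    0.23598358, 0.67762392, 0.16971222,   # versicolor column
    0.09006033, 0.05880012, 0.75971580),  # virginica column
  dim = c(3, 3, 1),
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"), "20")
)

# Drop the trees dimension to get a plain observations x classes matrix.
prob_matrix <- probs[, , 1]

# For each row, take the name of the column with the largest probability.
predicted <- colnames(prob_matrix)[apply(prob_matrix, 1, which.max)]
print(predicted)
```

With the real `pred` array, the same two steps apply, and `mean(predicted == irisTest$Species)` would then give the share of correctly classified test flowers.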

What could be done next?

  • Of course! Add petals!
  • Stratify the sampling by species (N samples of each species)
  • Tune the GBM parameters
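The second idea, sampling the same number of flowers from each species, can be sketched in base R as follows. The value of `n_per_species` and the seed are hypothetical choices for illustration only:

```r
data(iris)

set.seed(1)           # assumed seed, for reproducibility only
n_per_species <- 33   # hypothetical N: training samples taken from each species

# Group the row numbers by species, then draw N rows from each group.
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_species, function(rows) sample(rows, n_per_species)))

irisTrain <- iris[train_idx, ]
irisTest  <- iris[-train_idx, ]

# Every species now contributes exactly N rows to the training set.
train_counts <- table(irisTrain$Species)
print(train_counts)
```

Adding the petals then only requires changing the formula, for example to `Species ~ .`, which uses all four measurements.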

https://github.com/hokumski/ddp-predicting-iris-by-sepals

Thanks!