Using the ‘iris’ data set in R, predict the species type using the measured characteristics of different irises.
Start by loading the data. We can use the built-in data set iris, which is already structured, with several attributes measured for different species of iris. Preview the data:
# print compact display of 'iris' data
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
From the compact display, we can see that the data set has 150 observations of 5 variables: four characteristics of data type ‘num’ (petal length and width, sepal length and width) and a Species factor with 3 levels.
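The same information can be checked programmatically rather than read off the display. A minimal sketch using base R only; the variable names col_classes and species_levels are illustrative and not part of the original analysis:

```r
# Class of each column: four numeric measurements and one factor
col_classes <- sapply(iris, class)
print(col_classes)

# Names of the species levels stored in the factor
species_levels <- levels(iris$Species)
print(species_levels)
```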
Next we can get a better idea of how the data is distributed for these characteristics:
# how is data distributed for Species:
# number per species:
table(iris$Species)
setosa versicolor virginica
50 50 50
## Percent per species:
#round(prop.table(table(iris$Species)) * 100, digits = 1)
# summary of data
print('Summary:')
[1] "Summary:"
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
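The summary above pools all three species. To see how the characteristics differ between species, one option is a per-species mean. This is a sketch using aggregate() from base R; the name species_means is an illustrative choice:

```r
# Mean of each measurement within each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Large differences between rows of this table hint at which characteristics may be useful for telling the species apart.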
Now we need to decide which characteristics are relevant. We can start by taking a quick look at the data: for example, we can plot various attributes for each species and see whether they are correlated. What can we learn from the data set, and what is useful for classification?
One option for plotting is to use the ggvis package (in Python, one could use seaborn or bokeh):
# Load in `ggvis`
# install.packages('ggvis')
library(ggvis)
# PLOT:
# Iris scatter plot: sepal length vs sepal width
iris %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()
# Iris scatter plot: petal length vs. petal width
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>% layer_points()
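If ggvis is not available, roughly the same two scatter plots can be drawn with base graphics. This is only a sketch: the colour choices and the names palette3 and species_cols are assumptions made for illustration.

```r
# One colour per species; the factor's integer codes (1..3) index the palette
palette3 <- c("red", "darkgreen", "blue")
species_cols <- palette3[as.integer(iris$Species)]

# Sepal length vs. sepal width
plot(iris$Sepal.Length, iris$Sepal.Width, col = species_cols, pch = 19,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
legend("topright", legend = levels(iris$Species), col = palette3, pch = 19)

# Petal length vs. petal width
plot(iris$Petal.Length, iris$Petal.Width, col = species_cols, pch = 19,
     xlab = "Petal.Length", ylab = "Petal.Width")
legend("topleft", legend = levels(iris$Species), col = palette3, pch = 19)
```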
It appears that there is a strong positive correlation between petal length and petal width. This can be tested for all species types combined:
# Overall correlation `Petal.Length` and `Petal.Width`
cor(iris$Petal.Length, iris$Petal.Width)
[1] 0.9628654
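cor() only returns the coefficient. To test whether the correlation differs from zero, one option is cor.test() from base R's stats package. A minimal sketch; the name petal_test is illustrative:

```r
# Pearson correlation test between petal length and petal width
petal_test <- cor.test(iris$Petal.Length, iris$Petal.Width)

petal_test$estimate   # Pearson's r, the same value cor() returns
petal_test$p.value    # p-value for the null hypothesis of zero correlation
petal_test$conf.int   # 95% confidence interval for r
```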
Or, the correlation among individual species types can be tested and viewed:
# viz correlation matrix for each species
#install.packages('corrplot')
library(corrplot)
# Return values of `iris` levels
x <- levels(iris$Species)
# Viz Setosa correlation matrix
title <- "Setosa"
SetoCorr <- cor(iris[iris$Species == x[1], 1:4])
corrplot(SetoCorr, method = "number", title = title, mar = c(0, 0, 1, 0))
# Viz Versicolor correlation matrix
title <- "Versicolor"
VersiCorr <- cor(iris[iris$Species == x[2], 1:4])
corrplot(VersiCorr, method = "number", title = title, mar = c(0, 0, 1, 0))
# Viz Virginica correlation matrix
title <- "Virginica"
VirgCorr <- cor(iris[iris$Species == x[3], 1:4])
corrplot(VirgCorr, method = "number", title = title, mar = c(0, 0, 1, 0))
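Interestingly, when all the species are combined, we get a correlation coefficient of 0.96 for petal length vs. petal width. However, separately, we get 0.33, 0.79, and 0.32 for Setosa, Versicolor, and Virginica, respectively. Much of the overall correlation therefore comes from differences between species rather than from the relationship within any one species.
Here, we are going to split the data set into two data sets: a training set and a test set. We want to randomly select the rows that go into each set, and ideally we want each species to be equally represented within each set so as not to bias either one. So, we can randomly assign each row to one of the two sets using sample(). Note that this assigns each row independently with probabilities 0.8 and 0.2, so the split is only approximately 80/20 and the species counts are not guaranteed to be equal.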
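The per-species coefficients can also be computed without plotting. A small sketch using split() and sapply() from base R; the names by_species and petal_cor are illustrative:

```r
# Split the data frame into one data frame per species
by_species <- split(iris, iris$Species)

# Correlation of petal length vs. petal width within each species
petal_cor <- sapply(by_species, function(d) cor(d$Petal.Length, d$Petal.Width))
round(petal_cor, 2)
```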
# set random number generator seed
set.seed(69)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.8, 0.2))
# Compose training set
iris.training <- iris[ind==1, 1:4]
# Inspect training set
head(iris.training)
# Compose test set
iris.test <- iris[ind==2, 1:4]
# Inspect test set
head(iris.test)
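The split above does not guarantee equal numbers of each species in the two sets. If that balance matters, one alternative is a stratified split that samples 80% of the rows within each species. This is a sketch of that alternative, not the method used above; the names train_idx, strat.training and strat.test are hypothetical, and the Species column is kept here so the balance can be checked.

```r
set.seed(69)  # same seed value as above, but the resulting split differs

# Row indices grouped by species, then 80% sampled within each group
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(
  lapply(rows_by_species, function(rows) sample(rows, size = round(0.8 * length(rows)))),
  use.names = FALSE
)

# Stratified training and test sets (all five columns kept)
strat.training <- iris[train_idx, ]
strat.test <- iris[-train_idx, ]

# Each species contributes 40 training rows and 10 test rows
table(strat.training$Species)
table(strat.test$Species)
```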