ABOUT
This is an example of a notebook to demonstrate concepts of Data Science. In this example we will do some exploratory data analysis on the famous Iris dataset.
The Iris dataset contains four features (the length and width of sepals and petals) for 50 samples from each of three Iris species (Iris setosa, Iris virginica and Iris versicolor), 150 samples in total. These measurements were used to create a linear discriminant model to classify the species. The dataset is often used in data mining, classification and clustering examples, and to test algorithms.
Information about the original paper and usages of the dataset can be found in the UCI Machine Learning Repository – Iris Data Set.
DATA OVERVIEW
The iris dataset is built into R. It contains measurements of 4 attributes (in centimeters) for 150 flowers, 50 from each of 3 species. The str() function shows its structure:
str(iris)
We can use the dim() function to get the dimensions of the dataset in terms of number of rows and number of columns:
dim(iris)
Take a look at the first six rows of the dataset by using the head() function:
head(iris)
DATA CLASSES
# Get first 5 rows of each subset
subset(iris, Species == "setosa")[1:5,]
subset(iris, Species == "versicolor")[1:5,]
subset(iris, Species == "virginica")[1:5,]
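The three subset() calls can also be written as a single split-and-apply step. This is a minimal sketch using only base R; the name first_rows is introduced here for illustration.

```r
# Split the data frame into one piece per species,
# then keep the first 5 rows of each piece.
first_rows <- lapply(split(iris, iris$Species), head, n = 5)

names(first_rows)        # one list element per species
first_rows$versicolor    # same rows as subset(iris, Species == "versicolor")[1:5,]
```

The result is a named list, so each class can be inspected by name without repeating the filter expression.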
DATA SUMMARY
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
For each of the numeric variables we can see the following information:
- Min: The minimum value.
- 1st Qu: The value of the first quartile (25th percentile).
- Median: The median value.
- Mean: The mean value.
- 3rd Qu: The value of the third quartile (75th percentile).
- Max: The maximum value.
For the only categorical variable in the dataset (Species) we see a frequency count of each value:
- setosa: This species occurs 50 times.
- versicolor: This species occurs 50 times.
- virginica: This species occurs 50 times.
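The values listed above can be reproduced one at a time with base R functions. The sketch below does this for Sepal.Length and for the Species counts; the variable names are chosen here for illustration.

```r
# Recompute the summary() statistics for one numeric column.
x <- iris$Sepal.Length
stats <- c(Min    = min(x),
           Q1     = unname(quantile(x, 0.25)),  # first quartile
           Median = median(x),
           Mean   = mean(x),
           Q3     = unname(quantile(x, 0.75)),  # third quartile
           Max    = max(x))
round(stats, 3)

# Frequency count for the categorical column.
species_counts <- table(iris$Species)
species_counts
```

The rounded values should agree with the Sepal.Length column of the summary() output, and the table should show 50 flowers per species.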
VISUALIZATION
Use the hist() function to plot a histogram of sepal length.
hist(iris$Sepal.Length,
col='steelblue',
main='Histogram',
xlab='Length',
ylab='Frequency')

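# Define the per-species subsets before plotting them
irisVer <- subset(iris, Species == "versicolor")
irisSet <- subset(iris, Species == "setosa")
irisVir <- subset(iris, Species == "virginica")
par(mfrow=c(1,3))
hist(irisVer$Petal.Length,breaks=seq(0,8,length.out=17),xlim=c(0,8),ylim=c(0,40))
hist(irisSet$Petal.Length,breaks=seq(0,8,length.out=17),xlim=c(0,8),ylim=c(0,40))
hist(irisVir$Petal.Length,breaks=seq(0,8,length.out=17),xlim=c(0,8),ylim=c(0,40))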

Use the plot() function to create a scatterplot.
#create scatterplot of petal width vs. petal length
plot(iris$Petal.Width, iris$Petal.Length,
col='steelblue',
main='Scatterplot',
xlab='Petal Width',
ylab='Petal Length',
pch=19)

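Use the beanplot() function from the beanplot package to compare the distributions of the four numeric attributes.
library(beanplot)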
xiris <- iris
xiris$Species <- NULL
beanplot(xiris, main = "Iris flowers",col=c('#ff8080','#0000FF','#0000FF',
'#FF00FF'), border = "#000000")

Use the boxplot() function to compare sepal length across species.
#create boxplot of sepal length by species
boxplot(Sepal.Length~Species,
data=iris,
main='Sepal Length by Species',
xlab='Species',
ylab='Sepal Length',
col='steelblue',
border='black')

The x-axis displays the three species and the y-axis displays the distribution of values for sepal length for each species.
These plots give a rough picture of the distribution of the values for each attribute. Because every sample carries a class label, it makes more sense to look at the distribution of the values within each class:
irisVer <- subset(iris, Species == "versicolor")
irisSet <- subset(iris, Species == "setosa")
irisVir <- subset(iris, Species == "virginica")
par(mfrow=c(1,3),mar=c(6,3,2,1))
boxplot(irisVer[,1:4], main="Versicolor",ylim = c(0,8),las=2)
boxplot(irisSet[,1:4], main="Setosa",ylim = c(0,8),las=2)
boxplot(irisVir[,1:4], main="Virginica",ylim = c(0,8),las=2)
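
The per-class boxplots can be complemented with the numbers behind them. The following sketch, using only base R, computes the mean of every attribute per species and extracts the five boxplot statistics for sepal length without drawing a plot. The names species_means and bp are introduced here for illustration.

```r
# Mean of each numeric attribute within each species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means

# Statistics a boxplot is drawn from: lower whisker, lower hinge,
# median, upper hinge, upper whisker (one column per species).
bp <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)
colnames(bp$stats) <- bp$names
bp$stats
```

Reading the table of means next to the plots makes it easier to see which attributes differ most between classes.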

CORRELATION ANALYSIS
Are any variables correlated?
corr <- cor(iris[,1:4])
round(corr,3)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        1.000      -0.118        0.872       0.818
Sepal.Width        -0.118       1.000       -0.428      -0.366
Petal.Length        0.872      -0.428        1.000       0.963
Petal.Width         0.818      -0.366        0.963       1.000
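A coefficient close to +1 means the two variables have a strong positive linear correlation, a coefficient close to -1 means a strong inverse correlation, and a coefficient close to 0 means little or no linear relationship. Here petal length and petal width are the most strongly correlated pair (0.963).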
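To make the coefficient concrete, this sketch computes the Pearson correlation between Petal.Length and Petal.Width by hand and compares it with cor(). The variable names are chosen here for illustration.

```r
x <- iris$Petal.Length
y <- iris$Petal.Width

# Pearson correlation: covariance of the centered values divided by
# the product of their spreads.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)            # cor() uses Pearson by default
all.equal(r_manual, r_builtin)    # TRUE
round(r_manual, 3)                # 0.963, as in the matrix above
```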
pairs(iris[,1:4])

Let’s color the points by species to make the groups easier to see:
pairs(iris[,1:4],col=iris[,5],oma=c(4,4,6,12))
par(xpd=TRUE)
legend(0.85,0.6, as.vector(unique(iris$Species)),fill=c(1,2,3))

CLASSIFICATION
Create a model that predicts the species from the petal and sepal width and length using a decision tree.
library(C50)
input <- iris[,1:4]
output <- iris[,5]
model1 <- C5.0(input, output, control = C5.0Control(noGlobalPruning = TRUE,minCases=1))
plot(model1, main="C5.0 Decision Tree - Unpruned, min=1")

Adjust the parameters of the decision tree.
model2 <- C5.0(input, output, control = C5.0Control(noGlobalPruning = FALSE))
plot(model2, main="C5.0 Decision Tree - Pruned")

Retrieve model information.
summary(model2)
Call:
C5.0.default(x = input, y = output, control = C5.0Control(noGlobalPruning = FALSE))
C5.0 [Release 2.07 GPL Edition] Thu Jan 13 17:04:45 2022
-------------------------------
Class specified by attribute `outcome'
Read 150 cases (5 attributes) from undefined.data
Decision tree:

Petal.Length <= 1.9: setosa (50)
Petal.Length > 1.9:
:...Petal.Width > 1.7: virginica (46/1)
    Petal.Width <= 1.7:
    :...Petal.Length <= 4.9: versicolor (48/1)
        Petal.Length > 4.9: virginica (6/2)

Evaluation on training data (150 cases):

        Decision Tree
      ----------------
      Size      Errors
         4    4( 2.7%)   <<

       (a)   (b)   (c)    <-classified as
      ----  ----  ----
        50                (a): class setosa
              47     3    (b): class versicolor
               1    49    (c): class virginica
Attribute usage:
100.00% Petal.Length
66.67% Petal.Width
Time: 0.0 secs
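The printed tree can be read as a set of nested if/else rules. As an illustration only, the sketch below hard-codes those rules in a hypothetical helper, predict_by_rules(), which is not part of the C50 package, and applies them to the training data. It should reproduce the 4 training errors reported in the summary.

```r
# Hypothetical helper: the pruned tree from the summary, written as rules.
predict_by_rules <- function(petal_length, petal_width) {
  ifelse(petal_length <= 1.9, "setosa",
    ifelse(petal_width > 1.7, "virginica",
      ifelse(petal_length <= 4.9, "versicolor", "virginica")))
}

rule_pred <- factor(predict_by_rules(iris$Petal.Length, iris$Petal.Width),
                    levels = levels(iris$Species))

# Rows are the true species, columns the species assigned by the rules.
confusion <- table(actual = iris$Species, predicted = rule_pred)
confusion

errors <- sum(rule_pred != iris$Species)
errors                  # 4 misclassified flowers
errors / nrow(iris)     # about 2.7 % training error
```

Because the rules are evaluated on the same data the tree was trained on, this error rate is optimistic. A held-out test set would give a fairer estimate.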
Inspect which features the model uses:
C5imp(model2,metric='usage')
Predict classes from numerical attributes:
newcases <- iris[c(1:3,51:53,101:103),]
newcases
predicted <- predict(model2, newcases, type="class")
predicted
[1] setosa setosa setosa versicolor versicolor versicolor virginica virginica
[9] virginica
Levels: setosa versicolor virginica
K MEANS CLUSTERING
K-means clustering is normally used with unlabeled data. Our dataset is labeled, so we run the algorithm on the iris data without the Species column. The algorithm then clusters the data using only the four measurements, and we can compare the resulting clusters with the true species to see how well they agree.
library(ggplot2)
df <- iris
head(iris)
ggplot(df, aes(Petal.Length, Petal.Width)) + geom_point(aes(col=Species), size=4)

As the plot shows, setosa is clearly separated and should be easy to cluster. Versicolor and virginica overlap slightly, even though they look well separated at first glance.
Let’s run the model. The kmeans() function is part of the stats package, which ships with R and is loaded by default, so there is nothing to install.
The kmeans() function requires the centers argument, which here is the number of clusters to form. In this case we know the value is 3, so let’s set that. The nstart=20 argument tries 20 random starting configurations and keeps the best one.
Later we will see how we would choose the number of clusters if we didn’t know it.
set.seed(101)
irisCluster <- kmeans(df[,1:4], centers=3, nstart=20)
irisCluster
K-means clustering with 3 clusters of sizes 38, 62, 50
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.850000 3.073684 5.742105 2.071053
2 5.901613 2.748387 4.393548 1.433871
3 5.006000 3.428000 1.462000 0.246000
Clustering vector:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[44] 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2
[87] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1
[130] 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2
Within cluster sum of squares by cluster:
[1] 23.87947 39.82097 15.15100
(between_SS / total_SS = 88.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
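The 88.4 % figure in the output above is the share of the total variation that is explained by the cluster assignment. As a minimal sketch, it can be recomputed from the components of the fitted object; the name fit is used here so that the example stands on its own.

```r
set.seed(101)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Total sum of squares splits into a between-cluster part
# and a within-cluster part.
all.equal(fit$totss, fit$betweenss + fit$tot.withinss)
all.equal(fit$tot.withinss, sum(fit$withinss))

# Share of the total variation explained by the clusters, in percent.
ratio <- fit$betweenss / fit$totss
round(100 * ratio, 1)
```

A higher ratio means tighter, better separated clusters, but it always increases as more clusters are added, so it cannot be used on its own to choose the number of clusters.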
Compare the predicted clusters with the original data.
table(irisCluster$cluster, df$Species)
    setosa versicolor virginica
  1      0          2        36
  2      0         48        14
  3     50          0         0
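The table can be turned into a single agreement score by assigning each cluster to its most frequent species. With the counts shown above, that is (36 + 48 + 50) / 150, or about 0.893. The sketch below computes the same quantity in a way that does not depend on how the clusters happen to be numbered. The names fit, tab and purity are introduced here for illustration.

```r
set.seed(101)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

tab <- table(cluster = fit$cluster, species = iris$Species)

# For each cluster (row), count the flowers of its majority species,
# then divide by the total number of flowers.
matched <- sum(apply(tab, 1, max))
purity  <- matched / sum(tab)
purity
```

This score is often called cluster purity. It is not a true classification accuracy, because k-means never sees the labels.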
Plot out these clusters:
library(cluster)
# plot only the four numeric columns that were used for clustering
clusplot(df[,1:4], irisCluster$cluster, color=TRUE, shade=TRUE, labels=0, lines=0)

The setosa cluster is cleanly separated, while the virginica and versicolor clusters overlap slightly.
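If we did not know the number of clusters in advance, we could estimate it with the elbow method: fit k-means for a range of cluster counts and plot the total within-cluster sum of squares for each.
tot.withinss <- vector(mode="numeric", length=10)
for (i in 1:10){
# use a separate name so the 3-cluster model in irisCluster is not overwritten
fit <- kmeans(df[,1:4], centers=i, nstart=20)
tot.withinss[i] <- fit$tot.withinss
}
plot(1:10, tot.withinss, type="b", pch=19, xlab="Number of clusters", ylab="Total within-cluster sum of squares")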

The curve drops steeply up to about 3 clusters and flattens afterwards. This bend (the elbow) suggests 3 as a reasonable number of clusters.
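The elbow can also be checked numerically by looking at how much the total within-cluster sum of squares shrinks with each additional cluster. This is a rough sketch under the same settings as above; the names wss and drop_pct are introduced here for illustration.

```r
set.seed(101)
x <- iris[, 1:4]

# Total within-cluster sum of squares for k = 1, ..., 10.
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

# Percentage reduction when going from k to k + 1 clusters.
drop_pct <- round(100 * -diff(wss) / head(wss, -1), 1)
names(drop_pct) <- paste0(1:9, "->", 2:10)
drop_pct
```

Large reductions for the first steps followed by much smaller ones correspond to the bend seen in the plot.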
