Import Data
# change path if necessary
path = "file:///Users/ellenhwng/Documents/Research/iris.csv"
iris <- read.csv(path, na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
names(iris) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "class")
attributes <- c("sepal.length", "sepal.width", "petal.length", "petal.width")
Create Subset for Each Class
setosa <- iris[iris$class == "Iris-setosa", ]
versicolor <- iris[iris$class == "Iris-versicolor", ]
virginica <- iris[iris$class == "Iris-virginica", ]
Partition Data for Modelling: Training, Validation, Testing
nobs <- nrow(iris) # 151 observations/rows
train <- sample(nrow(iris), .7*nobs) #105 observations
validate <- sample(setdiff(seq_len(nrow(iris)), train), .15*nobs) # 22 observations
test <- setdiff(setdiff(seq_len(nrow(iris)), train), validate) # 24 observations (the rows left after train and validate)
nobs
## [1] 151
length(train)
## [1] 105
length(validate)
## [1] 22
length(test)
## [1] 24
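Because `sample()` is called here before any seed is set, each run draws different rows. A minimal sketch of a repeatable 70/15/15 split follows. The seed value and the `_demo` names are assumptions for illustration and are not taken from the original run.

```r
# Hypothetical stand-in: 151 rows, matching nobs above
n_demo <- 151
set.seed(42)  # assumed seed; any fixed value makes the split repeatable

rows <- seq_len(n_demo)
train_demo <- sample(rows, floor(0.70 * n_demo))          # 105 rows
remaining <- setdiff(rows, train_demo)
validate_demo <- sample(remaining, floor(0.15 * n_demo))  # 22 rows
test_demo <- setdiff(remaining, validate_demo)            # the remaining 24 rows

# The three index sets should be disjoint and cover every row
stopifnot(length(intersect(train_demo, validate_demo)) == 0,
          length(c(train_demo, validate_demo, test_demo)) == n_demo)
c(train = length(train_demo), validate = length(validate_demo), test = length(test_demo))
```

Taking the test set as whatever remains is why it ends up with 24 rows rather than 23: the two `sample()` sizes are truncated to whole numbers.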
Overall summary of dataset
## sepal.length sepal.width petal.length petal.width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## NA's :1 NA's :1 NA's :1 NA's :1
List of summaries of each iris type
obs_mean <- list(setosa = apply(setosa[1:4], 2, summary),
                 versicolor = apply(versicolor[1:4], 2, summary),
                 virginica = apply(virginica[1:4], 2, summary))
obs_mean
## $setosa
## sepal.length sepal.width petal.length petal.width
## Min. 4.300 2.300 1.000 0.100
## 1st Qu. 4.800 3.125 1.400 0.200
## Median 5.000 3.400 1.500 0.200
## Mean 5.006 3.418 1.464 0.244
## 3rd Qu. 5.200 3.675 1.575 0.300
## Max. 5.800 4.400 1.900 0.600
## NA's 1.000 1.000 1.000 1.000
##
## $versicolor
## sepal.length sepal.width petal.length petal.width
## Min. 4.900 2.000 3.00 1.000
## 1st Qu. 5.600 2.525 4.00 1.200
## Median 5.900 2.800 4.35 1.300
## Mean 5.936 2.770 4.26 1.326
## 3rd Qu. 6.300 3.000 4.60 1.500
## Max. 7.000 3.400 5.10 1.800
## NA's 1.000 1.000 1.00 1.000
##
## $virginica
## sepal.length sepal.width petal.length petal.width
## Min. 4.900 2.200 4.500 1.400
## 1st Qu. 6.225 2.800 5.100 1.800
## Median 6.500 3.000 5.550 2.000
## Mean 6.588 2.974 5.552 2.026
## 3rd Qu. 6.900 3.175 5.875 2.300
## Max. 7.900 3.800 6.900 2.500
## NA's 1.000 1.000 1.000 1.000
Of all the iris attribute distributions, petal length and petal width were clearly the most distinguishing characteristics. Iris setosa had the smallest petal lengths and petal widths, well below those of the other two classes. The histograms of the two attributes showed that the petal measurements of versicolor and virginica were largely separated but overlapped slightly in both petal length and petal width. Therefore, I need to examine petal length and petal width more closely in order to create a model that distinguishes between iris versicolor and iris virginica.
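The separation and overlap described above can be checked numerically by comparing the per-class ranges of the petal attributes. This is only a sketch: it assumes R's built-in `iris` data as a stand-in for the CSV, and `iris_demo` and `class_range` are hypothetical names introduced for the example.

```r
# Assumption: built-in iris data stands in for the CSV, renamed to match the names used here
iris_demo <- datasets::iris
names(iris_demo) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "class")
iris_demo$class <- paste0("Iris-", iris_demo$class)

# Min and max of one attribute within each class
class_range <- function(values, classes) {
  out <- sapply(split(values, classes), range, na.rm = TRUE)
  rownames(out) <- c("min", "max")
  out
}

length_range <- class_range(iris_demo$petal.length, iris_demo$class)
width_range <- class_range(iris_demo$petal.width, iris_demo$class)
length_range
width_range

# Setosa sits entirely below versicolor on petal length
length_range["max", "Iris-setosa"] < length_range["min", "Iris-versicolor"]
# Versicolor and virginica share part of their petal length range
length_range["max", "Iris-versicolor"] > length_range["min", "Iris-virginica"]
```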
set.seed(42)
iris_kmeans <- kmeans(na.omit(iris[train, attributes]), 3)
iris_kmeans
## K-means clustering with 3 clusters of sizes 27, 45, 33
##
## Cluster means:
## sepal.length sepal.width petal.length petal.width
## 1 6.888889 3.062963 5.748148 2.0666667
## 2 5.904444 2.737778 4.406667 1.4377778
## 3 5.030303 3.442424 1.493939 0.2484848
##
## Clustering vector:
## 34 53 109 50 115 18 66 63 37 19 134 149 41 8 138 70 116 48
## 3 1 1 3 2 3 2 2 3 3 2 1 3 3 1 2 1 3
## 99 122 40 26 22 79 80 49 125 85 91 30 43 74 44 117 64 131
## 2 2 3 3 3 2 2 3 1 2 2 3 3 2 3 1 2 1
## 108 90 126 35 97 25 58 120 104 136 73 86 59 100 6 17 103 77
## 1 2 1 3 2 3 2 2 1 1 2 2 2 2 3 3 1 2
## 144 5 145 96 128 7 52 12 76 78 46 11 110 2 28 102 65 142
## 1 3 1 2 2 3 2 3 2 1 3 3 1 3 3 2 2 1
## 123 107 119 140 42 20 69 29 55 89 92 94 87 31 124 133 148 121
## 1 2 1 1 3 3 2 3 2 2 2 2 2 3 2 1 1 1
## 132 45 81 33 147 62 150 75 54 84 38 82 129 143 105
## 1 3 2 3 2 2 2 2 2 2 3 2 1 2 1
##
## Within cluster sum of squares by cluster:
## [1] 15.877037 30.598667 8.491515
## (between_SS / total_SS = 88.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Graphical Representation of kmeans clusters
library(cluster)
cluster::clusplot(na.omit(iris[train, attributes]), iris_kmeans$cluster, color=TRUE, shade=TRUE, main='2D Representation of Clusters')
## Rattle: A free graphical interface for data mining with R.
## Version 4.0.5 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(iris_rpart, main = "Iris Decision Tree")
print(iris_rpart)
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 69 Iris-versicolor (0.31428571 0.34285714 0.34285714)
## 2) petal.length< 2.45 33 0 Iris-setosa (1.00000000 0.00000000 0.00000000) *
## 3) petal.length>=2.45 72 36 Iris-versicolor (0.00000000 0.50000000 0.50000000)
## 6) petal.width< 1.75 39 3 Iris-versicolor (0.00000000 0.92307692 0.07692308) *
## 7) petal.width>=1.75 33 0 Iris-virginica (0.00000000 0.00000000 1.00000000) *
printcp(iris_rpart)
##
## Classification tree:
## rpart(formula = class ~ ., data = iris[train, c(attributes, "class")],
## method = "class", parms = list(split = "information"), control = rpart.control(usesurrogate = 0,
## maxsurrogate = 0))
##
## Variables actually used in tree construction:
## [1] petal.length petal.width
##
## Root node error: 69/105 = 0.65714
##
## n= 105
##
## CP nsplit rel error xerror xstd
## 1 0.47826 0 1.000000 1.101449 0.066399
## 2 0.01000 2 0.043478 0.043478 0.024741
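The chunk that fits `iris_rpart` is not shown in this report, but `printcp()` echoes the call that was used. A minimal sketch of an equivalent fit is given below, with the arguments copied from that echoed call. The built-in `iris` data, the seed, and the `_demo` names are assumptions for illustration, so the resulting tree may differ from the one printed above.

```r
library(rpart)

# Assumption: built-in iris data as a stand-in for the CSV, with the column names used here
iris_demo <- datasets::iris
names(iris_demo) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "class")
attributes_demo <- c("sepal.length", "sepal.width", "petal.length", "petal.width")

set.seed(42)  # assumed seed, for a repeatable training sample
train_demo <- sample(nrow(iris_demo), floor(0.7 * nrow(iris_demo)))

# Same arguments as the call echoed by printcp()
tree_demo <- rpart(class ~ .,
                   data = iris_demo[train_demo, c(attributes_demo, "class")],
                   method = "class",
                   parms = list(split = "information"),
                   control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

print(tree_demo)
```

Setting `usesurrogate = 0` and `maxsurrogate = 0` turns off surrogate splits, so an observation with a missing value in a split variable is not sent further down the tree.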
Confusion Matrix for kmeans performance
The confusion matrix shows that the 1st kmeans cluster grouped 25 virginica and 2 versicolor observations, the 2nd kmeans cluster grouped 34 versicolor and 11 virginica observations, and the 3rd kmeans cluster grouped all 33 of the setosa observations in the training set. The mixing in the first two clusters demonstrates the similarity and overlap in attribute measurements between iris versicolor and iris virginica.
table(na.omit(iris[train,5]),iris_kmeans$cluster)
##
## 1 2 3
## Iris-setosa 0 0 33
## Iris-versicolor 2 34 0
## Iris-virginica 25 11 0
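Cluster numbers from `kmeans()` are arbitrary labels, so one way to read the table is to give each cluster the name of its most common class and count how many observations agree with that name. The sketch below does this with the counts copied from the table above; `cm`, `majority`, and `purity` are hypothetical names introduced for the example.

```r
# Counts copied from the table above (rows: actual class, columns: kmeans cluster)
cm <- matrix(c( 0,  0, 33,
                2, 34,  0,
               25, 11,  0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("Iris-setosa", "Iris-versicolor", "Iris-virginica"),
                             c("1", "2", "3")))

# Label each cluster with its most common class
majority <- rownames(cm)[apply(cm, 2, which.max)]
names(majority) <- colnames(cm)
majority

# Share of observations that fall in a cluster labelled with their own class
purity <- sum(apply(cm, 2, max)) / sum(cm)
round(purity, 3)
```

Here 92 of the 105 training observations land in a cluster named after their own class, and all 13 exceptions are versicolor or virginica.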
Error Matrices for the Validation
The error matrix for the validation data shows that the decision tree model makes 1 error, predicting an iris virginica observation as iris versicolor, giving an 11% error (1 of 9) when predicting iris virginica.
## Predicted
## Actual Iris-setosa Iris-versicolor Iris-virginica
## Iris-setosa 5 0 0
## Iris-versicolor 0 7 0
## Iris-virginica 0 1 8
## Predicted
## Actual Iris-setosa Iris-versicolor Iris-virginica Error
## Iris-setosa 0.23 0.00 0.00 0.00
## Iris-versicolor 0.00 0.32 0.00 0.00
## Iris-virginica 0.00 0.05 0.36 0.11
Error Matrices for the Testing
The error matrix for the testing data shows 1 error in predicting iris versicolor as iris virginica and 1 error in predicting iris virginica as iris versicolor, giving a 14% error (1 of 7) in predicting iris versicolor and a 20% error (1 of 5) in predicting iris virginica.
## Predicted
## Actual Iris-setosa Iris-versicolor Iris-virginica
## Iris-setosa 12 0 0
## Iris-versicolor 0 6 1
## Iris-virginica 0 1 4
## Predicted
## Actual Iris-setosa Iris-versicolor Iris-virginica Error
## Iris-setosa 0.5 0.00 0.00 0.00
## Iris-versicolor 0.0 0.25 0.04 0.14
## Iris-virginica 0.0 0.04 0.17 0.20
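The Error column in the validation and testing matrices can be reproduced from the counts: for each actual class it is the share of that row that falls off the diagonal. The sketch below uses the counts copied from the two count matrices above; `class_error` is a hypothetical helper written for this example, not a function from rattle or rpart.

```r
# Per-class and overall error from a matrix of counts (rows: actual, columns: predicted)
class_error <- function(counts) {
  per_class <- 1 - diag(counts) / rowSums(counts)
  overall <- 1 - sum(diag(counts)) / sum(counts)
  round(c(per_class, overall = overall), 2)
}

classes <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")

# Counts copied from the validation and testing matrices above
validation_counts <- matrix(c(5, 0, 0,
                              0, 7, 0,
                              0, 1, 8),
                            nrow = 3, byrow = TRUE, dimnames = list(classes, classes))
testing_counts <- matrix(c(12, 0, 0,
                            0, 6, 1,
                            0, 1, 4),
                         nrow = 3, byrow = TRUE, dimnames = list(classes, classes))

class_error(validation_counts)
class_error(testing_counts)
```

The per-class values match the Error columns shown above: 0.11 for iris virginica on the validation data, and 0.14 and 0.20 for iris versicolor and iris virginica on the testing data.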