Getting the Data Ready

Import Data

# change path if necessary
path = "file:///Users/ellenhwng/Documents/Research/iris.csv"
iris <- read.csv(path, na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
names(iris) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "class")
attributes <- c("sepal.length", "sepal.width", "petal.length", "petal.width")


Create Subset for Each Class

setosa <- iris[iris$class == "Iris-setosa", ]
versicolor <- iris[iris$class == "Iris-versicolor", ]
virginica <- iris[iris$class == "Iris-virginica", ]


Partition Data for Modelling: Training, Validation, Testing

nobs <- nrow(iris) # 151 observations/rows
train <- sample(nrow(iris), .7*nobs) #105 observations
validate <- sample(setdiff(seq_len(nrow(iris)), train), .15*nobs) # 22 observations
test <- setdiff(setdiff(seq_len(nrow(iris)), train), validate) # 24 observations (the remainder)
nobs 
## [1] 151
length(train)
## [1] 105
length(validate)
## [1] 22
length(test)
## [1] 24
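
The split above depends on the state of the random number generator, so the row indices change on every run unless a seed is set first. A minimal sketch of a reproducible version with two sanity checks, assuming a seed of 42 (the value used later for kmeans; the seed behind the split above is not stated) and a hypothetical `n_rows` standing in for `nrow(iris)`:

```r
# Sketch: reproducible 70/15/15 split with sanity checks.
# Assumption: the seed is illustrative, not the one behind the output above.
set.seed(42)
n_rows   <- 151                 # stands in for nrow(iris)
all_rows <- seq_len(n_rows)

train_idx    <- sample(all_rows, floor(0.70 * n_rows))
validate_idx <- sample(setdiff(all_rows, train_idx), floor(0.15 * n_rows))
test_idx     <- setdiff(all_rows, c(train_idx, validate_idx))

# The three partitions should be disjoint and together cover every row.
stopifnot(length(intersect(train_idx, validate_idx)) == 0)
stopifnot(setequal(c(train_idx, validate_idx, test_idx), all_rows))

c(train = length(train_idx), validate = length(validate_idx), test = length(test_idx))
```

The sizes come out as 105, 22 and 24, matching the lengths printed above, because the fractional sizes are rounded down and the testing set takes whatever rows remain.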


Summaries and Means

Overall summary of dataset

##   sepal.length    sepal.width     petal.length    petal.width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##  NA's   :1       NA's   :1       NA's   :1       NA's   :1
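
The code that produced this table is not shown. Output of this shape is what `summary()` returns for a data frame of numeric columns. A minimal sketch, assuming R's built-in `datasets::iris` as a stand-in for the CSV (it has 150 rows and no missing values, so no NA's row appears):

```r
# Sketch: column-wise summary of the four measurements.
# Assumption: datasets::iris stands in for the CSV read in "Import Data".
iris_demo <- datasets::iris
names(iris_demo) <- c("sepal.length", "sepal.width", "petal.length", "petal.width", "class")
attributes <- c("sepal.length", "sepal.width", "petal.length", "petal.width")

overall_summary <- summary(iris_demo[attributes])
overall_summary
```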


List of summaries of each iris type

obs_mean <- list(setosa = apply(setosa[1:4], 2, summary), 
                 versicolor = apply(versicolor[1:4], 2, summary), 
                 virginica = apply(virginica[1:4], 2, summary))
obs_mean
## $setosa
##         sepal.length sepal.width petal.length petal.width
## Min.           4.300       2.300        1.000       0.100
## 1st Qu.        4.800       3.125        1.400       0.200
## Median         5.000       3.400        1.500       0.200
## Mean           5.006       3.418        1.464       0.244
## 3rd Qu.        5.200       3.675        1.575       0.300
## Max.           5.800       4.400        1.900       0.600
## NA's           1.000       1.000        1.000       1.000
## 
## $versicolor
##         sepal.length sepal.width petal.length petal.width
## Min.           4.900       2.000         3.00       1.000
## 1st Qu.        5.600       2.525         4.00       1.200
## Median         5.900       2.800         4.35       1.300
## Mean           5.936       2.770         4.26       1.326
## 3rd Qu.        6.300       3.000         4.60       1.500
## Max.           7.000       3.400         5.10       1.800
## NA's           1.000       1.000         1.00       1.000
## 
## $virginica
##         sepal.length sepal.width petal.length petal.width
## Min.           4.900       2.200        4.500       1.400
## 1st Qu.        6.225       2.800        5.100       1.800
## Median         6.500       3.000        5.550       2.000
## Mean           6.588       2.974        5.552       2.026
## 3rd Qu.        6.900       3.175        5.875       2.300
## Max.           7.900       3.800        6.900       2.500
## NA's           1.000       1.000        1.000       1.000



Explorations

Of all the iris attributes, petal length and petal width distinguished the classes most clearly. Iris setosa clearly had the smallest petal lengths and petal widths. Histograms of these two attributes showed that versicolor and virginica were largely separated but overlapped slightly in both petal length and petal width. Therefore, to create a model that distinguishes between iris versicolor and iris virginica, I need to examine petal length and petal width more closely.
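
The separation and overlap described above can also be checked numerically. A minimal sketch, assuming the built-in `datasets::iris` as a stand-in for the CSV (its class labels are `setosa`, `versicolor` and `virginica`, without the `Iris-` prefix); `overlaps` is a hypothetical helper:

```r
# Sketch: per-class range of each petal attribute.
# Assumption: datasets::iris stands in for the CSV used in this report.
iris_demo <- datasets::iris

petal_length_range <- sapply(split(iris_demo$Petal.Length, iris_demo$Species), range)
petal_width_range  <- sapply(split(iris_demo$Petal.Width,  iris_demo$Species), range)
rownames(petal_length_range) <- c("min", "max")
rownames(petal_width_range)  <- c("min", "max")

petal_length_range
petal_width_range

# Hypothetical helper: two classes overlap on an attribute when the larger
# of their minimums does not exceed the smaller of their maximums.
overlaps <- function(r, a, b) {
  max(r["min", a], r["min", b]) <= min(r["max", a], r["max", b])
}

overlaps(petal_length_range, "versicolor", "virginica")  # ranges intersect
overlaps(petal_length_range, "setosa", "versicolor")     # ranges are disjoint
```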



Model Building with Rattle

K Means

set.seed(42)
iris_kmeans <- kmeans(na.omit(iris[train, attributes]), 3)
iris_kmeans
## K-means clustering with 3 clusters of sizes 27, 45, 33
## 
## Cluster means:
##   sepal.length sepal.width petal.length petal.width
## 1     6.888889    3.062963     5.748148   2.0666667
## 2     5.904444    2.737778     4.406667   1.4377778
## 3     5.030303    3.442424     1.493939   0.2484848
## 
## Clustering vector:
##  34  53 109  50 115  18  66  63  37  19 134 149  41   8 138  70 116  48 
##   3   1   1   3   2   3   2   2   3   3   2   1   3   3   1   2   1   3 
##  99 122  40  26  22  79  80  49 125  85  91  30  43  74  44 117  64 131 
##   2   2   3   3   3   2   2   3   1   2   2   3   3   2   3   1   2   1 
## 108  90 126  35  97  25  58 120 104 136  73  86  59 100   6  17 103  77 
##   1   2   1   3   2   3   2   2   1   1   2   2   2   2   3   3   1   2 
## 144   5 145  96 128   7  52  12  76  78  46  11 110   2  28 102  65 142 
##   1   3   1   2   2   3   2   3   2   1   3   3   1   3   3   2   2   1 
## 123 107 119 140  42  20  69  29  55  89  92  94  87  31 124 133 148 121 
##   1   2   1   1   3   3   2   3   2   2   2   2   2   3   2   1   1   1 
## 132  45  81  33 147  62 150  75  54  84  38  82 129 143 105 
##   1   3   2   3   2   2   2   2   2   2   3   2   1   2   1 
## 
## Within cluster sum of squares by cluster:
## [1] 15.877037 30.598667  8.491515
##  (between_SS / total_SS =  88.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"


Graphical Representation of kmeans clusters

library(cluster)
cluster::clusplot(na.omit(iris[train, attributes]), iris_kmeans$cluster, color=TRUE, shade=TRUE, main='2D Representation of Clusters')

Decision Tree Model

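library(rattle) # provides fancyRpartPlot()
## Rattle: A free graphical interface for data mining with R.
## Version 4.0.5 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)
# Model fit reconstructed from the call reported by printcp() below
iris_rpart <- rpart(class ~ ., data = iris[train, c(attributes, "class")],
                    method = "class", parms = list(split = "information"),
                    control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
fancyRpartPlot(iris_rpart, main = "Iris Decision Tree")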

print(iris_rpart)
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 69 Iris-versicolor (0.31428571 0.34285714 0.34285714)  
##   2) petal.length< 2.45 33  0 Iris-setosa (1.00000000 0.00000000 0.00000000) *
##   3) petal.length>=2.45 72 36 Iris-versicolor (0.00000000 0.50000000 0.50000000)  
##     6) petal.width< 1.75 39  3 Iris-versicolor (0.00000000 0.92307692 0.07692308) *
##     7) petal.width>=1.75 33  0 Iris-virginica (0.00000000 0.00000000 1.00000000) *
printcp(iris_rpart)
## 
## Classification tree:
## rpart(formula = class ~ ., data = iris[train, c(attributes, "class")], 
##     method = "class", parms = list(split = "information"), control = rpart.control(usesurrogate = 0, 
##         maxsurrogate = 0))
## 
## Variables actually used in tree construction:
## [1] petal.length petal.width 
## 
## Root node error: 69/105 = 0.65714
## 
## n= 105 
## 
##        CP nsplit rel error   xerror     xstd
## 1 0.47826      0  1.000000 1.101449 0.066399
## 2 0.01000      2  0.043478 0.043478 0.024741
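
In the `printcp()` table, `rel error` is expressed relative to the root node error, so it can be converted back into a count of misclassified training rows. A worked check using only the numbers printed above:

```r
# Worked arithmetic from the printcp() output above.
root_node_error <- 69 / 105   # the root misclassifies 69 of 105 training rows
rel_error       <- 0.043478   # rel error after 2 splits
n_train         <- 105

training_error_rate <- root_node_error * rel_error
misclassified_rows  <- round(training_error_rate * n_train)

training_error_rate
misclassified_rows
```

The result is 3 rows, which agrees with the loss of 3 reported at node 6 of the printed tree.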



Evaluate the Model

Kmeans

Confusion Matrix for kmeans performance

The confusion matrix shows that the 1st kmeans cluster grouped 25 virginica and 2 versicolor observations, the 2nd kmeans cluster grouped 34 versicolor and 11 virginica observations, and the 3rd kmeans cluster grouped all 33 of the setosa observations in the training data. This demonstrates some similarity and crossover in attribute measurements between iris versicolor and iris virginica.

table(na.omit(iris[train,5]),iris_kmeans$cluster)
##                  
##                    1  2  3
##   Iris-setosa      0  0 33
##   Iris-versicolor  2 34  0
##   Iris-virginica  25 11  0
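
One way to condense the table above into a single figure is to label each cluster with its most common class and count how many training rows agree with that label. A minimal sketch using the counts printed above; `confusion`, `majority_class` and `agreement` are illustrative names, not objects from the original analysis:

```r
# Counts copied from the table above (rows = class, columns = cluster).
confusion <- matrix(c( 0,  0, 33,
                       2, 34,  0,
                      25, 11,  0),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(class = c("Iris-setosa", "Iris-versicolor", "Iris-virginica"),
                                    cluster = c("1", "2", "3")))

# Most common class within each cluster.
majority_class <- rownames(confusion)[apply(confusion, 2, which.max)]
names(majority_class) <- colnames(confusion)
majority_class

# Share of rows whose class matches the majority class of their cluster.
agreement <- sum(apply(confusion, 2, max)) / sum(confusion)
round(agreement, 3)
```

With these counts, 92 of the 105 training rows (about 88%) fall in a cluster dominated by their own class, and every disagreement involves versicolor and virginica.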

Decision Tree Model

Error Matrices for the Validation

The error matrix for the validation data shows that the decision tree model makes 1 error, predicting an iris virginica observation as iris versicolor, which gives an 11% error when predicting iris virginica.

##                  Predicted
## Actual            Iris-setosa Iris-versicolor Iris-virginica
##   Iris-setosa               5               0              0
##   Iris-versicolor           0               7              0
##   Iris-virginica            0               1              8
##                  Predicted
## Actual            Iris-setosa Iris-versicolor Iris-virginica Error
##   Iris-setosa            0.23            0.00           0.00  0.00
##   Iris-versicolor        0.00            0.32           0.00  0.00
##   Iris-virginica         0.00            0.05           0.36  0.11
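
The code that produced these matrices is not shown. A count matrix of this kind is typically built by cross-tabulating the actual classes against `predict()` output on held-out rows. A minimal sketch, assuming the `rpart` package is installed and using the built-in `datasets::iris` with an illustrative seed and split, so the counts will not match those above:

```r
# Sketch: error (confusion) matrix for an rpart tree on held-out rows.
# Assumptions: datasets::iris stands in for the CSV; seed and split are illustrative.
library(rpart)

set.seed(42)
iris_demo  <- datasets::iris
train_rows <- sample(nrow(iris_demo), floor(0.7 * nrow(iris_demo)))
holdout    <- iris_demo[-train_rows, ]

tree_demo <- rpart(Species ~ ., data = iris_demo[train_rows, ], method = "class")

# type = "class" returns the predicted label rather than class probabilities.
predicted    <- predict(tree_demo, newdata = holdout, type = "class")
error_matrix <- table(Actual = holdout$Species, Predicted = predicted)
error_matrix
```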


Error Matrices for the Testing

The error matrix for the testing data shows 1 iris versicolor observation predicted as iris virginica and 1 iris virginica observation predicted as iris versicolor, giving a 14% error in predicting iris versicolor and a 20% error in predicting iris virginica.

##                  Predicted
## Actual            Iris-setosa Iris-versicolor Iris-virginica
##   Iris-setosa              12               0              0
##   Iris-versicolor           0               6              1
##   Iris-virginica            0               1              4
##                  Predicted
## Actual            Iris-setosa Iris-versicolor Iris-virginica Error
##   Iris-setosa             0.5            0.00           0.00  0.00
##   Iris-versicolor         0.0            0.25           0.04  0.14
##   Iris-virginica          0.0            0.04           0.17  0.20
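
The Error column in the proportion tables is the share of each actual class that was misclassified, that is, one minus the diagonal count divided by the row total. A minimal sketch that recomputes it from the count matrices printed above; `class_error` is a hypothetical helper, not a Rattle function:

```r
# Hypothetical helper: per-class error from a count matrix (rows = actual class).
class_error <- function(counts) round(1 - diag(counts) / rowSums(counts), 2)

classes <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")

# Counts copied from the validation and testing error matrices above.
validation_counts <- matrix(c(5, 0, 0,
                              0, 7, 0,
                              0, 1, 8),
                            nrow = 3, byrow = TRUE, dimnames = list(classes, classes))
testing_counts <- matrix(c(12, 0, 0,
                            0, 6, 1,
                            0, 1, 4),
                         nrow = 3, byrow = TRUE, dimnames = list(classes, classes))

class_error(validation_counts)
class_error(testing_counts)
```

The results reproduce the printed Error columns: 0.00, 0.00, 0.11 for validation and 0.00, 0.14, 0.20 for testing.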