This dataset contains measurements of sepal length and width and petal length and width for 150 flowers: 50 from each of 3 species of iris.
Necessary packages:
require(ggplot2)
Loading required package: ggplot2
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
summary(iris$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
Histogram of sepal length
qplot(Sepal.Length, data=iris, geom='histogram', fill=Species, alpha=I(1/2))
summary(iris$Sepal.Width)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.800 3.000 3.057 3.300 4.400
summary(iris$Petal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.600 4.350 3.758 5.100 6.900
Histogram of petal length
qplot(Petal.Length, data=iris, geom='histogram')
Estimating a kernel density for the distribution of petal length:
qplot(Petal.Length, data=iris, geom='density')
Separating the petal length histogram on the basis of species:
qplot(Petal.Length, data=iris, geom='histogram', fill=Species, alpha=I(0.5))
Densities separated on the basis of species (semi-transparent fill so that overlapping curves stay visible):
qplot(Petal.Length, data=iris, geom='density', color=Species, fill=Species, alpha=I(0.5))
summary(iris$Petal.Width)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.100 0.300 1.300 1.199 1.800 2.500
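The summaries above pool all three species. As a minimal sketch, the same measurements can be averaged within each species using base R's aggregate(); the variable name species_means is only illustrative and is not part of the original analysis.

```r
# Mean of every numeric measurement within each species.
# iris ships with base R, so no extra packages are needed.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)

# One row per species, one column per measurement.
print(species_means)
```

Comparing rows of this table gives a first hint of which measurements differ most between species.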
cor(iris$Sepal.Length, as.numeric(iris$Species), method='spearman')
[1] 0.7980781
cor(iris$Sepal.Width, as.numeric(iris$Species), method='spearman')
[1] -0.4402896
cor(iris$Petal.Length, as.numeric(iris$Species), method='spearman')
[1] 0.9354305
cor(iris$Petal.Width, as.numeric(iris$Species), method='spearman')
We note that petal length and width are very strongly correlated with species. Are these two numbers enough to correctly predict the species?
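The Spearman correlations with species can also be computed for all four measurements in one call. The sketch below assumes the same integer coding of Species as above (1 = setosa, 2 = versicolor, 3 = virginica, following the factor level order), so the values depend on that ordering; the names measurements, species_code and spearman are illustrative.

```r
# The four numeric measurement columns.
measurements <- iris[, c("Sepal.Length", "Sepal.Width",
                         "Petal.Length", "Petal.Width")]

# Integer code of the species factor (level order: setosa, versicolor, virginica).
species_code <- as.numeric(iris$Species)

# Spearman rank correlation of each measurement with the species code.
spearman <- sapply(measurements,
                   function(x) cor(x, species_code, method = "spearman"))

# Strongest correlation first.
print(round(sort(spearman, decreasing = TRUE), 4))
```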
Let us prepare a data frame with only these two columns:
iris2 <- iris[, c("Petal.Length", "Petal.Width")]
Let us perform k-means clustering on this data set
num_clusters <- 3
set.seed(1111)
result <- kmeans(iris2, num_clusters, nstart=20)
The k-means algorithm has identified 3 centers:
result$centers
Petal.Length Petal.Width
1 5.595833 2.037500
2 1.462000 0.246000
3 4.269231 1.342308
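Each center is the mean of the points assigned to that cluster. A minimal sketch that checks this by recomputing the means from the cluster assignments follows; it repeats the clustering call so that it runs on its own, and cluster_means is an illustrative name.

```r
# Same two columns, seed and settings as in the clustering step.
iris2 <- iris[, c("Petal.Length", "Petal.Width")]
set.seed(1111)
result <- kmeans(iris2, centers = 3, nstart = 20)

# Average petal length and width within each assigned cluster.
cluster_means <- aggregate(iris2, by = list(cluster = result$cluster), FUN = mean)
print(cluster_means)

# Compare with the centers reported by kmeans(), ignoring attributes.
print(all.equal(as.matrix(cluster_means[, -1]), result$centers,
                check.attributes = FALSE))
```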
The cluster assignments are as follows:
result$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[42] 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3
[83] 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1
[124] 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1
Mapping between cluster assignment and original species labels:
table(result$cluster, iris$Species)
setosa versicolor virginica
1 0 2 46
2 50 0 0
3 0 48 4
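Only 6 of the 150 flowers have been misclassified: 2 versicolor were placed in the cluster dominated by virginica, and 4 virginica in the cluster dominated by versicolor.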
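The misclassification count can be read off the table programmatically. The sketch below assumes each cluster is labelled with its most frequent species, which is one reasonable convention rather than something k-means itself provides; confusion, majority and misclassified are illustrative names, and the clustering call is repeated so the block runs on its own.

```r
# Repeat the clustering with the same seed and settings.
iris2 <- iris[, c("Petal.Length", "Petal.Width")]
set.seed(1111)
result <- kmeans(iris2, centers = 3, nstart = 20)

# Rows are clusters, columns are the true species.
confusion <- table(result$cluster, iris$Species)

# Size of the dominant species in each cluster.
majority <- apply(confusion, 1, max)

# Flowers that do not belong to their cluster's dominant species.
misclassified <- sum(confusion) - sum(majority)
agreement <- sum(majority) / sum(confusion)

print(misclassified)
print(round(agreement, 3))
```

Because the cluster numbers are arbitrary, counting by majority species avoids having to match cluster 1, 2, 3 to species names by hand.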
qplot(Petal.Width, Petal.Length, data=iris, color=Species)
qplot(Petal.Width, Petal.Length, data=iris, color=factor(result$cluster))