data(iris) # load the built-in iris dataset
head(iris) # view the first few rows
set.seed(42) # set seed to ensure reproducible results
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25) # 3 clusters, one per species in the iris dataset; nstart = 25 keeps the best of 25 random starts
km
K-means clustering with 3 clusters of sizes 38, 62, 50
Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     6.850000    3.073684     5.742105    2.071053
2     5.901613    2.748387     4.393548    1.433871
3     5.006000    3.428000     1.462000    0.246000
Clustering vector:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
[80] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2
Within cluster sum of squares by cluster:
[1] 23.87947 39.82097 15.15100
(between_SS / total_SS = 88.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
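The 88.4 % figure in the printout can be rebuilt from the listed components. The following is a minimal sketch, assuming the same seed and call as above; the variable name `ratio` is introduced here only for illustration.

```r
# Refit with the same settings so the block runs on its own
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Share of the total sum of squares that lies between clusters
ratio <- km$betweenss / km$totss
round(100 * ratio, 1)

# The components are linked: totss = betweenss + tot.withinss,
# and tot.withinss is the sum of the per-cluster withinss values
all.equal(km$tot.withinss, sum(km$withinss))
all.equal(km$totss, km$betweenss + km$tot.withinss)
```

A ratio close to 1 means most of the variation is explained by cluster membership rather than by spread inside the clusters.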
table(km$cluster, iris$Species)
    setosa versicolor virginica
  1      0          2        36
  2      0         48        14
  3     50          0         0
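One way to summarize this table in a single number is cluster purity: credit each cluster with its most common species and divide by the number of flowers. This is only a sketch of one possible summary, not part of the kmeans output; `tab`, `majority` and `purity` are names chosen here for illustration.

```r
# Refit with the same settings so the block runs on its own
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
tab <- table(km$cluster, iris$Species)

# Cluster numbers are arbitrary labels, so take the largest count in each row
majority <- apply(tab, 1, max)

# Fraction of flowers that fall in the majority species of their cluster
purity <- sum(majority) / sum(tab)
purity
```

With the table shown above this works out to (36 + 48 + 50) / 150, so setosa is separated perfectly and all of the disagreement comes from versicolor and virginica.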
plot(iris[,1], iris[,2], col=km$cluster) # plot Sepal Length against Sepal Width, colored by assigned cluster
points(km$centers[,c(1,2)], col=1:3, pch=8, cex=2) # add points for the cluster centers
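plot(iris[,3], iris[,4], col=km$cluster) # plot Petal Length against Petal Width, colored by assigned cluster
points(km$centers[,c(3,4)], col=1:3, pch=8, cex=2) # add points for the cluster centers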
1. How is unsupervised learning related to the statistical clustering problem?
Cluster analysis is the most common unsupervised learning method. The data carry no labels, so the algorithm has to find hidden patterns or groupings on its own, which makes clustering a natural tool for exploratory data analysis. Clusters are formed using a measure of similarity, defined by a metric such as Euclidean distance or a probabilistic distance. Unsupervised learning methods are used in bioinformatics for sequence analysis and genetic clustering, in data mining for sequence and pattern mining, in medical imaging for image segmentation, and in computer vision for object recognition.
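As a small illustration of a similarity measure, the sketch below computes Euclidean distances between the first five iris flowers with base R's dist() and checks one of them by hand. The choice of five rows and the names `x`, `d` and `by_hand` are assumptions made for this example.

```r
# Five iris flowers, numeric measurements only
x <- as.matrix(iris[1:5, 1:4])

# Pairwise Euclidean distances between the rows
d <- dist(x, method = "euclidean")
d

# The same distance between flowers 1 and 2, computed from the definition
by_hand <- sqrt(sum((x[1, ] - x[2, ])^2))
all.equal(as.matrix(d)[1, 2], by_hand)
```

Flowers with a small distance are treated as similar, and K-means assigns each flower to the center it is closest to under this metric.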
The sparcl package in R performs sparse hierarchical and sparse K-means clustering, and the nsprcomp package provides methods for sparse principal component analysis (PCA). Ordinary PCA is available in base R through prcomp() in the stats package.
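For ordinary (non-sparse) PCA, base R's prcomp() needs no extra installation. The sketch below runs it on the four iris measurements; scaling the variables first is a choice made here, not something the text above prescribes, and `pca` and `var_explained` are illustrative names. The sparse variants from sparcl and nsprcomp are not shown because they must be installed separately.

```r
# PCA on the four measurements, centered and scaled to unit variance
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Standard deviations and proportion of variance for each component
summary(pca)

# Scores of the first few flowers on the first two components
head(pca$x[, 1:2])

# Proportion of variance computed directly from the standard deviations
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
var_explained
```

Plotting the first two columns of pca$x, colored by km$cluster, gives a two-dimensional view of the clusters that uses all four measurements at once.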