k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean; that mean (the cluster center) serves as a prototype of the cluster.
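To make the "nearest mean" idea concrete, here is a minimal sketch of the classic assign-then-update loop (Lloyd's iteration). The function name `lloyd_sketch`, the seed, and the iteration cap are illustrative assumptions; R's built-in `kmeans()` uses the Hartigan-Wong algorithm by default, so this is not a reimplementation of it.

```r
# Illustrative sketch of Lloyd's iteration; lloyd_sketch is a hypothetical helper.
lloyd_sketch <- function(x, k, max_iter = 20) {
  x <- as.matrix(x)
  # Use k randomly chosen rows as the initial centers.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every center (n x k matrix).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each point to its nearest center.
    new_cluster <- max.col(-d, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments stable: stop
    cluster <- new_cluster
    # Move each center to the mean of its assigned points (empty clusters keep their center).
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # assumption: any seed works; it only fixes the random start
lloyd_fit <- lloyd_sketch(iris[, 3:4], 3)
table(lloyd_fit$cluster)
```

A single random start like this can settle in a poor local optimum, which is why running several starts and keeping the best one is common practice.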
data(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
library(ggplot2)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 30) # nstart = number of random starting configurations tried; the best result is kept
irisCluster
## K-means clustering with 3 clusters of sizes 50, 52, 48
##
## Cluster means:
## Petal.Length Petal.Width
## 1 1.462000 0.246000
## 2 4.269231 1.342308
## 3 5.595833 2.037500
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
## [106] 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3
## [141] 3 3 3 3 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 2.02200 13.05769 16.29167
## (between_SS / total_SS = 94.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
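The `between_SS / total_SS` figure in the printout can be recomputed from the listed components. The sketch below refits the model under an arbitrary seed (an assumption, used only to make the run repeatable) and stores it in a hypothetical variable `fit`, so the original `irisCluster` object is left alone.

```r
set.seed(42)  # assumption: arbitrary seed for repeatability
fit <- kmeans(iris[, 3:4], 3, nstart = 30)

# The total sum of squares splits into within-cluster and between-cluster parts.
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# tot.withinss is the sum of the per-cluster values in withinss.
all.equal(fit$tot.withinss, sum(fit$withinss))

# Share of the total variation explained by the clustering, in percent.
round(100 * fit$betweenss / fit$totss, 1)
```

A ratio close to 100% means most of the spread in the data lies between clusters rather than inside them.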
table(irisCluster$cluster,iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 4
## 3 0 2 46
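One way to summarize this table in a single number is cluster purity: credit each cluster with its most common species and divide by the number of observations. The sketch below copies the counts from the table above into a hypothetical matrix `tab`; it is a rough agreement measure, not something `kmeans()` reports.

```r
# Counts copied from the printed table (rows: clusters, columns: species).
tab <- matrix(c(50,  0,  0,
                 0, 48,  4,
                 0,  2, 46),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = c("1", "2", "3"),
                              species = c("setosa", "versicolor", "virginica")))

# Purity: sum of each cluster's largest count, divided by the total count.
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity  # (50 + 48 + 46) / 150 = 0.96
```

So 144 of the 150 flowers fall in the cluster dominated by their own species; the six mismatches are all between versicolor and virginica.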
Conceptually, factors are variables in R that take on a limited number of different values; such variables are often referred to as categorical variables. One of the most important uses of factors is in statistical modeling: because categorical variables enter statistical models differently from continuous variables, storing data as factors ensures that the modeling functions treat such data correctly.
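As a small illustration of the difference, the sketch below converts a made-up vector of integer labels (the names `labels_int` and `labels_fac` are hypothetical) into a factor and inspects the result.

```r
labels_int <- c(1L, 2L, 2L, 3L, 1L)   # made-up integer labels
labels_fac <- as.factor(labels_int)   # same labels, stored as categories

is.numeric(labels_int)  # TRUE: treated as numbers
is.factor(labels_fac)   # TRUE: treated as categories
levels(labels_fac)      # "1" "2" "3"
table(labels_fac)       # counts per category
```

Plotting functions make the same distinction: a numeric variable typically maps to a continuous color gradient, while a factor maps to one distinct color per level.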
irisCluster$cluster <- as.factor(irisCluster$cluster) # integer labels to factor, so ggplot uses a discrete color scale
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()
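Elbow method: WSS is the total within-cluster sum of squares, that is, the sum of squared distances from each point to its cluster center. Compute WSS for a range of k and take the elbow (kink) point, beyond which WSS changes little, as the optimal number of clusters.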
wss<-sapply(1:15,function(k){kmeans(iris[,3:4],k,nstart=30,iter.max = 20)$tot.withinss})
wss
## [1] 550.895333 86.390220 31.371359 19.465989 13.916909 11.025145
## [7] 9.236596 7.674414 6.456495 5.550520 5.088786 4.750035
## [13] 4.241221 3.927031 3.519928
plot(1:15,wss)
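Reading the elbow off a plot is a judgment call, so it can help to look at the relative drop in WSS at each step. The sketch below copies the WSS values printed above into a hypothetical vector `wss_printed` and computes the fraction of WSS removed when moving from k - 1 to k clusters; it is a rough aid, not a formal selection rule.

```r
# WSS values copied from the printed output above, for k = 1..15.
wss_printed <- c(550.895333, 86.390220, 31.371359, 19.465989, 13.916909,
                 11.025145, 9.236596, 7.674414, 6.456495, 5.550520,
                 5.088786, 4.750035, 4.241221, 3.927031, 3.519928)

# Fraction of WSS removed by going from k - 1 to k clusters (entries are k = 2..15).
rel_drop <- -diff(wss_printed) / head(wss_printed, -1)
names(rel_drop) <- paste0("k=", 2:15)
round(rel_drop[1:5], 2)
```

The drop is about 84% when going from 1 to 2 clusters and about 64% from 2 to 3, then falls to roughly 38% at k = 4 and keeps shrinking, which is consistent with an elbow around k = 3.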