Part 1
1.1 Create a kmeans object from the first, second, and third columns
cars_k3 = kmeans(mtcars[,1:3], centers =3)
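Because kmeans starts from randomly chosen centers, the cluster numbering (and occasionally the membership) can change from one run to the next. A minimal sketch of one way to make the result repeatable, assuming an arbitrary seed of 42 and nstart = 25 (both illustrative choices, not taken from the original analysis):

```r
# Fix the random seed so the same call gives the same clusters.
# The seed 42 and nstart = 25 are arbitrary illustrative values.
set.seed(42)
cars_k3_a <- kmeans(mtcars[, 1:3], centers = 3, nstart = 25)

set.seed(42)
cars_k3_b <- kmeans(mtcars[, 1:3], centers = 3, nstart = 25)

# Same seed, same call: the cluster assignments are identical.
identical(cars_k3_a$cluster, cars_k3_b$cluster)
```

With nstart greater than 1, kmeans tries several random starts and keeps the one with the lowest total within-cluster sum of squares.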
1.2 What is the size of each cluster?
cars_k3$size
## [1] 8 8 16
1.3 What are the centers of each cluster?
cars_k3$centers
##       mpg   cyl     disp
## 1 14.6000 8.000 399.1250
## 2 16.7625 7.500 279.1750
## 3 24.5000 4.625 122.2937
1.4 What is the average disp, wt, and qsec of each cluster?
mtcars$cluster_id<-cars_k3$cluster
car_agg<-aggregate(mtcars, by=list(mtcars$cluster_id), FUN=mean)
car_agg[,c("cluster_id", "disp", "wt", "qsec")]
##   cluster_id     disp     wt     qsec
## 1          1 399.1250 4.2355 16.63000
## 2          2 279.1750 3.5975 17.67875
## 3          3 122.2938 2.5180 18.54312
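The same per-cluster averages can be computed for just the columns of interest, instead of averaging every column and subsetting afterwards. A minimal sketch using the formula interface of aggregate; the names cars and car_means and the seed are assumptions made for this example:

```r
set.seed(1)  # arbitrary seed, only to make this sketch repeatable

# Work on a copy so the built-in mtcars data set is left untouched.
cars <- mtcars
cars$cluster_id <- kmeans(cars[, 1:3], centers = 3)$cluster

# Average only disp, wt, and qsec, grouped by cluster.
car_means <- aggregate(cbind(disp, wt, qsec) ~ cluster_id, data = cars, FUN = mean)
car_means
```

The result has one row per cluster and one column per averaged variable, so no second subsetting step is needed.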
1.5 Describe each cluster in English.
car_agg[,c("cluster_id", "mpg", "cyl", "disp")]
##   cluster_id     mpg   cyl     disp
## 1          1 14.6000 8.000 399.1250
## 2          2 16.7625 7.500 279.1750
## 3          3 24.5000 4.625 122.2938
Cluster 1: Automobiles with, on average, 8 cylinders, 14.6 miles per gallon, and a displacement of 399 cu. in.
Cluster 2: Automobiles with, on average, 7.5 cylinders, 16.8 miles per gallon, and a displacement of 279 cu. in.
Cluster 3: Automobiles with, on average, 4.6 cylinders, 24.5 miles per gallon, and a displacement of 122 cu. in.
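Since cluster numbers are arbitrary and can change between runs, one option is to build the written descriptions directly from the centers so that each sentence always matches its cluster number. A minimal sketch; the names km, centers, and descriptions and the seed are assumptions made for this example:

```r
set.seed(1)  # arbitrary seed, only to make this sketch repeatable
km <- kmeans(mtcars[, 1:3], centers = 3)

# Round the centers to one decimal place for readability.
centers <- round(km$centers, 1)

# Build one sentence per cluster, in cluster-number order.
descriptions <- sprintf(
  "Cluster %d: %s cylinders, %s miles per gallon, %s cu. in. displacement, on average.",
  seq_len(nrow(centers)),
  centers[, "cyl"], centers[, "mpg"], centers[, "disp"]
)
writeLines(descriptions)
```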
Part 2
2.1 Find a data set with at least 4 columns of numeric data and a categorical column
data(iris)
2.2 Run several scatter plots of the data
attach(iris)
plot(Sepal.Length, Sepal.Width, main="Relationship between Sepal Length and Sepal Width")
plot(Sepal.Length, Petal.Length, main="Relationship between Sepal Length and Petal Length")
plot(Sepal.Length, Petal.Width, main="Relationship between Sepal Length and Petal Width")
plot(Sepal.Width, Petal.Length, main="Relationship between Sepal Width and Petal Length")
plot(Sepal.Width, Petal.Width, main="Relationship between Sepal Width and Petal Width")
plot(Petal.Length, Petal.Width, main="Relationship between Petal Length and Petal Width")
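All six pairings can also be drawn in a single figure with pairs, which avoids the need for attach. A minimal sketch; the name numeric_cols and the choice to color points by species are assumptions made for this example:

```r
# Keep only the four numeric measurement columns.
numeric_cols <- iris[, 1:4]

# One panel per pair of columns; points are colored by species.
pairs(numeric_cols,
      col = as.integer(iris$Species),
      main = "Pairwise scatter plots of the iris measurements")
```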
2.3 Create a kmeans object from the numeric data; you can pick K to be whatever you want
iris_k4 = kmeans(iris[,1:4], centers =4)
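K = 4 is a free choice here. If a more informed choice is wanted, one common heuristic is the "elbow" plot: run kmeans for several values of K and look for the point where the total within-cluster sum of squares stops dropping sharply. A minimal sketch, assuming K from 1 to 8, nstart = 10, and an arbitrary seed (none of which come from the original analysis):

```r
set.seed(123)  # arbitrary seed, only to make this sketch repeatable

k_values <- 1:8

# Total within-cluster sum of squares for each candidate K.
wss <- sapply(k_values, function(k) {
  kmeans(iris[, 1:4], centers = k, nstart = 10)$tot.withinss
})

plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```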
2.4 Determine the size of each cluster
iris_k4$size
## [1] 32 28 40 50
2.5 Determine the centers of each cluster
iris_k4$centers
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.912500    3.100000     5.846875    2.131250
## 2     5.532143    2.635714     3.960714    1.228571
## 3     6.252500    2.855000     4.815000    1.625000
## 4     5.006000    3.428000     1.462000    0.246000
2.6 Compare the clusters to the categorical data
iris$cluster_id<-iris_k4$cluster
table(iris$cluster_id, iris$Species)
##    
##     setosa versicolor virginica
##   1      0          0        32
##   2      0         27         1
##   3      0         23        17
##   4     50          0         0
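The comparison can be summarised with row proportions and a single "purity" figure: the share of flowers that belong to the most common species within their cluster. For the table above this would be (32 + 27 + 23 + 50) / 150 = 0.88. A minimal sketch; the names km, tab, and purity and the seed are assumptions made for this example, and a fresh kmeans run may number or split the clusters differently:

```r
set.seed(7)  # arbitrary seed, only to make this sketch repeatable
km <- kmeans(iris[, 1:4], centers = 4)

# Cross-tabulate cluster membership against species.
tab <- table(cluster = km$cluster, species = iris$Species)

# Share of each species within each cluster (rows sum to 1).
round(prop.table(tab, margin = 1), 2)

# Purity: fraction of points in the majority species of their cluster.
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity
```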
Part 3
3.1 For your chosen data set, describe what each row of data represents
data(USArrests)
3.2 Describe each of your columns used - give a one sentence description of the column
Murder: number of murder arrests per 100,000 residents
Assault: number of assault arrests per 100,000 residents
Rape: number of rape arrests per 100,000 residents
UrbanPop: percentage of the state's population living in urban areas
3.3 If you know it, describe how the data was generated
For the clusters
Arrests_k3 = kmeans(USArrests[,c(1,2,4)], centers =3)
3.4 Describe the size and means of clusters
Arrests_k3$size
## [1] 14 20 16
Arrests_k3$centers
##      Murder  Assault     Rape
## 1  8.214286 173.2857 22.84286
## 2  4.270000  87.5500 14.39000
## 3 11.812500 272.5625 28.37500
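Assault values are on a much larger numeric scale than Murder and Rape, so on the raw data the Euclidean distances used by kmeans are driven mostly by Assault. A minimal sketch of one possible alternative, standardising the columns before clustering; this is not the method used above, and the names arrest_cols, arrests_scaled, arrests_scaled_k3, and scaled_means as well as the seed and nstart value are assumptions made for this example:

```r
set.seed(2024)  # arbitrary seed, only to make this sketch repeatable

arrest_cols <- USArrests[, c("Murder", "Assault", "Rape")]

# Standardise each column to mean 0 and standard deviation 1.
arrests_scaled <- scale(arrest_cols)

arrests_scaled_k3 <- kmeans(arrests_scaled, centers = 3, nstart = 25)

# Report the cluster means in the original units for readability.
scaled_means <- aggregate(arrest_cols,
                          by = list(cluster_id = arrests_scaled_k3$cluster),
                          FUN = mean)
scaled_means
```

The cluster sizes and means from this variant need not match the ones shown above, since the three columns now contribute equally to the distance.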
3.5 Give a one- or two-word description of each cluster - in other words, give each cluster a label or name. This is an exercise in turning your numeric data into something descriptive for non-statisticians.
Cluster 1: Medium security (states with medium arrest rates for murder, assault, and rape)
Cluster 2: High security (states with low arrest rates for murder, assault, and rape)
Cluster 3: Low security (states with high arrest rates for murder, assault, and rape)
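Because the cluster numbers can change between runs, the labels can be attached by ranking the cluster centers rather than by hard-coding cluster 1, 2, and 3. A minimal sketch that ranks clusters by their Assault center; ranking on Assault alone, the shortened label wording, and the names km, rank_by_assault, security_labels, and cluster_label are assumptions made for this example:

```r
set.seed(99)  # arbitrary seed, only to make this sketch repeatable
km <- kmeans(USArrests[, c("Murder", "Assault", "Rape")], centers = 3)

# Rank the clusters by their Assault center: 1 = lowest, 3 = highest.
rank_by_assault <- rank(km$centers[, "Assault"])

# Lowest arrest rate maps to the highest security label.
security_labels <- c("High security", "Medium security", "Low security")

# Look up each state's cluster rank, then its label.
cluster_label <- security_labels[rank_by_assault[km$cluster]]

table(cluster_label)
```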