Part 1

1.1 Create a kmeans object from the first, second, and third columns

cars_k3 = kmeans(mtcars[,1:3], centers =3)

1.2 What is the size of each cluster?

cars_k3$size
## [1]  8  8 16

1.3 What are the centers of each cluster?

cars_k3$centers
##       mpg   cyl     disp
## 1 14.6000 8.000 399.1250
## 2 16.7625 7.500 279.1750
## 3 24.5000 4.625 122.2937

1.4 What is the average disp, wt, and qsec of each cluster?

mtcars$cluster_id<-cars_k3$cluster
car_agg<-aggregate(mtcars, by=list(mtcars$cluster_id), FUN=mean)
car_agg[,c("cluster_id", "disp", "wt", "qsec")]
##   cluster_id     disp     wt     qsec
## 1          1 399.1250 4.2355 16.63000
## 2          2 279.1750 3.5975 17.67875
## 3          3 122.2938 2.5180 18.54312
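The same averages can be requested directly with the formula interface of aggregate(), which avoids averaging every column and then subsetting. This is a minimal sketch, not the method used above: the seed value and the names cars and agg are illustrative assumptions, and the cluster numbering may differ from the output shown.

```r
# Sketch: per-cluster means of disp, wt, and qsec via a formula.
# The seed (1) is an arbitrary choice made only for repeatability.
set.seed(1)
cars <- mtcars
cars$cluster_id <- kmeans(mtcars[, 1:3], centers = 3)$cluster

# cbind() on the left-hand side selects the columns to average;
# the right-hand side names the grouping variable.
agg <- aggregate(cbind(disp, wt, qsec) ~ cluster_id, data = cars, FUN = mean)
print(agg)
```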

1.5 Describe each cluster in English.

car_agg[,c("cluster_id", "mpg", "cyl", "disp")]
##   cluster_id     mpg   cyl     disp
## 1          1 14.6000 8.000 399.1250
## 2          2 16.7625 7.500 279.1750
## 3          3 24.5000 4.625 122.2938

Cluster 1: Automobiles with 8 cylinders, 14.6 miles per gallon, and a displacement of 399 cubic inches, on average.

Cluster 2: Automobiles with 7.5 cylinders, 16.8 miles per gallon, and a displacement of 279 cubic inches, on average.

Cluster 3: Automobiles with 4.6 cylinders, 24.5 miles per gallon, and a displacement of 122 cubic inches, on average.
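Because kmeans() starts from randomly chosen centers, the cluster numbers are arbitrary and can change from one run to the next, so written descriptions can easily fall out of step with a printed table. A minimal sketch of one way to keep them aligned, assuming a fixed seed (42 is arbitrary), nstart = 25 (also an assumption), and a convention of numbering clusters by increasing mean mpg (not part of the original exercise):

```r
# Sketch: reproducible k-means with cluster numbers ordered by mean mpg.
set.seed(42)
fit <- kmeans(mtcars[, 1:3], centers = 3, nstart = 25)

# rank() maps each original cluster number to its position when the
# clusters are sorted by their mpg center (1 = lowest mpg).
new_id <- rank(fit$centers[, "mpg"])
stable_cluster <- as.integer(unname(new_id[fit$cluster]))

# Reorder the centers the same way so that row i describes cluster i.
stable_centers <- fit$centers[order(fit$centers[, "mpg"]), ]
rownames(stable_centers) <- 1:3
print(stable_centers)
```

With this convention, cluster 1 is always the lowest-mpg group, whatever numbering kmeans() happened to produce.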

Part 2

2.1 Find a data set with at least 4 columns of numeric data and a categorical column

data(iris)

2.2 Run several scatter plots of the data

attach(iris)
plot(Sepal.Length, Sepal.Width, main="Relationship between Sepal Length and Sepal Width")

plot(Sepal.Length, Petal.Length, main="Relationship between Sepal Length and Petal Length")

plot(Sepal.Length, Petal.Width, main="Relationship between Sepal Length and Petal Width")

plot(Sepal.Width, Petal.Length, main="Relationship between Sepal Width and Petal Length")

plot(Sepal.Width, Petal.Width, main="Relationship between Sepal Width and Petal Width")

plot(Petal.Length, Petal.Width, main="Relationship between Petal Length and Petal Width")
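The six plots above can also be drawn as a single scatter-plot matrix with pairs(), which needs no attach(). This is a sketch only: the color choices and the name species_cols are assumptions made for illustration.

```r
# Sketch: all pairwise scatter plots of the four numeric columns,
# with one color per species.
data(iris)
species_cols <- c("red", "darkgreen", "blue")[as.integer(iris$Species)]
pairs(iris[, 1:4], col = species_cols,
      main = "Iris measurements by species")
```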

2.3 Create a kmeans object from the numeric data; you can pick K to be whatever you want

iris_k4 = kmeans(iris[,1:4], centers =4)
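K = 4 here is a free choice. One common heuristic for picking K is the "elbow" plot of total within-cluster sum of squares against K. The sketch below is an optional extra rather than part of the exercise. The seed, the range 1 to 8, and nstart = 20 are all assumptions.

```r
# Sketch: total within-cluster sum of squares for K = 1..8.
set.seed(123)
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss
})

# Look for the K after which the curve flattens out.
plot(ks, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```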

2.4 Determine the size of each cluster

iris_k4$size
## [1] 32 28 40 50

2.5 Determine the centers of each cluster

iris_k4$centers
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     6.912500    3.100000     5.846875    2.131250
## 2     5.532143    2.635714     3.960714    1.228571
## 3     6.252500    2.855000     4.815000    1.625000
## 4     5.006000    3.428000     1.462000    0.246000

2.6 Compare the clusters to the categorical data

iris$cluster_id<-iris_k4$cluster
table(iris$cluster_id, iris$Species)
##    
##     setosa versicolor virginica
##   1      0          0        32
##   2      0         27         1
##   3      0         23        17
##   4     50          0         0
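The table shows that cluster 4 is exactly the setosa flowers, cluster 1 is all virginica, and clusters 2 and 3 split versicolor, with cluster 3 also absorbing 17 virginica. One way to summarize the agreement in a single number is cluster purity, the share of flowers that belong to the majority species of their cluster. The sketch below re-enters the counts from the table by hand. The names tab, majority, and purity are illustrative.

```r
# Sketch: purity computed from the cross-tabulation printed above.
tab <- matrix(c( 0,  0, 32,
                 0, 27,  1,
                 0, 23, 17,
                50,  0,  0),
              nrow = 4, byrow = TRUE,
              dimnames = list(cluster = c("1", "2", "3", "4"),
                              species = c("setosa", "versicolor", "virginica")))

# Majority species in each cluster (row-wise maximum).
majority <- unname(colnames(tab)[apply(tab, 1, which.max)])

# Purity = (32 + 27 + 23 + 50) / 150
purity <- sum(apply(tab, 1, max)) / sum(tab)
print(majority)
print(purity)   # 0.88
```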

Part 3

3.1 For your chosen data set, describe what each row of data represents

data(USArrests)

Each row represents one of the 50 U.S. states.

3.2 Describe each of your columns used - give a one sentence description of the column

Murder: number of arrests for murder per 100,000 residents

Assault: number of arrests for assault per 100,000 residents

Rape: number of arrests for rape per 100,000 residents

UrbanPop: percentage of people in the state who live in an urban area

3.3 If you know it, describe how the data was generated

For the clusters

Arrests_k3 = kmeans(USArrests[,c(1,2,4)], centers =3)

3.4 Describe the size and means of clusters

Arrests_k3$size
## [1] 14 20 16
Arrests_k3$centers
##      Murder  Assault     Rape
## 1  8.214286 173.2857 22.84286
## 2  4.270000  87.5500 14.39000
## 3 11.812500 272.5625 28.37500
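The Assault centers are an order of magnitude larger than the Murder centers, so Euclidean distances, and therefore the clusters, are driven mostly by Assault. A possible variation, sketched here under assumptions (the seed, nstart = 25, and the names crime, crime_scaled, and fit_scaled are illustrative), is to standardize the columns before clustering. The result may differ from the clusters reported above.

```r
# Sketch: k-means on standardized crime columns.
set.seed(2024)
crime <- USArrests[, c("Murder", "Assault", "Rape")]

# scale() centers each column to mean 0 and rescales it to sd 1,
# so no single variable dominates the distance calculation.
crime_scaled <- scale(crime)
fit_scaled <- kmeans(crime_scaled, centers = 3, nstart = 25)

print(fit_scaled$size)
# Report cluster means on the original scale for readability.
print(aggregate(crime, by = list(cluster = fit_scaled$cluster), FUN = mean))
```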

3.5 Give a one- or two-word description to each cluster - in other words, give each cluster a label or name. This is an exercise in turning your numeric data into something descriptive for non-statisticians

Cluster 1: Medium level of security (states with moderate arrest rates for murder, assault, and rape)

Cluster 2: High level of security (states with low arrest rates for murder, assault, and rape)

Cluster 3: Low level of security (states with high arrest rates for murder, assault, and rape)