Assignment #8 Cluster Analysis

Part 1

Using the mtcars data set: Create a kmeans object from the first, second, and third columns. What is the size of each cluster? What are the centers of each cluster? What is the average disp, wt, and qsec of each cluster? Describe each cluster in English.

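# first three columns of mtcars: mpg, cyl, disp
mtcarsnum <- mtcars[, 1:3]
# k-means with 3 clusters; starting centers are chosen at random
mtcars_k3 <- kmeans(mtcarsnum, centers = 3)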
mtcars_k3$size

## [1] 14  9  9

mtcars_k3$centers

##        mpg      cyl      disp
## 1 15.10000 8.000000 353.10000
## 2 27.34444 4.000000  96.55556
## 3 20.60000 5.555556 174.52222
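
kmeans() picks its starting centers at random, so the cluster numbering (and occasionally the grouping itself) can change from one run to the next. A minimal sketch of one way to make the result repeatable, assuming an arbitrary seed of 123 and nstart = 25. Neither value comes from the original run, so the cluster numbers it produces may not match the output above.

```r
# Fix the random number generator so repeated runs give the same clusters.
# The seed value 123 is an arbitrary choice for illustration.
set.seed(123)
run_a <- kmeans(mtcars[, 1:3], centers = 3, nstart = 25)

set.seed(123)
run_b <- kmeans(mtcars[, 1:3], centers = 3, nstart = 25)

# Same seed and same settings should give the same assignments
identical(run_a$cluster, run_b$cluster)

# Cluster sizes for the seeded run
run_a$size
```

nstart = 25 asks kmeans() to try 25 random starts and keep the one with the smallest total within-cluster sum of squares, which makes a poor local solution less likely.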

# average of every mtcars column within each cluster
mtcars_new <- aggregate(mtcars, by = list(mtcars_k3$cluster), FUN = mean)
mtcars_new

##   Group.1      mpg      cyl      disp        hp     drat       wt     qsec
## 1       1 15.10000 8.000000 353.10000 209.21429 3.229286 3.999214 16.77214
## 2       2 27.34444 4.000000  96.55556  83.55556 4.130000 2.089222 18.62333
## 3       3 20.60000 5.555556 174.52222 112.55556 3.634444 3.128889 18.74889
##          vs        am     gear     carb
## 1 0.0000000 0.1428571 3.285714 3.500000
## 2 0.8888889 0.8888889 4.111111 1.444444
## 3 0.6666667 0.3333333 3.888889 3.111111

mtcars_new[,c("Group.1", "disp", "wt", "qsec")]

##   Group.1      disp       wt     qsec
## 1       1 353.10000 3.999214 16.77214
## 2       2  96.55556 2.089222 18.62333
## 3       3 174.52222 3.128889 18.74889

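mtcars_new gives the average of every variable within each cluster. Cluster 1 (14 cars) holds the large 8-cylinder cars: the biggest engines (mean disp about 353 cu. in.), the heaviest bodies (mean wt about 4.0, i.e. roughly 4,000 lbs), and the quickest quarter-mile times (mean qsec about 16.8 seconds). Cluster 2 (9 cars) holds the small 4-cylinder cars: the smallest engines (about 97 cu. in.), the lightest bodies (about 2,100 lbs), and the best fuel economy (27.3 mpg). Cluster 3 (9 cars) sits in between on engine size (about 175 cu. in.) and weight (about 3,100 lbs), with a quarter-mile time (about 18.7 seconds) close to that of cluster 2.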
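
Because kmeans() uses Euclidean distance on the raw values, disp (with values in the hundreds) outweighs mpg and cyl in the clustering above. A hedged sketch of one common alternative is to standardize the columns with scale() before clustering. The seed and nstart values are assumptions for illustration, and the resulting clusters need not match the ones reported above.

```r
# Standardize each column to mean 0 and standard deviation 1
mtcars_scaled <- scale(mtcars[, 1:3])

set.seed(123)  # arbitrary seed, for repeatability only
k3_scaled <- kmeans(mtcars_scaled, centers = 3, nstart = 25)

# Per-cluster averages, reported in the original (unscaled) units
scaled_means <- aggregate(mtcars[, c("disp", "wt", "qsec")],
                          by = list(cluster = k3_scaled$cluster),
                          FUN = mean)
scaled_means
```

Clustering on the scaled copy while summarizing the unscaled columns keeps the averages readable in their usual units.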

Part 2

Find a data set with at least 4 columns of numeric data and a categorical column. Run several scatter plots of the data. Create a kmeans object from the numeric data; you can pick K to be whatever you want. Determine the size of each cluster. Determine the centers of each cluster. Compare the clusters to the categorical data column, as we did with the iris$Species column.

# to look at the number of drivers killed and the number of drivers killed or seriously injured over time
plot(Seatbelts[,1])

plot(Seatbelts[,2])

# to look at the scatter plot between number of drivers killed and number of drivers killed or seriously injured
plot(Seatbelts[,2],Seatbelts[,1])

# to look at the scatter plot between number of drivers killed and petrol price
plot(Seatbelts[,6],Seatbelts[,1])

# to look at the scatter plot between number of drivers killed and distance driven
plot(Seatbelts[,5],Seatbelts[,1])

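# first four columns: DriversKilled, drivers, front, rear
Seatbeltsnum <- Seatbelts[, 1:4]
# k-means with 4 clusters; starting centers are chosen at random
Seatbelts_k4 <- kmeans(Seatbeltsnum, centers = 4)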
Seatbelts_k4$size

## [1] 62 65 47 18

Seatbelts_k4$centers

##   DriversKilled  drivers     front     rear
## 1      122.9355 1691.452  892.4194 420.0645
## 2      108.1538 1479.492  724.6615 353.3077
## 3      156.0213 2078.000 1030.5745 452.2340
## 4       88.5000 1222.000  548.6667 376.0000

Seatbelts_play <- Seatbelts
head(Seatbelts_play)

## [1] 107  97 102  87 119 106

Seatbelts_play_cluster_id <- Seatbelts_k4$cluster
Seatbelts_play_cluster_id

##   [1] 1 2 2 2 1 1 1 1 1 1 3 3 1 1 1 1 1 2 1 3 1 3 3 3 3 1 1 1 1 1 1 3 1 3 3
##  [36] 3 3 1 1 1 3 3 3 1 1 3 3 3 3 3 1 3 3 1 3 3 3 3 3 3 2 2 2 2 1 1 1 3 3 3
##  [71] 3 3 2 4 1 2 2 2 2 1 1 2 1 3 2 2 2 2 2 4 2 2 2 1 3 3 2 2 2 2 2 2 1 1 2
## [106] 1 3 3 3 2 2 2 2 1 1 1 1 1 3 3 1 2 1 2 2 2 2 1 1 1 3 3 1 4 2 2 2 2 2 1
## [141] 2 1 1 3 2 2 2 2 2 2 1 2 1 3 1 1 2 2 2 2 2 2 2 1 1 1 3 3 2 4 4 4 4 4 4
## [176] 4 2 2 2 2 4 4 4 4 4 4 4 4 2 2 1 1

# to compare the clusters to the law data column that indicates whether seatbelt law was in force (1 means law in force and 0 means no law in force)
table(Seatbelts_play[,8],Seatbelts_play_cluster_id)

##    Seatbelts_play_cluster_id
##      1  2  3  4
##   0 60 59 47  3
##   1  2  6  0 15
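
Raw counts are hard to compare across clusters of different sizes. In the table above, for instance, 15 of the 18 months in cluster 4 fall under the law, while cluster 3 has none. A small sketch that converts such a cross-tabulation into within-cluster proportions with prop.table(). It reruns kmeans() with an assumed seed and nstart, so its cluster numbering may differ from the table above.

```r
set.seed(123)  # arbitrary seed, for repeatability only
k4 <- kmeans(Seatbelts[, 1:4], centers = 4, nstart = 25)

# Cross-tabulate the law indicator (0 = no law, 1 = law in force) by cluster
law <- as.vector(Seatbelts[, "law"])
tab <- table(law = law, cluster = k4$cluster)

# margin = 2 divides each column by its total, so every cluster sums to 1
law_share <- prop.table(tab, margin = 2)
round(law_share, 2)
```

A cluster whose law = 1 share is close to 1 consists almost entirely of months after the law took effect.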

Part 3

For your chosen data set, describe what each row of data represents; describe each of your columns used – give a one sentence description of the column. If you know it, describe how the data was generated

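Each row holds the figures for one month. Seatbelts is a built-in R data set of road casualties in Great Britain, recorded monthly from January 1969 to December 1984 (192 rows).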

Data description:
1. DriversKilled - car drivers killed
2. drivers - monthly totals of car drivers in Great Britain killed or seriously injured
3. front - front-seat passengers killed or seriously injured
4. rear - rear-seat passengers killed or seriously injured
5. kms - distance driven
6. PetrolPrice - petrol price
7. VanKilled - number of van (‘light goods vehicle’) drivers killed
8. law - categorical data (0 or 1) that reflects whether the seatbelt law was in effect that month

For the clusters, describe the size and means of clusters. Give a one- or two-word description to each cluster – in other words, give each cluster a label or name. This is an exercise in turning your numeric data into something descriptive for non-statisticians

Seatbelts_k4$size

## [1] 62 65 47 18

Seatbelts_k4$centers

##   DriversKilled  drivers     front     rear
## 1      122.9355 1691.452  892.4194 420.0645
## 2      108.1538 1479.492  724.6615 353.3077
## 3      156.0213 2078.000 1030.5745 452.2340
## 4       88.5000 1222.000  548.6667 376.0000

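The cluster sizes are 62, 65, 47, and 18 months. Cluster 3 (47 months) has the highest mean on all four variables, including about 156 drivers killed per month, so it can be labeled "highest driver deaths". Clusters 1 and 2 (62 and 65 months) are close to each other, with about 123 and 108 drivers killed per month, and may share common factors; both can be labeled "middle driver deaths". Cluster 4 (18 months) has the lowest means for drivers killed (88.5), drivers killed or seriously injured, and front-seat casualties, although its rear-seat mean (376.0) is higher than that of cluster 2 (353.3). It can be labeled "lowest driver deaths", and 15 of its 18 months fall under the seatbelt law.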
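
Since cluster numbers are arbitrary and can change between runs, labels are safer when derived from the centers instead of being typed against a cluster number. A minimal sketch, assuming a rerun of kmeans() with an arbitrary seed and a hypothetical set of four label names ordered by mean DriversKilled:

```r
set.seed(123)  # arbitrary seed, for repeatability only
k4 <- kmeans(Seatbelts[, 1:4], centers = 4, nstart = 25)

# Cluster numbers sorted from lowest to highest mean DriversKilled
ord <- order(k4$centers[, "DriversKilled"])

# Hypothetical label names, listed in the same low-to-high order
label_names <- c("lowest", "lower-middle", "upper-middle", "highest")

# Attach each label to the cluster number it belongs to
cluster_label <- character(4)
cluster_label[ord] <- label_names

# One row per cluster: size, rounded centers, and descriptive label
summary_k4 <- data.frame(cluster = 1:4,
                         size = k4$size,
                         round(k4$centers, 1),
                         label = cluster_label)
summary_k4
```

The labels follow the data, so the same code still names the clusters sensibly if a different seed reorders them.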