Using the mtcars data set: Create a kmeans object from the first, second, and third columns. What is the size of each cluster? What are the centers of each cluster? What is the average disp, wt, and qsec of each cluster? Describe each cluster in English.
mtcarsnum <- mtcars[,1:3]
mtcars_k3 <- kmeans(mtcarsnum, centers=3)
mtcars_k3$size
## [1] 14 9 9
mtcars_k3$centers
## mpg cyl disp
## 1 15.10000 8.000000 353.10000
## 2 27.34444 4.000000 96.55556
## 3 20.60000 5.555556 174.52222
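kmeans chooses its starting centers at random, so the cluster numbering (and occasionally the sizes) can change from one run to the next. A minimal sketch of one way to make the result repeatable, assuming an arbitrary seed of 42 and nstart = 25; neither value comes from the run shown above:

```r
# Assumption: the seed (42) and nstart = 25 are illustrative choices,
# not settings used in the original run.
set.seed(42)
mtcarsnum <- mtcars[, 1:3]                 # mpg, cyl, disp
fit <- kmeans(mtcarsnum, centers = 3, nstart = 25)

fit$size                                   # cluster sizes; label order is arbitrary

# Sort the centers by displacement so clusters are easier to compare across runs
fit$centers[order(fit$centers[, "disp"]), ]
```

With nstart greater than 1, kmeans tries several random starts and keeps the one with the lowest total within-cluster sum of squares.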
mtcars_k3$cluster_id <- mtcars_k3$cluster
mtcars_new<-aggregate(mtcars, by=list(mtcars_k3$cluster_id), FUN=mean)
mtcars_new
## Group.1 mpg cyl disp hp drat wt qsec
## 1 1 15.10000 8.000000 353.10000 209.21429 3.229286 3.999214 16.77214
## 2 2 27.34444 4.000000 96.55556 83.55556 4.130000 2.089222 18.62333
## 3 3 20.60000 5.555556 174.52222 112.55556 3.634444 3.128889 18.74889
## vs am gear carb
## 1 0.0000000 0.1428571 3.285714 3.500000
## 2 0.8888889 0.8888889 4.111111 1.444444
## 3 0.6666667 0.3333333 3.888889 3.111111
mtcars_new[,c("Group.1", "disp", "wt", "qsec")]
## Group.1 disp wt qsec
## 1 1 353.10000 3.999214 16.77214
## 2 2 96.55556 2.089222 18.62333
## 3 3 174.52222 3.128889 18.74889
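mtcars_new gives the average of every variable within each cluster. In plain English, for this run: cluster 1 (14 cars) is the big-engine group. Every car in it has 8 cylinders, and it has the largest displacement (about 353 cu. in.), the heaviest weight (about 4,000 lb), the lowest fuel economy (15.1 mpg), and the quickest quarter-mile time (16.8 s). Cluster 2 (9 cars) is the small, light group: all 4-cylinder, about 97 cu. in., about 2,090 lb, and the best fuel economy (27.3 mpg). Cluster 3 (9 cars) sits in between: a mix of 4- and 6-cylinder cars, about 175 cu. in., about 3,130 lb, and 20.6 mpg. Clusters 2 and 3 have similar quarter-mile times (18.6 s and 18.7 s).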
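The per-cluster averages of disp, wt, and qsec can also be computed directly, without averaging every column first. A sketch under the assumption of an arbitrary seed and a working copy named cars (both are illustrative, not part of the analysis above):

```r
set.seed(42)  # assumption: arbitrary seed so the sketch is repeatable
fit <- kmeans(mtcars[, 1:3], centers = 3, nstart = 25)

# Attach the cluster label to a copy of the data (hypothetical name: cars),
# then average only the three requested columns within each cluster.
cars <- mtcars
cars$cluster <- fit$cluster
cluster_means <- aggregate(cbind(disp, wt, qsec) ~ cluster, data = cars, FUN = mean)
cluster_means
```

Storing the label in a data frame column, rather than on the kmeans object, keeps each car and its cluster in the same row.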
Find a data set with at least 4 columns of numeric data and a categorical column. Run several scatter plots of the data. Create a kmeans object from the numeric data; you can pick K to be whatever you want. Determine the size of each cluster. Determine the centers of each cluster. Compare the clusters to the categorical data column as we did with the iris$Species column.
# plot the number of drivers killed, and the number of drivers killed or seriously injured, over time
plot(Seatbelts[,1])
plot(Seatbelts[,2])
# scatter plot of the number of drivers killed against the number of drivers killed or seriously injured
plot(Seatbelts[,2],Seatbelts[,1])
# to look at the scatter plot between number of drivers killed and petrol price
plot(Seatbelts[,6],Seatbelts[,1])
# to look at the scatter plot between number of drivers killed and distance driven
plot(Seatbelts[,5],Seatbelts[,1])
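Selecting columns by position makes the plots above hard to read without the column list at hand. A small sketch of the same kind of plot using column names and axis labels, assuming a data frame copy named sb (a hypothetical name); the correlation call is an added numeric companion, not part of the original analysis:

```r
# Seatbelts is a multivariate time series in base R's datasets package;
# converting it to a data frame allows access by column name.
sb <- as.data.frame(Seatbelts)

# Same scatter plot as the petrol-price one, with labeled axes
plot(sb$PetrolPrice, sb$DriversKilled,
     xlab = "Petrol price", ylab = "Car drivers killed per month")

# Correlation of drivers killed with three of the other columns
cors <- cor(sb$DriversKilled, sb[, c("drivers", "kms", "PetrolPrice")])
cors
```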
Seatbeltsnum <- Seatbelts[,1:4]
Seatbelts_k4 <- kmeans(Seatbeltsnum, centers=4)
Seatbelts_k4$size
## [1] 62 65 47 18
Seatbelts_k4$centers
## DriversKilled drivers front rear
## 1 122.9355 1691.452 892.4194 420.0645
## 2 108.1538 1479.492 724.6615 353.3077
## 3 156.0213 2078.000 1030.5745 452.2340
## 4 88.5000 1222.000 548.6667 376.0000
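The centers show that drivers is measured in the thousands while DriversKilled is in the hundreds, so the Euclidean distance used by kmeans is driven mostly by drivers. One possible variation, which is not what was done here, is to standardize the columns before clustering. A sketch assuming an arbitrary seed and nstart = 25:

```r
set.seed(42)  # assumption: arbitrary seed for repeatability
sb_num <- as.data.frame(Seatbelts)[, 1:4]

# Standardize each column to mean 0 and standard deviation 1
sb_scaled <- scale(sb_num)
fit_scaled <- kmeans(sb_scaled, centers = 4, nstart = 25)

fit_scaled$size

# Report cluster means on the original scale by averaging the raw data
scaled_means <- aggregate(sb_num, by = list(cluster = fit_scaled$cluster), FUN = mean)
scaled_means
```

The resulting clusters will generally differ from the unscaled ones, so the sizes and centers reported in this document do not carry over.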
Seatbelts_play <- Seatbelts
head(Seatbelts_play)
## [1] 107 97 102 87 119 106
Seatbelts_play_cluster_id <- Seatbelts_k4$cluster
Seatbelts_play_cluster_id
## [1] 1 2 2 2 1 1 1 1 1 1 3 3 1 1 1 1 1 2 1 3 1 3 3 3 3 1 1 1 1 1 1 3 1 3 3
## [36] 3 3 1 1 1 3 3 3 1 1 3 3 3 3 3 1 3 3 1 3 3 3 3 3 3 2 2 2 2 1 1 1 3 3 3
## [71] 3 3 2 4 1 2 2 2 2 1 1 2 1 3 2 2 2 2 2 4 2 2 2 1 3 3 2 2 2 2 2 2 1 1 2
## [106] 1 3 3 3 2 2 2 2 1 1 1 1 1 3 3 1 2 1 2 2 2 2 1 1 1 3 3 1 4 2 2 2 2 2 1
## [141] 2 1 1 3 2 2 2 2 2 2 1 2 1 3 1 1 2 2 2 2 2 2 2 1 1 1 3 3 2 4 4 4 4 4 4
## [176] 4 2 2 2 2 4 4 4 4 4 4 4 4 2 2 1 1
# compare the clusters to the law column, which indicates whether the seatbelt law was in force (1 = law in force, 0 = no law in force)
table(Seatbelts_play[,8],Seatbelts_play_cluster_id)
## Seatbelts_play_cluster_id
## 1 2 3 4
## 0 60 59 47 3
## 1 2 6 0 15
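Raw counts are hard to compare because far fewer months fall under the law (23) than before it (169). A sketch that turns the counts printed above into proportions; the matrix is typed in by hand from that output, so it describes this particular run only:

```r
# Counts copied from the cross-tabulation above
# (rows: law = 0/1, columns: cluster 1 to 4).
counts <- matrix(c(60, 59, 47,  3,
                    2,  6,  0, 15),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(law = c("0", "1"),
                                 cluster = as.character(1:4)))

# Share of law / no-law months that land in each cluster (rows sum to 1)
by_law <- prop.table(counts, margin = 1)
round(by_law, 2)

# Share of each cluster's months that fall under the law (columns sum to 1)
by_cluster <- prop.table(counts, margin = 2)
round(by_cluster, 2)
```

In a live session the same call works on the table itself, for example prop.table(table(Seatbelts_play[,8], Seatbelts_play_cluster_id), margin = 1).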
Each row is the data for a month.
Data description:
1. DriversKilled - car drivers killed
2. drivers - monthly totals of car drivers in Great Britain killed or seriously injured
3. front - front-seat passengers killed or seriously injured
4. rear - rear-seat passengers killed or seriously injured
5. kms - distance driven
6. PetrolPrice - petrol price
7. VanKilled - number of van (‘light goods vehicle’) drivers killed
8. law - categorical data that reflects whether the seatbelt law was in effect that month
Seatbelts_k4$size
## [1] 62 65 47 18
Seatbelts_k4$centers
## DriversKilled drivers front rear
## 1 122.9355 1691.452 892.4194 420.0645
## 2 108.1538 1479.492 724.6615 353.3077
## 3 156.0213 2078.000 1030.5745 452.2340
## 4 88.5000 1222.000 548.6667 376.0000
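Cluster sizes are as shown: 62, 65, 47, and 18 months. The means of the four clustered variables differ clearly across clusters. Cluster 3 has the highest mean for drivers killed (156.0) and also the highest means for the other three variables, so it can be named "highest drivers killed". Clusters 1 and 2 fall in the middle (122.9 and 108.2 drivers killed) and are the closest to each other, so they may share common factors; both can be named "middle drivers killed". Cluster 4 has the lowest mean for drivers killed (88.5), and also for drivers and front, although its rear mean (376.0) is higher than that of cluster 2 (353.3); it can be named "lowest drivers killed". Compared with the law column, 15 of the 23 months with the law in force fall in cluster 4 and none fall in cluster 3, so the lowest-casualty cluster is mostly made up of months after the seatbelt law took effect.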