From this player data, we need to select the best players for the next tournament.
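We load dplyr and convert the data, already read into cricket, to a tbl so that it prints compactly.
library(dplyr)  # provides tbl_df(), %>%, select() and filter()
cricket <- tbl_df(cricket)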
cricket
## Source: local data frame [435 x 14]
##
## ID Mat Inns Not.Outs Runs HS Ave BF SR X100s X50s Ducks X4s X6s
## 1 1 55 48 14 2116 124 62.23 2569 82 2 16 0 138 27
## 2 2 51 47 12 1788 85 51.08 2120 84 0 17 0 109 20
## 3 3 49 46 4 1772 150 42.19 1949 91 4 12 3 183 11
## 4 4 49 47 5 1653 138 39.35 1667 99 4 8 3 167 52
## 5 5 49 48 3 1643 128 36.51 2208 74 4 8 2 162 3
## 6 6 33 33 1 1499 125 46.84 1193 126 3 10 0 206 36
## 7 7 42 42 1 1476 126 36.00 1865 79 3 10 1 143 13
## 8 8 38 38 2 1411 178 39.19 1637 86 3 8 1 156 22
## 9 9 45 45 0 1409 154 31.31 1797 78 2 8 6 160 16
## 10 10 41 41 6 1339 100 38.25 2004 67 1 12 2 109 2
## .. .. ... ... ... ... ... ... ... ... ... ... ... ... ...
Let’s remove the variable ‘ID’. It only identifies the player and carries no information for clustering.
cricket1 <- cricket %>%
  select(Mat:X6s)
cricket1
## Source: local data frame [435 x 13]
##
## Mat Inns Not.Outs Runs HS Ave BF SR X100s X50s Ducks X4s X6s
## 1 55 48 14 2116 124 62.23 2569 82 2 16 0 138 27
## 2 51 47 12 1788 85 51.08 2120 84 0 17 0 109 20
## 3 49 46 4 1772 150 42.19 1949 91 4 12 3 183 11
## 4 49 47 5 1653 138 39.35 1667 99 4 8 3 167 52
## 5 49 48 3 1643 128 36.51 2208 74 4 8 2 162 3
## 6 33 33 1 1499 125 46.84 1193 126 3 10 0 206 36
## 7 42 42 1 1476 126 36.00 1865 79 3 10 1 143 13
## 8 38 38 2 1411 178 39.19 1637 86 3 8 1 156 22
## 9 45 45 0 1409 154 31.31 1797 78 2 8 6 160 16
## 10 41 41 6 1339 100 38.25 2004 67 1 12 2 109 2
## .. ... ... ... ... ... ... ... ... ... ... ... ... ...
Let’s get a rough upper bound on the number of clusters from the number of players (435), using sqrt(n)/2 as a rule of thumb.
sqrt(435)/2
## [1] 10.43
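The rule of thumb only bounds the search. A common complement is the elbow method: fit k-means for each k up to the bound and look for the point where the total within-cluster sum of squares stops dropping sharply. The following is a minimal sketch under stated assumptions: toy is a synthetic stand-in with made-up Runs and Ave columns, not the real cricket table, and nstart = 20 is an arbitrary choice.

```r
# Synthetic stand-in for cricket1 (hypothetical values, three loose groups)
set.seed(42)
toy <- data.frame(
  Runs = c(rnorm(50, 100, 30), rnorm(50, 600, 80), rnorm(50, 1500, 150)),
  Ave  = c(rnorm(50, 12, 3),   rnorm(50, 30, 4),   rnorm(50, 45, 5))
)

# Same rule of thumb as in the text: sqrt(n) / 2, rounded down
k_max <- floor(sqrt(nrow(toy)) / 2)

# Total within-cluster sum of squares for k = 1, ..., k_max
wss <- sapply(seq_len(k_max), function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is where the curve flattens out
plot(seq_len(k_max), wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On the real data, toy would be replaced by cricket1 and the curve read by eye; the elbow is a judgment call rather than a hard rule.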
We try several values of k, starting near this bound, and compare the resulting cluster sizes.
cluster1 <- kmeans(cricket1, 10)
cluster1$size
## [1] 59 88 22 49 22 8 55 25 68 39
cluster2 <- kmeans(cricket1, 5)
cluster2$size
## [1] 69 195 28 39 104
cluster3 <- kmeans(cricket1, 4)
cluster3$size
## [1] 30 270 94 41
cluster4 <- kmeans(cricket1, 3)
cluster4$size
## [1] 88 33 314
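Each call above starts from randomly chosen centres, so the sizes can change from one run to the next. A hedged sketch of how the comparison could be made repeatable: fix the seed, use several random starts per k, and report the share of variance explained (betweenss / totss) alongside the smallest cluster size. Here toy is synthetic stand-in data, and the seed and nstart = 25 are arbitrary assumptions.

```r
set.seed(123)  # fix the random starting centres so the run can be repeated

# Synthetic stand-in for cricket1 (hypothetical values)
toy <- data.frame(
  Runs = c(rnorm(50, 100, 30), rnorm(50, 600, 80), rnorm(50, 1500, 150)),
  Ave  = c(rnorm(50, 12, 3),   rnorm(50, 30, 4),   rnorm(50, 45, 5))
)

# Fit each candidate k with 25 random starts and keep the best fit per k
ks   <- c(3, 4, 5, 10)
fits <- lapply(ks, function(k) kmeans(toy, centers = k, nstart = 25))

# One row per k: smallest cluster and share of total variance explained
res <- data.frame(
  k         = ks,
  smallest  = sapply(fits, function(f) min(f$size)),
  explained = sapply(fits, function(f) f$betweenss / f$totss)
)
res
```

A k whose smallest cluster holds only a handful of points, or whose explained share barely improves on the previous k, is a weaker candidate.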
None of these values of k gives a convincing split, so we inspect the data and clean it before clustering again.
summary(cricket1)
## Mat Inns Not.Outs Runs
## Min. : 1.0 Min. : 1.0 Min. : 0.00 Min. : 0
## 1st Qu.: 4.0 1st Qu.: 3.0 1st Qu.: 0.00 1st Qu.: 23
## Median : 9.0 Median : 7.0 Median : 1.00 Median : 93
## Mean :13.2 Mean :10.5 Mean : 1.85 Mean : 242
## 3rd Qu.:20.0 3rd Qu.:15.0 3rd Qu.: 3.00 3rd Qu.: 297
## Max. :55.0 Max. :48.0 Max. :14.00 Max. :2116
## HS Ave BF SR
## Min. : 0.0 Min. : 0.0 Min. : 1 Min. : 0.0
## 1st Qu.: 15.0 1st Qu.: 9.5 1st Qu.: 39 1st Qu.: 54.0
## Median : 39.0 Median : 20.0 Median : 139 Median : 70.0
## Mean : 48.2 Mean : 21.5 Mean : 310 Mean : 68.4
## 3rd Qu.: 74.0 3rd Qu.: 31.0 3rd Qu.: 386 3rd Qu.: 84.5
## Max. :194.0 Max. :183.0 Max. :2569 Max. :329.0
## X100s X50s Ducks X4s
## Min. :0.000 Min. : 0.00 Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.: 0.00 1st Qu.:0.000 1st Qu.: 2.0
## Median :0.000 Median : 0.00 Median :1.000 Median : 8.0
## Mean :0.251 Mean : 1.29 Mean :0.936 Mean : 21.9
## 3rd Qu.:0.000 3rd Qu.: 2.00 3rd Qu.:1.000 3rd Qu.: 25.5
## Max. :4.000 Max. :17.00 Max. :6.000 Max. :206.0
## X6s
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 1.00
## Mean : 3.57
## 3rd Qu.: 4.00
## Max. :53.00
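The summary shows variables on very different scales: Runs goes up to 2116 while X100s goes up to 4. k-means uses Euclidean distance, so on raw values the large-range columns dominate the result. One option, not used in this analysis, is to standardise each column first. A minimal sketch on synthetic stand-in data (the toy values are assumptions, not the real table):

```r
set.seed(1)

# Synthetic stand-in with one large-range and one small-range variable
toy <- data.frame(
  Runs  = c(rnorm(40, 200, 50), rnorm(40, 1200, 200)),
  X100s = c(rpois(40, 0.2),     rpois(40, 2))
)

# Centre each column to mean 0 and rescale it to standard deviation 1
toy_scaled <- scale(toy)

# Cluster on raw and on standardised values
fit_raw    <- kmeans(toy,        centers = 2, nstart = 20)
fit_scaled <- kmeans(toy_scaled, centers = 2, nstart = 20)

# Cross-tabulate the two assignments to see where they disagree
table(raw = fit_raw$cluster, scaled = fit_scaled$cluster)
```

Whether to scale depends on the goal: leaving the values raw effectively weights the clustering towards Runs and BF.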
We will remove the players who played fewer than 10 matches and cluster again without them.
cricket2 <- cricket1 %>%
  filter(Mat >= 10)
summary(cricket2)
## Mat Inns Not.Outs Runs
## Min. :10.0 Min. : 2.0 Min. : 0.00 Min. : 1
## 1st Qu.:14.0 1st Qu.:10.0 1st Qu.: 1.00 1st Qu.: 128
## Median :21.0 Median :16.0 Median : 3.00 Median : 300
## Mean :23.1 Mean :18.3 Mean : 3.37 Mean : 441
## 3rd Qu.:31.0 3rd Qu.:24.5 3rd Qu.: 5.00 3rd Qu.: 619
## Max. :55.0 Max. :48.0 Max. :14.00 Max. :2116
## HS Ave BF SR
## Min. : 1.0 Min. : 0.25 Min. : 5 Min. : 11.0
## 1st Qu.: 33.0 1st Qu.:14.73 1st Qu.: 166 1st Qu.: 64.0
## Median : 63.0 Median :24.40 Median : 389 Median : 76.0
## Mean : 67.6 Mean :25.08 Mean : 553 Mean : 75.2
## 3rd Qu.: 99.0 3rd Qu.:35.15 3rd Qu.: 839 3rd Qu.: 86.0
## Max. :194.0 Max. :62.23 Max. :2569 Max. :149.0
## X100s X50s Ducks X4s
## Min. :0.000 Min. : 0.00 Min. :0.00 Min. : 0
## 1st Qu.:0.000 1st Qu.: 0.00 1st Qu.:0.00 1st Qu.: 9
## Median :0.000 Median : 1.00 Median :1.00 Median : 27
## Mean :0.483 Mean : 2.41 Mean :1.55 Mean : 40
## 3rd Qu.:0.500 3rd Qu.: 4.00 3rd Qu.:2.00 3rd Qu.: 57
## Max. :4.000 Max. :17.00 Max. :6.00 Max. :206
## X6s
## Min. : 0.00
## 1st Qu.: 1.00
## Median : 3.00
## Mean : 6.59
## 3rd Qu.: 9.00
## Max. :53.00
We again compare the cluster sizes for different values of k.
cluster5 <- kmeans(cricket2, 6)
cluster5$size
## [1] 23 8 52 47 36 41
cluster6 <- kmeans(cricket2, 5)  # this split looks the most usable, so we keep k = 5
cluster6$size
## [1] 8 69 67 24 39
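Cluster sizes alone are a weak criterion for choosing k. As a hedged complement, the average silhouette width from the cluster package measures how well each point sits in its own cluster compared with the nearest other one; values closer to 1 are better. The sketch below uses synthetic stand-in data (toy is an assumption, not the cricket table) and an arbitrary range of k.

```r
library(cluster)  # provides silhouette()

set.seed(7)

# Synthetic stand-in for cricket2 (hypothetical values)
toy <- data.frame(
  Runs = c(rnorm(50, 100, 30), rnorm(50, 600, 80), rnorm(50, 1500, 150)),
  Ave  = c(rnorm(50, 12, 3),   rnorm(50, 30, 4),   rnorm(50, 45, 5))
)

d <- dist(toy)  # pairwise Euclidean distances, computed once

# Average silhouette width for each candidate k
avg_sil <- sapply(2:6, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 20)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil
```

On the real data, comparing avg_sil for k = 5 and k = 6 would give a numeric check on the choice made here by eye.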
We also plot the clusters.
library("cluster")
library("fpc")
clusplot(cricket2, cluster6$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
plotcluster(cricket2, cluster6$cluster)
Let’s build a data frame that includes the cluster assignment as an extra variable.
cricket_cluster <- data.frame(cricket2, cluster6$cluster)
head(cricket_cluster)
## Mat Inns Not.Outs Runs HS Ave BF SR X100s X50s Ducks X4s X6s
## 1 55 48 14 2116 124 62.23 2569 82 2 16 0 138 27
## 2 51 47 12 1788 85 51.08 2120 84 0 17 0 109 20
## 3 49 46 4 1772 150 42.19 1949 91 4 12 3 183 11
## 4 49 47 5 1653 138 39.35 1667 99 4 8 3 167 52
## 5 49 48 3 1643 128 36.51 2208 74 4 8 2 162 3
## 6 33 33 1 1499 125 46.84 1193 126 3 10 0 206 36
## cluster6.cluster
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 4
We compare the mean of each variable within each cluster (the cluster centres) and choose the best cluster.
cluster6$centers
##     Mat  Inns Not.Outs    Runs     HS   Ave     BF    SR  X100s     X50s
## 1 47.62 45.50    5.625 1649.50 125.62 42.12 2022.4 81.75 2.5000 11.37500
## 2 17.99  9.58    3.435   75.09  24.25 12.96  105.7 66.96 0.0000  0.05797
## 3 19.55 15.87    3.030  300.73  69.99 25.11  393.9 77.34 0.1194  1.34328
## 4 35.83 33.67    3.833 1139.21 120.33 39.60 1335.3 86.12 1.9167  6.87500
## 5 25.15 23.03    3.077  651.72  95.72 34.04  834.2 77.92 0.6667  3.82051
##   Ducks     X4s    X6s
## 1 2.125 146.375 18.000
## 2 1.464   6.348  1.130
## 3 1.582  27.015  5.403
## 4 2.250 108.167 18.750
## 5 1.103  58.103  8.487
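Cluster 1 has the highest means for most variables, including matches, runs, average, hundreds, fifties and fours; cluster 4 is slightly ahead only on strike rate and sixes, while cluster 2 has the lowest means. We therefore take the observations (players) in cluster 1. Cluster labels are arbitrary, so the number may differ on a rerun.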
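Because k-means numbers its clusters arbitrarily, it is safer to pick the cluster by a criterion than by its label. A minimal sketch, assuming mean Runs is the criterion and that an ID column is kept beside the clustering variables so the selected players can be identified afterwards. The toy data and all of its values are hypothetical.

```r
set.seed(99)

# Synthetic stand-in: ID is kept in the data but left out of the clustering
toy <- data.frame(
  ID   = 1:90,
  Runs = c(rnorm(30, 100, 30), rnorm(30, 600, 80), rnorm(30, 1500, 150)),
  Ave  = c(rnorm(30, 12, 3),   rnorm(30, 30, 4),   rnorm(30, 45, 5))
)

# Cluster on the performance variables only
fit <- kmeans(toy[, c("Runs", "Ave")], centers = 3, nstart = 20)

# Label of the cluster whose centre has the highest mean Runs
best <- which.max(fit$centers[, "Runs"])

# Players assigned to that cluster, with their IDs intact
selected <- toy[fit$cluster == best, ]
head(selected)
```

Applied to the analysis above, the same idea would use cluster6$centers and cricket_cluster, with the criterion chosen to match what "best" should mean for the tournament.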