Cluster Analysis: KMeans - Cricket Player Selections

From this player data, we have to select the best players for the next tournament.
We start by loading the data.

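library(dplyr)  # provides tbl_df(), %>%, select() and filter()
cricket <- tbl_df(cricket)  # tbl_df() is deprecated in newer dplyr; as_tibble() is the replacement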
cricket

## Source: local data frame [435 x 14]
## 
##    ID Mat Inns Not.Outs Runs  HS   Ave   BF  SR X100s X50s Ducks X4s X6s
## 1   1  55   48       14 2116 124 62.23 2569  82     2   16     0 138  27
## 2   2  51   47       12 1788  85 51.08 2120  84     0   17     0 109  20
## 3   3  49   46        4 1772 150 42.19 1949  91     4   12     3 183  11
## 4   4  49   47        5 1653 138 39.35 1667  99     4    8     3 167  52
## 5   5  49   48        3 1643 128 36.51 2208  74     4    8     2 162   3
## 6   6  33   33        1 1499 125 46.84 1193 126     3   10     0 206  36
## 7   7  42   42        1 1476 126 36.00 1865  79     3   10     1 143  13
## 8   8  38   38        2 1411 178 39.19 1637  86     3    8     1 156  22
## 9   9  45   45        0 1409 154 31.31 1797  78     2    8     6 160  16
## 10 10  41   41        6 1339 100 38.25 2004  67     1   12     2 109   2
## .. .. ...  ...      ...  ... ...   ...  ... ...   ...  ...   ... ... ...

Let's remove the variable ‘ID’. It is only a row identifier, so we don't need it for clustering.

cricket1 <- cricket %>%
  select(Mat:X6s)
cricket1

## Source: local data frame [435 x 13]
## 
##    Mat Inns Not.Outs Runs  HS   Ave   BF  SR X100s X50s Ducks X4s X6s
## 1   55   48       14 2116 124 62.23 2569  82     2   16     0 138  27
## 2   51   47       12 1788  85 51.08 2120  84     0   17     0 109  20
## 3   49   46        4 1772 150 42.19 1949  91     4   12     3 183  11
## 4   49   47        5 1653 138 39.35 1667  99     4    8     3 167  52
## 5   49   48        3 1643 128 36.51 2208  74     4    8     2 162   3
## 6   33   33        1 1499 125 46.84 1193 126     3   10     0 206  36
## 7   42   42        1 1476 126 36.00 1865  79     3   10     1 143  13
## 8   38   38        2 1411 178 39.19 1637  86     3    8     1 156  22
## 9   45   45        0 1409 154 31.31 1797  78     2    8     6 160  16
## 10  41   41        6 1339 100 38.25 2004  67     1   12     2 109   2
## .. ...  ...      ...  ... ...   ...  ... ...   ...  ...   ... ... ...

Let's estimate how many clusters to try. A square-root rule of thumb on the 435 observations gives a rough starting point.

sqrt(435)/2

## [1] 10.43
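As a side note, the rule of thumb is more often quoted as the square root of n/2 rather than half the square root of n. The short sketch below simply puts the two forms side by side for n = 435; the variable names are illustrative, and either value is only a rough upper guide, not a rule.

n <- 435                     # number of players in the data

k_as_used <- sqrt(n) / 2     # the calculation used above
k_common  <- sqrt(n / 2)     # the more commonly quoted form

round(c(as_used = k_as_used, common = k_common), 2)

The first value rounds to 10.43 and the second to 14.75, so starting the search at around 10 clusters is on the conservative side of this heuristic.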

We try several values of k and compare the cluster sizes.

cluster1 <- kmeans(cricket1, 10)
cluster1$size

##  [1] 59 88 22 49 22  8 55 25 68 39

cluster2 <- kmeans(cricket1, 5)
cluster2$size

## [1]  69 195  28  39 104

cluster3 <- kmeans(cricket1, 4)
cluster3$size

## [1]  30 270  94  41

cluster4 <- kmeans(cricket1, 3)
cluster4$size

## [1]  88  33 314
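A note on comparing these runs: kmeans() starts from randomly chosen centers, so the sizes above can change from one run to the next. The sketch below is one possible way to make the comparison repeatable and a little more systematic, assuming a fixed seed, several random starts (nstart) and the total within-cluster sum of squares as the yardstick. The data set toy is a hypothetical stand-in, not the cricket data, and the seed and nstart values are arbitrary.

set.seed(42)  # arbitrary seed so the random starting centers are repeatable

# Hypothetical stand-in data: 200 rows, 3 numeric columns, three loose groups
toy <- rbind(
  matrix(rnorm(150, mean = 0),  ncol = 3),
  matrix(rnorm(150, mean = 5),  ncol = 3),
  matrix(rnorm(300, mean = 10), ncol = 3)
)

ks <- 2:8
# nstart = 25 reruns kmeans from 25 random starts and keeps the best fit
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

data.frame(k = ks, tot.withinss = round(wss, 1))

# The "elbow" where the curve flattens is a common, informal guide to k
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

The same loop could be pointed at cricket1 instead of toy; the sizes reported above were produced without a seed, so they would not necessarily be reproduced.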

None of these choices gives a clearly satisfactory set of clusters, so we take a closer look at the data and do some data manipulation.

summary(cricket1)

##       Mat            Inns         Not.Outs          Runs     
##  Min.   : 1.0   Min.   : 1.0   Min.   : 0.00   Min.   :   0  
##  1st Qu.: 4.0   1st Qu.: 3.0   1st Qu.: 0.00   1st Qu.:  23  
##  Median : 9.0   Median : 7.0   Median : 1.00   Median :  93  
##  Mean   :13.2   Mean   :10.5   Mean   : 1.85   Mean   : 242  
##  3rd Qu.:20.0   3rd Qu.:15.0   3rd Qu.: 3.00   3rd Qu.: 297  
##  Max.   :55.0   Max.   :48.0   Max.   :14.00   Max.   :2116  
##        HS             Ave              BF             SR       
##  Min.   :  0.0   Min.   :  0.0   Min.   :   1   Min.   :  0.0  
##  1st Qu.: 15.0   1st Qu.:  9.5   1st Qu.:  39   1st Qu.: 54.0  
##  Median : 39.0   Median : 20.0   Median : 139   Median : 70.0  
##  Mean   : 48.2   Mean   : 21.5   Mean   : 310   Mean   : 68.4  
##  3rd Qu.: 74.0   3rd Qu.: 31.0   3rd Qu.: 386   3rd Qu.: 84.5  
##  Max.   :194.0   Max.   :183.0   Max.   :2569   Max.   :329.0  
##      X100s            X50s           Ducks            X4s       
##  Min.   :0.000   Min.   : 0.00   Min.   :0.000   Min.   :  0.0  
##  1st Qu.:0.000   1st Qu.: 0.00   1st Qu.:0.000   1st Qu.:  2.0  
##  Median :0.000   Median : 0.00   Median :1.000   Median :  8.0  
##  Mean   :0.251   Mean   : 1.29   Mean   :0.936   Mean   : 21.9  
##  3rd Qu.:0.000   3rd Qu.: 2.00   3rd Qu.:1.000   3rd Qu.: 25.5  
##  Max.   :4.000   Max.   :17.00   Max.   :6.000   Max.   :206.0  
##       X6s       
##  Min.   : 0.00  
##  1st Qu.: 0.00  
##  Median : 1.00  
##  Mean   : 3.57  
##  3rd Qu.: 4.00  
##  Max.   :53.00

We will remove the players who played fewer than 10 matches and cluster without them.

cricket2 <- cricket1 %>%
  filter(Mat >= 10)
summary(cricket2)

##       Mat            Inns         Not.Outs          Runs     
##  Min.   :10.0   Min.   : 2.0   Min.   : 0.00   Min.   :   1  
##  1st Qu.:14.0   1st Qu.:10.0   1st Qu.: 1.00   1st Qu.: 128  
##  Median :21.0   Median :16.0   Median : 3.00   Median : 300  
##  Mean   :23.1   Mean   :18.3   Mean   : 3.37   Mean   : 441  
##  3rd Qu.:31.0   3rd Qu.:24.5   3rd Qu.: 5.00   3rd Qu.: 619  
##  Max.   :55.0   Max.   :48.0   Max.   :14.00   Max.   :2116  
##        HS             Ave              BF             SR       
##  Min.   :  1.0   Min.   : 0.25   Min.   :   5   Min.   : 11.0  
##  1st Qu.: 33.0   1st Qu.:14.73   1st Qu.: 166   1st Qu.: 64.0  
##  Median : 63.0   Median :24.40   Median : 389   Median : 76.0  
##  Mean   : 67.6   Mean   :25.08   Mean   : 553   Mean   : 75.2  
##  3rd Qu.: 99.0   3rd Qu.:35.15   3rd Qu.: 839   3rd Qu.: 86.0  
##  Max.   :194.0   Max.   :62.23   Max.   :2569   Max.   :149.0  
##      X100s            X50s           Ducks           X4s     
##  Min.   :0.000   Min.   : 0.00   Min.   :0.00   Min.   :  0  
##  1st Qu.:0.000   1st Qu.: 0.00   1st Qu.:0.00   1st Qu.:  9  
##  Median :0.000   Median : 1.00   Median :1.00   Median : 27  
##  Mean   :0.483   Mean   : 2.41   Mean   :1.55   Mean   : 40  
##  3rd Qu.:0.500   3rd Qu.: 4.00   3rd Qu.:2.00   3rd Qu.: 57  
##  Max.   :4.000   Max.   :17.00   Max.   :6.00   Max.   :206  
##       X6s       
##  Min.   : 0.00  
##  1st Qu.: 1.00  
##  Median : 3.00  
##  Mean   : 6.59  
##  3rd Qu.: 9.00  
##  Max.   :53.00
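The summary also shows that the variables sit on very different ranges (Runs and BF reach into the thousands, while X100s tops out at 4). Because kmeans() works with Euclidean distances, the large-range columns tend to dominate the result. One option, which is not used in this analysis, is to standardize the columns with scale() before clustering. The sketch below illustrates the idea on a hypothetical mini data set; the values, the seed and the choice of two clusters are assumptions for illustration only.

# Hypothetical mini data set with two columns on very different ranges
toy <- data.frame(
  Runs  = c(2116, 1788, 300, 128, 75, 1),
  X100s = c(2, 0, 4, 0, 1, 0)
)

toy_scaled <- scale(toy)            # subtract column mean, divide by column sd

round(colMeans(toy_scaled), 10)     # each column now has mean 0
apply(toy_scaled, 2, sd)            # and standard deviation 1

set.seed(1)                         # arbitrary seed for repeatable starts
fit <- kmeans(toy_scaled, centers = 2, nstart = 10)
fit$size

Standardizing changes which players end up together, so the cluster sizes and centers reported in this document would not carry over to a scaled run.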

We check for the best number of clusters again on the filtered data.

cluster5 <- kmeans(cricket2, 6)
cluster5$size

## [1] 23  8 52 47 36 41

cluster6 <- kmeans(cricket2, 5)  # This looks perfect for our clusters
cluster6$size

## [1]  8 69 67 24 39

We plot the clusters as well.

library("cluster")
library("fpc")

clusplot(cricket2, cluster6$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)

[Figure: clusplot of cricket2, colored and shaded by the cluster6 assignments]

plotcluster(cricket2, cluster6$cluster)

[Figure: plotcluster of cricket2 by the cluster6 assignments]

Let's make a data frame with the cluster variable included.

cricket_cluster <- data.frame(cricket2, cluster6$cluster)
head(cricket_cluster)

##   Mat Inns Not.Outs Runs  HS   Ave   BF  SR X100s X50s Ducks X4s X6s
## 1  55   48       14 2116 124 62.23 2569  82     2   16     0 138  27
## 2  51   47       12 1788  85 51.08 2120  84     0   17     0 109  20
## 3  49   46        4 1772 150 42.19 1949  91     4   12     3 183  11
## 4  49   47        5 1653 138 39.35 1667  99     4    8     3 167  52
## 5  49   48        3 1643 128 36.51 2208  74     4    8     2 162   3
## 6  33   33        1 1499 125 46.84 1193 126     3   10     0 206  36
##   cluster6.cluster
## 1                1
## 2                1
## 3                1
## 4                1
## 5                1
## 6                4

We check the mean of each variable within each cluster and choose the best cluster.

cluster6$centers

##     Mat  Inns Not.Outs    Runs     HS   Ave     BF    SR  X100s     X50s
## 1 47.62 45.50    5.625 1649.50 125.62 42.12 2022.4 81.75 2.5000 11.37500
## 2 17.99  9.58    3.435   75.09  24.25 12.96  105.7 66.96 0.0000  0.05797
## 3 19.55 15.87    3.030  300.73  69.99 25.11  393.9 77.34 0.1194  1.34328
## 4 35.83 33.67    3.833 1139.21 120.33 39.60 1335.3 86.12 1.9167  6.87500
## 5 25.15 23.03    3.077  651.72  95.72 34.04  834.2 77.92 0.6667  3.82051
##   Ducks     X4s    X6s
## 1 2.125 146.375 18.000
## 2 1.464   6.348  1.130
## 3 1.582  27.015  5.403
## 4 2.250 108.167 18.750
## 5 1.103  58.103  8.487

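Since cluster 1 has the highest means for most of the variables, including Runs, Ave, BF, X100s, X50s and X4s, we will take the observations (players) in cluster 1. Cluster 4 is marginally ahead only on SR and X6s, while cluster 2 has the lowest means on most of the batting variables.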
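The selection step itself can be sketched as follows. Cluster numbers are arbitrary labels that may change between runs, so this sketch looks up the chosen cluster from the centers (here, the one with the highest mean Runs) instead of hard-coding a number. The data frame toy, the seed and the use of Runs as the deciding column are assumptions for illustration, not part of the original analysis.

set.seed(7)  # arbitrary seed so the toy example is repeatable

# Hypothetical stand-in for cricket2: two numeric columns, two clear groups
toy <- data.frame(
  Runs = c(2116, 1788, 1772, 1653, 310, 290, 120, 75),
  Ave  = c(62.2, 51.1, 42.2, 39.4, 25.0, 24.1, 14.7, 13.0)
)

fit <- kmeans(toy, centers = 2, nstart = 10)

# Look the best cluster up by its center rather than by its label
best <- which.max(fit$centers[, "Runs"])

# Attach the cluster label to each row and keep only the chosen cluster
toy_cluster <- data.frame(toy, cluster = fit$cluster)
selected <- toy_cluster[toy_cluster$cluster == best, ]
selected

Applied to the real data, the same pattern would use cluster6 and cricket_cluster in place of fit and toy_cluster.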

Loy

Monday, December 15, 2014