University of Warsaw, Unsupervised Learning course conducted by Katarzyna Kopczewska, PhD
Building a volleyball team out of 14 individuals is an interesting and challenging process. The coaching staff has to not only select the players but also decide which positions they will play and which of them will form the core seven. On top of that comes the training period, when changes take place and the team starts to take its final shape, and then the season itself brings injuries, lineup changes and transfers to and from other teams. In addition, the strategy and lineup have to be adjusted to the opposing team and even to the current situation on the court. Each of these “elements of the volleyball puzzle” raises the question of what the team is really made of. If the question were simple, the recipe for victory would just be to assemble the best 14 players available. Getting to the heart of the matter, however, would make it much easier to plan match strategies and to better understand the essence of the game itself. It is true that a team can be divided into 5 groups according to position, but this is a rigid, mechanical division that does not reflect each player’s individual playing style. My goal is to find non-obvious relationships between specific types of players.
library(dplyr)
library(stringr)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(hopkins)
library(gridExtra)
Data was gathered by Yunus Karatepe from CEV EuroVolley Women’s Player Stats and uploaded to Kaggle. It contains information on 27 characteristics (including name) of 518 players. For the purpose of this analysis, only numerical variables with valid responses were taken into consideration. Moreover, only players who played a sufficient number of matches and sets were kept, so that the results are more robust. Additionally, I did not use the efficiency variables expressed in %, as they can be very misleading in volleyball. For instance, a setter normally does not receive the ball, but if she happens to do so and does it correctly, she has 100% reception efficiency (while the best libero may have, e.g., 70%), or 0% in the opposite situation. The same applies to many other situations, so it is better to rely on other variables. During data cleaning, many observations were removed and/or transformed:
all_data <- read.csv("all_data.csv", header = TRUE, sep = ",")

# Build combined column names: original header pasted with the sub-header
# stored in the first data row ("_NA" appears where there is no sub-header)
cols <- all_data %>%
  colnames() %>%
  paste(all_data[1, ], sep = '_') %>%
  str_remove('_NA')

# Assign the combined names and drop the sub-header row
colnames(all_data) <- colnames(all_data %>%
  'colnames<-'(cols) %>%
  slice(-1) %>%
  apply(2, as.numeric) %>%
  as.data.frame())
all_data <- slice(all_data, -1)

# Label each row with "<Role> <row number>"
rownames(all_data) <- paste(all_data$Role_, rownames(all_data))

# Convert everything except the first two (non-numeric) columns and the
# percentage columns (12, 17, 18, 23, 24) to numeric
k <- 1:length(all_data)
l <- k[-c(1, 2, 12, 17, 18, 23, 24)]
for (i in l) {
  all_data[, i] <- as.numeric(all_data[, i])
}

# Turn the percentage columns into fractions
for (i in c(12, 17, 18, 23, 24)) {
  all_data[, i] <- as.numeric(str_replace(all_data[, i], "%", "")) / 100
}

# Keep only players with more than 15 sets and more than 7 matches played,
# and drop the columns not used for clustering (among others, the text
# identifiers, match/set counts and the percentage variables)
all_data <- all_data[all_data$Played.1_Sets > 15 & all_data$Played_Matches > 7,
                     k[-c(1:4, 8:10, 12, 13, 17, 18, 23, 24)]]
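The dimensions reported below can be verified directly; this trivial check was not part of the original chunks:
# Number of players (rows) and variables (columns) after cleaning
dim(all_data)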
The size of the cleaned dataset:
columns: 14
rows: 117
The dataset includes 14 continuous variables:
Points_Total Points
Points.1_Side-Out
Points.2_Break Point
Serve.3_Ace per Set
Reception.1_Error
Reception.2_Negative
Reception.3_Excellent
Attack_Total Points
Attack.1_Error
Attack.2_Blocked
Attack.3_Excellent
Block_Net
Block.1_Points
Block.2_Points per Set
head(all_data)
## Points_Total Points Points.1_Side-Out Points.2_Break Point
## Setter 1 32 10 22
## Setter 2 32 15 17
## Setter 3 29 11 18
## Setter 4 26 11 15
## Setter 5 25 9 16
## Setter 6 25 12 13
## Serve.3_Ace per Set Reception.1_Error Reception.2_Negative
## Setter 1 0.30 0 2
## Setter 2 0.21 0 0
## Setter 3 0.37 0 0
## Setter 4 0.13 1 2
## Setter 5 0.14 0 1
## Setter 6 0.22 0 0
## Reception.3_Excellent Attack_Total Points Attack.1_Error
## Setter 1 0 26 3
## Setter 2 0 28 3
## Setter 3 0 22 2
## Setter 4 0 29 0
## Setter 5 0 32 1
## Setter 6 0 25 1
## Attack.2_Blocked Attack.3_Excellent Block_Net Block.1_Points
## Setter 1 1 10 1 9
## Setter 2 1 17 0 9
## Setter 3 1 14 1 5
## Setter 4 2 12 0 9
## Setter 5 1 13 0 7
## Setter 6 2 10 1 9
## Block.2_Points per Set
## Setter 1 0.20
## Setter 2 0.32
## Setter 3 0.19
## Setter 4 0.23
## Setter 5 0.20
## Setter 6 0.33
The data presented above will be used to find out whether we can group players into clusters using unsupervised learning methods. The number of sensible clusters is limited: more than 4 does not make much sense. There are 5 positions in volleyball, and even if we wanted to find differences between players of the same position, we do not have enough data to trust such findings. As a former player, I would say that 3 clusters would be a perfect result, as we could distinguish offensive players, defensive players and the so-called ‘floor generals’, whose task on the court is more strategic. To begin with, I will use the Hopkins statistic to assess whether the data is clusterable.
# Standardise the variables and compute the Hopkins statistic of clusterability
all_datascale <- data.frame(scale(all_data))
hopkins(all_datascale)
## [1] 0.9917487
The obtained value is close to 1, which suggests that the data is far from uniformly distributed and contains meaningful cluster structure. Now we can move forward and begin clustering.
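Since the Hopkins statistic is based on random sampling, it is worth checking that the value is stable across runs. This is an illustrative sketch that was not part of the original analysis; the seed is arbitrary:
# Repeat the Hopkins statistic a few times to make sure the value is stable
set.seed(123)
replicate(5, hopkins(all_datascale))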
# Optimal number of clusters according to the average silhouette width
# (the default method of fviz_nbclust)
c1 <- fviz_nbclust(all_datascale, FUNcluster = kmeans)
c2 <- fviz_nbclust(all_datascale, FUNcluster = pam)
grid.arrange(c1, c2, top="Kmeans top, PAM bottom")

# Optimal number of clusters according to the gap statistic
plot1 <- fviz_nbclust(all_datascale, FUNcluster=kmeans, method="gap_stat")
plot2 <- fviz_nbclust(all_datascale, FUNcluster=cluster::pam, method="gap_stat")
grid.arrange(plot1, plot2, ncol=2, top="K-means / PAM")
As we can see, the optimal number of clusters is 2 in both cases when we look at the silhouette criterion. However, for the cluster counts we are interested in, the average silhouette value is higher for k-means than for PAM, while the gap statistic does not differ much between the two, so it is probably better to go with k-means rather than PAM. We won’t use CLARA, as our dataset is quite small. On closer inspection, the silhouette values for 2 and 3 clusters are not significantly different, and in terms of the gap statistic 3 is even higher. Thus, it is reasonable to consider both numbers of clusters, with a slight preference for 3. Unfortunately, these values could be higher in general: a low silhouette is undesirable, and values close to 0 indicate overlapping clusters. In order to assess the model, further investigation of its behaviour is needed. Let’s visualize the k-means clusters for both candidate values of k, using Euclidean distance for the calculations (other distance metrics do not change the outcome in a significant way).
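The chunk that actually fits km1 and km2 is not shown above; a minimal sketch consistent with the later silhouette calls (assuming k-means fitted through eclust() with k = 3 and k = 2 and Euclidean distance) could look like this:
# Assumed reconstruction of the missing chunk: k-means with k = 3 and k = 2,
# fitted via factoextra::eclust() so that the cluster plots are drawn as well
set.seed(123)  # illustrative seed, not in the original
km1 <- eclust(all_datascale, "kmeans", k = 3, hc_metric = "euclidean")
km2 <- eclust(all_datascale, "kmeans", k = 2, hc_metric = "euclidean")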
In both cases the clusters do not overlap. However, taking the labels into consideration, 3 clusters seem much more intuitive.
sil<-silhouette(km1$cluster, dist(all_datascale))
fviz_silhouette(sil, ggtheme=theme_classic(), main="K-means k=3")
## cluster size ave.sil.width
## 1 1 23 0.29
## 2 2 59 0.38
## 3 3 35 0.20
sil2<-silhouette(km2$cluster, dist(all_datascale))
fviz_silhouette(sil2, ggtheme=theme_classic(), main="K-means k=2")
## cluster size ave.sil.width
## 1 1 43 0.17
## 2 2 74 0.47
calinhara(all_datascale, km1$cluster)
## [1] 58.20988
calinhara(all_datascale, km2$cluster)
## [1] 66.67771
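As a side check (an illustrative sketch, not part of the original chunks), the average silhouette width and the Calinski-Harabasz index can be compared for several candidate values of k at once:
# Compare both criteria for k = 2..5 on the scaled data
set.seed(123)  # for reproducible kmeans() starts
sapply(2:5, function(k) {
  cl <- kmeans(all_datascale, centers = k, nstart = 25)$cluster
  c(k = k,
    avg_sil = mean(silhouette(cl, dist(all_datascale))[, 3]),
    CH = calinhara(all_datascale, cl))
})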
Whereas the average silhouette width is larger for 2 clusters, the plot looks better for 3, as there are fewer of the undesirable values below 0. The Calinski-Harabasz index indicates that 2 would be the better choice. Taking everything into consideration, there are arguments for both 2 and 3 clusters; in that case I will follow intuition and go with 3. Let’s see whether the model is stable across different methods.
km11<-eclust(all_datascale, "kmeans", hc_metric="manhattan",k=3)
km22<-eclust(all_datascale, "pam", hc_metric="manhattan", k=3)
The plots are clearly different, especially in terms of the top-left cluster. However, we can see overlapping clusters, which probably means many negative silhouette values and is yet another reason to claim that PAM is the worse method in this case. The model is ready; let’s inspect the characteristics of the clusters now.
# Attach the k-means (k = 3) cluster labels to the scaled data
dfnew <- data.frame(cbind(all_datascale, km1$cluster))
colnames(dfnew)[15] <- "cluster"

# Boxplots of selected variables by cluster
par(mfrow = c(2, 3))
for (i in c(1, 4, 7, 11, 14, 8)) {
  boxplot(dfnew[, i] ~ dfnew[, 15], vertical = TRUE, col = "blue",
          xlab = 'clusters', ylab = colnames(dfnew)[i])
}
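To relate the clusters to the nominal positions, one can also cross-tabulate the assignments against the role stored in the row names (an illustrative check, assuming the row labels keep the "<Role> <number>" format created during cleaning):
# Strip the trailing row number from labels such as "Setter 12" and
# cross-tabulate position vs. cluster assignment
roles <- sub("\\s*\\d+$", "", rownames(all_datascale))
table(roles, km1$cluster)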
In the analysis described above, the volleyball players’ statistics were investigated and clustered into groups. The following three groups were generated:
Block builders: characterised by little to no activity in aspects of the game other than blocking and attacking (but mostly blocking).
Floor generals: players whose statistics could suggest they are no good, as they neither score many points nor shine defensively. Their job on the floor is to set up the whole action, and that is consistent with reality, as the vast majority of this group are setters and liberos.
Core players: players who score the most points and receive the ball. The vast majority of the pressure is put on them, and it is practically impossible to win a game without them. This is why these players (mostly outside hitters and opposites) always get the most glory, while the setters and liberos are underappreciated by the audience.
This classification makes a lot of sense in real-life terms, and its predictions could be used for team assembly. However, the model could still be improved, as it is not the best in a statistical sense.
Possible improvements:
Gathering more data.
Outlier analysis (a possible starting point is sketched below).
Using more sophisticated methods.
Integrating the model with other models, e.g. total-points predictions, into a coherent recommendation tool.
Adding other variables and changing how the existing ones are calculated.
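A possible starting point for the outlier analysis, sketched here as an illustration only (km1 is the k-means fit used above; looking at the ten largest distances is an arbitrary choice):
# Distance of each player to her own k-means centroid; unusually large
# values point to atypical players worth inspecting before re-clustering
centres <- km1$centers[km1$cluster, ]
dist_cent <- sqrt(rowSums((all_datascale - centres)^2))
head(sort(dist_cent, decreasing = TRUE), 10)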