This article is a part of the Unsupervised Learning course at the Faculty of Economic Sciences, University of Warsaw.
Volleyball is a team sport in which each point is a separate action and the goal is to be the first team to reach 25 points in a set. Therefore the main drivers of winning matches are team consistency, also expressed as avoiding mistakes, and points gained through players' individual effort. In this paper, I will try to identify and measure these theoretical concepts and to predict teams' positions based on their assignment to clusters obtained with Principal Component Analysis and K-means clustering. I will use statistics from Plusliga, the highest-level professional league in Poland and one of the best in the world, for seasons 2017/18 - 2020/21, provided by Plusliga.pl. The original dataset comes from my prior university projects.
kable(head(Plusliga))
| Season | Team Name | Sets | Points | Serves | Aces | Serve Errors | Aces per set | Receptions | Reception Errors | Negative reception | Perfect Reception | % perf reception | Attacks | Attack errors | Blocked attacks | Perfect attacs | % perf attacks | Blocks | Blocks per set |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020/21 | Aluron CMC Warta Zawiercie | 5 | 73 | 108 | 6 | 11 | 1.20 | 85 | 9 | 19 | 8 | 9.41 | 121 | 7 | 9 | 60 | 49.59 | 7 | 1.40 |
| 2020/21 | Aluron CMC Warta Zawiercie | 4 | 76 | 101 | 6 | 10 | 1.50 | 81 | 4 | 30 | 24 | 29.63 | 111 | 7 | 8 | 60 | 54.05 | 10 | 2.50 |
| 2020/21 | Aluron CMC Warta Zawiercie | 3 | 54 | 74 | 3 | 8 | 1.00 | 43 | 3 | 12 | 9 | 20.93 | 67 | 0 | 2 | 43 | 64.18 | 8 | 2.67 |
| 2020/21 | Aluron CMC Warta Zawiercie | 3 | 61 | 74 | 5 | 5 | 1.67 | 51 | 2 | 8 | 13 | 25.49 | 83 | 5 | 6 | 51 | 61.45 | 5 | 1.67 |
| 2020/21 | Aluron CMC Warta Zawiercie | 4 | 82 | 108 | 5 | 10 | 1.25 | 77 | 4 | 17 | 21 | 27.27 | 117 | 5 | 7 | 70 | 59.83 | 7 | 1.75 |
| 2020/21 | Aluron CMC Warta Zawiercie | 3 | 54 | 74 | 3 | 11 | 1.00 | 49 | 3 | 7 | 10 | 20.41 | 71 | 4 | 7 | 40 | 56.34 | 11 | 3.67 |
We want to capture two measures: Team Consistency and Individual Offensive Effort (IOE). As most of the original statistics from the Plusliga website are heavily determined by the number of sets played in each match, we will create custom measures. To achieve that, we need to define other, more granular drivers of those measures.
# Team consistency measures (dplyr::coalesce replaces missing values from zero denominators with 0):
Plusliga$serve_consistency <- coalesce((Plusliga$Serves - Plusliga$`Serve Errors`)/Plusliga$Serves, 0)
Plusliga$Reception_consistency <- coalesce((Plusliga$Receptions - Plusliga$`Reception Errors` - Plusliga$`Negative reception`)/Plusliga$Receptions, 0)
Plusliga$Perfect_reception_consistency <- coalesce(Plusliga$`Perfect Reception`/Plusliga$Receptions, 0)
Plusliga$Attacks_consistency <- coalesce((Plusliga$Attacks - Plusliga$`Attack errors`) / Plusliga$Attacks, 0)
#IOE measures (we will use two common measures - blocks per set and aces per set, and one custom)
Plusliga$Perfect_Attacks_ratio <- coalesce(Plusliga$`Perfect attacs`/Plusliga$Attacks,0)
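As a quick sanity check, we can confirm that the new measures are per-action ratios rather than per-match totals, so they are not driven by the number of sets played; a minimal sketch using the columns created above:
# Each custom measure should lie between 0 and 1
summary(Plusliga[c("serve_consistency", "Reception_consistency", "Perfect_reception_consistency", "Attacks_consistency", "Perfect_Attacks_ratio")])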
# Grouping by Team name and Season
df <- aggregate(Plusliga[c(8,20:25)], by =list(Plusliga$`Team Name`, Plusliga$Season), FUN = mean)
# Adding group and final standings data
Final_Standings <- read_delim("Final_Standings.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
df <- merge(df, Final_Standings, by.x = c("Group.2", "Group.1"), by.y = c("Season", "Team"))
colnames(df) <- c("Season", "Team", colnames(df[,3:11]))
Now we can group the dataset by team and season and include the standings from the respective seasons. We add both Group Standings (position in the table after the group phase) and Final Standings (position in the table after the play-off phase, which follows the last match of the group phase).
| Season | Team | Aces per set | Blocks per set | serve_consistency | Reception_consistency | Perfect_reception_consistency | Attacks_consistency | Perfect_Attacks_ratio | Group Standings | Final Standings |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017/18 | Aluron CMC Warta Zawiercie | 1.662187 | 2.062500 | 0.8018745 | 0.6438980 | 0.2361160 | 0.9292495 | 0.4735497 | 10 | 9 |
| 2017/18 | Asseco Resovia Rzeszów | 1.469722 | 2.064722 | 0.8271924 | 0.6614031 | 0.2335482 | 0.9348035 | 0.5076606 | 4 | 6 |
| 2017/18 | BBTS Bielsko-Biała | 1.099667 | 2.396667 | 0.8353156 | 0.6044907 | 0.2006393 | 0.9118757 | 0.4616044 | 15 | 15 |
| 2017/18 | BKS Visła Bydgoszcz | 1.136061 | 2.046667 | 0.7990014 | 0.6230099 | 0.2058731 | 0.8947184 | 0.4402737 | 14 | 14 |
| 2017/18 | Cerrad Czarni Radom | 1.662813 | 2.041250 | 0.8149306 | 0.6713662 | 0.2356678 | 0.9228777 | 0.5070535 | 9 | 10 |
| 2017/18 | Cuprum Lubin | 1.255625 | 2.092500 | 0.8704322 | 0.6425882 | 0.1926827 | 0.9438522 | 0.4672254 | 8 | 7 |
We will also partition the data into a training set (seasons 2017/18 - 2019/20) and a test set (2020/21).
training <- df[df$Season != "2020/21",]
test <- df[df$Season == "2020/21",]
As can be seen in Chart 1 below, group standings are negatively correlated with each of the selected variables. We expected this kind of relation, as a high value of every variable theoretically means a positive effect, which translates into lower values of Group and Final Standings (higher places). Moreover, as Group Standings and Final Standings show similar correlations, from now on we will use only Final Standings.
corr = cor(training[,3:11], method='pearson')
corrplot(corr, title = "Chart 1. Correlation analysis between selected variables")
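To back the claim that Group Standings and Final Standings correlate similarly with the selected variables, we can also look at the raw numbers (a minimal check on the same training data):
# Correlations of each measure with the two standings columns
round(cor(training[,3:9], training[,10:11], method = 'pearson'), 2)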
In this part, we will create one Principal Component for each group of variables: Consistency and IOE. Instead of normalizing the variables beforehand, we will set the center and scale. arguments of the prcomp function to TRUE.
df.pcat1 <- prcomp(training[c(5:8)], center = T, scale. = T)
df.pcat2 <- prcomp(training[c(3:4,9)], center = T, scale. = T)
# Visualizing Explained variance
grid.arrange(fviz_eig(df.pcat1, main = "Chart 2. Explained variance by Consistency PCs"), fviz_eig(df.pcat2, main = "Chart 3. Explained variance by IOE PCs"), nrow = 1)
As we can see above, the first Principal Components explain more than 40% and more than 50% of the variance respectively, therefore we will use those two components.
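The exact shares can also be read off numerically from the prcomp objects:
# Proportion of variance explained by each principal component
summary(df.pcat1)$importance["Proportion of Variance", ]
summary(df.pcat2)$importance["Proportion of Variance", ]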
# Binding dataframe with teams, seasons, final standings and first PC scores
# (a PC's sign is arbitrary; the IOE score is flipped so that higher values mean better individual offensive performance)
trainingresults <- data.frame(training$Season, training$Team, training$`Final Standings`, df.pcat1$x[,1], -df.pcat2$x[,1])
colnames(trainingresults) <- c("Season", "Team", "Final Standings", "Consistency", "IOE")
# Plotting the PC scores
sp <- ggplot(trainingresults, aes(Consistency, IOE))+
geom_point(aes(size=`Final Standings`, color= `Final Standings`)) + ggtitle("Chart 4. Consistency and IOE for the teams")
sp
Chart 4 gives us some intuition that the two derived measures are indeed related to the final standing and that the relation is roughly linear (higher values of those measures mean higher final positions, i.e. lower standing numbers).
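To quantify that intuition, one could also fit a simple linear model of the final standing on the two components (a quick sketch, not part of the original pipeline; negative coefficients are expected because lower standing numbers mean better places):
# Linear relation between the derived measures and the final standing
summary(lm(`Final Standings` ~ Consistency + IOE, data = trainingresults))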
#Gap analysis chart
fviz_nbclust(trainingresults[4:5], FUNcluster = stats::kmeans, method = "gap_stat") + ggtitle("Chart 5. Gap Analysis")
# K-means clustering and plotting results
kmtraining <- kmeans(trainingresults[,4:5],6, nstart = 100)
clusterstrain <- ggplot(trainingresults, aes(Consistency, IOE))+
geom_point(aes(size=`Final Standings`, color= factor(kmtraining$cluster))) + scale_color_manual(values = c('#EDE61A','#84D076', '#29B510', "#ED481A", "#ED7D1A","#EDB21A")) + geom_text(label = trainingresults$`Final Standings`, check_overlap = TRUE) + ggtitle("Chart 6. Trained Clusters")
clusterstrain
Chart 5 above suggests that the optimal number of clusters for the k-means method in this example is 6, based on the gap analysis. Therefore we group the training dataset into 6 clusters, as shown in Chart 6. We can see that the final positions within each cluster are similar.
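A side note: k-means depends on random initialization, so the cluster numbering and, occasionally, the assignments can differ between runs; nstart = 100 keeps the solution stable, but fixing a seed before the kmeans call (the value below is arbitrary) would make it fully reproducible.
set.seed(123) # arbitrary seed, to be called before kmeans() for reproducibility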
# Assigning clusters to observations, calculating and plotting the mean position within clusters
trainingresults$cluster <- kmtraining$cluster
mean_pos <- aggregate(trainingresults[,3], by = list(trainingresults$cluster), FUN = mean)
centroids <- data.frame("Consistency" = kmtraining$centers[,1],
"IOE" = kmtraining$centers[,2],
"cluster" = rownames(kmtraining$centers),
"position" = round(mean_pos$x,2)
)
Trainchart_avgpos <- ggplot(trainingresults, aes(Consistency, IOE))+
geom_point(aes(size=`Final Standings`, color= factor(kmtraining$cluster))) +
scale_color_manual(values = c('#EDE61A','#84D076', '#29B510', "#ED481A", "#ED7D1A","#EDB21A")) +
geom_label(data = centroids, label = centroids$position, label.r = unit(0.25, "lines"), colour = "red")+
ggtitle("Chart 7. Clusters with average position")
Trainchart_avgpos
Chart 7 represents the calculated clusters together with the average final position of the teams within each cluster. As we can see, Cluster 4 (Red) describes teams at the bottom of the table, i.e. teams with low Consistency and low Individual Offensive Effort, while Cluster 5 (Dark Orange) represents teams with a similar, below-average IOE level, like C4, but a significantly higher Consistency level, resulting in a final position that is better by almost 4 places on average. These two clusters show that, for the same level of IOE, consistency plays an important role in the final position. We can see the same pattern across Clusters 6, 1 and 2, which are characterised by similar IOE levels (around 0.5) but different Consistency, where better consistency means a higher position. Finally, Cluster 3 shows "unicorns" with outstanding individual achievements guaranteeing a high position regardless of team consistency. We can conclude that the main driver of the final position was the IOE level, as more IOE meant higher positions for every cluster, but consistency, as a secondary effect, also played an important role.
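The interpretation above can be cross-checked by summarising the clusters numerically (a short sketch using the objects already computed):
# Mean Consistency, IOE and final position per training cluster
aggregate(trainingresults[, c("Consistency", "IOE", "Final Standings")],
by = list(cluster = trainingresults$cluster), FUN = mean)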
We will now run our calculations for the 2020/21 season and see how well we can assign clusters and positions based on them:
#calculating PCs and clusters for test data
#starting with normalization
test[,3:9] <- data.Normalization(test[,3:9], type="n1", normalization="column") # clusterSim::data.Normalization, column standardization
testing <- test[,c(1:2,11)]
testing$Consistency <- as.matrix(test[,c(5:8)])%*%df.pcat1$rotation[,1]
testing$IOE <- -as.matrix(test[,c(3:4,9)])%*%df.pcat2$rotation[,1]
km.train.kcca <- as.kcca(kmtraining, trainingresults[,4:5]) # conversion to a flexclust kcca object so that predict() can be used
km.pred<-predict(km.train.kcca, testing[,4:5]) # prediction for k-means
testing$cluster <- km.pred
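An alternative to normalizing the test data by hand is to let predict() reuse the centering and scaling stored in the training prcomp objects; a hedged sketch (the helper object test_raw is introduced here only for illustration), close but not identical to the approach above, since data.Normalization uses the test set's own means and standard deviations:
# Hypothetical alternative: project the un-normalized test split with the training PCA
test_raw <- df[df$Season == "2020/21", ] # copy taken before the data.Normalization step
consistency_alt <- predict(df.pcat1, newdata = test_raw[, 5:8])[, 1]
ioe_alt <- -predict(df.pcat2, newdata = test_raw[, c(3:4, 9)])[, 1]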
# Plotting both train and test data
colnames(trainingresults) = colnames(testing[,1:6])
all <- rbind(trainingresults, testing[,1:6])
Trainchart_avgpos <- ggplot(all, aes(Consistency, IOE))+
geom_point(aes(size=`Final Standings`, color= factor(cluster))) +
scale_color_manual(values = c('#EDE61A','#84D076', '#29B510', "#ED481A", "#ED7D1A","#EDB21A")) +
geom_label(data = centroids, label = centroids$cluster, label.r = unit(0.25, "lines") )+
ggtitle("Chart 8. Training and testing data by clusers")
Trainchart_avgpos
# Finding more suited clusters (based on the position) and calculating error
testing$actual_cluster <- rep(0, nrow(testing))
for (i in 1:nrow(testing))
{
testing$actual_cluster[i] <- which.min(abs(testing$`Final Standings`[i] - mean_pos$x))
}
testing$mean_in_cluster <- mean_pos$x[testing$cluster]
testing$error <- abs(testing$`Final Standings`- mean_pos$x[testing$cluster])
kable(testing[,c(2,3,6,7,8,9)])
| | Team | Final Standings | cluster | actual_cluster | mean_in_cluster | error |
|---|---|---|---|---|---|---|
| 41 | Aluron CMC Warta Zawiercie | 8 | 5 | 5 | 9.428571 | 1.4285714 |
| 42 | Asseco Resovia Rzeszów | 5 | 2 | 1 | 3.777778 | 1.2222222 |
| 43 | Cerrad Czarni Radom | 12 | 4 | 4 | 13.125000 | 1.1250000 |
| 44 | Cuprum Lubin | 11 | 5 | 5 | 9.428571 | 1.5714286 |
| 45 | GKS Katowice | 9 | 5 | 5 | 9.428571 | 0.4285714 |
| 46 | Grupa Azoty ZAKSA Kędzierzyn-Koźle | 2 | 3 | 3 | 1.500000 | 0.5000000 |
| 47 | Indykpol AZS Olsztyn | 10 | 5 | 5 | 9.428571 | 0.5714286 |
| 48 | Jastrzębski Węgiel | 1 | 2 | 3 | 3.777778 | 2.7777778 |
| 49 | MKS Będzin | 14 | 4 | 4 | 13.125000 | 0.8750000 |
| 50 | PGE Skra Bełchatów | 4 | 3 | 2 | 1.500000 | 2.5000000 |
| 51 | Ślepsk Malow Suwałki | 7 | 5 | 6 | 9.428571 | 2.4285714 |
| 52 | Stal Nysa | 13 | 4 | 4 | 13.125000 | 0.1250000 |
| 53 | Trefl Gdańsk | 6 | 1 | 1 | 5.857143 | 0.1428571 |
| 54 | VERVA Warszawa ORLEN Paliwa | 3 | 1 | 2 | 5.857143 | 2.8571429 |
mean(testing$error)
## [1] 1.325255
As we can see above, the new season's records were assigned to the clusters, which is shown in Chart 8, and the accuracy of the classification can be seen in the table above. 9 out of 14 new records were assigned to the cluster whose mean final position was closest to the actual one. What is more, if we had guessed the teams' final positions based on this method, we would have done so with a mean error of about 1.32 positions, which is in my opinion a pretty good result.
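The exact-match share mentioned above can be computed directly:
# How many test teams landed in the cluster whose mean position is closest to their actual one
sum(testing$cluster == testing$actual_cluster) # 9 of 14
mean(testing$cluster == testing$actual_cluster)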
We have applied a somewhat unusual method based on Principal Component Analysis to calculate theoretical measures, which helped us create visualisations and cluster teams' season performance. As the results above show, our clusters remain fairly accurate on the next season's data, and the derived measures of team consistency and individual offensive effort are strongly related to the final position. These outcomes could help top teams achieve their season goals by allocating budget to improving either IOE or team consistency.