This article is a part of the Unsupervised Learning course at the Faculty of Economic Sciences, University of Warsaw.
Volleyball is a team sport in which each point is a separate action and the goal is to be the first team to reach 25 points in a set. Therefore the main drivers of winning matches are team consistency, also expressed as avoiding mistakes, and points gained through players' individual effort. In this paper, I will try to identify and measure these theoretical concepts and to predict teams' positions based on their assignment to clusters obtained with Principal Component Analysis and K-means clustering. I will use statistics from Plusliga, the highest-level professional league in Poland and one of the best in the world, for seasons 2017/18 - 2020/21, provided by Plusliga.pl. The original dataset comes from my prior university projects.
kable(head(Plusliga))
| Season | Team Name | Sets | Points | Serves | Aces | Serve Errors | Aces per set | Receptions | Reception Errors | Negative reception | Perfect Reception | % perf reception | Attacks | Attack errors | Blocked attacks | Perfect attacs | % perf attacks | Blocks | Blocks per set |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020/21 | Aluron CMC Warta Zawiercie | 5 | 73 | 108 | 6 | 11 | 1.20 | 85 | 9 | 19 | 8 | 9.41 | 121 | 7 | 9 | 60 | 49.59 | 7 | 1.40 |
| 2020/21 | Aluron CMC Warta Zawiercie | 4 | 76 | 101 | 6 | 10 | 1.50 | 81 | 4 | 30 | 24 | 29.63 | 111 | 7 | 8 | 60 | 54.05 | 10 | 2.50 |
| 2020/21 | Aluron CMC Warta Zawiercie | 3 | 54 | 74 | 3 | 8 | 1.00 | 43 | 3 | 12 | 9 | 20.93 | 67 | 0 | 2 | 43 | 64.18 | 8 | 2.67 |
| 2020/21 | Aluron CMC Warta Zawiercie | 3 | 61 | 74 | 5 | 5 | 1.67 | 51 | 2 | 8 | 13 | 25.49 | 83 | 5 | 6 | 51 | 61.45 | 5 | 1.67 |
| 2020/21 | Aluron CMC Warta Zawiercie | 4 | 82 | 108 | 5 | 10 | 1.25 | 77 | 4 | 17 | 21 | 27.27 | 117 | 5 | 7 | 70 | 59.83 | 7 | 1.75 |
| 2020/21 | Aluron CMC Warta Zawiercie | 3 | 54 | 74 | 3 | 11 | 1.00 | 49 | 3 | 7 | 10 | 20.41 | 71 | 4 | 7 | 40 | 56.34 | 11 | 3.67 |
We want to capture two measures: Team Consistency and Individual Offensive Effort (IOE). As most of the original statistics from the Plusliga website are heavily determined by the number of sets played in each match, we will create custom measures. To achieve that, we need to define other, more granular drivers of those measures.
# Team consistency measures (dplyr::coalesce replaces missing values from zero denominators with 0):
Plusliga$serve_consistency <- coalesce((Plusliga$Serves - Plusliga$`Serve Errors`)/Plusliga$Serves, 0)
Plusliga$Reception_consistency <- coalesce((Plusliga$Receptions - Plusliga$`Reception Errors` - Plusliga$`Negative reception`)/Plusliga$Receptions, 0)
Plusliga$Perfect_reception_consistency <- coalesce(Plusliga$`Perfect Reception`/Plusliga$Receptions, 0)
Plusliga$Attacks_consistency <- coalesce((Plusliga$Attacks - Plusliga$`Attack errors`) / Plusliga$Attacks, 0)
#IOE measures (we will use two common measures - blocks per set and aces per set, and one custom)
Plusliga$Perfect_Attacks_ratio <- coalesce(Plusliga$`Perfect attacs`/Plusliga$Attacks,0)
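As a quick sanity check, we can confirm that the new measures are per-action ratios rather than per-match totals, so they are not driven by the number of sets played; a minimal sketch using the columns created above:
# Each custom measure should lie between 0 and 1
summary(Plusliga[c("serve_consistency", "Reception_consistency", "Perfect_reception_consistency", "Attacks_consistency", "Perfect_Attacks_ratio")])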
# Grouping by Team name and Season
df <- aggregate(Plusliga[c(8,20:25)], by =list(Plusliga$`Team Name`, Plusliga$Season), FUN = mean)
# Adding group and final standings data
Final_Standings <- read_delim("Final_Standings.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
df <- merge(df, Final_Standings, by.x = c("Group.2", "Group.1"), by.y = c("Season", "Team"))
colnames(df) <- c("Season", "Team", colnames(df[,3:11]))
Now we can group the dataset by team and season and include the standings from the respective seasons. We add both Group Standings (position in the table after the group phase) and Final Standings (position in the table after the play-off phase, which follows the last match of the group phase).
| Season | Team | Aces per set | Blocks per set | serve_consistency | Reception_consistency | Perfect_reception_consistency | Attacks_consistency | Perfect_Attacks_ratio | Group Standings | Final Standings |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017/18 | Aluron CMC Warta Zawiercie | 1.662187 | 2.062500 | 0.8018745 | 0.6438980 | 0.2361160 | 0.9292495 | 0.4735497 | 10 | 9 |
| 2017/18 | Asseco Resovia Rzeszów | 1.469722 | 2.064722 | 0.8271924 | 0.6614031 | 0.2335482 | 0.9348035 | 0.5076606 | 4 | 6 |
| 2017/18 | BBTS Bielsko-Biała | 1.099667 | 2.396667 | 0.8353156 | 0.6044907 | 0.2006393 | 0.9118757 | 0.4616044 | 15 | 15 |
| 2017/18 | BKS Visła Bydgoszcz | 1.136061 | 2.046667 | 0.7990014 | 0.6230099 | 0.2058731 | 0.8947184 | 0.4402737 | 14 | 14 |
| 2017/18 | Cerrad Czarni Radom | 1.662813 | 2.041250 | 0.8149306 | 0.6713662 | 0.2356678 | 0.9228777 | 0.5070535 | 9 | 10 |
| 2017/18 | Cuprum Lubin | 1.255625 | 2.092500 | 0.8704322 | 0.6425882 | 0.1926827 | 0.9438522 | 0.4672254 | 8 | 7 |
We will also partition the data into a training set (seasons 2017/18 - 2019/20) and a test set (2020/21).
training <- df[df$Season != "2020/21",]
test <- df[df$Season == "2020/21",]
As can be seen in Chart 1 below, group standings are negatively correlated with each of the selected variables. We expected this kind of relation, as a high value of every variable theoretically means a positive effect, which translates into lower values of Group and Final Standings (higher places). Moreover, as Group Standings and Final Standings show similar correlations, from now on we will use only Final Standings.
corr = cor(training[,3:11], method='pearson')
corrplot(corr, title = "Chart 1. Correlation analysis between selected variables")
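To back the claim that Group Standings and Final Standings correlate similarly with the selected variables, we can also look at the raw numbers (a minimal check on the same training data):
# Correlations of each measure with the two standings columns
round(cor(training[,3:9], training[,10:11], method = 'pearson'), 2)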
In this part, we will create one Principal Component for each group of variables: Consistency and IOE. Instead of normalizing the variables beforehand, we will set the center and scale. arguments of the prcomp function to TRUE.
df.pcat1 <- prcomp(training[c(5:8)], center = T, scale. = T)
df.pcat2 <- prcomp(training[c(3:4,9)], center = T, scale. = T)
# Visualizing Explained variance
grid.arrange(fviz_eig(df.pcat1, main = "Chart 2. Explained variance by Consistency PCs"), fviz_eig(df.pcat2, main = "Chart 3. Explained variance by IOE PCs"), nrow = 1)
As we can see above, the first Principal Components explain more than 40% and more than 50% of the variance respectively, therefore we will use those two components.
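The exact shares can also be read off numerically from the prcomp objects:
# Proportion of variance explained by each principal component
summary(df.pcat1)$importance["Proportion of Variance", ]
summary(df.pcat2)$importance["Proportion of Variance", ]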
# Binding dataframe with teams, seasons, final standings and first PC scores
# (a PC's sign is arbitrary; the IOE score is flipped so that higher values mean better individual offensive performance)
trainingresults <- data.frame(training$Season, training$Team, training$`Final Standings`, df.pcat1$x[,1], -df.pcat2$x[,1])
colnames(trainingresults) <- c("Season", "Team", "Final Standings", "Consistency", "IOE")
# Plotting the PC scores
sp <- ggplot(trainingresults, aes(Consistency, IOE))+
geom_point(aes(size=`Final Standings`, color= `Final Standings`)) + ggtitle("Chart 4. Consistency and IOE for the teams")
sp
Chart 4 gives us some intuition that the two derived measures are indeed related to the final standing and that the relation is roughly linear (higher values of those measures mean higher final positions, i.e. lower standing numbers).
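To quantify that intuition, one could also fit a simple linear model of the final standing on the two components (a quick sketch, not part of the original pipeline; negative coefficients are expected because lower standing numbers mean better places):
# Linear relation between the derived measures and the final standing
summary(lm(`Final Standings` ~ Consistency + IOE, data = trainingresults))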
#Gap analysis chart
fviz_nbclust(trainingresults[4:5], FUNcluster = stats::kmeans, method = "gap_stat") + ggtitle("Chart 5. Gap Analysis")
# K-means clustering and plotting results
kmtraining <- kmeans(trainingresults[,4:5],6, nstart = 100)
clusterstrain <- ggplot(trainingresults, aes(Consistency, IOE))+
geom_point(aes(size=`Final Standings`, color= factor(kmtraining$cluster))) + scale_color_manual(values = c('#EDE61A','#84D076', '#29B510', "#ED481A", "#ED7D1A","#EDB21A")) + geom_text(label = trainingresults$`Final Standings`, check_overlap = TRUE) + ggtitle("Chart 6. Trained Clusters")
clusterstrain
Chart 5 above suggests that the optimal number of clusters for the k-means method in this example is 6, based on the gap analysis. Therefore we group the training dataset into 6 clusters, as shown in Chart 6. We can see that the final positions within each cluster are similar.
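A side note: k-means depends on random initialization, so the cluster numbering and, occasionally, the assignments can differ between runs; nstart = 100 keeps the solution stable, but fixing a seed before the kmeans call (the value below is arbitrary) would make it fully reproducible.
set.seed(123) # arbitrary seed, to be called before kmeans() for reproducibility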
# Assigning clusters to observations, calculating and plotting the mean position within clusters
trainingresults$cluster <- kmtraining$cluster
mean_pos <- aggregate(trainingresults[,3], by = list(trainingresults$cluster), FUN = mean)
centroids <- data.frame("Consistency" = kmtraining$centers[,1],
"IOE" = kmtraining$centers[,2],
"cluster" = rownames(kmtraining$centers),
"position" = round(mean_pos$x,2)
)
Trainchart_avgpos <- ggplot(trainingresults, aes(Consistency, IOE))+
geom_point(aes(size=`Final Standings`, color= factor(kmtraining$cluster))) +
scale_color_manual(values = c('#EDE61A','#84D076', '#29B510', "#ED481A", "#ED7D1A","#EDB21A")) +
geom_label(data = centroids, label = centroids$position, label.r = unit(0.25, "lines"), colour = "red")+
ggtitle("Chart 7. Clusters with average position")
Trainchart_avgpos
Chart 7 represents the calculated clusters together with the average final position of the teams within each cluster. As we can see, Cluster 4 (Red) describes teams at the bottom of the table, i.e. teams with low Consistency and low Individual Offensive Effort, while Cluster 5 (Dark Orange) represents teams with a similar, below-average IOE level, like C4, but a significantly higher Consistency level, resulting in a final position that is better by almost 4 places on average. These two clusters show that, for the same level of IOE, consistency plays an important role in the final position. We can see the same pattern across Clusters 6, 1 and 2, which are characterised by similar IOE levels (around 0.5) but different Consistency, where better consistency means a higher position. Finally, Cluster 3 shows "unicorns" with outstanding individual achievements guaranteeing a high position regardless of team consistency. We can conclude that the main driver of the final position was the IOE level, as more IOE meant higher positions for every cluster, but consistency, as a secondary effect, also played an important role.
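The interpretation above can be cross-checked by summarising the clusters numerically (a short sketch using the objects already computed):
# Mean Consistency, IOE and final position per training cluster
aggregate(trainingresults[, c("Consistency", "IOE", "Final Standings")],
by = list(cluster = trainingresults$cluster), FUN = mean)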
We will now run our calculations for the 2020/21 season and see how well we can assign clusters and positions based on them:
#calculating PCs and clusters for test data
#starting with normalization
test[,3:9] <- data.Normalization(test[,3:9], type="n1", normalization="column") # clusterSim::data.Normalization, column standardization
testing <- test[,c(1:2,11)]
testing$Consistency <- as.matrix(test[,c(5:8)])%*%df.pcat1$rotation[,1]
testing$IOE <- -as.matrix(test[,c(3:4,9)])%*%df.pcat2$rotation[,1]
km.train.kcca <- as.kcca(kmtraining, trainingresults[,4:5]) # conversion to a flexclust kcca object so that predict() can be used
km.pred<-predict(km.train.kcca, testing[,4:5]) # prediction for k-means
testing$cluster <- km.pred
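An alternative to normalizing the test data by hand is to let predict() reuse the centering and scaling stored in the training prcomp objects; a hedged sketch (the helper object test_raw is introduced here only for illustration), close but not identical to the approach above, since data.Normalization uses the test set's own means and standard deviations:
# Hypothetical alternative: project the un-normalized test split with the training PCA
test_raw <- df[df$Season == "2020/21", ] # copy taken before the data.Normalization step
consistency_alt <- predict(df.pcat1, newdata = test_raw[, 5:8])[, 1]
ioe_alt <- -predict(df.pcat2, newdata = test_raw[, c(3:4, 9)])[, 1]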
# Plotting both train and test data
colnames(trainingresults) = colnames(testing[,1:6])
all <- rbind(trainingresults, testing[,1:6])
Trainchart_avgpos <- ggplot(all, aes(Consistency, IOE))+
geom_point(aes(size=`Final Standings`, color= factor(cluster))) +
scale_color_manual(values = c('#EDE61A','#84D076', '#29B510', "#ED481A", "#ED7D1A","#EDB21A")) +
geom_label(data = centroids, label = centroids$cluster, label.r = unit(0.25, "lines") )+
ggtitle("Chart 8. Training and testing data by clusers")
Trainchart_avgpos
# Finding more suited clusters (based on the position) and calculating error
testing$actual_cluster <- rep(0, nrow(testing))
for (i in 1:nrow(testing))
{
testing$actual_cluster[i] <- which.min(abs(testing$`Final Standings`[i] - mean_pos$x))
}
testing$mean_in_cluster <- mean_pos$x[testing$cluster]
testing$error <- abs(testing$`Final Standings`- mean_pos$x[testing$cluster])
kable(testing[,c(2,3,6,7,8,9)])
| | Team | Final Standings | cluster | actual_cluster | mean_in_cluster | error |
|---|---|---|---|---|---|---|
| 41 | Aluron CMC Warta Zawiercie | 8 | 5 | 5 | 9.428571 | 1.4285714 |
| 42 | Asseco Resovia Rzeszów | 5 | 2 | 1 | 3.777778 | 1.2222222 |
| 43 | Cerrad Czarni Radom | 12 | 4 | 4 | 13.125000 | 1.1250000 |
| 44 | Cuprum Lubin | 11 | 5 | 5 | 9.428571 | 1.5714286 |
| 45 | GKS Katowice | 9 | 5 | 5 | 9.428571 | 0.4285714 |
| 46 | Grupa Azoty ZAKSA Kędzierzyn-Koźle | 2 | 3 | 3 | 1.500000 | 0.5000000 |
| 47 | Indykpol AZS Olsztyn | 10 | 5 | 5 | 9.428571 | 0.5714286 |
| 48 | Jastrzębski Węgiel | 1 | 2 | 3 | 3.777778 | 2.7777778 |
| 49 | MKS Będzin | 14 | 4 | 4 | 13.125000 | 0.8750000 |
| 50 | PGE Skra Bełchatów | 4 | 3 | 2 | 1.500000 | 2.5000000 |
| 51 | Ślepsk Malow Suwałki | 7 | 5 | 6 | 9.428571 | 2.4285714 |
| 52 | Stal Nysa | 13 | 4 | 4 | 13.125000 | 0.1250000 |
| 53 | Trefl Gdańsk | 6 | 1 | 1 | 5.857143 | 0.1428571 |
| 54 | VERVA Warszawa ORLEN Paliwa | 3 | 1 | 2 | 5.857143 | 2.8571429 |
mean(testing$error)
## [1] 1.325255
As we can see above, the new season's records were assigned to the clusters, which is shown in Chart 8, and the accuracy of the classification can be seen in the table above. 9 out of 14 new records were assigned to the cluster whose mean final position was closest to the actual one. What is more, if we had guessed the teams' final positions based on this method, we would have done so with a mean error of about 1.32 positions, which is in my opinion a pretty good result.
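The exact-match share mentioned above can be computed directly:
# How many test teams landed in the cluster whose mean position is closest to their actual one
sum(testing$cluster == testing$actual_cluster) # 9 of 14
mean(testing$cluster == testing$actual_cluster)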
We have applied a somewhat unusual method based on Principal Component Analysis to calculate theoretical measures, which helped us create visualisations and cluster teams' season performance. As the results above show, our clusters remain fairly accurate on the next season's data, and the derived measures of team consistency and individual offensive effort are strongly related to the final position. These outcomes could help top teams achieve their season goals by allocating budget to improving either IOE or team consistency.