The investigated data set consists of football match statistics from the English Premier League for the seasons 2014/2015-2020/2021. It was scraped from a website that provides sports event statistics. The final data set has 2660 observations and 18 variables (after variable elimination). Each observation holds the statistics of a single match.
The analysis clusters the data in order to gain insight into football matches. Its results could be useful to people who bet on the results of sports events.
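The analysis proceeds through the following steps: data preparation, correlation-based variable selection, rescaling, assessment of clusterability, selection of the number of clusters, clustering with k-means, PAM and CLARA, and evaluation of the resulting clusters.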
library(NbClust)
library(ClusterR)
library(factoextra)
library(fpc)
library(cluster)
library(gridExtra)
# Set the working directory to the folder that contains matches-data.csv (machine-specific path)
setwd("C:/Users/bpop/OneDrive/R/USL/project/clust")
library(readr)
matches_data <- read_csv("matches-data.csv")
matches_data
## # A tibble: 2,660 x 29
## MatchID MatchDate Week HomeTeam AwayTeam HomeGoalsHT AwayGoalsHT HomeGoalsFT AwayGoalsFT HomeBallPos AwayBallPos HomeShotsOffTarget AwayShotsOffTarget HomeShotsOnTarget AwayShotsOnTarget HomeBlockedShots AwayBlockedShots HomeCorners AwayCorners HomePassSuccPerc AwayPassSuccPerc HomeAerialsWon AwayAerialsWon HomeFouls AwayFouls HomeYellowCards AwayYellowCards HomeRedCards AwayRedCards
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 Aug 16, 2014 1 Man Utd Swansea 0 1 1 2 60% 40% 5 0 5 4 4 1 4 0 86 80 20 10 14 20 2 4 0 0
## 2 1 Aug 16, 2014 1 QPR Hull City 0 0 0 1 51% 49% 7 3 6 4 6 4 8 9 77 76 30 15 10 10 1 2 0 0
## 3 2 Aug 16, 2014 1 Stoke Aston Villa 0 0 0 1 63% 37% 4 4 2 1 6 2 2 8 84 68 30 9 14 9 0 3 0 0
## 4 3 Aug 16, 2014 1 West Brom Sunderland 1 1 2 2 58% 42% 5 2 5 2 0 3 6 3 80 75 16 15 18 9 3 1 0 0
## 5 4 Aug 16, 2014 1 Leicester City Everton 1 2 2 2 37% 63% 5 5 3 3 3 5 3 6 77 84 27 14 16 10 1 1 0 0
## 6 5 Aug 16, 2014 1 West Ham Tottenham 0 0 0 1 47% 53% 10 2 4 4 4 4 8 5 83 80 15 12 12 10 2 0 0 1
## 7 6 Aug 16, 2014 1 Arsenal Crystal Palace 1 1 2 1 76% 24% 5 0 6 2 3 2 9 3 88 57 23 17 13 19 2 3 0 0
## 8 7 Aug 17, 2014 1 Liverpool Southampton 1 0 2 1 56% 44% 5 4 5 6 2 2 2 6 86 77 23 14 8 11 1 2 0 0
## 9 8 Aug 17, 2014 1 Newcastle Man City 0 1 0 2 44% 56% 9 5 0 5 3 3 3 3 83 86 14 16 8 11 1 5 0 0
## 10 9 Aug 18, 2014 1 Burnley Chelsea 1 3 1 3 39% 61% 6 4 2 3 1 4 4 3 70 82 27 20 6 7 1 1 0 0
## # ... with 2,650 more rows
The data set contains statistics from Premier League football matches. It covers 7 seasons, 2014/2015-2020/2021, which amounts to 2660 observations. The data was gathered with a self-built web scraper.
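The available variables are MatchID, MatchDate, Week, HomeTeam and AwayTeam.
The remaining variables carry the prefix ‘Home’ or ‘Away’, which indicates the team. Without the prefix they are: GoalsHT (goals at half time), GoalsFT (goals at full time), BallPos (ball possession, in %), ShotsOffTarget, ShotsOnTarget, BlockedShots, Corners, PassSuccPerc (share of successful passes, in %), AerialsWon (aerial duels won), Fouls, YellowCards and RedCards.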
There are no missing values in the data set.
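The claim about missing values can be checked in one or two lines. The sketch below uses a small made-up data frame (`toy_matches` is hypothetical and deliberately contains one `NA`) rather than the real `matches_data`, so that the check has something to find.

```r
# Toy data frame standing in for matches_data (values are made up)
toy_matches <- data.frame(
  HomeGoalsFT = c(1, 0, 2),
  AwayGoalsFT = c(2, 1, NA),
  HomeBallPos = c("60%", "51%", "63%"),
  stringsAsFactors = FALSE
)

# TRUE if at least one value anywhere in the data frame is missing
any_missing <- anyNA(toy_matches)

# Number of missing values in each variable
missing_per_column <- colSums(is.na(toy_matches))
missing_per_column
```

Applied to the real data, `anyNA(matches_data)` should return `FALSE` if the statement above holds.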
Data preparation: converting strings to numbers, rescaling percentages to fractions and removing unnecessary variables.
# Change ball possession from % to numbers
matches_data$HomeBallPos = sub('%', '', matches_data$HomeBallPos)
matches_data$AwayBallPos = sub('%', '', matches_data$AwayBallPos)
matches_data$HomeBallPos = as.numeric(matches_data$HomeBallPos)/100
matches_data$AwayBallPos = as.numeric(matches_data$AwayBallPos)/100
# Change successful pass percentage to numbers
matches_data$HomePassSuccPerc = matches_data$HomePassSuccPerc/100
matches_data$AwayPassSuccPerc = matches_data$AwayPassSuccPerc/100
matches_data_clear = subset(matches_data, select= -c(MatchID, MatchDate, Week, HomeTeam, AwayTeam))
Variables used for clustering should not be highly correlated, so some of them are removed from the data set based on a correlation analysis.
library(corrplot)
testRes = cor.mtest(matches_data_clear, conf.level = 0.95) # Significance of the correlation
corrplot(cor(matches_data_clear), p.mat = testRes$p, type = 'lower', method = 'number',
insig = 'blank', tl.cex=1)
Home team ball possession and away team ball possession are perfectly negatively correlated, so one of them must be discarded. Full-time goals and half-time goals are highly correlated; half-time goals are discarded. Shots on target are highly correlated with goals scored for both home and away teams, so they are discarded as well.
matches_data_clear = subset(matches_data_clear, select= -c(AwayBallPos))
matches_data_clear = subset(matches_data_clear, select= -c(HomeGoalsHT, AwayGoalsHT))
matches_data_clear = subset(matches_data_clear, select= -c(HomeShotsOnTarget, AwayShotsOnTarget))
corrplot(cor(matches_data_clear), type = 'lower', method = 'number')
Home team ball possession, which was kept when the perfectly correlated pair was resolved, is still highly correlated with a number of other variables. It is therefore discarded from the data set as well.
matches_data_clear = subset(matches_data_clear, select= -c(HomeBallPos))
corrplot(cor(matches_data_clear), type = 'lower', method = 'number')
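Reading highly correlated pairs off a correlation plot can be complemented by listing them programmatically. The following is a minimal sketch on simulated data; the data frame `toy` and the cut-off `threshold = 0.7` are assumptions made for illustration, not values taken from the analysis.

```r
set.seed(1)
n <- 200
possession_home <- runif(n, 0.3, 0.7)

# Simulated stand-in for matches_data_clear (hypothetical values)
toy <- data.frame(
  HomeBallPos = possession_home,
  AwayBallPos = 1 - possession_home,  # perfectly negatively correlated by construction
  HomeFouls   = rpois(n, 11)          # generated independently of possession
)

cor_mat <- cor(toy)
threshold <- 0.7  # hypothetical cut-off for "highly correlated"

# Blank the upper triangle and the diagonal so each pair is reported once
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA
high <- which(abs(cor_mat) > threshold, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
high_pairs
```

Which variable of each pair to drop remains a judgment call; the list only shows where such a decision is needed.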
table(matches_data_clear$HomeRedCards)
##
## 0 1 2
## 2567 89 4
table(matches_data_clear$AwayRedCards)
##
## 0 1 2
## 2557 98 5
The highest number of red cards given to a team during a match is 2, and it occurred in only 9 cases. It is therefore reasonable to convert this variable into a binary one (no red card vs. at least one red card).
matches_data_clear$HomeRedCards[matches_data_clear$HomeRedCards > 1] = 1
matches_data_clear$AwayRedCards[matches_data_clear$AwayRedCards > 1] = 1
summary(matches_data_clear)
## HomeGoalsFT AwayGoalsFT HomeShotsOffTarget AwayShotsOffTarget HomeBlockedShots AwayBlockedShots HomeCorners AwayCorners HomePassSuccPerc AwayPassSuccPerc HomeAerialsWon AwayAerialsWon HomeFouls AwayFouls HomeYellowCards AwayYellowCards HomeRedCards AwayRedCards
## Min. :0.000 Min. :0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.5100 Min. :0.4800 Min. : 3.0 Min. : 1.0 Min. : 0.00 Min. : 1.00 Min. :0.000 Min. :0.000 Min. :0.00000 Min. :0.00000
## 1st Qu.:1.000 1st Qu.:0.000 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.:0.7300 1st Qu.:0.7200 1st Qu.:13.0 1st Qu.:13.0 1st Qu.: 8.00 1st Qu.: 9.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :1.000 Median :1.000 Median : 5.000 Median : 4.000 Median : 3.000 Median : 3.000 Median : 5.000 Median : 4.000 Median :0.7900 Median :0.7800 Median :17.0 Median :17.0 Median :11.00 Median :11.00 Median :1.000 Median :2.000 Median :0.00000 Median :0.00000
## Mean :1.505 Mean :1.207 Mean : 5.379 Mean : 4.375 Mean : 3.818 Mean : 3.043 Mean : 5.776 Mean : 4.704 Mean :0.7823 Mean :0.7697 Mean :18.3 Mean :18.1 Mean :10.64 Mean :10.99 Mean :1.573 Mean :1.757 Mean :0.03496 Mean :0.03872
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 6.000 3rd Qu.:0.8400 3rd Qu.:0.8300 3rd Qu.:23.0 3rd Qu.:23.0 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :9.000 Max. :9.000 Max. :16.000 Max. :14.000 Max. :19.000 Max. :14.000 Max. :19.000 Max. :16.000 Max. :0.9400 Max. :0.9400 Max. :67.0 Max. :53.0 Max. :24.00 Max. :26.00 Max. :6.000 Max. :9.000 Max. :1.00000 Max. :1.00000
sapply(matches_data_clear, mean)
## HomeGoalsFT AwayGoalsFT HomeShotsOffTarget AwayShotsOffTarget HomeBlockedShots AwayBlockedShots HomeCorners AwayCorners HomePassSuccPerc AwayPassSuccPerc HomeAerialsWon AwayAerialsWon HomeFouls AwayFouls HomeYellowCards AwayYellowCards HomeRedCards AwayRedCards
## 1.50451128 1.20714286 5.37894737 4.37518797 3.81804511 3.04285714 5.77631579 4.70375940 0.78228571 0.76968421 18.30375940 18.09887218 10.64172932 10.98646617 1.57330827 1.75714286 0.03496241 0.03872180
The means of the variables differ by almost three orders of magnitude (from about 0.035 to about 18.3), so it is reasonable to bring the variables to a common scale. The red card variable is binary for both teams. In such a case the data could be standardized with the z-score transformation or rescaled to the [0, 1] interval. The second alternative was chosen for the analysis, as it leaves the binary variables unchanged.
library(scales)
matches_data_scaled = as.data.frame(sapply(matches_data_clear, rescale))
summary(matches_data_scaled)
## HomeGoalsFT AwayGoalsFT HomeShotsOffTarget AwayShotsOffTarget HomeBlockedShots AwayBlockedShots HomeCorners AwayCorners HomePassSuccPerc AwayPassSuccPerc HomeAerialsWon AwayAerialsWon HomeFouls AwayFouls HomeYellowCards AwayYellowCards HomeRedCards AwayRedCards
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.1111 1st Qu.:0.0000 1st Qu.:0.1875 1st Qu.:0.2143 1st Qu.:0.1053 1st Qu.:0.07143 1st Qu.:0.1579 1st Qu.:0.1875 1st Qu.:0.5116 1st Qu.:0.5217 1st Qu.:0.1562 1st Qu.:0.2308 1st Qu.:0.3333 1st Qu.:0.3200 1st Qu.:0.1667 1st Qu.:0.1111 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.1111 Median :0.1111 Median :0.3125 Median :0.2857 Median :0.1579 Median :0.21429 Median :0.2632 Median :0.2500 Median :0.6512 Median :0.6522 Median :0.2188 Median :0.3077 Median :0.4583 Median :0.4000 Median :0.1667 Median :0.2222 Median :0.00000 Median :0.00000
## Mean :0.1672 Mean :0.1341 Mean :0.3362 Mean :0.3125 Mean :0.2009 Mean :0.21735 Mean :0.3040 Mean :0.2940 Mean :0.6332 Mean :0.6297 Mean :0.2391 Mean :0.3288 Mean :0.4434 Mean :0.3995 Mean :0.2622 Mean :0.1952 Mean :0.03496 Mean :0.03872
## 3rd Qu.:0.2222 3rd Qu.:0.2222 3rd Qu.:0.4375 3rd Qu.:0.4286 3rd Qu.:0.2632 3rd Qu.:0.28571 3rd Qu.:0.4211 3rd Qu.:0.3750 3rd Qu.:0.7674 3rd Qu.:0.7609 3rd Qu.:0.3125 3rd Qu.:0.4231 3rd Qu.:0.5417 3rd Qu.:0.4800 3rd Qu.:0.3333 3rd Qu.:0.3333 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
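The rescaling to [0, 1] corresponds to min-max normalization, (x - min) / (max - min), which is what `scales::rescale` computes with its default arguments. The sketch below reimplements it in base R and compares it with the z-score alternative; the helper `min_max` is hypothetical, and the `fouls` vector simply reuses the minimum, quartiles and maximum reported for HomeFouls in the first summary.

```r
# Min-max normalization: maps the smallest value to 0 and the largest to 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

fouls <- c(0, 8, 11, 13, 24)  # min, 1st quartile, median, 3rd quartile, max of HomeFouls
red   <- c(0, 0, 1, 0, 0)     # a binary variable (illustrative values)

fouls_mm <- min_max(fouls)
red_mm   <- min_max(red)      # a 0/1 variable is left unchanged

# z-score alternative: mean 0 and standard deviation 1, but no fixed range
fouls_z <- as.numeric(scale(fouls))

round(fouls_mm, 4)
```

The rescaled values 0.3333, 0.4583 and 0.5417 agree with the quartiles of HomeFouls in the second summary.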
dist_eucl<-get_dist(matches_data_scaled, method="euclidean")
fviz_dist(dist_eucl, show_labels = FALSE) + labs(title="Matches data") # factoextra::
Some blocks are visible in the plot of the distance matrix, although they are not very clear. It may be assumed that the data is clusterable to some extent.
The next step is to determine the optimal number of clusters.
# with min.nc = 2 the system is singular, so min.nc = 3 is used here
opt1 = NbClust(data = matches_data_scaled, distance = 'euclidean', min.nc = 3, max.nc = 10, method = 'average')
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 3 as the best number of clusters
## * 13 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 4 proposed 7 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
opt2 = NbClust(data = matches_data_scaled, distance = 'euclidean', min.nc = 2, max.nc = 10, method = 'median')
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 10 proposed 2 as the best number of clusters
## * 1 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 7 proposed 7 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
opt3 = NbClust(data = matches_data_scaled, distance = 'euclidean', min.nc = 2, max.nc = 10, method = 'centroid')
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 7 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
With the ‘average’ linkage method the best choice is 4 clusters. With the ‘median’ and ‘centroid’ linkage methods the best choice is 2 clusters.
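The majority rule reported by NbClust is a plain vote count over the indices. As a sketch, the vote counts printed for the first run can be re-tallied by hand; the `votes` vector is typed in from that output, not extracted from the `opt1` object.

```r
# Votes copied from the printed NbClust summary of the first run:
# 4 indices chose 3 clusters, 13 chose 4, 1 chose 6, 4 chose 7, 1 chose 10
votes <- c(rep(3, 4), rep(4, 13), rep(6, 1), rep(7, 4), rep(10, 1))

vote_table <- table(votes)

# The majority rule picks the number of clusters with the most votes
best_k <- as.integer(names(vote_table)[which.max(vote_table)])
best_k
```

With a real NbClust result, the per-index choices are presumably available in the `Best.nc` component of the returned list; that is an assumption about the object's structure and worth checking against the package documentation.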
fv_km = fviz_nbclust(matches_data_scaled,kmeans,method = "silhouette") +ggtitle("kmeans")
fv_pam = fviz_nbclust(matches_data_scaled,pam,method = "silhouette") +ggtitle("pam")
fv_clara = fviz_nbclust(matches_data_scaled,clara,method = "silhouette") +ggtitle("clara")
grid.arrange(fv_km,fv_pam,fv_clara, ncol=2, top = "Silhouette based number of clusters")
According to the silhouette criterion, 3 clusters are the best choice for k-means, and 2 clusters for PAM and CLARA.
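The silhouette criterion can also be computed by hand: for each candidate number of clusters, the data is clustered and the silhouette widths are averaged. The sketch below does this for k-means on simulated data with two well-separated groups; the data, the range 2 to 6 and `nstart = 10` are assumptions for illustration and do not reproduce the exact procedure of `fviz_nbclust`.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Two well-separated groups of 50 points each (simulated data)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)
d <- dist(toy)

# Average silhouette width for each candidate number of clusters
candidate_k <- 2:6
avg_sil <- sapply(candidate_k, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- candidate_k

# The candidate with the highest average width is selected
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```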
Most methods pointed to 2 clusters as the best choice; 3 and 4 clusters were also proposed.
Based on these results, solutions with 2 and 3 clusters are analyzed.
The data set is considered large and does not show a hierarchical structure. This is why only non-hierarchical (partitioning) methods are used in the analysis.
Dim1 and Dim2 on the graphs stand for the first two dimensions resulting from the application of PCA.
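For readers unfamiliar with this projection, the sketch below shows how two such dimensions can be obtained with `prcomp` on simulated data. The data frame `toy` is made up, and scaling the variables before PCA is an assumption; the source does not show how factoextra prepares the data internally.

```r
set.seed(7)
# Simulated data with four variables, one of them nearly a copy of another
toy <- data.frame(a = runif(50), b = runif(50), c = runif(50))
toy$d <- toy$a + rnorm(50, sd = 0.05)

# Principal component analysis on centered and scaled variables
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Coordinates of the observations on the first two components (Dim1, Dim2)
scores <- pca$x[, 1:2]

# Share of total variance carried by each component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share, 3)
```

The share of variance carried by the first two components indicates how much of the structure a two-dimensional cluster plot can show; clusters that overlap in this projection may still differ along the remaining dimensions.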
# k-means
cluster_km <- eclust(matches_data_scaled,"kmeans", k = 3)
Viewed in the first two dimensions, k-means divides the data into three groups. They are not very well separated, and the division into clusters is based mainly on the Dim1 values.
# pam
cluster_pam<-eclust(matches_data_scaled, "pam", k = 2)
Here the case is similar to k-means: Dim1 divides the data into two clusters.
# clara
cluster_clara<-eclust(matches_data_scaled, "clara", k = 2)
In the CLARA approach both dimensions are significant for the cluster separation.
The plots show no clearly separated groups in the data: the derived clusters overlap and have fuzzy boundaries. Nevertheless, each cluster occupies a distinguishable region of the plot.
matches_data_clear$cluster_km = cluster_km$cluster
matches_data_clear$cluster_pam = cluster_pam$clustering
matches_data_clear$cluster_clara = cluster_clara$clustering
sil_km<-silhouette(cluster_km$cluster, dist(matches_data_scaled))
fviz_silhouette(sil_km)
## cluster size ave.sil.width
## 1 1 1026 0.10
## 2 2 893 0.07
## 3 3 741 0.10
sil_pam<-silhouette(cluster_pam$clustering, dist(matches_data_scaled))
fviz_silhouette(sil_pam)
## cluster size ave.sil.width
## 1 1 1640 0.12
## 2 2 1020 0.09
sil_clara<-silhouette(cluster_clara$clustering, dist(matches_data_scaled))
fviz_silhouette(sil_clara)
## cluster size ave.sil.width
## 1 1 1220 0.12
## 2 2 1440 0.07
The average silhouette width is almost the same for all three methods and low in absolute terms (around 0.1), which indicates a weak cluster structure. The highest value, 0.11, is reached for PAM.
The highest share of negative silhouette width values is observed for the second cluster in the CLARA method.
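The overall averages can be approximated from the per-cluster tables above as size-weighted means. The sketch below uses the printed, already rounded per-cluster widths, so the results are approximate.

```r
# Cluster sizes and average silhouette widths copied from the printed tables
sil <- list(
  kmeans = data.frame(size = c(1026, 893, 741), width = c(0.10, 0.07, 0.10)),
  pam    = data.frame(size = c(1640, 1020),     width = c(0.12, 0.09)),
  clara  = data.frame(size = c(1220, 1440),     width = c(0.12, 0.07))
)

# Overall average silhouette width = mean of cluster widths weighted by cluster size
overall <- sapply(sil, function(s) weighted.mean(s$width, s$size))
round(overall, 2)
# kmeans 0.09, pam 0.11, clara 0.09
```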
For all three methods the clusters are reasonably balanced in size:
table(cluster_km$cluster)/length(cluster_km$cluster)
##
## 1 2 3
## 0.3857143 0.3357143 0.2785714
table(cluster_pam$clustering)/length(cluster_pam$clustering)
##
## 1 2
## 0.6165414 0.3834586
table(cluster_clara$clustering)/length(cluster_clara$clustering)
##
## 1 2
## 0.4586466 0.5413534
# general characteristics of clusters for kmeans
clust_avg_km = aggregate(. ~ cluster_km, data = matches_data_clear, mean)
t(clust_avg_km)
## [,1] [,2] [,3]
## cluster_km 1.00000000 2.00000000 3.00000000
## HomeGoalsFT 1.88206628 1.31466965 1.21052632
## AwayGoalsFT 0.93957115 1.46584546 1.26585695
## HomeShotsOffTarget 6.97270955 3.82306831 5.04723347
## AwayShotsOffTarget 3.27582846 5.91489362 4.04183536
## HomeBlockedShots 5.14717349 2.47592385 3.59514170
## AwayBlockedShots 1.98538012 4.37737962 2.89878543
## HomeCorners 7.59844055 3.92609183 5.48313090
## AwayCorners 3.29044834 6.57110862 4.41025641
## HomePassSuccPerc 0.83294347 0.73539754 0.76865047
## AwayPassSuccPerc 0.73125731 0.81979843 0.76249663
## HomeAerialsWon 17.78947368 17.71444569 19.72604588
## AwayAerialsWon 16.84502924 18.19932811 19.71390013
## HomeFouls 9.49025341 10.11646137 12.86909582
## AwayFouls 10.64424951 10.37849944 12.19298246
## HomeYellowCards 0.97173489 1.24636058 2.80026991
## AwayYellowCards 1.66569201 1.40649496 2.30634278
## HomeRedCards 0.01559454 0.07390817 0.01484480
## AwayRedCards 0.06335283 0.01791713 0.02968961
## cluster_pam 1.00682261 1.68980963 1.53576248
## cluster_clara 1.13157895 1.79171333 1.80701754
# general characteristics of clusters for pam
clust_avg_pam = aggregate(. ~ cluster_pam, data = matches_data_clear, mean)
t(clust_avg_pam)
## [,1] [,2]
## cluster_pam 1.00000000 2.00000000
## HomeGoalsFT 1.61707317 1.32352941
## AwayGoalsFT 1.01585366 1.51470588
## HomeShotsOffTarget 6.10914634 4.20490196
## AwayShotsOffTarget 3.62560976 5.58039216
## HomeBlockedShots 4.57378049 2.60294118
## AwayBlockedShots 2.37621951 4.11470588
## HomeCorners 6.87073171 4.01666667
## AwayCorners 3.61524390 6.45392157
## HomePassSuccPerc 0.81172561 0.73495098
## AwayPassSuccPerc 0.74317683 0.81230392
## HomeAerialsWon 18.77317073 17.54901961
## AwayAerialsWon 17.85548780 18.49019608
## HomeFouls 9.82134146 11.96078431
## AwayFouls 11.05304878 10.87941176
## HomeYellowCards 1.22195122 2.13823529
## AwayYellowCards 1.68170732 1.87843137
## HomeRedCards 0.02439024 0.05196078
## AwayRedCards 0.04390244 0.03039216
## cluster_km 1.58841463 2.38235294
## cluster_clara 1.30365854 1.92352941
# general characteristics of clusters for clara
clust_avg_clara = aggregate(. ~ cluster_clara, data = matches_data_clear, mean)
t(clust_avg_clara)
## [,1] [,2]
## cluster_clara 1.00000000 2.00000000
## HomeGoalsFT 1.84918033 1.21250000
## AwayGoalsFT 0.95901639 1.41736111
## HomeShotsOffTarget 5.83688525 4.99097222
## AwayShotsOffTarget 3.51885246 5.10069444
## HomeBlockedShots 4.78688525 2.99722222
## AwayBlockedShots 2.00655738 3.92083333
## HomeCorners 6.83934426 4.87569444
## AwayCorners 3.56147541 5.67152778
## HomePassSuccPerc 0.82970492 0.74211111
## AwayPassSuccPerc 0.74189344 0.79322917
## HomeAerialsWon 17.03442623 19.37916667
## AwayAerialsWon 16.20000000 19.70763889
## HomeFouls 9.84590164 11.31597222
## AwayFouls 10.77704918 11.16388889
## HomeYellowCards 0.96557377 2.08819444
## AwayYellowCards 1.70245902 1.80347222
## HomeRedCards 0.02704918 0.04166667
## AwayRedCards 0.05081967 0.02847222
## cluster_km 1.38688525 2.32152778
## cluster_pam 1.06393443 1.65416667
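In the tables above, the last two rows are means of the other methods' cluster labels, which have no direct interpretation. A possible way to avoid them is to aggregate only the match statistics. The sketch below shows this on a tiny made-up data frame; `toy`, `label_cols` and `stat_cols` are hypothetical names.

```r
# Tiny stand-in for matches_data_clear with cluster labels attached (made-up values)
toy <- data.frame(
  HomeGoalsFT   = c(2, 3, 0, 1),
  AwayGoalsFT   = c(0, 1, 2, 2),
  cluster_km    = c(1, 1, 2, 3),
  cluster_pam   = c(1, 1, 2, 2),
  cluster_clara = c(1, 1, 2, 2)
)

# Separate the label columns from the match statistics
label_cols <- c("cluster_km", "cluster_pam", "cluster_clara")
stat_cols  <- setdiff(names(toy), label_cols)

# Mean of each statistic within each PAM cluster, without the other label columns
clust_avg_pam <- aggregate(toy[stat_cols],
                           by = list(cluster_pam = toy$cluster_pam),
                           FUN = mean)
t(clust_avg_pam)
```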
The means of the variables grouped by cluster show that a division into two clusters (PAM and CLARA) separates the matches according to which team was better: the 1st cluster gathers matches dominated by the home team and the 2nd cluster matches dominated by the away team.
When the data is divided into three clusters (the k-means case), the same tendency appears: the first two clusters gather the matches in which one of the teams is clearly better (even more pronounced than in the two-cluster case), and the third one gathers matches in which the statistics are evenly distributed between the teams and in which fouls and yellow cards are more frequent.
The cluster analysis showed that the data set can be separated into two or three clusters. The resulting clusters are reasonable, as they divide the matches into logical sets. The results could be used in further analyses, which might reveal other regularities in football matches.