Football matches clustering

Short introduction

The investigated data set consists of football match statistics from the English Premier League for seasons 2014/2015-2020/2021. It was scraped from a website providing sports event statistics. The final size of the data set is 2660x18: 2660 observations and 18 variables (after variable elimination). Each observation contains the statistics of a single match.

The analysis clusters the data in order to gain some insight into football matches. Its results could be useful for people betting on the results of sports events.

The consecutive steps of the analysis are:

Data set preparation (variable elimination and scaling)
Deriving the optimal number of clusters
Clustering (Kmeans, PAM, CLARA)
Clusters description

Read the data

library(NbClust)
library(ClusterR)
library(factoextra)
library(fpc)
library(cluster)
library(gridExtra)

# setwd("C:/Users/blaze/Dropbox/WNE/USL/project/clust")
setwd("C:/Users/bpop/OneDrive/R/USL/project/clust")

library(readr)
matches_data <- read_csv("matches-data.csv")
matches_data

## # A tibble: 2,660 x 29
##    MatchID MatchDate     Week HomeTeam       AwayTeam       HomeGoalsHT AwayGoalsHT HomeGoalsFT AwayGoalsFT HomeBallPos AwayBallPos HomeShotsOffTarget AwayShotsOffTarget HomeShotsOnTarget AwayShotsOnTarget HomeBlockedShots AwayBlockedShots HomeCorners AwayCorners HomePassSuccPerc AwayPassSuccPerc HomeAerialsWon AwayAerialsWon HomeFouls AwayFouls HomeYellowCards AwayYellowCards HomeRedCards AwayRedCards
##      <dbl> <chr>        <dbl> <chr>          <chr>                <dbl>       <dbl>       <dbl>       <dbl> <chr>       <chr>                    <dbl>              <dbl>             <dbl>             <dbl>            <dbl>            <dbl>       <dbl>       <dbl>            <dbl>            <dbl>          <dbl>          <dbl>     <dbl>     <dbl>           <dbl>           <dbl>        <dbl>        <dbl>
##  1       0 Aug 16, 2014     1 Man Utd        Swansea                  0           1           1           2 60%         40%                          5                  0                 5                 4                4                1           4           0               86               80             20             10        14        20               2               4            0            0
##  2       1 Aug 16, 2014     1 QPR            Hull City                0           0           0           1 51%         49%                          7                  3                 6                 4                6                4           8           9               77               76             30             15        10        10               1               2            0            0
##  3       2 Aug 16, 2014     1 Stoke          Aston Villa              0           0           0           1 63%         37%                          4                  4                 2                 1                6                2           2           8               84               68             30              9        14         9               0               3            0            0
##  4       3 Aug 16, 2014     1 West Brom      Sunderland               1           1           2           2 58%         42%                          5                  2                 5                 2                0                3           6           3               80               75             16             15        18         9               3               1            0            0
##  5       4 Aug 16, 2014     1 Leicester City Everton                  1           2           2           2 37%         63%                          5                  5                 3                 3                3                5           3           6               77               84             27             14        16        10               1               1            0            0
##  6       5 Aug 16, 2014     1 West Ham       Tottenham                0           0           0           1 47%         53%                         10                  2                 4                 4                4                4           8           5               83               80             15             12        12        10               2               0            0            1
##  7       6 Aug 16, 2014     1 Arsenal        Crystal Palace           1           1           2           1 76%         24%                          5                  0                 6                 2                3                2           9           3               88               57             23             17        13        19               2               3            0            0
##  8       7 Aug 17, 2014     1 Liverpool      Southampton              1           0           2           1 56%         44%                          5                  4                 5                 6                2                2           2           6               86               77             23             14         8        11               1               2            0            0
##  9       8 Aug 17, 2014     1 Newcastle      Man City                 0           1           0           2 44%         56%                          9                  5                 0                 5                3                3           3           3               83               86             14             16         8        11               1               5            0            0
## 10       9 Aug 18, 2014     1 Burnley        Chelsea                  1           3           1           3 39%         61%                          6                  4                 2                 3                1                4           4           3               70               82             27             20         6         7               1               1            0            0
## # ... with 2,650 more rows

Data description

The data set contains statistics from Premier League football matches. It covers 7 seasons (2014/2015-2020/2021), which gives 2660 observations in total. The data was gathered using a self-built web scraper.

Available variables are:

MatchID – number of the match in the data set
MatchDate – calendar date of the played match
Week – week of the season
HomeTeam – home team name
AwayTeam – away team name

The remaining variables come in pairs with the prefixes ‘Home’ and ‘Away’, which indicate the team. Only the names without prefixes are listed below.

GoalsHT – number of goals scored in the first half of the match
GoalsFT – number of goals scored till the end of the match
BallPos – percentage of ball possession
ShotsOffTarget – number of shots off the goal
ShotsOnTarget – number of shots on the goal
BlockedShots – number of shots blocked by the opponent
Corners – number of corners
PassSuccPerc – percentage of successful passes
AerialsWon – number of aerial duels won
Fouls – number of fouls committed
YellowCards – number of yellow cards
RedCards – number of red cards

There are no missing values in the data set.

Data transformation

Changing strings to numbers, rescaling percentages to fractions and removing unnecessary variables.

# Change ball possession from % to numbers
matches_data$HomeBallPos = sub('%', '', matches_data$HomeBallPos)
matches_data$AwayBallPos = sub('%', '', matches_data$AwayBallPos)

matches_data$HomeBallPos = as.numeric(matches_data$HomeBallPos)/100
matches_data$AwayBallPos = as.numeric(matches_data$AwayBallPos)/100

# Change successful pass percentage to numbers
matches_data$HomePassSuccPerc = matches_data$HomePassSuccPerc/100
matches_data$AwayPassSuccPerc = matches_data$AwayPassSuccPerc/100

matches_data_clear = subset(matches_data, select= -c(MatchID, MatchDate, Week, HomeTeam, AwayTeam))
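As a minimal sketch of the conversion above, the toy vector below (made-up values, not rows read from the file) is turned from percentage strings into fractions in one step. The helper name `pct_to_fraction` is hypothetical and does not appear in the analysis.

```r
# Hypothetical helper: strip the percent sign and convert to a fraction
pct_to_fraction <- function(x) {
  as.numeric(sub("%", "", x, fixed = TRUE)) / 100
}

possession <- c("60%", "51%", "37%")  # toy values for illustration
fractions <- pct_to_fraction(possession)
fractions
```

Wrapping the two steps in one function avoids repeating them for the home and away columns.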

Basic statistics

Correlations plots

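Variables used for clustering should not be highly correlated, because correlated variables give extra weight to the same information in the distance calculation. Some of them must be removed from the data set based on the correlation analysis.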

library(corrplot)
testRes = cor.mtest(matches_data_clear, conf.level = 0.95) # Significance of the correlation
corrplot(cor(matches_data_clear), p.mat = testRes$p, type = 'lower', method = 'number',
         insig = 'blank',  tl.cex=1)

Home team ball possession and away team ball possession are perfectly negatively correlated, so one of them must be discarded. Full time goals and half time goals are highly correlated, so half time goals are discarded. Shots on target are highly correlated with goals scored for both home and away teams, so they are discarded as well.

matches_data_clear = subset(matches_data_clear, select= -c(AwayBallPos))
matches_data_clear = subset(matches_data_clear, select= -c(HomeGoalsHT, AwayGoalsHT))
matches_data_clear = subset(matches_data_clear, select= -c(HomeShotsOnTarget, AwayShotsOnTarget))

corrplot(cor(matches_data_clear), type = 'lower', method = 'number')
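The pruning above was done by reading the correlation plot. One possible way to list strongly correlated pairs programmatically is sketched below. The toy data frame reuses the home possession and home fouls values from the first eight rows printed earlier, and the 0.7 threshold is an assumption made for illustration, not a value used in the analysis.

```r
# Toy data: possession shares sum to 100, so their correlation is exactly -1
toy <- data.frame(home_pos = c(60, 51, 63, 58, 37, 47, 76, 56))
toy$away_pos <- 100 - toy$home_pos
toy$fouls    <- c(14, 10, 14, 18, 16, 12, 13, 8)

cm <- cor(toy)
cm[upper.tri(cm, diag = TRUE)] <- NA           # keep each pair only once
high <- which(abs(cm) > 0.7, arr.ind = TRUE)   # assumed threshold of 0.7

high_pairs <- data.frame(var1 = rownames(cm)[high[, "row"]],
                         var2 = colnames(cm)[high[, "col"]],
                         r    = cm[high])
high_pairs
```

For each reported pair, one of the two variables is a candidate for removal; which one to drop remains a judgment call.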

Home team ball possession, which was kept when the perfectly correlated pair was resolved, is still highly correlated with a number of other variables. It should be discarded from the data set.

matches_data_clear = subset(matches_data_clear, select= -c(HomeBallPos))
corrplot(cor(matches_data_clear), type = 'lower', method = 'number')

Number of red cards

table(matches_data_clear$HomeRedCards)

## 
##    0    1    2 
## 2567   89    4

table(matches_data_clear$AwayRedCards)

## 
##    0    1    2 
## 2557   98    5

The highest number of red cards given to a team during a match is 2, and this happened in only 9 cases (4 for home teams and 5 for away teams). For this reason it is reasonable to convert this variable into a binary one.

matches_data_clear$HomeRedCards[matches_data_clear$HomeRedCards > 1] = 1
matches_data_clear$AwayRedCards[matches_data_clear$AwayRedCards > 1] = 1
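An equivalent way to express the same binarisation is sketched below on a toy vector (made-up counts). It is only an alternative formulation, not the code used in the analysis.

```r
red_cards <- c(0, 1, 2, 0, 1)            # toy counts per match
any_red   <- as.integer(red_cards > 0)   # 1 if at least one red card was shown

# pmin() caps the counts at 1 and gives the same values
capped <- pmin(red_cards, 1)
any_red
```

The comparison form makes the meaning of the new variable ("any red card") explicit.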

Data summary

summary(matches_data_clear)

##   HomeGoalsFT     AwayGoalsFT    HomeShotsOffTarget AwayShotsOffTarget HomeBlockedShots AwayBlockedShots  HomeCorners      AwayCorners     HomePassSuccPerc AwayPassSuccPerc HomeAerialsWon AwayAerialsWon   HomeFouls       AwayFouls     HomeYellowCards AwayYellowCards  HomeRedCards      AwayRedCards    
##  Min.   :0.000   Min.   :0.000   Min.   : 0.000     Min.   : 0.000     Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.5100   Min.   :0.4800   Min.   : 3.0   Min.   : 1.0   Min.   : 0.00   Min.   : 1.00   Min.   :0.000   Min.   :0.000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:1.000   1st Qu.:0.000   1st Qu.: 3.000     1st Qu.: 3.000     1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 3.000   1st Qu.: 3.000   1st Qu.:0.7300   1st Qu.:0.7200   1st Qu.:13.0   1st Qu.:13.0   1st Qu.: 8.00   1st Qu.: 9.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :1.000   Median :1.000   Median : 5.000     Median : 4.000     Median : 3.000   Median : 3.000   Median : 5.000   Median : 4.000   Median :0.7900   Median :0.7800   Median :17.0   Median :17.0   Median :11.00   Median :11.00   Median :1.000   Median :2.000   Median :0.00000   Median :0.00000  
##  Mean   :1.505   Mean   :1.207   Mean   : 5.379     Mean   : 4.375     Mean   : 3.818   Mean   : 3.043   Mean   : 5.776   Mean   : 4.704   Mean   :0.7823   Mean   :0.7697   Mean   :18.3   Mean   :18.1   Mean   :10.64   Mean   :10.99   Mean   :1.573   Mean   :1.757   Mean   :0.03496   Mean   :0.03872  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.: 7.000     3rd Qu.: 6.000     3rd Qu.: 5.000   3rd Qu.: 4.000   3rd Qu.: 8.000   3rd Qu.: 6.000   3rd Qu.:0.8400   3rd Qu.:0.8300   3rd Qu.:23.0   3rd Qu.:23.0   3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :9.000   Max.   :9.000   Max.   :16.000     Max.   :14.000     Max.   :19.000   Max.   :14.000   Max.   :19.000   Max.   :16.000   Max.   :0.9400   Max.   :0.9400   Max.   :67.0   Max.   :53.0   Max.   :24.00   Max.   :26.00   Max.   :6.000   Max.   :9.000   Max.   :1.00000   Max.   :1.00000

sapply(matches_data_clear, mean)

##        HomeGoalsFT        AwayGoalsFT HomeShotsOffTarget AwayShotsOffTarget   HomeBlockedShots   AwayBlockedShots        HomeCorners        AwayCorners   HomePassSuccPerc   AwayPassSuccPerc     HomeAerialsWon     AwayAerialsWon          HomeFouls          AwayFouls    HomeYellowCards    AwayYellowCards       HomeRedCards       AwayRedCards 
##         1.50451128         1.20714286         5.37894737         4.37518797         3.81804511         3.04285714         5.77631579         4.70375940         0.78228571         0.76968421        18.30375940        18.09887218        10.64172932        10.98646617         1.57330827         1.75714286         0.03496241         0.03872180

Scale the data

The means of the variables differ by almost 3 orders of magnitude (from about 0.03 to about 18), so it is reasonable to bring the variables to a common scale. Two variables in the data set (HomeRedCards and AwayRedCards) are binary. In such a case the data could be normalized with the z-score transformation or rescaled to the [0, 1] interval. The second alternative was chosen for the analysis.

library(scales)

matches_data_scaled = as.data.frame(sapply(matches_data_clear, rescale))
summary(matches_data_scaled)

##   HomeGoalsFT      AwayGoalsFT     HomeShotsOffTarget AwayShotsOffTarget HomeBlockedShots AwayBlockedShots   HomeCorners      AwayCorners     HomePassSuccPerc AwayPassSuccPerc HomeAerialsWon   AwayAerialsWon     HomeFouls        AwayFouls      HomeYellowCards  AwayYellowCards   HomeRedCards      AwayRedCards    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000     Min.   :0.0000     Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.1111   1st Qu.:0.0000   1st Qu.:0.1875     1st Qu.:0.2143     1st Qu.:0.1053   1st Qu.:0.07143   1st Qu.:0.1579   1st Qu.:0.1875   1st Qu.:0.5116   1st Qu.:0.5217   1st Qu.:0.1562   1st Qu.:0.2308   1st Qu.:0.3333   1st Qu.:0.3200   1st Qu.:0.1667   1st Qu.:0.1111   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.1111   Median :0.1111   Median :0.3125     Median :0.2857     Median :0.1579   Median :0.21429   Median :0.2632   Median :0.2500   Median :0.6512   Median :0.6522   Median :0.2188   Median :0.3077   Median :0.4583   Median :0.4000   Median :0.1667   Median :0.2222   Median :0.00000   Median :0.00000  
##  Mean   :0.1672   Mean   :0.1341   Mean   :0.3362     Mean   :0.3125     Mean   :0.2009   Mean   :0.21735   Mean   :0.3040   Mean   :0.2940   Mean   :0.6332   Mean   :0.6297   Mean   :0.2391   Mean   :0.3288   Mean   :0.4434   Mean   :0.3995   Mean   :0.2622   Mean   :0.1952   Mean   :0.03496   Mean   :0.03872  
##  3rd Qu.:0.2222   3rd Qu.:0.2222   3rd Qu.:0.4375     3rd Qu.:0.4286     3rd Qu.:0.2632   3rd Qu.:0.28571   3rd Qu.:0.4211   3rd Qu.:0.3750   3rd Qu.:0.7674   3rd Qu.:0.7609   3rd Qu.:0.3125   3rd Qu.:0.4231   3rd Qu.:0.5417   3rd Qu.:0.4800   3rd Qu.:0.3333   3rd Qu.:0.3333   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000     Max.   :1.0000     Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000
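The rescaling to [0, 1] is a min-max transformation. A base R sketch is shown below; `min_max` is a hypothetical helper that, to my understanding, matches the default behaviour of `scales::rescale` for a numeric vector without missing values.

```r
# Hypothetical helper: (x - min) / (max - min)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

home_goals <- c(0, 1, 2, 9)   # toy values spanning the observed range 0-9
scaled_goals <- min_max(home_goals)
round(scaled_goals, 4)
```

A single goal maps to 1/9, which agrees with the value 0.1111 shown for HomeGoalsFT in the summary above. Binary variables are left unchanged by this transformation.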

Check if data is clusterable

dist_eucl<-get_dist(matches_data_scaled, method="euclidean")
fviz_dist(dist_eucl, show_labels = FALSE) + labs(title="Matches data") # factoextra::

There are some visible blocks in the distance matrix plot, although they are not very clear. It may be assumed that the data is clusterable to some extent.
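The visual check can be complemented with a numeric one. Below is a rough base R sketch of a Hopkins-type statistic on simulated data; it was not part of the original analysis. Definitions of this statistic vary between packages. Under the convention assumed here, values near 0.5 suggest spatially random data and values close to 1 suggest a clustering tendency. The function name `hopkins_sketch` and the sample size `m` are assumptions.

```r
hopkins_sketch <- function(x, m = 30) {
  x  <- as.matrix(x)
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  idx <- sample(nrow(x), m)
  # w: distance from each sampled real point to its nearest other real point
  w <- sapply(idx, function(i) {
    min(sqrt(colSums((t(x[-i, , drop = FALSE]) - x[i, ])^2)))
  })
  # u: distance from each uniform random point to its nearest real point
  u <- replicate(m, {
    p <- runif(ncol(x), lo, hi)
    min(sqrt(colSums((t(x) - p)^2)))
  })
  sum(u) / (sum(u) + sum(w))
}

set.seed(10)
blobs <- rbind(matrix(rnorm(200, 0, 0.05), ncol = 2),
               matrix(rnorm(200, 1, 0.05), ncol = 2))
noise <- matrix(runif(400), ncol = 2)

h_blobs <- hopkins_sketch(blobs)   # two tight groups
h_noise <- hopkins_sketch(noise)   # uniform noise, for comparison
c(blobs = h_blobs, noise = h_noise)
```

Applied to the scaled match data, a value only moderately above 0.5 would be in line with the weak block structure seen in the plot.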

Optimal number of clusters

Get the best number of clusters.

NbClust – average

# with min.nc = 2 the system is singular, so min.nc = 3 is used here
opt1 = NbClust(data = matches_data_scaled, distance = 'euclidean', min.nc = 3, max.nc = 10, method = 'average')

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 3 as the best number of clusters 
## * 13 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 4 proposed 7 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  4 
##  
##  
## *******************************************************************

NbClust – median

opt2 = NbClust(data = matches_data_scaled, distance = 'euclidean', min.nc = 2, max.nc = 10, method = 'median')

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 10 proposed 2 as the best number of clusters 
## * 1 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 7 proposed 7 as the best number of clusters 
## * 3 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

NbClust – centroid

opt3 = NbClust(data = matches_data_scaled, distance = 'euclidean', min.nc = 2, max.nc = 10, method = 'centroid')

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 3 proposed 3 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 7 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

For the ‘average’ linkage method the best choice is 4 clusters. For both the ‘median’ and ‘centroid’ linkage methods the best choice is 2 clusters.

fv_km = fviz_nbclust(matches_data_scaled,kmeans,method = "silhouette") +ggtitle("kmeans")
fv_pam = fviz_nbclust(matches_data_scaled,pam,method = "silhouette") +ggtitle("pam")
fv_clara = fviz_nbclust(matches_data_scaled,clara,method = "silhouette") +ggtitle("clara")
grid.arrange(fv_km,fv_pam,fv_clara, ncol=2, top = "Silhouette based number of clusters")

According to the silhouette rule, 3 clusters are the best choice for Kmeans, while 2 are the best for PAM and CLARA.

Most methods pointed to 2 clusters as the best choice; 3 and 4 were also proposed.

Based on these results, solutions with 2 and 3 clusters are examined.
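The silhouette rule used above can be sketched directly with the `cluster` package: compute the average silhouette width for each candidate number of clusters and pick the maximum. The example runs on simulated data with two well-separated groups, not on the match data, and the range 2 to 6 is an assumption.

```r
library(cluster)  # pam() stores silhouette information in $silinfo

set.seed(42)
toy <- rbind(matrix(rnorm(100, 0, 0.3), ncol = 2),
             matrix(rnorm(100, 3, 0.3), ncol = 2))

ks <- 2:6
# average silhouette width of the PAM solution for each candidate k
avg_sil <- sapply(ks, function(k) pam(toy, k)$silinfo$avg.width)
best_k  <- ks[which.max(avg_sil)]

data.frame(k = ks, avg_sil = round(avg_sil, 3))
best_k
```

This is roughly the quantity that the silhouette plots of `fviz_nbclust` display, which makes it possible to read the exact values rather than estimate them from the plot.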

Clustering

With 2660 observations the data set is considered big, and the data does not manifest hierarchical characteristics. This is why only non-hierarchical (partitioning) methods are used in the analysis.

Dim1 and Dim2 on the graphs stand for the first two dimensions resulting from the application of PCA.
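As a minimal sketch of where such dimensions come from, the code below runs PCA with base R `prcomp` on simulated data and extracts the first two component scores together with their share of explained variance. The plotting function may apply its own standardisation before PCA, so this is an illustration of the idea rather than a reproduction of the plots.

```r
set.seed(7)
toy <- matrix(runif(200), ncol = 4)   # 50 toy observations, 4 variables

pca <- prcomp(toy)                    # centres the data by default
scores <- pca$x[, 1:2]                # coordinates on the first two components
explained <- pca$sdev^2 / sum(pca$sdev^2)

head(scores, 3)
round(explained[1:2], 2)
```

When the first two components explain only a small share of the variance, overlap seen in a two-dimensional cluster plot does not necessarily mean overlap in the full variable space.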

Kmeans

# k-means
cluster_km <- eclust(matches_data_scaled,"kmeans", k = 3)

Kmeans divides the data into three groups when viewed in the first two dimensions. They are not very well separated. The clusters are divided mainly by their Dim1 values.
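K-means starts from random initial centres, so cluster labels and sizes can differ between runs. A sketch with base R `kmeans` on simulated data shows one way to make a run repeatable: fix the seed and use several random starts. The seed value and `nstart = 25` are assumptions chosen for illustration.

```r
set.seed(123)
toy <- rbind(matrix(rnorm(60, 0, 0.2), ncol = 2),
             matrix(rnorm(60, 2, 0.2), ncol = 2),
             matrix(rnorm(60, 4, 0.2), ncol = 2))

# nstart repeats the algorithm from several random starts and keeps the best fit
km <- kmeans(toy, centers = 3, nstart = 25)

table(km$cluster)     # cluster sizes
km$tot.withinss       # total within-cluster sum of squares
```

The numbering of the clusters is arbitrary, so comparisons between runs should be based on cluster contents rather than on labels.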

PAM

# pam
cluster_pam<-eclust(matches_data_scaled, "pam", k = 2)

Here the case is similar to Kmeans: Dim1 divides the data into two clusters.

CLARA

# clara
cluster_clara<-eclust(matches_data_scaled, "clara", k = 2)

In the CLARA approach both dimensions contribute to the cluster separation.
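CLARA runs PAM on random subsamples and keeps the best set of medoids, so its result depends on the subsamples drawn. The sketch below calls `cluster::clara` directly on simulated data. The values of `samples` and `sampsize` are assumptions for illustration, and `rngR = TRUE` makes the function use R's random number generator so that `set.seed` controls the result.

```r
library(cluster)

set.seed(2021)
toy <- rbind(matrix(rnorm(400, 0, 0.3), ncol = 2),
             matrix(rnorm(400, 3, 0.3), ncol = 2))

# more and larger subsamples make the medoids less dependent on chance
cl <- clara(toy, k = 2, samples = 50, sampsize = 100, rngR = TRUE)

table(cl$clustering)  # cluster sizes
cl$medoids            # representative observations
```

With a data set of a few thousand rows, increasing the number and size of the subsamples is cheap and brings the result closer to what PAM would give on the full data.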

The graphs show no clearly separated groups in the data. The derived clusters overlap and have fuzzy boundaries. Even so, each cluster occupies its own distinguishable region of the plot.

Silhouette statistic

matches_data_clear$cluster_km = cluster_km$cluster
matches_data_clear$cluster_pam = cluster_pam$clustering
matches_data_clear$cluster_clara = cluster_clara$clustering

Kmeans

sil_km<-silhouette(cluster_km$cluster, dist(matches_data_scaled))
fviz_silhouette(sil_km)

##   cluster size ave.sil.width
## 1       1 1026          0.10
## 2       2  893          0.07
## 3       3  741          0.10

PAM

sil_pam<-silhouette(cluster_pam$clustering, dist(matches_data_scaled))
fviz_silhouette(sil_pam)

##   cluster size ave.sil.width
## 1       1 1640          0.12
## 2       2 1020          0.09

CLARA

sil_clara<-silhouette(cluster_clara$clustering, dist(matches_data_scaled))
fviz_silhouette(sil_clara)

##   cluster size ave.sil.width
## 1       1 1220          0.12
## 2       2 1440          0.07

Average silhouette width is almost the same for all three methods. The highest one, 0.11, is reached for PAM. All values are close to zero, which is consistent with the overlapping clusters seen in the plots.
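The overall average silhouette width is the size-weighted mean of the cluster averages. The sketch below recomputes it from the rounded sizes and widths printed above, so the results are approximate.

```r
# cluster sizes and average widths copied from the printed output above
sil <- list(
  kmeans = list(size = c(1026, 893, 741), width = c(0.10, 0.07, 0.10)),
  pam    = list(size = c(1640, 1020),     width = c(0.12, 0.09)),
  clara  = list(size = c(1220, 1440),     width = c(0.12, 0.07))
)

overall <- sapply(sil, function(s) weighted.mean(s$width, s$size))
round(overall, 2)
names(which.max(overall))
```

Because the inputs are rounded to two decimals, differences in the second decimal place between methods should be read with caution.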

The highest share of negative silhouette width values is observed for the second cluster in the CLARA method.

None of the methods produces a strongly unbalanced partition. The cluster shares are:

table(cluster_km$cluster)/length(cluster_km$cluster)

## 
##         1         2         3 
## 0.3857143 0.3357143 0.2785714

table(cluster_pam$clustering)/length(cluster_pam$clustering)

## 
##         1         2 
## 0.6165414 0.3834586

table(cluster_clara$clustering)/length(cluster_clara$clustering)

## 
##         1         2 
## 0.4586466 0.5413534

Average variable values for clusters

# general characteristics of clusters for kmeans
# clust_avg_km = aggregate(. ~ cluster_km, data = subset(matches_data_clear, select= -c(cluster_pam, cluster_clara)), mean)
# t(clust_avg_km)
# 
# # general characteristics of clusters for pam
# clust_avg_pam = aggregate(. ~ cluster_pam, data = subset(matches_data_clear, select= -c(cluster_km, cluster_clara)), mean)
# t(clust_avg_pam)
# 
# # general characteristics of clusters for clara
# clust_avg_clara = aggregate(. ~ cluster_clara, data = subset(matches_data_clear, select= -c(cluster_pam, cluster_km)), mean)
# t(clust_avg_clara)

Kmeans

clust_avg_km = aggregate(. ~ cluster_km, data = matches_data_clear, mean)
t(clust_avg_km)

##                           [,1]        [,2]        [,3]
## cluster_km          1.00000000  2.00000000  3.00000000
## HomeGoalsFT         1.88206628  1.31466965  1.21052632
## AwayGoalsFT         0.93957115  1.46584546  1.26585695
## HomeShotsOffTarget  6.97270955  3.82306831  5.04723347
## AwayShotsOffTarget  3.27582846  5.91489362  4.04183536
## HomeBlockedShots    5.14717349  2.47592385  3.59514170
## AwayBlockedShots    1.98538012  4.37737962  2.89878543
## HomeCorners         7.59844055  3.92609183  5.48313090
## AwayCorners         3.29044834  6.57110862  4.41025641
## HomePassSuccPerc    0.83294347  0.73539754  0.76865047
## AwayPassSuccPerc    0.73125731  0.81979843  0.76249663
## HomeAerialsWon     17.78947368 17.71444569 19.72604588
## AwayAerialsWon     16.84502924 18.19932811 19.71390013
## HomeFouls           9.49025341 10.11646137 12.86909582
## AwayFouls          10.64424951 10.37849944 12.19298246
## HomeYellowCards     0.97173489  1.24636058  2.80026991
## AwayYellowCards     1.66569201  1.40649496  2.30634278
## HomeRedCards        0.01559454  0.07390817  0.01484480
## AwayRedCards        0.06335283  0.01791713  0.02968961
## cluster_pam         1.00682261  1.68980963  1.53576248
## cluster_clara       1.13157895  1.79171333  1.80701754
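The last two rows of the table above are averages of the PAM and CLARA labels within each Kmeans cluster, because those label columns were part of the aggregated data. A sketch on a toy data frame shows how label columns of other methods can be left out of the aggregation; the column names are hypothetical.

```r
toy <- data.frame(goals     = c(2, 0, 3, 1),
                  fouls     = c(10, 14, 8, 12),
                  cluster_a = c(1, 2, 1, 2),
                  cluster_b = c(1, 1, 2, 2))

# drop the other method's labels before averaging by cluster_a
avg_a <- aggregate(. ~ cluster_a,
                   data = subset(toy, select = -cluster_b),
                   FUN = mean)
t(avg_a)
```

The same pattern applies to each of the three label columns in turn.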

PAM

# general characteristics of clusters for pam
clust_avg_pam = aggregate(. ~ cluster_pam, data = matches_data_clear, mean)
t(clust_avg_pam)

##                           [,1]        [,2]
## cluster_pam         1.00000000  2.00000000
## HomeGoalsFT         1.61707317  1.32352941
## AwayGoalsFT         1.01585366  1.51470588
## HomeShotsOffTarget  6.10914634  4.20490196
## AwayShotsOffTarget  3.62560976  5.58039216
## HomeBlockedShots    4.57378049  2.60294118
## AwayBlockedShots    2.37621951  4.11470588
## HomeCorners         6.87073171  4.01666667
## AwayCorners         3.61524390  6.45392157
## HomePassSuccPerc    0.81172561  0.73495098
## AwayPassSuccPerc    0.74317683  0.81230392
## HomeAerialsWon     18.77317073 17.54901961
## AwayAerialsWon     17.85548780 18.49019608
## HomeFouls           9.82134146 11.96078431
## AwayFouls          11.05304878 10.87941176
## HomeYellowCards     1.22195122  2.13823529
## AwayYellowCards     1.68170732  1.87843137
## HomeRedCards        0.02439024  0.05196078
## AwayRedCards        0.04390244  0.03039216
## cluster_km          1.58841463  2.38235294
## cluster_clara       1.30365854  1.92352941

CLARA

# general characteristics of clusters for clara
clust_avg_clara = aggregate(. ~ cluster_clara, data = matches_data_clear, mean)
t(clust_avg_clara)

##                           [,1]        [,2]
## cluster_clara       1.00000000  2.00000000
## HomeGoalsFT         1.84918033  1.21250000
## AwayGoalsFT         0.95901639  1.41736111
## HomeShotsOffTarget  5.83688525  4.99097222
## AwayShotsOffTarget  3.51885246  5.10069444
## HomeBlockedShots    4.78688525  2.99722222
## AwayBlockedShots    2.00655738  3.92083333
## HomeCorners         6.83934426  4.87569444
## AwayCorners         3.56147541  5.67152778
## HomePassSuccPerc    0.82970492  0.74211111
## AwayPassSuccPerc    0.74189344  0.79322917
## HomeAerialsWon     17.03442623 19.37916667
## AwayAerialsWon     16.20000000 19.70763889
## HomeFouls           9.84590164 11.31597222
## AwayFouls          10.77704918 11.16388889
## HomeYellowCards     0.96557377  2.08819444
## AwayYellowCards     1.70245902  1.80347222
## HomeRedCards        0.02704918  0.04166667
## AwayRedCards        0.05081967  0.02847222
## cluster_km          1.38688525  2.32152778
## cluster_pam         1.06393443  1.65416667

Analyzing the means of the variables grouped by clusters shows that the two-cluster solutions (PAM and CLARA) split the matches according to which team was better: the 1st cluster gathers matches in which the home team was better and the 2nd those in which the away team was better.

When the data is divided into three clusters (Kmeans case) we see the same tendency: the first two clusters aggregate the matches in which one of the teams is clearly better (even more pronounced than in the case of two clusters), and the third one gathers matches in which the statistics were evenly distributed and the numbers of fouls and yellow cards were the highest.
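How the three-cluster and two-cluster partitions relate to each other can be checked with a cross-tabulation of the labels. The sketch below uses two short made-up label vectors; with the real data the label columns stored in the data frame would take their place.

```r
labels_three <- c(1, 1, 2, 3, 2, 1, 3, 2)   # toy labels, three clusters
labels_two   <- c(1, 1, 2, 2, 2, 1, 1, 2)   # toy labels, two clusters

agreement <- table(three_clusters = labels_three, two_clusters = labels_two)
agreement

# row shares: where the members of each three-cluster group end up
prop.table(agreement, margin = 1)
```

Rows concentrated in a single column indicate that a cluster of one solution is nested in a cluster of the other.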

Conclusions

The cluster analysis showed that the data set can be separated into two or three clusters. The resulting clusters are reasonable: they divide the matches into logical sets. The results could be used in further analyses, which could reveal other regularities in football matches.

Błażej Popławski
