Cluster Sampling of Soccer Matches by Season to Estimate Average Goals per Match

Author

Yash Shah

Code

PremLg = read.csv("results.csv")
PremLg$TotalGoals = PremLg$FTHG + PremLg$FTAG #Creating the total goals column
str(PremLg)

'data.frame':   11113 obs. of  24 variables:
 $ Season    : chr  "1993-94" "1993-94" "1993-94" "1993-94" ...
 $ DateTime  : chr  "1993-08-14T00:00:00Z" "1993-08-14T00:00:00Z" "1993-08-14T00:00:00Z" "1993-08-14T00:00:00Z" ...
 $ HomeTeam  : chr  "Arsenal" "Aston Villa" "Chelsea" "Liverpool" ...
 $ AwayTeam  : chr  "Coventry" "QPR" "Blackburn" "Sheffield Weds" ...
 $ FTHG      : int  0 4 1 2 1 0 0 3 0 0 ...
 $ FTAG      : int  3 1 2 0 1 1 3 1 2 2 ...
 $ FTR       : chr  "A" "H" "A" "H" ...
 $ HTHG      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ HTAG      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ HTR       : chr  NA NA NA NA ...
 $ Referee   : chr  NA NA NA NA ...
 $ HS        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ AS        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ HST       : int  NA NA NA NA NA NA NA NA NA NA ...
 $ AST       : int  NA NA NA NA NA NA NA NA NA NA ...
 $ HC        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ AC        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ HF        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ AF        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ HY        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ AY        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ HR        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ AR        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ TotalGoals: int  3 5 3 2 2 1 3 4 2 2 ...

Code

head(PremLg)

# A tibble: 6 × 24
  Season  DateTime HomeTeam AwayTeam  FTHG  FTAG FTR    HTHG  HTAG HTR   Referee
  <chr>   <chr>    <chr>    <chr>    <int> <int> <chr> <int> <int> <chr> <chr>  
1 1993-94 1993-08… Arsenal  Coventry     0     3 A        NA    NA <NA>  <NA>   
2 1993-94 1993-08… Aston V… QPR          4     1 H        NA    NA <NA>  <NA>   
3 1993-94 1993-08… Chelsea  Blackbu…     1     2 A        NA    NA <NA>  <NA>   
4 1993-94 1993-08… Liverpo… Sheffie…     2     0 H        NA    NA <NA>  <NA>   
5 1993-94 1993-08… Man City Leeds        1     1 D        NA    NA <NA>  <NA>   
6 1993-94 1993-08… Newcast… Tottenh…     0     1 A        NA    NA <NA>  <NA>   
# ℹ 13 more variables: HS <int>, AS <int>, HST <int>, AST <int>, HC <int>,
#   AC <int>, HF <int>, AF <int>, HY <int>, AY <int>, HR <int>, AR <int>,
#   TotalGoals <int>

Introduction

Above are the structure and first rows of the data set I have chosen for my final project. It contains the results of every Premier League game from the 1993–1994 season up to the 2020–2021 season. The data set provides detailed information about match outcomes, including the number of goals scored by the home and away teams, which are the primary variables used in this analysis. Because the data set contains results from many seasons and thousands of matches, it is well suited to applying statistical sampling techniques to estimate overall patterns in match

In this project I will use cluster sampling to estimate the average number of goals scored per soccer match using historical match data. Soccer match outcomes and scoring patterns are widely studied in sports analytics, as goals scored per match are an important measure of offensive performance and game dynamics. However, analyzing every match in a large data set can be time-consuming. Sampling methods allow researchers to estimate population characteristics accurately and efficiently without analyzing every individual observation.

The strategy we will use is cluster sampling. Cluster sampling is particularly appropriate when data are naturally grouped into clusters. In this project, individual soccer matches are grouped by season, making each season a natural cluster. For each match, the goals scored by the home team and by the away team are recorded, so the total number of goals in the match can be calculated and analyzed.

The main objective of this project is to estimate the average number of total goals scored per match using cluster sampling. By selecting a subset of seasons and analyzing all matches within those seasons, this study demonstrates how cluster sampling can be applied to real-world sports data. The results provide insight into the effectiveness of cluster sampling for estimating population-level statistics in grouped data and highlight how sampling methods can be useful in sports analytics and performance analysis.

Methods

Code

length(unique(PremLg$Season))

[1] 29

Code

nrow(PremLg)

[1] 11113

Cluster sampling was selected as the primary sampling method for this project because the data set is naturally divided into groups based on seasons. In cluster sampling, the population is divided into clusters, and a subset of these clusters is randomly selected. All observations within the selected clusters are then included in the sample. This method is useful when the population is large and naturally grouped, as it allows efficient estimation of population parameters without requiring analysis of every individual observation.

In this data set, each season represents a natural cluster because matches are organized by season, and each season contains many individual games played between different teams. The full data set consists of 29 seasons, ranging from the 1993–1994 season to the 2020–2021 season, and includes a total of 11,113 matches. Treating each season as a cluster allows groups of matches to be sampled together while preserving the natural structure of the data.

We choose a simple random sample of six seasons from the 29 available seasons. The clusters are selected at random without replacement, so each season has an equal probability of being chosen. Once the clusters are selected, all matches within the chosen seasons are included in the sample, following a one-stage cluster sampling design.

For each match, the total number of goals scored is calculated by adding the number of goals scored by the home team and the away team. This variable serves as the primary measurement used in the analysis. The average number of goals per match is then estimated using the matches contained within the selected clusters.
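The estimate described above can be written as a ratio estimator: total goals in the sampled seasons divided by total matches in the sampled seasons. The sketch below is only an illustration on made-up toy data (the data frame `toy`, the season labels `S1` to `S5`, and the choice of sampled seasons are all assumptions, not part of the project data). It shows that this ratio equals the plain mean over all sampled matches, and it also computes the standard textbook standard error for a one-stage cluster sample, which this report does not otherwise calculate.

```r
# Toy data (made up for illustration): 5 "seasons" of unequal size.
set.seed(1)
toy = data.frame(
  Season     = rep(c("S1", "S2", "S3", "S4", "S5"), times = c(4, 4, 3, 3, 3)),
  TotalGoals = rpois(17, lambda = 2.7),
  stringsAsFactors = FALSE
)

N_clusters = length(unique(toy$Season))   # N: clusters in the population
sampled    = c("S1", "S3", "S5")          # pretend these were drawn at random
n_clusters = length(sampled)              # n: clusters in the sample

s   = toy[toy$Season %in% sampled, ]
t_i = tapply(s$TotalGoals, s$Season, sum)      # total goals per sampled season
M_i = tapply(s$TotalGoals, s$Season, length)   # matches per sampled season

# Ratio estimator: total goals / total matches over the sampled clusters.
ybar_hat = sum(t_i) / sum(M_i)

# It is identical to the plain mean over all sampled matches.
all.equal(ybar_hat, mean(s$TotalGoals))

# Estimated standard error with the finite population correction (1 - n/N).
s2_r   = sum((t_i - ybar_hat * M_i)^2) / (n_clusters - 1)
se_hat = sqrt((1 - n_clusters / N_clusters) * s2_r / (n_clusters * mean(M_i)^2))
se_hat
```

Because the mean over all sampled matches and the ratio estimator coincide, calling `mean()` on the pooled matches is a valid way to compute the one-stage cluster estimate even when seasons differ in size.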

Simulation

A simulation study is conducted to evaluate the performance of cluster sampling when estimating the average number of goals scored per match. Simulation methods allow repeated sampling from the population, making it possible to observe how estimates vary across different samples. This approach helps assess the reliability and variability of the sampling method and provides insight into how close sample estimates are expected to be to the true population value.

Here, repeated cluster samples are generated by randomly selecting a fixed number of seasons from the total set of 29 available seasons. For each simulation, six seasons are selected at random, and all matches from the selected seasons are included in the sample. This process follows the same one-stage cluster sampling design described in the Methods section, ensuring consistency between the sampling procedure and the simulation process.

We will repeat the sampling procedure 100 times in order to produce multiple estimates of the average number of total goals scored per match. For each simulated sample, the total number of goals per match is calculated, and the average value is recorded. These repeated estimates are then used to examine how much variation occurs between samples and to better understand the behavior of the sampling method.

By analyzing the distribution of the simulated sample means, the simulation provides information about the variability and consistency of cluster sampling. Comparing these simulated estimates to the overall population mean helps demonstrate how well cluster sampling is expected to perform when applied to real-world sports data and large grouped data sets.
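One way to summarize how the simulated estimates compare with the population mean is through their bias, standard deviation, and root mean squared error. The following is a minimal sketch on a made-up toy population; the data frame `toy`, the helper `one_estimate()`, the choice of 3 out of 10 seasons, and the 500 repetitions are assumptions for illustration only and are not the settings used in this project.

```r
# Toy population (made up): 10 "seasons" with 20 matches each.
set.seed(2)
toy = data.frame(
  Season     = rep(paste0("S", 1:10), each = 20),
  TotalGoals = rpois(200, lambda = 2.7),
  stringsAsFactors = FALSE
)
pop_mean = mean(toy$TotalGoals)
seasons  = unique(toy$Season)

# One repetition: draw 3 seasons without replacement, average all their matches.
one_estimate = function() {
  picked = sample(seasons, 3)
  mean(toy$TotalGoals[toy$Season %in% picked])
}

sim_means = replicate(500, one_estimate())

bias = mean(sim_means) - pop_mean            # average error of the estimates
sdev = sd(sim_means)                         # spread across repeated samples
rmse = sqrt(mean((sim_means - pop_mean)^2))  # typical distance from the true mean

round(c(bias = bias, sd = sdev, rmse = rmse), 4)
```

A bias near zero together with a small standard deviation is what "close to the population mean and consistent across samples" looks like numerically.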

Coding Work

Picking out 6 Random Seasons:

Code

set.seed(54689633) #my student number, doing this for reproducibility

seasons = unique(PremLg$Season)

sampled_seasons = sample(seasons, 6)

sampled_seasons

[1] "2002-03" "1996-97" "2001-02" "2017-18" "1997-98" "2013-14"

Calculating the number of matches in our sample:

Code

# Keep every match from the six seasons drawn above
cluster_sample = subset(
  PremLg,
  Season %in% sampled_seasons
)

nrow(cluster_sample)

[1] 2280

Getting sample and population means:

Code

sample_mean = mean(cluster_sample$TotalGoals)

population_mean = mean(PremLg$TotalGoals)

sample_mean

[1] 2.657895

Code

population_mean

[1] 2.659678

We then run the simulation 100 times to get the values we desire:

Code

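seasons = unique(PremLg$Season)   # the 29 season labels (clusters)

means = numeric(100)              # one estimate per repetition

for(i in 1:100){

  # draw six seasons at random, without replacement
  sampled_seasons = sample(seasons, 6)

  # one-stage design: keep every match from the drawn seasons
  cluster_sample = subset(
    PremLg,
    Season %in% sampled_seasons
  )

  # estimate: mean total goals per match in this sample
  means[i] = mean(cluster_sample$TotalGoals)

}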

mean(means)

[1] 2.664227

Code

var(means)

[1] 0.001880359

Code

sd(means)

[1] 0.04336311

Summarized results with these numbers:

Sampled Seasons:
2002-03, 1996-97, 2001-02, 2017-18, 1997-98, 2013-14

Number of matches in sample: 2280

Sample Mean Goals: 2.657895
Population Mean Goals: 2.659678

Simulation Mean: 2.664227
Simulation Variance: 0.001880359
Simulation SD: 0.04336311

Graph

Histogram to show the distribution:

Code

hist(
  means,
  main = "Distribution of Sample Means",
  xlab = "Average Goals per Match"
)

Real Data Analysis

A cluster sample was created by randomly selecting six seasons from the total of 29 available seasons. A fixed random seed was used so that the selection is reproducible. The selected seasons were 2002–03, 1996–97, 2001–02, 2017–18, 1997–98, and 2013–14. All matches within these selected seasons were included in the cluster sample, resulting in a total of 2,280 matches used for analysis.

For each match, the total number of goals scored was calculated by adding the number of goals scored by the home team and the away team. The average number of total goals per match was then calculated using the sampled data. The estimated average number of goals per match based on the cluster sample was 2.6579 goals per match. This value represents the sample estimate obtained using cluster sampling.

To evaluate the accuracy of the sampling method, the sample estimate was compared to the population mean calculated using the full data set. The population mean number of total goals per match across all 11,113 matches was 2.6597 goals per match. The sample mean differs from the population mean by less than 0.002 goals per match, so this particular cluster sample gave an accurate estimate of the true population value.

In addition to the single cluster sample, a simulation study was performed in which the cluster sampling procedure was repeated 100 times. For each repetition, six seasons were randomly selected and the mean number of total goals per match was calculated. The average of the simulated sample means was 2.6642, about 0.005 above the population mean, with a variance of 0.00188 and a standard deviation of 0.04336, which is roughly 1.6% of the population mean. These results show that the simulated sample means remained close to the population mean and did not vary greatly between repetitions, suggesting that cluster sampling produces consistent estimates when applied to this data set.
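The comparisons in this section come down to a few subtractions and one ratio. As a quick check, the sketch below redoes that arithmetic from the values printed in the Coding Work section; the variable names are chosen here for illustration and the numbers are copied by hand rather than recomputed from the data.

```r
# Values copied from the results reported above.
population_mean = 2.659678
sample_mean     = 2.657895
sim_mean        = 2.664227
sim_sd          = 0.04336311

sample_error = sample_mean - population_mean   # error of the single cluster sample
sim_bias     = sim_mean - population_mean      # average error over the 100 repetitions
relative_sd  = sim_sd / population_mean        # spread relative to the population mean

round(c(sample_error = sample_error, sim_bias = sim_bias), 6)
# sample_error = -0.001783, sim_bias = 0.004549

round(100 * relative_sd, 2)   # spread as a percentage of the population mean
# 1.63
```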

Conclusions and Discussions

The results of this project demonstrate that cluster sampling is an effective method for estimating the average number of goals scored per Premier League match. By selecting a subset of seasons and including all matches within those seasons, it was possible to generate an estimate that was very close to the true population mean. The similarity between the sample mean and the population mean indicates that cluster sampling can provide reliable estimates even when only a portion of the full data set is used.

The simulation results further support the effectiveness of cluster sampling. Repeating the sampling process 100 times produced estimates with relatively low variability (a standard deviation of about 0.043 goals per match) that stayed close to the population mean. This consistency suggests that cluster sampling is a stable and dependable method for analyzing large data sets that are naturally grouped into clusters.

This methodology can be useful in many real-world applications beyond sports analytics. Researchers, sports analysts, and data scientists can use cluster sampling to analyze large collections of sports data without needing to process every observation individually. More broadly, cluster sampling can be applied in fields such as economics, public health, and education, where data are often collected in naturally occurring groups. To conclude, this project highlights the practical usefulness of cluster sampling for estimating population characteristics in large, structured data sets.

References

Lohr, S. L. (2019). Sampling: Design and Analysis (2nd ed.). Chapman and Hall/CRC.

Alvin. (2021). English Premier League (EPL) Results data set. Kaggle. Retrieved from https://www.kaggle.com/datasets/irkaal/english-premier-league-results