1 Introduction

When we think about Italy, three things instantly come to mind: food, the sea, and football. No one can resist the delicious taste of Italian cuisine or the allure of its stunning southern beaches. But when it comes to football, a deeper revelation of culture, passion, and rivalry emerges. As an Italian, I couldn’t pass up the opportunity to combine my love for football with the lessons from this homework, making learning both productive and fun.

That’s why I decided to analyze the Serie A league, the crown jewel of Italian football, focusing on the 2023/2024 season. Using a dataset packed with match statistics, from goals scored to corners and even red cards, this homework explores whether these numbers reveal any interesting relationships. Which of them influence a game? Are they linked? Does playing at home really offer an advantage? While these questions might be the bread and butter of betting companies, they also offer a playground for statistical analysis.

So, let’s dive into this exciting fusion of football and statistics to uncover what the data can teach us about the beautiful game.

Milan-Inter, the most important derby in Serie A

library(readr)
mydata <- read_csv("~/Downloads/season-2324.csv")
## Rows: 380 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Date, HomeTeam, AwayTeam, FTR, HTR
## dbl (16): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, ...
## lgl  (1): Referee
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mydata)
## # A tibble: 6 × 22
##   Date     HomeTeam  AwayTeam   FTHG  FTAG FTR    HTHG  HTAG HTR   Referee    HS
##   <chr>    <chr>     <chr>     <dbl> <dbl> <chr> <dbl> <dbl> <chr> <lgl>   <dbl>
## 1 19/08/23 Empoli    Verona        0     1 A         0     0 D     NA         10
## 2 19/08/23 Frosinone Napoli        1     3 A         1     2 A     NA          4
## 3 19/08/23 Genoa     Fiorenti…     1     4 A         0     3 A     NA          4
## 4 19/08/23 Inter     Monza         2     0 H         1     0 H     NA         22
## 5 20/08/23 Roma      Salernit…     2     2 D         1     1 D     NA         13
## 6 20/08/23 Sassuolo  Atalanta      0     2 A         0     0 D     NA         11
## # ℹ 11 more variables: AS <dbl>, HST <dbl>, AST <dbl>, HF <dbl>, AF <dbl>,
## #   HC <dbl>, AC <dbl>, HY <dbl>, AY <dbl>, HR <dbl>, AR <dbl>

1.1 Dataset overview

  • Unit of observation: each row represents a single match in the Italian Serie A 2023/2024 season.
  • Sample size: the dataset includes all 380 matches played in the 2023/2024 Serie A football season.

The dataset was retrieved from https://datahub.io/core/italian-serie-a, which obtained data from https://football-data.co.uk.

1.2 Variables description

  • Date: The date on which the match was played.
  • HomeTeam: The team playing at their home stadium.
  • AwayTeam: The team playing at the opponent’s stadium.
  • FTHG (Full-Time Home Goals): Total goals scored by the home team.
  • FTAG (Full-Time Away Goals): Total goals scored by the away team.
  • FTR (Full-Time Result): Match outcome. Categories: Home Win (H), Draw (D), Away Win (A).
  • HTHG (Half-Time Home Goals): Goals scored by the home team by halftime.
  • HTAG (Half-Time Away Goals): Goals scored by the away team by halftime.
  • HTR (Half-Time Result): Match outcome at halftime. Categories: Home Win (H), Draw (D), Away Win (A).
  • Referee: Name of the referee officiating the match (empty in this dataset: the column is read as logical and contains only NA).
  • HS (Home Shots): Total shots taken by the home team.
  • AS (Away Shots): Total shots taken by the away team.
  • HST (Home Shots on Target): Shots on target by the home team.
  • AST (Away Shots on Target): Shots on target by the away team.
  • HF (Home Fouls): Fouls committed by the home team.
  • AF (Away Fouls): Fouls committed by the away team.
  • HC (Home Corners): Number of corners awarded to the home team.
  • AC (Away Corners): Number of corners awarded to the away team.
  • HY (Home Yellow Cards): Number of yellow cards received by the home team.
  • AY (Away Yellow Cards): Number of yellow cards received by the away team.
  • HR (Home Red Cards): Number of red cards received by the home team.
  • AR (Away Red Cards): Number of red cards received by the away team.

1.3 Data manipulation

# Keep only the variables needed for the research questions (selected by name)
Serie_A_2324 <- mydata[, c("Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR", "HC", "AC", "HR", "AR")]

head(Serie_A_2324)
## # A tibble: 6 × 10
##   Date     HomeTeam  AwayTeam     FTHG  FTAG FTR      HC    AC    HR    AR
##   <chr>    <chr>     <chr>       <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 19/08/23 Empoli    Verona          0     1 A         2     4     0     0
## 2 19/08/23 Frosinone Napoli          1     3 A         4     6     0     0
## 3 19/08/23 Genoa     Fiorentina      1     4 A         3     4     0     0
## 4 19/08/23 Inter     Monza           2     0 H         8     3     0     0
## 5 20/08/23 Roma      Salernitana     2     2 D         9     1     0     0
## 6 20/08/23 Sassuolo  Atalanta        0     2 A         7     7     0     0
# Convert HR (Home Red Cards) into Yes/No
Serie_A_2324$HR <- ifelse(Serie_A_2324$HR >= 1, "Yes", "No")

# Convert AR (Away Red Cards) into Yes/No
Serie_A_2324$AR <- ifelse(Serie_A_2324$AR >= 1, "Yes", "No")

head(Serie_A_2324)
## # A tibble: 6 × 10
##   Date     HomeTeam  AwayTeam     FTHG  FTAG FTR      HC    AC HR    AR   
##   <chr>    <chr>     <chr>       <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 19/08/23 Empoli    Verona          0     1 A         2     4 No    No   
## 2 19/08/23 Frosinone Napoli          1     3 A         4     6 No    No   
## 3 19/08/23 Genoa     Fiorentina      1     4 A         3     4 No    No   
## 4 19/08/23 Inter     Monza           2     0 H         8     3 No    No   
## 5 20/08/23 Roma      Salernitana     2     2 D         9     1 No    No   
## 6 20/08/23 Sassuolo  Atalanta        0     2 A         7     7 No    No

1.4 Descriptive statistics

# Set a wider console output width
options(width = 120)

library(psych)

# Generate descriptive statistics for the entire dataset
# (describe() is used because no grouping variable is needed)
describe(Serie_A_2324)
##           vars   n  mean    sd median trimmed   mad min max range  skew kurtosis   se
## Date*        1 380 70.24 39.58   72.5   70.56 51.15   1 135   134 -0.05    -1.23 2.03
## HomeTeam*    2 380 10.50  5.77   10.5   10.50  7.41   1  20    19  0.00    -1.22 0.30
## AwayTeam*    3 380 10.50  5.77   10.5   10.50  7.41   1  20    19  0.00    -1.22 0.30
## FTHG         4 380  1.43  1.19    1.0    1.32  1.48   0   7     7  0.92     1.11 0.06
## FTAG         5 380  1.18  1.12    1.0    1.04  1.48   0   6     6  0.92     0.76 0.06
## FTR*         6 380  2.13  0.83    2.0    2.16  1.48   1   3     2 -0.25    -1.51 0.04
## HC           7 380  5.50  3.01    5.0    5.35  2.97   0  18    18  0.56     0.44 0.15
## AC           8 380  4.13  2.51    4.0    3.96  2.97   0  13    13  0.59     0.04 0.13
## HR*          9 380  1.08  0.27    1.0    1.00  0.00   1   2     1  3.18     8.13 0.01
## AR*         10 380  1.08  0.28    1.0    1.00  0.00   1   2     1  2.98     6.91 0.01
# Check range of original red card variables (before conversion to Yes/No)
range(mydata$HR, na.rm = TRUE) # Home Red Cards
## [1] 0 3
range(mydata$AR, na.rm = TRUE) # Away Red Cards
## [1] 0 2

1.4.1 Explanation of a few parameter estimates

  • The mean number of goals scored by home teams (1.43) is slightly higher than that of away teams (1.18). This suggests that home teams have a scoring advantage; the test in RQ1 checks whether the difference is statistically significant;
  • Home teams have a median of 5 corners (mean = 5.50, SD = 3.01), while away teams have a median of 4 corners (mean = 4.13, SD = 2.51). Home teams therefore tend to win more corners and show slightly greater variability;
  • The number of red cards ranges from 0 to 3 for home teams and from 0 to 2 for away teams. Red cards are rare: after the Yes/No recoding, HR* and AR* are coded 1 = No and 2 = Yes, so a mean of 1.08 corresponds to roughly 8% of matches with at least one red card for that side. Matches with more than one red card for a single team are exceptional.
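The link between the mean of a recoded Yes/No variable and the share of matches can be checked by hand. The sketch below rebuilds the home red-card indicator from the counts in the contingency table of RQ3 (351 matches without and 29 with a home red card); the names hr_flag, mean_code, and share_yes are illustrative only.

```r
# Rebuild the Yes/No home red-card indicator from its counts
# (351 "No", 29 "Yes"); hr_flag is an illustrative name.
hr_flag <- factor(rep(c("No", "Yes"), times = c(351, 29)),
                  levels = c("No", "Yes"))

# Categorical columns (marked * in the table) are summarised through
# their integer codes: No = 1, Yes = 2
mean_code <- mean(as.integer(hr_flag))

# Share of matches with at least one home red card
share_yes <- mean(hr_flag == "Yes")

round(mean_code, 2)  # mean of the integer codes
round(share_yes, 3)  # proportion of "Yes" matches
```

The mean of the codes minus 1 equals the proportion of "Yes" matches, which is why a value just above 1 signals a rare event.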

2 RQ1: comparing means between two independent samples

Question: is there a significant difference in the average number of goals scored by home teams and away teams in the Italian Serie A?

2.1 Step 1: parametric test (Independent t-test)

The independent t-test is suitable because we are comparing the mean goals scored by home teams and away teams, treated here as two independent groups. Because equal variances are not assumed, we apply the Welch correction (var.equal = FALSE), which accounts for possible heterogeneity.

Hypotheses:

  • H0: The mean number of goals scored by home teams is equal to the mean number of goals scored by away teams (μHome=μAway).
  • H1: The mean number of goals scored by home teams is not equal to the mean number of goals scored by away teams (μHome≠μAway).
# Perform the t-test
t.test(Serie_A_2324$FTHG, Serie_A_2324$FTAG, 
                        var.equal = FALSE, 
                        alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  Serie_A_2324$FTHG and Serie_A_2324$FTAG
## t = 3.0788, df = 755.32, p-value = 0.002154
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.09345364 0.42233583
## sample estimates:
## mean of x mean of y 
##  1.434211  1.176316

Explanation of results

  • p<0.05, reject H0: there is a statistically significant difference between the mean goals scored by home and away teams (p=0.002).
  • Home teams tend to score slightly more goals on average than away teams: mean of home goals (x=1.43), mean of away goals (y=1.18).
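As a rough cross-check, the Welch statistic can be rebuilt from summary statistics alone. This is a minimal sketch that uses the means from the test output and the rounded standard deviations from the descriptive table, so the result only approximates the reported values; all variable names are illustrative.

```r
# Summary statistics taken from the report (standard deviations are rounded)
n      <- 380
mean_h <- 1.434211; sd_h <- 1.19   # home goals
mean_a <- 1.176316; sd_a <- 1.12   # away goals

# Welch t statistic: difference in means divided by its standard error
v_h     <- sd_h^2 / n
v_a     <- sd_a^2 / n
t_stat  <- (mean_h - mean_a) / sqrt(v_h + v_a)

# Welch-Satterthwaite approximation of the degrees of freedom
df <- (v_h + v_a)^2 / (v_h^2 / (n - 1) + v_a^2 / (n - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)

c(t = t_stat, df = df, p = p_value)
```

Small differences from the t.test() output are expected because the standard deviations are rounded to two decimals.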

2.2 Step 2: non-parametric test (Wilcoxon Rank Sum Test)

The Wilcoxon Rank Sum Test is the non-parametric alternative to the t-test, used when data is not normally distributed or assumptions are violated.

# Perform the Wilcoxon Rank Sum Test
wilcox.test(Serie_A_2324$FTHG, Serie_A_2324$FTAG,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  Serie_A_2324$FTHG and Serie_A_2324$FTAG
## W = 81286, p-value = 0.001782
## alternative hypothesis: true location shift is not equal to 0

Explanation of results

  • p<0.05, reject H0: the distributions of goals scored by home and away teams differ in location (p=0.002).

2.3 Step 3: assumptions for tests

To understand which of the two tests better suits our analysis, we check the assumptions behind the parametric test.

Assumptions:

  • Variables are numeric.
  • The distributions of the variables are normally distributed in both populations.
  • Variables come from independent populations.
  • Variances are equal (or adjust with Welch correction if not).

To verify them, I will use:

  • Histograms, to check skewness visually.
  • Q-Q plots, to compare the sample quantiles with those of a normal distribution.
  • Shapiro-Wilk Test, to test for normality numerically.
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
# Histogram for Home and Away Goals
home_hist <- ggplot(Serie_A_2324, aes(x = FTHG)) +
  geom_histogram(bins = 10, fill = "darkolivegreen", color = "darkolivegreen", alpha = 0.6) +
  labs(title = "Home Goals", x = "Goals", y = "Frequency")

away_hist <- ggplot(Serie_A_2324, aes(x = FTAG)) +
  geom_histogram(bins = 10, fill = "goldenrod1", color = "goldenrod1", alpha = 0.6) +
  labs(title = "Away Goals", x = "Goals", y = "Frequency")

library(ggpubr)

# Arrange histograms on one page
ggarrange(home_hist, away_hist, 
          ncol = 2, nrow = 1)

Explanation of results

Both histograms show a right-skewed distribution with potential outliers, suggesting that the goals scored by both home and away teams are not normally distributed. The skewness indicates that most matches have lower scores, while a few matches have unusually high scores. Outliers could affect the analysis, particularly parametric tests like the t-test.

# Q-Q Plots
qq_home <- ggqqplot(Serie_A_2324$FTHG, main = "Q-Q Plot for Home Goals")
qq_away <- ggqqplot(Serie_A_2324$FTAG, main = "Q-Q Plot for Away Goals")

# Arrange the plots side by side on the same page
ggarrange(qq_home, qq_away, 
          ncol = 2, nrow = 1)

Explanation of results

The dots form horizontal steps rather than aligning along the main diagonal. This pattern arises because goals are discrete (whole numbers), not continuous. The lack of linearity confirms that the data do not follow a normal distribution.

# Shapiro-Wilk Test for Normality
shapiro_home <- shapiro.test(Serie_A_2324$FTHG)
shapiro_away <- shapiro.test(Serie_A_2324$FTAG)

print(shapiro_home)
## 
##  Shapiro-Wilk normality test
## 
## data:  Serie_A_2324$FTHG
## W = 0.87998, p-value < 2.2e-16
print(shapiro_away)
## 
##  Shapiro-Wilk normality test
## 
## data:  Serie_A_2324$FTAG
## W = 0.85667, p-value < 2.2e-16

Explanation of results

The p-values are less than 0.05 (p<0.001), indicating the data for both home and away goals significantly deviate from normality. This reinforces the decision to prioritize the non-parametric Wilcoxon test.
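To illustrate why low-count data of this kind fail a normality test, the sketch below applies the same test to simulated counts. It assumes, purely for illustration, Poisson-distributed counts with the home-goal mean of 1.43; the report does not claim that goals follow a Poisson distribution, and sim_goals is a hypothetical name.

```r
# Simulated match counts (assumption: Poisson with mean 1.43, n = 380)
set.seed(123)  # for reproducibility
sim_goals <- rpois(380, lambda = 1.43)

# Same normality test as used on the real data
sim_test <- shapiro.test(sim_goals)

sim_test$statistic  # W statistic
sim_test$p.value    # p-value of the normality test
```

Small whole-number counts bounded at zero are right-skewed and take only a handful of distinct values, so a rejection of normality is expected regardless of the exact count distribution.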

2.4 Step 4: effect size

I chose the Wilcoxon Rank Sum Test, whose effect size is measured with the rank-biserial correlation.

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
# Calculate the rank-biserial correlation for the Wilcoxon test
effectsize(wilcox.test(Serie_A_2324$FTHG, Serie_A_2324$FTAG,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))
## r (rank biserial) |       95% CI
## --------------------------------
## 0.13              | [0.04, 0.21]
interpret_rank_biserial(0.13)
## [1] "small"
## (Rules: funder2019)

Explanation of results

The rank-biserial correlation indicates a small effect size based on Funder’s (2019) guidelines. This suggests that while the difference in goals is statistically significant, the practical impact is relatively minor.
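The rank-biserial correlation can also be recovered directly from the Wilcoxon statistic. The sketch below assumes that W counts the home/away pairs in which the home value is larger, with ties counted as one half, which is how wilcox.test() defines it for two samples; variable names are illustrative.

```r
# Wilcoxon statistic and sample sizes from the test output
W  <- 81286
n1 <- 380   # home observations
n2 <- 380   # away observations

# Share of all home/away pairs in which the home value is larger
# (ties counted as one half)
p_superiority <- W / (n1 * n2)

# Rank-biserial correlation: rescales that share from [0, 1] to [-1, 1]
r_rb <- 2 * p_superiority - 1

round(c(p_superiority = p_superiority, r_rb = r_rb), 2)
```

Read this way, a randomly chosen home score exceeds a randomly chosen away score in about 56% of pairs, which is only a modest departure from the 50% expected under no difference.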

2.5 Conclusion

Since the normality assumption is violated (as shown by the histograms, Q-Q plots, and Shapiro-Wilk tests), the Wilcoxon Rank Sum Test is the more suitable test for these data. Although the difference in goals is statistically significant (p=0.002), the effect size (r=0.13) suggests that the scoring advantage of home teams is relatively small in practical terms.

3 RQ2: correlation between two numerical variables

Question: is there a significant correlation between the number of corners and the number of goals scored in matches?

To properly address this question I first need to create some new useful variables that I will use during the analysis.

These are:

  • Number of Corners: Sum of home and away corners (Total Corners=HC+AC)
  • Number of Goals: Sum of home and away goals (Total Goals=FTHG+FTAG)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Calculate Total Corners and Total Goals and save changes
Serie_A_2324 <- Serie_A_2324 %>%
  mutate(TotalCorners = HC + AC,
         TotalGoals = FTHG + FTAG)

head(Serie_A_2324[c("Date", "TotalCorners", "TotalGoals")])
## # A tibble: 6 × 3
##   Date     TotalCorners TotalGoals
##   <chr>           <dbl>      <dbl>
## 1 19/08/23            6          1
## 2 19/08/23           10          4
## 3 19/08/23            7          5
## 4 19/08/23           11          2
## 5 20/08/23           10          4
## 6 20/08/23           14          2

Now we can see the total number of corners and goals in every match of the 2023/2024 Serie A season. For example, in the fourth match played on 19 August 2023 (Inter-Monza), 11 corners were taken and 2 goals were scored. With this information, we can state the hypotheses for this research question.

  • H0: There is no correlation between corners and total goals (𝜌=0);
  • H1: There is a significant correlation between corners and total goals (𝜌≠0).

I will first inspect the data visually to see whether the number of corners and the number of goals scored appear to be related, and then test the Pearson correlation coefficient.

3.1 Step 1: Scatterplot Matrix

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:psych':
## 
##     logit
# Scatterplot matrix for Total Corners and Total Goals
scatterplotMatrix(Serie_A_2324[ , c("TotalCorners", "TotalGoals")], smooth=FALSE)

Explanation of results

The scatterplots show no apparent linear relationship between corners and goals: the points form a roughly horizontal cloud with no diagonal trend. Visually, there is no evidence of correlation between the two variables.

3.2 Step 2: Pairwise Plot

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# Generate pairwise plot
ggpairs(Serie_A_2324[, c("TotalCorners", "TotalGoals")],
        title = "Pairwise Plot for Corners and Goals")

Explanation of results

The Pearson correlation coefficient calculated in the pairwise plot is r=−0.045, which is very close to zero. This indicates a negligible or no linear relationship between corners and goals. The negative sign implies a slight inverse relationship, but it is too weak to have practical significance.

3.3 Step 3: Correlation Matrix

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
# Calculate Pearson correlation matrix
rcorr(as.matrix(Serie_A_2324[, c("TotalCorners", "TotalGoals")]), 
      type = "pearson")
##              TotalCorners TotalGoals
## TotalCorners         1.00      -0.04
## TotalGoals          -0.04       1.00
## 
## n= 380 
## 
## 
## P
##              TotalCorners TotalGoals
## TotalCorners              0.3853    
## TotalGoals   0.3853

Explanation of results

The correlation coefficient is extremely close to zero, reaffirming no meaningful linear relationship between total corners and total goals. Since p>0.05, we fail to reject the null hypothesis. The p-value (p=0.3853) indicates that the correlation is not statistically significant at the 0.05 level. This suggests that the observed correlation could easily occur by chance and does not reflect a genuine relationship in the data.
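The p-value of a Pearson correlation follows from the coefficient and the sample size alone. The sketch below uses the rounded coefficient from the pairwise plot (r = -0.045), so the result only approximates the p-value printed by rcorr(); variable names are illustrative.

```r
# Pearson correlation (rounded) and sample size from the report
r <- -0.045
n <- 380

# Test statistic for H0: rho = 0, t-distributed with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

c(t = t_stat, p = p_value)
```

With 380 matches, a correlation this small gives a test statistic well inside the range expected under the null hypothesis.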

3.4 Conclusion

There is no significant correlation between the number of corners and the total goals scored in a match (r=−0.04, p=0.3853). Both visual inspection and statistical analysis confirm that the relationship between these variables is negligible.

4 RQ3: association between two categorical variables

Question: is there an association between home red cards and match outcomes in the Italian Serie A 2023/2024 season?

Hypotheses:

  • H0: There is no association between home red cards and match outcomes.
  • H1: Home red cards and match outcomes are associated.

4.1 Step 1: Pearson Chi-Square Test

Assumptions:

  • Observations are independent of each other;
  • Expected frequencies in all cells are greater than 5.
# Perform Pearson Chi-Square Test
results <- chisq.test(Serie_A_2324$HR, Serie_A_2324$FTR)

results
## 
##  Pearson's Chi-squared test
## 
## data:  Serie_A_2324$HR and Serie_A_2324$FTR
## X-squared = 14.847, df = 2, p-value = 0.0005971

Explanation of results

Since p<0.001, we reject the null hypothesis: there is a statistically significant association between home red cards and match outcomes. Whether the home team receives a red card appears to be related to whether the match ends in a home win, draw, or away win.

addmargins(results$observed)
##                Serie_A_2324$FTR
## Serie_A_2324$HR   A   D   H Sum
##             No   99  96 156 351
##             Yes  10  16   3  29
##             Sum 109 112 159 380

Explanation of results

From the observed data, home teams that received a red card won only 3 of 29 matches (about 10%), drew 16, and lost 10, whereas home teams without a red card won 156 of 351 matches (about 44%). This pattern is consistent with the Chi-square test result, which indicates an association between home red cards and match outcomes.
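Because the two rows have very different totals (351 vs. 29 matches), row percentages make the comparison easier to read than raw counts. A minimal sketch, re-entering the observed table by hand; the object names are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(99, 96, 156,
                     10, 16,   3),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(HR  = c("No", "Yes"),
                                   FTR = c("A", "D", "H")))

# Percentage of each outcome within each row (each row sums to 100)
row_pct <- round(100 * prop.table(observed, margin = 1), 1)
row_pct
```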

round(results$expected)
##                Serie_A_2324$FTR
## Serie_A_2324$HR   A   D   H
##             No  101 103 147
##             Yes   8   9  12

Explanation of results

The expected frequencies under the null hypothesis indicate that if home red cards were independent of match outcomes, home teams would experience approximately 101 home losses (or away wins=A), 103 draws (D), and 147 home wins (H) in matches without red cards. For matches where home teams received red cards, the expected outcomes would be 8 home losses/away wins, 9 draws, and 12 home wins. Since they are all higher than 5, the assumptions for chi-squared test are respected.
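The expected frequencies and the test statistic can be reproduced from the observed table: each expected count is the row total times the column total divided by the number of matches. A minimal sketch with the table re-entered by hand; object names are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(99, 96, 156,
                     10, 16,   3),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(HR  = c("No", "Yes"),
                                   FTR = c("A", "D", "H")))
n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic, degrees of freedom, and p-value
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

round(expected, 2)
c(chi_sq = chi_sq, df = df, p = p_value)
```

The unrounded expected counts show that the smallest cell is still above 8, so the rule of thumb of at least 5 per cell holds with some margin.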

4.2 Step 2: residuals analysis

round(results$residuals)
##                Serie_A_2324$FTR
## Serie_A_2324$HR  A  D  H
##             No   0 -1  1
##             Yes  1  3 -3

Explanation of results

The Pearson residuals (rounded to the nearest integer) indicate that matches with home red cards have notably fewer home wins (−3) and more draws (+3) than expected under the null hypothesis. This suggests that home red cards are associated with a lower likelihood of a home win.
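Rounding to whole numbers hides how far each cell really deviates. The sketch below re-enters the observed table by hand and prints the same Pearson residuals with two decimals; object names are illustrative.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(99, 96, 156,
                     10, 16,   3),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(HR  = c("No", "Yes"),
                                   FTR = c("A", "D", "H")))

# Pearson residuals: (observed - expected) / sqrt(expected)
fit <- chisq.test(observed)
pearson_res <- round(fit$residuals, 2)
pearson_res
```

As a common rule of thumb, cells with residuals beyond roughly ±2 contribute most to the chi-square statistic; here those are the draws and home wins in matches with a home red card.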

library(effectsize)

# Calculate Cramér's V
cramers_v(Serie_A_2324$HR, Serie_A_2324$FTR)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.18              | [0.07, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.18)
## [1] "small"
## (Rules: funder2019)

Explanation of results

While the Chi-square test indicates a statistically significant association between home red cards and match outcomes, the small Cramér’s V value suggests that the strength of this association is weak. This implies that while home red cards do influence outcomes, other factors likely play a more significant role in determining match results.
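Cramér's V can be derived from the chi-square statistic by hand. The sketch below computes the plain version and a small-sample bias-corrected version; it assumes that the "(adj.)" label in the output refers to this kind of bias correction, which may not match the package's computation exactly. Variable names are illustrative.

```r
# Values from the chi-square test above
chi_sq <- 14.847
n      <- 380
r      <- 2   # rows: HR (No / Yes)
k      <- 3   # columns: FTR (A / D / H)

# Plain Cramér's V
v_raw <- sqrt(chi_sq / (n * min(r - 1, k - 1)))

# Bias-corrected version (assumed to correspond to the "adj." output)
phi2_adj <- max(0, chi_sq / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(v_raw = v_raw, v_adj = v_adj), 2)
```

The correction lowers the estimate slightly, but both versions fall in the range labelled small by the interpretation rule used above.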

4.3 Conclusion

The Chi-square test revealed a statistically significant association (p<0.001), indicating that home red cards influence whether matches end in a home win, draw, or away win. Observed frequencies showed that home teams with red cards experience significantly fewer home wins and more draws than expected. However, the effect size, measured by Cramér’s V (0.18), was small, suggesting that while the association is significant, it is relatively weak. This implies that red cards alone have a limited practical impact on match outcomes, and other factors likely contribute more substantially to the results.

Acknowledgments

I would like to extend my heartfelt thanks to our mutual friend, ChatGPT, for tirelessly providing advice and helping me elevate the aesthetics of this homework. From fonts to formatting, his wisdom and patience have been invaluable. Without him, this document might still look like it was crafted in the early 2000s. Cheers to modern technology!

Christian Lasalvia