When we think about Italy, three things instantly come to mind: food, the sea, and football. No one can resist the delicious taste of Italian cuisine or the allure of its stunning southern beaches. But football reveals something deeper: culture, passion, and rivalry. As an Italian, I couldn’t pass up the opportunity to combine my love for football with the lessons from this homework, making learning both productive and fun.
That’s why I decided to analyze Serie A, the crown jewel of Italian football, focusing on the 2023/2024 season. Using a dataset packed with match statistics, from goals scored to corners and even red cards, this homework explores whether these numbers reveal any interesting relationships. Which of them influence a game? Are they linked? Does playing at home really offer an advantage? While these questions might be the bread and butter of betting companies, they also offer a playground for statistical analysis.
So, let’s dive into this exciting fusion of football and statistics to uncover what the data can teach us about the beautiful game.
Milan-Inter, the most important derby in Serie A
## Rows: 380 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Date, HomeTeam, AwayTeam, FTR, HTR
## dbl (16): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, ...
## lgl (1): Referee
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 22
## Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR Referee HS
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <lgl> <dbl>
## 1 19/08/23 Empoli Verona 0 1 A 0 0 D NA 10
## 2 19/08/23 Frosinone Napoli 1 3 A 1 2 A NA 4
## 3 19/08/23 Genoa Fiorenti… 1 4 A 0 3 A NA 4
## 4 19/08/23 Inter Monza 2 0 H 1 0 H NA 22
## 5 20/08/23 Roma Salernit… 2 2 D 1 1 D NA 13
## 6 20/08/23 Sassuolo Atalanta 0 2 A 0 0 D NA 11
## # ℹ 11 more variables: AS <dbl>, HST <dbl>, AST <dbl>, HF <dbl>, AF <dbl>,
## # HC <dbl>, AC <dbl>, HY <dbl>, AY <dbl>, HR <dbl>, AR <dbl>
The dataset was retrieved from https://datahub.io/core/italian-serie-a, which obtained data from https://football-data.co.uk.
# Create a new dataset with only the variables needed for my research questions
Serie_A_2324 <- mydata[, c(-7, -8, -9, -10, -11, -12, -13, -14, -15, -16, -19, -20)]
head(Serie_A_2324)
## # A tibble: 6 × 10
## Date HomeTeam AwayTeam FTHG FTAG FTR HC AC HR AR
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 19/08/23 Empoli Verona 0 1 A 2 4 0 0
## 2 19/08/23 Frosinone Napoli 1 3 A 4 6 0 0
## 3 19/08/23 Genoa Fiorentina 1 4 A 3 4 0 0
## 4 19/08/23 Inter Monza 2 0 H 8 3 0 0
## 5 20/08/23 Roma Salernitana 2 2 D 9 1 0 0
## 6 20/08/23 Sassuolo Atalanta 0 2 A 7 7 0 0
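Dropping columns by negative position works, but it silently breaks if the column order of the source file ever changes. As a small sketch of an alternative, the same subset can be selected by name. The data frame `mydata_demo` and the vector `keep` below are hypothetical stand-ins built from the first two rows shown above; they are not part of the original analysis.

```r
# Toy stand-in for `mydata`, holding only some of its 22 columns
mydata_demo <- data.frame(
  Date     = c("19/08/23", "19/08/23"),
  HomeTeam = c("Empoli", "Frosinone"),
  AwayTeam = c("Verona", "Napoli"),
  FTHG = c(0, 1), FTAG = c(1, 3), FTR = c("A", "A"),
  HTHG = c(0, 1), HTAG = c(0, 2), HTR = c("D", "A"),
  HC = c(2, 4), AC = c(4, 6), HR = c(0, 0), AR = c(0, 0)
)

# Names of the ten variables used in the research questions
keep <- c("Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR",
          "HC", "AC", "HR", "AR")

# Select by name instead of by negative position
demo_subset <- mydata_demo[, keep]
names(demo_subset)
```

Selecting by name also documents which variables the analysis relies on, which negative indices do not.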
# Convert HR (Home Red Cards) into Yes/No
Serie_A_2324$HR <- ifelse(Serie_A_2324$HR >= 1, "Yes", "No")
# Convert AR (Away Red Cards) into Yes/No
Serie_A_2324$AR <- ifelse(Serie_A_2324$AR >= 1, "Yes", "No")
head(Serie_A_2324)
## # A tibble: 6 × 10
## Date HomeTeam AwayTeam FTHG FTAG FTR HC AC HR AR
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 19/08/23 Empoli Verona 0 1 A 2 4 No No
## 2 19/08/23 Frosinone Napoli 1 3 A 4 6 No No
## 3 19/08/23 Genoa Fiorentina 1 4 A 3 4 No No
## 4 19/08/23 Inter Monza 2 0 H 8 3 No No
## 5 20/08/23 Roma Salernitana 2 2 D 9 1 No No
## 6 20/08/23 Sassuolo Atalanta 0 2 A 7 7 No No
# Set a wider console output width
options(width = 120)
library(psych)
# Generate descriptive statistics for the entire dataset
describeBy(Serie_A_2324)
## Warning in describeBy(Serie_A_2324): no grouping variable requested
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Date* 1 380 70.24 39.58 72.5 70.56 51.15 1 135 134 -0.05 -1.23 2.03
## HomeTeam* 2 380 10.50 5.77 10.5 10.50 7.41 1 20 19 0.00 -1.22 0.30
## AwayTeam* 3 380 10.50 5.77 10.5 10.50 7.41 1 20 19 0.00 -1.22 0.30
## FTHG 4 380 1.43 1.19 1.0 1.32 1.48 0 7 7 0.92 1.11 0.06
## FTAG 5 380 1.18 1.12 1.0 1.04 1.48 0 6 6 0.92 0.76 0.06
## FTR* 6 380 2.13 0.83 2.0 2.16 1.48 1 3 2 -0.25 -1.51 0.04
## HC 7 380 5.50 3.01 5.0 5.35 2.97 0 18 18 0.56 0.44 0.15
## AC 8 380 4.13 2.51 4.0 3.96 2.97 0 13 13 0.59 0.04 0.13
## HR* 9 380 1.08 0.27 1.0 1.00 0.00 1 2 1 3.18 8.13 0.01
## AR* 10 380 1.08 0.28 1.0 1.00 0.00 1 2 1 2.98 6.91 0.01
# Check range of original red card variables (before conversion to Yes/No)
range(mydata$HR, na.rm = TRUE) # Home Red Cards
## [1] 0 3
range(mydata$AR, na.rm = TRUE) # Away Red Cards
## [1] 0 2
Question: is there a significant difference in the average number of goals scored by home teams and away teams in the Italian Serie A?
The independent t-test is suitable because we are comparing the means of goals scored by home teams and away teams, which are two independent groups. If variances differ significantly, we apply the Welch correction to account for heterogeneity.
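Hypotheses:
H0: the mean number of goals scored by home teams is equal to the mean number scored by away teams.
H1: the mean number of goals scored by home teams differs from the mean number scored by away teams.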
# Perform the t-test
t.test(Serie_A_2324$FTHG, Serie_A_2324$FTAG,
       var.equal = FALSE,
       alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: Serie_A_2324$FTHG and Serie_A_2324$FTAG
## t = 3.0788, df = 755.32, p-value = 0.002154
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.09345364 0.42233583
## sample estimates:
## mean of x mean of y
## 1.434211 1.176316
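To make the Welch correction less of a black box, here is a minimal sketch that rebuilds the test statistic and the degrees of freedom from summary statistics alone. The means come from the output above and the standard deviations from the descriptive statistics table; since those standard deviations are rounded to two decimals, the result should only approximate the reported t = 3.0788 and df = 755.32. All variable names are my own.

```r
# Summary statistics copied from the outputs above (SDs are rounded)
mean_home <- 1.434211
mean_away <- 1.176316
sd_home <- 1.19
sd_away <- 1.12
n_home <- 380
n_away <- 380

# Squared standard error of each group mean
se2_home <- sd_home^2 / n_home
se2_away <- sd_away^2 / n_away

# Welch t statistic: difference in means over the unpooled standard error
t_welch <- (mean_home - mean_away) / sqrt(se2_home + se2_away)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (se2_home + se2_away)^2 /
  (se2_home^2 / (n_home - 1) + se2_away^2 / (n_away - 1))

# Two-sided p-value
p_welch <- 2 * pt(-abs(t_welch), df = df_welch)

c(t = t_welch, df = df_welch, p = p_welch)
```

Because the two groups have the same size and similar variances, the Welch degrees of freedom land very close to the pooled value of 758, so the correction changes little here.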
The Wilcoxon Rank Sum Test is the non-parametric alternative to the t-test, used when data is not normally distributed or assumptions are violated.
# Perform the Wilcoxon Rank Sum Test
wilcox.test(Serie_A_2324$FTHG, Serie_A_2324$FTAG,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: Serie_A_2324$FTHG and Serie_A_2324$FTAG
## W = 81286, p-value = 0.001782
## alternative hypothesis: true location shift is not equal to 0
To decide which of the two tests better suits our analysis, we verify the assumptions that support the parametric test.
Assumptions:
To verify them, I will use histograms, Q-Q plots, and the Shapiro-Wilk test.
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
# Histogram for Home and Away Goals
home_hist <- ggplot(Serie_A_2324, aes(x = FTHG)) +
  geom_histogram(bins = 10, fill = "darkolivegreen", color = "darkolivegreen", alpha = 0.6) +
  labs(title = "Home Goals", x = "Goals", y = "Frequency")
away_hist <- ggplot(Serie_A_2324, aes(x = FTAG)) +
  geom_histogram(bins = 10, fill = "goldenrod1", color = "goldenrod1", alpha = 0.6) +
  labs(title = "Away Goals", x = "Goals", y = "Frequency")
library(ggpubr)
# Arrange histograms on one page
ggarrange(home_hist, away_hist,
          ncol = 2, nrow = 1)
Both histograms show a right-skewed distribution with potential outliers, suggesting that the goals scored by home and away teams are not normally distributed. The skewness indicates that most matches have low scores, while a few matches have unusually high scores. Outliers could affect the analysis, particularly parametric tests like the t-test.
# Q-Q Plots
qq_home <- ggqqplot(Serie_A_2324$FTHG, main = "Q-Q Plot for Home Goals")
qq_away <- ggqqplot(Serie_A_2324$FTAG, main = "Q-Q Plot for Away Goals")
# Arrange the plots side by side on the same page
ggarrange(qq_home, qq_away,
          ncol = 2, nrow = 1)
The dots form horizontal steps rather than aligning along the main diagonal. This pattern arises because goals are discrete (whole numbers), not continuous. The lack of linearity in the Q-Q plots indicates that the data does not follow a normal distribution.
# Shapiro-Wilk Test for Normality
shapiro_home <- shapiro.test(Serie_A_2324$FTHG)
shapiro_away <- shapiro.test(Serie_A_2324$FTAG)
print(shapiro_home)
##
## Shapiro-Wilk normality test
##
## data: Serie_A_2324$FTHG
## W = 0.87998, p-value < 2.2e-16
print(shapiro_away)
##
## Shapiro-Wilk normality test
##
## data: Serie_A_2324$FTAG
## W = 0.85667, p-value < 2.2e-16
The p-values are less than 0.05 (p<0.001), indicating the data for both home and away goals significantly deviate from normality. This reinforces the decision to prioritize the non-parametric Wilcoxon test.
I chose to use the Wilcoxon Rank Sum Test, and I measure its effect size with the rank-biserial correlation.
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
# Calculate the rank-biserial correlation for the Wilcoxon test
effectsize(wilcox.test(Serie_A_2324$FTHG, Serie_A_2324$FTAG,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) | 95% CI
## --------------------------------
## 0.13 | [0.04, 0.21]
## [1] "small"
## (Rules: funder2019)
The rank-biserial correlation indicates a small effect size based on Funder’s (2019) guidelines. This suggests that while the difference in goals is statistically significant, the practical impact is relatively minor.
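For two independent samples, the rank-biserial correlation can be recovered directly from the Wilcoxon statistic as r = 2W / (n1 × n2) − 1, where W counts how often a home value outranks an away value. The sketch below applies this to the W reported above; the variable names are my own, and the formula is the standard one for the unpaired case rather than something taken from the package output.

```r
# Values copied from the Wilcoxon output above
W <- 81286      # Wilcoxon rank sum statistic for home goals
n_home <- 380   # number of home observations
n_away <- 380   # number of away observations

# Share of home/away pairs favouring the home side, rescaled to [-1, 1]
r_rank_biserial <- 2 * W / (n_home * n_away) - 1

round(r_rank_biserial, 2)
```

Read this way, an r of about 0.13 means that home teams outscore away teams in only a modestly larger share of pairwise comparisons than the reverse.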
Since the assumption of normality is violated (as shown by the histograms, Q-Q plots, and Shapiro-Wilk tests), the Wilcoxon Rank Sum Test is the more suitable test for this data. Although the difference in goals is statistically significant (p=0.002), the effect size (r=0.13) suggests the advantage of home teams in scoring goals is relatively small in practical terms.
Question: is there a significant correlation between the number of corners and the number of goals scored in matches?
To properly address this question I first need to create some new useful variables that I will use during the analysis.
These are TotalCorners (HC + AC) and TotalGoals (FTHG + FTAG).
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Calculate Total Corners and Total Goals and save changes
Serie_A_2324 <- Serie_A_2324 %>%
  mutate(TotalCorners = HC + AC,
         TotalGoals = FTHG + FTAG)
head(Serie_A_2324[c("Date", "TotalCorners", "TotalGoals")])
## # A tibble: 6 × 3
## Date TotalCorners TotalGoals
## <chr> <dbl> <dbl>
## 1 19/08/23 6 1
## 2 19/08/23 10 4
## 3 19/08/23 7 5
## 4 19/08/23 11 2
## 5 20/08/23 10 4
## 6 20/08/23 14 2
Now we can see the total number of corners and goals in every match of the Serie A 2023/2024 season. For example, in the fourth match played on 19 August 2023, 11 corners were taken and 2 goals were scored. With this information, we can state the hypotheses for this research question.
I will check this by first inspecting some plots to see whether there is a visible relationship between the number of corners and the number of goals scored in matches, and then computing the Pearson correlation coefficient.
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
# Scatterplot matrix for Total Corners and Total Goals
scatterplotMatrix(Serie_A_2324[ , c("TotalCorners", "TotalGoals")], smooth=FALSE)
The scatterplots indicate no apparent linear relationship between corners and goals. The points are scattered horizontally, with no diagonal trend. Visually, then, there is no evidence of correlation between the two variables.
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Generate pairwise plot
ggpairs(Serie_A_2324[, c("TotalCorners", "TotalGoals")],
        title = "Pairwise Plot for Corners and Goals")
The Pearson correlation coefficient calculated in the pairwise plot is r=−0.045, which is very close to zero. This indicates a negligible or no linear relationship between corners and goals. The negative sign implies a slight inverse relationship, but it is too weak to have practical significance.
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
# Calculate Pearson correlation matrix
rcorr(as.matrix(Serie_A_2324[, c("TotalCorners", "TotalGoals")]),
      type = "pearson")
## TotalCorners TotalGoals
## TotalCorners 1.00 -0.04
## TotalGoals -0.04 1.00
##
## n= 380
##
##
## P
## TotalCorners TotalGoals
## TotalCorners 0.3853
## TotalGoals 0.3853
The correlation coefficient is extremely close to zero, reaffirming no meaningful linear relationship between total corners and total goals. Since p>0.05, we fail to reject the null hypothesis. The p-value (p=0.3853) indicates that the correlation is not statistically significant at the 0.05 level. This suggests that the observed correlation could easily occur by chance and does not reflect a genuine relationship in the data.
There is no significant correlation between the number of corners and the total goals scored in a match (r=−0.04, p=0.3853). Both visual inspection and statistical analysis confirm that the relationship between these variables is negligible.
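As a sanity check on that p-value, the significance test for a Pearson correlation can be worked out by hand: under the null hypothesis, t = r × sqrt(n − 2) / sqrt(1 − r²) follows a t distribution with n − 2 degrees of freedom. The sketch below uses the rounded r = −0.045 from the pairwise plot, so the resulting p-value should only be close to the reported 0.3853, not identical. Variable names are my own.

```r
# Values copied from the outputs above (r is rounded to three decimals)
r <- -0.045
n <- 380

# t statistic for testing H0: the population correlation is zero
t_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value with n - 2 degrees of freedom
p_r <- 2 * pt(-abs(t_r), df = n - 2)

c(t = t_r, p = p_r)
```

With 380 matches, a correlation would need an absolute value of roughly 0.10 to reach significance at the 0.05 level, so the observed value falls well short.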
Question: is there an association between home red cards and match outcomes in the Italian Serie A 2023/2024 season?
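Hypotheses:
H0: there is no association between home red cards and match outcomes.
H1: there is an association between home red cards and match outcomes.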
Assumptions:
##
## Pearson's Chi-squared test
##
## data: Serie_A_2324$HR and Serie_A_2324$FTR
## X-squared = 14.847, df = 2, p-value = 0.0005971
Since the p-value is below 0.05 (p<0.001), we reject the null hypothesis. This means there is a statistically significant association between home red cards and match outcomes. Home red cards appear to influence whether the match ends in a home win, draw, or away win.
## Serie_A_2324$FTR
## Serie_A_2324$HR A D H Sum
## No 99 96 156 351
## Yes 10 16 3 29
## Sum 109 112 159 380
From the observed data, home teams that received a red card won far less often than they lost (3 home wins vs. 10 away wins), whereas home teams without a red card won more often than they lost (156 home wins vs. 99 away wins). This pattern is consistent with the Chi-square test result, which indicates a significant association between home red cards and match outcomes.
## Serie_A_2324$FTR
## Serie_A_2324$HR A D H
## No 101 103 147
## Yes 8 9 12
The expected frequencies under the null hypothesis indicate that if home red cards were independent of match outcomes, home teams would experience approximately 101 home losses (or away wins=A), 103 draws (D), and 147 home wins (H) in matches without red cards. For matches where home teams received red cards, the expected outcomes would be 8 home losses/away wins, 9 draws, and 12 home wins. Since they are all higher than 5, the assumptions for chi-squared test are respected.
## Serie_A_2324$FTR
## Serie_A_2324$HR A D H
## No 0 -1 1
## Yes 1 3 -3
The residuals indicate that matches with home red cards are associated with significantly fewer home wins (−3) and more draws (+3) than expected under the null hypothesis. This suggests that home red cards negatively impact the likelihood of a home win.
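The whole chain from observed counts to expected counts and residuals can be reproduced from the contingency table alone. The sketch below rebuilds the table by hand (the object names `observed` and `chi` are my own) and runs base R's `chisq.test()` on it. The rounded residuals match the table above if Pearson residuals are used, so I assume that is what was reported.

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(99, 96, 156,
    10, 16,   3),
  nrow = 2, byrow = TRUE,
  dimnames = list(HR = c("No", "Yes"), FTR = c("A", "D", "H"))
)

# Pearson chi-squared test of independence
chi <- chisq.test(observed)

chi$statistic             # chi-squared statistic
chi$parameter             # degrees of freedom: (2 - 1) * (3 - 1)
round(chi$expected, 2)    # row total * column total / grand total
round(chi$residuals, 2)   # Pearson residuals: (observed - expected) / sqrt(expected)
```

The two cells in the "Yes" row for draws and home wins contribute most of the statistic, which is why the residuals there stand out.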
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.18 | [0.07, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
## [1] "small"
## (Rules: funder2019)
While the Chi-square test indicates a statistically significant association between home red cards and match outcomes, the small Cramér’s V value suggests that the strength of this association is weak. This implies that while home red cards do influence outcomes, other factors likely play a more significant role in determining match results.
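To show where the 0.18 comes from, the sketch below computes Cramér's V from the chi-squared statistic, first in its plain form and then with a small-sample bias correction. I am assuming that the "adj." in the output refers to the bias correction proposed by Bergsma (2013); the formula below reproduces the reported value, but the package documentation should be checked to confirm. Variable names are my own.

```r
# Values copied from the chi-squared output above
chi2 <- 14.847
n <- 380
n_rows <- 2   # HR: No / Yes
n_cols <- 3   # FTR: A / D / H

# Plain Cramér's V
v_raw <- sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Bias-corrected version (assumed to match the adjusted output)
phi2_adj <- max(0, chi2 / n - (n_rows - 1) * (n_cols - 1) / (n - 1))
rows_adj <- n_rows - (n_rows - 1)^2 / (n - 1)
cols_adj <- n_cols - (n_cols - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(rows_adj - 1, cols_adj - 1))

round(c(raw = v_raw, adjusted = v_adj), 2)
```

The correction shrinks the estimate only slightly at this sample size, so the conclusion of a small effect holds either way.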
The Chi-square test revealed a statistically significant association (p<0.001), indicating that home red cards influence whether matches end in a home win, draw, or away win. Observed frequencies showed that home teams with red cards experience significantly fewer home wins and more draws than expected. However, the effect size, measured by Cramér’s V (0.18), was small, suggesting that while the association is significant, it is relatively weak. This implies that red cards alone have a limited practical impact on match outcomes, and other factors likely contribute more substantially to the results.
Christian Lasalvia