Introduction

This dataset provides team-level statistics for the 2024 Premier League season. It includes performance metrics such as goals scored, goals conceded, wins, draws, losses, and points, offering a view of each team’s results over the season.

Research question

The research aims to explore the factors contributing most to a team’s success in the 2024 Premier League season, focusing on performance metrics like goals scored, wins, draws, losses, and points, and their correlation with final rankings. The analysis will consider key variables such as team name, goals scored, wins, draws, losses, points, and final rank. Statistical methods including correlation analysis and regression modeling will be employed to identify which factors are most strongly linked to a team’s rank and points. The findings will provide valuable insights into the strategies that lead to success and highlight key performance indicators for teams striving for a top finish.

Loading libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Setting the working directory and reading the data

setwd("C:/Users/eyong/Downloads")
df <- read_csv("PremierLeagueSeason2024.csv")
## Rows: 24 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): team
## dbl (8): goals_scored, goals_conceded, wins, draws, losses, points, goal_dif...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Selecting the performance metrics for each team that will be shown in the heatmap

# Keep the team name plus the metrics to plot
df_heatmap <- df[, c("team", "wins", "draws", "losses", "goals_scored", "goals_conceded", "points")]
# Reshape to long format: one row per team and metric
df_heatmap <- df_heatmap %>% pivot_longer(cols = -team, names_to = "type", values_to = "count")
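To make the reshaping step concrete, here is a minimal sketch of what `pivot_longer()` does to a wide table. The two-team data frame `toy` and its values are hypothetical and are not taken from the Premier League file.

```r
library(tidyr)

# Hypothetical two-team table; the numbers are made up for illustration only
toy <- data.frame(
  team   = c("Team A", "Team B"),
  wins   = c(20, 10),
  draws  = c(8, 12),
  losses = c(10, 16)
)

# Every column except `team` becomes a (type, count) pair
toy_long <- pivot_longer(toy, cols = -team, names_to = "type", values_to = "count")

nrow(toy_long)  # 2 teams x 3 metrics = 6 rows
```

The long format is what `geom_tile()` needs: one row per tile, with the team on one axis and the metric on the other.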

Plotting the performance metrics (wins, draws, losses, goals, and points) as a heatmap to show which teams performed better on each metric

ggplot(df_heatmap, aes(x = team, y = type, fill = count)) +
  geom_tile() +
  scale_fill_gradient(low = "grey", high = "black") +
  labs(title = "Premier League Team Results", x = "Team", y = "Type") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  

From the heatmap, it’s clear that Manchester City has the highest number of wins, goals scored, and points, as indicated by the darker tiles compared to the other teams, followed by Arsenal. Sheffield United has the most losses, while Brighton and Hove Albion has the highest number of draws. In hindsight, if you had placed a parlay bet in the 2024 Premier League season, Manchester City and Arsenal would have been the safest picks.
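One caveat when reading this heatmap: all six metrics share a single fill scale, so metrics with larger raw values (points, goals) will tend to look darker than wins or draws regardless of how a team compares with its rivals. A possible remedy, sketched below on hypothetical data, is to rescale each metric to the 0 to 1 range within its own row before plotting. The data frame `toy_long` and the column name `scaled` are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

# Hypothetical long-format data; the values are made up for illustration only
toy_long <- data.frame(
  team  = rep(c("Team A", "Team B", "Team C"), times = 2),
  type  = rep(c("wins", "points"), each = 3),
  count = c(20, 10, 5, 68, 40, 22)
)

# Min-max rescale within each metric so every row of tiles spans 0 to 1
toy_scaled <- toy_long %>%
  group_by(type) %>%
  mutate(scaled = (count - min(count)) / (max(count) - min(count))) %>%
  ungroup()

toy_scaled
```

With this approach the plot would map `fill = scaled` instead of `fill = count`, so the darkest tile in each row marks the best (or worst) team on that metric.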

Calculating the correlation between wins and total goals scored

cor_win_goals <- cor(df$wins, df$goals_scored)
cor_win_goals
## [1] 0.9705855

The correlation between wins and goals scored is 0.9705855, which indicates a strong positive relationship.
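A single correlation coefficient says nothing about its uncertainty. As a sketch of how that could be checked, base R's `cor.test()` returns the same Pearson estimate together with a confidence interval and a p-value. The `wins` and `goals` vectors below are hypothetical stand-ins, not the values from the CSV.

```r
# Hypothetical season totals for eight teams (illustration only)
wins  <- c(28, 24, 20, 16, 13, 10, 7, 3)
goals <- c(96, 91, 76, 77, 57, 55, 52, 35)

# Pearson correlation test (the default method)
ct <- cor.test(wins, goals)

ct$estimate  # same value as cor(wins, goals)
ct$conf.int  # 95% confidence interval for the correlation
ct$p.value   # test of the null hypothesis of zero correlation
```

On the real data this would be `cor.test(df$wins, df$goals_scored)`.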

Regression analysis of goals scored on wins

model_1 <- lm(goals_scored~wins,data = df)
summary(model_1)
## 
## Call:
## lm(formula = goals_scored ~ wins, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.476  -7.445   2.766   5.841  21.524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.6959     4.2787   5.538 1.45e-05 ***
## wins          2.8593     0.1512  18.909 4.29e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.1 on 22 degrees of freedom
## Multiple R-squared:  0.942,  Adjusted R-squared:  0.9394 
## F-statistic: 357.5 on 1 and 22 DF,  p-value: 4.289e-15

The relationship between wins and goals scored is strong. Because the model regresses goals scored on wins, the multiple R-squared of 0.942 (adjusted 0.9394) means that about 94% of the variation in goals scored is explained by wins. The slope of 2.8593 indicates that each additional win is associated with roughly 2.9 more goals scored. The p-value of 4.289e-15 is far below the conventional threshold of 0.05, so the relationship is statistically significant. There is therefore strong evidence that wins and goals scored are closely linked in this data, although the regression alone does not show which one drives the other.
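The numbers reported above can be cross-checked with a little arithmetic. In a simple regression with one predictor, the multiple R-squared equals the squared Pearson correlation, and the fitted line is just intercept plus slope times the predictor. The sketch below uses only the values printed in the output; the choice of 25 wins as an input is an arbitrary example.

```r
# Values copied from the correlation and regression output above
r         <- 0.9705855
intercept <- 23.6959
slope     <- 2.8593

# Squared correlation should match the reported multiple R-squared
r_squared <- r^2
round(r_squared, 3)        # 0.942

# Goals predicted by model_1 for a team with 25 wins (arbitrary example input)
predicted_goals <- intercept + slope * 25
round(predicted_goals, 1)  # 95.2
```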

Plotting wins against goals scored with a fitted regression line

ggplot(df, aes(x = goals_scored, y = wins)) +
  geom_point() +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Regression of Wins on goals scored",  
       x = "goals scored",  
       y = "wins") +  
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Reading off the fitted line, a team that scores 100 goals is predicted to win about 25 matches. This is an estimate from the regression, not a guarantee.
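To put a range around a prediction like this, one option is `predict()` with `interval = "prediction"`, which returns the fitted value together with lower and upper bounds for a single new observation. The sketch below fits wins on goals scored using a hypothetical data frame `toy`; the model name `model_wins` is also an assumption, and the resulting numbers will differ from those of the real data.

```r
# Hypothetical data for eight teams (illustration only)
toy <- data.frame(
  goals_scored = c(96, 91, 76, 77, 57, 55, 52, 35),
  wins         = c(28, 24, 20, 16, 13, 10, 7, 3)
)

# Regress wins on goals scored, matching the axes of the plot
model_wins <- lm(wins ~ goals_scored, data = toy)

# Point prediction and 95% prediction interval for a team scoring 100 goals
pred <- predict(
  model_wins,
  newdata  = data.frame(goals_scored = 100),
  interval = "prediction",
  level    = 0.95
)
pred  # columns: fit, lwr, upr
```

The width of the interval between `lwr` and `upr` shows how far an individual team could plausibly fall from the fitted line.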

Residual histogram of the model

hist(residuals(model_1))

This histogram suggests that the residuals are approximately normally distributed with a bell-shaped curve, though slightly left-skewed, and exhibit an equal number of positive and negative outliers.
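A histogram with only 24 residuals can be hard to judge by eye, so a normal Q-Q plot and a Shapiro-Wilk test are common complements. The sketch below runs both on a hypothetical model fitted to simulated data; `toy_model`, `x`, and `y` are illustrative names, and the simulated values are unrelated to the Premier League file.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated data with roughly linear structure (illustration only)
x <- 1:24
y <- 20 + 3 * x + rnorm(24, sd = 10)

toy_model <- lm(y ~ x)
res <- residuals(toy_model)

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value would indicate departure from normality
sw <- shapiro.test(res)
sw$p.value
```

For the model in this report, the same checks would be applied to `residuals(model_1)`.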

Conclusion

In the 2024 Premier League season, the top teams in the table, Arsenal and Manchester City, stood out for their superior goals, points, and wins, which ultimately defined the season’s outcome.