The English Premier League (EPL) is one of the most watched and analyzed football leagues in the world. The 2024–25 season produced a wealth of player-level performance data across all 20 clubs, covering over 560 players and 57 distinct statistics ranging from basic counts like goals and appearances to advanced metrics like progressive carries and expected goals prevented.
This project examines what drives goal-scoring output among outfield EPL players during the 2024–25 season. Specifically, I explore whether the number of shots attempted and a player’s number of appearances can reliably predict how many goals they score. I also investigate how attacking output, which is measured by goals and assists, varies by player position (Forward, Midfielder, Defender), and whether any surprising patterns emerge across positions.
The dataset was sourced from the official Premier League statistics portal and contains season-long totals for all players who appeared in at least one match. The data includes all four positional groups: Goalkeepers (GKP), Defenders (DEF), Midfielders (MID), and Forwards (FWD).
setwd("C:/Users/zyamj/Downloads")
# Load required libraries
library(tidyverse)
library(broom)
library(ggplot2)
library(scales)
library(knitr)
# Load the dataset using readr::read_csv() as required
epl <- readr::read_csv("C:/Users/zyamj/Downloads/epl_player_stats_24_25.csv")
glimpse(epl)
Rows: 562
Columns: 57
$ `Player Name` <chr> "Ben White", "Bukayo Saka", "David Raya", …
$ Club <chr> "Arsenal", "Arsenal", "Arsenal", "Arsenal"…
$ Nationality <chr> "England", "England", "Spain", "England", …
$ Position <chr> "DEF", "MID", "GKP", "MID", "MID", "FWD", …
$ Appearances <dbl> 17, 25, 38, 35, 26, 17, 28, 33, 17, 15, 30…
$ Minutes <dbl> 1198, 1735, 3420, 2833, 889, 603, 2365, 23…
$ Goals <dbl> 0, 6, 0, 4, 4, 3, 3, 8, 1, 0, 1, 9, 1, 8, …
$ Assists <dbl> 2, 10, 0, 7, 0, 0, 1, 4, 0, 0, 3, 3, 0, 7,…
$ Shots <dbl> 9, 67, 0, 48, 24, 20, 22, 55, 3, 2, 14, 53…
$ `Shots On Target` <dbl> 12, 2, 0, 18, 0, 0, 25, 12, 1, 0, 34, 2, 0…
$ `Conversion %` <chr> "13%", "25%", "0%", "15%", "0%", "0%", "15…
$ `Big Chances Missed` <dbl> 0, 8, 0, 2, 0, 3, 4, 8, 0, 0, 1, 15, 1, 9,…
$ `Hit Woodwork` <dbl> 0, 0, 0, 0, 3, 1, 1, 0, 0, 0, 0, 0, 0, 1, …
$ Offsides <dbl> 1, 7, 0, 2, 6, 9, 0, 8, 0, 0, 6, 14, 2, 6,…
$ Touches <dbl> 833, 1094, 1599, 2016, 601, 328, 1911, 108…
$ Passes <dbl> 1678, 643, 0, 789, 0, 0, 590, 344, 533, 0,…
$ `Successful Passes` <dbl> 1493, 556, 0, 641, 0, 0, 466, 236, 461, 0,…
$ `Passes%` <chr> "89%", "87%", "0%", "81%", "0%", "0%", "79…
$ Crosses <dbl> 51, 1, 0, 63, 0, 0, 89, 9, 26, 0, 6, 6, 0,…
$ `Successful Crosses` <dbl> 10, 0, 0, 7, 0, 0, 21, 1, 6, 0, 1, 2, 0, 1…
$ `Crosses %` <chr> "20%", "0%", "0%", "11%", "0%", "0%", "24%…
$ `fThird Passes` <dbl> 714, 55, 0, 480, 0, 0, 339, 210, 132, 0, 1…
$ `Successful fThird Passes` <dbl> 592, 33, 0, 364, 0, 0, 254, 132, 93, 0, 99…
$ `fThird Passes %` <chr> "83%", "60%", "0%", "76%", "0%", "0%", "75…
$ `Through Balls` <dbl> 4, 1, 0, 11, 0, 0, 5, 12, 0, 0, 1, 1, 0, 3…
$ Carries <dbl> 583, 167, 0, 411, 0, 0, 337, 182, 101, 0, …
$ `Progressive Carries` <dbl> 296, 69, 0, 260, 0, 0, 216, 110, 52, 0, 38…
$ `Carries Ended with Goal` <dbl> 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, …
$ `Carries Ended with Assist` <dbl> 0, 0, 0, 2, 0, 0, 3, 2, 0, 0, 2, 0, 0, 2, …
$ `Carries Ended with Shot` <dbl> 5, 1, 0, 18, 0, 0, 18, 15, 1, 0, 8, 0, 0, …
$ `Carries Ended with Chance` <dbl> 17, 0, 0, 22, 0, 0, 22, 6, 0, 0, 4, 0, 0, …
$ `Possession Won` <dbl> 107, 44, 0, 121, 0, 0, 78, 66, 65, 0, 42, …
$ Dispossessed <dbl> 6, 40, 0, 32, 17, 13, 2, 38, 0, 9, 24, 23,…
$ `Clean Sheets` <dbl> 5, 2, 13, 7, 1, 1, 10, 3, 3, 1, 7, 7, 0, 5…
$ Clearances <dbl> 38, 6, 29, 50, 4, 7, 89, 16, 57, 2, 41, 32…
$ Interceptions <dbl> 23, 15, 0, 13, 0, 0, 5, 10, 9, 0, 6, 19, 0…
$ Blocks <dbl> 6, 14, 0, 5, 0, 0, 0, 3, 3, 0, 4, 6, 0, 8,…
$ Tackles <dbl> 20, 29, 0, 53, 11, 10, 25, 23, 22, 12, 58,…
$ `Ground Duels` <dbl> 231, 58, 0, 342, 0, 0, 206, 237, 53, 0, 14…
$ `gDuels Won` <dbl> 116, 34, 0, 121, 0, 0, 77, 111, 27, 0, 37,…
$ `gDuels %` <chr> "50%", "59%", "0%", "35%", "0%", "0%", "37…
$ `Aerial Duels` <dbl> 16, 45, 0, 26, 0, 0, 56, 72, 20, 0, 167, 3…
$ `aDuels Won` <dbl> 5, 23, 0, 10, 0, 0, 17, 25, 13, 0, 67, 8, …
$ `aDuels %` <chr> "31%", "51%", "0%", "39%", "0%", "0%", "30…
$ `Goals Conceded` <dbl> 0, 0, 34, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `xGoT Conceded` <dbl> 0, 0, 36, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `Own Goals` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Fouls <dbl> 10, 15, 1, 21, 9, 14, 19, 16, 10, 15, 0, 3…
$ `Yellow Cards` <dbl> 2, 3, 3, 5, 1, 4, 4, 1, 1, 5, 7, 5, 0, 2, …
$ `Red Cards` <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ Saves <dbl> 0, 0, 86, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `Saves %` <chr> "0%", "0%", "72%", "0%", "0%", "0%", "0%",…
$ `Penalties Saved` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ `Clearances Off Line` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Punches <dbl> 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ `High Claims` <dbl> 0, 0, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `Goals Prevented` <dbl> 0.0, 0.0, 2.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.…
# Step 1: Inspect column names and types
cat("Dimensions:", nrow(epl), "rows x", ncol(epl), "columns\n")
Dimensions: 562 rows x 57 columns
cat("Missing values per column:\n")
Missing values per column:
print(colSums(is.na(epl)))
Player Name Club Nationality
0 0 0
Position Appearances Minutes
0 0 0
Goals Assists Shots
0 0 0
Shots On Target Conversion % Big Chances Missed
0 0 0
Hit Woodwork Offsides Touches
0 0 0
Passes Successful Passes Passes%
0 0 0
Crosses Successful Crosses Crosses %
0 0 0
fThird Passes Successful fThird Passes fThird Passes %
0 0 0
Through Balls Carries Progressive Carries
0 0 0
Carries Ended with Goal Carries Ended with Assist Carries Ended with Shot
0 0 0
Carries Ended with Chance Possession Won Dispossessed
0 0 0
Clean Sheets Clearances Interceptions
0 0 0
Blocks Tackles Ground Duels
0 0 0
gDuels Won gDuels % Aerial Duels
0 0 0
aDuels Won aDuels % Goals Conceded
0 0 0
xGoT Conceded Own Goals Fouls
0 0 0
Yellow Cards Red Cards Saves
0 0 0
Saves % Penalties Saved Clearances Off Line
0 0 0
Punches High Claims Goals Prevented
0 0 0
# Step 2: Convert percentage columns from character strings to numeric
# Several columns store values like "89%" — strip the "%" and cast to double
percent_cols <- c("Conversion %", "Passes%", "Crosses %",
                  "fThird Passes %", "gDuels %", "aDuels %", "Saves %")
epl_clean <- epl %>%
  mutate(across(all_of(percent_cols),
                ~ as.numeric(str_remove(., "%"))))
# Step 3: Rename columns with spaces/special characters for easier use
epl_clean <- epl_clean %>%
  rename(
    player = `Player Name`,
    club = Club,
    nationality = Nationality,
    position = Position,
    appearances = Appearances,
    minutes = Minutes,
    goals = Goals,
    assists = Assists,
    shots = Shots,
    shots_on_target = `Shots On Target`,
    conversion_pct = `Conversion %`,
    prog_carries = `Progressive Carries`,
    passes = Passes,
    pass_pct = `Passes%`,
    yellow_cards = `Yellow Cards`,
    red_cards = `Red Cards`,
    touches = Touches,
    tackles = Tackles,
    interceptions = Interceptions,
    clearances = Clearances
  )
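# Step 4: Convert position to a factor with a fixed level order for plotting
# (factor() without ordered = TRUE sets the display order only, not an ordinal scale)
epl_clean <- epl_clean %>%
  mutate(position = factor(position,
                           levels = c("FWD", "MID", "DEF", "GKP")))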
# Step 5: Create an outfield-only subset (excludes goalkeepers)
# Goalkeepers have near-zero values for most attacking metrics,
# which would distort regression and visualization results
epl_out <- epl_clean %>%
  filter(position != "GKP")
cat("Full dataset rows:", nrow(epl_clean), "\n")
Full dataset rows: 562
cat("Outfield players only:", nrow(epl_out), "\n")
Outfield players only: 517
# Step 6: Create a "goals + assists" combined metric (Goal Contributions)
epl_out <- epl_out %>%
  mutate(goal_contributions = goals + assists)
# Step 7: Preview the cleaned dataset
epl_out %>%
  select(player, club, position, appearances, goals, assists,
         shots, pass_pct, prog_carries) %>%
  slice_head(n = 10) %>%
  kable(caption = "First 10 rows of the cleaned outfield dataset")

| player | club | position | appearances | goals | assists | shots | pass_pct | prog_carries |
|---|---|---|---|---|---|---|---|---|
| Ben White | Arsenal | DEF | 17 | 0 | 2 | 9 | 89 | 296 |
| Bukayo Saka | Arsenal | MID | 25 | 6 | 10 | 67 | 87 | 69 |
| Declan Rice | Arsenal | MID | 35 | 4 | 7 | 48 | 81 | 260 |
| Ethan Nwaneri | Arsenal | MID | 26 | 4 | 0 | 24 | 0 | 0 |
| Gabriel Jesus | Arsenal | FWD | 17 | 3 | 0 | 20 | 0 | 0 |
| Gabriel Magalhães | Arsenal | DEF | 28 | 3 | 1 | 22 | 79 | 216 |
| Gabriel Martinelli | Arsenal | MID | 33 | 8 | 4 | 55 | 69 | 110 |
| Jakub Kiwior | Arsenal | DEF | 17 | 1 | 0 | 3 | 87 | 52 |
| Jorginho | Arsenal | MID | 15 | 0 | 0 | 2 | 0 | 0 |
| Jurriën Timber | Arsenal | DEF | 30 | 1 | 3 | 14 | 65 | 38 |
epl_out %>%
  group_by(position) %>%
  summarise(
    n = n(),
    avg_goals = round(mean(goals), 2),
    avg_assists = round(mean(assists), 2),
    avg_shots = round(mean(shots), 2),
    avg_minutes = round(mean(minutes), 0)
  ) %>%
  kable(caption = "Summary statistics by position (outfield players, 2024–25)")

| position | n | avg_goals | avg_assists | avg_shots | avg_minutes |
|---|---|---|---|---|---|
| FWD | 86 | 3.57 | 0.81 | 26.80 | 969 |
| MID | 229 | 2.28 | 1.91 | 24.89 | 1396 |
| DEF | 202 | 0.65 | 0.66 | 9.03 | 1379 |
Can we predict the number of goals an outfield EPL player scores using the number of shots they attempt and the number of appearances they make?
# Fit multiple linear regression: Goals ~ Shots + Appearances
model <- lm(goals ~ shots + appearances, data = epl_out)
summary(model)
Call:
lm(formula = goals ~ shots + appearances, data = epl_out)
Residuals:
Min 1Q Median 3Q Max
-14.8372 -0.5098 -0.0618 0.4658 13.5580
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.093144 0.194293 0.479 0.6319
shots 0.127234 0.005556 22.902 <2e-16 ***
appearances -0.031356 0.010553 -2.971 0.0031 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.149 on 514 degrees of freedom
Multiple R-squared: 0.6048, Adjusted R-squared: 0.6033
F-statistic: 393.4 on 2 and 514 DF, p-value: < 2.2e-16
Based on the fitted model, the estimated regression equation is:
\[ \widehat{\text{Goals}} = 0.093 + 0.127 \times \text{Shots} - 0.031 \times \text{Appearances} \]
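As a quick worked example of the fitted equation, the sketch below plugs a hypothetical player into the unrounded coefficients from the summary output. The inputs (50 shots, 30 appearances) are assumed for illustration and are not a row from the dataset.

```r
# Coefficients copied from the summary(model) output above
b0      <- 0.093144
b_shots <- 0.127234
b_apps  <- -0.031356

# Hypothetical inputs (assumed for illustration, not taken from the data)
shots       <- 50
appearances <- 30

# Plug the inputs into the fitted equation
predicted_goals <- b0 + b_shots * shots + b_apps * appearances
round(predicted_goals, 2)
```

The result is about 5.5 goals. With the fitted object in memory, `predict(model, newdata = data.frame(shots = 50, appearances = 30))` should return the same value up to rounding.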
tidy(model) %>%
  kable(digits = 4,
        caption = "Regression coefficients, standard errors, and p-values")

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0931 | 0.1943 | 0.4794 | 0.6319 |
| shots | 0.1272 | 0.0056 | 22.9020 | 0.0000 |
| appearances | -0.0314 | 0.0106 | -2.9712 | 0.0031 |
glance(model) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df) %>%
  kable(digits = 4,
        caption = "Model fit statistics")

| r.squared | adj.r.squared | sigma | statistic | p.value | df |
|---|---|---|---|---|---|
| 0.6048 | 0.6033 | 2.1493 | 393.3705 | 0 | 2 |
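Coefficient Interpretation:
Holding appearances constant, each additional shot is associated with about 0.127 more goals (p < 2e-16). Holding shots constant, each additional appearance is associated with about 0.031 fewer goals (p = 0.0031). The intercept (0.093) is not statistically distinguishable from zero (p = 0.63).
Model Fit:
The model explains about 60% of the variance in goals (R-squared = 0.6048, adjusted 0.6033), with a residual standard error of 2.149 goals. The overall F-statistic (393.4 on 2 and 514 degrees of freedom, p < 2.2e-16) indicates that the two predictors jointly improve on an intercept-only model.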
# Four standard diagnostic plots
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
Diagnostic Interpretation:
# Custom non-default colors for three positional groups
position_colors <- c(
"FWD" = "#E63946", # vivid red
"MID" = "#2A9D8F", # teal
"DEF" = "#F4A261" # warm orange
)
# Highlight top 10 goal scorers
top_scorers <- epl_out %>%
  slice_max(goals, n = 10)
ggplot(epl_out, aes(x = shots, y = goals, color = position)) +
  # Regression line for all outfield players
  geom_smooth(method = "lm", aes(group = 1),
              color = "grey40", fill = "grey85",
              linetype = "dashed", linewidth = 0.8, se = TRUE) +
  # All player points
  geom_point(aes(size = appearances), alpha = 0.55) +
  # Label top 10 scorers
  ggrepel::geom_text_repel(
    data = top_scorers,
    aes(label = player),
    size = 2.8, color = "black",
    box.padding = 0.4, point.padding = 0.3,
    segment.color = "grey50", max.overlaps = 15
  ) +
  # Scales
  scale_color_manual(
    values = position_colors,
    name = "Position",
    labels = c("FWD" = "Forward", "MID" = "Midfielder", "DEF" = "Defender")
  ) +
  scale_size_continuous(name = "Appearances", range = c(1, 6)) +
  # Labels and theme
  labs(
    title = "Shots vs. Goals by Position: EPL 2024–25 Season",
    subtitle = "Outfield players only (n = 517) | Dashed line = overall OLS fit",
    x = "Total Shots Attempted",
    y = "Goals Scored",
    caption = "Data source: Premier League official statistics (2024–25 season)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 15),
    plot.subtitle = element_text(color = "grey45", size = 11),
    plot.caption = element_text(color = "grey55", size = 9, hjust = 0),
    legend.position = "right",
    panel.grid.minor = element_blank()
  )
ggplot(epl_out, aes(x = position, y = goal_contributions, fill = position)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 2,
               outlier.fill = "white", alpha = 0.7) +
  geom_jitter(width = 0.18, alpha = 0.25, size = 1.1,
              aes(color = position)) +
  scale_fill_manual(values = position_colors, guide = "none") +
  scale_color_manual(values = position_colors, guide = "none") +
  labs(
    title = "Goal Contributions (Goals + Assists) by Position: EPL 2024–25",
    x = "Position",
    y = "Goals + Assists",
    caption = "Data source: Premier League official statistics (2024–25 season)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    panel.grid.minor = element_blank()
  ) +
  scale_x_discrete(labels = c("FWD" = "Forward",
                              "MID" = "Midfielder",
                              "DEF" = "Defender"))
The raw dataset contained 562 rows and 57 columns covering all EPL players who made at least one appearance in the 2024–25 season. The first and most important cleaning step was handling percentage columns stored as character strings. Seven columns, including Conversion %, Passes%, and gDuels %, contained values like "89%". These were converted to numeric doubles by stripping the % character with str_remove() and casting with as.numeric(). Without this step, these columns could not be used in any quantitative analysis.
The second major step was renaming columns. Many variable names contained spaces or special characters (like %), so they had to be wrapped in backticks every time they were referenced, which is awkward in tidyverse pipelines. The key columns used in the analysis were renamed to snake_case equivalents (e.g., Player Name → player, Shots On Target → shots_on_target).
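The first two cleaning steps can be sketched on a toy data frame. This is a minimal illustration that uses base R equivalents (sub() and names()) so it runs without the tidyverse; the report itself uses str_remove() and rename(). The two rows reuse values shown in the glimpse() output, and the object name `toy` is hypothetical.

```r
# Toy data frame with the same awkward column names as the raw file
toy <- data.frame(
  `Player Name` = c("Ben White", "Bukayo Saka"),
  `Passes%`     = c("89%", "87%"),
  check.names = FALSE,        # keep the original names with spaces and symbols
  stringsAsFactors = FALSE
)

# Step 1: strip the "%" sign and convert the column to numeric
toy[["Passes%"]] <- as.numeric(sub("%", "", toy[["Passes%"]], fixed = TRUE))

# Step 2: rename to snake_case so backticks are no longer needed
names(toy)[names(toy) == "Player Name"] <- "player"
names(toy)[names(toy) == "Passes%"]     <- "pass_pct"

str(toy)
```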
Third, the Position variable was converted to a factor with an explicit level order (FWD, MID, DEF, GKP) to ensure consistent ordering in plots and summaries.
Fourth, goalkeepers were excluded from both the regression and the visualizations. Goalkeepers have near-zero values for goals and shots, but non-zero values for goalkeeper-specific metrics such as saves. Because they play a categorically different role, including them would distort the attacking analysis. This left 517 outfield players for analysis.
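The factor conversion and the goalkeeper filter interact in one way worth knowing: filtering rows does not remove the unused GKP level. The sketch below shows this on a toy vector (the values are made up for illustration). Calling droplevels() is an optional extra step that the report's pipeline does not use.

```r
# Toy position codes (illustrative values, not the dataset)
pos <- c("DEF", "MID", "GKP", "FWD", "MID")

# factor() with levels fixes the display order but is not an ordinal factor
f <- factor(pos, levels = c("FWD", "MID", "DEF", "GKP"))
is.ordered(f)

# ordered = TRUE would be needed for a true ordinal factor
o <- factor(pos, levels = c("FWD", "MID", "DEF", "GKP"), ordered = TRUE)
is.ordered(o)

# Filtering out goalkeepers keeps "GKP" as an empty level...
outfield <- f[f != "GKP"]
levels(outfield)

# ...unless it is dropped explicitly
outfield <- droplevels(outfield)
levels(outfield)
```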
Finally, a derived variable (goal_contributions = goals + assists) was engineered to capture a player’s total attacking output in a single metric, used in the boxplot visualization.
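Notably, the dataset contained no NA values in any column. That does not by itself guarantee completeness: in the preview table above, several players with 15 or more appearances (Ethan Nwaneri, Gabriel Jesus, Jorginho) show 0 for both pass_pct and prog_carries, which suggests that some missing values may be stored as zeros at the source.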
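Because an NA count cannot detect missing values stored as zeros, a complementary check is to flag players who appeared in matches yet show zeros in columns that should rarely be zero. This is a minimal sketch using four rows copied from the preview table above. The choice of columns to test is an assumption, and the data frame name `preview` is hypothetical.

```r
# Four rows copied from the preview of the cleaned outfield dataset
preview <- data.frame(
  player       = c("Ben White", "Ethan Nwaneri", "Gabriel Jesus", "Jorginho"),
  appearances  = c(17, 26, 17, 15),
  pass_pct     = c(89, 0, 0, 0),
  prog_carries = c(296, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Flag players who appeared but have zero passing accuracy AND zero carries
suspect <- subset(preview,
                  appearances > 0 & pass_pct == 0 & prog_carries == 0)
suspect$player
```

On the full data, the same condition could be applied to epl_out to count how many rows are affected before relying on the passing or carrying columns.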
The primary scatterplot displays the relationship between total shots attempted and goals scored for all 517 outfield players, colored by position (Forward = red, Midfielder = teal, Defender = orange) and sized by appearances. An OLS regression line with confidence interval confirms the strong positive relationship.
Several patterns stand out. First, Forwards cluster at the upper-right (high shot volume, high goal output) as expected, while Defenders cluster tightly near the origin, rarely attempting more than 20 shots and almost never scoring more than 2–3 goals. Midfielders occupy the wide middle ground, with tremendous spread: Mohamed Salah (130 shots, 29 goals) sits far in the upper-right alongside Erling Haaland and Alexander Isak, yet is classified as a midfielder by the Premier League. That output is rare for a player listed as a midfielder and expected to track back defensively.
The most surprising finding is how far Mohamed Salah’s season stands apart: he attempted nearly 30 more shots than the player with the second-highest total and scored 29 goals, the most in the dataset. His point is visually isolated even from other high-volume shooters, suggesting elite efficiency on top of elite volume.
The secondary boxplot compares goal contributions across positions. Going by the position averages in the summary table, forwards (3.57 goals + 0.81 assists, about 4.4) and midfielders (2.28 + 1.91, about 4.2) are close, and both are far ahead of defenders (0.65 + 0.66, about 1.3). The boxplot also reveals how wide the midfielder distribution is: some midfielders contribute as much as top forwards, while others contribute nothing all season. The jittered points make this spread visible in a way raw summary statistics cannot.
One limitation of the regression is that OLS inference assumes roughly normal, constant-variance residuals, but goals are non-negative counts with a right-skewed distribution. A Poisson or negative binomial regression would be more statistically appropriate; I explored it conceptually but did not implement it due to scope constraints. Fitting one in a future analysis would give a more reliable picture of how shots and appearances relate to goals.
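For reference, a Poisson version of the same model could look like the sketch below. It runs on simulated stand-in data, not the EPL dataset, and the data-generating coefficients are assumptions chosen only so the example is self-contained. On the real data, `data = epl_out` would replace `data = sim`.

```r
set.seed(42)
n <- 500

# Simulated stand-in data (NOT the EPL dataset)
sim <- data.frame(
  shots       = rnbinom(n, size = 1, mu = 20),
  appearances = sample(1:38, n, replace = TRUE)
)
# Assumed data-generating coefficients, chosen only for illustration
sim$goals <- rpois(n, lambda = exp(-1 + 0.02 * sim$shots + 0.01 * sim$appearances))

# Poisson regression with a log link, alongside OLS for comparison
pois_fit <- glm(goals ~ shots + appearances,
                family = poisson(link = "log"), data = sim)
ols_fit  <- lm(goals ~ shots + appearances, data = sim)

# Poisson fitted values are always positive; OLS fitted values need not be
range(fitted(pois_fit))
range(fitted(ols_fit))

# Rough overdispersion check: values well above 1 point toward
# a negative binomial model instead
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
dispersion

# Coefficients are on the log scale; exponentiate for multiplicative effects
exp(coef(pois_fit))
```

If the dispersion statistic on the real data is well above 1, a negative binomial fit (for example via MASS::glm.nb()) would be the natural next step.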
I also would have liked to include a heatmap of performance metrics by club, showing which teams produce the most shots, touches, or progressive carries per player on average, which is useful for identifying tactical styles. A treemap of total goals by club, sized by squad size, was also briefly explored but setting up the treemap geometry in R took longer than expected.
Finally, the ggrepel package is required for the text labels in the main scatterplot. If that package is unavailable, the labels can be removed by deleting the geom_text_repel() layer without affecting the rest of the chart.
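Rather than deleting the layer by hand, the label layer can be made conditional on whether ggrepel is installed. This is a minimal sketch on made-up toy data (the players and values are hypothetical); it relies on ggplot2 treating `+ NULL` as a no-op.

```r
library(ggplot2)

# Toy data standing in for top_scorers (values made up for illustration)
toy <- data.frame(
  player = c("Player A", "Player B", "Player C"),
  shots  = c(40, 85, 120),
  goals  = c(5, 14, 25)
)

# Use ggrepel labels when the package is installed; otherwise add no label layer
label_layer <- if (requireNamespace("ggrepel", quietly = TRUE)) {
  ggrepel::geom_text_repel(aes(label = player), size = 2.8)
} else {
  NULL
}

p <- ggplot(toy, aes(x = shots, y = goals)) +
  geom_point() +
  label_layer   # adding NULL leaves the plot unchanged
```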