Premier League Player Performance: Goals, Shots, and Position (2024–25)

Author

Zyam Jadoon Khawaja

Published

May 15, 2026


Introduction

The English Premier League (EPL) is one of the most watched and analyzed football leagues in the world. The 2024–25 dataset used here covers 562 players across all 20 clubs and has 57 columns, ranging from basic counts like goals and appearances to advanced metrics like progressive carries and goals prevented.

This project examines what drives goal-scoring output among outfield EPL players during the 2024–25 season. Specifically, I explore whether the number of shots attempted and a player’s number of appearances can reliably predict how many goals they score. I also investigate how attacking output, which is measured by goals and assists, varies by player position (Forward, Midfielder, Defender), and whether any surprising patterns emerge across positions.

The dataset was sourced from the official Premier League statistics portal and contains season-long totals for all players who appeared in at least one match. The data includes all four positional groups: Goalkeepers (GKP), Defenders (DEF), Midfielders (MID), and Forwards (FWD).


Data Loading and Cleaning

Code
setwd("C:/Users/zyamj/Downloads")

# Load required libraries
library(tidyverse)
library(broom)
library(ggplot2)
library(scales)
library(knitr)
Code
# Load the dataset using readr::read_csv() as required
epl <- readr::read_csv("C:/Users/zyamj/Downloads/epl_player_stats_24_25.csv")
glimpse(epl)
Rows: 562
Columns: 57
$ `Player Name`               <chr> "Ben White", "Bukayo Saka", "David Raya", …
$ Club                        <chr> "Arsenal", "Arsenal", "Arsenal", "Arsenal"…
$ Nationality                 <chr> "England", "England", "Spain", "England", …
$ Position                    <chr> "DEF", "MID", "GKP", "MID", "MID", "FWD", …
$ Appearances                 <dbl> 17, 25, 38, 35, 26, 17, 28, 33, 17, 15, 30…
$ Minutes                     <dbl> 1198, 1735, 3420, 2833, 889, 603, 2365, 23…
$ Goals                       <dbl> 0, 6, 0, 4, 4, 3, 3, 8, 1, 0, 1, 9, 1, 8, …
$ Assists                     <dbl> 2, 10, 0, 7, 0, 0, 1, 4, 0, 0, 3, 3, 0, 7,…
$ Shots                       <dbl> 9, 67, 0, 48, 24, 20, 22, 55, 3, 2, 14, 53…
$ `Shots On Target`           <dbl> 12, 2, 0, 18, 0, 0, 25, 12, 1, 0, 34, 2, 0…
$ `Conversion %`              <chr> "13%", "25%", "0%", "15%", "0%", "0%", "15…
$ `Big Chances Missed`        <dbl> 0, 8, 0, 2, 0, 3, 4, 8, 0, 0, 1, 15, 1, 9,…
$ `Hit Woodwork`              <dbl> 0, 0, 0, 0, 3, 1, 1, 0, 0, 0, 0, 0, 0, 1, …
$ Offsides                    <dbl> 1, 7, 0, 2, 6, 9, 0, 8, 0, 0, 6, 14, 2, 6,…
$ Touches                     <dbl> 833, 1094, 1599, 2016, 601, 328, 1911, 108…
$ Passes                      <dbl> 1678, 643, 0, 789, 0, 0, 590, 344, 533, 0,…
$ `Successful Passes`         <dbl> 1493, 556, 0, 641, 0, 0, 466, 236, 461, 0,…
$ `Passes%`                   <chr> "89%", "87%", "0%", "81%", "0%", "0%", "79…
$ Crosses                     <dbl> 51, 1, 0, 63, 0, 0, 89, 9, 26, 0, 6, 6, 0,…
$ `Successful Crosses`        <dbl> 10, 0, 0, 7, 0, 0, 21, 1, 6, 0, 1, 2, 0, 1…
$ `Crosses %`                 <chr> "20%", "0%", "0%", "11%", "0%", "0%", "24%…
$ `fThird Passes`             <dbl> 714, 55, 0, 480, 0, 0, 339, 210, 132, 0, 1…
$ `Successful fThird Passes`  <dbl> 592, 33, 0, 364, 0, 0, 254, 132, 93, 0, 99…
$ `fThird Passes %`           <chr> "83%", "60%", "0%", "76%", "0%", "0%", "75…
$ `Through Balls`             <dbl> 4, 1, 0, 11, 0, 0, 5, 12, 0, 0, 1, 1, 0, 3…
$ Carries                     <dbl> 583, 167, 0, 411, 0, 0, 337, 182, 101, 0, …
$ `Progressive Carries`       <dbl> 296, 69, 0, 260, 0, 0, 216, 110, 52, 0, 38…
$ `Carries Ended with Goal`   <dbl> 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, …
$ `Carries Ended with Assist` <dbl> 0, 0, 0, 2, 0, 0, 3, 2, 0, 0, 2, 0, 0, 2, …
$ `Carries Ended with Shot`   <dbl> 5, 1, 0, 18, 0, 0, 18, 15, 1, 0, 8, 0, 0, …
$ `Carries Ended with Chance` <dbl> 17, 0, 0, 22, 0, 0, 22, 6, 0, 0, 4, 0, 0, …
$ `Possession Won`            <dbl> 107, 44, 0, 121, 0, 0, 78, 66, 65, 0, 42, …
$ Dispossessed                <dbl> 6, 40, 0, 32, 17, 13, 2, 38, 0, 9, 24, 23,…
$ `Clean Sheets`              <dbl> 5, 2, 13, 7, 1, 1, 10, 3, 3, 1, 7, 7, 0, 5…
$ Clearances                  <dbl> 38, 6, 29, 50, 4, 7, 89, 16, 57, 2, 41, 32…
$ Interceptions               <dbl> 23, 15, 0, 13, 0, 0, 5, 10, 9, 0, 6, 19, 0…
$ Blocks                      <dbl> 6, 14, 0, 5, 0, 0, 0, 3, 3, 0, 4, 6, 0, 8,…
$ Tackles                     <dbl> 20, 29, 0, 53, 11, 10, 25, 23, 22, 12, 58,…
$ `Ground Duels`              <dbl> 231, 58, 0, 342, 0, 0, 206, 237, 53, 0, 14…
$ `gDuels Won`                <dbl> 116, 34, 0, 121, 0, 0, 77, 111, 27, 0, 37,…
$ `gDuels %`                  <chr> "50%", "59%", "0%", "35%", "0%", "0%", "37…
$ `Aerial Duels`              <dbl> 16, 45, 0, 26, 0, 0, 56, 72, 20, 0, 167, 3…
$ `aDuels Won`                <dbl> 5, 23, 0, 10, 0, 0, 17, 25, 13, 0, 67, 8, …
$ `aDuels %`                  <chr> "31%", "51%", "0%", "39%", "0%", "0%", "30…
$ `Goals Conceded`            <dbl> 0, 0, 34, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `xGoT Conceded`             <dbl> 0, 0, 36, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `Own Goals`                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Fouls                       <dbl> 10, 15, 1, 21, 9, 14, 19, 16, 10, 15, 0, 3…
$ `Yellow Cards`              <dbl> 2, 3, 3, 5, 1, 4, 4, 1, 1, 5, 7, 5, 0, 2, …
$ `Red Cards`                 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ Saves                       <dbl> 0, 0, 86, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `Saves %`                   <chr> "0%", "0%", "72%", "0%", "0%", "0%", "0%",…
$ `Penalties Saved`           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ `Clearances Off Line`       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Punches                     <dbl> 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ `High Claims`               <dbl> 0, 0, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ `Goals Prevented`           <dbl> 0.0, 0.0, 2.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.…

Cleaning Steps

Code
# Step 1: Check dataset dimensions and missing values per column
cat("Dimensions:", nrow(epl), "rows x", ncol(epl), "columns\n")
Dimensions: 562 rows x 57 columns
Code
cat("Missing values per column:\n")
Missing values per column:
Code
print(colSums(is.na(epl)))
              Player Name                      Club               Nationality 
                        0                         0                         0 
                 Position               Appearances                   Minutes 
                        0                         0                         0 
                    Goals                   Assists                     Shots 
                        0                         0                         0 
          Shots On Target              Conversion %        Big Chances Missed 
                        0                         0                         0 
             Hit Woodwork                  Offsides                   Touches 
                        0                         0                         0 
                   Passes         Successful Passes                   Passes% 
                        0                         0                         0 
                  Crosses        Successful Crosses                 Crosses % 
                        0                         0                         0 
            fThird Passes  Successful fThird Passes           fThird Passes % 
                        0                         0                         0 
            Through Balls                   Carries       Progressive Carries 
                        0                         0                         0 
  Carries Ended with Goal Carries Ended with Assist   Carries Ended with Shot 
                        0                         0                         0 
Carries Ended with Chance            Possession Won              Dispossessed 
                        0                         0                         0 
             Clean Sheets                Clearances             Interceptions 
                        0                         0                         0 
                   Blocks                   Tackles              Ground Duels 
                        0                         0                         0 
               gDuels Won                  gDuels %              Aerial Duels 
                        0                         0                         0 
               aDuels Won                  aDuels %            Goals Conceded 
                        0                         0                         0 
            xGoT Conceded                 Own Goals                     Fouls 
                        0                         0                         0 
             Yellow Cards                 Red Cards                     Saves 
                        0                         0                         0 
                  Saves %           Penalties Saved       Clearances Off Line 
                        0                         0                         0 
                  Punches               High Claims           Goals Prevented 
                        0                         0                         0 
Code
# Step 2: Convert percentage columns from character strings to numeric
# Several columns store values like "89%" — strip the "%" and cast to double
percent_cols <- c("Conversion %", "Passes%", "Crosses %",
                  "fThird Passes %", "gDuels %", "aDuels %", "Saves %")

epl_clean <- epl %>%
  mutate(across(all_of(percent_cols),
                ~ as.numeric(str_remove(., "%"))))

# Step 3: Rename columns with spaces/special characters for easier use
epl_clean <- epl_clean %>%
  rename(
    player          = `Player Name`,
    club            = Club,
    nationality     = Nationality,
    position        = Position,
    appearances     = Appearances,
    minutes         = Minutes,
    goals           = Goals,
    assists         = Assists,
    shots           = Shots,
    shots_on_target = `Shots On Target`,
    conversion_pct  = `Conversion %`,
    prog_carries    = `Progressive Carries`,
    passes          = Passes,
    pass_pct        = `Passes%`,
    yellow_cards    = `Yellow Cards`,
    red_cards       = `Red Cards`,
    touches         = Touches,
    tackles         = Tackles,
    interceptions   = Interceptions,
    clearances      = Clearances
  )

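# Step 4: Convert Position to a factor with an explicit level order for plotting
# (factor() with levels= fixes the display order; it does not create an ordered factor)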
epl_clean <- epl_clean %>%
  mutate(position = factor(position,
                           levels = c("FWD", "MID", "DEF", "GKP")))

# Step 5: Create an outfield-only subset (excludes goalkeepers)
# Goalkeepers have near-zero values for most attacking metrics, 
# which would distort regression and visualization results
epl_out <- epl_clean %>%
  filter(position != "GKP")

cat("Full dataset rows:", nrow(epl_clean), "\n")
Full dataset rows: 562 
Code
cat("Outfield players only:", nrow(epl_out), "\n")
Outfield players only: 517 
Code
# Step 6: Create a "goals + assists" combined metric (Goal Contributions)
epl_out <- epl_out %>%
  mutate(goal_contributions = goals + assists)

# Step 7: Preview the cleaned dataset
epl_out %>%
  select(player, club, position, appearances, goals, assists,
         shots, pass_pct, prog_carries) %>%
  slice_head(n = 10) %>%
  kable(caption = "First 10 rows of the cleaned outfield dataset")
First 10 rows of the cleaned outfield dataset
player              club     position  appearances  goals  assists  shots  pass_pct  prog_carries
Ben White           Arsenal  DEF                17      0        2      9        89           296
Bukayo Saka         Arsenal  MID                25      6       10     67        87            69
Declan Rice         Arsenal  MID                35      4        7     48        81           260
Ethan Nwaneri       Arsenal  MID                26      4        0     24         0             0
Gabriel Jesus       Arsenal  FWD                17      3        0     20         0             0
Gabriel Magalhães   Arsenal  DEF                28      3        1     22        79           216
Gabriel Martinelli  Arsenal  MID                33      8        4     55        69           110
Jakub Kiwior        Arsenal  DEF                17      1        0      3        87            52
Jorginho            Arsenal  MID                15      0        0      2         0             0
Jurriën Timber      Arsenal  DEF                30      1        3     14        65            38

Summary Statistics

Code
epl_out %>%
  group_by(position) %>%
  summarise(
    n            = n(),
    avg_goals    = round(mean(goals), 2),
    avg_assists  = round(mean(assists), 2),
    avg_shots    = round(mean(shots), 2),
    avg_minutes  = round(mean(minutes), 0)
  ) %>%
  kable(caption = "Summary statistics by position (outfield players, 2024–25)")
Summary statistics by position (outfield players, 2024–25)
position    n  avg_goals  avg_assists  avg_shots  avg_minutes
FWD        86       3.57         0.81      26.80          969
MID       229       2.28         1.91      24.89         1396
DEF       202       0.65         0.66       9.03         1379

Multiple Linear Regression

Research Question

Can we predict the number of goals an outfield EPL player scores using the number of shots they attempt and the number of appearances they make?

Model

Code
# Fit multiple linear regression: Goals ~ Shots + Appearances
model <- lm(goals ~ shots + appearances, data = epl_out)
summary(model)

Call:
lm(formula = goals ~ shots + appearances, data = epl_out)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.8372  -0.5098  -0.0618   0.4658  13.5580 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.093144   0.194293   0.479   0.6319    
shots        0.127234   0.005556  22.902   <2e-16 ***
appearances -0.031356   0.010553  -2.971   0.0031 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.149 on 514 degrees of freedom
Multiple R-squared:  0.6048,    Adjusted R-squared:  0.6033 
F-statistic: 393.4 on 2 and 514 DF,  p-value: < 2.2e-16

Regression Equation

Based on the fitted model, the estimated regression equation is:

\[ \widehat{\text{Goals}} = 0.093 + 0.127 \times \text{Shots} - 0.031 \times \text{Appearances} \]
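
As a worked example, the equation can be applied to a single player. The sketch below uses the unrounded coefficients from the summary(model) output and Bukayo Saka's values from the glimpse() output (67 shots, 25 appearances, 6 goals). The variable names are illustrative, and the coefficients are typed in by hand rather than read from the fitted model object.

```r
# Coefficients copied by hand from the summary(model) output
b0      <- 0.093144
b_shots <- 0.127234
b_apps  <- -0.031356

# Bukayo Saka's row as shown in the glimpse() output
shots       <- 67
appearances <- 25
goals       <- 6

# Plug the values into the fitted equation
predicted <- b0 + b_shots * shots + b_apps * appearances

# Residual = observed minus predicted
residual <- goals - predicted

print(round(c(predicted = predicted, residual = residual), 2))
```

Here the model predicts about 7.83 goals against an observed 6, so the residual is about -1.83: the model over-predicts for this player.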

Interpreting the Model

Code
tidy(model) %>%
  kable(digits = 4,
        caption = "Regression coefficients, standard errors, and p-values")
Regression coefficients, standard errors, and p-values
term         estimate  std.error  statistic  p.value
(Intercept)    0.0931     0.1943     0.4794   0.6319
shots          0.1272     0.0056    22.9020   0.0000
appearances   -0.0314     0.0106    -2.9712   0.0031
Code
glance(model) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df) %>%
  kable(digits = 4,
        caption = "Model fit statistics")
Model fit statistics
r.squared  adj.r.squared   sigma  statistic  p.value  df
   0.6048         0.6033  2.1493   393.3705        0   2

Coefficient Interpretation:

  • Intercept (0.093, p = 0.632): Not statistically significant. When shots and appearances are both zero, the predicted goals value is close to 0, which makes sense.
  • Shots (0.127, p < 0.001): Highly significant. For every additional shot a player attempts over the season, they are predicted to score approximately 0.127 more goals, holding appearances constant. In other words, roughly 8 shots per additional goal (1 / 0.127 ≈ 7.9).
  • Appearances (−0.031, p = 0.003): Statistically significant. Holding shots constant, each additional appearance lowers the predicted goal count by about 0.03. The negative sign is counterintuitive at first glance, but the coefficient compares players with the same shot total: those who needed more appearances to reach it (for example, defenders or defensive midfielders with high game time) take fewer shots per game and plausibly convert them at a lower rate than attackers who reach the same total in fewer games.
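
The two slope coefficients can be turned into more tangible quantities with simple arithmetic. The sketch below uses the estimates from the regression output; the choice of 10 extra appearances is an arbitrary illustration, not a value from the analysis.

```r
# Slope estimates copied by hand from the regression output
b_shots <- 0.127234
b_apps  <- -0.031356

# Shots needed for one additional predicted goal, appearances held constant
shots_per_goal <- 1 / b_shots

# Change in predicted goals for 10 extra appearances, shots held constant
# (10 is a hypothetical value chosen for illustration)
extra_apps  <- 10
goal_change <- b_apps * extra_apps

print(round(c(shots_per_goal = shots_per_goal, goal_change = goal_change), 2))
```

Under these estimates, two players with identical shot totals but 10 appearances apart differ by about 0.31 predicted goals, which is small next to the effect of shot volume.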

Model Fit:

  • Adjusted R² = 0.603: The model explains approximately 60.3% of the variance in goals scored among outfield EPL players — a strong result for a two-predictor model.
  • F-statistic = 393.4 (p < 0.001): The overall model is highly statistically significant.
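
The fit statistics are linked by standard formulas, so they can be cross-checked against one another. The sketch below starts from the rounded R² in the output, the 517 outfield players, and the two predictors. Because R² is rounded to four decimals, the recomputed F-statistic matches the reported one only approximately.

```r
# Values taken from the model output above
r2 <- 0.6048   # multiple R-squared (rounded)
n  <- 517      # outfield players used in the fit
p  <- 2        # predictors: shots and appearances

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic expressed through R-squared
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(c(adj_r2 = adj_r2, f_stat = f_stat), 4))
```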

Diagnostic Plots

Code
# Four standard diagnostic plots
par(mfrow = c(2, 2))
plot(model)

Code
par(mfrow = c(1, 1))

Diagnostic Interpretation:

  • Residuals vs. Fitted: Residuals are roughly centered around zero at lower fitted values, but a handful of high-scoring outliers (e.g., Mohamed Salah, Alexander Isak) create large positive residuals at the upper end. The residual range in the model summary (−14.8 to 13.6) shows that large misses occur in both directions. This is a natural feature of goals data: a few elite players are extremely difficult to predict from volume statistics alone.
  • Q-Q Plot: The residuals deviate noticeably from normality at the upper tail, consistent with the right-skewed distribution of goals (most players score zero or few goals). This is expected for count data and is a limitation of OLS in this context.
  • Scale-Location: There is mild heteroscedasticity at higher fitted values, again driven by the right skew in goals. This is acceptable for an exploratory project, but it should be kept in mind when reading the standard errors and p-values.
  • Residuals vs. Leverage: No single observation exerts extreme leverage on the model. The high-scoring players appear as high-residual but not high-leverage points.
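
The visual reading of the leverage plot can be backed up numerically. The sketch below is a minimal illustration on simulated data, not the EPL file. The simulated variables only borrow the column names, and the cutoffs (leverage above 2k/n, absolute standardized residual above 3) are common rules of thumb rather than anything taken from the analysis above.

```r
# Simulated stand-in data (NOT the EPL dataset); names mirror the report's columns
set.seed(42)
n <- 200
sim <- data.frame(
  shots       = rpois(n, 20),
  appearances = sample(1:38, n, replace = TRUE)
)
sim$goals <- rpois(n, lambda = 0.12 * sim$shots)

fit <- lm(goals ~ shots + appearances, data = sim)

h <- hatvalues(fit)      # leverage of each observation
r <- rstandard(fit)      # standardized residuals
k <- length(coef(fit))   # number of estimated coefficients (3 here)

# Rule-of-thumb flags: high leverage or a very large standardized residual
flagged <- sim[h > 2 * k / n | abs(r) > 3, ]
print(nrow(flagged))
```

Running the same calls on the fitted model from this report would list the flagged players by row, which makes it easy to confirm whether the top scorers are high-residual, high-leverage, or both.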

Data Visualization

Goals vs. Shots by Position (Outfield Players, 2024–25 EPL)

Code
# Custom non-default colors for three positional groups
position_colors <- c(
  "FWD" = "#E63946",   # vivid red
  "MID" = "#2A9D8F",   # teal
  "DEF" = "#F4A261"    # warm orange
)

# Highlight top 10 goal scorers
# (slice_max() keeps ties by default, so more than 10 rows can be returned)
top_scorers <- epl_out %>%
  slice_max(goals, n = 10)

ggplot(epl_out, aes(x = shots, y = goals, color = position)) +
  
  # Regression line for all outfield players
  geom_smooth(method = "lm", aes(group = 1),
              color = "grey40", fill = "grey85",
              linetype = "dashed", linewidth = 0.8, se = TRUE) +
  
  # All player points
  geom_point(aes(size = appearances), alpha = 0.55) +
  
  # Label top 10 scorers
  ggrepel::geom_text_repel(
    data = top_scorers,
    aes(label = player),
    size = 2.8, color = "black",
    box.padding = 0.4, point.padding = 0.3,
    segment.color = "grey50", max.overlaps = 15
  ) +
  
  # Scales
  scale_color_manual(
    values = position_colors,
    name   = "Position",
    labels = c("FWD" = "Forward", "MID" = "Midfielder", "DEF" = "Defender")
  ) +
  scale_size_continuous(name = "Appearances", range = c(1, 6)) +
  
  # Labels and theme
  labs(
    title    = "Shots vs. Goals by Position — EPL 2024–25 Season",
    subtitle = "Outfield players only (n = 517) | Dashed line = overall OLS fit",
    x        = "Total Shots Attempted",
    y        = "Goals Scored",
    caption  = "Data source: Premier League official statistics (2024–25 season)"
  ) +
  
  theme_minimal(base_size = 13) +
  theme(
    plot.title      = element_text(face = "bold", size = 15),
    plot.subtitle   = element_text(color = "grey45", size = 11),
    plot.caption    = element_text(color = "grey55", size = 9, hjust = 0),
    legend.position = "right",
    panel.grid.minor = element_blank()
  )


Bonus: Goal Contributions by Position (Side-by-Side Box Plots)

Code
ggplot(epl_out, aes(x = position, y = goal_contributions, fill = position)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 2,
               outlier.fill = "white", alpha = 0.7) +
  geom_jitter(width = 0.18, alpha = 0.25, size = 1.1,
              aes(color = position)) +
  scale_fill_manual(values  = position_colors, guide = "none") +
  scale_color_manual(values = position_colors, guide = "none") +
  labs(
    title   = "Goal Contributions (Goals + Assists) by Position — EPL 2024–25",
    x       = "Position",
    y       = "Goals + Assists",
    caption = "Data source: Premier League official statistics (2024–25 season)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title       = element_text(face = "bold", size = 14),
    panel.grid.minor = element_blank()
  ) +
  scale_x_discrete(labels = c("FWD" = "Forward",
                               "MID" = "Midfielder",
                               "DEF" = "Defender"))


Closing Essay

a. Data Cleaning Process

The raw dataset contained 562 rows and 57 columns covering all EPL players who made at least one appearance in the 2024–25 season. The first and most important cleaning step was handling percentage columns stored as character strings. Seven columns — including Conversion %, Passes%, and gDuels % — contained values like "89%". These were converted to numeric doubles by stripping the % character using str_remove() and casting with as.numeric(). Without this step, these columns would be unusable in any quantitative analysis.
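
The percentage conversion can be illustrated on a small vector. The sketch below uses base R's sub() in place of str_remove() so that it runs without extra packages. The sample values are taken from the Conversion % and Passes% columns shown earlier.

```r
# Sample values in the same "NN%" format as the raw percentage columns
raw <- c("89%", "0%", "13%")

# Strip the literal "%" and cast to numeric
# (base R equivalent of as.numeric(str_remove(raw, "%")))
clean <- as.numeric(sub("%", "", raw, fixed = TRUE))

print(clean)

# as.numeric() returns NA for strings that are not numbers,
# so a quick NA check catches unexpected formats
print(anyNA(clean))
```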

The second major step was renaming columns. Many variable names contained spaces or special characters (like %) and therefore required backticks, which makes them awkward to use in tidyverse pipelines. Twenty key columns were renamed to snake_case equivalents (e.g., Player Name → player, Shots On Target → shots_on_target).

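Third, the Position variable was converted to a factor with the level order FWD, MID, DEF, GKP to ensure consistent ordering in plots and summaries. This fixes the display order of the levels; it is not an ordered factor in R's strict sense, since ordered = TRUE was not set.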
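
The effect of supplying levels to factor() can be seen on a toy vector. This is a minimal sketch with made-up values; only the level order matches the cleaning step.

```r
# Toy position vector (hypothetical values)
pos <- c("DEF", "MID", "GKP", "FWD")

# Same call pattern as the cleaning step: explicit level order
f <- factor(pos, levels = c("FWD", "MID", "DEF", "GKP"))

print(levels(f))                 # the order used by plots and group summaries
print(as.character(sort(f)))     # sorting follows the level order, not the alphabet
print(is.ordered(f))             # FALSE: levels have a display order only
```

An ordered factor would additionally need ordered = TRUE, which matters only if positions are to be compared with operators such as < or modelled as ordinal.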

Fourth, a goalkeeper exclusion filter was applied for both the regression and visualization. Goalkeepers have near-zero values for goals, shots, and passes, but non-zero values for goalkeeper-specific metrics. Including them in the attacking analysis would introduce a structural confound — they are a categorically different player role. This left 517 outfield players for analysis.
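
One side effect of filtering a factor is that the removed level stays in the level set. The sketch below shows this on a toy vector and uses droplevels() to remove it. Adding droplevels() is a suggestion, not a step in the original pipeline.

```r
# Toy factor with the same level order as the cleaned data
position <- factor(c("FWD", "MID", "DEF", "GKP", "MID"),
                   levels = c("FWD", "MID", "DEF", "GKP"))

# Excluding goalkeepers removes the rows but not the level
outfield <- position[position != "GKP"]
print(table(outfield))        # still lists GKP, with a count of 0

# droplevels() discards levels that no longer occur
outfield <- droplevels(outfield)
print(levels(outfield))
```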

Finally, a derived variable (goal_contributions = goals + assists) was engineered to capture a player’s total attacking output in a single metric, used in the boxplot visualization.

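Notably, the dataset contained no NA values in any column. That does not guarantee completeness: in the preview table, Ethan Nwaneri, Gabriel Jesus, and Jorginho each show 0 for both pass_pct and prog_carries despite 15 or more appearances, which suggests that some zeros may be placeholders for missing values.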
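
A dataset with no NA values can still use zeros where data are missing. As a rough check, the sketch below re-enters the ten preview rows printed earlier and flags players with many appearances but 0 for both pass_pct and prog_carries. The 15-appearance cutoff is an arbitrary assumption, and a flagged row is only a candidate for a placeholder zero, not proof of one.

```r
# The ten preview rows from the cleaned outfield dataset, re-entered by hand
preview <- data.frame(
  player = c("Ben White", "Bukayo Saka", "Declan Rice", "Ethan Nwaneri",
             "Gabriel Jesus", "Gabriel Magalhães", "Gabriel Martinelli",
             "Jakub Kiwior", "Jorginho", "Jurriën Timber"),
  appearances  = c(17, 25, 35, 26, 17, 28, 33, 17, 15, 30),
  pass_pct     = c(89, 87, 81, 0, 0, 79, 69, 87, 0, 65),
  prog_carries = c(296, 69, 260, 0, 0, 216, 110, 52, 0, 38)
)

# Flag regular players whose passing and carrying columns are both zero
# (the cutoff of 15 appearances is an illustrative assumption)
suspect <- preview[preview$appearances >= 15 &
                   preview$pass_pct == 0 &
                   preview$prog_carries == 0, ]

print(suspect$player)
```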

b. Visualization: What It Shows and Surprising Patterns

The primary scatterplot displays the relationship between total shots attempted and goals scored for all 517 outfield players, colored by position (Forward = red, Midfielder = teal, Defender = orange) and sized by appearances. An OLS regression line with confidence interval confirms the strong positive relationship.

Several patterns stand out. First, Forwards cluster at the upper-right (high shot volume, high goal output), as expected, while Defenders cluster tightly near the origin, rarely attempting more than 20 shots and almost never scoring more than 2–3 goals. Midfielders occupy the wide middle ground, with tremendous spread: Mohamed Salah (130 shots, 29 goals) sits far in the upper-right alongside Erling Haaland and Alexander Isak, yet is classified as a midfielder by the Premier League. Few players listed as midfielders, who are also expected to track back and defend, come close to that output.

The most surprising finding is how Mohamed Salah’s season stands alone — he attempted nearly 30 more shots than the second-highest player and scored 29 goals, the most in the dataset. His dot is visually isolated from even other high-volume shooters, suggesting elite efficiency on top of elite volume.

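The secondary boxplot shows that defenders produce far fewer goal contributions than forwards or midfielders. The gap between forwards and midfielders is smaller than it might appear: from the summary table, mean goals plus assists is about 4.38 for forwards and 4.19 for midfielders, against 1.31 for defenders. The plot also reveals how wide the midfielder distribution is: some midfielders contribute as much as top forwards, while others contribute nothing all season. The jittered points make this spread visible in a way raw summary statistics cannot.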
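
Mean goal contributions by position can be recovered from the summary table, since the mean of a sum equals the sum of the means. The sketch below re-enters the table values by hand, so the results are accurate only up to the rounding of the inputs.

```r
# Values copied from the summary statistics table
summary_tbl <- data.frame(
  position    = c("FWD", "MID", "DEF"),
  n           = c(86, 229, 202),
  avg_goals   = c(3.57, 2.28, 0.65),
  avg_assists = c(0.81, 1.91, 0.66)
)

# Mean goal contributions = mean goals + mean assists
summary_tbl$avg_contrib <- summary_tbl$avg_goals + summary_tbl$avg_assists

# Consistency check: group sizes should add up to the 517 outfield players
total_players <- sum(summary_tbl$n)

print(summary_tbl[, c("position", "avg_contrib")])
print(total_players)
```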

c. Limitations and Wishes

One limitation of the regression is that OLS inference assumes normally distributed residuals, but goals data is count-based and right-skewed. A Poisson or negative binomial regression would be more statistically appropriate; I explored it conceptually but did not implement it due to scope constraints. Fitting one in a future analysis would give better-calibrated predictions and a clearer picture of how shots and appearances relate to goals.
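
For reference, a count model of the kind described can be fitted with base R's glm(). The sketch below is a minimal illustration on simulated data, not the EPL file and not a result of this project. The simulated coefficients (-1.5 and 0.03) are arbitrary assumptions, and the variable names only mirror the report's columns.

```r
# Simulated stand-in data (NOT the EPL dataset)
set.seed(2025)
n <- 517
sim <- data.frame(
  shots       = sample(0:100, n, replace = TRUE),
  appearances = sample(1:38, n, replace = TRUE)
)
# Goals drawn from a Poisson distribution whose log-mean rises with shots
sim$goals <- rpois(n, lambda = exp(-1.5 + 0.03 * sim$shots))

# Poisson regression with a log link
pois_fit <- glm(goals ~ shots + appearances,
                family = poisson(link = "log"), data = sim)

# Coefficients are on the log scale; exponentiate to get rate ratios
rate_ratios <- exp(coef(pois_fit))

# Pearson dispersion: values well above 1 suggest overdispersion
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)

print(round(rate_ratios, 3))
print(round(dispersion, 2))
```

If the dispersion on real data were well above 1, a negative binomial model (for example MASS::glm.nb()) would be the usual next step.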

I also would have liked to include a heatmap of performance metrics by club, showing which teams produce the most shots, touches, or progressive carries per player on average, which is useful for identifying tactical styles. A treemap of total goals by club, sized by squad size, was also briefly explored but setting up the treemap geometry in R took longer than expected.

Finally, the ggrepel package is required for the text labels in the main scatterplot. If that package is unavailable, the labels can be removed by deleting the geom_text_repel() layer without affecting the rest of the chart.
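
Rather than deleting the layer by hand, the label layer can be chosen at run time. The sketch below checks for ggrepel with requireNamespace() and falls back to ggplot2's geom_text() when it is absent. The three-row data frame uses player values quoted in this report, and the fallback styling is an assumption rather than part of the original chart.

```r
library(ggplot2)

# Small demo data using shots and goals quoted in this report
demo <- data.frame(
  player = c("Mohamed Salah", "Bukayo Saka", "Declan Rice"),
  shots  = c(130, 67, 48),
  goals  = c(29, 6, 4)
)

# Use ggrepel when installed; otherwise fall back to plain text labels
label_layer <- if (requireNamespace("ggrepel", quietly = TRUE)) {
  ggrepel::geom_text_repel(aes(label = player), size = 2.8, color = "black")
} else {
  geom_text(aes(label = player), size = 2.8, vjust = -1, check_overlap = TRUE)
}

p <- ggplot(demo, aes(x = shots, y = goals)) +
  geom_point() +
  label_layer

print(p)
```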