Predicting Player Performance Using Linear Regression

Introduction

In this tutorial, I show how to use a linear regression model to predict a player’s performance score from physical and match statistics.
Linear regression estimates how strongly each variable is associated with the outcome, and whether that association is statistically distinguishable from zero.

Load Required Packages

# Install packages if not already installed
# install.packages("tidyverse")
# install.packages("broom")

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)

Create a Sample Dataset

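We simulate a dataset of 20 players with sprint speed, distance covered, and training load, plus a performance score. Note that the score formula below calls runif() again, so it is built from fresh random draws over the same ranges rather than from the three columns themselves. Any relationship the model finds between the score and the predictors in this dataset is therefore due to chance.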

set.seed(100)

player_data <- tibble(
  Player = paste("Player", 1:20),
  SprintSpeed = runif(20, 25, 35),   # km/h
  Distance = runif(20, 7, 12),       # km
  TrainingLoad = runif(20, 200, 400),
  PerformanceScore = 0.5 * runif(20, 25, 35) +
    0.3 * runif(20, 7, 12) +
    0.2 * runif(20, 200, 400) / 10 +
    rnorm(20, 0, 2)
)

# Show first 6 rows

head(player_data)

## # A tibble: 6 × 5
##   Player   SprintSpeed Distance TrainingLoad PerformanceScore
##   <chr>          <dbl>    <dbl>        <dbl>            <dbl>
## 1 Player 1        28.1     9.68         266.             22.4
## 2 Player 2        27.6    10.6          373.             24.2
## 3 Player 3        30.5     9.69         356.             24.2
## 4 Player 4        25.6    10.7          365.             30.1
## 5 Player 5        29.7     9.10         321.             23.4
## 6 Player 6        29.8     7.86         298.             22.6
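
If the intent is for the score to depend on the measured metrics, the predictors can be drawn once and reused in the formula. The following is a minimal sketch under that assumption, using base R only. The name player_data_linked is hypothetical and chosen so that it does not overwrite player_data; the outputs shown in the rest of this tutorial come from the original simulation, not from this variant.

```r
# Assumption: the score should be a noisy linear function of the three metrics.
set.seed(100)
n <- 20

# Draw each predictor once so the same values feed the score formula
sprint_speed  <- runif(n, 25, 35)    # km/h
distance      <- runif(n, 7, 12)     # km
training_load <- runif(n, 200, 400)

# Hypothetical name, kept separate from the tutorial's player_data
player_data_linked <- data.frame(
  Player           = paste("Player", seq_len(n)),
  SprintSpeed      = sprint_speed,
  Distance         = distance,
  TrainingLoad     = training_load,
  PerformanceScore = 0.5 * sprint_speed +
    0.3 * distance +
    0.2 * training_load / 10 +
    rnorm(n, 0, 2)                   # random noise with sd = 2
)

head(player_data_linked)
```

Fitting the same lm() formula to this variant should recover coefficients near 0.5, 0.3, and 0.02, although with 20 rows and noise of this size the estimates will still be imprecise.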

Explore the Data

Check summary statistics and relationships between key variables.

summary(player_data)

##     Player           SprintSpeed       Distance       TrainingLoad  
##  Length:20          Min.   :25.56   Min.   : 7.651   Min.   :224.7  
##  Class :character   1st Qu.:28.01   1st Qu.: 9.011   1st Qu.:246.8  
##  Mode  :character   Median :29.34   Median : 9.946   Median :266.1  
##                     Mean   :29.63   Mean   : 9.960   Mean   :291.2  
##                     3rd Qu.:31.36   3rd Qu.:10.991   3rd Qu.:329.4  
##                     Max.   :33.82   Max.   :11.948   Max.   :376.8  
##  PerformanceScore
##  Min.   :19.43   
##  1st Qu.:22.46   
##  Median :24.17   
##  Mean   :24.09   
##  3rd Qu.:25.75   
##  Max.   :30.07

pairs(player_data[,2:5])
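
A scatterplot matrix is easier to read alongside the pairwise correlations. The helper below is a small sketch (numeric_correlations is a hypothetical name, not part of any package) that keeps only the numeric columns of a data frame and returns their rounded Pearson correlations.

```r
# Hypothetical helper: Pearson correlations between all numeric columns
numeric_correlations <- function(df, digits = 2) {
  # Flag numeric columns; character columns such as Player are dropped
  is_num <- vapply(df, is.numeric, logical(1))
  round(cor(df[, is_num, drop = FALSE]), digits)
}

# Example on a built-in dataset; Species (a factor) is excluded automatically
numeric_correlations(iris)
```

For the tutorial's data, the call would be numeric_correlations(player_data).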

Fit the Linear Regression Model

We will predict PerformanceScore based on the other three variables.
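
Written out, the model fitted below has the following form. The symbols are my own notation rather than anything defined in the code: the betas are the coefficients that lm() estimates by least squares, and the epsilon term is the random error for player i.

```latex
\text{PerformanceScore}_i = \beta_0
  + \beta_1\,\text{SprintSpeed}_i
  + \beta_2\,\text{Distance}_i
  + \beta_3\,\text{TrainingLoad}_i
  + \varepsilon_i
```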

model <- lm(PerformanceScore ~ SprintSpeed + Distance + TrainingLoad, data = player_data)
summary(model)

## 
## Call:
## lm(formula = PerformanceScore ~ SprintSpeed + Distance + TrainingLoad, 
##     data = player_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5330 -1.4332 -0.2928  1.9441  3.7064 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  27.80914    8.16636   3.405  0.00362 **
## SprintSpeed  -0.50719    0.24750  -2.049  0.05720 . 
## Distance      0.79078    0.43318   1.826  0.08665 . 
## TrainingLoad  0.01178    0.01127   1.045  0.31149   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.355 on 16 degrees of freedom
## Multiple R-squared:  0.3917, Adjusted R-squared:  0.2776 
## F-statistic: 3.434 on 3 and 16 DF,  p-value: 0.04235
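
Confidence intervals often communicate uncertainty more directly than p-values. As a sketch, the helper below (coef_intervals is a hypothetical name) wraps the base R confint() function and returns each estimate alongside its interval bounds.

```r
# Hypothetical helper: coefficient estimates with confidence intervals
coef_intervals <- function(fit, level = 0.95) {
  ci <- confint(fit, level = level)   # matrix with one row per term
  data.frame(
    term      = rownames(ci),
    estimate  = unname(coef(fit)),
    lower     = unname(ci[, 1]),
    upper     = unname(ci[, 2]),
    row.names = NULL
  )
}

# Example on a built-in dataset
example_fit <- lm(mpg ~ wt + hp, data = mtcars)
coef_intervals(example_fit)
```

Applied to this tutorial's fit with coef_intervals(model), every predictor's 95% interval will include zero, because each predictor has a p-value above 0.05 in the summary above.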

Visualise Model Fit

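Visualise the relationship between Sprint Speed and Performance Score. The red line is a simple one-predictor fit drawn by geom_smooth(), so its slope will not exactly match the SprintSpeed coefficient from the multiple regression above.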

ggplot(player_data, aes(x = SprintSpeed, y = PerformanceScore)) +
  geom_point(size = 3, color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(
    title = "Relationship Between Sprint Speed and Performance Score",
    x = "Sprint Speed (km/h)",
    y = "Performance Score"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'
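
To look at the fit of the full model rather than a single predictor, one option is to compare observed scores with fitted values. The helper below is a sketch with a hypothetical name (observed_vs_fitted); it only uses base R accessors on an lm object.

```r
# Hypothetical helper: observed values, fitted values, and residuals from an lm fit
observed_vs_fitted <- function(fit) {
  data.frame(
    observed = unname(model.response(model.frame(fit))),
    fitted   = unname(fitted(fit)),
    residual = unname(residuals(fit))
  )
}

# Example on a built-in dataset
example_fit <- lm(mpg ~ wt, data = mtcars)
head(observed_vs_fitted(example_fit))
```

For the tutorial's model, plotting observed_vs_fitted(model) with ggplot(aes(x = fitted, y = observed)) plus geom_point() and geom_abline(slope = 1, intercept = 0) shows how far each player sits from a perfect prediction.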

Extract Model Insights

View coefficients and goodness-of-fit metrics.

tidy(model)

## # A tibble: 4 × 5
##   term         estimate std.error statistic p.value
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   27.8       8.17        3.41 0.00362
## 2 SprintSpeed   -0.507     0.247      -2.05 0.0572 
## 3 Distance       0.791     0.433       1.83 0.0866 
## 4 TrainingLoad   0.0118    0.0113      1.05 0.311

glance(model)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.392         0.278  2.35      3.43  0.0423     3  -43.3  96.5  102.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

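Interpretation: Each coefficient estimates the change in performance score for a one-unit increase in that variable, holding the other two constant. For example, the SprintSpeed estimate of -0.507 means that each additional km/h is associated with a score about 0.51 points lower, although with p = 0.057 this is not significant at the 5% level. R-squared is 0.392, so the model explains about 39% of the variation in performance; the adjusted R-squared of 0.278 corrects for the number of predictors. The overall F-test (p = 0.042) is significant at the 5% level, but none of the individual predictors is.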
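
The two R-squared figures can be reproduced by hand from the residuals, which makes their meaning concrete. This is a sketch that assumes a model with an intercept; r_squared_manual is a hypothetical name.

```r
# Hypothetical helper: R-squared and adjusted R-squared computed from first principles
# Assumption: the model includes an intercept
r_squared_manual <- function(fit) {
  y   <- model.response(model.frame(fit))
  rss <- sum(residuals(fit)^2)        # residual sum of squares
  tss <- sum((y - mean(y))^2)         # total sum of squares
  n   <- length(y)
  p   <- length(coef(fit)) - 1        # predictors, excluding the intercept
  r2  <- 1 - rss / tss
  c(r.squared     = r2,
    adj.r.squared = 1 - (1 - r2) * (n - 1) / (n - p - 1))
}

# Example on a built-in dataset
example_fit <- lm(mpg ~ wt + hp, data = mtcars)
r_squared_manual(example_fit)
```

Plugging in this tutorial's values (R-squared 0.3917, n = 20, p = 3) gives 1 - 0.6083 * 19 / 16 ≈ 0.2776, which matches the adjusted R-squared reported by summary(model).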

Predict New Player Performance

Predict performance for a new player with known metrics.

new_player <- data.frame(SprintSpeed = 32, Distance = 10, TrainingLoad = 350)
predicted_score <- predict(model, new_player)
predicted_score

##        1 
## 23.60847
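
A point prediction on its own hides how uncertain it is. The base R predict() method for lm objects accepts interval = "prediction", and the sketch below wraps that call in a hypothetical helper (predict_with_interval) that returns the estimate with lower and upper bounds.

```r
# Hypothetical helper: point prediction plus a prediction interval
predict_with_interval <- function(fit, newdata, level = 0.95) {
  out <- predict(fit, newdata = newdata, interval = "prediction", level = level)
  as.data.frame(out)   # columns: fit, lwr, upr
}

# Example on a built-in dataset
example_fit <- lm(mpg ~ wt, data = mtcars)
predict_with_interval(example_fit, data.frame(wt = 3))
```

For the tutorial's model, the call would be predict_with_interval(model, new_player). Given the residual standard error of 2.355, the interval around the predicted score will span several points in each direction.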

Summary & Conclusion

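Linear regression estimates how each physical metric relates to performance and how much confidence to place in that estimate. In this example, no predictor was significant at the 5% level: Sprint Speed (p ≈ 0.057) and Distance (p ≈ 0.087) were marginal, and Training Load (p ≈ 0.31) showed no clear effect. That is the expected result for only 20 players and a score simulated independently of the three metrics. The same workflow can be applied to real-world sports datasets for deeper insights. Linear regression is a simple but powerful tool in sports performance analytics.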