In this tutorial, I will show how to use a linear regression
model to predict a player’s performance score from physical
and match statistics.
Linear regression estimates how strongly each variable is associated with
the outcome and whether that association can be distinguished from noise.
# Install packages if not already installed
# install.packages("tidyverse")
# install.packages("broom")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
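We simulate a dataset of 20 players with three metrics (sprint speed, distance covered, and training load) plus a performance score. Note that the score formula draws fresh random values over the same ranges rather than reusing the three metric columns, so the simulated score has no built-in relationship with the predictors; any association the model reports below is sampling noise.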
set.seed(100)
player_data <- tibble(
  Player = paste("Player", 1:20),
  SprintSpeed = runif(20, 25, 35),    # km/h
  Distance = runif(20, 7, 12),        # km
  TrainingLoad = runif(20, 200, 400),
  # Each runif() call below is a new draw, independent of the columns above
  PerformanceScore = 0.5 * runif(20, 25, 35) +
    0.3 * runif(20, 7, 12) +
    0.2 * runif(20, 200, 400) / 10 +
    rnorm(20, 0, 2)                   # random noise, sd = 2
)
# Show first 6 rows
head(player_data)
## # A tibble: 6 × 5
## Player SprintSpeed Distance TrainingLoad PerformanceScore
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Player 1 28.1 9.68 266. 22.4
## 2 Player 2 27.6 10.6 373. 24.2
## 3 Player 3 30.5 9.69 356. 24.2
## 4 Player 4 25.6 10.7 365. 30.1
## 5 Player 5 29.7 9.10 321. 23.4
## 6 Player 6 29.8 7.86 298. 22.6
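Because the score formula above calls runif() again instead of reusing the SprintSpeed, Distance, and TrainingLoad columns, the score is statistically independent of them. If the aim is a dataset in which the predictors really drive the outcome, one possible variant is sketched below. The names `linked_data` and `noise` are illustrative and not part of the original code, and every output shown in the rest of this tutorial comes from the original simulation, not from this variant.

```r
set.seed(100)
n <- 20

# Simulate the three predictors over the same ranges as above
SprintSpeed  <- runif(n, 25, 35)    # km/h
Distance     <- runif(n, 7, 12)     # km
TrainingLoad <- runif(n, 200, 400)

# Random noise, kept as its own vector so the signal can be inspected
noise <- rnorm(n, mean = 0, sd = 2)

# Hypothetical variant: derive the score from the simulated columns
linked_data <- data.frame(
  Player = paste("Player", 1:n),
  SprintSpeed = SprintSpeed,
  Distance = Distance,
  TrainingLoad = TrainingLoad,
  PerformanceScore = 0.5 * SprintSpeed +
    0.3 * Distance +
    0.2 * TrainingLoad / 10 +
    noise
)

head(linked_data)
```

Fitting the same lm() formula to `linked_data` should give slope estimates scattered around 0.5, 0.3, and 0.02, although with only 20 players the standard errors stay wide.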
Check summary statistics and relationships between key variables.
summary(player_data)
## Player SprintSpeed Distance TrainingLoad
## Length:20 Min. :25.56 Min. : 7.651 Min. :224.7
## Class :character 1st Qu.:28.01 1st Qu.: 9.011 1st Qu.:246.8
## Mode :character Median :29.34 Median : 9.946 Median :266.1
## Mean :29.63 Mean : 9.960 Mean :291.2
## 3rd Qu.:31.36 3rd Qu.:10.991 3rd Qu.:329.4
## Max. :33.82 Max. :11.948 Max. :376.8
## PerformanceScore
## Min. :19.43
## 1st Qu.:22.46
## Median :24.17
## Mean :24.09
## 3rd Qu.:25.75
## Max. :30.07
pairs(player_data[,2:5])
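The scatterplot matrix can be paired with a numeric summary. As a minimal sketch, cor() returns the pairwise Pearson correlations for the same four columns. The data frame is rebuilt here with base R so the snippet runs on its own; it assumes the same seed and draw order as the simulation above, and `cor_matrix` is an illustrative name.

```r
# Rebuild the simulated data so this snippet is self-contained
set.seed(100)
player_data <- data.frame(
  Player = paste("Player", 1:20),
  SprintSpeed = runif(20, 25, 35),
  Distance = runif(20, 7, 12),
  TrainingLoad = runif(20, 200, 400),
  PerformanceScore = 0.5 * runif(20, 25, 35) +
    0.3 * runif(20, 7, 12) +
    0.2 * runif(20, 200, 400) / 10 +
    rnorm(20, 0, 2)
)

# Pairwise Pearson correlations between the four numeric columns
cor_matrix <- cor(player_data[, 2:5])
round(cor_matrix, 2)
```

Values close to 0 in the PerformanceScore row suggest a weak linear relationship with that predictor; values near 1 or -1 between two predictors would be a warning sign for collinearity.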
We will predict PerformanceScore based on the other three variables.
model <- lm(PerformanceScore ~ SprintSpeed + Distance + TrainingLoad, data = player_data)
summary(model)
##
## Call:
## lm(formula = PerformanceScore ~ SprintSpeed + Distance + TrainingLoad,
## data = player_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5330 -1.4332 -0.2928 1.9441 3.7064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.80914 8.16636 3.405 0.00362 **
## SprintSpeed -0.50719 0.24750 -2.049 0.05720 .
## Distance 0.79078 0.43318 1.826 0.08665 .
## TrainingLoad 0.01178 0.01127 1.045 0.31149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.355 on 16 degrees of freedom
## Multiple R-squared: 0.3917, Adjusted R-squared: 0.2776
## F-statistic: 3.434 on 3 and 16 DF, p-value: 0.04235
Visualise the relationship between Sprint Speed and Performance Score.
ggplot(player_data, aes(x = SprintSpeed, y = PerformanceScore)) +
  geom_point(size = 3, color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 1) +
  labs(
    title = "Relationship Between Sprint Speed and Performance Score",
    x = "Sprint Speed (km/h)",
    y = "Performance Score"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
View coefficients and goodness-of-fit metrics.
tidy(model)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 27.8 8.17 3.41 0.00362
## 2 SprintSpeed -0.507 0.247 -2.05 0.0572
## 3 Distance 0.791 0.433 1.83 0.0866
## 4 TrainingLoad 0.0118 0.0113 1.05 0.311
glance(model)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.392 0.278 2.35 3.43 0.0423 3 -43.3 96.5 102.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
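Interpretation: Each coefficient is the expected change in performance score for a one-unit increase in that variable, holding the other two fixed. For example, the Distance estimate of 0.791 means roughly 0.79 more points per additional kilometre covered. R-squared is the share of the variation in performance explained by the model: here 0.392, or 0.278 after adjusting for the number of predictors. None of the three slopes is significant at the 5% level, so these estimates should be read with caution.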
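To make the interpretation more concrete, the sketch below computes 95% confidence intervals for the coefficients with confint() and reproduces R-squared by hand from the residuals. The data and model are rebuilt with base R so the snippet runs on its own, assuming the same seed and draw order as above; `rss`, `tss`, and `r_squared` are illustrative names.

```r
# Rebuild the simulated data and model so this snippet is self-contained
set.seed(100)
player_data <- data.frame(
  Player = paste("Player", 1:20),
  SprintSpeed = runif(20, 25, 35),
  Distance = runif(20, 7, 12),
  TrainingLoad = runif(20, 200, 400),
  PerformanceScore = 0.5 * runif(20, 25, 35) +
    0.3 * runif(20, 7, 12) +
    0.2 * runif(20, 200, 400) / 10 +
    rnorm(20, 0, 2)
)
model <- lm(PerformanceScore ~ SprintSpeed + Distance + TrainingLoad,
            data = player_data)

# 95% confidence intervals; an interval that spans 0 means
# even the sign of the effect is uncertain
confint(model, level = 0.95)

# R-squared by hand: 1 - residual sum of squares / total sum of squares
rss <- sum(residuals(model)^2)
tss <- sum((player_data$PerformanceScore - mean(player_data$PerformanceScore))^2)
r_squared <- 1 - rss / tss
r_squared
```

The hand-computed `r_squared` should match the `r.squared` column returned by glance(model).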
Predict performance for a new player with known metrics.
new_player <- data.frame(SprintSpeed = 32, Distance = 10, TrainingLoad = 350)
predicted_score <- predict(model, new_player)
predicted_score
## 1
## 23.60847
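A point prediction on its own hides how uncertain it is. As a hedged sketch, predict() accepts interval = "prediction" to return lower and upper bounds for a single new observation, and the point estimate can be cross-checked by hand from the coefficients. The data and model are rebuilt with base R so the snippet runs on its own, assuming the same seed and draw order as above; `pred`, `b`, and `manual` are illustrative names.

```r
# Rebuild the simulated data and model so this snippet is self-contained
set.seed(100)
player_data <- data.frame(
  Player = paste("Player", 1:20),
  SprintSpeed = runif(20, 25, 35),
  Distance = runif(20, 7, 12),
  TrainingLoad = runif(20, 200, 400),
  PerformanceScore = 0.5 * runif(20, 25, 35) +
    0.3 * runif(20, 7, 12) +
    0.2 * runif(20, 200, 400) / 10 +
    rnorm(20, 0, 2)
)
model <- lm(PerformanceScore ~ SprintSpeed + Distance + TrainingLoad,
            data = player_data)

new_player <- data.frame(SprintSpeed = 32, Distance = 10, TrainingLoad = 350)

# Point prediction with a 95% prediction interval (columns: fit, lwr, upr)
pred <- predict(model, new_player, interval = "prediction", level = 0.95)
pred

# Cross-check: intercept plus each coefficient times the new player's value
b <- coef(model)
manual <- b["(Intercept)"] +
  b["SprintSpeed"] * 32 +
  b["Distance"] * 10 +
  b["TrainingLoad"] * 350
unname(manual)
```

With the coefficients printed earlier, the same arithmetic is 27.809 - 0.507 × 32 + 0.791 × 10 + 0.0118 × 350, which comes to about 23.6 and agrees with the predicted score above.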
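Linear regression estimates how strongly each physical metric is associated with performance. In this example, no individual predictor was significant at the 5% level: Sprint Speed (p = 0.057) and Distance (p = 0.087) were borderline, and Training Load (p = 0.31) showed no detectable effect, even though the overall F-test came in just under 5% (p = 0.042). The negative Sprint Speed coefficient should not be read as a real effect, because the simulated score was generated from separate random draws rather than from the predictor columns. The same workflow can be applied to real-world sports datasets, where larger samples and genuine relationships give more reliable estimates. Linear regression remains a simple but powerful tool in sports performance analytics.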