This project analyzes football player data using data science techniques to answer economic questions related to player market values.
Dataset Description
This dataset contains football player statistics and market values. The data includes player performance metrics, disciplinary records, expected goals, assists, tackles, and other football-related statistics. The dataset is used to analyze the economic factors that influence football players’ market values.
Economic Question
Which player characteristics and performance statistics predict football players’ market values?
Regression Outcome Variable
The regression outcome variable is Bonservis, which represents football players’ market values measured in monetary units. Since market value is a continuous numeric variable, regression modeling techniques are appropriate for predicting and analyzing this outcome.
Data Import and Cleaning
The dataset was imported using the read_csv() function from the tidyverse package. The market value variable (Bonservis) was originally stored as a character variable with dots used as separators. These separators were removed, and the variable was converted into numeric format for analysis.
Summary Statistics
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.3 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
football <- read_csv("dataset.csv")
Rows: 4834 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Oyuncu, Uyruk, Mevki, Sezon, Lig, Kategori, Bonservis
dbl (31): Yaş, MP, DK, GLS, AST, ASR, TOS, SOT, BCM, KEYP, BCC, SDR, APS, AP...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(football)
# A tibble: 6 × 38
Oyuncu Yaş Uyruk Mevki Sezon Lig Kategori MP DK GLS AST ASR
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron Cr… 35 ENG D 24/25 Prem… Domesti… 18 847 0 0 6.91
2 Aaron Cr… 34 ENG D 23/24 Prem… Domesti… 11 453 0 0 6.58
3 Aaron Cr… 33 ENG D 22/23 Prem… Domesti… 28 2241 0 1 6.88
4 Aaron Cr… 32 ENG D 21/22 Prem… Domesti… 31 2728 2 3 7.01
5 Aaron Cr… 31 ENG D 20/21 Prem… Domesti… 36 3172 0 8 7.1
6 Aaron Cr… 30 ENG D 19/20 Prem… Domesti… 31 2730 3 0 6.71
# ℹ 26 more variables: TOS <dbl>, SOT <dbl>, BCM <dbl>, KEYP <dbl>, BCC <dbl>,
# SDR <dbl>, APS <dbl>, `APS%` <dbl>, ALB <dbl>, `LBA%` <dbl>, ACR <dbl>,
# `CA%` <dbl>, CLS <dbl>, YC <dbl>, RC <dbl>, ELTG <dbl>, DRP <dbl>,
# TACK <dbl>, INT <dbl>, BLS <dbl>, ADW <dbl>, xG <dbl>, xA <dbl>, GI <dbl>,
# XGI <dbl>, Bonservis <chr>
football_clean <- football %>%
  mutate(
    Bonservis = str_replace_all(Bonservis, "\\.", ""),
    Bonservis = as.numeric(Bonservis)
  )
summary(football_clean$Bonservis)
Min. 1st Qu. Median Mean 3rd Qu. Max.
40000 5000000 15000000 21841423 30000000 300000000
Probability Distribution Analysis
ggplot(football_clean, aes(x = Bonservis)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Football Players' Market Values",
    x = "Market Value",
    y = "Frequency"
  )
The original market value distribution is heavily right-skewed, meaning that most football players have relatively low market values while a small number of players have extremely high market values.
After applying a logarithmic transformation, the distribution became more symmetric and closer to a normal distribution. This suggests that a log-normal distribution approximates football player market values better than a normal distribution does.
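The effect of the logarithmic transformation can be checked numerically as well as visually. The sketch below is illustrative only: it uses simulated log-normal values as a stand-in for `football_clean$Bonservis`, and `skewness()` is a hypothetical helper written here, not a function used in the project.

```r
# Simulated stand-in for football_clean$Bonservis (NOT the project's data).
set.seed(465)
market_value <- rlnorm(1000, meanlog = log(15e6), sdlog = 1)

# Hypothetical helper: sample skewness (third standardized moment).
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

skew_raw <- skewness(market_value)         # raw scale
skew_log <- skewness(log10(market_value))  # log scale

# A large positive value indicates right skew; a value near 0 indicates symmetry.
print(c(raw = skew_raw, log10 = skew_log))
```

With the project's data, the same comparison could be made by passing `football_clean$Bonservis` to the helper, and a log-scale histogram could be drawn by adding `scale_x_log10()` to the existing ggplot call.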
Classification Dataset Description
This dataset contains English Premier League football match statistics. The data includes match performance indicators such as shots, fouls, yellow cards, corners, and match outcomes. The dataset is used to analyze whether football match statistics can predict match results.
Second Economic Question
Can football match statistics predict whether the home team will win a match?
Classification Outcome Variable
The classification outcome variable is home_win, a binary variable indicating whether the home team won the match.
1 = Home team won the match
0 = Home team did not win the match
Since the outcome has two categories, classification models such as logistic regression and decision trees are suitable for this analysis.
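The report does not show how home_win was created. A minimal sketch of one way to derive it is given below, assuming that the full-time result column `FTR` codes a home win as `"H"` (an assumption about the coding, which should be checked against the actual file). The data frame `matches_toy` is a hypothetical stand-in for the imported match data.

```r
# Hypothetical toy data; the "H" coding for a home win is an assumption.
matches_toy <- data.frame(
  HomeTeam = c("Arsenal", "Chelsea", "Everton", "Leeds"),
  FTR      = c("H", "D", "A", "H"),
  stringsAsFactors = FALSE
)

# 1 = home team won, 0 = home team did not win (draw or away win).
matches_toy$home_win <- as.integer(matches_toy$FTR == "H")

# Class balance of the binary outcome.
print(table(matches_toy$home_win))
```

Checking the class balance this way is useful because accuracy is only informative relative to the share of the majority class.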
matches <- read_csv("epl-footballprediction.csv")
New names:
• `` -> `...1`
Rows: 6840 Columns: 40
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Date, HomeTeam, AwayTeam, FTR, HM1, HM2, HM3, HM4, HM5, AM1, AM2, ...
dbl (24): ...1, FTHG, FTAG, HTGS, ATGS, HTGC, ATGC, HTP, ATP, MW, HTFormPts,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(matches_clean, aes(x = factor(home_win))) +
  geom_bar() +
  labs(
    title = "Distribution of Home Team Wins",
    x = "Home Win",
    y = "Count"
  )
The dataset classifies football matches based on whether the home team won the match. The distribution provides a suitable binary outcome for classification modeling techniques such as logistic regression and decision trees.
The regression dataset was split into 80% training data and 20% test data using set.seed(465) for reproducibility. The training set contains 3867 observations, and the test set contains 967 observations.
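The split code is not shown in the report. The following base R sketch illustrates one way to obtain an 80/20 split with the stated seed and row counts. The data frame `toy` and the index name `train_idx` are hypothetical, and the project may have used a different splitting function.

```r
# Hypothetical stand-in with the same number of rows as the player dataset.
toy <- data.frame(row_id = seq_len(4834))

set.seed(465)  # same seed as stated in the text, for reproducibility

# Draw 80% of the row indices without replacement for training.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))

toy_train <- toy[train_idx, , drop = FALSE]
toy_test  <- toy[-train_idx, , drop = FALSE]

print(c(train = nrow(toy_train), test = nrow(toy_test)))  # 3867 and 967
```

Because the indices are drawn without replacement, no row appears in both sets.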
Call:
lm(formula = Bonservis ~ GLS + AST + xG + Yaş, data = football_train)
Residuals:
Min 1Q Median 3Q Max
-65815266 -14144346 -5118892 8907295 142988428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44394929 3704004 11.986 < 2e-16 ***
GLS 1563883 442372 3.535 0.00042 ***
AST 1337185 312364 4.281 1.99e-05 ***
xG 346498 494200 0.701 0.48334
Yaş -960748 148859 -6.454 1.49e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22510000 on 1437 degrees of freedom
(2425 observations deleted due to missingness)
Multiple R-squared: 0.1546, Adjusted R-squared: 0.1522
F-statistic: 65.69 on 4 and 1437 DF, p-value: < 2.2e-16
This linear regression model examines whether player performance statistics can predict football players’ market values. The dependent variable is Bonservis (market value), while the independent variables are goals scored (GLS), assists (AST), expected goals (xG), and player age (Yaş).
The model is estimated using the training dataset. In the summary above, goals (GLS), assists (AST), and age (Yaş) are statistically significant at the 0.1% level, while expected goals (xG) is not (p = 0.48). The age coefficient is negative, so older players are associated with lower market values when the other predictors are held constant.
The model performance is evaluated using Root Mean Squared Error (RMSE). RMSE is the square root of the mean squared difference between the predicted and actual market values in the test dataset, so it is expressed in the same monetary units as Bonservis. Lower RMSE values indicate better predictive performance.
R-squared measures how much of the variation in football players’ market values is explained by the regression model. A higher R-squared value means that the model explains more of the variation in the outcome variable.
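Both metrics can be sketched in a few lines of R. The helpers below, `rmse_complete()` and `r2_complete()`, are hypothetical names introduced for illustration. They drop rows where either the actual or the predicted value is missing, which matters because `predict()` returns NA for rows with missing predictors and `mean()` then returns NA. The R-squared formula used here (1 minus SSE over SST on the evaluation data) is one common definition; the report does not state which definition it used.

```r
# Hypothetical helpers; not the functions used in the project.
rmse_complete <- function(actual, predicted) {
  ok <- complete.cases(actual, predicted)  # keep rows with both values present
  sqrt(mean((actual[ok] - predicted[ok])^2))
}

r2_complete <- function(actual, predicted) {
  ok  <- complete.cases(actual, predicted)
  sse <- sum((actual[ok] - predicted[ok])^2)     # unexplained variation
  sst <- sum((actual[ok] - mean(actual[ok]))^2)  # total variation
  1 - sse / sst
}

# Toy values with missing entries on both sides.
actual    <- c(10, 20, 30, NA, 50)
predicted <- c(12, 18, NA, 40, 47)

naive_rmse <- sqrt(mean((actual - predicted)^2))  # NA: missing values propagate
safe_rmse  <- rmse_complete(actual, predicted)    # sqrt(17 / 3)
safe_r2    <- r2_complete(actual, predicted)

print(c(naive = naive_rmse, safe = safe_rmse, r2 = safe_r2))
```

When comparing models, it is worth evaluating them on the same set of complete rows, since otherwise the metrics describe different subsets of the test data.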
model2 <- lm(
  Bonservis ~ GLS + AST + xG + xA + TOS + SOT + Yaş,
  data = football_train
)
summary(model2)
Call:
lm(formula = Bonservis ~ GLS + AST + xG + xA + TOS + SOT + Yaş,
data = football_train)
Residuals:
Min 1Q Median 3Q Max
-63591527 -14058273 -5194535 8914202 140667910
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49598426 3840350 12.915 < 2e-16 ***
GLS 1771754 499238 3.549 0.00040 ***
AST 466190 477376 0.977 0.32895
xG 2089992 634010 3.296 0.00100 **
xA 2697679 651794 4.139 3.70e-05 ***
TOS -282763 107109 -2.640 0.00838 **
SOT -263565 313985 -0.839 0.40138
Yaş -1143540 152613 -7.493 1.19e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22470000 on 1390 degrees of freedom
(2469 observations deleted due to missingness)
Multiple R-squared: 0.1716, Adjusted R-squared: 0.1674
F-statistic: 41.13 on 7 and 1390 DF, p-value: < 2.2e-16
Model 2 is an expanded linear regression model. It includes goals, assists, expected goals, expected assists, total shots, shots on target, and age as predictors. Compared with Model 1, this model uses more detailed attacking performance statistics to predict football players’ market values.
regression_comparison <- data.frame(
  Model = c(
    "Model 1: Basic Linear Regression",
    "Model 2: Expanded Linear Regression"
  ),
  RMSE = c(rmse, rmse2),
  R_squared = c(r_squared, r_squared2)
)
regression_comparison
Model RMSE R_squared
1 Model 1: Basic Linear Regression NA 0.7097705
2 Model 2: Expanded Linear Regression 22038191 0.7044190
The regression comparison table reports the test set performance of both linear regression models. RMSE measures the average prediction error, while R-squared measures how much variation in market value is explained by the model. A better regression model should have a lower RMSE and a higher R-squared value.
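The regression comparison results do not clearly favor either model on the test set. Model 1's RMSE is reported as NA, most likely because some test-set predictions or actual values are missing, so the two models cannot be compared on RMSE. The reported test-set R-squared values are nearly identical (0.710 for Model 1 and 0.704 for Model 2). On the training data, Model 2 fits slightly better, with an adjusted R-squared of 0.1674 compared with 0.1522 for Model 1, which suggests that the additional attacking statistics add modest explanatory power.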
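According to the model summaries, Model 1 and Model 2 were fitted on different numbers of rows (1442 and 1398) because `lm()` drops rows with missing values in any variable it uses. A fairer comparison fits both models on the same complete rows. The sketch below shows the idea on simulated data; the data frame `toy`, its columns, and the coefficients used to generate it are all hypothetical.

```r
# Simulated data (NOT the project's data) with missing values in one predictor.
set.seed(465)
n <- 200
toy <- data.frame(GLS = rpois(n, 3), AST = rpois(n, 2), xA = rexp(n, 1))
toy$value <- 1e6 * (5 + 2 * toy$GLS + toy$AST + 3 * toy$xA) + rnorm(n, sd = 2e6)
toy$xA[sample(n, 30)] <- NA  # 30 rows lose their xA value

# Keep only rows that are complete for every variable in the larger model.
vars   <- c("value", "GLS", "AST", "xA")
toy_cc <- toy[complete.cases(toy[, vars]), ]

m1 <- lm(value ~ GLS + AST, data = toy_cc)       # smaller model
m2 <- lm(value ~ GLS + AST + xA, data = toy_cc)  # nested, larger model

# Both models now use the same rows, so an F-test for nested models is valid.
print(c(m1 = nobs(m1), m2 = nobs(m2)))
print(anova(m1, m2))
```

The same row filter can be applied to the test set before computing RMSE and R-squared, so that both models are scored on identical observations.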
Warning: Removed 653 rows containing missing values or values outside the scale range
(`geom_point()`).
The scatter plot compares actual market values with predicted market values from Model 2. If the model predicts accurately, the points should appear closer to a diagonal pattern. The graph helps visually evaluate the predictive performance of the regression model.
Conclusion
This project used regression modeling techniques to predict football players’ market values using player performance statistics. The results show that variables such as goals, expected goals, expected assists, total shots, and age are associated with player market values. The expanded regression model fit the training data slightly better than the simpler baseline model (adjusted R-squared of 0.1674 versus 0.1522), suggesting that detailed football statistics add some explanatory power. The test-set comparison is incomplete, however, because the RMSE for Model 1 is missing.
The classification dataset was divided into training and test sets using an 80/20 split. The training dataset was used to estimate the logistic regression models, while the test dataset was used to evaluate predictive performance.
Call:
glm(formula = home_win ~ HTP + ATP + HTGD + ATGD + DiffPts, family = binomial,
data = matches_train)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.15780 0.12799 -1.233 0.2176
HTP 0.30674 0.12623 2.430 0.0151 *
ATP -0.28908 0.12667 -2.282 0.0225 *
HTGD 0.51093 0.09057 5.641 1.69e-08 ***
ATGD -0.52979 0.09163 -5.782 7.39e-09 ***
DiffPts NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7561.8 on 5471 degrees of freedom
Residual deviance: 6999.8 on 5467 degrees of freedom
AIC: 7009.8
Number of Fisher Scoring iterations: 4
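A logistic regression model was built to predict whether the home team would win a match. Variables related to team points (HTP, ATP), goal difference (HTGD, ATGD), and points difference (DiffPts) were entered as predictors. The coefficient for DiffPts is NA because R dropped it as a linear combination of the other predictors (likely the difference between HTP and ATP), so it adds no information beyond what HTP and ATP already provide.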
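The "not defined because of singularities" note in the output can be reproduced on simulated data. The sketch below assumes that DiffPts equals HTP minus ATP, which is an assumption about the dataset rather than something the report confirms. The data frame `toy` and its values are hypothetical.

```r
# Simulated data (NOT the project's data).
set.seed(465)
n <- 500
toy <- data.frame(HTP = runif(n, 0, 3), ATP = runif(n, 0, 3))
toy$DiffPts  <- toy$HTP - toy$ATP  # assumed definition: exact linear combination
toy$home_win <- rbinom(n, 1, plogis(0.5 * toy$DiffPts))

# DiffPts is perfectly collinear with HTP and ATP, so glm() cannot estimate it.
fit_full <- glm(home_win ~ HTP + ATP + DiffPts, family = binomial, data = toy)
print(coef(fit_full))  # DiffPts is reported as NA

# Dropping the redundant term gives the same fitted model.
fit_reduced <- glm(home_win ~ HTP + ATP, family = binomial, data = toy)
print(all.equal(deviance(fit_full), deviance(fit_reduced)))
```

Since the redundant term changes nothing about the fit, it can be removed from the formula to keep the output easier to read.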
The classification model was evaluated on the test dataset using accuracy, which measures the proportion of correctly predicted match outcomes. The model achieved an accuracy of approximately 64%.
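The accuracy calculation itself is not shown in the report. A minimal sketch is given below, assuming a 0.5 probability threshold (the threshold used in the project is not stated). The vectors `prob` and `actual` are toy values standing in for the output of `predict(..., type = "response")` and the observed home_win values.

```r
# Toy predicted probabilities and observed outcomes (NOT the project's data).
prob   <- c(0.80, 0.30, 0.60, 0.40, 0.55)
actual <- c(1, 0, 0, 0, 1)

# Classify as a home win when the predicted probability exceeds 0.5 (assumed).
pred <- as.integer(prob > 0.5)

accuracy <- mean(pred == actual)  # share of correct predictions: 4 of 5
baseline <- max(mean(actual == 1), mean(actual == 0))  # always predict majority

print(table(Predicted = pred, Actual = actual))  # confusion matrix
print(c(accuracy = accuracy, baseline = baseline))
```

Comparing accuracy with the majority-class baseline shows how much the model improves on always predicting the more common outcome.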
Conclusion
This project applied regression and classification techniques to football datasets in order to answer two economic questions.
In the regression analysis, player performance statistics such as goals, assists, expected goals, and age were used to predict football players’ market values. The comparison table reports a test-set R-squared of approximately 0.71, but the training summaries report R-squared values of only about 0.15 to 0.17, so the 0.71 figure should be verified before concluding that the model explains a large portion of the variation in player market values.
In the classification analysis, logistic regression was used to predict whether the home team would win a match. Variables related to team strength, goal difference, and points difference were used as predictors. The model achieved an accuracy score of approximately 64%, indicating moderate predictive performance.
Overall, the project demonstrated how data science and statistical modeling techniques can be applied to sports economics and football analytics.
AI Interaction Log
AI Tool Used
ChatGPT
Prompt
“How can I build regression and logistic regression models in R for my football project?”
How It Was Used
The AI provided example R code and explanations for predictive modeling. The code was modified and tested in RStudio before being used in the project.
Reflection
The AI was helpful for understanding the modeling process and fixing coding errors.