ECON 465 Data Science Project – Stage 2

Author

Görkem Öztürk

Introduction

This project analyzes football player data using data science techniques to answer economic questions related to player market values.

Dataset Description

This dataset contains football player statistics and market values. The data includes player performance metrics, disciplinary records, expected goals, assists, tackles, and other football-related statistics. The dataset is used to analyze the economic factors that influence football players’ market values.

Economic Question

Which player characteristics and performance statistics predict football players’ market values?

Regression Outcome Variable

The regression outcome variable is Bonservis, which represents football players’ market values measured in monetary units. Since market value is a continuous numeric variable, regression modeling techniques are appropriate for predicting and analyzing this outcome.

Data Import and Cleaning

The dataset was imported using the read_csv() function from the readr package, which is loaded as part of the tidyverse. The market value variable (Bonservis) was read as a character column because its values use dots as thousands separators. These separators were removed, and the variable was converted to numeric format for analysis.
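
The conversion described above can be sketched in base R on a few made-up strings. The values in raw_values are hypothetical and only mimic the dot-separated format; the project itself applies str_replace_all() to the Bonservis column.

```r
# Hypothetical market values stored as text with dots as separators
raw_values <- c("40.000", "5.000.000", "300.000.000")

# fixed = TRUE treats "." as a literal dot rather than a regex wildcard
clean_values <- as.numeric(gsub(".", "", raw_values, fixed = TRUE))

clean_values
```

Without fixed = TRUE (or an escaped pattern such as "\\."), the dot would match every character and the whole string would be removed.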

Summary Statistics

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.3     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

football <- read_csv("dataset.csv")

Rows: 4834 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): Oyuncu, Uyruk, Mevki, Sezon, Lig, Kategori, Bonservis
dbl (31): Yaş, MP, DK, GLS, AST, ASR, TOS, SOT, BCM, KEYP, BCC, SDR, APS, AP...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(football)

# A tibble: 6 × 38
  Oyuncu      Yaş Uyruk Mevki Sezon Lig   Kategori    MP    DK   GLS   AST   ASR
  <chr>     <dbl> <chr> <chr> <chr> <chr> <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron Cr…    35 ENG   D     24/25 Prem… Domesti…    18   847     0     0  6.91
2 Aaron Cr…    34 ENG   D     23/24 Prem… Domesti…    11   453     0     0  6.58
3 Aaron Cr…    33 ENG   D     22/23 Prem… Domesti…    28  2241     0     1  6.88
4 Aaron Cr…    32 ENG   D     21/22 Prem… Domesti…    31  2728     2     3  7.01
5 Aaron Cr…    31 ENG   D     20/21 Prem… Domesti…    36  3172     0     8  7.1 
6 Aaron Cr…    30 ENG   D     19/20 Prem… Domesti…    31  2730     3     0  6.71
# ℹ 26 more variables: TOS <dbl>, SOT <dbl>, BCM <dbl>, KEYP <dbl>, BCC <dbl>,
#   SDR <dbl>, APS <dbl>, `APS%` <dbl>, ALB <dbl>, `LBA%` <dbl>, ACR <dbl>,
#   `CA%` <dbl>, CLS <dbl>, YC <dbl>, RC <dbl>, ELTG <dbl>, DRP <dbl>,
#   TACK <dbl>, INT <dbl>, BLS <dbl>, ADW <dbl>, xG <dbl>, xA <dbl>, GI <dbl>,
#   XGI <dbl>, Bonservis <chr>

football_clean <- football %>%
  mutate(
    # Drop the dot separators, then convert the character column to numeric
    Bonservis = str_replace_all(Bonservis, "\\.", ""),
    Bonservis = as.numeric(Bonservis)
  )

summary(football_clean$Bonservis)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    40000   5000000  15000000  21841423  30000000 300000000

Probability Distribution Analysis

ggplot(football_clean, aes(x = Bonservis)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Football Players' Market Values",
    x = "Market Value",
    y = "Frequency"
  )

The original market value distribution is heavily right-skewed: the mean (about 21.8 million) is well above the median (15 million), and the maximum reaches 300 million. Most football players have relatively low market values, while a small number of players have extremely high market values.

football_clean <- football_clean %>%
  mutate(
    # Natural log; all market values are positive (minimum 40000)
    log_bonservis = log(Bonservis)
  )

ggplot(football_clean, aes(x = log_bonservis)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Log-Transformed Market Values",
    x = "Log Market Value",
    y = "Frequency"
  )

After applying a logarithmic transformation, the distribution becomes more symmetric and closer to a normal distribution. This suggests that a log-normal distribution approximates football player market values better than a normal distribution does.
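
The effect of the log transformation on skewness can be illustrated with simulated data. This is a minimal sketch, assuming log-normally distributed values; sim_values and the skewness() helper are hypothetical and are not part of the project code or data.

```r
set.seed(465)

# Simulated right-skewed values (assumption: log-normal, not the project data)
sim_values <- rlnorm(1000, meanlog = 16, sdlog = 1)

# Sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_values)
skew_log <- skewness(log(sim_values))

c(raw = skew_raw, log = skew_log)
```

A skewness near zero after the transformation is consistent with an approximately symmetric distribution, which is what the second histogram suggests for the market values.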

Classification Dataset Description

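This dataset contains English Premier League football match records. Judging by the column names, it includes full-time goals (FTHG, FTAG), the full-time result (FTR), cumulative goals scored and conceded, accumulated points (HTP, ATP), recent form, win and loss streaks, and goal difference (HTGD, ATGD) for the home and away teams. The dataset is used to analyze whether these match statistics can predict match results.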

Second Economic Question

Can football match statistics predict whether the home team will win a match?

Classification Outcome Variable

The classification outcome variable is home_win, a binary variable indicating whether the home team won the match.

1 = Home team won the match
0 = Home team did not win the match

Since the outcome has two categories, classification models such as logistic regression and decision trees are suitable for this analysis.

matches <- read_csv("epl-footballprediction.csv")

New names:
• `` -> `...1`
Rows: 6840 Columns: 40
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Date, HomeTeam, AwayTeam, FTR, HM1, HM2, HM3, HM4, HM5, AM1, AM2, ...
dbl (24): ...1, FTHG, FTAG, HTGS, ATGS, HTGC, ATGC, HTP, ATP, MW, HTFormPts,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(matches)

# A tibble: 6 × 40
   ...1 Date   HomeTeam AwayTeam  FTHG  FTAG FTR    HTGS  ATGS  HTGC  ATGC   HTP
  <dbl> <chr>  <chr>    <chr>    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1     0 19/08… Charlton Man City     4     0 H         0     0     0     0     0
2     1 19/08… Chelsea  West Ham     4     2 H         0     0     0     0     0
3     2 19/08… Coventry Middles…     1     3 NH        0     0     0     0     0
4     3 19/08… Derby    Southam…     2     2 NH        0     0     0     0     0
5     4 19/08… Leeds    Everton      2     0 H         0     0     0     0     0
6     5 19/08… Leicest… Aston V…     0     0 NH        0     0     0     0     0
# ℹ 28 more variables: ATP <dbl>, HM1 <chr>, HM2 <chr>, HM3 <chr>, HM4 <chr>,
#   HM5 <chr>, AM1 <chr>, AM2 <chr>, AM3 <chr>, AM4 <chr>, AM5 <chr>, MW <dbl>,
#   HTFormPtsStr <chr>, ATFormPtsStr <chr>, HTFormPts <dbl>, ATFormPts <dbl>,
#   HTWinStreak3 <dbl>, HTWinStreak5 <dbl>, HTLossStreak3 <dbl>,
#   HTLossStreak5 <dbl>, ATWinStreak3 <dbl>, ATWinStreak5 <dbl>,
#   ATLossStreak3 <dbl>, ATLossStreak5 <dbl>, HTGD <dbl>, ATGD <dbl>,
#   DiffPts <dbl>, DiffFormPts <dbl>

matches_clean <- matches %>%
  mutate(
    # 1 if the full-time result is a home win ("H"), 0 otherwise
    home_win = ifelse(FTR == "H", 1, 0)
  )

table(matches_clean$home_win)


   0    1 
3664 3176

Classification Distribution Analysis

ggplot(matches_clean, aes(x = factor(home_win))) +
  geom_bar() +
  labs(
    title = "Distribution of Home Team Wins",
    x = "Home Win",
    y = "Count"
  )

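The dataset classifies football matches based on whether the home team won the match. The two classes are reasonably balanced: 3,176 of the 6,840 matches (about 46%) are home wins and 3,664 (about 54%) are not. This provides a suitable binary outcome for classification modeling techniques such as logistic regression and decision trees.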

Regression Modeling

set.seed(465)

football_index <- sample(1:nrow(football_clean),
                         0.8 * nrow(football_clean))

football_train <- football_clean[football_index, ]
football_test <- football_clean[-football_index, ]

cat("Training set:", nrow(football_train), "observations\n")

Training set: 3867 observations

cat("Test set:", nrow(football_test), "observations\n")

Test set: 967 observations

The regression dataset was split into 80% training data and 20% test data using set.seed(465) for reproducibility. The training set contains 3867 observations, and the test set contains 967 observations.
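
The split sizes reported above can be reproduced from the row count alone. The sketch below is an illustration rather than the project code: it uses floor() to make the rounding of the training size explicit, and n_rows is set to the 4834 rows reported when the file was read.

```r
set.seed(465)

n_rows <- 4834  # number of rows reported for the player dataset
train_size <- floor(0.8 * n_rows)

# Draw the training rows without replacement; the rest form the test set
train_index <- sample(seq_len(n_rows), train_size)
test_index <- setdiff(seq_len(n_rows), train_index)

c(train = length(train_index), test = length(test_index))
```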

model1 <- lm(Bonservis ~ GLS + AST + xG + Yaş,
             data = football_train)

summary(model1)


Call:
lm(formula = Bonservis ~ GLS + AST + xG + Yaş, data = football_train)

Residuals:
      Min        1Q    Median        3Q       Max 
-65815266 -14144346  -5118892   8907295 142988428 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 44394929    3704004  11.986  < 2e-16 ***
GLS          1563883     442372   3.535  0.00042 ***
AST          1337185     312364   4.281 1.99e-05 ***
xG            346498     494200   0.701  0.48334    
Yaş          -960748     148859  -6.454 1.49e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22510000 on 1437 degrees of freedom
  (2425 observations deleted due to missingness)
Multiple R-squared:  0.1546,    Adjusted R-squared:  0.1522 
F-statistic: 65.69 on 4 and 1437 DF,  p-value: < 2.2e-16

This linear regression model examines whether player performance statistics can predict football players’ market values. The dependent variable is Bonservis (market value), while the independent variables are goals scored (GLS), assists (AST), expected goals (xG), and player age (Yaş).

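The model is estimated using the training dataset. Goals, assists, and age are statistically significant at the 0.1% level, while expected goals is not (p = 0.48) in this specification. The negative age coefficient indicates that, holding the other predictors fixed, each additional year of age is associated with a market value about 0.96 million lower. The model explains about 15% of the variation in the training data (R-squared = 0.1546), and 2,425 training observations were dropped because of missing values.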

predictions <- predict(model1, newdata = football_test)

head(predictions)

       1        2        3        4        5        6 
      NA 13338432 22127971       NA 20452464       NA

actual_values <- football_test$Bonservis

rmse <- sqrt(mean((actual_values - predictions)^2))

rmse

[1] NA

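The model performance is evaluated using Root Mean Squared Error (RMSE), the square root of the mean squared difference between the predicted and actual market values in the test dataset. Lower RMSE values indicate better predictive performance. Here the result is NA because predictions contains missing values for test rows with incomplete predictors and mean() is called without na.rm = TRUE, so this output does not provide a test RMSE for Model 1.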
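
The NA printed above comes from missing values in the predictions. A minimal sketch of an RMSE helper that keeps only rows where both values are present is shown below; rmse_complete() and the toy vectors are hypothetical and are not part of the project code or data.

```r
# RMSE over the rows where both the actual and predicted value are present
rmse_complete <- function(actual, predicted) {
  keep <- !is.na(actual) & !is.na(predicted)
  sqrt(mean((actual[keep] - predicted[keep])^2))
}

# Hypothetical values, not taken from the project data
toy_actual <- c(10, 20, 30, NA)
toy_predicted <- c(12, NA, 27, 40)

rmse_naive <- sqrt(mean((toy_actual - toy_predicted)^2))  # NA: a missing value propagates
rmse_fixed <- rmse_complete(toy_actual, toy_predicted)    # uses rows 1 and 3 only

c(naive = rmse_naive, fixed = rmse_fixed)
```

Passing na.rm = TRUE to mean(), as the Model 2 code does, has the same effect as the explicit filter used here.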

ss_total <- sum((actual_values - mean(actual_values, na.rm = TRUE))^2,
                na.rm = TRUE)

ss_residual <- sum((actual_values - predictions)^2,
                   na.rm = TRUE)

r_squared <- 1 - (ss_residual / ss_total)

r_squared

[1] 0.7097705

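R-squared measures how much of the variation in football players’ market values is explained by the regression model. A higher R-squared value means that the model explains more of the variation in the outcome variable. The value of 0.71 reported here should be treated with caution: ss_total is summed over every test row, whereas ss_residual is summed only over rows that have a non-missing prediction. Because the two sums cover different sets of observations, the residual share is understated and R-squared is overstated, which is consistent with the much lower R-squared of about 0.15 on the training data.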
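
In the code above, ss_total and ss_residual drop missing values independently, so they can end up covering different rows. The sketch below contrasts that calculation with one restricted to the same complete rows. The helper r_squared_complete() and the toy vectors are hypothetical illustrations, not project code or data.

```r
# R-squared with both sums of squares taken over the same complete rows
r_squared_complete <- function(actual, predicted) {
  keep <- !is.na(actual) & !is.na(predicted)
  a <- actual[keep]
  p <- predicted[keep]
  1 - sum((a - p)^2) / sum((a - mean(a))^2)
}

# Hypothetical values, not taken from the project data
toy_actual <- c(1, 2, 3, 4, NA, 6)
toy_predicted <- c(1.1, NA, 2.9, 4.2, 5, NA)

# Mismatched version: the total sum of squares uses every non-missing actual value
ss_total_all <- sum((toy_actual - mean(toy_actual, na.rm = TRUE))^2, na.rm = TRUE)
ss_residual <- sum((toy_actual - toy_predicted)^2, na.rm = TRUE)
r2_mismatched <- 1 - ss_residual / ss_total_all

r2_matched <- r_squared_complete(toy_actual, toy_predicted)

c(mismatched = r2_mismatched, matched = r2_matched)
```

In this toy example the mismatched version is the larger of the two, because its denominator sums over more rows than its numerator.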

model2 <- lm(Bonservis ~ GLS + AST + xG + xA + TOS + SOT + Yaş,
             data = football_train)

summary(model2)


Call:
lm(formula = Bonservis ~ GLS + AST + xG + xA + TOS + SOT + Yaş, 
    data = football_train)

Residuals:
      Min        1Q    Median        3Q       Max 
-63591527 -14058273  -5194535   8914202 140667910 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49598426    3840350  12.915  < 2e-16 ***
GLS          1771754     499238   3.549  0.00040 ***
AST           466190     477376   0.977  0.32895    
xG           2089992     634010   3.296  0.00100 ** 
xA           2697679     651794   4.139 3.70e-05 ***
TOS          -282763     107109  -2.640  0.00838 ** 
SOT          -263565     313985  -0.839  0.40138    
Yaş         -1143540     152613  -7.493 1.19e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22470000 on 1390 degrees of freedom
  (2469 observations deleted due to missingness)
Multiple R-squared:  0.1716,    Adjusted R-squared:  0.1674 
F-statistic: 41.13 on 7 and 1390 DF,  p-value: < 2.2e-16

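Model 2 is an expanded linear regression model. It includes goals, assists, expected goals, expected assists, total shots, shots on target, and age as predictors. Compared with Model 1, this model uses more detailed attacking performance statistics to predict football players’ market values. Once expected assists (xA) is included, assists (AST) is no longer statistically significant, and the adjusted R-squared on the training data rises only slightly, from 0.1522 to 0.1674.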

predictions2 <- predict(model2, newdata = football_test)

head(predictions2)

       1        2        3        4        5        6 
      NA 13402157 23639599       NA 21598794       NA

rmse2 <- sqrt(mean((actual_values - predictions2)^2,
                   na.rm = TRUE))

rmse2

[1] 22038191

ss_residual2 <- sum((actual_values - predictions2)^2,
                    na.rm = TRUE)

r_squared2 <- 1 - (ss_residual2 / ss_total)

r_squared2

[1] 0.704419

regression_comparison <- data.frame(
  Model = c("Model 1: Basic Linear Regression",
            "Model 2: Expanded Linear Regression"),
  RMSE = c(rmse, rmse2),
  R_squared = c(r_squared, r_squared2)
)

regression_comparison

                                Model     RMSE R_squared
1    Model 1: Basic Linear Regression       NA 0.7097705
2 Model 2: Expanded Linear Regression 22038191 0.7044190

The regression comparison table reports the test set performance of both linear regression models. RMSE measures the average prediction error, while R-squared measures how much variation in market value is explained by the model. A better regression model should have a lower RMSE and a higher R-squared value.

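The reported results do not establish that Model 2 outperforms Model 1. Model 1’s test RMSE is NA, so the RMSE values cannot be compared, and its test R-squared (0.7098) is marginally higher than that of Model 2 (0.7044). On the training data, Model 2 has a slightly higher adjusted R-squared (0.1674 versus 0.1522), but the two models were fit on different numbers of complete observations (1,398 versus 1,442), so even this comparison is only indicative.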
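
A comparison of two regression models is easier to interpret when both are fit on the same rows. The sketch below shows one way to do this with complete.cases(). It uses simulated data, and every column name in toy is hypothetical rather than taken from the project dataset.

```r
set.seed(465)

# Simulated player data (hypothetical columns, not the project data)
n <- 200
toy <- data.frame(
  goals = rpois(n, 3),
  assists = rpois(n, 2),
  shots = rpois(n, 20),
  age = sample(18:35, n, replace = TRUE)
)
toy$value <- 5 + 2 * toy$goals + toy$assists - 0.2 * toy$age + rnorm(n)

# Introduce missing values in two predictors
toy$shots[sample(n, 40)] <- NA
toy$assists[sample(n, 30)] <- NA

# Keep only rows that are complete for every variable used by either model
vars <- c("value", "goals", "assists", "shots", "age")
toy_cc <- toy[complete.cases(toy[, vars]), ]

m_small <- lm(value ~ goals + assists + age, data = toy_cc)
m_large <- lm(value ~ goals + assists + shots + age, data = toy_cc)

# Both models now use the same number of observations
c(small = nobs(m_small), large = nobs(m_large))
```

With a shared sample, differences in adjusted R-squared or test RMSE reflect the added predictors rather than a change in which rows were used.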

plot_data <- data.frame(
  Actual = actual_values,
  Predicted = predictions2
)

ggplot(plot_data, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Actual vs Predicted Market Values",
    x = "Actual Market Value",
    y = "Predicted Market Value"
  )

Warning: Removed 653 rows containing missing values or values outside the scale range
(`geom_point()`).

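The scatter plot compares actual market values with predicted market values from Model 2. If the model predicts accurately, the points should appear closer to a diagonal pattern. The graph helps visually evaluate the predictive performance of the regression model. The warning indicates that 653 of the 967 test rows were dropped from the plot because they contain missing values, so the plot covers only the remaining 314 observations.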

Regression Conclusion

This project used regression modeling techniques to predict football players’ market values using player performance statistics. The results show that variables such as goals, assists, expected goals, shots, and age are associated with player market values. The expanded regression model fits the training data slightly better than the simpler baseline model (adjusted R-squared of 0.1674 versus 0.1522), but the test-set results do not show a clear improvement in predictive performance.

Classification Modeling

set.seed(465)

matches_index <- sample(1:nrow(matches_clean),
                        0.8 * nrow(matches_clean))

matches_train <- matches_clean[matches_index, ]
matches_test <- matches_clean[-matches_index, ]

cat("Training set:", nrow(matches_train), "observations\n")

Training set: 5472 observations

cat("Test set:", nrow(matches_test), "observations\n")

Test set: 1368 observations

The classification dataset was divided into training and test sets using an 80/20 split, giving 5,472 training observations and 1,368 test observations. The training dataset was used to estimate the logistic regression model, while the test dataset was used to evaluate predictive performance.

logistic_model1 <- glm(
  home_win ~ HTP + ATP + HTGD + ATGD + DiffPts,
  data = matches_train,
  family = binomial
)

summary(logistic_model1)


Call:
glm(formula = home_win ~ HTP + ATP + HTGD + ATGD + DiffPts, family = binomial, 
    data = matches_train)

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.15780    0.12799  -1.233   0.2176    
HTP          0.30674    0.12623   2.430   0.0151 *  
ATP         -0.28908    0.12667  -2.282   0.0225 *  
HTGD         0.51093    0.09057   5.641 1.69e-08 ***
ATGD        -0.52979    0.09163  -5.782 7.39e-09 ***
DiffPts           NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7561.8  on 5471  degrees of freedom
Residual deviance: 6999.8  on 5467  degrees of freedom
AIC: 7009.8

Number of Fisher Scoring iterations: 4

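A logistic regression model was built to predict whether the home team would win a match. Accumulated points (HTP, ATP), goal difference (HTGD, ATGD), and points difference (DiffPts) were entered as predictors. The coefficient for DiffPts is reported as NA because of a singularity: it is an exact linear combination of the other predictors (presumably HTP minus ATP), so it adds no information and is dropped from the fit. The remaining coefficients have the expected signs, with home-team points and goal difference raising the estimated probability of a home win and the away-team equivalents lowering it.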
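
The "not defined because of singularities" note in the summary above can be reproduced on simulated data. In this minimal sketch, diff_pts is constructed as an exact difference of two other predictors. All variable names are hypothetical, and the assumption that DiffPts is built the same way in the match dataset is not confirmed by the source.

```r
set.seed(465)

# Simulated predictors (hypothetical names, not the project data)
n <- 500
home_pts <- rnorm(n)
away_pts <- rnorm(n)
diff_pts <- home_pts - away_pts  # exact linear combination of the two above

home_win <- rbinom(n, 1, plogis(0.5 * home_pts - 0.5 * away_pts))

fit <- glm(home_win ~ home_pts + away_pts + diff_pts, family = binomial)

# The redundant predictor gets an NA coefficient
coef(fit)
```

Dropping the redundant term from the formula leaves the fitted probabilities unchanged and removes the NA row from the coefficient table.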

predicted_probabilities <- predict(
  logistic_model1,
  newdata = matches_test,
  type = "response"
)

predicted_class <- ifelse(predicted_probabilities > 0.5, 1, 0)

accuracy <- mean(predicted_class == matches_test$home_win)

accuracy

[1] 0.6432749

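The logistic regression model achieved an accuracy of about 64.3% on the test dataset. Accuracy measures the proportion of correctly predicted match outcomes. For context, home wins make up about 46% of all matches in the dataset, so always predicting “no home win” would be correct roughly 54% of the time.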
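
Accuracy is easier to judge next to a confusion matrix and a majority-class baseline. The sketch below computes all three on a small set of made-up outcomes and predicted probabilities; toy_actual and toy_prob are hypothetical and are not taken from the match data.

```r
# Hypothetical outcomes and predicted probabilities, not the project data
toy_actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
toy_prob <- c(0.8, 0.3, 0.4, 0.6, 0.55, 0.2, 0.7, 0.1)

# Same 0.5 threshold as in the classification code above
toy_class <- ifelse(toy_prob > 0.5, 1, 0)

# Confusion matrix: rows are predictions, columns are actual outcomes
confusion <- table(Predicted = toy_class, Actual = toy_actual)
confusion

toy_accuracy <- mean(toy_class == toy_actual)

# Baseline: always predict the more frequent class
baseline_accuracy <- max(table(toy_actual)) / length(toy_actual)

c(model = toy_accuracy, baseline = baseline_accuracy)
```

The same calculation could be applied to predicted_class and matches_test$home_win to see which kind of error the logistic model makes more often.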

Conclusion

This project applied regression and classification techniques to football datasets in order to answer two economic questions.

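In the regression analysis, player performance statistics such as goals, assists, expected goals, and age were used to predict football players’ market values. The models explained about 15% to 17% of the variation in market values on the training data. The test-set R-squared of approximately 0.71 is overstated, because its total and residual sums of squares were computed over different sets of observations, and it should not be read as evidence that the model explains a large portion of the variation in player market values.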

In the classification analysis, logistic regression was used to predict whether the home team would win a match. Variables related to team strength, goal difference, and points difference were used as predictors. The model achieved an accuracy score of approximately 64%, indicating moderate predictive performance.

Overall, the project demonstrated how data science and statistical modeling techniques can be applied to sports economics and football analytics.

AI Interaction Log

AI Tool Used

ChatGPT

Prompt

“How can I build regression and logistic regression models in R for my football project?”

How It Was Used

The AI provided example R code and explanations for predictive modeling. The code was modified and tested in RStudio before being used in the project.

Reflection

The AI was helpful for understanding the modeling process and fixing coding errors.