ECON 465 Data Science Project – Stage 1



Introduction

This project analyzes football player data using data science techniques to answer economic questions related to player market values.

Dataset Description

This dataset contains football player statistics and market values in 4,834 rows and 38 columns. Each row appears to describe one player in one season and competition. The variables include player performance metrics such as goals, assists, expected goals (xG), and tackles, as well as disciplinary records and other football-related statistics. The dataset is used to analyze which factors are associated with football players’ market values.

Economic Question

Which player characteristics and performance statistics predict football players’ market values?

Data Import and Cleaning

The dataset was imported with the read_csv() function from the readr package, which is loaded as part of the tidyverse. The market value variable (Bonservis) was read in as a character column because its values use dots as thousands separators. These separators were removed, and the column was converted to numeric format for analysis.
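
As a minimal sketch of the cleaning step described above, the snippet below applies the same dot removal to a few made-up character values. The input strings are hypothetical and only illustrate the assumed format; the NA check is an addition that is not part of the original analysis.

```r
library(stringr)

# Hypothetical raw values in the assumed format (dots as separators)
raw_values <- c("40.000", "15.000.000", "300.000.000")

# Strip every literal dot, then convert the remaining digits to numeric
cleaned <- as.numeric(str_replace_all(raw_values, fixed("."), ""))
print(cleaned)

# as.numeric() returns NA for strings that still contain non-digit text,
# so counting NAs is a quick way to spot values that failed to convert
n_failed <- sum(is.na(cleaned))
print(n_failed)
```

On the project data, the same NA count could be run on football_clean$Bonservis right after the conversion.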

Summary Statistics

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
football <- read_csv("dataset.csv")
Rows: 4834 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): Oyuncu, Uyruk, Mevki, Sezon, Lig, Kategori, Bonservis
dbl (31): Yaş, MP, DK, GLS, AST, ASR, TOS, SOT, BCM, KEYP, BCC, SDR, APS, AP...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(football)
# A tibble: 6 × 38
  Oyuncu      Yaş Uyruk Mevki Sezon Lig   Kategori    MP    DK   GLS   AST   ASR
  <chr>     <dbl> <chr> <chr> <chr> <chr> <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron Cr…    35 ENG   D     24/25 Prem… Domesti…    18   847     0     0  6.91
2 Aaron Cr…    34 ENG   D     23/24 Prem… Domesti…    11   453     0     0  6.58
3 Aaron Cr…    33 ENG   D     22/23 Prem… Domesti…    28  2241     0     1  6.88
4 Aaron Cr…    32 ENG   D     21/22 Prem… Domesti…    31  2728     2     3  7.01
5 Aaron Cr…    31 ENG   D     20/21 Prem… Domesti…    36  3172     0     8  7.1 
6 Aaron Cr…    30 ENG   D     19/20 Prem… Domesti…    31  2730     3     0  6.71
# ℹ 26 more variables: TOS <dbl>, SOT <dbl>, BCM <dbl>, KEYP <dbl>, BCC <dbl>,
#   SDR <dbl>, APS <dbl>, `APS%` <dbl>, ALB <dbl>, `LBA%` <dbl>, ACR <dbl>,
#   `CA%` <dbl>, CLS <dbl>, YC <dbl>, RC <dbl>, ELTG <dbl>, DRP <dbl>,
#   TACK <dbl>, INT <dbl>, BLS <dbl>, ADW <dbl>, xG <dbl>, xA <dbl>, GI <dbl>,
#   XGI <dbl>, Bonservis <chr>
# Remove the dot separators from Bonservis, then convert the column to numeric
football_clean <- football %>%
  mutate(
    Bonservis = str_replace_all(Bonservis, "\\.", ""),
    Bonservis = as.numeric(Bonservis)
  )

summary(football_clean$Bonservis)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    40000   5000000  15000000  21841423  30000000 300000000 

Probability Distribution Analysis

ggplot(football_clean, aes(x = Bonservis)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Football Players' Market Values",
    x = "Market Value",
    y = "Frequency"
  )

# Natural log of market value; Bonservis is strictly positive (minimum 40,000)
football_clean <- football_clean %>%
  mutate(
    log_bonservis = log(Bonservis)
  )

ggplot(football_clean, aes(x = log_bonservis)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Log-Transformed Market Values",
    x = "Log Market Value",
    y = "Frequency"
  )

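The summary statistics show a mean (about 21.8 million) well above the median (15 million), which indicates that raw market values are right-skewed. After a logarithmic transformation, the histogram looks more symmetric and closer to a normal distribution. This suggests that a log-normal distribution approximates football player market values better than a normal distribution does, although the conclusion rests on visual inspection rather than a formal test.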
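
One rough way to back up the visual impression is to compare sample skewness before and after the log transformation. The sketch below uses simulated log-normal data as a stand-in, since the project data is not reproduced here; the parameters (meanlog = 16, sdlog = 1.2) and the helper sample_skewness() are assumptions made for illustration.

```r
set.seed(465)

# Simulated stand-in for market values (assumption, not the project data)
sim_values <- rlnorm(5000, meanlog = 16, sdlog = 1.2)

# Moment-based sample skewness: near 0 for a symmetric distribution
sample_skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

skew_raw <- sample_skewness(sim_values)
skew_log <- sample_skewness(log(sim_values))

# Large positive skew on the raw scale, close to zero on the log scale
print(c(raw = skew_raw, log = skew_log))
```

With the project data, the same helper could be applied to football_clean$Bonservis and football_clean$log_bonservis. A normal Q-Q plot of the log values, using qqnorm() and qqline(), is another common check.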

Classification Dataset

For the classification analysis, a binary variable (high_value) was created from football players’ market values. Players with a market value strictly above the median were classified as high-value players (1), while players at or below the median were classified as low-value players (0).
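
The median split can be sketched on a small made-up vector, assuming a strict "greater than" comparison. The values are hypothetical; the point is only that observations equal to the median fall into the low-value class, so the two classes need not be the same size.

```r
# Hypothetical market values with several observations equal to the median
values <- c(5, 10, 15, 15, 15, 20, 30)

med <- median(values)

# Strict comparison: ties at the median are assigned to class 0
high_value <- ifelse(values > med, 1, 0)

counts <- table(high_value)
print(med)
print(counts)
```

Using >= instead of > would move the tied observations into the high-value class; either choice is defensible as long as it is stated.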

Second Economic Question

Can player performance statistics classify footballers as high-value or low-value players?

# Flag players strictly above the median as high-value (1); ties at the median get 0
football_classification <- football_clean %>%
  mutate(
    high_value = ifelse(Bonservis > median(Bonservis), 1, 0)
  )

table(football_classification$high_value)

   0    1 
2507 2327 
ggplot(football_classification, aes(x = factor(high_value))) +
  geom_bar() +
  labs(
    title = "Distribution of High-Value and Low-Value Players",
    x = "Player Category",
    y = "Count"
  )

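The classification dataset divides players into high-value and low-value categories at the median market value. The classes are relatively balanced: 2,507 low-value players (about 52%) and 2,327 high-value players (about 48%). The split is not exactly even because players whose market value equals the median fall into the low-value class. This balance makes the dataset suitable for classification modeling techniques.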
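
To put a number on "relatively balanced", the class shares can be computed directly from the counts reported by table() above. The labels low_value and high_value are illustrative names, and the majority-class baseline is an addition that is not part of the original analysis: it is the accuracy a model would reach by always predicting the larger class, which gives later classifiers a reference point.

```r
# Class counts as reported by table() in the analysis above
class_counts <- c(low_value = 2507, high_value = 2327)

# Share of each class in the full sample
class_shares <- class_counts / sum(class_counts)
print(round(class_shares, 3))

# Accuracy of always predicting the larger class: a simple baseline
majority_baseline <- max(class_shares)
print(round(majority_baseline, 3))
```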