ECON 465 Data Science Project – Stage 1



Introduction

This project analyzes football player data using data science techniques to answer economic questions related to player market values.

Dataset Description

This dataset contains football player statistics and market values in 4,834 rows and 38 columns. The variables include player performance metrics, disciplinary records, expected goals, assists, tackles, and other football-related statistics. Column names are in Turkish; for example, Oyuncu is the player, Yaş is age, and Bonservis is the market value. The dataset is used to analyze the economic factors that influence football players’ market values.

Economic Question

Which player characteristics and performance statistics predict football players’ market values?

Data Import and Cleaning

The dataset was imported using the read_csv() function from readr, which is loaded as part of the tidyverse. The market value variable (Bonservis) was originally stored as a character variable with dots used as thousands separators. These separators were removed, and the variable was converted to numeric format for analysis.

Summary Statistics

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
football <- read_csv("dataset.csv")
Rows: 4834 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): Oyuncu, Uyruk, Mevki, Sezon, Lig, Kategori, Bonservis
dbl (31): Yaş, MP, DK, GLS, AST, ASR, TOS, SOT, BCM, KEYP, BCC, SDR, APS, AP...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(football)
# A tibble: 6 × 38
  Oyuncu      Yaş Uyruk Mevki Sezon Lig   Kategori    MP    DK   GLS   AST   ASR
  <chr>     <dbl> <chr> <chr> <chr> <chr> <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron Cr…    35 ENG   D     24/25 Prem… Domesti…    18   847     0     0  6.91
2 Aaron Cr…    34 ENG   D     23/24 Prem… Domesti…    11   453     0     0  6.58
3 Aaron Cr…    33 ENG   D     22/23 Prem… Domesti…    28  2241     0     1  6.88
4 Aaron Cr…    32 ENG   D     21/22 Prem… Domesti…    31  2728     2     3  7.01
5 Aaron Cr…    31 ENG   D     20/21 Prem… Domesti…    36  3172     0     8  7.1 
6 Aaron Cr…    30 ENG   D     19/20 Prem… Domesti…    31  2730     3     0  6.71
# ℹ 26 more variables: TOS <dbl>, SOT <dbl>, BCM <dbl>, KEYP <dbl>, BCC <dbl>,
#   SDR <dbl>, APS <dbl>, `APS%` <dbl>, ALB <dbl>, `LBA%` <dbl>, ACR <dbl>,
#   `CA%` <dbl>, CLS <dbl>, YC <dbl>, RC <dbl>, ELTG <dbl>, DRP <dbl>,
#   TACK <dbl>, INT <dbl>, BLS <dbl>, ADW <dbl>, xG <dbl>, xA <dbl>, GI <dbl>,
#   XGI <dbl>, Bonservis <chr>
football_clean <- football %>%
  mutate(
    # Remove the dots used as thousands separators, then convert to numeric
    Bonservis = str_replace_all(Bonservis, "\\.", ""),
    Bonservis = as.numeric(Bonservis)
  )

summary(football_clean$Bonservis)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    40000   5000000  15000000  21841423  30000000 300000000 
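
The summary reports no NA's, which suggests that every Bonservis value converted cleanly. As a minimal sketch of how that check could be made explicit, the example below applies the same cleaning step in base R to a small hypothetical vector (raw_values is invented for illustration and is not taken from the dataset) and counts the entries that fail to convert.

```r
# Hypothetical raw values in the same style as Bonservis
# (dots as thousands separators); "-" stands in for a malformed entry.
raw_values <- c("40.000", "5.000.000", "300.000.000", "-", NA)

# Remove every literal dot, then convert to numeric.
# Entries that are not purely numeric become NA (warning suppressed here).
cleaned <- suppressWarnings(as.numeric(gsub(".", "", raw_values, fixed = TRUE)))

# Count values that were present in the raw data but failed to convert.
n_failed <- sum(is.na(cleaned) & !is.na(raw_values))

print(cleaned)
print(n_failed)
```

Applied to the real data, the same comparison of is.na() before and after conversion would separate values that were missing from the start from values that were lost during cleaning.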

Probability Distribution Analysis

ggplot(football_clean, aes(x = Bonservis)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Football Players' Market Values",
    x = "Market Value",
    y = "Frequency"
  )

football_clean <- football_clean %>%
  mutate(
    # Natural log of market value; all values are positive, so the log is defined
    log_bonservis = log(Bonservis)
  )

ggplot(football_clean, aes(x = log_bonservis)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Log-Transformed Market Values",
    x = "Log Market Value",
    y = "Frequency"
  )

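The raw market values are strongly right-skewed: the mean (about 21.8 million) is well above the median (15 million), and the maximum reaches 300 million. After the logarithmic transformation, the histogram is more symmetric and closer to the bell shape of a normal distribution. This suggests that a log-normal distribution approximates football players' market values better than a normal distribution does. The assessment is based on visual inspection of the histograms.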
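
The visual comparison could be backed by a number such as sample skewness, which is near zero for a symmetric distribution. The sketch below uses simulated log-normal data rather than the project dataset; the skewness() helper and the meanlog and sdlog values are assumptions chosen only for illustration.

```r
set.seed(465)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in for market values; meanlog and sdlog are assumed, not estimated.
values <- rlnorm(5000, meanlog = 16, sdlog = 1.2)

# Sample skewness: third central moment divided by the variance raised to 1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(values)       # large and positive for right-skewed data
skew_log <- skewness(log(values))  # close to zero if the data are log-normal

print(c(raw = skew_raw, log = skew_log))
```

With the project data, the same helper could be called on football_clean$Bonservis and football_clean$log_bonservis to compare the two histograms numerically.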

Classification Dataset Description

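This dataset contains English Premier League football match data in 6,840 rows and 40 columns. The variables include the full-time score and result (FTHG, FTAG, FTR) together with team-level indicators such as goals scored and conceded, points, results of the last five matches, form points, winning and losing streaks, and goal difference. The dataset is used to analyze whether these statistics can predict match results.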

Second Economic Question

Can team statistics such as points, recent form, and goal difference predict whether the home team will win a match?

matches <- read_csv("epl-footballprediction.csv")
New names:
• `` -> `...1`
Rows: 6840 Columns: 40
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Date, HomeTeam, AwayTeam, FTR, HM1, HM2, HM3, HM4, HM5, AM1, AM2, ...
dbl (24): ...1, FTHG, FTAG, HTGS, ATGS, HTGC, ATGC, HTP, ATP, MW, HTFormPts,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(matches)
# A tibble: 6 × 40
   ...1 Date   HomeTeam AwayTeam  FTHG  FTAG FTR    HTGS  ATGS  HTGC  ATGC   HTP
  <dbl> <chr>  <chr>    <chr>    <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1     0 19/08… Charlton Man City     4     0 H         0     0     0     0     0
2     1 19/08… Chelsea  West Ham     4     2 H         0     0     0     0     0
3     2 19/08… Coventry Middles…     1     3 NH        0     0     0     0     0
4     3 19/08… Derby    Southam…     2     2 NH        0     0     0     0     0
5     4 19/08… Leeds    Everton      2     0 H         0     0     0     0     0
6     5 19/08… Leicest… Aston V…     0     0 NH        0     0     0     0     0
# ℹ 28 more variables: ATP <dbl>, HM1 <chr>, HM2 <chr>, HM3 <chr>, HM4 <chr>,
#   HM5 <chr>, AM1 <chr>, AM2 <chr>, AM3 <chr>, AM4 <chr>, AM5 <chr>, MW <dbl>,
#   HTFormPtsStr <chr>, ATFormPtsStr <chr>, HTFormPts <dbl>, ATFormPts <dbl>,
#   HTWinStreak3 <dbl>, HTWinStreak5 <dbl>, HTLossStreak3 <dbl>,
#   HTLossStreak5 <dbl>, ATWinStreak3 <dbl>, ATWinStreak5 <dbl>,
#   ATLossStreak3 <dbl>, ATLossStreak5 <dbl>, HTGD <dbl>, ATGD <dbl>,
#   DiffPts <dbl>, DiffFormPts <dbl>
matches_clean <- matches %>%
  mutate(
    # Binary target: 1 if the home team won (FTR == "H"), 0 otherwise ("NH")
    home_win = ifelse(FTR == "H", 1, 0)
  )

table(matches_clean$home_win)

   0    1 
3664 3176 
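
To read these counts as class shares, they can be divided by the total number of matches. The short sketch below copies the two counts from the table() output; the names counts and shares are introduced here for illustration.

```r
# Class counts copied from the table() output above.
counts <- c(not_home_win = 3664, home_win = 3176)

# Share of each class among all 6,840 matches, rounded to three decimals.
shares <- round(counts / sum(counts), 3)

print(shares)
```

On the full data frame, prop.table(table(matches_clean$home_win)) would give the same shares directly.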

Classification Distribution Analysis

ggplot(matches_clean, aes(x = factor(home_win))) +
  geom_bar() +
  labs(
    title = "Distribution of Home Team Wins",
    x = "Home Win",
    y = "Count"
  )

The home_win variable classifies each match by whether the home team won (1) or not (0). The two classes are reasonably balanced, with 3,176 home wins (about 46%) and 3,664 draws or away wins (about 54%), so the variable is a suitable binary outcome for classification methods such as logistic regression and decision trees.
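
As a rough illustration of how logistic regression could be applied to such a binary outcome, the sketch below fits a model with glm() on simulated data. The predictor names DiffPts and DiffFormPts mirror columns in the dataset, but the values, the coefficients used to generate the outcome, and the 0.5 cut-off are all assumptions; this is not the model fitted in the project.

```r
set.seed(465)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in for matches_clean; values and true coefficients are invented.
n <- 1000
sim <- data.frame(
  DiffPts     = rnorm(n),
  DiffFormPts = rnorm(n)
)
true_prob <- plogis(-0.15 + 0.8 * sim$DiffPts + 0.3 * sim$DiffFormPts)
sim$home_win <- rbinom(n, size = 1, prob = true_prob)

# Logistic regression: the log-odds of a home win as a linear function of the predictors.
fit <- glm(home_win ~ DiffPts + DiffFormPts, data = sim, family = binomial)

# Predicted probabilities, classified as a home win when above 0.5.
pred <- as.integer(predict(fit, type = "response") > 0.5)

# In-sample accuracy; an honest estimate would need a held-out test set.
accuracy <- mean(pred == sim$home_win)

print(coef(fit))
print(accuracy)
```

Because accuracy is measured on the same rows used for fitting, it is optimistic; a train/test split would give a fairer picture of predictive performance.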