This project analyzes football player data using data science techniques to answer economic questions related to player market values.
Dataset Description
This dataset contains football player statistics and market values, with one row per player per season (4,834 rows and 38 columns). The variables cover performance metrics such as goals, assists, expected goals, and tackles, along with disciplinary records and other football-related statistics. The dataset is used to analyze the factors that influence football players’ market values.
Economic Question
Which player characteristics and performance statistics predict football players’ market values?
Data Import and Cleaning
The dataset was imported using the read_csv() function from readr, which is loaded as part of the tidyverse. The market value variable (Bonservis) was read as a character variable because its values use dots as thousands separators. These separators were removed, and the variable was converted to numeric format for analysis.
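As a small illustration of this cleaning step, the sketch below converts dot-separated strings to numbers in two ways. The example strings are made up and are not taken from the dataset, and the sketch assumes that every dot in Bonservis is a thousands separator and never a decimal point.

```r
library(readr)
library(stringr)

# Made-up example values in the same format as Bonservis (not real data)
raw_values <- c("40.000", "5.000.000", "15.000.000")

# Approach used in this report: strip every literal dot, then convert
by_replace <- as.numeric(str_replace_all(raw_values, "\\.", ""))

# Alternative: tell readr that "." is the grouping mark
by_locale <- parse_number(
  raw_values,
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)

# Both approaches should agree, and no value should have turned into NA
stopifnot(all(by_replace == by_locale), !anyNA(by_replace))
```

Checking for NA values after the conversion is a cheap way to catch entries that do not follow the expected format.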
Summary Statistics
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
football <- read_csv("dataset.csv")
Rows: 4834 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Oyuncu, Uyruk, Mevki, Sezon, Lig, Kategori, Bonservis
dbl (31): Yaş, MP, DK, GLS, AST, ASR, TOS, SOT, BCM, KEYP, BCC, SDR, APS, AP...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(football)
# A tibble: 6 × 38
Oyuncu Yaş Uyruk Mevki Sezon Lig Kategori MP DK GLS AST ASR
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron Cr… 35 ENG D 24/25 Prem… Domesti… 18 847 0 0 6.91
2 Aaron Cr… 34 ENG D 23/24 Prem… Domesti… 11 453 0 0 6.58
3 Aaron Cr… 33 ENG D 22/23 Prem… Domesti… 28 2241 0 1 6.88
4 Aaron Cr… 32 ENG D 21/22 Prem… Domesti… 31 2728 2 3 7.01
5 Aaron Cr… 31 ENG D 20/21 Prem… Domesti… 36 3172 0 8 7.1
6 Aaron Cr… 30 ENG D 19/20 Prem… Domesti… 31 2730 3 0 6.71
# ℹ 26 more variables: TOS <dbl>, SOT <dbl>, BCM <dbl>, KEYP <dbl>, BCC <dbl>,
# SDR <dbl>, APS <dbl>, `APS%` <dbl>, ALB <dbl>, `LBA%` <dbl>, ACR <dbl>,
# `CA%` <dbl>, CLS <dbl>, YC <dbl>, RC <dbl>, ELTG <dbl>, DRP <dbl>,
# TACK <dbl>, INT <dbl>, BLS <dbl>, ADW <dbl>, xG <dbl>, xA <dbl>, GI <dbl>,
# XGI <dbl>, Bonservis <chr>
football_clean <- football %>%
  mutate(
    Bonservis = str_replace_all(Bonservis, "\\.", ""),
    Bonservis = as.numeric(Bonservis)
  )
summary(football_clean$Bonservis)
Min. 1st Qu. Median Mean 3rd Qu. Max.
40000 5000000 15000000 21841423 30000000 300000000
Probability Distribution Analysis
ggplot(football_clean, aes(x = Bonservis)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Football Players' Market Values",
    x = "Market Value",
    y = "Frequency"
  )
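The raw market values are right-skewed: the mean (about 21.8 million) exceeds the median (15 million), and the maximum reaches 300 million. After applying a logarithmic transformation, the distribution became more symmetric and closer to a normal distribution. A log-normal distribution therefore appears to approximate football player market values better than a normal distribution does.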
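The code for the transformed histogram is not shown in this report, so the following is only a sketch of how it might be drawn. The helper name plot_log_market_value and the choice of base-10 logarithms are assumptions. The toy values are the five quantiles from the summary output above, used only so the block runs on its own.

```r
library(ggplot2)

# Hypothetical helper: histogram of market values on a log10 scale
plot_log_market_value <- function(data, value_col = "Bonservis", bins = 30) {
  ggplot(data, aes(x = log10(.data[[value_col]]))) +
    geom_histogram(bins = bins) +
    labs(
      title = "Distribution of Log Market Values",
      x = "log10(Market Value)",
      y = "Frequency"
    )
}

# Stand-in data: min, quartiles, and max from summary(football_clean$Bonservis)
toy <- data.frame(Bonservis = c(40000, 5000000, 15000000, 30000000, 300000000))
p <- plot_log_market_value(toy)
```

With the cleaned data loaded, the call would be plot_log_market_value(football_clean). A normal Q-Q plot of the logged values would be a natural follow-up check of the log-normal claim.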
Classification Dataset Description
This dataset contains English Premier League football match statistics. The data includes match performance indicators such as shots, fouls, yellow cards, corners, and match outcomes. The dataset is used to analyze whether football match statistics can predict match results.
Second Economic Question
Can football match statistics predict whether the home team will win a match?
matches <- read_csv("epl-footballprediction.csv")
New names:
• `` -> `...1`
Rows: 6840 Columns: 40
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Date, HomeTeam, AwayTeam, FTR, HM1, HM2, HM3, HM4, HM5, AM1, AM2, ...
dbl (24): ...1, FTHG, FTAG, HTGS, ATGS, HTGC, ATGC, HTP, ATP, MW, HTFormPts,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
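The report does not show how the matches_clean data frame and its home_win variable were built, so the block below is a sketch under stated assumptions rather than the original code. It assumes that FTR holds the full-time result and codes home wins as "H". The helper name add_home_win is hypothetical, and the toy rows are made up for demonstration.

```r
library(dplyr)

# Hypothetical helper: drop the unnamed index column and add a 0/1 target.
# Assumption: FTR is the full-time result and "H" marks a home win.
add_home_win <- function(matches) {
  matches %>%
    select(-any_of("...1")) %>%
    mutate(home_win = as.integer(FTR == "H"))
}

# Made-up rows for demonstration only (not from the dataset)
toy_matches <- data.frame(index = 0:2, FTR = c("H", "D", "A"))
names(toy_matches)[1] <- "...1"  # mimic the column that read_csv renamed

toy_clean <- add_home_win(toy_matches)
count(toy_clean, home_win)
```

Applied to the imported data, this would read matches_clean <- add_home_win(matches). Any value of FTR other than "H" is mapped to 0, so draws and away wins fall into the same class.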
ggplot(matches_clean, aes(x = factor(home_win))) +
  geom_bar() +
  labs(
    title = "Distribution of Home Team Wins",
    x = "Home Win",
    y = "Count"
  )
The home_win variable records whether the home team won each match. This gives a binary outcome suitable for classification models such as logistic regression and decision trees.
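To make the modeling step concrete, here is a minimal logistic regression sketch in base R. It runs on simulated data, not on the EPL dataset, so its output says nothing about real matches. The column names HTP and ATP appear in the import message above, but their value ranges here and their use as predictors are assumptions made for illustration.

```r
# Simulated stand-in data (not the EPL dataset) so the sketch runs on its own
set.seed(42)
n <- 300
sim <- data.frame(
  HTP = runif(n, 0, 3),  # assumed range, for illustration only
  ATP = runif(n, 0, 3)
)
sim$home_win <- rbinom(n, 1, plogis(0.2 + 0.8 * sim$HTP - 0.8 * sim$ATP))

# Logistic regression: log-odds of a home win as a linear function of predictors
fit <- glm(home_win ~ HTP + ATP, data = sim, family = binomial)

# Predicted probabilities, then class labels using a 0.5 cut-off
sim$prob <- predict(fit, type = "response")
sim$pred <- as.integer(sim$prob > 0.5)

# In-sample confusion table and accuracy
table(actual = sim$home_win, predicted = sim$pred)
mean(sim$pred == sim$home_win)
```

In-sample accuracy tends to be optimistic, so on the real data a train/test split or cross-validation would give a fairer estimate of predictive performance.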