This project analyzes football player data using data science techniques to answer economic questions related to player market values.
Dataset Description
This dataset contains football player statistics and market values. The data includes player performance metrics, disciplinary records, expected goals, assists, tackles, and other football-related statistics. The dataset is used to analyze the economic factors that influence football players’ market values.
Economic Question
Which player characteristics and performance statistics predict football players’ market values?
Data Import and Cleaning
The dataset was imported with read_csv() from the readr package, which is loaded as part of the tidyverse. The market value variable (Bonservis) was read as a character column because its values use dots as thousands separators. These separators were removed, and the variable was converted to numeric format for analysis.
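As a hedged alternative to stripping the dots by hand, readr can parse dot-grouped numbers directly through a locale. The sketch below uses a few illustrative strings (not rows from the dataset) and the hypothetical names raw_values and parsed. It assumes every value in the column follows the same dot-separated format.

```r
library(readr)

# Illustrative strings only (not taken from the dataset), written in the
# same dot-separated style as the Bonservis column.
raw_values <- c("40.000", "5.000.000", "300.000.000")

# Treat "." as the grouping mark. The decimal mark is set to "," because
# readr requires the two marks to differ.
parsed <- parse_number(
  raw_values,
  locale = locale(grouping_mark = ".", decimal_mark = ",")
)

# Quick check that no value failed to parse.
stopifnot(!anyNA(parsed))
parsed
```

Checking for NA values after the conversion is worthwhile with either approach, since an unexpected format (for example a currency symbol or a text label) would otherwise pass silently into the analysis as a missing value.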
Summary Statistics
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
football <- read_csv("dataset.csv")
Rows: 4834 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Oyuncu, Uyruk, Mevki, Sezon, Lig, Kategori, Bonservis
dbl (31): Yaş, MP, DK, GLS, AST, ASR, TOS, SOT, BCM, KEYP, BCC, SDR, APS, AP...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(football)
# A tibble: 6 × 38
Oyuncu Yaş Uyruk Mevki Sezon Lig Kategori MP DK GLS AST ASR
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron Cr… 35 ENG D 24/25 Prem… Domesti… 18 847 0 0 6.91
2 Aaron Cr… 34 ENG D 23/24 Prem… Domesti… 11 453 0 0 6.58
3 Aaron Cr… 33 ENG D 22/23 Prem… Domesti… 28 2241 0 1 6.88
4 Aaron Cr… 32 ENG D 21/22 Prem… Domesti… 31 2728 2 3 7.01
5 Aaron Cr… 31 ENG D 20/21 Prem… Domesti… 36 3172 0 8 7.1
6 Aaron Cr… 30 ENG D 19/20 Prem… Domesti… 31 2730 3 0 6.71
# ℹ 26 more variables: TOS <dbl>, SOT <dbl>, BCM <dbl>, KEYP <dbl>, BCC <dbl>,
# SDR <dbl>, APS <dbl>, `APS%` <dbl>, ALB <dbl>, `LBA%` <dbl>, ACR <dbl>,
# `CA%` <dbl>, CLS <dbl>, YC <dbl>, RC <dbl>, ELTG <dbl>, DRP <dbl>,
# TACK <dbl>, INT <dbl>, BLS <dbl>, ADW <dbl>, xG <dbl>, xA <dbl>, GI <dbl>,
# XGI <dbl>, Bonservis <chr>
football_clean <- football %>%
  mutate(
    Bonservis = str_replace_all(Bonservis, "\\.", ""),
    Bonservis = as.numeric(Bonservis)
  )
summary(football_clean$Bonservis)
Min. 1st Qu. Median Mean 3rd Qu. Max.
40000 5000000 15000000 21841423 30000000 300000000
Probability Distribution Analysis
ggplot(football_clean, aes(x = Bonservis)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Football Players' Market Values",
    x = "Market Value",
    y = "Frequency"
  )
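The raw market values are strongly right-skewed: the mean (about 21.8 million) is well above the median (15 million), and the maximum reaches 300 million. After a logarithmic transformation, the distribution becomes more symmetric and closer to a normal distribution, which suggests that a log-normal distribution approximates football player market values better than a normal distribution does.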
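The code for the transformed histogram is not shown in the report, so the following is only a sketch of how it might be produced. To keep it runnable on its own, it uses simulated stand-in data (the hypothetical toy data frame) rather than football_clean. The natural logarithm is an assumption, since the report does not state which base was used.

```r
library(ggplot2)

# Simulated stand-in data so the sketch runs without dataset.csv.
# In the report, football_clean and its numeric Bonservis column
# would take the place of `toy`.
set.seed(1)
toy <- data.frame(Bonservis = rlnorm(500, meanlog = 16, sdlog = 1))

# Histogram of the log-transformed values (natural log is an assumption).
log_hist <- ggplot(toy, aes(x = log(Bonservis))) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Log Market Values",
    x = "log(Market Value)",
    y = "Frequency"
  )

log_hist
```

Comparing this plot with the untransformed histogram is an informal check only. A normal Q-Q plot of the logged values would give a sharper view of how closely they follow a normal distribution.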
Classification Dataset
For the classification analysis, a binary variable was created based on football players’ market values. Players with market values above the median were classified as high-value players, while players below the median were classified as low-value players.
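The step that builds the binary variable is not shown in the report. A minimal sketch of one way to do it is given below, using a small toy data frame with illustrative values in place of football_clean. The names football_classification and high_value match those used in the plotting code. Coding players exactly at the median as low-value is an assumption, because the report does not say how ties are handled.

```r
library(dplyr)

# Toy values for illustration only. In the report the input would be
# football_clean with its numeric Bonservis column.
toy <- data.frame(
  Bonservis = c(40000, 5000000, 15000000, 30000000, 300000000)
)

football_classification <- toy %>%
  mutate(
    # 1 = strictly above the median, 0 = at or below it.
    # Treating ties as low-value is an assumption of this sketch.
    high_value = if_else(Bonservis > median(Bonservis, na.rm = TRUE), 1L, 0L)
  )

# Class sizes, to check how balanced the split is.
count(football_classification, high_value)
```

Because many players can share the same round market value, a noticeable number of observations may sit exactly at the median. The class counts are therefore worth checking rather than assuming an even split.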
Second Economic Question
Can player performance statistics classify footballers as high-value or low-value players?
ggplot(football_classification, aes(x = factor(high_value))) +
  geom_bar() +
  labs(
    title = "Distribution of High-Value and Low-Value Players",
    x = "Player Category",
    y = "Count"
  )
The classification dataset divides players into high-value and low-value categories at the median market value. The bar chart shows that the two classes are of similar size, as expected from a median split, so class imbalance should not be a major concern for classification modeling.