1 Football Data from Transfermarkt – Structured Soccer Data


Dataset Source: Kaggle
https://www.kaggle.com/datasets/davidcariboo/player-scores

Every analytical project should originate from a business need or from the desire to generate actionable knowledge from data. Such insights are obtained by applying systematic best practices from Data Mining and Analytics.

1.1 Dataset Overview

Brief Description

Structured, clean, and weekly-updated data from Transfermarkt, including:

  • Over 60,000 matches from multiple seasons and major competitions.
  • More than 400 clubs across those competitions.
  • Over 30,000 players from participating clubs.
  • More than 400,000 historical records of player market valuations.
  • Over 1,200,000 match appearance records for players.

What does it include?

Several CSV files with information about competitions, matches, clubs, players, and appearances.
Each file contains entity attributes and identifiers (IDs) that allow relationships between datasets.

Example:
The appearances file includes one row per player for each match played, with data such as goals, assists, and yellow cards, in addition to IDs referencing other entities (player, match, etc.).
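As a minimal sketch of how these IDs link the files (the miniature data frames below are hypothetical stand-ins for the real CSVs, using only the player_id key and the country_of_citizenship column mentioned above):

library(dplyr)

# Tiny stand-ins for appearances.csv and players.csv, sharing the player_id key.
appearances_demo <- data.frame(player_id = c(1, 2), goals = c(1, 0))
players_demo <- data.frame(player_id = c(1, 2),
                           country_of_citizenship = c("Spain", "France"))

# One row per appearance, enriched with the player's nationality.
left_join(appearances_demo, players_demo, by = "player_id")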

1.2 The Three Pillars of Analytics

1.2.1 Business Understanding

This pillar focuses on gaining a deep understanding of the professional football industry and its key dynamics. The selected dataset addresses:

  • Player Market: How do player market valuations fluctuate based on age, position, or performance?
  • Team Performance: Which teams consistently excel in major competitions?
  • Transfer Trends: What are the historical patterns in player transfers?

The insights generated can help clubs, agents, and analysts identify opportunities to maximize both sporting and economic success.

1.2.2 Analytical Capabilities

The dataset enables the application of advanced techniques to explore and solve analytical problems, such as:

  • Predictive Models: Which players have the greatest potential to stand out or increase their market value?
  • Correlation Analysis: Which metrics (goals, assists, age, position) are most related to individual or team success?
  • Segmentation: Classify players, teams, or competitions to uncover performance patterns or detect emerging talent markets.

The organized structure of the dataset (IDs to link tables) makes it ideal for applying data mining, advanced statistics, and machine learning techniques, supporting practical use cases for analysts and researchers.

1.2.3 Data

The Transfermarkt dataset is a perfect example of how business questions can be connected to available data. It provides:

  • Detailed and up-to-date data: Over 1.2 million player appearances, 30,000 players, 400 clubs, and more.
  • Analysis-ready structure: Relationships between competitions, matches, players, and key metrics.
  • Flexibility and constant updates: Suitable for dynamic and evolving analytics projects.
  • Data preparation solved: The challenge of identifying, collecting, and cleaning data is largely addressed, allowing analysts to focus on generating actionable insights.

1.3 Analytical Study

1.3.1 Impact of Player Characteristics on Match Outcomes and Market Valuation

  • Problem Context:
    Football clubs aim to maximize both sporting performance and the market value of their players. However, identifying which characteristics have a direct impact on match outcomes and increases in market value remains a challenge due to the complexity of relationships within the data.

  • Analytical Objectives:

  • Identify key factors: Determine which player characteristics (position, nationality, club, average market value, etc.) and club attributes (name, match performance) most affect match outcomes.

  • Evaluate market impact: Analyze how match performance (minutes played, goals, assists) translates into increases or decreases in player market valuation.

  • Explore performance patterns: Compare how frequent positions and team results influence both individual and collective statistics.

1.3.2 Methodology to Address the Problem

Define relevant variables

  • Target variables:
    • Match result (home and away goals).
    • Average player market value (in EUR).
  • Predictor variables:
    • Most frequent position.
    • Player nationality.
    • Current club.
    • Individual performance: minutes played, goals, assists.
    • Club performance: goals scored and conceded.

Data Preparation
- Verify column quality across merged tables.
- Normalize continuous values (e.g., market value, minutes played).
- Encode categorical variables (position, nationality, club name), as sketched below.
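A minimal sketch of these preparation steps, using a hypothetical data frame df standing in for the merged dataset:

# Hypothetical sample with the column names used in this section.
df <- data.frame(market_value_in_eur = c(5e5, 2e6, 8e7),
                 minutes_played = c(90, 45, 30),
                 position = c("Centre-Forward", "Centre-Back", "Goalkeeper"))

# Normalize continuous values (z-scores).
df$market_value_scaled <- as.numeric(scale(df$market_value_in_eur))
df$minutes_scaled <- as.numeric(scale(df$minutes_played))

# Encode categorical variables as factors (one-hot dummies can be derived
# with model.matrix(~ position, df) if a model requires them).
df$position <- factor(df$position)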

Exploratory Analysis
- Analyze initial correlations between player and club characteristics and market value.
- Identify whether significant differences exist in market value across positions or clubs.

Analytical Models
- Multiple Regression: Predict average player market value based on individual and club performance features.
- Classification: Apply models such as Random Forest or XGBoost to classify match results (win/draw/loss) based on team and player attributes.
- Clustering: Segment players by performance and market trends (e.g., undervalued or overvalued players), as sketched below.
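A hedged sketch of these three model families on toy data (the randomForest package is an assumption, and all values below are illustrative stand-ins, not dataset results):

library(randomForest)  # assumed installed; any classifier could play this role

# Toy data standing in for the prepared dataset.
df <- data.frame(goals = c(0, 1, 2, 0, 3, 1),
                 assists = c(1, 0, 1, 0, 2, 1),
                 minutes_played = c(90, 45, 90, 30, 90, 60),
                 market_value_in_eur = c(1e6, 5e5, 8e6, 2e5, 2e7, 1e6),
                 result = c("win", "loss", "win", "draw", "win", "draw"))

# Multiple regression for market value.
fit_lm <- lm(market_value_in_eur ~ goals + assists + minutes_played, data = df)

# Random Forest classification of the match result.
fit_rf <- randomForest(factor(result) ~ goals + assists + minutes_played, data = df)

# K-means clustering of players on scaled performance features.
clusters <- kmeans(scale(df[, c("goals", "assists", "minutes_played")]), centers = 2)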

Evaluation
- Validate regression models with metrics such as R² and RMSE (see the sketch after this list).
- Validate classification models with accuracy and F1-Score.
- Interpret the most influential factors to provide actionable insights.
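The sketch referenced above, with toy vectors (made up for illustration, not dataset results):

# Regression: RMSE and R^2 on hypothetical observed vs. predicted values.
y <- c(3, 5, 2, 8)
y_hat <- c(2.5, 5.5, 2, 7)
rmse <- sqrt(mean((y - y_hat)^2))
r2 <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)

# Classification: accuracy and F1 for the "win" class on hypothetical labels.
obs <- c("win", "draw", "win", "loss")
pred <- c("win", "win", "win", "loss")
accuracy <- mean(obs == pred)
tp <- sum(pred == "win" & obs == "win")
precision <- tp / sum(pred == "win")
recall <- tp / sum(obs == "win")
f1 <- 2 * precision * recall / (precision + recall)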

Implementation of Insights
Develop a dashboard or report highlighting:
- Players with the greatest potential for market revaluation.
- Key factors influencing match results and market valuations.
- Positions or nationalities that have significant impact on collective outcomes.

1.3.3 Selecting a Dataset and Justifying the Choice

  • Selected Dataset
    A filtered and enriched version of the Kaggle dataset referenced above. This consolidated dataset integrates information about players, their individual performance, clubs, and match results.

Dataset Structure and Content

  • Player Information:
    • player_id: Unique identifier.
    • position: Most frequent position.
    • country_of_citizenship: Player nationality.
    • market_value_in_eur: Average player market value.
  • Club Information:
    • name: Club name.
    • player_club_id: Club ID.
    • Performance metrics:
      • minutes_played
      • goals
      • assists
      • yellow_cards
  • Match Results:
    • home_club_goals, away_club_goals (team goals).
    • Identifiers: game_id, home_club_id, away_club_id.

Justification of Dataset Selection

  • Suitability for Supervised Learning:

    • Target variable for regression: market_value_in_eur, enabling prediction of player market value.

    • Target variable for classification: result (win/draw/loss), derived from home/away goals (see the sketch after this list).

    • Predictors:

      • Individual factors (position, nationality, minutes played, goals, assists, cards).
      • Team and match factors (club, goals scored by home and away teams).

    This structure enables the use of algorithms such as linear regression, Random Forest, or XGBoost for predictive tasks.

  • Suitability for Unsupervised Learning:

    • Clustering: Variables such as position, minutes played, goals, and market value allow meaningful grouping of players (e.g., high-performance but undervalued players).
    • Association Analysis: Linking individual metrics and team outcomes to uncover hidden patterns, such as combinations of features correlated with team success. Algorithms like K-Means or DBSCAN are appropriate here.
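As referenced above, a minimal sketch of deriving the win/draw/loss classification target from the goal columns (the rows below are hypothetical stand-ins for the consolidated dataset):

library(dplyr)

# Hypothetical match rows with the goal columns described above.
matches <- data.frame(home_club_goals = c(2, 1, 0),
                      away_club_goals = c(2, 0, 3))

# Result from the home team's perspective.
matches <- matches %>%
  mutate(result = case_when(
    home_club_goals > away_club_goals ~ "win",
    home_club_goals == away_club_goals ~ "draw",
    TRUE ~ "loss"
  ))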

Alignment with Analytical Problem
This dataset includes the necessary variables to address the stated objectives:
- Identifying key factors: relationships between player characteristics and match results.
- Evaluating market impact: analyzing how individual performance affects average market value.
- Exploring performance patterns: comparing positions and characteristics that influence collective performance.

Consolidated Structure and Data Cleaning
The integration of players, clubs, and matches has produced a consolidated dataset aligned with analytical requirements, reducing redundancy and ensuring high-quality variables.

Minimum Requirement

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(purrr)

appearances <- read.csv("C:/Users/Manuel/Desktop/PR1/appearances.csv")
players <- read.csv("C:/Users/Manuel/Desktop/PR1/players.csv")
player_valuations <- read.csv("C:/Users/Manuel/Desktop/PR1/player_valuations.csv")
clubs <- read.csv("C:/Users/Manuel/Desktop/PR1/clubs.csv")
games <- read.csv("C:/Users/Manuel/Desktop/PR1/games.csv")

game_lineups <- read.csv("C:/Users/Manuel/Desktop/PR1/game_lineups.csv")
# Attach each player's nationality to their appearances.
appearances_with_country <- appearances %>%
  left_join(players %>%
              select(player_id, country_of_citizenship), by = "player_id")

# Average each player's historical market valuations into a single figure.
player_valuations_avg <- player_valuations %>%
  group_by(player_id) %>%
  summarise(market_value_in_eur = mean(market_value_in_eur, na.rm = TRUE))
appearances_with_full_info <- appearances_with_country %>%
  left_join(player_valuations_avg, by = "player_id")

# Determine each player's most frequent position from the lineups.
game_lineups_position <- game_lineups %>%
  group_by(player_id, position) %>%
  tally() %>%
  arrange(player_id, desc(n)) %>%
  group_by(player_id) %>%
  slice(1) %>%
  ungroup()
appearances_with_position <- appearances_with_full_info %>%
  left_join(game_lineups_position %>%
              select(player_id, position), by = "player_id")

# Attach the club name via the club the player appeared for.
club_id_name_table <- clubs %>%
  select(club_id, name) %>%
  distinct()
appearances_with_club_name <- appearances_with_position %>%
  left_join(clubs %>%
              select(club_id, name), by = c("player_club_id" = "club_id"))

# Attach the match context (teams involved and final score).
appearances_with_games <- appearances_with_club_name %>%
  left_join(games %>%
              select(game_id, home_club_id, away_club_id, home_club_goals, away_club_goals), 
            by = "game_id")

df_football <- appearances_with_games

The dataset must contain at least 500 observations with a minimum of 5 numerical variables, 2 categorical variables, and 1 binary variable.

Number of observations = 1,643,442

num_observaciones <- nrow(df_football)
cat("Number of observations:", num_observaciones, "\n")
## Number of observations: 1643442

Number of numerical variables = 14

num_vars <- sum(sapply(df_football, is.numeric))
cat("Number of numerical variables:", num_vars, "\n")
## Number of numerical variables: 14

Number of categorical variables = 7

cat_vars <- sum(sapply(df_football, function(x) is.factor(x) || is.character(x)))
cat("Number of categorical variables:", cat_vars, "\n")
## Number of categorical variables: 7

Number of binary variables = 1

binary_vars <- sapply(df_football[, sapply(df_football, is.numeric)], function(x) length(unique(x)) == 2)
cat("Binary variables:", names(binary_vars)[binary_vars], "\n")
## Binary variables: red_cards

Global Summary of the Dataset

str(df_football)
## 'data.frame':    1643442 obs. of  21 variables:
##  $ appearance_id         : chr  "2231978_38004" "2233748_79232" "2234413_42792" "2234418_73333" ...
##  $ game_id               : int  2231978 2233748 2234413 2234418 2234421 2234421 2235539 2235539 2235545 2235545 ...
##  $ player_id             : int  38004 79232 42792 73333 122011 146889 28716 69445 19409 30003 ...
##  $ player_club_id        : int  853 8841 6251 1274 195 195 282 282 317 317 ...
##  $ player_current_club_id: int  235 2698 465 6646 3008 2778 7185 19771 200 317 ...
##  $ date                  : chr  "2012-07-03" "2012-07-05" "2012-07-05" "2012-07-05" ...
##  $ player_name           : chr  "Aurélien Joachim" "Ruslan Abyshov" "Sander Puri" "Vegar Hedenstad" ...
##  $ competition_id        : chr  "CLQ" "ELQ" "ELQ" "ELQ" ...
##  $ yellow_cards          : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ red_cards             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ goals                 : int  2 0 0 0 0 0 0 0 0 0 ...
##  $ assists               : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ minutes_played        : int  90 90 45 90 90 90 90 90 45 90 ...
##  $ country_of_citizenship: chr  "Luxembourg" "Azerbaijan" "Estonia" "Norway" ...
##  $ market_value_in_eur   : num  346250 246875 200000 840833 2792593 ...
##  $ position              : chr  "Centre-Forward" "Centre-Back" "Left Midfield" "Right-Back" ...
##  $ name                  : chr  NA NA NA NA ...
##  $ home_club_id          : int  853 8841 6251 3779 21532 21532 282 282 317 317 ...
##  $ away_club_id          : int  10747 22783 11915 1274 195 195 10604 10604 28633 28633 ...
##  $ home_club_goals       : int  7 2 2 2 0 0 5 5 6 6 ...
##  $ away_club_goals       : int  0 2 1 0 3 3 2 2 0 0 ...

1.4 Data Cleaning

  • Load the required libraries
if (!require('cluster')) install.packages('cluster')
## Loading required package: cluster
library(cluster)
if (!require('Stat2Data')) install.packages('Stat2Data')
## Loading required package: Stat2Data
library(Stat2Data)
if (!require('dplyr')) install.packages('dplyr')
library(dplyr)
if (!require('ggplot2')) install.packages("ggplot2")
## Loading required package: ggplot2
library(ggplot2)
if (!require('factoextra')) install.packages("factoextra")
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(factoextra)
if (!require('NbClust')) install.packages("NbClust")
## Loading required package: NbClust
library(NbClust)
if (!require('dbscan')) install.packages('dbscan')
## Loading required package: dbscan
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram
library(dbscan)
if (!require('tidyr')) install.packages('tidyr')
## Loading required package: tidyr
library(tidyr)
if (!require('corrplot')) install.packages('corrplot')
## Loading required package: corrplot
## corrplot 0.95 loaded
library(corrplot)
if (!require('psych')) install.packages('psych')
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
  • Check for missing values
missing_values <- colSums(is.na(df_football) | df_football == "")
print(missing_values) 
##          appearance_id                game_id              player_id 
##                      0                      0                      0 
##         player_club_id player_current_club_id                   date 
##                      0                      0                      0 
##            player_name         competition_id           yellow_cards 
##                      6                      0                      0 
##              red_cards                  goals                assists 
##                      0                      0                      0 
##         minutes_played country_of_citizenship    market_value_in_eur 
##                      0                  22894                    669 
##               position                   name           home_club_id 
##                  10533                   7812                      0 
##           away_club_id        home_club_goals        away_club_goals 
##                      0                      0                      0
  • Before imputing, we check whether the proportion of missing values is significant. As we can see, it represents less than 3% of the dataframe.
missing_values <- colSums(is.na(df_football) | df_football == "")
missing_percentage <- (missing_values / nrow(df_football)) * 100
print(missing_percentage)
##          appearance_id                game_id              player_id 
##           0.0000000000           0.0000000000           0.0000000000 
##         player_club_id player_current_club_id                   date 
##           0.0000000000           0.0000000000           0.0000000000 
##            player_name         competition_id           yellow_cards 
##           0.0003650874           0.0000000000           0.0000000000 
##              red_cards                  goals                assists 
##           0.0000000000           0.0000000000           0.0000000000 
##         minutes_played country_of_citizenship    market_value_in_eur 
##           0.0000000000           1.3930518996           0.0407072474 
##               position                   name           home_club_id 
##           0.6409109661           0.4753438211           0.0000000000 
##           away_club_id        home_club_goals        away_club_goals 
##           0.0000000000           0.0000000000           0.0000000000
  • Given the large size of the dataframe, we drop the incomplete rows rather than imputing them, accepting the resulting ~2.5% reduction in observations.
df_football_clean <- df_football[complete.cases(df_football) & !apply(df_football == "", 1, any), ]
missing_values <- colSums(is.na(df_football_clean) | df_football_clean == "")
missing_percentage <- (missing_values / nrow(df_football_clean)) * 100
print(missing_percentage)
##          appearance_id                game_id              player_id 
##                      0                      0                      0 
##         player_club_id player_current_club_id                   date 
##                      0                      0                      0 
##            player_name         competition_id           yellow_cards 
##                      0                      0                      0 
##              red_cards                  goals                assists 
##                      0                      0                      0 
##         minutes_played country_of_citizenship    market_value_in_eur 
##                      0                      0                      0 
##               position                   name           home_club_id 
##                      0                      0                      0 
##           away_club_id        home_club_goals        away_club_goals 
##                      0                      0                      0
initial_size <- nrow(df_football)
final_size <- nrow(df_football_clean)
percentage_change <- ((initial_size - final_size) / initial_size) * 100
print(paste("Percentage change in dataset size:", round(percentage_change, 2), "%"))
## [1] "Percentage change in dataset size: 2.52 %"
  • Generate a summary

We begin analyzing our dataset by reviewing the columns we may need to change, modify, scale, or remove.
At this stage, it is not entirely clear which columns will be required in the future, so we will aim to retain as many as possible.

summary(df_football_clean)
##  appearance_id         game_id          player_id       player_club_id  
##  Length:1601984     Min.   :2211607   Min.   :     10   Min.   :     3  
##  Class :character   1st Qu.:2581781   1st Qu.:  57370   1st Qu.:   289  
##  Mode  :character   Median :3069457   Median : 140804   Median :   826  
##                     Mean   :3119260   Mean   : 199695   Mean   :  3093  
##                     3rd Qu.:3602582   3rd Qu.: 290250   3rd Qu.:  2441  
##                     Max.   :4481846   Max.   :1240467   Max.   :110302  
##  player_current_club_id     date           player_name       
##  Min.   :     3         Length:1601984     Length:1601984    
##  1st Qu.:   336         Class :character   Class :character  
##  Median :   931         Mode  :character   Mode  :character  
##  Mean   :  3930                                              
##  3rd Qu.:  2696                                              
##  Max.   :110302                                              
##  competition_id      yellow_cards      red_cards            goals        
##  Length:1601984     Min.   :0.0000   Min.   :0.000000   Min.   :0.00000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Mode  :character   Median :0.0000   Median :0.000000   Median :0.00000  
##                     Mean   :0.1479   Mean   :0.003778   Mean   :0.09585  
##                     3rd Qu.:0.0000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##                     Max.   :2.0000   Max.   :1.000000   Max.   :6.00000  
##     assists       minutes_played   country_of_citizenship market_value_in_eur
##  Min.   :0.0000   Min.   :  1.00   Length:1601984         Min.   :    10000  
##  1st Qu.:0.0000   1st Qu.: 45.00   Class :character       1st Qu.:   674038  
##  Median :0.0000   Median : 90.00   Mode  :character       Median :  1876190  
##  Mean   :0.0757   Mean   : 69.27                          Mean   :  5206425  
##  3rd Qu.:0.0000   3rd Qu.: 90.00                          3rd Qu.:  5652381  
##  Max.   :6.0000   Max.   :135.00                          Max.   :122761538  
##    position             name            home_club_id     away_club_id   
##  Length:1601984     Length:1601984     Min.   :     1   Min.   :     2  
##  Class :character   Class :character   1st Qu.:   294   1st Qu.:   294  
##  Mode  :character   Mode  :character   Median :   865   Median :   862  
##                                        Mean   :  3321   Mean   :  3169  
##                                        3rd Qu.:  2503   3rd Qu.:  2457  
##                                        Max.   :121966   Max.   :110302  
##  home_club_goals  away_club_goals 
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 1.000   1st Qu.: 0.000  
##  Median : 1.000   Median : 1.000  
##  Mean   : 1.561   Mean   : 1.248  
##  3rd Qu.: 2.000   3rd Qu.: 2.000  
##  Max.   :17.000   Max.   :16.000
  • Convert the date column to a common standard, from yyyy-mm-dd to yyyymmdd.
df_football_clean$date <- as.Date(df_football_clean$date, format = "%Y-%m-%d")
df_football_clean$date <- format(df_football_clean$date, "%Y%m%d")
dffootballclean <- df_football_clean  # shorter alias used from here on
  • Outlier Detection

Among all the data, the only value that stands out is the highest player market value, which appears as a potential outlier.
This observation likely corresponds to exceptional players in the dataset and should be carefully assessed to determine whether it reflects a valid extreme case or a data anomaly.

boxplot(dffootballclean$yellow_cards, main = "Boxplot of Yellow Cards", ylab = "Yellow Cards", col = "lightblue")

boxplot(dffootballclean$red_cards, main = "Boxplot of Red Cards", ylab = "Red Cards", col = "lightcoral")

boxplot(dffootballclean$goals, main = "Boxplot of Goals", ylab = "Goals", col = "lightgreen")

boxplot(dffootballclean$assists, main = "Boxplot of Assists", ylab = "Assists", col = "lightyellow")

boxplot(dffootballclean$minutes_played, main = "Boxplot of Minutes Played", ylab = "Minutes Played", col = "lightpink")

boxplot(dffootballclean$market_value_in_eur, main = "Boxplot of Market Value", ylab = "Market Value (EUR)", col = "lightseagreen")

We can confirm that this value is not an anomaly, but rather makes perfect sense: Kylian Mbappé is the most expensive player in the market, and the subsequent values are equally reasonable.

dffootballclean_unique <- dffootballclean[!duplicated(dffootballclean$player_name), ]
top_10_market_value_unique <- dffootballclean_unique[order(dffootballclean_unique$market_value_in_eur, decreasing = TRUE), ]
top_10_market_value_unique <- top_10_market_value_unique[1:10, ]
top_10_market_value_unique[, c("player_name", "market_value_in_eur")]
##               player_name market_value_in_eur
## 446992      Kylian Mbappé           122761538
## 985681     Erling Haaland            90934783
## 8136         Lionel Messi            88953488
## 1040433   Jude Bellingham            83382353
## 6865           Harry Kane            82740741
## 809835    Vinicius Junior            81750000
## 1429337      Lamine Yamal            81428571
## 139894             Neymar            76350000
## 8239    Cristiano Ronaldo            73806667
## 1017260     Jamal Musiala            71642857

1.5 Perform an Exploratory Analysis of the Selected Dataset

We conduct a descriptive analysis of the variables. As the graphs show, the distributions are heavily right-skewed rather than normal: the most frequent values are consistently low, and the apparent outliers are simply the players with the highest market value.

summary(dffootballclean[c("yellow_cards", "red_cards", "goals", "assists", "minutes_played", "market_value_in_eur")])
##   yellow_cards      red_cards            goals            assists      
##  Min.   :0.0000   Min.   :0.000000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.000000   Median :0.00000   Median :0.0000  
##  Mean   :0.1479   Mean   :0.003778   Mean   :0.09585   Mean   :0.0757  
##  3rd Qu.:0.0000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :2.0000   Max.   :1.000000   Max.   :6.00000   Max.   :6.0000  
##  minutes_played   market_value_in_eur
##  Min.   :  1.00   Min.   :    10000  
##  1st Qu.: 45.00   1st Qu.:   674038  
##  Median : 90.00   Median :  1876190  
##  Mean   : 69.27   Mean   :  5206425  
##  3rd Qu.: 90.00   3rd Qu.:  5652381  
##  Max.   :135.00   Max.   :122761538
hist(dffootballclean$yellow_cards, main = "Distribution of Yellow Cards", xlab = "Yellow Cards")

hist(dffootballclean$red_cards, main = "Distribution of Red Cards", xlab = "Red Cards")

hist(dffootballclean$goals, main = "Distribution of Goals", xlab = "Goals")

hist(dffootballclean$assists, main = "Distribution of Assists", xlab = "Assists")

hist(dffootballclean$minutes_played, main = "Distribution of Minutes Played", xlab = "Minutes Played")

hist(dffootballclean$market_value_in_eur, main = "Distribution of Market Value", xlab = "Market Value (EUR)")

  • Correlations

We perform a separate analysis focusing exclusively on correlations.

cor(dffootballclean[c("yellow_cards", "red_cards", "goals", "assists", "minutes_played", "market_value_in_eur")])
##                     yellow_cards    red_cards        goals      assists
## yellow_cards         1.000000000 -0.012375080  0.001195805 -0.002233251
## red_cards           -0.012375080  1.000000000 -0.009106582 -0.007586251
## goals                0.001195805 -0.009106582  1.000000000  0.074499848
## assists             -0.002233251 -0.007586251  0.074499848  1.000000000
## minutes_played       0.108352806 -0.034725874  0.079132755  0.077553148
## market_value_in_eur -0.015116322 -0.005572771  0.117005666  0.082148239
##                     minutes_played market_value_in_eur
## yellow_cards            0.10835281        -0.015116322
## red_cards              -0.03472587        -0.005572771
## goals                   0.07913276         0.117005666
## assists                 0.07755315         0.082148239
## minutes_played          1.00000000         0.043857842
## market_value_in_eur     0.04385784         1.000000000
plot(dffootballclean$goals, dffootballclean$assists, main = "Goals vs Assists", xlab = "Goals", ylab = "Assists")

plot(dffootballclean$goals, dffootballclean$minutes_played, main = "Goals vs Minutes Played", xlab = "Goals", ylab = "Minutes Played")

plot(dffootballclean$market_value_in_eur, dffootballclean$goals, main = "Market Value vs Goals", xlab = "Market Value (EUR)", ylab = "Goals")

  • yellow_cards and goals: There is no significant relationship between the number of yellow cards and goals scored.

  • red_cards and goals: The number of red cards received does not appear to significantly affect a player’s ability to score goals.

  • assists and goals: Although weak, there is a slight positive relationship between goals and assists, suggesting that players who score goals also tend to provide assists.

  • minutes_played and goals: Players who play more minutes tend to score more goals, although the relationship is not very strong.

  • market_value_in_eur and goals: Market value has a weak relationship with goals, indicating that the most expensive players do not necessarily score more goals.

  • market_value_in_eur and minutes_played: There does not appear to be a significant relationship between a player’s market value and the number of minutes they play.

  • market_value_in_eur and assists: Market value shows a slight positive relationship with assists, but not strong enough to be considered a decisive factor.

  • yellow_cards and minutes_played: Players who play more minutes have a slightly higher probability of receiving yellow cards.

  • red_cards and minutes_played: The number of minutes played does not seem to influence the number of red cards a player receives.

  • yellow_cards and red_cards: The linear correlation between yellow and red cards is negligible; at the level of a single appearance, receiving one type of card does not by itself predict receiving the other.

  • Next, we analyze the average number of goals per league and per team.

dffootballclean %>%
  group_by(competition_id) %>%
  summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
  arrange(desc(promedio_goles))
## # A tibble: 43 × 2
##    competition_id promedio_goles
##    <chr>                   <dbl>
##  1 DKP                     0.166
##  2 KLUB                    0.166
##  3 DFB                     0.159
##  4 NLP                     0.157
##  5 NLSC                    0.148
##  6 USC                     0.132
##  7 SFA                     0.129
##  8 DFL                     0.127
##  9 CDR                     0.127
## 10 CGB                     0.125
## # ℹ 33 more rows
dffootballclean %>%
  group_by(player_current_club_id) %>%
  summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
  arrange(desc(promedio_goles))
## # A tibble: 436 × 2
##    player_current_club_id promedio_goles
##                     <int>          <dbl>
##  1                  71985          0.197
##  2                    610          0.192
##  3                     27          0.180
##  4                     31          0.167
##  5                    583          0.166
##  6                    418          0.166
##  7                    141          0.165
##  8                    985          0.161
##  9                     13          0.154
## 10                    131          0.151
## # ℹ 426 more rows
  • Average goals per league.
promedio_goles_competencia <- dffootballclean %>%
  group_by(competition_id) %>%
  summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
  arrange(desc(promedio_goles))

ggplot(promedio_goles_competencia, aes(x = reorder(competition_id, -promedio_goles), y = promedio_goles, fill = competition_id)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Goals per Competition", x = "Competition", y = "Average Goals") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis_d()

Since there are too many competitions to display clearly, we keep only the top 10.

promedio_goles_competencia_top10 <- dffootballclean %>%
  group_by(competition_id) %>%
  summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
  arrange(desc(promedio_goles)) %>%
  head(10)  
promedio_goles_competencia_top10
## # A tibble: 10 × 2
##    competition_id promedio_goles
##    <chr>                   <dbl>
##  1 DKP                     0.166
##  2 KLUB                    0.166
##  3 DFB                     0.159
##  4 NLP                     0.157
##  5 NLSC                    0.148
##  6 USC                     0.132
##  7 SFA                     0.129
##  8 DFL                     0.127
##  9 CDR                     0.127
## 10 CGB                     0.125

We perform the same calculation, but this time at the team level, and increase the top selection to 25.

promedio_goles_equipo <- dffootballclean %>%
  group_by(player_current_club_id) %>%
  summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
  arrange(desc(promedio_goles)) %>%
  head(25)  

promedio_goles_equipo$player_current_club_id <- as.factor(promedio_goles_equipo$player_current_club_id)

ggplot(promedio_goles_equipo, aes(x = reorder(player_current_club_id, -promedio_goles), y = promedio_goles, fill = player_current_club_id)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Goals per Team (Top 25)", x = "Team", y = "Average Goals") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis_d()

We repeat the aggregation using club names instead of IDs.

promedio_goles_equipo_top25 <- dffootballclean %>%
  group_by(name) %>%
  summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
  arrange(desc(promedio_goles)) %>%
  head(25)
promedio_goles_equipo_top25
## # A tibble: 25 × 2
##    name                                                   promedio_goles
##    <chr>                                                           <dbl>
##  1 FC Bayern München                                               0.180
##  2 Futbol Club Barcelona                                           0.166
##  3 Eindhovense Voetbalvereniging Philips Sport Vereniging          0.163
##  4 AFC Ajax Amsterdam                                              0.162
##  5 Manchester City Football Club                                   0.160
##  6 Paris Saint-Germain Football Club                               0.159
##  7 Real Madrid Club de Fútbol                                      0.158
##  8 The Celtic Football Club                                        0.156
##  9 FC Shakhtar Donetsk                                             0.155
## 10 Borussia Dortmund                                               0.146
## # ℹ 15 more rows
  • Comparison of home and away goals
goles_comparacion_equipo <- dffootballclean %>%
  group_by(name) %>%
  summarise(total_goles_marcados = sum(goals, na.rm = TRUE), total_goles_recibidos = sum(away_club_goals, na.rm = TRUE)) %>%
  arrange(desc(total_goles_marcados)) %>%
  head(25)

goles_casa_fuera <- dffootballclean %>%
  mutate(tipo_partido = ifelse(home_club_id == player_current_club_id, "Home", "Away")) %>%
  group_by(tipo_partido) %>%
  summarise(promedio_goles = mean(goals, na.rm = TRUE))

ggplot(goles_casa_fuera, aes(x = tipo_partido, y = promedio_goles, fill = tipo_partido)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Goals at Home vs Away", x = "Match Type", y = "Average Goals") +
  theme_minimal()

  • Goals by team
ggplot(goles_comparacion_equipo, aes(x = reorder(name, -total_goles_marcados))) +
  geom_bar(aes(y = total_goles_marcados, fill = "Scored"), stat = "identity") +
  labs(title = "Goals Scored per Team", x = "Team", y = "Number of Goals") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("Scored" = "blue"))

  • Goals by position

We detect a few anomalous position labels and remove them below; the number of affected rows is very small.

goles_por_posicion <- dffootballclean %>%
  group_by(position) %>%
  summarise(total_goles = sum(goals, na.rm = TRUE)) %>%
  arrange(desc(total_goles))
print(goles_por_posicion)
## # A tibble: 16 × 2
##    position           total_goles
##    <chr>                    <int>
##  1 Centre-Forward           57276
##  2 Left Winger              17065
##  3 Right Winger             15770
##  4 Attacking Midfield       15750
##  5 Central Midfield         15611
##  6 Centre-Back              11801
##  7 Defensive Midfield        6848
##  8 Right-Back                4253
##  9 Left-Back                 3812
## 10 Second Striker            2363
## 11 Right Midfield            1562
## 12 Left Midfield             1401
## 13 Goalkeeper                  17
## 14 Attack                      16
## 15 Defender                     1
## 16 midfield                     1
dffootballclean <- dffootballclean %>%
  filter(!position %in% c("Attack", "Defender", "midfield"))
goles_por_posicion <- dffootballclean %>%
  group_by(position) %>%
  summarise(total_goles = sum(goals, na.rm = TRUE)) %>%
  arrange(desc(total_goles))
ggplot(goles_por_posicion, aes(x = reorder(position, -total_goles), y = total_goles, fill = position)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Goals by Position", x = "Position", y = "Total Goals") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis_d()

We identified values that make little sense: the generic labels Attack, Defender, and midfield appear to be data errors. It is implausible for a generic attacking position to total fewer goals than the goalkeeper, when attackers should logically contribute the most, so we removed those records in the filter step above.

It is possible to continue with more exploratory analysis, but we now move on to the next section.

1.6 Discretization Methods

We will include new columns to indicate specific values.

  • Label by market value
    We use quartiles (breaks every 0.25) to assign labels: low, medium, medium-high, and high (stored in the data as Bajo, Medio, MedioAlto, and Alto).
dffootballclean <- dffootballclean %>%
  mutate(Valor_mercado = cut(market_value_in_eur, 
                              breaks = quantile(market_value_in_eur, probs = seq(0, 1, by = 0.25), na.rm = TRUE),
                              include.lowest = TRUE,
                              labels = c("Bajo", "Medio", "MedioAlto", "Alto")))
  • Type of league

We separate the data into: Big Five (Premier League, Serie A, Bundesliga, La Liga, Ligue 1), Champions League, Europa League, and Minor Leagues.

unique_competition_ids <- unique(dffootballclean$competition_id)
unique_competition_ids
##  [1] "ELQ"  "UKRS" "UKR1" "DK1"  "RUSS" "RU1"  "BESC" "BE1"  "FRCH" "POCP"
## [11] "CLQ"  "SC1"  "NLSC" "FR1"  "NL1"  "SCI"  "POSU" "CIT"  "DFL"  "GBCS"
## [21] "DFB"  "TR1"  "PO1"  "GB1"  "ES1"  "UKRP" "SUC"  "L1"   "GR1"  "IT1" 
## [31] "USC"  "RUP"  "CDR"  "CL"   "EL"   "NLP"  "DKP"  "SFA"  "GRP"  "FAC" 
## [41] "KLUB" "ECLQ" "CGB"
dffootballclean$League_category <- case_when(
  dffootballclean$competition_id %in% c("GB1", "ES1", "IT1", "FR1", "L1") ~ "BigFiveLeagues",  # "L1" is the Bundesliga code in this dataset
  dffootballclean$competition_id %in% c("CL", "CLQ") ~ "Champions",
  dffootballclean$competition_id %in% c("EL", "ELQ") ~ "EuropaLeague",
  TRUE ~ "MinorLeagues")
  • Minutes played

We create a new category based on minutes played. Since there are values greater than 90 minutes (as observed during data cleaning), we define the categories as follows:
- 90 minutes: starter ("Titular")
- > 90 minutes: extra time ("Prorroga")
- < 90 minutes: substitute ("No titular")

dffootballclean$player_status <- case_when(
  dffootballclean$minutes_played == 90 ~ "Titular",
  dffootballclean$minutes_played > 90 ~ "Prorroga",
  dffootballclean$minutes_played < 90 ~ "No titular")
  • Goal scorers

Based on the number of goals scored, we assign labels (from 0 to 6 goals).

dffootballclean$goal_status <- case_when(
  dffootballclean$goals == 6 ~ "GOAT",
  dffootballclean$goals == 3 ~ "Hattrick",
  dffootballclean$goals == 4 ~ "Poker",
  dffootballclean$goals == 5 ~ "Repoker",
  dffootballclean$goals == 1 ~ "Goal",
  dffootballclean$goals == 2 ~ "Duo",
  dffootballclean$goals == 0 ~ "Singol",
  TRUE ~ "Ninguno")
  • Cards

Based on yellow and red cards.

dffootballclean$sancion_status <- case_when(
  dffootballclean$red_cards == 1 & dffootballclean$yellow_cards == 2 ~ "Expulsado por doble amarilla",
  dffootballclean$red_cards == 1 ~ "Expulsado",
  dffootballclean$yellow_cards == 1 & dffootballclean$red_cards == 0 ~ "Apercibido",
  dffootballclean$yellow_cards == 0 & dffootballclean$red_cards == 0 ~ "Sin sanción",
  TRUE ~ "Ninguno")
  • Verification of discretization

We could create many more categories, but we stop here and retain all existing columns to preserve as much information as possible.
We print the first 10 rows to verify that the new columns were derived correctly.

head(dffootballclean, 10)
##    appearance_id game_id player_id player_club_id player_current_club_id
## 1  2235545_19409 2235545     19409            317                    200
## 2  2235545_30667 2235545     30667            317                    317
## 3  2235545_34129 2235545     34129            317                   1435
## 4  2235545_36139 2235545     36139            317                     36
## 5   2235545_4520 2235545      4520            317                    317
## 6   2235545_4582 2235545      4582            317                    317
## 7  2235545_47740 2235545     47740            317                   1426
## 8  2235545_59631 2235545     59631            317                   6890
## 9  2235545_60312 2235545     60312            317                   1426
## 10 2235545_63342 2235545     63342            317                  11282
##        date      player_name competition_id yellow_cards red_cards goals
## 1  20120705   Willem Janssen            ELQ            0         0     0
## 2  20120705 Robbert Schilder            ELQ            0         0     2
## 3  20120705   Wesley Verhoek            ELQ            0         0     0
## 4  20120705      Dusan Tadic            ELQ            0         0     1
## 5  20120705  Peter Wisgerhof            ELQ            0         0     0
## 6  20120705  Sander Boschker            ELQ            0         0     0
## 7  20120705     Nils Röseler            ELQ            0         0     0
## 8  20120705     Nacer Chadli            ELQ            0         0     0
## 9  20120705      Joshua John            ELQ            0         0     1
## 10 20120705        Leroy Fer            ELQ            0         0     0
##    assists minutes_played country_of_citizenship market_value_in_eur
## 1        0             45            Netherlands           1329310.3
## 2        1             90            Netherlands           1002272.7
## 3        0             90            Netherlands            950000.0
## 4        0             45                 Serbia          10717073.2
## 5        0             90            Netherlands           1450000.0
## 6        0             90            Netherlands            422222.2
## 7        0             90                Germany            465740.7
## 8        3             45                Belgium           7088750.0
## 9        1             45                  Aruba            542045.5
## 10       0             26            Netherlands           3865972.2
##              position                 name home_club_id away_club_id
## 1         Centre-Back Football Club Twente          317        28633
## 2           Left-Back Football Club Twente          317        28633
## 3        Right Winger Football Club Twente          317        28633
## 4         Left Winger Football Club Twente          317        28633
## 5         Centre-Back Football Club Twente          317        28633
## 6          Goalkeeper Football Club Twente          317        28633
## 7         Centre-Back Football Club Twente          317        28633
## 8         Left Winger Football Club Twente          317        28633
## 9         Left Winger Football Club Twente          317        28633
## 10 Defensive Midfield Football Club Twente          317        28633
##    home_club_goals away_club_goals Valor_mercado League_category player_status
## 1                6               0         Medio    EuropaLeague    No titular
## 2                6               0         Medio    EuropaLeague       Titular
## 3                6               0         Medio    EuropaLeague       Titular
## 4                6               0          Alto    EuropaLeague    No titular
## 5                6               0         Medio    EuropaLeague       Titular
## 6                6               0          Bajo    EuropaLeague       Titular
## 7                6               0          Bajo    EuropaLeague       Titular
## 8                6               0          Alto    EuropaLeague    No titular
## 9                6               0          Bajo    EuropaLeague    No titular
## 10               6               0     MedioAlto    EuropaLeague    No titular
##    goal_status sancion_status
## 1       Singol    Sin sanción
## 2          Duo    Sin sanción
## 3       Singol    Sin sanción
## 4         Goal    Sin sanción
## 5       Singol    Sin sanción
## 6       Singol    Sin sanción
## 7       Singol    Sin sanción
## 8       Singol    Sin sanción
## 9         Goal    Sin sanción
## 10      Singol    Sin sanción

1.7 Component Analysis and SVD

1.7.1 PCA

Specific reference - https://rpubs.com/luis_abaunzag/ejercicio1_rd

This dataset could be applied to countless analytical ideas, but for now we will focus on the numerical columns related to players.

numeric_columns <- c("yellow_cards", "red_cards", "goals", "assists", "minutes_played", "market_value_in_eur")
dffootball_numeric <- dffootballclean[, numeric_columns]

The next step is to scale the data and perform PCA.

dffootball_scaled <- scale(dffootball_numeric)
pca_result <- prcomp(dffootball_scaled, scale. = TRUE)  # data is already scaled, so scale. = TRUE is redundant but harmless
summary(pca_result)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6
## Standard deviation     1.1180 1.0398 0.9964 0.9669 0.9416 0.9245
## Proportion of Variance 0.2083 0.1802 0.1655 0.1558 0.1478 0.1424
## Cumulative Proportion  0.2083 0.3885 0.5540 0.7098 0.8576 1.0000

The Principal Component Analysis (PCA) performed on player performance data provides valuable insights into how the variance of the selected numerical features is distributed: yellow cards, red cards, goals, assists, minutes played, and market value.

From the PCA summary, PC1 explains 20.83% of the total variance, indicating that this component captures a significant proportion of the variability in player characteristics. PC2 and PC3 explain 18.02% and 16.55% of the variance, respectively, meaning that these subsequent dimensions also capture a considerable amount of information. Together, the first three components (PC1, PC2, and PC3) explain 54.4% of the variance, suggesting that a large portion of variability in the data can be summarized using only these three principal components.

The cumulative proportion of variance shows that the six components together explain 100% of the variance, which is expected: six input variables can yield at most six components. More informative is how evenly the variance is spread across components (from 20.83% down to 14.24%), which indicates little redundancy among these variables and therefore limited scope for aggressive dimensionality reduction without information loss.

This analysis is useful for understanding the underlying structure of the data and facilitates the visualization of complex patterns.

Component Rotation

Rotation is used to make results easier to interpret. Although principal components identify the directions of maximum variability, they can be difficult to understand directly. By rotating the components, we aim for each one to have a clearer relationship with specific variables, which improves interpretability. Note that the pca_result$rotation slot printed below contains the unrotated loadings (prcomp uses "rotation" for the matrix that rotates the data into component space); the varimax rotation proper is applied further down with psych::principal.

For example, without rotation, a component may be influenced by several variables at once, whereas with rotation, each component can be more strongly associated with a key variable.

pca_result$rotation
##                            PC1        PC2         PC3         PC4         PC5
## yellow_cards         0.1879215  0.7129687  0.26545946  0.17179590 -0.36190593
## red_cards           -0.1209879 -0.2337237  0.95885294 -0.01595974  0.05370472
## goals                0.5201572 -0.2347260  0.02228464  0.44348855  0.56180214
## assists              0.4644091 -0.1711537  0.03506992 -0.81381839 -0.07084041
## minutes_played       0.4897718  0.4522012  0.09074299 -0.08164665  0.28213551
## market_value_in_eur  0.4732142 -0.3849324 -0.01309673  0.32339265 -0.68256827
##                             PC6
## yellow_cards         0.47476587
## red_cards           -0.09053898
## goals                0.40192013
## assists              0.29407819
## minutes_played      -0.67907049
## market_value_in_eur -0.23925209

PC1 (First Principal Component):
This component has a strong relationship with variables such as goals, assists, minutes played, and market value. This suggests that PC1 captures a measure of overall player performance. Players with high values on this component tend to demonstrate strong participation in the game (goals and assists) as well as higher playing time.

PC2 (Second Principal Component):
This component is strongly associated with yellow and red cards, indicating that PC2 reflects players’ disciplinary or aggressive behavior. Players with high scores on this component are more likely to commit fouls or receive sanctions.

PC3 to PC6:
These components reveal less clear combinations of variables but may represent more specific aspects of gameplay. For example, PC3 seems to be more closely related to red cards, suggesting that certain players are strongly associated with disciplinary records.

We applied a varimax rotation to improve the interpretability of the principal components, making it easier to identify meaningful patterns. After rotation, yellow cards, goals, and other key aspects of player performance cluster into well-defined components. This improves our understanding of how each variable contributes to the overall variability in the data.

Moreover, the components with the highest explained variance (such as the first and second) are the most relevant, indicating that the majority of the key information is concentrated within them.

pca_rotated <- principal(dffootball_scaled, nfactors = 5, rotate = "varimax")
pca_rotated$loadings
## 
## Loadings:
##                     RC2    RC1    RC4    RC3    RC5   
## yellow_cards         0.873 -0.145 -0.118              
## red_cards                                 0.996       
## goals                       0.919                0.120
## assists                            0.959              
## minutes_played       0.567  0.404  0.289        -0.171
## market_value_in_eur                              0.968
## 
##                  RC2   RC1   RC4   RC3   RC5
## SS loadings    1.088 1.037 1.023 1.003 0.995
## Proportion Var 0.181 0.173 0.170 0.167 0.166
## Cumulative Var 0.181 0.354 0.525 0.692 0.858

We create a more visual representation and can conclude that three components are sufficient to explain over 50% of the variance.

fviz_eig(pca_result)

To remove any doubt, we generate a scree plot (elbow method) and confirm that the change in trend indeed occurs at three components.

varianza_explicada <- pca_result$sdev^2
proporcion_varianza <- varianza_explicada / sum(varianza_explicada)

plot(proporcion_varianza, type = "b", pch = 19, xlab = "Principal Components", ylab = "Proportion of Variance Explained", 
     main = "Scree Plot", col = "blue")

abline(h = 1/length(proporcion_varianza), col = "red", lty = 2)
abline(v = 3, col = "green", lty = 2)

table1 <- table(dffootballclean$yellow_cards, dffootballclean$red_cards)
chisq.test(table1)
## 
##  Pearson's Chi-squared test
## 
## data:  table1
## X-squared = 245.4, df = 2, p-value < 2.2e-16

Chi-squared test result:

  • X-squared = 245.4: This is the chi-squared statistic, which measures the discrepancy between the observed and expected frequencies in a contingency table.
  • df = 2: This is the number of degrees of freedom, which depends on the size of the contingency table.
  • p-value < 2.2e-16: This p-value is extremely small (much lower than 0.05). This means we reject the null hypothesis of no relationship between the two variables, suggesting that there is a significant association between the categories of the analyzed variables (in this case, yellow and red cards).

The chi-squared test indicates that there is a significant relationship between the variables analyzed (yellow and red cards). The extremely small p-value suggests that this relationship is very unlikely to have occurred by chance. In simpler terms, this means there is a statistical dependency between the number of yellow and red cards a player receives.

cor(dffootball_numeric$goals, dffootball_numeric$assists, method = "spearman")
## [1] 0.0727869

0.0727869 is a small value: with more than 1.5 million observations even a correlation this weak is statistically significant, but the strength of the association between goals and assists is clearly modest.
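To attach a p-value to this coefficient, one could run a correlation test (a sketch; because of ties, R falls back to an asymptotic approximation, and at this sample size even tiny correlations test as significant):

cor.test(dffootball_numeric$goals, dffootball_numeric$assists,
         method = "spearman", exact = FALSE)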

phi <- cor(dffootball_numeric$goals, dffootball_numeric$yellow_cards, method = "pearson")  # plain Pearson correlation between two count variables
phi
## [1] 0.001195572

The value 0.001195572 is essentially zero, indicating no meaningful linear relationship between goals and yellow cards.

In summary, the dependence between yellow and red cards found by the chi-squared test makes sense, because both sanctions are governed by the same rules of the game; goals, by contrast, are essentially unrelated to yellow cards.

We now generate a correlation heatmap.

corrplot(cor(dffootball_numeric), method = "color", tl.cex = 0.5)

The correlation matrix reveals weak relationships among the variables in the dataset. The linear correlation between yellow and red cards is negligible, even though the chi-squared test detected a statistical dependence between them; the association is simply too weak to matter in linear terms. Although there is a slight positive correlation between goals and assists, the relationship is not strong enough to indicate a clear connection between them.

Furthermore, the correlation between minutes played and other variables is low, indicating that playing time is not strongly associated with performance in terms of goals or assists. Finally, player market value shows weak correlations with performance variables, suggesting that factors such as goals, assists, or minutes played do not largely explain a player’s market value.

Overall, the variables appear to be largely independent, implying that they provide unique and non-redundant information about player performance.

pca_scores <- pca_result$x  # projections of the observations (players)
pca_data <- data.frame(player_id = dffootballclean$player_id, pca_scores)
head(pca_data)
##   player_id        PC1        PC2        PC3         PC4        PC5        PC6
## 1     19409 -0.9477182 -0.3581370 -0.2499365 -0.05737914 0.07001139  0.2759403
## 2     30667  4.5395953 -1.6804405  0.1444410 -0.35773538 3.66765465  2.7202911
## 3     34129 -0.2309475  0.3389939 -0.1128023 -0.19416749 0.52398812 -0.7357923
## 4     36139  1.1278253 -1.4771678 -0.1965340  1.62701310 1.04122580  1.2358561
## 5      4520 -0.2041214  0.3171725 -0.1135448 -0.17583467 0.48529398 -0.7493553
## 6      4582 -0.2592639  0.3620277 -0.1120187 -0.21351880 0.56483194 -0.7214759

PCA with player labels: the first two principal components (PC1 and PC2) are plotted to visualize how players are distributed, with labels added for identification. However, given the overwhelming volume of data, it is impossible to draw meaningful conclusions from this visualization.

ggplot(pca_data, aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_text(aes(label = player_id), vjust = 1, hjust = 1) +
  labs(title = "PCA: First and Second Principal Components",
       x = "Principal Component 1", y = "Principal Component 2") +
  theme_minimal()

The PCA plot is overloaded due to the large volume of data (more than 1.5 million rows) and text labels (player_id), making visual interpretation difficult. The density of points and overlapping labels prevents clear identification of relationships between the principal components.

Observations:
- Data distribution: There is a dense cluster in the central area, suggesting that most players share similar characteristics across the selected variables (yellow_cards, red_cards, goals, etc.).
- Visible outliers: Some points scattered to the right may represent players with extreme values in certain variable combinations, such as very high market value or exceptional performance.
- Label overlap: Labeling each point is not effective due to massive overlap. This creates visual noise and limits the extraction of useful insights.

To address this, geom_bin2d is applied to display point density in the PCA space. This helps to identify areas of high player concentration without plotting each individual point, improving clarity and interpretability.

ggplot(pca_data, aes(x = PC1, y = PC2)) +
  geom_bin2d(bins = 100) +
  labs(title = "PCA: Player Density",
       x = "Principal Component 1", y = "Principal Component 2") +
  theme_minimal()

Observations:
- Central density: There is a notable concentration of points around the origin (values close to 0 in both principal components). This suggests that most players have average or balanced values in the considered metrics (such as goals, cards, and market value).
- Dispersion at the edges: Points scattered toward the extremes, especially to the right (high values on Principal Component 1), may reflect players with outstanding characteristics, such as high market value or strong offensive contributions (goals/assists).
- Diagonal patterns: The overall distribution suggests a moderate inverse relationship between the two principal components, which could imply that certain attributes vary in opposite ways for specific groups of players.

To further enhance interpretation, players are grouped into five clusters using K-means (not strictly necessary, but tested as an exploratory step). The clusters are visualized with different colors, allowing us to identify patterns and relationships in player data based on their principal components.

set.seed(123)
kmeans_result <- kmeans(pca_data[, c("PC1", "PC2")], centers = 5)  

ggplot(pca_data, aes(x = PC1, y = PC2, color = factor(kmeans_result$cluster))) +
  geom_point(alpha = 0.5) +
  labs(title = "PCA: Agrupación de Jugadores",
       x = "Componente Principal 1", y = "Componente Principal 2") +
  theme_minimal() +
  scale_color_manual(values = c("red", "blue", "green", "purple", "orange")) 

Observations:

  • Clear segmentation: The clusters show distinct regions in the principal component space, indicating that the original metrics (goals, assists, minutes played, etc.) separate players into different categories.

Possible interpretations:
- Cluster 1 (red): May represent players with extreme values in a single variable, such as cards or very low playing time.
- Cluster 3 (green): The largest and densest group, likely composed of players with average or balanced metrics.
- Cluster 5 (orange): Players with high values on PC1, potentially associated with strong offensive performance or high market value.
- Other clusters (blue and purple): Appear to represent subsets with intermediate characteristics.

Data structure: The boundaries between clusters are not completely linear, suggesting that the relationships among the metrics are not perfectly separable.
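
To put numbers behind these readings, a small sketch (reusing the pca_data and kmeans_result objects above) tabulates cluster sizes and mean component values per cluster:

# Cluster sizes and mean PC1/PC2 coordinates per cluster
table(kmeans_result$cluster)
aggregate(pca_data[, c("PC1", "PC2")],
          by = list(cluster = kmeans_result$cluster), FUN = mean)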

1.7.2 SVD

We perform Singular Value Decomposition (SVD) on the scaled data to obtain the three factors U, D, and V.
- U contains the player projections onto the new dimensions; scaled by the singular values in D, its columns correspond to the principal component scores from PCA.
- The dataframe svd_data stores these projections along with player_id, enabling analysis of how players are distributed in the new dimensions.

svd_result <- svd(dffootball_scaled)

U <- svd_result$u  
D <- svd_result$d  
V <- svd_result$v  

svd_data <- data.frame(player_id = dffootballclean$player_id, U)

head(svd_data)
##   player_id            X1            X2            X3            X4
## 1     19409 -0.0006698067 -0.0002721703 -1.982000e-04 -4.689226e-05
## 2     30667  0.0032083919 -0.0012770699  1.145419e-04 -2.923540e-04
## 3     34129 -0.0001632238  0.0002576223 -8.945242e-05 -1.586805e-04
## 4     36139  0.0007970987 -0.0011225905 -1.558517e-04  1.329653e-03
## 5      4520 -0.0001442643  0.0002410389 -9.004118e-05 -1.436983e-04
## 6      4582 -0.0001832366  0.0002751271 -8.883095e-05 -1.744951e-04
##             X5            X6
## 1 5.875418e-05  0.0002358509
## 2 3.077928e-03  0.0023250793
## 3 4.397355e-04 -0.0006288943
## 4 8.738059e-04  0.0010563073
## 5 4.072630e-04 -0.0006404868
## 6 4.740120e-04 -0.0006166578

The results of the Singular Value Decomposition (SVD) show how players are represented in the new dimensions obtained from the scaled data. Each row of the svd_data dataset corresponds to a player, identified by their player_id, followed by the values in the first six dimensions derived from the decomposition.

The columns X1 to X6 represent the projections of each player in the first six components of the feature space. These projections are continuous values that indicate how each player is distributed across the new dimensions of the reduced space. By analyzing these projections, we can gain insights into the relationships and similarities among players based on their numerical characteristics (e.g., goals, cards, minutes played, etc.).

This type of analysis is useful for dimensionality reduction, as it allows us to visualize player characteristics in a more compact space, facilitating the identification of patterns or clusters among them.
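
The link to PCA can be verified directly. A minimal sketch, assuming dffootball_scaled was produced with scale() (i.e., centered and scaled) and pca_result and svd_result are the objects above: since X = U D Vᵀ, the PCA scores equal U scaled by the singular values.

# Recover the PCA scores from the SVD factors
scores_from_svd <- svd_result$u %*% diag(svd_result$d)
max(abs(abs(scores_from_svd) - abs(pca_result$x)))  # ~0, up to possible sign flips per column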

1.7.3 PCA vs SVD

par(mfrow = c(1, 2))  
plot(pca_result$x[, 1], pca_result$x[, 2], 
     main = "PCA: Principal Components", 
     xlab = "PC1", ylab = "PC2")
plot(svd_result$u[, 1], svd_result$u[, 2], 
     main = "SVD: U Components", 
     xlab = "U1", ylab = "U2")

dif <- svd_result$v[, 1:5] - pca_result$rotation[, 1:5]  
summary(dif)
##       PC1                  PC2                  PC3            
##  Min.   :-2.687e-14   Min.   :-1.582e-14   Min.   :-3.331e-15  
##  1st Qu.:-4.066e-15   1st Qu.:-1.234e-14   1st Qu.: 4.424e-16  
##  Median : 4.913e-15   Median :-5.232e-15   Median : 7.147e-16  
##  Mean   :-4.626e-17   Mean   :-4.732e-15   Mean   : 6.407e-16  
##  3rd Qu.: 9.021e-15   3rd Qu.: 3.442e-15   3rd Qu.: 1.440e-15  
##  Max.   : 1.343e-14   Max.   : 6.273e-15   Max.   : 3.712e-15  
##       PC4                  PC5            
##  Min.   :-7.644e-14   Min.   :-1.349e-13  
##  1st Qu.:-8.476e-15   1st Qu.:-4.359e-14  
##  Median :-7.633e-17   Median : 6.939e-15  
##  Mean   : 6.173e-15   Mean   : 2.304e-15  
##  3rd Qu.: 1.935e-14   3rd Qu.: 7.142e-14  
##  Max.   : 9.909e-14   Max.   : 1.024e-13

The results show that the differences between the V matrix from SVD and the PCA rotation matrix for the first five components are on the order of 1e-14, i.e., essentially zero.
This indicates that the two matrices are practically identical, which is expected when the data are centered and scaled. The tiny residual variation is attributable to floating-point error when processing large matrices.

Both decompositions capture the same linear structure in the data, and the near-zero differences show that PCA and SVD represent the underlying structure of the dataset equivalently.
The medians and quartiles close to zero confirm that the variation between the two techniques is negligible.

This analysis reinforces the conclusion that, when the data are scaled, the results obtained by PCA and SVD are essentially the same, and the differences are insignificant in practical terms. Therefore, either method can be used for dimensionality reduction and to analyze the underlying structure of the data.

1.9 Second Part of Data Mining

We reload the cleaned dataset.

dffootballglobal <- read.csv("C:/Users/Manuel/Desktop/PR1/df_football_clean.csv")

1.10 Clustering with K-means on Original and Normalized Data

We begin our K-means study by selecting the numerical variables, since the algorithm can only be applied to numerical data.

sapply(dffootballglobal, class)
##          appearance_id                game_id              player_id 
##            "character"              "integer"              "integer" 
##         player_club_id player_current_club_id                   date 
##              "integer"              "integer"            "character" 
##            player_name         competition_id           yellow_cards 
##            "character"            "character"              "integer" 
##              red_cards                  goals                assists 
##              "integer"              "integer"              "integer" 
##         minutes_played country_of_citizenship    market_value_in_eur 
##              "integer"            "character"              "numeric" 
##               position                   name           home_club_id 
##            "character"            "character"              "integer" 
##           away_club_id        home_club_goals        away_club_goals 
##              "integer"              "integer"              "integer"

We must keep in mind that numerical variables should not be selected at random, as this would not yield meaningful clusters. Instead, we need to analyze which variables are important and why.

Important Variables

As indicated, the following may directly influence clustering based on player performance or characteristics:

  • position
    Reason: Reflects the player’s role on the field, which can affect performance.
    Required treatment: Convert into dummy variables (categorical positions → numerical).

  • country_of_citizenship
    Reason: May be relevant to analyze geographic or cultural patterns in performance.
    Required treatment: Convert into dummy variables.

  • market_value_in_eur
    Reason: Indicates the economic value of the player, a crucial factor in performance analysis and clustering.

  • minutes_played
    Reason: Represents the amount of time a player participates in matches, essential for assessing impact on the game.

  • goals
    Reason: Reflects direct player performance in terms of offensive contributions.

  • assists
    Reason: Complements the analysis of offensive performance, highlighting indirect contributions.

  • yellow_cards
    Reason: Relevant for evaluating player disciplinary behavior and its impact on matches.

  • home_club_goals and away_club_goals
    Reason: Help contextualize player performance in relation to match outcomes.

Less Important Variables

  • player_id
    Reason: Unique identifier with no analytical meaning.

  • game_id
    Reason: Unique match identifier, not directly related to clustering patterns.

  • player_club_id, home_club_id, and away_club_id
    Reason: Club identifiers that add no analytical information beyond the club name.

  • name (Club name)
    Reason: Textual categorical information not directly used in K-means.

dfkmeans <- dffootballglobal %>%
  select(
    position,
    country_of_citizenship,
    market_value_in_eur,
    minutes_played,
    goals,
    assists,
    yellow_cards,
    home_club_goals,
    away_club_goals
  )

head(dfkmeans)
##             position country_of_citizenship market_value_in_eur minutes_played
## 1        Centre-Back            Netherlands             1329310             45
## 2 Defensive Midfield                                    1615000             90
## 3          Left-Back            Netherlands             1002273             90
## 4       Right Winger            Netherlands              950000             90
## 5        Left Winger                 Serbia            10717073             45
## 6        Centre-Back            Netherlands             1450000             90
##   goals assists yellow_cards home_club_goals away_club_goals
## 1     0       0            0               6               0
## 2     0       0            0               6               0
## 3     2       1            0               6               0
## 4     0       0            0               6               0
## 5     1       0            0               6               0
## 6     0       0            0               6               0
str(dfkmeans)
## 'data.frame':    1290352 obs. of  9 variables:
##  $ position              : chr  "Centre-Back" "Defensive Midfield" "Left-Back" "Right Winger" ...
##  $ country_of_citizenship: chr  "Netherlands" "" "Netherlands" "Netherlands" ...
##  $ market_value_in_eur   : num  1329310 1615000 1002273 950000 10717073 ...
##  $ minutes_played        : int  45 90 90 90 45 90 90 90 45 45 ...
##  $ goals                 : int  0 0 2 0 1 0 0 0 0 1 ...
##  $ assists               : int  0 0 1 0 0 0 0 0 3 1 ...
##  $ yellow_cards          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_club_goals       : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ away_club_goals       : int  0 0 0 0 0 0 0 0 0 0 ...

We first focus on the character variables in order to convert them into numerical format.

unique(dfkmeans$position)
##  [1] "Centre-Back"        "Defensive Midfield" "Left-Back"         
##  [4] "Right Winger"       "Left Winger"        "Goalkeeper"        
##  [7] "Right-Back"         "Centre-Forward"     "Second Striker"    
## [10] "Central Midfield"   "Attacking Midfield" "Right Midfield"    
## [13] "Left Midfield"      "Attack"             "Defender"          
## [16] "midfield"           ""
unique(dfkmeans$country_of_citizenship)
##   [1] "Netherlands"              ""                        
##   [3] "Serbia"                   "Germany"                 
##   [5] "Belgium"                  "Aruba"                   
##   [7] "Romania"                  "Croatia"                 
##   [9] "Bulgaria"                 "Ukraine"                 
##  [11] "Brazil"                   "Cyprus"                  
##  [13] "Armenia"                  "North Macedonia"         
##  [15] "Switzerland"              "Spain"                   
##  [17] "Russia"                   "Denmark"                 
##  [19] "United States"            "France"                  
##  [21] "Georgia"                  "Argentina"               
##  [23] "Montenegro"               "Austria"                 
##  [25] "Albania"                  "Portugal"                
##  [27] "Nigeria"                  "Cameroon"                
##  [29] "Bosnia-Herzegovina"       "Norway"                  
##  [31] "Senegal"                  "Mali"                    
##  [33] "Iceland"                  "Zimbabwe"                
##  [35] "Paraguay"                 "Italy"                   
##  [37] "Finland"                  "Slovakia"                
##  [39] "Turkey"                   "Ghana"                   
##  [41] "Czech Republic"           "Uzbekistan"              
##  [43] "Tunisia"                  "Lithuania"               
##  [45] "Slovenia"                 "Azerbaijan"              
##  [47] "Philippines"              "Faroe Islands"           
##  [49] "Costa Rica"               "Sweden"                  
##  [51] "Pakistan"                 "Scotland"                
##  [53] "Chile"                    "Ireland"                 
##  [55] "Poland"                   "Kosovo"                  
##  [57] "Northern Ireland"         "Suriname"                
##  [59] "Türkiye"                  "England"                 
##  [61] "Morocco"                  "Congo"                   
##  [63] "Cote d'Ivoire"            "Ecuador"                 
##  [65] "Greece"                   "Guinea"                  
##  [67] "Israel"                   "Martinique"              
##  [69] "Zambia"                   "Venezuela"               
##  [71] "Kazakhstan"               "Hungary"                 
##  [73] "Moldova"                  "Belarus"                 
##  [75] "Latvia"                   "Japan"                   
##  [77] "Australia"                "South Africa"            
##  [79] "DR Congo"                 "Estonia"                 
##  [81] "Liberia"                  "The Gambia"              
##  [83] "Algeria"                  "Chinese Taipei"          
##  [85] "Burundi"                  "Burkina Faso"            
##  [87] "Angola"                   "Egypt"                   
##  [89] "Gabon"                    "Peru"                    
##  [91] "Central African Republic" "Kenya"                   
##  [93] "Trinidad and Tobago"      "Jamaica"                 
##  [95] "Wales"                    "Honduras"                
##  [97] "Réunion"                  "Uruguay"                 
##  [99] "Guinea-Bissau"            "Cape Verde"              
## [101] "Colombia"                 "Madagascar"              
## [103] "Haiti"                    "Bolivia"                 
## [105] "Curacao"                  "Afghanistan"             
## [107] "Guyana"                   "Canada"                  
## [109] "Antigua and Barbuda"      "Sierra Leone"            
## [111] "Comoros"                  "Chad"                    
## [113] "French Guiana"            "Togo"                    
## [115] "Mexico"                   "Guadeloupe"              
## [117] "Syria"                    "Korea, South"            
## [119] "Panama"                   "Sao Tome and Principe"   
## [121] "New Zealand"              "Benin"                   
## [123] "Equatorial Guinea"        "Libya"                   
## [125] "Seychelles"               "Barbados"                
## [127] "Oman"                     "Mozambique"              
## [129] "Palestine"                "Indonesia"               
## [131] "Iran"                     "Neukaledonien"           
## [133] "Malaysia"                 "Saint-Martin"            
## [135] "Luxembourg"               "Saudi Arabia"            
## [137] "Mauritania"               "Iraq"                    
## [139] "Tajikistan"               "El Salvador"             
## [141] "Mauritius"                "Kyrgyzstan"              
## [143] "China"                    "Lebanon"                 
## [145] "Niger"                    "Jordan"                  
## [147] "Dominican Republic"       "Rwanda"                  
## [149] "Malta"                    "Montserrat"              
## [151] "Guatemala"                "Thailand"                
## [153] "Uganda"                   "Grenada"                 
## [155] "Bermuda"                  "Laos"                    
## [157] "Monaco"                   "Ethiopia"                
## [159] "Liechtenstein"            "Malawi"                  
## [161] "Tanzania"                 "Eritrea"                 
## [163] "Qatar"                    "Nicaragua"               
## [165] "Sint Maarten"             "Korea, North"            
## [167] "Vietnam"                  "Cuba"                    
## [169] "St. Kitts & Nevis"

Generating one dummy variable per country would produce far too many columns and could distort the clustering. To address this, we group countries by continent while keeping player positions as they are.

paises_a_continente <- c(
  "Netherlands" = "Europa", "Serbia" = "Europa", "Germany" = "Europa", "Belgium" = "Europa", "Aruba" = "América",
  "Romania" = "Europa", "Croatia" = "Europa", "Bulgaria" = "Europa", "Ukraine" = "Europa", "Brazil" = "América",
  "Cyprus" = "Europa", "Armenia" = "Asia", "North Macedonia" = "Europa", "Switzerland" = "Europa", "Spain" = "Europa",
  "Russia" = "Europa", "Denmark" = "Europa", "United States" = "América", "France" = "Europa", "Georgia" = "Asia",
  "Argentina" = "América", "Montenegro" = "Europa", "Austria" = "Europa", "Albania" = "Europa", "Portugal" = "Europa",
  "Nigeria" = "África", "Cameroon" = "África", "Bosnia-Herzegovina" = "Europa", "Norway" = "Europa", "Senegal" = "África",
  "Mali" = "África", "Iceland" = "Europa", "Zimbabwe" = "África", "Paraguay" = "América", "Italy" = "Europa",
  "Finland" = "Europa", "Slovakia" = "Europa", "Turkey" = "Asia", "Ghana" = "África", "Czech Republic" = "Europa",
  "Uzbekistan" = "Asia", "Tunisia" = "África", "Lithuania" = "Europa", "Slovenia" = "Europa", "Azerbaijan" = "Asia",
  "Philippines" = "Asia", "Faroe Islands" = "Europa", "Costa Rica" = "América", "Sweden" = "Europa", "Pakistan" = "Asia",
  "Scotland" = "Europa", "Chile" = "América", "Ireland" = "Europa", "Poland" = "Europa", "Kosovo" = "Europa",
  "Northern Ireland" = "Europa", "Suriname" = "América", "Türkiye" = "Asia", "England" = "Europa", "Morocco" = "África",
  "Congo" = "África", "Cote d'Ivoire" = "África", "Ecuador" = "América", "Greece" = "Europa", "Guinea" = "África",
  "Israel" = "Asia", "Martinique" = "América", "Zambia" = "África", "Venezuela" = "América", "Kazakhstan" = "Asia",
  "Hungary" = "Europa", "Moldova" = "Europa", "Belarus" = "Europa", "Latvia" = "Europa", "Japan" = "Asia",
  "Australia" = "Oceanía", "South Africa" = "África", "DR Congo" = "África", "Estonia" = "Europa", "Liberia" = "África",
  "The Gambia" = "África", "Algeria" = "África", "Chinese Taipei" = "Asia", "Burundi" = "África", "Burkina Faso" = "África",
  "Angola" = "África", "Egypt" = "África", "Gabon" = "África", "Peru" = "América", "Central African Republic" = "África",
  "Kenya" = "África", "Trinidad and Tobago" = "América", "Jamaica" = "América", "Wales" = "Europa", "Honduras" = "América",
  "Réunion" = "África", "Uruguay" = "América", "Guinea-Bissau" = "África", "Cape Verde" = "África", "Colombia" = "América",
  "Madagascar" = "África", "Haiti" = "América", "Bolivia" = "América", "Curacao" = "América", "Afghanistan" = "Asia",
  "Guyana" = "América", "Canada" = "América", "Antigua and Barbuda" = "América", "Sierra Leone" = "África", "Comoros" = "África",
  "Chad" = "África", "French Guiana" = "América", "Togo" = "África", "Mexico" = "América", "Guadeloupe" = "América",
  "Syria" = "Asia", "Korea, South" = "Asia", "Panama" = "América", "Sao Tome and Principe" = "África", "New Zealand" = "Oceanía",
  "Benin" = "África", "Equatorial Guinea" = "África", "Libya" = "África", "Seychelles" = "África", "Barbados" = "América",
  "Oman" = "Asia", "Mozambique" = "África", "Palestine" = "Asia", "Indonesia" = "Asia", "Iran" = "Asia", "Neukaledonien" = "Oceanía",
  "Malaysia" = "Asia", "Saint-Martin" = "América", "Luxembourg" = "Europa", "Saudi Arabia" = "Asia", "Mauritania" = "África",
  "Iraq" = "Asia", "Tajikistan" = "Asia", "El Salvador" = "América", "Mauritius" = "África", "Kyrgyzstan" = "Asia", "China" = "Asia",
  "Lebanon" = "Asia", "Niger" = "África", "Jordan" = "Asia", "Dominican Republic" = "América", "Rwanda" = "África", "Malta" = "Europa",
  "Montserrat" = "América", "Guatemala" = "América", "Thailand" = "Asia", "Uganda" = "África", "Grenada" = "América",
  "Bermuda" = "América", "Laos" = "Asia", "Monaco" = "Europa", "Ethiopia" = "África", "Liechtenstein" = "Europa", "Malawi" = "África",
  "Tanzania" = "África", "Eritrea" = "África", "Qatar" = "Asia", "Nicaragua" = "América", "Sint Maarten" = "América",
  "Korea, North" = "Asia", "Vietnam" = "Asia", "Cuba" = "América", "St. Kitts & Nevis" = "América", "Southern Sudan" = "África", "Bonaire" = "América"
)

# Map each country of citizenship to a continent; unmatched values become NA
dfkmeans$continent <- unname(paises_a_continente[dfkmeans$country_of_citizenship])
table(dfkmeans$continent)
## 
##  África América    Asia  Europa Oceanía 
##  138750  172040   71854  883347    4962
dfkmeans$continent[is.na(dfkmeans$continent)] <- "Otro"

# One-hot encode continent (no intercept) and position via model.matrix
X <- model.matrix(~ 0 + continent, data = dfkmeans)
colnames(X) <- sub("^continent", "", colnames(X))
stopifnot(nrow(X) == nrow(dfkmeans))
dfkmeans <- cbind(dfkmeans, X)
dfkmeans <- cbind(dfkmeans, model.matrix(~ position - 1, data = dfkmeans))
head(dfkmeans)
##             position country_of_citizenship market_value_in_eur minutes_played
## 1        Centre-Back            Netherlands             1329310             45
## 2 Defensive Midfield                                    1615000             90
## 3          Left-Back            Netherlands             1002273             90
## 4       Right Winger            Netherlands              950000             90
## 5        Left Winger                 Serbia            10717073             45
## 6        Centre-Back            Netherlands             1450000             90
##   goals assists yellow_cards home_club_goals away_club_goals continent África
## 1     0       0            0               6               0    Europa      0
## 2     0       0            0               6               0      Otro      0
## 3     2       1            0               6               0    Europa      0
## 4     0       0            0               6               0    Europa      0
## 5     1       0            0               6               0    Europa      0
## 6     0       0            0               6               0    Europa      0
##   América Asia Europa Oceanía Otro position positionAttack
## 1       0    0      1       0    0        0              0
## 2       0    0      0       0    1        0              0
## 3       0    0      1       0    0        0              0
## 4       0    0      1       0    0        0              0
## 5       0    0      1       0    0        0              0
## 6       0    0      1       0    0        0              0
##   positionAttacking Midfield positionCentral Midfield positionCentre-Back
## 1                          0                        0                   1
## 2                          0                        0                   0
## 3                          0                        0                   0
## 4                          0                        0                   0
## 5                          0                        0                   0
## 6                          0                        0                   1
##   positionCentre-Forward positionDefender positionDefensive Midfield
## 1                      0                0                          0
## 2                      0                0                          1
## 3                      0                0                          0
## 4                      0                0                          0
## 5                      0                0                          0
## 6                      0                0                          0
##   positionGoalkeeper positionLeft-Back positionLeft Midfield
## 1                  0                 0                     0
## 2                  0                 0                     0
## 3                  0                 1                     0
## 4                  0                 0                     0
## 5                  0                 0                     0
## 6                  0                 0                     0
##   positionLeft Winger positionmidfield positionRight-Back
## 1                   0                0                  0
## 2                   0                0                  0
## 3                   0                0                  0
## 4                   0                0                  0
## 5                   1                0                  0
## 6                   0                0                  0
##   positionRight Midfield positionRight Winger positionSecond Striker
## 1                      0                    0                      0
## 2                      0                    0                      0
## 3                      0                    0                      0
## 4                      0                    1                      0
## 5                      0                    0                      0
## 6                      0                    0                      0

We remove the original ‘continent’, ‘position’, and ‘country_of_citizenship’ columns.

dfkmeans <- dfkmeans[, !names(dfkmeans) %in% c("continent", "position", "country_of_citizenship")]
head(dfkmeans)
##   market_value_in_eur minutes_played goals assists yellow_cards home_club_goals
## 1             1329310             45     0       0            0               6
## 2             1615000             90     0       0            0               6
## 3             1002273             90     2       1            0               6
## 4              950000             90     0       0            0               6
## 5            10717073             45     1       0            0               6
## 6             1450000             90     0       0            0               6
##   away_club_goals África América Asia Europa Oceanía Otro positionAttack
## 1               0      0       0    0      1       0    0              0
## 2               0      0       0    0      0       0    1              0
## 3               0      0       0    0      1       0    0              0
## 4               0      0       0    0      1       0    0              0
## 5               0      0       0    0      1       0    0              0
## 6               0      0       0    0      1       0    0              0
##   positionAttacking Midfield positionCentral Midfield positionCentre-Back
## 1                          0                        0                   1
## 2                          0                        0                   0
## 3                          0                        0                   0
## 4                          0                        0                   0
## 5                          0                        0                   0
## 6                          0                        0                   1
##   positionCentre-Forward positionDefender positionDefensive Midfield
## 1                      0                0                          0
## 2                      0                0                          1
## 3                      0                0                          0
## 4                      0                0                          0
## 5                      0                0                          0
## 6                      0                0                          0
##   positionGoalkeeper positionLeft-Back positionLeft Midfield
## 1                  0                 0                     0
## 2                  0                 0                     0
## 3                  0                 1                     0
## 4                  0                 0                     0
## 5                  0                 0                     0
## 6                  0                 0                     0
##   positionLeft Winger positionmidfield positionRight-Back
## 1                   0                0                  0
## 2                   0                0                  0
## 3                   0                0                  0
## 4                   0                0                  0
## 5                   1                0                  0
## 6                   0                0                  0
##   positionRight Midfield positionRight Winger positionSecond Striker
## 1                      0                    0                      0
## 2                      0                    0                      0
## 3                      0                    0                      0
## 4                      0                    1                      0
## 5                      0                    0                      0
## 6                      0                    0                      0
str(dfkmeans)
## 'data.frame':    1290352 obs. of  29 variables:
##  $ market_value_in_eur       : num  1329310 1615000 1002273 950000 10717073 ...
##  $ minutes_played            : int  45 90 90 90 45 90 90 90 45 45 ...
##  $ goals                     : int  0 0 2 0 1 0 0 0 0 1 ...
##  $ assists                   : int  0 0 1 0 0 0 0 0 3 1 ...
##  $ yellow_cards              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ home_club_goals           : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ away_club_goals           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ África                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ América                   : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Asia                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Europa                    : num  1 0 1 1 1 1 1 1 1 0 ...
##  $ Oceanía                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Otro                      : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ positionAttack            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ positionAttacking Midfield: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ positionCentral Midfield  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ positionCentre-Back       : num  1 0 0 0 0 1 0 1 0 0 ...
##  $ positionCentre-Forward    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ positionDefender          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ positionDefensive Midfield: num  0 1 0 0 0 0 0 0 0 0 ...
##  $ positionGoalkeeper        : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ positionLeft-Back         : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ positionLeft Midfield     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ positionLeft Winger       : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ positionmidfield          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ positionRight-Back        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ positionRight Midfield    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ positionRight Winger      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ positionSecond Striker    : num  0 0 0 0 0 0 0 0 0 0 ...

As we can see, the transformation was performed successfully. The next step is to normalize the dataset.
It is important to note that the dummy variables should not be scaled, so we exclude them from normalization.

More generally, binary variables should NOT be normalized; the only other binary variable in the data, red_cards, was not selected in the first place (see the toy illustration below).

Reference – Nº1
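
A toy illustration (synthetic values, not from the dataset) of why scaling a rare binary flag is harmful: the lone 1 is pushed almost ten standard deviations from the mean, letting it dominate any distance computation.

flag <- c(rep(0, 99), 1)   # a rare binary event, e.g. a red card
scaled_flag <- scale(flag)
range(scaled_flag)         # approx. -0.1 and 9.9: the single 1 becomes an extreme outlier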

numerical_vars <- c("market_value_in_eur", "minutes_played", "goals", "assists", "yellow_cards", "home_club_goals", "away_club_goals")
dfkmeans[numerical_vars] <- scale(dfkmeans[numerical_vars])
head(dfkmeans)
##   market_value_in_eur minutes_played      goals    assists yellow_cards
## 1          -0.4301239     -0.8649827 -0.2911196 -0.2665436   -0.4075265
## 2          -0.3966584      0.6637674 -0.2911196 -0.2665436   -0.4075265
## 3          -0.4684328      0.6637674  5.7062225  3.2022543   -0.4075265
## 4          -0.4745560      0.6637674 -0.2911196 -0.2665436   -0.4075265
## 5           0.6695507     -0.8649827  2.7075514 -0.2665436   -0.4075265
## 6          -0.4159864      0.6637674 -0.2911196 -0.2665436   -0.4075265
##   home_club_goals away_club_goals África América Asia Europa Oceanía Otro
## 1        3.323162       -1.005735      0       0    0      1       0    0
## 2        3.323162       -1.005735      0       0    0      0       0    1
## 3        3.323162       -1.005735      0       0    0      1       0    0
## 4        3.323162       -1.005735      0       0    0      1       0    0
## 5        3.323162       -1.005735      0       0    0      1       0    0
## 6        3.323162       -1.005735      0       0    0      1       0    0
##   positionAttack positionAttacking Midfield positionCentral Midfield
## 1              0                          0                        0
## 2              0                          0                        0
## 3              0                          0                        0
## 4              0                          0                        0
## 5              0                          0                        0
## 6              0                          0                        0
##   positionCentre-Back positionCentre-Forward positionDefender
## 1                   1                      0                0
## 2                   0                      0                0
## 3                   0                      0                0
## 4                   0                      0                0
## 5                   0                      0                0
## 6                   1                      0                0
##   positionDefensive Midfield positionGoalkeeper positionLeft-Back
## 1                          0                  0                 0
## 2                          1                  0                 0
## 3                          0                  0                 1
## 4                          0                  0                 0
## 5                          0                  0                 0
## 6                          0                  0                 0
##   positionLeft Midfield positionLeft Winger positionmidfield positionRight-Back
## 1                     0                   0                0                  0
## 2                     0                   0                0                  0
## 3                     0                   0                0                  0
## 4                     0                   0                0                  0
## 5                     0                   1                0                  0
## 6                     0                   0                0                  0
##   positionRight Midfield positionRight Winger positionSecond Striker
## 1                      0                    0                      0
## 2                      0                    0                      0
## 3                      0                    0                      0
## 4                      0                    1                      0
## 5                      0                    0                      0
## 6                      0                    0                      0
colnames(dfkmeans)
##  [1] "market_value_in_eur"        "minutes_played"            
##  [3] "goals"                      "assists"                   
##  [5] "yellow_cards"               "home_club_goals"           
##  [7] "away_club_goals"            "África"                    
##  [9] "América"                    "Asia"                      
## [11] "Europa"                     "Oceanía"                   
## [13] "Otro"                       "positionAttack"            
## [15] "positionAttacking Midfield" "positionCentral Midfield"  
## [17] "positionCentre-Back"        "positionCentre-Forward"    
## [19] "positionDefender"           "positionDefensive Midfield"
## [21] "positionGoalkeeper"         "positionLeft-Back"         
## [23] "positionLeft Midfield"      "positionLeft Winger"       
## [25] "positionmidfield"           "positionRight-Back"        
## [27] "positionRight Midfield"     "positionRight Winger"      
## [29] "positionSecond Striker"

We attempt to apply the elbow, gap statistic, and silhouette methods to choose the number of clusters.

dfnumeric <- dfkmeans %>%
  select(market_value_in_eur, minutes_played, goals, assists, yellow_cards, 
         home_club_goals, away_club_goals)

# Clean the data: flag non-finite values and drop incomplete rows
dfnumeric[!is.finite(as.matrix(dfnumeric))] <- NA
dfnumeric <- na.omit(dfnumeric)

# Compute WSS for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(dfnumeric, k, nstart = 10)$tot.withinss
})

# Elbow plot
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters K", 
     ylab = "Sum of Squared Distances (WSS)")

Since we have an overwhelming amount of data, we also run a test on a 10,000-row sample to check whether the trend remains consistent.

On the full dataset we are unable to compute the silhouette or gap statistic, as R fails with a vector allocation error (9556.0 GB).
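
Both statistics can, however, be estimated on a subsample. A minimal sketch of the silhouette part, assuming dfnumeric as built above and the cluster package (the 10,000-row sample size and seed are illustrative):

library(cluster)
set.seed(123)
idx <- sample(nrow(dfnumeric), 10000)
dsample <- dfnumeric[idx, ]
d <- dist(dsample)  # ~0.4 GB at this sample size, still manageable

avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(dsample, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, 3])  # mean silhouette width per k
})
avg_sil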

wss <- sapply(1:10, function(k) {
  km <- kmeans(dfnumeric, centers = k, nstart = 10, iter.max = 100)  
  return(km$tot.withinss)
})
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517550)
plot(1:10, wss, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of Clusters K", ylab = "Sum of Squared Distances")

#avg_sil <- sapply(2:10, function(k) {
#  km <- kmeans(dfnumeric, centers = k, nstart = 10, iter.max = 100)
#  ss <- silhouette(km$cluster, dist(dfnumeric))  # requires the cluster package
#  return(mean(ss[, 3]))
#})

#plot(2:10, avg_sil, type = "b", pch = 19, frame = FALSE, 
#     xlab = "Number of Clusters K", ylab = "Average Silhouette Width")

On the sampled data, the average silhouette score suggests that between 3 and 5 clusters could be optimal.
To work around the computational constraints, we continue the analysis on the principal components.
To better understand the explained variance, we inspect the loadings (the rotation matrix returned by prcomp).

# Coerce the selected variables to numeric
X <- dfkmeans[, numerical_vars, drop = FALSE]
X <- data.frame(lapply(X, function(col) as.numeric(as.character(col))))

# Replace Inf/-Inf with NA
is_bad <- !is.finite(as.matrix(X))
if (any(is_bad)) X[is_bad] <- NA

# Drop columns that are entirely NA
keep_cols <- colSums(!is.na(X)) > 0
X <- X[, keep_cols, drop = FALSE]

# Drop zero-variance (constant) columns
zero_var <- sapply(X, function(v) var(v, na.rm = TRUE) == 0)
if (any(zero_var)) X <- X[, !zero_var, drop = FALSE]

# Impute remaining NAs with the column median
for (j in seq_along(X)) {
  if (anyNA(X[[j]])) X[[j]][is.na(X[[j]])] <- median(X[[j]], na.rm = TRUE)
}

pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6    PC7
## Standard deviation     1.1589 1.0499 1.0287 0.9756 0.9583 0.9366 0.8655
## Proportion of Variance 0.1918 0.1575 0.1512 0.1360 0.1312 0.1253 0.1070
## Cumulative Proportion  0.1918 0.3493 0.5005 0.6365 0.7677 0.8930 1.0000
head(pca$x)
##             PC1        PC2       PC3        PC4        PC5        PC6
## [1,] 0.05826368 -1.8421522 -2.649327  0.8788404 -0.7291070 -0.4595283
## [2,] 0.53650414 -0.9294873 -2.905706  0.6853809 -0.3917896  0.4828133
## [3,] 5.42672491 -1.5973534 -2.742469  1.3812623 -2.1059689  1.3816755
## [4,] 0.50459458 -0.9199009 -2.910954  0.7423355 -0.4170827  0.5122130
## [5,] 2.12256631 -2.2186391 -2.490408 -0.4737798 -2.2121820 -0.2350347
## [6,] 0.52858673 -0.9271088 -2.907008  0.6995125 -0.3980653  0.4901080
##             PC7
## [1,] -1.1743148
## [2,] -1.5885355
## [3,]  2.8146782
## [4,] -1.5750277
## [5,]  0.1270253
## [6,] -1.5851840
pca$rotation
##                            PC1         PC2           PC3         PC4        PC5
## market_value_in_eur 0.40963485 -0.12306433  0.0673593525 -0.73114769  0.3246979
## minutes_played      0.30386375  0.59969455 -0.1691798405 -0.11054217  0.2135412
## goals               0.53818404 -0.08042097  0.0282942421 -0.18294613 -0.6136507
## assists             0.48776163 -0.05603862 -0.0004666133  0.50178595  0.5735126
## yellow_cards        0.05601611  0.71161435 -0.1647153116  0.09585773 -0.2402869
## home_club_goals     0.33651952 -0.32900863 -0.6266257506  0.26663427 -0.2181408
## away_club_goals     0.30972480  0.03117843  0.7390867014  0.29607856 -0.1953570
##                             PC6        PC7
## market_value_in_eur -0.37741526 -0.1734048
## minutes_played       0.62467488 -0.2671579
## goals                0.21327030  0.4975632
## assists             -0.11741296  0.4055322
## yellow_cards        -0.62561185  0.0708478
## home_club_goals     -0.11163841 -0.5046838
## away_club_goals     -0.06492827 -0.4762423

We plot the loadings to visualize each variable’s contribution to the first three components. The loadings data frame is built directly from pca$rotation so that the plot stays consistent with the output above.

variables <- rownames(pca$rotation)

loadings <- data.frame(
  Variable = rep(variables, times = 3),
  PC = rep(c("PC1", "PC2", "PC3"), each = length(variables)),
  Loading = c(pca$rotation[, 1], pca$rotation[, 2], pca$rotation[, 3])
)

ggplot(loadings, aes(x = Variable, y = Loading, fill = PC)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Variable loadings on the first 3 principal components",
       x = "Variables",
       y = "Loadings") +
  scale_fill_manual(values = c("#FF6347", "#4682B4", "#32CD32")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • PC1: Dominated by goals, assists, and market_value_in_eur, indicating a component associated with offensive output and player value.
  • PC2: Dominated by yellow_cards and minutes_played, suggesting a component linked to playing volume and discipline.
  • PC3: Dominated by the contrast between away_club_goals and home_club_goals, separating the home/away match context.
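
These readings can be checked programmatically. A small sketch that lists, for each of the first three components, the two variables with the largest absolute loadings:

# Top two absolute loadings per component, taken from the rotation matrix above
apply(abs(pca$rotation[, 1:3]), 2, function(v) names(sort(v, decreasing = TRUE))[1:2])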

We then run the WSS (elbow method) again to estimate the optimal number of clusters. However, the curve remains nearly flat, making it difficult to pinpoint the number of clusters.
Since the first three principal components capture about half of the variance, we adopt three clusters as a reasonable starting point.

library(dplyr)

set.seed(123)

# 1) Select variables and coerce them safely to numeric
dfnumeric <- dfkmeans %>%
  select(market_value_in_eur, minutes_played, goals, assists,
         yellow_cards, home_club_goals, away_club_goals) %>%
  mutate(across(everything(), ~ suppressWarnings(as.numeric(as.character(.)))))

# 2) Replace Inf/-Inf with NA
is_bad <- !is.finite(as.matrix(dfnumeric))
if (any(is_bad, na.rm = TRUE)) dfnumeric[is_bad] <- NA

# 3) Drop columns that are entirely NA
keep_cols <- colSums(!is.na(dfnumeric)) > 0
dfnumeric <- dfnumeric[, keep_cols, drop = FALSE]

# 4) Drop zero-variance (constant) columns
zero_var <- sapply(dfnumeric, function(v) var(v, na.rm = TRUE) == 0)
if (any(zero_var)) dfnumeric <- dfnumeric[, !zero_var, drop = FALSE]

# 5) Simple median imputation (avoids dropping rows)
for (j in seq_along(dfnumeric)) {
  if (anyNA(dfnumeric[[j]])) {
    med <- median(dfnumeric[[j]], na.rm = TRUE)
    dfnumeric[[j]][is.na(dfnumeric[[j]])] <- med
  }
}

# Early sanity checks
if (ncol(dfnumeric) < 1) stop("No valid columns remain after cleaning.")
if (nrow(dfnumeric) < 2) stop("Not enough rows for PCA/K-means.")

# 6) PCA (centered and scaled)
pca <- prcomp(dfnumeric, center = TRUE, scale. = TRUE)

# Ensure the requested number of PCs exists
num_pcs <- min(3, ncol(pca$x))
pca_data <- pca$x[, 1:num_pcs, drop = FALSE]

# 7) Elbow method for K-means (cap K at the number of rows)
k_max <- min(10, max(1, nrow(pca_data) - 1))
if (k_max < 1) stop("Not enough observations for K-means.")

wss <- sapply(1:k_max, function(k) {
  km <- kmeans(pca_data, centers = k, nstart = 25, iter.max = 100)
  km$tot.withinss
})
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
plot(1:k_max, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters K",
     ylab = "Sum of Squared Distances (WSS)")

We assess the separation between clusters by analyzing the distances between their centroids.

set.seed(123)
kmeans_result <- kmeans(pca_data[, 1:3], centers = 3, nstart = 25)

pca_data_with_clusters <- data.frame(pca_data, cluster = kmeans_result$cluster)
centroids <- kmeans_result$centers

centroid_distances <- dist(centroids)
print(as.matrix(centroid_distances))
##          1        2        3
## 1 0.000000 3.201442 2.410514
## 2 3.201442 0.000000 2.503221
## 3 2.410514 2.503221 0.000000

We plot the clusters.

ggplot(pca_data_with_clusters, aes(x = PC1, y = PC2, color = as.factor(cluster))) +
  geom_point() +
  labs(title = "Clustering con k-means (Euclidiano) sobre las 3 primeras componentes principales",
       x = "Componente Principal 1", 
       y = "Componente Principal 2",
       color = "Cluster") +
  theme_minimal()

1.11 Quality Measures of the Three-Cluster Model: Analysis and Discussion
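
Before the qualitative discussion, the basic internal quality measures of the fit can be displayed directly from the kmeans object (a short sketch using kmeans_result from above):

kmeans_result$size                              # observations per cluster
kmeans_result$withinss                          # within-cluster sum of squares
kmeans_result$betweenss / kmeans_result$totss   # share of total variance between clusters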

Cluster 1: Low-Performance Teams

  • Interpretation: This cluster has low values in both principal components, suggesting that the teams grouped here have overall poor performance. These teams may score few goals both at home and away, while conceding many, which could indicate a weak defense or offensive struggles.
  • Possible characteristics:
    • Goals scored: Low number of goals both home and away.
    • Goals conceded: High number of goals conceded, indicating a weak defense.
    • League position: Likely to occupy the lower positions in the table.
  • Possible team types: Newly promoted teams, squads with weaker rosters, or teams undergoing a rebuilding season.
  • Recommended action: These teams may benefit from reinforcing both defense and attack, improving offensive tactics, or working on consistency.

Cluster 2: Teams with Strong Offensive Performance

  • Interpretation: This cluster has high values on the first component, related to offensive performance. This suggests that teams in this group have strong offensive capacity, scoring many goals, especially at home. However, moderate values on the second component may indicate that these teams are less effective away from home or have lower game volume in certain contexts.
  • Possible characteristics:
    • Goals scored: High number of goals, with a tendency to be more effective at home.
    • Style of play: Likely to play aggressively and offensively, seeking to score quickly.
    • League position: These teams may occupy mid to upper positions in the table.
  • Possible team types: Teams with star attacking players and a strong offensive lineup but potentially vulnerable defensively.
  • Recommended action: These teams should focus on improving consistency away from home and balancing their offensive strategy with solid defense to compete for top spots.

Cluster 3: Defensive Teams with Moderate Attack

  • Interpretation: This cluster has low values on the first component (offensive performance) but moderate values on the second component. This suggests that teams in this group emphasize strong defense but have weaker goal-scoring capacity compared to others. They likely prioritize defensive solidity and counterattacking opportunities.
  • Possible characteristics:
    • Goals scored: Few goals, but supported by strong defense.
    • Goals conceded: Few goals conceded, highlighting defensive robustness.
    • Style of play: Tactical focus on defending and exploiting counterattacks.
    • League position: Likely mid-table teams that secure points with defensive strength and counterplay.
  • Possible team types: Teams prioritizing defensive organization over offense, often guided by coaches with a defensive philosophy.
  • Recommended action: These teams may need to improve offensive capacity to be more competitive in difficult matches, while maintaining their defensive structure.

Next step: We will analyze the model with 5 clusters and 6 principal components, which together explain roughly 90% of the variance.

set.seed(123)

dfnumeric <- dfkmeans %>%
  select(market_value_in_eur, minutes_played, goals, assists,
         yellow_cards, home_club_goals, away_club_goals) %>%
  mutate(across(everything(), ~ suppressWarnings(as.numeric(as.character(.)))))

cat("Filas con NA:", sum(!complete.cases(dfnumeric)), "\n")
## Filas con NA: 1
print(colSums(is.na(dfnumeric)))
## market_value_in_eur      minutes_played               goals             assists 
##                   1                   1                   1                   1 
##        yellow_cards     home_club_goals     away_club_goals 
##                   1                   1                   1
cat("Inf/-Inf presentes?:", any(is.infinite(as.matrix(dfnumeric))), "\n")
## Inf/-Inf presentes?: FALSE
is_bad <- !is.finite(as.matrix(dfnumeric))
if (any(is_bad, na.rm = TRUE)) dfnumeric[is_bad] <- NA

keep_cols <- colSums(!is.na(dfnumeric)) > 0
dfnumeric <- dfnumeric[, keep_cols, drop = FALSE]
zero_var <- sapply(dfnumeric, function(v) var(v, na.rm = TRUE) == 0)
if (any(zero_var, na.rm = TRUE)) {
  dfnumeric <- dfnumeric[, !zero_var, drop = FALSE]
}
for (j in seq_along(dfnumeric)) {
  if (anyNA(dfnumeric[[j]])) {
    med <- median(dfnumeric[[j]], na.rm = TRUE)
    dfnumeric[[j]][is.na(dfnumeric[[j]])] <- med
  }
}
stopifnot(!anyNA(dfnumeric))
stopifnot(all(is.finite(as.matrix(dfnumeric))))
stopifnot(nrow(dfnumeric) >= 6)  # so that requesting 6 PCs makes sense
pca <- prcomp(dfnumeric, center = TRUE, scale. = TRUE)
num_pcs <- min(6, ncol(pca$x))
pca_data <- pca$x[, 1:num_pcs, drop = FALSE]  # drop = FALSE keeps the matrix structure
k_req <- 5
k_ok  <- min(k_req, max(1, nrow(pca_data) - 1))
if (k_ok < k_req) message("Adjusting centers from ", k_req, " to ", k_ok, " because there are too few rows.")

kmeans_result_6pcs <- kmeans(pca_data[, 1:num_pcs, drop = FALSE],
                             centers = k_ok, nstart = 25, iter.max = 100)

pca_data_with_clusters_6pcs <- data.frame(pca_data, cluster = kmeans_result_6pcs$cluster)

k_max <- min(10, max(1, nrow(pca_data) - 1))
wss <- sapply(1:k_max, function(k) kmeans(pca_data, centers = k, nstart = 25)$tot.withinss)
## Warning: did not converge in 10 iterations
plot(1:k_max, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters K",
     ylab = "Sum of Squared Distances (WSS)")

We examine the distances between the centroids.

centroids_6pcs <- kmeans_result_6pcs$centers
centroid_distances_6pcs <- dist(centroids_6pcs)
print(as.matrix(centroid_distances_6pcs))
##          1        2        3        4        5
## 1 0.000000 3.807327 4.206629 3.386123 2.137755
## 2 3.807327 0.000000 4.626194 4.062461 3.217859
## 3 4.206629 4.626194 0.000000 4.406969 3.639814
## 4 3.386123 4.062461 4.406969 0.000000 2.796418
## 5 2.137755 3.217859 3.639814 2.796418 0.000000
ggplot(pca_data_with_clusters_6pcs, aes(x = PC1, y = PC2, color = as.factor(cluster))) +
  geom_point() +
  labs(title = "Clustering con k-means (6 PCAs explicando el 90% de la varianza)",
       x = "Componente Principal 1", 
       y = "Componente Principal 2",
       color = "Cluster") +
  theme_minimal()

1.12 Quality Measures of the Five-Cluster Model: Analysis and Discussion

Cluster 1: Low Overall Performance Teams

  • Interpretation: This cluster shows low values in both principal components (PC1 and PC2), indicating weak performance in both attack and defense. Teams in this group may struggle to remain competitive in their leagues.
  • Possible characteristics:
    • Goals scored: Few goals both at home and away.
    • Weak defense: High number of goals conceded.
    • Minutes played: Potential inconsistency among starters.
    • League position: Likely to be in the lower part of the table.
  • Possible team types:
    • Newly promoted teams with less competitive squads.
    • Teams undergoing rebuilding or experiencing internal turmoil.
  • Recommended action:
    • Reinforce the squad, especially in attack and defense.
    • Improve tactical organization to reduce goals conceded and create more chances.

Cluster 2: Highly Offensive Teams

  • Interpretation: This cluster has high values on PC1 (offensive performance) but intermediate values on PC2. This suggests teams with strong attacking capabilities but potential defensive vulnerabilities. They tend to score many goals, especially at home.
  • Possible characteristics:
    • Goals scored: Very high offensive output.
    • Style of play: Aggressive approach, aiming to score quickly and maintain possession.
    • League position: Mid-to-upper table.
  • Possible team types:
    • Teams with star forwards or an attack-heavy lineup.
    • Teams that constantly push forward, leaving spaces at the back.
  • Recommended action:
    • Improve defensive solidity to be more consistent in key matches.
    • Balance the attacking style with a more robust away-game strategy.

Cluster 3: Balanced Teams

  • Interpretation: This cluster shows moderate values on both principal components, suggesting a balanced performance between attack and defense. These teams are consistent and tend to secure key points, though they may not excel in a single dimension.
  • Possible characteristics:
    • Goals for/against: Moderate levels, with a healthy ratio.
    • Style of play: Balanced approach, focused on control.
    • League position: Mid-table but competitive against stronger opponents.
  • Possible team types:
    • Tactically well-organized sides.
    • Squads blending young talents and experienced players.
  • Recommended action:
    • Increase offensive aggression to compete for higher positions.
    • Maintain the balanced structure to secure points against weaker rivals.

Cluster 4: Defensively Strong Teams

  • Interpretation: This cluster shows low values on PC1 (attack) but high values on PC2, suggesting a solid defensive focus. These teams prioritize preventing goals over scoring them.
  • Possible characteristics:
    • Goals conceded: Very low, indicating robust defense.
    • Goals scored: Lower, but sufficient to win key points.
    • Style of play: Counterattacks and organized defending.
    • League position: Mid-table or even higher if consistency is sustained.
  • Possible team types:
    • Teams managed by defense-oriented coaches.
    • Squads with standout goalkeepers and defenders.
  • Recommended action:
    • Increase attacking capacity to be more competitive in tight games.
    • Invest in creative midfielders to generate chances.

Cluster 5: Irregular or Transitional Teams

  • Interpretation: This cluster has intermediate values on both components but with noticeable dispersion. It may represent teams in transition or with inconsistent performance.
  • Possible characteristics:
    • Goals scored/conceded: Highly variable across matches.
    • Style of play: Hard to define; mixed outcomes.
    • League position: Lower mid-table with fluctuations.
  • Possible team types:
    • Squads under reconstruction or with consistency issues.
    • Teams with key injuries or frequent rotations.
  • Recommended action:
    • Identify weak areas (attack, defense, midfield) and reinforce them.
    • Work on consistency and team mentality to improve performance.
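
As a rough numeric comparison of the two models (a sketch; kmeans_result is the 3-cluster/3-PC fit and kmeans_result_6pcs the 5-cluster/6-PC fit from above):

kmeans_result$betweenss / kmeans_result$totss            # 3 clusters on 3 PCs
kmeans_result_6pcs$betweenss / kmeans_result_6pcs$totss  # 5 clusters on 6 PCs

Note that the two ratios are computed in different PC spaces, so they are indicative rather than strictly comparable.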

1.13 Conclusions

  • The vector allocation error (9556.0 GB) reflects computational constraints, highlighting the need for dimensionality reduction (e.g., PCA) and optimized methods for large-scale data.

  • PCA explains a large proportion of variance (~50% with 3 PCs and ~90% with 6 PCs), but inevitably loses some information from the original variables. This may affect clustering reliability if the principal components do not capture all relevant relationships.

  • Clustering on very large datasets can be sensitive to noise or spurious patterns. Using PCA to explain 90% of variance with 6 PCs improved segmentation, though some overlap remains in 2D projections (PC1 vs PC2). Computational limits prevented 3D visualization with this data volume.

  • Random sampling tests did not show significant changes in trend, suggesting model consistency. However, this does not guarantee full reliability, as noise or dataset biases can still influence the resulting clusters.

  • With 3 PCs (~50% explained):
    General patterns are identifiable, but there is substantial overlap among clusters, indicating that 3 PCs are insufficient to capture dataset complexity.

  • With 6 PCs (~90% explained):
    Segmentation improves significantly in the multidimensional space. Nonetheless, 2D projections still show some overlap, reflecting the complex, high-dimensional nature of the data.

  • Overall assessment:
    Given the dataset’s complexity and size, clustering remains valid when combined with dimensionality reduction capturing at least ~90% variance. However, perfect group separability is not guaranteed—especially if the original data contain non-linear relationships or features not represented by PCA.

Given the complexity and scale of the dataset, clustering is NOT the most suitable standalone method.

1.14 Exercise Guidelines

  1. Apply the unsupervised k-means algorithm—based on distances to group means—on both the original and normalized data. Use quantitative or binary variables only. Decide whether groups are defined from normalized variables and select the number of clusters that best fits the data.

1.14.1 Section 1: Applying K-means with Original and Normalized Data

This section implements unsupervised k-means clustering with the following steps and key decisions:

Selection of quantitative/binary variables:
Variables chosen include:
- market_value_in_eur
- minutes_played
- goals
- assists
- yellow_cards
- home_club_goals
- away_club_goals

These are quantitative and suitable for meaningful segmentation.

Data normalization:
To mitigate scale differences, we applied scale() so variables have mean 0 and SD 1—preventing high-magnitude variables from dominating distance calculations.

Decision on data version:
We proceeded only with normalized data to improve interpretability and model stability across differing scales.

Selecting the number of clusters:
We used the elbow method (WSS) and complemented it with PCA to assess explained variance and grouping structure.
Based on results, five clusters provided the best trade-off between explained variance and model simplicity.

Results and visualization:
- WSS plot to justify the chosen k.
- Cluster projections onto principal components (PCA) to inspect group separation.
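The pipeline just described can be sketched as follows, assuming `dfnumeric` holds the seven quantitative variables listed above (the sketch illustrates the method rather than reproducing the exact run):

library(ggplot2)

# Normalize: mean 0, SD 1 per variable
df_scaled <- scale(dfnumeric)

# Elbow method: total within-cluster sum of squares (WSS) for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(df_scaled, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "WSS")

# Final model with the chosen k = 5, inspected on the first two principal components
km  <- kmeans(df_scaled, centers = 5, nstart = 25)
pca <- prcomp(df_scaled)
ggplot(data.frame(pca$x[, 1:2], cluster = factor(km$cluster)),
       aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.6) +
  theme_minimal()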

1.15 K-medians with the previously selected number of clusters

  • Why use Manhattan distance instead of Euclidean in k-medians?
    Bibliography No. 2

K-medians minimizes the sum of absolute deviations (the Manhattan metric), which makes it more robust to outliers and irregular distributions; here it is approximated with PAM (k-medoids) run under the Manhattan metric. Euclidean distance is not used because squared-distance minimization is the defining property of k-means, not of k-medians.

As discussed, we cannot fully follow the guideline with the raw dataset due to extreme size; this exercise attempts something far more complex than classic teaching datasets (e.g., Titanic or Iris petals).

Error from pam(...):

#km_medians <- pam(pca_data, k = 3, metric = "manhattan")  # fails: allocating the full dissimilarity matrix exceeds memory
#print(km_medians$medoids)

The problem is that the k-medians algorithm cannot be implemented directly with clara. To apply k-medians, it is necessary to work with a representative sample of the data.
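A back-of-the-envelope sketch shows why the full dissimilarity matrix is infeasible (the row count is illustrative, of the order of the full dataset):

n <- 1.2e6                  # illustrative number of rows
pairs <- n * (n - 1) / 2    # entries in a full pairwise distance matrix
pairs * 8 / 1024^3          # memory in GiB at 8 bytes per double: ~5,400 GiB

This is the same order of magnitude as the vector-size error cited in the conclusions, which motivates the sampling step below.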

set.seed(123)

# Representative random sample of 20,000 rows from the PCA scores
sample_indices <- sample(1:nrow(pca_data), size = 20000)
pca_sample <- pca_data[sample_indices, ]

# PAM with the Manhattan metric as the k-medians approximation (5 clusters)
kmedians_sample_result <- pam(pca_sample, k = 5, metric = "manhattan")
medoids <- kmedians_sample_result$medoids
rownames(medoids) <- paste0("Cluster ", 1:nrow(medoids))

# Manhattan distances between the five medoids
medoid_distances <- as.matrix(dist(medoids, method = "manhattan"))
print(medoid_distances)
##           Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
## Cluster 1  0.000000  4.670003  4.722068  8.157558  4.795006
## Cluster 2  4.670003  0.000000  5.137844 10.173993  5.430001
## Cluster 3  4.722068  5.137844  0.000000  7.921012  2.781470
## Cluster 4  8.157558 10.173993  7.921012  0.000000  7.521473
## Cluster 5  4.795006  5.430001  2.781470  7.521473  0.000000
pca_sample_with_clusters <- data.frame(pca_sample, cluster = kmedians_sample_result$clustering)

ggplot(pca_sample_with_clusters, aes(x = PC1, y = PC2, color = as.factor(cluster))) +
  geom_point(alpha = 0.6) +  
  geom_point(data = as.data.frame(medoids), aes(x = PC1, y = PC2), color = "black", shape = 8, size = 4) +  
  labs(
    title = "Clustering con k-medians (5 clusters, métrica Manhattan)",
    x = "Componente Principal 1",
    y = "Componente Principal 2",
    color = "Cluster"
  ) +
  theme_minimal()

2 Interpretation of k-medians Clusters, Manhattan Metric

2.1 Cluster 1: Teams with Overall Low Performance

Interpretation: This cluster consists of teams displaying limited effectiveness in both offense and defense, as evidenced by low values in the principal components. These teams typically struggle at the bottom of the league table.

Possible characteristics:
- Goals scored: Very low, both at home and away.
- Defense: Weak, with a high volume of goals conceded.
- Playing time: Irregular participation of starting players.
- League position: Predominantly in the lower ranks.

Recommended action:
- Strengthen both offensive and defensive units.
- Enhance team cohesion and refine tactical organization.

2.2 Cluster 2: Offensive Teams with Defensive Vulnerabilities

Interpretation: This cluster captures teams with strong attacking capabilities, reflected in high PC1 values, but with notable defensive shortcomings indicated by intermediate PC2 values.

Possible characteristics:
- Goals scored: High, particularly in home matches.
- Defense: Vulnerable, conceding frequently in away games.
- Style of play: Aggressive, offense-oriented.
- League position: Commonly positioned in the upper-mid range of the table.

Recommended action:
- Reinforce defensive stability to remain competitive in tightly contested matches.
- Balance offensive strategies with a more cautious approach in away fixtures.

2.3 Cluster 3: Balanced Teams

Interpretation: Teams in this cluster demonstrate a moderate and consistent performance across both offense and defense, without excelling significantly in either aspect.

Possible characteristics:
- Goals scored and conceded: Balanced, with a stable ratio.
- Style of play: Structured and evenly oriented.
- League position: Typically mid-table, competitive against stronger opponents.

Recommended action:
- Increase offensive intensity to challenge for higher league positions.
- Maintain tactical balance when facing weaker teams.

2.4 Cluster 4: Strong Defensive Teams

Interpretation: This cluster is composed of teams that prioritize defensive solidity over offensive output, as reflected in high PC2 and low PC1 values. Their resilience makes them particularly difficult opponents.

Possible characteristics:
- Goals conceded: Very low, signaling strong defensive discipline.
- Goals scored: Moderate, but sufficient to secure results.
- Style of play: Defensive approach, often relying on counterattacks.
- League position: Frequently situated in mid-to-upper standings.

Recommended action:
- Stimulate offensive creativity to secure victories in more demanding fixtures.
- Strengthen attacking options to sustain long-term competitiveness.

2.5 Cluster 5: Inconsistent or Transitional Teams

Interpretation: This cluster includes teams with irregular performance, often in transition or facing structural challenges. Their dispersion in the analysis reflects fluctuating results.

Possible characteristics:
- Goals scored and conceded: Highly variable.
- Style of play: Inconsistent, with significant fluctuations in performance.
- League position: Often placed in the lower-mid range.

Recommended action:
- Identify and reinforce critical areas of weakness.
- Build greater consistency through a clear and stable tactical framework.

2.6 Conclusion

The application of the Manhattan metric in k-medians produces clusters with a broader dispersion, offering clearer distinctions between offensive and defensive profiles. The insights derived from each cluster highlight tailored strategic recommendations to address the strengths and weaknesses observed across teams.

  • Exercise 1
distance_matrix <- matrix(c(0, 4.441345, 4.080367, 3.366489, 2.811969,
                           4.441345, 0, 4.646524, 4.198448, 3.648590,
                           4.080367, 4.646524, 0, 3.778034, 3.212250,
                           3.366489, 4.198448, 3.778034, 0, 2.082440,
                           2.811969, 3.648590, 3.212250, 2.082440, 0), 
                         nrow = 5, ncol = 5, byrow = TRUE)

colnames(distance_matrix) <- rownames(distance_matrix) <- c(1, 2, 3, 4, 5)
distance_matrix
##          1        2        3        4        5
## 1 0.000000 4.441345 4.080367 3.366489 2.811969
## 2 4.441345 0.000000 4.646524 4.198448 3.648590
## 3 4.080367 4.646524 0.000000 3.778034 3.212250
## 4 3.366489 4.198448 3.778034 0.000000 2.082440
## 5 2.811969 3.648590 3.212250 2.082440 0.000000
  • Exercise 2
distance_matrix_clusters <- matrix(c(0.000000, 5.891254, 5.218534, 4.281617, 7.521553,
                                    5.891254, 0.000000, 9.393775, 8.131273, 7.578812,
                                    5.218534, 9.393775, 0.000000, 4.607864, 10.226766,
                                    4.281617, 8.131273, 4.607864, 0.000000, 8.552605,
                                    7.521553, 7.578812, 10.226766, 8.552605, 0.000000), 
                                  nrow = 5, ncol = 5, byrow = TRUE)
colnames(distance_matrix_clusters) <- rownames(distance_matrix_clusters) <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5")
distance_matrix_clusters
##           Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
## Cluster 1  0.000000  5.891254  5.218534  4.281617  7.521553
## Cluster 2  5.891254  0.000000  9.393775  8.131273  7.578812
## Cluster 3  5.218534  9.393775  0.000000  4.607864 10.226766
## Cluster 4  4.281617  8.131273  4.607864  0.000000  8.552605
## Cluster 5  7.521553  7.578812 10.226766  8.552605  0.000000

2.6.1 Comparison between Exercise 1 (k-means) and Exercise 2 (k-medians)

Key Differences in the Results

  • Distance Metric:

  • Exercise 1 (k-means): Uses Euclidean distance, generating geometric centroids that represent the average of the points within each cluster. This leads to more compact clusters with clearly defined boundaries.

  • Exercise 2 (k-medians): Uses Manhattan distance and represents each cluster by a medoid—an actual data point from the dataset. This produces clusters with more irregular shapes, better adapted to the dataset’s structure.

  • Distances Between Clusters:

Exercise 1:
- Distances between clusters are generally smaller due to the geometric nature of Euclidean distance.
- The smallest distance (2.08) is observed between clusters 4 and 5, indicating their close proximity.

Exercise 2:
- Distances between clusters are larger, since the Manhattan metric measures absolute differences and is less sensitive to compactness.
- The largest distance (10.22) is observed between clusters 3 and 5, indicating a stronger separation.

  • Graphs:

Exercise 1:
- Clusters appear denser and have clearly defined boundaries.
- Centroids are calculated as geometric means, resulting in clusters with spherical shapes.

Exercise 2:
- Clusters are more dispersed and exhibit less regular shapes due to the use of medoids as representative points.
- The greater dispersion reflects the robustness of the Manhattan metric against outliers.

  • Numerical Observations

  • Exercise 1 (k-means):
    Smaller inter-cluster distances reflect greater compactness, with values such as 2.08 (clusters 4 and 5).
    This is ideal for datasets with a homogeneous structure or spherical distributions.

  • Exercise 2 (k-medians):
    Larger distances (such as 10.22 between clusters 3 and 5) suggest more separated clusters, which can be useful to identify more pronounced differences in heterogeneous datasets.

2.7 Conclusions

The comparative analysis between k-means and k-medians reveals key differences in how data is grouped and clusters are represented.

In Exercise 1 (k-means), using the Euclidean metric results in compact, spherical clusters that are well-suited for homogeneous datasets with regular distributions. Centroids, computed as geometric means, adapt well to uniform data densities, producing relatively small inter-cluster distances—such as the minimum of 2.08 observed between clusters 4 and 5.

In Exercise 2 (k-medians), based on the Manhattan metric, the method demonstrates its ability to handle dispersed data and outliers. Medoids, being actual points in the dataset, more accurately reflect the intrinsic structure of the data, producing more irregular and dispersed clusters. The larger separation between clusters, such as the maximum distance of 10.22 between clusters 3 and 5, highlights its adaptability to heterogeneous datasets.

While k-means is efficient for well-behaved data, k-medians proves robust against outliers and is better suited for datasets with significant variability.

2.8 Final Reflection

The analysis raises an important consideration about the relationship between the economic value of players and overall team performance. Although clubs with higher-valued squads generally have access to players with greater technical skills and experience, this does not automatically guarantee superior results. The clusters identified in this study show that economically powerful squads can be found both among high-performing teams and among those with tactical or cohesion-related weaknesses.

Key factors such as managerial strategy, tactical organization, player chemistry, and overall club management play a decisive role in team performance. Clubs with fewer financial resources have often achieved success through careful planning and disciplined execution, whereas heavily invested teams have sometimes struggled due to lack of integration and consistency.

Thus, while there is a correlation between financial resources and performance, it is not absolute. Success depends on a combination of factors in which the economic value of players is only one element. This analysis underscores the importance of comprehensive management to maximize performance beyond financial investment.

2.9 Exercise Guidelines

Section 2: Application of k-medians and Comparison with k-means
In this section, the k-medians algorithm was applied using normalized data and the number of clusters determined in the previous section (five clusters). The results were then compared with those obtained using k-means, emphasizing similarities, differences, and the suitability of each method for the dataset.

  • Application of k-medians:

  • Data used: The same normalized data from Section 1 was employed.

  • Number of clusters: Five clusters, previously determined as optimal.

  • Algorithmic approach: Unlike k-means, where cluster centers are defined by means, k-medians computes medians to determine cluster centers. This makes k-medians more robust to outliers, as medians are less sensitive to extreme values.

  • Comparison of results:

  • Similarities:
    Both methods identified five consistent clusters in terms of general groupings.
    Cluster assignments were broadly similar, with only minor variations at group boundaries.

  • Differences:
    K-medians demonstrated greater stability in the presence of outliers, particularly in variables such as goals and assists, where some records were significantly above the average.
    The cluster centers generated by k-medians are more representative of the median characteristics of each group, which is advantageous in non-normal distributions.

  • Evaluation metric:
    When evaluating the within-cluster sum of squares (WSS), k-means reported slightly lower values, as expected, since it directly optimizes this metric.
    However, k-medians achieved a more robust segmentation in the presence of outliers—an important advantage given the characteristics of the dataset.

  • Conclusion:

  • k-means: More appropriate for datasets without significant outliers, or when the goal is to directly minimize within-cluster variance.

  • k-medians: Preferable in this analysis due to the presence of outliers in key variables such as goals and assists, making medians more representative of actual groupings.

Based on these observations, k-medians is considered the most suitable method for this dataset, as it produces clusters that are more robust and representative without being overly influenced by extreme values.


3 Exercise 3

3.1 Clustering is obtained using k-means (with the groups selected in Exercise 1), but applying a different distance metric.

Why can’t Manhattan distance be used in k-means?

In k-means, Euclidean distance is used because the algorithm relies on computing centroids as geometric means. This approach requires the distance metric to be continuous and differentiable, which enables efficient optimization of point-to-cluster assignments. Manhattan distance, being a sum of absolute values, is not differentiable at certain points (sign changes), which prevents the precise calculation of geometric centroids needed for k-means to function properly.
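A small numeric illustration of this point (a self-contained sketch, independent of the dataset): the mean minimizes the sum of squared deviations, while the median minimizes the sum of absolute deviations, so a mean-based centroid update is inconsistent with the Manhattan metric.

x <- c(1, 2, 3, 4, 100)       # one extreme value
centers <- seq(0, 110, by = 0.5)
sq_loss  <- sapply(centers, function(m) sum((x - m)^2))   # Euclidean-style loss
abs_loss <- sapply(centers, function(m) sum(abs(x - m)))  # Manhattan-style loss
centers[which.min(sq_loss)]   # 22 = mean(x): the center k-means computes
centers[which.min(abs_loss)]  # 3 = median(x): the center the Manhattan metric calls for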

How can this be improved?

When Manhattan distance is required, it is better to use alternative methods such as CLARA or k-medians, which rely on medoids (actual data points) instead of geometric centroids. These techniques are more robust to outliers and do not require differentiable metrics, making them well-suited for situations where Manhattan distance is more appropriate.

Bibliography No. 3

“When attempting to use Manhattan distance with k-means, the algorithm’s behavior changes. This occurs because k-means is not optimized for non-Euclidean distances and may fail to converge correctly or produce suboptimal results. Since Manhattan distance is not differentiable at certain points, it is incompatible with the way centroids are computed in k-means.”

set.seed(123)
# CLARA: k-medoids on repeated subsamples, making the Manhattan metric feasible on the full dataset
clara_result <- clara(pca_data, k = 5, metric = "manhattan", samples = 5)

medoids <- clara_result$medoids

# Manhattan distances between the five medoids
medoid_distances <- as.matrix(dist(medoids, method = "manhattan"))
print("Distancias Manhattan entre los medoids:")
## [1] "Distancias Manhattan entre los medoids:"
print(medoid_distances)
##          1261567   249482  1058355   284155   893438
## 1261567 0.000000 6.660662 3.368143 2.525407 5.152314
## 249482  6.660662 0.000000 8.500112 6.313839 9.463945
## 1058355 3.368143 8.500112 0.000000 5.315538 6.364762
## 284155  2.525407 6.313839 5.315538 0.000000 5.901253
## 893438  5.152314 9.463945 6.364762 5.901253 0.000000
pca_data_with_clusters <- data.frame(pca_data, cluster = clara_result$clustering)

ggplot(pca_data_with_clusters, aes(x = PC1, y = PC2, color = as.factor(cluster))) +
  geom_point(alpha = 0.6) +
  geom_point(data = as.data.frame(medoids), aes(x = PC1, y = PC2), color = "black", shape = 8, size = 4) +
  labs(
    title = "Clustering con CLARA (5 clusters, métrica Manhattan)",
    x = "Componente Principal 1",
    y = "Componente Principal 2",
    color = "Cluster"
  ) +
  theme_minimal()

3.2 Comparison with the Results from Exercise 1

  • Differences between CLARA (Exercise 3) and k-means (Exercise 1)

  • Distance Metric:

  • Exercise 1 (k-means): Used Euclidean distance, generating geometric centroids that represent the average of the points within each cluster.

  • Exercise 3 (CLARA): Used Manhattan distance, generating medoids (actual data points) as cluster representatives.

  • Cluster Distribution:

  • With CLARA, clusters exhibit greater dispersion in space, as shown in the visualization. Manhattan distance measures absolute differences, which allows grouping points along non-linear trajectories.

  • With k-means, clusters are more compact with well-defined boundaries due to the geometric nature of the Euclidean metric.

  • Distances Between Clusters:

  • In CLARA, Manhattan distances between medoids reflect greater separation between clusters; the largest value in the printed matrix is 9.46, between the second and fifth medoids.

  • In k-means, Euclidean distances between centroids are smaller, as in clusters 4 and 5 in Exercise 1 (2.08).

  • Robustness to Outliers:

  • CLARA, by using Manhattan distance, is more robust to outliers, producing clusters with less regular shapes but more representative of the dataset’s variability.

3.3 Conclusions

  • Metric and Method: CLARA is an effective alternative when the dataset is large and the use of metrics such as Manhattan is required, as it is robust to outliers and dispersed data. This differentiates it from k-means, which assumes spherical and homogeneous distributions.

  • Cluster Separation: The Manhattan metric produces clusters with greater separation (higher distance values) and more irregular shapes, which can be useful for identifying patterns in heterogeneous datasets.

  • Applications: K-means is more efficient for regular datasets, whereas CLARA with Manhattan is ideal for data with non-linear trajectories or asymmetric distributions.

3.4 Final Reflection

After analyzing the applied methods, it becomes clear that k-means with Euclidean distance is the most appropriate option for this complete dataset. While alternatives such as k-medians or CLARA with Manhattan distance are useful and offer advantages in certain contexts, their reliance on sampling introduces inherent bias. Even when the sample is representative, it does not guarantee that all relationships and patterns in the full dataset are captured.

By applying Euclidean k-means to the full dataset:

  • No dependence on sampling: All data points are included, ensuring that no rare but relevant subgroups or patterns are missed.
  • Compatibility with scaling: Since the data were normalized, Euclidean distance is particularly suitable for analyzing relationships in multidimensional spaces when variables are on the same scale.
  • Reproducible and generalizable results: Working with the complete dataset ensures that clusters reflect global patterns, avoiding potential biases introduced by samples.

Although CLARA and k-medians are robust to outliers and scalable for large datasets, their dependence on samples means that some dataset characteristics may be overlooked. This prevents the clusters from being fully representative.

Therefore, Euclidean k-means is not only more straightforward and efficient in this case but also provides a comprehensive view of the dataset, making it the best option for this analysis.

3.5 Exercise Guidelines

Section 3: Training k-means with a Different Distance Metric

In this section, the k-means algorithm was trained again, but with a change in the distance metric, to compare the new results with those obtained in the original analysis (Euclidean distance).

  • Limitations of Using Alternative Metrics in k-means

  • K-means is specifically designed to use Euclidean distance, since centroids are computed as geometric means. This requires a continuous and differentiable metric to optimize the clusters effectively.

  • Manhattan distance, while useful in other methods such as CLARA or k-medians, is not compatible with k-means because it does not guarantee convergence or consistent results.

  • Implemented Alternative: CLARA

  • Since k-means does not adequately support Manhattan distance, the CLARA (Clustering Large Applications) algorithm was implemented with Manhattan distance to simulate clustering under this metric.

  • This method selects medoids instead of centroids, making it robust to outliers and more adaptable to non-linear data structures.

  • With Euclidean distance (k-means):
    Clusters are more compact with well-defined boundaries due to the geometric nature of the metric.
    Lower dispersion in space reflects a more homogeneous structure in the groups.

  • With Manhattan distance (CLARA):
    Clusters are more dispersed and less regular, adapting better to dataset variability.
    The largest inter-cluster distance (10.22) was observed between clusters 3 and 5, highlighting significant differences.

  • Conclusion

The Euclidean metric is more suitable for regular and homogeneous datasets, while Manhattan (through CLARA) is preferable for data with outliers or non-spherical distributions.
Switching the metric provides new perspectives on the data and cluster structures, though at the cost of convergence guarantees and model simplicity.


4 Exercise 4

4.1 Correct application of the DBSCAN and OPTICS algorithms

The value of k was determined according to the dataset dimensions. In this case, we continue working with the scaled dataframe, using only the numerical variables. While it would be possible to include the dataframe with dummy variables, the purpose of this exercise is to compare the results with those from the previous three exercises; therefore, dummy variables are not included.

We also cite the reference from which the chosen value of k was derived. In our case, with 7 numeric variables, the selected value is k = 7 - 1 = 6.

Bibliography No. 4

ncol(dfnumeric)
## [1] 7
k <- 6  # k = ncol(dfnumeric) - 1 = 7 - 1
kNNdistplot(dfnumeric, k = k)

We generate an additional plot to corroborate the results.

k_distances <- kNNdist(dfnumeric, k = 6)

plot(sort(k_distances, decreasing = TRUE), type = "l",
     main = "Curva de k-distancias para determinar eps",
     xlab = "Puntos ordenados",
     ylab = paste0("Distancia al ", k, "-ésimo vecino"))

In the 6-NN distance plot, most points exhibit very small distances to their nearest neighbors, but a sharp change (the “elbow”) appears toward the end of the curve. This inflection point indicates where distances begin to increase significantly and can be used to select the optimal value of eps.

  • Flat region at the beginning: Represents densely clustered points, where distances to nearest neighbors are small.
  • The elbow: The sharp change in slope indicates the approximate value of eps. In this case, it appears to be around 2.5 to 3.0.
  • Ascending region at the end: Represents scattered or noisy points (outliers), where distances to nearest neighbors are much larger.

Selection of eps value
Based on the plot, an eps value of approximately 2.5 or 3.0 is suitable for initiating clustering with DBSCAN. This value ensures that points within a cluster are densely connected without including excessive noise.
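To make the reading explicit, the candidate values can be overlaid on the k-distance curve (a quick sketch reusing the objects above):

kNNdistplot(dfnumeric, k = 6)
abline(h = c(2.5, 3.0), lty = 2, col = "red")  # candidate eps values read from the elbow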

We now determine epsilon for OPTICS; based on the reachability plot, a value of 3 is selected.

The reachability plot shows three main valleys representing dense clusters in the data, along with several smaller valleys that may correspond to transitional zones, noise, or minor clusters. These smaller valleys reflect gradual changes in density, potentially indicating subgroups or anomalies.

Using epsilon = 3 effectively captures the main clusters without excessive merging, while leaving the smaller valleys as noise or less dense regions. Smaller values of epsilon could provide a more detailed identification of these subgroups.

set.seed(123)
sample_indices <- sample(1:nrow(dfnumeric), size = 20000)  
dfnumeric_sample <- dfnumeric[sample_indices, ]

minPts <- 7  # minPts = k + 1, from the k-distance analysis above
res_sample <- optics(dfnumeric_sample, minPts = minPts)

plot(res_sample, main = "Reachability Plot (OPTICS)", 
     xlab = "Puntos ordenados", ylab = "Distancia de Alcance")

On the 20,000-point sample, extracting clusters at epsilon = 3 yields 4 main clusters and 39 noise points. The dominant cluster (Cluster 1) contains 18,540 points, while others, such as Cluster 4 with only 17 points, represent small groupings. This value of epsilon provides a balanced outcome, capturing both the large cluster and less dense structures.

epsilon <- 3.0 
res <- extractDBSCAN(res_sample, eps_cl = epsilon)

cluster_counts <- table(res$cluster)
print(cluster_counts)
## 
##     0     1     2     3     4 
##    39 18540  1330    74    17
dfnumeric_sample_with_clusters_eps3 <- data.frame(dfnumeric_sample, cluster = factor(res$cluster))

ggplot(dfnumeric_sample_with_clusters_eps3, aes(x = market_value_in_eur, y = goals, color = cluster)) +
    geom_point(alpha = 0.5, size = 1) +
    labs(
        title = "Clústeres Extraídos de OPTICS (eps = 3, sample = 20k)",
        x = "Market Value (EUR)",
        y = "Goals",
        color = "Cluster"
    ) +
    theme_minimal()


ggplot(dfnumeric_sample_with_clusters_eps3, aes(x = market_value_in_eur, y = assists, color = cluster)) +
    geom_point(alpha = 0.5, size = 1) +
    labs(
        title = "Clústeres Extraídos de OPTICS (eps = 3, sample = 20k)",
        x = "Market Value (EUR)",
        y = "assists",
        color = "Cluster"
    ) +
    theme_minimal()

We replicated the visualizations previously applied in order to obtain clearer conclusions.

goals_assists_plot <- data.frame(
  MarketValue = dfnumeric_sample$market_value_in_eur,
  Goals = dfnumeric_sample$goals,
  Assists = dfnumeric_sample$assists,
  Order = res_sample$order
)

ggplot(goals_assists_plot, aes(x = MarketValue, y = Goals)) +
  geom_point(color = "grey") +
  geom_polygon(aes(x = MarketValue[Order], y = Goals[Order]), fill = NA, color = "blue") +
  ggtitle("Trazas Valor de Mercado-Goles") +
  xlab("Valor de Mercado (EUR)") +
  ylab("Goles") +
  theme_minimal()

ggplot(goals_assists_plot, aes(x = MarketValue, y = Assists)) +
  geom_point(color = "grey") +
  geom_polygon(aes(x = MarketValue[Order], y = Assists[Order]), fill = NA, color = "green") +
  ggtitle("Trazas Valor de Mercado-Asistencias") +
  xlab("Valor de Mercado (EUR)") +
  ylab("Asistencias") +
  theme_minimal()

The visualizations reflect the clusters extracted with epsilon = 3, based on market value (EUR), goals, and assists. The dominant cluster (Cluster 1) concentrates players with relatively low market values and moderate contributions in goals and assists, suggesting high participation levels at a more accessible economic valuation.

Smaller clusters (e.g., Clusters 3 and 4) group players with high market values and outstanding performance in goals and assists, likely representing elite profiles or specific high-impact cases. Additionally, Cluster 0 (noise) comprises isolated points that do not belong to a clear density structure, potentially anomalies or unique player profiles.

The analysis indicates a positive association between market value and performance. Using epsilon = 3 identifies both large clusters and smaller subgroups without over-merging dense regions, providing a balanced trade-off between noise and structure.

4.2 Testing, Describing, and Interpreting Results with Different Values of eps and minPts

  • eps: Maximum neighborhood radius for defining nearest points. Several values between 2.5 and 3.0 were tested, as determined at the start of the previous section.
  • minPts: Minimum number of points required to form a cluster. Based on the dataset dimensionality, we set minPts = 7 + 1 = 8.

Bibliography No. 5

Due to dataset size, we operated on samples to avoid memory exhaustion (Error: std::bad_alloc). Note that the object names ending in “75k” below actually use 35,000-row samples, as their printed output confirms.

eps_value <- 2.5
minPts_value <- 8
set.seed(123)

sample_indices_2.5_20k <- sample(1:nrow(dfnumeric), size = 20000)
dfnumeric_sample_2.5_20k <- dfnumeric[sample_indices_2.5_20k, ]

dbscan_result_2.5_20k <- dbscan(dfnumeric_sample_2.5_20k, eps = eps_value, minPts = minPts_value)

print("DBSCAN Result for eps=2.5, sample=20k:")
## [1] "DBSCAN Result for eps=2.5, sample=20k:"
print(dbscan_result_2.5_20k)
## DBSCAN clustering for 20000 objects.
## Parameters: eps = 2.5, minPts = 8
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 15 cluster(s) and 81 noise points.
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
##    81 14600  2411  1101   968   156    13    29    73   111   201    64   152 
##    13    14    15 
##    15    15    10 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints
eps_value <- 2.5
minPts_value <- 8
set.seed(123)

sample_indices_2.5_75k <- sample(1:nrow(dfnumeric), size = 35000)
dfnumeric_sample_2.5_75k <- dfnumeric[sample_indices_2.5_75k, ]

dbscan_result_2.5_75k <- dbscan(dfnumeric_sample_2.5_75k, eps = eps_value, minPts = minPts_value)

print("DBSCAN Result for eps=2.5, sample=75k:")
## [1] "DBSCAN Result for eps=2.5, sample=75k:"
print(dbscan_result_2.5_75k)
## DBSCAN clustering for 35000 objects.
## Parameters: eps = 2.5, minPts = 8
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 16 cluster(s) and 100 noise points.
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
##   100 25657  4201  1890  1628   303    22    44   124   191   362   112    30 
##    13    14    15    16 
##   259    32    30    15 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints
eps_value <- 3
minPts_value <- 8
set.seed(123)

sample_indices_3_20k <- sample(1:nrow(dfnumeric), size = 20000)
dfnumeric_sample_3_20k <- dfnumeric[sample_indices_3_20k, ]
dbscan_result_3_20k <- dbscan(dfnumeric_sample_3_20k, eps = eps_value, minPts = minPts_value)

print("DBSCAN Result for eps=3, sample=20k:")
## [1] "DBSCAN Result for eps=3, sample=20k:"
print(dbscan_result_3_20k)
## DBSCAN clustering for 20000 objects.
## Parameters: eps = 3, minPts = 8
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 4 cluster(s) and 39 noise points.
## 
##     0     1     2     3     4 
##    39 18540  1330    74    17 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints
eps_value <- 3
minPts_value <- 8
set.seed(123)

sample_indices_3_75k <- sample(1:nrow(dfnumeric), size = 35000)
dfnumeric_sample_3_75k <- dfnumeric[sample_indices_3_75k, ]

dbscan_result_3_75k <- dbscan(dfnumeric_sample_3_75k, eps = eps_value, minPts = minPts_value)

print("DBSCAN Result for eps=3, sample=75k:")
## [1] "DBSCAN Result for eps=3, sample=75k:"
print(dbscan_result_3_75k)
## DBSCAN clustering for 35000 objects.
## Parameters: eps = 3, minPts = 8
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 3 cluster(s) and 48 noise points.
## 
##     0     1     2     3 
##    48 32507  2278   167 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints
  • eps = 2.5, sample = 20,000

  • Number of clusters: 15

  • Noise points: 81 (0.41% of the sample)

  • Cluster distribution: The largest cluster contains 14,600 points (73.0% of the sample), while the other clusters are significantly smaller.

  • Interpretation: With epsilon = 2.5, multiple dense clusters are identified. Most points fall into one dominant cluster, suggesting that this value of epsilon captures the global structure while still isolating smaller groups.

  • eps = 2.5, sample = 35,000

  • Number of clusters: 16

  • Noise points: 100 (0.29% of the sample)

  • Cluster distribution: The largest cluster contains 25,657 points (73.3% of the sample), while the smallest clusters contain as few as 15 points.

  • Interpretation: Increasing the sample size maintains a dominant cluster similar to the 20,000-sample case, confirming that the sampling is adequate and “more is not always better.” However, it does allow an additional small cluster to emerge. This suggests that epsilon captures large clusters effectively, while small clusters are more sensitive to sample size.

  • eps = 3.0, sample = 20,000

  • Number of clusters: 4

  • Noise points: 39 (0.20% of the sample)

  • Cluster distribution: The largest cluster contains 18,540 points (92.7% of the sample), while the remaining clusters are significantly smaller.

  • Interpretation: With epsilon = 3, both the number of noise points and the number of clusters decrease. A higher epsilon groups more points into larger clusters, leading to a more compact clustering structure.

  • eps = 3.0, sample = 35,000

  • Number of clusters: 3

  • Noise points: 48 (0.14% of the sample)

  • Cluster distribution: The largest cluster contains 32,507 points (92.9% of the sample), while the smaller clusters contain 2,278 and 167 points.

  • Interpretation: Increasing the sample size with epsilon = 3 maintains a dominant cluster and further reduces the proportion of noise points. However, the higher epsilon value tends to merge smaller clusters into larger ones.

4.2.1 Comparison of Results

  • eps = 2.5, sample = 20,000: Captures multiple dense clusters, but smaller clusters may be excluded.
  • eps = 3.0, sample = 20,000: Provides fewer clusters, fewer noise points, and a more compact structure.

The configuration with eps = 3.0 and sample = 20,000 is considered the most representative.

4.2.2 Cross-Analysis

We cross-referenced market value with goals and assists on the representative sample to better interpret the clustering results.

dfnumeric_sample_2.5_20k_with_clusters <- data.frame(
    dfnumeric_sample_2.5_20k, 
    cluster = factor(dbscan_result_2.5_20k$cluster)
)


ggplot(dfnumeric_sample_2.5_20k_with_clusters, aes(x = market_value_in_eur, y = goals, color = cluster)) +
    geom_point(alpha = 0.6) +
    labs(
        title = "Clústeres DBSCAN (eps = 2.5, sample = 20k)",
        x = "Market Value (EUR)",
        y = "Goals",
        color = "Cluster"
    ) +
    theme_minimal()


ggplot(dfnumeric_sample_2.5_20k_with_clusters, aes(x = market_value_in_eur, y = assists, color = cluster)) +
    geom_point(alpha = 0.6) +
    labs(
        title = "Clústeres DBSCAN (eps = 2.5, sample = 20k)",
        x = "Market Value (EUR)",
        y = "assists",
        color = "Cluster"
    ) +
    theme_minimal()

The DBSCAN visualizations with epsilon = 2.5 reveal patterns similar to those obtained with OPTICS. The dominant cluster (Cluster 1) concentrates players with lower market values and moderate contributions in goals and assists, suggesting a high density of players delivering accessible and balanced performance.

Smaller clusters (such as Clusters 3, 4, and 8) group players with higher market values and outstanding performance in goals and assists, representing elite or high-impact player profiles. The noise cluster (Cluster 0) includes scattered points that do not belong to any dense grouping, indicating anomalies or outlier cases.

DBSCAN with epsilon = 2.5 also demonstrates a correlation between market value and performance. By detecting more small clusters, it highlights additional subgroups within the dataset structure, thereby complementing the analysis with greater detail.

Overall, DBSCAN proves superior to OPTICS in identifying low-cost players with high performance.

4.3 Evaluation of Clustering Quality

We compute the average values of the clustering indices.

# Silhouette index for DBSCAN
silhouette_dbscan <- silhouette(dbscan_result_2.5_20k$cluster, dist(dfnumeric_sample_2.5_20k))
avg_silhouette_dbscan <- mean(silhouette_dbscan[, "sil_width"])
print(paste("Promedio Índice de Silhouette - DBSCAN:", round(avg_silhouette_dbscan, 2)))
## [1] "Promedio Índice de Silhouette - DBSCAN: 0.34"
# Silhouette index for OPTICS
dist_matrix <- dist(dfnumeric_sample_2.5_20k)  # same sample as the DBSCAN run
silhouette_optics <- silhouette(res$cluster, dist_matrix)  # `res` is the OPTICS extractDBSCAN result
avg_silhouette_optics <- mean(silhouette_optics[, "sil_width"])
cat("Promedio Índice de Silhouette - OPTICS:", round(avg_silhouette_optics, 2), "\n")
## Promedio Índice de Silhouette - OPTICS: 0.37
# Keep only clusters with more than 10 points for the silhouette plots
# (match by cluster label, not table position, since labels start at 0)
big_dbscan <- as.integer(names(which(table(dbscan_result_2.5_20k$cluster) > 10)))
dbscan_filtered <- silhouette_dbscan[dbscan_result_2.5_20k$cluster %in% big_dbscan, ]
big_optics <- as.integer(names(which(table(res$cluster) > 10)))
optics_filtered <- silhouette_optics[res$cluster %in% big_optics, ]
par(mfrow = c(1, 2))  

plot(dbscan_filtered, 
     main = "Índice de Silhouette - DBSCAN", 
     col = "blue")

plot(optics_filtered, 
     main = "Índice de Silhouette - OPTICS", 
     col = "red")

avg_silhouette_dbscan <- tapply(silhouette_dbscan[, "sil_width"], dbscan_result_2.5_20k$cluster, mean)
barplot(avg_silhouette_dbscan, 
        col = "skyblue", 
        main = "Promedio de Índice de Silhouette - DBSCAN", 
        xlab = "Clúster", 
        ylab = "Silhouette Promedio")

avg_silhouette_optics <- tapply(silhouette_optics[, "sil_width"], res$cluster, mean)
barplot(avg_silhouette_optics, 
        col = "red", 
        main = "Promedio de Índice de Silhouette - OPTICS", 
        xlab = "Clúster", 
        ylab = "Silhouette Promedio")

# PCA projection for DBSCAN
pca_dbscan <- prcomp(dfnumeric_sample_2.5_20k, scale. = TRUE)
pca_df_dbscan <- data.frame(pca_dbscan$x[, 1:2], cluster = factor(dbscan_result_2.5_20k$cluster))
ggplot(pca_df_dbscan, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.6) +
  labs(title = "Clústeres DBSCAN en espacio PCA", x = "PC1", y = "PC2") +
  theme_minimal()

# PCA projection for OPTICS
pca_optics <- prcomp(dfnumeric_sample, scale. = TRUE)
pca_df_optics <- data.frame(pca_optics$x[, 1:2], cluster = factor(res$cluster))
ggplot(pca_df_optics, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.6) +
  labs(title = "Clústeres OPTICS en espacio PCA", x = "PC1", y = "PC2") +
  theme_minimal()

4.3.1 Calculation of Intra- and Inter-Cluster Distances

To assess the quality of the clustering, we compute both intra-cluster and inter-cluster distances:

  • Intra-cluster distance: Measures the compactness of each cluster, i.e., the average distance between points within the same cluster. Lower values indicate greater cohesion and tighter grouping.

  • Inter-cluster distance: Measures the separation between clusters, i.e., the average distance between cluster centers (or medoids). Higher values indicate better separation between groups.

These measures provide complementary insights: compact clusters with large inter-cluster separation suggest a well-defined clustering structure.
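The inter-cluster side is computed below via centroid distances; the intra-cluster side can be sketched as follows, reusing the OPTICS sample and cluster assignments from above and capping cluster size to respect the memory constraints noted earlier:

set.seed(123)
clusters <- setdiff(sort(unique(res$cluster)), 0)   # skip noise (cluster 0)
intra <- sapply(clusters, function(cl) {
  pts <- dfnumeric_sample[res$cluster == cl, , drop = FALSE]
  if (nrow(pts) > 2000) pts <- pts[sample(nrow(pts), 2000), ]  # cap for memory
  mean(dist(pts))   # average pairwise intra-cluster distance
})
names(intra) <- paste("Cluster", clusters)
round(intra, 2)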

# Inter-cluster separation: cluster centroids, then pairwise distances between them
centroides <- aggregate(dfnumeric_sample, by = list(cluster = res$cluster), mean)
centroid_distances <- as.matrix(dist(centroides[-1]))

heatmap(centroid_distances, 
        main = "Distancias entre Centroides de Clústeres", 
        col = colorRampPalette(c("blue", "white", "red"))(100))

5 Exercise 5

5.1 Evaluation of DBSCAN and OPTICS clustering

In the clustering analysis, I evaluated the quality of the clusters generated by the DBSCAN and OPTICS algorithms using several metrics, with the goal of determining how well they represented the data and how separable they were from each other.

5.1.1 Average Silhouette Index

The Silhouette index is a key metric for measuring cohesion within a cluster and separation between clusters: for each point, s = (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b the mean distance to the nearest other cluster. In this case, the average score was 0.34 for DBSCAN (eps = 2.5) and 0.37 for OPTICS (eps = 3), indicating a similar level of quality in both methods. However, OPTICS showed a slight advantage in cluster separation, suggesting better performance for larger structures.

5.1.2 Silhouette Index Distribution

Silhouette indices were analyzed per cluster to identify potential anomalies. Some clusters presented negative values, such as Cluster 0 in both algorithms, reflecting misassigned points or noise. On the other hand, clusters with higher values demonstrated stronger compactness and clearer definition.

5.1.3 PCA Visualization

The data were projected into two dimensions using PCA to observe the cluster distribution. In these visualizations, DBSCAN revealed a higher number of smaller clusters, making it more suitable for identifying specific subgroups. OPTICS, by contrast, exhibited a clearer dominant cluster, well-suited to capturing hierarchical structures.

5.1.4 Distances Between Centroids

Finally, distances between cluster centroids were calculated and visualized. The heatmap showed that while some clusters were well separated, others were relatively close, which can be interpreted as partial overlap in the data.

5.1.5 Summary

Although both techniques offer comparable results, DBSCAN excelled at identifying small, specific subgroups, whereas OPTICS stood out in handling larger hierarchical structures.

5.2 Comparison with k-means, k-medians, and DBSCAN results

The analysis conducted with k-means, k-medians, DBSCAN, and OPTICS provides an in-depth perspective on the relationship between player market value and team performance. Reviewing the generated clusters and their characteristics reveals patterns that help address the research question.

  • k-means and k-medians

Both methods grouped teams according to key features such as goals, assists, minutes played, and more. Although they differ in distance metrics (Euclidean for k-means, Manhattan for k-medians), their clusters provided consistent insights:

  • Offensively strong teams (Cluster 2): Tend to include players with higher market values, leveraging their ability to generate goals and dominate matches. However, these teams also reveal defensive vulnerabilities.

  • Defensively solid teams (Cluster 4): Achieve positive results with fewer resources, prioritizing structured strategies and defensive solidity, showing that major investments are not always necessary to be competitive.

  • Balanced or transitional teams (Clusters 3 and 5): Represent intermediate cases, where performance depends more on tactical cohesion and the contribution of stable players, regardless of market value.

  • DBSCAN and OPTICS

These methods uncovered additional insights, particularly in identifying teams with lower-valued players delivering high performance:

  • DBSCAN: Stood out in identifying dense subgroups, highlighting low-cost players with significant contributions in goals or assists. This reinforces the notion that large investments are not essential to find impactful talent.
  • OPTICS: Adopted a more hierarchical approach, identifying teams with high market value but sometimes less efficient performance, underscoring the importance of strategy over financial investment.

5.3 Conclusions

This analysis makes it clear that having high-value players is not strictly necessary to achieve strong results. While expensive squads often provide offensive advantages, defensively oriented teams or those with structured tactics can balance this difference and compete at a similar level.

Methods such as DBSCAN highlight the potential to identify undervalued yet strategically valuable players, a realistic alternative for teams with limited budgets.

5.3.1 Final Reflection

Although market value may correlate with success, performance quality and strategic approach are equally decisive factors in achieving strong outcomes.

5.4 Exercise Guidelines

Section 4: Application of DBSCAN and OPTICS

In this section, DBSCAN and OPTICS were implemented, adjusting the eps and minPts parameters to evaluate cluster quality and compare results with previous methods (k-means and k-medians).

  • DBSCAN Application

  • Selected parameters:

    • eps (maximum neighborhood radius): Values between 2.5 and 3.0, selected based on nearest-neighbor distance analysis (k-NN elbow plot).
    • minPts (minimum points): Set to 8, based on dataset dimensionality.
  • Results:

    • eps = 2.5:
      • Number of clusters: 15
      • Noise points: 81 (0.41% of a 20,000-record sample)
      • Observation: Multiple dense clusters were identified, at the cost of more noise points.
    • eps = 3.0:
      • Number of clusters: 4
      • Noise points: 39 (0.20% of a 20,000-record sample)
      • Observation: Noise and the number of clusters decreased, with one dominant cluster containing 92.7% of the points.
  • OPTICS Application

  • Selected parameters:

    • minPts: Same as DBSCAN (8).
    • An initial eps = 3 was used to assess data density.
  • Results:

    • The reachability plot revealed three main valleys representing dominant clusters and several transitions indicating smaller subgroups or noise.
    • Extracting clusters at eps = 3 on the sampled data yielded 4 clusters plus noise; the minor valleys correspond to less dense subgroups.
  • Comparison with k-means and k-medians

  • Number of clusters:
    DBSCAN and OPTICS generated more clusters than k-means and k-medians (five clusters each), highlighting their ability to identify finer structures in the data. Both also identified noise points, which are not considered in k-means and k-medians.

  • Cohesion and separation:
    Silhouette indices averaged 0.34 for DBSCAN and 0.37 for OPTICS, indicating similar clustering quality, suitable for complex data.
    OPTICS showed a slight advantage in cluster separation, being more effective in detecting heterogeneous structures.

  • Robustness to noise:
    Both algorithms managed outliers more effectively than k-means and k-medians, particularly in smaller or more dispersed clusters.

  • Conclusion:

  • DBSCAN: Ideal for identifying dense clusters and filtering noise points, though its sensitivity to eps may complicate the detection of larger structures.

  • OPTICS: Offers greater flexibility in detecting transitions between clusters and subgroups, making it suitable for datasets with variable densities.

  • Global comparison: While k-means and k-medians are more appropriate for compact, well-defined clusters, DBSCAN and OPTICS enable a more detailed exploration, capturing complex structures and managing noise more effectively.


6 Exercise 6

6.1 Selection of Training and Test Samples

As observed in the previous exercise, a sample size of 20,000 was sufficient; therefore, the same sampling variable was applied here.

Bibliography No. 6

The variable high_goals was defined to classify players according to whether they scored more goals than the median. This transforms the problem into a supervised learning task, where the model learns to predict whether a player achieves high goal performance.

dfnumeric_sample_2.5_20k$high_goals <- ifelse(dfnumeric_sample_2.5_20k$goals > median(dfnumeric_sample_2.5_20k$goals), "yes", "no")

I divided the dataset into 90% for training and 10% for testing, ensuring that the data were selected randomly. This allows the model to learn patterns from the training set and evaluate its performance on the test set. In addition, I verified the class distribution to ensure that both categories (yes and no) were properly represented in both subsets.

set.seed(100)

sample <- sample(nrow(dfnumeric_sample_2.5_20k), 0.9 * nrow(dfnumeric_sample_2.5_20k))
train <- dfnumeric_sample_2.5_20k[sample, ]
test <- dfnumeric_sample_2.5_20k[-sample, ]
cat("Proporción en el conjunto de entrenamiento:\n")
## Proporción en el conjunto de entrenamiento:
prop.table(table(train$high_goals))
## 
##         no        yes 
## 0.91616667 0.08383333
cat("\nProporción en el conjunto de prueba:\n")
## 
## Proporción en el conjunto de prueba:
prop.table(table(test$high_goals))
## 
##    no   yes 
## 0.907 0.093

The proportions of the no and yes classes are consistent across both the training and test sets, ensuring representativeness in both. However, the dataset is imbalanced, with a clear predominance of the no class, which may affect the model’s performance.
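One standard mitigation, shown here only as a sketch and not used in the analysis that follows, is to downsample the majority class before training:

set.seed(100)
yes_rows <- which(train$high_goals == "yes")
no_rows  <- sample(which(train$high_goals == "no"), length(yes_rows))
train_bal <- train[c(yes_rows, no_rows), ]   # balanced 50/50 training set
table(train_bal$high_goals)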

The variable high_goals was converted into a factor so that the model could treat it as a categorical variable. A decision tree using the C5.0 algorithm was then trained to predict whether a player achieves high goal performance, based on the remaining variables.

#install.packages("C50")
library(C50)
## Warning: package 'C50' was built under R version 4.4.3
train$high_goals <- as.factor(train$high_goals)
model <- C5.0(high_goals ~ ., data = train)
summary(model)
## 
## Call:
## C5.0.formula(formula = high_goals ~ ., data = train)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Aug 25 21:04:44 2025
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 18000 cases (8 attributes) from undefined.data
## 
## Decision tree:
## 
## goals <= -0.2911196: no (16491)
## goals > -0.2911196: yes (1509)
## 
## 
## Evaluation on training data (18000 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       2    0( 0.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##   16491          (a): class no
##          1509    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% goals
## 
## 
## Time: 0.1 secs

The model generates a very simple decision tree with only two nodes, relying exclusively on the variable goals for classification. It separates the data into yes or no depending on whether the number of goals is above or below a specific threshold.

  • Conclusions:

  • Extreme simplicity: The model is highly interpretable but depends exclusively on a single variable (goals).

  • Perfect classification on training data: It achieved a 0% error rate, suggesting a perfect fit to the training set.

  • Generalization limitations: Its performance on new samples may be limited, as it does not consider other relevant variables that could provide predictive value.

The trained model was then applied to the test set, and a confusion matrix was created to compare predictions with actual values. Finally, model accuracy was calculated by dividing the number of correct predictions by the total number of cases.

predictions <- predict(model, test)
confusion_matrix <- table(test$high_goals, predictions)

cat("\nMatriz de confusión:\n")
## 
## Matriz de confusión:
print(confusion_matrix)
##      predictions
##         no  yes
##   no  1814    0
##   yes    0  186
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("\nPrecisión del modelo:", accuracy, "\n")
## 
## Precisión del modelo: 1

The confusion matrix shows that the model perfectly classifies the test set, with no errors in any class. Model accuracy is 100%, indicating that all predictions matched the actual values.

  • Conclusions:

  • Perfect classification: The model correctly classifies both no and yes, with zero errors in the test set.

  • Total dependence on goals: The perfect performance is due to the fact that high_goals is directly defined from the goals variable, meaning the model does not evaluate other factors.

  • Class imbalance: Although the model achieves 100% accuracy, most predictions belong to the no class because of its predominance. The 186 yes cases represent only 9.3% of the test set, so errors in this class would have had little impact on overall accuracy.

To address this limitation, the next step is to train the model using assists as the target variable.

dfnumeric_sample_2.5_20k$high_assists <- ifelse(dfnumeric_sample_2.5_20k$assists > median(dfnumeric_sample_2.5_20k$assists), "yes", "no")
prop.table(table(dfnumeric_sample_2.5_20k$high_assists))
## 
##      no     yes 
## 0.92775 0.07225

The imbalance in high_assists is even greater, with 93% of cases in the no class compared to 91% in high_goals. This makes the yes class even less representative and more difficult to classify correctly.

set.seed(100)
sample <- sample(nrow(dfnumeric_sample_2.5_20k), 0.9 * nrow(dfnumeric_sample_2.5_20k))

train <- dfnumeric_sample_2.5_20k[sample, ]
test <- dfnumeric_sample_2.5_20k[-sample, ]

cat("Proporción en el conjunto de entrenamiento:\n")
## Proporción en el conjunto de entrenamiento:
prop.table(table(train$high_assists))
## 
##         no        yes 
## 0.92744444 0.07255556
cat("\nProporción en el conjunto de prueba:\n")
## 
## Proporción en el conjunto de prueba:
prop.table(table(test$high_assists))
## 
##     no    yes 
## 0.9305 0.0695

The proportions in high_assists reveal a clear imbalance toward the no class, both in the training and test sets. This confirms that the yes class is a minority, and the model may become biased toward the dominant class.

train$high_assists <- as.factor(train$high_assists)
model <- C5.0(high_assists ~ ., data = train)
summary(model)
## 
## Call:
## C5.0.formula(formula = high_assists ~ ., data = train)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Aug 25 21:04:59 2025
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 18000 cases (9 attributes) from undefined.data
## 
## Decision tree:
## 
## assists <= -0.2665436: no (16694)
## assists > -0.2665436: yes (1306)
## 
## 
## Evaluation on training data (18000 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       2    0( 0.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##   16694          (a): class no
##          1306    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% assists
## 
## 
## Time: 0.1 secs

The C5.0 model was trained to classify high_assists, producing a decision tree with only two leaf nodes. The model relies exclusively on the variable assists to decide between the no and yes classes, based on a threshold of -0.2665436.

  • Conclusions:

  • Extreme simplicity:
    The tree is very simple and perfectly classifies the training data, with a 0% error rate. This mirrors the results obtained with high_goals, showing that the model depends exclusively on the primary variable (assists in this case).

  • Generalization limitations:
    Although it classifies the training set perfectly, the model may not generalize well on test data due to the extreme class imbalance.

  • Comparison with high_goals:
    As with high_goals, the model relies solely on a single variable, which could limit its ability to capture more complex patterns if other variables hold predictive relevance.

predictions <- predict(model, test)

confusion_matrix <- table(test$high_assists, predictions)

cat("\nMatriz de confusión:\n")
## 
## Matriz de confusión:
print(confusion_matrix)
##      predictions
##         no  yes
##   no  1861    0
##   yes    0  139
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("\nPrecisión del modelo:", accuracy, "\n")
## 
## Precisión del modelo: 1

The model perfectly predicts the classes in the test set, achieving 100% accuracy.
The confusion matrix shows that it correctly classified all 1,861 no cases and 139 yes cases, without any errors.

  • Conclusions:

  • Impact of imbalance: Although the model achieves perfect accuracy, the strong imbalance toward the no class facilitates the fit to the test set.

  • Model simplicity: This result is explained by the fact that assists is the only criterion used, directly related to the target variable high_assists.

  • Generalization limitations: Despite the high accuracy, dependence on a single variable and the class imbalance limit the model’s ability to generalize to less balanced or more complex datasets.

6.2 Exercise Guidelines

Selection of Training and Test Samples

A 90/10 split was applied, with 90% of the data used for training and 10% for testing—a standard proportion in many machine learning exercises. This division is appropriate given the availability of a large dataset (20,000 records), allowing the model to be trained on a sufficient sample while evaluating performance on a representative subset.

The proportions of the yes and no classes were checked to ensure consistency across both sets. However, the dataset is imbalanced, with the no class predominating, which may affect the model’s ability to correctly predict the yes class.
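The split above is a simple random sample. As a hedged alternative, caret's createDataPartition() draws a stratified sample that preserves the class proportions almost exactly in both subsets; a minimal sketch (caret is loaded later in this report):

# A minimal sketch of a stratified split, assuming the caret package.
# createDataPartition() samples within each class, so the yes/no proportions
# of high_goals are preserved in both the training and test subsets.
library(caret)

set.seed(100)
idx <- createDataPartition(dfnumeric_sample_2.5_20k$high_goals, p = 0.9, list = FALSE)
train_strat <- dfnumeric_sample_2.5_20k[idx, ]
test_strat  <- dfnumeric_sample_2.5_20k[-idx, ]

# Both proportions should now match the full dataset almost exactly
prop.table(table(train_strat$high_goals))
prop.table(table(test_strat$high_goals))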

Creation of Target Variables (high_goals and high_assists)

  • high_goals was defined according to whether a player’s number of goals was above or below the median, transforming the problem into a supervised learning task.
  • Similarly, high_assists was created to classify players based on whether their assists exceeded the median.

This approach converts the problem into a binary classification task, where the model learns to predict whether a player achieves high performance in goals or assists based on the provided features.
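Since the same median-threshold pattern is applied to each target, it can be expressed as a small reusable helper; a minimal sketch (make_binary_target is a hypothetical name, not part of the original code):

# Hypothetical helper illustrating the median-threshold binarization used for
# high_goals, high_assists, and later high_value.
make_binary_target <- function(x) {
  factor(ifelse(x > median(x), "yes", "no"), levels = c("no", "yes"))
}

# Example: equivalent to the high_goals definition used earlier
# dfnumeric_sample_2.5_20k$high_goals <- make_binary_target(dfnumeric_sample_2.5_20k$goals)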

Model Training

Decision tree models (C5.0) were trained to predict whether a player has high performance in goals (high_goals) and in assists (high_assists). Both models showed perfect performance, with 100% accuracy on both the training and test sets. However, this was entirely due to target leakage: each target was defined directly from the corresponding predictor, so each tree only needed to recover the median threshold used to create the labels.

  • The high_goals model relied exclusively on the variable goals.
  • The high_assists model relied exclusively on the variable assists.

Both models proved overly simplistic, considering only a single variable, which limited their ability to capture more complex patterns.

Evaluation and Confusion Matrix

Although the models achieved perfect accuracy on both the training and test data, this result reflects target leakage rather than genuine predictive power: each target was derived directly from its own predictor. The class imbalance compounds the problem, since overall accuracy would remain high even if the minority yes class were classified poorly, making accuracy alone a misleading metric here.

The analysis demonstrated that while goals and assists possess some predictive power, they are insufficient on their own to reliably predict whether a player will achieve a high market value.


7 Exercise 7

7.1 Rule Generation and Selection of the Most Significant Rules

Both previous models classified without error, but only because their targets were defined from the same variables they used as predictors. We therefore generate a new model as a way of applying new rules: even if it does not achieve 100% accuracy, it can provide more genuine insights.

For this purpose, we created performance_score, a weighted metric that combines goals and assists, assigning a weight of 70% to goals and 30% to assists. The aim is to better capture players’ overall performance.

  • Correlation Analysis:
    The correlation between performance_score and market_value_in_eur was computed to evaluate whether a significant relationship exists between this metric and market value. A high correlation would indicate that performance_score is a strong predictor.

  • Visualization:
    A scatter plot with a fitted line was generated to visualize the relationship between performance_score and market_value_in_eur. This helps identify key trends or patterns in the data.

dfnumeric_sample_2.5_20k$performance_score <- dfnumeric_sample_2.5_20k$goals * 0.7 + dfnumeric_sample_2.5_20k$assists * 0.3

cor(dfnumeric_sample_2.5_20k$performance_score, dfnumeric_sample_2.5_20k$market_value_in_eur)
## [1] 0.1363664
ggplot(dfnumeric_sample_2.5_20k, aes(x = performance_score, y = market_value_in_eur)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Relación entre Performance Score y Valor de Mercado",
       x = "Performance Score (Goles + Asistencias)",
       y = "Valor de Mercado (EUR)")
## `geom_smooth()` using formula = 'y ~ x'

dfnumeric_sample_2.5_20k$high_value <- ifelse(dfnumeric_sample_2.5_20k$market_value_in_eur > median(dfnumeric_sample_2.5_20k$market_value_in_eur), "yes", "no")
set.seed(100)
sample <- sample(nrow(dfnumeric_sample_2.5_20k), 0.9 * nrow(dfnumeric_sample_2.5_20k))
train <- dfnumeric_sample_2.5_20k[sample, ]
test <- dfnumeric_sample_2.5_20k[-sample, ]

train$high_value <- as.factor(train$high_value)
model <- C5.0(high_value ~ performance_score + goals + assists, data = train)
summary(model)
## 
## Call:
## C5.0.formula(formula = high_value ~ performance_score + goals + assists, data
##  = train)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Aug 25 21:05:12 2025
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 18000 cases (4 attributes) from undefined.data
## 
## Decision tree:
## 
## performance_score <= -0.2837468: no (15390/7443)
## performance_score > -0.2837468: yes (2610/1002)
## 
## 
## Evaluation on training data (18000 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       2 8445(46.9%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    7947  1002    (a): class no
##    7443  1608    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% performance_score
## 
## 
## Time: 0.0 secs
predictions <- predict(model, test)
confusion_matrix <- table(test$high_value, predictions)
print(confusion_matrix)
##      predictions
##        no yes
##   no  916 137
##   yes 784 163
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Precisión del modelo:", accuracy, "\n")
## Precisión del modelo: 0.5395
  • Overall Model Performance:
    The model's overall accuracy is 53.95%, a modest result that is far from ideal.
    This shows that the model struggles to classify both groups correctly, with predictions skewed heavily toward the no class.

  • Confusion Matrix:
    The model correctly classifies 916 no cases and 163 yes cases.
    However, it makes significant errors, especially in the yes class, with 784 cases misclassified as no (and 137 no cases misclassified as yes).

  • Tree Structure:
    The tree relies exclusively on the variable performance_score to make classifications, but this single split fails to capture patterns strong enough to separate the classes with greater precision.
    This is consistent with the low correlation between performance_score and market value (0.136), which already suggested that this metric alone was insufficient to predict the target.

  • Prediction Bias:
    The model predicts no for 85% of the test cases, making it ineffective at identifying the yes class even though the two classes are close to balanced by construction (high_value is a median split).

  • General Conclusion:
    Although the model incorporates performance_score along with goals and assists, its accuracy remains limited. This indicates that performance_score is not sufficient as a standalone predictor and that additional variables need to be included.

A new model will be built using all available variables.

set.seed(100)
sample <- sample(nrow(dfnumeric_sample_2.5_20k), 0.9 * nrow(dfnumeric_sample_2.5_20k))
train <- dfnumeric_sample_2.5_20k[sample, ]
test <- dfnumeric_sample_2.5_20k[-sample, ]

train$high_value <- as.factor(train$high_value)
model <- C5.0(high_value ~ performance_score + goals + assists + minutes_played + 
                yellow_cards + home_club_goals + away_club_goals, data = train)
summary(model)
## 
## Call:
## C5.0.formula(formula = high_value ~ performance_score + goals + assists
##  + minutes_played + yellow_cards + home_club_goals + away_club_goals, data
##  = train)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Aug 25 21:05:24 2025
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 18000 cases (8 attributes) from undefined.data
## 
## Decision tree:
## 
## performance_score > -0.2837468: yes (2610/1002)
## performance_score <= -0.2837468:
## :...away_club_goals <= 0.6283903: no (13520/6478)
##     away_club_goals > 0.6283903: yes (1870/905)
## 
## 
## Evaluation on training data (18000 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       3 8385(46.6%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    7042  1907    (a): class no
##    6478  2573    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% performance_score
##   85.50% away_club_goals
## 
## 
## Time: 0.1 secs
predictions <- predict(model, test)

confusion_matrix <- table(test$high_value, predictions)

cat("\nMatriz de confusión:\n")
## 
## Matriz de confusión:
print(confusion_matrix)
##      predictions
##        no yes
##   no  816 237
##   yes 687 260
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("\nPrecisión del modelo:", accuracy, "\n")
## 
## Precisión del modelo: 0.538
  • Conclusions of the Model Using All Variables

  • Overall Model Performance:

    • Overall accuracy: 53.8%, essentially unchanged from (in fact slightly below) the previous model based on performance_score, goals, and assists (53.95%).
    • Adding more candidate predictors did not improve performance, which remains limited.
  • Confusion Matrix:
    The confusion matrix shows that the model struggles to classify both classes correctly:

    • Class no:
      • Correctly classified: 816
      • Misclassified as yes: 237
    • Class yes:
      • Correctly classified: 260
      • Misclassified as no: 687

    The high number of misclassifications in the yes class demonstrates that the model still struggles to identify this class correctly.

  • Use of the Additional Variables:
    Although seven predictors were offered, the fitted tree actually uses only two of them:

    • performance_score (100%): Remains the most important variable, reinforcing its role in predicting high_value.
    • away_club_goals (85.50%): Provides a secondary split, capturing some of the impact of goals scored in away matches.

    The remaining variables (minutes_played, yellow_cards, home_club_goals) were not selected for any split, indicating that they added little discriminative value at this stage.

  • Tree Structure:

    • Tree size: The tree has 3 leaves, only marginally more complex than the previous two-leaf model.
    • Hierarchy of rules: The main split is based on performance_score, followed by a secondary split on away_club_goals.

    Even with the extra split, the training error rate remains high (46.6%), reflecting weak class separation.

  • Why Oversampling is Necessary:

    • The limited accuracy (53.8%) and the large number of errors in the yes class (687 misclassified cases) show that the model's predictions lean strongly toward the no class.
    • Oversampling: Resampling records so that both classes are equally represented in the training set is intended to let the model learn patterns from both classes more evenly and reduce this bias.

Bibliography No. 7

  • Oversampling

Oversampling involves resampling records of one class with replacement so that both classes end up with the same number of records; here, the yes class is resampled to match the size of the no class.
This is applied using the sample() function, creating a balanced dataset (train_balanced) that allows the model to learn patterns from both classes in a more equitable way.

train_yes <- subset(train, high_value == "yes")
train_no <- subset(train, high_value == "no")
oversampled_yes <- train_yes[sample(nrow(train_yes), nrow(train_no), replace = TRUE), ]
train_balanced <- rbind(train_no, oversampled_yes)
cat("Proporción en el conjunto balanceado:\n")
## Proporción en el conjunto balanceado:
prop.table(table(train_balanced$high_value))
## 
##  no yes 
## 0.5 0.5

Oversampling has perfectly balanced the classes in the training set, with 50% no and 50% yes. This ensures that the model is not biased toward the majority class and can learn patterns from both classes equally.

model_balanced <- C5.0(high_value ~ performance_score + goals + assists + minutes_played +
                         yellow_cards + home_club_goals + away_club_goals, data = train_balanced)
summary(model_balanced)
## 
## Call:
## C5.0.formula(formula = high_value ~ performance_score + goals + assists
##  + minutes_played + yellow_cards + home_club_goals + away_club_goals, data
##  = train_balanced)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Aug 25 21:05:40 2025
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 17898 cases (8 attributes) from undefined.data
## 
## Decision tree:
## 
## goals > -0.2911196: yes (1559/538)
## goals <= -0.2911196:
## :...performance_score > -0.2837468:
##     :...home_club_goals > 1.083628: yes (124/36)
##     :   home_club_goals <= 1.083628:
##     :   :...performance_score > 0.7568926:
##     :       :...home_club_goals <= -0.4093939: yes (16/4)
##     :       :   home_club_goals > -0.4093939: no (24/3)
##     :       performance_score <= 0.7568926:
##     :       :...home_club_goals <= -1.155905:
##     :           :...yellow_cards <= -0.4075265: yes (132/39)
##     :           :   yellow_cards > -0.4075265: no (11/3)
##     :           home_club_goals > -1.155905:
##     :           :...away_club_goals > 1.445453: yes (47/14)
##     :               away_club_goals <= 1.445453:
##     :               :...away_club_goals <= 0.6283903:
##     :                   :...yellow_cards <= -0.4075265: yes (533/244)
##     :                   :   yellow_cards > -0.4075265: no (82/33)
##     :                   away_club_goals > 0.6283903:
##     :                   :...home_club_goals <= 0.3371172: yes (91/37)
##     :                       home_club_goals > 0.3371172: no (15/3)
##     performance_score <= -0.2837468:
##     :...home_club_goals > 2.57665:
##         :...home_club_goals <= 3.323162: yes (118/42)
##         :   home_club_goals > 3.323162:
##         :   :...away_club_goals <= 3.079578: no (23/7)
##         :       away_club_goals > 3.079578: yes (18/6)
##         home_club_goals <= 2.57665:
##         :...minutes_played <= -0.4912882: no (3896/1747)
##             minutes_played > -0.4912882:
##             :...minutes_played > 0.6297952:
##                 :...away_club_goals <= -1.005735:
##                 :   :...home_club_goals <= 0.3371172: no (2354/1037)
##                 :   :   home_club_goals > 0.3371172:
##                 :   :   :...home_club_goals <= 1.830139: yes (558/269)
##                 :   :       home_club_goals > 1.830139: no (50/17)
##                 :   away_club_goals > -1.005735:
##                 :   :...away_club_goals <= 0.6283903: no (4649/2256)
##                 :       away_club_goals > 0.6283903:
##                 :       :...away_club_goals <= 2.262516: yes (944/444)
##                 :           away_club_goals > 2.262516: no (101/40)
##                 minutes_played <= 0.6297952:
##                 :...away_club_goals > 0.6283903:
##                     :...away_club_goals <= 2.262516: yes (261/103)
##                     :   away_club_goals > 2.262516: no (21/6)
##                     away_club_goals <= 0.6283903:
##                     :...yellow_cards > 2.30033:
##                         :...minutes_played <= -0.4233438: yes (7/1)
##                         :   minutes_played > -0.4233438: no (28/8)
##                         yellow_cards <= 2.30033:
##                         :...yellow_cards <= -0.4075265: yes (1850/886)
##                             yellow_cards > -0.4075265:
##                             :...away_club_goals <= -1.005735: no (128/50)
##                                 away_club_goals > -1.005735:
##                                 :...home_club_goals > -0.4093939: no (86/39)
##                                     home_club_goals <= -0.4093939:
##                                     :...away_club_goals > -0.1886724: yes (67/19)
##                                         away_club_goals <= -0.1886724:
##                                         :...minutes_played > 0.5278785: yes (16/2)
##                                             minutes_played <= 0.5278785:
##                                             :...home_club_goals <= -1.155905: [S1]
##                                                 home_club_goals > -1.155905: [S2]
## 
## SubTree [S1]
## 
## minutes_played <= -0.08362155: no (14/2)
## minutes_played > -0.08362155: yes (28/11)
## 
## SubTree [S2]
## 
## minutes_played <= -0.04964933: yes (17/4)
## minutes_played > -0.04964933: no (30/11)
## 
## 
## Evaluation on training data (17898 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      34 7961(44.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    6250  2699    (a): class no
##    5262  3687    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% goals
##   91.29% performance_score
##   91.29% home_club_goals
##   84.39% minutes_played
##   67.15% away_club_goals
##   16.92% yellow_cards
## 
## 
## Time: 0.1 secs
predictions_balanced <- predict(model_balanced, test)
confusion_matrix_balanced <- table(test$high_value, predictions_balanced)
cat("\nMatriz de confusión:\n")
## 
## Matriz de confusión:
print(confusion_matrix_balanced)
##      predictions_balanced
##        no yes
##   no  709 344
##   yes 602 345
accuracy_balanced <- sum(diag(confusion_matrix_balanced)) / sum(confusion_matrix_balanced)
cat("\nPrecisión del modelo con oversampling:", accuracy_balanced, "\n")
## 
## Precisión del modelo con oversampling: 0.527
  • Conclusions of the Oversampled Balanced Model:

  • Model Performance:

    • Overall accuracy: 52.7%, slightly below the unbalanced model (53.8%).
    • Although balancing allowed the model to pay more attention to the yes class, it still struggles to classify both classes correctly.
  • Confusion Matrix:

    • Class no:
      • Correctly classified: 709
      • Misclassified as yes: 344
    • Class yes:
      • Correctly classified: 345
      • Misclassified as no: 602

    This indicates that the model now identifies more yes cases (345 versus 260 before balancing, raising recall from roughly 27% to 36%), but it still makes many errors in this class. Oversampling improved sensitivity toward yes, but not sufficiently.

  • Main Tree Splits:
    After balancing, the model leverages a richer set of variables:

    • goals (100%): The root split of the tree.
    • performance_score (91.29%) and home_club_goals (91.29%): Both play a major role in the subsequent splits.
    • minutes_played (84.39%) and away_club_goals (67.15%): Capture specific aspects of playing time and match context.
    • yellow_cards (16.92%): Has a smaller but still present contribution, possibly related to disciplinary behavior.
  • Comparison with the Unbalanced Models:

    • Advantages of balancing:
      • The model uses more variables and considers more complex patterns across both classes.
      • The yes class receives greater attention, improving its recall.
    • Limitations:
      • Overall accuracy does not improve (it drops slightly), and the yes class remains difficult to classify, with 602 errors.
      • The model continues to favor the no class, albeit to a lesser degree.

We retain this model despite its limitations.

7.2 Rule Extraction in Text and Graphical Format

summary(model_balanced)
## 
## Call:
## C5.0.formula(formula = high_value ~ performance_score + goals + assists
##  + minutes_played + yellow_cards + home_club_goals + away_club_goals, data
##  = train_balanced)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Aug 25 21:05:40 2025
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 17898 cases (8 attributes) from undefined.data
## 
## Decision tree:
## 
## goals > -0.2911196: yes (1559/538)
## goals <= -0.2911196:
## :...performance_score > -0.2837468:
##     :...home_club_goals > 1.083628: yes (124/36)
##     :   home_club_goals <= 1.083628:
##     :   :...performance_score > 0.7568926:
##     :       :...home_club_goals <= -0.4093939: yes (16/4)
##     :       :   home_club_goals > -0.4093939: no (24/3)
##     :       performance_score <= 0.7568926:
##     :       :...home_club_goals <= -1.155905:
##     :           :...yellow_cards <= -0.4075265: yes (132/39)
##     :           :   yellow_cards > -0.4075265: no (11/3)
##     :           home_club_goals > -1.155905:
##     :           :...away_club_goals > 1.445453: yes (47/14)
##     :               away_club_goals <= 1.445453:
##     :               :...away_club_goals <= 0.6283903:
##     :                   :...yellow_cards <= -0.4075265: yes (533/244)
##     :                   :   yellow_cards > -0.4075265: no (82/33)
##     :                   away_club_goals > 0.6283903:
##     :                   :...home_club_goals <= 0.3371172: yes (91/37)
##     :                       home_club_goals > 0.3371172: no (15/3)
##     performance_score <= -0.2837468:
##     :...home_club_goals > 2.57665:
##         :...home_club_goals <= 3.323162: yes (118/42)
##         :   home_club_goals > 3.323162:
##         :   :...away_club_goals <= 3.079578: no (23/7)
##         :       away_club_goals > 3.079578: yes (18/6)
##         home_club_goals <= 2.57665:
##         :...minutes_played <= -0.4912882: no (3896/1747)
##             minutes_played > -0.4912882:
##             :...minutes_played > 0.6297952:
##                 :...away_club_goals <= -1.005735:
##                 :   :...home_club_goals <= 0.3371172: no (2354/1037)
##                 :   :   home_club_goals > 0.3371172:
##                 :   :   :...home_club_goals <= 1.830139: yes (558/269)
##                 :   :       home_club_goals > 1.830139: no (50/17)
##                 :   away_club_goals > -1.005735:
##                 :   :...away_club_goals <= 0.6283903: no (4649/2256)
##                 :       away_club_goals > 0.6283903:
##                 :       :...away_club_goals <= 2.262516: yes (944/444)
##                 :           away_club_goals > 2.262516: no (101/40)
##                 minutes_played <= 0.6297952:
##                 :...away_club_goals > 0.6283903:
##                     :...away_club_goals <= 2.262516: yes (261/103)
##                     :   away_club_goals > 2.262516: no (21/6)
##                     away_club_goals <= 0.6283903:
##                     :...yellow_cards > 2.30033:
##                         :...minutes_played <= -0.4233438: yes (7/1)
##                         :   minutes_played > -0.4233438: no (28/8)
##                         yellow_cards <= 2.30033:
##                         :...yellow_cards <= -0.4075265: yes (1850/886)
##                             yellow_cards > -0.4075265:
##                             :...away_club_goals <= -1.005735: no (128/50)
##                                 away_club_goals > -1.005735:
##                                 :...home_club_goals > -0.4093939: no (86/39)
##                                     home_club_goals <= -0.4093939:
##                                     :...away_club_goals > -0.1886724: yes (67/19)
##                                         away_club_goals <= -0.1886724:
##                                         :...minutes_played > 0.5278785: yes (16/2)
##                                             minutes_played <= 0.5278785:
##                                             :...home_club_goals <= -1.155905: [S1]
##                                                 home_club_goals > -1.155905: [S2]
## 
## SubTree [S1]
## 
## minutes_played <= -0.08362155: no (14/2)
## minutes_played > -0.08362155: yes (28/11)
## 
## SubTree [S2]
## 
## minutes_played <= -0.04964933: yes (17/4)
## minutes_played > -0.04964933: no (30/11)
## 
## 
## Evaluation on training data (17898 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      34 7961(44.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    6250  2699    (a): class no
##    5262  3687    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% goals
##   91.29% performance_score
##   91.29% home_club_goals
##   84.39% minutes_played
##   67.15% away_club_goals
##   16.92% yellow_cards
## 
## 
## Time: 0.1 secs
  • Basic Interpretation of the Extracted Rules:

  • If goals > -0.2911196, the record is classified as yes (the root rule, covering 1,559 training cases).

  • If goals <= -0.2911196, additional splits are applied:

    • If performance_score > -0.2837468, the model branches on home_club_goals, yellow_cards, and away_club_goals to decide between yes and no.
    • If performance_score <= -0.2837468, the model evaluates home_club_goals, minutes_played, away_club_goals, and yellow_cards to classify.

All thresholds refer to the scaled values used throughout this analysis, not raw counts. These rules illustrate how the predictor variables are combined to classify a record as either yes or no.
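As a complementary way of extracting rules in text form, C5.0 can fit the model directly as a rule set through its rules = TRUE option; a minimal sketch reusing the balanced training data:

# A minimal sketch: fit the same model as an explicit rule set (rules = TRUE),
# so that summary() prints IF/THEN rules with per-rule coverage and confidence
# instead of a tree.
model_rules <- C5.0(high_value ~ performance_score + goals + assists + minutes_played +
                      yellow_cards + home_club_goals + away_club_goals,
                    data = train_balanced, rules = TRUE)
summary(model_rules)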

Graphical representation follows.

#install.packages("DiagrammeR")
library(DiagrammeR)
## Warning: package 'DiagrammeR' was built under R version 4.4.3
grViz("
digraph tree {
  graph [layout = dot]
  
  # Root node
  node1 [label = 'performance_score > -0.282', shape = box]
  node2 [label = 'Class = Yes (2601 cases, 994 errors)', shape = oval]
  node3 [label = 'yellow_cards > 2.327', shape = box]
  node4 [label = 'Class = No (50 cases, 16 errors)', shape = oval]
  node5 [label = 'minutes_played <= -2.183', shape = box]
  
  # Edges from the root node
  node1 -> node2 [label = 'True']
  node1 -> node3 [label = 'False']
  
  # Edges from the yellow_cards node
  node3 -> node4 [label = 'True']
  node3 -> node5 [label = 'False']
  
  # Further nodes from minutes_played
  node6 [label = 'Class = No (437 cases, 145 errors)', shape = oval]
  node7 [label = 'Class = Yes (54 cases, 21 errors)', shape = oval]
  node5 -> node6 [label = 'away_club_goals <= 0.606']
  node5 -> node7 [label = 'away_club_goals > 0.606']
}
")
  • Conclusions of the Decision Tree

This hand-drawn diagram is a simplified rendering of the fitted tree, condensed to a few illustrative splits. It is used to determine whether a player has a high market value (Yes class) based on goals, assists, and other variables such as minutes played, yellow cards, and away goals. All thresholds are in standardized units rather than raw counts.

  • Root node (performance_score > -0.282):
    The combined metric of goals and assists (performance_score) is the most decisive factor.
    • If a player's performance_score exceeds -0.282, they are directly classified as Yes (high-value player), covering 2,601 cases with 994 errors. This shows that strong offensive performance is closely related to market value.
    • If performance_score is less than or equal to -0.282, other variables are evaluated for classification.
  • Split by disciplinary behavior (yellow_cards > 2.327):
    • If a player's yellow-card value is unusually high (above 2.327), they are classified as No (lower-value player), covering 50 cases with 16 errors. This suggests that highly undisciplined behavior may reduce market value.
  • Evaluation of minutes played and away goals:
    • If a player has few minutes played (<= -2.183), the model evaluates their away-goal contribution:
      • If the away_club_goals value is at most 0.606, the player is classified as No (437 cases, 145 errors).
      • If it exceeds 0.606, the player is classified as Yes (54 cases, 21 errors). This suggests that contributing to away goals adds value even with limited playing time.
  • General model errors:
    Although the tree captures important rules, errors remain high in certain intermediate nodes:
    • The branch below performance_score <= -0.282 produces less precise rules for classifying players with lower offensive performance.
    • Errors in nodes such as away_club_goals > 0.606 indicate that the model could benefit from additional variables or a more robust approach.

7.3 Additional Evaluation

A confusion matrix was generated to measure the predictive capacity of the algorithm, considering metrics such as accuracy, sensitivity, and specificity.
Alternatively, if the target variable under study were purely quantitative, error-based criteria would be applied to determine predictive performance, as in the sketch below.
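A minimal sketch of that error-based alternative, using a hypothetical linear regression on the raw market value purely to illustrate the metrics (this model is not part of the analysis above):

# Hypothetical regression sketch: error-based evaluation for a numeric target.
model_reg <- lm(market_value_in_eur ~ performance_score + minutes_played, data = train)
pred_reg  <- predict(model_reg, test)

rmse <- sqrt(mean((test$market_value_in_eur - pred_reg)^2))  # root mean squared error
mae  <- mean(abs(test$market_value_in_eur - pred_reg))       # mean absolute error
cat("RMSE:", rmse, " MAE:", mae, "\n")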

# Confusion matrix
confusion_matrix_balanced <- table(test$high_value, predictions_balanced)

# Accuracy
accuracy <- sum(diag(confusion_matrix_balanced)) / sum(confusion_matrix_balanced)

# Sensitivity (recall)
sensitivity <- confusion_matrix_balanced["yes", "yes"] / sum(confusion_matrix_balanced["yes", ])

# Specificity
specificity <- confusion_matrix_balanced["no", "no"] / sum(confusion_matrix_balanced["no", ])

# F1-Score
precision <- confusion_matrix_balanced["yes", "yes"] / sum(confusion_matrix_balanced[, "yes"])
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)

# Error rate
error_rate <- 1 - accuracy
cat("Model error rate:", error_rate, "\n")
## Model error rate: 0.473
balanced_accuracy <- (sensitivity + specificity) / 2
cat("Precisión balanceada:", balanced_accuracy, "\n")
## Precisión balanceada: 0.5188113
cat("Matriz de confusión:\n")
## Matriz de confusión:
print(confusion_matrix_balanced)
##      predictions_balanced
##        no yes
##   no  709 344
##   yes 602 345
cat("\nPrecisión del modelo:", accuracy, "\n")
## 
## Precisión del modelo: 0.527
cat("Sensibilidad (Recall):", sensitivity, "\n")
## Sensibilidad (Recall): 0.3643083
cat("Especificidad:", specificity, "\n")
## Especificidad: 0.6733143
cat("F1-Score:", f1_score, "\n")
## F1-Score: 0.4217604
  • Overall Performance:

    • The error rate is 47.3%, indicating that the model makes a large number of misclassifications.
    • The overall accuracy is 52.7%, which barely surpasses random performance, highlighting the model's difficulty in correctly classifying both classes.
  • Class Balance:
    The balanced accuracy of 51.88% (average of sensitivity and specificity) reveals that the model favors the no class, limiting its ability to correctly classify the yes class.

  • Confusion Matrix:

    • Class no: The model correctly classifies 709 cases but misclassifies 344, resulting in a specificity of 67.33%.
    • Class yes: The model correctly classifies only 345 cases while misclassifying 602, leading to a low sensitivity of 36.43%.
  • F1-Score:
    The F1-Score of 42.18% is low, showing that the model fails to achieve a good balance between precision and recall for the yes class.

  • Identified Issues:

    • Poor performance on the yes class:
      The low sensitivity (36.43%) indicates that the model struggles to correctly identify high-value players.
    • Preference for the no class:
      The model's specificity (67.33%) is clearly higher than its sensitivity, confirming that its predictions lean toward the no class.
  • ROC Curve:
    The Area Under the Curve (AUC) is 0.5188, which indicates performance barely above random.

#install.packages("pROC")
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_curve <- roc(test$high_value, as.numeric(predictions_balanced == "yes"))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
cat("Área bajo la curva (AUC):", auc_value, "\n")
## Área bajo la curva (AUC): 0.5188113

We plot the results to visualize them more clearly.

roc_curve <- roc(test$high_value, as.numeric(predictions_balanced == "yes"))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve, col = "blue", lwd = 2, main = "Curva ROC - Modelo Balanceado (C5.0)")
abline(a = 0, b = 1, col = "red", lty = 2)  

auc_value <- auc(roc_curve)
cat("Área bajo la curva (AUC):", auc_value, "\n")
## Área bajo la curva (AUC): 0.5188113
  • Conclusions from the ROC Curve

  • Model Performance:
    The ROC curve shows an Area Under the Curve (AUC) of 0.5188, only slightly better than a random model (AUC = 0.5). This reflects the model's difficulty in separating the yes and no classes.

  • Shape of the Curve:
    The curve lies close to the red diagonal line, which represents a model with no discriminative capacity (random). This reinforces that the model fails to achieve a good balance between sensitivity and specificity.

  • Sensitivity and Specificity:
    The curve shows that the model prioritizes specificity (correctly classifying the no class), but at the expense of sensitivity (correctly classifying the yes class), a pattern already observed in the previous metrics.

  • Possible Reasons:
    The model’s limitations may be due to:

    • Insufficient predictor variables: The variables used may not adequately capture the differences between the classes.
    • Original class imbalance: Despite oversampling, the model may still be influenced by the initial bias toward the no class.
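Incidentally, the AUC above coincides with the balanced accuracy (0.5188) because the ROC was built from hard yes/no labels. A hedged sketch using predicted class probabilities instead (type = "prob" is a standard option of predict for C5.0 models), which usually yields a more informative curve:

# A minimal sketch: build the ROC from class probabilities rather than from
# hard yes/no predictions, giving the curve more than one operating point.
prob_yes <- predict(model_balanced, test, type = "prob")[, "yes"]
roc_prob <- roc(test$high_value, prob_yes)
auc(roc_prob)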
#install.packages("caret")
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
confusionMatrix(as.factor(predictions_balanced), as.factor(test$high_value), positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  709 602
##        yes 344 345
##                                           
##                Accuracy : 0.527           
##                  95% CI : (0.5048, 0.5491)
##     No Information Rate : 0.5265          
##     P-Value [Acc > NIR] : 0.4912          
##                                           
##                   Kappa : 0.0381          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.3643          
##             Specificity : 0.6733          
##          Pos Pred Value : 0.5007          
##          Neg Pred Value : 0.5408          
##              Prevalence : 0.4735          
##          Detection Rate : 0.1725          
##    Detection Prevalence : 0.3445          
##       Balanced Accuracy : 0.5188          
##                                           
##        'Positive' Class : yes             
## 
  • Overall Performance:

  • Accuracy: The model achieves an overall accuracy of 52.7%, essentially equal to the No Information Rate (52.65%); the P-value of 0.49 confirms that it is not significantly better than always predicting the majority class.

    • 95% Confidence Interval: The accuracy falls within a narrow range (50.48% – 54.91%), reinforcing the consistency of this modest performance.
  • Class Agreement:

  • Kappa: With a value of 0.0381, the model shows very weak agreement beyond chance, reflecting its limited ability to correctly identify both classes.

  • McNemar's Test P-Value (< 2e-16): Indicates a significant asymmetry in the errors, in this case toward the no class.

  • Class-Level Evaluation:

  • Sensitivity (Recall for yes): At only 36.43%, the model struggles to correctly identify high-market-value players (yes class).

  • Specificity (Recall for no): Higher, at 67.33%, showing that the model classifies most no cases correctly.

  • Positive Predictive Value (PPV): The model achieves a PPV of 50.07% for the yes class, meaning that when it predicts yes, it is correct only about half the time.

  • Negative Predictive Value (NPV): The NPV for the no class is 54.08%, indicating that no predictions are not very reliable either.

  • Balanced Accuracy:
    With 51.88%, the model barely improves on random performance (50%). This suggests limited capacity to handle both classes in a balanced manner.

  • Positive Class (yes) Detection:

    • Detection Rate: Correct yes predictions account for only 17.25% of all cases.
    • Detection Prevalence: The model predicts the yes class in 34.45% of cases, meaning that many of these predictions are incorrect.

7.4 Comparison and Interpretation of Results

The results (with and without pruning options) are compared and interpreted, highlighting the advantages and disadvantages of the generated model relative to alternative construction methods.

model_pruned <- C5.0(high_value ~ performance_score + goals + assists + minutes_played +
                       yellow_cards + home_club_goals + away_club_goals, data = train_balanced, 
                     control = C5.0Control(minCases = 50))
summary(model_pruned)
## 
## Call:
## C5.0.formula(formula = high_value ~ performance_score + goals + assists
##  + minutes_played + yellow_cards + home_club_goals + away_club_goals, data
##  = train_balanced, control = C5.0Control(minCases = 50))
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Aug 25 21:05:44 2025
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 17898 cases (8 attributes) from undefined.data
## 
## Decision tree:
## 
## goals > -0.2911196: yes (1559/538)
## goals <= -0.2911196:
## :...performance_score > -0.2837468: yes (1075/464)
##     performance_score <= -0.2837468:
##     :...home_club_goals > 2.57665: yes (159/64)
##         home_club_goals <= 2.57665:
##         :...minutes_played <= -0.4912882: no (3896/1747)
##             minutes_played > -0.4912882:
##             :...minutes_played <= 0.6297952:
##                 :...away_club_goals > 0.6283903: yes (282/118)
##                 :   away_club_goals <= 0.6283903:
##                 :   :...yellow_cards <= -0.4075265: yes (1850/886)
##                 :       yellow_cards > -0.4075265:
##                 :       :...away_club_goals <= -1.005735: no (140/56)
##                 :           away_club_goals > -1.005735: yes (281/129)
##                 minutes_played > 0.6297952:
##                 :...away_club_goals <= -1.005735: no (2962/1343)
##                     away_club_goals > -1.005735:
##                     :...away_club_goals <= 0.6283903: no (4649/2256)
##                         away_club_goals > 0.6283903:
##                         :...away_club_goals <= 2.262516: yes (944/444)
##                             away_club_goals > 2.262516: no (101/40)
## 
## 
## Evaluation on training data (17898 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      12 8085(45.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    6306  2643    (a): class no
##    5442  3507    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% goals
##   91.29% performance_score
##   85.28% home_club_goals
##   84.39% minutes_played
##   62.63% away_club_goals
##   12.69% yellow_cards
## 
## 
## Time: 0.1 secs
predictions_pruned <- predict(model_pruned, test)
confusion_matrix_pruned <- table(test$high_value, predictions_pruned)
accuracy_pruned <- sum(diag(confusion_matrix_pruned)) / sum(confusion_matrix_pruned)
cat("Matriz de confusión (modelo podado):\n")
## Matriz de confusión (modelo podado):
print(confusion_matrix_pruned)
##      predictions_pruned
##        no yes
##   no  730 323
##   yes 605 342
cat("\nPrecisión del modelo podado:", accuracy_pruned, "\n")
## 
## Precisión del modelo podado: 0.536
  • Conclusions of the Pruned Model

  • Overall Performance:

    • Accuracy: The pruned model achieved an accuracy of 53.6%, slightly above the balanced model before pruning (52.7%). Pruning therefore simplified the tree without hurting, and even marginally improving, overall accuracy.
    • Error rate: The error rate remains high at 46.4%, showing that the model still struggles to classify the data correctly.
  • Structure of the Pruned Tree:

    • Reduced size: The tree shrank from 34 leaves to 12, improving interpretability considerably.
    • Use of variables:
      • All variables are still used, but with fewer splits: goals (100%), performance_score (91.29%), home_club_goals (85.28%), minutes_played (84.39%), away_club_goals (62.63%), and yellow_cards (12.69%).
      • The minCases = 50 constraint mainly removed the small, deeply nested splits, several of which involved yellow_cards and home_club_goals.
      • This simplification may have removed some genuine fine-grained patterns along with the noise.
  • Training Set Evaluation:

    • Errors in training:
      • Class no: 6,306 correctly classified, with 2,643 errors.
      • Class yes: 3,507 correctly classified, with 5,442 errors.
    • This reinforces that the model's predictions still lean toward the no class, with significant problems in classifying the minority of cases it should label yes.
  • Confusion Matrix on Test Set:

    • Class no:
      • Correctly classified: 730
      • Misclassified as yes: 323
      • Specificity improved slightly relative to the unpruned model (730 versus 709 correct).
    • Class yes:
      • Correctly classified: 342
      • Misclassified as no: 605
      • Sensitivity remains low, reflecting continued difficulty in correctly identifying the yes class.
  • Impact of Pruning:

    • Advantages:
      • The tree is simpler and easier to interpret, with 12 leaves instead of 34.
      • Pruning helps reduce overfitting by removing small, unstable splits.
    • Disadvantages:
      • No significant improvement in key metrics (accuracy, sensitivity, specificity).
      • The removal of fine-grained splits may limit the model's ability to capture subtler patterns.
  • General Conclusion on the Pruned Tree:
    The pruned model is more interpretable, but its performance remains limited, with clear problems in classifying the yes class. This suggests that while pruning is useful for simplifying the model, it does not address the underlying weaknesses: the predictors separate the classes poorly, and the model's predictions remain skewed toward no.

We plot the tree (again as a simplified, hand-drawn diagram) to visualize these results more clearly.

grViz("
digraph tree {
  graph [layout = dot]
  
  # Root node
  node1 [label = 'performance_score > -0.282', shape = box]
  node2 [label = 'Class = Yes (2601 cases, 994 errors)', shape = oval]
  node3 [label = 'minutes_played <= -2.183', shape = box]
  node4 [label = 'Class = No (491 cases, 178 errors)', shape = oval]
  node5 [label = 'minutes_played <= 0.693', shape = box]
  node6 [label = 'Class = No (14772 cases, 7131 errors)', shape = oval]
  node7 [label = 'Class = Yes (94 cases, 31 errors)', shape = oval]

  # Edges
  node1 -> node2 [label = 'True']
  node1 -> node3 [label = 'False']
  node3 -> node4 [label = 'True']
  node3 -> node5 [label = 'False']
  node5 -> node6 [label = 'True']
  node5 -> node7 [label = 'False']
}
")
  • Differences Between the Unpruned and Pruned Trees

  • Tree Complexity:

    • Unpruned tree:
      Contains 34 leaves, reflecting a complex structure with multiple branches and many small splits based on variables such as yellow_cards and away_club_goals.
    • Pruned tree:
      Contains only 12 leaves, greatly simplifying the structure by keeping only splits that cover at least 50 cases (minCases = 50).
  • Variables Used:

    • Unpruned tree:
      Uses goals, performance_score, home_club_goals, minutes_played, away_club_goals, and yellow_cards, with many deep interactions.
      The large number of small splits suggests that the model attempts to capture additional patterns, although this may lead to overfitting on the training data.
    • Pruned tree:
      Uses the same variables but with far fewer splits; the deep, low-coverage branches involving yellow_cards and home_club_goals were removed due to their lower predictive contribution.
  • Model Accuracy:

    • Unpruned: Accuracy = 52.7%
    • Pruned: Accuracy = 53.6%
    • The difference is minimal (and slightly favors the pruned tree), confirming that the branches removed during pruning did not contribute to generalization.
  • Interpretability:

    • Unpruned: More difficult to interpret due to its larger number of leaves and deeper nesting.
    • Pruned: Much easier to interpret and explain, as it keeps only the broader, higher-coverage rules.
  • Potential Issues:

    • Unpruned: While more detailed, it may suffer from overfitting, adapting too closely to the training data and losing generalizability to new data.
    • Pruned: Although more interpretable, pruning may remove genuine fine-grained patterns, which could reduce predictive capacity in more complex scenarios.
  • General Conclusion:
    The pruned tree is simpler and more manageable, focusing on the broadest rules, but may be too restrictive by discarding detailed splits.
    The unpruned tree captures more detail and patterns but carries a higher risk of overfitting and is harder to interpret.

7.5 Evaluation

The error rate is assessed at each tree level, along with classification efficiency (on both training and test samples) and the comprehensibility of the results.

predictions_train <- predict(model_balanced, train_balanced)
confusion_matrix_train <- table(train_balanced$high_value, predictions_train)
accuracy_train <- sum(diag(confusion_matrix_train)) / sum(confusion_matrix_train)
error_train <- 1 - accuracy_train
cat("Precisión en entrenamiento:", accuracy_train, "\n")
## Precisión en entrenamiento: 0.5552017
cat("Error en entrenamiento:", error_train, "\n")
## Error en entrenamiento: 0.4447983
predictions_test <- predict(model_balanced, test)
confusion_matrix_test <- table(test$high_value, predictions_test)
accuracy_test <- sum(diag(confusion_matrix_test)) / sum(confusion_matrix_test)
error_test <- 1 - accuracy_test
cat("Precisión en prueba:", accuracy_test, "\n")
## Precisión en prueba: 0.527
cat("Error en prueba:", error_test, "\n")
## Error en prueba: 0.473
train_control <- trainControl(method = "cv", number = 10)
model_cv <- train(high_value ~ performance_score + goals + assists + minutes_played +
                    yellow_cards + home_club_goals + away_club_goals, 
                  data = train_balanced, 
                  method = "C5.0", 
                  trControl = train_control)
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 5 for this object. Predictions generated using 5
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 5 for this object. Predictions generated using 5
## trials
## Warning: 'trials' should be <= 5 for this object. Predictions generated using 5
## trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
print(model_cv)
## C5.0 
## 
## 17898 samples
##     7 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 16108, 16108, 16108, 16109, 16109, 16108, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa     
##   rules  FALSE    1      0.5412894  0.08257774
##   rules  FALSE   10      0.5406190  0.08123696
##   rules  FALSE   20      0.5406190  0.08123696
##   rules   TRUE    1      0.5363731  0.07274534
##   rules   TRUE   10      0.5363173  0.07263361
##   rules   TRUE   20      0.5363173  0.07263361
##   tree   FALSE    1      0.5400052  0.08000873
##   tree   FALSE   10      0.5400042  0.08000712
##   tree   FALSE   20      0.5400042  0.08000712
##   tree    TRUE    1      0.5360387  0.07207577
##   tree    TRUE   10      0.5365406  0.07307975
##   tree    TRUE   20      0.5365406  0.07307975
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = rules and winnow
##  = FALSE.
  • Performance on Training and Test Sets:

  • Training accuracy: 55.52% with an error of 44.48%, indicating that the model is not strongly overfitted but also fails to achieve good performance even on the training data.

  • Test accuracy: 52.7% with an error of 47.3%, slightly below the training accuracy, reflecting limitations in correctly classifying both classes rather than a large generalization gap.

  • Cross-validation:
    With 10-fold cross-validation, the best configuration was a rule-based model with trials = 1 and winnow = FALSE, with an average accuracy of 54.13%.
    The model shows a low Kappa value (0.083), reflecting weak agreement beyond chance.

  • Limitations:

    • Accuracy barely surpasses random performance, highlighting the weak separation between the classes.
    • The large number of warnings arises because C5.0's internal early stopping kept fewer boosting trials than caret requested; pinning the tuning grid avoids them, as shown in the sketch below.
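A hedged sketch of how the tuning grid can be pinned down to avoid those warnings; tuneGrid with the trials, model, and winnow columns is caret's standard tuning interface for method "C5.0":

# A minimal sketch: fix the grid to a single configuration so caret never asks
# the fitted C5.0 object for more boosting trials than it actually kept.
grid_c50 <- expand.grid(trials = 1, model = "rules", winnow = FALSE)
model_cv_fixed <- train(high_value ~ performance_score + goals + assists + minutes_played +
                          yellow_cards + home_club_goals + away_club_goals,
                        data = train_balanced,
                        method = "C5.0",
                        trControl = train_control,
                        tuneGrid = grid_c50)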

7.6 Conclusions

The analysis began by evaluating single-variable models, first with goals and then with assists. Both models achieved perfect accuracy on the training and test sets, but only because each target (high_goals, high_assists) was derived directly from its own predictor; the trees simply recovered the median threshold used to create the labels and therefore offered no genuine predictive insight.

A combined metric, performance_score, was then created, weighting goals and assists, and attention turned to the genuinely predictive task of classifying high_value (market value above the median). This model struggled to correctly classify the yes class. Oversampling was applied to equalize the class proportions in the training set, which allowed the model to draw on a richer set of variables. However, although oversampling improved sensitivity for the yes class, overall accuracy remained limited.

The initial decision tree built with C5.0 incorporated multiple variables (performance_score, goals, minutes_played, yellow_cards, away_club_goals, and home_club_goals). This model was more complex and captured additional patterns, but still produced significant misclassifications for the yes class, with low sensitivity and limited overall performance. Pruning (minCases = 50) was then applied to simplify the tree, reducing it from 34 to 12 leaves by eliminating low-coverage splits. Although pruning improved interpretability, it did not significantly increase overall accuracy or sensitivity.

Finally, a model was tested with cross-validation using trainControl. The best configuration selected (model = rules, trials = 1, winnow = FALSE) achieved an average accuracy of 54.13% under cross-validation, but its low Kappa value confirmed weak agreement beyond chance.

  • Can these models be used to determine whether a player’s market value depends on the variables analyzed and the models applied?

NO, the current models are not robust enough to reliably determine whether a player’s market value depends directly on the variables analyzed. The main reasons are:

  • Low performance and bias toward the majority class:
    Although the models achieved some accuracy, sensitivity for the minority class (yes, high-value players) remained low. This indicates that the models failed to effectively capture patterns distinguishing high-value players. Bias toward the majority class (no) further limited their predictive usefulness.

  • Class overlap and residual skew:
    Even after applying oversampling, the models still struggled to correctly predict the yes class. The two classes overlap heavily in the predictor space, and the predictions remain skewed toward no, which limits generalization and reliable identification of high-value players.

  • Overly simplified pruned model:
    Pruning simplified the decision tree, but at the cost of removing low-coverage splits that may carry patterns relevant to market value. Such simplification risks ignoring the complexity of the underlying factors influencing player value.

  • Insufficient hyperparameter tuning:
    The cross-validation model, with an average accuracy of 53.65%, suggests that parameter settings were not optimal. More advanced models or further tuning could improve the ability to capture relationships between variables and market value.

  • How can we improve the model?

The next step is to use Random Forest, which will be implemented in the following section.

7.7 Exercise Guidelines

Once the target variable is defined, a rule-generation model based on decision trees must be applied, adjusting different options (minimum node size, splitting criteria, etc.) for its construction. Both unpruned and pruned trees must be generated. A confusion matrix should be obtained for both cases, and the results compared.

Alternatively, if the target variable is purely quantitative, error-based criteria should be used to assess predictive performance.

Applying a Rule-Generation Model Based on Decision Trees
A C5.0 model was trained to classify players based on goals and assists, generating a decision tree to predict whether a player achieves high performance. The tree was then used to generate rules based on variables such as performance_score, goals, and assists.

Adjusting Different Options (Minimum Node Size, Splitting Criteria, …)
Pruning was applied to the decision tree, adjusting the minimum node size. The pruned and unpruned models were compared, showing how pruning affects complexity and generated rules, thereby improving interpretability.

Obtaining Trees With and Without Pruning
Two decision trees were trained and compared: one unpruned and one pruned. The structural differences were analyzed, and results were compared in terms of accuracy and simplicity.

Confusion Matrix
Confusion matrices were calculated for both trees (pruned and unpruned) to evaluate model performance, particularly in terms of correctly classifying the yes and no classes. Accuracy, sensitivity, specificity, and F1-Score were reported, providing a comprehensive assessment.
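Since the same four metrics recur throughout, a small helper is convenient. This is a sketch assuming confusion matrices built as table(actual, predicted) with levels no/yes, as in the code above:

# Derive the reported metrics from a 2x2 confusion matrix
classification_metrics <- function(cm, positive = "yes") {
  tp <- cm[positive, positive]
  tn <- sum(diag(cm)) - tp
  fn <- sum(cm[positive, ]) - tp   # actual positives predicted as negative
  fp <- sum(cm[, positive]) - tp   # actual negatives predicted as positive
  precision   <- tp / (tp + fp)
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = sensitivity,
    specificity = specificity,
    F1          = 2 * precision * sensitivity / (precision + sensitivity))
}
# Example usage: classification_metrics(confusion_matrix)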

Comparing Results With and Without Pruning
The pruned and unpruned models were compared in terms of accuracy and sensitivity. The results highlight that while pruning improves interpretability, it does not significantly improve performance in identifying high-value players (yes class).


8 Exercise 8

8.1 Testing an Alternative Supervised Learning Approach

8.1.1 Random Forest Model

Bibliography No. 8

#install.packages("randomForest")
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
## 
##     outlier
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
set.seed(100)
model_rf <- randomForest(high_value ~ performance_score + goals + assists + minutes_played +
                           yellow_cards + home_club_goals + away_club_goals, 
                         data = train_balanced, ntree = 100, mtry = 3)

print(model_rf)
## 
## Call:
##  randomForest(formula = high_value ~ performance_score + goals +      assists + minutes_played + yellow_cards + home_club_goals +      away_club_goals, data = train_balanced, ntree = 100, mtry = 3) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 44.85%
## Confusion matrix:
##       no  yes class.error
## no  6868 2081   0.2325399
## yes 5946 3003   0.6644318
predictions_rf <- predict(model_rf, test)
confusion_matrix_rf <- table(test$high_value, predictions_rf)
print(confusion_matrix_rf)
##      predictions_rf
##        no yes
##   no  824 229
##   yes 686 261
accuracy_rf <- sum(diag(confusion_matrix_rf)) / sum(confusion_matrix_rf)
cat("Precisión del modelo Random Forest:", accuracy_rf, "\n")
## Precisión del modelo Random Forest: 0.5425

Bibliography No. 9

The number of trees was adjusted: initially set to 100, but later reduced to 30 in order to improve computational efficiency.

plot(model_rf, main = "OOB Error - Random Forest Model")

model_rf_30 <- randomForest(high_value ~ performance_score + goals + assists + minutes_played +
                            yellow_cards + home_club_goals + away_club_goals, 
                            data = train_balanced, ntree = 30, mtry = 3)

print(model_rf_30)
## 
## Call:
##  randomForest(formula = high_value ~ performance_score + goals +      assists + minutes_played + yellow_cards + home_club_goals +      away_club_goals, data = train_balanced, ntree = 30, mtry = 3) 
##                Type of random forest: classification
##                      Number of trees: 30
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 44.66%
## Confusion matrix:
##       no  yes class.error
## no  6946 2003   0.2238239
## yes 5991 2958   0.6694603
predictions_rf <- predict(model_rf_30, test)
confusion_matrix_rf <- table(test$high_value, predictions_rf)
print(confusion_matrix_rf)
##      predictions_rf
##        no yes
##   no  824 229
##   yes 700 247
accuracy_rf <- sum(diag(confusion_matrix_rf)) / sum(confusion_matrix_rf)
cat("Precisión del modelo Random Forest con 30 árboles:", accuracy_rf, "\n")
## Precisión del modelo Random Forest con 30 árboles: 0.5355

Conclusions of the Random Forest Model (based on the 100-tree run; the results barely change when the forest is reduced to 30 trees)

  • Overall Performance:
    • Accuracy: The Random Forest model achieved a test accuracy of 54.25%, slightly higher than the previous models but still far from ideal. The low accuracy reflects the model’s difficulty in correctly classifying both classes, especially the minority class (yes).
    • Out-of-Bag (OOB) Error: The estimated error of 44.85% shows that the model continues to have limited generalization performance, although the ensemble of random trees improves its ability to handle variability in the data.
  • Confusion Matrix (test set):
    • Class no: Correctly classified 824 cases, with 229 errors.
    • Class yes: Correctly classified 261 cases, with 686 errors.
    • The high number of errors in the yes class remains a major issue, indicating that the model is biased toward the no class, just like the previous models.
  • Classification Error (OOB confusion matrix):
    • Class no: The misclassification rate is relatively low at 23.25%, showing good performance when predicting no cases.
    • Class yes: The misclassification rate is much higher at 66.44%, demonstrating significant difficulty in correctly classifying high-value players.
  • Variable Importance:
    • Key variables: The model uses several predictors, such as performance_score, goals, assists, minutes_played, among others, to make predictions. However, variables directly linked to player performance (goals, assists) appear to be the most influential in determining market value.
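The importance ranking asserted above can be checked directly. As a minimal sketch, the standard randomForest helpers below would quantify it (model_rf is the 100-tree model fitted earlier):

# Quantify how much each predictor contributes to the forest's splits
importance(model_rf)        # mean decrease in Gini impurity per variable
varImpPlot(model_rf, main = "Variable Importance - Random Forest")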

8.1.2 SVM Model

Bibliography No. 10

#install.packages("e1071")
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
model_svm <- svm(high_value ~ performance_score + goals + assists + minutes_played +
                   yellow_cards + home_club_goals + away_club_goals, 
                 data = train_balanced, kernel = "radial", cost = 1)
predictions_svm <- predict(model_svm, test)
confusion_matrix_svm <- table(test$high_value, predictions_svm)
print(confusion_matrix_svm)
##      predictions_svm
##        no yes
##   no  809 244
##   yes 671 276
accuracy_svm <- sum(diag(confusion_matrix_svm)) / sum(confusion_matrix_svm)
cat("Precisión del modelo SVM:", accuracy_svm, "\n")
## Precisión del modelo SVM: 0.5425

The SVM model achieved an accuracy of 54.25%, indicating moderate performance. While it correctly classified most of the no class cases (809 correct predictions), it struggled with the yes class, misclassifying 671 cases. The confusion matrix reveals that the model remains biased toward the majority class (no), which hurts sensitivity for the yes class. In summary, the model requires further adjustment, particularly to improve classification of the minority class; a tuning sketch follows.
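As a hedged sketch of such an adjustment, e1071's tune() can grid-search the cost and gamma of the radial kernel. The grid values below are illustrative assumptions, and the search may be slow on the full balanced training set:

set.seed(100)
# Grid-search cost and gamma for the radial-kernel SVM (cross-validated by default)
svm_tuned <- tune(svm,
                  high_value ~ performance_score + goals + assists + minutes_played +
                    yellow_cards + home_club_goals + away_club_goals,
                  data = train_balanced,
                  kernel = "radial",
                  ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
summary(svm_tuned)          # CV error for each cost/gamma combination
best_svm <- svm_tuned$best.model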

8.2 Evaluation of Classification Quality – Random Forest vs. SVM

  • Random Forest (30 trees):

    • Accuracy: The model achieved an accuracy of 53.55%, representing moderate performance and slightly below the 100-tree version (54.25%). Accuracy remains low, reflecting difficulties in classifying both classes, especially the minority (yes).
    • Confusion Matrix:
      • Class no: Correctly classified 824 cases (an error rate of 21.75%), showing acceptable performance for the majority class.
      • Class yes: Misclassified 700 cases, a classification error of 73.92%, highlighting a considerable bias toward the no class.
    • OOB Error: The Out-of-Bag error of 44.66% shows that the model still generalizes poorly, even though Random Forest improves the ability to handle data variability. The OOB error curve flattens quickly, so adding more trees is unlikely to provide significant improvements.
    • General Evaluation: Random Forest performs well in classifying the no class but struggles significantly with the yes class. The low sensitivity for yes indicates the model fails to adequately capture patterns of high-market-value players.
  • SVM (Support Vector Machine):

    • Accuracy: The SVM model achieved an accuracy of 54.25%, slightly above the 30-tree Random Forest. However, it also shows a bias toward the no class, classifying those cases well but struggling with yes.
    • Confusion Matrix:
      • Class no: Correctly classified 809 cases, with 244 errors.
      • Class yes: Misclassified 671 cases, yielding a sensitivity of only about 29% for this class. Like Random Forest, the SVM exhibits bias toward the no class.
    • General Evaluation: SVM also demonstrates high specificity (accurately classifying no) but low sensitivity for yes. Its accuracy is insufficient to establish a clear relationship between the analyzed variables and player market value.
  • Comparison and General Conclusion:

  • Overall Performance:
Both Random Forest and SVM deliver moderate accuracy (around 54%) and show bias toward the no class, with notable difficulties in classifying the yes class. This majority-class bias limits the predictive capacity of the models.

  • Sensitivity and Specificity:
    Both models achieve high specificity (accurately classifying the no class) but low sensitivity for yes. This indicates that the models struggle to correctly identify high-market-value players.
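Because raw accuracy is misleading under class imbalance, a threshold-independent comparison is also informative. The sketch below is an addition, assuming the pROC package is installed; the SVM must be refit with probability = TRUE to produce class probabilities:

library(pROC)

# Random Forest: class probabilities for the positive class "yes"
prob_rf <- predict(model_rf_30, test, type = "prob")[, "yes"]
roc_rf <- roc(response = test$high_value, predictor = prob_rf, levels = c("no", "yes"))
auc(roc_rf)

# SVM: refit with probability estimates enabled, then extract P(yes)
model_svm_prob <- svm(high_value ~ performance_score + goals + assists + minutes_played +
                        yellow_cards + home_club_goals + away_club_goals,
                      data = train_balanced, kernel = "radial", cost = 1,
                      probability = TRUE)
prob_svm <- attr(predict(model_svm_prob, test, probability = TRUE), "probabilities")[, "yes"]
auc(roc(test$high_value, prob_svm, levels = c("no", "yes")))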

8.3 Comparison with the Supervised Method from Exercise 6

model_comparison <- data.frame(
  Model = c("Decision Tree (unpruned)", "Decision Tree (pruned)",
            "Random Forest (30 trees)", "SVM"),
  Accuracy = c("55.15%", "54.6%", "53.55%", "54.25%"),
  Sensitivity = c("High error on `yes`", "Similar to unpruned",
                  "Low for `yes`", "Low for `yes`"),
  Notes = c("Low sensitivity on `yes`", "Better interpretability",
            "Better generalization", "Still biased toward `no`"),
  Yes_Class_Error = c("70.43%", "70.43%", "73.92%", "70.86%")
)
print(model_comparison)
##                      Model Accuracy         Sensitivity
## 1 Decision Tree (unpruned)   55.15% High error on `yes`
## 2   Decision Tree (pruned)    54.6% Similar to unpruned
## 3 Random Forest (30 trees)   53.55%       Low for `yes`
## 4                      SVM   54.25%       Low for `yes`
##                      Notes Yes_Class_Error
## 1 Low sensitivity on `yes`          70.43%
## 2  Better interpretability          70.43%
## 3    Better generalization          73.92%
## 4 Still biased toward `no`          70.86%
  • Conclusions on the Models in Relation to Player Market Value

  • Decision Trees (unpruned and pruned):
    Both decision tree models show a bias toward the no class (players with lower market value), with limited ability to classify the yes class (high-value players). This indicates that, although the models account for goals and assists, these variables alone are not strong enough to effectively predict whether a player has high market value.
    The low sensitivity for the yes class and high accuracy for the no class suggest that the relationship between the analyzed variables and market value is not strong enough to serve as a reliable indicator of high value.

  • Random Forest (30 trees):
    Random Forest does not actually improve on the decision trees in overall accuracy (53.55% versus 55.15% for the unpruned tree) and still struggles with the classification of the yes class, confirming that class imbalance remains an important issue.
    The model draws on variables such as performance_score, goals, assists, minutes_played, and others to perform the classification, but the results demonstrate that goals and assists alone are not strong enough indicators to determine whether a player belongs to the high-value group.

  • SVM (Support Vector Machine):
    The SVM model shows performance similar to Random Forest, with moderate accuracy and a bias toward the no class. The low sensitivity for the yes class highlights the model’s difficulty in identifying high-market-value players based solely on the analyzed variables.

  • General Conclusion:
    Although the models confirm that goals and assists contribute to classifying a player’s market value, the low sensitivity for the yes class (high-value players) suggests that these variables alone are insufficient to reliably determine whether a player is expensive. The bias toward the no class and the models’ limitations indicate that player market value depends on more factors than those analyzed in this exercise, and that goals and assists are not the strongest standalone determinants.

8.4 Exercise Guidelines

Apply a supervised model different from that in Exercise 6:
Random Forest (a tree-based algorithm) and SVM (Support Vector Machine) were applied as alternative supervised models to C5.0. Both models were trained using the same variables as the C5.0 model, and their performance was compared in the classification of high-market-value players.

Compare results with the previously generated model:
The results of Random Forest and SVM were compared with those of the C5.0 models (pruned and unpruned). Several key metrics were evaluated, including accuracy, sensitivity, specificity, and F1-Score. Both alternative models performed very similarly; the SVM recovered slightly more minority-class (yes) cases than the 30-tree forest, but neither clearly outperformed the C5.0 trees.

Use the model evaluation criteria described in the course material:
Model evaluation criteria such as accuracy, sensitivity, specificity, and F1-Score were applied. Confusion matrices were used to assess the performance of the models, and a comparison was made between the results of SVM, Random Forest, and C5.0. These evaluation criteria are aligned with those described in the course material, as they assess predictive capacity, especially in the context of imbalanced classes.


9 Exercise 9

9.1 Identifying Possible Limitations in the Data Selected for Drawing Conclusions with Supervised and Unsupervised Models

  • Is having high-value players a guarantee of better results?

  • Data limitation: The model focuses primarily on goals and assists to predict market value. However, a player’s market value is not always directly related to goals and assists.

  • Having a high market value is not necessarily a guarantee of better performance. A player may have a high value due to popularity, future potential, or marketing factors rather than current on-field performance.
    Therefore, an analysis based solely on goals and assists cannot fully explain market value.

  • Do higher-value players score more goals and assists?

  • Data limitation: The performance_score variable created from goals and assists is a reasonable approximation, but it does not fully explain market value. The model is limited to evaluating performance based only on these two factors. In addition, class imbalance (the predominance of the no class) may distort the results.

  • While one might expect that players with more goals and assists have higher market value, the results do not provide sufficient evidence to confirm that higher-value players necessarily produce more goals or assists. Market value is likely influenced by other factors not captured in the model.

  • Implication: High-value players may stand out for reasons not directly reflected in performance measured by goals and assists. Although there is some correlation between goals/assists and market value, it does not guarantee that higher-value players are always the top performers in these statistics.

Combination with Data Limitations:
- Class imbalance: The predominance of the no class (low-value players) biases the models, making it difficult to correctly classify the yes class (high-value players). As a result, the models are more likely to predict the majority class and fail to capture the characteristics of high-value players.
- Dependence on few variables: The models rely almost exclusively on goals and assists. Yet many other factors determine a player’s value (e.g., position, injury history, long-term potential, transfer market dynamics). This limits the model’s generalization ability and predictive accuracy.
- Impact of preprocessing and balancing: Although oversampling was used to balance the classes, the models still struggle with the minority class (yes), meaning that predictions for high-value players remain unreliable.
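Beyond oversampling, the imbalance could also be handled inside the learner itself. The sketch below is an assumption, using randomForest's stratified sampling on the unbalanced training frame (here called train) so that every tree sees an equal number of no and yes cases:

# Balanced bootstrap samples: each tree draws as many "no" as "yes" cases
n_min <- min(table(train$high_value))

model_rf_strat <- randomForest(high_value ~ performance_score + goals + assists +
                                 minutes_played + yellow_cards +
                                 home_club_goals + away_club_goals,
                               data = train,
                               ntree = 100,
                               strata = train$high_value,
                               sampsize = c(n_min, n_min))  # one sample size per stratum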

9.2 Identifying Risks of Using the Model

When using these models to predict a player’s value, several risks must be considered:

  • Class imbalance: The data are heavily imbalanced, with far more low-value than high-value players. This makes the models biased toward low-value predictions, limiting their ability to correctly identify high-value players. Even after balancing, this remains a major issue.
  • Limited number of variables considered: The models rely mostly on goals and assists, while market value also depends on factors such as position, age, injury history, or even media popularity. Ignoring these factors reduces predictive reliability.
  • Overfitting risk: If the models become too complex, they may fit the training data too closely and fail to generalize to new, unseen players. This would make the predictions unreliable in practice.
  • Lack of interpretability: Models such as Random Forest and SVM are harder to interpret. Explaining why a player is classified as high or low value can be challenging, which is problematic in contexts where decision-making must be transparent and justified.
  • Dependence on historical data: The models use past data, but the transfer market evolves rapidly. Sudden changes in player performance or market conditions may lead to outdated or incorrect predictions.
  • Potential bias in the data: Historical decisions may introduce bias. For example, if a player was highly valued due to playing in a less competitive league, the model may overestimate similar cases, misrepresenting their actual market value.
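One cheap guard against the overfitting risk listed above is to compare resubstitution accuracy with held-out test accuracy; a large gap is a warning sign. A minimal sketch, reusing the fitted model_rf:

# Resubstitution accuracy (optimistic, since trees have seen these rows)
train_acc <- mean(predict(model_rf, train_balanced) == train_balanced$high_value)
# Held-out test accuracy
test_acc  <- mean(predict(model_rf, test) == test$high_value)
cat("Train:", train_acc, " Test:", test_acc, "\n")  # a large gap suggests overfitting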

9.3 Conclusion

Although the models explored have potential to predict player market value, several limitations undermine their reliability. The class imbalance skews predictions toward low-value players, reducing accuracy in the minority class. The reliance on goals and assists alone ignores key factors such as age, position, and reputation, limiting predictive power.

For improvement, more relevant features should be included, techniques to better balance classes should be applied, and models should be designed to adapt to future changes in the market. Interpretability should also be prioritized, ensuring that predictions can be explained and validated.

9.4 Exercise Guidelines

Identify potential limitations of the dataset selected and analyze the risks of using the model to classify new cases. For example, there may be risks of overfitting, or false positive and false negative rates may differ significantly.

  • Class imbalance: The model is biased toward the majority class (no), affecting predictions for the minority class (yes) and increasing false negatives.
  • Dependence on few variables: The model only considers goals and assists, ignoring other important factors.
  • Risk of overfitting: The model may fail to generalize if parameters are not properly tuned.
  • Lack of interpretability: Model complexity may hinder transparency in predictions.

9.5 Final Reflection

The dataset selected proved too ambitious for the purpose of predicting player market value. Despite applying various supervised and unsupervised techniques—such as C5.0, Random Forest, SVM, and DBSCAN—the class imbalance and limited variables made the results unreliable for estimating actual player value.

The models achieved moderate performance but failed to reflect the complexity of the football market, which is influenced by factors such as player position, injury history, and reputation. In retrospect, a more specific and manageable dataset with fewer records but richer variables would have been preferable, allowing for more accurate and meaningful predictions of market value.


10 Bibliography