Dataset Source: Kaggle
https://www.kaggle.com/datasets/davidcariboo/player-scores
Every analytical project should originate from a business need or from the desire to generate actionable knowledge contained within data. Such insights can only be obtained by applying systematic best practices derived from Data Mining and Analytics.
Brief Description
Structured, clean, and weekly-updated data from Transfermarkt, including:
What does it include?
Several CSV files with information about competitions, matches,
clubs, players, and appearances.
Each file contains entity attributes and identifiers (IDs) that allow
relationships between datasets.
Example:
The appearances file includes one row per player for each match played,
with data such as goals, assists, and yellow cards, in addition to IDs
referencing other entities (player, match, etc.).
This pillar focuses on gaining a deep understanding of the professional football industry and its key dynamics. The selected dataset addresses:
The insights generated can help clubs, agents, and analysts identify opportunities to maximize both sporting and economic success.
The dataset enables the application of advanced techniques to explore and solve analytical problems, such as:
The organized structure of the dataset (IDs to link tables) makes it ideal for applying data mining, advanced statistics, and machine learning techniques, supporting practical use cases for analysts and researchers.
The Transfermarkt dataset is a perfect example of how business questions can be connected to available data. It provides:
Problem Context:
Football clubs aim to maximize both sporting performance and the market
value of their players. However, identifying which characteristics have
a direct impact on match outcomes and increases in market value remains
a challenge due to the complexity of relationships within the
data.
Analytical Objectives:
Identify key factors: Determine which player
characteristics (position, nationality, club, average market value,
etc.) and club attributes (name, match performance) most affect match
outcomes.
Evaluate market impact: Analyze how match
performance (minutes played, goals, assists) translates into increases
or decreases in player market valuation.
Explore performance patterns: Compare how frequent positions and team results influence both individual and collective statistics.
Define relevant variables
Data Preparation
- Verify column quality across merged tables.
- Normalize continuous values (e.g., market value, minutes
played).
- Encode categorical variables (position, nationality, club name).
Exploratory Analysis
- Analyze initial correlations between player and club characteristics
and market value.
- Identify whether significant differences exist in market value across
positions or clubs.
Analytical Models
- Multiple Regression: Predict average player market
value based on individual and club performance features.
- Classification: Apply models such as Random Forest or
XGBoost to classify match results (win/draw/loss) based on team and
player attributes.
- Clustering: Segment players by performance and market
trends (e.g., undervalued or overvalued players).
Evaluation
- Validate regression models with R².
- Validate classification models with accuracy and
F1-Score.
- Interpret the most influential factors to provide actionable
insights.
Implementation of Insights
Develop a dashboard or report highlighting:
- Players with the greatest potential for market revaluation.
- Key factors influencing match results and market valuations.
- Positions or nationalities that have significant impact on collective
outcomes.
Dataset Structure and Content
player_id: Unique identifier.position: Most frequent position.country_of_citizenship: Player nationality.market_value_in_eur: Average player market value.name: Club name.player_club_id: Club ID.minutes_playedgoalsassistsyellow_cardshome_club_goals, away_club_goals (team
goals).game_id, home_club_id,
away_club_id.Justification of Dataset Selection
Suitability for Supervised Learning:
Target variable for regression:
market_value_in_eur, enabling prediction of player market
value.
Target variable for classification:
result (win/draw/loss), derived from home/away
goals.
Predictors:
This structure enables the use of algorithms such as linear regression, Random Forest, or XGBoost for predictive tasks.
Suitability for Unsupervised Learning:
Alignment with Analytical Problem
This dataset includes the necessary variables to address the stated
objectives:
- Identifying key factors: relationships between player characteristics
and match results.
- Evaluating market impact: analyzing how individual performance affects
average market value.
- Exploring performance patterns: comparing positions and
characteristics that influence collective performance.
Consolidated Structure and Data Cleaning
The integration of players, clubs, and matches has produced a
consolidated dataset aligned with analytical requirements, reducing
redundancy and ensuring high-quality variables.
Minimum Requirement
library(dplyr)
##
## Adjuntando el paquete: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(purrr)
appearances <- read.csv("C:/Users/Manuel/Desktop/PR1/appearances.csv")
players <- read.csv("C:/Users/Manuel/Desktop/PR1/players.csv")
player_valuations <- read.csv("C:/Users/Manuel/Desktop/PR1/player_valuations.csv")
clubs <- read.csv("C:/Users/Manuel/Desktop/PR1/clubs.csv")
games <- read.csv("C:/Users/Manuel/Desktop/PR1/games.csv")
game_lineups <- read.csv("C:/Users/Manuel/Desktop/PR1/game_lineups.csv")
appearances_with_country <- appearances %>%
left_join(players %>%
select(player_id, country_of_citizenship), by = "player_id")
player_valuations_avg <- player_valuations %>%
group_by(player_id) %>%
summarise(market_value_in_eur = mean(market_value_in_eur, na.rm = TRUE))
appearances_with_full_info <- appearances_with_country %>%
left_join(player_valuations_avg, by = "player_id")
game_lineups_position <- game_lineups %>%
group_by(player_id, position) %>%
tally() %>%
arrange(player_id, desc(n)) %>%
group_by(player_id) %>%
slice(1) %>%
ungroup()
appearances_with_position <- appearances_with_full_info %>%
left_join(game_lineups_position %>%
select(player_id, position), by = "player_id")
club_id_name_table <- clubs %>%
select(club_id, name) %>%
distinct()
appearances_with_club_name <- appearances_with_position %>%
left_join(clubs %>%
select(club_id, name), by = c("player_club_id" = "club_id"))
appearances_with_games <- appearances_with_club_name %>%
left_join(games %>%
select(game_id, home_club_id, away_club_id, home_club_goals, away_club_goals),
by = "game_id")
df_football <- appearances_with_games
The dataset must contain at least 500 observations with a minimum of 5 numerical variables, 2 categorical variables, and 1 binary variable.
Number of observations = 1,643,442
num_observaciones <- nrow(df_football)
cat("Número de observaciones:", num_observaciones, "\n")
## Número de observaciones: 1643442
Number of numerical variables = 14
num_vars <- sum(sapply(df_football, is.numeric))
cat("Number of numerical variables", num_vars, "\n")
## Number of numerical variables 14
Number of categorical variables = 7
cat_vars <- sum(sapply(df_football, function(x) is.factor(x) || is.character(x)))
cat("Number of categorical variables", cat_vars, "\n")
## Number of categorical variables 7
Number of binary variables = 1
binary_vars <- sapply(df_football[, sapply(df_football, is.numeric)], function(x) length(unique(x)) == 2)
cat("Binary variables:", names(binary_vars[binary_vars == TRUE]), "\n")
## Binary variables: red_cards
Global Summary of the Dataset
str(df_football)
## 'data.frame': 1643442 obs. of 21 variables:
## $ appearance_id : chr "2231978_38004" "2233748_79232" "2234413_42792" "2234418_73333" ...
## $ game_id : int 2231978 2233748 2234413 2234418 2234421 2234421 2235539 2235539 2235545 2235545 ...
## $ player_id : int 38004 79232 42792 73333 122011 146889 28716 69445 19409 30003 ...
## $ player_club_id : int 853 8841 6251 1274 195 195 282 282 317 317 ...
## $ player_current_club_id: int 235 2698 465 6646 3008 2778 7185 19771 200 317 ...
## $ date : chr "2012-07-03" "2012-07-05" "2012-07-05" "2012-07-05" ...
## $ player_name : chr "Aurélien Joachim" "Ruslan Abyshov" "Sander Puri" "Vegar Hedenstad" ...
## $ competition_id : chr "CLQ" "ELQ" "ELQ" "ELQ" ...
## $ yellow_cards : int 0 0 0 0 0 1 0 1 0 0 ...
## $ red_cards : int 0 0 0 0 0 0 0 0 0 0 ...
## $ goals : int 2 0 0 0 0 0 0 0 0 0 ...
## $ assists : int 0 0 0 0 1 0 0 1 0 0 ...
## $ minutes_played : int 90 90 45 90 90 90 90 90 45 90 ...
## $ country_of_citizenship: chr "Luxembourg" "Azerbaijan" "Estonia" "Norway" ...
## $ market_value_in_eur : num 346250 246875 200000 840833 2792593 ...
## $ position : chr "Centre-Forward" "Centre-Back" "Left Midfield" "Right-Back" ...
## $ name : chr NA NA NA NA ...
## $ home_club_id : int 853 8841 6251 3779 21532 21532 282 282 317 317 ...
## $ away_club_id : int 10747 22783 11915 1274 195 195 10604 10604 28633 28633 ...
## $ home_club_goals : int 7 2 2 2 0 0 5 5 6 6 ...
## $ away_club_goals : int 0 2 1 0 3 3 2 2 0 0 ...
if (!require('cluster')) install.packages('cluster')
## Cargando paquete requerido: cluster
library(cluster)
if (!require('Stat2Data')) install.packages('Stat2Data')
## Cargando paquete requerido: Stat2Data
library(Stat2Data)
if (!require('Stat2Data')) install.packages('Stat2Data')
if (!require('dplyr')) install.packages('dplyr')
library(dplyr)
if (!require('ggplot2')) install.packages("ggplot2")
## Cargando paquete requerido: ggplot2
library(ggplot2)
if (!require('factoextra')) install.packages("factoextra")
## Cargando paquete requerido: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(factoextra)
if (!require('NbClust')) install.packages("NbClust")
## Cargando paquete requerido: NbClust
library(NbClust)
if (!require('dbscan')) install.packages('dbscan')
## Cargando paquete requerido: dbscan
##
## Adjuntando el paquete: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(dbscan)
if (!require('tidyr')) install.packages('tidyr')
## Cargando paquete requerido: tidyr
library(tidyr)
if (!require('factoextra')) install.packages('factoextra')
library(factoextra)
if (!require('corrplot')) install.packages('corrplot')
## Cargando paquete requerido: corrplot
## corrplot 0.95 loaded
library(corrplot)
if (!require('psych')) install.packages('psych')
## Cargando paquete requerido: psych
##
## Adjuntando el paquete: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
missing_values <- colSums(is.na(df_football) | df_football == "")
print(missing_values)
## appearance_id game_id player_id
## 0 0 0
## player_club_id player_current_club_id date
## 0 0 0
## player_name competition_id yellow_cards
## 6 0 0
## red_cards goals assists
## 0 0 0
## minutes_played country_of_citizenship market_value_in_eur
## 0 22894 669
## position name home_club_id
## 10533 7812 0
## away_club_id home_club_goals away_club_goals
## 0 0 0
missing_values <- colSums(is.na(df_football) | df_football == "")
missing_percentage <- (missing_values / nrow(df_football)) * 100
print(missing_percentage)
## appearance_id game_id player_id
## 0.0000000000 0.0000000000 0.0000000000
## player_club_id player_current_club_id date
## 0.0000000000 0.0000000000 0.0000000000
## player_name competition_id yellow_cards
## 0.0003650874 0.0000000000 0.0000000000
## red_cards goals assists
## 0.0000000000 0.0000000000 0.0000000000
## minutes_played country_of_citizenship market_value_in_eur
## 0.0000000000 1.3930518996 0.0407072474
## position name home_club_id
## 0.6409109661 0.4753438211 0.0000000000
## away_club_id home_club_goals away_club_goals
## 0.0000000000 0.0000000000 0.0000000000
df_football_clean <- df_football[complete.cases(df_football) & !apply(df_football == "", 1, any), ]
missing_values <- colSums(is.na(df_football_clean) | df_football_clean == "")
missing_percentage <- (missing_values / nrow(df_football_clean)) * 100
print(missing_percentage)
## appearance_id game_id player_id
## 0 0 0
## player_club_id player_current_club_id date
## 0 0 0
## player_name competition_id yellow_cards
## 0 0 0
## red_cards goals assists
## 0 0 0
## minutes_played country_of_citizenship market_value_in_eur
## 0 0 0
## position name home_club_id
## 0 0 0
## away_club_id home_club_goals away_club_goals
## 0 0 0
initial_size <- nrow(df_football)
final_size <- nrow(df_football_clean)
percentage_change <- ((initial_size - final_size) / initial_size) * 100
print(paste("El porcentaje de cambio en el tamaño es: ", round(percentage_change, 2), "%"))
## [1] "El porcentaje de cambio en el tamaño es: 2.52 %"
We begin analyzing our dataset by reviewing the columns we may need
to change, modify, scale, or remove.
At this stage, it is not entirely clear which columns will be required
in the future, so we will aim to retain as many as possible.
summary(df_football_clean)
## appearance_id game_id player_id player_club_id
## Length:1601984 Min. :2211607 Min. : 10 Min. : 3
## Class :character 1st Qu.:2581781 1st Qu.: 57370 1st Qu.: 289
## Mode :character Median :3069457 Median : 140804 Median : 826
## Mean :3119260 Mean : 199695 Mean : 3093
## 3rd Qu.:3602582 3rd Qu.: 290250 3rd Qu.: 2441
## Max. :4481846 Max. :1240467 Max. :110302
## player_current_club_id date player_name
## Min. : 3 Length:1601984 Length:1601984
## 1st Qu.: 336 Class :character Class :character
## Median : 931 Mode :character Mode :character
## Mean : 3930
## 3rd Qu.: 2696
## Max. :110302
## competition_id yellow_cards red_cards goals
## Length:1601984 Min. :0.0000 Min. :0.000000 Min. :0.00000
## Class :character 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.00000
## Mode :character Median :0.0000 Median :0.000000 Median :0.00000
## Mean :0.1479 Mean :0.003778 Mean :0.09585
## 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :2.0000 Max. :1.000000 Max. :6.00000
## assists minutes_played country_of_citizenship market_value_in_eur
## Min. :0.0000 Min. : 1.00 Length:1601984 Min. : 10000
## 1st Qu.:0.0000 1st Qu.: 45.00 Class :character 1st Qu.: 674038
## Median :0.0000 Median : 90.00 Mode :character Median : 1876190
## Mean :0.0757 Mean : 69.27 Mean : 5206425
## 3rd Qu.:0.0000 3rd Qu.: 90.00 3rd Qu.: 5652381
## Max. :6.0000 Max. :135.00 Max. :122761538
## position name home_club_id away_club_id
## Length:1601984 Length:1601984 Min. : 1 Min. : 2
## Class :character Class :character 1st Qu.: 294 1st Qu.: 294
## Mode :character Mode :character Median : 865 Median : 862
## Mean : 3321 Mean : 3169
## 3rd Qu.: 2503 3rd Qu.: 2457
## Max. :121966 Max. :110302
## home_club_goals away_club_goals
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.000
## Median : 1.000 Median : 1.000
## Mean : 1.561 Mean : 1.248
## 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :17.000 Max. :16.000
yyyy-mm-dd to yyyymmdd.df_football_clean$date <- as.Date(df_football_clean$date, format = "%Y-%m-%d")
df_football_clean$date <- format(df_football_clean$date, "%Y%m%d")
dffootballclean <- df_football_clean
Among all the data, the only value that stands out is the
highest player market value, which appears as a
potential outlier.
This observation likely corresponds to exceptional players in the
dataset and should be carefully assessed to determine whether it
reflects a valid extreme case or a data anomaly.
boxplot(dffootballclean$yellow_cards, main = "Boxplot de Tarjetas Amarillas", ylab = "Tarjetas Amarillas", col = "lightblue")
boxplot(dffootballclean$red_cards, main = "Boxplot de Tarjetas Rojas", ylab = "Tarjetas Rojas", col = "lightcoral")
boxplot(dffootballclean$goals, main = "Boxplot de Goles", ylab = "Goles", col = "lightgreen")
boxplot(dffootballclean$assists, main = "Boxplot de Asistencias", ylab = "Asistencias", col = "lightyellow")
boxplot(dffootballclean$minutes_played, main = "Boxplot de Minutos Jugados", ylab = "Minutos Jugados", col = "lightpink")
boxplot(dffootballclean$market_value_in_eur, main = "Boxplot del Valor de Mercado", ylab = "Valor de Mercado (EUR)", col = "lightseagreen")
We can confirm that this value is not an anomaly, but rather makes
perfect sense: Kylian Mbappé is the most expensive player in
the market, and the subsequent values are equally reasonable.
dffootballclean_unique <- dffootballclean[!duplicated(dffootballclean$player_name), ]
top_10_market_value_unique <- dffootballclean_unique[order(dffootballclean_unique$market_value_in_eur, decreasing = TRUE), ]
top_10_market_value_unique <- top_10_market_value_unique[1:10, ]
top_10_market_value_unique[, c("player_name", "market_value_in_eur")]
## player_name market_value_in_eur
## 446992 Kylian Mbappé 122761538
## 985681 Erling Haaland 90934783
## 8136 Lionel Messi 88953488
## 1040433 Jude Bellingham 83382353
## 6865 Harry Kane 82740741
## 809835 Vinicius Junior 81750000
## 1429337 Lamine Yamal 81428571
## 139894 Neymar 76350000
## 8239 Cristiano Ronaldo 73806667
## 1017260 Jamal Musiala 71642857
We conduct a descriptive analysis of the variables, which—as shown in
the graphs—display “normal” data distributions.
The most frequent values are consistently low, while the emerging
pattern suggests that potential outliers are actually the players with
the highest economic market value.
summary(dffootballclean[c("yellow_cards", "red_cards", "goals", "assists", "minutes_played", "market_value_in_eur")])
## yellow_cards red_cards goals assists
## Min. :0.0000 Min. :0.000000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.000000 Median :0.00000 Median :0.0000
## Mean :0.1479 Mean :0.003778 Mean :0.09585 Mean :0.0757
## 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :2.0000 Max. :1.000000 Max. :6.00000 Max. :6.0000
## minutes_played market_value_in_eur
## Min. : 1.00 Min. : 10000
## 1st Qu.: 45.00 1st Qu.: 674038
## Median : 90.00 Median : 1876190
## Mean : 69.27 Mean : 5206425
## 3rd Qu.: 90.00 3rd Qu.: 5652381
## Max. :135.00 Max. :122761538
hist(dffootballclean$yellow_cards, main = "Distribución de Tarjetas Amarillas", xlab = "Tarjetas Amarillas")
hist(dffootballclean$red_cards, main = "Distribución de Tarjetas Rojas", xlab = "Tarjetas Rojas")
hist(dffootballclean$goals, main = "Distribución de Goles", xlab = "Goles")
hist(dffootballclean$assists, main = "Distribución de Asistencias", xlab = "Asistencias")
hist(dffootballclean$minutes_played, main = "Distribución de Minutos Jugados", xlab = "Minutos Jugados")
hist(dffootballclean$market_value_in_eur, main = "Distribución del Valor de Mercado", xlab = "Valor de Mercado (EUR)")
+ Correlations
We perform a separate analysis focusing exclusively on correlations.
cor(dffootballclean[c("yellow_cards", "red_cards", "goals", "assists", "minutes_played", "market_value_in_eur")])
## yellow_cards red_cards goals assists
## yellow_cards 1.000000000 -0.012375080 0.001195805 -0.002233251
## red_cards -0.012375080 1.000000000 -0.009106582 -0.007586251
## goals 0.001195805 -0.009106582 1.000000000 0.074499848
## assists -0.002233251 -0.007586251 0.074499848 1.000000000
## minutes_played 0.108352806 -0.034725874 0.079132755 0.077553148
## market_value_in_eur -0.015116322 -0.005572771 0.117005666 0.082148239
## minutes_played market_value_in_eur
## yellow_cards 0.10835281 -0.015116322
## red_cards -0.03472587 -0.005572771
## goals 0.07913276 0.117005666
## assists 0.07755315 0.082148239
## minutes_played 1.00000000 0.043857842
## market_value_in_eur 0.04385784 1.000000000
plot(dffootballclean$goals, dffootballclean$assists, main = "Goles vs Asistencias", xlab = "Goles", ylab = "Asistencias")
plot(dffootballclean$goals, dffootballclean$minutes_played, main = "Goles vs Minutos Jugados", xlab = "Goles", ylab = "Minutos Jugados")
plot(dffootballclean$market_value_in_eur, dffootballclean$goals, main = "Valor de Mercado vs Goles", xlab = "Valor de Mercado (EUR)", ylab = "Goles")
yellow_cards and goals: There is no significant
relationship between the number of yellow cards and goals scored.
red_cards and goals: The number of red cards received
does not appear to significantly affect a player’s ability to score
goals.
assists and goals: Although weak, there is a slight
positive relationship between goals and assists, suggesting that players
who score goals also tend to provide assists.
minutes_played and goals: Players who play more minutes
tend to score more goals, although the relationship is not very
strong.
market_value_in_eur and goals: Market value has a weak
relationship with goals, indicating that the most expensive players do
not necessarily score more goals.
market_value_in_eur and minutes_played: There does not
appear to be a significant relationship between a player’s market value
and the number of minutes they play.
market_value_in_eur and assists: Market value shows a
slight positive relationship with assists, but not strong enough to be
considered a decisive factor.
yellow_cards and minutes_played: Players who play more
minutes have a slightly higher probability of receiving yellow
cards.
red_cards and minutes_played: The number of minutes
played does not seem to influence the number of red cards a player
receives.
yellow_cards and red_cards: There is no significant relationship between yellow and red cards, suggesting they are not necessarily related.
Next, we analyze the average number of goals per league and per team.
dffootballclean %>%
group_by(competition_id) %>%
summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
arrange(desc(promedio_goles))
## # A tibble: 43 × 2
## competition_id promedio_goles
## <chr> <dbl>
## 1 DKP 0.166
## 2 KLUB 0.166
## 3 DFB 0.159
## 4 NLP 0.157
## 5 NLSC 0.148
## 6 USC 0.132
## 7 SFA 0.129
## 8 DFL 0.127
## 9 CDR 0.127
## 10 CGB 0.125
## # ℹ 33 more rows
dffootballclean %>%
group_by(player_current_club_id) %>%
summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
arrange(desc(promedio_goles))
## # A tibble: 436 × 2
## player_current_club_id promedio_goles
## <int> <dbl>
## 1 71985 0.197
## 2 610 0.192
## 3 27 0.180
## 4 31 0.167
## 5 583 0.166
## 6 418 0.166
## 7 141 0.165
## 8 985 0.161
## 9 13 0.154
## 10 131 0.151
## # ℹ 426 more rows
promedio_goles_competencia <- dffootballclean %>%
group_by(competition_id) %>%
summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
arrange(desc(promedio_goles))
ggplot(promedio_goles_competencia, aes(x = reorder(competition_id, -promedio_goles), y = promedio_goles, fill = competition_id)) +
geom_bar(stat = "identity") +
labs(title = "Promedio de Goles por Competencia", x = "Competencia", y = "Promedio de Goles") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_viridis_d()
Since there are too many, we filter the top 10.
promedio_goles_competencia_top10 <- dffootballclean %>%
group_by(competition_id) %>%
summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
arrange(desc(promedio_goles)) %>%
head(10)
promedio_goles_competencia_top10
## # A tibble: 10 × 2
## competition_id promedio_goles
## <chr> <dbl>
## 1 DKP 0.166
## 2 KLUB 0.166
## 3 DFB 0.159
## 4 NLP 0.157
## 5 NLSC 0.148
## 6 USC 0.132
## 7 SFA 0.129
## 8 DFL 0.127
## 9 CDR 0.127
## 10 CGB 0.125
We perform the same calculation, but this time at the team level, and increase the top selection to 25.
promedio_goles_equipo <- dffootballclean %>%
group_by(player_current_club_id) %>%
summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
arrange(desc(promedio_goles)) %>%
head(25)
promedio_goles_equipo$player_current_club_id <- as.factor(promedio_goles_equipo$player_current_club_id)
ggplot(promedio_goles_equipo, aes(x = reorder(player_current_club_id, -promedio_goles), y = promedio_goles, fill = player_current_club_id)) +
geom_bar(stat = "identity") +
labs(title = "Promedio de Goles por Equipo (Top 25)", x = "Equipo", y = "Promedio de Goles") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_viridis_d()
We assign names to these IDs.
promedio_goles_equipo_top25 <- dffootballclean %>%
group_by(name) %>%
summarise(promedio_goles = mean(goals, na.rm = TRUE)) %>%
arrange(desc(promedio_goles)) %>%
head(25)
promedio_goles_equipo_top25
## # A tibble: 25 × 2
## name promedio_goles
## <chr> <dbl>
## 1 FC Bayern München 0.180
## 2 Futbol Club Barcelona 0.166
## 3 Eindhovense Voetbalvereniging Philips Sport Vereniging 0.163
## 4 AFC Ajax Amsterdam 0.162
## 5 Manchester City Football Club 0.160
## 6 Paris Saint-Germain Football Club 0.159
## 7 Real Madrid Club de Fútbol 0.158
## 8 The Celtic Football Club 0.156
## 9 FC Shakhtar Donetsk 0.155
## 10 Borussia Dortmund 0.146
## # ℹ 15 more rows
goles_comparacion_equipo <- dffootballclean %>%
group_by(name) %>%
summarise(total_goles_marcados = sum(goals, na.rm = TRUE), total_goles_recibidos = sum(away_club_goals, na.rm = TRUE)) %>%
arrange(desc(total_goles_marcados)) %>%
head(25)
goles_casa_fuera <- dffootballclean %>%
mutate(tipo_partido = ifelse(home_club_id == player_current_club_id, "Casa", "Fuera")) %>%
group_by(tipo_partido) %>%
summarise(promedio_goles = mean(goals, na.rm = TRUE))
ggplot(goles_casa_fuera, aes(x = tipo_partido, y = promedio_goles, fill = tipo_partido)) +
geom_bar(stat = "identity") +
labs(title = "Promedio de Goles en Casa vs Fuera", x = "Tipo de Partido", y = "Promedio de Goles") +
theme_minimal()
ggplot(goles_comparacion_equipo, aes(x = reorder(name, -total_goles_marcados))) +
geom_bar(aes(y = total_goles_marcados, fill = "Marcados"), stat = "identity") +
labs(title = "Goles Marcados por Equipo", x = "Equipo", y = "Cantidad de Goles") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("Marcados" = "blue"))
+ Goals by position
We detect outliers and remove them; the number remains very low.
goles_por_posicion <- dffootballclean %>%
group_by(position) %>%
summarise(total_goles = sum(goals, na.rm = TRUE)) %>%
arrange(desc(total_goles))
print(goles_por_posicion)
## # A tibble: 16 × 2
## position total_goles
## <chr> <int>
## 1 Centre-Forward 57276
## 2 Left Winger 17065
## 3 Right Winger 15770
## 4 Attacking Midfield 15750
## 5 Central Midfield 15611
## 6 Centre-Back 11801
## 7 Defensive Midfield 6848
## 8 Right-Back 4253
## 9 Left-Back 3812
## 10 Second Striker 2363
## 11 Right Midfield 1562
## 12 Left Midfield 1401
## 13 Goalkeeper 17
## 14 Attack 16
## 15 Defender 1
## 16 midfield 1
dffootballclean <- dffootballclean %>%
filter(!position %in% c("Attack", "Defender", "midfield"))
goles_por_posicion <- dffootballclean %>%
group_by(position) %>%
summarise(total_goles = sum(goals, na.rm = TRUE)) %>%
arrange(desc(total_goles))
ggplot(goles_por_posicion, aes(x = reorder(position, -total_goles), y = total_goles, fill = position)) +
geom_bar(stat = "identity") +
labs(title = "Distribución de Goles por Posición", x = "Posición", y = "Total de Goles") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_viridis_d()
We have identified data that does not make much sense. Attack
appears to be an error: we will remove these records since it is
impossible for the attacking position to have fewer goals than the
goalkeeper. Logically, this position should have the highest
contribution.
It is possible to continue with more exploratory analysis, but we now move on to the next section.
We will include new columns to indicate specific values.
dffootballclean <- dffootballclean %>%
mutate(Valor_mercado = cut(market_value_in_eur,
breaks = quantile(market_value_in_eur, probs = seq(0, 1, by = 0.25), na.rm = TRUE),
include.lowest = TRUE,
labels = c("Bajo", "Medio", "MedioAlto", "Alto")))
We separate the data into: Big Five (Premier League, Serie A, Bundesliga, La Liga, Ligue 1), Champions League, Europa League, and Minor Leagues.
unique_competition_ids <- unique(dffootballclean$competition_id)
unique_competition_ids
## [1] "ELQ" "UKRS" "UKR1" "DK1" "RUSS" "RU1" "BESC" "BE1" "FRCH" "POCP"
## [11] "CLQ" "SC1" "NLSC" "FR1" "NL1" "SCI" "POSU" "CIT" "DFL" "GBCS"
## [21] "DFB" "TR1" "PO1" "GB1" "ES1" "UKRP" "SUC" "L1" "GR1" "IT1"
## [31] "USC" "RUP" "CDR" "CL" "EL" "NLP" "DKP" "SFA" "GRP" "FAC"
## [41] "KLUB" "ECLQ" "CGB"
dffootballclean$League_category <- case_when(
dffootballclean$competition_id %in% c("GB1", "ES1", "IT1", "FR1", "DE1") ~ "BigFiveLeagues",
dffootballclean$competition_id %in% c("CL", "CLQ") ~ "Champions",
dffootballclean$competition_id %in% c("EL", "ELQ") ~ "EuropaLeague",
TRUE ~ "MinorLeagues")
We create a new category based on minutes played. Since there are
values greater than 90 minutes (as observed during data cleaning), we
define the categories as follows:
- 90 minutes: starter
- > 90 minutes: extra time
- < 90 minutes: substitute
dffootballclean$player_status <- case_when(
dffootballclean$minutes_played == 90 ~ "Titular",
dffootballclean$minutes_played > 90 ~ "Prorroga",
dffootballclean$minutes_played < 90 ~ "No titular")
Based on the number of goals scored, we assign labels (from 0 to 6 goals).
dffootballclean$goal_status <- case_when(
dffootballclean$goals == 6 ~ "GOAT",
dffootballclean$goals == 3 ~ "Hattrick",
dffootballclean$goals == 4 ~ "Poker",
dffootballclean$goals == 5 ~ "Repoker",
dffootballclean$goals == 1 ~ "Goal",
dffootballclean$goals == 2 ~ "Duo",
dffootballclean$goals == 0 ~ "Singol",
TRUE ~ "Ninguno")
Based on yellow and red cards.
dffootballclean$sancion_status <- case_when(
dffootballclean$red_cards == 1 & dffootballclean$yellow_cards == 2 ~ "Expulsado por doble amarilla",
dffootballclean$red_cards == 1 ~ "Expulsado",
dffootballclean$yellow_cards == 1 & dffootballclean$red_cards == 0 ~ "Apercibido",
dffootballclean$yellow_cards == 0 & dffootballclean$red_cards == 0 ~ "Sin sanción",
TRUE ~ "Ninguno")
We could create many more categories, but we keep everything to
retain the maximum amount of information from our dataset.
We print the first 10 values to verify that all data has been correctly
processed.
head(dffootballclean, 10)
## appearance_id game_id player_id player_club_id player_current_club_id
## 1 2235545_19409 2235545 19409 317 200
## 2 2235545_30667 2235545 30667 317 317
## 3 2235545_34129 2235545 34129 317 1435
## 4 2235545_36139 2235545 36139 317 36
## 5 2235545_4520 2235545 4520 317 317
## 6 2235545_4582 2235545 4582 317 317
## 7 2235545_47740 2235545 47740 317 1426
## 8 2235545_59631 2235545 59631 317 6890
## 9 2235545_60312 2235545 60312 317 1426
## 10 2235545_63342 2235545 63342 317 11282
## date player_name competition_id yellow_cards red_cards goals
## 1 20120705 Willem Janssen ELQ 0 0 0
## 2 20120705 Robbert Schilder ELQ 0 0 2
## 3 20120705 Wesley Verhoek ELQ 0 0 0
## 4 20120705 Dusan Tadic ELQ 0 0 1
## 5 20120705 Peter Wisgerhof ELQ 0 0 0
## 6 20120705 Sander Boschker ELQ 0 0 0
## 7 20120705 Nils Röseler ELQ 0 0 0
## 8 20120705 Nacer Chadli ELQ 0 0 0
## 9 20120705 Joshua John ELQ 0 0 1
## 10 20120705 Leroy Fer ELQ 0 0 0
## assists minutes_played country_of_citizenship market_value_in_eur
## 1 0 45 Netherlands 1329310.3
## 2 1 90 Netherlands 1002272.7
## 3 0 90 Netherlands 950000.0
## 4 0 45 Serbia 10717073.2
## 5 0 90 Netherlands 1450000.0
## 6 0 90 Netherlands 422222.2
## 7 0 90 Germany 465740.7
## 8 3 45 Belgium 7088750.0
## 9 1 45 Aruba 542045.5
## 10 0 26 Netherlands 3865972.2
## position name home_club_id away_club_id
## 1 Centre-Back Football Club Twente 317 28633
## 2 Left-Back Football Club Twente 317 28633
## 3 Right Winger Football Club Twente 317 28633
## 4 Left Winger Football Club Twente 317 28633
## 5 Centre-Back Football Club Twente 317 28633
## 6 Goalkeeper Football Club Twente 317 28633
## 7 Centre-Back Football Club Twente 317 28633
## 8 Left Winger Football Club Twente 317 28633
## 9 Left Winger Football Club Twente 317 28633
## 10 Defensive Midfield Football Club Twente 317 28633
## home_club_goals away_club_goals Valor_mercado League_category player_status
## 1 6 0 Medio EuropaLeague No titular
## 2 6 0 Medio EuropaLeague Titular
## 3 6 0 Medio EuropaLeague Titular
## 4 6 0 Alto EuropaLeague No titular
## 5 6 0 Medio EuropaLeague Titular
## 6 6 0 Bajo EuropaLeague Titular
## 7 6 0 Bajo EuropaLeague Titular
## 8 6 0 Alto EuropaLeague No titular
## 9 6 0 Bajo EuropaLeague No titular
## 10 6 0 MedioAlto EuropaLeague No titular
## goal_status sancion_status
## 1 Singol Sin sanción
## 2 Duo Sin sanción
## 3 Singol Sin sanción
## 4 Goal Sin sanción
## 5 Singol Sin sanción
## 6 Singol Sin sanción
## 7 Singol Sin sanción
## 8 Singol Sin sanción
## 9 Goal Sin sanción
## 10 Singol Sin sanción
Specific reference - https://rpubs.com/luis_abaunzag/ejercicio1_rd
This dataset could be applied to countless analytical ideas, but for now we will focus on the numerical columns related to players.
numeric_columns <- c("yellow_cards", "red_cards", "goals", "assists", "minutes_played", "market_value_in_eur")
dffootball_numeric <- dffootballclean[, numeric_columns]
The next step is to scale the data and perform PCA.
dffootball_scaled <- scale(dffootball_numeric)
pca_result <- prcomp(dffootball_scaled, scale. = TRUE)
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.1180 1.0398 0.9964 0.9669 0.9416 0.9245
## Proportion of Variance 0.2083 0.1802 0.1655 0.1558 0.1478 0.1424
## Cumulative Proportion 0.2083 0.3885 0.5540 0.7098 0.8576 1.0000
The Principal Component Analysis (PCA) performed on player performance data provides valuable insights into how the variance of the selected numerical features is distributed: yellow cards, red cards, goals, assists, minutes played, and market value.
From the PCA summary, PC1 explains 20.83% of the total variance, indicating that this component captures a significant proportion of the variability in player characteristics. PC2 and PC3 explain 18.02% and 16.55% of the variance, respectively, meaning that these subsequent dimensions also capture a considerable amount of information. Together, the first three components (PC1, PC2, and PC3) explain 54.4% of the variance, suggesting that a large portion of variability in the data can be summarized using only these three principal components.
The cumulative proportion of variance shows that with the first six components (PC1 through PC6), we can explain 100% of the variance, which indicates that dimensionality reduction has been effective without losing critical information. This demonstrates that, even after reducing the number of dimensions, the principal components still retain most of the information contained in the original data.
This analysis is useful for understanding the underlying structure of the data and facilitates the visualization of complex patterns.
Component Rotation
Rotation is used to make results easier to interpret. Although principal components identify the directions of maximum variability, they can be difficult to understand directly. By rotating the components, we aim for each one to have a clearer relationship with specific variables, which improves interpretability.
For example, without rotation, a component may be influenced by several variables at once, whereas with rotation, each component can be more strongly associated with a key variable.
pca_result$rotation
## PC1 PC2 PC3 PC4 PC5
## yellow_cards 0.1879215 0.7129687 0.26545946 0.17179590 -0.36190593
## red_cards -0.1209879 -0.2337237 0.95885294 -0.01595974 0.05370472
## goals 0.5201572 -0.2347260 0.02228464 0.44348855 0.56180214
## assists 0.4644091 -0.1711537 0.03506992 -0.81381839 -0.07084041
## minutes_played 0.4897718 0.4522012 0.09074299 -0.08164665 0.28213551
## market_value_in_eur 0.4732142 -0.3849324 -0.01309673 0.32339265 -0.68256827
## PC6
## yellow_cards 0.47476587
## red_cards -0.09053898
## goals 0.40192013
## assists 0.29407819
## minutes_played -0.67907049
## market_value_in_eur -0.23925209
PC1 (First Principal Component):
This component has a strong relationship with variables such as goals,
assists, minutes played, and market value. This suggests that PC1
captures a measure of overall player performance. Players with
high values on this component tend to demonstrate strong participation
in the game (goals and assists) as well as higher playing time.
PC2 (Second Principal Component):
This component is strongly associated with yellow and red cards,
indicating that PC2 reflects players’ disciplinary or aggressive
behavior. Players with high scores on this component are more
likely to commit fouls or receive sanctions.
PC3 to PC6:
These components reveal less clear combinations of variables but may
represent more specific aspects of gameplay. For example, PC3 seems to
be more closely related to red cards, suggesting that certain players
are strongly associated with disciplinary records.
We applied a varimax rotation to improve the interpretability of the principal components, making it easier to identify meaningful patterns. After rotation, yellow cards, goals, and other key aspects of player performance cluster into well-defined components. This improves our understanding of how each variable contributes to the overall variability in the data.
Moreover, the components with the highest explained variance (such as the first and second) are the most relevant, indicating that the majority of the key information is concentrated within them.
pca_rotated <- principal(dffootball_scaled, nfactors = 5, rotate = "varimax")
pca_rotated$loadings
##
## Loadings:
## RC2 RC1 RC4 RC3 RC5
## yellow_cards 0.873 -0.145 -0.118
## red_cards 0.996
## goals 0.919 0.120
## assists 0.959
## minutes_played 0.567 0.404 0.289 -0.171
## market_value_in_eur 0.968
##
## RC2 RC1 RC4 RC3 RC5
## SS loadings 1.088 1.037 1.023 1.003 0.995
## Proportion Var 0.181 0.173 0.170 0.167 0.166
## Cumulative Var 0.181 0.354 0.525 0.692 0.858
We create a more visual representation and can conclude that three components are sufficient to explain over 50% of the variance.
fviz_eig(pca_result)
To remove any doubt, we generate a scree plot (elbow method) and confirm
that the change in trend indeed occurs at three
components.
varianza_explicada <- pca_result$sdev^2
proporcion_varianza <- varianza_explicada / sum(varianza_explicada)
plot(proporcion_varianza, type = "b", pch = 19, xlab = "Componentes Principales", ylab = "Proporción de Varianza Explicada",
main = "Scree Plot", col = "blue")
abline(h = 1/length(proporcion_varianza), col = "red", lty = 2)
abline(v = 3, col = "green", lty = 2)
table1 <- table(dffootballclean$yellow_cards, dffootballclean$red_cards)
chisq.test(table1)
##
## Pearson's Chi-squared test
##
## data: table1
## X-squared = 245.4, df = 2, p-value < 2.2e-16
Chi-squared test result:
The chi-squared test indicates that there is a significant relationship between the variables analyzed (yellow and red cards). The extremely small p-value suggests that this relationship is very unlikely to have occurred by chance. In simpler terms, this means there is a statistical dependency between the number of yellow and red cards a player receives.
cor(dffootball_numeric$goals, dffootball_numeric$assists, method = "spearman")
## [1] 0.0727869
0.0727869 is a small value, which suggests that although there is a statistically significant relationship between the two variables, the strength of the association is not very strong.
phi <- cor(dffootball_numeric$goals, dffootball_numeric$yellow_cards, method = "pearson")
phi
## [1] 0.001195572
The value 0.001195572 indicates that the relationship between the two variables (in this case, possibly between yellow and red cards, or another categorical comparison being evaluated) is statistically significant.
In summary, the correlation between yellow and red cards makes sense because both are directly related under the rules of the game. The statistical test confirms this significant relationship, which is something expected in most cases.
We now generate a correlation heatmap.
corrplot(cor(dffootball_numeric), method = "color", tl.cex = 0.5)
The correlation matrix reveals weak relationships among the variables in the dataset. No significant relationship is observed between yellow and red cards, suggesting that these events are not closely related. Although there is a slight positive correlation between goals and assists, the relationship is not strong enough to indicate a clear connection between them.
Furthermore, the correlation between minutes played and other variables is low, indicating that playing time is not strongly associated with performance in terms of goals or assists. Finally, player market value shows weak correlations with performance variables, suggesting that factors such as goals, assists, or minutes played do not largely explain a player’s market value.
Overall, the variables appear to be largely independent, implying that they provide unique and non-redundant information about player performance.
pca_scores <- pca_result$x # Las proyecciones de las observaciones (jugadores)
pca_data <- data.frame(player_id = dffootballclean$player_id, pca_scores)
head(pca_data)
## player_id PC1 PC2 PC3 PC4 PC5 PC6
## 1 19409 -0.9477182 -0.3581370 -0.2499365 -0.05737914 0.07001139 0.2759403
## 2 30667 4.5395953 -1.6804405 0.1444410 -0.35773538 3.66765465 2.7202911
## 3 34129 -0.2309475 0.3389939 -0.1128023 -0.19416749 0.52398812 -0.7357923
## 4 36139 1.1278253 -1.4771678 -0.1965340 1.62701310 1.04122580 1.2358561
## 5 4520 -0.2041214 0.3171725 -0.1135448 -0.17583467 0.48529398 -0.7493553
## 6 4582 -0.2592639 0.3620277 -0.1120187 -0.21351880 0.56483194 -0.7214759
(PCA with player labels): The first two principal components (PC1 and PC2) are plotted to visualize how players are distributed, with labels added for identification. However, given the overwhelming volume of data, it is impossible to draw meaningful conclusions from this visualization.
ggplot(pca_data, aes(x = PC1, y = PC2, label = player_id)) +
geom_point() +
geom_text(aes(label = player_id), vjust = 1, hjust = 1) +
labs(title = "PCA: Primer y Segundo Componente Principal",
x = "Componente Principal 1", y = "Componente Principal 2") +
theme_minimal()
The PCA plot is overloaded due to the large volume of data (more than 1.5 million rows) and text labels (player_id), making visual interpretation difficult. The density of points and overlapping labels prevents clear identification of relationships between the principal components.
Observations:
- Data distribution: There is a dense cluster in the
central area, suggesting that most players share similar characteristics
across the selected variables (yellow_cards, red_cards, goals,
etc.).
- Visible outliers: Some points scattered to the right
may represent players with extreme values in certain variable
combinations, such as very high market value or exceptional
performance.
- Label overlap: Labeling each point is not effective
due to massive overlap. This creates visual noise and limits the
extraction of useful insights.
To address this, geom_bin2d is applied to display point
density in the PCA space. This helps to identify areas of high player
concentration without plotting each individual point, improving clarity
and interpretability.
ggplot(pca_data, aes(x = PC1, y = PC2)) +
geom_bin2d(bins = 100) +
labs(title = "PCA: Densidad de Jugadores",
x = "Componente Principal 1", y = "Componente Principal 2") +
theme_minimal()
Observations:
- Central density: There is a notable concentration of
points around the origin (values close to 0 in both principal
components). This suggests that most players have average or balanced
values in the considered metrics (such as goals, cards, and market
value).
- Dispersion at the edges: Points scattered toward the
extremes, especially to the right (high values on Principal Component
1), may reflect players with outstanding characteristics, such as high
market value or strong offensive contributions (goals/assists).
- Diagonal patterns: The overall distribution suggests
a moderate inverse relationship between the two principal components,
which could imply that certain attributes vary in opposite ways for
specific groups of players.
To further enhance interpretation, players are grouped into five clusters using K-means (not strictly necessary, but tested as an exploratory step). The clusters are visualized with different colors, allowing us to identify patterns and relationships in player data based on their principal components.
set.seed(123)
kmeans_result <- kmeans(pca_data[, c("PC1", "PC2")], centers = 5)
ggplot(pca_data, aes(x = PC1, y = PC2, color = factor(kmeans_result$cluster))) +
geom_point(alpha = 0.5) +
labs(title = "PCA: Agrupación de Jugadores",
x = "Componente Principal 1", y = "Componente Principal 2") +
theme_minimal() +
scale_color_manual(values = c("red", "blue", "green", "purple", "orange"))
Observations:
Possible interpretations:
- Cluster 1 (red): May represent players with extreme
values in a single variable, such as cards or very low playing
time.
- Cluster 3 (green): The largest and densest group,
likely composed of players with average or balanced metrics.
- Cluster 5 (orange): Players with high values on PC1,
potentially associated with strong offensive performance or high market
value.
- Other clusters (blue and purple): Appear to represent
subsets with intermediate characteristics.
Data structure: The boundaries between clusters are not completely linear, suggesting that the relationships among the metrics are not perfectly separable.
We perform Singular Value Decomposition (SVD) on the
scaled data to obtain three components: U, D, and
V.
- U contains the player projections in the new
dimensions, similar to principal components in PCA.
- The dataframe svd_data stores these results along with
player_id, enabling analysis of how players are distributed
in the new dimensions.
svd_result <- svd(dffootball_scaled)
U <- svd_result$u
D <- svd_result$d
V <- svd_result$v
svd_data <- data.frame(player_id = dffootballclean$player_id, U)
head(svd_data)
## player_id X1 X2 X3 X4
## 1 19409 -0.0006698067 -0.0002721703 -1.982000e-04 -4.689226e-05
## 2 30667 0.0032083919 -0.0012770699 1.145419e-04 -2.923540e-04
## 3 34129 -0.0001632238 0.0002576223 -8.945242e-05 -1.586805e-04
## 4 36139 0.0007970987 -0.0011225905 -1.558517e-04 1.329653e-03
## 5 4520 -0.0001442643 0.0002410389 -9.004118e-05 -1.436983e-04
## 6 4582 -0.0001832366 0.0002751271 -8.883095e-05 -1.744951e-04
## X5 X6
## 1 5.875418e-05 0.0002358509
## 2 3.077928e-03 0.0023250793
## 3 4.397355e-04 -0.0006288943
## 4 8.738059e-04 0.0010563073
## 5 4.072630e-04 -0.0006404868
## 6 4.740120e-04 -0.0006166578
The results of the Singular Value Decomposition (SVD) show how
players are represented in the new dimensions obtained from the scaled
data. Each row of the svd_data dataset corresponds to a
player, identified by their player_id, followed by the
values in the first six dimensions derived from the decomposition.
The columns X1 to X6 represent the projections of each player in the first six components of the feature space. These projections are continuous values that indicate how each player is distributed across the new dimensions of the reduced space. By analyzing these projections, we can gain insights into the relationships and similarities among players based on their numerical characteristics (e.g., goals, cards, minutes played, etc.).
This type of analysis is useful for dimensionality reduction, as it allows us to visualize player characteristics in a more compact space, facilitating the identification of patterns or clusters among them.
par(mfrow = c(1, 2))
plot(pca_result$x[, 1], pca_result$x[, 2],
main = "PCA: Componentes Principales",
xlab = "PC1", ylab = "PC2")
plot(svd_result$u[, 1], svd_result$u[, 2],
main = "SVD: Componentes U",
xlab = "U1", ylab = "U2")
dif <- svd_result$v[, 1:5] - pca_result$rotation[, 1:5]
summary(dif)
## PC1 PC2 PC3
## Min. :-2.687e-14 Min. :-1.582e-14 Min. :-3.331e-15
## 1st Qu.:-4.066e-15 1st Qu.:-1.234e-14 1st Qu.: 4.424e-16
## Median : 4.913e-15 Median :-5.232e-15 Median : 7.147e-16
## Mean :-4.626e-17 Mean :-4.732e-15 Mean : 6.407e-16
## 3rd Qu.: 9.021e-15 3rd Qu.: 3.442e-15 3rd Qu.: 1.440e-15
## Max. : 1.343e-14 Max. : 6.273e-15 Max. : 3.712e-15
## PC4 PC5
## Min. :-7.644e-14 Min. :-1.349e-13
## 1st Qu.:-8.476e-15 1st Qu.:-4.359e-14
## Median :-7.633e-17 Median : 6.939e-15
## Mean : 6.173e-15 Mean : 2.304e-15
## 3rd Qu.: 1.935e-14 3rd Qu.: 7.142e-14
## Max. : 9.909e-14 Max. : 1.024e-13
The results show that the differences between the V
matrices from SVD and the PCA rotations for the first five
components are very small, with values close to zero.
The difference values indicate that the matrices are practically
identical, which is expected when the data are scaled. This small
variation can be attributed to numerical errors inherent in processing
large matrices.
The dispersion of the principal components reflects the linear
characteristics of the data, and the values close to zero in the
differences suggest that both PCA and SVD are equivalently representing
the underlying structure of the dataset.
The median and quartiles close to zero confirm that variations between
the two techniques are minimal.
This analysis reinforces the conclusion that, when the data are scaled, the results obtained by PCA and SVD are essentially the same, and the differences are insignificant in practical terms. Therefore, either method can be used for dimensionality reduction and to analyze the underlying structure of the data.
We reload the model.
dffootballglobal <- read.csv("C:/Users/Manuel/Desktop/PR1/df_football_clean.csv")
We begin our K-means study by selecting the numerical variables, since the algorithm can only be applied to numerical data.
sapply(dffootballglobal, class)
## appearance_id game_id player_id
## "character" "integer" "integer"
## player_club_id player_current_club_id date
## "integer" "integer" "character"
## player_name competition_id yellow_cards
## "character" "character" "integer"
## red_cards goals assists
## "integer" "integer" "integer"
## minutes_played country_of_citizenship market_value_in_eur
## "integer" "character" "numeric"
## position name home_club_id
## "character" "character" "integer"
## away_club_id home_club_goals away_club_goals
## "integer" "integer" "integer"
We must keep in mind that numerical variables should not be selected at random, as this would not yield meaningful clusters. Instead, we need to analyze which variables are important and why.
Important Variables
As indicated, the following may directly influence clustering based on player performance or characteristics:
position
Reason: Reflects the player’s role on the field, which can
affect performance.
Required treatment: Convert into dummy variables (categorical
positions → numerical).
country_of_citizenship
Reason: May be relevant to analyze geographic or cultural
patterns in performance.
Required treatment: Convert into dummy variables.
market_value_in_eur
Reason: Indicates the economic value of the player, a crucial
factor in performance analysis and clustering.
minutes_played
Reason: Represents the amount of time a player participates in
matches, essential for assessing impact on the game.
goals
Reason: Reflects direct player performance in terms of
offensive contributions.
assists
Reason: Complements the analysis of offensive performance,
highlighting indirect contributions.
yellow_cards
Reason: Relevant for evaluating player disciplinary behavior
and its impact on matches.
home_club_goals and
away_club_goals
Reason: Help contextualize player performance in relation to
match outcomes.
Less Important Variables
player_id
Reason: Unique identifier with no analytical meaning.
game_id
Reason: Unique match identifier, not directly related to
clustering patterns.
player_club_id, home_club_id, and
away_club_id
Reason: Unique identifiers for clubs that do not provide
additional information beyond name.
name (Club name)
Reason: Textual categorical information not directly used in
K-means.
dfkmeans <- dffootballglobal %>%
select(
position,
country_of_citizenship,
market_value_in_eur,
minutes_played,
goals,
assists,
yellow_cards,
home_club_goals,
away_club_goals
)
head(dfkmeans)
## position country_of_citizenship market_value_in_eur minutes_played
## 1 Centre-Back Netherlands 1329310 45
## 2 Defensive Midfield 1615000 90
## 3 Left-Back Netherlands 1002273 90
## 4 Right Winger Netherlands 950000 90
## 5 Left Winger Serbia 10717073 45
## 6 Centre-Back Netherlands 1450000 90
## goals assists yellow_cards home_club_goals away_club_goals
## 1 0 0 0 6 0
## 2 0 0 0 6 0
## 3 2 1 0 6 0
## 4 0 0 0 6 0
## 5 1 0 0 6 0
## 6 0 0 0 6 0
str(dfkmeans)
## 'data.frame': 1290352 obs. of 9 variables:
## $ position : chr "Centre-Back" "Defensive Midfield" "Left-Back" "Right Winger" ...
## $ country_of_citizenship: chr "Netherlands" "" "Netherlands" "Netherlands" ...
## $ market_value_in_eur : num 1329310 1615000 1002273 950000 10717073 ...
## $ minutes_played : int 45 90 90 90 45 90 90 90 45 45 ...
## $ goals : int 0 0 2 0 1 0 0 0 0 1 ...
## $ assists : int 0 0 1 0 0 0 0 0 3 1 ...
## $ yellow_cards : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_club_goals : int 6 6 6 6 6 6 6 6 6 6 ...
## $ away_club_goals : int 0 0 0 0 0 0 0 0 0 0 ...
We first focus on the character variables in order to convert them into numerical format.
unique(dfkmeans$position)
## [1] "Centre-Back" "Defensive Midfield" "Left-Back"
## [4] "Right Winger" "Left Winger" "Goalkeeper"
## [7] "Right-Back" "Centre-Forward" "Second Striker"
## [10] "Central Midfield" "Attacking Midfield" "Right Midfield"
## [13] "Left Midfield" "Attack" "Defender"
## [16] "midfield" ""
unique(dfkmeans$country_of_citizenship)
## [1] "Netherlands" ""
## [3] "Serbia" "Germany"
## [5] "Belgium" "Aruba"
## [7] "Romania" "Croatia"
## [9] "Bulgaria" "Ukraine"
## [11] "Brazil" "Cyprus"
## [13] "Armenia" "North Macedonia"
## [15] "Switzerland" "Spain"
## [17] "Russia" "Denmark"
## [19] "United States" "France"
## [21] "Georgia" "Argentina"
## [23] "Montenegro" "Austria"
## [25] "Albania" "Portugal"
## [27] "Nigeria" "Cameroon"
## [29] "Bosnia-Herzegovina" "Norway"
## [31] "Senegal" "Mali"
## [33] "Iceland" "Zimbabwe"
## [35] "Paraguay" "Italy"
## [37] "Finland" "Slovakia"
## [39] "Turkey" "Ghana"
## [41] "Czech Republic" "Uzbekistan"
## [43] "Tunisia" "Lithuania"
## [45] "Slovenia" "Azerbaijan"
## [47] "Philippines" "Faroe Islands"
## [49] "Costa Rica" "Sweden"
## [51] "Pakistan" "Scotland"
## [53] "Chile" "Ireland"
## [55] "Poland" "Kosovo"
## [57] "Northern Ireland" "Suriname"
## [59] "Türkiye" "England"
## [61] "Morocco" "Congo"
## [63] "Cote d'Ivoire" "Ecuador"
## [65] "Greece" "Guinea"
## [67] "Israel" "Martinique"
## [69] "Zambia" "Venezuela"
## [71] "Kazakhstan" "Hungary"
## [73] "Moldova" "Belarus"
## [75] "Latvia" "Japan"
## [77] "Australia" "South Africa"
## [79] "DR Congo" "Estonia"
## [81] "Liberia" "The Gambia"
## [83] "Algeria" "Chinese Taipei"
## [85] "Burundi" "Burkina Faso"
## [87] "Angola" "Egypt"
## [89] "Gabon" "Peru"
## [91] "Central African Republic" "Kenya"
## [93] "Trinidad and Tobago" "Jamaica"
## [95] "Wales" "Honduras"
## [97] "Réunion" "Uruguay"
## [99] "Guinea-Bissau" "Cape Verde"
## [101] "Colombia" "Madagascar"
## [103] "Haiti" "Bolivia"
## [105] "Curacao" "Afghanistan"
## [107] "Guyana" "Canada"
## [109] "Antigua and Barbuda" "Sierra Leone"
## [111] "Comoros" "Chad"
## [113] "French Guiana" "Togo"
## [115] "Mexico" "Guadeloupe"
## [117] "Syria" "Korea, South"
## [119] "Panama" "Sao Tome and Principe"
## [121] "New Zealand" "Benin"
## [123] "Equatorial Guinea" "Libya"
## [125] "Seychelles" "Barbados"
## [127] "Oman" "Mozambique"
## [129] "Palestine" "Indonesia"
## [131] "Iran" "Neukaledonien"
## [133] "Malaysia" "Saint-Martin"
## [135] "Luxembourg" "Saudi Arabia"
## [137] "Mauritania" "Iraq"
## [139] "Tajikistan" "El Salvador"
## [141] "Mauritius" "Kyrgyzstan"
## [143] "China" "Lebanon"
## [145] "Niger" "Jordan"
## [147] "Dominican Republic" "Rwanda"
## [149] "Malta" "Montserrat"
## [151] "Guatemala" "Thailand"
## [153] "Uganda" "Grenada"
## [155] "Bermuda" "Laos"
## [157] "Monaco" "Ethiopia"
## [159] "Liechtenstein" "Malawi"
## [161] "Tanzania" "Eritrea"
## [163] "Qatar" "Nicaragua"
## [165] "Sint Maarten" "Korea, North"
## [167] "Vietnam" "Cuba"
## [169] "St. Kitts & Nevis"
If we generate dummy variables but end up with too many columns, our clustering may become misleading. To address this, we group countries by continents while keeping player positions as they are.
paises_a_continente <- c(
"Netherlands" = "Europa", "Serbia" = "Europa", "Germany" = "Europa", "Belgium" = "Europa", "Aruba" = "América",
"Romania" = "Europa", "Croatia" = "Europa", "Bulgaria" = "Europa", "Ukraine" = "Europa", "Brazil" = "América",
"Cyprus" = "Europa", "Armenia" = "Asia", "North Macedonia" = "Europa", "Switzerland" = "Europa", "Spain" = "Europa",
"Russia" = "Europa", "Denmark" = "Europa", "United States" = "América", "France" = "Europa", "Georgia" = "Asia",
"Argentina" = "América", "Montenegro" = "Europa", "Austria" = "Europa", "Albania" = "Europa", "Portugal" = "Europa",
"Nigeria" = "África", "Cameroon" = "África", "Bosnia-Herzegovina" = "Europa", "Norway" = "Europa", "Senegal" = "África",
"Mali" = "África", "Iceland" = "Europa", "Zimbabwe" = "África", "Paraguay" = "América", "Italy" = "Europa",
"Finland" = "Europa", "Slovakia" = "Europa", "Turkey" = "Asia", "Ghana" = "África", "Czech Republic" = "Europa",
"Uzbekistan" = "Asia", "Tunisia" = "África", "Lithuania" = "Europa", "Slovenia" = "Europa", "Azerbaijan" = "Asia",
"Philippines" = "Asia", "Faroe Islands" = "Europa", "Costa Rica" = "América", "Sweden" = "Europa", "Pakistan" = "Asia",
"Scotland" = "Europa", "Chile" = "América", "Ireland" = "Europa", "Poland" = "Europa", "Kosovo" = "Europa",
"Northern Ireland" = "Europa", "Suriname" = "América", "Türkiye" = "Asia", "England" = "Europa", "Morocco" = "África",
"Congo" = "África", "Cote d'Ivoire" = "África", "Ecuador" = "América", "Greece" = "Europa", "Guinea" = "África",
"Israel" = "Asia", "Martinique" = "América", "Zambia" = "África", "Venezuela" = "América", "Kazakhstan" = "Asia",
"Hungary" = "Europa", "Moldova" = "Europa", "Belarus" = "Europa", "Latvia" = "Europa", "Japan" = "Asia",
"Australia" = "Oceanía", "South Africa" = "África", "DR Congo" = "África", "Estonia" = "Europa", "Liberia" = "África",
"The Gambia" = "África", "Algeria" = "África", "Chinese Taipei" = "Asia", "Burundi" = "África", "Burkina Faso" = "África",
"Angola" = "África", "Egypt" = "África", "Gabon" = "África", "Peru" = "América", "Central African Republic" = "África",
"Kenya" = "África", "Trinidad and Tobago" = "América", "Jamaica" = "América", "Wales" = "Europa", "Honduras" = "América",
"Réunion" = "África", "Uruguay" = "América", "Guinea-Bissau" = "África", "Cape Verde" = "África", "Colombia" = "América",
"Madagascar" = "África", "Haiti" = "América", "Bolivia" = "América", "Curacao" = "América", "Afghanistan" = "Asia",
"Guyana" = "América", "Canada" = "América", "Antigua and Barbuda" = "América", "Sierra Leone" = "África", "Comoros" = "África",
"Chad" = "África", "French Guiana" = "América", "Togo" = "África", "Mexico" = "América", "Guadeloupe" = "América",
"Syria" = "Asia", "Korea, South" = "Asia", "Panama" = "América", "Sao Tome and Principe" = "África", "New Zealand" = "Oceanía",
"Benin" = "África", "Equatorial Guinea" = "África", "Libya" = "África", "Seychelles" = "África", "Barbados" = "América",
"Oman" = "Asia", "Mozambique" = "África", "Palestine" = "Asia", "Indonesia" = "Asia", "Iran" = "Asia", "Neukaledonien" = "Oceanía",
"Malaysia" = "Asia", "Saint-Martin" = "América", "Luxembourg" = "Europa", "Saudi Arabia" = "Asia", "Mauritania" = "África",
"Iraq" = "Asia", "Tajikistan" = "Asia", "El Salvador" = "América", "Mauritius" = "África", "Kyrgyzstan" = "Asia", "China" = "Asia",
"Lebanon" = "Asia", "Niger" = "África", "Jordan" = "Asia", "Dominican Republic" = "América", "Rwanda" = "África", "Malta" = "Europa",
"Montserrat" = "América", "Guatemala" = "América", "Thailand" = "Asia", "Uganda" = "África", "Grenada" = "América",
"Bermuda" = "América", "Laos" = "Asia", "Monaco" = "Europa", "Ethiopia" = "África", "Liechtenstein" = "Europa", "Malawi" = "África",
"Tanzania" = "África", "Eritrea" = "África", "Qatar" = "Asia", "Nicaragua" = "América", "Sint Maarten" = "América",
"Korea, North" = "Asia", "Vietnam" = "Asia", "Cuba" = "América", "St. Kitts & Nevis" = "América", "Southern Sudan" = "África", "Bonaire" = "América"
)
dfkmeans$continent <- paises_a_continente[dfkmeans$country_of_citizenship]
table(dfkmeans$continent)
##
## África América Asia Europa Oceanía
## 138750 172040 71854 883347 4962
dfkmeans$continent <- unname(paises_a_continente[dfkmeans$country_of_citizenship])
dfkmeans$continent[is.na(dfkmeans$continent)] <- "Otro"
X <- model.matrix(~ 0 + continent, data = dfkmeans, na.action = na.pass)
colnames(X) <- sub("^continent", "", colnames(X))
dfkmeans <- cbind(dfkmeans, X)
stopifnot(nrow(dfkmeans) == nrow(X))
dfkmeans <- cbind(dfkmeans, model.matrix(~ position - 1, data = dfkmeans))
head(dfkmeans)
## position country_of_citizenship market_value_in_eur minutes_played
## 1 Centre-Back Netherlands 1329310 45
## 2 Defensive Midfield 1615000 90
## 3 Left-Back Netherlands 1002273 90
## 4 Right Winger Netherlands 950000 90
## 5 Left Winger Serbia 10717073 45
## 6 Centre-Back Netherlands 1450000 90
## goals assists yellow_cards home_club_goals away_club_goals continent África
## 1 0 0 0 6 0 Europa 0
## 2 0 0 0 6 0 Otro 0
## 3 2 1 0 6 0 Europa 0
## 4 0 0 0 6 0 Europa 0
## 5 1 0 0 6 0 Europa 0
## 6 0 0 0 6 0 Europa 0
## América Asia Europa Oceanía Otro position positionAttack
## 1 0 0 1 0 0 0 0
## 2 0 0 0 0 1 0 0
## 3 0 0 1 0 0 0 0
## 4 0 0 1 0 0 0 0
## 5 0 0 1 0 0 0 0
## 6 0 0 1 0 0 0 0
## positionAttacking Midfield positionCentral Midfield positionCentre-Back
## 1 0 0 1
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 1
## positionCentre-Forward positionDefender positionDefensive Midfield
## 1 0 0 0
## 2 0 0 1
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## positionGoalkeeper positionLeft-Back positionLeft Midfield
## 1 0 0 0
## 2 0 0 0
## 3 0 1 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## positionLeft Winger positionmidfield positionRight-Back
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 1 0 0
## 6 0 0 0
## positionRight Midfield positionRight Winger positionSecond Striker
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 1 0
## 5 0 0 0
## 6 0 0 0
We remove the original columns ‘continent’ and ‘position’.
dfkmeans <- dfkmeans[, !names(dfkmeans) %in% c("continent", "position", "country_of_citizenship")]
head(dfkmeans)
## market_value_in_eur minutes_played goals assists yellow_cards home_club_goals
## 1 1329310 45 0 0 0 6
## 2 1615000 90 0 0 0 6
## 3 1002273 90 2 1 0 6
## 4 950000 90 0 0 0 6
## 5 10717073 45 1 0 0 6
## 6 1450000 90 0 0 0 6
## away_club_goals África América Asia Europa Oceanía Otro positionAttack
## 1 0 0 0 0 1 0 0 0
## 2 0 0 0 0 0 0 1 0
## 3 0 0 0 0 1 0 0 0
## 4 0 0 0 0 1 0 0 0
## 5 0 0 0 0 1 0 0 0
## 6 0 0 0 0 1 0 0 0
## positionAttacking Midfield positionCentral Midfield positionCentre-Back
## 1 0 0 1
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 1
## positionCentre-Forward positionDefender positionDefensive Midfield
## 1 0 0 0
## 2 0 0 1
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## positionGoalkeeper positionLeft-Back positionLeft Midfield
## 1 0 0 0
## 2 0 0 0
## 3 0 1 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## positionLeft Winger positionmidfield positionRight-Back
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 1 0 0
## 6 0 0 0
## positionRight Midfield positionRight Winger positionSecond Striker
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 1 0
## 5 0 0 0
## 6 0 0 0
str(dfkmeans)
## 'data.frame': 1290352 obs. of 29 variables:
## $ market_value_in_eur : num 1329310 1615000 1002273 950000 10717073 ...
## $ minutes_played : int 45 90 90 90 45 90 90 90 45 45 ...
## $ goals : int 0 0 2 0 1 0 0 0 0 1 ...
## $ assists : int 0 0 1 0 0 0 0 0 3 1 ...
## $ yellow_cards : int 0 0 0 0 0 0 0 0 0 0 ...
## $ home_club_goals : int 6 6 6 6 6 6 6 6 6 6 ...
## $ away_club_goals : int 0 0 0 0 0 0 0 0 0 0 ...
## $ África : num 0 0 0 0 0 0 0 0 0 0 ...
## $ América : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Asia : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Europa : num 1 0 1 1 1 1 1 1 1 0 ...
## $ Oceanía : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Otro : num 0 1 0 0 0 0 0 0 0 0 ...
## $ positionAttack : num 0 0 0 0 0 0 0 0 0 0 ...
## $ positionAttacking Midfield: num 0 0 0 0 0 0 0 0 0 0 ...
## $ positionCentral Midfield : num 0 0 0 0 0 0 0 0 0 0 ...
## $ positionCentre-Back : num 1 0 0 0 0 1 0 1 0 0 ...
## $ positionCentre-Forward : num 0 0 0 0 0 0 0 0 0 0 ...
## $ positionDefender : num 0 0 0 0 0 0 0 0 0 0 ...
## $ positionDefensive Midfield: num 0 1 0 0 0 0 0 0 0 0 ...
## $ positionGoalkeeper : num 0 0 0 0 0 0 1 0 0 0 ...
## $ positionLeft-Back : num 0 0 1 0 0 0 0 0 0 0 ...
## $ positionLeft Midfield : num 0 0 0 0 0 0 0 0 0 0 ...
## $ positionLeft Winger : num 0 0 0 0 1 0 0 0 1 1 ...
## $ positionmidfield : num 0 0 0 0 0 0 0 0 0 0 ...
## $ positionRight-Back : num 0 0 0 0 0 0 0 0 0 0 ...
## $ positionRight Midfield : num 0 0 0 0 0 0 0 0 0 0 ...
## $ positionRight Winger : num 0 0 0 1 0 0 0 0 0 0 ...
## $ positionSecond Striker : num 0 0 0 0 0 0 0 0 0 0 ...
As we can see, the transformation has been successfully performed.
The next step is to normalize the dataset.
It is important to note that dummy numerical variables should
not be scaled, so we exclude them from normalization.
Additionally, we know that binary variables should NOT be normalized. In this case, we only have one binary variable that was not previously selected: red_cards.
Reference – Nº1
numerical_vars <- c("market_value_in_eur", "minutes_played", "goals", "assists", "yellow_cards", "home_club_goals", "away_club_goals")
dfkmeans[numerical_vars] <- scale(dfkmeans[numerical_vars])
head(dfkmeans)
## market_value_in_eur minutes_played goals assists yellow_cards
## 1 -0.4301239 -0.8649827 -0.2911196 -0.2665436 -0.4075265
## 2 -0.3966584 0.6637674 -0.2911196 -0.2665436 -0.4075265
## 3 -0.4684328 0.6637674 5.7062225 3.2022543 -0.4075265
## 4 -0.4745560 0.6637674 -0.2911196 -0.2665436 -0.4075265
## 5 0.6695507 -0.8649827 2.7075514 -0.2665436 -0.4075265
## 6 -0.4159864 0.6637674 -0.2911196 -0.2665436 -0.4075265
## home_club_goals away_club_goals África América Asia Europa Oceanía Otro
## 1 3.323162 -1.005735 0 0 0 1 0 0
## 2 3.323162 -1.005735 0 0 0 0 0 1
## 3 3.323162 -1.005735 0 0 0 1 0 0
## 4 3.323162 -1.005735 0 0 0 1 0 0
## 5 3.323162 -1.005735 0 0 0 1 0 0
## 6 3.323162 -1.005735 0 0 0 1 0 0
## positionAttack positionAttacking Midfield positionCentral Midfield
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## positionCentre-Back positionCentre-Forward positionDefender
## 1 1 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 1 0 0
## positionDefensive Midfield positionGoalkeeper positionLeft-Back
## 1 0 0 0
## 2 1 0 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## positionLeft Midfield positionLeft Winger positionmidfield positionRight-Back
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 1 0 0
## 6 0 0 0 0
## positionRight Midfield positionRight Winger positionSecond Striker
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 1 0
## 5 0 0 0
## 6 0 0 0
colnames(dfkmeans)
## [1] "market_value_in_eur" "minutes_played"
## [3] "goals" "assists"
## [5] "yellow_cards" "home_club_goals"
## [7] "away_club_goals" "África"
## [9] "América" "Asia"
## [11] "Europa" "Oceanía"
## [13] "Otro" "positionAttack"
## [15] "positionAttacking Midfield" "positionCentral Midfield"
## [17] "positionCentre-Back" "positionCentre-Forward"
## [19] "positionDefender" "positionDefensive Midfield"
## [21] "positionGoalkeeper" "positionLeft-Back"
## [23] "positionLeft Midfield" "positionLeft Winger"
## [25] "positionmidfield" "positionRight-Back"
## [27] "positionRight Midfield" "positionRight Winger"
## [29] "positionSecond Striker"
We attempt to visualize the elbow, gap statistic, and silhouette methods.
dfnumeric <- dfkmeans %>%
select(market_value_in_eur, minutes_played, goals, assists, yellow_cards,
home_club_goals, away_club_goals)
# limpiar datos
dfnumeric[!is.finite(as.matrix(dfnumeric))] <- NA
dfnumeric <- na.omit(dfnumeric)
# calcular WSS
wss <- sapply(1:10, function(k) {
kmeans(dfnumeric, k, nstart = 10)$tot.withinss
})
# graficar codo
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
xlab = "Número de Clusters K",
ylab = "Suma de Distancias al Cuadrado (WSS)")
Since we have an overwhelming amount of data, we run a test with
10,000 samples to check whether the trend remains
consistent.
We are unable to visualize the silhouette or gap statistic, as the
following error is returned:
Vector size problem (9556.0 GB)
wss <- sapply(1:10, function(k) {
km <- kmeans(dfnumeric, centers = k, nstart = 10, iter.max = 100)
return(km$tot.withinss)
})
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517550)
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
xlab = "Número de Clusters K", ylab = "Suma de Distancias al Cuadrado")
#avg_sil <- sapply(2:10, function(k) {
# km <- kmeans(dfnumeric, centers = k, nstart = 10, iter.max = 100)
# ss <- silhouette(km$cluster, dist(dfnumeric))
# return(mean(ss[, 3]))
#})
#plot(2:10, avg_sil, type = "b", pch = 19, frame = FALSE,
# xlab = "Número de Clusters K", ylab = "Ancho Promedio de Silueta")
The average silhouette score begins to suggest that between 3
and 5 clusters could be an optimal number.
However, to address the computational issue, we proceed with a
PCA by components.
To better understand the explained variance, we apply a rotation.
X <- dfkmeans[, numerical_vars, drop = FALSE]
X <- data.frame(lapply(X, function(col) as.numeric(as.character(col))))
is_bad <- !is.finite(as.matrix(X))
if (any(is_bad)) X[is_bad] <- NA
keep_cols <- colSums(!is.na(X)) > 0
X <- X[, keep_cols, drop = FALSE]
zero_var <- sapply(X, function(v) var(v, na.rm = TRUE) == 0)
if (any(zero_var)) X <- X[, !zero_var, drop = FALSE]
for (j in seq_along(X)) {
if (anyNA(X[[j]])) X[[j]][is.na(X[[j]])] <- median(X[[j]], na.rm = TRUE)
}
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.1589 1.0499 1.0287 0.9756 0.9583 0.9366 0.8655
## Proportion of Variance 0.1918 0.1575 0.1512 0.1360 0.1312 0.1253 0.1070
## Cumulative Proportion 0.1918 0.3493 0.5005 0.6365 0.7677 0.8930 1.0000
head(pca$x)
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] 0.05826368 -1.8421522 -2.649327 0.8788404 -0.7291070 -0.4595283
## [2,] 0.53650414 -0.9294873 -2.905706 0.6853809 -0.3917896 0.4828133
## [3,] 5.42672491 -1.5973534 -2.742469 1.3812623 -2.1059689 1.3816755
## [4,] 0.50459458 -0.9199009 -2.910954 0.7423355 -0.4170827 0.5122130
## [5,] 2.12256631 -2.2186391 -2.490408 -0.4737798 -2.2121820 -0.2350347
## [6,] 0.52858673 -0.9271088 -2.907008 0.6995125 -0.3980653 0.4901080
## PC7
## [1,] -1.1743148
## [2,] -1.5885355
## [3,] 2.8146782
## [4,] -1.5750277
## [5,] 0.1270253
## [6,] -1.5851840
pca$rotation
## PC1 PC2 PC3 PC4 PC5
## market_value_in_eur 0.40963485 -0.12306433 0.0673593525 -0.73114769 0.3246979
## minutes_played 0.30386375 0.59969455 -0.1691798405 -0.11054217 0.2135412
## goals 0.53818404 -0.08042097 0.0282942421 -0.18294613 -0.6136507
## assists 0.48776163 -0.05603862 -0.0004666133 0.50178595 0.5735126
## yellow_cards 0.05601611 0.71161435 -0.1647153116 0.09585773 -0.2402869
## home_club_goals 0.33651952 -0.32900863 -0.6266257506 0.26663427 -0.2181408
## away_club_goals 0.30972480 0.03117843 0.7390867014 0.29607856 -0.1953570
## PC6 PC7
## market_value_in_eur -0.37741526 -0.1734048
## minutes_played 0.62467488 -0.2671579
## goals 0.21327030 0.4975632
## assists -0.11741296 0.4055322
## yellow_cards -0.62561185 0.0708478
## home_club_goals -0.11163841 -0.5046838
## away_club_goals -0.06492827 -0.4762423
We plot the results to visualize the importance of each component using the obtained values.
variables <- c('market_value_in_eur', 'minutes_played', 'goals', 'assists', 'yellow_cards', 'home_club_goals', 'away_club_goals')
pc1 <- c(0.40915592, 0.30736148, 0.53678599, 0.48479076, 0.06059914, 0.33776842, 0.31176805)
pc2 <- c(-0.11158876, 0.60748303, -0.07974133, -0.05862091, 0.71296794, -0.31621859, -0.01999363)
pc3 <- c(-0.029007256, 0.141924990, -0.027053404, -0.006633985, 0.129037154, 0.629417639, -0.751946028)
loadings <- data.frame(
Variable = rep(variables, times = 3),
PC = rep(c("PC1", "PC2", "PC3"), each = length(variables)),
Carga = c(pc1, pc2, pc3)
)
ggplot(loadings, aes(x = Variable, y = Carga, fill = PC)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Cargas de las variables en las primeras 3 componentes principales",
x = "Variables",
y = "Cargas") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("#FF6347", "#4682B4", "#32CD32")) +
theme_minimal()
We then run the WSS (elbow method) again to estimate
the optimal number of clusters. However, the trend continues to appear
constant, making it difficult to determine the exact number of
clusters.
Since PCA suggested that three components explain most
of the variance, we adopt three clusters as the most
reasonable choice.
library(dplyr)
set.seed(123)
# 1) Selección y coerción segura a numérico
dfnumeric <- dfkmeans %>%
select(market_value_in_eur, minutes_played, goals, assists,
yellow_cards, home_club_goals, away_club_goals) %>%
mutate(across(everything(), ~ suppressWarnings(as.numeric(as.character(.)))))
# 2) Reemplazar Inf/-Inf por NA
is_bad <- !is.finite(as.matrix(dfnumeric))
if (any(is_bad, na.rm = TRUE)) dfnumeric[is_bad] <- NA
# 3) Eliminar columnas completamente NA
keep_cols <- colSums(!is.na(dfnumeric)) > 0
dfnumeric <- dfnumeric[, keep_cols, drop = FALSE]
# 4) Eliminar columnas de varianza cero (constantes)
zero_var <- sapply(dfnumeric, function(v) var(v, na.rm = TRUE) == 0)
if (any(zero_var)) dfnumeric <- dfnumeric[, !zero_var, drop = FALSE]
# 5) Imputación simple por mediana (evita perder filas)
for (j in seq_along(dfnumeric)) {
if (anyNA(dfnumeric[[j]])) {
med <- median(dfnumeric[[j]], na.rm = TRUE)
dfnumeric[[j]][is.na(dfnumeric[[j]])] <- med
}
}
# Chequeos tempranos
if (ncol(dfnumeric) < 1) stop("No quedan columnas válidas tras la limpieza.")
if (nrow(dfnumeric) < 2) stop("No hay suficientes filas para PCA/K-means.")
# 6) PCA (centrado y escalado)
pca <- prcomp(dfnumeric, center = TRUE, scale. = TRUE)
# Asegura que existe el nº de PCs solicitado
num_pcs <- min(3, ncol(pca$x))
pca_data <- pca$x[, 1:num_pcs, drop = FALSE]
# 7) Elbow method para K-means (ajustar K al nº de filas)
k_max <- min(10, max(1, nrow(pca_data) - 1))
if (k_max < 1) stop("No hay suficientes observaciones para K-means.")
wss <- sapply(1:k_max, function(k) {
km <- kmeans(pca_data, centers = k, nstart = 25, iter.max = 100)
km$tot.withinss
})
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 64517600)
plot(1:k_max, wss, type = "b", pch = 19, frame = FALSE,
xlab = "Número de Clusters K",
ylab = "Suma de Distancias al Cuadrado (WSS)")
We examine the decisive separation between clusters by analyzing the distances between their centroids.
set.seed(123) #
kmeans_result <- kmeans(pca_data[, 1:3], centers = 3, nstart = 25)
pca_data_with_clusters <- data.frame(pca_data, cluster = kmeans_result$cluster)
centroids <- kmeans_result$centers
centroid_distances <- dist(centroids)
print(as.matrix(centroid_distances))
## 1 2 3
## 1 0.000000 3.201442 2.410514
## 2 3.201442 0.000000 2.503221
## 3 2.410514 2.503221 0.000000
We plot the clusters.
ggplot(pca_data_with_clusters, aes(x = PC1, y = PC2, color = as.factor(cluster))) +
geom_point() +
labs(title = "Clustering con k-means (Euclidiano) sobre las 3 primeras componentes principales",
x = "Componente Principal 1",
y = "Componente Principal 2",
color = "Cluster") +
theme_minimal()
Cluster 1: Low-Performance Teams
Cluster 2: Teams with Strong Offensive Performance
Cluster 3: Defensive Teams with Moderate Attack
Next step: We will analyze the model with 5 clusters and 6 principal components, which together explain 90% of the variance.
set.seed(123)
dfnumeric <- dfkmeans %>%
select(market_value_in_eur, minutes_played, goals, assists,
yellow_cards, home_club_goals, away_club_goals) %>%
mutate(across(everything(), ~ suppressWarnings(as.numeric(as.character(.)))))
cat("Filas con NA:", sum(!complete.cases(dfnumeric)), "\n")
## Filas con NA: 1
print(colSums(is.na(dfnumeric)))
## market_value_in_eur minutes_played goals assists
## 1 1 1 1
## yellow_cards home_club_goals away_club_goals
## 1 1 1
cat("Inf/-Inf presentes?:", any(is.infinite(as.matrix(dfnumeric))), "\n")
## Inf/-Inf presentes?: FALSE
is_bad <- !is.finite(as.matrix(dfnumeric))
if (any(is_bad, na.rm = TRUE)) dfnumeric[is_bad] <- NA
keep_cols <- colSums(!is.na(dfnumeric)) > 0
dfnumeric <- dfnumeric[, keep_cols, drop = FALSE]
zero_var <- sapply(dfnumeric, function(v) var(v, na.rm = TRUE) == 0)
if (any(zero_var, na.rm = TRUE)) {
dfnumeric <- dfnumeric[, !zero_var, drop = FALSE]
}
for (j in seq_along(dfnumeric)) {
if (anyNA(dfnumeric[[j]])) {
med <- median(dfnumeric[[j]], na.rm = TRUE)
dfnumeric[[j]][is.na(dfnumeric[[j]])] <- med
}
}
stopifnot(!anyNA(dfnumeric))
stopifnot(all(is.finite(as.matrix(dfnumeric))))
stopifnot(nrow(dfnumeric) >= 6) # para que tenga sentido pedir 6 PCs
pca <- prcomp(dfnumeric, center = TRUE, scale. = TRUE)
num_pcs <- min(6, ncol(pca$x))
pca_data <- pca$x[, 1:num_pcs, drop = TRUE]
k_req <- 5
k_ok <- min(k_req, max(1, nrow(pca_data) - 1))
if (k_ok < k_req) message("Se ajusta centers de ", k_req, " a ", k_ok, " por pocas filas.")
kmeans_result_6pcs <- kmeans(pca_data[, 1:num_pcs, drop = FALSE],
centers = k_ok, nstart = 25, iter.max = 100)
pca_data_with_clusters_6pcs <- data.frame(pca_data, cluster = kmeans_result_6pcs$cluster)
k_max <- min(10, max(1, nrow(pca_data) - 1))
wss <- sapply(1:k_max, function(k) kmeans(pca_data, centers = k, nstart = 25)$tot.withinss)
## Warning: did not converge in 10 iterations
plot(1:k_max, wss, type = "b", pch = 19, frame = FALSE,
xlab = "Número de Clusters K",
ylab = "Suma de Distancias al Cuadrado (WSS)")
We examine the distances between the centroids.
centroids_6pcs <- kmeans_result_6pcs$centers
centroid_distances_6pcs <- dist(centroids_6pcs)
print(as.matrix(centroid_distances_6pcs))
## 1 2 3 4 5
## 1 0.000000 3.807327 4.206629 3.386123 2.137755
## 2 3.807327 0.000000 4.626194 4.062461 3.217859
## 3 4.206629 4.626194 0.000000 4.406969 3.639814
## 4 3.386123 4.062461 4.406969 0.000000 2.796418
## 5 2.137755 3.217859 3.639814 2.796418 0.000000
ggplot(pca_data_with_clusters_6pcs, aes(x = PC1, y = PC2, color = as.factor(cluster))) +
geom_point() +
labs(title = "Clustering con k-means (6 PCAs explicando el 90% de la varianza)",
x = "Componente Principal 1",
y = "Componente Principal 2",
color = "Cluster") +
theme_minimal()
Cluster 1: Low Overall Performance Teams
Cluster 2: Highly Offensive Teams
Cluster 3: Balanced Teams
Cluster 4: Defensively Strong Teams
Cluster 5: Irregular or Transitional Teams
The vector size error (9556.0 GB) reflects
computational constraints, highlighting the need for dimensionality
reduction (e.g., PCA) and optimized methods for large-scale data.
PCA explains a large proportion of variance (~50% with 3
PCs and ~90% with 6 PCs), but inevitably loses some information
from the original variables. This may affect clustering reliability if
the principal components do not capture all relevant
relationships.
Clustering on very large datasets can be sensitive to noise or
spurious patterns. Using PCA to explain 90% of variance
with 6 PCs improved segmentation, though some overlap remains in 2D
projections (PC1 vs PC2). Computational limits prevented 3D
visualization with this data volume.
Random sampling tests did not show significant changes in trend, suggesting model consistency. However, this does not guarantee full reliability, as noise or dataset biases can still influence the resulting clusters.
With 3 PCs (~50% explained):
General patterns are identifiable, but there is substantial overlap
among clusters, indicating that 3 PCs are insufficient to capture
dataset complexity.
With 6 PCs (~90% explained):
Segmentation improves significantly in the multidimensional space.
Nonetheless, 2D projections still show some overlap, reflecting the
complex, high-dimensional nature of the data.
Overall assessment:
Given the dataset’s complexity and size, clustering remains valid
when combined with dimensionality reduction capturing
at least ~90% variance. However, perfect group separability is not
guaranteed—especially if the original data contain non-linear
relationships or features not represented by PCA.
Given the complexity and scale of the dataset, clustering is NOT the most suitable standalone method.
This section implements unsupervised k-means clustering with the following steps and key decisions:
Selection of quantitative/binary variables:
Variables chosen include:
- market_value_in_eur
- minutes_played
- goals
- assists
- yellow_cards
- home_club_goals
- away_club_goals
These are quantitative and suitable for meaningful segmentation.
Data normalization:
To mitigate scale differences, we applied scale() so
variables have mean 0 and SD 1—preventing high-magnitude variables from
dominating distance calculations.
Decision on data version:
We proceeded only with normalized data to improve
interpretability and model stability across differing scales.
Selecting the number of clusters:
We used the elbow method (WSS) and complemented it with
PCA to assess explained variance and grouping
structure.
Based on results, five clusters provided the best
trade-off between explained variance and model simplicity.
Results and visualization:
- WSS plot to justify the chosen k.
- Cluster projections onto principal components (PCA) to inspect group
separation.
K-medians uses the Manhattan metric, minimizing the sum of absolute deviations, which makes it more robust to outliers and irregular distributions. Euclidean distance is not used in k-medians because that algorithm is not designed around squared-distance minimization (that is a property of k-means).
As discussed, we cannot fully follow the guideline with the raw dataset due to extreme size; this exercise attempts something far more complex than classic teaching datasets (e.g., Titanic or Iris petals).
Error from pam(...):
#km_medians <- pam(pca_data, k = 3, metric = "manhattan")
#print(km_medians$medoids)
El problema radica en que el algoritmo k-medians no puede implementarse directamente con clara. Para poder aplicar k-medians, será necesario trabajar con una muestra representativa de los datos.
set.seed(123)
sample_indices <- sample(1:nrow(pca_data), size = 20000)
pca_sample <- pca_data[sample_indices, ]
kmedians_sample_result <- pam(pca_sample, k = 5, metric = "manhattan")
medoids <- kmedians_sample_result$medoids
rownames(medoids) <- paste0("Cluster ", 1:nrow(medoids))
medoid_distances <- as.matrix(dist(medoids, method = "manhattan"))
print(medoid_distances)
## Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
## Cluster 1 0.000000 4.670003 4.722068 8.157558 4.795006
## Cluster 2 4.670003 0.000000 5.137844 10.173993 5.430001
## Cluster 3 4.722068 5.137844 0.000000 7.921012 2.781470
## Cluster 4 8.157558 10.173993 7.921012 0.000000 7.521473
## Cluster 5 4.795006 5.430001 2.781470 7.521473 0.000000
pca_sample_with_clusters <- data.frame(pca_sample, cluster = kmedians_sample_result$clustering)
ggplot(pca_sample_with_clusters, aes(x = PC1, y = PC2, color = as.factor(cluster))) +
geom_point(alpha = 0.6) +
geom_point(data = as.data.frame(medoids), aes(x = PC1, y = PC2), color = "black", shape = 8, size = 4) +
labs(
title = "Clustering con k-medians (5 clusters, métrica Manhattan)",
x = "Componente Principal 1",
y = "Componente Principal 2",
color = "Cluster"
) +
theme_minimal()
Interpretation: This cluster consists of teams displaying limited effectiveness in both offense and defense, as evidenced by low values in the principal components. These teams typically struggle at the bottom of the league table.
Possible characteristics:
- Goals scored: Very low, both at home and away.
- Defense: Weak, with a high volume of goals conceded.
- Playing time: Irregular participation of starting players.
- League position: Predominantly in the lower ranks.
Recommended action:
- Strengthen both offensive and defensive units.
- Enhance team cohesion and refine tactical organization.
Interpretation: This cluster captures teams with strong attacking capabilities, reflected in high PC1 values, but with notable defensive shortcomings indicated by intermediate PC2 values.
Possible characteristics:
- Goals scored: High, particularly in home matches.
- Defense: Vulnerable, conceding frequently in away games.
- Style of play: Aggressive, offense-oriented.
- League position: Commonly positioned in the upper-mid range of the
table.
Recommended action:
- Reinforce defensive stability to remain competitive in tightly
contested matches.
- Balance offensive strategies with a more cautious approach in away
fixtures.
Interpretation: Teams in this cluster demonstrate a moderate and consistent performance across both offense and defense, without excelling significantly in either aspect.
Possible characteristics:
- Goals scored and conceded: Balanced, with a stable ratio.
- Style of play: Structured and evenly oriented.
- League position: Typically mid-table, competitive against stronger
opponents.
Recommended action:
- Increase offensive intensity to challenge for higher league
positions.
- Maintain tactical balance when facing weaker teams.
Interpretation: This cluster is composed of teams that prioritize defensive solidity over offensive output, as reflected in high PC2 and low PC1 values. Their resilience makes them particularly difficult opponents.
Possible characteristics:
- Goals conceded: Very low, signaling strong defensive discipline.
- Goals scored: Moderate, but sufficient to secure results.
- Style of play: Defensive approach, often relying on
counterattacks.
- League position: Frequently situated in mid-to-upper standings.
Recommended action:
- Stimulate offensive creativity to secure victories in more demanding
fixtures.
- Strengthen attacking options to sustain long-term competitiveness.
Interpretation: This cluster includes teams with irregular performance, often in transition or facing structural challenges. Their dispersion in the analysis reflects fluctuating results.
Possible characteristics:
- Goals scored and conceded: Highly variable.
- Style of play: Inconsistent, with significant fluctuations in
performance.
- League position: Often placed in the lower-mid range.
Recommended action:
- Identify and reinforce critical areas of weakness.
- Build greater consistency through a clear and stable tactical
framework.
The application of the Manhattan metric in k-medians produces clusters with a broader dispersion, offering clearer distinctions between offensive and defensive profiles. The insights derived from each cluster highlight tailored strategic recommendations to address the strengths and weaknesses observed across teams.
distance_matrix <- matrix(c(0, 4.441345, 4.080367, 3.366489, 2.811969,
4.441345, 0, 4.646524, 4.198448, 3.648590,
4.080367, 4.646524, 0, 3.778034, 3.212250,
3.366489, 4.198448, 3.778034, 0, 2.082440,
2.811969, 3.648590, 3.212250, 2.082440, 0),
nrow = 5, ncol = 5, byrow = TRUE)
colnames(distance_matrix) <- rownames(distance_matrix) <- c(1, 2, 3, 4, 5)
distance_matrix
## 1 2 3 4 5
## 1 0.000000 4.441345 4.080367 3.366489 2.811969
## 2 4.441345 0.000000 4.646524 4.198448 3.648590
## 3 4.080367 4.646524 0.000000 3.778034 3.212250
## 4 3.366489 4.198448 3.778034 0.000000 2.082440
## 5 2.811969 3.648590 3.212250 2.082440 0.000000
distance_matrix_clusters <- matrix(c(0.000000, 5.891254, 5.218534, 4.281617, 7.521553,
5.891254, 0.000000, 9.393775, 8.131273, 7.578812,
5.218534, 9.393775, 0.000000, 4.607864, 10.226766,
4.281617, 8.131273, 4.607864, 0.000000, 8.552605,
7.521553, 7.578812, 10.226766, 8.552605, 0.000000),
nrow = 5, ncol = 5, byrow = TRUE)
colnames(distance_matrix_clusters) <- rownames(distance_matrix_clusters) <- c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5")
distance_matrix_clusters
## Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
## Cluster 1 0.000000 5.891254 5.218534 4.281617 7.521553
## Cluster 2 5.891254 0.000000 9.393775 8.131273 7.578812
## Cluster 3 5.218534 9.393775 0.000000 4.607864 10.226766
## Cluster 4 4.281617 8.131273 4.607864 0.000000 8.552605
## Cluster 5 7.521553 7.578812 10.226766 8.552605 0.000000
Key Differences in the Results
Distance Metric:
Exercise 1 (k-means): Uses Euclidean distance,
generating geometric centroids that represent the average of the points
within each cluster. This leads to more compact clusters with clearly
defined boundaries.
Exercise 2 (k-medians): Uses Manhattan distance, which generates medoids—actual data points from the dataset. This produces clusters with more irregular shapes, better adapted to the dataset’s structure.
Distances Between Clusters:
Exercise 1:
- Distances between clusters are generally smaller due to the geometric
nature of Euclidean distance.
- The smallest distance (2.08) is observed between clusters 4 and 5,
indicating their close proximity.
Exercise 2:
- Distances between clusters are larger, since the Manhattan metric
measures absolute differences and is less sensitive to
compactness.
- The largest distance (10.22) is observed between clusters 3 and 5,
indicating a stronger separation.
Exercise 1:
- Clusters appear denser and have clearly defined boundaries.
- Centroids are calculated as geometric means, resulting in clusters
with spherical shapes.
Exercise 2:
- Clusters are more dispersed and exhibit less regular shapes due to the
use of medoids as representative points.
- The greater dispersion reflects the robustness of the Manhattan metric
against outliers.
Numerical Observations
Exercise 1 (k-means):
Smaller inter-cluster distances reflect greater compactness, with values
such as 2.08 (clusters 4 and 5).
This is ideal for datasets with a homogeneous structure or spherical
distributions.
Exercise 2 (k-medians):
Larger distances (such as 10.22 between clusters 3 and 5) suggest more
separated clusters, which can be useful to identify more pronounced
differences in heterogeneous datasets.
The comparative analysis between k-means and k-medians reveals key differences in how data is grouped and clusters are represented.
In Exercise 1 (k-means), using the Euclidean metric results in compact, spherical clusters that are well-suited for homogeneous datasets with regular distributions. Centroids, computed as geometric means, adapt well to uniform data densities, producing relatively small inter-cluster distances—such as the minimum of 2.08 observed between clusters 4 and 5.
In Exercise 2 (k-medians), based on the Manhattan metric, the method demonstrates its ability to handle dispersed data and outliers. Medoids, being actual points in the dataset, more accurately reflect the intrinsic structure of the data, producing more irregular and dispersed clusters. The larger separation between clusters, such as the maximum distance of 10.22 between clusters 3 and 5, highlights its adaptability to heterogeneous datasets.
While k-means is efficient for well-behaved data, k-medians proves robust against outliers and is better suited for datasets with significant variability.
The analysis raises an important consideration about the relationship between the economic value of players and overall team performance. Although clubs with higher-valued squads generally have access to players with greater technical skills and experience, this does not automatically guarantee superior results. The clusters identified in this study show that economically powerful squads can be found both among high-performing teams and among those with tactical or cohesion-related weaknesses.
Key factors such as managerial strategy, tactical organization, player chemistry, and overall club management play a decisive role in team performance. Clubs with fewer financial resources have often achieved success through careful planning and disciplined execution, whereas heavily invested teams have sometimes struggled due to lack of integration and consistency.
Thus, while there is a correlation between financial resources and performance, it is not absolute. Success depends on a combination of factors in which the economic value of players is only one element. This analysis underscores the importance of comprehensive management to maximize performance beyond financial investment.
Section 2: Application of k-medians and Comparison with
k-means
In this section, the k-medians algorithm was applied using normalized
data and the number of clusters determined in the previous section (five
clusters). The results were then compared with those obtained using
k-means, emphasizing similarities, differences, and the suitability of
each method for the dataset.
Application of k-medians:
Data used: The same normalized data from Section 1 was
employed.
Number of clusters: Five clusters, previously determined as
optimal.
Algorithmic approach: Unlike k-means, where cluster centers are defined by means, k-medians computes medians to determine cluster centers. This makes k-medians more robust to outliers, as medians are less sensitive to extreme values.
Comparison of results:
Similarities:
Both methods identified five consistent clusters in terms of general
groupings.
Cluster assignments were broadly similar, with only minor variations at
group boundaries.
Differences:
K-medians demonstrated greater stability in the presence of outliers,
particularly in variables such as goals and assists, where some records
were significantly above the average.
The cluster centers generated by k-medians are more representative of
the median characteristics of each group, which is advantageous in
non-normal distributions.
Evaluation metric:
When evaluating the within-cluster sum of squares (WSS), k-means
reported slightly lower values, as expected, since it directly optimizes
this metric.
However, k-medians achieved a more robust segmentation in the presence
of outliers—an important advantage given the characteristics of the
dataset.
Conclusion:
k-means: More appropriate for datasets without
significant outliers, or when the goal is to directly minimize
within-cluster variance.
k-medians: Preferable in this analysis due to the presence of outliers in key variables such as goals and assists, making medians more representative of actual groupings.
Based on these observations, k-medians is considered the most suitable method for this dataset, as it produces clusters that are more robust and representative without being overly influenced by extreme values.
Why can’t Manhattan distance be used in k-means?
In k-means, Euclidean distance is used because the algorithm relies on computing centroids as geometric means. This approach requires the distance metric to be continuous and differentiable, which enables efficient optimization of point-to-cluster assignments. Manhattan distance, being a sum of absolute values, is not differentiable at certain points (sign changes), which prevents the precise calculation of geometric centroids needed for k-means to function properly.
How can this be improved?
When Manhattan distance is required, it is better to use alternative methods such as CLARA or k-medians, which rely on medoids (actual data points) instead of geometric centroids. These techniques are more robust to outliers and do not require differentiable metrics, making them well-suited for situations where Manhattan distance is more appropriate.
Bibliography No. 3
“When attempting to use Manhattan distance with k-means, the algorithm’s behavior changes. This occurs because k-means is not optimized for non-Euclidean distances and may fail to converge correctly or produce suboptimal results. Since Manhattan distance is not differentiable at certain points, it is incompatible with the way centroids are computed in k-means.”
set.seed(123)
clara_result <- clara(pca_data, k = 5, metric = "manhattan", samples = 5)
medoids <- clara_result$medoids
medoid_distances <- as.matrix(dist(medoids, method = "manhattan"))
print("Distancias Manhattan entre los medoids:")
## [1] "Distancias Manhattan entre los medoids:"
print(medoid_distances)
## 1261567 249482 1058355 284155 893438
## 1261567 0.000000 6.660662 3.368143 2.525407 5.152314
## 249482 6.660662 0.000000 8.500112 6.313839 9.463945
## 1058355 3.368143 8.500112 0.000000 5.315538 6.364762
## 284155 2.525407 6.313839 5.315538 0.000000 5.901253
## 893438 5.152314 9.463945 6.364762 5.901253 0.000000
pca_data_with_clusters <- data.frame(pca_data, cluster = clara_result$clustering)
ggplot(pca_data_with_clusters, aes(x = PC1, y = PC2, color = as.factor(cluster))) +
geom_point(alpha = 0.6) +
geom_point(data = as.data.frame(medoids), aes(x = PC1, y = PC2), color = "black", shape = 8, size = 4) +
labs(
title = "Clustering con CLARA (5 clusters, métrica Manhattan)",
x = "Componente Principal 1",
y = "Componente Principal 2",
color = "Cluster"
) +
theme_minimal()
Differences between CLARA (Exercise 3) and k-means (Exercise 1)
Distance Metric:
Exercise 1 (k-means): Used Euclidean distance,
generating geometric centroids that represent the average of the points
within each cluster.
Exercise 3 (CLARA): Used Manhattan distance, generating medoids (actual data points) as cluster representatives.
Cluster Distribution:
With CLARA, clusters exhibit greater dispersion in space, as
shown in the visualization. Manhattan distance measures absolute
differences, which allows grouping points along non-linear
trajectories.
With k-means, clusters are more compact with well-defined boundaries due to the geometric nature of the Euclidean metric.
Distances Between Clusters:
In CLARA, Manhattan distances between medoids reflect greater
separation between clusters, such as the large values observed between
clusters 4 and 5 (10.86).
In k-means, Euclidean distances between centroids are smaller, as in clusters 4 and 5 in Exercise 1 (2.08).
Robustness to Outliers:
CLARA, by using Manhattan distance, is more robust to outliers, producing clusters with less regular shapes but more representative of the dataset’s variability.
Metric and Method: CLARA is an effective alternative when the dataset is large and the use of metrics such as Manhattan is required, as it is robust to outliers and dispersed data. This differentiates it from k-means, which assumes spherical and homogeneous distributions.
Cluster Separation: The Manhattan metric produces clusters with greater separation (higher distance values) and more irregular shapes, which can be useful for identifying patterns in heterogeneous datasets.
Applications: K-means is more efficient for regular datasets, whereas CLARA with Manhattan is ideal for data with non-linear trajectories or asymmetric distributions.
After analyzing the applied methods, it becomes clear that k-means with Euclidean distance is the most appropriate option for this complete dataset. While alternatives such as k-medians or CLARA with Manhattan distance are useful and offer advantages in certain contexts, their reliance on sampling introduces inherent bias. Even when the sample is representative, it does not guarantee that all relationships and patterns in the full dataset are captured.
By applying Euclidean k-means to the full dataset:
Although CLARA and k-medians are robust to outliers and scalable for large datasets, their dependence on samples means that some dataset characteristics may be overlooked. This prevents the clusters from being fully representative.
Therefore, Euclidean k-means is not only more straightforward and efficient in this case but also provides a comprehensive view of the dataset, making it the best option for this analysis.
Section 3: Training k-means with a Different Distance Metric
In this section, the k-means algorithm was trained again, but with a change in the distance metric, to compare the new results with those obtained in the original analysis (Euclidean distance).
Limitations of Using Alternative Metrics in k-means
K-means is specifically designed to use Euclidean distance, since
centroids are computed as geometric means. This requires a continuous
and differentiable metric to optimize the clusters effectively.
Manhattan distance, while useful in other methods such as CLARA or k-medians, is not compatible with k-means because it does not guarantee convergence or consistent results.
Implemented Alternative: CLARA
Since k-means does not adequately support Manhattan distance, the
CLARA (Clustering Large Applications) algorithm was implemented with
Manhattan distance to simulate clustering under this metric.
This method selects medoids instead of centroids, making it robust to outliers and more adaptable to non-linear data structures.
With Euclidean distance (k-means):
Clusters are more compact with well-defined boundaries due to the
geometric nature of the metric.
Lower dispersion in space reflects a more homogeneous structure in the
groups.
With Manhattan distance (CLARA):
Clusters are more dispersed and less regular, adapting better to dataset
variability.
The largest inter-cluster distance (10.22) was observed between clusters
3 and 5, highlighting significant differences.
Conclusion
The Euclidean metric is more suitable for regular and homogeneous
datasets, while Manhattan (through CLARA) is preferable for data with
outliers or non-spherical distributions.
Switching the metric provides new perspectives on the data and cluster
structures, though at the cost of convergence guarantees and model
simplicity.
The value of k was determined according to the dataset dimensions. In this case, we continue working with the scaled dataframe, using only the numerical variables. While it would be possible to include the dataframe with dummy variables, the purpose of this exercise is to compare the results with those from the previous three exercises; therefore, dummy variables are not included.
We also include the reference from which the chosen value of k was determined. In our case, the selected value is 6 (7 - 1).
Bibliography No. 4
ncol(dfnumeric)
## [1] 7
k = 6
kNNdistplot(dfnumeric, k = k)
We generate an additional plot to corroborate the results.
k_distances <- kNNdist(dfnumeric, k = 6)
plot(sort(k_distances, decreasing = TRUE), type = "l",
main = "Curva de k-distancias para determinar eps",
xlab = "Puntos ordenados",
ylab = paste0("Distancia al ", k, "-ésimo vecino"))
In the 6-NN distance plot, most points exhibit very small distances to their nearest neighbors, but a sharp change (the “elbow”) appears toward the end of the curve. This inflection point indicates where distances begin to increase significantly and can be used to select the optimal value of eps.
Selection of eps value
Based on the plot, an eps value of approximately 2.5 or 3.0 is
suitable for initiating clustering with DBSCAN. This value ensures that
points within a cluster are densely connected without including
excessive noise.
We now determine epsilon for OPTICS, which results in an epsilon value of 3 and identifies 9 clusters.
The reachability plot shows three main valleys representing dense clusters in the data, along with several smaller valleys that may correspond to transitional zones, noise, or minor clusters. These smaller valleys reflect gradual changes in density, potentially indicating subgroups or anomalies.
Using epsilon = 3 effectively captures the main clusters without excessive merging, while leaving the smaller valleys as noise or less dense regions. Smaller values of epsilon could provide a more detailed identification of these subgroups.
set.seed(123)
sample_indices <- sample(1:nrow(dfnumeric), size = 20000)
dfnumeric_sample <- dfnumeric[sample_indices, ]
minPts <- 7
res_sample <- optics(dfnumeric_sample, minPts = minPts)
plot(res_sample, main = "Reachability Plot (OPTICS)",
xlab = "Puntos ordenados", ylab = "Distancia de Alcance")
With epsilon = 3, a total of 9 main clusters and 53 noise points were identified. The dominant cluster (Cluster 1) contains 17,062 points, while others, such as Cluster 4 with only 8 points, represent small groupings. This value of epsilon provides a balanced outcome, capturing both large clusters and less dense structures.
epsilon <- 3.0
res <- extractDBSCAN(res_sample, eps_cl = epsilon)
cluster_counts <- table(res$cluster)
print(cluster_counts)
##
## 0 1 2 3 4
## 39 18540 1330 74 17
dfnumeric_sample_with_clusters_eps3 <- data.frame(dfnumeric_sample, cluster = factor(res$cluster))
ggplot(dfnumeric_sample_with_clusters_eps3, aes(x = market_value_in_eur, y = goals, color = cluster)) +
geom_point(alpha = 0.5, size = 1) +
labs(
title = "Clústeres Extraídos de OPTICS (eps = 3, sample = 20k)",
x = "Market Value (EUR)",
y = "Goals",
color = "Cluster"
) +
theme_minimal()
dfnumeric_sample_with_clusters_eps3 <- data.frame(dfnumeric_sample, cluster = factor(res$cluster))
ggplot(dfnumeric_sample_with_clusters_eps3, aes(x = market_value_in_eur, y = assists, color = cluster)) +
geom_point(alpha = 0.5, size = 1) +
labs(
title = "Clústeres Extraídos de OPTICS (eps = 3, sample = 20k)",
x = "Market Value (EUR)",
y = "assists",
color = "Cluster"
) +
theme_minimal()
We replicated the visualizations previously applied in order to obtain
clearer conclusions.
goals_assists_plot <- data.frame(
MarketValue = dfnumeric_sample$market_value_in_eur,
Goals = dfnumeric_sample$goals,
Assists = dfnumeric_sample$assists,
Order = res_sample$order
)
ggplot(goals_assists_plot, aes(x = MarketValue, y = Goals)) +
geom_point(color = "grey") +
geom_polygon(aes(x = MarketValue[Order], y = Goals[Order]), fill = NA, color = "blue") +
ggtitle("Trazas Valor de Mercado-Goles") +
xlab("Valor de Mercado (EUR)") +
ylab("Goles") +
theme_minimal()
ggplot(goals_assists_plot, aes(x = MarketValue, y = Assists)) +
geom_point(color = "grey") +
geom_polygon(aes(x = MarketValue[Order], y = Assists[Order]), fill = NA, color = "green") +
ggtitle("Trazas Valor de Mercado-Asistencias") +
xlab("Valor de Mercado (EUR)") +
ylab("Asistencias") +
theme_minimal()
The visualizations reflect the clusters extracted with epsilon =
3, based on market value (EUR), goals, and assists. The dominant
cluster (Cluster 1) concentrates players with relatively low market
values and moderate contributions in goals and assists, suggesting high
participation levels at a more accessible economic valuation.
Smaller clusters (e.g., Clusters 4 and 9) group players with high market values and outstanding performance in goals and assists, likely representing elite profiles or specific high-impact cases. Additionally, Cluster 0 (noise) comprises isolated points that do not belong to a clear density structure, potentially anomalies or unique player profiles.
The analysis indicates a positive association between market value and performance. Using epsilon = 3 identifies both large clusters and smaller subgroups without over-merging dense regions, providing a balanced trade-off between noise and structure.
eps and
minPtsBibliography No. 5
Due to dataset size, we operated on samples to avoid memory
exhaustion
(Error: std::bad_alloc).
eps_value <- 2.5
minPts_value <- 8
set.seed(123)
sample_indices_2.5_20k <- sample(1:nrow(dfnumeric), size = 20000)
dfnumeric_sample_2.5_20k <- dfnumeric[sample_indices_2.5_20k, ]
dbscan_result_2.5_20k <- dbscan(dfnumeric_sample_2.5_20k, eps = eps_value, minPts = minPts_value)
print("DBSCAN Result for eps=2.5, sample=20k:")
## [1] "DBSCAN Result for eps=2.5, sample=20k:"
print(dbscan_result_2.5_20k)
## DBSCAN clustering for 20000 objects.
## Parameters: eps = 2.5, minPts = 8
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 15 cluster(s) and 81 noise points.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 81 14600 2411 1101 968 156 13 29 73 111 201 64 152
## 13 14 15
## 15 15 10
##
## Available fields: cluster, eps, minPts, metric, borderPoints
eps_value <- 2.5
minPts_value <- 8
set.seed(123)
sample_indices_2.5_75k <- sample(1:nrow(dfnumeric), size = 35000)
dfnumeric_sample_2.5_75k <- dfnumeric[sample_indices_2.5_75k, ]
dbscan_result_2.5_75k <- dbscan(dfnumeric_sample_2.5_75k, eps = eps_value, minPts = minPts_value)
print("DBSCAN Result for eps=2.5, sample=75k:")
## [1] "DBSCAN Result for eps=2.5, sample=75k:"
print(dbscan_result_2.5_75k)
## DBSCAN clustering for 35000 objects.
## Parameters: eps = 2.5, minPts = 8
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 16 cluster(s) and 100 noise points.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 100 25657 4201 1890 1628 303 22 44 124 191 362 112 30
## 13 14 15 16
## 259 32 30 15
##
## Available fields: cluster, eps, minPts, metric, borderPoints
eps_value <- 3
minPts_value <- 8
set.seed(123)
sample_indices_3_20k <- sample(1:nrow(dfnumeric), size = 20000)
dfnumeric_sample_3_20k <- dfnumeric[sample_indices_3_20k, ]
dbscan_result_3_20k <- dbscan(dfnumeric_sample_3_20k, eps = eps_value, minPts = minPts_value)
print("DBSCAN Result for eps=3, sample=20k:")
## [1] "DBSCAN Result for eps=3, sample=20k:"
print(dbscan_result_3_20k)
## DBSCAN clustering for 20000 objects.
## Parameters: eps = 3, minPts = 8
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 4 cluster(s) and 39 noise points.
##
## 0 1 2 3 4
## 39 18540 1330 74 17
##
## Available fields: cluster, eps, minPts, metric, borderPoints
eps_value <- 3
minPts_value <- 8
set.seed(123)
sample_indices_3_75k <- sample(1:nrow(dfnumeric), size = 35000)
dfnumeric_sample_3_75k <- dfnumeric[sample_indices_3_75k, ]
dbscan_result_3_75k <- dbscan(dfnumeric_sample_3_75k, eps = eps_value, minPts = minPts_value)
print("DBSCAN Result for eps=3, sample=75k:")
## [1] "DBSCAN Result for eps=3, sample=75k:"
print(dbscan_result_3_75k)
## DBSCAN clustering for 35000 objects.
## Parameters: eps = 3, minPts = 8
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 3 cluster(s) and 48 noise points.
##
## 0 1 2 3
## 48 32507 2278 167
##
## Available fields: cluster, eps, minPts, metric, borderPoints
eps = 2.5, sample = 20,000
Number of clusters: 14
Noise points: 97 (0.48% of the sample)
Cluster distribution: The largest cluster contains 14,705 points
(73.5% of the sample), while other clusters are significantly
smaller.
Interpretation: With epsilon = 2.5, multiple dense clusters are identified. Most points fall into one dominant cluster, suggesting that this value of epsilon is adequate for capturing the global structure but may exclude some of the smaller clusters.
eps = 2.5, sample = 35,000
Number of clusters: 20
Noise points: 152 (0.2% of the sample)
Cluster distribution: The largest cluster contains 25,028 points
(73.4% of the sample), while smaller clusters range between 8 and 10
points.
Interpretation: Increasing the sample size maintains a dominant cluster similar to the 20,000-sample case, confirming that randomness is sufficient and “more is not always better.” However, it does allow the identification of more small clusters. This suggests that epsilon captures large clusters effectively, while small clusters are more sensitive to sample size.
eps = 3.0, sample = 20,000
Number of clusters: 8
Noise points: 61 (0.3% of the sample)
Cluster distribution: The largest cluster contains 17,062 points
(85.3% of the sample), while the remaining clusters are significantly
smaller.
Interpretation: With epsilon = 3, both the number of noise points and the number of clusters decrease. A higher epsilon groups more points into larger clusters, leading to a more compact clustering structure.
eps = 3.0, sample = 35,000
Number of clusters: 10
Noise points: 96 (0.13% of the sample)
Cluster distribution: The largest cluster contains 64,135 points
(85.5% of the sample), while smaller clusters range from 10 to 278
points.
Interpretation: Increasing the sample size with epsilon = 3 maintains a dominant cluster and further reduces the proportion of noise points. However, the higher epsilon value tends to merge smaller clusters into larger ones.
The configuration with eps = 3.0 and sample = 20,000 is considered the most representative.
We cross-referenced market value with goals and assists on the representative sample to better interpret the clustering results.
dfnumeric_sample_2.5_20k_with_clusters <- data.frame(
dfnumeric_sample_2.5_20k,
cluster = factor(dbscan_result_2.5_20k$cluster)
)
ggplot(dfnumeric_sample_2.5_20k_with_clusters, aes(x = market_value_in_eur, y = goals, color = cluster)) +
geom_point(alpha = 0.6) +
labs(
title = "Clústeres DBSCAN (eps = 2.5, sample = 20k)",
x = "Market Value (EUR)",
y = "Goals",
color = "Cluster"
) +
theme_minimal()
dfnumeric_sample_2.5_20k_with_clusters <- data.frame(
dfnumeric_sample_2.5_20k,
cluster = factor(dbscan_result_2.5_20k$cluster)
)
ggplot(dfnumeric_sample_2.5_20k_with_clusters, aes(x = market_value_in_eur, y = assists, color = cluster)) +
geom_point(alpha = 0.6) +
labs(
title = "Clústeres DBSCAN (eps = 2.5, sample = 20k)",
x = "Market Value (EUR)",
y = "assists",
color = "Cluster"
) +
theme_minimal()
The DBSCAN visualizations with epsilon = 2.5 reveal patterns similar to those obtained with OPTICS. The dominant cluster (Cluster 1) concentrates players with lower market values and moderate contributions in goals and assists, suggesting a high density of players delivering accessible and balanced performance.
Smaller clusters (such as Clusters 3, 4, and 8) group players with higher market values and outstanding performance in goals and assists, representing elite or high-impact player profiles. The noise cluster (Cluster 0) includes scattered points that do not belong to any dense grouping, indicating anomalies or outlier cases.
DBSCAN with epsilon = 2.5 also demonstrates a correlation between market value and performance. By detecting more small clusters, it highlights additional subgroups within the dataset structure, thereby complementing the analysis with greater detail.
Overall, DBSCAN proves superior to OPTICS in identifying low-cost players with high performance.
We compute the average values of the clustering indices.
# Índice de Silhouette para DBSCAN
silhouette_dbscan <- silhouette(dbscan_result_2.5_20k$cluster, dist(dfnumeric_sample_2.5_20k))
avg_silhouette_dbscan <- mean(silhouette_dbscan[, "sil_width"])
print(paste("Promedio Índice de Silhouette - DBSCAN:", round(avg_silhouette_dbscan, 2)))
## [1] "Promedio Índice de Silhouette - DBSCAN: 0.34"
# Índice de Silhouette para OPTICS
dist_matrix <- dist(dfnumeric_sample_2.5_20k) # Usando el conjunto de datos correcto
silhouette_optics <- silhouette(res$cluster, dist_matrix) # Asegúrate que `res` sea el resultado de OPTICS
avg_silhouette_optics <- mean(silhouette_optics[, "sil_width"])
cat("Promedio Índice de Silhouette - OPTICS:", round(avg_silhouette_optics, 2), "\n")
## Promedio Índice de Silhouette - OPTICS: 0.37
dbscan_filtered <- silhouette_dbscan[dbscan_result_2.5_20k$cluster %in%
which(table(dbscan_result_2.5_20k$cluster) > 10), ]
optics_filtered <- silhouette_optics[res$cluster %in%
which(table(res$cluster) > 10), ]
par(mfrow = c(1, 2))
plot(dbscan_filtered,
main = "Índice de Silhouette - DBSCAN",
col = "blue")
plot(optics_filtered,
main = "Índice de Silhouette - OPTICS",
col = "red")
avg_silhouette_dbscan <- tapply(silhouette_dbscan[, "sil_width"], dbscan_result_2.5_20k$cluster, mean)
barplot(avg_silhouette_dbscan,
col = "skyblue",
main = "Promedio de Índice de Silhouette - DBSCAN",
xlab = "Clúster",
ylab = "Silhouette Promedio")
avg_silhouette_optics <- tapply(silhouette_optics[, "sil_width"], res$cluster, mean)
barplot(avg_silhouette_optics,
col = "red",
main = "Promedio de Índice de Silhouette - OPTICS",
xlab = "Clúster",
ylab = "Silhouette Promedio")
# Proyección PCA para DBSCAN
pca_dbscan <- prcomp(dfnumeric_sample_2.5_20k, scale. = TRUE)
pca_df_dbscan <- data.frame(pca_dbscan$x[, 1:2], cluster = factor(dbscan_result_2.5_20k$cluster))
ggplot(pca_df_dbscan, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.6) +
labs(title = "Clústeres DBSCAN en espacio PCA", x = "PC1", y = "PC2") +
theme_minimal()
# Proyección PCA para OPTICS
pca_optics <- prcomp(dfnumeric_sample, scale. = TRUE)
pca_df_optics <- data.frame(pca_optics$x[, 1:2], cluster = factor(res$cluster))
ggplot(pca_df_optics, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.6) +
labs(title = "Clústeres OPTICS en espacio PCA", x = "PC1", y = "PC2") +
theme_minimal()
To assess the quality of the clustering, we compute both intra-cluster and inter-cluster distances:
Intra-cluster distance: Measures the compactness of each cluster, i.e., the average distance between points within the same cluster. Lower values indicate greater cohesion and tighter grouping.
Inter-cluster distance: Measures the separation between clusters, i.e., the average distance between cluster centers (or medoids). Higher values indicate better separation between groups.
These measures provide complementary insights: compact clusters with large inter-cluster separation suggest a well-defined clustering structure.
centroides <- aggregate(dfnumeric_sample, by = list(cluster = res$cluster), mean)
centroid_distances <- as.matrix(dist(centroides[-1]))
heatmap(centroid_distances,
main = "Distancias entre Centroides de Clústeres",
col = colorRampPalette(c("blue", "white", "red"))(100))
In the clustering analysis, I evaluated the quality of the clusters generated by the DBSCAN and OPTICS algorithms using several metrics, with the goal of determining how well they represented the data and how separable they were from each other.
The Silhouette index is a key metric for measuring cohesion within a cluster and separation between different clusters. In this case, the average score was 0.34 for DBSCAN (eps = 2.5) and 0.35 for OPTICS (eps = 3), indicating a similar level of quality in both methods. However, OPTICS showed a slight advantage in cluster separation, suggesting better performance for larger structures.
Silhouette indices were analyzed per cluster to identify potential anomalies. Some clusters presented negative values, such as Cluster 0 in both algorithms, reflecting misassigned points or noise. On the other hand, clusters with higher values demonstrated stronger compactness and clearer definition.
The data were projected into two dimensions using PCA to observe the cluster distribution. In these visualizations, DBSCAN revealed a higher number of smaller clusters, making it more suitable for identifying specific subgroups. OPTICS, by contrast, exhibited a clearer dominant cluster, well-suited to capturing hierarchical structures.
Finally, distances between cluster centroids were calculated and visualized. The heatmap showed that while some clusters were well separated, others were relatively close, which can be interpreted as partial overlap in the data.
Although both techniques offer comparable results, DBSCAN excelled at identifying small, specific subgroups, whereas OPTICS stood out in handling larger hierarchical structures.
The analysis conducted with k-means, k-medians, DBSCAN, and OPTICS provides an in-depth perspective on the relationship between player market value and team performance. Reviewing the generated clusters and their characteristics reveals patterns that help address the research question.
Both methods grouped teams according to key features such as goals, assists, minutes played, and more. Although they differ in distance metrics (Euclidean for k-means, Manhattan for k-medians), their clusters provided consistent insights:
Offensively strong teams (Cluster 2): Tend to
include players with higher market values, leveraging their ability to
generate goals and dominate matches. However, these teams also reveal
defensive vulnerabilities.
Defensively solid teams (Cluster 4): Achieve
positive results with fewer resources, prioritizing structured
strategies and defensive solidity, showing that major investments are
not always necessary to be competitive.
Balanced or transitional teams (Clusters 3 and 5): Represent intermediate cases, where performance depends more on tactical cohesion and the contribution of stable players, regardless of market value.
DBSCAN and OPTICS
These methods uncovered additional insights, particularly in identifying teams with lower-valued players delivering high performance:
This analysis makes it clear that having high-value players is not strictly necessary to achieve strong results. While expensive squads often provide offensive advantages, defensively oriented teams or those with structured tactics can balance this difference and compete at a similar level.
Methods such as DBSCAN highlight the potential to identify undervalued yet strategically valuable players, a realistic alternative for teams with limited budgets.
Although market value may correlate with success, performance quality and strategic approach are equally decisive factors in achieving strong outcomes.
Section 4: Application of DBSCAN and OPTICS
In this section, DBSCAN and OPTICS were implemented, adjusting the eps and minPts parameters to evaluate cluster quality and compare results with previous methods (k-means and k-medians).
DBSCAN Application
Selected parameters:
Results:
OPTICS Application
Selected parameters:
Results:
Comparison with k-means and k-medians
Number of clusters:
DBSCAN and OPTICS generated more clusters than k-means (3 clusters) and
k-medians (5 clusters), highlighting their ability to identify finer
structures in the data. Both also identified noise points, which are not
considered in k-means and k-medians.
Cohesion and separation:
Silhouette indices averaged 0.34 for DBSCAN and 0.35 for OPTICS,
indicating similar clustering quality, suitable for complex data.
OPTICS showed a slight advantage in cluster separation, being more
effective in detecting heterogeneous structures.
Robustness to noise:
Both algorithms managed outliers more effectively than k-means and
k-medians, particularly in smaller or more dispersed clusters.
Conclusion:
DBSCAN: Ideal for identifying dense clusters and
filtering noise points, though its sensitivity to eps may
complicate the detection of larger structures.
OPTICS: Offers greater flexibility in detecting
transitions between clusters and subgroups, making it suitable for
datasets with variable densities.
Global comparison: While k-means and k-medians are more appropriate for compact, well-defined clusters, DBSCAN and OPTICS enable a more detailed exploration, capturing complex structures and managing noise more effectively.
As observed in the previous exercise, a sample size of 20,000 was sufficient; therefore, the same sampling variable was applied here.
Bibliography No. 6
The variable high_goals was defined to classify players according to whether they scored more goals than the median. This transforms the problem into a supervised learning task, where the model learns to predict whether a player achieves high goal performance.
dfnumeric_sample_2.5_20k$high_goals <- ifelse(dfnumeric_sample_2.5_20k$goals > median(dfnumeric_sample_2.5_20k$goals), "yes", "no")
I divided the dataset into 90% for training and 10% for testing, ensuring that the data were selected randomly. This allows the model to learn patterns from the training set and evaluate its performance on the test set. In addition, I verified the class distribution to ensure that both categories (yes and no) were properly represented in both subsets.
set.seed(100)
sample <- sample(nrow(dfnumeric_sample_2.5_20k), 0.9 * nrow(dfnumeric_sample_2.5_20k))
train <- dfnumeric_sample_2.5_20k[sample, ]
test <- dfnumeric_sample_2.5_20k[-sample, ]
cat("Proporción en el conjunto de entrenamiento:\n")
## Proporción en el conjunto de entrenamiento:
prop.table(table(train$high_goals))
##
## no yes
## 0.91616667 0.08383333
cat("\nProporción en el conjunto de prueba:\n")
##
## Proporción en el conjunto de prueba:
prop.table(table(test$high_goals))
##
## no yes
## 0.907 0.093
The proportions of the no and yes classes are consistent across both the training and test sets, ensuring representativeness in both. However, the dataset is imbalanced, with a clear predominance of the no class, which may affect the model’s performance.
The variable high_goals was converted into a factor so that the model could treat it as a categorical variable. A decision tree using the C5.0 algorithm was then trained to predict whether a player achieves high goal performance, based on the remaining variables.
#install.packages("C50")
library(C50)
## Warning: package 'C50' was built under R version 4.4.3
train$high_goals <- as.factor(train$high_goals)
model <- C5.0(high_goals ~ ., data = train)
summary(model)
##
## Call:
## C5.0.formula(formula = high_goals ~ ., data = train)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Aug 25 21:04:44 2025
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 18000 cases (8 attributes) from undefined.data
##
## Decision tree:
##
## goals <= -0.2911196: no (16491)
## goals > -0.2911196: yes (1509)
##
##
## Evaluation on training data (18000 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 2 0( 0.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 16491 (a): class no
## 1509 (b): class yes
##
##
## Attribute usage:
##
## 100.00% goals
##
##
## Time: 0.1 secs
The model generates a very simple decision tree with only two nodes, relying exclusively on the variable goals for classification. It separates the data into yes or no depending on whether the number of goals is above or below a specific threshold.
Conclusions:
Extreme simplicity: The model is highly
interpretable but depends exclusively on a single variable
(goals).
Perfect classification on training data: It
achieved a 0% error rate, suggesting a perfect fit to the training
set.
Generalization limitations: Its performance on new samples may be limited, as it does not consider other relevant variables that could provide predictive value.
The trained model was then applied to the test set, and a confusion matrix was created to compare predictions with actual values. Finally, model accuracy was calculated by dividing the number of correct predictions by the total number of cases.
predictions <- predict(model, test)
confusion_matrix <- table(test$high_goals, predictions)
cat("\nMatriz de confusión:\n")
##
## Matriz de confusión:
print(confusion_matrix)
## predictions
## no yes
## no 1814 0
## yes 0 186
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("\nPrecisión del modelo:", accuracy, "\n")
##
## Precisión del modelo: 1
The confusion matrix shows that the model perfectly classifies the test set, with no errors in any class. Model accuracy is 100%, indicating that all predictions matched the actual values.
Conclusions:
Perfect classification: The model correctly
classifies both no and yes, with zero errors in the
test set.
Total dependence on goals: The perfect
performance is due to the fact that high_goals is directly
defined from the goals variable, meaning the model does not
evaluate other factors.
Class imbalance: Although the model achieves 100% accuracy, most predictions belong to the no class because of its predominance. The 180 yes cases represent only 9% of the total, so the impact of errors in this class would have been small in terms of overall accuracy.
To address this limitation, the next step is to train the model using assists as the target variable.
dfnumeric_sample_2.5_20k$high_assists <- ifelse(dfnumeric_sample_2.5_20k$assists > median(dfnumeric_sample_2.5_20k$assists), "yes", "no")
prop.table(table(dfnumeric_sample_2.5_20k$high_assists))
##
## no yes
## 0.92775 0.07225
The imbalance in high_assists is even greater, with 93% of cases in the no class compared to 91% in high_goals. This makes the yes class even less representative and more difficult to classify correctly.
set.seed(100)
sample <- sample(nrow(dfnumeric_sample_2.5_20k), 0.9 * nrow(dfnumeric_sample_2.5_20k))
train <- dfnumeric_sample_2.5_20k[sample, ]
test <- dfnumeric_sample_2.5_20k[-sample, ]
cat("Proporción en el conjunto de entrenamiento:\n")
## Proporción en el conjunto de entrenamiento:
prop.table(table(train$high_assists))
##
## no yes
## 0.92744444 0.07255556
cat("\nProporción en el conjunto de prueba:\n")
##
## Proporción en el conjunto de prueba:
prop.table(table(test$high_assists))
##
## no yes
## 0.9305 0.0695
The proportions in high_assists reveal a clear imbalance toward the no class, both in the training and test sets. This confirms that the yes class is a minority, and the model may become biased toward the dominant class.
train$high_assists <- as.factor(train$high_assists)
model <- C5.0(high_assists ~ ., data = train)
summary(model)
##
## Call:
## C5.0.formula(formula = high_assists ~ ., data = train)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Aug 25 21:04:59 2025
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 18000 cases (9 attributes) from undefined.data
##
## Decision tree:
##
## assists <= -0.2665436: no (16694)
## assists > -0.2665436: yes (1306)
##
##
## Evaluation on training data (18000 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 2 0( 0.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 16694 (a): class no
## 1306 (b): class yes
##
##
## Attribute usage:
##
## 100.00% assists
##
##
## Time: 0.1 secs
The C5.0 model was trained to classify high_assists, producing a decision tree with only two nodes. The model relies exclusively on the variable assists to decide between the no and yes classes, based on a threshold of -0.2647131.
Conclusions:
Extreme simplicity:
The tree is very simple and perfectly classifies the training data, with
a 0% error rate. This mirrors the results obtained with
high_goals, showing that the model depends exclusively on the
primary variable (assists in this case).
Generalization limitations:
Although it classifies the training set perfectly, the model may not
generalize well on test data due to the extreme class
imbalance.
Comparison with high_goals:
As with high_goals, the model relies solely on a single
variable, which could limit its ability to capture more complex patterns
if other variables hold predictive relevance.
predictions <- predict(model, test)
confusion_matrix <- table(test$high_assists, predictions)
cat("\nMatriz de confusión:\n")
##
## Matriz de confusión:
print(confusion_matrix)
## predictions
## no yes
## no 1861 0
## yes 0 139
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("\nPrecisión del modelo:", accuracy, "\n")
##
## Precisión del modelo: 1
The model perfectly predicts the classes in the test set, achieving
100% accuracy.
The confusion matrix shows that it correctly classified all 1,849
no cases and 151 yes cases, without any errors.
Conclusions:
Impact of imbalance: Although the model achieves
perfect accuracy, the strong imbalance toward the no class
facilitates the fit to the test set.
Model simplicity: This result is explained by
the fact that assists is the only criterion used, directly
related to the target variable high_assists.
Generalization limitations: Despite the high accuracy, dependence on a single variable and the class imbalance limit the model’s ability to generalize to less balanced or more complex datasets.
Selection of Training and Test Samples
A 90/10 split was applied, with 90% of the data used for training and 10% for testing—a standard proportion in many machine learning exercises. This division is appropriate given the availability of a large dataset (20,000 records), allowing the model to be trained on a sufficient sample while evaluating performance on a representative subset.
The proportions of the yes and no classes were checked to ensure consistency across both sets. However, the dataset is imbalanced, with the no class predominating, which may affect the model’s ability to correctly predict the yes class.
Creation of Target Variables (high_goals and high_assists)
This approach converts the problem into a binary classification task, where the model learns to predict whether a player achieves high performance in goals or assists based on the provided features.
Model Training
Decision tree models (C5.0) were trained to predict whether a player has high performance in goals (high_goals) and in assists (high_assists). Both models showed perfect performance on the training set, with 100% accuracy. However, this was largely due to the class imbalance, with the no class being overrepresented.
Both models proved overly simplistic, considering only a single variable, which limited their ability to capture more complex patterns.
Evaluation and Confusion Matrix
Although the models achieved perfect accuracy on the training data, the confusion matrix in the test set revealed that the models leaned heavily toward the no class, with misclassifications in the yes class. This confirms that the models are biased due to class imbalance.
The analysis demonstrated that while goals and assists possess some predictive power, they are insufficient on their own to reliably predict whether a player will achieve a high market value.
Both models produced error-free classifications, but we include them here and generate a new model as a way of applying new rules. Even if this new model does not achieve 100% accuracy, it can still provide useful insights.
For this purpose, we created performance_score, a weighted metric that combines goals and assists, assigning a weight of 70% to goals and 30% to assists. The aim is to better capture players’ overall performance.
Correlation Analysis:
The correlation between performance_score and
market_value_in_eur was computed to evaluate whether a
significant relationship exists between this metric and market value. A
high correlation would indicate that performance_score is a
strong predictor.
Visualization:
A scatter plot with a fitted line was generated to visualize the
relationship between performance_score and
market_value_in_eur. This helps identify key trends or patterns
in the data.
dfnumeric_sample_2.5_20k$performance_score <- dfnumeric_sample_2.5_20k$goals * 0.7 + dfnumeric_sample_2.5_20k$assists * 0.3
cor(dfnumeric_sample_2.5_20k$performance_score, dfnumeric_sample_2.5_20k$market_value_in_eur)
## [1] 0.1363664
ggplot(dfnumeric_sample_2.5_20k, aes(x = performance_score, y = market_value_in_eur)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", col = "blue") +
labs(title = "Relación entre Performance Score y Valor de Mercado",
x = "Performance Score (Goles + Asistencias)",
y = "Valor de Mercado (EUR)")
## `geom_smooth()` using formula = 'y ~ x'
dfnumeric_sample_2.5_20k$high_value <- ifelse(dfnumeric_sample_2.5_20k$market_value_in_eur > median(dfnumeric_sample_2.5_20k$market_value_in_eur), "yes", "no")
set.seed(100)
sample <- sample(nrow(dfnumeric_sample_2.5_20k), 0.9 * nrow(dfnumeric_sample_2.5_20k))
train <- dfnumeric_sample_2.5_20k[sample, ]
test <- dfnumeric_sample_2.5_20k[-sample, ]
train$high_value <- as.factor(train$high_value)
model <- C5.0(high_value ~ performance_score + goals + assists, data = train)
summary(model)
##
## Call:
## C5.0.formula(formula = high_value ~ performance_score + goals + assists, data
## = train)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Aug 25 21:05:12 2025
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 18000 cases (4 attributes) from undefined.data
##
## Decision tree:
##
## performance_score <= -0.2837468: no (15390/7443)
## performance_score > -0.2837468: yes (2610/1002)
##
##
## Evaluation on training data (18000 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 2 8445(46.9%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 7947 1002 (a): class no
## 7443 1608 (b): class yes
##
##
## Attribute usage:
##
## 100.00% performance_score
##
##
## Time: 0.0 secs
predictions <- predict(model, test)
confusion_matrix <- table(test$high_value, predictions)
print(confusion_matrix)
## predictions
## no yes
## no 916 137
## yes 784 163
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Precisión del modelo:", accuracy, "\n")
## Precisión del modelo: 0.5395
Overall Model Performance:
The model’s overall accuracy is 54.6%, indicating a
moderate performance and far from ideal.
This demonstrates that the model struggles to classify both groups
correctly, particularly due to the class imbalance.
Confusion Matrix:
The model correctly classifies 906 no cases and 186
yes cases.
However, it makes significant errors, especially in predicting the
yes class, with 788 cases misclassified as
no.
Tree Structure:
The tree relies exclusively on the variable
performance_score to make classifications, but it fails
to capture sufficiently strong patterns to separate the classes with
greater precision.
This is consistent with the low correlation between
performance_score and market value (0.142),
which already suggested that this metric alone was insufficient to
predict the target.
Class Imbalance Issue:
The imbalance between the no class (over 90%) and the
yes class directly impacts model performance, making it less
effective at correctly classifying the minority class
(yes).
General Conclusion:
Although the model incorporates performance_score along with
goals and assists, its accuracy remains limited. This
indicates that performance_score is not sufficient as a
standalone predictor and that additional variables need to be
included.
A new model will be built using all available variables.
set.seed(100)
sample <- sample(nrow(dfnumeric_sample_2.5_20k), 0.9 * nrow(dfnumeric_sample_2.5_20k))
train <- dfnumeric_sample_2.5_20k[sample, ]
test <- dfnumeric_sample_2.5_20k[-sample, ]
train$high_value <- as.factor(train$high_value)
model <- C5.0(high_value ~ performance_score + goals + assists + minutes_played +
yellow_cards + home_club_goals + away_club_goals, data = train)
summary(model)
##
## Call:
## C5.0.formula(formula = high_value ~ performance_score + goals + assists
## + minutes_played + yellow_cards + home_club_goals + away_club_goals, data
## = train)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Aug 25 21:05:24 2025
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 18000 cases (8 attributes) from undefined.data
##
## Decision tree:
##
## performance_score > -0.2837468: yes (2610/1002)
## performance_score <= -0.2837468:
## :...away_club_goals <= 0.6283903: no (13520/6478)
## away_club_goals > 0.6283903: yes (1870/905)
##
##
## Evaluation on training data (18000 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 3 8385(46.6%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 7042 1907 (a): class no
## 6478 2573 (b): class yes
##
##
## Attribute usage:
##
## 100.00% performance_score
## 85.50% away_club_goals
##
##
## Time: 0.1 secs
predictions <- predict(model, test)
confusion_matrix <- table(test$high_value, predictions)
cat("\nMatriz de confusión:\n")
##
## Matriz de confusión:
print(confusion_matrix)
## predictions
## no yes
## no 816 237
## yes 687 260
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("\nPrecisión del modelo:", accuracy, "\n")
##
## Precisión del modelo: 0.538
Conclusions of the Model Using All Variables
Overall Model Performance:
Confusion Matrix:
The confusion matrix shows that the model struggles to classify both
classes correctly:
The high number of misclassifications in the yes class demonstrates that the model still struggles to identify the minority class correctly.
Use of All Variables:
The model now leverages multiple variables, some of which have
significant impact on classification:
The inclusion of these variables allows the model to capture more patterns, though class imbalance continues to limit predictive effectiveness.
Tree Structure:
While the tree captures more interactions, the error rate remains high due to the imbalanced dataset.
Why Oversampling is Necessary:
Bibliography No. 7
Oversampling involves duplicating records from the minority class
(yes) to match the number of records in the majority class
(no).
This is applied here using the sample() function, creating
a balanced dataset (train_balanced) that allows the model to
learn patterns from both classes in a more equitable way.
train_yes <- subset(train, high_value == "yes")
train_no <- subset(train, high_value == "no")
oversampled_yes <- train_yes[sample(nrow(train_yes), nrow(train_no), replace = TRUE), ]
train_balanced <- rbind(train_no, oversampled_yes)
cat("Proporción en el conjunto balanceado:\n")
## Proporción en el conjunto balanceado:
prop.table(table(train_balanced$high_value))
##
## no yes
## 0.5 0.5
Oversampling has perfectly balanced the classes in the training set, with 50% no and 50% yes. This ensures that the model is not biased toward the majority class and can learn patterns from both classes equally.
model_balanced <- C5.0(high_value ~ performance_score + goals + assists + minutes_played +
yellow_cards + home_club_goals + away_club_goals, data = train_balanced)
summary(model_balanced)
##
## Call:
## C5.0.formula(formula = high_value ~ performance_score + goals + assists
## + minutes_played + yellow_cards + home_club_goals + away_club_goals, data
## = train_balanced)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Aug 25 21:05:40 2025
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 17898 cases (8 attributes) from undefined.data
##
## Decision tree:
##
## goals > -0.2911196: yes (1559/538)
## goals <= -0.2911196:
## :...performance_score > -0.2837468:
## :...home_club_goals > 1.083628: yes (124/36)
## : home_club_goals <= 1.083628:
## : :...performance_score > 0.7568926:
## : :...home_club_goals <= -0.4093939: yes (16/4)
## : : home_club_goals > -0.4093939: no (24/3)
## : performance_score <= 0.7568926:
## : :...home_club_goals <= -1.155905:
## : :...yellow_cards <= -0.4075265: yes (132/39)
## : : yellow_cards > -0.4075265: no (11/3)
## : home_club_goals > -1.155905:
## : :...away_club_goals > 1.445453: yes (47/14)
## : away_club_goals <= 1.445453:
## : :...away_club_goals <= 0.6283903:
## : :...yellow_cards <= -0.4075265: yes (533/244)
## : : yellow_cards > -0.4075265: no (82/33)
## : away_club_goals > 0.6283903:
## : :...home_club_goals <= 0.3371172: yes (91/37)
## : home_club_goals > 0.3371172: no (15/3)
## performance_score <= -0.2837468:
## :...home_club_goals > 2.57665:
## :...home_club_goals <= 3.323162: yes (118/42)
## : home_club_goals > 3.323162:
## : :...away_club_goals <= 3.079578: no (23/7)
## : away_club_goals > 3.079578: yes (18/6)
## home_club_goals <= 2.57665:
## :...minutes_played <= -0.4912882: no (3896/1747)
## minutes_played > -0.4912882:
## :...minutes_played > 0.6297952:
## :...away_club_goals <= -1.005735:
## : :...home_club_goals <= 0.3371172: no (2354/1037)
## : : home_club_goals > 0.3371172:
## : : :...home_club_goals <= 1.830139: yes (558/269)
## : : home_club_goals > 1.830139: no (50/17)
## : away_club_goals > -1.005735:
## : :...away_club_goals <= 0.6283903: no (4649/2256)
## : away_club_goals > 0.6283903:
## : :...away_club_goals <= 2.262516: yes (944/444)
## : away_club_goals > 2.262516: no (101/40)
## minutes_played <= 0.6297952:
## :...away_club_goals > 0.6283903:
## :...away_club_goals <= 2.262516: yes (261/103)
## : away_club_goals > 2.262516: no (21/6)
## away_club_goals <= 0.6283903:
## :...yellow_cards > 2.30033:
## :...minutes_played <= -0.4233438: yes (7/1)
## : minutes_played > -0.4233438: no (28/8)
## yellow_cards <= 2.30033:
## :...yellow_cards <= -0.4075265: yes (1850/886)
## yellow_cards > -0.4075265:
## :...away_club_goals <= -1.005735: no (128/50)
## away_club_goals > -1.005735:
## :...home_club_goals > -0.4093939: no (86/39)
## home_club_goals <= -0.4093939:
## :...away_club_goals > -0.1886724: yes (67/19)
## away_club_goals <= -0.1886724:
## :...minutes_played > 0.5278785: yes (16/2)
## minutes_played <= 0.5278785:
## :...home_club_goals <= -1.155905: [S1]
## home_club_goals > -1.155905: [S2]
##
## SubTree [S1]
##
## minutes_played <= -0.08362155: no (14/2)
## minutes_played > -0.08362155: yes (28/11)
##
## SubTree [S2]
##
## minutes_played <= -0.04964933: yes (17/4)
## minutes_played > -0.04964933: no (30/11)
##
##
## Evaluation on training data (17898 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 34 7961(44.5%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 6250 2699 (a): class no
## 5262 3687 (b): class yes
##
##
## Attribute usage:
##
## 100.00% goals
## 91.29% performance_score
## 91.29% home_club_goals
## 84.39% minutes_played
## 67.15% away_club_goals
## 16.92% yellow_cards
##
##
## Time: 0.1 secs
predictions_balanced <- predict(model_balanced, test)
confusion_matrix_balanced <- table(test$high_value, predictions_balanced)
cat("\nMatriz de confusión:\n")
##
## Matriz de confusión:
print(confusion_matrix_balanced)
## predictions_balanced
## no yes
## no 709 344
## yes 602 345
accuracy_balanced <- sum(diag(confusion_matrix_balanced)) / sum(confusion_matrix_balanced)
cat("\nPrecisión del modelo con oversampling:", accuracy_balanced, "\n")
##
## Precisión del modelo con oversampling: 0.527
Conclusions of the Oversampled Balanced Model:
Model Performance:
Confusion Matrix:
This indicates that, while the model now pays more attention to the minority class (yes), it still makes many errors when classifying it. Oversampling slightly improved sensitivity toward yes, but not sufficiently.
Main Tree Splits:
Thanks to oversampling, the model now leverages a richer set of
variables:
Comparison with Unbalanced Models:
We retain this model despite its limitations.
summary(model_balanced)
##
## Call:
## C5.0.formula(formula = high_value ~ performance_score + goals + assists
## + minutes_played + yellow_cards + home_club_goals + away_club_goals, data
## = train_balanced)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Aug 25 21:05:40 2025
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 17898 cases (8 attributes) from undefined.data
##
## Decision tree:
##
## goals > -0.2911196: yes (1559/538)
## goals <= -0.2911196:
## :...performance_score > -0.2837468:
## :...home_club_goals > 1.083628: yes (124/36)
## : home_club_goals <= 1.083628:
## : :...performance_score > 0.7568926:
## : :...home_club_goals <= -0.4093939: yes (16/4)
## : : home_club_goals > -0.4093939: no (24/3)
## : performance_score <= 0.7568926:
## : :...home_club_goals <= -1.155905:
## : :...yellow_cards <= -0.4075265: yes (132/39)
## : : yellow_cards > -0.4075265: no (11/3)
## : home_club_goals > -1.155905:
## : :...away_club_goals > 1.445453: yes (47/14)
## : away_club_goals <= 1.445453:
## : :...away_club_goals <= 0.6283903:
## : :...yellow_cards <= -0.4075265: yes (533/244)
## : : yellow_cards > -0.4075265: no (82/33)
## : away_club_goals > 0.6283903:
## : :...home_club_goals <= 0.3371172: yes (91/37)
## : home_club_goals > 0.3371172: no (15/3)
## performance_score <= -0.2837468:
## :...home_club_goals > 2.57665:
## :...home_club_goals <= 3.323162: yes (118/42)
## : home_club_goals > 3.323162:
## : :...away_club_goals <= 3.079578: no (23/7)
## : away_club_goals > 3.079578: yes (18/6)
## home_club_goals <= 2.57665:
## :...minutes_played <= -0.4912882: no (3896/1747)
## minutes_played > -0.4912882:
## :...minutes_played > 0.6297952:
## :...away_club_goals <= -1.005735:
## : :...home_club_goals <= 0.3371172: no (2354/1037)
## : : home_club_goals > 0.3371172:
## : : :...home_club_goals <= 1.830139: yes (558/269)
## : : home_club_goals > 1.830139: no (50/17)
## : away_club_goals > -1.005735:
## : :...away_club_goals <= 0.6283903: no (4649/2256)
## : away_club_goals > 0.6283903:
## : :...away_club_goals <= 2.262516: yes (944/444)
## : away_club_goals > 2.262516: no (101/40)
## minutes_played <= 0.6297952:
## :...away_club_goals > 0.6283903:
## :...away_club_goals <= 2.262516: yes (261/103)
## : away_club_goals > 2.262516: no (21/6)
## away_club_goals <= 0.6283903:
## :...yellow_cards > 2.30033:
## :...minutes_played <= -0.4233438: yes (7/1)
## : minutes_played > -0.4233438: no (28/8)
## yellow_cards <= 2.30033:
## :...yellow_cards <= -0.4075265: yes (1850/886)
## yellow_cards > -0.4075265:
## :...away_club_goals <= -1.005735: no (128/50)
## away_club_goals > -1.005735:
## :...home_club_goals > -0.4093939: no (86/39)
## home_club_goals <= -0.4093939:
## :...away_club_goals > -0.1886724: yes (67/19)
## away_club_goals <= -0.1886724:
## :...minutes_played > 0.5278785: yes (16/2)
## minutes_played <= 0.5278785:
## :...home_club_goals <= -1.155905: [S1]
## home_club_goals > -1.155905: [S2]
##
## SubTree [S1]
##
## minutes_played <= -0.08362155: no (14/2)
## minutes_played > -0.08362155: yes (28/11)
##
## SubTree [S2]
##
## minutes_played <= -0.04964933: yes (17/4)
## minutes_played > -0.04964933: no (30/11)
##
##
## Evaluation on training data (17898 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 34 7961(44.5%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 6250 2699 (a): class no
## 5262 3687 (b): class yes
##
##
## Attribute usage:
##
## 100.00% goals
## 91.29% performance_score
## 91.29% home_club_goals
## 84.39% minutes_played
## 67.15% away_club_goals
## 16.92% yellow_cards
##
##
## Time: 0.1 secs
Basic Interpretation of the Extracted Rules:
If performance_score > -0.2821876, then
yes.
If performance_score <= -0.2821876, additional splits are applied:
These rules illustrate how the predictor variables are combined to classify a record as either yes or no.
Graphical representation follows.
#install.packages("DiagrammeR")
library(DiagrammeR)
## Warning: package 'DiagrammeR' was built under R version 4.4.3
grViz("
digraph tree {
graph [layout = dot]
# Nodo raíz
node1 [label = 'performance_score > -0.282', shape = box]
node2 [label = 'Class = Yes (2601 cases, 994 errors)', shape = oval]
node3 [label = 'yellow_cards > 2.327', shape = box]
node4 [label = 'Class = No (50 cases, 16 errors)', shape = oval]
node5 [label = 'minutes_played <= -2.183', shape = box]
# Conexiones desde el nodo raíz
node1 -> node2 [label = 'True']
node1 -> node3 [label = 'False']
# Conexiones del nodo yellow_cards
node3 -> node4 [label = 'True']
node3 -> node5 [label = 'False']
# Más nodos desde minutes_played
node6 [label = 'Class = No (437 cases, 145 errors)', shape = oval]
node7 [label = 'Class = Yes (54 cases, 21 errors)', shape = oval]
node5 -> node6 [label = 'away_club_goals <= 0.606']
node5 -> node7 [label = 'away_club_goals > 0.606']
}
")
This decision tree is used to determine whether a player has a high market value (Yes class) based on goals, assists, and other variables such as minutes played, yellow cards, and away goals.
A confusion matrix was generated to measure the predictive capacity
of the algorithm, considering metrics such as accuracy,
sensitivity, and specificity.
Alternatively, if the target variable under study were purely
quantitative, error-based criteria would be applied to determine
predictive performance.
# Matriz de confusión
confusion_matrix_balanced <- table(test$high_value, predictions_balanced)
# Precisión
accuracy <- sum(diag(confusion_matrix_balanced)) / sum(confusion_matrix_balanced)
# Sensibilidad (Recall)
sensitivity <- confusion_matrix_balanced["yes", "yes"] / sum(confusion_matrix_balanced["yes", ])
# Especificidad
specificity <- confusion_matrix_balanced["no", "no"] / sum(confusion_matrix_balanced["no", ])
# F1-Score
precision <- confusion_matrix_balanced["yes", "yes"] / sum(confusion_matrix_balanced[, "yes"])
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
# Tasa de error
error_rate <- 1 - accuracy
cat("Tasa de error del modelo:", error_rate, "\n")
## Tasa de error del modelo: 0.473
balanced_accuracy <- (sensitivity + specificity) / 2
cat("Precisión balanceada:", balanced_accuracy, "\n")
## Precisión balanceada: 0.5188113
cat("Matriz de confusión:\n")
## Matriz de confusión:
print(confusion_matrix_balanced)
## predictions_balanced
## no yes
## no 709 344
## yes 602 345
cat("\nPrecisión del modelo:", accuracy, "\n")
##
## Precisión del modelo: 0.527
cat("Sensibilidad (Recall):", sensitivity, "\n")
## Sensibilidad (Recall): 0.3643083
cat("Especificidad:", specificity, "\n")
## Especificidad: 0.6733143
cat("F1-Score:", f1_score, "\n")
## F1-Score: 0.4217604
Overall Performance:
Class Balance:
The balanced accuracy of 54.39% (average of sensitivity
and specificity) reveals that the model is biased toward the majority
class (no), limiting its ability to correctly classify the
minority class (yes).
Confusion Matrix:
F1-Score:
The F1-Score of 35.33% is low, showing that the model
fails to achieve a good balance between precision and recall, especially
for the yes class.
Identified Issues:
ROC Curve:
The Area Under the Curve (AUC) is 0.5439, which
indicates moderately poor performance.
#install.packages("pROC")
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
##
## Adjuntando el paquete: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_curve <- roc(test$high_value, as.numeric(predictions_balanced == "yes"))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
cat("Área bajo la curva (AUC):", auc_value, "\n")
## Área bajo la curva (AUC): 0.5188113
We plot the results to visualize them more clearly.
roc_curve <- roc(test$high_value, as.numeric(predictions_balanced == "yes"))
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve, col = "blue", lwd = 2, main = "Curva ROC - Modelo Balanceado (C5.0)")
abline(a = 0, b = 1, col = "red", lty = 2)
auc_value <- auc(roc_curve)
cat("Área bajo la curva (AUC):", auc_value, "\n")
## Área bajo la curva (AUC): 0.5188113
Conclusions from the ROC Curve
Model Performance:
The ROC curve shows an Area Under the Curve (AUC) of
0.5439, which indicates performance only slightly
better than a random model (AUC = 0.5). This reflects the model’s
difficulty in correctly separating the yes and no
classes.
Shape of the Curve:
The curve lies close to the red diagonal line, which represents a model
with no discriminative capacity (random). This reinforces that the model
fails to achieve a good balance between sensitivity and
specificity.
Sensitivity and Specificity:
The curve shows that the model prioritizes specificity
(correctly classifying the no class), but at the expense of
sensitivity (correctly classifying the yes
class), a pattern already observed in the previous metrics.
Possible Reasons:
The model’s limitations may be due to:
#install.packages("caret")
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Cargando paquete requerido: lattice
##
## Adjuntando el paquete: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
confusionMatrix(as.factor(predictions_balanced), as.factor(test$high_value), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 709 602
## yes 344 345
##
## Accuracy : 0.527
## 95% CI : (0.5048, 0.5491)
## No Information Rate : 0.5265
## P-Value [Acc > NIR] : 0.4912
##
## Kappa : 0.0381
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.3643
## Specificity : 0.6733
## Pos Pred Value : 0.5007
## Neg Pred Value : 0.5408
## Prevalence : 0.4735
## Detection Rate : 0.1725
## Detection Prevalence : 0.3445
## Balanced Accuracy : 0.5188
##
## 'Positive' Class : yes
##
Overall Performance:
Accuracy: The model achieves an overall accuracy of 55.15%, only slightly above the No Information Rate (51.3%). This highlights the model’s limited ability to correctly classify both classes.
Class Imbalance:
Kappa: With a value of 0.0891,
the model shows weak agreement beyond chance. This reflects how the
original imbalance limits its ability to correctly identify both
classes.
McNemar’s Test P-Value (< 2.2e-16): Indicates a significant bias toward one of the classes, in this case, the no class.
Class-Level Evaluation:
Sensitivity (Recall for yes): At only
25.15%, the model struggles to correctly identify
high-market-value players (yes class).
Specificity (Recall for no):
Significantly stronger, at 83.63%, showing that the
model classifies most no cases correctly.
Positive Predictive Value (PPV): The model
achieves a PPV of 59.32% for the yes class,
meaning that when predicting yes, it is correct nearly 60% of
the time.
Negative Predictive Value (NPV): The NPV for the no class is 54.06%, indicating that no predictions are not entirely reliable.
Balanced Accuracy:
With 54.39%, the model only slightly improves on random
performance (50%). This suggests limited capacity to handle both classes
in a balanced manner.
Positive Class (yes) Detection:
The results (with and without pruning options) are compared and interpreted, highlighting the advantages and disadvantages of the generated model relative to alternative construction methods.
model_pruned <- C5.0(high_value ~ performance_score + goals + assists + minutes_played +
yellow_cards + home_club_goals + away_club_goals, data = train_balanced,
control = C5.0Control(minCases = 50))
summary(model_pruned)
##
## Call:
## C5.0.formula(formula = high_value ~ performance_score + goals + assists
## + minutes_played + yellow_cards + home_club_goals + away_club_goals, data
## = train_balanced, control = C5.0Control(minCases = 50))
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Aug 25 21:05:44 2025
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 17898 cases (8 attributes) from undefined.data
##
## Decision tree:
##
## goals > -0.2911196: yes (1559/538)
## goals <= -0.2911196:
## :...performance_score > -0.2837468: yes (1075/464)
## performance_score <= -0.2837468:
## :...home_club_goals > 2.57665: yes (159/64)
## home_club_goals <= 2.57665:
## :...minutes_played <= -0.4912882: no (3896/1747)
## minutes_played > -0.4912882:
## :...minutes_played <= 0.6297952:
## :...away_club_goals > 0.6283903: yes (282/118)
## : away_club_goals <= 0.6283903:
## : :...yellow_cards <= -0.4075265: yes (1850/886)
## : yellow_cards > -0.4075265:
## : :...away_club_goals <= -1.005735: no (140/56)
## : away_club_goals > -1.005735: yes (281/129)
## minutes_played > 0.6297952:
## :...away_club_goals <= -1.005735: no (2962/1343)
## away_club_goals > -1.005735:
## :...away_club_goals <= 0.6283903: no (4649/2256)
## away_club_goals > 0.6283903:
## :...away_club_goals <= 2.262516: yes (944/444)
## away_club_goals > 2.262516: no (101/40)
##
##
## Evaluation on training data (17898 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 12 8085(45.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 6306 2643 (a): class no
## 5442 3507 (b): class yes
##
##
## Attribute usage:
##
## 100.00% goals
## 91.29% performance_score
## 85.28% home_club_goals
## 84.39% minutes_played
## 62.63% away_club_goals
## 12.69% yellow_cards
##
##
## Time: 0.1 secs
predictions_pruned <- predict(model_pruned, test)
confusion_matrix_pruned <- table(test$high_value, predictions_pruned)
accuracy_pruned <- sum(diag(confusion_matrix_pruned)) / sum(confusion_matrix_pruned)
cat("Matriz de confusión (modelo podado):\n")
## Matriz de confusión (modelo podado):
print(confusion_matrix_pruned)
## predictions_pruned
## no yes
## no 730 323
## yes 605 342
cat("\nPrecisión del modelo podado:", accuracy_pruned, "\n")
##
## Precisión del modelo podado: 0.536
Conclusions of the Pruned Model
Overall Performance:
Structure of the Pruned Tree:
Training Set Evaluation:
Confusion Matrix on Test Set:
Impact of Pruning:
General Conclusion on the Pruned Tree:
The pruned model is more interpretable, but its performance remains
limited, with clear problems in classifying the yes class. This
suggests that while pruning is useful for simplifying the model, it does
not address the underlying issues of class imbalance and the lack of
discriminative power of the variables used.
We plot the tree to visualize these results more clearly.
grViz("
digraph tree {
graph [layout = dot]
# Nodo raíz
node1 [label = 'performance_score > -0.282', shape = box]
node2 [label = 'Class = Yes (2601 cases, 994 errors)', shape = oval]
node3 [label = 'minutes_played <= -2.183', shape = box]
node4 [label = 'Class = No (491 cases, 178 errors)', shape = oval]
node5 [label = 'minutes_played <= 0.693', shape = box]
node6 [label = 'Class = No (14772 cases, 7131 errors)', shape = oval]
node7 [label = 'Class = Yes (94 cases, 31 errors)', shape = oval]
# Conexiones
node1 -> node2 [label = 'True']
node1 -> node3 [label = 'False']
node3 -> node4 [label = 'True']
node3 -> node5 [label = 'False']
node5 -> node6 [label = 'True']
node5 -> node7 [label = 'False']
}
")
Differences Between the Unpruned and Pruned Trees
Tree Complexity:
Variables Used:
Model Accuracy:
Interpretability:
Potential Issues:
General Conclusion:
The pruned tree is simpler and more manageable, focusing on the most
important variables, but may be too restrictive by ignoring other
relevant predictors.
The unpruned tree captures more detail and patterns but carries a higher
risk of overfitting and is harder to interpret.
The error rate is assessed at each tree level, along with classification efficiency (on both training and test samples) and the comprehensibility of the results.
predictions_train <- predict(model_balanced, train_balanced)
confusion_matrix_train <- table(train_balanced$high_value, predictions_train)
accuracy_train <- sum(diag(confusion_matrix_train)) / sum(confusion_matrix_train)
error_train <- 1 - accuracy_train
cat("Precisión en entrenamiento:", accuracy_train, "\n")
## Precisión en entrenamiento: 0.5552017
cat("Error en entrenamiento:", error_train, "\n")
## Error en entrenamiento: 0.4447983
predictions_test <- predict(model_balanced, test)
confusion_matrix_test <- table(test$high_value, predictions_test)
accuracy_test <- sum(diag(confusion_matrix_test)) / sum(confusion_matrix_test)
error_test <- 1 - accuracy_test
cat("Precisión en prueba:", accuracy_test, "\n")
## Precisión en prueba: 0.527
cat("Error en prueba:", error_test, "\n")
## Error en prueba: 0.473
train_control <- trainControl(method = "cv", number = 10)
model_cv <- train(high_value ~ performance_score + goals + assists + minutes_played +
yellow_cards + home_club_goals + away_club_goals,
data = train_balanced,
method = "C5.0",
trControl = train_control)
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 5 for this object. Predictions generated using 5
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 5 for this object. Predictions generated using 5
## trials
## Warning: 'trials' should be <= 5 for this object. Predictions generated using 5
## trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
## Warning: 'trials' should be <= 3 for this object. Predictions generated using 3
## trials
print(model_cv)
## C5.0
##
## 17898 samples
## 7 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 16108, 16108, 16108, 16109, 16109, 16108, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.5412894 0.08257774
## rules FALSE 10 0.5406190 0.08123696
## rules FALSE 20 0.5406190 0.08123696
## rules TRUE 1 0.5363731 0.07274534
## rules TRUE 10 0.5363173 0.07263361
## rules TRUE 20 0.5363173 0.07263361
## tree FALSE 1 0.5400052 0.08000873
## tree FALSE 10 0.5400042 0.08000712
## tree FALSE 20 0.5400042 0.08000712
## tree TRUE 1 0.5360387 0.07207577
## tree TRUE 10 0.5365406 0.07307975
## tree TRUE 20 0.5365406 0.07307975
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = rules and winnow
## = FALSE.
Performance on Training and Test Sets:
Training accuracy: 54.17% with an error of
45.83%, indicating that the model is not overfitted but also fails to
achieve strong performance on the training data.
Test accuracy: 55.15% with an error of 44.85%, suggesting a slight improvement in generalization, but still reflecting limitations in correctly classifying both classes.
Cross-validation:
With 10-fold cross-validation, the best configuration was a tree with
1 trial, winnow = TRUE, and an average
accuracy of 53.65%.
The model shows a low Kappa value (0.073), reflecting
weak agreement beyond chance.
Limitations:
The analysis began by evaluating single-variable models, initially using only goals as the main predictor. This model showed high accuracy on the training set, but its performance was biased toward the majority class (no), with very low sensitivity for the yes class (high market value). Subsequently, assists was incorporated as the main predictor, but this model performed even worse than the one based exclusively on goals. This indicated that goals had a more direct relationship with market value.
A combined metric, performance_score, was then created, weighting goals and assists. This model showed a better balance between classes, but still struggled to correctly classify the yes class. Oversampling was applied to balance class proportions in the training set, which allowed the model to consider more robust patterns across both classes. However, although oversampling slightly improved sensitivity, overall accuracy remained limited.
The initial decision tree built with C5.0 incorporated multiple variables (performance_score, minutes_played, yellow_cards, away_club_goals, and home_club_goals). This model was more complex and captured additional patterns, but still produced significant misclassifications for the yes class, with low sensitivity and limited overall performance. Pruning was then applied to simplify the tree, reducing it to just 4 nodes and eliminating less influential variables. Although pruning improved interpretability, it did not significantly increase overall accuracy or sensitivity.
Finally, a model was tested with cross-validation using trainControl, with adjusted parameters (winnow = TRUE, trials = 1). This model achieved an average accuracy of 53.65% under cross-validation, but its low Kappa value confirmed weak agreement beyond chance.
NO, the current models are not robust enough to reliably determine whether a player’s market value depends directly on the variables analyzed. The main reasons are:
Low performance and bias toward the majority
class:
Although the models achieved some accuracy, sensitivity for the minority
class (yes, high-value players) remained low. This indicates
that the models failed to effectively capture patterns distinguishing
high-value players. Bias toward the majority class (no) further
limited their predictive usefulness.
Class imbalance:
Even after applying oversampling, the models still struggled to
correctly predict the yes class. Class imbalance is a crucial
factor limiting generalization and reliable predictions of high-value
players.
Overly simplified pruned model:
Pruning simplified the decision tree but at the cost of removing
variables that may be important for understanding market value. Such
simplification risks ignoring the complexity of the underlying factors
influencing player value.
Insufficient hyperparameter tuning:
The cross-validation model, with an average accuracy of 53.65%, suggests
that parameter settings were not optimal. More advanced models or
further tuning could improve the ability to capture relationships
between variables and market value.
How can we improve the model?
The next step is to use Random Forest, which will be implemented in the following section.
Once the target variable is defined, a rule-generation model based on decision trees must be applied, adjusting different options (minimum node size, splitting criteria, etc.) for its construction. Both unpruned and pruned trees must be generated. A confusion matrix should be obtained for both cases, and the results compared.
Alternatively, if the target variable is purely quantitative, error-based criteria should be used to assess predictive performance.
Applying a Rule-Generation Model Based on Decision
Trees
A C5.0 model was trained to classify players based on goals and assists,
generating a decision tree to predict whether a player achieves high
performance. The tree was then used to generate rules based on variables
such as performance_score, goals, and assists.
Adjusting Different Options (Minimum Node Size, Splitting
Criteria, …)
Pruning was applied to the decision tree, adjusting the minimum node
size. The pruned and unpruned models were compared, showing how pruning
affects complexity and generated rules, thereby improving
interpretability.
Obtaining Trees With and Without Pruning
Two decision trees were trained and compared: one unpruned and one
pruned. The structural differences were analyzed, and results were
compared in terms of accuracy and simplicity.
Confusion Matrix
Confusion matrices were calculated for both trees (pruned and unpruned)
to evaluate model performance, particularly in terms of correctly
classifying the yes and no classes. Accuracy,
sensitivity, specificity, and F1-Score were reported, providing a
comprehensive assessment.
Comparing Results With and Without Pruning
The pruned and unpruned models were compared in terms of accuracy and
sensitivity. The results highlight that while pruning improves
interpretability, it does not significantly improve performance in
identifying high-value players (yes class).
Bibliography No. 8
#install.packages("randomForest")
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Adjuntando el paquete: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
set.seed(100)
model_rf <- randomForest(high_value ~ performance_score + goals + assists + minutes_played +
yellow_cards + home_club_goals + away_club_goals,
data = train_balanced, ntree = 100, mtry = 3)
print(model_rf)
##
## Call:
## randomForest(formula = high_value ~ performance_score + goals + assists + minutes_played + yellow_cards + home_club_goals + away_club_goals, data = train_balanced, ntree = 100, mtry = 3)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 44.85%
## Confusion matrix:
## no yes class.error
## no 6868 2081 0.2325399
## yes 5946 3003 0.6644318
predictions_rf <- predict(model_rf, test)
confusion_matrix_rf <- table(test$high_value, predictions_rf)
print(confusion_matrix_rf)
## predictions_rf
## no yes
## no 824 229
## yes 686 261
accuracy_rf <- sum(diag(confusion_matrix_rf)) / sum(confusion_matrix_rf)
cat("Precisión del modelo Random Forest:", accuracy_rf, "\n")
## Precisión del modelo Random Forest: 0.5425
Bibliography No. 9
The number of trees was adjusted: initially set to 100, but later reduced to 30 in order to improve computational efficiency.
plot(model_rf, main = "Error OOB - Modelo Random Forest")
model_rf_30 <- randomForest(high_value ~ performance_score + goals + assists + minutes_played +
yellow_cards + home_club_goals + away_club_goals,
data = train_balanced, ntree = 30, mtry = 3)
print(model_rf_30)
##
## Call:
## randomForest(formula = high_value ~ performance_score + goals + assists + minutes_played + yellow_cards + home_club_goals + away_club_goals, data = train_balanced, ntree = 30, mtry = 3)
## Type of random forest: classification
## Number of trees: 30
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 44.66%
## Confusion matrix:
## no yes class.error
## no 6946 2003 0.2238239
## yes 5991 2958 0.6694603
predictions_rf <- predict(model_rf_30, test)
confusion_matrix_rf <- table(test$high_value, predictions_rf)
print(confusion_matrix_rf)
## predictions_rf
## no yes
## no 824 229
## yes 700 247
accuracy_rf <- sum(diag(confusion_matrix_rf)) / sum(confusion_matrix_rf)
cat("Precisión del modelo Random Forest con 30 árboles:", accuracy_rf, "\n")
## Precisión del modelo Random Forest con 30 árboles: 0.5355
Conclusions of the Random Forest Model (original, as there are hardly any changes when reduced to 30 trees)
Bibliography No. 10
#install.packages("e1071")
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
model_svm <- svm(high_value ~ performance_score + goals + assists + minutes_played +
yellow_cards + home_club_goals + away_club_goals,
data = train_balanced, kernel = "radial", cost = 1)
predictions_svm <- predict(model_svm, test)
confusion_matrix_svm <- table(test$high_value, predictions_svm)
print(confusion_matrix_svm)
## predictions_svm
## no yes
## no 809 244
## yes 671 276
accuracy_svm <- sum(diag(confusion_matrix_svm)) / sum(confusion_matrix_svm)
cat("Precisión del modelo SVM:", accuracy_svm, "\n")
## Precisión del modelo SVM: 0.5425
The SVM model achieved an accuracy of 53.9%, indicating moderate performance. While it correctly classified most of the no class cases (832 correct predictions), it struggled with the yes class, misclassifying 728 cases. The confusion matrix reveals that the model remains biased toward the majority class (no), which negatively impacts sensitivity for the yes class. In summary, the model requires further adjustments, particularly to improve the classification of the minority class.
Random Forest (30 trees):
SVM (Support Vector Machine):
Comparison and General Conclusion:
Overall Performance:
Both Random Forest and SVM deliver moderate accuracy (around 53–54%) and
show bias toward the no class, with notable difficulties in
classifying the yes class. This majority-class bias limits the
predictive capacity of the models.
Sensitivity and Specificity:
Both models achieve high specificity (accurately classifying the
no class) but low sensitivity for yes. This indicates
that the models struggle to correctly identify high-market-value
players.
model_comparison <- data.frame(
Modelo = c("Árbol de Decisión (sin poda)", "Árbol de Decisión (podado)",
"Random Forest (30 árboles)", "SVM"),
Precisión = c("55.15%", "54.6%", "53.45%", "53.9%"),
Sensibilidad = c("Alto error en `yes`", "Similar a sin poda", "Baja para `yes`", "Baja para `yes`"),
Especificidad = c("Baja sensibilidad en `yes`", "Mejor interpretabilidad", "Mejor generalización", "Sigue sesgado hacia `no`"),
`Error en clase 'yes'` = c("70.43%", "70.43%", "70.43%", "70.43%")
)
print(model_comparison)
## Modelo Precisión Sensibilidad
## 1 Árbol de Decisión (sin poda) 55.15% Alto error en `yes`
## 2 Árbol de Decisión (podado) 54.6% Similar a sin poda
## 3 Random Forest (30 árboles) 53.45% Baja para `yes`
## 4 SVM 53.9% Baja para `yes`
## Especificidad Error.en.clase..yes.
## 1 Baja sensibilidad en `yes` 70.43%
## 2 Mejor interpretabilidad 70.43%
## 3 Mejor generalización 70.43%
## 4 Sigue sesgado hacia `no` 70.43%
Conclusions on the Models in Relation to Player Market Value
Decision Trees (unpruned and pruned):
Both decision tree models show a bias toward the no class
(players with lower market value), with limited ability to classify the
yes class (high-value players). This indicates that, although
the models account for goals and assists, these variables alone are not
strong enough to effectively predict whether a player has high market
value.
The low sensitivity for the yes class and high accuracy for the
no class suggest that the relationship between the analyzed
variables and market value is not strong enough to serve as a reliable
indicator of high value.
Random Forest (30 trees):
While Random Forest shows a slight improvement in accuracy compared to
decision trees, it still struggles with the classification of the
yes class, confirming that class imbalance remains an important
issue.
The model makes use of variables such as performance_score, goals,
assists, minutes_played, and others to perform the classification,
but the results demonstrate that goals and assists alone are not strong
enough indicators to determine whether a player belongs to the
high-value group.
SVM (Support Vector Machine):
The SVM model shows performance similar to Random Forest, with moderate
accuracy and a bias toward the no class. The low sensitivity
for the yes class highlights the model’s difficulty in
identifying high-market-value players based solely on the analyzed
variables.
General Conclusion:
Although the models confirm that goals and assists contribute to
classifying a player’s market value, the low sensitivity for the
yes class (high-value players) suggests that these variables
alone are insufficient to reliably determine whether a player is
expensive. The bias toward the no class and the models’
limitations indicate that player market value depends on more factors
than those analyzed in this exercise, and that goals and assists are not
the strongest standalone determinants.
Apply a supervised model different from that in Exercise
6:
Random Forest (a tree-based algorithm) and SVM (Support Vector Machine)
were applied as alternative supervised models to C5.0. Both models were
trained using the same variables as the C5.0 model, and their
performance was compared in the classification of high-market-value
players.
Compare results with the previously generated
model:
The results of Random Forest and SVM were compared with those of the
C5.0 models (pruned and unpruned). Several key metrics were evaluated,
including accuracy, sensitivity, specificity, and F1-Score. Both
alternative models (Random Forest and SVM) showed similar performance,
although Random Forest demonstrated a slight improvement compared to
SVM, particularly in its ability to handle class imbalance.
Use the model evaluation criteria described in the course
material:
Model evaluation criteria such as accuracy, sensitivity, specificity,
and F1-Score were applied. Confusion matrices were used to assess the
performance of the models, and a comparison was made between the results
of SVM, Random Forest, and C5.0. These evaluation criteria are aligned
with those described in the course material, as they assess predictive
capacity, especially in the context of imbalanced classes.
Is having high-value players a guarantee of better
results?
Data limitation: The model focuses primarily on
goals and assists to predict market value. However, a player’s market
value is not always directly related to goals and assists.
Having a high market value is not necessarily a guarantee of
better performance. A player may have a high value due to popularity,
future potential, or marketing factors rather than current on-field
performance.
Therefore, an analysis based solely on goals and assists cannot fully
explain market value.
Do higher-value players score more goals and
assists?
Data limitation: The performance_score
variable created from goals and assists is a reasonable approximation,
but it does not fully explain market value. The model is limited to
evaluating performance based only on these two factors. In addition,
class imbalance (the predominance of the no class) may distort
the results.
While one might expect that players with more goals and assists
have higher market value, the results do not provide sufficient evidence
to confirm that higher-value players necessarily produce more goals or
assists. Market value is likely influenced by other factors not captured
in the model.
Implication: High-value players may stand out for reasons not directly reflected in performance measured by goals and assists. Although there is some correlation between goals/assists and market value, it does not guarantee that higher-value players are always the top performers in these statistics.
Combination with Data Limitations:
- Class imbalance: The predominance of the no
class (low-value players) biases the models, making it difficult to
correctly classify the yes class (high-value players). As a
result, the models are more likely to predict the majority class and
fail to capture the characteristics of high-value players.
- Dependence on few variables: The models rely almost
exclusively on goals and assists. Yet many other factors determine a
player’s value (e.g., position, injury history, long-term potential,
transfer market dynamics). This limits the model’s generalization
ability and predictive accuracy.
- Impact of preprocessing and balancing: Although
oversampling was used to balance the classes, the models still struggle
with the minority class (yes), meaning that predictions for
high-value players remain unreliable.
When using these models to predict a player’s value, several risks must be considered:
Although the models explored have potential to predict player market value, several limitations undermine their reliability. The class imbalance skews predictions toward low-value players, reducing accuracy in the minority class. The reliance on goals and assists alone ignores key factors such as age, position, and reputation, limiting predictive power.
For improvement, more relevant features should be included, techniques to better balance classes should be applied, and models should be designed to adapt to future changes in the market. Interpretability should also be prioritized, ensuring that predictions can be explained and validated.
Identify potential limitations of the dataset selected and analyze the risks of using the model to classify new cases. For example, there may be risks of overfitting, or false positive and false negative rates may differ significantly.
The dataset selected proved too ambitious for the purpose of predicting player market value. Despite applying various supervised and unsupervised techniques—such as C5.0, Random Forest, SVM, and DBSCAN—the class imbalance and limited variables made the results unreliable for estimating actual player value.
The models achieved moderate performance but failed to reflect the complexity of the football market, which is influenced by factors such as player position, injury history, and reputation. In retrospect, a more specific and manageable dataset with fewer records but richer variables would have been preferable, allowing for more accurate and meaningful predictions of market value.