Main Project Goal
The main objective of this project is to create a regression model
that can accurately predict a FIFA football strikers market value. This
is done by diving deep into various attributes like performance metrics,
physical traits, and Player stats - age, height, weight. This analysis
aims to uncover the factors that truly drive a player’s worth in the
transfer market.
This will give Clubs and Agents the clarity they need to make
smarter, more informed decisions. It can pinpoint undervalued talent,
justify big-money investments, or even identify traits that consistently
lead to higher market valuations.
Specifically for the team’s manager, this could mean having a
competitive edge during negotiations. For the scouting team, it’s about
finding hidden gems that others might overlook. Finance teams get
clarity on the ROI of expensive signings, while the coaching staff gains
insights into which traits best align with the team’s style of play.
There are no missing values or null values on any of the
columns so we have very clean data to begin with
fifa_data <- fifa_data |>
mutate(
first_position = str_extract(positions, "^[^,]+")
)
fifa_data <- fifa_data |>
mutate(
position_category = case_when(
first_position %in% c("ST", "RW", "LW", "CF") ~ "Attacker",
first_position %in% c("CAM", "CM", "LM", "RM", "CDM") ~ "Midfielder",
first_position %in% c("CB", "LB", "RB", "LWB", "RWB") ~ "Defender",
first_position == "GK" ~ "Goalkeeper",
TRUE ~ "Unknown"
)
)
fifa_data |> count(position_category) |>
arrange(desc(n)) |>
ggplot(aes(x = reorder(position_category, -n), y=n)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(title = "Frequency of Player Positions", x = "Position", y = "Count")

Observations We get to see the distribution of
talent pool of attackers and we could see that there are approx ~3750
attackers.
# Filtering for only Attackers
fifa_data <- filter(fifa_data, position_category == "Attacker")
Let’s start off by looking at some key attributes of our
population to understand a bit about the data using some summary
statistics
# Summarize the data
summary_table <- fifa_data |>
select(age, height_cm, weight_kgs) |>
summary()
print(summary_table)
## age height_cm weight_kgs
## Min. :17.00 Min. :152.4 Min. : 49.9
## 1st Qu.:21.00 1st Qu.:154.9 1st Qu.: 69.9
## Median :25.00 Median :175.3 Median : 74.8
## Mean :25.12 Mean :174.1 Mean : 75.3
## 3rd Qu.:28.00 3rd Qu.:185.4 3rd Qu.: 79.8
## Max. :40.00 Max. :200.7 Max. :110.2
fifa_data |>
select(overall_rating, potential, value_euro, wage_euro) |>
pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
facet_wrap(~Variable, scales = "free") +
theme_minimal() +
labs(title = "Univariate Analysis: Histograms")
## Warning: Removed 84 rows containing non-finite outside the scale range
## (`stat_bin()`).

Summary Statistics - Mean age: 25.12 years - Total
age range: 17-40 years
Physical attributes - Median height is 175.3 cm - Median weight is
74.8 kg
As we have a lot of traits, let’s pick out the specific
traits which are important to us
We start off by picking the top Physical traits and traits specific
to football like finishing, dribbling, stamina, strength etc.., to
address our issue. This is done by creating a correlation matrix of the
attributes.
Correlation Matrix
key_attributes <- fifa_data |>
select(wage_euro,overall_rating, value_euro, potential, `skill_moves(1-5)`, `international_reputation(1-5)`)
# Compute the correlation matrix
cor_matrix_key <- cor(key_attributes, use = "complete.obs")
ggcorrplot(cor_matrix_key,
method = "circle",
type = "lower",
lab = TRUE,
title = "Correlation: Overall Rating vs Key Attributes",
colors = c("blue", "white", "red"))

Observation: Overall rating, wage and international
reputation have the highest correlation to the value in euro for a
particular player.
physical_traits <- fifa_data |>
select(overall_rating, stamina, strength, vision, sprint_speed, shot_power, agility)
# Compute the correlation matrix
cor_matrix_physical <- cor(physical_traits, use = "complete.obs")
ggcorrplot(cor_matrix_physical,
method = "circle",
type = "lower",
lab = TRUE,
title = "Correlation: Physical Traits",
colors = c("blue", "white", "red"))

Observation: Stamina, Vision and Shot power most
impactful attributes to the overall rating of a particular player which
based on the previous matrix strongly determines the value of the
particular player in the international transfer market.
football_traits <- fifa_data |>
select(overall_rating, finishing, dribbling, composure, reactions, positioning,
long_shots, ball_control, curve, volleys, heading_accuracy)
# Compute the correlation matrix
cor_matrix_football <- cor(football_traits, use = "complete.obs")
ggcorrplot(cor_matrix_football,
method = "circle",
type = "lower",
lab = TRUE,
title = "Correlation: Football Traits",
colors = c("blue", "white", "red"))

Observations finishing, dribbling, composure,
reactions, positioning, long shots, ball control and volleys have the
highest correlation to the overall rating Based on these observations,
we can pick the traits below for further analysis which show a high
correlation towards the overall rating.
Football traits: finishing, dribbling, composure,
reactions, positioning, long_shots, ball_control, volleys.
Physical traits: shot power, vision, stamina
Exploratory Data Analysis
Bi-variate Analysis
Scatter plot: Market Value vs Overall rating
fifa_data |>
ggplot(aes(x = overall_rating, y = value_euro)) +
geom_point(alpha = 0.6, color = "steelblue") +
theme_minimal() +
scale_y_log10(labels = dollar_format(prefix = "€")) +
labs(title = "Market Value vs Overall Rating", x = "Overall Rating", y = "Market Value")
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Observations - Market values vs Overall rating
We can see that the Market value increases exponentially when
compared to overall rating
Scatter plot: Market Value vs Wage
fifa_data |>
ggplot(aes(y = overall_rating, x = wage_euro)) +
geom_point(alpha = 0.6, color = "steelblue") +
theme_minimal() +
scale_x_log10(labels = dollar_format(prefix = "€")) +
labs(title = "Player Wage vs Overall Rating", x = "Overall Rating", y = "Market Value")
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Observations - Value vs Wage
Strong link observed between Overall rating and Market value
International Reputation vs Wage
fifa_data |>
group_by(`international_reputation(1-5)`) |>
summarise(
avg_value_euro = mean(value_euro, na.rm = TRUE),
median_value_euro = median(value_euro, na.rm = TRUE),
count = n()
) |>
arrange(desc(avg_value_euro))
## # A tibble: 5 × 4
## `international_reputation(1-5)` avg_value_euro median_value_euro count
## <dbl> <dbl> <dbl> <int>
## 1 5 77900000 80000000 5
## 2 4 48571429. 45750000 14
## 3 3 23940455. 15750000 88
## 4 2 8678549. 7000000 259
## 5 1 1815058. 775000 2972
Observation
Majority of our talent pool has a very low international reputation.
Target international reputations would be 1 & 2.
fifa_data |>
ggplot(aes(x = factor(`international_reputation(1-5)`), y = value_euro)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16, fill = "steelblue") +
theme_minimal() +
scale_y_continuous(labels = dollar_format(prefix = "€"))
## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

labs(
title = "Distribution of Player Value by International Reputation",
x = "International Reputation (1-5)",
y = "Player Value (Euro)"
)
## $x
## [1] "International Reputation (1-5)"
##
## $y
## [1] "Player Value (Euro)"
##
## $title
## [1] "Distribution of Player Value by International Reputation"
##
## attr(,"class")
## [1] "labels"
Observatinos - International reputation vs Value
We could see that there is a linear increase with overlap between
consecutive international reputation for the value.
Hypothesis Testing
Hypothesis 1 Null Hypothesis: Age does not affect
players’ market value Alternative Hypothesis: Age affects players’
market value
median_age <- median(fifa_data$age, na.rm = TRUE)
young_players <- fifa_data$value_euro[fifa_data$age < median_age]
older_players <- fifa_data$value_euro[fifa_data$age >= median_age]
model_age <- lm(log(value_euro) ~ age, data = fifa_data)
summary_model = summary(model_age)
p_value <- coef(summary_model)["age", "Pr(>|t|)"]
alpha <- 0.05
if (p_value <= alpha) {
cat("Reject the null hypothesis: Age significantly affects players' market value.\n")
} else {
cat("Fail to reject the null hypothesis: Age does not significantly affect players' market value.\n")
}
## Reject the null hypothesis: Age significantly affects players' market value.
Performing a two sided t-test
To test whether the average log-transformed market value for players
of a certain age differs from a hypothetical value.
fifa_data$age_group <- ifelse(fifa_data$age < median_age, "Young Players", "Older Players")
t.test(value_euro ~ age_group, data = fifa_data, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: value_euro by age_group
## t = 8.9852, df = 2655.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Older Players and group Young Players is not equal to 0
## 95 percent confidence interval:
## 1850550 2883723
## sample estimates:
## mean in group Older Players mean in group Young Players
## 4389851 2022714
mu1 <- mean(young_players, na.rm = TRUE)
sd1 <- sd(young_players, na.rm = TRUE)
# Perform the power analysis
power_analysis <- pwr.t.test(
d = (mu1) / sd1,
power = 0.8,
sig.level = 0.1,
type = "two.sample",
alternative = "two.sided"
)
# Plot the power analysis result
plot(power_analysis)

Hypothesis 2 Null Hypothesis: Potential doesn’t
affect a players’ market value Alternative Hypothesis: Potential affects
players’ market value
median_potential <- median(fifa_data$potential, na.rm = TRUE)
fifa_data$potential_group <- ifelse(fifa_data$potential < median_potential, "Low Potential", "High Potential")
fifa_data$log_value_euro <- fifa_data$value_euro
t_test_result_potential <- t.test(log_value_euro ~ potential_group, data = fifa_data, var.equal = FALSE)
print(t_test_result_potential)
##
## Welch Two Sample t-test
##
## data: log_value_euro by potential_group
## t = 20.27, df = 1727.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High Potential and group Low Potential is not equal to 0
## 95 percent confidence interval:
## 4525001 5494504
## sample estimates:
## mean in group High Potential mean in group Low Potential
## 5649289.9 639537.4
median_potential <- median(fifa_data$potential, na.rm = TRUE)
low_potential <- fifa_data$value_euro[fifa_data$potential < 75]
high_potential <- fifa_data$value_euro[fifa_data$potential >= 75]
mu1 <- mean(low_potential, na.rm = TRUE)
sd1 <- sd(low_potential, na.rm = TRUE)
# Perform the power analysis
power_analysis <- pwr.t.test(
d = (mu1) / sd1,
power = 0.8,
sig.level = 0.1,
type = "two.sample"
)
# Plot the power analysis result
plot(power_analysis)

Observations We can see that the p-values for both
potential and age are minuscule. Hence, we can safely reject the null
hypothesis and can say that both age and potential are statistically
significant and must be considered when creating the linear regression
model.
Linear Regression Models
**Linear Regression Model of value ins euro based on age, potential
and overall rating.
# Create the linear regression model
log_value_model <- lm(log(value_euro) ~ potential + overall_rating + age + I(age^2),
data = fifa_data)
fifa_data$predicted_value_plot <- exp(predict(log_value_model, fifa_data))
summary(log_value_model)
##
## Call:
## lm(formula = log(value_euro) ~ potential + overall_rating + age +
## I(age^2), data = fifa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94107 -0.11425 -0.01857 0.12120 0.95088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.9377265 0.1962523 -14.97 <2e-16 ***
## potential 0.0392308 0.0018412 21.31 <2e-16 ***
## overall_rating 0.1760239 0.0018499 95.16 <2e-16 ***
## age 0.2098850 0.0129872 16.16 <2e-16 ***
## I(age^2) -0.0046124 0.0002165 -21.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1808 on 3291 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.9837, Adjusted R-squared: 0.9837
## F-statistic: 4.97e+04 on 4 and 3291 DF, p-value: < 2.2e-16
plot_actual_vs_predicted_smooth <- function(predictor, data, actual, predicted) {
ggplot(data, aes_string(x = predictor)) +
geom_point(aes_string(y = actual), color = "blue", alpha = 0.5) +
geom_smooth(aes_string(x = predictor, y = predicted),
method = "loess",
color = "red",
se = FALSE,
linetype = "dashed",
size = 1) +
labs(title = paste("Actual vs Predicted for", predictor),
x = predictor, y = "Value (Euro)") +
theme_minimal() +
scale_y_continuous(labels = scales::comma) +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12))
}
plot_potential <- plot_actual_vs_predicted_smooth("potential", fifa_data, actual = "value_euro", predicted = "predicted_value_plot")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot_overall_rating <- plot_actual_vs_predicted_smooth("overall_rating", fifa_data, actual = "value_euro", predicted = "predicted_value_plot")
print(plot_potential)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

print(plot_overall_rating)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Quadratic Model to fit the Age vs Overall rating to find the turning
point
model_age <- lm(log(value_euro) ~ age + I(age^2), data = fifa_data)
age_seq <- seq(min(fifa_data$age), max(fifa_data$age), length.out = 1000)
predicted_log_values <- predict(model_age, newdata = data.frame(age = age_seq))
predicted_values <- exp(predicted_log_values) # Convert back to Euros
# Create a data frame for predictions
predicted_data <- data.frame(age = age_seq, value_euro = predicted_values)
# Calculate the turning point (peak age)
coefficients <- coef(model_age)
turning_point <- -coefficients["age"] / (2 * coefficients["I(age^2)"])
peak_value <- exp(predict(model_age, newdata = data.frame(age = turning_point)))
# Plot with ggplot2
ggplot(data = fifa_data, aes(x = age, y = value_euro)) +
geom_point(color = "darkgreen") + # Scatter plot for actual data
geom_line(data = predicted_data, aes(x = age, y = value_euro), color = "blue", size = 1) + # Predicted curve
geom_vline(xintercept = turning_point, linetype = "dashed", color = "red") + # Vertical line at the turning point
annotate("text", x = turning_point, y = peak_value, label = paste0("Peak: Age ", round(turning_point, 1), "\n€", format(round(peak_value, 2), big.mark = ",")),
hjust = -0.1, vjust = -0.5, color = "black", size = 10, fontface = "bold") + # Label at the turning point
labs(
title = "Age vs Value in Euros",
x = "Age",
y = "Value in Euros"
) +
scale_y_continuous(labels = dollar_format(prefix = "€")) +
coord_cartesian(ylim = c(0, 5000000)) +# Format y-axis as Euros
theme_minimal()
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

Age (-0.0649):
A player’s value increases by about 21% per year till they reach a
peak value at the age of 28.8 years. The value slowly decreases at first
~1% per year till about 30 and then more sharply ~6.5% after 30 years of
age and as the quadratic term dominates this gets higher. So players
less than 28.8 years have a strong growth potential whereas older
players face a steady decline in value.
Potential (0.0086):
A one-unit increase in potential corresponds to a 0.87% increase in
market value This small yet significant effect reinforces the premium
placed on long-term development potential.
International Reputation (-0.0104):
The coefficient is not statistically significant (p = 0.228), meaning
there is no evidence that international_reputation(1-5) has a meaningful
direct effect on market value in this model. Possible reasons: -
Col-linearity with overall_rating. - Reputation might indirectly affect
value through other factors not included here.
Overall Rating (0.2078):
A one-point increase in overall rating is associated with a 23.13%
increase in market value. This high coefficient and significance
emphasize that current performance is a critical driver of a player’s
worth
prediction_interval <- exp(predict(
log_value_model,
newdata = fifa_data,
interval = "prediction",
level = 0.95
))
fifa_data_predictions <- cbind(fifa_data, prediction_interval) |>
rename(
predicted_value = fit,
lower_bound = lwr,
upper_bound = upr
)
# Plot actual vs. predicted with prediction intervals
ggplot(fifa_data_predictions, aes(x = predicted_value, y = value_euro)) +
geom_point(color = "blue", alpha = 0.7, size = 2) + # Actual values
geom_errorbar(aes(ymin = lower_bound, ymax = upper_bound), width = 0.2, color = "gray50") +
labs(
title = "Predicted Market Values with 95% Prediction Intervals",
x = "Predicted Market Value (Euro)",
y = "Actual Market Value (Euro)"
) +
theme_minimal() +
scale_x_continuous(labels = dollar_format(prefix = "€"))+
scale_y_continuous(labels = dollar_format(prefix = "€"))+
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
theme(
axis.text = element_text(size = 12),
axis.title = element_text(size = 14)
)
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).
Observations Less than 50million euros we can see that
our linear model is a perfect fit and the Actual market value matches
the predicted market value. high value players whose value > 50
million euros might have other factors such as injury history, marketing
potential etc., which influences their market value.
player_counts <- fifa_data |>
summarise(
less_than_50 = sum(value_euro < 50000000, na.rm = TRUE),
greater_than_50 = sum(value_euro > 50000000, na.rm = TRUE)
)
total <- sum(player_counts$less_than_50, player_counts$greater_than_50)
percentages <- round(c(player_counts$less_than_50, player_counts$greater_than_50) / total * 100, 1)
labels_with_counts_percentages <- paste(
c("Less than 50M", "Greater than 50M"),
"\n",
c(player_counts$less_than_50, player_counts$greater_than_50),
" (", percentages, "%)",
sep = ""
)
pie(
c(player_counts$less_than_50, player_counts$greater_than_50),
labels = labels_with_counts_percentages,
main = "Player Market Value Distribution",
col = c("lightblue", "lightgreen")
)

# Create the linear regression model
traits_model <- lm(overall_rating ~ ball_control + dribbling + finishing +
positioning + shot_power + stamina +
composure + long_shots + reactions +
vision + volleys,
data = fifa_data)
Interpretation and Observation
vif(traits_model)
## ball_control dribbling finishing positioning shot_power stamina
## 7.332871 4.855570 4.127150 5.759807 3.404632 1.338559
## composure long_shots reactions vision volleys
## 3.927081 3.298864 4.483114 2.670967 3.221693
vif(log_value_model)
## potential overall_rating age I(age^2)
## 13.30212 17.71124 366.99430 279.01831
Actionable Insights
For Player Scouts - Prioritize young players (age
21-25) with high potential ratings - Focus on players with strong - Ball
control (highest correlation to overall rating) - Finishing skills -
Positioning - Shot power - Look for players with development potential,
as it significantly impacts market value
For Club Finance Teams - Value calculation based on
our Model - Overall rating is the strongest value driver (23% increase
per point) - Potential adds 0.87% value per point - Age negatively
impacts value (-6.49% per year)
- Strategy
- Invest in players under 25 with high potential
- Prioritize players with ratings above 75
- Consider long term development over immediate reputation even though
that has other benefits
For Club Managers - Player development priorities
for attackers - Improve ball control, dribbling and finishing skills. -
Focus on positioning and shot power - Performance metrics -