In this report, we will explore the factors influencing a soccer
player’s market value (value_euro). First, we will conduct
an ANOVA test using a categorical explanatory variable and follow it up
by building a simple linear regression model using a continuous
explanatory variable. These tests will help us understand which factors
are statistically significant in influencing player market value and how
they relate to the broader soccer market.
We will use positions as our
categorical explanatory variable. Player positions (e.g., forward,
midfielder, defender) are expected to influence market value, as players
in different roles may be valued differently depending on their
contribution to the team’s success.
Consolidating Player Positions before we run the ANOVA test
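# Packages used in this report
library(dplyr)
library(tidyr)
library(ggplot2)
# Inspect the distinct raw position strings
unique_positions <- unique(player_data$positions)
# Define the position mapping
# Note: "CAM", "RM" and "LM" appear under both Attackers and Midfielders;
# classify_position() returns the first match, so they are counted as Attackers
position_mapping <- list(
  "Attackers" = c("CF", "ST", "RW", "LW", "CAM", "RM", "LM"),
  "Midfielders" = c("CM", "CDM", "CAM", "RM", "LM"),
  "Defenders" = c("CB", "RB", "LB", "RWB", "LWB"),
  "Goalkeepers" = c("GK")
)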
# Create a function to classify positions
classify_position <- function(position) {
  for (category in names(position_mapping)) {
    if (position %in% position_mapping[[category]]) {
      return(category)
    }
  }
  return(NA) # Return NA if position doesn't match
}
# Split multi-position entries into one row per position, then label every row
# of a player with that player's first matched category.
# Note: the result has one row per player-position pair, not one row per player.
player_data <- player_data |>
  # Split positions that contain commas into separate rows
  separate_rows(positions, sep = ",") |>
  # Trim whitespace from positions (if any)
  mutate(positions = trimws(positions)) |>
  # Classify each individual position
  mutate(Category = sapply(positions, classify_position)) |>
  # Group by player name and give all of that player's rows the first non-missing category
  group_by(name) |>
  mutate(Category = first(na.omit(Category))) |>
  ungroup() # Remove the grouping
table(player_data$Category)
##
## Attackers Defenders Goalkeepers Midfielders
## 11991 8876 2104 6563
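One thing to keep in mind when reading these counts: separate_rows() produces one row per listed position, and the pipeline never collapses them again, so the 29,534 rows tallied above appear to be player-position pairs rather than individual players. The sketch below shows one possible way to assign a category while keeping one row per player. It uses only base R and a small made-up data frame; the names toy, position_lookup and classify_first are hypothetical, and CAM, RM and LM are mapped to Attackers only because that is what the first-match rule in the mapping above produces.

```r
# Toy data (hypothetical players, not taken from the report's dataset)
toy <- data.frame(
  name = c("Player A", "Player B", "Player C"),
  positions = c("CF,RW,ST", "CM, CAM", "GK"),
  stringsAsFactors = FALSE
)

# Flat lookup: each position code maps to exactly one category
position_lookup <- c(
  CF = "Attackers", ST = "Attackers", RW = "Attackers", LW = "Attackers",
  CAM = "Attackers", RM = "Attackers", LM = "Attackers",
  CM = "Midfielders", CDM = "Midfielders",
  CB = "Defenders", RB = "Defenders", LB = "Defenders",
  RWB = "Defenders", LWB = "Defenders",
  GK = "Goalkeepers"
)

# Return the category of the first listed position that has a mapping
classify_first <- function(position_string) {
  positions <- trimws(strsplit(position_string, ",", fixed = TRUE)[[1]])
  categories <- unname(position_lookup[positions])
  categories <- categories[!is.na(categories)]
  if (length(categories) == 0) NA_character_ else categories[1]
}

toy$Category <- vapply(toy$positions, classify_first, character(1),
                       USE.NAMES = FALSE)

nrow(toy)            # still one row per player
table(toy$Category)
```

Because no rows are duplicated, each player would contribute exactly one observation to any test that follows.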
Null hypothesis (H0): The mean player value is the same across all position categories.
Alternative hypothesis (H1): The mean player value differs for at least one of the position categories.
# Perform ANOVA test using player value
anova_value <- aov(value_euro ~ Category, data = player_data)
# Summarize the ANOVA result
summary(anova_value)
## Df Sum Sq Mean Sq F value Pr(>F)
## Category 3 9.421e+15 3.140e+15 89.53 <2e-16 ***
## Residuals 29125 1.022e+18 3.508e+13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 405 observations deleted due to missingness
The p-value is extremely small (< 2e-16), so we reject the null hypothesis: there is enough evidence to conclude that mean player value differs across at least some of the position categories.
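With more than 29,000 observations, even small differences between groups can produce a tiny p-value, so it may help to look at an effect size as well. The sketch below computes eta squared (the share of total variance attributable to Category) directly from the sums of squares printed in the ANOVA table above. The variable names are ours and are not part of the original analysis.

```r
# Sums of squares copied from the ANOVA summary above
ss_category  <- 9.421e15   # Sum Sq for Category
ss_residuals <- 1.022e18   # Sum Sq for Residuals

# Eta squared = between-group sum of squares / total sum of squares
eta_squared <- ss_category / (ss_category + ss_residuals)
round(eta_squared, 4)      # 0.0091
```

On these figures, position category accounts for a little under 1% of the variance in value_euro. As a possible follow-up, TukeyHSD(anova_value) from base R's stats package could be used to see which pairs of categories differ.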
What does this mean for agents and players? During contract negotiations, agents and players can leverage these findings. Players in higher-valued categories may negotiate from a stronger position, advocating for better salaries based on the established differences in market value.
To build a linear regression model, we need to select a continuous or ordered integer column that could potentially influence the response variable (player value). One logical choice is player potential, as players with higher potential are often valued more in the market.
# Fit the linear regression model using player potential to predict player value
model <- lm(value_euro ~ potential, data = player_data)
# Summarize the linear regression model
summary(model)
##
## Call:
## lm(formula = value_euro ~ potential, data = player_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10321795 -2058679 -236341 1205391 95260240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37737985 332966 -113.3 <2e-16 ***
## potential 563593 4626 121.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4842000 on 29127 degrees of freedom
## (405 observations deleted due to missingness)
## Multiple R-squared: 0.3376, Adjusted R-squared: 0.3376
## F-statistic: 1.485e+04 on 1 and 29127 DF, p-value: < 2.2e-16
Intercept: -37,737,985. This is the predicted value_euro when potential is zero. A potential of zero is not a realistic rating, so the intercept serves to position the regression line rather than to describe any actual player. The p-value is <2e-16, indicating that this coefficient is statistically significant.
Multiple R-squared: 0.3376. Approximately 33.76% of the variance in player value is explained by potential.
Potential: 563,593. For each one-unit increase in potential, the model predicts an increase of approximately 563,593 euros in value_euro.
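To make the coefficients concrete, the following sketch plugs the estimates from the summary above into the fitted line and evaluates it at a couple of potential ratings. The helper predict_value is our own illustration, not part of the original analysis.

```r
# Coefficients copied from the model summary above
intercept <- -37737985
slope     <- 563593

# Hypothetical helper: predicted value_euro for a given potential
predict_value <- function(potential) intercept + slope * potential

predict_value(c(70, 80))                # 1713525 7349455
predict_value(80) - predict_value(70)   # 5635930, i.e. 10 * slope

# Potential at which the fitted line crosses zero
zero_crossing <- -intercept / slope
round(zero_crossing, 2)                 # 66.96
```

For potential ratings below roughly 67 the line predicts a negative market value, which suggests the straight-line fit is only a rough approximation at the lower end of the scale.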
# Create predictions
player_data <- player_data |>
mutate(predicted_value = predict(model, newdata = player_data))
# Plotting
ggplot(player_data, aes(x = potential, y = value_euro)) +
  geom_point(alpha = 0.5, color = "blue") + # Actual data points
  geom_line(aes(y = predicted_value), color = "red", linewidth = 1) + # Regression line (linewidth replaces the deprecated size)
  labs(title = "Linear Regression of Player Value vs. Player Potential",
       x = "Player Potential",
       y = "Player Value (in Euros)") +
  theme_minimal()
## Warning: Removed 405 rows containing missing values or values outside the scale range
## (`geom_point()`).
There is a positive linear relationship between player potential and player value; as potential increases, so does the expected value.
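The model indicates that potential is a statistically significant predictor of player value in this dataset (p < 2e-16), although on its own it explains only about a third of the variance (R-squared = 0.3376).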
What other factors (e.g., age, skill ratings, injury history) could improve the predictive power of the model?
How does the player’s position influence their value? Would incorporating position into the model yield better predictions?
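One way to start exploring the last question is to fit a second model that adds the position category and compare it with the potential-only model. The sketch below does this on simulated data, so the numbers it produces say nothing about real players. The data frame toy, the effect sizes used to simulate it, and the model names fit_base and fit_pos are all assumptions made for illustration.

```r
set.seed(42)  # reproducible toy data

# Simulated stand-in for player_data (hypothetical values, not real players)
n <- 400
toy <- data.frame(
  potential = round(runif(n, 50, 95)),
  Category  = factor(sample(
    c("Attackers", "Midfielders", "Defenders", "Goalkeepers"),
    n, replace = TRUE
  ))
)
toy$value_euro <- 5e5 * (toy$potential - 50) +
  ifelse(toy$Category == "Attackers", 2e6, 0) +
  rnorm(n, sd = 3e6)

# Baseline model and a model that also includes position category
fit_base <- lm(value_euro ~ potential, data = toy)
fit_pos  <- lm(value_euro ~ potential + Category, data = toy)

# Compare adjusted R-squared and run a nested-model F test
c(base = summary(fit_base)$adj.r.squared,
  with_position = summary(fit_pos)$adj.r.squared)
anova(fit_base, fit_pos)
```

The same two lm() calls could be run on player_data, where the F test from anova() would indicate whether Category adds explanatory power beyond potential.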