In this RMarkdown notebook, we’ll explore relationships between pairs of numbers in the data, focusing on how one might affect the other. We will then plot a relationship between the two pairs and then draw conclusions based on the observation. Finally, we will calculate the correlation co-efficient between the these combinations and build a confidence interval for each of the response variables.
This new variable shows the value per unit of rating, helping to explore how much market value each rating point represents.
# Create the 'value_per_rating' column
player_data$value_per_rating <- player_data$value_euro / player_data$overall_rating
head(player_data$value_per_rating)
## [1] 1175531.9 789772.7 829545.5 704545.5 681818.2 676136.4
In this pair, we treat market value (value_euro) as the response variable and the overall ratings the explanatory variable to investigate how a player’s rating impacts their market value.
This new variable shows the ratio of stamina to strength, indicating how balanced the player’s endurance is compared to their strength.
# Create the 'endurance_to_strength_ratio' column
player_data$endurance_to_strength_ratio <- player_data$stamina / player_data$strength
head(player_data$endurance_to_strength_ratio)
## [1] 1.0909091 1.5862069 1.0114943 1.7045455 0.7978723 0.8152174
This new variable represents the ratio between stamina and strength. It tells us how much endurance a player has in relation to their strength. A higher ratio indicates that a player has relatively more stamina compared to their physical strength, while a lower ratio suggests that their strength is more dominant over endurance. This is useful for determining the position a player can play in.
# Bin overall_rating and strength into intervals of 10 (10-20, 20-30, ..., 90-100)
player_data$overall_rating_bin <- cut(player_data$overall_rating,
breaks = seq(10, 100, by = 10),
include.lowest = TRUE,
right = FALSE) # Left-inclusive intervals
player_data$strength_bin <- cut(player_data$strength,
breaks = seq(10, 100, by = 10),
include.lowest = TRUE,
right = FALSE)
# Create a box plot for Value in Euros vs. Overall Rating
ggplot(player_data, aes(x = overall_rating_bin, y = value_euro)) +
geom_boxplot() +
labs(title = "Box Plot: Value in Euros vs. Overall Rating (Binned)",
x = "Overall Rating (Binned by 10s)",
y = "Value in Euros") +
theme_minimal()
## Warning: Removed 255 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
These are players with much higher market value than expected for their rating. Possible reasons for this maybe that they are young players with high potential driving up their value even though their overall rating is lower.
ggplot(player_data, aes(x = strength_bin, y = stamina)) +
geom_boxplot() +
labs(title = "Box Plot: Stamina vs. Strength (Binned)",
x = "Strength (Binned by 10s)",
y = "Stamina") +
theme_minimal()
There are few players with significantly high stamina compared to their peers in the same strength group but we mostly observe that there are a lot of players with lower stamina at the same strength range compared to their peers. This might be because they might be older players who have retained their strength but their stamina has dropped.
# Calculating the correlation for Pair 1: value_euro vs. overall_rating
cor_value_rating <- cor(player_data$value_euro, player_data$overall_rating, use = "complete.obs")
cat("Correlation between value_euro and overall_rating:", cor_value_rating, "\n")
## Correlation between value_euro and overall_rating: 0.6309281
This correlation is expected to be positive and relatively strong, since a higher overall rating generally corresponds to a higher market value for a player. This is logical because players with better skills and performances are more valuable in the transfer market.
# Calculating the correlation for Pair 2: stamina vs. strength
cor_stamina_strength <- cor(player_data$stamina, player_data$strength, use = "complete.obs")
cat("Correlation between stamina and strength:", cor_stamina_strength, "\n")
## Correlation between stamina and strength: 0.2671001
This correlation is expected to be positive, but possibly weaker than the first pair. Although strength and stamina are both physical attributes, they measure different aspects of a player’s fitness. Some players might have high stamina but moderate strength, or vice versa, depending on their position or training focus.
# 95% Confidence Interval for value_euro (Response Variable 1)
ci_value_euro <- t.test(player_data$value_euro)$conf.int
# 95% Confidence Interval for stamina (Response Variable 2)
ci_stamina <- t.test(player_data$stamina)$conf.int
# Display the results
cat("95% Confidence Interval for value_euro:", ci_value_euro, "\n")
## 95% Confidence Interval for value_euro: 2395491 2563069
cat("95% Confidence Interval for stamina:", ci_stamina, "\n")
## 95% Confidence Interval for stamina: 62.90099 63.36669
# Calculate the means and confidence intervals for both variables
mean_value_euro <- mean(player_data$value_euro, na.rm = TRUE)
ci_value_euro <- t.test(player_data$value_euro)$conf.int
mean_stamina <- mean(player_data$stamina, na.rm = TRUE)
ci_stamina <- t.test(player_data$stamina)$conf.int
# Density plot (line graph) for value_euro with Confidence Interval
ggplot(player_data, aes(x = value_euro)) +
geom_density(fill = "skyblue", alpha = 0.4, color = "blue") + # Density line (smoothed)
geom_vline(aes(xintercept = mean_value_euro), color = "red", linetype = "dashed", size = 1.2) + # Mean line
geom_vline(aes(xintercept = ci_value_euro[1]), color = "blue", linetype = "dotted", size = 1) + # Lower CI bound
geom_vline(aes(xintercept = ci_value_euro[2]), color = "blue", linetype = "dotted", size = 1) + # Upper CI bound
labs(title = "Density Plot for Market Value (Euro)",
x = "Market Value (Euro)", y = "Density") +
xlim(0, 1e7) + # Set the limit for the x-axis
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 1130 rows containing non-finite outside the scale range
## (`stat_density()`).
The density plot for market value exhibits a highly skewed distribution with a very pronounced peak near zero, suggesting that a significant number of players have a low market value. The tail extends into higher values, indicating that while most players are valued lower, there are a few with very high market values.
The dashed red line indicates the mean market value. The confidence interval for this variable should reflect a broad range, capturing the significant variability in player market values. Given the skewness, the mean might be influenced heavily by the few players with extremely high values.
# Density plot (line graph) for stamina with Confidence Interval
ggplot(player_data, aes(x = stamina)) +
geom_density(fill = "lightgreen", alpha = 0.4, color = "green") + # Density line (smoothed)
geom_vline(aes(xintercept = mean_stamina), color = "red", linetype = "dashed", size = 1.2) + # Mean line
geom_vline(aes(xintercept = ci_stamina[1]), color = "blue", linetype = "dotted", size = 1) + # Lower CI bound
geom_vline(aes(xintercept = ci_stamina[2]), color = "blue", linetype = "dotted", size = 1) + # Upper CI bound
labs(title = "Density Plot for Stamina",
x = "Stamina", y = "Density") +
theme_minimal()
The density plot for stamina appears to be more normally distributed, with a clear peak around 75. This suggests that a majority of players have stamina levels around this value, indicating a more uniform distribution compared to market values.
The mean stamina, also indicated by the dashed line, likely falls around the peak of the distribution. The confidence interval should be narrower compared to that of market value, reflecting the lesser variability in stamina levels among players.
END