In this project, we draw five random samples from our existing player data, each with replacement and half the size of the data set, to mimic how we might gather information from a larger population. By looking closely at these samples, we want to spot any differences or unusual findings among them, as well as identify common traits.
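# Load the packages used below
library(dplyr)    # bind_rows(), group_by(), summarise(), sample_frac()
library(ggplot2)  # box plots
library(scales)   # label_number()
# Set seed for reproducibility
set.seed(12345)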
# Total rows in the data set
n <- nrow(Fifa_Players_Data)
# Loop to create 5 separate data frames for the 5 samples
for (i in 1:5) {
  # Each sample is drawn with replacement and is half the size of the data set
  sample_data <- Fifa_Players_Data[sample(1:n, size = ceiling(0.5 * n), replace = TRUE), ]
  assign(paste0("sample", i), sample_data) # Dynamically assign to separate data frames
}
# Merge the samples so we can compare the differences between them
# Create a subsample_id column in each data frame
sample1$subsample_id <- 1
sample2$subsample_id <- 2
sample3$subsample_id <- 3
sample4$subsample_id <- 4
sample5$subsample_id <- 5
# Combine all samples into one data frame
combined_data <- bind_rows(sample1, sample2, sample3, sample4, sample5)
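The sampling, labelling and combining steps above can also be written without `assign()` by keeping the samples in a list. The following is only a minimal sketch in base R: `toy_players`, `samples` and `combined_toy` are hypothetical names, and the toy data frame stands in for `Fifa_Players_Data` so the example runs on its own.

```r
# Toy stand-in for Fifa_Players_Data (hypothetical values, for illustration only)
set.seed(12345)
toy_players <- data.frame(
  overall_rating = sample(50:90, 200, replace = TRUE),
  age            = sample(17:38, 200, replace = TRUE)
)

n <- nrow(toy_players)

# Draw 5 samples with replacement, each half the size of the data,
# and tag every sample with its index as it is created
samples <- lapply(1:5, function(i) {
  idx <- sample(seq_len(n), size = ceiling(0.5 * n), replace = TRUE)
  out <- toy_players[idx, ]
  out$subsample_id <- i
  out
})

# Stack the list into one data frame
combined_toy <- do.call(rbind, samples)
rownames(combined_toy) <- NULL
```

Keeping the samples in a list means the number of samples can be changed in one place, and no `sample1` to `sample5` objects need to be labelled by hand.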
# Summarize key statistics for each subsample: continuous columns
summary_stats_continuos <- combined_data |>
  group_by(subsample_id) |>
  summarise(
    avg_overall_rating = mean(overall_rating, na.rm = TRUE),
    avg_potential = mean(potential, na.rm = TRUE),
    avg_strength = mean(strength, na.rm = TRUE),
    avg_age = mean(age, na.rm = TRUE),
    avg_wage = mean(wage_euro, na.rm = TRUE),
    avg_height = mean(height_cm, na.rm = TRUE)
  )
# Summarize key statistics for each subsample: categorical columns
summary_stats_categorical <- combined_data |>
  group_by(subsample_id) |>
  summarise(
    # Most frequent level of each categorical column
    most_common_position = names(which.max(table(positions))),
    most_common_national_team = names(which.max(table(national_team))),
    most_common_body_type = names(which.max(table(body_type)))
  )
print(summary_stats_continuos)
## # A tibble: 5 × 7
## subsample_id avg_overall_rating avg_potential avg_strength avg_age avg_wage
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 66.3 71.5 65.1 25.5 9972.
## 2 2 66.1 71.4 65.2 25.5 9885.
## 3 3 66.3 71.5 65.0 25.5 10088.
## 4 4 66.3 71.4 65.3 25.7 9750.
## 5 5 66.2 71.4 65.2 25.5 9697.
## # ℹ 1 more variable: avg_height <dbl>
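Samples 1, 3, and 4 share the highest average overall rating (66.3), and samples 1 and 3 have the highest average potential (71.5), while sample 2 has the lowest average rating (66.1). Average age is 25.5 in every sample except sample 4 (25.7). Overall, the numbers are very similar across the samples: the averages differ by at most 0.2 rating points and by roughly 390 euros in wage, which is consistent with all five samples being drawn from the same population.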
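To put a number on how similar the subsamples are, the spread of each average can be computed directly. The sketch below is illustrative only: it re-enters the rounded means from the printed table rather than recomputing them from the data, and `summary_means`, `spread` and `relative_spread` are hypothetical names.

```r
# Subsample means copied from the printed summary table (rounded values)
summary_means <- data.frame(
  subsample_id       = 1:5,
  avg_overall_rating = c(66.3, 66.1, 66.3, 66.3, 66.2),
  avg_potential      = c(71.5, 71.4, 71.5, 71.4, 71.4),
  avg_age            = c(25.5, 25.5, 25.5, 25.7, 25.5),
  avg_wage           = c(9972, 9885, 10088, 9750, 9697)
)

# Range (max - min) of each statistic across the five subsamples
spread <- sapply(summary_means[-1], function(x) diff(range(x)))

# Range relative to the mean of the five values, in percent
relative_spread <- 100 * spread / colMeans(summary_means[-1])
round(relative_spread, 2)
```

On these rounded values, every statistic varies by less than 5% of its mean across the subsamples, with wage showing the largest relative spread.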
print(summary_stats_categorical)
## # A tibble: 5 × 4
## subsample_id most_common_position most_common_national…¹ most_common_body_type
## <dbl> <chr> <chr> <chr>
## 1 1 CB Czech Republic Normal
## 2 2 CB Scotland Normal
## 3 3 CB England Normal
## 4 4 CB Poland Normal
## 5 5 CB Poland Normal
## # ℹ abbreviated name: ¹most_common_national_team
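The most common body type in every sample is “Normal”, and the most common position is CB. The most common national team, by contrast, changes from sample to sample: Czech Republic, Scotland, and England each lead one sample, and Poland leads samples 4 and 5. This variation suggests that no single national team dominates the data and that the leading team is sensitive to which players happen to be drawn.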
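Reporting only the single most common category hides how close the runners-up are. As a hedged sketch, the counts behind the mode can be inspected as follows; the `national_team` vector here is hypothetical toy data standing in for one subsample's column, not values from the FIFA data.

```r
# Hypothetical toy vector standing in for one subsample's national_team column
national_team <- c("Poland", "England", "Poland", "England", "Scotland", NA, NA)

# Counts per team, largest first; table() drops NA values by default
counts <- sort(table(national_team), decreasing = TRUE)
head(counts, 3)

# which.max() returns the first maximum, so ties are broken by level order
names(which.max(table(national_team)))
```

In this toy example England and Poland are tied, and `which.max()` reports only the first of them. Looking at the top few counts makes such near-ties visible.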
# Visualizations to understand if there are any key differences in the data
# Box plot for Overall Rating by Subsample
ggplot(combined_data, aes(x = factor(subsample_id), y = overall_rating)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Overall Rating Distribution by Subsample",
       x = "Subsample ID",
       y = "Overall Rating") +
  theme_minimal()
# Box plot for Potential by Subsample
ggplot(combined_data, aes(x = factor(subsample_id), y = potential)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Potential Distribution by Subsample",
       x = "Subsample ID",
       y = "Potential") +
  theme_minimal()
Overall Rating and Potential: The distributions of overall rating and potential are fairly consistent across subsamples, with average ratings of about 66 and average potential of about 71. However, sample 1 has more outliers (i.e. players with higher potential), while samples 3 and 4 contain the outliers with the lowest potential among all samples. This may indicate that these samples happen to contain more players from nations where football is less developed, although with random samples from the same data such differences can also arise from sampling variation alone.
# Box plot for Wage in Euro by Subsample
ggplot(combined_data, aes(x = factor(subsample_id), y = wage_euro)) +
  geom_boxplot(fill = "lavender") +
  labs(title = "Wage Distribution by Subsample",
       x = "Subsample ID",
       y = "Wage (Euro)") +
  # Show wages in thousands; the original scale = 1e-1 divided the labels by 10 without saying so
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  theme_minimal()
## Warning: Removed 588 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Box plot for Value in Euro by Subsample
ggplot(combined_data, aes(x = factor(subsample_id), y = value_euro)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Player Value Distribution by Subsample",
       x = "Subsample ID",
       y = "Value (Euro)") +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = " M")) +
  theme_minimal()
## Warning: Removed 614 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Samples 1, 3, and 4 include the players with the highest wages and market values, which sets them apart from samples 2 and 5. Wages and values tend to be right-skewed, so a handful of top earners landing in a sample is enough to produce this gap. The difference in wages might also indicate a difference in the concentration of players from different markets.
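The warnings printed with the wage and value plots report rows removed as non-finite. Assuming those rows are missing values, a quick per-subsample count shows whether the missing data is spread evenly across the samples. This is only a sketch: `toy` is a hypothetical data frame standing in for `combined_data`, and its values are invented for illustration.

```r
# Hypothetical toy data standing in for combined_data
toy <- data.frame(
  subsample_id = rep(1:3, each = 4),
  wage_euro    = c(1000, NA, 3000, 2000, NA, NA, 5000, 1000, 2000, 4000, NA, 1000)
)

# Number of missing wages in each subsample
missing_by_sample <- tapply(is.na(toy$wage_euro), toy$subsample_id, sum)
missing_by_sample

# Total number of rows a box plot of wage_euro would have to drop
sum(missing_by_sample)
```

If one subsample had far more missing wages than the others, its box plot would be based on noticeably fewer players and should be compared with more caution.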
# Box plot for Age by Subsample
ggplot(combined_data, aes(x = factor(subsample_id), y = age)) +
  geom_boxplot(fill = "lightcoral") +
  labs(title = "Age Distribution by Subsample",
       x = "Subsample ID",
       y = "Age") +
  theme_minimal()
The ages remain relatively stable across all the samples with very few differences.
# Box plot for Height by Subsample
ggplot(combined_data, aes(x = factor(subsample_id), y = height_cm)) +
  geom_boxplot(fill = "lightyellow") +
  labs(title = "Height Distribution by Subsample",
       x = "Subsample ID",
       y = "Height (cm)") +
  theme_minimal()
Sample 1 can be considered an anomaly compared to samples 2, 3, 4, and 5 because it has a lower concentration of players in the 155-185 cm height range. In addition, sample 1 contains a player who is 206 cm tall, a height that does not appear in any of the other samples.
# Set seed for reproducibility
set.seed(12345)
# Number of simulations
n_simulations <- 1000
# Create an empty list to store results
simulation_results <- vector("list", n_simulations)
# Run Monte Carlo Simulation
for (i in 1:n_simulations) {
  # Sample with replacement from the data set
  sampled_data <- Fifa_Players_Data |>
    sample_frac(size = 0.5, replace = TRUE) # Use 50% size of the data
  # Calculate key statistics
  stats <- sampled_data |>
    summarise(
      avg_overall_rating = mean(overall_rating, na.rm = TRUE),
      avg_potential = mean(potential, na.rm = TRUE),
      avg_value_euro = mean(value_euro, na.rm = TRUE),
      avg_age = mean(age, na.rm = TRUE)
    )
  # Store the results
  simulation_results[[i]] <- stats
}
# Combine all results into a single dataframe
simulation_results_df <- do.call(rbind, simulation_results)
# View summary of simulation results
summary(simulation_results_df)
## avg_overall_rating avg_potential avg_value_euro avg_age
## Min. :66.01 Min. :71.27 Min. :2286023 Min. :25.38
## 1st Qu.:66.19 1st Qu.:71.38 1st Qu.:2439135 1st Qu.:25.53
## Median :66.24 Median :71.43 Median :2475996 Median :25.57
## Mean :66.24 Mean :71.43 Mean :2477560 Mean :25.57
## 3rd Qu.:66.29 3rd Qu.:71.47 3rd Qu.:2517947 3rd Qu.:25.60
## Max. :66.52 Max. :71.65 Max. :2663477 Max. :25.74
# Set seed for reproducibility
set.seed(12345)
# Number of simulations
n_simulations <- 3000
# Create an empty list to store results
simulation_results <- vector("list", n_simulations)
# Run Monte Carlo Simulation
for (i in 1:n_simulations) {
  # Sample with replacement from the data set
  sampled_data <- Fifa_Players_Data |>
    sample_frac(size = 0.5, replace = TRUE) # Use 50% size of the data
  # Calculate key statistics
  stats <- sampled_data |>
    summarise(
      avg_overall_rating = mean(overall_rating, na.rm = TRUE),
      avg_potential = mean(potential, na.rm = TRUE),
      avg_value_euro = mean(value_euro, na.rm = TRUE),
      avg_age = mean(age, na.rm = TRUE)
    )
  # Store the results
  simulation_results[[i]] <- stats
}
# Combine all results into a single dataframe
simulation_results_df <- do.call(rbind, simulation_results)
# View summary of simulation results
summary(simulation_results_df)
## avg_overall_rating avg_potential avg_value_euro avg_age
## Min. :65.98 Min. :71.21 Min. :2286023 Min. :25.38
## 1st Qu.:66.19 1st Qu.:71.39 1st Qu.:2440310 1st Qu.:25.53
## Median :66.24 Median :71.43 Median :2478967 Median :25.56
## Mean :66.24 Mean :71.43 Mean :2479733 Mean :25.57
## 3rd Qu.:66.29 3rd Qu.:71.47 3rd Qu.:2521244 3rd Qu.:25.60
## Max. :66.52 Max. :71.65 Max. :2693747 Max. :25.76
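Beyond the five-number summaries above, the simulated averages can be condensed into a standard error and a percentile interval. The sketch below is a minimal illustration under stated assumptions: `ratings` is hypothetical normally distributed data standing in for `Fifa_Players_Data$overall_rating`, and `sim_means` and `m` are names introduced only for this example.

```r
set.seed(12345)

# Hypothetical ratings standing in for Fifa_Players_Data$overall_rating
ratings <- rnorm(2000, mean = 66, sd = 7)
m <- ceiling(0.5 * length(ratings))  # each resample is half the size of the data

# Mean of 1000 resamples drawn with replacement
sim_means <- replicate(1000, mean(sample(ratings, size = m, replace = TRUE)))

# Spread of the simulated means and a 95% percentile interval
sd(sim_means)
quantile(sim_means, probs = c(0.025, 0.975))

# Theoretical standard error of a mean of m draws, for comparison
sd(ratings) / sqrt(m)
```

The standard deviation of the simulated means should land close to the theoretical standard error. Running more simulations estimates that spread more precisely but does not make it smaller, since it is governed by the size of each resample.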
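Because the FIFA data set we have chosen consists primarily of football players from Europe, the values across the different categories remain relatively stable, especially the physical attributes, since the players come from a similar demographic. The simulations show the same stability: increasing the number of simulations from 1000 to 3000 leaves the mean of the simulated averages unchanged to two decimal places (66.24 for overall rating, 71.43 for potential, and 25.57 for age).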
In future data analysis, it’s important to use different random samples to capture variability and recognize any outliers that could skew results. We should also consider the context, like player backgrounds or market conditions, to avoid drawing misleading conclusions. Using methods like Monte Carlo simulations can help us understand uncertainty and make sure our findings are reliable.