In this analysis, we explore FIFA player data by grouping it on categorical columns and summarizing continuous variables within each group.
We create three grouped data frames, each built on a different categorical column (international reputation, nationality, and position), then examine each one and draw conclusions from the summaries.
reputation_summary <- Fifa_Players_Data |>
group_by(`international_reputation(1-5)`) |>
summarise(average_overall_rating = mean(overall_rating, na.rm = TRUE)) |>
arrange(desc(average_overall_rating))
reputation_summary <- reputation_summary |>
mutate(tag = if_else(`international_reputation(1-5)` == 1, "Lowest Probability Group", "Other"))
print(reputation_summary)
## # A tibble: 5 × 3
## `international_reputation(1-5)` average_overall_rating tag
## <dbl> <dbl> <chr>
## 1 5 90.8 Other
## 2 4 86.1 Other
## 3 3 81.2 Other
## 4 2 75.5 Other
## 5 1 65.2 Lowest Probability Gro…
International Reputation vs. Average Rating: This visualization helps in understanding how international recognition correlates with overall player ratings.
ggplot(reputation_summary, aes(x = as.factor(`international_reputation(1-5)`), y = average_overall_rating, fill = tag)) +
geom_bar(stat = "identity") +
labs(title = "Average Overall Rating by International Reputation", x = "International Reputation (1-5)", y = "Average Overall Rating") +
theme_minimal()
Players with an international reputation of 1 have the lowest average overall rating (65.2), and the average rises steadily with each reputation level, up to 90.8 at level 5. Lower-reputation players are likely less visible in international matches and high-profile competitions.
Note that this summary reports only mean ratings, not group sizes, so it cannot show whether reputation 1 is rare in the data set. The "Lowest Probability Group" tag therefore marks the lowest-rated group; a per-level player count is needed before calling any reputation level rare.
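Group sizes can be reported alongside the mean rating. The following is a minimal sketch on a small made-up tibble (`toy_players` and its values are hypothetical stand-ins for `Fifa_Players_Data`); it adds a count and an empirical share per reputation level, which is what a rarity claim would rest on:

```r
library(dplyr)

# Hypothetical toy data standing in for Fifa_Players_Data
toy_players <- tibble(
  `international_reputation(1-5)` = c(1, 1, 1, 1, 2, 2, 3),
  overall_rating = c(60, 65, 70, 65, 75, 77, 82)
)

reputation_counts <- toy_players |>
  group_by(`international_reputation(1-5)`) |>
  summarise(
    average_overall_rating = mean(overall_rating, na.rm = TRUE),
    count = n()                          # players in each reputation level
  ) |>
  mutate(share = count / sum(count)) |>  # empirical probability of each level
  arrange(count)                         # rarest level first

print(reputation_counts)
```

Sorting by `count` rather than by the mean rating puts the rarest level first, so the rare group can be tagged from the data instead of being assumed.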
# Group by Nationality
group_nationality <- Fifa_Players_Data |>
group_by(nationality) |>
summarise(
avg_value = mean(value_euro, na.rm = TRUE),
avg_wage = mean(wage_euro, na.rm = TRUE),
count = n()
) |>
arrange(desc(count))
print(group_nationality)
## # A tibble: 160 × 4
## nationality avg_value avg_wage count
## <chr> <dbl> <dbl> <int>
## 1 England 1543341. 9967. 1658
## 2 Germany 2553841. 9656. 1199
## 3 Spain 4324014. 16063. 1070
## 4 France 3779140. 14077. 925
## 5 Argentina 3211847. 11831. 904
## 6 Brazil 4527726. 17149. 832
## 7 Italy 3281359. 14185. 655
## 8 Colombia 1738026. 5498. 624
## 9 Japan 844419. 3528. 466
## 10 Netherlands 2967664. 10111. 441
## # ℹ 150 more rows
The table shows that a few nationalities dominate the player pool: England (1,658 players), Germany (1,199), and Spain (1,070) have the largest representation. The chart below shows the top 10.
# Taking top 10 nationalities
top_10_nationalities <- group_nationality |>
arrange(desc(count)) |>
slice_head(n = 10) # Select the top 10 rows
# Visualization of player counts by nationality (top 10)
ggplot(top_10_nationalities, aes(x=reorder(nationality, -count), y=count)) +
geom_bar(stat="identity", fill="forestgreen") +
labs(title="Top 10 Nationalities by Player Count", x="Nationality", y="Number of Players") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
lowest_nationality_group <- group_nationality |>
filter(count == min(count)) |>
mutate(tag = "Rare Group")
print(lowest_nationality_group)
## # A tibble: 17 × 5
## nationality avg_value avg_wage count tag
## <chr> <dbl> <dbl> <int> <chr>
## 1 Andorra 290000 1000 1 Rare Group
## 2 Barbados 400000 3000 1 Rare Group
## 3 Ethiopia 200000 1000 1 Rare Group
## 4 Guam 550000 3000 1 Rare Group
## 5 Indonesia 180000 1000 1 Rare Group
## 6 Kuwait 1200000 13000 1 Rare Group
## 7 Malta 300000 2000 1 Rare Group
## 8 New Caledonia 1800000 9000 1 Rare Group
## 9 Nicaragua 300000 1000 1 Rare Group
## 10 Oman 425000 12000 1 Rare Group
## 11 Papua New Guinea 260000 1000 1 Rare Group
## 12 South Sudan 260000 2000 1 Rare Group
## 13 St Lucia 500000 2000 1 Rare Group
## 14 São Tomé & Príncipe 2800000 15000 1 Rare Group
## 15 United Arab Emirates 10500000 39000 1 Rare Group
## 16 Vietnam 425000 1000 1 Rare Group
## 17 Yemen 160000 3000 1 Rare Group
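Countries with less established professional football infrastructures produce fewer professional players, so players from these nations have a lower probability of appearing in the dataset. Population size alone does not explain this, since populous nations such as Indonesia, Vietnam, and Ethiopia also appear in the list with a single player each.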
# Group by Position
group_position <- Fifa_Players_Data |>
group_by(positions) |>
summarise(
avg_value = mean(value_euro, na.rm = TRUE),
avg_wage = mean(wage_euro, na.rm = TRUE),
count = n()
) |>
arrange(desc(count))
group_position
## # A tibble: 890 × 4
## positions avg_value avg_wage count
## <chr> <dbl> <dbl> <int>
## 1 CB 2450709. 10556. 2243
## 2 GK 1626964. 6722. 2065
## 3 ST 2803157. 11185. 1747
## 4 CM 1850425. 7560. 764
## 5 CDM,CM 2979493. 12730 709
## 6 LB 1757062. 8496. 672
## 7 CM,CDM 3272840 12936 632
## 8 RB 1636034. 8669. 605
## 9 CDM 1808669. 8055. 321
## 10 CB,RB 1539223. 7148. 268
## # ℹ 880 more rows
The bar chart below shows the number of players for the five most common and five rarest position entries. Goalkeepers are not rare: GK is the second most common entry, with 2,065 players.
# Five most common position entries
top_5_positions <- group_position |>
arrange(desc(count)) |>
slice_head(n = 5) |>
mutate(tag = "Common Group")
# Five rarest position entries (ties at the minimum count are broken by row order)
lowest_position <- group_position |>
arrange(count) |>
slice_head(n = 5) |>
mutate(tag = "Rare Group")
# Combine the five most common and five rarest entries
positions_to_plot <- bind_rows(top_5_positions, lowest_position)
# Plotting Positions by Player Count
ggplot(positions_to_plot, aes(x=reorder(positions, -count), y=count)) +
geom_bar(stat="identity", fill="darkorange") +
labs(title="Number of Players by Position", x="Position", y="Number of Players") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Conclusions and Observations

* Center-backs (CB), goalkeepers (GK), and strikers (ST) are the most common position entries, while the rarest entries are multi-position combinations held by a single player each.
* The high number of center-backs is consistent with teams typically fielding more defensive players than forwards.
print(lowest_position)
## # A tibble: 5 × 5
## positions avg_value avg_wage count tag
## <chr> <dbl> <dbl> <int> <chr>
## 1 CAM,CDM,CM,RM 775000 2000 1 Rare Group
## 2 CAM,CDM,RM 6500000 28000 1 Rare Group
## 3 CAM,CDM,RM,CM 4100000 6000 1 Rare Group
## 4 CAM,CF,CM,RM 1800000 8000 1 Rare Group
## 5 CAM,CF,LM 575000 1000 1 Rare Group
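These combinations are rare mainly because the positions column stores an ordered, comma-separated list, so every distinct ordering forms its own group; CDM,CM and CM,CDM, for example, are counted separately in the table above. Players who cover such a wide mix of attacking and defensive roles are also uncommon, which further reduces how often each exact combination appears in the data set.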
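To count individual positions instead of exact combinations, the comma-separated entries can be split into one row per position first. The sketch below uses `tidyr::separate_rows()` on a small made-up tibble (`toy_players` and its contents are hypothetical; only the `positions` column name comes from the data above):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; the real column is Fifa_Players_Data$positions
toy_players <- tibble(
  name = c("A", "B", "C", "D"),
  positions = c("CB", "CDM,CM", "CM,CDM", "ST,RW")
)

position_counts <- toy_players |>
  separate_rows(positions, sep = ",") |>  # one row per player-position pair
  count(positions, sort = TRUE)           # players listing each single position

print(position_counts)
```

With this layout, CDM,CM and CM,CDM both contribute to the same CDM and CM totals, so ordering no longer creates separate groups.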
We continue the analysis of the categorical variables by finding the position and nationality combinations that never occur in the data:
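# Build every position x nationality pair; stringsAsFactors = FALSE keeps both
# columns as character so they match the column types in Fifa_Players_Data
combinations <- expand.grid(
positions = unique(Fifa_Players_Data$positions),
nationality = unique(Fifa_Players_Data$nationality),
stringsAsFactors = FALSE
)
# Keep only the pairs that never occur in the data
missing_combinations <- anti_join(combinations, Fifa_Players_Data, by = c("positions", "nationality"))
# Drop compound positions (entries containing a comma) to keep the result set manageable
missing_combinations <- missing_combinations |>
filter(!str_detect(positions, ","))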
# Display the first 10 rows
first_page <- missing_combinations %>% slice(1:10)
kable(first_page, caption = "Missing Player Combinations (Page 1)")
positions | nationality |
---|---|
LWB | Argentina |
RWB | Argentina |
CDM | Denmark |
RW | Denmark |
CF | Denmark |
LWB | France |
RWB | France |
CF | France |
LWB | Italy |
RWB | Italy |
# Total number of rows
total_rows <- nrow(missing_combinations)
cat("Total rows:", total_rows, "\n")
## Total rows: 1534
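Certain positions might be less common in certain countries due to regional playing styles or strategies. Note, however, that a pair counts as missing here only when no player of that nationality has exactly that single position as the full positions entry; a player whose entry is a compound that includes LWB does not count toward LWB.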
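One way to check whether a position is truly absent for a nationality is to split compound entries before building the grid. The following sketch assumes a made-up tibble (`toy_players` is hypothetical) and uses `tidyr::crossing()`, which keeps character columns as characters, together with `anti_join()`:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one compound positions entry
toy_players <- tibble(
  nationality = c("Argentina", "Argentina", "Denmark"),
  positions   = c("CB", "LWB,LB", "CB")
)

# One row per (nationality, individual position) that actually occurs
observed <- toy_players |>
  separate_rows(positions, sep = ",") |>
  distinct(nationality, positions)

# Every possible pair of observed positions and nationalities
all_pairs <- crossing(
  positions   = unique(observed$positions),
  nationality = unique(observed$nationality)
)

# Pairs with no player at all, counting compound entries
missing_pairs <- anti_join(all_pairs, observed, by = c("positions", "nationality"))
print(missing_pairs)
```

In this toy example Argentina is not reported as missing LWB, because the compound entry LWB,LB is split before the comparison.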
combination_counts <- Fifa_Players_Data |>
count(positions, nationality) |>
arrange(desc(n)) # Sort by frequency in descending order
most_common <- head(combination_counts)
least_common <- tail(combination_counts)
print(most_common)
## # A tibble: 6 × 3
## positions nationality n
## <chr> <chr> <int>
## 1 ST England 214
## 2 CB England 213
## 3 GK England 179
## 4 GK Germany 162
## 5 CB Argentina 133
## 6 CB Germany 133
print(least_common)
## # A tibble: 6 × 3
## positions nationality n
## <chr> <chr> <int>
## 1 ST,RW,LW,CAM South Africa 1
## 2 ST,RW,LW,CF Brazil 1
## 3 ST,RW,RM Burkina Faso 1
## 4 ST,RW,RM Germany 1
## 5 ST,RW,RM Scotland 1
## 6 ST,RWB,RM Senegal 1
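# Take the 10 most and 5 least frequent combinations directly from
# combination_counts; head() and tail() above return only 6 rows each,
# and with_ties = FALSE keeps exactly the requested number of rows
top_most_common <- combination_counts |>
slice_max(n, n = 10, with_ties = FALSE) |>
mutate(group = "Most Common")
top_least_common <- combination_counts |>
slice_min(n, n = 5, with_ties = FALSE) |>
mutate(group = "Least Common")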
# Combine both groups into one data frame
combined_data <- bind_rows(top_most_common, top_least_common)
# Create a new color scale for the groups
color_scale <- scale_fill_manual(values = c("Most Common" = "blue", "Least Common" = "red"))
filtered_combination_counts <- combination_counts |>
filter(nationality %in% combined_data$nationality)
# Visualize the heat map
ggplot(combined_data, aes(x = positions, y = nationality, fill = group)) +
geom_tile() +
color_scale +
labs(title = "Heatmap of Top 10 Most Common and Top 5 Least Common Combinations",
x = "Position",
y = "Nationality",
fill = "Group") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
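## Observations and Conclusion

* England supplies the three most common combinations: strikers (214), center-backs (213), and goalkeepers (179). England is also the most represented nationality overall (1,658 players), so part of this reflects sample size and not only positional preference.
* Germany and Argentina follow, with goalkeepers (162) and center-backs (133) leading for Germany and center-backs (133) the most common Argentinian entry.
* The least common combinations are multi-position entries such as ST,RW,RM, each held by a single player. These reflect how specific the compound position labels are, so they should not be read as a shortage of versatile players.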
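The heatmap above colors tiles only by group label, so it does not show how large each combination is. A possible alternative, sketched here, maps the fill to the count itself. The six rows are copied from the most_common output above; the object names `toy_counts` and `count_heatmap` and the color choices are assumptions for illustration:

```r
library(dplyr)
library(ggplot2)

# The six most common combinations, copied from the printed output above
toy_counts <- tibble(
  positions   = c("ST", "CB", "GK", "GK", "CB", "CB"),
  nationality = c("England", "England", "England", "Germany", "Argentina", "Germany"),
  n           = c(214, 213, 179, 162, 133, 133)
)

# Fill encodes the number of players, so darker tiles mean larger groups
count_heatmap <- ggplot(toy_counts, aes(x = positions, y = nationality, fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(title = "Player Count by Position and Nationality",
       x = "Position",
       y = "Nationality",
       fill = "Players") +
  theme_minimal()

print(count_heatmap)
```

The same call should work on the full combination_counts table after filtering to the nationalities and single positions of interest.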