Introduction

In this analysis, we explore FIFA player data by grouping it on categorical columns and summarizing continuous variables within each group.

We create three grouped data frames, each built on a different categorical column (international reputation, nationality, and position), and summarize ratings, market values, wages, and player counts for each. We then examine each grouping and draw conclusions from it.

Group by International reputation and calculate average overall rating

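library(tidyverse)  # dplyr, ggplot2, stringr
library(knitr)      # kable()

# Average overall rating per reputation level, highest first
reputation_summary <- Fifa_Players_Data |>
  group_by(`international_reputation(1-5)`) |>
  summarise(average_overall_rating = mean(overall_rating, na.rm = TRUE)) |>
  arrange(desc(average_overall_rating))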

Tagging the lowest probability group

reputation_summary <- reputation_summary |>
  mutate(tag = if_else(`international_reputation(1-5)` == 1, "Lowest Probability Group", "Other"))
print(reputation_summary)

## # A tibble: 5 × 3
##   `international_reputation(1-5)` average_overall_rating tag                    
##                             <dbl>                  <dbl> <chr>                  
## 1                               5                   90.8 Other                  
## 2                               4                   86.1 Other                  
## 3                               3                   81.2 Other                  
## 4                               2                   75.5 Other                  
## 5                               1                   65.2 Lowest Probability Gro…
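The tag above is keyed to reputation level 1, which has the lowest average rating. Whether that level is also the least probable group depends on group sizes, which the summary does not include. The following is a minimal sketch of how counts and empirical probabilities could be added; toy_players and its values are hypothetical and are not taken from the FIFA data.

```r
library(dplyr)

# Hypothetical toy data; values are illustrative only
toy_players <- data.frame(
  reputation = c(1, 1, 1, 1, 2, 2, 3),
  overall_rating = c(60, 62, 64, 66, 75, 77, 82)
)

toy_summary <- toy_players |>
  group_by(reputation) |>
  summarise(
    average_overall_rating = mean(overall_rating, na.rm = TRUE),
    count = n()
  ) |>
  # Empirical probability of drawing a player from each group,
  # then tag the group(s) with the smallest count
  mutate(
    probability = count / sum(count),
    tag = if_else(count == min(count), "Lowest Probability Group", "Other")
  )

print(toy_summary)
```

In this toy example the tag lands on the smallest group rather than on the group with the lowest average rating, which are not necessarily the same.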

Visualization of average overall rating by international reputation

International Reputation vs. Average Rating: This visualization helps in understanding how international recognition correlates with overall player ratings.

ggplot(reputation_summary, aes(x = as.factor(`international_reputation(1-5)`), y = average_overall_rating, fill = tag)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Overall Rating by International Reputation", x = "International Reputation (1-5)", y = "Average Overall Rating") +
  theme_minimal()

Conclusions and Observations:

The group with an international reputation of 1 has the lowest average overall rating (65.2, compared with 90.8 for reputation 5), and the average rises steadily with each reputation level. This suggests that players with this reputation are generally less recognized and may not be as highly valued in the football community. They are likely less experienced in international matches or less visible in high-profile competitions.

Hypothesis:

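Players with the lowest international reputation (1) are less likely to be selected for high-profile teams or international competitions because of their perceived lower skill level or experience, which would explain their lower ratings. The summary above reports only average ratings, not group sizes, so whether this group is also the rarest in the data set remains to be checked against a count per reputation level.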

Group by Nationality

Grouping and Summary

# Group by Nationality
group_nationality <- Fifa_Players_Data |>
  group_by(nationality) |>
  summarise(
    avg_value = mean(value_euro, na.rm = TRUE),
    avg_wage = mean(wage_euro, na.rm = TRUE),
    count = n()
  ) |>
  arrange(desc(count))

print(group_nationality)

## # A tibble: 160 × 4
##    nationality avg_value avg_wage count
##    <chr>           <dbl>    <dbl> <int>
##  1 England      1543341.    9967.  1658
##  2 Germany      2553841.    9656.  1199
##  3 Spain        4324014.   16063.  1070
##  4 France       3779140.   14077.   925
##  5 Argentina    3211847.   11831.   904
##  6 Brazil       4527726.   17149.   832
##  7 Italy        3281359.   14185.   655
##  8 Colombia     1738026.    5498.   624
##  9 Japan         844419.    3528.   466
## 10 Netherlands  2967664.   10111.   441
## # ℹ 150 more rows

Visualisation

The chart shows that a few nationalities dominate the player pool. England, Germany, and Spain have the largest representation, followed by France, Argentina, and Brazil.

# Taking top 10 nationalities
top_10_nationalities <- group_nationality |>
  arrange(desc(count)) |>
  slice_head(n = 10)  # Select the top 10 rows


# Visualization of player counts by nationality (top 10)

ggplot(top_10_nationalities, aes(x=reorder(nationality, -count), y=count)) +
  geom_bar(stat="identity", fill="forestgreen") +
  labs(title="Top 10 Nationalities by Player Count", x="Nationality", y="Number of Players") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusions and Observations:

A few nationalities dominate the player pool. England has by far the largest representation (1,658 players), followed by Germany and Spain.
Less common nationalities are not shown in this chart; the remaining 150 nationalities each contribute fewer players than the top 10.

lowest_nationality_group <- group_nationality |>
  filter(count == min(count)) |>
  mutate(tag = "Rare Group")

print(lowest_nationality_group)

## # A tibble: 17 × 5
##    nationality          avg_value avg_wage count tag       
##    <chr>                    <dbl>    <dbl> <int> <chr>     
##  1 Andorra                 290000     1000     1 Rare Group
##  2 Barbados                400000     3000     1 Rare Group
##  3 Ethiopia                200000     1000     1 Rare Group
##  4 Guam                    550000     3000     1 Rare Group
##  5 Indonesia               180000     1000     1 Rare Group
##  6 Kuwait                 1200000    13000     1 Rare Group
##  7 Malta                   300000     2000     1 Rare Group
##  8 New Caledonia          1800000     9000     1 Rare Group
##  9 Nicaragua               300000     1000     1 Rare Group
## 10 Oman                    425000    12000     1 Rare Group
## 11 Papua New Guinea        260000     1000     1 Rare Group
## 12 South Sudan             260000     2000     1 Rare Group
## 13 St Lucia                500000     2000     1 Rare Group
## 14 São Tomé & Príncipe    2800000    15000     1 Rare Group
## 15 United Arab Emirates  10500000    39000     1 Rare Group
## 16 Vietnam                 425000     1000     1 Rare Group
## 17 Yemen                   160000     3000     1 Rare Group

Hypothesis

Countries with smaller populations or less established football infrastructures naturally produce fewer professional players. Players from these nations have a lower probability of appearing in the dataset.

Group by Position

Grouping and Summarizing

# Group by Position

group_position <- Fifa_Players_Data |>
  group_by(positions) |>
  summarise(
    avg_value = mean(value_euro, na.rm = TRUE),
    avg_wage = mean(wage_euro, na.rm = TRUE),
    count = n()
  ) |>
  arrange(desc(count))

group_position

## # A tibble: 890 × 4
##    positions avg_value avg_wage count
##    <chr>         <dbl>    <dbl> <int>
##  1 CB         2450709.   10556.  2243
##  2 GK         1626964.    6722.  2065
##  3 ST         2803157.   11185.  1747
##  4 CM         1850425.    7560.   764
##  5 CDM,CM     2979493.   12730    709
##  6 LB         1757062.    8496.   672
##  7 CM,CDM     3272840    12936    632
##  8 RB         1636034.    8669.   605
##  9 CDM        1808669.    8055.   321
## 10 CB,RB      1539223.    7148.   268
## # ℹ 880 more rows
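The 890 groups arise because positions stores each player's full comma-separated list, so CDM,CM and CM,CDM are counted as different groups. One possible way to count by individual position instead is to split the string into one row per position. This is a sketch on hypothetical toy data, assuming the tidyr package is available; it is not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; names and positions are illustrative only
toy_players <- data.frame(
  name = c("A", "B", "C", "D"),
  positions = c("CB", "CDM,CM", "CM,CDM", "GK"),
  stringsAsFactors = FALSE
)

position_counts <- toy_players |>
  # One row per player and individual position
  separate_rows(positions, sep = ",") |>
  count(positions, name = "count") |>
  arrange(desc(count), positions)

print(position_counts)
```

With this approach a player who can play several positions is counted once for each of them, so the counts no longer sum to the number of players.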

Visualisation

A bar chart showing the number of players for the five most common and the five rarest position values.

top_5_positions <- group_position |>
  arrange(desc(count)) |>
  slice_head(n = 5)

# Find the five rarest position values
lowest_position <- group_position |>
  arrange(count) |>
  slice_head(n = 5) |>
  mutate(tag = "Rare Group")

# Combine the top 5 and the 5 rarest position values
positions_to_plot <- bind_rows(top_5_positions, lowest_position)

# Plotting Positions by Player Count
ggplot(positions_to_plot, aes(x=reorder(positions, -count), y=count)) +
  geom_bar(stat="identity", fill="darkorange") +
  labs(title="Number of Players by Position", x="Position", y="Number of Players") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusions and Observations:

Centre-backs (CB), goalkeepers (GK), and strikers (ST) are the most common position values, each with more than 1,700 players. The rarest values are multi-position combinations such as CAM,CDM,RM, each held by a single player.
The high centre-back count is consistent with teams typically fielding more defensive than attacking players.

print(lowest_position)

## # A tibble: 5 × 5
##   positions     avg_value avg_wage count tag       
##   <chr>             <dbl>    <dbl> <int> <chr>     
## 1 CAM,CDM,CM,RM    775000     2000     1 Rare Group
## 2 CAM,CDM,RM      6500000    28000     1 Rare Group
## 3 CAM,CDM,RM,CM   4100000     6000     1 Rare Group
## 4 CAM,CF,CM,RM    1800000     8000     1 Rare Group
## 5 CAM,CF,LM        575000     1000     1 Rare Group

Hypothesis

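Players with these position combinations are very rare, each combination appearing only once in the data set. They may require a wider range of skills, blending both offense and defense, which makes it harder to find players who excel in all of these roles. Part of the rarity also comes from how positions is stored: the full comma-separated list is treated as one category, so the same positions listed in a different order form a separate group (for example, CDM,CM and CM,CDM).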

Analysis of Categorical Variables

We continue the analysis of two categorical variables, positions and nationality, with the following steps:

Build a data frame of all possible combinations of the two categorical variables.
Identify combinations that do not exist in the data.
Analyze the most and least common combinations.
Visualize the results.

Generate all combinations of positions and nationality

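# stringsAsFactors = FALSE keeps both columns as character, matching the data
combinations <- expand.grid(
  positions = unique(Fifa_Players_Data$positions),
  nationality = unique(Fifa_Players_Data$nationality),
  stringsAsFactors = FALSE
)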

Find Missing Combinations

# Pairs in the full grid that never occur in the data
missing_combinations <- anti_join(combinations, Fifa_Players_Data, by = c("positions", "nationality"))
# Keep single positions only; compound positions make the result set very large
missing_combinations <- missing_combinations |>
    filter(!str_detect(positions, ","))

# Display the first 10 rows
first_page <- missing_combinations %>% slice(1:10)
kable(first_page, caption = "Missing Player Combinations (Page 1)")

Missing Player Combinations (Page 1)
positions	nationality
LWB	Argentina
RWB	Argentina
CDM	Denmark
RW	Denmark
CF	Denmark
LWB	France
RWB	France
CF	France
LWB	Italy
RWB	Italy

# Total number of rows
total_rows <- nrow(missing_combinations)
cat("Total rows:", total_rows, "\n")

## Total rows: 1534

Hypotheses for missing positions in certain countries.

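Certain positions might be less common in certain countries due to regional playing styles or strategies. Sample size also plays a role: 17 nationalities are represented by a single player, so almost every position is necessarily missing for them.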
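An alternative to expand.grid() plus anti_join() is to count the observed pairs and fill in the unobserved ones with zero using tidyr::complete(). This is a sketch on hypothetical toy data, not the method used in this report.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; values are illustrative only
toy_players <- data.frame(
  positions = c("GK", "CB", "CB", "ST"),
  nationality = c("England", "England", "Spain", "Spain"),
  stringsAsFactors = FALSE
)

pair_counts <- toy_players |>
  count(positions, nationality) |>
  # Add every position x nationality pair; unobserved pairs get n = 0
  complete(positions, nationality, fill = list(n = 0L))

# Pairs that never occur in the data
missing_pairs <- pair_counts |>
  filter(n == 0)

print(missing_pairs)
```

Keeping the zero counts in the same table as the observed counts makes it easy to compare missing and present combinations in a single summary.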

Count occurrences of each combination

combination_counts <- Fifa_Players_Data |>
  count(positions, nationality) |>
  arrange(desc(n)) # Sort by frequency in descending order

Show the most and least common combinations

Most common

most_common <- head(combination_counts)
least_common <- tail(combination_counts)
print(most_common)

## # A tibble: 6 × 3
##   positions nationality     n
##   <chr>     <chr>       <int>
## 1 ST        England       214
## 2 CB        England       213
## 3 GK        England       179
## 4 GK        Germany       162
## 5 CB        Argentina     133
## 6 CB        Germany       133

Least common

print(least_common)

## # A tibble: 6 × 3
##   positions    nationality      n
##   <chr>        <chr>        <int>
## 1 ST,RW,LW,CAM South Africa     1
## 2 ST,RW,LW,CF  Brazil           1
## 3 ST,RW,RM     Burkina Faso     1
## 4 ST,RW,RM     Germany          1
## 5 ST,RW,RM     Scotland         1
## 6 ST,RWB,RM    Senegal          1
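Because many combinations occur exactly once, tail() returns whichever six tied rows happen to sort last rather than a uniquely least common set. The sketch below shows how dplyr's slice_min() makes the tie handling explicit; toy_counts is hypothetical.

```r
library(dplyr)

# Hypothetical toy counts with a three-way tie for the minimum
toy_counts <- data.frame(
  combo = c("a", "b", "c", "d", "e"),
  n = c(10L, 3L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# with_ties = TRUE (the default) keeps every row tied for the minimum
all_rarest <- toy_counts |>
  slice_min(order_by = n, n = 1)

# with_ties = FALSE keeps exactly the requested number of rows
one_rarest <- toy_counts |>
  slice_min(order_by = n, n = 1, with_ties = FALSE)

nrow(all_rarest)
nrow(one_rarest)
```

Reporting the number of tied rows alongside a sample of them gives a more accurate picture than showing six rows alone.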

Visualisation

Create a heatmap of the combinations

# Take the 10 most common and 5 least common combinations directly from
# combination_counts; with_ties = FALSE keeps the group sizes exact
# (many combinations are tied at n = 1, so the 5 shown are a sample of them)
top_most_common <- combination_counts |>
  slice_max(order_by = n, n = 10, with_ties = FALSE) |>
  mutate(group = "Most Common")

top_least_common <- combination_counts |>
  slice_min(order_by = n, n = 5, with_ties = FALSE) |>
  mutate(group = "Least Common")

# Combine both groups into one data frame
combined_data <- bind_rows(top_most_common, top_least_common)

# Manual fill colors for the two groups
color_scale <- scale_fill_manual(values = c("Most Common" = "blue", "Least Common" = "red"))

# Tile plot of the selected combinations, colored by group
ggplot(combined_data, aes(x = positions, y = nationality, fill = group)) +
  geom_tile() +
  color_scale +
  labs(title = "Heatmap of Top 10 Most Common and Top 5 Least Common Combinations", 
       x = "Position", 
       y = "Nationality", 
       fill = "Group") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

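Observations and Conclusion:

English players account for the three most common position and nationality combinations (ST, CB, and GK), which reflects England having the largest player count in the data set; English strikers are the single most common combination with 214 players.
Argentina is also well represented, with centre-backs (CB) as its most common position at 133 players.
The least common combinations are multi-position values such as ST,RW,RM, each held by a single player of a given nationality. These may reflect the unusual mix of attributes such players need, though they also follow from treating each full position list as its own category.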

Data Dive - Group By and Probabilities

Raghuveer Venkatesh

2024-09-18
