FIFA World Cup

Author

Cristian Mendez

FIFA World Cup 2022

Stadium Fans

Source: APNews (https://apnews.com/article/womens-soccer-soccer-sports-barcelona-champions-league-ec5d0e514972b0b2bb44a2675606d9b1)

Introduction

In this project, I investigate whether a football team’s international ranking is connected to its likelihood of winning a match. Rankings are often used to predict which team will perform best, but how well do they actually reflect match outcomes? I analyze a dataset of roughly 24,000 international soccer matches played from 1993 to 2022. The dataset makes it possible to examine how often home teams win and whether there are patterns across time or continents. The data was compiled from FIFA match archives and other public soccer statistics by Brenda Loznik.

The variables that will be used are: home_team, away_team, home_team_score, away_team_score, home_team_fifa_rank, away_team_fifa_rank, and home_team_result.

The question that I will be exploring with this dataset is “How do team rankings connect with win probability?”

I believe the data was likely collected through FIFA’s official match-statistics tracking, which includes both automated systems and manual data collection by official match analysts.

I chose this dataset because I am a huge football fan. I’ve loved football since I was in elementary school. I’m interested in understanding the tactical and statistical patterns that appear in high-level matches. The World Cup represents a unique opportunity to analyze how home-field advantage really affects the outcomes of matches.

Load Libraries and Dataset

library(tidyverse)
library(ggthemes)
library(plotly)
library(ggplot2)
library(webshot2)
setwd("~/Downloads")
matches <- read_csv("international_matches.csv")
head(matches)
# A tibble: 6 × 25
  date   home_team away_team    home_team_continent away_team_continent
  <chr>  <chr>     <chr>        <chr>               <chr>              
1 8/8/93 Bolivia   Uruguay      South America       South America      
2 8/8/93 Brazil    Mexico       South America       North America      
3 8/8/93 Ecuador   Venezuela    South America       South America      
4 8/8/93 Guinea    Sierra Leone Africa              Africa             
5 8/8/93 Paraguay  Argentina    South America       South America      
6 8/8/93 Peru      Colombia     South America       South America      
# ℹ 20 more variables: home_team_fifa_rank <dbl>, away_team_fifa_rank <dbl>,
#   home_team_total_fifa_points <dbl>, away_team_total_fifa_points <dbl>,
#   home_team_score <dbl>, away_team_score <dbl>, tournament <chr>, city <chr>,
#   country <chr>, neutral_location <lgl>, shoot_out <chr>,
#   home_team_result <chr>, home_team_goalkeeper_score <dbl>,
#   away_team_goalkeeper_score <dbl>, home_team_mean_defense_score <dbl>,
#   home_team_mean_offense_score <dbl>, home_team_mean_midfield_score <dbl>, …

Clean the Data

This piece of code removes the rows with missing values (NA) in those specific columns.
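The chunk that drops the missing values is not reproduced in this write-up. Below is a minimal sketch of what such a step could look like. It assumes tidyr’s drop_na() and assumes the rank and score columns are the ones being cleaned. The toy_matches data and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for `matches`; the values are made up.
toy_matches <- tibble(
  home_team_fifa_rank = c(5, NA, 20),
  away_team_fifa_rank = c(12, 8, 31),
  home_team_score     = c(2, 1, NA),
  away_team_score     = c(0, 1, 3)
)

# Keep only rows where every listed column is present (assumed column choice).
toy_clean <- toy_matches |>
  drop_na(home_team_fifa_rank, away_team_fifa_rank,
          home_team_score, away_team_score)

nrow(toy_clean)
```

Only the first toy row survives, because the other two each have a missing rank or score.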

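matches_clean <- matches |>
  # Label each match by its winner, based on the home team's recorded result
  mutate(match_winner = case_when(
    home_team_result == "Win" ~ "Home",
    home_team_result == "Lose" ~ "Away",  # home defeats are coded "Lose", as in the visualization code
    home_team_result == "Draw" ~ "Draw"
  ))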

This code creates a new column called “match_winner”, which records the winner of each match (Home or Away) or a draw.

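matches_clean3 <- matches_clean |>
  mutate(
    # TRUE when the home team has the better (numerically lower) FIFA rank
    higher_rank_home = home_team_fifa_rank < away_team_fifa_rank,
    higher_rank_team_won = case_when(
      higher_rank_home & home_team_result == "Win" ~ TRUE,
      home_team_fifa_rank > away_team_fifa_rank & home_team_result == "Lose" ~ TRUE,
      TRUE ~ FALSE  # draws, upsets, and equal ranks
    )
  )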

This code determines whether the higher-ranked team won. I created two new columns, “higher_rank_home” and “higher_rank_team_won”. The first checks whether the home team had a better rank than the away team, and the second checks whether the higher-ranked team won the match.
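A flag like higher_rank_team_won can be turned into a win share, which speaks more directly to the question of win probability. The sketch below uses made-up values and assumes the flag is FALSE (not missing) whenever the higher-ranked team did not win. The names toy_flags and toy_summary are hypothetical.

```r
library(dplyr)

# Hypothetical toy results; TRUE means the better-ranked team won.
toy_flags <- tibble(higher_rank_team_won = c(TRUE, TRUE, FALSE, TRUE, FALSE))

toy_summary <- toy_flags |>
  summarise(
    matches          = n(),
    higher_rank_wins = sum(higher_rank_team_won),
    win_share        = mean(higher_rank_team_won)  # proportion of TRUE values
  )

toy_summary
```

Taking the mean of a logical column gives the proportion of TRUE values, so win_share here is 3 out of 5 matches.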

matches_clean4 <- matches_clean |>
  summarise(
    # `&` counts only the matches where both conditions hold
    higher_rank_home_win = sum(home_team_fifa_rank < away_team_fifa_rank & home_team_score > away_team_score, na.rm = TRUE),
    higher_rank_away_win = sum(away_team_fifa_rank < home_team_fifa_rank & away_team_score > home_team_score, na.rm = TRUE)
  )

This chunk of code counts how many matches the home team won while ranked higher, and how many matches the away team won while ranked higher.
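Counting matches where a team was both ranked higher and won depends on how the two conditions are combined inside sum(). The small sketch below, on made-up vectors, shows the difference between joining them with `&` and passing them as separate arguments.

```r
# Hypothetical toy vectors: was the home team better ranked, and did it win?
home_better_ranked <- c(TRUE, TRUE, FALSE, FALSE)
home_won           <- c(TRUE, FALSE, TRUE, FALSE)

# `&` keeps only the matches where both conditions hold, so this is a count of matches.
both <- sum(home_better_ranked & home_won)

# Separate arguments add the TRUE counts of each vector together instead.
added <- sum(home_better_ranked, home_won)

c(both = both, added = added)
```

Here only one match satisfies both conditions, while the comma version returns 4 because it adds two TRUE values from each vector.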

Statistical Analyses

matches_model2 <- matches_clean  |>
  mutate(
    rank_difference = home_team_fifa_rank - away_team_fifa_rank,
    goal_difference = home_team_score - away_team_score
  )

model2 <- lm(goal_difference ~ rank_difference, data = matches_model2)
summary(model2)

Call:
lm(formula = goal_difference ~ rank_difference, data = matches_model2)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.0466  -1.1103  -0.0563   1.0541  27.7413 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.4759566  0.0122727   38.78   <2e-16 ***
rank_difference -0.0220855  0.0002313  -95.48   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.895 on 23919 degrees of freedom
Multiple R-squared:  0.276, Adjusted R-squared:  0.2759 
F-statistic:  9116 on 1 and 23919 DF,  p-value: < 2.2e-16

In the chunk of code above, I ran a linear regression to explore whether the difference in FIFA rankings between two teams can predict the goal difference in a match. I first created a new dataset called matches_model2 with two new columns, rank_difference and goal_difference. rank_difference is the home team’s rank minus the away team’s rank, and goal_difference is the home team’s score minus the away team’s score. Because a smaller rank number means a better team, a negative rank_difference means the home team is the better-ranked side. The intercept of about 0.476 tells us that when the two teams have equal FIFA rankings, the home team wins by around 0.48 goals on average. The slope of about -0.022 tells us that for every one place the home team’s rank gets worse (numerically higher) relative to the away team, the expected goal difference falls by about 0.022 goals, to the home team’s disadvantage. The R-squared of 0.276 means that rank difference explains roughly 28% of the variation in goal difference.
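To make the coefficients concrete, here is a worked example of the fitted line for one match-up. The coefficients are copied from the regression summary, and the match-up (home team ranked 10th, away team ranked 40th) is hypothetical.

```r
# Coefficients copied from the regression summary.
intercept <- 0.4759566
slope     <- -0.0220855

# Hypothetical match-up: home team ranked 10th, away team ranked 40th.
rank_difference <- 10 - 40

# Fitted line: goal_difference = intercept + slope * rank_difference
predicted_goal_difference <- intercept + slope * rank_difference
round(predicted_goal_difference, 2)
```

For this match-up the model predicts the home team to win by about 1.14 goals: 0.48 from playing at home plus about 0.66 from being ranked 30 places better.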

Visualization 1: How often higher ranked teams win

higher_rank <- matches_clean |>
  # A lower rank number is a better rank, so a negative difference favors the home team
  mutate(rank_difference = home_team_fifa_rank - away_team_fifa_rank,
         higher_rank_team = if_else(rank_difference < 0, "Home",
                                    if_else(rank_difference > 0, "Away", "Equal"))) |>
  # Matches that fit none of the three cases below are left as NA
  mutate(winner_type = case_when(
    home_team_result == "Win" & higher_rank_team == "Home" ~ "H ranked Home Won",
    home_team_result == "Lose" & higher_rank_team == "Away" ~ "H ranked Away Won",
    home_team_result == "Draw" ~ "Draw"
  )) |>
  count(winner_type)

The first part of this chunk computes rank_difference and records which team was better ranked in higher_rank_team. The second part creates a new column called winner_type, which categorizes each match by whether the higher-ranked home team won, the higher-ranked away team won, or the match ended in a draw. Finally, count() tallies the number of matches in each category.

ggplot(higher_rank, aes(x = reorder(winner_type, -n), y = n, fill = winner_type)) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_manual(values = c("#32a852", "#3146a3", "#a33196", "#a39f31")) +
  labs(
    title = "Match Results Based on FIFA Ranking",
    x = "Result Type",
    y = "Number of Matches",
    fill = "Result Type",
    caption = "Source: FIFA via Brenda Loznik"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 15, hjust = 1)) +  # tilt the category labels so they do not overlap
  annotate("text", x =1, y = max(higher_rank$n) * 0.95,
           label = "Most Games won by H ranked home teams",
           color = "black",
           size = 2)

For this bar plot, I wanted to show how often higher-ranked home and away teams won and how many matches ended in a draw. The NA bar collects the matches that fit none of those categories, such as wins by the lower-ranked team.
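Raw counts can be hard to compare across categories, so one option is to convert them to shares of all matches. The sketch below uses made-up counts. The "Other" label for the NA category and the names toy_counts and toy_shares are hypothetical choices for illustration.

```r
library(dplyr)

# Hypothetical toy counts in the same shape as the output of count(winner_type).
toy_counts <- tibble(
  winner_type = c("H ranked Home Won", "Draw", NA),
  n           = c(50, 30, 20)
)

toy_shares <- toy_counts |>
  mutate(
    winner_type = coalesce(winner_type, "Other"),  # give the NA category a readable label
    share       = n / sum(n)                       # fraction of all matches
  )

toy_shares
```

Using shares instead of counts would let the same bar chart be read as percentages of matches.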

Visualization 2: Top 15 national teams by home win rate

home_win_rates <- matches_clean |>
  group_by(home_team) |>
  summarise(
    home_games = n(),
    home_wins = sum(home_team_result == "Win"),
    win_rate = round(home_wins / home_games, 3)
  ) |>
  filter(home_games >=20) |>
  arrange(desc(win_rate)) |>
  slice(1:15)

This creates a dataset of the 15 national teams with the highest home win rates. For each team, it counts the total number of home games and the number of home wins, calculates the home win rate, and excludes teams with fewer than 20 home games.

plot2 <- ggplot(home_win_rates, aes(x = win_rate, y = reorder(home_team, win_rate))) +
  geom_point(aes(color = win_rate), size = 5) +
  scale_color_gradient(low = "#a33531", high = "#31a385") +
  labs(
    title = "Top 15 National Teams by Home Win Rate",
    x = "Home Win Rates",
    y = "Country",
    color = "Win Rate",
    caption = "Source: FIFA via Brenda Loznik"
  ) +
  theme_minimal()+
  annotate("text", x = 0.9, y = 5, label = "High home win rates", size = 2.5)
ggplotly(plot2)