In this project, I investigate whether a football team’s international ranking is connected to its likelihood of winning a match. Rankings are often used to predict which team will perform best, but how well do they actually reflect match outcomes? I analyze a dataset of international soccer matches with over 43,000 matches from 1872 to 2022. This dataset makes it possible to examine how often home teams win and whether there are patterns or trends across time or continents. The data was compiled by Brenda Loznik from FIFA match archives and other public soccer statistics.
The variables that will be used are: home_team, away_team, home_team_score, away_team_score, home_team_fifa_rank, and away_team_fifa_rank.
The question that I will be exploring with this dataset is “How do team rankings connect with win probability?”
I believe that the data was most likely collected through FIFA’s official match statistics tracking, which includes both automated systems and manual data collection by official match analysts.
I chose this dataset because I am a huge football fan. I’ve loved the sport since I was in elementary school. I’m interested in understanding the tactical and statistical patterns that appear in high-level matches. The World Cup represents a unique opportunity to analyze how home-field advantage really affects the outcomes of matches.
# A tibble: 6 × 25
date home_team away_team home_team_continent away_team_continent
<chr> <chr> <chr> <chr> <chr>
1 8/8/93 Bolivia Uruguay South America South America
2 8/8/93 Brazil Mexico South America North America
3 8/8/93 Ecuador Venezuela South America South America
4 8/8/93 Guinea Sierra Leone Africa Africa
5 8/8/93 Paraguay Argentina South America South America
6 8/8/93 Peru Colombia South America South America
# ℹ 20 more variables: home_team_fifa_rank <dbl>, away_team_fifa_rank <dbl>,
# home_team_total_fifa_points <dbl>, away_team_total_fifa_points <dbl>,
# home_team_score <dbl>, away_team_score <dbl>, tournament <chr>, city <chr>,
# country <chr>, neutral_location <lgl>, shoot_out <chr>,
# home_team_result <chr>, home_team_goalkeeper_score <dbl>,
# away_team_goalkeeper_score <dbl>, home_team_mean_defense_score <dbl>,
# home_team_mean_offense_score <dbl>, home_team_mean_midfield_score <dbl>, …
Clean the Data
This piece of code removes all the NAs in those specific columns.
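The cleaning chunk itself is not printed in this report, so the following is only a minimal sketch of how rows with missing values could be dropped in base R. The toy data frame `matches_toy`, its values, and the selection in `needed_cols` are assumptions made for illustration; only the column names come from the dataset.

```r
# Toy stand-in for the matches data (values invented for illustration).
matches_toy <- data.frame(
  home_team           = c("Bolivia", "Brazil", "Ecuador", "Guinea"),
  away_team           = c("Uruguay", "Mexico", "Venezuela", "Sierra Leone"),
  home_team_score     = c(3, 1, NA, 1),
  away_team_score     = c(1, 1, 0, 0),
  home_team_fifa_rank = c(59, 8, 35, NA),
  away_team_fifa_rank = c(22, 14, 94, 86)
)

# Columns that must be complete for the analysis (assumed selection).
needed_cols <- c("home_team_score", "away_team_score",
                 "home_team_fifa_rank", "away_team_fifa_rank")

# complete.cases() is TRUE only for rows with no NA in the given columns.
matches_clean <- matches_toy[complete.cases(matches_toy[, needed_cols]), ]
```

In this toy example the Ecuador and Guinea rows are dropped because each has one missing value in a needed column.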
# A tibble: 6 × 28
date home_team away_team home_team_continent away_team_continent
<chr> <chr> <chr> <chr> <chr>
1 8/8/93 Bolivia Uruguay South America South America
2 8/8/93 Brazil Mexico South America North America
3 8/8/93 Ecuador Venezuela South America South America
4 8/8/93 Guinea Sierra Leone Africa Africa
5 8/8/93 Paraguay Argentina South America South America
6 8/8/93 Peru Colombia South America South America
# ℹ 23 more variables: home_team_fifa_rank <dbl>, away_team_fifa_rank <dbl>,
# home_team_total_fifa_points <dbl>, away_team_total_fifa_points <dbl>,
# home_team_score <dbl>, away_team_score <dbl>, tournament <chr>, city <chr>,
# country <chr>, neutral_location <lgl>, shoot_out <chr>,
# home_team_result <chr>, home_team_goalkeeper_score <dbl>,
# away_team_goalkeeper_score <dbl>, home_team_mean_defense_score <dbl>,
# home_team_mean_offense_score <dbl>, home_team_mean_midfield_score <dbl>, …
This code helps determine whether higher ranked teams won. I created two new columns, “higher_ranked_home” and “higher_rank_team_won”. The first checks whether the home team had a better rank than the away team, and the second checks whether the higher ranked team won the match.
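As a rough illustration of how two such columns could be built, here is a base R sketch on invented data. It assumes that a smaller FIFA rank number means a better team, that draws count as "higher ranked team did not win", and that equal ranks are treated as the home team not being higher ranked; the report's own chunk may handle these cases differently.

```r
# Toy matches (invented values). A smaller FIFA rank number means a better team.
matches_toy <- data.frame(
  home_team_score     = c(3, 1, 0, 2),
  away_team_score     = c(1, 1, 2, 0),
  home_team_fifa_rank = c(59, 8, 35, 65),
  away_team_fifa_rank = c(22, 14, 94, 86)
)

# TRUE when the home team holds the better (numerically lower) rank.
matches_toy$higher_ranked_home <-
  matches_toy$home_team_fifa_rank < matches_toy$away_team_fifa_rank

home_won <- matches_toy$home_team_score > matches_toy$away_team_score
away_won <- matches_toy$home_team_score < matches_toy$away_team_score

# TRUE when the better-ranked side won; draws are FALSE here (an assumption).
matches_toy$higher_rank_team_won <-
  (matches_toy$higher_ranked_home & home_won) |
  (!matches_toy$higher_ranked_home & away_won)
```

Only the last toy match ends with the better-ranked team winning: the first is an upset by the home team, the second is a draw, and the third is an upset by the away team.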
Call:
lm(formula = goal_difference ~ rank_difference, data = matches_model2)
Residuals:
Min 1Q Median 3Q Max
-19.0466 -1.1103 -0.0563 1.0541 27.7413
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4759566 0.0122727 38.78 <2e-16 ***
rank_difference -0.0220855 0.0002313 -95.48 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.895 on 23919 degrees of freedom
Multiple R-squared: 0.276, Adjusted R-squared: 0.2759
F-statistic: 9116 on 1 and 23919 DF, p-value: < 2.2e-16
In the chunk of code above, I ran a linear regression to explore whether the difference in FIFA rankings between two teams can predict the difference in goals in a match. I first created a new dataset called matches_model2 with two new columns, rank_difference and goal_difference. rank_difference is the ranking gap between the home and away teams, and goal_difference is the goal difference in a match from the home team’s point of view. The model has an intercept of about 0.476, which means that when the two teams have equal FIFA rankings, the home team wins by around 0.48 goals on average. The slope of about -0.022 means that for every one-unit increase in rank_difference, the home team’s expected goal difference drops by about 0.022 goals. Since a larger rank number means a weaker team, the negative slope says that the worse the home team’s rank is relative to the away team’s, the smaller its expected margin. The R-squared of 0.276 shows that rank difference alone explains about 28% of the variation in goal difference.
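To make the fitted line concrete, the sketch below plugs a few rank differences into the coefficients reported in the regression output. The function name `predict_goal_difference` is hypothetical, and the example assumes rank_difference is the home team's rank minus the away team's rank, which is consistent with the negative slope but is not stated in the output itself.

```r
# Coefficients copied from the regression summary above.
intercept <- 0.4759566
slope     <- -0.0220855

# Expected goal difference (home minus away) for a given rank_difference,
# assumed here to be home rank minus away rank.
predict_goal_difference <- function(rank_difference) {
  intercept + slope * rank_difference
}

# Home team 20 places better, equal ranks, home team 20 places worse.
predict_goal_difference(c(-20, 0, 20))
```

Under that assumption, a home team ranked 20 places better is expected to win by roughly 0.92 goals, while a home team ranked 20 places worse is expected to come out about even (roughly 0.03 goals ahead).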
Visualization 1: How often higher ranked teams win
The first half of this chunk of code counts the match outcomes by whether the higher ranked team won. The second half creates a new column called winner_type, which categorizes each match outcome based on which team was better ranked.
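One possible way to produce a table with a winner_type column and a count column n is sketched below in base R. The category labels, the helper `label_winner`, the toy rows, and the "Win"/"Lose"/"Draw" coding of home_team_result are all assumptions; the report's own chunk may label and order things differently.

```r
# Toy per-match data (invented values).
matches_toy <- data.frame(
  higher_ranked_home = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  home_team_result   = c("Win", "Draw", "Lose", "Win", "Win", "Draw")
)

# Label each match by which side won and whether it was the better-ranked one.
label_winner <- function(higher_ranked_home, result) {
  ifelse(result == "Draw", "Draw",
  ifelse(higher_ranked_home & result == "Win", "Higher-ranked home team won",
  ifelse(!higher_ranked_home & result == "Lose", "Higher-ranked away team won",
         "Lower-ranked team won")))
}

matches_toy$winner_type <- label_winner(matches_toy$higher_ranked_home,
                                        matches_toy$home_team_result)

# One row per category with its match count, in columns winner_type and n.
higher_rank <- as.data.frame(table(winner_type = matches_toy$winner_type),
                             responseName = "n", stringsAsFactors = FALSE)
```

The resulting `higher_rank` data frame has one row per outcome category, which is the shape a bar chart of counts needs.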
library(ggplot2)

# Bar chart of match counts per result type, tallest bar first
ggplot(higher_rank, aes(x = reorder(winner_type, -n), y = n, fill = winner_type)) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_manual(values = c("#32a852", "#3146a3", "#a33196", "#a39f31")) +
  labs(
    title = "Match Results Based on FIFA Ranking",
    x = "Result Type",
    y = "Number of Matches",
    fill = "Result Type",
    caption = "Source: FIFA via Brenda Loznik"
  ) +
  theme_minimal() +
  # Tilt the category labels so that long names do not overlap
  theme(axis.text.x = element_text(angle = 15, hjust = 1))
For this bar plot, I wanted to show how often the higher ranked home or away team won and how many matches ended in a draw. The NA bar shows matches for which there was no result.
This creates a dataset of the top 15 most dominant national teams at home. It counts each team’s total number of home games and its number of home wins, calculates the team’s home win percentage, and excludes teams with very few games.
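The steps described above can be sketched in base R as follows. The toy data, the `min_games` cutoff, and the "Win" coding of home_team_result are assumptions for illustration; the report only says that teams with very few games are excluded, without giving the threshold.

```r
# Toy home results (invented values), one row per match.
matches_toy <- data.frame(
  home_team        = c("Brazil", "Brazil", "Brazil", "Spain", "Spain", "Peru"),
  home_team_result = c("Win", "Win", "Draw", "Win", "Lose", "Win")
)

# Total home games and home wins per team.
home_games <- tapply(matches_toy$home_team_result, matches_toy$home_team, length)
home_wins  <- tapply(matches_toy$home_team_result == "Win",
                     matches_toy$home_team, sum)

home_win_rates <- data.frame(
  home_team  = names(home_games),
  home_games = as.vector(home_games),
  home_wins  = as.vector(home_wins),
  win_rate   = as.vector(home_wins / home_games)
)

min_games <- 2   # assumed cutoff; the report only says "very few games"
top_n     <- 15

# Drop teams with too few games, then keep the best win rates.
home_win_rates <- home_win_rates[home_win_rates$home_games >= min_games, ]
home_win_rates <- home_win_rates[order(-home_win_rates$win_rate), ]
home_win_rates <- head(home_win_rates, top_n)
```

The cutoff matters: in the toy data Peru has a perfect record from a single game, and without `min_games` it would rank above Brazil and Spain.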
library(ggplot2)
library(plotly)

# Dot plot of home win rate for the teams in home_win_rates
plot2 <- ggplot(home_win_rates, aes(x = win_rate, y = reorder(home_team, win_rate))) +
  geom_point(aes(color = win_rate), size = 5) +
  scale_color_gradient(low = "#a33531", high = "#31a385") +
  labs(
    title = "Top 15 National Teams by Home-Win-Rate",
    x = "Home Win Rates",
    y = "Country",
    color = "Win Rate",
    caption = "Source: FIFA via Brenda Loznik"
  ) +
  theme_minimal()

# Convert the static plot to an interactive plotly chart and display it
plotly_plot2 <- ggplotly(plot2)
plotly_plot2
Summary Essay
This project helped me explore whether a national football team’s FIFA ranking is correlated with its probability of winning a match. The dataset I chose is called “FIFA World Cup 2022” and it includes over 43,000 matches played from 1872 to 2022. The variables that I chose to focus on for this project are home_team (categorical), away_team (categorical), home_team_score (quantitative), away_team_score (quantitative), home_team_fifa_rank (ordinal), and away_team_fifa_rank (ordinal). The dataset comes from Brenda Loznik, using FIFA archives. There was no ReadMe file provided for this dataset, but based on its structure, the data most likely comes from FIFA’s own statistical records. To clean up this dataset, I first created a column called “match_winner” to record whether the match ended in a draw, win, or loss for the home team. I then grouped and summarized the data to calculate win rates and identify trends between team rankings and match results.
I chose this dataset because I am a huge fan of the sport. I have been a fan of football for as long as I can remember. It is a way for my family and me to come together and play, watch, or just enjoy something we all love. Football is very close to my heart, and I wanted to see what I could find out from the data.
FIFA’s global ranking system is based on a point accumulation method known as the “SUM” method. Teams gain or lose points after each match depending on the result, how important the match was, and how strong the opposing team was (FIFA, 2024). This model is meant to offer insight into the strength of international teams. These factors help explain a pattern in the dataset: higher ranked teams don’t always win, especially when playing away.
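As a rough sketch of how such a points update works, the R functions below follow the formula in FIFA’s published ranking procedure as I understand it: the new total is the old total plus the match importance times the gap between the actual and the expected result. The function names and the example values are hypothetical, and special cases such as penalty shoot-outs and knockout-round losses are left out.

```r
# Expected result of a match for a team, given the points gap to its opponent.
expected_result <- function(points_team, points_opponent) {
  dr <- points_team - points_opponent
  1 / (10^(-dr / 600) + 1)
}

# New point total: P = P_before + I * (W - W_e)
#   importance: match weight I (larger for more important matches)
#   result:     W, 1 for a win, 0.5 for a draw, 0 for a loss
update_points <- function(points_before, points_opponent, result, importance) {
  points_before +
    importance * (result - expected_result(points_before, points_opponent))
}

# A win between two evenly matched teams with importance 10 adds 5 points.
update_points(1500, 1500, result = 1, importance = 10)
```

Because the expected result rises with the points gap, a strong team gains little from beating a weak opponent and loses a lot if it is upset, which is how opponent strength enters the ranking.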
For the first visualization, the bar graph shows how often the higher ranked team won its matches. The most common outcome was a win for the higher ranked team. What I found interesting was that the numbers of draws and losses were almost the same. For my second visualization, the dot plot shows the top 15 national teams by home win rate. What I found interesting about this plot was that Brazil and Spain are ahead of all the other countries by a large margin.
In conclusion, this project helped me better understand the relationship between a team’s rank and its performance.
Work Cited: FIFA. FIFA/Coca-Cola World Ranking – Men’s Ranking Procedure. FIFA, Aug. 2023, https://digitalhub.fifa.com/m/f99da4f73212220/original/edbm045h0udbwkqew35a-pdf.pdf.