This data set contains one row per match of the 2022 World Cup. For each match it gives the probability that each team wins or that the match ends in a tie, each team’s SPI (Soccer Power Index), and a projected score. The table also holds the actual score, shot-based expected goals (xG), non-shot expected goals (nsxG), and adjusted scores that account for things that happened during the game.
wc_matches <- read.csv("wc_matches.csv")
summary(wc_matches)
## date league_id league team1
## Length:64 Min. :1908 Length:64 Length:64
## Class :character 1st Qu.:1908 Class :character Class :character
## Mode :character Median :1908 Mode :character Mode :character
## Mean :1908
## 3rd Qu.:1908
## Max. :1908
## team2 spi1 spi2 prob1
## Length:64 Min. :48.16 Min. :48.46 Min. :0.0363
## Class :character 1st Qu.:68.75 1st Qu.:66.05 1st Qu.:0.2851
## Mode :character Median :78.72 Median :74.46 Median :0.4460
## Mean :77.32 Mean :74.30 Mean :0.4432
## 3rd Qu.:87.23 3rd Qu.:79.50 3rd Qu.:0.6070
## Max. :93.66 Max. :93.48 Max. :0.8261
## prob2 probtie proj_score1 proj_score2
## Min. :0.0595 Min. :0.0000 Min. :0.310 Min. :0.440
## 1st Qu.:0.2039 1st Qu.:0.1081 1st Qu.:0.985 1st Qu.:0.820
## Median :0.3121 Median :0.2575 Median :1.315 Median :1.055
## Mean :0.3583 Mean :0.1985 Mean :1.325 Mean :1.139
## 3rd Qu.:0.5047 3rd Qu.:0.2912 3rd Qu.:1.620 3rd Qu.:1.367
## Max. :0.8112 Max. :0.3371 Max. :2.600 Max. :2.550
## score1 score2 xg1 xg2
## Min. :0.000 Min. :0.000 Min. :0.070 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.600 1st Qu.:0.5075
## Median :1.000 Median :1.000 Median :0.885 Median :0.9400
## Mean :1.578 Mean :1.109 Mean :1.075 Mean :1.1089
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.430 3rd Qu.:1.4350
## Max. :7.000 Max. :4.000 Max. :3.100 Max. :4.4100
## nsxg1 nsxg2 adj_score1 adj_score2
## Min. :0.240 Min. :0.0900 Min. :0.000 Min. :0.000
## 1st Qu.:0.760 1st Qu.:0.6475 1st Qu.:0.000 1st Qu.:0.000
## Median :1.185 Median :0.9350 Median :1.050 Median :1.050
## Mean :1.194 Mean :1.1553 Mean :1.572 Mean :1.122
## 3rd Qu.:1.433 3rd Qu.:1.4625 3rd Qu.:2.100 3rd Qu.:2.100
## Max. :3.100 Max. :5.9000 Max. :6.220 Max. :3.720
I’m curious about South Korea in particular, so I’d like to follow their journey through the 2022 World Cup. I only care about what was predicted and what actually happened, so the expected-goals columns and the adjusted scores aren’t relevant to me.
# import libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(scales)
# keep South Korea's matches and choose the relevant columns
relevant_data <- wc_matches %>%
  filter(team1 == "South Korea" | team2 == "South Korea") %>%
  select(date, team1, team2, spi1, spi2, prob1, prob2, probtie,
         proj_score1, proj_score2, score1, score2)
# rename columns so they are readable for folks who don't know about soccer
relevant_data <- relevant_data %>%
  rename(
    soccer_power_index_team1 = spi1,
    soccer_power_index_team2 = spi2,
    win_probability_team1 = prob1,
    win_probability_team2 = prob2,
    tie_probability = probtie,
    projected_score_team1 = proj_score1,
    projected_score_team2 = proj_score2,
    actual_score_team1 = score1,
    actual_score_team2 = score2
  )
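# format the probability columns as percentages to make them easier to read
# (select them by name so the step does not depend on column positions)
prob_cols <- c("win_probability_team1", "win_probability_team2", "tie_probability")
relevant_data[prob_cols] <- lapply(relevant_data[prob_cols], percent, accuracy = 0.01)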
relevant_data
We can see that South Korea had a higher SPI than only one of its four opponents, and three of its four matches went against the forecast. Their first game ended in a tie even though the other team was favored to win. They lost their second game despite having the higher SPI and the higher win probability. They won their third match despite a win probability of only 16.66%. Finally, they lost to Brazil in their last game, as predicted.
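One way to make this comparison explicit is to label each match with the forecast’s most likely outcome and with the actual result. The block below is only a minimal sketch, not part of the analysis above: the helper `label_outcomes` and the columns it adds (`predicted_outcome`, `actual_outcome`, `prediction_correct`) are names I made up for illustration, and it assumes the raw column names from `wc_matches.csv` rather than the renamed ones.

```r
library(dplyr)

# Hypothetical helper: label each match with the most likely forecast outcome
# and the actual result, using the raw columns prob1, prob2, probtie,
# score1 and score2.
label_outcomes <- function(matches) {
  matches %>%
    mutate(
      # the outcome with the largest forecast probability
      # (exact ties go to team1 first, then team2)
      predicted_outcome = case_when(
        prob1 >= prob2 & prob1 >= probtie ~ "team1 win",
        prob2 >= probtie ~ "team2 win",
        TRUE ~ "tie"
      ),
      # the outcome implied by the recorded score
      actual_outcome = case_when(
        score1 > score2 ~ "team1 win",
        score2 > score1 ~ "team2 win",
        TRUE ~ "tie"
      ),
      prediction_correct = predicted_outcome == actual_outcome
    )
}

# Apply it to South Korea's matches if the data file is available
if (file.exists("wc_matches.csv")) {
  wc_matches <- read.csv("wc_matches.csv")
  south_korea <- wc_matches %>%
    filter(team1 == "South Korea" | team2 == "South Korea") %>%
    label_outcomes() %>%
    select(date, team1, team2, predicted_outcome, actual_outcome, prediction_correct)
  print(south_korea)
}
```

Working from the raw columns matters here because the probability columns in `relevant_data` have already been converted to percentage strings, which can no longer be compared numerically.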
Since this covers only the 2022 matches, I’d like to expand the data set to include other years and measure how accurate these predictions are overall. For South Korea, the most likely outcome was the actual result in only one of the four games, the last one. Is South Korea an outlier here, or did other teams also see unlikely wins, losses, and ties?
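A first step toward answering that question could be a simple hit rate: the share of matches where the outcome with the highest forecast probability was the one that actually happened. The block below is a rough sketch under my own assumptions. `hit_rate` is a hypothetical helper, not something from the data source, and it treats any match recorded with level scores as a tie.

```r
# Hypothetical helper: share of matches whose most likely forecast outcome
# matched the recorded result. Expects the raw columns prob1, prob2,
# probtie, score1 and score2.
hit_rate <- function(matches) {
  probs <- as.matrix(matches[, c("prob1", "prob2", "probtie")])
  # index of the largest probability in each row, first column wins ties
  predicted <- c("team1", "team2", "tie")[max.col(probs, ties.method = "first")]
  actual <- ifelse(matches$score1 > matches$score2, "team1",
                   ifelse(matches$score2 > matches$score1, "team2", "tie"))
  mean(predicted == actual)
}

# Compare South Korea's matches with the whole tournament, if the file exists
if (file.exists("wc_matches.csv")) {
  wc_matches <- read.csv("wc_matches.csv")
  korea <- subset(wc_matches, team1 == "South Korea" | team2 == "South Korea")
  cat("All matches:", round(hit_rate(wc_matches), 3), "\n")
  cat("South Korea:", round(hit_rate(korea), 3), "\n")
}
```

Two caveats apply. The minimum `probtie` of 0 in the summary suggests that knockout matches carry no tie probability, so if a knockout match decided on penalties is stored with level scores it will always count as a miss. A hit rate also ignores how confident each forecast was, so it is only a coarse measure of accuracy.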