Introduction

This document looks at FiveThirtyEight’s Women’s World Cup Predictions from 2019. This data was sourced from here, and is discussed in this article.

Libraries

First we import the needed libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(devtools)
## Loading required package: usethis
library(RCurl)
library(ggplot2)

Getting the Data

Then we import the data, reading a CSV from Github.

x <- getURL("http://raw.githubusercontent.com/cmm6/data607-assignment1/main/wwc_forecasts.csv",.opts=curlOptions(followlocation = TRUE)) 
y <- read.csv(text = x)

Exploring and Cleaning Up the Data

The first column pulls in special characters, so we need to rename it for readability. We also want to rename a couple unclear columns to be more straightforward, and get rid of a couple columns we will not be focusing on.

# Explore the dataset
head(y)
##     ï..forecast_timestamp        team group      spi global_o global_d sim_wins
## 1 2019-07-07 16:55:04 UTC         USA     F 98.60122  5.48708  0.48679        3
## 2 2019-07-07 16:55:04 UTC Netherlands     E 92.58631  3.75056  0.69163        3
## 3 2019-07-07 16:55:04 UTC     England     D 93.60390  3.78397  0.61037        3
## 4 2019-07-07 16:55:04 UTC      Sweden     F 89.54022  3.09929  0.61151        2
## 5 2019-07-07 16:55:04 UTC      France     A 96.87964  4.45211  0.48766        3
## 6 2019-07-07 16:55:04 UTC     Germany     B 94.41832  4.23550  0.72900        3
##   sim_ties sim_losses sim_goal_diff goals_scored goals_against group_1 group_2
## 1        0          0            18           18             0       1       0
## 2        0          0             4            6             2       1       0
## 3        0          0             4            5             1       1       0
## 4        0          1             4            7             3       0       1
## 5        0          0             6            7             1       1       0
## 6        0          0             6            6             0       1       0
##   group_3 group_4 make_round_of_16 make_quarters make_semis make_final
## 1       0       0                1             1          1          1
## 2       0       0                1             1          1          1
## 3       0       0                1             1          1          0
## 4       0       0                1             1          1          0
## 5       0       0                1             1          0          0
## 6       0       0                1             1          0          0
##   win_league               timestamp
## 1          1 2019-07-07 16:55:43 UTC
## 2          0 2019-07-07 16:55:43 UTC
## 3          0 2019-07-07 16:55:43 UTC
## 4          0 2019-07-07 16:55:43 UTC
## 5          0 2019-07-07 16:55:43 UTC
## 6          0 2019-07-07 16:55:43 UTC
# Rename the first column, without having to copy the special characters directly
colnames(y)[1] <- "forecast_timestamp"

# Confirm the change
names(y)
##  [1] "forecast_timestamp" "team"               "group"             
##  [4] "spi"                "global_o"           "global_d"          
##  [7] "sim_wins"           "sim_ties"           "sim_losses"        
## [10] "sim_goal_diff"      "goals_scored"       "goals_against"     
## [13] "group_1"            "group_2"            "group_3"           
## [16] "group_4"            "make_round_of_16"   "make_quarters"     
## [19] "make_semis"         "make_final"         "win_league"        
## [22] "timestamp"
# Update global_o and global_d to more descriptive names
y <- y %>%
  rename(offense_rating = global_o)
y <- y %>%
  rename(defense_rating = global_d)
# Confirm the change
names(y)
##  [1] "forecast_timestamp" "team"               "group"             
##  [4] "spi"                "offense_rating"     "defense_rating"    
##  [7] "sim_wins"           "sim_ties"           "sim_losses"        
## [10] "sim_goal_diff"      "goals_scored"       "goals_against"     
## [13] "group_1"            "group_2"            "group_3"           
## [16] "group_4"            "make_round_of_16"   "make_quarters"     
## [19] "make_semis"         "make_final"         "win_league"        
## [22] "timestamp"

Investigate the data

I’d like to understand the ratio of offense_rating to defense_rating, and where most teams fall, and compare the ratio to the winners in their group.

y <- y %>%
  mutate(offense_defense_ratio = offense_rating / defense_rating)

hist(y$offense_defense_ratio, 
     main="Histogram of All Teams' Offensive/Defensive Score Ratio", 
     xlab="Offensive Defensive Ratio")

group_winners <- y %>% 
  filter(group_1 == 1) %>% 
  select(offense_rating,
         defense_rating,
         offense_defense_ratio,
         team,
         spi,
         make_final)

hist(group_winners$offense_defense_ratio, 
     main="Histogram of Group Winners' Offensive/Defensive Score Ratio",
     xlab="Offensive Defensive Ratio")

I’d like to plot this ratio against the SPI score for all and group winners. Per the FiveThirtyEight article, this is essentially the team quality as determined by FiveThirty Eight, and we can layer with the likelihood to make the final. I also wanted to play around with ggplot.

# playing with  ggplot inputs from textbook
scatterplot <- ggplot(y, aes(x=spi, y=offense_defense_ratio)) + geom_point(aes(color=make_final))

scatterplot + ggtitle("SPI and Offensive/Defensive Ratio - All") + xlab("SPI") + ylab("Offensive Defensive Ratio")

scatterplot <- ggplot(group_winners, aes(x=spi, y=offense_defense_ratio)) + geom_point(aes(color=make_final))

scatterplot + ggtitle("SPI and Offensive/Defensive Ratio - Group Winners") + xlab("SPI") + ylab("Offensive Defensive Ratio")

Conclusions

This quick analysis would seem to suggest some relationship between stronger-than-defense offense and success, though it is less clear as early losers are weeded out. As a next step, one could incorporate other World Cup predictions to get more data from group winners. I would also be curious how this compares to Men’s predictions.