TidyVerse_Assignment-Women World Cup 2019 Prediction

Assignment

For this assignment I am using Women World Cup 2019 Prediction data set that I retrieved from Fivethirtyeight site to create a programming sample “vignette” that demonstrate how to use few of the main capabilities of the selected tidyverse package.

Introduction

This dataset shows the chance that each team will win, lose or tie every one of its matches. One important thing that this dataset include is the SPI rating, Soccer Power Index, how we turn those ratings into a forecast and how we calculate our in-game win probabilities.
To generate our SPI ratings, we run through every past match in our database of women’s international matches — back to 1971 — evaluating the performance of both teams with four metrics:

The number of goals they scored.
The number of goals they scored, adjusted to account for red cards and the time and score of the match when each goal was scored.
The number of goals they were expected to score given the shots they took.
The number of goals they were expected to score given the nonshooting actions they took near the opposing team’s goal.

Read .csv file

women_worldcup <- read_csv("https://projects.fivethirtyeight.com/soccer-api/international/2019/wwc_matches.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   date = col_date(format = ""),
##   league = col_character(),
##   team1 = col_character(),
##   team2 = col_character()
## )

## See spec(...) for full column specifications.

head(women_worldcup)

## # A tibble: 6 x 20
##   date       league_id league team1 team2  spi1  spi2 prob1  prob2 probtie
##   <date>         <dbl> <chr>  <chr> <chr> <dbl> <dbl> <dbl>  <dbl>   <dbl>
## 1 2019-06-07       352 FIFA ~ Fran~ Sout~  95.3  80.8 0.694 0.114    0.192
## 2 2019-06-08       352 FIFA ~ Germ~ Chin~  94.2  82.1 0.599 0.176    0.224
## 3 2019-06-08       352 FIFA ~ Spai~ Sout~  83.6  61.5 0.592 0.130    0.278
## 4 2019-06-08       352 FIFA ~ Norw~ Nige~  82.9  70.9 0.520 0.218    0.263
## 5 2019-06-09       352 FIFA ~ Aust~ Ital~  93.1  71.1 0.742 0.114    0.143
## 6 2019-06-09       352 FIFA ~ Braz~ Jama~  89.1  59.8 0.803 0.0803   0.117
## # ... with 10 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## #   score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>,
## #   nsxg2 <dbl>, adj_score1 <dbl>, adj_score2 <dbl>

Data transformation with dplyr

In this section I will be using slice(), filter(), summerise(), count() functions that can be found under dplyr.

Manipulate Cases

Using slice option we can select the row numbers we want to display

slice(women_worldcup, 10:20)

## # A tibble: 11 x 20
##    date       league_id league team1 team2  spi1  spi2  prob1  prob2
##    <date>         <dbl> <chr>  <chr> <chr> <dbl> <dbl>  <dbl>  <dbl>
##  1 2019-06-11       352 FIFA ~ New ~ Neth~  83.0  92.7 0.2    0.582 
##  2 2019-06-11       352 FIFA ~ Chil~ Swed~  49.0  85.9 0.0373 0.822 
##  3 2019-06-11       352 FIFA ~ USA ~ Thai~  97.2  50.4 0.951  0.0117
##  4 2019-06-12       352 FIFA ~ Nige~ Sout~  69.3  78.3 0.277  0.455 
##  5 2019-06-12       352 FIFA ~ Germ~ Spai~  93.9  86.5 0.516  0.240 
##  6 2019-06-12       352 FIFA ~ Fran~ Norw~  96.1  84.1 0.649  0.136 
##  7 2019-06-13       352 FIFA ~ Aust~ Braz~  92.4  89.7 0.429  0.343 
##  8 2019-06-13       352 FIFA ~ Sout~ Chin~  58.2  81.9 0.143  0.593 
##  9 2019-06-14       352 FIFA ~ Japa~ Scot~  90.9  52.4 0.833  0.0542
## 10 2019-06-14       352 FIFA ~ Jama~ Ital~  58.1  72.5 0.242  0.574 
## 11 2019-06-14       352 FIFA ~ Engl~ Arge~  92.1  39.1 0.916  0.0136
## # ... with 11 more variables: probtie <dbl>, proj_score1 <dbl>,
## #   proj_score2 <dbl>, score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>,
## #   nsxg1 <dbl>, nsxg2 <dbl>, adj_score1 <dbl>, adj_score2 <dbl>

filter function to extract rows that meet logical criteria

filter(women_worldcup, women_worldcup$score1 > "3")

## # A tibble: 2 x 20
##   date       league_id league team1 team2  spi1  spi2 prob1  prob2 probtie
##   <date>         <dbl> <chr>  <chr> <chr> <dbl> <dbl> <dbl>  <dbl>   <dbl>
## 1 2019-06-07       352 FIFA ~ Fran~ Sout~  95.3  80.8 0.694 0.114    0.192
## 2 2019-06-16       352 FIFA ~ Swed~ Thai~  86.5  43.5 0.866 0.0254   0.108
## # ... with 10 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## #   score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>,
## #   nsxg2 <dbl>, adj_score1 <dbl>, adj_score2 <dbl>

Summarize Cases

apply summarize function to columns to create new table of summary statistics

women_worldcup %>% summarise(Avg_proj_score1 = mean(proj_score1), Median_proj_score1 = median(proj_score1), Avg_score1 = mean(score1), Median_score1 = median(score1))

## # A tibble: 1 x 4
##   Avg_proj_score1 Median_proj_score1 Avg_score1 Median_score1
##             <dbl>              <dbl>      <dbl>         <dbl>
## 1            1.47               1.27       1.65             1

count number of rows each group defined by the variable

#Team 1
count(women_worldcup, women_worldcup$team1, sort = TRUE)

## # A tibble: 24 x 2
##    `women_worldcup$team1`     n
##    <chr>                  <int>
##  1 England Women              5
##  2 France Women               4
##  3 Germany Women              4
##  4 Netherlands Women          4
##  5 Italy Women                3
##  6 Norway Women               3
##  7 Sweden Women               3
##  8 USA Women                  3
##  9 Australia Women            2
## 10 Canada Women               2
## # ... with 14 more rows

#Team 2
count(women_worldcup, women_worldcup$team2, sort = TRUE)

## # A tibble: 24 x 2
##    `women_worldcup$team2`     n
##    <chr>                  <int>
##  1 Sweden Women               4
##  2 USA Women                  4
##  3 Brazil Women               3
##  4 Cameroon Women             3
##  5 China Women                3
##  6 Netherlands Women          3
##  7 Argentina Women            2
##  8 Australia Women            2
##  9 Canada Women               2
## 10 Chile Women                2
## # ... with 14 more rows

Data Wrangling with tidyr

Under tidyr, There are four fundamental functions of data tyding.
1. gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer 2. spread() takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider 3. separate() splits a single column into multiple columns 4. unite() combines multiple columns into a single column

Reshaping data

Creating a new data set with selected columns. Then gather some columns.

#Eliminating few columns from the original data set and creating a new data set. 
wwc_19 <- select(women_worldcup, -c(date, league_id, league, xg1, xg2, nsxg1, nsxg2, prob1, prob2, probtie))

df <- wwc_19 %>%
  gather(Teams, team_names, team1:team2)
df

## # A tibble: 104 x 10
##     spi1  spi2 proj_score1 proj_score2 score1 score2 adj_score1 adj_score2
##    <dbl> <dbl>       <dbl>       <dbl>  <dbl>  <dbl>      <dbl>      <dbl>
##  1  95.3  80.8        1.9         0.61      4      0       3.83       0   
##  2  94.2  82.1        1.67        0.78      1      0       1.05       0   
##  3  83.6  61.5        1.27        0.43      3      1       2.34       1.05
##  4  82.9  70.9        1.35        0.76      3      0       3.15       0   
##  5  93.1  71.1        2.63        0.95      1      2       1.05       2.1 
##  6  89.1  59.8        2.92        0.85      3      0       3.15       0   
##  7  92.7  50.8        2.64        0.3       2      1       2.1        1.05
##  8  37.1  92.3        0.38        3.58      0      0       0          0   
##  9  92.2  62.5        1.87        0.44      1      0       1.05       0   
## 10  83.0  92.7        0.92        1.74      0      1       0          1.05
## # ... with 94 more rows, and 2 more variables: Teams <chr>,
## #   team_names <chr>

I am going to create a new two columns by combining scores and spi rank. This new data set will help me for my data visualization part as well.

only_scores <- wwc_19 %>%
  unite(score_comb, score1, score2, sep = "-") %>%
  unite(spi_comb, spi1, spi2, sep = "-")
only_scores

## # A tibble: 52 x 8
##    team1 team2 spi_comb proj_score1 proj_score2 score_comb adj_score1
##    <chr> <chr> <chr>          <dbl>       <dbl> <chr>           <dbl>
##  1 Fran~ Sout~ 95.32-8~        1.9         0.61 4-0              3.83
##  2 Germ~ Chin~ 94.17-8~        1.67        0.78 1-0              1.05
##  3 Spai~ Sout~ 83.58-6~        1.27        0.43 3-1              2.34
##  4 Norw~ Nige~ 82.91-7~        1.35        0.76 3-0              3.15
##  5 Aust~ Ital~ 93.13-7~        2.63        0.95 1-2              1.05
##  6 Braz~ Jama~ 89.12-5~        2.92        0.85 3-0              3.15
##  7 Engl~ Scot~ 92.7-50~        2.64        0.3  2-1              2.1 
##  8 Arge~ Japa~ 37.09-9~        0.38        3.58 0-0              0   
##  9 Cana~ Came~ 92.23-6~        1.87        0.44 1-0              1.05
## 10 New ~ Neth~ 82.98-9~        0.92        1.74 0-1              0   
## # ... with 42 more rows, and 1 more variable: adj_score2 <dbl>

Data Visualization - heatmaps

I am using ggplot function to show a heatmap like plot.

Visualization of scores

#Graph to show the scores  

g <- ggplot(wwc_19, aes(x = team1, y = team2, fill = score1-score2)) +
  geom_label(aes(label = paste(score1, score2, sep = "-")), fill = "white") +
  scale_x_discrete(limits = levels(wwc_19$team1), position = "bottom") +
  scale_y_discrete(limits = rev(levels(wwc_19$team2))) +
  theme (axis.text.x = element_text(angle = 60, hjust = 1))

g

only_scores <- wwc_19 %>%
  unite(score_comb, score1, score2, sep = "-") %>%
  unite(spi_comb, spi1, spi2, sep = "-")

heatmap <- ggplot(only_scores, mapping = aes(x = team2, y = team1, fill = score_comb))+
  geom_tile()+
  theme (axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(heatmap)

Visualization of SPI (Soccer Power Index) Rating

heatmap <- ggplot(only_scores, mapping = aes(x = team2, y = team1, fill = spi_comb))+
  geom_tile()+
  theme (axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(heatmap)