For this assignment I am using Women World Cup 2019 Prediction data set that I retrieved from Fivethirtyeight site to create a programming sample “vignette” that demonstrate how to use few of the main capabilities of the selected tidyverse package.
This dataset shows the chance that each team will win, lose or tie every one of its matches. One important thing that this dataset include is the SPI rating, Soccer Power Index, how we turn those ratings into a forecast and how we calculate our in-game win probabilities.
To generate our SPI ratings, we run through every past match in our database of women’s international matches — back to 1971 — evaluating the performance of both teams with four metrics:
women_worldcup <- read_csv("https://projects.fivethirtyeight.com/soccer-api/international/2019/wwc_matches.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## date = col_date(format = ""),
## league = col_character(),
## team1 = col_character(),
## team2 = col_character()
## )
## See spec(...) for full column specifications.
head(women_worldcup)
## # A tibble: 6 x 20
## date league_id league team1 team2 spi1 spi2 prob1 prob2 probtie
## <date> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-06-07 352 FIFA ~ Fran~ Sout~ 95.3 80.8 0.694 0.114 0.192
## 2 2019-06-08 352 FIFA ~ Germ~ Chin~ 94.2 82.1 0.599 0.176 0.224
## 3 2019-06-08 352 FIFA ~ Spai~ Sout~ 83.6 61.5 0.592 0.130 0.278
## 4 2019-06-08 352 FIFA ~ Norw~ Nige~ 82.9 70.9 0.520 0.218 0.263
## 5 2019-06-09 352 FIFA ~ Aust~ Ital~ 93.1 71.1 0.742 0.114 0.143
## 6 2019-06-09 352 FIFA ~ Braz~ Jama~ 89.1 59.8 0.803 0.0803 0.117
## # ... with 10 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## # score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>,
## # nsxg2 <dbl>, adj_score1 <dbl>, adj_score2 <dbl>
In this section I will be using slice(), filter(), summerise(), count() functions that can be found under dplyr.
Using slice option we can select the row numbers we want to display
slice(women_worldcup, 10:20)
## # A tibble: 11 x 20
## date league_id league team1 team2 spi1 spi2 prob1 prob2
## <date> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019-06-11 352 FIFA ~ New ~ Neth~ 83.0 92.7 0.2 0.582
## 2 2019-06-11 352 FIFA ~ Chil~ Swed~ 49.0 85.9 0.0373 0.822
## 3 2019-06-11 352 FIFA ~ USA ~ Thai~ 97.2 50.4 0.951 0.0117
## 4 2019-06-12 352 FIFA ~ Nige~ Sout~ 69.3 78.3 0.277 0.455
## 5 2019-06-12 352 FIFA ~ Germ~ Spai~ 93.9 86.5 0.516 0.240
## 6 2019-06-12 352 FIFA ~ Fran~ Norw~ 96.1 84.1 0.649 0.136
## 7 2019-06-13 352 FIFA ~ Aust~ Braz~ 92.4 89.7 0.429 0.343
## 8 2019-06-13 352 FIFA ~ Sout~ Chin~ 58.2 81.9 0.143 0.593
## 9 2019-06-14 352 FIFA ~ Japa~ Scot~ 90.9 52.4 0.833 0.0542
## 10 2019-06-14 352 FIFA ~ Jama~ Ital~ 58.1 72.5 0.242 0.574
## 11 2019-06-14 352 FIFA ~ Engl~ Arge~ 92.1 39.1 0.916 0.0136
## # ... with 11 more variables: probtie <dbl>, proj_score1 <dbl>,
## # proj_score2 <dbl>, score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>,
## # nsxg1 <dbl>, nsxg2 <dbl>, adj_score1 <dbl>, adj_score2 <dbl>
filter function to extract rows that meet logical criteria
filter(women_worldcup, women_worldcup$score1 > "3")
## # A tibble: 2 x 20
## date league_id league team1 team2 spi1 spi2 prob1 prob2 probtie
## <date> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-06-07 352 FIFA ~ Fran~ Sout~ 95.3 80.8 0.694 0.114 0.192
## 2 2019-06-16 352 FIFA ~ Swed~ Thai~ 86.5 43.5 0.866 0.0254 0.108
## # ... with 10 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## # score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>,
## # nsxg2 <dbl>, adj_score1 <dbl>, adj_score2 <dbl>
apply summarize function to columns to create new table of summary statistics
women_worldcup %>% summarise(Avg_proj_score1 = mean(proj_score1), Median_proj_score1 = median(proj_score1), Avg_score1 = mean(score1), Median_score1 = median(score1))
## # A tibble: 1 x 4
## Avg_proj_score1 Median_proj_score1 Avg_score1 Median_score1
## <dbl> <dbl> <dbl> <dbl>
## 1 1.47 1.27 1.65 1
count number of rows each group defined by the variable
#Team 1
count(women_worldcup, women_worldcup$team1, sort = TRUE)
## # A tibble: 24 x 2
## `women_worldcup$team1` n
## <chr> <int>
## 1 England Women 5
## 2 France Women 4
## 3 Germany Women 4
## 4 Netherlands Women 4
## 5 Italy Women 3
## 6 Norway Women 3
## 7 Sweden Women 3
## 8 USA Women 3
## 9 Australia Women 2
## 10 Canada Women 2
## # ... with 14 more rows
#Team 2
count(women_worldcup, women_worldcup$team2, sort = TRUE)
## # A tibble: 24 x 2
## `women_worldcup$team2` n
## <chr> <int>
## 1 Sweden Women 4
## 2 USA Women 4
## 3 Brazil Women 3
## 4 Cameroon Women 3
## 5 China Women 3
## 6 Netherlands Women 3
## 7 Argentina Women 2
## 8 Australia Women 2
## 9 Canada Women 2
## 10 Chile Women 2
## # ... with 14 more rows
Under tidyr, There are four fundamental functions of data tyding.
1. gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer 2. spread() takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider 3. separate() splits a single column into multiple columns 4. unite() combines multiple columns into a single column
Creating a new data set with selected columns. Then gather some columns.
#Eliminating few columns from the original data set and creating a new data set.
wwc_19 <- select(women_worldcup, -c(date, league_id, league, xg1, xg2, nsxg1, nsxg2, prob1, prob2, probtie))
df <- wwc_19 %>%
gather(Teams, team_names, team1:team2)
df
## # A tibble: 104 x 10
## spi1 spi2 proj_score1 proj_score2 score1 score2 adj_score1 adj_score2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 95.3 80.8 1.9 0.61 4 0 3.83 0
## 2 94.2 82.1 1.67 0.78 1 0 1.05 0
## 3 83.6 61.5 1.27 0.43 3 1 2.34 1.05
## 4 82.9 70.9 1.35 0.76 3 0 3.15 0
## 5 93.1 71.1 2.63 0.95 1 2 1.05 2.1
## 6 89.1 59.8 2.92 0.85 3 0 3.15 0
## 7 92.7 50.8 2.64 0.3 2 1 2.1 1.05
## 8 37.1 92.3 0.38 3.58 0 0 0 0
## 9 92.2 62.5 1.87 0.44 1 0 1.05 0
## 10 83.0 92.7 0.92 1.74 0 1 0 1.05
## # ... with 94 more rows, and 2 more variables: Teams <chr>,
## # team_names <chr>
I am going to create a new two columns by combining scores and spi rank. This new data set will help me for my data visualization part as well.
only_scores <- wwc_19 %>%
unite(score_comb, score1, score2, sep = "-") %>%
unite(spi_comb, spi1, spi2, sep = "-")
only_scores
## # A tibble: 52 x 8
## team1 team2 spi_comb proj_score1 proj_score2 score_comb adj_score1
## <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 Fran~ Sout~ 95.32-8~ 1.9 0.61 4-0 3.83
## 2 Germ~ Chin~ 94.17-8~ 1.67 0.78 1-0 1.05
## 3 Spai~ Sout~ 83.58-6~ 1.27 0.43 3-1 2.34
## 4 Norw~ Nige~ 82.91-7~ 1.35 0.76 3-0 3.15
## 5 Aust~ Ital~ 93.13-7~ 2.63 0.95 1-2 1.05
## 6 Braz~ Jama~ 89.12-5~ 2.92 0.85 3-0 3.15
## 7 Engl~ Scot~ 92.7-50~ 2.64 0.3 2-1 2.1
## 8 Arge~ Japa~ 37.09-9~ 0.38 3.58 0-0 0
## 9 Cana~ Came~ 92.23-6~ 1.87 0.44 1-0 1.05
## 10 New ~ Neth~ 82.98-9~ 0.92 1.74 0-1 0
## # ... with 42 more rows, and 1 more variable: adj_score2 <dbl>
I am using ggplot function to show a heatmap like plot.
#Graph to show the scores
g <- ggplot(wwc_19, aes(x = team1, y = team2, fill = score1-score2)) +
geom_label(aes(label = paste(score1, score2, sep = "-")), fill = "white") +
scale_x_discrete(limits = levels(wwc_19$team1), position = "bottom") +
scale_y_discrete(limits = rev(levels(wwc_19$team2))) +
theme (axis.text.x = element_text(angle = 60, hjust = 1))
g
only_scores <- wwc_19 %>%
unite(score_comb, score1, score2, sep = "-") %>%
unite(spi_comb, spi1, spi2, sep = "-")
heatmap <- ggplot(only_scores, mapping = aes(x = team2, y = team1, fill = score_comb))+
geom_tile()+
theme (axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(heatmap)
heatmap <- ggplot(only_scores, mapping = aes(x = team2, y = team1, fill = spi_comb))+
geom_tile()+
theme (axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(heatmap)