library(tidyverse)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.6.3
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
Tidyverse is a package I always use in my R data projects. Tidyverse is a great collection of R packages offering data science solutions in the areas of data manipulation, exploration, and visualization that share a common design philosophy. It was created by R industry luminary Hadley Wickham, the chief scientist behind RStudio.Tidyverse has become an essential for Data Science where Data Scientists spend more time cleaning, messaging and preparing data.- Tidyverse has actually made data cleaning and messaging fun!
For this blog I am using Women World Cup 2019 Prediction data set that I retrieved from Fivethirtyeight site to create a programming sample “vignette” that demonstrate how to use few of the main capabilities of the selected tidyverse package.
This dataset shows the chance that each team will win, lose or tie every one of its matches. One important thing that this dataset include is the SPI rating, Soccer Power Index, how we turn those ratings into a forecast and how we calculate our in-game win probabilities.
To generate our SPI ratings, we run through every past match in our database of women’s international matches — back to 1971 — evaluating the performance of both teams with four metrics:
women_worldcup <- read_csv("https://projects.fivethirtyeight.com/soccer-api/international/2019/wwc_matches.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## date = col_date(format = ""),
## league = col_character(),
## team1 = col_character(),
## team2 = col_character()
## )
## i Use `spec()` for the full column specifications.
women_worldcup %>%
head() %>%
kable() %>%
kable_styling()
date | league_id | league | team1 | team2 | spi1 | spi2 | prob1 | prob2 | probtie | proj_score1 | proj_score2 | score1 | score2 | xg1 | xg2 | nsxg1 | nsxg2 | adj_score1 | adj_score2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-06-07 | 352 | FIFA Women’s World Cup Final | France Women | South Korea Women | 95.32 | 80.76 | 0.6945 | 0.1140 | 0.1915 | 1.90 | 0.61 | 4 | 0 | 1.68 | 0.36 | 2.29 | 0.04 | 3.83 | 0.00 |
2019-06-08 | 352 | FIFA Women’s World Cup Final | Germany Women | China Women | 94.17 | 82.10 | 0.5993 | 0.1765 | 0.2243 | 1.67 | 0.78 | 1 | 0 | 0.82 | 0.63 | 1.08 | 0.59 | 1.05 | 0.00 |
2019-06-08 | 352 | FIFA Women’s World Cup Final | Spain Women | South Africa Women | 83.58 | 61.51 | 0.5921 | 0.1303 | 0.2776 | 1.27 | 0.43 | 3 | 1 | 3.33 | 0.50 | 3.42 | 0.12 | 2.34 | 1.05 |
2019-06-08 | 352 | FIFA Women’s World Cup Final | Norway Women | Nigeria Women | 82.91 | 70.90 | 0.5195 | 0.2176 | 0.2629 | 1.35 | 0.76 | 3 | 0 | 0.67 | 0.65 | 1.22 | 0.82 | 3.15 | 0.00 |
2019-06-09 | 352 | FIFA Women’s World Cup Final | Australia Women | Italy Women | 93.13 | 71.11 | 0.7425 | 0.1145 | 0.1430 | 2.63 | 0.95 | 1 | 2 | 2.52 | 0.79 | 1.74 | 0.57 | 1.05 | 2.10 |
2019-06-09 | 352 | FIFA Women’s World Cup Final | Brazil Women | Jamaica Women | 89.12 | 59.78 | 0.8027 | 0.0803 | 0.1170 | 2.92 | 0.85 | 3 | 0 | 3.73 | 0.94 | 1.90 | 0.70 | 3.15 | 0.00 |
Here is a complete list of functions that comes with dplyr.
In this section I will be using slice(), filter(), summerise(), count() functions that can be found under dplyr.
Using slice option we can select the row numbers we want to display
slice(women_worldcup, 10:20)
## # A tibble: 11 x 20
## date league_id league team1 team2 spi1 spi2 prob1 prob2 probtie
## <date> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-06-11 352 FIFA ~ New ~ Neth~ 83.0 92.7 0.2 0.582 0.218
## 2 2019-06-11 352 FIFA ~ Chil~ Swed~ 49.0 85.9 0.0373 0.822 0.140
## 3 2019-06-11 352 FIFA ~ USA ~ Thai~ 97.2 50.4 0.951 0.0117 0.0373
## 4 2019-06-12 352 FIFA ~ Nige~ Sout~ 69.3 78.3 0.277 0.455 0.268
## 5 2019-06-12 352 FIFA ~ Germ~ Spai~ 93.9 86.5 0.516 0.240 0.245
## 6 2019-06-12 352 FIFA ~ Fran~ Norw~ 96.1 84.1 0.649 0.136 0.215
## 7 2019-06-13 352 FIFA ~ Aust~ Braz~ 92.4 89.7 0.429 0.343 0.228
## 8 2019-06-13 352 FIFA ~ Sout~ Chin~ 58.2 81.9 0.143 0.593 0.265
## 9 2019-06-14 352 FIFA ~ Japa~ Scot~ 90.9 52.4 0.833 0.0542 0.113
## 10 2019-06-14 352 FIFA ~ Jama~ Ital~ 58.1 72.5 0.242 0.574 0.185
## 11 2019-06-14 352 FIFA ~ Engl~ Arge~ 92.1 39.1 0.916 0.0136 0.0709
## # ... with 10 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## # score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>, nsxg2 <dbl>,
## # adj_score1 <dbl>, adj_score2 <dbl>
filter function to extract rows that meet logical criteria
filter(women_worldcup, women_worldcup$score1 > "3")
## # A tibble: 2 x 20
## date league_id league team1 team2 spi1 spi2 prob1 prob2 probtie
## <date> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2019-06-07 352 FIFA ~ Fran~ Sout~ 95.3 80.8 0.694 0.114 0.192
## 2 2019-06-16 352 FIFA ~ Swed~ Thai~ 86.5 43.5 0.866 0.0254 0.108
## # ... with 10 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## # score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>, nsxg2 <dbl>,
## # adj_score1 <dbl>, adj_score2 <dbl>
apply summarize function to columns to create new table of summary statistics
women_worldcup %>% summarise(Avg_proj_score1 = mean(proj_score1), Median_proj_score1 = median(proj_score1), Avg_score1 = mean(score1), Median_score1 = median(score1))
## # A tibble: 1 x 4
## Avg_proj_score1 Median_proj_score1 Avg_score1 Median_score1
## <dbl> <dbl> <dbl> <dbl>
## 1 1.47 1.27 1.65 1
count number of rows each group defined by the variable
#Team 1
count(women_worldcup, women_worldcup$team1, sort = TRUE)
## # A tibble: 24 x 2
## `women_worldcup$team1` n
## <chr> <int>
## 1 England Women 5
## 2 France Women 4
## 3 Germany Women 4
## 4 Netherlands Women 4
## 5 Italy Women 3
## 6 Norway Women 3
## 7 Sweden Women 3
## 8 USA Women 3
## 9 Australia Women 2
## 10 Canada Women 2
## # ... with 14 more rows
#Team 2
count(women_worldcup, women_worldcup$team2, sort = TRUE)
## # A tibble: 24 x 2
## `women_worldcup$team2` n
## <chr> <int>
## 1 Sweden Women 4
## 2 USA Women 4
## 3 Brazil Women 3
## 4 Cameroon Women 3
## 5 China Women 3
## 6 Netherlands Women 3
## 7 Argentina Women 2
## 8 Australia Women 2
## 9 Canada Women 2
## 10 Chile Women 2
## # ... with 14 more rows
The tidyr package complements dplyr perfectly. It boosts the power of dplyr for data manipulation and pre-processing.Under tidyr, There are four fundamental functions of data tyding.
Creating a new data set with selected columns. Then gather some columns.
#Eliminating few columns from the original data set and creating a new data set.
wwc_19 <- select(women_worldcup, -c(date, league_id, league, xg1, xg2, nsxg1, nsxg2, prob1, prob2, probtie))
df <- wwc_19 %>%
gather(Teams, team_names, team1:team2)
df
## # A tibble: 104 x 10
## spi1 spi2 proj_score1 proj_score2 score1 score2 adj_score1 adj_score2 Teams
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 95.3 80.8 1.9 0.61 4 0 3.83 0 team1
## 2 94.2 82.1 1.67 0.78 1 0 1.05 0 team1
## 3 83.6 61.5 1.27 0.43 3 1 2.34 1.05 team1
## 4 82.9 70.9 1.35 0.76 3 0 3.15 0 team1
## 5 93.1 71.1 2.63 0.95 1 2 1.05 2.1 team1
## 6 89.1 59.8 2.92 0.85 3 0 3.15 0 team1
## 7 92.7 50.8 2.64 0.3 2 1 2.1 1.05 team1
## 8 37.1 92.3 0.38 3.58 0 0 0 0 team1
## 9 92.2 62.5 1.87 0.44 1 0 1.05 0 team1
## 10 83.0 92.7 0.92 1.74 0 1 0 1.05 team1
## # ... with 94 more rows, and 1 more variable: team_names <chr>
I am going to create a new two columns by combining scores and spi rank. This new data set will help me for my data visualization part as well.
only_scores <- wwc_19 %>%
unite(score_comb, score1, score2, sep = "-") %>%
unite(spi_comb, spi1, spi2, sep = "-")
only_scores %>%
kable() %>%
kable_styling()
team1 | team2 | spi_comb | proj_score1 | proj_score2 | score_comb | adj_score1 | adj_score2 |
---|---|---|---|---|---|---|---|
France Women | South Korea Women | 95.32-80.76 | 1.90 | 0.61 | 4-0 | 3.83 | 0.00 |
Germany Women | China Women | 94.17-82.1 | 1.67 | 0.78 | 1-0 | 1.05 | 0.00 |
Spain Women | South Africa Women | 83.58-61.51 | 1.27 | 0.43 | 3-1 | 2.34 | 1.05 |
Norway Women | Nigeria Women | 82.91-70.9 | 1.35 | 0.76 | 3-0 | 3.15 | 0.00 |
Australia Women | Italy Women | 93.13-71.11 | 2.63 | 0.95 | 1-2 | 1.05 | 2.10 |
Brazil Women | Jamaica Women | 89.12-59.78 | 2.92 | 0.85 | 3-0 | 3.15 | 0.00 |
England Women | Scotland Women | 92.7-50.75 | 2.64 | 0.30 | 2-1 | 2.10 | 1.05 |
Argentina Women | Japan Women | 37.09-92.29 | 0.38 | 3.58 | 0-0 | 0.00 | 0.00 |
Canada Women | Cameroon Women | 92.23-62.53 | 1.87 | 0.44 | 1-0 | 1.05 | 0.00 |
New Zealand Women | Netherlands Women | 82.98-92.71 | 0.92 | 1.74 | 0-1 | 0.00 | 1.05 |
Chile Women | Sweden Women | 48.96-85.88 | 0.27 | 2.12 | 0-2 | 0.00 | 1.58 |
USA Women | Thailand Women | 97.21-50.43 | 4.01 | 0.40 | 13-0 | 11.73 | 0.00 |
Nigeria Women | South Korea Women | 69.3-78.28 | 0.92 | 1.26 | 2-0 | 1.99 | 0.00 |
Germany Women | Spain Women | 93.86-86.53 | 1.48 | 0.92 | 1-0 | 1.05 | 0.00 |
France Women | Norway Women | 96.09-84.06 | 1.71 | 0.63 | 2-1 | 2.10 | 1.05 |
Australia Women | Brazil Women | 92.44-89.7 | 1.55 | 1.37 | 3-2 | 3.15 | 2.10 |
South Africa Women | China Women | 58.24-81.9 | 0.50 | 1.34 | 0-1 | 0.00 | 1.05 |
Japan Women | Scotland Women | 90.91-52.41 | 2.68 | 0.55 | 2-1 | 2.10 | 1.05 |
Jamaica Women | Italy Women | 58.14-72.46 | 1.43 | 2.26 | 0-5 | 0.00 | 4.96 |
England Women | Argentina Women | 92.14-39.07 | 2.86 | 0.20 | 1-0 | 1.05 | 0.00 |
Netherlands Women | Cameroon Women | 93.62-62.68 | 2.15 | 0.31 | 3-1 | 2.78 | 1.05 |
Canada Women | New Zealand Women | 91.94-80.85 | 1.56 | 0.78 | 2-0 | 1.87 | 0.00 |
Sweden Women | Thailand Women | 86.47-43.51 | 2.41 | 0.24 | 5-1 | 4.46 | 1.05 |
USA Women | Chile Women | 98.33-47.61 | 4.42 | 0.29 | 3-0 | 3.15 | 0.00 |
South Africa Women | Germany Women | 56.69-93.77 | 0.21 | 2.21 | 0-4 | 0.00 | 4.20 |
China Women | Spain Women | 82.67-86.53 | 0.85 | 1.08 | 0-0 | 0.00 | 0.00 |
South Korea Women | Norway Women | 76.36-83.68 | 0.94 | 1.40 | 1-2 | 1.05 | 2.10 |
Nigeria Women | France Women | 71.65-96.3 | 0.28 | 2.24 | 0-1 | 0.00 | 0.84 |
Jamaica Women | Australia Women | 53.52-92.82 | 0.76 | 3.65 | 1-4 | 1.05 | 3.87 |
Italy Women | Brazil Women | 76.14-89.47 | 1.10 | 1.94 | 0-1 | 0.00 | 1.05 |
Japan Women | England Women | 90.31-91.92 | 1.07 | 1.40 | 0-2 | 0.00 | 1.74 |
Scotland Women | Argentina Women | 53.98-39.19 | 1.50 | 0.90 | 3-3 | 3.15 | 3.15 |
Netherlands Women | Canada Women | 92.68-93.52 | 1.08 | 1.16 | 2-1 | 2.10 | 1.05 |
Cameroon Women | New Zealand Women | 65.84-77.62 | 0.93 | 1.37 | 2-1 | 2.10 | 1.05 |
Thailand Women | Chile Women | 40.69-46.53 | 1.44 | 1.69 | 0-2 | 0.00 | 1.86 |
Sweden Women | USA Women | 88.45-98.33 | 0.78 | 1.71 | 0-2 | 0.00 | 2.10 |
Germany Women | Nigeria Women | 94.93-69.91 | 2.19 | 0.43 | 3-0 | 2.84 | 0.00 |
Norway Women | Australia Women | 82.62-92.49 | 1.14 | 1.74 | 1-1 | 1.05 | 1.05 |
England Women | Cameroon Women | 92.2-69.69 | 1.78 | 0.50 | 3-0 | 3.15 | 0.00 |
France Women | Brazil Women | 96.61-89.26 | 1.70 | 0.72 | 2-1 | 2.10 | 1.05 |
Spain Women | USA Women | 88.41-98.49 | 0.82 | 1.98 | 1-2 | 1.05 | 2.10 |
Sweden Women | Canada Women | 87.36-93.07 | 0.75 | 1.01 | 1-0 | 1.05 | 0.00 |
Italy Women | China Women | 75.68-79.99 | 1.01 | 1.24 | 2-0 | 2.10 | 0.00 |
Netherlands Women | Japan Women | 93.33-89.81 | 1.62 | 1.16 | 2-1 | 2.10 | 1.05 |
Norway Women | England Women | 84.1-92.65 | 0.81 | 1.47 | 0-3 | 0.00 | 3.15 |
France Women | USA Women | 96.79-98.5 | 1.25 | 1.41 | 1-2 | 1.05 | 2.10 |
Italy Women | Netherlands Women | 77.41-93.32 | 0.90 | 2.10 | 0-2 | 0.00 | 1.85 |
Germany Women | Sweden Women | 94.89-88.46 | 1.26 | 0.85 | 1-2 | 1.05 | 2.10 |
England Women | USA Women | 93.54-98.42 | 0.98 | 1.63 | 1-2 | 1.05 | 2.10 |
Netherlands Women | Sweden Women | 93.56-89.6 | 1.15 | 0.92 | 1-0 | 1.05 | 0.00 |
England Women | Sweden Women | 94.04-88.77 | 1.10 | 0.75 | 1-2 | 1.05 | 2.10 |
USA Women | Netherlands Women | 98.24-93.96 | 1.71 | 1.08 | 2-0 | 2.10 | 0.00 |
There is so much we can do with this package. Whether it’s building box plots, density plots, violin plots, tile plots, time series plots – you name it and ggplot2 has a function for it.Let’s see a few examples of how to create some really interactive plots with ggplot2 in R.
#Graph to show the scores
g <- ggplot(wwc_19, aes(x = team1, y = team2, fill = score1-score2)) +
geom_label(aes(label = paste(score1, score2, sep = "-")), fill = "white") +
scale_x_discrete(limits = levels(wwc_19$team1), position = "bottom") +
scale_y_discrete(limits = rev(levels(wwc_19$team2))) +
theme (axis.text.x = element_text(angle = 60, hjust = 1))
g
only_scores <- wwc_19 %>%
unite(score_comb, score1, score2, sep = "-") %>%
unite(spi_comb, spi1, spi2, sep = "-")
heatmap <- ggplot(only_scores, mapping = aes(x = team2, y = team1, fill = score_comb))+
geom_tile()+
theme (axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(heatmap)
heatmap <- ggplot(only_scores, mapping = aes(x = team2, y = team1, fill = spi_comb))+
geom_tile()+
theme (axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(heatmap)
Tidyverse is the most popular collection of R packages. Which isn’t all that surprising given how useful and easy to use they are. It is consisting with a powerful collection of R packages for preparing, wrangling and visualizing data. Tidyverse has completely changed the way data science work with messy data.