library(tidyverse)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.6.3
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Tidyverse and its function in Data Visualization

Tidyverse is a package I always use in my R data projects. Tidyverse is a great collection of R packages offering data science solutions in the areas of data manipulation, exploration, and visualization that share a common design philosophy. It was created by R industry luminary Hadley Wickham, the chief scientist behind RStudio.Tidyverse has become an essential for Data Science where Data Scientists spend more time cleaning, messaging and preparing data.- Tidyverse has actually made data cleaning and messaging fun!

Introduction

For this blog I am using Women World Cup 2019 Prediction data set that I retrieved from Fivethirtyeight site to create a programming sample “vignette” that demonstrate how to use few of the main capabilities of the selected tidyverse package.

This dataset shows the chance that each team will win, lose or tie every one of its matches. One important thing that this dataset include is the SPI rating, Soccer Power Index, how we turn those ratings into a forecast and how we calculate our in-game win probabilities.
To generate our SPI ratings, we run through every past match in our database of women’s international matches — back to 1971 — evaluating the performance of both teams with four metrics:

  1. The number of goals they scored.
  2. The number of goals they scored, adjusted to account for red cards and the time and score of the match when each goal was scored.
  3. The number of goals they were expected to score given the shots they took.
  4. The number of goals they were expected to score given the non-shooting actions they took near the opposing team’s goal.

Read .csv file

women_worldcup <- read_csv("https://projects.fivethirtyeight.com/soccer-api/international/2019/wwc_matches.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   date = col_date(format = ""),
##   league = col_character(),
##   team1 = col_character(),
##   team2 = col_character()
## )
## i Use `spec()` for the full column specifications.
women_worldcup %>%
  head() %>%
  kable() %>%
  kable_styling()
date league_id league team1 team2 spi1 spi2 prob1 prob2 probtie proj_score1 proj_score2 score1 score2 xg1 xg2 nsxg1 nsxg2 adj_score1 adj_score2
2019-06-07 352 FIFA Women’s World Cup Final France Women South Korea Women 95.32 80.76 0.6945 0.1140 0.1915 1.90 0.61 4 0 1.68 0.36 2.29 0.04 3.83 0.00
2019-06-08 352 FIFA Women’s World Cup Final Germany Women China Women 94.17 82.10 0.5993 0.1765 0.2243 1.67 0.78 1 0 0.82 0.63 1.08 0.59 1.05 0.00
2019-06-08 352 FIFA Women’s World Cup Final Spain Women South Africa Women 83.58 61.51 0.5921 0.1303 0.2776 1.27 0.43 3 1 3.33 0.50 3.42 0.12 2.34 1.05
2019-06-08 352 FIFA Women’s World Cup Final Norway Women Nigeria Women 82.91 70.90 0.5195 0.2176 0.2629 1.35 0.76 3 0 0.67 0.65 1.22 0.82 3.15 0.00
2019-06-09 352 FIFA Women’s World Cup Final Australia Women Italy Women 93.13 71.11 0.7425 0.1145 0.1430 2.63 0.95 1 2 2.52 0.79 1.74 0.57 1.05 2.10
2019-06-09 352 FIFA Women’s World Cup Final Brazil Women Jamaica Women 89.12 59.78 0.8027 0.0803 0.1170 2.92 0.85 3 0 3.73 0.94 1.90 0.70 3.15 0.00

Data transformation with dplyr

Here is a complete list of functions that comes with dplyr.

  • select(): Select columns from your dataset
  • filter(): Filter out certain rows that meet your criteria(s)
  • group_by(): Group different observations together such that the original dataset does not change. Only the way it is represented is changed in the form of a list
  • summarise(): Summarise any of the above functions
  • arrange(): Arrange your column data in ascending or descending order
  • join(): Perform left, right, full, and inner joins in R
  • mutate(): Create new columns by preserving the existing variables

In this section I will be using slice(), filter(), summerise(), count() functions that can be found under dplyr.

Manipulate Cases

Using slice option we can select the row numbers we want to display

slice(women_worldcup, 10:20)
## # A tibble: 11 x 20
##    date       league_id league team1 team2  spi1  spi2  prob1  prob2 probtie
##    <date>         <dbl> <chr>  <chr> <chr> <dbl> <dbl>  <dbl>  <dbl>   <dbl>
##  1 2019-06-11       352 FIFA ~ New ~ Neth~  83.0  92.7 0.2    0.582   0.218 
##  2 2019-06-11       352 FIFA ~ Chil~ Swed~  49.0  85.9 0.0373 0.822   0.140 
##  3 2019-06-11       352 FIFA ~ USA ~ Thai~  97.2  50.4 0.951  0.0117  0.0373
##  4 2019-06-12       352 FIFA ~ Nige~ Sout~  69.3  78.3 0.277  0.455   0.268 
##  5 2019-06-12       352 FIFA ~ Germ~ Spai~  93.9  86.5 0.516  0.240   0.245 
##  6 2019-06-12       352 FIFA ~ Fran~ Norw~  96.1  84.1 0.649  0.136   0.215 
##  7 2019-06-13       352 FIFA ~ Aust~ Braz~  92.4  89.7 0.429  0.343   0.228 
##  8 2019-06-13       352 FIFA ~ Sout~ Chin~  58.2  81.9 0.143  0.593   0.265 
##  9 2019-06-14       352 FIFA ~ Japa~ Scot~  90.9  52.4 0.833  0.0542  0.113 
## 10 2019-06-14       352 FIFA ~ Jama~ Ital~  58.1  72.5 0.242  0.574   0.185 
## 11 2019-06-14       352 FIFA ~ Engl~ Arge~  92.1  39.1 0.916  0.0136  0.0709
## # ... with 10 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## #   score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>, nsxg2 <dbl>,
## #   adj_score1 <dbl>, adj_score2 <dbl>

filter function to extract rows that meet logical criteria

filter(women_worldcup, women_worldcup$score1 > "3")
## # A tibble: 2 x 20
##   date       league_id league team1 team2  spi1  spi2 prob1  prob2 probtie
##   <date>         <dbl> <chr>  <chr> <chr> <dbl> <dbl> <dbl>  <dbl>   <dbl>
## 1 2019-06-07       352 FIFA ~ Fran~ Sout~  95.3  80.8 0.694 0.114    0.192
## 2 2019-06-16       352 FIFA ~ Swed~ Thai~  86.5  43.5 0.866 0.0254   0.108
## # ... with 10 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## #   score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>, nsxg2 <dbl>,
## #   adj_score1 <dbl>, adj_score2 <dbl>

Summarize Cases

apply summarize function to columns to create new table of summary statistics

women_worldcup %>% summarise(Avg_proj_score1 = mean(proj_score1), Median_proj_score1 = median(proj_score1), Avg_score1 = mean(score1), Median_score1 = median(score1))
## # A tibble: 1 x 4
##   Avg_proj_score1 Median_proj_score1 Avg_score1 Median_score1
##             <dbl>              <dbl>      <dbl>         <dbl>
## 1            1.47               1.27       1.65             1

count number of rows each group defined by the variable

#Team 1
count(women_worldcup, women_worldcup$team1, sort = TRUE)
## # A tibble: 24 x 2
##    `women_worldcup$team1`     n
##    <chr>                  <int>
##  1 England Women              5
##  2 France Women               4
##  3 Germany Women              4
##  4 Netherlands Women          4
##  5 Italy Women                3
##  6 Norway Women               3
##  7 Sweden Women               3
##  8 USA Women                  3
##  9 Australia Women            2
## 10 Canada Women               2
## # ... with 14 more rows
#Team 2
count(women_worldcup, women_worldcup$team2, sort = TRUE)
## # A tibble: 24 x 2
##    `women_worldcup$team2`     n
##    <chr>                  <int>
##  1 Sweden Women               4
##  2 USA Women                  4
##  3 Brazil Women               3
##  4 Cameroon Women             3
##  5 China Women                3
##  6 Netherlands Women          3
##  7 Argentina Women            2
##  8 Australia Women            2
##  9 Canada Women               2
## 10 Chile Women                2
## # ... with 14 more rows

Data Wrangling with tidyr

The tidyr package complements dplyr perfectly. It boosts the power of dplyr for data manipulation and pre-processing.Under tidyr, There are four fundamental functions of data tyding.

  • gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer
  • spread() takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider
  • separate() splits a single column into multiple columns
  • unite() combines multiple columns into a single column

Reshaping data

Creating a new data set with selected columns. Then gather some columns.

#Eliminating few columns from the original data set and creating a new data set. 
wwc_19 <- select(women_worldcup, -c(date, league_id, league, xg1, xg2, nsxg1, nsxg2, prob1, prob2, probtie))
df <- wwc_19 %>%
  gather(Teams, team_names, team1:team2)
df
## # A tibble: 104 x 10
##     spi1  spi2 proj_score1 proj_score2 score1 score2 adj_score1 adj_score2 Teams
##    <dbl> <dbl>       <dbl>       <dbl>  <dbl>  <dbl>      <dbl>      <dbl> <chr>
##  1  95.3  80.8        1.9         0.61      4      0       3.83       0    team1
##  2  94.2  82.1        1.67        0.78      1      0       1.05       0    team1
##  3  83.6  61.5        1.27        0.43      3      1       2.34       1.05 team1
##  4  82.9  70.9        1.35        0.76      3      0       3.15       0    team1
##  5  93.1  71.1        2.63        0.95      1      2       1.05       2.1  team1
##  6  89.1  59.8        2.92        0.85      3      0       3.15       0    team1
##  7  92.7  50.8        2.64        0.3       2      1       2.1        1.05 team1
##  8  37.1  92.3        0.38        3.58      0      0       0          0    team1
##  9  92.2  62.5        1.87        0.44      1      0       1.05       0    team1
## 10  83.0  92.7        0.92        1.74      0      1       0          1.05 team1
## # ... with 94 more rows, and 1 more variable: team_names <chr>

I am going to create a new two columns by combining scores and spi rank. This new data set will help me for my data visualization part as well.

only_scores <- wwc_19 %>%
  unite(score_comb, score1, score2, sep = "-") %>%
  unite(spi_comb, spi1, spi2, sep = "-")

only_scores %>%
  kable() %>%
  kable_styling()
team1 team2 spi_comb proj_score1 proj_score2 score_comb adj_score1 adj_score2
France Women South Korea Women 95.32-80.76 1.90 0.61 4-0 3.83 0.00
Germany Women China Women 94.17-82.1 1.67 0.78 1-0 1.05 0.00
Spain Women South Africa Women 83.58-61.51 1.27 0.43 3-1 2.34 1.05
Norway Women Nigeria Women 82.91-70.9 1.35 0.76 3-0 3.15 0.00
Australia Women Italy Women 93.13-71.11 2.63 0.95 1-2 1.05 2.10
Brazil Women Jamaica Women 89.12-59.78 2.92 0.85 3-0 3.15 0.00
England Women Scotland Women 92.7-50.75 2.64 0.30 2-1 2.10 1.05
Argentina Women Japan Women 37.09-92.29 0.38 3.58 0-0 0.00 0.00
Canada Women Cameroon Women 92.23-62.53 1.87 0.44 1-0 1.05 0.00
New Zealand Women Netherlands Women 82.98-92.71 0.92 1.74 0-1 0.00 1.05
Chile Women Sweden Women 48.96-85.88 0.27 2.12 0-2 0.00 1.58
USA Women Thailand Women 97.21-50.43 4.01 0.40 13-0 11.73 0.00
Nigeria Women South Korea Women 69.3-78.28 0.92 1.26 2-0 1.99 0.00
Germany Women Spain Women 93.86-86.53 1.48 0.92 1-0 1.05 0.00
France Women Norway Women 96.09-84.06 1.71 0.63 2-1 2.10 1.05
Australia Women Brazil Women 92.44-89.7 1.55 1.37 3-2 3.15 2.10
South Africa Women China Women 58.24-81.9 0.50 1.34 0-1 0.00 1.05
Japan Women Scotland Women 90.91-52.41 2.68 0.55 2-1 2.10 1.05
Jamaica Women Italy Women 58.14-72.46 1.43 2.26 0-5 0.00 4.96
England Women Argentina Women 92.14-39.07 2.86 0.20 1-0 1.05 0.00
Netherlands Women Cameroon Women 93.62-62.68 2.15 0.31 3-1 2.78 1.05
Canada Women New Zealand Women 91.94-80.85 1.56 0.78 2-0 1.87 0.00
Sweden Women Thailand Women 86.47-43.51 2.41 0.24 5-1 4.46 1.05
USA Women Chile Women 98.33-47.61 4.42 0.29 3-0 3.15 0.00
South Africa Women Germany Women 56.69-93.77 0.21 2.21 0-4 0.00 4.20
China Women Spain Women 82.67-86.53 0.85 1.08 0-0 0.00 0.00
South Korea Women Norway Women 76.36-83.68 0.94 1.40 1-2 1.05 2.10
Nigeria Women France Women 71.65-96.3 0.28 2.24 0-1 0.00 0.84
Jamaica Women Australia Women 53.52-92.82 0.76 3.65 1-4 1.05 3.87
Italy Women Brazil Women 76.14-89.47 1.10 1.94 0-1 0.00 1.05
Japan Women England Women 90.31-91.92 1.07 1.40 0-2 0.00 1.74
Scotland Women Argentina Women 53.98-39.19 1.50 0.90 3-3 3.15 3.15
Netherlands Women Canada Women 92.68-93.52 1.08 1.16 2-1 2.10 1.05
Cameroon Women New Zealand Women 65.84-77.62 0.93 1.37 2-1 2.10 1.05
Thailand Women Chile Women 40.69-46.53 1.44 1.69 0-2 0.00 1.86
Sweden Women USA Women 88.45-98.33 0.78 1.71 0-2 0.00 2.10
Germany Women Nigeria Women 94.93-69.91 2.19 0.43 3-0 2.84 0.00
Norway Women Australia Women 82.62-92.49 1.14 1.74 1-1 1.05 1.05
England Women Cameroon Women 92.2-69.69 1.78 0.50 3-0 3.15 0.00
France Women Brazil Women 96.61-89.26 1.70 0.72 2-1 2.10 1.05
Spain Women USA Women 88.41-98.49 0.82 1.98 1-2 1.05 2.10
Sweden Women Canada Women 87.36-93.07 0.75 1.01 1-0 1.05 0.00
Italy Women China Women 75.68-79.99 1.01 1.24 2-0 2.10 0.00
Netherlands Women Japan Women 93.33-89.81 1.62 1.16 2-1 2.10 1.05
Norway Women England Women 84.1-92.65 0.81 1.47 0-3 0.00 3.15
France Women USA Women 96.79-98.5 1.25 1.41 1-2 1.05 2.10
Italy Women Netherlands Women 77.41-93.32 0.90 2.10 0-2 0.00 1.85
Germany Women Sweden Women 94.89-88.46 1.26 0.85 1-2 1.05 2.10
England Women USA Women 93.54-98.42 0.98 1.63 1-2 1.05 2.10
Netherlands Women Sweden Women 93.56-89.6 1.15 0.92 1-0 1.05 0.00
England Women Sweden Women 94.04-88.77 1.10 0.75 1-2 1.05 2.10
USA Women Netherlands Women 98.24-93.96 1.71 1.08 2-0 2.10 0.00

Data Visualization - ggplot2

There is so much we can do with this package. Whether it’s building box plots, density plots, violin plots, tile plots, time series plots – you name it and ggplot2 has a function for it.Let’s see a few examples of how to create some really interactive plots with ggplot2 in R.

Visualization of scores

#Graph to show the scores  

g <- ggplot(wwc_19, aes(x = team1, y = team2, fill = score1-score2)) +
  geom_label(aes(label = paste(score1, score2, sep = "-")), fill = "white") +
  scale_x_discrete(limits = levels(wwc_19$team1), position = "bottom") +
  scale_y_discrete(limits = rev(levels(wwc_19$team2))) +
  theme (axis.text.x = element_text(angle = 60, hjust = 1))

g

only_scores <- wwc_19 %>%
  unite(score_comb, score1, score2, sep = "-") %>%
  unite(spi_comb, spi1, spi2, sep = "-")

heatmap <- ggplot(only_scores, mapping = aes(x = team2, y = team1, fill = score_comb))+
  geom_tile()+
  theme (axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(heatmap)

Visualization of SPI (Soccer Power Index) Rating

heatmap <- ggplot(only_scores, mapping = aes(x = team2, y = team1, fill = spi_comb))+
  geom_tile()+
  theme (axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(heatmap)

End Note

Tidyverse is the most popular collection of R packages. Which isn’t all that surprising given how useful and easy to use they are. It is consisting with a powerful collection of R packages for preparing, wrangling and visualizing data. Tidyverse has completely changed the way data science work with messy data.