TidyVerse_Blog Post

Tidyverse and its function in Data Visualization

Tidyverse is a package I always use in my R data projects. Tidyverse is a great collection of R packages offering data science solutions in the areas of data manipulation, exploration, and visualization that share a common design philosophy. It was created by R industry luminary Hadley Wickham, the chief scientist behind RStudio.Tidyverse has become an essential for Data Science where Data Scientists spend more time cleaning, messaging and preparing data.- Tidyverse has actually made data cleaning and messaging fun!

Introduction

For this blog I am using Women World Cup 2019 Prediction data set that I retrieved from Fivethirtyeight site to create a programming sample “vignette” that demonstrate how to use few of the main capabilities of the selected tidyverse package.

This dataset shows the chance that each team will win, lose or tie every one of its matches. One important thing that this dataset include is the SPI rating, Soccer Power Index, how we turn those ratings into a forecast and how we calculate our in-game win probabilities.
To generate our SPI ratings, we run through every past match in our database of women’s international matches — back to 1971 — evaluating the performance of both teams with four metrics:

The number of goals they scored.
The number of goals they scored, adjusted to account for red cards and the time and score of the match when each goal was scored.
The number of goals they were expected to score given the shots they took.
The number of goals they were expected to score given the non-shooting actions they took near the opposing team’s goal.

Read .csv file

women_worldcup <- read_csv("https://projects.fivethirtyeight.com/soccer-api/international/2019/wwc_matches.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   date = col_date(format = ""),
##   league = col_character(),
##   team1 = col_character(),
##   team2 = col_character()
## )
## i Use `spec()` for the full column specifications.

women_worldcup %>%
  head() %>%
  kable() %>%
  kable_styling()

date	league_id	league	team1	team2	spi1	spi2	prob1	prob2	probtie	proj_score1	proj_score2	score1	score2	xg1	xg2	nsxg1	nsxg2	adj_score1	adj_score2
2019-06-07	352	FIFA Women’s World Cup Final	France Women	South Korea Women	95.32	80.76	0.6945	0.1140	0.1915	1.90	0.61	4	0	1.68	0.36	2.29	0.04	3.83	0.00
2019-06-08	352	FIFA Women’s World Cup Final	Germany Women	China Women	94.17	82.10	0.5993	0.1765	0.2243	1.67	0.78	1	0	0.82	0.63	1.08	0.59	1.05	0.00
2019-06-08	352	FIFA Women’s World Cup Final	Spain Women	South Africa Women	83.58	61.51	0.5921	0.1303	0.2776	1.27	0.43	3	1	3.33	0.50	3.42	0.12	2.34	1.05
2019-06-08	352	FIFA Women’s World Cup Final	Norway Women	Nigeria Women	82.91	70.90	0.5195	0.2176	0.2629	1.35	0.76	3	0	0.67	0.65	1.22	0.82	3.15	0.00
2019-06-09	352	FIFA Women’s World Cup Final	Australia Women	Italy Women	93.13	71.11	0.7425	0.1145	0.1430	2.63	0.95	1	2	2.52	0.79	1.74	0.57	1.05	2.10
2019-06-09	352	FIFA Women’s World Cup Final	Brazil Women	Jamaica Women	89.12	59.78	0.8027	0.0803	0.1170	2.92	0.85	3	0	3.73	0.94	1.90	0.70	3.15	0.00

Data transformation with dplyr

Here is a complete list of functions that comes with dplyr.

select(): Select columns from your dataset
filter(): Filter out certain rows that meet your criteria(s)
group_by(): Group different observations together such that the original dataset does not change. Only the way it is represented is changed in the form of a list
summarise(): Summarise any of the above functions
arrange(): Arrange your column data in ascending or descending order
join(): Perform left, right, full, and inner joins in R
mutate(): Create new columns by preserving the existing variables

In this section I will be using slice(), filter(), summerise(), count() functions that can be found under dplyr.

Manipulate Cases

Using slice option we can select the row numbers we want to display

slice(women_worldcup, 10:20)

## # A tibble: 11 x 20
##    date       league_id league team1 team2  spi1  spi2  prob1  prob2 probtie
##    <date>         <dbl> <chr>  <chr> <chr> <dbl> <dbl>  <dbl>  <dbl>   <dbl>
##  1 2019-06-11       352 FIFA ~ New ~ Neth~  83.0  92.7 0.2    0.582   0.218 
##  2 2019-06-11       352 FIFA ~ Chil~ Swed~  49.0  85.9 0.0373 0.822   0.140 
##  3 2019-06-11       352 FIFA ~ USA ~ Thai~  97.2  50.4 0.951  0.0117  0.0373
##  4 2019-06-12       352 FIFA ~ Nige~ Sout~  69.3  78.3 0.277  0.455   0.268 
##  5 2019-06-12       352 FIFA ~ Germ~ Spai~  93.9  86.5 0.516  0.240   0.245 
##  6 2019-06-12       352 FIFA ~ Fran~ Norw~  96.1  84.1 0.649  0.136   0.215 
##  7 2019-06-13       352 FIFA ~ Aust~ Braz~  92.4  89.7 0.429  0.343   0.228 
##  8 2019-06-13       352 FIFA ~ Sout~ Chin~  58.2  81.9 0.143  0.593   0.265 
##  9 2019-06-14       352 FIFA ~ Japa~ Scot~  90.9  52.4 0.833  0.0542  0.113 
## 10 2019-06-14       352 FIFA ~ Jama~ Ital~  58.1  72.5 0.242  0.574   0.185 
## 11 2019-06-14       352 FIFA ~ Engl~ Arge~  92.1  39.1 0.916  0.0136  0.0709
## # ... with 10 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## #   score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>, nsxg2 <dbl>,
## #   adj_score1 <dbl>, adj_score2 <dbl>

filter function to extract rows that meet logical criteria

filter(women_worldcup, women_worldcup$score1 > "3")

## # A tibble: 2 x 20
##   date       league_id league team1 team2  spi1  spi2 prob1  prob2 probtie
##   <date>         <dbl> <chr>  <chr> <chr> <dbl> <dbl> <dbl>  <dbl>   <dbl>
## 1 2019-06-07       352 FIFA ~ Fran~ Sout~  95.3  80.8 0.694 0.114    0.192
## 2 2019-06-16       352 FIFA ~ Swed~ Thai~  86.5  43.5 0.866 0.0254   0.108
## # ... with 10 more variables: proj_score1 <dbl>, proj_score2 <dbl>,
## #   score1 <dbl>, score2 <dbl>, xg1 <dbl>, xg2 <dbl>, nsxg1 <dbl>, nsxg2 <dbl>,
## #   adj_score1 <dbl>, adj_score2 <dbl>

Summarize Cases

apply summarize function to columns to create new table of summary statistics

women_worldcup %>% summarise(Avg_proj_score1 = mean(proj_score1), Median_proj_score1 = median(proj_score1), Avg_score1 = mean(score1), Median_score1 = median(score1))

## # A tibble: 1 x 4
##   Avg_proj_score1 Median_proj_score1 Avg_score1 Median_score1
##             <dbl>              <dbl>      <dbl>         <dbl>
## 1            1.47               1.27       1.65             1

count number of rows each group defined by the variable

#Team 1
count(women_worldcup, women_worldcup$team1, sort = TRUE)

## # A tibble: 24 x 2
##    `women_worldcup$team1`     n
##    <chr>                  <int>
##  1 England Women              5
##  2 France Women               4
##  3 Germany Women              4
##  4 Netherlands Women          4
##  5 Italy Women                3
##  6 Norway Women               3
##  7 Sweden Women               3
##  8 USA Women                  3
##  9 Australia Women            2
## 10 Canada Women               2
## # ... with 14 more rows

#Team 2
count(women_worldcup, women_worldcup$team2, sort = TRUE)

## # A tibble: 24 x 2
##    `women_worldcup$team2`     n
##    <chr>                  <int>
##  1 Sweden Women               4
##  2 USA Women                  4
##  3 Brazil Women               3
##  4 Cameroon Women             3
##  5 China Women                3
##  6 Netherlands Women          3
##  7 Argentina Women            2
##  8 Australia Women            2
##  9 Canada Women               2
## 10 Chile Women                2
## # ... with 14 more rows

Data Wrangling with tidyr

The tidyr package complements dplyr perfectly. It boosts the power of dplyr for data manipulation and pre-processing.Under tidyr, There are four fundamental functions of data tyding.

gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer
spread() takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider
separate() splits a single column into multiple columns
unite() combines multiple columns into a single column

Reshaping data

Creating a new data set with selected columns. Then gather some columns.

#Eliminating few columns from the original data set and creating a new data set. 
wwc_19 <- select(women_worldcup, -c(date, league_id, league, xg1, xg2, nsxg1, nsxg2, prob1, prob2, probtie))

df <- wwc_19 %>%
  gather(Teams, team_names, team1:team2)
df

## # A tibble: 104 x 10
##     spi1  spi2 proj_score1 proj_score2 score1 score2 adj_score1 adj_score2 Teams
##    <dbl> <dbl>       <dbl>       <dbl>  <dbl>  <dbl>      <dbl>      <dbl> <chr>
##  1  95.3  80.8        1.9         0.61      4      0       3.83       0    team1
##  2  94.2  82.1        1.67        0.78      1      0       1.05       0    team1
##  3  83.6  61.5        1.27        0.43      3      1       2.34       1.05 team1
##  4  82.9  70.9        1.35        0.76      3      0       3.15       0    team1
##  5  93.1  71.1        2.63        0.95      1      2       1.05       2.1  team1
##  6  89.1  59.8        2.92        0.85      3      0       3.15       0    team1
##  7  92.7  50.8        2.64        0.3       2      1       2.1        1.05 team1
##  8  37.1  92.3        0.38        3.58      0      0       0          0    team1
##  9  92.2  62.5        1.87        0.44      1      0       1.05       0    team1
## 10  83.0  92.7        0.92        1.74      0      1       0          1.05 team1
## # ... with 94 more rows, and 1 more variable: team_names <chr>

I am going to create a new two columns by combining scores and spi rank. This new data set will help me for my data visualization part as well.

only_scores <- wwc_19 %>%
  unite(score_comb, score1, score2, sep = "-") %>%
  unite(spi_comb, spi1, spi2, sep = "-")

only_scores %>%
  kable() %>%
  kable_styling()

team1	team2	spi_comb	proj_score1	proj_score2	score_comb	adj_score1	adj_score2
France Women	South Korea Women	95.32-80.76	1.90	0.61	4-0	3.83	0.00
Germany Women	China Women	94.17-82.1	1.67	0.78	1-0	1.05	0.00
Spain Women	South Africa Women	83.58-61.51	1.27	0.43	3-1	2.34	1.05
Norway Women	Nigeria Women	82.91-70.9	1.35	0.76	3-0	3.15	0.00
Australia Women	Italy Women	93.13-71.11	2.63	0.95	1-2	1.05	2.10
Brazil Women	Jamaica Women	89.12-59.78	2.92	0.85	3-0	3.15	0.00
England Women	Scotland Women	92.7-50.75	2.64	0.30	2-1	2.10	1.05
Argentina Women	Japan Women	37.09-92.29	0.38	3.58	0-0	0.00	0.00
Canada Women	Cameroon Women	92.23-62.53	1.87	0.44	1-0	1.05	0.00
New Zealand Women	Netherlands Women	82.98-92.71	0.92	1.74	0-1	0.00	1.05
Chile Women	Sweden Women	48.96-85.88	0.27	2.12	0-2	0.00	1.58
USA Women	Thailand Women	97.21-50.43	4.01	0.40	13-0	11.73	0.00
Nigeria Women	South Korea Women	69.3-78.28	0.92	1.26	2-0	1.99	0.00
Germany Women	Spain Women	93.86-86.53	1.48	0.92	1-0	1.05	0.00
France Women	Norway Women	96.09-84.06	1.71	0.63	2-1	2.10	1.05
Australia Women	Brazil Women	92.44-89.7	1.55	1.37	3-2	3.15	2.10
South Africa Women	China Women	58.24-81.9	0.50	1.34	0-1	0.00	1.05
Japan Women	Scotland Women	90.91-52.41	2.68	0.55	2-1	2.10	1.05
Jamaica Women	Italy Women	58.14-72.46	1.43	2.26	0-5	0.00	4.96
England Women	Argentina Women	92.14-39.07	2.86	0.20	1-0	1.05	0.00
Netherlands Women	Cameroon Women	93.62-62.68	2.15	0.31	3-1	2.78	1.05
Canada Women	New Zealand Women	91.94-80.85	1.56	0.78	2-0	1.87	0.00
Sweden Women	Thailand Women	86.47-43.51	2.41	0.24	5-1	4.46	1.05
USA Women	Chile Women	98.33-47.61	4.42	0.29	3-0	3.15	0.00
South Africa Women	Germany Women	56.69-93.77	0.21	2.21	0-4	0.00	4.20
China Women	Spain Women	82.67-86.53	0.85	1.08	0-0	0.00	0.00
South Korea Women	Norway Women	76.36-83.68	0.94	1.40	1-2	1.05	2.10
Nigeria Women	France Women	71.65-96.3	0.28	2.24	0-1	0.00	0.84
Jamaica Women	Australia Women	53.52-92.82	0.76	3.65	1-4	1.05	3.87
Italy Women	Brazil Women	76.14-89.47	1.10	1.94	0-1	0.00	1.05
Japan Women	England Women	90.31-91.92	1.07	1.40	0-2	0.00	1.74
Scotland Women	Argentina Women	53.98-39.19	1.50	0.90	3-3	3.15	3.15
Netherlands Women	Canada Women	92.68-93.52	1.08	1.16	2-1	2.10	1.05
Cameroon Women	New Zealand Women	65.84-77.62	0.93	1.37	2-1	2.10	1.05
Thailand Women	Chile Women	40.69-46.53	1.44	1.69	0-2	0.00	1.86
Sweden Women	USA Women	88.45-98.33	0.78	1.71	0-2	0.00	2.10
Germany Women	Nigeria Women	94.93-69.91	2.19	0.43	3-0	2.84	0.00
Norway Women	Australia Women	82.62-92.49	1.14	1.74	1-1	1.05	1.05
England Women	Cameroon Women	92.2-69.69	1.78	0.50	3-0	3.15	0.00
France Women	Brazil Women	96.61-89.26	1.70	0.72	2-1	2.10	1.05
Spain Women	USA Women	88.41-98.49	0.82	1.98	1-2	1.05	2.10
Sweden Women	Canada Women	87.36-93.07	0.75	1.01	1-0	1.05	0.00
Italy Women	China Women	75.68-79.99	1.01	1.24	2-0	2.10	0.00
Netherlands Women	Japan Women	93.33-89.81	1.62	1.16	2-1	2.10	1.05
Norway Women	England Women	84.1-92.65	0.81	1.47	0-3	0.00	3.15
France Women	USA Women	96.79-98.5	1.25	1.41	1-2	1.05	2.10
Italy Women	Netherlands Women	77.41-93.32	0.90	2.10	0-2	0.00	1.85
Germany Women	Sweden Women	94.89-88.46	1.26	0.85	1-2	1.05	2.10
England Women	USA Women	93.54-98.42	0.98	1.63	1-2	1.05	2.10
Netherlands Women	Sweden Women	93.56-89.6	1.15	0.92	1-0	1.05	0.00
England Women	Sweden Women	94.04-88.77	1.10	0.75	1-2	1.05	2.10
USA Women	Netherlands Women	98.24-93.96	1.71	1.08	2-0	2.10	0.00

Data Visualization - ggplot2

There is so much we can do with this package. Whether it’s building box plots, density plots, violin plots, tile plots, time series plots – you name it and ggplot2 has a function for it.Let’s see a few examples of how to create some really interactive plots with ggplot2 in R.

Visualization of scores

#Graph to show the scores  

g <- ggplot(wwc_19, aes(x = team1, y = team2, fill = score1-score2)) +
  geom_label(aes(label = paste(score1, score2, sep = "-")), fill = "white") +
  scale_x_discrete(limits = levels(wwc_19$team1), position = "bottom") +
  scale_y_discrete(limits = rev(levels(wwc_19$team2))) +
  theme (axis.text.x = element_text(angle = 60, hjust = 1))

g

only_scores <- wwc_19 %>%
  unite(score_comb, score1, score2, sep = "-") %>%
  unite(spi_comb, spi1, spi2, sep = "-")

heatmap <- ggplot(only_scores, mapping = aes(x = team2, y = team1, fill = score_comb))+
  geom_tile()+
  theme (axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(heatmap)

Visualization of SPI (Soccer Power Index) Rating

heatmap <- ggplot(only_scores, mapping = aes(x = team2, y = team1, fill = spi_comb))+
  geom_tile()+
  theme (axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(heatmap)

End Note

Tidyverse is the most popular collection of R packages. Which isn’t all that surprising given how useful and easy to use they are. It is consisting with a powerful collection of R packages for preparing, wrangling and visualizing data. Tidyverse has completely changed the way data science work with messy data.