knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(tidyr)nfl_elo <- read_csv('https://raw.githubusercontent.com/zachalexander/FALL2019TIDYVERSE/master/nfl_elo_latest.csv')Most of the time, data and information is not presented in a very easy-to-use format. As data scientists, a key part of our role is to manipulate data into formats that are simpler to understand and to use. Without clean and tidy data, it’s difficult to run analyses and visualize important points in order to draw valuable conclusions. Additionally, the data-tidying process is never static, as new questions arise and/or new data is collected, data need to be cleaned further.
The tidyr package is a great tool to utilize for these circumstances.
With built-in functions such as spread(), gather(), separate(), separate_rows(), unite(), drop_na(), fill(), and replace_na(), we can tidy data in many different ways.
Often times, you’ll need to change the layout of values in a table. Whether you are going from a wide dataset to long dataset, or vice versa, the gather() and spread() functions are very valuable for this process.
Gather
This function takes the values from the column names, turns them into a “key” column and gathers the values into a single value column.
As an example, we’ll take a look at the NFL Elo Ratings Dataset from FiveThirtyEight. We’ll select a few columns to simplify the view. Below, is a filtered dataset of Elo Ratings for all home teams for the 2019 season. I have removed games that have not been played yet (elo ratings as of 11/17/2019):
## # A tibble: 20 x 4
## date team elo1_pre elo1_post
## <date> <chr> <chr> <chr>
## 1 2019-09-08 ARI 1383.740 1385.565
## 2 2019-09-22 ARI 1379.762 1353.336
## 3 2019-09-29 ARI 1353.336 1336.548
## 4 2019-10-13 ARI 1354.952 1362.684
## 5 2019-10-31 ARI 1382.397 1375.141
## 6 2019-09-15 ATL 1499.424 1516.860
## 7 2019-09-29 ATL 1507.661 1474.769
## 8 2019-10-20 ATL 1446.419 1417.507
## 9 2019-10-27 ATL 1417.507 1401.747
## 10 2019-09-15 BAL 1598.558 1604.361
## 11 2019-09-29 BAL 1592.051 1544.287
## 12 2019-10-13 BAL 1560.892 1566.734
## 13 2019-11-03 BAL 1602.365 1637.031
## 14 2019-11-17 BAL 1648.927 1670.466
## 15 2019-09-22 BUF 1485.887 1494.388
## 16 2019-09-29 BUF 1494.388 1482.189
## 17 2019-10-20 BUF 1511.859 1520.440
## 18 2019-10-27 BUF 1520.440 1485.364
## 19 2019-11-03 BUF 1485.364 1497.455
## 20 2019-09-08 CAR 1519.379 1506.142
Now, in order to gather the pre and post elo ratings into one column, we can use the gather() function to do this. As you can see from the output, both elo rating values are now gathered into one row, and there is a key column of either pre or post elo ratings:
gather <- home_teams %>%
gather(elo1_pre, elo1_post, key = 'pre_post', value = 'elo_ratings') %>%
arrange(team, date)## # A tibble: 20 x 4
## date team pre_post elo_ratings
## <date> <chr> <chr> <chr>
## 1 2019-09-08 ARI elo1_pre 1383.740
## 2 2019-09-08 ARI elo1_post 1385.565
## 3 2019-09-22 ARI elo1_pre 1379.762
## 4 2019-09-22 ARI elo1_post 1353.336
## 5 2019-09-29 ARI elo1_pre 1353.336
## 6 2019-09-29 ARI elo1_post 1336.548
## 7 2019-10-13 ARI elo1_pre 1354.952
## 8 2019-10-13 ARI elo1_post 1362.684
## 9 2019-10-31 ARI elo1_pre 1382.397
## 10 2019-10-31 ARI elo1_post 1375.141
## 11 2019-09-15 ATL elo1_pre 1499.424
## 12 2019-09-15 ATL elo1_post 1516.860
## 13 2019-09-29 ATL elo1_pre 1507.661
## 14 2019-09-29 ATL elo1_post 1474.769
## 15 2019-10-20 ATL elo1_pre 1446.419
## 16 2019-10-20 ATL elo1_post 1417.507
## 17 2019-10-27 ATL elo1_pre 1417.507
## 18 2019-10-27 ATL elo1_post 1401.747
## 19 2019-09-15 BAL elo1_pre 1598.558
## 20 2019-09-15 BAL elo1_post 1604.361
Spread
In this same fashion, if we’d like to spread the key column back into separate columns, we can use the spread() function to move the unique values of a key into column names. Therefore, to continue our example, we can spread the elo1pre and elo1post values from the eloratings column, into their own columns based on the key values:
spread <- gather %>%
spread(pre_post, elo_ratings) %>%
arrange(team, date)## # A tibble: 20 x 4
## date team elo1_post elo1_pre
## <date> <chr> <chr> <chr>
## 1 2019-09-08 ARI 1385.565 1383.740
## 2 2019-09-22 ARI 1353.336 1379.762
## 3 2019-09-29 ARI 1336.548 1353.336
## 4 2019-10-13 ARI 1362.684 1354.952
## 5 2019-10-31 ARI 1375.141 1382.397
## 6 2019-09-15 ATL 1516.860 1499.424
## 7 2019-09-29 ATL 1474.769 1507.661
## 8 2019-10-20 ATL 1417.507 1446.419
## 9 2019-10-27 ATL 1401.747 1417.507
## 10 2019-09-15 BAL 1604.361 1598.558
## 11 2019-09-29 BAL 1544.287 1592.051
## 12 2019-10-13 BAL 1566.734 1560.892
## 13 2019-11-03 BAL 1637.031 1602.365
## 14 2019-11-17 BAL 1670.466 1648.927
## 15 2019-09-22 BUF 1494.388 1485.887
## 16 2019-09-29 BUF 1482.189 1494.388
## 17 2019-10-20 BUF 1520.440 1511.859
## 18 2019-10-27 BUF 1485.364 1520.440
## 19 2019-11-03 BUF 1497.455 1485.364
## 20 2019-09-08 CAR 1506.142 1519.379
Unite
If we’d like to combine the elo ratings, pre and post into one cell and one column per team, we could use the unite() function to do this, with a “/” as a separator:
unite <- spread %>%
unite(elo1_post, elo1_pre, col = "elo_ratings", sep = "/") %>%
arrange(team, date)## # A tibble: 20 x 3
## date team elo_ratings
## <date> <chr> <chr>
## 1 2019-09-08 ARI 1385.565/1383.740
## 2 2019-09-22 ARI 1353.336/1379.762
## 3 2019-09-29 ARI 1336.548/1353.336
## 4 2019-10-13 ARI 1362.684/1354.952
## 5 2019-10-31 ARI 1375.141/1382.397
## 6 2019-09-15 ATL 1516.860/1499.424
## 7 2019-09-29 ATL 1474.769/1507.661
## 8 2019-10-20 ATL 1417.507/1446.419
## 9 2019-10-27 ATL 1401.747/1417.507
## 10 2019-09-15 BAL 1604.361/1598.558
## 11 2019-09-29 BAL 1544.287/1592.051
## 12 2019-10-13 BAL 1566.734/1560.892
## 13 2019-11-03 BAL 1637.031/1602.365
## 14 2019-11-17 BAL 1670.466/1648.927
## 15 2019-09-22 BUF 1494.388/1485.887
## 16 2019-09-29 BUF 1482.189/1494.388
## 17 2019-10-20 BUF 1520.440/1511.859
## 18 2019-10-27 BUF 1485.364/1520.440
## 19 2019-11-03 BUF 1497.455/1485.364
## 20 2019-09-08 CAR 1506.142/1519.379
Separate
Similarly, if we’d like to move these values back out into separate columns, we can use the separate() function to create the pre and post elo rating columns:
separate <- unite %>%
separate(elo_ratings, sep = "/", into = c("elo1_post", "elo1_pre")) %>%
arrange(team, date)## # A tibble: 20 x 4
## date team elo1_post elo1_pre
## <date> <chr> <chr> <chr>
## 1 2019-09-08 ARI 1385.565 1383.740
## 2 2019-09-22 ARI 1353.336 1379.762
## 3 2019-09-29 ARI 1336.548 1353.336
## 4 2019-10-13 ARI 1362.684 1354.952
## 5 2019-10-31 ARI 1375.141 1382.397
## 6 2019-09-15 ATL 1516.860 1499.424
## 7 2019-09-29 ATL 1474.769 1507.661
## 8 2019-10-20 ATL 1417.507 1446.419
## 9 2019-10-27 ATL 1401.747 1417.507
## 10 2019-09-15 BAL 1604.361 1598.558
## 11 2019-09-29 BAL 1544.287 1592.051
## 12 2019-10-13 BAL 1566.734 1560.892
## 13 2019-11-03 BAL 1637.031 1602.365
## 14 2019-11-17 BAL 1670.466 1648.927
## 15 2019-09-22 BUF 1494.388 1485.887
## 16 2019-09-29 BUF 1482.189 1494.388
## 17 2019-10-20 BUF 1520.440 1511.859
## 18 2019-10-27 BUF 1485.364 1520.440
## 19 2019-11-03 BUF 1497.455 1485.364
## 20 2019-09-08 CAR 1506.142 1519.379
Separate Rows
Another option, rather than separating the values out into unique columns of pre and post elo ratings, is to create one new column with the pre and post ratings per date and team, similar to the gather() function, but without a key column. For this, we can use the separate_rows() function:
separate_rows <- unite %>%
separate_rows(elo_ratings, sep = "/") %>%
arrange(team, date)## # A tibble: 20 x 3
## date team elo_ratings
## <date> <chr> <chr>
## 1 2019-09-08 ARI 1385.565
## 2 2019-09-08 ARI 1383.740
## 3 2019-09-22 ARI 1353.336
## 4 2019-09-22 ARI 1379.762
## 5 2019-09-29 ARI 1336.548
## 6 2019-09-29 ARI 1353.336
## 7 2019-10-13 ARI 1362.684
## 8 2019-10-13 ARI 1354.952
## 9 2019-10-31 ARI 1375.141
## 10 2019-10-31 ARI 1382.397
## 11 2019-09-15 ATL 1516.860
## 12 2019-09-15 ATL 1499.424
## 13 2019-09-29 ATL 1474.769
## 14 2019-09-29 ATL 1507.661
## 15 2019-10-20 ATL 1417.507
## 16 2019-10-20 ATL 1446.419
## 17 2019-10-27 ATL 1401.747
## 18 2019-10-27 ATL 1417.507
## 19 2019-09-15 BAL 1604.361
## 20 2019-09-15 BAL 1598.558
Many times, there will be information in a dataset that is missing. Whether this is intentional, or by mistake, we can use a few tidyr functions to clean up these missing values.
To demonstrate, we’ll extend the example eloratings dataset to include elo ratings for games that have not occurred yet for NFL home teams in the 2019 season.
head(home_teams_future, n = 20L)## # A tibble: 20 x 4
## date team elo1_pre elo1_post
## <date> <chr> <chr> <chr>
## 1 2019-09-08 ARI 1384 1386
## 2 2019-09-22 ARI 1380 1353
## 3 2019-09-29 ARI 1353 1337
## 4 2019-10-13 ARI 1355 1363
## 5 2019-10-31 ARI 1382 1375
## 6 2019-12-01 ARI 1360 <NA>
## 7 2019-12-08 ARI 1360 <NA>
## 8 2019-12-15 ARI 1360 <NA>
## 9 2019-09-15 ATL 1499 1517
## 10 2019-09-29 ATL 1508 1475
## 11 2019-10-20 ATL 1446 1418
## 12 2019-10-27 ATL 1418 1402
## 13 2019-11-24 ATL 1510 <NA>
## 14 2019-11-28 ATL 1510 <NA>
## 15 2019-12-08 ATL 1510 <NA>
## 16 2019-12-22 ATL 1510 <NA>
## 17 2019-09-15 BAL 1599 1604
## 18 2019-09-29 BAL 1592 1544
## 19 2019-10-13 BAL 1561 1567
## 20 2019-11-03 BAL 1602 1637
Replace NA
In certain instances, it will be necessary to fill in a value for missing values. You can do this easily with the replace_na() function.
Continuing with our example, we can fill in all missing values (NA), with “Game hasn’t happened yet”, in order to better inform users as to why this data is missing:
replace_na <- home_teams_future %>%
replace_na(list(elo1_post = "Game hasn't happened yet")) %>%
arrange(team, date)## # A tibble: 20 x 4
## date team elo1_pre elo1_post
## <date> <chr> <chr> <chr>
## 1 2019-09-08 ARI 1384 1386
## 2 2019-09-22 ARI 1380 1353
## 3 2019-09-29 ARI 1353 1337
## 4 2019-10-13 ARI 1355 1363
## 5 2019-10-31 ARI 1382 1375
## 6 2019-12-01 ARI 1360 Game hasn't happened yet
## 7 2019-12-08 ARI 1360 Game hasn't happened yet
## 8 2019-12-15 ARI 1360 Game hasn't happened yet
## 9 2019-09-15 ATL 1499 1517
## 10 2019-09-29 ATL 1508 1475
## 11 2019-10-20 ATL 1446 1418
## 12 2019-10-27 ATL 1418 1402
## 13 2019-11-24 ATL 1510 Game hasn't happened yet
## 14 2019-11-28 ATL 1510 Game hasn't happened yet
## 15 2019-12-08 ATL 1510 Game hasn't happened yet
## 16 2019-12-22 ATL 1510 Game hasn't happened yet
## 17 2019-09-15 BAL 1599 1604
## 18 2019-09-29 BAL 1592 1544
## 19 2019-10-13 BAL 1561 1567
## 20 2019-11-03 BAL 1602 1637
Fill NA
In some instances, if there are missing values that should be the same value as the most recent non-NA value in the column, you can also use the fill() function to resolve this.
We will do this with our current example, although it will not reflect accurate information for the future elo ratings:
fill <- home_teams_future %>%
fill(elo1_post) %>%
arrange(team, date)## # A tibble: 20 x 4
## date team elo1_pre elo1_post
## <date> <chr> <chr> <chr>
## 1 2019-09-08 ARI 1384 1386
## 2 2019-09-22 ARI 1380 1353
## 3 2019-09-29 ARI 1353 1337
## 4 2019-10-13 ARI 1355 1363
## 5 2019-10-31 ARI 1382 1375
## 6 2019-12-01 ARI 1360 1375
## 7 2019-12-08 ARI 1360 1375
## 8 2019-12-15 ARI 1360 1375
## 9 2019-09-15 ATL 1499 1517
## 10 2019-09-29 ATL 1508 1475
## 11 2019-10-20 ATL 1446 1418
## 12 2019-10-27 ATL 1418 1402
## 13 2019-11-24 ATL 1510 1402
## 14 2019-11-28 ATL 1510 1402
## 15 2019-12-08 ATL 1510 1402
## 16 2019-12-22 ATL 1510 1402
## 17 2019-09-15 BAL 1599 1604
## 18 2019-09-29 BAL 1592 1544
## 19 2019-10-13 BAL 1561 1567
## 20 2019-11-03 BAL 1602 1637
Drop NA
Finally, to remove NA values from a column, you can use the drop_na() function to do this in one line of code.
We will also do this with our current example, reverting this back to our hometeams dataset, which only shows elo ratings for past games for home teams in the 2019 season:
drop_na <- home_teams_future %>%
drop_na(elo1_post) %>%
arrange(team, date)## # A tibble: 20 x 4
## date team elo1_pre elo1_post
## <date> <chr> <chr> <chr>
## 1 2019-09-08 ARI 1384 1386
## 2 2019-09-22 ARI 1380 1353
## 3 2019-09-29 ARI 1353 1337
## 4 2019-10-13 ARI 1355 1363
## 5 2019-10-31 ARI 1382 1375
## 6 2019-09-15 ATL 1499 1517
## 7 2019-09-29 ATL 1508 1475
## 8 2019-10-20 ATL 1446 1418
## 9 2019-10-27 ATL 1418 1402
## 10 2019-09-15 BAL 1599 1604
## 11 2019-09-29 BAL 1592 1544
## 12 2019-10-13 BAL 1561 1567
## 13 2019-11-03 BAL 1602 1637
## 14 2019-11-17 BAL 1649 1670
## 15 2019-09-22 BUF 1486 1494
## 16 2019-09-29 BUF 1494 1482
## 17 2019-10-20 BUF 1512 1520
## 18 2019-10-27 BUF 1520 1485
## 19 2019-11-03 BUF 1485 1497
## 20 2019-09-08 CAR 1519 1506