DATA 607 - TidyVerse Vignette (tidyr)

Working with the tidyr package

Most of the time, data and information is not presented in a very easy-to-use format. As data scientists, a key part of our role is to manipulate data into formats that are simpler to understand and to use. Without clean and tidy data, it’s difficult to run analyses and visualize important points in order to draw valuable conclusions. Additionally, the data-tidying process is never static, as new questions arise and/or new data is collected, data need to be cleaned further.

The tidyr package is a great tool to utilize for these circumstances.

With built-in functions such as spread(), gather(), separate(), separate_rows(), unite(), drop_na(), fill(), and replace_na(), we can tidy data in many different ways.

Reshaping Data

Often times, you’ll need to change the layout of values in a table. Whether you are going from a wide dataset to long dataset, or vice versa, the gather() and spread() functions are very valuable for this process.

Gather

This function takes the values from the column names, turns them into a “key” column and gathers the values into a single value column.

As an example, we’ll take a look at the NFL Elo Ratings Dataset from FiveThirtyEight. We’ll select a few columns to simplify the view. Below, is a filtered dataset of Elo Ratings for all home teams for the 2019 season. I have removed games that have not been played yet (elo ratings as of 11/17/2019):

## # A tibble: 20 x 4
##    date       team  elo1_pre elo1_post
##    <date>     <chr> <chr>    <chr>    
##  1 2019-09-08 ARI   1383.740 1385.565 
##  2 2019-09-22 ARI   1379.762 1353.336 
##  3 2019-09-29 ARI   1353.336 1336.548 
##  4 2019-10-13 ARI   1354.952 1362.684 
##  5 2019-10-31 ARI   1382.397 1375.141 
##  6 2019-09-15 ATL   1499.424 1516.860 
##  7 2019-09-29 ATL   1507.661 1474.769 
##  8 2019-10-20 ATL   1446.419 1417.507 
##  9 2019-10-27 ATL   1417.507 1401.747 
## 10 2019-09-15 BAL   1598.558 1604.361 
## 11 2019-09-29 BAL   1592.051 1544.287 
## 12 2019-10-13 BAL   1560.892 1566.734 
## 13 2019-11-03 BAL   1602.365 1637.031 
## 14 2019-11-17 BAL   1648.927 1670.466 
## 15 2019-09-22 BUF   1485.887 1494.388 
## 16 2019-09-29 BUF   1494.388 1482.189 
## 17 2019-10-20 BUF   1511.859 1520.440 
## 18 2019-10-27 BUF   1520.440 1485.364 
## 19 2019-11-03 BUF   1485.364 1497.455 
## 20 2019-09-08 CAR   1519.379 1506.142

Now, in order to gather the pre and post elo ratings into one column, we can use the gather() function to do this. As you can see from the output, both elo rating values are now gathered into one row, and there is a key column of either pre or post elo ratings:

gather <- home_teams %>% 
  gather(elo1_pre, elo1_post, key = 'pre_post', value = 'elo_ratings') %>% 
  arrange(team, date)

## # A tibble: 20 x 4
##    date       team  pre_post  elo_ratings
##    <date>     <chr> <chr>     <chr>      
##  1 2019-09-08 ARI   elo1_pre  1383.740   
##  2 2019-09-08 ARI   elo1_post 1385.565   
##  3 2019-09-22 ARI   elo1_pre  1379.762   
##  4 2019-09-22 ARI   elo1_post 1353.336   
##  5 2019-09-29 ARI   elo1_pre  1353.336   
##  6 2019-09-29 ARI   elo1_post 1336.548   
##  7 2019-10-13 ARI   elo1_pre  1354.952   
##  8 2019-10-13 ARI   elo1_post 1362.684   
##  9 2019-10-31 ARI   elo1_pre  1382.397   
## 10 2019-10-31 ARI   elo1_post 1375.141   
## 11 2019-09-15 ATL   elo1_pre  1499.424   
## 12 2019-09-15 ATL   elo1_post 1516.860   
## 13 2019-09-29 ATL   elo1_pre  1507.661   
## 14 2019-09-29 ATL   elo1_post 1474.769   
## 15 2019-10-20 ATL   elo1_pre  1446.419   
## 16 2019-10-20 ATL   elo1_post 1417.507   
## 17 2019-10-27 ATL   elo1_pre  1417.507   
## 18 2019-10-27 ATL   elo1_post 1401.747   
## 19 2019-09-15 BAL   elo1_pre  1598.558   
## 20 2019-09-15 BAL   elo1_post 1604.361

Spread

In this same fashion, if we’d like to spread the key column back into separate columns, we can use the spread() function to move the unique values of a key into column names. Therefore, to continue our example, we can spread the elo1pre and elo1post values from the eloratings column, into their own columns based on the key values:

spread <- gather %>% 
  spread(pre_post, elo_ratings) %>% 
  arrange(team, date)

## # A tibble: 20 x 4
##    date       team  elo1_post elo1_pre
##    <date>     <chr> <chr>     <chr>   
##  1 2019-09-08 ARI   1385.565  1383.740
##  2 2019-09-22 ARI   1353.336  1379.762
##  3 2019-09-29 ARI   1336.548  1353.336
##  4 2019-10-13 ARI   1362.684  1354.952
##  5 2019-10-31 ARI   1375.141  1382.397
##  6 2019-09-15 ATL   1516.860  1499.424
##  7 2019-09-29 ATL   1474.769  1507.661
##  8 2019-10-20 ATL   1417.507  1446.419
##  9 2019-10-27 ATL   1401.747  1417.507
## 10 2019-09-15 BAL   1604.361  1598.558
## 11 2019-09-29 BAL   1544.287  1592.051
## 12 2019-10-13 BAL   1566.734  1560.892
## 13 2019-11-03 BAL   1637.031  1602.365
## 14 2019-11-17 BAL   1670.466  1648.927
## 15 2019-09-22 BUF   1494.388  1485.887
## 16 2019-09-29 BUF   1482.189  1494.388
## 17 2019-10-20 BUF   1520.440  1511.859
## 18 2019-10-27 BUF   1485.364  1520.440
## 19 2019-11-03 BUF   1497.455  1485.364
## 20 2019-09-08 CAR   1506.142  1519.379

Unite

If we’d like to combine the elo ratings, pre and post into one cell and one column per team, we could use the unite() function to do this, with a “/” as a separator:

unite <- spread %>% 
  unite(elo1_post, elo1_pre, col = "elo_ratings", sep = "/") %>% 
  arrange(team, date)

## # A tibble: 20 x 3
##    date       team  elo_ratings      
##    <date>     <chr> <chr>            
##  1 2019-09-08 ARI   1385.565/1383.740
##  2 2019-09-22 ARI   1353.336/1379.762
##  3 2019-09-29 ARI   1336.548/1353.336
##  4 2019-10-13 ARI   1362.684/1354.952
##  5 2019-10-31 ARI   1375.141/1382.397
##  6 2019-09-15 ATL   1516.860/1499.424
##  7 2019-09-29 ATL   1474.769/1507.661
##  8 2019-10-20 ATL   1417.507/1446.419
##  9 2019-10-27 ATL   1401.747/1417.507
## 10 2019-09-15 BAL   1604.361/1598.558
## 11 2019-09-29 BAL   1544.287/1592.051
## 12 2019-10-13 BAL   1566.734/1560.892
## 13 2019-11-03 BAL   1637.031/1602.365
## 14 2019-11-17 BAL   1670.466/1648.927
## 15 2019-09-22 BUF   1494.388/1485.887
## 16 2019-09-29 BUF   1482.189/1494.388
## 17 2019-10-20 BUF   1520.440/1511.859
## 18 2019-10-27 BUF   1485.364/1520.440
## 19 2019-11-03 BUF   1497.455/1485.364
## 20 2019-09-08 CAR   1506.142/1519.379

Separate

Similarly, if we’d like to move these values back out into separate columns, we can use the separate() function to create the pre and post elo rating columns:

separate <- unite %>% 
  separate(elo_ratings, sep = "/", into = c("elo1_post", "elo1_pre")) %>% 
  arrange(team, date)

## # A tibble: 20 x 4
##    date       team  elo1_post elo1_pre
##    <date>     <chr> <chr>     <chr>   
##  1 2019-09-08 ARI   1385.565  1383.740
##  2 2019-09-22 ARI   1353.336  1379.762
##  3 2019-09-29 ARI   1336.548  1353.336
##  4 2019-10-13 ARI   1362.684  1354.952
##  5 2019-10-31 ARI   1375.141  1382.397
##  6 2019-09-15 ATL   1516.860  1499.424
##  7 2019-09-29 ATL   1474.769  1507.661
##  8 2019-10-20 ATL   1417.507  1446.419
##  9 2019-10-27 ATL   1401.747  1417.507
## 10 2019-09-15 BAL   1604.361  1598.558
## 11 2019-09-29 BAL   1544.287  1592.051
## 12 2019-10-13 BAL   1566.734  1560.892
## 13 2019-11-03 BAL   1637.031  1602.365
## 14 2019-11-17 BAL   1670.466  1648.927
## 15 2019-09-22 BUF   1494.388  1485.887
## 16 2019-09-29 BUF   1482.189  1494.388
## 17 2019-10-20 BUF   1520.440  1511.859
## 18 2019-10-27 BUF   1485.364  1520.440
## 19 2019-11-03 BUF   1497.455  1485.364
## 20 2019-09-08 CAR   1506.142  1519.379

Separate Rows

Another option, rather than separating the values out into unique columns of pre and post elo ratings, is to create one new column with the pre and post ratings per date and team, similar to the gather() function, but without a key column. For this, we can use the separate_rows() function:

separate_rows <- unite %>% 
  separate_rows(elo_ratings, sep = "/") %>% 
  arrange(team, date)

## # A tibble: 20 x 3
##    date       team  elo_ratings
##    <date>     <chr> <chr>      
##  1 2019-09-08 ARI   1385.565   
##  2 2019-09-08 ARI   1383.740   
##  3 2019-09-22 ARI   1353.336   
##  4 2019-09-22 ARI   1379.762   
##  5 2019-09-29 ARI   1336.548   
##  6 2019-09-29 ARI   1353.336   
##  7 2019-10-13 ARI   1362.684   
##  8 2019-10-13 ARI   1354.952   
##  9 2019-10-31 ARI   1375.141   
## 10 2019-10-31 ARI   1382.397   
## 11 2019-09-15 ATL   1516.860   
## 12 2019-09-15 ATL   1499.424   
## 13 2019-09-29 ATL   1474.769   
## 14 2019-09-29 ATL   1507.661   
## 15 2019-10-20 ATL   1417.507   
## 16 2019-10-20 ATL   1446.419   
## 17 2019-10-27 ATL   1401.747   
## 18 2019-10-27 ATL   1417.507   
## 19 2019-09-15 BAL   1604.361   
## 20 2019-09-15 BAL   1598.558

Handling Missing Values

Many times, there will be information in a dataset that is missing. Whether this is intentional, or by mistake, we can use a few tidyr functions to clean up these missing values.

To demonstrate, we’ll extend the example eloratings dataset to include elo ratings for games that have not occurred yet for NFL home teams in the 2019 season.

head(home_teams_future, n = 20L)

## # A tibble: 20 x 4
##    date       team  elo1_pre elo1_post
##    <date>     <chr> <chr>    <chr>    
##  1 2019-09-08 ARI   1384     1386     
##  2 2019-09-22 ARI   1380     1353     
##  3 2019-09-29 ARI   1353     1337     
##  4 2019-10-13 ARI   1355     1363     
##  5 2019-10-31 ARI   1382     1375     
##  6 2019-12-01 ARI   1360     <NA>     
##  7 2019-12-08 ARI   1360     <NA>     
##  8 2019-12-15 ARI   1360     <NA>     
##  9 2019-09-15 ATL   1499     1517     
## 10 2019-09-29 ATL   1508     1475     
## 11 2019-10-20 ATL   1446     1418     
## 12 2019-10-27 ATL   1418     1402     
## 13 2019-11-24 ATL   1510     <NA>     
## 14 2019-11-28 ATL   1510     <NA>     
## 15 2019-12-08 ATL   1510     <NA>     
## 16 2019-12-22 ATL   1510     <NA>     
## 17 2019-09-15 BAL   1599     1604     
## 18 2019-09-29 BAL   1592     1544     
## 19 2019-10-13 BAL   1561     1567     
## 20 2019-11-03 BAL   1602     1637

Replace NA

In certain instances, it will be necessary to fill in a value for missing values. You can do this easily with the replace_na() function.

Continuing with our example, we can fill in all missing values (NA), with “Game hasn’t happened yet”, in order to better inform users as to why this data is missing:

replace_na <- home_teams_future %>% 
  replace_na(list(elo1_post = "Game hasn't happened yet")) %>% 
  arrange(team, date)

## # A tibble: 20 x 4
##    date       team  elo1_pre elo1_post               
##    <date>     <chr> <chr>    <chr>                   
##  1 2019-09-08 ARI   1384     1386                    
##  2 2019-09-22 ARI   1380     1353                    
##  3 2019-09-29 ARI   1353     1337                    
##  4 2019-10-13 ARI   1355     1363                    
##  5 2019-10-31 ARI   1382     1375                    
##  6 2019-12-01 ARI   1360     Game hasn't happened yet
##  7 2019-12-08 ARI   1360     Game hasn't happened yet
##  8 2019-12-15 ARI   1360     Game hasn't happened yet
##  9 2019-09-15 ATL   1499     1517                    
## 10 2019-09-29 ATL   1508     1475                    
## 11 2019-10-20 ATL   1446     1418                    
## 12 2019-10-27 ATL   1418     1402                    
## 13 2019-11-24 ATL   1510     Game hasn't happened yet
## 14 2019-11-28 ATL   1510     Game hasn't happened yet
## 15 2019-12-08 ATL   1510     Game hasn't happened yet
## 16 2019-12-22 ATL   1510     Game hasn't happened yet
## 17 2019-09-15 BAL   1599     1604                    
## 18 2019-09-29 BAL   1592     1544                    
## 19 2019-10-13 BAL   1561     1567                    
## 20 2019-11-03 BAL   1602     1637

Fill NA

In some instances, if there are missing values that should be the same value as the most recent non-NA value in the column, you can also use the fill() function to resolve this.

We will do this with our current example, although it will not reflect accurate information for the future elo ratings:

fill <- home_teams_future %>% 
  fill(elo1_post) %>% 
  arrange(team, date)

## # A tibble: 20 x 4
##    date       team  elo1_pre elo1_post
##    <date>     <chr> <chr>    <chr>    
##  1 2019-09-08 ARI   1384     1386     
##  2 2019-09-22 ARI   1380     1353     
##  3 2019-09-29 ARI   1353     1337     
##  4 2019-10-13 ARI   1355     1363     
##  5 2019-10-31 ARI   1382     1375     
##  6 2019-12-01 ARI   1360     1375     
##  7 2019-12-08 ARI   1360     1375     
##  8 2019-12-15 ARI   1360     1375     
##  9 2019-09-15 ATL   1499     1517     
## 10 2019-09-29 ATL   1508     1475     
## 11 2019-10-20 ATL   1446     1418     
## 12 2019-10-27 ATL   1418     1402     
## 13 2019-11-24 ATL   1510     1402     
## 14 2019-11-28 ATL   1510     1402     
## 15 2019-12-08 ATL   1510     1402     
## 16 2019-12-22 ATL   1510     1402     
## 17 2019-09-15 BAL   1599     1604     
## 18 2019-09-29 BAL   1592     1544     
## 19 2019-10-13 BAL   1561     1567     
## 20 2019-11-03 BAL   1602     1637

Drop NA

Finally, to remove NA values from a column, you can use the drop_na() function to do this in one line of code.

We will also do this with our current example, reverting this back to our hometeams dataset, which only shows elo ratings for past games for home teams in the 2019 season:

drop_na <- home_teams_future %>% 
  drop_na(elo1_post) %>% 
  arrange(team, date)

## # A tibble: 20 x 4
##    date       team  elo1_pre elo1_post
##    <date>     <chr> <chr>    <chr>    
##  1 2019-09-08 ARI   1384     1386     
##  2 2019-09-22 ARI   1380     1353     
##  3 2019-09-29 ARI   1353     1337     
##  4 2019-10-13 ARI   1355     1363     
##  5 2019-10-31 ARI   1382     1375     
##  6 2019-09-15 ATL   1499     1517     
##  7 2019-09-29 ATL   1508     1475     
##  8 2019-10-20 ATL   1446     1418     
##  9 2019-10-27 ATL   1418     1402     
## 10 2019-09-15 BAL   1599     1604     
## 11 2019-09-29 BAL   1592     1544     
## 12 2019-10-13 BAL   1561     1567     
## 13 2019-11-03 BAL   1602     1637     
## 14 2019-11-17 BAL   1649     1670     
## 15 2019-09-22 BUF   1486     1494     
## 16 2019-09-29 BUF   1494     1482     
## 17 2019-10-20 BUF   1512     1520     
## 18 2019-10-27 BUF   1520     1485     
## 19 2019-11-03 BUF   1485     1497     
## 20 2019-09-08 CAR   1519     1506

DATA 607 - TidyVerse Vignette (tidyr)

Zach Alexander

November 19, 2019

Libraries loaded

Loading in the dataset from FiveThirtyEight

Working with the tidyr package

Reshaping Data

Handling Missing Values