Pivoting Homework

Overall Summary on Code Implementation:

We saw two types of pivoting in this code, longer and wider pivoting
The longer pivot is useful when a value that would rather make more sense as a variable is instead included as a column in the tibble. Longer pivoting may cause duplication of a few values in some columns but eventually makes it easier to plot and visualize the data set.
The wider pivot is useful in the inverse scenario of the longer pivot. When a value has been repeated in multiple instances, and each of it is associated with a different variable, it would make help more if we extract it into a column of its own for better readability and comparison. This also makes it easy to visualize it in a plot given that it can be given its own axis or dimension.

Note to grader:

The inline-comments on the code chunks have been titled and numbered with H3 headers and the comments have been italicized for faster identification.

library(tidyverse)
theme_set(theme_bw())

Pivoting refers to the process of changing the interpretation of each row in a data frame. It is useful for addressing problems 1 - 2 in the previous lecture, which we repeat here for completeness.
1. A variable might be implicitly stored within column names, rather than explicitly stored in its own column.
2. The same observation may appear in multiple rows, where each instance of the row is associated with a different variable.
To address (a), we can use the pivot_longer function in tidyr. It takes an implicitly stored variable and explicitly stores it in a column defined by the names_to argument. In
The example below shows pivot_longer being used to tidy one of the non-tidy tuberculosis data sets. Note that the data has doubled in length, because there are now two rows per country (one per year).

For reference, these are the original data.

table4a

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

This step lengthens the data,

table4a_longer <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

table4a_longer

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Comment #1:

Here we are reshaping the tale into a longer table by including year as a variable instead of a column name and assigning its corresponding values to a new column by the name ‘cases’.

dim(table4a)

[1] 3 3

dim(table4a_longer)

[1] 6 3

We can pivot both the population and the cases table, then combine them using a join operation. A join operation matches rows across two tables according to their shared columns.

# helper function, to avoid copying and pasting code
pivot_fun <- function(x, value_column = "cases") {
  x %>%
    pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = value_column)
}

table4 <- left_join(
  pivot_fun(table4a), # look for all country x year combinations in left table
  pivot_fun(table4b, "population") # and find matching rows in right table
)
table4

# A tibble: 6 × 4
  country     year   cases population
  <chr>       <chr>  <dbl>      <dbl>
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3 Brazil      1999   37737  172006362
4 Brazil      2000   80488  174504898
5 China       1999  212258 1272915272
6 China       2000  213766 1280428583

Comment #2:

Here we are joining the population and cases table to compare them against each other and plot them together in a graph. The left join ensures that we only pick entries from the population table only if the corresponding entries exist in the cases table.

This lets us make the year vs. rate plot that we had tried to put together in the last lecture. It’s much easier to recognize trends when comparing the rates, than when looking at the raw case counts.

ggplot(table4, aes(x = year, y = cases / population, col = country)) +
  geom_point() +
  geom_line(aes(group = country))

To address (b), we can use the pivot_wider function. It spreads the column in the values_from argument across new columns specified by the names_from argument.
The example below shows pivot_wider being used to tidy one of the other non-tidy datasets. Note when there are more than two levels in the names_from column, this will always be wider than the starting data frame, which is why this operation is called pivot_wider.

For reference, here is table2 before pivoting.

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Now, we spread the cases and population variables into their own columns.

table2 %>%
    pivot_wider(names_from = type, values_from = count)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Comment #3:

Here we are reshaping the table2 into a wider table that has its own columns for cases and population as the original table is difficult to follow and draw insights from when we need to compare an instance against another.