suppressPackageStartupMessages(library("tidyverse"))

1. Why are pivot_longer() and pivot_wider() not perfectly symmetrical? Carefully consider the following example:

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")

(Hint: look at the variable types and think about column names.) pivot_longer() has a names_ptype argument, e.g. names_ptype = list(year = double()). What does it do?

The functions pivot_longer() and pivot_wider() are not perfectly symmetrical because column type information is lost. When pivot_wider() moves the values of year into column names, they are converted to strings, since column names are always character. Later, when pivot_longer() turns those column names back into a column, it does not know the original data type of the variable.

stocks %>% 
  pivot_wider(names_from = year, values_from = return)

The following use of pivot_longer() will create a column, year, from the column names. However, since column names are used to create the names_to column, the resulting vector is a character vector. Even though we “know” that all the column names refer to years, which are numbers, the pivot_longer() function does not know that.

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
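
One way to confirm the change of type is to compare typeof() before and after the round trip. The sketch below repeats the stocks example; round_trip is a name introduced here purely for illustration.

```r
suppressPackageStartupMessages(library(tidyverse))

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

# Widen and then lengthen again, as in the exercise
# (round_trip is an illustrative name, not part of the exercise)
round_trip <- stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")

typeof(stocks$year)      # type of year in the original data
typeof(round_trip$year)  # type of year after the round trip

# The column order is also different after the round trip
names(stocks)
names(round_trip)
```

Besides the type of year, the order of the columns and rows differs from the original, which is another reason the two functions are not exact inverses.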

names_ptype in pivot_longer() is a list of column name-prototype pairs. A prototype (or ptype for short) is a zero-length vector (like integer() or numeric()) that defines the type, class, and attributes of a vector. If not specified, the type of the columns generated from names_to will be character, and the type of the variables generated from values_to will be the common type of the input columns used to generate them.

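The functions pivot_wider() and pivot_longer() are almost symmetrical if we use the names_ptype argument. When names_ptype = list(year = double()), the pivot_longer() function will attempt to convert year to double. (This describes tidyr 1.0.0; from tidyr 1.1.0 onward, names_ptype only checks the type, and the names_transform argument performs the conversion.)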

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return", names_ptype = list(year = double()))

Now, year is a double vector. However, pivot_longer() only knows the type because we supplied it. The original type information itself is lost, so the round trip will not always recover the original variable types.
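
In more recent versions of tidyr, the conversion is requested with names_transform rather than names_ptype. The following is a minimal sketch assuming tidyr 1.1.0 or later; restored is a name introduced here for illustration.

```r
suppressPackageStartupMessages(library(tidyverse))

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

# Assumes tidyr >= 1.1.0, where names_transform applies a function
# to the column created from the old column names
restored <- stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(
    `2015`:`2016`,
    names_to = "year",
    values_to = "return",
    names_transform = list(year = as.double)
  )

typeof(restored$year)
```

We still have to tell pivot_longer() which conversion to apply, so the type is supplied by us rather than recovered from the data.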

2. Why does this code fail?

# table4a %>%
#   pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
#> Error in inds_combine(.vars, ind_list): Position must be between 0 and n

The code fails because the column names 1999 and 2000 are non-syntactic variable names. A syntactic name must consist of letters, digits, . and _ but can’t begin with _ or a digit. Additionally, you can’t use any of the reserved words like TRUE, NULL, if, and function (see the complete list in ?Reserved). A name that doesn’t follow these rules is a non-syntactic name; if you try to use one without quoting it, R will not treat it as a name.

When selecting variables from a data frame, tidyverse functions will interpret numbers, like 1999 and 2000, as column numbers. In this case, pivot_longer() tries to select the 1999th and 2000th columns of the data frame. To select the columns named 1999 and 2000, you can either surround their names in backticks (`) or provide them as strings.

table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
table4a %>% 
  pivot_longer(c("1999", "2000"), names_to = "year", values_to = "cases")
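
As an informal check of which column names need backticks, base R's make.names() converts strings into syntactic names, so a name that it leaves unchanged is already syntactic. This is only a sketch; col_names and is_syntactic are names introduced here for illustration.

```r
suppressPackageStartupMessages(library(tidyverse))

# Column names of table4a, which ships with tidyr
col_names <- names(table4a)

# make.names() rewrites non-syntactic names (for example by adding a prefix),
# so a name that comes back unchanged is syntactic
is_syntactic <- make.names(col_names) == col_names

tibble(name = col_names, syntactic = is_syntactic)
```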

3. What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?

people <- tribble(
  ~name,             ~names,  ~values,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

glimpse(people)
#> Observations: 5
#> Variables: 3
#> $ name   <chr> "Phillip Woods", "Phillip Woods", "Phillip Woods", "Jessica Cordero", "Jessica Co...
#> $ names  <chr> "age", "height", "age", "age", "height"
#> $ values <dbl> 45, 186, 50, 37, 156

pivot_wider(people, names_from = names, values_from = values)
#> Values in `values` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(values = list)` to suppress this warning.
#> * Use `values_fn = list(values = length)` to identify where the duplicates arise
#> * Use `values_fn = list(values = summary_fun)` to summarise duplicates

Widening this data frame with pivot_wider() does not produce one value per cell. Instead it returns list-columns, with a warning, because the name and names columns do not uniquely identify rows. In particular, there are two rows with values for the age of “Phillip Woods”.
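
To locate the duplicated combinations before widening, we can count the rows for each pair of name and names. This is a minimal sketch using dplyr's count(); dups is a name introduced here for illustration.

```r
suppressPackageStartupMessages(library(tidyverse))

people <- tribble(
  ~name,             ~names,   ~values,
  "Phillip Woods",   "age",    45,
  "Phillip Woods",   "height", 186,
  "Phillip Woods",   "age",    50,
  "Jessica Cordero", "age",    37,
  "Jessica Cordero", "height", 156
)

# Count rows per (name, names) pair and keep the pairs that occur more than once
dups <- people %>%
  count(name, names) %>%
  filter(n > 1)
dups
```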

We could solve the problem by adding a column with a distinct observation count for each combination of name and names.

people2 <- people %>%
  group_by(name, names) %>%
  mutate(obs = row_number())
people2

We can pivot_wider() people2 because the combination of name and obs will uniquely identify the rows of the widened data frame.

pivot_wider(people2, names_from = names, values_from = values)

Another way to solve this problem is by keeping only distinct rows of the name and names values, and dropping duplicate rows.

people %>%
  distinct(name, names, .keep_all = TRUE) %>%
  pivot_wider(names_from = names, values_from = values)

However, before doing this you would want to understand why there are duplicates in the data to begin with. This is usually not merely a nuisance, but indicates deeper problems with the data.

4. Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

To tidy the preg tibble, we need to use pivot_longer(). The variables in this data are:

  • sex (“female”, “male”)
  • pregnant (“yes”, “no”)
  • count, which is a non-negative integer representing the number of observations.

The observations in this data are unique combinations of sex and pregnancy status.

preg_tidy <- preg %>%
  pivot_longer(c(male, female), names_to = "sex", values_to = "count")
preg_tidy

We can simplify the tidied data frame by removing the (male, pregnant) row with the missing value of NA.

preg_tidy2 <- preg %>%
  pivot_longer(c(male, female), names_to = "sex", values_to = "count", values_drop_na = TRUE)
preg_tidy2

This is an example of turning an explicit missing value into an implicit missing value, which is discussed in the upcoming Missing Values section. The missing (male, pregnant) row represents an implicit missing value because the value of count can be inferred from its absence. In the tidy data, we can represent rows with missing values of count either explicitly with an NA (as in preg_tidy) or implicitly by the absence of a row (as in preg_tidy2). But in the wide data, the missing values can only be represented explicitly.
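
The conversion also works in the other direction. As a sketch, tidyr's complete() can turn the implicit missing value back into an explicit NA by filling in every combination of pregnant and sex; preg_explicit is a name introduced here for illustration.

```r
suppressPackageStartupMessages(library(tidyverse))

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

# Tidy the data, dropping the NA so the missing value becomes implicit
preg_tidy2 <- preg %>%
  pivot_longer(c(male, female), names_to = "sex", values_to = "count",
               values_drop_na = TRUE)

# complete() adds a row for each missing combination and fills count with NA
preg_explicit <- preg_tidy2 %>%
  complete(pregnant, sex)
preg_explicit
```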

Though we have already done enough to make the data tidy, there are some other changes that can be made to clean this data. If a variable takes two values, like pregnant and sex, it is often preferable to store it as a logical vector.

preg_tidy3 <- preg_tidy2 %>%
  mutate(
    female = sex == "female",
    pregnant = pregnant == "yes"
  ) %>%
  select(female, pregnant, count)
preg_tidy3

In the previous data frame, I named the logical variable representing the sex female, not sex. This makes the meaning of the variable self-documenting. If the variable were named sex with values TRUE and FALSE, without reading the documentation, we wouldn’t know whether TRUE means male or female.

Apart from some minor memory savings, representing these variables as logical vectors results in clearer and more concise code. Compare the filter() calls to select non-pregnant females from preg_tidy2 and preg_tidy3.

filter(preg_tidy2, sex == "female", pregnant == "no")
filter(preg_tidy3, female, !pregnant)
