---
title: "Ch 10 - Tibbles"
output: html_notebook
---

#### 10.1 Introduction

Throughout this book we work with "tibbles" instead of R's traditional __data.frame__.
Tibbles *are* data frames, but they tweak some older behaviors to make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the __tibble__ package, which provides opinionated data frames that make working in the tidyverse a little easier. In most places, I'll use the terms tibble and data frame interchangeably; when I want to draw particular attention to R's built-in data frame, I'll call them __data.frame__s.

If this chapter leaves you wanting to learn more about tibbles, you might enjoy __vignette("tibble")__.

#### 10.1.1 Prerequisites

In this chapter we'll explore the __tibble__ package, part of the core tidyverse.

```{r}
library(tidyverse)
```

#### 10.2 Creating tibbles

Almost all of the functions that you'll use in this book produce tibbles, as tibbles are one of the unifying features of the tidyverse. Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with __as_tibble()__:

```{r}
as_tibble(iris)
```
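
One detail worth seeing when coercing: by default __as_tibble()__ discards row names. When the row names carry information, as they do in __mtcars__, the __rownames__ argument moves them into an ordinary column. A minimal sketch, where the column name __model__ is only an illustrative choice:

```{r}
library(tibble)

# mtcars keeps the car model in its row names
head(rownames(mtcars), 3)

# Plain coercion discards the row names ...
cars_plain <- as_tibble(mtcars)

# ... while `rownames =` keeps them as the first column
cars_named <- as_tibble(mtcars, rownames = "model")
names(cars_named)[1]
```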

You can create a new tibble from individual vectors with __tibble()__. __tibble()__ will automatically recycle inputs of length 1, and allows you to refer to variables that you just created, as shown below.

```{r}
tibble(
  x = 1:5,
  y = 1,
  z = x^2 + y
)
```
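
The recycling rule is stricter than it may look: only inputs of length 1 are recycled. The sketch below contrasts this with __data.frame()__, which also recycles a shorter vector whose length divides the longer one. The object names are made up for the illustration, and the error is caught so the chunk runs to completion.

```{r}
library(tibble)

# data.frame() recycles the length-2 vector to fill 4 rows
recycled_df <- data.frame(x = 1:4, y = 1:2)
nrow(recycled_df)

# tibble() refuses: only length-1 inputs are recycled
recycle_result <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) "tibble() raised an error"
)
recycle_result
```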

If you're already familiar with __data.frame()__, note that __tibble()__ does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.
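
The name handling is probably the easiest of these differences to see for yourself, because since R 4.0.0 __data.frame()__ no longer converts strings to factors by default either. A small sketch, using a made-up column name that contains a space:

```{r}
library(tibble)

# data.frame() rewrites the name into a syntactic one
name_df <- data.frame(`my var` = 1:2)
names(name_df)

# tibble() leaves the name exactly as given
name_tbl <- tibble(`my var` = 1:2)
names(name_tbl)
```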

It's possible for a tibble to have column names that are not valid R variable names, aka __non-syntactic__ names. For example, they might not start with a letter, or they might contain unusual characters like a space. To refer to these variables, you need to surround them with backticks, __`__:

```{r}
tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)
tb
```

You'll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
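
As a rough illustration of that point, the sketch below refers to non-syntactic columns with __$__, inside a dplyr verb, and with __[[__. The tibble and its column names are invented for the example; note that __[[__ takes the name as a string, so it needs no backticks.

```{r}
library(tibble)
library(dplyr)

tb_names <- tibble(`2000` = 1:3, `my col` = c("a", "b", "c"))

# Backticks with $
tb_names$`2000`

# Backticks inside a dplyr verb
doubled <- tb_names %>% mutate(doubled = `2000` * 2)
doubled

# [[ uses a plain string instead
tb_names[["my col"]]
```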

Another way to create a tibble is with __tribble()__, short for __tr__ansposed tibble. __tribble()__ is customised for data entry in code: column headings are defined by formulas (i.e. they start with __~__), and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy-to-read form.

```{r}
tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
```
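
To make the "transposed" idea concrete, here is the same data written both ways: row by row with __tribble()__ and column by column with __tibble()__. This is only a sketch with invented object names; the comparison at the end checks that the two layouts describe the same table.

```{r}
library(tibble)

# Row-wise entry: one line per observation
by_row <- tribble(
  ~x,  ~y, ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)

# Column-wise entry: one vector per variable
by_col <- tibble(
  x = c("a", "b"),
  y = c(2, 1),
  z = c(3.6, 8.5)
)

all.equal(by_row, by_col)
```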

#### 10.3 Tibbles vs. data.frame

There are two main differences in the usage of a tibble vs. a classic __data.frame__: printing and subsetting.

#### 10.3.1 Printing

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from __str()__:

```{r}
tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
```

Tibbles are designed so that you don't accidentally overwhelm your console when you print large data frames. But sometimes you need more output than the default display. There are a few options that can help.

First, you can explicitly __print()__ the data frame and control the number of rows (__n__) and the __width__ of the display. __width = Inf__ will display all columns. 

```{r}
nycflights13::flights %>%
  print(n = 10, width = Inf)
```

You can also control the default print behavior by setting options:

* __options(tibble.print_max = n, tibble.print_min = m)__: if more than __n__ rows, print only __m__ rows. Use __options(tibble.print_min = Inf)__ to always show all rows.

* Use __options(tibble.width = Inf)__ to always print all columns, regardless of the width of the screen.

You can see a complete list of options by looking at the package help with __package?tibble__.
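
A short sketch of how these options might be used in practice. __options()__ returns the previous values, so they can be restored afterwards; the thresholds 20 and 5 are arbitrary, and the exact printed output can differ between tibble versions.

```{r}
library(tibble)

# Remember the old settings while installing new ones
old_print_opts <- options(tibble.print_max = 20, tibble.print_min = 5)

# mtcars has 32 rows, which is more than print_max, so only print_min rows are shown
cars_tbl_print <- as_tibble(mtcars)
cars_tbl_print

# Put the previous settings back
options(old_print_opts)
```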

A final option is to use RStudio's built-in data viewer to get a scrollable view of the complete dataset. This is also often useful at the end of a long chain of manipulations.

```{r}
nycflights13::flights %>%
  View()
```

#### 10.3.2 Subsetting

So far all the tools you've learned have worked with complete data frames. If you want to pull out a single variable, you need some new tools, __$__ and __[[__. __[[__ can extract by name or position, __$__ only extracts by name but is a little less typing.

```{r}
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

# Extract by name
df$x

df[["x"]]

# Extract by position
df[[1]]
```

To use these in a pipe, you'll need to use the special placeholder __.__:

```{r}
df %>% .$x
df %>%
  .[["x"]]
```
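
An alternative that the text above does not cover, but that avoids the placeholder altogether, is __dplyr::pull()__. The sketch below uses a small made-up tibble with fixed values so the results are easy to compare.

```{r}
library(tibble)
library(dplyr)

df_demo <- tibble(
  x = c(0.1, 0.5, 0.9),
  y = c(10, 20, 30)
)

# pull() extracts a single column as a vector, by name ...
pulled_by_name <- df_demo %>% pull(x)
pulled_by_name

# ... or by position
pulled_by_position <- df_demo %>% pull(2)
pulled_by_position
```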

Compared to a __data.frame__, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
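
Both behaviors can be seen in a few lines. In this sketch (object names are illustrative) the warning from the tibble is captured with __tryCatch()__ so that its text can be inspected rather than just flashing by in the console.

```{r}
library(tibble)

base_df <- data.frame(value = 1:3)
strict_tbl <- tibble(value = 1:3)

# data.frame: `val` partially matches `value`, so the column is returned
base_df$val

# tibble: no partial matching; capture the warning text instead of printing it
warning_text <- tryCatch(
  strict_tbl$val,
  warning = function(w) conditionMessage(w)
)
warning_text
```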

#### 10.4 Interacting with older code

Some older functions don't work with tibbles. If you encounter one of these functions, use __as.data.frame()__ to turn a tibble back to a __data.frame__:

```{r}
class(as.data.frame(tb))
```

The main reason that some older functions don't work with tibbles is the __[__ function. We don't use __[__ much in this book because __dplyr::filter()__ and __dplyr::select()__ allow you to solve the same problems with clearer code (but you will learn a little about it in __vector subsetting__). With base R data frames, __[__ sometimes returns a data frame, and sometimes returns a vector. With tibbles, __[__ always returns another tibble.
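
The difference is easiest to see when selecting a single column. A minimal sketch with invented object names:

```{r}
library(tibble)

mixed_df <- data.frame(x = 1:3, y = c("a", "b", "c"))
mixed_tbl <- as_tibble(mixed_df)

# One column from a data.frame collapses to a vector ...
class(mixed_df[, "x"])

# ... two columns stay a data.frame
class(mixed_df[, c("x", "y")])

# A tibble stays a tibble either way
class(mixed_tbl[, "x"])
```

Code written for data frames that expects a vector back from the single-column case is what tends to break when handed a tibble.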

#### 10.5 Exercises

1. How can you tell if an object is a tibble? (Hint: try printing __mtcars__, which is a regular data frame).

```{r}
mtcars
as_tibble(mtcars)
```

On my screen there is not much of a noticeable difference. We know, however, that a data frame prints every column and does not cap the number of rows. Tibbles are more compact: by default (assuming no option overrides) they print only the first ten rows and only as many columns as fit on the screen.

Interestingly, both tables show the type of data in each column, which I did not expect from a data frame. This is a feature of the R Notebook, which renders data frames as paged tables; printed in the console, a regular data frame shows no column types.

You can also use the __is_tibble()__ function to check.

```{r}
is_tibble(mtcars)
```

More generally, you can use the __class()__ function to find out the class of an object. Tibbles have the classes __c("tbl_df", "tbl", "data.frame")__, while old data frames will only have the class __"data.frame"__.

If you are interested in reading more on R's classes, read the chapters on object oriented programming in __Advanced R__.
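
A quick sketch of the class-based check, using __inherits()__ from base R to test for the tibble class directly (the object name is made up):

```{r}
library(tibble)

cars_tbl <- as_tibble(mtcars)

class(mtcars)
class(cars_tbl)

# inherits() tests for a single class, much like is_tibble()
inherits(mtcars, "tbl_df")
inherits(cars_tbl, "tbl_df")
```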

2. Compare and contrast the following operations on a __data.frame__ and equivalent tibble. What is different? Why might the default data frame behaviors cause you frustration?

```{r}
df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]
df[, c("abc", "xyz")]
```

```{r}
df2 <- tibble(abc = 1, xyz = "a")
df2$x
df2[, "xyz"]
df2[, c("abc", "xyz")]
```

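When using the __$__ operator, a data frame does partial matching: it returns any column whose name begins with the text you typed, as long as the match is unique. So in the first example there is no column "x", but because there is a column "xyz", the data frame returns that column's values. This may save keystrokes, but it increases the risk of accidentally selecting a column you did not intend to. The tibble instead returns NULL with a warning. The __[__ calls differ too: the data frame returns a vector when one column is selected and a data frame when two are, whereas the tibble returns a tibble in both cases.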

3. If you have the name of a variable stored in an object, e.g. __var <- "mpg"__, how can you extract the reference variable from a tibble?

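Use __[[__, which accepts the name as a string held in a variable; __$var__ would look for a column literally called __var__.

```{r}
var <- "mpg"
as_tibble(mtcars)[[var]]
```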

4. Practice referring to non-syntactic names in the following data frame by:

    1. Extracting the variable called __1__.
    2. Plotting a scatterplot of __1__ vs __2__.
    3. Creating a new column called __3__ which is __2__ divided by __1__.
    4. Renaming the columns to __one__, __two__, and __three__.

```{r}
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)
```

```{r}
annoying[[1]]
annoying$`1`
annoying[["1"]]
```

```{r}
ggplot(annoying, aes(`1`, `2`)) +
  geom_point()
```

```{r}
annoying <- mutate(annoying, `3` = `2` / `1`)
```

```{r}
annoying
```

```{r}
annoying <- rename(annoying, one = `1`, two = `2`, three = `3`)
```

```{r}
annoying
```

5. What does __tibble::enframe()__ do? When might you use it?

```{r}
?tibble::enframe
```

enframe() converts vectors to data frames, and vice versa.

enframe() converts named atomic vectors or lists to one- or two-column data frames. For a list, the result will be a nested tibble with a column of type list. For unnamed vectors, the natural sequence is used as name column.

deframe() converts two-column data frames to a named vector or list, using the first column as name and the second column as value. If the input has only one column, an unnamed vector is returned.

```{r}
enframe(c(a = 1, b = 2, c = 3))
```

This would be useful if you had a named vector (or list) whose names and values you wanted as the two columns of a data frame. For example, if you had a named vector of car data, with one value per car, you could turn it into a tibble and then work with it using dplyr or ggplot2.
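
A small sketch of the round trip described in the help text, going from a named vector to a tibble and back. The vector is made up for the illustration.

```{r}
library(tibble)

named_vec <- c(a = 1, b = 2, c = 3)

# Names and values become two columns
framed <- enframe(named_vec)
names(framed)

# deframe() turns the two columns back into a named vector
unframed <- deframe(framed)
unframed
```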

6. What option controls how many additional column names are printed at the footer of a tibble?
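
One candidate answer, based on the options listed in the package help (__package?tibble__), is __tibble.max_extra_cols__. The sketch below is an assumption-laden illustration rather than a definitive answer: it builds a made-up wide tibble, limits the footer to three extra column names, and then restores the previous setting. Recent tibble releases delegate printing to the pillar package, which documents the same setting as __pillar.max_extra_cols__, so the output may vary by version.

```{r}
library(tibble)

# A one-row tibble with 30 columns, too wide for a narrow display
wide <- as_tibble(as.list(setNames(1:30, paste0("col", 1:30))))

# Limit how many of the columns that do not fit are named in the footer
old_extra_opts <- options(tibble.max_extra_cols = 3)
print(wide, width = 40)

# Restore the previous setting
options(old_extra_opts)
```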



