Memory usage

One of the reasons that dplyr is fast is that it is very careful about when it makes copies of columns. This vignette describes how this works, and gives you some useful tools for understanding the memory usage of data frames in R.

The first tool we’ll use is location(). It tells us three things about a data frame: the memory location of the object itself, of each column (variable), and of each attribute.

location(iris)
## <0x10888ca80>
## Variables:
##  * Sepal.Length: <0x1069f3000>
##  * Sepal.Width:  <0x106825800>
##  * Petal.Length: <0x106826600>
##  * Petal.Width:  <0x106961400>
##  * Species:      <0x103f298d0>
## Attributes:
##  * names:        <0x10888cad8>
##  * row.names:    <0x103f28340>
##  * class:        <0x108971868>

It’s useful to know the memory address, because if the address changes, then you know R has made a copy. R generally tries to avoid making copies. For example, if you just assign iris to another variable, it continues to point to the same location:

iris2 <- iris
location(iris2)
## <0x10888ca80>
## Variables:
##  * Sepal.Length: <0x1069f3000>
##  * Sepal.Width:  <0x106825800>
##  * Petal.Length: <0x106826600>
##  * Petal.Width:  <0x106961400>
##  * Species:      <0x103f298d0>
## Attributes:
##  * names:        <0x10888cad8>
##  * row.names:    <0x103f40ee0>
##  * class:        <0x108971868>

Rather than carefully comparing long memory locations, we can instead use the changes() function to highlight changes between two versions of a data frame. This shows us that iris and iris2 are identical: both names point to the same location in memory.

changes(iris2, iris)
## <identical>
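
The same check can be sketched without location() or changes(). The example below is only an illustration: it assumes the lobstr package is installed (lobstr is not used in the original text), and df and df2 are hypothetical names standing in for iris and iris2.

```r
# A minimal sketch, assuming the lobstr package is installed.
library(lobstr)

# Hypothetical example data, used instead of iris to keep the sketch small
df <- data.frame(x = c(1.5, 2.5, 3.5), y = c("a", "b", "c"))

# Plain assignment binds a second name to the same object; nothing is copied
df2 <- df

# obj_addr() returns the memory address of an object as a string
same_object <- identical(obj_addr(df), obj_addr(df2))
same_object
```

If the two addresses match, both names refer to one object in memory, which is what the `<identical>` result from changes() reports.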

What do you think happens if you modify a single column of iris2? R is going to have to make a copy, but it seems like most of the columns could continue to point to the old locations:

iris2$Sepal.Length <- iris2$Sepal.Length * 2
changes(iris, iris2)
## Changed variables:
##              old         new        
## Sepal.Length 0x1069f3000 0x106aeca00
## Sepal.Width  0x106825800 0x106ae2800
## Petal.Length 0x106826600 0x106ae3600
## Petal.Width  0x106961400 0x106ae5200
## Species      0x103f298d0 0x103f4e9f0
## 
## Changed attributes:
##              old         new        
## names        0x10888cad8 0x106814880
## row.names    0x103f4eee0 0x103f4f160
## class        0x108971868 0x100fbeaa8

Unfortunately, base R is not smart enough to recognise that it could save memory by having iris2$Sepal.Width point to the same location as iris$Sepal.Width. In contrast, let’s see what dplyr does:

iris3 <- mutate(iris, Sepal.Length = Sepal.Length * 2)
changes(iris3, iris)
## Changed variables:
##              old         new        
## Sepal.Length 0x10800ca00 0x1069f3000
## 
## Changed attributes:
##              old         new        
## class        0x108d5ba88 0x108971868
## names        0x106aee0b8 0x10888cad8
## row.names    0x102f12300 0x102f12580

It’s smart enough to create only one new column: all the other columns continue to point at their old locations. (You might notice that the attributes have still been copied. This isn’t usually a big deal because they’re so small, and copying them keeps dplyr’s internal code considerably simpler.)
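
To check this behaviour on your own installation, the comparison can be sketched column by column. This is an illustration only: it assumes the lobstr and dplyr packages are installed, and df, df_base and df_dplyr are hypothetical names that do not appear in the original text.

```r
# A minimal sketch, assuming the lobstr and dplyr packages are installed.
library(lobstr)
library(dplyr)

# Hypothetical example data
df <- data.frame(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30))

# Base R: replace one column with $<-
df_base <- df
df_base$x <- df_base$x * 2

# dplyr: replace the same column with mutate()
df_dplyr <- mutate(df, x = x * 2)

# obj_addrs() returns one address per column.
# TRUE means the column is still shared with the original df.
base_shared  <- setNames(obj_addrs(df) == obj_addrs(df_base), names(df))
dplyr_shared <- setNames(obj_addrs(df) == obj_addrs(df_dplyr), names(df))

base_shared
dplyr_shared
```

With mutate(), only x should get a new address. The base R result depends on your R version: the copies of untouched columns shown above come from an older R, and more recent versions may report y as shared as well.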

In base R, a surprising number of operations make copies of the individual vectors. For example, even extracting columns from a data frame creates a copy!

# Copies the first two columns
changes(iris, iris[1:2])
## Changed variables:
##              old         new        
## Sepal.Length 0x1069f3000 0x106af5800
## Sepal.Width  0x106825800 0x106af4a00
## Petal.Length 0x106826600 <deleted>  
## Petal.Width  0x106961400 <deleted>  
## Species      0x103f298d0 <deleted>  
## 
## Changed attributes:
##              old         new        
## names        0x10888cad8 0x1034ea440
## row.names    0x103f049f0 0x103f090c0
## class        0x108971868 0x1048d00e8
# dplyr::select doesn't copy the columns
changes(iris2, select(iris2, Sepal.Length, Sepal.Width))
## Changed variables:
##              old         new      
## Petal.Length 0x106ae3600 <deleted>
## Petal.Width  0x106ae5200 <deleted>
## Species      0x103f4e9f0 <deleted>
## 
## Changed attributes:
##              old         new        
## names        0x106814880 0x106cf7670
## row.names    0x103f05000 0x103f02160

(Due to work by Michael Lawrence, a number of these copies will be avoided in R 3.1.0.)
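
A similar sketch compares column extraction with `[` and with select(). As before, it assumes the lobstr and dplyr packages are installed, and the data frame and variable names are hypothetical.

```r
# A minimal sketch, assuming the lobstr and dplyr packages are installed.
library(lobstr)
library(dplyr)

# Hypothetical example data
df <- data.frame(
  x = c(1.5, 2.5, 3.5),
  y = c(10, 20, 30),
  z = c("a", "b", "c")
)

# Keep the first two columns, once with base R and once with dplyr
base_subset  <- df[1:2]
dplyr_subset <- select(df, x, y)

# Addresses of the first two columns of the original data frame
original <- obj_addrs(df)[1:2]

# TRUE means the extracted column is the same vector as in df (no copy)
base_shared  <- setNames(original == obj_addrs(base_subset), names(base_subset))
dplyr_shared <- setNames(original == obj_addrs(dplyr_subset), names(dplyr_subset))

base_shared
dplyr_shared
```

The select() result should be TRUE for both columns. Whether `[` also shares them depends on the R version you are running.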

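dplyr never makes copies unless it has to: as shown above, select() does not copy columns, and mutate() creates a new vector only for the column you modify.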

This means that dplyr lets you work with data frames with very little memory overhead.

data.table takes this idea one step further than dplyr, and provides functions that modify a data table in place. This avoids the need to copy the pointers to existing columns and attributes, and provides a speed-up when you have many columns. dplyr doesn’t do this with data frames (although it could) because I think it’s safer to keep data immutable: all dplyr data frame methods return a new data frame, even while they share as much data as possible.
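
In-place modification can be sketched as follows. This assumes the data.table package is installed; the table dt and its columns are hypothetical and serve only to illustrate the idea.

```r
# A minimal sketch, assuming the data.table package is installed.
library(data.table)

# Hypothetical example data
dt <- data.table(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30))

# address() returns the memory address of the table as a string
addr_before <- address(dt)

# := updates the column by reference; no new data table is returned or assigned
dt[, x := x * 2]

addr_after <- address(dt)

# TRUE means dt is still the same object after the update
in_place <- identical(addr_before, addr_after)
in_place
```

The trade-off is the one described above: every other name bound to the same data table sees the change too, whereas dplyr verbs leave their input untouched and return a new data frame.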