Data wrangling? No problem, let’s talk about DPLYR package!

24.06.2015, Oslo useR! Group Dmitrijs Cudihins

Who am I

Background

  • 5 month experience with R
  • Currently taking Data Science Course/ Specialisation on Coursera
  • I use R for a personal project

Job

  • MS Dynamics CRM consultant at AITC (aitc.no)

Agenda

This is meant to be an introduction to dplyr

  • dplyr basics
  • get into a little bit of programming with dplyr
  • tips and tricks
  • benchmarking comparison

Github: https://github.com/dclux/dsc-oslor-dplyr

Cheatsheet

You might also find Rstudio’s Data Wrangling Cheat Sheet featuring dplyr useful.

This is also where I borrowed some of the material used in this presentation.

Source: http://www.rstudio.com/resources/cheatsheets/

Intro to Split - Apply

Data analysis involves the procedure of splitting the data set based on a grouping variable and then applying a function to each of the groups (split-apply)

Lets calculate the median values for a few parameters for cars with different numbers of cylinders.

data(movies)
data = movies
data$year = factor(data$year)

# Preview the dataset
head(data)

# Global configuration
column.names = c("rating", "votes", "length")
# Select only required columns
data.base = data[ , colnames(data) %in% column.names]

# Split the data on the number of cylinders
data.base.split = split(data.base, data$year)

# Apply function of interest to all columns in our dataset
data.base.results = sapply(data.base.split, function(x) apply(x, 2, mean))
data.base.results = t(data.base.results)
head(data.base.results)
# Split-Apply using dplyr
data %>% group_by(year) %>% select(one_of(column.names)) %>% summarise_each(funs(mean))

base vs dplyr

base

dplyr

dplyr::%>%

%>% or the pipe operator is simply then

data %>% group_by(cyl) %>% select(one_of(column.names)) %>% summarise_each(funs(median))

dplyr::tbl_df

tbl_df which is a local data frame and mostly behaves like the classical data.frame but is more convenient for working with large data.

The main advantage to over a regular data frame is the printing: tbl objects only print a few rows and all the columns that fit on one screen and avoid accidental print of a full data set to the screen.

> (movies_df = tbl_df(movies))
Source: local data frame [58,788 x 24]

                      title year length budget rating votes   r1   r2  r3   r4   r5
1                         $ 1971    121     NA    6.4   348  4.5  4.5 4.5  4.5 14.5
2         $1000 a Touchdown 1939     71     NA    6.0    20  0.0 14.5 4.5 24.5 14.5
3    $21 a Day Once a Month 1941      7     NA    8.2     5  0.0  0.0 0.0  0.0  0.0
4                   $40,000 1996     70     NA    8.2     6 14.5  0.0 0.0  0.0  0.0
5  $50,000 Climax Show, The 1975     71     NA    3.4    17 24.5  4.5 0.0 14.5 14.5
6                     $pent 2000     91     NA    4.3    45  4.5  4.5 4.5 14.5 14.5
7                   $windle 2002     93     NA    5.3   200  4.5  0.0 4.5  4.5 24.5
8                      '15' 2002     25     NA    6.7    24  4.5  4.5 4.5  4.5  4.5
9                       '38 1987     97     NA    6.6    18  4.5  4.5 4.5  0.0  0.0
10                  '49-'17 1917     61     NA    6.0    51  4.5  0.0 4.5  4.5  4.5
..                      ...  ...    ...    ...    ...   ...  ...  ... ...  ...  ...
Variables not shown: r6 (dbl), r7 (dbl), r8 (dbl), r9 (dbl), r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int), Documentary (int), Romance (int), Short (int)

dplyr functions

The most common dplyr functions also referred to as verbs are as follows:

  • filter() and slice() - keep rows matching criterias
  • arrange() - reorder rows
  • select() and rename() - pick columns by name
  • mutate() and transmute() - add new variables
  • summarise() - reduce variables to values
  • group_by() - group rows

structure

These all functions most workly the same.

  • The first argument is a data frame
  • Subsequent arguments say what to do with data frame
  • Always return a data frame
  • Never modify in place, if you want to modify your data frame, you need to assign the result

Data exploration

glimpse is an analogue of str and tries to show you more of the data.

> str(movies)
'data.frame':    58788 obs. of  24 variables:
 $ title      : chr  "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
 $ year       : int  1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
 $ length     : int  121 71 7 70 71 91 93 25 97 61 ...
 $ budget     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ rating     : num  6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
 $ votes      : int  348 20 5 6 17 45 200 24 18 51 ...
 $ r1         : num  4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
 $ r2         : num  4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
 $ r3         : num  4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
 $ r4         : num  4.5 24.5 0 0 14.5 14.5 4.5 4.5 0 4.5 ...
 $ r5         : num  14.5 14.5 0 0 14.5 14.5 24.5 4.5 0 4.5 ...
 $ r6         : num  24.5 14.5 24.5 0 4.5 14.5 24.5 14.5 0 44.5 ...
 $ r7         : num  24.5 14.5 0 0 0 4.5 14.5 14.5 34.5 14.5 ...
 $ r8         : num  14.5 4.5 44.5 0 0 4.5 4.5 14.5 14.5 4.5 ...
 $ r9         : num  4.5 4.5 24.5 34.5 0 14.5 4.5 4.5 4.5 4.5 ...
 $ r10        : num  4.5 14.5 24.5 45.5 24.5 14.5 14.5 14.5 24.5 4.5 ...
 $ mpaa       : Factor w/ 5 levels "","NC-17","PG",..: 1 1 1 1 1 1 5 1 1 1 ...
 $ Action     : int  0 0 0 0 0 0 1 0 0 0 ...
 $ Animation  : int  0 0 1 0 0 0 0 0 0 0 ...
 $ Comedy     : int  1 1 0 1 0 0 0 0 0 0 ...
 $ Drama      : int  1 0 0 0 0 1 1 0 1 0 ...
 $ Documentary: int  0 0 0 0 0 0 0 1 0 0 ...
 $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Short      : int  0 0 1 0 0 0 0 1 0 0 ...
> glimpse(movies)
Observations: 58788
Variables:
$ title       (chr) "$", "$1000 a Touchdown", "$21 a Day Once a Month", "$...
$ year        (int) 1971, 1939, 1941, 1996, 1975, 2000, 2002, 2002, 1987, ...
$ length      (int) 121, 71, 7, 70, 71, 91, 93, 25, 97, 61, 99, 96, 10, 10...
$ budget      (int) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ rating      (dbl) 6.4, 6.0, 8.2, 8.2, 3.4, 4.3, 5.3, 6.7, 6.6, 6.0, 5.4,...
$ votes       (int) 348, 20, 5, 6, 17, 45, 200, 24, 18, 51, 23, 53, 44, 11...
$ r1          (dbl) 4.5, 0.0, 0.0, 14.5, 24.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4....
$ r2          (dbl) 4.5, 14.5, 0.0, 0.0, 4.5, 4.5, 0.0, 4.5, 4.5, 0.0, 0.0...
$ r3          (dbl) 4.5, 4.5, 0.0, 0.0, 0.0, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5,...
$ r4          (dbl) 4.5, 24.5, 0.0, 0.0, 14.5, 14.5, 4.5, 4.5, 0.0, 4.5, 1...
$ r5          (dbl) 14.5, 14.5, 0.0, 0.0, 14.5, 14.5, 24.5, 4.5, 0.0, 4.5,...
$ r6          (dbl) 24.5, 14.5, 24.5, 0.0, 4.5, 14.5, 24.5, 14.5, 0.0, 44....
$ r7          (dbl) 24.5, 14.5, 0.0, 0.0, 0.0, 4.5, 14.5, 14.5, 34.5, 14.5...
$ r8          (dbl) 14.5, 4.5, 44.5, 0.0, 0.0, 4.5, 4.5, 14.5, 14.5, 4.5, ...
$ r9          (dbl) 4.5, 4.5, 24.5, 34.5, 0.0, 14.5, 4.5, 4.5, 4.5, 4.5, 1...
$ r10         (dbl) 4.5, 14.5, 24.5, 45.5, 24.5, 14.5, 14.5, 14.5, 24.5, 4...
$ mpaa        (fctr) , , , , , , R, , , , , , , , PG-13, PG-13, , , , , , ...
$ Action      (int) 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, ...
$ Animation   (int) 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Comedy      (int) 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, ...
$ Drama       (int) 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, ...
$ Documentary (int) 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
$ Romance     (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Short       (int) 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...

dplyr::filter

filter() allows you to select a subset of the rows of a data frame matching some criterias or conditions. The first argument is the name of the data frame, and the second and subsequent are filtering expressions evaluated in the context of that data frame. If you supply multiple arguments and they are all added together.

AND condition

base

movies_df[movies_df$year == 2005 & movies_df$rating > 9.5, ]
Source: local data frame [10 x 24]

                    title year length budget rating votes r1  r2 r3 r4 r5 r6 r7   r8   r9  r10 mpaa Action
1            Blaze Orange 2005     16      0    9.6     7  0 0.0  0  0  0  0  0 14.5 14.5 74.5           0
2      Filmic Achievement 2005     80     NA    9.6    10  0 0.0  0  0  0  0  0 14.5 24.5 74.5           0
3          Goodnight Bill 2005     19     NA    9.6     7  0 0.0  0  0  0  0  0  0.0 44.5 45.5           0
4        Goodnite Charlie 2005    119 100000    9.8    34  0 4.5  0  0  0  0  0  4.5  4.5 84.5           1
5      Keeper of the Past 2005     18  30000    9.9     7  0 0.0  0  0  0  0  0  0.0 14.5 84.5           0
6              Morphin(e) 2005     20   8000    9.7     7  0 0.0  0  0  0  0  0  0.0 24.5 74.5           1
7                  Nun Fu 2005      5   5000    9.8     5  0 0.0  0  0  0  0  0  0.0 24.5 84.5           1
8               Oath, The 2005     23     NA    9.8     5  0 0.0  0  0  0  0  0  0.0 24.5 84.5           0
9  Weg ist das Spiel, Der 2005      3     NA    9.8     8  0 0.0  0  0  0  0  0  0.0 24.5 74.5           0
10        Wild Girls Gone 2005     93     NA    9.6     7  0 0.0  0  0  0  0  0 14.5 14.5 74.5           0
Variables not shown: Animation (int), Comedy (int), Drama (int), Documentary (int), Romance (int), Short (int)

dplyr

filter(movies_df, year == 2005, rating > 9.5)
Source: local data frame [10 x 24]

                    title year length budget rating votes r1  r2 r3 r4 r5 r6 r7   r8   r9  r10 mpaa Action
1            Blaze Orange 2005     16      0    9.6     7  0 0.0  0  0  0  0  0 14.5 14.5 74.5           0
2      Filmic Achievement 2005     80     NA    9.6    10  0 0.0  0  0  0  0  0 14.5 24.5 74.5           0
3          Goodnight Bill 2005     19     NA    9.6     7  0 0.0  0  0  0  0  0  0.0 44.5 45.5           0
4        Goodnite Charlie 2005    119 100000    9.8    34  0 4.5  0  0  0  0  0  4.5  4.5 84.5           1
5      Keeper of the Past 2005     18  30000    9.9     7  0 0.0  0  0  0  0  0  0.0 14.5 84.5           0
6              Morphin(e) 2005     20   8000    9.7     7  0 0.0  0  0  0  0  0  0.0 24.5 74.5           1
7                  Nun Fu 2005      5   5000    9.8     5  0 0.0  0  0  0  0  0  0.0 24.5 84.5           1
8               Oath, The 2005     23     NA    9.8     5  0 0.0  0  0  0  0  0  0.0 24.5 84.5           0
9  Weg ist das Spiel, Der 2005      3     NA    9.8     8  0 0.0  0  0  0  0  0  0.0 24.5 74.5           0
10        Wild Girls Gone 2005     93     NA    9.6     7  0 0.0  0  0  0  0  0 14.5 14.5 74.5           0
Variables not shown: Animation (int), Comedy (int), Drama (int), Documentary (int), Romance (int), Short (int)

OR condition

base

movies_df[movies_df$year == 2005 & movies_df$rating > 9.5 & (movies_df$Animation == 1 | movies_df$Short == 1), ]
Source: local data frame [7 x 24]

                   title year length budget rating votes r1 r2 r3 r4 r5 r6 r7   r8   r9  r10 mpaa Action
1           Blaze Orange 2005     16      0    9.6     7  0  0  0  0  0  0  0 14.5 14.5 74.5           0
2         Goodnight Bill 2005     19     NA    9.6     7  0  0  0  0  0  0  0  0.0 44.5 45.5           0
3     Keeper of the Past 2005     18  30000    9.9     7  0  0  0  0  0  0  0  0.0 14.5 84.5           0
4             Morphin(e) 2005     20   8000    9.7     7  0  0  0  0  0  0  0  0.0 24.5 74.5           1
5                 Nun Fu 2005      5   5000    9.8     5  0  0  0  0  0  0  0  0.0 24.5 84.5           1
6              Oath, The 2005     23     NA    9.8     5  0  0  0  0  0  0  0  0.0 24.5 84.5           0
7 Weg ist das Spiel, Der 2005      3     NA    9.8     8  0  0  0  0  0  0  0  0.0 24.5 74.5           0
Variables not shown: Animation (int), Comedy (int), Drama (int), Documentary (int), Romance (int),
  Short (int)

dplyr

filter(movies_df, year == 2005, rating > 9.5, Animation == 1 | Short == 1)
Source: local data frame [7 x 24]

                   title year length budget rating votes r1 r2 r3 r4 r5 r6 r7   r8   r9  r10 mpaa Action
1           Blaze Orange 2005     16      0    9.6     7  0  0  0  0  0  0  0 14.5 14.5 74.5           0
2         Goodnight Bill 2005     19     NA    9.6     7  0  0  0  0  0  0  0  0.0 44.5 45.5           0
3     Keeper of the Past 2005     18  30000    9.9     7  0  0  0  0  0  0  0  0.0 14.5 84.5           0
4             Morphin(e) 2005     20   8000    9.7     7  0  0  0  0  0  0  0  0.0 24.5 74.5           1
5                 Nun Fu 2005      5   5000    9.8     5  0  0  0  0  0  0  0  0.0 24.5 84.5           1
6              Oath, The 2005     23     NA    9.8     5  0  0  0  0  0  0  0  0.0 24.5 84.5           0
7 Weg ist das Spiel, Der 2005      3     NA    9.8     8  0  0  0  0  0  0  0  0.0 24.5 74.5           0
Variables not shown: Animation (int), Comedy (int), Drama (int), Documentary (int), Romance (int),
  Short (int)

We can include any ?Comparison or ?base::Logic operators in our filter.

dplyr::slice

Slice is a variant of filter used to extract rows based on position.

base

movies_df[1:10, ]
Source: local data frame [10 x 24]

                      title year length budget rating votes   r1   r2  r3   r4   r5   r6   r7   r8   r9
1                         $ 1971    121     NA    6.4   348  4.5  4.5 4.5  4.5 14.5 24.5 24.5 14.5  4.5
2         $1000 a Touchdown 1939     71     NA    6.0    20  0.0 14.5 4.5 24.5 14.5 14.5 14.5  4.5  4.5
3    $21 a Day Once a Month 1941      7     NA    8.2     5  0.0  0.0 0.0  0.0  0.0 24.5  0.0 44.5 24.5
4                   $40,000 1996     70     NA    8.2     6 14.5  0.0 0.0  0.0  0.0  0.0  0.0  0.0 34.5
5  $50,000 Climax Show, The 1975     71     NA    3.4    17 24.5  4.5 0.0 14.5 14.5  4.5  0.0  0.0  0.0
6                     $pent 2000     91     NA    4.3    45  4.5  4.5 4.5 14.5 14.5 14.5  4.5  4.5 14.5
7                   $windle 2002     93     NA    5.3   200  4.5  0.0 4.5  4.5 24.5 24.5 14.5  4.5  4.5
8                      '15' 2002     25     NA    6.7    24  4.5  4.5 4.5  4.5  4.5 14.5 14.5 14.5  4.5
9                       '38 1987     97     NA    6.6    18  4.5  4.5 4.5  0.0  0.0  0.0 34.5 14.5  4.5
10                  '49-'17 1917     61     NA    6.0    51  4.5  0.0 4.5  4.5  4.5 44.5 14.5  4.5  4.5
Variables not shown: r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int),
  Documentary (int), Romance (int), Short (int)

dplyr

slice(movies_df, 1:10)
Source: local data frame [10 x 24]

                      title year length budget rating votes   r1   r2  r3   r4   r5   r6   r7   r8   r9
1                         $ 1971    121     NA    6.4   348  4.5  4.5 4.5  4.5 14.5 24.5 24.5 14.5  4.5
2         $1000 a Touchdown 1939     71     NA    6.0    20  0.0 14.5 4.5 24.5 14.5 14.5 14.5  4.5  4.5
3    $21 a Day Once a Month 1941      7     NA    8.2     5  0.0  0.0 0.0  0.0  0.0 24.5  0.0 44.5 24.5
4                   $40,000 1996     70     NA    8.2     6 14.5  0.0 0.0  0.0  0.0  0.0  0.0  0.0 34.5
5  $50,000 Climax Show, The 1975     71     NA    3.4    17 24.5  4.5 0.0 14.5 14.5  4.5  0.0  0.0  0.0
6                     $pent 2000     91     NA    4.3    45  4.5  4.5 4.5 14.5 14.5 14.5  4.5  4.5 14.5
7                   $windle 2002     93     NA    5.3   200  4.5  0.0 4.5  4.5 24.5 24.5 14.5  4.5  4.5
8                      '15' 2002     25     NA    6.7    24  4.5  4.5 4.5  4.5  4.5 14.5 14.5 14.5  4.5
9                       '38 1987     97     NA    6.6    18  4.5  4.5 4.5  0.0  0.0  0.0 34.5 14.5  4.5
10                  '49-'17 1917     61     NA    6.0    51  4.5  0.0 4.5  4.5  4.5 44.5 14.5  4.5  4.5
Variables not shown: r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int),
  Documentary (int), Romance (int), Short (int)

dplyr::arrange

Order data based on specified columns.

base

movies_df[order(movies_df$votes), ]
Source: local data frame [58,788 x 24]

                                                                title year length budget rating votes   r1
1                                              $21 a Day Once a Month 1941      7     NA    8.2     5  0.0
2  'Sugar Chile' Robinson, Billie Holiday, Count Basie and His Sextet 1951     15     NA    5.3     5  0.0
3                                                  ...ani smrt nebere 1996     88     NA    3.6     5 44.5
4                       ...dopo di che, uccide il maschio e lo divora 1971     96     NA    6.1     5  0.0
5                                          ...und abends in die Scala 1958     94     NA    5.7     5  0.0
6                                         ...und die Liebe lacht dazu 1957     97     NA    6.8     5  0.0
7                                                               1+1=3 1979     85     NA    7.5     5  0.0
8                                              1,000 Dollars a Minute 1935     70     NA    6.2     5  0.0
9                                                                10.7 1997     28     NA    8.8     5  0.0
10                                                   10000th Day, The 1997     18     NA    8.0     5  0.0
..                                                                ...  ...    ...    ...    ...   ...  ...
Variables not shown: r2 (dbl), r3 (dbl), r4 (dbl), r5 (dbl), r6 (dbl), r7 (dbl), r8 (dbl), r9 (dbl),
  r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int), Documentary (int),
  Romance (int), Short (int)

dplyr

arrange(movies_df, votes)
Source: local data frame [58,788 x 24]

                                                                title year length budget rating votes   r1
1                                              $21 a Day Once a Month 1941      7     NA    8.2     5  0.0
2  'Sugar Chile' Robinson, Billie Holiday, Count Basie and His Sextet 1951     15     NA    5.3     5  0.0
3                                                  ...ani smrt nebere 1996     88     NA    3.6     5 44.5
4                       ...dopo di che, uccide il maschio e lo divora 1971     96     NA    6.1     5  0.0
5                                          ...und abends in die Scala 1958     94     NA    5.7     5  0.0
6                                         ...und die Liebe lacht dazu 1957     97     NA    6.8     5  0.0
7                                                               1+1=3 1979     85     NA    7.5     5  0.0
8                                              1,000 Dollars a Minute 1935     70     NA    6.2     5  0.0
9                                                                10.7 1997     28     NA    8.8     5  0.0
10                                                   10000th Day, The 1997     18     NA    8.0     5  0.0
..                                                                ...  ...    ...    ...    ...   ...  ...
Variables not shown: r2 (dbl), r3 (dbl), r4 (dbl), r5 (dbl), r6 (dbl), r7 (dbl), r8 (dbl), r9 (dbl),
  r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int), Documentary (int),
  Romance (int), Short (int)

descending

dplyr descending

arrange(movies_df, desc(votes))
Source: local data frame [58,788 x 24]

                                                title year length   budget rating  votes  r1  r2  r3  r4  r5
1  Lord of the Rings: The Fellowship of the Ring, The 2001    208 93000000    8.8 157608 4.5 4.5 4.5 4.5 4.5
2                           Shawshank Redemption, The 1994    142 25000000    9.1 149494 4.5 4.5 4.5 4.5 4.5
3                                         Matrix, The 1999    136 63000000    8.5 143853 4.5 4.5 4.5 4.5 4.5
4                                           Star Wars 1977    125 11000000    8.8 134640 4.5 4.5 4.5 4.5 4.5
5                                        Pulp Fiction 1994    168  8000000    8.8 132745 4.5 4.5 4.5 4.5 4.5
6                                      Godfather, The 1972    175  6000000    9.1 122755 4.5 4.5 4.5 4.5 4.5
7              Lord of the Rings: The Two Towers, The 2002    223 94000000    8.8 114797 4.5 4.5 4.5 4.5 4.5
8                                          Fight Club 1999    139 63000000    8.5 112092 4.5 4.5 4.5 4.5 4.5
9                                     American Beauty 1999    121 15000000    8.5 109991 4.5 4.5 4.5 4.5 4.5
10                                Usual Suspects, The 1995    106  6000000    8.7 103854 4.5 4.5 4.5 4.5 4.5
..                                                ...  ...    ...      ...    ...    ... ... ... ... ... ...
Variables not shown: r6 (dbl), r7 (dbl), r8 (dbl), r9 (dbl), r10 (dbl), mpaa (fctr), Action (int), Animation
(int), Comedy (int), Drama (int), Documentary (int), Romance (int), Short (int)

ascending and descending

If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

base

movies_df[order(movies_df$budget, -movies_df$votes), ]
Source: local data frame [58,788 x 24]

                       title year length budget rating votes   r1   r2  r3   r4   r5   r6   r7   r8
1        Last Broadcast, The 1998     86      0    5.7   967 14.5  4.5 4.5  4.5  4.5 14.5 14.5 14.5
2   Great Train Robbery, The 1903     12      0    7.4   849  4.5  4.5 4.5  4.5  4.5 14.5 24.5 24.5
3                  Tarnation 2003     88      0    7.7   596  4.5  4.5 4.5  4.5  4.5  4.5 14.5 14.5
4   Six Figures Getting Sick 1966      4      0    6.3   357  4.5  4.5 4.5  4.5 14.5 14.5 14.5 14.5
5                 My Hustler 1965     79      0    7.0    39  4.5  4.5 4.5  4.5  4.5  4.5 14.5 24.5
6  Adventures of Dollie, The 1908     12      0    5.1    34  4.5  4.5 4.5  4.5 24.5 14.5  4.5 24.5
7               Run Leia Run 2003     16      0    5.5    32  4.5 14.5 4.5  4.5  4.5  4.5  4.5 14.5
8              Visitant, The 1981     22      0    7.1    30 24.5  0.0 0.0  4.5  0.0  4.5 14.5  4.5
9             From the Drain 1967     14      0    5.1    28 14.5  4.5 0.0 14.5 14.5 24.5 14.5  4.5
10                      Zero 2000     20      0    8.1    24 24.5  0.0 0.0  0.0  0.0  4.5  0.0  0.0
..                       ...  ...    ...    ...    ...   ...  ...  ... ...  ...  ...  ...  ...  ...
Variables not shown: r9 (dbl), r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int),
  Drama (int), Documentary (int), Romance (int), Short (int)

dplyr descending

arrange(movies_df, budget, desc(votes))
Source: local data frame [58,788 x 24]

                       title year length budget rating votes   r1   r2  r3   r4   r5   r6   r7   r8
1        Last Broadcast, The 1998     86      0    5.7   967 14.5  4.5 4.5  4.5  4.5 14.5 14.5 14.5
2   Great Train Robbery, The 1903     12      0    7.4   849  4.5  4.5 4.5  4.5  4.5 14.5 24.5 24.5
3                  Tarnation 2003     88      0    7.7   596  4.5  4.5 4.5  4.5  4.5  4.5 14.5 14.5
4   Six Figures Getting Sick 1966      4      0    6.3   357  4.5  4.5 4.5  4.5 14.5 14.5 14.5 14.5
5                 My Hustler 1965     79      0    7.0    39  4.5  4.5 4.5  4.5  4.5  4.5 14.5 24.5
6  Adventures of Dollie, The 1908     12      0    5.1    34  4.5  4.5 4.5  4.5 24.5 14.5  4.5 24.5
7               Run Leia Run 2003     16      0    5.5    32  4.5 14.5 4.5  4.5  4.5  4.5  4.5 14.5
8              Visitant, The 1981     22      0    7.1    30 24.5  0.0 0.0  4.5  0.0  4.5 14.5  4.5
9             From the Drain 1967     14      0    5.1    28 14.5  4.5 0.0 14.5 14.5 24.5 14.5  4.5
10                      Zero 2000     20      0    8.1    24 24.5  0.0 0.0  0.0  0.0  4.5  0.0  0.0
..                       ...  ...    ...    ...    ...   ...  ...  ... ...  ...  ...  ...  ...  ...
Variables not shown: r9 (dbl), r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int),
  Drama (int), Documentary (int), Romance (int), Short (int)

dplyr::select

Often you work with large datasets with many columns where only a few are actually of interest to you. You provide a name of a dataframe and list of variables you want to keep.

base

movies_df[ , colnames(movies_df) %in% c("title","length")]
movies_df[ , c("title", "length")]
Source: local data frame [58,788 x 2]

                      title length
1                         $    121
2         $1000 a Touchdown     71
3    $21 a Day Once a Month      7
4                   $40,000     70
5  $50,000 Climax Show, The     71
6                     $pent     91
7                   $windle     93
8                      '15'     25
9                       '38     97
10                  '49-'17     61
..                      ...    ...

dplyr

select(movies_df, title, year)
select(movies_df, one_of(c("title", "year")))
Source: local data frame [58,788 x 2]

                      title year
1                         $ 1971
2         $1000 a Touchdown 1939
3    $21 a Day Once a Month 1941
4                   $40,000 1996
5  $50,000 Climax Show, The 1975
6                     $pent 2000
7                   $windle 2002
8                      '15' 2002
9                       '38 1987
10                  '49-'17 1917
..                      ...  ...

dplyr exclude columns

select(movies_df, -(r1:mpaa))
Source: local data frame [58,788 x 13]

                      title year length budget rating votes Action Animation Comedy Drama Documentary
1                         $ 1971    121     NA    6.4   348      0         0      1     1           0
2         $1000 a Touchdown 1939     71     NA    6.0    20      0         0      1     0           0
3    $21 a Day Once a Month 1941      7     NA    8.2     5      0         1      0     0           0
4                   $40,000 1996     70     NA    8.2     6      0         0      1     0           0
5  $50,000 Climax Show, The 1975     71     NA    3.4    17      0         0      0     0           0
6                     $pent 2000     91     NA    4.3    45      0         0      0     1           0
7                   $windle 2002     93     NA    5.3   200      1         0      0     1           0
8                      '15' 2002     25     NA    6.7    24      0         0      0     0           1
9                       '38 1987     97     NA    6.6    18      0         0      0     1           0
10                  '49-'17 1917     61     NA    6.0    51      0         0      0     0           0
..                      ...  ...    ...    ...    ...   ...    ...       ...    ...   ...         ...
Variables not shown: Romance (int), Short (int)

helper methods

There are a number of helper functions you can use within select(), like starts_with(), ends_with(), matches() and contains(). These let you quickly match larger blocks of variable that meet some criterion.

mutate and transmute

As well as selecting from the set of existing columns, it’s often useful to add new columns that are functions of existing columns or create stand alone variables.

When you add new variables they always go to the end.

As an example, lets calculate the age of the movie.

base

yearsSinceDate = 2015 - movies_df$year
movies_df_new = cbind(yearsSinceDate, movies_df)
Source: local data frame [58,788 x 25]

   yearsSinceDate                    title year length budget rating votes   r1   r2  r3   r4   r5   r6
1              44                        $ 1971    121     NA    6.4   348  4.5  4.5 4.5  4.5 14.5 24.5
2              76        $1000 a Touchdown 1939     71     NA    6.0    20  0.0 14.5 4.5 24.5 14.5 14.5
3              74   $21 a Day Once a Month 1941      7     NA    8.2     5  0.0  0.0 0.0  0.0  0.0 24.5
4              19                  $40,000 1996     70     NA    8.2     6 14.5  0.0 0.0  0.0  0.0  0.0
5              40 $50,000 Climax Show, The 1975     71     NA    3.4    17 24.5  4.5 0.0 14.5 14.5  4.5
6              15                    $pent 2000     91     NA    4.3    45  4.5  4.5 4.5 14.5 14.5 14.5
7              13                  $windle 2002     93     NA    5.3   200  4.5  0.0 4.5  4.5 24.5 24.5
8              13                     '15' 2002     25     NA    6.7    24  4.5  4.5 4.5  4.5  4.5 14.5
9              28                      '38 1987     97     NA    6.6    18  4.5  4.5 4.5  0.0  0.0  0.0
10             98                  '49-'17 1917     61     NA    6.0    51  4.5  0.0 4.5  4.5  4.5 44.5
..            ...                      ...  ...    ...    ...    ...   ...  ...  ... ...  ...  ...  ...
Variables not shown: r7 (dbl), r8 (dbl), r9 (dbl), r10 (dbl), mpaa (fctr), Action (int), Animation
  (int), Comedy (int), Drama (int), Documentary (int), Romance (int), Short (int)

dplyr

transmute(movies_df, yearsSinceDate = 2015 - year)
mutate(movies_df, yearsSinceDate = 2015 - year)
Source: local data frame [58,788 x 25]

                      title year length budget rating votes   r1   r2  r3   r4   r5   r6   r7   r8   r9
1                         $ 1971    121     NA    6.4   348  4.5  4.5 4.5  4.5 14.5 24.5 24.5 14.5  4.5
2         $1000 a Touchdown 1939     71     NA    6.0    20  0.0 14.5 4.5 24.5 14.5 14.5 14.5  4.5  4.5
3    $21 a Day Once a Month 1941      7     NA    8.2     5  0.0  0.0 0.0  0.0  0.0 24.5  0.0 44.5 24.5
4                   $40,000 1996     70     NA    8.2     6 14.5  0.0 0.0  0.0  0.0  0.0  0.0  0.0 34.5
5  $50,000 Climax Show, The 1975     71     NA    3.4    17 24.5  4.5 0.0 14.5 14.5  4.5  0.0  0.0  0.0
6                     $pent 2000     91     NA    4.3    45  4.5  4.5 4.5 14.5 14.5 14.5  4.5  4.5 14.5
7                   $windle 2002     93     NA    5.3   200  4.5  0.0 4.5  4.5 24.5 24.5 14.5  4.5  4.5
8                      '15' 2002     25     NA    6.7    24  4.5  4.5 4.5  4.5  4.5 14.5 14.5 14.5  4.5
9                       '38 1987     97     NA    6.6    18  4.5  4.5 4.5  0.0  0.0  0.0 34.5 14.5  4.5
10                  '49-'17 1917     61     NA    6.0    51  4.5  0.0 4.5  4.5  4.5 44.5 14.5  4.5  4.5
..                      ...  ...    ...    ...    ...   ...  ...  ... ...  ...  ...  ...  ...  ...  ...
Variables not shown: r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int),
  Documentary (int), Romance (int), Short (int), yearsSinceDate (dbl)

tip

Many dplyr functions will let you use newly create variables in the same function which is creating the variable in the first place.

Lets calculate how many years left to anniversary.

dplyr

transmute(movies_df, title, yearsSinceDate = 2015 - year, yearsToAnniversary = 100 - yearsSinceDate)
Source: local data frame [58,788 x 3]

                      title yearsSinceDate yearsToAnniversary
1                         $             44                 56
2         $1000 a Touchdown             76                 24
3    $21 a Day Once a Month             74                 26
4                   $40,000             19                 81
5  $50,000 Climax Show, The             40                 60
6                     $pent             15                 85
7                   $windle             13                 87
8                      '15'             13                 87
9                       '38             28                 72
10                  '49-'17             98                  2
..                      ...            ...                ...

dplyr::summarise

Verb is summarise(), which collapses a data frame to a single row.

Lets calculate median for a newly created age column.

transmute(movies_df, yearsSinceDate = 2015 - year) %>% summarise(age.median = median(yearsSinceDate))
Source: local data frame [1 x 1]

  age.median
1         32

Or lets count ratio for movies that don’t have a budget.

movies_df %>% summarize(budget.na.percent = sum(is.na(budget)) / length(budget) )
Source: local data frame [1 x 1]

  budget.na.percent
1     0.9112914

We can use summarise_each to summarise data across columns. Lets count number of movies by type.

select(movies_df, Action:Short) %>% summarise_each(funs(sum))
Source: local data frame [1 x 7]

  Action Animation Comedy Drama Documentary Romance Short
1   4688      3690  17271 21811        3472    4744  9458

dplyr::group_by

group_by breaks the data sets into groups of rows. This function becomes very powerful when combined with the previously discussed dplyr verbs.

Lets calculate number of movies per year per type.

select(movies_df, year, Action:Short) %>% group_by(year) %>% summarise_each(funs(sum))
Source: local data frame [113 x 8]

   year Action Animation Comedy Drama Documentary Romance Short
1  1893      0         0      0     0           0       0     1
2  1894      0         0      0     0           5       0     9
3  1895      0         0      0     0           2       0     3
4  1896      0         0      1     1           7       0    12
5  1897      0         0      2     0           6       0     7
6  1898      0         0      2     1           1       0     5
7  1899      0         0      1     2           4       0     8
8  1900      0         1      5     2           5       0    16
9  1901      1         0      8     5           9       1    27
10 1902      0         0      2     1           2       0     9
..  ...    ...       ...    ...   ...         ...     ...   ...

We can add additional filter and exclude all movies before 2000 and also count number of votes.

select(movies_df, year, votes, Action:Short) %>% filter(year >= 2000) %>% group_by(year) %>% summarise_each(funs(sum))
Source: local data frame [6 x 9]

  year   votes Action Animation Comedy Drama Documentary Romance Short
1 2000 2173392    154        89    561   793         175     207   449
2 2001 2263362    169        82    582   837         196     211   468
3 2002 2270605    176        81    591   929         249     245   533
4 2003 1816132    180        94    642   899         261     215   555
5 2004 1540722    147        56    597   805         258     169   480
6 2005  139286     43        10    123   137          35      37    54

A common data analysis task might be to carry out some group-wise normalization or adjustments of the data. For exmaple, we might want to find most rated movies during the calendar year.

To do this we will start by calculating the average rating for each year.

averages = select(movies_df, year, rating) %>% group_by(year) %>% summarise_each(funs(mean)) %>% rename(rating.avg = rating)
Source: local data frame [113 x 2]

   year rating.avg
1  1893   7.000000
2  1894   4.888889
3  1895   5.500000
4  1896   5.269231
5  1897   4.677778
6  1898   5.040000
7  1899   4.277778
8  1900   4.731250
9  1901   4.682143
10 1902   4.900000
..  ...        ...

Next we will use joing capabilities of dplyr to join averages with movies dataset and calculate the difference.

benchmarking comparison

Finally, I would like to show easy head-to-head comparison of base and dplyr.

Lets prepapre data - generate a data frame with 10000 rows and 100 columns and 100 groups.

rows = 10000
cols = 100
groups = 100
samples = rows / groups
tmp.data = data.frame(matrix(rnorm(rows),rows,cols))
tmp.data$group = rep(1:groups,each=samples)

base

time.start = Sys.time()
    
# split the data based on the group
big.l = split(tmp.data, tmp.data$group)
    
# apply some function of interest to all columns
results = sapply(big.l, function(x) apply(x,2,median))
    
# bind results and add splitting info
results = t(results)
    
# elapsed time
(time.end = Sys.time() - time.start ) 

dplyr

time.start <- Sys.time()
    
# complete all action
results = tmp.data %>% group_by(group) %>% summarise_each(funs(median(.)))
    
#elapsed time
(time.end = Sys.time() - time.start )
base: Time difference of 0.5896311 secs
dplyr: Time difference of 0.310796 secs