24.06.2015, Oslo useR! Group Dmitrijs Cudihins
Background
Job
This is meant to be an introduction to dplyr
You might also find Rstudio’s Data Wrangling Cheat Sheet featuring dplyr useful.
This is also where I borrowed some of the material used in this presentation.
Data analysis involves the procedure of splitting the data set based on a grouping variable and then applying a function to each of the groups (split-apply)
Lets calculate the median values for a few parameters for cars with different numbers of cylinders.
data(movies)
data = movies
data$year = factor(data$year)
# Preview the dataset
head(data)
# Global configuration
column.names = c("rating", "votes", "length")
# Select only required columns
data.base = data[ , colnames(data) %in% column.names]
# Split the data on the number of cylinders
data.base.split = split(data.base, data$year)
# Apply function of interest to all columns in our dataset
data.base.results = sapply(data.base.split, function(x) apply(x, 2, mean))
data.base.results = t(data.base.results)
head(data.base.results)
# Split-Apply using dplyr
data %>% group_by(year) %>% select(one_of(column.names)) %>% summarise_each(funs(mean))
base
dplyr
%>% or the pipe operator is simply then
data %>% group_by(cyl) %>% select(one_of(column.names)) %>% summarise_each(funs(median))
tbl_df which is a local data frame and mostly behaves like the classical data.frame but is more convenient for working with large data.
The main advantage to over a regular data frame is the printing: tbl objects only print a few rows and all the columns that fit on one screen and avoid accidental print of a full data set to the screen.
> (movies_df = tbl_df(movies))
Source: local data frame [58,788 x 24]
title year length budget rating votes r1 r2 r3 r4 r5
1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5 14.5
2 $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5 24.5 14.5
3 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0 0.0 0.0
4 $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0 0.0 0.0
5 $50,000 Climax Show, The 1975 71 NA 3.4 17 24.5 4.5 0.0 14.5 14.5
6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5 14.5
7 $windle 2002 93 NA 5.3 200 4.5 0.0 4.5 4.5 24.5
8 '15' 2002 25 NA 6.7 24 4.5 4.5 4.5 4.5 4.5
9 '38 1987 97 NA 6.6 18 4.5 4.5 4.5 0.0 0.0
10 '49-'17 1917 61 NA 6.0 51 4.5 0.0 4.5 4.5 4.5
.. ... ... ... ... ... ... ... ... ... ... ...
Variables not shown: r6 (dbl), r7 (dbl), r8 (dbl), r9 (dbl), r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int), Documentary (int), Romance (int), Short (int)
The most common dplyr functions also referred to as verbs are as follows:
filter() and slice() - keep rows matching criteriasarrange() - reorder rowsselect() and rename() - pick columns by namemutate() and transmute() - add new variablessummarise() - reduce variables to valuesgroup_by() - group rowsThese all functions most workly the same.
glimpse is an analogue of str and tries to show you more of the data.
> str(movies)
'data.frame': 58788 obs. of 24 variables:
$ title : chr "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
$ year : int 1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
$ length : int 121 71 7 70 71 91 93 25 97 61 ...
$ budget : int NA NA NA NA NA NA NA NA NA NA ...
$ rating : num 6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
$ votes : int 348 20 5 6 17 45 200 24 18 51 ...
$ r1 : num 4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
$ r2 : num 4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
$ r3 : num 4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
$ r4 : num 4.5 24.5 0 0 14.5 14.5 4.5 4.5 0 4.5 ...
$ r5 : num 14.5 14.5 0 0 14.5 14.5 24.5 4.5 0 4.5 ...
$ r6 : num 24.5 14.5 24.5 0 4.5 14.5 24.5 14.5 0 44.5 ...
$ r7 : num 24.5 14.5 0 0 0 4.5 14.5 14.5 34.5 14.5 ...
$ r8 : num 14.5 4.5 44.5 0 0 4.5 4.5 14.5 14.5 4.5 ...
$ r9 : num 4.5 4.5 24.5 34.5 0 14.5 4.5 4.5 4.5 4.5 ...
$ r10 : num 4.5 14.5 24.5 45.5 24.5 14.5 14.5 14.5 24.5 4.5 ...
$ mpaa : Factor w/ 5 levels "","NC-17","PG",..: 1 1 1 1 1 1 5 1 1 1 ...
$ Action : int 0 0 0 0 0 0 1 0 0 0 ...
$ Animation : int 0 0 1 0 0 0 0 0 0 0 ...
$ Comedy : int 1 1 0 1 0 0 0 0 0 0 ...
$ Drama : int 1 0 0 0 0 1 1 0 1 0 ...
$ Documentary: int 0 0 0 0 0 0 0 1 0 0 ...
$ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
$ Short : int 0 0 1 0 0 0 0 1 0 0 ...
> glimpse(movies)
Observations: 58788
Variables:
$ title (chr) "$", "$1000 a Touchdown", "$21 a Day Once a Month", "$...
$ year (int) 1971, 1939, 1941, 1996, 1975, 2000, 2002, 2002, 1987, ...
$ length (int) 121, 71, 7, 70, 71, 91, 93, 25, 97, 61, 99, 96, 10, 10...
$ budget (int) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ rating (dbl) 6.4, 6.0, 8.2, 8.2, 3.4, 4.3, 5.3, 6.7, 6.6, 6.0, 5.4,...
$ votes (int) 348, 20, 5, 6, 17, 45, 200, 24, 18, 51, 23, 53, 44, 11...
$ r1 (dbl) 4.5, 0.0, 0.0, 14.5, 24.5, 4.5, 4.5, 4.5, 4.5, 4.5, 4....
$ r2 (dbl) 4.5, 14.5, 0.0, 0.0, 4.5, 4.5, 0.0, 4.5, 4.5, 0.0, 0.0...
$ r3 (dbl) 4.5, 4.5, 0.0, 0.0, 0.0, 4.5, 4.5, 4.5, 4.5, 4.5, 4.5,...
$ r4 (dbl) 4.5, 24.5, 0.0, 0.0, 14.5, 14.5, 4.5, 4.5, 0.0, 4.5, 1...
$ r5 (dbl) 14.5, 14.5, 0.0, 0.0, 14.5, 14.5, 24.5, 4.5, 0.0, 4.5,...
$ r6 (dbl) 24.5, 14.5, 24.5, 0.0, 4.5, 14.5, 24.5, 14.5, 0.0, 44....
$ r7 (dbl) 24.5, 14.5, 0.0, 0.0, 0.0, 4.5, 14.5, 14.5, 34.5, 14.5...
$ r8 (dbl) 14.5, 4.5, 44.5, 0.0, 0.0, 4.5, 4.5, 14.5, 14.5, 4.5, ...
$ r9 (dbl) 4.5, 4.5, 24.5, 34.5, 0.0, 14.5, 4.5, 4.5, 4.5, 4.5, 1...
$ r10 (dbl) 4.5, 14.5, 24.5, 45.5, 24.5, 14.5, 14.5, 14.5, 24.5, 4...
$ mpaa (fctr) , , , , , , R, , , , , , , , PG-13, PG-13, , , , , , ...
$ Action (int) 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, ...
$ Animation (int) 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Comedy (int) 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, ...
$ Drama (int) 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, ...
$ Documentary (int) 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
$ Romance (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Short (int) 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...
filter() allows you to select a subset of the rows of a data frame matching some criterias or conditions. The first argument is the name of the data frame, and the second and subsequent are filtering expressions evaluated in the context of that data frame. If you supply multiple arguments and they are all added together.
base
movies_df[movies_df$year == 2005 & movies_df$rating > 9.5, ]
Source: local data frame [10 x 24]
title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 mpaa Action
1 Blaze Orange 2005 16 0 9.6 7 0 0.0 0 0 0 0 0 14.5 14.5 74.5 0
2 Filmic Achievement 2005 80 NA 9.6 10 0 0.0 0 0 0 0 0 14.5 24.5 74.5 0
3 Goodnight Bill 2005 19 NA 9.6 7 0 0.0 0 0 0 0 0 0.0 44.5 45.5 0
4 Goodnite Charlie 2005 119 100000 9.8 34 0 4.5 0 0 0 0 0 4.5 4.5 84.5 1
5 Keeper of the Past 2005 18 30000 9.9 7 0 0.0 0 0 0 0 0 0.0 14.5 84.5 0
6 Morphin(e) 2005 20 8000 9.7 7 0 0.0 0 0 0 0 0 0.0 24.5 74.5 1
7 Nun Fu 2005 5 5000 9.8 5 0 0.0 0 0 0 0 0 0.0 24.5 84.5 1
8 Oath, The 2005 23 NA 9.8 5 0 0.0 0 0 0 0 0 0.0 24.5 84.5 0
9 Weg ist das Spiel, Der 2005 3 NA 9.8 8 0 0.0 0 0 0 0 0 0.0 24.5 74.5 0
10 Wild Girls Gone 2005 93 NA 9.6 7 0 0.0 0 0 0 0 0 14.5 14.5 74.5 0
Variables not shown: Animation (int), Comedy (int), Drama (int), Documentary (int), Romance (int), Short (int)
dplyr
filter(movies_df, year == 2005, rating > 9.5)
Source: local data frame [10 x 24]
title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 mpaa Action
1 Blaze Orange 2005 16 0 9.6 7 0 0.0 0 0 0 0 0 14.5 14.5 74.5 0
2 Filmic Achievement 2005 80 NA 9.6 10 0 0.0 0 0 0 0 0 14.5 24.5 74.5 0
3 Goodnight Bill 2005 19 NA 9.6 7 0 0.0 0 0 0 0 0 0.0 44.5 45.5 0
4 Goodnite Charlie 2005 119 100000 9.8 34 0 4.5 0 0 0 0 0 4.5 4.5 84.5 1
5 Keeper of the Past 2005 18 30000 9.9 7 0 0.0 0 0 0 0 0 0.0 14.5 84.5 0
6 Morphin(e) 2005 20 8000 9.7 7 0 0.0 0 0 0 0 0 0.0 24.5 74.5 1
7 Nun Fu 2005 5 5000 9.8 5 0 0.0 0 0 0 0 0 0.0 24.5 84.5 1
8 Oath, The 2005 23 NA 9.8 5 0 0.0 0 0 0 0 0 0.0 24.5 84.5 0
9 Weg ist das Spiel, Der 2005 3 NA 9.8 8 0 0.0 0 0 0 0 0 0.0 24.5 74.5 0
10 Wild Girls Gone 2005 93 NA 9.6 7 0 0.0 0 0 0 0 0 14.5 14.5 74.5 0
Variables not shown: Animation (int), Comedy (int), Drama (int), Documentary (int), Romance (int), Short (int)
base
movies_df[movies_df$year == 2005 & movies_df$rating > 9.5 & (movies_df$Animation == 1 | movies_df$Short == 1), ]
Source: local data frame [7 x 24]
title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 mpaa Action
1 Blaze Orange 2005 16 0 9.6 7 0 0 0 0 0 0 0 14.5 14.5 74.5 0
2 Goodnight Bill 2005 19 NA 9.6 7 0 0 0 0 0 0 0 0.0 44.5 45.5 0
3 Keeper of the Past 2005 18 30000 9.9 7 0 0 0 0 0 0 0 0.0 14.5 84.5 0
4 Morphin(e) 2005 20 8000 9.7 7 0 0 0 0 0 0 0 0.0 24.5 74.5 1
5 Nun Fu 2005 5 5000 9.8 5 0 0 0 0 0 0 0 0.0 24.5 84.5 1
6 Oath, The 2005 23 NA 9.8 5 0 0 0 0 0 0 0 0.0 24.5 84.5 0
7 Weg ist das Spiel, Der 2005 3 NA 9.8 8 0 0 0 0 0 0 0 0.0 24.5 74.5 0
Variables not shown: Animation (int), Comedy (int), Drama (int), Documentary (int), Romance (int),
Short (int)
dplyr
filter(movies_df, year == 2005, rating > 9.5, Animation == 1 | Short == 1)
Source: local data frame [7 x 24]
title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 mpaa Action
1 Blaze Orange 2005 16 0 9.6 7 0 0 0 0 0 0 0 14.5 14.5 74.5 0
2 Goodnight Bill 2005 19 NA 9.6 7 0 0 0 0 0 0 0 0.0 44.5 45.5 0
3 Keeper of the Past 2005 18 30000 9.9 7 0 0 0 0 0 0 0 0.0 14.5 84.5 0
4 Morphin(e) 2005 20 8000 9.7 7 0 0 0 0 0 0 0 0.0 24.5 74.5 1
5 Nun Fu 2005 5 5000 9.8 5 0 0 0 0 0 0 0 0.0 24.5 84.5 1
6 Oath, The 2005 23 NA 9.8 5 0 0 0 0 0 0 0 0.0 24.5 84.5 0
7 Weg ist das Spiel, Der 2005 3 NA 9.8 8 0 0 0 0 0 0 0 0.0 24.5 74.5 0
Variables not shown: Animation (int), Comedy (int), Drama (int), Documentary (int), Romance (int),
Short (int)
We can include any ?Comparison or ?base::Logic operators in our filter.
Slice is a variant of filter used to extract rows based on position.
base
movies_df[1:10, ]
Source: local data frame [10 x 24]
title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9
1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5 14.5 24.5 24.5 14.5 4.5
2 $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5 24.5 14.5 14.5 14.5 4.5 4.5
3 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0 0.0 0.0 24.5 0.0 44.5 24.5
4 $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 34.5
5 $50,000 Climax Show, The 1975 71 NA 3.4 17 24.5 4.5 0.0 14.5 14.5 4.5 0.0 0.0 0.0
6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5 14.5 14.5 4.5 4.5 14.5
7 $windle 2002 93 NA 5.3 200 4.5 0.0 4.5 4.5 24.5 24.5 14.5 4.5 4.5
8 '15' 2002 25 NA 6.7 24 4.5 4.5 4.5 4.5 4.5 14.5 14.5 14.5 4.5
9 '38 1987 97 NA 6.6 18 4.5 4.5 4.5 0.0 0.0 0.0 34.5 14.5 4.5
10 '49-'17 1917 61 NA 6.0 51 4.5 0.0 4.5 4.5 4.5 44.5 14.5 4.5 4.5
Variables not shown: r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int),
Documentary (int), Romance (int), Short (int)
dplyr
slice(movies_df, 1:10)
Source: local data frame [10 x 24]
title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9
1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5 14.5 24.5 24.5 14.5 4.5
2 $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5 24.5 14.5 14.5 14.5 4.5 4.5
3 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0 0.0 0.0 24.5 0.0 44.5 24.5
4 $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 34.5
5 $50,000 Climax Show, The 1975 71 NA 3.4 17 24.5 4.5 0.0 14.5 14.5 4.5 0.0 0.0 0.0
6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5 14.5 14.5 4.5 4.5 14.5
7 $windle 2002 93 NA 5.3 200 4.5 0.0 4.5 4.5 24.5 24.5 14.5 4.5 4.5
8 '15' 2002 25 NA 6.7 24 4.5 4.5 4.5 4.5 4.5 14.5 14.5 14.5 4.5
9 '38 1987 97 NA 6.6 18 4.5 4.5 4.5 0.0 0.0 0.0 34.5 14.5 4.5
10 '49-'17 1917 61 NA 6.0 51 4.5 0.0 4.5 4.5 4.5 44.5 14.5 4.5 4.5
Variables not shown: r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int),
Documentary (int), Romance (int), Short (int)
Order data based on specified columns.
base
movies_df[order(movies_df$votes), ]
Source: local data frame [58,788 x 24]
title year length budget rating votes r1
1 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0
2 'Sugar Chile' Robinson, Billie Holiday, Count Basie and His Sextet 1951 15 NA 5.3 5 0.0
3 ...ani smrt nebere 1996 88 NA 3.6 5 44.5
4 ...dopo di che, uccide il maschio e lo divora 1971 96 NA 6.1 5 0.0
5 ...und abends in die Scala 1958 94 NA 5.7 5 0.0
6 ...und die Liebe lacht dazu 1957 97 NA 6.8 5 0.0
7 1+1=3 1979 85 NA 7.5 5 0.0
8 1,000 Dollars a Minute 1935 70 NA 6.2 5 0.0
9 10.7 1997 28 NA 8.8 5 0.0
10 10000th Day, The 1997 18 NA 8.0 5 0.0
.. ... ... ... ... ... ... ...
Variables not shown: r2 (dbl), r3 (dbl), r4 (dbl), r5 (dbl), r6 (dbl), r7 (dbl), r8 (dbl), r9 (dbl),
r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int), Documentary (int),
Romance (int), Short (int)
dplyr
arrange(movies_df, votes)
Source: local data frame [58,788 x 24]
title year length budget rating votes r1
1 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0
2 'Sugar Chile' Robinson, Billie Holiday, Count Basie and His Sextet 1951 15 NA 5.3 5 0.0
3 ...ani smrt nebere 1996 88 NA 3.6 5 44.5
4 ...dopo di che, uccide il maschio e lo divora 1971 96 NA 6.1 5 0.0
5 ...und abends in die Scala 1958 94 NA 5.7 5 0.0
6 ...und die Liebe lacht dazu 1957 97 NA 6.8 5 0.0
7 1+1=3 1979 85 NA 7.5 5 0.0
8 1,000 Dollars a Minute 1935 70 NA 6.2 5 0.0
9 10.7 1997 28 NA 8.8 5 0.0
10 10000th Day, The 1997 18 NA 8.0 5 0.0
.. ... ... ... ... ... ... ...
Variables not shown: r2 (dbl), r3 (dbl), r4 (dbl), r5 (dbl), r6 (dbl), r7 (dbl), r8 (dbl), r9 (dbl),
r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int), Documentary (int),
Romance (int), Short (int)
dplyr descending
arrange(movies_df, desc(votes))
Source: local data frame [58,788 x 24]
title year length budget rating votes r1 r2 r3 r4 r5
1 Lord of the Rings: The Fellowship of the Ring, The 2001 208 93000000 8.8 157608 4.5 4.5 4.5 4.5 4.5
2 Shawshank Redemption, The 1994 142 25000000 9.1 149494 4.5 4.5 4.5 4.5 4.5
3 Matrix, The 1999 136 63000000 8.5 143853 4.5 4.5 4.5 4.5 4.5
4 Star Wars 1977 125 11000000 8.8 134640 4.5 4.5 4.5 4.5 4.5
5 Pulp Fiction 1994 168 8000000 8.8 132745 4.5 4.5 4.5 4.5 4.5
6 Godfather, The 1972 175 6000000 9.1 122755 4.5 4.5 4.5 4.5 4.5
7 Lord of the Rings: The Two Towers, The 2002 223 94000000 8.8 114797 4.5 4.5 4.5 4.5 4.5
8 Fight Club 1999 139 63000000 8.5 112092 4.5 4.5 4.5 4.5 4.5
9 American Beauty 1999 121 15000000 8.5 109991 4.5 4.5 4.5 4.5 4.5
10 Usual Suspects, The 1995 106 6000000 8.7 103854 4.5 4.5 4.5 4.5 4.5
.. ... ... ... ... ... ... ... ... ... ... ...
Variables not shown: r6 (dbl), r7 (dbl), r8 (dbl), r9 (dbl), r10 (dbl), mpaa (fctr), Action (int), Animation
(int), Comedy (int), Drama (int), Documentary (int), Romance (int), Short (int)
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
base
movies_df[order(movies_df$budget, -movies_df$votes), ]
Source: local data frame [58,788 x 24]
title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8
1 Last Broadcast, The 1998 86 0 5.7 967 14.5 4.5 4.5 4.5 4.5 14.5 14.5 14.5
2 Great Train Robbery, The 1903 12 0 7.4 849 4.5 4.5 4.5 4.5 4.5 14.5 24.5 24.5
3 Tarnation 2003 88 0 7.7 596 4.5 4.5 4.5 4.5 4.5 4.5 14.5 14.5
4 Six Figures Getting Sick 1966 4 0 6.3 357 4.5 4.5 4.5 4.5 14.5 14.5 14.5 14.5
5 My Hustler 1965 79 0 7.0 39 4.5 4.5 4.5 4.5 4.5 4.5 14.5 24.5
6 Adventures of Dollie, The 1908 12 0 5.1 34 4.5 4.5 4.5 4.5 24.5 14.5 4.5 24.5
7 Run Leia Run 2003 16 0 5.5 32 4.5 14.5 4.5 4.5 4.5 4.5 4.5 14.5
8 Visitant, The 1981 22 0 7.1 30 24.5 0.0 0.0 4.5 0.0 4.5 14.5 4.5
9 From the Drain 1967 14 0 5.1 28 14.5 4.5 0.0 14.5 14.5 24.5 14.5 4.5
10 Zero 2000 20 0 8.1 24 24.5 0.0 0.0 0.0 0.0 4.5 0.0 0.0
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Variables not shown: r9 (dbl), r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int),
Drama (int), Documentary (int), Romance (int), Short (int)
dplyr descending
arrange(movies_df, budget, desc(votes))
Source: local data frame [58,788 x 24]
title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8
1 Last Broadcast, The 1998 86 0 5.7 967 14.5 4.5 4.5 4.5 4.5 14.5 14.5 14.5
2 Great Train Robbery, The 1903 12 0 7.4 849 4.5 4.5 4.5 4.5 4.5 14.5 24.5 24.5
3 Tarnation 2003 88 0 7.7 596 4.5 4.5 4.5 4.5 4.5 4.5 14.5 14.5
4 Six Figures Getting Sick 1966 4 0 6.3 357 4.5 4.5 4.5 4.5 14.5 14.5 14.5 14.5
5 My Hustler 1965 79 0 7.0 39 4.5 4.5 4.5 4.5 4.5 4.5 14.5 24.5
6 Adventures of Dollie, The 1908 12 0 5.1 34 4.5 4.5 4.5 4.5 24.5 14.5 4.5 24.5
7 Run Leia Run 2003 16 0 5.5 32 4.5 14.5 4.5 4.5 4.5 4.5 4.5 14.5
8 Visitant, The 1981 22 0 7.1 30 24.5 0.0 0.0 4.5 0.0 4.5 14.5 4.5
9 From the Drain 1967 14 0 5.1 28 14.5 4.5 0.0 14.5 14.5 24.5 14.5 4.5
10 Zero 2000 20 0 8.1 24 24.5 0.0 0.0 0.0 0.0 4.5 0.0 0.0
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Variables not shown: r9 (dbl), r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int),
Drama (int), Documentary (int), Romance (int), Short (int)
Often you work with large datasets with many columns where only a few are actually of interest to you. You provide a name of a dataframe and list of variables you want to keep.
base
movies_df[ , colnames(movies_df) %in% c("title","length")]
movies_df[ , c("title", "length")]
Source: local data frame [58,788 x 2]
title length
1 $ 121
2 $1000 a Touchdown 71
3 $21 a Day Once a Month 7
4 $40,000 70
5 $50,000 Climax Show, The 71
6 $pent 91
7 $windle 93
8 '15' 25
9 '38 97
10 '49-'17 61
.. ... ...
dplyr
select(movies_df, title, year)
select(movies_df, one_of(c("title", "year")))
Source: local data frame [58,788 x 2]
title year
1 $ 1971
2 $1000 a Touchdown 1939
3 $21 a Day Once a Month 1941
4 $40,000 1996
5 $50,000 Climax Show, The 1975
6 $pent 2000
7 $windle 2002
8 '15' 2002
9 '38 1987
10 '49-'17 1917
.. ... ...
dplyr exclude columns
select(movies_df, -(r1:mpaa))
Source: local data frame [58,788 x 13]
title year length budget rating votes Action Animation Comedy Drama Documentary
1 $ 1971 121 NA 6.4 348 0 0 1 1 0
2 $1000 a Touchdown 1939 71 NA 6.0 20 0 0 1 0 0
3 $21 a Day Once a Month 1941 7 NA 8.2 5 0 1 0 0 0
4 $40,000 1996 70 NA 8.2 6 0 0 1 0 0
5 $50,000 Climax Show, The 1975 71 NA 3.4 17 0 0 0 0 0
6 $pent 2000 91 NA 4.3 45 0 0 0 1 0
7 $windle 2002 93 NA 5.3 200 1 0 0 1 0
8 '15' 2002 25 NA 6.7 24 0 0 0 0 1
9 '38 1987 97 NA 6.6 18 0 0 0 1 0
10 '49-'17 1917 61 NA 6.0 51 0 0 0 0 0
.. ... ... ... ... ... ... ... ... ... ... ...
Variables not shown: Romance (int), Short (int)
There are a number of helper functions you can use within select(), like starts_with(), ends_with(), matches() and contains(). These let you quickly match larger blocks of variable that meet some criterion.
As well as selecting from the set of existing columns, it’s often useful to add new columns that are functions of existing columns or create stand alone variables.
When you add new variables they always go to the end.
As an example, lets calculate the age of the movie.
base
yearsSinceDate = 2015 - movies_df$year
movies_df_new = cbind(yearsSinceDate, movies_df)
Source: local data frame [58,788 x 25]
yearsSinceDate title year length budget rating votes r1 r2 r3 r4 r5 r6
1 44 $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5 14.5 24.5
2 76 $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5 24.5 14.5 14.5
3 74 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0 0.0 0.0 24.5
4 19 $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0 0.0 0.0 0.0
5 40 $50,000 Climax Show, The 1975 71 NA 3.4 17 24.5 4.5 0.0 14.5 14.5 4.5
6 15 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5 14.5 14.5
7 13 $windle 2002 93 NA 5.3 200 4.5 0.0 4.5 4.5 24.5 24.5
8 13 '15' 2002 25 NA 6.7 24 4.5 4.5 4.5 4.5 4.5 14.5
9 28 '38 1987 97 NA 6.6 18 4.5 4.5 4.5 0.0 0.0 0.0
10 98 '49-'17 1917 61 NA 6.0 51 4.5 0.0 4.5 4.5 4.5 44.5
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
Variables not shown: r7 (dbl), r8 (dbl), r9 (dbl), r10 (dbl), mpaa (fctr), Action (int), Animation
(int), Comedy (int), Drama (int), Documentary (int), Romance (int), Short (int)
dplyr
transmute(movies_df, yearsSinceDate = 2015 - year)
mutate(movies_df, yearsSinceDate = 2015 - year)
Source: local data frame [58,788 x 25]
title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9
1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5 14.5 24.5 24.5 14.5 4.5
2 $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5 24.5 14.5 14.5 14.5 4.5 4.5
3 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0 0.0 0.0 24.5 0.0 44.5 24.5
4 $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 34.5
5 $50,000 Climax Show, The 1975 71 NA 3.4 17 24.5 4.5 0.0 14.5 14.5 4.5 0.0 0.0 0.0
6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5 14.5 14.5 4.5 4.5 14.5
7 $windle 2002 93 NA 5.3 200 4.5 0.0 4.5 4.5 24.5 24.5 14.5 4.5 4.5
8 '15' 2002 25 NA 6.7 24 4.5 4.5 4.5 4.5 4.5 14.5 14.5 14.5 4.5
9 '38 1987 97 NA 6.6 18 4.5 4.5 4.5 0.0 0.0 0.0 34.5 14.5 4.5
10 '49-'17 1917 61 NA 6.0 51 4.5 0.0 4.5 4.5 4.5 44.5 14.5 4.5 4.5
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Variables not shown: r10 (dbl), mpaa (fctr), Action (int), Animation (int), Comedy (int), Drama (int),
Documentary (int), Romance (int), Short (int), yearsSinceDate (dbl)
Many dplyr functions will let you use newly create variables in the same function which is creating the variable in the first place.
Lets calculate how many years left to anniversary.
dplyr
transmute(movies_df, title, yearsSinceDate = 2015 - year, yearsToAnniversary = 100 - yearsSinceDate)
Source: local data frame [58,788 x 3]
title yearsSinceDate yearsToAnniversary
1 $ 44 56
2 $1000 a Touchdown 76 24
3 $21 a Day Once a Month 74 26
4 $40,000 19 81
5 $50,000 Climax Show, The 40 60
6 $pent 15 85
7 $windle 13 87
8 '15' 13 87
9 '38 28 72
10 '49-'17 98 2
.. ... ... ...
Verb is summarise(), which collapses a data frame to a single row.
Lets calculate median for a newly created age column.
transmute(movies_df, yearsSinceDate = 2015 - year) %>% summarise(age.median = median(yearsSinceDate))
Source: local data frame [1 x 1]
age.median
1 32
Or lets count ratio for movies that don’t have a budget.
movies_df %>% summarize(budget.na.percent = sum(is.na(budget)) / length(budget) )
Source: local data frame [1 x 1]
budget.na.percent
1 0.9112914
We can use summarise_each to summarise data across columns. Lets count number of movies by type.
select(movies_df, Action:Short) %>% summarise_each(funs(sum))
Source: local data frame [1 x 7]
Action Animation Comedy Drama Documentary Romance Short
1 4688 3690 17271 21811 3472 4744 9458
group_by breaks the data sets into groups of rows. This function becomes very powerful when combined with the previously discussed dplyr verbs.
Lets calculate number of movies per year per type.
select(movies_df, year, Action:Short) %>% group_by(year) %>% summarise_each(funs(sum))
Source: local data frame [113 x 8]
year Action Animation Comedy Drama Documentary Romance Short
1 1893 0 0 0 0 0 0 1
2 1894 0 0 0 0 5 0 9
3 1895 0 0 0 0 2 0 3
4 1896 0 0 1 1 7 0 12
5 1897 0 0 2 0 6 0 7
6 1898 0 0 2 1 1 0 5
7 1899 0 0 1 2 4 0 8
8 1900 0 1 5 2 5 0 16
9 1901 1 0 8 5 9 1 27
10 1902 0 0 2 1 2 0 9
.. ... ... ... ... ... ... ... ...
We can add additional filter and exclude all movies before 2000 and also count number of votes.
select(movies_df, year, votes, Action:Short) %>% filter(year >= 2000) %>% group_by(year) %>% summarise_each(funs(sum))
Source: local data frame [6 x 9]
year votes Action Animation Comedy Drama Documentary Romance Short
1 2000 2173392 154 89 561 793 175 207 449
2 2001 2263362 169 82 582 837 196 211 468
3 2002 2270605 176 81 591 929 249 245 533
4 2003 1816132 180 94 642 899 261 215 555
5 2004 1540722 147 56 597 805 258 169 480
6 2005 139286 43 10 123 137 35 37 54
A common data analysis task might be to carry out some group-wise normalization or adjustments of the data. For exmaple, we might want to find most rated movies during the calendar year.
To do this we will start by calculating the average rating for each year.
averages = select(movies_df, year, rating) %>% group_by(year) %>% summarise_each(funs(mean)) %>% rename(rating.avg = rating)
Source: local data frame [113 x 2]
year rating.avg
1 1893 7.000000
2 1894 4.888889
3 1895 5.500000
4 1896 5.269231
5 1897 4.677778
6 1898 5.040000
7 1899 4.277778
8 1900 4.731250
9 1901 4.682143
10 1902 4.900000
.. ... ...
Next we will use joing capabilities of dplyr to join averages with movies dataset and calculate the difference.
Finally, I would like to show easy head-to-head comparison of base and dplyr.
Lets prepapre data - generate a data frame with 10000 rows and 100 columns and 100 groups.
rows = 10000
cols = 100
groups = 100
samples = rows / groups
tmp.data = data.frame(matrix(rnorm(rows),rows,cols))
tmp.data$group = rep(1:groups,each=samples)
base
time.start = Sys.time()
# split the data based on the group
big.l = split(tmp.data, tmp.data$group)
# apply some function of interest to all columns
results = sapply(big.l, function(x) apply(x,2,median))
# bind results and add splitting info
results = t(results)
# elapsed time
(time.end = Sys.time() - time.start )
dplyr
time.start <- Sys.time()
# complete all action
results = tmp.data %>% group_by(group) %>% summarise_each(funs(median(.)))
#elapsed time
(time.end = Sys.time() - time.start )
base: Time difference of 0.5896311 secs
dplyr: Time difference of 0.310796 secs