Data Manipulation through dplyr and tibbles

SQL style data manipulation through dplyr package

The package dplyr gives freedom for us to process the data in data frames and tibbles using its functions similar SQL statements. This package dplyr provides:

a set of functions for efficiently manipulating data sets
a grammar of data manipulation

To see all functions included in the dplyr package, we can simply type library(“help:dplyr”) in the command prompt. We will get the following results.

all_equal               Flexible equality comparison for data frames
all_vars                Apply predicate to all variables
arrange                 Arrange rows by variables
arrange_all             Arrange rows by a selection of variables
as.table.tbl_cube       Coerce a 'tbl_cube' to other data structures
as.tbl_cube             Coerce an existing data structure into a'tbl_cube'
auto_copy               Copy tables to same source, if necessary
band_members            Band membership
between                 Do values in a numeric vector fall in specified range?
bind                    Efficiently bind multiple data frames by row and column
case_when               A general vectorised if
coalesce                Find first non-missing element
compute                 Force computation of a database query
copy_to                 Copy a local data frame to a remote src
cumall                  Cumulativate versions of any, all, and mean
desc                    Descending order
distinct                Select distinct/unique rows
do                      Do anything
dplyr-package           dplyr: a grammar of data manipulation
dr_dplyr                Dr Dplyr checks your installation for common problems.
explain                 Explain details of a tbl
filter                  Return rows with matching conditions
filter_all              Filter within a selection of variables
funs                    Create a list of functions calls.
group_by                Group by one or more variables
group_by_all            Group by a selection of variables
groups                  Return grouping variables
ident                   Flag a character vector as SQL identifiers
if_else                 Vectorised if
join                    Join two tbls together
join.tbl_df             Join data frame tbls
lead-lag                Lead and lag.
mutate                  Add new variables
n                       The number of observations in the current group.
n_distinct              Efficiently count the number of unique values in a set of vector
na_if                   Convert values to NA
nasa                    NASA spatio-temporal data
near                    Compare two numeric vectors
nth                     Extract the first, last or nth value from a vector
order_by                A helper function for ordering window function output
pull                    Pull out a single variable
ranking                 Windowed rank functions.
recode                  Recode values
rowwise                 Group input by rows
sample                  Sample n rows from a table
scoped                  Operate on a selection of variables
select                  Select/rename variables by name
select_all              Select and rename a selection of variables
select_helpers          Select helpers
setops                  Set operations
slice                   Select rows by position
sql                     SQL escaping.
src_dbi                 Source for database backends
starwars                Starwars characters
storms                  Storm tracks data
summarise               Reduces multiple values down to a single value
summarise_all           Summarise and mutate multiple columns.
tally                   Count/tally observations by group
tbl                     Create a table from a data source
tbl_cube                A data cube tbl
top_n                   Select top (or bottom) n rows (by value)
vars                    Select variables

The package, dplyr, uses around five functions as main functions out of the ones given above. Each function takes a data set (data frame or tibble) as input, and returns a data set(data frame or tibble) as output. The following are the functions along with data manipulation task details

Select selects certain columns of data. This is akin to SELECT clause of SQL and is comparable with projection operator of relational algebra. This can be also said as column wise sub-setting.
Filter to select specific rows. This is akin to SELECT-WHERE clause of SQL. This can be also said as row wise sub-setting.
Arrange the rows of our data into some order. This can be said as ORDER clause of SQL.
Mutate our data frame to contain new columns or change the existing columns. This is akin to DDL statements of SQL.
Summarize chunks of you data in some way. This is similar to aggregate operations like MAX(), SUM(), TOTAL() of SQL.

Select

We have already discussed about selection of columns of a data frame using their column indices or column names in quotes (single or double) (sub-setting).

Data_Frame_Object[,c(1,2,7)]

Data_Frame_Object(, c(“column_one_name”, “column_two_name”, “column_seven_name”)]

We can use select() function for the purpose as:

select(Data_Frame_Object,1,2,7]

select(Data_Frame_Object, column_one_name, column_two_name, column_seven_name)

Do see the following workout of ours. For ease of display, we are running head command on the result of select() function in both the cases such that first 6 rows of the result of select() will be displayed. Also, do remember that while specifying the column names with select function, we need not required enclose in quotes. Of course, if we want, we can include. However, in the case of sub-setting using [ and ], we have to include the column names in between quotes.

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

head(select(mtcars,1,2,7))

##                    mpg cyl  qsec
## Mazda RX4         21.0   6 16.46
## Mazda RX4 Wag     21.0   6 17.02
## Datsun 710        22.8   4 18.61
## Hornet 4 Drive    21.4   6 19.44
## Hornet Sportabout 18.7   8 17.02
## Valiant           18.1   6 20.22

head(select(mtcars,mpg,cyl,qsec))

##                    mpg cyl  qsec
## Mazda RX4         21.0   6 16.46
## Mazda RX4 Wag     21.0   6 17.02
## Datsun 710        22.8   4 18.61
## Hornet 4 Drive    21.4   6 19.44
## Hornet Sportabout 18.7   8 17.02
## Valiant           18.1   6 20.22

Figure 1.

With select also we can use minus sign to deselect some columns from our select. Of course, similar thing is possible in normal sub-setting also. Here also, we have sent the output of select to head() function to select only six rows for comparison purpose.

library(dplyr)
head(select(mtcars,-1,-2,-7))

##                   disp  hp drat    wt vs am gear carb
## Mazda RX4          160 110 3.90 2.620  0  1    4    4
## Mazda RX4 Wag      160 110 3.90 2.875  0  1    4    4
## Datsun 710         108  93 3.85 2.320  1  1    4    1
## Hornet 4 Drive     258 110 3.08 3.215  1  0    3    1
## Hornet Sportabout  360 175 3.15 3.440  0  0    3    2
## Valiant            225 105 2.76 3.460  1  0    3    1

head(select(mtcars,-mpg,-cyl,-qsec))

##                   disp  hp drat    wt vs am gear carb
## Mazda RX4          160 110 3.90 2.620  0  1    4    4
## Mazda RX4 Wag      160 110 3.90 2.875  0  1    4    4
## Datsun 710         108  93 3.85 2.320  1  1    4    1
## Hornet 4 Drive     258 110 3.08 3.215  1  0    3    1
## Hornet Sportabout  360 175 3.15 3.440  0  0    3    2
## Valiant            225 105 2.76 3.460  1  0    3    1

head(mtcars[-c(1,2,7)])

##                   disp  hp drat    wt vs am gear carb
## Mazda RX4          160 110 3.90 2.620  0  1    4    4
## Mazda RX4 Wag      160 110 3.90 2.875  0  1    4    4
## Datsun 710         108  93 3.85 2.320  1  1    4    1
## Hornet 4 Drive     258 110 3.08 3.215  1  0    3    1
## Hornet Sportabout  360 175 3.15 3.440  0  0    3    2
## Valiant            225 105 2.76 3.460  1  0    3    1

#head(mtcars[-c("mpg","cyl","qsec")])

Figure 2.

We wanted to explore between this select function and conventional sub-setting. So, we have tried to extract same columns more than once using both of them. The conventional sub-setting is giving repeated columns also according to our request, where as select function is giving a column only once. See the following workout of ours.

library(dplyr)
BOD[,c(1,2,1,1)]

##   Time demand Time.1 Time.2
## 1    1    8.3      1      1
## 2    2   10.3      2      2
## 3    3   19.0      3      3
## 4    4   16.0      4      4
## 5    5   15.6      5      5
## 6    7   19.8      7      7

BOD[,c("Time","demand","Time","Time")]

##   Time demand Time.1 Time.2
## 1    1    8.3      1      1
## 2    2   10.3      2      2
## 3    3   19.0      3      3
## 4    4   16.0      4      4
## 5    5   15.6      5      5
## 6    7   19.8      7      7

select(BOD,1,2,1,1)

##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8

select(BOD,Time,demand,Time,Time)

##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8

Figure 3.

Also, we wanted to further explore the difference between conventional sub-setting and select function. We want to identify whether one can specify some columns with minus and some others without minus with the both the approaches. We found that it is acceptable with select function, whereas with conventional sub-setting it is not. See the following workout in our support.

library(dplyr)
head(select(mtcars,-mpg,-cyl,qsec))

##                   disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4          160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag      160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710         108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive     258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant            225 105 2.76 3.460 20.22  1  0    3    1

head(select(mtcars,-1,-2,7))

##                   disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4          160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag      160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710         108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive     258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant            225 105 2.76 3.460 20.22  1  0    3    1

#head(mtcars[,c(-1,-2,7)])

Figure 4.

Exercise: Does the following command gets executed at R console?
BOD[Time,demand,Time, Time]
Answer: No.
Exercise: Does the output of the following two R commands are same?
select(BOD,2,1)
select(BOD,1,2)
Answer: No.
Exercise: How many columns will be there in the result of the following R command?
select(BOD,2,1,1,1)
Answer: 2
Exercise: How many columns will be there in the result of the following R command?
BOD[2,1,1,1]
Answer: 4

We can also use colon to specify a sequence of columns both with column indexes and names. For instance, the following selects four columns of mtcars data frame. select(mtcars, disp:wt)

select(mtcars,3:6)

We can assign new names to the resultant data frame columns of a select function by assigning old variable name to new variable name like

select(dataframe, new_col_name=old_col_name)

select(dataframe, new_col_name=old_col_number_or_index)

Also, if we want resultant data frame to have some columns with the old data frame column’s names while others with new names, we can use rename function.

The following figure illustrates the working of the above concepts on BOD data frame. In the first command, new data frame will be having it column name as ‘Demand’. After the second command, the resultant data frame will be having Time as the first column (same as BOD) and ‘Demand’ as the second column(new name).

library(dplyr)
library(dplyr)
select(BOD,demand)

##   demand
## 1    8.3
## 2   10.3
## 3   19.0
## 4   16.0
## 5   15.6
## 6   19.8

select(BOD,Demand=demand)

##   Demand
## 1    8.3
## 2   10.3
## 3   19.0
## 4   16.0
## 5   15.6
## 6   19.8

Figure 5.

Filter

We used Select to decide whether to include certain columns in our output dataset. Filter allows us to decide to include certain rows in our dataset. This is just like sub-setting operations we have already learned with data frames. For example, we want to select those rows of the data frame, BOD, whose demand value is more than 10. Then, we can use either of the following filter commands.

filter(BOD,“demand”>10)

filter(BOD,demand>10)

We are expected to use some expression may be involving column name(s) and some operators whose final result is TRUE or FALSE in a logical vector whose size is same as the number of rows of data frame. We cannot use simply column name in the filter function like the following,

filter(BOD,demand)

Also, we cannot use column numbers (indexes) in filter conditions like the following. filter(BOD,2>10)

library(dplyr)
filter(BOD,"demand">10)

## Warning: package 'bindrcpp' was built under R version 3.3.3

##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8

filter(BOD,demand>10)

##   Time demand
## 1    2   10.3
## 2    3   19.0
## 3    4   16.0
## 4    5   15.6
## 5    7   19.8

#filter(BOD,demand)
#filter(BOD,2>10)

Figure 6.

Exercise: Does the following command is acceptable?
filter(BOD,c(T,T,F,F,T,T))
Answer: Yes. It selects the rows 1, 2,5, and 6 of BOD.
Exercise: Does the following command is acceptable?
filter(BOD,c(T,T,F,T,T))
Answer: No.

Exercise: Does the following command is acceptable?
filter(BOD,as.logical(rep(1,6)) )
Answer:Yes.

Exercise: Does the following command is acceptable?
filter(BOD, Time*demand>67)
Answer: Yes.
Exercise: What are the equivalent commands in sub-setting for the following filter command?
filter(BOD,c(T,T,F,F,T,T))
Answer:
BOD[BOD[,2]>10,]
BOD[BOD$demand>10,]

Filtering on Multiple Columns

The filter function has the ability to take multiple arguments, i.e., multiple conditions. That is, we can filter on multiple conditions all at once. Also, we can filter on two or more conditions involving same column at the same time without any hesitation. When we use more than one condition, it selects those rows of the data frame which satisfies all the conditions simultaneously. Also, if we want we can use & between conditions. The following work out of ours illustrates use of multiple commands with filter function.

library(dplyr)
filter(BOD,demand>10,demand<16)

##   Time demand
## 1    2   10.3
## 2    5   15.6

filter(BOD,demand>10&demand<16)

##   Time demand
## 1    2   10.3
## 2    5   15.6

filter(BOD,Time>1,Time<6,demand>10,demand<16)

##   Time demand
## 1    2   10.3
## 2    5   15.6

filter(BOD,Time>1&Time<6&demand>10&demand<16)

##   Time demand
## 1    2   10.3
## 2    5   15.6

Figure 7.

Exercise: Does the following command gets executed?
filter(BOD, c(T,T,F,F,T,T),c(T,T,T,F,F,F))
Answer: Yes
Exercise: Does the following command gets executed?
filter(BOD, c(T,T,F,F,T,T)&c(T,T,T,F,F,F))
Answer: Yes

We can also use logical OR (|) between the conditions. We can also use some logical & operations, commas, and OR operations also. See the following workout of ours

library(dplyr)
filter(BOD,Time>1&Time<6,demand>10&demand<16)

##   Time demand
## 1    2   10.3
## 2    5   15.6

filter(BOD,Time>1|Time<4,demand>10|demand<16)

##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8

filter(BOD,Time>1|Time<4&demand>10|demand<16)

##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8

filter(BOD,Time>1&Time<4&demand>10&demand<16)

##   Time demand
## 1    2   10.3

FIgure 8.

We can also do similar things using the %in% operator, here is the same thing without using the | operator. Operator %in% compares the vector of input with the vector of options and returns TRUE for each element of input that is also in the vector of options. For instance, see the following workout:

x <- c(1:5)
x %in% seq(2,4)

## [1] FALSE  TRUE  TRUE  TRUE FALSE

 # [1] FALSE TRUE TRUE TRUE FALSE

That is, the elements of x, 1 to 5 are verified whether they are in seq(2,4). If they exists, result is TRUE; otherwise result is FALSE.

For example, the following selects 3-5 rows of BOD data frame.

 filter(BOD, Time %in%c(3:5))
  Time demand
1    3   19.0
2    4   16.0
3    5   15.6

Exercise: Does the following command gets executed?
filter(BOD, c(T,T,F,F,T,T)|c(T,T,T,F,F,F))
Answer: Yes

arrange

This function is used to order the rows of the data frame based on their values of a specified column (field) or group of column values. That is, it takes a data frame, and a set of column names (or more complicated expressions) to order by. If we provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

For example, we wanted to use mtcars data frame to order the rows such that rows having the number cylinders more first and less at last (descending) and gears in ascending. That is, if some cars are having same number of cylinders then they will sorted based on their gear value in ascending order. We need to specify the column name on which we want sorting has to be done naturally(ascending). If we want sorting in descending order based some column name, we can enclose that column inside desc() function. Of course, we use a minus sign also with that column name. See the following workout which is the result of the following command.

library(dplyr)
arrange(mtcars,desc(cyl),gear)

##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 2  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 3  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 4  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 5  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 6  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 7  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 8  14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 9  15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 10 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 11 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 12 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 13 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 14 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 15 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 16 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 17 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 18 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 19 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 20 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 21 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 22 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 23 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 24 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 25 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 26 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 27 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 28 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 29 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 30 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## 31 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 32 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

Figure 9.

The following also gives the same results.

arrange(mtcars, -cyl,gear)

Exercise: Explain what the following command gives.
select(arrange(mtcars,desc(cyl),gear), 1)[1,]
Answer: It displays mpg of the first car which contains highest number of cylinders.

Exercise: Does the following gets executed?
select(arrange(mtcars,desc(cyl),gear), cyl==8)
Answer: No.

Exercise: Explain how the following command works.
filter(filter(arrange(mtcars,desc(cyl),gear), cyl==8),gear==5)[,"mpg"]
>[1] 15.8 15.0
Answer: 1. mtcars data frame rows are sorted based on decreasing cyl and increasing gear. From the result, the rows whose cyl value is 8 are selected. From this result, rows whose gear value is 5 are selected. From those rows, mpg column will be displayed.

Function mutate

If we want to add some extra columns or change the existing columns, we can use mutate function. Mutate adds columns to the end of the dataset. In the following, we are creating a new data frame from BOD such that it contains third column as Dem in addition to Time and demand columns of BOD. Also, if we want to change an existing column, simply assign some expression to the existing column name inside the mutate function. We do have a similar function of base package known as transmute which returns the new data frame with new columns name. Also, if we want to remove some columns in the resultant data frame of mutate function, we can simply assign NULL to column names.

library(dplyr)
mutate(BOD,Dem=demand*10)

##   Time demand Dem
## 1    1    8.3  83
## 2    2   10.3 103
## 3    3   19.0 190
## 4    4   16.0 160
## 5    5   15.6 156
## 6    7   19.8 198

mutate(BOD,demand=demand*10)

##   Time demand
## 1    1     83
## 2    2    103
## 3    3    190
## 4    4    160
## 5    5    156
## 6    7    198

transmute(BOD,demand=demand*10)

##   demand
## 1     83
## 2    103
## 3    190
## 4    160
## 5    156
## 6    198

Figure 10.

We can use new column names that are used in mutate function also in the same call and with expressions that are assigned to other column names.

If we give column names enclosed between quotes or some integers to mutate function, we get new columns with string or integers are their values (see the following figure). Also, while creating new columns through mutate, we can give that column values as a vector as shown below Figure.

library(dplyr)
mutate(BOD,"demand",6)

##   Time demand "demand" 6
## 1    1    8.3   demand 6
## 2    2   10.3   demand 6
## 3    3   19.0   demand 6
## 4    4   16.0   demand 6
## 5    5   15.6   demand 6
## 6    7   19.8   demand 6

mutate(BOD,Price=c(1,8,2,3,3,2))

##   Time demand Price
## 1    1    8.3     1
## 2    2   10.3     8
## 3    3   19.0     2
## 4    4   16.0     3
## 5    5   15.6     3
## 6    7   19.8     2

mutate(BOD,"Price"=c(1,8,2,3,3,2))

##   Time demand Price
## 1    1    8.3     1
## 2    2   10.3     8
## 3    3   19.0     2
## 4    4   16.0     3
## 5    5   15.6     3
## 6    7   19.8     2

Figure 11,

Also, when we try to add a new column to a data frame using mutate function and with a simple expression (without new name for that column) that is involving the column names of the data frame, then we get new data frame with the expression as the respective column name. See the following figure.

library(dplyr)
#mutate(BOD,demand+10)
#mutate(BOD,"demand"+10)
mutate(BOD,Time*10)

##   Time demand Time * 10
## 1    1    8.3        10
## 2    2   10.3        20
## 3    3   19.0        30
## 4    4   16.0        40
## 5    5   15.6        50
## 6    7   19.8        70

Exercise: Does the following command is acceptable?
mutate(BOD, demand+10)
Answer: Yes

Mutate and window functions

Window functions are the functions that take a vector of values and return another vector of values. The following figure contains some window functions which can be used with mutate function.

Figure

The following workout will illustrate working with the above functions. For instance, lead shifts elements by one location left while lag shifts elements by one element right. At the ends, NA is added. Also, verify working of percent_rank(), min_rank(), cumsum() functions.

Figure

The following R commands are used to illustrate working with windows functions in mutate function. Do observe the output of each command.

mutate(BOD,lead(demand))

##   Time demand lead(demand)
## 1    1    8.3         10.3
## 2    2   10.3         19.0
## 3    3   19.0         16.0
## 4    4   16.0         15.6
## 5    5   15.6         19.8
## 6    7   19.8           NA

mutate(BOD,lead(demand),min_rank(demand),percent_rank(demand))

##   Time demand lead(demand) min_rank(demand) percent_rank(demand)
## 1    1    8.3         10.3                1                  0.0
## 2    2   10.3         19.0                2                  0.2
## 3    3   19.0         16.0                5                  0.8
## 4    4   16.0         15.6                4                  0.6
## 5    5   15.6         19.8                3                  0.4
## 6    7   19.8           NA                6                  1.0

mutate(BOD,lead(demand),min_rank(demand),percent_rank(demand),cumsum(demand))

##   Time demand lead(demand) min_rank(demand) percent_rank(demand)
## 1    1    8.3         10.3                1                  0.0
## 2    2   10.3         19.0                2                  0.2
## 3    3   19.0         16.0                5                  0.8
## 4    4   16.0         15.6                4                  0.6
## 5    5   15.6         19.8                3                  0.4
## 6    7   19.8           NA                6                  1.0
##   cumsum(demand)
## 1            8.3
## 2           18.6
## 3           37.6
## 4           53.6
## 5           69.2
## 6           89.0

mutate(BOD,lead(demand),min_rank(demand),percent_rank(demand),cumsum(demand),row_number(demand))

##   Time demand lead(demand) min_rank(demand) percent_rank(demand)
## 1    1    8.3         10.3                1                  0.0
## 2    2   10.3         19.0                2                  0.2
## 3    3   19.0         16.0                5                  0.8
## 4    4   16.0         15.6                4                  0.6
## 5    5   15.6         19.8                3                  0.4
## 6    7   19.8           NA                6                  1.0
##   cumsum(demand) row_number(demand)
## 1            8.3                  1
## 2           18.6                  2
## 3           37.6                  5
## 4           53.6                  4
## 5           69.2                  3
## 6           89.0                  6

Figure

Function summarise

This function helps us to find aggregate operations like column averages, maximum?s etc. For example, the following displays average mpg, average hp and average weight of the cars in mtcars data frame.

library(dplyr)
summarise(mtcars,avgmpg=mean(mpg,na.rm=T),avghp=mean(hp,na.rm=T),avgweight=mean(wt,na.rm=T))

Output is given as: avgmpg avghp avgweight 1 20.09062 146.6875 3.21725

The following R command displays, average mpg, average hp and maximum weight of the cars in mtcars data frame. >summarise(mtcars,avgmpg=mean(mpg,na.rm=T),avghp=mean(hp,na.rm=T),maxweight=max(wt,na.rm=T)) avgmpg avghp maxweight 1 20.09062 146.6875 5.424

We can use group_by function along with summarise. For example, in the following workout, we have calculated average weight of cars with cylinders 4, 6 and 8 using group_by and summarise functions. Also, we have illustrated use of group_by to group cars based on both cyl and gear and then apply summarise function to compute their average weight. When we group by using multiple variables, each summary peels off one level of the grouping.

library(dplyr)
summarise(group_by(mtcars,cyl), meanwt=mean(wt,na.rm=T))

## # A tibble: 3 x 2
##     cyl   meanwt
##   <dbl>    <dbl>
## 1     4 2.285727
## 2     6 3.117143
## 3     8 3.999214

summarise(group_by(mtcars,cyl,gear), meanwt=mean(wt,na.rm=T))

## # A tibble: 8 x 3
## # Groups:   cyl [?]
##     cyl  gear   meanwt
##   <dbl> <dbl>    <dbl>
## 1     4     3 2.465000
## 2     4     4 2.378125
## 3     4     5 1.826500
## 4     6     3 3.337500
## 5     6     4 3.093750
## 6     6     5 2.770000
## 7     8     3 4.104083
## 8     8     5 3.370000

Figure 13.

We can also use summarise_each() also like summarise. Both summarise, summarise_each use summary functions, functions that take a vector of values and return a single value, such as:

Figure

The following workout illustrates working with summarise_each. The following are the command executed here.

summarise_each(BOD,funs(mean))

## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`

##       Time   demand
## 1 3.666667 14.83333

summarise_each(BOD,funs(mean,max,min))

## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`

##   Time_mean demand_mean Time_max demand_max Time_min demand_min
## 1  3.666667    14.83333        7       19.8        1        8.3

summarise_each(BOD,funs(mean,max,min,IQR))

## `summarise_each()` is deprecated.
## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
## To map `funs` over all variables, use `summarise_all()`

##   Time_mean demand_mean Time_max demand_max Time_min demand_min Time_IQR
## 1  3.666667    14.83333        7       19.8        1        8.3      2.5
##   demand_IQR
## 1      6.625

Figure

From the above, we can understand that summarise_at, summarise_all(), summarise_if() can be used. These there are three variants of summarise family of functions.

summarise_all affects every variable summarise_at affects variables selected with a character vector or vars() summarise_if affects variables selected with a predicate function:

The following figure illustrates working with the above functions. In this workout, the following commands are executed. We can use funs() to execute multiple functions such as mean, max, etc., given in the above figure.

iris %>% group_by(Species)%>%summarise_all(mean)

## # A tibble: 3 x 5
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##       <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

iris %>% group_by(Species)%>%summarise_at(c("Sepal.Length","Sepal.Width"),mean, na.rm=T)

## # A tibble: 3 x 3
##      Species Sepal.Length Sepal.Width
##       <fctr>        <dbl>       <dbl>
## 1     setosa        5.006       3.428
## 2 versicolor        5.936       2.770
## 3  virginica        6.588       2.974

iris %>% group_by(Species)%>%summarise_at(vars(Sepal.Length:Sepal.Width),mean, na.rm=T)

## # A tibble: 3 x 3
##      Species Sepal.Length Sepal.Width
##       <fctr>        <dbl>       <dbl>
## 1     setosa        5.006       3.428
## 2 versicolor        5.936       2.770
## 3  virginica        6.588       2.974

iris %>% group_by(Species)%>%summarise_if(is.double,mean, na.rm=T)

## # A tibble: 3 x 5
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##       <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

iris %>% group_by(Species)%>%summarise_if(is.double,funs(mean,max), na.rm=T)

## # A tibble: 3 x 9
##      Species Sepal.Length_mean Sepal.Width_mean Petal.Length_mean
##       <fctr>             <dbl>            <dbl>             <dbl>
## 1     setosa             5.006            3.428             1.462
## 2 versicolor             5.936            2.770             4.260
## 3  virginica             6.588            2.974             5.552
## # ... with 5 more variables: Petal.Width_mean <dbl>,
## #   Sepal.Length_max <dbl>, Sepal.Width_max <dbl>, Petal.Length_max <dbl>,
## #   Petal.Width_max <dbl>

Figure

We do have mutate_all(), mutate_at(), mutate_if() functions to change data frame or tibble. For example, see the following.

#Each column of iris data frame will be converted to integer.
iris %>% as_tibble() %>% mutate_all(as.integer)%>%glimpse(.)

## Observations: 150
## Variables: 5
## $ Sepal.Length <int> 5, 4, 4, 4, 5, 5, 4, 5, 4, 4, 5, 4, 4, 4, 5, 5, 5...
## $ Sepal.Width  <int> 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 4, 4, 3...
## $ Petal.Length <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Petal.Width  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Species      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

#Only Sepal.Length and Sepal.Width columns if iris data frame are converted to integer.
iris %>% as_tibble() %>% mutate_at(vars(Sepal.Length:Sepal.Width), as.integer)%>%glimpse(.)

## Observations: 150
## Variables: 5
## $ Sepal.Length <int> 5, 4, 4, 4, 5, 5, 4, 5, 4, 4, 5, 4, 4, 4, 5, 5, 5...
## $ Sepal.Width  <int> 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 4, 4, 3...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

#Those columns of iris data frame which are double type of will be converted to integer.
iris %>% as_tibble() %>% mutate_if(is.double, as.integer)%>%glimpse(.)

## Observations: 150
## Variables: 5
## $ Sepal.Length <int> 5, 4, 4, 4, 5, 5, 4, 5, 4, 4, 5, 4, 4, 4, 5, 5, 5...
## $ Sepal.Width  <int> 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 4, 4, 3...
## $ Petal.Length <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Petal.Width  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

#Those columns of iris data frame which are double type are divided with 2.54.
iris %>% as_tibble() %>% mutate_if(is.double, funs(. / 2.54))%>%glimpse(.)

## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 2.007874, 1.929134, 1.850394, 1.811024, 1.968504,...
## $ Sepal.Width  <dbl> 1.377953, 1.181102, 1.259843, 1.220472, 1.417323,...
## $ Petal.Length <dbl> 0.5511811, 0.5511811, 0.5118110, 0.5905512, 0.551...
## $ Petal.Width  <dbl> 0.07874016, 0.07874016, 0.07874016, 0.07874016, 0...
## $ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

Functions sample_n, sample_frac We can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction. The following workout illustrates the working with these functions.

library(dplyr)
sample_n(mtcars,4)

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3

sample_frac(mtcars,0.2)

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3

Figure 14.

Piping operator

Like piping concept Unix/Linux or layering concept of ggplot2 or ggvis, dplyr also supports piping concept through which we can written easily decipherable data processing functions. Here, %>% is called as the piping operator whose primary action is output of the left side command (or function) to become as input to the right hand side command or function. While designing a piping sequence, the first argument becomes the data frame or tibble on which a series of data manipulation functions such as select, mutate, etc., are supposed to run. In the following, we have demonstrated with a data manipulation command and its equivalent one using piping. It is implicit that the function right hand side of the operator %>% takes the result of the left hand side function result as the first argument to it. However, we can first argument as dot in all the right hand side functions to piping operator %>%. Here, ?.? is a special object that represents the output from the prior step.

library(dplyr)
filter(filter(BOD,demand>10),demand<16)

##   Time demand
## 1    2   10.3
## 2    5   15.6

BOD %>% filter(demand>10) %>% filter(demand<16)

##   Time demand
## 1    2   10.3
## 2    5   15.6

BOD %>% filter(., demand>10) %>% filter(., demand<16)

##   Time demand
## 1    2   10.3
## 2    5   15.6

Figure 15.

In the following figure, we have R statement to select the car which is having largest mpg value and hp value more than 200 and more number of gears. We have also shown its equivalent piping command also.

library(dplyr)
head(arrange(filter(filter(mtcars,hp>200),gear==max(gear)), -mpg),1)

##    mpg cyl disp  hp drat   wt qsec vs am gear carb
## 1 15.8   8  351 264 4.22 3.17 14.5  0  1    5    4

mtcars %>% filter(hp>200) %>% filter(gear==max(gear)) %>% arrange(-mpg) %>% head(.,1)

##    mpg cyl disp  hp drat   wt qsec vs am gear carb
## 1 15.8   8  351 264 4.22 3.17 14.5  0  1    5    4

Figure 16.

Tibbles through dplyr

Developers of this tibble package claims that ?Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e.?converting character vectors to factors).?. Tibbles are first introduced in dplyr package. Later tibble package is released by changing tbl_df class of dplyr. These data types gives little better control over manipulating data using dplyr(sub-setting) in addition to its variety of printing style. When we enter a tibble type object at the console prompt, we get its first 10 rows, and all the columns that fit on screen. In addition to its name, each column reports its type, a nice feature borrowed from str().This makes it much easier to work with large data.

We can use tbl_df function to convert a data frame to a tibble. Let us make a simple experiment. Let us use this function with mtcars and flights (of course, flights is a tibble anyway). See the following figure. We find that first 10 rows of the data set is displayed along with additional information about how many rows and columns left. Also, it is showing remaining columns names and their types.

library(dplyr)
library(nycflights13)

## Warning: package 'nycflights13' was built under R version 3.3.3

tbl_df(mtcars)

## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
##  2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
##  3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
##  4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
##  5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
##  6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
##  7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
##  8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
##  9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
## 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
## # ... with 22 more rows

tbl_df(flights)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

$Figure 17.$ As detailed above, Tibbles are designed so that we don’t accidentally overwhelm our console when we enter data frame object names at the console prompt to see their content. However, sometimes we need more output than the default display. We can explicitly print() some number of rows of a tibble (or data frame) by specifying number of lines with argument n and the width (number of columns of tibble or data frame) with width argument; width= Inf will display all columns. See the following statement which displays 10 lines of mtcars.

mtcars %>% print(n=10,width=Inf)

We can control the default appearance of the tibble display with options: options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only the first m rows. Use options(tibble.print_max = Inf) to always show all rows. options(tibble.width = Inf) will always print all columns, regardless of the width of the screen.

We want readers to experiment the following commands.

options(tibble.print_max=Inf)

weather

options(tibble.width=Inf)

weather

Also, scrollable view of the complete tibble or data frame can be displayed using View() function. For example, the following will display flights tibble content.

library(dplyr)
library(nycflights13)
flights %>% View()

Ofcourse, we can work like this also.

library(dplyr)
library(nycflights13)
View(flights)

The following is the resulting window.

Figure.

Tibbles are quite strict about sub-setting. [ always returns another tibble. Contrast this with a data frame: sometimes [ returns a data frame and sometimes it just returns a vector. See the following figure which tries to use sub-setting with flights tibble that comes along with package nycflight113. Do not forget to load nycflights13 package first.

library(dplyr)
library(nycflights13)
flights[1:3,]

## # A tibble: 3 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

flights[1:3,1:3]

## # A tibble: 3 x 3
##    year month   day
##   <int> <int> <int>
## 1  2013     1     1
## 2  2013     1     1
## 3  2013     1     1

Figure 18.

We can use normal [[ to access columns. Of course, we can use column names with $ like tipplename $columnname also. This is valid with tipples also. The result of this operators with tipple gives vector. See the following figure.

library(dplyr)
library(nycflights13)
library(tibble)

## Warning: package 'tibble' was built under R version 3.3.3

class(flights[[1]])

## [1] "integer"

class(flights$year)

## [1] "integer"

is.vector(flights[[1]])

## [1] TRUE

is.data.frame(flights[[1]])

## [1] FALSE

is.tibble(flights[[1]])

## [1] FALSE

is.vector(flights$year)

## [1] TRUE

is.data.frame(flights$year)

## [1] FALSE

is.tibble(flights$year)

## [1] FALSE

Figure 19.

Tibbles are also stricter with $. Tibbles never do partial matching, and will throw a warning and return NULL if the column does not exist.Also, tibbles ignore the drop argument.

After loading the package, tibble, we can create tibbles using tibble() function. The following figure illustrates how to create tibble without column names and with column names. Also, the following figure illustrates what happens if a column value is having single element? Obviously, it recycles. Do remember that when constructing a tibble, only values of length 1 are recycled. The first column with length different to one determines the number of rows in the tibble, conflicts lead to an error. This also extends to tibbles with zero rows, which is sometimes important for programming.

library(dplyr)
library(nycflights13)
x<-tibble(1:5,c(81,22,33,12,33))
y<-tibble(a=1:5,b=c(81,22,33,12,33))
z<-tibble(a=1:5,b=7)

Figure 20.

Unlike?data.frames(),?tibble()?does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.

It is also possible for a tibble to have column names that are not valid R variable names, aka?non-syntactic?names(which might not start with a letter, or they might not contain unusual characters like a space). To refer to these variables, we need to surround them with backticks (`):

Figure 21.

We can use add_column() function to add columns to a tibble. However, we cannot change a column or change number of rows of a tibble.

library(dplyr)
library(nycflights13)
df<-tibble(x=1:3,y=3:1)
add_column(df,z=-1:1,w=0)

## # A tibble: 3 x 4
##       x     y     z     w
##   <int> <int> <int> <dbl>
## 1     1     3    -1     0
## 2     2     2     0     0
## 3     3     1     1     0

#add_column(df,x=4:6)
#add_column(df,z=1:5)

Figure 22.

We can use add_row() function to add rows to a tibble. See the following workout of ours and the comments there in.

library(dplyr)
library(nycflights13)
df<-tibble(x=1:3,y=3:1)
add_row(df,x=4,y=0)

## # A tibble: 4 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     3
## 2     2     2
## 3     3     1
## 4     4     0

add_row(df,x=4,y=6,.before=2)

## # A tibble: 4 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     3
## 2     4     6
## 3     2     2
## 4     3     1

add_row(df,x=4:5,y=0:-1)

## # A tibble: 5 x 2
##       x     y
##   <int> <int>
## 1     1     3
## 2     2     2
## 3     3     1
## 4     4     0
## 5     5    -1

add_row(df,x=4)

## # A tibble: 4 x 2
##       x     y
##   <dbl> <int>
## 1     1     3
## 2     2     2
## 3     3     1
## 4     4    NA

#add_row(df,z=10)

Figure 23.

Function glimpse is also used to see the content of a tibble. Its output looks like a transposed version of print: columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It?s a little like str() applied to a data frame but it tries to show you as much data as possible. (And it always shows the underlying data, even when applied to a remote data source.)

Another way to create a tibble is with?tribble(), short form for?transposed?tibble. Function?tribble()?is customised for data entry in code: column headings are defined by formulas (i.e. they start with?~), and entries are separated by commas.? Create tibbles using an easier to read row-by-row layout. They are useful for small tables of data where readability is important.

library(dplyr)
library(nycflights13)
x<-tribble(
~colA,~colB,
"a",1,
"b",2,
"c",3
)
x

## # A tibble: 3 x 2
##    colA  colB
##   <chr> <dbl>
## 1     a     1
## 2     b     2
## 3     c     3

y<-tribble(
~x,~y,
"a",1:3,
"b",4:6
)
y

## # A tibble: 2 x 2
##       x         y
##   <chr>    <list>
## 1     a <int [3]>
## 2     b <int [3]>

Figure 24.

Also, all the functions(select, filter, arrange, mutate, groub_by, piping) related to dplyr package which we have discussed in the previous pages are equally applicable with tibbles.

We can use pipe operator with the help of . operator as shown below.

library(dplyr)
library(nycflights13)
x<-tibble(a=1:5,b=c(12,22,32,22,32))
x %>% .[["a"]]

## [1] 1 2 3 4 5

x  %>% .[[1]]

## [1] 1 2 3 4 5

x %>% .[[1:2]]

## [1] 2

x %>% .[1,2]

## # A tibble: 1 x 1
##       b
##   <dbl>
## 1    12

x %>% .[1:2,]

## # A tibble: 2 x 2
##       a     b
##   <int> <dbl>
## 1     1    12
## 2     2    22

x %>% .[,"a"]

## # A tibble: 5 x 1
##       a
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5

Figure 25.

We can use as.tibble , as_tibble, tbl_df, as_data_frame etc., functions are used to convert data frame to tibble and vice versa. Note tbl_df is in dplyr package while others belongs to tibble package.

Joins

The join functions that are available with dplyr package are similar to the merge() function that comes with Base R but are more specific than the general merge(). These functions works with data frames also.

Figure 26.

Join Types

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(“.x”, “.y”), …)

left_join(x, y, by = NULL, copy = FALSE, suffix = c(“.x”, “.y”), …)

right_join(x, y, by = NULL, copy = FALSE, suffix = c(“.x”, “.y”), …)

full_join(x, y, by = NULL, copy = FALSE, suffix = c(“.x”, “.y”), …)

semi_join(x, y, by = NULL, copy = FALSE, …)

anti_join(x, y, by = NULL, copy = FALSE, …)

Arguments x, y tbls to join by a character vector of variables to join by. If NULL, the default, *_join() will do a natural join, using all variables with common names across the two tables. A message lists the variables so that you can check they’re right (to suppress the message, simply explicitly list the variables that you want to join).

To join by different variables on x and y use a named vector. For example, by = c(“a” = “b”) will match x.a to y.b. copy If x and y are not from the same data source, and copy is TRUE, then y will be copied into the same src as x. This allows you to join tables across srcs, but it is a potentially expensive operation so you must opt into it.

suffix If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2. … other parameters passed onto methods

Currently dplyr supports four join types:

inner_join()

return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

left_join()

return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

right_join()

return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

semi_join()

return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

anti_join()

return all rows from x where there are not matching values in y, keeping just columns from x.

full_join()

return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

Do remember that groups are ignored for the purpose of joining, but the result preserves the grouping of x.

The following figure illustrates the working of the above functions with two data tibbles (or data frames). We advise readers to be keen about the result to feel of these operations. We have borrowed this example from http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Figure.

Also, intersect(), union(), setdiff() functions are meaningful between tibbles also. We do have bind_rows(), bind_cols() functions in dplyr package which works akin to rbind and cbind. The following figure illustrates the working of these functions on two data frame y and z whose content is also shown in the figure.

Figure.

We can also join two tibbles that are having different column names in join operations. For example, assume X is tipple having “id”, “age”, “name” as column names, while Y is another tipple having “Eid”, “Sal” as column names. Then, left_join(X,Y, c(“id”==”Eid”)) gives join operation by taking “id” and “Eid” as common named columns.

For an example to illustrate the use of the join functions we will be using the weather dataset included with the nycflights13 package. Let?s first take a look at the data. We take flights data of nycflights13 package and weather package. We wanted to arrive some data which relate flights delay and wind speed. First, let us explore the column types in both the data sets by using names() function. The following figure contains the same. We find that both of them are having some common attributes such as origin, year, month, hour. Thus, we want to apply join between these two data tibbles and arrive at a table that is having delay(average) and wind speeds(average).

library(dplyr)
library(nycflights13)
names(flights)

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

names(weather)

##  [1] "origin"     "year"       "month"      "day"        "hour"      
##  [6] "temp"       "dewp"       "humid"      "wind_dir"   "wind_speed"
## [11] "wind_gust"  "precip"     "pressure"   "visib"      "time_hour"

Figure 27.

The following code is borrowed from www.rdocumenatation.org.

library(dplyr)
library(nycflights13)
wind_speed <- flights %>% 
                filter(., !is.na(dep_delay)) %>% 
                group_by(., origin, year, month, day, hour) %>% 
                summarise(., avg_dep_delay = round(mean(dep_delay),1)) %>% 
        left_join(.,
            weather %>% 
                filter(., !is.na(wind_speed)) %>% 
                select(., origin, year, month, day, hour, wind_speed) %>% 
                mutate(., wind_speed = round(wind_speed, 0)),
            by = c("origin", "year", "month", "day", "hour")) %>%
        group_by(., wind_speed) %>% 
        summarise(., delay = round(mean(avg_dep_delay, na.rm = T),2)) %>% 
        arrange(., wind_speed) 

wind_speed

## # A tibble: 36 x 2
##    wind_speed delay
##         <dbl> <dbl>
##  1          0  8.64
##  2          3  8.88
##  3          5  8.96
##  4          6  8.90
##  5          7 10.42
##  6          8 12.49
##  7          9 12.35
##  8         10 12.44
##  9         12 13.70
## 10         13 15.34
## # ... with 26 more rows

We get the following output.
# A tibble: 36 x 2
   wind_speed delay
         
 1          0  8.64
 2          3  8.88
 3          5  8.96
 4          6  8.90
 5          7 10.42
 6          8 12.49
 7          9 12.35
 8         10 12.44
 9         12 13.70
10         13 15.34
# ... with 26 more rows

We can find correlation coefficient between them. If we want, we can find regression equation also between them. If wanted, we can also extract other weather parameters such as wind direction, temperature, humidity, dew point, etc., and find multiple regression between delay and all of these quantities.

Exercise: Use flights data and display the flight, origin, dest, dep_time,air_time for those flights which starts on one calendar day and reaches in the next calendar day.

Answer: By observing the data, we found that dep_time and air_time(time taken in travel) are useful to solve this problem. Moreover, we understood that both of these two columns are double numbers. The last two digits of dep_time represent minutes and remaining digits conveys hours. For example, if one value in this column is 535 then it should be taken as 5 hours 35 minutes. Similarly, if a column value is 2122 then it should be taken as 21 hours 22 minutes. Also, air_time is given in total minutes taken in the air.

Now, our problem is to arrive at logic that declares a flight is reaching whether in the next calendar day or on the same calendar given its departure time and time taken during the flight. We employ the following logic. For dep_time, we separate hours and minutes by using integer division and modulus operation with 100. For air_time, we separate hours and minutes by using integer division and modulus operation with 60. Now, we calculate total minutes. This total minute is divided with 60 to find out how many hours are there (which can be 0 or 1). For this, we add hours component of both the given numbers. If this is more than 24, we declare the flight is reaching in the next calendar day; otherwise in the same calendar day.

We have written isNextDay() function that takes two numbers n and m. The above logic is employed to compute the results.

isNextDay<-function(n,m){
last<-n%%100
n<-n%/%100
last1<-m%%60
m<-m%/%60
hours=n+m+(last+last1)%/%60
hours=hours%/%24
return(hours==1)
}


isNextDay(1232,1422)  #supposed to get TRUE

## [1] TRUE

isNextDay(1232,234)    #supposed to get FALSE

## [1] FALSE

isNextDay(2210,234)   #supposed to get TRUE

## [1] TRUE

isNextDay(2210,153)   #supposed to get TRUE

## [1] TRUE

isNextDay(2210,101)  #supposed to get FALSE

## [1] FALSE

Now, to answer our problem, we use the above function with filter function of dplyr. We use piping operation. We select those rows of the flights data tibble for which isNextDay() returns TRUE. This is sent select function through piping operator which in turn selects flight, origin,dest, dep_time, and air_time. Result is stored in the object result whose content can be viewed either through head or glimpse or simply typing its name at the command prompt.

isNextDay<-function(n,m){
last<-n%%100
n<-n%/%100
last1<-m%%60
m<-m%/%60
hours=n+m+(last+last1)%/%60
hours=hours%/%24
return(hours==1)
}
library(dplyr)
library(nycflights13)
result<-filter(flights,isNextDay(dep_time,air_time))%>%select(.,flight,origin,dest,dep_time,air_time)
glimpse(result)

## Observations: 18,104
## Variables: 5
## $ flight   <int> 15, 119, 269, 1643, 1075, 389, 177, 29, 181, 91, 853,...
## $ origin   <chr> "EWR", "EWR", "JFK", "JFK", "EWR", "JFK", "JFK", "JFK...
## $ dest     <chr> "HNL", "LAX", "SEA", "SEA", "SNA", "SFO", "SFO", "SFO...
## $ dep_time <int> 1344, 1820, 1824, 1826, 1832, 1840, 1848, 1850, 1856,...
## $ air_time <dbl> 656, 366, 348, 334, 342, 357, 361, 364, 336, 330, 361...

Exercise: Use flights data from nycflights13 package and prepare a table which displays on a given date how many flights are there from one place to other.

Answer: First, we read year, month, day interactively. Then, we group flighs based on these three attributes using group_by. In the group_by, we included origin and dest columns also. This is sent to summarise function along with n() as the function to be executed on each of the groups. This summarise function retunrs on the given day, from one origin to one destination, how many flights are available will be returned which is stored in the object X. Using X[,4:6] sub-setting we get origin, dest, number of flights in a tabular fashion.

library(nycflights13)
library(dplyr)
if(interactive()){
Year<-as.integer(readline("Enter year"))
Month<-as.integer(readline("Enter month"))
Day<-as.integer(readline("Enter day"))

x<-summarise(group_by(flights,year==Year,month==Month,day==Day,origin,dest),No=n())
x[,4:6]
}

Exercise: Use flights data from nycflights13 package and prepare a table which shows total number of flights on each of the week days(Mon to Sun). No need of order.

Answer: Function ISODate() takes year, month, day and returns Date type of object which if passed to weekdays function we get respective day of the week. By using mutate function and we create new tibble which contains weekday for each fligh with the column name wday. The result is stores in res tibble. This, wday column is factored to have days as “Sunday” to “Saturday”. This tibble is grouped by group_by using wday column and the result is sent to summarise function along with n() as required aggregate function.

library(nycflights13)
library(dplyr)
res <- mutate(flights,wday=weekdays(ISOdate(year,month,day)))
res$wday <- factor(res$wday, levels= c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
summarise(group_by(res,wday),No=n())

## # A tibble: 7 x 2
##        wday    No
##      <fctr> <int>
## 1    Sunday 46357
## 2    Monday 50690
## 3   Tuesday 50422
## 4 Wednesday 50060
## 5  Thursday 50219
## 6    Friday 50308
## 7  Saturday 38720

Exercise:Use flights data from nycflights13 package and write R statements to select the flight which started before the scheduled time or which arrived before the scheduled time.

Answer: We have written orderTime() which takes two time values and returns TRUE if first one is earlier than second one; otherwise returns FALSE. By using filter function and this orderTime function we have selected those rows of the data tibble which related to the flights that arrived before the secheduled time or which departured before the scheduled time.

library(dplyr)
library(nycflights13)
orderTime<-function(n,m){
last<-n%%100
n<-n%/%100
last1<-m%%100
m<-m%/%100
if(n==0)n=24
if(m==0)m=24
m=m*60+last1;
n=n*60+last;
return(n<m)
}

result<-filter(flights,orderTime(dep_time,sched_dep_time)|orderTime(arr_time,sched_arr_time))

## Warning in if (n == 0) n = 24: the condition has length > 1 and only the
## first element will be used

## Warning in if (m == 0) m = 24: the condition has length > 1 and only the
## first element will be used

## Warning in if (n == 0) n = 24: the condition has length > 1 and only the
## first element will be used

## Warning in if (m == 0) m = 24: the condition has length > 1 and only the
## first element will be used

glimpse(result)

## Observations: 234,925
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 544, 554, 554, 555, 557, 557, 558, 558, 558, 55...
## $ sched_dep_time <int> 545, 600, 558, 600, 600, 600, 600, 600, 600, 60...
## $ dep_delay      <dbl> -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1,...
## $ arr_time       <int> 1004, 812, 740, 913, 709, 838, 753, 849, 853, 9...
## $ sched_arr_time <int> 1022, 837, 728, 854, 723, 846, 745, 851, 856, 9...
## $ arr_delay      <dbl> -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 3...
## $ carrier        <chr> "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6",...
## $ flight         <int> 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194...
## $ tailnum        <chr> "N804JB", "N668DN", "N39463", "N516JB", "N829AS...
## $ origin         <chr> "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA"...
## $ dest           <chr> "BQN", "ATL", "ORD", "FLL", "IAD", "MCO", "ORD"...
## $ air_time       <dbl> 183, 116, 150, 158, 53, 140, 138, 149, 158, 345...
## $ distance       <dbl> 1576, 762, 719, 1065, 229, 944, 733, 1028, 1005...
## $ hour           <dbl> 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6,...
## $ minute         <dbl> 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0,...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 06:00:00, 2013...

Exercise: Use flights data from nycflights12 package and write a set of R statements to display flight, year,month, day along with quarter number of the year as Q1, Q2, Q3 and Q4.

Answer: First, we write a function getQuarterNo() which takes month as argument and returns either Q1 or Q2 or Q3 or Q4 based on the month value. For, this, we have used a simple logic. We have subtracted 1 from month and divided with %/%(integer division) and added 1 which is appended to the string “Q” through paste. For example, if month value is 12, first it is taken as 11. Then 11%/%3 gives 3. By adding 1, we are getting 4 which is fourth quarter.

Now, from the flights data tibble we select flight, year, month,day and then using mutate we add new attribute Quarter whose value is calculated using the above function by sending month value.

library(dplyr)
library(nycflights13)

getQuarterNo<-function(month){
month<-month-1
month<-1+month%/%3
return(paste("Q",month,sep='') )
}
result<-select(flights,flight,year, month, day)
result<-mutate(result,Quarter=getQuarterNo(month))
glimpse(result)

## Observations: 336,776
## Variables: 5
## $ flight  <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, ...
## $ year    <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ month   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Quarter <chr> "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", "Q1", ...

Let us disambiguate group_by and ungroup

We have really understood the use of group_by function to divide the data into groups based on single column or multi-column. The function ungroup is reported as doing reverse of the group_by. We find one need to drill the conceptual things little further. Thus, the following experiment is created.

First, we will create a tibble with 100 rows having columns as sex, age and IQ. All are randomly generated.

library(dplyr)
set.seed(1)
DF <- 
  data.frame(
    sex = sample(c("Male","Female", "Hijra"), 100, TRUE),
    age = sample(20:100, 100, TRUE),
    IQ = rnorm(100, 100, 15)
  ) %>% as.tbl

Now, we apply group_by() function on DF with sex as the column and the result is stored in DF1. DF1<-DF %>% group_by(sex) DF2<-DF %>% group_by(sex) %>%ungroup()

We wanted test whether DF and DF1 are same. When we execute identical(), we get FALSE as output indicating that they are different. Further, we wanted to display and see them at the console prompt. The following figure contains the result. We find all the columns of DF and DF1 are exactly same. However, DF1 is showing Groups: sex[3], indicating it is a grouped tibble. We have also applied identical on DF and DF2 and the result is confirming that they are same. When we displayed their contents also, we found nothing different.

Figure

We have further executed class() function on DF, DF1, DF2. We find that class function for DF1 is showing it as grouped_df. class(DF) class(DF1) class(DF2)

Thus, though DF and DF1 tibbles columns contents are exactly same, but they are different. We wanted to further interrogate the difference. For this, we have applied filter operations as shown below.

The following command on DF gave only one row whose age value is minimum out of all, irrespective of sex. Do remember that DF is ungrouped tibble.

filter(DF,age==min(age))

Now, the same filter command is applied on DF1. It has given rows whose age is minimum for each of the groups. Of course, if there are more than one row is having same minimum, all of them are displayed. This is the difference between grouped tibble (DF1) and ungrouped tibble(DF). filter(DF1,age==min(age))

Of course, the same filter command DF2 worked akin to DF.

filter(DF2,age==min(age))

Whole code is given as:

library(dplyr)
set.seed(1)
DF <- 
  data.frame(
    sex = sample(c("Male","Female", "Hijra"), 100, TRUE),
    age = sample(20:100, 100, TRUE),
    IQ = rnorm(100, 100, 15)
  ) %>% as.tbl

DF1<-DF %>% group_by(sex)
DF2<-DF %>% group_by(sex) %>%ungroup()
class(DF)

## [1] "tbl_df"     "tbl"        "data.frame"

class(DF1)

## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

class(DF2)

## [1] "tbl_df"     "tbl"        "data.frame"

filter(DF,age==min(age))

## # A tibble: 1 x 3
##      sex   age       IQ
##   <fctr> <int>    <dbl>
## 1 Female    21 94.10788

filter(DF1,age==min(age))

## # A tibble: 6 x 3
## # Groups:   sex [3]
##      sex   age        IQ
##   <fctr> <int>     <dbl>
## 1 Female    21  94.10788
## 2  Hijra    28  95.81330
## 3   Male    34  97.30665
## 4   Male    34 131.30750
## 5  Hijra    28  80.70549
## 6   Male    34  83.87212

filter(DF2,age==min(age))

## # A tibble: 1 x 3
##      sex   age       IQ
##   <fctr> <int>    <dbl>
## 1 Female    21 94.10788

The following figure illustrates the working of the above. Figure.

Data Manipulation through dplyr and tibbles

N.B.Venkateswarlu

5 November 2017

SQL style data manipulation through dplyr package

Select

Filter

Filtering on Multiple Columns

arrange

Function mutate

Mutate and window functions

Function summarise

Piping operator

Tibbles through dplyr

Joins

inner_join()

left_join()

right_join()

semi_join()

anti_join()

full_join()

Let us disambiguate group_by and ungroup