1 Brief Introduction

The dplyr package from the tidyverse introduces functions that perform some of the most common operations when working with data frames and uses names for these functions that are relatively easy to remember. For instance, to change the data table by adding a new column, we use mutate(). To filter the data table to a subset of rows, we use filter(). To subset the data by selecting specific columns, we use select(). Besides these functions, the other dplyr verbs that are widely used are summarise(), group_by(), and arrange(). In the following sections we shall discuss these verbs in brief.

It should be noted that the packages dplyr can be used only for the data frames. To use the package as we mentioned earlier, we first need to install it in the R environment and then load it. Let us first install the package dplyr using the function install.packages() and load it in the environment by using library() function.

install.packages("dplyr")
library(dplyr)

1.1 Adding a column using mutate()

While doing data science project the we need to acquire all the necessary information for analysis to be included in the data table. For this we may have to create new variables from the existing variables using some suitable mathematical operations. If we are creating a new variable, that means we are creating a new column, because in a data frame each column represents a variable and each row represents the observations on the variables.

In the earlier sections the R base functions, which we did use to add a new column or a new variable, was not so user friendly. In this section we will learn to create a new column or a new variable comparatively more easier way by using the function mutate() of the dplyr package.

To understand using these functions of the dplyr package let us load the data set called gapminder from the package dslabs. gapminder data set consists of a time series of Health and income outcomes for 184 countries from the year 1960 to 2016. It also includes two character vectors, oecd and opec, with the names of OECD and OPEC countries from 2016. Details of variables can be obtained by using ?gapminder or help("gapminder") after loading dslabs package.

install.packages("dslabs")
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(dslabs)
data("gapminder")

We may explore the data structures of the variables inside the data frame gapminder using str() function.

str(gapminder)
'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ population      : num  1636054 11124892 5270844 54681 20619075 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...

We can see that the gapminder is a data frame consisting 10545 observations of 9 variables. The class of the variables country, continent and region is Factor while the variable year is integer and others are of numeric. We can also use glimpse() function to view the glimpse of the data set. Note that glimpse() is a function from the package dplyr.

dplyr::glimpse(gapminder)
Rows: 10,545
Columns: 9
$ country          <fct> Albania, Algeria, Angola, Antigua and Barbuda, Arg...
$ year             <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 19...
$ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, ...
$ life_expectancy  <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 7...
$ fertility        <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2....
$ population       <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 18673...
$ gdp              <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966...
$ continent        <fct> Europe, Africa, Africa, Americas, Americas, Asia, ...
$ region           <fct> Southern Europe, Northern Africa, Middle Africa, C...

Let us use the head() function to view first 6 observations of the data set gapminder.

head(gapminder)
              country year infant_mortality life_expectancy fertility
1             Albania 1960           115.40           62.87      6.19
2             Algeria 1960           148.20           47.50      7.65
3              Angola 1960           208.00           35.98      7.32
4 Antigua and Barbuda 1960               NA           62.97      4.43
5           Argentina 1960            59.87           65.39      3.11
6             Armenia 1960               NA           66.86      4.55
  population          gdp continent          region
1    1636054           NA    Europe Southern Europe
2   11124892  13828152297    Africa Northern Africa
3    5270844           NA    Africa   Middle Africa
4      54681           NA  Americas       Caribbean
5   20619075 108322326649  Americas   South America
6    1867396           NA      Asia    Western Asia

Now suppose we want to create a new variable say gdp_per_capita which can be obtained by dividing gdp by population. Let us see how we can create a this variable using mutate() function. The function mutate() takes the data frame as a first argument and the name and values of the variable as a second argument using the convention name = values. So, to add gdp_per_capita, we use:

gapminder = mutate(gapminder, gdp_per_capita = gdp / population)

Notice that here we used gdp and population inside the function, which are objects that are not defined in our workspace. But why don’t we get an error? Recall that using R base functions we could have accessed a variable only by using $ sign.

This is one of dplyr’s main features. Functions in this package, such as mutate(), know to look for variables in the data frame provided in the first argument. In the call to mutate above, gdp will have the values in gapminder$gdp and that of population will have the values as in gapminder$population. This function is aware of the column names and inside the function we can call them unquoted. We can write gdp rather than gapminder$gdp and population as gapminder$population. The function mutate() knows we are grabbing columns from gapminder data frame. This approach makes the code much more readable.

Now let us use the head() function to check whether the new column or the new variable gdp_per_capita is created.

head(gapminder)
              country year infant_mortality life_expectancy fertility
1             Albania 1960           115.40           62.87      6.19
2             Algeria 1960           148.20           47.50      7.65
3              Angola 1960           208.00           35.98      7.32
4 Antigua and Barbuda 1960               NA           62.97      4.43
5           Argentina 1960            59.87           65.39      3.11
6             Armenia 1960               NA           66.86      4.55
  population          gdp continent          region gdp_per_capita
1    1636054           NA    Europe Southern Europe             NA
2   11124892  13828152297    Africa Northern Africa       1242.992
3    5270844           NA    Africa   Middle Africa             NA
4      54681           NA  Americas       Caribbean             NA
5   20619075 108322326649  Americas   South America       5253.501
6    1867396           NA      Asia    Western Asia             NA

We can see the newly created variable gdp_per_capita, that has been added to the data frame gapminder. Although we have overwritten the original gapminder object, this does not change the object that loaded with data(gapminder). If we load the gapminder data again, the original will overwrite our mutated version. Otherwise we could have created a with new object name also.

1.2 Subsetting using filter()

Now suppose that we want to filter the data table to only show the entries for which the gdp_per_capita is higher than 25000. To do this we use the filter() function, which takes the data frame gapminder as the first argument and then the conditional statement as the second. Like mutate(), we can use the unquoted variable names from gapminder inside the function and it will know we mean the columns and not objects in the workspace.

filter(gapminder, gdp_per_capita > 55000)
               country year infant_mortality life_expectancy fertility
1 United Arab Emirates 1975             45.9           64.30      6.01
2 United Arab Emirates 1976             41.9           64.87      5.87
3 United Arab Emirates 1977             38.2           65.41      5.74
4 United Arab Emirates 1980             28.9           66.91      5.42
5 United Arab Emirates 1981             26.4           67.36      5.34
6           Luxembourg 2007              2.4           80.60      1.62
7           Luxembourg 2008              2.2           81.00      1.63
  population         gdp continent         region gdp_per_capita
1     531265 29899277858      Asia   Western Asia       56279.40
2     622051 34840688594      Asia   Western Asia       56009.38
3     722849 42310298856      Asia   Western Asia       58532.69
4    1016789 62370744063      Asia   Western Asia       61340.89
5    1093108 65276707224      Asia   Western Asia       59716.61
6     474705 27066319405    Europe Western Europe       57017.14
7     485079 27270323848    Europe Western Europe       56218.31

We can also use this function to filter the data table by several conditional statements. For an example suppose if we want to filter the data set by filtering year to be 2016 and the life_expectancy to be higher than 82 years, the code will look like as below: Let us recall that for equality we use double equal sign (==).

filter(gapminder, year == 2016 & life_expectancy > 82)
            country year infant_mortality life_expectancy fertility population
1         Australia 2016               NA           82.30        NA         NA
2  Hong Kong, China 2016               NA           83.90        NA         NA
3           Iceland 2016               NA           83.30        NA         NA
4            Israel 2016               NA           82.91        NA         NA
5             Italy 2016               NA           82.30        NA         NA
6             Japan 2016               NA           83.30        NA         NA
7        Luxembourg 2016               NA           82.30        NA         NA
8             Malta 2016               NA           82.20        NA         NA
9         Singapore 2016               NA           82.10        NA         NA
10            Spain 2016               NA           82.70        NA         NA
11           Sweden 2016               NA           82.10        NA         NA
12      Switzerland 2016               NA           83.10        NA         NA
   gdp continent                    region gdp_per_capita
1   NA   Oceania Australia and New Zealand             NA
2   NA      Asia              Eastern Asia             NA
3   NA    Europe           Northern Europe             NA
4   NA      Asia              Western Asia             NA
5   NA    Europe           Southern Europe             NA
6   NA      Asia              Eastern Asia             NA
7   NA    Europe            Western Europe             NA
8   NA    Europe           Southern Europe             NA
9   NA      Asia        South-Eastern Asia             NA
10  NA    Europe           Southern Europe             NA
11  NA    Europe           Northern Europe             NA
12  NA    Europe            Western Europe             NA

We can also use %in% to filter with dplyr. We can therefore see the data from Nepal, India and China using following line of codes.

Nep.Ind.chn = filter(gapminder, country %in% c("Nepal", "India", "China"))
head(Nep.Ind.chn)
  country year infant_mortality life_expectancy fertility population
1   China 1960            190.0           30.53      3.99  644450173
2   India 1960            165.1           41.26      5.87  449661874
3   Nepal 1960            219.6           39.80      5.99   10056945
4   China 1961            161.0           32.95      3.28  654625069
5   India 1961            162.5           41.99      5.86  458691457
6   Nepal 1961            216.7           40.26      5.99   10215153
          gdp continent        region gdp_per_capita
1 70348527260      Asia  Eastern Asia      109.16054
2 81042654922      Asia Southern Asia      180.23021
3  1354713257      Asia Southern Asia      134.70425
4 51282752582      Asia  Eastern Asia       78.33912
5 84059664307      Asia Southern Asia      183.25971
6  1380564790      Asia Southern Asia      135.14871

1.3 Selecting columns using select()

This function is very useful when we have to select the desired variables for the analysis from the data frame having lot of variables. In our gapminder data frame we have all together 10 columns or variables, but some data table may include hundreds of variables. If we want to view just a few, we can use the dplyr select() function. In the code below we select three columns, assign this to a new object called new_gapminder.

new_gapminder = select(gapminder, country, life_expectancy, population)
head(new_gapminder)
              country life_expectancy population
1             Albania           62.87    1636054
2             Algeria           47.50   11124892
3              Angola           35.98    5270844
4 Antigua and Barbuda           62.97      54681
5           Argentina           65.39   20619075
6             Armenia           66.86    1867396

When we call the function select() the first argument is the data frame (in this case gapminder) followed by the variable names separated by comma (in this case country, life_expectancy, population) to be selected. We can also use the filter() function after calling selec() function. Suppose if we want to select the three variables as in the above case and want to filter the new object new_gapminder by using the conditional statement population to be less than or equal to 45,000. The line of command would be as follows.

new_gapminder = select(gapminder, country, life_expectancy, population)
filter(new_gapminder, population <= 45000)
                 country life_expectancy population
1              Greenland           61.49      31238
2               Kiribati           45.84      41234
3  Micronesia, Fed. Sts.           56.82      44539
4             Seychelles           60.42      41538
5              Greenland           61.85      32474
6               Kiribati           46.24      42261
7             Seychelles           61.03      42402
8              Greenland           62.22      33730
9               Kiribati           46.64      43312
10            Seychelles           61.59      43409
11             Greenland           62.59      35038
12              Kiribati           47.03      44372
13            Seychelles           62.08      44515
14             Greenland           62.97      36441
15             Greenland           63.34      37963
16             Greenland           63.71      39616
17             Greenland           64.08      41360
18             Greenland           64.45      43100
19             Greenland           64.82      44703

1.4 The pipe operator: %>%

The pipe operator is one of the widely used operators while using different functions of the dplyr package. We can perform series of operations using pipe operator: %>%, for example we can select the columns (variables) using select() and then filter the observations , by sending the results of select() function to filter() using what is called the pipe operator : %>%. The short cut key for the pipe operator : %>% is clrt + shift + M. We can pronounce this operator as then also. Some details are included below.

We wrote the code above to select the three variables (country, life_expectancy, and population) and for population that have less than or equal to 45000. To do this we have had created an intermediate object new_gapminder. In dplyr we can write codes that looks more like a description of what we want to do without defining intermediate objects using pipe %>% operator.

original data → select() → filter()

For such an operation, we can use the pipe %>%. And using %>% the code looks like this:

gapminder %>% select(country, life_expectancy, population) %>% filter(population <=45000)
                 country life_expectancy population
1              Greenland           61.49      31238
2               Kiribati           45.84      41234
3  Micronesia, Fed. Sts.           56.82      44539
4             Seychelles           60.42      41538
5              Greenland           61.85      32474
6               Kiribati           46.24      42261
7             Seychelles           61.03      42402
8              Greenland           62.22      33730
9               Kiribati           46.64      43312
10            Seychelles           61.59      43409
11             Greenland           62.59      35038
12              Kiribati           47.03      44372
13            Seychelles           62.08      44515
14             Greenland           62.97      36441
15             Greenland           63.34      37963
16             Greenland           63.71      39616
17             Greenland           64.08      41360
18             Greenland           64.45      43100
19             Greenland           64.82      44703

Thus we find that, this one line of code is equivalent to two lines of code above. In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. In the above code first gapminder data set is called and then the set is put to select the three variables (country, life_expectance, and population) and finally it calls another function filter to filter the observations having population less than or equal to 45000.

Here are some more simple examples:

49 %>% sqrt()
[1] 7

We can continue this pipe values along:

49 %>% sqrt() %>% log10()
[1] 0.845098

The code is equivalent to log10(sqrt(49))

We see that, the pipe sends values to the first argument, so we can define other arguments as if the first argument is already defined. Therefore, when using the pipe with data frames and dplyr, we no longer need to specify the required first argument since the dplyr functions we have described all take the data as the first argument. Note that the pipe works well with functions where the first argument is the input data. All the functions in the package dplyr have this format and can be used easily with the pipe.

1.5 Summarizing data using summarize()[ or summarise()] function

Summarising data is one of the important part of exploratory data analysis. Summarising data includes computing descriptive statistic like averages, measures of variations, skewness, kurtosis etc. Details of these summary statistics will be covered in the section Descriptive Statistics. In this section we briefly discuss how we can compute the summary statistics like mean, media, standard deviation, etc using summarise() or summarize() function.

The summarize function in dplyr provides a way to compute summary statistics with intuitive and readable code. The following line of codes can be used to compute the summary statistics like mean, standard deviation, maximum and minimum value of life_expectancy of Nepal from the gapminder data set.

gapminder %>% filter(country == "Nepal") %>%
  summarise(Mean = mean(life_expectancy), 
           SD = sd(life_expectancy), 
          Max = max(life_expectancy),
          Min = min(life_expectancy))
      Mean       SD  Max  Min
1 56.12175 10.21833 70.2 39.8

The above codes at first calls our original data table gapminder as input, filters it to keep only for country Nepal, and then produces a new summarized table with just the arithmetic mean, the standard deviation, Maximum and Minimum value of life_expectancy. Note that we can choose the names of the columns of the resulting table. For example, above we decided to use Mean, SD, Max and Min but we could have used other names just in the similar manner. Also we could have stored the resulting table in a new data frame object. Let the above resulting table be stored in a data frame object called Nepal_life_expectancy_summary.

Nepal_life_expectancy_summary = gapminder %>% 
  filter(country == "Nepal") %>% 
  summarise(Mean = mean(life_expectancy),
            SD = sd(life_expectancy),
            Max = max(life_expectancy),
            Min = min(life_expectancy))
Nepal_life_expectancy_summary
      Mean       SD  Max  Min
1 56.12175 10.21833 70.2 39.8

As the table has been stored in the object Nepal_life_expectancy_summary, we can pull its components by using the Accessor, the $ sign.

Nepal_life_expectancy_summary$Mean
[1] 56.12175
Nepal_life_expectancy_summary$SD
[1] 10.21833
Nepal_life_expectancy_summary$Max
[1] 70.2
Nepal_life_expectancy_summary$Min
[1] 39.8

The similar action can be taken by using pull() function using the pipe operator %>%. This is something equivalent as using $. For example

Nepal_life_expectancy_summary %>% pull(Mean)
[1] 56.12175

and so on.

As the most of other dplyr functions, summarize() is also aware of the variable names and it uses them directly. No accessor, ($) is required as in R base functions to access a variable whenever we use **dplyr* package. So when inside the call to the summarize() function we write mean(life_expectancy), the function is accessing the column with the name “life_expectancy” of the gapminder data frame and then computing the arithmetic mean of the resulting numeric vector.

It should be noted that With the function summarize(), we can only call functions that return a single value. We can not use the functions that return two or more than two values inside summarise().

1.6 Grouping and then summarizing using group_by() function

In data exploration we may require to group the variable and compute the summary statistics for each of the groups. For example, we may want to compute the summary statistics like mean and standard deviation for different continents’ life_expectancy. In the gapminder data set we have the following continents; . Using goup_by() function we can easily compute the summary statistics like mean and standard deviation of life_expectancy continent wise.

gapminder %>% 
  group_by(continent) %>% 
  summarise(Mean = mean(life_expectancy), SD = sd(life_expectancy))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 3
  continent  Mean    SD
  <fct>     <dbl> <dbl>
1 Africa     54.4  8.97
2 Americas   68.8  7.00
3 Asia       65.9  9.77
4 Europe     73.6  4.26
5 Oceania    64.6  8.49

Here, we can see that the function summarise() can be applied to the each of the groups separately. We have to be sure that the grouping variable must have the class of factor as we can see that the class of the variable continent is a factor (<fct>) beneath the continent variable.

Let us try it with another example to compute the summary statistics like mean, standard deviation, median, value of life_expectancy for the different region. In this case our grouping variable is region and it has 22 groups (We can check this by using either str() or glimpse() function). The code should return mean, median and standard deviation for each of the 22 regions. Let us check what the code should look like;

gapminder %>% 
  group_by(region) %>% # this will group the variable `region`
  summarise(Mean = mean(life_expectancy), 
            Median = median(life_expectancy),
            SD = sd(life_expectancy))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 22 x 4
   region                     Mean Median    SD
   <fct>                     <dbl>  <dbl> <dbl>
 1 Australia and New Zealand  76.0   75.9  3.89
 2 Caribbean                  69.0   70.4  6.58
 3 Central America            68.0   70.0  7.67
 4 Central Asia               64.7   65.1  3.67
 5 Eastern Africa             54.5   53.9  8.68
 6 Eastern Asia               70.5   71.9  8.97
 7 Eastern Europe             70.6   70.6  2.82
 8 Melanesia                  60.5   61.3  7.24
 9 Micronesia                 60.0   60.8  6.05
10 Middle Africa              51.2   51.8  6.27
# ... with 12 more rows

1.7 Sorting data frame by a variable using arrange() function

In the exploratory data analysis, we may sometime require to examine the data table in a order with respect to the array of a particular variable or column. In such cases we could use the dplyr function, called arrang(). To arrange the data table using arrange() we first need to decide which variable to sort by.

To see the country by life_expectancy in the year 2015, from smallest to largest, in the data table gapminder, we arrange with the following line of codes.

gapminder %>%    # to call `gapminder` data frame
  select(country,life_expectancy,year) %>% # selects `country`, `life_expectancy`, `year`.
  filter(year == 2015) %>% # filter by the year of 2015.
  arrange(life_expectancy) %>%  # to arrange the data table by `life_expectancy`.
  head() # to view first 6 entries only.
                   country life_expectancy year
1                  Lesotho            47.1 2015
2 Central African Republic            49.6 2015
3                Swaziland            51.8 2015
4            Guinea-Bissau            55.6 2015
5                   Zambia            56.7 2015
6               Mozambique            57.1 2015

By default the arrange() function arranges the data table in ascending order and to arrange in descending order we call another function desc() inside the arrange(). Let us see the country list by life_expectancy in descending order.

gapminder %>% 
  select(country, life_expectancy, year) %>% 
  filter(year == 2015) %>% 
  arrange(desc(life_expectancy)) %>% 
  head()
           country life_expectancy year
1 Hong Kong, China           83.73 2015
2          Iceland           83.30 2015
3            Japan           83.20 2015
4      Switzerland           83.00 2015
5            Spain           82.60 2015
6        Australia           82.30 2015

1.8 Dealing with missing values

For any missing values in any data structure, R automatically assigns as “NA” meaning Not Available. NA is a logical constant of length 1 which contains a missing value indicator. For a detail description about NA, readers are advised to check for ?NA or help(NA).

If we observe the gapminder data set carefully, we can find many missing values in the columns infrant_mortality and gdp, which are assigned by NAs. We have to be cautious while computing summary statistics for a data which consists of NAs. If we try to compute the summary statistics like mean, standard deviation for a variable which consists of NA using R functions,that will always return NA. For an example if we compute the mean of gdp of Nepal from the gapminder data set, it returns NA.

gapminder %>% 
  select(country, gdp) %>% 
  filter(country == "Nepal") %>% 
  summarise(Mean_GDP = mean(gdp))
  Mean_GDP
1       NA

It can be seen that the above line of codes return NA as the mean of gdp of Nepal, this is because of missing entries of gdp of Nepal in some years.

To ignore such NA values we use na.rm argument. By setting the argument na.rm = TRUE we can ignore the NAs in the data set.

gapminder %>% 
  select(country, gdp) %>% 
  filter(country == "Nepal") %>% 
  summarise(Mean_GDP = mean(gdp, na.rm = TRUE))
    Mean_GDP
1 3542149470