dplyr packageThe dplyr package from the tidyverse introduces functions that perform some of the most common operations when working with data frames and uses names for these functions that are relatively easy to remember. For instance, to change the data table by adding a new column, we use mutate(). To filter the data table to a subset of rows, we use filter(). To subset the data by selecting specific columns, we use select(). Besides these functions, the other dplyr verbs that are widely used are summarise(), group_by(), and arrange(). In the following sections we shall discuss these verbs in brief.
It should be noted that the packages dplyr can be used only for the data frames. To use the package as we mentioned earlier, we first need to install it in the R environment and then load it. Let us first install the package dplyr using the function install.packages() and load it in the environment by using library() function.
mutate()While doing data science project the we need to acquire all the necessary information for analysis to be included in the data table. For this we may have to create new variables from the existing variables using some suitable mathematical operations. If we are creating a new variable, that means we are creating a new column, because in a data frame each column represents a variable and each row represents the observations on the variables.
In the earlier sections the R base functions, which we did use to add a new column or a new variable, was not so user friendly. In this section we will learn to create a new column or a new variable comparatively more easier way by using the function mutate() of the dplyr package.
To understand using these functions of the dplyr package let us load the data set called gapminder from the package dslabs. gapminder data set consists of a time series of Health and income outcomes for 184 countries from the year 1960 to 2016. It also includes two character vectors, oecd and opec, with the names of OECD and OPEC countries from 2016. Details of variables can be obtained by using ?gapminder or help("gapminder") after loading dslabs package.
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
We may explore the data structures of the variables inside the data frame gapminder using str() function.
'data.frame': 10545 obs. of 9 variables:
$ country : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
$ year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
$ infant_mortality: num 115.4 148.2 208 NA 59.9 ...
$ life_expectancy : num 62.9 47.5 36 63 65.4 ...
$ fertility : num 6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
$ population : num 1636054 11124892 5270844 54681 20619075 ...
$ gdp : num NA 1.38e+10 NA NA 1.08e+11 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
$ region : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
We can see that the gapminder is a data frame consisting 10545 observations of 9 variables. The class of the variables country, continent and region is Factor while the variable year is integer and others are of numeric. We can also use glimpse() function to view the glimpse of the data set. Note that glimpse() is a function from the package dplyr.
Rows: 10,545
Columns: 9
$ country <fct> Albania, Algeria, Angola, Antigua and Barbuda, Arg...
$ year <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 19...
$ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, ...
$ life_expectancy <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 7...
$ fertility <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2....
$ population <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 18673...
$ gdp <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966...
$ continent <fct> Europe, Africa, Africa, Americas, Americas, Asia, ...
$ region <fct> Southern Europe, Northern Africa, Middle Africa, C...
Let us use the head() function to view first 6 observations of the data set gapminder.
country year infant_mortality life_expectancy fertility
1 Albania 1960 115.40 62.87 6.19
2 Algeria 1960 148.20 47.50 7.65
3 Angola 1960 208.00 35.98 7.32
4 Antigua and Barbuda 1960 NA 62.97 4.43
5 Argentina 1960 59.87 65.39 3.11
6 Armenia 1960 NA 66.86 4.55
population gdp continent region
1 1636054 NA Europe Southern Europe
2 11124892 13828152297 Africa Northern Africa
3 5270844 NA Africa Middle Africa
4 54681 NA Americas Caribbean
5 20619075 108322326649 Americas South America
6 1867396 NA Asia Western Asia
Now suppose we want to create a new variable say gdp_per_capita which can be obtained by dividing gdp by population. Let us see how we can create a this variable using mutate() function. The function mutate() takes the data frame as a first argument and the name and values of the variable as a second argument using the convention name = values. So, to add gdp_per_capita, we use:
Notice that here we used gdp and population inside the function, which are objects that are not defined in our workspace. But why don’t we get an error? Recall that using R base functions we could have accessed a variable only by using $ sign.
This is one of dplyr’s main features. Functions in this package, such as mutate(), know to look for variables in the data frame provided in the first argument. In the call to mutate above, gdp will have the values in gapminder$gdp and that of population will have the values as in gapminder$population. This function is aware of the column names and inside the function we can call them unquoted. We can write gdp rather than gapminder$gdp and population as gapminder$population. The function mutate() knows we are grabbing columns from gapminder data frame. This approach makes the code much more readable.
Now let us use the head() function to check whether the new column or the new variable gdp_per_capita is created.
country year infant_mortality life_expectancy fertility
1 Albania 1960 115.40 62.87 6.19
2 Algeria 1960 148.20 47.50 7.65
3 Angola 1960 208.00 35.98 7.32
4 Antigua and Barbuda 1960 NA 62.97 4.43
5 Argentina 1960 59.87 65.39 3.11
6 Armenia 1960 NA 66.86 4.55
population gdp continent region gdp_per_capita
1 1636054 NA Europe Southern Europe NA
2 11124892 13828152297 Africa Northern Africa 1242.992
3 5270844 NA Africa Middle Africa NA
4 54681 NA Americas Caribbean NA
5 20619075 108322326649 Americas South America 5253.501
6 1867396 NA Asia Western Asia NA
We can see the newly created variable gdp_per_capita, that has been added to the data frame gapminder. Although we have overwritten the original gapminder object, this does not change the object that loaded with data(gapminder). If we load the gapminder data again, the original will overwrite our mutated version. Otherwise we could have created a with new object name also.
filter()Now suppose that we want to filter the data table to only show the entries for which the gdp_per_capita is higher than 25000. To do this we use the filter() function, which takes the data frame gapminder as the first argument and then the conditional statement as the second. Like mutate(), we can use the unquoted variable names from gapminder inside the function and it will know we mean the columns and not objects in the workspace.
country year infant_mortality life_expectancy fertility
1 United Arab Emirates 1975 45.9 64.30 6.01
2 United Arab Emirates 1976 41.9 64.87 5.87
3 United Arab Emirates 1977 38.2 65.41 5.74
4 United Arab Emirates 1980 28.9 66.91 5.42
5 United Arab Emirates 1981 26.4 67.36 5.34
6 Luxembourg 2007 2.4 80.60 1.62
7 Luxembourg 2008 2.2 81.00 1.63
population gdp continent region gdp_per_capita
1 531265 29899277858 Asia Western Asia 56279.40
2 622051 34840688594 Asia Western Asia 56009.38
3 722849 42310298856 Asia Western Asia 58532.69
4 1016789 62370744063 Asia Western Asia 61340.89
5 1093108 65276707224 Asia Western Asia 59716.61
6 474705 27066319405 Europe Western Europe 57017.14
7 485079 27270323848 Europe Western Europe 56218.31
We can also use this function to filter the data table by several conditional statements. For an example suppose if we want to filter the data set by filtering year to be 2016 and the life_expectancy to be higher than 82 years, the code will look like as below: Let us recall that for equality we use double equal sign (==).
country year infant_mortality life_expectancy fertility population
1 Australia 2016 NA 82.30 NA NA
2 Hong Kong, China 2016 NA 83.90 NA NA
3 Iceland 2016 NA 83.30 NA NA
4 Israel 2016 NA 82.91 NA NA
5 Italy 2016 NA 82.30 NA NA
6 Japan 2016 NA 83.30 NA NA
7 Luxembourg 2016 NA 82.30 NA NA
8 Malta 2016 NA 82.20 NA NA
9 Singapore 2016 NA 82.10 NA NA
10 Spain 2016 NA 82.70 NA NA
11 Sweden 2016 NA 82.10 NA NA
12 Switzerland 2016 NA 83.10 NA NA
gdp continent region gdp_per_capita
1 NA Oceania Australia and New Zealand NA
2 NA Asia Eastern Asia NA
3 NA Europe Northern Europe NA
4 NA Asia Western Asia NA
5 NA Europe Southern Europe NA
6 NA Asia Eastern Asia NA
7 NA Europe Western Europe NA
8 NA Europe Southern Europe NA
9 NA Asia South-Eastern Asia NA
10 NA Europe Southern Europe NA
11 NA Europe Northern Europe NA
12 NA Europe Western Europe NA
We can also use %in% to filter with dplyr. We can therefore see the data from Nepal, India and China using following line of codes.
country year infant_mortality life_expectancy fertility population
1 China 1960 190.0 30.53 3.99 644450173
2 India 1960 165.1 41.26 5.87 449661874
3 Nepal 1960 219.6 39.80 5.99 10056945
4 China 1961 161.0 32.95 3.28 654625069
5 India 1961 162.5 41.99 5.86 458691457
6 Nepal 1961 216.7 40.26 5.99 10215153
gdp continent region gdp_per_capita
1 70348527260 Asia Eastern Asia 109.16054
2 81042654922 Asia Southern Asia 180.23021
3 1354713257 Asia Southern Asia 134.70425
4 51282752582 Asia Eastern Asia 78.33912
5 84059664307 Asia Southern Asia 183.25971
6 1380564790 Asia Southern Asia 135.14871
select()This function is very useful when we have to select the desired variables for the analysis from the data frame having lot of variables. In our gapminder data frame we have all together 10 columns or variables, but some data table may include hundreds of variables. If we want to view just a few, we can use the dplyr select() function. In the code below we select three columns, assign this to a new object called new_gapminder.
country life_expectancy population
1 Albania 62.87 1636054
2 Algeria 47.50 11124892
3 Angola 35.98 5270844
4 Antigua and Barbuda 62.97 54681
5 Argentina 65.39 20619075
6 Armenia 66.86 1867396
When we call the function select() the first argument is the data frame (in this case gapminder) followed by the variable names separated by comma (in this case country, life_expectancy, population) to be selected. We can also use the filter() function after calling selec() function. Suppose if we want to select the three variables as in the above case and want to filter the new object new_gapminder by using the conditional statement population to be less than or equal to 45,000. The line of command would be as follows.
new_gapminder = select(gapminder, country, life_expectancy, population)
filter(new_gapminder, population <= 45000) country life_expectancy population
1 Greenland 61.49 31238
2 Kiribati 45.84 41234
3 Micronesia, Fed. Sts. 56.82 44539
4 Seychelles 60.42 41538
5 Greenland 61.85 32474
6 Kiribati 46.24 42261
7 Seychelles 61.03 42402
8 Greenland 62.22 33730
9 Kiribati 46.64 43312
10 Seychelles 61.59 43409
11 Greenland 62.59 35038
12 Kiribati 47.03 44372
13 Seychelles 62.08 44515
14 Greenland 62.97 36441
15 Greenland 63.34 37963
16 Greenland 63.71 39616
17 Greenland 64.08 41360
18 Greenland 64.45 43100
19 Greenland 64.82 44703
%>%The pipe operator is one of the widely used operators while using different functions of the dplyr package. We can perform series of operations using pipe operator: %>%, for example we can select the columns (variables) using select() and then filter the observations , by sending the results of select() function to filter() using what is called the pipe operator : %>%. The short cut key for the pipe operator : %>% is clrt + shift + M. We can pronounce this operator as then also. Some details are included below.
We wrote the code above to select the three variables (country, life_expectancy, and population) and for population that have less than or equal to 45000. To do this we have had created an intermediate object new_gapminder. In dplyr we can write codes that looks more like a description of what we want to do without defining intermediate objects using pipe %>% operator.
original data → select() → filter()
For such an operation, we can use the pipe %>%. And using %>% the code looks like this:
country life_expectancy population
1 Greenland 61.49 31238
2 Kiribati 45.84 41234
3 Micronesia, Fed. Sts. 56.82 44539
4 Seychelles 60.42 41538
5 Greenland 61.85 32474
6 Kiribati 46.24 42261
7 Seychelles 61.03 42402
8 Greenland 62.22 33730
9 Kiribati 46.64 43312
10 Seychelles 61.59 43409
11 Greenland 62.59 35038
12 Kiribati 47.03 44372
13 Seychelles 62.08 44515
14 Greenland 62.97 36441
15 Greenland 63.34 37963
16 Greenland 63.71 39616
17 Greenland 64.08 41360
18 Greenland 64.45 43100
19 Greenland 64.82 44703
Thus we find that, this one line of code is equivalent to two lines of code above. In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. In the above code first gapminder data set is called and then the set is put to select the three variables (country, life_expectance, and population) and finally it calls another function filter to filter the observations having population less than or equal to 45000.
Here are some more simple examples:
[1] 7
We can continue this pipe values along:
[1] 0.845098
The code is equivalent to log10(sqrt(49))
We see that, the pipe sends values to the first argument, so we can define other arguments as if the first argument is already defined. Therefore, when using the pipe with data frames and dplyr, we no longer need to specify the required first argument since the dplyr functions we have described all take the data as the first argument. Note that the pipe works well with functions where the first argument is the input data. All the functions in the package dplyr have this format and can be used easily with the pipe.
summarize()[ or summarise()] functionSummarising data is one of the important part of exploratory data analysis. Summarising data includes computing descriptive statistic like averages, measures of variations, skewness, kurtosis etc. Details of these summary statistics will be covered in the section Descriptive Statistics. In this section we briefly discuss how we can compute the summary statistics like mean, media, standard deviation, etc using summarise() or summarize() function.
The summarize function in dplyr provides a way to compute summary statistics with intuitive and readable code. The following line of codes can be used to compute the summary statistics like mean, standard deviation, maximum and minimum value of life_expectancy of Nepal from the gapminder data set.
gapminder %>% filter(country == "Nepal") %>%
summarise(Mean = mean(life_expectancy),
SD = sd(life_expectancy),
Max = max(life_expectancy),
Min = min(life_expectancy)) Mean SD Max Min
1 56.12175 10.21833 70.2 39.8
The above codes at first calls our original data table gapminder as input, filters it to keep only for country Nepal, and then produces a new summarized table with just the arithmetic mean, the standard deviation, Maximum and Minimum value of life_expectancy. Note that we can choose the names of the columns of the resulting table. For example, above we decided to use Mean, SD, Max and Min but we could have used other names just in the similar manner. Also we could have stored the resulting table in a new data frame object. Let the above resulting table be stored in a data frame object called Nepal_life_expectancy_summary.
Nepal_life_expectancy_summary = gapminder %>%
filter(country == "Nepal") %>%
summarise(Mean = mean(life_expectancy),
SD = sd(life_expectancy),
Max = max(life_expectancy),
Min = min(life_expectancy))
Nepal_life_expectancy_summary Mean SD Max Min
1 56.12175 10.21833 70.2 39.8
As the table has been stored in the object Nepal_life_expectancy_summary, we can pull its components by using the Accessor, the $ sign.
[1] 56.12175
[1] 10.21833
[1] 70.2
[1] 39.8
The similar action can be taken by using pull() function using the pipe operator %>%. This is something equivalent as using $. For example
[1] 56.12175
and so on.
As the most of other dplyr functions, summarize() is also aware of the variable names and it uses them directly. No accessor, ($) is required as in R base functions to access a variable whenever we use **dplyr* package. So when inside the call to the summarize() function we write mean(life_expectancy), the function is accessing the column with the name “life_expectancy” of the gapminder data frame and then computing the arithmetic mean of the resulting numeric vector.
It should be noted that With the function summarize(), we can only call functions that return a single value. We can not use the functions that return two or more than two values inside summarise().
group_by() functionIn data exploration we may require to group the variable and compute the summary statistics for each of the groups. For example, we may want to compute the summary statistics like mean and standard deviation for different continents’ life_expectancy. In the gapminder data set we have the following continents; . Using goup_by() function we can easily compute the summary statistics like mean and standard deviation of life_expectancy continent wise.
gapminder %>%
group_by(continent) %>%
summarise(Mean = mean(life_expectancy), SD = sd(life_expectancy))`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 3
continent Mean SD
<fct> <dbl> <dbl>
1 Africa 54.4 8.97
2 Americas 68.8 7.00
3 Asia 65.9 9.77
4 Europe 73.6 4.26
5 Oceania 64.6 8.49
Here, we can see that the function summarise() can be applied to the each of the groups separately. We have to be sure that the grouping variable must have the class of factor as we can see that the class of the variable continent is a factor (<fct>) beneath the continent variable.
Let us try it with another example to compute the summary statistics like mean, standard deviation, median, value of life_expectancy for the different region. In this case our grouping variable is region and it has 22 groups (We can check this by using either str() or glimpse() function). The code should return mean, median and standard deviation for each of the 22 regions. Let us check what the code should look like;
gapminder %>%
group_by(region) %>% # this will group the variable `region`
summarise(Mean = mean(life_expectancy),
Median = median(life_expectancy),
SD = sd(life_expectancy))`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 22 x 4
region Mean Median SD
<fct> <dbl> <dbl> <dbl>
1 Australia and New Zealand 76.0 75.9 3.89
2 Caribbean 69.0 70.4 6.58
3 Central America 68.0 70.0 7.67
4 Central Asia 64.7 65.1 3.67
5 Eastern Africa 54.5 53.9 8.68
6 Eastern Asia 70.5 71.9 8.97
7 Eastern Europe 70.6 70.6 2.82
8 Melanesia 60.5 61.3 7.24
9 Micronesia 60.0 60.8 6.05
10 Middle Africa 51.2 51.8 6.27
# ... with 12 more rows
arrange() functionIn the exploratory data analysis, we may sometime require to examine the data table in a order with respect to the array of a particular variable or column. In such cases we could use the dplyr function, called arrang(). To arrange the data table using arrange() we first need to decide which variable to sort by.
To see the country by life_expectancy in the year 2015, from smallest to largest, in the data table gapminder, we arrange with the following line of codes.
gapminder %>% # to call `gapminder` data frame
select(country,life_expectancy,year) %>% # selects `country`, `life_expectancy`, `year`.
filter(year == 2015) %>% # filter by the year of 2015.
arrange(life_expectancy) %>% # to arrange the data table by `life_expectancy`.
head() # to view first 6 entries only. country life_expectancy year
1 Lesotho 47.1 2015
2 Central African Republic 49.6 2015
3 Swaziland 51.8 2015
4 Guinea-Bissau 55.6 2015
5 Zambia 56.7 2015
6 Mozambique 57.1 2015
By default the arrange() function arranges the data table in ascending order and to arrange in descending order we call another function desc() inside the arrange(). Let us see the country list by life_expectancy in descending order.
gapminder %>%
select(country, life_expectancy, year) %>%
filter(year == 2015) %>%
arrange(desc(life_expectancy)) %>%
head() country life_expectancy year
1 Hong Kong, China 83.73 2015
2 Iceland 83.30 2015
3 Japan 83.20 2015
4 Switzerland 83.00 2015
5 Spain 82.60 2015
6 Australia 82.30 2015
For any missing values in any data structure, R automatically assigns as “NA” meaning Not Available. NA is a logical constant of length 1 which contains a missing value indicator. For a detail description about NA, readers are advised to check for ?NA or help(NA).
If we observe the gapminder data set carefully, we can find many missing values in the columns infrant_mortality and gdp, which are assigned by NAs. We have to be cautious while computing summary statistics for a data which consists of NAs. If we try to compute the summary statistics like mean, standard deviation for a variable which consists of NA using R functions,that will always return NA. For an example if we compute the mean of gdp of Nepal from the gapminder data set, it returns NA.
gapminder %>%
select(country, gdp) %>%
filter(country == "Nepal") %>%
summarise(Mean_GDP = mean(gdp)) Mean_GDP
1 NA
It can be seen that the above line of codes return NA as the mean of gdp of Nepal, this is because of missing entries of gdp of Nepal in some years.
To ignore such NA values we use na.rm argument. By setting the argument na.rm = TRUE we can ignore the NAs in the data set.
gapminder %>%
select(country, gdp) %>%
filter(country == "Nepal") %>%
summarise(Mean_GDP = mean(gdp, na.rm = TRUE)) Mean_GDP
1 3542149470