Week 2 - Functions in R and Basic Grammars: part 1 - dplyr verbs

As mentioned last week, R is a powerful programming language in its own right, specialised for statistics and data. One reason R has become a preferred language for data handling is its huge reserve of open-source packages. Skilled developers continuously create powerful and useful libraries that all R users can access. Most of all, R users love it!

Because R has so many functions and libraries, however, I have been a bit confused about which part I should start learning first. Plus, the language is not as familiar to programmers as other languages, since R is relatively young: it first appeared in 1993, and version 1.0.0 was released in 2000.

During my SLICC, I will mainly use the Tidyverse, a collection of packages that provides a coherent environment for data manipulation, exploration and visualisation. This smart collection is developed by Hadley Wickham with the intention of making statisticians and data scientists more productive. With it, let us explore the world of data and statistics.

As briefly mentioned last week, we first need to install the packages:

install.packages("dplyr")
install.packages("ggplot2")
install.packages("gapminder")
install.packages("statsr")

After that, let's load them using library().

library(dplyr)
library(ggplot2)
library(shiny)
library(gapminder)

The Tidyverse includes dplyr and ggplot2: dplyr is a specialised package for data manipulation, and ggplot2 is a specialised package for data visualisation. There are several other libraries we could use, yet I will use dplyr and ggplot2 here, analysing data from gapminder, an open-access dataset available as an R package, loaded above in this session.

Before starting, I want to point out that one of the most popular pieces of grammar in dplyr is the pipe, %>%. This is not a built-in operator in base R (it comes from the magrittr package), but it is used frequently with dplyr, so we need to know what it means. The pipe, %>%, means: take whatever is before it, and feed it into the next step.
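To make the pipe concrete, here is a minimal sketch using a toy vector of my own (not the gapminder data). The names x, nested and piped are made up for illustration. It shows that x %>% f() is simply another way of writing f(x), and that chained steps read from left to right.

```r
library(dplyr)  # dplyr re-exports the %>% pipe from the magrittr package

# A small made-up vector, used only for illustration
x <- c(1, 4, 9, 16)

# Nested style: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped style: read from left to right, one step per line
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)  # both styles give the same value
```

Each %>% passes the result so far in as the first argument of the next function, which is why round(1) here means "round to 1 decimal place".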

Filter verb: filtering cases!

One of the essential jobs in data analysis is probably filtering observations. Here, I introduce the filter() function for selecting a subset of observations, as an example of how to use the pipe, %>%.

Let's print what the gapminder dataset contains.

gapminder
## # A tibble: 1,704 x 6
##        country continent  year lifeExp      pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan      Asia  1952  28.801  8425333  779.4453
##  2 Afghanistan      Asia  1957  30.332  9240934  820.8530
##  3 Afghanistan      Asia  1962  31.997 10267083  853.1007
##  4 Afghanistan      Asia  1967  34.020 11537966  836.1971
##  5 Afghanistan      Asia  1972  36.088 13079460  739.9811
##  6 Afghanistan      Asia  1977  38.438 14880372  786.1134
##  7 Afghanistan      Asia  1982  39.854 12881816  978.0114
##  8 Afghanistan      Asia  1987  40.822 13867957  852.3959
##  9 Afghanistan      Asia  1992  41.674 16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414
## # ... with 1,694 more rows

There are 1,704 observations, each of which includes 6 variables: country, continent, year, life expectancy, population, and GDP per capita (GDPPC).

Let's use the pipe, %>%, to trim the data.

gapminder %>%
  filter(year == 2007)
## Warning: package 'bindrcpp' was built under R version 3.4.4
## # A tibble: 142 x 6
##        country continent  year lifeExp       pop  gdpPercap
##         <fctr>    <fctr> <int>   <dbl>     <int>      <dbl>
##  1 Afghanistan      Asia  2007  43.828  31889923   974.5803
##  2     Albania    Europe  2007  76.423   3600523  5937.0295
##  3     Algeria    Africa  2007  72.301  33333216  6223.3675
##  4      Angola    Africa  2007  42.731  12420476  4797.2313
##  5   Argentina  Americas  2007  75.320  40301927 12779.3796
##  6   Australia   Oceania  2007  81.235  20434176 34435.3674
##  7     Austria    Europe  2007  79.829   8199783 36126.4927
##  8     Bahrain      Asia  2007  75.635    708573 29796.0483
##  9  Bangladesh      Asia  2007  64.062 150448339  1391.2538
## 10     Belgium    Europe  2007  79.441  10392226 33692.6051
## # ... with 132 more rows

As can be seen, 1,562 of the 1,704 observations have been dropped. Because of filter(), only the observations from 2007 remain. Without the pipe, %>%, the same filter would be written like this:

filter(gapminder, year == 2007)
## # A tibble: 142 x 6
##        country continent  year lifeExp       pop  gdpPercap
##         <fctr>    <fctr> <int>   <dbl>     <int>      <dbl>
##  1 Afghanistan      Asia  2007  43.828  31889923   974.5803
##  2     Albania    Europe  2007  76.423   3600523  5937.0295
##  3     Algeria    Africa  2007  72.301  33333216  6223.3675
##  4      Angola    Africa  2007  42.731  12420476  4797.2313
##  5   Argentina  Americas  2007  75.320  40301927 12779.3796
##  6   Australia   Oceania  2007  81.235  20434176 34435.3674
##  7     Austria    Europe  2007  79.829   8199783 36126.4927
##  8     Bahrain      Asia  2007  75.635    708573 29796.0483
##  9  Bangladesh      Asia  2007  64.062 150448339  1391.2538
## 10     Belgium    Europe  2007  79.441  10392226 33692.6051
## # ... with 132 more rows

For a simple function call like the one above, it's completely doable without %>%. However, when several steps need to be chained together, the pipe, %>%, eases your stress for sure.

We can also specify multiple conditions in filter(), separated by commas; only rows that satisfy all of them are kept.

gapminder %>%
  filter(year == 2007, country == "United States")
## # A tibble: 1 x 6
##         country continent  year lifeExp       pop gdpPercap
##          <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
## 1 United States  Americas  2007  78.242 301139947  42951.65

This sort of double/multiple filtering is useful for extracting the single case that you are most interested in. Wonderful, isn't it?
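As a small sketch of how multiple conditions behave, the code below compares the comma form with an explicit & (logical AND), and then uses %in% to keep more than one country. The object names with_comma, with_and and two_countries are my own, and Canada is just an arbitrary second country chosen for the example.

```r
library(dplyr)
library(gapminder)

# Conditions separated by commas must all be true (logical AND)
with_comma <- gapminder %>%
  filter(year == 2007, country == "United States")

# The same filter written with an explicit & operator
with_and <- gapminder %>%
  filter(year == 2007 & country == "United States")

identical(with_comma, with_and)

# To keep several countries at once, %in% checks membership in a vector
two_countries <- gapminder %>%
  filter(year == 2007, country %in% c("United States", "Canada"))

nrow(two_countries)  # one row per country for the chosen year
```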

Arrange verb: arranging cases in an ascending/descending order!

We had a look at how to filter cases above. This time, I will introduce the arrange verb, arrange(). With a new dataset, researchers will probably first check which values are the highest and the lowest. In this sense, sorting with arrange() is essential!

Let’s have a look then!

gapminder %>%
  arrange(gdpPercap)
## # A tibble: 1,704 x 6
##             country continent  year lifeExp      pop gdpPercap
##              <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
##  1 Congo, Dem. Rep.    Africa  2002  44.966 55379852  241.1659
##  2 Congo, Dem. Rep.    Africa  2007  46.462 64606759  277.5519
##  3          Lesotho    Africa  1952  42.138   748747  298.8462
##  4    Guinea-Bissau    Africa  1952  32.500   580653  299.8503
##  5 Congo, Dem. Rep.    Africa  1997  42.587 47798986  312.1884
##  6          Eritrea    Africa  1952  35.928  1438760  328.9406
##  7          Myanmar      Asia  1952  36.319 20092996  331.0000
##  8          Lesotho    Africa  1957  45.047   813338  335.9971
##  9          Burundi    Africa  1952  39.031  2445618  339.2965
## 10          Eritrea    Africa  1957  38.047  1542611  344.1619
## # ... with 1,694 more rows

The cases are now sorted in increasing order, from the lowest GDPPC to the highest. Just as with filter, the gapminder object hasn’t been changed. arrange() just gives us a newly sorted dataset.
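To check for myself that the original object really stays untouched, here is a short sketch. The name by_gdp is my own choice; storing the result under a new name is simply one way to keep the sorted copy.

```r
library(dplyr)
library(gapminder)

# arrange() returns a new, sorted tibble; store it under a new name to keep it
by_gdp <- gapminder %>%
  arrange(gdpPercap)

# The first row of the sorted copy holds the smallest gdpPercap
by_gdp$gdpPercap[1] == min(gapminder$gdpPercap)

# The original object still starts with Afghanistan in 1952
head(gapminder, 1)
```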

arrange() also allows us to sort in descending order, by wrapping the variable in desc().

gapminder %>%
  arrange(desc(gdpPercap))
## # A tibble: 1,704 x 6
##      country continent  year lifeExp     pop gdpPercap
##       <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
##  1    Kuwait      Asia  1957  58.033  212846 113523.13
##  2    Kuwait      Asia  1972  67.712  841934 109347.87
##  3    Kuwait      Asia  1952  55.565  160000 108382.35
##  4    Kuwait      Asia  1962  60.470  358266  95458.11
##  5    Kuwait      Asia  1967  64.624  575003  80894.88
##  6    Kuwait      Asia  1977  69.343 1140357  59265.48
##  7    Norway    Europe  2007  80.196 4627926  49357.19
##  8    Kuwait      Asia  2007  77.588 2505559  47306.99
##  9 Singapore      Asia  2007  79.972 4553009  47143.18
## 10    Norway    Europe  2002  79.050 4535591  44683.98
## # ... with 1,694 more rows

What about the case where a researcher only needs to know which countries had the highest GDPPC in a single year? Here, the pipe, %>%, makes our life easy.

gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(gdpPercap))
## # A tibble: 142 x 6
##             country continent  year lifeExp       pop gdpPercap
##              <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
##  1           Norway    Europe  2007  80.196   4627926  49357.19
##  2           Kuwait      Asia  2007  77.588   2505559  47306.99
##  3        Singapore      Asia  2007  79.972   4553009  47143.18
##  4    United States  Americas  2007  78.242 301139947  42951.65
##  5          Ireland    Europe  2007  78.885   4109086  40676.00
##  6 Hong Kong, China      Asia  2007  82.208   6980412  39724.98
##  7      Switzerland    Europe  2007  81.701   7554661  37506.42
##  8      Netherlands    Europe  2007  79.762  16570613  36797.93
##  9           Canada  Americas  2007  80.653  33390141  36319.24
## 10          Iceland    Europe  2007  81.757    301931  36180.79
## # ... with 132 more rows

Without the pipe, the same result would need one function call nested inside another, which quickly becomes hard to read.
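For comparison, this is a sketch of what the 2007 ranking looks like without the pipe, written as nested calls. The names piped, nested and top3 are my own, and ending the pipeline with head() is just one simple way to keep only the top rows.

```r
library(dplyr)
library(gapminder)

# Piped version: filter first, then sort
piped <- gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(gdpPercap))

# Nested version: the inner call, filter(), runs first
nested <- arrange(filter(gapminder, year == 2007), desc(gdpPercap))

identical(piped, nested)

# If only the top few rows are needed, head() can end the pipeline
top3 <- piped %>% head(3)
as.character(top3$country)
```

Both versions give the same table, but the nested one has to be read from the inside out, while the piped one follows the order in which the steps actually happen.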

Mutate verb: changing or adding a variable!

When manipulating data, users often want to change one of the variables in their dataset, for example by rescaling it, or to add a new one. We can easily do this using the mutate() verb.

As with filter() and arrange(), we use mutate() after a pipe operator, %>%. Let's have a look.

gapminder %>%
  mutate(pop = pop / 10000000)
## # A tibble: 1,704 x 6
##        country continent  year lifeExp       pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>     <dbl>     <dbl>
##  1 Afghanistan      Asia  1952  28.801 0.8425333  779.4453
##  2 Afghanistan      Asia  1957  30.332 0.9240934  820.8530
##  3 Afghanistan      Asia  1962  31.997 1.0267083  853.1007
##  4 Afghanistan      Asia  1967  34.020 1.1537966  836.1971
##  5 Afghanistan      Asia  1972  36.088 1.3079460  739.9811
##  6 Afghanistan      Asia  1977  38.438 1.4880372  786.1134
##  7 Afghanistan      Asia  1982  39.854 1.2881816  978.0114
##  8 Afghanistan      Asia  1987  40.822 1.3867957  852.3959
##  9 Afghanistan      Asia  1992  41.674 1.6317921  649.3414
## 10 Afghanistan      Asia  1997  41.763 2.2227415  635.3414
## # ... with 1,694 more rows

pop is one of the variables in the gapminder dataset, and it has now been replaced by pop / 10000000. Its values are therefore much smaller than before: the population is now expressed in units of ten million. This is how we manipulate existing variables, which is often required for data processing and cleaning. Again, we are not changing the original data, only the values in the new data frame that is returned.

The mutate() function is also used for adding a new variable. Here, I will make a new variable, gdp, by multiplying gdpPercap by the population, pop.

gapminder %>%
  mutate(gdp = gdpPercap * pop)
## # A tibble: 1,704 x 7
##        country continent  year lifeExp      pop gdpPercap         gdp
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>       <dbl>
##  1 Afghanistan      Asia  1952  28.801  8425333  779.4453  6567086330
##  2 Afghanistan      Asia  1957  30.332  9240934  820.8530  7585448670
##  3 Afghanistan      Asia  1962  31.997 10267083  853.1007  8758855797
##  4 Afghanistan      Asia  1967  34.020 11537966  836.1971  9648014150
##  5 Afghanistan      Asia  1972  36.088 13079460  739.9811  9678553274
##  6 Afghanistan      Asia  1977  38.438 14880372  786.1134 11697659231
##  7 Afghanistan      Asia  1982  39.854 12881816  978.0114 12598563401
##  8 Afghanistan      Asia  1987  40.822 13867957  852.3959 11820990309
##  9 Afghanistan      Asia  1992  41.674 16317921  649.3414 10595901589
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414 14121995875
## # ... with 1,694 more rows

Notice that the resulting table has a newly created column, gdp.

As a last example of these dplyr verbs, I will combine all three verbs introduced above in one piece of code.
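Because the gapminder numbers are large, I find it easier to check what mutate() does on a tiny table. The sketch below uses a made-up data frame called toy (the countries "A" and "B" and all of their numbers are invented, not real statistics) and adds two columns in a single mutate() call. The column name pop_mil is also my own.

```r
library(dplyr)

# A tiny made-up data frame (these numbers are not real statistics)
toy <- data.frame(
  country   = c("A", "B"),
  pop       = c(2000000, 50000000),
  gdpPercap = c(1000, 20000)
)

result <- toy %>%
  mutate(
    gdp     = gdpPercap * pop,   # new column: total GDP
    pop_mil = pop / 1000000      # new column: population in millions
  )

names(result)  # the two new columns are appended on the right
names(toy)     # the original data frame still has only three columns
```

Giving the rescaled population a new name, instead of overwriting pop, keeps the original values available in the same table.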

gapminder %>%
  mutate(gdp = gdpPercap * pop) %>%
  filter(year == 2007) %>%
  arrange(desc(gdp))
## # A tibble: 142 x 7
##           country continent  year lifeExp        pop gdpPercap
##            <fctr>    <fctr> <int>   <dbl>      <int>     <dbl>
##  1  United States  Americas  2007  78.242  301139947 42951.653
##  2          China      Asia  2007  72.961 1318683096  4959.115
##  3          Japan      Asia  2007  82.603  127467972 31656.068
##  4          India      Asia  2007  64.698 1110396331  2452.210
##  5        Germany    Europe  2007  79.406   82400996 32170.374
##  6 United Kingdom    Europe  2007  79.425   60776238 33203.261
##  7         France    Europe  2007  80.657   61083916 30470.017
##  8         Brazil  Americas  2007  72.390  190010647  9065.801
##  9          Italy    Europe  2007  80.546   58147733 28569.720
## 10         Mexico  Americas  2007  76.195  108700891 11977.575
## # ... with 132 more rows, and 1 more variables: gdp <dbl>

This ranks the countries by total GDP in 2007, from the highest to the lowest.


I especially want to thank Kaya Lee, data scientist at CUPIST, for her advice. I also benefitted greatly from subscribing to DataCamp (www.campus.datacamp.com) this time, which was not originally in my SLICC plans. There are tons of useful tutorials and information we can access. Some parts of the DataCamp tutorials are free, others are not. But for those who are new to R, I definitely recommend going through DataCamp! In my case, I have consistently read specialised textbooks for R, Introduction to R and Learning R, yet practising in DataCamp with professionally organised video clips helps me a lot more than just reading a textbook and practising on my own.