dplyr verbs
As mentioned last week, R is a powerful programming language in its own right, specialised for statistics and data. One reason R has become a preferred language for data handling is its huge reserve of open-source packages. Skilled developers continuously create powerful and useful libraries that every R user can access. Most of all, R users love them!
Because R has so many functions and libraries, however, I have been a bit unsure which part I should start learning first. Plus, the language is less familiar to many programmers than other languages, since R is a relatively young language: it was first released in 1993, with version 1.0.0 following in 2000.
During my SLICC, I will mainly use the Tidyverse, a collection of packages that provides a coherent environment for data manipulation, exploration and visualisation. Its development is led by Hadley Wickham with the intention of making statisticians and data scientists more productive. With these packages, let us explore the world of data and statistics.
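As briefly mentioned last week, we first need to install the packages:
install.packages("dplyr")
install.packages("ggplot2")
install.packages("shiny")      # loaded below with library(shiny)
install.packages("gapminder")
install.packages("statsr")     # CRAN name; install.packages() does not accept a "user/repo" path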
After that, let’s load them using library().
library(dplyr)
library(ggplot2)
library(shiny)
library(gapminder)
The Tidyverse includes dplyr and ggplot2, where dplyr is a package specialised for data manipulation and ggplot2 is one specialised for data visualisation. There are several other libraries we could use, yet here I will use dplyr and ggplot2 to analyse the data provided by gapminder, an openly available dataset package for R, loaded above in this session.
Before we start, I want to point out that one of the most commonly used pieces of grammar in dplyr is the pipe, %>%. It is not built into base R; it comes from the magrittr package and becomes available when dplyr is loaded, and it is used very frequently with dplyr. Therefore, we need to know what it means. The pipe, %>%, means: take whatever is before it and feed it into the next step as the first argument.
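The behaviour of the pipe can be checked directly. The short sketch below, which assumes dplyr and gapminder are installed, compares a piped call with the ordinary call that passes the dataset as the first argument. The names piped and direct are only illustrative.

```r
library(dplyr)
library(gapminder)

# x %>% f(y) is evaluated as f(x, y): the left-hand side becomes the first argument.
piped  <- gapminder %>% head(3)
direct <- head(gapminder, 3)

# TRUE when both forms return exactly the same rows and columns.
identical(piped, direct)
```

Both forms do the same work; the pipe only changes the order in which the code is read.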
One of the essential jobs in any data analysis is filtering observations. Here, I introduce the filter() function for selecting a subset of the observations, as a first example of using the pipe, %>%.
Let’s print the gapminder dataset to see what it contains.
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
There are 1,704 observations, each of which includes 6 variables: country, continent, year, life expectancy (lifeExp), population (pop), and GDP per capita (gdpPercap, abbreviated as GDPPC below).
Let’s use the pipe, %>%, to trim the data.
gapminder %>%
filter(year == 2007)
## Warning: package 'bindrcpp' was built under R version 3.4.4
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.828 31889923 974.5803
## 2 Albania Europe 2007 76.423 3600523 5937.0295
## 3 Algeria Africa 2007 72.301 33333216 6223.3675
## 4 Angola Africa 2007 42.731 12420476 4797.2313
## 5 Argentina Americas 2007 75.320 40301927 12779.3796
## 6 Australia Oceania 2007 81.235 20434176 34435.3674
## 7 Austria Europe 2007 79.829 8199783 36126.4927
## 8 Bahrain Asia 2007 75.635 708573 29796.0483
## 9 Bangladesh Asia 2007 64.062 150448339 1391.2538
## 10 Belgium Europe 2007 79.441 10392226 33692.6051
## # ... with 132 more rows
As can be seen, 1,562 of the 1,704 observations have been dropped: because of filter(), only the 142 observations for the year 2007 remain. Without the pipe, %>%, the same filter is written by passing the dataset as the first argument, like this:
filter(gapminder, year == 2007)
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.828 31889923 974.5803
## 2 Albania Europe 2007 76.423 3600523 5937.0295
## 3 Algeria Africa 2007 72.301 33333216 6223.3675
## 4 Angola Africa 2007 42.731 12420476 4797.2313
## 5 Argentina Americas 2007 75.320 40301927 12779.3796
## 6 Australia Oceania 2007 81.235 20434176 34435.3674
## 7 Austria Europe 2007 79.829 8199783 36126.4927
## 8 Bahrain Asia 2007 75.635 708573 29796.0483
## 9 Bangladesh Asia 2007 64.062 150448339 1391.2538
## 10 Belgium Europe 2007 79.441 10392226 33692.6051
## # ... with 132 more rows
In simple cases like the one above, the code is perfectly doable without %>%. However, when several function calls have to be chained together, the pipe, %>%, certainly eases your stress.
We can also specify multiple conditions in filter(), separated by commas.
gapminder %>%
filter(year == 2007, country == "United States")
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 United States Americas 2007 78.242 301139947 42951.65
This sort of double or multiple filtering is useful for extracting the single case that you are most interested in. Wonderful, isn’t it?
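As a small sketch of how multiple conditions combine (assuming dplyr and gapminder are installed), conditions separated by commas must all hold, which is the same as joining them with &. The second part uses %in% to match either of two countries. The object names are illustrative, and Canada is chosen only as an example.

```r
library(dplyr)
library(gapminder)

# Conditions separated by commas are combined with AND, so these two calls match.
with_comma <- gapminder %>% filter(year == 2007, country == "United States")
with_and   <- gapminder %>% filter(year == 2007 & country == "United States")
identical(with_comma, with_and)

# To keep rows matching any of several values in one column, %in% is a compact option.
two_countries <- gapminder %>%
  filter(year == 2007, country %in% c("United States", "Canada"))
nrow(two_countries)
```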
Above, we had a look at how to filter cases. This time, I will introduce the arrange verb, arrange(). With a new dataset, researchers probably check first which values are the highest and the lowest. In this sense, sorting with arrange() is essential!
Let’s have a look then!
gapminder %>%
arrange(gdpPercap)
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Congo, Dem. Rep. Africa 2002 44.966 55379852 241.1659
## 2 Congo, Dem. Rep. Africa 2007 46.462 64606759 277.5519
## 3 Lesotho Africa 1952 42.138 748747 298.8462
## 4 Guinea-Bissau Africa 1952 32.500 580653 299.8503
## 5 Congo, Dem. Rep. Africa 1997 42.587 47798986 312.1884
## 6 Eritrea Africa 1952 35.928 1438760 328.9406
## 7 Myanmar Asia 1952 36.319 20092996 331.0000
## 8 Lesotho Africa 1957 45.047 813338 335.9971
## 9 Burundi Africa 1952 39.031 2445618 339.2965
## 10 Eritrea Africa 1957 38.047 1542611 344.1619
## # ... with 1,694 more rows
The cases are now sorted in increasing order, from the lowest GDPPC to the highest. Just as with filter(), the gapminder object itself hasn’t been changed; arrange() just gives us a newly sorted dataset.
arrange() also allows us to sort in descending order, by wrapping the variable in desc().
gapminder %>%
arrange(desc(gdpPercap))
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1957 58.033 212846 113523.13
## 2 Kuwait Asia 1972 67.712 841934 109347.87
## 3 Kuwait Asia 1952 55.565 160000 108382.35
## 4 Kuwait Asia 1962 60.470 358266 95458.11
## 5 Kuwait Asia 1967 64.624 575003 80894.88
## 6 Kuwait Asia 1977 69.343 1140357 59265.48
## 7 Norway Europe 2007 80.196 4627926 49357.19
## 8 Kuwait Asia 2007 77.588 2505559 47306.99
## 9 Singapore Asia 2007 79.972 4553009 47143.18
## 10 Norway Europe 2002 79.050 4535591 44683.98
## # ... with 1,694 more rows
What if a researcher only needs to know which countries had the highest GDPPC in one particular year? Here, the pipe, %>%, makes our life easy.
gapminder %>%
filter(year == 2007) %>%
arrange(desc(gdpPercap))
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Norway Europe 2007 80.196 4627926 49357.19
## 2 Kuwait Asia 2007 77.588 2505559 47306.99
## 3 Singapore Asia 2007 79.972 4553009 47143.18
## 4 United States Americas 2007 78.242 301139947 42951.65
## 5 Ireland Europe 2007 78.885 4109086 40676.00
## 6 Hong Kong, China Asia 2007 82.208 6980412 39724.98
## 7 Switzerland Europe 2007 81.701 7554661 37506.42
## 8 Netherlands Europe 2007 79.762 16570613 36797.93
## 9 Canada Americas 2007 80.653 33390141 36319.24
## 10 Iceland Europe 2007 81.757 301931 36180.79
## # ... with 132 more rows
Without the pipe, I would have to nest filter() inside arrange() to get the 2007 results sorted from the highest to the lowest GDPPC, which is much harder to read.
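For comparison, one way to write the same two steps without the pipe is to nest the calls. The sketch below assumes dplyr and gapminder are installed and checks that both styles return the same table. The names nested and piped are only illustrative.

```r
library(dplyr)
library(gapminder)

# Nested form: read from the inside out (filter first, then arrange).
nested <- arrange(filter(gapminder, year == 2007), desc(gdpPercap))

# Piped form: read from top to bottom, one step per line.
piped <- gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(gdpPercap))

# TRUE when the two approaches produce exactly the same result.
identical(nested, piped)
```

The nested version works, but each extra step adds another layer of parentheses, which is where the pipe starts to pay off.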
When manipulating data, users often want to change one of the variables in their dataset, for example by rescaling it or by adding to or subtracting from it. We can easily do this using the mutate() verb.
As we learnt with filter() and arrange(), we use mutate() after the pipe operator, %>%. Let’s have a look then.
gapminder %>%
mutate(pop = pop / 10000000)
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.801 0.8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 0.9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 1.0267083 853.1007
## 4 Afghanistan Asia 1967 34.020 1.1537966 836.1971
## 5 Afghanistan Asia 1972 36.088 1.3079460 739.9811
## 6 Afghanistan Asia 1977 38.438 1.4880372 786.1134
## 7 Afghanistan Asia 1982 39.854 1.2881816 978.0114
## 8 Afghanistan Asia 1987 40.822 1.3867957 852.3959
## 9 Afghanistan Asia 1992 41.674 1.6317921 649.3414
## 10 Afghanistan Asia 1997 41.763 2.2227415 635.3414
## # ... with 1,694 more rows
pop is one of the variables in the gapminder dataset, and in this output it has been replaced by pop / 10000000. Its values are therefore expressed in units of ten million people and are much smaller than before. This is how we manipulate existing variables, which is often required for data processing and cleaning. Again, we are not changing the original data, only the values in the new data frame that is returned.
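To see that the original data really are left untouched, the result of mutate() can be stored under a new name and compared with gapminder. This is a minimal sketch assuming dplyr and gapminder are installed; gapminder_scaled is a hypothetical name chosen for this example.

```r
library(dplyr)
library(gapminder)

# mutate() returns a new tibble; assign it to a name if you want to keep it.
gapminder_scaled <- gapminder %>%
  mutate(pop = pop / 10000000)

class(gapminder$pop)         # the original column is still stored as integer
class(gapminder_scaled$pop)  # the rescaled copy is stored as a double ("numeric")
```

If the result is not assigned, it is only printed and then discarded.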
The mutate function, mutate(), is also used for adding a new variable. Here, I will make a new variable gdp by multiplying gdpPercap by the population pop.
gapminder %>%
mutate(gdp = gdpPercap * pop)
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap gdp
## <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530 7585448670
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007 8758855797
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971 9648014150
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811 9678553274
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134 11697659231
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114 12598563401
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959 11820990309
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414 10595901589
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414 14121995875
## # ... with 1,694 more rows
Notice that the resulting table has a newly created column, gdp.
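As a further sketch, several columns can be created in a single mutate() call, and a later expression may use a column defined earlier in the same call. The column name gdp_billions and the object name with_gdp are hypothetical, introduced only for illustration; the code assumes dplyr and gapminder are installed.

```r
library(dplyr)
library(gapminder)

# gdp is created first, then reused to build gdp_billions in the same call.
with_gdp <- gapminder %>%
  mutate(
    gdp = gdpPercap * pop,
    gdp_billions = gdp / 1e9
  )

# The six original columns plus the two new ones.
names(with_gdp)
```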
As a last example for these dplyr verbs, I will use all three verbs introduced above, combined in one piece of code.
gapminder %>%
mutate(gdp = gdpPercap * pop) %>%
filter(year == 2007) %>%
arrange(desc(gdp))
## # A tibble: 142 x 7
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 United States Americas 2007 78.242 301139947 42951.653
## 2 China Asia 2007 72.961 1318683096 4959.115
## 3 Japan Asia 2007 82.603 127467972 31656.068
## 4 India Asia 2007 64.698 1110396331 2452.210
## 5 Germany Europe 2007 79.406 82400996 32170.374
## 6 United Kingdom Europe 2007 79.425 60776238 33203.261
## 7 France Europe 2007 80.657 61083916 30470.017
## 8 Brazil Americas 2007 72.390 190010647 9065.801
## 9 Italy Europe 2007 80.546 58147733 28569.720
## 10 Mexico Americas 2007 76.195 108700891 11977.575
## # ... with 132 more rows, and 1 more variables: gdp <dbl>
This ranks the countries by total GDP in 2007, from the highest to the lowest.
I especially want to thank Kaya Lee, data scientist at CUPIST, for her advice. I also benefitted a great deal from subscribing to DataCamp (www.campus.datacamp.com) this time, which was not originally in my SLICC plans. There are tons of useful tutorials and information we can access. Some parts of the DataCamp tutorials are free, others are not. But for those who are new to R, I definitely recommend going through DataCamp! In my case, I have consistently read specialised textbooks for R, Introduction to R and Learning R, yet practising in DataCamp with professionally organised video clips has helped me a lot more than just reading textbooks and practising on my own.