Data are ubiquitous but insight is hard due to lack of capacity among researchers/scholars at the local level. Survey data are collected but explored very little. One main reason for this limited use of data is a difficult and time consuming process of cleaning and transformation of data. Most of the time researchers use it in various softwares in a way that using data for others require the same level of effort as the person who got mastery over the data in a very hard way. This tutorial is an effort not only to introduce students/scholars/academia/policy makers to dplyr and other R packages but also to make it available that anyone can use it following these steps. I am using dplyr package of the tidyverse because it makes data transformation easier and faster. It has gained great popularity among R users and experise in it is one of the standard skill in R and is in high demand. In the following I give an introduction to some main functions of dplyr and use it for Gapminder data. I shall elaborate some of the most extensively used functions of dplyr. Once you get understanding of using dplyr with Gapminder data, I shall do data anaysis for Labour Force Surcey (LFS) data in my next post.
I shall upload gapminder and tidyverse library
library(gapminder)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
View command opens data in new worksheet while glimpse lists nature of variables (numeric/character/factor…) and total number of rows and columns.
glimpse(gapminder) # We see that there are 1704 rows for 6 columns and also tells nature of variable
Observations: 1,704
Variables: 6
$ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
View(gapminder) # This opens up full data in a new window
gapminder # gapminder is now tibble when we use dplyr and it displays only first 10 rows
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ... with 1,694 more rows
Now I shall mention some of the powerful but very simple to use features of dplyr. First of all I am going to explain filter variable of dplyr. filter is used only to select rows for a given condition. I am going to select data only for year 2002.
gapminder %>% filter(year==2002)
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2002 42.1 25268405 727.
## 2 Albania Europe 2002 75.7 3508512 4604.
## 3 Algeria Africa 2002 71.0 31287142 5288.
## 4 Angola Africa 2002 41.0 10866106 2773.
## 5 Argentina Americas 2002 74.3 38331121 8798.
## 6 Australia Oceania 2002 80.4 19546792 30688.
## 7 Austria Europe 2002 79.0 8148312 32418.
## 8 Bahrain Asia 2002 74.8 656397 23404.
## 9 Bangladesh Asia 2002 62.0 135656790 1136.
## 10 Belgium Europe 2002 78.3 10311970 30486.
## # ... with 132 more rows
Compared to previus one, gapminder is showing data only for 142 rows. Pipes %>% play very nicely with dplyr and make our code very easy to understand. The tibble gapminder is being piped into the function filter(). The argument year == 2002 tells filter() that it should find all the rows such that the logical condition year == 2002 is TRUE.
Have we accidently deleted all other rows? Answer is no. Nope: we haven’t made any changes to gapminder at all. If you don’t believe me try entering gapminder at the console. All that this command does is display a subset of gapminder. If we wanted to store the result of running this command, we’d need to assign it to a variable, for example if you are not sure, lets type
gapminder2002<-gapminder %>% filter(year==2002)
gapminder2002
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2002 42.1 25268405 727.
## 2 Albania Europe 2002 75.7 3508512 4604.
## 3 Algeria Africa 2002 71.0 31287142 5288.
## 4 Angola Africa 2002 41.0 10866106 2773.
## 5 Argentina Americas 2002 74.3 38331121 8798.
## 6 Australia Oceania 2002 80.4 19546792 30688.
## 7 Austria Europe 2002 79.0 8148312 32418.
## 8 Bahrain Asia 2002 74.8 656397 23404.
## 9 Bangladesh Asia 2002 62.0 135656790 1136.
## 10 Belgium Europe 2002 78.3 10311970 30486.
## # ... with 132 more rows
gapminder %>% filter(year==2002,country=="Sri Lanka")
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Sri Lanka Asia 2002 70.8 19576783 3015.
gapminderSA<-gapminder %>% filter(country %in% c("Bangladesh","India","Pakistan","Sri Lanka","Nepal", "Afghanistan","Bhutan", "Maldives"))
gapminderSA
## # A tibble: 72 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 62 more rows
Sort data with arrange Suppose we wanted to sort gapminder data for South Asia by gdpPercap. To do this we can use the arrange command along with the pipe %>% as follows:
gapminderSA %>% arrange(gdpPercap)
## # A tibble: 72 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Nepal Asia 1952 36.2 9182536 546.
## 2 India Asia 1952 37.4 372000000 547.
## 3 India Asia 1957 40.2 409000000 590.
## 4 Nepal Asia 1957 37.7 9682338 598.
## 5 Bangladesh Asia 1972 45.3 70759295 630.
## 6 Afghanistan Asia 1997 41.8 22227415 635.
## 7 Afghanistan Asia 1992 41.7 16317921 649.
## 8 Nepal Asia 1962 39.4 10332057 652.
## 9 India Asia 1962 43.6 454000000 658.
## 10 Bangladesh Asia 1977 46.9 80428306 660.
## # ... with 62 more rows
The logic is very similar to what we saw above for filter. Here, I use another important function arrange. The argument gdpPercap tells arrange() that we want to sort by GDP per capita. Note that by default arrange() sorts in ascending order. If we want to sort in descending order, we use the function desc().
gapminderSA %>% arrange(desc(gdpPercap))
## # A tibble: 72 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Sri Lanka Asia 2007 72.4 20378239 3970.
## 2 Sri Lanka Asia 2002 70.8 19576783 3015.
## 3 Sri Lanka Asia 1997 70.5 18698655 2664.
## 4 Pakistan Asia 2007 65.5 169270617 2606.
## 5 India Asia 2007 64.7 1110396331 2452.
## 6 Sri Lanka Asia 1992 70.4 17587060 2154.
## 7 Pakistan Asia 2002 63.6 153403524 2093.
## 8 Pakistan Asia 1997 61.8 135564834 2049.
## 9 Pakistan Asia 1992 60.8 120065004 1972.
## 10 Sri Lanka Asia 1987 69.0 16495304 1877.
## # ... with 62 more rows
What is the lowest and highest life expectancy among South Asian countries?
gapminderSA %>% filter(year==2002) %>% arrange(lifeExp)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2002 42.1 25268405 727.
## 2 Nepal Asia 2002 61.3 25873917 1057.
## 3 Bangladesh Asia 2002 62.0 135656790 1136.
## 4 India Asia 2002 62.9 1034172547 1747.
## 5 Pakistan Asia 2002 63.6 153403524 2093.
## 6 Sri Lanka Asia 2002 70.8 19576783 3015.
It’s a little hard to read the column pop in gapminder since there are so many digits. Suppose that, instead of raw population, we wanted to display population in millions. This requires us to pop by 1000000, which we can do using the function mutate() from dplyr as follows:
gapminderSA %>% mutate(pop=pop/1000000)
## # A tibble: 72 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8.43 779.
## 2 Afghanistan Asia 1957 30.3 9.24 821.
## 3 Afghanistan Asia 1962 32.0 10.3 853.
## 4 Afghanistan Asia 1967 34.0 11.5 836.
## 5 Afghanistan Asia 1972 36.1 13.1 740.
## 6 Afghanistan Asia 1977 38.4 14.9 786.
## 7 Afghanistan Asia 1982 39.9 12.9 978.
## 8 Afghanistan Asia 1987 40.8 13.9 852.
## 9 Afghanistan Asia 1992 41.7 16.3 649.
## 10 Afghanistan Asia 1997 41.8 22.2 635.
## # ... with 62 more rows
If we want to calculate GDP, we need to multiply gdpPercap by pop. But wait! Didn’t we just change pop so it’s expressed in millions? No: we never stored the results of our previous command, we simply displayed them. Just as I discussed above, unless you overwrite it, the original gapminder dataset will be unchanged. With this in mind, we can create the gdp variable as follows:
gapminderSA %>% mutate(gdp = pop * gdpPercap)
## # A tibble: 72 x 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
## # ... with 62 more rows
Another feature of dplyr is summarise data
gapminder %>% filter(year==2002) %>% group_by(continent) %>% summarise(mean=mean(lifeExp),min=min(lifeExp),max=max(lifeExp))
## # A tibble: 5 x 4
## continent mean min max
## <fct> <dbl> <dbl> <dbl>
## 1 Africa 53.3 39.2 75.7
## 2 Americas 72.4 58.1 79.8
## 3 Asia 69.2 42.1 82
## 4 Europe 76.7 70.8 80.6
## 5 Oceania 79.7 79.1 80.4
##A simple scatterplot using ggplot2 Now that we know the basics of dplyr, we’ll turn our attention to graphics. R has many powerful build-in graphics functions that may be familiar to you. I am using a very powerful package for statistical visualization called ggplot2. ggplot2 is included in the tidyverse package, which if you’ve already installed and loaded, no need to do anything. Otherwise, make sure you have this package installed. We’ll start off by constructing a subset of the gapminder dataset that contains information from the year 2002 that we’ll use for our plots below.
library(ggplot2 )
gapminder_2002 <- gapminder %>% filter(year == 2002)
p<-ggplot(gapminder_2002)+geom_point(mapping =aes(x=gdpPercap,y=lifeExp,color=continent))
p
There is a relationship between gdpPercap and lifeExp. Now we explore whether there is relationship between population and liftExp using scatter plot.
ggplot(data = gapminder_2002) +
geom_point(mapping = aes(x = pop, y = lifeExp, color = continent))
##Ploting on log scale To transform the x-axis, it’s as easy as adding a + scale_x_log10() to the end of our command from above:
ggplot(data = gapminder_2002) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp)) +
scale_x_log10()
ggplot(data = gapminder_2002) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp)) +
scale_x_log10() +
scale_y_log10()
In all of these above mentioned graphs, we have used two main arguments : data which is gapminder_2002 and geom_functioning. We can also use size and color arguments as well for the scatter plot to make graph more informative.
ggplot(data = gapminder_2002) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
scale_x_log10()