Exploring Gapminder data for South Asia using dplyr

Data are ubiquitous but insight is hard due to lack of capacity among researchers/scholars at the local level. Survey data are collected but explored very little. One main reason for this limited use of data is a difficult and time consuming process of cleaning and transformation of data. Most of the time researchers use it in various softwares in a way that using data for others require the same level of effort as the person who got mastery over the data in a very hard way. This tutorial is an effort not only to introduce students/scholars/academia/policy makers to dplyr and other R packages but also to make it available that anyone can use it following these steps. I am using dplyr package of the tidyverse because it makes data transformation easier and faster. It has gained great popularity among R users and experise in it is one of the standard skill in R and is in high demand. In the following I give an introduction to some main functions of dplyr and use it for Gapminder data. I shall elaborate some of the most extensively used functions of dplyr. Once you get understanding of using dplyr with Gapminder data, I shall do data anaysis for Labour Force Surcey (LFS) data in my next post.

I shall upload gapminder and tidyverse library

library(gapminder)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Information in gapminder data

View command opens data in new worksheet while glimpse lists nature of variables (numeric/character/factor…) and total number of rows and columns.

glimpse(gapminder) # We see that there are 1704 rows for 6 columns and also tells nature of variable
Observations: 1,704
Variables: 6
$ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
View(gapminder)    # This opens up full data in a new window
gapminder   # gapminder is now tibble when we use dplyr and it displays only first 10 rows
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ... with 1,694 more rows

dplyr features

Now I shall mention some of the powerful but very simple to use features of dplyr. First of all I am going to explain filter variable of dplyr. filter is used only to select rows for a given condition. I am going to select data only for year 2002.

gapminder %>% filter(year==2002)
## # A tibble: 142 x 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2002    42.1  25268405      727.
##  2 Albania     Europe     2002    75.7   3508512     4604.
##  3 Algeria     Africa     2002    71.0  31287142     5288.
##  4 Angola      Africa     2002    41.0  10866106     2773.
##  5 Argentina   Americas   2002    74.3  38331121     8798.
##  6 Australia   Oceania    2002    80.4  19546792    30688.
##  7 Austria     Europe     2002    79.0   8148312    32418.
##  8 Bahrain     Asia       2002    74.8    656397    23404.
##  9 Bangladesh  Asia       2002    62.0 135656790     1136.
## 10 Belgium     Europe     2002    78.3  10311970    30486.
## # ... with 132 more rows

Compared to previus one, gapminder is showing data only for 142 rows. Pipes %>% play very nicely with dplyr and make our code very easy to understand. The tibble gapminder is being piped into the function filter(). The argument year == 2002 tells filter() that it should find all the rows such that the logical condition year == 2002 is TRUE.

Have we accidently deleted all other rows? Answer is no. Nope: we haven’t made any changes to gapminder at all. If you don’t believe me try entering gapminder at the console. All that this command does is display a subset of gapminder. If we wanted to store the result of running this command, we’d need to assign it to a variable, for example if you are not sure, lets type

gapminder2002<-gapminder %>% filter(year==2002)
gapminder2002
## # A tibble: 142 x 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2002    42.1  25268405      727.
##  2 Albania     Europe     2002    75.7   3508512     4604.
##  3 Algeria     Africa     2002    71.0  31287142     5288.
##  4 Angola      Africa     2002    41.0  10866106     2773.
##  5 Argentina   Americas   2002    74.3  38331121     8798.
##  6 Australia   Oceania    2002    80.4  19546792    30688.
##  7 Austria     Europe     2002    79.0   8148312    32418.
##  8 Bahrain     Asia       2002    74.8    656397    23404.
##  9 Bangladesh  Asia       2002    62.0 135656790     1136.
## 10 Belgium     Europe     2002    78.3  10311970    30486.
## # ... with 132 more rows

Filtering with respect to two variables

gapminder %>% filter(year==2002,country=="Sri Lanka")
## # A tibble: 1 x 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Sri Lanka Asia       2002    70.8 19576783     3015.

Filtering data for South Asia countries

gapminderSA<-gapminder %>% filter(country %in% c("Bangladesh","India","Pakistan","Sri Lanka","Nepal", "Afghanistan","Bhutan", "Maldives"))
gapminderSA
## # A tibble: 72 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 62 more rows

Sort data with arrange

Sort data with arrange Suppose we wanted to sort gapminder data for South Asia by gdpPercap. To do this we can use the arrange command along with the pipe %>% as follows:

gapminderSA %>% arrange(gdpPercap)
## # A tibble: 72 x 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Nepal       Asia       1952    36.2   9182536      546.
##  2 India       Asia       1952    37.4 372000000      547.
##  3 India       Asia       1957    40.2 409000000      590.
##  4 Nepal       Asia       1957    37.7   9682338      598.
##  5 Bangladesh  Asia       1972    45.3  70759295      630.
##  6 Afghanistan Asia       1997    41.8  22227415      635.
##  7 Afghanistan Asia       1992    41.7  16317921      649.
##  8 Nepal       Asia       1962    39.4  10332057      652.
##  9 India       Asia       1962    43.6 454000000      658.
## 10 Bangladesh  Asia       1977    46.9  80428306      660.
## # ... with 62 more rows

The logic is very similar to what we saw above for filter. Here, I use another important function arrange. The argument gdpPercap tells arrange() that we want to sort by GDP per capita. Note that by default arrange() sorts in ascending order. If we want to sort in descending order, we use the function desc().

gapminderSA %>% arrange(desc(gdpPercap))
## # A tibble: 72 x 6
##    country   continent  year lifeExp        pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Sri Lanka Asia       2007    72.4   20378239     3970.
##  2 Sri Lanka Asia       2002    70.8   19576783     3015.
##  3 Sri Lanka Asia       1997    70.5   18698655     2664.
##  4 Pakistan  Asia       2007    65.5  169270617     2606.
##  5 India     Asia       2007    64.7 1110396331     2452.
##  6 Sri Lanka Asia       1992    70.4   17587060     2154.
##  7 Pakistan  Asia       2002    63.6  153403524     2093.
##  8 Pakistan  Asia       1997    61.8  135564834     2049.
##  9 Pakistan  Asia       1992    60.8  120065004     1972.
## 10 Sri Lanka Asia       1987    69.0   16495304     1877.
## # ... with 62 more rows

Life Expectancy in South Asia in 2002

What is the lowest and highest life expectancy among South Asian countries?

gapminderSA %>% filter(year==2002) %>%  arrange(lifeExp)
## # A tibble: 6 x 6
##   country     continent  year lifeExp        pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>      <int>     <dbl>
## 1 Afghanistan Asia       2002    42.1   25268405      727.
## 2 Nepal       Asia       2002    61.3   25873917     1057.
## 3 Bangladesh  Asia       2002    62.0  135656790     1136.
## 4 India       Asia       2002    62.9 1034172547     1747.
## 5 Pakistan    Asia       2002    63.6  153403524     2093.
## 6 Sri Lanka   Asia       2002    70.8   19576783     3015.

Change an existing variable or create a new one with mutate

It’s a little hard to read the column pop in gapminder since there are so many digits. Suppose that, instead of raw population, we wanted to display population in millions. This requires us to pop by 1000000, which we can do using the function mutate() from dplyr as follows:

gapminderSA %>% mutate(pop=pop/1000000)
## # A tibble: 72 x 6
##    country     continent  year lifeExp   pop gdpPercap
##    <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8.43      779.
##  2 Afghanistan Asia       1957    30.3  9.24      821.
##  3 Afghanistan Asia       1962    32.0 10.3       853.
##  4 Afghanistan Asia       1967    34.0 11.5       836.
##  5 Afghanistan Asia       1972    36.1 13.1       740.
##  6 Afghanistan Asia       1977    38.4 14.9       786.
##  7 Afghanistan Asia       1982    39.9 12.9       978.
##  8 Afghanistan Asia       1987    40.8 13.9       852.
##  9 Afghanistan Asia       1992    41.7 16.3       649.
## 10 Afghanistan Asia       1997    41.8 22.2       635.
## # ... with 62 more rows

If we want to calculate GDP, we need to multiply gdpPercap by pop. But wait! Didn’t we just change pop so it’s expressed in millions? No: we never stored the results of our previous command, we simply displayed them. Just as I discussed above, unless you overwrite it, the original gapminder dataset will be unchanged. With this in mind, we can create the gdp variable as follows:

gapminderSA %>% mutate(gdp = pop * gdpPercap)
## # A tibble: 72 x 7
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # ... with 62 more rows

Summarising data

Another feature of dplyr is summarise data

gapminder %>% filter(year==2002) %>% group_by(continent) %>% summarise(mean=mean(lifeExp),min=min(lifeExp),max=max(lifeExp))
## # A tibble: 5 x 4
##   continent  mean   min   max
##   <fct>     <dbl> <dbl> <dbl>
## 1 Africa     53.3  39.2  75.7
## 2 Americas   72.4  58.1  79.8
## 3 Asia       69.2  42.1  82  
## 4 Europe     76.7  70.8  80.6
## 5 Oceania    79.7  79.1  80.4

##A simple scatterplot using ggplot2 Now that we know the basics of dplyr, we’ll turn our attention to graphics. R has many powerful build-in graphics functions that may be familiar to you. I am using a very powerful package for statistical visualization called ggplot2. ggplot2 is included in the tidyverse package, which if you’ve already installed and loaded, no need to do anything. Otherwise, make sure you have this package installed. We’ll start off by constructing a subset of the gapminder dataset that contains information from the year 2002 that we’ll use for our plots below.

library(ggplot2 )
gapminder_2002 <- gapminder %>% filter(year == 2002)
p<-ggplot(gapminder_2002)+geom_point(mapping =aes(x=gdpPercap,y=lifeExp,color=continent))
p

There is a relationship between gdpPercap and lifeExp. Now we explore whether there is relationship between population and liftExp using scatter plot.

ggplot(data = gapminder_2002) +
geom_point(mapping = aes(x = pop, y = lifeExp, color = continent)) 

##Ploting on log scale To transform the x-axis, it’s as easy as adding a + scale_x_log10() to the end of our command from above:

ggplot(data = gapminder_2002) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp)) +
scale_x_log10()

ggplot(data = gapminder_2002) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp)) +
scale_x_log10() +
scale_y_log10()

In all of these above mentioned graphs, we have used two main arguments : data which is gapminder_2002 and geom_functioning. We can also use size and color arguments as well for the scatter plot to make graph more informative.

ggplot(data = gapminder_2002) +
geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
scale_x_log10()