Reproducibility, RMarkdown, Presentation, Webpages…., pdf, html, word, Latex…
Understanding data sets requires many hours/days or in some cases weeks.There are many commercially available software but open source community based software have now dominated and R is one of these. R makes data understanding process as easy as possible through the dplyr package. It is one of the easiest solution for code-based data analysis. We will learn in this training how to do it. I have already uploaded videos for downloading packages and relevant information. video links….
I have discussed the Gapminder dataset in my videos and we shall use it throughout this training. It’s available through CRAN, so make sure to install it. Here’s how to load in all required packages:
## # A tibble: 12 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Sweden Europe 1952 71.9 7124673 8528.
## 2 Sweden Europe 1957 72.5 7363802 9912.
## 3 Sweden Europe 1962 73.4 7561588 12329.
## 4 Sweden Europe 1967 74.2 7867931 15258.
## 5 Sweden Europe 1972 74.7 8122293 17832.
## 6 Sweden Europe 1977 75.4 8251648 18856.
## 7 Sweden Europe 1982 76.4 8325260 20667.
## 8 Sweden Europe 1987 77.2 8421403 23587.
## 9 Sweden Europe 1992 78.2 8718867 23880.
## 10 Sweden Europe 1997 79.4 8897619 25267.
## 11 Sweden Europe 2002 80.0 8954175 29342.
## 12 Sweden Europe 2007 80.9 9031088 33860.
First few rows of data are displayed here and there 1704 observations in total.
In this training, some very basic but important things you will learn are
View command opens data in new worksheet while glimpse lists nature of variables (numeric/character/factor…) and total number of rows and columns.
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, ~
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~
| country | continent | year | lifeExp | pop | gdpPercap | |
|---|---|---|---|---|---|---|
| Afghanistan: 12 | Africa :624 | Min. :1952 | Min. :23.60 | Min. :6.001e+04 | Min. : 241.2 | |
| Albania : 12 | Americas:300 | 1st Qu.:1966 | 1st Qu.:48.20 | 1st Qu.:2.794e+06 | 1st Qu.: 1202.1 | |
| Algeria : 12 | Asia :396 | Median :1980 | Median :60.71 | Median :7.024e+06 | Median : 3531.8 | |
| Angola : 12 | Europe :360 | Mean :1980 | Mean :59.47 | Mean :2.960e+07 | Mean : 7215.3 | |
| Argentina : 12 | Oceania : 24 | 3rd Qu.:1993 | 3rd Qu.:70.85 | 3rd Qu.:1.959e+07 | 3rd Qu.: 9325.5 | |
| Australia : 12 | NA | Max. :2007 | Max. :82.60 | Max. :1.319e+09 | Max. :113523.1 | |
| (Other) :1632 | NA | NA | NA | NA | NA |
More often than not, you don’t need all columns of a data set for your analysis. For example PDHS files have more than 5000 columns in some files and maybe 40 or 50 or even fewer than that are needed for your analysis. Select() function of R’s dplyr is used to select columns of your interest.
Three selected columns are selected as follows. You can give new name to this data.
## # A tibble: 1,704 x 3
## country pop lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 8425333 28.8
## 2 Afghanistan 9240934 30.3
## 3 Afghanistan 10267083 32.0
## 4 Afghanistan 11537966 34.0
## 5 Afghanistan 13079460 36.1
## 6 Afghanistan 14880372 38.4
## 7 Afghanistan 12881816 39.9
## 8 Afghanistan 13867957 40.8
## 9 Afghanistan 16317921 41.7
## 10 Afghanistan 22227415 41.8
## # ... with 1,694 more rows
In case you want to select most of the variables and drop one or two, you may proceed as follows
## # A tibble: 1,704 x 5
## country continent year lifeExp pop
## <fct> <fct> <int> <dbl> <int>
## 1 Afghanistan Asia 1952 28.8 8425333
## 2 Afghanistan Asia 1957 30.3 9240934
## 3 Afghanistan Asia 1962 32.0 10267083
## 4 Afghanistan Asia 1967 34.0 11537966
## 5 Afghanistan Asia 1972 36.1 13079460
## 6 Afghanistan Asia 1977 38.4 14880372
## 7 Afghanistan Asia 1982 39.9 12881816
## 8 Afghanistan Asia 1987 40.8 13867957
## 9 Afghanistan Asia 1992 41.7 16317921
## 10 Afghanistan Asia 1997 41.8 22227415
## # ... with 1,694 more rows
So gdpPerCapita column is now not shown above.
Filtering is another very important task one has to do in one’s analysis. Sometimes, one has to select sale related to a particular city or agent or quarter. Here is how one uses filter() command for data with a condition. We are using here commanad only to select data for year 2007 for all the countries.
Now I shall mention some of the powerful but very simple to use features of dplyr. First of all I am going to explain filter variable of dplyr. filter is used only to select rows for a given condition. I am going to select data only for year 2007.
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # ... with 132 more rows
Compared to previous one, gapminder is showing data only for 142 rows. Pipes %>% play very nicely with dplyr and make our code very easy to understand. The tibble gapminder is being piped into the function filter(). The argument year == 2007 tells filter() that it should find all the rows such that the logical condition year == 2007 is TRUE.
Have we accidently deleted all other rows? Answer is no. Nope: we haven’t made any changes to gapminder at all. If you don’t believe me try entering gapminder at the console. All that this command does is display a subset of gapminder. If we wanted to store the result of running this command, we’d need to assign it to a variable, for example if you are not sure, lets type
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
## 7 Austria Europe 2007 79.8 8199783 36126.
## 8 Bahrain Asia 2007 75.6 708573 29796.
## 9 Bangladesh Asia 2007 64.1 150448339 1391.
## 10 Belgium Europe 2007 79.4 10392226 33693.
## # ... with 132 more rows
One can apply multiple filters
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Pakistan Asia 2007 65.5 169270617 2606.
Now we are selecting multiple countries
## # A tibble: 5 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Bangladesh Asia 2007 64.1 150448339 1391.
## 3 India Asia 2007 64.7 1110396331 2452.
## 4 Iran Asia 2007 71.0 69453570 11606.
## 5 Pakistan Asia 2007 65.5 169270617 2606.
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Sri Lanka Asia 2007 72.4 20378239 3970.
## # A tibble: 72 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 62 more rows
Sort data with arrange Suppose we wanted to sort gapminder data for South Asia by gdpPercap. To do this we can use the arrange command along with the pipe %>% as follows:
## # A tibble: 72 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Nepal Asia 1952 36.2 9182536 546.
## 2 India Asia 1952 37.4 372000000 547.
## 3 India Asia 1957 40.2 409000000 590.
## 4 Nepal Asia 1957 37.7 9682338 598.
## 5 Bangladesh Asia 1972 45.3 70759295 630.
## 6 Afghanistan Asia 1997 41.8 22227415 635.
## 7 Afghanistan Asia 1992 41.7 16317921 649.
## 8 Nepal Asia 1962 39.4 10332057 652.
## 9 India Asia 1962 43.6 454000000 658.
## 10 Bangladesh Asia 1977 46.9 80428306 660.
## # ... with 62 more rows
The logic is very similar to what we saw above for filter. Here, I use another important function arrange. The argument gdpPercap tells arrange() that we want to sort by GDP per capita. Note that by default arrange() sorts in ascending order. If we want to sort in descending order, we use the function desc().
## # A tibble: 72 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Sri Lanka Asia 2007 72.4 20378239 3970.
## 2 Sri Lanka Asia 2002 70.8 19576783 3015.
## 3 Sri Lanka Asia 1997 70.5 18698655 2664.
## 4 Pakistan Asia 2007 65.5 169270617 2606.
## 5 India Asia 2007 64.7 1110396331 2452.
## 6 Sri Lanka Asia 1992 70.4 17587060 2154.
## 7 Pakistan Asia 2002 63.6 153403524 2093.
## 8 Pakistan Asia 1997 61.8 135564834 2049.
## 9 Pakistan Asia 1992 60.8 120065004 1972.
## 10 Sri Lanka Asia 1987 69.0 16495304 1877.
## # ... with 62 more rows
What is the lowest and highest life expectancy among South Asian countries?
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Nepal Asia 2007 63.8 28901790 1091.
## 3 Bangladesh Asia 2007 64.1 150448339 1391.
## 4 India Asia 2007 64.7 1110396331 2452.
## 5 Pakistan Asia 2007 65.5 169270617 2606.
## 6 Sri Lanka Asia 2007 72.4 20378239 3970.
It’s a little hard to read the column pop in gapminder since there are so many digits. Suppose that, instead of raw population, we wanted to display population in millions. This requires us to pop by 1000000, which we can do using the function mutate() from dplyr as follows:
## # A tibble: 72 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8.43 779.
## 2 Afghanistan Asia 1957 30.3 9.24 821.
## 3 Afghanistan Asia 1962 32.0 10.3 853.
## 4 Afghanistan Asia 1967 34.0 11.5 836.
## 5 Afghanistan Asia 1972 36.1 13.1 740.
## 6 Afghanistan Asia 1977 38.4 14.9 786.
## 7 Afghanistan Asia 1982 39.9 12.9 978.
## 8 Afghanistan Asia 1987 40.8 13.9 852.
## 9 Afghanistan Asia 1992 41.7 16.3 649.
## 10 Afghanistan Asia 1997 41.8 22.2 635.
## # ... with 62 more rows
If we want to calculate GDP, we need to multiply gdpPercap by pop. But wait! Didn’t we just change pop so it’s expressed in millions? No: we never stored the results of our previous command, we simply displayed them. Just as I discussed above, unless you overwrite it, the original gapminder dataset will be unchanged. With this in mind, we can create the gdp variable as follows:
## # A tibble: 72 x 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
## # ... with 62 more rows
Another feature of dplyr is summarise data
## # A tibble: 5 x 4
## continent mean min max
## <fct> <dbl> <dbl> <dbl>
## 1 Africa 54.8 39.6 76.4
## 2 Americas 73.6 60.9 80.7
## 3 Asia 70.7 43.8 82.6
## 4 Europe 77.6 71.8 81.8
## 5 Oceania 80.7 80.2 81.2
If one wants to have ordered data with respect to specific column(s), ‘arrange()’ function is used in dplyr. To arrange data by life expectancy, we use arrange() function
## # A tibble: 142 x 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Swaziland 2007 39.6
## 2 Mozambique 2007 42.1
## 3 Zambia 2007 42.4
## 4 Sierra Leone 2007 42.6
## 5 Lesotho 2007 42.6
## 6 Angola 2007 42.7
## 7 Zimbabwe 2007 43.5
## 8 Afghanistan 2007 43.8
## 9 Central African Republic 2007 44.7
## 10 Liberia 2007 45.7
## # ... with 132 more rows
If one wants order from top to bottom, then use arrange(desc()) command as follows:
## # A tibble: 142 x 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Japan 2007 82.6
## 2 Hong Kong, China 2007 82.2
## 3 Iceland 2007 81.8
## 4 Switzerland 2007 81.7
## 5 Australia 2007 81.2
## 6 Spain 2007 80.9
## 7 Sweden 2007 80.9
## 8 Israel 2007 80.7
## 9 France 2007 80.7
## 10 Canada 2007 80.7
## # ... with 132 more rows
If one wants that only top 5 or bottom 5 rows are returned. One may use
## Selecting by lifeExp
## # A tibble: 5 x 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Japan 2007 82.6
## 2 Hong Kong, China 2007 82.2
## 3 Iceland 2007 81.8
## 4 Switzerland 2007 81.7
## 5 Australia 2007 81.2
## Selecting by lifeExp
## # A tibble: 10 x 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Canada 2007 80.7
## 2 France 2007 80.7
## 3 Israel 2007 80.7
## 4 Sweden 2007 80.9
## 5 Spain 2007 80.9
## 6 Australia 2007 81.2
## 7 Switzerland 2007 81.7
## 8 Iceland 2007 81.8
## 9 Hong Kong, China 2007 82.2
## 10 Japan 2007 82.6
## Selecting by gdp
## # A tibble: 10 x 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 United States Americas 2007 78.2 301139947 42952. 1.29e13
## 2 China Asia 2007 73.0 1318683096 4959. 6.54e12
## 3 Japan Asia 2007 82.6 127467972 31656. 4.04e12
## 4 India Asia 2007 64.7 1110396331 2452. 2.72e12
## 5 Germany Europe 2007 79.4 82400996 32170. 2.65e12
## 6 United Kingdom Europe 2007 79.4 60776238 33203. 2.02e12
## 7 France Europe 2007 80.7 61083916 30470. 1.86e12
## 8 Brazil Americas 2007 72.4 190010647 9066. 1.72e12
## 9 Italy Europe 2007 80.5 58147733 28570. 1.66e12
## 10 Mexico Americas 2007 76.2 108700891 11978. 1.30e12
‘r transmute()’ keeps only the derived column. Let’s use it in the example from above:
## Selecting by gdp
## # A tibble: 10 x 1
## gdp
## <dbl>
## 1 1.29e13
## 2 6.54e12
## 3 4.04e12
## 4 2.72e12
## 5 2.65e12
## 6 2.02e12
## 7 1.86e12
## 8 1.72e12
## 9 1.66e12
## 10 1.30e12
## # A tibble: 1 x 1
## avglifeExp
## <dbl>
## 1 59.5
## # A tibble: 1 x 1
## avgLifeExp
## <dbl>
## 1 70.7
## # A tibble: 5 x 2
## continent avglife
## <fct> <dbl>
## 1 Africa 54.8
## 2 Americas 73.6
## 3 Asia 70.7
## 4 Europe 77.6
## 5 Oceania 80.7
## # A tibble: 5 x 3
## continent avgLifeExp over75
## <fct> <dbl> <chr>
## 1 Africa 54.8 N
## 2 Americas 73.6 Y
## 3 Asia 70.7 Y
## 4 Europe 77.6 Y
## 5 Oceania 80.7 Y
## # A tibble: 5 x 2
## continent tot_pop
## <fct> <dbl>
## 1 Africa 929539692
## 2 Americas 898871184
## 3 Asia 3811953827
## 4 Europe 586098529
## 5 Oceania 24549947
In general it is assumed that higher the GDP , higher the lifeExp. To test this assumption, lets calculate percentiles of lifeExp. This will indicate how many countries have ranking lower than the current country.
## # A tibble: 142 x 5
## country year lifeExp gdpPercap percentile
## <fct> <int> <dbl> <dbl> <int>
## 1 Norway 2007 80.2 49357. 88
## 2 Kuwait 2007 77.6 47307. 68
## 3 Singapore 2007 80.0 47143. 87
## 4 United States 2007 78.2 42952. 71
## 5 Ireland 2007 78.9 40676. 79
## 6 Hong Kong, China 2007 82.2 39725. 99
## 7 Switzerland 2007 81.7 37506. 97
## 8 Netherlands 2007 79.8 36798. 85
## 9 Canada 2007 80.7 36319. 91
## 10 Iceland 2007 81.8 36181. 98
## # ... with 132 more rows
One can notice that all countries are well above 60th percentile on lifeExpectancy when arranged by GDP per capita.
Before you conclude, lets see the bottom side
## # A tibble: 142 x 5
## country year lifeExp gdpPercap percentile
## <fct> <int> <dbl> <dbl> <int>
## 1 Congo, Dem. Rep. 2007 46.5 278. 7
## 2 Liberia 2007 45.7 415. 5
## 3 Burundi 2007 49.6 430. 10
## 4 Zimbabwe 2007 43.5 470. 4
## 5 Guinea-Bissau 2007 46.4 579. 6
## 6 Niger 2007 56.9 620. 18
## 7 Eritrea 2007 58.0 641. 19
## 8 Ethiopia 2007 52.9 691. 14
## 9 Central African Republic 2007 44.7 706. 5
## 10 Gambia 2007 59.4 753. 21
## # ... with 132 more rows
So it makes sense that higher the GDP, higher the lifeExp. This is not formal testing but exploratory data makes lot of sense here.
Filtering data as done in introductory analysis seems quite difficult if you are not familiar with these simple things. But if you are working with dplyr for quite sometime, there is not anything very advanced or difficult.
For example, let’s say you have to find out the top 10 countries in the 90th percentile regarding life expectancy in 2007. You can reuse some of the logic from the previous sections, but answering this question alone requires multiple filterings and subsetting:
## # A tibble: 10 x 4
## country continent lifeExp gdpPercap
## <fct> <fct> <dbl> <dbl>
## 1 Japan Asia 82.6 31656.
## 2 Hong Kong, China Asia 82.2 39725.
## 3 Iceland Europe 81.8 36181.
## 4 Switzerland Europe 81.7 37506.
## 5 Australia Oceania 81.2 34435.
## 6 Spain Europe 80.9 28821.
## 7 Sweden Europe 80.9 33860.
## 8 Israel Asia 80.7 25523.
## 9 France Europe 80.7 30470.
## 10 Canada Americas 80.7 36319.
In case you are interested in bottom 10 (worst lifeExp countries from the bottom), use top_n with -10.
## # A tibble: 10 x 4
## country continent lifeExp gdpPercap
## <fct> <fct> <dbl> <dbl>
## 1 Mozambique Africa 42.1 824.
## 2 Swaziland Africa 39.6 4513.
## 3 Sierra Leone Africa 42.6 863.
## 4 Zambia Africa 42.4 1271.
## 5 Angola Africa 42.7 4797.
## 6 Lesotho Africa 42.6 1569.
## 7 Afghanistan Asia 43.8 975.
## 8 Zimbabwe Africa 43.5 470.
## 9 Central African Republic Africa 44.7 706.
## 10 Liberia Africa 45.7 415.
Stat graphics is a mapping of variable to aesthetic attributes of geometric objects.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine