Uploaded data and required R packages

Understanding data sets requires many hours/days or in some cases weeks.There are many commercially available software but open source community based software have now dominated and R is one of these. R makes data understanding process as easy as possible through the dplyr package. It is one of the easiest solution for code-based data analysis. We will learn in this training how to do it. I have already uploaded videos for downloading packages and relevant information. video links….

I have discussed the Gapminder dataset in my videos and we shall use it throughout this training. It’s available through CRAN, so make sure to install it. Here’s how to load in all required packages:

## # A tibble: 12 x 6
##    country continent  year lifeExp     pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Sweden  Europe     1952    71.9 7124673     8528.
##  2 Sweden  Europe     1957    72.5 7363802     9912.
##  3 Sweden  Europe     1962    73.4 7561588    12329.
##  4 Sweden  Europe     1967    74.2 7867931    15258.
##  5 Sweden  Europe     1972    74.7 8122293    17832.
##  6 Sweden  Europe     1977    75.4 8251648    18856.
##  7 Sweden  Europe     1982    76.4 8325260    20667.
##  8 Sweden  Europe     1987    77.2 8421403    23587.
##  9 Sweden  Europe     1992    78.2 8718867    23880.
## 10 Sweden  Europe     1997    79.4 8897619    25267.
## 11 Sweden  Europe     2002    80.0 8954175    29342.
## 12 Sweden  Europe     2007    80.9 9031088    33860.

First few rows of data are displayed here and there 1704 observations in total.
In this training, some very basic but important things you will learn are

Components

View, glimpse, structure
head, tail
Column Selection
Data Filtering
Data Ordering
Creating Derived Columns
Calculating Summary Statistics
Grouping

Information in gapminder data

View command opens data in new worksheet while glimpse lists nature of variables (numeric/character/factor…) and total number of rows and columns.

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, ~
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan: 12	Africa :624	Min. :1952	Min. :23.60	Min. :6.001e+04	Min. : 241.2
Albania : 12	Americas:300	1st Qu.:1966	1st Qu.:48.20	1st Qu.:2.794e+06	1st Qu.: 1202.1
Algeria : 12	Asia :396	Median :1980	Median :60.71	Median :7.024e+06	Median : 3531.8
Angola : 12	Europe :360	Mean :1980	Mean :59.47	Mean :2.960e+07	Mean : 7215.3
Argentina : 12	Oceania : 24	3rd Qu.:1993	3rd Qu.:70.85	3rd Qu.:1.959e+07	3rd Qu.: 9325.5
Australia : 12	NA	Max. :2007	Max. :82.60	Max. :1.319e+09	Max. :113523.1
(Other) :1632	NA	NA	NA	NA	NA

Column Selection

More often than not, you don’t need all columns of a data set for your analysis. For example PDHS files have more than 5000 columns in some files and maybe 40 or 50 or even fewer than that are needed for your analysis. Select() function of R’s dplyr is used to select columns of your interest.
Three selected columns are selected as follows. You can give new name to this data.

## # A tibble: 1,704 x 3
##    country          pop lifeExp
##    <fct>          <int>   <dbl>
##  1 Afghanistan  8425333    28.8
##  2 Afghanistan  9240934    30.3
##  3 Afghanistan 10267083    32.0
##  4 Afghanistan 11537966    34.0
##  5 Afghanistan 13079460    36.1
##  6 Afghanistan 14880372    38.4
##  7 Afghanistan 12881816    39.9
##  8 Afghanistan 13867957    40.8
##  9 Afghanistan 16317921    41.7
## 10 Afghanistan 22227415    41.8
## # ... with 1,694 more rows

In case you want to select most of the variables and drop one or two, you may proceed as follows

## # A tibble: 1,704 x 5
##    country     continent  year lifeExp      pop
##    <fct>       <fct>     <int>   <dbl>    <int>
##  1 Afghanistan Asia       1952    28.8  8425333
##  2 Afghanistan Asia       1957    30.3  9240934
##  3 Afghanistan Asia       1962    32.0 10267083
##  4 Afghanistan Asia       1967    34.0 11537966
##  5 Afghanistan Asia       1972    36.1 13079460
##  6 Afghanistan Asia       1977    38.4 14880372
##  7 Afghanistan Asia       1982    39.9 12881816
##  8 Afghanistan Asia       1987    40.8 13867957
##  9 Afghanistan Asia       1992    41.7 16317921
## 10 Afghanistan Asia       1997    41.8 22227415
## # ... with 1,694 more rows

So gdpPerCapita column is now not shown above.

Data Filtering

Filtering is another very important task one has to do in one’s analysis. Sometimes, one has to select sale related to a particular city or agent or quarter. Here is how one uses filter() command for data with a condition. We are using here commanad only to select data for year 2007 for all the countries.

dplyr features

Now I shall mention some of the powerful but very simple to use features of dplyr. First of all I am going to explain filter variable of dplyr. filter is used only to select rows for a given condition. I am going to select data only for year 2007.

## # A tibble: 142 x 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # ... with 132 more rows

Compared to previous one, gapminder is showing data only for 142 rows. Pipes %>% play very nicely with dplyr and make our code very easy to understand. The tibble gapminder is being piped into the function filter(). The argument year == 2007 tells filter() that it should find all the rows such that the logical condition year == 2007 is TRUE.

Have we accidently deleted all other rows? Answer is no. Nope: we haven’t made any changes to gapminder at all. If you don’t believe me try entering gapminder at the console. All that this command does is display a subset of gapminder. If we wanted to store the result of running this command, we’d need to assign it to a variable, for example if you are not sure, lets type

## # A tibble: 142 x 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # ... with 132 more rows

One can apply multiple filters

## # A tibble: 1 x 6
##   country  continent  year lifeExp       pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Pakistan Asia       2007    65.5 169270617     2606.

Now we are selecting multiple countries

## # A tibble: 5 x 6
##   country     continent  year lifeExp        pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>      <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8   31889923      975.
## 2 Bangladesh  Asia       2007    64.1  150448339     1391.
## 3 India       Asia       2007    64.7 1110396331     2452.
## 4 Iran        Asia       2007    71.0   69453570    11606.
## 5 Pakistan    Asia       2007    65.5  169270617     2606.

Filtering with respect to two variables

## # A tibble: 1 x 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Sri Lanka Asia       2007    72.4 20378239     3970.

Filtering data for South Asia countries

## # A tibble: 72 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 62 more rows

Sort data with arrange

Sort data with arrange Suppose we wanted to sort gapminder data for South Asia by gdpPercap. To do this we can use the arrange command along with the pipe %>% as follows:

## # A tibble: 72 x 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Nepal       Asia       1952    36.2   9182536      546.
##  2 India       Asia       1952    37.4 372000000      547.
##  3 India       Asia       1957    40.2 409000000      590.
##  4 Nepal       Asia       1957    37.7   9682338      598.
##  5 Bangladesh  Asia       1972    45.3  70759295      630.
##  6 Afghanistan Asia       1997    41.8  22227415      635.
##  7 Afghanistan Asia       1992    41.7  16317921      649.
##  8 Nepal       Asia       1962    39.4  10332057      652.
##  9 India       Asia       1962    43.6 454000000      658.
## 10 Bangladesh  Asia       1977    46.9  80428306      660.
## # ... with 62 more rows

The logic is very similar to what we saw above for filter. Here, I use another important function arrange. The argument gdpPercap tells arrange() that we want to sort by GDP per capita. Note that by default arrange() sorts in ascending order. If we want to sort in descending order, we use the function desc().

## # A tibble: 72 x 6
##    country   continent  year lifeExp        pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Sri Lanka Asia       2007    72.4   20378239     3970.
##  2 Sri Lanka Asia       2002    70.8   19576783     3015.
##  3 Sri Lanka Asia       1997    70.5   18698655     2664.
##  4 Pakistan  Asia       2007    65.5  169270617     2606.
##  5 India     Asia       2007    64.7 1110396331     2452.
##  6 Sri Lanka Asia       1992    70.4   17587060     2154.
##  7 Pakistan  Asia       2002    63.6  153403524     2093.
##  8 Pakistan  Asia       1997    61.8  135564834     2049.
##  9 Pakistan  Asia       1992    60.8  120065004     1972.
## 10 Sri Lanka Asia       1987    69.0   16495304     1877.
## # ... with 62 more rows

Life Expectancy in South Asia in 2007

What is the lowest and highest life expectancy among South Asian countries?

## # A tibble: 6 x 6
##   country     continent  year lifeExp        pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>      <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8   31889923      975.
## 2 Nepal       Asia       2007    63.8   28901790     1091.
## 3 Bangladesh  Asia       2007    64.1  150448339     1391.
## 4 India       Asia       2007    64.7 1110396331     2452.
## 5 Pakistan    Asia       2007    65.5  169270617     2606.
## 6 Sri Lanka   Asia       2007    72.4   20378239     3970.

Change an existing variable or create a new one with mutate

It’s a little hard to read the column pop in gapminder since there are so many digits. Suppose that, instead of raw population, we wanted to display population in millions. This requires us to pop by 1000000, which we can do using the function mutate() from dplyr as follows:

## # A tibble: 72 x 6
##    country     continent  year lifeExp   pop gdpPercap
##    <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8.43      779.
##  2 Afghanistan Asia       1957    30.3  9.24      821.
##  3 Afghanistan Asia       1962    32.0 10.3       853.
##  4 Afghanistan Asia       1967    34.0 11.5       836.
##  5 Afghanistan Asia       1972    36.1 13.1       740.
##  6 Afghanistan Asia       1977    38.4 14.9       786.
##  7 Afghanistan Asia       1982    39.9 12.9       978.
##  8 Afghanistan Asia       1987    40.8 13.9       852.
##  9 Afghanistan Asia       1992    41.7 16.3       649.
## 10 Afghanistan Asia       1997    41.8 22.2       635.
## # ... with 62 more rows

If we want to calculate GDP, we need to multiply gdpPercap by pop. But wait! Didn’t we just change pop so it’s expressed in millions? No: we never stored the results of our previous command, we simply displayed them. Just as I discussed above, unless you overwrite it, the original gapminder dataset will be unchanged. With this in mind, we can create the gdp variable as follows:

## # A tibble: 72 x 7
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # ... with 62 more rows

Summarising data

Another feature of dplyr is summarise data

## # A tibble: 5 x 4
##   continent  mean   min   max
##   <fct>     <dbl> <dbl> <dbl>
## 1 Africa     54.8  39.6  76.4
## 2 Americas   73.6  60.9  80.7
## 3 Asia       70.7  43.8  82.6
## 4 Europe     77.6  71.8  81.8
## 5 Oceania    80.7  80.2  81.2

Ordering

If one wants to have ordered data with respect to specific column(s), ‘arrange()’ function is used in dplyr. To arrange data by life expectancy, we use arrange() function

## # A tibble: 142 x 3
##    country                   year lifeExp
##    <fct>                    <int>   <dbl>
##  1 Swaziland                 2007    39.6
##  2 Mozambique                2007    42.1
##  3 Zambia                    2007    42.4
##  4 Sierra Leone              2007    42.6
##  5 Lesotho                   2007    42.6
##  6 Angola                    2007    42.7
##  7 Zimbabwe                  2007    43.5
##  8 Afghanistan               2007    43.8
##  9 Central African Republic  2007    44.7
## 10 Liberia                   2007    45.7
## # ... with 132 more rows

If one wants order from top to bottom, then use arrange(desc()) command as follows:

## # A tibble: 142 x 3
##    country           year lifeExp
##    <fct>            <int>   <dbl>
##  1 Japan             2007    82.6
##  2 Hong Kong, China  2007    82.2
##  3 Iceland           2007    81.8
##  4 Switzerland       2007    81.7
##  5 Australia         2007    81.2
##  6 Spain             2007    80.9
##  7 Sweden            2007    80.9
##  8 Israel            2007    80.7
##  9 France            2007    80.7
## 10 Canada            2007    80.7
## # ... with 132 more rows

If one wants that only top 5 or bottom 5 rows are returned. One may use

## Selecting by lifeExp

## # A tibble: 5 x 3
##   country           year lifeExp
##   <fct>            <int>   <dbl>
## 1 Japan             2007    82.6
## 2 Hong Kong, China  2007    82.2
## 3 Iceland           2007    81.8
## 4 Switzerland       2007    81.7
## 5 Australia         2007    81.2

## Selecting by lifeExp

## # A tibble: 10 x 3
##    country           year lifeExp
##    <fct>            <int>   <dbl>
##  1 Canada            2007    80.7
##  2 France            2007    80.7
##  3 Israel            2007    80.7
##  4 Sweden            2007    80.9
##  5 Spain             2007    80.9
##  6 Australia         2007    81.2
##  7 Switzerland       2007    81.7
##  8 Iceland           2007    81.8
##  9 Hong Kong, China  2007    82.2
## 10 Japan             2007    82.6

How to calculate new variables

## Selecting by gdp

## # A tibble: 10 x 7
##    country        continent  year lifeExp        pop gdpPercap     gdp
##    <fct>          <fct>     <int>   <dbl>      <int>     <dbl>   <dbl>
##  1 United States  Americas   2007    78.2  301139947    42952. 1.29e13
##  2 China          Asia       2007    73.0 1318683096     4959. 6.54e12
##  3 Japan          Asia       2007    82.6  127467972    31656. 4.04e12
##  4 India          Asia       2007    64.7 1110396331     2452. 2.72e12
##  5 Germany        Europe     2007    79.4   82400996    32170. 2.65e12
##  6 United Kingdom Europe     2007    79.4   60776238    33203. 2.02e12
##  7 France         Europe     2007    80.7   61083916    30470. 1.86e12
##  8 Brazil         Americas   2007    72.4  190010647     9066. 1.72e12
##  9 Italy          Europe     2007    80.5   58147733    28570. 1.66e12
## 10 Mexico         Americas   2007    76.2  108700891    11978. 1.30e12

‘r transmute()’ keeps only the derived column. Let’s use it in the example from above:

## Selecting by gdp

## # A tibble: 10 x 1
##        gdp
##      <dbl>
##  1 1.29e13
##  2 6.54e12
##  3 4.04e12
##  4 2.72e12
##  5 2.65e12
##  6 2.02e12
##  7 1.86e12
##  8 1.72e12
##  9 1.66e12
## 10 1.30e12

Summarising data

## # A tibble: 1 x 1
##   avglifeExp
##        <dbl>
## 1       59.5

Summarising data by groups

## # A tibble: 1 x 1
##   avgLifeExp
##        <dbl>
## 1       70.7

## # A tibble: 5 x 2
##   continent avglife
##   <fct>       <dbl>
## 1 Africa       54.8
## 2 Americas     73.6
## 3 Asia         70.7
## 4 Europe       77.6
## 5 Oceania      80.7

if_else command alongwith mutate

## # A tibble: 5 x 3
##   continent avgLifeExp over75
##   <fct>          <dbl> <chr> 
## 1 Africa          54.8 N     
## 2 Americas        73.6 Y     
## 3 Asia            70.7 Y     
## 4 Europe          77.6 Y     
## 5 Oceania         80.7 Y

Total Population by Continets in 2007

## # A tibble: 5 x 2
##   continent    tot_pop
##   <fct>          <dbl>
## 1 Africa     929539692
## 2 Americas   898871184
## 3 Asia      3811953827
## 4 Europe     586098529
## 5 Oceania     24549947

Percentiles

In general it is assumed that higher the GDP , higher the lifeExp. To test this assumption, lets calculate percentiles of lifeExp. This will indicate how many countries have ranking lower than the current country.

## # A tibble: 142 x 5
##    country           year lifeExp gdpPercap percentile
##    <fct>            <int>   <dbl>     <dbl>      <int>
##  1 Norway            2007    80.2    49357.         88
##  2 Kuwait            2007    77.6    47307.         68
##  3 Singapore         2007    80.0    47143.         87
##  4 United States     2007    78.2    42952.         71
##  5 Ireland           2007    78.9    40676.         79
##  6 Hong Kong, China  2007    82.2    39725.         99
##  7 Switzerland       2007    81.7    37506.         97
##  8 Netherlands       2007    79.8    36798.         85
##  9 Canada            2007    80.7    36319.         91
## 10 Iceland           2007    81.8    36181.         98
## # ... with 132 more rows

One can notice that all countries are well above 60th percentile on lifeExpectancy when arranged by GDP per capita.

Before you conclude, lets see the bottom side

## # A tibble: 142 x 5
##    country                   year lifeExp gdpPercap percentile
##    <fct>                    <int>   <dbl>     <dbl>      <int>
##  1 Congo, Dem. Rep.          2007    46.5      278.          7
##  2 Liberia                   2007    45.7      415.          5
##  3 Burundi                   2007    49.6      430.         10
##  4 Zimbabwe                  2007    43.5      470.          4
##  5 Guinea-Bissau             2007    46.4      579.          6
##  6 Niger                     2007    56.9      620.         18
##  7 Eritrea                   2007    58.0      641.         19
##  8 Ethiopia                  2007    52.9      691.         14
##  9 Central African Republic  2007    44.7      706.          5
## 10 Gambia                    2007    59.4      753.         21
## # ... with 132 more rows

So it makes sense that higher the GDP, higher the lifeExp. This is not formal testing but exploratory data makes lot of sense here.

Advanced Analysis

Filtering data as done in introductory analysis seems quite difficult if you are not familiar with these simple things. But if you are working with dplyr for quite sometime, there is not anything very advanced or difficult.

For example, let’s say you have to find out the top 10 countries in the 90th percentile regarding life expectancy in 2007. You can reuse some of the logic from the previous sections, but answering this question alone requires multiple filterings and subsetting:

## # A tibble: 10 x 4
##    country          continent lifeExp gdpPercap
##    <fct>            <fct>       <dbl>     <dbl>
##  1 Japan            Asia         82.6    31656.
##  2 Hong Kong, China Asia         82.2    39725.
##  3 Iceland          Europe       81.8    36181.
##  4 Switzerland      Europe       81.7    37506.
##  5 Australia        Oceania      81.2    34435.
##  6 Spain            Europe       80.9    28821.
##  7 Sweden           Europe       80.9    33860.
##  8 Israel           Asia         80.7    25523.
##  9 France           Europe       80.7    30470.
## 10 Canada           Americas     80.7    36319.

In case you are interested in bottom 10 (worst lifeExp countries from the bottom), use top_n with -10.

## # A tibble: 10 x 4
##    country                  continent lifeExp gdpPercap
##    <fct>                    <fct>       <dbl>     <dbl>
##  1 Mozambique               Africa       42.1      824.
##  2 Swaziland                Africa       39.6     4513.
##  3 Sierra Leone             Africa       42.6      863.
##  4 Zambia                   Africa       42.4     1271.
##  5 Angola                   Africa       42.7     4797.
##  6 Lesotho                  Africa       42.6     1569.
##  7 Afghanistan              Asia         43.8      975.
##  8 Zimbabwe                 Africa       43.5      470.
##  9 Central African Republic Africa       44.7      706.
## 10 Liberia                  Africa       45.7      415.

ggplot2 like dplyr is part of tidyverse package

Stat graphics is a mapping of variable to aesthetic attributes of geometric objects.

3 Essential components

data: dataset containing the variables of interest
geom: geometric object in question line, point, bars
aes: _aes_thetic attributes of an object x/y position, colors, shape, size

Scatter plot

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

Comparison of 2007 vs 1952 continentwise

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

Learn Basics of Data Analysis using R

Knowledge and Research Support Services

Zahid Asghar

2021-11-05

Why R

Uploaded data and required R packages

Components

Information in gapminder data

Column Selection

Data Filtering

dplyr features

Filtering with respect to two variables

Filtering data for South Asia countries

Sort data with arrange

Life Expectancy in South Asia in 2007

Change an existing variable or create a new one with mutate

Summarising data

Ordering

How to calculate new variables

Summarising data

Summarising data by groups

if_else command alongwith mutate

Total Population by Continets in 2007

Percentiles

Advanced Analysis

ggplot2 like dplyr is part of tidyverse package

3 Essential components

Scatter plot

Comparison of 2007 vs 1952 continentwise