This notebook summarizes DataCamp’s ‘Introduction to the Tidyverse’ course in R.

Loading necessary libraries.

library(gapminder)
library(dplyr)
library(ggplot2)
options(scipen=999,digits=3)

Data Wrangling

The tidyverse is a powerful set of packages for manipulating, visualizing and modeling data. This notebook works throughout with the gapminder dataset, which tracks socio-economic indicators such as life expectancy, population and GDP per capita across countries over time. Each row is uniquely identified by country and year (the primary key).

The filter verb keeps only a subset of the observations, based on a condition. To chain several verbs, connect them with the pipe (%>%), which takes the output on its left and passes it as the first argument to the function on its right.
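
As a quick check of how the pipe behaves, the sketch below uses a small made-up data frame (`toy` is a hypothetical name, not part of the course) to show that `x %>% f(y)` gives the same result as `f(x, y)`.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- data.frame(country = c("A", "B", "C"),
                  year    = c(1952, 2007, 2007))

# The pipe passes `toy` as the first argument of filter()
piped  <- toy %>% filter(year == 2007)
direct <- filter(toy, year == 2007)

# Both calls should return the same two rows
identical(piped, direct)
```

The piped form tends to read better once several verbs are chained, because the steps appear in the order they are applied.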

gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333       779
##  2 Afghanistan Asia       1957    30.3  9240934       821
##  3 Afghanistan Asia       1962    32.0 10267083       853
##  4 Afghanistan Asia       1967    34.0 11537966       836
##  5 Afghanistan Asia       1972    36.1 13079460       740
##  6 Afghanistan Asia       1977    38.4 14880372       786
##  7 Afghanistan Asia       1982    39.9 12881816       978
##  8 Afghanistan Asia       1987    40.8 13867957       852
##  9 Afghanistan Asia       1992    41.7 16317921       649
## 10 Afghanistan Asia       1997    41.8 22227415       635
## # ... with 1,694 more rows
# Filter with one condition
gapminder %>% filter(year==2007)
## # A tibble: 142 x 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923       975
##  2 Albania     Europe     2007    76.4   3600523      5937
##  3 Algeria     Africa     2007    72.3  33333216      6223
##  4 Angola      Africa     2007    42.7  12420476      4797
##  5 Argentina   Americas   2007    75.3  40301927     12779
##  6 Australia   Oceania    2007    81.2  20434176     34435
##  7 Austria     Europe     2007    79.8   8199783     36126
##  8 Bahrain     Asia       2007    75.6    708573     29796
##  9 Bangladesh  Asia       2007    64.1 150448339      1391
## 10 Belgium     Europe     2007    79.4  10392226     33693
## # ... with 132 more rows
# Filter the gapminder dataset for the year 1957
gapminder %>% filter(year==1957)
## # A tibble: 142 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1957    30.3  9240934       821
##  2 Albania     Europe     1957    59.3  1476505      1942
##  3 Algeria     Africa     1957    45.7 10270856      3014
##  4 Angola      Africa     1957    32.0  4561361      3828
##  5 Argentina   Americas   1957    64.4 19610538      6857
##  6 Australia   Oceania    1957    70.3  9712569     10950
##  7 Austria     Europe     1957    67.5  6965860      8843
##  8 Bahrain     Asia       1957    53.8   138655     11636
##  9 Bangladesh  Asia       1957    39.3 51365468       662
## 10 Belgium     Europe     1957    69.2  8989111      9715
## # ... with 132 more rows
gapminder %>% filter(country =="United States")
## # A tibble: 12 x 6
##    country       continent  year lifeExp       pop gdpPercap
##    <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
##  1 United States Americas   1952    68.4 157553000     13990
##  2 United States Americas   1957    69.5 171984000     14847
##  3 United States Americas   1962    70.2 186538000     16173
##  4 United States Americas   1967    70.8 198712000     19530
##  5 United States Americas   1972    71.3 209896000     21806
##  6 United States Americas   1977    73.4 220239000     24073
##  7 United States Americas   1982    74.6 232187835     25010
##  8 United States Americas   1987    75.0 242803533     29884
##  9 United States Americas   1992    76.1 256894189     32004
## 10 United States Americas   1997    76.8 272911760     35767
## 11 United States Americas   2002    77.3 287675526     39097
## 12 United States Americas   2007    78.2 301139947     42952
# Filter with 2 conditions
gapminder %>% filter(year==2007,country=="United States")
## # A tibble: 1 x 6
##   country       continent  year lifeExp       pop gdpPercap
##   <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
## 1 United States Americas   2007    78.2 301139947     42952
# Filter for China in 2002
gapminder %>%
  filter(year==2002,country=="China")
## # A tibble: 1 x 6
##   country continent  year lifeExp        pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
## 1 China   Asia       2002    72.0 1280400000      3119
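
The two-condition filters above keep only rows where both conditions hold. The sketch below, on a made-up data frame (`toy` is hypothetical), checks that comma-separated conditions behave like `&`, and shows `|` for the case where either condition should match.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- data.frame(country = c("A", "A", "B", "B"),
                  year    = c(2002, 2007, 2002, 2007))

# Comma-separated conditions are combined with AND
both_comma <- toy %>% filter(year == 2007, country == "B")
both_and   <- toy %>% filter(year == 2007 & country == "B")

# Use | when either condition should match
either <- toy %>% filter(year == 2007 | country == "B")

nrow(both_comma)  # 1 row: B in 2007
nrow(either)      # 3 rows: A 2007, B 2002, B 2007
```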

The arrange verb sorts the dataset by one or more variables, in ascending or descending order. It is useful when you are interested in extreme values.

gapminder %>% arrange(gdpPercap)
## # A tibble: 1,704 x 6
##    country          continent  year lifeExp      pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Congo, Dem. Rep. Africa     2002    45.0 55379852       241
##  2 Congo, Dem. Rep. Africa     2007    46.5 64606759       278
##  3 Lesotho          Africa     1952    42.1   748747       299
##  4 Guinea-Bissau    Africa     1952    32.5   580653       300
##  5 Congo, Dem. Rep. Africa     1997    42.6 47798986       312
##  6 Eritrea          Africa     1952    35.9  1438760       329
##  7 Myanmar          Asia       1952    36.3 20092996       331
##  8 Lesotho          Africa     1957    45.0   813338       336
##  9 Burundi          Africa     1952    39.0  2445618       339
## 10 Eritrea          Africa     1957    38.0  1542611       344
## # ... with 1,694 more rows
gapminder %>% arrange(desc(gdpPercap))
## # A tibble: 1,704 x 6
##    country   continent  year lifeExp     pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Kuwait    Asia       1957    58.0  212846    113523
##  2 Kuwait    Asia       1972    67.7  841934    109348
##  3 Kuwait    Asia       1952    55.6  160000    108382
##  4 Kuwait    Asia       1962    60.5  358266     95458
##  5 Kuwait    Asia       1967    64.6  575003     80895
##  6 Kuwait    Asia       1977    69.3 1140357     59265
##  7 Norway    Europe     2007    80.2 4627926     49357
##  8 Kuwait    Asia       2007    77.6 2505559     47307
##  9 Singapore Asia       2007    80.0 4553009     47143
## 10 Norway    Europe     2002    79.0 4535591     44684
## # ... with 1,694 more rows
# Sort in ascending order of lifeExp
gapminder %>% arrange(lifeExp)
## # A tibble: 1,704 x 6
##    country      continent  year lifeExp     pop gdpPercap
##    <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Rwanda       Africa     1992    23.6 7290203       737
##  2 Afghanistan  Asia       1952    28.8 8425333       779
##  3 Gambia       Africa     1952    30.0  284320       485
##  4 Angola       Africa     1952    30.0 4232095      3521
##  5 Sierra Leone Africa     1952    30.3 2143249       880
##  6 Afghanistan  Asia       1957    30.3 9240934       821
##  7 Cambodia     Asia       1977    31.2 6978607       525
##  8 Mozambique   Africa     1952    31.3 6446316       469
##  9 Sierra Leone Africa     1957    31.6 2295678      1004
## 10 Burkina Faso Africa     1952    32.0 4469979       543
## # ... with 1,694 more rows
# Sort in descending order of lifeExp
gapminder %>% arrange(desc(lifeExp))
## # A tibble: 1,704 x 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972     31656
##  2 Hong Kong, China Asia       2007    82.2   6980412     39725
##  3 Japan            Asia       2002    82.0 127065841     28605
##  4 Iceland          Europe     2007    81.8    301931     36181
##  5 Switzerland      Europe     2007    81.7   7554661     37506
##  6 Hong Kong, China Asia       2002    81.5   6762476     30209
##  7 Australia        Oceania    2007    81.2  20434176     34435
##  8 Spain            Europe     2007    80.9  40448191     28821
##  9 Sweden           Europe     2007    80.9   9031088     33860
## 10 Israel           Asia       2007    80.7   6426679     25523
## # ... with 1,694 more rows
# Combining two verbs
gapminder %>% filter(year==2007) %>% arrange(desc(gdpPercap))
## # A tibble: 142 x 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Norway           Europe     2007    80.2   4627926     49357
##  2 Kuwait           Asia       2007    77.6   2505559     47307
##  3 Singapore        Asia       2007    80.0   4553009     47143
##  4 United States    Americas   2007    78.2 301139947     42952
##  5 Ireland          Europe     2007    78.9   4109086     40676
##  6 Hong Kong, China Asia       2007    82.2   6980412     39725
##  7 Switzerland      Europe     2007    81.7   7554661     37506
##  8 Netherlands      Europe     2007    79.8  16570613     36798
##  9 Canada           Americas   2007    80.7  33390141     36319
## 10 Iceland          Europe     2007    81.8    301931     36181
## # ... with 132 more rows
# Filter for the year 1957, then arrange in descending order of population
gapminder %>% filter(year==1957) %>% arrange(desc(pop))
## # A tibble: 142 x 6
##    country        continent  year lifeExp       pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>     <int>     <dbl>
##  1 China          Asia       1957    50.5 637408000       576
##  2 India          Asia       1957    40.2 409000000       590
##  3 United States  Americas   1957    69.5 171984000     14847
##  4 Japan          Asia       1957    65.5  91563009      4318
##  5 Indonesia      Asia       1957    39.9  90124000       859
##  6 Germany        Europe     1957    69.1  71019069     10188
##  7 Brazil         Americas   1957    53.3  65551171      2487
##  8 United Kingdom Europe     1957    70.4  51430000     11283
##  9 Bangladesh     Asia       1957    39.3  51365468       662
## 10 Italy          Europe     1957    67.8  49182000      6249
## # ... with 132 more rows
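
All the examples above sort by a single variable. As a sketch of sorting by more than one, the made-up data frame below (`toy` is hypothetical) is ordered by continent first and then by descending life expectancy within each continent.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- data.frame(continent = c("Asia", "Europe", "Asia", "Europe"),
                  country   = c("A", "B", "C", "D"),
                  lifeExp   = c(60, 75, 70, 80))

# Later sort keys only break ties in the earlier ones
sorted <- toy %>% arrange(continent, desc(lifeExp))

sorted$country  # "C" "A" "D" "B"
```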

The mutate verb modifies existing variables or creates new ones.

# Changing existing variables
gapminder %>% mutate(pop=pop/1000000)
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp   pop gdpPercap
##    <fct>       <fct>     <int>   <dbl> <dbl>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8.43       779
##  2 Afghanistan Asia       1957    30.3  9.24       821
##  3 Afghanistan Asia       1962    32.0 10.3        853
##  4 Afghanistan Asia       1967    34.0 11.5        836
##  5 Afghanistan Asia       1972    36.1 13.1        740
##  6 Afghanistan Asia       1977    38.4 14.9        786
##  7 Afghanistan Asia       1982    39.9 12.9        978
##  8 Afghanistan Asia       1987    40.8 13.9        852
##  9 Afghanistan Asia       1992    41.7 16.3        649
## 10 Afghanistan Asia       1997    41.8 22.2        635
## # ... with 1,694 more rows
# Use mutate to change lifeExp to be in months
gapminder %>% mutate(lifeExp = lifeExp * 12)
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952     346  8425333       779
##  2 Afghanistan Asia       1957     364  9240934       821
##  3 Afghanistan Asia       1962     384 10267083       853
##  4 Afghanistan Asia       1967     408 11537966       836
##  5 Afghanistan Asia       1972     433 13079460       740
##  6 Afghanistan Asia       1977     461 14880372       786
##  7 Afghanistan Asia       1982     478 12881816       978
##  8 Afghanistan Asia       1987     490 13867957       852
##  9 Afghanistan Asia       1992     500 16317921       649
## 10 Afghanistan Asia       1997     501 22227415       635
## # ... with 1,694 more rows
# Adding new variables
gapminder %>% mutate(grossgdp = pop * gdpPercap)
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap    grossgdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>       <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333       779  6567086330
##  2 Afghanistan Asia       1957    30.3  9240934       821  7585448670
##  3 Afghanistan Asia       1962    32.0 10267083       853  8758855797
##  4 Afghanistan Asia       1967    34.0 11537966       836  9648014150
##  5 Afghanistan Asia       1972    36.1 13079460       740  9678553274
##  6 Afghanistan Asia       1977    38.4 14880372       786 11697659231
##  7 Afghanistan Asia       1982    39.9 12881816       978 12598563401
##  8 Afghanistan Asia       1987    40.8 13867957       852 11820990309
##  9 Afghanistan Asia       1992    41.7 16317921       649 10595901589
## 10 Afghanistan Asia       1997    41.8 22227415       635 14121995875
## # ... with 1,694 more rows
# Use mutate to create a new column called lifeExpMonths
gapminder %>% mutate(lifeExpMonths = lifeExp * 12 )
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap lifeExpMonths
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>         <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333       779           346
##  2 Afghanistan Asia       1957    30.3  9240934       821           364
##  3 Afghanistan Asia       1962    32.0 10267083       853           384
##  4 Afghanistan Asia       1967    34.0 11537966       836           408
##  5 Afghanistan Asia       1972    36.1 13079460       740           433
##  6 Afghanistan Asia       1977    38.4 14880372       786           461
##  7 Afghanistan Asia       1982    39.9 12881816       978           478
##  8 Afghanistan Asia       1987    40.8 13867957       852           490
##  9 Afghanistan Asia       1992    41.7 16317921       649           500
## 10 Afghanistan Asia       1997    41.8 22227415       635           501
## # ... with 1,694 more rows
# Combining 3 verbs
gapminder %>% mutate(grossgdp = pop * gdpPercap) %>%
  filter(year==2007) %>% 
  arrange(desc(grossgdp))
## # A tibble: 142 x 7
##    country        continent  year lifeExp        pop gdpPercap    grossgdp
##    <fct>          <fct>     <int>   <dbl>      <int>     <dbl>       <dbl>
##  1 United States  Americas   2007    78.2  301139947     42952     1.29e13
##  2 China          Asia       2007    73.0 1318683096      4959     6.54e12
##  3 Japan          Asia       2007    82.6  127467972     31656     4.04e12
##  4 India          Asia       2007    64.7 1110396331      2452     2.72e12
##  5 Germany        Europe     2007    79.4   82400996     32170     2.65e12
##  6 United Kingdom Europe     2007    79.4   60776238     33203     2.02e12
##  7 France         Europe     2007    80.7   61083916     30470     1.86e12
##  8 Brazil         Americas   2007    72.4  190010647      9066     1.72e12
##  9 Italy          Europe     2007    80.5   58147733     28570     1.66e12
## 10 Mexico         Americas   2007    76.2  108700891     11978     1.30e12
## # ... with 132 more rows
# Find the countries with the highest life expectancy, in months, in the year 2007
# Filter, mutate, and arrange the gapminder dataset
gapminder %>% filter(year==2007) %>%
  mutate(lifeExpMonths = lifeExp * 12) %>%
  arrange(desc(lifeExpMonths))
## # A tibble: 142 x 7
##    country          continent  year lifeExp    pop gdpPercap lifeExpMonths
##    <fct>            <fct>     <int>   <dbl>  <int>     <dbl>         <dbl>
##  1 Japan            Asia       2007    82.6 1.27e8     31656           991
##  2 Hong Kong, China Asia       2007    82.2 6.98e6     39725           986
##  3 Iceland          Europe     2007    81.8 3.02e5     36181           981
##  4 Switzerland      Europe     2007    81.7 7.55e6     37506           980
##  5 Australia        Oceania    2007    81.2 2.04e7     34435           975
##  6 Spain            Europe     2007    80.9 4.04e7     28821           971
##  7 Sweden           Europe     2007    80.9 9.03e6     33860           971
##  8 Israel           Asia       2007    80.7 6.43e6     25523           969
##  9 France           Europe     2007    80.7 6.11e7     30470           968
## 10 Canada           Americas   2007    80.7 3.34e7     36319           968
## # ... with 132 more rows
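
None of the mutate calls above change gapminder itself, because the result is printed rather than assigned. The sketch below illustrates this on a made-up data frame (`toy` and `toy_millions` are hypothetical names).

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- data.frame(country = c("A", "B"),
                  pop     = c(1500000, 32000000))

# Without assignment the result is only printed; `toy` is unchanged
toy %>% mutate(pop_millions = pop / 1000000)
names(toy)

# Assign the result to keep the new column
toy_millions <- toy %>% mutate(pop_millions = pop / 1000000)
names(toy_millions)
```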

Visualizing with ggplot2

When working with more than two variables, use color to represent categorical (factor) variables and size to represent numeric ones.

# Is higher GDP per capita associated with higher lifeExp (e.g. through better healthcare)?
ggplot(data=gapminder,aes(x=gdpPercap,y=lifeExp)) + geom_point()

# A log scale helps when most values of a variable are bunched into a small part of its range
ggplot(data=gapminder,aes(x=log(gdpPercap),y=lifeExp)) + geom_point()

# A cleaner alternative: scale_x_log10() log-scales the axis itself, so the tick labels stay in the original units
ggplot(data=gapminder,aes(x=gdpPercap,y=lifeExp)) + geom_point() + 
  scale_x_log10()

# Create gapminder_1952
gapminder_1952 <- filter(.data=gapminder,year==1952)

# Create gapminder_2007
gapminder_2007 <- filter(.data=gapminder,year==2007)

# Population vs GDP in 1952
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point()

# Cleaning it up
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point() + 
  scale_x_log10() + 
  scale_y_log10()

# Create a scatter plot with pop on the x-axis and lifeExp on the y-axis
ggplot(data=gapminder_1952,aes(x=pop,y=lifeExp)) + geom_point()

# To clean it up
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# Working with >2 variables
ggplot(data=gapminder_2007,aes(x=gdpPercap,y=lifeExp,color=continent)) + geom_point() + 
  scale_x_log10() 

ggplot(data=gapminder_2007,aes(x=gdpPercap,y=lifeExp,color=continent,size=pop)) + geom_point() + 
  scale_x_log10() 

# Scatter plot comparing pop and lifeExp, with color representing continent
ggplot(data=gapminder_1952,aes(x=pop,y=lifeExp,color=continent)) +
  geom_point() + 
  scale_x_log10()

# Add the size aesthetic to represent a country's gdpPercap
ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent,
  size=gdpPercap)) +
  geom_point() +
  scale_x_log10()

# Instead of distinguishing the continents within a single plot, faceting draws one small panel per continent (five here)
ggplot(data=gapminder_2007,aes(x=gdpPercap,y=lifeExp)) + geom_point() + 
  scale_x_log10() + 
  facet_wrap(~continent)

# Scatter plot comparing pop and lifeExp, faceted by continent
ggplot(data=gapminder_1952,aes(x=pop,y=lifeExp)) + geom_point() + 
  scale_x_log10() + 
  facet_wrap(~continent)

# Scatter plot comparing gdpPercap and lifeExp, with color representing continent
# and size representing population, faceted by year
ggplot(data=gapminder,aes(x=gdpPercap,y=lifeExp,color=continent,
  size = pop)) + geom_point() + 
  scale_x_log10() + 
  facet_wrap(~year)
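
The difference between log-transforming a variable and log-scaling an axis can be inspected directly. This is a sketch on a made-up data frame (`toy`, `p_log` and `p_scale` are hypothetical names); it relies on ggplot2's `layer_data()` helper, which returns the values a layer actually draws.

```r
library(ggplot2)

# Hypothetical toy data, made up for illustration
toy <- data.frame(gdpPercap = c(100, 1000, 10000),
                  lifeExp   = c(40, 60, 80))

# Option 1: transform the variable; axis labels show natural-log values
p_log <- ggplot(toy, aes(x = log(gdpPercap), y = lifeExp)) + geom_point()

# Option 2: transform the axis; labels stay in the original units
p_scale <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# Positions actually used on the x-axis
layer_data(p_log)$x    # natural logs of 100, 1000, 10000
layer_data(p_scale)$x  # base-10 logs of the same values
```

Combining both in one plot, that is `log()` inside `aes()` together with `scale_x_log10()`, would apply a logarithm twice, so only one of the two should be used.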

Grouping and summarizing

summarize collapses many rows into one by aggregating them. To aggregate within groups rather than over the whole dataset, call group_by first. group_by and summarize are often used together.
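
A minimal sketch of the idea on a made-up data frame (all object names here are hypothetical). The `.groups = "drop"` argument assumes dplyr 1.0.0 or later; it returns an ungrouped result when grouping by more than one variable.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- data.frame(continent = c("Asia", "Asia", "Europe", "Europe"),
                  year      = c(1952, 2007, 1952, 2007),
                  lifeExp   = c(40, 70, 60, 80))

# No grouping: all rows collapse into a single row
overall <- toy %>% summarize(meanLifeExp = mean(lifeExp))

# One row per continent
toy_by_continent <- toy %>%
  group_by(continent) %>%
  summarize(meanLifeExp = mean(lifeExp))

# One row per year/continent pair, returned without grouping
toy_by_both <- toy %>%
  group_by(year, continent) %>%
  summarize(meanLifeExp = mean(lifeExp), .groups = "drop")

overall$meanLifeExp           # 62.5
toy_by_continent$meanLifeExp  # 55 (Asia), 70 (Europe)
nrow(toy_by_both)             # 4
```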

# Finding mean life exp across all years all continents
gapminder %>% summarize(meanLifeExp = mean(lifeExp))
## # A tibble: 1 x 1
##   meanLifeExp
##         <dbl>
## 1        59.5
# Summarize to find the median life expectancy
gapminder %>% summarize(medianLifeExp = median(lifeExp))
## # A tibble: 1 x 1
##   medianLifeExp
##           <dbl>
## 1          60.7
# Avg Life Exp in 2007
gapminder %>% filter(year==2007) %>% summarize(meanLifeExp = mean(lifeExp))
## # A tibble: 1 x 1
##   meanLifeExp
##         <dbl>
## 1        67.0
# Avg life Exp and total pop in 2007
gapminder %>% filter(year==2007) %>% summarize(meanLifeExp = mean(lifeExp),totalPop = sum(as.numeric(pop)))
## # A tibble: 1 x 2
##   meanLifeExp   totalPop
##         <dbl>      <dbl>
## 1        67.0 6251013179
# Filter for 1957 then summarize the median life expectancy
gapminder %>% filter(year==1957) %>% summarize(medianLifeExp = median(lifeExp))
## # A tibble: 1 x 1
##   medianLifeExp
##           <dbl>
## 1          48.4
# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>% filter(year==1957) %>% 
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
## # A tibble: 1 x 2
##   medianLifeExp maxGdpPercap
##           <dbl>        <dbl>
## 1          48.4       113523
# Avg life Exp and total pop in each year
gapminder %>% group_by(year) %>% summarize(meanLifeExp = mean(lifeExp),totalPop =  sum(as.numeric(pop)))
## # A tibble: 12 x 3
##     year meanLifeExp   totalPop
##    <int>       <dbl>      <dbl>
##  1  1952        49.1 2406957150
##  2  1957        51.5 2664404580
##  3  1962        53.6 2899782974
##  4  1967        55.7 3217478384
##  5  1972        57.6 3576977158
##  6  1977        59.6 3930045807
##  7  1982        61.5 4289436840
##  8  1987        63.2 4691477418
##  9  1992        64.2 5110710260
## 10  1997        65.0 5515204472
## 11  2002        65.7 5886977579
## 12  2007        67.0 6251013179
# Avg life Exp and total pop in each continent in 2007
gapminder %>% filter(year==2007) %>%
  group_by(continent) %>% 
    summarize(meanLifeExp = mean(lifeExp),totalPop = sum(as.numeric(pop)))
## # A tibble: 5 x 3
##   continent meanLifeExp   totalPop
##   <fct>           <dbl>      <dbl>
## 1 Africa           54.8  929539692
## 2 Americas         73.6  898871184
## 3 Asia             70.7 3811953827
## 4 Europe           77.6  586098529
## 5 Oceania          80.7   24549947
# Avg life Exp and total pop in each year and continent
gapminder %>% group_by(year,continent) %>% 
  summarize(meanLifeExp = mean(lifeExp),totalPop = sum(as.numeric(pop)))
## # A tibble: 60 x 4
## # Groups:   year [?]
##     year continent meanLifeExp   totalPop
##    <int> <fct>           <dbl>      <dbl>
##  1  1952 Africa           39.1  237640501
##  2  1952 Americas         53.3  345152446
##  3  1952 Asia             46.3 1395357351
##  4  1952 Europe           64.4  418120846
##  5  1952 Oceania          69.3   10686006
##  6  1957 Africa           41.3  264837738
##  7  1957 Americas         56.0  386953916
##  8  1957 Asia             49.3 1562780599
##  9  1957 Europe           66.7  437890351
## 10  1957 Oceania          70.3   11941976
## # ... with 50 more rows
# Find median life expectancy and maximum GDP per capita in each year
gapminder %>% group_by(year) %>% 
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
## # A tibble: 12 x 3
##     year medianLifeExp maxGdpPercap
##    <int>         <dbl>        <dbl>
##  1  1952          45.1       108382
##  2  1957          48.4       113523
##  3  1962          50.9        95458
##  4  1967          53.8        80895
##  5  1972          56.5       109348
##  6  1977          59.7        59265
##  7  1982          62.4        33693
##  8  1987          65.8        31541
##  9  1992          67.7        34933
## 10  1997          69.4        41283
## 11  2002          70.8        44684
## 12  2007          71.9        49357
# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>% filter(year==1957) %>% group_by(continent) %>% 
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
## # A tibble: 5 x 3
##   continent medianLifeExp maxGdpPercap
##   <fct>             <dbl>        <dbl>
## 1 Africa             40.6         5487
## 2 Americas           56.1        14847
## 3 Asia               48.3       113523
## 4 Europe             67.6        17909
## 5 Oceania            70.3        12247
# Find median life expectancy and maximum GDP per capita in each year/continent combination
gapminder %>% group_by(year,continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
## # A tibble: 60 x 4
## # Groups:   year [?]
##     year continent medianLifeExp maxGdpPercap
##    <int> <fct>             <dbl>        <dbl>
##  1  1952 Africa             38.8         4725
##  2  1952 Americas           54.7        13990
##  3  1952 Asia               44.9       108382
##  4  1952 Europe             65.9        14734
##  5  1952 Oceania            69.3        10557
##  6  1957 Africa             40.6         5487
##  7  1957 Americas           56.1        14847
##  8  1957 Asia               48.3       113523
##  9  1957 Europe             67.6        17909
## 10  1957 Oceania            70.3        12247
## # ... with 50 more rows
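
The population totals above wrap pop in as.numeric() before summing. A likely reason (an inference, not something stated in the course notes) is that pop is stored as an integer column and world totals exceed the largest integer R can hold, `.Machine$integer.max`, which is about 2.1 billion. A sketch with made-up values:

```r
# Two made-up integer populations whose total exceeds .Machine$integer.max
pop <- c(2000000000L, 2000000000L)

# Summing integers overflows: R returns NA with a warning
int_total <- suppressWarnings(sum(pop))

# Converting to double first avoids the overflow
dbl_total <- sum(as.numeric(pop))

is.na(int_total)  # TRUE
dbl_total         # 4000000000
```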

Visualizing summarized data. A plot can be misleading when the y-axis does not start at zero; adding expand_limits(y=0) makes the axis include zero.

by_year <- gapminder %>% group_by(year) %>% 
  summarize(meanLifeExp = mean(lifeExp),totalPop = sum(as.numeric(pop)))
by_year_continent <- gapminder %>% group_by(year,continent) %>% 
  summarize(meanLifeExp = mean(lifeExp),totalPop =sum(as.numeric(pop)))

# Visualizing population over time
ggplot(data=by_year, aes(x=year,y=totalPop)) + geom_point()

# Visualizing population over time,starting at zero
ggplot(data=by_year, aes(x=year,y=totalPop)) + geom_point() + expand_limits(y=0)

# Create a scatter plot showing the change in meanLifeExp over time
ggplot(data=by_year,aes(x=year,y=meanLifeExp)) + geom_point() +
  expand_limits(y=0)

# Visualizing population over time,starting at zero, for each continent
ggplot(data=by_year_continent, aes(x=year,y=totalPop,color=continent)) + geom_point() + expand_limits(y=0)

# Summarize medianGdpPercap within each continent within each year: by_year_continent2
by_year_continent2 <- gapminder %>% group_by(continent,year) %>%
  summarize(medianGdpPercap=median(gdpPercap))

# Plot the change in medianGdpPercap in each continent over time
ggplot(data=by_year_continent2,aes(x=year,y=medianGdpPercap,color=continent)) + geom_point() + expand_limits(y=0)

# Summarize the median GDP and median life expectancy per continent in 2007
by_continent_2007 <- gapminder %>% filter(year==2007) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            medianGdpPercap = median(gdpPercap))

# Use a scatter plot to compare the median GDP and median life expectancy
ggplot(data=by_continent_2007,aes(x=medianGdpPercap,y=medianLifeExp,color=continent)) + geom_point()

Types of visualization

Line plots are useful for showing a trend over time. Bar charts are good at showing statistics for different categories. Histograms describe the distribution of a single numeric variable. Boxplots compare the distribution of a numeric variable across different categories.

# Summarize the median gdpPercap by year, then save it as by_year
by_year <- gapminder %>% group_by(year) %>% 
  summarize(medianGdpPercap = median(gdpPercap))

# Create a line plot showing the change in medianGdpPercap over time
ggplot(data=by_year,aes(x=year,y=medianGdpPercap)) + geom_line() +
  expand_limits(y=0)

# Summarize the median gdpPercap by year & continent, save as by_year_continent
by_year_continent <- gapminder %>% group_by(year,continent) %>% 
  summarize(medianGdpPercap=median(gdpPercap))

# Create a line plot showing the change in medianGdpPercap by continent over time
ggplot(data=by_year_continent,aes(x=year,y=medianGdpPercap,color=continent)) + geom_line() + expand_limits(y=0)

Bar plots are useful for exploring data across discrete categories.

by_continent <- gapminder %>% filter(year==2007) %>% group_by(continent) %>% summarize(meanLifeExp = mean(lifeExp))

by_continent
## # A tibble: 5 x 2
##   continent meanLifeExp
##   <fct>           <dbl>
## 1 Africa           54.8
## 2 Americas         73.6
## 3 Asia             70.7
## 4 Europe           77.6
## 5 Oceania          80.7
ggplot(data=by_continent,aes(x=continent,y=meanLifeExp)) + geom_col()

# Summarize the median gdpPercap by continent in 1952
by_continent <- gapminder %>% filter(year==1952) %>% group_by(continent) %>% summarize(medianGdpPercap=median(gdpPercap))

# Create a bar plot showing medianGdp by continent
ggplot(data=by_continent,aes(x=continent,y=medianGdpPercap)) + geom_col()

# Filter for observations in the Oceania continent in 1952 (one row per country, so the median is simply that country's gdpPercap)
oceania_1952 <- gapminder %>% filter(year==1952,continent=="Oceania") %>% group_by(country) %>% summarize(medianGdpPercap=median(gdpPercap))

# Create a bar plot of gdpPercap by country
ggplot(data=oceania_1952,aes(x=country,y=medianGdpPercap)) + geom_col()

A histogram shows the distribution of a single numeric variable. Every bar represents a bin, and the height of the bar is the number of observations that fall into that bin. By default geom_histogram() uses 30 bins, and the bin width strongly affects how the distribution appears. The binwidth= argument of geom_histogram() controls it.
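
To make the idea of a bin concrete, the sketch below counts made-up values into bins of two different widths. It uses base R's `cut()` and `table()` rather than ggplot2's own binning, so the exact bin edges are an assumption and may differ from what geom_histogram() draws.

```r
# Made-up life expectancy values, for illustration only
lifeExp <- c(41, 43, 47, 52, 58, 61, 66, 72, 74, 79)

# The same values counted into bins of width 5 and width 10
counts_5  <- table(cut(lifeExp, breaks = seq(40, 80, by = 5)))
counts_10 <- table(cut(lifeExp, breaks = seq(40, 80, by = 10)))

counts_5   # eight narrow bins: 2 1 1 1 1 1 2 1
counts_10  # four wide bins:    3 2 2 3
```

Narrow bins show more detail but look noisier; wide bins smooth the shape but can hide features.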

ggplot(data=gapminder_2007,aes(x=lifeExp)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# With binwidth = 5
ggplot(data=gapminder_2007,aes(x=lifeExp)) + geom_histogram(binwidth=5)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Create a histogram of population (pop)
ggplot(data=gapminder_1952,aes(x=pop)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# This histogram is hard to read: a few very populous countries stretch the axis, so almost all other countries are clumped into one bin. Putting x on a log scale spreads them out.

# Create a histogram of population (pop), with x on a log scale
ggplot(data=gapminder_1952,aes(x=pop)) + geom_histogram() + 
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To compare distributions like the ones above across continents, we can use boxplots. The line in the middle of each box is the median of that distribution. The top and bottom of the box show the 75th and 25th percentiles. The whiskers extend to the most extreme data points within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn individually as outliers.
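
The pieces of a boxplot can be computed by hand. This sketch uses made-up values and assumes the default geom_boxplot() rule, in which whiskers reach at most 1.5 times the interquartile range beyond the box; all object names are hypothetical.

```r
# Made-up values with one extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Box bottom (25th percentile), median line, box top (75th percentile)
q   <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- q[[3]] - q[[1]]

# Fences at 1.5 * IQR beyond the box
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Whiskers end at the most extreme values inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Anything beyond the fences is drawn as an individual point
outliers <- x[x < lower_fence | x > upper_fence]

whiskers  # 1 9
outliers  # 30
```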

ggplot(data=gapminder_2007,aes(x=continent,y=lifeExp)) + geom_boxplot()

# Create a boxplot comparing gdpPercap among continents
ggplot(data=gapminder_1952,aes(x=continent,y=gdpPercap)) + geom_boxplot() + scale_y_log10()

# Add a title to this graph: "Comparing GDP per capita across continents"
ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() + 
  ggtitle("Comparing GDP per capita across continents")
