Libraries

library(gapminder)
library(tidyverse)

0. Course Description

This is an introduction to the programming language R, focused on a powerful set of tools known as the Tidyverse. You’ll learn the intertwined processes of data manipulation and visualization using the tools dplyr and ggplot2. You’ll learn to manipulate data by filtering, sorting, and summarizing a real dataset of historical country data in order to answer exploratory questions. You’ll then learn to turn this processed data into informative line plots, bar plots, histograms, and more with the ggplot2 package. You’ll get a taste of the value of exploratory data analysis and the power of Tidyverse tools.

1. Data Wrangling

In this chapter, you’ll learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You’ll see how each of these steps allows you to answer questions about your data.

1.1 filter()

The filter verb extracts particular observations based on a condition. In this exercise you’ll filter for observations from a particular year.

gapminder %>% filter(year==2007)
## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # … with 132 more rows
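### Filter for the United States. "UnitedStates" (no space) matches no country, so the result is empty; the name stored in gapminder is "United States".
gapminder %>% filter(country=="UnitedStates")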
## # A tibble: 0 × 6
## # … with 6 variables: country <fct>, continent <fct>, year <int>,
## #   lifeExp <dbl>, pop <int>, gdpPercap <dbl>
### Filter for China in 2002
gapminder %>% filter(year == 2002, country == "China")
## # A tibble: 1 × 6
##   country continent  year lifeExp        pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
## 1 China   Asia       2002    72.0 1280400000     3119.
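
When a filter on a name returns zero rows, the usual cause is a spelling that does not match the stored value exactly. The following is a minimal sketch, assuming the gapminder and dplyr packages are installed; the helper names `united` and `us` are made up for illustration. It looks up the exact spelling among the factor levels and then filters with it.

```r
library(gapminder)
library(dplyr)

# Look up the exact spelling among the country factor levels
united <- grep("United", levels(gapminder$country), value = TRUE)
united

# Filter using the exact name; gapminder has one row per country per survey year
us <- gapminder %>% filter(country == "United States")
nrow(us)
```

Checking `levels()` first is cheaper than guessing, since `==` on a factor compares against the level text character for character.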

1.2 arrange()

You use arrange() to sort observations in ascending or descending order of a particular variable. In this case, you’ll sort the dataset based on the lifeExp variable.

### Sort in ascending order of lifeExp
gapminder %>% arrange(lifeExp)
## # A tibble: 1,704 × 6
##    country      continent  year lifeExp     pop gdpPercap
##    <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Rwanda       Africa     1992    23.6 7290203      737.
##  2 Afghanistan  Asia       1952    28.8 8425333      779.
##  3 Gambia       Africa     1952    30    284320      485.
##  4 Angola       Africa     1952    30.0 4232095     3521.
##  5 Sierra Leone Africa     1952    30.3 2143249      880.
##  6 Afghanistan  Asia       1957    30.3 9240934      821.
##  7 Cambodia     Asia       1977    31.2 6978607      525.
##  8 Mozambique   Africa     1952    31.3 6446316      469.
##  9 Sierra Leone Africa     1957    31.6 2295678     1004.
## 10 Burkina Faso Africa     1952    32.0 4469979      543.
## # … with 1,694 more rows
### Sort in descending order of lifeExp
gapminder %>% arrange(desc(lifeExp))
## # A tibble: 1,704 × 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  3 Japan            Asia       2002    82   127065841    28605.
##  4 Iceland          Europe     2007    81.8    301931    36181.
##  5 Switzerland      Europe     2007    81.7   7554661    37506.
##  6 Hong Kong, China Asia       2002    81.5   6762476    30209.
##  7 Australia        Oceania    2007    81.2  20434176    34435.
##  8 Spain            Europe     2007    80.9  40448191    28821.
##  9 Sweden           Europe     2007    80.9   9031088    33860.
## 10 Israel           Asia       2007    80.7   6426679    25523.
## # … with 1,694 more rows
### Filter for the year 1957, then arrange in descending order of population
gapminder %>% filter(year==1957) %>% arrange(desc(pop))
## # A tibble: 142 × 6
##    country        continent  year lifeExp       pop gdpPercap
##    <fct>          <fct>     <int>   <dbl>     <int>     <dbl>
##  1 China          Asia       1957    50.5 637408000      576.
##  2 India          Asia       1957    40.2 409000000      590.
##  3 United States  Americas   1957    69.5 171984000    14847.
##  4 Japan          Asia       1957    65.5  91563009     4318.
##  5 Indonesia      Asia       1957    39.9  90124000      859.
##  6 Germany        Europe     1957    69.1  71019069    10188.
##  7 Brazil         Americas   1957    53.3  65551171     2487.
##  8 United Kingdom Europe     1957    70.4  51430000    11283.
##  9 Bangladesh     Asia       1957    39.3  51365468      662.
## 10 Italy          Europe     1957    67.8  49182000     6249.
## # … with 132 more rows
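
arrange() also accepts more than one sort key, with later keys breaking ties in earlier ones. The sketch below is an illustration rather than part of the course exercises, and the name `sorted_2007` is hypothetical. It sorts the 2007 observations by continent and then by descending life expectancy within each continent.

```r
library(gapminder)
library(dplyr)

# Sort 2007 observations by continent (factor level order, alphabetical here),
# then by descending life expectancy within each continent
sorted_2007 <- gapminder %>%
  filter(year == 2007) %>%
  arrange(continent, desc(lifeExp))

head(sorted_2007, 3)
```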

1.3 mutate()

Suppose you want life expectancy measured in months instead of years: you’d multiply the existing value by 12. You can use the mutate() verb to change this column in place, or to create a new column calculated this way.

### Use mutate to create a new column called lifeExpMonths.
gapminder %>% mutate(lifeExpMonths=12*lifeExp)
## # A tibble: 1,704 × 7
##    country     continent  year lifeExp      pop gdpPercap lifeExpMonths
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>         <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.          346.
##  2 Afghanistan Asia       1957    30.3  9240934      821.          364.
##  3 Afghanistan Asia       1962    32.0 10267083      853.          384.
##  4 Afghanistan Asia       1967    34.0 11537966      836.          408.
##  5 Afghanistan Asia       1972    36.1 13079460      740.          433.
##  6 Afghanistan Asia       1977    38.4 14880372      786.          461.
##  7 Afghanistan Asia       1982    39.9 12881816      978.          478.
##  8 Afghanistan Asia       1987    40.8 13867957      852.          490.
##  9 Afghanistan Asia       1992    41.7 16317921      649.          500.
## 10 Afghanistan Asia       1997    41.8 22227415      635.          501.
## # … with 1,694 more rows
### Filter, mutate, and arrange the gapminder dataset.
gapminder %>% filter(year==2007) %>% mutate(lifeExpMonths=12*lifeExp) %>% arrange(desc(lifeExpMonths))
## # A tibble: 142 × 7
##    country          continent  year lifeExp       pop gdpPercap lifeExpMonths
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>         <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.          991.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.          986.
##  3 Iceland          Europe     2007    81.8    301931    36181.          981.
##  4 Switzerland      Europe     2007    81.7   7554661    37506.          980.
##  5 Australia        Oceania    2007    81.2  20434176    34435.          975.
##  6 Spain            Europe     2007    80.9  40448191    28821.          971.
##  7 Sweden           Europe     2007    80.9   9031088    33860.          971.
##  8 Israel           Asia       2007    80.7   6426679    25523.          969.
##  9 France           Europe     2007    80.7  61083916    30470.          968.
## 10 Canada           Americas   2007    80.7  33390141    36319.          968.
## # … with 132 more rows
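
The two uses of mutate() described in this section differ only in the name on the left of the `=`. As a small sketch (the object names `in_months` and `with_months` are made up for this example), reusing an existing column name overwrites that column, while a new name appends a column.

```r
library(gapminder)
library(dplyr)

# Overwrite lifeExp in place: the column count stays the same
in_months <- gapminder %>% mutate(lifeExp = 12 * lifeExp)

# Add a new column instead: the original lifeExp is kept alongside it
with_months <- gapminder %>% mutate(lifeExpMonths = 12 * lifeExp)

ncol(in_months)    # same number of columns as gapminder
ncol(with_months)  # one more column than gapminder
```

Neither call modifies gapminder itself; each returns a new tibble, so the result must be assigned to keep it.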

2. Data Visualizations

In this chapter, you’ll learn the essential skills of data visualization using the ggplot2 package, and you’ll see how the dplyr and ggplot2 packages work closely together to create informative graphs.

### Create gapminder_1952, then make a scatter plot comparing pop and gdpPercap.
gapminder_1952 <- gapminder %>%  filter(year == 1952)
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) + geom_point()
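
A ggplot2 call like the one above is built from separate parts: the data, the aesthetic mappings, and one or more geom layers. The sketch below splits the same plot into those steps; the names `base_plot` and `scatter` are chosen for illustration only.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>% filter(year == 1952)

# ggplot() sets the data and the aesthetic mappings; no points are drawn yet
base_plot <- ggplot(gapminder_1952, aes(x = pop, y = gdpPercap))

# Adding a geom layer tells ggplot2 how to draw each observation
scatter <- base_plot + geom_point()

# Printing the object renders the plot
print(scatter)
```

Storing the base plot in a variable makes it easy to try different layers or scales on the same data and mappings.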

2.1 log() scales

Population and GDP per capita both span several orders of magnitude, so on a linear axis a few very large values squeeze most countries into one corner of the plot. Adding scale_x_log10() and scale_y_log10() puts the axes on a log scale, where each equal step represents a tenfold increase.

### Scatter plot comparing pop and gdpPercap, with both axes on a log scale.
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) + geom_point() + scale_x_log10() + scale_y_log10()
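
To see why a log scale suits a variable such as pop, it can help to compare its spread on the raw and log10 scales. This is a rough sketch for illustration; `pop_log_range` is a made-up name.

```r
library(gapminder)
library(dplyr)

gapminder_1952 <- gapminder %>% filter(year == 1952)

# Ratio between the largest and smallest population in 1952
max(gapminder_1952$pop) / min(gapminder_1952$pop)

# On a log10 scale the same spread is only a few units wide
pop_log_range <- range(log10(gapminder_1952$pop))
diff(pop_log_range)
```

Note that scale_x_log10() changes only how the axis is drawn; the values in the data frame are left untouched.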

2.2 Additional aesthetics

### Scatter plot comparing pop and lifeExp, with color representing continent.
ggplot(gapminder_1952, aes(x=pop, y=lifeExp, color=continent)) + geom_point() + scale_x_log10()

### Add the size aesthetic to represent a country's gdpPercap.
ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent, size=gdpPercap)) + geom_point() + scale_x_log10()

2.3 Faceting

### Scatter plot comparing pop and lifeExp, faceted by continent.
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point() + scale_x_log10() + facet_wrap(~continent)

### Scatter plot comparing gdpPercap and lifeExp, with color representing continent and size representing population, faceted by year.
ggplot(gapminder,  aes(x=gdpPercap, y=lifeExp, color = continent, size = pop)) + geom_point() + scale_x_log10() + facet_wrap(~year)

3. Grouping and summarizing

You’ll learn to use the group_by() and summarize() verbs, which collapse large datasets into manageable summaries.

### Find median life expectancy and maximum GDP per capita in each year.
gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))
## # A tibble: 12 × 3
##     year medianLifeExp maxGdpPercap
##    <int>         <dbl>        <dbl>
##  1  1952          45.1      108382.
##  2  1957          48.4      113523.
##  3  1962          50.9       95458.
##  4  1967          53.8       80895.
##  5  1972          56.5      109348.
##  6  1977          59.7       59265.
##  7  1982          62.4       33693.
##  8  1987          65.8       31541.
##  9  1992          67.7       34933.
## 10  1997          69.4       41283.
## 11  2002          70.8       44684.
## 12  2007          71.9       49357.
### Find median life expectancy and maximum GDP per capita in each continent in 1957.
gapminder %>%
  filter(year == 1957) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))
## # A tibble: 5 × 3
##   continent medianLifeExp maxGdpPercap
##   <fct>             <dbl>        <dbl>
## 1 Africa             40.6        5487.
## 2 Americas           56.1       14847.
## 3 Asia               48.3      113523.
## 4 Europe             67.6       17909.
## 5 Oceania            70.3       12247.
### Find median life expectancy and maximum GDP per capita in each continent/year combination
gapminder %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
## # A tibble: 60 × 4
## # Groups:   continent [5]
##    continent  year medianLifeExp maxGdpPercap
##    <fct>     <int>         <dbl>        <dbl>
##  1 Africa     1952          38.8        4725.
##  2 Africa     1957          40.6        5487.
##  3 Africa     1962          42.6        6757.
##  4 Africa     1967          44.7       18773.
##  5 Africa     1972          47.0       21011.
##  6 Africa     1977          49.3       21951.
##  7 Africa     1982          50.8       17364.
##  8 Africa     1987          51.6       11864.
##  9 Africa     1992          52.4       13522.
## 10 Africa     1997          52.8       14723.
## # … with 50 more rows
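
The message printed above appears because summarize() removes only the last grouping variable, so the result is still grouped by continent. One way to handle this, sketched here assuming dplyr 1.0.0 or later (where the `.groups` argument exists), is to ask for an ungrouped result explicitly; the name `by_continent_year` is hypothetical.

```r
library(gapminder)
library(dplyr)

by_continent_year <- gapminder %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap),
            .groups = "drop")  # return an ungrouped tibble, no message

is_grouped_df(by_continent_year)
```

Dropping the groups avoids surprises later, since verbs such as mutate() or arrange() behave differently on grouped data.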

3.1 Visualizing summarized data

### Create a scatter plot showing the change in medianLifeExp over time
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() + expand_limits(y = 0)

### Summarize medianGdpPercap within each continent within each year as by_year_continent, then plot the change in medianGdpPercap in each continent over time

by_year_continent <- gapminder %>%
  group_by(continent, year) %>%
  summarize(medianGdpPercap = median(gdpPercap))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_point() +
  expand_limits(y = 0)

### Summarize the median GDP per capita and median life expectancy per continent in 2007, and use a scatter plot to compare the two.
by_continent_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(medianGdpPercap = median(gdpPercap),
            medianLifeExp = median(lifeExp))

ggplot(by_continent_2007, aes(x = medianGdpPercap, y = medianLifeExp, color = continent)) +
  geom_point()

4. Types of visualizations

You will learn how to create line plots, bar plots, histograms, and boxplots. You’ll see how each plot requires different methods of data manipulation and preparation, and you’ll understand how each of these plot types plays a different role in data analysis.

### Summarize the median gdpPercap by year, then save it as by_year, and create a line plot showing the change in medianGdpPercap over time
by_year <- gapminder %>% group_by(year) %>% summarize(medianGdpPercap = median(gdpPercap))
by_year
## # A tibble: 12 × 2
##     year medianGdpPercap
##    <int>           <dbl>
##  1  1952           1969.
##  2  1957           2173.
##  3  1962           2335.
##  4  1967           2678.
##  5  1972           3339.
##  6  1977           3799.
##  7  1982           4216.
##  8  1987           4280.
##  9  1992           4386.
## 10  1997           4782.
## 11  2002           5320.
## 12  2007           6124.
ggplot(by_year, aes(x=year, y=medianGdpPercap)) + geom_line() + expand_limits(y = 0)

### Summarize the median gdpPercap by year & continent, save as by_year_continent, and create a line plot showing the change in medianGdpPercap by continent over time.
by_year_continent <- gapminder %>% group_by(year, continent) %>% summarize(medianGdpPercap = median(gdpPercap))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
ggplot(by_year_continent, aes(x=year, y=medianGdpPercap, color = continent)) + geom_line() + expand_limits(y = 0)

### Summarize the median gdpPercap by continent in 1952, and create a bar plot showing medianGdpPercap by continent
by_continent <- gapminder %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))

ggplot(by_continent, aes(x = continent, y = medianGdpPercap)) + geom_col()

### Filter for observations in the Oceania continent in 1952, and create a bar plot of gdpPercap by country
oceania_1952 <- gapminder %>% filter(year == 1952, continent == "Oceania")

ggplot(oceania_1952, aes(x = country, y = gdpPercap)) + geom_col()

### Create a histogram of population (pop_by_mil)
gapminder_1952 <- gapminder %>%
  filter(year == 1952) %>%
  mutate(pop_by_mil = pop / 1000000)

ggplot(gapminder_1952, aes(x=pop_by_mil)) + geom_histogram(bins=50)

### Create a histogram of population (pop), with x on a log scale
gapminder_1952 <- gapminder %>%
  filter(year == 1952)

ggplot(gapminder_1952, aes(x=pop)) + geom_histogram() + scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
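
The `stat_bin()` message above is a reminder that the default of 30 bins is arbitrary. A minimal sketch of setting the bin width explicitly follows; the value 0.25 and the name `pop_hist` are illustrative choices, not recommendations from the course.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>% filter(year == 1952)

# With scale_x_log10(), binwidth is measured in log10 units,
# so 0.25 means each bar spans a quarter of an order of magnitude
pop_hist <- ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram(binwidth = 0.25) +
  scale_x_log10()

pop_hist
```

Trying a few bin widths is worthwhile, because the apparent shape of a distribution can change noticeably with the binning.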

### Create a boxplot comparing gdpPercap among continents
gapminder_1952 <- gapminder %>%
  filter(year == 1952)

ggplot(gapminder_1952, aes(x = continent, y= gdpPercap)) + geom_boxplot() + scale_y_log10()

The End.

Thanks DataCamp
