Introduction to the Tidyverse; The Gapminder dataset

  • What is the Tidyverse?
    • A collection of tools in R for transforming and visualizing data
  • What is Gapminder?
    • Gapminder tracks economic and social indicators of countries over time
    • The gapminder package, created by Jenny Bryan, contains the Gapminder dataset
    • The dataset is structured as a data frame
  • What is a package?
    • R packages are tools that aren’t built into the language; they are created separately by programmers
  • What is dplyr?
    • A package created by Hadley Wickham that provides step-by-step tools for transforming data, such as filtering, sorting, and summarizing
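
Because packages are not built into the language, it can help to check whether one is installed before calling library(). The sketch below uses only base R; the helper name missing_packages is made up for illustration and is not part of the course.

```r
# Hypothetical helper: return the names of packages that are NOT installed.
# requireNamespace() returns TRUE when a package can be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R, so nothing should be reported as missing
missing_packages(c("stats", "utils"))
```

Anything this helper returns could then be passed to install.packages().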

Loading packages

library(gapminder)

Loading the gapminder and dplyr packages

  • Instructions
    • Use the library() function to load the dplyr package, just like we’ve loaded the gapminder package for you.
    • Type gapminder, on its own line, to look at the gapminder dataset.
# Install the packages (only needed once per R installation)
install.packages("gapminder")
install.packages("dplyr")
# Load the gapminder package
library(gapminder)
# Load the dplyr package
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Look at the gapminder dataset
# How many observations/rows are in the dataset?
gapminder

The dataset has 1704 rows.

The filter verb

  • What is a verb?
    • Verbs are functions available in the dplyr package
    • Verbs are the atomic steps you use to transform data
  • What is the filter verb?
    • A verb used when you want to look only at a subset of observations based on a particular condition
    • Filtering data is a common first step in data analysis
    • Every time you filter data in these notes, you will use a pipe
    • You can specify multiple conditions; just separate them with commas
  • What is a pipe?
    • The %>% operator
    • “Take whatever is before it and feed it into the next step”
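
As a minimal sketch of what the pipe does, the example below compares a piped call with the equivalent direct call. The toy data frame is invented for illustration and is not taken from gapminder.

```r
library(dplyr)

# Made-up toy data (not real gapminder values)
toy <- data.frame(
  country = c("A", "B", "C", "A"),
  year    = c(1952, 1952, 1957, 1957),
  pop     = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# The pipe feeds `toy` in as the first argument of filter()
piped  <- toy %>% filter(year == 1957)
direct <- filter(toy, year == 1957)
identical(piped, direct)

# Conditions separated by commas must all be true for a row to be kept
both <- toy %>% filter(country == "A", year == 1957)
both
```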

Filtering for one year

  • Instructions
    • Add a filter() line after the pipe (%>%) to extract only the observations from the year 1957. Remember that you use == to compare two values.
# Filter the gapminder dataset for the year 1957
gapminder %>%
  filter(year==1957)

Filtering for one country and one year

  • Instructions
    • Filter the gapminder data to retrieve only the observation from China in the year 2002.
# Filter for China in 2002
gapminder %>%
    filter(country=="China",year==2002)

The arrange verb

  • What is the arrange verb?
    • Sorts the observations in a dataset in ascending or descending order
    • Use it after the pipe operator
    • Within the parentheses, tell it which column you want to arrange by
    • To sort in descending order, use arrange(desc(variable))
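
The difference between the two sort orders is easy to see on a small table. This sketch uses a made-up three-row data frame rather than gapminder itself.

```r
library(dplyr)

# Made-up toy data (not real gapminder values)
toy <- data.frame(
  country = c("A", "B", "C"),
  lifeExp = c(60.5, 45.2, 71.3),
  stringsAsFactors = FALSE
)

# Ascending order is the default
ascending <- toy %>% arrange(lifeExp)

# Wrap the column in desc() for descending order
descending <- toy %>% arrange(desc(lifeExp))

ascending$country   # "B" "A" "C"
descending$country  # "C" "A" "B"
```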

Arranging observations by life expectancy

  • Instructions
    • Sort the gapminder dataset in ascending order of life expectancy (lifeExp).
    • Sort the gapminder dataset in descending order of life expectancy.
# Sort in ascending order of lifeExp
gapminder %>%
    arrange(lifeExp)
# Sort in descending order of lifeExp
gapminder %>%
    arrange(desc(lifeExp))

Filtering and arranging

  • Instructions
    • Use filter() to extract observations from just the year 1957, then use arrange() to sort in descending order of population (pop).
# Filter for the year 1957, then arrange in descending order of population
gapminder %>%
    filter(year==1957) %>%
    arrange(desc(pop))

The mutate verb

  • What is the mutate verb?
    • A verb that changes an existing variable or adds a new one
    • Use it after a pipe operator
    • Inside the parentheses, the name on the left of the = is the column being replaced or created; the expression on the right is what is being calculated
    • To add a new variable, put a name that doesn’t exist yet on the left; otherwise the usage is the same as changing a variable
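
Whether mutate() replaces or adds a column depends only on the name on the left of the =. A small sketch with invented values:

```r
library(dplyr)

# Made-up toy data (not real gapminder values)
toy <- data.frame(
  country = c("A", "B"),
  lifeExp = c(50, 70),
  stringsAsFactors = FALSE
)

# Existing name on the left: the column is overwritten
replaced <- toy %>% mutate(lifeExp = lifeExp * 12)

# New name on the left: a column is appended and the original is kept
added <- toy %>% mutate(lifeExpMonths = lifeExp * 12)

names(replaced)  # "country" "lifeExp"
names(added)     # "country" "lifeExp" "lifeExpMonths"
```

In both cases the original toy data frame is left unchanged unless the result is assigned back to it.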

Using mutate to change or create a column

  • Instructions
    • Use mutate() to change the existing lifeExp column, by multiplying it by 12: 12 * lifeExp.
    • Use mutate() to add a new column, called lifeExpMonths, calculated as 12 * lifeExp.
# Use mutate to change lifeExp to be in months
gapminder %>%
    mutate(lifeExp=lifeExp*12)
# Use mutate to create a new column called lifeExpMonths
gapminder %>%
    mutate(lifeExpMonths=lifeExp*12)

Combining filter, mutate, and arrange

  • Instructions
    • In one sequence of pipes on the gapminder dataset:
    • filter() for observations from the year 2007,
    • mutate() to create a column lifeExpMonths, calculated as 12 * lifeExp, and
    • arrange() in descending order of that new column
# Filter, mutate, and arrange the gapminder dataset
gapminder %>%
    filter(year==2007) %>%
    mutate(lifeExpMonths=lifeExp*12) %>%
    arrange(desc(lifeExpMonths))

Introduction to the Tidyverse; Data Visualization

Visualizing with ggplot2

  • What is ggplot2?
    • A package for creating plots, using the pattern ggplot(data, aes(x, y)) + layer
  • What is an aesthetic?
    • A visual dimension of a graph that can be used to communicate information
  • What is a layer?
    • Layers specify the type of graph that you’re creating

Variable assignment

  • Instructions
    • Load the ggplot2 package after the gapminder and dplyr packages.
    • Filter gapminder for observations from the year 1952, and assign it to a new dataset gapminder_1952 using the assignment operator (<-).
install.packages("ggplot2")
library(ggplot2)
# Create gapminder_1952
gapminder_1952<-gapminder %>%
    filter(year==1952)

Comparing population and GDP per capita

  • Instructions
    • Change the scatter plot of gapminder_1952 so that population (pop) is on the x-axis and GDP per capita (gdpPercap) is on the y-axis.
# Change to put pop on the x-axis and gdpPercap on the y-axis
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point()

Comparing population and life expectancy

  • Instructions
    • Create a scatter plot of gapminder_1952 with population (pop) on the x-axis and life expectancy (lifeExp) on the y-axis.
# Create a scatter plot with pop on the x-axis and lifeExp on the y-axis

ggplot(gapminder_1952,aes(x=pop,y=lifeExp)) + geom_point()

Log scales

  • What is a logarithmic scale?
    • A scale where each fixed distance represents a multiplication of the value
    • Add the layer scale_x_log10() for the x-axis
    • Add the layer scale_y_log10() for the y-axis
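
The "fixed distance equals a multiplication" idea can be checked with plain arithmetic. The sketch below uses base R only and made-up values; it illustrates the log10 transformation itself, not how ggplot2 draws the axis.

```r
# Each value is 10 times the previous one
values <- c(10, 100, 1000, 10000)

# Position of each value along a log10 axis
positions <- log10(values)

# Distance between neighbouring positions: the same step every time,
# even though the raw values grow by 90, 900, and 9000
steps <- diff(positions)
steps
```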

Putting the x-axis on a log scale

  • Instructions
    • Change the existing scatter plot (code provided) to put the x-axis (representing population) on a log scale.
# Change this plot to put the x-axis on a log scale
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point() + scale_x_log10()

Putting the x- and y- axes on a log scale

  • Instructions
    • Create a scatter plot with population (pop) on the x-axis and GDP per capita (gdpPercap) on the y-axis. Put both the x- and y- axes on a log scale.
# Scatter plot comparing pop and gdpPercap, with both axes on a log scale

ggplot(gapminder_1952,aes(pop,gdpPercap)) + geom_point() + scale_x_log10() + scale_y_log10()

Adding color to a scatter plot

  • Instructions
    • Create a scatter plot with population (pop) on the x-axis, life expectancy (lifeExp) on the y-axis, and with continent (continent) represented by the color of the points. Put the x-axis on a log scale.
# Scatter plot comparing pop and lifeExp, with color representing continent
ggplot(gapminder_1952,aes(pop,lifeExp, color=continent)) + geom_point() + scale_x_log10()

  • Instructions
    • Modify the scatter plot so that the size of the points represents each country’s GDP per capita (gdpPercap).
# Add the size aesthetic to represent a country's gdpPercap
ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent, size=gdpPercap)) +
  geom_point() +
  scale_x_log10()

Faceting

  • What is faceting?
    • Another way to explore your data in terms of a categorical variable
    • Add facet_wrap(~ variable); the tilde (~) means “split the plot by”
    • Divides the data into subplots based on the categorical variable

Creating a subgraph for each continent

  • Instructions
    • Create a scatter plot of gapminder_1952 with the x-axis representing population (pop), the y-axis representing life expectancy (lifeExp), and faceted to have one subplot per continent (continent). Put the x-axis on a log scale.
# Scatter plot comparing pop and lifeExp, faceted by continent
ggplot(gapminder_1952,aes(x=pop,y=lifeExp)) + geom_point() + scale_x_log10() + facet_wrap(~continent)

Faceting by year

  • Instructions
    • Create a scatter plot of the gapminder data:
    • Put GDP per capita (gdpPercap) on the x-axis and life expectancy (lifeExp) on the y-axis, with continent (continent) represented by color and population (pop) represented by size.
    • Put the x-axis on a log scale
    • Facet by the year variable
# Scatter plot comparing gdpPercap and lifeExp, with color representing continent
# and size representing population, faceted by year
ggplot(gapminder, aes(x=gdpPercap,y=lifeExp, color=continent, size=pop)) + geom_point() + scale_x_log10() + facet_wrap(~year)


Introduction to the Tidyverse; Grouping and summarizing

The summarize verb

  • What is the summarize verb?
    • collapses the entire table down to one row
    • can create multiple summaries at once, add commas
    • functions used for summarizing: mean, sum, median, min, max
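
A minimal sketch of several summaries computed in one summarize() call, using an invented four-row table so the results are easy to verify by hand:

```r
library(dplyr)

# Made-up toy data (not real gapminder values)
toy <- data.frame(
  lifeExp   = c(40, 50, 60, 70),
  gdpPercap = c(1000, 3000, 2000, 8000)
)

# Separate each summary with a comma; the whole table collapses to one row
result <- toy %>%
  summarize(medianLifeExp = median(lifeExp),
            meanLifeExp   = mean(lifeExp),
            maxGdpPercap  = max(gdpPercap))

result  # one row: medianLifeExp = 55, meanLifeExp = 55, maxGdpPercap = 8000
```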

Summarizing the median life expectancy

  • Instructions
    • Use the median() function within a summarize() to find the median life expectancy. Save it into a column called medianLifeExp.
# Summarize to find the median life expectancy
gapminder %>%
    summarize(medianLifeExp=median(lifeExp))

Summarizing the median life expectancy in 1957

  • Instructions
    • Filter for the year 1957, then use the median() function within a summarize() to calculate the median life expectancy into a column called medianLifeExp.
# Filter for 1957 then summarize the median life expectancy
gapminder %>%
    filter(year==1957) %>%
    summarize(medianLifeExp=median(lifeExp))

Summarizing multiple variables in 1957

  • Instructions
    • Find both the median life expectancy (lifeExp) and the maximum GDP per capita (gdpPercap) in the year 1957, calling them medianLifeExp and maxGdpPercap respectively. You can use the max() function to find the maximum.
# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>%
    filter(year==1957) %>%
    summarize(medianLifeExp=median(lifeExp),maxGdpPercap=max(gdpPercap))

The group_by verb

  • What is the group_by verb?
    • Tells dplyr to summarize within groups rather than summarizing the entire dataset
    • Example: replacing filter(year == 2007) with group_by(year) summarizes every year instead of just one
    • Can group by multiple variables; just add a comma between them
# Find median life expectancy and maximum GDP per capita in each year
gapminder %>%
  group_by(year) %>% 
    summarize(medianLifeExp=median(lifeExp), maxGdpPercap=max(gdpPercap))
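
To make the contrast between filter() and group_by() concrete, here is a small sketch on invented data: filtering first yields a single summary row, while grouping yields one row per year.

```r
library(dplyr)

# Made-up toy data (not real gapminder values)
toy <- data.frame(
  year    = c(2002, 2002, 2007, 2007),
  lifeExp = c(50, 60, 70, 80)
)

# filter() then summarize(): one row, for 2007 only
one_year <- toy %>%
  filter(year == 2007) %>%
  summarize(medianLifeExp = median(lifeExp))

# group_by() then summarize(): one row for each year
each_year <- toy %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp))

one_year$medianLifeExp   # 75
each_year$medianLifeExp  # 55 75
```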

Summarizing by continent

  • Instructions
    • Filter the gapminder data for the year 1957. Then find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each continent, saving them into medianLifeExp and maxGdpPercap, respectively.
# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>%
    filter(year == 1957) %>%
    group_by(continent) %>%
    summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))

Summarizing by continent and year

  • Instructions
    • Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each combination of continent and year, saving them into medianLifeExp and maxGdpPercap, respectively.
# Find median life expectancy and maximum GDP per capita in each continent/year combination
gapminder %>%
    group_by(continent, year) %>%
    summarize(medianLifeExp=median(lifeExp), maxGdpPercap=max(gdpPercap))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
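
The message above appears because, after summarizing by two variables, dplyr keeps the result grouped by the first one. The sketch below, on invented data, shows the `.groups` argument the message refers to: "drop_last" keeps that remaining grouping explicitly, while "drop" returns an ungrouped table. Both silence the message.

```r
library(dplyr)

# Made-up toy data (not real gapminder values)
toy <- data.frame(
  continent = c("X", "X", "X", "Y"),
  year      = c(1952, 1952, 1957, 1952),
  lifeExp   = c(40, 50, 60, 70),
  stringsAsFactors = FALSE
)

# Keep the result grouped by continent (drops only the last grouping variable)
kept <- toy %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp), .groups = "drop_last")

# Return a plain, ungrouped result
dropped <- toy %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp), .groups = "drop")

group_vars(kept)     # "continent"
group_vars(dropped)  # character(0)
```

Which option is preferable depends on whether later steps should keep operating within each continent.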

Visualizing summarized data

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# Create a scatter plot showing the change in medianLifeExp over time
ggplot(by_year,aes(x=year,y=medianLifeExp)) + geom_point() + expand_limits(y=0)

Visualizing median GDP per capita per continent over time

  • Instructions
    • Summarize the gapminder dataset by continent and year, finding the median GDP per capita (gdpPercap) within each and putting it into a column called medianGdpPercap. Use the assignment operator <- to save this summarized data as by_year_continent.
    • Create a scatter plot showing the change in medianGdpPercap by continent over time. Use color to distinguish between continents, and be sure to add expand_limits(y = 0) so that the y-axis starts at zero.
# Summarize medianGdpPercap within each continent within each year: by_year_continent
by_year_continent <- gapminder %>%
    group_by(continent, year) %>%
    summarize(medianGdpPercap = median(gdpPercap))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
# Plot the change in medianGdpPercap in each continent over time
ggplot(by_year_continent,aes(x=year, y=medianGdpPercap, color=continent)) + geom_point() + expand_limits(y=0)

Comparing median life expectancy and median GDP per continent in 2007

  • Instructions
    • Filter the gapminder dataset for the year 2007, then summarize the median GDP per capita and the median life expectancy within each continent, into columns called medianLifeExp and medianGdpPercap. Save this as by_continent_2007.
    • Use the by_continent_2007 data to create a scatterplot comparing these summary statistics for continents in 2007, putting the median GDP per capita on the x-axis to the median life expectancy on the y-axis. Color the scatter plot by continent. You don’t need to add expand_limits(y = 0) for this plot.
# Summarize the median GDP and median life expectancy per continent in 2007
by_continent_2007 <- gapminder %>%
    filter(year == 2007) %>%
    group_by(continent) %>%
    summarize(medianLifeExp = median(lifeExp), medianGdpPercap = median(gdpPercap))

# Use a scatter plot to compare the median GDP and median life expectancy
ggplot(by_continent_2007,aes(x=medianGdpPercap,y=medianLifeExp, color=continent)) + geom_point()


Introduction to the Tidyverse; Types of visualizations

Line plots

  • Types of graphs that can be made using ggplot2
    • scatterplots are useful for comparing 2 variables
    • line plots are useful for showing change over time
      • the connected points make it clear that we care about the upward or downward trend over time.
    • bar plots are good at comparing statistics for each of several categories
    • histograms describe the distribution of a one-dimensional numeric variable
    • box plots compare the distribution of a numeric variable among several categories

Visualizing median GDP per capita over time

  • Instructions
    • Use group_by() and summarize() to find the median GDP per capita within each year, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_year.
    • Use the by_year dataset to create a line plot showing the change in median GDP per capita over time. Be sure to use expand_limits(y = 0) to include 0 on the y-axis.
# Summarize the median gdpPercap by year, then save it as by_year
by_year<-gapminder%>%
    group_by(year) %>%
    summarize(medianGdpPercap=median(gdpPercap))

# Create a line plot showing the change in medianGdpPercap over time
ggplot(data=by_year,aes(x=year,y=medianGdpPercap)) + geom_line() + expand_limits(y=0)

Visualizing median GDP per capita by continent over time

  • Instructions
    • Use group_by() and summarize() to find the median GDP per capita within each year and continent, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_year_continent.
    • Use the by_year_continent dataset to create a line plot showing the change in median GDP per capita over time, with color representing continent. Be sure to use expand_limits(y = 0) to include 0 on the y-axis.
# Summarize the median gdpPercap by year & continent, save as by_year_continent
by_year_continent <- gapminder %>%
    group_by(year, continent) %>%
    summarize(medianGdpPercap = median(gdpPercap))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# Create a line plot showing the change in medianGdpPercap by continent over time
ggplot(data=by_year_continent, aes(x=year,y=medianGdpPercap, color=continent)) + geom_line() + expand_limits(y=0)

Bar plots

  • Bar plots
    • geom_col
    • x= categorical variable
    • y = the variable that determines the height of the bar
    • the y-axis always starts at 0

Visualizing median GDP per capita by continent

  • Instructions
    • Use group_by() and summarize() to find the median GDP per capita within each continent in the year 1952, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_continent.
    • Use the by_continent dataset to create a bar plot showing the median GDP per capita in each continent.
# Summarize the median gdpPercap by continent in 1952
by_continent <- gapminder %>%
    filter(year == 1952) %>%
    group_by(continent) %>%
    summarize(medianGdpPercap = median(gdpPercap))

# Create a bar plot showing medianGdp by continent
ggplot(data=by_continent, aes(x=continent,y=medianGdpPercap)) + geom_col()

Visualizing GDP per capita by country in Oceania

  • Instructions
    • Filter for observations in the Oceania continent in the year 1952. Save this as oceania_1952.
    • Use the oceania_1952 dataset to create a bar plot, with country on the x-axis and gdpPercap on the y-axis.
# Filter for observations in the Oceania continent in 1952
oceania_1952 <- gapminder %>%
    filter(continent == "Oceania", year == 1952)

# Create a bar plot of gdpPercap by country
ggplot(data = oceania_1952, aes(x = country, y = gdpPercap)) + geom_col()

Histograms

  • What is a histogram?
    • shows a distribution
    • every bar represents a bin
    • the height of each bar represents how many observations fall into that bin
    • geom_histogram()
    • only one aesthetic, the x-axis
    • x = the variable whose distribution you are examining
    • width of each bin is chosen automatically; the number of bins can be customized with geom_histogram(bins = n)
    • might have to put the x axis of the histogram on a log scale to make it more understandable
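
To see what "how many observations fall into each bin" means, the sketch below counts values per bin with base R's hist(). The data and bin edges are made up, and this is not ggplot2's own binning algorithm, only the same idea on a small scale.

```r
# Made-up values to bin
x <- c(1, 2, 2, 3, 7, 8, 9, 9, 9)

# Two bins: (0, 5] and (5, 10]; plot = FALSE returns the counts without drawing
binned <- hist(x, breaks = c(0, 5, 10), plot = FALSE)

# The bar heights a histogram would draw for these bins
binned$counts  # 4 5
```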

Visualizing population

  • Instructions
    • Use the gapminder_1952 dataset to create a histogram of country population (pop_by_mil) in the year 1952. Inside the histogram geom, set the number of bins to 50.
gapminder_1952 <- gapminder %>%
  filter(year == 1952) %>%
  mutate(pop_by_mil = pop / 1000000)

# Create a histogram of population (pop_by_mil)
ggplot(data=gapminder_1952, aes(x=pop_by_mil)) + geom_histogram(bins=50)

Visualizing population with x-axis on a log scale

  • Instructions
    • Use the gapminder_1952 dataset (code is provided) to create a histogram of country population (pop) in the year 1952, putting the x-axis on a log scale with scale_x_log10().
gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Create a histogram of population (pop), with x on a log scale
ggplot(data=gapminder_1952,aes(x=pop)) + geom_histogram() + scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Boxplots

  • What are boxplots?
    • compare the distribution of one variable across the categories of another variable
    • geom_boxplot(), with aes(x, y)
    • x=category
    • y= the values that we’re comparing
    • black line in the middle = median of the distribution
    • top = 75th percentile of that group
    • bottom = 25th percentile of that group
    • whiskers extend from the box to cover observations that lie outside the 25th to 75th percentile range but are not extreme enough to be outliers
    • dots that lie above or below the whiskers represent outliers
    • gives more context to the histogram
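
The pieces of a box plot can be computed by hand for a single group. This sketch uses base R and made-up values, and assumes the 1.5 × IQR rule that geom_boxplot() uses by default to decide which points are drawn as outliers.

```r
# Made-up values for one group, with one extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Bottom of the box, middle line, top of the box
q <- quantile(x, c(0.25, 0.5, 0.75))

# Points further than 1.5 * IQR from the box are treated as outliers
iqr         <- q[[3]] - q[[1]]
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr
outliers    <- x[x < lower_fence | x > upper_fence]

# The upper whisker ends at the largest non-outlying observation
upper_whisker <- max(x[x <= upper_fence])

unname(q)      # 3 5 7
outliers       # 100
upper_whisker  # 8
```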

Comparing GDP per capita across continents

  • Instructions
    • Use the gapminder_1952 dataset (code is provided) to create a boxplot comparing GDP per capita (gdpPercap) among continents. Put the y-axis on a log scale with scale_y_log10().
gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Create a boxplot comparing gdpPercap among continents
ggplot(data=gapminder_1952,aes(x=continent, y=gdpPercap)) + geom_boxplot() + scale_y_log10()

Adding a title to your graph

  • Instructions
    • Add a title to the graph: Comparing GDP per capita across continents. Use a search engine, such as Google or Bing, to learn how to do so.
gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Add a title to this graph: "Comparing GDP per capita across continents"
ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Comparing GDP per capita across continents")