What is Tidyverse? + A collection of tools in R for transforming and visualizing data
What is Gapminder? + Gapminder tracks economic and social indicators of countries over time + The gapminder package, created by Jenny Bryan, contains the Gapminder dataset + The dataset is structured as a data frame
What is a package? + R packages are tools that aren’t built into the language; they are written separately by programmers and installed when needed
What is dplyr? + A package created by Hadley Wickham that provides step-by-step tools for transforming data, such as filtering, sorting, and summarizing
Loading packages
# Install the packages first (only needed once)
install.packages("gapminder")
install.packages("dplyr")
# Load the gapminder package
library(gapminder)
# Load the dplyr package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
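Calling install.packages() on every run downloads packages again even when they are already present. A minimal sketch of one way to install only what is missing; the helper name missing_packages is made up for this illustration and is not part of any package:

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("gapminder", "dplyr")
to_install <- missing_packages(needed)

# Install only if something is actually missing
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

Loading with library() is still required in every new R session; only the installation step can be skipped.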
# Look at the gapminder dataset
# How many observations/rows are in the dataset?
gapminder
# Answer: 1704 observations (rows)
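Printing the dataset shows its dimensions in the header, but the counts can also be read off directly. A short sketch using base R functions on the same dataset:

```r
library(gapminder)

# Number of observations (rows) and variables (columns)
nrow(gapminder)   # 1704
ncol(gapminder)   # 6

# Column names of the data frame
names(gapminder)
```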
What is a verb? + Verbs are functions provided by the dplyr package + Verbs are the atomic steps you use to transform data
What is the filter verb? + A verb used when you want to look at only a subset of observations that meet a particular condition + Filtering data is a common first step in data analysis + In these notes, every filter step is applied with a pipe + You can specify multiple conditions by separating them with commas; a row is kept only if it meets all of them
What is a pipe? + %>% + “Take whatever is before it and feed it into the next step”
# Filter the gapminder dataset for the year 1957
gapminder %>%
filter(year==1957)
# Filter for China in 2002
gapminder %>%
filter(country=="China",year==2002)
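To make the pipe and the comma-separated conditions concrete, the sketch below writes the China query three ways and checks that they agree. It assumes only the gapminder and dplyr packages loaded earlier:

```r
library(gapminder)
library(dplyr)

# With the pipe: gapminder is fed in as the first argument of filter()
piped <- gapminder %>%
  filter(country == "China", year == 2002)

# Without the pipe: the same call written out directly
direct <- filter(gapminder, country == "China", year == 2002)

# Comma-separated conditions are combined with "and"
with_and <- gapminder %>%
  filter(country == "China" & year == 2002)

identical(piped, direct)     # TRUE
identical(piped, with_and)   # TRUE
nrow(piped)                  # 1
```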
What is the arrange verb? + Sorts the observations in a dataset in ascending or descending order + Use it after the pipe operator + Within the parentheses, name the column you want to sort by + To sort in descending order, wrap the column in desc(): arrange(desc(variable))
# Sort in ascending order of lifeExp
gapminder %>%
arrange(lifeExp)
# Sort in descending order of lifeExp
gapminder %>%
arrange(desc(lifeExp))
# Filter for the year 1957, then arrange in descending order of population
gapminder %>%
  filter(year == 1957) %>%
  arrange(desc(pop))
What is the mutate verb? + A verb that changes an existing variable or adds a new one + Use it after a pipe operator + Inside the parentheses, what’s on the left of the = is the column being replaced, and what’s on the right is the calculation + To add a new variable instead, put a name on the left that is not already a column; the usage is otherwise the same as changing a variable
# Use mutate to change lifeExp to be in months
gapminder %>%
mutate(lifeExp=lifeExp*12)
# Use mutate to create a new column called lifeExpMonths
gapminder %>%
mutate(lifeExpMonths=lifeExp*12)
# Filter, mutate, and arrange the gapminder dataset
gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpMonths = lifeExp * 12) %>%
  arrange(desc(lifeExpMonths))
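The difference between replacing and adding a column is easier to see on a tiny table. The sketch below uses a three-row toy data frame whose values are made up for illustration (they are not taken from gapminder):

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(
  country = c("A", "B", "C"),
  year    = c(2007, 2007, 2002),
  lifeExp = c(70, 80, 75)
)

# Same name on the left: the existing column is overwritten
replaced <- toy %>% mutate(lifeExp = lifeExp * 12)
names(replaced)   # "country" "year" "lifeExp"

# New name on the left: a column is added and the original is kept
added <- toy %>% mutate(lifeExpMonths = lifeExp * 12)
names(added)      # "country" "year" "lifeExp" "lifeExpMonths"

# Filter, mutate, and arrange chained together
result <- toy %>%
  filter(year == 2007) %>%
  mutate(lifeExpMonths = lifeExp * 12) %>%
  arrange(desc(lifeExpMonths))
result$country         # "B" "A"
result$lifeExpMonths   # 960 840
```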
What is ggplot2? + A package for building plots from data, aesthetics, and layers: ggplot(data, aes(x, y)) + layer
What is an aesthetic? + A visual dimension of a graph that can be used to communicate information
What is a layer? + Layers specify the type of graph that you’re creating
install.packages("ggplot2")
library(ggplot2)
# Create gapminder_1952
gapminder_1952<-gapminder %>%
filter(year==1952)
# Change to put pop on the x-axis and gdpPercap on the y-axis
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
geom_point()
# Create a scatter plot with pop on the x-axis and lifeExp on the y-axis
ggplot(gapminder_1952,aes(x=pop,y=lifeExp)) + geom_point()
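The data, aesthetics, and layer pieces can also be assembled one step at a time, which shows what each part contributes. A minimal sketch; the object names base_plot and scatter are chosen for this example:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Data and aesthetic mappings only: no layer yet, so no points are drawn
base_plot <- ggplot(gapminder_1952, aes(x = pop, y = lifeExp))

# Adding a layer with + tells ggplot2 how to draw the mapped data
scatter <- base_plot + geom_point()

# Printing the object renders the plot
print(scatter)
```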
What is a logarithmic scale? + A scale where each fixed distance represents a multiplication of the value + Add the layer scale_x_log10() to put the x-axis on a log scale + Add the layer scale_y_log10() to put the y-axis on a log scale
# Change this plot to put the x-axis on a log scale
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
geom_point() + scale_x_log10()
# Scatter plot comparing pop and gdpPercap, with both axes on a log scale
ggplot(gapminder_1952,aes(pop,gdpPercap)) + geom_point() + scale_x_log10() + scale_y_log10()
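A quick arithmetic check of what "each fixed distance represents a multiplication" means, using made-up population values that each differ by a factor of 10:

```r
# Illustrative values, each 10 times the previous one
pops <- c(1e5, 1e6, 1e7, 1e8)

# On a linear scale the gaps between them grow rapidly
diff(pops)          # 9e+05 9e+06 9e+07

# On a log10 scale every gap is the same size
log10(pops)         # 5 6 7 8
diff(log10(pops))   # 1 1 1
```

This is why countries with very different populations spread out evenly along the axis once scale_x_log10() is added.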
# Scatter plot comparing pop and lifeExp, with color representing continent
ggplot(gapminder_1952,aes(pop,lifeExp, color=continent)) + geom_point() + scale_x_log10()
# Add the size aesthetic to represent a country's gdpPercap
ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent, size=gdpPercap)) +
geom_point() +
scale_x_log10()
What is faceting? + Another way to explore your data in terms of a categorical variable + Add facet_wrap(~ variable) to split the plot by that variable + It divides the data into subplots, one for each value of the categorical variable
# Scatter plot comparing pop and lifeExp, faceted by continent
ggplot(gapminder_1952,aes(x=pop,y=lifeExp)) + geom_point() + scale_x_log10() + facet_wrap(~continent)
# Scatter plot comparing gdpPercap and lifeExp, with color representing continent
# and size representing population, faceted by year
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year)
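Since faceting draws one subplot per distinct value, counting the distinct values first tells you how many panels to expect. A small sketch; the ncol value is just an illustrative layout choice:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# facet_wrap(~ year) draws one panel per distinct year
n_distinct(gapminder$year)        # 12

# facet_wrap(~ continent) draws one panel per continent
n_distinct(gapminder$continent)   # 5

# The ncol argument controls how the panels are laid out
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year, ncol = 4)
```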
# Summarize to find the median life expectancy
gapminder %>%
summarize(medianLifeExp=median(lifeExp))
# Filter for 1957 then summarize the median life expectancy
gapminder %>%
filter(year==1957) %>%
summarize(medianLifeExp=median(lifeExp))
# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>%
filter(year==1957) %>%
summarize(medianLifeExp=median(lifeExp),maxGdpPercap=max(gdpPercap))
# Find median life expectancy and maximum GDP per capita in each year
gapminder %>%
group_by(year) %>%
summarize(medianLifeExp=median(lifeExp), maxGdpPercap=max(gdpPercap))
# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>%
group_by(continent) %>%
filter(year==1957) %>%
summarize(medianLifeExp=median(lifeExp),maxGdpPercap=max(gdpPercap))
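The pipeline above groups before it filters. For this kind of row-wise condition the order should not change the result, and filtering first means fewer rows need to be grouped. A sketch that checks the two orders against each other:

```r
library(gapminder)
library(dplyr)

# Group first, then filter (the order used above)
group_then_filter <- gapminder %>%
  group_by(continent) %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp))

# Filter first, then group
filter_then_group <- gapminder %>%
  filter(year == 1957) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp))

all.equal(group_then_filter, filter_then_group)   # TRUE
nrow(filter_then_group)                           # 5
```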
# Find median life expectancy and maximum GDP per capita in each continent/year combination
gapminder %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
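The message above appears because the result is still grouped by continent after summarizing over two grouping variables. One way to handle it, sketched here, is to set the .groups argument the message mentions; "drop" returns an ungrouped result:

```r
library(gapminder)
library(dplyr)

by_continent_year <- gapminder %>%
  group_by(continent, year) %>%
  summarize(
    medianLifeExp = median(lifeExp),
    maxGdpPercap  = max(gdpPercap),
    .groups = "drop"   # return an ungrouped result and suppress the message
  )

is_grouped_df(by_continent_year)   # FALSE
nrow(by_continent_year)            # 60 (5 continents x 12 years)
```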
Use the ggplot2 package to visualize summarized data
Save the summarized data as an object (<-)
To get an axis to start at 0, add expand_limits(y = 0) to the plot
by_year <- gapminder %>%
group_by(year) %>%
summarize(medianLifeExp = median(lifeExp),
maxGdpPercap = max(gdpPercap))
# Create a scatter plot showing the change in medianLifeExp over time
ggplot(by_year,aes(x=year,y=medianLifeExp)) + geom_point() + expand_limits(y=0)
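expand_limits(y = 0) is the approach used above. As a sketch of an alternative, the lower limit can also be set on the y scale itself; both objects below are named for this example only:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp))

# Option used above: widen the limits so that y = 0 is included
p_expand <- ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() +
  expand_limits(y = 0)

# Alternative: fix the lower limit at 0 and let ggplot2 pick the upper one (NA)
p_limits <- ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() +
  scale_y_continuous(limits = c(0, NA))
```

Starting the axis at 0 keeps small year-to-year changes from looking larger than they are.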
# Summarize medianGdpPercap within each continent within each year: by_year_continent
by_year_continent <-gapminder %>%
group_by(continent,year) %>%
summarize(medianGdpPercap=median(gdpPercap))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
# Plot the change in medianGdpPercap in each continent over time
ggplot(by_year_continent,aes(x=year, y=medianGdpPercap, color=continent)) + geom_point() + expand_limits(y=0)
# Summarize the median GDP and median life expectancy per continent in 2007
by_continent_2007<-gapminder %>%
group_by(continent) %>%
filter(year==2007) %>%
summarize(medianLifeExp=median(lifeExp),medianGdpPercap=median(gdpPercap))
# Use a scatter plot to compare the median GDP and median life expectancy
ggplot(by_continent_2007,aes(x=medianGdpPercap,y=medianLifeExp, color=continent)) + geom_point()
***
# Summarize the median gdpPercap by year, then save it as by_year
by_year<-gapminder%>%
group_by(year) %>%
summarize(medianGdpPercap=median(gdpPercap))
# Create a line plot showing the change in medianGdpPercap over time
ggplot(data=by_year,aes(x=year,y=medianGdpPercap)) + geom_line() + expand_limits(y=0)
# Summarize the median gdpPercap by year & continent, save as by_year_continent
by_year_continent<-gapminder %>%
group_by(year,continent) %>%
summarize(medianGdpPercap=median(gdpPercap))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# Create a line plot showing the change in medianGdpPercap by continent over time
ggplot(data=by_year_continent, aes(x=year,y=medianGdpPercap, color=continent)) + geom_line() + expand_limits(y=0)
# Summarize the median gdpPercap by continent in 1952
by_continent<-gapminder%>%
group_by(continent)%>%
filter(year==1952) %>%
summarize(medianGdpPercap=median(gdpPercap))
# Create a bar plot showing medianGdp by continent
ggplot(data=by_continent, aes(x=continent,y=medianGdpPercap)) + geom_col()
# Filter for observations in the Oceania continent in 1952
oceania_1952<-gapminder %>%
filter(continent=="Oceania", year==1952)
# Create a bar plot of gdpPercap by country
ggplot(data=oceania_1952,aes(x=country,y=gdpPercap)) +geom_col()
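geom_col() draws bars whose heights come from a column you supply. For comparison, the sketch below contrasts it with geom_bar(), which by default counts the rows in each category itself. The example counts countries per continent in 1952:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# geom_col(): bar height is a value supplied through the y aesthetic
countries_per_continent <- gapminder_1952 %>%
  count(continent)   # one row per continent, count stored in column n
p_col <- ggplot(countries_per_continent, aes(x = continent, y = n)) +
  geom_col()

# geom_bar(): no y aesthetic; ggplot2 counts the rows in each group itself
p_bar <- ggplot(gapminder_1952, aes(x = continent)) +
  geom_bar()
```

Both plots show the same bar heights; the difference is only in who does the counting.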
## Histograms
gapminder_1952 <- gapminder %>%
filter(year == 1952) %>%
mutate(pop_by_mil = pop / 1000000)
# Create a histogram of population (pop_by_mil)
ggplot(data=gapminder_1952, aes(x=pop_by_mil)) + geom_histogram(bins=50)
gapminder_1952 <- gapminder %>%
filter(year == 1952)
# Create a histogram of population (pop), with x on a log scale
ggplot(data=gapminder_1952,aes(x=pop)) + geom_histogram() + scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
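The message above is a reminder that the default of 30 bins is arbitrary. A sketch of the two ways to choose explicitly; the values 20 and 0.5 are illustrative, not recommendations:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Choose the number of bins explicitly
p_bins <- ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram(bins = 20) +
  scale_x_log10()

# Or choose the bin width; with scale_x_log10() the width is in log10 units,
# so 0.5 means each bar spans half an order of magnitude
p_width <- ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_log10()
```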
## Boxplots
gapminder_1952 <- gapminder %>%
filter(year == 1952)
# Create a boxplot comparing gdpPercap among continents
ggplot(data=gapminder_1952,aes(x=continent, y=gdpPercap)) + geom_boxplot() + scale_y_log10()
gapminder_1952 <- gapminder %>%
filter(year == 1952)
# Add a title to this graph: "Comparing GDP per capita across continents"
ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
geom_boxplot() +
scale_y_log10() + ggtitle("Comparing GDP per capita across continents")
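ggtitle() sets only the title. As a sketch of a related option, labs() can set the title and the axis labels in one call; the axis label wording here is an assumption for illustration:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# labs() sets the title and axis labels together
p_labeled <- ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(
    title = "Comparing GDP per capita across continents",
    x = "Continent",
    y = "GDP per capita (log scale)"
  )
```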