Course Description
This is an introduction to the programming language R, focused on a powerful set of tools known as the Tidyverse. You’ll learn the intertwined processes of data manipulation and visualization using the tools dplyr and ggplot2. You’ll learn to manipulate data by filtering, sorting, and summarizing a real dataset of historical country data in order to answer exploratory questions. You’ll then learn to turn this processed data into informative line plots, bar plots, histograms, and more with the ggplot2 package. You’ll get a taste of the value of exploratory data analysis and the power of Tidyverse tools. This is a suitable introduction for those who have no previous experience in R and are interested in performing data analysis.
In this chapter, you’ll learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You’ll see how each of these steps allows you to answer questions about your data.
Before you can work with the gapminder dataset, you’ll need to load two R packages that contain the tools for working with it, then display the gapminder dataset so that you can see what it contains.
To your right, you’ll see two windows inside which you can enter code: the script.R window and the R Console. All of your code to solve each exercise must go inside script.R.
If you hit Submit Answer, your R script is executed and the output is shown in the R Console. DataCamp checks whether your submission is correct and gives you feedback. You can hit Submit Answer as often as you want. If you’re stuck, you can ask for a hint or a solution.
You can use the R Console interactively by simply typing R code and hitting Enter. When you work in the console directly, your code will not be checked for correctness so it is a great way to experiment and explore.
This course introduces a lot of new concepts, so if you ever need a quick refresher, download the Tidyverse For Beginners Cheat Sheet and keep it handy!
Use the library() function to load the dplyr package, just like we’ve loaded the gapminder package for you.
Type gapminder, on its own line, to look at the gapminder dataset.
# Load the gapminder package
library(gapminder)
# Load the dplyr package
library(dplyr)
# Look at the gapminder dataset
gapminder
Now that you’ve loaded the gapminder dataset, you can start examining and understanding it.
We’ve already loaded the gapminder and dplyr packages. Type gapminder in the console to display the object.
How many observations (rows) are in the dataset?
The filter verb extracts particular observations based on a condition. In this exercise you’ll filter for observations from a particular year.
Add a filter() line after the pipe (%>%) to extract only the observations from the year 1957. Remember that you use == to compare two values.
library(gapminder)
library(dplyr)
# Filter the gapminder dataset for the year 1957
gapminder %>%
  filter(year == 1957)
You can also use the filter() verb to set two conditions, which could retrieve a single observation.
Just like in the last exercise, you can do this in two lines of code, starting with gapminder %>% and having the filter() on the second line. Keeping one verb on each line helps keep the code readable. Note that each time, you’ll put the pipe %>% at the end of the first line (like gapminder %>%); putting the pipe at the beginning of the second line will throw an error.
Filter the gapminder data to retrieve only the observation from China in the year 2002.
library(gapminder)
library(dplyr)
# Filter for China in 2002
gapminder %>%
  filter(country == "China", year == 2002)
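The comma inside filter() acts as a logical AND between the two conditions. As a minimal sketch (the names both_comma and both_and are illustrative, not part of the exercise), the following compares the comma form with an explicit & and checks that both return the same single row:

```r
library(gapminder)
library(dplyr)

# Two conditions separated by a comma, as in the exercise
both_comma <- gapminder %>%
  filter(country == "China", year == 2002)

# The same two conditions joined explicitly with the logical AND operator
both_and <- gapminder %>%
  filter(country == "China" & year == 2002)

# Each form should keep exactly one observation, and the results should match
nrow(both_comma)
all.equal(both_comma, both_and)
```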
You use arrange() to sort observations in ascending or descending order of a particular variable. In this case, you’ll sort the dataset based on the lifeExp variable.
Sort the gapminder dataset in ascending order of life expectancy (lifeExp).
Sort the gapminder dataset in descending order of life expectancy.
library(gapminder)
library(dplyr)
# Sort in ascending order of lifeExp
gapminder %>%
  arrange(lifeExp)
# Sort in descending order of lifeExp
gapminder %>%
  arrange(desc(lifeExp))
You’ll often need to use the pipe operator (%>%) to combine multiple dplyr verbs in a row. In this case, you’ll combine a filter() with an arrange() to find the highest population countries in a particular year.
Use filter() to extract observations from just the year 1957, then use arrange() to sort in descending order of population (pop).
library(gapminder)
library(dplyr)
# Filter for the year 1957, then arrange in descending order of population
gapminder %>%
  filter(year == 1957) %>%
  arrange(desc(pop))
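The pipe passes the result on its left as the first argument of the function on its right, so a chain of verbs is equivalent to nested function calls. A rough sketch of that equivalence (piped and nested are illustrative names):

```r
library(gapminder)
library(dplyr)

# Piped version: read top to bottom, one verb per line
piped <- gapminder %>%
  filter(year == 1957) %>%
  arrange(desc(pop))

# Nested version of the same steps: read from the inside out
nested <- arrange(filter(gapminder, year == 1957), desc(pop))

# Both should contain the same rows in the same order
all.equal(piped, nested)

# The first few rows hold the most populous countries in 1957
head(piped, 3)
```

The piped form is usually easier to read because the steps appear in the order they are carried out.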
Suppose we want life expectancy to be measured in months instead of years: you’d have to multiply the existing value by 12. You can use the mutate() verb to change this column, or to create a new column that’s calculated this way.
Use mutate() to change the existing lifeExp column, by multiplying it by 12: 12 * lifeExp.
Use mutate() to add a new column, called lifeExpMonths, calculated as 12 * lifeExp.
library(gapminder)
library(dplyr)
# Use mutate to change lifeExp to be in months
gapminder %>%
  mutate(lifeExp = lifeExp * 12)
# Use mutate to create a new column called lifeExpMonths
gapminder %>%
  mutate(lifeExpMonths = lifeExp * 12)
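Note that neither pipeline above changes gapminder itself: mutate() returns a new table, which is printed and then discarded unless you assign it. A small sketch, assuming you want to keep the result (with_months is a name chosen here for illustration):

```r
library(gapminder)
library(dplyr)

# Assign the mutated table to a new name so it can be reused later
with_months <- gapminder %>%
  mutate(lifeExpMonths = lifeExp * 12)

# The original dataset keeps its columns; the new table has one extra
names(gapminder)
names(with_months)
```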
In this exercise, you’ll combine all three of the verbs you’ve learned in this chapter, to find the countries with the highest life expectancy, in months, in the year 2007.
Perform the following steps in one sequence of pipes on the gapminder dataset:
filter() for observations from the year 2007,
mutate() to create a column lifeExpMonths, calculated as 12 * lifeExp, and
arrange() in descending order of that new column.
library(gapminder)
library(dplyr)
# Filter, mutate, and arrange the gapminder dataset
gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpMonths = 12 * lifeExp) %>%
  arrange(desc(lifeExpMonths))
Often a better way to understand and present data is as a graph. In this chapter, you’ll learn the essential skills of data visualization using the ggplot2 package, and you’ll see how the dplyr and ggplot2 packages work closely together to create informative graphs.
Throughout the exercises in this chapter, you’ll be visualizing a subset of the gapminder data from the year 1952. First, you’ll have to load the ggplot2 package, and create a gapminder_1952 dataset to visualize.
Load the ggplot2 package after the gapminder and dplyr packages.
Filter gapminder for observations from the year 1952, and assign it to a new dataset gapminder_1952 using the assignment operator (<-).
# Load the ggplot2 package as well
library(gapminder)
library(dplyr)
library(ggplot2)
# Create gapminder_1952
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
In the video you learned to create a scatter plot with GDP per capita on the x-axis and life expectancy on the y-axis (the code for that graph has been provided in the exercise code). When you’re exploring data visually, you’ll often need to try different combinations of variables and aesthetics.
Change the scatter plot of gapminder_1952 so that population (pop) is on the x-axis and GDP per capita (gdpPercap) is on the y-axis.
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
# Change to put pop on the x-axis and gdpPercap on the y-axis
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point()
In this exercise, you’ll use ggplot2 to create a scatter plot from scratch, to compare each country’s population with its life expectancy in the year 1952.
Create a scatter plot of gapminder_1952 with population (pop) on the x-axis and life expectancy (lifeExp) on the y-axis.
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
# Create a scatter plot with pop on the x-axis and lifeExp on the y-axis
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point()
You previously created a scatter plot with population on the x-axis and life expectancy on the y-axis. Since population is spread over several orders of magnitude, with some countries having a much higher population than others, it’s a good idea to put the x-axis on a log scale.
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
filter(year == 1952)
# Change this plot to put the x-axis on a log scale
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
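To see why a log scale helps here, you can check how widely population varies in 1952. This is only a quick sketch; pop_range and orders_of_magnitude are names introduced for the illustration:

```r
library(gapminder)
library(dplyr)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Smallest and largest country populations in 1952
pop_range <- range(gapminder_1952$pop)
pop_range

# Number of powers of ten between the smallest and largest population
orders_of_magnitude <- log10(pop_range[2] / pop_range[1])
orders_of_magnitude
```

On a linear axis, most countries would be squeezed against the left edge by the few largest ones; a log axis gives each power of ten the same width.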
Suppose you want to create a scatter plot with population on the x-axis and GDP per capita on the y-axis. Both population and GDP per capita are better represented with log scales, since they vary over many orders of magnitude.
Create a scatter plot with population (pop) on the x-axis and GDP per capita (gdpPercap) on the y-axis. Put both the x- and y-axes on a log scale.
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
# Scatter plot comparing pop and gdpPercap, with both axes on a log scale
ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
In this lesson you learned how to use the color aesthetic, which can be used to show which continent each point in a scatter plot represents.
Create a scatter plot with population (pop) on the x-axis, life expectancy (lifeExp) on the y-axis, and with continent (continent) represented by the color of the points. Put the x-axis on a log scale.
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
# Scatter plot comparing pop and lifeExp, with color representing continent
ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()
In the last exercise, you created a scatter plot communicating information about each country’s population, life expectancy, and continent. Now you’ll use the size of the points to communicate even more.
Modify the scatter plot so that the size of the points represents each country’s GDP per capita (gdpPercap).
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
# Add the size aesthetic to represent a country's gdpPercap
ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent, size = gdpPercap)) +
  geom_point() +
  scale_x_log10()
You’ve learned to use faceting to divide a graph into subplots based on one of its variables, such as the continent.
Create a scatter plot of gapminder_1952 with the x-axis representing population (pop), the y-axis representing life expectancy (lifeExp), and faceted to have one subplot per continent (continent). Put the x-axis on a log scale.
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
# Scatter plot comparing pop and lifeExp, faceted by continent
ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent)
All of the graphs in this chapter have been visualizing statistics within one year. Now that you’re able to use faceting, however, you can create a graph showing all the country-level data from 1952 to 2007, to understand how global statistics have changed over time.
Create a scatter plot of the gapminder data:
Put GDP per capita (gdpPercap) on the x-axis and life expectancy (lifeExp) on the y-axis, with continent (continent) represented by color and population (pop) represented by size.
Put the x-axis on a log scale.
Facet by the year variable.
library(gapminder)
library(dplyr)
library(ggplot2)
# Scatter plot comparing gdpPercap and lifeExp, with color representing continent
# and size representing population, faceted by year
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year)
So far you’ve been answering questions about individual country-year pairs, but you may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you’ll learn to use the group by and summarize verbs, which collapse large datasets into manageable summaries.
You’ve seen how to find the mean life expectancy and the total population across a set of observations, but mean() and sum() are only two of the functions R provides for summarizing a collection of numbers. Here, you’ll learn to use the median() function in combination with summarize().
By the way, dplyr displays some messages when it’s loaded that we’ve been hiding so far. They’ll show up in red and start with:
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
This will occur in future exercises each time you load dplyr: it’s mentioning some built-in functions that are masked by dplyr functions of the same name. You won’t need to worry about this message within this course.
Use the median() function within a summarize() to find the median life expectancy. Save it into a column called medianLifeExp.
library(gapminder)
library(dplyr)
# Summarize to find the median life expectancy
gapminder %>%
  summarize(medianLifeExp = median(lifeExp))
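summarize() collapses the whole table into a single row. As a quick check (a sketch, with overall and direct as illustrative names), the value it returns matches calling median() on the column directly:

```r
library(gapminder)
library(dplyr)

# One-row summary table produced by summarize()
overall <- gapminder %>%
  summarize(medianLifeExp = median(lifeExp))

# The same statistic computed on the raw column
direct <- median(gapminder$lifeExp)

# The summary has one row, and its value equals the direct calculation
nrow(overall)
overall$medianLifeExp == direct
```

The advantage of summarize() over the direct call is that it stays inside the pipeline, so it can follow a filter() or work within groups.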
Rather than summarizing the entire dataset, you may want to find the median life expectancy for only one particular year. In this case, you’ll find the median in the year 1957.
Filter for the year 1957, then use the median() function within a summarize() to calculate the median life expectancy into a column called medianLifeExp.
library(gapminder)
library(dplyr)
# Filter for 1957 then summarize the median life expectancy
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp))
The summarize() verb allows you to summarize multiple variables at once. In this case, you’ll use the median() function to find the median life expectancy and the max() function to find the maximum GDP per capita.
Find both the median life expectancy (lifeExp) and the maximum GDP per capita (gdpPercap) in the year 1957, calling them medianLifeExp and maxGdpPercap respectively. You can use the max() function to find the maximum.
library(gapminder)
library(dplyr)
# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
In a previous exercise, you found the median life expectancy and the maximum GDP per capita in the year 1957. Now, you’ll perform those two summaries within each year in the dataset, using the group_by verb.
Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each year, saving them into medianLifeExp and maxGdpPercap, respectively.
library(gapminder)
library(dplyr)
# Find median life expectancy and maximum GDP per capita in each year
gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
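After group_by(year), summarize() returns one row per year rather than one row overall. A minimal sketch to confirm this (by_year_check is a hypothetical name used only here):

```r
library(gapminder)
library(dplyr)

# Same grouped summary as in the exercise, saved so it can be inspected
by_year_check <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# The number of summary rows should equal the number of distinct years
nrow(by_year_check)
n_distinct(gapminder$year)
```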
You can group by any variable in your dataset to create a summary. Rather than comparing across time, you might be interested in comparing among continents. You’ll want to do that within one year of the dataset: let’s use 1957.
Filter the gapminder data for the year 1957. Then find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each continent, saving them into medianLifeExp and maxGdpPercap, respectively.
library(gapminder)
library(dplyr)
# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>%
  filter(year == 1957) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
Instead of grouping just by year, or just by continent, you’ll now group by both continent and year to summarize within each.
Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each combination of continent and year, saving them into medianLifeExp and maxGdpPercap, respectively.
library(gapminder)
library(dplyr)
# Find median life expectancy and maximum GDP per capita in each continent/year combination
gapminder %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
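When you group by more than one variable, summarize() removes only the last grouping level, so the result above is still grouped by continent, and recent versions of dplyr print a message saying so. The sketch below assumes dplyr 1.0.0 or later, where summarize() accepts a .groups argument; by_continent_year is an illustrative name:

```r
library(gapminder)
library(dplyr)

by_continent_year <- gapminder %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap),
            .groups = "drop")   # return an ungrouped table

# One row for every continent/year combination
nrow(by_continent_year)

# FALSE: the result is no longer a grouped table
is_grouped_df(by_continent_year)
```

Dropping the groups is optional for the plots in this course, but it avoids surprises if you apply further verbs to the summary.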
In the last chapter, you summarized the gapminder data to calculate the median life expectancy within each year. This code is provided for you, and is saved (with <-) as the by_year dataset.
Now you can use the ggplot2 package to turn this into a visualization of changing life expectancy over time.
Use the by_year dataset to create a scatter plot showing the change of median life expectancy over time, with year on the x-axis and medianLifeExp on the y-axis. Be sure to add expand_limits(y = 0) to make sure the plot’s y-axis includes zero.
library(gapminder)
library(dplyr)
library(ggplot2)
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
# Create a scatter plot showing the change in medianLifeExp over time
ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() +
  expand_limits(y = 0)
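By default ggplot2 fits the y-axis to the range of the data, which here does not come close to zero; that is why expand_limits(y = 0) is needed. A short sketch that checks the range (it rebuilds by_year so it runs on its own):

```r
library(gapminder)
library(dplyr)

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# Smallest and largest median life expectancy across the years
range(by_year$medianLifeExp)
```

Without the expanded limits, the axis would start near the smallest value, which can make the change over time look larger than it is.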
In the last exercise you were able to see how the median life expectancy of countries changed over time. Now you’ll examine the median GDP per capita instead, and see how the trend differs among continents.
Summarize the gapminder dataset by continent and year, finding the median GDP per capita (gdpPercap) within each and putting it into a column called medianGdpPercap. Use the assignment operator <- to save this summarized data as by_year_continent.
Create a scatter plot showing the change in medianGdpPercap by continent over time. Use color to distinguish between continents, and be sure to add expand_limits(y = 0) so that the y-axis starts at zero.
library(gapminder)
library(dplyr)
library(ggplot2)
# Summarize medianGdpPercap within each continent within each year: by_year_continent
by_year_continent <- gapminder %>%
  group_by(continent, year) %>%
  summarize(medianGdpPercap = median(gdpPercap))
# Plot the change in medianGdpPercap in each continent over time
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_point() +
  expand_limits(y = 0)
In these exercises you’ve generally created plots that show change over time. But as another way of exploring your data visually, you can also use ggplot2 to plot summarized data to compare continents within a single year.
Filter the gapminder dataset for the year 2007, then summarize the median GDP per capita and the median life expectancy within each continent, into columns called medianLifeExp and medianGdpPercap. Save this as by_continent_2007.
Use the by_continent_2007 data to create a scatter plot comparing these summary statistics for continents in 2007, putting the median GDP per capita on the x-axis and the median life expectancy on the y-axis. Color the scatter plot by continent. You don’t need to add expand_limits(y = 0) for this plot.
library(gapminder)
library(dplyr)
library(ggplot2)
# Summarize the median GDP per capita and median life expectancy per continent in 2007
by_continent_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(medianGdpPercap = median(gdpPercap),
            medianLifeExp = median(lifeExp))
# Use a scatter plot to compare the median GDP per capita and median life expectancy
ggplot(by_continent_2007, aes(x = medianGdpPercap, y = medianLifeExp, color = continent)) +
  geom_point()
In this chapter, you’ll learn how to create line plots, bar plots, histograms, and boxplots. You’ll see how each plot requires different methods of data manipulation and preparation, and you’ll understand how each of these plot types plays a different role in data analysis.
A line plot is useful for visualizing trends over time. In this exercise, you’ll examine how the median GDP per capita has changed over time.
Use group_by() and summarize() to find the median GDP per capita within each year, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_year.
Use the by_year dataset to create a line plot showing the change in median GDP per capita over time. Be sure to use expand_limits(y = 0) to include 0 on the y-axis.
library(gapminder)
library(dplyr)
library(ggplot2)
# Summarize the median gdpPercap by year, then save it as by_year
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianGdpPercap = median(gdpPercap))
# Create a line plot showing the change in medianGdpPercap over time
ggplot(by_year, aes(x = year, y = medianGdpPercap)) +
  geom_line() +
  expand_limits(y = 0)
In the last exercise you used a line plot to visualize the increase in median GDP per capita over time. Now you’ll examine the change within each continent.
Use group_by() and summarize() to find the median GDP per capita within each year and continent, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_year_continent.
Use the by_year_continent dataset to create a line plot showing the change in median GDP per capita over time, with color representing continent. Be sure to use expand_limits(y = 0) to include 0 on the y-axis.
library(gapminder)
library(dplyr)
library(ggplot2)
# Summarize the median gdpPercap by year & continent, save as by_year_continent
by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))
# Create a line plot showing the change in medianGdpPercap by continent over time
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_line() +
  expand_limits(y = 0)
A bar plot is useful for visualizing summary statistics, such as the median GDP per capita in each continent.
Use group_by() and summarize() to find the median GDP per capita within each continent in the year 1952, calling the output column medianGdpPercap. Use the assignment operator <- to save it to a dataset called by_continent.
Use the by_continent dataset to create a bar plot showing the median GDP per capita in each continent.
library(gapminder)
library(dplyr)
library(ggplot2)
# Summarize the median gdpPercap by continent in 1952
by_continent <- gapminder %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))
# Create a bar plot showing medianGdpPercap by continent
ggplot(by_continent, aes(x = continent, y = medianGdpPercap)) +
  geom_col()
You’ve created a plot where each bar represents one continent, showing the median GDP per capita for each. But the x-axis of the bar plot doesn’t have to be the continent: you can instead create a bar plot where each bar represents a country.
In this exercise, you’ll create a bar plot comparing the GDP per capita between the two countries in the Oceania continent (Australia and New Zealand).
Filter for observations in the Oceania continent in the year 1952. Save this as oceania_1952.
Use the oceania_1952 dataset to create a bar plot, with country on the x-axis and gdpPercap on the y-axis.
library(gapminder)
library(dplyr)
library(ggplot2)
# Filter for observations in the Oceania continent in 1952
oceania_1952 <- gapminder %>%
  filter(continent == "Oceania", year == 1952)
# Create a bar plot of gdpPercap by country
ggplot(oceania_1952, aes(x = country, y = gdpPercap)) +
  geom_col()
A histogram is useful for examining the distribution of a numeric variable. In this exercise, you’ll create a histogram showing the distribution of country populations (in millions) in the year 1952.
Code for generating this dataset, gapminder_1952, is provided.
Use the gapminder_1952 dataset to create a histogram of country population (pop_by_mil) in the year 1952. Inside the histogram geom, set the number of bins to 50.
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952) %>%
  mutate(pop_by_mil = pop / 1000000)
# Create a histogram of population (pop_by_mil)
ggplot(gapminder_1952, aes(x = pop_by_mil)) +
  geom_histogram(bins = 50)
In the last exercise you created a histogram of populations across countries. You might have noticed that there were several countries with a much higher population than others, which causes the distribution to be very skewed, with most of the distribution crammed into a small part of the graph. (Consider that it’s hard to tell the median or the minimum population from that histogram.)
To make the histogram more informative, you can try putting the x-axis on a log scale.
Use the gapminder_1952 dataset (code is provided) to create a histogram of country population (pop) in the year 1952, putting the x-axis on a log scale with scale_x_log10().
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
# Create a histogram of population (pop), with x on a log scale
ggplot(gapminder_1952, aes(x = pop)) +
  geom_histogram() +
  scale_x_log10()
A boxplot is useful for comparing a distribution of values across several groups. In this exercise, you’ll examine the distribution of GDP per capita by continent. Since GDP per capita varies across several orders of magnitude, you’ll need to put the y-axis on a log scale.
Use the gapminder_1952 dataset (code is provided) to create a boxplot comparing GDP per capita (gdpPercap) among continents. Put the y-axis on a log scale with scale_y_log10().
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
# Create a boxplot comparing gdpPercap among continents
ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10()
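The box in each boxplot spans the first to third quartile, with a line at the median. As a rough sketch of the numbers behind the plot (quartiles_1952 and the column names q1, med and q3 are chosen here for illustration), you can compute those quartiles with the group_by() and summarize() verbs from earlier:

```r
library(gapminder)
library(dplyr)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

# Quartiles of GDP per capita within each continent
quartiles_1952 <- gapminder_1952 %>%
  group_by(continent) %>%
  summarize(q1 = quantile(gdpPercap, 0.25),
            med = median(gdpPercap),
            q3 = quantile(gdpPercap, 0.75))

quartiles_1952
```

Because the plot applies the log scale before computing its statistics, the drawn box edges can differ slightly from these values.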
There are many other options for customizing a ggplot2 graph, which you can learn about in other DataCamp courses. You can also learn about them from online resources, which is an important skill to develop.
As the final exercise in this course, you’ll practice looking up ggplot2 instructions by completing a task we haven’t shown you how to do.
library(gapminder)
library(dplyr)
library(ggplot2)
gapminder_1952 <- gapminder %>%
  filter(year == 1952)
# Add a title to this graph: "Comparing GDP per capita across continents"
ggplot(gapminder_1952, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  scale_y_log10() +
  ggtitle("Comparing GDP per capita across continents")
Congratulations on completing this Introduction to R via the Tidyverse. You’ve been introduced to the principles of transforming and visualizing data with R, and in the process learned some real insights from the Gapminder dataset. This course forms a great foundation for other DataCamp courses where you can continue learning how to use these powerful tools to explore data. You can take courses about ggplot2 to learn to create more informative and customized data visualizations. You can learn much more about using dplyr to transform your data, such as how to join multiple tables together. To analyze other data that you’re interested in, you can take the course on importing and cleaning datasets. And you can practice your data wrangling and visualization skills in my own course “Exploratory Data Analysis with R”, which offers a case study of analyzing United Nations voting over time. These are just a few of the many resources you have to continue learning about data science and R.
I hope you had fun in this course, and continue to enjoy your data science journey!