A line graph is a type of graph that uses a line to play “connect the dots” with the data points. It’s not required, but the x-axis on most line graphs represents time in some way.
We’ll start with an example for a single name picked at random (like “Jacob”):
# We'll create a line graph for the popularity of the name Jacob for boys
babynames |>
filter(name == "Jacob" & sex == "M") |>
# Piping the data set into ggplot() means we don't need a data = ... argument
ggplot(
mapping = aes(
x = year,
y = prop
)
) +
# Adding a line
geom_line(
linewidth = 1
) +
# Changing the y-axis to a percentage
scale_y_continuous(
labels = scales::label_percent()
)
It’s pretty typical for line graphs to have more than one line. Let’s recreate the line graph we saw on the first day of class, comparing the relative popularity for the names Karen and Terry.
The code below will create a data set with just female babies named Karen and male babies named Terry from 1946 to 2017, then calculate the popularity of each name relative to its maximum popularity: n/max(n) for each name.
k_t <-
babynames |>
# Getting Karens and Terrys from 1946 to 2017
filter(
name == "Karen" & sex == "F" | name == "Terry" & sex == "M",
year >= 1946
) |>
# Calculating rel_prop - the relative popularity of the name
mutate(
.by = name,
rel_prop = n/max(n)
) |>
# Keeping only the relevant columns
dplyr::select(year, name, sex, rel_prop)
k_t
## # A tibble: 144 × 4
## year name sex rel_prop
## <dbl> <chr> <chr> <dbl>
## 1 1946 Karen F 0.484
## 2 1946 Terry M 0.684
## 3 1947 Karen F 0.533
## 4 1947 Terry M 0.817
## 5 1948 Karen F 0.542
## 6 1948 Terry M 0.778
## 7 1949 Karen F 0.554
## 8 1949 Terry M 0.734
## 9 1950 Karen F 0.595
## 10 1950 Terry M 0.733
## # ℹ 134 more rows
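To see what the mutate() step is doing, here is a minimal sketch on a made-up data set. The toy tibble and its counts are invented for illustration and are not taken from babynames. Within each name, n is divided by that name's largest n, so each name's peak year ends up with a rel_prop of 1.

```r
library(dplyr)

# Hypothetical counts for two made-up names (not real babynames data)
toy <- tibble(
  year = c(2000, 2001, 2002, 2000, 2001, 2002),
  name = c("A", "A", "A", "B", "B", "B"),
  n    = c(50, 100, 25, 10, 5, 20)
)

toy_rel <-
  toy |>
  # Because of .by = name, max(n) is found separately for each name
  mutate(
    .by = name,
    rel_prop = n / max(n)
  )

toy_rel
```

Name A peaks in 2001 and name B peaks in 2002, so those two rows are the ones where rel_prop equals 1.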
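The k_t data set has 4 columns:
year: the year the babies were born
name: the name of the babies (Karen or Terry)
sex: the sex of the babies (F or M)
rel_prop: the number of babies given the name that year, divided by the largest yearly count for that name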
We want to make a graph to display the relative popularity for babies named Karen and Terry. Let’s start by creating a blank graph with x = year and y = rel_prop and changing the y-axis to be percentages. Save the graph as gg_kt.
gg_kt <-
ggplot(
data = k_t,
mapping = aes(
x = year,
y = rel_prop
)
) +
scale_y_continuous(
labels = scales::label_percent()
)
gg_kt
Now that we have a blank graph, how do we add a line or lines? Can we just use geom_smooth()?
Not quite. geom_smooth() fits a smooth trend line across the graph. We want a geom that will connect the left-most dot with the next left-most dot, and so on. So which geom should we use?
It’s not much of a surprise that we should use geom_line()! Add it, with linewidth = 1, to the gg_kt graph that was created in the previous code chunk:
gg_kt +
geom_line(
linewidth = 1
)
Uh, that’s not quite what we wanted. So what happened?
The way geom_line() works is it will connect the dots from left to right. If there are 2 dots with the same x-value, it will draw a vertical line in the graph! Is that what is going on in our graph?
Instead of geom_line(), add geom_point() to our graph:
gg_kt +
geom_point()
We get a better look at the dots being connected now! For each unique year, there are two points in the data: one for Karen’s popularity and another for Terry’s. In the code chunk above, color the dots by name to paint a clearer picture.
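One way to confirm this without a graph is to count the rows for each year. The sketch below uses a small made-up data set (hypothetical values standing in for k_t). A count of 2 for every year means geom_line() will find two dots stacked at each x-value.

```r
library(dplyr)

# Hypothetical stand-in for k_t: two names, two years, made-up values
toy <- tibble(
  year     = c(1946, 1946, 1947, 1947),
  name     = c("Karen", "Terry", "Karen", "Terry"),
  rel_prop = c(0.5, 0.7, 0.55, 0.8)
)

# count() adds a column n with the number of rows for each year
rows_per_year <- toy |> count(year)

rows_per_year
```

Running count(year) on k_t itself should show the same pattern, since it holds one Karen row and one Terry row per year.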
So how do we fix it?
From what we saw when we were looking at making bar charts, ggplot() will form groups in the data whenever an aesthetic is mapped to a categorical variable. So how does that help us here?
geom_line() will only play connect-the-dots with points in the same group. So if we have ggplot() form groups in the data, it will draw multiple lines using one geom_line() function. We just need to map an appropriate aesthetic to the column(s) that forms the groups!
Some choices are:
color: the most popular aesthetic to use
linetype: changes how the line is drawn - solid, dashed, dotted, etc.
group: won’t actually change how the line is drawn, but will draw a separate line for each group.
Let’s map color to name and see what happens!
gg_kt2 <-
gg_kt +
geom_line(
mapping = aes(color = name),
linewidth = 1
)
gg_kt2
Try changing color to linetype and see what changes!
Once you get it working and have each line represented by a color, save the result as gg_kt2.
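For comparison, here is a minimal sketch of the group aesthetic from the list of choices, using a small made-up data frame (the values are hypothetical, not from babynames). It draws one line per name, but both lines look identical and no legend is produced.

```r
library(ggplot2)

# Hypothetical stand-in for k_t with made-up values
toy <- data.frame(
  year     = c(1946, 1947, 1948, 1946, 1947, 1948),
  name     = c("Karen", "Karen", "Karen", "Terry", "Terry", "Terry"),
  rel_prop = c(0.5, 0.55, 0.6, 0.7, 0.8, 0.75)
)

p_group <-
  ggplot(
    data = toy,
    mapping = aes(x = year, y = rel_prop)
  ) +
  # group splits the data into one line per name
  # without changing the color or style of the lines
  geom_line(
    mapping = aes(group = name),
    linewidth = 1
  )

p_group
```

That lack of a legend is one reason color or linetype is usually the friendlier choice when the reader needs to tell the lines apart.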
One caution about working with lines: make sure that there aren’t too many groups, otherwise geom_line() may not work correctly!
Let’s look over the example below:
k_t |>
# Picking just the years from 1950 to 1980
filter(
between(year, 1950, 1980)
) |>
ggplot(
mapping = aes(
x = factor(year),
y = rel_prop,
color = name
)
) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
# Moving the legend to the top of the graph and rotating the x-axis labels
theme(
legend.position = "top",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)
) +
scale_y_continuous(
labels = scales::label_percent()
) +
labs(x = NULL)
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
Uh oh, what happened?
The warning message gives a hint: geom_line(): each group consists of only one observation. Do you need to adjust the group aesthetic?
What that means is that, since both x and color are now mapped to categorical columns, the groups are formed for each combination of name and year. Since each combination of name and year only has 1 row, geom_line() doesn’t have 2 points it can play connect the dots with! geom_point() is unaffected since it just places a dot at each (x, y) coordinate. But if there aren’t 2 points in the same group, geom_line() can’t connect any of the dots!
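A rough way to check how many groups ggplot formed is to look at the group column that layer_data() returns for a layer. The sketch below uses a made-up data frame (hypothetical values) with 2 names and 3 years, so it should report one group for each of the 6 name and year combinations.

```r
library(ggplot2)

# Hypothetical stand-in for k_t with made-up values
toy <- data.frame(
  year     = c(1950, 1951, 1952, 1950, 1951, 1952),
  name     = c("Karen", "Karen", "Karen", "Terry", "Terry", "Terry"),
  rel_prop = c(0.6, 0.65, 0.7, 0.75, 0.7, 0.72)
)

p_bad <-
  ggplot(
    data = toy,
    mapping = aes(x = factor(year), y = rel_prop, color = name)
  ) +
  geom_line(linewidth = 1)

# layer_data() returns the data ggplot2 computed for a layer,
# including the group each row was assigned to
groups_bad <- layer_data(p_bad, 1)$group

# Number of distinct groups seen by geom_line()
length(unique(groups_bad))
```

With as many groups as rows, every group holds a single point, which is exactly the situation the warning describes.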
So what could we do? We can use the group aesthetic to try and fix it! Try mapping name to group below:
k_t |>
# Picking just the years from 1950 to 1980
filter(
between(year, 1950, 1980)
) |>
ggplot(
mapping = aes(
x = factor(year),
y = rel_prop,
color = name,
group = name
)
) +
geom_line(linewidth = 1) +
#geom_point(size = 2) +
# Moving the legend to the top of the graph and rotating the x-axis labels
theme(
legend.position = "top",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)
) +
scale_y_continuous(
labels = scales::label_percent()
) +
labs(x = NULL)
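Including group = name will cause ggplot to form groups based only on name, so each name gets one connected line. In the code above it is mapped inside ggplot(), so it applies to every geom. If group = name is mapped inside geom_line() instead, it only changes the groups for geom_line(). Any other geoms will still see the groups formed by both name and year!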
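The difference between layers can be checked with a small sketch. It assumes a made-up data frame (hypothetical values) and maps group = name inside geom_line() only, then compares the number of groups each layer ends up with using layer_data().

```r
library(ggplot2)

# Hypothetical stand-in for k_t with made-up values
toy <- data.frame(
  year     = c(1950, 1951, 1952, 1950, 1951, 1952),
  name     = c("Karen", "Karen", "Karen", "Terry", "Terry", "Terry"),
  rel_prop = c(0.6, 0.65, 0.7, 0.75, 0.7, 0.72)
)

p_layers <-
  ggplot(
    data = toy,
    mapping = aes(x = factor(year), y = rel_prop, color = name)
  ) +
  # group is mapped for this layer only
  geom_line(mapping = aes(group = name), linewidth = 1) +
  # no group mapping here, so the default grouping is used
  geom_point(size = 2)

# Layer 1 is geom_line(), layer 2 is geom_point()
line_groups  <- length(unique(layer_data(p_layers, 1)$group))
point_groups <- length(unique(layer_data(p_layers, 2)$group))

c(line = line_groups, point = point_groups)
```

The line layer should report one group per name, while the point layer still sees one group per name and year combination. That makes no visible difference for points, since each dot is drawn on its own either way.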
If gg_kt2 is working correctly, the code below will add the finishing touches to the graph:
gg_kt2 +
labs(
y = NULL,
x = "Year",
color = NULL,
title = "Men named <span style='color:steelblue;'>Terry</span> tend to be of the same age as women named <span style='color:#FE5BAC;'>Karen</span>",
subtitle = "Number of births per year in the US as a % of peak popularity",
caption = "Data: babynames package in R"
) +
scale_x_continuous(
expand = c(0, 1)
) +
theme_grey() +
theme(
plot.title.position = "plot",
plot.title = ggtext::element_markdown(face = "bold"),
plot.background = element_rect(fill = "#eef5f9"),
panel.background = element_rect(fill = "#eef5f9"),
panel.grid = element_line(color = "grey90"),
legend.position = "none"
) +
scale_color_manual(
values = c("Terry" = "steelblue", "Karen" = "#FE5BAC")
)