Line graphs

A line graph is a type of graph that uses a line to play “connect the dots” with the data points represented. It’s not required, but the x-axis on most line graphs represents time in some way.

We’ll start with an example for a single name picked at random (like “Jacob”):

# We'll create a line graph for the popularity of the name Jacob for boys
babynames |> 
  filter(name == "Jacob" & sex == "M") |> 
  # Piping the data set into ggplot(), so no data = ... argument is needed
  ggplot(
    mapping = aes(
      x = year, 
      y = prop
    )
  ) + 
  # Adding a line
  geom_line(
    linewidth = 1
  ) + 
  # Changing the y-axis to a percentage
  scale_y_continuous(
    labels = scales::label_percent()
  )

It’s pretty typical for line graphs to have more than one line. Let’s recreate the line graph we saw on the first day of class, comparing the relative popularity for the names Karen and Terry.

Using the babynames data set to create a set with Karen and Terry

The code below will create a data set with just female babies named Karen and male babies named Terry from 1946 - 2017, then calculate the popularity of each name relative to its maximum popularity: n / max(n) for each name.

k_t  <- 
  babynames |> 
  # Getting Karens and Terrys from 1946 to 2017
  filter(
    (name == "Karen" & sex == "F") | (name == "Terry" & sex == "M"),
    year >= 1946
  ) |> 
  # Calculating rel_prop - the relative popularity of the name
  mutate(
    .by = name,
    rel_prop = n/max(n)
  ) |> 
  # Keeping only the relevant columns
  dplyr::select(year, name, sex, rel_prop)

k_t
## # A tibble: 144 × 4
##     year name  sex   rel_prop
##    <dbl> <chr> <chr>    <dbl>
##  1  1946 Karen F        0.484
##  2  1946 Terry M        0.684
##  3  1947 Karen F        0.533
##  4  1947 Terry M        0.817
##  5  1948 Karen F        0.542
##  6  1948 Terry M        0.778
##  7  1949 Karen F        0.554
##  8  1949 Terry M        0.734
##  9  1950 Karen F        0.595
## 10  1950 Terry M        0.733
## # ℹ 134 more rows

Data description

The k_t data set has 4 columns:

  1. year: The year from 1946 - 2017
  2. name: The name given to the baby
  3. sex: If the baby was female (F) or male (M)
  4. rel_prop: The relative popularity of the name Terry or Karen
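
To make the rel_prop calculation concrete, here is a minimal sketch on a tiny made-up table. The toy data set and its counts are invented for illustration and are not taken from babynames. Each count is divided by the largest count for the same name, so every name peaks at 1 (100%):

```r
library(dplyr)

# Hypothetical counts, invented purely for illustration
toy <- tibble(
  name = c("A", "A", "A", "B", "B", "B"),
  year = c(2000, 2001, 2002, 2000, 2001, 2002),
  n    = c(50, 100, 75, 10, 20, 40)
)

toy_rel <-
  toy |>
  # Divide each count by the largest count within the same name
  mutate(
    .by = name,
    rel_prop = n / max(n)
  )

toy_rel
```

Name A peaks in 2001 and name B peaks in 2002, so those two rows should have rel_prop = 1.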

We want to make a graph to display the relative popularity for babies named Karen and Terry. Let’s start by creating a blank graph with x = year and y = rel_prop and changing the y-axis to be percentages. Save the graph as gg_kt.

gg_kt <- 
  ggplot(
    data = k_t,
    mapping = aes(
      x = year,
      y = rel_prop
    )
  ) + 
  scale_y_continuous(
    labels = scales::label_percent()
  )

gg_kt

Now that we have a blank graph, how do we add a line or lines? Can we just use geom_smooth()?

Not quite. geom_smooth() fits a smooth trend line across the graph. We want a geom that will connect the left-most dot with the next left-most dot, and so on. So which geom should we use?

It’s not much of a surprise that we should use geom_line()! Add it, with linewidth = 1, to the gg_kt graph that was created in the previous code chunk:

gg_kt + 
  geom_line(
    linewidth = 1
  )

Uh, that’s not quite what we wanted. So what happened?

The way geom_line() works is it will connect the dots from left to right. If there are 2 dots with the same x-value, it will draw a vertical line in the graph! Is that what is going on in our graph?

Instead of geom_line(), add geom_point() to our graph:

gg_kt +
  geom_point()

We get a better look at the dots being connected now! For each unique year, there are two points in the data: one for Karen’s popularity and another for Terry’s. In the code chunk above, color the dots by name to paint a clearer picture.
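
The same check can be done without a graph by counting the rows at each x-value. The sketch below uses a small made-up stand-in for k_t (the toy data set and its values are hypothetical); the same count() call should work on k_t itself, assuming it was created as above:

```r
library(dplyr)

# Hypothetical stand-in for k_t: two names, each with one row per year
toy <- tibble(
  year     = rep(2000:2002, each = 2),
  name     = rep(c("A", "B"), times = 3),
  rel_prop = c(0.2, 0.9, 0.5, 0.7, 1.0, 0.4)
)

# Number of rows at each x-value: anything above 1 means several points
# share that x, so geom_line() will draw a vertical segment there
rows_per_year <- count(toy, year)

rows_per_year
```

Every year has more than one row, which matches the vertical segments in the graph.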

So how do we fix it?

Multiple lines

From what we saw when we were making bar charts, ggplot() will form groups in the data whenever an aesthetic is mapped to a categorical variable. So how does that help us here? geom_line() will only play connect-the-dots with points in the same group. So if we have ggplot() form groups in the data, it will draw multiple lines using one geom_line() function. We just need to map an appropriate aesthetic to the column(s) that form the groups!

Some choices are:

  • color: the most popular aesthetic to use
  • linetype: changes how the line is drawn - solid, dashed, dotted, etc…
  • group: won’t actually change how the line is drawn, but will draw a separate line for each group.
    • Useful when constructing “spaghetti plots”. See the attached PDF for a more detailed description.
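
As a rough illustration of the group aesthetic, the sketch below draws one line per subject without changing color or line type. The toy data set and its columns (id, time, value) are made up for this example:

```r
library(ggplot2)

# Hypothetical data: 3 subjects (id), each measured at 4 time points
toy <- data.frame(
  id    = rep(c("s1", "s2", "s3"), each = 4),
  time  = rep(1:4, times = 3),
  value = c(1, 2, 4, 3,  2, 2, 3, 5,  0, 1, 1, 2)
)

gg_spaghetti <-
  ggplot(
    data = toy,
    mapping = aes(
      x = time,
      y = value,
      group = id
    )
  ) +
  # One line per id, all drawn in the same color
  geom_line(alpha = 0.5)

gg_spaghetti
```

Because group does not produce a legend, this approach suits graphs where the individual lines do not need to be identified.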

Let’s map color to name and see what happens!

gg_kt2 <- 
  gg_kt +
  geom_line(
    mapping = aes(color = name),
    linewidth = 1
  )

gg_kt2

Try changing color to linetype and see what changes!

Once you have it working, with each line drawn in its own color, save the result as gg_kt2.

Caution about line graphs with ggplot

One caution when working with lines: make sure that there aren’t too many groups; otherwise, geom_line() may not work correctly!

Let’s look over the example below:

k_t |> 
  # Picking just the years from 1950 to 1980
  filter(
    between(year, 1950, 1980)
    ) |> 
  
  ggplot(
    mapping = aes(
      x = factor(year), 
      y = rel_prop,
      color = name
    )
  ) + 
  
  geom_line(linewidth = 1) + 
  
  geom_point(size = 2) +
  
  # Moving the legend to the top of the graph and rotating the x-axis labels
  theme(
    legend.position = "top",
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
    ) + 
  scale_y_continuous(
    labels = scales::label_percent()
  ) + 
  
  labs(x = NULL)
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

Uh oh, what happened?

The warning message gives a hint: “geom_line(): Each group consists of only one observation. Do you need to adjust the group aesthetic?”

What that means is that, since both x and color are now mapped to categorical columns, a group is formed for each combination of name and year. Since each combination of name and year only has 1 row, geom_line() doesn’t have 2 points it can play connect the dots with! geom_point() is unaffected since it just places a dot at each (x, y) coordinate. But if there aren’t 2 points in the same group, geom_line() can’t connect any of the dots!

So what could we do? We can use the group aesthetic to try and fix it! Try mapping group to name below:

k_t |> 
  # Picking just the years from 1950 to 1980
  filter(
    between(year, 1950, 1980)
    ) |> 
  
  ggplot(
    mapping = aes(
      x = factor(year), 
      y = rel_prop,
      color = name,
      group = name
    )
  ) + 
  
  geom_line(linewidth = 1) + 
  
  #geom_point(size = 2) +
  
  # Moving the legend to the top of the graph and rotating the x-axis labels
  theme(
    legend.position = "top",
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
    ) + 
  scale_y_continuous(
    labels = scales::label_percent()
  ) + 
  
  labs(x = NULL)

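Including group = name tells ggplot to form groups based only on name, so geom_line() has a full set of points to connect for each name. In the code above, group = name sits in the aes() of ggplot(), so it applies to every geom. If it is placed inside the aes() of geom_line() instead, only geom_line() uses it, and any other geoms will still see the groups formed by both name and year!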
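
To illustrate setting the group for geom_line() only, here is a minimal sketch on made-up data (the toy data set and its values are hypothetical). The group aesthetic is mapped inside geom_line(), while geom_point() keeps the default groups:

```r
library(ggplot2)

# Hypothetical data: two names, each with one row per year
toy <- data.frame(
  year     = rep(2000:2003, times = 2),
  name     = rep(c("A", "B"), each = 4),
  rel_prop = c(0.2, 0.5, 1.0, 0.8,  0.9, 0.7, 0.4, 0.3)
)

gg_toy <-
  ggplot(
    data = toy,
    mapping = aes(
      x = factor(year),
      y = rel_prop,
      color = name
    )
  ) +
  # group = name applies to this layer only: one line per name
  geom_line(
    mapping = aes(group = name),
    linewidth = 1
  ) +
  # No group here: the points keep one group per name-and-year combination
  geom_point(size = 2)

gg_toy
```

The points look the same either way, since geom_point() draws each row on its own; the grouping only matters for geoms that connect rows, like geom_line().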

Recreating the graph seen on the first day of class

If gg_kt2 is working correctly, the code below will recreate the graph:

gg_kt2 + 
  
  labs(
    y = NULL,
    x = "Year",
    color = NULL,
    title = "Men named <span style='color:steelblue;'>Terry</span> tend to be of the same age as women named <span style='color:#FE5BAC;'>Karen</span>",
    subtitle = "Number of births per year in the US as a % of peak popularity",
    caption = "Data: babynames package in R"
  ) + 
  
  scale_x_continuous(
    expand = c(0, 1)
  ) + 
  theme_grey() + 
  theme(
    plot.title.position = "plot",
    plot.title = ggtext::element_markdown(face = "bold"),
    plot.background = element_rect(fill = "#eef5f9"),
    panel.background = element_rect(fill = "#eef5f9"),
    panel.grid = element_line(color = "grey90"),
    legend.position = "none"
  ) + 
  scale_color_manual(
    values = c("Terry" = "steelblue", "Karen" = "#FE5BAC")
  )