1. Building a plot

Looking at a dataset is nice, but we will often want to visualize our data. R has incredibly powerful tools for data visualization.

To start, load the tidyverse library, read in the presidential elections data from yesterday,[1] and take a look at it to remind yourself:

library(tidyverse)
elections <- read_csv("pres_elections.csv")
elections
## # A tibble: 1,097 x 5
##    state       abb   democrat  year region   
##    <chr>       <chr>    <dbl> <dbl> <chr>    
##  1 Alabama     AL        84.8  1932 South    
##  2 Arizona     AZ        67.0  1932 West     
##  3 Arkansas    AR        86.3  1932 South    
##  4 California  CA        58.4  1932 West     
##  5 Colorado    CO        54.8  1932 West     
##  6 Connecticut CT        47.4  1932 Northeast
##  7 Delaware    DE        48.1  1932 South    
##  8 Florida     FL        74.5  1932 South    
##  9 Georgia     GA        91.6  1932 South    
## 10 Idaho       ID        58.7  1932 West     
## # … with 1,087 more rows

Let’s look at the results for New Jersey:

nj <- elections %>% filter(state == "New Jersey")

As you might expect, there are many functions for plotting! The starting point for every plot we make in this course is called ggplot().

Starting with a dataset, you can create a plot with year on the x-axis and democrat on the y-axis with:

nj %>%
  ggplot(aes(x = year, y = democrat))

ggplot() makes a plot for you, and the aes() function (short for “aesthetic”) describes what you want on the x and y axes (for now! We will use aes() for other things later).

But it’s empty! To get shapes to appear on the plot, we need to ask for a particular geom (short for “geometry”). A geom in R is a way to visualize the data, like a point, a line, or a shape. To further customize this plot, we simply add a geom for the shape we want. Let’s use geom_line() to make a line:

nj %>%
  ggplot(aes(x = year, y = democrat)) + 
  geom_line()

Notice the + sign! We add a + sign between different pieces of a plot.

Look back at the dataset and see where you are plotting points on the plot:

nj
## # A tibble: 22 x 5
##    state      abb   democrat  year region   
##    <chr>      <chr>    <dbl> <dbl> <chr>    
##  1 New Jersey NJ        49.5  1932 Northeast
##  2 New Jersey NJ        59.5  1936 Northeast
##  3 New Jersey NJ        51.6  1940 Northeast
##  4 New Jersey NJ        50.3  1944 Northeast
##  5 New Jersey NJ        45.9  1948 Northeast
##  6 New Jersey NJ        42    1952 Northeast
##  7 New Jersey NJ        34.2  1956 Northeast
##  8 New Jersey NJ        50.0  1960 Northeast
##  9 New Jersey NJ        65.6  1964 Northeast
## 10 New Jersey NJ        44.0  1968 Northeast
## # … with 12 more rows

We could keep almost this exact code for a plot with points as well:

# the bottom line is the only one that changed
nj %>%
  ggplot(aes(x = year, y = democrat)) + 
  geom_point()

You can also add both! Notice how the points appear on top of the lines, since we added them after:

nj %>%
  ggplot(aes(x = year, y = democrat)) + 
    geom_line() + 
    geom_point()
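One way to see what the + sign is doing is to inspect the plot object itself. The sketch below assumes a small made-up table called toy (the values are illustrative, not real election results) and uses the plot's layers element, which is an internal detail of ggplot objects rather than something you need day to day. Each + appends one layer, and layers are drawn in the order they were added:

```r
library(ggplot2)

# Made-up values for illustration only (not real election results)
toy <- data.frame(
  year     = c(2000, 2004, 2008),
  democrat = c(52, 37, 63)
)

# No geom yet: the plot has axes but zero layers, so it looks empty
base <- ggplot(toy, aes(x = year, y = democrat))

# Each + appends one layer
both <- base + geom_line() + geom_point()

n_base <- length(base$layers)
n_both <- length(both$layers)

# Layers are stored (and drawn) in the order they were added:
# the line first, then the points on top
layer_order <- unname(sapply(both$layers, function(l) class(l$geom)[1]))
layer_order
```

Swapping the order to geom_point() + geom_line() would draw the line over the points instead.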

Exercises

  1. Okay, let’s all try this. Create a new object with the election results from one state other than New Jersey. Use it to make a plot like we have above.

  2. Then, try to make a bar graph using geom_col() instead of points or lines.

nj %>%
  ggplot(aes(x = year, y = democrat)) + 
    geom_col()

  3. Look back at the line, point, and bar plots you made. Are they all displaying the same information? Which one do you think is most effective?

2. Aesthetics

We added an x and y aesthetic, but plots can accept many other arguments.

Colors

If you want to make your geoms a certain color, that is very easy to do with the color argument:

nj %>%
  ggplot(aes(x = year, y = democrat)) + 
  geom_line(color = "grey") + 
  geom_point(color = "blue")

This looks great, but what if we want the colors in our plots to depend on the value of the data? For example, red points for elections that Republicans won and blue for elections that Democrats won.

Then, people looking at our plot would see additional pieces of information beyond the values on the x and y axes.

Just like the x and y axes, if we want the color of the points to depend on values in the data we have to use a column in our dataset to define the colors. Let’s make a new column that shows whether the Democratic candidate won the election.

For a crude measure of the election winner, let’s use whether democrat is greater than 50 percent (this is too simple since more than two candidates can run, but it’s okay for now).

nj <- nj %>%
  mutate(winner = democrat > 50)

Remember how this code works: the column democrat in nj is really a vector. The code works very similarly to running something like:

democrat <- c(52, 37, 63)
democrat > 50
## [1]  TRUE FALSE  TRUE
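To connect this back to mutate(), here is a minimal sketch that runs the same comparison on a small made-up table (toy and its values are illustrative, not real election results). The comparison is applied to every row at once, and the result is stored as a new column:

```r
library(dplyr)

# Made-up values for illustration only
toy <- tibble(
  year     = c(2000, 2004, 2008),
  democrat = c(52, 37, 63)
)

# democrat > 50 is evaluated for the whole column,
# giving one TRUE or FALSE per row
toy <- toy %>%
  mutate(winner = democrat > 50)

toy$winner
## [1]  TRUE FALSE  TRUE
```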

If you want the color of the points to depend on the value of a column, then you can use the color argument in the aes() function. R will assign one color to each value in the winner vector. Since there are only TRUE and FALSE values in this column, all of the TRUE values will have one color and FALSE will have another.

nj %>%
  ggplot(aes(x = year, y = democrat, color = winner)) + 
  geom_point()

What if we add the line back?

nj %>%
  ggplot(aes(x = year, y = democrat, color = winner)) + 
  geom_point() + geom_line()

Uh-oh! What’s happening here? Well, we’ve asked the plot to change the color of our shapes according to the winner variable. Since we have both points and a line, the plot is trying to change the color of both.

What if we only want to change the color of the points depending on the value of winner? Well, we can include that aesthetic only in the geom_point() function.

nj %>%
  ggplot(aes(x = year, y = democrat)) +
    geom_line() +
    geom_point(aes(color = winner))

Like before, you can still set the color of the line manually since you don’t want the color to vary by the value of a column. Make sure to do this outside of aes():

nj %>%
  ggplot(aes(x = year, y = democrat)) +
    geom_line(color = "grey") +
    geom_point(aes(color = winner))
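A common slip is to put a fixed color inside aes(). The sketch below, which again assumes a small made-up table called toy, compares the two spellings. It uses layer_data() from ggplot2 to look at the colors ggplot actually computed for a layer:

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  year     = c(2000, 2004, 2008),
  democrat = c(52, 37, 63)
)

# Outside aes(): "grey" is a literal color for the whole line
p_set <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_line(color = "grey")

# Inside aes(): "grey" is treated as data (a category with one value),
# so ggplot picks a color from its default palette and adds a legend
p_map <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_line(aes(color = "grey"))

# The colors each layer ends up using
set_colours <- unique(layer_data(p_set, 1)$colour)
map_colours <- unique(layer_data(p_map, 1)$colour)

set_colours
map_colours
```

If a line or point shows up in an unexpected color with a one-entry legend, this is usually the cause.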

Size and shape

Similarly, you can have the size of a point depend on the value of a column. For example:

nj %>%
  ggplot(aes(x = year, y = democrat)) +
    geom_line(color = "grey") +
    geom_point(aes(size = democrat))

Now, points are larger for larger values of democrat! However, larger values of democrat are already higher up on the y-axis, so this does not add much information to our plot.

The same is true for shape:

nj %>%
  ggplot(aes(x = year, y = democrat)) +
    geom_line(color = "grey") +
    geom_point(aes(shape = winner))

Exercises

  1. Create a new column in the nj dataset called percent. The values should equal the values in democrat divided by 100.

  2. Make a plot for the nj object with year on the x-axis and democrat on the y-axis.

  3. Create a new column in nj called modern which is TRUE for all elections after 1980 and FALSE for those before. Create a plot with year on the x-axis, democrat on the y-axis, color the points by winner, and vary the shape by modern:

nj %>%
  mutate(modern = year > 1980) %>%
  ggplot(aes(x = year, y = democrat)) +
    geom_line(color = "grey") +
    geom_point(aes(color = winner, shape = modern))

3. Customizing your visualizations

Geometries and aesthetics are the core of a nice visualization. R gives you many more tools to customize your plots any way you want. For example:

Labels

Labels are important in any plot. We create these with the labs() function, which has arguments for title, subtitle, caption, x, and y labels. You can choose which labels to include in your plot. For example:

nj %>%
  ggplot(aes(x = year, y = democrat)) +
    geom_line(color = "grey") +
    geom_point(aes(color = winner)) +
    labs(title = "New Jersey Presidential Election Results",
         subtitle = "1932-2016",
         x = "Election Year", 
         y = "Democratic %")
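labs() also accepts the name of any aesthetic you have mapped, which is how you rename a legend. A minimal sketch, assuming a made-up table called toy and an illustrative legend title; the last two lines peek at the plot's stored labels, which is an internal detail used here only to confirm what was set:

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  year     = c(2000, 2004, 2008),
  democrat = c(52, 37, 63)
)
toy$winner <- toy$democrat > 50

p <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_point(aes(color = winner)) +
  labs(title = "Toy Election Results",
       x = "Election Year",
       y = "Democratic %",
       color = "Democrat won?")  # legend title for the color aesthetic

# Confirm the labels that were stored on the plot
plot_title   <- p$labels$title
legend_title <- p$labels$colour  # ggplot2 stores "color" as "colour"
```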

Themes

Themes are simple ways to improve the presentation of your plot as well. We will learn how to make our own later, but for now you can use built-in themes. Some built-in themes include theme_bw(), theme_minimal(), and theme_dark().

For convenience, you can also store plots to an object and add additional features onto that object:

# save plot in an object called p
p <- nj %>%
  ggplot(aes(x = year, y = democrat)) +
    geom_line(color = "grey") +
    geom_point(aes(color = winner)) +
    labs(title = "New Jersey Presidential Election Results",
         subtitle = "1932-2016",
         x = "Election Year", 
         y = "Democratic %")

# now we can make more customizations to p
# without retyping everything
p + theme_minimal()

p + theme_dark()

There are many, many more themes available via packages. In your console, run install.packages("ggthemes"). Then, add this code to your .Rmd file:

library(ggthemes)

This opens up many more themes for you, many of which are listed in the ggthemes documentation. Here are a few:

p + theme_clean()

p + theme_fivethirtyeight() # 538

p + theme_igray() # Gray background

p + theme_economist() # The Economist

p + theme_stata() # theme from a language called Stata

p + theme_solarized()

You can edit almost anything you want about a plot’s theme, even if you’ve already added a preset theme. Most of this work happens through the theme() function. You can run ?theme to get a full list of options. For example:

p + 
  theme_bw() + 
  theme(legend.position = "bottom")
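The order of these two pieces matters. Preset themes such as theme_bw() are complete themes, so adding one after your theme() tweak replaces the tweak. A minimal sketch, assuming a made-up table called toy; reading the theme element off the plot object is an internal detail used here only to show the difference:

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(
  year     = c(2000, 2004, 2008),
  democrat = c(52, 37, 63)
)
toy$winner <- toy$democrat > 50

p <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_point(aes(color = winner))

# Preset theme first, tweak second: the tweak is kept
tweak_last <- p + theme_bw() + theme(legend.position = "bottom")

# Tweak first, preset theme second: the preset overwrites the tweak
tweak_first <- p + theme(legend.position = "bottom") + theme_bw()

pos_last  <- tweak_last$theme$legend.position
pos_first <- tweak_first$theme$legend.position

pos_last
pos_first
```

So when you combine a preset theme with theme(), add the preset first.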

Facets

Often, you will want to plot several groups at once. However, putting all information on one plot can be overwhelming. For example, consider this plot:

northeast <- elections %>% filter(region == "Northeast")

northeast %>%
  mutate(winner = democrat > 50) %>%
  ggplot(aes(x = year, y = democrat, color = winner)) + 
  geom_point()

Why is this so cluttered? Well, we’re now plotting results from every state in the Northeast on the same axes! We could color by state instead, but that might look overwhelming:

northeast %>%
  ggplot(aes(x = year, y = democrat, color = state)) + 
  geom_point()

Wow! That looks terrible. Instead, what if we plotted a separate line for each state?

northeast %>%
  ggplot(aes(x = year, y = democrat, color = state)) + 
  geom_point() + 
  geom_line()

That looks a little better, but it is still difficult to tell the lines apart. What if we made a smaller plot for each state and combined them? This is what a facet is. If we ask for a facet_wrap() by state, R will make one plot per state:

northeast %>%
  ggplot(aes(x = year, y = democrat)) + 
  geom_point() + 
  geom_line() + 
  facet_wrap(~state) + # notice the ~ key (called a tilde)
  theme_linedraw()

We could also add the winner color back and facet_wrap() will automatically apply it to each plot:

northeast %>%
  mutate(winner = democrat > 50) %>%
  ggplot(aes(x = year, y = democrat)) + 
  geom_line() + 
  geom_point(aes(color = winner)) + 
  facet_wrap(~state) + # notice the ~ key (called a tilde)
  theme_linedraw() + 
  labs(x = "Election Year",
       y = "Democratic %",
       title = "Presidential Elections",
       subtitle = "1932-2016, Northeastern States")
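To check that a facet really does give one panel per group, here is a minimal sketch with a made-up table of three hypothetical states (State A, State B, and State C, with illustrative values). layer_data() returns the data ggplot computed for a layer, including a PANEL column that records which small plot each row lands in:

```r
library(ggplot2)

# Made-up values for three hypothetical states
toy <- data.frame(
  state    = rep(c("State A", "State B", "State C"), each = 3),
  year     = rep(c(2000, 2004, 2008), times = 3),
  democrat = c(52, 37, 63, 48, 55, 51, 40, 45, 58)
)

p <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_line() +
  geom_point() +
  facet_wrap(~state)  # one small plot per value of state

# Count the distinct panels used by the first layer
n_panels <- length(unique(layer_data(p, 1)$PANEL))
n_panels
## [1] 3
```

facet_wrap() also takes nrow and ncol arguments if you want to control how the small plots are arranged, for example facet_wrap(~state, ncol = 1).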

Exercises

  1. Here is a dataset with state population data over time.[2] Like yesterday, download this file and place it in the same folder as this file. Then, read the file into an object called states.

  2. For any state you want, make a plot showing population by year for every year after 1960.

  3. Add labels and a theme to your plot from Question 2.

  4. Now, design a plot (or extend your plot from Question 3) that uses a facet in some way.


  1. This dataset comes from the pscl R package.

  2. This dataset comes from user JoshData on GitHub.