Introduction

We have already seen some of the basics of ggplot2. The following has been adapted from a course by Colin Pascal from Johns Hopkins University. We will assume that you are familiar with some of the basics of ggplot and graphics in general and we will go through concepts here fairly quickly.

A useful tool is the R Graph Gallery. While we cannot teach all of the various types of graphs that ggplot is capable of creating here, the R Graph Gallery is a very nice next step in creating different visualizations.

As we know, ggplot is a powerful and complex piece of software. Begin by loading the required library (tidyverse or ggplot2 will both work).

library(tidyverse)

We use data from the United States Congress collected by a research team led by Craig Volden of the University of Virginia and Alan Wiseman of Vanderbilt University. Volden and Wiseman collect information about members of Congress for their research on how effective members are at making laws.

cel <- read_csv(url("https://www.dropbox.com/s/4ebgnkdhhxo5rac/cel_volden_wiseman%20_coursera.csv?raw=1"))
head(cel)

## # A tibble: 6 x 38
##   thomas_num thomas_name icpsr congress  year st_name    cd   dem elected female
##        <dbl> <chr>       <dbl>    <dbl> <dbl> <chr>   <dbl> <dbl>   <dbl>  <dbl>
## 1          1 Abdnor, Ja~ 14000       93  1973 SD          2     0    1972      0
## 2          2 Abzug, Bel~ 13001       93  1973 NY         20     1    1970      1
## 3          3 Adams, Bro~ 10700       93  1973 WA          7     1    1964      0
## 4          4 Addabbo, J~ 10500       93  1973 NY          7     1    1960      0
## 5          6 Alexander,~ 12000       93  1973 AR          1     1    1968      0
## 6          8 Anderson, ~ 12001       93  1973 CA         35     1    1968      0
## # ... with 28 more variables: votepct <dbl>, dwnom1 <dbl>, deleg_size <dbl>,
## #   speaker <dbl>, subchr <dbl>, afam <dbl>, latino <dbl>, votepct_sq <dbl>,
## #   power <dbl>, chair <dbl>, state_leg <dbl>, state_leg_prof <dbl>,
## #   majority <dbl>, maj_leader <dbl>, min_leader <dbl>, meddist <dbl>,
## #   majdist <dbl>, all_bills <dbl>, all_aic <dbl>, all_abc <dbl>,
## #   all_pass <dbl>, all_law <dbl>, les <dbl>, seniority <dbl>, benchmark <dbl>,
## #   expectation <dbl>, TotalInParty <dbl>, RankInParty <dbl>

Getting Started with ggplot

We begin by describing the basic structure of a ggplot figure. The gg in ggplot stands for grammar of graphics. There are five parts to a ggplot visualization:

tidy data in long format.
mapping these data to the aesthetic features of a visualization, like the x and the y axis, the size of the marks on the plot, or the color or the type of marks.
geoms, which is what ggplot calls the geometric shapes that go into a plot.
a coordinate system.
guides, which is what ggplot calls legends and other labels.
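
The five parts can be seen together in one short sketch. The data frame toy and its columns are hypothetical and exist only for illustration; coord_cartesian() and the legend are ggplot2 defaults, written out here only to make each part visible.

```r
library(ggplot2)

# 1. Tidy data in long format (hypothetical toy data: one row per member).
toy <- data.frame(
  terms = c(1, 2, 3, 4, 5, 6),
  bills = c(0, 1, 1, 3, 2, 5),
  party = c("A", "B", "A", "B", "A", "B")
)

p <- ggplot(toy, aes(x = terms, y = bills, color = party)) +  # 2. aesthetic mapping
  geom_point() +                                              # 3. geom
  coord_cartesian() +                                         # 4. coordinate system (the default)
  guides(color = guide_legend(title = "Party"))               # 5. guide (the legend)

p
```

Dropping the last two lines of the pipeline should give essentially the same figure, since ggplot2 supplies a Cartesian coordinate system and a legend on its own.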

A Basic Scatter Plot

Suppose we want to plot how many bills members of Congress pass as a function of how many terms they were elected. We select a single Congress and see how many bills are passed versus the experience of the members.

We filter the data and take the 100th Congress as an example.

fig100 <- cel %>% 
    filter(congress==100) %>% 
    select("seniority","all_pass")

head(fig100)

## # A tibble: 6 x 2
##   seniority all_pass
##       <dbl>    <dbl>
## 1         3        4
## 2         6        1
## 3        10        0
## 4        10        7
## 5         3        1
## 6        12        2

Next, we plot the data using a basic ggplot. Initially, we want to feed the function the data. Then, we select an aesthetic mapping.

ggplot(fig100,aes(x=seniority, y=all_pass)) +
  geom_point()

There are many geom commands such as:

points,
bars,
boxplots,
lines
etc.

A complete list of geoms is available on the tidyverse page. The geom_point() command adds a layer of points, which produces a scatterplot. The other geom commands draw the shapes you might expect. Note that we do not put anything in the parentheses right now because the geom_point function inherits the data and the aesthetic mappings from the ggplot command that preceded it. Some geoms that you add to a ggplot string will require arguments inside the parentheses; however, not all will.

One issue you will immediately notice is that we appear to be missing points. This is an overplotting issue: many members share the same values of seniority and bills passed, so their dots are drawn on top of each other. To make these cases visible, we modify the command slightly. Rather than using geom_point, we use geom_jitter, which adds a small amount of random noise to each point so that the points no longer lie directly on top of each other. While the positions no longer represent the data exactly, the figure gives a more faithful visual impression of what is occurring: in general, the more senior a member of Congress is, the more bills they pass.

ggplot(fig100,aes(x=seniority, y=all_pass)) +
  geom_jitter()
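
As a rough check on overplotting, one can compare the number of rows with the number of distinct positions. The sketch below uses a small hypothetical data frame (toy) rather than the Congress data; the width, height and alpha values are illustrative choices, not recommendations.

```r
library(ggplot2)

# Hypothetical toy data with repeated (seniority, all_pass) pairs.
toy <- data.frame(
  seniority = c(1, 1, 1, 2, 2, 3),
  all_pass  = c(0, 0, 0, 1, 1, 2)
)

# Fewer distinct positions than rows means some points overlap exactly.
n_rows     <- nrow(toy)
n_distinct <- nrow(unique(toy))

p <- ggplot(toy, aes(x = seniority, y = all_pass)) +
  # width/height control how far points may move; alpha adds transparency
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.6)

set.seed(1)  # the jitter is random, so set a seed before drawing for a reproducible figure
p
```

Keeping the jitter small relative to the spacing of the data values helps the reader still see which value each point belongs to.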

To improve the look, we can give the figure a title and change the axis labels. We do this by adding a labs function to the string of ggplot commands, connecting it to the ggplot and geom_jitter functions with another plus sign.

ggplot(fig100,aes(x=seniority, y=all_pass)) +
  geom_jitter() +
  labs(x="Seniority",
       y="Bills Passed",
       title="Seniority and Bills Passed in the 100th Congress")

More or less, this is the basics of ggplot. However, we can still improve by adding colour. We modify the diagram to point out Democrats and Republicans. We will need to re-filter the data to include the party variable (“dem”).

fig100 <- cel %>% 
  filter(congress==100) %>% 
  select("seniority", "all_pass", "dem")

head(fig100)

## # A tibble: 6 x 3
##   seniority all_pass   dem
##       <dbl>    <dbl> <dbl>
## 1         3        4     1
## 2         6        1     1
## 3        10        0     1
## 4        10        7     1
## 5         3        1     1
## 6        12        2     1

ggplot(fig100, aes(x=seniority, y=all_pass, color=dem)) +
  geom_jitter() +
  labs(x="Seniority",
       y="Bills Passed",
       title="Seniority and Bills Passed in the 100th Congress")

The colour is given on a continuous scale because dem is stored as a number; however, the party of a member is a binary (categorical) variable. With a little data wrangling, we convert the dem variable from a number to a categorical variable using recode and then add that recoded variable back to our data.

party <- recode(fig100$dem, `1`="Democrat", `0`="Republican")

fig100 <- add_column(fig100, party)

ggplot(fig100,aes(x=seniority, y=all_pass, color=party)) +
  geom_jitter() +
  labs(x="Seniority",
       y="Bills Passed",
       title="Seniority and Bills Passed in the 100th Congress")
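
An alternative sketch of the same conversion uses base R's factor(), which sets the labels and their order in one step. The vector dem below is hypothetical; in the text it would be fig100$dem.

```r
# Hypothetical 1/0 party indicator.
dem <- c(1, 1, 0, 1, 0)

# levels gives the codes, labels gives the names shown in the legend,
# and the order of the levels controls the order of the legend entries.
party <- factor(dem, levels = c(1, 0), labels = c("Democrat", "Republican"))

table(party)
```

Either approach gives ggplot a discrete variable, which is what produces a two-colour legend instead of a continuous colour bar.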

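As a final touch, we match the traditional colours: blue for Democrats and red for Republicans. We manually control the colours by adding the function scale_color_manual and setting the values to blue and red. For a character variable the colours are assigned to the categories in alphabetical order, so Democrat gets blue and Republican gets red.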

ggplot(fig100,aes(x=seniority,y=all_pass,color=party)) +
  geom_jitter() +
  labs(x="Seniority",
       y="Bills Passed",
       title="Seniority and Bills Passed in the 100th Congress") +
  scale_color_manual(values=c("blue","red"))
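
Because an unnamed vector of colours is matched to the categories by position, a named vector is a slightly safer way to write the same thing. This is a sketch on hypothetical data (toy); party_colors is a name introduced here for illustration.

```r
library(ggplot2)

# Hypothetical toy data; note that "Republican" comes first in the rows.
toy <- data.frame(
  seniority = c(1, 2, 3, 4),
  all_pass  = c(0, 2, 1, 4),
  party     = c("Republican", "Democrat", "Republican", "Democrat")
)

# Names tie each colour to a category, regardless of ordering.
party_colors <- c(Democrat = "blue", Republican = "red")

p <- ggplot(toy, aes(x = seniority, y = all_pass, color = party)) +
  geom_point() +
  scale_color_manual(values = party_colors)

p
```

With the named vector, adding a third category or reordering the levels would not silently swap the colours.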

If we wanted separate plots for Democrats and Republicans, we use a process called faceting. We add the facet_wrap command to our ggplot to create two diagrams.
The facet_wrap command works by putting inside the parentheses a tilde mark followed by a variable in the data that will split the figure into two parts on the basis of that variable.

ggplot(fig100,aes(x=seniority, y=all_pass, color=party)) +
  geom_jitter() +
  labs(x="Seniority",
       y="Bills Passed",
       title="Seniority and Bills Passed in the 100th Congress") +
  scale_color_manual(values=c("blue","red")) +
  facet_wrap(~party)

Distributions

Box Plots

Scatter plots are useful to show the relationship between two different variables. There may, however, be occasions where visualizations that display summaries of the data or underlying data distributions for single variables in the dataset may be more informative. One classic example of such a visualization is the box plot. We have seen these in previous sections.

We use the survey data from the Cooperative Congressional Election Survey, which is a major academic survey focused on politics in the United States. It is distributed to members of the general public and responded to via phone.

cces<- read_csv(url("https://www.dropbox.com/s/ahmt12y39unicd2/cces_sample_coursera.csv?raw=1"))
head(cces)

## # A tibble: 6 x 25
##   caseid region gender  educ edloan  race hispanic employ marstat  pid7 ideo5
##    <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>    <dbl>  <dbl>   <dbl> <dbl> <dbl>
## 1 4.18e8      3      1     2      2     1        2      5       3     6     3
## 2 4.15e8      1      2     6      2     1        1      1       1     2     2
## 3 4.14e8      3      2     3      2     2        2      1       4     2     3
## 4 4.12e8      1      2     5      2     6        2      5       3     3     1
## 5 4.17e8      2      1     2     NA     4        2      8       5     1     1
## 6 4.20e8      2      1     3      2     1        2      1       1     4     5
## # ... with 14 more variables: pew_religimp <dbl>, newsint <dbl>,
## #   faminc_new <dbl>, union <dbl>, investor <dbl>, CC18_308a <dbl>,
## #   CC18_310a <dbl>, CC18_310b <dbl>, CC18_310c <dbl>, CC18_310d <dbl>,
## #   CC18_325a <dbl>, CC18_325b <dbl>, CC18_325c <dbl>, CC18_325d <dbl>

The column faminc_new in the dataset reports the approximate household income for survey respondents, broken into 16 categories from low to high, coded numbers 1 through 16. We make a boxplot of this data.

The first argument in ggplot is the dataset (called cces). For the aesthetic mapping, we only need to set y to the variable of interest because it is a univariate distribution. Once we have our ggplot command, we add the function geom_boxplot.

ggplot(cces, aes(y=faminc_new)) +
    geom_boxplot()
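
To make explicit what the box plot summarizes, the sketch below computes the same quantities by hand for a hypothetical vector of income codes (income is not from the survey). By default geom_boxplot draws the box from the first to the third quartile, a line at the median, and whiskers out to the most extreme values within 1.5 times the interquartile range; anything beyond that is drawn as a separate point.

```r
# Hypothetical income codes on the 1-16 scale.
income <- c(1, 3, 4, 6, 7, 8, 10, 12, 16)

# Quartiles: the edges of the box and the line inside it.
q <- quantile(income, probs = c(0.25, 0.5, 0.75))

# Interquartile range and the limits beyond which points count as outliers.
iqr         <- q[["75%"]] - q[["25%"]]
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

outliers <- income[income < lower_fence | income > upper_fence]

q
outliers
```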

If we wanted to modify this box plot, we could do so as we did before. For instance, suppose we want to compare the household incomes of individuals at different educational levels. The educ variable is the answer to a question about the highest level of education respondents achieved. We graph by education level by adding educ to our aesthetic mappings as the group. One could further modify the figure with labels and titles. This gives one box per education level, arranged along the x-axis, showing the distribution of income within each level.

ggplot(cces,aes(y=faminc_new, group=educ)) +
  geom_boxplot() +
  labs(x="Education Level",
       y="Family Income",
       title="Family Inc. by Respondent Ed. Level")

An additional change we may wish to make is to convert the numeric coding for the educational level variable to a categorical or qualitative variable. We can do that using the recode function.

cces$educ_category <- 
    recode(cces$educ,`1`="<4 yr Degree",
           `2`="<4 yr Degree",
           `3`="<4 yr Degree",
           `4`="<4 yr Degree",
           `5`="4 yr Deg. +",
           `6`="4 yr Deg. +")

head(cces)

## # A tibble: 6 x 26
##   caseid region gender  educ edloan  race hispanic employ marstat  pid7 ideo5
##    <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>    <dbl>  <dbl>   <dbl> <dbl> <dbl>
## 1 4.18e8      3      1     2      2     1        2      5       3     6     3
## 2 4.15e8      1      2     6      2     1        1      1       1     2     2
## 3 4.14e8      3      2     3      2     2        2      1       4     2     3
## 4 4.12e8      1      2     5      2     6        2      5       3     3     1
## 5 4.17e8      2      1     2     NA     4        2      8       5     1     1
## 6 4.20e8      2      1     3      2     1        2      1       1     4     5
## # ... with 15 more variables: pew_religimp <dbl>, newsint <dbl>,
## #   faminc_new <dbl>, union <dbl>, investor <dbl>, CC18_308a <dbl>,
## #   CC18_310a <dbl>, CC18_310b <dbl>, CC18_310c <dbl>, CC18_310d <dbl>,
## #   CC18_325a <dbl>, CC18_325b <dbl>, CC18_325c <dbl>, CC18_325d <dbl>,
## #   educ_category <chr>

Now we must change the aesthetic mapping so that the categorical variable we created is mapped to x rather than group. This will ensure the x-axis appears correctly. Occasionally, the aesthetic mappings can be finicky, so experimentation may be required to figure out exactly which mapping will achieve the look you want.

ggplot(cces,aes(y=faminc_new, x=educ_category)) +
  geom_boxplot() +
  labs(x="Education Level",
       y="Family Income",
       title="Family Inc. by Respondent Ed. Level")

Histograms & Density Plots

Two other common visualizations to display the distributions of univariate data are the histogram and the density plot.

A histogram is a bar plot of frequencies: it divides the range of a numeric vector into bins and shows the count of values that fall in each bin.

ggplot(cces,aes(x=faminc_new))+
  geom_histogram()
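
By default geom_histogram uses 30 bins and prints a message suggesting that you choose a better value. Since the income variable takes whole-number codes, one reasonable choice is a bin width of 1 with bin edges halfway between the codes. The sketch below uses a hypothetical vector (faminc) in place of the survey column.

```r
library(ggplot2)

# Hypothetical income codes.
toy <- data.frame(faminc = c(1, 2, 2, 3, 3, 3, 6, 6, 12))

p <- ggplot(toy, aes(x = faminc)) +
  # one bin per integer code: edges at 0.5, 1.5, 2.5, ...
  geom_histogram(binwidth = 1, boundary = 0.5)

# The counts that the bars display.
built <- layer_data(p, 1)
built[built$count > 0, c("x", "count")]

p
```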

A density plot is essentially a smoothed and re-scaled histogram.

ggplot(cces,aes(x=faminc_new))+
  geom_density()

You can make the same kinds of modifications to the colour, the x-axis and the axis labels as you did with the other figures. The process is identical.

Bar Plots

We use data from the Center for Effective Lawmaking and Cooperative Congressional Election Survey for our discussion on bar plots. We wish to create a bar chart that shows counts of how many Republicans and Democrats there were in the 100th Congress. We filter the data as we did before. You will notice that there are 180 Republicans and 266 Democrats in the 100th Congress.

dat <- cel %>% 
  filter(congress==100) 

ggplot(dat, aes(x=dem)) +
  geom_bar()
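
geom_bar counts the rows in each category for you. If the counts have already been computed, geom_col plots them directly. The sketch below shows both routes on hypothetical data (toy and counts are names introduced for illustration).

```r
library(ggplot2)

# Hypothetical data: one row per member.
toy <- data.frame(party = c("Democrat", "Democrat", "Republican"))

# Route 1: let geom_bar count the rows.
p_bar <- ggplot(toy, aes(x = party)) +
  geom_bar()

# Route 2: count first, then plot the precomputed values with geom_col.
counts <- as.data.frame(table(party = toy$party))  # columns: party, Freq
p_col <- ggplot(counts, aes(x = party, y = Freq)) +
  geom_col()

counts
```

Both plots should show the same bar heights; the second form is convenient when the data arrive already summarized.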

A Second Barplot

Suppose we wanted to see how many members of Congress there are for each state. This figure provides the count of how many members of Congress there are from each state in the US.

dat <- cel %>% 
    filter(congress==100)

ggplot(dat, aes(x=st_name)) + 
    geom_bar()

This graphic is a bit unappealing. Words are often better on the vertical axis as they tend not to get mashed together as much. So, we change our aesthetic to put the state names on the vertical axis.

ggplot(dat, aes(y=st_name)) + 
    geom_bar()

One thing that is unsatisfying about the x-axis of the party bar chart is that it is not intuitively labeled: it shows the numeric codes of dem rather than party names. If we recode the dem column, we can show the party name for each member of Congress.

party <- recode(cel$dem,
                `1`="Democrat",
                `0`="Republican")

cel <- add_column(cel, party)

dat <- cel %>% 
    filter(congress==100)

ggplot(dat, aes(x=party)) +
  geom_bar()

We now add some other aesthetic touches to the graph such as labels, party colour and the removal of the key (using the guides command) since it seems unnecessary.

dat <- cel %>% 
    filter(congress==100)

ggplot(dat, aes(x=party, fill=party)) +
  geom_bar() +
  labs(x="Party", 
       y="Number of Members") +
  scale_fill_manual(values=c("blue", "red")) +
  guides(fill="none")

Stacked & Side-by-Side Barplots

Once again, we will work with the Cooperative Congressional Election Survey data.

We will categorize respondents as either Democrat, Republican or Independent based on their response to the pid7 question with those answering 1 – 3 as Democrat and those answering 5 – 7 as Republican. Those who answered 4 will be considered Independent. We are then going to see how these political leanings look across the four regions of the United States.

We can do it in two ways. One way is to create four stacked bars, one per region, where the height of each bar is the total number of respondents in that region and the bar is divided into segments for the party categories. The other way is to create 12 separate bars, one per party category in each region, grouped together by region.

We begin by recoding the data in the pid7 question as dem_rep - a classification of Democrat, Republican or Independent.

cces <- read_csv(url("https://www.dropbox.com/s/ahmt12y39unicd2/cces_sample_coursera.csv?raw=1"))

dem_rep <- recode(cces$pid7,
                  `1`="Democrat",
                  `2`="Democrat",
                  `3`="Democrat",
                  `4`="Independent",
                  `5`="Republican",
                  `6`="Republican",
                  `7`="Republican")

table(dem_rep)

## dem_rep
##    Democrat Independent  Republican 
##         516         119         365

cces <- add_column(cces, dem_rep)

We graph this data with region on the \(x\)-axis. To create the stacked bars, we map the fill color for each bar to the collapsed party variable that we’ve created.

ggplot(cces, aes(x=region, fill=dem_rep)) +
  geom_bar()

The fill is mapped to the party category, and the height of each coloured segment is the count of respondents in that category in each of the four regions of the country. If you want a grouped set of bars instead, go into the geom_bar function and set the position option to "dodge". This takes the stacked bar chart and breaks it apart.

###grouped bars
ggplot(cces,aes(x=region, fill=dem_rep)) +
  geom_bar(position="dodge")
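
The numbers behind both versions of the chart are the same cross-tabulation of region by party. A sketch on hypothetical data (toy) shows the table alongside the two position settings.

```r
library(ggplot2)

# Hypothetical respondents: region code and collapsed party.
toy <- data.frame(
  region  = c(1, 1, 1, 2, 2, 4, 4, 4, 4, 4),
  dem_rep = c("Democrat", "Republican", "Democrat", "Independent", "Democrat",
              "Democrat", "Democrat", "Republican", "Independent", "Democrat")
)

# The counts that the bar segments represent.
tab <- table(region = toy$region, party = toy$dem_rep)
tab

# Same data, two layouts: stacked (the default) and side by side.
p_stack <- ggplot(toy, aes(x = factor(region), fill = dem_rep)) +
  geom_bar()
p_dodge <- ggplot(toy, aes(x = factor(region), fill = dem_rep)) +
  geom_bar(position = "dodge")
```

The stacked version makes the row totals of the table easy to read, while the dodged version makes the individual cells easy to compare.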

Which is better depends on what you want to tell your audience. For example, the second graph gives a better illustration that there are more Democrats in region 4 than in region 1 while illustration 1 shows more respondents from region 4 than in region 1. Of course, this is not a finished graphic as it should have better labels, colours and a title. However, this is done in the same way as it was in previous sections.

Line Plots

Line plots are generally used to show time trends indicating fluctuation in a value across time. For a line chart, we need data that has a column of time or date information with additional columns of data to be plotted at each time point.

To demonstrate, we create a fictitious data set consisting of the years 1991 to 2020 with random data representing a stock price. This data has mean 25 and standard deviation 5. We join these in a tibble.

The geom we want here is geom_line. Using the aesthetic of years on the horizontal axis and price on the vertical, we get a line plot of the data.

years <- seq(from=1991, to=2020, by=1)

price <- rnorm(30, mean=25, sd=5)

fig_data <- tibble("year"=years, "stock_price"=price)

ggplot(fig_data, aes(x=year, y=stock_price)) +
  geom_line()
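
Because rnorm draws new random numbers on every run, the figure changes each time the code is executed. Setting a seed first makes the simulated prices reproducible. This sketch uses a base R data frame to stay self-contained; the seed value 2020 is an arbitrary choice.

```r
library(ggplot2)

set.seed(2020)  # any fixed integer works; it only pins down the random draws

years <- 1991:2020
price <- rnorm(length(years), mean = 25, sd = 5)

fig_data <- data.frame(year = years, stock_price = price)

# Map the columns of fig_data, not the free-standing vectors.
p <- ggplot(fig_data, aes(x = year, y = stock_price)) +
  geom_line()

p
```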

If we wanted a second line on the graph, we could add one. Here, we recreate our data with two stocks (Stock_1 and Stock_2). The second stock will have mean 15 and standard deviation 3.

fig_data$stock_id=rep("Stock_1",30)

stock_1_time_series <- fig_data

stock_id <- rep("Stock_2",30)

years <- seq(from=1991,to=2020,by=1)

price <- rnorm(30,mean=15,sd=3)

stock_2_time_series <- tibble("stock_id"=stock_id, 
                              "year"=years, 
                              "stock_price"=price)

all_stocks_time_series <- bind_rows(stock_1_time_series, stock_2_time_series)

We now plot the data using the aesthetic group to break up the two data sets.

ggplot(all_stocks_time_series, (aes(x=year,y=stock_price, group=stock_id))) +
  geom_line()

Now we have the prices of our two stocks on the same figure, showing how they both vary across time. Of course, this graphic would be better if the lines were different colours or had different styles. We can do this by adding the linetype and color aesthetics to the mapping.

ggplot(all_stocks_time_series,
       (aes(x=year,
            y=stock_price,
            group=stock_id,
            linetype=stock_id,
            color=stock_id))) +
  geom_line()

If we want the lines on different graphs, we use facet_wrap.

ggplot(all_stocks_time_series,
       (aes(x=year,
            y=stock_price,
            group=stock_id,
            linetype=stock_id,
            color=stock_id))) +
  geom_line() +
  facet_wrap(~stock_id)

Another Line Chart Example

One question in American political science over the last 25 years has been whether the Republican and Democratic parties are becoming increasingly polarized, with Republicans becoming more conservative and Democrats becoming more liberal. Political scientists have come up with a way of measuring how conservative or liberal members of the US Congress are based on their voting records (called the DW-NOMINATE score).

The Volden and Wiseman dataset on Legislative Effectiveness has a column called dwnom1 that measures this score. A lower score means that the member of Congress is more liberal while a higher score means that the member of Congress is more conservative. One way to potentially assess whether American political parties have become more extreme over time is to measure the average ideology of members of the two parties across time. If the distance is increasing, we can interpret this as greater political polarization.

First, recode the dem variable from a 1/0 indicator so that it reads as Democrat or Republican. Next, summarize the data by taking the mean of the dwnom1 ideology score for each party in each year (note that we drop any missing values). This gives the average ideology for Democrats and Republicans for all the different years. Finally, create the ggplot in the usual way.

cel <- read_csv(url("https://www.dropbox.com/s/4ebgnkdhhxo5rac/cel_volden_wiseman%20_coursera.csv?raw=1"))

cel$Party <- recode(cel$dem,
                    `1`="Democrat",
                    `0`="Republican")

fig_data <- cel %>% 
  group_by(Party, year) %>% 
  summarize("Ideology"=mean(dwnom1, na.rm=T))
  
ggplot(fig_data,(aes(x=year, y=Ideology, group=Party, color=Party))) +
  geom_line() +
  scale_color_manual(values=c("blue", "red"))

We can see that the parties move apart from Congress to Congress. This suggests a greater partisan polarization over time. Note that this graphic is an exceptionally simple way to explain that idea to an audience.
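
The distance between the two party averages can also be computed directly. The sketch below uses a small hypothetical data frame with made-up scores (not the real dwnom1 values) to show the calculation.

```r
# Hypothetical ideology scores for two years; values are invented for illustration.
toy <- data.frame(
  year   = c(1973, 1973, 1973, 1973, 2013, 2013, 2013, 2013),
  Party  = rep(c("Democrat", "Democrat", "Republican", "Republican"), times = 2),
  dwnom1 = c(-0.3, -0.2, 0.2, 0.3, -0.4, -0.4, 0.5, 0.6)
)

# Mean score for each year (rows) and party (columns).
means <- tapply(toy$dwnom1, list(toy$year, toy$Party), mean)

# Gap between the party means in each year; a growing gap suggests polarization.
gap <- means[, "Republican"] - means[, "Democrat"]
gap
```

Plotting gap against year would give a single line summarizing the trend that the two-line chart shows.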

At this point, you are in a great position to experiment with the R Graph Gallery.

Advanced ggplot

Heatmaps

A heatmap is a handy way of summarizing a number of different variables across different units of observations. In a heatmap, the color of the different tiles indicates the relative magnitude of different variables.

One can create a heatmap in ggplot using the geom_tile function. To use this, we need a set of observations and then at least three variables per observation.

To create a dummy example, we create a data set with the first 20 letters as one variable and labels of the form “varXX” as the other. The expand.grid function creates a data frame containing every combination of the vectors supplied to it, so we get 20 x 20 = 400 rows. We then create a variable Z which consists of 400 real numbers drawn uniformly from the interval 0 to 5 (using runif).

Our ggplot will have letters on the \(x\)-axis, variables on the \(y\)-axis and the Z values as the fill.

x <- LETTERS[1:20]
y <- paste0("var", seq(1,20))

dat <- expand.grid(X=x, Y=y)

dat$Z <- runif(400, 0, 5)


ggplot(dat, aes(x=X, y=Y, fill= Z)) + 
  geom_tile()
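
Note that runif draws real numbers rather than whole numbers. If integer values from 1 to 5 were wanted instead, sample() is one option, sketched below. The names grid_dat and the seed are choices made here for illustration.

```r
library(ggplot2)

set.seed(10)  # arbitrary seed so the tiles are reproducible

# Every combination of 20 letters and 20 variable labels: 400 rows.
grid_dat <- expand.grid(X = LETTERS[1:20], Y = paste0("var", 1:20))

# 400 integers drawn uniformly (with replacement) from 1 to 5.
grid_dat$Z <- sample(1:5, size = nrow(grid_dat), replace = TRUE)

p <- ggplot(grid_dat, aes(x = X, y = Y, fill = Z)) +
  geom_tile()

p
```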

More Heatmaps

Suppose we had three basketball players and we had stats on those players (assists, points and rebounds). We create a tibble of these statistics (note that we will need to rotate the data 90 degrees).

| | Michael | LeBron | Kobe |
|---|---|---|---|
| Points | 35 | 40 | 45 |
| Assists | 10 | 12 | 5 |
| Rebounds | 15 | 12 | 5 |

players <- c("Michael","LeBron","Kobe")
points <- c(35, 40,45)
assists <- c(10,12,5)
rebounds <- c(15,12,5)

basketball <- tibble(players,points,assists,rebounds)
basketball

## # A tibble: 3 x 4
##   players points assists rebounds
##   <chr>    <dbl>   <dbl>    <dbl>
## 1 Michael     35      10       15
## 2 LeBron      40      12       12
## 3 Kobe        45       5        5

The next step is to standardize these values by dividing each one by the largest value of the same statistic (the largest entry in its row of the table above). This puts every statistic on a common scale with a maximum of 1, making the quantities comparable. (Normally, these types of statistics have very different ranges.) Alternatively (and especially for larger data sets), one could calculate a \(z\)-score.

basketball$stanardize_points <- basketball$points/max(basketball$points)
basketball$stanardize_assists <- basketball$assists/max(basketball$assists)
basketball$stanardize_rebounds <- basketball$rebounds/max(basketball$rebounds)

basketball_stanardize <- select(basketball, 
                                "players", 
                                "stanardize_points", 
                                "stanardize_assists", 
                                "stanardize_rebounds")

basketball_stanardize

## # A tibble: 3 x 4
##   players stanardize_points stanardize_assists stanardize_rebounds
##   <chr>               <dbl>              <dbl>               <dbl>
## 1 Michael             0.778              0.833               1    
## 2 LeBron              0.889              1                   0.8  
## 3 Kobe                1                  0.417               0.333
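
The two standardizations mentioned above can be compared on the points column. This is a sketch using the three point totals from the example; the variable names max_scaled and z_scores are introduced here.

```r
points <- c(35, 40, 45)

# Divide by the maximum: the largest value becomes 1.
max_scaled <- points / max(points)

# z-score: subtract the mean and divide by the standard deviation.
z_scores <- (points - mean(points)) / sd(points)

round(max_scaled, 3)
z_scores

# Base R's scale() computes the same z-scores (it returns a one-column matrix).
as.numeric(scale(points))
```

With z-scores the values are centred at 0, so a heatmap would show above-average and below-average performance rather than a share of the maximum.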

The standardized data is in wide format and not long format. The pivot_longer command turns wide data into long. Moving data from wide to long format is a little bit tricky. First, we identify the data we want to transform. Then we identify the variables that we want to stretch into the second column of data. This gives the names of the players matched with the pair of the statistic and the value for the statistic for that player.

long_basketball_scaled <-
    pivot_longer(basketball_stanardize,
                 c("stanardize_points",
                   "stanardize_assists",
                   "stanardize_rebounds"),
                 names_to="stat",
                 values_to = "value")

long_basketball_scaled

## # A tibble: 9 x 3
##   players stat                value
##   <chr>   <chr>               <dbl>
## 1 Michael stanardize_points   0.778
## 2 Michael stanardize_assists  0.833
## 3 Michael stanardize_rebounds 1    
## 4 LeBron  stanardize_points   0.889
## 5 LeBron  stanardize_assists  1    
## 6 LeBron  stanardize_rebounds 0.8  
## 7 Kobe    stanardize_points   1    
## 8 Kobe    stanardize_assists  0.417
## 9 Kobe    stanardize_rebounds 0.333
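
If shorter labels are wanted on the heatmap axis, pivot_longer can strip a common prefix from the column names while reshaping, using its names_prefix argument. The sketch below uses a hypothetical two-player, two-statistic data frame with a made-up prefix std_.

```r
library(tidyr)

# Hypothetical wide data: one row per player, one column per statistic.
wide <- data.frame(
  players     = c("Michael", "Kobe"),
  std_points  = c(0.778, 1),
  std_assists = c(0.833, 0.417)
)

long <- pivot_longer(
  wide,
  cols         = -players,   # reshape every column except players
  names_to     = "stat",     # old column names go here ...
  names_prefix = "std_",     # ... with this prefix removed
  values_to    = "value"     # cell values go here
)

long
```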

Finally, we create our heatmap with players on the \(x\)-axis, statistics on the \(y\)-axis and the statistic’s value as the fill.

ggplot(long_basketball_scaled, aes(x=players, y=stat, fill=value)) + 
  geom_tile()

Again, we could modify this heatmap with various labels, headings etc. However, the general idea is hopefully clear.

Annotations

ggplot has a variety of tools for doing more advanced graphical refining. Adding annotations to a graph can be very useful.

geom_text is analogous to geom_point. Essentially, what you are doing is replacing the points that you would have in a scatter plot with text for each individual point.

We create a graph of the number of hours playing a video game and lifetime high scores of the players. It seems likely that we will have a positive correlation between these two variables.

Player	Hours
Adam	50
Bob	45
Chris	45
Derek	35
Erin	30
Fran	15

player <- c("Adam","Bob","Chris","Derek","Erin","Fran")
time_spent <- c(50,45,45,35,30,15)
high_score <- c(100,75,85,50,20,45)

game_scores <- tibble(player, time_spent, high_score)
game_scores

## # A tibble: 6 x 3
##   player time_spent high_score
##   <chr>       <dbl>      <dbl>
## 1 Adam           50        100
## 2 Bob            45         75
## 3 Chris          45         85
## 4 Derek          35         50
## 5 Erin           30         20
## 6 Fran           15         45

The ggplot function will map the \(x\) and \(y\) aesthetics, and add the geom_point command.

ggplot(game_scores,aes(x=time_spent, y=high_score)) +
  geom_point()

We label the points with the player names using geom_text.

ggplot(game_scores, aes(x=time_spent, y=high_score)) +
  geom_point() +
  geom_text(aes(label=player))

We may want to remove the dot since it overlaps with the names. If we want to remove the points and only have names, we remove the geom_point command.

ggplot(game_scores, aes(x=time_spent, y=high_score)) +
  geom_text(aes(label=player))

The issue here is that we no longer know exactly where each point is. To keep the points but move the names away from them, we use the nudge_x and nudge_y arguments of geom_text.

ggplot(game_scores,aes(x=time_spent, y=high_score)) +
  geom_point() +
  geom_text(aes(label=player), nudge_y=5)

More Annotations

We return to the Congressional Effectiveness Data. We want to plot the relationship between political ideology and how many bills a member of Congress gets passed. The dwnom1 variable measures ideology from liberal to conservative. We measure how many bills a member of Congress passed using the variable all_pass.

First, we filter the data to keep only the 100th Congress.

cel<-drop_na(read_csv(url("https://www.dropbox.com/s/4ebgnkdhhxo5rac/cel_volden_wiseman%20_coursera.csv?raw=1")))

dat <- cel %>% 
    filter(congress==100)

Next we plot the data with the names of the Congressperson and the points.

ggplot(dat, aes(x=dwnom1, y=all_pass, label=thomas_name)) +
  geom_point() +
  geom_text()

Of course, this is a mess. The interesting names to look at are those who have passed a lot of legislation. We can add text only to those points that have more than 8 bills passed simply by filtering the data in the geom_text command.

ggplot(dat, aes(x=dwnom1, y=all_pass, label=thomas_name)) +
    geom_point() +
    geom_text(data=filter(dat, congress==100 & all_pass>8))
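
The same idea, giving one layer its own data while the other layers keep the full data, can be seen on a small hypothetical example. The data frame toy, its columns, and the nudge value are invented for illustration.

```r
library(ggplot2)

# Hypothetical members with an ideology score and a count of bills passed.
toy <- data.frame(
  name     = c("Ann", "Ben", "Cal", "Dee", "Eve"),
  ideology = c(-0.5, -0.2, 0.1, 0.3, 0.6),
  bills    = c(2, 12, 4, 9, 1)
)

# Only the members we want to label.
busy <- subset(toy, bills > 8)

p <- ggplot(toy, aes(x = ideology, y = bills, label = name)) +
  geom_point() +                          # uses all rows of toy
  geom_text(data = busy, nudge_y = 0.5)   # uses only the filtered rows

p
```

All five points are drawn, but only the two members above the threshold receive a label.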

Notice that we supply a different data set to the geom_text function: the filtered data that isolates observations in the 100th Congress with values of more than eight in the all_pass column. The aesthetic mappings still come from the original ggplot command, but the data are different. Swapping different data in and out of individual layers is a very useful thing to be able to do.

There are still some issues with this figure. The ggrepel library will help with these. ggrepel is not in the tidyverse. We use the geom_text_repel command to move the labels away from each other to avoid overlaps.

library(ggrepel)

dat <- cel %>% 
    filter(congress==100)

ggplot(dat, aes(x=dwnom1, y=all_pass)) +
  geom_point()+
  geom_text_repel(data=filter(cel, congress==100 & all_pass>8),
                  mapping=aes(x=dwnom1, 
                              y=all_pass, 
                              label=thomas_name))

If we wanted to point out Morris Udall, we could draw a red rectangle over that part of the plot and make it transparent (alpha = 0.2). We can add a text annotation in red as well.

dat <- cel %>% filter(congress==100)

ggplot(dat, aes(x=dwnom1, y=all_pass)) +
  geom_point() +
  geom_text_repel(data=filter(cel, congress==100 & all_pass>8),
                  mapping=aes(x=dwnom1, 
                              y=all_pass, 
                              label=thomas_name)) +
  annotate("rect", xmin=-.56, xmax=-.2, ymin=15, ymax=17, alpha=.2, fill="red") +
  annotate("text", x=.6, y=14, label="Most Passed", color="red")

Themes

There are a number of modifications to a visualization that can improve the overall look of the final output. These include themes, colour and legends.

Here, we work again with the Cooperative Congressional Election Survey. We will also add the RColorBrewer and the ggthemes libraries.

library(tidyverse)
library(RColorBrewer)
library(ggthemes)

cces <- read_csv(url("https://www.dropbox.com/s/ahmt12y39unicd2/cces_sample_coursera.csv?raw=1"))

We will consider some questions on Presidential support. In doing so, we will select a number of columns in the data set. We have filtered the data below. For reference:

CC18_308a: Job approval for President Trump
ideo5: In general, how would you describe your own political viewpoint?
educ: What is the highest level of education you have completed?
faminc_new: Thinking back over the last year, what was your family’s annual income?
employ: Which of the following best describes your current employment status?

plot_data <- select(cces, "CC18_308a", "ideo5", "educ", "faminc_new", "employ") %>%
    drop_na()

We will plot political ideology versus the Trump approval rating.

ggplot(plot_data, aes(y=CC18_308a, x=ideo5)) +
  geom_jitter()

To add some additional dimensionality to the data, we will colour the data by education level and change the size of the dot based on annual income.

ggplot(plot_data,aes(y=CC18_308a, x=ideo5, color=educ, size=faminc_new)) +
  geom_jitter()

We have now plotted four dimensions of the data on a 2D graph. If we want to change the colour scale (perhaps for a colour-blind audience), we can use scale_color_gradient, which interpolates between a low and a high colour.

ggplot(plot_data, aes(y=CC18_308a, x=ideo5, color=educ, size=faminc_new)) +
  geom_jitter() +
  scale_color_gradient(low="gray", high="purple")
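
As a hedged alternative to choosing the two end colours by hand, ggplot2 also provides the viridis scales, which are designed to be perceived by viewers with common forms of colour blindness. The sketch below swaps scale_color_gradient for scale_color_viridis_c, which is not what the course uses, and relies on the built-in mtcars data in place of plot_data.

```r
library(ggplot2)

# Built-in data stands in for plot_data (an assumption for illustration).
p_viridis <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp, size = qsec)) +
  geom_jitter(width = 0.1, height = 0.1) +
  scale_color_viridis_c(option = "viridis")  # continuous viridis colour scale

p_viridis
```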

We can also use a colour scale with categorical variables. Here we recode employ into a labelled variable, employ_cat, and colour the points by it. We use one of the pre-defined colour palettes, RdYlGn, from the RColorBrewer package. ColorBrewer is convenient because its aesthetically pleasing palettes are built in and can be applied directly to your figures.

plot_data$employ_cat <- recode(plot_data$employ,
                               `1`="Full-time",
                               `2`="Part-time",
                               `3`="Temp. Layoff",
                               `4`="Unemployed",
                               `5`="Retired",
                               `6`="Disabled",
                               `7`="Homemaker",
                               `8`="Student",
                               `9`="Other")

ggplot(plot_data, aes(y=CC18_308a, x=ideo5, color=employ_cat)) +
  geom_jitter() +
  scale_color_brewer(palette="RdYlGn")
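
Before committing to a palette, it can help to check that it has enough colours for the number of categories. The following sketch inspects RdYlGn with brewer.pal.info and brewer.pal from RColorBrewer. The count of nine assumes that every employment code in the recode above appears in the data.

```r
library(RColorBrewer)

# brewer.pal.info has one row per palette, giving its maximum number of
# colours, its category (seq, div, or qual), and a colour-blind flag.
info <- brewer.pal.info["RdYlGn", ]
info

# The recode above defines nine categories (assumed to all be present),
# so the palette must supply at least nine colours.
n_categories <- 9
n_categories <= info$maxcolors

# The nine hex codes for this palette.
brewer.pal(n_categories, "RdYlGn")
```

Running display.brewer.all() draws every available palette, which is a quick way to browse the alternatives.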

Themes

Here we discuss some themes. The theme is generally used to modify elements that stay the same regardless of the underlying data, such as the typeface, the placement of the legend, and other static aspects of the figure that are not tied to the data.

The theme function is fairly complicated, but it is also quite powerful. Once you find a theme you like, you will normally reuse it over and over again. Like the other layers we have seen, theme is added to a chain of ggplot commands with a plus sign.

Suppose we wanted to move the legend of this plot to the bottom. We can do this by setting the legend.position argument of theme to "bottom".

plot_data<-rename(plot_data,"Employment"=employ_cat)

ggplot(plot_data, aes(y=CC18_308a, x=ideo5, color=Employment)) +
  geom_jitter() +
  scale_color_brewer(name="Employment", palette="RdYlGn") +
  theme(legend.position="bottom")
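
Because a theme is usually reused, its settings can be stored in an ordinary R object and added to several plots. This is a minimal sketch: my_theme is a hypothetical name, and the built-in mtcars data stands in for plot_data.

```r
library(ggplot2)

# Hypothetical name: my_theme is an ordinary R object holding theme settings.
my_theme <- theme(legend.position = "bottom")

# Built-in data stands in for plot_data (an assumption for illustration).
p1 <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  my_theme

p2 <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(gear))) +
  geom_point() +
  my_theme

p1
p2
```

Changing my_theme in one place then updates every plot that uses it the next time the plots are built.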

We may also wish to change the format of the x-axis text. Again, this is done through an argument to the theme function, here axis.text.x. Using the options of the element_text function, we can set the angle, the horizontal justification (hjust), the colour of the text, and more. While not aesthetically pleasing, we will rotate the axis labels 90 degrees and colour them blue.

ggplot(plot_data,aes(y=CC18_308a,x=ideo5,color=Employment))+
  geom_jitter()+
  scale_colour_brewer(name="Employment",palette="RdYlGn")+
  theme(legend.position="bottom",
        axis.text.x=element_text(angle=90,hjust=1,color="blue"))
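
Rotation is most useful when the axis labels are long category names. As an illustrative sketch, the example below uses the mpg data set that ships with ggplot2 (not the survey data) and a 45 degree angle, which is an arbitrary choice.

```r
library(ggplot2)

# The mpg data set ships with ggplot2; manufacturer names are long enough
# to collide when printed horizontally.
p_axis <- ggplot(mpg, aes(x = manufacturer, y = hwy)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45,  # rotate the labels 45 degrees
                                   hjust = 1,   # right-justify each label
                                   vjust = 1))  # top-justify each label

p_axis
```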

The axis.title argument controls the appearance of the axis titles, while the labs command supplies their text. Used together, they give you very fine control over the detail in your labels.

ggplot(plot_data, aes(y=CC18_308a, x=ideo5, color=Employment)) +
  geom_jitter() +
  scale_colour_brewer(name="Employment", palette="RdYlGn") +
  theme(legend.position="bottom",
        axis.text.x=element_text(angle=90, hjust=1, color="blue"),
        axis.title=element_text(color="red")) +
  labs(x="Ideology", 
       y="Trump Approval",
       title="Trump Approval, Ideology, and Employment",
       caption="Cooperative Congressional Election Study.")
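
The same pairing of labs and theme applies to the plot title and caption, through the plot.title and plot.caption arguments. The sketch below uses the built-in mtcars data, and the styling choices are arbitrary examples.

```r
library(ggplot2)

# Built-in data stands in for plot_data (an assumption for illustration).
p_labs <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # labs supplies the text of the title and caption.
  labs(title = "Fuel Economy and Weight",
       caption = "Source: the mtcars data set built into R.") +
  # theme controls how that text is drawn.
  theme(plot.title = element_text(face = "bold", hjust = 0.5),     # bold, centred
        plot.caption = element_text(hjust = 0, color = "gray40"))  # left-aligned

p_labs
```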

There is, however, an easier way to do all of this: ggthemes. ggthemes is a library containing a large number of pre-made themes that you can apply to your visualizations. The website All Your Figure Are Belong To Us provides a gallery of the available themes. For example, we use theme_wsj, which mimics the look of figures in The Wall Street Journal. Keep in mind that even when you use ggthemes, you can still modify individual elements yourself with the theme function, which gives you direct control over the features of the plot.

ggplot(plot_data,aes(y=CC18_308a,x=ideo5,color=Employment))+
  geom_jitter()+
  theme_wsj()
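
To combine a pre-made theme with your own adjustments, add the theme call after the pre-made theme. This is a minimal sketch using the built-in mtcars data in place of plot_data.

```r
library(ggplot2)
library(ggthemes)

# Built-in data stands in for plot_data (an assumption for illustration).
p_wsj <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  theme_wsj() +                      # the complete pre-made theme comes first
  theme(legend.position = "bottom")  # then the individual override

p_wsj
```

The order matters: a complete theme such as theme_wsj replaces the settings added before it, so a theme call placed earlier in the chain would be lost.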

Citations

All Your Figure Are Belong To Us. Accessed July 9, 2021. Available here.

“Cookbook for R.” Accessed June 30, 2021. Available here.

Coursera. “Data Visualization in R with ggplot2.” Accessed July 9, 2021. Available here.

“Create Elegant Data Visualisations Using the Grammar of Graphics.” Accessed June 30, 2021. Available here.

Holtz, Yan. “The R Graph Gallery – Help and Inspiration for R Charts.” The R Graph Gallery. Accessed July 9, 2021. Available here.

“R for Data Science.” Accessed June 30, 2021. Available here.

Wickham, Hadley et al. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics (version 3.3.5), 2021. Available here.

More on ggplot2

OC Data Science
