About this Site

This website was created for Lecture 13 in the class PSYC 604: Computer-based Data Management & Analysis at James Madison University. It provides an overview of the use of ggplot in R to create commonly used graphs.

About the Data

To illustrate graphing we will use the data from the article below (this is the same data we used to learn graphing in SAS).

Data used in this example are from: Atkins, D. C., & Gallop, R. J. (2007). Re-thinking how family researchers model infrequent outcomes: A tutorial on count regression and zero-inflated models. Journal of Family Psychology, 21(4), 726-735.

Description of variables:

  • id = couple id
  • gender = gender (1 = Female, 2 = Male)
  • steps = # of steps taken towards divorce/separation
  • marital_satis = measure of marital satisfaction
  • sex_dis = measure of sexual dissatisfaction
  • afc = measure of problems with affective communication
  • affair = extramarital affair (0 = No Affair, 1 = Affair)

The data set contains data from 263 married people (about 133 couples); the data were slightly altered from the real data for IRB purposes.

Let’s begin by setting the working directory.

#Specify working directory where data file is located here
setwd("C:\\Users\\pastorda\\Dropbox\\604 FA23\\Day13\\course packet files")

Load Data

To read the permanent SAS data set called marriage.sas7bdat into R we will use the read_sas function from package haven.

#install and load haven package
if (require(haven) == FALSE){
  install.packages("haven")
}
library(haven)

marriage <- read_sas("marriage.sas7bdat")

I am now using str() to obtain variable names and data type for each variable.

str(marriage)   #Lists variable names and data types in your data set
## tibble [263 × 7] (S3: tbl_df/tbl/data.frame)
##  $ id           : num [1:263] 1 2 3 4 5 6 7 8 9 10 ...
##  $ gender       : num [1:263] 1 1 1 1 1 1 1 1 1 1 ...
##  $ steps        : num [1:263] 4 7 4 8 2 6 0 2 6 0 ...
##  $ marital_satis: num [1:263] 70 45 79 92 87 67 81 102 71 112 ...
##  $ sex_dis      : num [1:263] 61 69 49 66 42 49 73 58 35 42 ...
##  $ afc          : num [1:263] 65 69 49 56 65 69 69 54 63 49 ...
##  $ affair       : num [1:263] 0 0 0 0 0 0 0 1 1 0 ...

I see that gender and affair are both numeric variables. We might need them to be factors at some point, so I am going to create factor versions of them here. Instead of creating new variables that are factor versions of the original numeric variables, I could just use factor(gender) or factor(affair) in the ggplot function whenever these variables need to be factors in a plot.

#Creating genderF (factor) from gender (numeric)
marriage$genderF<-factor(marriage$gender,levels=c(1,2), labels=c("Female","Male"))

#Creating affairF (factor) from affair (numeric)
marriage$affairF<-factor(marriage$affair,levels=c(0,1), labels=c("No Affair","Affair"))

#checking to ensure genderF and affairF are factors
str(marriage)
## tibble [263 × 9] (S3: tbl_df/tbl/data.frame)
##  $ id           : num [1:263] 1 2 3 4 5 6 7 8 9 10 ...
##  $ gender       : num [1:263] 1 1 1 1 1 1 1 1 1 1 ...
##  $ steps        : num [1:263] 4 7 4 8 2 6 0 2 6 0 ...
##  $ marital_satis: num [1:263] 70 45 79 92 87 67 81 102 71 112 ...
##  $ sex_dis      : num [1:263] 61 69 49 66 42 49 73 58 35 42 ...
##  $ afc          : num [1:263] 65 69 49 56 65 69 69 54 63 49 ...
##  $ affair       : num [1:263] 0 0 0 0 0 0 0 1 1 0 ...
##  $ genderF      : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ affairF      : Factor w/ 2 levels "No Affair","Affair": 1 1 1 1 1 1 1 2 2 1 ...

#getting descriptives for all variables
summary(marriage)
##        id             gender          steps        marital_satis   
##  Min.   :  1.00   Min.   :1.000   Min.   : 0.000   Min.   : 40.00  
##  1st Qu.: 34.00   1st Qu.:1.000   1st Qu.: 1.000   1st Qu.: 74.50  
##  Median : 67.00   Median :1.000   Median : 3.000   Median : 87.00  
##  Mean   : 67.06   Mean   :1.498   Mean   : 3.186   Mean   : 84.49  
##  3rd Qu.:100.50   3rd Qu.:2.000   3rd Qu.: 5.000   3rd Qu.: 94.00  
##  Max.   :133.00   Max.   :2.000   Max.   :11.000   Max.   :115.00  
##     sex_dis           afc            affair         genderF         affairF   
##  Min.   :34.00   Min.   :45.00   Min.   :0.0000   Female:132   No Affair:225  
##  1st Qu.:55.00   1st Qu.:59.00   1st Qu.:0.0000   Male  :131   Affair   : 38  
##  Median :61.00   Median :63.00   Median :0.0000                               
##  Mean   :60.07   Mean   :63.35   Mean   :0.1445                               
##  3rd Qu.:68.00   3rd Qu.:69.00   3rd Qu.:0.0000                               
##  Max.   :79.00   Max.   :76.00   Max.   :1.0000
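
As a minimal sketch of what the levels and labels arguments of factor() are doing in the conversion above, the chunk below applies the same call to a toy vector (the values are made up for illustration and are not the marriage data). levels lists the values found in the numeric variable and labels gives the text to display for each one, in the same order.

```r
# Toy codes, invented for illustration (not the marriage data)
gender_codes <- c(1, 2, 2, 1, 1)

# levels = the values found in the numeric variable
# labels = the text to show for each of those values, in the same order
genderF_demo <- factor(gender_codes, levels = c(1, 2), labels = c("Female", "Male"))

levels(genderF_demo)       # the labels become the factor levels
table(genderF_demo)        # counts per label
as.integer(genderF_demo)   # underlying integer codes (1 = first level, 2 = second level)
```

Any value in the numeric variable that is not listed in levels would become NA in the factor, which is why it is worth checking the result with str() or summary().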

About ggplot

There are many, many ways to create graphics in R. In fact, many useful functions for graphing come with Base R (e.g., plot, coplot, hist, barplot, boxplot). Instead of learning about the graphs you can create using functions in Base R, we will learn how to create graphs using the ggplot function in the package ggplot2, which was created by Hadley Wickham. There is a function called qplot in ggplot2, but we will only focus on ggplot, which is more versatile. ggplot allows you to create a wide array of graphs and allows a lot of customization opportunities in R.

There are TONS of resources on the web for learning more about ggplot (too numerous to list here). You might find the RStudio ggplot2 cheat sheet useful.

Fantastic resources for considering all the graphing possibilities in R (e.g., base R functions, ggplot) include:

https://www.data-to-viz.com/#explore

https://www.r-graph-gallery.com/index.html

We could just install and load the package ggplot2, but we are going to install and load tidyverse, which is a collection of useful packages, including ggplot2.

if (require(tidyverse) == FALSE){
  install.packages("tidyverse")
}
library(tidyverse)

ggplot relies on the grammar of graphics (Leland Wilkinson’s framework, which Hadley Wickham adapted into the layered grammar used by ggplot2). It considers plots to consist of various elements, including:

  • data: the actual data you want to plot - almost all data you want to plot will need to be in a data frame
  • aesthetics: aesthetic attributes, e.g., what is on the x and y axes, color/shape/size of data points, height of bars, etc.
  • geometric objects: type of graph being created (e.g., histogram, bar plot, line plot, density, points)

We can add various geometric objects to a single chart by layering.

To understand the basics of ggplot, consider the script below where we call the ggplot function and provide only two arguments, data and mapping, to that function. The data argument indicates which data frame we want to use in the graph. The mapping argument indicates the aes (aesthetics), which right now just includes the variables we wish to graph on the x and y axes. After running this, you’ll notice no data were graphed, which is because we have yet to provide a geometric object.

ggplot(data=marriage, mapping=aes(x=steps, y=marital_satis))
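
One way to see why nothing is drawn is to count the layers stored in the plot object. The sketch below uses a small made-up data frame (called demo here, standing in for marriage) and assumes the plot's layers can be inspected with $layers, which is how ggplot objects have exposed them in the versions I have used.

```r
library(ggplot2)

# Small made-up data frame standing in for marriage
demo <- data.frame(steps = c(0, 2, 4, 6, 8),
                   marital_satis = c(100, 92, 85, 70, 61))

# Data and aesthetics only: axes are drawn, but nothing is plotted
p_empty <- ggplot(data = demo, mapping = aes(x = steps, y = marital_satis))
length(p_empty$layers)    # number of layers (geoms) so far

# Adding a geom adds a layer
p_points <- p_empty + geom_point()
length(p_points$layers)
```

The first plot has no layers and the second has one, which is the geom_point() layer.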

Scatterplots

Let’s add geom_point() as a layer to create a scatterplot.

#Let's add +geom_point() to actually provide a scatterplot
ggplot(data=marriage, mapping=aes(x=steps, y=marital_satis))+
  geom_point()

Adding a smoothed line to a scatterplot

Let’s now add a smoothed line to the scatterplot as another layer with geom_smooth().

ggplot(data=marriage, mapping=aes(x=steps, y=marital_satis))+
  geom_point()+
  geom_smooth()

Let’s now remove geom_point(). Notice how the actual data points disappear and we only have the smoothed line.

ggplot(data=marriage, mapping=aes(x=steps, y=marital_satis))+
  geom_smooth()

As illustrated above, the most basic information we provide to ggplot is the data, the aesthetics, and the geoms.

Remember that we don’t have to supply argument names (e.g., data=, mapping=) to a function. If we do not supply argument names, then the values for the arguments must come in a particular order.

#The script WITH argument names
ggplot(data=marriage, mapping=aes(x=steps, y=marital_satis))+
  geom_point()

#The script WITHOUT argument names (provides the same graph)
ggplot(marriage,aes(x=steps, y=marital_satis))+
  geom_point()

Another thing to keep in mind is that the plus sign (+) for adding elements must come at the END of a line.

#This works:
ggplot(marriage,aes(x=steps, y=marital_satis))+
  geom_point()

#This does NOT work:
ggplot(marriage,aes(x=steps, y=marital_satis))
  +geom_point()
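
The reason is that R treats a line as finished whenever it forms a complete expression, so the ggplot() call is run on its own and the line starting with + is then evaluated separately. A minimal sketch of this, using a small made-up data frame (demo): if you really want the plus sign at the start of a line, wrapping the whole expression in parentheses makes R keep reading until the closing parenthesis.

```r
library(ggplot2)

# Small made-up data frame standing in for marriage
demo <- data.frame(steps = c(0, 2, 4, 6, 8),
                   marital_satis = c(100, 92, 85, 70, 61))

# + at the END of the line: R knows the expression continues
p_end <- ggplot(demo, aes(x = steps, y = marital_satis)) +
  geom_point()

# + at the START of the line only works inside parentheses,
# because R keeps reading until the closing parenthesis
p_start <- (ggplot(demo, aes(x = steps, y = marital_satis))
            + geom_point())

# Both objects contain one layer (the points)
length(p_end$layers)
length(p_start$layers)
```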

Another thing to be aware of is that the aesthetics could be provided within the geom statement, as shown below. When we provide the aesthetic mapping within the geom statement, it is only applicable to that geom statement. This is known as applying the aesthetic mapping locally. If instead we supply the aesthetic mapping within the ggplot function (e.g., ggplot(marriage,aes(x=steps, y=marital_satis))), the aesthetic applies to all geoms used in that ggplot function. This is known as applying the aesthetic mapping globally.

ggplot(marriage)+
 geom_point(aes(x=steps, y=marital_satis))
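
The difference between local and global mappings matters most when a plot has more than one geom. The sketch below is only an illustration with a made-up data frame (demo, with a hypothetical grouping variable called group): mapping color globally splits both the points and the regression lines by group, while mapping color locally in geom_point leaves a single regression line.

```r
library(ggplot2)

# Made-up data with a hypothetical two-level grouping variable
demo <- data.frame(steps = c(0, 1, 2, 3, 4, 5, 6, 7),
                   marital_satis = c(101, 97, 90, 88, 80, 76, 70, 63),
                   group = rep(c("A", "B"), times = 4))

# Global mapping: color applies to every geom, so points AND lines are split by group
p_global <- ggplot(demo, aes(x = steps, y = marital_satis, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Local mapping: color applies to geom_point only, so there is one line for everyone
p_local <- ggplot(demo, aes(x = steps, y = marital_satis)) +
  geom_point(aes(color = group)) +
  geom_smooth(method = "lm", se = FALSE)

p_global
p_local
```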

Changing elements of scatterplots

We learned above how to create a basic scatterplot. Let’s add to it:

  • jittering of data points by wrapping the variable names in the jitter function inside the aes argument. Jittering adds a small amount of random noise to each value, which helps avoid overplotting when many people share the same combination of x and y values (so that a single marker would otherwise represent many, many people). Shifting each data point a little lets you better see how many people have a particular combination of values. Note: the alpha argument, which controls the transparency of markers, also helps with overplotting and has the benefit of not altering the data like jittering does. We are using the default amount of jittering here. To change the amount, supply a value after the variable name (this is jitter's factor argument): jitter(steps,2).
  • a title & more descriptive labels for x and y axes by including the labs function
  • changes to the markers by including geom_point. We altered their shape (see this handy guide for shape of markers), transparency using alpha argument (ranges from 0 to 1 with lower values yielding more transparency), border color with the color argument, fill color with the fill argument, and size with the size argument.
  • limits to the minimum and maximum of the x and y axes by including xlim and ylim, respectively.

ggplot(marriage,aes(x=jitter(steps), y=jitter(marital_satis)))+
  geom_point(size=2, shape=23, color="tomato", fill="royalblue",alpha=0.5)+
  labs(title="Relationship Between # of Steps Taken Towards Divorce and Marital Satisfaction",
  x="# of Steps Taken Towards Divorce",y="Marital Satisfaction")+
  xlim(0,20)+
  ylim(30,120)
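
To get a feel for how much jitter() moves the data, the sketch below applies it to a short made-up vector with many tied values (steps_demo is invented for illustration). The seed is arbitrary and only there so the random noise is reproducible.

```r
set.seed(604)  # arbitrary seed so the random noise is reproducible

steps_demo <- c(0, 0, 0, 1, 1, 2, 5, 5)   # made-up values with many ties

j1 <- jitter(steps_demo)      # default amount of noise
j2 <- jitter(steps_demo, 2)   # second argument (factor) scales the amount of noise

# Largest shift away from the original value under each setting
max(abs(j1 - steps_demo))
max(abs(j2 - steps_demo))
```

For whole-number data like these, the shifts stay well under half a unit, so jittered points still sit visibly closer to their original value than to a neighboring one. ggplot2 also provides geom_jitter(), which adds the noise when drawing rather than inside aes; that is an alternative to the approach used here, not what the code above does.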

Now I am adding a smoothed line and a regression line to the graph above, and omitting the tomato border color on the markers and the forced minimums and maximums on the x and y axes.

ggplot(marriage,aes(x=jitter(steps), y=jitter(marital_satis)))+
  geom_point(size=2, shape=23, fill="royalblue",alpha=0.5)+
  labs(title="Relationship Between # of Steps and Marital Satisfaction",
       x="# of Steps Taken Towards Divorce",
       y="Marital Satisfaction")+
  geom_smooth(color="red",linewidth=1.5)+
  geom_smooth(method=lm,color="orange",linewidth=1.5)

Let’s set things up a little differently so I can have a legend. Below I added color=NULL in labs so that the legend would not have a title (had I put color="Line Type", Line Type would appear as the title of the legend). I mapped the string "Loess" onto color for the first geom_smooth (because I am mapping something like a variable onto color, it goes in aes) and I mapped the string "Linear regression" onto color for the second geom_smooth. scale_color_manual is not needed (try running the code without it to see what happens), but it allows me to provide the colors I want for the loess and linear regression lines.

ggplot(marriage,aes(x=jitter(steps), y=jitter(marital_satis)))+
  geom_point(size=2, shape=23, fill="royalblue",alpha=0.5)+
  labs(title="Relationship Between # of Steps and Marital Satisfaction", 
       x="# of Steps Taken Towards Divorce",
       y="Marital Satisfaction",
       color=NULL)+
  geom_smooth(aes(color="Loess"),linewidth=1.5)+
  geom_smooth(method=lm,aes(color="Linear regression"),linewidth=1.5)+
  scale_color_manual(values = c("Loess" = "red", "Linear regression" ="orange"))

Adding horizontal and vertical lines and text to graph

Now let’s modify the scatterplot by omitting the smoothed line and the regression line and adding a horizontal line at the mean of the y variable and a vertical line at the mean of the x variable.

  • geom_hline: used to provide a horizontal line in the graph. The yintercept value is where the horizontal line will cross the y axis. I chose for it to cross at the mean of marital_satis (I had to provide the data frame name before marital_satis because yintercept is set outside of aes, so R does not look for marital_satis in the data frame provided in ggplot). In geom_hline I also provided values to alter line type, thickness, and color.

  • geom_vline: used to provide a vertical line in the graph. The xintercept value is where the vertical line will cross the x axis. I chose for it to cross at the mean of steps (I had to provide the data frame name before steps because xintercept is set outside of aes, so R does not look for steps in the data frame provided in ggplot). In geom_vline I also provided values to alter line type, thickness, and color.

  • annotate: used to add text in any location in any graph. I used annotate to add text indicating what the horizontal and vertical lines represent. By providing geom="text" in annotate I am indicating I would like the text after the label argument to be provided in the graph at the location specified by the x and y values.

Explanation of the first annotate statement: I wanted to place the text “Mean of Steps=3.19” alongside the vertical line I added to the graph, so I changed the angle to 90 degrees. I made the x value for the text equal to the mean of steps (I could’ve put 3.19 here, but decided to let R calculate it for me by putting mean(marriage$steps)). I subtracted a little bit off that value because I didn’t want the text exactly on the line, but to the left of the line. I made the y coordinate value equal to 53 (where text is centered on y-axis). In all honesty, I played around with different values of y until it looked right. As far as the label, I could’ve just put: label="Mean of Steps=3.19", but I wanted R to calculate the mean for me, which is why you see paste (paste combines “Mean of Steps=” with the actual mean value). I used the round function with the mean to round the value to two decimal places.

ggplot(marriage,aes(x=jitter(steps), y=jitter(marital_satis)))+
  geom_point(size=2, shape=23, fill="royalblue",alpha=0.5)+
  labs(title="Relationship Between # of Steps and Marital Satisfaction", 
       x="# of Steps Taken Towards Divorce",y="Marital Satisfaction")+
  geom_hline(yintercept=mean(marriage$marital_satis),
             linetype="dashed",
             linewidth=2, 
             color="black")+
  geom_vline(xintercept=mean(marriage$steps),
             linetype="solid",
             linewidth=2, 
             color="black")+
  annotate(geom="text",
           x=mean(marriage$steps)-.3,
           y=53, 
           angle=90,
           label=c(paste("Mean of Steps=",
                         round(mean(marriage$steps),digits=2))))+
  annotate(geom="text",
           x=8, 
           y=mean(marriage$marital_satis)+3, 
           label=c(paste("Mean of Marital Satisfaction=",
                         round(mean(marriage$marital_satis),digits=2))))
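
A small detail about the labels built above: paste() puts a space between the pieces it combines unless told otherwise. The sketch below shows the difference using 3.186, the mean of steps reported by summary() earlier, typed in by hand so the chunk runs on its own.

```r
m <- 3.186   # mean of steps as reported by summary(marriage), typed in for illustration

# paste() inserts a space between the pieces
label_space <- paste("Mean of Steps=", round(m, digits = 2))

# paste0() joins the pieces with no separator
label_tight <- paste0("Mean of Steps=", round(m, digits = 2))

# paste() with sep = "" gives the same result as paste0()
label_sep <- paste("Mean of Steps=", round(m, digits = 2), sep = "")

label_space   # "Mean of Steps= 3.19"
label_tight   # "Mean of Steps=3.19"
```

Either version works in annotate; which one to use is just a matter of how you want the label to look.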

Marginal histograms/densities/boxplots on scatterplots

I’m a fan of marginal histograms/densities/boxplots on scatterplots. To add marginal information onto a scatterplot, we need to install and load package ggExtra. I then take a scatterplot and store it as an object (called p). Any graph we create can be stored as an object; we just haven’t done this yet. We need to store the scatterplot as an object in order to use the functions in ggExtra to add marginal information to our scatterplot.

#Install and load ggExtra package
if (require(ggExtra) == FALSE){
  install.packages("ggExtra")
}
library(ggExtra)

#Taking basic scatterplot and storing it in object p
p <- ggplot(marriage,aes(x=jitter(steps), y=jitter(marital_satis)))+
  geom_point(size=2, shape=23, fill="royalblue",alpha=0.5)+
  labs(title="Relationship Between # of Steps and Marital Satisfaction", 
       x="# of Steps Taken Towards Divorce",y="Marital Satisfaction")

#Looking at graph - it does not yet have marginal histograms
p

#Using ggMarginal from package ggExtra, providing p as argument and asking for marginal histograms
ggMarginal(p,type="histogram")

#Using ggMarginal from package ggExtra, providing p as argument and asking for marginal density plots
ggMarginal(p,type="density")

#Using ggMarginal from package ggExtra, providing p as argument and asking for marginal boxplots
ggMarginal(p,type="boxplot")

Scatterplots by group

Now let’s obtain scatterplots by group. We will illustrate by considering relationship between steps and marital satisfaction by affair status.

I am using color and shape in aes mapping in ggplot to make data point markers a different shape and color for those who have and have not had an affair. I am also adding linear regression lines (for each group). I used affairF here, but you could instead use factor(affair) everywhere you see affairF below. However, if you used factor(affair) the legend would have values of 0 and 1, not No Affair and Affair. In the labs function I am specifying color=NULL and shape=NULL so that the legend does not have a title. You can always omit one or both of color=NULL and shape=NULL to see what would happen.

ggplot(marriage,aes(x=steps, y=marital_satis,color=affairF, shape=affairF))+
    geom_point(size=2,alpha=0.5)+
    geom_smooth(method=lm)+
    labs(title="Relationship Between # of Steps and Marital Satisfaction", 
         x="# of Steps Taken Towards Divorce",y="Marital Satisfaction",
         color=NULL, shape=NULL) 

Rather than having a single graph with different marker colors/shape for each group, below I am using facet_wrap to create a separate graph for each group.

ggplot(marriage,aes(x=steps, y=marital_satis))+
    geom_point(size=2,alpha=0.5)+
    geom_smooth(method=lm)+
    facet_wrap(~affairF)+
    labs(title="Relationship Between # of Steps and Marital Satisfaction", 
         x="# of Steps Taken Towards Divorce",y="Marital Satisfaction") 

Below I am combining the two approaches above - I have different colored/shaped markers for those having an affair/not having an affair and I create separate graphs for males and females using facet_wrap(~genderF).

ggplot(marriage,aes(x=steps, y=marital_satis,color=affairF, shape=affairF))+
    geom_point(size=2,alpha=0.5)+
    geom_smooth(method=lm)+
    facet_wrap(~genderF)+
    labs(x="# of Steps Taken Towards Divorce",y="Marital Satisfaction", color=NULL, shape=NULL) 

Scatterplots, correlations, histograms using pairs.panels

A great way to explore relationships between continuous variables in your data and get a feel for their distributions is by using the pairs.panels function in the package psych. This is not part of ggplot, so we are stepping away from that approach momentarily. Big thanks to Jeanne Horst for helping me learn about pairs.panels and many other ways of graphing in R.

if (require(psych) == FALSE){
  install.packages("psych")
}
library(psych)
#Here I am just running pairs.panels on 4 variables from my marriage data frame
pairs.panels(marriage[,c("steps","marital_satis","sex_dis","afc")])

#Doing the same, but illustrating how you can change all kinds of features
#Check out the pairs.panels documentation to learn more

pairs.panels(marriage[,c("steps","marital_satis","sex_dis","afc")], 
             ellipses = TRUE, 
             method = "pearson",
             pch = 18, 
             cor = TRUE,
             jiggle = FALSE,
             hist.col = "thistle", 
             show.points = TRUE,
             rug = TRUE, 
             cex.cor = 0.75,
             smoother = TRUE, 
             stars = FALSE, 
             ci=TRUE, 
             alpha =0.05)

Histograms

Use geom_histogram to create a basic histogram. You only need to supply a single variable, which here is marital_satis.

ggplot(marriage, aes(x=marital_satis))+
  geom_histogram()

Let’s change color of the box border to white and the fill of the box to lightblue. Let’s also change x axis label.

ggplot(marriage, aes(x=marital_satis))+
  geom_histogram(fill="lightblue", color="white")+
  labs(x="Marital Satisfaction")

Illustrating binwidth modification and grid.arrange

Note that we are getting a message in our console saying that the default # of bins is 30 and telling us to pick a better value with binwidth (binwidth = width of each bin in units of the variable, bins = # of bins). Below I am using geom_density to obtain a density plot and geom_histogram to obtain histograms with various binwidths. I am assigning each graph to an object and then using the grid.arrange function from the gridExtra package to arrange all graphs on a single page (you can use the grid.arrange function with any graphs you create).

#install and load gridExtra package
if (require(gridExtra) == FALSE){
  install.packages("gridExtra")
}  
library(gridExtra)

dens <- ggplot(marriage, aes(x=marital_satis))+
  geom_density()+
  labs(title="density", x="Marital Satisfaction")

#The default is 30 bins (bins = 30), which is not the same as binwidth = 30
bw30 <- ggplot(marriage, aes(x=marital_satis))+
  geom_histogram(fill="lightblue", color="white",bins = 30)+
  labs(title = "bins = 30 (default)",x="Marital Satisfaction")

bw5 <- ggplot(marriage, aes(x=marital_satis))+
  geom_histogram(fill="lightblue", color="white",binwidth = 5) +
  labs(title = "binwidth = 5", x="Marital Satisfaction")

bw10 <- ggplot(marriage, aes(x=marital_satis))+
  geom_histogram(fill="lightblue", color="white",binwidth = 10) +
  labs(title = "binwidth = 10", x="Marital Satisfaction")

bw20 <- ggplot(marriage, aes(x=marital_satis))+
  geom_histogram(fill="lightblue", color="white",binwidth = 20) +
  labs(title = "binwidth = 20", x="Marital Satisfaction")

bw50 <- ggplot(marriage, aes(x=marital_satis))+
  geom_histogram(fill="lightblue", color="white",binwidth = 50) +
  labs(title = "binwidth = 50", x="Marital Satisfaction")

grid.arrange(dens,bw30,bw5,bw10,bw20,bw50, nrow=3)
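
To make the difference between the number of bins and the bin width concrete, the sketch below builds two histograms from simulated scores (invented for illustration, not the marriage data) and uses ggplot2's layer_data() function to pull out the bars that were actually computed. The seed and the simulated range of 40 to 115 are arbitrary choices.

```r
library(ggplot2)

set.seed(13)  # arbitrary seed
# Simulated scores, invented for illustration (not the marriage data)
demo <- data.frame(marital_satis = round(runif(200, min = 40, max = 115)))

p_bins <- ggplot(demo, aes(x = marital_satis)) + geom_histogram(bins = 30)
p_bw5  <- ggplot(demo, aes(x = marital_satis)) + geom_histogram(binwidth = 5)

# layer_data() returns one row per bar, with xmin and xmax giving the bar's edges
d_bins <- layer_data(p_bins)
d_bw5  <- layer_data(p_bw5)

nrow(d_bins)                                 # number of bars when bins = 30
nrow(d_bw5)                                  # number of bars when binwidth = 5
unique(round(d_bw5$xmax - d_bw5$xmin, 6))    # width of each bar when binwidth = 5
```

With binwidth = 5 every bar spans 5 points on the marital satisfaction scale, so the number of bars depends on the range of the data; with bins = 30 the number of bars is fixed and the width is whatever is needed to cover the range.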

Illustrating here how we can obtain separate histograms by gender, with one on top of the other, by using facet_wrap with the ncol argument equal to 1. The last line suppresses the printing of the legend. Run the code without that line to see the legend; you’ll see that the information it provides is not needed. We’ll talk more about the theme function later in today’s lecture.

ggplot(marriage, aes(x=marital_satis, fill=genderF))+
  geom_histogram(color="white",binwidth=5)+
  labs(x="Marital Satisfaction")+
  facet_wrap(~genderF, ncol=1)+
  theme(legend.position = "none")

In this example I overlay the two densities for each gender for sexual dissatisfaction. The color argument in the global aes mapping controls the line color and fill controls the fill color. The alpha argument on the geom_density function can range from 0 to 1, with values closer to 0 yielding more transparency. I use color=NULL and fill=NULL in the labs function to control the appearance of the legend.

ggplot(marriage, aes(x=sex_dis, color=genderF, fill=genderF))+
  geom_density(alpha=0.2)+
  labs(x="Sexual Dissatisfaction", color=NULL, fill=NULL)

Here I provide the same density overlay plot, but without the use of fill.

ggplot(marriage, aes(x=sex_dis, color=genderF))+
  geom_density(linewidth=0.75)+
  labs(x="Sexual Dissatisfaction", color=NULL)

Boxplots

We use geom_boxplot to obtain boxplots. You need to supply only one variable (and as a y variable; you can always change y to x to see why we use y here).

ggplot(marriage, aes(y=marital_satis))+
  geom_boxplot()

Obtaining boxplots for marital satisfaction by gender in same graph.

ggplot(marriage, aes(x=genderF, y=marital_satis))+
  geom_boxplot()

Obtaining boxplots for marital satisfaction by gender in same graph, with separate graphs for those who have and have not had an affair.

ggplot(marriage, aes(x=genderF, y=marital_satis))+
  geom_boxplot()+
  facet_wrap(~affairF)

As with the other examples, you can use labs to add titles/change axis labels. You can also change other features (e.g., fill color of boxplots) as we have done in prior examples.

I’ve never been a fan of boxplots and it appears I am not alone. A warning about boxplots is provided here. They suggest using violin plots or ridgeline plots instead (same website provides ggplot examples for those).

Bar Plots: Single Categorical Variable

Plotting frequencies

When we learned about graphing in SAS, we started the section on bar plots by plotting the frequencies for a single categorical variable, steps. We will do the same here, using geom_bar. Notice that I supply a single variable, steps, and map it to the x axis.

# Notice that I supply a single variable, steps, and map it to the x axis.
ggplot(marriage,aes(x=steps))+
  geom_bar()

#If you want to have a horizontal bar graph, not a vertical bar graph, simply map the variable to the y axis (instead of x axis).
ggplot(marriage,aes(y=steps))+
  geom_bar()

#Or map the variable to the x axis and use coord_flip() to flip graph
ggplot(marriage,aes(x=steps))+
  geom_bar()+
  coord_flip()

Plotting proportions

To plot proportions instead of frequencies, we have MANY choices. The approach I prefer involves making a data frame (which I am calling toplot0 below) using dplyr functions available from the package tidyverse. This data frame will consist of the statistics we wish to plot - so rather than having ggplot calculate those statistics from the raw data, I am calculating them before using ggplot and storing them in a data frame I will use as input into ggplot. A good tutorial on the use of dplyr for manipulating data is here (see section called “Wrangling data with dplyr functions”).

Creating toplot0

  • When you see |> it means “pipe”, which I think of as “use this in”.

  • marriage |> means pipe (use) marriage data frame in next command.

  • group_by(steps) means to take marriage data frame and group records by steps. |> at end means pipe (use) the result of this in next command.

  • summarize means create a new summarized variable. My new summarized variable is called stepsfreq, which will be the # (n) of people who have taken each number of steps (the frequency). |> at end means pipe (use) the result of this in next command.

  • mutate adds one or more variables and is used here to create a variable called stepsprop which equals stepsfreq/sum(stepsfreq), which is the proportion of people who have taken each number of steps.

toplot0<-marriage |>
  group_by(steps) |> 
  summarize(stepsfreq=n()) |> 
  mutate(stepsprop = stepsfreq / sum(stepsfreq))

#taking a look at toplot0 making sure it has the information I want to plot...it does
toplot0
## # A tibble: 12 × 3
##    steps stepsfreq stepsprop
##    <dbl>     <int>     <dbl>
##  1     0        63   0.240  
##  2     1        26   0.0989 
##  3     2        32   0.122  
##  4     3        26   0.0989 
##  5     4        33   0.125  
##  6     5        24   0.0913 
##  7     6        24   0.0913 
##  8     7        16   0.0608 
##  9     8        12   0.0456 
## 10     9         5   0.0190 
## 11    10         1   0.00380
## 12    11         1   0.00380
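
The same pipeline can be tried on a tiny made-up data frame where the answer is easy to check by eye. The sketch below (demo, by_hand, and shortcut are names invented for illustration) also shows dplyr's count() function as a shorthand for the group_by plus summarize steps.

```r
library(dplyr)

# Made-up data: 10 people and the number of steps each has taken
demo <- data.frame(steps = c(0, 0, 0, 0, 1, 1, 2, 2, 2, 3))

# Same three-step pipeline as toplot0
by_hand <- demo |>
  group_by(steps) |>
  summarize(stepsfreq = n()) |>
  mutate(stepsprop = stepsfreq / sum(stepsfreq))

# count() is shorthand for group_by() + summarize(n = n());
# the name argument sets the name of the frequency column
shortcut <- demo |>
  count(steps, name = "stepsfreq") |>
  mutate(stepsprop = stepsfreq / sum(stepsfreq))

by_hand
sum(by_hand$stepsprop)   # proportions across all rows add up to 1
```

Four of the ten made-up people took 0 steps, so the first row has a frequency of 4 and a proportion of 0.4.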

Now I will use ggplot with toplot0, mapping steps to x axis and stepsprop to y axis. We must include stat="identity" because geom_bar wants to do the calculations for us, but we have already provided the values that need to be plotted (no calculations are needed).

ggplot(toplot0,aes(x=steps,y=stepsprop))+
  geom_bar(stat="identity")

#or just use geom_col (and you don't have to provide stat="identity")
ggplot(toplot0,aes(x=steps,y=stepsprop))+
  geom_col()

Data value labels in bar plot

A benefit of creating a data frame with the information we need to use in the bar plot is that it makes it easy to provide frequencies or proportions as value labels in the chart. We do so below by adding geom_text and mapping either the variable stepsfreq or stepsprop (depending on which we are plotting) to label. The vjust argument shifts the vertical location of the label relative to the top of the bar: a value of 0 places the bottom of the label right at the top of the bar, and negative values move the label farther up, so -0.5 leaves a small gap between the bar and the label. The size argument controls the size of the label.

#Plotting frequencies from data frame toplot0 and adding value labels
ggplot(toplot0,aes(x=steps,y=stepsfreq))+
  geom_col()+
  geom_text(aes(label=stepsfreq), vjust = -0.5, size=3)

#Plotting proportions from data frame toplot0 and adding value labels (note that I needed to round values of proportions used as labels)
ggplot(toplot0,aes(x=steps,y=stepsprop))+
  geom_col()+
  geom_text(aes(label=round(stepsprop,digits=2)), vjust = -0.5, size=3)

Bar Plots: Two Categorical Variables

After creating a bar graph for a single categorical variable in a previous lecture, we then considered how to create various bar graphs for 2 categorical variables. For instance, suppose we were interested in whether the distribution of steps differed for males and females.

Separate bar plots for each group

We could use facet_wrap to create different bar graphs of steps for males and females. Notice how I am using the marriage data frame as input here. fill=genderF makes the bars different colors for males and females, facet_wrap(~genderF) creates separate bar graphs for males and females, and ncol=1 arranges the separate graphs in a single column. I am omitting the legend with show.legend=FALSE.

ggplot(marriage, aes(x=steps, fill=genderF))+
  geom_bar(show.legend=FALSE)+
  facet_wrap(~genderF, ncol=1)

Stacked Bar Chart

We also used a stacked bar chart with steps and gender (although I am NOT a fan of stacked bar charts).

ggplot(marriage, aes(x=steps, fill=genderF))+
  geom_bar()+
  labs(fill=NULL)

100% Stacked Bar Chart

We also used a 100% stacked bar chart with steps and gender (I am also NOT a fan of 100% stacked bar charts). I always find it hard to interpret the 100% bar graphs, so here is some guidance using two values of steps:

  • Of those who have taken 0 steps, 51% are female (reddish) and 49% are male (bluish).
  • For those who have taken 8 steps, 67% are female (reddish) and 33% are male (bluish).

ggplot(marriage, aes(y=steps, fill=genderF))+
  geom_bar(position="fill")+
  #line below ensures all values of steps are shown and with 0 at top of y axis and 11 at the bottom
  scale_y_continuous(breaks=seq(0,11,1), trans = "reverse")+
  #line below puts legend at bottom of chart
  theme(legend.position="bottom")+
  #line below changes x axis title so that it is proportion
  labs(x="proportion", fill=NULL)

Clustered or Grouped Bar Chart

I like the separate bar charts of steps for each gender that we already created. I also like clustered or grouped bar charts. To create such charts in ggplot, we have to create a data frame with the information that we need, which is the number of people with each combination of values of steps and gender (e.g., # of males with 0 steps). Below I am making a data frame (which I am calling toplot1) using dplyr functions, where the variable n is the # of people with each combination of values of steps and gender. Notice I group data by both steps and genderF and then use count to create n.

#Creating toplot1
toplot1<-marriage |>
  group_by(steps, genderF) |> 
  count() 

#Viewing data frame to make sure it includes data I need for the plot
toplot1
## # A tibble: 22 × 3
## # Groups:   steps, genderF [22]
##    steps genderF     n
##    <dbl> <fct>   <int>
##  1     0 Female     32
##  2     0 Male       31
##  3     1 Female     11
##  4     1 Male       15
##  5     2 Female     10
##  6     2 Male       22
##  7     3 Female     14
##  8     3 Male       12
##  9     4 Female     16
## 10     4 Male       17
## # … with 12 more rows

In ggplot below I am using toplot1 as input, with genderF mapped to fill (bars will be different colors for males and females). steps is mapped to the x axis and n is mapped to the y axis. position="dodge" makes it a clustered/grouped bar chart.

ggplot(toplot1, aes(x=steps,y=n,fill=genderF)) + 
  geom_col(position="dodge")+
  #Instead of line above, could have used: geom_bar(position="dodge",stat="identity") 
  #line below adds value labels centered above each bar and of size 3
  geom_text(aes(label=n),
            position = position_dodge(0.9),
            vjust=-.5, 
            size=3)+
  #line below puts legend at bottom of chart
  theme(legend.position="bottom")+
  #line below changes y axis title 
  labs(y="frequency", fill=NULL)

Plotting Means by Group

In a previous lecture we also used SAS to plot means by group. Specifically, we plotted the mean of steps by gender. To do so in R, we first use dplyr to create a data frame (I am calling it toplot2) that contains the means of the variable steps by gender (our grouping variable). We also calculate the standard error of the mean (SE) for each group which we use to obtain the upper and lower limits of our confidence interval. The qt function is used to get the t critical value.

toplot2 <- marriage |> 
  group_by(genderF) |> # Group the data by genderF
  summarize(mean_steps=mean(steps), # Create variable with mean of steps per gender
            sd_steps=sd(steps), # Create variable with sd of steps per gender
            N_steps=n(), # Create new variable N of steps per gender
            se=sd_steps/sqrt(N_steps), # Create variable with se of steps per gender
            upper_limit=mean_steps+se*qt(p=.05/2, df=N_steps-1,lower.tail=FALSE), # Upper limit
            lower_limit=mean_steps-se*qt(p=.05/2, df=N_steps-1,lower.tail=FALSE) # Lower limit
            ) 

#Viewing toplot2 data frame to make sure all the information I need to plot is there
toplot2
## # A tibble: 2 × 7
##   genderF mean_steps sd_steps N_steps    se upper_limit lower_limit
##   <fct>        <dbl>    <dbl>   <int> <dbl>       <dbl>       <dbl>
## 1 Female        3.46     2.79     132 0.243        3.94        2.98
## 2 Male          2.91     2.55     131 0.223        3.35        2.47

To double check the calculations, I use t.test with the data limited to variable steps and either only males or females to make sure the lower and upper bounds of 95% CI are correct in toplot2.

femaleonly <-  marriage[which(marriage$genderF=="Female"),c("steps")]
maleonly <-  marriage[which(marriage$genderF=="Male"),c("steps")]

t.test(femaleonly)
## 
##  One Sample t-test
## 
## data:  femaleonly
## t = 14.235, df = 131, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  2.981001 3.943241
## sample estimates:
## mean of x 
##  3.462121
t.test(maleonly)
## 
##  One Sample t-test
## 
## data:  maleonly
## t = 13.057, df = 130, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  2.467735 3.349059
## sample estimates:
## mean of x 
##  2.908397
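
The same double check can be done in code rather than by eye. This is a minimal sketch on a small illustrative vector (x is made up for the example, not a group from the marriage data): it computes the limits with the same qt-based formula used for toplot2 and compares them to the interval reported by t.test.

```r
# Small illustrative vector (not one of the gender groups)
x <- c(4, 7, 4, 8, 2, 6, 0, 2, 6, 0)

n     <- length(x)
se    <- sd(x) / sqrt(n)                                # standard error of the mean
tcrit <- qt(p = .05/2, df = n - 1, lower.tail = FALSE)  # t critical value for a 95% CI

manual_ci <- c(mean(x) - tcrit * se, mean(x) + tcrit * se)
ttest_ci  <- as.numeric(t.test(x)$conf.int)             # as.numeric drops the conf.level attribute

# all.equal compares the two intervals within numerical tolerance
all.equal(manual_ci, ttest_ci)
```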

To create the plot of means by group with 95% CIs, I input toplot2 to ggplot and map genderF and mean_steps to the x and y axes, respectively. geom_point provides points for each mean, and I adjusted the point size to 2. geom_errorbar provides error bars; I am mapping lower_limit and upper_limit from toplot2 to ymin and ymax, respectively. I am forcing the range of the y axis to be 1 to 5 with ylim(1,5). I am adding the values of the means as text with the two annotate lines (x is the position on the x axis, y is the position on the y axis, and label is the actual text). hjust=-0.2 places the text to the right of the point. I am using labs to provide labels for the axes.

ggplot(toplot2,aes(x=genderF, y=mean_steps))+
  geom_point(size=2)+
  geom_errorbar(aes(ymin=lower_limit, ymax=upper_limit))+
  ylim(1,5)+
  annotate(geom="text",x="Female",y=3.462121,label="3.46",hjust=-0.2)+
  annotate(geom="text",x="Male",y=2.908397 ,label="2.91",hjust=-0.2)+
  labs(x="Gender", y="Average # of steps")

Note that instead of using y=3.462121 in the first annotate function, I could’ve used toplot2[1,2], which is the location of the value 3.462121 in the toplot2 data frame. Instead of using y=2.908397, I could’ve used toplot2[2,2].
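
Another way to avoid typing the means by hand is to let geom_text pull the labels from the data frame. The sketch below is one possible version, not the approach used above. To keep it self-contained it rebuilds a small data frame (means_df, a hypothetical name) from the rounded values printed in the toplot2 output; with the real toplot2 you would pass that data frame directly.

```r
library(ggplot2)

# Rounded summary values copied from the toplot2 output
means_df <- data.frame(
  genderF     = factor(c("Female", "Male")),
  mean_steps  = c(3.46, 2.91),
  lower_limit = c(2.98, 2.47),
  upper_limit = c(3.94, 3.35)
)

p_means <- ggplot(means_df, aes(x = genderF, y = mean_steps)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = lower_limit, ymax = upper_limit)) +
  # label comes from the data, formatted to 2 decimals, placed right of the point
  geom_text(aes(label = sprintf("%.2f", mean_steps)), hjust = -0.4) +
  ylim(1, 5) +
  labs(x = "Gender", y = "Average # of steps")

p_means
```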

If you search the web, you’ll find other approaches to plotting means with 95% CIs. Many of these other approaches will use the data frame (e.g., marriage) in ggplot and use statistical functions within ggplot to obtain means and SEs. That is fine to do, I just personally (at this point in my learning) like to calculate the stats myself, place those stats in a data frame and plot the information in that data frame.
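
For comparison, here is a minimal sketch of one such within-ggplot approach, using stat_summary with a user-written summary function. The data frame toy and the helper mean_ci95 are made up for this illustration; the helper uses the same qt-based formula as the toplot2 code and returns the y, ymin, and ymax columns that stat_summary expects from fun.data.

```r
library(ggplot2)

# Hypothetical helper: mean and 95% CI limits for a numeric vector
mean_ci95 <- function(y) {
  n     <- length(y)
  se    <- sd(y) / sqrt(n)
  tcrit <- qt(p = .05/2, df = n - 1, lower.tail = FALSE)
  data.frame(y = mean(y), ymin = mean(y) - tcrit * se, ymax = mean(y) + tcrit * se)
}

# Made-up raw data: one row per person, not pre-summarized
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  score = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)
)

p_ci <- ggplot(toy, aes(x = group, y = score)) +
  stat_summary(fun.data = mean_ci95, geom = "errorbar") +  # CI computed inside ggplot
  stat_summary(fun = mean, geom = "point", size = 2)       # group means as points

p_ci
```

The trade-off is that the summary statistics never appear in a data frame you can inspect, which is why calculating them yourself first can feel safer.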

Let’s look at another example where we plot the means for sexual dissatisfaction by two grouping variables: genderF and affairF. We did this in SAS in a previous lecture. To do so in R, we first use dplyr to create a data frame (I am calling it toplot3) that contains the means of the variable sex_dis by gender and affair status (our two grouping variables). We also calculate the standard error of the mean (SE) for each group combination which we use to obtain the upper and lower limits of our confidence interval.

toplot3 <- marriage|>
  group_by(genderF,affairF) |> 
  summarize(mean_sexdis=mean(sex_dis), 
            sd_sexdis=sd(sex_dis), 
            N_sexdis=n(), 
            se=sd_sexdis/sqrt(N_sexdis), 
            upper_limit=mean_sexdis+se*qt(p=.05/2, df=N_sexdis-1,lower.tail=FALSE), 
            lower_limit=mean_sexdis-se*qt(p=.05/2, df=N_sexdis-1,lower.tail=FALSE) 
            ) 

#Viewing data to make sure all the data is in there that I need for plotting purposes
toplot3
## # A tibble: 4 × 8
## # Groups:   genderF [2]
##   genderF affairF   mean_sexdis sd_sexdis N_sexdis    se upper_limit lower_limit
##   <fct>   <fct>           <dbl>     <dbl>    <int> <dbl>       <dbl>       <dbl>
## 1 Female  No Affair        60.4     10.4       113 0.979        62.3        58.5
## 2 Female  Affair           62.8      9.57       19 2.20         67.5        58.2
## 3 Male    No Affair        58.5      9.62      112 0.909        60.3        56.7
## 4 Male    Affair           64.8      7.47       19 1.71         68.4        61.2

Here I am using ggplot to create a graph similar to one from SAS in a previous lecture. I am using toplot3 as the data frame and mapping affairF to the x axis, mean_sexdis to the y axis, and genderF to color and shape (any colors or shapes in the graph will differ depending on the value of genderF). geom_point is used to provide points for each of the 4 means, with the size of the points set to 2. geom_line with genderF mapped to group provides lines that connect the means within each gender. ylim(50,70) limits the y axis range. theme(legend.position = "bottom") forces the legend to the bottom of the graph. annotate provides text in the graph (here I am providing the values for each mean; instead I could have used the location of each mean within the toplot3 data frame, e.g., toplot3[1,3]). labs is used to label the x and y axes.

ggplot(toplot3,aes(x=affairF, y=mean_sexdis,color=genderF,shape=genderF))+
  geom_point(size=2)+
  geom_line(aes(group=genderF))+
  ylim(50,70)+
  theme(legend.position = "bottom")+
  annotate(geom="text",x="No Affair",y=60.4,label="60.4",hjust=1.2)+
  annotate(geom="text",x="No Affair",y=58.5 ,label="58.5",hjust=1.2)+
  annotate(geom="text",x="Affair",y=62.8,label="62.8",hjust=-0.2)+
  annotate(geom="text",x="Affair",y=64.8 ,label="64.8",hjust=-0.2)+
  labs(x="Affair Status", y="Average Sexual Dissatisfaction")

I could also create a graph with the 95% CIs for each of the four means. I am using facet_wrap with genderF because the graph gets busy with all four 95% CIs in a single panel.

ggplot(toplot3,aes(x=affairF, y=mean_sexdis,color=genderF))+
  geom_point(size=2)+
  geom_errorbar(aes(ymin=lower_limit, ymax=upper_limit))+
  facet_wrap(~genderF)+
  ylim(50,70)+
  theme(legend.position = "none")+
  labs(x="Affair Status", y="Average Sexual Dissatisfaction")

Themes

Themes allow you to customize your graph. So far, we have been using ggplot’s default theme, but there are several other themes available to us (and you could even create your own personalized theme if you want to). To get a feel for different themes, let’s take one of our graphs and save it as an object.

graph1 <-
  ggplot(marriage,aes(x=steps, y=marital_satis,color=affairF, shape=affairF))+
  geom_point(size=2,alpha=0.5)+
  geom_smooth(method=lm)+
  facet_wrap(~genderF)+
  labs(x="# of Steps Taken Towards Divorce",y="Marital Satisfaction", color=NULL, shape=NULL) 

#We have to run the code below to "see" the graph. It is using the default theme or theme_gray (or theme_grey).
graph1

Now let’s apply various themes to see what they look like.

#I could apply the theme as another layer in ggplot

ggplot(marriage,aes(x=steps, y=marital_satis,color=affairF, shape=affairF))+
  geom_point(size=2,alpha=0.5)+
  geom_smooth(method=lm)+
  facet_wrap(~genderF)+
  labs(x="# of Steps Taken Towards Divorce",y="Marital Satisfaction", color=NULL, shape=NULL) +
  theme_minimal()

#Since I saved the graph as an object, I can just add the theme as a layer to the object
#You'll see people do this all the time when using ggplot

graph1+theme_minimal()

Let’s look at many themes. I’m using grid.arrange (from the gridExtra package) to arrange four graphs at a time into a single figure.

g1 <-graph1+labs(title="theme_grey(Default)")
g2 <-graph1+theme_minimal()+labs(title="theme_minimal")
g3 <-graph1+theme_bw()+labs(title="theme_bw")
g4 <-graph1+theme_light()+labs(title="theme_light")
g5 <-graph1+theme_dark()+labs(title="theme_dark")
g6 <-graph1+theme_void()+labs(title="theme_void")
g7 <-graph1+theme_classic()+labs(title="theme_classic")
g8 <-graph1+theme_linedraw()+labs(title="theme_linedraw")


grid.arrange(g1,g2,g3,g4, nrow=2)

grid.arrange(g5,g6,g7,g8, nrow=2)

If you fall in love with a theme, you can use the syntax below (would need to run it before running ggplot to create graphs) so that the theme applies to all graphs being created: theme_set(theme_minimal())
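
A small sketch of how that might look in a script. theme_set returns the theme that was previously active, so you can store it and switch back later; old_theme is just a name chosen for this example.

```r
library(ggplot2)

# Make theme_minimal the default and keep the previous default in old_theme
old_theme <- theme_set(theme_minimal())

# ... any ggplot created here uses theme_minimal() without adding it as a layer ...

# Restore the earlier default when you are done
theme_set(old_theme)
```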

What if you want to modify components (e.g., titles, labels, fonts, background, gridlines, and legends) of an existing theme? Useful examples of doing so are here and arguments to provide for element_text, element_line, etc. are here.

I am using theme() below to put the legend at the top, remove panel borders and gridlines, change size of titles for axes, change size of text for values on axes (and angle x values 30 degrees), and change font elements/color.

ggplot(marriage,aes(x=steps, y=marital_satis,color=affairF, shape=affairF))+
  geom_point(size=2,alpha=0.5)+
  geom_smooth(method=lm)+
  facet_wrap(~genderF)+
  labs(x="# of Steps Taken Towards Divorce",y="Marital Satisfaction",
       title="Relationship bt Steps & Marital Satis.", color=NULL, shape=NULL)+
  theme (#Place legend at top
        legend.position = "top",
        #Hide panel borders and remove grid lines
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        #X axis title
        axis.title.x=element_text(size=8),  
        #Y axis title
        axis.title.y=element_text(size=8),
        #X axis text
        axis.text.x=element_text(size=5,angle = 30), 
        #Y axis text
        axis.text.y=element_text(size=5),
        #plot title
        plot.title=element_text(size=12, 
                                face="bold", 
                                color="darkblue",
                                hjust=0.5) 
        )

#hjust=0.5 centers title (values range from 0=left justified to 1=right justified)
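
If you find yourself repeating the same theme() settings, one option is to store them in an object and add that object to any plot. This is only a sketch: my_theme is a hypothetical name, the settings are a subset of those above, and the built-in mtcars data stands in for the marriage data so the example runs on its own.

```r
library(ggplot2)

# A reusable bundle of settings: start from a complete theme, then adjust pieces
my_theme <- theme_minimal() +
  theme(legend.position  = "top",
        panel.grid.minor = element_blank(),
        axis.title.x     = element_text(size = 8),
        axis.title.y     = element_text(size = 8),
        plot.title       = element_text(size = 12, face = "bold",
                                        color = "darkblue", hjust = 0.5))

# Toy plot using built-in data; add my_theme like any other layer
p_themed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Toy plot with my_theme") +
  my_theme

p_themed
```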

Color Palettes

Sometimes you don’t want to change the theme, you just want to use a different palette of colors. We have already seen how to change the color of lines/fills/borders for individual graph components. Here I am changing the actual color palette being used. Again, you could add this within the ggplot script or, if the plot is assigned to an object, just add scale_color_??? to the object as I am doing below. A good resource on this is here.

gg1 <-graph1+scale_color_viridis_d()+labs(title="scale_color_viridis_d")
gg2 <-graph1+scale_color_brewer()+labs(title="scale_color_brewer")
gg3 <-graph1+scale_color_brewer(palette="Dark2")+labs(title="scale_color_brewer(palette=Dark2)")
gg4 <-graph1+scale_color_hue()+labs(title="scale_color_hue")

grid.arrange(gg1,gg2,gg3,gg4,nrow=2)

A particularly useful color palette is one that is colorblind safe. In order to use scale_color_colorblind, you must first install and load the package ggthemes. If you are ever worried that your color palette is not colorblind friendly, you can upload an image to this website to see how it appears under different kinds of colorblindness.

#install and load ggthemes package
if (require(ggthemes) == FALSE){
  install.packages("ggthemes")
}
library(ggthemes)

graph1+scale_color_colorblind()+labs(title="scale_color_colorblind")
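
If you would rather not install an extra package, a possible alternative (assuming R 4.0.0 or later) is the Okabe-Ito palette that ships with base R, supplied to scale_color_manual. This is a sketch using the built-in mtcars data as a stand-in for graph1; okabe_ito and p_cb are names made up for the example.

```r
library(ggplot2)

# Okabe-Ito colors from base R; unname() so colors are assigned by position
okabe_ito <- unname(palette.colors(n = 8, palette = "Okabe-Ito"))

# Toy plot with a two-level grouping variable
p_cb <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point(size = 2) +
  scale_color_manual(values = okabe_ito[2:3]) +  # skip the first color, which is black
  labs(color = NULL, title = "Okabe-Ito via scale_color_manual")

p_cb
```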

Saving graphs

Use the function ggsave to save a graph as an external file.

The first argument is the path and filename for the file you will save (if you omit the path, the file is saved in the working directory). The second argument, plot=, specifies which graph to save. It is a good idea to assign the plot to an object and supply the object name as the value of the plot argument. If you leave plot= out, ggsave saves the last plot displayed.

You can also save as jpg, bmp, tiff, pdf, etc.; ggsave picks the file type from the filename extension. You can also specify height, width, dpi, etc.
See handy reference here.

ggsave("graph1.png",plot = graph1)
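
A sketch of a call that also sets the size and resolution. The plot and the file name are placeholders: p_save is a toy plot built from the built-in mtcars data, and the file is written to R's temporary directory so the example does not touch your working directory.

```r
library(ggplot2)

# Toy plot standing in for graph1
p_save <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Hypothetical file name in the session's temporary folder
out_file <- file.path(tempdir(), "example_plot.png")

# 6 x 4 inch image at 300 dots per inch
ggsave(out_file, plot = p_save, width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```

Setting width, height, and dpi explicitly is worth the extra typing, because otherwise the saved size depends on whatever size your plot window happens to be.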