How to Make a Scatterplot in ggplot2

By: Greg Simonson, Aleksis Kincaid, and Michael Herriges

In this tutorial we will walk you through how to make a series of scatterplots using ggplot2.

Obtaining and Loading Data

We retrieved our data from Table 1 on page 15 in the following article by Sullivan and Artiles:

Sullivan, A.L., & Artiles, A.J. (2011). Theorizing racial inequity in special education: Applying structural inequity theory to disproportionality. Urban Education, 46, 1526-1552.

Table 1

These data were entered into a Microsoft Excel file in “long” format, with one row per combination of disability category and ethnicity, and saved as a CSV file.
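As a minimal sketch of what "long" format means here, the following builds a small data frame by hand from four rows of the data set. The object name `long_example` is our own illustrative choice and is not used elsewhere in the tutorial.

```r
# Four rows copied from the data set, entered in long format:
# each row is one Disability x Ethnicity combination, and each
# measure (StateRRR, LEALow, LEAHigh) has its own column.
long_example <- data.frame(
  Disability = c("SPED", "SLD", "SPED", "SLD"),
  StateRRR   = c(1.22, 1.38, 0.96, 1.19),
  Ethnicity  = c("AA", "AA", "HI", "HI"),
  LEALow     = c(17.50, 27.50, 11.18, 9.41),
  LEAHigh    = c(23.8, 29.7, 9.7, 19.5)
)

print(long_example)
```

Long format suits ggplot2 because a single column (here Ethnicity) can be mapped to color or shape.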

Raw Data

##    Disability StateRRR Ethnicity LEALow LEAHigh
## 1        SPED     1.22        AA  17.50    23.8
## 2         SLD     1.38        AA  27.50    29.7
## 3         SLI     0.82        AA  43.75     9.2
## 4          ED     1.48        AA  37.50    18.9
## 5        MIMR     2.42        AA  35.00    27.0
## 6         OHI     0.85        AA  48.75    10.8
## 7      Autism     0.89        AA  43.13     6.6
## 8          HI     0.76        AA  35.00     1.1
## 9          MD     1.30        AA  37.50     9.7
## 10       MOMR     1.53        AA  41.25    10.8
## 11         OI     0.71        AA  35.63     4.9
## 12        TBI     1.00        AA  31.88     1.6
## 13         VI     0.86        AA  27.50     2.7
## 14       SPED     0.96        HI  11.18     9.7
## 15        SLD     1.19        HI   9.41    19.5
## 16        SLI     0.86        HI  21.76     9.7
## 17         ED     0.37        HI  57.65     5.9
## 18       MIMR     1.38        HI  21.18    15.7
## 19        OHI     0.40        HI  50.59     3.2
## 20     Autism     0.42        HI  46.47     3.2
## 21         HI     1.12        HI  20.00     8.6
## 22         MD     0.95        HI  25.29     5.9
## 23       MOMR     1.35        HI  24.71    15.1
## 24         OI     0.71        HI  27.65     4.3
## 25        TBI     0.75        HI  20.59     5.4
## 26         VI     0.71        HI  23.53     2.7
## 27       SPED     1.26       NAM  16.46    23.8
## 28        SLD     1.71       NAM  21.34    39.5
## 29        SLI     0.90       NAM  39.02    15.7
## 30         ED     0.65       NAM  51.22    11.4
## 31       MIMR     1.49       NAM  20.73    16.8
## 32        OHI     0.45       NAM  59.15     5.4
## 33     Autism     0.03       NAM  39.63     3.8
## 34         HI     1.71       NAM  29.88     4.3
## 35         MD     1.70       NAM  21.34    11.4
## 36       MOMR     1.41       NAM  43.29    10.8
## 37         OI     1.00       NAM  37.20     4.3
## 38        TBI     1.50       NAM  28.66     4.9
## 39         VI     1.57       NAM  28.66     3.8
## 40       SPED     0.58       API  53.10     9.7
## 41        SLD     0.42       API  68.28     5.9
## 42        SLI     0.79       API  53.79     9.7
## 43         ED     0.21       API  79.31     2.7
## 44       MIMR     0.69       API  61.38     4.9
## 45        OHI     0.40       API  73.10     3.2
## 46     Autism     0.11       API  53.10     5.9
## 47         HI     1.71       API  35.17     5.9
## 48         MD     0.07       API  52.41     5.4
## 49       MOMR     0.94       API  53.79     4.9
## 50         OI     1.14       API  38.62     5.9
## 51        TBI       NA       API  35.86     2.2
## 52         VI     1.00       API  33.79     2.7

The data were loaded into R with the following read.csv command:


Data <- read.csv("file.path")
# Replace "file.path" with the actual path to the CSV file on your computer.

# This also names your data frame "Data".
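After loading, it can help to confirm that the columns were read as expected. The sketch below is self-contained: instead of a file it reads three rows of the data set from an inline string, and the names `csv_text` and `Demo` are our own. We also assume `stringsAsFactors = TRUE`, which was the default in older versions of R but must be requested explicitly in R 4.0 and later.

```r
# Three rows of the data set as inline CSV text, so the example
# runs without a file on disk.
csv_text <- "Disability,StateRRR,Ethnicity,LEALow,LEAHigh
SPED,1.22,AA,17.50,23.8
TBI,NA,API,35.86,2.2
VI,1.00,API,33.79,2.7"

# read.csv() accepts text = ... in place of a file path.
Demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Check the column types and count the missing values per column.
str(Demo)
colSums(is.na(Demo))
```

With the real file, the same two checks can be run on `Data`.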

Next, ggplot2 was loaded using:

library(ggplot2)
# If you don't have ggplot2 installed in R, install it first from the
# CRAN repository with install.packages("ggplot2")

Now that we have the data frame in R and ggplot2 is loaded, we can graphically explore the data from Table 1. Sullivan and Artiles (2011) were interested in “investigating differential risk of educational disability across racial groups” (p. 1526). Our goal is to graphically display what Sullivan and Artiles hoped their table would elucidate: differential risk of being placed in special education by race.

Graphing in ggplot

We decided to plot the state relative risk ratio for each disability category, grouped by ethnicity. The code is as follows.

ggplot(data=Data, aes(x=Disability, y=StateRRR, fill=Ethnicity))+
#We set x = Disability, y = StateRRR, and fill = Ethnicity.
#Note that fill has no visible effect on the default point shape, so the points are not yet distinguished by ethnicity.
#Remember to end each line of ggplot code with a + sign when another layer follows

  geom_point()
#geom_point() adds points to the graph for a scatterplot

At this point the graph looks like this:

## Warning: Removed 1 rows containing missing values (geom_point).

We thought the graph would be more interpretable if the points were different shapes and colors based on ethnicity. Also, as a matter of aesthetics, we wanted to eliminate the warning message that appeared due to one instance of missing data. The code now looks like this:

ggplot(data=Data, aes(x=Disability, y=StateRRR, fill=Ethnicity))+
  geom_point(size=5, na.rm=TRUE, aes(color=Ethnicity, shape=Ethnicity))
#Within geom_point you can change the point size with size=
#na.rm=TRUE silences the pesky warning message about the missing value
#aes() maps variables in the data to visual aesthetics
#Here, we used aes() to make the color and shape conditional on ethnicity
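Before silencing the warning, it is worth checking which observation is missing. The sketch below does this on three rows of the data set; the names `Demo` and `missing_rows` are our own, and with the full data the same expression can be applied to `Data`.

```r
# Three rows from the data set, including the one with a missing StateRRR.
Demo <- data.frame(
  Disability = c("OI", "TBI", "VI"),
  StateRRR   = c(1.14, NA, 1.00),
  Ethnicity  = c("API", "API", "API")
)

# Keep only the rows where StateRRR is missing.
missing_rows <- Demo[is.na(Demo$StateRRR), ]
print(missing_rows)
```

In the full data set this is the single row that geom_point() drops.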

The graph now looks like this:

At this point, we got picky about what shapes we wanted for ethnicity. In the above graph, the '+' symbol for Native American students is difficult to see on the graph. We also wanted to include lines for the cut scores that represent significant under- and over-representation of students compared to white students (centered at 1.0 on state relative risk ratio; StateRRR).

The numeric codes that R uses for specific point shapes can be found on the R Cookbook website.
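The shape codes can also be previewed directly in R. The following is a minimal sketch that draws the four codes used in this tutorial (15 to 18); the names `shape_codes` and `shape_preview` are our own.

```r
library(ggplot2)

# One row per shape code we want to preview.
shape_codes <- data.frame(code = 15:18)

# scale_shape_identity() tells ggplot2 to use the values in `code`
# directly as shape codes rather than assigning shapes itself.
shape_preview <- ggplot(shape_codes, aes(x = factor(code), y = 1, shape = code)) +
  geom_point(size = 5) +
  scale_shape_identity() +
  xlab("Shape code") +
  ylab(NULL)

print(shape_preview)
```

Changing `15:18` to another range of valid codes previews other shapes in the same way.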

We added a few lines of code to accomplish this:

ggplot(data=Data, aes(x=Disability, y=StateRRR, fill=Ethnicity))+
  geom_point(size=5, na.rm=TRUE, aes(color=Ethnicity, shape=Ethnicity))+
  scale_shape_manual(values=c(15,16,17,18))+
  geom_hline(size=1, yintercept=1.0)+
  #alpha is set outside aes() so it is applied as a fixed value instead of being mapped to a legend
  geom_hline(size=1, yintercept=0.5, alpha=.7)+
  geom_hline(size=1, yintercept=1.5, alpha=.7)

#scale_shape_manual lets you choose the point shapes manually, using the shape codes listed on the R Cookbook website
#geom_hline adds a horizontal line to the graph at the y value you specify
#again, the size argument can be used
#alpha sets the line's transparency, from fully transparent (0) to fully opaque (1)
#Lines were added at .5 and 1.5 to mark the cut scores for disproportionate under- and over-representation of minority groups (anything between .5 and 1.5 is not considered disproportionate)
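The same cut scores can be applied to the numbers themselves. The sketch below labels four rows of the data set; the `Status` column and the label text are our own additions, and we assume that values exactly equal to .5 or 1.5 count as not disproportionate, which the source does not specify.

```r
# Four rows from the data set.
Demo <- data.frame(
  Disability = c("SPED", "MIMR", "ED", "OHI"),
  Ethnicity  = c("AA", "AA", "HI", "AA"),
  StateRRR   = c(1.22, 2.42, 0.37, 0.85)
)

# Below .5 is under-representation, above 1.5 is over-representation,
# and everything in between is not considered disproportionate.
Demo$Status <- ifelse(
  Demo$StateRRR < 0.5, "under-represented",
  ifelse(Demo$StateRRR > 1.5, "over-represented", "not disproportionate")
)

print(Demo)
```

The points labeled as under- or over-represented are the ones that fall outside the two outer horizontal lines on the graph.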

Now the graph looks like this:

We decided the graph would look better with lines connecting each individual ethnic group across disability categories to display trends. We also wanted to title the axes and the graph, as well as remove the grey background.

ggplot(data=Data, aes(x=Disability, y=StateRRR, fill=Ethnicity))+
  geom_point(size=5, na.rm=TRUE, aes(color=Ethnicity, shape=Ethnicity))+
  scale_shape_manual(values=c(15,16,17,18))+
  geom_hline(size=1, yintercept=1.0)+
  geom_hline(size=1, yintercept=0.5, alpha=.7)+
  geom_hline(size=1, yintercept=1.5, alpha=.7)+
  #new code starts below
  geom_line(size=1, aes(group=Ethnicity, color=Ethnicity))+
  ylab("State Relative Risk Ratio")+
  xlab("Disability Categories")+
  ggtitle("State Level Relative Risk Ratio by Ethnicity")+
  theme_bw()+
  #added to remove the grey background
  theme(axis.title.x=element_text(size=16)) +
  theme(axis.title.y=element_text(size=16)) +
  theme(plot.title=element_text(size=20))

#geom_line adds lines to the graph; aes() was used to draw a separate line for each ethnicity (group) and to color the lines by ethnicity
#labeling the graph uses fairly intuitive functions: xlab for the x-axis label, ylab for the y-axis label, and ggtitle for the plot title
#The theme function was used to change the size of the x and y axis titles as well as the plot title
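The separate theme() calls can also be combined into a single call, which some readers may find easier to scan. This is a sketch on three rows of the data set rather than the full plot; the names `Demo` and `p` are our own.

```r
library(ggplot2)

# Three rows from the data set, enough to draw a small plot.
Demo <- data.frame(
  Disability = c("SPED", "SLD", "SLI"),
  StateRRR   = c(1.22, 1.38, 0.82)
)

# One theme() call can set several theme elements at once.
p <- ggplot(Demo, aes(x = Disability, y = StateRRR)) +
  geom_point(size = 5) +
  ggtitle("State Level Relative Risk Ratio by Ethnicity") +
  theme_bw() +
  theme(
    axis.title.x = element_text(size = 16),
    axis.title.y = element_text(size = 16),
    plot.title   = element_text(size = 20)
  )

print(p)
```

Either style should give the same result. Note that theme() has to come after theme_bw(), because a complete theme resets any earlier theme settings.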

The graph now looks like this:

The graph looked like it was coming together, and trends were becoming more interpretable than in the table of data. However, we wanted to clean the graph up a bit and make it look less chaotic. Thus, we created a white background, removed the minor gridlines, made the tick mark text larger, and changed the color of the major gridlines.

ggplot(data=Data, aes(x=Disability, y=StateRRR, fill=Ethnicity))+
  geom_point(size=5, na.rm=TRUE, aes(color=Ethnicity, shape=Ethnicity))+
  scale_shape_manual(values=c(15,16,17,18))+
  geom_hline(size=1, yintercept=1.0)+
  geom_hline(size=1, yintercept=0.5, alpha=.7)+
  geom_hline(size=1, yintercept=1.5, alpha=.7)+
  geom_line(size=1, aes(group=Ethnicity, color=Ethnicity))+
  ylab("State Relative Risk Ratio")+
  xlab("Disability Categories")+
  ggtitle("State Level Relative Risk Ratio by Ethnicity")+
  theme_bw()+
  theme(axis.title.x=element_text(size=16)) +
  theme(axis.title.y=element_text(size=16)) +
  theme(plot.title=element_text(size=20)) + 
#new code starts below
  theme(panel.grid.minor=element_blank())+
  #to remove the minor gridlines
  theme(panel.grid.major=element_line(color="grey78"))+
  #to make the major gridlines grey
  theme(axis.text=element_text(size=10))
#to make the tickmark labels larger

Voilà, the final graph:

This graph demonstrates the relative risk of being given a specific disability label for students in different ethnic groups compared to white students. The researchers also presented data about the percentage of local education agencies (LEAs) under-representing (low) and over-representing (high) students in different ethnic groups across disability categories compared to white students. The script for those graphs is similar to the previous one and will thus be explained in less detail.

ggplot(data=Data, aes(x=Disability, y=LEALow, fill=Ethnicity))+
  #y changed to LEALow
  ylab("Percentage of LEAs Underrepresenting Students")+
  xlab("Disability Categories")+
  ggtitle("Underrepresentation in Disability Category by Ethnicity")+
  theme_bw()+
  theme(axis.title.x=element_text(size=16)) +
  theme(axis.title.y=element_text(size=16)) +
  theme(plot.title=element_text(size=18))+
  theme(panel.grid.minor=element_blank())+
  theme(axis.text=element_text(size=10))+
  theme(panel.grid.major=element_line(color="grey78"))+
  geom_point(size=5, na.rm=TRUE, aes(color=Ethnicity, shape=Ethnicity))+
  scale_shape_manual(values=c(15,16,17,18))+
  geom_line(size=2, aes(group=Ethnicity, color=Ethnicity))
#hlines were omitted as there is no longer a significance cutoff

This script produces this graph:

It is easy to graphically see trends in underrepresentation based on ethnic group as compared to white students. While the tabular data (seen in Table 1) were hard to compare and interpret, clear trends emerge in the graph.

The next graph uses the same script as the previous one, except that y is set to LEAHigh rather than LEALow. Thus, the script will not be reproduced.
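Because the two scripts differ only in the y variable and the labels, one way to avoid repeating the code is to wrap it in a small function. The sketch below is not part of our original script: `plot_lea` is a hypothetical helper, the over-representation labels are our own wording, and the gridline and text-size settings are left out for brevity. It assumes a version of ggplot2 that supports the `.data` pronoun inside aes() (3.0.0 or later).

```r
library(ggplot2)

# Hypothetical helper: draws the LEA plot for whichever column is named in y_var.
plot_lea <- function(data, y_var, y_label, plot_title) {
  ggplot(data, aes(x = Disability, y = .data[[y_var]])) +
    geom_point(size = 5, na.rm = TRUE, aes(color = Ethnicity, shape = Ethnicity)) +
    scale_shape_manual(values = c(15, 16, 17, 18)) +
    # size= matches the tutorial; newer ggplot2 versions prefer linewidth= for lines
    geom_line(size = 2, aes(group = Ethnicity, color = Ethnicity)) +
    ylab(y_label) +
    xlab("Disability Categories") +
    ggtitle(plot_title) +
    theme_bw()
}

# Four rows from the data set for demonstration.
Demo <- data.frame(
  Disability = c("SPED", "SLD", "SPED", "SLD"),
  Ethnicity  = c("AA", "AA", "HI", "HI"),
  LEALow     = c(17.50, 27.50, 11.18, 9.41),
  LEAHigh    = c(23.8, 29.7, 9.7, 19.5)
)

high_plot <- plot_lea(
  Demo, "LEAHigh",
  "Percentage of LEAs Overrepresenting Students",
  "Overrepresentation in Disability Category by Ethnicity"
)
print(high_plot)
```

Calling the same function with "LEALow" and the under-representation labels would reproduce the structure of the previous graph.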

Again, in graphical form, it is very easy to see emerging trends of which ethnic groups are over-represented in specific disability categories as compared to white students.