ICPSR Machine Learning Lab: Data Visualization in R

Introduction

This demo provides a basic understanding of data visualization in R, mainly using ggplot2 and plotly. There’s a whole ecosystem to explore beyond that if you’re interested. One good place to start is https://github.com/erikgahner/awesome-ggplot2 and other links are scattered throughout. For a more general guide to data visualization, I recommend Edward Tufte’s one-day course, or you can check out his books here: https://www.edwardtufte.com/tufte/ Kieran Healy is another great data viz guru to check out. His blog is here: https://kieranhealy.org/blog/ and his book is here: https://www.amazon.com/Data-Visualization-Introduction-Kieran-Healy/dp/0691181624

set.seed(1995) 
library(ggplot2) #the main dataviz package for this lab 
library(ggrepel) #package for cleaning up text labels 
library(ggcorrplot) # great for making correlation plots, another alternative is corrr::rplot()
library(GGally) # Pairwise plot package among other things 
library(ggpubr) # package for stitching together ggplots for publications
library(ggthemes) ## Out-of-the-box aesthetic themes for ggplot. 
## Here's a themes guide https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/ 
library(reshape2) #data melting functions and other stuff
library(performance) #great model check package 
library(fixest) #my favorite estimation package - it rocks! 
library(ggdist) #very nice ggplot extension for mapping distributions 
library(tidyverse) #lots of other tidy functions 
library(marginaleffects) #my favorite package for standard model comparisons and interaction effects 

Data

For part of this demo, we’ll be working with the mtcars dataset built into R. It’s a dataset of 32 automobiles with some associated characteristics. You don’t need to load mtcars; it comes with R from the factory…

Here’s the head so you can get a sense of it.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Base R Plots

Before we get to ggplot, let’s talk briefly about base R plotting. Here’s a simple scatter plot example using the base plot() function:

plot(x = mtcars$mpg, #define the x variable 
     y = mtcars$hp) #define the y variable 

This is pretty ugly, but for quick diagnostics it works just fine. For example, if you want to see a quick histogram you can use the hist() function…

hist(mtcars$mpg)

Here’s a chunk demonstrating how to build and save a base plot:

pdf("my_histogram.pdf") #create a blank pdf file to save the plot in, 
                        #this could also be an svg, jpg, tiff, bmp, or png. 
hist(mtcars$mpg) #create the plot 
dev.off() #close and save the file 

## NOTE: This can be finicky on occasion, alternative methods here:
# https://www.stat.berkeley.edu/~s133/saving.html 
## IF you're interested in learning more about base plot (for whatever reason), 
# this is a helpful guide:https://intro2r.com/simple-base-r-plots.html 
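
If you want more control over the saved file, here is a minimal sketch of the same open-plot-close pattern with explicit page dimensions. The file name and the 6 x 4 inch size are placeholder choices of mine, and I write to a temporary folder so nothing in your working directory gets overwritten.

```r
# Hypothetical output path in the session's temporary folder
out_file <- file.path(tempdir(), "mpg_histogram.pdf")

pdf(out_file, width = 6, height = 4) # open the pdf device; width/height are in inches
hist(mtcars$mpg,
     main = "Histogram of MPG",      # plot title
     xlab = "Miles per Gallon")      # x axis label
dev.off()                            # close the device so the file is written

file.exists(out_file)                # check that the file was created
```

The key point is that nothing is written until dev.off() runs, so a forgotten dev.off() is the usual culprit when a saved file turns out empty or missing.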

The ggplot Package

Basic Scatter Plot

Moving past base plot, let’s make a basic ggplot object…

## This is a scatter plot with jitter (to avoid clusters of points)

scatter.plot <- ggplot(data = mtcars, ## Feed ggplot your dataframe. 
       aes(x = mpg, y = hp)) + ## Define your x axis and y axis. 
       geom_jitter(width = 5, height = 5) ## Tell ggplot what type of plot you'd like. 
                    #This is a scatterplot with some added jitter to avoid clumping 

## All possible ggplot "geoms" here: https://ggplot2.tidyverse.org/reference/ 

scatter.plot # Print the plot

You’ll notice right away that ggplot’s base formatting violates some of the rules we discussed earlier (mainly clutter). We’ll talk about fixing this up throughout the demo. Also, keep in mind that the data you feed ggplot MUST be in the form of a data frame. To convert a matrix or other format to a DF, you can sometimes get away with using as.data.frame(). Sometimes more complex transformations will be needed.
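
As a quick illustration of the data frame requirement, here is a minimal sketch that converts a small matrix with as.data.frame() before plotting. The matrix values and the object names (m, m_df, matrix_plot) are made up for the example.

```r
library(ggplot2)

# A small numeric matrix standing in for non-data-frame input (made-up values)
m <- matrix(c(1, 2, 3, 4,    # first column: x
              2, 4, 6, 8),   # second column: y
            ncol = 2,
            dimnames = list(NULL, c("x", "y")))

m_df <- as.data.frame(m) # the matrix column names become the variable names
class(m_df)

matrix_plot <- ggplot(data = m_df, aes(x = x, y = y)) +
               geom_point()
matrix_plot
```

This works cleanly because the matrix already has column names. Without them, as.data.frame() falls back to V1, V2, and so on, which you would then use inside aes().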

Saving Plots as an Object

You can store a ggplot in a variable, just like any other R object…

## Like any other R object, you can simply save gg objects as such. 
mpg_plot <- ggplot(data = mtcars, aes(x = mpg, y = hp)) + 
                   geom_jitter(size = 2) #increase point size
## To print the object simply run the object itself
mpg_plot

Saving Plots to a File

## To save a ggplot as a png,pdf,jpeg, etc. use the ggsave function. 
ggsave("mpg_plot.png", ## Type whatever you'd like the name of the file 
                       ## to be in quotes here. Must include the extension!!
       plot = mpg_plot, ## Define what plot you'd like to save,
                        #default is last plot loaded,
       width = 10, #width of the file 
       height = 10, #height of the file 
       units = "in", #units for width and height, here inches 
       dpi = 300) ## Define the picture resolution. 300 is a good, HD default. 

Changing Axis Labels

## Now let's fix up the labels on our plot. 
mpg_plot <- ggplot(data = mtcars, 
                   aes(x = mpg, y = hp)) + 
                   xlab("Miles per Gallon") + ## x axis label 
                   ylab("Horsepower") + ## y axis label  
                   ggtitle("Scatterplot of MPG and Horsepower") + ## title 
                   geom_jitter()

mpg_plot

Changing Axis Limits

## Now let's customize the axis ranges 
mpg_plot <- ggplot(data = mtcars, 
                   aes(x = mpg, y = hp)) + 
                   xlim(0,40) + ## x axis range 
                   ylim(0,400) + ## y axis range
                   labs(x = "Miles per Gallon", 
                        y = "Horsepower",
                        title = "Scatterplot of MPG and Horsepower") +
                   geom_jitter()

## NOTE: You can also change all labels using the labs() function

mpg_plot
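
One thing worth knowing about xlim() and ylim(): they set the scale limits, so observations outside the range are turned into missing values before anything is drawn or fitted. If you only want to zoom in, coord_cartesian() changes the visible window and leaves the data alone. The sketch below compares the two. The 15 to 25 range and the object names are arbitrary choices for illustration.

```r
library(ggplot2)

# xlim() sets the scale limits: points outside 15-25 mpg become NA
clipped <- ggplot(data = mtcars, aes(x = mpg, y = hp)) +
           geom_point() +
           xlim(15, 25)

# coord_cartesian() only zooms the view: every point is kept
zoomed <- ggplot(data = mtcars, aes(x = mpg, y = hp)) +
          geom_point() +
          coord_cartesian(xlim = c(15, 25))

# Count how many points survive in each version
sum(!is.na(layer_data(clipped, 1)$x))
sum(!is.na(layer_data(zoomed, 1)$x))
```

The difference matters most once you add something like geom_smooth(), because a model fit after xlim() only sees the points that were kept.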

Color Customization

## Now let's customize our colors. 
#First, define a custom color British Racing Green 
brg <- "#004225" ## This is the color code for BRG, you can find it here:
#https://en.wikipedia.org/wiki/British_racing_green

mpg_plot <- ggplot(data = mtcars, 
                   aes(x = mpg, y = hp)) + 
                   xlab("Miles per Gallon") + 
                   ylab("Horsepower") + 
                   ggtitle("Scatterplot of MPG and Horsepower") + 
                   geom_jitter(color=brg) ## Set the color 
mpg_plot
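
The chunk above sets one fixed color for every point. If you would rather have color encode a variable, map it inside aes() and supply your own palette with scale_color_manual(). This is a sketch only: the palette (cyl_colors) is a made-up example, with the green being the BRG code from above and the other two picked arbitrarily.

```r
library(ggplot2)

# Hypothetical palette: one hex code per cylinder count
cyl_colors <- c("4" = "#004225", "6" = "#ec9d37", "8" = "#7a7a7a")

color_plot <- ggplot(data = mtcars,
                     aes(x = mpg, y = hp,
                         color = factor(cyl))) + ## map color to a variable 
              geom_point(size = 2) +
              scale_color_manual(values = cyl_colors, ## supply the palette 
                                 name = "Cylinders") + ## legend title 
              labs(x = "Miles per Gallon", y = "Horsepower")
color_plot
```

The rule of thumb is that a color inside aes() is data (and gets a legend), while a color outside aes() is just decoration.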

Format Text and Theme

Don’t get overwhelmed here. Remember, every line formats one part of our graphic, so take them in order.

mpg_plot + 
        theme(  ## Adjust title format 
                plot.title = element_text(size=14, face="bold", 
                                          hjust = 0.5),
                ## Adjust background color and fill 
                panel.background = element_rect(fill = "white",
                                colour = "white",
                                linewidth = 0.5, linetype = "solid"),
                ## Adjust axis lines
                axis.line = element_line(linewidth = .5, color = "black"),
                ## Adjust x + y axis title formatting 
                axis.title.x = element_text(size=12, face="bold"),
                axis.title.y = element_text(size=12, face="bold"),
                ## Adjust x + y ticks 
                axis.text.x = element_text(color = "grey30", size=10),
                axis.text.y = element_text(color = "grey30", size=10),
                axis.ticks = element_blank()) # remove tick marks

Saving and Using a Custom Theme

Once you come up with a theme you like, you can save the preset and reuse it…

justanothertheme <- function(){
  theme(plot.title = element_text(size=14, face="bold", hjust = 0.5),
  panel.background = element_rect(fill = "white", colour = "white", 
                                  linewidth = 0.5, 
                                  linetype = "solid"),
  axis.line = element_line(linewidth = .5, color = "black"),
  axis.title.x = element_text(size=12, face="bold"),
  axis.title.y = element_text(size=12, face="bold"),
  axis.text.x = element_text(color = "grey30", size=10),
  axis.text.y = element_text(color = "grey30", size=10),
  axis.ticks = element_blank())
}
mpg_plot + justanothertheme() # notice the result is the same, much cleaner though! 
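
Because the theme is wrapped in a function, you can also give it arguments. Here is a rough sketch of a variant that takes a base font size, which is handy when the same plot needs to go into both a paper and a slide deck. The function name sized_theme and the base_size argument are my own inventions for this example, not part of ggplot2.

```r
library(ggplot2)

# Hypothetical variant of the custom theme with an adjustable font size
sized_theme <- function(base_size = 12){
  theme(plot.title = element_text(size = base_size + 2, face = "bold", hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "white"),
        axis.line = element_line(linewidth = .5, color = "black"),
        axis.title = element_text(size = base_size, face = "bold"),
        axis.text = element_text(color = "grey30", size = base_size - 2),
        axis.ticks = element_blank())
}

themed_plot <- ggplot(data = mtcars, aes(x = mpg, y = hp)) +
               geom_point() +
               ggtitle("Scatterplot of MPG and Horsepower") +
               sized_theme(base_size = 16) # bigger text, e.g. for slides
themed_plot
```

Calling sized_theme() with no arguments falls back to the 12 point default, so existing code keeps working.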

Adding a Geom to an Existing Plot

Rug Plot

Moving back to different types of geoms, consider using a rug with or instead of axis lines…

mpg_plot + justanothertheme() + geom_rug() 

Prediction Lines

We can also combine different graphics in the same image, a prediction line through points for example. A longer explainer for adding prediction lines in ggplot is here: https://aosmith.rbind.io/2018/11/16/plot-fitted-lines/

Linear model

mpg_plot <- ggplot(data = mtcars, 
                   aes(x = mpg, y = hp)) + 
                   xlab("Miles per Gallon") + 
                   ylab("Horsepower") + 
                   ggtitle("Scatterplot of MPG and Horsepower") + 
                   justanothertheme() + 
                   geom_jitter(color=brg, alpha =.5) + ## Alpha adjusts opacity 
                   ## smooth adds a prediction line, we're using default lm here 
                   geom_smooth(method = "lm", color = brg) 
mpg_plot                  

LOESS Model

mpg_plot <- ggplot(data = mtcars, 
                   aes(x = mpg, y = hp)) + 
                   xlab("Miles per Gallon") + 
                   ylab("Horsepower") + 
                   justanothertheme() + 
                   ggtitle("Scatterplot of MPG and Horsepower") + 
                   geom_jitter(color=brg, alpha =.5) + 
                   geom_smooth(method = "loess", color = brg) # Change model type 
mpg_plot      
## `geom_smooth()` using formula = 'y ~ x'
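
If you want to see what geom_smooth(method = "lm") is doing under the hood, you can fit the model yourself and draw the line and confidence band by hand. The sketch below does that with base R's lm() and predict(). The object names (hp_fit, pred_grid, manual_fit_plot) and the 50-point grid are arbitrary choices, and this is an approximation of the idea rather than ggplot's exact internals.

```r
library(ggplot2)

# Fit the same straight-line model: horsepower as a function of mpg
hp_fit <- lm(hp ~ mpg, data = mtcars)

# Predict over an evenly spaced grid of mpg values, with a 95% confidence band
pred_grid <- data.frame(mpg = seq(min(mtcars$mpg), max(mtcars$mpg),
                                  length.out = 50))
pred_grid <- cbind(pred_grid,
                   predict(hp_fit, newdata = pred_grid,
                           interval = "confidence")) # adds fit, lwr, upr

manual_fit_plot <- ggplot(data = mtcars, aes(x = mpg, y = hp)) +
                   geom_point(alpha = .5) +
                   geom_ribbon(data = pred_grid, # shaded confidence band 
                               aes(x = mpg, ymin = lwr, ymax = upr),
                               inherit.aes = FALSE, alpha = .2) +
                   geom_line(data = pred_grid, aes(y = fit)) # fitted line 
manual_fit_plot
```

Doing it by hand is more typing, but it means you can swap in any model with a predict() method, which is the approach the heteroskedasticity plots in the appendix take.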

Text Labels

mpg_plot <- ggplot(data = mtcars, 
                   aes(x = mpg, y = hp)) + 
                   xlab("Miles per Gallon") + 
                   ylab("Horsepower") + 
                   justanothertheme() + 
                   ggtitle("Scatterplot of MPG and Horsepower") + 
                   geom_jitter(color=brg) +
                   geom_text(aes(label=rownames(mtcars))) ## Add text labels
                                                          ## using the rownames
mpg_plot

Uh oh, this is pretty ugly! Let’s try a different approach…

mpg_plot <- ggplot(data = mtcars, 
                   aes(x = mpg, y = hp)) + 
                   xlab("Miles per Gallon") + 
                   ylab("Horsepower") + 
                   justanothertheme() + 
                   ggtitle("Scatterplot of MPG and Horsepower") + 
                   geom_jitter(color=brg) +
                   ## Use text_repel to separate the labels
                   geom_text_repel(aes(label=rownames(mtcars)),
                                   max.time = 3,
                                   max.iter = 10000, #I set a seed up top,
                                   #you'll need one for this to reproduce! 
                                   max.overlaps = 5) 
mpg_plot

This is a bit better, but with data this clustered the best option would be to use something interactive like plotly instead (we won’t cover it here, but demo files are available in the folder!).
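
A static alternative is to label only the points you actually care about. Here is a minimal sketch that labels just the high-horsepower cars by handing the text layer its own, smaller data frame. The 200 hp cutoff and the object names are arbitrary choices for the example. I use plain geom_text() to keep the sketch dependency-free, and geom_point() rather than geom_jitter() so the labels sit on the points.

```r
library(ggplot2)

# Put the car names in a column so they can be filtered like any other variable
cars_df <- mtcars
cars_df$car <- rownames(mtcars)

# Hypothetical rule: only label cars with more than 200 horsepower
label_df <- subset(cars_df, hp > 200)

subset_label_plot <- ggplot(data = cars_df, aes(x = mpg, y = hp)) +
                     geom_point() +
                     geom_text(data = label_df, ## labels use the subset only 
                               aes(label = car),
                               vjust = -0.8, ## nudge labels above the points 
                               size = 3)
subset_label_plot
```

Like other geoms, geom_text_repel() takes a data argument too, so the same trick should combine with the repel approach above.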

Histograms

hist_data <- data.frame(X1 = rnorm(10000,0,1)) #draw a normal distro, mean 0 sd = 1 
hist_plot <- ggplot(data = hist_data, 
                   aes(x = X1)) + 
                   geom_histogram(color=brg, #outline color 
                                  fill=brg, #fill color 
                                  alpha = .5) + #transparency, a value in the closed interval [0,1] 
                   justanothertheme() + 
                   xlab("") + 
                   ylab("Count") + #geom_histogram() plots counts by default 
                   ggtitle("") 
hist_plot
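
By default geom_histogram() puts counts on the y axis. If you want the bars on the density scale instead (so the bar areas sum to one and can be compared against a density curve), map y to after_stat(density). The sketch below does this and overlays the standard normal density for reference. The object names and the choice of 30 bins are mine.

```r
library(ggplot2)

set.seed(1995)
sim_df <- data.frame(X1 = rnorm(10000, 0, 1)) # same kind of draw as above

density_hist <- ggplot(data = sim_df, aes(x = X1)) +
                geom_histogram(aes(y = after_stat(density)), # rescale counts to density 
                               bins = 30,
                               alpha = .5) +
                stat_function(fun = dnorm) + # overlay the N(0,1) density curve 
                xlab("") +
                ylab("Density")
density_hist
```

With 10,000 draws the bars should track the curve closely, which makes this a handy sanity check for simulated data.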

Density Plots

## Density plots and histograms use the same data type, so we can easily flip between them
hist_plot <- ggplot(data = hist_data, 
                   aes(x = X1)) + 
                   geom_density(color=brg,
                                  fill=brg,
                                  alpha = .5) + 
                   justanothertheme() + 
                   xlab("") + 
                   ylab("Density") +
                   ggtitle("") 
hist_plot

Overlapping Density and Histogram Plots

## Draw another distro with a different mean 
hist_data1 <- as.data.frame(rnorm(10000, 1, 1))

## Combine the objects into one DF 
overlap_data <- cbind(x=hist_data,y=hist_data1)
colnames(overlap_data) <- c("Mean0","Mean1") # Change col names
overlap_data <- reshape2::melt(overlap_data) #use melt to reformat the data

## Generate the plot 
hist_plot2 <- ggplot(data = overlap_data, 
                   aes(x=value, fill=variable)) + 
                   geom_density(alpha = .25) + 
                   ggthemes::theme_economist() + ## Use a canned theme, the economist magazine here  
                   xlab("") + 
                   ylab("Density") +
                   ggtitle("") 
hist_plot2
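
If the melt() step feels like magic, here is a tiny sketch of what that kind of reshaping does, using base R's stack() on made-up numbers. It is not the reshape2 implementation, just the same wide-to-long idea: the two columns get stacked into one value column, with a second column recording which original column each value came from.

```r
# A tiny wide data frame: one column per distribution (made-up values)
wide_df <- data.frame(Mean0 = c(0.1, -0.3, 0.5),
                      Mean1 = c(1.2, 0.8, 1.4))

long_df <- stack(wide_df)                  # stack the columns on top of each other
names(long_df) <- c("value", "variable")   # rename to match the melt() output names
long_df
```

The long layout is what lets a single aes(fill = variable) split the plot into groups.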

Or two overlapping histograms instead!

## Generate the plot 
hist_plot3 <- ggplot(data = overlap_data, 
                   aes(x=value, fill=variable)) + 
                   geom_histogram(bins = 40, alpha = .5, 
                                  position = "identity") + ## overlap the bars, the default stacks them 
                   ggthemes::theme_stata() + ## Use a canned theme, trick them into thinking you used stata! 
                   xlab("") + 
                   ylab("Count") +
                   ggtitle("") 
hist_plot3 

Appendix (Extra Code)

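Below is the code used to generate some of the figures in the slides if you’re interested. Note that slide_theme() is a custom theme that is not defined in this document, so swap in justanothertheme() or another theme if you want to run these chunks.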

q.df <- mtcars[4:7,]
ggplot(data = q.df, 
       aes(x = mpg,
           y = drat,
           label = rownames(q.df))) + 
      geom_jitter(size = 4) +
      slide_theme() + 
      xlab("MPG") +
      ylab("Rear Axle Ratio") + 
      geom_text_repel(box.padding = .3,size = 10) 

ggsave("badplot.png",dpi = 500, width = 16,height = 9,units = "in")
ggplot(data = mtcars, 
       aes(x = mpg, y = hp)) + 
       geom_jitter(size = 2) + 
       theme(
         axis.line = element_line(linewidth = .5, colour = "black", linetype=1),
         panel.grid = element_line(color = "black",
                                     linewidth = .5,
                                     linetype = 1)) + 
      slide_theme()

ggsave("notbetterplot.png",dpi = 500,width = 8, height = 9,units = "in")
ggplot(data = mtcars, 
       aes(x = mpg, y = hp)) +  
       geom_jitter(size = 2, alpha = .8) + 
       #geom_rug(alpha = .8) + 
       xlab("Miles per Gallon") + 
       ylab("Horsepower") + 
       theme_classic() + 
       theme(panel.background = element_blank(),
             axis.ticks.length = unit(.1,"in")) + 
       #scale_x_continuous(limits = c(0,45), expand = c(0, 0)) +
       #scale_y_continuous(limits = c(0,400), expand = c(0, 0)) + 
  #this will reduce the axis and remove extra space
       geom_rangeframe() + 
       slide_theme()

ggsave("betterplot.png",dpi = 500,width = 8, height = 9,units = "in")
#some code I stole to generate a simple DGP with heteroskedastic error  
n      <- rep(1:100,2)
a      <- 0
b      <- 1
sigma2 <- n^1.5
eps    <- rnorm(n,mean=0,sd=sqrt(sigma2))
y      <- a+b*n + eps
#into a df it goes! 
hetero.df <- data.frame(Y = y,
                        X = n)
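
To see what "heteroskedastic" means in this DGP, here is a small sketch that draws the same kind of errors and compares their spread for small and large values of X. The object names are mine, and I pass length(x) to rnorm() explicitly instead of relying on the vector shortcut used above.

```r
set.seed(1995)

x <- rep(1:100, 2)                # same design as above: each value of X appears twice
eps <- rnorm(length(x), mean = 0,
             sd = sqrt(x^1.5))    # error sd grows with x

# Compare the spread of the errors in the lower and upper half of x
sd_low  <- sd(eps[x <= 50])
sd_high <- sd(eps[x > 50])
c(low = sd_low, high = sd_high)
```

The errors fan out as X grows, which is why the plots below use heteroskedasticity-robust standard errors.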

hetero.mod <- fixest::feols(Y ~ X, data = hetero.df, vcov = "hetero")
hetero.preds <- predict(hetero.mod,
                        hetero.df,
                        interval = "conf",
                        level = .99)
hetero.df$fit <- hetero.preds$fit
hetero.df$upper <- hetero.preds$ci_high
hetero.df$lower <- hetero.preds$ci_low

ggplot(aes(x = X, 
           y = Y),
       data = hetero.df) + 
       geom_point(size = 2) + 
       geom_smooth(method = "lm",
                   se = FALSE, linewidth = 2) + 
       theme_few() + 
       slide_theme()
## `geom_smooth()` using formula = 'y ~ x'

ggsave("badheteroplot.png",dpi=500,width = 16,height = 9,units = "in")
## `geom_smooth()` using formula = 'y ~ x'
#Star Trek LCARS Orange, because why not 
goodorange <- "#ec9d37"
ggplot(aes(x = X, 
           y = Y),
       data = hetero.df) + 
       geom_point(size = 2, color = goodorange,alpha = .75) + 
       geom_line(aes(x = X, y = fit),color = goodorange,linewidth = 2) + 
       geom_ribbon(aes(x = X, y = fit, ymin=lower, ymax=upper),
                   alpha = .2,
                   fill = goodorange) + 
       theme_few() + 
       slide_theme()

ggsave("heteroplot.png",dpi=500,width = 16,height = 9,units = "in")
ggplot(aes(x = X, 
           y = Y),
       data = hetero.df) + 
       geom_point(size = 2, color = goodorange,alpha = .75) + 
       geom_line(aes(x = X, y = fit),color = goodorange,linewidth = 2) + 
       geom_ribbon(aes(x = X, y = fit, ymin=lower, ymax=upper),
                   alpha = .2,
                   fill = goodorange) + 
       xlab("") + 
       ylab("") + 
       ggtitle("") + 
         slide_theme() + 
       theme(axis.ticks.x = element_blank(),
             axis.text.x = element_blank(),
             axis.text.y = element_blank()) 

ggsave("nolabels.png",dpi=500,width = 16,height = 9,units = "in")