ICPSR Machine Learning Lab: Data Visualization in R
Introduction
This demo provides a basic understanding of data visualization in R, mainly using ggplot2 and plotly. There’s a whole ecosystem to explore beyond that if you’re interested. One good place to start is https://github.com/erikgahner/awesome-ggplot2, and other links are scattered throughout. For a more general guide to data visualization, I recommend Edward Tufte’s one-day course, or you can check out his books here: https://www.edwardtufte.com/tufte/. Kieran Healy is another great data viz guru to check out; his blog is here: https://kieranhealy.org/blog/ and his book is here: https://www.amazon.com/Data-Visualization-Introduction-Kieran-Healy/dp/0691181624
set.seed(1995)
library(ggplot2) #the main dataviz package for this lab
library(ggrepel) #package for cleaning up text labels
library(ggcorrplot) # great for making correlation plots, another alternative is corrr::rplot()
library(GGally) # Pairwise plot package among other things
library(ggpubr) # package for stitching together ggplots for publications
library(ggthemes) ## Out-of-the-box aesthetic themes for ggplot.
## Here's a themes guide https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
library(reshape2) #data melting functions and other stuff
library(performance) #great model check package
library(fixest) #my favorite estimation package - it rocks!
library(ggdist) #very nice ggplot extension for mapping distributions
library(tidyverse) #lots of other tidy functions
library(marginaleffects) #my favorite package for standard model comparisons and interaction effects
Data
For part of this demo, we’ll be working with the mtcars dataset built into R. It’s a dataset of a few dozen automobiles with some associated characteristics. You don’t need to load mtcars, it comes with R from the factory…
Here’s the head so you can get a sense of it.
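The call that produced the preview is not shown in the text. A minimal sketch, assuming it was a plain head() call with the default of six rows, would be:

```r
# mtcars ships with base R (datasets package), so no library() call is needed.
# head() returns the first six rows by default.
mtcars_preview <- head(mtcars)
mtcars_preview
```

The name mtcars_preview is only illustrative; calling head(mtcars) directly prints the same rows.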
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Base R Plots
Before we get to ggplot, let’s talk briefly about base R plotting. Here’s a simple scatter plot example using base plot:
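The original chunk for this figure is not reproduced in the text. As a rough sketch, assuming the same mpg and hp columns used in the rest of the demo, a base R scatter plot could look like this (the axis labels and title are my own choices, not the author's):

```r
# Base R scatter plot: horsepower against fuel economy.
# plot() accepts a formula (y ~ x) together with a data frame.
plot(hp ~ mpg, data = mtcars,
     xlab = "Miles per Gallon",
     ylab = "Horsepower",
     main = "Base R scatter plot")
```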
This is pretty ugly, but for quick diagnostics it works just fine. For example, if you want to see a quick histogram you can use the hist() function…
Here’s a chunk demonstrating how to build and save a base plot:
pdf("my_histogram.pdf") #create a blank pdf file to save the plot in;
#this could also be an svg, jpg, tiff, bmp, or png.
hist(mtcars$mpg) #create the plot
dev.off() #close and save the file
## NOTE: This can be finicky on occasion, alternative methods here:
# https://www.stat.berkeley.edu/~s133/saving.html
## If you're interested in learning more about base plot (for whatever reason),
# this is a helpful guide: https://intro2r.com/simple-base-r-plots.html
The ggplot Package
Basic Scatter Plot
Moving past base plot, let’s make a basic ggplot object…
## This is a scatter plot with jitter (to avoid clusters of points)
scatter.plot <- ggplot(data = mtcars, ## Feed ggplot your dataframe.
aes(x = mpg, y = hp)) + ## Define your x axis and y axis.
geom_jitter(width = 5, height = 5) ## Tell ggplot what type of plot you'd like.
#This is a scatterplot with some added jitter to avoid clumping
## All possible ggplot "geoms" here: https://ggplot2.tidyverse.org/reference/
scatter.plot # Print the plot
You’ll notice right away that ggplot’s default formatting violates some of the rules we discussed earlier (mainly clutter). We’ll talk about fixing this up throughout the demo. Also, keep in mind that the data you feed ggplot MUST be a data frame. To convert a matrix or other format to a data frame, you can sometimes get away with using as.data.frame(); sometimes more complex transformations will be needed.
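To make the conversion step concrete, here is a small sketch using a made-up matrix. The object names m, m_df, and matrix_plot are hypothetical and only serve the illustration:

```r
library(ggplot2)

# A toy numeric matrix with named columns (illustrative data only)
m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2,
            dimnames = list(NULL, c("x", "y")))

# as.data.frame() keeps the column names, so they can be used inside aes()
m_df <- as.data.frame(m)

matrix_plot <- ggplot(data = m_df, aes(x = x, y = y)) +
  geom_point()
matrix_plot
```

This works cleanly because the matrix already has column names; unnamed matrices come back with default names (V1, V2, ...) that you would probably want to rename first.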
Saving Plots as an Object
You can store a ggplot in a variable like any other R object…
## Like any other R object, you can simply save gg objects as such.
mpg_plot <- ggplot(data = mtcars, aes(x = mpg, y = hp)) +
geom_jitter(size = 2) #increase point size
## To print the object simply run the object itself
mpg_plot
Saving Plots to a File
## To save a ggplot as a png,pdf,jpeg, etc. use the ggsave function.
ggsave("mpg_plot.png", ## Type whatever you'd like the name of the file
## to be in quotes here. Must include the extension!!
plot = mpg_plot, ## Define what plot you'd like to save,
#default is last plot loaded,
width = 10, #width of the file
height = 10, #height of the file
units = "in", #units for width and height, here inches
dpi = 300 ## Define the picture resolution. 300 is a good, HD default.
)
Changing Axis Labels
## Now let's fix up the labels on our plot.
mpg_plot <- ggplot(data = mtcars,
aes(x = mpg, y = hp)) +
xlab("Miles per Gallon") + ## x axis label
ylab("Horsepower") + ## y axis label
ggtitle("Scatterplot of MPG and Horsepower") + ## title
geom_jitter()
mpg_plot
Changing Axis Limits
## Now let's customize the axis ranges
mpg_plot <- ggplot(data = mtcars,
aes(x = mpg, y = hp)) +
xlim(0,40) + ## x axis range
ylim(0,400) + ## y axis range
labs(x = "Miles per Gallon",
y = "Horsepower",
title = "Scatterplot of MPG and Horsepower") +
geom_jitter()
## NOTE: You can also change all labels using the labs() function
mpg_plot
Color Customization
## Now let's customize our colors.
#First, define a custom color British Racing Green
brg <- "#004225" ## This is the color code for BRG, you can find it here:
#https://en.wikipedia.org/wiki/British_racing_green
mpg_plot <- ggplot(data = mtcars,
aes(x = mpg, y = hp)) +
xlab("Miles per Gallon") +
ylab("Horsepower") +
ggtitle("Scatterplot of MPG and Horsepower") +
geom_jitter(color=brg) ## Set the color
mpg_plot
Format Text and Theme
Don’t get overwhelmed here. Remember, every line formats one part of our graphic; consider them in order.
mpg_plot +
theme( ## Adjust title format
plot.title = element_text(size=14, face="bold",
hjust = 0.5),
## Adjust background color and fill
panel.background = element_rect(fill = "white",
colour = "white",
linewidth = 0.5, linetype = "solid"),
## Adjust axis lines
axis.line = element_line(linewidth = .5, color = "black"),
## Adjust x + y axis title formatting
axis.title.x = element_text(size=12, face="bold"),
axis.title.y = element_text(size=12, face="bold"),
## Adjust x + y ticks
axis.text.x = element_text(color = "grey30", size=10),
axis.text.y = element_text(color = "grey30", size=10),
axis.ticks = element_blank()) # remove tick marks
Saving and Using a Custom Theme
Once you come up with a theme you like, you can save the preset and reuse it…
justanothertheme <- function(){
theme(plot.title = element_text(size=14, face="bold", hjust = 0.5),
panel.background = element_rect(fill = "white", colour = "white",
linewidth = 0.5,
linetype = "solid"),
axis.line = element_line(linewidth = .5, color = "black"),
axis.title.x = element_text(size=12, face="bold"),
axis.title.y = element_text(size=12, face="bold"),
axis.text.x = element_text(color = "grey30", size=10),
axis.text.y = element_text(color = "grey30", size=10),
axis.ticks = element_blank())
}
mpg_plot + justanothertheme() # notice the result is the same, much cleaner though!
Adding a Geom to an Existing Plot
Rug Plot
Moving back to different types of geoms, consider using a rug alongside, or instead of, axis lines…
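The rug chunk itself is missing from the text. A minimal sketch, assuming the same mpg and hp mapping as the earlier scatter plots, might be (rug_plot is a name chosen for this example):

```r
library(ggplot2)

# geom_rug() draws a short tick on each axis for every observation,
# showing the marginal distribution of x and y alongside the points.
rug_plot <- ggplot(data = mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  geom_rug(alpha = .5) + ## semi-transparent ticks so overlaps stay visible
  xlab("Miles per Gallon") +
  ylab("Horsepower")
rug_plot
```

By default the rug appears on the bottom and left sides; the sides argument of geom_rug() controls this.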
Prediction Lines
We can also combine different graphics in the same image, a prediction line through points for example. A longer explainer for adding prediction lines in ggplot is here: https://aosmith.rbind.io/2018/11/16/plot-fitted-lines/
Linear model
mpg_plot <- ggplot(data = mtcars,
aes(x = mpg, y = hp)) +
xlab("Miles per Gallon") +
ylab("Horsepower") +
ggtitle("Scatterplot of MPG and Horsepower") +
justanothertheme() +
geom_jitter(color=brg, alpha =.5) + ## Alpha adjusts opacity
## smooth adds a prediction line, we're using default lm here
geom_smooth(method = "lm", color = brg)
mpg_plot
LOESS Model
mpg_plot <- ggplot(data = mtcars,
aes(x = mpg, y = hp)) +
xlab("Miles per Gallon") +
ylab("Horsepower") +
justanothertheme() +
ggtitle("Scatterplot of MPG and Horsepower") +
geom_jitter(color=brg, alpha =.5) +
geom_smooth(method = "loess", color = brg) # Change model type
mpg_plot
## `geom_smooth()` using formula = 'y ~ x'
Text Labels
mpg_plot <- ggplot(data = mtcars,
aes(x = mpg, y = hp)) +
xlab("Miles per Gallon") +
ylab("Horsepower") +
justanothertheme() +
ggtitle("Scatterplot of MPG and Horsepower") +
geom_jitter(color=brg) +
geom_text(aes(label=rownames(mtcars))) ## Add text labels
## using the rownames
mpg_plot
Uh oh, this is pretty ugly! Let’s try a different approach…
mpg_plot <- ggplot(data = mtcars,
aes(x = mpg, y = hp)) +
xlab("Miles per Gallon") +
ylab("Horsepower") +
justanothertheme() +
ggtitle("Scatterplot of MPG and Horsepower") +
geom_jitter(color=brg) +
## Use text_repel to separate the labels
geom_text_repel(aes(label=rownames(mtcars)),
max.time = 3,
max.iter = 10000, #I set a seed up top,
#you'll need one for this to reproduce!
max.overlaps = 5)
mpg_plot
This is a bit better, but with data this clustered the best option would be to use something interactive like plotly instead (which we won’t cover here, but demo files are available in the folder!).
Histograms
hist_data <- data.frame(X1 = rnorm(10000,0,1)) #draw a normal distro, mean 0 sd = 1
hist_plot <- ggplot(data = hist_data,
aes(x = X1)) +
geom_histogram(color=brg, #outline color
fill=brg, #fill color
alpha = .5) + #transparency in the closed set from [0,1]
justanothertheme() +
xlab("") +
ylab("Count") + #geom_histogram() plots counts by default, not densities
ggtitle("")
hist_plot
Density Plots
## Density plots and histograms use the same data type, so we can easily flip between them
hist_plot <- ggplot(data = hist_data,
aes(x = X1)) +
geom_density(color=brg,
fill=brg,
alpha = .5) +
justanothertheme() +
xlab("") +
ylab("Density") +
ggtitle("")
hist_plot
Overlapping Density and Histogram Plots
## Draw another distro with a different mean
hist_data1 <- as.data.frame(rnorm(10000, 1, 1))
## Combine the objects into one DF
overlap_data <- cbind(x=hist_data,y=hist_data1)
colnames(overlap_data) <- c("Mean0","Mean1") # Change col names
overlap_data <- reshape2::melt(overlap_data) #use melt to reformat the data
## Generate the plot
hist_plot2 <- ggplot(data = overlap_data,
aes(x=value, fill=variable)) +
geom_density(alpha = .25) +
ggthemes::theme_economist() + ## Use a canned theme, the economist magazine here
xlab("") +
ylab("Density") +
ggtitle("")
hist_plot2
Or two overlapping histograms instead!
## Generate the plot
hist_plot3 <- ggplot(data = overlap_data,
aes(x=value, fill=variable)) +
geom_histogram(bins = 40, alpha = .5,
position = "identity") + ## overlay the bars; the default position stacks them
ggthemes::theme_stata() + ## Use a canned theme, trick them into thinking you used stata!
xlab("") +
ylab("Count") +
ggtitle("")
hist_plot3
Appendix (Extra Code)
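Below is the code used to generate some of the figures in the slides if you’re interested. Note that slide_theme() is a custom theme that is not defined in the code shown here, so define it (or drop it) before running these chunks.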
q.df <- mtcars[4:7,]
ggplot(data = q.df,
aes(x = mpg,
y = drat,
label = rownames(q.df))) +
geom_jitter(size = 4) +
slide_theme() +
xlab("MPG") +
ylab("Rear Axle Ratio") +
geom_text_repel(box.padding = .3,size = 10)
ggplot(data = mtcars,
aes(x = mpg, y = hp)) +
geom_jitter(size = 2) +
theme(
axis.line = element_line(linewidth = .5, colour = "black", linetype=1),
panel.grid = element_line(color = "black",
linewidth = .5,
linetype = 1)) +
slide_theme()
ggplot(data = mtcars,
aes(x = mpg, y = hp)) +
geom_jitter(size = 2, alpha = .8) +
#geom_rug(alpha = .8) +
xlab("Miles per Gallon") +
ylab("Horsepower") +
theme_classic() +
theme(panel.background = element_blank(),
axis.ticks.length = unit(.1,"in")) +
#scale_x_continuous(limits = c(0,45), expand = c(0, 0)) +
#scale_y_continuous(limits = c(0,400), expand = c(0, 0)) +
#this will reduce the axis and remove extra space
geom_rangeframe() +
slide_theme()
#some code I stole to generate a simple DGP with heteroskedastic error
n <- rep(1:100,2)
a <- 0
b <- 1
sigma2 <- n^1.5
eps <- rnorm(length(n),mean=0,sd=sqrt(sigma2)) #one draw per observation
y <- a+b*n + eps
#into a df it goes!
hetero.df <- data.frame(Y = y,
X = n)
hetero.mod <- fixest::feols(Y ~ X, data = hetero.df, vcov = "hetero")
hetero.preds <- predict(hetero.mod,
hetero.df,
interval = "conf",
level = .99)
hetero.df$fit <- hetero.preds$fit
hetero.df$upper <- hetero.preds$ci_high
hetero.df$lower <- hetero.preds$ci_low
ggplot(aes(x = X,
y = Y),
data = hetero.df) +
geom_point(size = 2) +
geom_smooth(method = "lm",
se = FALSE,linewidth = 2) +
theme_few() +
slide_theme()
## `geom_smooth()` using formula = 'y ~ x'
#Star Trek LCARS Orange, because why not
goodorange <- "#ec9d37"
ggplot(aes(x = X,
y = Y),
data = hetero.df) +
geom_point(size = 2, color = goodorange,alpha = .75) +
geom_line(aes(x = X, y = fit),color = goodorange,linewidth = 2) +
geom_ribbon(aes(x = X, y = fit, ymin=lower, ymax=upper),
alpha = .2,
fill = goodorange) +
theme_few() +
slide_theme()
ggplot(aes(x = X,
y = Y),
data = hetero.df) +
geom_point(size = 2, color = goodorange,alpha = .75) +
geom_line(aes(x = X, y = fit),color = goodorange,linewidth = 2) +
geom_ribbon(aes(x = X, y = fit, ymin=lower, ymax=upper),
alpha = .2,
fill = goodorange) +
xlab("") +
ylab("") +
ggtitle("") +
slide_theme() +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank())