Data Visualization is:
Abbreviations: Both “data vis” and “data viz” are used as abbreviations for data visualization.
Observe this example, which visualizes public construction records for Lakeview Amphitheater in Syracuse, NY:
Questions: In observing the above scatter plot:
R can produce graphics with both built-in and external packages (libraries).
The three most popular packages, from oldest to newest, are:
Honorable mention goes to package grid, which powers packages such as lattice and ggplot2 behind the scenes.
Package graphics comes preinstalled and preloaded in your R or RStudio environment.
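One way to confirm this in your own session is sketched below. It relies only on search(), which lists everything attached to the session; the object name attached is just illustrative.

```r
# List everything currently attached to the session's search path
attached <- search()

# TRUE if package graphics is already attached, with no call to library()
"package:graphics" %in% attached
```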
Let’s import the Lakeview Amphitheater data again and save them in object lv:
library(readr)
url <- "https://raw.githubusercontent.com/jamisoncrawford/reis/master/Datasets/tblr_master.csv"
lv <- read_csv(url)
index <- which(lv$project == "Lakeview")
lv <- lv[index, 1:12]
You can find the documentation for these data in the Lakeview and Hancock repositories.
Explore the data with str() and other exploratory functions:
names(lv)
dim(lv)
summary(lv)
str(lv)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3356 obs. of 12 variables:
## $ project: chr "Lakeview" "Lakeview" "Lakeview" "Lakeview" ...
## $ name : chr "Ajay Glass" "Ajay Glass" "Ajay Glass" "Ajay Glass" ...
## $ ending : Date, format: "2015-08-02" "2015-08-02" ...
## $ zip : int 14612 13205 14428 14549 14433 14612 14428 13118 14433 14612 ...
## $ ssn : chr "3888" "2807" "7762" "9759" ...
## $ class : chr "Journeyman" "Journeyman" "Journeyman" "Journeyman" ...
## $ hours : num 40 32 40 26 32 40 40 16 32 42 ...
## $ rate : num 18.5 28.7 33.7 33.7 17 18.5 33.7 28.7 17 18.5 ...
## $ gross : num 740 918 1348 876 544 ...
## $ net : num 511 587 1026 891 384 ...
## $ sex : chr NA NA NA NA ...
## $ race : chr NA NA NA NA ...
Plotting Functions: Base R graphics have many functions. The following will initialize a plot:
- plot() automatically detects/creates scatter plots, bar charts, etc.
- hist() creates a histogram for a single variable
- boxplot() creates a box and whisker plot
- barplot() creates bar plots
There are several other functions for initializing plots, but these are common.
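As a rough illustration, here is each of these functions called once. The built-in dataset mtcars is an assumption of this sketch, standing in for the lesson's data, and the object names h, b, cyl_counts, and bp are illustrative.

```r
# Built-in dataset mtcars stands in for real data (an assumption of this sketch)
data(mtcars)

# One numeric variable: a histogram
h <- hist(mtcars$mpg)

# A numeric variable split by a grouping variable: box and whisker plots
b <- boxplot(mpg ~ cyl, data = mtcars)

# Counts per category: a bar plot built from a frequency table
cyl_counts <- table(mtcars$cyl)
bp <- barplot(cyl_counts)

# Two numeric variables: plot() draws a scatter plot
plot(x = mtcars$wt, y = mtcars$mpg)
```

Each call starts a new plot, so in an interactive session only the last one stays visible unless you page back through the plot history.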
Annotating Functions: The following functions, among others, help annotate an existing plot.
- loess() fits a smoother that can be drawn over an existing scatter plot with lines()
- text() adds text to an existing plot
- jitter() adds random noise to points to break up overlap
- legend() adds a legend to an existing plot
Modifiable Parameters: Parameters go inside the () of a function, and there are many.
- main = modifies title
- xlab = modifies x-axis label
- ylab = modifies y-axis label
- lwd = modifies line width
- lty = modifies line type
- col = modifies color
- cex = modifies size
There are a ridiculous number of modifiable parameters. You can learn more by running ?par.
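To see annotations and parameters working together, consider this minimal sketch. It assumes the built-in dataset cars rather than the lesson's data, and the titles, labels, and colors are arbitrary choices.

```r
data(cars)  # built-in dataset: speed and stopping distance

# Initialize the plot, setting title, axis labels, color, and point size
plot(x = cars$speed, y = cars$dist,
     main = "Stopping distance by speed",
     xlab = "Speed (mph)",
     ylab = "Distance (ft)",
     col = "tomato",
     cex = 1.2)

# Fit a loess smoother, then draw its fitted values over the points
fit <- loess(dist ~ speed, data = cars)
ord <- order(cars$speed)
lines(x = cars$speed[ord], y = fitted(fit)[ord],
      lwd = 2, lty = 2, col = "steelblue")

# Label the point with the longest stopping distance
far <- which.max(cars$dist)
text(x = cars$speed[far], y = cars$dist[far], labels = "Longest", pos = 1)

# Add a legend describing the smoother
legend("topleft", legend = "loess fit", lwd = 2, lty = 2, col = "steelblue")
```

Note that loess() only fits the model; lines() is what actually draws on the plot.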
The curious learner should check out this Base R Cheat Sheet by David Gerard (2017).
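Turn It Off: Since each annotating function adds a layer to an existing plot, you will sometimes need a clean slate:
- plot.new() starts a new, blank plot frame
- dev.off() closes the current graphics device, which also discards its parameter settings
These are great go-to procedures for troubleshooting when visualizations deviate from expectations.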
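The reset behavior of dev.off() can be checked with a small sketch. It uses pdf(NULL), a device that writes no file, purely so the example leaves nothing behind; the names before and after are illustrative.

```r
# Open a throwaway device that writes nowhere
pdf(NULL)

# Change a parameter on this device
par(mfrow = c(1, 2))
before <- par("mfrow")   # 1 2

# Closing the device discards its parameter settings
dev.off()

# A freshly opened device starts again from the defaults
pdf(NULL)
after <- par("mfrow")    # 1 1
dev.off()
```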
Let’s Plot: The following is a scatter plot of gross pay as a function of hours worked.
Set x = and y = to variables hours and gross in dataset lv using $:
plot.new()
plot(x = lv$hours, y = lv$gross)
Modifying an Initial Plot: You can adjust the parameters in the initial call to plot().
plot.new()
plot(x = lv$hours,
y = lv$gross,
main = "Lakeview: Weekly hours v. gross pay",
xlab = "Hours",
ylab = "Gross")
Experiment with the above plot with new arguments. Try:
- pch = adjusts point shape (e.g. pch = 19)
- col = adjusts color (e.g. col = "tomato", try it)
Instructions: Use these data (lv) to create a histogram, with function hist(), of the distribution of hours worked each pay period, variable hours. Refer to these data as lv$hours in arguments.
Separate the following arguments with commas and wrap text values in "", e.g. col = "skyblue"
Check out Colors in R for a sizeable selection of colors.
- main =
- col =
- xlab =
- ylab =
- breaks =
Store the result in my_hist; the code is all set:
my_hist <- hist()
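For reference, here is how the same arguments look on a different dataset. This sketch assumes the built-in faithful data, so it is not a solution to the exercise; the object name wait_hist and all labels are illustrative.

```r
data(faithful)  # built-in dataset of Old Faithful eruptions

# Each argument is separated by a comma; text values are quoted
wait_hist <- hist(x = faithful$waiting,
                  main = "Waiting time between eruptions",
                  col = "skyblue",
                  xlab = "Minutes",
                  ylab = "Eruptions",
                  breaks = 20)   # a single number is only a suggested bin count

# hist() also returns the bins it used, which helps when tuning breaks =
wait_hist$breaks
wait_hist$counts
```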
Setting Global Parameters: Recall the earlier call to ?par to view all base graphics arguments.
- par() allows setting global parameters
- Reset them with par() or dev.off()
For example, we can change the number of visualizations in one graphic with argument mfrow =:
par(mfrow = c(1, 2))
Here, mfrow = takes a pair of numbers wrapped in function c(): the first sets the number of rows, the second the number of columns.
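Because par() changes stay in effect for later plots, it can help to save the old values and restore them afterwards. A minimal sketch, assuming the built-in cars dataset and illustrative object names:

```r
# par() returns the previous values of whatever it changes, so keep them
old <- par(mfrow = c(1, 2))

# Two plots now share one graphic: 1 row, 2 columns
plot(x = cars$speed, y = cars$dist)
hist(x = cars$dist)

layout_now <- par("mfrow")

# Hand the saved values back to par() to restore the earlier layout
par(old)
layout_restored <- par("mfrow")
```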
Grid It: After calling par() with mfrow = c(1, 2), you can call the following plot functions:
par(mfrow = c(1, 2))
plot(x = lv$hours,
y = lv$gross,
main = "Weekly hours v. gross",
xlab = "Hours",
ylab = "Gross (USD)",
pch = 21,
col = "tomato")
hist(x = lv$hours,
main = "Distribution of worker hours",
xlab = "Hours",
col = "skyblue")
Base exporting functions include png(), pdf(), jpeg(), tiff(), bmp(), and others.
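The usual pattern is to open a device, draw the plot, then close the device. The sketch below assumes your R build supports png(), and it writes to a temporary file; the dimensions, resolution, and plot are arbitrary choices.

```r
# Write to a temporary file so nothing in the working directory is touched
out_file <- tempfile(fileext = ".png")

# Open a PNG device: 6 x 4 inches at 150 pixels per inch
png(filename = out_file, width = 6, height = 4, units = "in", res = 150)

# Everything plotted now goes to the file rather than the screen
hist(x = cars$dist,
     main = "Distribution of stopping distance",
     xlab = "Distance (ft)",
     col = "skyblue")

# Closing the device finishes writing the file
dev.off()

file.exists(out_file)
```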
- filename = takes a quoted value, including an extension, e.g. "hours_distro.png"
- height = and width = modify dimensions, which is important for tailoring, not stretching, graphics
- units =, relating to height and width, sets the units for size, e.g. "in" (inches) or "px" (pixels)
- family = selects font family, affecting all text
- res = adjusts resolution
Weaknesses: Base R package graphics has shortcomings:
pairs()
cex =)
Strengths: Base R package graphics has a few advantages:
In summary, base R graphics is rarely used when alternatives like package ggplot2 are easier to learn and often more practical to implement. However, it has some pretty nifty statistical uses. Here are a few:
Heatmaps
data(volcano)
heatmap(volcano)
Model Summaries
my_lm <- lm(gross ~ hours, data = lv)
par(mfrow = c(2, 2))
plot(my_lm)
Dendrograms
my_cluster <- hclust(dist(USArrests), "ave")
plot(my_cluster)
Mosaic Plots or Confusion Matrices
mosaicplot(gross ~ hours,
data = lv[1:20, ])
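Mosaic plots are usually easiest to read with categorical variables. As a hedged illustration, the sketch below uses the built-in Titanic table, which is an assumption here and not part of the lesson's data; the object name class_by_survival is illustrative.

```r
data(Titanic)  # built-in 4-way table: Class, Sex, Age, Survived

# Collapse the table to the two variables of interest
class_by_survival <- margin.table(Titanic, margin = c(1, 4))

# Tile area is proportional to the count in each Class/Survived cell
mosaicplot(class_by_survival,
           main = "Titanic: survival by class",
           color = c("tomato", "skyblue"))
```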
Univariate Diagnostics
Importing Data: Run the following code to import the worker records for the I-690.
library(readr)
url <- "https://raw.githubusercontent.com/jamisoncrawford/reis/master/Datasets/690_workforce_summary.csv"
hw <- read_csv(url)[, 1:12]
The Plots: Create four plots in a 2 by 2 grid. The calls to par() and mfrow = are in the code below. Create:
- A histogram of hw$hours with function hist()
- A histogram of hw$gross with function hist()
- A scatter plot of hw$gross and hw$hours with function plot()
- A box plot (using plot()) with x = as.factor(hw$race) and y = hw$gross
The Annotations: Include the following arguments for each plot and supply thoughtful parameters:
- main =
- xlab = (optional)
- ylab = (optional)
- frame.plot = FALSE
- col =, cex =, size =, and pch = at your discretion
par(mfrow = c(2, 2))
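If it is unclear why plot() can yield either a scatter plot or a box plot, the sketch below may help. It assumes the built-in iris dataset in place of hw, so it is not a solution to the exercise, and all titles and colors are illustrative.

```r
data(iris)  # built-in dataset, standing in for hw

old <- par(mfrow = c(1, 2))

# Two numeric variables: plot() draws a scatter plot
plot(x = iris$Sepal.Length, y = iris$Petal.Length,
     main = "Sepal v. petal length",
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)",
     frame.plot = FALSE, pch = 19, col = "tomato")

# A factor on x and a numeric on y: plot() draws one box per level
species <- as.factor(iris$Species)
plot(x = species, y = iris$Sepal.Length,
     main = "Sepal length by species",
     xlab = "Species", ylab = "Sepal length (cm)",
     frame.plot = FALSE, col = "skyblue")

# Restore the layout that was in place before this sketch
par(old)
```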