1 What is Data Visualization?


Data Visualization is:

  • “The representation of an object, situation, or set of information as a chart or other image” - Oxford Dictionary
  • “…the creation and study of the visual representation of data” - Wikipedia


Abbreviations: Both “data vis” and “data viz” are used as abbreviations for data visualization.


Observe this example, which visualizes public construction records for Lakeview Amphitheater in Syracuse, NY:



Questions: In observing the above scatter plot:

  • What do you like about it?
  • What do you dislike about it?
  • What idea does this image attempt to convey?


2 Visualization in R


R produces graphics with both built-in and external packages (libraries).

The three most popular packages, from oldest to newest, are:

  • graphics, built into R for “base graphics”
  • lattice, a recommended package that ships with R but must be loaded with library(), for small multiples, which group and depict data in small charts on common scales
  • ggplot2, a core Tidyverse package based on Leland Wilkinson’s “Grammar of Graphics” (1999)


Honorable mention goes to package grid, which powers lattice and ggplot2 output behind the scenes. Base R graphics is a separate system.


3 Base R “graphics”


Package graphics comes preinstalled and preloaded in your R or RStudio environment.

  • Base R visualizations are layered, i.e. an “Artist’s Palette” approach
  • The first visualization activates R’s graphics device
  • Subsequent commands layer over the initial viz
  • A layer cannot be undone, so fixing a mistake requires rerunning the commands, starting with the initial plot


3.1 Practice Data


Let’s import the Lakeview Amphitheater data again and save them in object lv:

library(readr)

url <- "https://raw.githubusercontent.com/jamisoncrawford/reis/master/Datasets/tblr_master.csv"
lv <- read_csv(url)

index <- which(lv$project == "Lakeview")
lv <- lv[index, 1:12]

You can find the documentation for these data in the Lakeview and Hancock repositories.


Explore the data with str() and other exploratory functions:

names(lv)
dim(lv)
summary(lv)
str(lv)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3356 obs. of  12 variables:
##  $ project: chr  "Lakeview" "Lakeview" "Lakeview" "Lakeview" ...
##  $ name   : chr  "Ajay Glass" "Ajay Glass" "Ajay Glass" "Ajay Glass" ...
##  $ ending : Date, format: "2015-08-02" "2015-08-02" ...
##  $ zip    : int  14612 13205 14428 14549 14433 14612 14428 13118 14433 14612 ...
##  $ ssn    : chr  "3888" "2807" "7762" "9759" ...
##  $ class  : chr  "Journeyman" "Journeyman" "Journeyman" "Journeyman" ...
##  $ hours  : num  40 32 40 26 32 40 40 16 32 42 ...
##  $ rate   : num  18.5 28.7 33.7 33.7 17 18.5 33.7 28.7 17 18.5 ...
##  $ gross  : num  740 918 1348 876 544 ...
##  $ net    : num  511 587 1026 891 384 ...
##  $ sex    : chr  NA NA NA NA ...
##  $ race   : chr  NA NA NA NA ...


3.2 Functions


Plotting Functions: Base R graphics have many functions. The following will initialize a plot:

  • plot() automatically detects/creates scatter plots, bar charts, etc.
  • hist() creates a histogram for a single variable
  • boxplot() creates a box and whisker plot
  • barplot() creates bar plots

There are several other functions for initializing plots, but these are the most common.
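
To see how plot() picks a chart type from its inputs, here is a minimal sketch. It uses the built-in mtcars data rather than the Lakeview records, and the object name cyl is made up for this illustration.

```r
# Built-in data, used only for illustration (not the Lakeview records)
data(mtcars)

# Number of cylinders, stored as a categorical variable (hypothetical name)
cyl <- as.factor(mtcars$cyl)

plot(x = mtcars$wt, y = mtcars$mpg)   # numeric x, numeric y: scatter plot
plot(x = cyl)                         # a single factor: bar chart of counts
plot(x = cyl, y = mtcars$mpg)         # factor x, numeric y: box plots
```

Each call starts a new plot, so in RStudio you can page back through all three in the Plots pane.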


Annotating Functions: The following functions, among others, help annotate an existing plot.

  • loess() fits a local regression smoother, which lines() can then draw over an existing scatter plot
  • text() adds text to an existing plot
  • jitter() adds random noise to a numeric vector before it is plotted, which breaks up overlapping points
  • legend() adds a legend to an existing plot
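
As a rough sketch of layering, the code below initializes a scatter plot and then annotates it three times. It assumes the built-in cars data instead of lv, and the names fit and top are invented for the example. Note that loess() only fits the model; lines() is what draws the fitted curve.

```r
data(cars)  # built-in data: speed (mph) and stopping distance (ft)

# Initialize the plot; every call below draws on top of it
plot(x = cars$speed, y = cars$dist)

# Fit a local regression, then draw the fitted curve with lines()
# (cars is already sorted by speed, so the curve is drawn left to right)
fit <- loess(dist ~ speed, data = cars)
lines(x = cars$speed, y = predict(fit), col = "tomato", lwd = 2)

# Label the observation with the longest stopping distance
top <- which.max(cars$dist)
text(x = cars$speed[top], y = cars$dist[top],
     labels = "Longest stop", pos = 2)   # pos = 2 puts text left of the point

# Add a legend in the top-left corner
legend("topleft", legend = "Loess smoother", col = "tomato", lwd = 2)
```

If any layer looks wrong, rerun the block from plot() onward.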


Modifiable Parameters: Parameters go inside the () of a function, and there are many.

  • main = modifies title
  • xlab = modifies x-axis label
  • ylab = modifies y-axis label
  • lwd = modifies line width
  • lty = modifies line type
  • col = modifies color
  • cex = modifies size

There are a ridiculous number of modifiable parameters. You can learn more by running ?par.
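
Here is a small sketch that uses each parameter listed above in a single call. It assumes the built-in pressure data, and type = "b" (points joined by lines) is an extra argument added so that lwd = and lty = have something to act on.

```r
data(pressure)  # built-in data: temperature and vapor pressure of mercury

plot(x = pressure$temperature,
     y = pressure$pressure,
     type = "b",                          # draw both points and lines
     main = "Vapor pressure of mercury",  # title
     xlab = "Temperature (deg C)",        # x-axis label
     ylab = "Pressure (mm Hg)",           # y-axis label
     lwd = 2,                             # line width
     lty = 2,                             # line type: 2 is dashed
     col = "steelblue",                   # color
     cex = 1.5)                           # point size, relative to a default of 1
```

Try changing one value at a time to see what each parameter controls.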

The curious learner should check out this Base R Cheat Sheet by David Gerard (2017).


Turn It Off: Since each annotating function adds a layer to an existing plot:

  • You can begin a new plot with plot.new()
  • You can turn off (and restart) your graphical device with dev.off()

These are great go-to procedures for troubleshooting when visualizations deviate from expectations.
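
A quick way to see what dev.off() does is to check the active device before and after. This sketch assumes only one graphics device is open; dev.cur() is a base R helper not covered above.

```r
plot(x = 1:10, y = (1:10)^2)   # opens a graphics device if none is active
dev.cur()                      # name and number of the active device

dev.off()                      # close the active device
dev.cur()                      # with no other device open: "null device"

plot(x = 1:10, y = (1:10)^2)   # plotting again opens a fresh device
```

Closing the device also discards any global parameters set on it, which is why it works as a reset.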


Let’s Plot: The following is a scatter plot of gross pay as a function of hours worked.

  • Here, we set x = and y = to variables hours and gross, extracted from dataset lv with $
plot.new()
plot(x = lv$hours, y = lv$gross)


Modifying an Initial Plot: You can adjust the parameters in the initial call to plot().

plot.new()
plot(x = lv$hours, 
     y = lv$gross,
     main = "Lakeview: Weekly hours v. gross pay",
     xlab = "Hours",
     ylab = "Gross")


Experiment with the above plot with new arguments. Try:

  • pch = adjusts point shape (e.g. pch = 19)
  • col = adjusts color (e.g. col = "tomato")
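
The sketch below combines pch = and col = with jitter(). It assumes the built-in mtcars data, whose cyl and gear variables take only a few distinct values, so many points would otherwise sit exactly on top of each other.

```r
data(mtcars)  # built-in data, for illustration only

# jitter() nudges each value by a small random amount before plotting
plot(x = jitter(mtcars$cyl),
     y = jitter(mtcars$gear),
     main = "Cylinders v. gears (jittered)",
     xlab = "Cylinders",
     ylab = "Gears",
     pch = 19,         # solid circles
     col = "tomato")
```

Because the noise is random, the points land in slightly different places on every run.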


3.3 Mid-Introduction Practice


Instructions: Use these data (lv) to create a histogram, with function hist(), of the distribution of hours worked each pay period, variable hours. Refer to these data as lv$hours in arguments.

The following arguments are separated by commas, and text values must be wrapped in quotation marks, e.g. col = "skyblue"

Check out Colors in R for a sizeable selection of colors.

  • Set a title with argument main =
  • Set a color argument with col =
  • Set an x-axis label with xlab =
  • Set a y-axis label with ylab =
  • Set the number of bins (or the exact break points) with breaks =
  • Store this in object my_hist; the assignment is already written in the code below
my_hist <- hist()


3.4 Global Parameters


Setting Global Parameters: Recall the earlier call to ?par to view all base graphics arguments.

  • Function par() allows setting global parameters
  • In other words, they apply to plots until reset
  • Allows for unified and consistent visualizations
  • Reset by restoring the values saved from an earlier call to par(), or by closing the device with dev.off()
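
One common pattern for resetting, sketched here with the built-in cars data: when par() sets parameters it returns their previous values, so you can store them and hand them back later. The object name op is arbitrary.

```r
data(cars)  # built-in data, for illustration only

# Set global parameters and save the previous values in op
op <- par(mfrow = c(1, 2), cex.main = 0.9)

hist(cars$speed, main = "Speed")                 # left panel
hist(cars$dist, main = "Stopping distance")      # right panel

# Restore the saved values; later plots use the old settings again
par(op)
```

Calling par() with no arguments only lists the current settings; it does not reset anything.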


For example, we can change the number of visualizations in one graphic with argument mfrow =:

par(mfrow = c(1, 2))

Here, mfrow = takes a pair of numbers in function c(): the first sets the number of rows, the second the number of columns.


Grid It: After calling par() with mfrow = c(1, 2), you can call the following plot functions:

par(mfrow = c(1, 2))

plot(x = lv$hours, 
     y = lv$gross,
     main = "Weekly hours v. gross",
     xlab = "Hours",
     ylab = "Gross (USD)",
     pch = 21,
     col = "tomato")

hist(x = lv$hours,
     main = "Distribution of worker hours",
     xlab = "Hours",
     col = "skyblue")


3.5 Exporting Graphics


Base exporting functions include png(), pdf(), jpeg(), tiff(), bmp(), and others.

  • filename = takes a quoted value, including an extension, e.g. "hours_distro.png" (in pdf(), this argument is named file =)
  • height = and width = modify dimensions - important for tailoring, not stretching, graphics
  • units = sets the unit of measurement for height and width, e.g. "in" (inches) or "px" (pixels)
  • family = selects font family, affecting all text
  • res = sets resolution in pixels per inch
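
Exporting follows an open, draw, close pattern, roughly as sketched below. The file name and the use of tempdir() are placeholders for this example, and the built-in cars data stands in for lv.

```r
data(cars)  # built-in data, for illustration only

# Hypothetical output path; swap in your own folder and file name
out <- file.path(tempdir(), "example_hist.png")

# 1. Open a file-based graphics device
png(filename = out, width = 6, height = 4, units = "in", res = 300)

# 2. Draw as usual; output goes to the file, not the screen
hist(cars$dist,
     main = "Distribution of stopping distances",
     xlab = "Distance (ft)",
     col = "skyblue")

# 3. Close the device; the file is not complete until this runs
dev.off()

file.exists(out)   # check that the file was written
```

When units = is anything other than "px", png() needs a value for res = to convert the dimensions to pixels.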


3.6 Verdict


Weaknesses: Base R package graphics has shortcomings:

  • “Artist’s Palette” is cool, and good script writing helps, but it may be inconvenient
  • Limited use of small multiples with function pairs()
  • Inconsistent in syntax, with unintuitive arguments (e.g. cex =)
  • Limited in displaying multivariate data
  • Not terribly elegant


Strengths: Base R package graphics has a few advantages:

  • “Quick and Dirty” exploratory visualizations to understand new data
  • Uses many arguments which help in understanding more advanced packages
  • Pretty good for statistics-related visualization


In summary, Base R graphics is rarely used when alternatives like package ggplot2 are easier to learn and often more practical to implement. However, it has some pretty nifty statistical uses. Here are a few:

Heatmaps

data(volcano)
heatmap(volcano)

Model Summaries

my_lm <- lm(gross ~ hours, data = lv)
par(mfrow = c(2, 2))
plot(my_lm)


Dendrograms

my_cluster <- hclust(dist(USArrests), "ave")
plot(my_cluster)


Mosaic Plots or Confusion Matrices

mosaicplot(gross ~ hours, 
           data = lv[1:20, ])
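
Mosaic plots are designed for categorical variables, so they are easiest to read when built from a table of counts. A minimal sketch, assuming the built-in mtcars data and a made-up object name, counts:

```r
data(mtcars)  # built-in data, for illustration only

# Cross-tabulate two categorical variables; a confusion matrix has this same shape
counts <- table(cylinders = mtcars$cyl, gears = mtcars$gear)
counts

# Tile area is proportional to the count in each cell
mosaicplot(counts,
           main = "Cylinders by gears",
           color = TRUE)   # shade the tiles
```

The same call works on a table of predicted versus actual classes.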


Univariate Diagnostics


4 Applied Practice


Importing Data: Run the following code to import the worker records for the I-690 project.

library(readr)

url <- "https://raw.githubusercontent.com/jamisoncrawford/reis/master/Datasets/690_workforce_summary.csv"
hw <- read_csv(url)[, 1:12]


The Plots: Create four plots in a 2 by 2 grid. The calls to par() and mfrow = are in the code below. Create:

  1. A histogram of hw$hours with function hist()
  2. A histogram of hw$gross with function hist()
  3. A scatterplot of hw$gross and hw$hours with function plot()
  4. A boxplot (using function plot()) with x = as.factor(hw$race) and y = hw$gross


The Annotations: Include the following arguments for each plot and supply thoughtful parameters:

  • main =
  • xlab = (optional)
  • ylab = (optional)
  • Set frame.plot = FALSE
  • Use col =, cex =, and pch = at your discretion
par(mfrow = c(2, 2))