Graphics: Bare
Prologue
- Data in the form of a data frame is not immediately informative.
- Graphics, on the other hand, tell stories and are therefore important for communicating information.
- As the old adage goes, “A picture is worth a thousand words”.
- In this session, we will learn to draw basic plots with no aesthetics or additional annotations whatsoever. So beware of the many shades of grey in this session!
- At the most basic level, we may have 1 or 2 variables to plot. Barplots and histograms are suitable when you have 1 variable, while boxplots, dotplots, scatterplots, and line graphs are suitable when you have 2 variables.
- More advanced graphics such as heatmaps, maps, 3D plots, and interactive graphics will be discussed in later sessions.
- We will be using the ggplot2 package for plotting here.
- We will use the Wage dataset from the ISLR package for the barplot, histogram, boxplot, and scatterplot, and we will create our own datasets for the dotplot and line graph.
library(ggplot2)
# Dataset 1
library(ISLR)
data(Wage)
class(Wage); dim(Wage); str(Wage)
## [1] "data.frame"
## [1] 3000 11
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
# Dataset 2
col1 <- sample(toupper(letters[1:3]), size=30, replace=TRUE)
col2 <- rnorm(n=30, mean=10, sd=2)
df2 <- data.frame("Group"=col1, "Value"=col2)
head(df2)
## Group Value
## 1 A 10.847050
## 2 C 10.636241
## 3 C 13.425266
## 4 C 12.697105
## 5 A 8.588117
## 6 B 9.884022
# Dataset 3
col1 <- 2009:2018
col2 <- rnorm(n=10, mean=10, sd=2)
df3 <- data.frame("Year"=col1, "Interest_in_Her"=col2)
df3
## Year Interest_in_Her
## 1 2009 12.193212
## 2 2010 9.116503
## 3 2011 8.580381
## 4 2012 12.952090
## 5 2013 8.555840
## 6 2014 6.788802
## 7 2015 7.428374
## 8 2016 6.757306
## 9 2017 4.940656
## 10 2018 10.628538
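Dataset 2 and Dataset 3 are random draws, so the values printed above will differ on every run. A minimal sketch of making them reproducible with set.seed() follows; the seed value 123 is an arbitrary choice for illustration and was not used to produce the output shown in this session.

```r
# Fix the random seed so sample() and rnorm() return the same values on every run.
# The seed 123 is arbitrary (an assumption, not taken from the original session).
set.seed(123)

# Dataset 2: 30 observations spread over three groups
df2 <- data.frame(
  Group = sample(c("A", "B", "C"), size = 30, replace = TRUE),
  Value = rnorm(n = 30, mean = 10, sd = 2)
)

# Dataset 3: one value per year
df3 <- data.frame(
  Year = 2009:2018,
  Interest_in_Her = rnorm(n = 10, mean = 10, sd = 2)
)

head(df2)
df3
```

Re-running the whole block, including the set.seed() call, should then give identical data frames each time.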
Barplot
- Barplots are suitable when you want to plot 1 categorical variable.
- The categorical variable will be mapped on the x-axis.
- Counts for each level of the variable, which are usually not yet computed in your original data, will be computed behind the scenes and plotted on the y-axis.
- Using the ggplot() and geom_bar() functions will do the trick.
# Plot the distribution of education levels
ggplot(data=Wage) +
geom_bar(mapping=aes(x=education))
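To see the counts that geom_bar() computes behind the scenes, one option is to inspect the built plot. The sketch below assumes ggplot2's layer_data() helper is available in your installed version; the object names p and computed are only illustrative.

```r
library(ggplot2)
library(ISLR)
data(Wage)

p <- ggplot(data = Wage) +
  geom_bar(mapping = aes(x = education))

# layer_data() returns the data frame ggplot2 computed for the first layer;
# its `count` column holds the bar heights.
computed <- layer_data(p)
computed[, c("x", "count")]

# The same counts obtained directly with base R, for comparison
table(Wage$education)
```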

- You may also supply an input in which the counts are already computed.
- First create a data frame containing the count for each level.
- Then pass the y aesthetic and the stat="identity" argument to the geom_bar() function.
# Compute count for each education level
education_freq <- data.frame(table(Wage$education))
# Give relevant header to data frame
names(education_freq) <- c("education", "frequency")
education_freq
## education frequency
## 1 1. < HS Grad 268
## 2 2. HS Grad 971
## 3 3. Some College 650
## 4 4. College Grad 685
## 5 5. Advanced Degree 426
# Plot the distribution of education levels
ggplot(data=education_freq) +
geom_bar(mapping=aes(x=education, y=frequency), stat="identity")
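As an alternative sketch, recent versions of ggplot2 provide geom_col(), which plots the y values as given, so stat="identity" is not needed. This assumes your installed ggplot2 includes geom_col(); the object name p_col is only illustrative.

```r
library(ggplot2)
library(ISLR)
data(Wage)

# Pre-computed counts, built the same way as above
education_freq <- data.frame(table(Wage$education))
names(education_freq) <- c("education", "frequency")

# geom_col() uses the supplied y values directly as bar heights
p_col <- ggplot(data = education_freq) +
  geom_col(mapping = aes(x = education, y = frequency))
p_col
```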

- You may want proportions instead of counts. Simple pre-processing of the data frame containing the counts will do.
# Compute proportion for each education level
education_freq$proportion <- education_freq$frequency/sum(education_freq$frequency)
education_freq
## education frequency proportion
## 1 1. < HS Grad 268 0.08933333
## 2 2. HS Grad 971 0.32366667
## 3 3. Some College 650 0.21666667
## 4 4. College Grad 685 0.22833333
## 5 5. Advanced Degree 426 0.14200000
# Plot the distribution of education levels
ggplot(data=education_freq) +
geom_bar(mapping=aes(x=education, y=proportion), stat="identity")
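The proportions can also be obtained in one step with base R's prop.table(). This is a sketch of an alternative route, not the approach used above; the name education_prop is illustrative.

```r
library(ggplot2)
library(ISLR)
data(Wage)

# prop.table() divides each count by the total, giving proportions that sum to 1
education_prop <- data.frame(prop.table(table(Wage$education)))
names(education_prop) <- c("education", "proportion")
education_prop

# Plot the proportions as given
p_prop <- ggplot(data = education_prop) +
  geom_bar(mapping = aes(x = education, y = proportion), stat = "identity")
p_prop
```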

Histogram
- Histograms are suitable when you want to plot 1 continuous variable.
- The continuous variable will be mapped on the x-axis.
- Counts for each bin of the variable, which are usually not yet computed in your original data, will be computed behind the scenes and plotted on the y-axis.
- Using the ggplot() and geom_histogram() functions will do the trick.
ggplot(data=Wage) +
geom_histogram(mapping=aes(x=age))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
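The message above is a reminder that the default of 30 bins is rarely the best choice. A minimal sketch of setting the bin width explicitly follows; the width of 5 years is an illustrative assumption, not a recommendation from this session.

```r
library(ggplot2)
library(ISLR)
data(Wage)

# Each bar now covers a 5-year age band instead of one of 30 automatic bins
p_hist <- ggplot(data = Wage) +
  geom_histogram(mapping = aes(x = age), binwidth = 5)
p_hist
```

Passing binwidth (or, alternatively, bins) also silences the message.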

Boxplot
- Boxplots are suitable when you want to plot a categorical variable and a continuous variable.
- The categorical variable will be mapped on the x-axis.
- The continuous variable will be mapped on the y-axis.
- Using the ggplot() and geom_boxplot() functions will do the trick.
ggplot(data=Wage) +
geom_boxplot(mapping=aes(x=education, y=wage))
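Each box summarises a group by its lower quartile, median, and upper quartile. As a rough check, these statistics can be computed directly with base R; the name wage_quartiles is illustrative, and quantile() with its default settings may differ slightly from the hinges ggplot2 draws.

```r
library(ISLR)
data(Wage)

# Split wage by education level, then compute the three quartiles per group.
# The result is a matrix with one column per education level.
wage_quartiles <- sapply(
  split(Wage$wage, Wage$education),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)
wage_quartiles
```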

Dotplot
- A dotplot is similar to a boxplot: the former displays the individual data points, while the latter displays summary statistics of the data points, i.e. the median, the lower and upper quartiles, and outliers.
- A dotplot is more informative than a boxplot when you have few data points in each group.
- Use the ggplot() and geom_dotplot() functions for dotplot.
# Plot the individual values for each group, binned along the y-axis
ggplot(data=df2, aes(x=Group, y=Value)) +
geom_dotplot(binaxis='y', stackdir='center')
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
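As with the histogram, the message above suggests choosing the bin width yourself. The sketch below recreates a dataset shaped like Dataset 2 so that it runs on its own; the seed and the binwidth of 0.3 are illustrative assumptions that you would tune to your data.

```r
library(ggplot2)

# Recreate a dataset shaped like Dataset 2 (arbitrary seed for reproducibility)
set.seed(1)
df2 <- data.frame(
  Group = sample(c("A", "B", "C"), size = 30, replace = TRUE),
  Value = rnorm(n = 30, mean = 10, sd = 2)
)

# binwidth is in the units of the binned axis (here, Value on the y-axis)
p_dot <- ggplot(data = df2, aes(x = Group, y = Value)) +
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.3)
p_dot
```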

Scatterplot
- Scatterplots are suitable when you want to plot 2 continuous variables.
- The “independent” variable will be mapped on the x-axis.
- The “dependent” variable will be mapped on the y-axis.
- I placed quotation marks around the terms independent and dependent because, in association studies, determining which variable affects which may not be straightforward.
- Using the ggplot() and geom_point() functions will do the trick.
ggplot(data=Wage) +
geom_point(mapping=aes(x=age, y=wage))

Line graph
- Line graphs are suitable when you want to join the dots. In other words, you are not interested in finding a line of best fit, such as a straight, curved, or smooth line.
- Using the ggplot() and geom_line() functions will do the trick.
ggplot(data=df3) +
geom_line(mapping=aes(x=Year, y=Interest_in_Her))
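To make the joined dots themselves visible, one option is to layer geom_point() on top of geom_line(). The sketch below recreates a dataset shaped like Dataset 3 with an arbitrary seed so that it runs on its own; the object name p_line is illustrative.

```r
library(ggplot2)

# Recreate a dataset shaped like Dataset 3 (arbitrary seed for reproducibility)
set.seed(1)
df3 <- data.frame(
  Year = 2009:2018,
  Interest_in_Her = rnorm(n = 10, mean = 10, sd = 2)
)

# The mapping is given once in ggplot() and shared by both layers
p_line <- ggplot(data = df3, mapping = aes(x = Year, y = Interest_in_Her)) +
  geom_line() +
  geom_point()
p_line
```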

Summary
- The geom_*() function specifies the type of plot.
- The aes() function, passed as the mapping argument, specifies the x and y variables.
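The session places aes() inside the geom_*() call in most examples and inside ggplot() in the dotplot example. A short sketch suggesting that the two placements give the same single-layer plot follows; the names p1 and p2 are illustrative.

```r
library(ggplot2)
library(ISLR)
data(Wage)

# Mapping supplied to the geom: applies to that layer only
p1 <- ggplot(data = Wage) +
  geom_point(mapping = aes(x = age, y = wage))

# Mapping supplied to ggplot(): inherited by every layer added afterwards
p2 <- ggplot(data = Wage, mapping = aes(x = age, y = wage)) +
  geom_point()

# Both plots place the points at the same coordinates
all.equal(layer_data(p1)[, c("x", "y")], layer_data(p2)[, c("x", "y")])
```

Putting the mapping in ggplot() tends to be convenient once a plot has more than one layer, since it only has to be written once.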