Graphics: Adding New Dimensions

Prologue

Picking up from where we left off previously, we typically will have 1 or 2 variables to plot.
For 1 variable, we would commonly use bar charts and histograms when our variable is categorical and continuous, respectively.
For 2 variables, we would commonly use boxplots, scatterplots, and line graphs.
What if we would like to add a 2nd variable to, say, a bar chart? Or a 3rd variable to a scatterplot? Or even a 2nd, 3rd, 4th etc. variable to a line graph?
The answer to these scenarios is color, and to some degree shape and size, particularly for scatterplots.
We will be using the ggplot2 package for plotting purposes here.
We will use the Wage dataset from the ISLR package for plotting bar charts, the iris dataset from base R for plotting scatterplots, and a dataset of our own for plotting line graphs.

library(ggplot2)

# Dataset 1
library(ISLR)
data(Wage)
class(Wage); dim(Wage); str(Wage)

## [1] "data.frame"

## [1] 3000   11

## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

# Dataset 2
data(iris)
class(iris); dim(iris); str(iris)

## [1] "data.frame"

## [1] 150   5

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# Dataset 3
set.seed(888)
col1 <- rep(toupper(letters[1:10]), each=10)
col2 <- rep(2009:2018, times=10)  # repeat the ten years once per item
col3 <- rnorm(n=100, mean=20, sd=1)
col4 <- rnorm(n=100, mean=5, sd=1)
df <- data.frame("Item"=col1, "Year"=col2, "Sales"=col3, "Quantity"=col4)
head(df, 15)

##    Item Year    Sales Quantity
## 1     A 2009 18.04866 5.729918
## 2     A 2010 18.45563 4.154880
## 3     A 2011 20.72983 3.792285
## 4     A 2012 19.72242 6.760088
## 5     A 2013 18.34372 3.671745
## 6     A 2014 19.74898 4.352879
## 7     A 2015 17.83355 3.662588
## 8     A 2016 20.58686 4.919433
## 9     A 2017 20.63062 6.085947
## 10    A 2018 21.10359 4.854613
## 11    B 2009 20.31740 4.995891
## 12    B 2010 21.35573 7.466998
## 13    B 2011 20.95616 3.837340
## 14    B 2012 20.40556 4.226224
## 15    B 2013 19.49944 5.190011

Bar chart: Colors

Here, we will first plot the count distribution of a single variable, education.
Then, we will add race as the 2nd variable to the plot. This gives the breakdown of race within each education level. Map this 2nd variable to the fill aesthetic inside aes() and pass it to the geom_bar() function.
Pass the position="fill" argument to the geom_bar() function to yield a stacked bar chart in which every bar is scaled to the same height.
The y-axis then displays proportions instead of counts. The default, position="stack", stacks the raw counts instead.

# Plotting 1 variable
ggplot(data=Wage) +
  geom_bar(mapping=aes(x=education))

# Plotting 2 variables: stacked bar chart
ggplot(data=Wage) +
  geom_bar(mapping=aes(x=education, fill=race), position="fill")

Pass the position="dodge" argument to the geom_bar() function to yield a contiguous bar chart, in which the bars for each race are placed side by side within each education level.
The usual count is displayed on the y-axis for contiguous bar charts.

# Plotting 2 variables: contiguous bar chart
ggplot(data=Wage) +
  geom_bar(mapping=aes(x=education, fill=race), position="dodge")
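
For comparison, here is a minimal sketch of the default stacking behaviour, together with the proportions that position="fill" draws. It assumes the ggplot2 and ISLR packages are installed; the object names p_stack and props are illustrative only.

```r
library(ggplot2)
library(ISLR)
data(Wage)

# Default position ("stack"): segments are stacked and the y-axis shows counts
p_stack <- ggplot(data=Wage) +
  geom_bar(mapping=aes(x=education, fill=race), position="stack")
p_stack

# Row-wise proportions of race within each education level;
# these are the segment heights drawn when position="fill"
props <- prop.table(table(Wage$education, Wage$race), margin=1)
round(props, 3)
```

Each row of props sums to 1, which is why every bar has the same height when position="fill" is used.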

Scatterplot: Colors

Here, we will first plot the data points for the 2 variables, i.e. sepal length and petal length.
Then, we will add a categorical variable as the 3rd variable to the plot. Here, we will use Species as this additional variable. This will annotate each data point with the corresponding species.
Define this 3rd variable in the color argument and pass it to the geom_point() function.

# Plotting 2 variables
ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length))

# Plotting 3 variables
ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, color=Species))

You may want to customize your legends.
Pass the color argument to the labs() function to customize the main legend title.
Pass the labels argument to the factor() function to customize the individual labels.

ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, color=factor(Species, labels=c("Setosa", "Versicolor", "Virginica")))) +
  labs(color="Species Labels")
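
As an alternative sketch, the legend can also be relabelled through the color scale itself, which leaves the Species column untouched. This uses scale_color_discrete() from ggplot2; the labels are given in the order of the factor levels (setosa, versicolor, virginica), and the object name p_legend is illustrative only.

```r
library(ggplot2)

# Relabel the legend through the colour scale; the iris data stay unchanged
p_legend <- ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, color=Species)) +
  scale_color_discrete(name="Species Labels",
                       labels=c("Setosa", "Versicolor", "Virginica"))
p_legend
```

Whether to relabel in factor() or in the scale is a matter of preference; the scale approach keeps the plotting code free of data manipulation.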

We just added a categorical variable to our scatterplot. We could do the same with a continuous variable. Here, we will use petal width as this additional variable.

ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, color=Petal.Width))
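
If the default gradient does not suit your needs, the two ends of the continuous color scale may be chosen explicitly. The sketch below uses scale_color_gradient() from ggplot2; the two color names are arbitrary choices and p_gradient is an illustrative object name.

```r
library(ggplot2)

# Continuous colour scale running from a light to a dark shade of blue
p_gradient <- ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, color=Petal.Width)) +
  scale_color_gradient(low="lightblue", high="darkblue")
p_gradient
```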

Scatterplot: Shapes

When plotting a 3rd categorical variable, we could represent each level of this variable using data points of different shapes.
Pass the shape argument to the geom_point() function for this purpose.

ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, shape=Species))

You may also use a combination of shapes and colors.
This combination is very typical of principal component analysis plots.

ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, color=Species, shape=Species))
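
To illustrate the principal component analysis use case, here is a minimal sketch that runs prcomp() from base R on the four numeric iris columns and plots the first two components. The object names pca, scores and p_pca are illustrative only.

```r
library(ggplot2)

# Principal component analysis on the four numeric iris columns
pca <- prcomp(iris[, 1:4], center=TRUE, scale.=TRUE)

# Combine the first two component scores with the species labels
scores <- data.frame(pca$x[, 1:2], Species=iris$Species)

# Species is shown through both the color and the shape of each point
p_pca <- ggplot(data=scores) +
  geom_point(mapping=aes(x=PC1, y=PC2, color=Species, shape=Species))
p_pca
```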

Scatterplot: Sizes

When plotting a 3rd continuous variable, we could represent the scale of this variable using data points of different sizes.
Pass the size argument to the geom_point() function for this purpose.

ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, size=Petal.Width))

It would seem that the overlapping data points may obscure one another to some degree.
In this case, you may want to make your data points more transparent.
Pass the alpha argument to the geom_point() function for this purpose. Because alpha is a constant here rather than a variable, place it outside aes(). The alpha argument takes a value from 0 to 1, where values nearer to 0 represent more transparency.

ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, size=Petal.Width), alpha=0.1)
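
An aesthetic can either be mapped to a variable inside aes() or set to a constant outside it. The sketch below sets alpha to a constant and then uses layer_data() from ggplot2 to inspect the values that are actually used for drawing; p_set is an illustrative object name.

```r
library(ggplot2)

# alpha is set as a constant outside aes(), so no alpha legend is drawn
p_set <- ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, size=Petal.Width),
             alpha=0.1)

# layer_data() returns the computed values for the first layer
unique(layer_data(p_set)$alpha)
```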

You may also use a combination of sizes and colors.

ggplot(data=iris) +
  geom_point(mapping=aes(x=Sepal.Length, y=Petal.Length, color=Petal.Width, size=Petal.Width), alpha=0.1)

Line graph: Colors

Plotting line graphs may require you to reshape your data frame into a “long” format. Refer to Reshaping Data Frames for more information.

head(df, n=15)

##    Item Year    Sales Quantity
## 1     A 2009 18.04866 5.729918
## 2     A 2010 18.45563 4.154880
## 3     A 2011 20.72983 3.792285
## 4     A 2012 19.72242 6.760088
## 5     A 2013 18.34372 3.671745
## 6     A 2014 19.74898 4.352879
## 7     A 2015 17.83355 3.662588
## 8     A 2016 20.58686 4.919433
## 9     A 2017 20.63062 6.085947
## 10    A 2018 21.10359 4.854613
## 11    B 2009 20.31740 4.995891
## 12    B 2010 21.35573 7.466998
## 13    B 2011 20.95616 3.837340
## 14    B 2012 20.40556 4.226224
## 15    B 2013 19.49944 5.190011

Colors in line graphs are useful when you would like to plot the same feature for multiple elements.
For example, you may want to look into sales values across the years for multiple items.
Here, each differently colored line represents a different item.

ggplot(data=df) +
  geom_line(mapping=aes(x=Year, y=Sales, colour=Item))
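
With ten items, the lines may be hard to tell apart, and the default x-axis breaks need not fall on whole years. The sketch below is one possible refinement: it recreates Dataset 3, keeps three items (an arbitrary choice), and places one axis break per year with scale_x_continuous(). The names df_sub and p_lines are illustrative only.

```r
library(ggplot2)

# Recreate Dataset 3 from the prologue
set.seed(888)
col1 <- rep(toupper(letters[1:10]), each=10)
col2 <- rep(2009:2018, times=10)
col3 <- rnorm(n=100, mean=20, sd=1)
col4 <- rnorm(n=100, mean=5, sd=1)
df <- data.frame(Item=col1, Year=col2, Sales=col3, Quantity=col4)

# Keep a handful of items so the colored lines stay readable
df_sub <- subset(df, Item %in% c("A", "B", "C"))

# One axis break per year
p_lines <- ggplot(data=df_sub) +
  geom_line(mapping=aes(x=Year, y=Sales, colour=Item)) +
  scale_x_continuous(breaks=2009:2018)
p_lines
```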

Line graph: Dual axes

You may wish to look into two trends for the same element.
For example, you may want to look into both the trends of sales and quantity across the years for a single item.
You may of course plot both trends using the same y-axis.

library(reshape2)

# Retrieve one item (A)
df_Item_A <- df[which(df$Item=="A"), ]

# Convert data frame to "long format"
tall <- melt(data=df_Item_A, id.vars=c("Item", "Year"), measure.vars=c("Sales", "Quantity"))
tall

##    Item Year variable     value
## 1     A 2009    Sales 18.048657
## 2     A 2010    Sales 18.455634
## 3     A 2011    Sales 20.729833
## 4     A 2012    Sales 19.722418
## 5     A 2013    Sales 18.343716
## 6     A 2014    Sales 19.748977
## 7     A 2015    Sales 17.833546
## 8     A 2016    Sales 20.586860
## 9     A 2017    Sales 20.630617
## 10    A 2018    Sales 21.103590
## 11    A 2009 Quantity  5.729918
## 12    A 2010 Quantity  4.154880
## 13    A 2011 Quantity  3.792285
## 14    A 2012 Quantity  6.760088
## 15    A 2013 Quantity  3.671745
## 16    A 2014 Quantity  4.352879
## 17    A 2015 Quantity  3.662588
## 18    A 2016 Quantity  4.919433
## 19    A 2017 Quantity  6.085947
## 20    A 2018 Quantity  4.854613

# Plot line graph
ggplot(data=tall) +
  geom_line(mapping=aes(x=Year, y=value, colour=variable))
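
To make the "long" format more concrete, the sketch below builds the same layout by hand in base R, without reshape2. It recreates Dataset 3 first; the name tall_base is illustrative only. Unlike melt(), this version stores the variable column as character rather than as a factor.

```r
# Recreate Dataset 3 and retrieve item A
set.seed(888)
col1 <- rep(toupper(letters[1:10]), each=10)
col2 <- rep(2009:2018, times=10)
col3 <- rnorm(n=100, mean=20, sd=1)
col4 <- rnorm(n=100, mean=5, sd=1)
df <- data.frame(Item=col1, Year=col2, Sales=col3, Quantity=col4)
df_Item_A <- df[df$Item == "A", ]

# Stack the two measurement columns on top of each other and
# record in "variable" which column each value came from
tall_base <- data.frame(
  Item=rep(df_Item_A$Item, times=2),
  Year=rep(df_Item_A$Year, times=2),
  variable=rep(c("Sales", "Quantity"), each=nrow(df_Item_A)),
  value=c(df_Item_A$Sales, df_Item_A$Quantity)
)
head(tall_base, 3)
```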

But oftentimes, the two variables may not share the same scale/range or even the same units.
Here, combining multi-colored lines with an additional y-axis, added through the sec.axis argument of scale_y_continuous(), may help.

# Compute average ratio of Sales to Quantity
ratio <- mean(df_Item_A$Sales)/mean(df_Item_A$Quantity)

# Plot line graph
ggplot(data=df_Item_A) +
  # Plot the first variable
  geom_line(aes(x=Year, y=Sales, color="Sales")) +
  # Plot the second variable
  geom_line(aes(x=Year, y=Quantity*ratio, color="Quantity")) +
  # Add the second y-axis
  scale_y_continuous(sec.axis=sec_axis(~./ratio, name="Quantity")) +
  # Change legend main title
  labs(color="Variables")
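
The rescaling used in the plot above can be checked numerically. The sketch below recreates Dataset 3 and confirms that multiplying Quantity by the ratio brings it to the same mean as Sales, and that dividing by the ratio, as done in sec_axis(), recovers the original values. The name scaled is illustrative only.

```r
# Recreate Dataset 3 and retrieve item A
set.seed(888)
col1 <- rep(toupper(letters[1:10]), each=10)
col2 <- rep(2009:2018, times=10)
col3 <- rnorm(n=100, mean=20, sd=1)
col4 <- rnorm(n=100, mean=5, sd=1)
df <- data.frame(Item=col1, Year=col2, Sales=col3, Quantity=col4)
df_Item_A <- df[df$Item == "A", ]

ratio <- mean(df_Item_A$Sales) / mean(df_Item_A$Quantity)

# First transformation: Quantity expressed on the scale of the primary axis
scaled <- df_Item_A$Quantity * ratio
all.equal(mean(scaled), mean(df_Item_A$Sales))

# Second transformation: the secondary axis undoes the scaling for its labels
all.equal(scaled / ratio, df_Item_A$Quantity)
```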

Plotting dual axes using ggplot2 isn’t the most convenient approach.
Notice that we had to do 2 transformations: the first when plotting the line of the additional variable, and the second when defining the scale of the y-axis for this additional variable.
Notice also that the scale of the additional y-axis depends heavily on the values of the plotted variables, through the ratio used for the rescaling.
In this case, the base R plotting mechanism may be more suitable for plotting dual axes.

# Widen the plot margins to make space for the 2nd y-axis
par(mar = c(5, 4, 4, 4) + 0.3)

# Plot the first variable, i.e. Sales
plot(df_Item_A$Year, df_Item_A$Sales, type="l", col="blue", xlab="Year", ylab="Sales")

# Indicate you would like to plot an additional line graph
par(new = TRUE)

# Plot the second variable, i.e. Quantity
# The argument axes=FALSE suppresses the axes of this second plot. Setting it to TRUE would draw a second y-axis on top of the existing one in the final display
# The x- and y-axis labels need not be specified again, so they are left empty
plot(df_Item_A$Year, df_Item_A$Quantity, type = "l", col="red", axes = FALSE, xlab = "", ylab = "")

# Set up the y-axis for the second variable
# The side argument specifies that the axis should be drawn on the right-hand side of the graph
axis(side=4, at=pretty(range(df_Item_A$Quantity)))

# Give the new axis a label
# The side argument specifies that the label should be placed on the right-hand side of the graph
# The line argument specifies how far from the plot margin the label should be placed
mtext("Quantity", side=4, line=3)
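
The base R steps above may be wrapped in a small helper so that the widened margins are restored afterwards and a legend identifies the two lines. This is only a sketch: plot_dual_axis() is a hypothetical function name, not part of any package, and the colors and legend position are arbitrary choices.

```r
# Hypothetical helper: two line graphs sharing an x-axis, one y-axis per side
plot_dual_axis <- function(x, y1, y2, xlab="Year", lab1="Sales", lab2="Quantity") {
  # Widen the margins and restore the previous setting when the function exits
  old_par <- par(mar=c(5, 4, 4, 4) + 0.3)
  on.exit(par(old_par))

  # First variable on the left-hand axis
  plot(x, y1, type="l", col="blue", xlab=xlab, ylab=lab1)

  # Second variable drawn over the same plotting region, without axes
  par(new=TRUE)
  plot(x, y2, type="l", col="red", axes=FALSE, xlab="", ylab="")

  # Right-hand axis and its label
  axis(side=4, at=pretty(range(y2)))
  mtext(lab2, side=4, line=3)

  # Legend matching the line colors
  legend("topleft", legend=c(lab1, lab2), col=c("blue", "red"), lty=1)
  invisible(NULL)
}

# Recreate Dataset 3, retrieve item A and draw the plot
set.seed(888)
col1 <- rep(toupper(letters[1:10]), each=10)
col2 <- rep(2009:2018, times=10)
col3 <- rnorm(n=100, mean=20, sd=1)
col4 <- rnorm(n=100, mean=5, sd=1)
df <- data.frame(Item=col1, Year=col2, Sales=col3, Quantity=col4)
df_Item_A <- df[df$Item == "A", ]
plot_dual_axis(df_Item_A$Year, df_Item_A$Sales, df_Item_A$Quantity)
```

Restoring par() through on.exit() means that later plots in the same session are not affected by the wider right-hand margin.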