ALY6000 Class 2: Graphics
Northeastern University

Instructor: Dr. Dee Chiluiza, PhD
Boston, Massachusetts
Date: 25 May, 2021


Note. There is a code icon at the top-right of this document. It will allow you to hide or to show all codes at once.
You can also observe individual codes by clicking on the icons located on the right margin of the document.


Libraries and data sets used on this document

Click the Code icon on the right to observe the list of libraries and data sets used on this document.

# Libraries used in this document
library(tidyverse) 
library(gridExtra)   # For grid.arrange()
library(grid)        # For grid tables
library(DT)          # For datatables
library(knitr) 
library(modeest)
library(magrittr)
library(RColorBrewer)
library(pander)


# Data sets used in this document
data("faithful")
data("ggplot2::mpg")
data("iris")
data("mtcars")
data("pressure")


Bar plots


Take a look at the data table

Observe just the top of the data set to have a general idea.

knitr::kable(head(mpg,5))
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact


Take a quick look at the data using tibble::glimpse()

Another similar alternative is str().

tibble::glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "~
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "~
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.~
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200~
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, ~
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto~
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4~
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1~
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2~
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p~
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c~


Tables. We can use code table() to present a simple count of observations per categories.
As observed in the tibble::glimpse() code above, drive train in a categorical variable with 3 categories.
These are the categories in drive train: f, 4, r, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd.


This first table is created using regular coding: codes inside codes.

knitr::kable(as.data.frame(table(mpg$drv)))
Var1 Freq
4 103
f 106
r 25

Compare with this table created using the same codes, but organized using pipes (%>%).

mpg %>%
  select(drv)%>%
  table()%>%
  as.data.frame()%>%
  knitr::kable()
. Freq
4 103
f 106
r 25


Prepare a bar plot

One easy way to prepare a bar plot for a categorical variable is by using the tables we created above. Observe the code below.

par(mfcol=c(1,2))

drv_table = table(mpg$drv) 

# Basic plot
barplot(drv_table, 
        main="Basic plot")

# Plot with added options
barplot(drv_table, 
        main="Plot with added options",
          col= c("blue", "yellow","pink"),
          las=1,
          ylab = "Frequency")

Observe the second bar plot. You can improve the presentation format of your plot by adding options. These options are inserted in the barplot() code and separated by comas.
A common practice for better organization is to create a new line for each added option.


Horizontal bar plots

You can present the bar plot in an horizontal position by simply adding the option: horiz = TRUE.
Using the second bar plot from the previous R chunk.
Pay attention to xlab and ylab options, in this case, we need to change the ylab = “Frequency” to xlab = “Frequency”.

barplot(drv_table, 
        horiz = TRUE, 
        main="Horizontal Bar Plot",
          col= c("blue", "yellow","pink"),
          las=1,
          xlab = "Frequency")


Prepare a bar plot with two categorical variables

You can also present two categorical variables at the same time, this one you can observe the distribution of one categorical variable based on a second categorical variable.

In this case, the table you create must include the two categorical variables of interest.
The par() code below helps with the presentation of 2 or more graphs, I will teach more about this code during next class.

par(mfrow=c(1,2), mai=c(1.4,1.4,1,0.6))

### Create a plot using regular coding 

# 1. Create a table for the two categorical variables
# Provide a name to the table

driveClass = table(mpg$drv, mpg$class)

# 2. Create the plot using the table you created

plot14 = barplot(driveClass,
                 main = "Regular codes",
                 las=2,
                 ylab = "Frequency")

### Create a plot using pipes

mpg %>%
  select(drv, class) %>%
  table() %>%
  barplot(las=2,
          main = "Codes using Pipes (%>%)",
          ylab = "Frequency")


Add colors and a legend

Using the second bar plot from the previous R chunk.

For colors, one option to use is brewer.pal(), from library(RColorBrewer). After installing the package and activating the library, check all available colors by using code: display.brewer.all() in your R Studio console.

The palette you observe has different sets, pay attention to how many color each set has, and their pattern, you can choose one set or another depending on your needs, and you can also choose the number of colors you want per set.

To use this option:
- Enter the brewer.pal() code.
- Inside the code, first enter the number of colors you need. In this case, we use 3 because we have three categories that we need to differentiate in drive train.
- Enter the color set using quotations.

Other two options to enter colors:
- Create a vector with the color names, e.g., c(“blue”, “yellow”, “green”)
- Create a vector with the color codes, e.g., c(“#A11515”, “#00CC66”, “#7F00FF”)

Check this page for color additional codes: https://www.rapidtables.com/web/color/RGB_Color.html

Observe the codes on the R chunk below.

mpg %>%
  select(drv, class) %>%
  table() %>%
  barplot(las=2,
          main = "Bar plot with colors and a legend",
          col = brewer.pal(3, "Dark2"),
          legend = TRUE,
          args.legend = list(x = "topleft"))


Present the second variable side-by-side

Observe the graph below. Since some car classes do not present all drive train types, column groups are uneven (some present one bar, some two, some three).
In this case, the previous bar graphs is probably a better option.

mpg %>%
  select(drv, class) %>%
  table() %>%
  barplot(beside = TRUE,
          las=2,
          main = NA,
          legend=TRUE,
          col=brewer.pal(3, "Dark2"),
          args.legend = list(x = "topleft"))


Exercise # 1


Using data set data(“chickwts”), create a bar plot for feed type.

Steps:
1. Check data set information using ?chickwts on your console.
2. Take a glimpse() of the data set.
3. Create a table of the categorical variable of interest using name = table(dataset$variable).
4. Create a basic barplot() using the table you created.
5. Improve presentation of your bar graphs by adding options.


Histograms


To extend our practice and manipulation of data sets, let’s use the continuous variable “weight” from data set data(“chickwts”). First, use the code ?chickwts in your R Studio console to get information about the data set.

Since this variable is continuous, we can enter the variable name directly into the code (you do not need to create a table like you did on bar plots).

The code is very simple: hist(dataset$variable)

Let’s start from the basics:
    - Inside the code enter the name of the data set.
    - Follow by the dollar sign $.
    - Then the name of the variable.

In this case:

hist(chickwts$weight)

The code below will create a basic histogram.

hist(chickwts$weight)


Improve histogram presentation

In the following order, let’s add options to:

hist(chickwts$weight,
     main = "Histogram of Chicken Weights",
     xlab = "Weight",
     breaks = 16,
     ylim = c(0,12),
     las=1,
     col = terrain.colors(12),
     labels = TRUE)


Add vertical lines for additional values

You can use the power of coding to enter additional information.
For example, you need to show the mean of the variable and the position of the interquartile range (Q1 at 25% and Q3 at 75%).
Plot the mean using a blue line, and Q1,Q3 using red lines.
The code in this case is abline(). Check options using ?abline in your R Sturio console.
Since these are vertical lines, use v = inside the code, and then the corresponding code for the statistical value you need.

Here an important note: This code goes after your plot code, in other words, after the closing parenthesis of your code.

The code below is the same code as the histogram above, with the abline() codes added after the closing parenthesis.
For each corresponding abline(), the code for the statistic value of interest is entered after v =.

hist(chickwts$weight,
     main = "Histogram of Chicken Weights",
     xlab = "Weight",
     breaks = 17,
     ylim = c(0,12),
     las=1,
     col = "#ACFCF9",
     labels = TRUE)

abline(v=mean(chickwts$weight), col="blue", lwd=3)
abline(v=quantile(chickwts$weight, 0.25), col="red", lwd=3)
abline(v=quantile(chickwts$weight, 0.75), col="red", lwd=3)

text("25% quantile", x=198, y=10, cex = 0.8, col="red", srt=90)
text("Mean", x=270, y=10, cex = 0.8, col="blue", srt=1350)
text("75% quantile", x=318, y=10, cex = 0.8, col="red", srt=90)


Zoom into an area of the histogram

You can “zoom” in an area of the histogram by changing the limits of your x-axis and then fixing the limits of the y-axis to fit the highest bar on your new graph. Below two examples, one “zooming” on area 100 to 200, the other on 350 to 450.
Notice the changes on options xlim = c(100,200) and xlim = c(350,450).

par(mfcol=c(1,2))

# Histogram 100 to 200
hist(chickwts$weight,
     main = NA,
     xlab = "Weight",
     breaks = 30,
     ylim = c(0,4),
     las=1,
     col = "#ACFCF9",
     labels = TRUE,
     xlim = c(100,200))

# Histogram 350 to 450
hist(chickwts$weight,
     main = NA,
     xlab = "Weight",
     breaks = 30,
     ylim = c(0,3),
     las=1,
     col = "#ACFCF9",
     labels = TRUE,
     xlim = c(350,450))


Filter data first

Another way to perform the “zoom” task is to select the data first, then create the graph.
In the R chunk below, first create objects to store the data values of interest, then use those objects to plot histograms.

# Create object to select values below 100
weight100 = filter(chickwts, weight<200)

# Create object to select values above 350
weight350 = filter(chickwts, weight>350)

# Code to plot two graphs next to each other.
par(mfcol=c(1,2))

# Histograms 

hist(weight100$weight,
     main = NA,
     xlab = "Weight",
     breaks = 10,
     ylim = c(0,4),
     las=1,
     col = "#DC92DA",
     labels = TRUE)

hist(weight350$weight,
     main = NA,
     xlab = "Weight",
     breaks = 10,
     ylim = c(0,3),
     las=1,
     col = "#F2D165",
     labels = TRUE)


Exercise # 2


- Using the data set data(“faithful”), create a histogram for variables waiting.

Steps:
1. Check data set information using ?faithful on your console.
2. Take a glimpse() of the data set.
3. 4. Create a basic histogram using hist().
4. Improve presentation of your histogram by adding options.
5. Using abline(), add a vertical line for the 80% quantile. 6. Using text(), add a text label for the vertical line.


Box plots


Box plots are another very useful graphical tool to plot and analyze continuous numerical data.
At difference with histograms, which display a continuous spectrum of the data based on frequency observations per class (bins), box plots show the data divided into quartiles.

par(mfcol=c(1,2))
boxplot(faithful$eruptions)
boxplot(faithful$waiting)


Data “sections” in box plots indicate quartiles

In general, box plots are divided in 5 lines, with lines 2 and 4 forming a box. All these lines represent quartiles.


This values can be obtained with the code quantile() without specifying values.

Remember
- The minimum value min(dataset$variable) is also the 0% quantile value quantile(dataset$variable, 0).
- The maximum value max(dataset$variable) is also the 100% quantile value quantile(dataset$variable, 1).
- The median value median(dataset$variable) is also the 50% quantile value quantile(dataset$variable, 0.5).

In the R chunk below, an object has been created to store the quartiles of the variable eruptions, then that object will be used to present a table and to add labels to the box plot.

eruption_quant = quantile(faithful$eruptions)


Present a table for quartiles

Observe R chunk below.

eruption_quant %>%
  as.data.frame() %>%
  knitr::kable()
.
0% 1.60000
25% 2.16275
50% 4.00000
75% 4.45425
100% 5.10000


Add quantile values to the box plot

Notice that I used the round() code inside labels =.

boxplot(faithful$eruptions)
text(y = eruption_quant,
     labels = round(eruption_quant,2),
     x = 1.24,
     col= "#A11515")


Add the mean value to the box plot

Using the same strategy, the mean value can be added. Mean is normally presented using a symbol such a cross, circle, diamond, etc.

To display a point for the mean, use code point() as follow:

boxplot(faithful$eruptions)

text(y = eruption_quant,
     labels = round(eruption_quant,2),
     x = 1.24,
     col= "#A11515")

text(y = mean(faithful$eruptions),
     labels = paste(round(mean(faithful$eruptions),2),"Mean"),
     x = 1.08,
     col = "red")

points(mean(faithful$eruptions), 
       pch = 18, 
       col = "#F20909")


Outliers

If you box plot contains outliers, they can also be labeled in the same way, using code text() and for the formula: boxplot.stats(dataset$variable)$out.


Improve presentation using options

Similar to bar plots and histograms, the presentation of box plots can be improved using options.

boxplot(faithful$eruptions,
        las=1,
        col="#212198",
        ylab= "Eruption time in minutes",
        ylim = c(0,6))

text(y = eruption_quant,
     labels = round(eruption_quant,2),
     x = 1.24,
     col= "#A11515")

text(y = mean(faithful$eruptions),
     labels = paste(round(mean(faithful$eruptions),2),"Mean"),
     x = 1.08,
     col = "#F5CCEE")

points(mean(faithful$eruptions), 
       pch = 18, 
       col = "#F5CCEE")


Box plot of one numerical variable with a categorical variable

Let’s use the mpg data set.
Imagine you want to know the data distribution for city efficiency in miles per gallon (variable cty), but you want to know the distribution of the data by car class, a categorical variable that contains sever categories.
Observe the codes below:
- The first code presents the whole distribution of cty alone.
- The second code incorporates the class variable.

par(mfcol=c(1,2), mai=c(1.2,1, 0.6, 0.4), cex.main=0.8)

boxplot(mpg$cty,
        main = "City efficiency",
        ylab = "Miles per galon",
        col="#0000CC",
        las=2)

boxplot(mpg$cty ~ mpg$class,
        main = "City efficiency per class",
        ylab= NA,
        xlab = NA,
        col=brewer.pal(7, "Dark2"),
        las=2)


Exercise # 2


- Using the data set data(“pressure”), create a box plot for the variable pressure. - pressure$pressure

Steps:
1. Check data set information using ?pressure on your console.
2. Take a glimpse() of the data set.
3. 4. Create a basic box plot using boxplot().
4. Since variable “pressure” contains outliers, add the value labels to your graph.
5. Add the labels for quartiles.
6. Add the mean value.
4. Improve presentation of your box plot by adding options.





Disclaimer: This short series manual project is a work in progress. Until clearly mentioned, these files are considered draft versions.


Dee Chiluiza, PhD
2021

Bruno Dog