Pre-requisites

Quiz-1 and Quiz-2

Learning Outcomes

After completing this module, quizzes and assignments, you will be able to:

  1. Develop plots in ggplot2.
  2. Define the workflow to design advanced plots.
  3. Interpret several plots.

Basic Setup

Please download the data crimedata_all_withUnemp.csv and save at MyViz folder at D drive. You can have save the data in any folder locally, but make sure to set the path of the working directory to that folder. However, I will use D:/MyViz as an example.

# Setting the working directory
path <- "D:/MyViz"
setwd(path)

Let’s load the required packages. Below you will see a function load_package that will install the required packages in the R environment. Note that if the packages have been installed previously you do not need to perform this step. I am using the sapply function to list wise apply the function library() to character ggplot2, dplyr, readr. This is easier when many libraries are loaded in the R environment. The apply family functions can help to loop a function. Some of them are apply, lapply and sapply. You should consider learning these functions to increase your productivity. here

# load_package function
load_package <- function(package) {
    new.package <- package[!(package %in% installed.packages()[, "Package"])]
    if (length(new.package)) 
        install.packages(new.package, dependencies = TRUE)
    sapply(package, require, character.only = TRUE)
}

# These are the packages we will use for this module.
req_packages <- c("ggplot2", "dplyr", "readr", "gridExtra", "maps", "ggthemes", 
    "rmarkdown")

# Lets load the packages.
load_package(req_packages)

Now, let’s import the data and then see the structure of the dataset using the str() function of base R. The str() function shows the dimension of the data along with a best guess of the type of variables in your data set. Sometimes using the str() function may not be helpful. I usually prefer (whenever possible) to export the dataset in extensions such that Excel can read the data. Then, I use the filter in Excel to try to see the structure of the data, as well as how the missing and non-available values of variables are recorded. Data management is an integral part of data visualization. However, we will not be spending much time on this aspect of data management in this course.

Next, I use the read_csv function from the readr package. Compared to the conventional read.csv function, the read_csv function can load large .csv files quickly in the R environment. Here, I changed the variable name ID to be a character and skipped several variables, such as StateName, X1 and fstate.

path <- "C:/Users/ss0088/Dropbox (RRI at WVU)/Shishir/ELG/WVU Tableau/greg/Week-3"
setwd(path)

# Load the data
crimedata <- read_csv("Data/crimedata_all_withUnemp.csv", col_types = cols(ID = col_character(), 
    StateName = col_skip(), X1 = col_skip(), fstate = col_skip()))


str(crimedata)

Basics of ggplot2 (The Grammar of Graphics)

In ggplot2 there are three components.

Quick Plots qplot

The qplot() represents quick plot and allows users to generate a number of different types of plots quickly. However, to create more complex graphics we will learn ggplot().

Histograms/Density (Quantitative Variables)

A histogram is appropriate for summarizing the distribution of a numerical variable. We will develop several histograms of the variable VicAge, which is a numerical variable. Remember, we need the data, geom function and aesthetic mappings for a ggplot object. For our purposes, data is crimedata, geom function is histogram and aesthetic mapping is done for variable VicAge which represents the Victim’s age.

hist1 <- qplot(data = crimedata, VicAge, geom = "histogram")

# Let not included the unknown age. We can use:
# crimedata$VicAge[crimedata$VicAge == 998] <- NA (However, this will reduce
# the size of the dataset)

md <- crimedata %>% select(VicAge) %>% filter(VicAge != "998")

hist2 <- qplot(data = md, VicAge, main = "Hist-2")

hist3 <- qplot(data = md, VicAge, binwidth = 10, geom = "histogram", main = "Hist-3") + 
    theme_bw()

hist4 <- qplot(data = md, VicAge, binwidth = 15, geom = "histogram") + ggtitle("Hist-4") + 
    theme(plot.title = element_text(hjust = 0.5))

# Ploting all the histograms in one diagram
gridExtra::grid.arrange(hist1, hist2, hist3, hist4, ncol = 2)  #grid.arrange belong to package called gridExtra. 

I have defined 4 different histograms. hist1 is a generic histogram. The hist1 object give false results because looking at the data we know that missing observations of VicAge are labeled as 998. In the hist2, I have considered a dataframe called md which is a subset of crimedata. md dataframe excludes the missing observation of VicAge. Remember md <- crimedata %>% select(VicAge) %>% filter(VicAge != "998") can be interpreted as taking crimedata, then, selecting VicAge and finally filtering VicAge!= 998 (VicAge not equal to 998). Then I defined a ggplot object called hist2 and ran the qplot command over the data md, on variable VicAge, and provide a title Hist-2. Notice that qplot() identified the numerical variable geom function should be histogram.

If you have correctly run the above code for hist1 and hist, it should have given you a warning message that it implemented the bandwidth as 30. In the object hist3 and hist4 I have defined different bandwidths. Depending upon the bandwidths, the histogram can reveal or hide several important aspects within the distribution of the data. Further, I layered different themes by using the + command. You should try finding various other themes and experimenting with them. In object hist4, I have layered the title and a theme in which I centered the title.

All the histograms are saved as the ggplot object in the R environment and not yet plotted. If you type the object names in the R console then it plots the histogram. However, to plot all 4 histogram in 2 columns, I used the function called grid.arrange from package called gridExtra. The command can be run as grid.arrange(hist1, hist2, hist3, hist4, ncol = 2). But, I like to run gridExtra::grid.arrange instead. I do this mainly for three reasons: 1) I want to make sure that I know both the command and package when I use several packages; 2) Sometime two different packages can have same the command name, but do different things; 3) Sometimes I don’t want to load the whole package in my environment (given that I have already installed the package) when I just need one command.

The ggthemes Package

I have now taken the previous dataframe md, but randomly selected 500 observations without replacement and overwrote the original dataframe md. The object qp1 is a density plot which is plotted with The Economist magazine theme. The density is the plot of relative frequency of the histogram. The object qp2 is a dotplot with binwidth of 2 which is plotted with the Google document theme. The object qp3 is a frequency polygon plotted with theme of fivethirtyeight.com and qp4 is a generic histogram plotted with standard ggplot theme. However I defined qp4 differently. First, I defined a ggplot object called obj in which there are two important features: 1) data and 2) aesthetic. Once I defined obj then I started to layer the histogram using the geom_histogram() command, then I layered the title and Stata (a common statistics software) output theme.

# md <- crimedata %>% select(VicAge) %>% filter(VicAge != '998')

# Randomly sampling 500 observations
md <- md[sample(1:nrow(md), 500, replace = FALSE), ]

# a theme based on the plots in the The Economist magazine.
qp1 <- qplot(data = md, VicAge, main = "qp1-Density Plot", geom = "density") + 
    ggthemes::theme_economist()

# a theme based on Google Docs
qp2 <- qplot(data = md, VicAge, main = "qp2-Dot Plot", geom = "dotplot", binwidth = 2) + 
    ggthemes::theme_gdocs()

# a theme based on the plots at fivethirtyeight.com.
qp3 <- qplot(data = md, VicAge, main = "qp-3Frequency Polygon", geom = "freqpoly") + 
    ggthemes::theme_fivethirtyeight() + coord_cartesian(xlim = c(0, 100), ylim = c(0, 
    100))

# themes based on Stata graph schemes
obj <- ggplot(md, aes(VicAge))
qp4 <- obj + geom_histogram() + ggtitle("hist-4") + ggthemes::theme_stata()

gridExtra::grid.arrange(qp1, qp2, qp3, qp4, nrow = 2)

Filling the histogram with the categorical variables

From here I will develop histograms differently. I will let the categorical variable interact with the numeric variable. Consider a new data frame called df. It takes the crime data, then selects VicAge and VicSex and filters the VicAge which are not 998. The command sample randomly select 1000 observations. Due to the random sample, your plots will not be same as mine. Further, I suggest you to run the following snippet of code several times and see if you get a different looking graph each time or not.

df <- crimedata %>% select(VicAge, VicSex) %>% filter(VicAge != "998")

df <- df[sample(1:nrow(df), 1000, replace = FALSE), ]

h5a <- qplot(data = df, VicAge, fill = VicSex, main = "H5a", geom = "density")
h5b <- qplot(data = df, VicAge, fill = VicSex, main = "H5b", geom = "histogram")

gridExtra::grid.arrange(h5a, h5b, ncol = 2)

I have defined ggplot object h5a to be a density plot. However, it will separate the VicSex. The variable VicSex represents the victim’s sex, which can be either Female or Male and plot the density plot of VicAge on VicSex. In h5b the logic is the same, but I plotted the histogram. The density plot H5a does not look good. What’s wrong with that the plot?

Excluding VicSex unknown (just for simplicity)

Again, I used the command sample to randomly select 1000 observations. However, this time I used a random seed to sample 1000 observations with the command set.seed(123). If you used the same seed as mine you will be able to produce the exact same results. Random sample by default has to be random but researchers seed the randomness for the sake of replicability. Read here for reasons to use the set.seed().

The variable VicSex has Female, Male and Unknown as the factors, but let’s gets rid of instances where the VicSex is Unknown. Further, let’s add another variable which is Homicide. The plots will then show the victim’s age distribution across gender and type of homicide. I have plotted this histogram, but in object h6a, I filled the interaction between VicSex and Homicide while in object h6b I colored the interaction.

# Let's not include the VicSex which is unknown (just for simplicity)
df <- crimedata %>% select(VicAge, VicSex, Homicide) %>% filter(VicSex != "Unknown" & 
    VicAge != 998)

set.seed(123)
df <- df[sample(1:nrow(df), 1000, replace = FALSE), ]

qplot(data = df, VicAge, fill = interaction(VicSex, Homicide), geom = "histogram", 
    main = "H6a")

qplot(data = df, VicAge, color = interaction(VicSex, Homicide), geom = "histogram", 
    main = "H6b")

However, both of these plots are not easy to comprehend. We will try and explore different other techniques. For this we will develop vertical facets and horizontal facets.

Facets

Again, I want to plot the histogram of the victim’s age distribution across gender and type of homicide. To do this I will use vertical facets, which will develop the two different plots based on the interaction and the selected facet variables. In this example the facets variable is VicSex. Note that to develop the vertical facets I used ~. after the facets variable name.

qplot(data = df, VicAge, facets = VicSex ~ ., fill = interaction(VicSex, Homicide), 
    geom = "histogram", main = "The Vertical Facets")

The histogram looks appealing but in each panel the graph overlaps. So rather than fill, using the color option for interactions can be helpful. Also, histogram density plots can be very useful in this case. In this example I chose the facets variable as VicSex (you can choose Homicide as well) . Note that to develop horizontal facets I used .~ before the facets variable name.

qplot(data = df, VicAge, facets = . ~ VicSex, color = interaction(VicSex, Homicide), 
    geom = "density", main = "The Horizontal Facets")

Boxplots (Combining of Quantitative and Categorical Variables)

The Meaning of a BoxPlot

Now we will develop boxplots. Boxplots can sometimes be more informative than the histogram because the boxplots are based on the inter-quartile range (IQR) and median. Anything above 1.5 times the IQR is defined as an outlier or extreme value. The IQR is based on the first and third quartiles (see the following figures).

Boxplot

Boxplot

Box plot in ggplot2

You have likely realized that there is several ways to plot the same diagram. Below I show three different ways to plot same boxplots. I suggest that you try other ways on you own.

md <- crimedata %>% select(Homicide, OffAge, OffRace, VicSex, State_Abb) %>% 
    filter(OffAge != 0, VicSex != "Unknown")

# We can use qplot to develop the boxplot.
qplot(data = md, factor(State_Abb), OffAge, geom = "boxplot")

# OR we can create a ggplot object and add layer to it.
ggplot(data = md, aes(x = State_Abb, y = OffAge)) + geom_boxplot()

# Or Alternatively,
obj <- ggplot(data = md, aes(x = State_Abb, y = OffAge))
obj + geom_boxplot()

# All of above codes will generate same graph as below.

To make the boxplot I selected a few variables from the crimedata and then filtered. To develop a boxplot we need a quantitative and a categorical variable. For the purpose of demonstration, I took the state variable called State_Abb which is the abbreviation of the US states. It is a categorical variable. Then I took OffAge which is the offender’s age.

You can clearly see that the median age fluctuated around 25 to 35 years. Consider Alaska (AK) where the median age of the offender is just above 25 years, meaning that less than 50 percent of the offender’s in Alaska will have an age that is below 25. Note that the categorical variable should be the x-axis and quantitative variable should be the y-axis for the boxplot.

Box plot with interactions.

Next, I move on to make a boxplot of OffAge interacted with Homicide type and OffRace. Note that the x-axis name will be very long, so I have rotated the x-axis text 90 degrees and resized the font to 10 point. To make the graphic fit better, one might consider shortening the description of either the race or the homicide description, or both.

# ggplot object
obj <- ggplot(data = md, aes(x = interaction(Homicide, OffRace), y = OffAge))
# Object, geom and theam layering
obj + geom_boxplot() + theme(axis.text.x = element_text(face = "italic", color = "black", 
    size = 10, angle = 90))

In the above boxplot the median age of Black offenders whose crime was categorized as murder and non-negligent manslaughter is 25. You should try to interpret other boxplots as well.

In the boxplots below, I layered the ggplot object with boxplot geom function and the theme which rotates the x-axis text 90 degrees and italicizes it. I have used fill with VicSex.

obj1 <- ggplot(data = md, aes(x = interaction(Homicide, OffRace), y = OffAge, 
    fill = OffRace))
obj2 <- ggplot(data = md, aes(x = interaction(Homicide, OffRace), y = OffAge, 
    fill = VicSex))

thm <- theme(axis.text.x = element_text(face = "italic", color = "black", size = 10, 
    angle = 90))

plt1 <- obj1 + geom_boxplot() + thm
plt2 <- obj2 + geom_boxplot() + thm

grid.arrange(plt1, plt2, ncol = 2)

The plot on the right is same as the left plot except that we have further broken down the age of offenders by homicide type, ethnicity and gender. As you can see, there are many simple ways to present additional information without cluttering a graphic.

Scatterplots (Quantitative Variables)

Now, let’s focus on quantitative variables. For this I will use public dataset called the LungCap dataset available in the marinstat lectures website here. First download the data here and load it in R.

bd <- readr::read_delim("Data/LungCapData.txt", "\t", escape_double = FALSE, 
    trim_ws = TRUE)

This data has 725 observations for 6 variables. Try str(bd) and summary(bd) command to see the structure and basic summary. The variable LungCap, Age and Height are quantitative variable and Smoke, Gender and Caesaren are the categorical variable.

For the scatterplot I need two or more quantitative variables. For now I will only focus on two quantitative variables, Age and LungCap. Age will appear on the x-axis and LungCap will appear on the y-axis. The scatter plot shows a positive association between LungCap and Age.

# Quick Plot
qplot(data = bd, Age, LungCap)

When developing scatterplots you should try to determine which variable is the dependent variable and which is the independent variable. With regards to lung capacity and age, you should be asking "does the lung capacity depend on age or does age depend on lung capacity?" Let’s assume that lung capacity should depend on age. Therefore, LungCap should be on the y-axis and Age on the x-axis. While it is fairly straightforward to discern the directional relationship between lung capacity and age, it is not always so easy to determine the relationship between variables.

For example, if you were examining the relationship between police and crime, which is the dependent/independent variable? Just a thought exercise.

Scatterplot in Facets

Here, I have developed the scatter plot of LungCap and Age but with the facets based on Smoke. All the facets are vertical. The object sc1 is a generic scatter plot but it seperates the relationship of Age and LungCap with respect to smoking behavior of the children in the sample. You can see that smoking starts around 10 years of age. The object sc2 is the same as sc1 but here it also accounts for the Height (another quantitative variable). Higher Height is colored with a lighter tone. In sc3, instead of Height, if I consider Gender (which is a categorical variable) then color is not represented in tints and tones but the color is discrete. Here the Gender is either male or female. In sc5, I have allowed for the interaction between Smoke and Gender. It appears that if you allow for higher level interactions the scatterplot becomes less useful.

# Scatterplot in Facets
sc1 <- qplot(data = bd, Age, LungCap, main = "sc1", facets = Smoke ~ .)
sc2 <- qplot(data = bd, Age, LungCap, color = Height, main = "sc2", facets = Smoke ~ 
    .)
sc3 <- qplot(data = bd, Age, LungCap, color = Gender, main = "sc3", facets = Smoke ~ 
    .)
sc4 <- qplot(data = bd, Age, Height, color = Smoke, main = "sc4", facets = Gender ~ 
    .)
sc5 <- qplot(data = bd, Age, Height, color = interaction(Smoke, Gender), main = "sc5", 
    facets = Caesarean ~ .)
grid.arrange(sc1, sc2, sc3, sc4, sc5, ncol = 2)

Fitting and plotting a line inside a scatterplot

The purpose of a scatterplot is to exploit the relationship between the two quantitative variables. Fitting a linear trend can help in this comprehension. In the following code I allow for the interaction of several categorical variables and fit a linear trend.

# Fitting and ploting a line inside scatter plot
sc6 <- qplot(data = bd, Age, LungCap, geom = c("point", "smooth"), method = "lm", 
    color = Gender, main = "sc6")
sc7 <- qplot(data = bd, Age, LungCap, geom = c("point", "smooth"), method = "lm", 
    color = Gender, facets = Smoke ~ ., main = "sc7")
grid.arrange(sc6, sc7, ncol = 2)

In the above scatter plot, sc6 (on the left) is a basic scatterplot of Age and LungCap colored with Gender, but I have fit a linear trend with the method="lm". Note that method="lm" represents the linear method. If I don’t put it in, then ggplot will give a smooth curve. From the figure we can see that the LungCap of males is higher than it is for females at all ages.

In the sc7 (on the right), I allowed for color based on the Gender and facets based on the variable Smoke. Looking at sc7 it seems that the lung capacity of males is higher than females for non-smoker (same as sc6). This plot also reveals an interesting story. The smoking behavior starts after the age of 10, and the lungCap of male smokers declines significantly faster when compared to female smokers.

sc8 <- qplot(data = bd, Age, LungCap, geom = c("point", "smooth"), method = "lm", 
    color = interaction(Smoke, Gender), facets = Caesarean ~ ., main = "sc8")
sc8

In sc8, I try to see how the LungCap varies as the Age increases for the male and female children in the sample based on their smoking behavior. But, I also examine whether delivered by cesarean section (c-section) or not.

Fitting and ploting a smooth curve inside scatter plot

Sometimes the relation may not be linear, as we observed in the LungCap data. For example, stress level and years of job experiences follow an inverted U-shaped relationship. Generally speaking, for the first few years of experience the stress level increases, but after some level of experience stress levels tends to decline.

To examine the non-linear relationship, we can use scatterplot and a smooth curve instead of a line. The following three diagrams represent such relationships. Here I didn’t impose the method as linear.

# Fitting and ploting a smooth curve inside scatter plot
qplot(data = bd, Age, LungCap, geom = c("point", "smooth"))

qplot(data = bd, Age, LungCap, geom = c("point", "smooth"), color = Gender, 
    facets = Smoke ~ .)

qplot(data = bd, Age, Height, geom = c("point", "smooth"), color = interaction(Smoke, 
    Gender), facets = Caesarean ~ .)

Assignment

Please submit the assignment in the following Rmarkdown format. Download R Markdown file here. Make sure you save this file in .Rmd extension. Type your codes and kint the document. You may need to install "rmarkdown" package. If you are not familiar with the rmarkdown, please watch this video here.

Consider the fuel economy dataset, mpg, which is a sample records make, model, class, engine size, transmission and fuel economy for a selection of US cars in 1999 and 2008. This data comes from the EPA fuel economy website here. If you load the library ggplot2, mpg is pre-built in ggplot2.

require(ggplot2)
head(mpg, 10)
## # A tibble: 10 × 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
## 1          audi         a4   1.8  1999     4   auto(l5)     f    18    29
## 2          audi         a4   1.8  1999     4 manual(m5)     f    21    29
## 3          audi         a4   2.0  2008     4 manual(m6)     f    20    31
## 4          audi         a4   2.0  2008     4   auto(av)     f    21    30
## 5          audi         a4   2.8  1999     6   auto(l5)     f    16    26
## 6          audi         a4   2.8  1999     6 manual(m5)     f    18    26
## 7          audi         a4   3.1  2008     6   auto(av)     f    18    27
## 8          audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
## 9          audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 2 more variables: fl <chr>, class <chr>

The first 10 cars in the mpg dataset, included in the ggplot2 package.

Q1. What is the dimension of this data? How many observations and variables are there?

Q2. List the categorical and quantitative variables. (hint: here and try use sapply function, is.numeric() and is.character())

Here, I have changed the character to factors and extracted the levels of each of those factors. Factors represents the categorical variable.

Q3. Create 5 different histograms of displ, cyl, year, cty and hwy. Try using gridExtra::grid.arrange(). Is this graph helpful? Can we define year and cyl as the quantitative variables? Justify your answers in less than 100 words.

Q4. Develop three boxplots of drv on displ, cty and hwy. Interpret the results.

Q5. Develop the boxplot of class on city mileage. Interpret the results. Can you find three outliers (highest city mileage) of compact class?

Q.6 Sub-select the class of vehicles that are not manual transmission. The dimension of your data should now be 157 x 11. Develop the same boxplot. Interpret the results based on the mean, variance and outliers. Try to find which vehicle has the highest city mileage in each class category. Hint: filter(data, column != "manual(m5)" & column != "manual(m6)")

Q.7 How are engine sizes and fuel economy related over the time? In other words, what is the relationship between displ and hwy? Plot a colored scatterplot based on cyl using factor() and facets with year. Also, layer with the geom_smooth(method = "lm"). Explain the graphs.

Hint: qplot(a, b, data=mpg, facets= , colour = factor()) + geom_smooth(method= "lm").