Data Visualization

Introduction

In this class, we will start to learn how to visualize data with the ggplot2 package in R. Again, to activate all functions in ggplot2, we need to load the package. Usually we simply load tidyverse, which contains ggplot2.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.4
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

As we see, ggplot2 is part of the tidyverse package.

Recall: types of random variables

Categorical (or qualitative) variable: takes values that are not numerical (not numbers)
- Ordinal variable: similar to categorical but with ordered categories
Numeric (or quantitative) variable: takes values that are numeric (numbers)
- Discrete variable: A numeric variable whose possible values can be listed.
- Continuous variable: A numeric variable whose possible values form an interval of real numbers.
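
As a rough illustration of how these variable types usually appear in R, here is a minimal sketch. The vectors fuel, size, doors and weight are made-up toy data for this example and are not taken from mpg.

```r
# Hypothetical toy vectors, one per variable type (not taken from mpg)
fuel   <- c("diesel", "regular", "premium", "regular")          # categorical
size   <- factor(c("small", "large", "medium", "small"),
                 levels = c("small", "medium", "large"),
                 ordered = TRUE)                                 # ordinal
doors  <- c(2L, 4L, 4L, 5L)                                      # discrete
weight <- c(1.25, 1.80, 1.47, 2.03)                              # continuous

class(fuel)        # "character"
is.ordered(size)   # TRUE: the levels have a meaningful order
size < "large"     # comparisons are allowed for ordered factors
class(doors)       # "integer"
class(weight)      # "numeric"
```

Categorical variables are often stored as character vectors or factors, while an ordered factor is one common way to record an ordinal variable.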

Recall: plot types for different data types

Why do we have this many plot types? One reason is that we need different plots to best illustrate the relationship between (usually one or two) variables of different types.

Bar plots: (usually) for one categorical variable
Histograms: for one numeric variable
Box plots: for one continuous variable
Scatter plots: (usually) for two numeric variables
Multiple box plots: for one continuous variable and one categorical/discrete variable
Stacked bar plots: for two categorical variables

Warmup Exercise

Again, let’s use the fuel economy data mpg as the first data set to work on. To recall what we learned from last class, answer the following questions using R.

How can you get a quick view of the data?
How many samples are there in the data set? How many variables are there?
What is the data type for the variable “model”? How about “cyl”? How about “displ”?
What is the meaning of the variable “fl”?
How can you extract the data from one variable, such as “drv”, and store it separately as a vector in R?

Create scatter plots in R

Let’s start with scatter plots, one of the most commonly used graphs in scientific research. We will plot the hwy variable against the cty variable in the mpg data set, with cty on the x-axis and hwy on the y-axis. The code is given below.

The R Code for `cty vs hwy` scatter plot

ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))

ggplot() creates a coordinate system that you can add layers to.
- Using it alone won’t plot anything.
First argument of ggplot() is the data set to use in the graph.
The function geom_point() adds a scatter plot (which is called a layer) to the current plot.
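“+” adds the next layer to the plot. It must come at the end of a line, not at the start of the next one; otherwise R treats the first line as a complete command.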
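
The points above can be checked with a small sketch. The object names base_plot and scatter are chosen for this example only, and inspecting the layers element of a plot object is just one convenient way to count layers.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg)          # coordinate system only, no layers yet
scatter   <- base_plot +
  geom_point(mapping = aes(x = cty, y = hwy))

length(base_plot$layers)   # number of layers before adding a geom
length(scatter$layers)     # number of layers after adding geom_point()

scatter                    # printing the object draws the plot
```

Printing base_plot on its own gives an empty panel, because no geom has been added yet.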

geom_point() function
- mapping argument: defines how variables in your data set are mapped to visual properties.
  - Always paired with the aes() function which constructs aesthetic mappings.
  - x and y arguments of aes() specify which variables to map to the x and y axes. You don’t have to have quotation marks for variable names.
  - Other arguments will be introduced later.
- It looks for the mapped variable in the data argument, in this case, mpg.

Lab Exercise

Make a scatter plot of displ vs hwy from the mpg data set.
Observe the plot. What preliminary conclusion can you draw from it?
Explain why the number of data points visible on the plot is smaller than the total number of samples (which is 234).

Overplotting

We see fewer points than samples because the values of displ and hwy are rounded, so some points overlap with each other. This problem is known as overplotting.

For example, the fifth and the sixth sample share the same displ and hwy values.

mpg[5:6, c('displ', 'hwy')]  # This code shows the values from "hwy" and "displ" for the 5th and 6th sample

## # A tibble: 2 × 2
##   displ   hwy
##   <dbl> <int>
## 1   2.8    26
## 2   2.8    26
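
One way to see how much overlap there is, sketched here with dplyr's distinct() function (the names n_rows and n_points are only for this example), is to compare the number of rows with the number of distinct (displ, hwy) pairs.

```r
library(ggplot2)   # provides the mpg data set
library(dplyr)

n_rows   <- nrow(mpg)                         # one row per sample
n_points <- nrow(distinct(mpg, displ, hwy))   # distinct (displ, hwy) positions

n_rows
n_points   # smaller than n_rows because several samples share a position
```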

“jitter” positioning to avoid overplotting (Optional)

We can add the option position = "jitter" to the geom_point function to avoid the overplotting problem. This adds a small amount of random noise to each point, which spreads the points out.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

Here the position argument controls position adjustments, which determine how to arrange geoms that would otherwise occupy the same space.
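
If more control over the noise is wanted, the position argument also accepts the position_jitter() function in place of the string "jitter". The sketch below is only an illustration; the width, height and seed values are arbitrary choices for this example.

```r
library(ggplot2)

jittered <- ggplot(data = mpg) +
  geom_point(
    mapping  = aes(x = displ, y = hwy),
    # width and height set the maximum shift in data units;
    # seed makes the random noise reproducible
    position = position_jitter(width = 0.05, height = 0.5, seed = 1)
  )

jittered
```

Without a fixed seed, the points land in slightly different places every time the plot is drawn.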

Create a bar plot

Next, let’s learn how to create a bar plot for one variable with ggplot2. The code template is very similar to that for scatter plots, but the variable must be categorical and we don’t need a y variable in the mapping. Let’s create a bar plot for the variable drv in the mpg data set.

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = drv))

Here we use the function geom_bar to create a bar plot.

A bar plot summarizes the count (or frequency) of each category in the data set.
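
The counts behind the bars can also be computed directly. This is a minimal sketch using dplyr's count() function; the name drv_counts is only for this example.

```r
library(ggplot2)   # provides the mpg data set
library(dplyr)

drv_counts <- count(mpg, drv)   # one row per drive train type, count in column n
drv_counts

sum(drv_counts$n)               # the bar heights add up to the number of rows
```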

Lab Exercise

Create proper bar plots to answer the following questions:

Which vehicle class contains the fewest samples in the data set?
Which three manufacturers have the most samples in the data set?

Graphing template

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

This template contains the most basic information needed to create a graph:

We need a data set as <DATA>.
We need to know which type of graph to create by selecting different <GEOM_FUNCTION>.
We need to know which variables are used from <MAPPINGS>.

We will expand this template to more complicated cases in future classes. More details can be found at

https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf
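
As an example of filling in the template with a geom that has not been shown yet, the sketch below draws a histogram of hwy. The binwidth value is an arbitrary choice for illustration.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_histogram, <MAPPINGS> = x = hwy
hwy_hist <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 2)  # binwidth chosen for illustration

hwy_hist
```

Only the three placeholders changed; the rest of the code has the same shape as the scatter plot and bar plot code.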

Create a colored bar plot

We can use the fill argument in the aes function to make a bar plot colored.

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = drv, fill = drv))

Stacked bar plot

In the previous plot, we used the same variable for the fill and x arguments, which means “color each bar by its x value”. If we map fill to a different variable, we get a stacked bar plot.

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = drv, fill = class))

Stacked bar plot

A stacked bar plot shows the distribution across combinations of two categorical variables by breaking each bar down into smaller colored bars. Observe the graph on the previous page and answer the following questions.

Are most SUVs four-wheel drive (4WD), front-wheel drive (FWD), or rear-wheel drive (RWD)?
Which is the most common drive train type for compact cars?
What is the drive train type of 2seaters?
What is the drive train type of pickups?

Dodged bar plot

Another way to show the distribution across combinations of two categorical variables is to use a dodged bar plot.

ggplot(data = mpg) + 
  geom_bar(mapping = aes(x = drv, fill = class), position = "dodge")

Again, we use the position argument in the geom function to adjust the position of the bars.
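
With position = "dodge", bars in a group that contains fewer classes are drawn wider to fill the available space. One possible variation, sketched here as an optional extra, passes the position_dodge() function with preserve = "single" so that every bar keeps the same width.

```r
library(ggplot2)

dodged <- ggplot(data = mpg) +
  geom_bar(
    mapping  = aes(x = drv, fill = class),
    position = position_dodge(preserve = "single")  # same bar width in every group
  )

dodged
```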

Exercise

Use dodged bar plots to answer the following questions:

In this data set, which manufacturer produces the most SUVs?
Change the keyword x in the aes function to y and reproduce the plot. What do you see?

Box plots

Next, let’s learn how to create a box plot. A box plot summarizes key information about the center, spread and potential outliers of a numeric variable.

First, let’s review how to read a box plot.
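
The numbers a box plot is built from can be computed by hand. The sketch below uses displ and the common 1.5 × IQR rule for whiskers and outliers, which is also the default in geom_boxplot; the names quartiles, iqr, lower_fence and upper_fence are only for this example.

```r
library(ggplot2)   # provides the mpg data set

quartiles <- quantile(mpg$displ, probs = c(0.25, 0.5, 0.75))  # Q1, median, Q3
iqr <- quartiles[["75%"]] - quartiles[["25%"]]                # height of the box

# Whiskers reach the most extreme data values inside these fences
lower_fence <- quartiles[["25%"]] - 1.5 * iqr
upper_fence <- quartiles[["75%"]] + 1.5 * iqr

quartiles
# Values outside the fences would be drawn as individual outlier points
mpg$displ[mpg$displ < lower_fence | mpg$displ > upper_fence]
```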

Create a box plot

We use the function geom_boxplot to create a box plot.

ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(x = displ)) +
  scale_y_discrete(breaks = NULL) # Remove the y scales

ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(y = displ)) + # The plot can also be vertical
  scale_x_discrete(breaks = NULL) # Remove the x scales

Add whisker lines

By default, boxplots created by ggplot2 draw the whiskers without horizontal end caps. There is a trick to add them to our plot:

ggplot(data = mpg, mapping = aes(y = displ)) + 
  stat_boxplot(geom = "errorbar", width = 0.5) + # "width" controls the width of the end caps
  geom_boxplot() +
  scale_x_discrete(breaks = NULL)

Lab Exercise

Try removing the line geom_boxplot() from the code above and see what you get.
Try putting the line geom_boxplot() before the stat_boxplot line and see what you get.

Create multiple boxplots

More often, multiple boxplots are used to compare the distribution of a numeric variable across the categories of a categorical variable. It’s very easy to do this with ggplot2: we simply use both the x and y arguments.

For example, we hope to study the effect of drive train types on fuel economy measured by hwy. We can create the plots with the following code.

ggplot(data = mpg, mapping = aes(x = drv, y = hwy)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot()
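
The middle line of each box is the group median, which can be cross-checked numerically. This is a minimal sketch using dplyr; the names hwy_by_drv and median_hwy are only for this example.

```r
library(ggplot2)   # provides the mpg data set
library(dplyr)

hwy_by_drv <- mpg %>%
  group_by(drv) %>%
  summarise(median_hwy = median(hwy), n = n())  # median and group size per drv

hwy_by_drv
```

The column n is included because a box based on few samples deserves more caution than one based on many.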

Lab Exercise

Create multiple boxplots for the variables manufacturer and cty, then answer the following questions:

Within the data set, which manufacturer's cars are the most fuel-efficient?
Within the data set, which manufacturer's cars are the least fuel-efficient?
Do you think the conclusions from the first two questions hold in general for data beyond the current data set?

Creating a smooth line graph

How are these two plots similar?

Both plots use the same data set, but different visual objects (which we call geoms).

Code for a smooth line graph

# left
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))   # point geom

# right
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))  # smooth geom

The function geom_smooth fits a smoothed conditional mean curve to the data. The shaded region represents a 95% confidence interval by default; the level can be adjusted.

There is statistical modeling behind this plot, so we need to learn statistical methods to fully understand the details.
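
As a sketch of how the fit and the confidence level can be adjusted, the code below swaps the default smoother for a straight-line fit and widens the band to 99%. The method, formula and level arguments belong to geom_smooth; the particular values are just one possible choice for illustration.

```r
library(ggplot2)

linear_fit <- ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy),
    method  = "lm",     # straight-line fit instead of the default smoother
    formula = y ~ x,
    level   = 0.99      # 99% confidence band instead of 95%
  )

linear_fit
```

Setting se = FALSE instead would hide the shaded band entirely.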

Multiple-layer plot

What if we hope to combine the two plots into one plot? It is very simple to do it with ggplot2 - we just apply two geom functions which add two layers to the same plot.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

We can put the mapping into the ggplot function to avoid redundant code:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth()
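
A mapping given inside ggplot() is shared by all layers, while a mapping given inside a geom function applies to that layer only. The sketch below previews this with the color aesthetic, which has not been introduced in this class, so treat it as an optional look ahead.

```r
library(ggplot2)

layered <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +  # color applies to the points only
  geom_smooth()                             # inherits x and y, not color

layered
```

Because color is mapped only in geom_point, a single smooth curve is fitted to all the data rather than one curve per drive train type.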

In-class Exercise (submit to Canvas)

For the following two questions, submit your plot and answer to Canvas (go to Discussions tab and reply to the post)

Using the built-in diamonds data set in ggplot2, create a scatter plot and a smooth line plot (in the same graph) with price on the y-axis and carat on the x-axis. What conclusions can you draw from your figure?
(self-study) Do some self-study to see how the function geom_count() works. Create a plot with mpg data set using geom_count()

Summary

In this class, we learned some basics of data visualization with ggplot2 in R. You are required to

understand which plot type to use for various types of variables
understand how to create scatter plots, bar plots, box plots, multiple box plots and smooth line graphs with different geom functions
understand the basic template of plotting graphs with ggplot2
make relevant plots to answer simple questions regarding a given data set

Data Visualization - Part One

Miao Yu

2024-01-22
