Introduction

library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages ----------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts -------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Run ggplot(data = mpg). What do you see?

ggplot(data = mpg)

This code creates an empty plot. The ggplot() function creates the background of the plot, but since no layers were specified with a geom function, nothing is drawn.
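The absence of layers can be checked directly. The following is a minimal sketch, assuming the plot object exposes its layers as a list in `$layers` (an implementation detail of ggplot2, not something stated in the text); the names `p_empty` and `p_points` are illustrative only.

```r
library(ggplot2)

# ggplot() alone sets up the data and the plot background but adds no layers.
p_empty <- ggplot(data = mpg)
length(p_empty$layers)

# Adding a geom function appends a layer, which is what actually draws points.
p_points <- p_empty + geom_point(aes(x = displ, y = hwy))
length(p_points$layers)
```

Under that assumption, the first length is zero and the second is one, which matches the empty plot described above.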

How many rows are in mtcars? How many columns?

There are 32 rows and 11 columns in the mtcars data frame.

nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
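
As a small complement to nrow() and ncol(), base R's dim() returns both counts in a single call, rows first and columns second.

```r
# dim() returns c(number of rows, number of columns) for a data frame.
dim(mtcars)
## [1] 32 11
```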

The glimpse() function also displays the number of rows and columns in a data frame.

glimpse(mtcars)
## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17...
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3....
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,...

What does the drv variable describe? Read the help for ?mpg to find out.

The drv variable is a categorical variable that classifies cars as front-wheel, rear-wheel, or four-wheel drive.

Value   Description
"f"     front-wheel drive
"r"     rear-wheel drive
"4"     four-wheel drive
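
The codes in the table can be mapped to their descriptions in code. This is only a sketch: `drv_labels` and `drv_described` are hypothetical names introduced here, and the descriptions are copied from the table above rather than read from the help page.

```r
library(ggplot2)  # provides the mpg data set

# Hypothetical lookup vector: names are the codes stored in mpg$drv,
# values are the descriptions from the table above.
drv_labels <- c(
  "f" = "front-wheel drive",
  "r" = "rear-wheel drive",
  "4" = "four-wheel drive"
)

# Index the lookup vector by code to get one description per car.
drv_described <- unname(drv_labels[mpg$drv])

# Tabulate how many cars fall into each drive type.
table(drv_described)
```

Indexing a named vector by the code column works here because drv is stored as a character vector, so each code is matched by name.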

Make a scatter plot of hwy vs. cyl.

# "hwy vs. cyl" puts hwy on the y-axis and cyl on the x-axis.
ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_point()

What happens if you make a scatter plot of class vs drv? Why is the plot not useful?

The resulting scatterplot has only a few points.

ggplot(mpg, aes(x = class, y = drv)) +
  geom_point()

A scatter plot is not a useful display of these variables because both drv and class are categorical. Since categorical variables typically take a small number of values, there are a limited number of unique (x, y) combinations that can be displayed. In this data, drv takes 3 values and class takes 7 values, so at most 21 distinct points could appear on a scatter plot of drv vs. class. Only 12 of these (drv, class) combinations are actually observed.

count(mpg, drv, class)
## # A tibble: 12 x 3
##    drv   class          n
##    <chr> <chr>      <int>
##  1 4     compact       12
##  2 4     midsize        3
##  3 4     pickup        33
##  4 4     subcompact     4
##  5 4     suv           51
##  6 f     compact       35
##  7 f     midsize       38
##  8 f     minivan       11
##  9 f     subcompact    22
## 10 r     2seater        5
## 11 r     subcompact     9
## 12 r     suv           11
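
The comparison between possible and observed combinations can also be computed rather than read off the table. This is a minimal sketch in base R; the variable names are illustrative.

```r
library(ggplot2)  # provides the mpg data set

# Number of distinct values of each categorical variable.
n_drv <- length(unique(mpg$drv))
n_class <- length(unique(mpg$class))

# Upper bound on the number of distinct points in the scatter plot.
n_possible <- n_drv * n_class

# Number of (drv, class) combinations that actually occur in the data.
n_observed <- nrow(unique(mpg[, c("drv", "class")]))

c(possible = n_possible, observed = n_observed)
## possible observed 
##       21       12 
```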

A simple scatter plot does not show how many observations there are for each (x, y) value. Scatter plots therefore work best when both x and y are continuous and when all (x, y) values are unique.
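
One possible way to make the number of observations visible, offered here as a sketch and not taken from the text above, is ggplot2's geom_count(), which sizes each point by the number of observations at that location. The name `p_count` is illustrative.

```r
library(ggplot2)

# geom_count() counts the observations at each (x, y) location and
# maps that count to point size, so overlapping points become visible.
p_count <- ggplot(mpg, aes(x = class, y = drv)) +
  geom_count()

p_count
```

The plot still shows one point per observed (drv, class) combination, but larger points now mark the combinations with more cars.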