3- Data Visualization

Importing Libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

3.2.4 Exercises

  1. Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)

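An empty plot: a blank gray panel with no axes, gridlines, or points, because no aesthetic mappings or geoms have been specified yet.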

  1. How many rows are in mpg? How many columns?
dim(mpg)
## [1] 234  11

234 rows, 11 columns.

  1. What does the drv variable describe? Read the help for ?mpg to find out.
?mpg

drv: “the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd”

  1. Make a scatterplot of hwy vs cyl.
ggplot(data = mpg) + geom_point(aes(x = cyl, y = hwy))

  1. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(mpg) + geom_point(aes(class,drv))

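Both class and drv are categorical, so the points can only fall on a small grid of combinations and many observations are drawn on top of one another. The plot shows which combinations occur but not how many cars have each, so a scatterplot doesn’t really make sense here. A bar plot or a count-based plot would be better.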
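One possible way to make the overlap visible, offered as a sketch rather than the book's intended answer, is geom_count(), which sizes each point by the number of observations at that position. The object name p_count is just illustrative.

```r
library(ggplot2)

# Each (class, drv) pair is drawn once; point size encodes how many cars share it
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
p_count
```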

3.3.1 Exercises

  1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

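To set one color for the entire plot (not dependent on a variable), the color argument must go outside the aes() function. Inside aes(), "blue" is treated as a variable with a single value, so ggplot2 assigns it the first color of the default palette and adds a legend.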
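A corrected version might look like the following, with color passed as a fixed argument to geom_point(). The object name p_blue is only there so the plot can be inspected.

```r
library(ggplot2)

# color is a fixed argument of geom_point(), not a mapping inside aes()
p_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p_blue
```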

  1. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

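manufacturer, model, trans, drv, fl, and class are categorical. displ, year, cyl, cty, and hwy are numeric, although cyl and year take only a few distinct values. When you print mpg, the type abbreviation under each column name shows this: <chr> for character columns, <int> for integers, and <dbl> for doubles.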

mpg
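As an alternative to reading the printed tibble, the column classes can be listed directly. This is a small sketch using base R's sapply(); col_types is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset

# Class of every column: "character" columns are categorical,
# "integer" and "numeric" columns hold numbers
col_types <- sapply(mpg, class)
col_types
```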
  1. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = year))

Categorical variables get a distinct color or size for each level, whereas continuous variables are mapped onto a gradient of colors or a range of sizes. Shapes can’t be put on a smooth spectrum, so you cannot map a continuous variable to shape; ggplot2 raises an error if you try.
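The size and shape cases are not shown above, so here is a hedged sketch of both. The choice of cty as the continuous variable and the names p_size, p_shape, and shape_error are assumptions made for illustration; the exact wording of the error depends on the ggplot2 version.

```r
library(ggplot2)

# Continuous variable mapped to size: larger cty gives larger points
p_size <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty))

# Continuous variable mapped to shape: building the plot fails,
# so capture the error message instead of stopping
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
shape_error <- tryCatch(
  {
    ggplot_build(p_shape)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
shape_error
```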

  1. What happens if you map the same variable to multiple aesthetics?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = trans, shape = trans))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 10. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 96 rows containing missing values (geom_point).

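That variable is assigned a value for each of the aesthetics. For example, auto(l4) is green and square. Because trans has 10 levels and the default shape palette has only 6 shapes, the remaining four levels get no shape and their 96 rows are dropped, which is what the warnings report.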
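The warning suggests specifying shapes manually. One way to do that, as a sketch, is scale_shape_manual() with one shape code per level; the codes 0:9 are an arbitrary choice, not something taken from the book.

```r
library(ggplot2)

# trans has 10 levels, so supply 10 shape codes by hand (0:9 is arbitrary)
p_trans <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = trans, shape = trans)) +
  scale_shape_manual(values = 0:9)
p_trans
```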

  1. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

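Stroke changes the thickness of the border around the shape being graphed. It applies to shapes that have a border, such as the filled shapes 21 to 25, where the border color and the fill color can be set separately.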
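A minimal sketch of stroke on a bordered shape follows. The specific values (shape 21, white fill, black border, stroke of 2) are illustrative assumptions.

```r
library(ggplot2)

# Shape 21 is a filled circle with a border: fill colors the inside,
# color colors the border, and stroke sets the border width
p_stroke <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    shape = 21, fill = "white", color = "black", stroke = 2
  )
p_stroke
```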

  1. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

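If you map an aesthetic to a Boolean condition, ggplot2 evaluates the condition for every row and assigns one value of the aesthetic to TRUE and another to FALSE. Here the points are colored by whether displ is less than 5.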

3.5.1 Exercises

  1. What happens if you facet on a continuous variable?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ cty)

It makes a separate panel for every unique value of the continuous variable.
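If that many panels is too many, one option is to bin the continuous variable before faceting. The sketch below uses ggplot2's cut_width(); the bin width of 5 is an arbitrary assumption.

```r
library(ggplot2)

# cut_width() groups cty into bins 5 mpg wide, giving a handful of panels
# instead of one panel per distinct value
p_binned <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cut_width(cty, 5))
p_binned
```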

  1. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

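The empty cells in facet_grid(drv ~ cyl) are combinations of drv and cyl that have no observations in mpg (for example, there are no rear-wheel-drive cars with 4 cylinders). In the second plot, because both variables are discrete, all observations with the same combination fall on the same point. Each point on that plot therefore corresponds to one non-empty facet, and each gap in the grid of points corresponds to an empty facet.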
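The combinations that actually occur can also be listed as a table. This sketch uses dplyr's count(); combos is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# One row per (drv, cyl) pair present in the data, with its number of cars;
# pairs missing from this table are the empty facets
combos <- count(mpg, drv, cyl)
combos
```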

  1. What plots does the following code make? What does . do?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

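Each call makes a facet grid with only one variable. The . is a placeholder meaning “no variable on this dimension”: drv ~ . splits the plot into rows by drv, and . ~ cyl splits it into columns by cyl. It’s like a facet wrap, but instead of separate plots it’s one plot with separate sections laid out in a single column or a single row.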

  1. Take the first faceted plot in this section. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

It is much easier to see the distributions of the individual classes, but harder to see the overall trend across all classes. With a larger dataset, the overall chart becomes increasingly crowded. Faceting can help separate the data into more readable plots.

  1. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

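nrow and ncol set the number of rows and columns the faceted plots will be displayed in. Other arguments that affect the layout include dir, which controls whether panels are filled horizontally or vertically, and as.table, which controls the order in which panels are placed. facet_grid() doesn’t have nrow and ncol because its rows and columns are determined by the number of unique values of the variables used to facet.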
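As a quick illustration of ncol, the earlier class-faceted plot can be laid out in two columns instead of two rows. The value 2 and the name p_layout are illustrative.

```r
library(ggplot2)

# ncol fixes the number of columns; the number of rows follows from
# the number of levels of class
p_layout <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 2)
p_layout
```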

  1. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

This makes the plot easier to read and interpret because plots and computer screens are usually wider than they are tall. Putting the variable with more levels in the columns uses that horizontal space, so each panel is squished less.