3- Data Visualization

Importing Libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

3.2.4 Exercises

Run ggplot(data = mpg). What do you see?

ggplot(data = mpg)

An empty gray panel with no axes, grid lines, or points, because no aesthetic mappings or geoms have been added yet.

How many rows are in mpg? How many columns?

dim(mpg)

## [1] 234  11

234 rows, 11 columns.

What does the drv variable describe? Read the help for ?mpg to find out.

?mpg

“drv- the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd”

Make a scatterplot of hwy vs cyl.

ggplot(data = mpg) + geom_point(aes(x = hwy, y = cyl))

What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

ggplot(mpg) + geom_point(aes(class,drv))

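Both class and drv are categorical, so every car with the same combination is drawn on exactly the same point. The plot shows which combinations exist but not how many cars each one has. A plot based on counts, such as a bar plot, would be better.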
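
One way to make the hidden overlap visible is to count the cars at each combination. This is only a sketch of what a more useful plot could look like, not part of the exercise; the names p_count and class_drv_counts are my own.

```r
library(ggplot2)
library(dplyr)

# Size each point by the number of cars at that class/drv combination
p_count <- ggplot(mpg, aes(x = class, y = drv)) +
  geom_count()
p_count

# The same information as a table of counts
class_drv_counts <- count(mpg, class, drv)
class_drv_counts
```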

3.3.1 Exercises

What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

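To set one color for the entire plot (not dependent on a variable), the color argument must go outside the aes() function. Inside aes(), "blue" is treated as a variable with a single value, so the points get the default palette color and a legend.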
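
A corrected version of the plot, as a minimal sketch. The only change is where color sits; p_blue is just a name chosen here.

```r
library(ggplot2)

# color is set outside aes(), so it applies to every point
p_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p_blue
```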

Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

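manufacturer, model, trans, drv, fl, and class are categorical. displ, year, cyl, cty, and hwy are numeric, although year and cyl take only a few distinct values. When you print mpg, the type is shown under each column name: <chr> for the categorical columns, and <int> or <dbl> for the numeric ones.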

mpg

Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = year))

Categorical variables get a distinct color, size, or shape for each level, whereas continuous variables are mapped to a gradient of colors or a graded range of sizes. Shapes have no natural order, so mapping a continuous variable to shape raises an error.
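
The size and shape cases are not plotted above, so here is a rough sketch of both. I am assuming cty as the continuous variable. The error only appears when the plot is built, so tryCatch() is used to capture the message instead of stopping the script.

```r
library(ggplot2)

# A continuous variable mapped to size gives a graded legend
p_size <- ggplot(mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point()
p_size

# Mapping it to shape fails when the plot is built, so catch the error
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = cty)) +
  geom_point()
shape_result <- tryCatch(
  {
    ggplot_build(p_shape)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)
shape_result
```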

What happens if you map the same variable to multiple aesthetics?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = trans, shape = trans))

## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 10. Consider
## specifying shapes manually if you must have them.

## Warning: Removed 96 rows containing missing values (geom_point).

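The variable is encoded by every aesthetic it is mapped to, which is redundant. For example, auto(l4) is green and square. Here the default shape palette only covers six levels, so the 96 rows that belong to the remaining four levels of trans are dropped, as the warnings say.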

What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

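Stroke changes the thickness of the border around the shape being graphed. It is most useful with the filled shapes 21 to 25, where the border color and the fill color can be set separately.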
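
A small sketch to see stroke in action. The specific values (shape 21, size 3, stroke 2) are arbitrary choices of mine.

```r
library(ggplot2)

# Shape 21 is a filled circle: color sets the border, fill the interior,
# and stroke the border width
p_stroke <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21, color = "black", fill = "white", size = 3, stroke = 2)
p_stroke
```

Changing stroke while keeping size fixed makes the border thicker or thinner without changing the interior.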

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

If you map an aesthetic to a logical condition, ggplot2 evaluates it for every row and treats the result as a variable with two values, so TRUE and FALSE each get their own version of the aesthetic.

3.5.1 Exercises

What happens if you facet on a continuous variable?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ cty)

It makes a panel for every unique value of the continuous variable.
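
A sketch of how many panels that means for cty, and one possible way to tame it. Binning with cut_width() and the bin width of 5 are my own assumptions, not something the exercise asks for.

```r
library(ggplot2)
library(dplyr)

# One panel per distinct value of cty
n_panels <- n_distinct(mpg$cty)
n_panels

# Binning the variable first gives a handful of panels instead
p_binned <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cut_width(cty, 5))
p_binned
```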

What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

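The empty cells are combinations of drv and cyl that have no cars in mpg. Because both are discrete variables, all of the cars with a given combination fall on the same point in the second plot. Each point on that plot stands for one non-empty panel of the facet_grid(drv ~ cyl) plot, and the positions without a point line up with the empty panels.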
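
The empty panels can also be checked without plotting. This is a quick sketch using a cross-tabulation; drv_cyl and empty_cells are names I made up.

```r
library(ggplot2)

# Cross-tabulate the two faceting variables
drv_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_cyl

# Combinations with zero cars correspond to the empty facets
empty_cells <- sum(drv_cyl == 0)
empty_cells
```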

What plots does the following code make? What does . do?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

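Each call makes a facet grid with only one variable. The . is a placeholder that means no variable for that dimension: drv ~ . lays the panels out in rows by drv, and . ~ cyl lays them out in columns by cyl. It’s like a facet wrap, but the panels stay in a single column or a single row instead of wrapping.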
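
facet_grid() also accepts rows and cols arguments with vars(), which avoids the . placeholder. A sketch of the same two plots written that way, assuming a ggplot2 version that supports this interface (3.0.0 or later, as far as I know):

```r
library(ggplot2)

# Same as facet_grid(drv ~ .): one row of panels per drive train
p_rows <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(rows = vars(drv))
p_rows

# Same as facet_grid(. ~ cyl): one column of panels per cylinder count
p_cols <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(cols = vars(cyl))
p_cols
```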

Take the first faceted plot in this section. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

It is much easier to see the distributions of the individual classes, but harder to see the overall trend across all classes. With a larger dataset, the overall chart becomes increasingly crowded. Faceting can help separate the data into more readable plots.

Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

nrow and ncol set the number of rows and columns the faceted panels will be displayed in. Other layout options include as.table and dir. facet_grid() doesn’t have these arguments because its rows and columns are determined by the number of unique values of the variables used to facet.

When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

This makes the plot easier to read and interpret because pages and computer screens are usually wider than they are tall. Putting the variable with more levels in the columns squashes each panel less.

3.6.1 Exercises

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

Line: geom_line(), Boxplot: geom_boxplot(), Histogram: geom_histogram(), Area: geom_area().

Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

Scatter plot of displacement against hwy mpg, dots are colored according to drive train. Overlayed is a smooth regression line of the data.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

I was wrong. I forgot that color = drv is set in ggplot(), so every geom inherits it and geom_smooth() draws a separate line for each drive train.

What does show.legend = FALSE do? What happens if you remove it?
Why do you think I used it earlier in the chapter?

show.legend = FALSE removes the legend on the side of the graph that gives the label for aesthetics like color and shape.
I would guess so that graphs would all be the same size when next to each other.

What does the se argument to geom_smooth() do?

It controls whether the semi-transparent confidence band around the line is drawn (se = TRUE, the default) or hidden (se = FALSE).

Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

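They will not. In the first plot the data and mapping carry over from ggplot() to the geoms, and in the second the same data and mapping are repeated in each geom, so both layers get identical inputs either way.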

Recreate the R code necessary to generate the following graphs.

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(aes(group = drv), se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(aes(linetype = drv), se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 5, color = 'white', alpha = 0.5) +
  geom_point(aes(color = drv))

3.7.1 Exercises

What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

geom_pointrange().

ggplot(diamonds, aes(cut, depth)) +
  geom_pointrange(
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )

What does geom_col() do? How is it different to geom_bar()?

geom_col() uses stat_identity() by default, whereas geom_bar() uses stat_count(). This means you need to provide a y variable for geom_col() but not for geom_bar().
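
To see the difference, here is a sketch that draws the same bar chart both ways. Counting with dplyr::count() first is my own choice for producing the y values that geom_col() needs.

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows itself via stat_count()
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()
p_bar

# geom_col() plots values as given, so count the rows first
cut_counts <- count(diamonds, cut)
p_col <- ggplot(cut_counts, aes(x = cut, y = n)) +
  geom_col()
p_col
```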

Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

geom	stat
`geom_bar()`	`stat_count()`
`geom_bin2d()`	`stat_bin_2d()`
`geom_boxplot()`	`stat_boxplot()`
`geom_contour_filled()`	`stat_contour_filled()`
`geom_contour()`	`stat_contour()`
`geom_count()`	`stat_sum()`
`geom_density_2d()`	`stat_density_2d()`
`geom_density()`	`stat_density()`
`geom_dotplot()`	`stat_bindot()`
`geom_function()`	`stat_function()`
`geom_sf()`	`stat_sf()`
`geom_smooth()`	`stat_smooth()`
`geom_violin()`	`stat_ydensity()`
`geom_hex()`	`stat_bin_hex()`
`geom_qq_line()`	`stat_qq_line()`
`geom_qq()`	`stat_qq()`
`geom_quantile()`	`stat_quantile()`

Their names are often identical, and each geom uses its paired stat by default (and vice versa).

What variables does stat_smooth() compute? What parameters control its behavior?

It computes the predicted value y, the lower and upper bounds of the confidence interval ymin and ymax, and the standard error se. Its behavior is controlled by arguments such as method, formula, se, span, and level.
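
The computed variables can be inspected directly with layer_data(). A sketch, assuming method = "lm" just to keep the fit simple and quiet:

```r
library(ggplot2)

p_smooth <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm", formula = y ~ x)

# layer_data() returns the data frame the stat computed for a layer
smooth_data <- layer_data(p_smooth, 1)
names(smooth_data)
```

The column names include the computed variables alongside bookkeeping columns such as PANEL and group.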

In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))

Without group = 1 (or group set to any other constant), geom_bar() groups by x by default, and prop is computed within each group. Each bar is then its own group: the proportion of Fair within Fair is 100%, and the same holds for every other cut, so all the bars have the same height. Setting group = 1 puts all the rows in one group, so each bar’s proportion is computed relative to all levels of cut.
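
The computed prop values can be pulled out with layer_data() to confirm this. A sketch; the object names are my own.

```r
library(ggplot2)

# Default grouping: every bar is its own group
p_no_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
props_no_group <- layer_data(p_no_group)$prop
props_no_group

# One group for the whole dataset
p_one_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
props_one_group <- layer_data(p_one_group)$prop
props_one_group
```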

ggplot(data = diamonds) + 
  geom_bar(aes(x = cut, y = after_stat(prop), fill = color, group = 1))

But the fill color has gone, because group = 1 overrides the grouping by color. Let’s try removing the grouping.

ggplot(data = diamonds) + 
  geom_bar(aes(x = cut, y = after_stat(prop), fill = color))

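That’s not right either: every cut and color combination is now its own group with a proportion of 1, so all the segments have the same height.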

ggplot(data = diamonds) + 
  geom_bar(aes(x = cut, y = after_stat(count) / sum(after_stat(count)), fill = color))

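This took a second to get but makes sense. With fill = color, every cut and color combination is its own group, so after_stat(prop) is computed within each group and always equals 1. Dividing each group’s count by the sum of all the counts gives each segment’s share of the whole dataset, so the bars keep their colors and their heights are comparable.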
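
The same shares can be worked out by hand with dplyr as a sanity check. This is a sketch of the arithmetic, not what geom_bar() runs internally.

```r
library(ggplot2)  # for the diamonds dataset
library(dplyr)

# Share of the whole dataset in each cut/color combination
segment_props <- diamonds %>%
  count(cut, color) %>%
  mutate(prop = n / sum(n))
segment_props

# Adding the segments within a cut gives the height of each stacked bar
bar_heights <- segment_props %>%
  group_by(cut) %>%
  summarise(prop = sum(prop))
bar_heights
```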

3.8.1 Exercises

What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

Many of the points overlap, which makes the dataset look smaller than it actually is. We can either lower alpha so that overlapping points appear darker, or use jitter to move overlapping points slightly apart so they are all visible.
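
A sketch of both fixes, plus a count of how much overlap there is. The alpha and jitter amounts are arbitrary values I picked.

```r
library(ggplot2)
library(dplyr)

# Compare the number of rows with the number of distinct plotted positions
n_positions <- nrow(distinct(mpg, cty, hwy))
c(rows = nrow(mpg), positions = n_positions)

# Option 1: transparency, so stacked points look darker
p_alpha <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.3)
p_alpha

# Option 2: jitter, so stacked points are nudged apart
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3)
p_jitter
```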

What parameters to geom_jitter() control the amount of jittering?

width and height.

Compare and contrast geom_jitter() with geom_count().

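geom_count() shows overlapping data as a single larger dot sized by the number of observations, whereas geom_jitter() keeps every dot the same size and spreads them out with a small amount of random noise. geom_count() keeps the positions exact, while geom_jitter() changes them slightly.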

What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.

By default for geom_boxplot(), position = "dodge2"

ggplot(mpg, aes(hwy, class)) +
  geom_boxplot(aes(fill = fl))

We can see that, within each class, the boxplot for each fl is dodged so that it sits beside the others instead of overlapping them.

3.9.1 Exercises

Turn a stacked bar chart into a pie chart using coord_polar().

ggplot(mpg) +
  geom_bar(aes(x = factor(1), fill = manufacturer), width = 1) +
  coord_polar(theta = "y")

What does labs() do? Read the documentation.

It sets the labels on a ggplot: the title, subtitle, caption, and tag, as well as the axis and legend titles.
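
A short sketch of labs() on one of the earlier plots. All of the label text is made up for illustration.

```r
library(ggplot2)

p_labs <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(
    title = "Highway mileage by engine size",
    x = "Engine displacement",
    y = "Highway miles per gallon",
    color = "Drive train",   # legend title for the color aesthetic
    caption = "Data: ggplot2::mpg"
  )
p_labs
```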

What's the difference between coord_quickmap() and coord_map()?

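coord_map() applies a full map projection, which in general does not preserve straight lines and so takes considerable computation. coord_quickmap() is a quick approximation that does preserve straight lines, which makes it significantly faster. It works best for small areas close to the equator.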

What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

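We can see the relationship is roughly linear, and highway mileage is consistently higher than city mileage because the points sit above the reference line. coord_fixed() is important because it sets the plot to an aspect ratio of 1, so one unit on the x axis is the same length as one unit on the y axis and the reference line is drawn at 45 degrees. geom_abline() adds a reference line separate from the data. Because no arguments are passed, it defaults to an intercept of 0 and a slope of 1, which is y = x.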
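
A sketch that spells out the geom_abline() defaults and puts a number on how many cars sit above the line. share_above and p_ref are names of my own.

```r
library(ggplot2)

# Share of cars whose highway mileage exceeds their city mileage
share_above <- mean(mpg$hwy > mpg$cty)
share_above

# The same plot with the defaults of geom_abline() written out: the line y = x
p_ref <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  coord_fixed()
p_ref
```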

R for Data Science Chapter 3 Exercises

Ethan Niser

9-19-2022