EXERCISE 3.2.4
library(tidyverse)
Registered S3 method overwritten by 'dplyr':
  method           from
  print.rowwise_df
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.2.1     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   1.0.0     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)

ANSWER
An empty plot. ggplot(data = mpg) only creates a blank canvas; nothing is drawn until a geom layer is added.
How many rows are in mpg? How many columns?
?mpg
ANSWER
A data frame with 234 rows and 11 variables
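The same numbers can be checked directly in R rather than read from the help page. A minimal sketch, assuming only that ggplot2 (which ships the mpg data) is installed:

```r
library(ggplot2)  # provides the mpg dataset

n_rows <- nrow(mpg)  # number of observations
n_cols <- ncol(mpg)  # number of variables
dim(mpg)             # both at once: rows first, then columns
```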
What does the drv variable describe? Read the help for ?mpg to find out.
ANSWER
drv describes the type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive (4wd)
Make a scatterplot of hwy vs cyl
# "hwy vs cyl" puts hwy on the y-axis and cyl on the x-axis
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy))

What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(data=mpg)+
geom_point(mapping = aes(x=class, y=drv))

ANSWER
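The plot is not useful because both class and drv are categorical. Each point can only fall on one of a few class/drv combinations, so many observations are drawn on top of each other and the plot does not show how many cars are in each combination.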
3.3.1 EXERCISES
What’s gone wrong with this code? Why are the points not blue?
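SOLUTION
In the original code color = "blue" was placed inside aes(), so ggplot2 treated "blue" as a data value to map rather than as a colour. Setting color outside aes() makes the points blue.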
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
?mpg
str(mpg)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 234 obs. of 11 variables:
$ manufacturer: chr "audi" "audi" "audi" "audi" ...
$ model : chr "a4" "a4" "a4" "a4" ...
$ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
$ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
$ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
$ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
$ drv : chr "f" "f" "f" "f" ...
$ cty : int 18 21 20 21 16 18 18 18 16 20 ...
$ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
$ fl : chr "p" "p" "p" "p" ...
$ class : chr "compact" "compact" "compact" "compact" ...
ANSWER
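Categorical - manufacturer, model, trans, drv, fl, class
Continuous - displ, year, cyl, cty, hwy
When you print mpg, the tibble shows the type of each column under its name: <chr> for the categorical variables, <dbl> and <int> for the continuous ones.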
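The split can also be derived from the column types instead of being read off the str() output. This is a small sketch under one assumption that is not in the exercise itself: character columns are treated as categorical and numeric columns as continuous.

```r
library(ggplot2)  # provides the mpg dataset

# Assumption: character columns are categorical, numeric columns are continuous
is_chr <- vapply(mpg, is.character, logical(1))

categorical <- names(mpg)[is_chr]
continuous  <- names(mpg)[!is_chr]

categorical
continuous
```

Integer columns such as cyl and year take only a few distinct values, so treating them as continuous is a judgment call.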
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty))

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cty))

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
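Error: A continuous variable can not be mapped to shape
ANSWER
With a continuous variable, color uses a gradient and size scales the points smoothly, while shape fails because shapes have no natural order. With a categorical variable, color and shape each assign one distinct value per level.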

What happens if you map the same variable to multiple aesthetics?
The plot works, but the information is redundant: both size and color encode cty.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty, color = cty))

What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
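The stroke aesthetic controls the width of the border of a point. It is meant for shapes that have a border (like 21), where the inside and the outside can be coloured separately.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty, color = cty), shape = 21, stroke = 5)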

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
ggplot(data=mpg)+
geom_point(mapping=aes(x = displ, y = hwy, color=displ<5))
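An expression inside aes() is evaluated for every row, and the result is mapped like any other variable. As a quick illustration (the name is_small is made up for this sketch), the condition can be inspected on its own:

```r
library(ggplot2)  # provides the mpg dataset

is_small <- mpg$displ < 5  # logical vector, one TRUE/FALSE per car

class(is_small)
table(is_small)            # how many cars fall into each colour group
```

Because the result is logical, the legend of the plot has exactly two entries, FALSE and TRUE.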

3.5.1 EXERCISE
What happens if you facet on a continuous variable?
facet_wrap() works with a continuous variable, but it is less useful than with a categorical one: the variable is treated as categorical and every distinct value gets its own panel.
ggplot(data=mpg)+
geom_point(mapping=aes(x = displ, y = hwy))+
facet_wrap(~cty)
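To see why this gets unwieldy, it helps to count the distinct values, since each one becomes a panel. The sketch below also shows one possible workaround, binning the variable with cut_width() from ggplot2 before faceting; the bin width of 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

n_panels <- length(unique(mpg$cty))          # one panel per distinct cty value
cty_binned <- cut_width(mpg$cty, width = 5)  # group cty into bins 5 mpg wide
nlevels(cty_binned)                          # far fewer panels after binning

# Facet on the binned variable instead of the raw one
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cut_width(cty, 5))
# print(p) draws the plot
```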

What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
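Empty cells mean that no observations in mpg have that combination of drv and cyl. They correspond to the drv/cyl positions that have no point in the plot of drv against cyl.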
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))+
facet_grid(drv ~ cyl)
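The empty panels can be confirmed without plotting by cross-tabulating the two variables. A minimal sketch (the name combos is only illustrative):

```r
library(ggplot2)  # provides the mpg dataset

combos <- table(drv = mpg$drv, cyl = mpg$cyl)  # number of cars per drv/cyl pair
combos

# Pairs with a count of zero are exactly the empty facet cells
which(combos == 0, arr.ind = TRUE)
```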

What plots does the following code make? What does . do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)

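conclusion
The . means "do not facet along this dimension". In facet_grid(drv ~ .) the rows are faceted by the variable on the left-hand side of ~, and in facet_grid(. ~ cyl) the columns are faceted by the variable on the right-hand side of ~.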
Take the first faceted plot in this section:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
An advantage of faceting is that it splits the plot into subplots that each display one subset of the data, so you can focus on one category at a time. With the colour aesthetic, by contrast, the colours become hard to tell apart as the number of categories grows.
A disadvantage of faceting is that the points are on separate panels, so direct comparison between categories is harder. With a larger dataset the points overlap more, so faceting becomes more attractive.
Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
?facet_wrap()
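nrow, ncol
Number of rows and columns of panels. Other arguments that control the layout include as.table and dir. facet_grid() has no nrow or ncol arguments because the number of rows and columns is determined by the number of unique values of the faceting variables.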
When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
One reason is that the dependent variable is usually plotted on the y-axis, so panels placed side by side share a y-axis and make it easier to compare highs, lows, and trends. Plots are also usually wider than they are tall, so there is more room for many levels across the columns.
EXERCISE 3.6.1
What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
ggplot(data = mpg)+
geom_line(mapping=aes(x=displ,y=hwy))

ggplot(data = mpg)+
geom_boxplot(mapping=aes(x=displ,y=hwy,color=drv))

# geom_histogram() needs an x aesthetic; hwy is used here as an example
ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy))

ggplot(data = mpg)+
geom_area(mapping=aes(x=displ,y=hwy))

Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)

What does show.legend = FALSE do?
It hides the legend for that layer, i.e. the key that shows which level corresponds to which colour.
What happens if you remove it?
You see the legend explaining what levels correspond to which values.
Why do you think I used it earlier in the chapter?
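Most likely to keep the plots shown side by side consistent: the other plots had no legend, and a legend on this one would have made its plot area smaller.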
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE
)

What does the se argument to geom_smooth() do?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = TRUE)

ANSWER
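The se argument controls whether the confidence interval around the smooth is displayed. With se = TRUE (the default) a shaded band is drawn around the line; with se = FALSE only the smoothed line is shown.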
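As a rough sketch of what the shaded band drawn with se = TRUE represents, the code below fits a single loess curve to all cars with base R and builds a 95% pointwise interval from its standard errors. It is an approximation for illustration, not a claim about ggplot2 internals: the plots above fit one curve per drv, and the grid size of 80 and the names fit, grid and band are choices made for this example.

```r
library(ggplot2)  # provides the mpg dataset

fit <- loess(hwy ~ displ, data = mpg)  # one smooth over all cars

# Evaluate the smooth on an evenly spaced grid of displ values
grid <- data.frame(displ = seq(min(mpg$displ), max(mpg$displ), length.out = 80))
pred <- predict(fit, newdata = grid, se = TRUE)

# 95% pointwise interval: fit +/- t-quantile * standard error
crit <- qt(0.975, pred$df)
band <- data.frame(
  displ = grid$displ,
  y     = pred$fit,
  ymin  = pred$fit - crit * pred$se.fit,
  ymax  = pred$fit + crit * pred$se.fit
)
head(band)
```

With se = FALSE only the y column would be drawn; ymin and ymax are what the ribbon shows.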
Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()

ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

ANSWER
No, the two graphs will look the same, because both calls use the same data and mappings for both layers. The second version only duplicates them in each geom, which is not good practice for clean code.
Recreate the R code necessary to generate the following graphs.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(size=class),show.legend = FALSE) +
geom_smooth(se=FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(size=class),show.legend = FALSE) +
geom_smooth(mapping = aes(group=drv),se=FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color=drv,size=class),show.legend = TRUE) +
geom_smooth(mapping = aes(color=drv),show.legend = TRUE,se=FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color=drv,size=class),show.legend = TRUE) +
geom_smooth(show.legend = FALSE,se=FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color=drv,size=class),show.legend = TRUE) +
geom_smooth(mapping = aes(linetype=drv),show.legend = TRUE,se=FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy,size=class)) +
geom_point(mapping = aes(color=drv),show.legend = TRUE)

EXERCISE 3.7.1
What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
?stat_summary()
PREVIOUS PLOT
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.ymin = min,
fun.ymax = max,
fun.y = median
)

SOLUTION
The default geom of stat_summary() is geom_pointrange(), so the previous plot can be rewritten as:
ggplot(data = diamonds) +
geom_pointrange(mapping = aes(x = cut, y = depth),
stat = "summary",
fun.ymin = min,
fun.ymax = max,
fun.y = median)

What does geom_col() do? How is it different to geom_bar()?
?geom_bar
ANSWER
geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.
Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
?geom_bar
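geom_bar(mapping = NULL, data = NULL, stat = "count", position = "stack", ..., width = NULL, binwidth = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
stat_count(mapping = NULL, data = NULL, geom = "bar", position = "stack", ..., width = NULL, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
Both stat_count() and geom_bar() understand the following aesthetics: x, y, alpha, colour, fill, group, linetype and size. Other pairs follow the same pattern, for example geom_boxplot()/stat_boxplot() and geom_smooth()/stat_smooth(): each geom names its partner as its default stat, and each stat names its partner as its default geom.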
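One way to check that the two functions are a pair is to build the same chart from each side and compare the data that ggplot2 computes for the layer. This is a small sketch using layer_data() from ggplot2; the names p_geom, p_stat, d_geom and d_stat are made up for the example.

```r
library(ggplot2)  # provides the diamonds dataset

p_geom <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
p_stat <- ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))

# layer_data() returns the data frame computed for a layer after the stat has run
d_geom <- layer_data(p_geom)
d_stat <- layer_data(p_stat)

identical(d_geom$count, d_stat$count)  # same counts from either entry point
```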
What variables does stat_smooth() compute? What parameters control its behaviour?
?stat_smooth
y - predicted value
ymin - lower pointwise confidence interval around the mean
ymax - upper pointwise confidence interval around the mean
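se - standard error
The main parameters that control its behaviour are method (the smoothing function), formula, se (whether to compute the confidence interval), level (the confidence level), span (the amount of smoothing for loess) and n (the number of points at which the smoother is evaluated).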
In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
THE PLOT IS WRONG when you exclude group = 1
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))

THE PLOT IS WRONG when you exclude group = 1
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..,group=1))

ANSWER
We need to set group = 1 to override the default grouping, which here is by cut (in general, by the discrete x variable). Without it, ..prop.. is computed within each group, so every bar is 100% of its own group and all bars have height 1. With group = 1 the proportions are computed over the whole dataset.
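The arithmetic behind the two versions can be reproduced with base R. In this sketch (variable names are illustrative), dividing each count by its own group total mimics the default grouping, and dividing by the overall total mimics group = 1:

```r
library(ggplot2)  # provides the diamonds dataset

counts <- table(diamonds$cut)  # number of diamonds per cut

within_group <- counts / counts       # each cut relative to itself: always 1
overall      <- counts / sum(counts)  # each cut relative to all diamonds

within_group
overall
```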
3.8.1 Exercises
What is the problem with this plot? How could you improve it?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()

Answer
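cty and hwy are rounded to whole numbers, so the points fall on a grid and many of them overlap. Adding a small amount of random noise with geom_jitter(), or showing counts with geom_count(), makes the overlapping points visible.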
What parameters to geom_jitter() control the amount of jittering?
?geom_jitter
ANSWER
width and height
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(width = 0.5, height = 0.5)

Compare and contrast geom_jitter() with geom_count().
?geom_count
geom_count() is a variant of geom_point() that counts the number of observations at each location, then maps the count to point area. It is useful when you have discrete data and overplotting.
?geom_jitter
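The jitter geom is a convenient shortcut for geom_point(position = "jitter"). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets. In short, geom_count() keeps the points at their true locations and encodes overlap as size, while geom_jitter() keeps one point per observation but moves it slightly.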
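The two strategies can be imitated with base R to make the contrast concrete. This is only an illustration of the ideas, not how either geom is implemented; the seed and the amount of 0.5 are arbitrary choices, and pts, loc_counts, jit_cty and jit_hwy are names invented for the sketch.

```r
library(ggplot2)  # provides the mpg dataset

# geom_count() idea: collapse identical (cty, hwy) locations and keep the count
pts <- data.frame(cty = mpg$cty, hwy = mpg$hwy, n = 1L)
loc_counts <- aggregate(n ~ cty + hwy, data = pts, FUN = sum)
nrow(loc_counts)  # distinct locations, fewer than the number of cars

# geom_jitter() idea: keep every point but shift it by a small random amount
set.seed(42)  # arbitrary seed so the sketch is reproducible
jit_cty <- jitter(mpg$cty, amount = 0.5)  # uniform noise in [-0.5, 0.5]
jit_hwy <- jitter(mpg$hwy, amount = 0.5)
```

Counting preserves the exact positions at the cost of showing fewer marks, while jittering shows every car at a slightly wrong position.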
What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
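ANSWER
The default position adjustment is "dodge2": boxes that share an x value are placed side by side instead of overlapping, as the two years within each drv show here.
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(y = displ, x = drv, color = factor(year)))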

3.9.1 Exercises
Turn a stacked bar chart into a pie chart using coord_polar().
ggplot(data = diamonds) +
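  # a pie chart is a single stacked bar drawn in polar coordinates with theta mapped to y
  geom_bar(mapping = aes(x = factor(1), fill = clarity), width = 1) +
  coord_polar(theta = "y")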

What does labs() do? Read the documentation.
?labs
Modify axis, legend, and plot labels
What’s the difference between coord_quickmap() and coord_map()?
?coord_quickmap
?coord_map
coord_map projects a portion of the earth, which is approximately spherical, onto a flat 2D plane using any projection defined by the mapproj package. Map projections do not, in general, preserve straight lines, so this requires considerable computation. coord_quickmap is a quick approximation that does preserve straight lines. It works best for smaller areas closer to the equator.
What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()

The plot shows a positive linear trend between hwy and cty with a slope of approximately 1, meaning that a unit increase in cty is associated with roughly a unit increase in hwy. The points lie above the reference line, so highway mileage is higher than city mileage.
?coord_fixed()
coord_fixed() forces a specified aspect ratio between the physical representation of the units on the axes. The ratio is 1 by default. It is important to fix the aspect ratio in this case because hwy and cty are measured in the same unit (miles per gallon). Any other aspect ratio would give a visually misleading representation and might lead us to believe that one increases at a faster rate than the other.
?geom_abline()
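geom_abline() adds a diagonal reference line to the plot. With no arguments it uses intercept 0 and slope 1, which allows us to compare each car's hwy with its cty: points above the line have higher highway than city mileage.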
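What the reference line shows visually can also be checked numerically. A hedged sketch: share_above is the fraction of cars that sit above the y = x line, and the straight-line fit is included only for comparison with that line (both names are made up for this example).

```r
library(ggplot2)  # provides the mpg dataset

# Fraction of cars whose highway mileage exceeds their city mileage
share_above <- mean(mpg$hwy > mpg$cty)
share_above

# Least-squares line of hwy on cty, to compare with intercept 0 and slope 1
fit <- lm(hwy ~ cty, data = mpg)
coef(fit)
```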
4.4 Practice
Why does this code not work?
my_variable <- 10
my_varlable
Error: object 'my_varlable' not found
Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)
The object called in the second line, my_varlable, does not match the name of the object created in the first line, my_variable (an l was typed instead of an i).
Tweak each of the following R commands so that they run correctly:
library(tidyverse)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

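fliter(mpg, cyl = 8)
Error in fliter(mpg, cyl = 8) : could not find function "fliter"
The function name is misspelled and = should be ==. The corrected command is:
filter(mpg, cyl == 8)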
Press Alt + Shift + K. What happens? How can you get to the same place using the menus?
The keyboard shortcut reference appears.
You can get to the same place from the Tools menu by choosing Keyboard Shortcuts Help.
