L02 Data Visualization

Data Science 1 with R (STAT301-1)

Author

YOUR NAME

# Loading package(s)
library(tidyverse)

Exercise 1

There are three particularly important parameters to our template for building a graphic with ggplot2. They are <DATA>, <GEOM_FUNCTION>, and <MAPPINGS>. The importance of <DATA> is obvious. <GEOM_FUNCTION> is referring to the selection of a geom. <MAPPINGS>, specifically aes(<MAPPINGS>), is referring to the process of defining aesthetic mappings.

What is a geom?
What is an aesthetic mapping?

Solution

A geom is the geometrical object that a plot uses to represent data. It tells the plot how the data should be displayed. Aesthetic mapping, meanwhile, defines how variables are mapped to visual properties, such as color, size, and shape.
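As a rough illustration of the difference, the sketch below keeps the data fixed and changes first the geom, then the aesthetic mapping. The object names (p_points, p_smooth, p_colour) are made up for this example and are not part of the lab template.

```r
library(ggplot2)

# Same data and same x/y mapping, two different geoms:
# points draw one mark per car, smooth draws a fitted curve
p_points <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
p_smooth <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth()

# Same geom, one extra aesthetic mapping: class is mapped to colour
p_colour <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = class))

p_colour
```

Printing p_points and p_smooth side by side shows how much the choice of geom alone changes the picture.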

Exercise 2

Make a scatterplot of hwy vs cyl.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy))

Exercise 3

ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = class))

What happens if you make a scatterplot of class vs drv? What is the major drawback of this plot, the one that really limits the plot’s usefulness?

Solution

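The scatterplot has very little value because both class and drv are categorical. Many cars share the same combination, so their points land on top of one another and the plot cannot show how many observations each point represents. Putting class on the x-axis and using a bar chart would lead to a more effective data visualization. You could also add a third variable and have class be illustrated with an aesthetic mapping.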

Exercise 4

What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

Solution

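The closing parenthesis of aes() should come right after “y = hwy”. Because color = "blue" sits inside aes(), ggplot2 treats "blue" as a one-value variable and maps it to the default color scale instead of painting the points blue. To set a color manually, the argument needs to go outside of the aes() function.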
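A minimal sketch of the corrected call, with color passed to geom_point() as a constant rather than inside aes(). The name p_fixed is only a label for this example.

```r
library(ggplot2)

# color is set outside aes(), so it is a fixed value applied to every point
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

p_fixed
```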

Exercise 5

What does the stroke aesthetic do? What shapes (provide the numerical references) does it work with? (Hint: use ?geom_point)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), shape = 8, stroke = 7)

Solution

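The stroke aesthetic changes the thickness of the outline of each point on a scatterplot. Shapes 1, 10, 13, 16, 19, 20, and 21 are circles; shapes 2, 17, and 24 are triangles; shapes 5, 9, 18, and 23 are diamonds; shape 3 is a plus; shape 4 is an X; shapes 6 and 25 are downward-pointing triangles; shapes 0, 7, 12, 14, 15, and 22 are squares; shape 8 is an asterisk; and shape 11 is a star. In total, there are 26 standard shapes, numbered 0 to 25. Values from 32 to 127 draw ASCII characters instead, with 65 to 90 going through the uppercase alphabet, and nothing appears at “shape = 127”. The documentation demonstrates stroke on shapes with a border, such as 21 (shapes 21 to 25 have both a border color and a fill), but it also thickens the lines of the other shapes, as the asterisk in the code above shows.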
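To see stroke on a bordered shape, the sketch below draws the same scatterplot twice with shape 21 and changes only the stroke value. The specific sizes and colors are arbitrary choices for illustration.

```r
library(ggplot2)

# Shape 21 has a border (colour) and an interior (fill);
# size controls the filled part and stroke the border width
p_thin <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             shape = 21, colour = "black", fill = "white",
             size = 3, stroke = 0.5)

p_thick <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             shape = 21, colour = "black", fill = "white",
             size = 3, stroke = 2)

p_thick
```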

Exercise 6

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?

ggplot(data = mpg) + geom_point(aes(x = cty, y = hwy, colour = displ < 5))

Solution

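ggplot2 evaluates “displ < 5” for every row and maps the resulting TRUE/FALSE values to color. With the default palette, the points where “displ < 5” is TRUE turn blue and the points where it is FALSE (“displ >= 5”) are red, and the legend shows the two logical values.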
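One way to check what is being mapped is to compute the expression directly and count its values before plotting. This is only a sketch; lgl_counts and p_lgl are hypothetical names.

```r
library(ggplot2)

# The expression inside aes() is evaluated per row, giving a logical vector
lgl_counts <- table(mpg$displ < 5)
lgl_counts

# Mapping that logical vector to colour yields exactly two colour groups
p_lgl <- ggplot(data = mpg) +
  geom_point(aes(x = cty, y = hwy, colour = displ < 5))

p_lgl
```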

Exercise 7

What do the empty cells in the plot using facets indicate?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

Solution

The empty cells in the plot using the facets indicate that there are no observations in the data with that combination of drv and cyl values.

How do they relate to this plot?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

Solution

The x-axis here (drv) is the variable that defines the rows of the previous graph, labeled on the right, while the y-axis (cyl) is the variable that defines the columns, labeled on top. As you can see, there are no points at combinations such as (4, 5) on this graph, which explains why those combinations show up as empty cells in the previous graph.
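The same conclusion can be checked without a plot by cross-tabulating the two faceting variables. The sketch below uses base R's table(); combo_counts is a name chosen for this example.

```r
library(ggplot2)  # provides the mpg data

# Count the cars in every drv/cyl combination
combo_counts <- table(drv = mpg$drv, cyl = mpg$cyl)
combo_counts

# Combinations with a count of zero correspond to the empty facet cells
which(combo_counts == 0, arr.ind = TRUE)
```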

Exercise 8

Given the faceted plot:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

Solution

Facets make it easier to split the data into separate groupings, which works very well for a larger dataset where colored points would overlap. At the same time, comparing values across panels is harder than comparing colors within one panel, and panels can be unbalanced with a smaller dataset. The 2seater cell is an example of this; with so few points, it can be difficult to interpret immediately. For larger datasets, faceting would work better, but the color aesthetic would be superior for smaller datasets.

Exercise 9

Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

?facet_wrap

Solution

nrow sets the number of rows of panels, while ncol sets the number of columns. Other options also control the layout: scales chooses fixed or free scales, as.table dictates whether the highest values go at the bottom right or the top right, drop removes factor levels not used in the data, dir changes the direction in which the panels are laid out, and strip.position places the labels on a specific side of each panel. facet_grid() doesn’t need nrow and ncol arguments because the number of rows and columns is determined by the number of unique values of the faceting variables.
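A short sketch of a few of these layout arguments on the mpg data. The particular values (four columns, free y scales, strips at the bottom) are arbitrary choices for illustration.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# ncol fixes the number of panel columns; rows are filled as needed
p_cols <- base_plot +
  facet_wrap(~ class, ncol = 4)

# scales frees the y-axis per panel; strip.position moves the labels
p_free <- base_plot +
  facet_wrap(~ class, nrow = 2, scales = "free_y", strip.position = "bottom")

p_cols
```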

Exercise 10

When using facet_grid() it is suggested that you put the variable with more unique levels in the columns. Why do you think that this practice is suggested?

Solution

Putting the variable with more unique levels in the rows would make the graph taller, and most screens and pages are wider than they are tall, so a wider graph leaves more room for each panel and makes for a cleaner data visualization.

Exercise 11

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_line() + geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_boxplot() + geom_point()

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_area() + geom_point()

Solution

geom_line() produces a line chart, geom_boxplot() produces a boxplot, geom_histogram() produces a histogram, and geom_area() produces an area chart.
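For completeness, here is a sketch of the histogram and of a boxplot grouped by a categorical variable. The binwidth of 2 and the choice of class for the x-axis are assumptions made for this example.

```r
library(ggplot2)

# A histogram needs only an x aesthetic; the counts are computed by the stat
p_hist <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# A boxplot summarizes a continuous y within each level of a categorical x
p_box <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

p_hist
```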

Exercise 12

Suppose we have a dataset named dat containing the variables weight, height, and gender. Predict what the output/graphic will look like for the code below.

ggplot(data = dat, mapping = aes(x = height, y = weight, color = gender)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

Solution

This code draws a scatterplot of weight (y-axis) against height (x-axis), with the points colored by gender. On top of the points, the smooth geom adds a smooth line fitted to the data, and because color is mapped in ggplot(), each gender gets its own line in its own color (for instance, one for male and one for female). Since se = FALSE, no confidence band is drawn around the lines.
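The prediction can be checked on simulated data. The dat below is entirely made up (two gender levels, arbitrary means and noise) and only stands in for the hypothetical dataset in the exercise.

```r
library(ggplot2)

set.seed(1)
n <- 100

# Simulated stand-in for dat; the numbers are arbitrary, not real measurements
dat <- data.frame(
  gender = rep(c("female", "male"), each = n / 2),
  height = c(rnorm(n / 2, mean = 163, sd = 7),
             rnorm(n / 2, mean = 176, sd = 7))
)
dat$weight <- 0.9 * dat$height - 90 + rnorm(n, sd = 8)

p_dat <- ggplot(data = dat, mapping = aes(x = height, y = weight, color = gender)) +
  geom_point() +
  geom_smooth(se = FALSE)

p_dat
```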

Exercise 13

Will these two graphs look different? Why/why not? — Try answering without running code and then check.

# Graph 1
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

# Graph 2
ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

Solution

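These graphs will not look different. In the first graph, the data and mapping are specified once in ggplot(), so every geom inherits them and they aren’t needed in the geom functions. In the second graph, the same data and mapping are repeated inside each geom instead, and the result is the same. I will run the code below to display it after predicting.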

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
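Beyond comparing the two plots by eye, one possible check is to compare the data each layer ends up drawing, using layer_data(). The names g1, g2, same_points, and same_smooth are hypothetical.

```r
library(ggplot2)

g1 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

g2 <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# layer_data(plot, i) returns the computed data for the i-th layer
same_points <- isTRUE(all.equal(layer_data(g1, 1), layer_data(g2, 1)))
same_smooth <- isTRUE(all.equal(layer_data(g1, 2), layer_data(g2, 2)))
```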

Exercise 14

Recreate the R code necessary to generate the following graphs (first 5 required, 6th is a challenge).

Solution

Optional Challenge

Exercise 15

What is the default geom associated with stat_summary()? How could you rewrite the code to use the default geom function instead of the stat function?

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

Solution

The default geom associated with stat_summary() is geom_pointrange(). To rewrite the code to use the default geom function instead of the stat function, we would run this:

ggplot(data = diamonds) + 
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )

Exercise 16

What variables does stat_smooth() compute? In your own words, describe how the parameters method, formula, and span affect its behavior.

Solution

stat_smooth() computes four different variables: the predicted value (y or x), the lower pointwise confidence interval around the mean (ymin or xmin), the upper pointwise confidence interval around the mean (ymax or xmax), and the standard error (se). The method parameter sets the smoothing method to use; if NULL, it is chosen based on the size of the largest group (loess for fewer than 1,000 observations, a GAM otherwise). The formula parameter gives the formula to use in the smoothing function; formula = NULL indicates the default, which is y ~ x for the smaller groups that use loess. Meanwhile, the span parameter controls the amount of smoothing for the default loess smoother; larger numbers give smoother lines and smaller numbers give wigglier lines.
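A small sketch of how these parameters change the fitted line, assuming the mpg data and span values of 0.2 and 1 picked only for contrast. The last plot keeps the confidence band so the computed variables can be inspected with layer_data().

```r
library(ggplot2)

base_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Small span: the loess line follows local bumps closely
p_wiggly <- base_plot +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.2, se = FALSE)

# Large span: the loess line is much smoother
p_flat <- base_plot +
  geom_smooth(method = "loess", formula = y ~ x, span = 1, se = FALSE)

# A different method: a straight line from a linear model, with its band
p_lm <- base_plot +
  geom_smooth(method = "lm", formula = y ~ x)

# The variables computed by the stat appear as columns of the layer data
names(layer_data(p_lm, 2))
```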

Exercise 17

What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

Solution

With no extra aesthetics, this plot is harder to read and offers little to analyze. To improve it, map a third variable, such as class, to an aesthetic such as color, size, alpha, or shape. While in theory one could argue a simple plot is best, it’s pretty easy to understand that more city mpg will likely also be correlated with more highway mpg, so more nuance is needed than that. The following four aesthetic mapping changes are candidates; color is the cleanest, since ggplot2 warns that size and alpha are not advised for a discrete variable and the default shape palette covers only six of the seven classes:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) + geom_point()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, size = class)) + geom_point()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, alpha = class)) + geom_point()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, shape = class)) + geom_point()
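Another angle worth considering is overplotting: cty and hwy are recorded as whole numbers, so many cars share exactly the same coordinates and are drawn on top of one another. The sketch below counts the distinct coordinate pairs and uses geom_jitter() to spread the points slightly; the jitter amounts are arbitrary assumptions.

```r
library(ggplot2)

# Fewer distinct (cty, hwy) pairs than rows means points are hidden
n_pairs <- nrow(unique(mpg[, c("cty", "hwy")]))
n_pairs < nrow(mpg)

# geom_jitter() adds a small random offset to each point
p_jit <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.6)

p_jit
```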

Exercise 18

What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.

Solution

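The default position adjustment for geom_boxplot() is position = "dodge2", which places overlapping objects beside one another. Mapping class to fill puts several boxplots at each drv value, so the dodging becomes visible:

ggplot(data = mpg, mapping = aes(x = drv, y = hwy, fill = class)) + 
  geom_boxplot(position = "dodge2")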

Exercise 19

What does labs() do? Read the documentation.

?labs

Solution

labs() is used for changing axis, legend, and plot labels. The arguments include the text for the title, the subtitle text, the text for the caption, the text for the tag label, alternate text, and the axis labels.
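A brief sketch of labs() in use. The label text is invented for this example; the argument names (title, subtitle, caption, x, y, colour) are those listed in the documentation.

```r
library(ggplot2)

p_lab <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(
    title = "Highway mileage by engine size",   # plot title
    subtitle = "mpg dataset",                   # line under the title
    caption = "Source: ggplot2::mpg",           # small text at the bottom
    x = "Engine displacement (litres)",         # x-axis label
    y = "Highway miles per gallon",             # y-axis label
    colour = "Vehicle class"                    # legend title
  )

p_lab
```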

Exercise 20

What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

Solution

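This plot shows a positive correlation between city and highway mpg, so as one gets higher, the other does as well. The points sit above the reference line, so cars get higher mpg on the highway than in the city. coord_fixed() is important because it keeps the graph proportionate; it forces a specified ratio between the physical representation of data units on the axes. For instance, the default ratio = 1 means that one unit on the x-axis is equal in length to one unit on the y-axis, so the reference line sits at 45 degrees. geom_abline() adds a reference line to a plot, mapping out the “y = mx + b” equation; with no arguments it uses a slope of 1 and an intercept of 0.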
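To make the reference line explicit, the sketch below spells out the slope and intercept that geom_abline() uses when called without arguments and computes the share of cars above the line. The dashed linetype and the name share_above are choices made for this example.

```r
library(ggplot2)

p_ref <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # the line y = x
  coord_fixed(ratio = 1)                                        # one x unit = one y unit

# Share of cars whose highway mpg exceeds their city mpg
share_above <- mean(mpg$hwy > mpg$cty)
share_above

p_ref
```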

Exercise 21

In a few sentences, describe the approach to building graphics that is implemented in ggplot2.

Solution

Building graphics starts with finding the dataset, and, from there, transforming it into the information display you desire. From there, you should choose a geometric object to represent each observation in the dataset, with the ability to use aesthetic mapping to further represent each variable. After that, select a coordinate system to place the geoms into and then use the location of the objects to display the values of the x and y variables. While further adjustments can be made, either via a position adjustment, faceting, or adding additional layers, this is not necessary, and the graphic is complete. Really, it’s a methodical step-by-step process with a variety of options, based on how elegant you’d like your graphic to be, but simply following a simple blueprint will get you a desired baseline.
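The steps described above can be sketched as a single layered call. This is one possible arrangement on the mpg data, not a required recipe; each component is labeled in the comments.

```r
library(ggplot2)

p_layers <- ggplot(data = mpg) +                       # data
  geom_point(
    mapping = aes(x = displ, y = hwy, colour = drv),   # aesthetic mappings
    position = "jitter"                                # position adjustment
  ) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy),                 # a second layer with its own stat
    method = "lm", formula = y ~ x, se = FALSE
  ) +
  coord_cartesian() +                                  # coordinate system
  facet_wrap(~ year)                                   # faceting

p_layers
```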

Challenge

No one, undergraduate or graduate, is required to complete this challenge. This section is for those wanting to go a little further with ggplot2.

Attempt to recreate the following graphic.

Hint: ggthemes package