Sections 3-3.2: Introduction, Prerequisites, First Steps, The mpg Data Frame, Creating a ggplot, A Graphing Template

1. Run ggplot(data = mpg). What do you see?

ggplot(data = mpg)

  • I see a blank graph, because the code supplies only the data: there are no aesthetic mappings for the x and y axes and no geom layer to draw.

2. How many rows are in mpg? How many columns?

nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11
  • 234 rows and 11 columns
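
As a cross-check, dim() returns both counts in a single call. This is a minimal sketch; mpg_dims is only an illustrative variable name.

```r
library(ggplot2)  # provides the mpg data frame

# dim() returns c(number of rows, number of columns)
mpg_dims <- dim(mpg)
mpg_dims
```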

3. What does the drv variable describe? Read the help for ?mpg to find out.

?mpg
## starting httpd help server ... done
  • drv: the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd

4. Make a scatterplot of hwy vs cyl.

#map the cyl column and hwy column of dataset mpg to x and y axis.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = cyl, y = hwy))

5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

#map the class column and drv column of dataset mpg to x and y axis.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = class, y = drv))

  • I see only a few points, and they do not form a pattern. The plot is not useful because drv and class are both categorical variables with only a few values each, so many observations fall on the same point and overlap.
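
To see how much the points overlap, the observations behind each class/drv combination can be counted. The sketch below is one possible way to do that; combo_counts and p_count are illustrative names, and geom_count() is shown only as one alternative that makes the hidden counts visible.

```r
library(ggplot2)

# Count the observations for every combination of class and drv
combo_counts <- as.data.frame(table(class = mpg$class, drv = mpg$drv))

# Keep only the combinations that actually occur in the data
combo_counts <- combo_counts[combo_counts$Freq > 0, ]

# Number of distinct points the scatterplot can show, versus rows in mpg
nrow(combo_counts)
nrow(mpg)

# geom_count() sizes each point by the number of observations it represents
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
```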


Section 3.3: Aesthetic Mappings

1. What’s gone wrong with this code? Why are the points not blue?

#map the displ column and hwy column of dataset mpg to x and y axis, and make the color of all the points blue.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

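  • The given code put color = "blue" inside aes(), so ggplot2 treated "blue" as a variable with a single value and mapped it through the default colour scale instead of drawing blue points. To set the colour manually, the argument must go outside aes(). The code above is the corrected version.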
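
The difference can be checked by inspecting the colours ggplot2 actually computes. In this sketch, p_mapped is a reconstruction of the kind of code the exercise describes (colour inside aes()), and p_set is the corrected version; both names are illustrative.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as a one-value variable and is
# mapped through the default colour scale
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(): the colour is set directly for all points
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Colours that each plot will draw
unique(ggplot_build(p_mapped)$data[[1]]$colour)
unique(ggplot_build(p_set)$data[[1]]$colour)
```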

2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

#view mpg table.
mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows
  • The categorical variables are manufacturer, model, trans, drv, fl, class. The continuous variables are displ, year, cyl, cty, hwy. When you run mpg, you can see the data type below each column header: columns marked <chr> are categorical, and columns marked <dbl> or <int> are continuous.
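
The same split can be derived from the column types programmatically. This is a minimal sketch that assumes, as the answer above does, that character columns are categorical and numeric columns are continuous; the variable names are illustrative.

```r
library(ggplot2)

# First class of every column: "character", "integer" or "numeric"
col_types <- vapply(mpg, function(col) class(col)[1], character(1))

# Character columns are treated as categorical
categorical <- names(col_types)[col_types == "character"]

# Integer and double columns are treated as continuous
continuous <- names(col_types)[col_types %in% c("integer", "numeric")]

categorical
continuous
```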

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

#mapping cty to color
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = cty))

  • When you map a continuous variable to color, ggplot2 uses a colour gradient (different shades of one colour) instead of a set of distinct colours.
#mapping cty to size
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = cty))

  • When you map a continuous variable to size, the size of each dot scales with the value. I think this makes more sense than mapping a categorical variable such as class to size, because from the graph you can easily tell that a larger dot means a higher value.
#mapping cty to shape
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
## Error in `scale_f()`:
## ! A continuous variable can not be mapped to shape

  • When you map a continuous variable to shape, R returns an error. Shapes have no natural order and there is only a limited set of them, so it does not make sense to map a continuous variable to shape.

4. What happens if you map the same variable to multiple aesthetics?

#mapping displ to both x axis and size, mapping hwy to both y axis and color
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = hwy, size = displ))

  • The plot is drawn without errors, but mapping one variable to two aesthetics is redundant and does not help viewers perceive the information any quicker.

5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

#making the stroke width 3, color purple, the fill of the shape white, size 5
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), shape=21, color='purple', fill='white', size=5, stroke=3)

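  • stroke sets the width of a point's border. It applies to shapes that have a border, such as the filled shapes 21-25: in the code above, color sets the border colour, fill sets the interior colour, and stroke = 3 makes the border thicker.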

6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

ggplot(mpg, aes(x = displ, y = hwy, colour = displ < 5)) +
  geom_point()

  • ggplot2 evaluates displ < 5 for every row, which creates a temporary logical variable whose value is TRUE or FALSE, and colours the points by that variable.

Sections 3.4 & 3.5: Common Problems, Facets

3. What plots does the following code make? What does . do?

#facet by values of drv on the y-axis.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

#facet by values of cyl on the x-axis.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

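  • The symbol . is a placeholder that means "no faceting variable in this dimension": drv ~ . facets by drv in rows only, and . ~ cyl facets by cyl in columns only.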
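
The effect of . can be confirmed by reading the panel layout of each plot. This sketch relies on the layout table returned by ggplot_build(), which is an implementation detail of ggplot2 rather than a documented interface; base, rows_layout and cols_layout are illustrative names.

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Facet by drv in rows only, and by cyl in columns only
p_rows <- base + facet_grid(drv ~ .)
p_cols <- base + facet_grid(. ~ cyl)

# One row per panel, with its ROW and COL position in the grid
rows_layout <- ggplot_build(p_rows)$layout$layout
cols_layout <- ggplot_build(p_cols)$layout$layout

# Grid dimensions of each plot
c(rows = max(rows_layout$ROW), cols = max(rows_layout$COL))
c(rows = max(cols_layout$ROW), cols = max(cols_layout$COL))
```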

4. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

  • The advantage of faceting over the colour aesthetic is that, when there are many classes, each class gets its own panel and is easy to distinguish, whereas with colours it can be hard to tell which is which because there are too many colours. The disadvantage is that values of different classes sit on different panels, so they are harder to compare directly.

5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

?facet_wrap
  • nrow sets the number of rows in the facet layout, and ncol sets the number of columns. Other options control the layout too, such as as.table, which lays out the facets like a table with the highest values at the bottom-right. facet_grid() does not have nrow and ncol because the numbers of rows and columns are determined by the number of unique values of the variables specified in the function.
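
A small sketch of how nrow and ncol reshape the same seven class panels. The helper panel_grid() is hypothetical, written only for this illustration, and it reads the layout table from ggplot_build(), which is an implementation detail of ggplot2.

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Hypothetical helper: grid dimensions of a faceted plot
panel_grid <- function(plot) {
  layout <- ggplot_build(plot)$layout$layout
  c(rows = max(layout$ROW), cols = max(layout$COL))
}

# class has seven values, so two rows need four columns ...
panel_grid(base + facet_wrap(~ class, nrow = 2))

# ... and two columns need four rows
panel_grid(base + facet_wrap(~ class, ncol = 2))
```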

Section 3.6: Geometric Objects

2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

#map displ to x-axis, hwy to y-axis, and drv to color. geom_point will create a scatter plot for this mapping, and geom_smooth creates three smooth lines, one for each drv, without standard errors.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

5. Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

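  • These two graphs look exactly the same, because both layers use the same data and mappings in each case. The first plot declares them once in ggplot(), where every layer inherits them; the second repeats them in each geom.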
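
One way to check this claim without comparing the pictures by eye is to compare the data each layer computes. This is a sketch under the assumption that identical layer data implies identical drawings; p_global and p_local are illustrative names.

```r
library(ggplot2)

# Data and mappings declared once, inherited by both layers
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# Data and mappings repeated in each layer
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data of the point layer (1) and smooth layer (2)
same_points <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
same_smooth <- isTRUE(all.equal(
  suppressMessages(layer_data(p_global, 2)),
  suppressMessages(layer_data(p_local, 2))
))
c(points = same_points, smooth = same_smooth)
```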

6. Recreate the R code necessary to generate the following graphs.

#map displ to x-axis, hwy to y-axis, draw a scatter plot and a smooth line
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#map displ to x-axis, hwy to y-axis, draw a scatter plot and three smooth lines, each representing one drv.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(group = drv), se = FALSE) +
  geom_point()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#map displ to x-axis, hwy to y-axis, drv to colour, draw a scatter plot and three smooth lines, each representing one drv.
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#map displ to x-axis, hwy to y-axis, draw a scatter plot with colour representing drv, and one smooth line fitted to all points regardless of drv.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#map displ to x-axis, hwy to y-axis, draw a scatter plot with colour representing drv, and draw three smooth lines with a different linetype for each drv.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#map displ to x-axis, hwy to y-axis, and draw two layers of points: first larger white points (size 4) that ignore drv, then points coloured by drv on top, so each coloured point gets a white outline.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 4, color = "white") +
  geom_point(aes(colour = drv))


Section 3.7: Statistical Transformations

1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

#the default geom is geom_pointrange; stat = "summary" is added because the default stat of geom_pointrange is identity.
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary"
  )
## No summary function supplied, defaulting to `mean_se()`

#With no summary function supplied, stat_summary defaults to mean_se(), which uses the mean and standard error for the middle point and endpoints of the line. To recreate the previous plot, where min and max were the endpoints, fun.min, fun.max and fun need to be defined.
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
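
To check that the geom version matches a stat_summary() version with the same summary functions, the computed summaries of both can be compared. This is a sketch: p_stat is a reconstruction of the kind of stat_summary() call the exercise refers to, not a copy of it, and the variable names are illustrative.

```r
library(ggplot2)

# stat function with its default geom (pointrange)
p_stat <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

# geom function with the stat overridden to "summary"
p_geom <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )

# Compare the summary values each plot will draw
summary_cols <- c("x", "y", "ymin", "ymax")
same_summary <- isTRUE(all.equal(
  layer_data(p_stat)[summary_cols],
  layer_data(p_geom)[summary_cols]
))
same_summary
```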

2. What does geom_col() do? How is it different to geom_bar()?

  • The difference is in the default stat. The default stat of geom_col() is stat_identity(): it expects both x and y values and draws the bar heights as given. The default stat of geom_bar() is stat_count(): it takes only x values and computes y as the number of observations for each x.
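
As an illustration of the two defaults, the sketch below draws the same bar heights both ways: geom_bar() counts the rows itself, while geom_col() is given counts computed beforehand. The names cut_counts, p_bar and p_col are illustrative.

```r
library(ggplot2)

# Pre-compute the number of diamonds per cut
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")

# geom_bar(): only x is supplied, stat_count() computes the heights
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# geom_col(): x and y are both supplied, heights are used as given
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))

# Bar heights computed by each plot
layer_data(p_bar)$count
layer_data(p_col)$y
```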

5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))

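  • The problem is that, without group = 1, the proportions are calculated within each group. In the first plot every value of cut is its own group, so each proportion is 1 and all the bars have the same height. In the second plot every combination of cut and color is its own group, so each fill segment also has a proportion of 1.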
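
The effect of group = 1 on the first plot can be seen in the computed proportions. This is a minimal sketch with illustrative names; it covers only the first graph, not the version with fill = color.

```r
library(ggplot2)

# Without group = 1: each cut is its own group
p_ungrouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# With group = 1: all cuts belong to one group
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# Proportions computed per bar
prop_ungrouped <- layer_data(p_ungrouped)$prop
prop_grouped <- layer_data(p_grouped)$prop

prop_ungrouped       # every bar is 1
sum(prop_grouped)    # the bars share a total of 1
```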

Section 3.8: Position Adjustment

1. What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = "jitter")

  • The original plot suffers from overplotting, because there are multiple observations for each combination of cty and hwy values. To improve it, we can use a jitter position adjustment, as in the code above, to reduce the overplotting.

2. What parameters to geom_jitter() control the amount of jittering?

  • width controls the amount of horizontal displacement, height controls the amount of vertical displacement.
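
A small sketch of the two parameters in use. The values width = 0.2 and height = 0 are arbitrary choices for illustration: they spread the points horizontally by at most 0.2 and leave the vertical positions untouched.

```r
library(ggplot2)

set.seed(1)  # make the random jitter reproducible

# Jitter horizontally only
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.2, height = 0)

jittered <- layer_data(p_jitter)

# Horizontal displacement stays within the chosen width
range(jittered$x - mpg$cty)

# Vertical positions are unchanged because height = 0
all(jittered$y == mpg$hwy)
```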

3. Compare and contrast geom_jitter() with geom_count().

#The geom geom_jitter() adds random variation to the locations of the points in the graph. This method reduces overplotting since two points with the same location are unlikely to have the same random variation.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()

#The geom geom_count() sizes the points relative to the number of observations. Combinations of (x, y) values with more observations will be larger than those with fewer observations.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

#When adding a third variable as color aesthetic, geom_count will be less readable than geom_jitter
ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
  geom_jitter()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
  geom_count()


Section 3.9: Coordinate Systems

1. Turn a stacked bar chart into a pie chart using coord_polar().

#mapping the y to the angle of each pie section.
ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

3. What’s the difference between coord_quickmap() and coord_map()?

  • coord_map() uses map projections to project the three-dimensional Earth onto a two-dimensional plane, whereas coord_quickmap() uses an approximate but faster projection that ignores the curvature of the Earth and adjusts the map for the latitude/longitude ratio.

4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

#scatter plot of cty vs hwy with a reference line (slope 1, intercept 0) and a fixed 1:1 aspect ratio.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

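  • City and highway MPG are strongly and roughly linearly related, and the points lie above the reference line, so highway MPG is higher than city MPG. geom_abline() adds a line with a specified slope and intercept to the plot; with no arguments it draws the line with slope 1 and intercept 0, where hwy would equal cty. coord_fixed() gives both axes the same scale, which ensures that this line is at a 45-degree angle. A 45-degree line makes it easy to compare highway and city mileage.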
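
The reading of the plot can be backed up with a quick count of how many cars sit above the reference line. In this sketch the defaults of geom_abline() and coord_fixed() are written out explicitly; share_above and p_ref are illustrative names.

```r
library(ggplot2)

# Share of cars whose highway MPG exceeds their city MPG,
# i.e. the share of points above the line hwy = cty
share_above <- mean(mpg$hwy > mpg$cty)
share_above

# Same plot with the default arguments spelled out
p_ref <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +  # the line hwy = cty
  coord_fixed(ratio = 1)                   # one unit on x equals one unit on y
```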