---
title: "Week 2: Code along 1"
author: "Alex St. Pierre"
date: "2025-09-01"
output:
  html_document:
    toc: yes
  pdf_document: default
  word_document: default
editor_options:
  chunk_output_type: console
---

Welcome

Ch1 Introduction

The data science project workflow

Prerequisites

  • R
  • RStudio
  • R packages

Install the tidyverse package

Running R code

1+2
## [1] 3

Getting help

  • Google
  • Stack Overflow

Ch2 Introduction to Data Exploration

Ch3 Data Visualization

Set up

library(tidyverse)

data

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

3.2.4 Exercises

  1. Running ggplot(data = mpg) yields a blank box with no data plotted within it, because no geom layer has been added yet.
ggplot(data = mpg)

  2. There are 234 rows and 11 columns in the mpg data set.
  3. The drv variable describes the “type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd.”
  4. A scatterplot of hwy vs. cyl.
ggplot(data = mpg) +
    geom_point(mapping = aes(x = hwy, y = cyl))

  5. A scatterplot of class vs. drv. It is not useful because both variables are categorical, so many cars land on the same point and the plot cannot show how many cars fall into each combination.
ggplot(data = mpg) +
    geom_point(mapping = aes(x = class, y = drv))
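One possible way to recover the information the class vs. drv scatterplot hides is to count the cars in each combination. This is only a sketch: the names combo_counts and p_count are made up for illustration, and geom_count() is one of several geoms that could be used.

```r
library(ggplot2)  # mpg ships with ggplot2 (already attached by tidyverse)

# Tabulate how many cars share each class/drv combination
combo_counts <- as.data.frame(table(class = mpg$class, drv = mpg$drv))
combo_counts

# geom_count() sizes each point by the number of observations at that location
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
p_count
```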

aesthetics

  • x
  • y
  • color
  • size
  • alpha
  • shape
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = class))

3.3.1 Exercises

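  1. The points aren’t blue because, inside aes(), ggplot treats “blue” as a data value (a category with a single level) rather than a fixed attribute, and assigns it a color from the default scale. Setting color = "blue" outside of aes() fixes this.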
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

  2. mpg - Categorical Variables: manufacturer, model, trans, drv, fl, class. mpg - Continuous Variables: displ, year, cyl, cty, hwy.

  3. Mapping a continuous variable to color produces a gradient of shades rather than distinct hues, and mapping it to size scales the point size with the value. Mapping a continuous variable to shape throws an error and halts execution, because shapes have no natural order and ggplot2 provides no continuous shape scale. Mapping a categorical variable to these same aesthetics gives each group a distinct color, point size, and symbol, although ggplot2 warns that size is not advised for a discrete variable.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = year))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = year))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = drv))
## Warning: Using size for a discrete variable is not advised.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv))
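A minimal sketch of the shape case, assuming a current ggplot2 version. The names p_bad, shape_result, and p_ok are hypothetical, and wrapping year in factor() is just one way to make the variable discrete.

```r
library(ggplot2)

# Mapping the numeric variable year to shape: the plot object is created,
# but building it fails when ggplot2 looks for a continuous shape scale.
p_bad <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = year))
shape_result <- tryCatch(ggplot_build(p_bad), error = function(e) e)
conditionMessage(shape_result)  # prints the error text instead of halting

# Converting year to a factor makes it discrete, so the shape mapping works.
p_ok <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = factor(year)))
p_ok
```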

4. Mapping the same continuous variable to multiple aesthetics gives each point both a position on the color gradient and a size, so the same information is encoded twice. Mapping the same categorical variable to multiple aesthetics gives each group a specific color, shape, and an arbitrary size.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = year, size = year))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = drv , size = drv, shape = drv))
## Warning: Using size for a discrete variable is not advised.

  5. The stroke aesthetic determines the border width for shapes that have both a fill and a border.

  6. Mapping an aesthetic to an expression rather than a variable name, in this case color = displ < 5, makes ggplot2 evaluate the expression for each row and color the points by the result: TRUE where engine displacement is below 5 liters and FALSE otherwise.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
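To see what ggplot2 receives for the color aesthetic here, the expression can be evaluated on its own. This is a sketch; is_small is a name introduced only for illustration.

```r
library(ggplot2)

# displ < 5 is evaluated row by row and yields a logical vector,
# which ggplot2 then treats as a two-level discrete variable.
is_small <- mpg$displ < 5
table(is_small)
```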

common problems

  • Sometimes you’ll run the code and nothing happens.
  • Putting the + in the wrong place.

How to get help

  • ?function_name
  • Select the function name and press F1
  • Read the error message
  • Google the error message

facets

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~class, nrow = 2)

3.5.1 Exercises

  1. Faceting on a continuous variable creates one panel per unique value, which renders many tiny subplots that are difficult to read.
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~ displ)

  2. The empty cells mean that there are no cars in the dataset with that combination of drv and cyl. facet_grid(drv ~ cyl) renders every possible combination of the two variables, including those where no data exists.
ggplot(data = mpg) +
    geom_point(mapping = aes(x = drv, y = cyl)) +
    facet_grid(drv ~ cyl)
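The empty panels can also be checked without a plot by cross-tabulating the two variables. A sketch, with drv_cyl as a hypothetical name:

```r
library(ggplot2)

# Cross-tabulate drive train against number of cylinders
drv_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_cyl

# Combinations with a count of zero correspond to the empty panels
which(drv_cyl == 0, arr.ind = TRUE)
```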

  3. The first plot splits rows by the drv type (4, f, r), whereas the second plot splits columns by the number of cylinders (4, 5, 6, 8). The “.” tells ggplot2 to leave that dimension un-faceted, that is, “do not facet in this direction.”
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

  4. Faceting allows for clear, side-by-side comparisons without the overlap that occurs when all classes are plotted together, which makes patterns within each class much easier to see. On the other hand, faceting makes direct comparisons between classes harder because the points sit in separate panels. With a larger dataset, faceting can produce a cluttered visualization if the number of panels grows; in that situation, it would be more sensible to use a color aesthetic instead.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

  5. nrow and ncol set the number of rows and columns for facet_wrap(). Other options that control the layout of individual panels include shrink (if TRUE, scales shrink to fit the output of statistics rather than the raw data; if FALSE, they cover the range of the raw data before statistical summary), switch (moves facet labels from the top and right to the bottom and left), and scales (controls whether the axis scales are fixed or free). facet_grid() has no nrow and ncol arguments because its layout is fixed by the data: the number of rows and columns equals the number of unique values of the faceting variables.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(~ class)
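For comparison, a sketch of facet_wrap() with the layout arguments discussed above. The choice of ncol = 4 and scales = "free_y" is arbitrary and only meant to show where the arguments go; p_wrap is a hypothetical name.

```r
library(ggplot2)

# facet_wrap() accepts ncol/nrow; scales = "free_y" lets each panel
# choose its own y-axis range instead of sharing one.
p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 4, scales = "free_y")
p_wrap
```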

  6. Most screens and pages are wider than they are tall, so columns have more room to spread out horizontally. Putting the variable with more unique levels in the rows makes the plot tall with squished panels, so placing it in the columns gives better overall readability.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ class)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(class ~ drv)

geometric objects

Different geoms use different visual objects to represent the same data.

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy))    

3.6.1 Exercises

  1. Line chart: geom_line(); boxplot: geom_boxplot(); histogram: geom_histogram(); area chart: geom_area().

  2. I predicted that the chart would plot vehicles based on their engine displacement and fuel mileage while color-coding each point based on what type of drivetrain it had.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

3. show.legend = FALSE removes the legend from the side of the chart. If you remove the argument, ggplot2 shows the legend by default. I assume show.legend = FALSE was used to suppress the legend that would otherwise have been generated automatically by mapping color = drv.

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  )

  4. The se argument controls whether the shaded confidence ribbon is drawn around the smoothed line: se = TRUE shows it and se = FALSE hides it.

  5. The two plots are identical; each code chunk shows that mappings can be set either globally in ggplot() or locally within each geom.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
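One way to check that the two specifications are equivalent, sketched under the assumption that layer_data() returns the data ggplot2 computes for a given layer. The names p_global, p_local, same_points, and same_smooth are hypothetical.

```r
library(ggplot2)

# Mappings set globally in ggplot()
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# The same data and mappings repeated locally in each geom
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data layer by layer (1 = points, 2 = smooth)
same_points <- all.equal(layer_data(p_global, 1), layer_data(p_local, 1))
same_smooth <- all.equal(layer_data(p_global, 2), layer_data(p_local, 2))
same_points
same_smooth
```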

not every aesthetic works with every geom

two geoms in the same graph!

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(color = class)) +
    geom_smooth()

local vs. global mappings

This makes it possible to display different aesthetics in different layers.

specify different data for each layer

statistical transformation

3.7.1 Exercises

  1. By default, stat_summary() uses geom = “pointrange”. Because every stat has a default geom and every geom has a default stat, the two can be swapped with one another to achieve the same result.
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
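For reference, a sketch of the stat-side version that the geom_pointrange() call above mirrors, assuming a ggplot2 version that supports the fun, fun.min, and fun.max arguments. p_stat is a hypothetical name.

```r
library(ggplot2)

# stat_summary() falls back to its default geom (pointrange),
# so no geom needs to be named explicitly.
p_stat <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
p_stat
```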

  2. geom_col() requires both an x and a y value and draws bars at the heights given in the data, while geom_bar() only needs an x value and computes the bar heights itself by counting the rows in each group.

  3. Most geoms and stats come in pairs: each geom has a default stat and each stat has a default geom, so the two can often be used interchangeably to chart data.

  4. stat_smooth() computes y (the predicted value), ymin and ymax (the lower and upper bounds of the confidence interval), and se (the standard error). Its behavior is controlled by the method, formula, se, level, span, n, and fullrange parameters.

  5. Without group = 1, each level of cut is treated as its own group, so proportions are calculated within each bar and every bar has a height of 1. Adding group = 1 calculates proportions across the whole dataset.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
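A sketch of the first plot with group = 1 added, so that proportions are taken across the whole dataset. The names p_prop and prop_data are hypothetical; layer_data() is used here only to inspect the computed prop values, which should sum to 1.

```r
library(ggplot2)

# group = 1 puts every row in a single group, so prop is each bar's
# share of all diamonds rather than a share of itself.
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
p_prop

# Inspect the values computed by the stat
prop_data <- layer_data(p_prop, 1)
sum(prop_data$prop)
```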

adjustments for bar charts

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

adjustments for scatterplots
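A possible example for this note, assuming the adjustment meant here is jittering, which adds a small amount of random noise to each point so that overlapping points become visible. p_jitter is a hypothetical name.

```r
library(ggplot2)

# position = "jitter" spreads out points that would otherwise overplot
p_jitter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
p_jitter
```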

coordinate systems

switch x and y
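A possible example for this note, assuming coord_flip() is the function meant. The boxplot of hwy by class is chosen only for illustration, and p_flip is a hypothetical name.

```r
library(ggplot2)

# coord_flip() swaps the x and y axes, which gives long class labels
# room along the vertical axis.
p_flip <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
p_flip
```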

set the aspect ratio correctly for maps

Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

3.9.1 Exercises

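  1. A stacked bar chart turned into a pie chart using coord_polar(). A single stacked bar (one x position, filled by cut) becomes a pie when the y axis is mapped to the angle with theta = "y".
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = "", fill = cut), width = 1) +
  coord_polar(theta = "y")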

  2. labs() enables you to rename or add titles, subtitles, captions, axis labels, etc. in ggplot.

  3. coord_map() -> slower, more accurate. Uses map projections to make curved surfaces appear correctly on a flat plot. coord_quickmap() -> faster, less accurate. Adjusts the aspect ratio so that distances appear roughly accurate.

  4. The plot shows that highway mpg is always higher than city mpg. coord_fixed() is important because it fixes the aspect ratio so that one unit on the x axis is the same length as one unit on the y axis; the reference line is then drawn at a true 45 degrees, which avoids visual distortion. geom_abline() draws the reference line where city and highway mpg are equal.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()
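The claim about highway vs. city mileage can be checked directly, and the defaults of geom_abline() can be spelled out. This is a sketch; n_not_higher and p_ref are hypothetical names.

```r
library(ggplot2)

# Count cars whose highway mileage is NOT strictly greater than city mileage;
# a result of 0 would support the claim above.
n_not_higher <- sum(mpg$hwy <= mpg$cty)
n_not_higher

# geom_abline() defaults to intercept = 0 and slope = 1; written out here
p_ref <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  coord_fixed()
p_ref
```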

the layered grammar of graphics

The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of:

  • a dataset,
  • a geom,
  • a set of mappings,
  • a stat,
  • a position adjustment,
  • a coordinate system, and
  • a faceting scheme.
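As an illustration, a single plot that names all seven components explicitly. The specific choices (diamonds, bars by cut, dodge, flipped coordinates, facets by color) are arbitrary, and p_full is a hypothetical name.

```r
library(ggplot2)

p_full <- ggplot(data = diamonds) +              # dataset
  geom_bar(                                      # geom
    mapping = aes(x = cut, fill = clarity),      # set of mappings
    stat = "count",                              # stat
    position = "dodge"                           # position adjustment
  ) +
  coord_flip() +                                 # coordinate system
  facet_wrap(~ color)                            # faceting scheme
p_full
```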