Welcome

Ch1 Introduction

The data science project workflow

Prerequisites

  • R
  • RStudio
  • R packages

Install the tidyverse package
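
A minimal sketch of the installation step, assuming an internet connection. The mirror URL and the requireNamespace() guard are additions for illustration, not part of the original notes; the guard skips the download when the package is already present.

```r
# Install tidyverse only if it is not already available.
# The repos URL is an assumed CRAN mirror; substitute your preferred one.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Attach the core tidyverse packages (ggplot2, dplyr, tibble, ...).
library(tidyverse)
```

Installation is needed once per machine; library(tidyverse) is needed once per R session.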

Running R code

1+2
## [1] 3

Getting help

  • Google
  • Stack Overflow

Ch2 Introduction to Data Exploration

Ch3 Data Visualization

Set up

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2

data

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

aesthetics

  • x
  • y
  • color
  • size
  • alpha
  • shape
ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = class))

common problems

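  • Sometimes you’ll run the code and nothing happens. Check whether the console shows a + prompt: that means R is still waiting for you to finish the expression.
  • Putting the + in the wrong place. It must come at the end of a line, not at the start of the next one.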
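
The + problem is easiest to see side by side. This is a small sketch that loads ggplot2 directly (the notes load it through tidyverse); the object name p is arbitrary, and the incorrect version is left commented out so the block still runs.

```r
library(ggplot2)

# Correct: the + ends the line, so R knows the expression continues.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Incorrect (left commented out): a line that starts with + is parsed as a
# new expression, so the first line is drawn as an empty plot and the
# second line raises an error.
# ggplot(data = mpg)
# + geom_point(mapping = aes(x = displ, y = hwy))

p
```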

How to get help

  • ?function_name (for example, ?geom_point)
  • Select the function name and press F1
  • Read the error message
  • Google the error message

facets

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~class, nrow = 2)

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(drv ~ cyl)

Exercises 3.5.1

  1. What happens if you facet on a continuous variable?
ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~year, nrow = 2)

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~displ, nrow = 2)

ggplot2 treats the continuous variable as if it were categorical and creates a separate facet for every distinct value. With year this is harmless, because it takes only a few values. With displ, which takes many distinct values, it produces far too many facets to read.
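
One possible workaround, not taken from the exercise itself, is to bin the continuous variable before faceting. The sketch below uses cut_width() from ggplot2; the bin width of 1 and the names mpg_binned and displ_bin are illustrative choices.

```r
library(ggplot2)

# Number of distinct displ values = number of facets without binning.
n_raw_facets <- length(unique(mpg$displ))

# cut_width() bins a continuous vector into intervals of the given width.
mpg_binned <- mpg
mpg_binned$displ_bin <- cut_width(mpg$displ, width = 1)
n_binned_facets <- nlevels(droplevels(mpg_binned$displ_bin))

# Facet on the binned variable instead of the raw one.
p <- ggplot(data = mpg_binned) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ displ_bin, nrow = 2)
p
```

Comparing n_raw_facets with n_binned_facets shows how many panels the binning saves.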

  2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

An empty cell in the facet_grid(drv ~ cyl) plot means that no observation in the data set has that combination of drv and cyl. The scatterplot of drv against cyl shows the same thing: it has a point only at combinations that exist, so the positions without a point correspond to the empty facets.
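
The same information can be read from a cross-tabulation. This sketch uses base R's xtabs(); the names combo_counts and missing_combos are arbitrary.

```r
library(ggplot2)  # provides the mpg data set

# Cross-tabulate drv against cyl; a zero marks a combination with no cars.
combo_counts <- xtabs(~ drv + cyl, data = mpg)
combo_counts

# List the missing combinations explicitly.
missing_combos <- subset(as.data.frame(combo_counts), Freq == 0)
missing_combos
```

Each row of missing_combos corresponds to one empty cell in the faceted plot.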

  3. What plots does the following code make? What does . do?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

With facet_grid(drv ~ .), the . is on the right and the facets form rows, one for each value of drv, stacked vertically. With facet_grid(. ~ cyl), the . is on the left and the facets form columns, one for each value of cyl, placed side by side.

In other words, the . means “no faceting in this dimension”: when it is on the right, only row facets are created, and when it is on the left, only column facets are created.
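
For comparison, facet_grid() also accepts rows and cols arguments together with vars(), which avoids the . placeholder. A sketch, with base, p_rows and p_cols as arbitrary names:

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Same layout as facet_grid(drv ~ .): one row of panels per drv value.
p_rows <- base + facet_grid(rows = vars(drv))

# Same layout as facet_grid(. ~ cyl): one column of panels per cyl value.
p_cols <- base + facet_grid(cols = vars(cyl))

p_rows
p_cols
```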

  4. Take the first faceted plot in this section:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

Faceting gives a clearer view of each distinct group than the colour aesthetic does. Each group's pattern is easier to see because there is less clutter and overlap, which makes individual points easier to read.

However, faceting makes it harder to compare values across groups, because the groups sit in different panels rather than sharing one panel and being distinguished by colour. Also, when there are many facets, it can be difficult to fit them in a document or on a page.

With a larger dataset, facets would become more valuable, since there is a higher chance of overlapping points and clutter. Also, the more groups there are, the more colours are needed, which makes a single coloured panel even harder to read. Facets would therefore give a clearer view of the data for each group.

  5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

nrow: Controls the number of rows

ncol: Controls the number of columns

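Other options that control the layout of the individual panels are dir and as.table, which set the order in which panels are placed, and strip.position, which sets where the panel labels go. The scales argument controls whether axes are shared between panels rather than the layout itself.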

facet_grid() doesn’t have nrow and ncol arguments because the number of rows and columns is determined by the number of unique values of the faceting variables, so there is nothing to set manually.
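
A short sketch of nrow, ncol and dir in use; the values chosen and the object names are illustrative only.

```r
library(ggplot2)

base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Panels wrapped into 2 rows; ggplot2 works out the number of columns.
p_two_rows <- base + facet_wrap(~ class, nrow = 2)

# The same panels wrapped into 2 columns, filled column by column.
p_two_cols <- base + facet_wrap(~ class, ncol = 2, dir = "v")

p_two_rows
p_two_cols
```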

  6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Putting the variable with more levels in the columns works better on screens: most screens are wider than they are tall, so laying the panels out horizontally makes better use of the space.

geometric objects

different visual objects to represent data

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) + 
    geom_smooth(mapping = aes(x = displ, y = hwy))

not every aesthetic works with every geom

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

two geoms in the same graph!

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()

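local vs. global mappings: mappings placed in ggplot() are global and apply to every layer, while mappings placed in a geom function are local to that layer. This makes it possible to display different aesthetics in different layers.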

specify different data for each layer
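
A sketch of what a per-layer data override can look like. The choice of the subcompact class, the use of base R's subset(), and method = "lm" are all illustrative assumptions rather than anything the notes prescribe.

```r
library(ggplot2)

# Rows for one class only; subset() is base R, so no extra packages needed.
subcompact <- subset(mpg, class == "subcompact")

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +   # this layer uses all of mpg
  geom_smooth(
    data = subcompact,                         # this layer uses the subset only
    method = "lm", formula = y ~ x, se = FALSE
  )
p
```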

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE)

statistical transformation

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
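
For pre-computed counts like these, geom_col() is a shorthand that uses the identity stat by default. A self-contained sketch that rebuilds the same demo table (loading tibble directly for tribble()):

```r
library(ggplot2)
library(tibble)

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

# geom_col() plots the y values as given, without counting rows.
p <- ggplot(data = demo) +
  geom_col(mapping = aes(x = cut, y = freq))
p
```

Because cut is a plain character column in this table, the bars are ordered alphabetically rather than by quality.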

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = cut, y = depth), fun.min = min, fun.max = max, fun = median)

position adjustments

adjustments for bar charts

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Others:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")

ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + 
  geom_bar(fill = NA, position = "identity")

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

adjustments for scatterplots

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
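
The amount of jitter can also be controlled explicitly. This sketch uses position_jitter(); the width, height and seed values are arbitrary illustrative choices, and the seed only serves to make the random offsets reproducible.

```r
library(ggplot2)

# Explicit jitter: at most 0.1 units horizontally and 0.5 vertically.
p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.1, height = 0.5, seed = 42)
  )
p
```

geom_jitter() is a shorthand for geom_point() with a jitter position.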

coordinate systems

switch x and y

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

set the aspect ratio correctly for maps
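
A sketch of the idea using coord_quickmap(). The outline data frame is hypothetical: its coordinates are made up for illustration and are not real map data.

```r
library(ggplot2)

# Hypothetical outline: a 10-by-10 degree box centred near latitude 60 N.
outline <- data.frame(
  long = c(10, 20, 20, 10, 10),
  lat  = c(55, 55, 65, 65, 55)
)

base <- ggplot(data = outline, mapping = aes(x = long, y = lat)) +
  geom_path()

# Default Cartesian coordinates: the plot stretches to fill the device.
p_cartesian <- base

# coord_quickmap() fixes the aspect ratio with a quick approximation so
# that distances look right at this latitude.
p_map <- base + coord_quickmap()

p_cartesian
p_map
```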

Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut)) + 
    coord_polar()

ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) + 
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL)

ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) + 
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL) + coord_flip()

ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) + 
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL) + coord_polar()

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut)) +
    coord_polar()

the layered grammar of graphics

The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of:

  • a dataset,
  • a geom,
  • a set of mappings,
  • a stat,
  • a position adjustment,
  • a coordinate system, and
  • a faceting scheme.
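
To make the list concrete, here is one plot with all seven components spelled out. The particular choices (diamonds, a filled bar chart, flipped coordinates, faceting by color) are only an illustration.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +                 # dataset
  geom_bar(                                    # geom
    mapping = aes(x = cut, fill = clarity),    # set of mappings
    stat = "count",                            # stat
    position = "fill"                          # position adjustment
  ) +
  coord_flip() +                               # coordinate system
  facet_wrap(~ color)                          # faceting scheme
p
```

Most of these have sensible defaults, which is why the earlier plots in these notes needed only the dataset, a geom and the mappings.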