Welcome

Ch1 Introduction

The data science project workflow

Prerequisites

  • R
  • RStudio
  • R packages

Install the tidyverse package

Running R code

1+2
## [1] 3

Getting help

  • Google
  • Stack Overflow

Ch2 Introduction to Data Exploration

Ch3 Data Visualization

Set up

library(tidyverse)

data

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

aesthetics

  • x
  • y
  • color
  • size
  • alpha
  • shape
ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, color = class))

By size:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.


Other:

# Left
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.

# Right
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
##   that many of them.
## Warning: Removed 62 rows containing missing values or values outside the scale range
## (`geom_point()`).

Exercises 3.2.4

  1. Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)

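What I see: an empty grey panel. ggplot(data = mpg) only creates a blank canvas; nothing is drawn until a geom layer is added.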

  2. How many rows are in mpg? How many columns?

Rows: 234 Columns: 11
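
The counts can also be computed directly instead of being read off the printout. A minimal sketch, assuming ggplot2 (which ships the mpg data) is installed; n_rows and n_cols are illustrative names:

```r
library(ggplot2)  # provides the mpg dataset

n_rows <- nrow(mpg)  # number of observations (cars)
n_cols <- ncol(mpg)  # number of variables
dim(mpg)             # both at once: rows first, then columns
```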

  3. What does the drv variable describe? Read the help for ?mpg to find out.

drv is the type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.

  4. Make a scatterplot of hwy vs cyl.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = cyl, y = hwy))

  5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = class, y = drv))

What happens: the plot is drawn, but both class and drv are categorical, so every car with the same combination lands on exactly the same point. The plot only shows which combinations occur, not how many cars share each one, so it reveals no pattern or trend.
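
One way to see what the scatterplot hides is to count the cars behind each point. The sketch below uses base R's table() and ggplot2's geom_count(); the names combo_counts and p_count are only for illustration.

```r
library(ggplot2)

# Number of cars for every class/drv combination. Each non-zero cell is
# drawn as a single point in the scatterplot, however many cars it holds.
combo_counts <- table(class = mpg$class, drv = mpg$drv)
combo_counts

# geom_count() sizes each point by the number of observations it represents.
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
p_count
```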

Exercises 3.3.1

  1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

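What is wrong with the code: color = "blue" is inside aes(), so ggplot2 treats "blue" as a data value to map to a colour scale instead of as the literal colour. To set the colour manually, move it outside aes():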

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

  2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

Categorical: manufacturer, model, trans, drv, fl, class

Continuous: displ, year, cyl, cty, hwy

Running ?mpg describes each variable. When we run mpg itself, the tibble printout shows each column's type under its name: <chr> marks the categorical variables, while <dbl> and <int> mark the numeric ones. To verify this information:

categorical_vars <- names(mpg)[sapply(mpg, function(x) class(x) %in% c("factor", "character"))]
categorical_vars
## [1] "manufacturer" "model"        "trans"        "drv"          "fl"          
## [6] "class"
continuous_vars <- names(mpg)[sapply(mpg, function(x) class(x) %in% c("numeric", "integer"))]
continuous_vars
## [1] "displ" "year"  "cyl"   "cty"   "hwy"
  3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = class, size = class))
## Warning: Using size for a discrete variable is not advised.
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
##   that many of them.
## Warning: Removed 62 rows containing missing values or values outside the scale range
## (`geom_point()`).

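The code above maps the categorical variable class: color and shape each get one distinct value per category, while size triggers a warning because size implies an order that the categories do not have. With a continuous variable, color becomes a gradient and size scales smoothly with the value, but shape fails with an error because a continuous variable cannot be mapped to shape.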
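
The continuous case can be tried with cty. This is a sketch: the choice of cty and the names p_color, p_size, and shape_result are assumptions for illustration. Building the shape plot inside tryCatch() keeps the script running when ggplot2 rejects the mapping.

```r
library(ggplot2)

# cty is an arbitrary choice of continuous variable (assumption for illustration).

# Continuous variable on colour: drawn as a gradient with a colour bar.
p_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cty))

# Continuous variable on size: point size scales with the value.
p_size <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty))

# Continuous variable on shape: ggplot2 raises an error when the plot is built.
shape_result <- tryCatch(
  {
    ggplot_build(
      ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
    )
    "built without error"
  },
  error = function(e) conditionMessage(e)
)
shape_result
```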

  4. What happens if you map the same variable to multiple aesthetics?

It works, but the aesthetics are redundant. In the plot below, the colour and size of each point both vary with displ, which the x-axis already shows, so nothing new is learned.

ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, color = displ, size = displ))

  5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

The stroke aesthetic modifies the width of a point's border. It works with shapes that have a border, such as shape 21, for which the outline (colour) and the inside (fill) can be set separately. For example:

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 5)

  6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, colour = displ < 5))

ggplot2 evaluates displ < 5 for every row, which gives a logical value, and maps TRUE and FALSE to two different colours. Points with displ below 5 get one colour and all other points get the other.
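
The expression inside aes() is evaluated against the data, so the colour scale receives a logical vector. A short sketch of the same expression evaluated outside the plot (under_five is an illustrative name):

```r
library(ggplot2)

# The same expression evaluated outside aes(): one TRUE/FALSE value per car.
under_five <- mpg$displ < 5
table(under_five)  # how many points fall in each colour group
```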

common problems

  • Sometimes you’ll run the code and nothing happens.
  • Putting the + in the wrong place.

How to get help

  • ?function_name
  • Select the function name and press F1
  • Read the error message
  • Google the error message

facets

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~class, nrow = 2)

Facet with 2 variables:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

Exercises 3.5.1

  1. What happens if you facet on a continuous variable?
ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~year, nrow = 2)

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~displ, nrow = 2)

Each distinct value of the continuous variable is treated as a category and gets its own facet. This is fine for year, which has only two values, but displ has many distinct values, so the second plot has too many facets to be readable.
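
If faceting on a continuous variable is really needed, one option is to bin it first. A sketch using ggplot2's cut_width(); the bin width of 2 litres is an arbitrary choice for illustration, as are the variable names.

```r
library(ggplot2)

# Number of distinct values, i.e. the number of facets each variable produces.
n_year  <- length(unique(mpg$year))
n_displ <- length(unique(mpg$displ))

# Bin displ into intervals of width 2 (arbitrary) so only a few facets remain.
p_binned <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cut_width(displ, 2), nrow = 2)
p_binned
```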

  2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

An empty cell in the plot with facet_grid(drv ~ cyl) means that no car in the data set has that combination of drv and cyl. This relates to the plot above because the scatterplot of drv against cyl shows a point for every combination that exists; the places with no point correspond to the empty facets.
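
The empty cells can be confirmed by cross-tabulating the two variables. A minimal sketch with base R's table(); drv_cyl and missing_combos are illustrative names.

```r
library(ggplot2)

# Rows are drv, columns are cyl. A zero marks a combination with no cars,
# which corresponds to an empty panel in facet_grid(drv ~ cyl).
drv_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_cyl

# List the combinations that never occur.
missing_combos <- subset(as.data.frame(drv_cyl), Freq == 0)
missing_combos
```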

  3. What plots does the following code make? What does . do?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

With facet_grid(drv ~ .), the . is on the right, so the plot is faceted into rows only: one panel per value of drv, stacked vertically. With facet_grid(. ~ cyl), the . is on the left, so the plot is faceted into columns only: one panel per value of cyl, placed side by side.

Basically, the . means “no facet in this dimension”: on the right of ~ it leaves only row facets, and on the left it leaves only column facets.

  4. Take the first faceted plot in this section:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

Faceting instead of the colour aesthetic gives a clearer view of each distinct group. With less clutter and overlap, each point is easier to read and the pattern within each class is easier to see.

However, faceting makes it harder to compare values across groups, since they sit in different panels rather than on one set of axes. Also, with many facets, it can be difficult to fit them all in a document or on a page.

With a larger dataset, facets would become more valuable, since there is a higher chance of overlapping points and clutter. Also, the more groups there are, the more colours are needed, which makes a single coloured panel even harder to read. Facets would then allow a clearer view of the data for each group.

  5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

nrow: sets the number of rows of panels

ncol: sets the number of columns of panels

Other options that control the layout of the individual panels are dir and as.table, which set the order in which panels are laid out, while scales and strip.position adjust the panels' axes and labels.

facet_grid() doesn’t have nrow and ncol arguments because the number of rows and columns is fixed by the number of unique values of the two faceting variables, so there is nothing to set manually.

  6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Putting the variable with more levels in the columns makes better use of space: most screens are wider than they are tall, so more panels fit side by side than stacked vertically.

geometric objects

A geom is the visual object a plot uses to represent data.

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) + 
    geom_smooth(mapping = aes(x = displ, y = hwy))

not every aesthetic works with every geom

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

two geoms in the same graph!

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()

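Local vs. global mappings: mappings set in ggplot() apply to every layer, while mappings set inside a geom apply to that layer only. This makes it possible to display different aesthetics in different layers.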

specify different data for each layer
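
A layer can override the plot's data as well as its mappings. The sketch below passes a data argument to one geom; restricting the smooth line to the subcompact class is just an illustrative choice, and the names subcompact and p_layers are assumptions.

```r
library(ggplot2)

# Subset used only by the smoothing layer (illustrative choice of class).
subcompact <- mpg[mpg$class == "subcompact", ]

p_layers <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +   # local mapping: colour applies to points only
  geom_smooth(data = subcompact, se = FALSE)   # local data: line fitted to subcompacts only
p_layers
```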

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE)

Exercises 3.6.1

  1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

Line chart: geom_line

Boxplot: geom_boxplot

Histogram: geom_histogram

Area chart: geom_area

  2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

Prediction: A graph with points and curves

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

Predictions confirmed.

  3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

show.legend = FALSE hides the legend for the layer it is set on. If you remove it, the legend is drawn whenever that layer maps an aesthetic such as color. I think it was used earlier in the chapter to keep the focus on the data: without the legend the plot is less cluttered, which makes it easier to follow in that context.

  4. What does the se argument to geom_smooth() do?

The se argument controls whether the confidence band is drawn around the smoothed line. With se = TRUE (the default) the band is displayed; with se = FALSE only the line is drawn, which makes the graph cleaner.

  5. Will these two graphs look different? Why/why not?

No, they will look the same. In the first version the data and mapping are set once in ggplot() and inherited by both geoms; in the second version the same data and mapping are repeated inside each geom. Both layers end up with identical inputs.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

  6. Recreate the R code necessary to generate the following graphs.
# Black points with a single smooth line
ggplot(mpg, aes(displ, hwy)) + 
    geom_point(size = 5) + 
    geom_smooth(se = FALSE, linewidth = 2)

# One smooth line per drv, without colour
ggplot(mpg, aes(displ, hwy)) + 
    geom_point(size = 5) +
    geom_smooth(aes(group = drv), se = FALSE, linewidth = 2)

# Points and lines both coloured by drv
ggplot(mpg, aes(displ, hwy, color = drv)) +
    geom_point(size = 5) +
    geom_smooth(se = FALSE, linewidth = 2)

# Only the points coloured by drv; size is a constant, so it goes outside aes()
ggplot(mpg, aes(displ, hwy)) +
    geom_point(aes(color = drv), size = 5) +
    geom_smooth(se = FALSE, linewidth = 2)

# Points coloured by drv, one line type per drv
ggplot(mpg, aes(displ, hwy)) +
    geom_point(aes(color = drv), size = 5) +
    geom_smooth(aes(linetype = drv), se = FALSE, linewidth = 2)

# Points filled by drv with a white border
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(fill = drv), shape = 21, colour = "white", size = 5, stroke = 2)

statistical transformation

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

demo = tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
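
The frequencies typed into demo are the counts that geom_bar() computes on its own from the raw data. A quick check, assuming the diamonds data from ggplot2; cut_counts and p_bar are illustrative names.

```r
library(ggplot2)

# Count diamonds per cut, the same summary geom_bar() draws by default.
cut_counts <- table(diamonds$cut)
cut_counts

# Equivalent plot built from the raw data: geom_bar() does the counting itself.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```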

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = cut, y = depth), fun.min = min, fun.max = max, fun = median)

position adjustments

adjustments for bar charts

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Others:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")

ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + 
  geom_bar(fill = NA, position = "identity")

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

adjustments for scatterplots

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

coordinate systems

switch x and y

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

set the aspect ratio correctly for maps

Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) + 
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL)

ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) + 
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL) + coord_flip()

ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill = cut), show.legend = FALSE, width = 1) + 
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL) + coord_polar()

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut)) +
    coord_polar()

Exercises 3.8.1

  1. What is the problem with this plot? How could you improve it?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

The problem with this plot is overplotting: cty and hwy are rounded to integers, so many cars share exactly the same coordinates and the plot hides how many observations sit at each point. The viewer also cannot learn anything about a vehicle's class or other characteristics.

How I would improve it: I would jitter the points to reveal the overlap, add a line to show the trend more clearly, and add colours to show the classes.

ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_jitter(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE)

  2. What parameters to geom_jitter() control the amount of jittering?

The width and height arguments control the amount of horizontal and vertical jitter.
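
A sketch of setting both parameters explicitly. The values 0.3 and 0 are arbitrary, and set.seed() is only there to make the random offsets reproducible.

```r
library(ggplot2)

set.seed(1)  # jitter is random; fix the seed for a repeatable plot

# width and height give the maximum offset in each direction, in data units.
# height = 0 leaves hwy untouched and spreads the points along x only.
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0)
p_jitter
```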

  3. Compare and contrast geom_jitter() with geom_count().

geom_jitter() reveals overlapping points by adding a small amount of random noise to each point's position, spreading them horizontally and vertically. Every observation becomes visible, but the positions are no longer exact.

geom_count() leaves the points where they are and instead scales the size of each point to the number of observations at that location.

  4. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

The default position is “dodge2”, a variant of “dodge” designed for box plots. The two plots below are identical.

ggplot(mpg, aes(x = class, y = hwy, fill = drv)) +
    geom_boxplot()

ggplot(mpg, aes(x = class, y = hwy, fill = drv)) +
    geom_boxplot(position = "dodge2")
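
The default can be read off the layer object that geom_boxplot() returns. This relies on the layer's position field, which is an implementation detail of ggplot2 rather than a documented interface, so treat it as a quick check only.

```r
library(ggplot2)

# A layer created with default arguments carries its position adjustment.
# Accessing $position is an assumption about ggplot2 internals.
default_position <- class(geom_boxplot()$position)[1]
default_position
```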

Exercises 3.9.1

  1. Turn a stacked bar chart into a pie chart using coord_polar().
# Stacked bar chart: a single bar split by cut
bar <- ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = "", fill = cut), width = 1) + 
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL)
bar

# Mapping the bar's height to the angle turns it into a pie chart
bar + coord_polar(theta = "y")

  2. What does labs() do? Read the documentation.

labs() sets the labels of a plot: the title, subtitle, and caption, the axis titles, and the legend titles.

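  3. What’s the difference between coord_quickmap() and coord_map()?

coord_quickmap(): creates a quick visualization of a map by setting an approximate aspect ratio. It keeps straight lines straight, is faster to draw, and works best for small areas close to the equator.

coord_map(): gives a more accurate map by projecting the roughly spherical earth onto a flat plane with a map projection (Mercator by default), which requires more computation.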

  4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

It tells me that cars generally get better mileage on highways than in the city, as the points lie above the line. coord_fixed() is important because it forces one unit on the x-axis to be the same length as one unit on the y-axis, so the reference line is drawn at 45 degrees and the gap between city and highway mileage is easy to judge. geom_abline() draws the reference line: with its default slope of 1 and intercept of 0, it marks where city and highway mileage would be equal.
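
The visual impression can be checked numerically. A small sketch; share_above and p_ref are illustrative names, and the slope and intercept are written out only to make geom_abline()'s defaults explicit.

```r
library(ggplot2)

# Share of cars whose highway mileage exceeds their city mileage,
# i.e. the share of points above the reference line y = x.
share_above <- mean(mpg$hwy > mpg$cty)
share_above

# Same plot as in the exercise, with geom_abline()'s defaults spelled out.
p_ref <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  coord_fixed()
```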

the layered grammar of graphics

The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of:

  • a dataset,
  • a geom,
  • a set of mappings,
  • a stat,
  • a position adjustment,
  • a coordinate system, and
  • a faceting scheme.
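
As a sketch, a single plot can set all seven components explicitly. The particular choices here (diamonds, a bar geom, the dodge position, and so on) are for illustration only.

```r
library(ggplot2)

p_grammar <- ggplot(data = diamonds) +         # dataset
  geom_bar(                                    # geom
    mapping = aes(x = cut, fill = clarity),    # mappings
    stat = "count",                            # stat
    position = "dodge"                         # position adjustment
  ) +
  coord_flip() +                               # coordinate system
  facet_wrap(~ color, nrow = 2)                # faceting scheme
p_grammar
```

Leaving out stat, position, the coordinate function, or the facet function falls back to ggplot2's defaults, which is why most of the plots in these notes specify only the data, the geom, and the mappings.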