R for Data Science

Whole game/Quickstart

1. Data Visualization

a. Tutorial

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
library(ggthemes)
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", y = "Body mass (g)",
    color = "Species", shape = "Species"
  ) +
  scale_color_colorblind()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

b. Exercises

  1. How many rows are in penguins? How many columns?
dim(penguins)
## [1] 344   8
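
The two numbers from dim() can also be pulled out one at a time. A small sketch, assuming dplyr and palmerpenguins are installed; n_rows and n_cols are just illustrative names:

```r
library(dplyr)
library(palmerpenguins)

n_rows <- nrow(penguins)  # number of penguins (rows)
n_cols <- ncol(penguins)  # number of variables (columns)

# glimpse() prints every column with its type and first few values
glimpse(penguins)
```
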
  2. What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.
?penguins

> According to the help page, bill_depth_mm is the bill depth in millimeters.

  3. Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

At a glance, there seems to be no relationship between these two variables.

But who knows? Maybe there is a third variable that affects both of them.

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

> Yup, as I thought! Once the points are colored by species, longer bills go with deeper bills within each species.
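
One way to back this up with numbers, as a rough sketch: compare the correlation over all penguins with the correlation inside each species. The names overall and by_species are only illustrative, and rows with missing measurements are skipped. If the plot is telling the truth, the overall value should come out weakly negative while each per-species value is positive.

```r
library(dplyr)
library(palmerpenguins)

# Correlation over all penguins, ignoring species
overall <- cor(
  penguins$bill_length_mm, penguins$bill_depth_mm,
  use = "complete.obs"
)

# Correlation computed separately within each species
by_species <- penguins |>
  group_by(species) |>
  summarize(
    r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs")
  )

overall
by_species
```
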

  4. What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?

> Since ‘species’ is a categorical variable, a scatterplot just lines the points up in three strips, which isn’t very informative. In that case, we can use a bar chart.

ggplot(penguins, aes(x = species, y = bill_depth_mm)) +
  geom_bar()

Wait, what? It errors out. I thought you could map both aesthetics. Let me ask AI, haha.

Ohhh, geom_bar() counts the rows itself, so it only takes one of x or y. There is a dedicated geom for supplying both, geom_col(). Let me try it.

ggplot(penguins, aes(x = species, y = bill_depth_mm)) +
  geom_col()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_col()`).

Okay, now that’s what I expected. However, Nemotron-4 said it’s better to visualize this with a boxplot or a violin plot. I wonder why? Well, let’s try it out.
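
Before that, a rough sketch of one reason the bar chart is hard to read. By default geom_col() stacks one segment per row, so each bar’s height is the sum of bill_depth_mm for that species, which mostly reflects how many penguins were measured. The summary table below (depth_by_species is an illustrative name) makes that visible, and plotting the mean instead gives a bar that is easier to interpret:

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Per-species count, sum, and mean of bill depth (NAs dropped)
depth_by_species <- penguins |>
  group_by(species) |>
  summarize(
    n = sum(!is.na(bill_depth_mm)),
    total_depth = sum(bill_depth_mm, na.rm = TRUE),
    mean_depth = mean(bill_depth_mm, na.rm = TRUE)
  )

depth_by_species

# One bar per species, with height equal to the mean bill depth
p_mean <- ggplot(depth_by_species, aes(x = species, y = mean_depth)) +
  geom_col()
p_mean
```

Even the mean is only a single number per species, which is where the violin plot and boxplot come in.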

ggplot(penguins, aes(x = species, y = bill_depth_mm)) +
  geom_violin()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

Ahh, that looks much better. Now we can see exactly how the bill_depth_mm value changes across species. Wait, that’s the wrong interpretation. I mean, we can see the distribution of bill_depth_mm within each of the 3 species, rather than just one bar height per species.

Alright, let’s try the boxplot

ggplot(penguins, aes(x = species, y = bill_depth_mm)) +
  geom_boxplot()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Pretty cool!
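
The boxes can be cross-checked against a small summary table. This is only a sketch: depth_summary is an illustrative name, and quantile() may not match the boxplot’s hinges to the last decimal, so treat the numbers as approximately what the boxes show.

```r
library(dplyr)
library(palmerpenguins)

# Quartiles of bill depth per species (NAs dropped)
depth_summary <- penguins |>
  group_by(species) |>
  summarize(
    q1 = quantile(bill_depth_mm, 0.25, na.rm = TRUE),
    med = median(bill_depth_mm, na.rm = TRUE),
    q3 = quantile(bill_depth_mm, 0.75, na.rm = TRUE)
  )

depth_summary
```
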

  5. Why does the following give an error and how would you fix it?

    ggplot(data = penguins) + 
      geom_point()

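Yeah, it’s missing the aesthetic mapping. geom_point() needs x and y, and no aes() is supplied anywhere. The fix is to add a mapping, either in ggplot() or in geom_point().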
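
A minimal sketch of the fix. The exercise doesn’t say which variables to plot, so flipper_length_mm and body_mass_g are an assumption borrowed from the tutorial plot, and p_fixed is just an illustrative name:

```r
library(ggplot2)
library(palmerpenguins)

# Supplying x and y through aes() gives geom_point() what it requires
p_fixed <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
p_fixed
```
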

  6. What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.
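
For the na.rm question, a minimal sketch. With the default, na.rm = FALSE, rows with missing values are removed with a warning (the "Removed 2 rows" messages seen throughout these notes); with na.rm = TRUE they are removed silently. The choice of variables and the name p_na are assumptions for illustration:

```r
library(ggplot2)
library(palmerpenguins)

# na.rm = TRUE drops rows with missing x or y without warning
p_na <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)
p_na
```
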

  7. Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().
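
For the caption, a sketch using the caption argument of labs(). The base plot here is an assumed stand-in for the previous exercise’s plot, and p_caption is an illustrative name:

```r
library(ggplot2)
library(palmerpenguins)

# Same kind of scatterplot as before, with a caption added via labs()
p_caption <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE) +
  labs(caption = "Data come from the palmerpenguins package.")
p_caption
```
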

  8. Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?

Plot

Okay, the x and y are obvious: flipper_length_mm and body_mass_g. The aesthetic for bill_depth_mm is also obvious: color. Since it applies to all geoms, it should be set at the global level. Oh right, there is also a smooth line that runs across all the points, so that points to global too.

Alright, let’s test this out

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = bill_depth_mm)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

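> Yup, the plot matches! But wait, the warning says the colour aesthetic was dropped for the smooth. A single smooth line can’t use a continuous color, so color really only applies to the points and belongs at the geom level, inside geom_point().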
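
The warning above about the dropped colour aesthetic should go away if color is mapped only where it is used. A minimal sketch, keeping x and y global and moving color into geom_point(); p_depth is an illustrative name:

```r
library(ggplot2)
library(palmerpenguins)

# x and y are shared by both geoms; color is only meaningful for the points
p_depth <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = bill_depth_mm)) +
  geom_smooth()
p_depth
```

The "Removed 2 rows" warnings still appear, since the missing values are still in the data.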