R For Data Science | by: Wickham & Grolemund

Chapter 1 | Data Visualization with ggplot2

1.1 Load in libraries

Add the tidyverse, palmerpenguins, and ggthemes to your setup chunk before doing these exercises.
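
A setup chunk for these notes might look like the following minimal sketch. It assumes all three packages are already installed; the chunk itself is not shown in the original notes.

```r
# Load the packages used throughout this chapter.
# Assumes they were installed beforehand, e.g. with
# install.packages(c("tidyverse", "palmerpenguins", "ggthemes")).
library(tidyverse)       # ggplot2, dplyr, tibble, and friends
library(palmerpenguins)  # the penguins data set
library(ggthemes)        # scale_color_colorblind()
```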

1.2 View data

Tibble

There are different ways to view data. The first is to print it as a tibble, a table showing the rows and columns that your data contain. Just type the name of the data set and run it.

penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

glimpse()

You can also quickly view the data type for each column using glimpse().

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

view()

Finally, instead of using your mouse to click open your data, you can just use view() to quickly open the data in a new window.

view(penguins)

geom_point()

Create a visual display of the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.

plot <- ggplot(penguins, aes(flipper_length_mm, body_mass_g, color=species, shape=species))+
  geom_point()+
  labs(title="Body mass & flipper length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", x="Flipper Length (mm)", y="Body Mass (g)")

plot

Add a trend line

plot+geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

When aesthetic mappings are defined in ggplot(), at the global level, they’re passed down to each of the subsequent geom layers of the plot. However, each geom function in ggplot2 can also take a mapping argument, which allows for aesthetic mappings at the local level that are added to those inherited from the global level. Since we want points to be colored based on species but don’t want the lines to be separated out for them, we should specify color = species for geom_point() only.

Applying viz locally

plot2 <- ggplot(penguins, aes(flipper_length_mm, body_mass_g))+
  geom_point(aes(color=species, shape=species))+
  labs(title="Body mass & flipper length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", x="Flipper Length (mm)", y="Body Mass (g)")+
  geom_smooth(method="lm")

plot2
## `geom_smooth()` using formula = 'y ~ x'

Changing colors

plot2+scale_color_colorblind()
## `geom_smooth()` using formula = 'y ~ x'

1.1 - 1.2 Exercises

1 | How many rows are in penguins? How many columns?
dim(penguins)
## [1] 344   8

There are 344 rows and 8 columns in the penguins data.

2 | What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.
#?penguins

The variable bill_depth_mm is a number denoting bill depth in millimeters.

3 | Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.
ggplot(penguins, aes(bill_length_mm, bill_depth_mm))+
  geom_point(aes(color=species, shape=species))+
  labs(title="Penguin bill length vs. bill depth", x="Bill Length (mm)", y="Bill Depth (mm)")+
  geom_smooth(method="lm", color="gray")+
  scale_color_colorblind()
## `geom_smooth()` using formula = 'y ~ x'
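
Across all penguins the fitted line slopes slightly downward: as bill length (mm) increases, bill depth tends to decrease, although the points are widely scattered. Within each species, however, the points trend upward, so longer bills go with deeper bills.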

4 | What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?

Because species is categorical, a scatterplot piles the points into one vertical strip per species, where they overlap. I think it would be a better choice to display these data as a box plot or maybe a violin plot.

plot3 <- ggplot(penguins, aes(species, bill_depth_mm))+
  geom_violin(aes(color=species))+
  labs(title="Penguin species vs. bill depth", x="Species", y="Bill Depth (mm)")+
  scale_color_colorblind()

plot3

5 | Why does the following give an error and how would you fix it?
ggplot(data = penguins) + 
  geom_point()

The code above is missing the x and y aesthetics that geom_point() requires. I would add a mapping such as aes(x = flipper_length_mm, y = body_mass_g) inside geom_point(), or inside ggplot() if I wanted it to apply globally rather than locally.
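
As a minimal sketch of the fix, the mapping can be supplied globally in ggplot(). The two variables used here are an illustrative choice, since the exercise does not name any, and fixed_plot is just a name picked for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Supply the required x and y aesthetics globally, in ggplot().
# The two variables are an illustrative choice; the exercise names none.
fixed_plot <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

fixed_plot
```

Moving the same mapping = aes(...) into geom_point() would give the same picture here, because there is only one layer.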

6 | What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.
#?geom_point()

na.rm in geom_point() controls how missing values are handled (our data frame has some; recall the warnings from the earlier plots). With the default, na.rm = FALSE, rows with missing values are dropped with a warning; with na.rm = TRUE they are dropped silently. So instead of suppressing the warnings in each code chunk above, I could have just added na.rm = TRUE to geom_point().
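
The exercise also asks for a scatterplot that uses the argument. A minimal sketch, reusing the flipper length and body mass variables from earlier in these notes (quiet_plot is a name chosen for this example):

```r
library(ggplot2)
library(palmerpenguins)

# na.rm = TRUE drops the rows with a missing flipper length or body mass
# silently, so the usual warning about removed rows is not printed.
quiet_plot <- ggplot(penguins, aes(flipper_length_mm, body_mass_g)) +
  geom_point(na.rm = TRUE)

quiet_plot
```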

7 | Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().
plot3+labs(caption = "Data come from the palmerpenguins package.")

8 | Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?
ggplot(penguins, aes(flipper_length_mm, body_mass_g))+
  # color is mapped at the geom level so that only the points use it
  geom_point(aes(color=bill_depth_mm), na.rm=TRUE)+
  geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

9 | Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)

This code will create a scatter plot with flipper length on the x axis and body mass on the y axis. The plot will be colored by the categorical islands and there will be a line smoothed over it. I do not know what the se=F does under the geom_smooth() function.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#?geom_smooth

The se argument of geom_smooth() controls whether the confidence interval is drawn around the smooth (TRUE by default; see level to control its width), so se = FALSE hides the shaded band.

10 | Will these two graphs look different? Why/why not?
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )

These two plots should not look any different. In the first plot the data and aesthetics are set globally in ggplot() and inherited by both geoms; in the second, the same data and the same x and y mappings are repeated locally in each geom. Plot 2 just repeats itself, while plot 1 states the request once.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()+
  labs(title="Plot 1")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )+
  labs(title="Plot 2")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

1.3 Being efficient

You do not need to name the data argument each time by writing data = penguins. Because it is the first argument of ggplot(), you can just write penguins. Soon we will get more into piping, and you will be able to create a plot rather quickly by doing the following:

penguins |>
  ggplot(aes(flipper_length_mm, body_mass_g))+
  geom_point()

1.4 Visualizing distributions

How you decide to visualize your data depends on the data types you are trying to view.

  1. numerical vs. numerical = scatter
  2. numerical vs. categorical = bar, violin, box and whisker
  3. one numerical = histogram or density plot

Note that a density plot is kind of like dropping a piece of cooked spaghetti over the bars of a histogram.
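
To see the spaghetti picture directly, one option is to draw both layers on the same axes. This is only a sketch: the binwidth of 200 and the red line color are arbitrary choices, and the histogram is rescaled with after_stat(density) so that its bars and the density curve share a y-axis.

```r
library(ggplot2)
library(palmerpenguins)

# Histogram scaled to density (instead of counts) so that both layers
# share the same y-axis, with the density curve drawn on top.
hist_density <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 200, na.rm = TRUE) +
  geom_density(color = "red", na.rm = TRUE)

hist_density
```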

1.3 - 1.4 Exercises

1 | Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?
penguins |>
  ggplot(aes(y=species, fill=species))+
  geom_bar()

The only difference is that the bars now run horizontally, with species on the y-axis and the counts on the x-axis.

2 | How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")

fill changes the inside color of the bars and color changes only the outline of each bar. Both work, but fill is the more useful one for changing the color of bars, because it covers the whole bar.

3 | What does the bins argument in geom_histogram() do?

It sets the number of bins. The default is 30, and bins is overridden by binwidth if you supply that argument instead.
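
A small side-by-side sketch of the two arguments, using the penguins body mass as an example variable. The values 10 and 500 are arbitrary, and base, by_bins, and by_width are names made up for this illustration.

```r
library(ggplot2)
library(palmerpenguins)

base <- ggplot(penguins, aes(x = body_mass_g))

# bins fixes the NUMBER of bars; their width is derived from the data range.
by_bins <- base + geom_histogram(bins = 10, na.rm = TRUE)

# binwidth fixes the WIDTH of each bar (here 500 g); the number of bars follows.
by_width <- base + geom_histogram(binwidth = 500, na.rm = TRUE)

by_bins
by_width
```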

4 | Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?
diamonds|>
  ggplot(aes(x=carat))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

diamonds|>
  ggplot(aes(x=carat))+
  geom_histogram(binwidth=20)

diamonds|>
  ggplot(aes(carat))+
  geom_histogram(binwidth = 0.5)

mean(diamonds$carat)
## [1] 0.7979397
diamonds|>
  ggplot(aes(carat))+
  geom_histogram(binwidth = 0.8)

diamonds|>
  ggplot(aes(carat))+
  geom_histogram(binwidth = 0.3)

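Of the values tried, a binwidth well below 1 reveals the most interesting pattern; a binwidth of 20 is wider than the whole range of carat and lumps every diamond into a single bar.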
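
Going further in the same direction, a much narrower binwidth can be tried as well. The value 0.01 is an illustrative assumption, not one taken from the notes above, and whether the resulting picture is the most interesting one is a judgment call.

```r
library(ggplot2)

# A much narrower binwidth than the ones tried above (0.01 carat).
# The value is an illustrative assumption, not one taken from the notes.
fine_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

fine_hist
```

At this resolution it is worth looking at where the tall bars sit relative to round carat values.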

1.5 Visualizing Relationships

Note that in a boxplot, any data point that falls more than 1.5 times the IQR beyond either edge of the box is shown as an individual dot above or below the whisker.

penguins |>
  ggplot(aes(species, body_mass_g))+
  geom_boxplot()
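
The 1.5 times IQR rule noted above can be checked by hand. The sketch below uses a small made-up vector rather than the penguins data, and computes the quartiles with quantile(); all object names are chosen for this example.

```r
# A small made-up vector with one unusually large value.
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

q1 <- unname(quantile(x, 0.25))  # first quartile
q3 <- unname(quantile(x, 0.75))  # third quartile
iqr <- q3 - q1                   # interquartile range

# Points beyond these fences would be drawn as individual dots.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

Here the quartiles are 4 and 8, so the fences sit at -2 and 14 and only the value 30 lies outside them.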

You can also choose to display the above data as a density plot:

plot4 <- penguins |>
  ggplot(aes(body_mass_g, color=species))+
  geom_density()

plot4

And we can add a fill to the density curves and make it semi-transparent using the alpha argument.

penguins|>
  ggplot(aes(body_mass_g, color=species, fill=species))+
  geom_density(alpha=0.5)

4 or more variables

Say you have 2 categorical and 2 numerical variables you are trying to display. Try facet_wrap()!

penguins|>
  ggplot(aes(body_mass_g, flipper_length_mm))+
  geom_point(aes(color=species, shape=species))+
  facet_wrap(~island)

Displaying 2 categories

penguins|>
  ggplot(aes(island, fill=species))+
  geom_bar()

1.5 Exercises

1 | The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

The <chr> columns (manufacturer, model, trans, drv, fl, class) are categorical; the <dbl> and <int> columns (displ, year, cyl, cty, hwy) are numerical. Running mpg on its own prints the same type abbreviations under each column name.

2 | Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?
mpg|>
  ggplot(aes(displ, hwy))+
  geom_point(aes(color=cty))

mpg|>
  ggplot(aes(displ, hwy))+
  geom_point(aes(color=manufacturer, shape=manufacturer))

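shape can only be mapped to a categorical variable; mapping a numerical variable such as cty to shape gives an error. color works with both: a numerical variable gets a continuous gradient, while a categorical variable gets one distinct color per level.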
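
The restriction on shape can be seen without crashing a script by catching the error. This is a sketch only: bad_shape and msg are names chosen here, and the exact wording of the message depends on the installed ggplot2 version.

```r
library(ggplot2)

# Mapping a numerical variable (cty) to shape.
# The plot object can be created, but building it fails.
bad_shape <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(shape = cty))

# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(
  {
    ggplot_build(bad_shape)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```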

3 | In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?
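mpg|>
  ggplot(aes(displ, hwy))+
  geom_point(aes(linewidth=year))

Nothing changes in the plot: geom_point() does not use the linewidth aesthetic, so the mapping is ignored (ggplot2 warns about an unknown aesthetic) and the points look the same as without it.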

4 | What happens if you map the same variable to multiple aesthetics?
mpg|>
  ggplot(aes(hwy, hwy))+
  geom_point(aes(color=hwy))

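Every point falls on a perfectly straight diagonal line, and the color repeats the same information, so the plot is redundant. The data seem too nice! :)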

5 | Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?
penguins|>
  ggplot(aes(bill_length_mm, bill_depth_mm))+
  geom_point(aes(color=species))

penguins|>
  ggplot(aes(bill_length_mm, bill_depth_mm))+
  geom_point(aes(color=species))+
  facet_wrap(~species)

It seems that Adelie penguins have a shorter bill length on average than the other two species of penguin. It also seems that bill depth is lowest in Gentoo penguins. In all penguin species, the depth of the bill increases as the length of the bill increases.

6 | Why does the following yield two separate legends? How would you fix it to combine the two legends?
ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm
  )
) +
  geom_point(aes(color = species, shape = species))+
  labs(color="Species", shape="Species")

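ggplot2 merges legends only when they share the same title and labels. If only one of color or shape is renamed in labs(), the two titles differ and two legends appear. All you have to do is give both the same label within the labs() function.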

7 | Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")

The first plot answers: what proportion of the penguins on each island belongs to each species? The second answers: what proportion of each species lives on each island?

1.6 Saving your plots

You can save your plots using ggsave()!

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

ggsave(filename = "penguin-plot.png")
## Saving 7 x 5 in image

If you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about ggsave() in the documentation.
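
A minimal sketch of a more reproducible call, with the dimensions spelled out. The 6 x 4 inch size, the dpi value, and the temporary output path are arbitrary choices for illustration; p and out_file are names picked for this example.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Writing to a temporary folder keeps the example from cluttering the project;
# the path and the 6 x 4 inch size are arbitrary choices for illustration.
out_file <- file.path(tempdir(), "penguin-plot.png")

ggsave(
  filename = out_file,
  plot = p,       # name the plot explicitly instead of relying on the last one shown
  width = 6,
  height = 4,
  units = "in",
  dpi = 300
)
```

Passing plot = p also makes the call independent of whichever plot happened to be displayed last.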

Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. You will learn more about Quarto in Chapter 28.

1.6 Exercises

1 | Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?
ggplot(mpg, aes(x = class)) +
  geom_bar()

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggsave("mpg-plot.png")
## Saving 7 x 5 in image

Only the second plot is saved. By default, ggsave() saves the most recently displayed plot (its plot argument defaults to last_plot()), which here is the scatterplot.

2 | What do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?
ggsave("mpg-plot.pdf")
## Saving 7 x 5 in image

Changing the file extension from .png to .pdf is enough, because ggsave() picks the graphics device from the extension. The device argument described in ?ggsave lists the supported file types.

1.7 Common problems

If you’re still stuck, try the help. You can get help about any R function by running ?function_name in the console, or highlighting the function name and pressing F1 in RStudio. Don’t worry if the help doesn’t seem that helpful - instead skip down to the examples and look for code that matches what you’re trying to do.