Linyi_week2

Data visualization

Exercise 1.2.5

  1. How many rows are in penguins? How many columns?
library(palmerpenguins)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nrow(penguins)
[1] 344
ncol(penguins)
[1] 8

There are 344 rows and 8 columns.
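
As a cross-check, dim() returns both counts in one call. This is a small sketch rather than part of the original answer, and it assumes the palmerpenguins package is installed.

```r
library(palmerpenguins)

# dim() returns c(number of rows, number of columns)
penguin_dims <- dim(penguins)
penguin_dims
```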

  1. What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.

bill_depth_mm is the depth of the penguin’s bill, in millimeters. The culmen is the upper ridge of a bird’s bill. In the simplified penguins data, culmen length and depth are renamed as the variables bill_length_mm and bill_depth_mm to be more intuitive.

  1. Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.

    penguins |>
      ggplot(mapping = aes(x = bill_length_mm,
                           y = bill_depth_mm)) +
      geom_point()
    Warning: Removed 2 rows containing missing values (`geom_point()`).

    There is no obvious overall relationship: the points are widely scattered and appear to fall into separate clusters.

  2. What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?

    penguins |>
      ggplot(mapping = aes(x = bill_depth_mm,
                           y = species)) +
      geom_point()
    Warning: Removed 2 rows containing missing values (`geom_point()`).

    The points line up in three horizontal rows, one per species, and overlap heavily, so the distribution within each species is hard to see. A box plot or a density plot is a better choice. The density plot below uses a coloured line per species; mapping fill = species instead of colour = species would give filled areas rather than outlines.

    penguins |>
      ggplot(mapping = aes(x = bill_depth_mm,
                           colour = species)) +
      geom_density(linewidth = 2)
    Warning: Removed 2 rows containing non-finite values (`stat_density()`).
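
A box plot is another geom that suits one categorical and one numerical variable. The sketch below is an alternative to the density plot above, not part of the original answer; the variable name species_boxplot is only illustrative.

```r
library(palmerpenguins)
library(ggplot2)

# One box per species summarises the distribution of bill depth;
# na.rm = TRUE drops the rows with missing values without a warning
species_boxplot <- ggplot(penguins, aes(x = bill_depth_mm, y = species)) +
  geom_boxplot(na.rm = TRUE)

species_boxplot
```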

  3. Why does the following give an error and how would you fix it?

    geom_point() requires x and y aesthetics, and the code supplies none, so the error reports that they are missing. The fix is to map variables to x and y:

    penguins |> 
      ggplot(mapping = aes(x = bill_depth_mm,
                           y = body_mass_g)) + 
      geom_point()
    Warning: Removed 2 rows containing missing values (`geom_point()`).

  4. What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.

    Two rows have missing values in x or y, so ggplot2 drops them and prints a warning in the console along with the plot. Setting na.rm = TRUE in geom_point() drops these rows silently, so the warning is no longer printed.

    The default value of na.rm is FALSE.

    penguins |> 
      ggplot(mapping = aes(x = bill_depth_mm,
                           y = body_mass_g)) + 
      geom_point(na.rm = TRUE)
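
To see where the "Removed 2 rows" warning comes from, the rows with a missing value in either plotted variable can be counted directly. This is a minimal sketch; the names is_missing and n_missing are illustrative.

```r
library(palmerpenguins)

# TRUE for rows where either plotted variable is missing
is_missing <- is.na(penguins$bill_depth_mm) | is.na(penguins$body_mass_g)

# Number of rows that geom_point() cannot draw
n_missing <- sum(is_missing)
n_missing
```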

  5. Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().

    penguins |> 
      ggplot(mapping = aes(x = bill_depth_mm,
                           y = body_mass_g)) + 
      geom_point(na.rm = TRUE) + 
      labs(caption = "Data come from the palmerpenguins package.")

  6. Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?

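    bill_depth_mm should be mapped to the colour aesthetic. It is mapped at the geom level, inside geom_point(), so that it applies only to the points and not to the smooth line: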

    penguins |> 
      ggplot(mapping = aes(x = flipper_length_mm,
                           y = body_mass_g)) + 
      geom_point(aes(colour = bill_depth_mm),
                 na.rm = TRUE) +
      geom_smooth(na.rm = TRUE)
    `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  7. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

    ggplot(
      data = penguins,
      mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
    ) +
      geom_point() +
      geom_smooth(se = FALSE)
    `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
    Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
    Warning: Removed 2 rows containing missing values (`geom_point()`).

The plot will show points with flipper length on the x-axis and body mass on the y-axis. The points will be coloured by island and, because colour is mapped globally, there will be one smooth curve per island. With se = FALSE, the curves are drawn without the grey confidence band.
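
The number of smooth curves to expect can be checked before plotting by listing the islands in the data. A small sketch, assuming island is stored as a factor (as it is in palmerpenguins):

```r
library(palmerpenguins)

# Each level of island gets its own colour and its own smooth curve
island_names <- levels(penguins$island)
island_names
length(island_names)
```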

  1. Will these two graphs look different? Why/why not?

    ggplot(
      data = penguins,
      mapping = aes(x = flipper_length_mm, y = body_mass_g)
    ) +
      geom_point() +
      geom_smooth()
    `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
    Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
    Warning: Removed 2 rows containing missing values (`geom_point()`).

    ggplot() +
      geom_point(
        data = penguins,
        mapping = aes(x = flipper_length_mm, y = body_mass_g)
      ) +
      geom_smooth(
        data = penguins,
        mapping = aes(x = flipper_length_mm, y = body_mass_g)
      )
    `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
    Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
    Warning: Removed 2 rows containing missing values (`geom_point()`).

The two graphs are the same. In the first, the data and aes() are set globally in ggplot(), so both geom functions inherit them. In the second, ggplot() is left empty and the same data and mapping are repeated in each geom, which gives the same chart.
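
One way to check this without comparing the plots by eye is to compare the data that ggplot2 computes for each layer. The sketch below uses layer_data() from ggplot2; the helper get_layers() and the plot names are illustrative, not part of the original answer.

```r
library(palmerpenguins)
library(ggplot2)

# Data and mapping set globally
p_global <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth()

# Same data and mapping repeated in each geom
p_local <- ggplot() +
  geom_point(data = penguins,
             mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_smooth(data = penguins,
              mapping = aes(x = flipper_length_mm, y = body_mass_g))

# Extract the computed data for layer 1 (points) and layer 2 (smooth),
# silencing the method message and the missing-value warnings
get_layers <- function(p) {
  suppressMessages(suppressWarnings(
    list(points = layer_data(p, 1), smooth = layer_data(p, 2))
  ))
}

same_layers <- isTRUE(all.equal(get_layers(p_global), get_layers(p_local)))
same_layers
```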

Exercise 1.4.3

  1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?

    The plot appears to be ‘flipped’ so the species are appearing on the left side (y axis).

    ggplot(data = penguins,
           aes(y = species))+
      geom_bar()

  2. How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

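    color changes only the outline of the bars, while fill changes their interior, so fill is more useful for changing the color of bars.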

    ggplot(penguins, aes(x = species)) +
      geom_bar(color = "red")

    ggplot(penguins, aes(x = species)) +
      geom_bar(fill = "red")

  3. What does the bins argument in geom_histogram() do?

    ggplot(penguins, aes(x = body_mass_g)) +
      geom_histogram()
    `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

    bins sets the number of “buckets” that the data are cut into. The default is 30.

    To remove the message and the warning, set bins explicitly and add na.rm = TRUE:

    ggplot(penguins, aes(x = body_mass_g)) +
      geom_histogram(bins = 10,
                     na.rm = TRUE)

  4. Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?

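    ggplot(diamonds, aes(x = carat)) +
      geom_histogram()
    `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    binwidth sets the width of each “bucket” in the units of the variable (here, carats), whereas bins sets how many buckets there are. Smaller binwidths show finer detail in the distribution.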
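
One way to experiment is to draw the same histogram at several binwidths. The values 0.5, 0.1 and 0.01 below are arbitrary choices for illustration, not taken from the original answer. At the smaller binwidths, the counts tend to pile up at round carat values.

```r
library(ggplot2)

# Hypothetical binwidths to compare; adjust as needed
binwidths <- c(0.5, 0.1, 0.01)

# Build one histogram of carat per binwidth
carat_plots <- lapply(binwidths, function(bw) {
  ggplot(diamonds, aes(x = carat)) +
    geom_histogram(binwidth = bw) +
    labs(title = paste("binwidth =", bw))
})

# Print a plot to view it, for example the finest one
carat_plots[[3]]
```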

Exercise 1.5.5

  1. The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?

    str(ggplot2::mpg)
    tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
     $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
     $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
     $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
     $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
     $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
     $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
     $ drv         : chr [1:234] "f" "f" "f" "f" ...
     $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
     $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
     $ fl          : chr [1:234] "p" "p" "p" "p" ...
     $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

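    Categorical: manufacturer, model, trans, drv, fl and class (type chr). Numerical: displ, year, cyl, cty and hwy (type num or int). Running mpg prints the tibble with each column’s type abbreviated under its name (<chr>, <int> or <dbl>).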
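
The same split can be derived programmatically by testing each column with is.numeric(). This is a sketch; it treats every non-numeric column as categorical, which holds for mpg but is an assumption for other data sets.

```r
library(ggplot2)

# TRUE for numeric (double or integer) columns, FALSE otherwise
is_num <- vapply(mpg, is.numeric, logical(1))

numerical_vars <- names(mpg)[is_num]
categorical_vars <- names(mpg)[!is_num]

numerical_vars
categorical_vars
```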

  2. Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point()

    # third numerical variable, color
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = cyl))

    # size
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(size = cyl))

    # color and size
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = cyl, size = cyl))

    # shape (using the categorical variable class)
    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(shape = class))
    Warning: The shape palette can deal with a maximum of 6 discrete values because more
    than 6 becomes difficult to discriminate
    ℹ you have requested 7 values. Consider specifying shapes manually if you need
      that many have them.
    Warning: Removed 62 rows containing missing values (`geom_point()`).

When a numerical variable is mapped to color or size, its values are shown on a continuous scale: a color gradient or graduated point sizes. With the default scales, higher values appear as lighter colors or larger points.

When a categorical variable is mapped to color, each category is assigned its own color, which helps to distinguish the categories.

When a categorical variable is mapped to shape, each category is drawn with a distinct shape, which is useful when color cannot be distinguished. A numerical variable cannot be mapped to shape; the plot above uses the categorical variable class instead. The default palette provides only 6 shapes, so the seventh class (suv, 62 rows) is not drawn.
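
The difference for shape can be seen by mapping a numerical variable to it and capturing the resulting error. The sketch below builds the plot without drawing it; the names p_shape and shape_result are illustrative.

```r
library(ggplot2)

# cyl is an integer column, so ggplot2 treats it as continuous
p_shape <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(shape = cyl))

# Building the plot triggers the scale check; tryCatch() captures
# the error message instead of stopping the script
shape_result <- tryCatch(
  ggplot_build(p_shape),
  error = function(e) conditionMessage(e)
)
shape_result
```

Wrapping the variable in factor(), as in aes(shape = factor(cyl)), would make it discrete and avoid the error.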

  1. In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?

Mapping a third variable to linewidth in the scatterplot of hwy vs. displ has no visible effect on the points, because geom_point() does not use the linewidth aesthetic. linewidth sets the width of lines, so it matters for line-based geoms rather than for points.

  1. What happens if you map the same variable to multiple aesthetics?

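    The variable is encoded redundantly. For example, mapping cyl to both color and size, as in the plot above, makes points with more cylinders both lighter and larger. This adds no new information, though the pattern may be easier to spot.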

  2. Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?

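    Coloring by species shows that the points form three clusters and that, within each species, bill depth tends to increase with bill length. Faceting by species puts each cluster in its own panel, which makes the within-species relationship easier to see.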

    ggplot(data = penguins,
           mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
      geom_point(aes(color = species))
    Warning: Removed 2 rows containing missing values (`geom_point()`).

    ggplot(data = penguins,
           mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
      geom_point(aes(color = species)) +
      facet_wrap(~species)
    Warning: Removed 2 rows containing missing values (`geom_point()`).
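
The pattern seen in the colored and faceted plots can also be checked numerically by comparing the correlation across all penguins with the correlation within each species. This is a rough sketch using base R; the variable names are illustrative, and correlation only captures linear association.

```r
library(palmerpenguins)

# Keep only rows where both bill measurements are present
bills <- penguins[!is.na(penguins$bill_length_mm) &
                    !is.na(penguins$bill_depth_mm), ]

# Correlation across all penguins, ignoring species
overall_cor <- cor(bills$bill_length_mm, bills$bill_depth_mm)

# Correlation within each species
species_cor <- sapply(split(bills, bills$species), function(d) {
  cor(d$bill_length_mm, d$bill_depth_mm)
})

overall_cor
species_cor
```

The overall correlation is negative while each within-species correlation is positive, which is why grouping by species changes the picture.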

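  1. Why does the following yield two separate legends? How would you fix it to combine the two legends?

    ggplot2 combines the legends for color and shape only when both have the same title. Giving the two scales the same name, as below, produces a single legend: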

    ggplot(
      data = penguins,
      mapping = aes(
        x = bill_length_mm, y = bill_depth_mm, 
        color = species, shape = species )) +
      geom_point() +
      scale_color_discrete(name = "Species") +
      scale_shape_discrete(name = "Species")
    Warning: Removed 2 rows containing missing values (`geom_point()`).
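
An alternative way to give both legends the same title is labs(), which sets the titles without calling the scale functions. This is a sketch of the same idea; the name merged_legend is illustrative.

```r
library(palmerpenguins)
library(ggplot2)

# Setting the same title for color and shape lets ggplot2 merge the legends
merged_legend <- ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm,
    color = species, shape = species
  )
) +
  geom_point(na.rm = TRUE) +
  labs(color = "Species", shape = "Species")

merged_legend
```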

  1. Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?

    The first displays the distribution of penguin species on each island, and answers the question: “What is the distribution of penguin species on each island?” The second shows the distribution of islands within each species, and answers the question: “How are the penguins of each species distributed across the islands?”
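
The two plots from the exercise are not reproduced in this write-up. The sketch below shows what they might look like, assuming the first maps island to x and species to fill and the second swaps the two. The use of position = "fill", which scales each bar to proportions, is also an assumption.

```r
library(palmerpenguins)
library(ggplot2)

# Assumed first plot: one bar per island, split by species
species_by_island <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

# Assumed second plot: one bar per species, split by island
island_by_species <- ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")

species_by_island
island_by_species
```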

Exercise 1.6.1

  1. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

    ggplot(mpg, aes(x = class)) +
      geom_bar()
    ggplot(mpg, aes(x = cty, y = hwy)) +
      geom_point()
    ggsave("mpg-plot.png")

    The second plot (the scatterplot of hwy vs. cty) is saved, because by default ggsave() saves the last plot that was displayed.

  2. What do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?

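To save as a PDF, change the file extension in the file name from .png to .pdf; ggsave() picks the file type from the extension. The supported file types are listed under the device argument in ?ggsave.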
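
A minimal sketch of the PDF version follows. It passes the plot explicitly rather than relying on the last plot displayed, and writes to a temporary directory. The width and height values are arbitrary choices for illustration.

```r
library(ggplot2)

# Build the scatterplot and keep it in a variable
mpg_scatter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# The .pdf extension tells ggsave() to use the PDF device
pdf_path <- file.path(tempdir(), "mpg-plot.pdf")
ggsave(pdf_path, plot = mpg_scatter, width = 6, height = 4)

file.exists(pdf_path)
```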