Project_1_HW_answers_MoystadAsk

Author

Ask Moystad

—————- SECTION 1.2.5 EXERCISES ——————

Q1: How many rows are in penguins? How many columns?

install.packages("tidyverse")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
(as 'lib' is unspecified)

install.packages("palmerpenguins")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
(as 'lib' is unspecified)

install.packages("ggthemes")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
(as 'lib' is unspecified)

library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("ggthemes")
library("palmerpenguins")

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Q1 answer: 344 rows, 8 columns
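The same counts can also be pulled out directly instead of being read off the glimpse() output. A minimal sketch, assuming the palmerpenguins package is installed (the names n_rows and n_cols are only illustrative):

```r
library(palmerpenguins)

# nrow() and ncol() return the two counts separately
n_rows <- nrow(penguins)
n_cols <- ncol(penguins)

# dim() returns both at once, as c(rows, columns)
dim(penguins)
```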

Q2: What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.

?penguins

Q2 answer: bill_depth_mm: a number denoting bill depth (millimeters)

Q3: Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.

ggplot(data = penguins, mapping = aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point()

Warning: Removed 2 rows containing missing values (`geom_point()`).

Q3 answer: Without picking out any more variables, there does not appear to be a clear correlation.

Q4: What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?

ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = species)) + geom_point()

Warning: Removed 2 rows containing missing values (`geom_point()`).

Q4 answer: The points line up in three horizontal strips, one per species, and overlap heavily. Because species is a factor, and not an integer or dbl, a better choice might be a boxplot.
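One geom that suits a numerical variable plotted against a categorical one is a boxplot. A minimal sketch of what that could look like for this question, assuming ggplot2 and palmerpenguins are installed (p_box is just an illustrative name):

```r
library(ggplot2)
library(palmerpenguins)

# A boxplot summarises the distribution of bill depth within each species
# instead of stacking individual points on top of each other
p_box <- ggplot(penguins, aes(x = bill_depth_mm, y = species)) +
  geom_boxplot(na.rm = TRUE)

p_box
```

Each species gets one box, so the medians and spreads can be compared at a glance.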

Q5: Why does the following give an error and how would you fix it?

ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = bill_length_mm)) + geom_point()

Warning: Removed 2 rows containing missing values (`geom_point()`).

Q5 Answer: The original code recalls the data and says that we want a scatterplot, but it does not pick the specific variables to use for the scatterplot, so geom_point() does not know what to plot. Mapping variables to x and y with aes(), as in the code above, fixes it.
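The failing version of the exercise code is not reproduced above. Assuming it was ggplot(data = penguins) + geom_point(), with no mapping at all, the error can be captured as text without stopping the script. This is only a sketch; broken and msg are illustrative names:

```r
library(ggplot2)
library(palmerpenguins)

# Assumed form of the exercise code: data is supplied, but no x or y mapping
broken <- ggplot(data = penguins) + geom_point()

# The error only appears when the plot is built, so build it inside tryCatch()
msg <- tryCatch(
  {
    ggplot_build(broken)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

msg
```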

Q6: What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.

ggplot(data = penguins, mapping = aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point()

Warning: Removed 2 rows containing missing values (`geom_point()`).

Q6 answer: na.rm controls how missing values are handled. If FALSE, the default, missing values are removed with a warning (as in the plot above). If TRUE, missing values are silently removed.

ggplot(data = penguins, mapping = aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point(na.rm = TRUE) + labs(title = "Data come from the palmerpenguins package")
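To see where the "Removed 2 rows" warning comes from, the rows with a missing bill measurement can be counted directly. A small sketch, assuming palmerpenguins is installed (missing_rows is an illustrative name):

```r
library(palmerpenguins)

# TRUE for every row where either plotted variable is missing
missing_rows <- is.na(penguins$bill_length_mm) | is.na(penguins$bill_depth_mm)

# Number of rows geom_point() has to drop
sum(missing_rows)

# Inspect the affected rows
penguins[missing_rows, ]
```

The count should match the number of rows reported in the warning.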

Q8: Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point(mapping = aes(color = bill_depth_mm), na.rm = TRUE) + geom_smooth(method = "loess", na.rm = TRUE)

`geom_smooth()` using formula = 'y ~ x'

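Q8 Answer: bill_depth_mm is mapped to the color aesthetic, at the geom level (inside geom_point()), so that only the points are coloured and the smooth line stays a single line.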

Q9: Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

Q9 prediction: I think that I will get a graph that shows flipper length in relation to body mass, with island determining the colour of the individual dots. However, se is set to FALSE for the line of fit and I have no idea what that might mean.

ggplot( data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island) ) + geom_point() + geom_smooth(se = FALSE)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 2 rows containing missing values (`geom_point()`).

Q9 Answer: So this gave me a separate smoothed line for each island. I will now try and see what happens when se = TRUE

ggplot( data = penguins, 
        mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
        ) + geom_point() + geom_smooth(se = TRUE)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 2 rows containing missing values (`geom_point()`).

Q9 Answer continued: So what that means is that the shaded confidence band around each line is hidden if se = FALSE. I can see how this might be helpful.

Q10: Will these two graphs look different? Why/why not?

The top code is more efficient and does the same thing.

ggplot( data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g) ) + geom_point() + geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot() + geom_point( data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g) ) + geom_smooth( data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g))

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 2 rows containing missing values (`geom_point()`).

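Q10 Answer: They seem identical, because both layers receive the same data and mappings either way. Clearly one can write more efficient code if one knows how: defining the data and mapping once in ggplot() avoids repeating them in every geom.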
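Rather than comparing the two plots by eye, the data that ggplot2 computes for each layer can be compared. A minimal sketch, assuming ggplot2 and palmerpenguins are installed; p_global, p_local and the same_* flags are illustrative names, and na.rm = TRUE is added only to silence the warnings:

```r
library(ggplot2)
library(palmerpenguins)

# Data and mapping defined once, globally
p_global <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(na.rm = TRUE)

# Data and mapping repeated in each geom
p_local <- ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g),
    na.rm = TRUE
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g),
    na.rm = TRUE
  )

# layer_data() returns the data frame ggplot2 builds for layer i
same_points <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
same_smooth <- isTRUE(all.equal(layer_data(p_global, 2), layer_data(p_local, 2)))

c(points = same_points, smooth = same_smooth)
```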

—————- SECTION 1.4.3 EXERCISES ——————

Q1: Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?

ggplot(penguins, aes(y = fct_infreq(species))) + geom_bar()

Q1 answer: With species on the y aesthetic, the bars are horizontal: the species are listed along the y-axis and the counts run along the x-axis.

Q2: How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

ggplot(penguins, aes(x = species)) + geom_bar(color = "red")

ggplot(penguins, aes(x = species)) + geom_bar(fill = "red")

Q2 answer: The top code only adds a red outline to the bars, while the one below makes the bars themselves red. So fill is the more useful aesthetic for changing the colour of bars.

Q3: What does the bins argument in geom_histogram() do?

Q3 answer: The bins argument sets the number of bins the histogram is divided into. The related binwidth argument instead sets the width of each bin.
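The difference between bins and binwidth can be checked by counting the bars each one produces. A small sketch, assuming ggplot2 and palmerpenguins are installed; the choice of body_mass_g, 20 bins and a 200 g binwidth is arbitrary and only for illustration:

```r
library(ggplot2)
library(palmerpenguins)

# bins fixes how many bars there are
p_bins <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 20, na.rm = TRUE)

# binwidth fixes how wide each bar is; the number of bars then follows from the data range
p_width <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, na.rm = TRUE)

# Each row of the layer data is one bar
n_bars_bins <- nrow(layer_data(p_bins))
n_bars_width <- nrow(layer_data(p_width))

c(bins = n_bars_bins, binwidth = n_bars_width)
```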

Q4: Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package.

Experiment with different binwidths. What binwidth reveals the most interesting patterns?

ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 100, na.rm = TRUE)

ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 10, na.rm = TRUE)

ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.1, na.rm = TRUE)

glimpse(diamonds)

Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.01, na.rm = TRUE)

Q4 answer: Taking a closer look at the dataset, and the graphs for that matter, it looks like we might benefit either from going really small (0.01) or slightly larger (0.05), depending on what we want to express with our visualisation.
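One way to see what a small binwidth is picking up is to count how often each exact carat value occurs. A rough sketch, assuming dplyr and ggplot2 are installed (top_carats is an illustrative name):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Count diamonds per exact carat value, most common first
top_carats <- diamonds |>
  count(carat, sort = TRUE) |>
  head(10)

top_carats
```

If a handful of carat values dominate the counts, a narrow binwidth will show them as spikes, while a wide one will smooth them away.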

—————- SECTION 1.5.5 EXERCISES ——————

Q1: The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?

?mpg

glimpse(mpg)

Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

head(mpg)

# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Q1 answer: By reading the help page (?mpg) and glimpsing the dataset, I can see that the categorical variables are manufacturer, model, trans (transmission), drv (drive), fl (fuel type) and class. The numerical variables are displ (displacement), year, cyl (cylinders), cty (city miles per gallon) and hwy (highway miles per gallon). When mpg is printed, this information appears as the type abbreviation under each column name (<chr> for character, <int> and <dbl> for numbers).
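The split by column type can also be done in code instead of by reading the printout. A minimal sketch, assuming dplyr and ggplot2 are installed; it treats character columns as categorical, which matches the types shown by glimpse(mpg) above:

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Columns stored as numbers (<int> or <dbl>)
numeric_vars <- mpg |>
  select(where(is.numeric)) |>
  names()

# Columns stored as text (<chr>), treated here as categorical
categorical_vars <- mpg |>
  select(where(is.character)) |>
  names()

numeric_vars
categorical_vars
```

Note that a variable stored as a number, such as year or cyl, can still behave like a category when it has only a few distinct values.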

Q2: Make a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(aes(color = year))

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(aes(size = cyl))

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(aes(color = year, size = cyl, shape = trans))

Warning: The shape palette can deal with a maximum of 6 discrete values because more
than 6 becomes difficult to discriminate
ℹ you have requested 10 values. Consider specifying shapes manually if you need
  that many have them.

Warning: Removed 96 rows containing missing values (`geom_point()`).

Q2 answer: Numerical variables are mapped on a sliding scale (a colour gradient or graduated point sizes), so they do not necessarily create neat categories unless there are few enough distinct values (for example, the year of make could be treated as a categorical variable). Categorical variables get one discrete colour or shape per level, and with many levels the system runs out of ways to fit them in: the shape palette handles at most 6 values, so trans, with 10, leaves 96 points unplotted.
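The remark about year can be tried out by wrapping it in factor(), which should switch the colour scale from a gradient to discrete colours. A small sketch, assuming ggplot2 is installed; p_cont, p_disc and n_colours are illustrative names:

```r
library(ggplot2)

# year as a number: continuous colour gradient
p_cont <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = year))

# year as a factor: one discrete colour per level
p_disc <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = factor(year)))

# Number of distinct colours actually used in the discrete version
n_colours <- length(unique(layer_data(p_disc)$colour))
n_colours
```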

Q3: In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?

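ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(aes(linewidth = cyl))

Q3 answer: I couldn’t get linewidth to change anything. geom_point() draws no lines, so it has no linewidth aesthetic to act on, and adding scale_linewidth() without a mapping does nothing either.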

Q4: What happens if you map the same variable to multiple aesthetics?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(aes(color = year, size = year))

Q4 answer: It appears to just add more “traits” to the plotted variable. In the plot above, for example, the year of make is shown not only by a different colour but also by a different size, so the two aesthetics carry the same information.

Q5: Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?

ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm)
) +
  geom_point(aes(color = species, shape = species), na.rm = TRUE) +
  facet_wrap(~species) +
  labs(
    title = "Bill length vs Bill depth",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Bill length (mm)", y = "Bill depth (mm)",
    color = "Species", shape = "Species"
  ) +
  scale_color_colorblind()

Q5 answer: Colouring by species reveals a clear positive correlation between bill length and bill depth within each species, which is hidden when the species are pooled. This is further highlighted by faceting by species.
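The visual impression can be backed up with correlation coefficients, computed once for all penguins together and once per species. A minimal sketch, assuming dplyr and palmerpenguins are installed; overall and by_species are illustrative names, and Pearson correlation is just one possible measure:

```r
library(dplyr)
library(palmerpenguins)

# Correlation across all penguins, ignoring rows with missing values
overall <- cor(
  penguins$bill_length_mm,
  penguins$bill_depth_mm,
  use = "complete.obs"
)

# The same correlation computed separately within each species
by_species <- penguins |>
  group_by(species) |>
  summarise(r = cor(bill_length_mm, bill_depth_mm, use = "complete.obs"))

overall
by_species
```

Comparing the sign of the overall value with the per-species values shows how pooling the species can mask the within-species relationship.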

Q6: Why does the following yield two separate legends? How would you fix it to combine the two legends?

Q6 answer: The issue appears to be that only the colour legend was given the title “Species”, so the shape legend kept a different title and R created a separate legend for shape. This can be fixed by giving the shape legend the same title, “Species”.

ggplot( data = penguins, 
        mapping = aes( x = bill_length_mm, y = bill_depth_mm, color = species, shape = species )
        ) + geom_point() + labs(colour = "Species", shape = "Species")

Warning: Removed 2 rows containing missing values (`geom_point()`).

Q7: Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?

ggplot(penguins, aes(x = island, fill = species)) + geom_bar(position = "fill")

ggplot(penguins, aes(x = species, fill = island)) + geom_bar(position = "fill")

Q7 answer: For the first one: What is the composition of penguin species on each island? For the second one: On which islands might we find the Adelie, Chinstrap or Gentoo species of penguin?
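The proportions behind the first plot can also be computed as a table. A small sketch, assuming dplyr and palmerpenguins are installed (species_by_island is an illustrative name):

```r
library(dplyr)
library(palmerpenguins)

# Count penguins per island and species, then convert to within-island shares
species_by_island <- penguins |>
  count(island, species) |>
  group_by(island) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

species_by_island
```

Swapping the roles of island and species in count() and group_by() gives the table behind the second plot.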

—————- SECTION 1.6.1 EXERCISES ——————

Q1: Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

Q1 answer: It saves the last plot that is displayed. So since I displayed the scatterplot last, it would be the one that has been saved.

ggplot(mpg, aes(x = class)) + geom_bar()

ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()

ggsave("mpg-plot.png")

Saving 7 x 5 in image
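To avoid depending on which plot happened to be displayed last, the plot can be stored in a variable and passed to ggsave() through its plot argument. A minimal sketch, assuming ggplot2 is installed; the object names and the temporary file path are only for illustration:

```r
library(ggplot2)

# Store both plots instead of only displaying them
bar_plot <- ggplot(mpg, aes(x = class)) + geom_bar()
scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()

# Illustrative output location in the session's temporary directory
out_file <- file.path(tempdir(), "mpg-bar-plot.png")

# plot = picks the plot explicitly, even though it was not the last one created
ggsave(out_file, plot = bar_plot, width = 7, height = 5)

file.exists(out_file)
```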

Q2: What do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?

Q2 answer: All that is required to save it as a PDF is to change the file extension in the file name from .png to .pdf. To find out which file types work, I could check the ggsave() help page with ?ggsave or the documentation on tidyverse.org.

?ggsave() 

ggsave("mpg-plot.pdf")

Saving 7 x 5 in image