Hello, Quarto

Meet Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Meet the penguins

Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst.

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.

The plot below shows the relationship between flipper and bill lengths of these penguins.

Why I am here

I am taking this module as part of my MSc programme in endangered species recovery and conservation. The end goal is to work with an all-time favourite of mine: tigers. I'm not picky, as long as it's stripey and orange.

Week 2

Image embedding:

Humbug

This is my cat; his name is Humbug. He is a very cool dude. In this image he is wearing his Christmas sweater, and I thought you would like to see him.

Video embedding:

Week 3 - 6.1

library(tidyverse)
View(diamonds)
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
names(diamonds)
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
 [8] "x"       "y"       "z"      

mutate()

diamonds %>% 
  mutate(JustOne = 1,
         Values = "something",
         Simple = TRUE)
# A tibble: 53,940 × 13
   carat cut    color clarity depth table price     x     y     z JustOne Values
   <dbl> <ord>  <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>   <dbl> <chr> 
 1  0.23 Ideal  E     SI2      61.5    55   326  3.95  3.98  2.43       1 somet…
 2  0.21 Premi… E     SI1      59.8    61   326  3.89  3.84  2.31       1 somet…
 3  0.23 Good   E     VS1      56.9    65   327  4.05  4.07  2.31       1 somet…
 4  0.29 Premi… I     VS2      62.4    58   334  4.2   4.23  2.63       1 somet…
 5  0.31 Good   J     SI2      63.3    58   335  4.34  4.35  2.75       1 somet…
 6  0.24 Very … J     VVS2     62.8    57   336  3.94  3.96  2.48       1 somet…
 7  0.24 Very … I     VVS1     62.3    57   336  3.95  3.98  2.47       1 somet…
 8  0.26 Very … H     SI1      61.9    55   337  4.07  4.11  2.53       1 somet…
 9  0.22 Fair   E     VS2      65.1    61   337  3.87  3.78  2.49       1 somet…
10  0.23 Very … H     VS1      59.4    61   338  4     4.05  2.39       1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>

This creates three new variables: JustOne, where every value in the column is 1; Values, where every value is "something"; and Simple, where every value is TRUE.

diamonds %>% 
  mutate(price200 = price - 200)
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z price200
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43      126
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31      126
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31      127
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63      134
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75      135
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48      136
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47      136
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53      137
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49      137
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39      138
# ℹ 53,930 more rows

mutate() can also be used to create variables based on existing variables from the dataset. It can create multiple columns at once, separating each new variable with a comma.

diamonds %>% 
  mutate(price200 = price - 200,        # $200 OFF from the original price
         price20perc = price * 0.20,    # 20% of the original price
         price20percoff = price * 0.80, # 20% OFF from the original price 
         pricepercarat = price / carat, # ratio of price to carat
         pizza = depth ^ 2)             # Square the original depth
# A tibble: 53,940 × 15
   carat cut       color clarity depth table price     x     y     z price200
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43      126
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31      126
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31      127
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63      134
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75      135
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48      136
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47      136
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53      137
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49      137
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39      138
# ℹ 53,930 more rows
# ℹ 4 more variables: price20perc <dbl>, price20percoff <dbl>,
#   pricepercarat <dbl>, pizza <dbl>

Saving

diamonds.new <- # saving changes to diamonds as a new object
  diamonds %>%  # original dataset
  mutate(price200 = price - 200,        # $200 OFF from the original price
         price20perc = price * .20,     # 20% of the original price
         price20percoff = price * 0.80, # 20% OFF from the original price
         pricepercarat = price / carat, # ratio of price to carat
         pizza = depth ^ 2)             # Square the original depth

6.1.1 Nesting functions

Nesting is where one function “nests” inside another function, for example mean() inside mutate().

diamonds %>% 
  mutate(m = mean(price))
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z     m
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 3933.
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 3933.
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 3933.
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 3933.
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 3933.
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 3933.
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 3933.
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 3933.
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 3933.
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 3933.
# ℹ 53,930 more rows
diamonds %>% 
  mutate(m = mean(price),     # calculates the mean price
         sd = sd(price),      # calculates standard deviation
         med = median(price)) # calculates the median price
# A tibble: 53,940 × 13
   carat cut       color clarity depth table price     x     y     z     m    sd
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 3933. 3989.
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 3933. 3989.
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 3933. 3989.
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 3933. 3989.
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 3933. 3989.
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 3933. 3989.
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 3933. 3989.
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 3933. 3989.
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 3933. 3989.
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>
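
As a small check on the nesting examples above, the sketch below uses a three-row toy tibble (hypothetical data, not part of the diamonds dataset) so the result is easy to verify by hand. It suggests that a nested summary function such as mean() returns a single value, which mutate() then repeats on every row, and that the nested call can also sit inside a larger expression.

```r
library(dplyr)

# Hypothetical toy data, small enough to check by hand
toy <- tibble(price = c(100, 200, 300))

nested <- toy %>%
  mutate(m = mean(price),               # one value, repeated on every row
         centred = price - mean(price)) # nested call used inside arithmetic

print(nested)
```

The column name centred is only an illustration; any expression that mixes a column with a nested summary should behave the same way.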

6.1.1.0.1 - Exercises

1. A

library(tidyverse)

# Load the midwest dataset
data(midwest)

# Create a tibble with the original data and a new column for average population density
midwest_with_avg_pop_den <- midwest %>%
  mutate(avg.pop.den = mean(popdensity))

# Print the resulting tibble
print(midwest_with_avg_pop_den)
# A tibble: 437 × 29
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 20 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, …

B

# Create a tibble with the original data and a new column for average population density and average area
midwest_with_avg_pop_den_and_area <- midwest %>%
  mutate(avg.pop.den = mean(popdensity),
         avg.area = mean(area))

# Print the resulting tibble
print(midwest_with_avg_pop_den_and_area)
# A tibble: 437 × 30
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 21 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, …

C

# Create a tibble with the original data and a new column for average population density, average area, and total adults
midwest_with_avg_pop_den_and_area_and_total_adults <- midwest %>%
  mutate(avg.pop.den = mean(popdensity),
         avg.area = mean(area),
         totadult = sum(popadults))

# Print the resulting tibble
print(midwest_with_avg_pop_den_and_area_and_total_adults)
# A tibble: 437 × 31
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 22 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, …

D

# Create a tibble with the original data and a new column for average population density, average area, total adults, and total population minus white population
midwest_with_avg_pop_den_and_area_and_total_adults_and_tot_minus_white <- midwest %>%
  mutate(avg.pop.den = mean(popdensity),
         avg.area = mean(area),
         totadult = sum(popadults),
         tot.minus.white = poptotal - popwhite)

# Print the resulting tibble
print(midwest_with_avg_pop_den_and_area_and_total_adults_and_tot_minus_white)
# A tibble: 437 × 32
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 23 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, …
# Alternative calculation for tot.minus.white without using poptotal and popwhite
midwest_with_avg_pop_den_and_area_and_total_adults_and_tot_minus_white_alternative <- midwest %>%
  mutate(avg.pop.den = mean(popdensity),
         avg.area = mean(area),
         totadult = sum(popadults),
         tot.minus.white = popblack + popamerindian + popasian + popother)

# Print the resulting tibble
print(midwest_with_avg_pop_den_and_area_and_total_adults_and_tot_minus_white_alternative)
# A tibble: 437 × 32
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 23 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, …

E

# Calculate the ratio of children to adults in poverty
midwest_with_all <- midwest %>%
  mutate(avg.pop.den = mean(popdensity),
         avg.area = mean(area),
         totadult = sum(popadults),
         tot.minus.white = popblack + popamerindian + popasian + popother,
         child.to.adult = percchildbelowpovert / percadultpoverty)

# Print the resulting tibble
print(midwest_with_all)
# A tibble: 437 × 33
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 24 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, …

F

# Existing mutations: the columns from the previous steps are kept, and ratio.adult is added.

midwest_with_all <- midwest %>%
  mutate(avg.pop.den = mean(popdensity),
         avg.area = mean(area),
         totadult = sum(popadults),
         tot.minus.white = popblack + popamerindian + popasian + popother,
         child.to.adult = percchildbelowpovert / percadultpoverty,
         ratio.adult = popadults / poptotal)

# Print the resulting tibble
print(midwest_with_all)
# A tibble: 437 × 34
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 25 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, …

G

# Existing mutations: perc.adult reuses ratio.adult, which is created earlier in the same mutate() call.

midwest_with_all <- midwest %>%
  mutate(avg.pop.den = mean(popdensity),
         avg.area = mean(area),
         totadult = sum(popadults),
         tot.minus.white = popblack + popamerindian + popasian + popother,
         child.to.adult = percchildbelowpovert / percadultpoverty,
         ratio.adult = popadults / poptotal,
         perc.adult = ratio.adult * 100)

# Print the resulting tibble
print(midwest_with_all)
# A tibble: 437 × 35
     PID county  state  area poptotal popdensity popwhite popblack popamerindian
   <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
 1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
 2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
 3   563 BOND    IL    0.022    14991       681.    14477      429            35
 4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
 5   565 BROWN   IL    0.018     5836       324.     5264      547            14
 6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
 7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
 8   568 CARROLL IL    0.027    16805       622.    16519      111            30
 9   569 CASS    IL    0.024    13437       560.    13384       16             8
10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
# ℹ 427 more rows
# ℹ 26 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, …
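
Exercise G relies on mutate() evaluating its arguments in order, so a column defined earlier in the call can be used by a later one. The sketch below shows this on two made-up rows; the counts are hypothetical and only borrow the popadults and poptotal column names from midwest.

```r
library(dplyr)

# Hypothetical counts standing in for popadults and poptotal
counties <- tibble(popadults = c(50, 30),
                   poptotal  = c(100, 120))

result <- counties %>%
  mutate(ratio.adult = popadults / poptotal, # created first
         perc.adult  = ratio.adult * 100)    # reuses the column defined just above

print(result)
```

Swapping the two lines inside mutate() would be expected to fail, because perc.adult would then refer to a column that does not exist yet.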

2

view(presidential)
# Load the presidential dataset
data(presidential)

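# Create a tibble with the original data and a new column for duration in years
# Note: end - start is a difftime, so the result still prints with a "days" label
# even though dividing by 365 turns the values into approximate years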
presidential_with_duration <- presidential %>%
  mutate(duration = (end - start) / 365)

# Print the resulting tibble
print(presidential_with_duration)
# A tibble: 12 × 5
   name       start      end        party      duration     
   <chr>      <date>     <date>     <chr>      <drtn>       
 1 Eisenhower 1953-01-20 1961-01-20 Republican 8.005479 days
 2 Kennedy    1961-01-20 1963-11-22 Democratic 2.838356 days
 3 Johnson    1963-11-22 1969-01-20 Democratic 5.167123 days
 4 Nixon      1969-01-20 1974-08-09 Republican 5.553425 days
 5 Ford       1974-08-09 1977-01-20 Republican 2.452055 days
 6 Carter     1977-01-20 1981-01-20 Democratic 4.002740 days
 7 Reagan     1981-01-20 1989-01-20 Republican 8.005479 days
 8 Bush       1989-01-20 1993-01-20 Republican 4.002740 days
 9 Clinton    1993-01-20 2001-01-20 Democratic 8.005479 days
10 Bush       2001-01-20 2009-01-20 Republican 8.005479 days
11 Obama      2009-01-20 2017-01-20 Democratic 8.005479 days
12 Trump      2017-01-20 2021-01-20 Republican 4.002740 days
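
The duration column above is labelled "days" because subtracting two dates gives a difftime, although the values are really approximate years. One possible way around this, sketched below on two hypothetical rows that reuse the start and end column names, is to convert the difference to a plain number before dividing. The column name duration.years is an assumption made for this example.

```r
library(dplyr)

# Two hypothetical terms with the same column names as presidential
terms <- tibble(
  start = as.Date(c("1953-01-20", "1977-01-20")),
  end   = as.Date(c("1961-01-20", "1981-01-20"))
)

terms_years <- terms %>%
  # as.numeric() drops the difftime class, so no "days" label is attached
  mutate(duration.years = as.numeric(end - start, units = "days") / 365)

print(terms_years)
```

Dividing by 365 ignores leap days, which is why an eight-year term comes out slightly above 8.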

3

# Load the economics dataset
data(economics)

# Create a tibble with the original data and a new column for the percentage of unemployed
economics_with_perc_unemploy <- economics %>%
  mutate(perc.unemploy = (unemploy / pop) * 100)

# Print the resulting tibble
print(economics_with_perc_unemploy)
# A tibble: 574 × 7
   date         pce    pop psavert uempmed unemploy perc.unemploy
   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>         <dbl>
 1 1967-07-01  507. 198712    12.6     4.5     2944          1.48
 2 1967-08-01  510. 198911    12.6     4.7     2945          1.48
 3 1967-09-01  516. 199113    11.9     4.6     2958          1.49
 4 1967-10-01  512. 199311    12.9     4.9     3143          1.58
 5 1967-11-01  517. 199498    12.8     4.7     3066          1.54
 6 1967-12-01  525. 199657    11.8     4.8     3018          1.51
 7 1968-01-01  531. 199808    11.7     5.1     2878          1.44
 8 1968-02-01  534. 199920    12.3     4.5     3001          1.50
 9 1968-03-01  544. 200056    11.7     4.1     2877          1.44
10 1968-04-01  544  200208    12.3     4.6     2709          1.35
# ℹ 564 more rows

4. A

# Load the txhousing dataset
data(txhousing)

# Create a tibble with the original data and a new column for success rate
txhousing_with_successrate <- txhousing %>%
  mutate(successrate = (sales / listings) * 100)

# Print the resulting tibble
print(txhousing_with_successrate)
# A tibble: 8,602 × 10
   city     year month sales  volume median listings inventory  date successrate
   <chr>   <int> <int> <dbl>   <dbl>  <dbl>    <dbl>     <dbl> <dbl>       <dbl>
 1 Abilene  2000     1    72  5.38e6  71400      701       6.3 2000         10.3
 2 Abilene  2000     2    98  6.51e6  58700      746       6.6 2000.        13.1
 3 Abilene  2000     3   130  9.28e6  58100      784       6.8 2000.        16.6
 4 Abilene  2000     4    98  9.73e6  68600      785       6.9 2000.        12.5
 5 Abilene  2000     5   141  1.06e7  67300      794       6.8 2000.        17.8
 6 Abilene  2000     6   156  1.39e7  66900      780       6.6 2000.        20  
 7 Abilene  2000     7   152  1.26e7  73500      742       6.2 2000.        20.5
 8 Abilene  2000     8   131  1.07e7  75000      765       6.4 2001.        17.1
 9 Abilene  2000     9   104  7.62e6  64500      771       6.5 2001.        13.5
10 Abilene  2000    10   101  7.04e6  59300      764       6.6 2001.        13.2
# ℹ 8,592 more rows

B

# Create a tibble with the original data and new columns for success rate and fail rate
txhousing_with_successrate_and_failrate <- txhousing %>%
  mutate(successrate = (sales / listings) * 100,
         failrate = 100 - successrate)

# Print the resulting tibble
print(txhousing_with_successrate_and_failrate)
# A tibble: 8,602 × 11
   city     year month sales  volume median listings inventory  date successrate
   <chr>   <int> <int> <dbl>   <dbl>  <dbl>    <dbl>     <dbl> <dbl>       <dbl>
 1 Abilene  2000     1    72  5.38e6  71400      701       6.3 2000         10.3
 2 Abilene  2000     2    98  6.51e6  58700      746       6.6 2000.        13.1
 3 Abilene  2000     3   130  9.28e6  58100      784       6.8 2000.        16.6
 4 Abilene  2000     4    98  9.73e6  68600      785       6.9 2000.        12.5
 5 Abilene  2000     5   141  1.06e7  67300      794       6.8 2000.        17.8
 6 Abilene  2000     6   156  1.39e7  66900      780       6.6 2000.        20  
 7 Abilene  2000     7   152  1.26e7  73500      742       6.2 2000.        20.5
 8 Abilene  2000     8   131  1.07e7  75000      765       6.4 2001.        17.1
 9 Abilene  2000     9   104  7.62e6  64500      771       6.5 2001.        13.5
10 Abilene  2000    10   101  7.04e6  59300      764       6.6 2001.        13.2
# ℹ 8,592 more rows
# ℹ 1 more variable: failrate <dbl>

6.1.1.1 recode()

diamonds %>% 
  mutate(cut.new = recode(cut,
                          "Ideal" = "IDEAL"))
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z cut.new  
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>    
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 IDEAL    
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 Premium  
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 Good     
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 Premium  
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 Good     
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 Very Good
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 Very Good
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 Very Good
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 Fair     
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 Very Good
# ℹ 53,930 more rows
diamonds %>% 
  mutate(cut.new = recode(cut,
                          "Ideal" = "IDEAL",
                          "Fair" = "Okay",
                          "Premium" = "pizza"))
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z cut.new  
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>    
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 IDEAL    
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 pizza    
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 Good     
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 pizza    
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 Good     
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 Very Good
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 Very Good
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 Very Good
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 Okay     
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 Very Good
# ℹ 53,930 more rows
# creating a dataset with 2 variables (Sex , TestScore)
Sex <- factor(c("male", "m", "M", "Female", "Female", "Female"))
TestScore <- c(10, 20, 10, 25, 12, 5)
dataset <- tibble(Sex, TestScore)
str(dataset)
tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
 $ Sex      : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
 $ TestScore: num [1:6] 10 20 10 25 12 5
# creating a new variable (Sex.new) with recoded values 
# from the original variable (Sex)
dataset %>% 
  mutate(Sex.new = recode(Sex, 
                          "m" = "Male",
                          "M" = "Male",
                          "male" = "Male"))
# A tibble: 6 × 3
  Sex    TestScore Sex.new
  <fct>      <dbl> <fct>  
1 male          10 Male   
2 m             20 Male   
3 M             10 Male   
4 Female        25 Female 
5 Female        12 Female 
6 Female         5 Female 
dataset.new <- # saving the changes to a new object
  dataset %>% 
  mutate(Sex.new = recode(Sex, 
                          "m" = "Male",
                          "M" = "Male",
                          "male" = "Male"))
str(dataset.new)
tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
 $ Sex      : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
 $ TestScore: num [1:6] 10 20 10 25 12 5
 $ Sex.new  : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1
str(dataset)
tibble [6 × 2] (S3: tbl_df/tbl/data.frame)
 $ Sex      : Factor w/ 4 levels "Female","m","M",..: 4 2 3 1 1 1
 $ TestScore: num [1:6] 10 20 10 25 12 5

6.2 summarize()

summarize() collapses all rows into a one-row summary (or into one row per group when the data are grouped).

diamonds %>% 
  summarize(avg.price = mean(price))
# A tibble: 1 × 1
  avg.price
      <dbl>
1     3933.
diamonds %>% 
  summarize(avg.price = mean(price),     # average price of all diamonds
            dbl.price = mean(price) * 2, # calculating double the average price
            random.add = 1 + 2,          # a math operation without an existing variable 
            avg.carat = mean(carat),     # average carat size of all diamonds
            stdev.price = sd(price))     # calculating the standard deviation 
# A tibble: 1 × 5
  avg.price dbl.price random.add avg.carat stdev.price
      <dbl>     <dbl>      <dbl>     <dbl>       <dbl>
1     3933.     7866.          3     0.798       3989.

6.3 group_by() and ungroup()

group_by() takes existing data and groups the rows by the values of one or more variables, so that later operations are carried out separately within each group.

## Creating identification number to represent 50 individual people
ID <- c(1:50)

## Creating sex variable (25 males/25 females)
Sex <- rep(c("male", "female"), 25) # rep stands for replicate

## Creating age variable (20-39 year olds)
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33, 
         39, 28, 26, 29, 33, 22, 35, 23, 26, 36, 
         21, 20, 31, 21, 35, 39, 36, 22, 22, 25, 
         27, 30, 26, 34, 38, 39, 30, 29, 26, 25, 
         26, 36, 23, 21, 21, 39, 26, 26, 27, 21) 

## Creating a dependent variable called Score
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173, 
           0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482, 
           0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752, 
           0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567, 
           0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)

## Creating a unified dataset that puts together all variables
data <- tibble(ID, Sex, Age, Score)

6.3.1 summarize() and group_by()

data %>% 
  group_by(Sex) %>% 
  summarize(m = mean(Score), # calculates the mean
            s = sd(Score),   # calculates the standard deviation
            n = n()) %>%     # calculates the total number of observations
  ungroup()
# A tibble: 2 × 4
  Sex        m     s     n
  <chr>  <dbl> <dbl> <int>
1 female 0.437 0.268    25
2 male   0.487 0.268    25
data %>% 
  group_by(Sex, Age) %>%     # grouped by Sex and Age
  summarize(m = mean(Score),
            s = sd(Score),   
            n = n()) %>% 
  ungroup()
`summarise()` has grouped output by 'Sex'. You can override using the `.groups`
argument.
# A tibble: 27 × 5
   Sex      Age     m      s     n
   <chr>  <dbl> <dbl>  <dbl> <int>
 1 female    20 0.046 NA         1
 2 female    21 0.740  0.253     3
 3 female    22 0.672  0.253     2
 4 female    23 0.501 NA         1
 5 female    25 0.579  0.167     3
 6 female    26 0.41  NA         1
 7 female    28 0.152 NA         1
 8 female    29 0.426  0.339     2
 9 female    30 0.170  0.238     2
10 female    33 0.173 NA         1
# ℹ 17 more rows

6.3.2 mutate() and group_by()

data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Score)) %>% # calculates mean score by Sex
  ungroup()
# A tibble: 50 × 5
      ID Sex      Age Score     m
   <int> <chr>  <dbl> <dbl> <dbl>
 1     1 male      26 0.01  0.487
 2     2 female    25 0.418 0.437
 3     3 male      39 0.014 0.487
 4     4 female    37 0.09  0.437
 5     5 male      31 0.061 0.487
 6     6 female    34 0.328 0.437
 7     7 male      34 0.656 0.487
 8     8 female    30 0.002 0.437
 9     9 male      26 0.639 0.487
10    10 female    33 0.173 0.437
# ℹ 40 more rows

6.3.3 Ungrouping

Always ungroup() when you’ve finished with your calculations.

## Example 1

data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age)) %>%   # calculates the average age of males and females
  mutate(x = mean(Score)) %>% # calculates the mean score of males and females
  ungroup()                   # closing ungroup() 
# A tibble: 50 × 6
      ID Sex      Age Score     m     x
   <int> <chr>  <dbl> <dbl> <dbl> <dbl>
 1     1 male      26 0.01   29.2 0.487
 2     2 female    25 0.418  29.0 0.437
 3     3 male      39 0.014  29.2 0.487
 4     4 female    37 0.09   29.0 0.437
 5     5 male      31 0.061  29.2 0.487
 6     6 female    34 0.328  29.0 0.437
 7     7 male      34 0.656  29.2 0.487
 8     8 female    30 0.002  29.0 0.437
 9     9 male      26 0.639  29.2 0.487
10    10 female    33 0.173  29.0 0.437
# ℹ 40 more rows
## Example 2

data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age)) %>%  # calculates the average age of males and females
  ungroup() %>%              # nested ungroup()
  mutate(x = mean(Score))    # calculates the overall mean score (data are no longer grouped)
# A tibble: 50 × 6
      ID Sex      Age Score     m     x
   <int> <chr>  <dbl> <dbl> <dbl> <dbl>
 1     1 male      26 0.01   29.2 0.462
 2     2 female    25 0.418  29.0 0.462
 3     3 male      39 0.014  29.2 0.462
 4     4 female    37 0.09   29.0 0.462
 5     5 male      31 0.061  29.2 0.462
 6     6 female    34 0.328  29.0 0.462
 7     7 male      34 0.656  29.2 0.462
 8     8 female    30 0.002  29.0 0.462
 9     9 male      26 0.639  29.2 0.462
10    10 female    33 0.173  29.0 0.462
# ℹ 40 more rows
## Creating/Saving the object named "data1"
data1 <- 
  data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age))

## Using the data1 object after it's been saved above (WITHOUT an ungroup)
data1 %>% 
  mutate(x = mean(Score))
# A tibble: 50 × 6
# Groups:   Sex [2]
      ID Sex      Age Score     m     x
   <int> <chr>  <dbl> <dbl> <dbl> <dbl>
 1     1 male      26 0.01   29.2 0.487
 2     2 female    25 0.418  29.0 0.437
 3     3 male      39 0.014  29.2 0.487
 4     4 female    37 0.09   29.0 0.437
 5     5 male      31 0.061  29.2 0.487
 6     6 female    34 0.328  29.0 0.437
 7     7 male      34 0.656  29.2 0.487
 8     8 female    30 0.002  29.0 0.437
 9     9 male      26 0.639  29.2 0.487
10    10 female    33 0.173  29.0 0.437
# ℹ 40 more rows
## Creating/Saving the object named "data1"
data1 <- 
  data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age))

## Using the data1 object saved above (WITHOUT an ungroup): group_by(Age) replaces the existing Sex grouping
data1 %>% 
  group_by(Age) %>% 
  mutate(x = mean(Score)) %>% 
  ungroup()
# A tibble: 50 × 6
      ID Sex      Age Score     m     x
   <int> <chr>  <dbl> <dbl> <dbl> <dbl>
 1     1 male      26 0.01   29.2 0.583
 2     2 female    25 0.418  29.0 0.579
 3     3 male      39 0.014  29.2 0.210
 4     4 female    37 0.09   29.0 0.09 
 5     5 male      31 0.061  29.2 0.491
 6     6 female    34 0.328  29.0 0.550
 7     7 male      34 0.656  29.2 0.550
 8     8 female    30 0.002  29.0 0.209
 9     9 male      26 0.639  29.2 0.583
10    10 female    33 0.173  29.0 0.347
# ℹ 40 more rows

Proper method:

data1 <- 
  data %>% 
  group_by(Sex) %>% 
  mutate(m = mean(Age)) %>% 
  ungroup() # Ungroup at the end of a definition!!!

data1 %>% 
  group_by(Sex, Age) %>%  # group the relevant variables here
  mutate(x = mean(Score)) %>% 
  ungroup()
# A tibble: 50 × 6
      ID Sex      Age Score     m     x
   <int> <chr>  <dbl> <dbl> <dbl> <dbl>
 1     1 male      26 0.01   29.2 0.605
 2     2 female    25 0.418  29.0 0.579
 3     3 male      39 0.014  29.2 0.045
 4     4 female    37 0.09   29.0 0.09 
 5     5 male      31 0.061  29.2 0.491
 6     6 female    34 0.328  29.0 0.497
 7     7 male      34 0.656  29.2 0.656
 8     8 female    30 0.002  29.0 0.170
 9     9 male      26 0.639  29.2 0.605
10    10 female    33 0.173  29.0 0.173
# ℹ 40 more rows

6.4 filter()

Only retain specific rows of data that meet the specified requirement(s).

diamonds %>% filter(cut == "Fair")
# A tibble: 1,610 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 1,600 more rows
diamonds %>%
  filter(cut == "Fair" | cut == "Good", price <= 600)
# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows

Alternative:

diamonds %>%
  filter(cut %in% c("Fair", "Good"), price <= 600)
# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows

The following code requires cut to be both Fair and Good at the same time, because conditions separated by commas in filter() are combined with AND, so no rows are returned:

diamonds %>%
  filter(cut == "Fair", cut == "Good", price <= 600)
# A tibble: 0 × 10
# ℹ 10 variables: carat <dbl>, cut <ord>, color <ord>, clarity <ord>,
#   depth <dbl>, table <dbl>, price <int>, x <dbl>, y <dbl>, z <dbl>

Only display data from diamonds that do not have a cut value of Fair:

diamonds %>% filter(cut != "Fair")
# A tibble: 52,330 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
10  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73
# ℹ 52,320 more rows

6.5 select()

Select only the columns (variables) that you want to see. Gets rid of all other columns.

diamonds %>% select(cut, color)
# A tibble: 53,940 × 2
   cut       color
   <ord>     <ord>
 1 Ideal     E    
 2 Premium   E    
 3 Good      E    
 4 Premium   I    
 5 Good      J    
 6 Very Good J    
 7 Very Good I    
 8 Very Good H    
 9 Fair      E    
10 Very Good H    
# ℹ 53,930 more rows
#only retain the first 5 columns
diamonds %>% select(1:5)
# A tibble: 53,940 × 5
   carat cut       color clarity depth
   <dbl> <ord>     <ord> <ord>   <dbl>
 1  0.23 Ideal     E     SI2      61.5
 2  0.21 Premium   E     SI1      59.8
 3  0.23 Good      E     VS1      56.9
 4  0.29 Premium   I     VS2      62.4
 5  0.31 Good      J     SI2      63.3
 6  0.24 Very Good J     VVS2     62.8
 7  0.24 Very Good I     VVS1     62.3
 8  0.26 Very Good H     SI1      61.9
 9  0.22 Fair      E     VS2      65.1
10  0.23 Very Good H     VS1      59.4
# ℹ 53,930 more rows

To select all columns except cut:

diamonds %>% select(-cut)
# A tibble: 53,940 × 9
   carat color clarity depth table price     x     y     z
   <dbl> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Retain all of the columns except for cut and color:

diamonds %>% select(-cut, -color)
# A tibble: 53,940 × 8
   carat clarity depth table price     x     y     z
   <dbl> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Retain all of the columns except for the first five columns:

diamonds %>% select(-1, -2, -3, -4, -5)
# A tibble: 53,940 × 5
   table price     x     y     z
   <dbl> <int> <dbl> <dbl> <dbl>
 1    55   326  3.95  3.98  2.43
 2    61   326  3.89  3.84  2.31
 3    65   327  4.05  4.07  2.31
 4    58   334  4.2   4.23  2.63
 5    58   335  4.34  4.35  2.75
 6    57   336  3.94  3.96  2.48
 7    57   336  3.95  3.98  2.47
 8    55   337  4.07  4.11  2.53
 9    61   337  3.87  3.78  2.49
10    61   338  4     4.05  2.39
# ℹ 53,930 more rows
# or
diamonds %>% select(-(1:5))
# A tibble: 53,940 × 5
   table price     x     y     z
   <dbl> <int> <dbl> <dbl> <dbl>
 1    55   326  3.95  3.98  2.43
 2    61   326  3.89  3.84  2.31
 3    65   327  4.05  4.07  2.31
 4    58   334  4.2   4.23  2.63
 5    58   335  4.34  4.35  2.75
 6    57   336  3.94  3.96  2.48
 7    57   336  3.95  3.98  2.47
 8    55   337  4.07  4.11  2.53
 9    61   337  3.87  3.78  2.49
10    61   338  4     4.05  2.39
# ℹ 53,930 more rows

You can also retain all of the columns, but rearrange some of the columns to appear at the beginning—this moves the x,y,z variables to the first 3 columns:

diamonds %>% select(x,y,z, everything())
# A tibble: 53,940 × 10
       x     y     z carat cut       color clarity depth table price
   <dbl> <dbl> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  3.95  3.98  2.43  0.23 Ideal     E     SI2      61.5    55   326
 2  3.89  3.84  2.31  0.21 Premium   E     SI1      59.8    61   326
 3  4.05  4.07  2.31  0.23 Good      E     VS1      56.9    65   327
 4  4.2   4.23  2.63  0.29 Premium   I     VS2      62.4    58   334
 5  4.34  4.35  2.75  0.31 Good      J     SI2      63.3    58   335
 6  3.94  3.96  2.48  0.24 Very Good J     VVS2     62.8    57   336
 7  3.95  3.98  2.47  0.24 Very Good I     VVS1     62.3    57   336
 8  4.07  4.11  2.53  0.26 Very Good H     SI1      61.9    55   337
 9  3.87  3.78  2.49  0.22 Fair      E     VS2      65.1    61   337
10  4     4.05  2.39  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

6.6 arrange()

Allows you to arrange the rows by the values of a variable, in ascending or descending order (if that is applicable to your values).

diamonds %>% arrange(cut)
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 53,930 more rows

Arrange price by numerical order (lowest to highest):

diamonds %>% arrange(price)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

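Arrange cut in descending order. Because cut is an ordered factor, this follows the order of its levels (Ideal first) rather than alphabetical order: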

diamonds %>% arrange(desc(cut))
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
 3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
 4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
 5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
 6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
 7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
 8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
 9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
# ℹ 53,930 more rows
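
The ordering of cut can be checked directly. The sketch below is only an illustration and assumes dplyr and ggplot2 (which provides diamonds) are installed; the names cut_levels, top_by_level and top_by_alpha are made up for this example. It compares sorting by factor level with a true reverse-alphabetical sort, which needs cut converted to character first.

```r
library(dplyr)
library(ggplot2) # provides the diamonds data set

# cut is an ordered factor; its levels run from lowest to highest quality
cut_levels <- levels(diamonds$cut)
cut_levels

# desc(cut) puts the highest level first
top_by_level <- diamonds %>%
  arrange(desc(cut)) %>%
  slice(1) %>%
  pull(cut) %>%
  as.character()

# converting to character first gives a reverse-alphabetical sort instead
top_by_alpha <- diamonds %>%
  arrange(desc(as.character(cut))) %>%
  slice(1) %>%
  pull(cut) %>%
  as.character()

top_by_level
top_by_alpha
```

The two results differ because "Ideal" is the highest level of the factor, while "Very Good" comes last alphabetically.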

Arrange price in descending numerical order:

diamonds %>% arrange(desc(price))
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows

6.6.1 Exercises

Problem A:

## Problem A
midwest %>%   # uses the midwest dataset
  group_by(state) %>% 
  summarize(poptotalmean = mean(poptotal),  # mean of total population, one row per state
            poptotalmed = median(poptotal),  # median of total population
            popmax = max(poptotal),   # largest total population
            popmin = min(poptotal),  # smallest total population
            popdistinct = n_distinct(poptotal), # number of unique total population values
            popfirst = first(poptotal), # first total population value in each state
            popany = any(poptotal < 5000), # TRUE if any county has a population below 5,000
            popany2 = any(poptotal > 2000000)) %>% # TRUE if any county has a population above 2,000,000
  ungroup()   # removes the grouping
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>

Problem B:

## Problem B
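midwest %>%    # uses the midwest dataset
  group_by(state) %>%    # group rows by state
  summarize(num5k = sum(poptotal < 5000),      # number of counties with a population below 5,000
            num2mil = sum(poptotal > 2000000), # number of counties with a population above 2,000,000
            numrows = n()) %>%                 # number of rows (counties) per state
  ungroup()  # removes the grouping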
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72

Problem C part 1:

## Problem C
# part I
midwest %>%   # uses the midwest dataset
  group_by(county) %>%   # group rows by county name
  summarize(x = n_distinct(state)) %>%   # number of different states each county name appears in
  arrange(desc(x)) %>%   # sort by x, largest first
  ungroup()  # removes the grouping
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows

Problem C part 2:

# part II
#n():

#Counts the total number of rows in a dataset or within a group. It's equivalent to the nrow() function.

#n_distinct():

#Counts the number of unique values in a specific column or set of columns. It's useful for determining the diversity or variability of values within a variable.

midwest %>% 
  group_by(county) %>% 
  summarize(x = n()) %>% 
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows

Problem C part 3:

midwest %>% 
  group_by(county) %>% 
  summarize(x = n_distinct(county)) %>% 
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# ℹ 310 more rows
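
To make the difference between n() and n_distinct() easier to see, here is a small sketch on made-up data. The tibble toy and its values are hypothetical and only stand in for the county and state columns of midwest.

```r
library(dplyr)

# Hypothetical toy data: ADAMS appears three times across two states
toy <- tibble(
  county = c("ADAMS", "ADAMS", "ADAMS", "ALCONA"),
  state  = c("IL", "IL", "OH", "MI")
)

counts <- toy %>%
  group_by(county) %>%
  summarize(rows = n(),                        # rows in each county group
            states = n_distinct(state),        # unique states in each county group
            counties = n_distinct(county)) %>% # always 1, because each group is a single county
  ungroup()

counts
```

This is why n_distinct(county) after group_by(county) returns 1 for every row: each group contains exactly one county name, however many rows it has.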

Problem D:

diamonds %>% 
  group_by(clarity) %>%   # group rows by clarity
  summarize(a = n_distinct(color),  # number of unique colour values in each clarity group
            b = n_distinct(price),  # number of unique price values in each clarity group
            c = n()) %>%   # number of rows in each clarity group
  ungroup()  # removes the grouping
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem E part 1:

diamonds %>% 
  group_by(color, cut) %>%  # group rows by colour and cut
  summarize(m = mean(price), # mean price in each colour/cut group
            s = sd(price)) %>%   # standard deviation of price in each group
  ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

Part 3:

# - How good is the sale if the price of diamonds equaled msale? 20%?
# - e.g. The diamonds are x% off original price in msale.
diamonds %>%  # working with the diamonds data set
  group_by(cut, color, clarity) %>%  # group rows by cut, colour and clarity
  summarize(m = mean(price),  # mean price in each group
            s = sd(price), # standard deviation of price in each group
            msale = m * 0.80) %>%   # sale price: 80% of the group mean price (20% off)
  ungroup()  # removes the grouping
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows

Problem F:

diamonds %>%  # working with the diamonds data set
  group_by(cut) %>%   # group rows by cut
  summarize(potato = mean(depth),  # potato: mean depth per cut
            pizza = mean(price),  # pizza: mean price per cut
            popcorn = median(y), # popcorn: median of y per cut
            pineapple = potato - pizza, # pineapple: mean depth minus mean price
            papaya = pineapple ^ 2, # papaya: pineapple squared
            peach = n()) %>%   # peach: number of rows per cut
  ungroup()  # removes the grouping
# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem G part 1:

diamonds %>%  # diamonds data set
  group_by(color) %>%  # group rows by colour
  summarize(m = mean(price)) %>%  # mean price per colour
  mutate(x1 = str_c("Diamond color ", color), # label made by pasting "Diamond color " before each colour
         x2 = 5) %>%   # constant column
  ungroup() # removes any remaining grouping
# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5

Part 2:

What does the first ungroup() do? Is it useful here? Why/why not? ungroup() removes any grouping so that later operations run on the whole data set. It is not useful here, because there is nothing left to ungroup: summarize() on a single grouping variable already returns an ungrouped result.

Why isn't there a closing ungroup() after the mutate()? Because I don't think it's needed there: the data are no longer grouped at that point.
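
One way to check whether an ungroup() is still needed is to ask dplyr directly whether a result is grouped. The sketch below is only an illustration and assumes dplyr and ggplot2 are installed; by_color and by_color_cut are names made up for this example.

```r
library(dplyr)
library(ggplot2) # provides the diamonds data set

# One grouping variable: summarize() drops it, so the result is ungrouped
by_color <- diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price))

is_grouped_df(by_color)

# Two grouping variables: only the last one (cut) is dropped,
# so the result is still grouped by color
by_color_cut <- diamonds %>%
  group_by(color, cut) %>%
  summarize(m = mean(price), .groups = "drop_last")

group_vars(by_color_cut)
```

With more than one grouping variable the summary is still grouped, which is where a closing ungroup() (or .groups = "drop") does make a difference.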

Problem H part 1:

diamonds %>%  # working with the diamonds dataset
  group_by(color) %>%  # group rows by colour 
  mutate(x1 = price * 0.5) %>%  # new variable: half of each diamond's price (calculated row by row, so grouping does not change it)
  summarize(m = mean(x1)) %>%  # mean of the halved price within each colour group
  ungroup()  # removes the grouping
# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.

Part 2:

# part II
# What's the difference between part I and II?
diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% 
  ungroup() %>%  
  summarize(m = mean(x1)) 
# A tibble: 1 × 1
      m
  <dbl>
1 1966.

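What's the difference between part 1 and 2? In part 2, ungroup() comes before summarize(), so the grouping by colour is removed and the mean is calculated across all diamonds. The result is a single row instead of one row per colour.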

6.6.2 Questions

Why is grouping data necessary? To specify the groups (the values of one or more variables) within which you want calculations to be carried out.

Why is ungrouping data necessary? To clear the grouping you set, so that later operations are not carried out per group by accident, or so that a new grouping can be set.

When should you ungroup data? When you have finished the grouped calculations, or before working with a different grouping.

If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()? I don't believe so, since ungroup() only matters when the data have been grouped.
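
The last answer can be checked with a quick sketch. The tibble toy below is hypothetical and only there for illustration; group_vars() lists the grouping variables of a data frame.

```r
library(dplyr)

# Hypothetical toy tibble that has never been grouped
toy <- tibble(id = 1:4, score = c(0.2, 0.4, 0.6, 0.8))

# mutate() without group_by(): no grouping variables are attached
plain <- toy %>% mutate(newVar = 1 + 2)
group_vars(plain)

# ungroup() leaves an ungrouped tibble unchanged
same <- identical(ungroup(plain), plain)
same

# After group_by(), mutate() keeps the grouping until ungroup() is called
grouped <- toy %>% group_by(id) %>% mutate(newVar = 1 + 2)
group_vars(grouped)
```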

Research methods

Good question:

Does the cut variable have a bigger impact on the price of diamonds than other variables do?

Bad question:

Do diamonds sell?

Data Analysis

Loading the data

library(tidyverse)
library(modeldata)

Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':

    penguins
# The basics of visualising data.

ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point(color = "gold1",
             size = 2,
             alpha = .3,
             shape = "square") +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

# Adding another layer


ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'

ggplot(crickets, aes(x = temp, #creating a plot of the cricket data 
                     y = rate,
                     color = species)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Oranges")
`geom_smooth()` using formula = 'y ~ x'

# Other plots

ggplot(crickets, aes(x = rate)) + 
  geom_histogram(bins = 15) # one quantitative variable

ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15)

ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black",
           fill = "gold1")

ggplot(crickets, aes(x = species, 
                     fill = species)) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "PuOr")

ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "RdPu") +
  theme_dark() #changes background theme

?theme_dark()
starting httpd help server ... done

Research Methods

What is a good hypothesis?

A good hypothesis is grounded in observation and must be testable: it cannot be something you already know will be accepted, and there has to be a real possibility of it being either rejected or accepted. It is still an educated “guess”, in which background reading done before any tests are carried out provides enough supporting or inferred evidence to make an informed prediction.

Week 5

Example graphs, identifying which tests can be associated with them, and why.

To be honest, I'm not 100% sure yet, but I'm trying to get used to identifying which statistical test to use when analysing data.

Graph1

For graph 1, we can assume normality, as the sample size is larger than 30. A parametric test is therefore more suitable. The study design is unpaired, because there is only one measurement for each individual. There are 3 groups, so a one-way ANOVA seems appropriate.

Graph2

This density plot compares petal length between 3 iris species. There is only a single measurement per individual, so the data are unpaired. There are 3 groups and the data are not normally distributed, so a Kruskal-Wallis test should be used.

Graph3

Graph4
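
As a rough sketch, the two tests chosen above for graphs 1 and 2 could be run in R along these lines. It assumes graph 1 shows Sepal.Length by Species and graph 2 shows Petal.Length by Species from the built-in iris data; that mapping is an assumption based on the recreated plots further down. The object names (normality_p, anova_fit, kw_result) are only illustrative.

```r
# Sketch only: assumes graph 1 is Sepal.Length by Species and graph 2 is
# Petal.Length by Species, both from the built-in iris data.
data(iris)

# Shapiro-Wilk normality check within each species (one p-value per group)
normality_p <- tapply(iris$Petal.Length, iris$Species,
                      function(v) shapiro.test(v)$p.value)
normality_p

# Graph 1: one-way ANOVA (parametric, unpaired, 3 groups)
anova_fit <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_fit)

# Graph 2: Kruskal-Wallis test (non-parametric, unpaired, 3 groups)
kw_result <- kruskal.test(Petal.Length ~ Species, data = iris)
kw_result
```

A small p-value from either test would suggest that at least one species differs from the others. Neither test says which one, so a follow-up comparison would still be needed.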

Recreating graphs

ggplot(iris, aes(x = Species,
                 y = Sepal.Length,
                 color = Species)) +
  geom_boxplot() +
  theme_dark() +
  scale_color_manual(values=c("gold1","black","orange")) +
  scale_fill_manual(values=c("gold1","black","orange")) +
  labs(x = "Species",
       y = "Sepal Length",
       title = "Sepal length of 3 Iris species")

This is the recreation of graph 1.

ggplot(iris, aes(x = Petal.Length,
                 fill = Species)) +
  geom_density(alpha = 0.3) +
  theme_dark() +
  scale_color_manual(values=c("gold1","black","orange")) +
  scale_fill_manual(values=c("gold1","black","orange")) +
  labs(x = "Petal Length",
       y = "Density",
       title = "Petal length of 3 Iris species")

This is a recreation of graph 2.

ggplot(iris, aes(x = Petal.Length,
                 y = Petal.Width)) +
  geom_point(aes(colour = Species,
                 shape = Species)) +
  geom_smooth(method = "lm", color = "blue") +
  theme_dark() +
  scale_color_manual(values=c("gold1", "black", "orange")) +
  scale_fill_manual(values=c("gold1", "black", "orange")) +
  labs(x = "Petal Length",
       y = "Petal Width",
       title = "Petal width, and length for Iris species")
`geom_smooth()` using formula = 'y ~ x'

Recreation of graph 3.

data(iris)
iris.new <-
  mutate(iris, size=ifelse(Sepal.Length < median(Sepal.Length),
                           "small", "big"))
ggplot(iris.new, aes(x = Species, color=size, fill=size)) +
  geom_bar(position="dodge") +
  theme_dark() +
  scale_color_manual(values=c("gold1", "black")) +
  scale_fill_manual(values=c("gold1", "black")) +
  labs(x = "Species",
       y = "Count",
       title = "Sepal size of Iris species")