Workbook

Author

Emily Sankey

Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Running Code

When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:

Code

1 + 1

[1] 2

You can add options to executable code like this:

[1] 4

The echo: false option disables the printing of code (only output is displayed).
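The chunk that produced the [1] 4 output is hidden in the rendered page, so its source is not visible here. A plausible sketch, assuming the hidden chunk was a simple 2 * 2 calculation (the variable name result is only illustrative), would look like this. The #| echo: false line is the chunk option; R itself treats it as an ordinary comment.

```r
#| echo: false
# Assumed chunk body: the rendered output only tells us that the result is 4.
result <- 2 * 2
result
```

With echo: false set, only the printed value appears in the rendered document, not the two lines of code.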

italics - bullet - points

Mutate

mutate() can be used to create variables based on existing variables from the dataset. You can create multiple columns at once, separating each new variable with a comma.

In order to save changes, you must assign the result to an object using <-. It is good practice to save your changes to a dataset under a new object name (e.g., diamonds.new) each time you make changes rather than saving over the original dataset name (e.g., diamonds).

Nesting

Nesting is where one function (mean()) is called inside another function (mutate()), so that mean() “nests” inside mutate().
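As a minimal sketch of these ideas (several new columns in one mutate() call, mean() nested inside mutate(), and saving under a new object name), the following uses the midwest dataset from ggplot2. The object name midwest.new and the column names are illustrative choices, not names required by dplyr.

```r
library(dplyr)
library(ggplot2)  # provides the midwest dataset

# Save the result under a new name so the original midwest stays untouched.
midwest.new <- midwest %>%
  mutate(
    avg.pop.den = mean(popdensity),            # mean() nested inside mutate()
    perc.adult  = popadults / poptotal * 100   # a second column, separated by a comma
  )

ncol(midwest)      # column count of the original dataset
ncol(midwest.new)  # two more columns than the original
```

Because mean(popdensity) returns a single number, mutate() repeats it in every row, whereas perc.adult is calculated row by row.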

6.1.1.0.1 Exercises (midwest data set)

To add a new column: calculate the average population density

Code

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.2.3

Warning: package 'ggplot2' was built under R version 4.2.3

Warning: package 'tibble' was built under R version 4.2.3

Warning: package 'tidyr' was built under R version 4.2.3

Warning: package 'readr' was built under R version 4.2.3

Warning: package 'purrr' was built under R version 4.2.3

Warning: package 'dplyr' was built under R version 4.2.3

Warning: package 'stringr' was built under R version 4.2.3

Warning: package 'forcats' was built under R version 4.2.3

Warning: package 'lubridate' was built under R version 4.2.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(ggplot2)  # This is where midwest is located
midwest <- ggplot2::midwest  # Load the midwest dataset

avg_density <- mean(midwest$popdensity)
midwest$avg.pop.den <- avg_density
head(midwest)

# A tibble: 6 × 29
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 20 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>

To calculate the average area for the entire dataset

Code

avg.area <- mean(midwest$area)  # stored as a single value; it is not added to midwest, so head() below is unchanged
head(midwest)

# A tibble: 6 × 29
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 20 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>

To calculate total number of adults in dataset

Code

total_adults <- sum(midwest$popadults)
midwest$totadult <- total_adults
head(midwest)

# A tibble: 6 × 30
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 21 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
#   totadult <int>

To calculate the difference between the total population and the white population

Code

midwest$tot.minus.white <- midwest$poptotal - midwest$popwhite
## alternatively without using poptotal and popwhite
midwest$tot.minus.white <- midwest$popblack + midwest$popamerindian + midwest$popasian + midwest$popother
head(midwest)

# A tibble: 6 × 31
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 22 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
#   totadult <int>, tot.minus.white <int>

To calculate the ratio of the child poverty percentage to the adult poverty percentage

Code

midwest$child.to.adult <- midwest$percchildbelowpovert / midwest$percadultpoverty
head(midwest)

# A tibble: 6 × 32
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 23 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
#   totadult <int>, tot.minus.white <int>, child.to.adult <dbl>

Code

## Why are the values in child.to.adult all different? Because each observation (row) in the dataset has its own ratio of the percentage of children below the poverty line to the percentage of adults in poverty.

midwest$ratio.adult <- midwest$popadults / midwest$poptotal
head(midwest)

# A tibble: 6 × 33
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 24 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
#   totadult <int>, tot.minus.white <int>, child.to.adult <dbl>, …

To calculate the percentage of each county's total population that are adults

Code

midwest$perc.adult <- (midwest$popadults / midwest$poptotal) * 100
head(midwest)

# A tibble: 6 × 34
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 25 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
#   totadult <int>, tot.minus.white <int>, child.to.adult <dbl>, …

Exercises (presidential data set)

To load dataset

Code

library(ggplot2)  # presidential ships with ggplot2, not with the datasets package
data("presidential")
head(presidential) # to look at the data itself

# A tibble: 6 × 4
  name       start      end        party     
  <chr>      <date>     <date>     <chr>     
1 Eisenhower 1953-01-20 1961-01-20 Republican
2 Kennedy    1961-01-20 1963-11-22 Democratic
3 Johnson    1963-11-22 1969-01-20 Democratic
4 Nixon      1969-01-20 1974-08-09 Republican
5 Ford       1974-08-09 1977-01-20 Republican
6 Carter     1977-01-20 1981-01-20 Democratic

Code

str(presidential) # to understand the structure and types of the dataset

tibble [12 × 4] (S3: tbl_df/tbl/data.frame)
 $ name : chr [1:12] "Eisenhower" "Kennedy" "Johnson" "Nixon" ...
 $ start: Date[1:12], format: "1953-01-20" "1961-01-20" ...
 $ end  : Date[1:12], format: "1961-01-20" "1963-11-22" ...
 $ party: chr [1:12] "Republican" "Democratic" "Democratic" "Republican" ...

To create a duration column by calculating the difference in days

Code

presidential$start <- as.Date(presidential$start)
presidential$end <- as.Date(presidential$end)
presidential$duration <- as.numeric(presidential$end - presidential$start)
head(presidential)

# A tibble: 6 × 5
  name       start      end        party      duration
  <chr>      <date>     <date>     <chr>         <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican     2922
2 Kennedy    1961-01-20 1963-11-22 Democratic     1036
3 Johnson    1963-11-22 1969-01-20 Democratic     1886
4 Nixon      1969-01-20 1974-08-09 Republican     2027
5 Ford       1974-08-09 1977-01-20 Republican      895
6 Carter     1977-01-20 1981-01-20 Democratic     1461
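The same duration column can also be written with mutate(), as described in the Mutate section. This is a sketch rather than the workbook's own method; presidential.new is an illustrative object name. Subtracting one Date from another gives a time difference in days, which as.numeric() turns into a plain number.

```r
library(dplyr)
library(ggplot2)  # provides the presidential dataset

presidential.new <- presidential %>%
  mutate(duration = as.numeric(end - start))  # Date minus Date is a difference in days

head(presidential.new)
```

Since start and end are already stored as dates in this dataset, no as.Date() conversion is needed before subtracting.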

Exercises (economics dataset)

To load data

Code

library(ggplot2)  # economics ships with ggplot2, not with the datasets package
data("economics")
head(economics)

# A tibble: 6 × 6
  date         pce    pop psavert uempmed unemploy
  <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
1 1967-07-01  507. 198712    12.6     4.5     2944
2 1967-08-01  510. 198911    12.6     4.7     2945
3 1967-09-01  516. 199113    11.9     4.6     2958
4 1967-10-01  512. 199311    12.9     4.9     3143
5 1967-11-01  517. 199498    12.8     4.7     3066
6 1967-12-01  525. 199657    11.8     4.8     3018

To calculate the % of the population that is unemployed

Code

economics$perc.unemploy <- (economics$unemploy / economics$pop) * 100 ##multiplying by 100 turns the ratio into a %.
head(economics)

# A tibble: 6 × 7
  date         pce    pop psavert uempmed unemploy perc.unemploy
  <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>         <dbl>
1 1967-07-01  507. 198712    12.6     4.5     2944          1.48
2 1967-08-01  510. 198911    12.6     4.7     2945          1.48
3 1967-09-01  516. 199113    11.9     4.6     2958          1.49
4 1967-10-01  512. 199311    12.9     4.9     3143          1.58
5 1967-11-01  517. 199498    12.8     4.7     3066          1.54
6 1967-12-01  525. 199657    11.8     4.8     3018          1.51

Exercises (TX housing dataset)

To load the dataset

Code

library(ggplot2)
data("txhousing")
head(txhousing)

# A tibble: 6 × 9
  city     year month sales   volume median listings inventory  date
  <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.

To calculate the success rate (sales as a % of listings)

Code

txhousing$successrate <- (txhousing$sales / txhousing$listings) * 100
head(txhousing)

# A tibble: 6 × 10
  city     year month sales   volume median listings inventory  date successrate
  <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>       <dbl>
1 Abilene  2000     1    72  5380000  71400      701       6.3 2000         10.3
2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.        13.1
3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.        16.6
4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.        12.5
5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.        17.8
6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.        20

To calculate the % of listed houses that do not sell

Code

txhousing$failrate <- ((txhousing$listings - txhousing$sales) / txhousing$listings) * 100
head(txhousing)

# A tibble: 6 × 11
  city     year month sales   volume median listings inventory  date successrate
  <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>       <dbl>
1 Abilene  2000     1    72  5380000  71400      701       6.3 2000         10.3
2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.        13.1
3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.        16.6
4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.        12.5
5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.        17.8
6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.        20  
# ℹ 1 more variable: failrate <dbl>
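A quick way to sanity-check the two rates is to confirm that they add up to 100 in every row where both are known. The sketch below builds both columns in one mutate() call; txhousing.new and check are illustrative names.

```r
library(dplyr)
library(ggplot2)  # provides the txhousing dataset

txhousing.new <- txhousing %>%
  mutate(
    successrate = sales / listings * 100,
    failrate    = (listings - sales) / listings * 100,
    check       = successrate + failrate  # should be 100 wherever both rates are known
  )

# Rows with missing sales or listings give NA, so ignore them when checking.
range(txhousing.new$check, na.rm = TRUE)
```

If both ends of the range are 100 (up to rounding), the fail rate is simply 100 minus the success rate.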

6.6.1

Problem A

Code

midwest %>% # start with the dataset; the pipe passes the output of one function to the next as its input
  group_by(state) %>% # groups the data by the state column
  summarize(poptotalmean = mean(poptotal), # mean of the population total for each state
            poptotalmed = median(poptotal), # median of the population total
            popmax = max(poptotal), # maximum value of the population total
            popmin = min(poptotal), # minimum value of the population total
            popdistinct = n_distinct(poptotal), # number of distinct values in the population total column
            popfirst = first(poptotal), # first value of the population total column for each state
            popany = any(poptotal < 5000), # TRUE if any county in the state has a population below 5000
            popany2 = any(poptotal > 2000000)) %>% # TRUE if any county in the state has a population above 2 million
  ungroup() # removes the grouping structure from the data frame; useful if later operations should not be done by group

# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>

Problem B

Code

midwest %>% 
  group_by(state) %>% 
  summarize(num5k = sum(poptotal < 5000), # one summary row per state; sum() of a logical condition counts the rows where it is TRUE
            num2mil = sum(poptotal > 2000000),
            numrows = n()) %>% 
  ungroup()

# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72

Code

# num5k = number of counties in each state where the population is less than 5000
# num2mil = number of counties in each state where the population is greater than 2 million
# numrows = total number of counties (rows) in each state group
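The reason sum() can count rows is that R treats TRUE as 1 and FALSE as 0. A small base R sketch with made-up population values (not taken from midwest) shows the difference between sum() and any() on the same condition:

```r
# Hypothetical county populations (made-up values for illustration only)
pops <- c(4200, 15000, 66000, 2500000)

pops < 5000          # a logical vector, one TRUE/FALSE per county
sum(pops < 5000)     # counts how many values pass the test
any(pops < 5000)     # only asks whether at least one value passes
sum(pops > 2000000)  # counts the values above 2 million
length(pops)         # plays the role of n(): total number of values
```

This is why Problem A returns TRUE/FALSE columns while Problem B returns counts for the same conditions.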

Problem C

Code

# part I
midwest %>% 
  group_by(county) %>% # groups data by the county column
  summarize(x = n_distinct(state)) %>% # number of distinct states in which each county name occurs
  arrange(desc(x)) %>% # sorts the summarised data in descending order of x
  ungroup()

# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows

Code

# part II
# How does n() differ from n_distinct()? = n() counts the total number of observations (rows) in the current group, whereas n_distinct() counts the number of unique (distinct) values of a specified column within the current group
# When would they be the same? different? = The same when every row in the group has a distinct value in the column being counted; different when several rows share the same value in that column
midwest %>% 
  group_by(county) %>% 
  summarize(x = n()) %>% # calculates total number of rows or observations for each county and stores in x
  ungroup()

# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows

Code

# part III
# hint: 
# - How many distinctly different counties are there for each county? = grouping by county gives 1 row per county name, and each group contains exactly one distinct county name, so x is 1
# - Can there be more than 1 (county) county in each county? = no; a group can hold several rows (the same county name occurs in several states), but they all share one county name
# - What if we replace 'county' with 'state'? = replacing it only inside n_distinct() counts the states that share each county name (as in part I); replacing it in group_by() as well gives 1 for every state
midwest %>% 
  group_by(county) %>% 
  summarize(x = n_distinct(county)) %>% # number of distinct county names within each county group, which is always 1
  ungroup()

# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# ℹ 310 more rows
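The three cases in Problem C can be seen side by side on a tiny made-up table. This is only a sketch: toy, g and value are hypothetical names, and the numbers are chosen so that one group contains a repeated value.

```r
library(dplyr)

# Made-up data: group "a" has three rows but only two different values.
toy <- tibble(g = c("a", "a", "a", "b"),
              value = c(1, 1, 2, 5))

counts <- toy %>%
  group_by(g) %>%
  summarize(rows = n(),                    # number of rows in each group
            distinct = n_distinct(value),  # number of different values in each group
            self = n_distinct(g))          # the grouping column itself: always 1
counts
```

For group "a", rows is larger than distinct because the value 1 appears twice; for group "b" the two agree.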

Problem D

Code

diamonds %>% 
  group_by(clarity) %>% # groups dataset by clarity column
  summarize(a = n_distinct(color), # number of distinct colours within each clarity group
            b = n_distinct(price), # number of distinct prices within each clarity group
            c = n()) %>% # total number of diamonds within each clarity group
  ungroup()

# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem E

Code

# part I
diamonds %>% 
  group_by(color, cut) %>% # groups dataset by color and cut
  summarize(m = mean(price), # mean price of diamonds for each combination of colour and cut
            s = sd(price)) %>% # standard deviation of price for each combination of colour and cut
  ungroup()

`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.

# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

Code

# part II
diamonds %>% 
  group_by(cut, color) %>% # same groups and same statistics as part I; only the column order and row order change
  summarize(m = mean(price),
            s = sd(price)) %>% 
  ungroup()

`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.

# A tibble: 35 × 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# ℹ 25 more rows

Code

# part III
# hint: 
# - How good is the sale if the price of diamonds equaled msale? = msale is 80% of the mean price in each group, so a diamond priced at msale is 20% off its group's mean price
# - e.g. The diamonds are x% off original price in msale.
diamonds %>% 
  group_by(cut, color, clarity) %>% # groups dataset by cut, color and clarity
  summarize(m = mean(price),
            s = sd(price),
            msale = m * 0.80) %>% # sale price that is 20% off the mean price
  ungroup()

`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.

# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows
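The message printed in Problem E mentions a .groups argument. As a sketch of how it can be used, setting .groups = "drop" inside summarize() removes all grouping in one step, so the message disappears and the trailing ungroup() is no longer needed. The object name by_cut_color is illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

by_cut_color <- diamonds %>%
  group_by(cut, color) %>%
  summarize(m = mean(price),
            s = sd(price),
            msale = m * 0.80,   # sale price: 20% off the group mean
            .groups = "drop")   # drop all grouping, so ungroup() is not needed

group_vars(by_cut_color)  # lists any remaining grouping columns
```

An empty result from group_vars() confirms that the summary is no longer grouped.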

Problem F

Code

diamonds %>% 
  group_by(cut) %>% 
  summarize(potato = mean(depth), # mean depth of diamonds for each cut category
            pizza = mean(price), # mean price of diamonds for each cut category
            popcorn = median(y), # median of y, the diamond's width in mm
            pineapple = potato - pizza, # difference between mean depth and mean price (uses columns created earlier in the same summarize())
            papaya = pineapple ^ 2, # squares the value of pineapple
            peach = n()) %>% # total number of diamonds in each cut category
  ungroup()

# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem G

Code

# part I
diamonds %>% 
  group_by(color) %>% # groups dataset by the color variable
  summarize(m = mean(price)) %>% # mean price of diamonds for each colour group
  mutate(x1 = str_c("Diamond color ", color), 
         x2 = 5) %>% # adds two columns to the summarised data: x1 joins the string "Diamond color " to the colour value, and x2 is 5 in every row
  ungroup()

# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5

Code

# part II
# What does the first ungroup() do? Is it useful here? Why/why not? = it removes any grouping left after summarize(). Here summarize() already drops the only grouping variable (color), so this ungroup() changes nothing; it is a harmless safeguard that guarantees later steps are not affected by grouping
# Why isn't there a closing ungroup() after the mutate()? = the data is already ungrouped before mutate(), and mutate() does not add grouping, so another ungroup() is not needed.
diamonds %>% 
  group_by(color) %>% 
  summarize(m = mean(price)) %>% 
  ungroup() %>% # ungroup before mutate: (1) summarise mean price by colour, (2) ungrouping summarised data, (3) adding new columns (x1 & x2)
  mutate(x1 = str_c("Diamond color ", color),
         x2 = 5)

# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5
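Whether data is still grouped can be checked directly rather than guessed. A short sketch using dplyr's is_grouped_df() on the Problem G summary (by_color is an illustrative object name):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

by_color <- diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price))

is_grouped_df(group_by(diamonds, color))  # grouped data, for comparison
is_grouped_df(by_color)                   # the summary after summarize()
```

With a single grouping variable, summarize() removes that grouping on its own, which is why parts I and II of Problem G print identical tables.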

Problem H

Code

# part I
diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% # creates new column (x1) that contains half the value of the price for each diamond
  summarize(m = mean(x1)) %>% # calculates mean of new column for each colour group stored in a new column named 'm'
  ungroup()

# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.

Code

# part II
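# What's the difference between part I and II? = in part II ungroup() comes before summarize(), so the mean is taken over the whole dataset and the result is a single row instead of one row per colour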
diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% 
  ungroup() %>%  # removed grouping structure before summarising
  summarize(m = mean(x1))

# A tibble: 1 × 1
      m
  <dbl>
1 1966.

Why is grouping data necessary? - It is crucial for performing calculations and analyses within subsets of the data.

Why is ungrouping data necessary? - It ensures that further operations do not unintentionally respect previous groupings.

When should you ungroup data? - After summary operations, or before applying transformations that should not be group-specific.

If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()? - No. If there is no grouping in the code, ungroup() is not required.
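The effect of forgetting ungroup() shows up most clearly with mutate(), which keeps the grouping it was given. In this sketch (still_grouped, ungrouped, half and avg are illustrative names) the same mean() call gives different answers depending on whether the grouping is still active.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

still_grouped <- diamonds %>%
  group_by(color) %>%
  mutate(half = price * 0.5) %>%
  mutate(avg = mean(half))   # grouping is still active: one mean per colour

ungrouped <- diamonds %>%
  group_by(color) %>%
  mutate(half = price * 0.5) %>%
  ungroup() %>%
  mutate(avg = mean(half))   # grouping removed: one overall mean

n_distinct(still_grouped$avg)  # number of different means when grouped
n_distinct(ungrouped$avg)      # number of different means when ungrouped
```

The grouped version holds one mean per colour, while the ungrouped version holds a single overall mean repeated in every row.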

Chapter 6.7: Extra Practice

Code

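library(dplyr)
View(diamonds) # opens the data viewer (interactive sessions only)
diamonds_sorted_low_to_high <- diamonds %>%
  arrange(price) # lowest price first
diamonds_sorted_high_to_low <- diamonds %>%
  arrange(desc(price)) # highest price first
diamonds_sorted_low_price_cut <- diamonds %>%
  arrange(price, cut) # lowest price first, ties ordered by cut
diamonds_sorted_high_price_cut <- diamonds %>%
  arrange(desc(price), cut) # highest price first, ties ordered by cut
diamonds_sorted_price_clarity <- diamonds %>%
  arrange(price, desc(clarity)) # lowest price first, ties ordered by descending clarity
diamonds_with_sale_price <- diamonds %>%
  mutate(salePrice = price - 250) # new column: price with 250 taken off
diamonds_selected <- diamonds %>%
  select(-x, -y, -z) # drops the x, y and z columns
diamonds_count_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(count = n()) # number of diamonds in each cut category
diamonds_with_total_num <- diamonds %>%
  mutate(totalNum = n()) # total number of rows, repeated in every row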
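The group_by() plus summarise(count = n()) pattern is common enough that dplyr offers count() as a shortcut. The sketch below checks that both routes give the same counts; counts_long and counts_short are illustrative names.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Long form, as used in the extra practice
counts_long <- diamonds %>%
  group_by(cut) %>%
  summarise(count = n())

# Shortcut: count() groups, counts and ungroups in one call
counts_short <- diamonds %>%
  count(cut, name = "count")

identical(counts_long$count, counts_short$count)
```

The name argument only sets the name of the count column; without it, count() calls the column n.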

Exercise for research methods

Good question = How does the quality of the diamond impact the price?

Bad question = Which diamond is the cheapest?

Week 4 post-session

Histogram

Code

library(tidyverse)
library(ggplot2)
options(repos = c(CRAN = "https://cloud.r-project.org/"))
install.packages("modeldata")

Installing package into 'C:/Users/sanke/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)


  There is a binary version available but the source version is later:
          binary source needs_compilation
modeldata  1.3.0  1.4.0             FALSE

installing the source package 'modeldata'

Code

library(modeldata)
View(crickets)

## x & y axis are set in aes(); colour the points by species
## Add titles and a caption with labs(), then a colour palette
## Every layer must be joined to the plot with +; run on their own, labs() and scale_color_brewer() only print their objects
ggplot(crickets, aes(x = temp,
                     y = rate,
                     color = species)) +
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")

Code

## Modifying basic properties of the plot

ggplot(crickets, aes(x = temp,
                     y = rate)) +
  geom_point(color = "red",
             size = 3,
             alpha = .3,
             shape = "square") +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "McDonald (2009)")

Code

## Learn more about options for geom_
# with ?geom_point

## Adding another layer 
# Regression line 
# lm = linear model
# se = standard error

ggplot(crickets, aes(x = temp,
                     y = rate)) +
  geom_point() +
  geom_smooth(method = "lm",
             se = FALSE) +
  labs(x = "Temperature",
     y = "Chirp rate",
     title = "Cricket chirps",
     caption = "McDonald (2009)")

`geom_smooth()` using formula = 'y ~ x'
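
geom_smooth(method = "lm") draws the straight line from a linear model of rate on temp. As a sketch, the same model can be fitted directly with lm() to read off the intercept and slope. The name chirp_fit is introduced here for illustration only.

```r
library(modeldata)  # provides the crickets data set

# Fit the same straight line that geom_smooth(method = "lm") draws
chirp_fit <- lm(rate ~ temp, data = crickets)

coef(chirp_fit)     # intercept and slope (change in chirp rate per degree)
summary(chirp_fit)  # standard errors, t-tests and R-squared
```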

Code

## add layer

ggplot(crickets, aes(x = temp,
                     y = rate,
                     color = species)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
     y = "Chirp rate",
     color = "Species",
     title = "Cricket chirps",
     caption = "McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")

`geom_smooth()` using formula = 'y ~ x'

Code

## Other plots

ggplot(crickets, aes(x = rate)) +
  geom_histogram(bins = 15) # one quantitative variable

Code

ggplot(crickets, aes(x = rate)) +
  geom_freqpoly(bins = 15)

Code

ggplot(crickets, aes(x = species)) +
  geom_bar(color = "black", #boundary colour
           fill = "lightblue")

Code

ggplot(crickets, aes(x = species,
                     fill = species)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")

Code

# Bar chart - for a single categorical variable
# Histogram - for a single quantitative variable
# Scatterplot - for 2 quantitative variables
# Boxplot - for 1 quantitative & 1 categorical variable 

ggplot(crickets, aes(x = species,
                     y = rate,
                     color = species)) +
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() #remove grey background

Code

?theme_minimal() #options for complete themes

starting httpd help server ...

 done

Code

# Faceting

# not great:
ggplot(crickets, aes(x = rate,
                     fill = species)) +
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2")

Code

ggplot(crickets, aes(x = rate,
                     fill = species)) +
  geom_histogram(bins = 15,
                 show.legend = FALSE) +
  facet_wrap(~species) + # one panel per species
  scale_fill_brewer(palette = "Dark2")

Code

?facet_wrap # help file for facets 


ggplot(crickets, aes(x = rate,
                     fill = species)) +
  geom_histogram(bins = 15,
                 show.legend = FALSE) +
  facet_wrap(~species,
             ncol = 1) +
  scale_fill_brewer(palette = "Dark2")

Code

file.rename("Quarto-Workbook-exercises.rmarkdown", "Quarto_Workbook_exercises.rmd")

[1] TRUE

Exercise for research methods

What is a good research hypothesis? A good research hypothesis is a specific, measurable, testable statement that predicts a relationship between variables. It should guide the choice of research method and be formulated so that data collection and analysis can either support or refute it.

Week 5 post-session

Code

library(tidyverse)
library(ggplot2)
data("iris")
View(iris)
ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) +
  geom_boxplot(fill = NA) +  # Set fill to NA for no fill
  labs(title = "Boxplot of Sepal Length of Different Species",
       x = "Species",
       y = "Sepal Length") +
  theme_minimal() +
  scale_color_manual(values = c("setosa" = "red", 
                                 "versicolor" = "green", 
                                 "virginica" = "blue")) +
  theme(legend.title = element_blank())

Code

# Statistical tests that can be associated with boxplots:
# 1. T-test
# 2. ANOVA
# 3. Kruskal-Wallis
# 4. Mann-Whitney U
# 5. Post-hoc tests (e.g., Tukey's HSD after ANOVA)
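
As a sketch of how some of the tests listed above could be run on the boxplot data, the block below compares sepal length across the three species using base R only. The object names sepal_aov and sepal_kw are illustrative, and the choice of these particular tests is an assumption rather than part of the original exercise.

```r
# iris ships with base R, so no extra packages are needed

# One-way ANOVA: do mean sepal lengths differ between species?
sepal_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(sepal_aov)

# Post-hoc pairwise comparisons, usually run after a significant ANOVA
TukeyHSD(sepal_aov)

# Kruskal-Wallis: rank-based alternative that does not assume normality
sepal_kw <- kruskal.test(Sepal.Length ~ Species, data = iris)
sepal_kw
```

A t-test or Mann-Whitney U test would apply in the same way when only two groups are compared.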

Code

data("iris")
View(iris)
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.5) +  # Create the density plot with transparency
  labs(x = "Petal.Length", y = "Density") + 
  theme_minimal()

Code

# Statistical tests that can be associated with density plots:
# 1. Kolmogorov-Smirnov 
# 2. Mann-Whitney U
# 3. Shapiro-Wilk
# 4. Bartlett's
# 5. Kruskal-Wallis
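
A minimal sketch of three of the tests listed above, applied to the petal length data behind the density plot. It uses base R only. The object names are illustrative, and comparing versicolor with virginica is an arbitrary choice made for this example.

```r
# Shapiro-Wilk: is petal length compatible with a normal distribution?
petal_sw <- shapiro.test(iris$Petal.Length)
petal_sw

# Bartlett's test: are the variances of petal length equal across species?
petal_bt <- bartlett.test(Petal.Length ~ Species, data = iris)
petal_bt

# Two-sample Kolmogorov-Smirnov: do two species share the same distribution?
versicolor <- iris$Petal.Length[iris$Species == "versicolor"]
virginica  <- iris$Petal.Length[iris$Species == "virginica"]
# Tied values can trigger a warning about the p-value, so it is suppressed here
petal_ks <- suppressWarnings(ks.test(versicolor, virginica))
petal_ks
```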

Code

data("iris")
View(iris)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = TRUE) +  # Add separate regression lines with confidence intervals
  labs(x = "Petal.Length", y = "Petal.Width") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Code

# Statistical tests that can be associated with scatterplots:
# 1. Pearson correlation
# 2. Spearman rank correlation
# 3. Kendall's tau
# 4. Linear regression (t-test on the slope)
# 5. ANCOVA (comparing slopes between groups)
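
A scatterplot of two quantitative variables is commonly paired with a correlation test. The sketch below, using base R only, tests the relationship between petal length and petal width shown in the plot above. The names petal_cor and petal_rho are illustrative.

```r
# Pearson correlation: strength of the linear relationship
petal_cor <- cor.test(iris$Petal.Length, iris$Petal.Width)
petal_cor

# Spearman rank correlation: monotonic relationship, no normality assumption
# Tied values prevent an exact p-value, which raises a warning
petal_rho <- suppressWarnings(
  cor.test(iris$Petal.Length, iris$Petal.Width, method = "spearman")
)
petal_rho
```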

Code

data("iris")
View(iris)
data <- data.frame(
  Species = c(rep("setosa", 2), rep("versicolor", 2), rep("virginica", 2)),
  size = c("big", "small", "big", "small", "big", "small"),
  count = c(1, 50, 30, 25, 50, 5)
)
ggplot(data, aes(x = Species, y = count, fill = size)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(y = "count", x = "Species") +
  theme_minimal() +
  scale_fill_manual(values = c("big" = "salmon", "small" = "lightblue"))

Code

iris %>%
mutate(size=ifelse(Sepal.Length < median(Sepal.Length),
  "small", "big"))

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  size
1            5.1         3.5          1.4         0.2     setosa small
2            4.9         3.0          1.4         0.2     setosa small
3            4.7         3.2          1.3         0.2     setosa small
4            4.6         3.1          1.5         0.2     setosa small
5            5.0         3.6          1.4         0.2     setosa small
6            5.4         3.9          1.7         0.4     setosa small
7            4.6         3.4          1.4         0.3     setosa small
8            5.0         3.4          1.5         0.2     setosa small
9            4.4         2.9          1.4         0.2     setosa small
10           4.9         3.1          1.5         0.1     setosa small
11           5.4         3.7          1.5         0.2     setosa small
12           4.8         3.4          1.6         0.2     setosa small
13           4.8         3.0          1.4         0.1     setosa small
14           4.3         3.0          1.1         0.1     setosa small
15           5.8         4.0          1.2         0.2     setosa   big
16           5.7         4.4          1.5         0.4     setosa small
17           5.4         3.9          1.3         0.4     setosa small
18           5.1         3.5          1.4         0.3     setosa small
19           5.7         3.8          1.7         0.3     setosa small
20           5.1         3.8          1.5         0.3     setosa small
21           5.4         3.4          1.7         0.2     setosa small
22           5.1         3.7          1.5         0.4     setosa small
23           4.6         3.6          1.0         0.2     setosa small
24           5.1         3.3          1.7         0.5     setosa small
25           4.8         3.4          1.9         0.2     setosa small
26           5.0         3.0          1.6         0.2     setosa small
27           5.0         3.4          1.6         0.4     setosa small
28           5.2         3.5          1.5         0.2     setosa small
29           5.2         3.4          1.4         0.2     setosa small
30           4.7         3.2          1.6         0.2     setosa small
31           4.8         3.1          1.6         0.2     setosa small
32           5.4         3.4          1.5         0.4     setosa small
33           5.2         4.1          1.5         0.1     setosa small
34           5.5         4.2          1.4         0.2     setosa small
35           4.9         3.1          1.5         0.2     setosa small
36           5.0         3.2          1.2         0.2     setosa small
37           5.5         3.5          1.3         0.2     setosa small
38           4.9         3.6          1.4         0.1     setosa small
39           4.4         3.0          1.3         0.2     setosa small
40           5.1         3.4          1.5         0.2     setosa small
41           5.0         3.5          1.3         0.3     setosa small
42           4.5         2.3          1.3         0.3     setosa small
43           4.4         3.2          1.3         0.2     setosa small
44           5.0         3.5          1.6         0.6     setosa small
45           5.1         3.8          1.9         0.4     setosa small
46           4.8         3.0          1.4         0.3     setosa small
47           5.1         3.8          1.6         0.2     setosa small
48           4.6         3.2          1.4         0.2     setosa small
49           5.3         3.7          1.5         0.2     setosa small
50           5.0         3.3          1.4         0.2     setosa small
51           7.0         3.2          4.7         1.4 versicolor   big
52           6.4         3.2          4.5         1.5 versicolor   big
53           6.9         3.1          4.9         1.5 versicolor   big
54           5.5         2.3          4.0         1.3 versicolor small
55           6.5         2.8          4.6         1.5 versicolor   big
56           5.7         2.8          4.5         1.3 versicolor small
57           6.3         3.3          4.7         1.6 versicolor   big
58           4.9         2.4          3.3         1.0 versicolor small
59           6.6         2.9          4.6         1.3 versicolor   big
60           5.2         2.7          3.9         1.4 versicolor small
61           5.0         2.0          3.5         1.0 versicolor small
62           5.9         3.0          4.2         1.5 versicolor   big
63           6.0         2.2          4.0         1.0 versicolor   big
64           6.1         2.9          4.7         1.4 versicolor   big
65           5.6         2.9          3.6         1.3 versicolor small
66           6.7         3.1          4.4         1.4 versicolor   big
67           5.6         3.0          4.5         1.5 versicolor small
68           5.8         2.7          4.1         1.0 versicolor   big
69           6.2         2.2          4.5         1.5 versicolor   big
70           5.6         2.5          3.9         1.1 versicolor small
71           5.9         3.2          4.8         1.8 versicolor   big
72           6.1         2.8          4.0         1.3 versicolor   big
73           6.3         2.5          4.9         1.5 versicolor   big
74           6.1         2.8          4.7         1.2 versicolor   big
75           6.4         2.9          4.3         1.3 versicolor   big
76           6.6         3.0          4.4         1.4 versicolor   big
77           6.8         2.8          4.8         1.4 versicolor   big
78           6.7         3.0          5.0         1.7 versicolor   big
79           6.0         2.9          4.5         1.5 versicolor   big
80           5.7         2.6          3.5         1.0 versicolor small
81           5.5         2.4          3.8         1.1 versicolor small
82           5.5         2.4          3.7         1.0 versicolor small
83           5.8         2.7          3.9         1.2 versicolor   big
84           6.0         2.7          5.1         1.6 versicolor   big
85           5.4         3.0          4.5         1.5 versicolor small
86           6.0         3.4          4.5         1.6 versicolor   big
87           6.7         3.1          4.7         1.5 versicolor   big
88           6.3         2.3          4.4         1.3 versicolor   big
89           5.6         3.0          4.1         1.3 versicolor small
90           5.5         2.5          4.0         1.3 versicolor small
91           5.5         2.6          4.4         1.2 versicolor small
92           6.1         3.0          4.6         1.4 versicolor   big
93           5.8         2.6          4.0         1.2 versicolor   big
94           5.0         2.3          3.3         1.0 versicolor small
95           5.6         2.7          4.2         1.3 versicolor small
96           5.7         3.0          4.2         1.2 versicolor small
97           5.7         2.9          4.2         1.3 versicolor small
98           6.2         2.9          4.3         1.3 versicolor   big
99           5.1         2.5          3.0         1.1 versicolor small
100          5.7         2.8          4.1         1.3 versicolor small
101          6.3         3.3          6.0         2.5  virginica   big
102          5.8         2.7          5.1         1.9  virginica   big
103          7.1         3.0          5.9         2.1  virginica   big
104          6.3         2.9          5.6         1.8  virginica   big
105          6.5         3.0          5.8         2.2  virginica   big
106          7.6         3.0          6.6         2.1  virginica   big
107          4.9         2.5          4.5         1.7  virginica small
108          7.3         2.9          6.3         1.8  virginica   big
109          6.7         2.5          5.8         1.8  virginica   big
110          7.2         3.6          6.1         2.5  virginica   big
111          6.5         3.2          5.1         2.0  virginica   big
112          6.4         2.7          5.3         1.9  virginica   big
113          6.8         3.0          5.5         2.1  virginica   big
114          5.7         2.5          5.0         2.0  virginica small
115          5.8         2.8          5.1         2.4  virginica   big
116          6.4         3.2          5.3         2.3  virginica   big
117          6.5         3.0          5.5         1.8  virginica   big
118          7.7         3.8          6.7         2.2  virginica   big
119          7.7         2.6          6.9         2.3  virginica   big
120          6.0         2.2          5.0         1.5  virginica   big
121          6.9         3.2          5.7         2.3  virginica   big
122          5.6         2.8          4.9         2.0  virginica small
123          7.7         2.8          6.7         2.0  virginica   big
124          6.3         2.7          4.9         1.8  virginica   big
125          6.7         3.3          5.7         2.1  virginica   big
126          7.2         3.2          6.0         1.8  virginica   big
127          6.2         2.8          4.8         1.8  virginica   big
128          6.1         3.0          4.9         1.8  virginica   big
129          6.4         2.8          5.6         2.1  virginica   big
130          7.2         3.0          5.8         1.6  virginica   big
131          7.4         2.8          6.1         1.9  virginica   big
132          7.9         3.8          6.4         2.0  virginica   big
133          6.4         2.8          5.6         2.2  virginica   big
134          6.3         2.8          5.1         1.5  virginica   big
135          6.1         2.6          5.6         1.4  virginica   big
136          7.7         3.0          6.1         2.3  virginica   big
137          6.3         3.4          5.6         2.4  virginica   big
138          6.4         3.1          5.5         1.8  virginica   big
139          6.0         3.0          4.8         1.8  virginica   big
140          6.9         3.1          5.4         2.1  virginica   big
141          6.7         3.1          5.6         2.4  virginica   big
142          6.9         3.1          5.1         2.3  virginica   big
143          5.8         2.7          5.1         1.9  virginica   big
144          6.8         3.2          5.9         2.3  virginica   big
145          6.7         3.3          5.7         2.5  virginica   big
146          6.7         3.0          5.2         2.3  virginica   big
147          6.3         2.5          5.0         1.9  virginica   big
148          6.5         3.0          5.2         2.0  virginica   big
149          6.2         3.4          5.4         2.3  virginica   big
150          5.9         3.0          5.1         1.8  virginica   big
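
The bar chart above uses counts typed in by hand. As a sketch, the same kind of table can be derived from iris by counting the result of the mutate() step, so that the chart stays in step with the data. The names size_counts and size_plot are illustrative, and geom_col() is swapped in here as the equivalent of geom_bar(stat = "identity").

```r
library(dplyr)
library(ggplot2)

# Label each flower relative to the overall median sepal length,
# then count flowers per species and size
size_counts <- iris %>%
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big")) %>%
  count(Species, size, name = "count")

# geom_col() plots the counts exactly as given in the data
size_plot <- ggplot(size_counts, aes(x = Species, y = count, fill = size)) +
  geom_col(position = "dodge") +
  labs(x = "Species", y = "count") +
  theme_minimal() +
  scale_fill_manual(values = c("big" = "salmon", "small" = "lightblue"))

size_plot
```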

Code

# Statistical tests that can be associated with bar charts:
# 1. Chi-Square
# 2. T-test
# 3. ANOVA
# 4. Mann-Whitney U
# 5. Kruskal-Wallis
# 6. Z-test
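
As a sketch of the first test in the list, a chi-square test of independence can check whether size class depends on species. It uses base R only and reuses the small/big rule from the mutate() step. The names size, size_table and size_chisq are illustrative.

```r
# Classify each flower as in the mutate() step: below the median is "small"
size <- ifelse(iris$Sepal.Length < median(iris$Sepal.Length), "small", "big")

# Contingency table of species by size class
size_table <- table(iris$Species, size)
size_table

# Chi-square test of independence between species and size class
size_chisq <- chisq.test(size_table)
size_chisq
```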