Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
1 + 1
[1] 2
You can add options to executable code like this:
[1] 4
The echo: false option disables the printing of code (only output is displayed).
italics - bullet - points
mutate() can be used to create variables based on existing variables from the dataset. You can create multiple columns at once, separating each new variable with a comma.
In order to save changes, you must define the new dataset as an object using <-! It is good practice to save your changes to a dataset under a new object name (e.g., diamonds.new) each time you make changes rather than saving over the original dataset name (e.g., diamonds).
Functions can also be nested, where one function (mean()) is called inside another function (mutate()).
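A minimal sketch of this pattern, assuming dplyr is installed. The objects `toy` and `toy.new` and their columns are hypothetical and only stand in for a real dataset:

```r
library(dplyr)

# Hypothetical toy data standing in for a real dataset
toy <- tibble(id = 1:4, price = c(10, 20, 30, 40))

# mean() is nested inside mutate(); two columns are created at once,
# separated by a comma, and the result is saved under a new object name
toy.new <- toy %>%
  mutate(avg.price = mean(price),
         diff.from.avg = price - avg.price)

toy.new
```

Because the result is assigned to `toy.new`, the original `toy` keeps its two columns and can still be used as a starting point.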
To add a new column
To calculate average population density
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'purrr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
Warning: package 'forcats' was built under R version 4.2.3
Warning: package 'lubridate' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2) # This is where midwest is located
midwest <- ggplot2::midwest # Load the midwest dataset
avg_density <- mean(midwest$popdensity)
midwest$avg.pop.den <- avg_density
head(midwest)
# A tibble: 6 × 29
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 20 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>
To calculate the average area for the entire dataset
avg.area <- mean(midwest$area)
head(midwest)
# A tibble: 6 × 29
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 20 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>
To calculate total number of adults in dataset
total_adults <- sum(midwest$popadults)
midwest$totadult <- total_adults
head(midwest)
# A tibble: 6 × 30
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 21 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
# totadult <int>
To calculate the difference between the total population and the white population
midwest$tot.minus.white <- midwest$poptotal - midwest$popwhite
## alternatively without using poptotal and popwhite
midwest$tot.minus.white <- midwest$popblack + midwest$popamerindian + midwest$popasian + midwest$popother
head(midwest)
# A tibble: 6 × 31
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 22 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
# totadult <int>, tot.minus.white <int>
To calculate the ratio of child poverty to adult poverty
midwest$child.to.adult <- midwest$percchildbelowpovert / midwest$percadultpoverty
head(midwest)
# A tibble: 6 × 32
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 23 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
# totadult <int>, tot.minus.white <int>, child.to.adult <dbl>
## Why are the values in child.to.adult all different? Because they represent the ratio of the percentage of children below the poverty line to the percentage of adults in poverty for each observation (row) in the dataset.
midwest$ratio.adult <- midwest$popadults / midwest$poptotal
head(midwest)
# A tibble: 6 × 33
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 24 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
# totadult <int>, tot.minus.white <int>, child.to.adult <dbl>, …
To calculate the percentage of each county's total population that are adults
midwest$perc.adult <- (midwest$popadults / midwest$poptotal) * 100
head(midwest)
# A tibble: 6 × 34
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAND… IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
# ℹ 25 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
# totadult <int>, tot.minus.white <int>, child.to.adult <dbl>, …
To load dataset
library(ggplot2) # presidential ships with ggplot2, not with datasets
data("presidential")
head(presidential) # to look at the data itself
# A tibble: 6 × 4
name start end party
<chr> <date> <date> <chr>
1 Eisenhower 1953-01-20 1961-01-20 Republican
2 Kennedy 1961-01-20 1963-11-22 Democratic
3 Johnson 1963-11-22 1969-01-20 Democratic
4 Nixon 1969-01-20 1974-08-09 Republican
5 Ford 1974-08-09 1977-01-20 Republican
6 Carter 1977-01-20 1981-01-20 Democratic
str(presidential) # to understand the structure and types of the dataset
tibble [12 × 4] (S3: tbl_df/tbl/data.frame)
$ name : chr [1:12] "Eisenhower" "Kennedy" "Johnson" "Nixon" ...
$ start: Date[1:12], format: "1953-01-20" "1961-01-20" ...
$ end : Date[1:12], format: "1961-01-20" "1963-11-22" ...
$ party: chr [1:12] "Republican" "Democratic" "Democratic" "Republican" ...
To create a duration column by calculating the difference in days
presidential$start <- as.Date(presidential$start)
presidential$end <- as.Date(presidential$end)
presidential$duration <- as.numeric(presidential$end - presidential$start)
head(presidential)
# A tibble: 6 × 5
name start end party duration
<chr> <date> <date> <chr> <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican 2922
2 Kennedy 1961-01-20 1963-11-22 Democratic 1036
3 Johnson 1963-11-22 1969-01-20 Democratic 1886
4 Nixon 1969-01-20 1974-08-09 Republican 2027
5 Ford 1974-08-09 1977-01-20 Republican 895
6 Carter 1977-01-20 1981-01-20 Democratic 1461
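Subtracting one Date from another gives a difftime object, which is why the code wraps the difference in as.numeric(). A minimal base R sketch using the Kennedy row from the output; the variable names `span` and `duration` are chosen for illustration only:

```r
# Dates taken from the Kennedy row of the presidential output
start <- as.Date("1961-01-20")
end   <- as.Date("1963-11-22")

span <- end - start                            # a difftime measured in days
duration <- as.numeric(span, units = "days")   # plain number of days

class(span)
duration
```

Passing `units = "days"` makes the unit explicit instead of relying on the unit that the subtraction picked.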
To load data
library(ggplot2) # economics ships with ggplot2, not with datasets
data("economics")
head(economics)
# A tibble: 6 × 6
date pce pop psavert uempmed unemploy
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1967-07-01 507. 198712 12.6 4.5 2944
2 1967-08-01 510. 198911 12.6 4.7 2945
3 1967-09-01 516. 199113 11.9 4.6 2958
4 1967-10-01 512. 199311 12.9 4.9 3143
5 1967-11-01 517. 199498 12.8 4.7 3066
6 1967-12-01 525. 199657 11.8 4.8 3018
To calculate the % of the population that is unemployed
economics$perc.unemploy <- (economics$unemploy / economics$pop) * 100 ##multiplying by 100 turns the ratio into a %.
head(economics)
# A tibble: 6 × 7
date pce pop psavert uempmed unemploy perc.unemploy
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1967-07-01 507. 198712 12.6 4.5 2944 1.48
2 1967-08-01 510. 198911 12.6 4.7 2945 1.48
3 1967-09-01 516. 199113 11.9 4.6 2958 1.49
4 1967-10-01 512. 199311 12.9 4.9 3143 1.58
5 1967-11-01 517. 199498 12.8 4.7 3066 1.54
6 1967-12-01 525. 199657 11.8 4.8 3018 1.51
To load the dataset
library(ggplot2)
data("txhousing")
head(txhousing)
# A tibble: 6 × 9
city year month sales volume median listings inventory date
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
To create success rate
txhousing$successrate <- (txhousing$sales / txhousing$listings) * 100
head(txhousing)
# A tibble: 6 × 10
city year month sales volume median listings inventory date successrate
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene 2000 1 72 5380000 71400 701 6.3 2000 10.3
2 Abilene 2000 2 98 6505000 58700 746 6.6 2000. 13.1
3 Abilene 2000 3 130 9285000 58100 784 6.8 2000. 16.6
4 Abilene 2000 4 98 9730000 68600 785 6.9 2000. 12.5
5 Abilene 2000 5 141 10590000 67300 794 6.8 2000. 17.8
6 Abilene 2000 6 156 13910000 66900 780 6.6 2000. 20
To calculate the % of total listings that do not sell
txhousing$failrate <- ((txhousing$listings - txhousing$sales) / txhousing$listings) * 100
head(txhousing)
# A tibble: 6 × 11
city year month sales volume median listings inventory date successrate
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene 2000 1 72 5380000 71400 701 6.3 2000 10.3
2 Abilene 2000 2 98 6505000 58700 746 6.6 2000. 13.1
3 Abilene 2000 3 130 9285000 58100 784 6.8 2000. 16.6
4 Abilene 2000 4 98 9730000 68600 785 6.9 2000. 12.5
5 Abilene 2000 5 141 10590000 67300 794 6.8 2000. 17.8
6 Abilene 2000 6 156 13910000 66900 780 6.6 2000. 20
# ℹ 1 more variable: failrate <dbl>
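The same two rates can also be written with mutate(). A minimal sketch, assuming dplyr is installed; `toy` and `toy.rates` are hypothetical names, and the two rows reuse the sales and listings values from the first two Abilene rows:

```r
library(dplyr)

# Two rows copied from the Abilene output; the object names are hypothetical
toy <- tibble(sales = c(72, 98), listings = c(701, 746))

toy.rates <- toy %>%
  mutate(successrate = sales / listings * 100,  # % of listings that sold
         failrate = 100 - successrate)          # % of listings that did not sell

round(toy.rates$successrate, 1)
```

Defining the fail rate as 100 minus the success rate gives the same values as (listings - sales) / listings * 100, which is a quick way to check both columns.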
Problem A
midwest %>% # dataset and pipe tool used to take output of one function & use as input for next
group_by(state) %>% # groups the data by the state column
summarize(poptotalmean = mean(poptotal), # calculates summary stats for each group & calculates the mean of the pop total
poptotalmed = median(poptotal), # calculates median of population total
popmax = max(poptotal), # finds maximum value of the population total
popmin = min(poptotal), # finds minimum value of the population total
popdistinct = n_distinct(poptotal), # counts the number of distinct values in population total column
popfirst = first(poptotal), # retrieves the first value of the pop total column for each state
popany = any(poptotal < 5000), # checks if there are any values in the pop total column that are less than 5000 for each state.
popany2 = any(poptotal > 2000000)) %>% # checks for any values in the pop total column that are greater than 2mil
ungroup() # removes grouping structure from data frame. Useful if you plan to perform operations that should not be done in groups
# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
Problem B
midwest %>%
group_by(state) %>%
summarize(num5k = sum(poptotal < 5000), # calculates summary statistic for each group. Inside function, several calculations are defined.
num2mil = sum(poptotal > 2000000),
numrows = n()) %>%
ungroup()
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
# num5k = number of counties in each state where the population is less than 5000
# num2mil = number of counties in each state where the population is greater than 2mil
# numrows = number of counties (rows) in each state group
Problem C
# part I
midwest %>%
group_by(county) %>% # groups data by the county column
summarize(x = n_distinct(state)) %>% # calculate number of distinct states associated with each county
arrange(desc(x)) %>% # sorts summarised data in descending order based on values in column x
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
# part II
# How does n() differ from n_distinct()? = n() counts the total number of observations (rows) in the current group, whereas n_distinct() counts the number of unique (distinct) values in the specified column within the current group
# When would they be the same? different? = The same when every row in the group has a distinct value in the column being counted. Different when there are multiple rows with the same value in the specified column
midwest %>%
group_by(county) %>%
summarize(x = n()) %>% # calculates total number of rows or observations for each county and stores in x
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
# part III
# hint:
# - How many distinctly different counties are there for each county? = if you group by county, you get 1 row per county name, and within each group there is only one distinct county value
# - Can there be more than 1 (county) county in each county? = no, but there can be multiple rows for one county name when that name occurs in more than one state
# - What if we replace 'county' with 'state'? = if both are replaced, you get 1 row per state and x is again always 1; if only n_distinct(county) becomes n_distinct(state), x is the number of states that share each county name
midwest %>%
group_by(county) %>%
summarize(x = n_distinct(county)) %>% # calculates number of distinct counties associated with each county which will always return 1.
ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
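The three parts of Problem C can be compared side by side on a small example. This is a minimal sketch, assuming dplyr is installed; the `toy` data frame is hypothetical and deliberately contains a repeated row, which is an assumption made to show n() and n_distinct() giving different answers:

```r
library(dplyr)

# Hypothetical records: "ADAMS" appears twice in IL and once in IN
toy <- tibble(county = c("ADAMS", "ADAMS", "ADAMS", "BOND"),
              state  = c("IL", "IL", "IN", "IL"))

counts <- toy %>%
  group_by(county) %>%
  summarize(rows = n(),                        # number of rows in each group
            states = n_distinct(state),        # unique states in each group
            counties = n_distinct(county)) %>% # always 1 inside a county group
  ungroup()

counts
```

For ADAMS, `rows` is 3 while `states` is 2 because one state is repeated; for BOND both are 1, so the two functions agree.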
Problem D
diamonds %>%
group_by(clarity) %>% # groups dataset by clarity column
summarize(a = n_distinct(color), # creates summary statistics for each group. In function, calculates number of distinct colours for each clarity group
b = n_distinct(price), # calculates number of distinct prices for each clarity group
c = n()) %>% # counts total number of diamonds within each clarity group
ungroup()
# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
Problem E
# part I
diamonds %>%
group_by(color, cut) %>% # groups dataset by color and cut
summarize(m = mean(price), # calculates mean price of diamonds for each combination of colour and cut
s = sd(price)) %>% # calculates standard deviation of price for each combination of colour and cut
ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
# part II
diamonds %>%
group_by(cut, color) %>% # same groups as part I; only the order of the columns and rows differs
summarize(m = mean(price),
s = sd(price)) %>%
ungroup()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
# part III
# hint:
# - How good is the sale if the price of diamonds equaled msale? = msale is 20% off the mean price of diamonds in each group, so the sale can be judged by comparing msale with the original mean price m
# - e.g., the diamonds are 20% off the original mean price in msale.
diamonds %>%
group_by(cut, color, clarity) %>% # groups dataset by the columns
summarize(m = mean(price),
s = sd(price),
msale = m * 0.80) %>% # calculates sale price that is 20% off mean price
ungroup()
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
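The message printed by summarise() mentions the `.groups` argument. A minimal sketch of using it, assuming dplyr is installed; the `toy` data frame and its values are hypothetical:

```r
library(dplyr)

# Hypothetical data with two grouping columns
toy <- tibble(cut   = c("Fair", "Fair", "Good", "Good"),
              color = c("D", "E", "D", "D"),
              price = c(100, 200, 300, 500))

res <- toy %>%
  group_by(cut, color) %>%
  summarize(m = mean(price),
            msale = m * 0.80,   # 20% off the group mean
            .groups = "drop")   # return an ungrouped result, no message

is_grouped_df(res)
res
```

With `.groups = "drop"` the result comes back ungrouped, so a trailing ungroup() is not needed for this pipeline.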
Problem F
diamonds %>%
group_by(cut) %>%
summarize(potato = mean(depth),# calculates mean depth of diamonds for each cut category
pizza = mean(price),# calculates mean price of diamonds for each cut category
popcorn = median(y), # calculates median of y, the diamond width in mm (y is a column in the diamonds dataset; an undefined column would produce an error)
pineapple = potato - pizza,# calculates difference between mean depth and mean price
papaya = pineapple ^ 2,# squares value of pineapple
peach = n()) %>% # counts total number of diamonds in each cut category
ungroup()
# A tibble: 5 × 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
Problem G
# part I
diamonds %>%
group_by(color) %>% # groups dataset by colour variable
summarize(m = mean(price)) %>% # calculates mean price of diamonds for each colour group
mutate(x1 = str_c("Diamond color ", color),
x2 = 5) %>% # adds new columns to the summarized dataset. x1 concatenates the string "Diamond color " with the actual colour value. x2 assigns the value 5 to every row.
ungroup()
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
# part II
# What does the first ungroup() do? Is it useful here? Why/why not? = it removes any grouping structure left after summarising, so later operations on the resulting data frame are not affected by the previous grouping. Here summarize() has already dropped the only grouping variable (color), so it is a safeguard rather than a necessity
# Why isn't there a closing ungroup() after the mutate()? = the data were already ungrouped before mutate(), and mutate() does not add grouping, so there is no need for another ungroup().
diamonds %>%
group_by(color) %>%
summarize(m = mean(price)) %>%
ungroup() %>% # ungroup before mutate: (1) summarise mean price by colour, (2) ungrouping summarised data, (3) adding new columns (x1 & x2)
mutate(x1 = str_c("Diamond color ", color),
x2 = 5)
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
Problem H
# part I
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>% # creates new column (x1) that contains half the value of the price for each diamond
summarize(m = mean(x1)) %>% # calculates mean of new column for each colour group stored in a new column named 'm'
ungroup()
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
# part II
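# What's the difference between part I and II? = (1) the position of ungroup(), (2) the effect on results: part I returns one mean per colour, part II returns a single overall mean because the grouping was removed before summarising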
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>%
ungroup() %>% # removed grouping structure before summarising
summarize(m = mean(x1))
# A tibble: 1 × 1
m
<dbl>
1 1966.
Why is grouping data necessary? - it is crucial for performing calculations and analyses within subsets of the data
Why is ungrouping data necessary? - it ensures that further operations do not unintentionally respect previous groupings
When should you ungroup data? - after summary operations, or before applying transformations that are not group-specific
If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()? - if there is no grouping in the code, ungroup() is not required
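These answers can be checked on a small example. A minimal sketch, assuming dplyr is installed; the `toy` data frame and the object names are hypothetical:

```r
library(dplyr)

# Hypothetical data: two colours, two prices each
toy <- tibble(color = c("D", "D", "E", "E"),
              price = c(100, 300, 200, 600))

# Grouped when summarizing: one mean per colour
by_color <- toy %>%
  group_by(color) %>%
  summarize(m = mean(price)) %>%
  ungroup()

# Ungrouped before summarizing: a single overall mean
overall <- toy %>%
  group_by(color) %>%
  ungroup() %>%
  summarize(m = mean(price))

# No group_by() at all: the result is not grouped, so ungroup() adds nothing
plain <- toy %>% mutate(newVar = 1 + 2)
is_grouped_df(plain)
```

`by_color` has one row per colour, `overall` has a single row, and `is_grouped_df(plain)` returns FALSE because no grouping was ever applied.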
Chapter 6.7: Extra Practice
library(dplyr)
View(diamonds)
diamonds_sorted_low_to_high <- diamonds %>%
arrange(price)
diamonds_sorted_high_to_low <- diamonds %>%
arrange(desc(price))
diamonds_sorted_low_price_cut <- diamonds %>%
arrange(price, cut)
diamonds_sorted_high_price_cut <- diamonds %>%
arrange(desc(price), cut)
diamonds_sorted_price_clarity <- diamonds %>%
arrange(price, desc(clarity))
diamonds_with_sale_price <- diamonds %>%
mutate(salePrice = price - 250)
diamonds_selected <- diamonds %>%
select(-x, -y, -z)
diamonds_count_by_cut <- diamonds %>%
group_by(cut) %>%
summarise(count = n())
diamonds_with_total_num <- diamonds %>%
mutate(totalNum = n())
Good question = How does the quality of the diamond impact the price?
Bad question = Which diamond is the cheapest?
library(tidyverse)
library(ggplot2)
options(repos = c(CRAN = "https://cloud.r-project.org/"))
install.packages("modeldata")
Installing package into 'C:/Users/sanke/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)
There is a binary version available but the source version is later:
binary source needs_compilation
modeldata 1.3.0 1.4.0 FALSE
installing the source package 'modeldata'
library(modeldata)
View(crickets)
## x & y axis
## To add titles & captions to graph, then adding colour
ggplot(crickets, aes(x = temp,
                     y = rate,
                     color = species)) +
  geom_point() + # each layer must be joined with +, otherwise labs() and the scale are printed instead of applied
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")
## Modifying basic properties of the plot
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point(color = "red",
size = 3,
alpha = .3,
shape = "square") +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "McDonald (2009)")
## Learn more about options for geom_
# with ?geom_point
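In ggplot2 every layer, label set, and scale has to be joined to the plot with +; a call that is not joined is evaluated on its own and its object is printed instead of changing the plot. A minimal sketch, assuming ggplot2 is installed and using the built-in mtcars data (chosen for illustration, not a dataset from this workbook):

```r
library(ggplot2)

# Every component is chained with +, so all of them end up in one plot object
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight", y = "Miles per gallon", color = "Cylinders") +
  scale_color_brewer(palette = "Dark2")

length(p$layers)  # number of geom layers stored in the plot
p                 # printing the object draws the plot
```

Storing the plot in an object such as `p` (a hypothetical name) makes it easy to confirm that the labels and scale were attached before drawing it.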
## Adding another layer
# Regression line
# lm = linear model
# se = standard error
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'
## add layer
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "McDonald (2009)") + # join the scale with + so it is applied to the plot rather than printed
scale_color_brewer(palette = "Dark2")
`geom_smooth()` using formula = 'y ~ x'
## Other plots
ggplot(crickets, aes(x = rate)) +
  geom_histogram(bins = 15) # one quantitative variable
ggplot(crickets, aes(x = rate)) +
  geom_freqpoly(bins = 15)
ggplot(crickets, aes(x = species)) +
  geom_bar(color = "black", # boundary colour
           fill = "lightblue")
ggplot(crickets, aes(x = species,
                     fill = species)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")
# Bar chart - for a single categorical variable
# Histogram - for a single quantitative variable
# Scatterplot - for 2 quantitative variables
# Boxplot - for 1 quantitative & 1 categorical variable
ggplot(crickets, aes(x = species,
y = rate,
color = species)) +
geom_boxplot(show.legend = FALSE) +
scale_color_brewer(palette = "Dark2") +
theme_minimal() # remove grey background
?theme_minimal # options for complete themes
# Faceting
# not great:
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_fill_brewer(palette = "Dark2")
ggplot(crickets, aes(x = rate,
                     fill = species)) +
  geom_histogram(bins = 15,
                 show.legend = FALSE) +
  facet_wrap(~species) + # ~ means wrap by species
  scale_fill_brewer(palette = "Dark2")
?facet_wrap # help file for facets
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species,
ncol = 1) +
scale_fill_brewer(palette = "Dark2")
file.rename("Quarto-Workbook-exercises.rmarkdown", "Quarto_Workbook_exercises.rmd")
[1] TRUE
What is a good research hypothesis? A good research hypothesis is a testable statement that is both specific and measurable, predicting a relationship between variables. It should guide the research method, allowing for empirical investigation, as well as being formulated in a way that allows for the possibility of being supported or refuted through data collection and analysis.
library(tidyverse)
library(ggplot2)
data("iris")
View(iris)
ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) +
geom_boxplot(fill = NA) + # Set fill to NA for no fill
labs(title = "Boxplot of Sepal Length of Different Species",
x = "Species",
y = "Sepal Length") +
theme_minimal() +
scale_color_manual(values = c("setosa" = "red",
"versicolor" = "green",
"virginica" = "blue")) +
theme(legend.title = element_blank())
# Statistical tests that can be associated with boxplots:
# 1. T-test
# 2. ANOVA
# 3. Kruskal-Wallis
# 4. Mann-Whitney U
# 5. Post-hoc
data("iris")
View(iris)
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.5) + # Create the density plot with transparency
  labs(x = "Petal.Length", y = "Density") +
  theme_minimal()
# Statistical tests that can be associated with density plots:
# 1. Kolmogorov-Smirnov
# 2. Mann-Whitney U
# 3. Shapiro-Wilk
# 4. Bartlett's
# 5. Kruskal-Wallis
data("iris")
View(iris)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(mapping = aes(colour = Species, shape = Species)) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") + # One overall linear fit; se = FALSE hides the confidence band
  labs(x = "Petal.Length", y = "Petal.Width") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
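Because colour and shape are mapped only inside geom_point(), the geom_smooth() layer in the plot above sees no grouping and fits a single line to all 150 flowers. If one regression line per species is wanted, a minimal sketch (one possible approach, not necessarily what the exercise intended) is to move the colour mapping into ggplot() so that both layers inherit it. The fixed color = "blue" is dropped here because it would override the inherited mapping.

```r
library(ggplot2)

# Mapping colour in ggplot() makes every layer group by Species,
# so geom_smooth() fits one linear model per species.
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point(aes(shape = Species)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # se = FALSE: no confidence band
  labs(x = "Petal.Length", y = "Petal.Width") +
  theme_minimal()

p
```

Setting se = TRUE instead would draw a confidence band around each line. Supplying formula = y ~ x explicitly also suppresses the informational message that geom_smooth() prints.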
# Statistical tests that can be associated with scatterplots:
# 1. Kolmogorov-Smirnov
# 2. Mann-Whitney U
# 3. Shapiro-Wilk
# 4. Bartlett's
# 5. Kruskal-Wallis
data("iris")
View(iris)
data <- data.frame( # create data frame called data with 3 columns
  Species = c(rep("setosa", 2), rep("versicolor", 2), rep("virginica", 2)),
  size = c("big", "small", "big", "small", "big", "small"),
  count = c(1, 50, 30, 25, 50, 5)
)
ggplot(data, aes(x = Species, y = count, fill = size)) +
  geom_bar(stat = "identity", position = "dodge") + # "dodge" ensures bars are side by side rather than stacked
  labs(y = "count", x = "Species") +
  theme_minimal() +
  scale_fill_manual(values = c("big" = "salmon", "small" = "lightblue"))
iris %>%
  # mutate() adds or modifies columns in a data frame; here it adds the column size
  # ifelse() labels a row "small" if Sepal.Length is below the median, otherwise "big"
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length),
                       "small", "big"))
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  size
1 5.1 3.5 1.4 0.2 setosa small
2 4.9 3.0 1.4 0.2 setosa small
3 4.7 3.2 1.3 0.2 setosa small
4 4.6 3.1 1.5 0.2 setosa small
5 5.0 3.6 1.4 0.2 setosa small
6 5.4 3.9 1.7 0.4 setosa small
7 4.6 3.4 1.4 0.3 setosa small
8 5.0 3.4 1.5 0.2 setosa small
9 4.4 2.9 1.4 0.2 setosa small
10 4.9 3.1 1.5 0.1 setosa small
11 5.4 3.7 1.5 0.2 setosa small
12 4.8 3.4 1.6 0.2 setosa small
13 4.8 3.0 1.4 0.1 setosa small
14 4.3 3.0 1.1 0.1 setosa small
15 5.8 4.0 1.2 0.2 setosa big
16 5.7 4.4 1.5 0.4 setosa small
17 5.4 3.9 1.3 0.4 setosa small
18 5.1 3.5 1.4 0.3 setosa small
19 5.7 3.8 1.7 0.3 setosa small
20 5.1 3.8 1.5 0.3 setosa small
21 5.4 3.4 1.7 0.2 setosa small
22 5.1 3.7 1.5 0.4 setosa small
23 4.6 3.6 1.0 0.2 setosa small
24 5.1 3.3 1.7 0.5 setosa small
25 4.8 3.4 1.9 0.2 setosa small
26 5.0 3.0 1.6 0.2 setosa small
27 5.0 3.4 1.6 0.4 setosa small
28 5.2 3.5 1.5 0.2 setosa small
29 5.2 3.4 1.4 0.2 setosa small
30 4.7 3.2 1.6 0.2 setosa small
31 4.8 3.1 1.6 0.2 setosa small
32 5.4 3.4 1.5 0.4 setosa small
33 5.2 4.1 1.5 0.1 setosa small
34 5.5 4.2 1.4 0.2 setosa small
35 4.9 3.1 1.5 0.2 setosa small
36 5.0 3.2 1.2 0.2 setosa small
37 5.5 3.5 1.3 0.2 setosa small
38 4.9 3.6 1.4 0.1 setosa small
39 4.4 3.0 1.3 0.2 setosa small
40 5.1 3.4 1.5 0.2 setosa small
41 5.0 3.5 1.3 0.3 setosa small
42 4.5 2.3 1.3 0.3 setosa small
43 4.4 3.2 1.3 0.2 setosa small
44 5.0 3.5 1.6 0.6 setosa small
45 5.1 3.8 1.9 0.4 setosa small
46 4.8 3.0 1.4 0.3 setosa small
47 5.1 3.8 1.6 0.2 setosa small
48 4.6 3.2 1.4 0.2 setosa small
49 5.3 3.7 1.5 0.2 setosa small
50 5.0 3.3 1.4 0.2 setosa small
51 7.0 3.2 4.7 1.4 versicolor big
52 6.4 3.2 4.5 1.5 versicolor big
53 6.9 3.1 4.9 1.5 versicolor big
54 5.5 2.3 4.0 1.3 versicolor small
55 6.5 2.8 4.6 1.5 versicolor big
56 5.7 2.8 4.5 1.3 versicolor small
57 6.3 3.3 4.7 1.6 versicolor big
58 4.9 2.4 3.3 1.0 versicolor small
59 6.6 2.9 4.6 1.3 versicolor big
60 5.2 2.7 3.9 1.4 versicolor small
61 5.0 2.0 3.5 1.0 versicolor small
62 5.9 3.0 4.2 1.5 versicolor big
63 6.0 2.2 4.0 1.0 versicolor big
64 6.1 2.9 4.7 1.4 versicolor big
65 5.6 2.9 3.6 1.3 versicolor small
66 6.7 3.1 4.4 1.4 versicolor big
67 5.6 3.0 4.5 1.5 versicolor small
68 5.8 2.7 4.1 1.0 versicolor big
69 6.2 2.2 4.5 1.5 versicolor big
70 5.6 2.5 3.9 1.1 versicolor small
71 5.9 3.2 4.8 1.8 versicolor big
72 6.1 2.8 4.0 1.3 versicolor big
73 6.3 2.5 4.9 1.5 versicolor big
74 6.1 2.8 4.7 1.2 versicolor big
75 6.4 2.9 4.3 1.3 versicolor big
76 6.6 3.0 4.4 1.4 versicolor big
77 6.8 2.8 4.8 1.4 versicolor big
78 6.7 3.0 5.0 1.7 versicolor big
79 6.0 2.9 4.5 1.5 versicolor big
80 5.7 2.6 3.5 1.0 versicolor small
81 5.5 2.4 3.8 1.1 versicolor small
82 5.5 2.4 3.7 1.0 versicolor small
83 5.8 2.7 3.9 1.2 versicolor big
84 6.0 2.7 5.1 1.6 versicolor big
85 5.4 3.0 4.5 1.5 versicolor small
86 6.0 3.4 4.5 1.6 versicolor big
87 6.7 3.1 4.7 1.5 versicolor big
88 6.3 2.3 4.4 1.3 versicolor big
89 5.6 3.0 4.1 1.3 versicolor small
90 5.5 2.5 4.0 1.3 versicolor small
91 5.5 2.6 4.4 1.2 versicolor small
92 6.1 3.0 4.6 1.4 versicolor big
93 5.8 2.6 4.0 1.2 versicolor big
94 5.0 2.3 3.3 1.0 versicolor small
95 5.6 2.7 4.2 1.3 versicolor small
96 5.7 3.0 4.2 1.2 versicolor small
97 5.7 2.9 4.2 1.3 versicolor small
98 6.2 2.9 4.3 1.3 versicolor big
99 5.1 2.5 3.0 1.1 versicolor small
100 5.7 2.8 4.1 1.3 versicolor small
101 6.3 3.3 6.0 2.5 virginica big
102 5.8 2.7 5.1 1.9 virginica big
103 7.1 3.0 5.9 2.1 virginica big
104 6.3 2.9 5.6 1.8 virginica big
105 6.5 3.0 5.8 2.2 virginica big
106 7.6 3.0 6.6 2.1 virginica big
107 4.9 2.5 4.5 1.7 virginica small
108 7.3 2.9 6.3 1.8 virginica big
109 6.7 2.5 5.8 1.8 virginica big
110 7.2 3.6 6.1 2.5 virginica big
111 6.5 3.2 5.1 2.0 virginica big
112 6.4 2.7 5.3 1.9 virginica big
113 6.8 3.0 5.5 2.1 virginica big
114 5.7 2.5 5.0 2.0 virginica small
115 5.8 2.8 5.1 2.4 virginica big
116 6.4 3.2 5.3 2.3 virginica big
117 6.5 3.0 5.5 1.8 virginica big
118 7.7 3.8 6.7 2.2 virginica big
119 7.7 2.6 6.9 2.3 virginica big
120 6.0 2.2 5.0 1.5 virginica big
121 6.9 3.2 5.7 2.3 virginica big
122 5.6 2.8 4.9 2.0 virginica small
123 7.7 2.8 6.7 2.0 virginica big
124 6.3 2.7 4.9 1.8 virginica big
125 6.7 3.3 5.7 2.1 virginica big
126 7.2 3.2 6.0 1.8 virginica big
127 6.2 2.8 4.8 1.8 virginica big
128 6.1 3.0 4.9 1.8 virginica big
129 6.4 2.8 5.6 2.1 virginica big
130 7.2 3.0 5.8 1.6 virginica big
131 7.4 2.8 6.1 1.9 virginica big
132 7.9 3.8 6.4 2.0 virginica big
133 6.4 2.8 5.6 2.2 virginica big
134 6.3 2.8 5.1 1.5 virginica big
135 6.1 2.6 5.6 1.4 virginica big
136 7.7 3.0 6.1 2.3 virginica big
137 6.3 3.4 5.6 2.4 virginica big
138 6.4 3.1 5.5 1.8 virginica big
139 6.0 3.0 4.8 1.8 virginica big
140 6.9 3.1 5.4 2.1 virginica big
141 6.7 3.1 5.6 2.4 virginica big
142 6.9 3.1 5.1 2.3 virginica big
143 5.8 2.7 5.1 1.9 virginica big
144 6.8 3.2 5.9 2.3 virginica big
145 6.7 3.3 5.7 2.5 virginica big
146 6.7 3.0 5.2 2.3 virginica big
147 6.3 2.5 5.0 1.9 virginica big
148 6.5 3.0 5.2 2.0 virginica big
149 6.2 3.4 5.4 2.3 virginica big
150 5.9 3.0 5.1 1.8 virginica big
# Statistical tests that can be associated with bar charts:
# 1. Chi-Square
# 2. T-test
# 3. ANOVA
# 4. Mann-Whitney U
# 5. Kruskal-Wallis
# 6. Z-test
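The first test in the list, the chi-square test, applies to counts like those shown in the bar chart. The sketch below assumes the size classes are derived from iris with the same median rule as the mutate() step, rather than from the hand-typed count column. It tabulates species against size and tests whether the two are independent.

```r
data("iris")

# Same rule as the mutate() step: below the median is "small", otherwise "big"
size <- ifelse(iris$Sepal.Length < median(iris$Sepal.Length), "small", "big")

# Contingency table of species by size class
counts <- table(Species = iris$Species, size = size)
counts

# Chi-square test of independence between species and size class
result <- chisq.test(counts)
result
```

If the bar chart should reflect these computed counts, as.data.frame(counts) returns them in long form, with the count in a column named Freq, which could be passed to ggplot() in place of the manually built data frame.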