RM&DA Workbook
Introduction to Quarto
To-do list
- customise workbook
- load first data set
- publish using R-Pubs
Adding a figure
Open the ‘Insert’ menu at the top right and choose ‘Figure’. Select an image file, then add a caption and alternative text.
Adding code chunks - key things to note
When you click the Render button, a document is generated that includes both the content and the output of the embedded code. Note that all packages needed to run these chunks are loaded at the start of the page. You can embed code like this:
Chunk 1: The echo: false option disables the printing of code (only output is displayed).
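As a minimal sketch of what such a chunk might contain (the calculation itself is an assumption chosen for illustration): in the .qmd source the chunk would open with ```{r}, and the `#|` option line has to sit at the top of the chunk.

```r
#| echo: false
# With echo: false, this code is hidden in the rendered page;
# only the printed result appears.
result <- 2 * 2
result
```

Setting the option to `echo: true` would show the code above its output instead.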
Chunk 2: Basic data management
“%>%” The pipe operator lets you chain functions together in a more readable and intuitive way. It takes the output of one function and passes it as the first argument to the next function.
“mutate()” adds new columns or modifies existing variables in the data set. We can also use other functions inside mutate() to create our new variable(s); this is called nesting.
“summarise()” collapses all rows and returns a one-row summary; for example, it lets us calculate the mean. We can also perform multiple operations in one summarise() call and nest other useful functions inside it. (“summarize()” is the same function under its US spelling.)
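The three functions above can be sketched on a small made-up data set. The tibble `scores` and its column names are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
scores <- tibble(id = 1:4, score = c(10, 20, 30, 40))

# mutate() adds a column; sqrt() is nested inside it
scores_with_root <- scores %>%
  mutate(root_score = sqrt(score))

# summarise() collapses all rows into a single summary row
score_summary <- scores %>%
  summarise(mean_score = mean(score), n_rows = n())

# The pipe passes `scores` as the first argument, so this is the same call
same_summary <- summarise(scores, mean_score = mean(score), n_rows = n())
identical(score_summary, same_summary)
```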
“group_by()” and “ungroup()”: group_by() takes existing data and groups it by one or more variables, so that later operations are performed per group. For example, grouping by age and sex (male/female) might be useful if we care about how females of a certain age scored compared to males of the same age (or about comparing ages within males or within females).
- “summarise()” and “group_by()” can be used together to calculate and compare the average score (and other measures) for males and females separately, for example.
- “mutate()” and “group_by()” can also be used together to add a new column based on the group.
“ungroup()” should be used after “group_by()” once the grouped calculations are done. If you forget to ungroup() the data, later data management steps are still applied group by group and are likely to produce errors or unexpected results. Always ungroup() when you’ve finished with your calculations.
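A minimal sketch of the grouped workflow, again on hypothetical toy data (the `scores` tibble and its values are assumptions, not part of the session data):

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
scores <- tibble(
  sex   = c("female", "female", "male", "male"),
  score = c(10, 20, 30, 50)
)

# group_by() + summarise(): one row per group
group_means <- scores %>%
  group_by(sex) %>%
  summarise(mean_score = mean(score))

# group_by() + mutate(): every row keeps its place and gains its group mean
with_group_mean <- scores %>%
  group_by(sex) %>%
  mutate(group_mean = mean(score)) %>%
  ungroup()

# Check that the grouping has been removed again
is_grouped_df(with_group_mean)
```

Leaving out the final ungroup() would keep `with_group_mean` grouped by sex, so any later mutate() or summarise() on it would still run per group.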
“filter()” retains only the rows of data that meet the specified requirement(s).
“select()” keeps only the columns (variables) that you want to see and drops all other columns. You can refer to columns by position (for example, 1 for the first column) or by name. The order in which you list the column names/positions is the order in which the columns will be displayed.
“arrange()” sorts the rows by the values of a variable in ascending or descending order (if that is applicable to your values). This works for both numerical and non-numerical values.
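The row and column helpers can be combined in one pipeline. This is a sketch on a hypothetical tibble `toy`; none of its names or values come from the session data.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- tibble(
  name  = c("a", "b", "c", "d"),
  group = c("x", "y", "x", "y"),
  value = c(3, 1, 4, 2)
)

result <- toy %>%
  filter(value > 1) %>%     # keep only rows where value is above 1
  select(value, name) %>%   # keep two columns, in the order listed
  arrange(desc(value))      # sort from largest to smallest value

result
```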
Chunk 3: Plotting data set using ggplot
library(ggplot2) # For plotting
library(palmerpenguins) # For the penguins dataset
ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +
geom_point(aes(color = species, shape = species)) +
scale_color_manual(values = c("darkorange","purple","cyan4")) +
labs(
title = "Flipper and bill length",
subtitle = "Dimensions for penguins at Palmer Station LTER",
x = "Flipper length (mm)", y = "Bill length (mm)",
color = "Penguin species", shape = "Penguin species"
) +
theme_minimal()
Warning: Removed 2 rows containing missing values or values outside the scale range (`geom_point()`).
Using LaTeX (TinyTex)
Wrap an expression in $ ... $ to enter inline math mode, or in $$ ... $$ for a displayed equation. For example:
\(\alpha\) \[ \pi \in \mathbb{R} \]
Data Wrangling
Session workthrough
1. Calling a tibble
library(palmerpenguins)
data("penguins")
penguins %>%
select(1:3)
# A tibble: 344 × 3
species island bill_length_mm
<fct> <fct> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
6 Adelie Torgersen 39.3
7 Adelie Torgersen 38.9
8 Adelie Torgersen 39.2
9 Adelie Torgersen 34.1
10 Adelie Torgersen 42
# ℹ 334 more rows
2. Summary
penguins %>%
summary()
      species          island    bill_length_mm  bill_depth_mm
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
                                 Mean   :43.92   Mean   :17.15
                                 3rd Qu.:48.50   3rd Qu.:18.70
                                 Max.   :59.60   Max.   :21.50
                                 NA's   :2       NA's   :2
 flipper_length_mm  body_mass_g       sex           year
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007
 Median :197.0     Median :4050   NA's  : 11   Median :2008
 Mean   :200.9     Mean   :4202                Mean   :2008
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009
 Max.   :231.0     Max.   :6300                Max.   :2009
 NA's   :2         NA's   :2
3. Subsetting data
penguins %>%
select(body_mass_g)
# A tibble: 344 × 1
body_mass_g
<int>
1 3750
2 3800
3 3250
4 NA
5 3450
6 3650
7 3625
8 4675
9 3475
10 4250
# ℹ 334 more rows
penguins %>%
filter(species=="Gentoo",
bill_length_mm > 50,
sex=="male") %>%
select(bill_length_mm,
bill_depth_mm) %>%
arrange(bill_depth_mm)
# A tibble: 21 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 51.3 14.2
2 50.2 14.3
3 50.1 15
4 50.7 15
5 50.4 15.3
6 52.5 15.6
7 54.3 15.7
8 50.8 15.7
9 50.4 15.7
10 53.4 15.8
# ℹ 11 more rows
4. Adding columns
penguins %>%
select(bill_length_mm,
bill_depth_mm) %>%
mutate(bill_volume=bill_length_mm+bill_depth_mm) %>%
mutate(log_bill_volume=log(bill_volume)) %>%
mutate(bill_categ=ifelse(bill_volume<60, "small", "big"))
# A tibble: 344 × 5
bill_length_mm bill_depth_mm bill_volume log_bill_volume bill_categ
<dbl> <dbl> <dbl> <dbl> <chr>
1 39.1 18.7 57.8 4.06 small
2 39.5 17.4 56.9 4.04 small
3 40.3 18 58.3 4.07 small
4 NA NA NA NA <NA>
5 36.7 19.3 56 4.03 small
6 39.3 20.6 59.9 4.09 small
7 38.9 17.8 56.7 4.04 small
8 39.2 19.6 58.8 4.07 small
9 34.1 18.1 52.2 3.96 small
10 42 20.2 62.2 4.13 big
# ℹ 334 more rows
5. Reshaping data
penguins %>%
select(bill_length_mm,
bill_depth_mm,
year) %>%
pivot_longer(cols = c(bill_length_mm:bill_depth_mm),
names_to = "bill_feature", values_to = "value")
# A tibble: 688 × 3
year bill_feature value
<int> <chr> <dbl>
1 2007 bill_length_mm 39.1
2 2007 bill_depth_mm 18.7
3 2007 bill_length_mm 39.5
4 2007 bill_depth_mm 17.4
5 2007 bill_length_mm 40.3
6 2007 bill_depth_mm 18
7 2007 bill_length_mm NA
8 2007 bill_depth_mm NA
9 2007 bill_length_mm 36.7
10 2007 bill_depth_mm 19.3
# ℹ 678 more rows
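To see why the row count doubles from 344 to 688, here is a sketch of the same reshape on a two-row stand-in tibble `wide` (an assumed subset using the first two penguins from the output above): each original row becomes one row per measured column.

```r
library(dplyr)
library(tidyr)

# Stand-in data: the first two rows of the selection above (assumed subset)
wide <- tibble(
  bill_length_mm = c(39.1, 39.5),
  bill_depth_mm  = c(18.7, 17.4),
  year           = c(2007L, 2007L)
)

# Two measurement columns become two rows per penguin
long <- wide %>%
  pivot_longer(cols = bill_length_mm:bill_depth_mm,
               names_to = "bill_feature", values_to = "value")

long
```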
Post session
data("diamonds")
diamonds %>%
group_by(clarity) %>%
summarise(a = n_distinct(color),
b = n_distinct(price),
c = n()) %>%
ungroup()
# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
diamonds %>%
group_by(color, cut) %>%
summarize(m = mean(price),
sd = sd(price)) %>%
ungroup()
`summarise()` has grouped output by 'color'. You can override using the `.groups` argument.
# A tibble: 35 × 4
color cut m sd
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
diamonds %>%
group_by(cut, color, clarity) %>%
summarize(m = mean(price),
sd = sd(price),
msale = m * 0.80) %>%
ungroup()
`summarise()` has grouped output by 'cut', 'color'. You can override using the `.groups` argument.
# A tibble: 276 × 6
cut color clarity m sd msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
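The message printed above points to the `.groups` argument of summarise(). As a sketch of one possible alternative to the trailing ungroup(), setting `.groups = "drop"` returns an ungrouped result and suppresses the message; the object name `price_summary` is just an illustrative choice.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

price_summary <- diamonds %>%
  group_by(color, cut) %>%
  summarise(m  = mean(price),
            sd = sd(price),
            .groups = "drop")  # drop all grouping in the result

# Check that no grouping is left
is_grouped_df(price_summary)
```

Both styles give the same table here; the explicit ungroup() used in the workbook is arguably easier to spot when reading a long pipeline.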
Data Exploration & Visualisation
For Continuous Variables
Histogram
penguins %>%
group_by(species) %>%
ggplot(aes(x=bill_length_mm, color=species, fill=species))+
geom_histogram(alpha=0.5)+
scale_color_manual(values = c("Adelie" = "#2df8ff", "Chinstrap" = "#34e1a0", "Gentoo" = "#8abf26")) +
scale_fill_manual(values = c("Adelie" = "#2df8ff", "Chinstrap" = "#34e1a0", "Gentoo" = "#8abf26")) +
labs(
title = "Bill Length Across the Palmer Penguins",
x = "Bill Length (mm)", y = "Number of Penguins")
penguins %>%
group_by(species) %>%
ggplot(aes(x=bill_depth_mm, color=species, fill=species))+
geom_histogram(alpha=0.5)+
scale_color_manual(values = c("Adelie" = "#2df8ff", "Chinstrap" = "#34e1a0", "Gentoo" = "#8abf26")) +
scale_fill_manual(values = c("Adelie" = "#2df8ff", "Chinstrap" = "#34e1a0", "Gentoo" = "#8abf26")) +
labs(
title = "Bill Depth Across the Palmer Penguins",
x = "Bill Depth (mm)", y = "Number of Penguins")
Box Plot
penguins %>%
group_by(species) %>%
ggplot(aes(x=species,
y=bill_length_mm,
color=species,
fill=species))+
geom_boxplot(alpha=0.5) +
scale_color_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
scale_fill_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
labs(
title = "Bill Length Across the Palmer Penguins",
x = "Species", y = "Bill Length (mm)")
penguins %>%
group_by(species) %>%
ggplot(aes(x=species,
y=bill_depth_mm,
color=species,
fill=species))+
geom_boxplot(alpha=0.5) +
geom_jitter() +
scale_color_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
scale_fill_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
labs(
title = "Bill Depth Across the Palmer Penguins",
x = "Species", y = "Bill Depth (mm)")
For Categorical Variables
Bar Chart
penguins %>%
group_by(species) %>%
ggplot(aes(x=species,
color=species,
fill=species)) +
geom_bar(alpha=0.5) +
scale_color_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
scale_fill_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
labs(
title = "Number of Palmer Penguins Used Across Each Species",
x = "Species", y = "Number of Penguins")
penguins %>%
group_by(species) %>%
ggplot(aes(x=island,
color=species,
fill=species))+
geom_bar(alpha=0.5) +
scale_color_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
scale_fill_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
labs(
title = "Number of Palmer Penguins Used Across Each Species per Island",
x = "Island", y = "Number of Penguins")
penguins %>%
group_by(species) %>%
ggplot(aes(x=year,
color=species,
fill=species))+
geom_bar(alpha=0.5) +
scale_color_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
scale_fill_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
labs(
title = "Number of Palmer Penguins Used Across Each Species per Year",
x = "Year", y = "Number of Penguins")
penguins %>%
group_by(species) %>%
ggplot(aes(x=year,
color=species,
fill=species))+
geom_bar(position = "fill", alpha=0.5) +
scale_color_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
scale_fill_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
labs(
title = "Number of Palmer Penguins Used Across Each Species per Year",
subtitle = "using the 'fill' position",
x = "Year", y = "Proportion of Penguins")
penguins %>%
group_by(species) %>%
ggplot(aes(x=year,
color=species,
fill=species))+
geom_bar(position = "dodge", alpha=0.5) +
scale_color_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
scale_fill_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
labs(
title = "Number of Palmer Penguins Used Across Each Species per Year",
subtitle = "using the 'dodge' position",
x = "Year", y = "Number of Penguins")
For Visualising Correlations
Scatter graph
penguins %>%
group_by(species) %>%
ggplot(aes(x=bill_length_mm,
y=bill_depth_mm,
color=species,
shape=species)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_color_manual(values = c("#7451ff","#a691ff","#d6ceff")) +
scale_shape_manual(values = c("triangle", "square", "circle")) +
labs(
title = "Bill dimensions for Palmer Penguins",
x = "Bill Length (mm)", y = "Bill Depth (mm)")
Box Plot (for multiple variables)
penguins %>%
group_by(species) %>%
na.omit() %>%
ggplot(aes(x=sex,
y=body_mass_g,
color=species,
fill=species))+
geom_boxplot(alpha=0.5) +
scale_color_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
scale_fill_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
labs(
title = "Body Mass per Sex",
x = "Sex", y = "Body Mass (g)")
penguins %>%
group_by(species) %>%
na.omit() %>%
ggplot(aes(x=species,
y=body_mass_g,
color=sex,
fill=sex))+
geom_boxplot(alpha=0.5) +
scale_color_manual(values = c("male" = "#60d5ff", "female" = "#ff9bf5")) +
scale_fill_manual(values = c("male" = "#60d5ff", "female" = "#ff9bf5")) +
labs(
title = "Body Mass per Sex",
subtitle = "Inverting Groups",
x = "Species", y = "Body Mass (g)")
For Visualising Distributions
penguins %>%
na.omit() %>%
pivot_longer(bill_length_mm:body_mass_g, names_to = "trait") %>%
ggplot(aes(x=value,
group=species,
fill=species,
color=species))+
geom_density(alpha=0.5)+
scale_color_manual(values = c("Adelie" = "#c6ff9b", "Chinstrap" = "#ffb987", "Gentoo" = "#ff9bf5")) +
scale_fill_manual(values = c("Adelie" = "#c6ff9b", "Chinstrap" = "#ffb987", "Gentoo" = "#ff9bf5")) +
facet_grid(~trait, scales = "free_x")
Choosing the right analysis
Graph 1 - Box Plot
Predictor variable (species) is categorical and outcome variable (sepal length) is numerical
There are more than 2 groups being tested - there are 3 different species
= ANOVA
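A minimal sketch of how this test might be run in R. It assumes the graph is based on the built-in iris data set (suggested by the sepal measurements and the three species, but not stated in these notes) and uses base R's aov().

```r
# Assumption: the data are the built-in iris data set
# One-way ANOVA: numerical outcome, categorical predictor with 3 groups
anova_fit <- aov(Sepal.Length ~ Species, data = iris)

# ANOVA table with the F statistic and p-value
anova_table <- summary(anova_fit)
anova_table
```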
Graph 2 - Density Plot
Predictor variable (species) is categorical and outcome variable (sepal length) is numerical
There are more than 2 groups being tested - there are 3 different species
= ANOVA
Graph 3 - Scatter Plot
Predictor variable (species) is categorical and outcome variables (sepal length/width) are numerical
There are more than 2 groups being tested - there are 3 different species
There is more than one outcome variable
= MANOVA
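As a sketch under the same assumption (that the scatter plot uses the built-in iris data), a MANOVA with two numerical outcomes could be fitted with base R's manova(), binding the outcome columns together with cbind().

```r
# Assumption: the data are the built-in iris data set
# MANOVA: two numerical outcomes, one categorical predictor
manova_fit <- manova(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris)

# Multivariate test table (Pillai's trace is the default statistic)
summary(manova_fit)
```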
Graph 4 - Histogram
Predictor variable (species) is categorical and outcome variable (size - big/small) is also categorical
= Chi-Squared Test
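A sketch of the corresponding test in base R. The notes do not say how the big/small size category was defined, so the rule below (sepal length above the overall median counts as "big") is purely a hypothetical stand-in, as is the use of the iris data.

```r
# Assumption: iris data; the "size" rule is hypothetical, for illustration only
size <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length), "big", "small")

# Cross-tabulate the two categorical variables
counts <- table(iris$Species, size)

# Chi-squared test of independence between species and size
chi_fit <- chisq.test(counts)
chi_fit
```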