Week 1 01.10.24
Meet Quarto
This workbook will serve as a guide to assist with coding as well as highlighting my development.
A picture from a recent trip to Iceland.
To add a photograph, use the image icon in the toolbar above and select Browse, which lists all available images. It is also possible to insert links to data or buzzwords.
Coding
Switching to the Source tab shows the markdown that is generated as you work in Visual mode. The rendered document will usually look much the same as it did in Visual mode.
Rendering generates a file that combines the code and its output.
A YAML header is delimited by three dashes (---) at either end.
YAML uses key-value pairs in the format key: value. Other fields found in headers include date, subtitle, theme, and font colour.
Code Chunks
R code chunks are identified with {r} and can carry (optional) chunk options, written in YAML style and marked by #| at the beginning of the line.
In this case, the label of the code chunk is load-packages, and we set include to false to indicate that we don’t want the chunk itself or any of its outputs in the rendered documents.
echo: false hides the code and shows only its output. It can be applied document-wide by setting it in the YAML header.
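A minimal sketch of the chunk described above, assuming its only job is to load the tidyverse (the `library()` call is an illustrative assumption). In a .qmd file these lines sit inside a chunk opened with {r}; Quarto reads the `#|` lines as chunk options, while plain R treats them as comments.

```r
#| label: load-packages
#| include: false

# `label` names the chunk; `include: false` keeps both the code and
# its output out of the rendered document.
library(tidyverse)
```

Swapping `include: false` for `echo: false` would hide the code but still show any output.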
Markdown text
Text with formatting, including section headers, hyperlinks, an embedded image, and an inline code chunk.
Quarto uses markdown syntax for text. If using the visual editor, you won’t need to learn much markdown syntax for authoring your document, as you can use the menus and shortcuts to add a header, bold text, or insert a table. If using the source editor, you can achieve these with markdown expressions like ## and **bold**.
This scatter plot illustrates the relationship between flipper and bill length of penguins.
Week 3 pre-session 08.10.24
Tidyverse
With reference to Chapter 4 of R for Graduate Students by Y. Wendy Huynh.
── Attaching core tidyverse packages ────────────────────── tidyverse 2.0.0
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
Tidyverse is a package which contains other packages such as dplyr and ggplot2. It must be installed on the device before it can be loaded from the library.
Use install.packages("tidyverse") once, then library(tidyverse) in each session.
Packages often have naming conflicts; when two attached packages define the same function name, R uses the version from the most recently loaded package.
To call ONLY the filter function from dplyr, without attaching the package:
dplyr::filter()
To attach ALL functions from dplyr:
library(dplyr)
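As a rough illustration of the namespace prefix, the sketch below calls both filter() functions named in the conflict message. It assumes dplyr is installed and uses the built-in mtcars data; the object names `four_cyl` and `moving_avg` are made up for the example.

```r
# Call one function through its namespace without attaching the package.
# dplyr's filter() keeps the rows that match a condition.
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() is a different function (linear filtering of a series),
# so the prefix removes any doubt about which one runs.
moving_avg <- stats::filter(1:10, rep(1 / 3, 3))
moving_avg
```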
Example Code:
diamonds %>%
group_by(clarity) %>%
summarize(m = mean(price)) %>%
ungroup()
# A tibble: 8 x 2
clarity m
<ord> <dbl>
1 I1 3924
2 SI2 5063
3 SI1 3996
4 VS2 3925
5 VS1 3839
6 VVS2 3284
7 VVS1 2523
8 IF 2865.
Diamond data set
Chapter 5 R for Grad students
The diamonds dataset is not part of base R; it ships with the ggplot2 package.
View(diamonds) will open up the data when typed into the console. To edit this data you must do so in code.
view(diamonds)
str(diamonds) will allow you to look at the structure of the data. There are 10 variables in total (three ordered factors, one integer, and six numeric).
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Week 3 post-session 10.10.24
Chapter 6.6 arrange()
Allows you to arrange the values of a variable in ascending or descending order (where an order applies). This works for both numerical and non-numerical values.
diamonds %>% arrange(cut)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 53,930 more rows
diamonds %>%                           # utilizes the diamonds dataset
  group_by(color, clarity) %>%         # groups data by color and clarity variables
  mutate(price200 = mean(price)) %>%   # creates new variable (average price by groups)
  ungroup() %>%                        # data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>%    # new variable, original price + $10
  select(cut, color,                   # retain only these listed columns
         clarity, price,
         price200, random10) %>%
  arrange(color) %>%                   # visualize data ordered by color
  group_by(cut) %>%                    # group data by cut
  mutate(dis = n_distinct(price),      # counts the number of unique price values per cut
         rowID = row_number()) %>%     # numbers each row consecutively for each cut
  ungroup()                            # final ungrouping of data
# A tibble: 53,940 × 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# ℹ 53,930 more rows
Chapter 6.7 Extra practise
- View all of the variable names in diamonds
view(diamonds)
- Arrange the diamonds by lowest to highest price
diamonds %>% arrange(price)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
- Arrange the diamonds by highest to lowest price
diamonds %>% arrange(desc(price))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
- Arrange the diamonds by lowest price and cut
diamonds %>% arrange(price) %>% arrange(cut)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.25 Fair E VS1 55.2 64 361 4.21 4.23 2.33
3 0.23 Fair G VVS2 61.4 66 369 3.87 3.91 2.39
4 0.27 Fair E VS1 66.4 58 371 3.99 4.02 2.66
5 0.3 Fair J VS2 64.8 58 416 4.24 4.16 2.72
6 0.3 Fair F SI1 63.1 58 496 4.3 4.22 2.69
7 0.34 Fair J SI1 64.5 57 497 4.38 4.36 2.82
8 0.37 Fair F SI1 65.3 56 527 4.53 4.47 2.94
9 0.3 Fair D SI2 64.6 54 536 4.29 4.25 2.76
10 0.25 Fair D VS1 61.2 55 563 4.09 4.11 2.51
# ℹ 53,930 more rows
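Sorting by two variables can also be written as a single arrange() call, with the first column acting as the primary key. The sketch below assumes dplyr is installed and uses a small hypothetical data frame, `gems`, standing in for diamonds.

```r
library(dplyr)

# Hypothetical toy data standing in for diamonds
gems <- data.frame(
  cut   = factor(c("Good", "Fair", "Good", "Fair"),
                 levels = c("Fair", "Good"), ordered = TRUE),
  price = c(400, 350, 300, 500)
)

# Sort by cut first, then by price within each cut
by_cut_price <- gems %>% arrange(cut, price)

# Descending on both keys
by_cut_price_desc <- gems %>% arrange(desc(cut), desc(price))

by_cut_price
by_cut_price_desc
```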
- Arrange the diamonds by highest price and cut
diamonds %>% arrange(desc(price)) %>% arrange(desc(cut))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
2 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
3 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
4 2.05 Ideal G SI1 61.9 57 18787 8.1 8.16 5.03
5 1.6 Ideal F VS1 62 56 18780 7.47 7.52 4.65
6 2.06 Ideal I VS2 62.2 55 18779 8.15 8.19 5.08
7 1.71 Ideal G VVS2 62.1 55 18768 7.66 7.63 4.75
8 2.08 Ideal H SI1 58.7 60 18760 8.36 8.4 4.92
9 2.03 Ideal G SI1 60 55.8 18757 8.17 8.3 4.95
10 2.61 Ideal I SI2 62.1 56 18756 8.85 8.73 5.46
# ℹ 53,930 more rows
- Arrange the diamonds by lowest to highest price and worst to best clarity.
# Note: desc(clarity) lists the best clarity (IF) first; arrange(clarity, price) would give worst to best
diamonds %>% arrange(price) %>% arrange(desc(clarity))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Very Good H IF 63.9 55 369 3.89 3.9 2.49
2 0.24 Very Good H IF 61.3 56 449 4.04 4.06 2.48
3 0.26 Ideal H IF 61.1 57 468 4.12 4.16 2.53
4 0.23 Very Good F IF 61 62 485 3.95 3.99 2.42
5 0.3 Ideal J IF 61.5 56 489 4.32 4.33 2.66
6 0.3 Ideal J IF 61.5 57 489 4.29 4.36 2.66
7 0.23 Very Good E IF 59.9 58 492 3.98 4.03 2.4
8 0.24 Good F IF 65.1 58 492 3.86 3.88 2.52
9 0.24 Ideal H IF 62.5 54 504 3.97 4 2.49
10 0.24 Ideal H IF 62.1 57 504 4 4.04 2.5
# ℹ 53,930 more rows
- Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond
diamonds %>% mutate(saleprice = price - 250)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z saleprice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
- Remove the x, y, and z variables from the diamonds dataset
diamonds %>% select(-x, -y, -z)
# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
- Determine the number of diamonds there are for each cut value
diamonds %>% group_by(cut) %>% summarise(number = n()) %>% ungroup()
# A tibble: 5 × 2
cut number
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
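The group_by() / summarise() / ungroup() pattern for counting rows has a shorthand in dplyr, count(). The sketch below assumes dplyr is installed and uses a small hypothetical data frame, `gems`, rather than diamonds, so the two forms can be compared side by side.

```r
library(dplyr)

# Hypothetical toy data standing in for diamonds
gems <- data.frame(cut = c("Fair", "Good", "Good", "Ideal", "Ideal", "Ideal"))

# Long form: group, count rows per group, ungroup
long_form <- gems %>%
  group_by(cut) %>%
  summarise(number = n()) %>%
  ungroup()

# Shorthand: count() groups, tallies, and returns ungrouped data in one step
short_form <- gems %>% count(cut, name = "number")

long_form
short_form
```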
- Create a new column named totalnum that holds the total number of diamonds.
diamonds %>% mutate(totalnum = n())
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z totalnum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
Week 3 post session
Week 4 15.10.24
Contingency tables and how to write a good research paper
Data Exploration & Scientific Hypotheses
How to create a histogram to highlight the relationship between the bill length and species of penguin:
library(tidyverse)
library(palmerpenguins)
data("penguins")
penguins %>%
group_by(species) %>%
ggplot(aes(x=bill_length_mm, color=species, fill=species))+
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
How to create a box plot to highlight the relationship between the bill length and species of penguin :
library(tidyverse)
library(palmerpenguins)
data("penguins")
penguins %>%
group_by(species) %>%
ggplot(aes(x=species,
y=bill_length_mm,
color=species,
fill=species))+
geom_boxplot(alpha=0.5)+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Categorical variables
- Note
Ordinal: categories that have a natural order.
Nominal: categories that have no ranking order.
Binary: nominal variables with exactly two categories.
Numerical, discrete: numeric values that can only take certain values (for example, counts).
Numerical, continuous: measured numeric values that can take any value within a particular range.
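These variable types map onto R data types roughly as follows. This is a sketch with made-up vectors (the object names and values are illustrative assumptions), using only base R.

```r
# Nominal: categories with no inherent order
island <- factor(c("Biscoe", "Dream", "Torgersen", "Dream"))

# Ordinal: categories with a meaningful order (levels listed worst to best)
cut_quality <- factor(c("Good", "Fair", "Ideal"),
                      levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                      ordered = TRUE)

# Binary: a nominal variable with exactly two categories
sex <- factor(c("male", "female", "female"))

# Discrete numeric (whole-number values) vs continuous numeric (measurements)
year <- c(2007L, 2008L, 2009L)
bill_length_mm <- c(39.1, 39.5, 40.3)

is.ordered(island)               # nominal factors are not ordered
is.ordered(cut_quality)          # ordinal factors are
cut_quality[1] > cut_quality[2]  # comparisons only make sense for ordered factors
nlevels(sex)
```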
Species of penguin
library(tidyverse)
library(palmerpenguins)
penguins %>%
ggplot(aes(x=species,
color=species,
fill=species))+
geom_bar(alpha=0.5)+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Observations per year
library(tidyverse)
library(palmerpenguins)
penguins %>%
ggplot(aes(x=year,
color=species,
fill=species))+
geom_bar()+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Observations per island
library(tidyverse)
library(palmerpenguins)
penguins %>%
ggplot(aes(x=island,
color=species,
fill=species))+
geom_bar()+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Visualizing correlations
penguins %>%
ggplot(aes(x=bill_length_mm,
y = bill_depth_mm))+
geom_point()+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Visualising correlations per species
penguins %>%
ggplot(aes(x=bill_length_mm,
y = bill_depth_mm,
color=species,
fill=species))+
geom_point()+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Moments of centrality
Mean
Median
Mode

Moments of dispersion
Variance
Standard deviation
Standard error
Range + quartiles
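Each of the measures listed above can be computed in base R. The sketch below uses a hypothetical vector `x` of body masses in grams; base R has no built-in statistical mode, so the most frequent value is taken from a frequency table.

```r
# Hypothetical sample of body masses in grams
x <- c(3750, 3800, 3250, 3450, 3650, 3625, 3450)

# Centrality
mean(x)
median(x)
mode_x <- as.numeric(names(which.max(table(x))))  # most frequent value
mode_x

# Dispersion
var(x)                           # sample variance
sd(x)                            # standard deviation
se <- sd(x) / sqrt(length(x))    # standard error of the mean
se
range(x)                         # minimum and maximum
quantile(x, c(0.25, 0.5, 0.75))  # quartiles
```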
Body mass per sex
penguins %>%
na.omit() %>%
ggplot(aes(x=sex,
y = body_mass_g,
color=species,
fill=species))+
geom_boxplot(alpha=0.7)+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Body mass per sex (inverting groups)
penguins %>%
na.omit() %>%
ggplot(aes(x=species,
y = body_mass_g,
color=sex,
fill=sex))+
geom_boxplot(alpha=0.7)+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))
Can body mass predict bill length?
Does sex explain flipper length?
Check distributions
penguins %>%
na.omit() %>%
pivot_longer(bill_length_mm:body_mass_g, names_to = "trait") %>%
ggplot(aes(x=value,
group=species,
fill=species,
color=species))+
geom_density(alpha=0.7)+
facet_grid(~trait, scales = "free_x" )+
theme(axis.text=element_text(size=16),
axis.title=element_text(size=16))+
theme_minimal()
Checking via histogram
set.seed(999)
normal<-rnorm(100)
normal %>%
as.tibble() %>%
ggplot(aes(value))+
geom_histogram(color="#DD4A48", fill="#DD4A48")+
geom_vline(xintercept=c(mean(normal), (mean(normal)+sd(normal)),mean(normal)-sd(normal)),
linetype="dashed")
Warning: `as.tibble()` was deprecated in tibble 2.0.0.
ℹ Please use `as_tibble()` instead.
ℹ The signature and semantics have changed, see `?as_tibble`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
set.seed(999)
normal<-rnorm(100)
normal %>%
as.tibble() %>%
ggplot(aes(value))+
geom_boxplot(fill="#DD4A48",alpha=0.7)
Research Methods
3.2 Scientific Hypotheses
Why do we need a hypothesis?
Candidate explanation for a phenomenon
Contains predictions and expectations
Feeds back into theory
Advances science
Image: By Efbrazil - Own work, CC BY-SA 4.0
What must a hypothesis contain?
It is an affirmative statement
It is not a question
Must lead to expectations if confirmed
Self-explanatory
Types of hypotheses
Scientific Hypotheses
Candidate statements to explain an observed phenomenon
Meant to generate logical predictions
Working guidelines
Statistical Hypotheses
Logical predictions
Confirmed by stats
Can be drawn in a graph
How to write hypotheses?
You should tell a story
Don’t use a subheading
Never refer to statistical hypotheses
If drug X has an effect on reducing headaches, then ...
When many exotic species are introduced into an ecosystem, the likelihood of severe ecological disruption increases
Identify the variables: Determine your independent and dependent variable.
Forecast a prediction: State what you expect to occur based on your current knowledge.
Drug X reduces headaches because it blocks the neuroreceptors of pain
Exotic species in high quantities disrupt ecological mutualistic networks
Week 4 post session
Data visualization
library(tidyverse)
library(modeldata)
Attaching package: 'modeldata'
The following object is masked _by_ '.GlobalEnv':
penguins
The following object is masked from 'package:palmerpenguins':
penguins
?ggplot
starting httpd help server ...
done
?crickets
view(crickets)
ggplot(crickets,aes(x = temp,
y = rate)) +
geom_point() +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket Chirps",
caption = "Source: McDonald (2009)")
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
geom_point() +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2")
Modifying basic properties of the plot
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point(color = "red",
size = 2,
alpha = .3,
shape = "square") +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
# Learn more about the options for geom_point()
# with ?geom_point
Adding another layer
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
title = "Cricket chirps",
caption = "Source: McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'
ggplot(crickets, aes(x = temp,
y = rate,
color = species)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Temperature",
y = "Chirp rate",
color = "Species",
title = "Cricket chirps",
caption = "Source: McDonald (2009)") +
scale_color_brewer(palette = "Dark2")
`geom_smooth()` using formula = 'y ~ x'
Other plots
ggplot(crickets, aes(x = rate)) +
  geom_histogram(bins = 15) # one quantitative variable
ggplot(crickets, aes(x = rate)) +
  geom_freqpoly(bins = 15)
ggplot(crickets, aes(x = species)) +
  geom_bar(color = "black",
           fill = "lightblue")
ggplot(crickets, aes(x = species,
                     fill = species)) +
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")
ggplot(crickets, aes(x = species,
                     y = rate,
                     color = species)) +
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()
?theme_minimal()
# faceting
# not great:
ggplot(crickets, aes(x = rate,
                     fill = species)) +
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2")
ggplot(crickets, aes(x = rate,
                     fill = species)) +
  geom_histogram(bins = 15,
                 show.legend = FALSE) +
  facet_wrap(~species) +
  scale_fill_brewer(palette = "Dark2")
?facet_wrap
ggplot(crickets, aes(x = rate,
                     fill = species)) +
  geom_histogram(bins = 15,
                 show.legend = FALSE) +
  facet_wrap(~species, ncol = 1) +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()
Research Method post session week 4
A research hypothesis is a condensed claim about what is expected to be observed during an experiment; it is the entry point of scientific inquiry. Evidence of the use of research hypotheses and empiricism dates back to Ancient Greek philosophy (Stoicism and Epicurus).
The formation of a hypothesis begins with a literature review to identify knowledge gaps, followed by a forecast of the outcome we would expect to occur. There are two types of hypothesis in research: null and alternative.
Effective hypotheses contain these qualities: testability, brevity and objectivity, clarity and relevance.
Week 5 pre session 22.10.24
Week 5 22.10.24
Choosing the right analysis
library(palmerpenguins)
library(tidyverse)
penguins %>%
glimpse()
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Is there an association between species and sex?
penguins %>%
na.omit() %>%
ggplot(aes(
x= species,
color=sex,
fill=sex))+
geom_bar(position = "dodge")-> cat_x_cat
cat_x_cat
| Frequency tests: useful for testing associations between categorical variables | Mean tests: useful for testing differences in means |
|---|---|
| Chi-square | T-test (two levels) |
| G-test | ANOVAs (3+ levels) |
| Contingency tables | Non-parametric equivalents |
| Log-linear models | Nested and two-way |
| | Post-hoc tests (Tukey HSD, Student, etc.) |
Correlations and models
correlations – many variations
Linear models– many variations
Highly predictive and powerful but depend on many conditions
Logistic models
Predictive of odds
Similar in logic to frequency tests
Similar in calculation to linear models
Research Methods
5.1 Hypothetico-Deductive Reasoning
Workflow
Question: How can climate location affect sexual dimorphism in penguins?
Hypothesis: Colder temperature leads to larger bodies thus reducing dimorphism
Prediction: Penguins on colder islands have similar body measurements
Make a question
How is global warming affecting wildlife tourism in Norway?
Create a hypothesis
Wildlife tourism in Norway is declining as climate change intensifies.
Draw a prediction
The increase in the earth's temperature is changing the Arctic landscape, leading to a decline in flora and fauna that fail to adapt. This in turn results in a decline in wildlife tourism as animal populations decrease.
Workflow (Wikipedia)
1- Based on observation, previously collected data and literature, find a knowledge gap
2- Form a hypothesis that explains the phenomenon
3- Deduce some expected patterns, assuming your hypothesis is true
4- Design an experiment to test your hypothesis
The scientific method
source: Crnkovic and Crnkovic 2014
Week 5 post session
Diagram one: Boxplots summarise a data set visually by plotting the lower quartile (Q1), median and upper quartile (Q3).
Mann-Whitney U: compares the medians of two independent groups to test for a statistically significant difference. As a non-parametric method, it is suitable for data that do not follow a normal distribution.
Kruskal-Wallis: a non-parametric test that compares the medians of three or more independent groups to determine whether they differ significantly.
Chi-square test: tests the independence of, or relationship between, categorical variables.
Diagram two: Plotting the distribution of a data set provides insight into skewness, kurtosis and outliers.
Shapiro-Wilk test: determines whether a sample is consistent with a normal distribution, which indicates whether parametric statistical tests can be used.
Chi-square can be used as a goodness-of-fit test, checking whether the observed distribution of a discrete variable matches a theoretical one.
Skewness test: measures the asymmetry of the distribution, which helps indicate which statistical tests can be used.
Diagram three: Line graphs highlight the relationship between two continuous variables.
ANOVA can test for significant differences in the means of multiple groups over time.
Chi-square can be used to test for changes in categorical variables over time.
Diagram four: Bar charts display categorical data.
Chi-square can be used to test for a significant association between two variables when a contingency table records their frequencies.
T-test: compares the means of two groups to determine whether they are significantly different.
ANOVA can be used, as above, to compare the means of multiple groups.
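Several of the tests named above are available in base R. The sketch below runs them on the built-in iris data as one possible illustration; which species and variable to compare is an assumption made for the example, not a recommendation from the notes.

```r
# Normality check: Shapiro-Wilk on sepal length of one species
setosa <- subset(iris, Species == "setosa")
shapiro_res <- shapiro.test(setosa$Sepal.Length)

# Two independent groups, non-parametric: Mann-Whitney U (wilcox.test in R)
two_species <- droplevels(subset(iris, Species != "virginica"))
mw_res <- wilcox.test(Sepal.Length ~ Species, data = two_species, exact = FALSE)

# Three or more independent groups, non-parametric: Kruskal-Wallis
kw_res <- kruskal.test(Sepal.Length ~ Species, data = iris)

# Parametric counterparts: t-test (two groups) and one-way ANOVA (3+ groups)
t_res <- t.test(Sepal.Length ~ Species, data = two_species)
aov_res <- aov(Sepal.Length ~ Species, data = iris)

shapiro_res$p.value
mw_res$p.value
kw_res$p.value
t_res$p.value
summary(aov_res)
```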
View(iris)
#Load the necessary library
library(ggplot2)
# Create the boxplot
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
labs(title = "Sepal Length by Species",
x = "Species",
y = "Sepal Length") +
theme_minimal()
# Create the density plot
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
geom_density(alpha = 0.5) + # Adjust alpha for transparency
labs(title = "Density Plot of Petal Length by Species",
x = "Petal Length",
y = "Density") +
theme_minimal()
# Create the line graph
#ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
# geom_line(aes(group = Species)) + # Group by Species for separate lines
#labs(title = "Petal Length vs. Petal Width by Species",
# x = "Petal Length",
# y = "Petal Width") +
# theme_minimal()
# Create the scatter plot
#ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
# geom_point() +
#labs(title = "Petal Length vs. Petal Width by Species",
# x = "Petal Length",
# y = "Petal Width") +
#theme_minimal()
# Create the scatter plot with different symbols for each species
#ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species)) +
# geom_point(size = 3) + # Adjust point size as needed
#labs(title = "Petal Length vs. Petal Width by Species",
# x = "Petal Length",
# y = "Petal Width") +
# theme_minimal()
library(ggplot2)
library(dplyr)
# Create a new categorical variable "SizeCategory"
iris <- iris %>%
mutate(SizeCategory = ifelse(Sepal.Length < median(Sepal.Length), "small", "big"))
# Create the grouped bar chart
ggplot(iris, aes(x = Species, fill = SizeCategory)) +
geom_bar(position = "dodge") +
labs(title = "Count of Species by Size",
x = "Species",
y = "Count") +
theme_minimal()
Week 6 29.10.24
Frequency Tests
When to use frequency tests
When categorical variables are present.
library(tidyverse)
library(ggplot2)
library(dbplyr)
Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':
ident, sql
library(knitr)
ladybirds <- tribble(
~Habitat, ~Site, ~Colour, ~Number,
"Rural", "R1", "Black", 10,
"Rural", "R2", "Black", 3,
"Rural", "R3", "Black", 4,
"Rural", "R4", "Black", 7,
"Rural", "R5", "Black", 6,
"Rural", "R1", "Red", 15,
"Rural", "R2", "Red", 18,
"Rural", "R3", "Red", 9,
"Rural", "R4", "Red", 12,
"Rural", "R5", "Red", 16,
"Industrial", "U1", "Black", 32,
"Industrial", "U2", "Black", 25,
"Industrial", "U3", "Black", 25,
"Industrial", "U4", "Black", 17,
"Industrial", "U5", "Black", 16,
"Industrial", "U1", "Red", 17,
"Industrial", "U2", "Red", 23,
"Industrial", "U3", "Red", 21,
"Industrial", "U4", "Red", 9,
"Industrial", "U5",
"Red", 15
)
ladybirds%>%
group_by(Habitat, Colour) %>%
summarize(count = sum(Number)) |>
kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
| Habitat | Colour | count |
|---|---|---|
| Industrial | Black | 115 |
| Industrial | Red | 85 |
| Rural | Black | 30 |
| Rural | Red | 70 |
ladybirds%>%
group_by(Habitat, Colour) %>%
summarize(count = sum(Number)) %>%
spread(Colour, count, fill = 0) |>
kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
| Habitat | Black | Red |
|---|---|---|
| Industrial | 115 | 85 |
| Rural | 30 | 70 |
How does habitat type influence morphotype occurrence of ladybirds?
ladybirds |>
group_by(Habitat, Colour) |>
summarize(count = sum(Number)) |>
mutate(prop=count/sum(count)) |> # our new proportion variable
kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
| Habitat | Colour | count | prop |
|---|---|---|---|
| Industrial | Black | 115 | 0.575 |
| Industrial | Red | 85 | 0.425 |
| Rural | Black | 30 | 0.300 |
| Rural | Red | 70 | 0.700 |
ladybirds |>
group_by(Habitat, Colour) |>
summarize(count = sum(Number)) |>
mutate(prop=count/sum(count)) |> # our new proportion variable
dplyr::select(Habitat, Colour, prop) %>%
spread(Habitat, prop) |>
kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
| Colour | Industrial | Rural |
|---|---|---|
| Black | 0.575 | 0.3 |
| Red | 0.425 | 0.7 |
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
ladybirds |>
group_by(Habitat, Colour) %>%
summarize(count = sum(Number)) %>%
spread(Colour, count, fill = 0)|>
adorn_totals(c("row", "col")) |>
kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
| Habitat | Black | Red | Total |
|---|---|---|---|
| Industrial | 115 | 85 | 200 |
| Rural | 30 | 70 | 100 |
| Total | 145 | 155 | 300 |
Proportions
#total
ladybirds |>
group_by(Habitat, Colour) %>%
summarize(count = sum(Number)) %>%
spread(Colour, count, fill = 0) |>
column_to_rownames("Habitat") |>
proportions() |>
kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
| | Black | Red |
|---|---|---|
| Industrial | 0.3833333 | 0.2833333 |
| Rural | 0.1000000 | 0.2333333 |
#rows
ladybirds |>
group_by(Habitat, Colour) %>%
summarise(count = sum(Number)) %>%
spread(Colour, count, fill = 0) |>
column_to_rownames("Habitat") |>
as.matrix()->t
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
proportions(t,1) |>
kable()
| | Black | Red |
|---|---|---|
| Industrial | 0.575 | 0.425 |
| Rural | 0.300 | 0.700 |
#columns
ladybirds |>
group_by(Habitat, Colour) %>%
summarise(count = sum(Number)) %>%
spread(Colour, count, fill = 0) |>
column_to_rownames("Habitat") |>
as.matrix()->t
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
proportions(t,2) |>
kable()
| | Black | Red |
|---|---|---|
| Industrial | 0.7931034 | 0.5483871 |
| Rural | 0.2068966 | 0.4516129 |
Is there an association between habitat and LB morphotype?
Habitat ~ colour
habitat/ colour
2/3 of LB are found in Industrial areas
It is rare to find a black LB in rural areas
Red LB don’t show any habitat preference
Black LB prefer Industrial areas
Work with tables and graph
Goodness of fit tests
Independence tests
Do people who watch Naruto watch Ghibli?
naruto <- matrix(c(35, 205, 8, 48), nrow = 2, byrow = TRUE)
chisq.test(naruto)$expected
chisq.test(naruto)
Homogeneity tests
$$\chi^2 = \sum \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}}$$
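The formula can be checked by hand against R's built-in test. The sketch below uses the observed counts from the ladybird contingency table and assumes a plain Pearson chi-squared test with the Yates continuity correction switched off, so the two results are comparable. The call is written as `stats::chisq.test()` because janitor masks `chisq.test()` when it is loaded.

```r
# Observed counts from the ladybird contingency table
obs <- matrix(c(115, 85,
                 30, 70),
              nrow = 2, byrow = TRUE,
              dimnames = list(Habitat = c("Industrial", "Rural"),
                              Colour  = c("Black", "Red")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-squared statistic: sum over cells of (Obs - Exp)^2 / Exp
chi_sq <- sum((obs - expected)^2 / expected)

# Built-in test; correct = FALSE switches off the Yates continuity correction
# so the statistic matches the plain formula
res <- stats::chisq.test(obs, correct = FALSE)

chi_sq
res$statistic
res$p.value
```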
6.1 Research Methods- How to make good titles?
What is a title?
A short statement that conveys the relevant geographical scope, methods, taxonomic group and effect using keywords.
It is not a long, over-informative statement full of overly specialised non-keywords.
Descriptive titles
Methodological titles
‘Spoiler’ title
Interrogative titles