R Skills Workbook

Author

Tia Fernandez


Week 1 01.10.24

Meet Quarto

This workbook will serve as a guide to assist with coding as well as highlighting my development.

A picture from a recent trip to Iceland.

The small fishing town of Húsavík is the whale watching capital of Europe.

Photograph of a defrosting lake against a sunrise

Reykjavikurtjorn during Polar winter.

To add a photograph, use the image button in the toolbar above and select Browse, which will pull up all available images. It is also possible to insert links to data or buzzwords.

Coding

Switching to the Source tab shows the markdown that is generated as you work in Visual mode. The rendered document will usually look much the same as it did in Visual mode.

Rendered documents refer to the generation of a file which contains the combination of the code and the output.

A YAML header is delimited by three dashes (---) at either end.

YAML uses key-value pairs in the format key: value. Other fields found in headers include date, subtitle, theme, and font colour.

Code Chunks

R code chunks are identified with {r}, with (optional) chunk options, in YAML style, identified by #| at the beginning of the line.

In this case, the label of the code chunk is load-packages, and we set include to false to indicate that we don’t want the chunk itself or any of its outputs in the rendered documents.

echo: false hides the code and shows only its output. This can be applied document-wide by adding the option under execute in the YAML header.
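As a rough sketch of the chunk options described above: the lines beginning with #| sit at the top of the chunk body, before any code. The label summary-demo and the vector x are made up for illustration; in a Quarto document these lines would sit inside an {r} chunk.

```r
#| label: summary-demo
#| echo: false

# With echo: false, the rendered document shows only the printed result,
# not these lines of code. Run as a plain script, the #| lines are comments.
x <- c(2, 4, 6)   # hypothetical example data
mean(x)
```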



Markdown text

Text with formatting, including section headers, hyperlinks, an embedded image, and an inline code chunk.

Quarto uses markdown syntax for text. If using the visual editor, you won’t need to learn much markdown syntax for authoring your document, as you can use the menus and shortcuts to add a header, bold text, or insert a table. If using the source editor, you can achieve these with markdown expressions like ## and **bold**.

This scatter plot illustrates the relationship between flipper and bill length of penguins.

Week 3 pre-session 08.10.24

Tidyverse

With reference to Chapter 4 of R for Graduate Students by Y. Wendy Huynh.

── Attaching core tidyverse packages ────────────────────── tidyverse 2.0.0
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ── 
✖ dplyr::filter() masks stats::filter() 
✖ dplyr::lag()    masks stats::lag() 
ℹ Use the conflicted package to force all conflicts to become errors

Tidyverse is a package which contains other packages such as dplyr and ggplot2. This package must be installed on the device before it can be loaded from the library.

Use install.packages("tidyverse") once, before library(tidyverse).

Packages often have naming conflicts. When two loaded packages share a function name, R uses the function from the most recently loaded package.

To call ONLY the filter function from dplyr, without loading the rest of the package:

dplyr::filter()

To load ALL functions from dplyr:

library(dplyr)
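A minimal sketch of the difference, assuming dplyr is installed. The built-in mtcars data and the object names four_cyl and moving_avg are chosen here purely for illustration. Writing package::function() states which package a function comes from, which sidesteps the masking described above.

```r
library(dplyr)   # attaches every exported dplyr function

# After attaching dplyr, a bare filter() refers to dplyr::filter().
# The package::function() form names the source explicitly.
four_cyl <- dplyr::filter(mtcars, cyl == 4)        # keep rows where cyl is 4
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))    # 3-point moving average

nrow(four_cyl)
moving_avg
```

The masked stats::filter() remains reachable through its prefix even though dplyr::filter() has taken over the bare name.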

Example Code:

diamonds %>%
  group_by(clarity) %>%
  summarize(m = mean(price)) %>%
  ungroup()

# A tibble: 8 x 2
clarity     m
<ord>   <dbl> 
1 I1      3924
2 SI2     5063
3 SI1     3996
4 VS2     3925
5 VS1     3839
6 VVS2    3284
7 VVS1    2523
8 IF      2865.

Diamond data set

Chapter 5, R for Graduate Students

The diamonds dataset ships with the ggplot2 package, so it is available once ggplot2 (or the tidyverse) is loaded.

View(diamonds) will open up the data when typed into the console. To edit this data you must do so in code.

view(diamonds)

str(diamonds) will allow you to look at the structure of the data. There are 10 variables in total (three ordered factors, one integer, and six numeric).

str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Don’t forget that R cares about capitalization.

Week 3 post-session 10.10.24

Chapter 6.6 arrange()

Allows you to arrange values within a variable in ascending or descending order (if that is applicable to your values). This can apply to both numerical and non-numerical values.

diamonds %>% arrange(cut)
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 53,930 more rows
diamonds %>%                         # utilizes the diamonds dataset
  group_by(color, clarity) %>%       # groups data by color and clarity variables
  mutate(price200 = mean(price)) %>% # creates new variable (average price by groups)
  ungroup() %>%                      # data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>%  # new variable, original price + $10
  select(cut, color,                 # retain only these listed columns
         clarity, price, 
         price200, random10) %>% 
  arrange(color) %>%                 # visualize data ordered by color
  group_by(cut) %>%                  # group data by cut
  mutate(dis = n_distinct(price),    # counts the number of unique price values per cut 
         rowID = row_number()) %>%   # numbers each row consecutively for each cut
  ungroup()                          # final ungrouping of data
# A tibble: 53,940 × 8
   cut       color clarity price price200 random10   dis rowID
   <ord>     <ord> <ord>   <int>    <dbl>    <dbl> <int> <int>
 1 Very Good D     VS2       357    2587.      367  5840     1
 2 Very Good D     VS1       402    3030.      412  5840     2
 3 Very Good D     VS2       403    2587.      413  5840     3
 4 Good      D     VS2       403    2587.      413  3086     1
 5 Good      D     VS1       403    3030.      413  3086     2
 6 Premium   D     VS2       404    2587.      414  6014     1
 7 Premium   D     SI1       552    2976.      562  6014     2
 8 Ideal     D     SI1       552    2976.      562  7281     1
 9 Ideal     D     SI1       552    2976.      562  7281     2
10 Very Good D     VVS1      553    2948.      563  5840     4
# ℹ 53,930 more rows

Chapter 6.7 Extra practise

View all of the variable names in the diamonds dataset

view(diamonds)
  • Arrange the diamonds by lowest to highest price
diamonds %>% arrange(price)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
  • Arrange the diamonds by highest to lowest price
diamonds %>% arrange(desc(price))
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
  • Arrange the diamonds by lowest price and cut
diamonds %>% arrange(price)%>% arrange(cut)
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
# ℹ 53,930 more rows
  • Arrange the diamonds by highest price and cut

    diamonds %>% arrange(desc(price)) %>% arrange(desc(cut))
    # A tibble: 53,940 × 10
       carat cut   color clarity depth table price     x     y     z
       <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
     1  1.51 Ideal G     IF       61.7  55   18806  7.37  7.41  4.56
     2  2.07 Ideal G     SI2      62.5  55   18804  8.2   8.13  5.11
     3  2.15 Ideal G     SI2      62.6  54   18791  8.29  8.35  5.21
     4  2.05 Ideal G     SI1      61.9  57   18787  8.1   8.16  5.03
     5  1.6  Ideal F     VS1      62    56   18780  7.47  7.52  4.65
     6  2.06 Ideal I     VS2      62.2  55   18779  8.15  8.19  5.08
     7  1.71 Ideal G     VVS2     62.1  55   18768  7.66  7.63  4.75
     8  2.08 Ideal H     SI1      58.7  60   18760  8.36  8.4   4.92
     9  2.03 Ideal G     SI1      60    55.8 18757  8.17  8.3   4.95
    10  2.61 Ideal I     SI2      62.1  56   18756  8.85  8.73  5.46
    # ℹ 53,930 more rows
  • Arrange the diamonds by lowest to highest price and worst to best clarity.

    diamonds %>% arrange(price) %>% arrange(desc(clarity))
    # A tibble: 53,940 × 10
       carat cut       color clarity depth table price     x     y     z
       <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
     1  0.23 Very Good H     IF       63.9    55   369  3.89  3.9   2.49
     2  0.24 Very Good H     IF       61.3    56   449  4.04  4.06  2.48
     3  0.26 Ideal     H     IF       61.1    57   468  4.12  4.16  2.53
     4  0.23 Very Good F     IF       61      62   485  3.95  3.99  2.42
     5  0.3  Ideal     J     IF       61.5    56   489  4.32  4.33  2.66
     6  0.3  Ideal     J     IF       61.5    57   489  4.29  4.36  2.66
     7  0.23 Very Good E     IF       59.9    58   492  3.98  4.03  2.4 
     8  0.24 Good      F     IF       65.1    58   492  3.86  3.88  2.52
     9  0.24 Ideal     H     IF       62.5    54   504  3.97  4     2.49
    10  0.24 Ideal     H     IF       62.1    57   504  4     4.04  2.5 
    # ℹ 53,930 more rows
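An alternative worth noting for the two-key exercises: arrange() accepts several columns at once, sorting by the first and breaking ties with the next. The sketch below assumes dplyr and ggplot2 are installed; the object names are made up. Because clarity is an ordered factor running from I1 up to IF, worst to best clarity corresponds to ascending order, whereas desc(clarity) puts IF first.

```r
library(dplyr)
library(ggplot2)   # provides the diamonds dataset

# Sort by cut first, then by price within each cut (both ascending)
by_cut_price <- diamonds %>% arrange(cut, price)

# Ascending order of the ordered factor puts I1 (the lowest level) first,
# with price ascending inside each clarity level
by_clarity_price <- diamonds %>% arrange(clarity, price)

head(by_cut_price, 3)
head(by_clarity_price, 3)
```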
  • Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond

     diamonds %>% mutate(saleprice = price - 250)
    # A tibble: 53,940 × 11
       carat cut       color clarity depth table price     x     y     z saleprice
       <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
     1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
     2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
     3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
     4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
     5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
     6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
     7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
     8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
     9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
    10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
    # ℹ 53,930 more rows
  • Remove the x, y, and z variables from the diamond dataset

    diamonds %>% select(-x, -y, -z)
    # A tibble: 53,940 × 7
       carat cut       color clarity depth table price
       <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
     1  0.23 Ideal     E     SI2      61.5    55   326
     2  0.21 Premium   E     SI1      59.8    61   326
     3  0.23 Good      E     VS1      56.9    65   327
     4  0.29 Premium   I     VS2      62.4    58   334
     5  0.31 Good      J     SI2      63.3    58   335
     6  0.24 Very Good J     VVS2     62.8    57   336
     7  0.24 Very Good I     VVS1     62.3    57   336
     8  0.26 Very Good H     SI1      61.9    55   337
     9  0.22 Fair      E     VS2      65.1    61   337
    10  0.23 Very Good H     VS1      59.4    61   338
    # ℹ 53,930 more rows
  • Determine the number of diamonds there are for each cut value

diamonds %>% group_by(cut) %>% summarise(number = n()) %>% ungroup()
# A tibble: 5 × 2
  cut       number
  <ord>      <int>
1 Fair        1610
2 Good        4906
3 Very Good  12082
4 Premium    13791
5 Ideal      21551
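For reference, dplyr also offers count(), which is roughly a shorthand for group_by() followed by summarise(n = n()). This is a sketch assuming dplyr and ggplot2 are installed; the name argument is used only to match the column name above.

```r
library(dplyr)
library(ggplot2)   # provides the diamonds dataset

# One row per cut, with the number of diamonds in a column called "number"
cut_counts <- diamonds %>% count(cut, name = "number")
cut_counts
```

The result comes back ungrouped, so no ungroup() step is needed afterwards.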
  • Create a new column named totalnum that calculates the total number of diamonds.
diamonds %>% mutate(totalnum = n())
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z totalnum
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <int>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    53940
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    53940
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    53940
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    53940
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    53940
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    53940
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47    53940
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53    53940
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49    53940
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39    53940
# ℹ 53,930 more rows

Week 3 post session

Week 4 15.10.24

Contingency tables and how to write a good research paper

Data Exploration & Scientific Hypotheses

How to create a histogram to highlight the relationship between the bill length and species of penguin:

library(tidyverse)
library(palmerpenguins)

data("penguins")
penguins %>% 
  group_by(species) %>% 
  ggplot(aes(x=bill_length_mm, color=species, fill=species))+
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
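The two messages above come from the default of 30 bins and from two penguins with a missing bill length. A possible way to address both is sketched below, assuming ggplot2 and palmerpenguins are installed; the bin width of 1 mm is an arbitrary choice for illustration, and penguins_complete is a made-up name.

```r
library(ggplot2)
library(palmerpenguins)

# Drop the rows with a missing bill length so ggplot2 has nothing to remove
penguins_complete <- penguins[!is.na(penguins$bill_length_mm), ]

# binwidth = 1 (1 mm per bar) replaces the default of 30 bins
p <- ggplot(penguins_complete,
            aes(x = bill_length_mm, color = species, fill = species)) +
  geom_histogram(binwidth = 1, alpha = 0.5)
p
```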

How to create a box plot to highlight the relationship between the bill length and species of penguin:

library(tidyverse)
library(palmerpenguins)

data("penguins")
penguins %>% 
  group_by(species) %>% 
  ggplot(aes(x=species, 
             y=bill_length_mm, 
             color=species, 
             fill=species))+
  geom_boxplot(alpha=0.5)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Categorical variables

Note

  • Ordinal: categories that maintain an order

  • Nominal: categories that have no ranking order

  • Binary: nominal variables with two categories

  • Numerical (discrete): numbered values that can only take certain values

  • Numerical (continuous): measured values that can be any number within a particular range
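One way these variable types might be represented in R is sketched below. All values and object names are made up for illustration.

```r
# Nominal: categories with no order
island <- factor(c("Biscoe", "Dream", "Torgersen", "Dream"))

# Ordinal: categories with an order, stored as an ordered factor
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# Binary: a nominal variable with exactly two categories
sex <- factor(c("male", "female", "female"))

# Discrete numerical: whole-number counts, stored as integers
eggs <- c(1L, 2L, 2L)

# Continuous numerical: measurements, stored as doubles
bill_length_mm <- c(39.1, 39.5, 40.3)

is.ordered(island)      # FALSE
is.ordered(size)        # TRUE
size[1] < size[2]       # TRUE: "small" ranks below "large"
nlevels(sex)            # 2
class(eggs)             # "integer"
class(bill_length_mm)   # "numeric"
```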

Species of penguin

library(tidyverse)
library(palmerpenguins)

penguins %>% 
  ggplot(aes(x=species,
             color=species, 
             fill=species))+
  geom_bar(alpha=0.5)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Observations per year

library(tidyverse)
library(palmerpenguins)

penguins %>% 
  ggplot(aes(x=year,
             color=species, 
             fill=species))+
  geom_bar()+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Observations per island

library(tidyverse)
library(palmerpenguins)

penguins %>% 
  ggplot(aes(x=island,
             color=species, 
             fill=species))+
  geom_bar()+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Visualizing correlations

penguins %>% 
  ggplot(aes(x=bill_length_mm, 
             y = bill_depth_mm))+
  geom_point()+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Visualising correlations per species

penguins %>% 
  ggplot(aes(x=bill_length_mm, 
             y = bill_depth_mm,
             color=species, 
             fill=species))+
  geom_point()+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Note

Moments of centrality

  • Mean

  • Median

  • Mode

Moments of dispersion

  • Variance

  • Standard deviation

  • Standard error

  • Range + quartiles
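The measures listed above can be computed in base R roughly as follows. The vector x is made-up data. Base R has no function that returns the statistical mode, so the most frequent value is looked up from a frequency table here, which is one of several possible approaches.

```r
x <- c(2, 4, 4, 5, 7, 9, 11)   # hypothetical measurements

# Centrality
x_mean   <- mean(x)                                # arithmetic mean
x_median <- median(x)                              # middle value
x_mode   <- as.numeric(names(which.max(table(x)))) # most frequent value

# Dispersion
x_var   <- var(x)                                  # sample variance
x_sd    <- sd(x)                                   # standard deviation
x_se    <- sd(x) / sqrt(length(x))                 # standard error of the mean
x_range <- range(x)                                # minimum and maximum
x_quart <- quantile(x, c(0.25, 0.5, 0.75))         # quartiles

x_mean; x_median; x_mode
x_var; x_sd; x_se
x_range; x_quart
```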

Body mass per sex

penguins %>% 
  na.omit() %>% 
  ggplot(aes(x=sex, 
             y = body_mass_g,
             color=species, 
             fill=species))+
  geom_boxplot(alpha=0.7)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Body mass per sex (inverting groups)

penguins %>% 
  na.omit() %>% 
  ggplot(aes(x=species, 
             y = body_mass_g,
             color=sex, 
             fill=sex))+
  geom_boxplot(alpha=0.7)+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))

Can body mass predict bill length?

Does sex explain flipper length?

Check distributions

penguins %>% 
  na.omit() %>% 
  pivot_longer(bill_length_mm:body_mass_g, names_to = "trait") %>% 
  ggplot(aes(x=value,
         group=species,
         fill=species,
         color=species))+
  geom_density(alpha=0.7)+
  facet_grid(~trait, scales = "free_x" )+
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))+
  theme_minimal()

Checking via histogram

set.seed(999)
normal<-rnorm(100)
normal %>% 
  as_tibble() %>% 
  ggplot(aes(value))+
  geom_histogram(color="#DD4A48", fill="#DD4A48")+
  geom_vline(xintercept=c(mean(normal), (mean(normal)+sd(normal)),mean(normal)-sd(normal)), 
             linetype="dashed")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

set.seed(999)
normal<-rnorm(100)
normal %>% 
  as_tibble() %>% 
  ggplot(aes(value))+
  geom_boxplot(fill="#DD4A48",alpha=0.7)

Research Methods

3.2 Scientific Hypotheses

Why do we need a hypothesis?

  • Candidate explanation for a phenomenon

  • Contains predictions and expectations

  • Feeds back into theory

  • Advances science

Image: By Efbrazil - Own work, CC BY-SA 4.0

What must hypotheses contain?

  • It is an affirmative statement

  • It is not a question

  • Must lead to expectations if confirmed

  • Self-explanatory

Types of hypotheses

Scientific Hypotheses

  • Candidate statements to explain an observed phenomenon

  • Meant to generate logical predictions

  • Working guidelines

Statistical Hypotheses

  • Logical predictions

  • Confirmed by stats

  • Can be drawn in a graph

How to write hypotheses?

  • You should tell a story

  • Don’t use a subheading

  • Never refer to statistical hypotheses

The “If/then” method:

If drug X has an effect on reducing headaches, then …

When many exotic species are introduced into the ecosystem, then the likelihood of severe ecological disruption increases.

  1. Identify the variables: Determine your independent and dependent variable.

  2. Forecast a prediction: State what you expect to occur based on your current knowledge.

The statement method:

Drug X reduces headaches because it blocks the neuroreceptors of pain.

Exotic species in high quantities disrupt ecological mutualistic networks.

Week 4 post session

Data visualization

library(tidyverse)
library(modeldata)

Attaching package: 'modeldata'
The following object is masked _by_ '.GlobalEnv':

    penguins
The following object is masked from 'package:palmerpenguins':

    penguins
?ggplot
starting httpd help server ...
 done
?crickets
view(crickets)
ggplot(crickets,aes(x = temp, 
                    y = rate)) + 
  
  geom_point() + 
  labs(x = "Temperature",
       y = "Chirp rate",
       title =  "Cricket Chirps",
       caption = "Source: McDonald (2009)" )

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")

Modifying basic properties of the plot

ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point(color = "red",
             size = 2,
             alpha = .3,
             shape = "square") +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

# Learn more about the options for geom_point()
# with ?geom_point

Adding another layer

ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'

ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2") 
`geom_smooth()` using formula = 'y ~ x'

Other plots

ggplot(crickets, aes(x = rate)) + 
  geom_histogram(bins = 15) # one quantitative variable

ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15)

ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black",
           fill = "lightblue")

ggplot(crickets, aes(x = species, 
                     fill = species)) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")

ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

?theme_minimal()

# faceting

# not great:
ggplot(crickets, aes(x = rate, 
                     fill = species)) + 
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2")

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species) +
  scale_fill_brewer(palette = "Dark2")

?facet_wrap

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species, ncol = 1) +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()

Research Method post session week 4

A research hypothesis is a condensed claim about what is expected to be observed during an experiment. It is the entry point of scientific inquiry. Evidence of the use of research hypotheses and empiricism has been found dating back to Ancient Greek philosophy (Stoicism and Epicurus).

The formation of a hypothesis begins with a literature review to determine knowledge gaps, followed by forecasting the outcome we would expect to occur. There are two types of hypotheses in research: null and alternative.

Effective hypotheses contain these qualities: testability, brevity and objectivity, clarity and relevance.

Week 5 pre session 22.10.24

Week 5 22.10.24

Choosing the right analysis

library(palmerpenguins)
library(tidyverse)

penguins %>% 
  glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Is there an association between species and sex?

penguins %>% 
  na.omit() %>% 
  ggplot(aes(
    x= species, 
    color=sex, 
    fill=sex))+
  geom_bar(position = "dodge")-> cat_x_cat
cat_x_cat

Frequency Tests

Useful in testing associations between categorical variables

Mean tests

Useful for testing differences in means

Frequency tests       Mean tests
Chi-square            T-test (two levels)
G-test                ANOVAs (3+ levels)
Contingency tables    Non-parametric equivalents
Log-linear models     Nested and two-way
                      Post-hoc tests (Tukey HSD, Student, etc.)
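As an illustration of the mean tests named above, the sketch below runs a two-level t-test, a one-way ANOVA, and a Tukey HSD post-hoc test in base R. The built-in iris data is an assumption made for the example, not the dataset used in the session, and the object names are made up.

```r
# Two levels: Welch two-sample t-test on two of the three iris species
two_species <- droplevels(subset(iris, Species != "virginica"))
t_res <- t.test(Sepal.Length ~ Species, data = two_species)
t_res

# Three or more levels: one-way ANOVA across all three species
aov_res <- aov(Sepal.Length ~ Species, data = iris)
summary(aov_res)

# Post-hoc: Tukey HSD shows which pairs of species differ
tukey_res <- TukeyHSD(aov_res)
tukey_res
```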

Correlations and models

  • correlations – many variations

  • Linear models– many variations

Highly predictive and powerful but depend on many conditions

Logistic models

  • Logistic models

  • Predictive of odds

  • Similar in logic to frequency tests

  • Similar in calculation to linear models
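A minimal base R sketch of a correlation, a linear model, and a logistic model. The built-in mtcars data is an assumption chosen for illustration (wt is weight, mpg is fuel economy, am is a 0/1 transmission indicator); the object names are made up.

```r
# Correlation between two continuous variables (Pearson by default)
cor_res <- cor.test(mtcars$wt, mtcars$mpg)
cor_res

# Linear model: predict a continuous response from a continuous predictor
lin_mod <- lm(mpg ~ wt, data = mtcars)
summary(lin_mod)

# Logistic model: predict a binary response; coefficients are on the log-odds scale
log_mod <- glm(am ~ wt, data = mtcars, family = binomial)
exp(coef(log_mod))   # exponentiating gives odds ratios
```

The glm() call has the same formula shape as lm(), which reflects the point above that logistic models are similar in calculation to linear models.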

Research Methods

5.1 Hypothetico-Deductive Reasoning

Workflow

Question: How can climate location affect sexual dimorphism in penguins?

Hypothesis: Colder temperature leads to larger bodies thus reducing dimorphism

Prediction: Penguins on colder islands have similar body measurements

Your Turn
  • Make a question

    How is global warming affecting wildlife tourism in Norway?

  • Create a hypothesis

    Wildlife tourism in Norway is on the decline as climate change increases.

  • Draw a prediction

    The increase in the temperature of the Earth is causing changes to the Arctic landscape, leading to a decline in flora and fauna as they fail to adapt. This results in a decline in wildlife tourism as animal populations decrease.

Workflow (Wikipedia)

1- Based on observation, previous collected data and literature, find a knowledge gap

2- Form a hypothesis that explains the phenomenon

3- Deduce some expected patterns, assuming your hypothesis is true

4- Design an experiment to test your hypothesis

The scientific method

source: Crnkovic and Crnkovic 2014

Week 5 post session

Diagram one: Boxplots visually illustrate a data set by plotting the lower quartile (Q1), median and upper quartile (Q3).

  • Mann-Whitney U: can be used to compare the medians of two independent groups to determine statistical significance. As a non-parametric testing method, it is suitable for data that doesn’t follow a normal distribution.

  • Kruskal-Wallis: a non-parametric test that is able to compare the medians of three or more independent groups. This test can determine statistical significance among the medians.

  • Chi-square Test: Can test the independence or relationship between categorical variables.
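The two non-parametric tests above might be run in base R as sketched here. The built-in iris data and the object names are assumptions made for illustration. In R, the Mann-Whitney U test is carried out by wilcox.test() on two independent groups.

```r
# Mann-Whitney U: compare two independent groups
two_species <- droplevels(subset(iris, Species != "virginica"))
# exact = FALSE requests the normal approximation, since the data contain ties
mw_res <- wilcox.test(Sepal.Width ~ Species, data = two_species, exact = FALSE)
mw_res

# Kruskal-Wallis: compare three or more independent groups
kw_res <- kruskal.test(Sepal.Width ~ Species, data = iris)
kw_res
```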

Diagram two: Visually providing the distribution of a data set can provide an insight into skewness, kurtosis and outliers.

  • Shapiro-Wilk test: can determine whether a sample comes from a ‘normal’ distribution, which allows parametric statistical tests to be used.

  • Chi-square: can be used as a goodness-of-fit test, checking whether the observed frequencies of a categorical (discrete) variable match a specified theoretical distribution.

  • Skewness test: measures the asymmetry of the distribution, which can indicate which statistical tests can be used.
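A sketch of checking normality and skewness in base R. The simulated samples are made up for illustration. Base R has no built-in skewness function, so a small hypothetical helper named skewness() is defined here using the moment-based formula; packaged versions may use slightly different corrections.

```r
set.seed(999)
normal <- rnorm(100)   # simulated sample from a normal distribution
skewed <- rexp(100)    # simulated right-skewed sample

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed
sw_normal <- shapiro.test(normal)
sw_skewed <- shapiro.test(skewed)
sw_normal
sw_skewed

# Hypothetical helper: third central moment divided by the
# second central moment raised to the power 1.5
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skewness(normal)   # close to 0 for a symmetric sample
skewness(skewed)   # positive for a right-skewed sample
```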

Diagram three: Line graphs highlight the correlation of two continuous variables

  • ANOVA is able to determine whether the means of multiple groups differ significantly over time.

  • Chi-squared can be used to test the significance of categorical variables over time.

Diagram four: Bar charts display categorical data.

  • Chi-squared can be used to determine whether there is a significant association between variables when there is a contingency table representing the frequency of two variables.

  • T-test can be used to compare the means of two groups to determine if they are significantly different.

  • ANOVA can be used like above to compare the means of multiple groups

View(iris)
# Load the necessary library
library(ggplot2)

# Create the boxplot
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  labs(title = "Sepal Length by Species",
       x = "Species",
       y = "Sepal Length") +
  theme_minimal()

# Create the density plot
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.5) +  # Adjust alpha for transparency
  labs(title = "Density Plot of Petal Length by Species",
       x = "Petal Length",
       y = "Density") +
  theme_minimal()

# Create the line graph
#ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
 # geom_line(aes(group = Species)) +  # Group by Species for separate lines
  #labs(title = "Petal Length vs. Petal Width by Species",
   #    x = "Petal Length",
    #   y = "Petal Width") +
#  theme_minimal()
# Create the scatter plot
#ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
 # geom_point() +
  #labs(title = "Petal Length vs. Petal Width by Species",
   #    x = "Petal Length",
    #   y = "Petal Width") +
  #theme_minimal()
# Create the scatter plot with different symbols for each species
#ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, shape = Species)) +
 # geom_point(size = 3) +  # Adjust point size as needed
  #labs(title = "Petal Length vs. Petal Width by Species",
   #    x = "Petal Length",
    #   y = "Petal Width") +
#  theme_minimal()
library(ggplot2)
library(dplyr)

# Create a new categorical variable "SizeCategory"
iris <- iris %>%
  mutate(SizeCategory = ifelse(Sepal.Length < median(Sepal.Length), "small", "big"))

# Create the grouped bar chart
ggplot(iris, aes(x = Species, fill = SizeCategory)) +
  geom_bar(position = "dodge") +
  labs(title = "Count of Species by Size",
       x = "Species",
       y = "Count") +
  theme_minimal()

Week 6 29.10.24

Frequency Tests

When to use frequency tests

Use frequency tests when the variables are categorical and the data are counts of observations in each category.

library(tidyverse)
library(ggplot2)
library(dplyr)
library(knitr)
ladybirds <- tribble(
  ~Habitat, ~Site, ~Colour, ~Number,
  "Rural", "R1", "Black", 10,
  "Rural", "R2", "Black", 3,
  "Rural", "R3", "Black", 4,
  "Rural", "R4", "Black", 7,
  "Rural", "R5", "Black", 6,
  "Rural", "R1", "Red", 15,
  "Rural", "R2", "Red", 18,
  "Rural", "R3", "Red", 9,
  "Rural", "R4", "Red", 12,
  "Rural", "R5", "Red", 16,
  "Industrial", "U1", "Black", 32,
  "Industrial", "U2", "Black", 25,
  "Industrial", "U3", "Black", 25,
  "Industrial", "U4", "Black", 17,
  "Industrial", "U5", "Black", 16,
  "Industrial", "U1", "Red", 17,
  "Industrial", "U2", "Red", 23,
  "Industrial", "U3", "Red", 21,
  "Industrial", "U4", "Red", 9,
  "Industrial", "U5", "Red", 15
)
ladybirds%>% 
  group_by(Habitat, Colour) %>% 
  summarize(count = sum(Number)) |> 
  kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
|Habitat    |Colour | count|
|:----------|:------|-----:|
|Industrial |Black  |   115|
|Industrial |Red    |    85|
|Rural      |Black  |    30|
|Rural      |Red    |    70|
ladybirds%>% 
  group_by(Habitat, Colour) %>% 
  summarize(count = sum(Number)) %>% 
  spread(Colour, count, fill = 0) |> 
  kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
|Habitat    | Black| Red|
|:----------|-----:|---:|
|Industrial |   115|  85|
|Rural      |    30|  70|
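`spread()` still works but is superseded in tidyr by `pivot_wider()`, and the repeated `summarise()` message can be silenced with `.groups = "drop"`. Below is a minimal sketch of the same habitat-by-colour table. It assumes a condensed stand-in data frame (`ladybird_totals`, a hypothetical name) that holds only the four totals shown above rather than the full site-level data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in: the four habitat/colour totals from the table above
ladybird_totals <- tribble(
  ~Habitat,     ~Colour, ~Number,
  "Industrial", "Black", 115,
  "Industrial", "Red",    85,
  "Rural",      "Black",  30,
  "Rural",      "Red",    70
)

wide_counts <- ladybird_totals |>
  group_by(Habitat, Colour) |>
  summarise(count = sum(Number), .groups = "drop") |>  # drop grouping, so no message
  pivot_wider(names_from = Colour, values_from = count, values_fill = 0)

wide_counts
```

With the full `ladybirds` data the pipeline would be the same; only the input data frame changes.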

How does habitat type influence the occurrence of ladybird morphotypes?

ladybirds |> 
  group_by(Habitat, Colour) |> 
  summarize(count = sum(Number)) |> 
  mutate(prop=count/sum(count)) |>   # our new proportion variable
  kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
|Habitat    |Colour | count|  prop|
|:----------|:------|-----:|-----:|
|Industrial |Black  |   115| 0.575|
|Industrial |Red    |    85| 0.425|
|Rural      |Black  |    30| 0.300|
|Rural      |Red    |    70| 0.700|
ladybirds |> 
  group_by(Habitat, Colour) |> 
  summarize(count = sum(Number)) |> 
  mutate(prop=count/sum(count)) |>   # our new proportion variable
  dplyr::select(Habitat, Colour, prop) %>% 
  spread(Habitat, prop) |> 
  kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
|Colour | Industrial| Rural|
|:------|----------:|-----:|
|Black  |      0.575|   0.3|
|Red    |      0.425|   0.7|
library(janitor)

Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
ladybirds |> 
  group_by(Habitat, Colour) %>% 
  summarize(count = sum(Number)) %>% 
  spread(Colour, count, fill = 0)|> 
  adorn_totals(c("row", "col")) |> 
  kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
|Habitat    | Black| Red| Total|
|:----------|-----:|---:|-----:|
|Industrial |   115|  85|   200|
|Rural      |    30|  70|   100|
|Total      |   145| 155|   300|


Is there an association between habitat and ladybird (LB) morphotype?

Habitat ~ colour

habitat/ colour

Tweaking tables.


Proportions

Totals

ladybirds |> 
  group_by(Habitat, Colour) %>% 
  summarize(count = sum(Number)) %>% 
  spread(Colour, count, fill = 0) |> 
  column_to_rownames("Habitat") |> 
  proportions() |> 
  kable()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
|           |     Black|       Red|
|:----------|---------:|---------:|
|Industrial | 0.3833333| 0.2833333|
|Rural      | 0.1000000| 0.2333333|

Rows

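# Store the counts as a matrix (named habitat_colour to avoid masking base::t)
habitat_colour <- ladybirds |> 
  group_by(Habitat, Colour) |> 
  summarise(count = sum(Number)) |> 
  spread(Colour, count, fill = 0) |> 
  column_to_rownames("Habitat") |> 
  as.matrix()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
# margin = 1: divide each count by its row total, so each habitat sums to 1
proportions(habitat_colour, 1) |> 
  kable()
|           | Black|   Red|
|:----------|-----:|-----:|
|Industrial | 0.575| 0.425|
|Rural      | 0.300| 0.700|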

Columns

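# Store the counts as a matrix (named habitat_colour to avoid masking base::t)
habitat_colour <- ladybirds |> 
  group_by(Habitat, Colour) |> 
  summarise(count = sum(Number)) |> 
  spread(Colour, count, fill = 0) |> 
  column_to_rownames("Habitat") |> 
  as.matrix()
`summarise()` has grouped output by 'Habitat'. You can override using the
`.groups` argument.
# margin = 2: divide each count by its column total, so each colour sums to 1
proportions(habitat_colour, 2) |> 
  kable()
|           |     Black|       Red|
|:----------|---------:|---------:|
|Industrial | 0.7931034| 0.5483871|
|Rural      | 0.2068966| 0.4516129|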
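The three proportion tables differ only in the `margin` argument of `proportions()`. A small base R sketch, assuming the counts are typed in directly as a matrix (the object name `lb` is made up for this example), shows what each margin sums to:

```r
# Counts taken from the totals table; `lb` is an illustrative name
lb <- matrix(c(115, 85,
                30, 70),
             nrow = 2, byrow = TRUE,
             dimnames = list(Habitat = c("Industrial", "Rural"),
                             Colour  = c("Black", "Red")))

overall <- proportions(lb)     # each cell divided by the grand total
by_row  <- proportions(lb, 1)  # each cell divided by its row (habitat) total
by_col  <- proportions(lb, 2)  # each cell divided by its column (colour) total

sum(overall)     # the whole table sums to 1
rowSums(by_row)  # each habitat sums to 1
colSums(by_col)  # each colour sums to 1
```

Which margin to use depends on the question: row proportions describe the colour mix within a habitat, column proportions describe where each colour is found.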
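  • Two thirds of the ladybirds (200 of 300) were found in Industrial areas.

  • It is rare to find a black ladybird in Rural areas (10% of all ladybirds counted).

  • Red ladybirds show little habitat preference (55% Industrial, 45% Rural).

  • Black ladybirds prefer Industrial areas (79% of black ladybirds were found there).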
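Whether these patterns amount to a real association between habitat and colour can be checked with a chi-squared test of independence. This is only a sketch: it assumes the pooled counts from the totals table are an appropriate unit of analysis (it ignores the site-level structure), and `lb_counts` is a name introduced here.

```r
# Pooled habitat-by-colour counts from the totals table (illustrative object name)
lb_counts <- matrix(c(115, 85,
                       30, 70),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Habitat = c("Industrial", "Rural"),
                                    Colour  = c("Black", "Red")))

# stats:: is written out because janitor masks chisq.test when it is loaded.
# For a 2 x 2 table, R applies Yates' continuity correction by default.
lb_test <- stats::chisq.test(lb_counts)

lb_test$expected  # counts expected if habitat and colour were independent
lb_test           # test statistic, degrees of freedom and p-value
```

A small p-value would indicate that the colour proportions differ between habitats more than chance alone would explain.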

Work with tables and graphs

Goodness of fit tests
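A goodness-of-fit test compares the observed counts in the categories of a single variable with the counts expected under a stated hypothesis. As a sketch, assuming the hypothesis of interest is that black and red ladybirds are equally common overall (an assumption made purely for illustration), the colour totals of 145 and 155 can be tested like this:

```r
# Colour totals from the ladybird table; the 50:50 hypothesis is an assumption
colour_totals <- c(Black = 145, Red = 155)

gof <- stats::chisq.test(colour_totals, p = c(0.5, 0.5))

gof$expected  # 150 of each colour expected under the 50:50 hypothesis
gof           # test statistic, degrees of freedom and p-value
```

The `p` argument holds the hypothesised proportions and must sum to 1.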

Independence tests

Do people who watch Naruto watch Ghibli?

naruto <- matrix(c(35, 205, 8, 48), nrow = 2, byrow = TRUE)
chisq.test(naruto)$expected

chisq.test(naruto)

Homogeneity tests

$$\chi^2 = \sum \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}}$$
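The formula can be checked by hand against `chisq.test()`. The sketch below reuses the `naruto` counts from above. Expected counts are computed as row total × column total / grand total, and `correct = FALSE` is set because R otherwise applies Yates' continuity correction to 2 × 2 tables, which the plain formula does not include.

```r
naruto <- matrix(c(35, 205, 8, 48), nrow = 2, byrow = TRUE)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(naruto), colSums(naruto)) / sum(naruto)

# Chi-squared statistic: sum over all cells of (Obs - Exp)^2 / Exp
chi_sq_manual <- sum((naruto - expected)^2 / expected)

# Same test via R, without the continuity correction
chi_sq_r <- stats::chisq.test(naruto, correct = FALSE)

chi_sq_manual
unname(chi_sq_r$statistic)
```

The two printed values should agree, which confirms that the formula is what `chisq.test()` computes when the correction is switched off.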

6.1 Research Methods - How to make good titles?

What is a title?

A short statement that conveys the relevant geographical scope, methods, taxonomic group and effect using keywords.

It is not a long, over-informative statement filled with overly specialised terms that are not keywords.

Descriptive titles

Methodological titles

‘Spoiler’ titles

Interrogative titles