Research methods and data analysis

Author

Sophie W - N0842818

This is my (Sophie’s) online workbook! It is split up into sessions and then sections within that.

Post session one (wc/ 23rd September 2024)

First I started off with inserting an image

Cool grasshopper 🦗 - more specifically the elegant grasshopper Zonocerus elegans

Loading the needed packages 📦

Click on code to see what they are! They are all the packages used throughout this workbook.

Code
library(knitr)
library(rmarkdown)
library(quarto)
library(palmerpenguins)
library(ggplot2) # this is within tidyverse but I wanted to show its use
library(tidyverse)
library(dplyr)
library(vtable)
library(gt)
library(modeldata)
library(geomtextpath)

Quarto tutorial

Meet the penguins 🐧

Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst.

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.

Heres what the first few lines of data looks like!

Code
# Check the first few rows of the data set
kable(head(penguins))
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Adelie Torgersen 39.1 18.7 181 3750 male
Adelie Torgersen 39.5 17.4 186 3800 female
Adelie Torgersen 40.3 18.0 195 3250 female
Adelie Torgersen NA NA NA NA NA
Adelie Torgersen 36.7 19.3 193 3450 female
Adelie Torgersen 39.3 20.6 190 3650 male

The plot below shows the relationship between flipper and bill lengths of these penguins.

I encountered many problems making this graph despite copying the code from the tutorial. I ended up un-installing all the packages and quarto and re-installing to no avail, however doing the creation and printing of the plot separately worked. So I’m sticking with that!

Code
# Create the plot
p <- ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_bw() +
  theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(p)

Doing my own thing

Trying another plot with different data

First I imported the data and had a look at the first couple of rows.

Code
# Import data set
pitfall <- read_csv("pitfall.csv")

# First few rows
kable(head(pitfall))
sample_point num_individuals num_species
1 3 3
2 2 4
3 1 1
4 1 1
5 2 2
6 8 6

About the data set 🐞🕳️

Collected early this September in Scotland. Non-lethal pitfalls placed in between hairy wood ant nests to try and look into their activity, however only one ant was captured (and immediately released). Although this data cannot be used to look into the activity of the ant it can be used as practice, especially since I might do pitfall trapping for my research project.

My friends Katie and Curtis creating holes in the ground to put the traps into.

‎‎‎

Me making the non-lethal traps by putting Vaseline on the inside lip of a cup. The cups were then placed into the holes in the ground.

Reshaping the data

The data needs to be reshaped to allow both the number of species and the number of individuals (categories) to be plotted on the same graph. This allows for comparisons to be made!

Code
# Reshape data
pitfall_long <- pitfall %>%
  pivot_longer(cols = c(num_individuals, num_species), 
               names_to = "category", 
               values_to = "count")

# Now it looks like this:
kable(head(pitfall_long))
sample_point category count
1 num_individuals 3
1 num_species 3
2 num_individuals 2
2 num_species 4
3 num_individuals 1
3 num_species 1

Plotting the data

Same with the penguins, I plotted and printed separately. Below is the graph that was created! Mostly there’s the same amount of individuals and species in each sample point. Point six has a lot more as mistakenly it was placed under a tree where beating samples were taken.

Code
# Create the plot
f <- ggplot(pitfall_long, aes(fill = category, y = count, x = sample_point)) + 
    geom_bar(position = "dodge", width = 0.8, stat = "identity", color="black") +
  scale_fill_manual(name = "Legend", labels = c("Individuals", "Species"), values=c("#E69F00", "#56B4E9")) +
  scale_x_discrete(limits = as.character(1:15), 
                   labels = paste(1:15)) +
  labs(x = "Sample point", y = "Total number", title = "Total number of species and individuals at each sample point") +
  theme_bw() +
  theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(f)

Session two (wc/ 30th Septermber 2024)

Ethics for experimental design - notes

What is the scientific method?

A procedure which involves finding new/correcting previous knowledge through testing and experimentation.

What can influence the scientific method?

  • Personal beliefs and biases

  • Societal norms

  • Available technology

  • Funding and resources

  • And many more!

Science is an inferential exercise

As soon as you apply or interpret science you end up inserting values and judgements within it (based on personal beliefs and biases/societal norms).

Therefore ethics are then subject to the society at the time!

Ethics definition

  • Moral principles that govern human’s behaviours or the conducting of an activity.

  • The branch of knowledge that deals with moral principles.

  • External - Social

Morality definition

  • Principles concerning the distinction between right and wrong and good and bad behaviour.

  • A system of ideas of right and wrong conduct.

  • Virtuous conduct.

  • Can vary from person to person.

  • Internal

Animal experimentation example

There is a spectrum of opinions ranging from one extreme (Acceptable to use animals in any way that we see fit) to another (Animals have rights and it is wrong to use them).

The example is as follows: You are a member of an animal experimentation committee. An experiment, which requires keeping mice in isolation, is proposed to measure the effect of a vaccine which will stop a debilitating disease in humans. Note - mice are extremely social and suffer badly when isolated

Is this scenario ethical???

It depends on your judgement:

Yes - It is permissible to keep mice in isolation to obtain the results required.

No - We should not inflict suffering on animals.

Maybe - The experiment needs to be done in isolation and will help many if the vaccine is proven to be successful, but what can be done to make it better for the mice?

This is where the three R’s come into play!

The three R’s

Principles for human experimental technique set out by Russel and Burch (1959). They are widely accepted guidelines that can also be used for animals. It is possible to create a better experiment with more statistical ‘power’ after considering the three R’s.

Replacement

  • Methods that avoid or replace the use of animals that are protected under the Animals (Scientific Procedures) Act 1986, amended 2012 (ASPA).

  • Protected animals include all living vertebrates (except humans - they are under other protections as they can consent) and some invertebrates (e.g., cephalopods like octopuses and squid).

  • Instead use:

    • Human volunteers

    • Computer modelling

    • Animal cell cultures

    • Alternative species

Reduction

  • Use less subjects where possible.

  • Can often obtain the same amount of information.

  • ‘Power analysis’ can show what the smallest possible sample size that is needed for the same ‘power.’

  • Often includes good use of experimental design and stats.

Refinement

  • Try to use methods that minimise the pain, suffering, distress or lasting harm experienced by the animals.

  • Applies to all aspects of animals use, from housing and husbandry to the specific procedures that will be performed on them.

  • e.g., use appropriate anaesthetics and analgesics, training animals to cooperate to avoid stress and providing animals with appropriate housing that allows for the expression of species-specific behaviours, such as having enclosures next to one another in the mice example (so they can still see and hear one another without being in direct contact).

More info: https://www.nc3rs.org.uk/

Challenges

  • The globalisation of medical research: the need to capture universal values and formulate universal principles (global bioethics).

  • The emerging reality of the diversity of moral cultures: the need to respect plurality and ethical diversity (Asian, American, European bioethics).

Other things to consider

  • Decolonisation - Avoiding ‘helicopter science’

  • Anti-oppressive practices

  • Discrimination

Pre session three (wc/ 7th October 2024)

Data analysis

Tidyverse package

The tidyverse package is a collection of other packages such as ggplot2 and dplyr. They all share an underlying design philosophy, grammar, and data structures. The code from this exercise elsewhere!

Research methods

How to write a research question

What is a research question?

A question (usually addresses an issue/problem) that a study/research project aims to answer, through data analysis.

The path to a research question starts off with a broad topic which gets narrowed down through preliminary research and searching for gaps in the knowledge, to eventually create a question.

Types of research questions

Quantitative
  • Precise

  • Often exploring relationships, comparisons or descriptions in a study

  • Not answerable with “yes” or “no”

    • Therefore, questions don’t use words like “is,” “are,” “do,” or “does.”
Qualitative
  • Broad or more specific areas of study

  • Adaptable, non-directional and more flexible

    • Therefore, these questions aim to “discover,” “explain,” or “explore.”
Mixed-method

Combination of both!

Examples of good and bad research questions

Example one

Bad: How does social media affect people’s behaviour?

Good: What effect does the daily use of YouTube have on the attention span of children aged under 16?

The bad one is vague and lacks specification.

Example two

Bad: Has there been an increase in childhood obesity in the US in the past 10 years?

Good: How have school intervention programs and parental education levels affected the rate of childhood obesity among 1st to 6th-grade students?

The bad one is simple plus it can be answered with yes or no!

Common mistakes

  • Overly broad questions

    • Unfocused and vague questions make it difficult to draw meaningful conclusions
  • Not feasible

    • Make sure that you have the resources and time to answer the question
  • Ambiguous language

    • Can lead to misunderstandings
  • Neglecting existing literature

    • Can result in doing a question thats already been heavily addressed
  • Ignoring ethics

    • Can lead to harm to the subjects used

Notes from: https://research.com/research/how-to-write-a-research-question

Session three (wc/ 7th October 2024)

Data wrangling

Back to the penguins data set!

A tibble

Code
penguins <- palmerpenguins::penguins

head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Summaries of the data set

Using summary()
Code
pen1 <- summary(penguins)

print(pen1)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
Using vtable()
Code
penguins %>% 
  vtable(., lush = TRUE)
.
Name Class Values Missing Summary
species factor 'Adelie' 'Chinstrap' 'Gentoo' 0 nuniq: 3
island factor 'Biscoe' 'Dream' 'Torgersen' 0 nuniq: 3
bill_length_mm numeric Num: 32.1 to 59.6 2 mean: 43.922, sd: 5.46, nuniq: 164
bill_depth_mm numeric Num: 13.1 to 21.5 2 mean: 17.151, sd: 1.975, nuniq: 80
flipper_length_mm integer Num: 172 to 231 2 mean: 200.915, sd: 14.062, nuniq: 55
body_mass_g integer Num: 2700 to 6300 2 mean: 4201.754, sd: 801.955, nuniq: 94
sex factor 'female' 'male' 11 nuniq: 2
year integer Num: 2007 to 2009 0 mean: 2008.029, sd: 0.818, nuniq: 3
Grouping by species using group_by()
Code
penguins %>% 
  group_by(species) %>% 
  na.omit() %>% 
  summarise(mean = mean(bill_length_mm), sd=sd(bill_length_mm), n = n())
# A tibble: 3 × 4
  species    mean    sd     n
  <fct>     <dbl> <dbl> <int>
1 Adelie     38.8  2.66   146
2 Chinstrap  48.8  3.34    68
3 Gentoo     47.6  3.11   119

Reshaping the data into a wide format using Tidyr

It is a way of organizing data, where each variable is represented as a column, and each observation is represented as a row.

Code
penguins %>% 
  mutate(row = row_number()) %>% # needed to add a row number to identify each row as separate case
  select(row, species, island, body_mass_g) %>% 
  pivot_wider(names_from = island, values_from = body_mass_g)
# A tibble: 344 × 5
     row species Torgersen Biscoe Dream
   <int> <fct>       <int>  <int> <int>
 1     1 Adelie       3750     NA    NA
 2     2 Adelie       3800     NA    NA
 3     3 Adelie       3250     NA    NA
 4     4 Adelie         NA     NA    NA
 5     5 Adelie       3450     NA    NA
 6     6 Adelie       3650     NA    NA
 7     7 Adelie       3625     NA    NA
 8     8 Adelie       4675     NA    NA
 9     9 Adelie       3475     NA    NA
10    10 Adelie       4250     NA    NA
# ℹ 334 more rows

Types of variables

Code
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Categorical variables
  • Ordinal

    • Categories that maintain an order
  • Nominal

    • Categories with no order ranking
  • Binary

    • Nominal variables with two categories
Numerical variables
  • Discrete

    • Numbered values that can only take certain values
  • Continuous

    • Numered values that are measured, and can be any number within a particular range

Research methods

Research questions

Inductive reasoning vs deductive reasoning

Inductive
  1. Observation

  2. Pattern

  3. Hypothesis

  4. Theory

Deductive
  1. Theory

  2. Hypothesis

  3. Observation

  4. Confirmation

When to pose a question?

When a scientific question preceeds the hypotheses. Comes after a good screening of theory.

Good and bad questions - see pre-session activity on why these questions are good or bad

Bad:
  • Is there any difference between A and B?

  • Is A bigger than B?

  • Can X influence Y?

  • Are there little green people out there?

Good:
  • What explains differences between A and B?

  • What makes A bigger than B?

  • How X can influence Y?

  • How can evolution lead to little green people out there?

Importance of a good question

There is no good research based on bad questions. Also, a good question always push knowledge further. However, it is something that comes with practise.

Post session three (wc/ 7th October 2024)

Data analysis

Exercise 6.6.1

1. Practice typing/executing these problems and explain what each line of code does. Make sure tidyverse is loaded in your libraries!

I have removed the code fold on these problems so you can see all the annotations easier.

library(tidyverse) is the star of this section :)

Problem A

# %>% means putting the result into the next thing probably going to be used a lot so I am just mentioning here!

midwest %>% # Using the midwest data set (included in the tidyverse package)
  group_by(state) %>% # Grouping the data by states in the midwest of the US
  summarize(poptotalmean = mean(poptotal), # summarize() means creating a summary of the state grouped data. Inside of the function various summary statistics are calculated for the population columns. poptotalmean = mean(poptotal) calculates the mean of the total population for each state.
            poptotalmed = median(poptotal), # Calculates the median of the total population for each state
            popmax = max(poptotal), # Calculates the maximum value for the total population for each state
            popmin = min(poptotal), # Calculates the minimum value for the total population for each state
            popdistinct = n_distinct(poptotal), # Counts the number of unique values in the total population for each state
            popfirst = first(poptotal), # Gets the first value of the total population for each state (the first row of the group)
            popany = any(poptotal < 5000), # Checks if there are any values in the population total less than 5,000 in each state (response will be either TRUE - yes or FALSE - no)
            popany2 = any(poptotal > 2000000)) %>% # Does the same as above except if any values greater than 2,000,000
  ungroup() # Final ungrouping of the data - makes sure subsequent operations wont get affected
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>

Problem B

midwest %>% # Using midwest data set
  group_by(state) %>% # Groups the data by states
  summarize(num5k = sum(poptotal < 5000), # Summarising of grouped data for... num5k = sum(poptotal < 5000) which counts the number of times the population total is less than 5,000 in each state
            num2mil = sum(poptotal > 2000000), # Counts the number of times the population total is more than 2,000,000 in each state
            numrows = n()) %>% # Counts the number of rows in each state group
  ungroup() # Final ungrouping of the data
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72

Problem C

Part I
midwest %>% # Using the midwest data set
  group_by(county) %>% # Groups the data by county
  summarize(x = n_distinct(state)) %>% # Summary ~ Counts the number of unique states for each county (shows how many different states each county spans)
  arrange(desc(x)) %>% # Sorts the summarised data into descending order based on the value of x (counties spanning the most states appear first)
  ungroup() # Final ungrouping of the data
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows
Part II
# How does n() differ from n_distinct()? - n() is the total number of rows in the current group, it is used when you want to know how many entries there are in each group. n_distinct() is the number of unique values in a specified column, it is used when you want to know how many unique values exist in your chosen column within the group.
# When would they be the same? different? - They would be the same when every entry in a column within the group is the same. They would be different when there are different values within a column.

midwest %>% # Using the midwest data set
  group_by(county) %>% # Groups the data by county
  summarize(x = n()) %>% # Summary ~ Counts the total number of rows in each county group (shows how many entries exist for each county).
  ungroup() # Final ungrouping of the data
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows
Part III
# hint: 
# - How many distinctly different counties are there for each county? Only one as each group is defined by that distinct county
# - Can there be more than 1 (county) county in each county? No!
# - What if we replace 'county' with 'state'? I'm guessing it means the first county in the code (group_by(state) instead) and keeping the second the same (n_distinct(county)). It would count the number of unique counties for each state.

midwest %>% # Using the midwest data set
  group_by(county) %>% # Groups the data by county
  summarize(x = n_distinct(county)) %>% # Summary ~ Counts the number of unique counties in each county which will always be 1!
  ungroup() # Final ungrouping of the data
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# ℹ 310 more rows

Problem D

diamonds %>% # Using the diamonds data set
  group_by(clarity) %>% # Grouping by clarity
  summarize(a = n_distinct(color), # Summary ~ Counts the number of unique diamond colours in each clarity group
            b = n_distinct(price), # Counts the number of unique prices in each clarity group
            c = n()) %>% # Counts the total number of rows in each clarity group (shows how many entries exist for each clarity).
  ungroup() # Final ungrouping of the data
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem E

Part I
diamonds %>% # Using the diamonds data set
  group_by(color, cut) %>% # Grouping by colour and cut
  summarize(m = mean(price), # Summary ~ Calculates the mean price of diamond for each combination of colour and cut
            s = sd(price)) %>% # Calculates the standard deviation of diamond prices for each combination of colour and cut
  ungroup() # Final ungrouping of the data
# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows
Part II
# Took me a moment to see the difference, but it was the order of cut and colour in the grouping command. The summary statistics (mean and sd) will be the same either way, however the order of the columns in the tibble is different!

diamonds %>% # Using the diamonds data set
  group_by(cut, color) %>% # Grouping by cut and colour
  summarize(m = mean(price), # Summary ~ Calculates the mean price of diamond for each combination of cut and colour
            s = sd(price)) %>% # Calculates the standard deviation of diamond prices for each combination of cut and colour
  ungroup() # Final ungrouping of the data
# A tibble: 35 × 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# ℹ 25 more rows
Part III
# - How good is the sale if the price of diamonds equaled msale? - It is a 20% discount, which is very good especially as they are diamonds.
# - e.x. The diamonds are x% off original price in msale.

diamonds %>% # Using the diamonds data set
  group_by(cut, color, clarity) %>% # Grouping by cut and colour
  summarize(m = mean(price), # Summary ~ Calculates the mean price of diamond for each combination of cut and colour
            s = sd(price), # Calculates the standard deviation of diamond prices for each combination of cut and colour
            msale = m * 0.80) %>% # Calculates a sale price that is 80% of the mean price - 20% discount
  ungroup() # Final ungrouping of the data
# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows

Problem F

# These are all given weird unrelated food names to make it confusing :)
diamonds %>% # Using the diamonds data set
  group_by(cut) %>% # Grouping by cut
  summarize(potato = mean(depth), # Summary ~ Calculates potato which is the mean depth of diamonds for each cut
            pizza = mean(price), # Calculates pizza which is the mean price of diamonds for each cut
            popcorn = median(y), # Calculates popcorn which is the median of the y dimension (height of diamonds) for each cut
            pineapple = potato - pizza, # Calculates pineapple which is the difference between potato (mean depth) and pizza (mean price) for each cut
            papaya = pineapple ^ 2, # Calculates papaya which is pineapple (above) to the power of 2
            peach = n()) %>% # Calculates peach which is total number of rows (diamonds) for each cut
  ungroup() # Final ungrouping of the data
# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem G

Part I
diamonds %>% # Using the diamonds data set
  group_by(color) %>% # Grouping by colour
  summarize(m = mean(price)) %>% # Summary ~ Calculates the mean price of diamonds for each colour
  mutate(x1 = str_c("Diamond color ", color), # Mutate() - adds new columns to the summarised data. x1 = str_c("Diamond color ", color) - creates a new column called x1 that links "Diamond color" with each unique colour. e.g. "Diamond color D" or "Diamond color E"
         x2 = 5) %>% # Creates a new column called x2 that has a constant value of 5 in each row
  ungroup() # Final ungrouping of the data
# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5
Part II
# What does the first ungroup() do? Is it useful here? Why/why not? - Removes the grouping structure of the data after summarising. Meaning the mutate() operation isn't performed in the grouped data set. It is useful as we added new columns/performed operations that do not depend on the grouping of the data by colour. These are then instead applied to the entire data frame as opposed to within the groups avoiding any unintended consequences.
# Why isn't there a closing ungroup() after the mutate()? - It would have no effect as mutate() doesn't group the data and the data was already ungrouped before hand.

diamonds %>% # Using the diamonds data set
  group_by(color) %>% # Grouping the data by colour
  summarize(m = mean(price)) %>% # Summary ~ Calculates the mean price of diamonds for each colour
  ungroup() %>% # Ungroups the data
  mutate(x1 = str_c("Diamond color ", color), # Mutate ~ Creates a new column called x1 that links "Diamond color" with each unique colour
         x2 = 5) # Creates a new column called x2 that has a constant value of 5 in each row
# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5

Problem H

Part I
diamonds %>% # Using the diamonds data set
  group_by(color) %>% # Grouping by colour
  mutate(x1 = price * 0.5) %>% # Mutate ~ Creates a new column called x1 that is half the price of each diamonds and since the data is grouped by colour x1 will be the half the price for each diamond colour group
  summarize(m = mean(x1)) %>% # Summary ~ Calculates the mean x1 value for each colour group (mean of half the price of each diamond colour)
  ungroup() # Final ungrouping of the data
# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.
Part II
# What's the difference between part I and II? - The order of things (ungroup is at the end in the first and before summarise in the second). Part I provides mean of half the prices for each diamond colour and part II gives the overall mean of half the prices for all diamonds (regardless of colour). This is because part II the data is ungrouped before the summary so the mean of x1 is calculated on the whole data set not the grouped one like in part I.

diamonds %>% # Using the diamonds data set
  group_by(color) %>% # Grouping by colour
  mutate(x1 = price * 0.5) %>%  # Mutate ~ Creates a new column called x1 that is half the price of each diamonds and since the data is grouped by colour x1 will be the half the price for each diamond colour group
  ungroup() %>% # Ungrouping of the data
  summarize(m = mean(x1)) # Summary ~ Calculates the mean x1 value for each colour group (mean of half the price of each diamond colour)
# A tibble: 1 × 1
      m
  <dbl>
1 1966.

2. Why is grouping data necessary?

When you want to perform calculations that are specific to some parts of the data (the things you make grouped). It can also allow for further insight into the data as things like summarize() and mutate() can work within specific sections of the data.

3. Why is ungrouping data necessary?

If you want to perform operations that do not need to be constrained by previously defined groups. Also, prevents unintended consequences (if not done all operations will still respect the grouping).

4. When should you ungroup data?

  • After summarising and you don’t need to maintain the grouping for further things.

  • Before operations you want to perform to all the whole data set

  • Before final output to make it cleaner especially if you want to visualise the data

5. If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?

No as no groups are defined if group_by() isn’t used! In the example mutate() operates on the entire data set and no groups are defined.

Exercise 6.7

Using the diamonds built-in data set (requires tidyverse), perform the following tasks (annotations visible when you look at code):

Task 1 - View all of the variable names in diamonds (hint: view()).

Code
view(diamonds) # Allows the viewing of variables, which includes: carat, cut,
#color, clarity, depth, table, price, x, y, and z

Task 2 - Arrange the diamonds by:

Lowest to highest price (hint: arrange())
Code
diamonds %>% # Using diamonds data set
  arrange(price) # Arranges price from low to high
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
Highest to lowest price (hint: arrange(), desc())
Code
diamonds %>% # Using diamonds data set
  arrange(desc(price)) # Arranges price from high to low (descending)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
Lowest price and cut
Code
# Look at top of table

diamonds %>% # Using diamonds data set
  group_by(cut) %>% # Grouping by cut
  summarize(lowest_price = min(price)) %>% # Calculates the minimum price of diamonds per cut
  ungroup() # Final ungrouping of the data
# A tibble: 5 × 2
  cut       lowest_price
  <ord>            <int>
1 Fair               337
2 Good               327
3 Very Good          336
4 Premium            326
5 Ideal              326
Highest price and cut
Code
# Look at the bottom of the table

diamonds %>% # Using diamonds data set
  group_by(cut) %>% # Grouping by cut
  summarize(highest_price = max(price)) %>% # Calculates the maximum price of diamonds per cut
  ungroup() # Final ungrouping of the data
# A tibble: 5 × 2
  cut       highest_price
  <ord>             <int>
1 Fair              18574
2 Good              18788
3 Very Good         18818
4 Premium           18823
5 Ideal             18806

Task 3 - Arrange the diamonds by lowest to highest price and worst to best clarity.

Code
# Converted clarity to an ordered factor
diamonds$clarity <- factor(diamonds$clarity,
                           levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"),
                           ordered = TRUE)

diamonds %>% # Using diamonds data set
  arrange(clarity, price) # Arranges clarity from worst to best and price from lowest to best
# A tibble: 53,940 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 2  0.32 Good    D     I1       64      54   361  4.33  4.36  2.78
 3  0.31 Premium F     I1       62.9    59   394  4.33  4.29  2.71
 4  0.34 Ideal   D     I1       62.5    57   413  4.47  4.49  2.8 
 5  0.34 Ideal   D     I1       61.4    55   413  4.5   4.52  2.77
 6  0.32 Premium E     I1       60.9    58   444  4.42  4.38  2.68
 7  0.43 Premium H     I1       62      59   452  4.78  4.83  2.98
 8  0.41 Good    G     I1       63.8    56   467  4.7   4.74  3.01
 9  0.32 Good    D     I1       64      54   468  4.36  4.33  2.78
10  0.32 Ideal   E     I1       60.7    57   490  4.45  4.41  2.69
# ℹ 53,930 more rows

Task 4 - Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond (hint: mutate()).

Code
diamonds %>% # Using the diamonds data set
  mutate(salePrice = price - 250) # Mutate ~ Creates a new column called salePrice that is the price of diamonds with a discount of $250
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z salePrice
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
# ℹ 53,930 more rows

Task 5 - Remove the x, y, and z variables from the diamonds data set (hint: select()).

Code
diamonds %>% # Using the diamonds data set
  select(-x, -y, -z) # Removes the x, y and z variables (- is minus)
# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

Task 6 - Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).

Code
diamonds %>% # Using the diamonds data set
  group_by(cut) %>% # Grouping by cut
  summarize(x = n()) %>% # Summary ~ Counts the total number of rows in each cut group (shows how many entries exist for each cut).
  ungroup() # Final ungrouping of the data
# A tibble: 5 × 2
  cut           x
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

Task 7 - Create a new column named totalNum that calculates the total number of diamonds.

Code
diamonds %>% # Using diamonds data set
  mutate(totalNum = n()) # Mutate ~ Creates a new column called totalNum that calculates the total number of diamonds (each row is the same as the total is the same throughout)
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z totalNum
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <int>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    53940
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    53940
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    53940
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    53940
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    53940
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    53940
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47    53940
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53    53940
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49    53940
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39    53940
# ℹ 53,930 more rows

Research methods

Creating a good and bad question about the diamonds data set

Bad questions

  • Are diamonds expensive? Very vague and can be answered with just a yes or no

  • What is the best diamond? Best is a subjective term - could be colour or price whatever the person thinks

  • How do diamonds compare to other gemstones? Just using the diamonds data set this question cannot be answered as it brings in unrelated subjects

Good questions

  • How does the price of diamonds vary by cut, quality and clarity, and what is the average price for each cut category? More specific and cannot be answered with just a yes or no, also, uses clear language and it's feasible with the diamonds data set

  • What is the relationship between carat, weight and price for diamonds of different clarity levels? Once again specific as it targets specific variables and the relationship between them

  • What percentage of diamonds in the diamonds data set are classified as Ideal cut, and how does that compare to other cut categories? Allows for clear statistical analysis

Pre session four (wc/ 14th October 2024)

ggplot2 package

My favourite package!!! I have used ggplot2 since starting with RStudio five years ago and I would never make a graph without it :) These next few plots I will do the basic ggplot2 code and then add a bunch of fancy colours and things to show what can be done. Chapter that info is from: https://r-graphics.org/chapter-quick Website I always use for ggplot2: https://r-graph-gallery.com/

Scatter plot

Code
# Create the plot
b1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + # Making graph with ggplot using cars
  geom_point() + # Scatter plot
  geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) + # Adds line
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1)) # Decoration

# Print the plot explicitly
print(b1)

Line plot

Code
# Create the plot
b2 <- ggplot(pressure, aes(x = temperature, y = pressure)) + # Making graph with ggplot using cars
  geom_line() + # Line plot
  geom_point(shape = 21, color = "black", fill = "#6dd6a7", size = 4) + # Adds points of certain colour and size
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1)) # Decoration

# Print the plot explicitly
print(b2)

Bar graph

Code
# Create the plot
b3 <- ggplot(mtcars, aes(x = factor(cyl), fill=as.factor(cyl))) + # Making graph with ggplot using cars - also filling colour as cyl
  geom_bar() + # Bar plot
    scale_fill_hue(c = 40) + # Changing colour
  theme(legend.position="none") + # Removing the legend as its not needed
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1)) # Decoration

# Print the plot explicitly
print(b3)

Histogram

Code
# Create the plot
b4 <- ggplot(mtcars, aes(x = mpg)) + # Making graph with ggplot using cars
  geom_histogram(binwidth = 4, fill = "#69b3a2", color = "#ffffff", alpha = 0.7) + # Plotting histogram - and changing colour
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1)) # Decoration

# Print the plot explicitly
print(b4)

Boxplot

Code
# Create the plot
b5 <- ggplot(ToothGrowth, aes(x = supp, y = len)) + # Making graph with ggplot using ToothGrowth
  geom_boxplot(fill = "slateblue", alpha = 0.6) + # Boxplot with colour
    geom_jitter(color="black", size = 2, alpha = 0.8) + # Adds individual observations
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1)) # Decoration

# Print the plot explicitly
print(b5)

Function curve

Code
# Plot a user-defined function
myfun <- function(xvar) {
  1 / (1 + exp(-xvar + 10))
}

# Create the plot
b6 <- ggplot(data.frame(x = c(0, 20)), aes(x = x)) + # Making graph with ggplot using data set created
  stat_function(fun = myfun, geom = "line", color = "red", linewidth = 2) + # Function curve thats red and thicker
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1)) # Decoration

# Print the plot explicitly
print(b6)

Session four (wc/ 14th October 2024)

Data exploration

How to visually check continuous variables using the penguins data set?

Histogram

Code
data("penguins")

# Group by species
penguins_grouped <- penguins %>% 
  group_by(species)

# Create the plot
p1 <- ggplot(penguins_grouped, 
      aes(x = bill_length_mm, color = species, fill = species)) +
      geom_histogram() +
      labs(x = "Bill length in mm", y = "Number of observed penguins", title = "The bill length of observed penguins per species") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(p1)

Chinstrap and Gentoo have similar sized of bill lengths, whereas Adelie has a smaller bill length.

Boxplot

Code
data("penguins")

# Group by species
penguins_grouped <- penguins %>% 
  group_by(species)

# Create the plot
p2 <- ggplot(penguins_grouped, aes(x = species, 
             y = bill_length_mm, 
             color = species, 
             fill = species)) +
  geom_boxplot(alpha = 0.8, show.legend = FALSE) +
  labs(x = "Species", y = "Bill length in mm", title = "The bill length of penguins per species") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))
  
# Print the plot explicitly
print(p2)

Similar to above, just a different way to visually show it.

How to visually check categorical variables using the penguins data set?

Species of penguin

Code
# Create the plot
p3 <- ggplot(penguins, aes(x = species,
             color = species, 
             fill = species)) +
  geom_bar(show.legend = FALSE) +
labs(x = "Species", y = "Number of penguins observed", title = "The number of penguins observed per species") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(p3)

Adelie observed the most and Chinstrap observed the least.

Observations per year

IMPORTANT

Since adding the package modeldata for post session four this chunk would not render at alllll! It took me several hours to realise the year column was missing from the data in the vtable and therefore this plot couldn’t be made. Both palmerspenguins and modeldata have penguins data sets with only the palmer one including years. So to fix my issue you will see in the code I had to specify I was using palmerspenguins data (I also did this at the beginning of session three).

Code
# Load the penguins dataset from palmerpenguins explicitly
penguins <- palmerpenguins::penguins

# Create the plot
p4 <- ggplot(penguins, aes(x = year, color = species, fill = species)) +
  geom_bar() +
  labs(x = "Year", y = "Number of penguins observed", title = "The number of penguins observed per year") +
  theme_minimal() +
  theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot
print(p4)

Gradually the total number of penguins observed increases each year. Each bar has a breakdown of the number of species observed in that year but as the bars stacked its harder to compare.

Observations per island

Code
# Create the plot
p5 <- ggplot(penguins, aes(x = island,
             color = species, 
             fill = species)) +
  geom_bar() +
labs(x = "Island", y = "Number of penguins observed", title = "The number of penguins observed per island") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(p5)

Adelie penguins can be found on all three islands (Biscoe, Dream and Torgersen). Chinstrap penguins can only be found on Dream island. Gentoo penguins can only be found on Biscoe island.

Visulising correlations

Code
# Create the plot
p6 <- ggplot(penguins, aes(x = bill_length_mm, 
             y = bill_depth_mm)) +
  geom_point() +
labs(x = "Bill length in mm", y = "Bill depth in mm", title = "The correlation between bill length and depth") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(p6)

A lot of dots everywhere, hard to make out any pattern.

You can also do this correlation per species:
Code
# Create the plot
p7 <- ggplot(penguins, aes(x = bill_length_mm, 
             y = bill_depth_mm, color = species, 
             fill = species)) +
  geom_point() +
labs(x = "Bill length in mm", y = "Bill depth in mm", title = "The correlation between bill length and depth per species") +
  geom_smooth(method = lm , se = TRUE) +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(p7)

Now that the dots are coloured by species you can see that there is different positive correlations between bill length and depth for each species.

Body mass per sex

Code
# Clean the data (removes any n/a values)
clean_penguins <- na.omit(penguins)

# Create the plot
p8 <- ggplot(clean_penguins, aes(x = sex, 
             y = body_mass_g,
             color = species, 
             fill = species)) +
  geom_boxplot(alpha = 0.8) +
labs(x = "Sex", y = "Body mass in grams", title = "The correlation between sex and body mass per species") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(p8)

Male penguins no matter the species are larger than the females. Although Gentoo penguins do have a greater difference.

Inverting the groups:
Code
# Clean the data (removes any n/a values)
clean_penguins <- na.omit(penguins)

# Create the plot
p9 <- ggplot(clean_penguins, aes(x = species, 
             y = body_mass_g,
             color = sex, 
             fill = sex)) +
  geom_boxplot(alpha = 0.8) +
labs(x = "Species", y = "Body mass in grams", title = "The correlation between penguin species and body mass per sex") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(p9)

Same as above, but I feel this graph is easier to read when comparing the sex across each species as the males and female boxplots are next to each other.

Questions

Can body mass predict bill length?

Code
# Clean the data (removes any n/a values)
clean_penguins <- na.omit(penguins)

# Create the plot
p10 <- ggplot(clean_penguins, aes(x = body_mass_g, 
             y = bill_length_mm, color = species, 
             fill = species)) +
  geom_point() +
labs(x = "Body mass in grams", y = "Bill length in mm", title = "The correlation between bill length and body mass per species") +
  geom_smooth(method = lm , se = TRUE) +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(p10)

Using this graph you could measure the weight of a penguin in grams (which I presume would be easier than measuring its bill length), then go along the x axis with the weight (and the species of penguin) to the line. Where the weight hits the line is also what the supposed bill length would be! e.g., an Adelie penguin with the weight of 4000g would have a bill length of around 40mm.

Does sex explain flipper length?

Code
# Clean the data (removes any n/a values)
clean_penguins <- na.omit(penguins)

# Create the plot
p11 <- ggplot(clean_penguins, aes(x = species, 
             y = bill_length_mm,
             color = sex, 
             fill = sex)) +
  geom_boxplot(alpha = 0.8) +
labs(x = "Species", y = "Bill length in mm", title = "The sex and bill length of penguins per species") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(p11)

The median bill length is greater for males than females across each species, with it being bigger than the interquartile range for each species’s females. However, the overall range still overlaps for each sex of each species so statistical analysis should be done to say with more confidence if sex can explain flipper length.

Exploring data is asking relevant questions, not just correlating random things! I like to always roughly draw my graph on paper to help visualise what I will be creating on RStudio.

Distributions

Checking distributions

Code
# Clean the data (removes any n/a values)
clean_penguins <- na.omit(penguins)

# Turning data to long format
long_penguins <- clean_penguins %>%
  pivot_longer(bill_length_mm:body_mass_g, names_to = "trait")

# Create the plot
p12 <- ggplot(long_penguins, aes(x = value,
         group = species,
         fill = species,
         color = species)) +
  geom_density(alpha=0.8) +
  facet_grid(~trait, scales = "free_x" ) +
labs(x = "Value", y = "Density", title = "Checking the distributions") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))
  
# Print the plot explicitly
print(p12)

Moments of centrality and dispersion

Centrality

Image showing moments of centrality - from the slides

Dispersion

Image showing moments of dispersion - from the slides

  • Variance

  • Standard deviation

  • Standard Error

  • Range

  • Quantiles

Checking distributions via a histogram

Code
set.seed(999)
normal<-rnorm(100)
normal %>% 
  as.tibble() %>% 
  ggplot(aes(value))+
  geom_histogram(color="#DD4A48", fill="#DD4A48")+
  geom_vline(xintercept=c(mean(normal), (mean(normal)+sd(normal)),mean(normal)-sd(normal)), 
             linetype="dashed") +
  labs(x = "Value", y = "Count", title = "Checking the distributions via a histogram") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

Shows a sort of bell curve that can be interpretated.

Checking distributions via a boxplot

Code
set.seed(999)
normal<-rnorm(100)
normal %>% 
  as.tibble() %>% 
  ggplot(aes(value))+
  geom_boxplot(fill="#DD4A48",alpha=0.7) +
  labs(x = "Value", y = "Count", title = "Checking the distributions via a boxplot") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

Take a look at the median (line in box) and interquartile range (the box) to determine the spread of the data.

Scientific hypotheses

A hypothesis:

  • Is affirmative

  • Is not a question

  • Is self-explanatory

  • Must lead to expectations if confirmed

Why do we need hypotheses?

  • Candidate explanation to a phenomenon

  • Contain previsions and expectations

  • Feedback theory

  • Advance Science

Types of hypotheses

There are two types of hypotheses!

Scientific hypotheses

  • Candidate statements to explain an observed phenomenon

  • Meant to generate logical predictions

  • Working guidelines

Statistical hypotheses

  • Logical predictions

  • Confirmed by stats

  • Can be drawn in a graph

Image showing how statistical hypotheses can be drawn on a graph - from the slides

A good working hypothesis is everything

  • A wrong hypotheses will misguide you

  • A good hypothesis will keep you excited and up to date

  • Practise makes perfect

How to write hypotheses?

  • Tell a story

  • Don’t use a subheading

  • Never reefer to statistical hypotheses

The if/then method

  • If drug X have an effect on reducing headache, then

  • When many exotic species are introduced in the ecosystem, then the likelihood of severe ecological disruption increases.

The statement method

  • Drug X reduce the headaches because it blocks the neuroreceptors of pain.

  • Exotic species in high quantities disrupts ecological mutualistic networks.

Post session four (wc/ 14th October 2024)

Data analysis

Following Andrew’s script

Packages and the data used

The packages used here are loaded at the beginning but also shown in this chunk.

?crickets gives more information on the data or you can even do ?ggplot2 to find out more about a package.

Code
# Packages being used
library(tidyverse)
library(modeldata)

# Taking a look at the crickets data set
crickets %>% 
  vtable(., lush = TRUE)
.
Name Class Values Missing Summary
species factor 'O. exclamationis' 'O. niveus' 0 nuniq: 2
temp numeric Num: 17.2 to 30.4 0 mean: 23.765, sd: 3.823, nuniq: 17
rate numeric Num: 44.3 to 101.7 0 mean: 72.887, sd: 16.914, nuniq: 31

The basics

Code
# Create the plot
c1 <- ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c1)

The chirp rate increases with temperature.

Code
# Create the plot
c2 <- ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c2)

The chirp rate increased with temperature for both species, but O. exclamationis has a greater overall chirp rate.

Modifiying basic properties of the plot

Code
# Create the plot
c3 <- ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point(color = "red",
             size = 2,
             alpha = 0.3,
             shape = "square") +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c3)

Colours the points to red with an alpha of 0.3 and size of 2. They are also square now!

Code
# Create the plot
c4 <- ggplot(crickets, aes(x = temp, 
                     y = rate,
                     shape = species, 
                     color = species)) + 
  geom_point(alpha = 0.6, size = 5) +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c4)

Gives different shaped and coloured points to the two species!

Adding another layer

Code
# Create the plot
c5 <- ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c5)

Adds a line of best fit.

Code
# Need this package for the labels on the lines
library(geomtextpath)

# Create the plot
c6 <- ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  geom_labelsmooth(aes(label = species), fill = "white",
                method = "lm", formula = y ~ x,
                size = 3, linewidth = 1, boxlinewidth = 0.4) +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2") +
      theme_minimal() +
      guides(color = 'none') +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c6)

Adds a line of best fit for each species. They are even labelled!

Other plots

Code
# Create the plot
c7 <- ggplot(crickets, aes(x = rate)) + 
  geom_histogram(bins = 15) +
      theme_minimal() +
      guides(color = 'none') +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c7)

Histogram of the crickets data showing the number of observations per chirp rate.

Code
# Create the plot
c8 <- ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15) +
      theme_minimal() +
      guides(color = 'none') +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c8)

Graph of the crickets data showing the number of observations per chirp rate. You can see how the peaks are the same as the histogram above.

Code
# Create the plot
c9 <- ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black",
           fill = "lightblue") +
      theme_minimal() +
      guides(color = 'none') +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c9)

Bar chart showing the number of times each species was observed. O. niveus was observed more! You can see the bars are both set to be light blue with a black border

Code
# Create the plot
c10 <- ggplot(crickets, aes(x = species, 
                     fill = species)) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2") +
      theme_minimal() +
      guides(color = 'none') +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c10)

Bar chart showing the same as above but has scale_fill_brewer() with the palette "Dark2".

Code
# Create the plot
c11 <- ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
      theme_minimal() +
      guides(color = 'none') +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c11)

Boxplot of both species’ chirp rates. It shows how O. exclamationis has a greater median and interquartile range than O. niveus. The colours are the same as above except only for the outline.

Faceting

Code
# Create the plot
c12 <- ggplot(crickets, aes(x = rate, 
                     fill = species)) + 
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2")  +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c12)

Not visually great!

Code
# Create the plot
c13 <- ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species) +
  scale_fill_brewer(palette = "Dark2")  +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c13)

Separates the two species into their own graph stacked next to each other.

Code
# Create the plot
c14 <- ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species,
             ncol = 1) +
  scale_fill_brewer(palette = "Dark2") +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(c14)

Separates the two species into their own graph stacked on top of each other.

Research methods

Summary on what is a good research hypothesis:

A research hypothesis is a suggested explanation for a phenomenon or an alternate reasoned proposal that suggests the possible correlation between or among a set of phenomena. They can be super specific or broad as long as they can be tested by experiments and observations. A good research hypothesis will be testable (allows you to work towards observable results), brief and objective (not too wordy but has all the needed information to make it clear) and clear and relevant (a clear aim on what you want to find out based on the field of study and whatever gaps there might be).

Pre session five (wc/ 21st October 2024)

Finding the appropriate statistical test

There are many different tests to chose from based on how much data and what kind of data you have collected. It is always important to think about the statistical test you are aiming towards when creating the experimental design, so you don’t collect data that has no statistical power/no meaning.

I usually use a flow chart like the one below to work out what stats test will be needed.

Flow chart of statistical tests

Session five (wc/ 21st October 2024)

Choosing the right analysis

Hypothetico-deductive reasoning

Post session five (wc/ 21st October 2024)

Creating graphs and determining what tests are associated with them

Using the iris data set.

Example one

Code
# Create the plot
i1 <- ggplot(iris, aes(x = Species,
                     y = Sepal.Length,
                     color = Species)) +
    geom_boxplot(alpha = 0) +
      theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(i1)

Example two

Code
# Create the plot
i2 <- ggplot(iris, aes(x = Petal.Length, 
                      color = Species, 
                      fill = Species)) +
  geom_density(alpha = 0.5) +
theme_minimal() +
      theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(i2)

Example three

Code
# Create the plot
i3 <- ggplot(iris, aes(x = Petal.Length,
                       y = Petal.Width,
                       shape = Species,
                       color = Species)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = lm, se = TRUE, color = "darkgrey",
              aes(group = 1)) +
  theme_minimal() +
  theme(plot.background = element_rect(color = "#5e5d59", 
                                       fill = NA, size = 1))

# Print the plot explicitly
print(i3)

Example four

Code
# Create the plot with mutated data
i4 <- iris %>%
  mutate(Size = ifelse(Sepal.Length < median(Sepal.Length), "Small", "Big")) %>%
  group_by(Species, Size) %>%
  summarize(Count = n(), .groups = 'drop') %>%
  ungroup() %>%
  ggplot(aes(x = Species, y = Count, fill = Size)) +
  geom_bar(position = "dodge", stat = "identity", colour = "white") +
  theme_minimal() +
  theme(plot.background = element_rect(color = "#5e5d59", fill = NA, size = 1))

# Print the plot explicitly
print(i4)

Session six (wc/ 28th October 2024)

Work in progress

🚧🔨👷🧱️🚧