This is my (Sophie’s) online workbook! It is split up into sessions and then sections within that.
Post session one (wc/ 23rd September 2024)
First I started off with inserting an image
Cool grasshopper 🦗 - more specifically the elegant grasshopper Zonocerus elegans
Loading the needed packages 📦
Click on code to see what they are! They are all the packages used throughout this workbook.
Code
library(knitr)library(rmarkdown)library(quarto)library(palmerpenguins)library(ggplot2) # this is within tidyverse but I wanted to show its uselibrary(tidyverse)library(dplyr)library(vtable)library(gt)library(modeldata)library(geomtextpath)
Quarto tutorial
Meet the penguins 🐧
The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.
Heres what the first few lines of data looks like!
Code
# Check the first few rows of the data setkable(head(penguins))
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
Adelie
Torgersen
39.1
18.7
181
3750
male
Adelie
Torgersen
39.5
17.4
186
3800
female
Adelie
Torgersen
40.3
18.0
195
3250
female
Adelie
Torgersen
NA
NA
NA
NA
NA
Adelie
Torgersen
36.7
19.3
193
3450
female
Adelie
Torgersen
39.3
20.6
190
3650
male
The plot below shows the relationship between flipper and bill lengths of these penguins.
I encountered many problems making this graph despite copying the code from the tutorial. I ended up un-installing all the packages and quarto and re-installing to no avail, however doing the creation and printing of the plot separately worked. So I’m sticking with that!
Code
# Create the plotp <-ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +geom_point(aes(color = species, shape = species)) +scale_color_manual(values =c("darkorange","purple","cyan4")) +labs(title ="Flipper and bill length",subtitle ="Dimensions for penguins at Palmer Station LTER",x ="Flipper length (mm)", y ="Bill length (mm)",color ="Penguin species", shape ="Penguin species" ) +theme_bw() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p)
Doing my own thing
Trying another plot with different data
First I imported the data and had a look at the first couple of rows.
Code
# Import data setpitfall <-read_csv("pitfall.csv")# First few rowskable(head(pitfall))
sample_point
num_individuals
num_species
1
3
3
2
2
4
3
1
1
4
1
1
5
2
2
6
8
6
About the data set 🐞🕳️
Collected early this September in Scotland. Non-lethal pitfalls placed in between hairy wood ant nests to try and look into their activity, however only one ant was captured (and immediately released). Although this data cannot be used to look into the activity of the ant it can be used as practice, especially since I might do pitfall trapping for my research project.
My friends Katie and Curtis creating holes in the ground to put the traps into.
Me making the non-lethal traps by putting Vaseline on the inside lip of a cup. The cups were then placed into the holes in the ground.
Reshaping the data
The data needs to be reshaped to allow both the number of species and the number of individuals (categories) to be plotted on the same graph. This allows for comparisons to be made!
Code
# Reshape datapitfall_long <- pitfall %>%pivot_longer(cols =c(num_individuals, num_species), names_to ="category", values_to ="count")# Now it looks like this:kable(head(pitfall_long))
sample_point
category
count
1
num_individuals
3
1
num_species
3
2
num_individuals
2
2
num_species
4
3
num_individuals
1
3
num_species
1
Plotting the data
Same with the penguins, I plotted and printed separately. Below is the graph that was created! Mostly there’s the same amount of individuals and species in each sample point. Point six has a lot more as mistakenly it was placed under a tree where beating samples were taken.
Code
# Create the plotf <-ggplot(pitfall_long, aes(fill = category, y = count, x = sample_point)) +geom_bar(position ="dodge", width =0.8, stat ="identity", color="black") +scale_fill_manual(name ="Legend", labels =c("Individuals", "Species"), values=c("#E69F00", "#56B4E9")) +scale_x_discrete(limits =as.character(1:15), labels =paste(1:15)) +labs(x ="Sample point", y ="Total number", title ="Total number of species and individuals at each sample point") +theme_bw() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(f)
Session two (wc/ 30th Septermber 2024)
Ethics for experimental design - notes
What is the scientific method?
A procedure which involves finding new/correcting previous knowledge through testing and experimentation.
What can influence the scientific method?
Personal beliefs and biases
Societal norms
Available technology
Funding and resources
And many more!
Science is an inferential exercise
As soon as you apply or interpret science you end up inserting values and judgements within it (based on personal beliefs and biases/societal norms).
Therefore ethics are then subject to the society at the time!
Ethics definition
Moral principles that govern human’s behaviours or the conducting of an activity.
The branch of knowledge that deals with moral principles.
External - Social
Morality definition
Principles concerning the distinction between right and wrong and good and bad behaviour.
A system of ideas of right and wrong conduct.
Virtuous conduct.
Can vary from person to person.
Internal
Animal experimentation example
There is a spectrum of opinions ranging from one extreme (Acceptable to use animals in any way that we see fit) to another (Animals have rights and it is wrong to use them).
The example is as follows: You are a member of an animal experimentation committee. An experiment, which requires keeping mice in isolation, is proposed to measure the effect of a vaccine which will stop a debilitating disease in humans. Note - mice are extremely social and suffer badly when isolated
Is this scenario ethical???
It depends on your judgement:
Yes - It is permissible to keep mice in isolation to obtain the results required.
No - We should not inflict suffering on animals.
Maybe - The experiment needs to be done in isolation and will help many if the vaccine is proven to be successful, but what can be done to make it better for the mice?
This is where the three R’s come into play!
The three R’s
Principles for human experimental technique set out by Russel and Burch (1959). They are widely accepted guidelines that can also be used for animals. It is possible to create a better experiment with more statistical ‘power’ after considering the three R’s.
Replacement
Methods that avoid or replace the use of animals that are protected under the Animals (Scientific Procedures) Act 1986, amended 2012 (ASPA).
Protected animals include all living vertebrates (except humans - they are under other protections as they can consent) and some invertebrates (e.g., cephalopods like octopuses and squid).
Instead use:
Human volunteers
Computer modelling
Animal cell cultures
Alternative species
Reduction
Use less subjects where possible.
Can often obtain the same amount of information.
‘Power analysis’ can show what the smallest possible sample size that is needed for the same ‘power.’
Often includes good use of experimental design and stats.
Refinement
Try to use methods that minimise the pain, suffering, distress or lasting harm experienced by the animals.
Applies to all aspects of animals use, from housing and husbandry to the specific procedures that will be performed on them.
e.g., use appropriate anaesthetics and analgesics, training animals to cooperate to avoid stress and providing animals with appropriate housing that allows for the expression of species-specific behaviours, such as having enclosures next to one another in the mice example (so they can still see and hear one another without being in direct contact).
The globalisation of medical research: the need to capture universal values and formulate universal principles (global bioethics).
The emerging reality of the diversity of moral cultures: the need to respect plurality and ethical diversity (Asian, American, European bioethics).
Other things to consider
Decolonisation - Avoiding ‘helicopter science’
Anti-oppressive practices
Discrimination
Pre session three (wc/ 7th October 2024)
Data analysis
Tidyverse package
The tidyverse package is a collection of other packages such as ggplot2 and dplyr. They all share an underlying design philosophy, grammar, and data structures. The code from this exercise elsewhere!
Research methods
How to write a research question
What is a research question?
A question (usually addresses an issue/problem) that a study/research project aims to answer, through data analysis.
The path to a research question starts off with a broad topic which gets narrowed down through preliminary research and searching for gaps in the knowledge, to eventually create a question.
Types of research questions
Quantitative
Precise
Often exploring relationships, comparisons or descriptions in a study
Not answerable with “yes” or “no”
Therefore, questions don’t use words like “is,” “are,” “do,” or “does.”
Qualitative
Broad or more specific areas of study
Adaptable, non-directional and more flexible
Therefore, these questions aim to “discover,” “explain,” or “explore.”
Mixed-method
Combination of both!
Examples of good and bad research questions
Example one
Bad: How does social media affect people’s behaviour?
Good: What effect does the daily use of YouTube have on the attention span of children aged under 16?
The bad one is vague and lacks specification.
Example two
Bad: Has there been an increase in childhood obesity in the US in the past 10 years?
Good: How have school intervention programs and parental education levels affected the rate of childhood obesity among 1st to 6th-grade students?
The bad one is simple plus it can be answered with yes or no!
Common mistakes
Overly broad questions
Unfocused and vague questions make it difficult to draw meaningful conclusions
Not feasible
Make sure that you have the resources and time to answer the question
Ambiguous language
Can lead to misunderstandings
Neglecting existing literature
Can result in doing a question thats already been heavily addressed
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
Summaries of the data set
Using summary()
Code
pen1 <-summary(penguins)print(pen1)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
Using vtable()
Code
penguins %>%vtable(., lush =TRUE)
.
Name
Class
Values
Missing
Summary
species
factor
'Adelie' 'Chinstrap' 'Gentoo'
0
nuniq: 3
island
factor
'Biscoe' 'Dream' 'Torgersen'
0
nuniq: 3
bill_length_mm
numeric
Num: 32.1 to 59.6
2
mean: 43.922, sd: 5.46, nuniq: 164
bill_depth_mm
numeric
Num: 13.1 to 21.5
2
mean: 17.151, sd: 1.975, nuniq: 80
flipper_length_mm
integer
Num: 172 to 231
2
mean: 200.915, sd: 14.062, nuniq: 55
body_mass_g
integer
Num: 2700 to 6300
2
mean: 4201.754, sd: 801.955, nuniq: 94
sex
factor
'female' 'male'
11
nuniq: 2
year
integer
Num: 2007 to 2009
0
mean: 2008.029, sd: 0.818, nuniq: 3
Grouping by species using group_by()
Code
penguins %>%group_by(species) %>%na.omit() %>%summarise(mean =mean(bill_length_mm), sd=sd(bill_length_mm), n =n())
# A tibble: 3 × 4
species mean sd n
<fct> <dbl> <dbl> <int>
1 Adelie 38.8 2.66 146
2 Chinstrap 48.8 3.34 68
3 Gentoo 47.6 3.11 119
Reshaping the data into a wide format using Tidyr
It is a way of organizing data, where each variable is represented as a column, and each observation is represented as a row.
Code
penguins %>%mutate(row =row_number()) %>%# needed to add a row number to identify each row as separate caseselect(row, species, island, body_mass_g) %>%pivot_wider(names_from = island, values_from = body_mass_g)
# A tibble: 344 × 5
row species Torgersen Biscoe Dream
<int> <fct> <int> <int> <int>
1 1 Adelie 3750 NA NA
2 2 Adelie 3800 NA NA
3 3 Adelie 3250 NA NA
4 4 Adelie NA NA NA
5 5 Adelie 3450 NA NA
6 6 Adelie 3650 NA NA
7 7 Adelie 3625 NA NA
8 8 Adelie 4675 NA NA
9 9 Adelie 3475 NA NA
10 10 Adelie 4250 NA NA
# ℹ 334 more rows
Types of variables
Code
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Categorical variables
Ordinal
Categories that maintain an order
Nominal
Categories with no order ranking
Binary
Nominal variables with two categories
Numerical variables
Discrete
Numbered values that can only take certain values
Continuous
Numered values that are measured, and can be any number within a particular range
Research methods
Research questions
Inductive reasoning vs deductive reasoning
Inductive
Observation
Pattern
Hypothesis
Theory
Deductive
Theory
Hypothesis
Observation
Confirmation
When to pose a question?
When a scientific question preceeds the hypotheses. Comes after a good screening of theory.
Good and bad questions - see pre-session activity on why these questions are good or bad
Bad:
Is there any difference between A and B?
Is A bigger than B?
Can X influence Y?
Are there little green people out there?
Good:
What explains differences between A and B?
What makes A bigger than B?
How X can influence Y?
How can evolution lead to little green people out there?
Importance of a good question
There is no good research based on bad questions. Also, a good question always push knowledge further. However, it is something that comes with practise.
Post session three (wc/ 7th October 2024)
Data analysis
Exercise 6.6.1
1. Practice typing/executing these problems and explain what each line of code does. Make sure tidyverse is loaded in your libraries!
I have removed the code fold on these problems so you can see all the annotations easier.
library(tidyverse) is the star of this section :)
Problem A
# %>% means putting the result into the next thing probably going to be used a lot so I am just mentioning here!midwest %>%# Using the midwest data set (included in the tidyverse package)group_by(state) %>%# Grouping the data by states in the midwest of the USsummarize(poptotalmean =mean(poptotal), # summarize() means creating a summary of the state grouped data. Inside of the function various summary statistics are calculated for the population columns. poptotalmean = mean(poptotal) calculates the mean of the total population for each state.poptotalmed =median(poptotal), # Calculates the median of the total population for each statepopmax =max(poptotal), # Calculates the maximum value for the total population for each statepopmin =min(poptotal), # Calculates the minimum value for the total population for each statepopdistinct =n_distinct(poptotal), # Counts the number of unique values in the total population for each statepopfirst =first(poptotal), # Gets the first value of the total population for each state (the first row of the group)popany =any(poptotal <5000), # Checks if there are any values in the population total less than 5,000 in each state (response will be either TRUE - yes or FALSE - no)popany2 =any(poptotal >2000000)) %>%# Does the same as above except if any values greater than 2,000,000ungroup() # Final ungrouping of the data - makes sure subsequent operations wont get affected
midwest %>%# Using midwest data setgroup_by(state) %>%# Groups the data by statessummarize(num5k =sum(poptotal <5000), # Summarising of grouped data for... num5k = sum(poptotal < 5000) which counts the number of times the population total is less than 5,000 in each statenum2mil =sum(poptotal >2000000), # Counts the number of times the population total is more than 2,000,000 in each statenumrows =n()) %>%# Counts the number of rows in each state groupungroup() # Final ungrouping of the data
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
Problem C
Part I
midwest %>%# Using the midwest data setgroup_by(county) %>%# Groups the data by countysummarize(x =n_distinct(state)) %>%# Summary ~ Counts the number of unique states for each county (shows how many different states each county spans)arrange(desc(x)) %>%# Sorts the summarised data into descending order based on the value of x (counties spanning the most states appear first)ungroup() # Final ungrouping of the data
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
Part II
# How does n() differ from n_distinct()? - n() is the total number of rows in the current group, it is used when you want to know how many entries there are in each group. n_distinct() is the number of unique values in a specified column, it is used when you want to know how many unique values exist in your chosen column within the group.# When would they be the same? different? - They would be the same when every entry in a column within the group is the same. They would be different when there are different values within a column.midwest %>%# Using the midwest data setgroup_by(county) %>%# Groups the data by countysummarize(x =n()) %>%# Summary ~ Counts the total number of rows in each county group (shows how many entries exist for each county).ungroup() # Final ungrouping of the data
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
Part III
# hint: # - How many distinctly different counties are there for each county? Only one as each group is defined by that distinct county# - Can there be more than 1 (county) county in each county? No!# - What if we replace 'county' with 'state'? I'm guessing it means the first county in the code (group_by(state) instead) and keeping the second the same (n_distinct(county)). It would count the number of unique counties for each state.midwest %>%# Using the midwest data setgroup_by(county) %>%# Groups the data by countysummarize(x =n_distinct(county)) %>%# Summary ~ Counts the number of unique counties in each county which will always be 1!ungroup() # Final ungrouping of the data
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
Problem D
diamonds %>%# Using the diamonds data setgroup_by(clarity) %>%# Grouping by claritysummarize(a =n_distinct(color), # Summary ~ Counts the number of unique diamond colours in each clarity groupb =n_distinct(price), # Counts the number of unique prices in each clarity groupc =n()) %>%# Counts the total number of rows in each clarity group (shows how many entries exist for each clarity).ungroup() # Final ungrouping of the data
diamonds %>%# Using the diamonds data setgroup_by(color, cut) %>%# Grouping by colour and cutsummarize(m =mean(price), # Summary ~ Calculates the mean price of diamond for each combination of colour and cuts =sd(price)) %>%# Calculates the standard deviation of diamond prices for each combination of colour and cutungroup() # Final ungrouping of the data
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
Part II
# Took me a moment to see the difference, but it was the order of cut and colour in the grouping command. The summary statistics (mean and sd) will be the same either way, however the order of the columns in the tibble is different!diamonds %>%# Using the diamonds data setgroup_by(cut, color) %>%# Grouping by cut and coloursummarize(m =mean(price), # Summary ~ Calculates the mean price of diamond for each combination of cut and colours =sd(price)) %>%# Calculates the standard deviation of diamond prices for each combination of cut and colourungroup() # Final ungrouping of the data
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
Part III
# - How good is the sale if the price of diamonds equaled msale? - It is a 20% discount, which is very good especially as they are diamonds.# - e.x. The diamonds are x% off original price in msale.diamonds %>%# Using the diamonds data setgroup_by(cut, color, clarity) %>%# Grouping by cut and coloursummarize(m =mean(price), # Summary ~ Calculates the mean price of diamond for each combination of cut and colours =sd(price), # Calculates the standard deviation of diamond prices for each combination of cut and colourmsale = m *0.80) %>%# Calculates a sale price that is 80% of the mean price - 20% discountungroup() # Final ungrouping of the data
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
Problem F
# These are all given weird unrelated food names to make it confusing :)diamonds %>%# Using the diamonds data setgroup_by(cut) %>%# Grouping by cutsummarize(potato =mean(depth), # Summary ~ Calculates potato which is the mean depth of diamonds for each cutpizza =mean(price), # Calculates pizza which is the mean price of diamonds for each cutpopcorn =median(y), # Calculates popcorn which is the median of the y dimension (height of diamonds) for each cutpineapple = potato - pizza, # Calculates pineapple which is the difference between potato (mean depth) and pizza (mean price) for each cutpapaya = pineapple ^2, # Calculates papaya which is pineapple (above) to the power of 2peach =n()) %>%# Calculates peach which is total number of rows (diamonds) for each cutungroup() # Final ungrouping of the data
diamonds %>%# Using the diamonds data setgroup_by(color) %>%# Grouping by coloursummarize(m =mean(price)) %>%# Summary ~ Calculates the mean price of diamonds for each colourmutate(x1 =str_c("Diamond color ", color), # Mutate() - adds new columns to the summarised data. x1 = str_c("Diamond color ", color) - creates a new column called x1 that links "Diamond color" with each unique colour. e.g. "Diamond color D" or "Diamond color E"x2 =5) %>%# Creates a new column called x2 that has a constant value of 5 in each rowungroup() # Final ungrouping of the data
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
Part II
# What does the first ungroup() do? Is it useful here? Why/why not? - Removes the grouping structure of the data after summarising. Meaning the mutate() operation isn't performed in the grouped data set. It is useful as we added new columns/performed operations that do not depend on the grouping of the data by colour. These are then instead applied to the entire data frame as opposed to within the groups avoiding any unintended consequences.# Why isn't there a closing ungroup() after the mutate()? - It would have no effect as mutate() doesn't group the data and the data was already ungrouped before hand.diamonds %>%# Using the diamonds data setgroup_by(color) %>%# Grouping the data by coloursummarize(m =mean(price)) %>%# Summary ~ Calculates the mean price of diamonds for each colourungroup() %>%# Ungroups the datamutate(x1 =str_c("Diamond color ", color), # Mutate ~ Creates a new column called x1 that links "Diamond color" with each unique colourx2 =5) # Creates a new column called x2 that has a constant value of 5 in each row
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
Problem H
Part I
diamonds %>%# Using the diamonds data setgroup_by(color) %>%# Grouping by colourmutate(x1 = price *0.5) %>%# Mutate ~ Creates a new column called x1 that is half the price of each diamonds and since the data is grouped by colour x1 will be the half the price for each diamond colour groupsummarize(m =mean(x1)) %>%# Summary ~ Calculates the mean x1 value for each colour group (mean of half the price of each diamond colour)ungroup() # Final ungrouping of the data
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
Part II
# What's the difference between part I and II? - The order of things (ungroup is at the end in the first and before summarise in the second). Part I provides mean of half the prices for each diamond colour and part II gives the overall mean of half the prices for all diamonds (regardless of colour). This is because part II the data is ungrouped before the summary so the mean of x1 is calculated on the whole data set not the grouped one like in part I.diamonds %>%# Using the diamonds data setgroup_by(color) %>%# Grouping by colourmutate(x1 = price *0.5) %>%# Mutate ~ Creates a new column called x1 that is half the price of each diamonds and since the data is grouped by colour x1 will be the half the price for each diamond colour groupungroup() %>%# Ungrouping of the datasummarize(m =mean(x1)) # Summary ~ Calculates the mean x1 value for each colour group (mean of half the price of each diamond colour)
# A tibble: 1 × 1
m
<dbl>
1 1966.
2. Why is grouping data necessary?
When you want to perform calculations that are specific to some parts of the data (the things you make grouped). It can also allow for further insight into the data as things like summarize() and mutate() can work within specific sections of the data.
3. Why is ungrouping data necessary?
If you want to perform operations that do not need to be constrained by previously defined groups. Also, prevents unintended consequences (if not done all operations will still respect the grouping).
4. When should you ungroup data?
After summarising and you don’t need to maintain the grouping for further things.
Before operations you want to perform to all the whole data set
Before final output to make it cleaner especially if you want to visualise the data
5. If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?
No as no groups are defined if group_by() isn’t used! In the example mutate() operates on the entire data set and no groups are defined.
Exercise 6.7
Using the diamonds built-in data set (requires tidyverse), perform the following tasks (annotations visible when you look at code):
Task 1 - View all of the variable names in diamonds (hint: view()).
Code
view(diamonds) # Allows the viewing of variables, which includes: carat, cut,#color, clarity, depth, table, price, x, y, and z
Task 2 - Arrange the diamonds by:
Lowest to highest price (hint: arrange())
Code
diamonds %>%# Using diamonds data setarrange(price) # Arranges price from low to high
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Highest to lowest price (hint: arrange(), desc())
Code
diamonds %>%# Using diamonds data setarrange(desc(price)) # Arranges price from high to low (descending)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
Lowest price and cut
Code
# Look at top of tablediamonds %>%# Using diamonds data setgroup_by(cut) %>%# Grouping by cutsummarize(lowest_price =min(price)) %>%# Calculates the minimum price of diamonds per cutungroup() # Final ungrouping of the data
# A tibble: 5 × 2
cut lowest_price
<ord> <int>
1 Fair 337
2 Good 327
3 Very Good 336
4 Premium 326
5 Ideal 326
Highest price and cut
Code
# Look at the bottom of the tablediamonds %>%# Using diamonds data setgroup_by(cut) %>%# Grouping by cutsummarize(highest_price =max(price)) %>%# Calculates the maximum price of diamonds per cutungroup() # Final ungrouping of the data
# A tibble: 5 × 2
cut highest_price
<ord> <int>
1 Fair 18574
2 Good 18788
3 Very Good 18818
4 Premium 18823
5 Ideal 18806
Task 3 - Arrange the diamonds by lowest to highest price and worst to best clarity.
Code
# Converted clarity to an ordered factordiamonds$clarity <-factor(diamonds$clarity,levels =c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"),ordered =TRUE)diamonds %>%# Using diamonds data setarrange(clarity, price) # Arranges clarity from worst to best and price from lowest to best
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
2 0.32 Good D I1 64 54 361 4.33 4.36 2.78
3 0.31 Premium F I1 62.9 59 394 4.33 4.29 2.71
4 0.34 Ideal D I1 62.5 57 413 4.47 4.49 2.8
5 0.34 Ideal D I1 61.4 55 413 4.5 4.52 2.77
6 0.32 Premium E I1 60.9 58 444 4.42 4.38 2.68
7 0.43 Premium H I1 62 59 452 4.78 4.83 2.98
8 0.41 Good G I1 63.8 56 467 4.7 4.74 3.01
9 0.32 Good D I1 64 54 468 4.36 4.33 2.78
10 0.32 Ideal E I1 60.7 57 490 4.45 4.41 2.69
# ℹ 53,930 more rows
Task 4 - Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond (hint: mutate()).
Code
diamonds %>%# Using the diamonds data setmutate(salePrice = price -250) # Mutate ~ Creates a new column called salePrice that is the price of diamonds with a discount of $250
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z salePrice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
Task 5 - Remove the x, y, and z variables from the diamonds data set (hint: select()).
Code
diamonds %>%# Using the diamonds data setselect(-x, -y, -z) # Removes the x, y and z variables (- is minus)
# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
Task 6 - Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).
Code
diamonds %>%# Using the diamonds data setgroup_by(cut) %>%# Grouping by cutsummarize(x =n()) %>%# Summary ~ Counts the total number of rows in each cut group (shows how many entries exist for each cut).ungroup() # Final ungrouping of the data
# A tibble: 5 × 2
cut x
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
Task 7 - Create a new column named totalNum that calculates the total number of diamonds.
Code
diamonds %>%# Using diamonds data setmutate(totalNum =n()) # Mutate ~ Creates a new column called totalNum that calculates the total number of diamonds (each row is the same as the total is the same throughout)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z totalNum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
Research methods
Creating a good and bad question about the diamonds data set
Bad questions
Are diamonds expensive? Very vague and can be answered with just a yes or no
What is the best diamond? Best is a subjective term - could be colour or price whatever the person thinks
How do diamonds compare to other gemstones? Just using the diamonds data set this question cannot be answered as it brings in unrelated subjects
Good questions
How does the price of diamonds vary by cut, quality and clarity, and what is the average price for each cut category? More specific and cannot be answered with just a yes or no, also, uses clear language and it's feasible with the diamonds data set
What is the relationship between carat, weight and price for diamonds of different clarity levels? Once again specific as it targets specific variables and the relationship between them
What percentage of diamonds in the diamonds data set are classified as Ideal cut, and how does that compare to other cut categories? Allows for clear statistical analysis
Pre session four (wc/ 14th October 2024)
ggplot2 package
My favourite package!!! I have used ggplot2 since starting with RStudio five years ago and I would never make a graph without it :) These next few plots I will do the basic ggplot2 code and then add a bunch of fancy colours and things to show what can be done. Chapter that info is from: https://r-graphics.org/chapter-quick Website I always use for ggplot2: https://r-graph-gallery.com/
Scatter plot
Code
# Create the plotb1 <-ggplot(mtcars, aes(x = wt, y = mpg)) +# Making graph with ggplot using carsgeom_point() +# Scatter plotgeom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +# Adds linetheme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1)) # Decoration# Print the plot explicitlyprint(b1)
Line plot
Code
# Create the plotb2 <-ggplot(pressure, aes(x = temperature, y = pressure)) +# Making graph with ggplot using carsgeom_line() +# Line plotgeom_point(shape =21, color ="black", fill ="#6dd6a7", size =4) +# Adds points of certain colour and sizetheme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1)) # Decoration# Print the plot explicitlyprint(b2)
Bar graph
Code
# Create the plotb3 <-ggplot(mtcars, aes(x =factor(cyl), fill=as.factor(cyl))) +# Making graph with ggplot using cars - also filling colour as cylgeom_bar() +# Bar plotscale_fill_hue(c =40) +# Changing colourtheme(legend.position="none") +# Removing the legend as its not neededtheme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1)) # Decoration# Print the plot explicitlyprint(b3)
Histogram
Code
# Create the plotb4 <-ggplot(mtcars, aes(x = mpg)) +# Making graph with ggplot using carsgeom_histogram(binwidth =4, fill ="#69b3a2", color ="#ffffff", alpha =0.7) +# Plotting histogram - and changing colourtheme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1)) # Decoration# Print the plot explicitlyprint(b4)
Boxplot
Code
# Create the plotb5 <-ggplot(ToothGrowth, aes(x = supp, y = len)) +# Making graph with ggplot using ToothGrowthgeom_boxplot(fill ="slateblue", alpha =0.6) +# Boxplot with colourgeom_jitter(color="black", size =2, alpha =0.8) +# Adds individual observationstheme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1)) # Decoration# Print the plot explicitlyprint(b5)
Function curve
Code
# Plot a user-defined functionmyfun <-function(xvar) {1/ (1+exp(-xvar +10))}# Create the plotb6 <-ggplot(data.frame(x =c(0, 20)), aes(x = x)) +# Making graph with ggplot using data set createdstat_function(fun = myfun, geom ="line", color ="red", linewidth =2) +# Function curve thats red and thickertheme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1)) # Decoration# Print the plot explicitlyprint(b6)
Session four (wc/ 14th October 2024)
Data exploration
How to visually check continuous variables using the penguins data set?
Histogram
Code
data("penguins")# Group by speciespenguins_grouped <- penguins %>%group_by(species)# Create the plotp1 <-ggplot(penguins_grouped, aes(x = bill_length_mm, color = species, fill = species)) +geom_histogram() +labs(x ="Bill length in mm", y ="Number of observed penguins", title ="The bill length of observed penguins per species") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p1)
Chinstrap and Gentoo have similar sized of bill lengths, whereas Adelie has a smaller bill length.
Boxplot
Code
data("penguins")# Group by speciespenguins_grouped <- penguins %>%group_by(species)# Create the plotp2 <-ggplot(penguins_grouped, aes(x = species, y = bill_length_mm, color = species, fill = species)) +geom_boxplot(alpha =0.8, show.legend =FALSE) +labs(x ="Species", y ="Bill length in mm", title ="The bill length of penguins per species") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p2)
Similar to above, just a different way to visually show it.
How to visually check categorical variables using the penguins data set?
Species of penguin
Code
# Create the plotp3 <-ggplot(penguins, aes(x = species,color = species, fill = species)) +geom_bar(show.legend =FALSE) +labs(x ="Species", y ="Number of penguins observed", title ="The number of penguins observed per species") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p3)
Adelie observed the most and Chinstrap observed the least.
Observations per year
IMPORTANT
Since adding the package modeldata for post session four this chunk would not render at alllll! It took me several hours to realise the year column was missing from the data in the vtable and therefore this plot couldn’t be made. Both palmerspenguins and modeldata have penguins data sets with only the palmer one including years. So to fix my issue you will see in the code I had to specify I was using palmerspenguins data (I also did this at the beginning of session three).
Code
# Load the penguins dataset from palmerpenguins explicitlypenguins <- palmerpenguins::penguins# Create the plotp4 <-ggplot(penguins, aes(x = year, color = species, fill = species)) +geom_bar() +labs(x ="Year", y ="Number of penguins observed", title ="The number of penguins observed per year") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plotprint(p4)
Gradually the total number of penguins observed increases each year. Each bar has a breakdown of the number of species observed in that year but as the bars stacked its harder to compare.
Observations per island
Code
# Create the plotp5 <-ggplot(penguins, aes(x = island,color = species, fill = species)) +geom_bar() +labs(x ="Island", y ="Number of penguins observed", title ="The number of penguins observed per island") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p5)
Adelie penguins can be found on all three islands (Biscoe, Dream and Torgersen). Chinstrap penguins can only be found on Dream island. Gentoo penguins can only be found on Biscoe island.
Visulising correlations
Code
# Create the plotp6 <-ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +geom_point() +labs(x ="Bill length in mm", y ="Bill depth in mm", title ="The correlation between bill length and depth") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p6)
A lot of dots everywhere, hard to make out any pattern.
You can also do this correlation per species:
Code
# Create the plotp7 <-ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species, fill = species)) +geom_point() +labs(x ="Bill length in mm", y ="Bill depth in mm", title ="The correlation between bill length and depth per species") +geom_smooth(method = lm , se =TRUE) +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p7)
Now that the dots are coloured by species you can see that there is different positive correlations between bill length and depth for each species.
Body mass per sex
Code
# Clean the data (removes any n/a values)clean_penguins <-na.omit(penguins)# Create the plotp8 <-ggplot(clean_penguins, aes(x = sex, y = body_mass_g,color = species, fill = species)) +geom_boxplot(alpha =0.8) +labs(x ="Sex", y ="Body mass in grams", title ="The correlation between sex and body mass per species") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p8)
Male penguins no matter the species are larger than the females. Although Gentoo penguins do have a greater difference.
Inverting the groups:
Code
# Clean the data (removes any n/a values)clean_penguins <-na.omit(penguins)# Create the plotp9 <-ggplot(clean_penguins, aes(x = species, y = body_mass_g,color = sex, fill = sex)) +geom_boxplot(alpha =0.8) +labs(x ="Species", y ="Body mass in grams", title ="The correlation between penguin species and body mass per sex") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p9)
Same as above, but I feel this graph is easier to read when comparing the sex across each species as the males and female boxplots are next to each other.
Questions
Can body mass predict bill length?
Code
# Clean the data (removes any n/a values)clean_penguins <-na.omit(penguins)# Create the plotp10 <-ggplot(clean_penguins, aes(x = body_mass_g, y = bill_length_mm, color = species, fill = species)) +geom_point() +labs(x ="Body mass in grams", y ="Bill length in mm", title ="The correlation between bill length and body mass per species") +geom_smooth(method = lm , se =TRUE) +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p10)
Using this graph you could measure the weight of a penguin in grams (which I presume would be easier than measuring its bill length), then go along the x axis with the weight (and the species of penguin) to the line. Where the weight hits the line is also what the supposed bill length would be! e.g., an Adelie penguin with the weight of 4000g would have a bill length of around 40mm.
Does sex explain flipper length?
Code
# Clean the data (removes any n/a values)clean_penguins <-na.omit(penguins)# Create the plotp11 <-ggplot(clean_penguins, aes(x = species, y = bill_length_mm,color = sex, fill = sex)) +geom_boxplot(alpha =0.8) +labs(x ="Species", y ="Bill length in mm", title ="The sex and bill length of penguins per species") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p11)
The median bill length is greater for males than females across each species, with it being bigger than the interquartile range for each species’s females. However, the overall range still overlaps for each sex of each species so statistical analysis should be done to say with more confidence if sex can explain flipper length.
Exploring data is asking relevant questions, not just correlating random things! I like to always roughly draw my graph on paper to help visualise what I will be creating on RStudio.
Distributions
Checking distributions
Code
# Clean the data (removes any n/a values)clean_penguins <-na.omit(penguins)# Turning data to long formatlong_penguins <- clean_penguins %>%pivot_longer(bill_length_mm:body_mass_g, names_to ="trait")# Create the plotp12 <-ggplot(long_penguins, aes(x = value,group = species,fill = species,color = species)) +geom_density(alpha=0.8) +facet_grid(~trait, scales ="free_x" ) +labs(x ="Value", y ="Density", title ="Checking the distributions") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(p12)
Moments of centrality and dispersion
Centrality
Image showing moments of centrality - from the slides
Dispersion
Image showing moments of dispersion - from the slides
Variance
Standard deviation
Standard Error
Range
Quantiles
Checking distributions via a histogram
Code
set.seed(999)normal<-rnorm(100)normal %>%as.tibble() %>%ggplot(aes(value))+geom_histogram(color="#DD4A48", fill="#DD4A48")+geom_vline(xintercept=c(mean(normal), (mean(normal)+sd(normal)),mean(normal)-sd(normal)), linetype="dashed") +labs(x ="Value", y ="Count", title ="Checking the distributions via a histogram") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))
Shows a sort of bell curve that can be interpretated.
Checking distributions via a boxplot
Code
set.seed(999)normal<-rnorm(100)normal %>%as.tibble() %>%ggplot(aes(value))+geom_boxplot(fill="#DD4A48",alpha=0.7) +labs(x ="Value", y ="Count", title ="Checking the distributions via a boxplot") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))
Take a look at the median (line in box) and interquartile range (the box) to determine the spread of the data.
Scientific hypotheses
A hypothesis:
Is affirmative
Is not a question
Is self-explanatory
Must lead to expectations if confirmed
Why do we need hypotheses?
Candidate explanation to a phenomenon
Contain previsions and expectations
Feedback theory
Advance Science
Types of hypotheses
There are two types of hypotheses!
Scientific hypotheses
Candidate statements to explain an observed phenomenon
Meant to generate logical predictions
Working guidelines
Statistical hypotheses
Logical predictions
Confirmed by stats
Can be drawn in a graph
Image showing how statistical hypotheses can be drawn on a graph - from the slides
A good working hypothesis is everything
A wrong hypotheses will misguide you
A good hypothesis will keep you excited and up to date
Practise makes perfect
How to write hypotheses?
Tell a story
Don’t use a subheading
Never reefer to statistical hypotheses
The if/then method
If drug X have an effect on reducing headache, then …
When many exotic species are introduced in the ecosystem, then the likelihood of severe ecological disruption increases.
The statement method
Drug X reduce the headaches because it blocks the neuroreceptors of pain.
Exotic species in high quantities disrupts ecological mutualistic networks.
Post session four (wc/ 14th October 2024)
Data analysis
Following Andrew’s script
Packages and the data used
The packages used here are loaded at the beginning but also shown in this chunk.
?crickets gives more information on the data or you can even do ?ggplot2 to find out more about a package.
Code
# Packages being usedlibrary(tidyverse)library(modeldata)# Taking a look at the crickets data setcrickets %>%vtable(., lush =TRUE)
.
Name
Class
Values
Missing
Summary
species
factor
'O. exclamationis' 'O. niveus'
0
nuniq: 2
temp
numeric
Num: 17.2 to 30.4
0
mean: 23.765, sd: 3.823, nuniq: 17
rate
numeric
Num: 44.3 to 101.7
0
mean: 72.887, sd: 16.914, nuniq: 31
The basics
Code
# Create the plotc1 <-ggplot(crickets, aes(x = temp, y = rate)) +geom_point() +labs(x ="Temperature",y ="Chirp rate",title ="Cricket chirps",caption ="Source: McDonald (2009)") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c1)
The chirp rate increases with temperature.
Code
# Create the plotc2 <-ggplot(crickets, aes(x = temp, y = rate,color = species)) +geom_point() +labs(x ="Temperature",y ="Chirp rate",color ="Species",title ="Cricket chirps",caption ="Source: McDonald (2009)") +scale_color_brewer(palette ="Dark2") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c2)
The chirp rate increased with temperature for both species, but O. exclamationis has a greater overall chirp rate.
Modifiying basic properties of the plot
Code
# Create the plotc3 <-ggplot(crickets, aes(x = temp, y = rate)) +geom_point(color ="red",size =2,alpha =0.3,shape ="square") +labs(x ="Temperature",y ="Chirp rate",title ="Cricket chirps",caption ="Source: McDonald (2009)") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c3)
Colours the points to red with an alpha of 0.3 and size of 2. They are also square now!
Code
# Create the plotc4 <-ggplot(crickets, aes(x = temp, y = rate,shape = species, color = species)) +geom_point(alpha =0.6, size =5) +labs(x ="Temperature",y ="Chirp rate",title ="Cricket chirps",caption ="Source: McDonald (2009)") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c4)
Gives different shaped and coloured points to the two species!
Adding another layer
Code
# Create the plotc5 <-ggplot(crickets, aes(x = temp, y = rate)) +geom_point() +geom_smooth(method ="lm",se =FALSE) +labs(x ="Temperature",y ="Chirp rate",title ="Cricket chirps",caption ="Source: McDonald (2009)") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c5)
Adds a line of best fit.
Code
# Need this package for the labels on the lineslibrary(geomtextpath)# Create the plotc6 <-ggplot(crickets, aes(x = temp, y = rate,color = species)) +geom_point() +geom_labelsmooth(aes(label = species), fill ="white",method ="lm", formula = y ~ x,size =3, linewidth =1, boxlinewidth =0.4) +labs(x ="Temperature",y ="Chirp rate",color ="Species",title ="Cricket chirps",caption ="Source: McDonald (2009)") +scale_color_brewer(palette ="Dark2") +theme_minimal() +guides(color ='none') +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c6)
Adds a line of best fit for each species. They are even labelled!
Other plots
Code
# Create the plotc7 <-ggplot(crickets, aes(x = rate)) +geom_histogram(bins =15) +theme_minimal() +guides(color ='none') +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c7)
Histogram of the crickets data showing the number of observations per chirp rate.
Code
# Create the plotc8 <-ggplot(crickets, aes(x = rate)) +geom_freqpoly(bins =15) +theme_minimal() +guides(color ='none') +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c8)
Graph of the crickets data showing the number of observations per chirp rate. You can see how the peaks are the same as the histogram above.
Code
# Create the plotc9 <-ggplot(crickets, aes(x = species)) +geom_bar(color ="black",fill ="lightblue") +theme_minimal() +guides(color ='none') +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c9)
Bar chart showing the number of times each species was observed. O. niveus was observed more! You can see the bars are both set to be light blue with a black border
Code
# Create the plotc10 <-ggplot(crickets, aes(x = species, fill = species)) +geom_bar(show.legend =FALSE) +scale_fill_brewer(palette ="Dark2") +theme_minimal() +guides(color ='none') +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c10)
Bar chart showing the same as above but has scale_fill_brewer() with the palette "Dark2".
Code
# Create the plotc11 <-ggplot(crickets, aes(x = species, y = rate,color = species)) +geom_boxplot(show.legend =FALSE) +scale_color_brewer(palette ="Dark2") +theme_minimal() +guides(color ='none') +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c11)
Boxplot of both species’ chirp rates. It shows how O. exclamationis has a greater median and interquartile range than O. niveus. The colours are the same as above except only for the outline.
Faceting
Code
# Create the plotc12 <-ggplot(crickets, aes(x = rate, fill = species)) +geom_histogram(bins =15) +scale_fill_brewer(palette ="Dark2") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c12)
Not visually great!
Code
# Create the plotc13 <-ggplot(crickets, aes(x = rate,fill = species)) +geom_histogram(bins =15,show.legend =FALSE) +facet_wrap(~species) +scale_fill_brewer(palette ="Dark2") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c13)
Separates the two species into their own graph stacked next to each other.
Code
# Create the plotc14 <-ggplot(crickets, aes(x = rate,fill = species)) +geom_histogram(bins =15,show.legend =FALSE) +facet_wrap(~species,ncol =1) +scale_fill_brewer(palette ="Dark2") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(c14)
Separates the two species into their own graph stacked on top of each other.
Research methods
Summary on what is a good research hypothesis:
A research hypothesis is a suggested explanation for a phenomenon or an alternate reasoned proposal that suggests the possible correlation between or among a set of phenomena. They can be super specific or broad as long as they can be tested by experiments and observations. A good research hypothesis will be testable (allows you to work towards observable results), brief and objective (not too wordy but has all the needed information to make it clear) and clear and relevant (a clear aim on what you want to find out based on the field of study and whatever gaps there might be).
Pre session five (wc/ 21st October 2024)
Finding the appropriate statistical test
There are many different tests to chose from based on how much data and what kind of data you have collected. It is always important to think about the statistical test you are aiming towards when creating the experimental design, so you don’t collect data that has no statistical power/no meaning.
I usually use a flow chart like the one below to work out what stats test will be needed.
Flow chart of statistical tests
Session five (wc/ 21st October 2024)
Choosing the right analysis
Hypothetico-deductive reasoning
Post session five (wc/ 21st October 2024)
Creating graphs and determining what tests are associated with them
Using the iris data set.
Example one
Code
# Create the ploti1 <-ggplot(iris, aes(x = Species,y = Sepal.Length,color = Species)) +geom_boxplot(alpha =0) +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(i1)
Example two
Code
# Create the ploti2 <-ggplot(iris, aes(x = Petal.Length, color = Species, fill = Species)) +geom_density(alpha =0.5) +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(i2)
Example three
Code
# Create the ploti3 <-ggplot(iris, aes(x = Petal.Length,y = Petal.Width,shape = Species,color = Species)) +geom_point(alpha =0.7, size =2) +geom_smooth(method = lm, se =TRUE, color ="darkgrey",aes(group =1)) +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(i3)
Example four
Code
# Create the plot with mutated datai4 <- iris %>%mutate(Size =ifelse(Sepal.Length <median(Sepal.Length), "Small", "Big")) %>%group_by(Species, Size) %>%summarize(Count =n(), .groups ='drop') %>%ungroup() %>%ggplot(aes(x = Species, y = Count, fill = Size)) +geom_bar(position ="dodge", stat ="identity", colour ="white") +theme_minimal() +theme(plot.background =element_rect(color ="#5e5d59", fill =NA, size =1))# Print the plot explicitlyprint(i4)