RM&DA Workbook

Author

Ellelouise

Published

September 28, 2024

Introduction to Quarto

To-do list

  1. customise workbook
  2. load first data set
  3. publish using R-Pubs

Adding a figure

he is a whole mood

Gary the gadget guy

Go to the ‘Insert’ tab at the top right and choose the figure option. Select a file, then add a caption and alternative text.

Adding code chunks - key things to note

When you click the Render button, a document is generated that includes both the content and the output of the embedded code. Note that all packages needed to run these chunks were loaded at the start of the page. You can embed code like this:

Chunk 1: The echo: false option disables the printing of code (only output is displayed).

[1] 4
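As a rough sketch, a chunk along these lines would render only the output shown above. The expression `2 + 2` is an assumption, since the real chunk's code is hidden by the option itself.

```r
#| echo: false
# Hypothetical chunk body: the workbook's actual code is not shown in the render
result <- 2 + 2
print(result)
```

In a Quarto document the `#| echo: false` line sits at the top of the chunk, so the rendered page shows the printed value but not the code.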

Chunk 2: Basic data management

“%>%” The Pipe operator allows you to chain functions together in a more readable and intuitive way. It takes the output of one function and passes it as the first argument to the next function.
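A minimal sketch of the same calculation written with and without the pipe; the `scores` vector is made up for illustration.

```r
library(dplyr)

# Illustrative values only, not from the workbook
scores <- c(4, 9, 16)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(scores)), 1)

# Piped call: reads left to right, each result feeds the next function
piped <- scores %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```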

“mutate()” Adds new columns or modifies current variables in the data set. We can also use other functions inside the mutate function to create our new variable(s) - nesting.
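For instance, assuming a small made-up table called `scores`, nesting `mean()` and `ifelse()` inside `mutate()` might look like this:

```r
library(dplyr)

# Hypothetical toy data, not from the workbook
scores <- tibble(id = 1:4, score = c(10, 20, 30, 40))

scores_new <- scores %>%
  mutate(
    score_centred = score - mean(score),        # mean() nested inside mutate()
    passed = ifelse(score >= 25, "yes", "no")   # ifelse() nested inside mutate()
  )

scores_new
```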

“summarise()” collapses all rows and returns a one-row summary (or one row per group when the data are grouped); for example, it allows us to calculate the mean. We can also perform multiple operations with summarise() and nest other useful functions inside it.
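A short sketch of several summaries computed in one call, again on made-up data:

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(score = c(10, 20, 30, 40))

overall <- scores %>%
  summarise(
    mean_score = mean(score),  # average of all rows
    sd_score   = sd(score),    # standard deviation of all rows
    n          = n()           # number of rows
  )

overall
```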

“group_by()” takes existing data and groups it by one or more variables so that later operations are performed per group; “ungroup()” removes that grouping again. For example, grouping by age and sex (male/female) might be useful in a dataset if we care about how females of a certain age scored compared to males of a certain age (or comparing ages within males or within females).

  • “summarize()” and “group_by()” can be used to calculate/compare the average Score (and other measures) for males and females separately for example.
  • “mutate()” and “group_by()” could also be utilized to add a new column based on the group.
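The two combinations above can be sketched as follows; the `scores` table and its column names are assumptions made for the example.

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(
  sex   = c("female", "female", "male", "male"),
  score = c(10, 20, 30, 50)
)

# group_by() + summarise(): one row per group
by_sex <- scores %>%
  group_by(sex) %>%
  summarise(mean_score = mean(score))

# group_by() + mutate(): keeps every row and adds the group's mean as a column
with_group_mean <- scores %>%
  group_by(sex) %>%
  mutate(group_mean = mean(score)) %>%
  ungroup()

by_sex
with_group_mean
```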

“ungroup()” should be used after group_by() once the grouped calculations are done. If you forget to ungroup() the data, later operations will still be applied per group, which can produce unexpected results or errors. Always ungroup() when you’ve finished with your calculations.
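To see why this matters, here is a small sketch (made-up data) in which the same `mutate()` call gives different answers depending on whether the grouping is still active:

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(
  sex   = c("female", "female", "male", "male"),
  score = c(10, 20, 30, 50)
)

# Grouping still active: mean() is computed within each sex
still_grouped <- scores %>%
  group_by(sex) %>%
  mutate(centre = mean(score))

# Grouping removed first: mean() is computed over all rows
ungrouped <- scores %>%
  group_by(sex) %>%
  ungroup() %>%
  mutate(centre = mean(score))

still_grouped$centre
ungrouped$centre
```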

“filter()” retains only the rows of data that meet the specified requirement(s).
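A minimal sketch with two conditions, using made-up data; conditions separated by commas must all be true for a row to be kept.

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(
  sex   = c("female", "female", "male", "male"),
  age   = c(21, 35, 28, 40),
  score = c(10, 20, 30, 50)
)

# Keep only rows where sex is "female" AND score is above 15
kept <- scores %>%
  filter(sex == "female", score > 15)

kept
```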

“select()” keeps only the columns (variables) that you want to see and drops all other columns. You can refer to the columns by position (e.g. the first column) or by name. The order in which you list the column names/positions is the order in which the columns will be displayed.
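As an illustration on a made-up table, selecting by name and by position can give the same result, and the listed order sets the column order:

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(id = 1:3, age = c(21, 35, 28), score = c(10, 20, 30))

# By name: score first, then id; age is dropped
by_name <- scores %>% select(score, id)

# By position: third column first, then first column
by_position <- scores %>% select(3, 1)

names(by_name)
names(by_position)
```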

“arrange()” allows you to arrange the rows by the values of a variable in ascending or descending order (if that is applicable to your values). This can apply to both numerical and non-numerical values.
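A brief sketch of ascending, descending and alphabetical ordering, again using made-up data:

```r
library(dplyr)

# Hypothetical toy data
scores <- tibble(name = c("Cara", "Abe", "Bo"), score = c(20, 30, 10))

ascending    <- scores %>% arrange(score)        # lowest score first
descending   <- scores %>% arrange(desc(score))  # highest score first
alphabetical <- scores %>% arrange(name)         # character values sort A to Z

ascending
descending
alphabetical
```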

Chunk 3: Plotting data set using ggplot

library(ggplot2)          # For plotting
library(palmerpenguins)   # For the penguins dataset

ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Using LaTeX (TinyTex)

$ = math mode

\(\alpha\) \[ \pi \in \mathbb{R} \]
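As a small sketch of how this is typed in the source document: a single pair of dollar signs gives inline math, and a double pair gives a displayed equation. The sample mean formula is only an illustration and is not taken from the workbook.

```latex
Inline math: the significance level is $\alpha = 0.05$.

Display math:
$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$
```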

Data Wrangling

Session workthrough

1. Calling a tibble

library(palmerpenguins)
data("penguins")

penguins %>%
  select(1:3)
# A tibble: 344 × 3
   species island    bill_length_mm
   <fct>   <fct>              <dbl>
 1 Adelie  Torgersen           39.1
 2 Adelie  Torgersen           39.5
 3 Adelie  Torgersen           40.3
 4 Adelie  Torgersen           NA  
 5 Adelie  Torgersen           36.7
 6 Adelie  Torgersen           39.3
 7 Adelie  Torgersen           38.9
 8 Adelie  Torgersen           39.2
 9 Adelie  Torgersen           34.1
10 Adelie  Torgersen           42  
# ℹ 334 more rows

2. Summary

penguins %>%
  summary()
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 

3. Subsetting data

penguins %>%
  select(body_mass_g)
# A tibble: 344 × 1
   body_mass_g
         <int>
 1        3750
 2        3800
 3        3250
 4          NA
 5        3450
 6        3650
 7        3625
 8        4675
 9        3475
10        4250
# ℹ 334 more rows
penguins %>%
  filter(species=="Gentoo",
         bill_length_mm > 50,
         sex=="male") %>%
  select(bill_length_mm,
         bill_depth_mm) %>%
  arrange(bill_depth_mm)
# A tibble: 21 × 2
   bill_length_mm bill_depth_mm
            <dbl>         <dbl>
 1           51.3          14.2
 2           50.2          14.3
 3           50.1          15  
 4           50.7          15  
 5           50.4          15.3
 6           52.5          15.6
 7           54.3          15.7
 8           50.8          15.7
 9           50.4          15.7
10           53.4          15.8
# ℹ 11 more rows

4. Adding columns

penguins %>%
  select(bill_length_mm,
         bill_depth_mm) %>%
  mutate(bill_volume=bill_length_mm+bill_depth_mm) %>%
  mutate(log_bill_volume=log(bill_volume)) %>%
  mutate(bill_categ=ifelse(bill_volume<60, "small", "big"))
# A tibble: 344 × 5
   bill_length_mm bill_depth_mm bill_volume log_bill_volume bill_categ
            <dbl>         <dbl>       <dbl>           <dbl> <chr>     
 1           39.1          18.7        57.8            4.06 small     
 2           39.5          17.4        56.9            4.04 small     
 3           40.3          18          58.3            4.07 small     
 4           NA            NA          NA             NA    <NA>      
 5           36.7          19.3        56              4.03 small     
 6           39.3          20.6        59.9            4.09 small     
 7           38.9          17.8        56.7            4.04 small     
 8           39.2          19.6        58.8            4.07 small     
 9           34.1          18.1        52.2            3.96 small     
10           42            20.2        62.2            4.13 big       
# ℹ 334 more rows

5. Reshaping data

penguins %>%
  select(bill_length_mm,
         bill_depth_mm,
         year) %>%
  pivot_longer(cols = c(bill_length_mm:bill_depth_mm),
               names_to = "bill_feature", values_to = "value") 
# A tibble: 688 × 3
    year bill_feature   value
   <int> <chr>          <dbl>
 1  2007 bill_length_mm  39.1
 2  2007 bill_depth_mm   18.7
 3  2007 bill_length_mm  39.5
 4  2007 bill_depth_mm   17.4
 5  2007 bill_length_mm  40.3
 6  2007 bill_depth_mm   18  
 7  2007 bill_length_mm  NA  
 8  2007 bill_depth_mm   NA  
 9  2007 bill_length_mm  36.7
10  2007 bill_depth_mm   19.3
# ℹ 678 more rows

Post-session

data("diamonds")
 
diamonds %>% 
  group_by(clarity) %>%
  summarise(a = n_distinct(color),
            b = n_distinct(price),
            c = n()) %>% 
  ungroup()
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790
diamonds %>% 
  group_by(color, cut) %>% 
  summarize(m = mean(price),
            sd = sd(price)) %>% 
  ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
   color cut           m    sd
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows
diamonds %>% 
  group_by(cut, color, clarity) %>% 
  summarize(m = mean(price),
            sd = sd(price),
            msale = m * 0.80) %>% 
  ungroup()
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
   cut   color clarity     m    sd msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows

Data Exploration & Visualisation

For Continuous Variables

Histogram

penguins %>% 
  group_by(species) %>%
  ggplot(aes(x=bill_length_mm, color=species, fill=species))+
  geom_histogram(alpha=0.5)+
  scale_color_manual(values = c("Adelie" = "#2df8ff", "Chinstrap" = "#34e1a0", "Gentoo" = "#8abf26")) +
  scale_fill_manual(values = c("Adelie" = "#2df8ff", "Chinstrap" = "#34e1a0", "Gentoo" = "#8abf26")) +
  labs(
   title = "Bill Length Across the Palmer Penguins",
    x = "Bill Length (mm)", y = "Number of Penguins") 

penguins %>% 
  group_by(species) %>%
  ggplot(aes(x=bill_depth_mm, color=species, fill=species))+
  geom_histogram(alpha=0.5)+
  scale_color_manual(values = c("Adelie" = "#2df8ff", "Chinstrap" = "#34e1a0", "Gentoo" = "#8abf26")) +
  scale_fill_manual(values = c("Adelie" = "#2df8ff", "Chinstrap" = "#34e1a0", "Gentoo" = "#8abf26")) +
  labs(
   title = "Bill Depth Across the Palmer Penguins",
    x = "Bill Depth (mm)", y = "Number of Penguins") 

Box Plot

penguins %>% 
  group_by(species) %>%
  ggplot(aes(x=species, 
             y=bill_length_mm, 
             color=species, 
             fill=species))+
  geom_boxplot(alpha=0.5) +
  scale_color_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
  scale_fill_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
  labs(
   title = "Bill Length Across the Palmer Penguins",
    x = "Species", y = "Bill Length (mm)") 

penguins %>% 
  group_by(species) %>%
  ggplot(aes(x=species, 
             y=bill_depth_mm, 
             color=species, 
             fill=species))+
  geom_boxplot(alpha=0.5) +
  geom_jitter() +
  scale_color_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
  scale_fill_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
  labs(
   title = "Bill Depth Across the Palmer Penguins",
    x = "Species", y = "Bill Depth (mm)") 

For categorical variables

Bar Chart

penguins %>% 
  group_by(species) %>% 
  ggplot(aes(x=species,
             color=species, 
             fill=species)) +
  geom_bar(alpha=0.5) +
  scale_color_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
  scale_fill_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
  labs(
   title = "Number of Palmer Penguins Used Across Each Species",
    x = "Species", y = "Number of Penguins") 

penguins %>% 
  group_by(species) %>% 
  ggplot(aes(x=island,
             color=species, 
             fill=species))+
  geom_bar(alpha=0.5) +
  scale_color_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
  scale_fill_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
  labs(
    title = "Number of Palmer Penguins Used Across Each Species per Island",
     x = "Island", y = "Number of Penguins") 

penguins %>% 
  group_by(species) %>% 
  ggplot(aes(x=year,
             color=species, 
             fill=species))+
  geom_bar(alpha=0.5) +
  scale_color_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
  scale_fill_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
  labs(
    title = "Number of Palmer Penguins Used Across Each Species per Year",
     x = "Year", y = "Number of Penguins") 

penguins %>% 
  group_by(species) %>% 
  ggplot(aes(x=year,
             color=species, 
             fill=species))+
  geom_bar(position = "fill", alpha=0.5) +
  scale_color_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
  scale_fill_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
  labs(
    title = "Proportion of Palmer Penguins in Each Species per Year",
    subtitle = "using the 'fill' position",
     x = "Year", y = "Proportion of Penguins") 

penguins %>% 
  group_by(species) %>% 
  ggplot(aes(x=year,
             color=species, 
             fill=species))+
  geom_bar(position = "dodge", alpha=0.5) +
  scale_color_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
  scale_fill_manual(values = c("Adelie" = "#fff0a3", "Chinstrap" = "#ffc875", "Gentoo" = "#ff9a5c")) +
  labs(
    title = "Number of Palmer Penguins Used Across Each Species per Year",
    subtitle = "using the 'dodge' position",
     x = "Year", y = "Number of Penguins") 

For Visualising Correlations

Scatter graph

penguins %>% 
  group_by(species) %>% 
  ggplot(aes(x=bill_length_mm, 
             y=bill_depth_mm,
             color=species, 
             shape=species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_manual(values = c("#7451ff","#a691ff","#d6ceff")) +
  scale_shape_manual(values = c("triangle", "square", "circle")) +
    labs(
       title = "Bill dimensions for Palmer Penguins",
        x = "Bill length (mm)", y = "Bill depth (mm)") 

Box Plot (for multiple variables)

penguins %>% 
  group_by(species) %>% 
  na.omit() %>% 
  ggplot(aes(x=sex, 
             y=body_mass_g,
             color=species, 
             fill=species))+
  geom_boxplot(alpha=0.5) +
  scale_color_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
  scale_fill_manual(values = c("Adelie" = "#fdb7ff", "Chinstrap" = "#ff74c3", "Gentoo" = "#ff006f")) +
  labs(
   title = "Body Mass per Sex",
    x = "Sex", y = "Body Mass (g)") 

penguins %>% 
  group_by(species) %>% 
  na.omit() %>% 
  ggplot(aes(x=species, 
             y=body_mass_g,
             color=sex, 
             fill=sex))+
  geom_boxplot(alpha=0.5) +
  scale_color_manual(values = c("male" = "#60d5ff", "female" = "#ff9bf5")) +
  scale_fill_manual(values = c("male" = "#60d5ff", "female" = "#ff9bf5")) +
  labs(
   title = "Body Mass per Sex",
   subtitle = "Inverting Groups",
    x = "Species", y = "Body Mass (g)") 

For Visualising Distributions

penguins %>% 
  na.omit() %>% 
  pivot_longer(bill_length_mm:body_mass_g, names_to = "trait") %>% 
  ggplot(aes(x=value,
         group=species,
         fill=species,
         color=species))+
  geom_density(alpha=0.5)+
  scale_color_manual(values = c("Adelie" = "#c6ff9b", "Chinstrap" = "#ffb987", "Gentoo" = "#ff9bf5")) +
  scale_fill_manual(values = c("Adelie" = "#c6ff9b", "Chinstrap" = "#ffb987", "Gentoo" = "#ff9bf5")) +
  facet_grid(~trait, scales = "free_x")

Choosing the right analysis

Graph 1 - Box Plot

  • Predictor variable (species) is categorical and outcome variable (sepal length) is numerical

  • There are more than 2 groups being tested - there are 3 different species

    = ANOVA
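A minimal sketch of a one-way ANOVA in base R. It assumes the graphs describe R's built-in `iris` data (species and sepal length), which the workbook does not state explicitly.

```r
# Assumption: the built-in iris data set is the one behind the graphs
fit_aov <- aov(Sepal.Length ~ Species, data = iris)

# One row for Species (between groups) and one for Residuals (within groups)
aov_table <- summary(fit_aov)
aov_table

# p-value for the Species effect
p_species <- aov_table[[1]][["Pr(>F)"]][1]
p_species
```

A small p-value here suggests that mean sepal length differs between at least two species; a post-hoc test such as `TukeyHSD(fit_aov)` would be one way to see which ones.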


Graph 2 - Density Plot

  • Predictor variable (species) is categorical and outcome variable (sepal length) is numerical

  • There are more than 2 groups being tested - there are 3 different species

    = ANOVA


Graph 3 - Scatter Plot

  • Predictor variable (species) is categorical and outcome variables (sepal length/width) are numerical

  • There are more than 2 groups being tested - there are 3 different species

  • There is more than one outcome variable

    = MANOVA
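A rough sketch of a MANOVA with two outcome variables, again assuming the built-in `iris` data is the data set in question.

```r
# Assumption: the built-in iris data set is the one behind the graphs
# cbind() combines the two numerical outcomes on the left-hand side
fit_manova <- manova(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris)

# The default multivariate test statistic is Pillai's trace
manova_summary <- summary(fit_manova)
manova_summary

# p-value for the Species effect
p_manova <- manova_summary$stats["Species", "Pr(>F)"]
p_manova
```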


Graph 4 - Histogram

  • Predictor variable (species) is categorical and outcome variable (size - big/small) is also categorical

    = Chi-Squared Test
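A sketch of a chi-squared test of independence on two categorical variables. Both the use of the built-in `iris` data and the big/small cut-off (the median sepal length) are assumptions made for illustration; the workbook does not say how size was defined.

```r
# Assumption: iris data, with "big"/"small" defined by the median sepal length
size <- ifelse(iris$Sepal.Length > median(iris$Sepal.Length), "big", "small")

# Contingency table of species by size category
counts <- table(species = iris$Species, size = size)
counts

# Test whether size category is independent of species
chi_test <- chisq.test(counts)
chi_test
```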