HC’s Workbook

Author

H. Cyril

Data Analysis with R

Using RStudio for Data Analysis with R.

Week One

Introduction to R on RStudio (basics of R). Installation and set-up of RStudio
Getting used to terminology, code structure, available options for different functions
Creation of workbook with Quarto using R
Using the published workbook as a space to add completed assignments and tasks

Getting introduced to the basics of R

This exercise involves the use of data on penguins included in the Palmer penguins package.

The first step is to load the packages and label the chunk for easy subsequent reference. Chunk behaviour is controlled by execution options rather than commands. Setting ‘echo’ to ‘true’ shows the source code of the chunk in the rendered output. Setting ‘include’ to ‘false’ still runs the chunk but keeps both its code and its results out of the output. Setting ‘output’ to ‘false’ hides the results of the chunk while the code can still be shown.
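As a minimal sketch of how these options are written, the chunk below uses a hypothetical label and made-up numbers (neither comes from this workbook). Quarto reads lines starting with ‘#|’ as options only when they are the first lines of the chunk; run as plain R, they are ordinary comments.

```r
#| label: chunk-options-demo
#| echo: true
#| output: false

# 'chunk-options-demo' and the values below are illustrative assumptions
flipper_mm <- c(181, 186, 195)
mean_flipper <- mean(flipper_mm)
mean_flipper
```

With these settings the rendered page would show the code but not the printed mean.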

The library() calls load the tidyverse and palmerpenguins packages. Typing the dataset name prints the data below the code.

library(tidyverse)
library(palmerpenguins)

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Viewing a summary of the penguin data

This chunk of R code is labelled ‘summary-penguins’, as the function ‘glimpse()’ provides a brief overview of the penguins data. The pipe operator ‘%>%’ passes the data on its left to the function on its right as that function's first argument. In RStudio, the pipe can be inserted using the shortcut ‘ctrl+shift+m’.

penguins %>% 
  glimpse()

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
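To make the role of the pipe concrete, here is a small sketch using a made-up data frame (the `birds` object and its values are assumptions for illustration, not part of the penguins data). It shows that a piped call and the equivalent direct call give the same result.

```r
library(dplyr)

# Made-up data, used only for illustration
birds <- data.frame(id = 1:3, body_mass_g = c(3750, 3800, 3250))

# The pipe passes 'birds' as the first argument of summarise()
piped <- birds %>% summarise(mean_mass = mean(body_mass_g))

# The same call written without the pipe
direct <- summarise(birds, mean_mass = mean(body_mass_g))

identical(piped, direct)
```

The piped form reads from left to right, which is why it is preferred once several steps are chained.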

Penguins of the Palmer Archipelago

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.

Week Two

Generating a scatter plot displaying the relationship between bill length and bill depth

The scatter plot below shows the relationship between bill length and bill depth of these penguins, with bill length in mm along the x-axis and bill depth in mm along the y-axis. Adelie penguins are indicated with filled seagreen circles, Chinstrap penguins with filled blue triangles, and Gentoo penguins with filled pink squares; the colours are assigned in the order of the species factor levels. The title and subtitle of the scatter plot are specified using the arguments ‘title’ and ‘subtitle’ of ‘labs()’, respectively.

ggplot(penguins, 
       aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("seagreen","blue","pink")) +
  labs(
    title = "Bill Length and Depth",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Bill Length (mm)", y = "Bill Depth (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
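Since the same three colours are reused for every plot in this workbook, one option is to store them once as a named vector. The sketch below assumes made-up measurements and hypothetical object names (`species_colours`, `toy_penguins`, `toy_plot`); with a named vector, ggplot2 matches each colour to a species by name rather than by position.

```r
library(ggplot2)

# Named palette: each colour is tied to a species by name, not by position
species_colours <- c(Adelie = "seagreen", Chinstrap = "blue", Gentoo = "pink")

# Made-up measurements, used only to keep the sketch self-contained
toy_penguins <- data.frame(
  species = c("Gentoo", "Adelie", "Chinstrap"),
  bill_length_mm = c(47.5, 39.1, 48.8),
  bill_depth_mm = c(15.0, 18.7, 18.4)
)

toy_plot <- ggplot(toy_penguins,
                   aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = species_colours)

toy_plot
```

The same vector could be passed to scale_fill_manual() in later plots, so a change of colour only needs to be made in one place.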

Generating a histogram exploring the frequency of flipper lengths in the Palmer Archipelago penguins

It is important to label chunks of R code accordingly, as mentioned previously.

The following plot is a histogram showing the frequency of different flipper lengths among the penguins observed in the Palmer Archipelago. The Adelie penguin is indicated in seagreen, the Chinstrap penguin in blue, and the Gentoo penguin in pink.

flipper_hist <- ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(fill = species), 
                 alpha = 0.5, 
                 position = "identity") +
  scale_fill_manual(values = c("seagreen","blue","pink")) +
  labs(x = "Flipper length (mm)",
       y = "Frequency",
       title = "Frequency of Penguin flipper lengths",
       subtitle = "Dimensions for penguins at Palmer Station LTER")

flipper_hist

Visualising relative body mass of the three penguin species

The following chunk of R code generates a boxplot comparing the body mass of the three species. Most of the R code in this chunk was taken from the code for ‘Boxplots’ on the Week 3 slides in the course room for ‘Research Methods and Data Analysis’.

However, functions were added to change the variables studied and to match the colour scheme used since the beginning of the document. These are: ‘scale_color_manual(values = c("seagreen","blue","pink")) + scale_fill_manual(values = c("seagreen","blue","pink"))’

The labels and title of the boxplot were added using ‘labs(x = "Species", y = "Body mass (g)", title = "Relative body mass of the three penguin species")’

#| label: boxplot-bodymass-species
#| warning: false
#| echo: true
# Quarto applies '#|' options only when they are the first lines of the chunk

library(tidyverse)
library(palmerpenguins)

data("penguins")
penguins %>% 
  ggplot(aes(x = species, 
             y = body_mass_g, 
             color = species, 
             fill = species)) +
  geom_boxplot(alpha = 0.5) + 
  scale_color_manual(values = c("seagreen","blue","pink")) +
  scale_fill_manual(values = c("seagreen","blue","pink")) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 12)) +
  labs(x = "Species",
       y = "Body mass (g)",
       title = "Relative body mass of the three penguin species",
       subtitle = "Dimensions for penguins at Palmer Station LTER")

Visualising observed occurrence of the three penguin species on the three islands

The following R code chunk generates a bar graph showing how many times each of the three penguin species was observed on each of the three islands. As with the previous graph, the R code for this bar graph was taken from the slides available in the learning room, with the colour scheme changed to match the rest of this document using the functions mentioned in the previous section.

#| label: bargraph-penguins-islands
#| warning: false
#| echo: true
# Quarto applies '#|' options only when they are the first lines of the chunk

library(tidyverse)
library(palmerpenguins)

penguins %>% 
  ggplot(aes(x = island,
             color = species, 
             fill = species)) +
  geom_bar() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 12)) +
  scale_color_manual(values = c("seagreen","blue","pink")) +
  scale_fill_manual(values = c("seagreen","blue","pink")) + 
  labs(x = "Island",
       y = "Number of observations",
       title = "Observations of the three penguin species on the three islands",
       subtitle = "Dimensions for penguins at Palmer Station LTER")

Body mass and flipper length

The scatter plot below shows the correlation between body mass in g and flipper length in mm for the three penguin species observed in the Palmer Archipelago. It was adapted from the R code chunk on the slide for ‘Body mass per sex’ in the course learning room on NOW, with changes to the variables studied and to the geom (points rather than boxes, since both variables are continuous), the colour scheme as previously mentioned, and labels, title and subtitle added.

penguins %>% 
  na.omit() %>%   # drops rows with missing values in any column
  ggplot(aes(x = body_mass_g, 
             y = flipper_length_mm,
             color = species)) +
  geom_point(alpha = 0.7) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 12)) +
  scale_color_manual(values = c("seagreen","blue","pink")) +
  labs(x = "Body mass (g)",
       y = "Flipper length (mm)",
       title = "Correlation between Body Mass and Flipper Length",
       subtitle = "Dimensions for penguins at Palmer Station LTER")

Inserting images

The following image is that of an Adelie Penguin.

The following image shows a Chinstrap Penguin.

The following image is that of a Gentoo Penguin.

These images were downloaded from Wikimedia Commons, saved to the folder ‘Pictures’ within the working directory ‘RStudio’ in New Volume (E:) on my computer, and embedded using the ‘Insert -> Image’ option available in Visual mode.

Embedding a video

The following video is of Chinstrap and Gentoo penguins recorded in Antarctica. It was embedded by copying the embed code from the YouTube page of the video and pasting it below. Rendering the Quarto file produces the html file with the embedded video in it.

[Embedded video: Chinstrap and Gentoo penguins in Antarctica]

Week Three

Data Wrangling - Lecture - 8/10/2024

This week, we were introduced to a few more basic concepts in R. A very important point to remember is to always ensure reproducibility, which is an important aspect of science. We learnt what data wrangling is and why it matters, and were shown how to load data sets and use a few basic functions such as str(), mutate(), ifelse(), summarise() and so on. Furthermore, we learnt the purpose of the pipe operator %>%, and got to know a few packages besides tidyverse, such as magrittr, plyr and tidyr. We also learnt about the different types of variables and the correct format of data for analyses using R.
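A short sketch of three of the functions mentioned above, using a made-up data frame (the `birds` object, its values and the 4000 g threshold are assumptions for illustration only). str() shows the structure of the data, while mutate() together with ifelse() derives a new column from an existing one.

```r
library(dplyr)

# Made-up data with one missing value, used only for illustration
birds <- data.frame(id = 1:3, body_mass_g = c(3750, NA, 4675))

str(birds)  # prints the type and first values of each column

# ifelse() is applied to every row; a missing body mass gives a missing size
birds_sized <- birds %>%
  mutate(size = ifelse(body_mass_g > 4000, "large", "small"))

birds_sized
```

Keeping these steps in code rather than editing the data by hand is what makes the wrangling reproducible.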

Data Wrangling - Formative Exercises - 10/10/2024

Section 6.6.1 - “R for Graduate Students”

For the formative exercises, we follow the book “R for Graduate Students” by Wendy Huynh. Section 6.6.1 lists problems for which the code is provided. Our task is to annotate each line of the code with the purpose of each function used. The package tidyverse is loaded. Each chunk of R code is labelled accordingly for easy future reference.

library(tidyverse)

Problem A

midwest %>%             # starts from the midwest dataset
  group_by(state) %>%     # groups the data by state
  summarize(poptotalmean = mean(poptotal),   # one summary row per state; mean of total population
            poptotalmed = median(poptotal),  # median of total population
            popmax = max(poptotal),     # maximum value
            popmin = min(poptotal),     # minimum value
            popdistinct = n_distinct(poptotal), # no. of distinct population values
            popfirst = first(poptotal),   # first value within each state
            popany = any(poptotal < 5000), # TRUE if any county has a population below 5000
            popany2 = any(poptotal > 2000000)) %>% # TRUE if any county has a population above 2000000
  ungroup()         # ungroups the data

# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>

Problem B

midwest %>%     # starts from the midwest dataset
  group_by(state) %>%  # groups by state
  summarize(num5k = sum(poptotal < 5000), # one summary row per state; counts the counties with a total population below 5000
            num2mil = sum(poptotal > 2000000), # counts the counties with a total population above 2000000
            numrows = n()) %>%  # returns no. of rows (counties) for each state
  ungroup()       # ungroups data

# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72
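The counts in Problem B rely on the fact that sum() treats TRUE as 1 and FALSE as 0. A minimal base R sketch, using four population values taken from the popmin, popfirst and popmax columns of the Problem A output (the vector itself and the object names are assumptions for illustration):

```r
# Four county populations picked from the Problem A output
poptotal <- c(4373, 66090, 1701, 2111687)

# A comparison returns one TRUE/FALSE per county
poptotal < 5000

# Summing the logical vector counts the TRUE values
n_below_5k <- sum(poptotal < 5000)
n_above_2mil <- sum(poptotal > 2000000)

n_below_5k
n_above_2mil
```

So num5k and num2mil are counts of counties, not sums of populations.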

Problem C

part I

midwest %>%  # starts from the midwest dataset
  group_by(county) %>%  # groups by county name
  summarize(x = n_distinct(state)) %>%  # one summary row per county name; x is the no. of distinct states in which that name occurs
  arrange(desc(x)) %>%    # sorts the summary in descending order of x
  ungroup()     # ungroups data

# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows

part II

How does n() differ from n_distinct()? When would they be the same? different?

midwest %>%     # starts from the midwest dataset
  group_by(county) %>%    # groups by county name
  summarize(x = n()) %>%    # one summary row per county name; x is the total no. of rows in the group, whereas n_distinct() counts only the distinct values of a vector
  ungroup()   # ungroups data

# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows

              # n() and n_distinct() give the same result when no value is repeated within a group, eg, when each county name occurs at most once per state
              # n() and n_distinct() differ when a group contains repeated values, because n() counts every row
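The difference can be seen on a deliberately small, made-up table (the `counties` data frame below is hypothetical and contains a repeated county and state pair that does not occur in midwest):

```r
library(dplyr)

# Hypothetical data: ADAMS appears three times, twice in the same state
counties <- data.frame(
  county = c("ADAMS", "ADAMS", "ADAMS", "ALCONA"),
  state  = c("IL", "IN", "IN", "MI")
)

counts <- counties %>%
  group_by(county) %>%
  summarize(rows = n(),                  # every row in the group
            states = n_distinct(state)) %>%  # only the distinct states
  ungroup()

counts
```

For ADAMS the two columns differ (3 rows, 2 distinct states), while for ALCONA they agree (1 and 1).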

part III

hint:
- How many distinctly different counties are there for each county?
- Can there be more than 1 (county) county in each county?
- What if we replace ‘county’ with ‘state’?

midwest %>%     # calls midwest database
  group_by(county) %>%    # groups by county
  summarize(x = n_distinct(county)) %>%     # provides one-row summary, variable x provides no of distinct values for each county
  ungroup()         # ungroups data

# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# ℹ 310 more rows

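                    # x is 1 for every row: each group is defined by a single county name, so it contains exactly one distinct county
                    # there cannot be more than one distinct county name within a county group
                    # replacing 'county' with 'state' inside n_distinct() would give the no. of distinct states in which each county name occurs, as in part I; replacing it in both places would again give 1 for every state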

Problem D

diamonds %>%  # calls diamonds database
  group_by(clarity) %>%  # groups by clarity
  summarize(a = n_distinct(color),  # one summary row per clarity; a is the no. of distinct color values
            b = n_distinct(price),  # variable b gives no. of distinct price values
            c = n()) %>%  # variable c gives no. of total observations
  ungroup()               # ungroups data

# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem E

part I

diamonds %>%    # calls diamonds database
  group_by(color, cut) %>%    # groups by color and then cut, in order
  summarize(m = mean(price),    # one summary row per combination of color and cut; m is the mean price
            s = sd(price)) %>%    # returns standard deviation of price
  ungroup()               # ungroups data

# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

part II

diamonds %>%    # calls diamonds database
  group_by(cut, color) %>%    # groups by cut and then color, in order
  summarize(m = mean(price),    # one summary row per combination of cut and color; m is the mean price
            s = sd(price)) %>%    # variable s returns standard deviation of price
  ungroup()     # ungroups data

# A tibble: 35 × 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# ℹ 25 more rows

part III

hint:
- How good is the sale if the price of diamonds equaled msale?
- e.x. The diamonds are x% off original price in msale.

diamonds %>%    # calls diamonds database
  group_by(cut, color, clarity) %>%     # groups by cut, color and clarity, in that order
  summarize(m = mean(price),    # one summary row per combination of cut, color and clarity; m is the mean price
            s = sd(price),    # s is the standard deviation of price
            msale = m * 0.80) %>%     # msale is 80% of the mean price, ie, the mean price after a 20% discount
  ungroup()   # ungroups data

# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows

              # if the price of the diamonds equaled msale, they would be 20% off the original mean price, since msale is 80% of m
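The size of the discount can be checked with a line of arithmetic. The sketch below takes the mean price of the first group in the output above (7383) and uses hypothetical object names:

```r
m <- 7383            # mean price of the first group in the output above
msale <- m * 0.80    # 80% of the mean price

# Percentage taken off the mean price
discount_pct <- (m - msale) / m * 100

round(msale, 1)
round(discount_pct)
```

Multiplying by 0.80 keeps 80% of the price, so the discount is the remaining 20%.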

Problem F

diamonds %>%    # calls diamonds data set
  group_by(cut) %>%     # groups diamonds by cut
  summarize(potato = mean(depth), # one summary row per cut; 'potato' is the mean depth
            pizza = mean(price), # 'pizza' is the mean price
            popcorn = median(y), # 'popcorn' is the median width (y) in mm
            pineapple = potato - pizza, # 'pineapple' is the difference between mean depth and mean price
            papaya = pineapple ^ 2,   # 'papaya' is pineapple squared, ie, the squared difference between mean depth and mean price
            peach = n()) %>%    # 'peach' is the no. of observations
  ungroup()   # ungroups data

# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem G

part I

diamonds %>%    # calls diamonds database
  group_by(color) %>%     # groups by color
  summarize(m = mean(price)) %>%    # one summary row per color; m is the mean price
  mutate(x1 = str_c("Diamond color ", color), # mutate() adds a new variable x1, which joins the text "Diamond color " to each value of color
         x2 = 5) %>%    # x2 assigns the value 5 to every row
  ungroup()     # ungroups data

# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5

part II

What does the first ungroup() do? Is it useful here? Why/why not? Why isn’t there a closing ungroup() after the mutate()?

diamonds %>%    # calls diamonds dataset
  group_by(color) %>%     # groups by color
  summarize(m = mean(price)) %>%    # one summary row per color; m is the mean price of the diamonds of that color
  ungroup() %>%     # removes any remaining grouping; summarize() has already dropped the grouping by color here
  mutate(x1 = str_c("Diamond color ", color),   # mutate() creates two variables x1 and x2, as described previously
         x2 = 5)

# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5

         # the first ungroup() makes sure the data are ungrouped before mutate() is executed; it is not needed here, because summarize() already drops the only grouping variable (color)
         # there is no need for a closing ungroup() after mutate() because the data are already ungrouped at that point
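Whether an ungroup() is still needed can be checked with group_vars(), which lists the grouping variables of a table. The sketch below uses a made-up data frame (`sales` and its values are assumptions) and assumes dplyr 1.0.0 or later, where summarize() drops the last grouping level by default and accepts a `.groups` argument.

```r
library(dplyr)

# Made-up data, used only for illustration
sales <- data.frame(
  color = c("D", "D", "E", "E"),
  cut   = c("Fair", "Good", "Fair", "Good"),
  price = c(100, 200, 300, 400)
)

# One grouping variable: summarize() returns an ungrouped table
one_key <- sales %>%
  group_by(color) %>%
  summarize(m = mean(price))

# Two grouping variables: only the last one (cut) is dropped
two_keys <- sales %>%
  group_by(color, cut) %>%
  summarize(m = mean(price), .groups = "drop_last")

group_vars(one_key)
group_vars(two_keys)
```

In the second case the result is still grouped by color, which is when a closing ungroup() makes a real difference.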

Problem H

part I

diamonds %>%    # calls the diamonds database
  group_by(color) %>%     # groups by color
  mutate(x1 = price * 0.5) %>%    # creates new variable x1, which is half the price of each diamond
  summarize(m = mean(x1)) %>% # one summary row per color; m is the mean of the half prices for that color
  ungroup()     # ungroups data

# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.

part II

What’s the difference between part I and II?

diamonds %>%    # calls the diamonds database
  group_by(color) %>%     # groups by color
  mutate(x1 = price * 0.5) %>%    # mutate() creates new variable x1, which is half the price of each diamond
  ungroup() %>%  # ungroups data grouped earlier by color
  summarize(m = mean(x1))       # creates a one-row summary; m is the mean of the half prices across all diamonds

# A tibble: 1 × 1
      m
  <dbl>
1 1966.

    # part I gives the mean half price for every color group, whereas part II gives the mean half price for all the diamonds in total
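The effect of the position of ungroup() can be reproduced on three made-up rows (the `gems` data frame and its prices are assumptions for illustration):

```r
library(dplyr)

# Made-up data: two diamonds of color D and one of color E
gems <- data.frame(color = c("D", "D", "E"), price = c(100, 300, 1000))

# Summarizing while still grouped: one mean per color
by_color <- gems %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  summarize(m = mean(x1))

# Ungrouping first: a single mean over all rows
overall <- gems %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  ungroup() %>%
  summarize(m = mean(x1))

by_color
overall
```

The grouped version returns two rows (100 for D and 500 for E), while the ungrouped version returns one row with the mean of all three half prices.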

Data Wrangling - Formative Exercises - 10/10/2024

Section 6.7 (Extra Practice) - “R for Graduate Students”

This exercise involves writing our own code chunks for each task with our own comments. The first step is to ensure the tidyverse package is loaded. Each chunk of R code is labelled accordingly for easy future reference.

library(tidyverse)   # loads the tidyverse package

1. View all of the variable names in diamonds (hint: View()).

#| label: view-diamonds
library(tidyverse)

view(diamonds)   # opens the object in a new tab in RStudio; not visible in the rendered Quarto html file

names(diamonds)   # quick view of all variable names in the diamonds dataset

 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
 [8] "x"       "y"       "z"

2. Arrange the diamonds by:

Lowest to highest price (hint: arrange())
Highest to lowest price (hint: arrange(), desc())
Lowest price and cut
highest price and cut

- arranging diamonds by lowest to highest price

diamonds %>% arrange(price) # arranges diamonds by price in ascending order, ie, from the lowest to highest values. The pipe symbol takes the output of the function to its left as the input for the operation of the function on its right

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

- arranging diamonds from highest to lowest by price

diamonds %>% arrange(desc(price))  # arranges diamonds in descending order of price, from the highest to lowest values

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows

- arranging diamonds by lowest price and cut

diamonds %>% 
  arrange(cut, price) # arranges diamonds in ascending order by cut, and by ascending price within each cut

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
# ℹ 53,930 more rows

- arranging diamonds by highest price and cut

diamonds %>% 
  arrange(desc(cut), desc(price)) # arranges diamonds in descending order by cut, and by descending price within each cut

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.51 Ideal G     IF       61.7  55   18806  7.37  7.41  4.56
 2  2.07 Ideal G     SI2      62.5  55   18804  8.2   8.13  5.11
 3  2.15 Ideal G     SI2      62.6  54   18791  8.29  8.35  5.21
 4  2.05 Ideal G     SI1      61.9  57   18787  8.1   8.16  5.03
 5  1.6  Ideal F     VS1      62    56   18780  7.47  7.52  4.65
 6  2.06 Ideal I     VS2      62.2  55   18779  8.15  8.19  5.08
 7  1.71 Ideal G     VVS2     62.1  55   18768  7.66  7.63  4.75
 8  2.08 Ideal H     SI1      58.7  60   18760  8.36  8.4   4.92
 9  2.03 Ideal G     SI1      60    55.8 18757  8.17  8.3   4.95
10  2.61 Ideal I     SI2      62.1  56   18756  8.85  8.73  5.46
# ℹ 53,930 more rows
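When sorting by two columns, the first column given to arrange() is the primary sort key and the second only breaks ties. A small sketch with made-up data (the `gems` data frame is an assumption for illustration) compares a single two-column call with two chained single-column calls:

```r
library(dplyr)

# Made-up data with an ordered cut, as in diamonds
gems <- data.frame(
  cut = factor(c("Good", "Fair", "Good", "Fair"),
               levels = c("Fair", "Good"), ordered = TRUE),
  price = c(400, 900, 350, 500)
)

# One call: cut is the primary key, price breaks ties within a cut
sorted <- gems %>% arrange(cut, price)

# Two chained calls: the last arrange() decides the primary key
chained <- gems %>% arrange(price) %>% arrange(cut)

sorted
chained
```

Both give the same order here, but the single call states the intended priority of the keys directly.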

3. Arrange the diamonds by lowest to highest price and worst to best clarity.

diamonds %>% 
  arrange(price, clarity) # arranges diamonds from lowest to highest price; diamonds with the same price are ordered from worst to best clarity

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

4. Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond (hint: mutate()).

diamonds %>% 
  mutate(salePrice = price - 250) # mutate() creates new variable "salePrice", which reflects discount of $250 off of original price

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z salePrice
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
# ℹ 53,930 more rows

5. Remove the x, y, and z variables from the diamonds dataset (hint: select()).

diamonds %>% select(-x, -y, -z) # select() helps remove/retain desired variables. In this case, it removes the x, y and z variables and retains the others

# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

6. Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).

diamonds %>% 
  group_by(cut) %>%   # groups diamonds by cut
  summarise(n = n()) %>%  # produces one row per cut value with the number of observations, ie, the number of diamonds of that cut
  ungroup()    # ungroups data

# A tibble: 5 × 2
  cut           n
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

7. Create a new column named totalNum that calculates the total number of diamonds.

diamonds %>% 
  mutate(totalNum = n()) # creates a new column, totalNum, which gives the total number of diamonds

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z totalNum
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <int>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    53940
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    53940
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    53940
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    53940
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    53940
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    53940
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47    53940
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53    53940
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49    53940
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39    53940
# ℹ 53,930 more rows
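
The value returned by n() depends on the grouping in force: without grouping it counts every row, while after group_by() it counts the rows in each group. The sketch below contrasts the two; the column name cutNum and the object name diamonds_counts are hypothetical and not part of the exercise.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

diamonds_counts <- diamonds %>%
  mutate(totalNum = n()) %>%   # ungrouped: n() is the number of rows in the whole dataset
  group_by(cut) %>%
  mutate(cutNum = n()) %>%     # grouped: n() is the number of rows sharing the same cut
  ungroup()                    # removes the grouping so later steps are not affected

diamonds_counts %>%
  select(cut, totalNum, cutNum)
```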

Research Methods - Formative Exercise - 11/10/2024

Develop a good question and a bad question based on the diamonds dataset using the principles discussed previously.

Bad Question: Do all variables influence the price of diamonds?

Good Question: Which combination of variables is associated with the highest sale price of a diamond, and what values do those variables take for the highest-priced diamonds?

Week Four

Data Exploration - Lecture - 15/10/2024

During this lecture, we were introduced to the importance of framing good questions and understanding the data available for analysis. We were also made aware of the types of variables and the different ways to visualise data, such as bar charts, histograms and scatter plots. We also learnt how to visualise correlations and check distributions. In addition, we learnt about measures of central tendency and dispersion.

Data Exploration - Formative Exercise - 16/10/2024

For this week’s formative exercise, we follow the video tutorial “Visualization with R in 36 minutes.” The task is to follow the tutorial and reproduce the graphs explained in the video, organised in our own way.

The first step is to load the “tidyverse” package, which includes the ggplot2 package, necessary for data exploration.

library(tidyverse)

Second, the “modeldata” package is loaded (it must be installed beforehand). It includes the “crickets” dataset, which is used for the rest of the exercise.

library(modeldata)

The next step is to take a look at the crickets dataset, which we will work with.

view(crickets) # view object in a new tab on RStudio, not visible in rendered Quarto html file

This dataset has 31 observations, each with values for temperature and chirping rate, for 2 species of crickets, O. exclamationis and O. niveus.
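
Because view() does not show up in the rendered file, a quick structural check can be printed instead. The following is a minimal sketch using base R functions; the object names n_obs and species_counts are introduced only for illustration.

```r
library(modeldata)  # provides the crickets dataset

n_obs <- nrow(crickets)                    # number of observations
species_counts <- table(crickets$species)  # number of observations per species

n_obs
species_counts
str(crickets)  # column names and types
```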

Deciding on which graphic to use for visualising a data set depends on the variables in question, namely, the type and number of variables. If there is only a single variable of interest, the most used plots are as follows:
- single categorical variable: bar chart
- single quantitative variable: histogram

If there are two variables for analysis, the commonly used plots are:
- both categorical: grouped bar chart
- both quantitative: scatterplot
- one categorical and one quantitative: box plot

1. For one variable

1.1. Single categorical variable

1.1.1. Generating a bar chart for a single categorical variable.

A bar chart is the default way of representing the frequency/count of a single categorical variable. The following code generates a bar chart of the count of individuals in each species of cricket in the data set. This is accomplished using the geom_bar() function.

ggplot(crickets, aes(x = species, #crickets dataset loaded, species variable mapped to the x-axis
                     fill = species)) + # color of bars set according to species
  geom_bar() + # specifies bar chart
  labs(x = "Species", # label assigned according to variable on x-axis
       y = "No. of individuals", # label assigned for y-axis
       fill = "Species", # label provided for legend (the legend belongs to the fill aesthetic)
       title = "Number of individuals of each cricket species") + # suitable title also specified
  scale_fill_brewer(palette = "Dark2") # increases color contrast for improved accessibility

Differentiating the species by bar colour, in the legend and along the x-axis is redundant. The legend can therefore be removed by setting the argument show.legend = FALSE inside geom_bar().

ggplot(crickets, aes(x = species, #crickets dataset loaded, species variable mapped to the x-axis
                     fill = species)) + # color of bars set according to species
  geom_bar(show.legend = FALSE) + # specifies bar chart, removes legend
  labs(x = "Species", # label assigned according to variable on x-axis
       y = "No. of individuals", # label assigned for y-axis
       title = "Number of individuals of each cricket species") + # suitable title also specified
  scale_fill_brewer(palette = "Dark2") # increases color contrast for improved accessibility

1.2. Single quantitative variable

1.2.1. Generating a histogram of chirp rates

Histograms constitute the most common way of visualising data with a single quantitative variable, with frequency along the y-axis. The geom_histogram() function generates a histogram with a default of 30 bins, which is too many for a dataset of only 31 observations and gives a poor picture of the distribution.

ggplot(crickets, aes(x = rate)) + # database loaded, single variable mapped to x-axis
  geom_histogram() + # specifies histogram, with default bin number
  labs(x = "Chirp rate", # label assigned to x-axis
       y = "Frequency", # label assigned to y-axis
       title = "Frequency of chirp rates") # suitable title specified

It is good data exploration practice to experiment with different numbers of bins, because the apparent shape of the distribution changes between a large number of narrow bins and a small number of wide ones.

ggplot(crickets, aes(x = rate)) + # database loaded, single variable mapped to x-axis
  geom_histogram(bins = 15) + # specifies histogram, with bin number = 15 
  labs(x = "Chirp rate", # label assigned to x-axis
       y = "Frequency", # label assigned to y-axis
       title = "Frequency of chirp rates") # suitable title specified

Bin size can also be controlled with the binwidth argument, which sets the width of each bin in the units of the variable instead of fixing the number of bins.

ggplot(crickets, aes(x = rate)) + # database loaded, single variable mapped to x-axis
  geom_histogram(binwidth = 5)  + # specifies histogram, with desired binwidth
  labs(x = "Chirp rate", # label assigned to x-axis
       y = "Frequency", # label assigned to y-axis
       title = "Frequency of chirp rates") # suitable title specified
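
The bins and binwidth arguments describe the same thing from two directions. As a rough sketch, the number of bins implied by a given binwidth can be estimated from the range of the data. The object names rate_range, binwidth and approx_bins are introduced here for illustration, and the exact number of bars drawn by geom_histogram() may differ by one depending on where the bin boundaries fall.

```r
library(modeldata)  # provides the crickets dataset

rate_range <- range(crickets$rate)  # smallest and largest chirp rate
binwidth <- 5                       # same width as in the histogram above

# approximate number of bins needed to cover the range at this width
approx_bins <- ceiling(diff(rate_range) / binwidth)
approx_bins
```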

The same distribution can also be represented as a frequency polygon using the geom_freqpoly() function, with chirp rate along the x-axis, just as in the histogram.

ggplot(crickets, aes(x = rate)) + # database loaded, single variable mapped to x-axis
  geom_freqpoly(bins = 15) + # generates frequency polygon
  labs(x = "Chirp rate", # label assigned to x-axis
       y = "Frequency", # label assigned to y-axis
       title = "Frequency of chirp rates") # suitable title specified

1.2.2. Faceting

Faceting splits a chart into multiple smaller panels, each of which displays a different subset of the data. By default, the panels share the same scales. Faceting is useful when a single histogram represents data for several groups, for example, 2 species. As the bins for both species are stacked one over the other, such a graphic is hard to read, especially if the number of categories increases beyond 2, as evident in the output of the following R code chunk.

ggplot(crickets, aes(x = rate, # crickets dataset, rate mapped to x-axis
                     fill = species)) + # bin color according to species
  geom_histogram(bins = 15) + # generates histogram with 15 bins
  labs(x = "Chirp rate", # label assigned according to variable on x-axis
       y = "No. of individuals", # label assigned for y-axis
       fill = "Species", # label provided for legend (the legend belongs to the fill aesthetic)
       title = "Chirp rates of the cricket species") + # suitable title also specified
  scale_fill_brewer(palette = "Dark2") # increases contrast for improved visibility

A better way to resolve this, rather than creating a separate dataset for each species, which is long-winded, is to use the function facet_wrap(), which draws a separate panel for each species. The scales of the panels are the same, so it is easier to compare the two plots and make reasonable, valid deductions from them.

ggplot(crickets, aes(x = rate, # crickets dataset, rate mapped to x-axis
                     fill = species)) + # bin color according to species
  geom_histogram(bins = 15, # generates histogram with 15 bins
                 show.legend = FALSE) + # removes legend
  facet_wrap(~species) + # wraps by species
  labs(x = "Chirp rate", # label assigned according to variable on x-axis
       y = "No. of individuals", # label assigned for y-axis
       title = "Chirp rates of individual species") + # suitable title also specified
  scale_fill_brewer(palette = "Dark2") # increases contrast for improved visibility

A better way of visualising the above plots is to arrange them vertically, so that the difference in chirp rates between the two species becomes more apparent. This is done using the following R code chunk.

ggplot(crickets, aes(x = rate, # crickets dataset, rate mapped to x-axis
                     fill = species)) + # bin color according to species
  geom_histogram(bins = 15, # generates histogram with 15 bins
                 show.legend = FALSE) + # removes legend
  facet_wrap(~species, # wraps by species
             ncol = 1) + # specifies no. of columns to be considered
  labs(x = "Chirp rate", # label assigned according to variable on x-axis
       y = "No. of individuals", # label assigned for y-axis
       title = "Chirp rate of individual species") + # suitable title also specified
  scale_fill_brewer(palette = "Dark2") + # increases contrast for improved visibility
  theme_minimal() # removes grey default background
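
The panels above share the same axes, which is what makes them directly comparable. For contrast, the sketch below lets each panel have its own y-axis through the scales argument of facet_wrap(). This variant is not part of the tutorial, and the object name p_free is introduced only for illustration.

```r
library(ggplot2)
library(modeldata)  # provides the crickets dataset

p_free <- ggplot(crickets, aes(x = rate, fill = species)) +
  geom_histogram(bins = 15, show.legend = FALSE) +
  facet_wrap(~species, ncol = 1, scales = "free_y") +  # y-axis scaled separately for each panel
  labs(x = "Chirp rate", y = "No. of individuals") +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()

p_free
```

Free scales make small groups easier to read, but bar heights can no longer be compared across panels by eye, so shared scales remain the safer choice for comparing the two species.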

2. For two variables

2.1. For two quantitative variables

2.1.1a. Creating a scatter plot of chirping rate and temperature

This scatter plot uses the crickets dataset, with temperature on the x-axis and chirping rate on the y-axis. This is the most commonly used way of visualising the relationship between two quantitative variables in a dataset. The geom_point() function generates scatter plots.

# basic scatter plot of temperature vs chirping rate
ggplot(crickets, aes(x = temp,  # crickets database, temp along x-axis
                     y = rate,  # chirp rate along y-axis
                     color = species)) + # color of points according to species
  geom_point() +  # scatter plot is specified using this command
  labs(x = "Temperature", # labels assigned according to variables on each axis
       y = "Chirp rate",
       color = "Species", # label provided for legend
       title = "Chirping in crickets") + # suitable title also specified
  scale_color_brewer(palette = "Dark2") # increases contrast for improved accessibility

2.1.1b. Adding a regression line to the existing scatter plot

The function geom_smooth() fits a smoother, or curve of best fit, over the data. With its default settings, however, it produces a flexible curve that follows the data but is hard to interpret as a statement about the relationship between the variables in question. The smoother is accompanied by a grey ribbon showing the confidence interval around the curve (95% by default), which gives a measure of uncertainty in the smoother.

ggplot(crickets, aes(x = temp,   # crickets database, temp along x-axis
                     y = rate)) +  # chirp rate along y-axis
  geom_point() + # scatter plot is specified using this command
  geom_smooth() + # fits a smoother with a grey confidence ribbon over the data
  labs(x = "Temperature", # labels assigned according to variables on each axis
       y = "Chirp rate",
       title = "Chirping in crickets") # suitable title also specified; no colour label or scale is needed, as colour is not mapped here

A regression line can be added by passing arguments to geom_smooth() that change the smoother. Setting the method argument to "lm" fits a straight line from a linear model, and the grey confidence ribbon can be removed by setting “se = FALSE.” Mapping color to species in aes() makes geom_smooth() fit a separate regression line for each species, which follows the data more closely than a single line for both species.

ggplot(crickets, aes(x = temp,  # crickets database, temp along x-axis
                     y = rate,  # chirp rate along y-axis
                     color = species)) + # color of points according to species
  geom_point() + # scatter plot is specified using this command
  geom_smooth(method = "lm", # specifies a linear model
              se = FALSE) +  # removes the grey confidence ribbon
  labs(x = "Temperature", # labels assigned according to variables on each axis
       y = "Chirp rate",
       color = "Species", # label provided for legend
       title = "Chirping in crickets") + # suitable title also specified
  scale_color_brewer(palette = "Dark2") # increases contrast for improved accessibility
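
The per-species lines drawn by geom_smooth(method = "lm") correspond to separate linear models of chirp rate on temperature, one per species. As a sketch of what is being fitted, the coefficients can be obtained directly with lm(); the object name species_fits is introduced here for illustration.

```r
library(modeldata)  # provides the crickets dataset

# fit rate ~ temp separately within each species,
# mirroring what geom_smooth(method = "lm") does for each colour group
species_fits <- lapply(
  split(crickets, crickets$species),
  function(d) coef(lm(rate ~ temp, data = d))
)

species_fits  # intercept and slope (change in chirp rate per unit temperature) for each species
```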

2.2. For two variables - one categorical, one quantitative

2.2.1. Creating a box plot for species vs chirp rate

This graphic is used when analysing the relationship between one quantitative and one categorical variable in a dataset. Box plots can be generated using the geom_boxplot() function. The following boxplots consider the cricket species and their respective chirp rates.

ggplot(crickets, aes(x = species, # crickets dataset, species along x-axis
                     y = rate, # chirp rate along y-axis
                     color = species)) + # boxplot color according to species
  geom_boxplot(show.legend = FALSE) + # creates box plot, hides legend
  labs(x = "Species", # label assigned according to variable on x-axis
       y = "Chirp rate", # label assigned for y-axis
       title = "Relationship between chirp rate and species") + # suitable title also specified
  scale_color_brewer(palette = "Dark2") # increased color contrast for better visibility

In the above plot, the grey background is provided by default. In order to improve aesthetics, this grey background can be removed using the theme_minimal() function, as follows.

ggplot(crickets, aes(x = species, # crickets dataset, species along x-axis
                     y = rate, # chirp rate along y-axis
                     color = species)) + # boxplot color according to species
  geom_boxplot(show.legend = FALSE) + # creates box plot, hides legend
  labs(x = "Species", # label assigned according to variable on x-axis
       y = "Chirp rate", # label assigned for y-axis
       title = "Relationship between chirp rate and species") + # suitable title also specified
  scale_color_brewer(palette = "Dark2") + # increased color contrast for better visibility
  theme_minimal() # removes the grey default background

Research Methods - Formative Exercise - 17/10/2024

What makes a good research hypothesis?

A good research hypothesis can only be put forth after an exhaustive, comprehensive review of existing scientific literature on the topic of interest. This is to identify any gaps in knowledge that the research hypothesis can help in addressing. The most important aspect of a good research hypothesis is that it should be testable and practical. It should also be falsifiable, that is, open to either support or rejection based on subsequent experiments. It should also state the concept to be studied clearly, so that subsequent experimentation and analysis can be guided properly. A good research hypothesis should be objective and based on facts and scientific evidence, and not on personal opinions and beliefs. Finally, it should be relevant to the research topic/area in question. Otherwise, it would misguide the entire research process following it.

Week Five

Choosing the right analysis - Lecture - 22/10/2024

This week’s lecture was on identifying the nature of variables, which extended into learning about the different types of statistical analyses, depending on the nature of the variables in question. We were introduced to frequency tests, mean tests, correlations and models. The second half of the lecture dealt with research hypotheses, the importance of hypotheses in science, the difference between scientific and statistical hypotheses and hypothetico-deductive reasoning.

Choosing the right analysis - Formative Exercise - 23/10/2024

The task for this week is to identify the statistical tests most appropriate for and relevant to the graphics provided on the NOW page for the module, under Week 5 - Post-session, and to reproduce the graphics using R code. The data set used for this exercise is “iris,” which is built into R.

data("iris")

This task is laid out as follows: the R code for each graphic comes first, followed by the graphic itself, and then the identification of suitable tests with an explanation, for better clarity.

library(tidyverse) # loading tidyverse for ggplot functions

1. Box plot

Reproducing the box plot of sepal length for the 3 Iris species, I. setosa, I. versicolor, and I. virginica.

ggplot(iris, aes(x = Species, # calls iris data set, assigns species to x-axis
                 y = Sepal.Length, # assigns sepal length to y-axis
                 color = Species)) + # generates legend
  geom_boxplot() + # generates boxplot
  labs(x = "Species", # labels added to x- and y-axes, title also added
       y = "Sepal Length",
       title = "Sepal length for the three Iris species")

Identification of suitable tests: The purpose of the above graphic appears to be to compare sepal length between the three Iris species. The variable along the x-axis, species, is categorical, and the variable along the y-axis, sepal length, is numerical and on a continuous scale. The data appear fairly normally distributed for I. setosa and I. versicolor, while the data for I. virginica appear slightly negatively skewed. However, we assume that normality is fulfilled, as the sample size is larger than 30 (n = 150 in total, 50 per species). Therefore, parametric tests are more suitable in this case than non-parametric tests. Furthermore, the study design is unpaired, as each individual belongs to only one species and is measured once. The number of groups (species) is 3 (more than 2), and therefore, considering all the above factors, a one-way ANOVA seems to be the most appropriate test for this graphic.
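
Although the task only asks for the test to be identified, a minimal sketch of how the one-way ANOVA could be run in base R is given below. The object names sepal_aov and anova_p are introduced for illustration, and the Shapiro-Wilk check on the residuals is an optional extra that was not part of the exercise.

```r
data("iris")

# one-way ANOVA: sepal length (continuous) explained by species (3 groups)
sepal_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(sepal_aov)

# p-value for the Species term, extracted from the ANOVA table
anova_p <- summary(sepal_aov)[[1]][["Pr(>F)"]][1]
anova_p

# optional check of the normality assumption on the model residuals
shapiro.test(residuals(sepal_aov))
```

A small p-value for the Species term indicates that at least one species differs in mean sepal length; a post-hoc test such as TukeyHSD(sepal_aov) would be needed to see which pairs differ.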

2. Density plot

Reproducing the density plot of Petal length for the 3 Iris species.

ggplot(iris, aes(x = Petal.Length, # calls iris data set, assigns petal length to x-axis
                 fill = Species)) + # fills area under each density plot with colour according to species
  geom_density(alpha = 0.3) + # generates density plot, with density plot transparency adjusted to match that of graphic on NOW page
  labs(x = "Petal Length", # labels added to x- and y-axes, title also added
       y = "Density",
       title = "Petal length for the three Iris species")

Identification of suitable tests: As a density plot is a representation of the frequency distribution of a numeric variable, it is safe to assume that the aim of this graphic is to compare the difference in petal length (and corresponding frequencies) between the three Iris species. The scale of the measured variable, petal length, is continuous, as it is a numerical variable (density, along the y-axis, is also numerical). As only a single measurement has been made for each individual, and each individual belongs to only one species, the study design can be considered unpaired. There are more than 2 groups (3 species). The data do not appear normally distributed, and therefore do not fulfil the assumption of normality. Hence, the Kruskal-Wallis test would be most appropriate in this case.
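
As a minimal sketch, the Kruskal-Wallis test identified above could be run in base R as follows. The object names petal_kw and normality_p are introduced for illustration, and the per-species Shapiro-Wilk check is an optional way of examining the normality judgement made from the plot.

```r
data("iris")

# Kruskal-Wallis rank sum test: petal length compared across the 3 species
petal_kw <- kruskal.test(Petal.Length ~ Species, data = iris)
petal_kw

# optional: Shapiro-Wilk p-value for petal length within each species
normality_p <- tapply(iris$Petal.Length, iris$Species,
                      function(x) shapiro.test(x)$p.value)
normality_p
```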

3. Scatter Plot

Reproducing the scatter plot with regression line of Petal length and Petal width for the 3 Iris species.

ggplot(iris, aes(x = Petal.Length, # loads iris data set, assigns petal length to x-axis
                 y = Petal.Width)) + # assigns petal width to y-axis
  geom_point(aes(colour = Species, # generates scatter plot, with colour of points set to species
                 shape = Species)) + # shape of points set to species
  geom_smooth(method = "lm") + # generates a single regression line (all species pooled) with error ribbon
  labs(x = "Petal Length", # labels added to x- and y-axes, title also added
       y = "Petal Width",
       title = "Petal length and width for Iris")

Identification of suitable tests: A scatter plot is generally used to visualise the relationship between two variables, and therefore we assume this to be the aim of this graphic. Petal length, along the x-axis, and petal width, along the y-axis, are both numerical and on a continuous scale. The data appear fairly normally distributed, without extreme values; therefore, the assumption of normality is considered to be fulfilled. There are 3 groups (species), and the two measurements are paired, as each individual has a measurement of both petal length and petal width. Therefore, Pearson correlation would be an appropriate test to use in this case. However, simple linear regression could also be suitable, as the regression line has been generated using a model, with y as a function of x.
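
Both options named above can be sketched in base R. The object names petal_cor and petal_lm are introduced for illustration; like the regression line in the plot, both analyses pool the three species together.

```r
data("iris")

# Pearson correlation between petal length and petal width
petal_cor <- cor.test(iris$Petal.Length, iris$Petal.Width, method = "pearson")
petal_cor

# simple linear regression with petal width as a function of petal length
petal_lm <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(petal_lm)     # intercept and slope of the fitted line
summary(petal_lm)  # coefficient table and R-squared
```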

4. Bar chart

Reproducing bar chart of frequency of big and small sepals in the plants according to species

iris.modified <- # modified data set created
iris %>%  # pipe symbol indicates function operation on iris data set
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big")) # creates a new variable size, with two categories, big and small
iris.modified # calls the modified data set

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  size
1            5.1         3.5          1.4         0.2     setosa small
2            4.9         3.0          1.4         0.2     setosa small
3            4.7         3.2          1.3         0.2     setosa small
4            4.6         3.1          1.5         0.2     setosa small
5            5.0         3.6          1.4         0.2     setosa small
6            5.4         3.9          1.7         0.4     setosa small
7            4.6         3.4          1.4         0.3     setosa small
8            5.0         3.4          1.5         0.2     setosa small
9            4.4         2.9          1.4         0.2     setosa small
10           4.9         3.1          1.5         0.1     setosa small
11           5.4         3.7          1.5         0.2     setosa small
12           4.8         3.4          1.6         0.2     setosa small
13           4.8         3.0          1.4         0.1     setosa small
14           4.3         3.0          1.1         0.1     setosa small
15           5.8         4.0          1.2         0.2     setosa   big
16           5.7         4.4          1.5         0.4     setosa small
17           5.4         3.9          1.3         0.4     setosa small
18           5.1         3.5          1.4         0.3     setosa small
19           5.7         3.8          1.7         0.3     setosa small
20           5.1         3.8          1.5         0.3     setosa small
21           5.4         3.4          1.7         0.2     setosa small
22           5.1         3.7          1.5         0.4     setosa small
23           4.6         3.6          1.0         0.2     setosa small
24           5.1         3.3          1.7         0.5     setosa small
25           4.8         3.4          1.9         0.2     setosa small
26           5.0         3.0          1.6         0.2     setosa small
27           5.0         3.4          1.6         0.4     setosa small
28           5.2         3.5          1.5         0.2     setosa small
29           5.2         3.4          1.4         0.2     setosa small
30           4.7         3.2          1.6         0.2     setosa small
31           4.8         3.1          1.6         0.2     setosa small
32           5.4         3.4          1.5         0.4     setosa small
33           5.2         4.1          1.5         0.1     setosa small
34           5.5         4.2          1.4         0.2     setosa small
35           4.9         3.1          1.5         0.2     setosa small
36           5.0         3.2          1.2         0.2     setosa small
37           5.5         3.5          1.3         0.2     setosa small
38           4.9         3.6          1.4         0.1     setosa small
39           4.4         3.0          1.3         0.2     setosa small
40           5.1         3.4          1.5         0.2     setosa small
41           5.0         3.5          1.3         0.3     setosa small
42           4.5         2.3          1.3         0.3     setosa small
43           4.4         3.2          1.3         0.2     setosa small
44           5.0         3.5          1.6         0.6     setosa small
45           5.1         3.8          1.9         0.4     setosa small
46           4.8         3.0          1.4         0.3     setosa small
47           5.1         3.8          1.6         0.2     setosa small
48           4.6         3.2          1.4         0.2     setosa small
49           5.3         3.7          1.5         0.2     setosa small
50           5.0         3.3          1.4         0.2     setosa small
51           7.0         3.2          4.7         1.4 versicolor   big
52           6.4         3.2          4.5         1.5 versicolor   big
53           6.9         3.1          4.9         1.5 versicolor   big
54           5.5         2.3          4.0         1.3 versicolor small
55           6.5         2.8          4.6         1.5 versicolor   big
56           5.7         2.8          4.5         1.3 versicolor small
57           6.3         3.3          4.7         1.6 versicolor   big
58           4.9         2.4          3.3         1.0 versicolor small
59           6.6         2.9          4.6         1.3 versicolor   big
60           5.2         2.7          3.9         1.4 versicolor small
61           5.0         2.0          3.5         1.0 versicolor small
62           5.9         3.0          4.2         1.5 versicolor   big
63           6.0         2.2          4.0         1.0 versicolor   big
64           6.1         2.9          4.7         1.4 versicolor   big
65           5.6         2.9          3.6         1.3 versicolor small
66           6.7         3.1          4.4         1.4 versicolor   big
67           5.6         3.0          4.5         1.5 versicolor small
68           5.8         2.7          4.1         1.0 versicolor   big
69           6.2         2.2          4.5         1.5 versicolor   big
70           5.6         2.5          3.9         1.1 versicolor small
71           5.9         3.2          4.8         1.8 versicolor   big
72           6.1         2.8          4.0         1.3 versicolor   big
73           6.3         2.5          4.9         1.5 versicolor   big
74           6.1         2.8          4.7         1.2 versicolor   big
75           6.4         2.9          4.3         1.3 versicolor   big
76           6.6         3.0          4.4         1.4 versicolor   big
77           6.8         2.8          4.8         1.4 versicolor   big
78           6.7         3.0          5.0         1.7 versicolor   big
79           6.0         2.9          4.5         1.5 versicolor   big
80           5.7         2.6          3.5         1.0 versicolor small
81           5.5         2.4          3.8         1.1 versicolor small
82           5.5         2.4          3.7         1.0 versicolor small
83           5.8         2.7          3.9         1.2 versicolor   big
84           6.0         2.7          5.1         1.6 versicolor   big
85           5.4         3.0          4.5         1.5 versicolor small
86           6.0         3.4          4.5         1.6 versicolor   big
87           6.7         3.1          4.7         1.5 versicolor   big
88           6.3         2.3          4.4         1.3 versicolor   big
89           5.6         3.0          4.1         1.3 versicolor small
90           5.5         2.5          4.0         1.3 versicolor small
91           5.5         2.6          4.4         1.2 versicolor small
92           6.1         3.0          4.6         1.4 versicolor   big
93           5.8         2.6          4.0         1.2 versicolor   big
94           5.0         2.3          3.3         1.0 versicolor small
95           5.6         2.7          4.2         1.3 versicolor small
96           5.7         3.0          4.2         1.2 versicolor small
97           5.7         2.9          4.2         1.3 versicolor small
98           6.2         2.9          4.3         1.3 versicolor   big
99           5.1         2.5          3.0         1.1 versicolor small
100          5.7         2.8          4.1         1.3 versicolor small
101          6.3         3.3          6.0         2.5  virginica   big
102          5.8         2.7          5.1         1.9  virginica   big
103          7.1         3.0          5.9         2.1  virginica   big
104          6.3         2.9          5.6         1.8  virginica   big
105          6.5         3.0          5.8         2.2  virginica   big
106          7.6         3.0          6.6         2.1  virginica   big
107          4.9         2.5          4.5         1.7  virginica small
108          7.3         2.9          6.3         1.8  virginica   big
109          6.7         2.5          5.8         1.8  virginica   big
110          7.2         3.6          6.1         2.5  virginica   big
111          6.5         3.2          5.1         2.0  virginica   big
112          6.4         2.7          5.3         1.9  virginica   big
113          6.8         3.0          5.5         2.1  virginica   big
114          5.7         2.5          5.0         2.0  virginica small
115          5.8         2.8          5.1         2.4  virginica   big
116          6.4         3.2          5.3         2.3  virginica   big
117          6.5         3.0          5.5         1.8  virginica   big
118          7.7         3.8          6.7         2.2  virginica   big
119          7.7         2.6          6.9         2.3  virginica   big
120          6.0         2.2          5.0         1.5  virginica   big
121          6.9         3.2          5.7         2.3  virginica   big
122          5.6         2.8          4.9         2.0  virginica small
123          7.7         2.8          6.7         2.0  virginica   big
124          6.3         2.7          4.9         1.8  virginica   big
125          6.7         3.3          5.7         2.1  virginica   big
126          7.2         3.2          6.0         1.8  virginica   big
127          6.2         2.8          4.8         1.8  virginica   big
128          6.1         3.0          4.9         1.8  virginica   big
129          6.4         2.8          5.6         2.1  virginica   big
130          7.2         3.0          5.8         1.6  virginica   big
131          7.4         2.8          6.1         1.9  virginica   big
132          7.9         3.8          6.4         2.0  virginica   big
133          6.4         2.8          5.6         2.2  virginica   big
134          6.3         2.8          5.1         1.5  virginica   big
135          6.1         2.6          5.6         1.4  virginica   big
136          7.7         3.0          6.1         2.3  virginica   big
137          6.3         3.4          5.6         2.4  virginica   big
138          6.4         3.1          5.5         1.8  virginica   big
139          6.0         3.0          4.8         1.8  virginica   big
140          6.9         3.1          5.4         2.1  virginica   big
141          6.7         3.1          5.6         2.4  virginica   big
142          6.9         3.1          5.1         2.3  virginica   big
143          5.8         2.7          5.1         1.9  virginica   big
144          6.8         3.2          5.9         2.3  virginica   big
145          6.7         3.3          5.7         2.5  virginica   big
146          6.7         3.0          5.2         2.3  virginica   big
147          6.3         2.5          5.0         1.9  virginica   big
148          6.5         3.0          5.2         2.0  virginica   big
149          6.2         3.4          5.4         2.3  virginica   big
150          5.9         3.0          5.1         1.8  virginica   big

ggplot(iris.modified, aes(x = Species,    # uses the modified data set; the categorical variable Species goes on the x-axis
                          colour = size,  # bar outline colour follows the sepal size category
                          fill = size)) + # bar fill colour follows the sepal size category
  geom_bar(position = "dodge") # bar chart of counts, with the bars for the two size categories side by side within each species

Identification of suitable tests: This bar chart shows the count of individuals in each species with small or big sepals, where “small” refers to individuals with a sepal length below the median sepal length, and “big” to those with a sepal length at or above the median (a sepal length of 5.8 is labelled “big” in the output above). Both “species” and “size” are categorical, nominal variables. We assume the objective of this graphic is to compare, through a frequency distribution, sepal size (length) across the three species. The study design appears to be unpaired: each individual is measured once for sepal length and falls into exactly one of the two groups, “small” or “big.” Because there are more than two groups (three species), a chi-square test of homogeneity would be the most suitable test in this case.
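The test named above could be run roughly as follows. This is a minimal sketch, not the workbook's own code: the object names iris_sketch, size_by_species and chisq_result are hypothetical, and the size column is rebuilt under the assumption that “small” means a sepal length below the median and “big” means a sepal length at or above it.

```r
# Sketch only: rebuild a size column under the ASSUMPTION that "small" means
# Sepal.Length below the median and "big" means at or above the median.
iris_sketch <- iris
iris_sketch$size <- ifelse(
  iris_sketch$Sepal.Length < median(iris_sketch$Sepal.Length),
  "small", "big"
)

# Contingency table of counts: the same counts the dodged bar chart displays
size_by_species <- table(iris_sketch$Species, iris_sketch$size)
size_by_species

# Chi-square test on the 3 x 2 table (species by size category)
chisq_result <- chisq.test(size_by_species)
chisq_result

# Expected counts under the null hypothesis; the chi-square approximation is
# commonly considered reasonable when these are at least about 5
chisq_result$expected
```

In R, chisq.test() performs the same calculation for a test of homogeneity and a test of independence; the two differ in the sampling design and in how the result is interpreted. A small p-value here would suggest that the proportion of small and big sepals is not the same in all three species.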

Week 7 - Formative Assessment

Title and abstract writing

Investigating factors influencing activity and behaviour patterns in the Western Santa Cruz tortoise, Chelonoidis porteri, using Dirichlet regressions and generalised linear mixed models

Affiliations: James Cook University, Queensland, Australia

Abstract:

Although vital to biodiversity conservation, preserving and restoring important natural areas are unfeasible because of booming global human populations. Instead, land-sharing with wildlife is more practical. Determining how animals balance activity patterns in relation to land-use types has important conservation implications. This study helped determine the factors influencing the proportions of time that Western Santa Cruz tortoises, Chelonoidis porteri, spent eating, resting and walking in agricultural areas of Santa Cruz. Tortoise behaviour on the Santa Cruz farms, Galapagos, was observed during wet and dry seasons. The duration and type of tortoise behaviour were recorded. Habitat characteristics, including percent cover, density and mean height of vegetation, were noted for each land-use type. Carapace length and width were measured for each observed tortoise, and thermal images of the animals and surrounding habitats were captured. The proportions of time spent eating, walking and resting, and the behavioural categories relating to vegetation characteristics, were analysed in R using Dirichlet regressions and generalised linear mixed models, respectively. Eating and resting durations were greatly affected by land-use type and temperature. Vegetation characteristics also affected tortoise activity patterns. Eating probability was positively correlated with vegetation cover, but also depended on vegetation density. The tortoises spent significantly longer resting on abandoned land than on livestock and touristic land. Vegetation height and density influenced walking probability. Land-use type and vegetation characteristics strongly influenced tortoise behaviour patterns. The differences in activity patterns indicated preferences for activities that reduce energetic costs to the tortoises. Understanding how land-use types affect activity patterns can inform conservation management actions.

Keywords: land-use type; vegetation characteristic; carapace temperature; eating; resting; walking