Introduction to R on RStudio (basics of R). Installation and set-up of RStudio
Getting used to terminology, code structure, available options for different functions
Creation of workbook with Quarto using R
Using the published workbook as a space to add completed assignments and tasks
Getting introduced to the basics of R
This exercise involves the use of data on penguins included in the Palmer penguins package.
The first step is to load the package and label the chunk for easy subsequent reference. Chunk options control what appears in the rendered document: setting ‘echo’ to ‘true’ shows the code itself, setting ‘output’ to ‘false’ hides the results the code produces, and setting ‘include’ to ‘false’ runs the chunk but keeps both its code and its output out of the rendered document.
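As a minimal sketch of how these options look inside a chunk (the label ‘load-penguins’ and the option values chosen here are illustrative assumptions, not the settings of this workbook), the options are written as ‘#|’ comments at the very top of the chunk:

```r
#| label: load-penguins
#| echo: true
#| output: false

# With echo: true the lines below appear in the rendered page;
# with output: false whatever they print is left out.
library(palmerpenguins)

n_penguins <- nrow(penguins) # number of rows in the penguins data
n_penguins
```

When run as plain R, the ‘#|’ lines are ordinary comments; Quarto only treats them as options when they come before any code in the chunk.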
Loading the palmerpenguins package makes the penguins dataset available. Typing the dataset name prints the data below the code.
library(tidyverse)
library(palmerpenguins)
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Viewing a summary of the penguin data
This chunk of R code has been labelled ‘summary-penguins’, as the function ‘glimpse’ provides a brief overview of the penguins data. The pipe symbol ‘%>%’ passes the data on its left to the function on its right as its first argument. It can be inserted using the shortcut ‘Ctrl+Shift+M’.
The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.
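The chunk described above is not reproduced here, so the following is only a sketch of what a ‘summary-penguins’ chunk could look like; loading dplyr on its own instead of the whole tidyverse, and the helper object ‘dims’, are assumptions made to keep the example small.

```r
#| label: summary-penguins
library(dplyr)
library(palmerpenguins)

# The pipe passes penguins to glimpse() as its first argument,
# so this is equivalent to glimpse(penguins).
penguins %>%
  glimpse()

# Rows and columns, matching the "344 x 8" header of the tibble
dims <- dim(penguins)
dims
```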
Week Two
Generating a scatter plot displaying the relationship between bill length and bill depth
The scatter plot below shows the relationship between bill length and bill depth of these penguins, with bill length in mm along the x-axis and bill depth in mm along the y-axis. Adelie penguins are indicated with filled seagreen circles, Chinstrap penguins with filled blue triangles, and Gentoo penguins with filled pink squares. The title and subtitle of the scatter plot are specified using the ‘title’ and ‘subtitle’ arguments of labs(), respectively.
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("seagreen", "blue", "pink")) +
  labs(
    title = "Bill Length and Depth",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Bill Length (mm)", y = "Bill Depth (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
Generating a histogram exploring the frequency of flipper lengths in the Palmer Archipelago penguins
As mentioned previously, it is important to label each chunk of R code so that it can be found easily later.
The following plot is a histogram showing the frequency of different flipper lengths occurring in the penguin population on Palmer Archipelago. The Adelie penguin is indicated in seagreen, the Chinstrap penguin in blue, and the Gentoo penguin in pink.
flipper_hist <- ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(fill = species), alpha = 0.5, position = "identity") +
  scale_fill_manual(values = c("seagreen", "blue", "pink")) +
  labs(
    x = "Flipper length (mm)",
    y = "Frequency",
    title = "Frequency of Penguin flipper lengths",
    subtitle = "Dimensions for penguins at Palmer Station LTER"
  )
flipper_hist
Visualising relative body mass of the three penguin species
The following chunk of R code generates a boxplot comparing the body mass of the three species. Most of the code in this chunk was taken from the ‘Boxplots’ code on the Week 3 slides in the course room for ‘Research Methods and Data Analysis’.
However, the variables studied were changed and the colours were customised to match the scheme used since the beginning of the document, using: ‘scale_color_manual(values = c("seagreen","blue","pink")) + scale_fill_manual(values = c("seagreen","blue","pink"))’
The axis labels and title of the boxplot were added using ‘labs(x = "Species", y = "Body mass (g)", title = "Relative body mass of the three penguin species")’
#| label: boxplot-bodymass-species
#| warning: false
#| echo: true
library(tidyverse)
library(palmerpenguins)
data("penguins")
penguins %>%
  group_by(species) %>%
  ggplot(aes(x = species, y = body_mass_g, color = species, fill = species)) +
  geom_boxplot(alpha = 0.5) +
  scale_color_manual(values = c("seagreen", "blue", "pink")) +
  scale_fill_manual(values = c("seagreen", "blue", "pink")) +
  theme(axis.text = element_text(size = 12), axis.title = element_text(size = 12)) +
  labs(
    x = "Species",
    y = "Body mass (g)",
    title = "Relative body mass of the three penguin species",
    subtitle = "Dimensions for penguins at Palmer Station LTER"
  )
Visualising observed occurrence of the three penguin species on the three islands
The following R code chunk generates a bar graph showing the number of observations of each of the three penguin species on the three islands. As with the previous graph, the code for this bar graph was taken from the slides available in the learning room. The colours were changed to the scheme used throughout this document, using the functions mentioned in the previous section.
#| label: bargraph-penguins-islands
#| warning: false
#| echo: true
library(tidyverse)
library(palmerpenguins)
penguins %>%
  ggplot(aes(x = island, color = species, fill = species)) +
  geom_bar() +
  theme(axis.text = element_text(size = 12), axis.title = element_text(size = 12)) +
  scale_color_manual(values = c("seagreen", "blue", "pink")) +
  scale_fill_manual(values = c("seagreen", "blue", "pink")) +
  labs(
    x = "Island",
    y = "Number of penguins observed",
    title = "Observations of the three penguin species on the three islands",
    subtitle = "Dimensions for penguins at Palmer Station LTER"
  )
Body mass and flipper length
The boxplot below shows the relationship between body mass in g and flipper length in mm for the three penguin species observed on the Palmer Archipelago. It was generated from the R code chunk on the ‘Body mass per sex’ slide in the course learning room on NOW, with changes to the variables studied, the colour scheme described previously, and with labels, title and subtitle added.
penguins %>%
  na.omit() %>%
  ggplot(aes(x = body_mass_g, y = flipper_length_mm, color = species, fill = species)) +
  geom_boxplot(alpha = 0.7) +
  theme(axis.text = element_text(size = 12), axis.title = element_text(size = 12)) +
  scale_color_manual(values = c("seagreen", "blue", "pink")) +
  scale_fill_manual(values = c("seagreen", "blue", "pink")) +
  labs(
    x = "Body mass (g)",
    y = "Flipper length (mm)",
    title = "Correlation between Body Mass and Flipper Length",
    subtitle = "Dimensions for penguins at Palmer Station LTER"
  )
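Because body mass and flipper length are both quantitative, a scatter plot together with a correlation coefficient is a common alternative way of looking at this relationship. The following is only a sketch and not the code from the course slides; the object names ‘complete’, ‘r’ and ‘scatter’ are assumptions.

```r
library(ggplot2)
library(palmerpenguins)

# Keep only the rows where both measurements are present
complete <- na.omit(penguins[, c("species", "body_mass_g", "flipper_length_mm")])

# Pearson correlation between the two measurements
r <- cor(complete$body_mass_g, complete$flipper_length_mm)
r

# One point per penguin, coloured with the scheme used in this document
scatter <- ggplot(complete, aes(x = body_mass_g, y = flipper_length_mm, color = species)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("seagreen", "blue", "pink")) +
  labs(x = "Body mass (g)", y = "Flipper length (mm)", color = "Species")
scatter
```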
Inserting images
The following image is that of an Adelie Penguin.
The following image shows a Chinstrap Penguin.
The following image is that of a Gentoo Penguin.
These images were downloaded from Wikimedia Commons, saved to the Pictures folder within the working directory ‘RStudio’ on New Volume (E:) on my computer, and embedded using the ‘Insert–>Image’ option available in Visual mode.
Embedding a video
The following video is of Chinstrap and Gentoo penguins recorded in Antarctica. It was embedded by copying the embed code from the YouTube video page and pasting it below. Rendering the Quarto file produces the Quarto html file with the embedded video in it.
[Embedded video: Chinstrap and Gentoo penguins in Antarctica]
Week Three
Data Wrangling - Lecture - 8/10/2024
This week, we were introduced to a few more basic concepts in R. A very important point to remember is to always ensure reproducibility, which is an important aspect of science. We learnt what data wrangling is and why it matters, and were shown how to upload data sets and use a few basic functions such as str(), mutate(), ifelse() and summarise(). Furthermore, we learnt the purpose of the pipe symbol %>%, and got to know a few packages apart from tidyverse, such as magrittr, plyr and tidyr. We also learnt about the different types of variables and the correct format of data for analyses using R.
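To make the functions named above concrete, here is a small sketch on made-up data. The tibble ‘birds’ and its column names are hypothetical and were not part of the lecture.

```r
library(dplyr)

# Hypothetical toy data, not from the lecture
birds <- tibble(
  species = c("A", "A", "B", "B"),
  mass_g  = c(3500, 4200, 5100, 4800)
)

str(birds) # structure: column names, types and first values

# mutate() adds a column; ifelse() fills it from a condition
birds_sized <- birds %>%
  mutate(size = ifelse(mass_g > 4000, "large", "small"))

# summarise() after group_by() returns one row per species
mass_summary <- birds_sized %>%
  group_by(species) %>%
  summarise(mean_mass = mean(mass_g))
mass_summary
```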
Data Wrangling - Formative Exercises - 10/10/2024
Section 6.6.1 - “R for Graduate Students”
For the formative exercises, we follow the book “R for Graduate Students” by Wendy Huynh. Section 6.6.1 lists problems for which the code is provided. Our task is to annotate each line of the code with the purpose of each function used. The tidyverse package is loaded first. Each chunk of R code is labelled accordingly for easy future reference.
library(tidyverse)
Problem A
midwest %>% # database is called
  group_by(state) %>% # groups the data by state
  summarize(
    poptotalmean = mean(poptotal), # provides one-row summary, variable with mean of total population
    poptotalmed = median(poptotal), # calculates median value
    popmax = max(poptotal), # provides maximum value
    popmin = min(poptotal), # provides minimum value
    popdistinct = n_distinct(poptotal), # provides no. of distinct entries
    popfirst = first(poptotal), # returns the first element
    popany = any(poptotal < 5000), # checks if any value is below 5000
    popany2 = any(poptotal > 2000000) # checks if any value is above 2000000
  ) %>%
  ungroup() # ungroups the data
midwest %>% # calls database for analysis
  group_by(state) %>% # groups by state
  summarize(
    num5k = sum(poptotal < 5000), # one summary row per state; counts the counties with total population below 5000
    num2mil = sum(poptotal > 2000000), # counts the counties with total population above 2000000
    numrows = n() # returns no. of rows for each state
  ) %>%
  ungroup() # ungroups data
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
Problem C
part I
midwest %>% # calls midwest database
  group_by(county) %>% # groups by county name
  summarize(x = n_distinct(state)) %>% # one summary row per county name; x is the no. of distinct states in which that name occurs
  arrange(desc(x)) %>% # data returned by summarize() is arranged in descending order of x
  ungroup() # ungroups data
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
part II
How does n() differ from n_distinct()? When would they be the same? different?
midwest %>% # calls the database
  group_by(county) %>% # groups by county
  summarize(x = n()) %>% # provides a one-row summary, variable x provides no. of total observations, unlike n_distinct(), which returns the no. of distinct observations in vector
  ungroup() # ungroups data
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
# n() and n_distinct() would be the same if there are no repetitions in the observations, i.e., no repeated combination
# n() and n_distinct() would be different if there are repeated observations
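The difference between n() and n_distinct() can be checked on a very small example. The tibble ‘toy’ below is hypothetical data made up for illustration; it is not taken from midwest.

```r
library(dplyr)

# Hypothetical toy data: ADAMS has three rows, two of them in the same state
toy <- tibble(
  county = c("ADAMS", "ADAMS", "ADAMS", "ALCONA"),
  state  = c("IL", "IL", "OH", "MI")
)

counts <- toy %>%
  group_by(county) %>%
  summarize(
    rows   = n(),               # every row in the group is counted
    states = n_distinct(state)  # repeated state values are counted once
  ) %>%
  ungroup()
counts
```

For ADAMS the two functions disagree because ‘IL’ is repeated; for ALCONA, which has no repeated values, they agree.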
part III hint: - How many distinctly different counties are there for each county? - Can there be more than 1 (county) county in each county? - What if we replace ‘county’ with ‘state’?
midwest %>% # calls midwest database
  group_by(county) %>% # groups by county
  summarize(x = n_distinct(county)) %>% # provides one-row summary, variable x provides no. of distinct values for each county
  ungroup() # ungroups data
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
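# x is 1 for every county, because each group holds exactly one distinct county name
# there cannot be more than 1 distinct county within a single county group
# replacing 'county' with 'state' inside n_distinct() counts the distinct states for each county name, as in part I, where some names occur in up to 5 states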
Problem D
diamonds %>% # calls diamonds database
  group_by(clarity) %>% # groups by clarity
  summarize(
    a = n_distinct(color), # creates one-row summary, variable a gives no. of distinct color values
    b = n_distinct(price), # variable b gives no. of distinct price values
    c = n() # variable c gives no. of total observations
  ) %>%
  ungroup() # ungroups data
diamonds %>% # calls diamonds database
  group_by(color, cut) %>% # groups by color and then cut, in order
  summarize(
    m = mean(price), # returns one-row summary, variable m gives the mean price
    s = sd(price) # returns standard deviation of price
  ) %>%
  ungroup() # ungroups data
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
part II
diamonds %>% # calls diamonds database
  group_by(cut, color) %>% # groups by cut and then color, in order
  summarize(
    m = mean(price), # creates one-row summary, variable m gives mean price
    s = sd(price) # variable s returns standard deviation of price
  ) %>%
  ungroup() # ungroups data
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
part III hint: - How good is the sale if the price of diamonds equaled msale? - e.x. The diamonds are x% off original price in msale.
diamonds %>% # calls diamonds database
  group_by(cut, color, clarity) %>% # groups by cut, color and clarity, in that order
  summarize(
    m = mean(price), # variable m returns mean price
    s = sd(price), # variable s returns standard deviation of price
    msale = m * 0.80 # variable msale gives 80% of the mean price, i.e., the mean price with 20% off
  ) %>%
  ungroup() # ungroups data
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
# if the diamond price equaled msale, the diamonds would be 20% off the mean price for their cut, color and clarity; this appears to be an effective sale, though the quality of the diamonds may be questionable
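The size of the discount can be worked out from the first row of the output above. This is a small arithmetic sketch; the variable name ‘percent_off’ is an assumption.

```r
m <- 7383          # mean price of the first group (Fair, D, I1) in the output above
msale <- m * 0.80  # the sale price is 80% of the mean price

# Discount relative to the mean price, in percent
percent_off <- (m - msale) / m * 100
percent_off
```

Multiplying by 0.80 therefore corresponds to 20% off, whatever the value of m.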
Problem F
diamonds %>% # calls diamonds data set
  group_by(cut) %>% # groups diamonds by cut
  summarize(
    potato = mean(depth), # creates one summary row per cut, variable 'potato' returns mean depth
    pizza = mean(price), # variable 'pizza' returns mean price
    popcorn = median(y), # variable 'popcorn' returns median value of width in mm
    pineapple = potato - pizza, # variable 'pineapple' gives the difference between mean depth and mean price
    papaya = pineapple^2, # variable 'papaya' gives the squared value of pineapple, i.e., squared value of difference between mean depth and mean price
    peach = n() # variable 'peach' gives no. of observations
  ) %>%
  ungroup() # ungroups data
diamonds %>% # calls diamonds database
  group_by(color) %>% # groups by color
  summarize(m = mean(price)) %>% # creates one summary row per color, variable m returns mean price
  mutate(
    x1 = str_c("Diamond color ", color), # mutate() creates new variable x1, which joins the string "Diamond color " and the value of 'color' into a single character vector
    x2 = 5 # variable x2 assigns the value 5 to each row
  ) %>%
  ungroup() # ungroups data
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
part II What does the first ungroup() do? Is it useful here? Why/why not? Why isn’t there a closing ungroup() after the mutate()?
diamonds %>% # calls diamonds dataset
  group_by(color) %>% # groups by color
  summarize(m = mean(price)) %>% # creates a summary with the mean price of the diamonds of each color
  ungroup() %>% # ungroups data grouped earlier by color
  mutate(
    x1 = str_c("Diamond color ", color), # mutate() creates two variables x1 and x2, as described previously
    x2 = 5
  )
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
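# the first ungroup() makes sure the data grouped earlier by color are ungrouped before mutate() is executed; because summarize() already drops the only grouping variable here, it is a safeguard rather than a necessity
# there is no need for a closing ungroup() after mutate() because the data are no longer grouped when mutate() runs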
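Whether a result is still grouped after summarize() can be checked directly with the dplyr helpers is_grouped_df() and group_vars(). This is a sketch for checking, not part of the book's exercise; the object names are assumptions.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

by_color <- diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price))

# summarize() drops the last grouping variable, so with a single
# grouping variable the result is no longer grouped
grouped_after_one <- is_grouped_df(by_color)

by_cut_color <- diamonds %>%
  group_by(cut, color) %>%
  summarize(m = mean(price), .groups = "drop_last")

# with two grouping variables the result stays grouped by the first one
remaining_groups <- group_vars(by_cut_color)

grouped_after_one
remaining_groups
```

With more than one grouping variable some grouping is left over after summarize(), which is where a closing ungroup() makes a difference.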
Problem H
part I
diamonds %>% # calls the diamonds database
  group_by(color) %>% # groups by color
  mutate(x1 = price * 0.5) %>% # creates new variable x1, which is half the price of each diamond
  summarize(m = mean(x1)) %>% # creates one summary row per color, variable m gives the mean value of the half price for each color
  ungroup() # ungroups data
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
part II What’s the difference between part I and II?
diamonds %>% # calls the diamonds database
  group_by(color) %>% # groups by color
  mutate(x1 = price * 0.5) %>% # mutate() creates new variable x1, which is half the price of each diamond
  ungroup() %>% # ungroups data grouped earlier by color
  summarize(m = mean(x1)) # creates one-row summary, variable m gives mean of half price values after they are ungrouped
# A tibble: 1 × 1
m
<dbl>
1 1966.
# part I gives the mean half price for every color group, whereas part II gives the mean half price for all the diamonds in total
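The effect of where ungroup() sits can be reproduced on three rows. The tibble ‘toy’ is hypothetical data chosen so the means are easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data: two diamonds of color D, one of color E
toy <- tibble(color = c("D", "D", "E"), price = c(100, 300, 500))

# As in part I: summarize() runs while the data are still grouped
per_color <- toy %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  summarize(m = mean(x1)) %>%
  ungroup()

# As in part II: ungroup() comes first, so summarize() sees all rows at once
overall <- toy %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  ungroup() %>%
  summarize(m = mean(x1))

per_color # one row per color: 100 for D, 250 for E
overall   # a single row: 150
```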
Data Wrangling - Formative Exercises - 10/10/2024
Section 6.7 (Extra Practice) - “R for Graduate Students”
This exercise involves writing our own code chunks for each task with our own comments. The first step is to ensure the tidyverse package is loaded. Each chunk of R code is labelled accordingly for easy future reference.
library(tidyverse) # loads the tidyverse package
1. View all of the variable names in diamonds (hint: View()).
#| label: view-diamonds
library(tidyverse)
view(diamonds) # view object in a new tab on RStudio, not visible in rendered Quarto html file
names(diamonds) # quick view of all variable names in diamonds database
diamonds %>%
  arrange(price) # arranges diamonds by price in ascending order, i.e., from the lowest to highest values. The pipe symbol takes the output of the function to its left as the input for the operation of the function on its right
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
- arranging diamonds from highest to lowest by price
diamonds %>%
  arrange(desc(price)) # arranges diamonds in descending order of price, from the highest to lowest values
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
- arranging diamonds by lowest price and cut
diamonds %>%
  arrange(price) %>%
  arrange(cut) # arranges diamonds in ascending order by price, then re-arranges them in ascending order by cut, so the output is ordered by cut, with the ascending price order kept among diamonds of the same cut
diamonds %>%
  arrange(desc(price)) %>%
  arrange(desc(cut)) # arranges diamonds in descending order by price, then re-arranges them in descending order by cut, so the output is ordered by cut from best to worst, with the descending price order kept among diamonds of the same cut
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
2 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
3 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
4 2.05 Ideal G SI1 61.9 57 18787 8.1 8.16 5.03
5 1.6 Ideal F VS1 62 56 18780 7.47 7.52 4.65
6 2.06 Ideal I VS2 62.2 55 18779 8.15 8.19 5.08
7 1.71 Ideal G VVS2 62.1 55 18768 7.66 7.63 4.75
8 2.08 Ideal H SI1 58.7 60 18760 8.36 8.4 4.92
9 2.03 Ideal G SI1 60 55.8 18757 8.17 8.3 4.95
10 2.61 Ideal I SI2 62.1 56 18756 8.85 8.73 5.46
# ℹ 53,930 more rows
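Sorting on two keys can also be written as a single arrange() call, in which the first argument is the main key and later arguments break ties. The following is a sketch on hypothetical toy data, not the code used for the output above.

```r
library(dplyr)

# Hypothetical toy data with an ordered factor, like cut in diamonds
toy <- tibble(
  cut = factor(c("Good", "Ideal", "Good", "Ideal"),
               levels = c("Good", "Ideal"), ordered = TRUE),
  price = c(300, 500, 400, 450)
)

# Best cut first; within the same cut, highest price first
sorted <- toy %>%
  arrange(desc(cut), desc(price))
sorted
```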
3. Arrange the diamonds by lowest to highest price and worst to best clarity.
diamonds %>%
  arrange(price, clarity) # arranging diamonds in ascending order by price and clarity, i.e., from lowest to highest by price and from worst to best by clarity
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
4. Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond (hint: mutate()).
diamonds %>%
  mutate(salePrice = price - 250) # mutate() creates new variable "salePrice", which reflects discount of $250 off of original price
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z salePrice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
5. Remove the x, y, and z variables from the diamonds dataset (hint: select()).
diamonds %>%
  select(-x, -y, -z) # select() helps remove/retain desired variables. In this case, it removes the x, y and z variables and retains the others
# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
6. Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).
diamonds %>%
  group_by(cut) %>% # groups diamonds by cut
  summarise(n = n()) %>% # produces one summary row per cut with the number of observations, i.e., number of diamonds for each cut value
  ungroup() # ungroups data
# A tibble: 5 × 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
7. Create a new column named totalNum that calculates the total number of diamonds.
diamonds %>%
  mutate(totalNum = n()) # creates a new column, totalNum, which gives the total number of diamonds
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z totalNum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
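The output above repeats 53940 on every row because mutate() keeps all rows, whereas summarise() would collapse them into one. A small sketch on hypothetical toy data shows the contrast.

```r
library(dplyr)

toy <- tibble(id = 1:3) # hypothetical three-row dataset

with_total <- toy %>%
  mutate(totalNum = n())    # keeps all 3 rows, each with totalNum = 3

only_total <- toy %>%
  summarise(totalNum = n()) # collapses to a single row with totalNum = 3

with_total
only_total
```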
Research Methods - Formative Exercise - 11/10/2024
Develop a good question and a bad question based on the diamonds dataset using the principles discussed previously.
Bad Question: Do all variables influence the price of diamonds?
Good Question: What combination of variables results in a diamond being sold at the highest price, and what values do those variables take?
Week Four
Data Exploration - Lecture - 15/10/2024
During this lecture, we were introduced to the importance of framing good questions and understanding the data available for analysis. We were also made aware of the types of variables, as well as the different ways to visualise data, namely, bar charts, histograms, scatter plots and so on. We also learnt how to visualise correlations and check distributions. In addition, we learnt about the measures of central tendency and dispersion.
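As a reminder of what these measures look like in R, here is a short sketch using base R functions on a made-up vector; the values in ‘x’ are hypothetical and not from the lecture.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9) # hypothetical measurements

# Central tendency
centre <- c(mean = mean(x), median = median(x))

# Dispersion: sample variance, sample standard deviation,
# interquartile range and total range
spread <- c(variance = var(x), sd = sd(x), iqr = IQR(x), range = diff(range(x)))

centre
spread
```

Note that var() and sd() use the sample formulas, dividing by n - 1 rather than n.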
Data Exploration - Formative Exercise - 16/10/2024
For this week’s formative exercise, we follow the video tutorial “Visualization with R in 36 minutes.” The task is to follow the tutorial and reproduce the graphs explained in the video, organised in our own way.
The first step is to load the “tidyverse” package, which includes the ggplot2 package, necessary for data exploration.
library(tidyverse)
Second, the “modeldata” package, which includes the “crickets” dataset used for the rest of the exercise, is installed and then loaded.
library(modeldata)
library(tidyverse) #tidyverse is loaded
library(modeldata) #modeldata is loaded
The next step is to take a look at the crickets dataset, which we will work with.
view(crickets) # view object in a new tab on RStudio, not visible in rendered Quarto html file
This dataset has 31 observations with corresponding values for temperature and chirping rate for 2 species of crickets, O. exclamationis and O. niveus.
Deciding on which graphic to use for visualising a data set depends on the variables in question, namely, the type and number of variables. If there is only a single variable of interest, the most used plots are as follows:
- single categorical variable: bar chart
- single quantitative variable: histogram
If there are two variables for analysis, the commonly used plots are:
- both categorical: grouped bar chart
- both quantitative: scatterplot
- one categorical and one quantitative: box plot
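The rules of thumb above can be written down as a small function. ‘suggest_plot’ is a hypothetical helper invented for this sketch, not a function from ggplot2 or the tutorial, and it treats factors, character and logical vectors as categorical.

```r
# Hypothetical helper: suggests a plot type from the types of one or two variables
suggest_plot <- function(x, y = NULL) {
  is_cat <- function(v) is.factor(v) || is.character(v) || is.logical(v)
  if (is.null(y)) {
    # one variable
    return(if (is_cat(x)) "bar chart" else "histogram")
  }
  # two variables
  if (is_cat(x) && is_cat(y)) {
    "grouped bar chart"
  } else if (!is_cat(x) && !is_cat(y)) {
    "scatterplot"
  } else {
    "box plot"
  }
}

one_cat   <- suggest_plot(c("a", "b", "a"))
one_quant <- suggest_plot(c(1.5, 2.0, 3.2))
two_quant <- suggest_plot(c(1, 2, 3), c(10, 20, 30))
mixed     <- suggest_plot(c("a", "b", "a"), c(10, 20, 30))
```

These are defaults to start from rather than fixed rules; the choice still depends on the question being asked of the data.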
1. For one variable
1.1. Single categorical variable
1.1.1. Generating a bar chart for a single categorical variable.
A bar chart is the default way of representing the frequency/count of a single categorical variable. The following code generates a bar chart of the count of individuals in each species of cricket in the data set. This is accomplished using the geom_bar() function.
ggplot(crickets, aes(x = species, # crickets dataset loaded, species variable mapped to the x-axis
                     fill = species)) + # color of bars set according to species
  geom_bar() + # specifies bar chart
  labs(x = "Species", # label assigned according to variable on x-axis
       y = "No. of individuals", # label assigned for y-axis
       fill = "Species", # label provided for legend (the legend belongs to the fill aesthetic)
       title = "Number of individuals of each cricket species") + # suitable title also specified
  scale_fill_brewer(palette = "Dark2") # increases color contrast for improved accessibility
Differentiating the species by color in the bar chart while also indicating them in the legend and along the x-axis is redundant, so the legend can be removed by passing the argument show.legend = FALSE to geom_bar().
ggplot(crickets, aes(x = species, # crickets dataset loaded, species variable mapped to the x-axis
                     fill = species)) + # color of bars set according to species
  geom_bar(show.legend = FALSE) + # specifies bar chart, removes legend
  labs(x = "Species", # label assigned according to variable on x-axis
       y = "No. of individuals", # label assigned for y-axis
       title = "Number of individuals of each cricket species") + # suitable title also specified
  scale_fill_brewer(palette = "Dark2") # increases color contrast for improved accessibility
1.2. Single quantitative variable
1.2.1. Generating a histogram of chirp rates
Histograms are the most common way of visualising a single quantitative variable, with frequency along the y-axis. The geom_histogram() function generates a histogram with a default of 30 bins, which, for a dataset of only 31 observations, does not give a clear picture of the data.
ggplot(crickets, aes(x = rate)) + # database loaded, single variable mapped to x-axis
  geom_histogram() + # specifies histogram, with default bin number
  labs(x = "Chirp rate", # label assigned to x-axis
       y = "Frequency", # label assigned to y-axis
       title = "Frequency of chirp rates") # suitable title specified
It is good data exploration practice to experiment with different numbers of bins, from many to few, as this affects how trends in a dataset are visualised.
ggplot(crickets, aes(x = rate)) + # database loaded, single variable mapped to x-axis
  geom_histogram(bins = 15) + # specifies histogram, with bin number = 15
  labs(x = "Chirp rate", # label assigned to x-axis
       y = "Frequency", # label assigned to y-axis
       title = "Frequency of chirp rates") # suitable title specified
The bins can also be controlled with the binwidth argument, which sets the width of each bin in the units of the variable instead of fixing the number of bins.
ggplot(crickets, aes(x = rate)) + # database loaded, single variable mapped to x-axis
  geom_histogram(binwidth = 5) + # specifies histogram, with desired binwidth
  labs(x = "Chirp rate", # label assigned to x-axis
       y = "Frequency", # label assigned to y-axis
       title = "Frequency of chirp rates") # suitable title specified
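The difference between bins and binwidth can be inspected by asking ggplot2 for the bars it computed, using layer_data(). This is a sketch only; the data frame ‘rates’ holds made-up values standing in for the chirp rates, and the object names are assumptions.

```r
library(ggplot2)

# Hypothetical chirp rates standing in for crickets$rate
rates <- data.frame(rate = c(44, 47, 52, 58, 61, 65, 70, 74, 80, 85, 91, 97, 101))

p_bins  <- ggplot(rates, aes(x = rate)) + geom_histogram(bins = 15)
p_width <- ggplot(rates, aes(x = rate)) + geom_histogram(binwidth = 5)

# layer_data() returns one row per bar that ggplot2 computed
n_bars_bins <- nrow(layer_data(p_bins))                  # fixed by bins
bar_widths  <- with(layer_data(p_width), xmax - xmin)    # fixed by binwidth

n_bars_bins
unique(round(bar_widths, 6))
```

With bins the number of bars is fixed and their width follows from the range of the data; with binwidth the width is fixed and the number of bars follows from the range.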
The same distribution can also be represented as a frequency polygon using the geom_freqpoly() function, with chirp rate along the x-axis, just as in the histogram.
ggplot(crickets, aes(x = rate)) + # database loaded, single variable mapped to x-axis
  geom_freqpoly(bins = 15) + # generates frequency polygon
  labs(x = "Chirp rate", # label assigned to x-axis
       y = "Frequency", # label assigned to y-axis
       title = "Frequency of chirp rates") # suitable title specified
1.2.2. Faceting
Faceting in R splits a chart into multiple smaller panels, each displaying a different subset of the data. By default, the panels share the same scales. Faceting is useful when a single histogram represents data for several groups, for example, 2 species. As the bins for both species are stacked one over the other, this type of graphic is not particularly useful for visualisation, especially if the number of categories rises above 2, as evident in the output of the following R code chunk.
ggplot(crickets, aes(x = rate, # crickets dataset, rate mapped to x-axis
                     fill = species)) + # bin color according to species
  geom_histogram(bins = 15) + # generates histogram with 15 bins
  labs(x = "Chirp rate", # label assigned according to variable on x-axis
       y = "No. of individuals", # label assigned for y-axis
       fill = "Species", # label provided for legend (the legend belongs to the fill aesthetic)
       title = "Chirp rates of the cricket species") + # suitable title also specified
  scale_fill_brewer(palette = "Dark2") # increases contrast for improved visibility
A better way to resolve this, rather than creating a separate dataset for each species, which is long-winded, is to use the function facet_wrap(), which splits the visualised data into one panel per species. The scales of the panels are the same, so it is easier to compare the two plots and to make reasonable, valid deductions from them.
ggplot(crickets, aes(x = rate,  # crickets dataset, rate mapped to x-axis
                     fill = species)) +  # bin color according to species
  geom_histogram(bins = 15,  # generates histogram with 15 bins
                 show.legend = FALSE) +  # removes legend
  facet_wrap(~species) +  # wraps by species
  labs(x = "Chirp rate",  # label assigned according to variable on x-axis
       y = "No. of individuals",  # label assigned for y-axis
       title = "Chirp rates of individual species") +  # suitable title also specified
  scale_fill_brewer(palette = "Dark2")  # increases contrast for improved visibility
A better way of visualising the above plots is to arrange them vertically, so that the difference in chirp rates between the two species becomes more apparent. This is done using the following R code chunk.
ggplot(crickets, aes(x = rate,  # crickets dataset, rate mapped to x-axis
                     fill = species)) +  # bin color according to species
  geom_histogram(bins = 15,  # generates histogram with 15 bins
                 show.legend = FALSE) +  # removes legend
  facet_wrap(~species,  # wraps by species
             ncol = 1) +  # specifies no. of columns to be considered
  labs(x = "Chirp rate",  # label assigned according to variable on x-axis
       y = "No. of individuals",  # label assigned for y-axis
       title = "Chirp rate of individual species") +  # suitable title also specified
  scale_fill_brewer(palette = "Dark2") +  # increases contrast for improved visibility
  theme_minimal()  # removes grey default background
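Whether the panels share their axes is controlled by the scales argument of facet_wrap(). The sketch below uses the built-in iris data set as a stand-in for the crickets data (an assumption made only so that the example runs without extra packages beyond ggplot2) and compares the default shared scales with a free y-axis.

```r
library(ggplot2)

# Default: every panel shares the same x- and y-axis (scales = "fixed")
p_fixed <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(bins = 15, show.legend = FALSE) +
  facet_wrap(~Species, ncol = 1)

# Alternative: each panel gets its own y-axis range
p_free <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(bins = 15, show.legend = FALSE) +
  facet_wrap(~Species, ncol = 1, scales = "free_y")

# print(p_fixed) or print(p_free) draws the corresponding plot
```

Shared scales are usually the better choice when the aim is to compare groups directly, because bar heights and positions mean the same thing in every panel.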
2. For two variables
2.1. For two quantitative variables
2.1.1a. Creating a scatter plot of chirping rate and temperature
This scatter plot uses the crickets dataset, with temperature on the x-axis and chirping rate on the y-axis. A scatter plot is the most commonly used way of visualising the relationship between two quantitative variables in a dataset. The geom_point() function generates scatter plots.
# basic scatter plot of temperature vs chirping rate
ggplot(crickets, aes(x = temp,  # crickets dataset, temp along x-axis
                     y = rate,  # chirp rate along y-axis
                     color = species)) +  # color of points according to species
  geom_point() +  # scatter plot is specified using this command
  labs(x = "Temperature",  # labels assigned according to variables on each axis
       y = "Chirp rate",
       color = "Species",  # label provided for legend
       title = "Chirping in crickets") +  # suitable title also specified
  scale_color_brewer(palette = "Dark2")  # increases contrast for improved accessibility
2.1.1b. Adding a regression line to the existing scatter plot
The function geom_smooth() fits a smoother, or curve of best fit, over the data. By default, however, it produces a flexible curve that follows the data closely but says little about the overall form of the relationship between the variables in question. The smoother is accompanied by a grey error ribbon that provides a measure of uncertainty in the fit.
ggplot(crickets, aes(x = temp,  # crickets dataset, temp along x-axis
                     y = rate)) +  # chirp rate along y-axis
  geom_point() +  # scatter plot is specified using this command
  geom_smooth() +  # fits a smoother with error ribbon over the data
  labs(x = "Temperature",  # labels assigned according to variables on each axis
       y = "Chirp rate",
       title = "Chirping in crickets")  # suitable title also specified
A regression line can be added by passing arguments to geom_smooth() that alter the smoother. Setting method = "lm" fits a straight line from a linear model, and the grey error ribbon can be removed by setting “se = FALSE.” Mapping color to species fits a separate regression line for each species, which follows the data more closely than a single line would.
ggplot(crickets, aes(x = temp,  # crickets dataset, temp along x-axis
                     y = rate,  # chirp rate along y-axis
                     color = species)) +  # color of points according to species
  geom_point() +  # scatter plot is specified using this command
  geom_smooth(method = "lm",  # specifies a linear model
              se = FALSE) +  # removes the grey error ribbon
  labs(x = "Temperature",  # labels assigned according to variables on each axis
       y = "Chirp rate",
       color = "Species",  # label provided for legend
       title = "Chirping in crickets") +  # suitable title also specified
  scale_color_brewer(palette = "Dark2")  # increases contrast for improved accessibility
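The lines drawn by geom_smooth(method = "lm") correspond to ordinary linear models fitted separately within each colour group. As a minimal sketch of that idea, the code below fits one lm() per species using the built-in iris data set as a stand-in (petal length and petal width here play the roles of temperature and chirp rate; this substitution is an assumption made only so the example is self-contained).

```r
# Split the data into one data frame per species
by_species <- split(iris, iris$Species)

# Fit a separate straight line within each species and collect the coefficients
fits <- sapply(by_species, function(d) {
  coef(lm(Petal.Width ~ Petal.Length, data = d))
})

# Rows are the intercept and slope; columns are the species
fits
```

Each column gives the intercept and slope of one of the per-group lines, which is why the lines in the plot can differ in both height and steepness.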
2.2. For two variables - one categorical, one quantitative
2.2.1. Creating a box plot for species vs chirp rate
This graphic is used when analysing the relationship between one quantitative and one categorical variable in a dataset. Box plots can be generated using the geom_boxplot() function. The following boxplots consider the cricket species and their respective chirp rates.
ggplot(crickets, aes(x = species,  # crickets dataset, species along x-axis
                     y = rate,  # chirp rate along y-axis
                     color = species)) +  # boxplot color according to species
  geom_boxplot(show.legend = FALSE) +  # creates box plot, hides legend
  labs(x = "Species",  # label assigned according to variable on x-axis
       y = "Chirp rate",  # label assigned for y-axis
       title = "Relationship between chirp rate and species") +  # suitable title also specified
  scale_color_brewer(palette = "Dark2")  # increased color contrast for better visibility
In the above plot, the grey background is provided by default. In order to improve aesthetics, this grey background can be removed using the theme_minimal() function, as follows.
ggplot(crickets, aes(x = species,  # crickets dataset, species along x-axis
                     y = rate,  # chirp rate along y-axis
                     color = species)) +  # boxplot color according to species
  geom_boxplot(show.legend = FALSE) +  # creates box plot, hides legend
  labs(x = "Species",  # label assigned according to variable on x-axis
       y = "Chirp rate",  # label assigned for y-axis
       title = "Relationship between chirp rate and species") +  # suitable title also specified
  scale_color_brewer(palette = "Dark2") +  # increased color contrast for better visibility
  theme_minimal()  # removes the grey default background
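The box in each box plot spans the lower and upper quartiles, with the median drawn as the line inside it. As a rough sketch of the numbers behind such a plot, the code below computes these three values per group in base R, using the built-in iris data set as a stand-in for the crickets data (an assumption made only to keep the example self-contained).

```r
# Quartiles and median of sepal length within each species
box_stats <- sapply(
  split(iris$Sepal.Length, iris$Species),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)

# Rows are the 25%, 50% (median) and 75% points; columns are the species
box_stats
```

Comparing the "50%" row across columns gives the same between-group comparison that the eye makes when reading the median lines of the box plot.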
Research Methods - Formative Exercise - 17/10/2024
What makes a good research hypothesis?
A good research hypothesis can only be put forth after an exhaustive, comprehensive review of existing scientific literature on the topic of interest. This is to identify any gaps in knowledge that the research hypothesis can help to address. The most important aspect of a good research hypothesis is that it should be testable and practical. It should also be falsifiable, that is, open to either support or rejection based on subsequent experiments. It should state the concept to be studied clearly enough that subsequent experimentation and analysis can be guided properly. A good research hypothesis should be objective and based on facts and scientific evidence, not on personal opinions and beliefs. Finally, it should be relevant to the research topic or area in question; otherwise, it would misguide the entire research process that follows.
Week Five
Choosing the right analysis - Lecture - 22/10/2024
This week’s lecture was on identifying the nature of variables, which extended into learning about the different types of statistical analyses, depending on the nature of the variables in question. We were introduced to frequency tests, mean tests, correlations and models. The second half of the lecture dealt with research hypotheses, the importance of hypotheses in science, the difference between scientific and statistical hypotheses and hypothetico-deductive reasoning.
Choosing the right analysis - Formative Exercise - 23/10/2024
The task for this week is to identify the statistical tests most appropriate for and relevant to the graphics provided on the NOW page for the module, under Week 5 - Post-session, and to reproduce the graphics using R code. The data set used for this exercise is “iris,” which is pre-built in R.
data("iris")
For clarity, each part of this task presents the R code for a graphic, then the graphic itself, and then the identification of suitable tests with an explanation.
library(tidyverse) # loading tidyverse for ggplot functions
1. Box plot
Reproducing the box plot of sepal length for the 3 Iris species, I. setosa, I. versicolor, and I. virginica.
ggplot(iris, aes(x = Species,  # calls iris data set, assigns species to x-axis
                 y = Sepal.Length,  # assigns sepal length to y-axis
                 color = Species)) +  # generates legend
  geom_boxplot() +  # generates boxplot
  labs(x = "Species",  # labels added to x- and y-axes, title also added
       y = "Sepal Length",
       title = "Sepal length for the three Iris species")
Identification of suitable tests: The purpose of the above graphic appears to be to compare sepal length between the three Iris species. The variable along the x-axis, species, is categorical, and the variable along the y-axis, sepal length, is numerical and on a continuous scale. The data appear fairly normally distributed for I. setosa and I. versicolor, while the data for I. virginica appear slightly negatively skewed. However, we assume that normality is fulfilled, as the sample size is larger than 30 (n = 150 in total, 50 per species). Therefore, parametric tests are more suitable in this case than non-parametric tests. Furthermore, the study design is unpaired, as there is only one measurement (variable being measured) for each individual of each species. The number of groups (species) is 3 (n>2), and therefore, considering all the above factors, a one-way ANOVA seems to be the most appropriate test for this graphic.
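The test identified above can be sketched in base R as follows. This is a minimal illustration rather than a full analysis: the object names are hypothetical, and the Shapiro-Wilk check is included only as one possible way of examining the normality assumption discussed above.

```r
data("iris")

# One-way ANOVA: does mean sepal length differ between the three species?
anova_fit <- aov(Sepal.Length ~ Species, data = iris)
anova_table <- summary(anova_fit)[[1]]

# p-value for the Species term (first row of the ANOVA table)
anova_p <- anova_table[["Pr(>F)"]][1]

# Optional normality check within each species (Shapiro-Wilk test)
shapiro_p <- tapply(iris$Sepal.Length, iris$Species,
                    function(x) shapiro.test(x)$p.value)

anova_table
shapiro_p
```

A small p-value for Species indicates that at least one species mean differs from the others; a post hoc test such as TukeyHSD(anova_fit) would be needed to say which pairs differ.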
2. Density plot
Reproducing the density plot of Petal length for the 3 Iris species.
ggplot(iris, aes(x = Petal.Length,  # calls iris data set, assigns petal length to x-axis
                 fill = Species)) +  # fills area under each density plot with colour according to species
  geom_density(alpha = 0.3) +  # generates density plot, with density plot transparency adjusted to match that of graphic on NOW page
  labs(x = "Petal Length",  # labels added to x- and y-axes, title also added
       y = "Density",
       title = "Petal length for the three Iris species")
Identification of suitable tests: As a density plot is a representation of the frequency distribution of a numeric variable, it is safe to assume that the aim of this graphic is to compare the difference in petal length (and corresponding frequencies) between the three Iris species. The measured variable, petal length, is numerical and on a continuous scale (density, along the y-axis, is also numerical). As only a single measurement has been made for each individual of each species, the study design can be considered unpaired. There are more than 2 groups (3 species). The data are not normally distributed, and therefore they do not fulfil the assumption of normality. Hence, the Kruskal-Wallis test would be most appropriate in this case.
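A minimal sketch of the Kruskal-Wallis test on these data, using base R only (the object name kw_result is hypothetical):

```r
data("iris")

# Kruskal-Wallis rank sum test: compares petal length across the three species
# without assuming normally distributed data
kw_result <- kruskal.test(Petal.Length ~ Species, data = iris)

kw_result
```

The test has two degrees of freedom here because three groups are compared; as with ANOVA, a significant result only shows that some difference exists, not which species differ.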
3. Scatter Plot
Reproducing the scatter plot with regression line of Petal length and Petal width for the 3 Iris species.
ggplot(iris, aes(x = Petal.Length,  # loads iris data set, assigns petal length to x-axis
                 y = Petal.Width)) +  # assigns petal width to y-axis
  geom_point(aes(colour = Species,  # generates scatter plot, with colour of points set to species
                 shape = Species)) +  # shape of points set to species
  geom_smooth(method = "lm") +  # generates regression line with error ribbon
  labs(x = "Petal Length",  # labels added to x- and y-axes, title also added
       y = "Petal Width",
       title = "Petal length and width for Iris")
Identification of suitable tests: A scatter plot is generally used to visualise the relationship between two variables, and therefore we assume this to be the aim of this graphic. Petal length, along the x-axis, and petal width, along the y-axis, are both numerical and on a continuous scale. The data appear fairly normally distributed, without extreme values; therefore, the assumptions of normality are considered to be fulfilled. There are 3 groups (species; n>2), and the study design is paired, as each individual has a measurement of both petal length and petal width. Therefore, Pearson correlation would be an appropriate test to use in this case. However, simple linear regression could also be suitable, as the regression line has been generated using a model, with y as a function of x.
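Both options mentioned above can be sketched in base R. The object names are hypothetical, and the example pools all three species, as the single regression line in the graphic does.

```r
data("iris")

# Pearson correlation between petal length and petal width
cor_result <- cor.test(iris$Petal.Length, iris$Petal.Width,
                       method = "pearson")

# Simple linear regression with petal width as a function of petal length
lm_fit <- lm(Petal.Width ~ Petal.Length, data = iris)

cor_result$estimate  # correlation coefficient r
coef(lm_fit)         # intercept and slope of the fitted line
```

The correlation describes the strength of the linear association, whereas the regression additionally gives the slope, that is, the expected change in petal width per unit change in petal length.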
4. Bar chart
Reproducing bar chart of frequency of big and small sepals in the plants according to species
iris.modified <-  # modified data set created
  iris %>%  # pipe symbol indicates function operation on iris data set
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big"))  # creates a new variable size, with two categories, big and small
iris.modified  # calls the modified data set
Sepal.Length Sepal.Width Petal.Length Petal.Width Species size
1 5.1 3.5 1.4 0.2 setosa small
2 4.9 3.0 1.4 0.2 setosa small
3 4.7 3.2 1.3 0.2 setosa small
4 4.6 3.1 1.5 0.2 setosa small
5 5.0 3.6 1.4 0.2 setosa small
6 5.4 3.9 1.7 0.4 setosa small
7 4.6 3.4 1.4 0.3 setosa small
8 5.0 3.4 1.5 0.2 setosa small
9 4.4 2.9 1.4 0.2 setosa small
10 4.9 3.1 1.5 0.1 setosa small
11 5.4 3.7 1.5 0.2 setosa small
12 4.8 3.4 1.6 0.2 setosa small
13 4.8 3.0 1.4 0.1 setosa small
14 4.3 3.0 1.1 0.1 setosa small
15 5.8 4.0 1.2 0.2 setosa big
16 5.7 4.4 1.5 0.4 setosa small
17 5.4 3.9 1.3 0.4 setosa small
18 5.1 3.5 1.4 0.3 setosa small
19 5.7 3.8 1.7 0.3 setosa small
20 5.1 3.8 1.5 0.3 setosa small
21 5.4 3.4 1.7 0.2 setosa small
22 5.1 3.7 1.5 0.4 setosa small
23 4.6 3.6 1.0 0.2 setosa small
24 5.1 3.3 1.7 0.5 setosa small
25 4.8 3.4 1.9 0.2 setosa small
26 5.0 3.0 1.6 0.2 setosa small
27 5.0 3.4 1.6 0.4 setosa small
28 5.2 3.5 1.5 0.2 setosa small
29 5.2 3.4 1.4 0.2 setosa small
30 4.7 3.2 1.6 0.2 setosa small
31 4.8 3.1 1.6 0.2 setosa small
32 5.4 3.4 1.5 0.4 setosa small
33 5.2 4.1 1.5 0.1 setosa small
34 5.5 4.2 1.4 0.2 setosa small
35 4.9 3.1 1.5 0.2 setosa small
36 5.0 3.2 1.2 0.2 setosa small
37 5.5 3.5 1.3 0.2 setosa small
38 4.9 3.6 1.4 0.1 setosa small
39 4.4 3.0 1.3 0.2 setosa small
40 5.1 3.4 1.5 0.2 setosa small
41 5.0 3.5 1.3 0.3 setosa small
42 4.5 2.3 1.3 0.3 setosa small
43 4.4 3.2 1.3 0.2 setosa small
44 5.0 3.5 1.6 0.6 setosa small
45 5.1 3.8 1.9 0.4 setosa small
46 4.8 3.0 1.4 0.3 setosa small
47 5.1 3.8 1.6 0.2 setosa small
48 4.6 3.2 1.4 0.2 setosa small
49 5.3 3.7 1.5 0.2 setosa small
50 5.0 3.3 1.4 0.2 setosa small
51 7.0 3.2 4.7 1.4 versicolor big
52 6.4 3.2 4.5 1.5 versicolor big
53 6.9 3.1 4.9 1.5 versicolor big
54 5.5 2.3 4.0 1.3 versicolor small
55 6.5 2.8 4.6 1.5 versicolor big
56 5.7 2.8 4.5 1.3 versicolor small
57 6.3 3.3 4.7 1.6 versicolor big
58 4.9 2.4 3.3 1.0 versicolor small
59 6.6 2.9 4.6 1.3 versicolor big
60 5.2 2.7 3.9 1.4 versicolor small
61 5.0 2.0 3.5 1.0 versicolor small
62 5.9 3.0 4.2 1.5 versicolor big
63 6.0 2.2 4.0 1.0 versicolor big
64 6.1 2.9 4.7 1.4 versicolor big
65 5.6 2.9 3.6 1.3 versicolor small
66 6.7 3.1 4.4 1.4 versicolor big
67 5.6 3.0 4.5 1.5 versicolor small
68 5.8 2.7 4.1 1.0 versicolor big
69 6.2 2.2 4.5 1.5 versicolor big
70 5.6 2.5 3.9 1.1 versicolor small
71 5.9 3.2 4.8 1.8 versicolor big
72 6.1 2.8 4.0 1.3 versicolor big
73 6.3 2.5 4.9 1.5 versicolor big
74 6.1 2.8 4.7 1.2 versicolor big
75 6.4 2.9 4.3 1.3 versicolor big
76 6.6 3.0 4.4 1.4 versicolor big
77 6.8 2.8 4.8 1.4 versicolor big
78 6.7 3.0 5.0 1.7 versicolor big
79 6.0 2.9 4.5 1.5 versicolor big
80 5.7 2.6 3.5 1.0 versicolor small
81 5.5 2.4 3.8 1.1 versicolor small
82 5.5 2.4 3.7 1.0 versicolor small
83 5.8 2.7 3.9 1.2 versicolor big
84 6.0 2.7 5.1 1.6 versicolor big
85 5.4 3.0 4.5 1.5 versicolor small
86 6.0 3.4 4.5 1.6 versicolor big
87 6.7 3.1 4.7 1.5 versicolor big
88 6.3 2.3 4.4 1.3 versicolor big
89 5.6 3.0 4.1 1.3 versicolor small
90 5.5 2.5 4.0 1.3 versicolor small
91 5.5 2.6 4.4 1.2 versicolor small
92 6.1 3.0 4.6 1.4 versicolor big
93 5.8 2.6 4.0 1.2 versicolor big
94 5.0 2.3 3.3 1.0 versicolor small
95 5.6 2.7 4.2 1.3 versicolor small
96 5.7 3.0 4.2 1.2 versicolor small
97 5.7 2.9 4.2 1.3 versicolor small
98 6.2 2.9 4.3 1.3 versicolor big
99 5.1 2.5 3.0 1.1 versicolor small
100 5.7 2.8 4.1 1.3 versicolor small
101 6.3 3.3 6.0 2.5 virginica big
102 5.8 2.7 5.1 1.9 virginica big
103 7.1 3.0 5.9 2.1 virginica big
104 6.3 2.9 5.6 1.8 virginica big
105 6.5 3.0 5.8 2.2 virginica big
106 7.6 3.0 6.6 2.1 virginica big
107 4.9 2.5 4.5 1.7 virginica small
108 7.3 2.9 6.3 1.8 virginica big
109 6.7 2.5 5.8 1.8 virginica big
110 7.2 3.6 6.1 2.5 virginica big
111 6.5 3.2 5.1 2.0 virginica big
112 6.4 2.7 5.3 1.9 virginica big
113 6.8 3.0 5.5 2.1 virginica big
114 5.7 2.5 5.0 2.0 virginica small
115 5.8 2.8 5.1 2.4 virginica big
116 6.4 3.2 5.3 2.3 virginica big
117 6.5 3.0 5.5 1.8 virginica big
118 7.7 3.8 6.7 2.2 virginica big
119 7.7 2.6 6.9 2.3 virginica big
120 6.0 2.2 5.0 1.5 virginica big
121 6.9 3.2 5.7 2.3 virginica big
122 5.6 2.8 4.9 2.0 virginica small
123 7.7 2.8 6.7 2.0 virginica big
124 6.3 2.7 4.9 1.8 virginica big
125 6.7 3.3 5.7 2.1 virginica big
126 7.2 3.2 6.0 1.8 virginica big
127 6.2 2.8 4.8 1.8 virginica big
128 6.1 3.0 4.9 1.8 virginica big
129 6.4 2.8 5.6 2.1 virginica big
130 7.2 3.0 5.8 1.6 virginica big
131 7.4 2.8 6.1 1.9 virginica big
132 7.9 3.8 6.4 2.0 virginica big
133 6.4 2.8 5.6 2.2 virginica big
134 6.3 2.8 5.1 1.5 virginica big
135 6.1 2.6 5.6 1.4 virginica big
136 7.7 3.0 6.1 2.3 virginica big
137 6.3 3.4 5.6 2.4 virginica big
138 6.4 3.1 5.5 1.8 virginica big
139 6.0 3.0 4.8 1.8 virginica big
140 6.9 3.1 5.4 2.1 virginica big
141 6.7 3.1 5.6 2.4 virginica big
142 6.9 3.1 5.1 2.3 virginica big
143 5.8 2.7 5.1 1.9 virginica big
144 6.8 3.2 5.9 2.3 virginica big
145 6.7 3.3 5.7 2.5 virginica big
146 6.7 3.0 5.2 2.3 virginica big
147 6.3 2.5 5.0 1.9 virginica big
148 6.5 3.0 5.2 2.0 virginica big
149 6.2 3.4 5.4 2.3 virginica big
150 5.9 3.0 5.1 1.8 virginica big
ggplot(iris.modified, aes(x = Species,  # loads the modified data set, categorical variable, species, assigned to x-axis
                          colour = size,  # bar colour set according to sepal size
                          fill = size)) +  # bar fill colour set according to sepal size
  geom_bar(position = "dodge")  # creates bar chart, with a bar corresponding to each size category adjacent to each other, for each species
Identification of suitable tests: This bar chart helps visualise the count of individuals in each species with small or big sepals, where the “small” category refers to those individuals with sepal length less than the median sepal length, and the “big” category to those with sepal length equal to or greater than the median sepal length. In this case, both “species” and “size” are categorical and nominal. We assume the objective of this graphic to be a comparison (through a frequency distribution) of the difference in sepal size (length) between the three species. The study design seems to be unpaired, as the only variable measured is sepal length, which categorises individuals into either of the two groups, “small” and “big.” There are more than 2 groups (3 species), and therefore, taking into account these considerations, a chi-square test of homogeneity would be the most suitable test in this case.
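A minimal sketch of this test is given below. It rebuilds the size variable with base R's transform() instead of the mutate() call used earlier (a substitution made only so that the example runs without loading the tidyverse), and the object names are hypothetical.

```r
data("iris")

# Recreate the size variable: "small" if below the overall median, else "big"
iris_sized <- transform(
  iris,
  size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big")
)

# Contingency table of counts: species (rows) by sepal size (columns)
size_counts <- table(iris_sized$Species, iris_sized$size)

# Chi-square test on the contingency table
chi_result <- chisq.test(size_counts)

size_counts
chi_result
```

The contingency table holds the same counts that the bar chart displays, so the test asks directly whether the proportion of big and small sepals is the same in all three species.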
Week 7 - Formative Assessment
Title and abstract writing
Investigating factors influencing activity and behaviour patterns in the Western Santa Cruz tortoise, Chelonoidis porteri, using Dirichlet regressions and generalised linear mixed models
Affiliations: James Cook University, Queensland, Australia
Abstract:
Although vital to biodiversity conservation, preserving and restoring important natural areas are unfeasible because of booming global human populations. Instead, land-sharing with wildlife is more practical. Determining how animals balance activity patterns relating to land-use types has important conservation implications. This study helped determine the factors influencing time proportions spent by Western Santa Cruz tortoises, Chelonoidis porteri, on eating, resting and walking in agricultural areas of Santa Cruz. Tortoise behaviour on the Santa Cruz farms, Galapagos, was observed during wet and dry seasons. The duration and type of tortoise behaviour were recorded. Habitat characteristics, including percent cover, density and mean height of vegetation, were noted for each land-use type. Carapace length and width were measured for each observed tortoise, and thermal images of the animals and surrounding habitats were captured. The time proportions spent for eating, walking and resting, and behavioural categories relating to vegetation characteristics, were analysed using Dirichlet regressions and generalised linear mixed models, respectively, using R. Eating and resting durations were greatly affected by land-use type and temperature. Vegetation characteristics also affected tortoise activity patterns. Eating probability positively correlated with vegetation cover, but also depended on vegetation density. The tortoises spent significantly longer resting on abandoned land than on livestock and touristic land. Vegetation height and density influenced walking probability. Land-use type and vegetation characteristics strongly influenced tortoise behaviour patterns. The differences in activity patterns indicated preferences for activities reducing energetic costs to the tortoises. Understanding how land-use types affect activity patterns can inform conservation management actions.