Welcome to Computer Lab 5 for the Data Analysis (DA) component of BIO2POS!
In DA Topic 5, we shifted our focus to modelling relationships between numeric variables. We discussed the concept of correlation, and introduced linear regression. Both simple linear regression (with the one independent variable) and multiple linear regression (with multiple independent variables) were covered, including the modelling process and model interpretation. The assumptions of these different models were also discussed.
In this computer lab, you will continue to learn how to use the statistical software jamovi, and create simple and multiple linear regression models using real data sets. You will also learn how to interpret and summarise jamovi output for these models, and how to check the assumptions of the models in jamovi.
These labs are designed to provide you with plenty of opportunities to practice different aspects of the statistical content covered in the lectures.
Each lab consists of core questions (with the 🌱 symbol) and extension questions (with the 🌳 symbol).
Having completed this lab, you will be able to:
You will also be able to interpret the results of the above statistical techniques, check the assumptions of the tests, and provide clear summary statements highlighting the key statistical outputs of the models.
Before you begin, please check the following:
Please aim to complete Step 1 before starting this lab, as doing so will help you to better understand the content covered. Please aim to complete Step 2 before the next week of DA content.
To begin, we will return to the red crab data, collected by Green (1997), which we began analysing in DA Computer Lab 1. As a refresher, recall that this data has recorded values for variables including:
Figure 1.1: Note. From File:Gecarcoidea natalis 248249776.jpg, by Fernando Pérez Peralta, 2019, Wikimedia Commons (https://commons.wikimedia.org/). CC0 1.0 DEED
It is recommended that you save all your lab work, e.g. on OneDrive, so that you can access it easily at a later date.
This red crab data is available in the current week’s tile on LMS, in the file crab_data_extended.omv. Download this file now,
and save it on your computer. Also open up a text document (e.g. a Word document), in which you can write down your responses and save your jamovi output as you work through the lab.
Open up jamovi and load in the crab_data_extended.omv file.
If you would like to analyse this red crab data in R, you can download either the crab_data_extended.omv or crab_data_extended.csv file in the current week’s tile on LMS.
Also open up a text document (e.g. a Word document), in which you can write down your responses and
save your R output as you work through the lab. Alternatively, you can simply save your R script, as the code you write is reproducible.
Open up R, and open a new R script (Ctrl + Shift + N in Windows, or go to File -> New File -> R Script), set your current working directory to where you downloading the crab data, and read in the crab data.
If you need to refresh your memory on any of the processes mentioned above, just check the Details box below:
Recall that to open .omv files in R, we’ll need the jmvReadWrite package installed and loaded. If you are not sure whether you have this package, copy-paste and then run the following code in your R script, one line at a time.
install.packages("jmvReadWrite") # this line installs the package we need
library(jmvReadWrite) # this line loads the package in our current session
# Note that anything after a # is called a comment in R, and isn't treated as executable code
To run a line of code in RStudio, just have your cursor on that line, and click the Run Selected Line(s) button at the top right of the script (where the green arrow is, see reference image below). Your line of code will then be run, or executed, and you should see the code and some other output appear in the Console section below your script file.
Recall that to set your Working Directory (where R looks for files), the two simplest options in RStudio are:
Then:
crab_data_extended <- read_omv("crab_data_extended.omv")
# This line loads our crab data set into RStudio,
# and stores the data in an object we've called crab_data_extended
You should now see crab_data_extended listed in the Environment section of RStudio in the top right - this means the data is loaded in RStudio, and ready for analysis!
Alternatively, if you wanted to use the .csv file, you can simply use the following code:
crab_data_extended <- read.csv("crab_data_extended.csv", header = T)
# This line loads our crab data set into RStudio,
# and stores the data in an object we've called crab_data_extended
# The header = T part ensures the column names (e.g. CW, are treated as names rather than observations)
To begin, suppose that we would like to determine if there are any correlations between different physical characteristics of the red crabs.
If you would like to refresh your memory on correlations, check the Topic 5A lecture.
In jamovi, click on the Analyses tab, and then click on Regression and select Correlation Matrix.
Drag CW and WEIGHT across into the box on the right to produce the correlation matrix for these two variables.
Under the Plot heading, select the correlation matrix and densities for variables boxes.
Assessing both the correlation matrix and plots, what do you notice about the relationship between the two variables?
Based on your findings from part 2.0.1, decide if it is more appropriate to compute the Pearson or Spearman correlation for the chosen variables, and select the relevant box under the Correlation Coefficients subheading.
Add the variables LEG, CLAW and OtherClaw to your correlation matrix, and for each pair determine which correlation coefficient is more appropriate to use.
Assessing your results, answer the following questions:
Which correlations appear to be linear?
Which correlations (if any) are statistically significant?
Which correlation is the strongest?
In RStudio, we can use the function cor to compute the correlation values between two numeric variables.
If we would like to plot the data to visually assess the correlation, the simplest option is to use the plot function.
To begin, run the following code to create a scatter plot of CW and WEIGHT. Then, using this code as a basis, create scatter plots for the other pairs of numeric variables in the data set.
plot(crab_data_extended$CW, crab_data_extended$WEIGHT)
# Recall the $ operator allows us to access specific columns in our data frame
Based on your visual assessment, decide if it is more appropriate to use the Pearson or Spearman correlation for each pair of variables.
Run the following code to compute the Pearson correlation for each pair of variables:
cor(crab_data_extended[, -5])
# The -5 tells R to ignore the 5th column (which is not numeric)
If you felt that Spearman correlation was more appropriate, you can compute this for a specific pair of variables by including the additional argument method = "spearman", e.g.:
cor(crab_data_extended$CW, crab_data_extended$WEIGHT, method = "spearman)
Assessing your results, answer the following questions:
Which correlations appear to be linear?
Which correlations (if any) are statistically significant?
Which correlation is the strongest?
Given that we observe a strong linear correlation between the CLAW and OtherClaw variables, suppose we would like to construct a simple linear regression (SLR) model that uses CLAW values to predict OtherClaw values.
Select the Linear Regression option in the Regression tab, and drag CLAW (our independent variable in this scenario) to the Covariates box, and OtherClaw to the Dependent Variable box.
You should see some output appear automatically in the Results section.
If successful, this could mean for example that in future we would only have to measure one claw per crab.
Scroll down to the Assumption Checks section, expand it, and tick the Normality test, Q-Q plot of residuals, and Residuals plots boxes.
Scroll down a little further to the Model Coefficients section, expand it, and tick the Confidence interval box under the Estimate subheading.
Before we proceed further, we need to check the model assumptions. Complete the following:
While the Shapiro-Wilk normality test is typically sufficient as a check for the assumption of normality, we may also like to visualise the distribution of the residuals.
Linear Regression section, expand the Save section, and tick the Residuals box. This will save the residuals obtained from fitting the SLR.Exploration tab, and select Descriptives. You should see a new variable, Residuals.Using your SLR model output, complete the following:
To conclude our SLR modelling, we can add the SLR regression line to a scatter plot of the data.
Navigate to the Exploration tab, select the Scatterplot option, and then construct the plot of OtherClaw vs CLAW.
Under the Regression Line subheading, select Linear. You should see that the fitted line fits the data almost perfectly.
To fit a linear regression in R, the simplest option is to use the lm function, which comes built in with the base R installation.
Suppose that, as a starting point, we would like to model OtherClaw against CLAW.
If successful, this could mean for example that in future we would only have to measure one claw per crab.
To do this, and assuming CLAW is the independent variable, we can run the following R code:
crab_lm <- lm(OtherClaw ~ CLAW, data = crab_data_extended)
# Here the dependent variable is specified first, and then the ~ symbol
# is used as shorthand for 'regressed against', followed by the independent variable(s)
# Finally, we specify the data set we are using, via the `data` argument
summary(crab_lm) # The summary function will provide more useful information
# than simply calling crab_lm by itself
To check the confidence intervals for the parameter coefficient estimates, we can use the confint function, as follows:
confint(crab_lm)
Before we proceed further, we need to check the model assumptions. To begin, list the assumptions of a simple linear regression.
To check the test assumptions, we can begin by using the shapiro.test and qqnorm and `qqline functions, in the same manner as we did in Question 2.0.6 of DA Computer Lab 3.
Remember that we are checking the normality of the linear regression’s fitted residuals, not the underlying crab data.
Recall that if we use the $ operator, we can access specific components of the data stored in R objects. To check the residuals computed as part of the linear model fitting procedure, we can use crab_lm$residuals.
Carry out these steps now, using the above information as a guide.
While the Shapiro-Wilk normality test is typically sufficient as a check for the assumption of normality, it can also be helpful to visualise the distribution of the residuals.
hist function, create a histogram of the residuals.Using your SLR model output, complete the following:
To conclude our SLR modelling, we can add the SLR regression line to a scatter plot of the data.
First, we can create a simple scatter plot with the following R code:
Then, we can add the fitted regression line using the following R code:
abline(reg = crab_lm$coefficients, col = "blue")
You should see that the fitted line fits the data almost perfectly.
Repeat all the steps from question 3, but this time model WEIGHT (dependent variable) against CW (independent variable).
Suppose that we want to extend our simple linear regression framework to take into account multiple independent variables.
We will now assess whether a multiple linear regression (MLR) can be used to help us model the weight of the red crabs.
To turn an SLR into an MLR in jamovi, we simply add more variables to the Covariates box, in the Linear Regression section.
Create a regression with WEIGHT as the dependent variable and CW, LEG and CLAW as the covariates.
Rather than simply having the one MLR model, as created in part 5.0.1, we can create a set of nested models via the Model Builder box in the Linear Regression section. This way, we can observe the impact of adding each additional independent variable at a time to our simpler model.
Click the blue Add New Block button twice, then drag LEG from Block 1 to Block 2, and CLAW from Block 1 to Block 3. In the Results window, you should now be able to cycle through the three models.
This process is known as stepwise regression. Since we are adding more variables to create a more complex model, we refer to this as forwards selection. If instead we were removing variables one at a time to create a simpler model, that would be backwards selection.
Normally, the decision on which variable to add/remove at each stage is based on a metric like the AIC.
For an MLR model, it is more appropriate to refer to the adjusted \(R^2\) value than the \(R^2\) value when assessing the fit of the model.
To obtain this value in jamovi, expand the Model Fit section, and select the Adjusted \(R^2\) box.
Also select the AIC box, to check the impact of the inclusion of each additional independent variable on the resultant fit on the model.
What do you observe?
Write out your fitted regression model for the full model with all 3 independent variables, and provide an interpretation of each of the fitted coefficient values.
For example interpretations of fitted coefficients, check slides 17-18 of the Topic 5B lecture.
Check the MLR model assumptions, and discuss your findings.
If we would like to fit a multiple linear regression model in R, we can use the lm function as before, and just add additional independent variables.
So for example, if we would like to model WEIGHT against CW, LEG and CLAW, we could write:
crab_mlr <- lm(WEIGHT ~ CW + LEG + CLAW, data = crab_data_extended)
summary(crab_mlr)
If we would like to conduct stepwise regression, in case there are simpler models that are close to, or just as efficacious as, the “full”’ model with all three independent variables, we can use the step function, as follows:
step(crab_mlr)
You can also try fitting separate linear models, using the lm function process.
For an MLR model, it is more appropriate to refer to the adjusted \(R^2\) value than the \(R^2\) value when assessing the fit of the model.
Recall that we would also like to focus on the AIC score here, to check the impact of the inclusion of each additional independent variable on the resultant fit on the model. What do you observe?
Write out your fitted regression model for the full model with all 3 independent variables, and provide an interpretation of each of the fitted coefficient values.
For example interpretations of fitted coefficients, check slides 17-18 of the Topic 5B lecture.
Check the MLR model assumptions, and discuss your findings.
A recent paper by Frisbie et al. (2024) discussed the possibility of increases in the arsenic levels in Bangladeshi drinking water, due to rising sea levels and flooding, induced by climate change. The paper is freely accessible here.
A copy of the data analysed in the study is available in this week’s tile on LMS, in the file Bangladesh_water_quality.omv. Download this data now, save it on your PC, and open it in jamovi.
Figure 6.1: Note. From File:Sundarbans, Bangladesh - 6 January 2023 (52612859696).jpg, by SentinelHub, 2023, Wikimedia Commons (https://commons.wikimedia.org/). CC BY 2.0 DEED This image contains modified Copernicus Sentinel data 2023.
As part of their analyses, Frisbie et al. (2024) conducted several simple linear regressions of Arsenic (micrograms per liter) against other variables.
To familiarise yourself with the data and their results, read through the Statistical analyses paragraph section of their paper, and then check Table 2. from the Results and Discussion section.
Also check Figures 4, 6, 11, 13 and 15.
Using the Bangladesh_water_quality.omv data set, carry out simple linear regressions in jamovi to reproduce the 5 simple linear regressions reported in Table 2 of Frisbie et al. (2024).
Confirm that the jamovi output matches that shown in Table 2 of the paper.
Comment on the \(R^2\) values for the 5 regressions. What do you observe?
Check the SLR model assumptions for each of the 5 regressions. Based on the results, and your findings from part 6.3 do you have any concerns about the validity of the SLR results reported in the paper?
Conduct a multiple linear regression on Arsenic, using all the independent variables from the Bangladesh_water_quality.omv data set. Make sure to check all the relevant model assumptions.
Write out your fitted model, and comment on the statistical significance of each of the estimated coefficients, and the overall fit of the model.
Based on your results from the previous parts of this question, do you think that either simple and/or multiple linear regression analyses were appropriate to use in this context?
A recent paper by Seo & Takikawa (2022a) assessed the socioeconomic factors affecting national healthcare expenditure and health system performances in regions across Japan. The paper is freely accessible here, but you are not expected to read it in detail.
As part of the study, regression models were fitted to the data collected. The original data, available from the Dryad website (see Seo & Takikawa, (2022b)) has been cleaned and prepared for you, and is available in this week’s tile on LMS:
japan_medical_suburbs_cleaned.omv contains data on suburbs in the Chiba prefecturejapan_medical_central_cleaned.omv contains data on the central cities in TokyoFigure 7.1: Note. From File:Tokyo landscape seen from Shibuya Stream 10.jpg, by Syced, 2023, Wikimedia Commons (https://commons.wikimedia.org/). CC0 1.0 DEED
To begin, suppose we are interested in whether there is a relationship between the number of doctors in a specific region, and the total medical expenses in that region.
Conduct the following steps for both data sets separately:
Create a Correlation Matrix of total medical expenses (the Medical expenses*(JPY**) variable) against Number of Doctors.
Fit a simple linear regression model using these variables, with total medical expenses being the dependent variable.
Assess the model assumptions in both cases, and then provide a brief summary of your findings, including a comparison of the results between the two data sets.
Focusing on the japan_medical_suburbs_cleaned.omv data set, fit a multiple linear regression model of total medical expenses against all of the other variables, except Location and Suburb, and then answer the following:
Combine the two data sets into one and fit a multiple linear regression model of total medical expenses against all of the other variables, except Location.
Using the Suburb variable, check if being in the suburbs has a statistically significant impact on the total medical expenses.
Then, repeat parts a-c of part 7.2.
Frisbie, S.H., Mitchell, E.J., and Molla, A.R. (2024). Sea level rise from climate change is expected to increase the release of arsenic into Bangladesh’s drinking well water by reduction and by the salt effect. PLOS ONE 19(1): e0295172. https://doi.org/10.1371/journal.pone.0295172
Green, P. T. (1997). Red crabs in rain forest on Christmas Island, Indian Ocean: activity patterns, density and biomass. Journal of Tropical Ecology, 13(1), 17-38
Seo, Y., and Takikawa, T. (2022a). Regional Variation in National Healthcare Expenditure and Health System Performance in Central Cities and Suburbs in Japan. Healthcare. 10(6):968. https://doi.org/10.3390/healthcare10060968
Seo, Y., and Takikawa, T. (2022b). Regional Variation in National Healthcare Expenditure and Health System Performance in Central Cities and Suburbs in Japan [Dataset]. Dryad. https://doi.org/10.5061/dryad.h18931znw
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.