Welcome to Computer Lab 5 for the Data Analysis (DA) component of BIO2POS!
In DA Topic 5, we shifted our focus to modelling relationships between numeric variables. We discussed the concept of correlation, and introduced linear regression. Both simple linear regression (with one independent variable) and multiple linear regression (with multiple independent variables) were covered, including the modelling process and model interpretation. The assumptions of these different models were also discussed.
In this computer lab, you will continue to learn how to use the statistical software jamovi, and create simple and multiple linear regression models using real data sets. You will also learn how to interpret and summarise jamovi output for these models, and how to check the assumptions of the models in jamovi.
These labs are designed to provide you with plenty of opportunities to practise different aspects of the statistical content covered in the lectures.
Each lab consists of core questions (with the 🌱 symbol) and extension questions (with the 🌳 symbol).
Having completed this lab, you will be able to conduct the following tests and calculations in jamovi:
You will also be able to interpret the results of the above statistical techniques, check the assumptions of the tests, and provide clear summary statements highlighting the key statistical outputs of the models.
Please complete at least step 1. first, as doing so will help you to better understand the concepts you will need for this computer lab.
To begin, we will return to the red crab data, collected by Green (1997), which we began analysing in DA Computer Lab 1. As a refresher, recall that this data has recorded values for variables including:
Figure 0.1: Note. From File:Gecarcoidea natalis 248249776.jpg, by Fernando Pérez Peralta, 2019, Wikimedia Commons (https://commons.wikimedia.org/). CC0 1.0 DEED
This red crab data is available in the current week’s tile on LMS, in the file crab_data_extended.omv. Download this file now, and save it on your computer, if you do not already have a copy.
Also open up a text document (e.g. a Word document), in which you can write down your responses and save your jamovi output as you work through the lab.
It is recommended that you save all your lab work, e.g. on OneDrive, so that you can access it easily at a later date.
Open up jamovi and load in the crab_data_extended.omv file.
To begin, suppose that we would like to determine if there are any correlations between different physical characteristics of the red crabs.
If you would like to refresh your memory on correlations, check the Topic 5A lecture.
In jamovi, click on the Analyses tab, and then click on Regression and select Correlation Matrix.
Drag CW and WEIGHT across into the box on the right to produce the correlation matrix for these two variables.
Under the Plot heading, tick the Correlation matrix and Densities for variables boxes.
Assessing both the correlation matrix and plots, what do you notice about the relationship between the two variables?
Based on your findings from part 1.2, decide if it is more appropriate to compute the Pearson or Spearman correlation for the chosen variables, and select the relevant box under the Correlation Coefficients subheading.
Add the variables LEG, CLAW and OtherClaw to your correlation matrix, and for each pair determine which correlation coefficient is more appropriate to use.
Assessing your results, answer the following questions:
Which correlations appear to be linear?
Which correlations (if any) are statistically significant?
Which correlation is the strongest?
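To see why the choice between the two coefficients matters, the following minimal sketch computes both by hand in Python. It is only an illustration of the definitions, not a description of how jamovi computes them. The data values are made up (they are not the red crab data), and the function names `pearson`, `average_ranks` and `spearman` are hypothetical helpers introduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upwards; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up values for illustration only: weight grows with the cube of width,
# so the relationship is monotonic but clearly not linear.
width = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
weight = [w ** 3 for w in width]

print(f"Pearson r    = {pearson(width, weight):.3f}")
print(f"Spearman rho = {spearman(width, weight):.3f}")
```

In this made-up example the Spearman coefficient is larger than the Pearson coefficient, because the ranks agree perfectly even though the points do not fall on a straight line.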
Given that we observe a strong linear correlation between the CLAW and OtherClaw variables, suppose we would like to construct a simple linear regression (SLR) model that uses CLAW values to predict OtherClaw values.
Select the Linear Regression option in the Regression tab, and drag CLAW (our independent variable in this scenario) to the Covariates box, and OtherClaw to the Dependent Variable box.
You should see some output appear automatically in the Results section.
If the model is successful, this could mean, for example, that in future we would only need to measure one claw per crab.
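The intercept and slope that appear in the Results section are least-squares estimates. As a rough sketch of what that means, the Python below fits a straight line by hand and reports \(R^2\). The claw lengths are made up for illustration (they are not the red crab data), and `fit_slr` is a hypothetical helper name; jamovi's own output also includes standard errors, p-values and confidence intervals, which this sketch does not attempt.

```python
def fit_slr(x, y):
    """Least-squares fit of y = intercept + slope * x; also returns R^2."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares give the proportion of variance explained.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Made-up claw lengths for illustration only (NOT the red crab data).
claw = [10.0, 12.0, 14.0, 16.0, 18.0]
other_claw = [10.4, 11.9, 14.2, 15.8, 18.1]

slope, intercept, r_squared = fit_slr(claw, other_claw)
print(f"OtherClaw = {intercept:.3f} + {slope:.3f} * CLAW  (R^2 = {r_squared:.3f})")
```

A slope close to 1 together with an \(R^2\) close to 1 is the pattern we would hope for if one claw is to stand in for the other.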
Scroll down to the Assumption Checks section, expand it, and tick the Normality test, Q-Q plot of residuals, and Residuals plots boxes.
Scroll down a little further to the Model Coefficients section, expand it, and tick the Confidence interval box under the Estimate subheading.
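The Q-Q plot of residuals compares the ordered residuals with the values we would expect from a normal distribution. The sketch below builds the points of such a plot in Python, as one way to see what the plot represents. The residuals are made up, `qq_points` is a hypothetical helper, and the plotting positions \((i - 0.5)/n\) are an assumption made here; jamovi may use a slightly different convention.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical quantile, standardised residual) pairs, in order."""
    n = len(residuals)
    centre, spread = mean(residuals), stdev(residuals)
    standardised = sorted((r - centre) / spread for r in residuals)
    # Assumed plotting positions: (i - 0.5) / n for i = 1, ..., n.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, standardised))

# Made-up residuals for illustration only.
residuals = [-0.4, 0.1, 0.3, -0.2, 0.5, -0.3, 0.0]

for expected, observed in qq_points(residuals):
    print(f"expected {expected:6.2f}   observed {observed:6.2f}")
```

If the residuals are approximately normal, the observed values track the expected values closely, which shows up on the plot as points lying near the diagonal line.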
Before we proceed further, we need to check the model assumptions. Complete the following:
While the Shapiro-Wilk normality test is typically sufficient as a check for the assumption of normality, we may also like to visualise the distribution of the residuals.
In the Linear Regression section, expand the Save section, and tick the Residuals box. This will save the residuals obtained from fitting the SLR.
Navigate to the Exploration tab, and select Descriptives. You should see a new variable, Residuals.
Using your SLR model output, complete the following:
To conclude our SLR modelling, we can add the SLR regression line to a scatter plot of the data.
Navigate to the Exploration tab, select the Scatterplot option, and then construct the plot of OtherClaw vs CLAW.
Under the Regression Line subheading, select Linear. You should see that the fitted line fits the data almost perfectly.
Repeat all the steps from question 2, but this time model WEIGHT (dependent variable) against CW (independent variable).
Suppose that we want to extend our simple linear regression framework to take into account multiple independent variables.
We will now assess whether a multiple linear regression (MLR) can be used to help us model the weight of the red crabs.
To turn an SLR into an MLR in jamovi, we simply add more variables to the Covariates box, in the Linear Regression section.
Create a regression with WEIGHT as the dependent variable and CW, LEG and CLAW as the covariates.
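Behind the scenes, an MLR is fitted in the same least-squares sense as an SLR, just with more columns. The sketch below solves the normal equations \((X^\top X)\hat{\beta} = X^\top y\) in plain Python. It is a teaching illustration under stated assumptions, not how jamovi is implemented: `solve` and `fit_mlr` are hypothetical helpers, there are two predictors rather than three, and the data are synthetic values generated from the known equation \(y = 1 + 2x_1 + 3x_2\) so that the recovered coefficients can be checked.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_mlr(rows, y):
    """rows: one tuple of predictor values per observation.
    Returns [intercept, b1, ..., bp] from the normal equations."""
    design = [[1.0, *row] for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data generated from y = 1 + 2*x1 + 3*x2 (NOT the red crab data).
predictors = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
response = [9, 8, 22, 18, 35]

coefficients = fit_mlr(predictors, response)
print([round(c, 6) for c in coefficients])
```

With real data the response will not lie exactly on a plane, so the fitted coefficients are estimates with uncertainty, which is why the jamovi output also reports standard errors and p-values.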
Rather than simply having the one MLR model, as created in part 4.1, we can create a set of nested models via the Model Builder box in the Linear Regression section. This way, we can observe the impact of adding each additional independent variable at a time to our simpler model.
Click the blue Add New Block button twice, then drag LEG from Block 1 to Block 2, and CLAW from Block 1 to Block 3. In the Results window, you should now be able to cycle through the three models.
This process is known as stepwise regression. Since we are adding more variables to create a more complex model, we refer to this as forwards selection. If instead we were removing variables one at a time to create a simpler model, that would be backwards selection.
Normally, the decision on which variable to add/remove at each stage is based on a metric like the AIC.
For an MLR model, it is more appropriate to refer to the adjusted \(R^2\) value than the \(R^2\) value when assessing the fit of the model.
To obtain this value in jamovi, expand the Model Fit section, and select the Adjusted \(R^2\) box.
Also select the AIC box, to check the impact of each additional independent variable on the fit of the model.
What do you observe?
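Both metrics penalise extra independent variables, which is why they are preferred over the plain \(R^2\) when comparing nested models. The sketch below shows the standard formulas in Python. The function names are hypothetical, the summary values for the three models are made up, and the AIC shown is the usual Gaussian form \(-2\log L + 2k\), counting the slopes, the intercept and the error variance as parameters. It is an assumption here that jamovi counts parameters in the same way, so compare differences between models rather than absolute values.

```python
import math

def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for n observations and p independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

def gaussian_aic(rss, n, p):
    """AIC = -2 * log-likelihood + 2k, with k = p + 2
    (p slopes, one intercept, one error variance)."""
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * log_lik + 2 * (p + 2)

# Made-up summaries of three nested models (label, p, R^2, residual sum of squares).
n = 50
models = [
    ("Model 1", 1, 0.900, 100.0),
    ("Model 2", 2, 0.905, 95.0),
    ("Model 3", 3, 0.906, 94.0),
]

for label, p, r2, rss in models:
    print(f"{label}: adjusted R^2 = {adjusted_r_squared(r2, n, p):.4f}, "
          f"AIC = {gaussian_aic(rss, n, p):.2f}")
```

A lower AIC and a higher adjusted \(R^2\) both point towards the better model. If an added variable barely reduces the residual sum of squares, the penalty can outweigh the gain.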
Write out your fitted regression model for the full model with all 3 independent variables, and provide an interpretation of each of the fitted coefficient values.
For example interpretations of fitted coefficients, check slides 17-18 of the Topic 5B lecture.
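As a template for writing out the fitted model, the general form is shown below. The \(\hat{\beta}\) notation is an assumption made here and may differ from the notation used in the lectures; replace each \(\hat{\beta}\) with the corresponding estimate from the Model Coefficients table.

\[
\widehat{\text{WEIGHT}} = \hat{\beta}_0 + \hat{\beta}_1\,\text{CW} + \hat{\beta}_2\,\text{LEG} + \hat{\beta}_3\,\text{CLAW}
\]

Here \(\hat{\beta}_1\) is the estimated change in mean WEIGHT for a one-unit increase in CW, with LEG and CLAW held fixed, and similarly for \(\hat{\beta}_2\) and \(\hat{\beta}_3\).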
Check the MLR model assumptions, and discuss your findings.
A recent paper by Frisbie et al. (2024) discussed the possibility of increases in the arsenic levels in Bangladeshi drinking water, due to rising sea levels and flooding, induced by climate change. The paper is freely accessible at https://doi.org/10.1371/journal.pone.0295172.
A copy of the data analysed in the study is available in this week’s tile on LMS, in the file Bangladesh_water_quality.omv. Download this data now, save it on your PC, and open it in jamovi.
Figure 5.1: Note. From File:Sundarbans, Bangladesh - 6 January 2023 (52612859696).jpg, by SentinelHub, 2023, Wikimedia Commons (https://commons.wikimedia.org/). CC BY 2.0 DEED This image contains modified Copernicus Sentinel data 2023.
As part of their analyses, Frisbie et al. (2024) conducted several simple linear regressions of Arsenic (micrograms per liter) against other variables.
To familiarise yourself with the data and their results, read through the Statistical analyses section of their paper, and then check Table 2 in the Results and Discussion section.
Also check Figures 4, 6, 11, 13 and 15.
Using the Bangladesh_water_quality.omv data set, carry out simple linear regressions in jamovi to reproduce the 5 simple linear regressions reported in Table 2 of Frisbie et al. (2024).
Confirm that the jamovi output matches that shown in Table 2 of the paper.
Comment on the \(R^2\) values for the 5 regressions. What do you observe?
Check the SLR model assumptions for each of the 5 regressions. Based on the results, and your findings from part 5.3, do you have any concerns about the validity of the SLR results reported in the paper?
Conduct a multiple linear regression on Arsenic, using all the independent variables from the Bangladesh_water_quality.omv data set. Make sure to check all the relevant model assumptions.
Write out your fitted model, and comment on the statistical significance of each of the estimated coefficients, and the overall fit of the model.
Based on your results from the previous parts of this question, do you think that simple and/or multiple linear regression analyses were appropriate to use in this context?
A recent paper by Seo & Takikawa (2022a) assessed the socioeconomic factors affecting national healthcare expenditure and health system performances in regions across Japan. The paper is freely accessible at https://doi.org/10.3390/healthcare10060968, but you are not expected to read it in detail.
As part of the study, regression models were fitted to the data collected. The original data, available from the Dryad website (see Seo & Takikawa, 2022b), has been cleaned and prepared for you, and is available in this week’s tile on LMS:
japan_medical_suburbs_cleaned.omv contains data on suburbs in the Chiba prefecture
japan_medical_central_cleaned.omv contains data on the central cities in Tokyo
Figure 6.1: Note. From File:Tokyo landscape seen from Shibuya Stream 10.jpg, by Syced, 2023, Wikimedia Commons (https://commons.wikimedia.org/). CC0 1.0 DEED
To begin, suppose we are interested in whether there is a relationship between the number of doctors in a specific region, and the total medical expenses in that region.
Conduct the following steps for both data sets separately:
Create a Correlation Matrix of total medical expenses (the Medical expenses*(JPY**) variable) against Number of Doctors.
Fit a simple linear regression model using these variables, with total medical expenses being the dependent variable.
Assess the model assumptions in both cases, and then provide a brief summary of your findings, including a comparison of the results between the two data sets.
Focusing on the japan_medical_suburbs_cleaned.omv data set, fit a multiple linear regression model of total medical expenses against all of the other variables, except Location and Suburb, and then answer the following:
Combine the two data sets into one and fit a multiple linear regression model of total medical expenses against all of the other variables, except Location.
Using the Suburb variable, check if being in the suburbs has a statistically significant impact on the total medical expenses.
Then, repeat parts a-c of part 6.2.
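To illustrate how an indicator such as Suburb enters a regression, the sketch below stacks two small tables and fits a line with the 0/1 indicator as the only predictor. Everything in it is hypothetical: the rows are made up (they are not the Japanese data), the column names are shortened, and `stack` and `dummy_slope` are helper names introduced here. In the full MLR the Suburb coefficient is adjusted for the other variables, so it will generally differ from this simple difference.

```python
def stack(first, second):
    """Combine two tables (lists of dicts) after checking the columns match."""
    if set(first[0]) != set(second[0]):
        raise ValueError("the two data sets must have the same columns")
    return first + second

def dummy_slope(indicator, y):
    """Least-squares slope of y on a 0/1 indicator."""
    n = len(y)
    x_bar, y_bar = sum(indicator) / n, sum(y) / n
    sxy = sum((x - x_bar) * (v - y_bar) for x, v in zip(indicator, y))
    sxx = sum((x - x_bar) ** 2 for x in indicator)
    return sxy / sxx

# Made-up rows for illustration only (hypothetical, shortened column names).
suburbs = [
    {"Location": "Region A", "Expenses": 300.0, "Suburb": 1},
    {"Location": "Region B", "Expenses": 320.0, "Suburb": 1},
    {"Location": "Region C", "Expenses": 310.0, "Suburb": 1},
]
central = [
    {"Location": "Region D", "Expenses": 400.0, "Suburb": 0},
    {"Location": "Region E", "Expenses": 420.0, "Suburb": 0},
]

combined = stack(suburbs, central)
indicator = [row["Suburb"] for row in combined]
expenses = [row["Expenses"] for row in combined]

difference = dummy_slope(indicator, expenses)
print(f"Suburb coefficient (suburb minus central mean): {difference:.1f}")
```

With a single 0/1 predictor, the fitted slope equals the difference between the two group means, which is a useful way to read the sign and size of the Suburb coefficient.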
Frisbie, S.H., Mitchell, E.J., and Molla, A.R. (2024). Sea level rise from climate change is expected to increase the release of arsenic into Bangladesh’s drinking well water by reduction and by the salt effect. PLOS ONE 19(1): e0295172. https://doi.org/10.1371/journal.pone.0295172
Green, P. T. (1997). Red crabs in rain forest on Christmas Island, Indian Ocean: activity patterns, density and biomass. Journal of Tropical Ecology, 13(1), 17-38.
Seo, Y., and Takikawa, T. (2022a). Regional Variation in National Healthcare Expenditure and Health System Performance in Central Cities and Suburbs in Japan. Healthcare. 10(6):968. https://doi.org/10.3390/healthcare10060968
Seo, Y., and Takikawa, T. (2022b). Regional Variation in National Healthcare Expenditure and Health System Performance in Central Cities and Suburbs in Japan [Dataset]. Dryad. https://doi.org/10.5061/dryad.h18931znw
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.