Topic 5: Regression Modelling in jamovi


These are the solutions for DA Computer Lab 5.

Please make sure to go over these after the lab session, and finish off any questions you may have missed during the lab.


Preparations: Red Crab Data

No answer required.

In this question we prepared to assess the red crab data, collected by Green (1997). This data has recorded values for variables including:

  • CW: Carapace width (mm)
  • LEG: Leg length (mm)
  • CLAW: Length of dominant claw (mm)
  • WEIGHT: Weight (grams)
  • SEX: The sex of the crab (1 = female, 2 = male)
  • OtherClaw: Length of other claw (mm)

a.

No answer required.

b.

No answer required.


1 Red Crab Correlations 🌱

No answer required.

1.1

1.2

We observe that the variables CW and WEIGHT are strongly positively correlated, with a Pearson correlation of \(r = 0.932\), and that this correlation is statistically significant, with \(p < .001\). However, their relationship is non-linear, so the Spearman correlation is more appropriate here than the Pearson correlation.

1.3

We observe that the Spearman correlation of 0.993 is larger than the original Pearson correlation.
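
To illustrate why the two measures can differ, here is a minimal Python sketch (run outside jamovi, on made-up data rather than the red crab data). It assumes a hypothetical cubic relationship between width and weight, and the helper functions `pearson`, `ranks` and `spearman` are written here purely for illustration. Because the relationship is monotone but curved, the Spearman correlation equals 1 (up to floating-point rounding), while the Pearson correlation is smaller.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks from 1 to n; assumes no ties (true for the toy data below)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical widths (mm) and a made-up cubic "weight"; NOT the red crab data.
width = [float(w) for w in range(40, 121, 5)]
weight = [0.0003 * w ** 3 for w in width]

r = pearson(width, weight)
r_s = spearman(width, weight)
print(f"Pearson r    = {r:.3f}")
print(f"Spearman r_S = {r_s:.3f}")
```

The Spearman correlation only uses the ordering of the values, so any strictly increasing relationship gives the maximum value, whereas the Pearson correlation is reduced by curvature.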

1.4

Results for both correlation options are shown below.

Based on these results, it appears the pairs LEG vs CLAW, LEG vs OtherClaw, and CLAW vs OtherClaw all have linear relationships and can use the Pearson correlation. However, all other pairs exhibit non-linear characteristics, and should probably use the Spearman correlation.

1.5

  1. LEG vs CLAW, LEG vs OtherClaw, and CLAW vs OtherClaw all have linear relationships

  2. All of the correlations are found to be statistically significant at the 5% level of significance

  3. The strongest correlation is 0.999, for CLAW vs OtherClaw (which is perhaps not surprising)


2 Red Crab Claw SLR 🌱

No answer required.

2.1

2.2

2.3

2.4

  1. The three key assumptions covered in the lecture are linearity, constant variance of residuals, and normality of residuals

  2. Yes, the simple linear regression assumptions appear to be met. The Shapiro-Wilk test statistic of \(0.997\), with \(p = 0.860 > 0.05\), suggests the normality assumption is satisfied, and this is supported by inspection of the Normal Q-Q plot. The residuals vs fitted values plot shows a random scatter of points around the horizontal line at 0, with no obvious fanning or pattern present, which suggests the constant variance of residuals assumption is satisfied. From our initial correlation calculations, we can see there is a linear relationship between the variables.

2.5

  1. No answer required.
  2. No answer required.

The histogram of residuals appears approximately symmetric, and supports our conclusion that the normality assumption is satisfied.

2.6

  1. The model fits the data extremely well, with an \(R^2\) value of 0.998.
  2. The fitted regression model is \[\widehat{\text{Other Claw}} = 0.065 + 0.999 \times \text{Claw}\]
  3. A simple linear regression model was fitted to data on \(n=273\) red crabs from Christmas Island. The Other Claw length (mm) of the red crabs was regressed against Claw length (mm).

The Pearson correlation between these two numeric variables was large, positive, and statistically significant at the 5% level of significance, with \(r = 0.999, p < .001\).

The fitted SLR model was \(\widehat{\text{Other Claw}} = 0.065 + 0.999 \times \text{Claw}\). We are 95% confident that on average, a 1 mm increase in Claw length leads to an increase in Other Claw length of between 0.993 mm and 1.004 mm.

The model has an excellent fit, with \(R^2 = 0.998\). The linear association is statistically significant, with Claw length being a significant predictor of Other Claw length (\(\hat{\beta}_1 = 0.999\), \(p < .001\)).
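
As a rough illustration of how the fitted model above can be used, the following Python sketch plugs a claw length into the reported equation. The coefficients and the slope confidence interval are the values reported above; the function name `predict_other_claw` and the 20 mm claw length are hypothetical choices made only for this example.

```python
# Coefficients as reported in the fitted SLR model above.
INTERCEPT = 0.065
SLOPE = 0.999
SLOPE_CI_95 = (0.993, 1.004)  # reported 95% confidence interval for the slope


def predict_other_claw(claw_mm):
    """Predicted Other Claw length (mm) for a given Claw length (mm)."""
    return INTERCEPT + SLOPE * claw_mm


# Hypothetical claw length, chosen only for illustration.
claw_mm = 20.0
print(f"Predicted Other Claw: {predict_other_claw(claw_mm):.3f} mm")

# A slope of exactly 1 would mean both claws increase at the same rate;
# check whether 1 lies inside the reported interval.
lower, upper = SLOPE_CI_95
print("Slope of 1 inside the 95% CI:", lower <= 1.0 <= upper)
```

Predictions of this kind are only sensible for claw lengths within the range observed in the data.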

2.7

3 Red Crab Weight SLR 🌱

The simple linear regression assumptions are all violated. The Shapiro-Wilk test statistic of \(0.869\), with \(p < .001\) suggests the normality assumption is violated. This is supported by inspection of the Normal Q-Q plot, which has standardized residuals exhibiting clear deviations from the theoretical line. The residuals vs fitted values plot shows clear curvature plus fanning for larger fitted values. This suggests the constant variance of residuals assumption is violated. From our initial correlation calculations, we can see there is a non-linear relationship between the variables.

3.1

  1. No answer required.
  2. No answer required.

The histogram of residuals is clearly asymmetric, and appears multimodal. This supports our conclusion that the normality assumption is violated.

Make sure you have assessed the correct set of residuals here - you may have two sets of residuals now, if you kept the residuals obtained from Question 2.5.

3.2

  1. In terms solely of \(R^2\), the model fits the data very well, with an \(R^2\) value of 0.868.
  2. The fitted regression model is \[\widehat{\text{Weight}} = -192.422 + 4.864 \times \text{Carapace Width}\]
  3. A simple linear regression model was fitted to data on \(n=273\) red crabs from Christmas Island. The Weight (grams) of the red crabs was regressed against Carapace width (mm).

The Spearman correlation between these two numeric variables was large, positive, and statistically significant at the 5% level of significance, with \(r_S = 0.993, p < .001\).

The fitted SLR model was \(\widehat{\text{Weight}} = -192.422 + 4.864 \times \text{Carapace Width}\). We are 95% confident that on average, a 1 mm increase in Carapace width leads to an increase in Weight of between 4.637 grams and 5.091 grams.

The model has an excellent fit, with \(R^2 = 0.868\). The linear association is statistically significant, with Carapace width being a significant predictor of Weight (\(\hat{\beta}_1 = 4.864\), \(p < .001\)).

However, all of the SLR model assumptions were violated, meaning that these results are invalid and cannot be relied upon.

In cases like this, we could try to perform some transformation of the data, to try and obtain results which satisfy the model assumptions, but such techniques are beyond the scope of this subject.


4 Red Crab Weight MLR 🌱

No answer required.

4.1

4.2

4.3

We observe that the AIC reduces as each new independent variable is added. The larger reduction occurs when moving from an SLR to an MLR. This is also reflected in the changes in the adjusted \(R^2\) - while each additional variable improves the adjusted \(R^2\) (suggesting the new variable is worth including in the model), the greater benefit to the model fit is obtained from the change from SLR to MLR. The change from model 2 to model 3 is not as beneficial.
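
The penalty that the adjusted \(R^2\) applies for extra variables can be sketched in Python as follows. This uses the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\) for a model with \(p\) predictors; the function name `adjusted_r2` is ours. The first call uses the \(R^2 = 0.868\) and \(n = 273\) reported for the Weight vs Carapace width SLR, while the second call is a purely hypothetical scenario and not one of the lab models.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 273  # sample size reported for the red crab data

# R^2 = 0.868 is the value reported for the Weight vs CW SLR (p = 1).
slr_adjusted = adjusted_r2(0.868, n, 1)
print(f"SLR adjusted R^2:           {slr_adjusted:.4f}")

# Hypothetical: two extra predictors that do not raise R^2 at all (p = 3).
no_gain_adjusted = adjusted_r2(0.868, n, 3)
print(f"Same R^2 with 3 predictors: {no_gain_adjusted:.4f}")
```

If extra predictors do not improve \(R^2\), the adjusted \(R^2\) falls, so an adjusted \(R^2\) that rises when a variable is added indicates the variable contributes more than the penalty costs.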

4.4

The fitted regression model is

\[\widehat{\text{Weight}} = -148.007 + 1.686 \times \text{CW} - 3.585 \times \text{Claw} + 11.584 \times \text{Leg}\]

All coefficient estimates are statistically significant, with \(p\)-values \(< .001\).

  • The y-intercept estimate of \(-148.007\) suggests that if a crab had CW, CLAW and LEG values all equal to 0, its predicted weight would be \(-148.007\) grams. Of course, this is not reasonable to interpret at face value, since the CW, CLAW and LEG values will always be non-zero.
  • The CW coefficient estimate of \(1.686\) suggests that, all other variables being fixed, a 1 unit increase in CW leads to an increase in weight of 1.686 grams on average.
  • The Claw coefficient estimate of \(-3.585\) suggests that, all other variables being fixed, a 1 unit increase in Claw leads to a decrease in weight of 3.585 grams on average.
  • The Leg coefficient estimate of \(11.584\) suggests that, all other variables being fixed, a 1 unit increase in Leg leads to an increase in weight of 11.584 grams on average.
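
The "all other variables being fixed" interpretation above can be checked numerically with a short Python sketch. The coefficients are those of the fitted model reported above; the function name `predict_weight` and the crab measurements are hypothetical and chosen only for illustration.

```python
# Coefficient estimates from the fitted MLR model reported above.
COEFFICIENTS = {"intercept": -148.007, "CW": 1.686, "CLAW": -3.585, "LEG": 11.584}


def predict_weight(cw, claw, leg):
    """Predicted weight (grams) from CW, CLAW and LEG (all in mm)."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["CW"] * cw
        + COEFFICIENTS["CLAW"] * claw
        + COEFFICIENTS["LEG"] * leg
    )


# Hypothetical measurements (mm); not taken from the data set.
base = predict_weight(cw=100.0, claw=25.0, leg=30.0)
longer_leg = predict_weight(cw=100.0, claw=25.0, leg=31.0)

print(f"Predicted weight:              {base:.3f} g")
print(f"With LEG increased by 1 mm:    {longer_leg:.3f} g")
print(f"Difference (= LEG coefficient): {longer_leg - base:.3f} g")
```

Changing one input by 1 mm while holding the others fixed changes the prediction by exactly that variable's coefficient, which is what each bullet point above describes.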

4.5

The multiple linear regression assumptions are all violated. The Shapiro-Wilk test statistic of \(0.948\), with \(p < .001\), suggests the normality assumption is violated. The Normal Q-Q plot, which has standardized residuals exhibiting clear deviations from the theoretical line, supports this conclusion. The residuals vs fitted values plot shows clear curvature plus fanning for larger fitted values. This suggests the linearity and constant variance of residuals assumptions are violated.

In practice this example would require the transformation of some of the variables and some more advanced regression techniques due to the violation of the model assumptions. It is a very good example of where the regression output looks good (high adjusted \(R^2\), significant variables) but in fact there are major problems.


5 Bangladeshi Water Quality Paper Review 🌳

No answer required.

In this question we considered data from a recent paper by Frisbie et al. (2024) on the possibility of increases in the arsenic levels in Bangladeshi drinking water, due to rising sea levels and flooding induced by climate change. The paper is freely accessible via the DOI given in the References.

Data from the study was provided in the file Bangladesh_water_quality.omv.

Please note that none of the components of this question are intended as a criticism of the original paper. Rather, the questions are included to promote discussion and active learning. The content we consider in the computer lab focuses on only one component of the paper, and it should be noted that the paper includes several other advanced analyses which are beyond the scope of this subject.

5.1

No answer required.

Arsenic (micrograms per liter) is the response variable. Explanatory variables include:

  • ORP: Oxidation-reduction potential
  • DO: Dissolved oxygen
  • SC: Specific conductance
  • pH: pH level
  • T(C): Temperature, in degrees Celsius

5.2

Results from the jamovi SLR modelling are included below. Students should compare these with the models reported in Table 2 of the Frisbie et al. (2024) paper, and confirm the results match.

Arsenic vs ORP

Arsenic vs DO

Arsenic vs SC

Arsenic vs pH

Arsenic vs T(C)

5.3

The \(R^2\) values for the 5 regressions are all low, indicating weak fits:

  • Arsenic vs ORP \(R^2 = 0.21\)
  • Arsenic vs DO \(R^2 = 0.106\)
  • Arsenic vs SC \(R^2 = 0.063\)
  • Arsenic vs pH \(R^2 = 0.055\)
  • Arsenic vs T(C) \(R^2 = 0.016\)

5.4

For all SLRs, we observe that the Shapiro-Wilk tests suggest the normality assumption is violated. This is supported by the Normal Q-Q plots. The residuals vs fitted values plots show clear signs of fanning and/or other issues.

For conciseness, not all results are presented here, only results for Arsenic vs ORP.

These results suggest that we should have concerns about the validity of the SLR results reported in the paper.

It does not appear that the SLR model assumptions were checked in the paper.

5.5

The fitted model is \[\widehat{\text{Arsenic} (\mu \text{g/L})} = 57.447 + 0.005 \times SC - 9.894 \times pH - 0.392 \times ORP + 4.546 \times T(C) - 7.970 \times DO\]

The overall fit of the model is weak, with an adjusted \(R^2 = 0.192\).

The only statistically significant coefficient is for the ORP explanatory variable, with \(\hat{\beta}_3 = -0.392, \, p = 0.002 < .05\). We can conclude that ORP is a significant predictor of Arsenic concentration.

We are 95% confident that on average, with all other variables fixed, a 1 unit increase in ORP (mV) leads to a decrease in the Arsenic concentration level of between 0.15 and 0.635 \(\mu\)g/L.

However, any interpretation of the results is suspect as the MLR model assumptions are clearly violated.
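
As a quick sanity check on the confidence interval quoted above, the Python sketch below confirms that the reported ORP estimate sits at the centre of the interval (up to rounding) and that the interval excludes zero. It assumes the interval is symmetric about the estimate, as is usual for regression coefficient intervals of the form estimate ± margin of error; the function name `ci_summary` is ours.

```python
def ci_summary(lower, upper):
    """Return the centre and half-width (margin of error) of an interval."""
    centre = (lower + upper) / 2
    half_width = (upper - lower) / 2
    return centre, half_width


# Reported values for the ORP coefficient: a decrease of between
# 0.15 and 0.635 corresponds to a slope interval of (-0.635, -0.15).
estimate = -0.392
lower, upper = -0.635, -0.15

centre, half_width = ci_summary(lower, upper)
print(f"Centre of interval: {centre:.4f} (reported estimate {estimate})")
print(f"Margin of error:    {half_width:.4f}")

# An interval that excludes 0 agrees with significance at the 5% level.
print("Interval excludes 0:", upper < 0 or lower > 0)
```

The same check can be applied to any of the coefficient intervals reported in these solutions.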

5.6

Based on our results from the previous parts of this question, it does not appear that either simple or multiple linear regression analyses were appropriate to use in this context (unless transformations of the data were made, in order to ensure the model assumptions were not violated).


6 Japanese Healthcare Expenses Regressions 🌳

No answer required.

In this question we considered data from a recent paper by Seo & Takikawa (2022a) on the socioeconomic factors affecting national healthcare expenditure and health system performances in regions across Japan.

Data from the study was provided in the files:

  • japan_medical_suburbs_cleaned.omv contains data on suburbs in the Chiba prefecture
  • japan_medical_central_cleaned.omv contains data on the central cities in Tokyo

6.1

Suburbs Results

Central Results

Suburbs Results

Central Results

Suburbs Results

Central Results

Summary

Simple linear regression models were fitted to Japanese medical data, with total medical costs (JPY) regressed against number of doctors. Data sets for central Tokyo \((n = 23)\) and the Chiba prefecture \((n = 27)\) were assessed separately.

For central Tokyo, the Pearson correlation of -0.197 was not statistically significant, with \(p = 0.367\), suggesting there was no significant linear relationship between total medical expenses and number of doctors.

However for the Chiba prefecture, the Pearson correlation was statistically significant, with \(r = 0.512\) and \(p = 0.006\).

The fitted SLR models were:

  • For central Tokyo: \[\widehat{\text{Total Medical Expenses}} = 244581.115 - 7.320 \times \text{Number of doctors}\]
  • For the Chiba prefecture: \[\widehat{\text{Total Medical Expenses}} = 269217.610 + 46.045 \times \text{Number of doctors}\]

The central Tokyo model has a (very) weak fit, with \(R^2 = 0.039\), while the Chiba prefecture model has a moderate fit, with \(R^2 = 0.262\).
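
For a simple linear regression, \(R^2\) equals the square of the Pearson correlation, so the values reported above can be cross-checked. The short Python sketch below does this using only the figures quoted in this summary; the dictionary layout and variable names are ours.

```python
# Pearson correlations and R^2 values as reported in the summary above.
reported = {
    "central Tokyo": {"r": -0.197, "r_squared": 0.039},
    "Chiba prefecture": {"r": 0.512, "r_squared": 0.262},
}

implied = {}
for region, values in reported.items():
    # In SLR, R^2 is the squared Pearson correlation (sign is lost).
    implied[region] = round(values["r"] ** 2, 3)
    print(f"{region}: r^2 = {implied[region]}, reported R^2 = {values['r_squared']}")
```

Note that squaring removes the sign, so the direction of each relationship must be read from the correlation or the slope, not from \(R^2\).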

For the Chiba prefecture SLR, the number of doctors was a significant predictor of total medical costs (\(\hat{\beta}_1 = 46.045\), \(p = 0.006\)).

We estimate that on average, for every one additional doctor in service, the total medical costs in the Chiba prefecture will increase by 46.045 JPY.

We observe that the relationship between number of doctors and total medical costs appears to be different between the central Tokyo and Chiba prefecture regions of Japan.

However, the results for our SLRs should be treated with caution, as assumption checks suggest that some of the SLR model assumptions are violated in both cases (we have relatively small sample sizes, which does not help):

Suburbs

  • The Shapiro-Wilk normality test results, with test statistic 0.894 and \(p = 0.01\) suggest the normality assumption is violated. This is supported by inspection of the Normal Q-Q plot, with points deviating clearly from the theoretical line.
  • The constant variance assumption may be satisfied, but it appears that an outlier may be skewing some of the results. Results may improve were this to be removed.

Central

  • Since the model does not show a statistically significant relationship between the two variables, checking the model assumptions is not as vital here.
  • The Shapiro-Wilk normality test results, with test statistic 0.961 and \(p = 0.480\), suggest the normality assumption is satisfied. However, inspection of the Normal Q-Q plot, which has many points deviating clearly from the theoretical line, suggests that there may be some underlying normality issues.
  • The constant variance assumption may be satisfied, but we have fewer observations for smaller fitted values, making interpretation less assured.

6.2

  1. The variables with statistically significant coefficients are:
  • The intercept (\(\hat{\beta}_0 = - 57394.814\), \(p = 0.021\))
  • Medical expenses for inpatients (JPY) (\(\hat{\beta}_1 = 0.999\), \(p < .001\))
  • Medical expenses for outpatients (JPY) (\(\hat{\beta}_2 = 1.115\), \(p < .001\))
  • Income (10,000JPY) (\(\hat{\beta}_8 = 78.542\), \(p = 0.004\))
  • Percentage of singles (%) (\(\hat{\beta}_{13} = 399.918\), \(p = 0.041\))
  • Percentage of households with own houses (%) (\(\hat{\beta}_{14} = 465.504\), \(p = 0.007\))

Checks suggest that there are no violations of the MLR model assumptions. The Shapiro-Wilk test statistic of 0.972 has a \(p\)-value of \(0.666 > 0.05\). The Normal Q-Q plot appears reasonable, and the residuals vs fitted values plot (while there is some clumping for smaller fitted values) shows the residuals exhibit reasonably constant variance, with no obvious pattern or curvature present.

  2. A brief example summary is included below.

A multiple linear regression model was fitted to medical data on \(n=27\) locations in the Chiba prefecture.

The Total medical expenses (JPY) variable was regressed against all other variables in the data set. The model had an excellent fit, with an adjusted \(R^2 = 0.997\). Several explanatory variables were found to have statistically significant coefficients, namely Medical expenses for inpatients (JPY), Medical expenses for outpatients (JPY), Income (10,000JPY), Percentage of singles (%), and Percentage of households with own houses (%). The coefficient estimates suggest that these explanatory variables all have a positive relationship with total medical expenses (i.e., keeping other variables fixed, a one unit increase in any of the listed explanatory variables leads to an increase in total medical expenses).

You may wish to expand upon the results here, and discuss the confidence interval interpretations for each of the statistically significant coefficients too.

6.3

When the two data sets are combined, and the Suburb variable is assessed in conjunction with all the other variables, we find that it is not statistically significant, with the coefficient estimate having a \(p\)-value of \(0.392\).

  1. The only variables with statistically significant coefficients are:
  • Medical expenses for inpatients (JPY) (\(\hat{\beta}_1 = 1.031\), \(p < .001\))
  • Medical expenses for outpatients (JPY) (\(\hat{\beta}_2 = 1.020\), \(p < .001\))
  • Outpatient consultation rates (%) (\(\hat{\beta}_{4} = 15.874\), \(p = 0.005\))

Checks suggest that there are no major violations of the MLR model assumptions. The Shapiro-Wilk test statistic of 0.991 has a \(p\)-value of \(0.966 > 0.05\). The Normal Q-Q plot appears reasonable. There may be some concern of non-constant variance in the residuals vs fitted values plot, with a slight increase in variance of residuals between approximately 250000 and 300000, but overall the plot appears reasonable.

  2. A brief example summary is included below.

A multiple linear regression model was fitted to medical data on \(n=50\) locations in central Tokyo and the Chiba prefecture.

The Total medical expenses (JPY) variable was regressed against all other variables in the data set. The model had an excellent fit, with an adjusted \(R^2 = 0.998\). However, only three explanatory variables were found to have statistically significant coefficients, namely Medical expenses for inpatients (JPY), Medical expenses for outpatients (JPY), and Outpatient consultation rates (%). The coefficient estimates suggest that these explanatory variables all have a positive relationship with total medical expenses (i.e., keeping other variables fixed, a one unit increase in any of the listed explanatory variables leads to an increase in total medical expenses). It is unsurprising that the medical expenses for inpatients and outpatients are important contributors to the total medical expenses.

Here we observe how different analyses of the data can lead to different results and inferences.


Excellent work! That concludes the DA Topic 5 jamovi computer lab.


References

Frisbie, S.H., Mitchell, E.J., and Molla, A.R. (2024). Sea level rise from climate change is expected to increase the release of arsenic into Bangladesh’s drinking well water by reduction and by the salt effect. PLOS ONE 19(1): e0295172. https://doi.org/10.1371/journal.pone.0295172

Green, P. T. (1997). Red crabs in rain forest on Christmas Island, Indian Ocean: activity patterns, density and biomass. Journal of Tropical Ecology, 13(1), 17-38.

Seo, Y., and Takikawa, T. (2022a). Regional Variation in National Healthcare Expenditure and Health System Performance in Central Cities and Suburbs in Japan. Healthcare. 10(6):968. https://doi.org/10.3390/healthcare10060968

Seo, Y., and Takikawa, T. (2022b). Regional Variation in National Healthcare Expenditure and Health System Performance in Central Cities and Suburbs in Japan [Dataset]. Dryad. https://doi.org/10.5061/dryad.h18931znw


These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.