These are the solutions for DA Computer Lab 5.
Please make sure to go over these after the lab session, and finish off any questions you may have missed during the lab.
No answer required.
In this question we prepared to assess the red crab data collected by Green (1997). The data set records values for variables including:
No answer required.
No answer required.
No answer required.
We observe that the variables CW and WEIGHT are highly positively correlated, with \(r = 0.932\), and that the correlation is statistically significant, with \(p < .001\). However, their relationship is non-linear, so it would be more appropriate to use the Spearman correlation rather than the Pearson correlation.
We observe that the Spearman correlation of 0.993 is larger than the original Pearson correlation.
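The difference between the two coefficients can be illustrated with a minimal sketch in plain Python. The data below are synthetic (a cubic relationship chosen purely for illustration, not the red crab measurements), and the helper names `pearson`, `ranks` and `spearman` are assumptions of this sketch rather than anything produced by jamovi.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic data (an assumption for illustration): weight grows like width cubed.
width = [float(w) for w in range(20, 121, 5)]
weight = [0.0003 * w ** 3 for w in width]

r = pearson(width, weight)
r_s = spearman(width, weight)
print(f"Pearson r = {r:.3f}, Spearman r_S = {r_s:.3f}")
```

Because the synthetic relationship is strictly increasing, the ranks of the two variables agree exactly and the Spearman correlation reaches 1, while the Pearson correlation is pulled below 1 by the curvature. This mirrors the pattern seen for CW and WEIGHT.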
Results for both correlation options are shown below.
Based on these results, it appears the pairs LEG vs CLAW, LEG vs OtherClaw, and CLAW vs OtherClaw all have linear relationships and can use the Pearson correlation. However, all other pairs exhibit non-linear characteristics, and should probably use the Spearman correlation.
LEG vs CLAW, LEG vs OtherClaw, and CLAW vs OtherClaw all have linear relationships
All of the correlations are found to be statistically significant at the 5% level of significance
The strongest correlation is 0.999, for CLAW vs OtherClaw (which is perhaps not surprising)
No answer required.
The three key assumptions covered in the lecture are linearity, constant variance of residuals, and normality of residuals.
Yes, the simple linear regression assumptions appear to be met. The Shapiro-Wilk test statistic of \(0.997\), with \(p = 0.860 > 0.05\), suggests the normality assumption is satisfied, and this is supported by inspection of the Normal Q-Q plot. The residuals vs fitted values plot shows a random scatter of points around the horizontal line at 0, with no obvious fanning or pattern present. This suggests the constant variance of residuals assumption is satisfied. From our initial correlation calculations, we can see there is a linear relationship between the variables.
The histogram of residuals appears approximately symmetric, and supports our conclusion that the normality assumption is satisfied.
The Pearson correlation between these two numeric variables was large, positive, and statistically significant at the 5% level of significance, with \(r = 0.999, p < .001\).
The fitted SLR model was \(\widehat{\text{Other Claw}} = 0.065 + 0.999 \times \text{Claw}\). We are 95% confident that on average, a 1 mm increase in Claw length leads to an increase in Other Claw length of between 0.993 mm and 1.004 mm.
The model has an excellent fit, with \(R^2 = 0.998\). The linear association is statistically significant, with Claw length being a significant predictor of Other Claw length (\(\hat{\beta}_1 = 0.999\), \(p < .001\)).
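As a quick consistency check (not part of the jamovi output), recall that in simple linear regression the coefficient of determination equals the square of the Pearson correlation. Using the rounded values reported above:

\[R^2 = r^2 = 0.999^2 \approx 0.998\]

This agrees with the reported \(R^2\). Note also that the 95% confidence interval for the slope, from 0.993 to 1.004, contains the value 1.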
The simple linear regression assumptions are all violated. The Shapiro-Wilk test statistic of \(0.869\), with \(p < .001\) suggests the normality assumption is violated. This is supported by inspection of the Normal Q-Q plot, which has standardized residuals exhibiting clear deviations from the theoretical line. The residuals vs fitted values plot shows clear curvature plus fanning for larger fitted values. This suggests the constant variance of residuals assumption is violated. From our initial correlation calculations, we can see there is a non-linear relationship between the variables.
The histogram of residuals is clearly asymmetric, and appears multimodal. This supports our conclusion that the normality assumption is violated.
Make sure you have assessed the correct set of residuals here - you may have two sets of residuals now, if you kept the residuals obtained from Question 2.5.
The Spearman correlation between these two numeric variables was large, positive, and statistically significant at the 5% level of significance, with \(r_S = 0.993, p < .001\).
The fitted SLR model was \(\widehat{\text{Weight}} = -192.422 + 4.864 \times \text{Carapace Width}\). We are 95% confident that on average, a 1 mm increase in Carapace width leads to an increase in Weight of between 4.637 grams and 5.091 grams.
The model has an excellent fit, with \(R^2 = 0.868\). The linear association is statistically significant, with Carapace width being a significant predictor of Weight (\(\hat{\beta}_1 = 4.864\), \(p < .001\)).
However, all of the SLR model assumptions were violated, meaning that these results are invalid and cannot be relied upon.
In cases like this, we could try to perform some transformation of the data, to try and obtain results which satisfy the model assumptions, but such techniques are beyond the scope of this subject.
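Although transformations are not examinable here, the idea can be sketched briefly. The example below uses synthetic power-law data (hypothetical values, not the red crab measurements) and a hand-written least squares helper, `fit_slr`, which is an assumption of this sketch. It shows how taking logarithms of both variables can turn a curved relationship into a straight line.

```python
import math

def fit_slr(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return b0, b1, 1 - ss_res / ss_tot

# Synthetic power-law data (hypothetical): weight = 0.0003 * width^3, no noise.
width = [float(w) for w in range(20, 121, 5)]
weight = [0.0003 * w ** 3 for w in width]

# Fit on the original scale, then on the log-log scale.
_, _, r2_raw = fit_slr(width, weight)
_, slope_log, r2_log = fit_slr([math.log(w) for w in width],
                               [math.log(g) for g in weight])
print(f"raw R^2 = {r2_raw:.3f}, log-log R^2 = {r2_log:.3f}, "
      f"log-log slope = {slope_log:.3f}")
```

On the log-log scale the synthetic relationship is exactly linear, and the fitted slope recovers the exponent of the power law. Real data would include noise, so the assumption checks would still need to be repeated on the transformed model.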
No answer required.
We observe that the AIC reduces as each new independent variable is added. The larger reduction occurs when moving from an SLR to an MLR. This is also reflected in the changes in the adjusted \(R^2\) - while each additional variable improves the adjusted \(R^2\) (suggesting the new variable is worth including in the model), the greater benefit to the model fit is obtained from the change from SLR to MLR. The change from model 2 to model 3 is not as beneficial.
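To see why the adjusted \(R^2\) can show diminishing returns, it may help to compute it by hand using the standard formula \(1 - (1 - R^2)\frac{n - 1}{n - p - 1}\), where \(p\) is the number of explanatory variables. In the sketch below the sample size and the three \(R^2\) values are hypothetical, chosen only to mimic a large gain followed by a small one; they are not the jamovi results for the crab models.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for a model with p explanatory variables and n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Hypothetical sample size and R^2 values, for illustration only.
n = 50
models = [
    ("Model 1 (1 variable)", 0.800, 1),
    ("Model 2 (2 variables)", 0.900, 2),
    ("Model 3 (3 variables)", 0.905, 3),
]

for name, r2, p in models:
    adj = adjusted_r_squared(r2, n, p)
    print(f"{name}: R^2 = {r2:.3f}, adjusted R^2 = {adj:.4f}")
```

The penalty term grows with each added variable, so the adjusted \(R^2\) only increases when the new variable improves the fit by more than the penalty costs.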
The fitted regression model is
\[\widehat{\text{Weight}} = -148.007 + 1.686 \times \text{CW} - 3.585 \times \text{Claw} + 11.584 \times \text{Leg}\] All coefficient estimates are statistically significant, with \(p\)-values \(< .001\).
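The meaning of a single coefficient in this equation can be checked numerically. The sketch below encodes the fitted equation in a helper, `predict_weight` (a name introduced here for illustration), and compares two predictions that differ only in Leg. The input measurements are hypothetical, and Leg is assumed to be recorded in mm.

```python
def predict_weight(cw, claw, leg):
    """Fitted MLR equation from the solution: predicted weight in grams."""
    return -148.007 + 1.686 * cw - 3.585 * claw + 11.584 * leg

# Hypothetical measurements (assumed to be in mm), chosen only for illustration.
cw, claw, leg = 100.0, 40.0, 30.0

baseline = predict_weight(cw, claw, leg)
longer_leg = predict_weight(cw, claw, leg + 1)  # only Leg changes, by one unit

print(f"Change in predicted weight: {longer_leg - baseline:.3f} g")
```

With CW and Claw held fixed, the difference between the two predictions equals the Leg coefficient, 11.584. This is what "keeping other variables fixed" means when interpreting an MLR coefficient.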
The multiple linear regression assumptions are all violated. The Shapiro-Wilk test statistic of \(0.948\), with \(p < .001\), suggests the normality assumption is violated. The Normal Q-Q plot, which has standardized residuals exhibiting clear deviations from the theoretical line, supports this conclusion. The residuals vs fitted values plot shows clear curvature plus fanning for larger fitted values. This suggests the constant variance of residuals assumption is violated.
In practice this example would require the transformation of some of the variables and some more advanced regression techniques due to the violation of the model assumptions. It is a very good example of where the regression output looks good (high adjusted \(R^2\), significant variables) but in fact there are major problems.
No answer required.
In this question we considered data from a recent paper by Frisbie et al. (2024) on the possibility of increases in the arsenic levels in Bangladeshi drinking water, due to rising sea levels and flooding, induced by climate change. The paper is freely accessible at https://doi.org/10.1371/journal.pone.0295172.
Data from the study was provided in the file Bangladesh_water_quality.omv.
Please note that none of the components of this question are intended as a criticism of the original paper. Rather, the questions are included to promote discussion and active learning. The content we consider in the computer lab focuses on only one component of the paper, and it should be noted that the paper includes several other advanced analyses which are beyond the scope of this subject.
No answer required.
Arsenic (micrograms per liter) is the response variable. Explanatory variables include:
Results from the jamovi SLR modelling are included below. Students should compare these with the models reported in Table 2 of the Frisbie et al. (2024) paper, and confirm the results match.
The \(R^2\) values for the 5 regressions are typically very weak, with:
For all SLRs, we observe that the Shapiro-Wilk tests suggest the normality assumption is violated. This is supported by the Normal Q-Q plots. The residuals vs fitted values plots show clear signs of fanning and/or other issues.
For conciseness, not all results are presented here, only results for Arsenic vs ORP.
These results suggest that we should have concerns about the validity of the SLR results reported in the paper.
It does not appear that the SLR model assumptions were checked in the paper.
The fitted model is \[\widehat{\text{Arsenic} (\mu \text{g/L})} = 57.447 + 0.005 \times SC - 9.894 \times pH - 0.392 \times ORP + 4.546 \times T(C) - 7.970 \times DO\]
The overall fit of the model is weak, with an adjusted \(R^2 = 0.192\).
The only statistically significant coefficient is for the ORP explanatory variable, with \(\hat{\beta_3} = -0.392, \, p = 0.002 < .05\). We can conclude that ORP is a significant predictor of Arsenic concentration.
We are 95% confident that on average, with all other variables fixed, a 1 unit increase in ORP (mV) leads to a decrease in the Arsenic concentration level of between 0.150 and 0.635 \(\mu\)g/L.
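As a quick check on the interval (an aside, not part of the jamovi output), a confidence interval for a regression coefficient is symmetric about the estimate, so its midpoint should match the size of the reported coefficient:

\[\frac{0.150 + 0.635}{2} = 0.3925 \approx 0.392 = |\hat{\beta}_3|\]

The small difference is due to rounding in the reported values.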
However, any interpretation of the results is suspect as the MLR model assumptions are clearly violated.
Based on our results from the previous parts of this question, it does not appear that either simple or multiple linear regression analyses were appropriate to use in this context (unless transformations of the data were made, in order to ensure the model assumptions were not violated).
No answer required.
In this question we considered data from a recent paper by Seo & Takikawa (2022a) on the socioeconomic factors affecting national healthcare expenditure and health system performances in regions across Japan.
Data from the study was provided in the files:
japan_medical_suburbs_cleaned.omv contains data on suburbs in the Chiba prefecture
japan_medical_central_cleaned.omv contains data on the central cities in Tokyo
Simple linear regression models were fitted to Japanese medical data, with total medical costs (JPY) regressed against number of doctors. Data sets for central Tokyo \((n = 23)\) and the Chiba prefecture \((n = 27)\) were assessed separately.
For central Tokyo, the Pearson correlation of -0.197 was not statistically significant, with \(p = 0.367\), suggesting there was no significant linear relationship between total medical expenses and number of doctors.
However for the Chiba prefecture, the Pearson correlation was statistically significant, with \(r = 0.512\) and \(p = 0.006\).
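The two significance results can be reproduced approximately by hand. The sketch below uses the standard test statistic for a Pearson correlation, \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) with \(n - 2\) degrees of freedom, together with the reported correlations and sample sizes. The critical values are taken from standard \(t\) tables and are an addition of this sketch; jamovi reports the \(p\)-values directly.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-sided 5% critical values from standard t tables, keyed by degrees of freedom.
T_CRIT = {21: 2.080, 25: 2.060}

results = {}
for region, r, n in [("Central Tokyo", -0.197, 23),
                     ("Chiba prefecture", 0.512, 27)]:
    t = correlation_t_statistic(r, n)
    df = n - 2
    significant = abs(t) > T_CRIT[df]
    results[region] = (t, significant)
    verdict = "significant" if significant else "not significant"
    print(f"{region}: t = {t:.3f} on {df} df, {verdict} at the 5% level")
```

The central Tokyo statistic falls well inside the critical value while the Chiba statistic exceeds it, which is consistent with the reported \(p\)-values of 0.367 and 0.006.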
The fitted SLR models were:
The central Tokyo model has a (very) weak fit, with \(R^2 = 0.039\), while the Chiba prefecture model has a moderate fit, with \(R^2 = 0.262\).
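These \(R^2\) values can be cross-checked against the Pearson correlations reported earlier, since \(R^2 = r^2\) in simple linear regression:

\[(-0.197)^2 \approx 0.039, \qquad 0.512^2 \approx 0.262\]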
For the Chiba prefecture SLR, the number of doctors was a significant predictor of total medical costs (\(\hat{\beta_1} = 46.045\), \(p = 0.006\)).
We estimate that on average, for every one additional doctor in service, the total medical costs in the Chiba prefecture will increase by 46.045 JPY.
We observe that the relationship between number of doctors and total medical costs appears to be different between the central Tokyo and Chiba prefecture regions of Japan.
However, the results for our SLRs should be treated with caution, as assumption checks suggest that some of the SLR model assumptions are violated in both cases (we have relatively small sample sizes, which does not help):
Suburbs
Central
Checks suggest that there are no violations of the MLR model assumptions. The Shapiro-Wilk test statistic of 0.972 has a \(p\)-value of \(0.666 > 0.05\). The Normal Q-Q plot appears reasonable, and the residuals vs fitted values plot (while there is some clumping for smaller fitted values) shows the residuals exhibit reasonably constant variance, with no obvious pattern or curvature present.
A multiple linear regression model was fitted to medical data on \(n=27\) locations in the Chiba prefecture.
The Total medical expenses (JPY) variable was regressed against all other variables in the data set. The model had an excellent fit, with an adjusted \(R^2 = 0.997\). Several explanatory variables were found to have statistically significant coefficients, including Medical expenses for inpatients (JPY), Medical expenses for outpatients (JPY), Income (10,000JPY), Percentage of singles (%), and Percentage of households with own houses (%). The coefficient estimates suggest that these explanatory variables all have a positive relationship with total medical expenses (i.e., keeping other variables fixed, a one unit increase in any of the listed explanatory variables leads to an increase in total medical expenses).
You may wish to expand upon the results here, and discuss the confidence interval interpretations for each of the statistically significant coefficients too.
When the two data sets are combined, and the Suburb variable is assessed in conjunction with all the other variables, we find that it is not statistically significant, with the coefficient estimate having a \(p\)-value of \(0.392\).
Checks suggest that there are no major violations of the MLR model assumptions. The Shapiro-Wilk test statistic of 0.991 has a \(p\)-value of \(0.966 > 0.05\). The Normal Q-Q plot appears reasonable. There may be some concern of non-constant variance in the residuals vs fitted values plot, with a slight increase in variance of residuals between approximately 250000 and 300000, but overall the plot appears reasonable.
A multiple linear regression model was fitted to medical data on \(n=50\) locations in central Tokyo and the Chiba prefecture.
The Total medical expenses (JPY) variable was regressed against all other variables in the data set. The model had an excellent fit, with an adjusted \(R^2 = 0.998\). However, only three explanatory variables were found to have statistically significant coefficients, namely Medical expenses for inpatients (JPY), Medical expenses for outpatients (JPY), and Outpatient consultation rates (%). The coefficient estimates suggest that these explanatory variables all have a positive relationship with total medical expenses (i.e., keeping other variables fixed, a one unit increase in any of the listed explanatory variables leads to an increase in total medical expenses). It is logical that the medical expenses for inpatients and outpatients are important contributors to the total medical expenses.
Here we observe how different analyses of the data can lead to different results and inferences.
Frisbie, S.H., Mitchell, E.J., and Molla, A.R. (2024). Sea level rise from climate change is expected to increase the release of arsenic into Bangladesh’s drinking well water by reduction and by the salt effect. PLOS ONE 19(1): e0295172. https://doi.org/10.1371/journal.pone.0295172
Green, P. T. (1997). Red crabs in rain forest on Christmas Island, Indian Ocean: activity patterns, density and biomass. Journal of Tropical Ecology, 13(1), 17-38.
Seo, Y., and Takikawa, T. (2022a). Regional Variation in National Healthcare Expenditure and Health System Performance in Central Cities and Suburbs in Japan. Healthcare. 10(6):968. https://doi.org/10.3390/healthcare10060968
Seo, Y., and Takikawa, T. (2022b). Regional Variation in National Healthcare Expenditure and Health System Performance in Central Cities and Suburbs in Japan [Dataset]. Dryad. https://doi.org/10.5061/dryad.h18931znw
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematical and Physical Sciences and with the Department of Environment and Genetics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.