Xiangkai Wang (s3509521) & Davide Baj (s3689919)
Last updated: 15 October, 2017
The purpose of this presentation is to illustrate, via a ficticious case study, the usage of statistical methods of simple linear regression to analyse the relationship of two continuous variables and how, without proper contextualisation, a signficant statistical result can lead to the wrong conclusions.
A microbiology laboratory studying the waters of the Husdon River (NY, USA) is interested in knowing whether there is any correlation between the acidity of the water and its dissolved content of oxygen.
To find out whether there is any relationship between these two continuous variables, a random sample with a representative number of observations is analysed using:
descriptive statistics
simple linear regression
A sample of observations taken every 15 minutes at station Piermont Pier NY from the 5th to the 8th of October 2017 was sourced from http://hudson.dl.stevens-tech.edu/hrecos/d/index.shtml.
Complete metadata information and sampling method is provided: http://www.hrecos.org/images/Data/forweb/HRPMNTH.Metadata.pdf
Relevant data for this investigation:
Data pre-processing: removed data header and renamed variables for consistency
HR <- read_csv("hudson.csv")
HRDissolved oxygen:
precision <- 2
HR %>% summarise(Min = min(Dissolved_Oxygen, na.rm = TRUE),
Q1 = quantile(Dissolved_Oxygen, probs = .25, na.rm = TRUE) %>% round(precision),
Median = median(Dissolved_Oxygen, na.rm = TRUE) %>% round(precision),
Q3 = quantile(Dissolved_Oxygen, probs = .75, na.rm = TRUE) %>% round(precision),
Max = max(Dissolved_Oxygen, na.rm = TRUE),
Mean = mean(Dissolved_Oxygen, na.rm = TRUE) %>% round(precision),
SD = sd(Dissolved_Oxygen, na.rm = TRUE) %>% round(precision),
n = n(),
Missing = sum(is.na(Dissolved_Oxygen))) -> table1
knitr::kable(table1)| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 4.88 | 5.9 | 6.12 | 6.31 | 6.97 | 6.07 | 0.4 | 384 | 0 |
Acidity:
HR %>% summarise(Min = min(Acidity, na.rm = TRUE),
Q1 = quantile(Acidity, probs = .25, na.rm = TRUE) %>% round(precision),
Median = median(Acidity, na.rm = TRUE) %>% round(precision),
Q3 = quantile(Acidity, probs = .75, na.rm = TRUE) %>% round(precision),
Max = max(Acidity, na.rm = TRUE),
Mean = mean(Acidity, na.rm = TRUE) %>% round(precision),
SD = sd(Acidity, na.rm = TRUE) %>% round(precision),
n = n(),
Missing = sum(is.na(Acidity))) -> table2
knitr::kable(table2)| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 7.57 | 7.71 | 7.73 | 7.76 | 7.84 | 7.73 | 0.05 | 384 | 0 |
# histogram attributes
color = "slategray2"
breaks = 10
HR$Dissolved_Oxygen %>% hist(breaks=breaks, main="Empirical frequency of Dissolved oxygen",
xlab="Dissolved oxygen (mg/L)", col=color) Dissolved oxygen appears to be approximately normally distributed and slightly skewed to the left.
HR$Acidity %>% hist(breaks=breaks, main="Empirical frequency of Acidity",
xlab="Acidity (pH)", col=color) Acidity appears to be approximately normally distributed and slightly skewed to the left.
qqPlot(HR$Dissolved_Oxygen, dist = "norm",
ylab="Dissolved oxygen quantiles",
xlab="Normal theoretical quantiles",
main="Q-Q Plot")The Q-Q Plot confirms the essential normality of the Dissolved oxygen observations and the absence of significant outliers.
qqPlot(HR$Acidity, dist = "norm",
ylab="Acidity quantiles",
xlab="Normal theoretical quantiles",
main="Q-Q Plot") The Q-Q Plot confirms the essential normality of the Dissolved oxygen observations and the absence of significant outliers.
The test focuses on the slope of the regression line
\[y = \alpha + \beta x + \epsilon\]
If the slope of the regression line is significantly different from 0, then we can conclude that there is a significant relationship between Dissolved oxygen and water Acidity.
\[H_0: \beta = 0 \] The Null Hypothesis for the linear regression is that there is no association between Dissolved oxygen and Acidity, which means the slope of the regression line, the beta coefficient, is 0.
\[H_A: \beta \ne 0 \] The Alternate Hypothesis for the linear regression is that there is an association between Dissolved oxygen and Acidity, which means the slope of the regression line, the beta coefficient, is not equal to 0.
linearModel <- lm (Acidity ~ Dissolved_Oxygen, data = HR)
linearModel %>% summary()##
## Call:
## lm(formula = Acidity ~ Dissolved_Oxygen, data = HR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.061671 -0.013672 0.001887 0.014326 0.054374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.020909 0.017439 402.60 <2e-16 ***
## Dissolved_Oxygen 0.116759 0.002867 40.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02221 on 382 degrees of freedom
## Multiple R-squared: 0.8128, Adjusted R-squared: 0.8123
## F-statistic: 1659 on 1 and 382 DF, p-value: < 2.2e-16
linearModel %>% confint()## 2.5 % 97.5 %
## (Intercept) 6.986620 7.0551972
## Dissolved_Oxygen 0.111123 0.1223956
plot(Acidity ~ Dissolved_Oxygen, data = HR,
xlab = "Dissolved oxygen (mg/L)",
ylab = "Acidity (pH)",
main="Scatter plot with regression line")
abline(linearModel, col = "red") From the Scatter plot with regression line emerges a strong linear relationship between the Dissolved oxygen and the water Acidity.
plot(linearModel, which=1) Since the red line, indicating the relationship between fitted values and residuals, is nearly flat, this is a good indicator that we are modelling a linear relationship.
plot(linearModel, which=2) The Q-Q plot suggests there are no major deviations from normality, which means that it is safe to assume the residuals are approximately normally distributed.
plot(linearModel, which=3) This plots verifies Homoscedasticity. In fact, the rel line is close to flat and variance in the square root of the standardised residuals is consistent across the fitted values.
plot(linearModel, which=5) In the Residuals vs Leverage plot we check the presence of outliers which can unduly influence the fit of the regression model. Since no outliers fall beyond the bands of the Cook’s distances, there is no evidence of influential cases.
Result of the analysis:
A linear regression model was fitted to predict the dependent variable, Acidity, using measures of Dissolved oxygen as a single predictor. Prior to fitting the regression, a scatter plot assessing the bivariate relationship between Acidity and Dissolved oxygen was inspected. The scatter plot demonstrated evidence of a strong positive linear relationship. Other non-linear trends were ruled out. The overall regression model was statistically significant: \[F(1, 382) = 1670,\space p<.001\]
and explained 81.34% of the variability in Acidity \[R^2=.8134\] The estimated regression equation was \[Acidity = 7.044 + .009 * DissolvedOxygen\] The positive slope for Dissolved oxygen was statistically significant:
\[b=0.009,\space t(382)=40.87, \space p<.001, 95% CI [0.009, 0.010]\] Final inspection of the residuals supported normality and homoscedasticity.
Based on the data, one could infer that there is a strong, positive correlation between pH and concentration of dissolved oxygen. However, in practice it has been documented that oxygenation of water is not proportional to pH (Hyslope et al. 2015). This underscores the fact that correlation does not imply causation. Altough pH and oxygen concentration appear related in this analysis, we know that they are not and that other external factors are likely to account for this apparent correlation.
Hyslope, B., Huneycutt, A. and Cobbs, M., 2015. H2O No, I Can’t Breathe: How pH level affects dissolved oxygen concentration in water. Journal of Introductory Biology Investigations, 2(4).