Introduction

Rpubs link: http://rpubs.com/javide/ph

The purpose of this presentation is to illustrate, via a ficticious case study, the usage of statistical methods of simple linear regression to analyse the relationship of two continuous variables and how, without proper contextualisation, a signficant statistical result can lead to the wrong conclusions.

Problem Statement

A microbiology laboratory studying the waters of the Husdon River (NY, USA) is interested in knowing whether there is any correlation between the acidity of the water and its dissolved content of oxygen.
To find out whether there is any relationship between these two continuous variables, a random sample with a representative number of observations is analysed using:

descriptive statistics
simple linear regression

Data

A sample of observations taken every 15 minutes at station Piermont Pier NY from the 5th to the 8th of October 2017 was sourced from http://hudson.dl.stevens-tech.edu/hrecos/d/index.shtml.

Complete metadata information and sampling method is provided: http://www.hrecos.org/images/Data/forweb/HRPMNTH.Metadata.pdf

Relevant data for this investigation:

Acidity: Hydrogen ion concentration (pH)
Dissolved oxygen: mg/L

Data pre-processing: removed data header and renamed variables for consistency

HR <- read_csv("hudson.csv")
HR

Descriptive Statistics:

Dissolved oxygen:

precision <- 2
HR %>% summarise(Min = min(Dissolved_Oxygen, na.rm = TRUE),
         Q1 = quantile(Dissolved_Oxygen, probs = .25, na.rm = TRUE) %>% round(precision),
         Median = median(Dissolved_Oxygen, na.rm = TRUE) %>% round(precision),
         Q3 = quantile(Dissolved_Oxygen, probs = .75, na.rm = TRUE) %>% round(precision),
         Max = max(Dissolved_Oxygen, na.rm = TRUE),
         Mean = mean(Dissolved_Oxygen, na.rm = TRUE) %>% round(precision),
         SD = sd(Dissolved_Oxygen, na.rm = TRUE) %>% round(precision),
         n = n(),
         Missing = sum(is.na(Dissolved_Oxygen))) -> table1
knitr::kable(table1)

Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
4.88	5.9	6.12	6.31	6.97	6.07	0.4	384	0

Acidity:

HR %>% summarise(Min = min(Acidity, na.rm = TRUE),
         Q1 = quantile(Acidity, probs = .25, na.rm = TRUE) %>% round(precision),
         Median = median(Acidity, na.rm = TRUE) %>% round(precision),
         Q3 = quantile(Acidity, probs = .75, na.rm = TRUE) %>% round(precision),
         Max = max(Acidity, na.rm = TRUE),
         Mean = mean(Acidity, na.rm = TRUE) %>% round(precision),
         SD = sd(Acidity, na.rm = TRUE) %>% round(precision),
         n = n(),
         Missing = sum(is.na(Acidity))) -> table2
knitr::kable(table2)

Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
7.57	7.71	7.73	7.76	7.84	7.73	0.05	384	0

Visualisation: Dissolved oxygen

# histogram attributes
color = "slategray2"
breaks = 10

HR$Dissolved_Oxygen %>% hist(breaks=breaks, main="Empirical frequency of Dissolved oxygen", 
                             xlab="Dissolved oxygen (mg/L)", col=color)

Dissolved oxygen appears to be approximately normally distributed and slightly skewed to the left.

Visualisation: Acidity

HR$Acidity %>% hist(breaks=breaks, main="Empirical frequency of Acidity", 
                    xlab="Acidity (pH)", col=color)

Acidity appears to be approximately normally distributed and slightly skewed to the left.

Visualisation: Dissolved oxygen normal quantiles

qqPlot(HR$Dissolved_Oxygen, dist = "norm", 
       ylab="Dissolved oxygen quantiles", 
       xlab="Normal theoretical quantiles", 
       main="Q-Q Plot")

The Q-Q Plot confirms the essential normality of the Dissolved oxygen observations and the absence of significant outliers.

Visualisation: Acidity normal quantiles

qqPlot(HR$Acidity, dist = "norm", 
       ylab="Acidity quantiles", 
       xlab="Normal theoretical quantiles", 
       main="Q-Q Plot")

The Q-Q Plot confirms the essential normality of the Dissolved oxygen observations and the absence of significant outliers.

Hypothesis Testing for Linear Regression

The test focuses on the slope of the regression line

\[y = \alpha + \beta x + \epsilon\]

If the slope of the regression line is significantly different from 0, then we can conclude that there is a significant relationship between Dissolved oxygen and water Acidity.

\[H_0: \beta = 0 \] The Null Hypothesis for the linear regression is that there is no association between Dissolved oxygen and Acidity, which means the slope of the regression line, the beta coefficient, is 0.

\[H_A: \beta \ne 0 \] The Alternate Hypothesis for the linear regression is that there is an association between Dissolved oxygen and Acidity, which means the slope of the regression line, the beta coefficient, is not equal to 0.

Testing Linear Regression

linearModel <- lm (Acidity ~ Dissolved_Oxygen, data = HR)
linearModel %>% summary()

## 
## Call:
## lm(formula = Acidity ~ Dissolved_Oxygen, data = HR)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.061671 -0.013672  0.001887  0.014326  0.054374 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.020909   0.017439  402.60   <2e-16 ***
## Dissolved_Oxygen 0.116759   0.002867   40.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02221 on 382 degrees of freedom
## Multiple R-squared:  0.8128, Adjusted R-squared:  0.8123 
## F-statistic:  1659 on 1 and 382 DF,  p-value: < 2.2e-16

linearModel %>% confint()

##                     2.5 %    97.5 %
## (Intercept)      6.986620 7.0551972
## Dissolved_Oxygen 0.111123 0.1223956

Linear regression visualitation

plot(Acidity ~ Dissolved_Oxygen, data = HR, 
     xlab = "Dissolved oxygen (mg/L)", 
     ylab = "Acidity (pH)", 
     main="Scatter plot with regression line")
abline(linearModel, col = "red")

From the Scatter plot with regression line emerges a strong linear relationship between the Dissolved oxygen and the water Acidity.

Verification of assumptions for linear regression

plot(linearModel, which=1)

Since the red line, indicating the relationship between fitted values and residuals, is nearly flat, this is a good indicator that we are modelling a linear relationship.

Verification of assumptions for linear regression

plot(linearModel, which=2)

The Q-Q plot suggests there are no major deviations from normality, which means that it is safe to assume the residuals are approximately normally distributed.

Verification of assumptions for linear regression

plot(linearModel, which=3)

This plots verifies Homoscedasticity. In fact, the rel line is close to flat and variance in the square root of the standardised residuals is consistent across the fitted values.

Verification of assumptions for linear regression

plot(linearModel, which=5)

In the Residuals vs Leverage plot we check the presence of outliers which can unduly influence the fit of the regression model. Since no outliers fall beyond the bands of the Cook’s distances, there is no evidence of influential cases.

Verification of assumptions for linear regression

Result of the analysis:

Independence: this is assumed as the data have been scientifically collected
Linearity: from the scatter plot we have verfied that there is a linear bivariate relationship
Normality of residuals: verified in the QQ Plot
Homoscedasticity: verified in the Scale-Location plot

Discussion

A linear regression model was fitted to predict the dependent variable, Acidity, using measures of Dissolved oxygen as a single predictor. Prior to fitting the regression, a scatter plot assessing the bivariate relationship between Acidity and Dissolved oxygen was inspected. The scatter plot demonstrated evidence of a strong positive linear relationship. Other non-linear trends were ruled out. The overall regression model was statistically significant: \[F(1, 382) = 1670,\space p<.001\]

and explained 81.34% of the variability in Acidity \[R^2=.8134\] The estimated regression equation was \[Acidity = 7.044 + .009 * DissolvedOxygen\] The positive slope for Dissolved oxygen was statistically significant:

\[b=0.009,\space t(382)=40.87, \space p<.001, 95% CI [0.009, 0.010]\] Final inspection of the residuals supported normality and homoscedasticity.

Conclusion

Based on the data, one could infer that there is a strong, positive correlation between pH and concentration of dissolved oxygen. However, in practice it has been documented that oxygenation of water is not proportional to pH (Hyslope et al. 2015). This underscores the fact that correlation does not imply causation. Altough pH and oxygen concentration appear related in this analysis, we know that they are not and that other external factors are likely to account for this apparent correlation.

Apparent correlation between pH and dissolved oxygen in water

A statistical analysis

Introduction

Problem Statement

Data

Descriptive Statistics:

Visualisation: Dissolved oxygen

Visualisation: Acidity

Visualisation: Dissolved oxygen normal quantiles

Visualisation: Acidity normal quantiles

Hypothesis Testing for Linear Regression

Testing Linear Regression

Linear regression visualitation

Verification of assumptions for linear regression

Verification of assumptions for linear regression

Verification of assumptions for linear regression

Verification of assumptions for linear regression

Verification of assumptions for linear regression

Discussion

Conclusion

References