As area increases, the pH value tends to increase as well, which suggests a positive correlation between area and pH value.
\[ n = 13 \]
\[ \sum_{i=1}^{13} X_i = 1844 \]
\[ \overline{X} = \frac{\sum_{i=1}^{13} X_i}{n} = 141.846 \]
\[ \sum_{i=1}^{13} Y_i = 90.5 \]
\[ \overline{Y} = \frac{\sum_{i=1}^{13} Y_i}{n} = 6.962 \]
\[ \sum_{i=1}^{13} X_i Y_i = 13397 \]
\[ \sum_{i=1}^{13} X_i^2 = 356828 \]
\[ S_{xx} = \sum_{i=1}^{13} X_i^2 - \frac{(\sum_{i=1}^{13} X_i)^2}{n} = 356828 - \frac{3400336}{13} = 95263.692 \]
\[ \sum_{i=1}^{13} Y_i^2 = 637.11 \]
\[ S_{yy} = \sum_{i=1}^{13} Y_i^2 - \frac{(\sum_{i=1}^{13} Y_i)^2}{n} = 637.11 - \frac{8190.25}{13} = 7.091 \]
\[ S_{xy} = \sum_{i=1}^{13} X_i Y_i - \frac{\sum_{i=1}^{13} X_i * \sum_{i=1}^{13} Y_i}{n} = 13397 - \frac{1844*90.5}{13} = 559.923 \]
Slope: \[ \hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{559.923}{95263.692} = 0.00588 \]
Intercept: \[ \hat{\alpha} = \overline{Y} - \hat{\beta}\overline{X} = 6.962 - 0.00588 * 141.846 = 6.128 \]
Coefficient of determination: \[ r^2 = \frac{S_{xy}^2}{S_{xx}S_{yy}} = \frac{559.923^2}{95263.692*7.091} = 0.464 \]
Standard deviation of the errors: \[ s = \sqrt{\frac{S_{yy} - \hat\beta S_{xy}}{n-2}} = \sqrt{\frac{7.091 - 0.00588*559.923}{13-2}} = 0.5877 \]
Standard deviations of the estimates: \[ S_{\hat\alpha} = s \sqrt{\frac{\sum X_i^2}{nS_{xx}}} = 0.5877 * \sqrt{\frac{356828}{13*95263.692}} = 0.315 \] \[ S_{\hat\beta} = \frac{s}{\sqrt{S_{xx}}} = \frac{0.5877}{\sqrt{95263.692}} = 0.0019 \]
Estimated equation of the line is: \[ \hat Y = 6.128 + 0.0059X \]
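As a cross-check, the hand calculations above can be reproduced in R. This is only a minimal sketch that starts from the summary totals quoted in the text. The variable names are illustrative and are not taken from the original analysis.

```r
# Summary totals quoted in the text (X = lake area, Y = pH).
n      <- 13
sum_x  <- 1844
sum_y  <- 90.5
sum_xy <- 13397
sum_x2 <- 356828
sum_y2 <- 637.11

# Corrected sums of squares and cross products.
sxx <- sum_x2 - sum_x^2 / n
syy <- sum_y2 - sum_y^2 / n
sxy <- sum_xy - sum_x * sum_y / n

# Least squares estimates and coefficient of determination.
beta_hat  <- sxy / sxx
alpha_hat <- sum_y / n - beta_hat * sum_x / n
r_squared <- sxy^2 / (sxx * syy)

# Residual standard deviation and standard errors of the estimates.
s        <- sqrt((syy - beta_hat * sxy) / (n - 2))
se_alpha <- s * sqrt(sum_x2 / (n * sxx))
se_beta  <- s / sqrt(sxx)

round(c(slope = beta_hat, intercept = alpha_hat, r2 = r_squared,
        s = s, se_alpha = se_alpha, se_beta = se_beta), 4)
```

Working from the unrounded slope avoids the small rounding differences that appear when intermediate values are truncated by hand.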
The ANOVA table is as follows:
Source | Sum of Squares | Degrees of Freedom | Mean Squares |
---|---|---|---|
Regression | 3.291 | 1 | 3.291 |
Residual | 3.800 | 11 | 0.345 |
Total | 7.091 | 12 | |
The test statistic is \[ f = \frac{SSR/1}{SSE/(n-2)} = \frac{3.291}{3.800/11} = 9.527 \]
We test \(H_0\): a linear relationship does not exist between y and x
against the alternative
\(H_1\): a linear relationship does exist between y and x.
At the 5% significance level the critical value is \[ F^{(1, 11)}_{0.05} = 4.8443 \] Since \[ 9.527 > 4.8443 \] we have \[ f > F^{(1, 11)}_{0.05} \] and we reject the null hypothesis. Hence, we can conclude that a linear relationship exists between area and pH level.
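The critical value and the decision can also be obtained with R's F distribution functions. The sketch below assumes only the sums of squares from the ANOVA table. The variable names are illustrative.

```r
# Sums of squares from the ANOVA table above.
ssr <- 3.291   # regression sum of squares
sse <- 3.800   # residual sum of squares
n   <- 13

# F statistic with 1 and n - 2 degrees of freedom.
f_stat <- (ssr / 1) / (sse / (n - 2))

# Upper 5% point of F(1, 11) and the corresponding p-value.
f_crit  <- qf(0.95, df1 = 1, df2 = n - 2)
p_value <- pf(f_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

c(f = f_stat, critical = f_crit, p = p_value)
f_stat > f_crit   # TRUE means H0 is rejected at the 5% level
```

Comparing the p-value with 0.05 leads to the same decision as comparing the statistic with the critical value.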
The same model fitted with the lm function in R gives the following summary:
##
## Call:
## lm(formula = pHLakesOntario.pH ~ pHLakesOntario.Area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8269 -0.6741 0.1607 0.3730 0.6967
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.127822 0.315483 19.424 7.31e-10 ***
## pHLakesOntario.Area 0.005878 0.001904 3.087 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5877 on 11 degrees of freedom
## Multiple R-squared: 0.4641, Adjusted R-squared: 0.4154
## F-statistic: 9.527 on 1 and 11 DF, p-value: 0.01035
Residuals vs Fitted:
The residuals seem to be evenly spread around the zero line. The smoothed trend (the red line) sways away from the zero line because of an outlier.
Normal Q-Q plot: The points more or less follow a straight line, which supports the normality of the residuals. A few points deviate from the line, but this could be due to outliers.
Scale-Location: The trend line is not horizontal and the residuals are not equally spread along it. This suggests that the variance is not constant and that outliers have a noticeable influence on the plot.
Residuals vs Leverage: One point lies beyond the Cook's distance contour, which means that this point is highly influential. It could be an outlier and could explain the distortions in the previous plots.
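Estimated equation of the line is: \[ \hat Y = 6.128 + 0.0059X \] Therefore, the predicted pH level for another lake in the same region with area 2050 sq. km, computed with the unrounded coefficients, is: \[ \hat Y = 6.1278 + 0.0058776 * 2050 = 18.1769 \] With \(t^{(11)}_{0.005} = 3.106\) and a standard error of prediction of 3.6843: \[ \hat Y = 18.1769 \pm (3.106 * 3.6843) \] Hence, the 99% prediction interval for the pH level is \[ 18.1769 \pm 11.4437 = (6.7332, 29.6206) \] An area of 2050 sq. km lies far beyond the observed areas, and a predicted pH above 14 is not physically plausible, so this extrapolation should be treated with caution.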
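The prediction interval can be reproduced from the summary totals with the standard formula for a new observation, \(s\sqrt{1 + 1/n + (x_0 - \overline{X})^2/S_{xx}}\). The following is a sketch under that assumption, with illustrative variable names.

```r
# Summary totals quoted earlier (X = lake area, Y = pH).
n <- 13
sum_x <- 1844;   sum_y <- 90.5
sum_xy <- 13397; sum_x2 <- 356828; sum_y2 <- 637.11

sxx <- sum_x2 - sum_x^2 / n
syy <- sum_y2 - sum_y^2 / n
sxy <- sum_xy - sum_x * sum_y / n

x_bar     <- sum_x / n
beta_hat  <- sxy / sxx
alpha_hat <- sum_y / n - beta_hat * x_bar
s         <- sqrt((syy - beta_hat * sxy) / (n - 2))

# Point prediction and standard error for a new lake of area x0.
x0      <- 2050
y0_hat  <- alpha_hat + beta_hat * x0
se_pred <- s * sqrt(1 + 1 / n + (x0 - x_bar)^2 / sxx)

# 99% prediction interval using the t distribution with n - 2 df.
t_crit   <- qt(1 - 0.01 / 2, df = n - 2)
interval <- y0_hat + c(-1, 1) * t_crit * se_pred

round(c(prediction = y0_hat, se = se_pred,
        lower = interval[1], upper = interval[2]), 4)
```

The term \((x_0 - \overline{X})^2/S_{xx}\) dominates the standard error here, which is why the interval is so wide.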
\[ n = 25 \]
\[ \sum_{i=1}^{25} X_i = 1315 \]
\[ \overline{X} = \frac{\sum_{i=1}^{25} X_i}{n} = 52.6 \]
\[ \sum_{i=1}^{25} Y_i = 235.6 \]
\[ \overline{Y} = \frac{\sum_{i=1}^{25} Y_i}{n} = 9.424 \]
\[ \sum_{i=1}^{25} X_i Y_i = 11821.432 \]
\[ \sum_{i=1}^{25} X_i^2 = 76323.42 \]
\[ S_{xx} = \sum_{i=1}^{25} X_i^2 - \frac{(\sum_{i=1}^{25} X_i)^2}{n} = 76323.42 - \frac{1729225}{25} = 7154.42 \]
\[ \sum_{i=1}^{25} Y_i^2 = 2284.1102 \]
\[ S_{yy} = \sum_{i=1}^{25} Y_i^2 - \frac{(\sum_{i=1}^{25} Y_i)^2}{n} = 2284.1102 - \frac{55507.36}{25} = 63.8158 \]
\[ S_{xy} = \sum_{i=1}^{25} X_i Y_i - \frac{\sum_{i=1}^{25} X_i * \sum_{i=1}^{25} Y_i}{n} = 11821.432 - \frac{1315*235.6}{25} = -571.1279 \]
The least squares estimate of the slope: \[ \hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{-571.1279}{7154.42} = -0.0798 \]
The least squares estimate of the constant (intercept), computed with the unrounded slope: \[ \hat{\alpha} = \overline{Y} - \hat{\beta}\overline{X} = 9.424 - (-0.0798) * 52.6 = 13.6230 \]
Coefficient of determination: \[ r^2 = \frac{S_{xy}^2}{S_{xx}S_{yy}} = \frac{(-571.1279)^2}{7154.42*63.8158} = 0.7144 \]
Standard deviation of the errors: \[ s = \sqrt{\frac{S_{yy} - \hat\beta S_{xy}}{n-2}} = \sqrt{\frac{63.8158 - (-0.0798)*(-571.1279)}{25-2}} = 0.8901 \]
Standard deviations of the estimates: \[ S_{\hat\alpha} = s \sqrt{\frac{\sum X_i^2}{nS_{xx}}} = 0.8901 * \sqrt{\frac{76323.42}{25*7154.42}} = 0.5815 \] \[ S_{\hat\beta} = \frac{s}{\sqrt{S_{xx}}} = \frac{0.8901}{\sqrt{7154.42}} = 0.0105 \]
Estimated equation of the line is: \[ \hat Y = 13.6230 - 0.0798X \]
observation | steam | temperature | fitted temperature | residuals |
---|---|---|---|---|
\((n)\) | \((x)\) | \((y)\) | (\(\hat y = 13.6230 - 0.0798*x\)) | \((y - \hat y)\) |
1 | 35.3 | 10.98 | 10.80606 | 0.17394 |
2 | 29.7 | 11.13 | 11.25294 | -0.12294 |
3 | 30.8 | 12.51 | 11.16516 | 1.34484 |
4 | 58.8 | 8.40 | 8.93076 | -0.53076 |
5 | 61.4 | 9.27 | 8.72328 | 0.54672 |
6 | 71.3 | 8.73 | 7.93326 | 0.79674 |
7 | 74.4 | 6.36 | 7.68588 | -1.32588 |
8 | 76.7 | 8.50 | 7.50234 | 0.99766 |
9 | 70.7 | 7.82 | 7.98114 | -0.16114 |
10 | 57.5 | 9.14 | 9.03450 | 0.10550 |
11 | 46.4 | 8.24 | 9.92028 | -1.68028 |
12 | 28.9 | 12.19 | 11.31678 | 0.87322 |
13 | 28.1 | 11.88 | 11.38062 | 0.49938 |
14 | 39.1 | 9.57 | 10.50282 | -0.93282 |
15 | 46.8 | 10.94 | 9.88836 | 1.05164 |
16 | 48.5 | 9.58 | 9.75270 | -0.17270 |
17 | 59.3 | 10.09 | 8.89086 | 1.19914 |
18 | 70.0 | 8.11 | 8.03700 | 0.07300 |
19 | 70.0 | 6.83 | 8.03700 | -1.20700 |
20 | 74.5 | 8.88 | 7.67790 | 1.20210 |
21 | 72.1 | 7.68 | 7.86942 | -0.18942 |
22 | 58.1 | 8.47 | 8.98662 | -0.51662 |
23 | 44.6 | 8.86 | 10.06392 | -1.20392 |
24 | 33.4 | 10.36 | 10.95768 | -0.59768 |
25 | 28.6 | 11.08 | 11.34072 | -0.26072 |
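The fitted values and residuals in the table can be checked in R. The sketch below retypes the 25 observations from the table (the vector names steam and temperature are illustrative) and compares the residuals obtained from the rounded equation with those from the unrounded lm fit.

```r
# Data transcribed from the table above (x = steam, y = temperature).
steam <- c(35.3, 29.7, 30.8, 58.8, 61.4, 71.3, 74.4, 76.7, 70.7, 57.5,
           46.4, 28.9, 28.1, 39.1, 46.8, 48.5, 59.3, 70.0, 70.0, 74.5,
           72.1, 58.1, 44.6, 33.4, 28.6)
temperature <- c(10.98, 11.13, 12.51, 8.40, 9.27, 8.73, 6.36, 8.50, 7.82,
                 9.14, 8.24, 12.19, 11.88, 9.57, 10.94, 9.58, 10.09, 8.11,
                 6.83, 8.88, 7.68, 8.47, 8.86, 10.36, 11.08)

fit <- lm(temperature ~ steam)

# Residuals from the unrounded least squares coefficients.
unrounded <- unname(residuals(fit))

# Residuals as in the table, using coefficients rounded to 4 decimals.
rounded <- temperature - (13.6230 - 0.0798 * steam)

head(data.frame(steam, temperature, fitted = unname(fitted(fit)),
                unrounded, rounded))

# Largest discrepancy caused purely by rounding the coefficients.
max(abs(unrounded - rounded))
```

The two sets of residuals differ only in the third decimal place, which explains why the table values do not match the residuals printed by R exactly.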
The ANOVA table is as follows:
Source | Sum of Squares | Degrees of Freedom | Mean Squares |
---|---|---|---|
Regression | 45.593 | 1 | 45.593 |
Residual | 18.223 | 23 | 0.792 |
Total | 63.816 | 24 | |
The ANOVA table partitions the total variation in y (63.816) into the part explained by the regression (45.593) and the residual part (18.223). The ratio of the regression mean square to the residual mean square gives the F statistic, which is used to test whether a linear relationship exists between x and y. Hence, the table supports hypothesis testing of the fitted model.
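The entries of the ANOVA table follow directly from \(S_{xx}\), \(S_{yy}\) and \(S_{xy}\). The sketch below rebuilds the table from those three quantities, using the identity that the regression sum of squares equals \(S_{xy}^2/S_{xx}\). The names are illustrative.

```r
# Quantities computed earlier for the steam data.
n   <- 25
sxx <- 7154.42
syy <- 63.8158
sxy <- -571.1279

# Decompose the total sum of squares.
ssr <- sxy^2 / sxx    # regression sum of squares (equals beta_hat * sxy)
sse <- syy - ssr      # residual sum of squares

anova_table <- data.frame(
  source = c("Regression", "Residual", "Total"),
  sum_sq = c(ssr, sse, syy),
  df     = c(1, n - 2, n - 1),
  mean_sq = c(ssr / 1, sse / (n - 2), NA)
)
print(anova_table)

# F statistic and its 5% critical value.
f_stat <- (ssr / 1) / (sse / (n - 2))
f_crit <- qf(0.95, df1 = 1, df2 = n - 2)
c(f = f_stat, critical = f_crit)
```

An F statistic far above the critical value indicates that the regression explains much more variation than would be expected by chance.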
Coefficient of determination: \[ r^2 = \frac{S_{xy}^2}{S_{xx}S_{yy}} = \frac{(-571.1279)^2}{7154.42*63.8158} = 0.7144 \]
This means that 71.4% of the variation in the mean atmospheric temperature is explained by its linear relationship with steam.
Correlation Coefficient: \[ r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{-571.1279}{\sqrt{7154.42*63.8158}} = -\sqrt{0.7144} = -0.8452 \]
The following conclusions can be derived from the magnitude of the correlation coefficient:
Magnitude of r | Strength of Linear Relationship |
---|---|
\(0 \le \lvert r \rvert < 0.3\) | weak |
\(0.3 \le \lvert r \rvert < 0.55\) | slight |
\(0.55 \le \lvert r \rvert < 0.8\) | moderate |
\(0.8 \le \lvert r \rvert \le 1\) | strong |
Hence, since |r| = 0.845 in our case, there is a strong linear relationship between y and x.
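The lookup in the table can be written as a small R helper. This is a sketch only: classify_strength is a hypothetical name, and it assumes that a value lying exactly on a cut point is assigned to the stronger category.

```r
# Classify |r| using the cut points 0.3, 0.55 and 0.8 from the table.
# Assumption: a value exactly on a cut point goes to the stronger category.
classify_strength <- function(r) {
  labels <- c("weak", "slight", "moderate", "strong")
  labels[findInterval(abs(r), c(0.3, 0.55, 0.8)) + 1]
}

# Correlation coefficient from the quantities computed earlier.
sxx <- 7154.42
syy <- 63.8158
sxy <- -571.1279
r   <- sxy / sqrt(sxx * syy)

r
classify_strength(r)
```

The sign of r is ignored for the classification and only tells us that temperature decreases as steam increases.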
Standard deviation of the errors: \[ s = \sqrt{\frac{S_{yy} - \hat\beta S_{xy}}{n-2}} = \sqrt{\frac{63.8158 - (-0.0798)*(-571.1279)}{25-2}} = 0.8901 \]
Standard deviation of the constant: \[ S_{\hat\alpha} = s \sqrt{\frac{\sum X_i^2}{nS_{xx}}} = 0.8901 * \sqrt{\frac{76323.42}{25*7154.42}} = 0.5815 \]
Standard deviation of the slope: \[ S_{\hat\beta} = \frac{s}{\sqrt{S_{xx}}} = \frac{0.8901}{\sqrt{7154.42}} = 0.0105 \]
Significance test for Slope
T test with
\(H_0\): Slope \(\beta = 0\) (a linear relationship does not exist between y and x)
against the alternative
\(H_1\): Slope \(\beta \neq 0\) (a linear relationship does exist between y and x)
Standard error of the slope = 0.0105
\(\hat\beta\) = -0.0798
\[\ t = \frac{\hat\beta - \beta_0}{se(\hat\beta)} = \frac{-0.0798 - 0}{0.0105} = -7.6 \]
At the 5% significance level:
\[\ t_{0.025, 23} = 2.069\] \[\ |-7.6| > 2.069\] Hence \(H_0\) is rejected.
Conclusion: a linear relationship does exist between y and x.
Significance test for Intercept
T test with
\(H_0\): Intercept \(\alpha = \alpha_0\), where \(\alpha_0 = 0\)
against the alternative
\(H_1\): Intercept \(\alpha \neq \alpha_0\)
Standard error of the intercept = 0.5815
\(\hat\alpha\) = 13.6230
\[\ t = \frac{\hat\alpha - \alpha_0}{se(\hat\alpha)} = \frac{13.6230 - 0}{0.5815} = 23.4273 \]
At the 5% significance level:
\[\ t_{0.025, 23} = 2.069\] \[\ |23.4273| > 2.069\]
Hence \(H_0\) is rejected, so the intercept differs significantly from 0.
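Both significance tests above follow the same recipe, so they can be wrapped in one small function. The helper name t_test_coef is hypothetical and the sketch uses the rounded estimates and standard errors quoted in the text.

```r
# Two-sided t test of H0: coefficient = null_value.
# t_test_coef is a hypothetical helper introduced for this sketch.
t_test_coef <- function(estimate, std_error, df, null_value = 0, level = 0.05) {
  t_stat  <- (estimate - null_value) / std_error
  t_crit  <- qt(1 - level / 2, df)
  p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
  list(t = t_stat, critical = t_crit, p = p_value,
       reject = abs(t_stat) > t_crit)
}

# Slope and intercept tests with n - 2 = 23 degrees of freedom.
slope_test     <- t_test_coef(-0.0798, 0.0105, df = 23)
intercept_test <- t_test_coef(13.6230, 0.5815, df = 23)

str(slope_test)
str(intercept_test)
```

Passing a different null_value would test the coefficient against any other hypothesised value in the same way.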
95% confidence interval for Slope:
\[ (\hat\beta \pm t^{(n-2)}_{\alpha/2}*S_{\hat \beta}) \] \[ -0.0798 \pm 2.069*0.0105 \] \[ (-0.1016, -0.0581) \]
95% confidence interval for Intercept:
\[ (\hat\alpha \pm t^{(n-2)}_{\alpha/2}*S_{\hat \alpha}) \] \[ 13.6230 \pm 2.069*0.5815 \] \[ (12.4198, 14.826) \]
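The two intervals can be reproduced with qt, which supplies the critical value with more digits than the table value 2.069. The sketch below assumes the rounded estimates and standard errors from the text. coef_ci is a hypothetical helper name.

```r
# Confidence interval: estimate +/- t critical value * standard error.
# coef_ci is a hypothetical helper introduced for this sketch.
coef_ci <- function(estimate, std_error, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df)
  estimate + c(lower = -1, upper = 1) * t_crit * std_error
}

slope_ci     <- coef_ci(-0.0798, 0.0105, df = 23)
intercept_ci <- coef_ci(13.6230, 0.5815, df = 23)

round(slope_ci, 4)
round(intercept_ci, 4)
```

Neither interval contains 0, which agrees with the rejection of both null hypotheses in the significance tests.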
The scatter plot can be drawn using the plot function:
plot(steamAtmsphTemp.steam, steamAtmsphTemp.temperature, xlab = "steam (pounds per month)", ylab = "mean atmospheric temperature (deg. F)", main = "Scatter plot for steam \n and atmospheric temperature")
The fitted line can be added to the scatter plot using the lm and abline functions:
plot(steamAtmsphTemp.steam,steamAtmsphTemp.temperature,xlab="steam (pounds per month)", ylab="mean atmospheric temperature (deg. F)", main="Scatter plot for steam \n and atmospheric temperature")
steamAtmsphTemp.lm <- lm(steamAtmsphTemp.temperature ~ steamAtmsphTemp.steam)
abline(steamAtmsphTemp.lm)
Once the linear model has been fitted with the lm function, its residuals component gives the residuals:
steamAtmsphTemp.lm$residuals
## 1 2 3 4 5 6
## 0.17496361 -0.12207708 1.34573449 -0.52906210 0.54849250 0.79879656
## 7 8 9 10 11 12
## -1.32373449 0.99987151 -0.15910065 0.10716060 -1.67893790 0.87405997
## 13 14 15 16 17 18
## 0.50019701 -0.93168736 1.05299358 -0.17129764 1.20085225 0.07501926
## 19 20 21 22 23 24
## -1.20498074 1.20424838 -0.18734048 -0.51494219 -1.20262955 -0.59671091
## 25
## -0.25988864
The aov function can be used to make the ANOVA table:
steamAtmsphTemp.anova <- aov(steamAtmsphTemp.temperature ~ steamAtmsphTemp.steam,
data = steamAtmsphTemp)
summary(steamAtmsphTemp.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## steamAtmsphTemp.steam 1 45.59 45.59 57.54 1.05e-07 ***
## Residuals 23 18.22 0.79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient of determination can be found by the following method:
summary(steamAtmsphTemp.lm)$r.squared
## [1] 0.7144375
The correlation coefficient can be found using the following method:
cor(steamAtmsphTemp.steam,steamAtmsphTemp.temperature)
## [1] -0.8452441
summary.lm(steamAtmsphTemp.lm)
##
## Call:
## lm(formula = steamAtmsphTemp.temperature ~ steamAtmsphTemp.steam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6789 -0.5291 -0.1221 0.7988 1.3457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.62299 0.58146 23.429 < 2e-16 ***
## steamAtmsphTemp.steam -0.07983 0.01052 -7.586 1.05e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8901 on 23 degrees of freedom
## Multiple R-squared: 0.7144, Adjusted R-squared: 0.702
## F-statistic: 57.54 on 1 and 23 DF, p-value: 1.055e-07
The 95% confidence intervals for the coefficients can also be found:
confint(steamAtmsphTemp.lm, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 12.4201404 14.82583815
## steamAtmsphTemp.steam -0.1015984 -0.05805901
par(mfrow = c(2, 2))
plot(steamAtmsphTemp.lm)
Explanation: Residuals vs Fitted: The trend line is roughly horizontal at 0 and the residuals are evenly spread around it. This supports linearity.
Normal Q-Q plot: The points follow the dotted straight line, which supports the normality of the residuals. Only 2 points do not fall on this line, and these can be ignored.
Scale-Location: The trend line is roughly horizontal with the residuals evenly spread around it. This indicates homoscedasticity.
Residuals vs Leverage: There are no points beyond the Cook's distance contours. This means there are no influential points or significant outliers affecting the regression.
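The visual reading of the Residuals vs Leverage plot can be backed up numerically. The sketch below refits the model on the 25 observations retyped from the residual table (the vector names are illustrative) and checks Cook's distance against 0.5 and 1, the contour levels that plot.lm draws by default.

```r
# Data transcribed from the residual table (x = steam, y = temperature).
steam <- c(35.3, 29.7, 30.8, 58.8, 61.4, 71.3, 74.4, 76.7, 70.7, 57.5,
           46.4, 28.9, 28.1, 39.1, 46.8, 48.5, 59.3, 70.0, 70.0, 74.5,
           72.1, 58.1, 44.6, 33.4, 28.6)
temperature <- c(10.98, 11.13, 12.51, 8.40, 9.27, 8.73, 6.36, 8.50, 7.82,
                 9.14, 8.24, 12.19, 11.88, 9.57, 10.94, 9.58, 10.09, 8.11,
                 6.83, 8.88, 7.68, 8.47, 8.86, 10.36, 11.08)

fit <- lm(temperature ~ steam)

# Influence measures for every observation.
leverage <- hatvalues(fit)
cooks    <- cooks.distance(fit)
std_res  <- rstandard(fit)

# The three observations with the largest Cook's distance.
head(sort(cooks, decreasing = TRUE), 3)

# Is any point beyond the 0.5 contour, and how large is the largest
# standardized residual?
any(cooks > 0.5)
max(abs(std_res))
```

If any(cooks > 0.5) returns FALSE, no observation crosses the inner Cook's distance contour, which matches the conclusion drawn from the plot.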