Q1

(a)

As area increases, the pH value tends to increase as well. This suggests a positive correlation between area and pH value.

(b)

\[ n = 13 \]
\[ \sum_{i=1}^{13} X_i = 1844 \]
\[ \overline{X} = \frac{\sum_{i=1}^{13} X_i}{n} = 141.846 \]
\[ \sum_{i=1}^{13} Y_i = 90.5 \]
\[ \overline{Y} = \frac{\sum_{i=1}^{13} Y_i}{n} = 6.962 \]
\[ \sum_{i=1}^{13} X_i Y_i = 13397 \]
\[ \sum_{i=1}^{13} X^2_i = 356828 \]
\[ S_{xx} = \sum_{i=1}^{13} X^2_i - \frac{(\sum_{i=1}^{13} X_i)^2}{n} = 356828 - \frac{3400336}{13} = 95263.692 \]
\[ \sum_{i=1}^{13} Y^2_i = 637.11 \]
\[ S_{yy} = \sum_{i=1}^{13} Y^2_i - \frac{(\sum_{i=1}^{13} Y_i)^2}{n} = 637.11 - \frac{8190.25}{13} = 7.091 \]
\[ S_{xy} = \sum_{i=1}^{13} X_i Y_i - \frac{\sum_{i=1}^{13} X_i \cdot \sum_{i=1}^{13} Y_i}{n} = 13397 - \frac{1844 \cdot 90.5}{13} = 559.923 \]

Slope \[ \hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{559.923}{95263.692} = 0.00588 \]

Intercept \[ \hat{\alpha} = \overline{Y} - \hat{\beta}\overline{X} = 6.962 - 0.00588 \cdot 141.846 = 6.128 \]

Coefficient of determination \[ r^2 = \frac{S^2_{xy}}{S_{xx}S_{yy}} = \frac{559.923^2}{95263.692 \cdot 7.091} = 0.464 \]

Standard deviation of the errors s \[ s = \sqrt{\frac{S_{yy} - \hat\beta S_{xy}}{n-2}} = \sqrt{\frac{7.091 - 0.00588 \cdot 559.923}{13-2}} = 0.5877 \]

Standard deviation of the estimates \[ S_{\hat\alpha} = s \sqrt{\frac{\sum X^2_i}{nS_{xx}}} = 0.5877 \cdot \sqrt{\frac{356828}{13 \cdot 95263.692}} = 0.315 \] \[ S_{\hat\beta} = \frac{s}{\sqrt{S_{xx}}} = \frac{0.5877}{\sqrt{95263.692}} = 0.0019 \]

The estimated equation of the line is: \[ \hat Y = 6.128 + 0.0059X \]
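The hand calculation above can be cross-checked with a few lines of R. This is only a sketch that starts from the summary statistics quoted in part (b); the variable names (`sum_x`, `beta_hat`, and so on) are illustrative and do not come from the original analysis.

```r
# Summary statistics copied from part (b); variable names are illustrative.
n      <- 13
sum_x  <- 1844
sum_y  <- 90.5
sum_xy <- 13397
sum_x2 <- 356828
sum_y2 <- 637.11

# Corrected sums of squares and cross-products
Sxx <- sum_x2 - sum_x^2 / n
Syy <- sum_y2 - sum_y^2 / n
Sxy <- sum_xy - sum_x * sum_y / n

# Least-squares estimates
beta_hat  <- Sxy / Sxx
alpha_hat <- sum_y / n - beta_hat * sum_x / n

# Goodness of fit and standard errors
r2       <- Sxy^2 / (Sxx * Syy)
s        <- sqrt((Syy - beta_hat * Sxy) / (n - 2))
se_alpha <- s * sqrt(sum_x2 / (n * Sxx))
se_beta  <- s / sqrt(Sxx)

round(c(slope = beta_hat, intercept = alpha_hat, r2 = r2, s = s,
        se_intercept = se_alpha, se_slope = se_beta), 4)
```

Because no intermediate value is rounded here, the results should agree with the `lm` summary in part (d) more closely than the hand-rounded figures do.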

(c)

The ANOVA table is as follows:

| Source | Sum of Squares | Degrees of Freedom | Mean Squares |
|---|---|---|---|
| Regression | 3.291 | 1 | 3.291 |
| Residual | 3.800 | 11 | 0.345 |
| Total | 7.091 | 12 | |

\[ f = \frac{SSR/1}{SSE/(n-2)} = 9.527 \]

\(H_0\): A linear relationship does not exist between y and x

against the alternative

\(H_1\): A linear relationship does exist between y and x

\[ F^{(1, 11)}_{0.05} = 4.8443 \] \[ 9.527 > 4.8443 \] Hence, \[ f > F^{(1, 11)}_{0.05} \] We reject the null hypothesis at the 5% significance level and conclude that a linear relationship exists between area and pH level.
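The critical value and the test decision can be reproduced in R with the base functions `qf` and `pf`. The sketch below only uses the sums of squares from the ANOVA table; the object names are illustrative.

```r
# Sums of squares taken from the ANOVA table in part (c)
SSR <- 3.291
SSE <- 3.800
n   <- 13

f_stat <- (SSR / 1) / (SSE / (n - 2))      # observed F statistic
f_crit <- qf(0.95, df1 = 1, df2 = n - 2)   # upper 5% point of F(1, 11)
p_val  <- pf(f_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

c(f = f_stat, critical = f_crit, p.value = p_val)
f_stat > f_crit   # TRUE means H0 is rejected at the 5% level
```

Comparing `f_stat` with `f_crit` and comparing `p_val` with 0.05 are equivalent ways of reaching the same decision.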

(d)

## 
## Call:
## lm(formula = pHLakesOntario.pH ~ pHLakesOntario.Area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8269 -0.6741  0.1607  0.3730  0.6967 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.127822   0.315483  19.424 7.31e-10 ***
## pHLakesOntario.Area 0.005878   0.001904   3.087   0.0103 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5877 on 11 degrees of freedom
## Multiple R-squared:  0.4641, Adjusted R-squared:  0.4154 
## F-statistic: 9.527 on 1 and 11 DF,  p-value: 0.01035

The Residual vs Fitted Plot:

The residuals seem to be evenly spread around the zero line. The smoothed trend line (red line) drifts away from the zero line, probably because of an outlier.

Normal QQ plot: The points more or less follow a straight line, which supports the assumption that the residuals are normally distributed. A few deviations from the line exist, but these could be due to outliers.

Scale-Location: The trend line is not horizontal and the residuals are not equally spread along it, which suggests non-constant variance. This may reflect the influence of outliers.

Residuals vs Leverage: One point lies beyond the Cook’s distance contour, which means that this point is highly influential. It could be an outlier and could explain the distortions in the previous plots.

(e)

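The estimated equation of the line is: \[ \hat Y = 6.128 + 0.0059X \] Therefore, the predicted pH level for another lake in the same region with area 2050 sq. km, computed with the unrounded coefficients, is: \[ \hat Y = \hat\alpha + \hat\beta \cdot 2050 = 18.1769 \] With \(t^{(11)}_{0.005} = 3.106\) and a standard error of prediction of 3.6843, the 99% prediction interval is: \[ 18.1769 \pm (3.106 \cdot 3.6843) = 18.1769 \pm 11.4437 = (6.7332, 29.6206) \] Note that an area of 2050 sq. km lies outside the observed data (the 13 areas sum to only 1844), so this is an extrapolation. The point estimate exceeds the usual 0 to 14 pH scale and should be treated with caution.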
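The standard error 3.6843 is not derived in the text. A sketch under the assumption that it is the usual standard error for predicting a single new observation, \(s\sqrt{1 + 1/n + (x_0 - \overline{X})^2 / S_{xx}}\), reproduces it from the quantities in parts (b) and (d). The variable names are illustrative.

```r
# Quantities from parts (b) and (d)
n         <- 13
x_bar     <- 1844 / 13
Sxx       <- 95263.692
s         <- 0.5877      # residual standard error
alpha_hat <- 6.127822    # intercept from the lm output
beta_hat  <- 0.005878    # slope from the lm output

x0 <- 2050               # area of the new lake (sq. km)

# Point prediction and standard error for a single new observation
# (assumed formula; it matches the 3.6843 used in the text)
y0_hat  <- alpha_hat + beta_hat * x0
se_pred <- s * sqrt(1 + 1 / n + (x0 - x_bar)^2 / Sxx)

# 99% prediction interval
t_crit <- qt(1 - 0.01 / 2, df = n - 2)
c(fit = y0_hat, lower = y0_hat - t_crit * se_pred,
  upper = y0_hat + t_crit * se_pred)
```

The term \((x_0 - \overline{X})^2 / S_{xx}\) dominates the standard error here, which is why the interval is so wide for an area this far from the sample mean.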

Q2

(a)

\[ n = 25 \]
\[ \sum_{i=1}^{25} X_i = 1315 \]
\[ \overline{X} = \frac{\sum_{i=1}^{25} X_i}{n} = 52.6 \]
\[ \sum_{i=1}^{25} Y_i = 235.6 \]
\[ \overline{Y} = \frac{\sum_{i=1}^{25} Y_i}{n} = 9.424 \]
\[ \sum_{i=1}^{25} X_i Y_i = 11821.432 \]
\[ \sum_{i=1}^{25} X^2_i = 76323.42 \]
\[ S_{xx} = \sum_{i=1}^{25} X^2_i - \frac{(\sum_{i=1}^{25} X_i)^2}{n} = 76323.42 - \frac{1729225}{25} = 7154.42 \]
\[ \sum_{i=1}^{25} Y^2_i = 2284.1102 \]
\[ S_{yy} = \sum_{i=1}^{25} Y^2_i - \frac{(\sum_{i=1}^{25} Y_i)^2}{n} = 2284.1102 - \frac{55507.36}{25} = 63.8158 \]
\[ S_{xy} = \sum_{i=1}^{25} X_i Y_i - \frac{\sum_{i=1}^{25} X_i \cdot \sum_{i=1}^{25} Y_i}{n} = 11821.432 - \frac{1315 \cdot 235.6}{25} = -571.1279 \]

The least-squares estimate of the slope: \[ \hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{-571.1279}{7154.42} = -0.0798 \]

The least-squares estimate of the constant (intercept), computed with the unrounded slope: \[ \hat{\alpha} = \overline{Y} - \hat{\beta}\overline{X} = 9.424 - (-0.0798) \cdot 52.6 = 13.6230 \]

Coefficient of determination \[ r^2 = \frac{S^2_{xy}}{S_{xx}S_{yy}} = \frac{(-571.1279)^2}{7154.42 \cdot 63.8158} = 0.7144 \]

Standard deviation of the errors s \[ s = \sqrt{\frac{S_{yy} - \hat\beta S_{xy}}{n-2}} = \sqrt{\frac{63.8158 - (-0.0798) \cdot (-571.1279)}{25-2}} = 0.8901 \]

Standard deviation of the estimates \[ S_{\hat\alpha} = s \sqrt{\frac{\sum X^2_i}{nS_{xx}}} = 0.8901 \cdot \sqrt{\frac{76323.42}{25 \cdot 7154.42}} = 0.5815 \] \[ S_{\hat\beta} = \frac{s}{\sqrt{S_{xx}}} = \frac{0.8901}{\sqrt{7154.42}} = 0.0105 \]

The estimated equation of the line is: \[ \hat Y = 13.6230 - 0.0798X \]

(b)

| Observation \((n)\) | Steam \((x)\) | Temperature \((y)\) | Fitted temperature (\(\hat y = 13.6230 - 0.0798x\)) | Residual \((y - \hat y)\) |
|---|---|---|---|---|
| 1 | 35.3 | 10.98 | 10.80606 | 0.17394 |
| 2 | 29.7 | 11.13 | 11.25294 | -0.12294 |
| 3 | 30.8 | 12.51 | 11.16516 | 1.34484 |
| 4 | 58.8 | 8.40 | 8.93076 | -0.53076 |
| 5 | 61.4 | 9.27 | 8.72328 | 0.54672 |
| 6 | 71.3 | 8.73 | 7.93326 | 0.79674 |
| 7 | 74.4 | 6.36 | 7.68588 | -1.32588 |
| 8 | 76.7 | 8.50 | 7.50234 | 0.99766 |
| 9 | 70.7 | 7.82 | 7.98114 | -0.16114 |
| 10 | 57.5 | 9.14 | 9.03450 | 0.10550 |
| 11 | 46.4 | 8.24 | 9.92028 | -1.68028 |
| 12 | 28.9 | 12.19 | 11.31678 | 0.87322 |
| 13 | 28.1 | 11.88 | 11.38062 | 0.49938 |
| 14 | 39.1 | 9.57 | 10.50282 | -0.93282 |
| 15 | 46.8 | 10.94 | 9.88836 | 1.05164 |
| 16 | 48.5 | 9.58 | 9.75270 | -0.17270 |
| 17 | 59.3 | 10.09 | 8.89086 | 1.19914 |
| 18 | 70.0 | 8.11 | 8.03700 | 0.07300 |
| 19 | 70.0 | 6.83 | 8.03700 | -1.20700 |
| 20 | 74.5 | 8.88 | 7.67790 | 1.20210 |
| 21 | 72.1 | 7.68 | 7.86942 | -0.18942 |
| 22 | 58.1 | 8.47 | 8.98662 | -0.51662 |
| 23 | 44.6 | 8.86 | 10.06392 | -1.20392 |
| 24 | 33.4 | 10.36 | 10.95768 | -0.59768 |
| 25 | 28.6 | 11.08 | 11.34072 | -0.26072 |
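The fitted values and residuals in the table follow directly from the estimated line. As a small illustration, the sketch below recomputes them for the first five observations only, using the rounded estimates from part (a); the vector names are illustrative.

```r
# First five observations from the table above
x <- c(35.3, 29.7, 30.8, 58.8, 61.4)
y <- c(10.98, 11.13, 12.51, 8.40, 9.27)

# Rounded least-squares estimates from part (a)
alpha_hat <- 13.6230
beta_hat  <- -0.0798

fitted_y <- alpha_hat + beta_hat * x   # y-hat for each observation
res      <- y - fitted_y               # residual = observed - fitted

data.frame(x, y, fitted = round(fitted_y, 5), residual = round(res, 5))
```

These residuals differ slightly from those reported by `lm`, because `lm` works with the unrounded coefficients.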

(c)

The ANOVA table is as follows:

| Source | Sum of Squares | Degrees of Freedom | Mean Squares |
|---|---|---|---|
| Regression | 45.593 | 1 | 45.593 |
| Residual | 18.223 | 23 | 0.792 |
| Total | 63.816 | 24 | |

The ANOVA table splits the total variation in y into the part explained by the regression and the residual part. The ratio of the regression mean square to the residual mean square is the F statistic, which is used to test whether a linear relationship exists between x and y (that is, whether the slope is zero). Hence, the table supports hypothesis testing of the fitted model.
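The entries of the table can be rebuilt from the sums of squares in part (a). The sketch below uses the identity \(SSR = \hat\beta S_{xy}\) and \(SSE = S_{yy} - SSR\), and also computes the F statistic that the table itself does not list. Object names are illustrative.

```r
# Quantities from part (a)
n   <- 25
Sxx <- 7154.42
Syy <- 63.8158
Sxy <- -571.1279

beta_hat <- Sxy / Sxx

SSR <- beta_hat * Sxy    # regression sum of squares
SSE <- Syy - SSR         # residual sum of squares
MSR <- SSR / 1
MSE <- SSE / (n - 2)

f_stat <- MSR / MSE
p_val  <- pf(f_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

anova_tab <- data.frame(
  source = c("Regression", "Residual", "Total"),
  SS     = round(c(SSR, SSE, Syy), 3),
  df     = c(1, n - 2, n - 1),
  MS     = round(c(MSR, MSE, NA), 3)
)
anova_tab
c(F = f_stat, p.value = p_val)
```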

(d)

Coefficient of determination: \[ r^2 = \frac{S^2_{xy}}{S_{xx}S_{yy}} = \frac{(-571.1279)^2}{7154.42 \cdot 63.8158} = 0.7144 \]

This means that 71.4% of the variation in the mean atmospheric temperature is explained by its linear relationship with steam.
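The same value can be read off the ANOVA table in part (c), since in simple linear regression the coefficient of determination equals the regression sum of squares divided by the total sum of squares. A quick check with the tabulated values:

\[ r^2 = \frac{SSR}{S_{yy}} = \frac{45.593}{63.816} = 0.7144 \]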

Correlation Coefficient: \[ r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{-571.1279}{\sqrt{7154.42 \cdot 63.8158}} = -\sqrt{0.7144} = -0.8452 \]

The following conclusions can be drawn from the magnitude of the correlation coefficient:

| Magnitude of r | Strength of Linear Relationship |
|---|---|
| \(0 \le \lvert r \rvert \le 0.3\) | weak |
| \(0.3 \le \lvert r \rvert \le 0.55\) | slight |
| \(0.55 \le \lvert r \rvert \le 0.8\) | moderate |
| \(0.8 \le \lvert r \rvert \le 1\) | strong |

Since in our case |r| = 0.845 and r is negative, there is a strong negative linear relationship between y and x.

(e)

Standard deviation of the errors: \[ s = \sqrt{\frac{S_{yy} - \hat\beta S_{xy}}{n-2}} = \sqrt{\frac{63.8158 - (-0.0798) \cdot (-571.1279)}{25-2}} = 0.8901 \]

Standard deviation of the constant: \[ S_{\hat\alpha} = s \sqrt{\frac{\sum X^2_i}{nS_{xx}}} = 0.8901 \cdot \sqrt{\frac{76323.42}{25 \cdot 7154.42}} = 0.5815 \]

Standard deviation of the slope: \[ S_{\hat\beta} = \frac{s}{\sqrt{S_{xx}}} = \frac{0.8901}{\sqrt{7154.42}} = 0.0105 \]

(f)

Significance test for Slope

t-test with

\(H_0\): Slope \(\beta = 0\) (a linear relationship does not exist between y and x)

against the alternative

\(H_1\): Slope \(\beta \neq 0\) (a linear relationship does exist between y and x)

Standard error of the slope = 0.0105

\(\hat\beta\) = -0.0798

\[\ t = \frac{\hat\beta - \beta_0}{se(\hat\beta)} = \frac{-0.0798 - 0}{0.0105} = -7.6 \]

At the 5% significance level:

\[ t_{0.025, 23} = 2.069 \] \[ |-7.6| > 2.069 \] Hence \(H_0\) is rejected.

In conclusion, a linear relationship does exist between y and x.

Significance test for Intercept

t-test with

\(H_0\): Intercept \(\alpha = \alpha_0\), with \(\alpha_0 = 0\)

against the alternative

\(H_1\): Intercept \(\alpha \neq \alpha_0\)

Standard error of the intercept = 0.5815

\(\hat\alpha\) = 13.6230

\[\ t = \frac{\hat\alpha - \alpha_0}{se(\hat\alpha)} = \frac{13.6230 - 0}{0.5815} = 23.4273 \]

At the 5% significance level:

\[ t_{0.025, 23} = 2.069 \] \[ |23.4273| > 2.069 \]

Hence \(H_0\) is rejected: the intercept is significantly different from 0.
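Both t-tests can be checked in R with `qt` and `pt`. The sketch below starts from the rounded estimates and standard errors used above, so the statistics match the hand calculation rather than the `lm` output exactly; the object names are illustrative.

```r
df  <- 25 - 2
est <- c(intercept = 13.6230, slope = -0.0798)   # estimates from part (a)
se  <- c(intercept = 0.5815,  slope = 0.0105)    # standard errors from part (e)

t_stat <- (est - 0) / se                  # test statistics under H0: parameter = 0
t_crit <- qt(0.975, df)                   # two-sided 5% critical value
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

data.frame(estimate = est, se = se, t = round(t_stat, 4),
           p.value = signif(p_val, 3), reject_H0 = abs(t_stat) > t_crit)
```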

(g)

95% confidence interval for Slope:

\[ (\hat\beta \pm t^{(n-2)}_{\alpha/2} \cdot S_{\hat \beta}) \] \[ -0.0798 \pm 2.069 \cdot 0.0105 \] \[ (-0.1016, -0.0581) \]

95% confidence interval for Intercept:

\[ (\hat\alpha \pm t^{(n-2)}_{\alpha/2} \cdot S_{\hat \alpha}) \] \[ 13.6230 \pm 2.069 \cdot 0.5815 \] \[ (12.4199, 14.8261) \]
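The two intervals can be computed together in R. This sketch again uses the rounded estimates and standard errors from parts (a) and (e), so the limits can differ from `confint` on the fitted model in the fourth decimal place; the object names are illustrative.

```r
df     <- 25 - 2
est    <- c(intercept = 13.6230, slope = -0.0798)
se     <- c(intercept = 0.5815,  slope = 0.0105)
t_crit <- qt(1 - 0.05 / 2, df)   # t quantile for a 95% two-sided interval

# One row per parameter: lower and upper confidence limits
ci <- data.frame(lower = est - t_crit * se,
                 upper = est + t_crit * se)
round(ci, 4)
```

Neither interval contains 0, which agrees with the rejection of both null hypotheses in part (f).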

Q3

(1)

plot(steamAtmsphTemp.steam, steamAtmsphTemp.temperature,
     xlab = "steam (pounds per month)",
     ylab = "mean atmospheric temperature (deg. F)",
     main = "Scatter plot for steam \n and atmospheric temperature")

(2)

(a)

The fitted line can be drawn using the abline function together with the lm function:

plot(steamAtmsphTemp.steam, steamAtmsphTemp.temperature,
     xlab = "steam (pounds per month)",
     ylab = "mean atmospheric temperature (deg. F)",
     main = "Scatter plot for steam \n and atmospheric temperature")
steamAtmsphTemp.lm <- lm(steamAtmsphTemp.temperature ~ steamAtmsphTemp.steam)
abline(steamAtmsphTemp.lm)

(b)

Once the linear model is found using the lm function, the residuals output can be used to display the residuals.

steamAtmsphTemp.lm$residuals
##           1           2           3           4           5           6 
##  0.17496361 -0.12207708  1.34573449 -0.52906210  0.54849250  0.79879656 
##           7           8           9          10          11          12 
## -1.32373449  0.99987151 -0.15910065  0.10716060 -1.67893790  0.87405997 
##          13          14          15          16          17          18 
##  0.50019701 -0.93168736  1.05299358 -0.17129764  1.20085225  0.07501926 
##          19          20          21          22          23          24 
## -1.20498074  1.20424838 -0.18734048 -0.51494219 -1.20262955 -0.59671091 
##          25 
## -0.25988864

(c)

The aov function can be used to make the ANOVA table:

steamAtmsphTemp.anova <- aov(steamAtmsphTemp.temperature ~ steamAtmsphTemp.steam,
                             data = steamAtmsphTemp)
summary(steamAtmsphTemp.anova)
##                       Df Sum Sq Mean Sq F value   Pr(>F)    
## steamAtmsphTemp.steam  1  45.59   45.59   57.54 1.05e-07 ***
## Residuals             23  18.22    0.79                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(d)

The coefficient of determination can be found by the following method:

summary(steamAtmsphTemp.lm)$r.squared
## [1] 0.7144375

The correlation coefficient can be found using the following method:

cor(steamAtmsphTemp.steam,steamAtmsphTemp.temperature)
## [1] -0.8452441

(e & f)

summary.lm(steamAtmsphTemp.lm)
## 
## Call:
## lm(formula = steamAtmsphTemp.temperature ~ steamAtmsphTemp.steam)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6789 -0.5291 -0.1221  0.7988  1.3457 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           13.62299    0.58146  23.429  < 2e-16 ***
## steamAtmsphTemp.steam -0.07983    0.01052  -7.586 1.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8901 on 23 degrees of freedom
## Multiple R-squared:  0.7144, Adjusted R-squared:  0.702 
## F-statistic: 57.54 on 1 and 23 DF,  p-value: 1.055e-07

(g)

The confidence interval can also be found:

confint(steamAtmsphTemp.lm, level = 0.95)
##                            2.5 %      97.5 %
## (Intercept)           12.4201404 14.82583815
## steamAtmsphTemp.steam -0.1015984 -0.05805901

(3)

par(mfrow = c(2, 2))
plot(steamAtmsphTemp.lm)

Explanation: Residuals vs Fitted: The trend line is roughly horizontal at 0 and the residuals are evenly spread around it. This supports the linearity assumption.

Normal QQ Plot: The points follow the dotted straight line, which supports the assumption that the residuals are normally distributed. Only 2 points fall off this line, and these can be ignored.

Scale-Location: The trend line is roughly horizontal with the residuals evenly spread around it. This indicates homoscedasticity.

Residuals vs Leverage: There are no points beyond the Cook’s distance contours. This means there are no influential points or significant outliers affecting the regression.
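The visual reading of the plots can be backed up numerically. The sketch below refits the model from the 25 observations listed in Q2 (b) and inspects Cook's distance and a Shapiro-Wilk test on the residuals. The short names `x`, `y` and `fit` are illustrative, and the 0.5 threshold is an assumption that mirrors the first contour R draws on the Residuals vs Leverage plot.

```r
# The 25 observations from the table in Q2 (b)
x <- c(35.3, 29.7, 30.8, 58.8, 61.4, 71.3, 74.4, 76.7, 70.7, 57.5,
       46.4, 28.9, 28.1, 39.1, 46.8, 48.5, 59.3, 70.0, 70.0, 74.5,
       72.1, 58.1, 44.6, 33.4, 28.6)
y <- c(10.98, 11.13, 12.51, 8.40, 9.27, 8.73, 6.36, 8.50, 7.82, 9.14,
       8.24, 12.19, 11.88, 9.57, 10.94, 9.58, 10.09, 8.11, 6.83, 8.88,
       7.68, 8.47, 8.86, 10.36, 11.08)

fit <- lm(y ~ x)

# Influence: Cook's distance for every observation
cook <- cooks.distance(fit)
max(cook)            # largest Cook's distance in the sample
which(cook > 0.5)    # observations past the 0.5 contour, if any

# Normality of residuals: a formal companion to the QQ plot
shapiro.test(residuals(fit))
```

An empty result from `which(cook > 0.5)` corresponds to the statement that no point lies beyond the Cook’s distance contours.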