Read in the dataset and take a look at the head of the data. There are five variables.
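| | Creek | Side | Number | Distance | Measurement |
|---|---|---|---|---|---|
| 1 | A | Sea | 1 | 0.00 | 123.92 |
| 2 | A | Land | 1 | 0.00 | 118.32 |
| 3 | A | Land | 2 | 25.32 | 127.10 |
| 4 | A | Land | 3 | 47.80 | 148.34 |
| 5 | A | Land | 4 | 79.72 | 34.90 |
| 6 | A | Land | 5 | 136.49 | 130.95 |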
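As a minimal sketch of this step: the name of the original data file is not given, so the example below writes the six rows shown above to a temporary CSV (the file and the object name `creeks` are stand-ins, not the real ones) and reads it back with `read.csv`.

```r
# The six rows from the head() output above, written to a temporary CSV
# so the example runs without the original file (whose name is unknown).
rows <- c(
  "Creek,Side,Number,Distance,Measurement",
  "A,Sea,1,0.00,123.92",
  "A,Land,1,0.00,118.32",
  "A,Land,2,25.32,127.10",
  "A,Land,3,47.80,148.34",
  "A,Land,4,79.72,34.90",
  "A,Land,5,136.49,130.95"
)
path <- tempfile(fileext = ".csv")  # stand-in for the real data file
writeLines(rows, path)

# Read Creek and Side as factors so they can be used as categorical variables.
creeks <- read.csv(path, stringsAsFactors = TRUE)
print(head(creeks))  # first six rows
str(creeks)          # names and types of the five variables
```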
Question 1: Make a plot containing boxplots of the contaminant measurements, separating out the Sea and Land measurements of each creek. Make sure the plot is properly labelled. Describe very briefly any features you observe from the boxplots.
In the plot above, the red points mark the mean measurements. For creeks A and D, the mean measurements on the sea side and the land side are quite different. For creek F, the spread of the measurements differs considerably between the sea side and the land side.
From the plot above, we can see that, on both the sea side and the land side, the variance of the measurements differs considerably across the six creeks.
There is an obvious outlier in the land-side measurements of creek C, and the sea-side measurements of creek E are right-skewed.
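A plot of this kind could be produced roughly as follows. This is only a sketch in base R: the data are simulated placeholders (not the real measurements), and the object name `creeks`, the colours, and the axis labels are my own choices.

```r
set.seed(1)
# Simulated stand-in for the real data: 6 creeks x 2 sides x 6 values each.
creeks <- expand.grid(Number = 1:6, Side = c("Land", "Sea"),
                      Creek = LETTERS[1:6])
creeks$Measurement <- rlnorm(nrow(creeks), meanlog = 4.7, sdlog = 0.6)

op <- par(mar = c(6, 4, 3, 1))  # extra bottom margin for rotated labels
# One box per Side-within-Creek combination (Land.A, Sea.A, Land.B, ...).
bp <- boxplot(Measurement ~ Side + Creek, data = creeks,
              las = 2, xlab = "", ylab = "Contaminant measurement",
              main = "Measurements by creek and side",
              col = c("tan", "lightblue"))

# Group means in the same order as the boxes, overlaid as red points.
means <- tapply(creeks$Measurement, list(creeks$Side, creeks$Creek), mean)
points(seq_along(means), as.vector(means), pch = 19, col = "red")
par(op)
```

Because `Side` is listed first in the formula, the Land and Sea boxes of each creek sit next to each other, which makes the within-creek comparison easier to read.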
Question 2: We are interested in finding out if the sea-side measurements are different from the land-side measurements or not. We will do it in two ways.
(a) Test each creek separately whether the mean sea-side measurements are equal to the mean land-side measurements or not. Summarize your results succinctly.
[1] "ALand$Measurement and ASea$Measurement"
t statistic p-value
t-test -2.549559 0.06576873
The p-value is 0.06577 > 0.05, so the null hypothesis of equal means is not rejected at the 5% level. However, the p-value is much closer to 0.05 than for most of the other creeks, so there is weak evidence (significant at the 10% level) that the mean measurements of the sea side and the land side differ for creek A.
[1] "BLand$Measurement and BSea$Measurement"
t statistic p-value
t-test -0.4897228 0.6414854
The p-value is 0.6415 > 0.05, so there is not enough evidence to reject the null hypothesis: the data give no indication that the mean measurements of the sea side and the land side differ for creek B.
[1] "CLand$Measurement and CSea$Measurement"
t statistic p-value
t-test -0.1231174 0.9049277
The p-value is 0.9049 > 0.05, so there is not enough evidence to reject the null hypothesis: the data give no indication that the mean measurements of the sea side and the land side differ for creek C.
[1] "DLand$Measurement and DSea$Measurement"
t statistic p-value
t-test 2.156275 0.06107179
The p-value is 0.06107 > 0.05, so, as for creek A, the null hypothesis is not rejected at the 5% level, but the p-value is much closer to 0.05 than for most of the other creeks. There is weak evidence (significant at the 10% level) that the mean measurements of the sea side and the land side differ for creek D.
[1] "ELand$Measurement and ESea$Measurement"
t statistic p-value
t-test 0.2796955 0.787288
The p-value is 0.7873 > 0.05, so there is not enough evidence to reject the null hypothesis: the data give no indication that the mean measurements of the sea side and the land side differ for creek E.
[1] "FLand$Measurement and FSea$Measurement"
t statistic p-value
t-test -0.3276592 0.7535138
The p-value is 0.7535 > 0.05, so there is not enough evidence to reject the null hypothesis: the data give no indication that the mean measurements of the sea side and the land side differ for creek F.
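The six tests above can be run in one loop instead of one call per creek. The sketch below uses simulated placeholder data and assumes the Welch two-sample test, which is the default of `t.test`; whether the original analysis used that default is an assumption on my part.

```r
set.seed(2)
# Simulated stand-in for the real data (not the actual measurements).
creeks <- expand.grid(Number = 1:6, Side = c("Land", "Sea"),
                      Creek = LETTERS[1:6])
creeks$Measurement <- rlnorm(nrow(creeks), meanlog = 4.7, sdlog = 0.6)

# One two-sample t-test per creek. Land is x and Sea is y, so a negative
# t statistic means the land-side mean is lower than the sea-side mean.
by_creek <- split(creeks, creeks$Creek)
results <- t(sapply(by_creek, function(d) {
  tt <- t.test(d$Measurement[d$Side == "Land"],
               d$Measurement[d$Side == "Sea"])  # Welch test by default
  c(t = unname(tt$statistic), df = unname(tt$parameter),
    p.value = tt$p.value)
}))
print(round(results, 4))  # one row per creek
```

Collecting the results in one matrix makes it easy to report all six creeks in a single table.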
(b) Consider Creek and Side as categorical variables and fit a regression model containing these two variables as well as their interactions. You do not need to transform the response. Interpret the model and summarize your findings.
Call:
lm(formula = Measurement ~ Creek + Side + Creek:Side)
Residuals:
Min 1Q Median 3Q Max
-149.43 -48.44 -10.52 43.16 196.16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 103.139 25.023 4.122 0.000121 ***
CreekB 5.151 36.476 0.141 0.888187
CreekC -22.396 35.387 -0.633 0.529305
CreekD 62.885 36.476 1.724 0.090036 .
CreekE 82.973 39.564 2.097 0.040348 *
CreekF 55.221 39.564 1.396 0.168114
SideSea 93.286 45.110 2.068 0.043113 *
CreekB:SideSea -78.026 64.406 -1.211 0.230625
CreekC:SideSea -89.014 63.795 -1.395 0.168241
CreekD:SideSea -192.412 64.406 -2.988 0.004117 **
CreekE:SideSea -108.810 66.203 -1.644 0.105675
CreekF:SideSea -74.626 66.203 -1.127 0.264290
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 75.07 on 58 degrees of freedom
Multiple R-squared: 0.274, Adjusted R-squared: 0.1363
F-statistic: 1.99 on 11 and 58 DF, p-value: 0.04606
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Creek | 5 | 70955.63 | 14191.13 | 2.52 | 0.0394 |
| Side | 1 | 186.36 | 186.36 | 0.03 | 0.8563 |
| Creek:Side | 5 | 52200.59 | 10440.12 | 1.85 | 0.1168 |
| Residuals | 58 | 326839.19 | 5635.16 | | |
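We first fit the regression model and then perform an ANOVA on it. In the ANOVA table above, Creek is the only term with a p-value below 0.05 (0.0394), so it is the only significant variable in the model; Side (p = 0.8563) and the Creek:Side interaction (p = 0.1168) are not significant at the 5% level. In the coefficient table, creek A on the land side is the baseline, so SideSea (93.286, p = 0.0431) is the estimated sea-minus-land difference for creek A, and CreekD:SideSea (-192.412, p = 0.0041) indicates that this difference is significantly different for creek D.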
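The two steps (fit, then ANOVA) can be sketched as follows. The data are simulated placeholders, so the numbers will not match the output above; only the model formula is taken from the report.

```r
set.seed(3)
# Simulated stand-in for the real data (not the actual measurements).
creeks <- expand.grid(Number = 1:6, Side = c("Land", "Sea"),
                      Creek = LETTERS[1:6])
creeks$Measurement <- rlnorm(nrow(creeks), meanlog = 4.7, sdlog = 0.6)

# Creek and Side are factors; Creek:Side adds one interaction term for
# every non-baseline creek, giving 12 coefficients in total.
fit <- lm(Measurement ~ Creek + Side + Creek:Side, data = creeks)
print(summary(fit)$coefficients)

# anova() on an lm fit gives sequential F tests, in the order the terms
# appear in the formula: Creek, then Side, then Creek:Side.
tab <- anova(fit)
print(tab)
```

Because the tests are sequential, the order of the terms can matter when the design is unbalanced, as it is in the real data (more land-side than sea-side measurements).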
(c) Compare your findings from parts a and b above. Explain any similarities or differences you find.
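One way to line up the two parts is to compute, from the coefficients reported in (b), the sea-minus-land difference that the model implies for each creek, and set it beside the t statistics from (a). The sketch below only does arithmetic on numbers copied from the outputs above. It assumes the t statistics were computed as land minus sea, which is what the labels such as "ALand$Measurement and ASea$Measurement" suggest.

```r
# Coefficients copied from the regression summary in part (b).
side_sea <- 93.286  # sea-minus-land difference for baseline creek A
creek_by_sea <- c(A = 0, B = -78.026, C = -89.014,
                  D = -192.412, E = -108.810, F = -74.626)

# Implied sea-minus-land difference in mean measurement for each creek.
sea_minus_land <- side_sea + creek_by_sea

# t statistics copied from part (a); assumed to be land minus sea.
t_land_minus_sea <- c(A = -2.549559, B = -0.4897228, C = -0.1231174,
                      D = 2.156275, E = 0.2796955, F = -0.3276592)

comparison <- data.frame(sea_minus_land, t_land_minus_sea)
print(comparison)

# If both parts describe the same differences, the signs must be opposite.
signs_agree <- all(sign(sea_minus_land) == -sign(t_land_minus_sea))
print(signs_agree)
```

The implied differences are largest in magnitude for creeks A (93.286) and D (-99.126), the same two creeks with the smallest p-values in (a). The p-values themselves need not match between the two parts, since the regression pools the residual variance over all creeks while each t-test uses only the data from one creek.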
Question 3: There is also interest in examining if there is any trend in the land-side measurements with distance away from the flood gate.
(a) Combining the land-side measurements for all the creeks, fit a regression model with Distance as the covariate. Think about whether you want to transform the measurements or not. Provide a rationale for your decision.
Call:
lm(formula = Measurement.land ~ Distance)
Residuals:
Min 1Q Median 3Q Max
-116.69 -71.02 -11.88 37.45 223.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 131.482293 17.902763 7.344 3.59e-09 ***
Distance -0.004078 0.017940 -0.227 0.821
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 86.16 on 44 degrees of freedom
Multiple R-squared: 0.001173, Adjusted R-squared: -0.02153
F-statistic: 0.05167 on 1 and 44 DF, p-value: 0.8212
After fitting the regression model, the Normal Q-Q plot of the residuals shows clear skewness, so the measurement variable probably needs a transformation. Using the Box-Cox method, the suggested lambda value is approximately 0.33. (A lambda of 0.33 is closest to a cube-root transformation; the log, which corresponds to lambda = 0, was used here.) I created a new variable, y, which is the log transformation of the original measurements, and then fit the regression model again.
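The Box-Cox step can be sketched with `boxcox` from the MASS package. The data below are simulated placeholders (46 rows, matching the degrees of freedom above, with a hypothetical distance range), so the estimated lambda will not be the 0.33 reported here.

```r
library(MASS)  # provides boxcox()

set.seed(4)
# Simulated land-side data; NOT the real measurements. The distance range
# and the lognormal parameters are assumptions for illustration only.
land <- data.frame(Distance = runif(46, 0, 2500))
land$Measurement <- rlnorm(46, meanlog = 4.7, sdlog = 0.8)

# y = TRUE stores the response in the fit, which boxcox() needs.
fit_raw <- lm(Measurement ~ Distance, data = land, y = TRUE)

# Profile log-likelihood over a grid of lambda values; take the maximizer.
bc <- boxcox(fit_raw, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# Rough reading of lambda: near 0 -> log, near 1/3 -> cube root,
# near 1/2 -> square root, near 1 -> no transformation.
land$y <- log(land$Measurement)
fit_log <- lm(y ~ Distance, data = land)
print(summary(fit_log)$coefficients)
```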
Call:
lm(formula = y ~ Distance)
Residuals:
Min 1Q Median 3Q Max
-2.0857 -0.5542 0.1646 0.5172 1.2094
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6914698 0.1649950 28.43 <2e-16 ***
Distance -0.0001307 0.0001653 -0.79 0.434
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7941 on 44 degrees of freedom
Multiple R-squared: 0.01399, Adjusted R-squared: -0.008416
F-statistic: 0.6245 on 1 and 44 DF, p-value: 0.4336
After transformation, the Q-Q plot of the residuals looks more normal.
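The before-and-after comparison of the Q-Q plots could be drawn side by side as in this sketch, again on simulated placeholder data rather than the real measurements.

```r
set.seed(5)
# Simulated land-side data; NOT the real measurements.
land <- data.frame(Distance = runif(46, 0, 2500))
land$Measurement <- rlnorm(46, meanlog = 4.7, sdlog = 0.8)

fit_raw <- lm(Measurement ~ Distance, data = land)
fit_log <- lm(log(Measurement) ~ Distance, data = land)

# Residual Q-Q plots next to each other; points close to the reference
# line indicate residuals that are approximately normal.
op <- par(mfrow = c(1, 2))
qqnorm(resid(fit_raw), main = "Raw measurements")
qqline(resid(fit_raw))
qqnorm(resid(fit_log), main = "Log measurements")
qqline(resid(fit_log))
par(op)
```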
(b) Repeat the above analysis, but with each creek separately. Remember to use only the land-side measurements.
Creek A:
Using the same method as in part (a), the suggested transformation for the creek A measurement variable is approximately a square transformation. The suggested lambda value is about 2.
lm(formula = dataA$Measurement ~ dataA$Distance)
lm(formula = yA ~ dataA$Distance)
Creek B:
For creek B, there is no need to transform the measurements variable, since the suggested lambda value is around 1.
lm(formula = dataB$Measurement ~ dataB$Distance)
Creek C:
Using the same method as in part (a), the suggested transformation for the creek C measurement variable is a log transformation. The suggested lambda value is about 0.
lm(formula = dataC$Measurement ~ dataC$Distance)
lm(formula = yC ~ dataC$Distance)
Creek D:
For creek D, there is also no need to transform the measurements variable. The suggested lambda value is around 1.
lm(formula = dataD$Measurement ~ dataD$Distance)
Creek E:
Using the same method as in part (a), the suggested transformation for the creek E measurement variable is a log transformation. The suggested lambda value is about 0.
lm(formula = dataE$Measurement ~ dataE$Distance)
lm(formula = yE ~ dataE$Distance)
Creek F:
Using the same method as in part (a), the suggested transformation for the creek F measurement variable is a log transformation. The suggested lambda value is about 0.
lm(formula = dataF$Measurement ~ dataF$Distance)
lm(formula = yF ~ dataF$Distance)
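The per-creek analysis can also be written as one function applied to each creek, which keeps each creek's data and results together. This is a sketch on simulated placeholder data; the helper name `fit_one` and the data frame `land` are hypothetical.

```r
library(MASS)  # provides boxcox()

set.seed(6)
# Simulated land-side data, 8 rows per creek; NOT the real measurements.
land <- expand.grid(Number = 1:8, Creek = LETTERS[1:6])
land$Distance <- runif(nrow(land), 0, 2500)
land$Measurement <- rlnorm(nrow(land), meanlog = 4.7, sdlog = 0.6)

# For one creek: fit the untransformed model, find the Box-Cox lambda,
# and return it with the p-value of Distance in the untransformed fit.
fit_one <- function(d) {
  raw <- lm(Measurement ~ Distance, data = d, y = TRUE)
  bc <- boxcox(raw, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
  c(lambda = bc$x[which.max(bc$y)],
    p.distance = summary(raw)$coefficients["Distance", "Pr(>|t|)"])
}

per_creek <- t(sapply(split(land, land$Creek), fit_one))
print(round(per_creek, 3))  # one row per creek
```

With only a handful of observations per creek, the estimated lambda is very noisy, so it is worth looking at the whole likelihood curve rather than the maximizer alone before choosing a transformation.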
Summarize your findings from the fitted models in 3a and 3b above, and make conclusions about any trends you may (or may not) have found.
In all of the fitted models from 3(a) and 3(b), the p-values for the distance variable are greater than 0.05 (not shown here due to the page limit). However, I do not think I can jump to the conclusion that the distance variable is insignificant as a covariate, because I suspect that the distance variable also needs a transformation. More exploratory analysis should be performed before a final conclusion is drawn.
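A first step in that exploratory analysis might look like the sketch below, which compares a few candidate transformations of Distance by AIC. The candidates (`log1p` and `sqrt`) are my own suggestions, not something established above, and the data are simulated placeholders. `log1p` is used rather than `log` because some distances are 0, as in the first rows of the data.

```r
set.seed(7)
# Simulated land-side data including a zero distance; NOT the real data.
land <- data.frame(Distance = c(0, runif(45, 0, 2500)))
land$Measurement <- rlnorm(46, meanlog = 4.7, sdlog = 0.8)

# Same response in every model, so the AIC values are comparable.
fit_lin  <- lm(log(Measurement) ~ Distance, data = land)
fit_logd <- lm(log(Measurement) ~ log1p(Distance), data = land)
fit_sqrt <- lm(log(Measurement) ~ sqrt(Distance), data = land)

aic <- AIC(fit_lin, fit_logd, fit_sqrt)
print(aic)  # lower AIC indicates a better trade-off of fit and complexity
```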