Math 664 - Methods for Statistical Consulting

HW-03

Student Name: Yao Zhang (31250772)

Read in the dataset and take a look at the head of the data. There are five variables.

  Creek Side Number Distance Measurement
1     A  Sea      1     0.00      123.92
2     A Land      1     0.00      118.32
3     A Land      2    25.32      127.10
4     A Land      3    47.80      148.34
5     A Land      4    79.72       34.90
6     A Land      5   136.49      130.95

Question 1: Make a plot containing boxplots of the contaminant measurements, separating out the Sea and Land measurements of each creek. Make sure the plot is properly labelled. Describe very briefly any features you observe from the boxplots.

In the plot above, the red points are the mean measurements. We can observe that, for creeks A and D, the mean measurements on the sea side and the land side are quite different. For creek F, the variance of the measurements across the sea side and the land side is quite large.

From the plot above, we can see that, for both the sea side and the land side, the variances of the measurements differ considerably among the six creeks.
There is an obvious outlier among the land-side measurements of creek C, and the sea-side measurements of creek E are right-skewed.
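
A minimal sketch of how a plot of this kind could be produced in base R. The data frame `creek` below is simulated and only stands in for the real dataset; the column names follow the data head shown earlier, and the group sizes are assumptions.

```r
# Hypothetical stand-in for the creek data (simulated, not the real measurements).
set.seed(1)
creek <- data.frame(
  Creek = rep(LETTERS[1:6], each = 10),
  Side  = rep(rep(c("Land", "Sea"), times = c(7, 3)), times = 6),
  Measurement = rexp(60, rate = 1 / 120)
)

# One box per Side-within-Creek combination (labels such as "Land.A", "Sea.A").
bp <- boxplot(Measurement ~ Side + Creek, data = creek,
              las = 2, xlab = "", ylab = "Contaminant measurement",
              main = "Measurements by creek and side")

# Group means in the same order as the boxes, overlaid as red points.
grp   <- interaction(creek$Side, creek$Creek)
means <- tapply(creek$Measurement, grp, mean)
points(seq_along(means), means, col = "red", pch = 19)
```

Putting Side first in the formula keeps the Land and Sea boxes of each creek next to each other, which makes the within-creek comparison easier to read.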

Question 2: We are interested in finding out if the sea-side measurements are different from the land-side measurements or not. We will do it in two ways.

(a) Test each creek separately whether the mean sea-side measurements are equal to the mean land-side measurements or not. Summarize your results succinctly.

[1] "ALand$Measurement and ASea$Measurement"
       t statistic    p-value
t-test   -2.549559 0.06576873

The p-value is 0.06577 > 0.05, so the null hypothesis is not rejected at the 5% level. However, the p-value is much closer to 0.05 than for the other creeks, so there is marginal evidence that the mean measurements of the sea side and the land side are different for creek A.

[1] "BLand$Measurement and BSea$Measurement"
       t statistic   p-value
t-test  -0.4897228 0.6414854

The p-value is 0.6415 > 0.05, so there is not enough evidence to reject the null hypothesis. The data are consistent with equal mean measurements on the sea side and the land side for creek B.

[1] "CLand$Measurement and CSea$Measurement"
       t statistic   p-value
t-test  -0.1231174 0.9049277

The p-value is 0.9049 > 0.05, so there is not enough evidence to reject the null hypothesis. The data are consistent with equal mean measurements on the sea side and the land side for creek C.

[1] "DLand$Measurement and DSea$Measurement"
       t statistic    p-value
t-test    2.156275 0.06107179

The p-value is 0.06107 > 0.05, so the null hypothesis is not rejected at the 5% level. As with creek A, however, the p-value is much closer to 0.05 than for the other creeks, so there is marginal evidence that the mean measurements of the sea side and the land side are different for creek D.

[1] "ELand$Measurement and ESea$Measurement"
       t statistic  p-value
t-test   0.2796955 0.787288

The p-value is 0.7873 > 0.05, so there is not enough evidence to reject the null hypothesis. The data are consistent with equal mean measurements on the sea side and the land side for creek E.

[1] "FLand$Measurement and FSea$Measurement"
       t statistic   p-value
t-test  -0.3276592 0.7535138

The p-value is 0.7535 > 0.05, so there is not enough evidence to reject the null hypothesis. The data are consistent with equal mean measurements on the sea side and the land side for creek F.
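
The six tests above can be run in one loop rather than one creek at a time. The sketch below uses simulated data as a stand-in for the real dataset, and it assumes the default Welch form of `t.test()`; the output above does not show whether the Welch or the pooled-variance test was actually used.

```r
# Hypothetical stand-in data (simulated); the real file is not reproduced here.
set.seed(2)
creek <- data.frame(
  Creek = rep(LETTERS[1:6], each = 10),
  Side  = rep(rep(c("Land", "Sea"), times = c(7, 3)), times = 6),
  Measurement = rexp(60, rate = 1 / 120)
)

# One two-sample t-test (Land vs Sea) per creek. t.test() defaults to the
# Welch test (var.equal = FALSE), which does not assume equal variances.
results <- do.call(rbind, lapply(split(creek, creek$Creek), function(d) {
  tt <- t.test(d$Measurement[d$Side == "Land"],
               d$Measurement[d$Side == "Sea"])
  data.frame(Creek = d$Creek[1],
             t.statistic = unname(tt$statistic),
             p.value = tt$p.value)
}))
print(results)
```

Collecting the results in a single data frame makes it easier to compare the p-values across creeks side by side.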

(b) Consider Creek and Side as categorical variables and fit a regression model containing these two variables as well as their interactions. You do not need to transform the response. Interpret the model and summarize your findings.


Call:
lm(formula = Measurement ~ Creek + Side + Creek:Side)

Residuals:
    Min      1Q  Median      3Q     Max 
-149.43  -48.44  -10.52   43.16  196.16 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     103.139     25.023   4.122 0.000121 ***
CreekB            5.151     36.476   0.141 0.888187    
CreekC          -22.396     35.387  -0.633 0.529305    
CreekD           62.885     36.476   1.724 0.090036 .  
CreekE           82.973     39.564   2.097 0.040348 *  
CreekF           55.221     39.564   1.396 0.168114    
SideSea          93.286     45.110   2.068 0.043113 *  
CreekB:SideSea  -78.026     64.406  -1.211 0.230625    
CreekC:SideSea  -89.014     63.795  -1.395 0.168241    
CreekD:SideSea -192.412     64.406  -2.988 0.004117 ** 
CreekE:SideSea -108.810     66.203  -1.644 0.105675    
CreekF:SideSea  -74.626     66.203  -1.127 0.264290    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 75.07 on 58 degrees of freedom
Multiple R-squared:  0.274, Adjusted R-squared:  0.1363 
F-statistic:  1.99 on 11 and 58 DF,  p-value: 0.04606

           Df    Sum Sq  Mean Sq F value Pr(>F)
Creek       5  70955.63 14191.13    2.52 0.0394
Side        1    186.36   186.36    0.03 0.8563
Creek:Side  5  52200.59 10440.12    1.85 0.1168
Residuals  58 326839.19  5635.16

I first fit the regression model and then perform an ANOVA on it. From the ANOVA table above, the p-value of Creek is 0.0394 < 0.05, so Creek is a significant variable in the model. Side (p = 0.8563) and the Creek:Side interaction (p = 0.1168) are not significant at the 0.05 level.
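
To help interpret the coefficient table, the sketch below works out the estimated Sea-minus-Land difference for each creek from the printed estimates. It assumes R's default treatment contrasts, so creek A and the land side are the reference levels: `SideSea` is the difference in creek A, and each `CreekX:SideSea` term is added to it for creek X. The variable names are my own.

```r
# Estimates copied from the summary output above (treatment contrasts,
# reference levels: Creek A and Side Land).
side_sea <- 93.286
interaction_sea <- c(A = 0, B = -78.026, C = -89.014,
                     D = -192.412, E = -108.810, F = -74.626)

# Estimated Sea-minus-Land difference in mean measurement for each creek.
sea_minus_land <- side_sea + interaction_sea
print(round(sea_minus_land, 3))
```

The two largest differences in magnitude are for creek A (about 93.3) and creek D (about -99.1), and they have opposite signs. This is one plausible reason why the overall Side effect looks negligible in the ANOVA table even though individual creeks show sizeable sea-land differences.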

(c) Compare your findings from parts a and b above. Explain any similarities or differences you find.

  • In part (a), I use a two-sample t-test for each creek to see if there is a significant difference in the measurements between the sea side and the land side. There is marginal evidence of a difference for creeks A and D (p-values just above 0.05), but no evidence of a difference for any of the other creeks.
  • In part (b), I consider Creek and Side as two categorical variables and fit them in a regression model. From the ANOVA result, I conclude that the Side variable is not significant and the Creek variable is significant.

Question 3: There is also interest in examining if there is any trend in the land-side measurements with distance away from the flood gate.

(a) Combining the land-side measurements for all the creeks, fit a regression model with Distance as the covariate. Think about whether you want to transform the measurements or not. Provide a rationale for your decision.


Call:
lm(formula = Measurement.land ~ Distance)

Residuals:
    Min      1Q  Median      3Q     Max 
-116.69  -71.02  -11.88   37.45  223.98 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 131.482293  17.902763   7.344 3.59e-09 ***
Distance     -0.004078   0.017940  -0.227    0.821    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 86.16 on 44 degrees of freedom
Multiple R-squared:  0.001173,  Adjusted R-squared:  -0.02153 
F-statistic: 0.05167 on 1 and 44 DF,  p-value: 0.8212

After fitting the regression model, the Normal Q-Q plot of the residuals is clearly skewed, so the measurements variable probably needs a transformation. Using the Box-Cox method, the suggested lambda value is approximately 0.33. Note that a lambda near 0.33 corresponds to a cube-root transformation, whereas the log transformation corresponds to lambda = 0. I create a new variable called y, which is the log transformation of the original measurements, and then fit the regression model again.
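
A minimal sketch of the Box-Cox step, assuming `boxcox()` from the MASS package was the tool used (the report only names the method). The data are simulated stand-ins for the land-side measurements, and `box_cox()` is a hypothetical helper that writes out the transformation family for reference.

```r
library(MASS)  # provides boxcox()

# Hypothetical right-skewed data (simulated), standing in for the land-side data.
set.seed(3)
Distance <- runif(46, 0, 2000)
Measurement.land <- rlnorm(46, meanlog = 4.7, sdlog = 0.8)

# y = TRUE and qr = TRUE store the pieces that boxcox() needs in the fit.
fit <- lm(Measurement.land ~ Distance, y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of lambda; its maximiser is the suggested lambda.
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Box-Cox family: log(y) at lambda = 0, (y^lambda - 1) / lambda otherwise.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Refit on the log scale (lambda = 0) and inspect the residual Q-Q plot.
y <- box_cox(Measurement.land, 0)
fit_log <- lm(y ~ Distance)
qqnorm(resid(fit_log)); qqline(resid(fit_log))
```

In practice the suggested lambda is usually rounded to a nearby interpretable value (for example 0, 1/3, 1/2 or 1), provided that value lies inside the confidence interval that `boxcox()` draws when `plotit = TRUE`.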


Call:
lm(formula = y ~ Distance)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0857 -0.5542  0.1646  0.5172  1.2094 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.6914698  0.1649950   28.43   <2e-16 ***
Distance    -0.0001307  0.0001653   -0.79    0.434    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7941 on 44 degrees of freedom
Multiple R-squared:  0.01399,   Adjusted R-squared:  -0.008416 
F-statistic: 0.6245 on 1 and 44 DF,  p-value: 0.4336

After transformation, the Q-Q plot of the residuals looks more normal.

(b) Repeat the above analysis, but with each creek separately. Remember to use only the land-side measurements.

Creek A:
Using the same method as in part (a), the suggested transformation for the creek A measurements variable is approximately a square transformation. The suggested lambda value is about 2.

lm(formula = dataA$Measurement ~ dataA$Distance)
lm(formula = yA ~ dataA$Distance)

Creek B:
For creek B, there is no need to transform the measurements variable, since the suggested lambda value is around 1.

lm(formula = dataB$Measurement ~ dataB$Distance)

Creek C:
Using the same method as in part (a), the suggested transformation for the creek C measurements variable is a log transformation. The suggested lambda value is about 0.

lm(formula = dataC$Measurement ~ dataC$Distance)
lm(formula = yC ~ dataC$Distance)

Creek D:
For creek D, there is also no need to transform the measurements variable. The suggested lambda value is around 1.

lm(formula = dataD$Measurement ~ dataD$Distance)

Creek E:
Using the same method as in part (a), the suggested transformation for the creek E measurements variable is a log transformation. The suggested lambda value is about 0.

lm(formula = dataE$Measurement ~ dataE$Distance)
lm(formula = yE ~ dataE$Distance)

Creek F:
Using the same method as in part (a), the suggested transformation for the creek F measurements variable is a log transformation. The suggested lambda value is about 0.

lm(formula = dataF$Measurement ~ dataF$Distance)
lm(formula = yF ~ dataF$Distance)
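
The per-creek fits can also be written as one loop, which avoids copying variable names from one creek to the next. This is only a sketch on simulated data; the data frame `land` and its group sizes are assumptions, and for brevity it fits the untransformed response in every creek.

```r
# Hypothetical stand-in for the land-side data (simulated).
set.seed(4)
land <- data.frame(
  Creek = rep(LETTERS[1:6], each = 8),
  Distance = runif(48, 0, 2000),
  Measurement = rlnorm(48, meanlog = 4.7, sdlog = 0.8)
)

# Fit Measurement ~ Distance within each creek; keep the slope and its p-value.
slopes <- do.call(rbind, lapply(split(land, land$Creek), function(d) {
  co <- summary(lm(Measurement ~ Distance, data = d))$coefficients
  data.frame(Creek = d$Creek[1],
             slope = co["Distance", "Estimate"],
             p.value = co["Distance", "Pr(>|t|)"])
}))
print(slopes)
```

A creek-specific transformation (square for A, log for C, E and F, none for B and D) could be applied inside the function before calling `lm()`.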

Summarize your findings from the fitted models in 3a and 3b above, and make conclusions about any trends you may (or may not) have found.
In all of the fitted models from 3(a) and 3(b), the p-values for the distance variable are all greater than 0.05 (not shown here due to the page limit). However, I don't think I can jump to the conclusion that the distance variable is insignificant as a covariate, because I suspect the distance variable also needs a transformation. More exploratory analysis should be performed before a final conclusion is reached.
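
One possible way to follow up on a transformation of Distance is sketched below on simulated data. Because Distance contains exact zeros (see the data head), a plain log is not usable, so the sketch tries `log1p()` and `sqrt()` instead. These two candidates, and the use of AIC to compare them, are my own assumptions and not results from this analysis.

```r
# Hypothetical land-side data (simulated); Distance includes exact zeros.
set.seed(5)
land <- data.frame(
  Distance = c(0, 0, runif(44, 1, 2000)),
  Measurement = rlnorm(46, meanlog = 4.7, sdlog = 0.8)
)

# log(0) is -Inf, so use log1p(Distance) = log(1 + Distance) or sqrt(Distance).
fit_log1p <- lm(log(Measurement) ~ log1p(Distance), data = land)
fit_sqrt  <- lm(log(Measurement) ~ sqrt(Distance), data = land)

# Both models share the same response, so their AIC values are comparable
# (lower is better).
aic_table <- AIC(fit_log1p, fit_sqrt)
print(aic_table)
```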