Researchers at National Institutes of Standards and Technology (NIST) collected pipeline data on ultrasonic measurements of the depth of defects in the Alaska pipeline in the field. The depth of the defects were then remeasured in the laboratory. These measurements were performed in six different batches. It turns out that this batch effect is not significant and so can be ignored in the analysis that follows. The laboratory measurements are more accurate than the in-field measurements, but more time consuming and expensive. We want to develop a regression equation for correcting the in-field measurements.

  1. Fit a regression model Lab ~ Field. Check for non-constant variance.
##   Field  Lab Batch
## 1    18 20.2     1
## 2    38 56.0     1
## 3    15 12.5     1
## 4    20 21.2     1
## 5    18 15.5     1
## 6    36 39.0     1
## [1] 107   3
## [1] "Field" "Lab"   "Batch"
## 
## Call:
## lm(formula = Lab ~ Field, data = pipeline)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.985  -4.072  -1.431   2.504  24.334 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.96750    1.57479  -1.249    0.214    
## Field        1.22297    0.04107  29.778   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.865 on 105 degrees of freedom
## Multiple R-squared:  0.8941, Adjusted R-squared:  0.8931 
## F-statistic: 886.7 on 1 and 105 DF,  p-value: < 2.2e-16

## [1] 0.08754773
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(pipeline_lm)
## W = 0.92697, p-value = 1.841e-05

There doesn’t appear to be any autocorrelation from repeated experiments, however the Shapiro-Wilk test supports rejecting the null hypothesis that residuals are normally distributed.

  1. We wish to use weights to account for the non-constant variance. Here we split the range of Field into 12 groups of size nine (except for the last group which has only eight values). Within each group, we compute the variance of Lab as varlab and the mean of Field as meanfield. Supposing pipeline is the name of your data frame, the following R code will make the needed computations:

i <- order(pipeline\(Field) npipe <- pipeline[i,] ff <- gl(12,9)[-108] meanfield <- unlist(lapply(split(npipe\)Field,ff),mean)) varlab <- unlist(lapply(split(npipe$Lab,ff),var))

Suppose we guess that the the variance in the response is linked to the predictor in the following way:

\[var(Lab) = a_0 Field^{a_1}\]

Regress log(varlab) on log(meanfield) to estimate a0 and a1. (You might choose to remove the last point.) Use this to determine appropriate weights in a WLS fit of Lab on Field. Show the regression summary.

##    (Intercept) log(meanfield) 
##     -0.3537921      1.1244202

In this case, the intercept is log(a0) and the slope of meanfield is a1.

## Generalized least squares fit by REML
##   Model: Lab ~ Field 
##   Data: pipeline 
##        AIC      BIC    logLik
##   708.1764 716.1383 -351.0882
## 
## Variance function:
##  Structure: fixed weights
##  Formula: ~(var_lab) 
## 
## Coefficients:
##                 Value Std.Error  t-value p-value
## (Intercept) -1.494365 0.9070661 -1.64747  0.1025
## Field        1.208276 0.0348839 34.63704  0.0000
## 
##  Correlation: 
##       (Intr)
## Field -0.809
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.7814680 -0.6930709 -0.2727923  0.5313567  2.9450736 
## 
## Residual standard error: 1.169044 
## Degrees of freedom: 107 total; 105 residual

  1. An alternative to weighting is transformation. Find transformations on Lab and/or Field so that in the transformed scale the relationship is approximately linear with constant variance. You may restrict your choice of transformation to square root, log and inverse.
## 
## Call:
## lm(formula = sqrt(Lab) ~ Field, data = pipeline)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1688 -0.4060 -0.0514  0.2766  1.7103 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.558536   0.122530   20.88   <2e-16 ***
## Field       0.100642   0.003195   31.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6119 on 105 degrees of freedom
## Multiple R-squared:  0.9043, Adjusted R-squared:  0.9034 
## F-statistic: 991.9 on 1 and 105 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = log(Lab) ~ Field, data = pipeline)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97034 -0.13220  0.00857  0.15797  0.49898 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.251322   0.052269   43.07   <2e-16 ***
## Field       0.035526   0.001363   26.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.261 on 105 degrees of freedom
## Multiple R-squared:  0.8661, Adjusted R-squared:  0.8648 
## F-statistic: 679.2 on 1 and 105 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = 1/(Lab) ~ (Field), data = pipeline)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.023296 -0.012976 -0.005380  0.005615  0.150379 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0892885  0.0044632   20.00   <2e-16 ***
## Field       -0.0014220  0.0001164  -12.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02229 on 105 degrees of freedom
## Multiple R-squared:  0.587,  Adjusted R-squared:  0.5831 
## F-statistic: 149.2 on 1 and 105 DF,  p-value: < 2.2e-16