Chapter 15 – Introduction to the Design of Experimental and Observational Studies


Load the data sets

env <- new.env()
load("../data.rda", envir = env)

Input Teaching Effectiveness Data

df <- get("CH15FI01", envir = env)
names(df) <- c("y", "x")

FIGURE 15.1 (p 645)

Teaching Performance Comparison–Teaching Effectiveness Example

stripchart(split(df$y, df$x), method = "stack", main = "Attendance", xlab = "Rating",
           group.names = c("Attend", "Not Attend"), pch = 19, col = "gray30")
abline(h = 2 - 0.05)  # Add a reference line  below the 2nd group

plot of chunk stripchart

Input Quick Bread Volume Data

For pedagogical reasons, a linear model on these day are presented. As the comment on page 646 states, ordinal data can be treated as evenly spaced interval data of a continuous nature. This is the default behavior with ordered factors in lm. We're still interested, however, in treating our fitted model like an ANOVA model with dummary variables. We do this by specifying the contrast used when you create a linear model with a categorical variable (dummy encoding). You can get the same results (but not necesssarily presented in the same order) by keeping the factor unordered and just making sure the reference level is the minimum ordered factor (in this case, “Low”). This can be done with relevel.

df <- get("CH15FI06", envir = env)
names(df) <- c("y", "x")

# Treat the factor as ordered
df <- transform(df, x = factor(x, ordered = TRUE, levels = c("Low", "Medium", "High", "Very High")))


# Specify appropriate contrasts to great this ordered factor as a categorical factor
fit <- lm(y ~ x, df, contrasts = list(x = "contr.treatment"))
summary(fit)
## 
## Call:
## lm(formula = y ~ x, data = df, contrasts = list(x = "contr.treatment"))
## 
## Residuals:
##    1    2    3    4    5    6    7    8 
##   50  -50   25  -25  110 -110   75  -75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    860.0       72.2   11.91  0.00028 ***
## xMedium        385.0      102.1    3.77  0.01959 *  
## xHigh          540.0      102.1    5.29  0.00613 ** 
## xVery High     245.0      102.1    2.40  0.07439 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 102 on 4 degrees of freedom
## Multiple R-squared: 0.883,   Adjusted R-squared: 0.796 
## F-statistic: 10.1 on 3 and 4 DF,  p-value: 0.0246
anova(fit)
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value Pr(>F)  
## x          3 315250  105083    10.1  0.025 *
## Residuals  4  41700   10425                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

FIGURE 15.6 (p 659)

Summary Plot–Quick Bread Volume Example

R defaults to plotting boxes for factors. We'll manually plot these as interval data. It is much more convenient to just make use of xyplot (lattice) in this case

xyplot(y ~ x, df, ylim = c(0, 1600), pch = 19)

plot(y ~ unclass(x), df, ylim = c(0, 1600), pch = 19, xaxt = 'n', xlab = "Oven Temperature", ylab = "Volume")
axis(1, at = seq(4), labels = levels(df$x))
title("Summary Plot")

plot of chunk unnamed-chunk-4

Input Blocked Quick Bread Volume Data

df <- get("CH15FI09", envir = env)
names(df) <- c("y", "x1", "x2")
df <- transform(df, x1 = factor(x1, ordered = TRUE, levels = c("Low", "Medium", "High", "Very High")))

fit <- lm(y ~ x1 + x2, df, contrasts = list(x1 = "contr.treatment"))
summary(fit)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df, contrasts = list(x1 = "contr.treatment"))
## 
## Residuals:
##   1   2   3   4   5   6   7   8 
## -15  15 -40  40  45 -45  10 -10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    795.0       40.6   19.60  0.00029 ***
## x1Medium       385.0       51.3    7.50  0.00491 ** 
## x1High         540.0       51.3   10.52  0.00183 ** 
## x1Very High    245.0       51.3    4.77  0.01746 *  
## x2Plant B      130.0       36.3    3.58  0.03722 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 51.3 on 3 degrees of freedom
## Multiple R-squared: 0.978,   Adjusted R-squared: 0.948 
## F-statistic: 33.1 on 4 and 3 DF,  p-value: 0.00812
anova(fit)
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value Pr(>F)   
## x1         3 315250  105083    39.9 0.0064 **
## x2         1  33800   33800    12.8 0.0372 * 
## Residuals  3   7900    2633                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

FIGURE 15.9 (p 662)

Summary Plot–Blocked Quick Bread Volume Optimization Example

Again, the use of xyplot may be preferable. The auto.key parameter controls the legend by setting it a list of options. In this snippet below, that is to generate a 2 column legend that, in this case, will put the 2 Plant factors as a horizontal legend value.

xyplot(y ~ x1, df, groups = x2, type = "b", pch = 20, auto.key = list(columns = 2))

plot(y ~ unclass(x1), df, col = unclass(x2), pch = 20, xaxt = 'n', xlab = "Oven Temperature", ylab = "Volume")
lines(y ~ unclass(x1), df, subset = x2 == 'Plant A', col = unclass(x2))
lines(y ~ unclass(x1), df, subset = x2 == 'Plant B', col = unclass(x2))
axis(1, at = seq(4), labels = levels(df$x1))
title("Summary Plot")
text(3.6, 1400, "Plant B", col = 2)
text(3.5, 1100, "Plant A")

plot of chunk unnamed-chunk-6

Input Skin Sensitivity Experiment Data

The data is currently in a “wide” format and possibly can be reshaped using reshape (stats). However, stack (utils) will suffice. It returns the values with an indicator variable (ind) derived from the selected columns to stack; in this case, ind will consist of factors for 'control' and 'experiment'. Thus, the model will just require this long dataset combined with the subject numbers. An alternative is to use the functions, like melt, found in Wickham's reshape package.

df <- get("CH15TA01", envir = env)
names(df) <- c("subject", "control", "experiment", "difference")

df <- transform(df,
                x1 = stack(df, select = c(control, experiment))$ind,
                x2 = factor(subject),
                y  = stack(df, select = c(control, experiment))$values)

fit <- lm(y ~ x1 + x2, df)

TABLE 15.1 (p 670)

Data and Descriptive Statistics–Skin Sensitivity Experiment

Author error found for within-subject difference on the sample standard deviation.

0.758 - 0.807 != 0.0501

tab <- xtabs(y ~ x2 + x1, df)  # Wide format table like before
tab <- addmargins(tab, 1, FUN = list(list('Mean' = mean, 'Std' = sd)))  # Add sample means and std. dev. to bottom margin
tab <- addmargins(tab, 2, FUN = diff)  # Add within-subject differences to side margin
round(tab, 4)
##       x1
## x2     control experiment    diff
##   1     0.5900     0.4300 -0.1600
##   2     0.6900     0.5300 -0.1600
##   3     0.8200     0.5800 -0.2400
##   4     0.8000     0.6500 -0.1500
##   5     0.6600     0.3800 -0.2800
##   6     0.6700     0.4800 -0.1900
##   7     0.7600     0.6200 -0.1400
##   8     0.8000     0.6000 -0.2000
##   9     0.7200     0.4400 -0.2800
##   10    0.8600     0.6200 -0.2400
##   11    0.7400     0.6000 -0.1400
##   12    0.7200     0.5500 -0.1700
##   13    0.6600     0.4200 -0.2400
##   14    0.6600     0.5600 -0.1000
##   15    0.6800     0.4700 -0.2100
##   16    0.6700     0.5000 -0.1700
##   17    0.6900     0.5400 -0.1500
##   18    0.8500     0.6000 -0.2500
##   19    0.8500     0.6500 -0.2000
##   20    0.7400     0.5800 -0.1600
##   Mean  0.7315     0.5400 -0.1915
##   Std   0.0768     0.0807  0.0039

FIGURE 15.13 (p 670) #

Summary Plot–Skin Sensitivity Example

Since the points overplot, their x-values are randomly shifted a small amount using jitter.

x <- jitter(unclass(df$x1), 0.12)
plot(y ~ x, df, pch = 19, xaxt = 'n', xlab = "Treatment", ylab = "Diameter")
axis(1, at = seq(2), labels = levels(df$x1))
for (id in df$x2)
  lines(y ~ x, df, subset = x2 == id)
title("Summary Plot")

plot of chunk unnamed-chunk-9

FIGURE 15.14 (p 671)

Regression Results–Skin Sensitivity Experiment

As the authors point out, the coefficients on the subjects don't matter as the blocking variable was included only to improve the precision of the comparison between control and experimental treatments. The coefficients will be different from what the book has because it uses a different reference level. R uses the first factor as the reference level. The book (15.18) defines the model as using reference level 20 (the variable left out of the X_ij assignment). Thus, if you use relevel to change the reference level to 20, you'll get the same subject coefficients.

summary(fit)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.0457 -0.0208  0.0000  0.0208  0.0457 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.6058     0.0257   23.61  1.5e-15 ***
## x1experiment  -0.1915     0.0112  -17.10  5.4e-13 ***
## x22            0.1000     0.0354    2.82  0.01085 *  
## x23            0.1900     0.0354    5.37  3.5e-05 ***
## x24            0.2150     0.0354    6.07  7.7e-06 ***
## x25            0.0100     0.0354    0.28  0.78070    
## x26            0.0650     0.0354    1.84  0.08214 .  
## x27            0.1800     0.0354    5.08  6.6e-05 ***
## x28            0.1900     0.0354    5.37  3.5e-05 ***
## x29            0.0700     0.0354    1.98  0.06278 .  
## x210           0.2300     0.0354    6.49  3.2e-06 ***
## x211           0.1600     0.0354    4.52  0.00024 ***
## x212           0.1250     0.0354    3.53  0.00224 ** 
## x213           0.0300     0.0354    0.85  0.40746    
## x214           0.1000     0.0354    2.82  0.01085 *  
## x215           0.0650     0.0354    1.84  0.08214 .  
## x216           0.0750     0.0354    2.12  0.04760 *  
## x217           0.1050     0.0354    2.97  0.00795 ** 
## x218           0.2150     0.0354    6.07  7.7e-06 ***
## x219           0.2400     0.0354    6.78  1.8e-06 ***
## x220           0.1500     0.0354    4.24  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.0354 on 19 degrees of freedom
## Multiple R-squared: 0.96,    Adjusted R-squared: 0.919 
## F-statistic: 23.1 on 20 and 19 DF,  p-value: 2.28e-09
anova(fit)
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## x1         1  0.367   0.367   292.4 5.4e-13 ***
## x2        19  0.212   0.011     8.9 7.3e-06 ***
## Residuals 19  0.024   0.001                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(y ~ x1, df), fit)  # (15.20)
## Analysis of Variance Table
## 
## Model 1: y ~ x1
## Model 2: y ~ x1 + x2
##   Res.Df    RSS Df Sum of Sq   F  Pr(>F)    
## 1     38 0.2359                             
## 2     19 0.0238 19     0.212 8.9 7.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1