env <- new.env()
load("../data.rda", envir = env)
df <- get("CH15FI01", envir = env)
names(df) <- c("y", "x")
stripchart(split(df$y, df$x), method = "stack", main = "Attendance", xlab = "Rating",
group.names = c("Attend", "Not Attend"), pch = 19, col = "gray30")
abline(h = 2 - 0.05) # Add a reference line below the 2nd group
For pedagogical reasons, a linear model for these data is presented. As the comment on page 646 states, ordinal data can be treated as evenly spaced interval data of a continuous nature. This is the default behavior for ordered factors in lm. We are still interested, however, in treating our fitted model like an ANOVA model with dummy variables. We do this by specifying the contrast used when the linear model is created with a categorical variable (dummy encoding). You can get the same results (though not necessarily presented in the same order) by keeping the factor unordered and making sure the reference level is the lowest ordered level (in this case, “Low”). This can be done with relevel.
df <- get("CH15FI06", envir = env)
names(df) <- c("y", "x")
# Treat the factor as ordered
df <- transform(df, x = factor(x, ordered = TRUE, levels = c("Low", "Medium", "High", "Very High")))
# Specify appropriate contrasts to treat this ordered factor as a categorical factor
fit <- lm(y ~ x, df, contrasts = list(x = "contr.treatment"))
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df, contrasts = list(x = "contr.treatment"))
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 50 -50 25 -25 110 -110 75 -75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 860.0 72.2 11.91 0.00028 ***
## xMedium 385.0 102.1 3.77 0.01959 *
## xHigh 540.0 102.1 5.29 0.00613 **
## xVery High 245.0 102.1 2.40 0.07439 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102 on 4 degrees of freedom
## Multiple R-squared: 0.883, Adjusted R-squared: 0.796
## F-statistic: 10.1 on 3 and 4 DF, p-value: 0.0246
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 3 315250 105083 10.1 0.025 *
## Residuals 4 41700 10425
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
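As mentioned above, the same estimates can also be obtained by keeping the factor unordered and using relevel; a minimal sketch (fit2 is just an illustrative name):
# Equivalent fit via relevel: keep x unordered and make "Low" the reference level.
# The estimates match the summary above, though the remaining levels sort alphabetically.
fit2 <- lm(y ~ x, transform(df, x = relevel(factor(as.character(x)), ref = "Low")))
coef(fit2)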
R defaults to plotting boxplots for factors, so we manually plot these as interval data. It is much more convenient to simply use xyplot (lattice) in this case.
library(lattice)  # xyplot() is provided by the lattice package
xyplot(y ~ x, df, ylim = c(0, 1600), pch = 19)
plot(y ~ unclass(x), df, ylim = c(0, 1600), pch = 19, xaxt = 'n', xlab = "Oven Temperature", ylab = "Volume")
axis(1, at = seq(4), labels = levels(df$x))
title("Summary Plot")
df <- get("CH15FI09", envir = env)
names(df) <- c("y", "x1", "x2")
df <- transform(df, x1 = factor(x1, ordered = TRUE, levels = c("Low", "Medium", "High", "Very High")))
fit <- lm(y ~ x1 + x2, df, contrasts = list(x1 = "contr.treatment"))
summary(fit)
##
## Call:
## lm(formula = y ~ x1 + x2, data = df, contrasts = list(x1 = "contr.treatment"))
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -15 15 -40 40 45 -45 10 -10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 795.0 40.6 19.60 0.00029 ***
## x1Medium 385.0 51.3 7.50 0.00491 **
## x1High 540.0 51.3 10.52 0.00183 **
## x1Very High 245.0 51.3 4.77 0.01746 *
## x2Plant B 130.0 36.3 3.58 0.03722 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.3 on 3 degrees of freedom
## Multiple R-squared: 0.978, Adjusted R-squared: 0.948
## F-statistic: 33.1 on 4 and 3 DF, p-value: 0.00812
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 3 315250 105083 39.9 0.0064 **
## x2 1 33800 33800 12.8 0.0372 *
## Residuals 3 7900 2633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, the use of xyplot may be preferable. The auto.key parameter controls the legend by taking a list of options; in the snippet below it produces a two-column legend, placing the two Plant levels side by side.
xyplot(y ~ x1, df, groups = x2, type = "b", pch = 20, auto.key = list(columns = 2))
plot(y ~ unclass(x1), df, col = unclass(x2), pch = 20, xaxt = 'n', xlab = "Oven Temperature", ylab = "Volume")
lines(y ~ unclass(x1), df, subset = x2 == 'Plant A', col = unclass(x2))
lines(y ~ unclass(x1), df, subset = x2 == 'Plant B', col = unclass(x2))
axis(1, at = seq(4), labels = levels(df$x1))
title("Summary Plot")
text(3.6, 1400, "Plant B", col = 2)
text(3.5, 1100, "Plant A")
The data are currently in a “wide” format and could be reshaped using reshape (stats). However, stack (utils) will suffice. It returns the stacked values together with an indicator variable (ind) derived from the selected columns; here, ind consists of the factor levels 'control' and 'experiment'. Thus, the model just requires this long dataset combined with the subject numbers. An alternative is to use functions such as melt from Wickham's reshape package.
df <- get("CH15TA01", envir = env)
names(df) <- c("subject", "control", "experiment", "difference")
df <- transform(df,
x1 = stack(df, select = c(control, experiment))$ind,
x2 = factor(subject),
y = stack(df, select = c(control, experiment))$values)
fit <- lm(y ~ x1 + x2, df)
An author error was found for the within-subject difference of the sample standard deviation: the difference of the two standard deviations does not equal the book's value.
0.0768 - 0.0807 != 0.0501  # 0.0501 is the standard deviation of the 20 differences, not the difference of the standard deviations
tab <- xtabs(y ~ x2 + x1, df) # Wide format table like before
tab <- addmargins(tab, 1, FUN = list(list('Mean' = mean, 'Std' = sd))) # Add sample means and std. dev. to bottom margin
tab <- addmargins(tab, 2, FUN = diff) # Add within-subject differences to side margin
round(tab, 4)
## x1
## x2 control experiment diff
## 1 0.5900 0.4300 -0.1600
## 2 0.6900 0.5300 -0.1600
## 3 0.8200 0.5800 -0.2400
## 4 0.8000 0.6500 -0.1500
## 5 0.6600 0.3800 -0.2800
## 6 0.6700 0.4800 -0.1900
## 7 0.7600 0.6200 -0.1400
## 8 0.8000 0.6000 -0.2000
## 9 0.7200 0.4400 -0.2800
## 10 0.8600 0.6200 -0.2400
## 11 0.7400 0.6000 -0.1400
## 12 0.7200 0.5500 -0.1700
## 13 0.6600 0.4200 -0.2400
## 14 0.6600 0.5600 -0.1000
## 15 0.6800 0.4700 -0.2100
## 16 0.6700 0.5000 -0.1700
## 17 0.6900 0.5400 -0.1500
## 18 0.8500 0.6000 -0.2500
## 19 0.8500 0.6500 -0.2000
## 20 0.7400 0.5800 -0.1600
## Mean 0.7315 0.5400 -0.1915
## Std 0.0768 0.0807 0.0039
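The standard deviation of the 20 within-subject differences can be checked directly from the table:
sd(tab[seq(20), "diff"])  # approximately 0.0501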
Since the points overplot, their x-values are randomly shifted a small amount using jitter.
x <- jitter(unclass(df$x1), 0.12)
plot(y ~ x, df, pch = 19, xaxt = 'n', xlab = "Treatment", ylab = "Diameter")
axis(1, at = seq(2), labels = levels(df$x1))
for (id in unique(df$x2))  # connect each subject's control and experiment points
lines(y ~ x, df, subset = x2 == id)
title("Summary Plot")
As the authors point out, the coefficients on the subjects don't matter: the blocking variable was included only to improve the precision of the comparison between the control and experimental treatments. The coefficients will differ from those in the book because it uses a different reference level. R uses the first factor level as the reference, while the book (15.18) defines the model with subject 20 as the reference level (the variable left out of the X_ij assignment). Thus, if you use relevel to change the reference level to 20, you will get the same subject coefficients, as sketched after the summary below.
summary(fit)
##
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0457 -0.0208 0.0000 0.0208 0.0457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6058 0.0257 23.61 1.5e-15 ***
## x1experiment -0.1915 0.0112 -17.10 5.4e-13 ***
## x22 0.1000 0.0354 2.82 0.01085 *
## x23 0.1900 0.0354 5.37 3.5e-05 ***
## x24 0.2150 0.0354 6.07 7.7e-06 ***
## x25 0.0100 0.0354 0.28 0.78070
## x26 0.0650 0.0354 1.84 0.08214 .
## x27 0.1800 0.0354 5.08 6.6e-05 ***
## x28 0.1900 0.0354 5.37 3.5e-05 ***
## x29 0.0700 0.0354 1.98 0.06278 .
## x210 0.2300 0.0354 6.49 3.2e-06 ***
## x211 0.1600 0.0354 4.52 0.00024 ***
## x212 0.1250 0.0354 3.53 0.00224 **
## x213 0.0300 0.0354 0.85 0.40746
## x214 0.1000 0.0354 2.82 0.01085 *
## x215 0.0650 0.0354 1.84 0.08214 .
## x216 0.0750 0.0354 2.12 0.04760 *
## x217 0.1050 0.0354 2.97 0.00795 **
## x218 0.2150 0.0354 6.07 7.7e-06 ***
## x219 0.2400 0.0354 6.78 1.8e-06 ***
## x220 0.1500 0.0354 4.24 0.00045 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0354 on 19 degrees of freedom
## Multiple R-squared: 0.96, Adjusted R-squared: 0.919
## F-statistic: 23.1 on 20 and 19 DF, p-value: 2.28e-09
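A minimal sketch of that relevel approach (fit20 is just an illustrative name); refitting with subject 20 as the reference reproduces the book's subject coefficients:
fit20 <- lm(y ~ x1 + relevel(x2, ref = "20"), df)  # subject 20 as the reference level
coef(fit20)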
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 0.367 0.367 292.4 5.4e-13 ***
## x2 19 0.212 0.011 8.9 7.3e-06 ***
## Residuals 19 0.024 0.001
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(y ~ x1, df), fit) # (15.20)
## Analysis of Variance Table
##
## Model 1: y ~ x1
## Model 2: y ~ x1 + x2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 38 0.2359
## 2 19 0.0238 19 0.212 8.9 7.3e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1