One-way ANOVA
Use a one-way ANOVA to compare the average passenger age by passenger class.
titanic <- read.table("titanic.txt", header=TRUE)
fit <- lm(Age~PClass, data=titanic)
summary(fit)
##
## Call:
## lm(formula = Age ~ PClass, data = titanic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.748 -7.300 -0.668 7.791 42.700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6678 0.8556 46.360 <2e-16 ***
## PClass2nd -11.3676 1.2299 -9.243 <2e-16 ***
## PClass3rd -14.4592 1.1191 -12.920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.86 on 753 degrees of freedom
## (557 observations deleted due to missingness)
## Multiple R-squared: 0.1884, Adjusted R-squared: 0.1862
## F-statistic: 87.38 on 2 and 753 DF, p-value: < 2.2e-16
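As a quick check on the fitted model (a minimal sketch, not part of the original solution, assuming the titanic data frame loaded above), the class means can be computed directly and compared with the means implied by the coefficients:
with(titanic, tapply(Age, PClass, mean, na.rm=TRUE))   # mean age per class, ignoring missing ages
coef(fit)[1] + c(0, coef(fit)[2:3])                    # 1st = intercept; 2nd and 3rd = intercept + contrast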
For the Age and Memory data, make a subset of the Older subjects, and conduct a one-way ANOVA to compare words remembered by memory technique.
memory <- read.table("eysenck.txt", header=TRUE)
memOlder <- subset(memory, Age=="Older")
fit <- lm(Words~Process, data=memOlder)
summary(fit)
##
## Call:
## lm(formula = Words ~ Process, data = memOlder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.00 -1.85 -0.45 2.00 9.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0000 0.9835 11.184 1.39e-14 ***
## ProcessCounting -4.0000 1.3909 -2.876 0.00614 **
## ProcessImagery 2.4000 1.3909 1.725 0.09130 .
## ProcessIntentional 1.0000 1.3909 0.719 0.47589
## ProcessRhyming -4.1000 1.3909 -2.948 0.00506 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.11 on 45 degrees of freedom
## Multiple R-squared: 0.4468, Adjusted R-squared: 0.3976
## F-statistic: 9.085 on 4 and 45 DF, p-value: 1.815e-05
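The overall F-test shows that the memory techniques differ; to see which pairs of techniques differ from each other, a multiple-comparison procedure can be used. One option (a sketch, not part of the original solution) is Tukey's HSD on the equivalent aov fit:
TukeyHSD(aov(Words ~ Process, data=memOlder))   # all pairwise comparisons of the Process levels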
Two-way ANOVA
Using the pupae dataset, fit a two-way ANOVA to model PupalWeight as a function of Gender and CO2_treatment. Which main effects are significant?
pupae <- read.csv("pupae.csv")
pupae$CO2_treatment <- as.factor(pupae$CO2_treatment)
fit <- lm(PupalWeight~Gender+CO2_treatment, data=pupae)
summary(fit)
##
## Call:
## lm(formula = PupalWeight ~ Gender + CO2_treatment, data = pupae)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.126338 -0.028338 0.000162 0.022538 0.190663
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.26217 0.00886 29.589 < 2e-16 ***
## Gender 0.07817 0.01053 7.423 1.48e-10 ***
## CO2_treatment400 0.02017 0.01049 1.923 0.0583 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04623 on 75 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.4434, Adjusted R-squared: 0.4285
## F-statistic: 29.87 on 2 and 75 DF, p-value: 2.872e-10
anova(fit)
## Analysis of Variance Table
##
## Response: PupalWeight
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 0.119802 0.119802 56.0472 1.126e-10 ***
## CO2_treatment 1 0.007902 0.007902 3.6966 0.05832 .
## Residuals 75 0.160314 0.002138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Is there an interaction between Gender and CO2_treatment?
fit <- lm(PupalWeight~Gender*CO2_treatment, data=pupae)
anova(fit)
## Analysis of Variance Table
##
## Response: PupalWeight
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 0.119802 0.119802 55.3217 1.488e-10 ***
## CO2_treatment 1 0.007902 0.007902 3.6488 0.05998 .
## Gender:CO2_treatment 1 0.000063 0.000063 0.0292 0.86489
## Residuals 74 0.160251 0.002166
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
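The lack of an interaction can also be inspected visually; a minimal sketch using base graphics (assuming the pupae data frame as loaded above):
with(pupae, interaction.plot(CO2_treatment, Gender, PupalWeight,
     fun=function(x) mean(x, na.rm=TRUE)))   # roughly parallel lines suggest little evidence of an interaction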
Repeat the above using T_treatment instead of CO2_treatment.
fit <- lm(PupalWeight~Gender+T_treatment, data=pupae)
summary(fit)
##
## Call:
## lm(formula = PupalWeight ~ Gender + T_treatment, data = pupae)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.134744 -0.026092 -0.000259 0.023459 0.197226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.275774 0.009946 27.728 < 2e-16 ***
## Gender 0.078203 0.010836 7.217 3.63e-10 ***
## T_treatmentelevated -0.005233 0.010909 -0.480 0.633
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04729 on 75 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.4177, Adjusted R-squared: 0.4022
## F-statistic: 26.9 on 2 and 75 DF, p-value: 1.556e-09
anova(fit)
## Analysis of Variance Table
##
## Response: PupalWeight
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 0.119802 0.119802 53.5784 2.325e-10 ***
## T_treatment 1 0.000515 0.000515 0.2301 0.6328
## Residuals 75 0.167701 0.002236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit <- lm(PupalWeight~Gender*T_treatment, data=pupae)
anova(fit)
## Analysis of Variance Table
##
## Response: PupalWeight
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 0.119802 0.119802 52.9358 3e-10 ***
## T_treatment 1 0.000515 0.000515 0.2274 0.6349
## Gender:T_treatment 1 0.000227 0.000227 0.1005 0.7521
## Residuals 74 0.167474 0.002263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Perform a two-way ANOVA to test irrigation and fertilisation effects on tree diameter, and compare the results with a one-way ANOVA of the combined treatment factor.
hfeif2012 <- read.csv("HFEIFplotmeans2012.csv", stringsAsFactors=TRUE)  # keep treat as a factor (R >= 4.0 reads strings as character by default)
iflm <- lm(diameter ~ treat, data=hfeif2012)
anova(iflm)
## Analysis of Variance Table
##
## Response: diameter
## Df Sum Sq Mean Sq F value Pr(>F)
## treat 3 46.022 15.3408 13.338 0.000396 ***
## Residuals 12 13.802 1.1502
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
levels(hfeif2012$treat)
## [1] "C" "F" "I" "IL"
hfeif2012$Fert_treat <- as.factor(ifelse(hfeif2012$treat %in% c("F","IL"),
"fertilized", "control"))
hfeif2012$Irrig_treat <- as.factor(ifelse(hfeif2012$treat %in% c("I","IL"),
"irrigated", "control"))
levels(hfeif2012$Fert_treat)
## [1] "control" "fertilized"
levels(hfeif2012$Irrig_treat)
## [1] "control" "irrigated"
lmif2 <- lm(diameter ~ Fert_treat*Irrig_treat, data=hfeif2012)
anova(lmif2)
## Analysis of Variance Table
##
## Response: diameter
## Df Sum Sq Mean Sq F value Pr(>F)
## Fert_treat 1 0.000 0.000 0.000 0.9978
## Irrig_treat 1 45.620 45.620 39.663 3.96e-05 ***
## Fert_treat:Irrig_treat 1 0.403 0.403 0.350 0.5651
## Residuals 12 13.802 1.150
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(effects)
plot(allEffects(lmif2))
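To see where the irrigation effect comes from, the treatment means can be tabulated directly (a quick sketch, assuming the hfeif2012 data frame with the two new factors created above):
with(hfeif2012, tapply(diameter, list(Fert_treat, Irrig_treat), mean))   # mean diameter per fertilisation x irrigation combination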
Multiple regression
Using the Pulse data, fit a multiple linear regression of Pulse1 against Age, Weight and Height (add the variables to the model in that order).
Do the results change when you use drop1 instead of anova?
pulse <- read.table("ms212.txt", header=TRUE)
fit <- lm(Pulse1 ~ Age+Weight+Height, data=pulse)
summary(fit)
##
## Call:
## lm(formula = Pulse1 ~ Age + Weight + Height, data = pulse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.838 -9.016 -0.115 6.497 70.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 123.39317 14.98976 8.232 5.42e-13 ***
## Age -0.45843 0.31739 -1.444 0.1516
## Weight -0.01985 0.10079 -0.197 0.8443
## Height -0.21542 0.09399 -2.292 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.82 on 105 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.09705, Adjusted R-squared: 0.07125
## F-statistic: 3.762 on 3 and 105 DF, p-value: 0.01304
anova(fit)
## Analysis of Variance Table
##
## Response: Pulse1
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 406.6 406.57 2.4757 0.11863
## Weight 1 584.1 584.13 3.5568 0.06206 .
## Height 1 862.7 862.72 5.2532 0.02390 *
## Residuals 105 17244.0 164.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(fit, test="F")
## Single term deletions
##
## Model:
## Pulse1 ~ Age + Weight + Height
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 17244 559.96
## Age 1 342.60 17587 560.11 2.0861 0.1516
## Weight 1 6.37 17250 558.00 0.0388 0.8443
## Height 1 862.72 18107 563.28 5.2532 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
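The two tables differ because anova() tests terms sequentially (each term adjusted only for the terms listed before it), whereas drop1() tests each term as if it were added last. One way to see this (a sketch, not part of the original exercise) is to refit the model with the predictors in the reverse order and compare the anova() table:
anova(lm(Pulse1 ~ Height + Weight + Age, data=pulse))   # the sequential table changes with predictor order; drop1() would not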
par(mfrow=c(1,2))
plot(fit)
Linear model with factor and numeric variables
Repeat the previous exercise, and also include the factor Exercise. Does adding Exercise improve the model?
pulse$Exercise <- as.factor(pulse$Exercise)
fit2 <- lm(Pulse1 ~ Age+Weight+Height+Exercise, data=pulse)
summary(fit2)
##
## Call:
## lm(formula = Pulse1 ~ Age + Weight + Height + Exercise, data = pulse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.564 -7.706 -0.548 6.543 70.318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 114.500042 15.440027 7.416 3.53e-11 ***
## Age -0.551732 0.317033 -1.740 0.0848 .
## Weight 0.006992 0.101024 0.069 0.9450
## Height -0.199932 0.093220 -2.145 0.0343 *
## Exercise2 6.444701 3.812995 1.690 0.0940 .
## Exercise3 8.677150 4.105668 2.113 0.0370 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.67 on 103 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1347, Adjusted R-squared: 0.09267
## F-statistic: 3.206 on 5 and 103 DF, p-value: 0.009894
anova(fit2)
## Analysis of Variance Table
##
## Response: Pulse1
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 406.6 406.57 2.5341 0.11448
## Weight 1 584.1 584.13 3.6408 0.05916 .
## Height 1 862.7 862.72 5.3772 0.02238 *
## Exercise 2 718.5 359.25 2.2391 0.11171
## Residuals 103 16525.5 160.44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(fit, fit2)
## df AIC
## fit 5 871.2904
## fit2 7 870.6514
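The AIC is slightly lower for fit2, suggesting a marginal improvement. A more direct check (a minimal sketch using the nested fits above) is an F-test comparing the two models:
anova(fit, fit2)   # tests the joint contribution of the Exercise factor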
Using the same data, fit a model of Pulse2 as a function of Pulse1 and Ran as main effects only.
pulse$Ran <- as.factor(pulse$Ran)
fit3 <- lm(Pulse2~Pulse1+Ran, data=pulse)
anova(fit3)
## Analysis of Variance Table
##
## Response: Pulse2
## Df Sum Sq Mean Sq F value Pr(>F)
## Pulse1 1 14063 14063 71.922 1.44e-13 ***
## Ran 1 72836 72836 372.497 < 2.2e-16 ***
## Residuals 106 20727 196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(effects)
plot(allEffects(fit3))
Now add the interaction between Pulse1 and Ran. Is it significant? Also look at the effects plot: how is it different from the model without an interaction?
fit4 <- lm(Pulse2~Pulse1*Ran, data=pulse)
anova(fit4)
## Analysis of Variance Table
##
## Response: Pulse2
## Df Sum Sq Mean Sq F value Pr(>F)
## Pulse1 1 14063 14063 71.2471 1.872e-13 ***
## Ran 1 72836 72836 369.0002 < 2.2e-16 ***
## Pulse1:Ran 1 1 1 0.0049 0.9442
## Residuals 105 20726 197
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(effects)
plot(allEffects(fit4))
Add (factor versions of) Alcohol, Smokes and Exercise to this model, and inspect whether the model improved.
pulse$Alcohol <- as.factor(pulse$Alcohol)   # convert the remaining grouping variables to factors (Exercise was converted above)
pulse$Smokes <- as.factor(pulse$Smokes)
fit5 <- lm(Pulse2~Pulse1+Ran+Alcohol+Smokes+Exercise, data=pulse)
anova(fit5)
## Analysis of Variance Table
##
## Response: Pulse2
## Df Sum Sq Mean Sq F value Pr(>F)
## Pulse1 1 14063 14063 69.4559 3.834e-13 ***
## Ran 1 72836 72836 359.7233 < 2.2e-16 ***
## Alcohol 1 38 38 0.1888 0.6648
## Smokes 1 24 24 0.1165 0.7336
## Exercise 2 12 6 0.0299 0.9706
## Residuals 102 20653 202
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
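Judging by the small sums of squares, the extra terms add little. A more direct comparison (a sketch, assuming Alcohol and Smokes introduce no additional missing values, so both models use the same observations) is against the earlier main-effects model:
anova(fit3, fit5)   # F-test for the joint contribution of Alcohol, Smokes and Exercise
AIC(fit3, fit5)     # lower AIC indicates the preferred model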
Logistic regression
Using the Pulse data once more, build a model to see if Pulse2 can predict whether people were in the Ran group.
fit6 <- glm(Ran ~ Pulse2, data=pulse, family=binomial)
summary(fit6)
##
## Call:
## glm(formula = Ran ~ Pulse2, family = binomial, data = pulse)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.98452 -0.05046 0.13816 0.32434 2.88698
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 14.70267 3.34818 4.391 1.13e-05 ***
## Pulse2 -0.15712 0.03828 -4.104 4.06e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 148.444 on 108 degrees of freedom
## Residual deviance: 44.847 on 107 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 48.847
##
## Number of Fisher Scoring iterations: 7
library(visreg)
visreg(fit6, scale="response")
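The fitted model can also be used to predict probabilities at chosen values of Pulse2 (a sketch; note that glm models the probability of the second level of the Ran factor, and the example pulse rates are arbitrary):
predict(fit6, newdata=data.frame(Pulse2=c(70, 110)), type="response")   # fitted probabilities at two example Pulse2 values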
Generalized linear model (GLM)
First, run the following code to generate some data:
len <- 21
x <- seq(0,1,length=len)
y <- rpois(len,exp(x-0.5))
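Because rpois draws random numbers, y (and therefore the model output below) will differ between runs; to make the exercise reproducible, a seed could be set before generating the data, for example:
set.seed(1234)                  # any fixed value makes the simulation reproducible
y <- rpois(len, exp(x - 0.5))   # regenerate y with the seed set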
Fit a Poisson GLM to model y on x. Is x significant in the model?
fit7 <- glm(y ~ x, family=poisson)
summary(fit7)
##
## Call:
## glm(formula = y ~ x, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.49510 -1.41391 -0.06732 0.72395 1.66556
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.06744 0.42443 -0.159 0.874
## x 0.22335 0.70515 0.317 0.751
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22.227 on 20 degrees of freedom
## Residual deviance: 22.126 on 19 degrees of freedom
## AIC: 58.565
##
## Number of Fisher Scoring iterations: 5
Repeat the above with a larger sample size and compare the results.
len <- 101
x <- seq(0, 1, length=len)
y <- rpois(len, exp(x-0.5))
fit <- glm(y ~ x, family=poisson)
summary(fit)
##
## Call:
## glm(formula = y ~ x, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.91124 -1.29170 -0.02566 0.42149 2.15269
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4642 0.2158 -2.151 0.031480 *
## x 1.0883 0.3295 3.303 0.000956 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 134.15 on 100 degrees of freedom
## Residual deviance: 122.91 on 99 degrees of freedom
## AIC: 279.69
##
## Number of Fisher Scoring iterations: 5
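With the larger sample, the standard error of the slope drops from about 0.71 to 0.33 and x is now clearly significant. To put the two fits side by side (a sketch, assuming fit7 and fit are both still in the workspace):
coef(summary(fit7))   # n = 21
coef(summary(fit))    # n = 101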
The memory data were previously analysed assuming a normal error distribution and using a Poisson error distribution, and each approach resulted in a different outcome for the significance of the interaction term. The participants in the study were asked to remember up to 27 words, not an unlimited number, and some of the participants were remembering close to this upper limit. Therefore, it may make sense to think of the response as a proportion. Use glm to model the response using a binomial error distribution.
memory <- read.delim('eysenck.txt')
m1 <- glm(cbind(Words, 27-Words) ~ Age * Process, data=memory, family=binomial)
anova(m1, test='LRT')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: cbind(Words, 27 - Words)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 99 421.68
## Age 1 36.393 98 385.29 1.613e-09 ***
## Process 4 238.591 94 146.70 < 2.2e-16 ***
## Age:Process 4 26.158 90 120.54 2.940e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
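Because the binomial variance is fixed by the mean, it is worth checking for overdispersion before relying on these p-values. A rough check (a sketch, assuming m1 as fitted above) is the ratio of residual deviance to residual degrees of freedom, which should be close to 1:
deviance(m1) / df.residual(m1)   # values well above 1 suggest overdispersion; a quasibinomial family could then be used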