Linear modelling

One-way ANOVA

use a one-way ANOVA to compare the average passenger age by passenger class

titanic <- read.table("titanic.txt", header=TRUE)
fit <- lm(Age~PClass, data=titanic)
summary(fit)

## 
## Call:
## lm(formula = Age ~ PClass, data = titanic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.748  -7.300  -0.668   7.791  42.700 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6678     0.8556  46.360   <2e-16 ***
## PClass2nd   -11.3676     1.2299  -9.243   <2e-16 ***
## PClass3rd   -14.4592     1.1191 -12.920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.86 on 753 degrees of freedom
##   (557 observations deleted due to missingness)
## Multiple R-squared:  0.1884, Adjusted R-squared:  0.1862 
## F-statistic: 87.38 on 2 and 753 DF,  p-value: < 2.2e-16

For the Age and Memory data, make a subset of the Older subjects, and conduct a one-way ANOVA to compare words remembered by memory technique.

memory <- read.table("eysenck.txt", header=TRUE)
memOlder <- subset(memory, Age=="Older")
fit <- lm(Words~Process, data=memOlder)
summary(fit)

## 
## Call:
## lm(formula = Words ~ Process, data = memOlder)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7.00  -1.85  -0.45   2.00   9.60 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         11.0000     0.9835  11.184 1.39e-14 ***
## ProcessCounting     -4.0000     1.3909  -2.876  0.00614 ** 
## ProcessImagery       2.4000     1.3909   1.725  0.09130 .  
## ProcessIntentional   1.0000     1.3909   0.719  0.47589    
## ProcessRhyming      -4.1000     1.3909  -2.948  0.00506 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.11 on 45 degrees of freedom
## Multiple R-squared:  0.4468, Adjusted R-squared:  0.3976 
## F-statistic: 9.085 on 4 and 45 DF,  p-value: 1.815e-05

Two-way ANOVA

Using the pupae dataset, 1t a two-way ANOVA to compare PupalWeight to Gender and CO2_treatment. Which main effects are signi1cant?

pupae <- read.csv("pupae.csv")
pupae$CO2_treatment <- as.factor(pupae$CO2_treatment)
fit <- lm(PupalWeight~Gender+CO2_treatment, data=pupae)
summary(fit)

## 
## Call:
## lm(formula = PupalWeight ~ Gender + CO2_treatment, data = pupae)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.126338 -0.028338  0.000162  0.022538  0.190663 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.26217    0.00886  29.589  < 2e-16 ***
## Gender            0.07817    0.01053   7.423 1.48e-10 ***
## CO2_treatment400  0.02017    0.01049   1.923   0.0583 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04623 on 75 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.4434, Adjusted R-squared:  0.4285 
## F-statistic: 29.87 on 2 and 75 DF,  p-value: 2.872e-10

anova(fit)

## Analysis of Variance Table
## 
## Response: PupalWeight
##               Df   Sum Sq  Mean Sq F value    Pr(>F)    
## Gender         1 0.119802 0.119802 56.0472 1.126e-10 ***
## CO2_treatment  1 0.007902 0.007902  3.6966   0.05832 .  
## Residuals     75 0.160314 0.002138                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Is there an interaction between Gender and CO2_treatment?

fit <- lm(PupalWeight~Gender*CO2_treatment, data=pupae)
anova(fit)

## Analysis of Variance Table
## 
## Response: PupalWeight
##                      Df   Sum Sq  Mean Sq F value    Pr(>F)    
## Gender                1 0.119802 0.119802 55.3217 1.488e-10 ***
## CO2_treatment         1 0.007902 0.007902  3.6488   0.05998 .  
## Gender:CO2_treatment  1 0.000063 0.000063  0.0292   0.86489    
## Residuals            74 0.160251 0.002166                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Repeat the above using T_treatment instead of CO2_treatment.

fit <- lm(PupalWeight~Gender+T_treatment, data=pupae)
summary(fit)

## 
## Call:
## lm(formula = PupalWeight ~ Gender + T_treatment, data = pupae)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.134744 -0.026092 -0.000259  0.023459  0.197226 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.275774   0.009946  27.728  < 2e-16 ***
## Gender               0.078203   0.010836   7.217 3.63e-10 ***
## T_treatmentelevated -0.005233   0.010909  -0.480    0.633    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04729 on 75 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.4177, Adjusted R-squared:  0.4022 
## F-statistic:  26.9 on 2 and 75 DF,  p-value: 1.556e-09

anova(fit)

## Analysis of Variance Table
## 
## Response: PupalWeight
##             Df   Sum Sq  Mean Sq F value    Pr(>F)    
## Gender       1 0.119802 0.119802 53.5784 2.325e-10 ***
## T_treatment  1 0.000515 0.000515  0.2301    0.6328    
## Residuals   75 0.167701 0.002236                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fit <- lm(PupalWeight~Gender*T_treatment, data=pupae)
anova(fit)

## Analysis of Variance Table
## 
## Response: PupalWeight
##                    Df   Sum Sq  Mean Sq F value Pr(>F)    
## Gender              1 0.119802 0.119802 52.9358  3e-10 ***
## T_treatment         1 0.000515 0.000515  0.2274 0.6349    
## Gender:T_treatment  1 0.000227 0.000227  0.1005 0.7521    
## Residuals          74 0.167474 0.002263                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Perform a two-way ANOVA to test irrigation and fertilisation effects on tree diameter. Compare the results.

hfeif2012 <- read.csv("HFEIFplotmeans2012.csv")
iflm <- lm(diameter ~ treat, data=hfeif2012)
anova(iflm)

## Analysis of Variance Table
## 
## Response: diameter
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## treat      3 46.022 15.3408  13.338 0.000396 ***
## Residuals 12 13.802  1.1502                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

levels(hfeif2012$treat)

## [1] "C"  "F"  "I"  "IL"

hfeif2012$Fert_treat <- as.factor(ifelse(hfeif2012$treat %in% c("F","IL"),
"fertilized", "control"))
hfeif2012$Irrig_treat <- as.factor(ifelse(hfeif2012$treat %in% c("I","IL"),
"irrigated", "control"))

levels(hfeif2012$Fert_treat)

## [1] "control"    "fertilized"

levels(hfeif2012$Irrig_treat)

## [1] "control"   "irrigated"

lmif2 <- lm(diameter ~ Fert_treat*Irrig_treat, data=hfeif2012)
anova(lmif2)

## Analysis of Variance Table
## 
## Response: diameter
##                        Df Sum Sq Mean Sq F value   Pr(>F)    
## Fert_treat              1  0.000   0.000   0.000   0.9978    
## Irrig_treat             1 45.620  45.620  39.663 3.96e-05 ***
## Fert_treat:Irrig_treat  1  0.403   0.403   0.350   0.5651    
## Residuals              12 13.802   1.150                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(effects)
plot(allEffects(lmif2))

Multiple regression

Using the Pulse data, fit a multiple linear regression of Pulse1 against Age, Weight and Height (add the variables to the model in that order).

Do the results change when you use drop1 instead of anova?

pulse <- read.table("ms212.txt", header=TRUE)

fit <- lm(Pulse1 ~ Age+Weight+Height, data=pulse)
summary(fit)

## 
## Call:
## lm(formula = Pulse1 ~ Age + Weight + Height, data = pulse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.838  -9.016  -0.115   6.497  70.923 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 123.39317   14.98976   8.232 5.42e-13 ***
## Age          -0.45843    0.31739  -1.444   0.1516    
## Weight       -0.01985    0.10079  -0.197   0.8443    
## Height       -0.21542    0.09399  -2.292   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.82 on 105 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.09705,    Adjusted R-squared:  0.07125 
## F-statistic: 3.762 on 3 and 105 DF,  p-value: 0.01304

anova(fit)

## Analysis of Variance Table
## 
## Response: Pulse1
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## Age         1   406.6  406.57  2.4757 0.11863  
## Weight      1   584.1  584.13  3.5568 0.06206 .
## Height      1   862.7  862.72  5.2532 0.02390 *
## Residuals 105 17244.0  164.23                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop1(fit, test="F")

## Single term deletions
## 
## Model:
## Pulse1 ~ Age + Weight + Height
##        Df Sum of Sq   RSS    AIC F value Pr(>F)  
## <none>              17244 559.96                 
## Age     1    342.60 17587 560.11  2.0861 0.1516  
## Weight  1      6.37 17250 558.00  0.0388 0.8443  
## Height  1    862.72 18107 563.28  5.2532 0.0239 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow=c(1,2))
plot(fit)

Linear Model with factor and numeric variables

Repeat exercise and also include the factor Exercise. Does adding Exercise improve the model?

pulse$Exercise <- as.factor(pulse$Exercise)
fit2 <- lm(Pulse1 ~ Age+Weight+Height+Exercise, data=pulse)
summary(fit2)

## 
## Call:
## lm(formula = Pulse1 ~ Age + Weight + Height + Exercise, data = pulse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.564  -7.706  -0.548   6.543  70.318 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 114.500042  15.440027   7.416 3.53e-11 ***
## Age          -0.551732   0.317033  -1.740   0.0848 .  
## Weight        0.006992   0.101024   0.069   0.9450    
## Height       -0.199932   0.093220  -2.145   0.0343 *  
## Exercise2     6.444701   3.812995   1.690   0.0940 .  
## Exercise3     8.677150   4.105668   2.113   0.0370 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.67 on 103 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1347, Adjusted R-squared:  0.09267 
## F-statistic: 3.206 on 5 and 103 DF,  p-value: 0.009894

anova(fit2)

## Analysis of Variance Table
## 
## Response: Pulse1
##            Df  Sum Sq Mean Sq F value  Pr(>F)  
## Age         1   406.6  406.57  2.5341 0.11448  
## Weight      1   584.1  584.13  3.6408 0.05916 .
## Height      1   862.7  862.72  5.3772 0.02238 *
## Exercise    2   718.5  359.25  2.2391 0.11171  
## Residuals 103 16525.5  160.44                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC(fit, fit2)

##      df      AIC
## fit   5 871.2904
## fit2  7 870.6514

Using the same data, fit a model of Pulse2 as a function of Pulse1 and Ran as main effects only.

pulse$Ran <- as.factor(pulse$Ran)
fit3 <- lm(Pulse2~Pulse1+Ran, data=pulse)
anova(fit3)

## Analysis of Variance Table
## 
## Response: Pulse2
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Pulse1      1  14063   14063  71.922  1.44e-13 ***
## Ran         1  72836   72836 372.497 < 2.2e-16 ***
## Residuals 106  20727     196                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(effects)
plot(allEffects(fit3))

Now add the interaction between Pulse1 and Ran. Is it signi1cant? Also look at the effects plot, how is it different from the model without an interaction?

fit4 <- lm(Pulse2~Pulse1*Ran, data=pulse)
anova(fit4)

## Analysis of Variance Table
## 
## Response: Pulse2
##             Df Sum Sq Mean Sq  F value    Pr(>F)    
## Pulse1       1  14063   14063  71.2471 1.872e-13 ***
## Ran          1  72836   72836 369.0002 < 2.2e-16 ***
## Pulse1:Ran   1      1       1   0.0049    0.9442    
## Residuals  105  20726     197                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(effects)
plot(allEffects(fit4))

Add (factor versions of) Alcohol, Smokes and Exercise to this model, and inspect whether the model improved.

fit5 <- lm(Pulse2~Pulse1+Ran+Alcohol+Smokes+Exercise, data=pulse)
anova(fit5)

## Analysis of Variance Table
## 
## Response: Pulse2
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## Pulse1      1  14063   14063  69.4559 3.834e-13 ***
## Ran         1  72836   72836 359.7233 < 2.2e-16 ***
## Alcohol     1     38      38   0.1888    0.6648    
## Smokes      1     24      24   0.1165    0.7336    
## Exercise    2     12       6   0.0299    0.9706    
## Residuals 102  20653     202                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logistic regression

Using the Pulse data once more, build a model to see if Pulse2 can predict whether people were in the Ran group.

fit6 <- glm(Ran ~ Pulse2, data=pulse, family=binomial)
summary(fit6)

## 
## Call:
## glm(formula = Ran ~ Pulse2, family = binomial, data = pulse)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.98452  -0.05046   0.13816   0.32434   2.88698  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 14.70267    3.34818   4.391 1.13e-05 ***
## Pulse2      -0.15712    0.03828  -4.104 4.06e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 148.444  on 108  degrees of freedom
## Residual deviance:  44.847  on 107  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 48.847
## 
## Number of Fisher Scoring iterations: 7

library(visreg)
visreg(fit6, scale="response")

Generalized linear model (GLM)

First run the following code to generate some data

len <- 21
x <- seq(0,1,length=len)
y <- rpois(len,exp(x-0.5))

Fit a Poisson GLM to model y on x. Is x signi1cant in the model?

fit7 <- glm(y ~ x, family=poisson)
summary(fit7)

## 
## Call:
## glm(formula = y ~ x, family = poisson)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.49510  -1.41391  -0.06732   0.72395   1.66556  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.06744    0.42443  -0.159    0.874
## x            0.22335    0.70515   0.317    0.751
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22.227  on 20  degrees of freedom
## Residual deviance: 22.126  on 19  degrees of freedom
## AIC: 58.565
## 
## Number of Fisher Scoring iterations: 5

Repeat above with a larger sample size. Compare the results

len <-101
x <- seq(0, 1, length=len)
y <- rpois(len, exp(x-0.5))
fit <-glm(y ~ x, family=poisson)
summary(fit)

## 
## Call:
## glm(formula = y ~ x, family = poisson)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.91124  -1.29170  -0.02566   0.42149   2.15269  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.4642     0.2158  -2.151 0.031480 *  
## x             1.0883     0.3295   3.303 0.000956 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 134.15  on 100  degrees of freedom
## Residual deviance: 122.91  on  99  degrees of freedom
## AIC: 279.69
## 
## Number of Fisher Scoring iterations: 5

The memory data were analysed assuming a normal error distribution and using a Poisson error distribution previously, and each approach resulted in a different outcome for the signi1cance of the interaction term. The participants in the study were asked to remember up to 27 words, not an unlimited number, and some of the participants were remembering close to this upper limit. Therefore, it may make sense to think of the response as a proportion. Use glm to model the response using a binomial error distribution.

memory <- read.delim('eysenck.txt')

m1 <- glm(cbind(Words, 27-Words) ~ Age * Process, data=memory, family=binomial)

anova(m1, test='LRT')

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: cbind(Words, 27 - Words)
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                           99     421.68              
## Age          1   36.393        98     385.29 1.613e-09 ***
## Process      4  238.591        94     146.70 < 2.2e-16 ***
## Age:Process  4   26.158        90     120.54 2.940e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear modelling

Kushan De Silva

August 16, 2017