setwd("F:/Documents/Work/Courses/Stats_LUND_2012_October/Exercises/Ex_06")

Exercise 6 - R

To learn:
. Multiple linear regression
. Partial correlation
. Model selection
. Thinking about collinearity
. Nonlinear regression

Part A:
1. Open the file “rmfor.sav”. The data set is divided into four types of years, these are called yrtype (with the levels 0, 1, 10 and 11). Then there is a variable (gud) describing the overall food availability in the territory in that year. Finally there are four variables describing how much the woodpeckers used different tree species: oak, birch, alder, lime for the standardised use of oak, birch, alder and lime respectively. Here we will concentrate on the four usage variables.

rmfor <- read.csv("rmfor.csv", header = TRUE, sep = ";")
names(rmfor)

## [1] "no"     "yrtype" "gud"    "oak"    "birch"  "alder"  "lime"

str(rmfor)

## 'data.frame':    27 obs. of  7 variables:
##  $ no    : int  0 0 0 0 7 7 7 10 10 10 ...
##  $ yrtype: int  0 1 10 11 0 10 11 0 10 11 ...
##  $ gud   : num  0.24 1.15 1.62 2.1 2.08 0.46 1.34 1.31 0.57 0.44 ...
##  $ oak   : num  0.1339 0.0244 0.1063 0.1348 0.0057 ...
##  $ birch : num  0.431 0.295 0.553 0.526 0 ...
##  $ alder : num  0.0183 0.2699 0.081 0.2772 0 ...
##  $ lime  : num  0.0737 0.3988 0.1437 0 0.9209 ...

rmfor$yrtype <- factor(rmfor$yrtype)

First of all make a scatterplot matrix of the four tree variables (oak, birch, alder and lime). According to Quick-R (a very useful website which I assume most of you have already discovered by now), there are at least 4 methods for doing this (see http://www.statmethods.net/graphs/scatterplot.html). The simplest uses the command pairs:

pairs(~oak + birch + alder + lime, data = rmfor, main = "Simple Scatterplot Matrix")

plot of chunk unnamed-chunk-3

Do they seem independent?
Most vairables appear independent, but not all. Where oak is common birch appears less comon, and where birch is common lime appears less common.

Now we will try to predict the usage of lime from the other usage variables. We will allow for interactions. For this, you have to compute new variables for the two-way interaction terms between oak, birch and alder. To be precise, this means you should calculate the products of oak and birch, oak and alder, and finally birch and alder, and then add them to the dataframe.

rmfor$oxb <- rmfor$oak * rmfor$birch
rmfor$oxa <- rmfor$oak * rmfor$alder
rmfor$bxa <- rmfor$birch * rmfor$alder

Compute Pearson correlations of lime against all the other usage variables, i.e. oak, birch, alder, and the interactions. Remember that you can use rcorr in the Hmisc library for this. Think of lime as your dependent variable. Are there any correlations?

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 2.14.2

## Loading required package: survival

## Loading required package: splines

## Hmisc library by Frank E Harrell Jr
## 
## Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview') to see overall
## documentation.
## 
## NOTE:Hmisc no longer redefines [.factor to drop unused levels when
## subsetting.  To get the old behavior of Hmisc type dropUnusedLevels().

## Attaching package: 'Hmisc'

## The following object(s) are masked from 'package:survival':
## 
## untangle.specials

## The following object(s) are masked from 'package:base':
## 
## format.pval, round.POSIXt, trunc.POSIXt, units

rcorr(as.matrix(rmfor), type = "pearson")

##           no yrtype   gud   oak birch alder  lime   oxb   oxa   bxa
## no      1.00  -0.05 -0.13  0.37 -0.09 -0.13 -0.10  0.10  0.20 -0.19
## yrtype -0.05   1.00 -0.19 -0.13  0.68  0.04 -0.49  0.53  0.19  0.27
## gud    -0.13  -0.19  1.00 -0.25 -0.32 -0.11  0.54 -0.46 -0.04 -0.09
## oak     0.37  -0.13 -0.25  1.00 -0.34 -0.36 -0.25  0.20  0.00 -0.35
## birch  -0.09   0.68 -0.32 -0.34  1.00  0.12 -0.67  0.44  0.07  0.47
## alder  -0.13   0.04 -0.11 -0.36  0.12  1.00 -0.31 -0.22  0.41  0.80
## lime   -0.10  -0.49  0.54 -0.25 -0.67 -0.31  1.00 -0.39 -0.26 -0.47
## oxb     0.10   0.53 -0.46  0.20  0.44 -0.22 -0.39  1.00  0.22 -0.07
## oxa     0.20   0.19 -0.04  0.00  0.07  0.41 -0.26  0.22  1.00  0.35
## bxa    -0.19   0.27 -0.09 -0.35  0.47  0.80 -0.47 -0.07  0.35  1.00
## 
## n= 27 
## 
## 
## P
##        no     yrtype gud    oak    birch  alder  lime   oxb    oxa   
## no            0.7985 0.5221 0.0544 0.6711 0.5143 0.6146 0.6163 0.3267
## yrtype 0.7985        0.3307 0.5168 0.0001 0.8316 0.0092 0.0045 0.3477
## gud    0.5221 0.3307        0.2061 0.1052 0.5850 0.0034 0.0147 0.8364
## oak    0.0544 0.5168 0.2061        0.0867 0.0674 0.2104 0.3227 0.9977
## birch  0.6711 0.0001 0.1052 0.0867        0.5376 0.0001 0.0223 0.7123
## alder  0.5143 0.8316 0.5850 0.0674 0.5376        0.1185 0.2660 0.0324
## lime   0.6146 0.0092 0.0034 0.2104 0.0001 0.1185        0.0455 0.1893
## oxb    0.6163 0.0045 0.0147 0.3227 0.0223 0.2660 0.0455        0.2803
## oxa    0.3267 0.3477 0.8364 0.9977 0.7123 0.0324 0.1893 0.2803       
## bxa    0.3383 0.1654 0.6706 0.0766 0.0134 0.0000 0.0123 0.7324 0.0693
##        bxa   
## no     0.3383
## yrtype 0.1654
## gud    0.6706
## oak    0.0766
## birch  0.0134
## alder  0.0000
## lime   0.0123
## oxb    0.7324
## oxa    0.0693
## bxa

names(rmfor)

##  [1] "no"     "yrtype" "gud"    "oak"    "birch"  "alder"  "lime"  
##  [8] "oxb"    "oxa"    "bxa"

# for only ones we are interested in:
trees <- rmfor[4:6]
trees <- cbind(trees, rmfor[8:10])
trees <- as.matrix(trees)
lime <- rmfor$lime
rcorr(lime, trees, type = "pearson")

##           x   oak birch alder   oxb   oxa   bxa
## x      1.00 -0.25 -0.67 -0.31 -0.39 -0.26 -0.47
## oak   -0.25  1.00 -0.34 -0.36  0.20  0.00 -0.35
## birch -0.67 -0.34  1.00  0.12  0.44  0.07  0.47
## alder -0.31 -0.36  0.12  1.00 -0.22  0.41  0.80
## oxb   -0.39  0.20  0.44 -0.22  1.00  0.22 -0.07
## oxa   -0.26  0.00  0.07  0.41  0.22  1.00  0.35
## bxa   -0.47 -0.35  0.47  0.80 -0.07  0.35  1.00
## 
## n= 27 
## 
## 
## P
##       x      oak    birch  alder  oxb    oxa    bxa   
## x            0.2104 0.0001 0.1185 0.0455 0.1893 0.0123
## oak   0.2104        0.0867 0.0674 0.3227 0.9977 0.0766
## birch 0.0001 0.0867        0.5376 0.0223 0.7123 0.0134
## alder 0.1185 0.0674 0.5376        0.2660 0.0324 0.0000
## oxb   0.0455 0.3227 0.0223 0.2660        0.2803 0.7324
## oxa   0.1893 0.9977 0.7123 0.0324 0.2803        0.0693
## bxa   0.0123 0.0766 0.0134 0.0000 0.7324 0.0693

there are 7 significant correlations, though this is without compensating for multipe comparisons.

Make a partial correlations matrix controlling for the most significant variable from the original correlation matrix. You can do this using the command pcor.test in the ppcor library. It has the form:

# pcor.test(x, y, z, method = 'pearson')
# 
# install.packages('ppcor')
library(ppcor)

## Warning: package 'ppcor' was built under R version 2.14.2

In this command x and y are the variables to be correlated and z is the variable(s) which you wish to control for. For example:

pcor.test(rmfor$lime, rmfor$oak, rmfor$birch, method = "pearson")

##   estimate   p.value statistic  n gp  Method
## 1  -0.6732 8.185e-06     -4.46 27  1 pearson

Note that you must do sepparate correlations for each pair of variables.

p.values <- rep(NA, 10)
dim(p.values) <- c(5, 2)
lim_oak <- pcor.test(rmfor$lime, rmfor$oak, rmfor$birch, method = "pearson")
p.values[1, 1] <- "lim_oak"
p.values[1, 2] <- lim_oak$p.value
lim_ald <- pcor.test(rmfor$lime, rmfor$alder, rmfor$birch, method = "pearson")
p.values[2, 1] <- "lim_ald"
p.values[2, 2] <- lim_ald$p.value
lim_oxb <- pcor.test(rmfor$lime, rmfor$oxb, rmfor$birch, method = "pearson")
p.values[3, 1] <- "lim_oxb"
p.values[3, 2] <- lim_oxb$p.value
lim_oxa <- pcor.test(rmfor$lime, rmfor$oxa, rmfor$birch, method = "pearson")
p.values[4, 1] <- "lim_oxa"
p.values[4, 2] <- lim_oxa$p.value
lim_bxa <- pcor.test(rmfor$lime, rmfor$bxa, rmfor$birch, method = "pearson")
p.values[5, 1] <- "lim_bxa"
p.values[5, 2] <- lim_bxa$p.value
p.values

##      [,1]      [,2]                  
## [1,] "lim_oak" "8.18539823252772e-06"
## [2,] "lim_ald" "0.117932241923188"   
## [3,] "lim_oxb" "0.477882571211855"   
## [4,] "lim_oxa" "0.147126076970424"   
## [5,] "lim_bxa" "0.214092179456883"

Make a new partial correlation matrix, controlling for both the variable you controlled for last time, and the one that was then the most significant. Repeat this step until no more variables should be included at the 5% level.

names(rmfor)

##  [1] "no"     "yrtype" "gud"    "oak"    "birch"  "alder"  "lime"  
##  [8] "oxb"    "oxa"    "bxa"


p.values2 <- rep(NA, 8)
dim(p.values2) <- c(4, 2)
lim_ald <- pcor.test(rmfor$lime, rmfor$alder, c(rmfor$birch, rmfor$oak), method = "pearson")
p.values2[1, 1] <- "lim_ald"
p.values2[1, 2] <- lim_ald$p.value
lim_oxb <- pcor.test(rmfor$lime, rmfor$oxb, c(rmfor$birch, rmfor$oak), method = "pearson")
p.values2[2, 1] <- "lim_oxb"
p.values2[2, 2] <- lim_oxb$p.value
lim_oxa <- pcor.test(rmfor$lime, rmfor$oxa, c(rmfor$birch, rmfor$oak), method = "pearson")
p.values2[3, 1] <- "lim_oxa"
p.values2[3, 2] <- lim_oxa$p.value
lim_bxa <- pcor.test(rmfor$lime, rmfor$bxa, c(rmfor$birch, rmfor$oak), method = "pearson")
p.values2[4, 1] <- "lim_bxa"
p.values2[4, 2] <- lim_bxa$p.value
p.values2

##      [,1]      [,2]                  
## [1,] "lim_ald" "0.00180188619376948" 
## [2,] "lim_oxb" "0.0265210414099598"  
## [3,] "lim_oxa" "0.0429466523524417"  
## [4,] "lim_bxa" "3.51064944731661e-05"


p.values3 <- rep(NA, 6)
dim(p.values3) <- c(3, 2)
lim_oxb <- pcor.test(rmfor$lime, rmfor$oxb, c(rmfor$birch, rmfor$oak, rmfor$alder), 
    method = "pearson")
p.values3[1, 1] <- "lim_oxb"
p.values3[1, 2] <- lim_oxb$p.value
lim_oxa <- pcor.test(rmfor$lime, rmfor$oxa, c(rmfor$birch, rmfor$oak, rmfor$alder), 
    method = "pearson")
p.values3[2, 1] <- "lim_oxa"
p.values3[2, 2] <- lim_oxa$p.value
lim_bxa <- pcor.test(rmfor$lime, rmfor$bxa, c(rmfor$birch, rmfor$oak, rmfor$alder), 
    method = "pearson")
p.values3[3, 1] <- "lim_bxa"
p.values3[3, 2] <- lim_bxa$p.value
p.values3

##      [,1]      [,2]                  
## [1,] "lim_oxb" "0.000669364219320102"
## [2,] "lim_oxa" "0.0380666506225002"  
## [3,] "lim_bxa" "3.64356069417029e-05"

Make a stepwise regression, with lime your dependent variable, oak, birch and alder and all their interaction variables as the independent variables. You can do this using the stepAIC command in the MASS library:

library(MASS)
fit <- lm(lime ~ birch, data = rmfor)
step <- stepAIC(fit, scope = list(upper = ~oak + birch + alder + oxb + oxa + 
    bxa, lower = ~1), direction = "forward")

## Start:  AIC=-80.79
## lime ~ birch
## 
##         Df Sum of Sq   RSS   AIC
## + oak    1     0.529 0.639 -95.1
## + alder  1     0.108 1.060 -81.4
## + oxa    1     0.094 1.074 -81.1
## <none>               1.168 -80.8
## + bxa    1     0.071 1.098 -80.5
## + oxb    1     0.024 1.144 -79.4
## 
## Step:  AIC=-95.09
## lime ~ birch + oak
## 
##         Df Sum of Sq   RSS    AIC
## + alder  1     0.372 0.267 -116.7
## + bxa    1     0.196 0.443 -103.0
## + oxa    1     0.082 0.556  -96.8
## <none>               0.639  -95.1
## + oxb    1     0.024 0.615  -94.1
## 
## Step:  AIC=-116.7
## lime ~ birch + oak + alder
## 
##        Df Sum of Sq   RSS  AIC
## <none>              0.267 -117
## + bxa   1   0.01558 0.251 -116
## + oxb   1   0.00112 0.266 -115
## + oxa   1   0.00035 0.266 -115

step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## lime ~ birch
## 
## Final Model:
## lime ~ birch + oak + alder
## 
## 
##      Step Df Deviance Resid. Df Resid. Dev     AIC
## 1                            25     1.1682  -80.79
## 2   + oak  1   0.5294        24     0.6387  -95.09
## 3 + alder  1   0.3721        23     0.2666 -116.68

summary(step)

## 
## Call:
## lm(formula = lime ~ birch + oak + alder, data = rmfor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3641 -0.0258  0.0150  0.0617  0.1564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.8883     0.0516   17.20  1.3e-14 ***
## birch        -1.0929     0.1022  -10.69  2.1e-10 ***
## oak          -0.8900     0.1076   -8.27  2.4e-08 ***
## alder        -0.7517     0.1327   -5.67  9.1e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.108 on 23 degrees of freedom
## Multiple R-squared: 0.873,   Adjusted R-squared: 0.857 
## F-statistic: 52.8 on 3 and 23 DF,  p-value: 1.82e-10

The first line specifies the initial model, and the stepAIC command line indicates which terms should be added to the model. What is the relation between the variables in the final model and the variables you accounted for in the partial correlations procedures?
The approach we followed for the partial correlations procedure is equivelent to a stepwise regression, just manual, so the result is the same.

You can check for collinearity problems in the model using kappa:

model <- model.matrix(~oak + birch + alder + oxb + oxa + bxa, data = rmfor)
kappa(model)

## [1] 61.29

# Would seem to be quite high - but unsure how to judge this.

Part B:
1. Time for a new data set: PCA.csv (the filename will be apparent in a later exercise). It is data on great reed warblers' body size. The first seven variables are just when and where an individual was caught etc. The other variables are the gender, age, and a number of body size variables. Note that there are a lot of missing values in the beginning of the file. Here, we will try to predict an individual's age from its body measures (not using gender). First of all, make a regression with age as dependent and the nine body measures as independent. Use Forward model selection. Note that R does not like missing data, so you will have exclude samples with missing data first:

pca <- read.csv("PCA.csv", header = TRUE, sep = ";")

## Warning: invalid input found on input connection 'PCA.csv'

pca.na <- na.omit(pca)
names(pca)

##  [1] "ind"    "ringnr" "tr"     "plats"  "year"   "date"   "hour"  
##  [8] "sign"   "sex"    "age"    "wing"   "tarsv"  "tail"   "wproj" 
## [15] "fett"   "weigh"  "bill"   "bihe"   "biwi"

# make a lm model and use forward stepwise regression to simplify
fit <- lm(age ~ wing, data = pca.na)
step <- stepAIC(fit, scope = list(upper = ~wing + tarsv + tail + wproj + fett + 
    weigh + bill + bihe + biwi, lower = ~1), direction = "forward")

## Start:  AIC=-Inf
## age ~ wing

## Warning: attempting model selection on an essentially perfect fit is
## nonsense

##        Df Sum of Sq RSS  AIC
## <none>                0 -Inf

step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## age ~ wing
## 
## Final Model:
## age ~ wing
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev  AIC
## 1                          0          0 -Inf

summary(step)

## 
## Call:
## lm(formula = age ~ wing, data = pca.na)
## 
## Residuals:
## ALL 1 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)       35         NA      NA       NA
## wing              NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom

What variables are chosen in the final model?
In the final model are 'wing', 'weigh', 'bihe', and 'biwi'. Two of which are significant, 'weigh' and 'bihe'. The explained variation in age is quite low, only around 2% of variation explained.
This is evident to if we plot one of the significant vairables:

plot(age ~ bihe, data = pca.na)

plot of chunk unnamed-chunk-13

Rerun the same regression but use backward selection instead. Backward selection means all variables are included at first, and then removed one by one. What variables are now selected? Is there any difference compared to the forward selection procedure?

fit2 <- lm(age ~ wing + tarsv + tail + wproj + fett + weigh + bill + bihe + 
    biwi, data = pca.na)
step2 <- stepAIC(fit2, scope = list(upper = ~wing + tarsv + tail + wproj + fett + 
    weigh + bill + bihe + biwi, lower = ~1), direction = "backward")

## Start:  AIC=-Inf
## age ~ wing + tarsv + tail + wproj + fett + weigh + bill + bihe + 
##     biwi

## Warning: attempting model selection on an essentially perfect fit is
## nonsense

## Error: missing value where TRUE/FALSE needed

step2$anova

## Error: object 'step2' not found

summary(step2)

## Error: object 'step2' not found

The result is the same, with the same vairables selected.
3. Make a third regression (forward selection), but excluding the wing parameter. Now you get a new model. What parameters are included? Can you explain why suddenly wproj is included? What is the correlation between wproj and wing?

fit3 <- lm(age ~ tarsv, data = pca.na)
step3 <- stepAIC(fit3, scope = list(upper = ~tarsv + tail + wproj + fett + weigh + 
    bill + bihe + biwi, lower = ~1), direction = "forward")

## Start:  AIC=-Inf
## age ~ tarsv

## Warning: attempting model selection on an essentially perfect fit is
## nonsense

##        Df Sum of Sq RSS  AIC
## <none>                0 -Inf

step3$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## age ~ tarsv
## 
## Final Model:
## age ~ tarsv
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev  AIC
## 1                          0          0 -Inf

summary(step3)

## 
## Call:
## lm(formula = age ~ tarsv, data = pca.na)
## 
## Residuals:
## ALL 1 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)       35         NA      NA       NA
## tarsv             NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom

Now five paramaters are inlcuded: tarsv, wproj, weigh, bihe, biwi. Wproj is now included which wasn't before. wproj presumably is colinear with wing, perhaps another wing measure.

# Testing for correlation between wing and wproj
library(Hmisc)
rcorr(pca.na$wing, pca.na$wproj, type = "pearson")

## Error: must have >4 observations

They are highly correlated, with low P value.

# To visualise this:
plot(pca.na$wing, pca.na$wproj)

plot of chunk unnamed-chunk-17

PART C.

Use the file PHEASANT.csv. It is growth data (WEIGHT and TARSH) of one pheasant chick from birth until the age of 140 days (AGEDAY). Plot WEIGHT against AGEDAY.

pheas <- read.csv("PHEASANT.CSV", sep = ";", header = TRUE)
plot(pheas$WEIGHT ~ pheas$AGEDAY)

plot of chunk unnamed-chunk-18

Non-linear regression can be used to fit data to a known equation. There is good theoretical reasons to believe that the weight should increase with age according to the following equation: weight=a/[1+e-k(age-i)] where a is the asymptote of the weight (the final weight that the chick may reach), k is the growth rate, and i is the inflection point, i.e. where growth goes from being concave to convex, or vice versa. Use non-linear regression to estimate these parameters using the command nls (= nonlinear least squares). For the exponential function use exp(-k(AGEDAY-i)). You will need something like the following values to start with, not to end up with horrific estimates (you may try to change these values just to see how sensitive non-linear modelling might be): a=1400, k=0.05, i=60:

nls <- nls(pheas$WEIGHT ~ a/(1 + exp(-k * (pheas$AGEDAY - i))), data = pheas, 
    start = list(a = 1400, k = 0.05, i = 60), na.action = na.omit)
# Save the predicted values.
pheas$PRED <- predict(nls, pheas)

You can get correlations of the parameter estimates using correlation=TRUE within summary.

summary(nls, correlation = TRUE)

## 
## Formula: pheas$WEIGHT ~ a/(1 + exp(-k * (pheas$AGEDAY - i)))
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a 1.31e+03   2.68e+01    48.9   <2e-16 ***
## k 5.15e-02   2.19e-03    23.5   <2e-16 ***
## i 6.62e+01   1.56e+00    42.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 33.5 on 27 degrees of freedom
## 
## Correlation of Parameter Estimates:
##   a     k    
## k -0.65      
## i  0.81 -0.71
## 
## Number of iterations to convergence: 6 
## Achieved convergence tolerance: 7.17e-06

There are quite high correlations between the parameter estimates. What does that mean?
The high correlations suggest that values of a, i, and k are closely related, with k reducing as a increases (i.e. growth rate decreases as the chick grows), i is postively correlated with a, suggesting that the effect of age on weight get less as the chick gets older. i is negatively correlated with k, which means that when the growth rate is high then i is small. This expected as k is effectively the growth rate irrespective of age, then i reduces this depending on age.

Plot the original weight values and the predicted against the age, to see how the model seems to fit data.

names(pheas)

## [1] "AGEDAY" "WEIGHT" "TARSH"  "PRED"

plot(pheas$WEIGHT ~ pheas$AGEDAY)
points(pheas$PRED ~ pheas$AGEDAY, col = "red")

plot of chunk unnamed-chunk-21

If you wish, you can perform the same analysis for the variable TARSH (length of right tarsus). This should follow the same general equation, but you need different start values. Make a scatterplot to see what start values may be reasonable.

# not done.