setwd("F:/Documents/Work/Courses/Stats_LUND_2012_October/Exercises/Ex_06")
Exercise 6 - R
To learn:
. Multiple linear regression
. Partial correlation
. Model selection
. Thinking about collinearity
. Nonlinear regression
Part A:
1. Open the file “rmfor.sav”. The data set is divided into four types of years, these are called yrtype (with the levels 0, 1, 10 and 11). Then there is a variable (gud) describing the overall food availability in the territory in that year. Finally there are four variables describing how much the woodpeckers used different tree species: oak, birch, alder, lime for the standardised use of oak, birch, alder and lime respectively. Here we will concentrate on the four usage variables.
rmfor <- read.csv("rmfor.csv", header = TRUE, sep = ";")
names(rmfor)
## [1] "no" "yrtype" "gud" "oak" "birch" "alder" "lime"
str(rmfor)
## 'data.frame': 27 obs. of 7 variables:
## $ no : int 0 0 0 0 7 7 7 10 10 10 ...
## $ yrtype: int 0 1 10 11 0 10 11 0 10 11 ...
## $ gud : num 0.24 1.15 1.62 2.1 2.08 0.46 1.34 1.31 0.57 0.44 ...
## $ oak : num 0.1339 0.0244 0.1063 0.1348 0.0057 ...
## $ birch : num 0.431 0.295 0.553 0.526 0 ...
## $ alder : num 0.0183 0.2699 0.081 0.2772 0 ...
## $ lime : num 0.0737 0.3988 0.1437 0 0.9209 ...
rmfor$yrtype <- factor(rmfor$yrtype)
First of all make a scatterplot matrix of the four tree variables (oak, birch, alder and lime). According to Quick-R (a very useful website which I assume most of you have already discovered by now), there are at least 4 methods for doing this (see http://www.statmethods.net/graphs/scatterplot.html). The simplest uses the command pairs:
pairs(~oak + birch + alder + lime, data = rmfor, main = "Simple Scatterplot Matrix")
Do they seem independent?
Most vairables appear independent, but not all. Where oak is common birch appears less comon, and where birch is common lime appears less common.
rmfor$oxb <- rmfor$oak * rmfor$birch
rmfor$oxa <- rmfor$oak * rmfor$alder
rmfor$bxa <- rmfor$birch * rmfor$alder
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 2.14.2
## Loading required package: survival
## Loading required package: splines
## Hmisc library by Frank E Harrell Jr
##
## Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview') to see overall
## documentation.
##
## NOTE:Hmisc no longer redefines [.factor to drop unused levels when
## subsetting. To get the old behavior of Hmisc type dropUnusedLevels().
## Attaching package: 'Hmisc'
## The following object(s) are masked from 'package:survival':
##
## untangle.specials
## The following object(s) are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
rcorr(as.matrix(rmfor), type = "pearson")
## no yrtype gud oak birch alder lime oxb oxa bxa
## no 1.00 -0.05 -0.13 0.37 -0.09 -0.13 -0.10 0.10 0.20 -0.19
## yrtype -0.05 1.00 -0.19 -0.13 0.68 0.04 -0.49 0.53 0.19 0.27
## gud -0.13 -0.19 1.00 -0.25 -0.32 -0.11 0.54 -0.46 -0.04 -0.09
## oak 0.37 -0.13 -0.25 1.00 -0.34 -0.36 -0.25 0.20 0.00 -0.35
## birch -0.09 0.68 -0.32 -0.34 1.00 0.12 -0.67 0.44 0.07 0.47
## alder -0.13 0.04 -0.11 -0.36 0.12 1.00 -0.31 -0.22 0.41 0.80
## lime -0.10 -0.49 0.54 -0.25 -0.67 -0.31 1.00 -0.39 -0.26 -0.47
## oxb 0.10 0.53 -0.46 0.20 0.44 -0.22 -0.39 1.00 0.22 -0.07
## oxa 0.20 0.19 -0.04 0.00 0.07 0.41 -0.26 0.22 1.00 0.35
## bxa -0.19 0.27 -0.09 -0.35 0.47 0.80 -0.47 -0.07 0.35 1.00
##
## n= 27
##
##
## P
## no yrtype gud oak birch alder lime oxb oxa
## no 0.7985 0.5221 0.0544 0.6711 0.5143 0.6146 0.6163 0.3267
## yrtype 0.7985 0.3307 0.5168 0.0001 0.8316 0.0092 0.0045 0.3477
## gud 0.5221 0.3307 0.2061 0.1052 0.5850 0.0034 0.0147 0.8364
## oak 0.0544 0.5168 0.2061 0.0867 0.0674 0.2104 0.3227 0.9977
## birch 0.6711 0.0001 0.1052 0.0867 0.5376 0.0001 0.0223 0.7123
## alder 0.5143 0.8316 0.5850 0.0674 0.5376 0.1185 0.2660 0.0324
## lime 0.6146 0.0092 0.0034 0.2104 0.0001 0.1185 0.0455 0.1893
## oxb 0.6163 0.0045 0.0147 0.3227 0.0223 0.2660 0.0455 0.2803
## oxa 0.3267 0.3477 0.8364 0.9977 0.7123 0.0324 0.1893 0.2803
## bxa 0.3383 0.1654 0.6706 0.0766 0.0134 0.0000 0.0123 0.7324 0.0693
## bxa
## no 0.3383
## yrtype 0.1654
## gud 0.6706
## oak 0.0766
## birch 0.0134
## alder 0.0000
## lime 0.0123
## oxb 0.7324
## oxa 0.0693
## bxa
names(rmfor)
## [1] "no" "yrtype" "gud" "oak" "birch" "alder" "lime"
## [8] "oxb" "oxa" "bxa"
# for only ones we are interested in:
trees <- rmfor[4:6]
trees <- cbind(trees, rmfor[8:10])
trees <- as.matrix(trees)
lime <- rmfor$lime
rcorr(lime, trees, type = "pearson")
## x oak birch alder oxb oxa bxa
## x 1.00 -0.25 -0.67 -0.31 -0.39 -0.26 -0.47
## oak -0.25 1.00 -0.34 -0.36 0.20 0.00 -0.35
## birch -0.67 -0.34 1.00 0.12 0.44 0.07 0.47
## alder -0.31 -0.36 0.12 1.00 -0.22 0.41 0.80
## oxb -0.39 0.20 0.44 -0.22 1.00 0.22 -0.07
## oxa -0.26 0.00 0.07 0.41 0.22 1.00 0.35
## bxa -0.47 -0.35 0.47 0.80 -0.07 0.35 1.00
##
## n= 27
##
##
## P
## x oak birch alder oxb oxa bxa
## x 0.2104 0.0001 0.1185 0.0455 0.1893 0.0123
## oak 0.2104 0.0867 0.0674 0.3227 0.9977 0.0766
## birch 0.0001 0.0867 0.5376 0.0223 0.7123 0.0134
## alder 0.1185 0.0674 0.5376 0.2660 0.0324 0.0000
## oxb 0.0455 0.3227 0.0223 0.2660 0.2803 0.7324
## oxa 0.1893 0.9977 0.7123 0.0324 0.2803 0.0693
## bxa 0.0123 0.0766 0.0134 0.0000 0.7324 0.0693
there are 7 significant correlations, though this is without compensating for multipe comparisons.
# pcor.test(x, y, z, method = 'pearson')
#
# install.packages('ppcor')
library(ppcor)
## Warning: package 'ppcor' was built under R version 2.14.2
In this command x and y are the variables to be correlated and z is the variable(s) which you wish to control for. For example:
pcor.test(rmfor$lime, rmfor$oak, rmfor$birch, method = "pearson")
## estimate p.value statistic n gp Method
## 1 -0.6732 8.185e-06 -4.46 27 1 pearson
Note that you must do sepparate correlations for each pair of variables.
p.values <- rep(NA, 10)
dim(p.values) <- c(5, 2)
lim_oak <- pcor.test(rmfor$lime, rmfor$oak, rmfor$birch, method = "pearson")
p.values[1, 1] <- "lim_oak"
p.values[1, 2] <- lim_oak$p.value
lim_ald <- pcor.test(rmfor$lime, rmfor$alder, rmfor$birch, method = "pearson")
p.values[2, 1] <- "lim_ald"
p.values[2, 2] <- lim_ald$p.value
lim_oxb <- pcor.test(rmfor$lime, rmfor$oxb, rmfor$birch, method = "pearson")
p.values[3, 1] <- "lim_oxb"
p.values[3, 2] <- lim_oxb$p.value
lim_oxa <- pcor.test(rmfor$lime, rmfor$oxa, rmfor$birch, method = "pearson")
p.values[4, 1] <- "lim_oxa"
p.values[4, 2] <- lim_oxa$p.value
lim_bxa <- pcor.test(rmfor$lime, rmfor$bxa, rmfor$birch, method = "pearson")
p.values[5, 1] <- "lim_bxa"
p.values[5, 2] <- lim_bxa$p.value
p.values
## [,1] [,2]
## [1,] "lim_oak" "8.18539823252772e-06"
## [2,] "lim_ald" "0.117932241923188"
## [3,] "lim_oxb" "0.477882571211855"
## [4,] "lim_oxa" "0.147126076970424"
## [5,] "lim_bxa" "0.214092179456883"
names(rmfor)
## [1] "no" "yrtype" "gud" "oak" "birch" "alder" "lime"
## [8] "oxb" "oxa" "bxa"
p.values2 <- rep(NA, 8)
dim(p.values2) <- c(4, 2)
lim_ald <- pcor.test(rmfor$lime, rmfor$alder, c(rmfor$birch, rmfor$oak), method = "pearson")
p.values2[1, 1] <- "lim_ald"
p.values2[1, 2] <- lim_ald$p.value
lim_oxb <- pcor.test(rmfor$lime, rmfor$oxb, c(rmfor$birch, rmfor$oak), method = "pearson")
p.values2[2, 1] <- "lim_oxb"
p.values2[2, 2] <- lim_oxb$p.value
lim_oxa <- pcor.test(rmfor$lime, rmfor$oxa, c(rmfor$birch, rmfor$oak), method = "pearson")
p.values2[3, 1] <- "lim_oxa"
p.values2[3, 2] <- lim_oxa$p.value
lim_bxa <- pcor.test(rmfor$lime, rmfor$bxa, c(rmfor$birch, rmfor$oak), method = "pearson")
p.values2[4, 1] <- "lim_bxa"
p.values2[4, 2] <- lim_bxa$p.value
p.values2
## [,1] [,2]
## [1,] "lim_ald" "0.00180188619376948"
## [2,] "lim_oxb" "0.0265210414099598"
## [3,] "lim_oxa" "0.0429466523524417"
## [4,] "lim_bxa" "3.51064944731661e-05"
p.values3 <- rep(NA, 6)
dim(p.values3) <- c(3, 2)
lim_oxb <- pcor.test(rmfor$lime, rmfor$oxb, c(rmfor$birch, rmfor$oak, rmfor$alder),
method = "pearson")
p.values3[1, 1] <- "lim_oxb"
p.values3[1, 2] <- lim_oxb$p.value
lim_oxa <- pcor.test(rmfor$lime, rmfor$oxa, c(rmfor$birch, rmfor$oak, rmfor$alder),
method = "pearson")
p.values3[2, 1] <- "lim_oxa"
p.values3[2, 2] <- lim_oxa$p.value
lim_bxa <- pcor.test(rmfor$lime, rmfor$bxa, c(rmfor$birch, rmfor$oak, rmfor$alder),
method = "pearson")
p.values3[3, 1] <- "lim_bxa"
p.values3[3, 2] <- lim_bxa$p.value
p.values3
## [,1] [,2]
## [1,] "lim_oxb" "0.000669364219320102"
## [2,] "lim_oxa" "0.0380666506225002"
## [3,] "lim_bxa" "3.64356069417029e-05"
library(MASS)
fit <- lm(lime ~ birch, data = rmfor)
step <- stepAIC(fit, scope = list(upper = ~oak + birch + alder + oxb + oxa +
bxa, lower = ~1), direction = "forward")
## Start: AIC=-80.79
## lime ~ birch
##
## Df Sum of Sq RSS AIC
## + oak 1 0.529 0.639 -95.1
## + alder 1 0.108 1.060 -81.4
## + oxa 1 0.094 1.074 -81.1
## <none> 1.168 -80.8
## + bxa 1 0.071 1.098 -80.5
## + oxb 1 0.024 1.144 -79.4
##
## Step: AIC=-95.09
## lime ~ birch + oak
##
## Df Sum of Sq RSS AIC
## + alder 1 0.372 0.267 -116.7
## + bxa 1 0.196 0.443 -103.0
## + oxa 1 0.082 0.556 -96.8
## <none> 0.639 -95.1
## + oxb 1 0.024 0.615 -94.1
##
## Step: AIC=-116.7
## lime ~ birch + oak + alder
##
## Df Sum of Sq RSS AIC
## <none> 0.267 -117
## + bxa 1 0.01558 0.251 -116
## + oxb 1 0.00112 0.266 -115
## + oxa 1 0.00035 0.266 -115
step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## lime ~ birch
##
## Final Model:
## lime ~ birch + oak + alder
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 25 1.1682 -80.79
## 2 + oak 1 0.5294 24 0.6387 -95.09
## 3 + alder 1 0.3721 23 0.2666 -116.68
summary(step)
##
## Call:
## lm(formula = lime ~ birch + oak + alder, data = rmfor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3641 -0.0258 0.0150 0.0617 0.1564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8883 0.0516 17.20 1.3e-14 ***
## birch -1.0929 0.1022 -10.69 2.1e-10 ***
## oak -0.8900 0.1076 -8.27 2.4e-08 ***
## alder -0.7517 0.1327 -5.67 9.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.108 on 23 degrees of freedom
## Multiple R-squared: 0.873, Adjusted R-squared: 0.857
## F-statistic: 52.8 on 3 and 23 DF, p-value: 1.82e-10
The first line specifies the initial model, and the stepAIC command line indicates which terms should be added to the model. What is the relation between the variables in the final model and the variables you accounted for in the partial correlations procedures?
The approach we followed for the partial correlations procedure is equivelent to a stepwise regression, just manual, so the result is the same.
model <- model.matrix(~oak + birch + alder + oxb + oxa + bxa, data = rmfor)
kappa(model)
## [1] 61.29
# Would seem to be quite high - but unsure how to judge this.
Part B:
1. Time for a new data set: PCA.csv (the filename will be apparent in a later exercise). It is data on great reed warblers' body size. The first seven variables are just when and where an individual was caught etc. The other variables are the gender, age, and a number of body size variables. Note that there are a lot of missing values in the beginning of the file. Here, we will try to predict an individual's age from its body measures (not using gender). First of all, make a regression with age as dependent and the nine body measures as independent. Use Forward model selection. Note that R does not like missing data, so you will have exclude samples with missing data first:
pca <- read.csv("PCA.csv", header = TRUE, sep = ";")
## Warning: invalid input found on input connection 'PCA.csv'
pca.na <- na.omit(pca)
names(pca)
## [1] "ind" "ringnr" "tr" "plats" "year" "date" "hour"
## [8] "sign" "sex" "age" "wing" "tarsv" "tail" "wproj"
## [15] "fett" "weigh" "bill" "bihe" "biwi"
# make a lm model and use forward stepwise regression to simplify
fit <- lm(age ~ wing, data = pca.na)
step <- stepAIC(fit, scope = list(upper = ~wing + tarsv + tail + wproj + fett +
weigh + bill + bihe + biwi, lower = ~1), direction = "forward")
## Start: AIC=-Inf
## age ~ wing
## Warning: attempting model selection on an essentially perfect fit is
## nonsense
## Df Sum of Sq RSS AIC
## <none> 0 -Inf
step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## age ~ wing
##
## Final Model:
## age ~ wing
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 0 0 -Inf
summary(step)
##
## Call:
## lm(formula = age ~ wing, data = pca.na)
##
## Residuals:
## ALL 1 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35 NA NA NA
## wing NA NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
What variables are chosen in the final model?
In the final model are 'wing', 'weigh', 'bihe', and 'biwi'. Two of which are significant, 'weigh' and 'bihe'. The explained variation in age is quite low, only around 2% of variation explained.
This is evident to if we plot one of the significant vairables:
plot(age ~ bihe, data = pca.na)
fit2 <- lm(age ~ wing + tarsv + tail + wproj + fett + weigh + bill + bihe +
biwi, data = pca.na)
step2 <- stepAIC(fit2, scope = list(upper = ~wing + tarsv + tail + wproj + fett +
weigh + bill + bihe + biwi, lower = ~1), direction = "backward")
## Start: AIC=-Inf
## age ~ wing + tarsv + tail + wproj + fett + weigh + bill + bihe +
## biwi
## Warning: attempting model selection on an essentially perfect fit is
## nonsense
## Error: missing value where TRUE/FALSE needed
step2$anova
## Error: object 'step2' not found
summary(step2)
## Error: object 'step2' not found
The result is the same, with the same vairables selected.
3. Make a third regression (forward selection), but excluding the wing parameter. Now you get a new model. What parameters are included? Can you explain why suddenly wproj is included? What is the correlation between wproj and wing?
fit3 <- lm(age ~ tarsv, data = pca.na)
step3 <- stepAIC(fit3, scope = list(upper = ~tarsv + tail + wproj + fett + weigh +
bill + bihe + biwi, lower = ~1), direction = "forward")
## Start: AIC=-Inf
## age ~ tarsv
## Warning: attempting model selection on an essentially perfect fit is
## nonsense
## Df Sum of Sq RSS AIC
## <none> 0 -Inf
step3$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## age ~ tarsv
##
## Final Model:
## age ~ tarsv
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 0 0 -Inf
summary(step3)
##
## Call:
## lm(formula = age ~ tarsv, data = pca.na)
##
## Residuals:
## ALL 1 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35 NA NA NA
## tarsv NA NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
Now five paramaters are inlcuded: tarsv, wproj, weigh, bihe, biwi. Wproj is now included which wasn't before. wproj presumably is colinear with wing, perhaps another wing measure.
# Testing for correlation between wing and wproj
library(Hmisc)
rcorr(pca.na$wing, pca.na$wproj, type = "pearson")
## Error: must have >4 observations
They are highly correlated, with low P value.
# To visualise this:
plot(pca.na$wing, pca.na$wproj)
PART C.
pheas <- read.csv("PHEASANT.CSV", sep = ";", header = TRUE)
plot(pheas$WEIGHT ~ pheas$AGEDAY)
Non-linear regression can be used to fit data to a known equation. There is good theoretical reasons to believe that the weight should increase with age according to the following equation: weight=a/[1+e-k(age-i)] where a is the asymptote of the weight (the final weight that the chick may reach), k is the growth rate, and i is the inflection point, i.e. where growth goes from being concave to convex, or vice versa. Use non-linear regression to estimate these parameters using the command nls (= nonlinear least squares). For the exponential function use exp(-k(AGEDAY-i)). You will need something like the following values to start with, not to end up with horrific estimates (you may try to change these values just to see how sensitive non-linear modelling might be): a=1400, k=0.05, i=60:
nls <- nls(pheas$WEIGHT ~ a/(1 + exp(-k * (pheas$AGEDAY - i))), data = pheas,
start = list(a = 1400, k = 0.05, i = 60), na.action = na.omit)
# Save the predicted values.
pheas$PRED <- predict(nls, pheas)
You can get correlations of the parameter estimates using correlation=TRUE within summary.
summary(nls, correlation = TRUE)
##
## Formula: pheas$WEIGHT ~ a/(1 + exp(-k * (pheas$AGEDAY - i)))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 1.31e+03 2.68e+01 48.9 <2e-16 ***
## k 5.15e-02 2.19e-03 23.5 <2e-16 ***
## i 6.62e+01 1.56e+00 42.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.5 on 27 degrees of freedom
##
## Correlation of Parameter Estimates:
## a k
## k -0.65
## i 0.81 -0.71
##
## Number of iterations to convergence: 6
## Achieved convergence tolerance: 7.17e-06
There are quite high correlations between the parameter estimates. What does that mean?
The high correlations suggest that values of a, i, and k are closely related, with k reducing as a increases (i.e. growth rate decreases as the chick grows), i is postively correlated with a, suggesting that the effect of age on weight get less as the chick gets older. i is negatively correlated with k, which means that when the growth rate is high then i is small. This expected as k is effectively the growth rate irrespective of age, then i reduces this depending on age.
names(pheas)
## [1] "AGEDAY" "WEIGHT" "TARSH" "PRED"
plot(pheas$WEIGHT ~ pheas$AGEDAY)
points(pheas$PRED ~ pheas$AGEDAY, col = "red")
# not done.