setwd("F:/Documents/Work/Courses/Stats_LUND_2012_October/Exercises/Ex_05")

Part A
1. Read in the file “ANCEX.csv”. It has two variables: xvar and yvar. Explore their distributions. Do they look normally distributed? Test it!

ancex <- read.csv("ANCEX.csv", header = TRUE, sep = ";")

names(ancex)
## [1] "XVAR" "YVAR"
str(ancex)
## 'data.frame':    30 obs. of  2 variables:
##  $ XVAR: num  2.14 2.26 19.96 15.01 2.61 ...
##  $ YVAR: num  16.8 15.7 69.3 54.7 15.1 ...

Neither of the two variables looks normally distributed:
X variable

hist(ancex$XVAR)

[Plot: histogram of XVAR]

This may be Poisson distributed.
Y variable

hist(ancex$YVAR)

[Plot: histogram of YVAR]

This would perhaps benefit from a log transformation, given the right-hand (positive) skew.
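As a quick optional check of that idea (not part of the exercise), one could look at the distribution of the log-transformed values:

hist(log(ancex$YVAR))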
We are requested to test for normality. We will use the Shapiro-Wilk test:

shapiro.test(ancex$XVAR)
## 
##  Shapiro-Wilk normality test
## 
## data:  ancex$XVAR 
## W = 0.8287, p-value = 0.0002295
shapiro.test(ancex$YVAR)
## 
##  Shapiro-Wilk normality test
## 
## data:  ancex$YVAR 
## W = 0.8794, p-value = 0.002732

As expected, both distributions differ significantly from a normal distribution (P = 0.00023 for XVAR and P = 0.0027 for YVAR).

  2. Calculate the (Pearson) correlation between xvar and yvar (use rcorr in the Hmisc library). What does it tell you? What are the assumptions behind the significance test?
# Attach required package 'Hmisc' for Pearson correlation test
library(Hmisc)
rcorr(ancex$XVAR, ancex$YVAR, type = "pearson")
##      x    y
## x 1.00 0.93
## y 0.93 1.00
## 
## n= 30 
## 
## 
## P
##   x  y 
## x     0
## y  0

The test suggests that XVAR and YVAR are strongly correlated (r = 0.93), and that this is significant, with a very low P-value (reported here as '0', i.e. zero to several decimal places).
However, as the question hints, the assumptions behind the significance test are not fulfilled: the test assumes normally distributed variables and a linear relationship, but neither variable is normally distributed, and we have not yet checked whether the relationship is linear.
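A rank-based alternative that does not rely on these assumptions is the Spearman correlation, which rcorr also provides (an optional sketch, not part of the exercise):

rcorr(ancex$XVAR, ancex$YVAR, type = "spearman")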

  3. Make a scatterplot with xvar and yvar. Does it look like you expected, judging from the correlation?
plot(ancex$XVAR, ancex$YVAR)

[Plot: scatterplot of YVAR against XVAR]

Given the high correlation value, the plot does look largely as expected, showing a near-linear relationship.

  4. Now make a linear regression, using xvar to predict yvar. In other words, xvar is the independent and yvar is the dependent variable. For this you can use the lm command (lm for linear model):
# fit <- lm(ancex$YVAR~ancex$XVAR) summary(fit)
fit <- lm(ancex$YVAR ~ ancex$XVAR)
summary(fit)
## 
## Call:
## lm(formula = ancex$YVAR ~ ancex$XVAR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.229  -3.292  -0.504   4.211  10.507 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.287      1.707    6.03  1.7e-06 ***
## ancex$XVAR     2.905      0.215   13.49  9.0e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 5.88 on 28 degrees of freedom
## Multiple R-squared: 0.867,   Adjusted R-squared: 0.862 
## F-statistic:  182 on 1 and 28 DF,  p-value: 9.05e-14

View this on the scatterplot:

plot(ancex$XVAR, ancex$YVAR)
abline(fit)

[Plot: scatterplot of YVAR against XVAR with the fitted regression line]

What is the P-value of the regression? What is H0?
The P-value is 9.05e-14, far below 0.001.
H0 is that the regression slope is zero, i.e. that there is no linear relationship between the variables: knowing the value of XVAR tells you nothing about the value of YVAR.

  5. The regression also produced parameter values. Make sure you can find them. The “Estimate” column is the important one. Write down the equation for the regression line!

The general equation: y = mx + c
Here: y = 2.905x + 10.287 (slope and intercept taken from the Estimate column).
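The same numbers can be pulled out of the fitted model programmatically, which avoids copying them by hand (a small sketch; the value 5 is just an arbitrary example):

b <- coef(fit)                           # named vector: (Intercept) and ancex$XVAR
b["(Intercept)"] + b["ancex$XVAR"] * 5   # predicted YVAR at XVAR = 5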

  6. Extract the confidence intervals for the parameters using the command confint. It has the form confint(object, parm, level = 0.95). What are the 95% confidence intervals?
confint(fit, parm = "ancex$XVAR", level = 0.95)
##            2.5 % 97.5 %
## ancex$XVAR 2.464  3.346
confint(fit, parm = "(Intercept)", level = 0.95)
##             2.5 % 97.5 %
## (Intercept) 6.792  13.78
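Calling confint without the parm argument returns both intervals in a single table:

confint(fit, level = 0.95)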
  7. You can also get a significance value by doing an ANOVA, using the command aov. Use the same commands as in step 4, but replace lm with aov and call your results something else, for example fit.aov. Compare the square of the appropriate t-value in the output from the lm analysis with the F-value in the ANOVA output. What do you conclude?
fit.aov <- aov((ancex$YVAR ~ ancex$XVAR))
fit.aov
## Call:
##    aov(formula = (ancex$YVAR ~ ancex$XVAR))
## 
## Terms:
##                 ancex$XVAR Residuals
## Sum of Squares        6292       969
## Deg. of Freedom          1        28
## 
## Residual standard error: 5.882 
## Estimated effects may be unbalanced
summary(fit.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## ancex$XVAR   1   6292    6292     182  9e-14 ***
## Residuals   28    969      35                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit)
## 
## Call:
## lm(formula = ancex$YVAR ~ ancex$XVAR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.229  -3.292  -0.504   4.211  10.507 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.287      1.707    6.03  1.7e-06 ***
## ancex$XVAR     2.905      0.215   13.49  9.0e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 5.88 on 28 degrees of freedom
## Multiple R-squared: 0.867,   Adjusted R-squared: 0.862 
## F-statistic:  182 on 1 and 28 DF,  p-value: 9.05e-14

F value: 181.9
t value: 13.486
square of t-value:

13.486 * 13.486
## [1] 181.9

This matches the F value. We conclude that F = t^2: the t-test on the slope in the lm output and the F-test in the ANOVA output are testing exactly the same thing.
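The F = t^2 identity can also be checked directly from the fitted objects, rather than by retyping the numbers (a small sketch):

tval <- summary(fit)$coefficients["ancex$XVAR", "t value"]
Fval <- anova(fit)[1, "F value"]    # the same F value that appears in the aov table
c(t_squared = tval^2, F = Fval)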

  8. Add a regression line to your scatterplot using abline.
# Already done - see above.
  9. It's about time we tested the assumptions of the regression. They are mostly concerned with the residuals, which are easily calculated in R. You can extract them from the model using the command residuals and add them directly to your dataframe:
ancex$resid <- residuals(fit)
  10. The residuals are supposed to be normally distributed. Are they?
hist(ancex$resid)

[Plot: histogram of the residuals]

shapiro.test(ancex$resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  ancex$resid 
## W = 0.9766, p-value = 0.7287

Yes: the histogram looks roughly symmetric, and the Shapiro-Wilk test (W = 0.977, P = 0.73) gives no evidence against normality.
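A normal Q-Q plot is another common way to eyeball this (optional, not asked for in the exercise):

qqnorm(ancex$resid)
qqline(ancex$resid)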

  11. The residuals are also supposed to be independent of the predictor XVAR. If you plot the residuals against XVAR, there should be no obvious nonlinear trend (like a hump, for example). A scatterplot is appropriate for this.
plot(ancex$resid ~ ancex$XVAR)

[Plot: residuals against XVAR]

There is no obvious nonlinear trend, so this assumption looks OK.

  12. Another assumption is that the variance of the residuals is independent of the independent variable XVAR. This can also be visually checked in the scatterplot you just made. What do you conclude?
    The variance does not appear to be independent of XVAR: the residuals are more spread out at low x-values than at high x-values, suggesting heteroscedasticity.
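R's built-in regression diagnostics show the same pattern; for example, the scale-location plot (an optional sketch; it plots against the fitted values, which here are just a linear function of XVAR):

plot(fit, which = 3)   # sqrt(|standardised residuals|) against fitted values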

  13. Make a new regression, but now choose XVAR as the dependent and YVAR as the independent. What is the P-value? What is the regression slope?

mod2 <- lm(ancex$XVAR ~ ancex$YVAR)
summary(mod2)
## 
## Call:
## lm(formula = ancex$XVAR ~ ancex$YVAR)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.541 -1.037  0.005  0.924  4.149 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.2471     0.7119   -3.16   0.0038 ** 
## ancex$YVAR    0.2983     0.0221   13.49    9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.88 on 28 degrees of freedom
## Multiple R-squared: 0.867,   Adjusted R-squared: 0.862 
## F-statistic:  182 on 1 and 28 DF,  p-value: 9.05e-14

The p-value is <0.001. The slope is 0.298.
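As an aside, the two OLS slopes are linked through the correlation: their product equals r-squared (here 2.905 * 0.298 is about 0.87, matching the R-squared of 0.867 reported above). A one-line check:

coef(fit)["ancex$XVAR"] * coef(mod2)["ancex$YVAR"]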

  14. Now compare the P-values of the two regressions, and the P-value of the correlation above. What do you conclude?
The P-values are identical in all three cases: the correlation test and the two regressions all test the same null hypothesis of no linear association, so the calculations are equivalent.
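One way to see the equivalence directly (an optional sketch): base R's cor.test reports the same t statistic and P-value as the two regression slope tests:

cor.test(ancex$XVAR, ancex$YVAR)   # Pearson by default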

Part B

  1. Time to switch data sets. Open the file “RMfor.csv”. This is woodpecker data. The data set is divided into four types of years; these are called yrtype (with the levels 0, 1, 10 and 11). Then there is a variable (gud) describing the overall food availability in the territory in that year. Finally there are four variables describing how much the woodpeckers used different tree species: oak, birch, alder, lime for the standardised use of oak, birch, alder and lime respectively. For this exercise you only need to worry about lime and birch. You can even delete the other variables if you want. Make the ordinary (Model I) regression with lime your dependent and birch the independent. Save the predicted values using the command fitted. Here I used the model name fit.RMfor:
# RMfor$pred.bir <- fitted(fit.RMfor)
rmfor <- read.csv("RMfor.csv", header = TRUE, sep = ";")
str(rmfor)
## 'data.frame':    27 obs. of  7 variables:
##  $ no    : int  0 0 0 0 7 7 7 10 10 10 ...
##  $ yrtype: int  0 1 10 11 0 10 11 0 10 11 ...
##  $ gud   : num  0.24 1.15 1.62 2.1 2.08 0.46 1.34 1.31 0.57 0.44 ...
##  $ oak   : num  0.1339 0.0244 0.1063 0.1348 0.0057 ...
##  $ birch : num  0.431 0.295 0.553 0.526 0 ...
##  $ alder : num  0.0183 0.2699 0.081 0.2772 0 ...
##  $ lime  : num  0.0737 0.3988 0.1437 0 0.9209 ...
rmfor$yrtype <- as.factor(rmfor$yrtype)
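As the question notes, the unused tree columns could be dropped; one optional way to do this (the name rmfor.small is arbitrary, chosen so the full data frame stays available):

rmfor.small <- rmfor[, c("yrtype", "gud", "birch", "lime")]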

First make a regression of lime against birch, then save the fitted model values:

fit.rmfor <- lm(rmfor$lime ~ rmfor$birch)
summary(fit.rmfor)
## 
## Call:
## lm(formula = rmfor$lime ~ rmfor$birch)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5492 -0.1170 -0.0225  0.0985  0.3957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.5585     0.0715    7.81  3.6e-08 ***
## rmfor$birch  -0.8635     0.1933   -4.47  0.00015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.216 on 25 degrees of freedom
## Multiple R-squared: 0.444,   Adjusted R-squared: 0.422 
## F-statistic:   20 on 1 and 25 DF,  p-value: 0.000148
rmfor$pred.bir <- fitted(fit.rmfor)
  2. There are two reasons why Model I regression may not be appropriate: i) there may be a considerable error of measurement in the independent variable, and ii) there is not a simple causal relation between the two variables, i.e. it is not easy to say which one should be x and which one should be y. Then you should use some sort of Model II regression. There are two types: Major Axis (the same as the principal component axis), which should only be applied when the scales of the two variables are similar, and Reduced Major Axis. Major axis regression minimizes the sum of the squared perpendicular distances between the points and the regression line, i.e. the function h = r^2/(1 + b^2), where r is the residual (the vertical distance between the point and the line) and b is the slope (the regression coefficient) of the line. Reduced major axis minimizes the sum of the areas of the right-angled triangles between the points and the line. This is done by minimizing the function A = r^2/(2|b|), where r is the residual as before, and |b| is the absolute value of the slope. Both MA regression and RMA regression can be calculated using the lmodel2 package. The lmodel2 command has the form:
# lmodel2(formula, data = NULL, range.y=NULL, range.x=NULL, nperm=0)

The formula is written in the same way as for the lm command. Leave range.y and range.x as NULL (the default), since these are designed to be used in a different method (called Ranged Major Axis Regression, which we will not discuss here).

# install.packages('lmodel2')
library(lmodel2)
fit.MA <- lmodel2(lime ~ birch, data = rmfor, nperm = 99)
## RMA was not requested: it will not be computed.
# Note that for lmodel2 you don't need to ask for a summary of the model
# fit.
fit.MA
## 
## Model II regression
## 
## Call: lmodel2(formula = lime ~ birch, data = rmfor, nperm = 99)
## 
## n = 27   r = -0.6663   r-square = 0.444 
## Parametric P-values:   2-tailed = 0.0001479    1-tailed = 7.394e-05 
## Angle between the two OLS regression lines = 21.98 degrees
## 
## Permutation tests of OLS, MA, RMA slopes: 1-tailed, tail corresponding to sign
## A permutation test of r is equivalent to a permutation test of the OLS slope
## P-perm for SMA = NA because the SMA slope cannot be tested
## 
## Regression results
##   Method Intercept     Slope  Angle (degrees)  P-perm (1-tailed)
## 1    OLS    0.5585   -0.8635           -40.81               0.01
## 2     MA    0.7403   -1.4680           -55.74               0.01
## 3    SMA    0.6885   -1.2959           -52.34                 NA
## 
## Confidence intervals
##   Method  2.5%-Intercept 97.5%-Intercept  2.5%-Slope 97.5%-Slope
## 1    OLS          0.4113          0.7057      -1.262     -0.4655
## 2     MA          0.5827          1.0477      -2.490     -0.9441
## 3    SMA          0.5868          0.8262      -1.754     -0.9576
## 
## Eigenvalues: 0.1091 0.01981 
## 
## H statistic used for computing C.I. of MA: 0.046

Plot the model fit

plot(fit.MA, method = "MA")

[Plot: lmodel2 fit of lime against birch, MA method]

  3. In the lmodel2 output you will see values for OLS, MA, and SMA. OLS = ordinary least squares, MA = major axis, and SMA = reduced major axis. You can see how the calculated slope and intercept differ between the three methods.
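To connect these numbers to the definitions in step 2, the MA and SMA slopes can also be computed by hand from the sample variances and covariance, using the closed-form solutions of those minimisation problems (an optional sketch; the values should match the lmodel2 table above):

sxy <- cov(rmfor$birch, rmfor$lime)
sx2 <- var(rmfor$birch)
sy2 <- var(rmfor$lime)
# SMA (reduced major axis) slope: sign of the covariance times sd(y)/sd(x)
sma.slope <- sign(sxy) * sqrt(sy2 / sx2)
# MA (major axis) slope: slope of the first principal component of the covariance matrix
ma.slope <- (sy2 - sx2 + sqrt((sy2 - sx2)^2 + 4 * sxy^2)) / (2 * sxy)
c(SMA = sma.slope, MA = ma.slope)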

  4. Make an overlay scatter plot with lime and all three predicted values against birch. You will need to calculate them yourself since the fitted command doesn't work with lmodel2. All you have to do is use the intercepts and slopes from the lmodel2 output. You can generate a table with the regression results using:

fit.MA$regression.results
##   Method Intercept     Slope  Angle (degrees)  P-perm (1-tailed)
## 1    OLS    0.5585   -0.8635           -40.81               0.01
## 2     MA    0.7403   -1.4680           -55.74               0.01
## 3    SMA    0.6885   -1.2959           -52.34                 NA

Once you see the location of the slopes and intercepts, use these to calculate the predicted values. This is for the major axis regression:

intMA <- fit.MA$regression.results[2, 2]
slopeMA <- fit.MA$regression.results[2, 3]
rmfor$pred.bir.MA <- intMA + rmfor$birch * slopeMA
# OLS
intOLS <- fit.MA$regression.results[1, 2]
slopeOLS <- fit.MA$regression.results[1, 3]
rmfor$pred.bir.OLS <- intOLS + rmfor$birch * slopeOLS
# SMA
intSMA <- fit.MA$regression.results[3, 2]
slopeSMA <- fit.MA$regression.results[3, 3]
rmfor$pred.bir.SMA <- intSMA + rmfor$birch * slopeSMA

Plot the predicted values from the linear regression against birch using plot. Then add the predicted values for the major axis and reduced major axis regressions to the same plot using points. Compare with what you see when using the raw data.
All three model fits are plotted below, with fitted values for MA (major axis) in black, OLS (ordinary least squares) in red, and SMA (reduced major axis) in blue. Note that in this plot birch is on the y-axis and the predicted lime values are on the x-axis.

plot(rmfor$birch ~ rmfor$pred.bir.MA)
points(rmfor$birch ~ rmfor$pred.bir.OLS, col = "red")
points(rmfor$birch ~ rmfor$pred.bir.SMA, col = "blue")

[Plot: birch against the predicted values from the MA (black), OLS (red), and SMA (blue) fits]

What do you see?
The MA and SMA fits give quite similar predicted values, whereas the OLS fit appears steeper in this plot; all three lines intersect at roughly the same point (they all pass through the centroid of the data).
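To compare with the raw data, as the question suggests, the three lines can also be drawn directly over the original scatter of lime against birch using abline, reusing the intercepts and slopes extracted above (an optional sketch):

plot(lime ~ birch, data = rmfor)
abline(intOLS, slopeOLS, col = "red")
abline(intMA, slopeMA)                  # MA in black
abline(intSMA, slopeSMA, col = "blue")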

  5. If you feel like it, you may repeat the previous steps but letting birch be the dependent variable and lime the independent. You should find that the two types of Model II regressions give the result that b'=1/b and a'=-a/b, where b' and a' are the coefficients for the new model. This means that the regression line is the same regardless of which variable is x and y. For the Model I regression, however, you will not be able to predict the values a' and b' from a and b. If you want to predict lime from birch, you get different results than if you want to predict birch from lime.
# Not done.
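For completeness, a sketch of how that check could be done (not run here): refit with birch as the dependent variable and compare the MA and SMA rows against 1/b and -a/b from the original fit.

fit.MA.rev <- lmodel2(birch ~ lime, data = rmfor, nperm = 99)
fit.MA.rev$regression.results
# Reversed coefficients predicted from the original fit (the relation holds for the MA and SMA rows only)
data.frame(Method    = fit.MA$regression.results$Method,
           Intercept = -fit.MA$regression.results$Intercept / fit.MA$regression.results$Slope,
           Slope     = 1 / fit.MA$regression.results$Slope)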