setwd("F:/Documents/Work/Courses/Stats_LUND_2012_October/Exercises/Ex_05")
Part A
1. Read in the file “ANCEX.csv”. It has two variables: xvar and yvar. Explore their distributions. Do they look normally distributed? Test it!
ancex <- read.csv("ANCEX.csv", header = TRUE, sep = ";")
names(ancex)
## [1] "XVAR" "YVAR"
str(ancex)
## 'data.frame': 30 obs. of 2 variables:
## $ XVAR: num 2.14 2.26 19.96 15.01 2.61 ...
## $ YVAR: num 16.8 15.7 69.3 54.7 15.1 ...
Neither of the two variables looks normally distributed:
X variable
hist(ancex$XVAR)
This may be Poisson distributed.
Y variable
hist(ancex$YVAR)
This would perhaps benefit from a log transformation, given the right-hand skew.
We are asked to test for normality. We will use the Shapiro-Wilk test:
shapiro.test(ancex$XVAR)
##
## Shapiro-Wilk normality test
##
## data: ancex$XVAR
## W = 0.8287, p-value = 0.0002295
shapiro.test(ancex$YVAR)
##
## Shapiro-Wilk normality test
##
## data: ancex$YVAR
## W = 0.8794, p-value = 0.002732
As expected, both are highly significantly different from a normal distribution.
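As an aside (not required by the exercise), we can check whether the log transformation suggested above helps YVAR:
# Aside: re-test YVAR after a log transformation
hist(log(ancex$YVAR))
shapiro.test(log(ancex$YVAR))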
# Attach required package 'Hmisc' for Pearson correlation test
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 2.14.2
## Loading required package: survival
## Loading required package: splines
## Hmisc library by Frank E Harrell Jr
##
## Type library(help='Hmisc'), ?Overview, or ?Hmisc.Overview') to see overall
## documentation.
##
## NOTE:Hmisc no longer redefines [.factor to drop unused levels when
## subsetting. To get the old behavior of Hmisc type dropUnusedLevels().
## Attaching package: 'Hmisc'
## The following object(s) are masked from 'package:survival':
##
## untangle.specials
## The following object(s) are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
rcorr(ancex$XVAR, ancex$YVAR, type = "pearson")
## x y
## x 1.00 0.93
## y 0.93 1.00
##
## n= 30
##
##
## P
## x y
## x 0
## y 0
The test suggests that X and Y are correlated, with a coefficient of 0.93, and this is significant with a very low P-value, reported here as '0', i.e. zero to several decimal places.
However, as the question points out, the assumptions of the Pearson correlation are not fulfilled: the data are not normally distributed, and may also not show a linear relationship (we do not know this yet).
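As a side note (not part of the exercise), a rank-based correlation such as Spearman's does not assume normality and could be used instead:
# Aside: Spearman rank correlation does not assume normality
cor.test(ancex$XVAR, ancex$YVAR, method = "spearman")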
plot(ancex$XVAR, ancex$YVAR)
Given the high correlation value, the plot does look largely as expected, and shows a near linear relationship.
fit <- lm(ancex$YVAR ~ ancex$XVAR)
summary(fit)
##
## Call:
## lm(formula = ancex$YVAR ~ ancex$XVAR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.229 -3.292 -0.504 4.211 10.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.287 1.707 6.03 1.7e-06 ***
## ancex$XVAR 2.905 0.215 13.49 9.0e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.88 on 28 degrees of freedom
## Multiple R-squared: 0.867, Adjusted R-squared: 0.862
## F-statistic: 182 on 1 and 28 DF, p-value: 9.05e-14
View this on the scatterplot:
plot(ancex$XVAR, ancex$YVAR)
abline(fit)
What is the P-value of the regression? What is H0?
The P-value of the regression is 9.05e-14, i.e. far below 0.001.
H0 is that the regression slope is zero, i.e. there is no relationship between the variables: knowing the value of X tells you nothing about the value of Y.
The general equation: y = mx + c
Here: y = 2.91 * X + 10.29
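To illustrate, the fitted equation can be used to predict Y for any X; taking X = 10 as an arbitrary example:
# Example prediction from the fitted coefficients (X = 10 chosen arbitrarily)
coef(fit)[1] + coef(fit)[2] * 10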
confint(fit, parm = "ancex$XVAR", level = 0.95)
## 2.5 % 97.5 %
## ancex$XVAR 2.464 3.346
confint(fit, parm = "(Intercept)", level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 6.792 13.78
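For reference, the slope interval can be reproduced by hand as estimate +/- t(0.975, df = 28) * standard error, using the values from the coefficient table:
# Slope CI by hand: estimate +/- t quantile * standard error
2.905 + c(-1, 1) * qt(0.975, df = 28) * 0.215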
fit.aov <- aov((ancex$YVAR ~ ancex$XVAR))
fit.aov
## Call:
## aov(formula = (ancex$YVAR ~ ancex$XVAR))
##
## Terms:
## ancex$XVAR Residuals
## Sum of Squares 6292 969
## Deg. of Freedom 1 28
##
## Residual standard error: 5.882
## Estimated effects may be unbalanced
summary(fit.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## ancex$XVAR 1 6292 6292 182 9e-14 ***
## Residuals 28 969 35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit)
##
## Call:
## lm(formula = ancex$YVAR ~ ancex$XVAR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.229 -3.292 -0.504 4.211 10.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.287 1.707 6.03 1.7e-06 ***
## ancex$XVAR 2.905 0.215 13.49 9.0e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.88 on 28 degrees of freedom
## Multiple R-squared: 0.867, Adjusted R-squared: 0.862
## F-statistic: 182 on 1 and 28 DF, p-value: 9.05e-14
F value: 181.9
t value: 13.486
square of t-value:
13.486 * 13.486
## [1] 181.9
This is equal to the F value. We conclude that the two statistics are equivalent: the t-test on the slope and the F-test of the regression are testing the same thing.
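The same comparison can be pulled straight from the model object (a sketch using the components returned by summary.lm):
# Extract the slope t value and the overall F statistic from summary(fit)
tval <- summary(fit)$coefficients["ancex$XVAR", "t value"]
fval <- summary(fit)$fstatistic["value"]
tval^2
fval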
# Already done - see above.
ancex$resid <- residuals(fit)
hist(ancex$resid)
shapiro.test(ancex$resid)
##
## Shapiro-Wilk normality test
##
## data: ancex$resid
## W = 0.9766, p-value = 0.7287
Yes, the residuals appear normally distributed (Shapiro-Wilk P = 0.73, so we cannot reject normality).
plot(ancex$resid ~ ancex$XVAR)
Apparently OK: there is no obvious trend in the residuals against XVAR.
Another assumption is that the variance of the residuals is independent of the independent variable XVAR. This can also be visually checked in the scatterplot you just made. What do you conclude?
It does not appear to be: the variance looks greater for lower x-values than for higher x-values.
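R's built-in regression diagnostics give another way to look at this, for example the residuals-vs-fitted and scale-location plots:
# Standard lm diagnostic plots
plot(fit, which = 1)  # residuals vs fitted values
plot(fit, which = 3)  # scale-location (spread of residuals)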
Make a new regression, but now choose XVAR as the dependent and YVAR as the independent. What is the P-value? What is the regression slope?
mod2 <- lm(ancex$XVAR ~ ancex$YVAR)
summary(mod2)
##
## Call:
## lm(formula = ancex$XVAR ~ ancex$YVAR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.541 -1.037 0.005 0.924 4.149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.2471 0.7119 -3.16 0.0038 **
## ancex$YVAR 0.2983 0.0221 13.49 9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.88 on 28 degrees of freedom
## Multiple R-squared: 0.867, Adjusted R-squared: 0.862
## F-statistic: 182 on 1 and 28 DF, p-value: 9.05e-14
The P-value is again 9.05e-14, i.e. <0.001. The slope is 0.298.
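Note that the two slopes are not simply reciprocals of each other (1/2.905 is about 0.344, not 0.298); a quick check shows that the product of the two OLS slopes equals the squared correlation:
# The product of the two OLS slopes equals r-squared
coef(fit)[2] * coef(mod2)[2]
cor(ancex$XVAR, ancex$YVAR)^2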
Part B
rmfor <- read.csv("RMfor.csv", header = TRUE, sep = ";")
str(rmfor)
## 'data.frame': 27 obs. of 7 variables:
## $ no : int 0 0 0 0 7 7 7 10 10 10 ...
## $ yrtype: int 0 1 10 11 0 10 11 0 10 11 ...
## $ gud : num 0.24 1.15 1.62 2.1 2.08 0.46 1.34 1.31 0.57 0.44 ...
## $ oak : num 0.1339 0.0244 0.1063 0.1348 0.0057 ...
## $ birch : num 0.431 0.295 0.553 0.526 0 ...
## $ alder : num 0.0183 0.2699 0.081 0.2772 0 ...
## $ lime : num 0.0737 0.3988 0.1437 0 0.9209 ...
rmfor$yrtype <- as.factor(rmfor$yrtype)
First make a regression of lime against birch. Then save the fitted model values.
fit.rmfor <- lm(rmfor$lime ~ rmfor$birch)
summary(fit.rmfor)
##
## Call:
## lm(formula = rmfor$lime ~ rmfor$birch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5492 -0.1170 -0.0225 0.0985 0.3957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5585 0.0715 7.81 3.6e-08 ***
## rmfor$birch -0.8635 0.1933 -4.47 0.00015 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.216 on 25 degrees of freedom
## Multiple R-squared: 0.444, Adjusted R-squared: 0.422
## F-statistic: 20 on 1 and 25 DF, p-value: 0.000148
rmfor$pred.bir <- fitted(fit.rmfor)
# lmodel2(formula, data = NULL, range.y=NULL, range.x=NULL, nperm=0)
The formula is written in the same way as for the lm command. Leave range.y and range.x as NULL (the default), since these are designed for a different method (ranged major axis regression, which we will not discuss here).
# install.packages('lmodel2')
library(lmodel2)
## Warning: package 'lmodel2' was built under R version 2.14.2
fit.MA <- lmodel2(lime ~ birch, data = rmfor, nperm = 99)
## RMA was not requested: it will not be computed.
# Note that for lmodel2 you don't need to ask for a summary of the model fit.
fit.MA
##
## Model II regression
##
## Call: lmodel2(formula = lime ~ birch, data = rmfor, nperm = 99)
##
## n = 27 r = -0.6663 r-square = 0.444
## Parametric P-values: 2-tailed = 0.0001479 1-tailed = 7.394e-05
## Angle between the two OLS regression lines = 21.98 degrees
##
## Permutation tests of OLS, MA, RMA slopes: 1-tailed, tail corresponding to sign
## A permutation test of r is equivalent to a permutation test of the OLS slope
## P-perm for SMA = NA because the SMA slope cannot be tested
##
## Regression results
## Method Intercept Slope Angle (degrees) P-perm (1-tailed)
## 1 OLS 0.5585 -0.8635 -40.81 0.01
## 2 MA 0.7403 -1.4680 -55.74 0.01
## 3 SMA 0.6885 -1.2959 -52.34 NA
##
## Confidence intervals
## Method 2.5%-Intercept 97.5%-Intercept 2.5%-Slope 97.5%-Slope
## 1 OLS 0.4113 0.7057 -1.262 -0.4655
## 2 MA 0.5827 1.0477 -2.490 -0.9441
## 3 SMA 0.5868 0.8262 -1.754 -0.9576
##
## Eigenvalues: 0.1091 0.01981
##
## H statistic used for computing C.I. of MA: 0.046
Plot the model fit:
plot(fit.MA, method = "MA")
In the lmodel2 output you will see values for OLS, MA, and SMA. OLS = ordinary least squares, MA = major axis, and SMA = standard major axis (often called reduced major axis). You can see how the calculated slope and intercept differ between the three methods.
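The other fits can be displayed in the same way by changing the method argument of the plot call, e.g.:
# Plot the OLS and SMA fits from the same lmodel2 object
plot(fit.MA, method = "OLS")
plot(fit.MA, method = "SMA")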
Make an overlay scatter plot with lime and all three predicted values against birch. You will need to calculate them yourself since the fitted command doesn't work with lmodel2. All you have to do is use the intercepts and slopes from the lmodel2 output. You can generate a table with the regression results using:
fit.MA$regression.results
## Method Intercept Slope Angle (degrees) P-perm (1-tailed)
## 1 OLS 0.5585 -0.8635 -40.81 0.01
## 2 MA 0.7403 -1.4680 -55.74 0.01
## 3 SMA 0.6885 -1.2959 -52.34 NA
Once you see the location of the slopes and intercepts, use these to calculate the predicted values. This is for the major axis regression:
intMA <- fit.MA$regression.results[2, 2]
slopeMA <- fit.MA$regression.results[2, 3]
rmfor$pred.bir.MA <- intMA + rmfor$birch * slopeMA
# OLS
intOLS <- fit.MA$regression.results[1, 2]
slopeOLS <- fit.MA$regression.results[1, 3]
rmfor$pred.bir.OLS <- intOLS + rmfor$birch * slopeOLS
# SMA
intSMA <- fit.MA$regression.results[3, 2]
slopeSMA <- fit.MA$regression.results[3, 3]
rmfor$pred.bir.SMA <- intSMA + rmfor$birch * slopeSMA
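The same coefficients can also be picked out by method name rather than row number, which is a little more robust if the table layout ever changes (a sketch):
# Extract intercept and slope by method name from the results table
reg <- fit.MA$regression.results
reg[reg$Method == "SMA", c("Intercept", "Slope")]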
Plot the predicted values from the linear regression against birch using plot. Then add the predicted values for the major axis and reduced major axis regressions to the same plot using points. Compare with what you see when using the raw data.
All three model fits are plotted, with fitted values for MA (major axis) in black, OLS (ordinary least squares) in red, and SMA (standard/reduced major axis) in blue.
plot(rmfor$birch ~ rmfor$pred.bir.MA)
points(rmfor$birch ~ rmfor$pred.bir.OLS, col = "red")
points(rmfor$birch ~ rmfor$pred.bir.SMA, col = "blue")
What do you see?
The MA and SMA models give quite similar outcomes, whereas the OLS fit leads to a steeper slope in this plot; all three lines intersect at the same point (the centroid of the data).
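An alternative view (an aside) is to draw the three fitted lines directly on the raw lime vs birch scatter with abline, using the intercepts and slopes extracted above:
# Overlay the three fitted lines on the raw data
plot(lime ~ birch, data = rmfor)
abline(intOLS, slopeOLS, col = "red")   # OLS
abline(intMA, slopeMA)                  # MA (black)
abline(intSMA, slopeSMA, col = "blue")  # SMA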
# Not done.