The data for this lecture is the same as for the previous lecture. It is included here in two R vectors to keep this document self-contained.
sbp = c(144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128, 130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175)
age = c(39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42, 48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69)
Later on it will be useful to have a data frame with the data, created as follows:
data = data.frame(x=age, y=sbp)
This allows us to use the names data$x and data$y to refer to the variables. Sometimes it is easier to think in terms of x and y; at other times it is better to think of the actual content of the variables using age and sbp.
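As a quick sanity check (a small addition to the original notes), we can verify that the renamed columns contain exactly the same values as the original vectors:
# Should return TRUE: the columns are just the original vectors.
all(data$x == age) & all(data$y == sbp)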
Let us first compute the covariance of x (age) and y (sbp) directly, using the definition \[s_{xy} = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)\] We begin with the sample size and the two means:
(n = nrow(data))
## [1] 30
(meanX = mean(data$x))
## [1] 45.13333
(meanY = mean(data$y))
## [1] 142.5333
(covXY = sum((data$x - meanX) * (data$y - meanY)) / (n - 1))
## [1] 227.0989
Alternatively, simply use the cov function:
(covXY = cov(data$x, data$y))
## [1] 227.0989
The standard deviations of x and y are:
(sdX = sd(data$x))
## [1] 15.2942
(sdY = sd(data$y))
## [1] 22.58125
and so the correlation coefficient is \[r_{xy} = \dfrac{s_{xy}}{s_x\, s_y}\]
(rXY = covXY/(sdX * sdY))
## [1] 0.6575673
Or we could simply use the cor function:
(cor(data$x, data$y))
## [1] 0.6575673
And finally let us check that this is the same as: \[\dfrac{s_x}{s_y}\hat\beta_1\] For this we first construct the linear model using lm (as we learned last week) and extract the intercept and the slope coefficient:
(lmXY = lm(y ~ x, data))
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Coefficients:
## (Intercept) x
## 98.7147 0.9709
(beta0 = lmXY$coefficients[1])
## (Intercept)
## 98.71472
(beta1 = lmXY$coefficients[2])
## x
## 0.9708704
And now back to the check:
(sdX / sdY) * beta1
## x
## 0.6575673
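Equivalently, rearranging the same identity gives the slope in terms of the correlation. This quick check is a small addition to the original notes:
# Should reproduce beta1 = 0.9708704 (see above).
rXY * (sdY / sdX)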
Just for fun, let us plot the points in each of the four quadrants defined by \(\bar x\) and \(\bar y\) in a different color.
# Set the plot frame, but use "n" to avoid actually plotting the points.
plot(data$x, data$y, type="n", xlab="age", ylab="sbp")
# Horizontal and vertical lines through the means:
abline(h=meanY)
abline(v=meanX)
# Four data sets selecting the points in each quadrant.
# There are more elegant ways to do this, but this will do.
Q1 = data[(data$x - meanX > 0) & (data$y - meanY > 0), ]
Q2 = data[(data$x - meanX < 0) & (data$y - meanY > 0), ]
Q3 = data[(data$x - meanX < 0) & (data$y - meanY < 0), ]
Q4 = data[(data$x - meanX > 0) & (data$y - meanY < 0), ]
# And now we plot these data frames with different colors.
points(Q1$x, Q1$y, col="red", pch="+", cex=2)
points(Q2$x, Q2$y, col="blue", pch="+", cex=2)
points(Q3$x, Q3$y, col="magenta", pch="+", cex=2)
points(Q4$x, Q4$y, col="cyan", pch="+", cex=2)
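To connect the plot with the covariance, note that the cross-products \((x_i - \bar x)(y_i - \bar y)\) are positive in the red and magenta quadrants and negative in the blue and cyan ones. The following check is a small addition, not part of the original lecture:
# Cross-products of the deviations from the means:
crossProd = (data$x - meanX) * (data$y - meanY)
# Count the points with negative (-1) and positive (+1) cross-products:
table(sign(crossProd))
# Averaging the cross-products reproduces covXY = 227.0989 (see above).
sum(crossProd) / (n - 1)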
Let us compute the value of the \(t\)-statistic for testing the null hypothesis \(H_0\colon \rho = 0\): \[t = r_{xy}\sqrt{\dfrac{n-2}{1-r_{xy}^2}}\]
(tStat = rXY * sqrt((n - 2) / (1 - rXY^2)))
## [1] 4.618447
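Under \(H_0\colon \rho = 0\) this statistic follows a \(t\) distribution with \(n - 2\) degrees of freedom, so we can also compute the two-sided p-value directly. This step is a small addition; compare the result with the Pr(>|t|) value 7.87e-05 for x in the summary below:
# Two-sided p-value for the t-statistic with n - 2 degrees of freedom.
2 * pt(abs(tStat), df = n - 2, lower.tail = FALSE)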
This is the same value that we get from the linear model. Look at the second row of the t value column in the output of summary(lmXY):
summary(lmXY)
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.724 -6.994 -0.520 2.931 75.654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.7147 10.0005 9.871 1.28e-10 ***
## x 0.9709 0.2102 4.618 7.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared: 0.4324, Adjusted R-squared: 0.4121
## F-statistic: 21.33 on 1 and 28 DF, p-value: 7.867e-05
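Rather than reading the value off the printed table, we can also extract it programmatically from the coefficient matrix that summary.lm returns; this snippet is a small addition:
# Should reproduce tStat = 4.618447.
summary(lmXY)$coefficients["x", "t value"]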
For this test we will define a pair of functions to compute Fisher's z transformation \[z = \dfrac{1}{2}\log\left(\dfrac{1+r}{1-r}\right)\] and its inverse:
fishertrans = function(r){(1/2) * log((1 + r) / (1 - r))}
fisherInvTrans = function(z){(exp(2 * z) - 1) / (exp(2 * z) + 1)}
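A quick check, added here for completeness, that the two functions really are mutual inverses:
# Should return rXY = 0.6575673.
fisherInvTrans(fishertrans(rXY))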
Now we can apply this to compute the value of the normally distributed statistic (here it is called zsStat) for testing the null hypothesis \(H_0\colon \rho = r_0\) with \(r_0 = 0.85\):
r0 = 0.85
fishertrans(rXY)
## [1] 0.7885156
(zsStat = (fishertrans(rXY) - fishertrans(r0)) / (1/sqrt(n - 3)))
## [1] -2.429914
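Since zsStat is approximately standard normal under the null hypothesis, a two-sided p-value can be obtained with pnorm. This step is a small addition to the original notes:
# Two-sided p-value from the standard normal distribution.
2 * pnorm(-abs(zsStat))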
The confidence interval is then obtained as usual:
(confInt_Z = fishertrans(rXY) + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))
## [1] 0.4113203 1.1657108
And transforming the interval back we get:
fisherInvTrans(confInt_Z)
## [1] 0.3895932 0.8228923
I should mention that you can get the same values using the psych library. If you need to install it, just uncomment the first line and execute it:
#install.packages("psych", dependencies=TRUE)
library(psych)
fisherz(rXY)
## [1] 0.7885156
r.con(rXY, n, 0.95)
## [1] 0.3895932 0.8228923
Last week we already saw how to get the Anova table for the regression model:
(AnovaXY = anova(lmXY))
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 6394.0 6394.0 21.33 7.867e-05 ***
## Residuals 28 8393.4 299.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let us review how the components of this table are obtained. First the total sum of squares:
(SSTotal = (n - 1) * var(data$y))
## [1] 14787.47
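This is just the sum of the squared deviations of y from its mean, since \((n-1)\,s_y^2 = \sum_{i=1}^{n}(y_i - \bar y)^2\); the equivalent computation below is a small addition:
# Should reproduce SSTotal = 14787.47.
sum((data$y - meanY)^2)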
The sum of squares corresponding to the model (the explained variance):
(SSModel = sum((lmXY$fitted.values - meanY)^2))
## [1] 6394.023
The residual sum of squares (the unexplained variance; I like to think of this as the random component of variance):
(SSresiduals = sum(lmXY$residuals^2))
## [1] 8393.444
The Anova identity is just the fact that the following two expressions give the same result:
SSTotal
## [1] 14787.47
SSresiduals + SSModel
## [1] 14787.47
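As an aside (a small addition), the fraction of the total variation explained by the model is exactly the Multiple R-squared of 0.4324 reported by summary(lmXY) above, and in simple regression it also equals the square of the correlation coefficient:
# Both of these should give the Multiple R-squared 0.4324.
SSModel / SSTotal
rXY^2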
The SSresiduals and SSModel constitute the Sum Sq column of the Anova table. The degrees of freedom in the first column are obtained as follows:
(k = ncol(data))
## [1] 2
(dfModel = k - 1)
## [1] 1
(dfResidual = n - 2)
## [1] 28
Here \(k\) is the number of columns in the data frame (the response plus the predictors), so the model degrees of freedom \(k - 1\) equal the number of predictor variables. In the simple regression model we are studying now that is just 1, but we will also be studying regression models with more than one predictor variable.
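Equivalently (a small addition), the model degrees of freedom can be counted as the number of estimated coefficients excluding the intercept:
# Should reproduce dfModel = 1.
length(coef(lmXY)) - 1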
Now we can divide the sums of squares by the corresponding degrees of freedom to get the Mean Sq column:
(meanSSmodel = SSModel / dfModel)
## [1] 6394.023
(meanSSresidual = SSresiduals / dfResidual)
## [1] 299.7659
The quotient of the values in that column leads to the F-statistic:
(Fstatistic = meanSSmodel / meanSSresidual)
## [1] 21.33006
And from the \(F\)-statistic and the degrees of freedom we can compute a p-value:
(pValue = pf(Fstatistic, df1 = dfModel, df2 = dfResidual, lower.tail = FALSE))
## [1] 7.867263e-05
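We can check these manual computations against the stored Anova table directly; this comparison is a small addition (AnovaXY behaves like a data frame, so its columns can be accessed by name):
# Should reproduce Fstatistic = 21.33006 and pValue = 7.867263e-05.
AnovaXY[["F value"]][1]
AnovaXY[["Pr(>F)"]][1]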
The value of the \(F\)-statistic is simply the square of the \(t\)-statistic for the slope, that is, \(\hat\beta_1\) divided by its standard error:
(tStatistic = beta1 / sqrt(meanSSresidual / ((n-1) * var(data$x))))
## x
## 4.618447
tStatistic^2
## x
## 21.33006
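Note that the denominator in the formula for tStatistic is exactly the standard error 0.2102 reported for x by summary(lmXY) above, since \((n-1)\,s_x^2 = \sum_{i=1}^{n}(x_i - \bar x)^2\). Checking this is a small addition:
# Should reproduce the Std. Error 0.2102 from the summary above.
sqrt(meanSSresidual / ((n - 1) * var(data$x)))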
Thanks for your attention!