Preliminaries

Loading and exploring the data

The data for this lecture is the same as for the previous lecture. It is included here in two R vectors to keep this document self-contained.

sbp = c(144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128, 130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175)

age = c(39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42, 48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69)

Later on it will be useful to have a data frame with the data, created as follows:

data = data.frame(x=age, y=sbp)

This allows us to use the names data$x and data$y to refer to the variables. Sometimes it is easier to think in terms of x and y, and sometimes it is better to think of the actual content of the variables using age and sbp.
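Before moving on, it can be useful to take a quick look at the data frame with the usual exploration functions (output not shown here):

# Quick exploration of the data frame: first rows, structure and numerical summaries.
head(data)
str(data)
summary(data)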

The correlation coefficient.

Let us first compute the covariance of x (age) and y (sbp) directly:

(n = nrow(data))
## [1] 30
(meanX = mean(data$x))
## [1] 45.13333
(meanY = mean(data$y))
## [1] 142.5333
(covXY = sum((data$x - meanX) * (data$y - meanY)) / (n - 1))
## [1] 227.0989

Alternatively, simply use the cov function:

(covXY = cov(data$x, data$y))
## [1] 227.0989

The standard deviations of x and y are:

(sdX = sd(data$x))
## [1] 15.2942
(sdY = sd(data$y))
## [1] 22.58125

and so the correlation coefficient is

(rXY = covXY/(sdX * sdY))
## [1] 0.6575673

Or we could simply use the cor function:

(cor(data$x, data$y))
## [1] 0.6575673

And finally, let us check that this is the same as \[\dfrac{s_x}{s_y}\,\hat\beta_1.\] For this we first fit the linear model using lm (as we learned last week) and extract the intercept and the slope coefficient:

(lmXY = lm(y ~ x, data))
## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Coefficients:
## (Intercept)            x  
##     98.7147       0.9709
(beta0 = lmXY$coefficients[1])
## (Intercept) 
##    98.71472
(beta1 = lmXY$coefficients[2])
##         x 
## 0.9708704

And now back to the check:

(sdX / sdY) * beta1
##         x 
## 0.6575673
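A quick way to see why this identity holds: the least squares slope is the covariance divided by the variance of \(x\), so \[\dfrac{s_x}{s_y}\,\hat\beta_1 = \dfrac{s_x}{s_y}\cdot\dfrac{s_{xy}}{s_x^2} = \dfrac{s_{xy}}{s_x\, s_y} = r_{xy}.\]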

Plotting the points in the four quadrants defined by \((\bar x, \bar y).\)

Just for fun, let us plot in a different color the points in each of the four quadrants defined by \(\bar x\) and \(\bar y\).

# Set the plot frame, but use "n" to avoid actually plotting the points.
plot(data$x, data$y, type="n", xlab="age", ylab="sbp")
# Horizontal and vertical lines through the means:
abline(h=meanY)
abline(v=meanX)
# Four data sets selecting the points in each quadrant.
# There are more elegant ways to do this, but this will do.
Q1 = data[(data$x - meanX > 0) & (data$y - meanY > 0), ]
Q2 = data[(data$x - meanX < 0) & (data$y - meanY > 0), ]
Q3 = data[(data$x - meanX < 0) & (data$y - meanY < 0), ]
Q4 = data[(data$x - meanX > 0) & (data$y - meanY < 0), ]
# And now we plot these data frames with different colors.
points(Q1$x, Q1$y, col="red", pch="+", cex=2)
points(Q2$x, Q2$y, col="blue", pch="+", cex=2)
points(Q3$x, Q3$y, col="magenta", pch="+", cex=2)
points(Q4$x, Q4$y, col="cyan", pch="+", cex=2)
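The points in the first and third quadrants contribute positive terms to the covariance (both factors have the same sign there), while the points in the second and fourth quadrants contribute negative terms. With a positive correlation such as ours, most points should therefore fall in Q1 and Q3. A quick count (output not shown here):

# Number of points in each quadrant; Q1 and Q3 should dominate.
c(Q1 = nrow(Q1), Q2 = nrow(Q2), Q3 = nrow(Q3), Q4 = nrow(Q4))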

Testing the null hypothesis \(H_0=\{\rho_{xy} = 0\}.\)

Let us compute the value of the statistic \(t = r_{xy}\sqrt{\dfrac{n-2}{1-r_{xy}^2}}\), which under the null hypothesis follows a \(t\) distribution with \(n - 2\) degrees of freedom:

(tStat = rXY * sqrt((n - 2) / (1 - rXY^2)))
## [1] 4.618447

This is the same value that we get from the linear model. Look at the second row of the t value column in the output of summary(lmXY):

summary(lmXY)
## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.724  -6.994  -0.520   2.931  75.654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  98.7147    10.0005   9.871 1.28e-10 ***
## x             0.9709     0.2102   4.618 7.87e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared:  0.4324, Adjusted R-squared:  0.4121 
## F-statistic: 21.33 on 1 and 28 DF,  p-value: 7.867e-05
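The corresponding two-sided p-value can also be computed by hand from the \(t\) distribution with \(n - 2\) degrees of freedom; it should reproduce the 7.87e-05 that appears in the Pr(>|t|) column for x (output not shown here):

# Two-sided p-value of the t statistic with n - 2 degrees of freedom.
2 * pt(abs(tStat), df = n - 2, lower.tail = FALSE)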

Testing the null hypothesis \(H_0=\{\rho_{xy} = 0.85\}.\)

For this test we will define a pair of functions to compute Fisher's z transformation and its inverse:

fishertrans = function(r){(1/2) * log((1 + r) / (1 - r))}
fisherInvTrans = function(z){(exp(2 * z) - 1) / (exp(2 * z) + 1)}
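These functions implement Fisher's transformation and its inverse, \[z = \dfrac{1}{2}\log\left(\dfrac{1+r}{1-r}\right), \qquad r = \dfrac{e^{2z}-1}{e^{2z}+1},\] and under the null hypothesis the statistic \[Z = \dfrac{z_{r} - z_{r_0}}{1/\sqrt{n-3}}\] is approximately standard normal.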

Now we can apply them to compute the value of the (approximately) normally distributed statistic (here it is called zsStat):

r0 = 0.85
fishertrans(rXY)
## [1] 0.7885156
(zsStat = (fishertrans(rXY) - fishertrans(r0)) / (1/sqrt(n - 3)))
## [1] -2.429914
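If we also want the two-sided p-value of this test, a minimal sketch (output not shown here):

# Two-sided p-value from the standard normal distribution.
2 * pnorm(abs(zsStat), lower.tail = FALSE)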

The 95% confidence interval for the Fisher-transformed correlation is then obtained as usual:

(confInt_Z = fishertrans(rXY) + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))
## [1] 0.4113203 1.1657108

And transforming the interval back we get:

fisherInvTrans(confInt_Z)
## [1] 0.3895932 0.8228923

I should mention that you can get the same values using the psych library. If you need to install it, just uncomment the first line and execute it:

#install.packages("psych", dependencies=TRUE)
library(psych)
fisherz(rXY)
## [1] 0.7885156
r.con(rXY, n, 0.95)
## [1] 0.3895932 0.8228923

Anova: analysis of variance table.

Last week we already saw how to get the Anova table for the regression model:

(AnovaXY = anova(lmXY))
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x          1 6394.0  6394.0   21.33 7.867e-05 ***
## Residuals 28 8393.4   299.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let us review how the components of this table are obtained. First the total sum of squares:

(SSTotal = (n - 1) * var(data$y))
## [1] 14787.47

The sum of squares corresponding to the model (the explained variance):

(SSModel = sum((lmXY$fitted.values - meanY)^2))
## [1] 6394.023

The residual sum of squares (the unexplained variance; I like to think of this as the random component of variance):

(SSresiduals = sum(lmXY$residuals^2))
## [1] 8393.444

The Anova identity is just the fact that the following two expressions give the same result:

SSTotal
## [1] 14787.47
SSresiduals + SSModel
## [1] 14787.47
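In formula form, the Anova identity says that \[\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2.\]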

The SSresiduals and SSModel constitute the Sum Sq column of the Anova table. The degrees of freedom in the first column are obtained as follows:

(k = ncol(data))
## [1] 2
(dfModel = k - 1)
## [1] 1
(dfResidual = n - 2)
## [1] 28

Here \(k\) is the number of columns of the data frame, that is, the response plus the predictor variables, so the model degrees of freedom are \(k - 1\), the number of predictors. In the simple regression model we are studying now this is just 1, but we will also be studying regression models with more than one predictor variable.

Now we can divide the sums of squares by the corresponding degrees of freedom to get the Mean Sq column:

(meanSSmodel = SSModel / dfModel)
## [1] 6394.023
(meanSSresidual = SSresiduals / (n - 2))
## [1] 299.7659

The quotient of the values in that column leads to the F-statistic:

(Fstatistic = meanSSmodel / meanSSresidual)
## [1] 21.33006

And from the \(F\)-statistic and the degrees of freedom we can compute a p-value:

(pValue = pf(Fstatistic, df1 = dfModel, df2 = dfResidual, lower.tail = FALSE))
## [1] 7.867263e-05

The value of the \(F\)-statistic is simply the square of the \(t\)-statistic:

(tStatistic = beta1 / sqrt(meanSSresidual / ((n-1) * var(data$x))))
##        x 
## 4.618447
tStatistic^2
##        x 
## 21.33006
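Finally, note that the Multiple R-squared reported by summary(lmXY) is just the fraction of the total sum of squares explained by the model, which in simple regression is also the square of the correlation coefficient. Both of the following should reproduce the 0.4324 shown above (output not shown here):

# R squared as the explained fraction of the total sum of squares...
SSModel / SSTotal
# ...and as the square of the correlation coefficient.
rXY^2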

Thanks for your attention!