Applied Regression Analysis. Week01

Preliminaries
Loading and exploring the data
Plotting the data
Obtaining the regression line.
Adding the regression line to the scatter plot.
Analysis of the regression model on page 12 of the lecture.
Estimating $\sigma^2$.
Testing the null hypothesis $H_0=\{\beta_1 = 0\}$.
Testing the null hypothesis $H_0=\{\beta_0 = 75\}$.
Confidence interval for $\mu_{y|x}$ for 65 year old individuals.
Prediction interval for age 65 individuals.
Plotting confidence and prediction bands.

Preliminaries

These are my notes for the lectures of the Coursera course “Applied Regression Analysis” by Professor Stanley Lemeshow. The goal of these notes is to provide the R code to obtain the same results as the Stata code in the lectures.
I use (and strongly recommend) RStudio to do all my work in R. I wrote these notes with an intermediate R user in mind (that means, non-absolute newbie but non-expert either). There are now hundreds of ways of learning R. I recommend the course R Programming by R. Peng, J. Leek and B. Caffo (the course is part of the Data Science Specialization, which I also recommend if you are willing to learn a lot about data science using R). Also, the Open Intro Statistics R labs can be useful. I have also often found the following websites very useful:
The notes are written in Rmarkdown, which uses knitr to process the embedded R code and insert the results into the resulting document. The R code appears in darker grey boxes, while the output appears below in white boxes (the output appears here prefixed with ##, but that’s not really part of R output; see the Rmarkdown and knitr references for details). The important thing is that you can copy all the code from the grey boxes, paste it into R and execute it directly.
Two R-quirks of mine: I use = for assignments in R, against the recomended assignment operator <-. And as you may know, R does not print out the result of assignments by default. But you can get the result printed out if you surround the whole assignment statement with parenthesis. You will see that I use this extensively, except for cases where the output is irrelevant or too large to display conveniently.

Loading and exploring the data

The data for the first lecture appears in a table on page 2 of the pdf for the lecture. Thus you could cut and paste it to obtain a csv file. However the size of the data set is small and so I will just paste it here into two R vectors to keep this document self-contained.

sbp = c(144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150, 140, 110, 128, 130, 135, 114, 116, 124, 136, 142, 120, 120, 160, 158, 144, 130, 125, 175)

age = c(39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42, 48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69)

Later on it will be useful to have a data frame with the data, created as follows:

data = data.frame(x=age, y=sbp)

This allows us to use the names data$x and data$y to refer to the variables. Sometimes is easier to think in terms of x-y, sometimes is better to think of the actual content of the variables using age and sbp.

Data frames in R can be thought of as tables. Here each column corresponds to a variable, and each row to an observation. The first few rows of the data frame can be seen with:

head(data)

##    x   y
## 1 39 144
## 2 47 220
## 3 45 138
## 4 47 145
## 5 65 162
## 6 46 142

And if you want to obtain a csv with this data you just need to uncomment (remove the #) and execute the following line of code. The file will be created at your R working directory, that you can locate executing the getwd function. Remeber that you can also set the working directory with setwd.

# write.table(data, "sbp_age.csv", sep = ",", row.names = FALSE)

Let’s begin by exporing the data. The summary and str (think structure) functions are helpful when starting to explore a new data set:

summary(data)

##        x               y        
##  Min.   :17.00   Min.   :110.0  
##  1st Qu.:36.75   1st Qu.:125.8  
##  Median :45.50   Median :141.0  
##  Mean   :45.13   Mean   :142.5  
##  3rd Qu.:56.00   3rd Qu.:157.0  
##  Max.   :69.00   Max.   :220.0

str(data)

## 'data.frame':    30 obs. of  2 variables:
##  $ x: num  39 47 45 47 65 46 67 42 67 56 ...
##  $ y: num  144 220 138 145 162 142 170 124 158 154 ...

Plotting the data

Let us do a basic plot of the data. With plot we can get a scatter plot of the sbp variable (vertical axis, thus second argument of plot) vs the age variable (horizontal axis, thus first argument of plot). The xlab and ylab options allow us to set labels for the plot axis:

plot(data$x, data$y, xlab="age", ylab="sbp")

This is a really basic plot, but it is enough for our purposes. In case you want fancier plots, check the resources mentioned above. I would suggest taking a look at the ggplot package.

Obtaining the regression line.

We are going to obtain this line in two different ways. First we will benefit from the fact that R is a programming language, and we will directly compute the expressions on pages 6 and 7 of the lecture pdf. After that we will use the function lm to get the results in a single step.

Let’s begin by computing $\beta_1$ (see page 6 of the pdf). A possibility is to do:

(beta1 = cov(data$x, data$y)/var(data$x))

## [1] 0.9708704

But we can also proceed slowly, computing the products with the vector arithmetic of R. We also define n as the length of the data vectors (in this case I define it as the number of rows in the data frame):

(n = nrow(data))

## [1] 30

(sumXY = sum(data$x*data$y))

## [1] 199576

(sumX = sum(data$x))

## [1] 1354

(sumY = sum(data$y))

## [1] 4276

(sumX2 = sum(data$x^2))

## [1] 67894

and now:

(beta1 = (sumXY - (sumX * sumY) / n) / (sumX2 - (sumX^2 / n)))

## [1] 0.9708704

The result is the same as before, as expected. To compute $\beta_0$ we will first obtain the means of the data vectors and store them in two variables for later reference:

(meanX = mean(data$x))

## [1] 45.13333

(meanY = mean(data$y))

## [1] 142.5333

Now:

(beta0 = meanY - beta1 * meanX)

## [1] 98.71472

For the second version we just need to apply the function lm to the R formula

y ~ x

that describes that we want to obtain a linear model with y as response (dependent variable) and x as predictor (independent variable). It will be convenient later to give the resulting model a name, so I will use lmXY for this:

(lmXY = lm(y ~ x, data))

## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Coefficients:
## (Intercept)            x  
##     98.7147       0.9709

The results coincide with the previous direct computation. In case we want to access the coefficients individually after using lm we can do so as follows:

(beta0 = lmXY$coefficients[1])

## (Intercept) 
##    98.71472

(beta1 = lmXY$coefficients[2])

##         x 
## 0.9708704

It is not really relevant for what follows, but note that when you use lm the coefficients become named quantities in R.

Adding the regression line to the scatter plot.

This is straightforward with the abline function (ab stands for the coefficients in a line equation such as $y = a + bx$). First I recreate the scatter plot, then I add the line:

plot(data$x, data$y, xlab = "age", ylab = "sbp")
abline(beta0, beta1)

Analysis of the regression model on page 12 of the lecture.

To obtain the inforrmation that appears on page 12 we apply summary to the model object lmXY that we created before:

summary(lmXY)

## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.724  -6.994  -0.520   2.931  75.654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  98.7147    10.0005   9.871 1.28e-10 ***
## x             0.9709     0.2102   4.618 7.87e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.31 on 28 degrees of freedom
## Multiple R-squared:  0.4324, Adjusted R-squared:  0.4121 
## F-statistic: 21.33 on 1 and 28 DF,  p-value: 7.867e-05

This gives you most of the values that appear in the lower table of that page. Some of them perhaps don’t make sense yet for you, we will learn more about them as the course moves forward. However it may be good to use this to create a short of dictionary between the results of R and those of Stata:

The following values appear on the top right of the pdf page:

The F-statistic $21.33$, in Stata F( 1, 28).
The corresponding p-value $7.867\times 10^{-5}$, in Stata prob > F.
The (multiple) R-squared, $0.4324$.
The adjusted R-squared, $0.4121$.
The residual standard error $17.31$, that Stata shows as Root MSE.

To get the values that appear in the first table on page 12 of the pdf (the one with columns labeled SS, df, MS) we use:

(AnovaXY = anova(lmXY))

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x          1 6394.0  6394.0   21.33 7.867e-05 ***
## Residuals 28 8393.4   299.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Naming the result has some important advantages. For example, the components of this table can be accessed using the $ notation in R (if you don’t know the names of the components in RStudio you can simply type AnovaXY$ in the prompt and then press Tab; the list of components will appear). For example the first column of the table is accessed as:

AnovaXY$`Sum Sq`

## [1] 6394.023 8393.444

Note that we use quotes because the name of this column contains spaces. Now if you need the sum of this column (that the Stata table shows under Total/SS) simply type:

sum(AnovaXY$`Sum Sq`)

## [1] 14787.47

The remaining components of the Anova table can be accessed similarly. I should mention that you can also use this method to access the components of the model lmXY and (interestingly) of the summary of the model (you will have to provide a variable name for the output of summary(lmXY) to do this).

You may have noticed that the Stata table also contains confidence intervals for $\beta_0$ and $\beta_1$. We will compute these intervals in a different way below, but for now we can do with this command:

confint(lmXY, level=0.95)

##                  2.5 %     97.5 %
## (Intercept) 78.2296890 119.199747
## x            0.5402629   1.401478

The default confidence level is in fact 0.95, but I have included it explicitly so you can change it easily if you need to.

Estimating $\sigma^2$.

I will use the variable s2yx to compute $S^2_{y|x}$. The sample variances of $x$ and $y$ can be obtained with (alternatively you may prefer to use the formulas on page 15):

(varX = var(data$x))

## [1] 233.9126

(varY = var(data$y))

## [1] 509.9126

and so, using the last formula on page 14:

(s2yx = ((n - 1) / (n - 2)) * (varY - beta1^2 * varX))

##        x 
## 299.7659

You may have noticed that this value was already obtained in the second column of the R Anova table (see also the column labeled MSin the Stata output). Thus an alternative way to get the value is:

(s2yx = AnovaXY$`Mean Sq`[2])

## [1] 299.7659

The selection with brackets [2] is needed because the desired value is the second element of the column.

The standard error of the estimate is then:

sqrt(s2yx)

## [1] 17.31375

Note that this was also previously obtained as the residual standard error.

Testing the null hypothesis $H_0=\{\beta_1 = 0\}$.

We will use the formula (in green) on page 19 of the lecture to compute the value of a $t$ statistic, expressed as follows:

\[ t = \dfrac{\hat\beta_1 - \beta_1^{(0)}}{\sqrt{\dfrac{s^2_{y|x}}{(n - 1)s^2_x}}} \] We already have all the ingredients for this formula, and so the value of the $t$ statistic is:

(tStstc = beta1 / sqrt(s2yx / ((n-1) * varX)))

##        x 
## 4.618447

And now we obtain the p-value as follows:

(pValue = 2 * pt(abs(tStstc), df = n - 2, lower.tail = FALSE))

##            x 
## 7.867263e-05

Notice that we take the absolute value and multiply by 2 because this is a two-tail contrast. And recall that we have already seen this value in the analysis of the model (although there it was expressed in terms of the F distribution). If you’d rather use values of $t$ for the contrast instead of p-values, then you can find the value $t_{.975}(28)$ (the quantile of $t$ that serves as a cut point for a two tailed $t$ contrast at 95% confidence level) with qt as follows:

qt(0.975, df = n - 2)

## [1] 2.048407

Testing the null hypothesis $H_0=\{\beta_0 = 75\}$.

In this case the value of the $t$ statistic is:

(tStstc = (beta0 - 75) / sqrt(s2yx *  (1 / n + meanX^2 / ((n -1 ) * varX))))

## (Intercept) 
##    2.371361

(do not pay attention to the name Intercept, it is simply inherited from beta0).

The corresponding p-value is:

(pValue = 2 * pt(abs(tStstc), df = n - 2, lower.tail = FALSE))

## (Intercept) 
##  0.02483759

Confidence interval for $\mu_{y|x}$ for 65 year old individuals.

We will do this also in two different ways: first a step by step procedure, and then the express version.

The estimation $s^2_{\hat y_{x_0}}$ that appears on page 24 of the pdf is obtained with:

x0 = 65
(s2yx0 = s2yx * ((1 / n) + (x0 - meanX)^2 / ((n - 1) * varX)))

## [1] 27.43356

The standard error that we will use for the confidence interval is obtained as the square root of this value:

sqrt(s2yx0)

## [1] 5.237706

and the quantile of the t-distribution is obtained using qt:

(tquantile = qt(0.95, df = n - 2))

## [1] 1.701131

The predicted value for $x_0 = 65$ is:

(y0 = beta0 + beta1 * x0)

## (Intercept) 
##    161.8213

Note that the value that appears in the pdf is different due to the rounding used there.

With these values the confidence interval is constructed with:

(confInterval = y0 + c(-1, 1) * tquantile * sqrt(s2yx0))

## [1] 152.9113 170.7313

The trick with c(-1, 1) produces both ends of the interval with a single command. Note the effect of rounding again here.

This concludes the step by step computation of the confidence interval. Now for the express procedure things are way faster:

(confInterval = predict(lmXY, int="c", newdata=data.frame(x = 65), level=0.90))

##        fit      lwr      upr
## 1 161.8213 152.9113 170.7313

The output shows the center (that is, y0) and the ends of the interval. It is obtained with the function predict, and the option int="c" implies that a confidence interval is returned. We provide the value of $x$ through a data frame in the option newdata=data.frame(x = 65). This has the advantage that several values of $x$ can be used at the same time, and we wil get the corresponding confidence interval for each of these values. We will see an example of this below when we plot the confidence and prediction bands.

Prediction interval for age 65 individuals.

For this, and in order to appreciate the ease of use of predict, let us begin with the express procedure:

(predInterval = predict(lmXY, int="p", newdata=data.frame(x = 65), level=0.90))

##        fit      lwr      upr
## 1 161.8213 131.0501 192.5925

As you can see, all that we need to do is to exchange the c of confidence for the p of prediction, and we have our interval. As I said before, with the added benefit that we can easily get many prediction intervals with a single command.

Now let’s complete the step by step construction. We can reuse some of the values and formulas that we used for the confidence interval. The standard error for prediction is (formula on page 27) similar to the computation :

(stdErrPred = sqrt(s2yx * (1 + (1 / n) + (x0 - meanX)^2 / ((n - 1) * varX))))

## [1] 18.08865

and then the prediction interval is:

(predInterval = y0 + c(-1, 1) * tquantile * stdErrPred)

## [1] 131.0501 192.5925

same as before.

but instead of using the square root of the sample variance for $x$ it is simpler to express the standard deviation of $x$ as:

(sdX = sd(data$x))

## [1] 15.2942

Plotting confidence and prediction bands.

Using predict this is quite simple. We only need to take a large enough number of values in the $x$-range of our sample (I’ll take a hundred equispaced points),

xValues = seq(min(data$x), max(data$x), length.out=100)

Then we construct the confidence and prediction interval for each of these values

predIntrvls = predict(lmXY, int="p", newdata=data.frame(x = xValues), level=0.95)
confIntrvls = predict(lmXY, int="c", newdata=data.frame(x = xValues), level=0.95)

The first few confidence intervals are:

head(confIntrvls)

##        fit      lwr      upr
## 1 115.2195 101.4832 128.9558
## 2 115.7295 102.1922 129.2667
## 3 116.2394 102.9003 129.5785
## 4 116.7494 103.6076 129.8912
## 5 117.2593 104.3139 130.2048
## 6 117.7693 105.0192 130.5194

Now for the plot we:

draw the scatter plot of $y$ versus $x$ with plot,
use abline to draw the regression line,
use segments to join each sample point $(x, y)$ with the corresponding predicted point on the regression line,
and finally use matlines to join the endpoints of those intervals to get the boundaries of the corresponding band.

plot(data$x, data$y, pch=4, lwd=2, 
     col="black", xlab="age", ylab = "sbp",
     cex.lab=1.1, cex.axis=1.1)
abline(lmXY, lwd=2, col="blue")
segments(data$x,fitted(lmXY), data$x, data$y, lwd=2)
matlines(xValues, predIntrvls[ ,2:3], lty=c(1, 1, 1), col="red", lwd=2)
matlines(xValues, confIntrvls[ ,2:3], lty=c(1, 1, 1), col="darkgreen", lwd=2)

Thanks for your attention!

Applied Regression Analysis. Week01

Course Notes by Fernando San Segundo

February 2016

Preliminaries

Loading and exploring the data

Plotting the data

Obtaining the regression line.

Adding the regression line to the scatter plot.

Analysis of the regression model on page 12 of the lecture.

Estimating \(\sigma^2\).

Testing the null hypothesis \(H_0=\{\beta_1 = 0\}\).

Testing the null hypothesis \(H_0=\{\beta_0 = 75\}\).

Confidence interval for \(\mu_{y|x}\) for 65 year old individuals.

Prediction interval for age 65 individuals.

Plotting confidence and prediction bands.