Today you will learn more about linear regression. After today you will be able to interpret the output from a linear regression model. You will know how to create a bootstrap sample and why that is useful. You will also know how to output uncertainty intervals on your model predictions.
We begin with a short detour on manipulating data frames.
R Cheat Sheets: http://bit.ly/UzCoCj
One of the things that can be frustrating for new R users is learning how to manipulate a data frame. The plyr package makes it easier.
require(plyr)
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.0.2
Think “plier”: a versatile tool.
Consider the mtcars data frame.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Suppose you want to determine the average miles per gallon (mpg) for cars with 6 cylinders.
mtcars$mpg[mtcars$cyl == 6]
## [1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7
mean(mtcars$mpg[mtcars$cyl == 6])
## [1] 19.74
But suppose you want the average mpg for each cylinder level?
df = ddply(mtcars, .(cyl), summarize, avgMPG = mean(mpg))
df
## cyl avgMPG
## 1 4 26.66
## 2 6 19.74
## 3 8 15.10
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.0.2
ggplot(df, aes(x = factor(cyl), y = avgMPG)) + geom_bar(stat = "identity") +
xlab("Number of Cylinders") + ylab("Average MPG")
Let's review what we did last week.
In U.S. culture, an older man dating a younger woman is not uncommon, but when the age difference becomes too great it may seem unacceptable.
The data set too.young (from the UsingR package) is a survey of 80 people giving the minimum acceptable age of a female partner for a range of male ages. A surprising rule of thumb (surprising in the sense that someone took the time to figure this out) for the minimum is half the man's age plus seven.
require(UsingR)
## Loading required package: UsingR
## Warning: package 'UsingR' was built under R version 3.0.2
## Loading required package: MASS
##
## Attaching package: 'UsingR'
##
## The following object is masked from 'package:ggplot2':
##
## movies
tail(too.young)
## Male Female
## 75 21 18
## 76 25 18
## 77 30 25
## 78 40 30
## 79 50 40
## 80 60 50
Create a scatter plot of the survey results plotting the man's age on the x-axis.
require(ggplot2)
p = ggplot(too.young, aes(x = Male, y = Female)) + geom_point()
p
## Warning: Removed 1 rows containing missing values (geom_point).
p = p + geom_smooth(method = lm, se = FALSE)
p
## Warning: Removed 1 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
p = p + geom_abline(intercept = 7, slope = 1/2, col = "red")
p
## Warning: Removed 1 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
model = lm(Female ~ Male, data = too.young)
model
##
## Call:
## lm(formula = Female ~ Male, data = too.young)
##
## Coefficients:
## (Intercept) Male
## 5.472 0.575
The model indicates that folks in this sample believe it is okay for a man to date a woman whose age is about 58% of his age plus 5.5. This is close to the rule of thumb of 50% plus 7.
How close depends on the parameters' standard errors. These are given using the summary() method on the model object.
summary(model)
##
## Call:
## lm(formula = Female ~ Male, data = too.young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.993 -1.705 0.322 1.418 10.760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.4720 0.9989 5.48 5.2e-07 ***
## Male 0.5754 0.0278 20.72 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.69 on 77 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.848, Adjusted R-squared: 0.846
## F-statistic: 429 on 1 and 77 DF, p-value: <2e-16
The standard errors are given in the column labeled Std. Error. The standard error on the intercept is approximately 1, so 2 standard errors added to the estimate of 5.47 gives a range that includes the value 7.
Similarly, the slope minus 2 times the slope standard error is close to ½.
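Rather than adding and subtracting standard errors by hand, the confint() function returns confidence intervals for the coefficients directly (a quick check; this output is not part of the original notes):
confint(model, level = 0.95)  # 95% intervals for the intercept and the Male slope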
Let's look at the output from the summary() method in greater detail.
Let's return to our heart rate data set: 15 individuals were tested for maximum heart rate. The data are
Age = c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
HR = c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183,
178)
heart = data.frame(HR = HR, Age = Age)
Age is in years and the heart rate is in beats per minute.
We modeled the data with linear regression and saved the object as “model” by typing
model = lm(HR ~ Age, data = heart)
Summary statistics from the model are extracted by typing
summary(model)
##
## Call:
## lm(formula = HR ~ Age, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.926 -2.538 0.388 3.187 6.624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.048 2.867 73.3 < 2e-16 ***
## Age -0.798 0.070 -11.4 3.8e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.58 on 13 degrees of freedom
## Multiple R-squared: 0.909, Adjusted R-squared: 0.902
## F-statistic: 130 on 1 and 13 DF, p-value: 3.85e-08
The summary() method applied to a model object returns quite a bit of output. The first bit is the model call (the formula that was fit). The second bit is summary statistics on the model residuals.
Recall a residual is the observed value minus the modeled value.
A table of coefficients contains the intercept and slope estimates along with their standard errors. The standard error is a measure of the uncertainty surrounding an estimate. The best estimate for the population slope is -0.798 with a standard error of 0.07.
The uncertainty about a model parameter is used to test hypotheses and compute confidence intervals.
Interest typically centers on the null hypothesis that the slope = 0. A zero slope implies the line is horizontal and thus that HR is independent of Age.
The t-value (t-statistic) for the zero-slope hypothesis is the slope estimate divided by its standard error. If the slope is zero, the t-value has a t distribution with n - 2 degrees of freedom (dof).
Here the t-value is -11.4, which gives a p-value (Pr(>|t|)) of .0000000385 (3.85e-08). The p-value is the probability of observing a t-value at least as extreme as this one, assuming the null hypothesis (zero slope) is true.
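As a quick check (using the rounded values from the coefficient table, so the result is approximate), you can reproduce the t-value and p-value by hand:
-0.798/0.07             # t-value: slope divided by its standard error, about -11.4
2 * pt(-11.4, df = 13)  # two-sided p-value from a t distribution with n - 2 = 13 dof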
The symbols to the right are indicators of the level of significance. The line below the table shows the code definitions, which we can interpret using our language of evidence:
***: overwhelming, **: convincing, *: moderate, .: suggestive but inconclusive.
So we have overwhelming evidence that HR depends on Age given these data.
The next line of output from the summary() method gives the residual standard error. This tells you how close the observations are to the regression line. The dof is again n - 2.
sqrt(sum(resid(model)^2)/13)
## [1] 4.578
Note the function resid() extracts the model residuals from the model object.
resid(model)
## 1 2 3 4 5 6 7 8 9
## 6.3106 -5.7007 -3.1053 -2.1280 -2.1962 2.0288 -8.9258 6.6242 0.3879
## 10 11 12 13 14 15
## 4.1083 1.2993 -2.5439 2.3106 4.0629 -2.5326
The next line of output gives the multiple R-squared value (also called the coefficient of determination) and the adjusted R-squared value.
For simple linear regression, the multiple R-squared is equal to the square of the Pearson correlation coefficient between the explanatory and response variables.
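You can verify this with a one-line check (not shown in the original output):
cor(heart$Age, heart$HR)^2  # approximately 0.909, matching the multiple R-squared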
The multiple R-squared = 1 - SSE/SSY, where SSE is the sum of the squared residuals and SSY is the total variation about the mean response.
SSE = sum(resid(model)^2)
SSE
## [1] 272.4
deviance(model)
## [1] 272.4
model0 = lm(HR ~ 1, data = heart)
SSY = deviance(model0)
1 - SSE/SSY
## [1] 0.9091
That is, one minus the ratio of the unexplained variation to the total variation.
SSE will always be less than SSY, so SSE/SSY will be less than 1. If SSE is much less than SSY, then SSE/SSY is close to zero and R-squared is close to 1.
The adjusted R-squared multiplied by 100% is the percentage of variance in the response variable explained by the explanatory variable, adjusted for the number of parameters in the model.
The adjusted R-squared is designed to penalize for explanatory variables which do not add to the explanatory power of the regression.
adjusted R-squared = 1 - (n - 1)/(n - p) * (1 - multiple R-squared)
where n is the sample size and p is the number of model parameters (here 2: the intercept and the slope).
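As a check (a small sketch using the SSE and SSY computed above, with n = 15 and p = 2), the adjusted R-squared can be reproduced by hand:
R2 = 1 - SSE/SSY                   # multiple R-squared
1 - (15 - 1)/(15 - 2) * (1 - R2)   # approximately 0.902, matching the summary output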
The adjusted R-squared is always smaller than the multiple R-squared, can decrease as you add new explanatory variables, and can even be negative for poorly fitting models. It is important in the context of multiple regression.
The final line of output is the F-statistic, dof, and associated p-value.
F-statistic = [(SSY - SSE)/(p - 1)]/[SSE/(n - p)]
Under the null hypothesis that the regression is no better than the mean as a model for the data, the F-statistic comes from an F distribution with (p - 1) and (n - p) degrees of freedom (here, 1 and 13).
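As a check (a sketch using SSE and SSY from above, with p = 2 and n = 15), the F-statistic can be computed by hand before passing it to pf():
((SSY - SSE)/(2 - 1))/(SSE/(15 - 2))   # approximately 130, matching the summary output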
pf(130, 1, 13, lower.tail = FALSE)
## [1] 3.85e-08
The p-value is very small, so you confidently reject the null hypothesis that the regression model is no better than the mean-only model.
Suppose we randomly choose another 15 people and test for maximum heart rate and record ages. Will the slope be different? Almost certainly, yes.
Although we don't have access to a new sample of subjects, we can create “new” samples by the method of resampling.
This is called “bootstrapping”.
s = 1:15
bs1 = sample(s, size = 15, replace = TRUE)
bs1 is called a bootstrap sample of the original sample s.
Table the output from the sample() function using the table() function.
table(sample(s, size = 15, replace = TRUE))
##
## 1 3 5 7 9 10 11 13 15
## 1 4 1 1 2 1 1 1 3
Not all the numbers are picked and some are picked more than once. Repeat the command a few times and you will get a different bootstrap sample each time.
You rerun the regression using the bootstrap sample as an index to select the points to include.
lm(HR[bs1] ~ Age[bs1])
##
## Call:
## lm(formula = HR[bs1] ~ Age[bs1])
##
## Coefficients:
## (Intercept) Age[bs1]
## 210.803 -0.847
The slope value is not exactly the same as before.
Why stop at one bootstrap sample? We can resample again and again.
Repeat that procedure 1000 times and create a histogram.
slope = numeric()                                 # vector to hold the bootstrapped slopes
for (i in 1:1000) {
    bs = sample(1:15, size = 15, replace = TRUE)  # indices of one bootstrap sample
    model = lm(HR[bs] ~ Age[bs])                  # refit the regression on the resampled data
    slope[i] = coef(model)[2]                     # save the slope coefficient
}
ggplot(as.data.frame(slope), aes(slope)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The standard deviation on the set of bootstrapped slopes will be close to the standard error estimated from statistical theory.
sd(slope)
## [1] 0.07507
This is quite a remarkable result. Uncertainty bounds can be estimated from a single sample by resampling.
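As a small extension (not in the original notes), percentiles of the bootstrapped slopes give an approximate 95% uncertainty interval for the slope, which should be broadly consistent with the theory-based interval from confint():
quantile(slope, probs = c(0.025, 0.975))   # percentile bootstrap interval
confint(lm(HR ~ Age, data = heart))[2, ]   # theory-based interval for comparison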
Recall that to make a prediction with your model you use the predict() function and specify the value of the explanatory variable as a data frame.
model = lm(HR ~ Age, data = heart)
predict(model, data.frame(Age = c(50, 60)))
## 1 2
## 170.2 162.2
To get uncertainty intervals on the predicted values, use the level = and interval = "confidence" arguments with the predict() function.
predict(model, data.frame(Age = c(50, 60)), level = 0.95, interval = "confidence")
## fit lwr upr
## 1 170.2 167.0 173.4
## 2 162.2 157.9 166.5
The lower (lwr) and upper (upr) bounds represent the 95% uncertainty (confidence) interval about the location of the line for the particular value of the explanatory variable Age.
You state that the best prediction for maximum heart rate for a random 50-yr old is 170 bpm with a 95% uncertainty interval between 167 and 173 bpm. Thus, if we were to repeat the sampling 100 times and compute the interval each time, about 95 of the intervals would cover the true mean maximum heart rate for 50-yr olds.
At this point it is important to note that repeated sampling is not the same as bootstrap resampling. Repeated sampling refers to a theoretical ideal of there being a true model, while bootstrap resampling is a procedure that you implement on your observations.
With interval = "prediction" you get a 95% prediction interval. That interval is wider than the confidence interval because it represents two sources of uncertainty:
the uncertainty associated with the mean value GIVEN the person's age AND the uncertainty associated with a particular maximum heart rate GIVEN that conditional mean.
predict(model, data.frame(Age = c(50, 60)), level = 0.95, interval = "prediction")
## fit lwr upr
## 1 170.2 159.8 180.6
## 2 162.2 151.4 173.0
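To see where the extra width comes from (a sketch, not part of the original notes), ask predict() for the standard errors with se.fit = TRUE:
pr = predict(model, data.frame(Age = 50), se.fit = TRUE)
pr$se.fit                               # uncertainty about the conditional mean at Age = 50
sqrt(pr$se.fit^2 + pr$residual.scale^2) # larger standard error for a single new observation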
This is an example of compound uncertainty. If you are 90% certain that your spouse is faithful and 90% certain he is not a cross-dresser, then you can be at most 90% certain (and typically less) that he is a faithful non-cross-dresser.
Bank teller riddle: Laura is 31, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.
Which of the following statements about Laura is the most likely?
A. Laura is a bank teller. B. Laura is a bank teller and a professional golfer. C. Laura is a bank teller and active in the feminist movement.
To plot the confidence band type
ggplot(heart, aes(x = Age, y = HR)) + geom_point(size = 2) + geom_smooth(method = lm)
By default level=.95 in the geom_smooth() function.
To plot the prediction band, first add the prediction intervals to the original data frame.
heart2 = data.frame(heart, predict(model, interval = "prediction"))
## Warning: predictions on current data refer to _future_ responses
heart2 = heart2[order(heart2$Age), ]
Then
ggplot(heart2, aes(x = Age, y = HR)) + geom_point(size = 2) + geom_smooth(method = lm) +
geom_ribbon(aes(y = fit, ymin = lwr, ymax = upr), alpha = 0.2)
Although the regression model by construction represents the best-fit line through the data, you need to know whether the model is adequate.
By fitting a linear regression model, you are implicitly making four assumptions: linearity, constant variance, independence, and normality. The assumptions concern the behavior of the residuals, or equivalently the distribution of Y (response variable) conditional on X (explanatory variable).
To the extent these assumptions are valid, the model is said to be adequate for representing your data. A model can be significant, but not adequate.
The first three assumptions are best examined using plots. You will never be in a position to state that a given assumption is valid, only that you have no evidence to indicate otherwise.
Consider the cars data frame.
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
The data give the speed of cars (mph) and the distances (ft) taken to stop. The data were recorded in the 1920s.
cor(cars$speed, cars$dist)
## [1] 0.8069
ggplot(cars, aes(x = speed, y = dist)) + geom_point() + geom_smooth(method = lm,
se = FALSE)
The scatter plot shows that there appears to be a linear relationship between the braking distance and the forward speed of the car for this sample of data.
The relationship is positive, indicating that as speed increases so does the average braking distance. The relationship also appears to be quite strong (r = .81).
model = lm(dist ~ speed, data = cars)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.07 -9.53 -2.27 9.21 43.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.579 6.758 -2.60 0.012 *
## speed 3.932 0.416 9.46 1.5e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.4 on 48 degrees of freedom
## Multiple R-squared: 0.651, Adjusted R-squared: 0.644
## F-statistic: 89.6 on 1 and 48 DF, p-value: 1.49e-12
The results of your regression analysis suggest that there is also a statistically significant relationship (p-value less than 0.001) in the population. In fact, 65% of the variation in braking distance is associated with differences in speed. The statistically significant effect suggests it is unlikely that there was no relationship between speed and braking distance in the population of cars at that time.
The model looks pretty good, so you write up your results and you are done. Hold on. Let's take a closer look.
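One quick way to take that closer look (an aside using base graphics; the notes below use ggplot2 and the sm package instead) is the plot() method for lm objects, which produces a set of standard diagnostic plots:
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(model)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage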
There tend to be more values of braking distance below the line over the range of speeds between 10 and 20 mph. Also, the spread of the residuals appears to get larger as speed increases: there is more variation in the response for larger values of speed.
A way to see this more clearly is to cut the explanatory variable into non-overlapping groups and display the set of corresponding response values as a box plot.
ggplot(cars, aes(x = cut(speed, 5), y = dist)) + geom_boxplot()
Here you can see evidence that makes you question the validity of the linearity assumption.
You can also clearly see that the IQR (interquartile range) of distance is larger for faster-moving cars. There is evidence to suspect the constant variance assumption.
Another informative plot overlays a locally weighted (loess) regression curve; the base-R scatter.smooth() function does something similar, but here we use geom_smooth().
ggplot(cars, aes(x = speed, y = dist)) + geom_point() + geom_smooth(se = FALSE) +
geom_smooth(method = lm, color = "red", se = FALSE)
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Here local regressions are performed using overlapping groups of data. Again, linearity and constant variance assumptions are questionable.
What about the assumption of normality? The assumption is that the conditional distribution of the residuals at each x_i is adequately described by a normal distribution. With sample data there are usually not enough observations at each value of X to check this directly, so in practice it is common to examine the joint (pooled) distribution of the residuals.
The residuals are obtained by using the residuals() extractor function.
res = residuals(model)
res
## 1 2 3 4 5 6 7 8
## 3.8495 11.8495 -5.9478 12.0522 2.1198 -7.8126 -3.7450 4.2550
## 9 10 11 12 13 14 15 16
## 12.2550 -8.6774 2.3226 -15.6098 -9.6098 -5.6098 -1.6098 -7.5422
## 17 18 19 20 21 22 23 24
## 0.4578 0.4578 12.4578 -11.4746 -1.4746 22.5254 42.5254 -21.4070
## 25 26 27 28 29 30 31 32
## -15.4070 12.5930 -13.3394 -5.3394 -17.2719 -9.2719 0.7281 -11.2043
## 33 34 35 36 37 38 39 40
## 2.7957 22.7957 30.7957 -21.1367 -11.1367 10.8633 -29.0691 -13.0691
## 41 42 43 44 45 46 47 48
## -9.0691 -5.0691 2.9309 -2.9339 -18.8663 -6.7987 15.2013 16.2013
## 49 50
## 43.2013 4.2689
A quick look at the residuals shows that some are positive and some are negative. A histogram of the model's residuals is obtained by typing
hist(res)
plot(density(res))
You can see that the right tail in the histogram (and density) is longer than the left tail. With a normal distribution the tails are symmetric. The normality assumption is under question.
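Another common check (a base-R alternative, not shown in the original notes) is a normal quantile-quantile plot of the residuals:
qqnorm(res)   # sample quantiles of the residuals vs theoretical normal quantiles
qqline(res)   # reference line; points bending above it in the right tail suggest right skew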
Since departures from normality can occur simply because of sampling variation, the question arises as to whether that apparent skewness in the residuals is significant.
A common approach to visualizing the expected variation from the reference distribution is to superimpose a confidence band (envelope) on the density plot. The sm.density() function from the sm package is a quick way to plot the confidence envelope. The argument model = "Normal" adds the band appropriate for a normal reference distribution.
require(sm)
## Loading required package: sm
## Warning: package 'sm' was built under R version 3.0.2
## Package `sm', version 2.2-5: type help(sm) for summary information
##
## Attaching package: 'sm'
##
## The following object is masked from 'package:MASS':
##
## muscle
sm.density(res, model = "Normal")
The curve is shifted left and goes outside the confidence band in the right tail indicating that the residuals may not be adequately described by a normal distribution.
The file Inc.txt contains average income versus the percentage of college graduates by state. Fit a linear regression model to these data and check the model assumptions. Does the linear model appear to be adequate?
Inc = read.table("http://myweb.fsu.edu/jelsner/data/Inc.txt", header = TRUE)
head(Inc)
## State College Income
## 1 AL 20.4 20487
## 2 AK 28.1 26171
## 3 AZ 24.6 21947
## 4 AR 18.4 19479
## 5 CA 27.5 26808
## 6 CO 34.6 27573
Repeat the steps above to check model adequacy.
ggplot(Inc, aes(x = College, y = Income)) + geom_point() + geom_smooth(method = lm)
model = lm(Income ~ College, data = Inc)
summary(model)
##
## Call:
## lm(formula = Income ~ College, data = Inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4136 -1345 75 1276 5121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10460 1615 6.48 4.3e-08 ***
## College 545 63 8.65 2.0e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2080 on 49 degrees of freedom
## Multiple R-squared: 0.604, Adjusted R-squared: 0.596
## F-statistic: 74.8 on 1 and 49 DF, p-value: 1.98e-11
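One way to finish the adequacy check, mirroring the steps used for the cars data (a sketch; other diagnostic plots would work equally well):
ggplot(Inc, aes(x = cut(College, 5), y = Income)) + geom_boxplot()   # check linearity and constant variance
res = residuals(model)
sm.density(res, model = "Normal")   # check normality against a normal reference band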