4 SLR Model Assumptions

I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 4 in STAT 501 is SLR Model Assumptions. The SAS version of the code below is posted on my blog.

Overview

This lesson presents graphical diagnostic tests of the assumptions underlying the SLR.

The residuals versus fits plot is a scatter plot of residuals on the y axis and fitted values (estimated responses) on the x axis. Use the plot to detect non-linearity, unequal error variances, and outliers.
If a regression model violates the linearity assumption, the residuals will depart from 0 in some systematic manner.
Flag for further investigation observations with a standardized residual greater than 2.
A residuals versus order plot detects a particular form of non-independence of the error terms: serial (in time or space) correlation.
If a normal probability plot of residuals is not approximately linear, then the normality assumption may be violated.

R Concepts Worth Noting

Create residuals with resid(), standardized residuals with rstandard(), and fitted values with fitted(), then merge into the original data set with cbind.

Data Management for this Module

library(readr)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(alr3)

## Warning: package 'alr3' was built under R version 3.4.3

## Loading required package: car

## Warning: package 'car' was built under R version 3.4.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# alcoholarm is a tab-delimited data file.
# Total lifetime consumption of alcohol and deltoid muscle strength of nondominant arm for n=50 alcoholic men.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/alcoholarm.txt'
alcoholarm <- read.table(url, sep = '\t', header = TRUE)
#head(alcoholarm)
#summarise(alcoholarm, n=n())

# bloodpress is a tab-delimited data file.
# Age (in years), weight (in pounds), duration of hypertension (in years), and diastolic blood pressure (in mm Hg) on a sample of n = 20 hypertensive individuals.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/bloodpress.txt'
bloodpress <- read.table(url, sep = '\t', header = TRUE)
#head(bloodpress)
#summarise(bloodpress, n=n())

# treadwear is a tab-delimited data file.
# Mileage (in 1000 miles) driven and depth of the remaining groove (in mils) on a sample of n = 9 vehicles.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/treadwear.txt'
treadwear <- read.table(url, sep = '\t', header = TRUE)
#head(treadwear)
#summarise(treadwear, n=n())

# alphapluto is a tab-delimited data file.
# Plutonium activity (pCi/g) and alpha count rate (number per second) for n=23 samples. 
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/alphapluto.txt'
alphapluto <- read.table(url, sep = '\t', header = TRUE)
#head(alphapluto)
#summarise(alphapluto, n=n())

# alcoholtobacco is a tab-delimited data file.
# Average weekly expenditure on tobacco (British pounds) and average weekly expenditure on alcohol (British pounds) for households in n = 11 regions in the United Kingdom.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/alcoholtobacco.txt'
alcoholtobacco <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(alcoholtobacco)
#summarise(alcoholtobacco, n=n())

# anscombe is a fixed-width data file.
# 
anscombe <- read_fwf(
  file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/anscombe.txt", 
  skip = 1,
  fwf_widths(c(3, 6, 3, 6, 3, 7, 3, 6), c('x1','y1','x2','y2','x3','y3','x4','y4')))

## Parsed with column specification:
## cols(
##   x1 = col_integer(),
##   y1 = col_double(),
##   x2 = col_integer(),
##   y2 = col_double(),
##   x3 = col_integer(),
##   y3 = col_double(),
##   x4 = col_integer(),
##   y4 = col_double()
## )

#head(anscombe)
#summarise(anscombe, n=n())

# alligator is a tab-delimited data file.
alligator <- read_fwf(
  file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/alligator.txt", 
  skip = 1,
  fwf_widths(c(11, 3), c('weight', 'length')))

## Parsed with column specification:
## cols(
##   weight = col_integer(),
##   length = col_integer()
## )

## Warning: 2 parsing failures.
## row # A tibble: 2 x 5 col     row col    expected  actual    file                                    expected   <int> <chr>  <chr>     <chr>     <chr>                                   actual 1    26 weight 11 chars  0         'https://onlinecourses.science.psu.edu~ file 2    26 <NA>   2 columns 1 columns 'https://onlinecourses.science.psu.edu~

head(alligator)

## # A tibble: 6 x 2
##   weight length
##    <int>  <int>
## 1    130     94
## 2     51     74
## 3    640    147
## 4     28     58
## 5     80     86
## 6    110     94

summarise(alligator, n=n())

## # A tibble: 1 x 1
##       n
##   <int>
## 1    26

# skincancer is a fixed-width data file.
# Mortality due to skin cancer (number of deaths per 10 million people), and 
# U.S. state latitude and longitude at center of state for n=49 states.
skincancer <- read_fwf(
  file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/skincancer.txt", 
  skip = 1,
  fwf_widths(c(25, 9, 7, 5, 5), c('state','lat','mort','ocean','lon')))

## Parsed with column specification:
## cols(
##   state = col_character(),
##   lat = col_double(),
##   mort = col_integer(),
##   ocean = col_integer(),
##   lon = col_double()
## )

#head(skincancer)
#summarise(skincancer, n=n())

# corrosion is a tab-delimited data file.
# iron content and weight loss due to corrosion for sample of n=13 specimens.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/corrosion.txt'
corrosion <- read.table(url, sep = '\t', header = TRUE)
head(corrosion)

##   iron wgtloss
## 1 0.01   127.6
## 2 0.01   130.1
## 3 0.01   128.0
## 4 0.48   124.0
## 5 0.48   122.0
## 6 0.71   110.8

summarise(corrosion, n=n())

##    n
## 1 13

# handheight is a tab-delimited data file.
# handspan (cm) and height (inches), for n = 167 students.
url <- 'https://onlinecourses.science.psu.edu/stat462/sites/onlinecourses.science.psu.edu.stat462/files/data/handheight.txt'
handheight <- read.table(url, sep = '\t', header = TRUE)
#head(handheight)

# solutions_conc is a fixed-width data file.
# concentration of a chemical solution and time after solution was made
solutions_conc <- read_fwf(
  file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/solutions_conc.txt", 
  skip = 0,
  fwf_widths(c(11, 3), c('conc','time')))

## Parsed with column specification:
## cols(
##   conc = col_double(),
##   time = col_double()
## )

#head(solutions_conc)
#summarise(solutions_conc, n=n())

# oldfaithful is a tab-delimited data file.
# Time to next eruption and duration of last eruption for eruptions of the Old Faithful geyser.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/oldfaithful.txt'
oldfaithful <- read.table(url, sep = '\t', header = TRUE)
#head(oldfaithful)
#summarise(oldfaithful, n=n())

# realestate is a tab-delimited data file.
# sale price of a home and square foot area of home for sample of n=521 homes.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/realestate.txt'
realestate <- read.table(url, sep = '\t', header = TRUE)
#head(realestate)
#summarise(realestate, n=n())

# infectionrisk is a tab-delimited data file.
# Infection risk in a hospital and x = average length of stay in the hospital at n=113 hospitals.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/infectionrisk.txt'
infectionrisk <- read.table(url, sep = '\t', header = TRUE)
head(infectionrisk)

##   ID  Stay  Age InfctRsk Culture  Xray Beds MedSchool Region Census Nurses
## 1  1  7.13 55.7      4.1     9.0  39.6  279         2      4    207    241
## 2  2  8.82 58.2      1.6     3.8  51.7   80         2      2     51     52
## 3  3  8.34 56.9      2.7     8.1  74.0  107         2      3     82     54
## 4  4  8.95 53.7      5.6    18.9 122.8  147         2      4     53    148
## 5  5 11.20 56.5      5.7    34.5  88.9  180         2      1    134    151
## 6  6  9.76 50.9      5.1    21.9  97.0  150         2      2    147    106
##   Facilities
## 1         60
## 2         40
## 3         20
## 4         40
## 5         40
## 6         40

summarise(infectionrisk, n=n())

##     n
## 1 113

# carstopping is a tab-delimited data file.
# Stopping distance of a car and speed of the car when the brakes were applied for sample of n=61 cars.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/carstopping.txt'
carstopping <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
head(carstopping)

##   StopDist Speed
## 1        4     4
## 2        2     5
## 3        8     5
## 4        8     5
## 5        4     5
## 6        6     7

summarise(carstopping, n=n())

##    n
## 1 63

4.1 Background

Recall the four “LINE” conditions underlying SLR: the errors $e_i$ are independent normal random variables with mean zero and constant variance, $\sigma^2$.

The response variable mean $E(Y_i)$ is a linear function of $x_i$. (which is also to say $E(e_i)=0).
The errors $e_i$ are independent.
The errors $e_i$ at each $x_i$ are normally distributed.
The errors $e_i$ at each $x_i$ have equal variances $\sigma^2$.

This list could be expanded to include two more conditions:

One or more outlier or leverage points are handled properly.
The model is specified correctly.

Here is a summary of the importantance of the conditions.

All tests and intervals are very sensitive to even minor departures from independence.
All tests and intervals are sensitive to moderate departures from equal variance.
The hypothesis tests and confidence intervals for $\beta_0$ and $\beta_1$ are fairly robust against departures from normality.
Prediction intervals are quite sensitive to departures from normality.

4.2 Residuals vs. Fits Plot

The residuals versus fits plot is a scatter plot of residuals on the y axis and fitted values (estimated responses) on the x axis. Use the plot to detect non-linearity, unequal error variances, and outliers.

The plot below is a well-behaved residuals vs. fits plot.

The residuals “bounce randomly” around the 0 line, supporting the linearity assumption.
The residuals form a “horizontal band” around the 0 line, supporting the equal variances assumption.
No one residual “stands out” from the basic random pattern, supportin the no outliers assumption.

alcoholarm.lm <- lm(strength ~ alcohol, data = alcoholarm)
alcoholarm.res <- data.frame(res=resid(alcoholarm.lm))
alcoholarm.fit <- data.frame(fit=fitted(alcoholarm.lm))
alcoholarm2 <- cbind(alcoholarm, alcoholarm.res, alcoholarm.fit)
ggplot(alcoholarm2, aes(y = res, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  xlab("Fitted Value") +
  ylab("Residual") +
  ggtitle("Residuals vs. Fits Plot")

4.3 Residuals vs. Predictor Plot

The residuals versus predictors plot is a scatter plot of residuals on the y axis and explanatory variables on the x axis. It is an alternative to the residuals vs. fitted values plot. Use the plot to detect non-linearity, unequal error variances, and outliers. A well-behaved plot bounces randomly and forms a roughly horizontal band around the residual = 0 line, and no data points stand out from the basic random pattern of the other residuals.

In an SLR, the residuals versus predictors plot adds no information not already supplied by the residuals versus fits plot. But in multiple linear regression (MLR), the residuals versus predictors plot can indicate whether a preditor should be added to the model.

ggplot(alcoholarm2, aes(y = res, x = alcohol)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  xlab("Predictor Value (alcohol)") +
  ylab("Residual") +
  ggtitle("Residuals vs. Predictor Plot")

Consider an example in which the residuals vs. predictor plot can help determine whether or not another predictor should be added to the model. Data set bloodpress contains the age (in years), weight (in pounds), duration of hypertension (in years), and diastolic blood pressure (in mm Hg) BP of a sample of n = 20 hypertensive individuals.

Start with a regression of BP on Age. The Pearson correlation coefficient indicates a moderately strong linear relationship (r=0.659093). A regression of BP on Weight also has a high correlation coefficient (r=0.9500677). A regression of BP on Dur does not have a high correlation coefficient (r=0.2928336).

cor(bloodpress$BP, bloodpress$Age)

## [1] 0.659093

cor(bloodpress$BP, bloodpress$Weight)

## [1] 0.9500677

cor(bloodpress$BP, bloodpress$Dur)

## [1] 0.2928336

Regress BP on Age then plot the residuals against Weight. In general a non-random pattern indicates that the additional explanatory variable can explain some of the remaining variability in the response. Think of the residuals on the y axis as a new response, the diastolic blood pressure adjusted for age. If a plot of the new response against a predictor shows a non-random pattern, the predictor explains some of the remaining variability. Here, there is a pattern in the plot.

bloodpress.lm <- lm(BP ~ Age, data = bloodpress)
bloodpress.res <- data.frame(res=resid(bloodpress.lm))
bloodpress2 <- cbind(bloodpress, bloodpress.res)
ggplot(bloodpress2, aes(y = res, x = Weight)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  xlab("Weight") +
  ylab("Residual") +
  ggtitle("Residuals vs. Weight")

How about the Dur variable? A plot of the residuals against Dur reveals no pattern. Adding Dur to the model is not worthwhile.

ggplot(bloodpress2, aes(y = res, x = Dur)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  xlab("Duration") +
  ylab("Residual") +
  ggtitle("Residuals vs. Duration")

4.4 Identifying Specific Problems Using Residual Plots

Observe how residuals versus fits (or predictor) plots reveal problems with a formulated regression model.

If a regression model violates the linearity assumption, the residuals will depart from 0 in some systematic manner.

Consider a data set treadwear of mileage driven mileage (1000 miles) and the depth of the remaining groove groove (mils) for n=9 vehicles. The relationship is not quite linear, and it is obvious from the residuals vs. fitted values plot. The residuals plot suggests a nonlinear relationship would be a better predictor.

treadwear.lm <- lm(groove ~ mileage, data = treadwear)
treadwear.res <- data.frame(res=resid(treadwear.lm))
treadwear.fit <- data.frame(fit=fitted(treadwear.lm))
treadwear <- cbind(treadwear, treadwear.res, treadwear.fit)
ggplot(treadwear, aes(y = groove, x = mileage)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  xlab("Mileage (1000 miles)") +
  ylab("Groove (mils)") +
  ggtitle("Regression Plot")

ggplot(treadwear, aes(y = res, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  xlab("Fitted") +
  ylab("Residual") +
  ggtitle("Residuals versus Fitted")

Non-constant error variance shows up on a residuals vs. fits (or predictor) plot as a “fanning” effect, a “funneling” effect, or some other complex fashion.

Consider a data set alphapluto of plutonium activity pluto (pCi/g) and alpha count rate alpha (number per second) for n=23 samples. The regression plot shows a linear relationship between alpha count rate and plutonium activity. It also suggests that the error terms vary around the regression line in a non-constant manner. The corresponding residuals vs. fits plot accentuates this observation.

alphapluto.lm <- lm(alpha ~ pluto, data = alphapluto)
alphapluto.res <- data.frame(res=resid(alphapluto.lm))
alphapluto.fit <- data.frame(fit=fitted(alphapluto.lm))
alphapluto <- cbind(alphapluto, alphapluto.res, alphapluto.fit)
ggplot(alphapluto, aes(y = alpha, x = pluto)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  xlab("pluto (pCi/g)") +
  ylab("alhpa (number/sec") +
  ggtitle("Regression Plot")

ggplot(alphapluto, aes(y = res, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  xlab("Fitted") +
  ylab("Residual") +
  ggtitle("Residuals versus Fitted")

Outliers show up on a residuals versus fitted values plot standing apart from the basic random pattern of the rest of the residuals. The random pattern of the residual plot can even disappear if one outlier really deviates from the pattern of the rest of the data.

Consider a data set alcoholtobacco of expenditure on tobacco Tobacco (British pounds) and the average weekly expenditure on alcohol Alcohol (British pounds) for households in n=11 different regions in the United Kingdom. The regression plot shows an outlier in the lower right of the plot. In fact, the outlier is so far removed from the pattern of the rest of the data that it appears to be pulling the line in its direction. The corresponding residuals vs. fits plot accentuates the effect.

alcoholtobacco.lm <- lm(Alcohol ~ Tobacco, data = alcoholtobacco)
alcoholtobacco.res <- data.frame(res=resid(alcoholtobacco.lm))
alcoholtobacco.fit <- data.frame(fit=fitted(alcoholtobacco.lm))
alcoholtobacco <- cbind(alcoholtobacco, alcoholtobacco.res, alcoholtobacco.fit)
ggplot(alcoholtobacco, aes(y = Alcohol, x = Tobacco)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  xlab("Tobacco (British Pounds)") +
  ylab("Alcohol (British Pounds)") +
  ggtitle("Regression Plot")

ggplot(alcoholtobacco, aes(y = res, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  xlab("Fitted") +
  ylab("Residual") +
  ggtitle("Residuals versus Fitted")

How large must a residual be for the associated data point to be flagged as being an outlier? Divide the residuals by their standard deviation to create standardized residuals. Flag observations with a standardized residuals greater than 2 for further investigation.

alcoholtobacco.stdres <- data.frame(stdres=rstandard(alcoholtobacco.lm))
alcoholtobacco <- cbind(alcoholtobacco, alcoholtobacco.stdres)
ggplot(alcoholtobacco, aes(y = stdres, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2, colour="red") +
  geom_hline(yintercept = 2, colour="red") +
  xlab("Fitted") +
  ylab("Standardized Residual") +
  ggtitle("Residuals versus Fitted")

Here is an extreme example where the ideal random pattern of the residual plot has disappeared, since the one outlier really deviates from the pattern of the rest of the data.

anscombe.lm <- lm(y3 ~ x3, data = anscombe)
anscombe.stdres <- data.frame(stdres=rstandard(anscombe.lm))
anscombe.fit <- data.frame(fit=fitted(anscombe.lm))
anscombe <- cbind(anscombe, anscombe.stdres, anscombe.fit)
ggplot(anscombe, aes(y = y3, x = x3)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  xlab("x3") +
  ylab("y3") +
  ggtitle("Regression Plot")

ggplot(anscombe, aes(y = stdres, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  xlab("Fitted") +
  ylab("Standardized Residual") +
  ggtitle("Standardized Residuals versus Fitted")

4.5 Residuals vs. Order Plot

A residuals versus order plot detects a particular form of non-independence of the error terms: serial (in time or space) correlation. The plot is only appropriate if you know the order in which the data were collected.

If serial correlation is not present, the residuals bounce randomly around the residual = 0 line. A residuals vs. order plot with a positive trend suggests that some of the variation in the response is due to time, so a time explanatory variable belongs in the model. Positive serial correlation exists when residuals tend to be followed, in time, by residuals of the same sign and about the same magnitude. Negative serial correlation exists when residuals of one sign tend to be followed by residuals of the opposite sign.

4.6 Normal Probability Plot of Residuals

Use a normal probability plot of the residuals to determine whether it is reasonable to assume that the error terms are normally distributed. A plot of the theoretical percentiles of the normal distribution versus the observed sample percentiles should be approximately linear.

Skewed residuals will show a bow in the normal probability plot. Heavy tails will cross the normal probability plot line.

4.7 Assessing Linearity by Visual Inspection

For an SLR, it is possible to assess linearity by simply looking at the regression plot.

For our skincancer data set example, the regression plot supports the linearity assumption.

ggplot(skincancer, aes(y=mort, x=lat)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE) +
  ylab("Mortality (Deaths per 10 million)") +
  xlab("Latitude (at center of state)") +
  ggtitle("Skin Cancer Mortality versus State Latitude")

The following example from data set alligator of alligator length (inches) and weight (pounds) from a sample of n=25 alligators. The regression plot shows a nonlinear relationship.

ggplot(alligator, aes(y=weight, x=length)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE) +
  ylab("Weight (pounds)") +
  xlab("Length (inches)") +
  ggtitle("Regression Plot")

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

The following example from data set corrosion of corrosion and iron content from a sample of n=13 specimens. The regression plot shows a linear relationship.

ggplot(corrosion, aes(y=wgtloss, x=iron)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE) +
  ylab("Iron") +
  xlab("Weight Loss") +
  ggtitle("Regression Plot")

4.8 Further Examples

This plot from data set handheight looks good in that the variance is roughly the same all the way across and there are no worrisome patterns. There seems to be no difficulties with the model or data.

handheight.lm <- lm(HandSpan ~ Height, data = handheight)
handheight.stdres <- data.frame(stdres=rstandard(handheight.lm))
handheight.fit <- data.frame(fit=fitted(handheight.lm))
handheight <- cbind(handheight, handheight.stdres, handheight.fit)
ggplot(handheight, aes(y = stdres, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2, colour="red") +
  geom_hline(yintercept = 2, colour="red") +
  xlab("Fitted") +
  ylab("Standardized Residual") +
  ggtitle("Standardized Residuals versus Fitted")

This plot from data set solutions_conc shows two difficulties. First, the pattern is curved which indicates that the wrong type of equation was used. Second, the variance (vertical spread) increases as the fitted values (predicted values) increase.

solutions_conc.lm <- lm(conc ~ time, data = solutions_conc)
solutions_conc.stdres <- data.frame(stdres=rstandard(solutions_conc.lm))
solutions_conc.fit <- data.frame(fit=fitted(solutions_conc.lm))
solutions_conc <- cbind(solutions_conc, solutions_conc.stdres, solutions_conc.fit)
ggplot(solutions_conc, aes(y = stdres, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2, colour="red") +
  geom_hline(yintercept = 2, colour="red") +
  xlab("Fitted") +
  ylab("Standardized Residual") +
  ggtitle("Standardized Residuals versus Fitted")

This plot from data set realestate shows that the residual variance (vertical spread) increases as the fitted values (predicted values of sale price) increase. This violates the assumption of constant error variance.

realestate.lm <- lm(SalePrice ~ SqFeet, data = realestate)
realestate.stdres <- data.frame(stdres=rstandard(realestate.lm))
realestate.fit <- data.frame(fit=fitted(realestate.lm))
realestate <- cbind(realestate, realestate.stdres, realestate.fit)
ggplot(realestate, aes(y = stdres, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2, colour="red") +
  geom_hline(yintercept = 2, colour="red") +
  xlab("Fitted") +
  ylab("Standardized Residual") +
  ggtitle("Standardized Residuals versus Fitted")

The histogram of oldfaithful is roughly bell-shaped so it is an indication that it is reasonable to assume that the errors have a normal distribution. The pattern of the normal probability plot is straight, so this plot also provides evidence that it is reasonable to assume that the errors have a normal distribution.

oldfaithful.lm <- lm(eruption ~ waiting, data = oldfaithful)
oldfaithful.stdres <- data.frame(stdres=rstandard(oldfaithful.lm))
oldfaithful.fit <- data.frame(fit=fitted(oldfaithful.lm))
oldfaithful <- cbind(oldfaithful, oldfaithful.stdres, oldfaithful.fit)
ggplot(oldfaithful, aes(x = stdres)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(oldfaithful, aes(y = stdres, x = fit)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2, colour="red") +
  geom_hline(yintercept = 2, colour="red") +
  xlab("Fitted") +
  ylab("Standardized Residual") +
  ggtitle("Standardized Residuals versus Fitted")

The normal probability plot of infectionrisk shows some deviation from the straight-line pattern indicating a distribution with heavier tails than a normal distribution.

infectionrisk.lm <- lm(InfctRsk ~ Stay, data = infectionrisk)
infectionrisk.stdres <- rstandard(infectionrisk.lm)
qqnorm(infectionrisk.stdres, 
     ylab="Standardized Residuals", 
     xlab="Normal Scores", 
     main="Old Faithful Eruptions") 
 qqline(infectionrisk.stdres)

The regression plot of carstopping shows problems with both curvature and nonconstant variance. One possible remedy is to transform y. With some trial and error, we find that there is an approximate linear relationship between $sqrt{y}$ and x with no suggestion of nonconstant variance.

carstopping.lm <- lm(StopDist ~ Speed, data = carstopping)
carstopping.res <- data.frame(res=resid(carstopping.lm))
carstopping.fit <- data.frame(fit=fitted(carstopping.lm))
carstopping <- cbind(carstopping, carstopping.res, carstopping.fit)
ggplot(carstopping, aes(y = StopDist, x = Speed)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  xlab("Speed (mpg)") +
  ylab("Stopping Distance (feet)") +
  ggtitle("Regression Plot")

carstopping$StopDist.sqrt = sqrt(carstopping$StopDist)
carstopping.lm2 <- lm(StopDist.sqrt ~ Speed, data = carstopping)
newdata <- data.frame(Speed=c(10,20,30,40))
predict(carstopping.lm2, newdata, interval="prediction")

##         fit      lwr       upr
## 1  3.443966 1.984875  4.903057
## 2  5.969649 4.519884  7.419414
## 3  8.495332 7.031415  9.959249
## 4 11.021015 9.520132 12.521898