Graphing Regression Diagnostics

This is the code to create the graphs and tables in the second chapter of “Regression Analysis by Example.” This document focuses on the data for the hypothetical study on television news ratings. The authors use this study to illustrate the impact of extreme outliers on the quality of a simple linear regression. There are two data sets used; the second is simply a truncated version of the first (with 4 influential observations removed).

External Libraries

We need to import the ggplot2 library.

library(ggplot2)

The Data Sets

Create the news ratings data.

lead_in <- c(2.5,2.7,2.9,3.1,3.3,3.5,3.7,3.9,4.1,4.3,
             4.5,4.7,4.9,5.1,5.3,5.5,5.7,5.9,6.1,6.3,
             6.5,6.7,6.9,7.1,7.3,7.5,2.5,2.7,7.3,7.5)

news    <- c(3.8,4.1,5.8,4.8,5.7,4.4,4.8,3.6,5.5,4.15,
             5.8,3.8,4.75,3.9,6.2,4.35,4.15,4.85,6.2,3.8,
             7.0,5.4,6.1,6.5,6.1,4.75,1.0,1.2,9.5,9.0)

The Linear Model

Create the relevant models.

# linear model
rtgs <- lm(news~lead_in)
# anova table
rtgs_anv <- anova(rtgs)
# summary stats
rtgs_smry <- summary(rtgs)

Residuals and Fitted Values

Note that the model created by the lm() function is, among other things, a list. You can access a lot of useful output by targeting indices of that list.

Use that approach to create a few important vectors.

# vector of model residuals
rtgs_resids <- rtgs[[2]]
# vector of standardized residuals
rtgs_std_resids <- rstandard(rtgs)
# vector of fitted values
rtgs_fitVals <- rtgs[[5]]

The Data Frame

Create a data frame to store those vectors side by side with the original data. I find this makes plotting and any future subsetting a little bit easier. This is a pretty small data set, so there’s no harm in creating some extra objects. Having multiple ways to access data components can reduce headaches by reducing the need for commands with tons of parentheses (inside of brackets inside of more parentheses etc).

Here’s that data frame.

rtgsDF <- data.frame(lead_in, news, 
                     rtgs_resids,
                     rtgs_std_resids,
                     rtgs_fitVals)

colnames(rtgsDF) <- c("lead", "news", 
                      "resd", 
                      "sResd", 
                      "fits")

Tables and Graphical Output

Now we can produce the tables and graphs used in the chapter.
Here’s Table 2.1.

# Table 2.1. Lead rating and news rating of television data
rtgsDF[, c(1:2)]

##    lead news
## 1   2.5 3.80
## 2   2.7 4.10
## 3   2.9 5.80
## 4   3.1 4.80
## 5   3.3 5.70
## 6   3.5 4.40
## 7   3.7 4.80
## 8   3.9 3.60
## 9   4.1 5.50
## 10  4.3 4.15
## 11  4.5 5.80
## 12  4.7 3.80
## 13  4.9 4.75
## 14  5.1 3.90
## 15  5.3 6.20
## 16  5.5 4.35
## 17  5.7 4.15
## 18  5.9 4.85
## 19  6.1 6.20
## 20  6.3 3.80
## 21  6.5 7.00
## 22  6.7 5.40
## 23  6.9 6.10
## 24  7.1 6.50
## 25  7.3 6.10
## 26  7.5 4.75
## 27  2.5 1.00
## 28  2.7 1.20
## 29  7.3 9.50
## 30  7.5 9.00

Figure 2.1, i.e. the graph of news ratings against lead-in ratings.

# Fig. 2.1. Graph of news rating against lead rating
ggplot(rtgsDF, aes(x=lead, y=news)) +
    geom_point(shape=19, size=3, color="slateblue3") +
    geom_smooth(method=lm,
                se=FALSE)

Table 2.2

# Table 2.2. Regression coefficients and standard errors
#            for television ratings data
# coefficients:
rtgs_smry[[4]]

##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 1.7065376  0.8171549 2.088389 0.0459765082
## lead_in     0.6653592  0.1552076 4.286898 0.0001938651

# n size:
length(news)

## [1] 30

# R squared:
rtgs_smry[[8]]

## [1] 0.3962589

# s (residual standard error):
rtgs_smry[[6]]

## [1] 1.401861

Figure 2.2, the graph of the standardized residuals against x (i.e. the lead-in values).

# Fig. 2.2. Graph of standardized residuals against x.
ggplot(rtgsDF, aes(x=lead, y=sResd)) +
    geom_point(shape=19, size=3, color="slateblue3")

Figure 2.3, the graph of the standardized residuals against the fitted values.

# Fig. 2.3 Graph of standardized residuals versus fitted values
ggplot(rtgsDF, aes(x=fits, y=sResd)) +
    geom_point(shape=19, size=3, color="slateblue3")

The Reduced Data Set and Associated Output

The reduced data set.

# ratings_data, reduced version
lead_in2 <- c(2.5,2.7,2.9,3.1,3.3,3.5,3.7,3.9,
              4.1,4.3,4.5,4.7,4.9,5.1,5.3,5.5,
              5.7,5.9,6.1,6.3,6.5,6.7,6.9,7.1,
              7.3,7.5)

news2    <- c(3.8,4.1,5.8,4.8,5.7,4.4,4.8,3.6,
              5.5,4.15,5.8,3.8,4.75,3.9,6.2,4.35,
              4.15,4.85,6.2,3.8,7.0,5.4,6.1,6.5,
              6.1,4.75)

Models for the reduced data set.

rtgs2 <- lm(news2 ~ lead_in2)
rtgs_anv2 <- anova(rtgs)
rtgs_smry2 <- summary(rtgs2)

Vectors for the reduced data set.

rtgs_resids2 <- rtgs2[[2]]
rtgs_std_resids2 <- rstandard(rtgs2)
rtgs_fitVals2 <- rtgs2[[5]]

Data frame for the reduced data set.

rtgsDF2 <- data.frame(lead_in2, news2, 
                     rtgs_resids2,
                     rtgs_std_resids2,
                     rtgs_fitVals2)

colnames(rtgsDF2) <- c("lead2", "news2", 
                      "resd2", 
                      "sResd2", 
                      "fits2")

Table 2.3, coefficients etc for the reduced data set.

# Table 2.3. Regression coefficients and standard errors
#            for reduced data set
# coefficients:
rtgs_smry2[[4]]

##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 3.7132479  0.6313522 5.881421 4.564029e-06
## lead_in2    0.2596581  0.1209451 2.146908 4.211490e-02

# n size:
length(news2)

## [1] 26

# R squared:
rtgs_smry2[[8]]

## [1] 0.1611094

# s (residual standard error):
rtgs_smry2[[6]]

## [1] 0.9250525

Figure 2.4, the graph of the standardized residuals against the x values for the reduced data set.

# Fig. 2.4. Graph of standardized residuals against x (reduced data set)
ggplot(rtgsDF2, aes(x=lead2, y=sResd2)) +
    geom_point(shape=19, size=3, color="slateblue3")