This is the code to create the graphs and tables in the second chapter of “Regression Analysis by Example.” This document focuses on the data for the hypothetical study on television news ratings. The authors use this study to illustrate the impact of extreme outliers on the quality of a simple linear regression. There are two data sets used; the second is simply a truncated version of the first (with 4 influential observations removed).
We need to import the ggplot2 library.
library(ggplot2)
Create the news ratings data.
lead_in <- c(2.5,2.7,2.9,3.1,3.3,3.5,3.7,3.9,4.1,4.3,
4.5,4.7,4.9,5.1,5.3,5.5,5.7,5.9,6.1,6.3,
6.5,6.7,6.9,7.1,7.3,7.5,2.5,2.7,7.3,7.5)
news <- c(3.8,4.1,5.8,4.8,5.7,4.4,4.8,3.6,5.5,4.15,
5.8,3.8,4.75,3.9,6.2,4.35,4.15,4.85,6.2,3.8,
7.0,5.4,6.1,6.5,6.1,4.75,1.0,1.2,9.5,9.0)
Create the relevant models.
# linear model
rtgs <- lm(news~lead_in)
# anova table
rtgs_anv <- anova(rtgs)
# summary stats
rtgs_smry <- summary(rtgs)
Note that the model created by the lm() function is, among other things, a list. You can access a lot of useful output by targeting indices of that list.
Use that approach to create a few important vectors.
# vector of model residuals
rtgs_resids <- rtgs[[2]]
# vector of standardized residuals
rtgs_std_resids <- rstandard(rtgs)
# vector of fitted values
rtgs_fitVals <- rtgs[[5]]
Create a data frame to store those vectors side by side with the original data. I find this makes plotting and any future subsetting a little bit easier. This is a pretty small data set, so there’s no harm in creating some extra objects. Having multiple ways to access data components can reduce headaches by reducing the need for commands with tons of parentheses (inside of brackets inside of more parentheses etc).
Here’s that data frame.
rtgsDF <- data.frame(lead_in, news,
rtgs_resids,
rtgs_std_resids,
rtgs_fitVals)
colnames(rtgsDF) <- c("lead", "news",
"resd",
"sResd",
"fits")
Now we can produce the tables and graphs used in the chapter.
Here’s Table 2.1.
# Table 2.1. Lead rating and news rating of television data
rtgsDF[, c(1:2)]
## lead news
## 1 2.5 3.80
## 2 2.7 4.10
## 3 2.9 5.80
## 4 3.1 4.80
## 5 3.3 5.70
## 6 3.5 4.40
## 7 3.7 4.80
## 8 3.9 3.60
## 9 4.1 5.50
## 10 4.3 4.15
## 11 4.5 5.80
## 12 4.7 3.80
## 13 4.9 4.75
## 14 5.1 3.90
## 15 5.3 6.20
## 16 5.5 4.35
## 17 5.7 4.15
## 18 5.9 4.85
## 19 6.1 6.20
## 20 6.3 3.80
## 21 6.5 7.00
## 22 6.7 5.40
## 23 6.9 6.10
## 24 7.1 6.50
## 25 7.3 6.10
## 26 7.5 4.75
## 27 2.5 1.00
## 28 2.7 1.20
## 29 7.3 9.50
## 30 7.5 9.00
Figure 2.1, i.e. the graph of news ratings against lead-in ratings.
# Fig. 2.1. Graph of news rating against lead rating
ggplot(rtgsDF, aes(x=lead, y=news)) +
geom_point(shape=19, size=3, color="slateblue3") +
geom_smooth(method=lm,
se=FALSE)
Table 2.2
# Table 2.2. Regression coefficients and standard errors
# for television ratings data
# coefficients:
rtgs_smry[[4]]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7065376 0.8171549 2.088389 0.0459765082
## lead_in 0.6653592 0.1552076 4.286898 0.0001938651
# n size:
length(news)
## [1] 30
# R squared:
rtgs_smry[[8]]
## [1] 0.3962589
# s (residual standard error):
rtgs_smry[[6]]
## [1] 1.401861
Figure 2.2, the graph of the standardized residuals against x (i.e. the lead-in values).
# Fig. 2.2. Graph of standardized residuals against x.
ggplot(rtgsDF, aes(x=lead, y=sResd)) +
geom_point(shape=19, size=3, color="slateblue3")
Figure 2.3, the graph of the standardized residuals against the fitted values.
# Fig. 2.3 Graph of standardized residuals versus fitted values
ggplot(rtgsDF, aes(x=fits, y=sResd)) +
geom_point(shape=19, size=3, color="slateblue3")
The reduced data set.
# ratings_data, reduced version
lead_in2 <- c(2.5,2.7,2.9,3.1,3.3,3.5,3.7,3.9,
4.1,4.3,4.5,4.7,4.9,5.1,5.3,5.5,
5.7,5.9,6.1,6.3,6.5,6.7,6.9,7.1,
7.3,7.5)
news2 <- c(3.8,4.1,5.8,4.8,5.7,4.4,4.8,3.6,
5.5,4.15,5.8,3.8,4.75,3.9,6.2,4.35,
4.15,4.85,6.2,3.8,7.0,5.4,6.1,6.5,
6.1,4.75)
Models for the reduced data set.
rtgs2 <- lm(news2 ~ lead_in2)
rtgs_anv2 <- anova(rtgs)
rtgs_smry2 <- summary(rtgs2)
Vectors for the reduced data set.
rtgs_resids2 <- rtgs2[[2]]
rtgs_std_resids2 <- rstandard(rtgs2)
rtgs_fitVals2 <- rtgs2[[5]]
Data frame for the reduced data set.
rtgsDF2 <- data.frame(lead_in2, news2,
rtgs_resids2,
rtgs_std_resids2,
rtgs_fitVals2)
colnames(rtgsDF2) <- c("lead2", "news2",
"resd2",
"sResd2",
"fits2")
Table 2.3, coefficients etc for the reduced data set.
# Table 2.3. Regression coefficients and standard errors
# for reduced data set
# coefficients:
rtgs_smry2[[4]]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7132479 0.6313522 5.881421 4.564029e-06
## lead_in2 0.2596581 0.1209451 2.146908 4.211490e-02
# n size:
length(news2)
## [1] 26
# R squared:
rtgs_smry2[[8]]
## [1] 0.1611094
# s (residual standard error):
rtgs_smry2[[6]]
## [1] 0.9250525
Figure 2.4, the graph of the standardized residuals against the x values for the reduced data set.
# Fig. 2.4. Graph of standardized residuals against x (reduced data set)
ggplot(rtgsDF2, aes(x=lead2, y=sResd2)) +
geom_point(shape=19, size=3, color="slateblue3")