Linear regression, à la \( Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots \), assumes a \( \beta_i \) change in the response variable \( Y \) for a unit change in predictor variable \( X_i \), holding \( X_{j \ne i} \) constant.
Transformations address a non-linear response-predictor relationship by applying a function (e.g., log, polynomial) to the predictor and/or response variables.
Interactions between predictor variables may be present; in that case an interaction term is formed by multiplying the interacting predictors together (see lm_find.interactions).
Qualitative predictors are represented by replacing them with n-1 binary (indicator) variables, where n is the number of levels.
The effects of such changes are considered improvements if we see:
reduced residual standard error
increased adjusted \( R^2 \)
increased F statistic and reduced associated p-value
increased abs(t statistic) and reduced p-values of the individual predictor variables
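As a minimal, self-contained sketch of how each of these devices looks in an R model formula (the data frame d and the variables x1, x2, f, y are made up purely for illustration):
set.seed(1)
d = data.frame(x1 = runif(30, 1, 10), x2 = rnorm(30), f = factor(sample(c("a", "b", "c"), 30, replace = TRUE)))
d$y = 1 + log(d$x1) + 2 * d$x2 + 0.5 * d$x1 * d$x2 + rnorm(30)
lm(y ~ log(x1) + x2, data = d)  # transformation applied to a predictor
lm(y ~ x1 * x2, data = d)  # interaction: expands to x1 + x2 + x1:x2
lm(y ~ x1 + f, data = d)  # 3-level factor f expands to n-1 = 2 indicator variables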
tvads = read.csv("tvads.csv")
str(tvads)
## 'data.frame': 21 obs. of 3 variables:
## $ Brand : Factor w/ 21 levels "ATT/Bell","BudLite",..: 15 17 20 8 3 5 12 13 7 9 ...
## $ Spend : num 50.1 74.1 19.3 22.9 82.4 ...
## $ Impress: num 32.1 99.6 11.7 21.9 60.8 78.6 92.4 50.7 21.4 40.1 ...
head(tvads)
## Brand Spend Impress
## 1 Miller 50.1 32.1
## 2 Pepsi 74.1 99.6
## 3 Strohs 19.3 11.7
## 4 Fedex 22.9 21.9
## 5 BurgerKing 82.4 60.8
## 6 CocaCola 40.1 78.6
grfpairs(tvads[, c("Spend", "Impress")], main = "TV Ad Spending")
simple.model = lm(Impress ~ Spend, data = tvads)
log.model = lm(Impress ~ log(Spend), data = tvads)
lm_compare(simple.model, log.model)
## Residual Standard Error
## simple.model 23.5
## log.model 21.18
## Decreased: -2.326
## log.model preferred
## Adjusted R^2
## simple.model 0.3936
## log.model 0.5077
## Increased: 0.1141
## log.model preferred
## F Statistic
## simple.model 13.98
## log.model 21.63
## Increased: 7.643
## log.model preferred
## F Statistic p-value
## simple.model 0.001389
## log.model 0.0001744
## Decreased: -0.001215
##
## Coeffcient Statistics
## Spend abs(t stat)
## simple.model 3.739
## log.model 4.65
## Increased: 0.911
## log.model preferred
## Spend t stat p-value
## simple.model 0.001389
## log.model 0.0001744
## Decreased: -0.001215
par(mfrow = c(2, 2))
attach(tvads)
simple.model = lm(Impress ~ Spend, data = tvads)
# X^2 transform
plot(Spend, Impress, main = expression("Linear vs" ~ X^2))
abline(simple.model, lty = 3)
Spend2 = Spend^2
poly.model = lm(Impress ~ Spend + Spend2, data = tvads)
lines(sort(Spend), fitted(poly.model)[order(Spend)])
# log transform
plot(Spend, Impress, main = "Linear vs Log(X)")
abline(simple.model, lty = 3)
SpendLog = log(Spend)
log.model = lm(Impress ~ SpendLog, data = tvads)
lines(sort(Spend), fitted(log.model)[order(Spend)])
# reciprocal transform
plot(Spend, Impress, main = expression("Linear vs" ~ 1/X))
abline(simple.model, lty = 3)
SpendRecip = 1/Spend
recip.model = lm(Impress ~ SpendRecip, data = tvads)
lines(sort(Spend), fitted(recip.model)[order(Spend)])
# exponential transform
plot(Spend, Impress, main = expression("Linear vs" ~ e^X))
abline(simple.model, lty = 3)
SpendExp = exp(Spend)
exp.model = lm(Impress ~ SpendExp, data = tvads)
lines(sort(Spend), fitted(exp.model)[order(Spend)])
The log-transform model stipulates that the effect on the response variable is more pronounced at small levels of spending than at higher levels: the value per dollar spent tends to fall, i.e., a declining marginal return/effect.
In biological applications \( \log_2 \) is preferred since it shows a “doubling” effect; in engineering \( \log_{10} \), which shows order-of-magnitude changes, may be preferred.
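For example, under the log model each doubling of Spend adds the same fixed amount to predicted Impress. A quick check (refitting the log model with log() inside the formula, as at the top of this section, so that predict() handles the transform; log.model2 and doublings are my names):
log.model2 = lm(Impress ~ log(Spend), data = tvads)
doublings = data.frame(Spend = c(10, 20, 40, 80))
diff(predict(log.model2, newdata = doublings))  # three equal steps of roughly coef(log.model2)[2] * log(2)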
Linear regression models often produce better results when the predictor variables are symmetrically distributed. Large outliers are “pulled in” with a log transformation.
par(mfrow = c(1, 2))
hist(Spend)
hist(log(Spend))
# What about logs of negative numbers?
par(mfrow = c(1, 3))
XX = rexp(1000) - 1
hist(XX, main = "Negative Numbers\nSkewed")
# Translate XX to (XX + a); for a lower bound of 1, a = (1 - min(XX))
XXxlate = XX + (1 - min(XX))
hist(XXxlate, main = "Translated", xlim = c(min(XX), max(XXxlate)))
qqnorm(XXxlate)
qqline(XXxlate)
par(mfrow = c(1, 2))
hist(log(XXxlate), main = "Log of translate")
qqnorm(log(XXxlate))
qqline(log(XXxlate))
hist(sqrt(XXxlate), main = "Sqrt of translate")
qqnorm(sqrt(XXxlate))
qqline(sqrt(XXxlate))
hist(1/XXxlate, main = "Reciprocal of translate")
qqnorm(1/XXxlate)
qqline(1/XXxlate)
Note: page 147 of Applied Regression Modeling states that when using a polynomial transformation of a predictor, the predictor is commonly rescaled to have \( mean = 0,\; \sigma^2 = 1 \).
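A minimal sketch of that rescaling (x and y here are illustrative, not taken from the data sets in these notes):
set.seed(3)
x = runif(50, 10, 90)  # illustrative predictor
z = as.numeric(scale(x))  # rescaled to mean 0, sd 1
y = 5 + 2 * z - 1.5 * z^2 + rnorm(50)  # illustrative response
summary(lm(y ~ z + I(z^2)))  # quadratic fit in the rescaled predictor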
homes4 = read.csv("homes4.csv")
str(homes4)
## 'data.frame': 76 obs. of 3 variables:
## $ Price: num 388 450 386 350 156 ...
## $ Year : int 1940 1957 1955 1956 1994 1940 1958 1961 1965 1968 ...
## $ Age : int 65 48 50 49 11 65 47 44 40 37 ...
names(homes4)
## [1] "Price" "Year" "Age"
head(homes4)
## Price Year Age
## 1 388.0 1940 65
## 2 450.0 1957 48
## 3 386.0 1955 50
## 4 350.0 1956 49
## 5 155.5 1994 11
## 6 220.0 1940 65
# looking just at Age = 2005 - Year
grfpairs(homes4[, c("Price", "Age")], main = "Home Prices vs. Age")
The lower-left chart gives the idea for a polynomial transform.
plot(homes4$Age, homes4$Price)
simple.model = lm(Price ~ Age, data = homes4)
abline(simple.model, lty = 3)
# poly.model = lm(Price~Age + I(Age^2), data=homes4)
Age_sq = homes4$Age^2
poly.model = lm(Price ~ Age + Age_sq, data = homes4)
lines(sort(homes4$Age), fitted(poly.model)[order(homes4$Age)])
scatter.smooth(homes4$Age, homes4$Price)
abline(simple.model, lty = 3)
The principle of Preserving Hierarchy tells us that if we prefer the model with \( X^2 \), then we should also retain \( X \), whatever its p-value. The exception is when we know (somehow) that the fitted regression line should be flat at \( X = 0 \).
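A quick illustration of hierarchy with the homes4 data: even if Age's own p-value were large, we would keep it alongside the squared term rather than fit the quadratic term alone (the second fit below is shown only as the form to avoid):
hier.model = lm(Price ~ Age + I(Age^2), data = homes4)  # respects hierarchy: X kept alongside X^2
nohier.model = lm(Price ~ I(Age^2), data = homes4)  # drops Age, forcing a flat fitted line at Age = 0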
lm_compare(simple.model, poly.model)
## Residual Standard Error
## simple.model 60.01
## poly.model 56.49
## Decreased: -3.527
## poly.model preferred
## Adjusted R^2
## simple.model 0.01057
## poly.model 0.1235
## Increased: 0.1129
## poly.model preferred
## F Statistic
## simple.model 1.801
## poly.model 6.282
## Increased: 4.481
## poly.model preferred
## F Statistic p-value
## simple.model 0.1837
## poly.model 0.003039
## Decreased: -0.1807
##
## Coeffcient Statistics
## Age abs(t stat)
## simple.model 1.342
## poly.model 3.542
## Increased: 2.2
## poly.model preferred
## Age t stat p-value
## simple.model 0.1837
## poly.model 0.0006952
## Decreased: -0.183
The polynomial model is preferred on all criteria. Note the coefficient on the squared term (Age_sq, i.e. \( Age^2 \)) cannot be compared because it appears only in poly.model.
rm(list = setdiff(ls(), lsf.str())) # remove all but functions
This example clearly shows an improvement in the statistical value of the model that uses the reciprocal of the predictor variable. I do not have a strong intuition for why this should be so, since common sense suggests that \( X \) and \( 1/X \) are mirror images of each other; but the reciprocal is clearly the better predictor here.
cars3 = read.csv("cars3.csv")
str(cars3)
## 'data.frame': 351 obs. of 3 variables:
## $ Name: Factor w/ 351 levels "Acura RL 3.7L",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Cmpg: int 17 18 17 13 12 13 22 21 22 21 ...
## $ Eng : num 3.7 3.5 3.7 5.9 5.9 5.9 2 2 2 2 ...
names(cars3)
## [1] "Name" "Cmpg" "Eng"
head(cars3)
## Name Cmpg Eng
## 1 Acura RL 3.7L 17 3.7
## 2 Acura TL 2wd 3.5L 18 3.5
## 3 Acura TL 4wd 3.7L 17 3.7
## 4 Aston Martin DB9 5.9L 13 5.9
## 5 Aston Martin DBS 5.9L 12 5.9
## 6 Aston Martin Rapide 5.9L 13 5.9
# compare effect of Eng: engine size in liters on Cmpg: city miles per
# gallon
grfpairs(cars3[, c("Cmpg", "Eng")], main = "City MPG and Engine Size")
regular.model = lm(Cmpg ~ Eng, data = cars3)
recip.model = lm(Cmpg ~ I(1/Eng), data = cars3)
par(mfrow = c(2, 1))
plot(cars3$Eng, cars3$Cmpg, main = "Regular Model", xlab = "Engine Size", ylab = "City mpg")
abline(regular.model)
plot(1/cars3$Eng, cars3$Cmpg, main = "Reciprocal Model", xlab = "Reciprocal of Engine Size",
ylab = "City mpg")
lines(sort(1/cars3$Eng), fitted(recip.model)[order(1/cars3$Eng)])
lm_compare(regular.model, recip.model)
## Residual Standard Error
## regular.model 1.929
## recip.model 1.741
## Decreased: -0.1881
## recip.model preferred
## Adjusted R^2
## regular.model 0.7506
## recip.model 0.7969
## Increased: 0.04626
## recip.model preferred
## F Statistic
## regular.model 1055
## recip.model 1374
## Increased: 319.7
## recip.model preferred
## F Statistic p-value
## regular.model 0
## recip.model 0
## Unchanged: 0
##
## Coeffcient Statistics
## Eng abs(t stat)
## regular.model 32.47
## recip.model 37.07
## Increased: 4.597
## recip.model preferred
## Eng t stat p-value
## regular.model 1.67e-107
## recip.model 4.562e-123
## Decreased: -1.67e-107
The reciprocal model is preferred on all criteria.
rm(list = setdiff(ls(), lsf.str())) # remove all but functions
This model makes predictions for \( \ln(Y) \), which can be converted to a prediction for \( Y \) via \( e^{\ln(Y)} \).
This form of modelling may be useful in the following circumstances:
To correct for skewed data in \( Y \), just as the log transform can do for \( X \)
To prevent negative predicted values: \( e^{\ln(Y)} \) is always positive
To transform multiplicative models to simple linear regression:
from \( Y = e^{\beta_0} e^{\beta_1 X_1} e^{\beta_2 X_2} e^\epsilon \Rightarrow \ln(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \)
Note:
The residual standard error is measured in the units of the response variable, so it cannot be compared between models when the response is transformed: in the transformed model its units are the log of those in the un-transformed model.
Similarly, \( R^2 \) measures variation explained in the response, so it is only informally helpful when the responses are measured on different scales.
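A minimal sketch of fitting a log-response model and converting its predictions back to the original units (ex and sal are simulated here for illustration, not the workexp data loaded below):
set.seed(2)
ex = sample(1:30, 40, replace = TRUE)  # illustrative years of experience
sal = exp(3 + 0.05 * ex + rnorm(40, sd = 0.1))  # illustrative salary ($000)
m = lm(log(sal) ~ ex)
pred.log = predict(m, newdata = data.frame(ex = 10))  # prediction for ln(salary) at 10 years
exp(pred.log)  # converted back to the original salary scale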
workexp = read.csv("workexp.csv")
str(workexp)
## 'data.frame': 50 obs. of 2 variables:
## $ Exp: int 8 1 20 21 17 9 12 29 11 12 ...
## $ Sal: num 35 22.9 65.3 74.4 70.5 ...
names(workexp)
## [1] "Exp" "Sal"
head(workexp)
## Exp Sal
## 1 8 35.00
## 2 1 22.88
## 3 20 65.32
## 4 21 74.36
## 5 17 70.54
## 6 9 39.93
# compare the effect of Exp: years of experience on Sal: salary ($,000)
grfpairs(workexp, main = "Experience vs. Salary")
par(mfrow = c(1, 2))
hist(workexp$Sal)
hist(log(workexp$Sal))
par(mfrow = c(1, 1))
regular.model = lm(Sal ~ Exp, data = workexp)
logresp.model = lm(log(Sal) ~ Exp, data = workexp)
lm_compare(regular.model, logresp.model)
## Residual Standard Error
## regular.model 8.357
## logresp.model 0.1248
## Decreased: -8.232
## logresp.model preferred
## Adjusted R^2
## regular.model 0.8619
## logresp.model 0.9102
## Increased: 0.04827
## logresp.model preferred
## F Statistic
## regular.model 306.8
## logresp.model 497.4
## Increased: 190.6
## logresp.model preferred
## F Statistic p-value
## regular.model 0
## logresp.model 0
## Unchanged: 0
##
## Coeffcient Statistics
## Exp abs(t stat)
## regular.model 17.52
## logresp.model 22.3
## Increased: 4.787
## logresp.model preferred
## Exp t stat p-value
## regular.model 1.738e-22
## logresp.model 5.586e-27
## Decreased: -1.738e-22
Note: as discussed above, the residual standard error of logresp.model is in log(Sal) units, so its large decrease relative to regular.model is not directly comparable; the adjusted \( R^2 \) comparison is likewise only informal.
rm(list = setdiff(ls(), lsf.str())) # remove all but functions
hometax = read.csv("hometax.csv")
str(hometax)
## 'data.frame': 104 obs. of 2 variables:
## $ Tax : int 1639 1088 1193 1635 1732 1534 1765 1161 1010 1191 ...
## $ Price: num 205 208 215 215 200 ...
names(hometax)
## [1] "Tax" "Price"
head(hometax)
## Tax Price
## 1 1639 205.0
## 2 1088 208.0
## 3 1193 215.0
## 4 1635 215.0
## 5 1732 199.9
## 6 1534 190.0
grfpairs(hometax, main = "Annual Tax bill vs. Sale Price ($,000)")
BUT taxes are a function of assessed home value which is very nearly the same as sale price, so “prediction” doesn't seem warranted here.
simple.model = lm(Tax ~ Price, data = hometax)
lnTax = log(hometax$Tax)
lnPrice = log(hometax$Price)
loglog.model = lm(lnTax ~ lnPrice)
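To see both fits on the original scale, the log-log fitted values can be exponentiated back into Tax units (a quick visual check of mine, in the spirit of the earlier transform plots):
plot(hometax$Price, hometax$Tax, xlab = "Sale Price ($,000)", ylab = "Annual Tax")
abline(simple.model, lty = 3)  # linear fit
ord = order(hometax$Price)
lines(hometax$Price[ord], exp(fitted(loglog.model))[ord])  # back-transformed log-log fit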
lm_compare(simple.model, loglog.model)
## Residual Standard Error
## simple.model 141.9
## loglog.model 0.1616
## Decreased: -141.7
## loglog.model preferred
## Adjusted R^2
## simple.model 0.7751
## loglog.model 0.7825
## Increased: 0.007428
## loglog.model preferred
## F Statistic
## simple.model 355.9
## loglog.model 371.6
## Increased: 15.64
## loglog.model preferred
## F Statistic p-value
## simple.model 0
## loglog.model 0
## Unchanged: 0
##
## Coeffcient Statistics
## Price abs(t stat)
## simple.model 18.87
## loglog.model 19.28
## Increased: 0.41
## loglog.model preferred
## Price t stat p-value
## simple.model 4.879e-35
## loglog.model 8.76e-36
## Decreased: -4.003e-35
Note: as above, the residual standard error of loglog.model is in log(Tax) units, so its dramatic decrease relative to simple.model is not directly comparable; the adjusted \( R^2 \) comparison is likewise only informal.
rm(list = setdiff(ls(), lsf.str())) # remove all but functions
This example examines the hypothesis that the effect of Advertising on Sales varies with the Interest rate.
This can be done by creating a new variable AdvInt = Advert * Interest and including it in the model, or equivalently by using the formula Sales ~ Advert * Interest, which R expands to both main effects plus their interaction (a quick check follows the data load below).
sales1 = read.csv("sales1.csv")
str(sales1)
## 'data.frame': 12 obs. of 3 variables:
## $ Sales : num 4 6 8 2 4.5 9 4.5 8 10.5 5 ...
## $ Advert : num 3.5 5.5 7 1 3 6.5 2 4 6 1 ...
## $ Interest: int 5 5 5 4 4 4 3 3 3 2 ...
grfpairs(sales1, main = "Sales vs. Advertising and Interest Rates")
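As a quick check that the explicit product variable gives exactly the same fit as R's * formula shorthand (AdvInt and byhand.model are names introduced here for illustration):
AdvInt = sales1$Advert * sales1$Interest
byhand.model = lm(Sales ~ Advert + Interest + AdvInt, data = sales1)
all.equal(unname(coef(byhand.model)), unname(coef(lm(Sales ~ Advert * Interest, data = sales1))))  # TRUE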
model = lm(Sales ~ Advert * Interest, data = sales1)
summary(model)
##
## Call:
## lm(formula = Sales ~ Advert * Interest, data = sales1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4225 -0.2071 0.0377 0.1642 0.3959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9415 0.6617 8.98 1.9e-05 ***
## Advert 1.8363 0.1349 13.61 8.2e-07 ***
## Interest -1.3119 0.1967 -6.67 0.00016 ***
## Advert:Interest -0.1260 0.0386 -3.26 0.01151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.308 on 8 degrees of freedom
## Multiple R-squared: 0.994, Adjusted R-squared: 0.992
## F-statistic: 472 on 3 and 8 DF, p-value: 2.44e-09
# Y = sales1$Sales Xdf = sales1 excluding Y
lm_find.interactions(sales1$Sales, subset(sales1, select = -c(Sales)))
## Advert : Interest interaction p-value = 0.01151
The Advert:Interest interaction term has a p-value less than 0.05, so we conclude there is an interaction.
# see computer help #33
X = sales1$Advert
Y = sales1$Sales
Factor = as.factor(sales1$Interest)
plot(X, Y, type = "n", ann = FALSE)
for (i in 1:nrow(sales1)) points(X[i], Y[i], pch = as.character(Factor[i]))
for (i in levels(Factor)) lines(X[Factor == i], fitted(model)[Factor == i])
title(xlab = "Adversising (in $mm)", ylab = "Sales (in $mm)", main = "Sales by Advertising for separate Interest Rate levels")
This graph makes it clear that the slope of Sales against Advertising differs by interest-rate level: the higher the interest rate, the weaker the effect of advertising, which is consistent with the negative Advert:Interest coefficient.
rm(list = setdiff(ls(), lsf.str())) # remove all but functions
Incorporate qualitative (factor or categorical) predictors using indicator (or dummy) variables.
In this example DF is the indicator variable ("Dummy Female"), where 1 indicates Female:
DF = ifelse(salgpa1$Gender == "F", yes = 1, no = 0)
lm_create.indicators uses this trick to create indicator variables from a factor input.
Note: use one fewer indicator than the number of levels; the omitted level becomes the baseline.
Note: whether we use DF (Female = 1) or DM (Male = 1) makes no difference to the outcome of the analysis.
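A base-R sketch of the same idea for a general factor (model.matrix drops the first level automatically; the factor f below is illustrative, and lm_create.indicators is the author's own helper):
f = factor(c("F", "M", "F", "M", "M"))  # illustrative 2-level factor
model.matrix(~f)[, -1]  # the n-1 = 1 indicator column, fM (Male = 1)
ifelse(f == "F", 1, 0)  # or build DF (Female = 1) by hand, as above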
salgpa1 = read.csv("salgpa1.csv")
str(salgpa1)
## 'data.frame': 10 obs. of 4 variables:
## $ Salary: num 59.8 44.2 52 62.4 75.4 70.2 59.8 28.6 39 49.4
## $ Gpa : num 3 2.5 1 3.5 3.5 3 2 1.5 2 1.5
## $ Gender: Factor w/ 2 levels "F","M": 2 2 1 2 1 1 1 2 2 1
## $ DF : int 0 0 1 0 1 1 1 0 0 1
names(salgpa1)
## [1] "Salary" "Gpa" "Gender" "DF"
head(salgpa1)
## Salary Gpa Gender DF
## 1 59.8 3.0 M 0
## 2 44.2 2.5 M 0
## 3 52.0 1.0 F 1
## 4 62.4 3.5 M 0
## 5 75.4 3.5 F 1
## 6 70.2 3.0 F 1
# DF is the indicator variable for Gender
grfpairs(salgpa1, main = "Annual Starting Salary and GPA\nDF is the indicator variable for Gender")
DfGPA = salgpa1$DF * salgpa1$Gpa
df.model = lm(Salary ~ DF + Gpa + DfGPA, data = salgpa1)
summary(df.model)
##
## Call:
## lm(formula = Salary ~ DF + Gpa + DfGPA, data = salgpa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.595 -1.690 0.393 0.916 4.160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.60 5.29 0.49 0.64039
## DF 35.61 6.46 5.52 0.00149 **
## Gpa 17.68 2.04 8.69 0.00013 ***
## DfGPA -7.16 2.56 -2.80 0.03128 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.22 on 6 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.95
## F-statistic: 57.5 on 3 and 6 DF, p-value: 8.19e-05
The model:
\( E(Salary) = \beta_0 + \beta_1 DF + \beta_2 Gpa + \beta_3 DfGPA \)
\( E(Salary) = \) 2.60 + 35.61 DF + 17.68 Gpa - 7.16 DfGPA
Setting DF = 0 we get the model for male salaries
\( E(Salary Males) = \beta_0 + \beta_2 Gpa \)
\( E(Salary Males) = \) 2.6 + 17.68 Gpa
Setting DF = 1 we get the model for female salaries
\( E(Salary Females) = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) Gpa \)
\( E(Salary Females) = \) 38.21 + 10.52 Gpa
The model shows that females start at higher salaries but their salaries grow more slowly with GPA.
The fact that the lines are not parallel (different slopes) suggests an interaction between Gender (as represented by the indicator variable DF) and GPA. The fact that DfGPA has a p-value of 0.0313, below the 5% significance level, seems to confirm this by rejecting the null hypothesis that \( \beta_3 \) is actually zero in the population as a whole.
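These sub-model intercepts and slopes can be read straight off the fitted coefficients (a quick check of the arithmetic above):
b = coef(df.model)
unname(c(b["(Intercept)"], b["Gpa"]))  # male intercept and slope
unname(c(b["(Intercept)"] + b["DF"], b["Gpa"] + b["DfGPA"]))  # female intercept and slope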
# see computer help #33
X = salgpa1$Gpa
Y = salgpa1$Salary
Factor = salgpa1$Gender # use factor() if it isn't one already
plot(X, Y, type = "n", ann = FALSE)
for (i in 1:nrow(salgpa1)) points(X[i], Y[i], pch = as.character(Factor[i]))
for (i in levels(Factor)) lines(X[Factor == i], fitted(df.model)[Factor == i])
title(xlab = "GPA", ylab = "Salary ($000)", main = "Effect of GPA on Salary, by Gender")
simple.model = lm(Salary ~ DF + Gpa, data = salgpa1)
lm_compare(simple.model, df.model, coefficients = c(2, 3))
## Residual Standard Error
## simple.model 4.522
## df.model 3.218
## Decreased: -1.304
## df.model preferred
## Adjusted R^2
## simple.model 0.9005
## df.model 0.9496
## Increased: 0.04913
## df.model preferred
## F Statistic
## simple.model 41.72
## df.model 57.54
## Increased: 15.82
## df.model preferred
## F Statistic p-value
## simple.model 0.000129
## df.model 8.187e-05
## Decreased: -4.713e-05
##
## Coeffcient Statistics
## DF abs(t stat)
## simple.model 6.366
## df.model 5.516
## Decreased: -0.8498
## simple.model preferred
## DF t stat p-value
## simple.model 0.0003794
## df.model 0.001492
## Increased: 0.001113
##
## Gpa abs(t stat)
## simple.model 7.584
## df.model 8.687
## Increased: 1.103
## df.model preferred
## Gpa t stat p-value
## simple.model 0.0001279
## df.model 0.0001284
## Increased: 5.228e-07
This result confirms the hypothesis test: there is an interaction, and including an interaction variable improves the model.
cars4 = read.csv("cars4.csv")
str(cars4)
## 'data.frame': 26 obs. of 7 variables:
## $ Name : Factor w/ 26 levels "Audi A3 2L","Chevrolet Aveo 1.6L",..: 9 12 17 18 19 21 24 2 3 4 ...
## $ Cmpg : int 29 25 26 17 23 27 29 25 24 22 ...
## $ Eng : num 1.6 1.8 1.6 3.8 2.5 1.8 1.5 1.6 1.4 1.8 ...
## $ Class: Factor w/ 3 levels "CO","SC","SW": 2 2 2 2 2 2 2 1 1 1 ...
## $ DSc : int 1 1 1 1 1 1 1 0 0 0 ...
## $ DCo : int 0 0 0 0 0 0 0 1 1 1 ...
## $ DSw : int 0 0 0 0 0 0 0 0 0 0 ...
names(cars4)
## [1] "Name" "Cmpg" "Eng" "Class" "DSc" "DCo" "DSw"
head(cars4)
## Name Cmpg Eng Class DSc DCo DSw
## 1 Ford Fiesta Fwd 1.6L 29 1.6 SC 1 0 0
## 2 Honda Civic 1.8L 25 1.8 SC 1 0 0
## 3 Mini Cooper S Clubman 1.6L 26 1.6 SC 1 0 0
## 4 Mitsubishi Eclipse 3.8L 17 3.8 SC 1 0 0
## 5 Nissan Altima Coupe 2.5L 23 2.5 SC 1 0 0
## 6 Scion xD 1.8L 27 1.8 SC 1 0 0
# Dxx are dummy/indicator variables for class of car
grfpairs(cars4[, c("Cmpg", "Eng", "Class")], main = "City MPG based on Engine Size by Class of Car")
lm_create.indicators(cars4$Class, ret = FALSE)
##
## indicators created: D.CO, D.SC, D.SW
# create response variable
Cgphm = 100/cars4$Cmpg  # city gallons per 100 miles
# create interaction variables; use the DCo and DSw indicator columns already in
# cars4 (DSc, Subcompact, is the omitted baseline), since the D.xx objects listed
# above are not available in this workspace
DCO_ENG = cars4$DCo * cars4$Eng
DSW_ENG = cars4$DSw * cars4$Eng
model = lm(Cgphm ~ DCo + DSw + Eng + DCO_ENG + DSW_ENG, data = cars4)
summary(model)
# see computer help #33
X = cars4$Eng
Y = Cgphm
Factor = cars4$Class # use factor() if it isn't one already
plot(X, Y, type = "n", ann = FALSE)
for (i in 1:nrow(cars4)) points(X[i], Y[i], pch = as.character(substr(Factor[i],
2, 2)))
for (i in levels(Factor)) lines(X[Factor == i], fitted(model)[Factor == i])
title(xlab = "Engine Size (liters)", ylab = "City gallons/100 miles", main = "Effect of Engine Size on Efficiency, by Class of Car")
text(3.25, 3.75, labels = "C = SC = Subcompact\nO = CO = Compact\nW = SW = Station Wagon")
The model summary shows very high p-values for DSw, the indicator for Station Wagon, and DSW_ENG, the interaction between the Station Wagon indicator and Engine Size.
The lines for Subcompact and Station Wagon are (to the eye) nearly parallel.
Let's remove DSW_ENG and see how a new model performs vs. the original:
new.model = lm(Cgphm ~ DCo + DSw + Eng + DCO_ENG, data = cars4)
lm_compare(model, new.model, coefficients = c(2, 3))
The new model is the clear winner.