Multiple Linear Regression

Transforms, Interactions, Indicators

Linear regression, à la \( Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots \), assumes a \( \beta_i \) change in the response variable \( Y \) for a unit change in predictor variable \( X_i \), holding the other predictors \( X_{j \ne i} \) constant.

  1. Transformations address a non-linear response-predictor relationship by applying a function (e.g., log, polynomial) to predictor and/or response variables.

  2. Interactions between predictor variables may be present, in which case the interacting predictors are multiplied together.
    see lm_find.interactions

  3. Qualitative predictors are incorporated by replacing them with n - 1 binary indicator variables, where n = the number of levels (see the model.matrix sketch just below).
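
A quick illustration of the n - 1 coding, using base R's model.matrix (a minimal sketch; the factor values here are made up):

f = factor(c("CO", "SC", "SW", "SC"))
model.matrix(~f)  # an intercept plus n - 1 = 2 indicator columns (fSC, fSW)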

The effects of such changes are considered improvements if we see:

  1. reduced residual standard error

  2. increased adjusted \( R^2 \)

  3. increased F statistic
    and reduced associated p-value

  4. increased abs(t statistic)
    and reduced p-values of the individual predictor variables

See lm_compare.R
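
lm_compare is a helper defined in lm_compare.R; a minimal sketch of the kind of side-by-side comparison it performs (compare_fits is an assumed name, not the real implementation):

compare_fits = function(m1, m2) {
    s1 = summary(m1)
    s2 = summary(m2)
    # pull the headline statistics from each summary.lm object
    data.frame(row.names = c("Resid. Std. Error", "Adjusted R^2", "F Statistic"),
        model1 = c(s1$sigma, s1$adj.r.squared, s1$fstatistic[1]),
        model2 = c(s2$sigma, s2$adj.r.squared, s2$fstatistic[1]))
}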


Natural log transform of predictor

tvads = read.csv("tvads.csv")
str(tvads)
## 'data.frame':    21 obs. of  3 variables:
##  $ Brand  : Factor w/ 21 levels "ATT/Bell","BudLite",..: 15 17 20 8 3 5 12 13 7 9 ...
##  $ Spend  : num  50.1 74.1 19.3 22.9 82.4 ...
##  $ Impress: num  32.1 99.6 11.7 21.9 60.8 78.6 92.4 50.7 21.4 40.1 ...
head(tvads)
##        Brand Spend Impress
## 1     Miller  50.1    32.1
## 2      Pepsi  74.1    99.6
## 3     Strohs  19.3    11.7
## 4      Fedex  22.9    21.9
## 5 BurgerKing  82.4    60.8
## 6   CocaCola  40.1    78.6

grfpairs(tvads[, c("Spend", "Impress")], main = "TV Ad Spending")

[Figure: pairs plot, "TV Ad Spending"]


Simple regression

simple.model = lm(Impress ~ Spend, data = tvads)

Natural log of the predictor

log.model = lm(Impress ~ log(Spend), data = tvads)

Comparison of the models

lm_compare(simple.model, log.model)
## Residual Standard Error 
##       simple.model        23.5 
##       log.model       21.18 
##       Decreased:      -2.326 
##               log.model preferred 
## Adjusted R^2 
##       simple.model        0.3936 
##       log.model       0.5077 
##       Increased:      0.1141 
##               log.model preferred 
## F Statistic 
##       simple.model        13.98 
##       log.model       21.63 
##       Increased:      7.643 
##               log.model preferred 
## F Statistic p-value 
##       simple.model        0.001389 
##       log.model       0.0001744 
##       Decreased:      -0.001215 
## 
## Coefficient Statistics 
##   Spend  abs(t stat) 
##       simple.model        3.739 
##       log.model       4.65 
##       Increased:      0.911 
##               log.model preferred 
##   Spend  t stat p-value 
##       simple.model        0.001389 
##       log.model       0.0001744 
##       Decreased:      -0.001215

Predictor Effect Plot (my variation)

Display the effects of transformed predictors vs. the linear fit

par(mfrow = c(2, 2))
attach(tvads)
simple.model = lm(Impress ~ Spend, data = tvads)

# X^2 transform
plot(Spend, Impress, main = paste("Linear vs", expression(X^2)))
abline(simple.model, lty = 3)

Spend2 = Spend^2
poly.model = lm(Impress ~ Spend + Spend2, data = tvads)
lines(sort(Spend), fitted(poly.model)[order(Spend)])

# log transform
plot(Spend, Impress, main = "Linear vs Log(X)")
abline(simple.model, lty = 3)

SpendLog = log(Spend)
log.model = lm(Impress ~ SpendLog, data = tvads)
lines(sort(Spend), fitted(log.model)[order(Spend)])

# reciprocal transform
plot(Spend, Impress, main = paste("Linear vs ", expression(1/X)))
abline(simple.model, lty = 3)

SpendRecip = 1/Spend
recip.model = lm(Impress ~ SpendRecip, data = tvads)
lines(sort(Spend), fitted(recip.model)[order(Spend)])

# exponential transform
plot(Spend, Impress, main = paste("Linear vs ", expression(e^X)))
abline(simple.model, lty = 3)

SpendExp = exp(Spend)
exp.model = lm(Impress ~ SpendExp, data = tvads)
lines(sort(Spend), fitted(exp.model)[order(Spend)])

[Figure: 2 x 2 panels comparing the linear fit with X^2, log(X), 1/X, and e^X transforms]

The log-transform model stipulates that the effect on the response variable is more pronounced at low levels of spending than at high levels: the value per dollar spent tends to fall, i.e., declining marginal returns.
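
To see why: for \( Y = \beta_0 + \beta_1 \ln(X) \), the marginal effect is \( dY/dX = \beta_1 / X \), which shrinks as \( X \) grows; equivalently, a 1% increase in \( X \) raises \( Y \) by roughly \( \beta_1 / 100 \) at any spending level.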


Different logs

In biological applications \( \log_2 \) is preferred since it shows a "doubling" effect; in engineering \( \log_{10} \), which shows order-of-magnitude changes, may be preferred.
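
In R, log() is the natural log, with log2() and log10() for the other bases:

log(exp(1))  # natural log: 1
log2(8)      # base 2: 3
log10(1000)  # base 10: 3

Whichever base is chosen the fitted model is equivalent; since \( \log_b(x) = \ln(x)/\ln(b) \), only the coefficient is rescaled.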

Indication of log transform

Linear regression models often produce better results when the predictor variables are symmetrically distributed. Large outliers are “pulled in” with a log transformation.

par(mfrow = c(1, 2))
hist(Spend)
hist(log(Spend))

[Figure: histograms of Spend and log(Spend)]


# What about logs of negative numbers?
par(mfrow = c(1, 3))
XX = rexp(1000) - 1
hist(XX, main = "Negative Numbers\nSkewed")

# Translate XX to (XX + a); for a lower bound of 1, a = (1 - min(XX))
XXxlate = XX + (1 - min(XX))
hist(XXxlate, main = "Translated", xlim = c(min(XX), max(XXxlate)))
qqnorm(XXxlate)
qqline(XXxlate)

[Figure: histogram of skewed negative-valued data, translated histogram, and normal Q-Q plot]


par(mfrow = c(1, 2))
hist(log(XXxlate), main = "Log of translate")
qqnorm(log(XXxlate))
qqline(log(XXxlate))

[Figure: histogram and Q-Q plot of the log of the translated data]

hist(sqrt(XXxlate), main = "Sqrt of translate")
qqnorm(sqrt(XXxlate))
qqline(sqrt(XXxlate))

[Figure: histogram and Q-Q plot of the square root of the translated data]

hist(1/XXxlate, main = "Reciprocal of translate")
qqnorm(1/XXxlate)
qqline(1/XXxlate)

[Figure: histogram and Q-Q plot of the reciprocal of the translated data]


Polynomial transform of predictor

Note: see page 147 of Applied Regression Modeling, which states that when using a polynomial transformation of a predictor, the predictor is commonly rescaled to have \( \text{mean} = 0,\; \sigma^2 = 1 \) (a standardization sketched below).

homes4 = read.csv("homes4.csv")
str(homes4)
## 'data.frame':    76 obs. of  3 variables:
##  $ Price: num  388 450 386 350 156 ...
##  $ Year : int  1940 1957 1955 1956 1994 1940 1958 1961 1965 1968 ...
##  $ Age  : int  65 48 50 49 11 65 47 44 40 37 ...
names(homes4)
## [1] "Price" "Year"  "Age"
head(homes4)
##   Price Year Age
## 1 388.0 1940  65
## 2 450.0 1957  48
## 3 386.0 1955  50
## 4 350.0 1956  49
## 5 155.5 1994  11
## 6 220.0 1940  65

# looking just at Age = 2005 - Year
grfpairs(homes4[, c("Price", "Age")], main = "Home Prices vs. Age")

[Figure: pairs plot, "Home Prices vs. Age"]

The lower-left chart suggests a polynomial transform.


Plot linear vs. 2nd-order polynomial

plot(homes4$Age, homes4$Price)

simple.model = lm(Price ~ Age, data = homes4)
abline(simple.model, lty = 3)

# poly.model = lm(Price~Age + I(Age^2), data=homes4)
Age_sq = homes4$Age^2
poly.model = lm(Price ~ Age + Age_sq, data = homes4)
lines(sort(homes4$Age), fitted(poly.model)[order(homes4$Age)])

[Figure: linear fit (dotted) vs. second-order polynomial fit of Price on Age]
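
Per the rescaling note above, a sketch of the same quadratic fit with a standardized predictor (AgeZ and polyz.model are assumed names; base R's poly(Age, 2) is an orthogonal-polynomial alternative):

AgeZ = as.numeric(scale(homes4$Age))  # rescaled to mean 0, sd 1
polyz.model = lm(Price ~ AgeZ + I(AgeZ^2), data = homes4)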


Add a loess fitted line

scatter.smooth(homes4$Age, homes4$Price)
abline(simple.model, lty = 3)

[Figure: loess fit (solid) with the linear fit (dotted)]


Compare the models

The principle of Preserving Hierarchy tells us that if we prefer the model with \( X^2 \), then we should also retain \( X \), whatever its p-value. The exception is when we know (somehow) that the fitted regression line should be flat at \( X = 0 \).
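
A nested-model F test is another way to check whether the quadratic term earns its keep (a sketch using the two models fitted above):

# H0: the Age_sq coefficient is zero; a small p-value favors poly.model
anova(simple.model, poly.model)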


lm_compare(simple.model, poly.model)
## Residual Standard Error 
##       simple.model        60.01 
##       poly.model          56.49 
##       Decreased:      -3.527 
##               poly.model preferred 
## Adjusted R^2 
##       simple.model        0.01057 
##       poly.model          0.1235 
##       Increased:      0.1129 
##               poly.model preferred 
## F Statistic 
##       simple.model        1.801 
##       poly.model          6.282 
##       Increased:      4.481 
##               poly.model preferred 
## F Statistic p-value 
##       simple.model        0.1837 
##       poly.model          0.003039 
##       Decreased:      -0.1807 
## 
## Coefficient Statistics 
##   Age  abs(t stat) 
##       simple.model        1.342 
##       poly.model          3.542 
##       Increased:      2.2 
##               poly.model preferred 
##   Age  t stat p-value 
##       simple.model        0.1837 
##       poly.model          0.0006952 
##       Decreased:      -0.183

The polynomial model is preferred by all measures. Note the coefficient on Age_sq cannot be compared because it appears only in poly.model.

rm(list = setdiff(ls(), lsf.str()))  # remove all but functions

Reciprocal transform of predictor

This example clearly shows an improvement in the statistical value of the model which uses the reciprocal of the predictor variable. I do not have any intuition for why this should be so, since the common-sense view is that the two are mirror images of each other; but it is clearly true that the reciprocal is the better predictor.
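
One candidate explanation: if fuel burned per mile scales roughly linearly with displacement, \( gpm \approx c \cdot Eng \), then \( mpg = 1/gpm \approx (1/c)(1/Eng) \), so city mpg is closer to linear in \( 1/Eng \) than in \( Eng \).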

cars3 = read.csv("cars3.csv")
str(cars3)
## 'data.frame':    351 obs. of  3 variables:
##  $ Name: Factor w/ 351 levels "Acura RL 3.7L",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Cmpg: int  17 18 17 13 12 13 22 21 22 21 ...
##  $ Eng : num  3.7 3.5 3.7 5.9 5.9 5.9 2 2 2 2 ...
names(cars3)
## [1] "Name" "Cmpg" "Eng"
head(cars3)
##                       Name Cmpg Eng
## 1            Acura RL 3.7L   17 3.7
## 2        Acura TL 2wd 3.5L   18 3.5
## 3        Acura TL 4wd 3.7L   17 3.7
## 4    Aston Martin DB9 5.9L   13 5.9
## 5    Aston Martin DBS 5.9L   12 5.9
## 6 Aston Martin Rapide 5.9L   13 5.9

# compare effect of Eng: engine size in liters on Cmpg: city miles per
# gallon
grfpairs(cars3[, c("Cmpg", "Eng")], main = "City MPG and Engine Size")

[Figure: pairs plot, "City MPG and Engine Size"]

regular.model = lm(Cmpg ~ Eng, data = cars3)
recip.model = lm(Cmpg ~ I(1/Eng), data = cars3)

Regular and reciprocal models

plots

par(mfrow = c(2, 1))
plot(cars3$Eng, cars3$Cmpg, main = "Regular Model", xlab = "Engine Size", ylab = "City mpg")
abline(regular.model)

plot(1/cars3$Eng, cars3$Cmpg, main = "Reciprocal Model", xlab = "Reciprocal of Engine Size", 
    ylab = "City mpg")
lines(sort(1/cars3$Eng), fitted(recip.model)[order(1/cars3$Eng)])

[Figure: regular model (top) and reciprocal model (bottom) fits]


Regular and reciprocal models

analysis

lm_compare(regular.model, recip.model)
## Residual Standard Error 
##       regular.model       1.929 
##       recip.model         1.741 
##       Decreased:      -0.1881 
##               recip.model preferred 
## Adjusted R^2 
##       regular.model       0.7506 
##       recip.model         0.7969 
##       Increased:      0.04626 
##               recip.model preferred 
## F Statistic 
##       regular.model       1055 
##       recip.model         1374 
##       Increased:      319.7 
##               recip.model preferred 
## F Statistic p-value 
##       regular.model       0 
##       recip.model         0 
##       Unchanged:      0 
## 
## Coefficient Statistics 
##   Eng  abs(t stat) 
##       regular.model       32.47 
##       recip.model         37.07 
##       Increased:      4.597 
##               recip.model preferred 
##   Eng  t stat p-value 
##       regular.model       1.67e-107 
##       recip.model         4.562e-123 
##       Decreased:      -1.67e-107

The reciprocal model is preferred by all measures.

rm(list = setdiff(ls(), lsf.str()))  # remove all but functions

Natural log transform of the response variable

This model makes predictions for \( \ln(Y) \), which can be converted to predictions for \( Y \) via \( e^{\ln(Y)} \) (illustrated after the model comparison below).

This form of modelling may be useful in the following circumstances:

  1. To correct for skewed data in Y just as it can do for X

  2. To prevent negative predicted values: \( e^{\ln(Y)} \) is always positive

  3. To transform multiplicative models to simple linear regression:
    from \( Y = e^{\beta_0} e^{\beta_1 X_1} e^{\beta_2 X_2} e^\epsilon \Rightarrow \ln(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \)

Note:

  1. The residual standard error is measured in the units of the response variable and therefore cannot be compared between the models, since the response in the transformed model is on the log scale of the response in the un-transformed model.

  2. Similarly, \( R^2 \) measures explained variation in the response variable and is only informally comparable when the responses are measured in different units.


workexp = read.csv("workexp.csv")
str(workexp)
## 'data.frame':    50 obs. of  2 variables:
##  $ Exp: int  8 1 20 21 17 9 12 29 11 12 ...
##  $ Sal: num  35 22.9 65.3 74.4 70.5 ...
names(workexp)
## [1] "Exp" "Sal"
head(workexp)
##   Exp   Sal
## 1   8 35.00
## 2   1 22.88
## 3  20 65.32
## 4  21 74.36
## 5  17 70.54
## 6   9 39.93

# compare the effect of Exp: years of experience on Sal: salary ($,000)
grfpairs(workexp, main = "Experience vs. Salary")

[Figure: pairs plot, "Experience vs. Salary"]


Compare \( Y \) and \( \ln(Y) \)

par(mfrow = c(1, 2))
hist(workexp$Sal)
hist(log(workexp$Sal))

[Figure: histograms of Sal and log(Sal)]

par(mfrow = c(1, 1))

Compare Models

regular.model = lm(Sal ~ Exp, data = workexp)
logresp.model = lm(log(Sal) ~ Exp, data = workexp)
lm_compare(regular.model, logresp.model)
## Residual Standard Error 
##       regular.model       8.357 
##       logresp.model       0.1248 
##       Decreased:      -8.232 
##               logresp.model preferred 
## Adjusted R^2 
##       regular.model       0.8619 
##       logresp.model       0.9102 
##       Increased:      0.04827 
##               logresp.model preferred 
## F Statistic 
##       regular.model       306.8 
##       logresp.model       497.4 
##       Increased:      190.6 
##               logresp.model preferred 
## F Statistic p-value 
##       regular.model       0 
##       logresp.model       0 
##       Unchanged:      0 
## 
## Coefficient Statistics 
##   Exp  abs(t stat) 
##       regular.model       17.52 
##       logresp.model       22.3 
##       Increased:      4.787 
##               logresp.model preferred 
##   Exp  t stat p-value 
##       regular.model       1.738e-22 
##       logresp.model       5.586e-27 
##       Decreased:      -1.738e-22

Note

  1. residual standard error is not comparable
  2. adjusted \( R^2 \) is only informally comparable
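
As promised above, a log-scale prediction converts back to salary units by exponentiating (a sketch; the query value Exp = 10 is arbitrary):

new = data.frame(Exp = 10)
exp(predict(logresp.model, newdata = new))  # point prediction for Sal
exp(predict(logresp.model, newdata = new, interval = "prediction"))  # back-transformed interval

Strictly, exponentiating the fitted log-scale value estimates the conditional median of \( Y \), not its mean.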
rm(list = setdiff(ls(), lsf.str()))  # remove all but functions

Transformations for both response and predictors

hometax = read.csv("hometax.csv")
str(hometax)
## 'data.frame':    104 obs. of  2 variables:
##  $ Tax  : int  1639 1088 1193 1635 1732 1534 1765 1161 1010 1191 ...
##  $ Price: num  205 208 215 215 200 ...
names(hometax)
## [1] "Tax"   "Price"
head(hometax)
##    Tax Price
## 1 1639 205.0
## 2 1088 208.0
## 3 1193 215.0
## 4 1635 215.0
## 5 1732 199.9
## 6 1534 190.0

grfpairs(hometax, main = "Annual Tax bill vs. Sale Price ($,000)")

[Figure: pairs plot, "Annual Tax bill vs. Sale Price ($,000)"]

  1. Both variables seem skewed
  2. The variance clearly increases with sale price, which violates the constant-variance assumption

BUT taxes are a function of assessed home value, which is very nearly the same as sale price, so "prediction" doesn't seem warranted here.

simple.model = lm(Tax ~ Price, data = hometax)

lnTax = log(hometax$Tax)
lnPrice = log(hometax$Price)
loglog.model = lm(lnTax ~ lnPrice)

lm_compare(simple.model, loglog.model)
## Residual Standard Error 
##       simple.model        141.9 
##       loglog.model        0.1616 
##       Decreased:      -141.7 
##               loglog.model preferred 
## Adjusted R^2 
##       simple.model        0.7751 
##       loglog.model        0.7825 
##       Increased:      0.007428 
##               loglog.model preferred 
## F Statistic 
##       simple.model        355.9 
##       loglog.model        371.6 
##       Increased:      15.64 
##               loglog.model preferred 
## F Statistic p-value 
##       simple.model        0 
##       loglog.model        0 
##       Unchanged:      0 
## 
## Coefficient Statistics 
##   Price  abs(t stat) 
##       simple.model        18.87 
##       loglog.model        19.28 
##       Increased:      0.41 
##               loglog.model preferred 
##   Price  t stat p-value 
##       simple.model        4.879e-35 
##       loglog.model        8.76e-36 
##       Decreased:      -4.003e-35

Note

  1. residual standard error is not comparable
  2. adjusted \( R^2 \) is only informally comparable
rm(list = setdiff(ls(), lsf.str()))  # remove all but functions

Interactions

This example examines the hypothesis that the effect of Advertising on Sales varies with the Interest rate.

This is done by adding an interaction term, AdvInt = Advert * Interest, to the model.
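
Note that in an R model formula, Advert * Interest expands to the main effects plus the interaction, so the product variable need not be built by hand; these two fits are equivalent:

lm(Sales ~ Advert * Interest, data = sales1)
lm(Sales ~ Advert + Interest + I(Advert * Interest), data = sales1)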

sales1 = read.csv("sales1.csv")
str(sales1)
## 'data.frame':    12 obs. of  3 variables:
##  $ Sales   : num  4 6 8 2 4.5 9 4.5 8 10.5 5 ...
##  $ Advert  : num  3.5 5.5 7 1 3 6.5 2 4 6 1 ...
##  $ Interest: int  5 5 5 4 4 4 3 3 3 2 ...

grfpairs(sales1, main = "Sales vs. Advertising and Interest Rates")

[Figure: pairs plot, "Sales vs. Advertising and Interest Rates"]


Create the model to test for interaction

model = lm(Sales ~ Advert * Interest, data = sales1)
summary(model)
## 
## Call:
## lm(formula = Sales ~ Advert * Interest, data = sales1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4225 -0.2071  0.0377  0.1642  0.3959 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.9415     0.6617    8.98  1.9e-05 ***
## Advert            1.8363     0.1349   13.61  8.2e-07 ***
## Interest         -1.3119     0.1967   -6.67  0.00016 ***
## Advert:Interest  -0.1260     0.0386   -3.26  0.01151 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.308 on 8 degrees of freedom
## Multiple R-squared:  0.994,  Adjusted R-squared:  0.992 
## F-statistic:  472 on 3 and 8 DF,  p-value: 2.44e-09

use lm_find.interactions

# Y = sales1$Sales Xdf = sales1 excluding Y
lm_find.interactions(sales1$Sales, subset(sales1, select = -c(Sales)))
## Advert : Interest    interaction p-value = 0.01151
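
lm_find.interactions is a helper defined elsewhere; a minimal sketch of what such a search might do (find_interactions is an assumed name, not the real implementation):

find_interactions = function(y, Xdf, alpha = 0.05) {
    nm = names(Xdf)
    for (pair in combn(nm, 2, simplify = FALSE)) {
        term = paste(pair, collapse = ":")
        # fit all main effects plus this one pairwise interaction
        fit = lm(reformulate(c(nm, term), response = "y"), data = cbind(y = y, Xdf))
        p = summary(fit)$coefficients[term, 4]
        if (p < alpha) cat(pair[1], ":", pair[2], "   interaction p-value =", signif(p, 4), "\n")
    }
}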

Interaction Hypothesis

The Advert:Interest interaction term has a p-value below 0.05, so we reject the null hypothesis of no interaction.
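
The interaction coefficient modifies the slope: from the coefficient table, the fitted slope of Sales on Advert at a given Interest rate is \( 1.8363 - 0.1260 \times Interest \), i.e., about 1.58 at Interest = 2 but only about 1.21 at Interest = 5.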


Plot Sales by Advert for each level of Interest Rates

Y, X1 quantitative; X2 discrete

non-parallel lines show an interaction

# see computer help #33
X = sales1$Advert
Y = sales1$Sales
Factor = as.factor(sales1$Interest)

plot(X, Y, type = "n", ann = FALSE)

for (i in 1:nrow(sales1)) points(X[i], Y[i], pch = as.character(Factor[i]))
for (i in levels(Factor)) lines(X[Factor == i], fitted(model)[Factor == i])

title(xlab = "Adversising (in $mm)", ylab = "Sales (in $mm)", main = "Sales by Advertising for separate Interest Rate levels")

[Figure: "Sales by Advertising for separate Interest Rate levels"]

This graph makes these things clear:

  1. Sales are an increasing function of Advertising
  2. Interest Rates play a strong role in depressing the value of Sales across all levels of Advertising spending
  3. The slope discussion, above, shows that Interest Rates negatively affect the rate of growth of Sales as a function of Advertising, as well as their absolute level
rm(list = setdiff(ls(), lsf.str()))  # remove all but functions

Qualitative Predictors

Incorporate qualitative (factor or categorical) predictors using indicator (or dummy) variables.

Qualitative Predictors with Two Levels

In this example DF ("Dummy Female") is the indicator variable, where 1 => Female:

DF = ifelse(salgpa1$Gender == "F", yes = 1, no = 0)

lm_create.indicators uses this trick to create indicator variables from a factor variable input.
Note: include one fewer indicator in the model than the number of levels.

Note: Whether we use DF (Female = 1) or DM (Male = 1) makes no difference to the outcome of the analysis.

salgpa1 = read.csv("salgpa1.csv")
str(salgpa1)
## 'data.frame':    10 obs. of  4 variables:
##  $ Salary: num  59.8 44.2 52 62.4 75.4 70.2 59.8 28.6 39 49.4
##  $ Gpa   : num  3 2.5 1 3.5 3.5 3 2 1.5 2 1.5
##  $ Gender: Factor w/ 2 levels "F","M": 2 2 1 2 1 1 1 2 2 1
##  $ DF    : int  0 0 1 0 1 1 1 0 0 1
names(salgpa1)
## [1] "Salary" "Gpa"    "Gender" "DF"
head(salgpa1)
##   Salary Gpa Gender DF
## 1   59.8 3.0      M  0
## 2   44.2 2.5      M  0
## 3   52.0 1.0      F  1
## 4   62.4 3.5      M  0
## 5   75.4 3.5      F  1
## 6   70.2 3.0      F  1

# DF is the indicator variable for Gender
grfpairs(salgpa1, main = "Annual Starting Salary and GPA\nDF is the indicator variable for Gender")

[Figure: pairs plot, "Annual Starting Salary and GPA"]


Create a model

DfGPA = salgpa1$DF * salgpa1$Gpa
df.model = lm(Salary ~ DF + Gpa + DfGPA, data = salgpa1)
summary(df.model)
## 
## Call:
## lm(formula = Salary ~ DF + Gpa + DfGPA, data = salgpa1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.595 -1.690  0.393  0.916  4.160 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.60       5.29    0.49  0.64039    
## DF             35.61       6.46    5.52  0.00149 ** 
## Gpa            17.68       2.04    8.69  0.00013 ***
## DfGPA          -7.16       2.56   -2.80  0.03128 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.22 on 6 degrees of freedom
## Multiple R-squared:  0.966,  Adjusted R-squared:  0.95 
## F-statistic: 57.5 on 3 and 6 DF,  p-value: 8.19e-05
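
For reference, R builds the indicator and the interaction automatically when the factor is used directly; this fit (auto.model is an assumed name) reproduces the same fitted values as df.model, just reparameterized with a GenderM dummy since F is the baseline level:

auto.model = lm(Salary ~ Gender * Gpa, data = salgpa1)  # dummy coding handled internally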

The model:
\( E(Salary) = \beta_0 + \beta_1 DF + \beta_2 Gpa + \beta_3 DfGPA \)
\( E(Salary) = 2.60 + 35.61\, DF + 17.68\, Gpa - 7.16\, DfGPA \)

Setting DF = 0 we get the model for male salaries:
\( E(Salary_{Males}) = \beta_0 + \beta_2 Gpa \)
\( E(Salary_{Males}) = 2.60 + 17.68\, Gpa \)

Setting DF = 1 we get the model for female salaries:
\( E(Salary_{Females}) = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) Gpa \)
\( E(Salary_{Females}) = 38.21 + 10.52\, Gpa \)

The model shows females start higher but grow more slowly.

The fact that the lines are not parallel (different slopes) suggests an interaction between Gender (as represented by the indicator variable DF) and GPA. The DfGPA p-value of 0.0313, below the 5% significance level, confirms this by rejecting the null hypothesis that \( \beta_3 = 0 \) in the population as a whole.


Plot the model

# see computer help #33
X = salgpa1$Gpa
Y = salgpa1$Salary
Factor = salgpa1$Gender  # use factor() if it isn't one already  

plot(X, Y, type = "n", ann = FALSE)

for (i in 1:nrow(salgpa1)) points(X[i], Y[i], pch = as.character(Factor[i]))
for (i in levels(Factor)) lines(X[Factor == i], fitted(df.model)[Factor == i])

title(xlab = "GPA", ylab = "Salary ($000)", main = "Effect of GPA on Salary, by Gender")

[Figure: "Effect of GPA on Salary, by Gender"]


Compare the model with and without the interaction term

simple.model = lm(Salary ~ DF + Gpa, data = salgpa1)
lm_compare(simple.model, df.model, coefficients = c(2, 3))
## Residual Standard Error 
##       simple.model        4.522 
##       df.model        3.218 
##       Decreased:      -1.304 
##               df.model preferred 
## Adjusted R^2 
##       simple.model        0.9005 
##       df.model        0.9496 
##       Increased:      0.04913 
##               df.model preferred 
## F Statistic 
##       simple.model        41.72 
##       df.model        57.54 
##       Increased:      15.82 
##               df.model preferred 
## F Statistic p-value 
##       simple.model        0.000129 
##       df.model        8.187e-05 
##       Decreased:      -4.713e-05 
## 
## Coefficient Statistics 
##   DF  abs(t stat) 
##       simple.model        6.366 
##       df.model        5.516 
##       Decreased:      -0.8498 
##               simple.model preferred 
##   DF  t stat p-value 
##       simple.model        0.0003794 
##       df.model        0.001492 
##       Increased:      0.001113 
## 
##   Gpa  abs(t stat) 
##       simple.model        7.584 
##       df.model        8.687 
##       Increased:      1.103 
##               df.model preferred 
##   Gpa  t stat p-value 
##       simple.model        0.0001279 
##       df.model        0.0001284 
##       Increased:      5.228e-07

This result confirms the hypothesis test: there is an interaction, and including an interaction variable improves the model.


Qualitative predictors with three or more levels

cars4 = read.csv("cars4.csv")
str(cars4)
## 'data.frame':    26 obs. of  7 variables:
##  $ Name : Factor w/ 26 levels "Audi A3 2L","Chevrolet Aveo 1.6L",..: 9 12 17 18 19 21 24 2 3 4 ...
##  $ Cmpg : int  29 25 26 17 23 27 29 25 24 22 ...
##  $ Eng  : num  1.6 1.8 1.6 3.8 2.5 1.8 1.5 1.6 1.4 1.8 ...
##  $ Class: Factor w/ 3 levels "CO","SC","SW": 2 2 2 2 2 2 2 1 1 1 ...
##  $ DSc  : int  1 1 1 1 1 1 1 0 0 0 ...
##  $ DCo  : int  0 0 0 0 0 0 0 1 1 1 ...
##  $ DSw  : int  0 0 0 0 0 0 0 0 0 0 ...
names(cars4)
## [1] "Name"  "Cmpg"  "Eng"   "Class" "DSc"   "DCo"   "DSw"
head(cars4)
##                         Name Cmpg Eng Class DSc DCo DSw
## 1       Ford Fiesta Fwd 1.6L   29 1.6    SC   1   0   0
## 2           Honda Civic 1.8L   25 1.8    SC   1   0   0
## 3 Mini Cooper S Clubman 1.6L   26 1.6    SC   1   0   0
## 4    Mitsubishi Eclipse 3.8L   17 3.8    SC   1   0   0
## 5   Nissan Altima Coupe 2.5L   23 2.5    SC   1   0   0
## 6              Scion xD 1.8L   27 1.8    SC   1   0   0

# Dxx are dummy/indicator variables for class of car
grfpairs(cars4[, c("Cmpg", "Eng", "Class")], main = "City MPG based on Engine Size by Class of Car")

[Figure: pairs plot, "City MPG based on Engine Size by Class of Car"]


Create the model

lm_create.indicators(cars4$Class, ret = FALSE)
## 
## indicators created: D.CO, D.SC, D.SW

# define the indicators explicitly so they exist in this workspace
# (equivalent to what lm_create.indicators builds from the factor)
D.CO = as.numeric(cars4$Class == "CO")
D.SW = as.numeric(cars4$Class == "SW")

# create response variable
Cgphm = 100/cars4$Cmpg  # city gallons per 100 miles

## create interaction variables
DCO_ENG = D.CO * cars4$Eng
DSW_ENG = D.SW * cars4$Eng
model = lm(Cgphm ~ D.CO + D.SW + Eng + DCO_ENG + DSW_ENG, data = cars4)
summary(model)

Plot the model

# see computer help #33
X = cars4$Eng
Y = Cgphm
Factor = cars4$Class  # use factor() if it isn't one already  

plot(X, Y, type = "n", ann = FALSE)

for (i in 1:nrow(cars4)) points(X[i], Y[i], pch = substr(as.character(Factor[i]), 
    2, 2))
for (i in levels(Factor)) lines(X[Factor == i], fitted(model)[Factor == i])

title(xlab = "Engine Size (liters)", ylab = "City gallons/100 miles", main = "Effect of Engine Size on Efficiency, by Class of Car")

text(3.25, 3.75, labels = "C = SC = Subcompact\nO = CO = Compact\nW = SW = Station Wagon")

[Figure: "Effect of Engine Size on Efficiency, by Class of Car"]


Analysis of the model

  1. The model summary shows very high p-values for D.SW, the indicator for Station Wagon, and DSW_ENG, the interaction variable for Station Wagons and Engine Size

  2. The lines for Subcompact and Station Wagon are (to the eye) nearly parallel.

Let's remove DSW_ENG and see how a new model performs vs. the original:

new.model = lm(Cgphm ~ D.CO + D.SW + Eng + DCO_ENG, data = cars4)
lm_compare(model, new.model, coefficients = c(2, 3))

The new model is the clear winner.