Regression Model Building II

  1. Influential Points

    Note that rather than excluding an anomalous point, a transformation may be preferred

  2. Regression Pitfalls

setwd("~/R/regression")
# opts_knit$set(tidy = TRUE)
source("~/R/grfpairs.R")  # Winston Chang's pairs function
# source('~/R/predictvals.R') # Winston Chang P95
source("~/R/multiplot.R")  # Winston Chang from cookbook website
source("~/R/lm_compare.R")  # my model-comparison function
# source('~/R/create.indicators.R') # my indicator-creation function
library(UsingR)
library(ggplot2)
library(grid)
library(lmtest)  # for Breusch-Pagan test for Heteroskedasticity
library(randomizeBE)  # for runs.pvalue(resid_studentized) test for autocorrelation 

Influential Points

Note: a quick-and-dirty look for influential points comes from R's built-in diagnostics, plot(model); plot(model, which = 4) is used for Cook's Distance below.

Outliers

Data that lie far away from the majority of the other data … an actual \( Y_i \) value that lies a long way from its predicted value \( \hat{Y}_i \) … i.e. a residual with a large magnitude: \( \hat{\epsilon}_i = Y_i - \hat{Y}_i \)

  1. Residuals
    resid = residuals(model)

  2. Standardized Residuals
    resid_standardized = rstandard(model)

  3. Studentized Residuals
    resid_studentized = rstudent(model)
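
For reference, the usual textbook definitions behind rstandard() and rstudent() (they are not spelled out in these notes) are:

Standardized: \( r_i = \hat{\epsilon}_i / \left( \hat{\sigma}\sqrt{1-h_{ii}} \right) \), where \( \hat{\sigma} \) is the residual standard error from the full fit.

Studentized: \( t_i = \hat{\epsilon}_i / \left( \hat{\sigma}_{(i)}\sqrt{1-h_{ii}} \right) \), where \( \hat{\sigma}_{(i)} \) is re-estimated with observation \( i \) excluded; \( h_{ii} \) is the leverage of observation \( i \) (see the Leverage section below).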

When \( \text{Residuals} \thicksim N(0,\sigma^2) \; iid \), the \( \text{Studentized Residuals} \thicksim N(0,1) \) (approximately).

Therefore we define a potential outlier to be an observation with \( \left| \text{Studentized Residual} \right| > 3 \).

Sources of outliers:

  1. Data error

  2. An important predictor has been omitted from the model

  3. \( Residuals \not\thicksim N(0,\sigma^2) \; iid \)

  4. The outlier is a real observation
    maybe remove, maybe rethink the situation


Exclude an Outlier and Reanalyze

  1. If the results change substantially, exclude the outlier and analyze it separately from the rest of the sample

  2. If not, merely note the outlier's existence

cars5 = read.csv("cars5.csv")
str(cars5)
## 'data.frame':    50 obs. of  6 variables:
##  $ ID  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name: Factor w/ 50 levels "Aston Martin DB9 5.9L",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Cmpg: int  13 18 18 18 18 18 18 19 17 17 ...
##  $ Eng : num  5.9 3 3 3 3 3 3 3 3 3 ...
##  $ Cyl : int  12 6 6 6 6 6 6 6 6 6 ...
##  $ Vol : num  0.83 0.86 0.96 1 0.93 1.05 1.18 0.93 0.93 1 ...
names(cars5)
## [1] "ID"   "Name" "Cmpg" "Eng"  "Cyl"  "Vol"
head(cars5)
##   ID                     Name Cmpg Eng Cyl  Vol
## 1  1    Aston Martin DB9 5.9L   13 5.9  12 0.83
## 2  2 BMW 128Ci Convertible 3L   18 3.0   6 0.86
## 3  3              BMW 128i 3L   18 3.0   6 0.96
## 4  4             BMW 328Ci 3L   18 3.0   6 1.00
## 5  5 BMW 328Ci Convertible 3L   18 3.0   6 0.93
## 6  6              BMW 328i 3L   18 3.0   6 1.05

grfpairs(cars5[, c("Eng", "Cyl", "Vol", "Cmpg")], main = "Engine Size vs. Fuel Efficiency")

Create the model, print its parameters and look for outliers

Cgphm = 100/cars5$Cmpg  # city gallons per 100 miles
model = lm(Cgphm ~ Eng + Cyl + Vol, data = cars5)
summary(model)
## 
## Call:
## lm(formula = Cgphm ~ Eng + Cyl + Vol, data = cars5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9094 -0.1793  0.0453  0.2350  0.5232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.5420     0.3650    6.96    1e-08 ***
## Eng           0.2350     0.1117    2.10  0.04089 *  
## Cyl           0.2937     0.0762    3.85  0.00036 ***
## Vol           0.4762     0.2786    1.71  0.09421 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.379 on 46 degrees of freedom
## Multiple R-squared:  0.758,  Adjusted R-squared:  0.743 
## F-statistic: 48.1 on 3 and 46 DF,  p-value: 3.11e-14

Note the high p-value (0.094) for Vol.


resid_studentized = rstudent(model)
hist(resid_studentized, main = "Studentized Residual Distribution")

[plot: Studentized Residual Distribution]


# find the |Studentized Residuals| > 3 standard deviations
(gt3 = resid_studentized[abs(resid_studentized) > 3])
##     32 
## -8.274
outlier_list = vector()
for (name in names(gt3)) {
    print(cars5[as.integer(name), ])
    outlier_list[length(outlier_list) + 1] = as.integer(name)
}
##    ID               Name Cmpg Eng Cyl  Vol
## 32 32 Lexus HS 250h 2.4L   35 2.4   4 1.02

This outlier is a hybrid (the Lexus HS 250h), so it is not surprising that its fuel-efficiency characteristics differ from the rest of the sample.


Remove the outliers and reanalyze

cars5_ex_outliers = cars5[-outlier_list, ]
Cgphm = 100/cars5_ex_outliers$Cmpg  # city gallons per 100 miles
model_ex_outliers = lm(Cgphm ~ Eng + Cyl + Vol, data = cars5_ex_outliers)
summary(model_ex_outliers)
## 
## Call:
## lm(formula = Cgphm ~ Eng + Cyl + Vol, data = cars5_ex_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4765 -0.2447  0.0477  0.1974  0.4501 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9048     0.2365   12.28  5.7e-16 ***
## Eng           0.2594     0.0712    3.64  0.00069 ***
## Cyl           0.2278     0.0492    4.63  3.1e-05 ***
## Vol           0.4922     0.1774    2.77  0.00803 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.241 on 45 degrees of freedom
## Multiple R-squared:  0.857,  Adjusted R-squared:  0.847 
## F-statistic: 89.6 on 3 and 45 DF,  p-value: <2e-16

lm_compare(model, model_ex_outliers, coefficients = c(2, 3, 4))

resid_studentized_ex_outliers = rstudent(model_ex_outliers)
hist(resid_studentized_ex_outliers, main = "Studentized Residual Distribution post removal of outliers")

[plot: Studentized Residual Distribution after removing the outlier]


Leverage

The Leverage of an observation ranges from 0 (low influence) to 1 (high influence).

For k = # X variables: length(coef(model)) - 1
and n = # observations: nrow(dataset)

  1. Create the leverages
    lev = hatvalues(model)

  2. Note any observations with leverage > 3 * (k + 1) / n
    (or > 2 * (k + 1) / n if the observation is isolated)
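
For reference (a standard result, not stated in the notes): the leverages are the diagonal entries of the hat matrix

\( H = X(X^{T}X)^{-1}X^{T}, \qquad h_{ii} = H_{ii} \)

They always sum to \( k + 1 \), so their average is \( (k+1)/n \); the cutoffs above are simply 2 or 3 times that average.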


Note any observations with large leverage in model_ex_outliers

lev = hatvalues(model_ex_outliers)

k = length(coef(model_ex_outliers)) - 1
n = nrow(cars5_ex_outliers)

plot(as.integer(rownames(cars5_ex_outliers)), lev, main = "Leverage in cars5_ex_outliers Dataset", 
    ylab = "Leverage Value")
abline(h = 2 * (k + 1)/n, lty = 2)
abline(h = 3 * (k + 1)/n, lty = 2)

[plot: Leverage in cars5_ex_outliers Dataset]


Identify and exclude items with high leverage in addition to the outliers

(plus3 = lev[lev > (3 * (k + 1)/n)])
##      1     22     50 
## 0.4908 0.3144 0.3497
# note this peculiar way of identifying a row: when we removed row #32, the
# Lexus, we created a gap in the rownames so now the row in position 32 has
# rowname '33' ... the data frame retains the original rownames; so we
# identify by rowname not row position
biglev_list = vector()
for (name in names(plus3)) {
    print(cars5_ex_outliers[rownames(cars5_ex_outliers) == as.integer(name), 
        ])
    biglev_list[length(biglev_list) + 1] = as.integer(name)
}
##   ID                  Name Cmpg Eng Cyl  Vol
## 1  1 Aston Martin DB9 5.9L   13 5.9  12 0.83
##    ID                       Name Cmpg Eng Cyl  Vol
## 22 22 Dodge Challenger SRT8 6.4L   14 6.4   8 1.07
##    ID                   Name Cmpg Eng Cyl  Vol
## 50 50 Rolls-Royce Ghost 6.6L   13 6.6  12 1.25

The decision here is to exclude only #1 (the Aston Martin) rather than the entire biglev_list, because its leverage (0.49) is much higher than the others'.


# note this peculiar way of removing a row: when we removed row #32, the
# Lexus, we created a gap in the rownames so now the row in position 32 has
# rowname '33' ... the data frame retains the original rownames; so we
# remove by rowname not row position
cars5_ex_biglev = cars5_ex_outliers[rownames(cars5_ex_outliers) != 1, ]
# to exclude the whole biglev_list:

# cars5_ex_biglev = cars5_ex_outliers[!(rownames(cars5_ex_outliers) %in% biglev_list), ]

Cgphm = 100/cars5_ex_biglev$Cmpg  # city gallons per 100 miles
model_ex_biglev = lm(Cgphm ~ Eng + Cyl + Vol, data = cars5_ex_biglev)
summary(model_ex_biglev)
## 
## Call:
## lm(formula = Cgphm ~ Eng + Cyl + Vol, data = cars5_ex_biglev)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.461 -0.229  0.051  0.208  0.437 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9488     0.2472   11.93  2.2e-15 ***
## Eng           0.2823     0.0796    3.55  0.00094 ***
## Cyl           0.2010     0.0640    3.14  0.00303 ** 
## Vol           0.5316     0.1883    2.82  0.00711 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.243 on 44 degrees of freedom
## Multiple R-squared:  0.827,  Adjusted R-squared:  0.815 
## F-statistic:   70 on 3 and 44 DF,  p-value: <2e-16

Compare the ex-leverage model to ex-outlier model

lm_compare(model_ex_biglev, model_ex_outliers, coefficients = c(2, 3, 4))

And we find that we have actually made things worse by removing the high-leverage observation … so we should leave it in.


Cook's Distance

Cook's Distance helps to identify observations that are influential either because they are outliers (large residuals) or because they have high leverage; it combines both effects in a single measure.
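
For reference, the standard formula (not given in the notes) combines the two:

\( D_i = \dfrac{r_i^{2}}{k+1} \cdot \dfrac{h_{ii}}{1-h_{ii}} \)

where \( r_i \) is the standardized residual and \( h_{ii} \) the leverage of observation \( i \).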

In the cars5 example, it identifies only the Lexus outlier, confirming that the leveraged observations are not consequential.

cook = cooks.distance(model)

Look at items with a Cook's Distance greater than 0.5.


Compare Cook's Distance before and after removing outliers

cook_simple = cooks.distance(model)
cook_ex_outliers = cooks.distance(model_ex_outliers)

par(mfrow = c(1, 2))
plot(as.integer(rownames(cars5)), cook_simple, main = "Including Outliers", 
    ylab = "Cook's Distance", ylim = c(0, max(cook_simple)))
abline(h = 0.5)
plot(as.integer(rownames(cars5_ex_outliers)), cook_ex_outliers, main = "Excluding Outliers", 
    ylab = "", ylim = c(0, max(cook_simple)))
abline(h = 0.5)

[plots: Cook's Distance including vs. excluding outliers]


Identify the Cook's Distance extremes

(cook05 = cook_simple[cook_simple > 0.5])
##    32 
## 0.633
cook_extremes = vector()
for (name in names(cook05)) {
    print(cars5[as.integer(name), ])
    cook_extremes[length(cook_extremes) + 1] = as.integer(name)
}
##    ID               Name Cmpg Eng Cyl  Vol
## 32 32 Lexus HS 250h 2.4L   35 2.4   4 1.02
library(car)
outlierTest(model)
##    rstudent unadjusted p-value Bonferonni p
## 32   -8.274          1.382e-10    6.907e-09

Identify Cook's Distance extremes with plot(model):

plot(model, which = 4)
abline(h = 0.5, lty = 2)

[plot: Cook's Distance from plot(model, which = 4)]


Regression Pitfalls

Nonconstant error variance (Heteroskedasticity)

fertility = read.csv("fertility.csv")
str(fertility)
## 'data.frame':    169 obs. of  4 variables:
##  $ Locality: Factor w/ 169 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fer     : num  6.42 1.56 2.31 5.58 2.23 1.74 1.94 1.37 2.16 1.9 ...
##  $ Inc     : num  1.22 6.55 6.21 5.06 13.5 ...
##  $ Edu     : num  0.8 10.7 7.1 4.5 11.4 11.4 12.7 11.4 11.9 12 ...
names(fertility)
## [1] "Locality" "Fer"      "Inc"      "Edu"
head(fertility)
##      Locality  Fer    Inc  Edu
## 1 Afghanistan 6.42  1.217  0.8
## 2     Albania 1.56  6.546 10.7
## 3     Algeria 2.31  6.207  7.1
## 4      Angola 5.58  5.056  4.5
## 5   Argentina 2.23 13.498 11.4
## 6     Armenia 1.74  4.523 11.4

grfpairs(fertility[, c("Inc", "Edu", "Fer")], main = "Fertility vs. Income (PPP) & Education Level")

lnInc = log(fertility$Inc)
lnInc_model = lm(Fer ~ lnInc + Edu, data = fertility)

scatter.smooth(rstudent(lnInc_model) ~ fitted(lnInc_model), main = "Fertility-Rate model Heteroskedasticity", 
    sub = "solid line should lie along the dotted", xlab = "Fitted Values", 
    ylab = "Studentized Residuals")
abline(h = 0, lty = 2)

[plot: Fertility-Rate model Heteroskedasticity]


bptest(lnInc_model)  # Breusch-Pagan test for Heteroskedasticity
## 
##  studentized Breusch-Pagan test
## 
## data:  lnInc_model
## BP = 14.01, df = 2, p-value = 0.0009064

The very low p-value indicates heteroskedasticity.


Responses to Heteroskedasticity

  1. Use Weighted (WLS) or Generalized (GLS) least squares rather than Ordinary Least Squares (OLS); a WLS sketch appears at the end of this subsection
  2. Transform the Y variable
lnFer = log(fertility$Fer)  # Transform Y to log(Y)
lnln_model = lm(lnFer ~ lnInc + Edu, data = fertility)

scatter.smooth(rstudent(lnln_model) ~ fitted(lnln_model), main = "Fertility-Rate model reduced Heteroskedasticity", 
    sub = "solid line much closer to the dotted", xlab = "Fitted Values", ylab = "Studentized Residuals")
abline(h = 0, lty = 2)

[plot: Fertility-Rate model reduced Heteroskedasticity]


bptest(lnln_model)  # Breusch-Pagan test for Heteroskedasticity
## 
##  studentized Breusch-Pagan test
## 
## data:  lnln_model
## BP = 0.2408, df = 2, p-value = 0.8865

The very high p-value is consistent with constant variance (no evidence of heteroskedasticity).


Comparison of the models is inconclusive regarding the value of reducing heteroskedasticity

lm_compare(lnInc_model, lnln_model, coefficients = c(2, 3))
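
Response 1 (WLS) is not demonstrated above; the following is a minimal sketch of one common approach, assuming the error spread grows with the fitted values (the weight model is illustrative, not part of the original analysis).

ols_fit = lm(Fer ~ lnInc + Edu, data = fertility)
spread_fit = lm(abs(resid(ols_fit)) ~ fitted(ols_fit))  # rough model of the error spread
w = 1/fitted(spread_fit)^2  # inverse-variance weights (check that the fitted spreads are positive)
wls_fit = lm(Fer ~ lnInc + Edu, data = fertility, weights = w)
summary(wls_fit)
# gls() in the nlme package handles more general variance structures (GLS)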

Autocorrelation/Serial Correlation (failure of Independence assumption)

This is often seen in time series in which the next value is not independent of the previous one.

unemp = read.csv("unemp.csv")
str(unemp)
## 'data.frame':    52 obs. of  7 variables:
##  $ Year: int  1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 ...
##  $ Unr : num  5.45 5.54 6.69 5.57 5.64 ...
##  $ Bms : num  298 312 336 363 393 ...
##  $ Ipd : num  18.3 18.6 18.8 19.1 19.3 ...
##  $ Gdp : num  2818 2868 2933 3119 3249 ...
##  $ Exp : num  83.9 98.5 99 104 111.5 ...
##  $ Time: int  1 2 3 4 5 6 7 8 9 10 ...
names(unemp)
## [1] "Year" "Unr"  "Bms"  "Ipd"  "Gdp"  "Exp"  "Time"
head(unemp)
##   Year   Unr   Bms   Ipd  Gdp   Exp Time
## 1 1959 5.450 297.8 18.34 2818  83.9    1
## 2 1960 5.542 312.4 18.60 2868  98.5    2
## 3 1961 6.692 335.5 18.80 2933  99.0    3
## 4 1962 5.567 362.7 19.06 3119 104.0    4
## 5 1963 5.642 393.2 19.27 3249 111.5    5
## 6 1964 5.158 424.7 19.56 3426 124.6    6

grfpairs(unemp[, c("Time", "Bms", "Ipd", "Gdp", "Exp", "Unr")], main = "Unemployment vs other macro factors")

All except unemployment go up over time.


Autocorrelated model

lnBms_Ipd = log(unemp$Bms/unemp$Ipd)
lnGdp = log(unemp$Gdp)
lnExp = log(unemp$Exp)

model = lm(Unr ~ lnBms_Ipd + lnGdp + lnExp + Time, data = unemp)
resid_studentized = rstudent(model)
scatter.smooth(unemp$Time, resid_studentized, ann = FALSE)
abline(h = 0, lty = 2)
title(main = "Residuals over Time", xlab = "Time Period", ylab = "Studentized Residuals", 
    sub = "momentum is clearly at work")

[plot: Residuals over Time]


1st Order Autoregressive Errors AR(1)

\( \epsilon_i = r \epsilon_{i-1} + v_i \)

runs.pvalue(resid_studentized)  # Wald and Wolfowitz runs test
## [1] 9.49e-07

The very low p-value indicates the presence of autocorrelation.
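
Two supplementary checks (not in the original notes): the Durbin-Watson test from lmtest, already loaded above, tests for AR(1) errors, and regressing each residual on its predecessor gives a rough estimate of r.

dwtest(model)  # Durbin-Watson statistic well below 2 suggests positive autocorrelation
e = resid(model)
n = length(e)
ar1_fit = lm(e[-1] ~ e[-n] - 1)  # epsilon_i = r * epsilon_(i-1) + v_i, no intercept
coef(ar1_fit)  # rough estimate of r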

What to do?


Multicollinearity

When the X variables are highly correlated with each other, the standard errors of the coefficient estimates are inflated.

sales3 = read.csv("sales3.csv")
str(sales3)
## 'data.frame':    12 obs. of  3 variables:
##  $ Sales: num  5 7 7 5 9 11 12 13 15 17 ...
##  $ Trad : num  1 2 2.5 1.5 3.5 4 5 6 6.5 8 ...
##  $ Int  : num  2 2.5 3 1.5 4 4.5 5 5.5 6 7.5 ...
names(sales3)
## [1] "Sales" "Trad"  "Int"
head(sales3)
##   Sales Trad Int
## 1     5  1.0 2.0
## 2     7  2.0 2.5
## 3     7  2.5 3.0
## 4     5  1.5 1.5
## 5     9  3.5 4.0
## 6    11  4.0 4.5

grfpairs(sales3, main = "Sales by Traditional and Internet Advertising")

Multicollinearity is obvious here: Trad and Int rise almost in lockstep.
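
A standard way to quantify this (not shown in the notes) is the variance inflation factor from the car package, loaded earlier; the model below is only for illustration.

cor(sales3$Trad, sales3$Int)  # pairwise correlation of the two predictors
sales_model = lm(Sales ~ Trad + Int, data = sales3)  # illustrative model
vif(sales_model)  # VIFs well above 5-10 signal problematic multicollinearity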

What to do?


Excluding important predictors

Omitting a relevant predictor can bias the remaining coefficient estimates, so, generally speaking, start big and trim.
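
One mechanical way to start big and trim (a sketch, not part of the original notes) is backward elimination by AIC with step(), shown here on the cars5 model from earlier.

cars5$Cgphm = 100/cars5$Cmpg  # put the response in the data frame
full_model = lm(Cgphm ~ Eng + Cyl + Vol, data = cars5)  # start big
step(full_model, direction = "backward")  # trim by AIC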

Overfitting

Either including unimportant variables or fitting the sample data so closely that the model sacrifices its predictive power on new data.
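
A quick check (a sketch, assuming leave-one-out error is a fair proxy for predictive power; not from the notes) is the PRESS statistic, which reuses the hatvalues seen earlier.

press = function(fit) sum((resid(fit)/(1 - hatvalues(fit)))^2)  # leave-one-out prediction error
cars5$Cgphm = 100/cars5$Cmpg
m_small = lm(Cgphm ~ Eng + Cyl + Vol, data = cars5)
m_big = lm(Cgphm ~ poly(Eng, 3) + poly(Vol, 3) + Cyl, data = cars5)  # deliberately over-complex
press(m_small)
press(m_big)  # if this is larger, the extra terms are chasing noise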

Extrapolation

Be wary of using a model to predict too far away from the data you're given.

Power and Sample Size

Small datasets lack statistical power: real effects can go undetected and coefficient estimates are imprecise.
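
As an illustration (the pwr package is not used elsewhere in these notes, so treat this as an assumption-laden sketch), pwr.f2.test() gives the sample size needed for a multiple-regression F-test.

library(pwr)
# sample size for a 3-predictor regression, medium effect (f2 = 0.15),
# 5% significance level and 80% power
pw = pwr.f2.test(u = 3, f2 = 0.15, sig.level = 0.05, power = 0.8)
pw$v  # required denominator degrees of freedom
ceiling(pw$v) + 3 + 1  # implied number of observations: n = v + k + 1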