Regression Pitfalls
Influential Points
Note: rather than excluding an anomalous point, a transformation may be preferred.
setwd("~/R/regression")
# opts_knit$set(tidy = TRUE)
source("~/R/grfpairs.R") # Winston Chang's pairs function
# source('~/R/predictvals.R') # Winston Chang P95
source("~/R/multiplot.R") # Winston Chang from cookbook website
source("~/R/lm_compare.R") # my model-comparison function
# source('~/R/create.indicators.R') # my indicator-creation function
library(UsingR)
library(ggplot2)
library(grid)
library(lmtest) # for Breusch-Pagan test for Heteroskedasticity
library(randomizeBE) # for runs.pvalue(resid_studentized) test for autocorrelation
Note: a quick-and-dirty look for influential points:
plot(model)
(the lower-right panel shows Cook's Distance bands)
or, for Cook's Distance alone:
plot(model, which = 4)
abline(h = 0.5, lty = 2)
Outliers
Data that lie far away from the majority of the other data: an actual \( Y_i \) value which lies a long way from the predicted \( \hat{Y}_i \) value, i.e. a residual with a large magnitude: \( \hat{\epsilon}_i = Y_i - \hat{Y}_i \)
Residuals
resid = residuals(model)
Standardized Residuals
resid_standardized = rstandard(model)
Studentized Residuals
resid_studentized = rstudent(model)
When the residuals are \( \thicksim N(0,\sigma^2) \; iid \), the studentized residuals are approximately \( N(0,1) \).
Therefore we define a potential outlier to be an observation with a \( \left| Studentized \; Residual \right| > 3 \)
Sources of outliers:
Data error
An important predictor has been omitted from the model
\( Residuals \not\thicksim N(0,\sigma^2) \; iid \)
The outlier is a real observation
maybe remove, maybe rethink the situation
Fit the model with and without the outlier. If the results change substantially, exclude the outlier and analyze it separately from the rest of the sample.
If not, merely note the outlier's existence.
cars5 = read.csv("cars5.csv")
str(cars5)
## 'data.frame': 50 obs. of 6 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name: Factor w/ 50 levels "Aston Martin DB9 5.9L",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Cmpg: int 13 18 18 18 18 18 18 19 17 17 ...
## $ Eng : num 5.9 3 3 3 3 3 3 3 3 3 ...
## $ Cyl : int 12 6 6 6 6 6 6 6 6 6 ...
## $ Vol : num 0.83 0.86 0.96 1 0.93 1.05 1.18 0.93 0.93 1 ...
names(cars5)
## [1] "ID" "Name" "Cmpg" "Eng" "Cyl" "Vol"
head(cars5)
## ID Name Cmpg Eng Cyl Vol
## 1 1 Aston Martin DB9 5.9L 13 5.9 12 0.83
## 2 2 BMW 128Ci Convertible 3L 18 3.0 6 0.86
## 3 3 BMW 128i 3L 18 3.0 6 0.96
## 4 4 BMW 328Ci 3L 18 3.0 6 1.00
## 5 5 BMW 328Ci Convertible 3L 18 3.0 6 0.93
## 6 6 BMW 328i 3L 18 3.0 6 1.05
grfpairs(cars5[, c("Eng", "Cyl", "Vol", "Cmpg")], main = "Engine Size vs. Fuel Efficiency")
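grfpairs is a custom wrapper sourced above (not reproduced here); as a rough stand-in, base R's pairs with a smoother in each panel gives a similar quick look:
# rough stand-in for the custom grfpairs wrapper
pairs(cars5[, c("Eng", "Cyl", "Vol", "Cmpg")], panel = panel.smooth,
    main = "Engine Size vs. Fuel Efficiency")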
Cgphm = 100/cars5$Cmpg # city gallons per 100 miles
model = lm(Cgphm ~ Eng + Cyl + Vol, data = cars5)
summary(model)
##
## Call:
## lm(formula = Cgphm ~ Eng + Cyl + Vol, data = cars5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9094 -0.1793 0.0453 0.2350 0.5232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5420 0.3650 6.96 1e-08 ***
## Eng 0.2350 0.1117 2.10 0.04089 *
## Cyl 0.2937 0.0762 3.85 0.00036 ***
## Vol 0.4762 0.2786 1.71 0.09421 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.379 on 46 degrees of freedom
## Multiple R-squared: 0.758, Adjusted R-squared: 0.743
## F-statistic: 48.1 on 3 and 46 DF, p-value: 3.11e-14
Note the high p-value for Vol.
resid_studentized = rstudent(model)
hist(resid_studentized, main = "Studentized Residual Distribution")
# find the |Studentized Residuals| > 3 standard deviations
(gt3 = resid_studentized[abs(resid_studentized) > 3])
## 32
## -8.274
outlier_list = vector()
for (name in names(gt3)) {
print(cars5[as.integer(name), ])
outlier_list[length(outlier_list) + 1] = as.integer(name)
}
## ID Name Cmpg Eng Cyl Vol
## 32 32 Lexus HS 250h 2.4L 35 2.4 4 1.02
This outlier is a hybrid, so it is not surprising that its characteristics differ from the rest of the sample.
cars5_ex_outliers = cars5[-outlier_list, ]
Cgphm = 100/cars5_ex_outliers$Cmpg # city gallons per 100 miles
model_ex_outliers = lm(Cgphm ~ Eng + Cyl + Vol, data = cars5_ex_outliers)
summary(model_ex_outliers)
##
## Call:
## lm(formula = Cgphm ~ Eng + Cyl + Vol, data = cars5_ex_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4765 -0.2447 0.0477 0.1974 0.4501
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9048 0.2365 12.28 5.7e-16 ***
## Eng 0.2594 0.0712 3.64 0.00069 ***
## Cyl 0.2278 0.0492 4.63 3.1e-05 ***
## Vol 0.4922 0.1774 2.77 0.00803 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.241 on 45 degrees of freedom
## Multiple R-squared: 0.857, Adjusted R-squared: 0.847
## F-statistic: 89.6 on 3 and 45 DF, p-value: <2e-16
lm_compare(model, model_ex_outliers, coefficients = c(2, 3, 4))
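lm_compare is a custom helper sourced above (not reproduced here); a minimal stand-in that lines up the slope estimates from the two fits:
# minimal stand-in for the custom lm_compare helper: coefficients
# 2 through 4 (Eng, Cyl, Vol) from each fit, side by side
round(rbind(with_outlier = coef(model)[2:4],
    without_outlier = coef(model_ex_outliers)[2:4]), 4)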
resid_studentized_ex_outliers = rstudent(model_ex_outliers)
hist(resid_studentized_ex_outliers, main = "Studentized Residual Distribution post removal of outliers")
Leverage
The leverage of an observation ranges from 0 (low influence) to 1 (high influence).
For k = # X variables: length(coef(model)) - 1
and n = # observations: nrow(dataset)
Create the leverages
lev = hatvalues(model)
Note any observations with leverage > 3 * (k + 1) / n (or > 2 * (k + 1) / n if the observation is isolated).
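These thresholds are multiples of the average leverage: the hat values always sum to the number of fitted parameters, so
\( \bar{h} = \frac{\sum_i h_{ii}}{n} = \frac{k+1}{n} \)
and the rule flags observations at two to three times that average.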
lev = hatvalues(model_ex_outliers)
k = length(coef(model_ex_outliers)) - 1
n = nrow(cars5_ex_outliers)
plot(as.integer(rownames(cars5_ex_outliers)), lev, main = "Leverage in cars5_ex_outliers Dataset",
ylab = "Leverage Value")
abline(h = 2 * (k + 1)/n, lty = 2)
abline(h = 3 * (k + 1)/n, lty = 2)
(plus3 = lev[lev > (3 * (k + 1)/n)])
## 1 22 50
## 0.4908 0.3144 0.3497
# note this peculiar way of identifying a row: when we removed row #32, the
# Lexus, we created a gap in the rownames so now the row in position 32 has
# rowname '33' ... the data frame retains the original rownames; so we
# identify by rowname not row position
biglev_list = vector()
for (name in names(plus3)) {
print(cars5_ex_outliers[rownames(cars5_ex_outliers) == as.integer(name),
])
biglev_list[length(biglev_list) + 1] = as.integer(name)
}
## ID Name Cmpg Eng Cyl Vol
## 1 1 Aston Martin DB9 5.9L 13 5.9 12 0.83
## ID Name Cmpg Eng Cyl Vol
## 22 22 Dodge Challenger SRT8 6.4L 14 6.4 8 1.07
## ID Name Cmpg Eng Cyl Vol
## 50 50 Rolls-Royce Ghost 6.6L 13 6.6 12 1.25
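A terser equivalent to the loop above: character subscripts index a data frame by rowname, so a direct lookup works too.
# character subscripts index rows by rowname, so this one-liner
# prints the same three cars
cars5_ex_outliers[names(plus3), ]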
The analysis here is to exclude only #1, the Aston Martin, rather than the entire biglev_list, because its leverage is the highest by a wide margin.
# as above, remove by rowname rather than row position: the data frame
# retains the original rownames after the Lexus row was dropped
cars5_ex_biglev = cars5_ex_outliers[rownames(cars5_ex_outliers) != 1, ]
# to exclude the whole biglev_list:
# cars5_ex_biglev = cars5_ex_outliers[!(rownames(cars5_ex_outliers) %in%
#     biglev_list), ]
Cgphm = 100/cars5_ex_biglev$Cmpg # city gallons per 100 miles
model_ex_biglev = lm(Cgphm ~ Eng + Cyl + Vol, data = cars5_ex_biglev)
summary(model_ex_biglev)
##
## Call:
## lm(formula = Cgphm ~ Eng + Cyl + Vol, data = cars5_ex_biglev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.461 -0.229 0.051 0.208 0.437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9488 0.2472 11.93 2.2e-15 ***
## Eng 0.2823 0.0796 3.55 0.00094 ***
## Cyl 0.2010 0.0640 3.14 0.00303 **
## Vol 0.5316 0.1883 2.82 0.00711 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.243 on 44 degrees of freedom
## Multiple R-squared: 0.827, Adjusted R-squared: 0.815
## F-statistic: 70 on 3 and 44 DF, p-value: <2e-16
lm_compare(model_ex_biglev, model_ex_outliers, coefficients = c(2, 3, 4))
And we find that we have actually made things worse by removing the high-leverage observation (adjusted R-squared fell from 0.847 to 0.815), so we should leave it in.
Cook's Distance helps to identify observations that are either outliers or high-leverage points (or both).
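It combines the two ingredients above: with standardized residual \( r_i \), leverage \( h_{ii} \), and \( k+1 \) fitted parameters,
\( D_i = \frac{r_i^2}{k+1} \cdot \frac{h_{ii}}{1-h_{ii}} \)
so an observation scores high only when both its residual and its leverage are sizable.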
In the cars5 example, it identifies only the Lexus outlier, confirming that the leveraged observations are not consequential.
cook = cooks.distance(model)
Look at the items with a value greater than 0.5.
cook_simple = cooks.distance(model)
cook_ex_outliers = cooks.distance(model_ex_outliers)
par(mfrow = c(1, 2))
plot(as.integer(rownames(cars5)), cook_simple, main = "Including Outliers",
ylab = "Cook's Distance", ylim = c(0, max(cook_simple)))
abline(h = 0.5)
plot(as.integer(rownames(cars5_ex_outliers)), cook_ex_outliers, main = "Excluding Outliers",
ylab = "", ylim = c(0, max(cook_simple)))
abline(h = 0.5)
(cook05 = cook_simple[cook_simple > 0.5])
## 32
## 0.633
cook_extremes = vector()
for (name in names(cook05)) {
print(cars5[as.integer(name), ])
cook_extremes[length(cook_extremes) + 1] = as.integer(name)
}
## ID Name Cmpg Eng Cyl Vol
## 32 32 Lexus HS 250h 2.4L 35 2.4 4 1.02
library(car)
outlierTest(model)
## rstudent unadjusted p-value Bonferonni p
## 32 -8.274 1.382e-10 6.907e-09
plot(model, which = 4)
abline(h = 0.5, lty = 2)
Heteroskedasticity
fertility = read.csv("fertility.csv")
str(fertility)
## 'data.frame': 169 obs. of 4 variables:
## $ Locality: Factor w/ 169 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Fer : num 6.42 1.56 2.31 5.58 2.23 1.74 1.94 1.37 2.16 1.9 ...
## $ Inc : num 1.22 6.55 6.21 5.06 13.5 ...
## $ Edu : num 0.8 10.7 7.1 4.5 11.4 11.4 12.7 11.4 11.9 12 ...
names(fertility)
## [1] "Locality" "Fer" "Inc" "Edu"
head(fertility)
## Locality Fer Inc Edu
## 1 Afghanistan 6.42 1.217 0.8
## 2 Albania 1.56 6.546 10.7
## 3 Algeria 2.31 6.207 7.1
## 4 Angola 5.58 5.056 4.5
## 5 Argentina 2.23 13.498 11.4
## 6 Armenia 1.74 4.523 11.4
grfpairs(fertility[, c("Inc", "Edu", "Fer")], main = "Fertility vs. Income (PPP) & Education Level")
lnInc = log(fertility$Inc)
lnInc_model = lm(Fer ~ lnInc + Edu, data = fertility)
scatter.smooth(rstudent(lnInc_model) ~ fitted(lnInc_model), main = "Fertility-Rate model Heteroskedasticity",
sub = "solid line should lie along the dotted", xlab = "Fitted Values",
ylab = "Studentized Residuals")
abline(h = 0, lty = 2)
bptest(lnInc_model) # Breusch-Pagan test for Heteroskedasticity
##
## studentized Breusch-Pagan test
##
## data: lnInc_model
## BP = 14.01, df = 2, p-value = 0.0009064
A very low p-value indicates heteroskedasticity.
lnFer = log(fertility$Fer) # Transform Y to log(Y)
lnln_model = lm(lnFer ~ lnInc + Edu, data = fertility)
scatter.smooth(rstudent(lnln_model) ~ fitted(lnln_model), main = "Fertility-Rate model reduced Heteroskedasticity",
sub = "solid line much closer to the dotted", xlab = "Fitted Values", ylab = "Studentized Residuals")
abline(h = 0, lty = 2)
bptest(lnln_model) # Breusch-Pagan test for Heteroskedasticity
##
## studentized Breusch-Pagan test
##
## data: lnln_model
## BP = 0.2408, df = 2, p-value = 0.8865
A very high p-value indicates constant variance (homoskedasticity).
lm_compare(lnInc_model, lnln_model, coefficients = c(2, 3))
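An alternative remedy, sketched on the assumption that the sandwich package is installed: keep the untransformed model but report heteroskedasticity-consistent standard errors.
# alternative remedy (assumes the sandwich package is installed):
# White/HC3 standard errors for the original model
library(sandwich)
coeftest(lnInc_model, vcov. = vcovHC(lnInc_model, type = "HC3"))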
Autocorrelation
This is often seen in time series, in which the next value is not independent of the previous one.
unemp = read.csv("unemp.csv")
str(unemp)
## 'data.frame': 52 obs. of 7 variables:
## $ Year: int 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 ...
## $ Unr : num 5.45 5.54 6.69 5.57 5.64 ...
## $ Bms : num 298 312 336 363 393 ...
## $ Ipd : num 18.3 18.6 18.8 19.1 19.3 ...
## $ Gdp : num 2818 2868 2933 3119 3249 ...
## $ Exp : num 83.9 98.5 99 104 111.5 ...
## $ Time: int 1 2 3 4 5 6 7 8 9 10 ...
names(unemp)
## [1] "Year" "Unr" "Bms" "Ipd" "Gdp" "Exp" "Time"
head(unemp)
## Year Unr Bms Ipd Gdp Exp Time
## 1 1959 5.450 297.8 18.34 2818 83.9 1
## 2 1960 5.542 312.4 18.60 2868 98.5 2
## 3 1961 6.692 335.5 18.80 2933 99.0 3
## 4 1962 5.567 362.7 19.06 3119 104.0 4
## 5 1963 5.642 393.2 19.27 3249 111.5 5
## 6 1964 5.158 424.7 19.56 3426 124.6 6
grfpairs(unemp[, c("Time", "Bms", "Ipd", "Gdp", "Exp", "Unr")], main = "Unemployment vs other macro factors")
All except unemployment go up over time.
lnBms_Ipd = log(unemp$Bms/unemp$Ipd)
lnGdp = log(unemp$Gdp)
lnExp = log(unemp$Exp)
model = lm(Unr ~ lnBms_Ipd + lnGdp + lnExp + Time, data = unemp)
resid_studentized = rstudent(model)
scatter.smooth(unemp$Time, resid_studentized, ann = FALSE)
abline(h = 0, lty = 2)
title(main = "Residuals over Time", xlab = "Time Period", ylab = "Studentized Residuals",
sub = "momentum is clearly at work")
The residuals follow a first-order autoregressive pattern: \( \epsilon_i = r \, \epsilon_{i-1} + v_i \)
runs.pvalue(resid_studentized) # Wald and Wolfowitz runs test
## [1] 9.49e-07
The very low p-value indicates the presence of autocorrelation.
What to do? Model the time dependence directly, for example by adding lagged terms or by fitting a generalized least squares model, rather than trusting the OLS standard errors.
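A standard confirmation is the Durbin-Watson test from lmtest (already loaded above); a statistic well below 2 signals positive first-order autocorrelation.
# Durbin-Watson test on the fitted model (lmtest is loaded above)
dwtest(model)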
Multicollinearity
When the X variables are highly correlated with each other, the model's standard errors are inflated.
sales3 = read.csv("sales3.csv")
str(sales3)
## 'data.frame': 12 obs. of 3 variables:
## $ Sales: num 5 7 7 5 9 11 12 13 15 17 ...
## $ Trad : num 1 2 2.5 1.5 3.5 4 5 6 6.5 8 ...
## $ Int : num 2 2.5 3 1.5 4 4.5 5 5.5 6 7.5 ...
names(sales3)
## [1] "Sales" "Trad" "Int"
head(sales3)
## Sales Trad Int
## 1 5 1.0 2.0
## 2 7 2.0 2.5
## 3 7 2.5 3.0
## 4 5 1.5 1.5
## 5 9 3.5 4.0
## 6 11 4.0 4.5
grfpairs(sales3, main = "Sales by Traditional and Internet Advertising")
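A numeric check to go with the scatterplots (car was already loaded for outlierTest): variance inflation factors well above 5 to 10 signal problematic collinearity.
# correlation and variance inflation factors (car is loaded above)
cor(sales3$Trad, sales3$Int)
sales_model = lm(Sales ~ Trad + Int, data = sales3)
vif(sales_model)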
What to do?
Generally speaking, start big and trim.
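Base R's step gives one mechanical way to "start big and trim" (backward elimination by AIC), reusing sales_model from the sketch above:
# backward elimination by AIC; step() drops terms while AIC improves
step(sales_model, direction = "backward")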
Overfitting
Either including unimportant variables or fitting the sample data so closely that the model sacrifices its predictive power on new data.
Extrapolation
Be wary of using a model to predict too far away from the data you're given.
Small Samples
Small datasets lack statistical power.