A high leverage data point has extreme predictor \(X\) values. An outlier is a data point whose response value \(y\) does not follow the general trend of the rest of the data. A data point is influential if it unduly influences predicted responses, the estimated slope coefficients, or hypothesis test results. High leverage data points and outliers may or may not be influential - you need to investigate!
If you have influential data points, you can 1) do nothing because there are theoretical reasons to include them, 2) drop them, or 3) include them with reduced weight using robust regression.
The “hat matrix” \(H\) identifies any leverage points. Recall that in the linear regression model \(\hat{y} = X \hat{\beta}\) the slope coefficients are estimated by \(\hat{\beta} = (X'X)^{-1}X'y\). Substituting, \(\hat{y} = X(X'X)^{-1}X'y\), or \(\hat{y} = Hy\), where
\[H = X(X'X)^{-1}X'.\]
\(H\) is called the hat matrix because \(H\) puts the hat on \(y\). \(H\) is an \(n \times n\) matrix. The diagonal elements \(H_{ii}\) measure the distance between observation \(i\)'s predictor values \(X_i\) and the mean of the predictor values \(\bar{X}\). \(H_{ii}\) is the leverage that the observed response \(y_i\) exerts on the predicted response \(\hat{y}_i\). Each \(H_{ii}\) lies in the unit interval [0, 1] and the values sum to the number of regression parameters (including the intercept), \(\sum{H_{ii}} = p\). A common rule is to investigate any observation whose leverage is more than 3 times the mean leverage. Since the leverages sum to \(p\), the mean leverage is \(p/n\), so the rule flags observations with
\[H_{ii} > 3 \frac{p}{n}.\]
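For intuition (a small illustration with made-up predictor values, not part of the original text): in simple linear regression the leverage has the closed form \(H_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}\), which makes the distance-from-the-mean interpretation explicit.
# Hypothetical predictor values; the last one sits far from the mean.
x <- c(1, 2, 3, 4, 10)
X <- cbind(1, x)                          # design matrix with an intercept
H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix
cbind(closed_form = 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2),
      diag_H      = diag(H))              # the two columns agree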
In practice, you will identify and deal with influential data points, not the leverage points that give rise to them. However, since leverage is an important component of influential data points, you should know how to work with it.
A study measured cement hardening based on its chemical composition. Data set cement (cement.txt) is a summary of n = 13 formulations:

- y = heat evolved, in calories/gram
- x1 = % of tricalcium aluminate
- x2 = % of tricalcium silicate
- x3 = % of tetracalcium alumino ferrite
- x4 = % of dicalcium silicate

library(readr)
cement <- read_fwf(file = "./Data/cement.txt",
                   skip = 1,
                   fwf_widths(c(5, 7, 7, 7, 7)))
colnames(cement) <- c("y", "x1", "x2", "x3", "x4")
head(cement)
## # A tibble: 6 x 5
## y x1 x2 x3 x4
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 78.5 7 26 6 60
## 2 74.3 1 29 15 52
## 3 104. 11 56 8 20
## 4 87.6 11 31 8 47
## 5 95.9 7 52 6 33
## 6 109. 11 55 9 22
Suppose you model y ~ x1.
cement_lm1 <- lm(y ~ x1, data = cement)
From the scatterplot it looks like observation 10 may be a high leverage point.
library(ggplot2)
ggplot(data = cement, aes(y = y, x = x1)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "y ~ x1",
subtitle = "Obs 10 a high leverage point?") +
geom_text(label=rownames(cement),
nudge_x = 0.5,
nudge_y = 0.25,
check_overlap = T)
The hat matrix is \(H = X(X'X)^{-1}X'\). The diagonal elements are the leverages. The further each \(X_i\) is from \(\bar{X}\), the larger the leverage. Observation 10 has the largest leverage.
X <- matrix(c(rep(1, times = nrow(cement)),
cement$x1),
ncol = 2)
# t(x) is transpose of x: x'
# solve(x) is inverse of x: x^{-1}
H <- X %*% solve(t(X) %*% X) %*% t(X)
data.frame(x1 = cement$x1,
H = diag(H))
## x1 H
## 1 7 0.07743609
## 2 1 0.17747314
## 3 11 0.10707670
## 4 11 0.10707670
## 5 7 0.07743609
## 6 11 0.10707670
## 7 3 0.12486106
## 8 1 0.17747314
## 9 2 0.14875880
## 10 21 0.51834013
## 11 1 0.17747314
## 12 11 0.10707670
## 13 10 0.09244165
The sum of the matrix diagonal (the trace) equals the number of regression parameters in the model, including the intercept.
sum(diag(H))
## [1] 2
You did not need to calculate the hat matrix manually for this exercise because the R function hatvalues(model) returns the diagonal elements of the hat matrix. Which observation(s) are high leverage points? Observation 10's leverage is more than three times the mean leverage.
leverages <- hatvalues(cement_lm1) > 3 * mean(hatvalues(cement_lm1))
cement[leverages, ]
## # A tibble: 1 x 5
## y x1 x2 x3 x4
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 116. 21 47 4 26
For multiple linear regression models like y ~ x1 + x2 + x3 + x4, there is no straightforward visual diagnostic like the simple linear regression scatterplot. However, the method is otherwise unchanged. In this case, there are no high leverage points.
cement_lm2 <- lm(y ~ x1 + x2 + x3 + x4,
data = cement)
leverages <- hatvalues(cement_lm2) > 3 * mean(hatvalues(cement_lm2))
cement[leverages, ]
## # A tibble: 0 x 5
## # ... with 5 variables: y <dbl>, x1 <dbl>, x2 <dbl>, x3 <dbl>, x4 <dbl>
An outlier is a point whose observed response is far from the fitted regression line. That is, an outlier is a point with a large residual. “Large” is relative to the variable units, so standardize the residuals by dividing by an estimate of their standard deviation.1 Residuals standardized in this way are sometimes called internally standardized residuals or internally studentized residuals.
\[r_i = \frac{\epsilon_i}{\sqrt{MSE(1 - H_{ii})}}\]
mse <- sum(cement_lm2$residuals^2) / df.residual(cement_lm2)
cement_lm2$residuals / sqrt(mse*(1 - hatvalues(cement_lm2)))
## 1 2 3 4 5
## 0.002902141 0.756624558 -1.050274056 -0.841081415 0.127905849
## 6 7 8 9 10
## 1.714815620 -0.744450296 -1.687801801 0.670799981 0.210293420
## 11 12 13
## 1.073910078 0.463352296 -1.124105189
These residuals are calculated by the rstandard() function.
rstandard(cement_lm2)
## 1 2 3 4 5
## 0.002902141 0.756624558 -1.050274056 -0.841081415 0.127905849
## 6 7 8 9 10
## 1.714815620 -0.744450296 -1.687801801 0.670799981 0.210293420
## 11 12 13
## 1.073910078 0.463352296 -1.124105189
An observation with a standardized residual larger than 3 in absolute value is generally considered an outlier. One problem with this definition is that a potential outlier pulls the regression line toward itself, muting its measured residual. A second definition of an outlier using the externally studentized residual remedies this problem. Externally standardized (studentized) residuals are standardized using the MSE measured with observation \(i\) removed, \(MSE_{(i)}\).
\[r_i = \frac{\epsilon_i}{\sqrt{MSE_{(i)}(1 - H_{ii})}}\]
Calculating \(MSE_{(i)}\) is a leave-one-out process. In the code below, mdl is a list of \(n\) models, the \(i\)th fit with observation \(i\) removed, and mse is the resulting mean squared error, \(\sum{\epsilon^2} / df_\epsilon\).
library(purrr)
i <- 1:13
mdl <- map(i, function(i) lm(y ~ x1 + x2 + x3 + x4,
data = cement[-i, ]))
mse <- as.numeric(map(i, function(i) sum(mdl[[i]]$residuals^2) /
mdl[[i]]$df.residual))
as.numeric(map(i, function(i) (cement_lm2$residuals /
sqrt(mse[[i]] *
(1 - hatvalues(cement_lm2))))[i]))
## [1] 0.002714706 0.734526654 -1.058093203 -0.824036397 0.119767490
## [6] 2.017049821 -0.721820523 -1.967482994 0.645903738 0.197257449
## [11] 1.085864774 0.439362041 -1.145888712
These residuals are calculated by the rstudent(model) function.
rstudent(cement_lm2)
## 1 2 3 4 5
## 0.002714706 0.734526654 -1.058093203 -0.824036397 0.119767490
## 6 7 8 9 10
## 2.017049821 -0.721820523 -1.967482994 0.645903738 0.197257449
## 11 12 13
## 1.085864774 0.439362041 -1.145888712
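As an aside (not in the original text), the leave-one-out refits are not strictly necessary: the externally studentized residuals follow from the internally studentized ones through the identity \(t_i = r_i \sqrt{\frac{n - p - 1}{n - p - r_i^2}}\), where \(p\) counts the parameters including the intercept.
n <- nrow(cement)
p <- length(coef(cement_lm2))              # parameters, including the intercept
r <- rstandard(cement_lm2)                 # internally studentized residuals
t_ext <- r * sqrt((n - p - 1) / (n - p - r^2))
all.equal(t_ext, rstudent(cement_lm2))     # should be TRUE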
There are two common measures for identifying influential data points: difference in fits (DFFITS), and Cook’s distance. Both methods use the leave-one-out process from above.
The difference in fits for observation \(i\) is defined
\[DFFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)} H_{ii}}}.\]
\(DFFITS_i\) is similar to the externally studentized residual, except that the numerator is the difference between the fitted value for observation \(i\) computed with and without observation \(i\) in the model. An observation is considered influential if its DFFITS value is large in absolute value:
\[|DFFITS_i| > 2 \sqrt{\frac{p + 1}{n - p - 1}}.\]
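Before moving on to a new data set, here is a quick check of this definition against R's dffits() (a sketch, not part of the original example), reusing cement_lm2 and the leave-one-out idea from above.
h <- hatvalues(cement_lm2)
dffits_manual <- sapply(1:nrow(cement), function(i) {
  fit_i <- lm(y ~ x1 + x2 + x3 + x4, data = cement[-i, ])   # drop observation i
  mse_i <- sum(fit_i$residuals^2) / fit_i$df.residual        # MSE_(i)
  yhat_with    <- fitted(cement_lm2)[i]                      # fit using observation i
  yhat_without <- predict(fit_i, newdata = cement[i, ])      # fit without it
  (yhat_with - yhat_without) / sqrt(mse_i * h[i])
})
all.equal(unname(dffits_manual), unname(dffits(cement_lm2))) # should be TRUE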
Data set influence3 (influence3.txt) has one influential observation.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
influence3 <- read_tsv(file = "./Data/influence3.txt")
## Parsed with column specification:
## cols(
## Row = col_double(),
## x = col_double(),
## y = col_double()
## )
print(influence3)
## # A tibble: 21 x 3
## Row x y
## <dbl> <dbl> <dbl>
## 1 1 0.1 -0.0716
## 2 2 0.454 4.17
## 3 3 1.10 6.57
## 4 4 1.28 13.8
## 5 5 2.21 11.5
## 6 6 2.50 13.0
## 7 7 3.04 20.2
## 8 8 3.24 17.6
## 9 9 4.45 26.0
## 10 10 4.17 22.8
## # ... with 11 more rows
From the scatterplot, you can see observation 21 is a high leverage data point.
ggplot(data = influence3, aes(y = y, x = x)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "y ~ x",
subtitle = "Obs 21 an influential data point?") +
geom_text(label=rownames(influence3),
nudge_x = 0.5,
nudge_y = 0.25,
check_overlap = T)
That intuition is confirmed by the diagnostic test:
influence3_lm <- lm(y ~ x, data = influence3)
leverages <- hatvalues(influence3_lm) > 3 * mean(hatvalues(influence3_lm))
influence3[leverages, ]
## # A tibble: 1 x 3
## Row x y
## <dbl> <dbl> <dbl>
## 1 21 14 68
Is it also an outlier? No!
outliers <- abs(rstudent(influence3_lm)) > 3
influence3[outliers, ]
## # A tibble: 0 x 3
## # ... with 3 variables: Row <dbl>, x <dbl>, y <dbl>
However, it is potentially influential according to DFFITS. Use the dffits(model) function to calculate the \(DFFITS_i\) values. In this case look for \(DFFITS_i\) values (in absolute value) greater than
\[2 \sqrt{\frac{p + 1}{n - p - 1}} = 2 \sqrt{\frac{2 + 1}{21 - 2 - 1}} = 0.82.\]
Observation 21 is a potentially influential data point.
p <- length(influence3_lm$coefficients)
n <- nrow(influence3_lm$model)
dffits_crit = 2 * sqrt((p + 1) / (n - p - 1))
influence3_dffits <- dffits(influence3_lm)
df <- data.frame(obs = names(influence3_dffits),
                 dffits = influence3_dffits)
ggplot(df, aes(y = dffits, x = obs)) +
geom_point() +
geom_hline(yintercept = c(dffits_crit, -dffits_crit), linetype="dashed") +
labs(title = "DFFITS",
subtitle = "Influential Observation at observation 21.",
x = "Observation Number",
y = "DFFITS")
influence3[which(abs(influence3_dffits) > dffits_crit),]
## # A tibble: 1 x 3
## Row x y
## <dbl> <dbl> <dbl>
## 1 21 14 68
This is surprising because observation 21 appears to fit the overall trend of the data well. Remember, DFFITS just flags potentially influential data points - you should check whether the flagged points significantly alter the outcome of the regression analysis. If they do, report the results with and without the influential points.
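One quick way to run that check (a sketch, not part of the original analysis) is to refit the model without observation 21 and compare the coefficient estimates side by side.
rbind(with_obs_21    = coef(influence3_lm),
      without_obs_21 = coef(lm(y ~ x, data = influence3[-21, ])))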
Cook’s distance for observation \(i\) is defined
\[D_i = \frac{(y_i - \hat{y}_i)^2}{p \times MSE} \frac{H_{ii}}{(1 - H_{ii})^2}.\]
\(D_i\) directly summarizes how much all of the fitted values change when the \(i\)th observation is deleted. A data point with \(D_i > 1\) is probably influential, and \(D_i > 0.5\) is at least worth investigating.
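As a quick check of the formula (a sketch, not part of the original example), compute \(D_i\) by hand for the influence3 model and compare it to R's cooks.distance().
h   <- hatvalues(influence3_lm)
e   <- residuals(influence3_lm)
p   <- length(coef(influence3_lm))                  # parameters, incl. intercept
mse <- sum(e^2) / df.residual(influence3_lm)
D   <- e^2 / (p * mse) * h / (1 - h)^2
all.equal(unname(D), unname(cooks.distance(influence3_lm)))  # should be TRUE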
Using Cook’s distance, observation 21 is still potentially influential.
cooks_crit = 0.5
influence3_cooks <- cooks.distance(influence3_lm)
df <- data.frame(obs = names(influence3_cooks),
cooks = influence3_cooks)
ggplot(df, aes(y = cooks, x = obs)) +
geom_point() +
geom_hline(yintercept = cooks_crit, linetype="dashed") +
labs(title = "Cook's Distance",
subtitle = "Influential Observation at observation 21.",
x = "Observation Number",
y = "Cook's")
influence3[which(abs(influence3_cooks) > cooks_crit),]
## # A tibble: 1 x 3
## Row x y
## <dbl> <dbl> <dbl>
## 1 21 14 68
Dataset allentest (allentest.txt2) contains Allen Cognitive Level scores and three test scores.

- ACL: Allen Cognitive Level score
- Vocab: vocabulary test score
- Abstract: Abstraction test score
- SDMT: SDMT test score

allentest <- read_tsv(file = "./Data/allentest.txt")
head(allentest)
## # A tibble: 6 x 5
## Subj ACL Vocab Abstract SDMT
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 6 28 36 70
## 2 2 5.4 34 32 49
## 3 3 4.7 19 8 28
## 4 4 4.8 32 28 47
## 5 5 4.9 22 4 29
## 6 6 4.5 24 24 23
The scatterplot matrix suggests which predictors to include in the model. ACL is most strongly correlated with SDMT, but less so with Abstract and Vocab. SDMT is also strongly correlated with both of the other predictor variables, Abstract and Vocab.
plot(allentest[, -1])
cor(allentest[,-1], method = "pearson")
## ACL Vocab Abstract SDMT
## ACL 1.0000000 0.2500002 0.3537383 0.5208768
## Vocab 0.2500002 1.0000000 0.6978405 0.5560707
## Abstract 0.3537383 0.6978405 1.0000000 0.5769238
## SDMT 0.5208768 0.5560707 0.5769238 1.0000000
library(corrplot)
corrplot(cor(allentest[,-1], method = "pearson"))
Fit an initial regression model ACL ~ Vocab + Abstract + SDMT.
allentest.lm <- lm(ACL ~ Vocab + Abstract + SDMT, data = allentest)
summary(allentest.lm)
##
## Call:
## lm(formula = ACL ~ Vocab + Abstract + SDMT, data = allentest)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63459 -0.46009 -0.02471 0.33624 1.52886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.946347 0.338069 11.673 < 2e-16 ***
## Vocab -0.017397 0.018077 -0.962 0.339428
## Abstract 0.012182 0.011585 1.051 0.296926
## SDMT 0.027404 0.007168 3.823 0.000298 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6878 on 65 degrees of freedom
## Multiple R-squared: 0.2857, Adjusted R-squared: 0.2528
## F-statistic: 8.668 on 3 and 65 DF, p-value: 6.414e-05
Conduct diagnostic tests on the model. The residuals vs. fits plot \((\epsilon \sim \hat{Y})\) detects non-linearity, unequal error variances, and outliers. If the relationship is linear, the residuals vary randomly around \(\epsilon = 0\). If the error variances are equal, the residuals vary within a band of constant width around \(\epsilon = 0\), with no fan shape at the low and high ends. The residuals vs. fits plot also identifies outliers and influential points (standardize the residuals by dividing by their standard deviation - about 95% of standardized residuals should fall within two standard deviations of zero).
The residuals normal probability plot is a plot of the theoretical percentiles of the normal distribution versus the observed sample percentiles. It should be approximately linear.
In the plots below the residuals appear independent and normally distributed with constant variance.
par(mfrow = c(2, 2))
plot(allentest.lm)
Assess the presence of influential data points with DFFITS. DFFITS measures how much the regression function changes at the \(i\)th observation when the \(i\)th observation is deleted. Three observations are potentially influential data points.
p <- length(allentest.lm$coefficients)
n <- nrow(allentest.lm$model)
dffits_crit = 2 * sqrt((p + 1) / (n - p - 1))
allentest.lm_dffits <- dffits(allentest.lm)
df <- data.frame(obs = names(allentest.lm_dffits),
dffits = allentest.lm_dffits)
ggplot(df, aes(y = dffits, x = obs)) +
geom_point() +
geom_hline(yintercept = c(dffits_crit, -dffits_crit), linetype="dashed") +
labs(title = "DFFITS",
x = "Observation Number",
y = "DFFITS")
allentest[which(abs(allentest.lm_dffits) > dffits_crit),]
## # A tibble: 3 x 5
## Subj ACL Vocab Abstract SDMT
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20 3.8 18 28 41
## 2 40 3.9 30 36 61
## 3 49 5.6 22 6 20
Assess the presence of influential data points with Cook’s distance. Cook’s distance measures how much the entire regression function changes when the \(i\)th case is deleted. Cook’s \(D_i\) flags no potentially influential points.
cooks_crit = 0.5
allentest.lm_cooks <- cooks.distance(allentest.lm)
df <- data.frame(obs = names(allentest.lm_cooks),
cooks = allentest.lm_cooks)
ggplot(df, aes(y = cooks, x = obs)) +
geom_point() +
geom_hline(yintercept = cooks_crit, linetype="dashed") +
labs(title = "Cook's Distance",
x = "Observation Number",
y = "Cook's")
allentest[which(abs(allentest.lm_cooks) > cooks_crit),]
## # A tibble: 0 x 5
## # ... with 5 variables: Subj <dbl>, ACL <dbl>, Vocab <dbl>,
## # Abstract <dbl>, SDMT <dbl>
Robust regression methods provide an alternative to least squares regression by requiring less restrictive assumptions. These methods dampen the influence of outlying cases in order to provide a better fit to the majority of the data.
Use robust regression when the dataset includes outliers or high leverage data points that are not data entry errors or from a different population. If there is no compelling reason to exclude these data points from the analysis, robust regression is an option because it is a compromise between treating all data points equally and excluding certain points entirely.
One robust regression method is M-estimation, where “M” stands for “maximum likelihood type”. M-estimation defines a weight function such that the estimating equation becomes \(\sum_{i=1}^n {w_i(y_i-x_i'b)x_i'} = 0\). The weights \(w_i\) depend on the residuals, and the residuals depend on the weights, so the equation is solved using iteratively reweighted least squares (IRLS). The coefficient estimate at iteration \(j\) is \(B_j = (X'W_{j-1}X)^{-1}X'W_{j-1}Y\), and the iterations continue until the estimates converge. In Huber weighting, observations with small residuals get a weight of 1 and observations with larger residuals get proportionately smaller weights. In bisquare weighting, all observations are down-weighted at least a little.
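To make the algorithm concrete, here is a minimal hand-rolled IRLS loop with Huber weights (a sketch, not the method used below; the tuning constant k = 1.345 and the MAD scale estimate are conventional choices, and MASS::rlm() remains the implementation to use in practice).
irls_huber <- function(formula, data, k = 1.345, tol = 1e-6, max_iter = 50) {
  fit <- lm(formula, data = data)                # start from the OLS fit
  for (j in seq_len(max_iter)) {
    e <- residuals(fit)
    s <- mad(e, center = 0)                      # robust scale estimate
    w <- pmin(1, k / abs(e / s))                 # Huber weights: 1 for small residuals
    old <- coef(fit)
    fit <- lm(formula, data = data, weights = w) # weighted least squares step
    if (max(abs(coef(fit) - old)) < tol) break   # stop when coefficients stabilize
  }
  fit
}
# For example, irls_huber(crime ~ poverty + single, data = crime) on the crime
# data loaded below should give estimates close to rlm()'s default Huber fit.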
The rlm() function in the MASS package implements several robust regression methods.
Dataset crime (crime.dta) records crime statistics for each state plus Washington DC (n = 51).

- sid: state identifier
- state: state abbrev
- crime: violent crimes per 100,000 people
- murder: murders per 1,000,000 people
- pctmetro: percent of population living in metropolitan areas
- pctwhite: percent of population that is white
- pcths: percent of population graduating high school
- poverty: percent of population living under poverty line
- single: percent of population that are single parents

library(haven)
crime <- read_dta(file = "https://stats.idre.ucla.edu/stat/data/crime.dta")
head(crime)
## # A tibble: 6 x 9
## sid state crime murder pctmetro pctwhite pcths poverty single
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 ak 761 9 41.8 75.2 86.6 9.10 14.3
## 2 2 al 780 11.6 67.4 73.5 66.9 17.4 11.5
## 3 3 ar 593 10.2 44.7 82.9 66.3 20 10.7
## 4 4 az 715 8.60 84.7 88.6 78.7 15.4 12.1
## 5 5 ca 1078 13.1 96.7 79.3 76.2 18.2 12.5
## 6 6 co 567 5.80 81.8 92.5 84.4 9.90 12.1
For this example, examine the relationship between violent crime, the poverty rate, and single parenthood: crime ~ poverty + single.
Fit an OLS regression and run diagnostic tests.
crime.ols <- lm(crime ~ poverty + single,
data = crime)
summary(crime.ols)
##
## Call:
## lm(formula = crime ~ poverty + single, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -811.14 -114.27 -22.44 121.86 689.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1368.189 187.205 -7.308 2.48e-09 ***
## poverty 6.787 8.989 0.755 0.454
## single 166.373 19.423 8.566 3.12e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 243.6 on 48 degrees of freedom
## Multiple R-squared: 0.7072, Adjusted R-squared: 0.695
## F-statistic: 57.96 on 2 and 48 DF, p-value: 1.578e-13
par(mfrow = c(2,2))
plot(crime.ols)
Observations 9 (FL), 25 (MS), and 51 (DC) have either high leverage or large residuals.
crime[c(9, 25, 51), c("sid", "state", "crime", "poverty", "single")]
## # A tibble: 3 x 5
## sid state crime poverty single
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 9 fl 1206 17.8 10.6
## 2 25 ms 434 24.7 14.7
## 3 51 dc 2922 26.4 22.1
Look at the observations with large values of Cook’s D. The conventional cut-off point is 4 / n.
d1 <- cooks.distance(crime.ols)
r <- rstandard(crime.ols)
a <- cbind(crime, d1, r)
a[d1 > 4/51, c("sid", "state", "crime", "poverty", "single",
"d1", "r")]
## sid state crime poverty single d1 r
## 1 1 ak 761 9.1 14.3 0.1254750 -1.397418
## 9 9 fl 1206 17.8 10.6 0.1425891 2.902663
## 25 25 ms 434 24.7 14.7 0.6138721 -3.562990
## 51 51 dc 2922 26.4 22.1 2.6362519 2.616447
Observation 1 (AK) is also flagged, even though it did not stand out in the diagnostic plots above.
Run a robust linear regression with Huber weights.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
##
## cement
## The following object is masked from 'package:dplyr':
##
## select
crime.rr.huber <- rlm(crime ~ poverty + single, data = crime)
summary(crime.rr.huber)
##
## Call: rlm(formula = crime ~ poverty + single, data = crime)
## Residuals:
## Min 1Q Median 3Q Max
## -846.09 -125.80 -16.49 119.15 679.94
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -1423.0373 167.5899 -8.4912
## poverty 8.8677 8.0467 1.1020
## single 168.9858 17.3878 9.7186
##
## Residual standard error: 181.8 on 48 degrees of freedom
library(dplyr)
data.frame(state = crime$state,
resid = crime.rr.huber$resid,
weight = crime.rr.huber$w) %>%
arrange(weight) %>%
head(n = 15)
## state resid weight
## 1 ms -846.08536 0.2889618
## 2 fl 679.94327 0.3595480
## 3 vt -410.48310 0.5955740
## 4 dc 376.34468 0.6494131
## 5 mt -356.13760 0.6864625
## 6 me -337.09622 0.7252263
## 7 nj 331.11603 0.7383578
## 8 il 319.10036 0.7661169
## 9 ak -313.15532 0.7807432
## 10 md 307.19142 0.7958154
## 11 ma 291.20817 0.8395172
## 12 la -266.95752 0.9159411
## 13 al 105.40319 1.0000000
## 14 ar 30.53589 1.0000000
## 15 az -43.25299 1.0000000
The weights decrease as the absolute value of the residual increases: observations with large residuals are down-weighted, while observations that fit well keep a weight of 1. In OLS regression, all observations implicitly have a weight of 1.
Now run a robust regression with the bisquare weighting function.
crime.rr.bisquare <- rlm(crime ~ poverty + single,
data = crime,
psi = psi.bisquare)
summary(crime.rr.bisquare)
##
## Call: rlm(formula = crime ~ poverty + single, data = crime, psi = psi.bisquare)
## Residuals:
## Min 1Q Median 3Q Max
## -905.59 -140.97 -14.98 114.65 668.38
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -1535.3338 164.5062 -9.3330
## poverty 11.6903 7.8987 1.4800
## single 175.9303 17.0678 10.3077
##
## Residual standard error: 202.3 on 48 degrees of freedom
data.frame(state = crime$state,
resid = crime.rr.bisquare$resid,
weight = crime.rr.bisquare$w) %>%
arrange(weight) %>%
head(n = 15)
## state resid weight
## 1 ms -905.5931 0.007652565
## 2 fl 668.3844 0.252870542
## 3 vt -402.8031 0.671495418
## 4 mt -360.8997 0.731136908
## 5 nj 345.9780 0.751347695
## 6 la -332.6527 0.768938330
## 7 me -328.6143 0.774103322
## 8 ak -325.8519 0.777662383
## 9 il 313.1466 0.793658594
## 10 md 308.7737 0.799065530
## 11 ma 297.6068 0.812596833
## 12 dc 260.6489 0.854441716
## 13 wy -234.1952 0.881660897
## 14 ca 201.4407 0.911713981
## 15 ga -186.5799 0.924033113
The weight given for “ms” is much lower using the bisquare weighting function than the Huber weighting function. Huber weights can have difficulties with severe outliers, and bisquare weights can have difficulties converging or may yield multiple solutions.
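To see how the two weighting schemes differ, you can plot the weight functions themselves (a sketch, not from the original; with their default deriv = 0, MASS::psi.huber() and MASS::psi.bisquare() return the weight \(w(u)\) for a scaled residual \(u\)).
u <- seq(-6, 6, by = 0.01)
plot(u, MASS::psi.huber(u), type = "l", ylim = c(0, 1),
     xlab = "scaled residual u", ylab = "weight w(u)")
lines(u, MASS::psi.bisquare(u), lty = 2)
legend("topright", legend = c("Huber", "bisquare"), lty = c(1, 2))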
If the results from a robust regression are very different from the regular OLS regression, use the robust regression. Large differences suggest the outlying observations are strongly influencing the parameter estimates.
The parameter estimates from the two weighting methods differ.
OLS:
summary.lm(crime.ols)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1368.188661 187.205167 -7.3084984 2.478608e-09
## poverty 6.787359 8.988529 0.7551134 4.538703e-01
## single 166.372670 19.422910 8.5657951 3.116800e-11
Huber:
summary(crime.rr.huber)$coefficients
## Value Std. Error t value
## (Intercept) -1423.037337 167.589930 -8.491186
## poverty 8.867678 8.046717 1.102024
## single 168.985787 17.387790 9.718647
Bisquare:
summary(crime.rr.bisquare)$coefficients
## Value Std. Error t value
## (Intercept) -1535.33376 164.506224 -9.332983
## poverty 11.69033 7.898655 1.480041
## single 175.93032 17.067850 10.307703
The reason for studentizing is that the variances of the residuals at different predictor values may differ, even if the variances of the errors are equal. The issue is the difference between errors and residuals. Unlike errors, residuals do not all have the same variance: the variance decreases as the predictor values get farther from their average value. This is because the regression fits values at the ends of the domain more closely. It can also be seen because the residuals at the endpoints depend greatly on the slope of the fitted line, while the residuals in the middle are relatively insensitive to the slope. Read more on Wikipedia↩
See the example at PSU STAT 501. I downloaded the dataset locally to address a unicode encoding issue.↩