A high leverage data point has extreme predictor \(X\) values. An outlier is a data point whose response value \(y\) does not follow the general trend of the rest of the data. A data point is influential if it unduly influences predicted responses, the estimated slope coefficients, or hypothesis test results. High leverage data points and outliers may or may not be influential - you need to investigate!

If you have influential data points, you can (1) keep them because there are theoretical reasons to include them, (2) drop them, or (3) include them with reduced weight using robust regression.
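
As a minimal sketch of these three options (the data frame df, the formula y ~ x, and the vector of influential row indices idx are all hypothetical stand-ins):

library(MASS)  # rlm() implements robust regression (covered at the end of this section)

# 1) Do nothing: keep all observations on theoretical grounds
fit_all <- lm(y ~ x, data = df)

# 2) Drop the influential observations
fit_dropped <- lm(y ~ x, data = df[-idx, ])

# 3) Keep them with reduced weight using robust regression
fit_robust <- rlm(y ~ x, data = df)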

High Leverage Points

The “hat matrix” \(H\) identifies high leverage points. Recall that in the linear regression model \(\hat{y} = X \hat{\beta}\) the coefficients are estimated by \(\hat{\beta} = (X'X)^{-1}X'y\). Substituting, \(\hat{y} = X(X'X)^{-1}X'y\), or \(\hat{y} = Hy\), where

\[H = X(X'X)^{-1}X'.\]

\(H\) is called the hat matrix because \(H\) puts the hat on \(y\). \(H\) is an \(n \times n\) matrix. The diagonal elements \(H_{ii}\) measure the distance between observation \(i\)’s predictor values \(X_i\) and the mean of the predictor values for the entire data set, \(\bar{X}\). \(H_{ii}\) is the leverage that the observed response \(y_i\) exerts on the predicted response \(\hat{y}_i\). Each \(H_{ii}\) lies in the unit interval [0, 1], and the values sum to the number of regression parameters (including the intercept), \(\sum{H_{ii}} = p\). A common rule is to investigate any observation whose leverage value is more than 3 times the mean leverage value. Since the leverage values sum to \(p\), the mean leverage is \(p/n\), so the rule flags observations with

\[H_{ii} > 3 \frac{p}{n}.\]

In practice, you will identify and deal with influential data points, not the leverage points that give rise to them. However, since leverage is an important component of influential data points, you should know how to work with it.

Example

A study measured cement hardening based on its chemical composition. Data set cement (cement.txt) is a summary of n = 13 formulations:

  • y = heat evolved, in calories/gram
  • x1 = % of tricalcium aluminate
  • x2 = % of tricalcium silicate
  • x3 = % of tetracalcium alumino ferrite
  • x4 = % of dicalcium silicate
library(readr)
cement <- read_fwf(file = "./Data/cement.txt", 
                   skip = 1,
                   fwf_widths(c(5, 7, 7, 7, 7)))
colnames(cement) <- c("y", "x1", "x2", "x3", "x4")
head(cement)
## # A tibble: 6 x 5
##       y    x1    x2    x3    x4
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  78.5     7    26     6    60
## 2  74.3     1    29    15    52
## 3 104.     11    56     8    20
## 4  87.6    11    31     8    47
## 5  95.9     7    52     6    33
## 6 109.     11    55     9    22

Suppose you model y ~ x1.

cement_lm1 <- lm(y ~ x1, data = cement)

From the scatterplot it looks like observation 10 may be a high leverage point.

library(ggplot2)
ggplot(data = cement, aes(y = y, x = x1)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "y ~ x1",
       subtitle = "Obs 10 a high leverage point?") +
  geom_text(label=rownames(cement), 
            nudge_x = 0.5, 
            nudge_y = 0.25, 
            check_overlap = T)

The hat matrix is \(H = X(X'X)^{-1}X'\). The diagonal elements are the leverages. The further each \(X_i\) is from \(\bar{X}\), the larger the leverage. Observation 10 has the largest leverage.

X <- matrix(c(rep(1, times = nrow(cement)),
              cement$x1), 
            ncol = 2)

# t(x) is transpose of x: x'
# solve(x) is inverse of x: x^{-1}
H <- X %*% solve(t(X) %*% X) %*% t(X)

data.frame(x1 = cement$x1,
           H = diag(H))
##    x1          H
## 1   7 0.07743609
## 2   1 0.17747314
## 3  11 0.10707670
## 4  11 0.10707670
## 5   7 0.07743609
## 6  11 0.10707670
## 7   3 0.12486106
## 8   1 0.17747314
## 9   2 0.14875880
## 10 21 0.51834013
## 11  1 0.17747314
## 12 11 0.10707670
## 13 10 0.09244165

The sum of the matrix diagonal (trace) equals the number of regression parameters in the model, including the intercept.

sum(diag(H))
## [1] 2

You did not need to calculate the hat matrix manually for this exercise because the R function hatvalues(model) returns the diagonal elements of the hat matrix. Which observation(s) are high leverage points? Observation 10 has more than three times the mean leverage.

leverages <- hatvalues(cement_lm1) > 3 * mean(hatvalues(cement_lm1))
cement[leverages, ]
## # A tibble: 1 x 5
##       y    x1    x2    x3    x4
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  116.    21    47     4    26

For multiple linear regression models like y ~ x1 + x2 + x3 + x4, there is no straightforward visual diagnostic like the simple linear regression scatterplot. However, the method is otherwise unchanged. In this case, there are no high leverage points.

cement_lm2 <- lm(y ~ x1 + x2 + x3 + x4, 
                 data = cement)

leverages <- hatvalues(cement_lm2) > 3 * mean(hatvalues(cement_lm2))
cement[leverages, ]
## # A tibble: 0 x 5
## # ... with 5 variables: y <dbl>, x1 <dbl>, x2 <dbl>, x3 <dbl>, x4 <dbl>

Outliers

An outlier is an observation whose response value is far from the fitted regression line, that is, a point with a large residual. “Large” is relative to the units of the response, so standardize the residuals by dividing by an estimate of their standard deviation.1 Residuals standardized in this way are sometimes called internally standardized residuals or internally studentized residuals.

\[r_i = \frac{\epsilon_i}{\sqrt{MSE(1 - H_{ii})}}\]

mse <- sum(cement_lm2$residuals^2) / df.residual(cement_lm2)
cement_lm2$residuals / sqrt(mse*(1 - hatvalues(cement_lm2)))
##            1            2            3            4            5 
##  0.002902141  0.756624558 -1.050274056 -0.841081415  0.127905849 
##            6            7            8            9           10 
##  1.714815620 -0.744450296 -1.687801801  0.670799981  0.210293420 
##           11           12           13 
##  1.073910078  0.463352296 -1.124105189

These residuals are calculated by the rstandard() function.

rstandard(cement_lm2)
##            1            2            3            4            5 
##  0.002902141  0.756624558 -1.050274056 -0.841081415  0.127905849 
##            6            7            8            9           10 
##  1.714815620 -0.744450296 -1.687801801  0.670799981  0.210293420 
##           11           12           13 
##  1.073910078  0.463352296 -1.124105189

An observation with a standardized residual larger than 3 in absolute value is generally considered an outlier. One problem with this definition is that a potential outlier pulls the regression line toward itself, muting its measured residual. A second definition of an outlier, using the externally studentized residual, remedies this problem. Externally studentized (standardized) residuals divide each residual by \(\sqrt{MSE_{(i)}(1 - H_{ii})}\), where \(MSE_{(i)}\) is the MSE measured with observation \(i\) removed.

\[r_i = \frac{\epsilon_i}{\sqrt{MSE_{(i)}(1 - H_{ii})}}\]

Calculating \(MSE_{(i)}\) is a leave-one-out process. In the code below, mdl is a list of \(i = 1:n\) models, each fit with observation i removed, and mse is the resulting mean squared error, \(\sum{\epsilon^2} / df_\epsilon\).

library(purrr)
i <- 1:13
mdl <- map(i, function(i) lm(y ~ x1 + x2 + x3 + x4, 
                             data = cement[-i, ]))
mse <- as.numeric(map(i, function(i) sum(mdl[[i]]$residuals^2) /
                         mdl[[i]]$df.residual))
as.numeric(map(i, function(i)  (cement_lm2$residuals / 
                                  sqrt(mse[[i]] * 
                                         (1 - hatvalues(cement_lm2))))[i]))
##  [1]  0.002714706  0.734526654 -1.058093203 -0.824036397  0.119767490
##  [6]  2.017049821 -0.721820523 -1.967482994  0.645903738  0.197257449
## [11]  1.085864774  0.439362041 -1.145888712

These residuals are calculated by the rstudent(model) function.

rstudent(cement_lm2)
##            1            2            3            4            5 
##  0.002714706  0.734526654 -1.058093203 -0.824036397  0.119767490 
##            6            7            8            9           10 
##  2.017049821 -0.721820523 -1.967482994  0.645903738  0.197257449 
##           11           12           13 
##  1.085864774  0.439362041 -1.145888712

Influential Points

There are two common measures for identifying influential data points: difference in fits (DFFITS), and Cook’s distance. Both methods use the leave-one-out process from above.

DFFITS

The difference in fits for observation \(i\) is defined

\[DFFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)} H_{ii}}}.\]

\(DFFITS\) is similar to the externally studentized residual definition, except that the numerator is the difference between the fitted value with observation \(i\) included and the fitted value with observation \(i\) removed. An observation is considered potentially influential if

\[|DFFITS_i| > 2 \sqrt{\frac{p + 1}{n - p - 1}}.\]
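
You can check the definition directly by refitting the model with each observation left out and comparing the result to R’s dffits() function. This is a minimal sketch; the fitted lm object fit and its data frame dat are hypothetical stand-ins.

manual_dffits <- sapply(seq_len(nrow(dat)), function(i) {
  fit_i  <- update(fit, data = dat[-i, ])        # refit without observation i
  yhat   <- fitted(fit)[i]                       # fitted value using all observations
  yhat_i <- predict(fit_i, newdata = dat[i, ])   # fitted value without observation i
  mse_i  <- sum(residuals(fit_i)^2) / df.residual(fit_i)
  (yhat - yhat_i) / sqrt(mse_i * hatvalues(fit)[i])
})
all.equal(unname(manual_dffits), unname(dffits(fit)))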

Example

Data set influence3 (influence3.txt) has one influential observation.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
influence3 <- read_tsv(file = "./Data/influence3.txt")
## Parsed with column specification:
## cols(
##   Row = col_double(),
##   x = col_double(),
##   y = col_double()
## )
print(influence3)
## # A tibble: 21 x 3
##      Row     x       y
##    <dbl> <dbl>   <dbl>
##  1     1 0.1   -0.0716
##  2     2 0.454  4.17  
##  3     3 1.10   6.57  
##  4     4 1.28  13.8   
##  5     5 2.21  11.5   
##  6     6 2.50  13.0   
##  7     7 3.04  20.2   
##  8     8 3.24  17.6   
##  9     9 4.45  26.0   
## 10    10 4.17  22.8   
## # ... with 11 more rows

From the scatterplot, you can see observation 21 is a high leverage data point.

ggplot(data = influence3, aes(y = y, x = x)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "y ~ x",
       subtitle = "Obs 21 an influential data point?") +
  geom_text(label=rownames(influence3), 
            nudge_x = 0.5, 
            nudge_y = 0.25, 
            check_overlap = T)

That intuition is confirmed by the diagnostic test:

influence3_lm <- lm(y ~ x, data = influence3)
leverages <- hatvalues(influence3_lm) > 3 * mean(hatvalues(influence3_lm))
influence3[leverages, ]
## # A tibble: 1 x 3
##     Row     x     y
##   <dbl> <dbl> <dbl>
## 1    21    14    68

Is it also an outlier? No!

outliers <- abs(rstudent(influence3_lm)) > 3
influence3[outliers, ]
## # A tibble: 0 x 3
## # ... with 3 variables: Row <dbl>, x <dbl>, y <dbl>

However, it is potentially influential according to DFFITS. Use the dffits(model) function to calculate the \(DFFITS_i\) values. In this case look for \(DFFITS_i\) values (in absolute value) greater than

\[2 \sqrt{\frac{p + 1}{n - p - 1}} = 2 \sqrt{\frac{2 + 1}{21 - 2 - 1}} = 0.82.\]

Observation 21 is a potential influential data point.

p <- length(influence3_lm$coefficients)
n <- nrow(influence3_lm$model)
dffits_crit = 2 * sqrt((p + 1) / (n - p - 1))
influence3_dffits <- dffits(influence3_lm)
df <- data.frame(obs = names(influence3_dffits),
                 dffits = influence3_dffits)
ggplot(df, aes(y = dffits, x = obs)) +
  geom_point() +
  geom_hline(yintercept = c(dffits_crit, -dffits_crit), linetype="dashed") +
  labs(title = "DFFITS",
       subtitle = "Influential Observation at observation 21.",
       x = "Observation Number",
       y = "DFFITS")

influence3[which(abs(influence3_dffits) > dffits_crit),]
## # A tibble: 1 x 3
##     Row     x     y
##   <dbl> <dbl> <dbl>
## 1    21    14    68

This is somewhat surprising because observation 21 appears to fit the overall pattern of the data well. Remember, DFFITS just flags potentially influential data points - you should check whether they significantly alter the outcome of the regression analysis. If they do, report the results with and without the influential points.

Cook’s Distance

Cook’s distance for observation \(i\) is defined

\[D_i = \frac{(y_i - \hat{y}_i)^2}{p \times MSE} \frac{H_{ii}}{(1 - H_{ii})^2}.\]

\(D_i\) directly summarizes how much all of the fitted values change when the \(i\)th observation is deleted. A data point with \(D_i > 1\) is probably influential. \(D_i > 0.5\) is at least worth investigating.
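
Because every component of \(D_i\) is available from a fitted model, you can compute it directly and compare the result to R’s cooks.distance() function. A minimal sketch (the fitted lm object fit is a hypothetical stand-in), using the equivalent form \(D_i = \frac{r_i^2}{p} \frac{H_{ii}}{1 - H_{ii}}\), where \(r_i\) is the internally standardized residual from above:

p <- length(coef(fit))                   # regression parameters, including the intercept
h <- hatvalues(fit)                      # leverages H_ii
d <- rstandard(fit)^2 / p * h / (1 - h)  # Cook's distance from its components
all.equal(d, cooks.distance(fit))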

Using Cook’s distance, observation 21 is still potentially influential.

cooks_crit = 0.5
influence3_cooks <- cooks.distance(influence3_lm)
df <- data.frame(obs = names(influence3_cooks),
                 cooks = influence3_cooks)
ggplot(df, aes(y = cooks, x = obs)) +
  geom_point() +
  geom_hline(yintercept = cooks_crit, linetype="dashed") +
  labs(title = "Cook's Distance",
       subtitle = "Influential Observation at observation 21.",
       x = "Observation Number",
       y = "Cook's")

influence3[which(abs(influence3_cooks) > cooks_crit),]
## # A tibble: 1 x 3
##     Row     x     y
##   <dbl> <dbl> <dbl>
## 1    21    14    68

Full Example

Dataset allentest (allentest.txt2) contains Allen Cognitive Level scores and three test scores.

  • ACL: Allen Cognitive Level score
  • Vocab: vocabulary test score
  • Abstract: Abstraction test score
  • SDMT: SDMT test score
allentest <- read_tsv(file = "./Data/allentest.txt")
head(allentest)
## # A tibble: 6 x 5
##    Subj   ACL Vocab Abstract  SDMT
##   <dbl> <dbl> <dbl>    <dbl> <dbl>
## 1     1   6      28       36    70
## 2     2   5.4    34       32    49
## 3     3   4.7    19        8    28
## 4     4   4.8    32       28    47
## 5     5   4.9    22        4    29
## 6     6   4.5    24       24    23

The scatterplot matrix suggests which predictors to include in the model. ACL is most strongly correlated with SDMT, and less so with Abstract and Vocab. SDMT is also strongly correlated with the other two predictor variables, Abstract and Vocab.

plot(allentest[, -1])

cor(allentest[,-1], method = "pearson")
##                ACL     Vocab  Abstract      SDMT
## ACL      1.0000000 0.2500002 0.3537383 0.5208768
## Vocab    0.2500002 1.0000000 0.6978405 0.5560707
## Abstract 0.3537383 0.6978405 1.0000000 0.5769238
## SDMT     0.5208768 0.5560707 0.5769238 1.0000000
library(corrplot)
corrplot(cor(allentest[,-1], method = "pearson"))

Fit an initial regression model ACL ~ Vocab + Abstract + SDMT.

allentest.lm <- lm(ACL ~ Vocab + Abstract + SDMT, data = allentest)
summary(allentest.lm)
## 
## Call:
## lm(formula = ACL ~ Vocab + Abstract + SDMT, data = allentest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63459 -0.46009 -0.02471  0.33624  1.52886 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.946347   0.338069  11.673  < 2e-16 ***
## Vocab       -0.017397   0.018077  -0.962 0.339428    
## Abstract     0.012182   0.011585   1.051 0.296926    
## SDMT         0.027404   0.007168   3.823 0.000298 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6878 on 65 degrees of freedom
## Multiple R-squared:  0.2857, Adjusted R-squared:  0.2528 
## F-statistic: 8.668 on 3 and 65 DF,  p-value: 6.414e-05

Conduct diagnostic tests on the model. The residuals vs. fits plot \((\epsilon \sim \hat{Y})\) detects non-linearity, unequal error variances, and outliers. If the relationship is linear, the values vary around \(\epsilon = 0\). If the error variances are equal, the values vary with a constant width around \(\epsilon = 0\), with no fan shape at the low and high ends. The residuals vs. fits plot also identifies outliers and influential points (standardize the residuals by dividing by their standard deviation; about 95% of standardized residuals should fall within two standard deviations of zero).

The residuals normal probability plot is a plot of the theoretical percentiles of the normal distribution versus the observed sample percentiles. It should be approximately linear.

In the plots below, the errors appear to be independent and normally distributed with constant variance.

par(mfrow = c(2, 2))
plot(allentest.lm)

Assess the presence of influential data points with DFFITS. DFFITS measures how much the regression function changes at the i-th observation when the i-th observation is deleted. Three observations are flagged as potentially influential.

p <- length(allentest.lm$coefficients)
n <- nrow(allentest.lm$model)
dffits_crit = 2 * sqrt((p + 1) / (n - p - 1))
allentest.lm_dffits <- dffits(allentest.lm)
df <- data.frame(obs = names(allentest.lm_dffits),
                 dffits = allentest.lm_dffits)
ggplot(df, aes(y = dffits, x = obs)) +
  geom_point() +
  geom_hline(yintercept = c(dffits_crit, -dffits_crit), linetype="dashed") +
  labs(title = "DFFITS",
       x = "Observation Number",
       y = "DFFITS")

allentest[which(abs(allentest.lm_dffits) > dffits_crit),]
## # A tibble: 3 x 5
##    Subj   ACL Vocab Abstract  SDMT
##   <dbl> <dbl> <dbl>    <dbl> <dbl>
## 1    20   3.8    18       28    41
## 2    40   3.9    30       36    61
## 3    49   5.6    22        6    20

Assess the presence of influential data points with Cook’s distance. Cook’s distance measures how much the entire regression function changes when the i-th observation is deleted. Cook’s \(D_i\) flags no potentially influential points.

cooks_crit = 0.5
allentest.lm_cooks <- cooks.distance(allentest.lm)
df <- data.frame(obs = names(allentest.lm_cooks),
                 cooks = allentest.lm_cooks)
ggplot(df, aes(y = cooks, x = obs)) +
  geom_point() +
  geom_hline(yintercept = cooks_crit, linetype="dashed") +
  labs(title = "Cook's Distance",
       x = "Observation Number",
       y = "Cook's")

allentest[which(abs(allentest.lm_cooks) > cooks_crit),]
## # A tibble: 0 x 5
## # ... with 5 variables: Subj <dbl>, ACL <dbl>, Vocab <dbl>,
## #   Abstract <dbl>, SDMT <dbl>

Robust Regression

Robust regression methods provide an alternative to least squares regression by requiring less restrictive assumptions. These methods dampen the influence of outlying cases in order to provide a better fit to the majority of the data.

Use robust regression when the dataset includes outliers or high leverage data points that are not data entry errors or from a different population. If there is no compelling reason to exclude these data points from the analysis, robust regression is an option because it is a compromise between treating all data points equally and excluding certain points entirely.

One robust regression method is M-estimation, where “M” stands for “maximum likelihood type”. M-estimation defines a weight function such that the estimating equation becomes \(\sum_{i=1}^n {w_i(y_i - x_i'b)x_i'} = 0\). The weights \(w_i\) depend on the residuals, and the residuals depend on the weights, so the equation is solved using iteratively reweighted least squares (IRLS). The coefficient estimate at iteration \(j\) is \(B_j = (X'W_{j-1}X)^{-1}X'W_{j-1}Y\), and the iterations continue until the estimates converge. In Huber weighting, observations with small residuals receive a weight of 1, and observations with larger residuals receive proportionately smaller weights. In bisquare weighting, all observations are down-weighted at least a little.
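
A bare-bones IRLS sketch with Huber weights is shown below. It is an illustration of the algorithm, not MASS’s implementation; the design matrix X (with an intercept column), response y, and tuning constant k = 1.345 (the conventional Huber default) are assumptions.

irls_huber <- function(X, y, k = 1.345, tol = 1e-8, max_iter = 50) {
  b <- solve(t(X) %*% X, t(X) %*% y)             # start from the OLS estimate
  for (j in seq_len(max_iter)) {
    r <- as.vector(y - X %*% b)                  # current residuals
    s <- median(abs(r)) / 0.6745                 # robust scale estimate (MAD)
    u <- r / s                                   # scaled residuals
    w <- ifelse(abs(u) <= k, 1, k / abs(u))      # Huber weights
    b_new <- solve(t(X) %*% (w * X), t(X) %*% (w * y))  # weighted least squares step
    if (max(abs(b_new - b)) < tol) break         # stop when the estimates converge
    b <- b_new
  }
  b
}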

The rlm() function in the MASS package implements several robust regression methods.

Example

Dataset crime (crime.dta) records crime statistics for each state plus Washington DC (n = 51).

  • sid: state identifier
  • state: state abbrev
  • crime: violent crimes per 100,000 people
  • murder: murders per 1,000,000 people
  • pctmetro: percent of population living in metropolitan areas
  • pctwhite: percent of population that is white
  • pcths: percent of population graduating high school
  • poverty: percent of population living under poverty line
  • single: percent of population that are single parents
library(haven)

crime <- read_dta(file = "https://stats.idre.ucla.edu/stat/data/crime.dta")

head(crime)
## # A tibble: 6 x 9
##     sid state crime murder pctmetro pctwhite pcths poverty single
##   <dbl> <chr> <dbl>  <dbl>    <dbl>    <dbl> <dbl>   <dbl>  <dbl>
## 1     1 ak      761   9        41.8     75.2  86.6    9.10   14.3
## 2     2 al      780  11.6      67.4     73.5  66.9   17.4    11.5
## 3     3 ar      593  10.2      44.7     82.9  66.3   20      10.7
## 4     4 az      715   8.60     84.7     88.6  78.7   15.4    12.1
## 5     5 ca     1078  13.1      96.7     79.3  76.2   18.2    12.5
## 6     6 co      567   5.80     81.8     92.5  84.4    9.90   12.1

For this example, examine the relationship between violent crime and poverty rate and single parenthood, crime ~ poverty + single.

Fit an OLS regression and run diagnostic tests.

crime.ols <- lm(crime ~ poverty + single, 
                data = crime)
summary(crime.ols)
## 
## Call:
## lm(formula = crime ~ poverty + single, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -811.14 -114.27  -22.44  121.86  689.82 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1368.189    187.205  -7.308 2.48e-09 ***
## poverty         6.787      8.989   0.755    0.454    
## single        166.373     19.423   8.566 3.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 243.6 on 48 degrees of freedom
## Multiple R-squared:  0.7072, Adjusted R-squared:  0.695 
## F-statistic: 57.96 on 2 and 48 DF,  p-value: 1.578e-13
par(mfrow = c(2,2))
plot(crime.ols)

Observations 9 (FL), 25 (MS), and 51 (DC) have either high leverage or large residuals.

crime[c(9, 25, 51), c("sid", "state", "crime", "poverty", "single")]
## # A tibble: 3 x 5
##     sid state crime poverty single
##   <dbl> <chr> <dbl>   <dbl>  <dbl>
## 1     9 fl     1206    17.8   10.6
## 2    25 ms      434    24.7   14.7
## 3    51 dc     2922    26.4   22.1

Look at the observations with large values of Cook’s D. The conventional cut-off point is 4 / n.

d1 <- cooks.distance(crime.ols)
r <- rstandard(crime.ols)
a <- cbind(crime, d1, r)
a[d1 > 4/51, c("sid", "state", "crime", "poverty", "single", 
               "d1", "r")]
##    sid state crime poverty single        d1         r
## 1    1    ak   761     9.1   14.3 0.1254750 -1.397418
## 9    9    fl  1206    17.8   10.6 0.1425891  2.902663
## 25  25    ms   434    24.7   14.7 0.6138721 -3.562990
## 51  51    dc  2922    26.4   22.1 2.6362519  2.616447

Observation 1 (AK) is also flagged, in addition to the three observations identified above.

Run a robust linear regression with Huber weights.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
## 
##     cement
## The following object is masked from 'package:dplyr':
## 
##     select
crime.rr.huber <- rlm(crime ~ poverty + single, data = crime)
summary(crime.rr.huber)
## 
## Call: rlm(formula = crime ~ poverty + single, data = crime)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -846.09 -125.80  -16.49  119.15  679.94 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) -1423.0373   167.5899    -8.4912
## poverty         8.8677     8.0467     1.1020
## single        168.9858    17.3878     9.7186
## 
## Residual standard error: 181.8 on 48 degrees of freedom
library(dplyr)
data.frame(state = crime$state, 
           resid = crime.rr.huber$resid, 
           weight = crime.rr.huber$w) %>%
  arrange(weight) %>%
  head(n = 15)
##    state      resid    weight
## 1     ms -846.08536 0.2889618
## 2     fl  679.94327 0.3595480
## 3     vt -410.48310 0.5955740
## 4     dc  376.34468 0.6494131
## 5     mt -356.13760 0.6864625
## 6     me -337.09622 0.7252263
## 7     nj  331.11603 0.7383578
## 8     il  319.10036 0.7661169
## 9     ak -313.15532 0.7807432
## 10    md  307.19142 0.7958154
## 11    ma  291.20817 0.8395172
## 12    la -266.95752 0.9159411
## 13    al  105.40319 1.0000000
## 14    ar   30.53589 1.0000000
## 15    az  -43.25299 1.0000000

The weights decrease as the absolute value of the residual increases: observations with large residuals are down-weighted. In OLS regression, all observations have a weight of 1.

Now run a robust regression with the bisquare weighting function.

crime.rr.bisquare <- rlm(crime ~ poverty + single, 
                         data = crime, 
                         psi = psi.bisquare)
summary(crime.rr.bisquare)
## 
## Call: rlm(formula = crime ~ poverty + single, data = crime, psi = psi.bisquare)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -905.59 -140.97  -14.98  114.65  668.38 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) -1535.3338   164.5062    -9.3330
## poverty        11.6903     7.8987     1.4800
## single        175.9303    17.0678    10.3077
## 
## Residual standard error: 202.3 on 48 degrees of freedom
data.frame(state = crime$state, 
           resid = crime.rr.bisquare$resid, 
           weight = crime.rr.bisquare$w) %>%
  arrange(weight) %>%
  head(n = 15)
##    state     resid      weight
## 1     ms -905.5931 0.007652565
## 2     fl  668.3844 0.252870542
## 3     vt -402.8031 0.671495418
## 4     mt -360.8997 0.731136908
## 5     nj  345.9780 0.751347695
## 6     la -332.6527 0.768938330
## 7     me -328.6143 0.774103322
## 8     ak -325.8519 0.777662383
## 9     il  313.1466 0.793658594
## 10    md  308.7737 0.799065530
## 11    ma  297.6068 0.812596833
## 12    dc  260.6489 0.854441716
## 13    wy -234.1952 0.881660897
## 14    ca  201.4407 0.911713981
## 15    ga -186.5799 0.924033113

The weight given for “ms” is much lower using the bisquare weighting function than the Huber weighting function. Huber weights can have difficulties with severe outliers, and bisquare weights can have difficulties converging or may yield multiple solutions.
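
Plotting the two weight functions against the scaled residual makes the difference concrete. A small sketch (the tuning constants 1.345 and 4.685 are the conventional defaults used by rlm()):

u        <- seq(-6, 6, by = 0.01)
huber    <- pmin(1, 1.345 / abs(u))                            # weight 1 near zero, decays slowly
bisquare <- ifelse(abs(u) <= 4.685, (1 - (u / 4.685)^2)^2, 0)  # down-weights all nonzero residuals, 0 for extreme ones
matplot(u, cbind(huber, bisquare), type = "l", lty = 1:2, col = 1:2,
        xlab = "scaled residual", ylab = "weight",
        main = "Huber vs. bisquare weight functions")
legend("topright", legend = c("Huber", "bisquare"), lty = 1:2, col = 1:2)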

If the results from a robust regression are very different from the regular OLS regression, use the robust regression. Large differences mean the outliers are highly influencing the model parameters.

The parameter estimates from OLS and the two robust weighting methods differ.

OLS:

summary.lm(crime.ols)$coefficients
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -1368.188661 187.205167 -7.3084984 2.478608e-09
## poverty         6.787359   8.988529  0.7551134 4.538703e-01
## single        166.372670  19.422910  8.5657951 3.116800e-11

Huber:

summary(crime.rr.huber)$coefficients
##                    Value Std. Error   t value
## (Intercept) -1423.037337 167.589930 -8.491186
## poverty         8.867678   8.046717  1.102024
## single        168.985787  17.387790  9.718647

Bisquare:

summary(crime.rr.bisquare)$coefficients
##                   Value Std. Error   t value
## (Intercept) -1535.33376 164.506224 -9.332983
## poverty        11.69033   7.898655  1.480041
## single        175.93032  17.067850 10.307703

  1. The reason for studentizing is that the variances of the residuals at different predictor values may differ, even if the variances of the errors are equal. The issue is the difference between errors and residuals. Unlike the errors, the residuals do not all have the same variance: the variance decreases as the predictor values get farther from their average. This is because the regression line fits the values at the ends of the domain more closely. It can also be seen because the residuals at the endpoints depend greatly on the slope of the fitted line, while the residuals in the middle are relatively insensitive to the slope. Read more on Wikipedia

  2. See the example at PSU STAT 501. I downloaded the data set locally to work around a unicode encoding issue.