Definitions

Some important terms in linear regression.

  • Residual: The difference between the actual, observed value and the value predicted by the regression equation (observed minus predicted).
  • Outlier: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.
  • Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an observation's predictor values deviate from their means. High leverage points can have a large effect on the estimates of the regression coefficients.
  • Influence: An observation is said to be influential if removing the observation substantially changes the estimate of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.
  • Cook's distance (or Cook's D): A measure that combines information on the leverage and the residual of an observation.
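
As a minimal sketch of the residual definition above, the following fits a least-squares line with NumPy and computes residuals as observed minus predicted. The data values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data: y is roughly linear in x, with an unusual last point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 18.0])

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
fitted = X @ beta
residuals = y - fitted                         # observed minus predicted

largest = int(np.argmax(np.abs(residuals)))
print("coefficients:", beta)
print("residuals:", residuals.round(3))
print("largest absolute residual at index", largest)
```

With an intercept in the model the residuals sum to zero, so a single unusual point also shifts the residuals of the other observations.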

Leverage and Influence

  • Studentized Residuals : Residuals divided by their estimated standard errors (like t-statistics). Observations with values larger than 3 in absolute value are considered outliers.
  • Leverage Values (Hat Diag) : Measure of how far an observation is from the others in terms of the levels of the independent variables (not the dependent variable). Observations with values larger than \(2(k+1)/n\) are considered high leverage points and therefore potentially highly influential, where k is the number of predictors and n is the sample size.
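
The leverage cutoff above can be sketched as follows. This assumes the leverage values are the diagonal of the hat matrix \(H = X(X'X)^{-1}X'\), which is the usual meaning of "Hat Diag"; the function name `leverage_values` and the data are hypothetical.

```python
import numpy as np

def leverage_values(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'.

    X is the design matrix and must already include the intercept column.
    """
    # With the reduced QR factorization X = QR, H = Q Q', so the
    # diagonal is the row-wise sum of squares of Q.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

# Hypothetical data: one predictor, last observation far from the others.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
X = np.column_stack([np.ones_like(x), x])   # intercept + k = 1 predictor
n, p = X.shape                              # p = k + 1

h = leverage_values(X)
cutoff = 2 * p / n                          # the 2(k+1)/n rule
high_leverage = np.where(h > cutoff)[0]
print("leverage:", h.round(3))
print("cutoff:", cutoff, "flagged indices:", high_leverage)
```

The leverage values always sum to k+1, so the cutoff is twice the average leverage.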

DFFITS and DFBETAs

  • DFFITS : Measure of how much an observation has affected its fitted value from the regression model. Values larger than \(2\sqrt{(k+1)/n}\) in absolute value are considered highly influential.

  • DFBETAS : Measure of how much an observation has affected the estimate of a regression coefficient (there is one DFBETAS for each regression coefficient, including the intercept). Values larger than \(2/\sqrt{n}\) in absolute value are usually considered highly influential.

  • DFBETAS measures how much impact each observation has on a particular regression coefficient. The DFBETAS for a given predictor and a given observation is the difference between the regression coefficient calculated from all of the data and the regression coefficient calculated with the observation deleted, scaled by the standard error calculated with the observation deleted.
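
The deletion idea behind DFFITS and DFBETAS can be sketched by refitting the model once per observation. This is an illustration, not a production implementation: the function name `dffits_dfbetas` and the data are hypothetical, and the scaling assumes the common convention of using \(s_{(i)}\) (the residual standard deviation with observation i deleted) together with the leverage \(h_i\) for DFFITS and the diagonal of the full-data \((X'X)^{-1}\) for DFBETAS.

```python
import numpy as np

def dffits_dfbetas(X, y):
    """Leave-one-out DFFITS and DFBETAS.

    X is the design matrix and must already include the intercept column.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y                      # full-data coefficients
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage values
    dffits = np.empty(n)
    dfbetas = np.empty((n, p))
    for i in range(n):
        keep = np.arange(n) != i                  # drop observation i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid_i = y[keep] - X[keep] @ beta_i
        s_i = np.sqrt(resid_i @ resid_i / (n - 1 - p))   # deleted std. dev.
        # Change in the fitted value at observation i, scaled.
        dffits[i] = X[i] @ (beta - beta_i) / (s_i * np.sqrt(h[i]))
        # Change in each coefficient, scaled.
        dfbetas[i] = (beta - beta_i) / (s_i * np.sqrt(np.diag(XtX_inv)))
    return dffits, dfbetas

# Hypothetical data: a clean linear trend with one shifted last point.
x = np.arange(1.0, 11.0)
noise = np.array([0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, 8.0])
y = 1.0 + 2.0 * x + noise
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape                                    # p = k + 1

dffits, dfbetas = dffits_dfbetas(X, y)
flag_dffits = np.where(np.abs(dffits) > 2 * np.sqrt(p / n))[0]
flag_dfbetas = np.argwhere(np.abs(dfbetas) > 2 / np.sqrt(n))
print("DFFITS flagged observations:", flag_dffits)
print("DFBETAS flagged (observation, coefficient) pairs:", flag_dfbetas.tolist())
```

Refitting n times is the most direct reading of the definition; closed-form shortcuts exist but hide the deletion step that the definition describes.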


Leverage and Influence

  • Cook's D : Measure of aggregate impact of each observation on the group of regression coefficients, as well as the group of fitted values. Values larger than 4/n are considered highly influential.
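
A minimal sketch of Cook's D follows, assuming the standard closed form \(D_i = \frac{e_i^2}{(k+1)\,s^2} \cdot \frac{h_i}{(1-h_i)^2}\), where \(e_i\) is the ordinary residual, \(h_i\) the leverage, and \(s^2\) the full-data error variance estimate. That formula is not stated in these notes; the function name `cooks_distance` and the data are hypothetical.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D for each observation.

    X is the design matrix and must already include the intercept column.
    """
    n, p = X.shape                                # p = k + 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                              # ordinary residuals
    s2 = e @ e / (n - p)                          # error variance estimate
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)                      # leverage values
    return (e**2 / (p * s2)) * h / (1 - h) ** 2

# Hypothetical data: a clean linear trend with one shifted last point.
x = np.arange(1.0, 11.0)
noise = np.array([0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, 8.0])
y = 1.0 + 2.0 * x + noise
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

D = cooks_distance(X, y)
flagged = np.where(D > 4 / n)[0]                  # the 4/n rule
print("Cook's D:", D.round(3))
print("flagged observations:", flagged)
```

The formula shows how Cook's D combines the size of the residual with the leverage: either one alone being small keeps D small.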

The studentized residual

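  • The studentized residual RSTUDENT is computed with the error standard deviation \(s_{(i)}\) estimated without the ith observation, not with \(s\) from the full fit. With \(r_i\) the ordinary residual and \(h_i\) the leverage of the ith observation,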

\[\mbox{RSTUDENT} = \frac{r_i}{s_{(i)} \sqrt{(1 - h_i)}} \]

  • Observations with RSTUDENT larger than 2 in absolute value may need some attention.
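
The RSTUDENT formula above can be sketched without refitting the model n times. This assumes the standard identity \(s_{(i)}^2 = \frac{(n-k-1)\,s^2 - r_i^2/(1-h_i)}{n-k-2}\) for the deleted variance, which is not given in these notes; the function name `rstudent` and the data are hypothetical.

```python
import numpy as np

def rstudent(X, y):
    """Externally studentized residuals r_i / (s_(i) * sqrt(1 - h_i)).

    X is the design matrix and must already include the intercept column.
    """
    n, p = X.shape                                # p = k + 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta                              # ordinary residuals r_i
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)                      # leverage values h_i
    s2 = r @ r / (n - p)                          # full-data variance s^2
    # Deleted variance s_(i)^2 from the closed-form identity (assumption).
    s2_del = ((n - p) * s2 - r**2 / (1 - h)) / (n - p - 1)
    return r / np.sqrt(s2_del * (1 - h))

# Hypothetical data: a clean linear trend with one shifted last point.
x = np.arange(1.0, 11.0)
noise = np.array([0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, 8.0])
y = 1.0 + 2.0 * x + noise
X = np.column_stack([np.ones_like(x), x])

t = rstudent(X, y)
needs_attention = np.where(np.abs(t) > 2)[0]      # the |RSTUDENT| > 2 rule
print("RSTUDENT:", t.round(2))
print("observations needing attention:", needs_attention)
```

Because \(s_{(i)}\) excludes the ith observation, a gross outlier does not inflate its own scale estimate, which makes it stand out more clearly than it would with \(s\).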