Plotting Outlier Statistics

R automatically provides us with several commonly used diagnostic statistics upon constructing a linear model. For a useful cheat sheet of linear regression commands and objects, please see the following: http://www2.kenyon.edu/Depts/Math/hartlaub/Math305%20Fall2011/R.htm

In this example, we will create a standardized DFFITS scatterplot (DFFITS measures how much a case's fitted value changes when that case is left out of the fit), draw lines at the cutoff values, and label the points that fall outside those cutoffs.

  1. Read in data
df <- read.csv("Sales.csv")
  1. Construct Linear Model
fit <- with(df, lm(Sales ~ Advertising + Bonuses))
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Advertising + Bonuses)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -165.255  -84.635    6.292   54.150  131.377 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -516.4443   189.8757  -2.720   0.0125 *  
## Advertising    2.4732     0.2753   8.983 8.18e-09 ***
## Bonuses        1.8562     0.7157   2.593   0.0166 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90.75 on 22 degrees of freedom
## Multiple R-squared:  0.8549, Adjusted R-squared:  0.8418 
## F-statistic: 64.83 on 2 and 22 DF,  p-value: 5.985e-10
  1. Define n, k, and Dffits critical value
n <- nrow(df)
k <- length(fit$coefficients)-1
cv <- 2*sqrt(k/n)
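
To make these quantities concrete without needing Sales.csv, here is a minimal sketch on R's built-in mtcars data. The dataset, the model mpg ~ wt + hp, and every name ending in _demo are assumptions made for illustration and are not part of the original example. Note also that some textbooks state the cutoff in terms of the number of coefficients including the intercept, i.e. 2*sqrt((k+1)/n); the sketch follows the k/n convention used in this post.

```r
# Illustration only: mtcars stands in for Sales.csv (assumption)
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

n_demo  <- nrow(mtcars)                 # number of observations
k_demo  <- length(coef(fit_demo)) - 1   # number of predictors (intercept excluded)
cv_demo <- 2 * sqrt(k_demo / n_demo)    # cutoff under the convention used in this post

d_demo <- dffits(fit_demo)              # one DFFITS value per case, named by row name

# Names of the cases whose DFFITS fall outside +/- cv_demo
flagged_demo <- names(d_demo)[abs(d_demo) > cv_demo]

print(cv_demo)
print(flagged_demo)
```

With 32 rows and 2 predictors the cutoff works out to 2*sqrt(2/32) = 0.5, which makes it easy to eyeball which values of d_demo are flagged.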
  1. Create a scatterplot of the DFFITS values, which dffits() computes from the fitted model. Label the axes, and put the critical value in the title. Then draw horizontal lines at the critical values.
plot(dffits(fit), 
     ylab = "Standardized dfFits", xlab = "Index", 
     main = paste("Standardized DfFits, \n critical value = 2*sqrt(k/n) = +/-", round(cv,3)))

#Critical Value horizontal lines
abline(h = cv, lty = 2)
abline(h = -cv, lty = 2)

  1. Labelling Points

Here we have a few options. The simplest from a coding standpoint is to use R’s built-in identify function and manually click on the points that fall outside of our limits. The code for this is basically: identify(x coordinates, y coordinates, labels to plot, plot = TRUE). Run this and click the points on the plot you want to label. Then click Finish in the upper right of the plot (in RStudio) and the labels will appear.

# x is the case index, matching the x axis of the index plot
identify(seq_along(dffits(fit)), dffits(fit), labels = row.names(df), plot = TRUE)

The second option uses the textxy function from the calibrate package to automatically label the points that fall outside the critical values. While the code for this is slightly more difficult to follow, it is a cool feature to have. First we give the function the x coordinates of the points whose DFFITS values are below -cv or above cv, then the y coordinates of those same points. Lastly, we tell the function to label each of those points with its case number, which ends up being the same as the x value since this is an index plot.

library(calibrate)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.2.2
plot(dffits(fit), 
     ylab = "Standardized dfFits", xlab = "Index", 
     main = paste("Standardized DfFits, \n critical value = 2*sqrt(k/n) = +/-", round(cv,3)))
abline(h = cv, lty = 2)
abline(h = -cv, lty = 2)

#code for labeling points
d <- dffits(fit)            # compute the DFFITS values once
out <- which(abs(d) > cv)   # positions of the cases beyond +/- cv
# x = case index, y = DFFITS value, label = case name
textxy(out, d[out], labs = names(d)[out], cex = 0.7, offset = -1)
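
If you would rather not depend on the calibrate package, base R's text() can do the same labelling. The following is a minimal sketch, again using the built-in mtcars data and the model mpg ~ wt + hp as stand-ins (assumptions for illustration; the _demo names are not from the original example).

```r
# Illustration only: mtcars stands in for Sales.csv (assumption)
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
d_demo   <- dffits(fit_demo)
cv_demo  <- 2 * sqrt((length(coef(fit_demo)) - 1) / nrow(mtcars))

# Positions (x coordinates in the index plot) of cases beyond +/- cv_demo
out_demo <- which(abs(d_demo) > cv_demo)

plot(d_demo, ylab = "Standardized dfFits", xlab = "Index")
abline(h = c(-cv_demo, cv_demo), lty = 2)   # both cutoff lines in one call

# Label only the flagged points; skip the call if nothing was flagged
if (length(out_demo) > 0) {
  text(out_demo, d_demo[out_demo],
       labels = names(d_demo)[out_demo],
       pos = 3, cex = 0.7)                  # pos = 3 places the label above the point
}
```

Using which() for the x coordinates means the labels line up with the plot even when the row names are not numeric, as is the case for mtcars.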