R automatically provides us with several commonly used diagnostic statistics upon constructing a linear model. For a useful cheat sheet of linear regression commands and objects, please see the following: http://www2.kenyon.edu/Depts/Math/hartlaub/Math305%20Fall2011/R.htm
In this example, we will create a standardized DFFITS index plot, draw the cutoff values on it, and label the points that fall outside those cutoffs.
df <- read.csv("Sales.csv")
fit <- with(df, lm(Sales ~ Advertising + Bonuses))
summary(fit)
##
## Call:
## lm(formula = Sales ~ Advertising + Bonuses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -165.255 -84.635 6.292 54.150 131.377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -516.4443 189.8757 -2.720 0.0125 *
## Advertising 2.4732 0.2753 8.983 8.18e-09 ***
## Bonuses 1.8562 0.7157 2.593 0.0166 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90.75 on 22 degrees of freedom
## Multiple R-squared: 0.8549, Adjusted R-squared: 0.8418
## F-statistic: 64.83 on 2 and 22 DF, p-value: 5.985e-10
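Before plotting, it may help to see what dffits() actually returns. The sketch below rebuilds DFFITS by hand as the externally studentized residual scaled by a leverage term, and checks it against the built-in. Sales.csv is not available here, so the built-in mtcars data and the name fit_demo are stand-ins chosen purely for illustration.

```r
# Assumption: mtcars stands in for Sales.csv; fit_demo is an illustrative name.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

h   <- hatvalues(fit_demo)  # leverage of each case
t_i <- rstudent(fit_demo)   # externally studentized residuals

# DFFITS rebuilt from its two ingredients
dffits_manual <- t_i * sqrt(h / (1 - h))

# Compare with the built-in; they should agree up to floating-point error
same <- all.equal(unname(dffits_manual), unname(dffits(fit_demo)))
print(same)
```

A case gets a large DFFITS value when it combines a large studentized residual with high leverage, which is why the plot is useful for spotting influential observations.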
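n <- nrow(df)                      # number of observations
k <- length(fit$coefficients) - 1  # number of predictors (intercept excluded)
cv <- 2 * sqrt(k / n)              # cutoff used here; some texts use 2*sqrt((k + 1)/n) instead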
plot(dffits(fit),
     ylab = "Standardized dfFits", xlab = "Index",
     main = paste("Standardized DfFits, \n critical value = 2*sqrt(k/n) = +/-", round(cv, 3)))
# Critical value horizontal lines
abline(h = cv, lty = 2)
abline(h = -cv, lty = 2)
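Alongside the plot, it can be handy to list the flagged cases directly. The following is a minimal sketch of that idea, again using mtcars as a stand-in for the Sales data; fit_demo, flagged, and flagged_tbl are illustrative names, not part of the original analysis.

```r
# Assumption: mtcars stands in for Sales.csv; all object names are illustrative.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

n  <- nrow(mtcars)
k  <- length(coef(fit_demo)) - 1  # predictors, intercept excluded
cv <- 2 * sqrt(k / n)             # same cutoff rule as in the plot

d <- dffits(fit_demo)

# Keep only the cases whose DFFITS lie outside +/- cv
flagged <- d[abs(d) > cv]
flagged_tbl <- data.frame(case   = names(flagged),
                          dffits = round(unname(flagged), 3))
print(flagged_tbl)
```

Using abs(d) > cv covers both the upper and lower cutoff in a single comparison.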
Here we have a few options. The simplest from a coding standpoint is to use R’s built-in identify function and manually click on the points that fall outside of our limits. The code for this is basically: identify(x axis, y axis, labels to plot, plot = TRUE). Run this and click the points on the plot that you want to label. Then click Finish in the upper right of the plot and the labels will appear.
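# x must be numeric plot coordinates, so use the index positions rather than the row names
identify(seq_along(dffits(fit)), dffits(fit), labels = row.names(df), plot = TRUE)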
The second option uses the textxy function from the calibrate package to automatically label the points that fall outside the critical values. While the code for this is slightly more difficult to follow, it is a cool feature to have. First we feed the function the x coordinates of the points whose values lie above or below the critical values, then the corresponding y coordinates. Lastly, we tell the function to label each of those points with its case number, which is the same as its x value since this is an index plot.
library(calibrate)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.2.2
plot(dffits(fit),
     ylab = "Standardized dfFits", xlab = "Index",
     main = paste("Standardized DfFits, \n critical value = 2*sqrt(k/n) = +/-", round(cv, 3)))
abline(h = cv, lty = 2)
abline(h = -cv, lty = 2)
# Code for labeling points: compute DFFITS once, then pick out the flagged cases
d <- dffits(fit)
out <- which(abs(d) > cv)  # index positions of cases outside +/- cv
textxy(out, d[out], labs = names(d)[out], cex = 0.7, offset = -1)
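If installing calibrate is not an option, much the same labeling can be done with base R's text function. The sketch below is one possible way to do it, not a drop-in equivalent of textxy: the label placement (pos = 3, just above each point) is a choice made here, and mtcars with the name fit_demo again stands in for the Sales data.

```r
# Assumption: mtcars stands in for Sales.csv; fit_demo is an illustrative name.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

d   <- dffits(fit_demo)
cv  <- 2 * sqrt((length(coef(fit_demo)) - 1) / nrow(mtcars))
out <- which(abs(d) > cv)  # index positions of flagged cases

plot(d, ylab = "Standardized dfFits", xlab = "Index")
abline(h = c(-cv, cv), lty = 2)  # both cutoff lines in one call

# Label only the flagged points; skip the call if nothing is flagged
if (length(out) > 0) {
  text(x = out, y = d[out], labels = names(d)[out],
       pos = 3, cex = 0.7, xpd = NA)  # xpd = NA lets labels spill past the plot region
}
```

Because out holds index positions rather than row names converted to numbers, the labels still land on the right points when the row names are not simply 1, 2, 3, and so on.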