Check linearity
Linear regression using the least squares method assumes that the DV is linearly predicted by the IVs. So the first thing to do is run a scatter plot of the DV against each IV and check for a linear relation. It may not be apparent immediately, so try some transformations like \(DV\sim log(IV)\), \(log(DV)\sim IV\), \(DV\sim(IV)^{2}\), etc.
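As a minimal sketch using the mtcars data (the same data used in the examples below; the choice of wt as the IV is purely illustrative):

par(mfrow = c(1, 3))
plot(mpg ~ wt, data = mtcars, main = "DV ~ IV")            # raw scales
plot(mpg ~ log(wt), data = mtcars, main = "DV ~ log(IV)")  # log-transformed IV
plot(log(mpg) ~ wt, data = mtcars, main = "log(DV) ~ IV")  # log-transformed DV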
Check residuals
normal distribution: the errors (\(\epsilon_i\)) must be normally distributed.
mean = 0: the errors (\(\epsilon_i\)) must have mean zero.
constant variance: the errors (\(\epsilon_i\)) must have constant variance, also called homogeneity or homoscedasticity.
independence: the errors (\(\epsilon_i\)) must be independent.
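These can be checked visually from the residuals of a fitted model. A minimal sketch, again using mtcars (the particular model is arbitrary):

fit <- lm(mpg ~ wt, data = mtcars)
plot(fitted(fit), resid(fit))            # look for mean zero and constant spread
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))   # points near the line suggest normality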
Check IVs
All IVs are expected to be independent of each other. If there is collinearity, it needs to be dealt with appropriately.
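One common check is the variance inflation factor. A sketch, assuming the car package is available (the rule of thumb that values above roughly 5-10 signal trouble is a convention, not a hard cutoff):

library(car)
fit <- lm(mpg ~ wt + disp + hp, data = mtcars)
vif(fit)   # large values mean an IV is well predicted by the other IVs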
Graphical analysis is very useful and should give us insights even before we begin the formal analysis. It can be broken down into two types: plots of individual variables and plots of relationships between variables.
Example of a stem-and-leaf plot
stem(mtcars$mpg)
##
## The decimal point is at the |
##
## 10 | 44
## 12 | 3
## 14 | 3702258
## 16 | 438
## 18 | 17227
## 20 | 00445
## 22 | 88
## 24 | 4
## 26 | 03
## 28 |
## 30 | 44
## 32 | 49
Example of a dot plot
dotchart(mtcars$mpg, labels = row.names(mtcars), cex = 0.7)
Ideally we should evaluate all dimensions of the data, but this is only possible with a few variables. The most common analysis is pair-wise scatter plots; again, this is only feasible with a few variables.
A simple correlation matrix can be created with cor(mtcars[,1:4]). But sometimes you want to see the correlations along with a scatter plot. This is called a draftsman’s plot.
library(GGally)
ggpairs(mtcars[,1:4])
When using correlations to detect relationships remember two things:
1. correlations can only detect linear relationships, so if the relationship is non-linear we need another method
2. correlations can be heavily influenced by one or two outliers
Therefore, when running multiple regression, don’t assume that the absence of a linear pattern (or correlation) means there is no linear relationship. Sometimes a linear relationship between two variables emerges only after adjusting for a third variable. 3D rotating plots can be useful for spotting this; use the rgl or Rcmdr library in R. Here is an example
library(rgl)
plot3d(mtcars$wt, mtcars$disp, mtcars$mpg, col = "red", size = 3)
To understand leverage and influence take a look at this graph.
Think of leverage as something that is an outlier on the X-axis.
Think of influence as something that is an outlier on the Y-axis.
Both high-leverage and high-influence points can impact the regression equation, so they need special attention. To understand how much they are impacting the regression, we can run the regression with and without these points and see how much our regression coefficients change.
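As a minimal sketch, dropping a single suspect point and comparing coefficients (the Chrysler Imperial is picked here purely as an illustrative candidate):

fit_all  <- lm(mpg ~ wt, data = mtcars)
fit_drop <- lm(mpg ~ wt, data = mtcars[rownames(mtcars) != "Chrysler Imperial", ])
coef(fit_all)
coef(fit_drop)   # large shifts in the coefficients flag an influential point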
To fully list all the methods for probing influential points, run ?influence.measures in the console, but here are a few important ones. Residuals, \(Y_{i}-\widehat{Y}_{i}\), need to be standardized, and there are two ways to standardize them. Both divide the \(i^{th}\) residual by the standard deviation; where they differ is in whether they include the \(i^{th}\) observation in that calculation. Both are referred to as studentized, but one follows an exact t-distribution while the other does not.
rstandard - residuals divided by their standard deviation; also called internally standardized for this reason
rstudent - residuals divided by their standard deviation, where the \(i^{th}\) data point was deleted in the calculation of the standard deviation; this follows an exact t-distribution and is also called externally standardized for this reason
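In symbols, writing \(e_i = Y_i - \widehat{Y}_i\) for the residual and \(h_{ii}\) for the hat diagonal (this is the standard formulation, stated here for reference):
\[r_i = \frac{e_i}{s\sqrt{1-h_{ii}}} \qquad\qquad t_i = \frac{e_i}{s_{(i)}\sqrt{1-h_{ii}}}\]
where \(s\) is the residual standard deviation and \(s_{(i)}\) is the same quantity computed with the \(i^{th}\) observation deleted.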
Below are some of the common ways to probe for influence.
dffits - change in the predicted response, \(\widehat{Y}_{i}\), when the \(i^{th}\) point is deleted in fitting the model
dfbetas - change in the individual coefficients, \(\beta_{i}\), when the \(i^{th}\) point is deleted in fitting the model
cooks.distance - overall change in the coefficients when the \(i^{th}\) point is deleted
resid(fit)/(1 - hatvalues(fit)) - the difference between the response and the predicted response at data point \(i\), when it is not included in the model fitting
Leverage is largely measured by one quantity, the so-called hat diagonals, which can be obtained with hatvalues(). The hat values are between 0 and 1, and larger values indicate greater (potential for) leverage.
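A minimal sketch showing all of these on one fitted model (the model itself is arbitrary):

fit <- lm(mpg ~ wt, data = mtcars)
head(rstandard(fit))        # internally standardized residuals
head(rstudent(fit))         # externally standardized; exact t-distribution
head(dffits(fit))           # change in the fitted value when point i is dropped
head(dfbetas(fit))          # change in each coefficient, one column per coefficient
head(cooks.distance(fit))   # overall change in the coefficients
head(hatvalues(fit))        # hat diagonals, i.e. leverage
resid(fit)/(1 - hatvalues(fit))   # response minus leave-one-out prediction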