Regression analysis and diagnostics

Henk Harmsen

December 23, 2015

What is OLS?

  • Ordinary Least Squares regression
  • You want to explain something: a quantitative dependent variable (outcome, response)
  • Fit a model with one or more independent variables (predictors, explanatory variables)

Beware …

There are underlying assumptions for using OLS, which are often violated:

  • Normality = for fixed values of the independent variables, the dependent variable is normally distributed.
  • Independence = the observations (and hence the residuals) are independent of each other.
  • Constant variance = the variance of the dependent variable does not change with the values of the independent variables.
  • Linearity = the relation between the dependent variable and the independent variables is linear.

Example: maize acreage

Can you explain maize acreage using other variables?

You may just type:

fit = lm(maize.acres ~ bags.store.2013 + savings.2013 + 
           cows + solar.light, data = df2)

Why is this not a good idea?

Cleaning up the dataset

Always visualize your dataset before you work with it!

Better:
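
The figure referenced later is a scatterplot matrix; a likely command (assuming df2 holds the cleaned data, using the car package):

library(car)
# Pairwise plots of the response and all candidate predictors
scatterplotMatrix(~ maize.acres + bags.store.2013 + savings.2013 +
                    cows + solar.light, data = df2)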

Fit your model
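
The table below is the coefficient block of the model summary, most likely extracted with:

coef(summary(fit))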

##                      Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)      6.857085e-01 7.974795e-02  8.5984468 1.686361e-15
## bags.store.2013  1.453749e-02 1.494103e-02  0.9729909 3.316512e-01
## savings.2013    -1.490552e-05 5.672169e-05 -0.2627834 7.929692e-01
## cows1            1.007916e-01 1.083842e-01  0.9299478 3.534411e-01
## solar.light1     5.816559e-01 1.027453e-01  5.6611421 4.777664e-08

  • What do you see? Why? (hint: go back to the scatterplot matrix on the previous slide).
  • What else do you want to see?

Diagnostics

Standard R offers diagnostic plots that you get by typing:

plot(fit)
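
plot(fit) draws the four standard diagnostic plots one after another; to see them in a single window (an optional convenience, not in the original):

par(mfrow = c(2, 2))   # 2 x 2 grid of plots
plot(fit)
par(mfrow = c(1, 1))   # reset the layout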

A detailed summary is obtained by typing:

summary(fit)

Always check the p-value for the F-statistic first! It tests whether the model as a whole explains the outcome better than chance alone. If p > 0.05 you do not need to examine the output any further.
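
The overall F-test p-value can also be pulled out programmatically (a small sketch, assuming the fitted model above):

fstat = summary(fit)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)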

Diagnostics - Normality
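
The slide shows the normal Q-Q plot of the model; assuming base R's diagnostics, that single panel is:

plot(fit, which = 2)   # normal Q-Q plot of the standardized residuals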

Diagnostics - Normality / what you see

  • The dots should lie close to the dotted line…
  • …but they do not: maize.acres is not normally distributed for fixed values of the independent variables
  • The largest deviations are labelled -> these are row numbers in the dataset

Diagnostics - Linearity
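
Assuming base R's diagnostics again, the linearity panel is the residuals-vs-fitted plot:

plot(fit, which = 1)   # residuals vs fitted values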

Diagnostics - Linearity / what you see

  • The linearity plot should look like random noise (no pattern)…
  • …but there is a downward-sloping pattern, albeit not a very strong one
  • Conclude that this condition is not satisfied (see the remediation slide further on)

Diagnostics - Constant variance
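
The corresponding base R panel is the scale-location plot (an assumption, as before):

plot(fit, which = 3)   # scale-location: spread of residuals vs fitted values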

Diagnostics - Constant variance / what you see

  • The dots in the constant variance plot should be more or less randomly distributed along a horizontal band…
  • …the points do lie along a horizontal band, but they are not randomly distributed
  • Conclude that this condition is not satisfied (it is an artificial dataset)

Diagnostics - Influential observations
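
The matching base R panel is the residuals-vs-leverage plot (again an assumption):

plot(fit, which = 5)   # standardized residuals vs leverage, with Cook's distance contours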

Diagnostics - Influential observations / what you see

  • There are outliers: points whose standardized residuals fall outside the -2 to +2 band (vertical axis), e.g. points 97 and 153.
  • There are points with an unusual combination of predictor values (leverage, horizontal axis).
  • Point 105 is influential (Cook's distance).
  • Conclude that the model does not predict very well (in view of the outliers).

Diagnostics - remediations for Normality

Normality is violated: this can be remediated by a so-called Box-Cox transformation.
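
A minimal sketch of finding the Box-Cox power, assuming maize.acres is strictly positive (MASS::boxcox requires a positive response):

library(MASS)
bc = boxcox(fit)                  # profile log-likelihood over candidate powers
lambda = bc$x[which.max(bc$y)]    # power with the highest likelihood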

Diagnostics - remediations for Linearity

The linearity assumption is violated; this can be remediated with a so-called Box-Tidwell transformation of the predictor variables.

The command is shown but not run, for technical reasons:

  • The transformation involves logarithms, which require strictly positive values, so zeros must be removed first.
  • After removal of the zeros, not enough observations are left in the dataset.

library(car)
# Box-Tidwell works on logarithms, so the numeric predictors must be positive:
# dfx2 = subset(dfx2, bags.store.2013 > 0 & savings.2013 > 0)
# Factors such as cows and solar.light cannot be power-transformed;
# they are passed via other.x:
# boxTidwell(maize.acres ~ bags.store.2013 + savings.2013,
#            other.x = ~ cows + solar.light, data = dfx2)

Tips - One overall summary plot

The plot on the next slide combines many of the points discussed into a single plot (a likely command is sketched after this list). In particular:

  • Bigger circles = bigger influence on the model
  • Hat values on the x-axis express leverage (= unusual combinations of predictor values)
  • Residuals on the y-axis express differences between predicted and actual values.
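
The description matches car's influencePlot(); a likely call for the next slide's figure:

library(car)
# Studentized residuals vs hat-values; circle size reflects Cook's distance
influencePlot(fit)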

Tips / Influence plot

At this point …

You can conclude that:

  • OLS is a highly technical topic
  • Underlying assumptions are often violated - this is not a robust method
  • Better & more modern techniques are worth exploring, such as regression trees

Regression tree example
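
A minimal sketch of such a tree on the same data, assuming df2 and the rpart package (the original slide presumably showed the fitted tree):

library(rpart)
tree = rpart(maize.acres ~ bags.store.2013 + savings.2013 +
               cows + solar.light, data = df2)
plot(tree, uniform = TRUE, margin = 0.1)   # draw the tree structure
text(tree, use.n = TRUE)                   # label splits and leaf counts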