Regression analysis and diagnostics

Henk Harmsen

December 23, 2015

What is OLS?

  • Ordinary Least Squares regression
  • You want to explain something: a quantitative dependent variable (outcome, response)
  • Fit a model with one or more independent variables (predictors, explanatory variables)

Beware …

There are underlying assumptions for using OLS, which are often violated:

  • Normality = for fixed values of the independent variables, the dependent variable is normally distributed.
  • Independence = the observations (and hence the residuals) are independent of each other.
  • Constant variance = the variance of the dependent variable does not change with the values of the independent variables.
  • Linearity = the relation between the dependent variable and the independent variables is linear.

Example: maize acreage

Can you explain maize acreage using other variables?

You may just type:

fit = lm(maize.acres ~ bags.store.2013 + savings.2013 + 
           cows + solar.light, data = df2)

Why is this not a good idea?

Cleaning up the dataset

Always visualize your dataset before you work with it!

Better:
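
The figure referenced later is a scatterplot matrix; a likely command (assuming df2 holds the cleaned data, using the car package):

library(car)
# Pairwise plots of the response and all candidate predictors
scatterplotMatrix(~ maize.acres + bags.store.2013 + savings.2013 +
                    cows + solar.light, data = df2)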

Fit your model
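
The table below is the coefficient block of the model summary, most likely extracted with:

coef(summary(fit))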

##                      Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)      6.857085e-01 7.974795e-02  8.5984468 1.686361e-15
## bags.store.2013  1.453749e-02 1.494103e-02  0.9729909 3.316512e-01
## savings.2013    -1.490552e-05 5.672169e-05 -0.2627834 7.929692e-01
## cows1            1.007916e-01 1.083842e-01  0.9299478 3.534411e-01
## solar.light1     5.816559e-01 1.027453e-01  5.6611421 4.777664e-08

  • What do you see? Why? (hint: go back to the scatterplot matrix on the previous slide).
  • What else do you want to see?

Diagnostics

Standard R offers diagnostic plots that you get by typing:

plot(fit)
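
plot(fit) draws the four standard diagnostic plots one after another; to see them in a single window (an optional convenience, not in the original):

par(mfrow = c(2, 2))   # 2 x 2 grid of plots
plot(fit)
par(mfrow = c(1, 1))   # reset the layout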

A detailed summary is obtained by typing:

summary(fit)

Always check the p-value for the F-statistic first! It tests whether the model as a whole explains the outcome better than chance alone. If p > 0.05 you do not need to examine the output any further.
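
The overall F-test p-value can also be pulled out programmatically (a small sketch, assuming the fitted model above):

fstat = summary(fit)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)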

Diagnostics - Normality
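
The slide shows the normal Q-Q plot of the model; assuming base R's diagnostics, that single panel is:

plot(fit, which = 2)   # normal Q-Q plot of the standardized residuals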

Diagnostics - Normality / what you see

  • The dots should lie close to the dotted line…
  • …but they do not: maize.acres is not normally distributed for fixed values of the independent variables
  • The largest deviations are labelled -> these are row numbers in the dataset

Diagnostics - Linearity
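
Assuming base R's diagnostics again, the linearity panel is the residuals-vs-fitted plot:

plot(fit, which = 1)   # residuals vs fitted values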

Diagnostics - Linearity / what you see

  • The linearity plot should look like random noise (no pattern)…
  • …but there is a downward-sloping pattern, albeit not a very strong one
  • Conclude that this condition is not satisfied (see the remediation slide further on)

Diagnostics - Constant variance
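
The corresponding base R panel is the scale-location plot (an assumption, as before):

plot(fit, which = 3)   # scale-location: spread of residuals vs fitted values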

Diagnostics - Constant variance / what you see

  • The dots in the constant variance plot should be more or less randomly distributed along a horizontal band…
  • …the points do lie along a horizontal band, but they are not randomly distributed
  • Conclude that this condition is not satisfied (it is an artificial dataset)

Diagnostics - Influential observations
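
The matching base R panel is the residuals-vs-leverage plot (again an assumption):

plot(fit, which = 5)   # standardized residuals vs leverage, with Cook's distance contours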

Diagnostics - Influential observations / what you see

  • There are outliers: points whose standardized residuals fall outside the -2 to +2 band (vertical axis), e.g. points 97 and 153.
  • There are points with an unusual combination of predictor values (leverage, horizontal axis).
  • Point 105 is influential (Cook's distance).
  • Conclude that the model does not predict very well (in view of the outliers).

Diagnostics - remediations for Normality

Normality is violated: this can be remediated by a so-called Box-Cox transformation.
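
A minimal sketch of finding the Box-Cox power, assuming maize.acres is strictly positive (MASS::boxcox requires a positive response):

library(MASS)
bc = boxcox(fit)                  # profile log-likelihood over candidate powers
lambda = bc$x[which.max(bc$y)]    # power with the highest likelihood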

Diagnostics - remediations for Linearity

The linearity assumption is violated; this can be remediated with a so-called Box-Tidwell transformation of the predictor variables.

The command is shown but not run, for technical reasons:

  • The transformation involves logarithms, which require strictly positive values, so zeros must be removed first.
  • After removal of the zeros, not enough observations are left in the dataset.

library(car)
# Box-Tidwell works on logarithms, so the numeric predictors must be positive:
# dfx2 = subset(dfx2, bags.store.2013 > 0 & savings.2013 > 0)
# Factors such as cows and solar.light cannot be power-transformed;
# they are passed via other.x:
# boxTidwell(maize.acres ~ bags.store.2013 + savings.2013,
#            other.x = ~ cows + solar.light, data = dfx2)

Tips - One overall summary plot

The plot on the next slide combines many of the points discussed into a single plot (a likely command is sketched after this list). In particular:

  • Bigger circles = bigger influence on the model
  • Hat values on the x-axis express leverage (= unusual combinations of predictor values)
  • Residuals on the y-axis express differences between predicted and actual values.
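
The description matches car's influencePlot(); a likely call for the next slide's figure:

library(car)
# Studentized residuals vs hat-values; circle size reflects Cook's distance
influencePlot(fit)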

Tips / Influence plot

At this point …

You can conclude that:

  • OLS is a highly technical topic
  • Underlying assumptions are often violated - this is not a robust method
  • Better & more modern techniques are worth exploring, such as regression trees

Regression tree example
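
A minimal sketch of such a tree on the same data, assuming df2 and the rpart package (the original slide presumably showed the fitted tree):

library(rpart)
tree = rpart(maize.acres ~ bags.store.2013 + savings.2013 +
               cows + solar.light, data = df2)
plot(tree, uniform = TRUE, margin = 0.1)   # draw the tree structure
text(tree, use.n = TRUE)                   # label splits and leaf counts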