The normality assumption in linear modeling (multiple regression) applies to the residuals, not to the dependent or independent variables.

Look at the following data. The sample size is n = 14:

str(d)
## 'data.frame':    14 obs. of  4 variables:
##  $ y : num  14.54 15.65 9.41 4.96 13.93 ...
##  $ x1: num  0.266 0.372 0.573 0.908 0.202 ...
##  $ x2: num  0.54 0.957 0.147 1.391 0.762 ...
##  $ x3: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 2 2 2 ...

Let’s say we want to model y as an additive function of x1, x2 and x3. This is basic multiple regression. Before we begin, let’s examine the distributions of the numeric data. None of them are remotely Normal.

hist(d$y)

hist(d$x1)

hist(d$x2)
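
If you like, the three histograms can be drawn side by side for easier comparison. This is just a convenience sketch using base R graphics; it assumes the data frame d shown above.

# put the three histograms in one row for side-by-side comparison
op <- par(mfrow = c(1, 3))
hist(d$y, main = "y")
hist(d$x1, main = "x1")
hist(d$x2, main = "x2")
par(op)  # restore the original plotting layout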

That doesn’t matter. We can still proceed with multiple regression:

m <- lm(y ~ x1 + x2 + x3, data = d)

NOW is the time to assess the normality assumption. One way is with a QQ plot, which is built into R as follows. The dots should lie close to the diagonal line. The normality assumption checks out. It really doesn’t get any better than this.

plot(m, which = 2)
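
The same kind of plot can also be built directly from the residuals with base R’s qqnorm() and qqline(), in case you prefer not to go through the lm plotting method. A small sketch:

# normal QQ plot of the residuals, drawn by hand
qqnorm(residuals(m))
qqline(residuals(m))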

We could also look at a histogram of the residuals. This looks pretty good considering our small sample size.

hist(residuals(m))

We could even do a Shapiro-Wilk normality test, though I don’t care for it since it encourages binary Yes-No thinking. In this case, however, the result agrees with the QQ plot and histogram. We fail to reject the null hypothesis of normality.

shapiro.test(residuals(m))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m)
## W = 0.98133, p-value = 0.9818

How I generated the data

Here is the R code I used to generate the data. Notice the only values generated from a Normal distribution are the errors: rnorm(n, sd = 4.7). (The residuals from the fitted model are the estimates of these errors.) The x1 and x2 values were drawn from uniform and exponential distributions, respectively. The x3 variable is a categorical variable with 2 levels, 7 observations each.

set.seed(1)
n <- 14
# generate IVs
x1 <- runif(n)
x2 <- rexp(n)
x3 <- gl(n = 2, k = 7)
# generate DV
y <- 7.8 + 1.2 * x1 + 3.9 * x2 + 20.5 * (x3 == "2") + rnorm(n, sd = 4.7)
d <- data.frame(y, x1, x2, x3)

Notice the regression does a decent job of recovering the “True” values used to generate the data, except for x1.

summary(m)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2686 -1.5815  0.1818  1.8399  4.8533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.7219     2.5349   5.413 0.000296 ***
## x1           -9.0774     3.7792  -2.402 0.037192 *  
## x2            4.1378     0.9807   4.219 0.001773 ** 
## x32          18.4677     2.0250   9.120 3.67e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.566 on 10 degrees of freedom
## Multiple R-squared:  0.9238, Adjusted R-squared:  0.901 
## F-statistic: 40.43 on 3 and 10 DF,  p-value: 6.714e-06

True Intercept: 7.8
Estimated Intercept: 13.722

True x1 coefficient: 1.2
Estimated x1 coefficient: -9.077

True x2 coefficient: 3.9
Estimated x2 coefficient: 4.138

True x3 coefficient: 20.5
Estimated x3 coefficient: 18.468

True Residual standard error: 4.7
Estimated Residual standard error: 3.566
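
One way to put this comparison in context is to look at 95% confidence intervals for the estimates; with only 14 observations they are quite wide. The sketch below collects the true values listed above, the estimates, and the intervals in one matrix (true_vals is just a name I’m introducing for illustration):

# true generating values, in the same order as coef(m)
true_vals <- c(7.8, 1.2, 3.9, 20.5)
# compare truth, estimates, and 95% confidence intervals side by side
cbind(true = true_vals, estimate = coef(m), confint(m))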

And all of this worked even though neither the predictors nor the dependent variable were normally distributed.