Ok, let’s look at this simple model. \[ Y = \beta_0 + \beta_1 X + \epsilon \] where \(\epsilon\) is a normally distributed error term with a mean of zero.

If X is normally distributed, Y will be normally distributed too: a linear function of a normal variable plus independent normal noise is itself normal. However, if X is not normally distributed, or we only have values from part of its range, Y will not be normally distributed. Let’s look at these cases.
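First, a quick sanity check of that claim (a sketch; the seed and parameter values are my own arbitrary choices): simulate a normal x and confirm that y comes out normal too.

set.seed(42)  # arbitrary seed, just for reproducibility
x <- rnorm(1000, 50, 10)  # normally distributed predictor
y <- 2 + 0.5 * x + rnorm(1000, 0, 2)  # intercept 2, slope 0.5, normal errors
plot(density(y))  # nicely bell-shaped
shapiro.test(y)  # should fail to reject normality

Now the more interesting case: x uniformly distributed.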

x <- runif(1000, 0, 100)  # x uniformly distributed from 0 to 100
y <- 2 + 0.5 * x + rnorm(1000, 0, 2)  # linear in x, plus normal errors (sd = 2)
plot(density(y))  # density of the response

Oh my Glob! Y is not normally distributed. What are we to do? Nothing. It’s fine; only the error needs to be normally distributed.
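If you’d rather have a number than a picture, run the same Shapiro-Wilk test on y that we’re about to run on the residuals. With x uniform, y is roughly uniform too, so expect it to reject decisively (the exact p-value will vary with the simulation).

shapiro.test(y)  # testing the raw response: expect a vanishingly small p-value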

lm1 <- lm(y ~ x)  # fit the simple linear regression
plot(density(resid(lm1)))  # density of the residuals

qqnorm(resid(lm1))  # Q-Q plot of the residuals against a normal
qqline(resid(lm1))  # reference line: normal residuals fall along it

shapiro.test(resid(lm1))  # formal test: fails to reject normality
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(lm1)
## W = 0.9984, p-value = 0.5096
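As an aside, base R will draw that same Q-Q plot (plus a few other standard diagnostics) straight from the fitted model, so you don’t have to call qqnorm by hand:

plot(lm1, which = 2)  # normal Q-Q plot of the standardized residuals
# plot(lm1) alone cycles through residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage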

See, the normality assumption in regression is about the residuals, not the raw variables: normally distributed residuals are what tell you the assumption holds. Never let another teacher tell you that all your variables, or even your response variable, need to be normally distributed.

Let’s look at another common case where x is lognormally distributed.

x <- rlnorm(100, 10)  # lognormal x (meanlog = 10): large, strongly right-skewed values
y <- 2 + 0.5 * x + rnorm(100, 0, 2)  # same linear model, normal errors
plot(density(y))  # the response inherits the skew

Well, that’s really not normal. But now let’s look at the residuals again.

lm2 <- lm(y ~ x)  # fit the regression with the lognormal predictor
plot(density(resid(lm2)))  # residual density

qqnorm(resid(lm2))  # Q-Q plot of the residuals
qqline(resid(lm2))

shapiro.test(resid(lm2))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(lm2)
## W = 0.9881, p-value = 0.5142

That looks great! So there you have it. Stop worrying about non-normal response variables, and start checking your residuals instead.
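One parting sketch (my own example, not part of the analysis above), so you can see what an actual problem looks like. If the errors really aren’t normal, say they’re skewed, the residual diagnostics will flag it:

set.seed(1)  # arbitrary seed, for reproducibility
x <- runif(1000, 0, 100)
y <- 2 + 0.5 * x + rexp(1000, rate = 0.5)  # exponential (skewed) errors instead of normal ones
lm3 <- lm(y ~ x)
qqnorm(resid(lm3))  # the upper tail bends away from the line
qqline(resid(lm3))
shapiro.test(resid(lm3))  # expect a tiny p-value this time

And that’s the point: it’s the residuals, not the raw variables, that tell you when to worry.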