Ok, let’s look at this simple model. \[ \begin{aligned} Y = \beta_0 + \beta_1 X + \epsilon \end{aligned} \] where \(\epsilon\) is a normally distributed random variable with a mean of zero and constant variance.
If X is normally distributed, Y will obviously be normally distributed too: a linear function of a normal variable plus independent normal noise is still normal. However, if X is not normally distributed, or we only sample a restricted or uneven slice of its range, Y will not be normally distributed. Let’s look at these cases.
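As a quick sanity check of that first claim, here’s the normal-X case (this snippet is my own illustration, with arbitrary coefficients matching the examples below):

set.seed(42)                        # arbitrary seed, just for reproducibility
x <- rnorm(1000, 50, 10)            # X normally distributed
y <- 2 + 0.5*x + rnorm(1000, 0, 2)  # same linear model used below
plot(density(y))                    # bell-shaped, as expected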
x <- runif(1000, 0, 100)           # uniformly distributed from 0 to 100
y <- 2 + 0.5*x + rnorm(1000, 0, 2) # normal errors with sd = 2
plot(density(y))
Oh my Glob! Y is not normally distributed. What are we to do? Nothing. It’s fine: only the errors \(\epsilon\) need to be normally distributed, not the response itself.
lm1 <- lm(y ~ x)
plot(density(resid(lm1)))  # residual density: this is what should look bell-shaped
qqnorm(resid(lm1))         # QQ plot of residuals against normal quantiles
qqline(resid(lm1))         # reference line; points should fall close to it
shapiro.test(resid(lm1))   # shapiro.test() is in stats, which R attaches by default
##
## Shapiro-Wilk normality test
##
## data: resid(lm1)
## W = 0.9984, p-value = 0.5096
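One caveat before leaning too hard on that p-value: shapiro.test() refuses samples larger than 5000, and near that limit it flags departures from normality far too small to matter in practice. A quick sketch of my own (t-distributed data with mildly heavy tails):

set.seed(1)                 # for reproducibility
big <- rt(4999, df = 10)    # mildly heavy-tailed, close to normal
shapiro.test(big)           # often a tiny p-value at this sample size
qqnorm(big); qqline(big)    # yet the QQ plot looks nearly straight

So at large n, trust the QQ plot at least as much as the test.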
See, all you need are normally distributed residuals to tell you whether a linear regression is appropriate. Never let another teacher tell you all your variables, or even your response variable, need to be normally distributed.
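For contrast, here’s a sketch of what a genuine violation looks like (my own illustrative example, not part of the original): skewed errors show up immediately in these same diagnostics.

set.seed(2)
x_bad <- runif(1000, 0, 100)
y_bad <- 2 + 0.5*x_bad + rexp(1000, rate = 0.5)  # skewed, non-normal errors
lm_bad <- lm(y_bad ~ x_bad)
qqnorm(resid(lm_bad)); qqline(resid(lm_bad))     # points peel away from the line
shapiro.test(resid(lm_bad))                      # tiny p-value: assumption violated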
Let’s look at another common case, where X is lognormally distributed.
x <- rlnorm(100, 10)              # lognormal with meanlog = 10: strongly right-skewed
y <- 2 + 0.5*x + rnorm(100, 0, 2)
plot(density(y))
Well, that’s really not normal. But let’s look at the residuals again.
lm2 <- lm(y ~ x)
plot(density(resid(lm2)))
qqnorm(resid(lm2))
qqline(resid(lm2))
shapiro.test(resid(lm2))
##
## Shapiro-Wilk normality test
##
## data: resid(lm2)
## W = 0.9881, p-value = 0.5142
That looks great! So there you have it: check the residuals, not the raw variables, and stop worrying about non-normal response variables.
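If you find yourself running these three diagnostics over and over, a tiny helper keeps it to one call. This is just a convenience sketch; the name check_resids is mine, not a standard function.

check_resids <- function(model) {
  r <- resid(model)
  op <- par(mfrow = c(1, 2))          # density and QQ plot side by side
  on.exit(par(op))                    # restore plotting settings afterwards
  plot(density(r), main = "Residual density")
  qqnorm(r)
  qqline(r)
  shapiro.test(r)                     # returned, and printed when called at the console
}
check_resids(lm2)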