Assumptions of linear models (parametric tests):
Data (or residuals) are normally distributed
Variance is homogeneous (homoscedasticity)
Interval data
Observations are independent of each other
We’ve touched on this before, but let’s really dive in:
hist(mtcars$mpg,breaks=10,probability = T)
lines(density(mtcars$mpg),col='red',lwd=2)
There is another (easier?) way of visualizing normality: the Q-Q
plot
qqnorm(mtcars$mpg)
qqline(mtcars$mpg,col="red")
The closer the “dots” are to the “line”, the more normal the data is.
What would we call this? Normal-ish? Sort of normal?
To me, this distribution is “mostly or fairly normal”. Not really a scientific term, though.
We can quantify it, however:
# Shapiro-Wilk is mostly used for datasets with less than 50 observations (or so)
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
How to interpret a general Shapiro-Wilk normality diagnostic:
For a p-value GREATER than 0.05 = data is plausibly normal
For a p-value LESS THAN OR EQUAL TO 0.05 = data is probably not normal
So, MPG from mtcars is probably normal, per Shapiro-Wilk.
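If you want to apply that rule in code, here is a small sketch (not part of the original notes): shapiro.test() returns an object whose p.value element you can compare to 0.05 yourself.
sw <- shapiro.test(mtcars$mpg)
sw$p.value          # 0.1229, the same value printed above
sw$p.value > 0.05   # TRUE, so the data is plausibly normal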
For larger datasets (from 50 or so observations up into the thousands and beyond), there is the K-S test:
# the diamonds dataset comes built in with ggplot2
hist(diamonds$price, main="Histogram of Diamond Prices", xlab="Price", breaks=50)
qqnorm(diamonds$price)
qqline(diamonds$price,col="red")
ks.test(diamonds$price,"pnorm",mean=mean(diamonds$price),sd=sd(diamonds$price))
## Warning in ks.test.default(diamonds$price, "pnorm", mean =
## mean(diamonds$price), : ties should not be present for the one-sample
## Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: diamonds$price
## D = 0.18467, p-value < 2.2e-16
## alternative hypothesis: two-sided
We are getting a warning because there are repeated prices in the data.
The null hypothesis of the one-sample K-S test is that the sample was drawn from the specified distribution (here, a normal distribution with the same mean and standard deviation as the prices). Because the p-value is below 0.05, we can reject the null hypothesis and say “the sample is very different from a normal distribution”.
P-value above 0.05 = probably NORMAL
P-value at or below 0.05 = probably NOT NORMAL
So the K-S test is consistent with what we see in the plots above.
Why not just use Shapiro-Wilk here? The Shapiro-Wilk test is a poor choice for large datasets: with thousands of observations it has so much power that even tiny, practically meaningless departures from normality come back as “significant”. (R’s shapiro.test() also refuses to run on more than 5000 observations.)
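Here is a quick illustration of that over-sensitivity, using simulated data rather than anything from the course (so treat it as a sketch): a t-distribution with 10 degrees of freedom looks very close to normal, but with 5000 observations Shapiro-Wilk will almost always flag it.
set.seed(1)
close_to_normal <- rt(5000, df = 10)   # nearly normal, just slightly heavier tails
shapiro.test(close_to_normal)          # with n = 5000 this will almost always come back "significant"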
You can also get the Shapiro-Wilk result (along with a pile of descriptive statistics) in one shot with stat.desc() from the pastecs package:
stat.desc(mtcars$mpg,norm=T)
## nbr.val nbr.null nbr.na min max range
## 32.0000000 0.0000000 0.0000000 10.4000000 33.9000000 23.5000000
## sum median mean SE.mean CI.mean.0.95 var
## 642.9000000 19.2000000 20.0906250 1.0654240 2.1729465 36.3241028
## std.dev coef.var skewness skew.2SE kurtosis kurt.2SE
## 6.0269481 0.2999881 0.6106550 0.7366922 -0.3727660 -0.2302812
## normtest.W normtest.p
## 0.9475647 0.1228814
The second assumption of linear regression is homogeneity of variance. What does this mean, exactly?
ggplot(mtcars,aes(x=cyl,y=mpg)) + geom_point()
What do you notice here? The means are very different between cylinder counts (4-cylinder cars seem to have a much higher average MPG than 6- or 8-cylinder cars), but that’s not what we’re looking at. We are looking at the spread of the points around each of those means.
ggplot(mtcars,aes(x=cyl,y=mpg,group=cyl)) + geom_boxplot()
(this is where boxplot comes in handy)
Are they similar? Or different?
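One way to put numbers on that spread before running a formal test (a quick sketch using dplyr, which we already have loaded via the tidyverse; output omitted here):
mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(), variance = var(mpg), std_dev = sd(mpg))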
# leveneTest() comes from the car package
leveneTest(mtcars$mpg,as.factor(mtcars$cyl))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 5.5071 0.00939 **
## 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that in the Levene test, the grouping variable (cyl in this case) must come SECOND.
Our result is that the variances are SIGNIFICANTLY different; i.e., the homogeneity assumption is violated.
In our work in this course, you won’t need to report the actual F-statistics, just if the variances are homogeneous or not.
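If you just want the yes/no answer inside a script, here is one way (a sketch, not something you need for the course): the object leveneTest() returns is a small table with a Pr(>F) column.
lev <- leveneTest(mtcars$mpg, as.factor(mtcars$cyl))
lev[["Pr(>F)"]][1]          # the p-value for the group term (0.00939 above)
lev[["Pr(>F)"]][1] > 0.05   # FALSE, so the homogeneity assumption is violated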
This topic can be tricky and is somewhat advanced (sorry). Soon you will understand this is more of an art than a science.
Look what happens with outliers:
ggplot(mtcars,aes(x=cyl,y=hp)) + geom_point()
The Maserati’s horsepower is much higher than anything in the other groups, and even higher than the other 8-cylinder cars. That one car inflates the variance of the 8-cylinder group and distorts the comparison between groups.
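You can confirm which car that is with a quick subset (a sketch; the 300 hp cutoff is just a convenient line that only one car crosses):
mtcars[mtcars$hp > 300, c("cyl", "hp")]   # the Maserati Bora: 8 cylinders, 335 hp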
leveneTest(mtcars$hp,mtcars$cyl,center=mean)
## Warning in leveneTest.default(mtcars$hp, mtcars$cyl, center = mean): mtcars$cyl
## coerced to factor.
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 2 4.1437 0.02613 *
## 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One idea is to just remove the outlier:
cars_removed<-mtcars %>% filter(hp<300)
ggplot(cars_removed,aes(x=cyl,y=hp)) + geom_point()
leveneTest(cars_removed$hp,cars_removed$cyl)
## Warning in leveneTest.default(cars_removed$hp, cars_removed$cyl):
## cars_removed$cyl coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.4953 0.2415
## 28
We now see that the Levene test is not significant; that is, the group variances are no longer significantly different from each other.
The data is now considerably more normal as well:
hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)
hist(cars_removed$hp,breaks=10,probability = T)
lines(density(cars_removed$hp),col='red',lwd=2)
The problem here is that we’ve deleted something from our data for the purpose of making it “better”, but this is basically cheating. DON’T DO THIS!
Exception: there is some kind of overwhelming theoretical reason to remove specific parts of the data. When this happens, you must make it clear that you have done this.
Instead of removing outliers or anything else that can distort results, we can transform the data using various methods. My favorite (because of how easy it is) is the log-transformation.
A logarithm is “the power to which we must raise a certain number (the base) to get another number”
base^exponent = number
2^(exponent?) = 8
To get 8, you have to raise 2 to what power?
It’s actually 3:
2^3 = 8
So, the log of 8 in “base 2”, is 3.
log(8,base=2)
## [1] 3
The base of the natural logarithm is about 2.718, a value known as Euler’s number, or e (you can google “Euler’s number” for help sleeping). When you use the “log()” command, it assumes a base of e (~2.718).
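You can see the value R uses with exp(), which raises e to whatever power you give it:
exp(1)   # e raised to the power 1 is just e itself
## [1] 2.718282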
log(8)
## [1] 2.079442
~2.71 ^ 2.079442 = ~8
2.71^2.079442
## [1] 7.949403
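Using the exact value of e instead of the rounded 2.71 gets you back to 8:
exp(log(8))
## [1] 8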
Why does this matter?
It turns out that taking the natural log of a whole variable will “smoosh” the data together, reducing skew and pulling extreme values in toward the rest of the data.
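A tiny illustration of the “smooshing”, using made-up numbers rather than the car data: values that differ by a factor of 10 end up evenly spaced on the log scale.
c(10, 100, 1000, 10000)        # each value is 10 times the previous one
log(c(10, 100, 1000, 10000))   # on the log scale they sit about 2.3 apart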
ggplot(mtcars,aes(x=cyl,y=hp)) + geom_point() + ggtitle("Untransformed Data")
ggplot(mtcars,aes(x=log(cyl),y=log(hp))) + geom_point() + ggtitle("Log-Transformed Data")
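The log-transformed plot looks much more even; to check that with a number rather than just a picture (my own check, output not shown here), you could run Levene's test on the log-transformed values:
leveneTest(log(mtcars$hp), as.factor(mtcars$cyl))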
Variances are much more homogeneous! What about normality?
hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)
hist(log(mtcars$hp),breaks=10,probability = T)
lines(density(log(mtcars$hp)),col='red',lwd=2)
Very good. What does a log transformation actually look like?
mt_cars_log<-mtcars %>% mutate(LOG_HP=log(hp)) %>% select(hp,LOG_HP)
head(mt_cars_log)
## hp LOG_HP
## Mazda RX4 110 4.700480
## Mazda RX4 Wag 110 4.700480
## Datsun 710 93 4.532599
## Hornet 4 Drive 110 4.700480
## Hornet Sportabout 175 5.164786
## Valiant 105 4.653960
The natural log transformation “squishes” the values of horsepower, to reduce the influence of outliers.
You don’t have to transform every variable in the dataset, but you must transform every value within the same column (so, no log-transforming the HP of car #1 but not car #2).
Natural logs don’t work (in terms of real numbers) for negative values or zero. If you somehow had a car with negative 100 horsepower, this is what would happen:
log(-100)
## Warning in log(-100): NaNs produced
## [1] NaN
It is possible to extend logarithms to negative numbers but that is a complex mathematical topic far beyond this course (also unlikely to be a solution in real life)
The book recommends that you “add a constant” to make all the values positive; I actually disagree and think that is a bit silly.
Better idea: use absolute values (abs())
log(abs(-100))
## [1] 4.60517
Always make it clear in your paper if you’ve transformed the data. Turning negative values into positive values by taking the absolute value is perfectly fine as long as you make it clear you’ve done so – and that doing so is supported by the literature in some way. You can report your result as “the log absolute value of x” instead of just “x”. The downside is that this often makes interpretation a little more difficult.
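Here is what that might look like in practice, using a completely made-up data frame (the column x and its values are hypothetical, purely for illustration):
fake_data <- data.frame(x = c(-100, -10, 5, 50))   # hypothetical values, including negatives
fake_data %>% mutate(LOG_ABS_X = log(abs(x)))      # "the log absolute value of x"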
It’s also possible to do transformations via square roots:
mt_cars_sqrt<-mtcars %>% mutate(HP_SQRT=sqrt(hp)) %>% select(hp,HP_SQRT)
head(mt_cars_sqrt)
## hp HP_SQRT
## Mazda RX4 110 10.488088
## Mazda RX4 Wag 110 10.488088
## Datsun 710 93 9.643651
## Hornet 4 Drive 110 10.488088
## Hornet Sportabout 175 13.228757
## Valiant 105 10.246951
hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)
hist(mt_cars_sqrt$HP_SQRT,breaks=10,probability = T)
lines(density(mt_cars_sqrt$HP_SQRT),col='red',lwd=2)
Which is better? One main advantage of the square root transformation is that it can be applied to values of zero, which the natural log cannot handle:
sqrt(0)
## [1] 0
log(0)
## [1] -Inf