Assumptions of linear models (parametric tests):
Data (or residuals) are normally distributed
Variance is homogeneous (roughly equal across groups)
We’ve touched on this before, but let’s really dive in: probability = T tells hist() to plot densities (so the bar areas sum to 1) rather than raw counts, which makes it easy to overlay a density curve:
hist(mtcars$mpg,breaks=10,probability = T)
lines(density(mtcars$mpg),col='red',lwd=2)
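If you prefer ggplot2 (which we use elsewhere in these notes), the same plot can be sketched like this; this assumes a reasonably recent ggplot2 (3.3 or later) for after_stat():
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10,
                 fill = "grey80", colour = "black") +  # density scale, like probability = T
  geom_density(colour = "red")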
There is another (easier?) way of visualizing normality: the Q-Q plot
# the red line represents what the data would look like if it were perfectly normal
# this example data is more normal than not, though not entirely normal; data like this would still be usable for a project
qqnorm(mtcars$mpg)
qqline(mtcars$mpg,col="red")
The closer the “dots” are to the “line”, the more normal the data is.
What would we call this? Normal-ish? Sort of normal?
To me, this distribution is “mostly or fairly normal”. Not really a scientific term, though.
We can quantify it, however:
# Shapiro-Wilk is mostly used for datasets with less than 50 observations (or so)
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
# this tests whether the data is distinguishable from a normal distribution
How to interpret the Shapiro-Wilk normality test:
p-value GREATER than 0.05 = data is plausibly normal
p-value LESS THAN OR EQUAL TO 0.05 = data is probably not normal
So, MPG from mtcars is probably normal, per Shapiro-Wilk.
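If you find yourself applying this rule of thumb over and over, you could wrap it in a tiny helper. interpret_normality() below is a hypothetical function (not from any package), just a sketch of the decision rule:
# hypothetical helper: applies the p > 0.05 rule of thumb from above
interpret_normality <- function(p_value, alpha = 0.05) {
  if (p_value > alpha) "plausibly normal" else "probably not normal"
}
interpret_normality(shapiro.test(mtcars$mpg)$p.value)  # "plausibly normal" for mpg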
For larger datasets (between 50 and 5000+ observations), there is the K-S test:
hist(diamonds$price, main="Histogram of Diamond Prices", xlab="Price", breaks=50)
# this data is really skewed to the right
qqnorm(diamonds$price)
qqline(diamonds$price,col="red")
# the points do not really follow the red line; they only cross it a couple of times, and there are large outliers at the upper end
ks.test(diamonds$price,"pnorm",mean=mean(diamonds$price),sd=sd(diamonds$price))
## Warning in ks.test.default(diamonds$price, "pnorm", mean =
## mean(diamonds$price), : ties should not be present for the one-sample
## Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: diamonds$price
## D = 0.18467, p-value < 2.2e-16
## alternative hypothesis: two-sided
We are getting a warning because there are repeated prices in the data.
The null hypothesis of a K-S test is that the sample and a normal distribution are the same. Because the p-value is below 0.05, we can reject the null hypothesis and say “the sample is very different from a normal distribution”.
p-value above 0.05 = probably NORMAL
p-value at or below 0.05 = probably NOT NORMAL
So the K-S test is consistent with what we see in the plots above.
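For contrast, here is a rough sketch of what the K-S test looks like on data that really is normal. The fake_prices vector is simulated, so the exact p-value will vary with the seed, but it should sit comfortably above 0.05:
set.seed(1)                                    # arbitrary seed, just for reproducibility
fake_prices <- rnorm(nrow(diamonds),           # a truly normal sample of the same size
                     mean = mean(diamonds$price),
                     sd = sd(diamonds$price))
ks.test(fake_prices, "pnorm",
        mean = mean(fake_prices), sd = sd(fake_prices))
# no ties warning (simulated values are essentially never repeated), and the p-value
# should be well above 0.05, i.e. indistinguishable from a normal distribution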
The Shapiro-Wilk test becomes overly sensitive for large datasets: the more observations you have, the smaller the departure from normality needed to produce a significant result, so even trivial deviations get flagged as “not normal” (see the sketch after the stat.desc() output below).
# stat.desc (from the pastecs package) describes a variable, including normality statistics
stat.desc(mtcars$mpg,norm=T)
## nbr.val nbr.null nbr.na min max range
## 32.0000000 0.0000000 0.0000000 10.4000000 33.9000000 23.5000000
## sum median mean SE.mean CI.mean.0.95 var
## 642.9000000 19.2000000 20.0906250 1.0654240 2.1729465 36.3241028
## std.dev coef.var skewness skew.2SE kurtosis kurt.2SE
## 6.0269481 0.2999881 0.6106550 0.7366922 -0.3727660 -0.2302812
## normtest.W normtest.p
## 0.9475647 0.1228814
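About the sample-size sensitivity mentioned above, here is a rough sketch. shapiro.test() refuses samples larger than 5000, and near that limit even mildly skewed data is usually flagged as significant, while the same shape with only 50 observations usually is not (exact p-values depend on the random seed):
set.seed(42)                            # arbitrary seed
mild_skew <- rgamma(5000, shape = 50)   # only mildly skewed data
shapiro.test(mild_skew)                 # n = 5000: usually a tiny p-value ("not normal")
shapiro.test(mild_skew[1:50])           # n = 50: usually p > 0.05 ("plausibly normal")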
The second assumption of linear regression is homogeneity of variance. What does this mean, exactly?
ggplot(mtcars,aes(x=cyl,y=mpg)) + geom_point()
What do you notice here? The means are very different between cylinders (4-cylinder cars seem to have a much higher average MPG than 6- or 8-cylinder cars), but that’s not what we’re looking at. We are looking at the spread of the data around each of those means.
# each dot represents a car; some groups are more tightly clustered than others
# if the variances were homogeneous, the spread within each group would be about the same; here the variance is NOT homogeneous across cylinders
ggplot(mtcars,aes(x=cyl,y=mpg,group=cyl)) + geom_boxplot()
(this is where boxplot comes in handy)
Are they similar? Or different?
# leveneTest checks whether the group variances are significantly different
leveneTest(mtcars$mpg,mtcars$cyl)
## Warning in leveneTest.default(mtcars$mpg, mtcars$cyl): mtcars$cyl coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 5.5071 0.00939 **
## 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the Levene test, the grouping variable (cyl in this case) must come SECOND.
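A side note on the warning: leveneTest() comes from the car package and wants the grouping variable to be a factor. You can avoid the “coerced to factor” message by converting cyl yourself, or by using the formula interface:
library(car)                                  # leveneTest() lives here
leveneTest(mpg ~ factor(cyl), data = mtcars)  # same test, no coercion warning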
Our result is that the variances are SIGNIFICANTLY different; i.e., the assumption is violated.
In our work in this course, you won’t need to report the actual F-statistics, just if the variances are homogeneous or not.
This topic can be tricky and is somewhat advanced (sorry). Soon you will understand this is more of an art than a science.
Look what happens with outliers:
ggplot(mtcars,aes(x=cyl,y=hp)) + geom_point()
The Maserati (the top 8-cylinder dot) has a horsepower that is much higher than the other groups, and even higher than the other 8-cylinder cars. It’s distorting the variance of all the groups.
# if the Maserati were removed from the data, the variance would compress a bit
leveneTest(mtcars$hp,mtcars$cyl,center=mean)
## Warning in leveneTest.default(mtcars$hp, mtcars$cyl, center = mean): mtcars$cyl
## coerced to factor.
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 2 4.1437 0.02613 *
## 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# just barely non-homogeneous, since the p-value is close to 0.05
One idea is to simply remove the outlier:
cars_removed<-mtcars %>% filter(hp<300)
ggplot(cars_removed,aes(x=cyl,y=hp)) + geom_point()
leveneTest(cars_removed$hp,cars_removed$cyl)
## Warning in leveneTest.default(cars_removed$hp, cars_removed$cyl):
## cars_removed$cyl coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.4953 0.2415
## 28
# the p-value is now well above 0.05, meaning the variances have been homogenized by removing the Maserati
We now see that the Levene Test is not significant – that is, the variances have been homogenized.
The data is now considerably more normal as well:
hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)
hist(cars_removed$hp,breaks=10,probability = T)
lines(density(cars_removed$hp),col='red',lwd=2)
In general, you should not remove data just to make a test come out the way you want. Exception: there is some kind of overwhelming theoretical reason to remove specific parts of the data. When this happens, you must make it clear that you have done so.
Instead of removing outliers or anything else that can distort results, we can transform the data using various methods. My favorite (because of how easy it is) is the log-transformation.
A logarithm is “the power to which we must raise a certain number (the base) to get another number”
base^exponent = number
2^(exponent?) = 8
To get 8, you have to raise 2 to what power?
It’s actually 3:
2^3 = 8
So, the log of 8 in “base 2”, is 3.
log(8,base=2)
## [1] 3
The natural logarithm uses the base e, which is about 2.718 (you can also google “Euler’s number” for help sleeping). When you use the log() command in R, it assumes a base of e (≈ 2.718):
log(8)
## [1] 2.079442
~2.71 ^ 2.079442 = ~8
2.71^2.079442
## [1] 7.949403
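The small mismatch above (7.95 instead of 8) is only because 2.71 is a rounded version of e. R’s exp() function uses the full-precision value, so it undoes log() exactly (up to floating-point rounding):
exp(log(8))          # back to 8
2^log(8, base = 2)   # same idea in base 2: back to 8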
Why does this matter?
It turns out that taking the natural log of a whole dataset will “smoosh” the data together, reducing skew and the influence of outliers to some degree.
ggplot(mtcars,aes(x=cyl,y=hp)) + geom_point() + ggtitle("Untransformed Data")
ggplot(mtcars,aes(x=log(cyl),y=log(hp))) + geom_point() + ggtitle("Log-Transformed Data")
Variances are much more homogeneous! What about normality?
# the underlying observations are the same, but the shape of the distribution has changed
# the complication is that you can no longer report this as cyl x hp; it is now log(cyl) x log(hp), which is not the same as reporting the raw data
# you want the data as close to normal as possible to get an accurate p-value on the regression
hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)
hist(log(mtcars$hp),breaks=10,probability = T)
lines(density(log(mtcars$hp)),col='red',lwd=2)
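If you want numbers rather than pictures, you can re-run the same diagnostics on the log-transformed values; a minimal sketch (output omitted here, so try it yourself):
leveneTest(log(hp) ~ factor(cyl), data = mtcars)  # homogeneity of variance after the log
shapiro.test(log(mtcars$hp))                      # normality after the log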
Very good. What does a log transformation actually look like?
mt_cars_log<-mtcars %>% mutate(LOG_HP=log(hp)) %>% select(hp,LOG_HP)
head(mt_cars_log)
## hp LOG_HP
## Mazda RX4 110 4.700480
## Mazda RX4 Wag 110 4.700480
## Datsun 710 93 4.532599
## Hornet 4 Drive 110 4.700480
## Hornet Sportabout 175 5.164786
## Valiant 105 4.653960
The natural log transformation “squishes” the values of horsepower, reducing the influence of outliers.
# so e^4.700480 (roughly 2.718^4.700480) gives back the original hp of 110
You don’t have to transform every variable in the dataset, but you must transform every value within the same column (so, no transforming HP #1 but not HP #2).
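As a quick sanity check, exponentiating undoes the natural log and recovers the original horsepower (HP_BACK is just a hypothetical column name for this sketch):
mt_cars_log %>% mutate(HP_BACK = exp(LOG_HP)) %>% head()  # HP_BACK should match hp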
Natural logs don’t work (in terms of real numbers) for negative values or zero. If you somehow had a car with negative 100 horsepower, this is what would happen:
log(-100)
## Warning in log(-100): NaNs produced
## [1] NaN
It is possible to extend logarithms to negative numbers, but that is a complex mathematical topic far beyond this course (and unlikely to be a practical solution in real life).
The book recommends that you “add a constant” to make the values positive; I actually disagree and think that is a bit silly.
Better idea: use absolute values (abs())
log(abs(-100))
## [1] 4.60517
Always make it clear in your paper if you’ve transformed the data. Turning negative values into positive values by taking the absolute value is perfectly fine as long as you make it clear you’ve done so, and as long as doing so is supported by the literature in some way. You can report your result as “the log absolute value of x” instead of just “x”. The downside is that this often makes interpretation a little more difficult.
It’s also possible to do transformations via square roots:
mt_cars_sqrt<-mtcars %>% mutate(HP_SQRT=sqrt(hp)) %>% select(hp,HP_SQRT)
head(mt_cars_sqrt)
## hp HP_SQRT
## Mazda RX4 110 10.488088
## Mazda RX4 Wag 110 10.488088
## Datsun 710 93 9.643651
## Hornet 4 Drive 110 10.488088
## Hornet Sportabout 175 13.228757
## Valiant 105 10.246951
hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)
hist(mt_cars_sqrt$HP_SQRT,breaks=10,probability = T)
lines(density(mt_cars_sqrt$HP_SQRT),col='red',lwd=2)
Which is better? It depends on the data, but the main advantage of the square root is that it can be applied to values of zero (the natural log cannot):
sqrt(0)
## [1] 0
log(0)
## [1] -Inf