Assumptions of linear models (parametric tests):
Data (or residuals) are normally distributed
Variance is homogeneous (homoscedasticity)
Interval data
Observations are independent of each other
We’ve touched on this before, but let’s really dive in:
hist(mtcars$mpg,breaks=10,probability = T)
lines(density(mtcars$mpg),col='red',lwd=2)
There is another (easier?) way of visualizing normality: the Q-Q
plot
qqnorm(mtcars$mpg)
qqline(mtcars$mpg,col="red")
The closer the “dots” are to the “line”, the more normal the data is.
What would we call this? Normal-ish? Sort of normal?
To me, this distribution is “mostly or fairly normal”. Not really a scientific term, though.
We can quantify it, however:
# Shapiro-Wilk is mostly used for datasets with less than 50 observations (or so)
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
How to interpret a general Shapiro-Wilk normality diagnostic:
For a p-value GREATER than 0.05 = data is plausibly normal
For a p-value LESS THAN OR EQUAL TO 0.05 = data is probably not normal
So, MPG from mtcars is probably normal, per Shapiro-Wilk.
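If you want to apply that rule in code, here is a small sketch (not part of the original notes): shapiro.test() returns an object whose p.value element you can compare to 0.05 yourself.
sw <- shapiro.test(mtcars$mpg)
sw$p.value          # 0.1229, the same value printed above
sw$p.value > 0.05   # TRUE, so the data is plausibly normal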
For larger datasets (from 50 or so observations up into the thousands and beyond), there is the K-S test:
# the diamonds dataset comes built in with ggplot2
hist(diamonds$price, main="Histogram of Diamond Prices", xlab="Price", breaks=50)
qqnorm(diamonds$price)
qqline(diamonds$price,col="red")
ks.test(diamonds$price,"pnorm",mean=mean(diamonds$price),sd=sd(diamonds$price))
## Warning in ks.test.default(diamonds$price, "pnorm", mean =
## mean(diamonds$price), : ties should not be present for the one-sample
## Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: diamonds$price
## D = 0.18467, p-value < 2.2e-16
## alternative hypothesis: two-sided
We are getting a warning because there are repeated prices in the data.
The null hypothesis of the one-sample K-S test is that the sample was drawn from the specified distribution (here, a normal distribution with the same mean and standard deviation as the prices). Because the p-value is below 0.05, we can reject the null hypothesis and say “the sample is very different from a normal distribution”.
P-value above 0.05 = probably NORMAL
P-value at or below 0.05 = probably NOT NORMAL
So the K-S test is consistent with what we see in the plots above.
Why not just use Shapiro-Wilk here? The Shapiro-Wilk test is a poor choice for large datasets: with thousands of observations it has so much power that even tiny, practically meaningless departures from normality come back as “significant”. (R’s shapiro.test() also refuses to run on more than 5000 observations.)
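Here is a quick illustration of that over-sensitivity, using simulated data rather than anything from the course (so treat it as a sketch): a t-distribution with 10 degrees of freedom looks very close to normal, but with 5000 observations Shapiro-Wilk will almost always flag it.
set.seed(1)
close_to_normal <- rt(5000, df = 10)   # nearly normal, just slightly heavier tails
shapiro.test(close_to_normal)          # with n = 5000 this will almost always come back "significant"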
You can also get the Shapiro-Wilk result (along with a pile of descriptive statistics) in one shot with stat.desc() from the pastecs package:
stat.desc(mtcars$mpg,norm=T)
## nbr.val nbr.null nbr.na min max range
## 32.0000000 0.0000000 0.0000000 10.4000000 33.9000000 23.5000000
## sum median mean SE.mean CI.mean.0.95 var
## 642.9000000 19.2000000 20.0906250 1.0654240 2.1729465 36.3241028
## std.dev coef.var skewness skew.2SE kurtosis kurt.2SE
## 6.0269481 0.2999881 0.6106550 0.7366922 -0.3727660 -0.2302812
## normtest.W normtest.p
## 0.9475647 0.1228814
The second assumption of linear regression is homogeneity of variance. What does this mean, exactly?
ggplot(mtcars,aes(x=cyl,y=mpg)) + geom_point()
What do you notice here? The means are very different between cylinder counts (4-cylinder cars seem to have a much higher average MPG than 6- or 8-cylinder cars), but that’s not what we’re looking at. We are looking at the spread of the points around each of those means.
ggplot(mtcars,aes(x=cyl,y=mpg,group=cyl)) + geom_boxplot()
(this is where boxplot comes in handy)
Are they similar? Or different?
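One way to put numbers on that spread before running a formal test (a quick sketch using dplyr, which we already have loaded via the tidyverse; output omitted here):
mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(), variance = var(mpg), std_dev = sd(mpg))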
# leveneTest() comes from the car package
leveneTest(mtcars$mpg,as.factor(mtcars$cyl))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 5.5071 0.00939 **
## 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that in the Levene test, the grouping variable (cyl in this case) must come SECOND.
Our result is that the variances are SIGNIFICANTLY different; i.e., the homogeneity assumption is violated.
In our work in this course, you won’t need to report the actual F-statistics, just if the variances are homogeneous or not.
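If you just want the yes/no answer inside a script, here is one way (a sketch, not something you need for the course): the object leveneTest() returns is a small table with a Pr(>F) column.
lev <- leveneTest(mtcars$mpg, as.factor(mtcars$cyl))
lev[["Pr(>F)"]][1]          # the p-value for the group term (0.00939 above)
lev[["Pr(>F)"]][1] > 0.05   # FALSE, so the homogeneity assumption is violated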
This topic can be tricky and is somewhat advanced (sorry). Soon you will understand this is more of an art than a science.
Look what happens with outliers:
ggplot(mtcars,aes(x=cyl,y=hp)) + geom_point()
The Maserati’s horsepower is much higher than anything in the other groups, and even higher than the other 8-cylinder cars. That one car inflates the variance of the 8-cylinder group and distorts the comparison between groups.
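You can confirm which car that is with a quick subset (a sketch; the 300 hp cutoff is just a convenient line that only one car crosses):
mtcars[mtcars$hp > 300, c("cyl", "hp")]   # the Maserati Bora: 8 cylinders, 335 hp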
leveneTest(mtcars$hp,mtcars$cyl,center=mean)
## Warning in leveneTest.default(mtcars$hp, mtcars$cyl, center = mean): mtcars$cyl
## coerced to factor.
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 2 4.1437 0.02613 *
## 29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One idea is to just remove the outlier:
cars_removed<-mtcars %>% filter(hp<300)
ggplot(cars_removed,aes(x=cyl,y=hp)) + geom_point()
leveneTest(cars_removed$hp,cars_removed$cyl)
## Warning in leveneTest.default(cars_removed$hp, cars_removed$cyl):
## cars_removed$cyl coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.4953 0.2415
## 28
We now see that the Levene test is not significant; that is, the group variances are no longer significantly different from each other.
The data is now considerably more normal as well:
hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)
hist(cars_removed$hp,breaks=10,probability = T)
lines(density(cars_removed$hp),col='red',lwd=2)
The problem here is that we’ve deleted something from our data for the purpose of making it “better”, but this is basically cheating. DON’T DO THIS!
Exception: there is some kind of overwhelming theoretical reason to remove specific parts of the data. When this happens, you must make it clear that you have done this.
Instead of removing outliers or anything else that can distort results, we can transform the data using various methods. My favorite (because of how easy it is) is the log-transformation.
A logarithm is “the power to which we must raise a certain number (the base) to get another number”
base^exponent = number
2^(exponent?) = 8
To get 8, you have to raise 2 to what power?
It’s actually 3:
2^3 = 8
So, the log of 8 in “base 2”, is 3.
log(8,base=2)
## [1] 3
The base of the natural logarithm is about 2.718, a value known as Euler’s number, or e (you can google “Euler’s number” for help sleeping). When you use the “log()” command, it assumes a base of e (~2.718).
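You can see the value R uses with exp(), which raises e to whatever power you give it:
exp(1)   # e raised to the power 1 is just e itself
## [1] 2.718282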
log(8)
## [1] 2.079442
~2.71 ^ 2.079442 = ~8
2.71^2.079442
## [1] 7.949403
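Using the exact value of e instead of the rounded 2.71 gets you back to 8:
exp(log(8))
## [1] 8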
Why does this matter?
It turns out that taking the natural log of a whole variable will “smoosh” the data together, reducing skew and pulling extreme values in toward the rest of the data.
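A tiny illustration of the “smooshing”, using made-up numbers rather than the car data: values that differ by a factor of 10 end up evenly spaced on the log scale.
c(10, 100, 1000, 10000)        # each value is 10 times the previous one
log(c(10, 100, 1000, 10000))   # on the log scale they sit about 2.3 apart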
ggplot(mtcars,aes(x=cyl,y=hp)) + geom_point() + ggtitle("Untransformed Data")
ggplot(mtcars,aes(x=log(cyl),y=log(hp))) + geom_point() + ggtitle("Log-Transformed Data")
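The log-transformed plot looks much more even; to check that with a number rather than just a picture (my own check, output not shown here), you could run Levene's test on the log-transformed values:
leveneTest(log(mtcars$hp), as.factor(mtcars$cyl))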
Variances are much more homogeneous! What about normality?
hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)
hist(log(mtcars$hp),breaks=10,probability = T)
lines(density(log(mtcars$hp)),col='red',lwd=2)
Very good. What does a log transformation actually look like?
mt_cars_log<-mtcars %>% mutate(LOG_HP=log(hp)) %>% select(hp,LOG_HP)
head(mt_cars_log)
## hp LOG_HP
## Mazda RX4 110 4.700480
## Mazda RX4 Wag 110 4.700480
## Datsun 710 93 4.532599
## Hornet 4 Drive 110 4.700480
## Hornet Sportabout 175 5.164786
## Valiant 105 4.653960
The natural log transformation “squishes” the values of horsepower, to reduce the influence of outliers.
You don’t have to transform every variable in the dataset, but you must transform every value within the same column (so, no log-transforming the HP of car #1 but not car #2).
Natural logs don’t work (in terms of real numbers) for negative values or zero. If you somehow had a car with negative 100 horsepower, this is what would happen:
log(-100)
## Warning in log(-100): NaNs produced
## [1] NaN
It is possible to extend logarithms to negative numbers but that is a complex mathematical topic far beyond this course (also unlikely to be a solution in real life)
The book recommends that you “add a constant” to make all the values positive; I actually disagree and think that is a bit silly.
Better idea: use absolute values (abs())
log(abs(-100))
## [1] 4.60517
Always make it clear in your paper if you’ve transformed the data. Turning negative values into positive values by taking the absolute value is perfectly fine as long as you make it clear you’ve done so – and that doing so is supported by the literature in some way. You can report your result as “the log absolute value of x” instead of just “x”. The downside is that this often makes interpretation a little more difficult.
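Here is what that might look like in practice, using a completely made-up data frame (the column x and its values are hypothetical, purely for illustration):
fake_data <- data.frame(x = c(-100, -10, 5, 50))   # hypothetical values, including negatives
fake_data %>% mutate(LOG_ABS_X = log(abs(x)))      # "the log absolute value of x"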
It’s also possible to do transformations via square roots:
mt_cars_sqrt<-mtcars %>% mutate(HP_SQRT=sqrt(hp)) %>% select(hp,HP_SQRT)
head(mt_cars_sqrt)
## hp HP_SQRT
## Mazda RX4 110 10.488088
## Mazda RX4 Wag 110 10.488088
## Datsun 710 93 9.643651
## Hornet 4 Drive 110 10.488088
## Hornet Sportabout 175 13.228757
## Valiant 105 10.246951
hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)
hist(mt_cars_sqrt$HP_SQRT,breaks=10,probability = T)
lines(density(mt_cars_sqrt$HP_SQRT),col='red',lwd=2)
Which is better? One main advantage of the square root transformation is that it can be applied to values of zero, which the natural log cannot handle:
sqrt(0)
## [1] 0
log(0)
## [1] -Inf