Assumptions of linear models (parametric tests):

  1. Data (or residuals) are normally distributed

  2. Variance is homogeneous (homoscedasticity)

  3. Interval data

  4. Observations are independent of each other

This is why we don't use linear models for everything: the data has to meet these assumptions for the results to be valid!

NORMALITY

We’ve touched on this before, but let’s really dive in:

hist(mtcars$mpg,breaks=10,probability = T)
lines(density(mtcars$mpg),col='red',lwd=2)

This is kind of bell-shaped, but with a long tail. If something isn't normal, we can often make it more normal (without cheating), as we'll see later.

There is another (easier?) way of visualizing normality: the Q-Q plot

qqnorm(mtcars$mpg)
qqline(mtcars$mpg,col="red")

The closer the “dots” are to the “line”, the more normal the data is.

What would we call this? Normal-ish? Sort of normal? It goes a bit off the rails at the end.

To me, this distribution is "mostly or fairly normal". Not really a scientific term, though. This is why data analysis is an art and a science: no real-world data is perfectly normal; everything deviates (at least a little bit) from the ideal.

We can quantify it, however:

# Shapiro-Wilk is mostly used for datasets with less than 50 observations (or so)

shapiro.test(mtcars$mpg)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

How to interpret a general Shapiro-Wilk normality diagnostic:

For a p-value GREATER than 0.05 = data is plausibly normal
For a p-value LESS THAN OR EQUAL TO 0.05 = data is probably not normal

So, mpg from mtcars is plausibly normal, per Shapiro-Wilk (p = 0.12).
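
If you want to pull that p-value out programmatically (handy when checking several variables at once), the object returned by shapiro.test() stores it in $p.value. A quick sketch, using the same value reported above:

# shapiro.test() returns a list; the p-value lives in $p.value
sw <- shapiro.test(mtcars$mpg)
sw$p.value
## [1] 0.1228814
# Automated version of the 0.05 rule of thumb above
sw$p.value > 0.05   # TRUE means "plausibly normal"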

For larger datasets (between 50 and 5000+ observations), there is the K-S test:

hist(diamonds$price, main="Histogram of Diamond Prices", xlab="Price", breaks=50)

Not even close to a bell curve! Why is that important? Remember that the p-value is derived from the standard deviation. So if the data is not bell-shaped, the SD is misleading, and the p-value is too. Any parametric test that reports a p-value assumes a (roughly) normal distribution, or it will be incorrect (and R won't tell you that it is incorrect).

qqnorm(diamonds$price)
qqline(diamonds$price,col="red")

The K-S (Kolmogorov-Smirnov) test is kind of like the Shapiro-Wilk test, but for larger datasets.

ks.test(diamonds$price,"pnorm",mean=mean(diamonds$price),sd=sd(diamonds$price))
## Warning in ks.test.default(diamonds$price, "pnorm", mean =
## mean(diamonds$price), : ties should not be present for the one-sample
## Kolmogorov-Smirnov test
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  diamonds$price
## D = 0.18467, p-value < 2.2e-16
## alternative hypothesis: two-sided

Warning explanation: the one-sample K-S test is not supposed to be used on data with tied (repeated) values; in this example, no two diamonds should have exactly the same price, such as two diamonds both costing $1,000.

We are getting a warning because there are repeated prices in the data.
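
One common workaround (just a sketch, not something we need here) is to break the ties by adding a tiny amount of random noise with jitter() before testing. For data this far from normal, the conclusion will not change:

# jitter() adds a small amount of noise so no two prices are exactly equal,
# which silences the ties warning; set.seed() makes it reproducible
set.seed(1)
ks.test(jitter(diamonds$price), "pnorm",
        mean = mean(diamonds$price), sd = sd(diamonds$price))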

The null hypothesis of a K-S test is that the sample and a normal distribution are the same. Because the p-value is below 0.05, we can reject the null hypothesis and say “the sample is very different from a normal distribution”.

P-value above 0.05 = probably NORMAL
P-value at or below 0.05 = probably NOT NORMAL

So the k-s test is consistent with what we see in the plots above.

Erik recommends using the Q-Q plot and eyeballing it to decide whether the data is normal-ish. In papers, you don't usually have to show that data is 'normal.'

The Shapiro-Wilk test becomes less useful for large datasets: the larger the dataset, the more power the test has, so even tiny, practically irrelevant deviations from normality come out as "significant".
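
In fact, R enforces a hard limit: shapiro.test() only accepts between 3 and 5000 non-missing observations, so it will refuse to run on something like diamonds$price. A sketch of what you could do instead (testing a random subsample is an imperfect but common workaround):

# shapiro.test(diamonds$price)   # would error: more than 5000 observations
set.seed(1)
shapiro.test(sample(diamonds$price, 5000))   # test a random sample of 5000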

stat.desc(mtcars$mpg,norm=T)
##      nbr.val     nbr.null       nbr.na          min          max        range 
##   32.0000000    0.0000000    0.0000000   10.4000000   33.9000000   23.5000000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##  642.9000000   19.2000000   20.0906250    1.0654240    2.1729465   36.3241028 
##      std.dev     coef.var     skewness     skew.2SE     kurtosis     kurt.2SE 
##    6.0269481    0.2999881    0.6106550    0.7366922   -0.3727660   -0.2302812 
##   normtest.W   normtest.p 
##    0.9475647    0.1228814

The normtest.p shows the Shapiro-Wilk p-value!

HOMOGENEOUS VARIANCES

The second assumption of linear regression is homogeneity of variance. What does this mean, exactly?

ggplot(mtcars,aes(x=cyl,y=mpg)) + geom_point()

What do you notice here? The means are very different between cylinders (4 cylinders seems to have a much higher average MPG than 6 or 8) – but that’s not what we’re looking at. We are looking at the spread of variance around each of the means.

ggplot(mtcars,aes(x=cyl,y=mpg,group=cyl)) + geom_boxplot()

(this is where boxplot comes in handy)

Are they similar or different? The four-cylinder cars have a big range, while the six-cylinder cars are packed closely together: the number of cylinders affects the spread of mpg, not just its mean. This is not good for linear regression. (A quick way to quantify the spread is sketched below.)
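
To put numbers on that spread, here is a quick sketch using dplyr (loaded with the tidyverse) to compute the variance of mpg within each cylinder group:

# Variance and SD of mpg within each cylinder group
mtcars %>%
  group_by(cyl) %>%
  summarise(variance = var(mpg), sd = sd(mpg))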

leveneTest(mtcars$mpg,mtcars$cyl)
## Warning in leveneTest.default(mtcars$mpg, mtcars$cyl): mtcars$cyl coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)   
## group  2  5.5071 0.00939 **
##       29                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the Levene test (called this way, with two vectors), the grouping variable (cyl in this case) must come SECOND.

Our result is that the variances are SIGNIFICANTLY different, i.e., the assumption is violated. The null hypothesis is rejected. The alternate hypothesis, that the variance around the means differs between groups, is supported.

In our work in this course, you won’t need to report the actual F-statistics, just if the variances are homogeneous or not.
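
As a side note, car::leveneTest() also accepts a formula. This sketch is equivalent to the call above, and converting cyl to a factor yourself avoids the "coerced to factor" warning:

# Same test written with a formula; factor() makes the grouping explicit
leveneTest(mpg ~ factor(cyl), data = mtcars)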

DEALING WITH SKEWED DATA

This topic can be tricky and is somewhat advanced (sorry). Soon you will understand this is more of an art than a science.

Look what happens with outliers:

ggplot(mtcars,aes(x=cyl,y=hp)) + geom_point()

The Maserati has a horsepower that is much higher than the other groups, and even higher than other 8-cylinder cars. It’s distorting the variance of all groups.
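
If you want to confirm which car that is, a quick base-R filter does it (it's the Maserati Bora, at 335 hp):

# Show any car with more than 300 horsepower
mtcars[mtcars$hp > 300, c("cyl", "hp")]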

leveneTest(mtcars$hp,mtcars$cyl,center=mean)
## Warning in leveneTest.default(mtcars$hp, mtcars$cyl, center = mean): mtcars$cyl
## coerced to factor.
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value  Pr(>F)  
## group  2  4.1437 0.02613 *
##       29                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis can be rejected– the variances are not homogeneous.

One idea is to just remove the outlier: (Erik does not recommend this generally, but we’re doing this exercise to prove a point)

cars_removed<-mtcars %>% filter(hp<300)

ggplot(cars_removed,aes(x=cyl,y=hp)) + geom_point()

leveneTest(cars_removed$hp,cars_removed$cyl)
## Warning in leveneTest.default(cars_removed$hp, cars_removed$cyl):
## cars_removed$cyl coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  1.4953 0.2415
##       28

We now see that the Levene Test is not significant – that is, the variances have been homogenized. Basically, these dots have a similar spread!

The data is now considerably more normal as well:

hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)

hist(cars_removed$hp,breaks=10,probability = T)
lines(density(cars_removed$hp),col='red',lwd=2)

Removing the outlier made the data more normal. It is still skewed, but less skewed than before. (Definition of skewed: the data has a long tail on one side.) We can put a number on the skew, too; see the sketch below.
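
To put a number on the skew, pastecs::stat.desc() (used earlier) reports a skewness statistic; values closer to 0 mean a more symmetric distribution. A sketch comparing before and after removal:

# Skewness of hp before and after removing the outlier
stat.desc(mtcars$hp, norm = TRUE)["skewness"]
stat.desc(cars_removed$hp, norm = TRUE)["skewness"]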

The problem here is that we've deleted something from our data for the purpose of making it look better, which is basically cheating. DON'T DO THIS!

Exception: there is some kind of overwhelming theoretical reason to remove specific parts of the data. When this happens, you must make it clear that you have done this.

TRANSFORMATION

Instead of removing outliers or anything else that can distort results, we can transform the data using various methods. My favorite (because of how easy it is) is the log-transformation. (you can do weird things to the data, as long as you do the same weird things to all the other observations. It’s like algebra– what you do to one side of an equation, you have to do to the other)

A logarithm is "the power we must raise a certain number (the base) to in order to get another number":

base^exponent = number

2^(exponent?) = 8

To get 8, you have to raise 2 to what power?

It’s actually 3:

2^3 = 8

So, the log of 8 in "base 2" is 3. You don't have to memorize this; it's just background for explaining what log is!

log(8,base=2)
## [1] 3

The base of the natural logarithm is Euler's number, e ≈ 2.718 (you can google "Euler's number" for help sleeping). When you use the log() command, R assumes a base of e.

log(8)
## [1] 2.079442

~2.71 ^ 2.079442 ≈ 8 (not exactly 8, because 2.71 is only an approximation):

2.71^2.079442
## [1] 7.949403
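
It doesn't come out to exactly 8 because 2.71 is only an approximation. The exact base is e, which R gives you as exp(1):

# exp(1) is e, the exact base of the natural logarithm
exp(1)
## [1] 2.718282
# Raising e to log(8) gets us back to exactly 8
exp(1)^log(8)
## [1] 8
exp(log(8))   # exp() does the same thing in one step
## [1] 8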

Why does this matter?

It turns out that finding the natural log of a whole data set will “smoosh” the data together - reducing skew and outliers to some degree.

ggplot(mtcars,aes(x=cyl,y=hp)) + geom_point() + ggtitle("Untransformed Data")

ggplot(mtcars,aes(x=log(cyl),y=log(hp))) + geom_point() + ggtitle("Log-Transformed Data")

Variances are much more homogeneous! What about normality?

hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)

hist(log(mtcars$hp),breaks=10,probability = T)
lines(density(log(mtcars$hp)),col='red',lwd=2)

What's the catch? When you do this to your data, you are no longer reporting your results as the effect of cylinders on horsepower; you are now reporting the effect of log(cyl) on log(hp). This is much more theoretical.

Very good. What does a log transformation actually look like?

mt_cars_log<-mtcars %>% mutate(LOG_HP=log(hp)) %>% select(hp,LOG_HP)

head(mt_cars_log)
##                    hp   LOG_HP
## Mazda RX4         110 4.700480
## Mazda RX4 Wag     110 4.700480
## Datsun 710         93 4.532599
## Hornet 4 Drive    110 4.700480
## Hornet Sportabout 175 5.164786
## Valiant           105 4.653960

Here's the math: e^4.700480 ≈ 110.
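
You can check this in R with exp(), which raises e to a power; exponentiating the logged value recovers the original horsepower (this is also how you back-transform a logged result when you need to talk about it in original units):

# exp() undoes log(): e^4.700480 is the Mazda RX4's original 110 hp
exp(4.700480)
## [1] 110
exp(log(110))
## [1] 110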

The natural log transformation “squishes” the values of horsepower, to reduce the influence of outliers.

You don't have to transform every variable in the dataset, but you must transform every value within a column you do transform (so, no transforming the HP of car #1 but not car #2).

Natural logs don’t work (in terms of real numbers) for negative values or zero. If you somehow had a car with negative 100 horsepower, this is what would happen:

log(-100)
## Warning in log(-100): NaNs produced
## [1] NaN

It is possible to extend logarithms to negative numbers but that is a complex mathematical topic far beyond this course (also unlikely to be a solution in real life)

The book recommends that you “add a constant” to make the values become positive – I actually disagree and think that is a bit silly.

Better idea: use absolute values (abs())

log(abs(-100))
## [1] 4.60517

Always make it clear in your paper if you’ve transformed the data. Turning negative values into positive values by taking the absolute value is perfectly fine as long as you make it clear you’ve done so – and that doing so is supported by the literature in some way. You can report your result as “the log absolute value of x” instead of just “x”. The downside is that this often makes interpretation a little more difficult.

It’s also possible to do transformations via square roots: (if you don’t like the log function and think it’s too complicated)

mt_cars_sqrt<-mtcars %>% mutate(HP_SQRT=sqrt(hp)) %>% select(hp,HP_SQRT)

head(mt_cars_sqrt)
##                    hp   HP_SQRT
## Mazda RX4         110 10.488088
## Mazda RX4 Wag     110 10.488088
## Datsun 710         93  9.643651
## Hornet 4 Drive    110 10.488088
## Hornet Sportabout 175 13.228757
## Valiant           105 10.246951

It isn't as squished as the log transformation, but it is still less skewed than the untransformed data.

hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)

hist(mt_cars_sqrt$HP_SQRT,breaks=10,probability = T)
lines(density(mt_cars_sqrt$HP_SQRT),col='red',lwd=2)

Which is better? The main advantage of the square root transformation is that it can be applied to values of zero:

sqrt(0)
## [1] 0
log(0)
## [1] -Inf
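
If you like the log transformation but have zeros in your data, one common alternative (a sketch, and another choice you should state clearly if you use it) is log1p(), which computes log(1 + x) and therefore maps zero to zero:

# log1p(x) = log(1 + x), which is defined at zero
log1p(0)
## [1] 0
log1p(100)
## [1] 4.615121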

HOMEWORK

  1. From the data you have chosen, select a variable that you are interested in (just one variable)
  2. Use pastecs::stat.desc to describe the variable. Include a few sentences about what the variable is and what it's measuring. Remember to load pastecs with library(pastecs) (if you need to install it: install.packages("pastecs"))
  3. Remove NAs if needed using dplyr::filter (or anything similar)
  4. Provide a histogram of the variable (as shown in this lesson)
  5. Transform the variable using the log transformation or the square root transformation (whichever is more appropriate) using dplyr::mutate or something similar
  6. Provide a histogram of the transformed variable
  7. Submit via RPubs on Canvas

Practice

library(tidyverse)
library(pastecs)
cars_data2<-mtcars

Erik makes a new chunk whenever he does something different– kind of like starting a new paragraph in English

stat.desc(cars_data2$disp, norm=T)
##       nbr.val      nbr.null        nbr.na           min           max 
##  3.200000e+01  0.000000e+00  0.000000e+00  7.110000e+01  4.720000e+02 
##         range           sum        median          mean       SE.mean 
##  4.009000e+02  7.383100e+03  1.963000e+02  2.307219e+02  2.190947e+01 
##  CI.mean.0.95           var       std.dev      coef.var      skewness 
##  4.468466e+01  1.536080e+04  1.239387e+02  5.371779e-01  3.816570e-01 
##      skew.2SE      kurtosis      kurt.2SE    normtest.W    normtest.p 
##  4.604298e-01 -1.207212e+00 -7.457714e-01  9.200127e-01  2.080657e-02
#lets check that there aren't any NA's in the data-- 
summary(cars_data2$disp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.1   120.8   196.3   230.7   326.0   472.0

The mean is 230 and the median is 196. The normality test p-value is 0.0208, which is significant: the data is probably not normal. Let's look at it:

hist(cars_data2$disp)

cars_data_transformed <- cars_data2 |> mutate(disp_log=log(disp))

This creates another variable in cars_data_transformed.

hist(cars_data_transformed$disp_log)

Not quite normal… but more normal than the previous histogram!
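
If you want a number to back that up, you could re-run the Shapiro-Wilk test on the transformed variable (a quick sketch; a p-value above 0.05 would mean the log-transformed displacement is plausibly normal):

# Re-check normality after the log transformation
shapiro.test(cars_data_transformed$disp_log)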