Assumptions of linear models (parametric tests):

  1. Data (or residuals) are normally distributed

  2. Variance is homogeneous (roughly the same across groups)

  3. Data are measured at the interval level

  4. Observations are independent of each other

NORMALITY

We’ve touched on this before, but let’s really dive in. Setting probability = T makes the histogram show densities (so the total area of the bars is 1) rather than raw counts – which is what lets us overlay a density curve:

hist(mtcars$mpg,breaks=10,probability = T)
lines(density(mtcars$mpg),col='red',lwd=2)

There is another (easier?) way of visualizing normality: the Q-Q plot. The red line represents what the data would look like if it were perfectly normal. The example below is more normal than not, though not entirely normal – data like this would still be usable for a project.

qqnorm(mtcars$mpg)
qqline(mtcars$mpg,col="red")

The closer the “dots” are to the “line”, the more normal the data is.

What would we call this? Normal-ish? Sort of normal?

To me, this distribution is “mostly or fairly normal”. Not really a scientific term, though.
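One way to calibrate your eye is to Q-Q plot some simulated data that you know is normal and compare. A small sketch – rnorm() draws random values from a normal distribution, and the mean and sd here are made up to roughly match mpg:

set.seed(123)                                 # just for reproducibility
fake_normal <- rnorm(32, mean = 20, sd = 6)   # 32 simulated, genuinely normal values
qqnorm(fake_normal)
qqline(fake_normal, col = "red")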

We can quantify it, however:

# Shapiro-Wilk is mostly used for datasets with less than 50 observations (or so)

shapiro.test(mtcars$mpg)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

# Shapiro-Wilk tests whether the data are distinguishable from a normal distribution

How to interpret a Shapiro-Wilk normality test:

For a p-value GREATER than 0.05 = data is plausibly normal
For a p-value LESS THAN OR EQUAL TO 0.05 = data is probably not normal

So, MPG from mtcars is probably normal, per Shapiro-Wilk.
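Side note: shapiro.test() returns a result object, so if you ever just need the number (say, to quote in your write-up), you can pull the p-value out directly:

sw_result <- shapiro.test(mtcars$mpg)   # store the test result
sw_result$p.value                       # extract just the p-value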

For larger datasets (from around 50 observations up into the thousands), there is the Kolmogorov-Smirnov (K-S) test. Let’s try it on the diamonds data that ships with ggplot2:

hist(diamonds$price, main="Histogram of Diamond Prices", xlab="Price", breaks=50)

# this data is really skewed to the right

qqnorm(diamonds$price)
qqline(diamonds$price,col="red")

# does not really follow the red line; the points only cross it about twice, and the large outliers pull the upper tail far away

ks.test(diamonds$price,"pnorm",mean=mean(diamonds$price),sd=sd(diamonds$price))
## Warning in ks.test.default(diamonds$price, "pnorm", mean =
## mean(diamonds$price), : ties should not be present for the one-sample
## Kolmogorov-Smirnov test
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  diamonds$price
## D = 0.18467, p-value < 2.2e-16
## alternative hypothesis: two-sided

We are getting a warning because there are repeated prices in the data.

The null hypothesis of this K-S test is that the sample was drawn from the specified normal distribution. Because the p-value is far below 0.05, we reject the null hypothesis and say “the sample is very different from a normal distribution”.

P-value above 0.05 = probably NORMAL
P-value at or below 0.05 = probably NOT NORMAL

So the k-s test is consistent with what we see in the plots above.
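Incidentally, this is also why we didn’t use Shapiro-Wilk here: R’s shapiro.test() only accepts samples of 5000 observations or fewer, so it won’t even run on a dataset this size. A quick sketch (the failing call is left commented out):

nrow(diamonds)                   # over 50,000 rows - far too many for shapiro.test()
# shapiro.test(diamonds$price)   # would stop with a sample-size error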

Even below that hard limit, the Shapiro-Wilk test becomes less useful as datasets grow: the larger the sample, the more sensitive the test is, so even trivial departures from normality come back as “significant”.

# stat.desc() is the pastecs command for describing a variable; norm=T adds the normality statistics (normtest.W and normtest.p are the same values shapiro.test gives)

stat.desc(mtcars$mpg,norm=T)
##      nbr.val     nbr.null       nbr.na          min          max        range 
##   32.0000000    0.0000000    0.0000000   10.4000000   33.9000000   23.5000000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##  642.9000000   19.2000000   20.0906250    1.0654240    2.1729465   36.3241028 
##      std.dev     coef.var     skewness     skew.2SE     kurtosis     kurt.2SE 
##    6.0269481    0.2999881    0.6106550    0.7366922   -0.3727660   -0.2302812 
##   normtest.W   normtest.p 
##    0.9475647    0.1228814

HOMOGENEOUS VARIANCES

The second assumption of linear regression is homogeneity of variance. What does this mean, exactly?

ggplot(mtcars,aes(x=cyl,y=mpg)) + geom_point()

What do you notice here? The means are very different between cylinders (4-cylinder cars seem to have a much higher average MPG than 6- or 8-cylinder cars) – but that’s not what we’re looking at. We are looking at the spread of the data around each of those means. Each dot represents a car; within some groups the dots are packed closer together than in others. If the variances were homogeneous, the spread would be about the same in every group – and here the variance is clearly NOT homogeneous across cylinders.

ggplot(mtcars,aes(x=cyl,y=mpg,group=cyl)) + geom_boxplot()

(this is where boxplot comes in handy)

Are they similar? Or different?
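Before running any formal test, you can put numbers on the spread by computing the variance within each group. A small sketch, assuming dplyr is loaded (we’re already using %>% elsewhere in this lesson):

mtcars %>%
  group_by(cyl) %>%
  summarise(variance_mpg = var(mpg), sd_mpg = sd(mpg))   # spread of mpg within each cylinder group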

# leveneTest() (from the car package) tests whether the variances differ significantly across groups

leveneTest(mtcars$mpg,mtcars$cyl)
## Warning in leveneTest.default(mtcars$mpg, mtcars$cyl): mtcars$cyl coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)   
## group  2  5.5071 0.00939 **
##       29                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In leveneTest(), the grouping variable (cyl in this case) must come SECOND.

Our result is that the variances are SIGNIFICANTLY different – i.e., the assumption of homogeneity is violated.

In our work in this course, you won’t need to report the actual F-statistics, just whether the variances are homogeneous or not.

DEALING WITH SKEWED DATA

This topic can be tricky and is somewhat advanced (sorry). Soon you will understand this is more of an art than a science.

Look what happens with outliers:

ggplot(mtcars,aes(x=cyl,y=hp)) + geom_point()

The Maserati (the top 8-cylinder dot) has a horsepower much higher than the other groups, and even higher than the other 8-cylinder cars. It’s distorting the variance comparison across the groups. If the Maserati were removed from the data, the spread would compress quite a bit.

leveneTest(mtcars$hp,mtcars$cyl,center=mean)
## Warning in leveneTest.default(mtcars$hp, mtcars$cyl, center = mean): mtcars$cyl
## coerced to factor.
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value  Pr(>F)  
## group  2  4.1437 0.02613 *
##       29                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The variances here are only just barely inhomogeneous, since the p-value (0.026) is fairly close to 0.05. One idea is to just remove the outlier:

cars_removed<-mtcars %>% filter(hp<300)

ggplot(cars_removed,aes(x=cyl,y=hp)) + geom_point()

leveneTest(cars_removed$hp,cars_removed$cyl)
## Warning in leveneTest.default(cars_removed$hp, cars_removed$cyl):
## cars_removed$cyl coerced to factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  1.4953 0.2415
##       28

# the p-value is now much higher (0.24), meaning the variances have been homogenized by removing the Maserati

We now see that the Levene Test is not significant – that is, the variances have been homogenized.

The data is now considerably more normal as well:

hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)

hist(cars_removed$hp,breaks=10,probability = T)
lines(density(cars_removed$hp),col='red',lwd=2)

The problem here is that we’ve deleted something from our data purely to make the results look better – and that is basically cheating. DON’T DO THIS!

Exception: there is some kind of overwhelming theoretical reason to remove specific parts of the data. When this happens, you must make it clear that you have done this.

TRANSFORMATION

Instead of removing outliers or anything else that can distort results, we can transform the data using various methods. My favorite (because of how easy it is) is the log-transformation.

A logarithm is “the power to which we must raise a certain number (the base) to get another number”.

base^exponent = number

2^(exponent?) = 8

To get 8, you have to raise 2 to what power?

It’s actually 3:

2^3 = 8

So, the log of 8 in “base 2”, is 3.

log(8,base=2)
## [1] 3
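R also has shortcut functions for the two most common bases, which do the same thing as log() with the base argument:

log2(8)       # base-2 logarithm
log10(1000)   # base-10 logarithm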

The natural logarithm uses a base of roughly 2.718 – Euler’s number, e (you can google it for help sleeping). When you use the log() command, it assumes a base of e (~2.718):

log(8)
## [1] 2.079442

So, raising e to the power 2.079442 should give us back roughly 8:

2.71^2.079442
## [1] 7.949403
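The result is slightly off only because we truncated e to 2.71. R’s exp() function uses the full value of e, so exponentiating the natural log recovers the original number:

exp(1)          # Euler's number, e
exp(log(8))     # exponentiating undoes the natural log, giving back 8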

Why does this matter?

It turns out that taking the natural log of a whole data set will “smoosh” the data together – reducing skew and the influence of outliers to some degree.

ggplot(mtcars,aes(x=cyl,y=hp)) + geom_point() + ggtitle("Untransformed Data")

ggplot(mtcars,aes(x=log(cyl),y=log(hp))) + geom_point() + ggtitle("Log-Transformed Data")

Variances are much more homogeneous! What about normality? Note that the underlying data are the same; only the shape of the distribution has changed. The complication is that you can no longer report the result as cyl x hp – it is now log(cyl) x log(hp), which is not the same as reporting the raw variables. The reason to bother: you want the data as close to normal as possible to get an accurate p-value on the regression.

hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)

hist(log(mtcars$hp),breaks=10,probability = T)
lines(density(log(mtcars$hp)),col='red',lwd=2)
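If you want to go beyond eyeballing the histograms, you can re-run the same diagnostics on the transformed variable (a sketch – interpret the p-values exactly as before):

shapiro.test(log(mtcars$hp))                     # normality of log-transformed hp
leveneTest(log(mtcars$hp), factor(mtcars$cyl))   # variance homogeneity of log(hp) across cyl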

Very good. What does a log transformation actually look like?

mt_cars_log<-mtcars %>% mutate(LOG_HP=log(hp)) %>% select(hp,LOG_HP)

head(mt_cars_log)
##                    hp   LOG_HP
## Mazda RX4         110 4.700480
## Mazda RX4 Wag     110 4.700480
## Datsun 710         93 4.532599
## Hornet 4 Drive    110 4.700480
## Hornet Sportabout 175 5.164786
## Valiant           105 4.653960

The natural log transformation “squishes” the values of horsepower, reducing the influence of outliers (raising e to 4.700480 gets you back to a horsepower of 110). You don’t have to transform every variable in the dataset, but you must transform every value within the same column (so, no transforming the first car’s HP but not the second’s).
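You can always get back to the original scale by exponentiating the transformed column, which is handy when you need to report a result in the original units. A quick sketch (BACK_TO_HP is just a made-up column name):

mt_cars_log %>%
  mutate(BACK_TO_HP = exp(LOG_HP)) %>%   # exp() undoes the natural log
  head()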

Natural logs don’t work (in terms of real numbers) for negative values or zero. If you somehow had a car with negative 100 horsepower, this is what would happen:

log(-100)
## Warning in log(-100): NaNs produced
## [1] NaN

It is possible to extend logarithms to negative numbers but that is a complex mathematical topic far beyond this course (also unlikely to be a solution in real life)

The book recommends that you “add a constant” to make the values become positive – I actually disagree and think that is a bit silly.

Better idea: use absolute values (abs())

log(abs(-100))
## [1] 4.60517

Always make it clear in your paper if you’ve transformed the data. Turning negative values into positive values by taking the absolute value is perfectly fine as long as you make it clear you’ve done so – and that doing so is supported by the literature in some way. You can report your result as “the log absolute value of x” instead of just “x”. The downside is that this often makes interpretation a little more difficult.
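To make the two options concrete, here is a minimal sketch on a made-up vector (the values are hypothetical, not from mtcars):

x <- c(-100, -5, 20, 300)      # hypothetical data containing negative values

log(x + abs(min(x)) + 1)       # the book's approach: shift everything positive first
log(abs(x))                    # the absolute-value approach discussed above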

It’s also possible to do transformations via square roots:

mt_cars_sqrt<-mtcars %>% mutate(HP_SQRT=sqrt(hp)) %>% select(hp,HP_SQRT)

head(mt_cars_sqrt)
##                    hp   HP_SQRT
## Mazda RX4         110 10.488088
## Mazda RX4 Wag     110 10.488088
## Datsun 710         93  9.643651
## Hornet 4 Drive    110 10.488088
## Hornet Sportabout 175 13.228757
## Valiant           105 10.246951
hist(mtcars$hp,breaks=10,probability = T)
lines(density(mtcars$hp),col='red',lwd=2)

hist(mt_cars_sqrt$HP_SQRT,breaks=10,probability = T)
lines(density(mt_cars_sqrt$HP_SQRT),col='red',lwd=2)

Which is better? It depends on the data, but the main advantage of the square root transformation is that it can be applied to values of zero, which the log cannot handle:

sqrt(0)
## [1] 0
log(0)
## [1] -Inf

HOMEWORK due in 3 weeks

  1. From the data you have chosen, select a variable that you are interested in
  2. Use pastecs::stat.desc to describe the variable. Include a few sentences about what the variable is and what it’s measuring. Remember to load pastecs with library(pastecs)
  3. Remove NA’s if needed using dplyr::filter (or anything similar)
  4. Provide a histogram of the variable (as shown in this lesson)
  5. Transform the variable using the log transformation OR the square root transformation (whichever is more appropriate) using dplyr::mutate or something similar
  6. Provide a histogram of the transformed variable
  7. Submit via RPubs on CANVAS