Testing Assumptions of Regression

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
cars_data<-mtcars

Let’s review the assumptions of linear regression from last time. For your regression results to be trustworthy, all of the following must be true:

  1. Linearity of variables
  2. Independence of errors
  3. Homoscedasticity
  4. Normality of residuals
  5. No multicollinearity

1) We went over Linearity of variables, but to review:

The relationship between the independent variables and the dependent variable should be linear (a straight line). In the Residuals vs. Fitted plot, that shows up as points scattered evenly around a roughly straight, flat line.

cars_model_1<-lm(hp~wt+carb,data=cars_data)
plot(cars_model_1,which=1)

#note: the which=1 brings up the Residuals vs. Fitted graph. Without which=1, there are four graphs that come up.

The fact that the points are scattered more or less evenly around the straight line in the center (and the fitted line is mostly straight) means that the relationship between these variables is pretty linear: as WT and CARB go up, HP changes in a roughly straight-line fashion. In this case, it goes up.

We can also do a mathematical test (a “rainbow” test) to see if the variables have a linear relationship.

raintest(cars_model_1)
## 
##  Rainbow test
## 
## data:  cars_model_1
## Rain = 2.4129, df1 = 16, df2 = 13, p-value = 0.05781

In the case of the Rainbow test, a low p-value means that the data is NOT linear. As we can see, this model just barely passes (p = 0.058). Also, the Rainbow test assumes homoscedasticity, so keep an eye out for that (we’ll talk about it soon).

The null hypothesis is that the relationship is linear. If the p-value is greater than .05, the data is likely linear. (If the p-value is smaller than .05, you can try things like log transformations.) Importantly, the Rainbow test relies on a p-value, which assumes the errors are roughly normally distributed.

Erik’s advice: don’t sweat it too much. You want to address it in the paper, but there are many different ways to deal with the data if it is non-linear.

2) Independence of Errors

  • we want to make sure that none of the residuals in the model are too strongly correlated with one another; we want them to be “independent” (hence, Independence of Errors). Why? If errors are too closely correlated with each other, it messes up the standard errors of the model, which messes up literally everything else (p-values, beta coefficients, r-squared, everything).

example: imagine throwing darts at a dartboard. You are aiming at the middle, but all the darts you throw end up in the bottom right. You would expect the errors to be randomly distributed, landing all over the board; instead, something seems to be systematically influencing how the darts are thrown. The errors are not independent.

Thankfully there is a pretty simple way to test for Independence of Errors using the “car” package. (Unrelated to MTCARS or even “CARS”, sorry about that.) The package contains the “durbinWatsonTest” (Durbin-Watson test), which lets us test the model directly:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
durbinWatsonTest(cars_model_1)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1464524      1.652833   0.258
##  Alternative hypothesis: rho != 0

A p-value greater than 0.05 means that the errors ARE independent. In this case, the errors are independent (p-value = 0.258; note that car computes this p-value by simulation, so it can shift slightly each time you run it). Also, the closer the D-W statistic is to 2, the better (roughly, D-W ≈ 2 × (1 − r), where r is the autocorrelation of the residuals, so no autocorrelation gives a value of 2).

3) Homoscedasticity

  • the variance of the residuals should be the same across all levels of the fitted values. Basically, if you plot the residuals, they should sit in an even band around a straight line and not fan out or get “wavy”. Put another way: do the errors stay about the same size across the whole range of the model?

Thankfully this can be done very simply with the Scale-Location “plot” command:

plot(cars_model_1,which=3)

You’ll notice this is very similar to the Residuals vs. Fitted plot; the difference is that it plots the fitted values against the square root of the standardized residuals. The reading is the same: the red line should be almost totally straight, and the points should cluster evenly around it. As you can see, the red line here is very curvy, so this model likely violates the assumption of Homoscedasticity.

Study tip: “Homo” means “same,” so “homoscedasticity” means that the “scedasticity” (the scatter, or variance, of the errors) is the same across the plot.

This can also be tested mathematically using the Breusch-Pagan Test in the “lmtest” package:

bptest(cars_model_1)
## 
##  studentized Breusch-Pagan test
## 
## data:  cars_model_1
## BP = 8.0855, df = 2, p-value = 0.01755

For the BP test, a significant result means the null hypothesis can be rejected, and here the null hypothesis is that the model is homoscedastic. Since the p-value is significant (less than 0.05), we reject homoscedasticity and conclude the model is heteroscedastic.

What does a homoscedastic model look like?

# here is the old hp~mpg model we used:

cars_model_2<-lm(hp~mpg,data=cars_data)

plot(cars_model_2,which=3)

The line is much straighter, much more homoscedastic.

This is borne out mathematically as well:

bptest(cars_model_2)
## 
##  studentized Breusch-Pagan test
## 
## data:  cars_model_2
## BP = 0.86843, df = 1, p-value = 0.3514

The p-value is 0.35, much larger than 0.05, which means we FAIL to reject the null hypothesis of homoscedasticity; in other words, we can reasonably assume the model is homoscedastic.

Note: You don’t have to put these images in the paper if you don’t want to; you can just write out results (example: “the model met this assumption but not this one; here is what I did to mitigate that”)

4) Normality of Residuals

We want to make sure that the residuals are normally distributed. This is hard to visualize on its own, but we can use what is called a “Q-Q plot” to do so:

plot(cars_model_1,which=2)

For residuals to be normally distributed, all of the observations would need to lie along the dotted line. As you can see, there is significant deviation from the dotted line. In fact, the observations barely touch the dotted line.

Lets look at cars_model_2 for a more normally distributed Q-Q plot:

plot(cars_model_2,which=2)

There is still some deviation from the dotted line, but it’s much less than in the previous plot. This is slightly more normal. We can also run a Shapiro-Wilk test to check for normality of residuals:

shapiro.test(cars_model_2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars_model_2$residuals
## W = 0.89154, p-value = 0.003799

The p-value is below 0.05, which suggests that the residuals are significantly different from a normal distribution (i.e., they are not normal). The Shapiro-Wilk test’s null hypothesis is that the residuals ARE normal, so a small p-value means we reject normality.

5) No Multicollinearity

  • the independent variables themselves should not be strongly correlated with each other. We can test this with the “vif()” command from the “car” package loaded above. If your independent variables are strongly correlated, they may be masking each other’s effects. You want independent variables that influence the dependent variable but don’t influence each other too much.

Here’s the big huge kitchen sink model we ran last week: (VIF= variance inflation factor)

kitchen_sink<-lm(hp~cyl+disp+mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)


vif(kitchen_sink)
##       cyl      disp       mpg      drat        wt      qsec        vs        am 
## 14.912374 15.715476  7.296154  3.398500 16.339384  7.958123  4.600828  4.931842 
##      gear      carb 
##  5.344559  5.923559

Basically, any VIF over 10 means that the variable is strongly correlated with some other variable, and thus the assumption of “no multicollinearity” will be violated. Some sources say anything over “5” is problematic. Erik said to go easy on ourselves– it’s 10 for our class!

It does not tell you which variables it is strongly correlated with, but if you have two variables over 10, it’s likely they are correlated with each other. What do we do? You can take one variable out, re-run vif(), and see if it improves. Erik recommends taking them out one at a time.

new_kitchen_sink<-lm(hp~cyl+mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)
vif(new_kitchen_sink)
##       cyl       mpg      drat        wt      qsec        vs        am      gear 
## 12.084679  7.272244  3.379276  7.221269  6.651531  4.587613  4.930532  5.235139 
##      carb 
##  4.267101

We took disp out, but cyl is still a problem.

new_kitchen_sink2<-lm(hp~disp+mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)
vif(new_kitchen_sink2)
##      disp       mpg      drat        wt      qsec        vs        am      gear 
## 12.735496  7.267753  3.117346 15.834900  7.356650  4.221355  4.663491  4.737532 
##      carb 
##  5.114559

Taking out cyl instead (and keeping disp) still leaves disp and wt over 10. We may need to take all three out, or keep just one of them!

Mitigation

So, lets say your data fails all of these assumptions. Is it time to give up?

Absolutely not!

Typically, data can be transformed so that it meets the assumptions of multiple regression. This somewhat changes the interpretation of the end-product, but the idea is basically the same.

Spoiler alert: transformation is almost always the answer. Unfortunately, interpretation of results can become a bit trickier: instead of interpreting ‘mpg,’ you might be interpreting log(mpg) or sqrt(mpg).

  1. Linearity:

Non-linear data can be adjusted using the log transformation.

cars_model_3<-lm(hp~mpg+am,data=cars_data)
cars_model_3_log<-lm(log(hp)~log(mpg)+am,data=cars_data)
#log doesn't work with negative numbers or zero-- use square root if you have these

raintest(cars_model_3)
## 
##  Rainbow test
## 
## data:  cars_model_3
## Rain = 3.0656, df1 = 16, df2 = 13, p-value = 0.02378
raintest(cars_model_3_log)
## 
##  Rainbow test
## 
## data:  cars_model_3_log
## Rain = 2.4838, df1 = 16, df2 = 13, p-value = 0.05225

Now we see that the log-transformed model (cars_model_3_log) has a Rainbow test p-value of 0.052, back above 0.05 (barely linear), compared to 0.024 for the untransformed cars_model_3.
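The comment above mentions square roots as an alternative when your data contains zeros or negative values. Here is a minimal sketch of what that would look like (this is my added illustration, not something from the lecture; hp and mpg are already positive, so it is purely to show the syntax, and you would re-check linearity with raintest afterwards):

cars_model_3_sqrt<-lm(sqrt(hp)~sqrt(mpg)+am,data=cars_data)
#same idea as the log model, but sqrt() is defined at zero
raintest(cars_model_3_sqrt)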

  2. Independence of Errors

Instead of an “LM”, you can run an “RLM” from the “MASS” package. This creates a “robust” linear model where independence of errors is not as important:

cars_model_2_robust<-rlm(hp~mpg,data=cars_data)


summary(cars_model_2)
## 
## Call:
## lm(formula = hp ~ mpg, data = cars_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.26 -28.93 -13.45  25.65 143.36 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   324.08      27.43  11.813 8.25e-13 ***
## mpg            -8.83       1.31  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.95 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
summary(cars_model_2_robust)
## 
## Call: rlm(formula = hp ~ mpg, data = cars_data)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.902 -25.432  -6.355  26.480 150.425 
## 
## Coefficients:
##             Value    Std. Error t value 
## (Intercept) 308.7988  24.9851    12.3593
## mpg          -8.2816   1.1927    -6.9435
## 
## Residual standard error: 38.78 on 30 degrees of freedom

The main issue is that robust linear models do not report p-values or r-squared (they don’t rely on the same distributional assumptions), so it’s best to run a regular regression and a robust regression together, get a sense of which variables are significant, and report both. See if the estimates vary a lot. In this example, -8.83 and -8.28 are pretty close; even if the data violated independence of errors, perhaps the violation doesn’t matter that much. You can report something like this in your paper.
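One quick way to eyeball that comparison (a convenience snippet I added, not something from the lecture) is to line the two sets of coefficients up side by side:

#compare the lm and rlm coefficients directly
cbind(lm=coef(cars_model_2),rlm=coef(cars_model_2_robust))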

  3. Homoscedasticity

Similar to the issues with linearity above, a log-transformation of the dependent variable can induce homoscedasticity:

bptest(cars_model_3)
## 
##  studentized Breusch-Pagan test
## 
## data:  cars_model_3
## BP = 10.251, df = 2, p-value = 0.005943
bptest(cars_model_3_log)
## 
##  studentized Breusch-Pagan test
## 
## data:  cars_model_3_log
## BP = 3.7762, df = 2, p-value = 0.1514

(the log model is non-significant, meaning that it is now homoscedastic)

  4. Normality of residuals

If the residuals of a model are non-normal, they can be transformed (typically with log transformation, as above)

plot(cars_model_3,which=2)

plot(cars_model_3_log,which=2)
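If you want a numeric check to go with the two Q-Q plots, you can re-run the Shapiro-Wilk test on both sets of residuals (output not shown here; the null hypothesis is still that the residuals are normal):

shapiro.test(residuals(cars_model_3))
shapiro.test(residuals(cars_model_3_log))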

  5. Multicollinearity

We saw earlier that a VIF greater than 5 or 10 meant there was some multicollinearity:

vif(kitchen_sink)
##       cyl      disp       mpg      drat        wt      qsec        vs        am 
## 14.912374 15.715476  7.296154  3.398500 16.339384  7.958123  4.600828  4.931842 
##      gear      carb 
##  5.344559  5.923559

It’s fine to remove one variable that is strongly correlated with another. But how do we know which one to remove?

kitchen_sink_vars<-cars_data %>% dplyr::select(cyl,disp,mpg,drat,wt,qsec,vs,am,gear,carb)
#The dplyr:: prefix is needed because different packages sometimes have commands with the same name. Here, MASS also has a select() that masks dplyr's (see the message when MASS was loaded), so dplyr::select() tells R exactly which one we mean

cor(kitchen_sink_vars)
##             cyl       disp        mpg        drat         wt        qsec
## cyl   1.0000000  0.9020329 -0.8521620 -0.69993811  0.7824958 -0.59124207
## disp  0.9020329  1.0000000 -0.8475514 -0.71021393  0.8879799 -0.43369788
## mpg  -0.8521620 -0.8475514  1.0000000  0.68117191 -0.8676594  0.41868403
## drat -0.6999381 -0.7102139  0.6811719  1.00000000 -0.7124406  0.09120476
## wt    0.7824958  0.8879799 -0.8676594 -0.71244065  1.0000000 -0.17471588
## qsec -0.5912421 -0.4336979  0.4186840  0.09120476 -0.1747159  1.00000000
## vs   -0.8108118 -0.7104159  0.6640389  0.44027846 -0.5549157  0.74453544
## am   -0.5226070 -0.5912270  0.5998324  0.71271113 -0.6924953 -0.22986086
## gear -0.4926866 -0.5555692  0.4802848  0.69961013 -0.5832870 -0.21268223
## carb  0.5269883  0.3949769 -0.5509251 -0.09078980  0.4276059 -0.65624923
##              vs          am       gear        carb
## cyl  -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.7104159 -0.59122704 -0.5555692  0.39497686
## mpg   0.6640389  0.59983243  0.4802848 -0.55092507
## drat  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    1.0000000  0.16834512  0.2060233 -0.56960714
## am    0.1683451  1.00000000  0.7940588  0.05753435
## gear  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.5696071  0.05753435  0.2740728  1.00000000

According to vif(), cyl, disp and wt are most strongly correlated with other variables, and we see in the cor() chart that they are strongly correlated with basically everything, including each other. This makes sense, as “cyl”, “disp” and “wt” are all basically measurements of the car’s size, just stated in different ways. Consequently, it wouldn’t hurt to decide which of these measurements is most important to your analysis, and keep that one.
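If you don’t want to read the whole cor() matrix by eye, here is a small helper sketch (my addition; the 0.8 cutoff is just a rule of thumb, not an official threshold) that lists only the variable pairs with very strong correlations:

cor_mat<-cor(kitchen_sink_vars)
#keep only the upper triangle so each pair appears once
high<-which(abs(cor_mat)>0.8 & upper.tri(cor_mat),arr.ind=TRUE)
data.frame(var1=rownames(cor_mat)[high[,1]],
           var2=colnames(cor_mat)[high[,2]],
           r=round(cor_mat[high],2))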

In this case, I think “wt” is probably the best measurement of car size; it’s more direct than cyl or disp. So, let’s drop cyl and disp:

kitchen_sink_altered<-lm(hp~mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)
vif(kitchen_sink_altered)
##      mpg     drat       wt     qsec       vs       am     gear     carb 
## 7.259608 3.106846 6.770138 4.650877 4.192360 4.578935 4.737443 4.125290

As you can see, the VIF for “wt” went from around 16 to about 7 – much more manageable than before. It’s still kind of high but that’s because “wt” is correlated with many things in a car.

Cheat Sheet for Mitigation:

  1. Not linear? (try a log transformation) (may not always work)
  2. Errors not independent? (try a robust linear model with rlm()) (removes p-values)
  3. Heteroscedasticity? (log-transform the dependent variable) (also may not always work)
  4. Residuals not normal? (log transformation)
  5. Multicollinearity? (remove strongly correlated variables)

As you can see, log() transformations are basically the main “tool” to help.
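One interpretation note for when you do use log transformations (a general rule of thumb I’m adding, not something specific to this dataset): in a log-log model like cars_model_3_log, the slope on log(mpg) is roughly an elasticity, i.e., the approximate percent change in hp associated with a 1% change in mpg.

#slope on log(mpg): approximate % change in hp for a 1% change in mpg
coef(cars_model_3_log)["log(mpg)"]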

Logistic Regression

What is logistic regression and why is it valuable?

Basically: logistic regression is similar to “regular” regression, but instead of a continuous outcome (like price, or mpg) the dependent variable is categorical (it’s one or zero).

Who cares? Well, this is actually one of the most powerful tools that is used day-to-day by organizations. If you think about it, many important things can be represented by a one or zero.

  • Does the patient live or die? (1 = live, 0 = die)
  • Is this bank transaction fraudulent? (1 = fraud, 0 = legitimate)
  • Do clients of this non-profit find housing? (1 = yes, 0 = no)

And so forth.

This is a little different from regular regression because the dependent variable is no longer continuous, so the result is no longer a straight line. Instead, the model predicts the probability of a 1 versus a 0.
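Under the hood, the model treats the log-odds of a 1 as a linear function of the predictors, and those log-odds get converted to probabilities with the inverse-logit function. A tiny illustration (plogis() is base R’s inverse-logit; this snippet is just my added example):

#log-odds of -2, 0, and 2 become probabilities of about 0.12, 0.50, and 0.88
plogis(c(-2,0,2))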

The dependent variable is what we are trying to predict.

am_logistic<-glm(am~mpg+wt,data=cars_data,family="binomial")
#am is the transmission type in mtcars, where 0=automatic and 1=manual. We are trying to predict what kind of transmission a car has (am) from the mpg and weight
#glm is a generalized linear model; family="binomial" means the dependent variable is 1 or 0  
summary(am_logistic)
## 
## Call:
## glm(formula = am ~ mpg + wt, family = "binomial", data = cars_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  25.8866    12.1935   2.123   0.0338 *
## mpg          -0.3242     0.2395  -1.354   0.1759  
## wt           -6.4162     2.5466  -2.519   0.0118 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 17.184  on 29  degrees of freedom
## AIC: 23.184
## 
## Number of Fisher Scoring iterations: 7

Weight is significant (p = 0.012). The more weight a car has, the less likely it is to have a manual transmission (am = 1).

#we select "wt" from the model coefficients
wt_odds<-coef(am_logistic)[3]

# print wt coefficient
print(wt_odds)
##        wt 
## -6.416172
# exponentiate the coefficient to get the odds ratio per one-unit change in wt
exp(wt_odds)
##          wt 
## 0.001634903

What does this mean? exp(-6.42) ≈ 0.0016, so for every one-unit increase in wt, the odds of am = 1 (a manual transmission) are multiplied by about 0.0016. In other words, the odds drop by roughly 99.8% per unit of weight, a very large effect, which matches the sign of the coefficient: heavier cars are much less likely to be manuals.
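If odds feel abstract, you can also ask the model for predicted probabilities directly. This is just an illustrative sketch: the mpg and wt values below are made-up example cars, not rows from the data.

#predicted probability of a manual transmission (am=1) for a light car and a heavy car
example_cars<-data.frame(mpg=c(28,15),wt=c(2.0,4.5))
predict(am_logistic,newdata=example_cars,type="response")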

We will discuss this more next week.

Homework

  1. Load your preferred dataset into RStudio
  2. Create a linear model with lm() from your variables, with a continuous dependent variable as the outcome
  3. Check the following assumptions:
       • Linearity (plot and raintest)
       • Independence of errors (Durbin-Watson)
       • Homoscedasticity (plot, bptest)
       • Normality of residuals (Q-Q plot, Shapiro-Wilk test; switch to a KS test if you have over 5,000 observations. If your dataset is very large, take normality with a grain of salt: you can mention in your paper if R says your residuals aren’t normal, but it’s very common with big datasets)
       • No multicollinearity (vif, cor)
  4. Does your model meet those assumptions? You don’t have to be perfectly right, just make a good case.
  5. If your model violates an assumption, which one?
  6. What would you do to mitigate that violation? Show your work.
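Here is a minimal sketch of the homework workflow, using mtcars purely as a stand-in; swap in your own dataset, outcome, and predictors. (Also note that shapiro.test() only accepts between 3 and 5,000 values, which is why the KS test is the fallback for big datasets.)

library(lmtest)   #raintest, bptest
library(car)      #durbinWatsonTest, vif

hw_model<-lm(mpg~wt+hp,data=mtcars)   #stand-in model: replace with your own data and variables

plot(hw_model,which=1)                #1. linearity
raintest(hw_model)
durbinWatsonTest(hw_model)            #2. independence of errors
plot(hw_model,which=3)                #3. homoscedasticity
bptest(hw_model)
plot(hw_model,which=2)                #4. normality of residuals
shapiro.test(residuals(hw_model))
vif(hw_model)                         #5. no multicollinearity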