Testing Assumptions of Regression

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
cars_data<-mtcars

Let’s review the assumptions of linear regression from last time.

  1. Linearity of variables
  2. Independence of errors
  3. Homoscedasticity
  4. Normality of residuals
  5. No multicollinearity

1) We went over Linearity of variables, but to review:

The relationship between the predictors and the outcome should be linear: in a Residuals vs. Fitted plot, the residuals should scatter evenly around a roughly straight, flat line.

cars_model_1<-lm(hp~wt+carb,data=cars_data)
plot(cars_model_1,which=1)

The fact that the points are scattered more or less evenly around the line in the center (and that the fitted line is mostly straight) means that the relationship between these variables is reasonably linear: as wt and carb increase, hp changes at a roughly constant rate. In this case, it increases.

We can also run a formal test, the Rainbow test from the lmtest package, to check whether the relationship is linear.

raintest(cars_model_1)
## 
##  Rainbow test
## 
## data:  cars_model_1
## Rain = 2.4129, df1 = 16, df2 = 13, p-value = 0.05781

For the Rainbow test, a low p-value (below 0.05) means that the relationship is NOT linear. Here the p-value is 0.058, so this model passes, but only just. Also, the Rainbow test assumes homoscedasticity, so keep an eye out for that (we'll talk about it soon).
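As a side note (my own addition, not part of the original lecture), lmtest also ships a second linearity check, the Harvey-Collier test, which can be run on the same fitted model as a rough cross-check:

# Harvey-Collier test from lmtest: another check of linearity on the same model
# (it uses recursive residuals, so its p-value will differ from the Rainbow test)
harvtest(cars_model_1)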

2) Independence of Errors

  • We want to make sure that none of the residuals in the model are too strongly correlated with one another; that is, we want them to be "independent" (hence, Independence of Errors). Why? If the errors are correlated with each other, the standard errors of the model are distorted, and the standard errors feed into almost everything else (p-values, confidence intervals, significance tests).

Thankfully there is a pretty simple way to test for Independence of Errors using the "car" package. (Unrelated to MTCARS or even "CARS", sorry about that.) The package contains durbinWatsonTest() (the Durbin-Watson test), which lets us test the model directly:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
durbinWatsonTest(cars_model_1)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1464524      1.652833   0.222
##  Alternative hypothesis: rho != 0

A p-value greater than 0.05 means we cannot reject the hypothesis that the errors are independent. In this case, the errors look independent (p-value = 0.222; car computes this p-value by simulation, so it can shift slightly between runs). Also, the closer the D-W statistic is to 2, the better: values near 0 suggest positive autocorrelation and values near 4 suggest negative autocorrelation.
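As an aside (my own note), the lmtest package we attached earlier has an equivalent function, dwtest(); the D-W statistic is the same, though its p-value is computed analytically rather than by simulation:

# Durbin-Watson test via lmtest instead of car
dwtest(cars_model_1)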

3) Homoscedasticity

  • The variance of the residuals should be the same across all fitted values of the model. Basically, if you plot the residuals, their spread should stay roughly constant rather than fanning out or getting "wavy".

Thankfully this can be checked very simply with the Scale-Location plot:

plot(cars_model_1,which=3)

You'll notice this looks a lot like the Residuals vs. Fitted plot, but here the y-axis is the square root of the absolute standardized residuals. The interpretation is similar: the red line should be roughly flat, and the points should cluster evenly around it. As you can see, the red line here is quite curvy, so this model likely violates the assumption of homoscedasticity.

Study tip: “Homo” means “same” so “homoscedasticity” means that the “scedasticity” is the same across the plot.

This can also be tested mathematically using the Breusch-Pagan Test in the “lmtest” package:

bptest(cars_model_1)
## 
##  studentized Breusch-Pagan test
## 
## data:  cars_model_1
## BP = 8.0855, df = 2, p-value = 0.01755

For the BP test, the null hypothesis is that the model is homoscedastic, and a significant result means that null hypothesis can be rejected. Here the p-value is significant (less than 0.05), so we reject homoscedasticity and conclude that the model is heteroscedastic.
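If you're curious what the test is doing under the hood: the studentized BP statistic is essentially n times the R-squared from regressing the squared residuals on the predictors. A rough sketch of that idea (my own illustration, not part of the lecture):

# Intuition behind the (studentized) Breusch-Pagan test: do the predictors
# explain any of the variation in the squared residuals?
bp_aux <- lm(residuals(cars_model_1)^2 ~ wt + carb, data = cars_data)
nobs(cars_model_1) * summary(bp_aux)$r.squared  # close to the BP statistic above (about 8.09)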

What does a homoscedastic model look like?

# here is the old hp~mpg model we used:

cars_model_2<-lm(hp~mpg,data=cars_data)

plot(cars_model_2,which=3)

The line is much straighter, much more homoscedastic.

This is borne out mathematically as well:

bptest(cars_model_2)
## 
##  studentized Breusch-Pagan test
## 
## data:  cars_model_2
## BP = 0.86843, df = 1, p-value = 0.3514

The p-value is 0.35, much larger than 0.05, which means that the null hypothesis of homoscedasticity CANNOT be rejected, so homoscedasticity can be assumed.

4) Normality of Residuals

We want to make sure that the residuals are normally distributed. This is hard to judge on its own, but a "Q-Q plot" (quantile-quantile plot) makes it easy:

plot(cars_model_1,which=2)

For the residuals to be normally distributed, the observations would need to lie close to the dotted line. As you can see, there is substantial deviation from it; in fact, the observations barely touch the dotted line.

Let's look at cars_model_2 for a more normally distributed Q-Q plot:

plot(cars_model_2,which=2)

There is still some deviation from the dotted line, but it's much less than in the previous plot. We can also run a Shapiro-Wilk test to check for normality of residuals:

shapiro.test(cars_model_2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars_model_2$residuals
## W = 0.89154, p-value = 0.003799

The p-value is actually below 0.05, which suggests that the residuals differ significantly from a normal distribution (i.e., they are not normal), so even this model does not fully satisfy the assumption.
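As a small aside (my own suggestion), a plain histogram of the residuals is another quick way to eyeball normality:

# Histogram of the residuals: roughly bell-shaped and centered near zero means roughly normal
hist(residuals(cars_model_2), breaks = 10,
     main = "Residuals of cars_model_2", xlab = "Residual")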

5) No Multicollinearity

  • The predictor variables themselves should not be too strongly correlated with each other. We can check this with the vif() command from the "car" package we loaded above.

Here’s the big huge kitchen sink model we ran last week:

kitchen_sink<-lm(hp~cyl+disp+mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)


vif(kitchen_sink)
##       cyl      disp       mpg      drat        wt      qsec        vs        am 
## 14.912374 15.715476  7.296154  3.398500 16.339384  7.958123  4.600828  4.931842 
##      gear      carb 
##  5.344559  5.923559

Basically, any VIF over 10 means that the variable is strongly correlated with some other variable, and thus the assumption of “no multicollinearity” will be violated. Some sources say anything over “5” is problematic.
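Where do those numbers come from? A predictor's VIF is 1 / (1 - R-squared), where the R-squared comes from regressing that predictor on all the other predictors in the model. Here is a quick check for wt (my own illustration):

# VIF for wt "by hand": regress wt on the other predictors, then compute 1 / (1 - R^2)
wt_aux <- lm(wt ~ cyl + disp + mpg + drat + qsec + vs + am + gear + carb, data = cars_data)
1 / (1 - summary(wt_aux)$r.squared)  # matches vif(kitchen_sink)["wt"], about 16.3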

Mitigation

So, let's say your data fails all of these assumptions. Is it time to give up?

Absolutely not!

Typically, data can be transformed so that it meets the assumptions of multiple regression. This somewhat changes the interpretation of the end-product, but the idea is basically the same.

  1. Linearity:

Non-linear data can be adjusted using the log transformation.

cars_model_3<-lm(hp~mpg+am,data=cars_data)
cars_model_3_log<-lm(log(hp)~log(mpg)+am,data=cars_data)

raintest(cars_model_3)
## 
##  Rainbow test
## 
## data:  cars_model_3
## Rain = 3.0656, df1 = 16, df2 = 13, p-value = 0.02378
raintest(cars_model_3_log)
## 
##  Rainbow test
## 
## data:  cars_model_3_log
## Rain = 2.4838, df1 = 16, df2 = 13, p-value = 0.05225

Now we see that the log-transformed model, cars_model_3_log, has a Rainbow test p-value of 0.052 (up from 0.024), so we can, just barely, no longer reject linearity.

  2. Independence of Errors

Instead of lm(), you can run rlm() from the MASS package. This fits a "robust" linear model, for which the independence-of-errors assumption is less critical:

cars_model_2_robust<-rlm(hp~mpg,data=cars_data)


summary(cars_model_2)
## 
## Call:
## lm(formula = hp ~ mpg, data = cars_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.26 -28.93 -13.45  25.65 143.36 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   324.08      27.43  11.813 8.25e-13 ***
## mpg            -8.83       1.31  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.95 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
summary(cars_model_2_robust)
## 
## Call: rlm(formula = hp ~ mpg, data = cars_data)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.902 -25.432  -6.355  26.480 150.425 
## 
## Coefficients:
##             Value    Std. Error t value 
## (Intercept) 308.7988  24.9851    12.3593
## mpg          -8.2816   1.1927    -6.9435
## 
## Residual standard error: 38.78 on 30 degrees of freedom

The main issue is that robust linear models do not report p-values or r-squared (they make fewer assumptions about the errors, so the usual significance machinery doesn't apply directly). It's therefore best to run a regular regression and a robust regression together, use the regular one to get a sense of which variables are significant, and report both.
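One convenient way to report both side by side (the data frame below is just my own quick sketch, not a standard function):

# Put the OLS and robust coefficient estimates next to each other for comparison
data.frame(
  ols    = coef(cars_model_2),
  robust = coef(cars_model_2_robust)
)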

  3. Homoscedasticity

Similar to the issues with linearity above, a log-transformation of the dependent variable can induce homoscedasticity:

bptest(cars_model_3)
## 
##  studentized Breusch-Pagan test
## 
## data:  cars_model_3
## BP = 10.251, df = 2, p-value = 0.005943
bptest(cars_model_3_log)
## 
##  studentized Breusch-Pagan test
## 
## data:  cars_model_3_log
## BP = 3.7762, df = 2, p-value = 0.1514

(The BP test for the log model is non-significant, meaning we can now assume it is homoscedastic.)

  4. Normality of residuals

If the residuals of a model are non-normal, transforming the variables (typically with a log transformation, as above) can help. Compare the Q-Q plots of the raw and log models:

plot(cars_model_3,which=2)

plot(cars_model_3_log,which=2)
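To attach numbers to the visual comparison (output not shown here, so run it yourself), you could test both sets of residuals:

# Shapiro-Wilk tests on the raw-model and log-model residuals
shapiro.test(residuals(cars_model_3))
shapiro.test(residuals(cars_model_3_log))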

  5. Multicollinearity

We saw earlier that a VIF greater than 5 or 10 meant there was some multicollinearity:

vif(kitchen_sink)
##       cyl      disp       mpg      drat        wt      qsec        vs        am 
## 14.912374 15.715476  7.296154  3.398500 16.339384  7.958123  4.600828  4.931842 
##      gear      carb 
##  5.344559  5.923559

It's acceptable to remove a variable that is strongly correlated with another. But how do we know which one to remove? We can look at the correlation matrix of the predictors:

kitchen_sink_vars<-cars_data %>% dplyr::select(cyl,disp,mpg,drat,wt,qsec,vs,am,gear,carb)

cor(kitchen_sink_vars)
##             cyl       disp        mpg        drat         wt        qsec
## cyl   1.0000000  0.9020329 -0.8521620 -0.69993811  0.7824958 -0.59124207
## disp  0.9020329  1.0000000 -0.8475514 -0.71021393  0.8879799 -0.43369788
## mpg  -0.8521620 -0.8475514  1.0000000  0.68117191 -0.8676594  0.41868403
## drat -0.6999381 -0.7102139  0.6811719  1.00000000 -0.7124406  0.09120476
## wt    0.7824958  0.8879799 -0.8676594 -0.71244065  1.0000000 -0.17471588
## qsec -0.5912421 -0.4336979  0.4186840  0.09120476 -0.1747159  1.00000000
## vs   -0.8108118 -0.7104159  0.6640389  0.44027846 -0.5549157  0.74453544
## am   -0.5226070 -0.5912270  0.5998324  0.71271113 -0.6924953 -0.22986086
## gear -0.4926866 -0.5555692  0.4802848  0.69961013 -0.5832870 -0.21268223
## carb  0.5269883  0.3949769 -0.5509251 -0.09078980  0.4276059 -0.65624923
##              vs          am       gear        carb
## cyl  -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.7104159 -0.59122704 -0.5555692  0.39497686
## mpg   0.6640389  0.59983243  0.4802848 -0.55092507
## drat  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    1.0000000  0.16834512  0.2060233 -0.56960714
## am    0.1683451  1.00000000  0.7940588  0.05753435
## gear  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.5696071  0.05753435  0.2740728  1.00000000

According to vif(), cyl, disp, and wt are the most strongly correlated with the other variables, and the cor() table shows they are strongly correlated with basically everything, including each other. This makes sense: cyl, disp, and wt are all essentially measurements of the car's size, just stated in different ways. Consequently, it wouldn't hurt to decide which measurement of car size is most important to your analysis and keep only that one.

In this case, I think "wt" is probably the best measurement of car size, since it's more direct than cyl or disp. So, let's drop cyl and disp:

kitchen_sink_altered<-lm(hp~mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)
vif(kitchen_sink_altered)
##      mpg     drat       wt     qsec       vs       am     gear     carb 
## 7.259608 3.106846 6.770138 4.650877 4.192360 4.578935 4.737443 4.125290

As you can see, the VIF for “wt” went from around 16 to about 7 – much more manageable than before. It’s still kind of high but that’s because “wt” is correlated with many things in a car.

Cheat Sheet for Mitigation:

  1. Not linear? Try a log transformation (may not always work).
  2. Errors not independent? Try a robust linear model with rlm() (but you lose p-values).
  3. Heteroscedasticity? Log-transform the dependent variable (also may not always work).
  4. Residuals not normal? Try a log transformation.
  5. Multicollinearity? Remove strongly correlated variables.

As you can see, log() transformations are basically the main “tool” to help.

Logistic Regression

What is logistic regression and why is it valuable?

Basically: logistic regression is similar to "regular" regression, but instead of a continuous outcome (like price, or mpg), the dependent variable is binary (one or zero).

Who cares? Well, this is actually one of the most powerful tools that is used day-to-day by organizations. If you think about it, many important things can be represented by a one or zero.

  • Does the patient live or die? (1 = live, 0 = die)
  • Is this bank transaction fraudulent? (1 = fraud, 0 = legitimate)
  • Do clients of this non-profit find housing? (1 = yes, 0 = no)

And so forth.

This is a little different from regular regression because the dependent variable is no longer continuous, so instead of lm() we use glm() with family = "binomial":

am_logistic<-glm(am~mpg+wt,data=cars_data,family="binomial")
summary(am_logistic)
## 
## Call:
## glm(formula = am ~ mpg + wt, family = "binomial", data = cars_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  25.8866    12.1935   2.123   0.0338 *
## mpg          -0.3242     0.2395  -1.354   0.1759  
## wt           -6.4162     2.5466  -2.519   0.0118 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 17.184  on 29  degrees of freedom
## AIC: 23.184
## 
## Number of Fisher Scoring iterations: 7
#we select "wt" from the model coefficients
wt_odds<-coef(am_logistic)[3]

# print wt coefficient
print(wt_odds)
##        wt 
## -6.416172
# exponentiate the coefficient to get the odds ratio per unit of wt
exp(wt_odds)
##          wt 
## 0.001634903

What does this mean? The exponentiated coefficient is an odds ratio: for every one-unit increase in wt, the odds of am = 1 (manual transmission) are multiplied by about 0.0016. In other words, the odds fall by roughly 99.8% per unit of weight, so heavier cars are far less likely to have a manual transmission. The wt effect is statistically significant (p = 0.0118) and, in this case, also very large in practical terms.
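To double-check that interpretation with a little arithmetic (my own sketch; recall that wt is measured in units of 1,000 lbs in mtcars):

# Convert the odds ratio into a percentage change in the odds per one-unit increase in wt
wt_odds_ratio <- exp(coef(am_logistic)["wt"])
(wt_odds_ratio - 1) * 100  # about -99.8: each extra 1,000 lbs cuts the odds by roughly 99.8%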

We will discuss this more next week.

Homework

  1. Load your preferred dataset into RStudio
  2. Create a linear model “lm()” from the variables, with a continuous dependent variable as the outcome
  3. Check the following assumptions (a rough helper sketch follows this list):
  • Linearity (plot and raintest)
  • Independence of errors (Durbin-Watson)
  • Homoscedasticity (plot, bptest)
  • Normality of residuals (Q-Q plot, Shapiro-Wilk test)
  • No multicollinearity (VIF, cor)
  4. Does your model meet those assumptions? You don’t have to be perfectly right, just make a good case.
  5. If your model violates an assumption, which one?
  6. What would you do to mitigate it? Show your work.
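For the assumption checks in step 3, here is a rough helper sketch (my own function, not from the lecture; it assumes lmtest and car are loaded and that the model has at least two predictors, since vif() requires that):

# Run the five assumption checks from this lesson on a fitted lm object
check_assumptions <- function(model) {
  print(raintest(model))                 # 1. linearity
  print(durbinWatsonTest(model))         # 2. independence of errors
  print(bptest(model))                   # 3. homoscedasticity
  print(shapiro.test(residuals(model)))  # 4. normality of residuals
  print(vif(model))                      # 5. multicollinearity
}

# Example: run it on the kitchen-sink model from earlier
check_assumptions(kitchen_sink)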