library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
cars_data<-mtcars
Let’s review the assumptions of linear regression from last time.
The overall relationship between the predictors and the outcome should be linear (a straight line), which we can check by plotting the residuals against the fitted values:
cars_model_1<-lm(hp~wt+carb,data=cars_data)
plot(cars_model_1,which=1)
The points are scattered more or less evenly around the horizontal line in the center, and the fitted red line is mostly straight, which means the relationship between these variables is pretty linear: as wt and carb go up, hp changes at a roughly constant rate. In this case, it goes up.
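If you’re curious what this plot is built from, you can recreate a rough version of it by hand from the model object. Here is a sketch (purely illustrative, not part of the assumption checks):
# a hand-rolled Residuals vs. Fitted plot: residuals against fitted values
plot(fitted(cars_model_1),resid(cars_model_1),xlab="Fitted values",ylab="Residuals")
abline(h=0,lty=2) # dashed reference line at zero
# smoothed trend line, similar in spirit to the red line drawn by which=1
lines(lowess(fitted(cars_model_1),resid(cars_model_1)),col="red")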
We can also do a mathematical test (a “rainbow” test) to see if the variables have a linear relationship.
raintest(cars_model_1)
##
## Rainbow test
##
## data: cars_model_1
## Rain = 2.4129, df1 = 16, df2 = 13, p-value = 0.05781
In the case of the Rainbow test, a low p-value means that the data is NOT linear. As we can see, the p-value here (0.058) is just barely above 0.05, so this model only just qualifies as linear. Also, the Rainbow test assumes homoscedasticity, so keep an eye out for that (we’ll talk about it soon).
Thankfully there is a pretty simple way to test for Independence of Errors using the “car” package (unrelated to mtcars or even the “cars” dataset, sorry about that). The package contains durbinWatsonTest() (the Durbin-Watson test), which lets us test the model directly:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
durbinWatsonTest(cars_model_1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1464524 1.652833 0.222
## Alternative hypothesis: rho != 0
A p-value greater than 0.05 means that the errors ARE independent. In this case, the errors are independent (p-value = 0.222). Also, the closer the D-W statistic is to “2”, the better. (Long story, just go with it.)
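As a side note, the “lmtest” package we loaded earlier has its own Durbin-Watson function, dwtest(), which computes the p-value analytically rather than by simulation, so its p-value can differ slightly from the car version. A quick sketch (output omitted):
# Durbin-Watson test via lmtest; same idea, analytic p-value
dwtest(cars_model_1)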
The next assumption is Homoscedasticity: the spread of the residuals should stay roughly constant across the fitted values. Thankfully this can be checked very simply with the Scale-Location “plot” command:
plot(cars_model_1,which=3)
You’ll notice this is very similar to the Residuals vs. Fitted plot, except that the y-axis is the square root of the absolute standardized residuals rather than the raw residuals. The interpretation is the same: the red line should be almost totally straight, and the points should cluster evenly around it. As you can see, the red line here is very curvy, so this model likely violates the assumption of Homoscedasticity.
Study tip: “homo” means “same” and “scedasticity” roughly means “scatter,” so “homoscedasticity” means that the scatter of the residuals is the same across the plot.
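To see what a clearly heteroscedastic model looks like on this plot, here is a small simulated example (my own toy data, nothing to do with mtcars). Because the error spread grows with x, the red line in the Scale-Location plot should trend noticeably upward:
# toy example: simulate data whose error spread grows with x (heteroscedastic)
set.seed(42)
sim_x<-runif(100,1,10)
sim_y<-2*sim_x+rnorm(100,sd=sim_x) # the error sd grows with x
plot(lm(sim_y~sim_x),which=3)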
Homoscedasticity can also be tested mathematically using the Breusch-Pagan Test in the “lmtest” package:
bptest(cars_model_1)
##
## studentized Breusch-Pagan test
##
## data: cars_model_1
## BP = 8.0855, df = 2, p-value = 0.01755
For the BP test, a significant result means the null hypothesis can be rejected, and here the null hypothesis is that the model is homoscedastic. The p-value is significant (less than 0.05), so we can reject homoscedasticity and conclude the model is heteroscedastic.
What does a homoscedastic model look like?
# here is the old hp~mpg model we used:
cars_model_2<-lm(hp~mpg,data=cars_data)
plot(cars_model_2,which=3)
The line is much straighter, much more homoscedastic.
This is borne out mathematically as well:
bptest(cars_model_2)
##
## studentized Breusch-Pagan test
##
## data: cars_model_2
## BP = 0.86843, df = 1, p-value = 0.3514
The p-value is 0.35, much larger than 0.05, which means that the null hypothesis of homoscedasticity can NOT be rejected, so homoscedasticity can be assumed.
We also want to make sure that the residuals are normally distributed. This is hard to visualize on its own, but we can use what is called a “Q-Q plot” to do so:
plot(cars_model_1,which=2)
For the residuals to be normally distributed, the observations would need to lie close to the dotted line. As you can see, there is significant deviation from the dotted line; in fact, the observations barely touch it.
Let’s look at cars_model_2 for a more normally distributed Q-Q plot:
plot(cars_model_2,which=2)
There is still some deviation from the dotted line, but it’s much less than in the previous plot. We can also run a Shapiro-Wilk test to check for normality of residuals:
shapiro.test(cars_model_2$residuals)
##
## Shapiro-Wilk normality test
##
## data: cars_model_2$residuals
## W = 0.89154, p-value = 0.003799
The p-value is below 0.05, which suggests that the residuals are significantly different from a normal distribution (i.e., they are not normal).
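For comparison, we could run the same test on the residuals of cars_model_1, which looked much worse on its Q-Q plot. A quick sketch (output omitted; try it and compare the p-values):
# Shapiro-Wilk test on the residuals of the first model, for comparison
shapiro.test(residuals(cars_model_1))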
The last assumption to check is multicollinearity: the predictors should not be too strongly correlated with one another. We can check this with the Variance Inflation Factor (VIF), using the big kitchen-sink model we ran last week:
kitchen_sink<-lm(hp~cyl+disp+mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)
vif(kitchen_sink)
## cyl disp mpg drat wt qsec vs am
## 14.912374 15.715476 7.296154 3.398500 16.339384 7.958123 4.600828 4.931842
## gear carb
## 5.344559 5.923559
Basically, any VIF over 10 means that the variable is strongly correlated with some other variable, and thus the assumption of “no multicollinearity” will be violated. Some sources say anything over “5” is problematic.
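If you just want to flag the offending variables, you can filter the VIF output directly. A quick sketch using the stricter cutoff of 5 (the sink_vifs name is just my own helper):
# keep only the VIF values above the stricter cutoff of 5
sink_vifs<-vif(kitchen_sink)
sink_vifs[sink_vifs>5]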
So, let’s say your data fails all of these assumptions. Is it time to give up?
Absolutely not!
Typically, data can be transformed so that it meets the assumptions of multiple regression. This somewhat changes the interpretation of the end-product, but the idea is basically the same.
Non-linear data can be adjusted using the log transformation.
cars_model_3<-lm(hp~mpg+am,data=cars_data)
cars_model_3_log<-lm(log(hp)~log(mpg)+am,data=cars_data)
raintest(cars_model_3)
##
## Rainbow test
##
## data: cars_model_3
## Rain = 3.0656, df1 = 16, df2 = 13, p-value = 0.02378
raintest(cars_model_3_log)
##
## Rainbow test
##
## data: cars_model_3_log
## Rain = 2.4838, df1 = 16, df2 = 13, p-value = 0.05225
Now we see that the log-transformed model (cars_model_3_log) has a Rainbow test p-value of 0.052 (just barely linear), whereas the untransformed cars_model_3 fails the test outright (p = 0.024).
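As mentioned above, the interpretation changes slightly: because both hp and mpg are logged, the log(mpg) coefficient is read (approximately) as a percent-for-percent change, i.e., a 1% increase in mpg is associated with roughly that many percent change in hp. A quick sketch of pulling the coefficients out (output omitted):
# coefficients of the log-log model; the log(mpg) slope is roughly an elasticity
coef(cars_model_3_log)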
Instead of an “lm”, you can run an “rlm” from the “MASS” package. This creates a “robust” linear model where the independence-of-errors assumption is not as important:
cars_model_2_robust<-rlm(hp~mpg,data=cars_data)
summary(cars_model_2)
##
## Call:
## lm(formula = hp ~ mpg, data = cars_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.26 -28.93 -13.45 25.65 143.36
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 324.08 27.43 11.813 8.25e-13 ***
## mpg -8.83 1.31 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.95 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
summary(cars_model_2_robust)
##
## Call: rlm(formula = hp ~ mpg, data = cars_data)
## Residuals:
## Min 1Q Median 3Q Max
## -53.902 -25.432 -6.355 26.480 150.425
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 308.7988 24.9851 12.3593
## mpg -8.2816 1.1927 -6.9435
##
## Residual standard error: 38.78 on 30 degrees of freedom
The main issue is that robust linear models do not report p-values or R-squared, so it’s best to run a regular regression and a robust regression together, use the regular one to get a sense of which variables are significant, and report both.
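If you do want rough p-values for the robust model, one common workaround (my own sketch, not part of the MASS output) is to treat the reported t values as approximately normal:
# approximate two-sided p-values from the rlm t values (normal approximation)
robust_coefs<-summary(cars_model_2_robust)$coefficients
2*pnorm(-abs(robust_coefs[,"t value"]))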
Similar to the issues with linearity above, a log transformation of the dependent variable (and, here, of mpg as well) can induce homoscedasticity:
bptest(cars_model_3)
##
## studentized Breusch-Pagan test
##
## data: cars_model_3
## BP = 10.251, df = 2, p-value = 0.005943
bptest(cars_model_3_log)
##
## studentized Breusch-Pagan test
##
## data: cars_model_3_log
## BP = 3.7762, df = 2, p-value = 0.1514
(the BP test for the log model is non-significant, meaning that the log model is now homoscedastic)
If the residuals of a model are non-normal, transforming the variables (typically with a log transformation, as above) can help:
plot(cars_model_3,which=2)
plot(cars_model_3_log,which=2)
We saw earlier that a VIF greater than 5 or 10 meant there was some multicollinearity:
vif(kitchen_sink)
## cyl disp mpg drat wt qsec vs am
## 14.912374 15.715476 7.296154 3.398500 16.339384 7.958123 4.600828 4.931842
## gear carb
## 5.344559 5.923559
It’s acceptable to remove one variable that is strongly correlated with another. But how do we know which one to remove?
kitchen_sink_vars<-cars_data %>% dplyr::select(cyl,disp,mpg,drat,wt,qsec,vs,am,gear,carb)
cor(kitchen_sink_vars)
## cyl disp mpg drat wt qsec
## cyl 1.0000000 0.9020329 -0.8521620 -0.69993811 0.7824958 -0.59124207
## disp 0.9020329 1.0000000 -0.8475514 -0.71021393 0.8879799 -0.43369788
## mpg -0.8521620 -0.8475514 1.0000000 0.68117191 -0.8676594 0.41868403
## drat -0.6999381 -0.7102139 0.6811719 1.00000000 -0.7124406 0.09120476
## wt 0.7824958 0.8879799 -0.8676594 -0.71244065 1.0000000 -0.17471588
## qsec -0.5912421 -0.4336979 0.4186840 0.09120476 -0.1747159 1.00000000
## vs -0.8108118 -0.7104159 0.6640389 0.44027846 -0.5549157 0.74453544
## am -0.5226070 -0.5912270 0.5998324 0.71271113 -0.6924953 -0.22986086
## gear -0.4926866 -0.5555692 0.4802848 0.69961013 -0.5832870 -0.21268223
## carb 0.5269883 0.3949769 -0.5509251 -0.09078980 0.4276059 -0.65624923
## vs am gear carb
## cyl -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.7104159 -0.59122704 -0.5555692 0.39497686
## mpg 0.6640389 0.59983243 0.4802848 -0.55092507
## drat 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 1.0000000 0.16834512 0.2060233 -0.56960714
## am 0.1683451 1.00000000 0.7940588 0.05753435
## gear 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.5696071 0.05753435 0.2740728 1.00000000
According to vif(), cyl, disp, and wt are most strongly correlated with other variables, and the cor() chart shows that they are strongly correlated with basically everything, including each other. This makes sense, as “cyl”, “disp”, and “wt” are all basically measurements of the car’s size, just stated in different ways. Consequently, it wouldn’t hurt to decide which measurement of car size is the most important to your analysis, and keep that one.
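If the correlation matrix is too big to eyeball, you can also pull out the strongly correlated pairs programmatically. Here is a rough sketch (my own helper code, using an absolute correlation of 0.8 as the cutoff):
# list the variable pairs whose absolute correlation exceeds 0.8
cor_mat<-cor(kitchen_sink_vars)
high_pairs<-which(abs(cor_mat)>0.8 & upper.tri(cor_mat),arr.ind=TRUE)
data.frame(var1=rownames(cor_mat)[high_pairs[,1]],var2=colnames(cor_mat)[high_pairs[,2]],correlation=round(cor_mat[high_pairs],2))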
In this case, I think “wt” is probably the best measurement of car size; it’s more direct than cyl or disp. So, let’s drop cyl and disp:
kitchen_sink_altered<-lm(hp~mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)
vif(kitchen_sink_altered)
## mpg drat wt qsec vs am gear carb
## 7.259608 3.106846 6.770138 4.650877 4.192360 4.578935 4.737443 4.125290
As you can see, the VIF for “wt” went from around 16 to about 7 – much more manageable than before. It’s still kind of high but that’s because “wt” is correlated with many things in a car.
So, in the end, log() transformations are basically the main “tool” for dealing with these problems.
What is logistic regression and why is it valuable?
Basically: logistic regression is similar to “regular” regression, but instead of a continuous outcome (like price, or mpg) the dependent variable is categorical (it’s one or zero).
Who cares? Well, this is actually one of the most powerful tools that is used day-to-day by organizations. If you think about it, many important things can be represented by a one or zero.
Does the patient live or die? (1 = live, 0 = die)
Is this bank transaction fraudulent? (1 = fraud, 0 = legitimate)
Do clients of this non-profit find housing? (1 = yes, 0 = no)
And so forth.
This is a little different than regular regression because the dependent variable is no longer continuous.
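To see why ordinary lm() is a poor fit for a 0/1 outcome, here is a quick illustration (my own sketch, not part of the lesson): a straight-line model of am can produce fitted “probabilities” outside the 0-to-1 range, which makes no sense for a yes/no outcome. The glm() call below avoids this by modeling the log-odds instead.
# fit am with plain lm() and check the range of its fitted values
am_linear<-lm(am~mpg+wt,data=cars_data)
range(fitted(am_linear)) # anything below 0 or above 1 is a nonsensical "probability"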
am_logistic<-glm(am~mpg+wt,data=cars_data,family="binomial")
summary(am_logistic)
##
## Call:
## glm(formula = am ~ mpg + wt, family = "binomial", data = cars_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 25.8866 12.1935 2.123 0.0338 *
## mpg -0.3242 0.2395 -1.354 0.1759
## wt -6.4162 2.5466 -2.519 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 17.184 on 29 degrees of freedom
## AIC: 23.184
##
## Number of Fisher Scoring iterations: 7
# we select "wt" from the model coefficients
wt_odds<-coef(am_logistic)[3]
# print wt coefficient
print(wt_odds)
## wt
## -6.416172
# exponentiate the coefficient to get the odds ratio per unit of wt
exp(wt_odds)
## wt
## 0.001634903
What does this mean? For every one-unit increase in “wt” (i.e., an extra 1,000 lbs of weight), the odds of am = 1 (a manual transmission) are multiplied by about 0.0016; in other words, the odds drop by roughly 99.8%. That is both statistically significant and a very large effect in practical terms.
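To read all of the coefficients on the odds scale at once, and get confidence intervals for them, you can exponentiate the whole coefficient vector. A quick sketch (output omitted):
# all coefficients converted to odds ratios
exp(coef(am_logistic))
# profile-likelihood confidence intervals, also on the odds scale
exp(confint(am_logistic))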
We will discuss this more next week.