library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
cars_data<-mtcars #copy the built-in mtcars dataset into a new object
Let’s review the assumptions of linear regression from last time. For the regression results to be trustworthy, all of the following must hold:
The overall relationship between the predictors and the outcome should be linear (a straight line)
cars_model_1<-lm(hp~wt+carb,data=cars_data)
plot(cars_model_1,which=1)
#note: the which=1 brings up the Residuals vs. Fitted graph. Without which=1, there are four graphs that come up.
The fact that the points are scattered more or less evenly around the straight line in the center (and the fitted line is mostly straight) means that the relationship between these variables is pretty linear: as WT and CARB go up, HP either goes up or down in a linear path. In this case, it goes up.
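By the way, if you want to see all four default diagnostic plots at once (instead of picking them out with which=), one quick sketch is to set up a 2x2 plotting grid first:
par(mfrow=c(2,2)) #arrange the plotting window as a 2x2 grid
plot(cars_model_1) #draws the four default diagnostic plots in one view
par(mfrow=c(1,1)) #reset the plotting window afterwards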
We can also do a mathematical test (a “rainbow” test) to see if the variables have a linear relationship.
raintest(cars_model_1)
##
## Rainbow test
##
## data: cars_model_1
## Rain = 2.4129, df1 = 16, df2 = 13, p-value = 0.05781
In the case of the Rainbow test, a low p-value means that the data is NOT linear. As we can see, this model is just barely linear (p = 0.058). Also, the Rainbow test assumes homoskedasticity, so keep an eye out for that (we’ll talk about it soon).
The null hypothesis is that the data is linear. If the p-value is greater than .05, the data is likely linear. (If the p-value is smaller than .05, you can do things like log transformations, which we’ll cover below.) Importantly, the Rainbow test relies on a p-value, which assumes normally distributed errors.
Erik’s advice: don’t sweat it too much. You want to address it in the research, but there are many different ways to deal with the data if it is non-linear.
The next assumption is independence of errors. Example: imagine throwing darts at a dartboard. You are aiming at the middle, but all the darts you throw end up in the bottom right. If the errors were independent, you would expect them to be randomly distributed, landing all over the board. Instead, there may be something systematically influencing how the darts are being thrown: the errors are not independent.
Thankfully there is a pretty simple way to test for independence of errors using the "car" package (unrelated to mtcars or even "cars", sorry about that). The package contains durbinWatsonTest() (the Durbin-Watson test), which lets us test the model directly:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
durbinWatsonTest(cars_model_1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1464524 1.652833 0.258
## Alternative hypothesis: rho != 0
A p-value greater than 0.05 means that the errors ARE independent. In this case, the errors are independent (p-value = 0.258). Also, the closer the D-W statistic is to 2, the better. (Long story, just go with it.)
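As a side note, the lmtest package we loaded at the top also has a dwtest() function that computes the same Durbin-Watson statistic, if you'd rather skip loading another package. A quick sketch (the p-value may differ slightly, since the two functions calculate it differently):
dwtest(cars_model_1)
#the null hypothesis is the same: no autocorrelation in the errors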
The next assumption is homoscedasticity: the spread of the residuals should stay roughly constant across the fitted values. Thankfully this can be checked very simply with the Scale-Location "plot" command:
plot(cars_model_1,which=3)
You’ll notice this is very similar to the Residuals vs. Fitted plot, except that it plots the fitted values against the square root of the standardized residuals. The interpretation is the same: the red line should be almost totally straight, and the points should cluster evenly around it. As you can see, the red line here is very curvy, so this model likely violates the assumption of homoscedasticity.
Study tip: “Homo” means “same” so “homoscedasticity” means that the “scedasticity” is the same across the plot.
This can also be tested mathematically using the Breusch-Pagan Test in the “lmtest” package:
bptest(cars_model_1)
##
## studentized Breusch-Pagan test
##
## data: cars_model_1
## BP = 8.0855, df = 2, p-value = 0.01755
For the BP test, the null hypothesis is that the model is homoscedastic, and a significant result means that null can be rejected. In this case, the p-value is significant (less than 0.05), so we reject homoscedasticity and assume the model is heteroscedastic.
What does a homoscedastic model look like?
# here is the old hp~mpg model we used:
cars_model_2<-lm(hp~mpg,data=cars_data)
plot(cars_model_2,which=3)
The line is much straighter, much more homoscedastic.
This is borne out mathematically as well:
bptest(cars_model_2)
##
## studentized Breusch-Pagan test
##
## data: cars_model_2
## BP = 0.86843, df = 1, p-value = 0.3514
The p-value is 0.35, much larger than 0.05, which means that the null hypothesis of homoscedasticity can NOT be rejected, so we can assume the model is homoscedastic.
Note: You don’t have to put these images in the paper if you don’t want to; you can just write out results (example: “the model met this assumption but not this one; here is what I did to mitigate that”)
We also want to make sure that the residuals are normally distributed. This is hard to visualize on its own, but we can use what is called a "Q-Q plot" to do so:
plot(cars_model_1,which=2)
For residuals to be normally distributed, all of the observations would need to lie along the dotted line. As you can see, there is significant deviation from the dotted line. In fact, the observations barely touch the dotted line.
Let’s look at cars_model_2 for a more normally distributed Q-Q plot:
plot(cars_model_2,which=2)
There is still some deviation from the dotted line, but it’s much less than in the previous plot; this is somewhat more normal. We can also run a Shapiro-Wilk test to check for normality of residuals:
shapiro.test(cars_model_2$residuals)
##
## Shapiro-Wilk normality test
##
## data: cars_model_2$residuals
## W = 0.89154, p-value = 0.003799
The p-value is actually below 0.05, which suggests that the residuals are significantly different from a normal distribution (i.e., they are not normal). The Shapiro-Wilk test’s null hypothesis is that the data IS normal, so a small p-value means we reject normality.
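For comparison, you could run the same test on the residuals of the first model (a quick sketch; run it yourself to see how it comes out):
shapiro.test(residuals(cars_model_1))
#again, the null hypothesis is normality, so a p-value below 0.05 would mean non-normal residuals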
The last assumption is "no multicollinearity": the predictors shouldn’t be too strongly correlated with each other. Here’s the big huge kitchen sink model we ran last week, checked with VIF (variance inflation factor):
kitchen_sink<-lm(hp~cyl+disp+mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)
vif(kitchen_sink)
## cyl disp mpg drat wt qsec vs am
## 14.912374 15.715476 7.296154 3.398500 16.339384 7.958123 4.600828 4.931842
## gear carb
## 5.344559 5.923559
Basically, any VIF over 10 means that the variable is strongly correlated with some other variable, and thus the assumption of “no multicollinearity” will be violated. Some sources say anything over “5” is problematic. Erik said to go easy on ourselves– it’s 10 for our class!
VIF does not tell you which variables each one is strongly correlated with, but if you have two variables over 10, it’s likely they are correlated with each other. What do we do? You can take one variable out, re-run vif(), and see if it improves. Erik recommends taking them out one at a time.
new_kitchen_sink<-lm(hp~cyl+mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)
vif(new_kitchen_sink)
## cyl mpg drat wt qsec vs am gear
## 12.084679 7.272244 3.379276 7.221269 6.651531 4.587613 4.930532 5.235139
## carb
## 4.267101
We took disp out, but cyl is still a problem.
new_kitchen_sink2<-lm(hp~disp+mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)
vif(new_kitchen_sink2)
## disp mpg drat wt qsec vs am gear
## 12.735496 7.267753 3.117346 15.834900 7.356650 4.221355 4.663491 4.737532
## carb
## 5.114559
Taking out cyl makes disp and wt go over 10. We may need to take all 3 out!
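As a sketch of that idea (the model name here is just for illustration), dropping cyl, disp, and wt all at once and re-checking would look like this:
kitchen_sink_smallest<-lm(hp~mpg+drat+qsec+vs+am+gear+carb,data=cars_data)
vif(kitchen_sink_smallest)
#if cyl, disp, and wt are all really measuring "car size," removing all three should bring the remaining VIFs down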
So, let’s say your data fails all of these assumptions. Is it time to give up?
Absolutely not!
Typically, data can be transformed so that it meets the assumptions of multiple regression. This somewhat changes the interpretation of the end-product, but the idea is basically the same.
Spoiler alert: transformation is almost always the answer. Unfortunately, interpretation of the results can become a bit more tricky: instead of interpreting ‘mpg,’ you might be interpreting log(mpg) or sqrt(mpg).
Non-linear data can be adjusted using the log transformation.
cars_model_3<-lm(hp~mpg+am,data=cars_data)
cars_model_3_log<-lm(log(hp)~log(mpg)+am,data=cars_data)
#log doesn't work with negative numbers or zero-- use square root if you have these
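For illustration (this sqrt model is just a sketch and isn’t used below), a square-root version of the same model would look like:
cars_model_3_sqrt<-lm(sqrt(hp)~sqrt(mpg)+am,data=cars_data)
#you could then check it the same way, e.g. raintest(cars_model_3_sqrt)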
raintest(cars_model_3)
##
## Rainbow test
##
## data: cars_model_3
## Rain = 3.0656, df1 = 16, df2 = 13, p-value = 0.02378
raintest(cars_model_3_log)
##
## Rainbow test
##
## data: cars_model_3_log
## Rain = 2.4838, df1 = 16, df2 = 13, p-value = 0.05225
Now we see that the log-transformed model (cars_model_3_log) has a raintest p-value of 0.052 (barely linear), compared to 0.024 for the untransformed cars_model_3.
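A quick note on interpreting the transformed model (just a sketch of how to read the coefficients):
coef(cars_model_3_log)
#in a log-log model, the coefficient on log(mpg) is read (roughly) as a percent change:
#a 1% increase in mpg is associated with about that-many-percent change in hp
#am is untransformed, so its coefficient is the change in log(hp) when am goes from 0 to 1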
Instead of an lm(), you can run an rlm() from the "MASS" package. This creates a "robust" linear model, where independence of errors is not as important:
cars_model_2_robust<-rlm(hp~mpg,data=cars_data)
summary(cars_model_2)
##
## Call:
## lm(formula = hp ~ mpg, data = cars_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.26 -28.93 -13.45 25.65 143.36
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 324.08 27.43 11.813 8.25e-13 ***
## mpg -8.83 1.31 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.95 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
summary(cars_model_2_robust)
##
## Call: rlm(formula = hp ~ mpg, data = cars_data)
## Residuals:
## Min 1Q Median 3Q Max
## -53.902 -25.432 -6.355 26.480 150.425
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 308.7988 24.9851 12.3593
## mpg -8.2816 1.1927 -6.9435
##
## Residual standard error: 38.78 on 30 degrees of freedom
The main issue is that robust linear models do not report p-values or R-squared, so it’s best to run a regular regression and a robust regression together: use the regular one to get a sense of which variables are significant, report both, and see if the estimates differ much. In this example, -8.83 and -8.28 are pretty close, so even if the data violated independence of errors, perhaps the violation doesn’t matter that much. You can report something like this in your paper.
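One easy way to line the two sets of estimates up side by side (a small sketch using the models above):
cbind(lm_estimate=coef(cars_model_2),rlm_estimate=coef(cars_model_2_robust))
#binds the coefficients from the regular and robust models into one table for comparison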
Similar to the linearity issues above, a log transformation of the variables can also induce homoscedasticity:
bptest(cars_model_3)
##
## studentized Breusch-Pagan test
##
## data: cars_model_3
## BP = 10.251, df = 2, p-value = 0.005943
bptest(cars_model_3_log)
##
## studentized Breusch-Pagan test
##
## data: cars_model_3_log
## BP = 3.7762, df = 2, p-value = 0.1514
(The BP test on the log model is non-significant, meaning that it is now homoscedastic.)
If the residuals of a model are non-normal, transforming the variables (typically with a log transformation, as above) can fix that too:
plot(cars_model_3,which=2)
plot(cars_model_3_log,which=2)
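You can also confirm this numerically with a Shapiro-Wilk test on the transformed model (a sketch; run it to see whether the p-value clears 0.05):
shapiro.test(residuals(cars_model_3_log))
#the null hypothesis is that the residuals are normal, so a p-value above 0.05 supports normality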
We saw earlier that a VIF greater than 5 or 10 meant there was some multicollinearity:
vif(kitchen_sink)
## cyl disp mpg drat wt qsec vs am
## 14.912374 15.715476 7.296154 3.398500 16.339384 7.958123 4.600828 4.931842
## gear carb
## 5.344559 5.923559
You’re allowed to remove a variable that is strongly correlated with another. But how do we know which one to remove?
kitchen_sink_vars<-cars_data %>% dplyr::select(cyl,disp,mpg,drat,wt,qsec,vs,am,gear,carb)
#The dplyr:: prefix is used because different packages in R sometimes have functions with the same name (here, MASS and dplyr both have select()). If you load two such packages, you have to specify which one you want R to use, or you'll get errors. dplyr:: does this.
cor(kitchen_sink_vars)
## cyl disp mpg drat wt qsec
## cyl 1.0000000 0.9020329 -0.8521620 -0.69993811 0.7824958 -0.59124207
## disp 0.9020329 1.0000000 -0.8475514 -0.71021393 0.8879799 -0.43369788
## mpg -0.8521620 -0.8475514 1.0000000 0.68117191 -0.8676594 0.41868403
## drat -0.6999381 -0.7102139 0.6811719 1.00000000 -0.7124406 0.09120476
## wt 0.7824958 0.8879799 -0.8676594 -0.71244065 1.0000000 -0.17471588
## qsec -0.5912421 -0.4336979 0.4186840 0.09120476 -0.1747159 1.00000000
## vs -0.8108118 -0.7104159 0.6640389 0.44027846 -0.5549157 0.74453544
## am -0.5226070 -0.5912270 0.5998324 0.71271113 -0.6924953 -0.22986086
## gear -0.4926866 -0.5555692 0.4802848 0.69961013 -0.5832870 -0.21268223
## carb 0.5269883 0.3949769 -0.5509251 -0.09078980 0.4276059 -0.65624923
## vs am gear carb
## cyl -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.7104159 -0.59122704 -0.5555692 0.39497686
## mpg 0.6640389 0.59983243 0.4802848 -0.55092507
## drat 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 1.0000000 0.16834512 0.2060233 -0.56960714
## am 0.1683451 1.00000000 0.7940588 0.05753435
## gear 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.5696071 0.05753435 0.2740728 1.00000000
According to vif(), cyl, disp and wt are most strongly correlated with other variables, and we see in the cor() chart that they are strongly correlated with basically everything, including each other. This makes sense, as "cyl", "disp" and "wt" are basically measurements of the car’s size – just stated in different ways. Consequently, it wouldn’t hurt to decide which measurement of car size is the most important to your analysis, and keep that one.
In this case, I think "wt" is probably the best measurement of car size (it’s more direct than cyl or disp). So, let’s drop cyl and disp:
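If the matrix is hard to eyeball, here is a small sketch for pulling out the highly correlated pairs (the 0.8 cutoff is just an arbitrary choice for illustration):
cor_mat<-cor(kitchen_sink_vars)
cor_mat[upper.tri(cor_mat,diag=TRUE)]<-NA #blank out the upper triangle so each pair only shows up once
high_pairs<-which(abs(cor_mat)>0.8,arr.ind=TRUE) #find the row/column positions of the large correlations
data.frame(var1=rownames(cor_mat)[high_pairs[,1]],var2=colnames(cor_mat)[high_pairs[,2]],r=round(cor_mat[high_pairs],2))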
kitchen_sink_altered<-lm(hp~mpg+drat+wt+qsec+vs+am+gear+carb,data=cars_data)
vif(kitchen_sink_altered)
## mpg drat wt qsec vs am gear carb
## 7.259608 3.106846 6.770138 4.650877 4.192360 4.578935 4.737443 4.125290
As you can see, the VIF for “wt” went from around 16 to about 7 – much more manageable than before. It’s still kind of high but that’s because “wt” is correlated with many things in a car.
Stepping back: log() transformations are basically the main "tool" for dealing with these assumption problems.
What is logistic regression and why is it valuable?
Basically: logistic regression is similar to “regular” regression, but instead of a continuous outcome (like price, or mpg) the dependent variable is categorical (it’s one or zero).
Who cares? Well, this is actually one of the most powerful tools that is used day-to-day by organizations. If you think about it, many important things can be represented by a one or zero.
Does the patient live or die? (1 = live, 0 = die)
Is this bank transaction fraudulent? (1 = fraud, 0 = legitimate)
Do clients of this non-profit find housing? (1 = yes, 0 = no)
And so forth.
This is a little different from regular regression because the dependent variable is no longer continuous, so the result is no longer a straight line. Instead of predicting a number, the model predicts the probability that the outcome is a 1 rather than a 0.
The dependent variable is what we are trying to predict.
am_logistic<-glm(am~mpg+wt,data=cars_data,family="binomial")
#am is the transmission type in the cars data, where 0=automatic and 1=manual. We are trying to predict what kind of transmission a car has (am) from the mpg and weight
#glm is a generalized linear model; family="binomial" means the dependent variable is 1 or 0
summary(am_logistic)
##
## Call:
## glm(formula = am ~ mpg + wt, family = "binomial", data = cars_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 25.8866 12.1935 2.123 0.0338 *
## mpg -0.3242 0.2395 -1.354 0.1759
## wt -6.4162 2.5466 -2.519 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 17.184 on 29 degrees of freedom
## AIC: 23.184
##
## Number of Fisher Scoring iterations: 7
Weight is mildly significant (p = 0.012). The more a car weighs, the less likely it is to have a manual transmission.
#we select "wt" from the model coefficients
wt_odds<-coef(am_logistic)[3]
# print wt coefficient
print(wt_odds)
## wt
## -6.416172
# exponentiate the coefficient to get the odds ratio per unit of wt
exp(wt_odds)
## wt
## 0.001634903
What does this mean? For every one-unit increase in wt (in mtcars, one unit of wt is 1,000 lbs), the odds of the car having a manual transmission are multiplied by about 0.0016, i.e., they drop by roughly 99.8%. That is statistically significant and a big practical effect too (though keep in mind that a one-unit change in wt is a very large change in weight).
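If odds ratios feel abstract, you can also ask the model for predicted probabilities directly. A small sketch (the two example cars here are made up for illustration):
new_cars<-data.frame(mpg=c(30,15),wt=c(1.8,4.0)) #a light, efficient car and a heavy gas-guzzler
predict(am_logistic,newdata=new_cars,type="response") #predicted probability that am=1 (manual)
#the light car should come out close to 1 (almost certainly manual) and the heavy car close to 0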
We will discuss this more next week.