library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
Purpose: Compares the means of two groups to see if they are significantly different.
Interpretation: p-value < 0.05 - the group means are probably different
p-value > 0.05 - no strong evidence that the group means differ
Example:
# running a Welch two-sample t-test on mpg by transmission type (am) in mtcars
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
Result: A p-value of 0.0013 suggests the mean MPG differs between transmission types. The mean of group 0 (automatic transmission) is estimated at 17.15, and the mean of group 1 (manual transmission) at 24.39.
Assumptions: normal distribution, data is continuous, observations are independent, sampling is random
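A quick sketch (not part of the original notes) of checking the normality assumption within each transmission group before relying on the t-test:
# Shapiro-Wilk test of mpg within each transmission group
shapiro.test(mtcars$mpg[mtcars$am == 0])
shapiro.test(mtcars$mpg[mtcars$am == 1])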
Purpose: Tests for normality of a variable. Best used for small numbers of observations (<5,000).
Interpretation: p-value < 0.05 - data is probably not normal
p-value > 0.05 - data is probably normal
Example:
shapiro.test(mtcars$hp)
##
## Shapiro-Wilk normality test
##
## data: mtcars$hp
## W = 0.93342, p-value = 0.04881
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
Results: HP has a p-value of 0.048, so it’s probably non-normal. MPG has a p-value of 0.12, so it’s probably normal.
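A quick visual companion to these results (a sketch, not part of the original notes): points that hug the reference line in a Q-Q plot suggest normality.
# Q-Q plots for hp and mpg
qqnorm(mtcars$hp)
qqline(mtcars$hp)
qqnorm(mtcars$mpg)
qqline(mtcars$mpg)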
Assumptions: data is continuous, observations are independent, sample size is relatively small (the test is intended for fewer than ~5,000 observations)
Purpose: compares a variable with a known distribution (typically a normal distribution). Can detect if a variable is normal or not. Best used for large numbers of observations (5,000+).
Interpretation: p-value < 0.05 - variable is likely different from normal distribution
p-value > 0.05 - no strong evidence that the variable differs from a normal distribution
Example:
ks.test(diamonds$price,"pnorm")
## Warning in ks.test.default(diamonds$price, "pnorm"): ties should not be present
## for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: diamonds$price
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
Results: A p-value less than 2.2e-16 suggests the price of diamonds is not normally distributed. WARNING: the larger the dataset, the more likely K-S test will find deviations from normality. Use with caution.
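Note that with only "pnorm" supplied, the prices are compared against a standard normal distribution (mean 0, sd 1). A sketch of a more informative comparison against a normal with matching parameters (estimating the parameters from the same data makes the p-value only approximate):
# compare price against a normal with the sample mean and sd
ks.test(diamonds$price, "pnorm",
        mean = mean(diamonds$price), sd = sd(diamonds$price))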
Assumptions: data is continuous, observations are independent, large sample size
Purpose: Testing for heteroscedasticity (non-constant variance) in residuals of a regression model
Interpretation: p-value < 0.05 - model is probably heteroscedastic
p-value > 0.05 - model is probably homoscedastic
Example:
# first we create a model to test
cars_model<-lm(mpg~cyl,data=mtcars)
# then we test the model for homoscedasticity
bptest(cars_model)
##
## studentized Breusch-Pagan test
##
## data: cars_model
## BP = 5.4942, df = 1, p-value = 0.01908
Results: With a p-value of 0.019, which is less than 0.05, we can conclude the model is likely heteroscedastic.
Assumptions: Errors are normally distributed, no multicollinearity among predictors, model is homoscedastic
Purpose: Checks for correlation of residuals in a regression model
Interpretation: p-value < 0.05 - model likely contains correlation of residuals
p-value > 0.05 - model probably doesn't contain correlation of residuals
Example:
# first, let's create a model whose residuals show some correlation:
cars_correlated<-lm(hp~cyl+wt+disp,data=mtcars)
# now we can run a Durbin-Watson test on it:
durbinWatsonTest(cars_correlated)
## lag Autocorrelation D-W Statistic p-value
## 1 0.3326231 1.296869 0.026
## Alternative hypothesis: rho != 0
Results: with a p-value less than 0.05, the model likely contains some correlation of residuals.
Assumptions: Errors are normally distributed, homoscedastic, no autocorrelation among errors
Purpose: to determine if there is multicollinearity within a regression model.
Interpretation: VIF > 10 means that a variable is strongly correlated with another variable
VIF < 10 means that a variable is not strongly correlated with another variable
Example:
# let's create a model with several multicollinear variables to test VIF
vif_model<-lm(hp~mpg+disp+wt+cyl+qsec+carb,data=mtcars)
# run VIF on this model
vif(vif_model)
## mpg disp wt cyl qsec carb
## 6.451314 15.204973 15.143123 8.562826 5.285539 3.441238
Results: DISP and WT are strongly multicollinear (VIF > 10).
Assumptions: the model being checked is a linear regression with at least two predictors
Purpose: Tests for linearity within a regression model.
Interpretation: p-value < 0.05 - model is probably non-linear
p-value > 0.05 - model is likely linear
Example:
cars_model<-lm(mpg~cyl,data=mtcars)
raintest(cars_model)
##
## Rainbow test
##
## data: cars_model
## Rain = 0.53774, df1 = 16, df2 = 14, p-value = 0.8828
Results: The p-value of 0.88 is greater than 0.05, which means the model is likely linear.
Assumptions: model is linear, errors are normally distributed, homoscedastic, no autocorrelation in errors
The following assumptions must be met in order for a linear model to be accurate:
Definition: the relationship between independent variables and the dependent variable must be linear.
Example:
# this model is mostly linear
hp_carb<-lm(hp~wt+carb,data=mtcars)
plot(hp_carb,which=1)
# this model is not linear
mpg_carb<-lm(mpg~hp,data=mtcars)
plot(mpg_carb,which=1)
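To back up the plots numerically (a sketch using the Rainbow test described earlier; a higher p-value favors linearity):
raintest(hp_carb)
raintest(mpg_carb)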
Consequences: Non-linearity of variables in the model will result in inaccurate estimates.
Definition: The residuals of the model should be independent of each other; the value of one error should not be predictable from the others.
Example:
# the hp_carb model "lm(hp~wt+carb,data=mtcars)" - has mostly independent errors. The p-value is over 0.05 and the D-W Statistic is fairly close to "2":
durbinWatsonTest(hp_carb)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1464524 1.652833 0.21
## Alternative hypothesis: rho != 0
# the mpg_carb model "lm(mpg~hp,data=mtcars)" - has some errors that are not independent. The p-value is below 0.05 and the D-W Statistic is farther away from "2":
durbinWatsonTest(mpg_carb)
## lag Autocorrelation D-W Statistic p-value
## 1 0.428481 1.133807 0.012
## Alternative hypothesis: rho != 0
Consequences: correlated errors bias the standard errors of the estimates, making p-values and confidence intervals unreliable.
Definition: the variance of the residuals (errors) is the same across the independent variables
Example:
# the model "lm(hp~mpg,data=mtcars)" is fairly homoscedastic. The variance of the residuals is similar across the plot:
hp_mpg<-lm(hp~mpg,data=mtcars)
plot(hp_mpg,which=3)
# the null hypothesis of the Breusch-Pagan test is that the model IS homoscedastic. So, a p-value higher than 0.05 means we fail to reject it - the model is probably homoscedastic:
bptest(hp_mpg)
##
## studentized Breusch-Pagan test
##
## data: hp_mpg
## BP = 0.86843, df = 1, p-value = 0.3514
Consequences: if model is NOT homoscedastic, estimates and p-values will be inaccurate
Definition: Errors of the model should be normally distributed
Example:
# the model "lm(qsec~disp,data=mtcars)" has residuals/errors that are somewhat normally distributed - they mostly fall on the dotted line in a Q-Q plot:
qsec_lm<-lm(qsec~disp,data=mtcars)
plot(qsec_lm,which=2)
shapiro.test(qsec_lm$residuals)
##
## Shapiro-Wilk normality test
##
## data: qsec_lm$residuals
## W = 0.96176, p-value = 0.3066
# the linear model "lm(mpg~carb,data=mtcars)" has errors that are somewhat less normal:
plot(mpg_carb,which=2)
shapiro.test(mpg_carb$residuals)
##
## Shapiro-Wilk normality test
##
## data: mpg_carb$residuals
## W = 0.92337, p-value = 0.02568
Consequences: if residuals/errors are not normal, the p-values will be unreliable, along with confidence intervals.
Definition: independent variables should not be overly correlated with each other.
Example:
# in the model "lm(mpg~hp+disp+cyl+wt,data=mtcars)", disp (engine displacement) has a VIF over 10, meaning that it's strongly correlated with some other variable (probably CYL):
multicolinear_model<-lm(mpg~hp+disp+cyl+wt,data=mtcars)
vif(multicolinear_model)
## hp disp cyl wt
## 3.405983 10.373286 6.737707 4.848016
# after removing "disp", the model is no longer multicollinear:
less_colinear_model<-lm(mpg~hp+cyl+wt,data=mtcars)
vif(less_colinear_model)
## hp cyl wt
## 3.258481 4.757456 2.580486
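The pairwise correlations behind those VIF values can also be inspected directly (a quick complementary check, not part of the original notes):
# correlation matrix of the predictors from the multicollinear model
cor(mtcars[, c("hp", "disp", "cyl", "wt")])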
Consequences: A model with multicollinearity will have inaccurate p-values and standard errors
Purpose: To mitigate a violation of the linearity assumption in linear regression
Suggested Method: Log or Square-Root Transformation of one or more variables
Example:
# the model "mpg_carb" is non-linear, but when we log both variables (mpg and hp), the model becomes noticeably more linear:
plot(mpg_carb,which=1)
mpg_carb_log<-lm(log(mpg)~log(hp),data=mtcars)
plot(mpg_carb_log,which=1)
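A square-root transformation (the other option mentioned above) is a milder alternative; the variable choices here are illustrative, not from the original notes:
# square-root transform of both variables, then check the residual plot
mpg_carb_sqrt <- lm(sqrt(mpg) ~ sqrt(hp), data = mtcars)
plot(mpg_carb_sqrt, which = 1)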
REMINDER: a log transformation does not work with zeroes or negative numbers, and missing values will stay missing. A square-root transformation cannot handle missing data or negative numbers.
Purpose: to reduce correlation between errors in a model. Ironically, we can do this by (among other things) just changing the model to something more appropriate.
Example:
durbinWatsonTest(mpg_carb)
## lag Autocorrelation D-W Statistic p-value
## 1 0.428481 1.133807 0.01
## Alternative hypothesis: rho != 0
mpg_carb_changed<-rlm(mpg~carb+drat,data=mtcars)
durbinWatsonTest(mpg_carb_changed)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.2707792 2.502298 0.19
## Alternative hypothesis: rho != 0
REMINDER: correlation of errors suggests that the model just isn’t very good. Adding a lag indicator for time series data might work, but otherwise it means the variables may be inappropriate for a linear model.
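A minimal sketch of that lag idea, using an invented time-ordered data frame (ts_data and its columns are hypothetical, not from the original notes):
set.seed(1)
# hypothetical time-ordered data: y drifts over time, x is a noisy predictor
ts_data <- data.frame(y = cumsum(rnorm(100)), x = rnorm(100))
# add a lagged dependent variable and drop the first (now incomplete) row
ts_lagged <- ts_data %>% mutate(y_lag = lag(y)) %>% drop_na()
lag_model <- lm(y ~ x + y_lag, data = ts_lagged)
durbinWatsonTest(lag_model)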
Purpose: To reduce heteroscedasticity in a model and increase homoscedasticity (an assumption of linear regression).
Example: This can potentially be accomplished through a log transformation of the dependent variable:
hp_wt_carb<-lm(hp~wt+carb,data=mtcars)
plot(hp_wt_carb,which=3)
hp_wt_carb_log<-lm(log(hp)~wt+carb,data=mtcars)
plot(hp_wt_carb_log,which=3)
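To quantify the change rather than judging by eye (a quick check, not part of the original notes), the Breusch-Pagan test can be re-run on both models and the p-values compared:
bptest(hp_wt_carb)
bptest(hp_wt_carb_log)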
REMINDER: Some data may be so skewed that even a log or square-root transformation cannot induce homoscedasticity. Also, log transformations will not work on data with zeroes or negative values, and missing values will remain missing.
Purpose: If the residuals in a model are not normal, the assumption is not met. We can potentially transform the dependent variable to induce normality of residuals.
Example: logging the dependent variable (hp) takes the residuals from somewhat non-normal to fairly normal.
cars_model_2<-lm(hp~mpg,data=mtcars)
shapiro.test(cars_model_2$residuals)
##
## Shapiro-Wilk normality test
##
## data: cars_model_2$residuals
## W = 0.89154, p-value = 0.003799
cars_model_3<-lm(log(hp)~mpg,data=mtcars)
shapiro.test(cars_model_3$residuals)
##
## Shapiro-Wilk normality test
##
## data: cars_model_3$residuals
## W = 0.9453, p-value = 0.1059
plot(cars_model_2,which=2)
plot(cars_model_3,which=2)
REMINDER: For large datasets, the central limit theorem often mitigates non-normality of residuals on its own.
Purpose: If variables are highly correlated with one another, that violates an assumption of the linear model.
Example: CYL, DISP and WT are strongly correlated with one another (VIF > 10). By removing two of these variables, multicollinearity will be reduced (see the follow-up check after the output below):
kitchen_sink<-lm(hp~cyl+disp+mpg+drat+wt+qsec+vs+am+gear+carb,data=mtcars)
vif(kitchen_sink)
## cyl disp mpg drat wt qsec vs am
## 14.912374 15.715476 7.296154 3.398500 16.339384 7.958123 4.600828 4.931842
## gear carb
## 5.344559 5.923559
#WT and DISP removed:
kitchen_sink_2<-lm(hp~cyl+mpg+drat+qsec+vs+am+gear+carb,data=mtcars)
vif(kitchen_sink_2)
## cyl mpg drat qsec vs am gear carb
## 11.329719 5.670006 3.365246 5.165514 4.438732 4.923775 5.188293 3.873564
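Note that cyl still shows a VIF just above 10 in the reduced model, so a further step (a sketch, not part of the original notes) could drop it as well and re-check:
# drop cyl too and re-run VIF
kitchen_sink_3 <- lm(hp ~ mpg + drat + qsec + vs + am + gear + carb, data = mtcars)
vif(kitchen_sink_3)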
There are more types of GLM’s than just the following, but these are the types you will most likely encounter in this program:
Dependent variable - Independent variable - Method
Continuous - Continuous or categorical - Gaussian (family = "gaussian")
Binary - Categorical or continuous - Binomial (family = "binomial")
Counts (integers) - Categorical or continuous - Poisson (family = "poisson")
Continuous (skewed, positive) - Categorical or continuous - Gamma (family = "Gamma")
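As a quick illustration of the table above (the variable choices are the editor's, not from the original notes), each family can be fit with glm() on the datasets already loaded:
# continuous outcome
gaussian_glm <- glm(mpg ~ wt, data = mtcars, family = gaussian)
# binary outcome
binomial_glm <- glm(am ~ hp, data = mtcars, family = binomial)
# count outcome
poisson_glm <- glm(carb ~ hp, data = mtcars, family = poisson)
# skewed, positive outcome
gamma_glm <- glm(price ~ carat, data = diamonds, family = Gamma(link = log))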
Each of the families can also have “links” specified to the data:
“Identity” - for linear relationships
“Log” - when dependent variable is positive or when estimating rates
“Logit” - when the dependent variable is binary
“Inverse” - for skewed data
“Inverse-Square” - skewed data with outliers
Don't worry about the links too much in this academic program; each family has its own default link, which will probably work just fine.
Remember these types of variables as well:
Continuous - Can be written on a number line (1.1,1.2,1.3,1.4). Examples: height in cm (160.5 cm, etc.)
Binary - Either it happened, or it didn’t (1, 0). Examples: patient lived after surgery (0) or died (1)
Counts - Integers you can “count” on your fingers (1,2,3). Examples: there were 4 car crashes yesterday
Continuous (skewed - Gamma) - Same as continuous, but positive and with values bunched toward one end of the range. Example: graduation rates
Categorical - different “categories” the data can be sorted into. Example: OJ vs. Vitamin C (from ToothGrowth)
A GLM summary does not report an overall p-value for the model the way lm() does (even though the individual coefficients have p-values). However, it's possible to estimate a pseudo p-value (a rough estimate of the model's significance) using the differences between the null and residual deviance, and between the null and residual degrees of freedom.
If the null deviance minus the residual deviance is substantially greater than the null DF minus the residual DF, the model is probably significant:
chick<-ChickWeight
chick_glm<-glm(weight~Time+Diet,data=chick,family = Gamma(link=log))
summary(chick_glm)
##
## Call:
## glm(formula = weight ~ Time + Diet, family = Gamma(link = log),
## data = chick)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.68330 0.02015 182.801 < 2e-16 ***
## Time 0.07991 0.00133 60.092 < 2e-16 ***
## Diet2 0.12122 0.02450 4.949 9.84e-07 ***
## Diet3 0.23526 0.02450 9.604 < 2e-16 ***
## Diet4 0.22651 0.02463 9.197 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.0465709)
##
## Null deviance: 193.022 on 577 degrees of freedom
## Residual deviance: 28.338 on 573 degrees of freedom
## AIC: 5272.8
##
## Number of Fisher Scoring iterations: 5
193 - 28 = 165
577 - 573 = 4
165 > 4
Model is probably significant!
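A more formal version of this check (a sketch, not part of the original notes) compares the fitted model against an intercept-only model; test = "F" accounts for the estimated Gamma dispersion:
# intercept-only model on the same data and family
chick_null <- glm(weight ~ 1, data = chick, family = Gamma(link = log))
# deviance-based comparison of the two models
anova(chick_null, chick_glm, test = "F")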