library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

Cheat Sheets for Statistical Tests

T-Test

Purpose: compares means of two groups to see if they are significantly different

Interpretation: p-value < 0.05 - group means are probably different
                p-value > 0.05 - no strong evidence the group means differ

Example:

#running T-test on MPG vs. Transmission type (AM) in MTCARS

t.test(mpg~am,data=mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

Result: a p-value of 0.0013 means the mean MPG likely differs between transmission types. The mean in group 0 (automatic transmission) is estimated at 17.15 MPG; the mean in group 1 (manual transmission) is estimated at 24.39 MPG.

Assumptions: data in each group are approximately normally distributed, data is continuous, observations are independent, sampling is random
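
One quick way to check the normality assumption is to run a normality test within each group (see the Shapiro-Wilk section below). A minimal sketch using base R's by():

# apply shapiro.test to mpg separately for each transmission group
# (am = 0 and am = 1)
by(mtcars$mpg, mtcars$am, shapiro.test)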

Shapiro-Wilk Test

Purpose: Tests for normality of a variable. Best used for small numbers of observations (<5,000).

Interpretation: p-value < 0.05 - data is probably not normal
                
                p-value > 0.05 - data is probably normal

Example:

shapiro.test(mtcars$hp)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$hp
## W = 0.93342, p-value = 0.04881
shapiro.test(mtcars$mpg)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

Results: HP has a p-value of 0.048, so it’s probably non-normal. MPG has a p-value of 0.12, so it’s probably normal.

Assumptions: data is continuous, observations are independent, sample size is small

K-S Test

Purpose: compares a variable's distribution against a reference distribution (typically the normal) to detect departures from it. Best used for large numbers of observations (5,000+).

Interpretation: p-value < 0.05 - variable likely differs from the reference distribution
                
                p-value > 0.05 - variable is consistent with the reference distribution

Example:

ks.test(diamonds$price,"pnorm")
## Warning in ks.test.default(diamonds$price, "pnorm"): ties should not be present
## for the one-sample Kolmogorov-Smirnov test
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  diamonds$price
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

Results: A p-value below 2.2e-16 suggests the price of diamonds is not normally distributed - though note that with no mean or sd supplied, ks.test() compared the prices against a standard normal (see the note below). WARNING: the larger the dataset, the more likely the K-S test is to flag even trivial deviations from normality. Use with caution.

Assumptions: data is continuous, observations are independent, large sample size
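
Note: ks.test(x, "pnorm") with no further arguments tests against a standard normal (mean 0, sd 1), which is part of why D = 1 above. A fairer sketch standardizes the variable first (strictly speaking, estimating the parameters from the same data makes the resulting p-value approximate):

# standardize price so the reference is a normal with the variable's
# own mean and standard deviation
ks.test(as.numeric(scale(diamonds$price)), "pnorm")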

Breusch-Pagan Test

Purpose: Testing for heteroscedasticity (non-constant variance) in residuals of a regression model

Interpretation: p-value < 0.05 - model is likely heteroscedastic
                
                p-value > 0.05 - model is probably homoscedastic

Example:

# first we create a model to test
cars_model<-lm(mpg~cyl,data=mtcars)

# then we test the model for homoscedasticity

bptest(cars_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  cars_model
## BP = 5.4942, df = 1, p-value = 0.01908

Results: With a p-value of 0.019, which is less than 0.05, we can conclude the model is likely heteroscedastic.

Assumptions: errors are normally distributed (the studentized version used by bptest() relaxes this), no multicollinearity among predictors; note that homoscedasticity is the null hypothesis being tested, not an assumption

Durbin-Watson Test

Purpose: Checks for autocorrelation (serial correlation) of residuals in a regression model

Interpretation: p-value < 0.05 - model likely contains autocorrelated residuals
                
                p-value > 0.05 - model probably does not contain autocorrelated residuals

Example:

# first, let's create a model likely to show correlated residuals:

cars_correlated<-lm(hp~cyl+wt+disp,data=mtcars)

# now we can run a Durbin-Watson test on it:

durbinWatsonTest(cars_correlated)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.3326231      1.296869   0.026
##  Alternative hypothesis: rho != 0

Results: with a p-value less than 0.05, the model likely contains some correlation of residuals.

Assumptions: errors are normally distributed and homoscedastic; note that no autocorrelation is the null hypothesis being tested, not an assumption

VIF (Variance Inflation Factor)

Purpose: to determine if there is multicollinearity within a regression model.

Interpretation: VIF > 10 means that a variable is strongly correlated with another variable
                VIF < 10 means that a variable is not strongly correlated with another variable

Example:

# let's create a model with several multicollinear variables to test VIF

vif_model<-lm(hp~mpg+disp+wt+cyl+qsec+carb,data=mtcars)

# run VIF on this model

vif(vif_model)
##       mpg      disp        wt       cyl      qsec      carb 
##  6.451314 15.204973 15.143123  8.562826  5.285539  3.441238

Results: DISP and WT are strongly multicollinear (VIF > 10)

Assumptions: none as such - VIF is a diagnostic computed from a fitted linear model, not a hypothesis test

Rainbow Test

Purpose: Tests for linearity within a regression model.

Interpretation: p-value < 0.05 - model is likely non-linear
                p-value > 0.05 - model is likely linear

Example:

cars_model<-lm(mpg~cyl,data=mtcars)
raintest(cars_model)
## 
##  Rainbow test
## 
## data:  cars_model
## Rain = 0.53774, df1 = 16, df2 = 14, p-value = 0.8828

Results: The p-value of 0.88 is greater than 0.05, which means the model is likely linear.

Assumptions: model is linear, errors are normally distributed, homoscedastic, no autocorrelation in errors

LINEAR MODEL CHEAT SHEET

The following assumptions must be met in order for a linear model to be accurate:

Linearity

Definition: the relationship between independent variables and the dependent variable must be linear.

Example:

# this model is mostly linear

hp_carb<-lm(hp~wt+carb,data=mtcars)
plot(hp_carb,which=1)

# this model is not linear

mpg_carb<-lm(mpg~hp,data=mtcars)
plot(mpg_carb,which=1)

Consequences: Non-linearity of variables in the model will result in inaccurate estimates.

Independence of Errors

Definition: The residuals of the model should be independent of each other; in other words, one residual should not help predict the next.

Example:

# the hp_carb model "lm(hp~wt+carb,data=mtcars)" - has mostly independent errors. The p-value is over 0.05 and the D-W Statistic is fairly close to "2":

durbinWatsonTest(hp_carb)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1464524      1.652833    0.21
##  Alternative hypothesis: rho != 0
# the mpg_carb model "lm(mpg~hp,data=mtcars)" - has some errors that are not independent. The p-value is below 0.05 and the D-W Statistic is farther away from "2":

durbinWatsonTest(mpg_carb)
##  lag Autocorrelation D-W Statistic p-value
##    1        0.428481      1.133807   0.012
##  Alternative hypothesis: rho != 0

Consequences: correlated errors make standard errors and p-values unreliable, which can render the model's conclusions inaccurate.
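
A quick visual check is to plot the autocorrelation of the residuals at several lags. A minimal sketch using base R's acf() on the mpg_carb model from above:

# spikes outside the dashed confidence bands at lag 1 or beyond suggest
# correlated residuals
acf(residuals(mpg_carb), main = "Residual autocorrelation: mpg_carb")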

Homoscedasticity

Definition: the variance of the residuals (errors) is constant across all values of the independent variables

Example:

# the model "lm(hp~mpg,data=mtcars)" is fairly homoscedastic. The variance of the residuals is similar across the plot:

hp_mpg<-lm(hp~mpg,data=mtcars)
plot(hp_mpg,which=3)

# the null hypothesis of the Breusch-Pagan test is that the model IS homoscedastic. So, a p-value higher than 0.05 means we fail to reject that - the model is probably homoscedastic:

bptest(hp_mpg)
## 
##  studentized Breusch-Pagan test
## 
## data:  hp_mpg
## BP = 0.86843, df = 1, p-value = 0.3514

Consequences: if the model is NOT homoscedastic, the coefficient estimates remain unbiased, but standard errors, p-values, and confidence intervals will be unreliable

Normality of Errors (or Normality of Residuals)

Definition: Errors of the model should be normally distributed

Example:

# the model "lm(qsec~disp,data=mtcars)" has residuals/errors that are somewhat normally distributed - they mostly fall on the dotted line in a Q-Q plot:

qsec_lm<-lm(qsec~disp,data=mtcars)
plot(qsec_lm,which=2)

shapiro.test(qsec_lm$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  qsec_lm$residuals
## W = 0.96176, p-value = 0.3066
# the mpg_carb model "lm(mpg~hp,data=mtcars)" has errors that are somewhat less normal:

plot(mpg_carb,which=2)

shapiro.test(mpg_carb$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mpg_carb$residuals
## W = 0.92337, p-value = 0.02568

Consequences: if residuals/errors are not normal, the p-values will be unreliable, along with confidence intervals.

No Multicollinearity

Definition: independent variables should not be overly correlated with each other.

Example:

# in the model "lm(mpg~hp+disp+cyl+wt,data=mtcars)", disp (engine displacement) has a VIF over 10, meaning that it's strongly correlated with some other variable (probably CYL):

multicolinear_model<-lm(mpg~hp+disp+cyl+wt,data=mtcars)
vif(multicolinear_model)
##        hp      disp       cyl        wt 
##  3.405983 10.373286  6.737707  4.848016
# after removing "disp", the model is no longer multicollinear:

less_colinear_model<-lm(mpg~hp+cyl+wt,data=mtcars)
vif(less_colinear_model)
##       hp      cyl       wt 
## 3.258481 4.757456 2.580486

Consequences: A model with multicollinearity will have inaccurate p-values and standard errors
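
To see which predictors drive the inflation, one option is a plain correlation matrix of the suspect variables - a minimal sketch:

# cyl, disp and wt are all strongly pairwise-correlated, which is what
# inflates their VIFs above
cor(mtcars[, c("cyl", "disp", "wt")])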

MITIGATION TECHNIQUES

  1. Non-linearity? Try a log transformation (may not always work).
  2. Errors not independent? Try a robust linear model with rlm() (its summary does not report p-values).
  3. Heteroscedasticity? Log-transform the dependent variable (also may not always work).
  4. Residuals not normal? Log transformation.
  5. Multicollinearity? Remove strongly correlated variables.

Mitigation of Assumption Violations

Linearity Violation

Purpose: To mitigate a violation of the linearity assumption in linear regression

Suggested Method: Log or Square-Root Transformation of one or more variables

Example:

# the model "mpg_carb" is non-linear, but when we log both variables (mpg and hp), the model becomes noticeably more linear:

plot(mpg_carb,which=1)

mpg_carb_log<-lm(log(mpg)~log(hp),data=mtcars)
plot(mpg_carb_log,which=1)

REMINDER: log transformation does not work with zeroes or negative numbers, and missing data will carry through as NA. Square-root likewise cannot handle negative numbers or missing data.
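
If the data contain zeroes, one common workaround (a sketch, not a universal fix) is log1p(), which computes log(1 + x) and so maps zero to zero:

# log1p() tolerates zeroes, though it is still undefined below -1;
# mtcars has no zeroes, so this is purely illustrative
mpg_hp_log1p <- lm(log1p(mpg) ~ log1p(hp), data = mtcars)
plot(mpg_hp_log1p, which = 1)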

Independent Errors Violation

Purpose: to reduce correlation between errors in a model. Ironically, we can do this by (among other things) just changing the model to something more appropriate.

Example:

durbinWatsonTest(mpg_carb)
##  lag Autocorrelation D-W Statistic p-value
##    1        0.428481      1.133807    0.01
##  Alternative hypothesis: rho != 0
mpg_carb_changed<-rlm(mpg~carb+drat,data=mtcars)

durbinWatsonTest(mpg_carb_changed)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.2707792      2.502298    0.19
##  Alternative hypothesis: rho != 0

REMINDER: correlation of errors suggests that the model just isn’t very good. Adding a lag indicator for time series data might work, but otherwise it means the variables may be inappropriate for a linear model.
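
For time-series data specifically, here is a hedged sketch of the lag approach on simulated (not mtcars) data:

# simulate a predictor plus autocorrelated noise, then show that adding
# the lagged response as a predictor soaks up much of the autocorrelation
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + as.numeric(arima.sim(model = list(ar = 0.6), n = n))
d <- data.frame(y, x, y_lag = dplyr::lag(y))
durbinWatsonTest(lm(y ~ x, data = d))          # typically flags autocorrelation
durbinWatsonTest(lm(y ~ x + y_lag, data = d))  # lag term absorbs much of it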

Homoscedasticity Violation

Purpose: To reduce heteroscedasticity in a model and restore homoscedasticity (an assumption of linear regression).

Example: This can potentially be accomplished through a log transformation of the dependent variable -

hp_wt_carb<-lm(hp~wt+carb,data=mtcars)

plot(hp_wt_carb,which=3)

hp_wt_carb_log<-lm(log(hp)~wt+carb,data=mtcars)
plot(hp_wt_carb_log,which=3)

REMINDER: Some data may be so skewed that even a log or square-root transformation cannot induce homoscedasticity. Also, log transformations will not work on data with zeroes or negative values, and missing values carry through as NA.
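
If no transformation works, another route (a sketch assuming the sandwich package is installed; lmtest is already loaded) is to keep the model but compute heteroscedasticity-robust standard errors:

# robust (HC3) covariance corrects the standard errors and p-values for
# heteroscedasticity without changing the coefficient estimates
coeftest(hp_wt_carb, vcov. = sandwich::vcovHC(hp_wt_carb, type = "HC3"))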

Normality of Residuals Violation

Purpose: If the residuals in a model are not normal, the assumption is not met. We can potentially transform the dependent variable to induce normality of residuals.

Example: logging the dependent variable (hp) takes the residuals from somewhat non-normal to fairly normal.

cars_model_2<-lm(hp~mpg,data=mtcars)
shapiro.test(cars_model_2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars_model_2$residuals
## W = 0.89154, p-value = 0.003799
cars_model_3<-lm(log(hp)~mpg,data=mtcars)
shapiro.test(cars_model_3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars_model_3$residuals
## W = 0.9453, p-value = 0.1059
plot(cars_model_2,which=2)

plot(cars_model_3,which=2)

REMINDER: For large datasets, the central limit theorem often mitigates non-normality of residuals on its own.

Multicollinearity Violation

Purpose: to reduce multicollinearity - independent variables that are highly correlated with one another violate an assumption of the linear model.

Example: CYL, DISP and WT are strongly correlated with one another (VIF > 10). Removing two of them largely mitigates the multicollinearity (though note that cyl's VIF stays just above 10 in the reduced model):

kitchen_sink<-lm(hp~cyl+disp+mpg+drat+wt+qsec+vs+am+gear+carb,data=mtcars)


vif(kitchen_sink)
##       cyl      disp       mpg      drat        wt      qsec        vs        am 
## 14.912374 15.715476  7.296154  3.398500 16.339384  7.958123  4.600828  4.931842 
##      gear      carb 
##  5.344559  5.923559
#WT and DISP removed:
kitchen_sink_2<-lm(hp~cyl+mpg+drat+qsec+vs+am+gear+carb,data=mtcars)

vif(kitchen_sink_2)
##       cyl       mpg      drat      qsec        vs        am      gear      carb 
## 11.329719  5.670006  3.365246  5.165514  4.438732  4.923775  5.188293  3.873564

GENERALIZED LINEAR MODEL SELECTION

There are more types of GLMs than just the following, but these are the types you will most likely encounter in this program:

Dependent variable - Independent variable - Method

Continuous - Continuous or categorical - Gaussian (family=“gaussian”)

Binary - Categorical or continuous - Binomial (family=“binomial”)

Counts (non-negative integers) - Categorical or continuous - Poisson (family=“poisson”)

Continuous (positive, skewed) - Categorical or continuous - Gamma (family=“Gamma”)
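
A sketch of what each call looks like in practice (using mtcars throughout; treating carb, the carburetor count, as a Poisson outcome is purely illustrative):

# continuous outcome: gaussian (equivalent to lm)
glm(mpg ~ wt, data = mtcars, family = gaussian)

# binary outcome (am is 0/1): binomial, i.e. logistic regression
glm(am ~ hp + wt, data = mtcars, family = binomial)

# count outcome: poisson
glm(carb ~ hp, data = mtcars, family = poisson)

# positive, skewed outcome: Gamma with a log link
glm(hp ~ mpg, data = mtcars, family = Gamma(link = "log"))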

VARIABLE DEFINITIONS

Remember these types of variables as well:

Continuous - Can be written on a number line (1.1,1.2,1.3,1.4). Examples: height in cm (160.5 cm, etc.)

Binary - Either it happened, or it didn’t (1, 0). Examples: patient lived after surgery (0) or died (1)

Counts - Integers you can “count” on your fingers (1,2,3). Examples: there were 4 car crashes yesterday

Continuous (skewed - Gamma) - Same as continuous but positive, with values bunched toward one end and a long tail. Example: graduation rates

Categorical - different “categories” the data can be sorted into. Example: orange juice (OJ) vs. ascorbic acid (VC) in ToothGrowth

PSEUDO P-VALUE

A GLM summary reports p-values for its individual coefficients, but unlike lm() it does not report a single overall model p-value. However, it's possible to estimate a pseudo p-value (a rough gauge of the model's overall significance) using the differences between null and residual deviance, and between null and residual degrees of freedom.

As a rule of thumb, if the null deviance minus the residual deviance is much greater than the null degrees of freedom minus the residual degrees of freedom, the model is probably significant (the formal version of this comparison appears in the sketch at the end of this section):

chick<-ChickWeight
chick_glm<-glm(weight~Time+Diet,data=chick,family = Gamma(link=log))

summary(chick_glm)
## 
## Call:
## glm(formula = weight ~ Time + Diet, family = Gamma(link = log), 
##     data = chick)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.68330    0.02015 182.801  < 2e-16 ***
## Time         0.07991    0.00133  60.092  < 2e-16 ***
## Diet2        0.12122    0.02450   4.949 9.84e-07 ***
## Diet3        0.23526    0.02450   9.604  < 2e-16 ***
## Diet4        0.22651    0.02463   9.197  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.0465709)
## 
##     Null deviance: 193.022  on 577  degrees of freedom
## Residual deviance:  28.338  on 573  degrees of freedom
## AIC: 5272.8
## 
## Number of Fisher Scoring iterations: 5

193 - 28 = 165
577 - 573 = 4

165 > 4

Model is probably significant!
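
A more formal version of this comparison (a sketch; for families with an estimated dispersion, like Gamma, an F test is generally preferred) treats the drop in deviance as a chi-square statistic:

# likelihood-ratio-style test: deviance drop vs. a chi-square with df
# equal to the number of parameters added (ignores the Gamma dispersion)
pchisq(193.022 - 28.338, df = 577 - 573, lower.tail = FALSE)

# or let R run the sequential tests; test = "F" accounts for the
# estimated dispersion
anova(chick_glm, test = "F")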
