Core functions from the broom package: glance() (model-level fit statistics), tidy() (model parameters), and augment() (per-observation values).

Also: visdat::vis_dat(), rsample::initial_split(), car::vif(), yardstick::conf_mat(), GGally::ggpairs() and interactions::interact_plot().

1 Examine data

1.1 Overview

head(penguins)
summary(penguins)
nrow(penguins)

1.2 Examine missing data

visdat::vis_dat(penguins)
penguins <- drop_na(penguins)
nrow(penguins)
## [1] 333

1.3 Plot variables

penguins %>%
  pairs()

## GGally::ggpairs(penguins %>% select(bill_length_mm, bill_depth_mm, year))

pairs(penguins %>% select(bill_length_mm, bill_depth_mm, year))

penguins %>%
  select_if(is.numeric) %>% 
  cor() 

2 Multiple linear regression

Objective: choose the coefficients that minimize the residual sum of squares (RSS)

2.1 Fit model

lmfit <- lm(body_mass_g ~ flipper_length_mm + sex, penguins)

summary(lmfit)
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + sex, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -910.28 -243.89   -2.94  238.85 1067.73 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5410.300    285.798 -18.931  < 2e-16 ***
## flipper_length_mm    46.982      1.441  32.598  < 2e-16 ***
## sexmale             347.850     40.342   8.623 2.78e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 355.9 on 330 degrees of freedom
## Multiple R-squared:  0.8058, Adjusted R-squared:  0.8047 
## F-statistic: 684.8 on 2 and 330 DF,  p-value: < 2.2e-16
  • Residual SE: the average amount the response deviates from the true regression line, in units of the response
  • computed as sqrt(sum(residuals^2) / df.residual); see the sketch below
  • roughly 68% of observations fall within one residual SE of the line (if the residuals are normal)
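
A quick check, recomputing the residual standard error from lmfit:

sqrt(sum(residuals(lmfit)^2) / df.residual(lmfit))  # same value as sigma(lmfit); cf. 355.9 above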

2.2 Model goodness of fit

Does the model explain much of the variance? Is the F-statistic significant?

glance(lmfit) %>%
  kable()
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.8058374 0.8046607 355.8829 684.8033 0 2 -2427.242 4862.484 4877.717 41795374 330 333
  • R2 = the fraction of the variation in your dependent variable that is accounted for (or predicted) by your independent variables (with a single predictor, R = r). R2 = 1 - RSS/TSS
  • to compute R2 manually, see the sketch after this list
  • F-value:
    • tests the null hypothesis that all regression coefficients are zero (no relationship between predictors and outcome)
    • has an associated p-value
    • under the null hypothesis, F is expected to be near 1
    • look at this before the individual p-values, because the F-statistic accounts for the number of predictors (it depends on n and p), guarding against multiple comparisons
  • nobs = n observations
  • sigma = Estimated standard error of the residuals
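
A sketch recomputing R2 and the F-statistic by hand (rss, tss, n, p are hypothetical helper names):

rss <- sum(residuals(lmfit)^2)
tss <- sum((penguins$body_mass_g - mean(penguins$body_mass_g))^2)
1 - rss / tss                             # R-squared, cf. glance() above
n <- nobs(lmfit); p <- 2                  # p = number of predictors
((tss - rss) / p) / (rss / (n - p - 1))   # F-statistic, cf. glance() above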

2.3 Model parameters

lmparms <- tidy(lmfit, conf.int = TRUE)
kable(lmparms)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -5410.30022 285.797694 -18.930524 0 -5972.51535 -4848.08510
flipper_length_mm 46.98218 1.441256 32.598066 0 44.14697 49.81738
sexmale 347.85025 40.341557 8.622628 0 268.49120 427.20930
  • include CIs with conf.int = TRUE (set the level with conf.level)

  • p-value for each independent variable

    • tests the null hypothesis that the variable's coefficient is zero, i.e., no correlation with the dependent variable
    • technical definition: the probability of observing a result at least as extreme as the one observed, under the assumption that the null hypothesis is correct (probability of a false positive; Type I error)
    • informal intuition (not strictly correct): the probability that the coefficient is actually 0 and the effect arose from random chance (small value -> unlikely to have occurred by chance)
  • Non-standardized estimates (coefficients, betas):

    • the weights that minimize the sum of squared errors

    • useful when you want to interpret the effect of a one-unit change in a predictor on the response

    • how much the mean of the dependent variable changes, on average, for a one-unit shift in the independent variable, holding the other variables in the model constant

    • B0 and B1 are chosen to minimize RSS

    • SE = the standard deviation of the coefficient estimate; the SE depends on both the predictor and outcome variables

  • Standardized estimates (coefficients, betas):

    • useful when you want to compare the effects of predictors measured on different scales
    • a change of 1 SD in X is associated with a change of beta SDs in Y, holding the other variables in the model constant
    • z-score the predictors and the outcome variable (scale()); see the sketch after this list
    • allows you to compare coefficients: e.g., “X2 is twice as important as X1 in predicting Y, assuming that both X1 and X2 follow roughly the same distribution and their standard deviations are not that different.”
  • standard error of the estimate

  • t = B/SE(B); df = sample size - number of estimated parameters

  • CI = B ± t_crit*SE; see the sketch after this list

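A minimal sketch of the two notes above, assuming lmfit from earlier (est, se, t_crit are hypothetical helper names):

# standardized coefficients: z-score the numeric variables with scale()
lm(scale(body_mass_g) ~ scale(flipper_length_mm) + sex, penguins)

# t statistics and 95% CIs by hand; should agree with tidy(lmfit, conf.int = TRUE)
est    <- coef(lmfit)
se     <- sqrt(diag(vcov(lmfit)))
t_crit <- qt(0.975, df = df.residual(lmfit))
est / se                                                     # t statistics
cbind(lower = est - t_crit * se, upper = est + t_crit * se)  # cf. confint(lmfit)
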
Plot params:

ggplot(lmparms, aes(estimate, term, xmin = conf.low, xmax = conf.high)) +
  geom_point() +
  geom_vline(xintercept = 0, lty = 4) +
  geom_linerange() +
  coord_flip()

2.4 Model values

lm.augment <- augment(lmfit, interval = "confidence") 
lm.augment
## # A tibble: 333 × 11
##    body_mass_g flipper_length_mm sex    .fitted .lower .upper .resid    .hat
##          <int>             <int> <fct>    <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1        3750               181 male     3441.  3356.  3527.  309.  0.0150 
##  2        3800               186 female   3328.  3265.  3392.  472.  0.00818
##  3        3250               195 female   3751.  3696.  3806. -501.  0.00615
##  4        3450               193 female   3657.  3601.  3713. -207.  0.00637
##  5        3650               190 male     3864.  3796.  3932. -214.  0.00940
##  6        3625               181 female   3093.  3022.  3165.  532.  0.0105 
##  7        4675               195 male     4099.  4039.  4159.  576.  0.00743
##  8        3200               182 female   3140.  3071.  3210.   59.5 0.00993
##  9        3800               191 male     3911.  3845.  3977. -111.  0.00894
## 10        4400               198 male     4240.  4183.  4297.  160.  0.00665
## # … with 323 more rows, and 3 more variables: .sigma <dbl>, .cooksd <dbl>,
## #   .std.resid <dbl>
  • if rows with missing data were dropped, pass the original data via newdata = to get values for every row; see the sketch below
  • interval = “confidence”/“prediction”
  • (confidence interval: uncertainty in the fitted mean due to sampling; prediction interval: that uncertainty plus the random variation of individual observations)
  • .sigma = estimated residual standard deviation when the corresponding observation is dropped from the model
  • .hat = diagonal of the hat matrix (measures leverage)
  • .cooksd = a measure of influence (an outlier with high leverage; a Cook’s distance greater than 1 is often considered large)
  • .se.fit = standard error of the fitted values (request with se_fit = TRUE)
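
A sketch combining those options (argument names per broom’s augment.lm):

augment(lmfit, newdata = penguins, interval = "prediction", se_fit = TRUE)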

2.5 Evaluating model assumptions


2.5.1 Normality of residuals

= errors between the line and the data are assumed to be drawn from a nearly-normal distribution

  • To assess, use Q-Q plots
  • do the residuals come from a normal distribution?
  • a Q-Q plot plots the quantiles of two distributions against each other
  • when violated, p-values are wrong
ggplot(lm.augment, aes(sample = .resid)) + 
  geom_qq() + 
  geom_qq_line() + 
  ggtitle('Q-Q plot of residuals')

2.5.2 Linearity

= y should be linearly related to x

  • To assess, plot residuals vs. fitted values; look for:
  • residuals spread randomly around the 0 line, indicating that the relationship is linear
  • residuals forming an approximately horizontal band around the 0 line, indicating homogeneity of error variance
  • no single residual visibly away from the random pattern, indicating no outliers
  • if violated, consider transforming variables
ggplot(lm.augment, aes(y = .resid, x = .fitted)) + 
  geom_point() +
  geom_smooth(method = "loess")  # a flexible smoother reveals curvature; an lm line here is flat by construction

2.5.3 Homogeneity of variance

= data are assumed to be approximately equally variable across the full range of x and y

  • To assess, plot standardized residuals vs. x (simple regression) or vs. fitted values
  • is the residual variance constant? (residuals should be uniform in spread and uncorrelated with each other)
  • if not, the SEs of the estimates can be wrong (overconfidence)
ggplot(lm.augment, aes(y = .std.resid, x = .fitted)) + 
  geom_point() +
  geom_smooth(method = "loess")  # look for fanning or trends in the spread

2.5.4 Anomalous data

= Data points very far away from the rest can exert undue influence on the model parameters

  • Look at Cook’s distance from the augment() output (values > 1 are problematic); see the sketch below
    • Cook’s distance is a measure of how much the fitted values change when the observation is removed
  • Outlier (unusual Y): studentized residuals, e/SE, give a scale-free estimate of deviation; exclude outliers with values greater than 3
  • Leverage (unusual X): see .hat
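
A sketch flagging anomalous rows with those thresholds:

lm.augment %>%
  filter(.cooksd > 1 | abs(.std.resid) > 3)  # high influence or large standardized residuals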

2.5.5 Collinearity

= predictors should be independent of (not strongly correlated with) each other

  • To assess, use the variance inflation factor (VIF)
  • collinearity produces a lot of uncertainty about the estimates and reduces the power of the hypothesis tests (the probability of correctly detecting a non-zero coefficient)
  • for each predictor, estimate R2 from a regression with that predictor as the OUTCOME and all the other variables as predictors
  • VIF = 1/(1-R2); see the sketch below
  • the square root of the VIF is fairly interpretable: it tells you how much wider the confidence interval for the corresponding coefficient is, relative to what you would expect if the predictors were all uncorrelated with one another
  • a VIF of 1.8 means the variance is 80% larger than it would be if that predictor were uncorrelated with the other predictors; VIF ranges from 1 to infinity; a VIF >= 4 warrants investigation, a VIF >= 10 requires correction
  • drop predictors with VIF > 10
#cor(penguins %>% select(bill_length_mm, flipper_length_mm), use = "pairwise.complete.obs")
car::vif(lmfit)
## flipper_length_mm               sex 
##          1.069646          1.069646
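
The same quantity by hand, per the formula above (should match the car::vif output):

r2 <- summary(lm(flipper_length_mm ~ sex, penguins))$r.squared
1 / (1 - r2)  # VIF for flipper_length_mm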

3 Logistic Regression

Goal: maximize the likelihood (least squares is a special case)

3.1 Fit model

glmfit <- glm(sex ~ flipper_length_mm + bill_length_mm, penguins, 
              family = "binomial")
glmfit
## 
## Call:  glm(formula = sex ~ flipper_length_mm + bill_length_mm, family = "binomial", 
##     data = penguins)
## 
## Coefficients:
##       (Intercept)  flipper_length_mm     bill_length_mm  
##         -7.005988           0.007737           0.124375  
## 
## Degrees of Freedom: 332 Total (i.e. Null);  330 Residual
## Null Deviance:       461.6 
## Residual Deviance: 419.9     AIC: 425.9
  • the null deviance shows how well the response is predicted by a model with nothing but an intercept
  • the residual deviance shows how well the response is predicted by the model when the predictors are included
  • degrees of freedom = no. of observations - no. of estimated parameters (including the intercept)
  • deviance:
    • a measure of goodness of fit of a generalized linear model (deviance with respect to a perfect “saturated” model)
    • indicates the extent to which the likelihood of the saturated model exceeds the likelihood of the proposed model; see the likelihood-ratio sketch below
  • AIC is based on the deviance, but penalizes you for making the model more complicated (smaller = better); a balance between model fit, given by the likelihood, and complexity
  • BIC is similar to AIC, but its penalty also grows with the number of observations
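
A sketch of the deviance comparison as a likelihood-ratio test, using components stored on the glmfit object:

# drop in deviance is ~ chi-squared under H0 (no predictor effects)
with(glmfit, pchisq(null.deviance - deviance,
                    df.null - df.residual, lower.tail = FALSE))
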
summary(glmfit)
## 
## Call:
## glm(formula = sex ~ flipper_length_mm + bill_length_mm, family = "binomial", 
##     data = penguins)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8924  -1.1412   0.5765   0.9822   1.6976  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.005988   1.727620  -4.055 5.01e-05 ***
## flipper_length_mm  0.007737   0.011081   0.698    0.485    
## bill_length_mm     0.124375   0.029528   4.212 2.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 461.61  on 332  degrees of freedom
## Residual deviance: 419.94  on 330  degrees of freedom
## AIC: 425.94
## 
## Number of Fisher Scoring iterations: 4

3.2 Model parameters

tidy(glmfit, conf.int = TRUE)
## # A tibble: 3 × 7
##   term              estimate std.error statistic   p.value conf.low conf.high
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)       -7.01       1.73      -4.06  0.0000501 -10.5      -3.67  
## 2 flipper_length_mm  0.00774    0.0111     0.698 0.485      -0.0141    0.0295
## 3 bill_length_mm     0.124      0.0295     4.21  0.0000253   0.0679    0.184
  • include CIs with conf.int = TRUE (set the level with conf.level)

  • statistic = z = B/SE(B) (a Wald z statistic for a glm, not t)

  • nobs = n observations

  • glance() on a glm reports deviance, AIC, and BIC rather than sigma

  • Betas are in log-odds space

  • log odds to odds ratio [odds = p/(1-p)]: exp() the coefficient (e.g., an odds ratio of 2 implies 2x higher odds); see the sketch below
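
A sketch using tidy()’s exponentiate argument to report odds ratios directly:

tidy(glmfit, exponentiate = TRUE, conf.int = TRUE)  # odds ratios with CIs
exp(coef(glmfit))                                   # same point estimates by hand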

penguins %>%
  mutate(prob = ifelse(sex == "male", 1, 0)) %>%
  ggplot(aes(body_mass_g, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"))   
#http://www.sthda.com/english/articles/36-classification-methods-essentials/151-logistic-regression-essentials-in-r/

3.3 Model values

  • type.predict = “link” (the default) returns fitted values on the log-odds scale; “response” returns probabilities
model_vals <- augment(glmfit, type.predict = "response") %>%
    mutate(predicted = factor(case_when(.fitted > .5 ~ "male",
                                        TRUE ~ "female")))  # factor, as conf_mat() expects
  • with “response”, fitted values are transformed by the inverse of the link function (here the logistic), so .fitted is a probability

Confusion matrix:

yardstick::conf_mat(model_vals,  sex, predicted) #true, predicted
##           Truth
## Prediction female male
##     female     89   69
##     male       76   99
  • prediction accuracy here is computed on the training data, so it is over-confident
  • for a fuller classification report, call summary() on the conf_mat object; see the sketch below
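
summary() on a conf_mat object returns accuracy, kappa, sensitivity, specificity, and more:

summary(yardstick::conf_mat(model_vals, sex, predicted))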

4 Misc

4.1 Transformations

  • deal with problems of non-constant variance and skew
  • right-skewed: log(y)
  • left-skewed: exp(y) or y^2
  • non-normal count data: sqrt(y)
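
An illustrative one-liner (not fitted elsewhere in these notes), log-transforming the outcome:

lm(log(body_mass_g) ~ flipper_length_mm + sex, penguins)  # coefficients are now on the log-mass scale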

4.2 Dummy variables

levels(penguins$sex)
## [1] "female" "male"
penguins <- penguins %>%
  mutate(sex = relevel(sex, ref = "male"))
  • 0/1 dummy coding (default): intercept = mean when the variable = 0 (the reference level); beta = difference between groups
  • -1/1 effect coding: intercept = average of the two groups; beta = difference from that average (see the sketch below)
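
A sketch of -1/1 coding using base R’s contr.sum (penguins_ec is a hypothetical copy, so the 0/1 examples above are unaffected):

penguins_ec <- penguins
contrasts(penguins_ec$sex) <- contr.sum(2)  # -1/1 (effect) coding
coef(lm(body_mass_g ~ sex, penguins_ec))    # intercept = average of the two group means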

4.3 Prediction

predict(model, new_data_with_predictors) (for a GLM, add type = "response")

#probabilities <- model %>% predict(test.data, type = "response")

Split into training vs. test sets to evaluate your models out-of-sample on new data.

library(rsample)
penguins_split <- initial_split(penguins, prop = .8)
train_data <- training(penguins_split)
test_data <- testing(penguins_split)

Compare models using RMSE (in-sample: sqrt(mean(lmfit$residuals^2)); out-of-sample: see the sketch below).
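
A sketch of the out-of-sample comparison (train_fit and preds are hypothetical names):

train_fit <- lm(body_mass_g ~ flipper_length_mm + sex, train_data)
preds <- predict(train_fit, newdata = test_data)
sqrt(mean((test_data$body_mass_g - preds)^2))  # test RMSE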

4.4 Comparing models

anova() - Is the more complex model better?

lmfit <- lm(body_mass_g ~ flipper_length_mm + sex, penguins)

lmfit2 <- lm(body_mass_g ~ flipper_length_mm + sex + island, penguins)
anova(lmfit2, lmfit)
## Analysis of Variance Table
## 
## Model 1: body_mass_g ~ flipper_length_mm + sex + island
## Model 2: body_mass_g ~ flipper_length_mm + sex
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1    328 35758209                                  
## 2    330 41795374 -2  -6037165 27.689 7.737e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lmfit, lmfit2)
## Analysis of Variance Table
## 
## Model 1: body_mass_g ~ flipper_length_mm + sex
## Model 2: body_mass_g ~ flipper_length_mm + sex + island
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1    330 41795374                                  
## 2    328 35758209  2   6037165 27.689 7.737e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • with a small number of variables, consider all subsets
  • Forward: starts with zero variables; once a variable is added it stays in the model, even if it becomes non-significant later
  • Backward: starts with all the variables; once a variable is eliminated it cannot re-enter the model
  • Stepwise: combines forward and backward selection; terminates when no variable can be added or removed (see the step() sketch below)
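
A sketch with base R’s step(), which performs AIC-guided selection:

step(lmfit2, direction = "backward")  # also "forward" or "both"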

4.5 Interpreting interactions

  • center the variables first
  • don’t focus too much on the main effects of terms included in the interaction, since they are conditional on the other variable(s) in the interaction being held constant at 0
lmfit_mod <- lm(body_mass_g ~ island * flipper_length_mm, penguins)
summary(lmfit_mod)
## 
## Call:
## lm(formula = body_mass_g ~ island * flipper_length_mm, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -985.13 -248.93  -20.27  222.90 1192.46 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -5520.278    429.553 -12.851  < 2e-16 ***
## islandDream                        3743.156    975.595   3.837  0.00015 ***
## islandTorgersen                    2845.594   1742.520   1.633  0.10342    
## flipper_length_mm                    48.862      2.045  23.892  < 2e-16 ***
## islandDream:flipper_length_mm       -20.413      4.971  -4.106 5.08e-05 ***
## islandTorgersen:flipper_length_mm   -15.535      9.047  -1.717  0.08689 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 371.8 on 327 degrees of freedom
## Multiple R-squared:   0.79,  Adjusted R-squared:  0.7868 
## F-statistic: 246.1 on 5 and 327 DF,  p-value: < 2.2e-16
interactions::interact_plot(lmfit_mod, pred = flipper_length_mm, modx = island)

lmfit_mod2 <- lm(body_mass_g ~ bill_depth_mm * flipper_length_mm, penguins)

summary(lmfit_mod2)
## 
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm * flipper_length_mm, 
##     data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -915.78 -253.91  -33.51  223.53 1032.50 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -37536.190   4692.218  -8.000 2.15e-14 ***
## bill_depth_mm                     1854.806    276.375   6.711 8.46e-11 ***
## flipper_length_mm                  203.368     22.879   8.889  < 2e-16 ***
## bill_depth_mm:flipper_length_mm     -9.018      1.357  -6.646 1.25e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.2 on 329 degrees of freedom
## Multiple R-squared:  0.7916, Adjusted R-squared:  0.7897 
## F-statistic: 416.7 on 3 and 329 DF,  p-value: < 2.2e-16
interactions::interact_plot(lmfit_mod2, pred = flipper_length_mm, modx = bill_depth_mm)

interactions::interact_plot(lmfit_mod2, pred = bill_depth_mm, modx = flipper_length_mm)

4.6 Syntax for fitting various models

polynomial: model <- lm(y ~ poly(x, 3))

intercept_mod <- lm(body_mass_g ~ 1, penguins)
#no_intercept_mod <- lm(y ~ x - 1)
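
A few more formula idioms (quad_mod is a hypothetical name):

# main effects plus interaction: y ~ x1 * x2 (same as x1 + x2 + x1:x2)
# interaction term only:         y ~ x1:x2
# arithmetic inside a formula needs I()
quad_mod <- lm(body_mass_g ~ flipper_length_mm + I(flipper_length_mm^2), penguins)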