Requirements for multivariate regression

library(tidyverse)
library(here)
library(car)
library(ggResidpanel)
nyc<-read.csv(here('data','nyc.csv'))

Response Requirements

response != binary & response != count

IF response==binary use multivariate_logistic_regression.Rmd

IF response==count use multivariate_count_regression.Rmd

High correlation between predictors can lead to unstable parameter estimates, which makes it difficult to assess the effect of individual independent variables on the dependent variable.

no autocorrelation if there is a time base

## Check distribution of response variable

-should be normal, transform if needed

check with car::powerTransform

shapiro.test(nyc$Price)
## 
##  Shapiro-Wilk normality test
## 
## data:  nyc$Price
## W = 0.99223, p-value = 0.5031

H_0: data is normally distributed

H_A: data is not normally distributed

Here p = 0.5031 > .05, so we fail to reject normality and no transformation of Price is needed.

IF shapiro.test(response)==FAIL use car::powerTransform(response) to get transformation power
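A minimal sketch of that step, assuming the estimated power is applied directly to the response via car::bcPower (in practice you would usually round lambda to a convenient value such as 0, 0.5, or 1; Price_t is just an illustrative column name):

# estimate a Box-Cox power for the response
pt<-car::powerTransform(nyc$Price)
summary(pt)

# apply the estimated power; lambda = 0 corresponds to a log transform
lambda<-coef(pt)
nyc$Price_t<-car::bcPower(nyc$Price,lambda)

# re-check normality of the transformed response
shapiro.test(nyc$Price_t)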

Correlation and covariance

Dohoo et al. (1997) argued that multicollinearity is certain when the correlation coefficient between two predictors is 0.9 or higher.

Analysis of Covariance and correlation

Correlation is a scaled form of covariance (the measure of how change in one variable is associated with change in another). In an analysis of covariance, the regression lines for the groups can take four forms (a minimal sketch of the corresponding model formulas follows this list):

  1. coincident regression lines

  2. parallel regression lines

  3. regression lines with equal intercepts but different slopes

  4. unrelated regression lines
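A minimal sketch of how those four cases translate into R model formulas, using Price, Food, and the East indicator from this data set (the model names are just illustrative):

# 1. coincident lines: the grouping variable has no effect at all
lm.coincident<-lm(Price~Food,data=nyc)

# 2. parallel lines: same slope, different intercepts
lm.parallel<-lm(Price~Food+East,data=nyc)

# 3. equal intercepts, different slopes: interaction term only
lm.slopes<-lm(Price~Food+Food:East,data=nyc)

# 4. unrelated lines: separate intercepts and slopes
lm.unrelated<-lm(Price~Food*East,data=nyc)

# compare nested models, e.g. parallel vs. unrelated lines
anova(lm.parallel,lm.unrelated)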

library(dlookr)
dlookr::plot_correlate(nyc)

IF correlation(x_i,x_j)>.9 consider dropping one of the variables from your analysis
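A minimal sketch of flagging such pairs directly from the correlation matrix, assuming the 0.9 cutoff above (numeric columns only; cors and high are illustrative names):

# correlation matrix of the numeric columns
cors<-cor(nyc%>%select(where(is.numeric)))

# pairs above the 0.9 cutoff (upper triangle only, to skip the diagonal and duplicates)
high<-which(abs(cors)>0.9 & upper.tri(cors),arr.ind=TRUE)
data.frame(var1=rownames(cors)[high[,1]],
           var2=colnames(cors)[high[,2]],
           r=cors[high])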

Create first fitted model

library(tidyverse)
library(here)
nyc<-read.csv(here('data','nyc.csv'))%>%select(-c(Case,Restaurant))
lm.list=list()
lm.list[["lm.base"]]<-lm(Price~.,data=nyc)
summary(lm.list[["lm.base"]])
## 
## Call:
## lm(formula = Price ~ ., data = nyc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0465  -3.8837   0.0373   3.3942  17.7491 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.023800   4.708359  -5.102 9.24e-07 ***
## Food          1.538120   0.368951   4.169 4.96e-05 ***
## Decor         1.910087   0.217005   8.802 1.87e-15 ***
## Service      -0.002727   0.396232  -0.007   0.9945    
## East          2.068050   0.946739   2.184   0.0304 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.738 on 163 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6187 
## F-statistic: 68.76 on 4 and 163 DF,  p-value: < 2.2e-16

Examine Residuals for possible transformations

see transformations.Rmd for interpretation of residual plots for transformations

library(regclass)
regclass::VIF(lm.list[['lm.base']])
##     Food    Decor  Service     East 
## 2.714261 1.744851 3.558735 1.064985

see predictor_interaction.Rmd

ggResidpanel::resid_xpanel(lm.list[['lm.base']])

car::powerTransform(nyc%>%select(-c(Price,East)))
## Estimated transformation parameters 
##     Food    Decor  Service 
## 0.748239 1.446262 1.729677

It looks like Decor might have a polynomial (e.g., quadratic) relationship with Price.

ggplot(nyc,aes(x=Decor,y=Price))+geom_point()+geom_smooth(method='loess',se=FALSE,color='blue')+geom_smooth(method='lm',color='red')
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

## Check for heteroscedasticity

library(lmtest)
lmtest::bptest(lm.list[['lm.base']])
## 
##  studentized Breusch-Pagan test
## 
## data:  lm.list[["lm.base"]]
## BP = 3.1186, df = 4, p-value = 0.5382

if heteroscedasticity(model)==FAIL SEE heteroscedasticity_FAIL.Rmd
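A minimal sketch of that decision rule, assuming the usual .05 cutoff (a small Breusch-Pagan p-value is the FAIL case, i.e. evidence of heteroscedasticity):

bp<-lmtest::bptest(lm.list[['lm.base']])
if(bp$p.value<0.05){
  message("heteroscedasticity detected; see heteroscedasticity_FAIL.Rmd")
}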

Examine interactions between predictor variables for possible transformations

VIF

-The VIF of a predictor is a measure of how well it can be predicted by a linear regression on the other predictors.

A VIF above 10 indicates high correlation and is cause for concern; some authors suggest a more conservative threshold of 2.5.

IF VIF(predictor)>threshold, best practice is to remove the predictor with the highest VIF and rerun the model (see the sketch after the VIF output below).

library(regclass)
regclass::VIF(lm.list[['lm.base']])
##     Food    Decor  Service     East 
## 2.714261 1.744851 3.558735 1.064985
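A minimal sketch of the drop-and-refit rule, assuming a cutoff of 10 (none of the VIFs above exceed it, so the branch is not taken for these data; lm.reduced is an illustrative name):

vifs<-regclass::VIF(lm.list[['lm.base']])
if(max(vifs)>10){
  # drop the single worst offender and re-check
  worst<-names(which.max(vifs))
  lm.reduced<-update(lm.list[['lm.base']],as.formula(paste('. ~ . -',worst)))
  regclass::VIF(lm.reduced)
}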

Add transformations to base model

# add a quadratic term for Decor (the squaring must happen inside I() to be treated arithmetically)
lm.list[['lm.transform']]<-update(lm.list[['lm.base']],~.+I(Decor^2))


#summary(lm.list[['lm.transform']])

Create “best subsets” models

This will give you the best possible model for each number of predictors, from 1 up to the nvmax parameter.

You can use the method parameter to choose forward or backward selection instead of the default exhaustive search.
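For example, a minimal sketch of the same search run with forward selection (best_forward is an illustrative name; the chunk below uses the default exhaustive search):

best_forward<-leaps::regsubsets(Price~.,data=nyc,nvmax=4,method='forward')
summary(best_forward)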

library(leaps)
max_model_variable=4
best_subset<-leaps::regsubsets(Price~.,data=nyc,nvmax=max_model_variable)

results<-summary(best_subset)


results
## Subset selection object
## Call: regsubsets.formula(Price ~ ., data = nyc, nvmax = max_model_variable)
## 4 Variables  (and intercept)
##         Forced in Forced out
## Food        FALSE      FALSE
## Decor       FALSE      FALSE
## Service     FALSE      FALSE
## East        FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          Food Decor Service East
## 1  ( 1 ) " "  "*"   " "     " " 
## 2  ( 1 ) "*"  "*"   " "     " " 
## 3  ( 1 ) "*"  "*"   " "     "*" 
## 4  ( 1 ) "*"  "*"   "*"     "*"

Extract and plot results from “best_subset”

tibble(predictors = 1:max_model_variable,
       adj_R2 = results$adjr2,
       Cp = results$cp,
       BIC = results$bic) %>%
  gather(statistic, value, -predictors) %>%
  ggplot(aes(predictors, value, color = statistic)) +
  geom_line(show.legend = F) +
  geom_point(show.legend = F) +
  facet_wrap(~ statistic, scales = "free")

paste('best adjR2: ',which.max(results$adjr2),"best BIC: ",which.min(results$bic),"best Cp: ",which.min(results$cp))
## [1] "best adjR2:  3 best BIC:  2 best Cp:  2"

Decide which number of variables is best by examining the curves

Compare variables and coefficients that these models include

# 2 variable model

coef(best_subset,2)
## (Intercept)        Food       Decor 
##  -24.500156    1.646149    1.882015
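The same extraction works for the other candidate sizes, e.g. the 3-variable model favored by adjusted R-squared:

# 3 variable model
coef(best_subset,3)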

Manually refit the model with the best predictors and add it to lm.list

lm.list[['lm.best']]<-lm(Price~Food+Decor,data=nyc)

Model Diagnostics

F-test:

if the p-value of the overall F-test is below the .05 threshold, the model is significantly better than an intercept-only model

AIC and BIC: the model with the lowest value is preferred (see the sketch below for extracting these along with the F-test p-value).
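A minimal sketch of pulling these diagnostics from a fitted model (the overall F-test p-value is not stored directly, so it is computed from the F statistic and its degrees of freedom):

fit<-lm.list[['lm.best']]
fstat<-summary(fit)$fstatistic

# overall F-test p-value
pf(fstat['value'],fstat['numdf'],fstat['dendf'],lower.tail=FALSE)

# information criteria: lower is better
AIC(fit)
BIC(fit)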

VIF

-The VIF of a predictor is a measure of how well it can be predicted by a linear regression on the other predictors.

A VIF above 10 indicates high correlation and is cause for concern; some authors suggest a more conservative threshold of 2.5.

library(regclass)
regclass::VIF(lm.list[['lm.base']])
##     Food    Decor  Service     East 
## 2.714261 1.744851 3.558735 1.064985

Model Selection

# AIC for every model in lm.list (unname() lets do.call() pass the models as separate arguments)
AICs<-do.call(AIC,unname(lm.list))$AIC
# adjusted R-squared for every model
adjr2<-lapply(X=lm.list,
       FUN=function(x) unlist(summary(x)$adj.r.squared))
(select_df<-data.frame(Models=names(lm.list),AIC=round(AICs,2), R2=round(unlist(unname(adjr2)),2)))
##         Models     AIC   R2
## 1      lm.base 1070.71 0.62
## 2 lm.transform 1070.71 0.62
## 3      lm.best 1071.68 0.61
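If you also want BIC alongside AIC (the Model Diagnostics note says the lowest value wins for both), a minimal sketch that extends the table above:

select_df$BIC<-round(sapply(lm.list,BIC),2)
select_df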