library(tidyverse)
library(here)
library(car)
library(ggResidpanel)
nyc<-read.csv(here('data','nyc.csv'))
Response Requirements
1. response != binary & response != count
- IF response == binary use multivariate_logistic_regression.Rmd
- IF response == count use multivariate_count_regression.Rmd
High correlation between predictors can lead to unstable regression parameter estimates, which makes it difficult to assess the effect of the independent variables on the dependent variable.
- no autocorrelation if the data have a time component
## Check distribution of response variable
- should be normal, transform if needed
- check with car::powerTransform
shapiro.test(nyc$Price)
##
## Shapiro-Wilk normality test
##
## data: nyc$Price
## W = 0.99223, p-value = 0.5031
H_0: data is normal
H_A: data is not normal
IF shapiro.test(response)==FAIL (p < 0.05) use car::powerTransform(response) to get the transformation power
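A minimal sketch of that step (illustrative only, since Price passes the Shapiro-Wilk test here; Price_t is a hypothetical column name):
# Sketch: estimate the Box-Cox power for the response and apply it
pt<-car::powerTransform(nyc$Price)
summary(pt)
lambda<-coef(pt,round=TRUE)    # rounded estimated power, e.g. 1, 0.5, or 0
nyc$Price_t<-car::bcPower(nyc$Price,lambda)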
Dohoo et al. (1997) argued that multicollinearity is certain when the correlation coefficient between two predictors is 0.9 or higher
Correlation is a scaled form of covariance (the measure of how change in one variable is associated with change in another). When a grouping variable is in the model (analysis of covariance), the possible relationships between the group regression lines are:
- coincident regression lines
- parallel regression lines
- regression lines with equal intercepts but different slopes
- unrelated regression lines
library(dlookr)
dlookr::plot_correlate(nyc)
IF correlation(x_i,x_j) > 0.9, consider dropping one of the two variables from your analysis
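A minimal sketch of flagging such pairs numerically (Dohoo et al.'s 0.9 threshold; none are expected in this data):
# Sketch: list numeric variable pairs whose absolute correlation exceeds 0.9
cor_mat<-cor(nyc%>%select(where(is.numeric)))
high<-which(abs(cor_mat)>0.9&upper.tri(cor_mat),arr.ind=TRUE)
data.frame(var1=rownames(cor_mat)[high[,1]],var2=colnames(cor_mat)[high[,2]],r=cor_mat[high])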
library(tidyverse)
library(here)
nyc<-read.csv(here('data','nyc.csv'))%>%select(-c(Case,Restaurant))
lm.list=list()
lm.list[["lm.base"]]<-lm(Price~.,data=nyc)
summary(lm.list[["lm.base"]])
##
## Call:
## lm(formula = Price ~ ., data = nyc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.0465 -3.8837 0.0373 3.3942 17.7491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.023800 4.708359 -5.102 9.24e-07 ***
## Food 1.538120 0.368951 4.169 4.96e-05 ***
## Decor 1.910087 0.217005 8.802 1.87e-15 ***
## Service -0.002727 0.396232 -0.007 0.9945
## East 2.068050 0.946739 2.184 0.0304 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.738 on 163 degrees of freedom
## Multiple R-squared: 0.6279, Adjusted R-squared: 0.6187
## F-statistic: 68.76 on 4 and 163 DF, p-value: < 2.2e-16
see transformations.Rmd for how to interpret residual plots when choosing transformations
library(regclass)
regclass::VIF(lm.list[['lm.base']])
## Food Decor Service East
## 2.714261 1.744851 3.558735 1.064985
see predictor_interaction.Rmd
ggResidpanel::resid_xpanel(lm.list[['lm.base']])
car::powerTransform(nyc%>%select(-c(Price,East)))
## Estimated transformation parameters
## Food Decor Service
## 0.748239 1.446262 1.729677
it looks like Decor might need a polynomial term
ggplot(nyc,aes(x=Decor,y=Price))+geom_point()+geom_smooth(method='loess',se=FALSE,color='blue')+geom_smooth(method='lm',color='red')
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Check for heteroscedasticity
library(lmtest)
lmtest::bptest(lm.list[['lm.base']])
##
## studentized Breusch-Pagan test
##
## data: lm.list[["lm.base"]]
## BP = 3.1186, df = 4, p-value = 0.5382
IF heteroscedasticity(model)==FAIL (Breusch-Pagan p < 0.05) SEE heteroscedasticity_FAIL.Rmd
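For reference, a minimal sketch of one common remedy, heteroscedasticity-consistent (HC3) standard errors via the sandwich package; heteroscedasticity_FAIL.Rmd may take a different approach:
# Sketch: re-test the coefficients with robust standard errors
library(sandwich)
lmtest::coeftest(lm.list[['lm.base']],vcov.=sandwich::vcovHC(lm.list[['lm.base']],type='HC3'))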
VIF
- The VIF of a predictor measures how easily that predictor can be predicted from a linear regression on the other predictors.
- A VIF above 10 indicates high correlation and is cause for concern; some authors suggest a more conservative threshold of 2.5.
- IF VIF(predictor) > threshold, best practice is to remove the predictor with the highest VIF and rerun the model.
library(regclass)
regclass::VIF(lm.list[['lm.base']])
## Food Decor Service East
## 2.714261 1.744851 3.558735 1.064985
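A minimal sketch of acting on that rule (illustrative only; none of these VIFs exceed 10):
# Sketch: drop the predictor with the highest VIF and refit
vifs<-regclass::VIF(lm.list[['lm.base']])
worst<-names(which.max(vifs))
lm.reduced<-update(lm.list[['lm.base']],as.formula(paste('~ . -',worst)))
summary(lm.reduced)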
lm.list[['lm.transform']]<-update(lm.list[['lm.base']],~.+I(Decor^2)) # I(Decor^2) adds the quadratic term; I(Decor)^2 would not
#summary(lm.list[['lm.transform']])
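An equivalent way to add the quadratic term is poly(); a minimal sketch (lm.poly is a hypothetical name, not part of the model list above):
# Sketch: orthogonal-polynomial version of the quadratic Decor model
lm.poly<-lm(Price~Food+poly(Decor,2)+Service+East,data=nyc)
summary(lm.poly)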
this will give you the best possible model for each number of predictors, from 1 up to the nvmax parameter
you can use the method parameter to choose forward or backward selection instead of the exhaustive search (see the sketch after the output below)
library(leaps)
max_model_variable=4
best_subset<-leaps::regsubsets(Price~.,data=nyc,nvmax=max_model_variable)
results<-summary(best_subset)
results
## Subset selection object
## Call: regsubsets.formula(Price ~ ., data = nyc, nvmax = max_model_variable)
## 4 Variables (and intercept)
## Forced in Forced out
## Food FALSE FALSE
## Decor FALSE FALSE
## Service FALSE FALSE
## East FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## Food Decor Service East
## 1 ( 1 ) " " "*" " " " "
## 2 ( 1 ) "*" "*" " " " "
## 3 ( 1 ) "*" "*" " " "*"
## 4 ( 1 ) "*" "*" "*" "*"
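A minimal sketch of the forward-selection variant mentioned above (method can also be 'backward' or 'seqrep'):
# Sketch: same search with forward stepwise selection instead of exhaustive
fwd_subset<-leaps::regsubsets(Price~.,data=nyc,nvmax=max_model_variable,method='forward')
summary(fwd_subset)$which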
tibble(predictors = 1:max_model_variable,
adj_R2 = results$adjr2,
Cp = results$cp,
BIC = results$bic) %>%
gather(statistic, value, -predictors) %>%
ggplot(aes(predictors, value, color = statistic)) +
geom_line(show.legend = F) +
geom_point(show.legend = F) +
facet_wrap(~ statistic, scales = "free")
paste('best adjR2: ',which.max(results$adjr2),"best BIC: ",which.min(results$bic),"best Cp: ",which.min(results$cp))
## [1] "best adjR2: 3 best BIC: 2 best Cp: 2"
Decide which number of variables is best by examining the curves
# 2 variable model
coef(best_subset,2)
## (Intercept) Food Decor
## -24.500156 1.646149 1.882015
lm.list[['lm.best']]<-lm(Price~Food+Decor,data=nyc)
F-test: a p-value below .05 indicates the model (or, for nested comparisons, the added terms) is significant
AIC and BIC: the model with the lowest value is best
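One way to apply the F-test criterion is a partial F-test between nested models; a minimal sketch comparing the two-variable model to the full model:
# Sketch: partial F-test (H0: the dropped terms add nothing)
anova(lm.list[['lm.best']],lm.list[['lm.base']])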
AICs<-do.call(AIC,unname(lm.list))$AIC
adjr2<-lapply(X=lm.list,
FUN=function(x) unlist(summary(x)$adj.r.squared))
(select_df<-data.frame(Models=names(lm.list),AIC=round(AICs,2), R2=round(unlist(unname(adjr2)),2)))
## Models AIC R2
## 1 lm.base 1070.71 0.62
## 2 lm.transform 1070.71 0.62
## 3 lm.best 1071.68 0.61