Simple Logistic Regression

\(m = \text{number of observations per row}\)

This means that each row in your data frame has more than one observation.

Additional requirements for simple logistic regression

\(m \neq 1\) is required; if \(m = 1\), USE simple_binary_logistic_regression.Rmd

library(here)
# load the aggregated Michelin data: one row per Food score
MichelinFood <- read.table(here('data','MichelinFood.txt'), header = TRUE)

Note:

each row has a count of \(m\) observations (InMichelin + NotInMichelin) for every Food score; therefore \(m\) is the number of observations per row and \(m \neq 1\)

MichelinFood[6,]
##   Food InMichelin NotInMichelin mi proportion
## 6   20          8            25 33       0.24
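As a quick sanity check (a minimal sketch using only the columns shown above), mi should equal InMichelin + NotInMichelin, and proportion should be InMichelin / mi:

with(MichelinFood[6, ], c(InMichelin + NotInMichelin, round(InMichelin / mi, 2)))
## [1] 33.00  0.24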

Build model

Use family = binomial in glm() to fit a logistic regression; with aggregated data the response is supplied as cbind(successes, failures).

glm.list <- list()
# response is cbind(successes, failures) because each row aggregates m observations
glm.list[['glm.base']] <- glm(formula = cbind(InMichelin, NotInMichelin) ~ Food,
                              family = binomial, data = MichelinFood)

summary(glm.list[['glm.base']])
## 
## Call:
## glm(formula = cbind(InMichelin, NotInMichelin) ~ Food, family = binomial, 
##     data = MichelinFood)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4850  -0.7987  -0.1679   0.5913   1.5889  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.84154    1.86236  -5.821 5.84e-09 ***
## Food          0.50124    0.08768   5.717 1.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 61.427  on 13  degrees of freedom
## Residual deviance: 11.368  on 12  degrees of freedom
## AIC: 41.491
## 
## Number of Fisher Scoring iterations: 4

Interpretation

The odds of success equal the exponential of the linear predictor, with the coefficients estimated by the glm:

\[\text{odds of success}=e^{\beta_0+\beta_1x_i}\]
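For example, using the fitted coefficients, we can compute the odds and the implied probability for a hypothetical Food score of 20 (a minimal sketch; the score is illustrative, not from the original text):

b <- coef(glm.list[['glm.base']])
food_score <- 20                       # hypothetical value
log_odds <- b[1] + b[2] * food_score   # beta_0 + beta_1 * x_i
exp(log_odds)                          # odds of success
exp(log_odds) / (1 + exp(log_odds))    # implied probability of success

The same probability can also be obtained with predict(glm.list[['glm.base']], newdata = data.frame(Food = 20), type = 'response').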

Examining Diagnostics

In logistic regression, deviance takes the place of the least-squares residual sum of squares.

The hypothesis test examining goodness of fit:

\(H_0\): the model we created fits the data well

\(H_A\): the model does not fit the data well enough, and a better model is needed

# residual deviance of the fitted model
deviance <- glm.list[["glm.base"]]$deviance
# residual degrees of freedom
df <- glm.list[["glm.base"]]$df.residual
# p-value: upper tail of the chi-squared distribution
pchisq(deviance, df, lower.tail = FALSE)
## [1] 0.4976357

Since the p-value is high, we CANNOT reject \(H_0\); there is no evidence that the model fits the data poorly.

RULE OF THUMB: if the deviance is less than the degrees of freedom, then the model is probably good.
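A one-line sketch of that rule of thumb for our model (here 11.368 < 12, so it passes):

# residual deviance should not exceed the residual degrees of freedom
glm.list[['glm.base']]$deviance < glm.list[['glm.base']]$df.residual
## [1] TRUE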

\(R^2_{dev}\) for logistic regression

The equivalent of \(R^2\) for logistic regression is \(R^2_{dev}\), which is based on the deviance instead of the sums of squares.

library(modEvA)
modEvA::Dsquared(glm.list[["glm.base"]])
## [1] 0.8149279
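The same value can be computed by hand as the proportion of the null deviance explained by the model (a sketch using the deviances reported in the summary above):

fit <- glm.list[['glm.base']]
# 1 - residual deviance / null deviance
1 - fit$deviance / fit$null.deviance
## [1] 0.8149279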

Marginal model plots

Visually assess how closely our model fits the data.

blue line: -a flexible smooth curve fit directly to the data

red line: -our model's fit (it doesn't seem to work well for single predictors for some reason; need to look into this more)

car::mmps(glm.list[["glm.base"]])
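A possible workaround (an untested assumption, not from the original notes): car::mmp() draws a single marginal model plot, by default against the fitted values, which may behave better with one predictor.

car::mmp(glm.list[['glm.base']])  # single plot against fitted values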

Overdispersion

if the residual deviance is large relative to the degrees of freedom, there are two explanations

MOST COMMON:

wrong structural form of model -wrong predictors, wrong transformations, outliers skewing model

OR

genuine overdispersion

Checks before relying on overdispersion to explain a large deviance

sparseness -the number of observations per row

library(dplyr)
MichelinFood %>% mutate(Total = InMichelin + NotInMichelin) %>% select(Total)
##    Total
## 1      1
## 2      1
## 3      8
## 4     15
## 5     18
## 6     33
## 7     26
## 8     12
## 9     18
## 10     7
## 11    12
## 12     2
## 13     7
## 14     4

Low counts for some levels of Food could lead to sparseness.
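As a rough check (a sketch; the cutoff of 5 is a common rule of thumb, not from the original text), count the Food levels with very few observations:

MichelinFood %>%
  mutate(Total = InMichelin + NotInMichelin) %>%
  filter(Total < 5) %>%
  nrow()
## [1] 4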

check for outliers with faraway::halfnorm

library(faraway)
faraway::halfnorm(residuals(glm.list[["glm.base"]]))

Observations that fall off the linear pattern have a large influence on the model; this looks OK.

If these checks are met

estimate the dispersion parameter from the Pearson residuals

\[\hat{\sigma}^2 = \frac{\sum_i r^2_{P,i}}{\text{residual degrees of freedom}}\]

(sigma2<-sum(residuals(glm.list[['glm.base']],type='pearson')^2)/glm.list[['glm.base']]$df.residual)
## [1] 0.9999572

In this case, the estimated dispersion parameter is essentially 1 (no overdispersion).

IF THERE IS DISPERSION

we can use the dispersion parameter to scale up the estimates of the standard errors:

-this will give a more accurate reading of the significance of the variables.

summary(glm.list[['glm.base']], dispersion=sigma2)
## 
## Call:
## glm(formula = cbind(InMichelin, NotInMichelin) ~ Food, family = binomial, 
##     data = MichelinFood)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4850  -0.7987  -0.1679   0.5913   1.5889  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.84154    1.86232  -5.822 5.83e-09 ***
## Food          0.50124    0.08767   5.717 1.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.9999572)
## 
##     Null deviance: 61.427  on 13  degrees of freedom
## Residual deviance: 11.368  on 12  degrees of freedom
## AIC: 41.491
## 
## Number of Fisher Scoring iterations: 4
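A commonly used alternative (a sketch, not part of the original workflow) is to refit the model with family = quasibinomial, which estimates the dispersion parameter from the data and inflates the standard errors automatically:

# same model, but the dispersion parameter is estimated rather than fixed at 1
glm.quasi <- glm(cbind(InMichelin, NotInMichelin) ~ Food,
                 family = quasibinomial, data = MichelinFood)
summary(glm.quasi)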