\(m = \text{number of observations per row}\)
This means that each row in your data frame has more than one observation.
Additional requirements for simple logistic regression
Requires m != 1; if m == 1, use simple_binary_logistic_regression.Rmd instead.
library(here)
MichelinFood <- read.table(here('data', 'MichelinFood.txt'), header = TRUE)
Note:
each row has a count of m observations (InMichelin + NotInMichelin) for each Food score; therefore m = number of observations per row, and m != 1.
MichelinFood[6,]
## Food InMichelin NotInMichelin mi proportion
## 6 20 8 25 33 0.24
Use family = binomial in glm() to fit a logistic regression.
glm.list <- list()
# response is a two-column matrix: (successes, failures) for each row
glm.list[['glm.base']] <- glm(formula = cbind(InMichelin, NotInMichelin) ~ Food,
                              family = binomial, data = MichelinFood)
summary(glm.list[['glm.base']])
##
## Call:
## glm(formula = cbind(InMichelin, NotInMichelin) ~ Food, family = binomial,
## data = MichelinFood)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4850 -0.7987 -0.1679 0.5913 1.5889
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.84154 1.86236 -5.821 5.84e-09 ***
## Food 0.50124 0.08768 5.717 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.427 on 13 degrees of freedom
## Residual deviance: 11.368 on 12 degrees of freedom
## AIC: 41.491
##
## Number of Fisher Scoring iterations: 4
The odds of success equal the exponential of the linear predictor, with the coefficients given by the glm:
\[\text{odds of success}=e^{\beta_0+\beta_1x_i}\]
In logistic regression, deviance takes the place of the least-squares residual sum of squares as the measure of fit.
The hypothesis test examining goodness of fit:
\(H_0\): the model we created fits the data well.
\(H_A\): the model does not fit the data well enough, and a better model is needed.
deviance <- glm.list[["glm.base"]]$deviance
# residual degrees of freedom
df <- glm.list[["glm.base"]]$df.residual
# p-value from the chi-squared distribution of the deviance
pchisq(deviance, df, lower.tail = FALSE)
## [1] 0.4976357
Since the p-value is high, we CANNOT reject \(H_0\): the model fits the data adequately.
RULE OF THUMB: if the residual deviance is less than the residual degrees of freedom, the model is probably adequate.
The equivalent of \(R^2\) for logistic regression is \(R^2_{dev}\), which is based on the deviance instead of least squares.
library(modEvA)
modEvA::Dsquared(glm.list[["glm.base"]])
## [1] 0.8149279
Visually assess how closely our model fits the data:
blue line: a smooth curve fit directly to the data
red line: our model's fit (it doesn't seem to work well for single predictors for some reason; need to look into this more)
car::mmps(glm.list[["glm.base"]])
If the deviance is large:
MOST COMMON: wrong structural form of the model - wrong predictors, wrong transformations, or outliers skewing the fit
OR overdispersion.
Checks to run before relying on overdispersion to explain a large deviance:
sparseness - the number of observations per row
library(dplyr)
MichelinFood %>%
  mutate(Total = InMichelin + NotInMichelin) %>%
  select(Total)
## Total
## 1 1
## 2 1
## 3 8
## 4 15
## 5 18
## 6 33
## 7 26
## 8 12
## 9 18
## 10 7
## 11 12
## 12 2
## 13 7
## 14 4
Low counts for some levels of Food could lead to sparseness.
Check for outliers with faraway::halfnorm:
library(faraway)
faraway::halfnorm(residuals(glm.list[["glm.base"]]))
Observations that fall off the linear pattern have a large influence on the model; this plot looks OK.
If these checks are met, estimate the dispersion parameter from the Pearson residuals:
\[\hat{\phi}=\frac{\sum r_{P,i}^2}{n-p}\]
where \(r_{P,i}\) are the Pearson residuals and \(n-p\) is the residual degrees of freedom.
(sigma2<-sum(residuals(glm.list[['glm.base']],type='pearson')^2)/glm.list[['glm.base']]$df.residual)
## [1] 0.9999572
In this case, the dispersion is essentially 1 (no overdispersion).
IF THERE IS OVERDISPERSION:
we can use the dispersion parameter to scale up the estimates of the standard errors;
this gives a more accurate reading of the significance of the variables.
summary(glm.list[['glm.base']], dispersion=sigma2)
##
## Call:
## glm(formula = cbind(InMichelin, NotInMichelin) ~ Food, family = binomial,
## data = MichelinFood)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4850 -0.7987 -0.1679 0.5913 1.5889
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.84154 1.86232 -5.822 5.83e-09 ***
## Food 0.50124 0.08767 5.717 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9999572)
##
## Null deviance: 61.427 on 13 degrees of freedom
## Residual deviance: 11.368 on 12 degrees of freedom
## AIC: 41.491
##
## Number of Fisher Scoring iterations: 4