Reminders:
Explain plots and graphs; don't just leave them on their own.
When interpreting a model, interpret every variable in the model.
For degrees of freedom in a drop-in-deviance test, be careful with categorical variables, which may contribute multiple betas.
Can't use the Wald test from summary() for categorical variables; use a drop-in-deviance test (LRT) instead.
Use "times as many" or "times as great" rather than "greater than" or "more".
Response is binomial (binary outcomes from independent trials with the same P(success) for every trial)
If X~Bin(n,p), then E(X) = np and Var(X) = np(1-p)
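For example, if X ~ Bin(10, 0.3), then E(X) = 10(0.3) = 3 and Var(X) = 10(0.3)(0.7) = 2.1.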
Odds = P(success)/P(failure) = (# successes/n)/(# failures/n) = # successes/# failures. If p is the P(success), then the odds of success are p/(1-p).
Log odds (also called logit) use the natural log: logit(p) = log(p/(1-p)).
Logistic regression: log(p/(1-p)) = B0 + B1x1 + B2x2 + ... + Bkxk. Use maximum likelihood to estimate the betas from the data.
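A quick numeric check of these definitions (a sketch, not part of the original notes):
p <- 0.75
p / (1 - p)      # odds of success: 0.75/0.25 = 3
log(p / (1 - p)) # log odds (logit): log(3), about 1.099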
library(faraway)
data(orings)
Use jitter(variable) to make overlapping points show up more in a plot: plot(y ~ jitter(x))
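For instance, a quick sketch with the orings data loaded above:
plot(damage ~ jitter(temp), data = orings) # damage counts vs. jittered temperature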
# glm(cbind(number of successes, number of failures) ~ x, family = binomial, data = data)
shuttlemod<- glm(cbind(damage,6-damage)~temp, family = binomial, data = orings)
summary(shuttlemod)
##
## Call:
## glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial,
## data = orings)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9529 -0.7345 -0.4393 -0.2079 1.9565
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.66299 3.29626 3.538 0.000403 ***
## temp -0.21623 0.05318 -4.066 4.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38.898 on 22 degrees of freedom
## Residual deviance: 16.912 on 21 degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 6
Model output: estimated log odds = 11.663 - 0.216(temperature), where log(p/(1-p)) = logit(p).
ilogit(w) (part of the faraway package) solves for p, where w is the estimated log odds from the regression: ilogit(w) = exp(w)/(1 + exp(w)) = 1/(1 + exp(-w)).
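For example, the model's estimated probability of damage at 65 degrees F (a sketch using the fitted coefficients above):
ilogit(11.66299 - 0.21623 * 65) # about 0.08
# equivalently: predict(shuttlemod, newdata = data.frame(temp = 65), type = "response")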
b1 represents the coefficient of the variable we are interpreting
Linear model: For every 1 unit increase in x, we expect a b1 increase in y
Poisson Regression: Every 1 unit increase in x is associated w/ a multiplicative change in the mean of y by a factor of exp(b1)
Logistic Regression: Every 1 unit increase in x is associated w/ a b1 unit increase in the log odds OR Every 1 unit increase in x is associated w/ a multiplicative change in the odds by a factor of exp(b1)
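The odds multiplier can be pulled straight from a fitted model. A quick sketch using the o-ring fit from above:
exp(coef(shuttlemod)["temp"]) # about 0.81: each 1 degree F increase multiplies the odds of damage by roughly 0.81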
6.8.1 1.
A. y = whether a student binge drinks; x = academic performance (grades)
B. y = whether they are accepted into medical school; x = GPA and MCAT scores
C. y = whether they marry the bio dad; x = sex of the baby
D. y = whether the student graduates; x = athletic participation
E. y = whether the person gets a cancer diagnosis; x = exposure to a particular chemical
# Logistic Regression
Workshop notes: https://irsaatumn.github.io/RWorkshop18/logistic-regression-in-r.html
Dropbox notes: https://www.dropbox.com/s/u9fcrrl2vvo7a3b/LogisticIntro.pdf?dl=0
crabs <- read.csv("http://www.cknudson.com/data/crabs.csv")
variable "y" is a binomial variable with 1 indicating a female crab having satellites and 0 indicating a female crab with no satellites.
weightmodel<- glm(y~weight, data= crabs, family = binomial)
summary(weightmodel)
##
## Call:
## glm(formula = y ~ weight, family = binomial, data = crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1108 -1.0749 0.5426 0.9122 1.6285
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6947264 0.8801975 -4.198 2.70e-05 ***
## weight 0.0018151 0.0003767 4.819 1.45e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 195.74 on 171 degrees of freedom
## AIC: 199.74
##
## Number of Fisher Scoring iterations: 4
Estimated log odds = -3.6947264 + 0.0018151(weight). Heavier crabs have higher chances of having satellites, since a 1 gram increase in the weight of a female crab is associated with a multiplicative change of 1.001817 in the odds of having a satellite.
library(faraway)
exp(0.0018151)
## [1] 1.001817
#probability of having satellites if weight = 2000 grams
ilogit(-3.6947264 + 0.0018151*2000)
## [1] 0.483874
Yes, there is a linear relationship between a female crab's weight and her log odds of having satellites. The p-value of the coefficient for weight is 1.45*10^-6, which is significant at the alpha = 0.05 level.
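Since weight is measured in grams, the per-gram odds multiplier is barely above 1; rescaling can make it easier to read (a sketch, not part of the original notes):
exp(0.0018151 * 100) # about 1.20: each additional 100 grams multiplies the odds of having satellites by roughly 1.20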
blowdown <- read.csv("http://www.cknudson.com/data/blowdown.csv")
deathmodel<-glm(y~D,data = blowdown, family = binomial)
summary(deathmodel)
##
## Call:
## glm(formula = y ~ D, family = binomial, data = blowdown)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6309 -0.9616 -0.7211 1.1495 1.7172
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.702112 0.082798 -20.56 <2e-16 ***
## D 0.097558 0.004846 20.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5057.9 on 3665 degrees of freedom
## Residual deviance: 4555.6 on 3664 degrees of freedom
## AIC: 4559.6
##
## Number of Fisher Scoring iterations: 4
predicted log odds of dying = -1.702112 + 0.097558 * Diameter
exp(0.097558)
## [1] 1.102475
Thicker trees have higher chances of dying, since a 1 cm increase in the diameter of the tree is associated with a multiplicative change of 1.102475 in the odds of dying.
#Probability of a 20 cm diameter tree dying:
ilogit(-1.702112 + 0.097558 * 20)
## [1] 0.5619422
Yes, there is a linear relationship between a tree's diameter and its log odds of dying. The p-value for diameter is less than 2*10^-16, which is significant at the 0.05 level.
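The same probability can be obtained with predict() instead of ilogit() (equivalent to the calculation above):
predict(deathmodel, newdata = data.frame(D = 20), type = "response") # about 0.56, matching the ilogit result above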
beer <- read.csv("http://www.cknudson.com/data/MNbeer.csv")
ABVmodel<- glm(Good~ABV, data= beer, family = binomial)
summary(ABVmodel)
##
## Call:
## glm(formula = Good ~ ABV, family = binomial, data = beer)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3847 -0.8206 -0.4663 0.7230 2.1320
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.7881 3.3959 -2.882 0.00395 **
## ABV 1.4662 0.5481 2.675 0.00747 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 51.564 on 43 degrees of freedom
## Residual deviance: 42.108 on 42 degrees of freedom
## AIC: 46.108
##
## Number of Fisher Scoring iterations: 5
exp(1.4662)
## [1] 4.332739
Beers with a higher ABV are more likely to be good, since a 1 percentage point increase in ABV is associated with a multiplicative change of 4.332739 in the odds of having a score that is at least 90.
Stargazer has an ABV of 5. Let's find its probability of having a score of at least 90:
ilogit(-9.7881 + 1.4662*5)
## [1] 0.07892089
Stargazer has a predicted probability of 0.0789 of having a score of at least 90.
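A Wald confidence interval for the odds multiplier comes from exponentiating the interval for the coefficient (a sketch; confint.default() uses the normal approximation):
exp(confint.default(ABVmodel)["ABV", ]) # roughly (1.5, 12.7)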
election <- read.csv("https://www.macalester.edu/~ajohns24/data/IMAdata1.csv")
election$Red <- as.numeric(election$StateColor == "red")
electionmodel<- glm(Red~per_capita_income, data= election, family = binomial)
summary(electionmodel)
##
## Call:
## glm(formula = Red ~ per_capita_income, family = binomial, data = election)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.617 -1.146 -0.753 1.157 2.073
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.649e+00 1.722e-01 9.575 <2e-16 ***
## per_capita_income -7.339e-05 7.214e-06 -10.173 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4352.6 on 3142 degrees of freedom
## Residual deviance: 4237.1 on 3141 degrees of freedom
## AIC: 4241.1
##
## Number of Fisher Scoring iterations: 4
exp(-7.339e-05)
## [1] 0.9999266
If a state has a high per capita income, it is less likely to be red, because a 1 unit increase in per capita income is associated with a multiplicative change of 0.9999 in the odds of the state being red.
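A per-unit multiplier this close to 1 is hard to read; rescaling helps (a sketch assuming per_capita_income is measured in dollars):
exp(-7.339e-05 * 10000) # about 0.48: each extra $10,000 of per capita income multiplies the odds of being red by roughly 0.48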
H0: no lack of fit; Ha: lack of fit.
pchisq(deviance(modelname), lower.tail = FALSE, df = n - #betas)
(df here is the model's residual degrees of freedom: n minus the number of betas.)
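For example, applying this to the o-ring model from above (df.residual() returns n - #betas, which is 21 here per the summary output):
pchisq(deviance(shuttlemod), df = df.residual(shuttlemod), lower.tail = FALSE)
# large p-value here, so no evidence of lack of fit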
*note: cbind() only needs to be used if your data is not already in binary form.
mod2 <- glm(cbind(damage, 6 - damage) ~ 1, data = orings, family = binomial)
dsmall <- deviance(mod2)
dbig <- deviance(shuttlemod)
(teststat <- dsmall - dbig)
df = #betas in big model - #betas in small model
pchisq(teststat, df = 1, lower.tail = FALSE)
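anova() performs the same drop-in-deviance test in one step (equivalent to the by-hand calculation above):
anova(mod2, shuttlemod, test = "Chisq")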
Drop-in-deviance tests perform better than Wald tests with smaller sample sizes.