Homework review:

Make sure to explain plots and graphs, not just leave them on their own.
When interpreting a model, you need to interpret every variable in the model.
For degrees of freedom in a drop-in-deviance test, be careful with categorical variables, which may contribute multiple betas.
Can't use the Wald tests in summary() for categorical variables; use a drop-in-deviance test (LRT) instead.
Use "times as many" or "times as great" rather than "greater than" or "more."

LOGISTIC REGRESSION

Response is binomial (binary, independent trials with the same P(success) for every trial)

If X~Bin(n,p), then E(X) = np and Var(X) = np(1-p)

Definition:

Odds = P(success)/P(failure) = (# successes/n)/(# failures/n) = # successes/# failures

If p is the P(success), then the odds of success are p/(1-p). For example, if p = 0.75, the odds are 0.75/0.25 = 3.

log odds (also called the logit; uses the natural log) = log(p/(1-p))

logistic regression: log(p/(1-p)) = B0 + B1x1 + B2x2 + ... + Bkxk

Use maximum likelihood to estimate the betas from the data.
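
A quick numeric check of the logit and its inverse in base R (no package needed; faraway's logit() and ilogit() do the same thing):

p <- 0.75                          # example P(success)
logodds <- log(p / (1 - p))        # logit(p) = log(odds)
exp(logodds) / (1 + exp(logodds))  # inverse logit recovers p = 0.75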

 library(faraway)
 data(orings)

Use jitter(variable) to make overlapping points show up more in a plot:

plot(y ~ jitter(x))
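
For example, with the orings data (damage out of 6 o-rings per launch, so damage/6 is the observed proportion; the axis labels are my own):

plot(damage/6 ~ jitter(temp), data = orings,
     xlab = "Temperature (F)", ylab = "Proportion of o-rings damaged")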

# glm(cbind(# successes, # failures) ~ x, family = binomial, data = data)
shuttlemod <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial, data = orings)
summary(shuttlemod)
## 
## Call:
## glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial, 
##     data = orings)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9529  -0.7345  -0.4393  -0.2079   1.9565  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 11.66299    3.29626   3.538 0.000403 ***
## temp        -0.21623    0.05318  -4.066 4.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38.898  on 22  degrees of freedom
## Residual deviance: 16.912  on 21  degrees of freedom
## AIC: 33.675
## 
## Number of Fisher Scoring iterations: 6

Model output: log odds = 11.663 - 0.216(temperature)

log(p/(1-p)) = logit(p). ilogit(w) (part of the faraway package) solves for p, where w is the estimated log odds from the regression:

ilogit(w) = exp(w)/(1 + exp(w)) = 1/(1 + exp(-w))
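
As a sketch (the temperature 50 is an arbitrary illustration value), the fitted probability of damage can be computed by hand or with predict():

ilogit(11.66299 - 0.21623 * 50)                                          # about 0.70
predict(shuttlemod, newdata = data.frame(temp = 50), type = "response")  # same value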

Interpreting Coefficients

b1 represents the coefficient of the variable we are interpreting

Linear model: For every 1 unit increase in x, we expect a b1 increase in y

Poisson Regression: Every 1 unit increase in x is associated w/ a multiplicative change in the mean of y by a factor of exp(b1)

Logistic Regression: Every 1 unit increase in x is associated w/ a b1 unit increase in the log odds OR Every 1 unit increase in x is associated w/ a multiplicative change in the odds by a factor of exp(b1)
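
Applying the odds interpretation to the shuttle model above:

exp(coef(shuttlemod))
# exp(-0.21623) is about 0.806 for temp: each 1 degree increase in temperature
# multiplies the odds of o-ring damage by about 0.81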

Example

6.8.1 1.
A. y = whether a student binge drinks; x = academic performance (grades)
B. y = whether they are accepted into medical school; x = GPA and MCAT scores
C. y = whether they marry the bio dad; x = sex of the baby
D. y = whether the student graduates; x = athletic participation
E. y = whether the person gets a cancer diagnosis; x = exposure to a particular chemical

New Document with Notes & Practice Problems

Logistic Regression in R: https://irsaatumn.github.io/RWorkshop18/logistic-regression-in-r.html
Dropbox notes: https://www.dropbox.com/s/u9fcrrl2vvo7a3b/LogisticIntro.pdf?dl=0

crabs <- read.csv("http://www.cknudson.com/data/crabs.csv")

variable "y" is a binomial variable with 1 indicating a female crab having satellites and 0 indicating a female crab with no satellites.

5.3.1 Female Horseshoe Crab Weight

weightmodel<- glm(y~weight, data= crabs, family = binomial)
summary(weightmodel)
## 
## Call:
## glm(formula = y ~ weight, family = binomial, data = crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1108  -1.0749   0.5426   0.9122   1.6285  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.6947264  0.8801975  -4.198 2.70e-05 ***
## weight       0.0018151  0.0003767   4.819 1.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 195.74  on 171  degrees of freedom
## AIC: 199.74
## 
## Number of Fisher Scoring iterations: 4

Estimated log odds = -3.6947264 + 0.0018151(weight)

Heavier crabs have higher chances of having satellites: a 1 gram increase in a female crab's weight is associated with a multiplicative change of 1.001817 in her odds of having a satellite.

library(faraway)
exp(0.0018151)
## [1] 1.001817
#probability of having satellites if weight = 2000 grams
ilogit(-3.6947264 + 0.0018151*2000)
## [1] 0.483874
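
The same probability without hand-copying coefficients, using predict():

predict(weightmodel, newdata = data.frame(weight = 2000), type = "response")
# matches the ilogit() calculation above (about 0.484)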

Yes, there is evidence of a linear relationship between a female crab's weight and her log odds of having satellites: the p-value for the weight coefficient is 1.45*10^-6, which is significant at the alpha = 0.05 level.

5.3.2 Boundary Water Blowdown

blowdown <- read.csv("http://www.cknudson.com/data/blowdown.csv")
deathmodel<-glm(y~D,data = blowdown, family = binomial)
summary(deathmodel)
## 
## Call:
## glm(formula = y ~ D, family = binomial, data = blowdown)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6309  -0.9616  -0.7211   1.1495   1.7172  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.702112   0.082798  -20.56   <2e-16 ***
## D            0.097558   0.004846   20.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5057.9  on 3665  degrees of freedom
## Residual deviance: 4555.6  on 3664  degrees of freedom
## AIC: 4559.6
## 
## Number of Fisher Scoring iterations: 4

predicted log odds of dying = -1.702112 + 0.097558 * Diameter

exp(0.097558)
## [1] 1.102475

Thicker trees have higher chances of dying: a 1 cm increase in tree diameter is associated with a multiplicative change of 1.102475 in the odds of dying.

#Probability of a 20 cm diameter tree dying:
ilogit(-1.702112 + 0.097558 * 20)
## [1] 0.5619422

Yes, there is evidence of a linear relationship between a tree's diameter and its log odds of dying: the p-value for diameter is less than 2*10^-16, which is significant at the 0.05 level.
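
To visualize the fitted relationship, the estimated probability curve can be drawn with base R's curve() (the 0-80 cm range is an assumption for illustration, not taken from the data):

curve(ilogit(-1.702112 + 0.097558 * x), from = 0, to = 80,
      xlab = "Diameter (cm)", ylab = "Estimated P(death)")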

5.3.3 Beer

beer <- read.csv("http://www.cknudson.com/data/MNbeer.csv")
ABVmodel<- glm(Good~ABV, data= beer, family = binomial)
summary(ABVmodel)
## 
## Call:
## glm(formula = Good ~ ABV, family = binomial, data = beer)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3847  -0.8206  -0.4663   0.7230   2.1320  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -9.7881     3.3959  -2.882  0.00395 **
## ABV           1.4662     0.5481   2.675  0.00747 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 51.564  on 43  degrees of freedom
## Residual deviance: 42.108  on 42  degrees of freedom
## AIC: 46.108
## 
## Number of Fisher Scoring iterations: 5
exp(1.4662)
## [1] 4.332739

Beers with a higher ABV are more likely to be good: a 1 percentage point increase in ABV is associated with a multiplicative change of 4.332739 in the odds of having a score of at least 90.

Stargazer has an ABV of 5. Let's find its probability of having a score of at least 90:

ilogit(-9.7881 + 1.4662*5)
## [1] 0.07892089

Stargazer has a predicted probability of 0.0789 of having a score of at least 90.
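
To put an interval around the 4.33 odds multiplier, one option is to exponentiate a Wald interval (confint.default() gives estimate ± 1.96 SE; profile-likelihood intervals from confint() are another option):

exp(confint.default(ABVmodel))
# the ABV row is exp(1.4662 ± 1.96 * 0.5481), roughly 1.5 to 12.7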

5.3.4 State Colors

election <- read.csv("https://www.macalester.edu/~ajohns24/data/IMAdata1.csv")
election$Red <- as.numeric(election$StateColor == "red")
electionmodel<- glm(Red~per_capita_income, data= election, family = binomial)
summary(electionmodel)
## 
## Call:
## glm(formula = Red ~ per_capita_income, family = binomial, data = election)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.617  -1.146  -0.753   1.157   2.073  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.649e+00  1.722e-01   9.575   <2e-16 ***
## per_capita_income -7.339e-05  7.214e-06 -10.173   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4352.6  on 3142  degrees of freedom
## Residual deviance: 4237.1  on 3141  degrees of freedom
## AIC: 4241.1
## 
## Number of Fisher Scoring iterations: 4
exp(-7.339e-05)
## [1] 0.9999266

If a state has a high per capita income, it is less likely to be red: a 1 unit increase in per capita income is associated with a multiplicative change of 0.9999 in the odds of the state being red.
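
Because a 1 unit change in income is tiny, the multiplier is easier to read on a larger scale; for example, per 1,000 units of per capita income:

exp(-7.339e-05 * 1000)
# about 0.929: each 1,000-unit increase multiplies the odds of being red by about 0.93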

Lack of Fit Test

H0: no lack of fit
Ha: lack of fit

pchisq(deviance(modelname), df = df.residual(modelname), lower.tail = FALSE)

The degrees of freedom are the residual degrees of freedom, n - (# betas), reported next to the residual deviance in summary().
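
Applied to the shuttle model (residual deviance 16.912 on 21 df in the summary above):

pchisq(deviance(shuttlemod), df = df.residual(shuttlemod), lower.tail = FALSE)
# p is large (about 0.72), so no evidence of lack of fit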

Drop In Dev

Note: cbind() only needs to be used if your data are not already in binary (0/1) form.

mod2 <- glm(cbind(damage, 6 - damage) ~ 1, data = orings, family = binomial)
dsmall <- deviance(mod2)        # deviance of the smaller (intercept-only) model
dbig <- deviance(shuttlemod)    # deviance of the bigger model
(teststat <- dsmall - dbig)

df = (# betas in big model) - (# betas in small model); here df = 2 - 1 = 1

pchisq(teststat, df = 1, lower.tail = FALSE)
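
The same comparison in one step with anova() (test = "LRT" requests the chi-squared / likelihood ratio comparison):

anova(mod2, shuttlemod, test = "LRT")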

Drop-in-deviance tests perform better than Wald tests with smaller sample sizes.