Logistic Regression

This week we started learning about the basics of logistic regression. This Rpub will be a review of all the things we covered.

In order to use logistic regression, we have to meet certain assumptions:

  1. The response must be dichotomous (only two possible responses)
  2. The observations must be independent of each other
  3. The variance of the response is np(1-p), which is highest at p = .5
  4. The log odds, log(p/(1-p)), must be a linear function of x

# Load the crab data
crabdat <- read.csv("http://www.cknudson.com/data/crabs.csv")
library(faraway)
# attach() lets later code refer to columns such as y and width by name
attach(crabdat)
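
As a quick numeric check of assumptions 3 and 4 from the list above, here is a minimal sketch in base R. The probabilities in p and the names bern_var and logit are my own illustrative choices and are not part of the crab analysis.

```r
# Illustrative probabilities (made up for this sketch)
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)

# Variance of a single Bernoulli trial (n = 1): p(1 - p)
bern_var <- p * (1 - p)
bern_var

# Log odds (logit): log(p / (1 - p)); base R's qlogis() computes the same thing
logit <- log(p / (1 - p))
all.equal(logit, qlogis(p))
```

The variance peaks at p = .5 and shrinks toward 0 as p approaches 0 or 1, while the log odds can take any value from negative to positive infinity.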

In this crab dataset there is a variable called ‘y’. This variable indicates whether a female crab had at least one satellite (1) or none (0).

Because this ‘y’ variable is dichotomous we can use logistic regression. For this example I will model ‘y’ using female crab carapace width.

modwidth <- glm(y~width, family = binomial)
summary(modwidth)
## 
## Call:
## glm(formula = y ~ width, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0281  -1.0458   0.5480   0.9066   1.6942  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.3508     2.6287  -4.698 2.62e-06 ***
## width         0.4972     0.1017   4.887 1.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 225.76  on 172  degrees of freedom
## Residual deviance: 194.45  on 171  degrees of freedom
## AIC: 198.45
## 
## Number of Fisher Scoring iterations: 4

First, we look at the p-value for the variable ‘width’. The p-value is very small (1.02e-06), which means there is a significant linear relationship between width and the log odds of a female having at least one satellite crab.

Next, it is important to lay out the log odds equation and analyze it.

For this model it is: log(odds) = -12.3508 + .4972(crab carapace width)

A 1 cm increase in width is associated with a .4972 increase in the log odds of ‘y’. Equivalently, the odds are multiplied by exp(.4972), or about 1.64.
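
To make the log odds interpretation concrete, here is a small sketch that converts the width coefficient from the summary output into an odds ratio. The names b0, b1, and odds_ratio are my own, and the coefficients are the rounded values printed above.

```r
# Coefficients copied from the summary(modwidth) output above
b0 <- -12.3508
b1 <- 0.4972

# Exponentiating the slope turns an additive change in log odds
# into a multiplicative change in odds
odds_ratio <- exp(b1)
odds_ratio

# Check: odds at 29 cm divided by odds at 28 cm equals exp(b1)
odds_28 <- exp(b0 + b1 * 28)
odds_29 <- exp(b0 + b1 * 29)
odds_29 / odds_28
```

Both values come out to roughly 1.64, so each extra centimeter of width multiplies the estimated odds of having a satellite by about 1.64.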

Because the coefficient for ‘width’ is positive, the wider a female’s carapace is, the higher the chance that she will have at least 1 satellite crab.

Let’s say a female crab has a carapace width of 28 cm…

width1 <- 28
ans <- -12.3508 + .4972*(width1)

prob <- exp(ans)/(1 + exp(ans))
prob
## [1] 0.8278976

A female crab with a carapace width of 28 cm has approximately an 82.79% chance of having at least 1 satellite crab.
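
The same conversion from log odds to probability can also be done with base R's plogis(), which is the inverse logit. This is only a sketch of an alternative to the manual formula. The names log_odds_28, manual, and builtin are ones I made up.

```r
# Log odds for a 28 cm carapace, using the rounded coefficients from the summary
log_odds_28 <- -12.3508 + 0.4972 * 28

# Manual inverse logit, as in the calculation above
manual <- exp(log_odds_28) / (1 + exp(log_odds_28))

# plogis() is base R's logistic distribution function
builtin <- plogis(log_odds_28)

all.equal(manual, builtin)
```

If modwidth is still in the workspace, predict(modwidth, newdata = data.frame(width = 28), type = "response") should give nearly the same probability. Any small difference would come from the rounded coefficients used here.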

Creating a fictional list of possible widths and then plotting that with their corresponding probabilities of having at least 1 satellite crab provides this chart:

width1 <- c(15,17,19,21,23,24,26,28,30,31,33,37)
ans <- -12.3508 + .4972*(width1)

prob <- exp(ans)/(1 + exp(ans))

plot(width1,prob, type = "l", ylab = "Probability", xlab = "Width (cm)" )

I think this does a nice job of visually representing the relationship of width and the probability of having at least one satellite.
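
As a possible refinement, the sketch below draws the same curve on an evenly spaced grid and marks the width where the fitted probability crosses .5. The names width_grid, prob_grid, and midpoint are my own, and the 0.5 cm spacing is an arbitrary choice.

```r
# Rounded coefficients from the summary(modwidth) output
b0 <- -12.3508
b1 <- 0.4972

# Evenly spaced widths over the same range as the fictional list
width_grid <- seq(15, 37, by = 0.5)
prob_grid <- plogis(b0 + b1 * width_grid)

# Width at which the fitted probability equals 0.5 (log odds = 0)
midpoint <- -b0 / b1
midpoint

plot(width_grid, prob_grid, type = "l",
     ylab = "Probability", xlab = "Width (cm)")
abline(v = midpoint, lty = 2)  # dashed line at the 50% width
```

The dashed line falls at about 24.8 cm. Below that width the model gives a female less than a 50% chance of having a satellite, and above it more than 50%.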

Probability Review

The last thing I want to do this week is review a few probability concepts.

“Odds” is always a confusing term for me because growing up it was synonymous with the probability of something happening. However, in the statistical world, the word means something a little different.

Odds = #successes/#failures or Odds = p/(1-p)

If I go to the gym, shoot 100 free throws, and make 63 of them, the probability of me making a free throw is .63.

However, the odds of me making a free throw are .63/(1-.63), or approximately 1.703 to 1.
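
The free throw example can be checked in a few lines of R. This is just a sketch, and the names made, missed, p, and odds are my own.

```r
made <- 63
missed <- 100 - made

p <- made / (made + missed)   # probability of a make
odds <- p / (1 - p)           # same as made / missed

odds
# Converting odds back to a probability
odds / (1 + odds)
```

The two definitions agree: 63 makes to 37 misses gives the same odds as .63/(1-.63), and odds/(1 + odds) recovers the original probability of .63.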