This week we started learning about the basics of logistic regression. This Rpub will be a review of all the things we covered.
In order to use logistic regression, we have to meet certain assumptions. The one that matters most for this example is that the response variable is dichotomous (binary).
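# Read the crab data from the course site
crabdat <- read.csv("http://www.cknudson.com/data/crabs.csv")
library(faraway)
# attach() lets the columns (y, width) be used by name in the code below
attach(crabdat)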
In this crab dataset there is a variable called ‘y’. This variable indicates if a crab did indeed have satellites (1) or not (0).
Because this ‘y’ variable is dichotomous we can use logistic regression. For this example I will model ‘y’ using female crab carapace width.
modwidth <- glm(y~width, family = binomial)
summary(modwidth)
##
## Call:
## glm(formula = y ~ width, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0281 -1.0458 0.5480 0.9066 1.6942
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3508 2.6287 -4.698 2.62e-06 ***
## width 0.4972 0.1017 4.887 1.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 194.45 on 171 degrees of freedom
## AIC: 198.45
##
## Number of Fisher Scoring iterations: 4
First, we look at the p-value for the variable ‘width’. It is very small (about 1.02e-06), which means there is a statistically significant linear relationship between width and the log odds of a female having at least one satellite crab.
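As a rough check on where that p-value comes from: the z value in the summary is the estimate divided by its standard error, and the p-value is the two-sided tail area of the standard normal distribution. The sketch below redoes that arithmetic with the rounded numbers copied from the output above, so it will only match the summary approximately. The variable names `est`, `se`, `z`, and `p_value` are my own.

```r
# Rounded values copied from the summary output above
est <- 0.4972   # estimated coefficient for width
se  <- 0.1017   # its standard error

# Wald z statistic: estimate divided by standard error
z <- est / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

z
p_value
```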
Next, it is important to lay out the log odds equation and analyze it.
For this model it is: log(odds) = -12.3508 + .4972(crab carapace width)
A 1 cm increase in width is associated with an additive increase of 0.4972 in the log odds of ‘y’. Equivalently, the odds are multiplied by exp(0.4972), or about 1.64, for each additional cm of width.
Because the coefficient for ‘width’ is positive, the wider a female's carapace is, the higher the chance that she will have at least 1 satellite crab.
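To put the coefficient on the odds scale, it can be exponentiated. This is a small sketch that uses the rounded estimate from the summary output rather than the fitted model object. The names `b_width` and `odds_ratio` are just for illustration.

```r
# Rounded width coefficient copied from the summary output
b_width <- 0.4972

# Exponentiating a log odds coefficient gives an odds ratio:
# the factor the odds are multiplied by for each extra 1 cm of width
odds_ratio <- exp(b_width)
odds_ratio   # roughly 1.64

# For a 2 cm increase the factor compounds: exp(2 * b) = exp(b)^2
exp(2 * b_width)
```

So the log odds go up by a fixed amount per cm, while the odds go up by a fixed factor per cm.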
Let's say a female crab has a carapace width of 28 cm…
width1 <- 28
ans <- -12.3508 + .4972*(width1)
prob <- exp(ans)/(1 + exp(ans))
prob
## [1] 0.8278976
A female crab with a carapace width of 28 cm has approximately an 82.79% chance of having at least 1 satellite crab.
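The same conversion from log odds to probability can be done with base R's `plogis()`, which is the inverse logit function. The sketch below compares it with the hand calculation, again using the rounded coefficients from the summary. The names `b0`, `b1`, and `eta` are my own.

```r
# Rounded coefficients copied from the summary output
b0 <- -12.3508   # intercept
b1 <- 0.4972     # slope for width

# Log odds for a crab with a 28 cm carapace
eta <- b0 + b1 * 28

# Inverse logit by hand and with the built-in function
by_hand     <- exp(eta) / (1 + exp(eta))
with_plogis <- plogis(eta)

by_hand
with_plogis
all.equal(by_hand, with_plogis)   # TRUE
```

With the fitted model in memory, `predict(modwidth, newdata = data.frame(width = 28), type = "response")` should give nearly the same value. It may differ in the later decimal places because the model uses unrounded coefficients.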
Creating a fictional list of possible widths and plotting them against their corresponding probabilities of having at least 1 satellite crab gives this chart:
width1 <- c(15,17,19,21,23,24,26,28,30,31,33,37)
ans <- -12.3508 + .4972*(width1)
prob <- exp(ans)/(1 + exp(ans))
plot(width1,prob, type = "l", ylab = "Probability", xlab = "Width (cm)" )
I think this does a nice job of visually representing the relationship of width and the probability of having at least one satellite.
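A possible variation on the plot: using an evenly spaced grid of widths gives a smoother curve, and the width where the predicted probability crosses 50% can be marked. That crossing happens where the log odds equal 0, which is at -intercept/slope. This is only a sketch using the rounded coefficients, and the names `width_grid`, `prob_grid`, and `width_50` are my own.

```r
# Rounded coefficients copied from the summary output
b0 <- -12.3508
b1 <- 0.4972

# Evenly spaced fictional widths over the same range as before
width_grid <- seq(15, 37, by = 0.1)
prob_grid  <- plogis(b0 + b1 * width_grid)

plot(width_grid, prob_grid, type = "l",
     ylab = "Probability", xlab = "Width (cm)")

# Width at which the log odds are 0, i.e. the probability is 0.5
width_50 <- -b0 / b1
abline(v = width_50, lty = 2)   # dashed vertical line at that width
abline(h = 0.5, lty = 2)        # dashed horizontal line at 50%

width_50
```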
The last thing I want to do this week is review a few probability concepts.
“Odds” is always a confusing term for me because growing up it was synonymous with the probability of something happening. However, in the statistical world, the word means something a little different.
Odds = #successes/#failures or Odds = p/(1-p)
If I go to the gym, shoot 100 free throws, and make 63 of them, the probability of me making a free throw is .63.
However, the odds of me making a free throw are .63/(1-.63), or approximately 1.703 to 1.
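The free throw example can be checked with a couple of small helper functions. These helpers (`prob_to_odds` and `odds_to_prob`) are hypothetical names I made up for this sketch and do not come from any package.

```r
# Hypothetical helpers for converting between probability and odds
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

# 63 makes out of 100 free throws
p_make    <- 63 / 100
odds_make <- prob_to_odds(p_make)

odds_make                 # about 1.703
63 / 37                   # same thing: successes divided by failures
odds_to_prob(odds_make)   # converts back to 0.63

# The log of the odds is the scale that logistic regression models
log(odds_make)
```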