Stats 155 Class Notes 2012-11-19

Musical Theme for the Day

Question: What does 5 to 9 mean?

Modeling of Categorical Variables

We're going to do the simplest case: two-level variables.

Points to keep in mind as we study logistic regression

Example:

Just so that you'll see that it's easy to do, let's do a simple model: Can we predict the sex of one of the kids from the width and length of the foot?

mod = glm(sex ~ width + length, data = KidsFeet, family = "binomial")
summary(mod)
## 
## Call:
## glm(formula = sex ~ width + length, family = "binomial", data = KidsFeet)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.606  -1.030  -0.563   1.017   1.656  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   18.139      8.155    2.22    0.026 *
## width         -1.649      0.967   -1.71    0.088 .
## length        -0.136      0.353   -0.39    0.699  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 54.040  on 38  degrees of freedom
## Residual deviance: 47.071  on 36  degrees of freedom
## AIC: 53.07
## 
## Number of Fisher Scoring iterations: 4
head(fitted(mod))  # probabilities
##      1      2      3      4      5      6 
## 0.7246 0.5429 0.2334 0.1901 0.5120 0.2054

It's not obvious from the report whether these are the probability of being a girl or the probability of being a boy. Here's a plot to help out:

bwplot(fitted(mod) ~ sex, data = KidsFeet)

plot of chunk unnamed-chunk-3

It seems that the probability refers to being a girl, since the girls have a higher probability than the boys.

The functions themselves have a nice, graceful shape which you may recognize from Math 135 as being an S-function.

f = makeFun(mod)
plotFun(f(width = 10, length) ~ length, length.lim = range(0, 40))

plot of chunk unnamed-chunk-4

plotFun(f(width = width, length = length) ~ width & length, length.lim = c(10, 
    30), width.lim = c(7, 12))

plot of chunk unnamed-chunk-4

ACTIVITY: Try adding in an interaction term, etc.

mod2 = glm(sex ~ length * width, data = KidsFeet, family = "binomial")
f2 = makeFun(mod2)
plotFun(f2(width = width, length = length) ~ width & length, length.lim = c(20, 
    30), width.lim = c(7, 10))
plotPoints(length ~ width, data = KidsFeet, add = TRUE, col = KidsFeet$sex)

plot of chunk unnamed-chunk-5

The Zero-One/TRUE-FALSE encoding

Look at the data in oring-damage.csv, which is about the space shuttle. The damage variable has been encoded as 0 or 1.

oring = fetchData("oring-damage.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/oring-damage.csv
mod = glm(damage ~ temp, data = oring, family = "binomial")
f = makeFun(mod)
plotFun(f(temp) ~ temp, temp.lim = range(0, 100))

plot of chunk unnamed-chunk-6

Activity: Add in quadratic or cubic terms or stepwise terms and see how that changes the shape of the function. For instance:

mod2 = glm(damage ~ poly(temp, 2) + (temp < 70), data = oring, family = "binomial")
f2 = makeFun(mod2)
plotFun(f2(temp) ~ temp, temp.lim = range(0, 100))

plot of chunk unnamed-chunk-7

The function doesn't need to be a simple step.

Goals for this Unit

Since fitting and the report generation is trivial, and the report interpretation is superficially the same, we don't need to spend much time on that. Instead, we'll spend a bit of time on interpretation in a common setting: Case-Control studies.

We tend to think about three main kinds of studies:

But there is another kind of study that people don't usually think about but that can be very useful in certain circumstances.

There is a group that wants to do its survey about health outcomes for college students and the factors that determine them.