Monday April 15, 2013 Stats 155 Class Notes

Carson wrote: I've encountered some problems in grading Friday's quizzes, mainly regarding how to give partial credits to the last question, which asks to state the Null Hypothesis:

Especially for 1) and 2), I feel that the answers imply strongly the causal direction of the relationship, which should not be found in a null hypothesis.

Eating variance: the age-at-marriage problem

Eating Up the Variance

EXAMPLE: Difference in age between husband and wife in couples getting married.

Ask the class: Who is older in a married couple, the man or the woman? By how much?

Let's see if the data support this:

m = fetchData("marriage.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/marriage.csv
mod0 = mm(Age ~ Person, data = m)
mod0
## Groupwise Model.
## Call:
## Age ~ Person
## 
## Coefficients:
## Bride  Groom  
##  33.2   35.8
confint(mod0)
##   group 2.5 % 97.5 %
## 1 Bride 29.15  37.33
## 2 Groom 31.70  39.87
mod1 = lm(Age ~ Person, data = m)
summary(mod1)
## 
## Call:
## lm(formula = Age ~ Person, data = m)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.42 -12.02  -3.34   8.80  39.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    33.24       2.06   16.13   <2e-16 ***
## PersonGroom     2.55       2.91    0.87     0.38    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 14.4 on 96 degrees of freedom
## Multiple R-squared: 0.00789, Adjusted R-squared: -0.00245 
## F-statistic: 0.763 on 1 and 96 DF,  p-value: 0.384

The point estimate is about right, but the margin of error is so large that we can't take this estimate very seriously. The p-value is so large that we can't reject the null that there is no relationship between Person and age.

Try adding in some other variables, astrological sign, years of education, etc. and show that this doesn't help much.

Finally, add in the BookpageID variable.

mod2 = lm(Age ~ Person + BookpageID, data = m)
anova(mod2)
## Analysis of Variance Table
## 
## Response: Age
##            Df Sum Sq Mean Sq F value Pr(>F)    
## Person      1    159     159    9.07 0.0041 ** 
## BookpageID 48  19127     398   22.77 <2e-16 ***
## Residuals  48    840      18                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
head(confint(mod2))
##                        2.5 % 97.5 %
## (Intercept)          18.7105 30.728
## PersonGroom           0.8461  4.245
## BookpageIDB230p1354  -9.1763  7.648
## BookpageIDB230p1665  -6.6544 10.169
## BookpageIDB230p1948 -13.7133  3.111
## BookpageIDB230p539   -3.7393 13.085

This gives an individual ID to each marriage. Putting this in the model effectively holds the couple constant when considering the Person. In terms of ANOVA, BookpageID is eating up lots and lots of variance.

A Simple Experiment

Suppose that we've just invented a device that makes cars safer, and which we expect to reduce deaths by 10%. That would be a big improvement. We're going to do an experiment:

We already know something about Planet Alt: that the death rate will be lower by 10%. Now we need to do a little bit more work to make a Planet Alt that's relevant to the data we will actually be creating on Planet Earth. Death rates for vehicular accidents are given at the NHTSA web site http://www-fars.nhtsa.dot.gov/Main/index.aspx.

The site indicates that in 2010, there were 15.65 vehicle accident deaths per 100,000 drivers per year. We don't actually know that this rate will be relevant to our experiment.

Imagine that we collect data with \( n=1000 \) sites. (This corresponds to 100,000 cars in each of the treatment and control groups.) The death rate at each site should be 15.65/1000.

n = 1000
ourdata = data.frame(treatment = rpois(n, lambda = 0.9 * 15.65/1000), control = rpois(n, 
    lambda = 15.65/1000))
require(reshape2)
## Loading required package: reshape2
ourdata = melt(ourdata)
## Using as id variables
summary(lm(value ~ variable, data = ourdata))
## 
## Call:
## lm(formula = value ~ variable, data = ourdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.015 -0.015 -0.011 -0.011  0.989 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.01500    0.00358    4.19    3e-05 ***
## variablecontrol -0.00400    0.00507   -0.79     0.43    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.113 on 1998 degrees of freedom
## Multiple R-squared: 0.000312,    Adjusted R-squared: -0.000189 
## F-statistic: 0.623 on 1 and 1998 DF,  p-value: 0.43

Have everyone in the class do this and see how many of them were able to reject the null hypothesis.

How many sites do we need to use to get 90% power?

Looking at the Standard Error

Another way to think about this problem is to ask what is the precision of our estimate of the number of deaths. It needs to be substantially smaller than the anticipated reduction in the death rate. That reduction is 10% of 15.65/100 or 0.0016

summary(lm(rpois(1000, lambda = 0.9 * 15.65/1000) ~ 1))
## 
## Call:
## lm(formula = rpois(1000, lambda = 0.9 * 15.65/1000) ~ 1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.011 -0.011 -0.011 -0.011  0.989 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0110     0.0033    3.33  0.00089 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.104 on 999 degrees of freedom

This is too large. We should try to get it much smaller, perhaps one-sixth as large. That will necessitate collecting 36 times as much data.

Do the power calculation for that size.

QUESTION: Given the number of sites, is there a practical way to do the experiment? How could we do the experiment in a less costly way? (Find a high-risk group of cars. Find an endpoint other than deaths, say injuries.)

Here's a bit of information for Pennsylvania.

How big would the study need to be if studied injuries rather than deaths?

Win/Lose, Yes/No, Life & Death

Fugue for Tinhorns from Guys and Dolls.

Review of Odds

An introduction to being a bookie.

Who is going to win the World Series? Solicit answers. Then go through the class and get people to lay a $1 or $2 bet on their choice.

For each choice, the odds are simply calculated: the amount on all the other choices divided by the amount on the choices itself.

If you win, you get your bet times the odds. (You also keep your original money.) If you lose, you pay up.

Show that I, the bookie, can't lose money — the payout in every case must be exactly equal to the amount paid by the people who lose.

By shaving the odds a bit, I can guarantee to make money. (At least, if I can collect from the losers.)

In practice, it works like this:

So, with a bit of seed capital, and some muscle for enforcement, I can make money.

Log odds: translate odds (which run 0 to \( \infty \)) into a variable with support from \( -\infty \) to \( \infty \).

An explanation resource on Dutch book