Stats 155 Class Notes 2012-11-16

The Power and Significance App

fetchData("mHypTest.R")
mHypTest()  # for a t-test.

Show how low degrees of freedom in the residual and high residual size lead to broad tails in the null distribution.

Turn on the alternative hypothesis and show how bigger effect size increases the power, as does a larger amount of data.

Now with the F distribution

mHypTest(TRUE)  # for an F distribution

Effect size is in terms of \( R^2 \).

The F Distribution

Simulation with random explanatory vectors. Vary the degrees of freedom in the residual to show how small degrees of freedom require F to be big to reject the null.

Show the F distribution analytically, that is, with the df() operator.
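
A minimal sketch of the analytic approach (this code is mine, not from the original notes): df() gives the F density and qf() gives the critical value that F must exceed to reject the null.

fvals = seq(0, 6, by = 0.01)
plot(fvals, df(fvals, df1 = 3, df2 = 100), type = "l", xlab = "F", ylab = "density")
lines(fvals, df(fvals, df1 = 3, df2 = 5), lty = 2)  # small residual df: fatter right tail
qf(0.95, df1 = 3, df2 = c(2, 5, 10, 100))  # the critical F shrinks as residual df grows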

Compare F and t: \( t^2 \) is F. One-sided versus two-sided tests.
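
A quick numeric check of this (my example, with 10 residual degrees of freedom): the one-sided tail of F matches the two-sided tail of t.

2 * pt(2, df = 10, lower.tail = FALSE)  # two-sided t p-value at t = 2
pf(2^2, df1 = 1, df2 = 10, lower.tail = FALSE)  # one-sided F p-value at F = 4
# Both are about 0.073.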

Eating Up the Variance

The F statistic compares the “credit” earned by a model term to the mean square residual, which can be interpreted as the credit that would be earned by a junky random term.
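
In symbols, F is the mean square of the term divided by the mean square of the residual. In the first table below, for instance, the F value for sector is its mean square, 367, divided by the residual mean square, 10805/524, which is about 20.6, giving 17.8.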

Fit a model and add in some random terms. Show that the F for the random terms is about 1 and that the mean square of the residual is hardly changed by the random terms.

mod0 = lm(wage ~ sector + sex + exper, data = CPS85)
anova(mod0)
## Analysis of Variance Table
## 
## Response: wage
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## sector      7   2572     367    17.8 < 2e-16 ***
## sex         1    436     436    21.1 5.3e-06 ***
## exper       1    264     264    12.8 0.00037 ***
## Residuals 524  10805      21                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The mean square residual is about 20.

Now throw in some junk:

mod10 = lm(wage ~ sector + sex + exper + rand(10), data = CPS85)
anova(mod10)
## Analysis of Variance Table
## 
## Response: wage
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## sector      7   2572     367   17.77 < 2e-16 ***
## sex         1    436     436   21.09 5.5e-06 ***
## exper       1    264     264   12.79 0.00038 ***
## rand(10)   10    180      18    0.87 0.56047    
## Residuals 514  10625      21                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But what if a term eats up more variance than a junky term would? Such a term makes it easier for the other terms to show significance.

Example: Difference in age between husband and wife in couples getting married.

Ask: Who is older in a married couple, the man or the woman? By how much?

Let's see if the data support this:

m = fetchData("marriage.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/marriage.csv
mod0 = mm(Age ~ Person, data = m)
mod0
## Groupwise Model.
## Call:
## Age ~ Person
## 
## Coefficients:
## Bride  Groom  
##  33.2   35.8
confint(mod0)
##   group 2.5 % 97.5 %
## 1 Bride 29.15  37.33
## 2 Groom 31.70  39.87
mod1 = lm(Age ~ Person, data = m)
summary(mod1)
## 
## Call:
## lm(formula = Age ~ Person, data = m)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.42 -12.02  -3.34   8.80  39.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    33.24       2.06   16.13   <2e-16 ***
## PersonGroom     2.55       2.91    0.87     0.38    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 14.4 on 96 degrees of freedom
## Multiple R-squared: 0.00789, Adjusted R-squared: -0.00245 
## F-statistic: 0.763 on 1 and 96 DF,  p-value: 0.384

The point estimate is about right, but the margin of error is so large that we can't take the estimate very seriously. The p-value is so large that we can't reject the null that there is no relationship between Person and Age.

Try adding in some other variables (astrological sign, years of education, etc.) and show that this doesn't help much.

Finally, add in the BookpageID variable.

mod2 = lm(Age ~ Person + BookpageID, data = m)
anova(mod2)
## Analysis of Variance Table
## 
## Response: Age
##            Df Sum Sq Mean Sq F value Pr(>F)    
## Person      1    159     159    9.07 0.0041 ** 
## BookpageID 48  19127     398   22.77 <2e-16 ***
## Residuals  48    840      18                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
head(confint(mod2))
##                        2.5 % 97.5 %
## (Intercept)          18.7105 30.728
## PersonGroom           0.8461  4.245
## BookpageIDB230p1354  -9.1763  7.648
## BookpageIDB230p1665  -6.6544 10.169
## BookpageIDB230p1948 -13.7133  3.111
## BookpageIDB230p539   -3.7393 13.085

This gives an individual ID to each marriage. Putting it in the model effectively holds the couple constant when considering the Person effect. In terms of ANOVA, BookpageID is eating up lots and lots of variance.
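
As a cross-check, assuming (as the ANOVA table suggests) that each BookpageID identifies one bride-groom pair, we can compute the within-couple age differences directly. This code is mine, not from the original notes.

brides = subset(m, Person == "Bride")
grooms = subset(m, Person == "Groom")
both = merge(brides, grooms, by = "BookpageID")  # Age.x is the bride, Age.y the groom
mean(both$Age.y - both$Age.x)  # within-couple age difference
t.test(both$Age.y - both$Age.x)  # the paired version of the same comparison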

Example: Heating and Electricity Use

In theory, the electricity used in a house for lighting, computers, etc. should also be heating the house; eventually the energy in the electricity is transformed into heat. But how much? Perhaps the amount of electricity is small compared to the amount of heat.

Let's look just at winter months, November through March, when heating is used pretty steadily:

utils = fetchData("utilities.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/utilities.csv
winter = subset(utils, temp < 55 & ccf > 45 & month != 10)
bwplot(gasbill/elecbill ~ as.factor(month), data = winter)

[Figure: box-and-whisker plot of gasbill/elecbill by month]

It looks like roughly twice as much is paid for gas as electricity. So there's more money going into heating than lighting and other electricity uses. Considering that electricity is a much more expensive form of energy than natural gas, this suggests that there is much more energy being used for heating than for lighting.

But since energy is a physical quantity, there is a very good theory of it. In fact, both kWh and ccf are forms of energy, so there is a conversion between them. (Actually, ccf is a volume, but since the volumetric energy density of natural gas varies only by a few percent depending on other factors, we can treat it approximately as an energy.) Looking it up in Wikipedia indicates that a ccf is roughly 29.3 kWh.

bwplot(kwh/(29.3 * ccf) ~ as.factor(month), data = winter)

[Figure: box-and-whisker plot of kwh/(29.3 * ccf) by month]

Electricity energy use is about 10-20% of gas energy use in the winter months. This suggests that we should be able to detect the electricity use. It should show up as a negative coefficient on kwh in a model of ccf, with a magnitude of about \( 1/29.3 \approx 0.034 \) (the conversion factor between kwh and ccf).

mod1 = lm(ccf ~ kwh, data = winter)
summary(mod1)
## 
## Call:
## lm(formula = ccf ~ kwh, data = winter)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -82.38 -48.21   6.18  40.41 100.65 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.4193    31.9863    3.64   0.0007 ***
## kwh           0.0420     0.0436    0.96   0.3407    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 54.6 on 45 degrees of freedom
## Multiple R-squared: 0.0202,  Adjusted R-squared: -0.00158 
## F-statistic: 0.928 on 1 and 45 DF,  p-value: 0.341
anova(mod1)
## Analysis of Variance Table
## 
## Response: ccf
##           Df Sum Sq Mean Sq F value Pr(>F)
## kwh        1   2763    2763    0.93   0.34
## Residuals 45 134066    2979

Alas, we can't reject the null. Note that the two reports give the same information: there is only one explanatory vector, so there is no “team” effect for ANOVA to help us understand.

But there are other factors that affect natural gas use, and we may be able to explain some of these. Doing so will reduce the size of the residuals and may make it easier to see the effect of electricity use on natural gas use.

mod2 = lm(ccf ~ kwh + temp, data = winter)
summary(mod2)
## 
## Call:
## lm(formula = ccf ~ kwh + temp, data = winter)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -44.82 -10.49  -3.72  12.33  45.12 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 299.4614    14.9283   20.06   <2e-16 ***
## kwh          -0.0225     0.0155   -1.46     0.15    
## temp         -4.3683     0.2395  -18.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 18.9 on 44 degrees of freedom
## Multiple R-squared: 0.886,   Adjusted R-squared: 0.88 
## F-statistic:  170 on 2 and 44 DF,  p-value: <2e-16
anova(mod2)
## Analysis of Variance Table
## 
## Response: ccf
##           Df Sum Sq Mean Sq F value Pr(>F)    
## kwh        1   2763    2763    7.77 0.0078 ** 
## temp       1 118411  118411  332.81 <2e-16 ***
## Residuals 44  15655     356                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When put first in the model, kwh shows up as significant in ANOVA. But when put last (try it!), it doesn't. That's because both temp and kwh have a common contributing cause: winter. temp is capable of explaining some of kwh and so putting temp first grabs some of the credit.
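
Here is the "try it!" version, with kwh last. Since kwh is then adjusted for temp, its p-value should match the 0.15 on kwh in the regression report above.

anova(lm(ccf ~ temp + kwh, data = winter))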

A sample-size calculation: One way to make Planet Alt

How much data would we need to test the conversion between ccf and kwh to within about 5%?

Create Planet Alt as a copy of Planet Null. Resample it at various sample sizes until the confidence intervals become acceptably small.

Emphasize: this cannot replace collecting the data, but it indicates how much data we would need to collect to get the sort of result we expect.
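
One way to sketch the calculation (the trial size of 200 and the model are my choices, not from the notes): treat the observed winter data as Planet Alt, resample it at a trial sample size, and look at the spread of the kwh coefficients. do() and resample() are from the mosaic package.

trials = do(500) * lm(ccf ~ kwh + temp, data = resample(winter, size = 200))
sd(trials$kwh)  # standard error of the kwh coefficient at this sample size
2 * sd(trials$kwh)  # rough margin of error; compare to 0.05 * (1/29.3)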

Choosing Model Terms

Put terms first: they get credit. Put terms later: they can help the earlier terms to get credit.

Look at sex and race in the wage data. Put them first. Put them last. What are the different interpretations?

mod0 = lm(log(wage) ~ race + sex, data = CPS85)
anova(mod0)
## Analysis of Variance Table
## 
## Response: log(wage)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## race        1    0.7    0.66     2.5    0.11    
## sex         1    7.2    7.23    27.3 2.5e-07 ***
## Residuals 531  140.6    0.26                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Race isn't showing up as significant, but sex is. Perhaps if we eat up the variance by putting in more explanatory variables …

mod1 = lm(log(wage) ~ race + sex + sector + educ + exper + south, data = CPS85)
anova(mod1)
## Analysis of Variance Table
## 
## Response: log(wage)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## race        1    0.7    0.66    3.47  0.0631 .  
## sex         1    7.2    7.23   37.96 1.4e-09 ***
## sector      7   24.8    3.54   18.56 < 2e-16 ***
## educ        1    5.5    5.52   28.95 1.1e-07 ***
## exper       1    9.5    9.49   49.83 5.4e-12 ***
## south       1    1.5    1.53    8.03  0.0048 ** 
## Residuals 521   99.3    0.19                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The significance of sex is even stronger. Race is right on the margin; with more data (and this is a very limited amount of data) we might see something.

But what if someone argues that it's really educational level and the like that matter, and that you have to see how these affect wage first:

mod3 = lm(log(wage) ~ sector + educ + exper + south + sex + race, data = CPS85)
anova(mod3)
## Analysis of Variance Table
## 
## Response: log(wage)
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## sector      7   27.0    3.86   20.27 < 2e-16 ***
## educ        1    6.5    6.50   34.13 9.1e-09 ***
## exper       1    8.8    8.83   46.34 2.7e-11 ***
## south       1    1.5    1.45    7.63  0.0059 ** 
## sex         1    5.2    5.23   27.46 2.3e-07 ***
## race        1    0.1    0.13    0.70  0.4019    
## Residuals 521   99.3    0.19                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

No indication that race is playing a role.

QUESTION: But is race affecting education, experience, and sector? If so, then race should go first.
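
One way to explore this question (my code, not from the notes): check whether race predicts the other explanatory variables themselves.

anova(lm(educ ~ race, data = CPS85))  # does race predict education?
anova(lm(exper ~ race, data = CPS85))  # ... or experience?
with(CPS85, table(sector, race))  # ... or sector?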

An Odd Situation

Sometimes the later term gets credit, but only because it was set up by an earlier term. Without the earlier term, it wouldn't get credit.

Examples:
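
For instance, a simulated illustration (mine, not from the original notes): the response depends only on the difference between two nearly redundant explanatory variables, so the later term earns credit only once the earlier one is in the model.

set.seed(1)
x1 = rnorm(100, sd = 10)
x2 = x1 + rnorm(100)  # x2 is nearly redundant with x1
y = (x2 - x1) + rnorm(100, sd = 0.1)  # y depends only on the difference
anova(lm(y ~ x2))  # x2 alone earns little credit
anova(lm(y ~ x1 + x2))  # after x1 sets it up, x2 earns nearly all the credit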

ANOVA and near redundancy

The CPS85 data. With ANOVA, you can throw all sorts of stuff into the model and not have to worry about redundancy or near redundancy.

m1 = lm(wage ~ sector + exper + age + educ, data = CPS85)
summary(m1)  # terrible standard errors
## 
## Call:
## lm(formula = wage ~ sector + exper + age + educ, data = CPS85)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10.36  -2.80  -0.65   2.01  33.92 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.0445     6.8083   -0.74   0.4591    
## sectorconst     3.0259     1.0984    2.75   0.0061 ** 
## sectormanag     4.0025     0.7651    5.23  2.4e-07 ***
## sectormanuf     1.7037     0.7186    2.37   0.0181 *  
## sectorother     2.1345     0.7114    3.00   0.0028 ** 
## sectorprof      2.7200     0.6766    4.02  6.7e-05 ***
## sectorsales    -0.1196     0.8467   -0.14   0.8877    
## sectorservice  -0.1434     0.6727   -0.21   0.8313    
## exper          -0.0765     1.1133   -0.07   0.9452    
## age             0.1751     1.1122    0.16   0.8749    
## educ            0.5733     1.1182    0.51   0.6084    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 4.42 on 523 degrees of freedom
## Multiple R-squared: 0.274,   Adjusted R-squared: 0.26 
## F-statistic: 19.7 on 10 and 523 DF,  p-value: <2e-16
anova(m1)
## Analysis of Variance Table
## 
## Response: wage
##            Df Sum Sq Mean Sq F value  Pr(>F)    
## sector      7   2572     367   18.80 < 2e-16 ***
## exper       1    213     213   10.91   0.001 ** 
## age         1   1068    1068   54.65 5.7e-13 ***
## educ        1      5       5    0.26   0.608    
## Residuals 523  10219      20                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Still working …

One of the advantages of Planet Null is that it's easy to get there and collect data. So we can know a lot about it. Discoveries about Planet Null have played a large role in statistics.

Time for t

Let's collect lots of coefficients on Planet Null and see what they have in common. We'll do this by randomizing the response variable:

mod1 = lm(shuffle(wage) - mean(wage) ~ shuffle(sector) * shuffle(sex) * shuffle(age) * 
    shuffle(exper) * shuffle(married) * shuffle(union), data = CPS85)
houses = fetchData("SaratogaHouses.csv")
mod2 = lm(shuffle(Price) - mean(Price) ~ shuffle(Living.Area) * shuffle(Baths) * 
    shuffle(Bedrooms) * shuffle(Fireplace) * shuffle(Acres), data = houses)
grades = fetchData("grades.csv")
g2n = fetchData("grade-to-number.csv")
courses = fetchData("courses.csv")
all = merge(grades, g2n)
all = merge(courses, all)
mod3 = lm(rand(1) ~ shuffle(sessionID) + shuffle(dept) + shuffle(sid), data = all)
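
To sketch the "see what they have in common" step (my code): pool the t statistics from these null-world fits and look at their distribution. With this many residual degrees of freedom, the t distribution is close to normal.

tstats = c(coef(summary(mod1))[-1, "t value"],
    coef(summary(mod2))[-1, "t value"],
    coef(summary(mod3))[-1, "t value"])
densityplot(~tstats)  # a common bell shape centered on zero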

Panel data with mother, father, and children: there is just a small amount of variation in the ages, but show that the children are younger than the parents.

Lifespan versus year of birth (then add in year of death).