Lecture 19: (More) Hypothesis Testing II (Again)

The elementary setting for ANOVA

Differences among groups. When there are two groups, this is called a “t-test”. More than two, it’s traditionally called simply “ANOVA” or “one-way ANOVA”, but it’s just a standard application of the ANOVA procedure to a simple model.

The t-test

Consider the difference between the sexes in wage:

data("CPS85")
mod = lm(wage ~ sex, data=CPS85)
summary(mod)

## 
## Call:
## lm(formula = wage ~ sex, data = CPS85)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.995 -3.529 -1.072  2.394 36.621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.8789     0.3216   24.50  < 2e-16 ***
## sexM          2.1161     0.4372    4.84  1.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.034 on 532 degrees of freedom
## Multiple R-squared:  0.04218,    Adjusted R-squared:  0.04038 
## F-statistic: 23.43 on 1 and 532 DF,  p-value: 1.703e-06

anova(mod)

## Analysis of Variance Table
## 
## Response: wage
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## sex         1   593.7  593.71  23.426 1.703e-06 ***
## Residuals 532 13483.0   25.34                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that the regression report and the ANOVA report give the same p-value. There’s no “team” issue when there is just one coefficient at issue.

For multiple groups, consider wage versus sector:

mod2 = lm(wage ~ sector, data=CPS85)
summary(mod2)

## 
## Call:
## lm(formula = wage ~ sector, data = CPS85)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.704  -3.037  -1.037   2.279  31.796 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.4226     0.4749  15.631  < 2e-16 ***
## sectorconst     2.0794     1.1485   1.810   0.0708 .  
## sectormanag     5.2814     0.7894   6.690 5.71e-11 ***
## sectormanuf     0.6135     0.7397   0.829   0.4073    
## sectorother     1.0780     0.7397   1.457   0.1456    
## sectorprof      4.5249     0.6586   6.870 1.82e-11 ***
## sectorsales     0.1701     0.8950   0.190   0.8494    
## sectorservice  -0.8851     0.6993  -1.266   0.2062    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.677 on 526 degrees of freedom
## Multiple R-squared:  0.1827, Adjusted R-squared:  0.1718 
## F-statistic:  16.8 on 7 and 526 DF,  p-value: < 2.2e-16

In the regression report, each sector gets its own p-value with respect to the reference sector. It might be that there is not enough evidence to support a claim that any given sector is different from the reference, but taken together they suggest that the overall variable is playing a role in explaining the response variable. ANOVA provides a simple way to combine the contributions of all the levels of the variable.

anova(mod2)

## Analysis of Variance Table
## 
## Response: wage
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## sector      7  2571.6  367.37  16.796 < 2.2e-16 ***
## Residuals 526 11505.1   21.87                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When considering the null hypothesis, the “explanatory” variables are assumed to be no better than the residuals. Here’s how that shows up in the ANOVA table:

modnull = lm(wage ~ shuffle(sector), data=CPS85)
anova(modnull)

## Analysis of Variance Table
## 
## Response: wage
##                  Df  Sum Sq Mean Sq F value Pr(>F)
## shuffle(sector)   7   177.6  25.375  0.9603 0.4595
## Residuals       526 13899.1  26.424

Log prices, indices, and wages

Prices are relative. An indication of this is the almost universal use of percentages to describe inflation, wage increases, etc. For example, an often quoted number is that women earn approximately 72 cents for each dollar earned by a man.

The naive way to find this number — which is in fact the way it is found — is to divide the average wage of women by the average wage of men, e.g.

mean( wage ~ sex, data=CPS85)

##        F        M 
## 7.878857 9.994913

7.88/9.99

## [1] 0.7887888

Close to the quoted number in this data set from the 1980s.

A better way is to work with log wages, find the contribution from sex, and then convert that back into a multiplier. That let’s us adjust for various other factors. Here is the basic calculation, done in log-wage style:

lm(log(wage) ~ sex, data=CPS85)

## 
## Call:
## lm(formula = log(wage) ~ sex, data = CPS85)
## 
## Coefficients:
## (Intercept)         sexM  
##      1.9340       0.2312

exp(-0.2312)

## [1] 0.7935807

Now we can include covariates:

lm( log(wage) ~ sex + sector + exper + educ, data=CPS85)

## 
## Call:
## lm(formula = log(wage) ~ sex + sector + exper + educ, data = CPS85)
## 
## Coefficients:
##   (Intercept)           sexM    sectorconst    sectormanag    sectormanuf  
##       0.69776        0.21966        0.16371        0.20869        0.04473  
##   sectorother     sectorprof    sectorsales  sectorservice          exper  
##       0.03932        0.19760       -0.15135       -0.15387        0.01176  
##          educ  
##       0.07607

exp(-0.2197)

## [1] 0.8027596

Hardly any difference. But maybe the model should be more complicated? How can we tell?

Use ANOVA and log wages to see if interactions should be included in the model. Example:

mod = lm(log(wage) ~ sector*sex, data=CPS85)
anova(mod)

## Analysis of Variance Table
## 
## Response: log(wage)
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## sector       7  27.035  3.8621 17.7750 < 2.2e-16 ***
## sex          1   5.444  5.4444 25.0575 7.634e-07 ***
## sector:sex   6   3.201  0.5335  2.4555   0.02375 *  
## Residuals  519 112.767  0.2173                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Do we have a sigficant interaction?

Question

Is there an interaction between age and mileage in the used car data? Does it show up if we look at log prices?

cars = read.csv("http://www.mosaic-web.org/go/datasets/used-hondas.csv")
anova(lm(log(Price)~Age*Mileage, data=cars))

## Analysis of Variance Table
## 
## Response: log(Price)
##             Df Sum Sq Mean Sq  F value  Pr(>F)    
## Age          1 8.3325  8.3325 599.4336 < 2e-16 ***
## Mileage      1 2.1881  2.1881 157.4089 < 2e-16 ***
## Age:Mileage  1 0.0457  0.0457   3.2865 0.07326 .  
## Residuals   88 1.2232  0.0139                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Do professors vary in how they grade?

One place where F shines is when we want to look at many explanatory vectors collectively.

Previously, we looked at professor-wise grade-point averages, with an eye to the question of whether some professors are easy grading.We used as a test statistic, the model coefficient for each professor, and ran into the question of multiple comparisons.

Now let’s return to the question using analysis of variance.

grades = read.csv("http://www.mosaic-web.org/go/datasets/grades.csv")
courses = read.csv("http://www.mosaic-web.org/go/datasets/courses.csv")
g2n = read.csv("http://www.mosaic-web.org/go/datasets/grade-to-number.csv")
all = merge( grades, courses)
all = merge(all, g2n) # a data set of every grade given, etc.

Suppose, instead of being concerned about individual professors, we were interested in the professorate as a whole: do they grade in a consistent way, where “consistent” means, “draw grades from a common pool.” This test can be done easily. Build the model and see if the explanatory variable accounts for more than is likely to arise from chance:

mod1 = lm(gradepoint ~ iid, data=all)
rsquared(mod1)

## [1] 0.1744812

The regression report actually gives a p-value for this r.squared. It’s not any different than we would get based on the null hypothesis — randomizing iid and seeing what is the distribution of \(R^2\).

Another way to summarize the model is with an ANOVA report:

anova(mod1)

## Analysis of Variance Table
## 
## Response: gradepoint
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## iid        358  351.98 0.98319  3.1586 < 2.2e-16 ***
## Residuals 5350 1665.33 0.31128                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Breaking up the variance into parts

Of course, it’s not fair to credit professors for variation in grades that is really due to the students. So we want to divide up the variation into that due to the students and that due to the professors. ANOVA let’s us do this:

mod2 = lm(gradepoint ~ sid + iid, data=all)
anova(mod2)

## Analysis of Variance Table
## 
## Response: gradepoint
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## sid        442  645.01 1.45930  6.7583 < 2.2e-16 ***
## iid        358  312.53 0.87300  4.0430 < 2.2e-16 ***
## Residuals 4908 1059.77 0.21593                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interestingly, the result depends on the order in which you put the model terms, even though the model values do not at all depend on this. This is explained in Section 15.3 of your textbook. It is based on the fact that we are testing two different null hypotheses:

Looking at sid, in the first case the null is that sid on it’s own doesn’t contribute beyond what is typical for a random vector, and in the second case, the null is that sid doesn’t contribute beyond the contribution already made by iid. The two reports give us different perspectives on each of the variables (in this case, we focus on sid).

mod3 = lm(gradepoint ~ iid + sid, data=all)
anova(mod3)

## Analysis of Variance Table
## 
## Response: gradepoint
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## iid        358  351.98 0.98319  4.5534 < 2.2e-16 ***
## sid        442  605.56 1.37005  6.3449 < 2.2e-16 ***
## Residuals 4908 1059.77 0.21593                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

sum((fitted(mod2) - fitted(mod3))^2) # Model values are the same

## [1] 1.039428e-22

Eating up the variance

The F statistic compares the “credit” earned by a model term to the mean square residual, which can be interpreted as the credit that would be earned by a junk random term.

What happens if we fit a model and add in some random terms? What should we see?

We should see that the F for the random terms is about 1 and that the mean square of the residual is hardly changed by the random terms.

mod0 = lm(wage ~ sector + sex + exper, data=CPS85)
anova(mod0)

## Analysis of Variance Table
## 
## Response: wage
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## sector      7  2571.6  367.37  17.816 < 2.2e-16 ***
## sex         1   436.0  435.98  21.144 5.348e-06 ***
## exper       1   264.5  264.48  12.826 0.0003736 ***
## Residuals 524 10804.7   20.62                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The mean square residual is about 20. Now throw in some junk:

mod10 = lm( wage ~ sector + sex + exper + rand(10), data=CPS85)
anova(mod10)

## Analysis of Variance Table
## 
## Response: wage
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## sector      7  2571.6  367.37 17.9505 < 2.2e-16 ***
## sex         1   436.0  435.98 21.3029 4.959e-06 ***
## exper       1   264.5  264.48 12.9230 0.0003559 ***
## rand(10)   10   285.4   28.54  1.3943 0.1792743    
## Residuals 514 10519.3   20.47                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

But what if a term eats more variance than a junk term? That term makes it easier for the other terms to show significance!

Example

Difference in age between husband and wife in couples getting married

Who is older in a married couple, the man or the woman? By how much?

Let’s see if the data support this:

m = read.csv("http://www.mosaic-web.org/go/datasets/marriage.csv")
mod0 = mm(Age ~ Person, data=m)
mod0

## 
## Groupwise Model Call:
## Age ~ Person
## 
## Coefficients:
## Bride  Groom  
## 33.24  35.78

confint(mod0)

##   group    2.5 %   97.5 %
## 1 Bride 29.14955 37.32884
## 2 Groom 31.69509 39.87438

mod1 = lm(Age ~ Person, data=m)
summary(mod1)

## 
## Call:
## lm(formula = Age ~ Person, data = m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.423 -12.022  -3.338   8.796  39.564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.239      2.060  16.133   <2e-16 ***
## PersonGroom    2.546      2.914   0.874    0.384    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.42 on 96 degrees of freedom
## Multiple R-squared:  0.007888,   Adjusted R-squared:  -0.002447 
## F-statistic: 0.7633 on 1 and 96 DF,  p-value: 0.3845

The point estimate is about right, but the margin of error is so large that we can’t take this estimate very seriously. The p-value is so large that we can’t reject the null that there is no relationship between Person and age.

We can try adding in some other variables, astrological sign, years of education, etc. but this doesn’t help much…

mod1 = lm(Age ~ Person + Race + College + Sign, data=m)
summary(mod1)

## 
## Call:
## lm(formula = Age ~ Person + Race + College + Sign, data = m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.510 -10.188  -2.478   9.552  36.646 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      22.7417    16.6082   1.369   0.1752  
## PersonGroom       1.7171     3.3139   0.518   0.6060  
## RaceBlack         1.6238    15.5737   0.104   0.9173  
## RaceHispanic      1.5978    21.1876   0.075   0.9401  
## RaceWhite         1.4318    15.2981   0.094   0.9257  
## College           1.7995     0.8259   2.179   0.0327 *
## SignAries         4.4266     7.2254   0.613   0.5421  
## SignCancer       10.0494     7.4525   1.348   0.1818  
## SignCapricorn    -0.4124    11.5024  -0.036   0.9715  
## SignGemini       13.8901     7.2829   1.907   0.0605 .
## SignLeo           7.4954     7.9812   0.939   0.3508  
## SignLibra         0.9624     8.2834   0.116   0.9078  
## SignPisces        3.3368     6.8114   0.490   0.6257  
## SignSaggitarius   5.8941     7.4545   0.791   0.4318  
## SignScorpio       8.3847     7.9792   1.051   0.2969  
## SignTaurus       10.8592     8.4584   1.284   0.2034  
## SignVirgo         2.1650     7.3922   0.293   0.7705  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.22 on 71 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.1571, Adjusted R-squared:  -0.0328 
## F-statistic: 0.8273 on 16 and 71 DF,  p-value: 0.6511

OK, what about if we add in the BookpageID variable as a covariate? This gives an individual ID to each marriage.

mod2 = lm(Age ~ Person + BookpageID, data=m)
anova(mod2)

## Analysis of Variance Table
## 
## Response: Age
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## Person      1   158.8  158.75  9.0699  0.004137 ** 
## BookpageID 48 19127.3  398.49 22.7661 < 2.2e-16 ***
## Residuals  48   840.2   17.50                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

head(confint(mod2))

##                           2.5 %    97.5 %
## (Intercept)          18.7104914 30.727529
## PersonGroom           0.8460752  4.245007
## BookpageIDB230p1354  -9.1763101  7.647543
## BookpageIDB230p1665  -6.6543923 10.169461
## BookpageIDB230p1948 -13.7132964  3.110557
## BookpageIDB230p539   -3.7393238 13.084529

Putting this in the model effectively holds the couple constant when considering the Person. In terms of ANOVA, BookpageID is eating up lots and lots of variance. This is sometimes called a paired t-test, because each family has a wife and a husband (or perhaps more politically correct, each marriage has two different individuals).

Example

Heating and electricity Use

In theory, the electricity used in a house for lighting, computers, etc. should also be heating the house; eventually the energy in the electricity is transformed into heat. But how much? Perhaps the amount of electricity is small compared to the amount of heat.

Let’s look just at winter months, November through March, when heating is used pretty steadily:

utils = read.csv("http://www.mosaic-web.org/go/datasets/utilities.csv")
winter = subset(utils, temp < 55 & ccf > 45 & month != 10)
bwplot( gasbill/elecbill ~ as.factor(month), data=winter)

It looks like roughly twice as much is paid for gas as electricity. So there’s more money going into heating than lighting and other electricity uses. Considering that electricity is a much more expensive form of energy than natural gas, this suggests that there is much more energy being used for heating than for lighting.

But since energy is a physical quantity, there is a very good theory of it. In fact, both kWh and ccf are forms of energy, and so there is a conversion between them. (Actually, ccf is a volume, but since the volumetric energy density of natural gas varies only by a few percent depending on other factors, we can treat it approximately as an energy.) Looking it up on in Wikipedia indicates that a ccf is roughly 29.3 kWh.

bwplot(kwh/(29.3*ccf) ~ as.factor(month), data=winter)

Electricity energy use is about 10-20% compared to gas energy use in the winter months. This suggests that we should be able to detect the electricity use. It should show up as a negative coefficient on kwh in a model of ccf, with a magnitude of about \(1/29.3\) (the conversion factor between kwh and ccf).

mod1 = lm(ccf ~ kwh, data=winter)
summary(mod1)

## 
## Call:
## lm(formula = ccf ~ kwh, data = winter)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.382 -48.208   6.176  40.406 100.651 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.41931   31.98631   3.640 0.000701 ***
## kwh           0.04197    0.04358   0.963 0.340652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.58 on 45 degrees of freedom
## Multiple R-squared:  0.0202, Adjusted R-squared:  -0.001578 
## F-statistic: 0.9275 on 1 and 45 DF,  p-value: 0.3407

anova(mod1)

## Analysis of Variance Table
## 
## Response: ccf
##           Df Sum Sq Mean Sq F value Pr(>F)
## kwh        1   2763  2763.3  0.9275 0.3407
## Residuals 45 134066  2979.2

Alas, we can’t reject the null. Note that the two reports are giving the same information. There is only explanatory vector and so there is no “team” effect for ANOVA to help us understand.

But there are other factors that affect natural gas use, and we may be able to explain some of these. Doing so will reduce the size of the residuals and may make it easier to see the effect of electricity use on natural gas use.

mod2 = lm(ccf ~ kwh + temp, data=winter)
summary(mod2)

## 
## Call:
## lm(formula = ccf ~ kwh + temp, data = winter)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.819 -10.492  -3.723  12.332  45.121 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 299.46137   14.92833  20.060   <2e-16 ***
## kwh          -0.02254    0.01547  -1.457    0.152    
## temp         -4.36834    0.23945 -18.243   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.86 on 44 degrees of freedom
## Multiple R-squared:  0.8856, Adjusted R-squared:  0.8804 
## F-statistic: 170.3 on 2 and 44 DF,  p-value: < 2.2e-16

anova(mod2)

## Analysis of Variance Table
## 
## Response: ccf
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## kwh        1   2763    2763   7.7668  0.007826 ** 
## temp       1 118411  118411 332.8121 < 2.2e-16 ***
## Residuals 44  15655     356                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When put first in the model, kwh shows up as significant in ANOVA, but when put last (below), it doesn’t. That’s because both temp and kwh have a common contributing cause: winter! Since temp is capable of explaining some of kwh, putting temp first grabs some of the credit.

mod3 = lm(ccf ~ temp + kwh, data=winter)
summary(mod3)

## 
## Call:
## lm(formula = ccf ~ temp + kwh, data = winter)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.819 -10.492  -3.723  12.332  45.121 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 299.46137   14.92833  20.060   <2e-16 ***
## temp         -4.36834    0.23945 -18.243   <2e-16 ***
## kwh          -0.02254    0.01547  -1.457    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.86 on 44 degrees of freedom
## Multiple R-squared:  0.8856, Adjusted R-squared:  0.8804 
## F-statistic: 170.3 on 2 and 44 DF,  p-value: < 2.2e-16

anova(mod3)

## Analysis of Variance Table
## 
## Response: ccf
##           Df Sum Sq Mean Sq  F value Pr(>F)    
## temp       1 120419  120419 338.4563 <2e-16 ***
## kwh        1    755     755   2.1226 0.1522    
## Residuals 44  15655     356                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The fact that order matters can be extremely frustrating to people new to ANOVA. There are some general guidelines about how to choose the order of terms in ANOVA. Section 15.4.1 of your textbook provides some useful guidelines. I strongly recommend you read this section and use the information there when conducting ANOVA in your final projects.

Choosing model terms

With the above caveate, here are a few ‘nice’ ways to think about order and model terms in ANOVA:

Put terms first — they get credit
Put terms later — they can help the earlier terms to get credit

For example, look at sex and race in the wage data. Let’s put them first; let’s put them last. What are the different interpretations?

mod0 = lm(log(wage) ~ race + sex, data=CPS85)
anova(mod0)

## Analysis of Variance Table
## 
## Response: log(wage)
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## race        1   0.661  0.6607   2.496    0.1147    
## sex         1   7.233  7.2327  27.325 2.477e-07 ***
## Residuals 531 140.553  0.2647                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems as though race isn’t showing up as significant, but sex is. Perhaps if we eat up the variance by putting in more explanatory variables …?

mod1 = lm(log(wage) ~ race + sex + sector + educ + exper + south, data=CPS85)
anova(mod1)

## Analysis of Variance Table
## 
## Response: log(wage)
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## race        1  0.661  0.6607  3.4678  0.063136 .  
## sex         1  7.233  7.2327 37.9630 1.444e-09 ***
## sector      7 24.753  3.5362 18.5605 < 2.2e-16 ***
## educ        1  5.516  5.5159 28.9516 1.123e-07 ***
## exper       1  9.494  9.4941 49.8324 5.374e-12 ***
## south       1  1.529  1.5292  8.0266  0.004788 ** 
## Residuals 521 99.261  0.1905                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Significance of sex is even stronger, but race is right on the margin. With more data (and this is a very limited amount of data) we might see something.

But what if someone argues that it’s really educational levels etc. that matter and you have to see how these effect wage first:

mod3 = lm( log(wage) ~ sector + educ + exper + south + sex + race, data=CPS85)
anova(mod3)

## Analysis of Variance Table
## 
## Response: log(wage)
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## sector      7 27.035  3.8621 20.2713 < 2.2e-16 ***
## educ        1  6.503  6.5026 34.1306 9.086e-09 ***
## exper       1  8.828  8.8283 46.3379 2.747e-11 ***
## south       1  1.454  1.4538  7.6309   0.00594 ** 
## sex         1  5.232  5.2323 27.4630 2.330e-07 ***
## race        1  0.134  0.1341  0.7037   0.40193    
## Residuals 521 99.261  0.1905                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now there is no indication that race is playing a role! But is race affecting education, experience, and sector? If so, then race should go first.

Reference

This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.