Linear Models (Contd.)

Categorical Explanatory Variables in Linear Models

Example: Comparing Mean Incomes of Racial-Ethnic Groups

Income<-read.table("http://stat4ds.rwth-aachen.de/data/Income.dat", header=TRUE)
head(Income)
  income education race
1     16        10    B
2     18         7    B
3     26         9    B
4     16        11    B
5     34        14    B
6     22        12    B
summary(lm(income~race,data=Income))

Call:
lm(formula = income ~ race, data = Income)

Residuals:
   Min     1Q Median     3Q    Max 
-24.48 -12.48  -4.48   7.52  77.52 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   27.750      4.968   5.586 3.37e-07 ***
raceH          3.250      7.273   0.447   0.6562    
raceW         14.730      5.708   2.581   0.0118 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.87 on 77 degrees of freedom
Multiple R-squared:  0.0993,    Adjusted R-squared:  0.0759 
F-statistic: 4.244 on 2 and 77 DF,  p-value: 0.01784

Analysis of Variance (ANOVA)

anova(lm(income~race,data=Income)) # analysis of variance comparing mean
Analysis of Variance Table

Response: income
          Df  Sum Sq Mean Sq F value  Pr(>F)  
race       2  3352.5 1676.23  4.2444 0.01784 *
Residuals 77 30409.5  394.93                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Comparisons of Means: Bonferroni and Tukey Methods

fit<-lm(income~race, data=Income)
summary(fit)

Call:
lm(formula = income ~ race, data = Income)

Residuals:
   Min     1Q Median     3Q    Max 
-24.48 -12.48  -4.48   7.52  77.52 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   27.750      4.968   5.586 3.37e-07 ***
raceH          3.250      7.273   0.447   0.6562    
raceW         14.730      5.708   2.581   0.0118 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.87 on 77 degrees of freedom
Multiple R-squared:  0.0993,    Adjusted R-squared:  0.0759 
F-statistic: 4.244 on 2 and 77 DF,  p-value: 0.01784
confint(fit, level=1-0.05/3) #confidence level 0.9833 instead of 0.95
                0.833 % 99.167 %
(Intercept)  15.5907991 39.90920
raceH       -14.5492476 21.04925
raceW         0.7601417 28.69986
race2<-factor(Income$race, levels=c("H","B","W"))# re-code so H is reference
confint(lm(income~race2,data=Income), level = 1-0.05/3)
               0.833 % 99.167 %
(Intercept)  18.001267 43.99873
race2B      -21.049248 14.54925
race2W       -3.226387 26.18639

fit.aov<-aov(income~race, data=Income) #analysis of variance function
TukeyHSD(fit.aov,conf.level=0.95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = income ~ race, data = Income)

$race
     diff        lwr      upr     p adj
H-B  3.25 -14.130754 20.63075 0.8959280
W-B 14.73   1.088600 28.37140 0.0312355
W-H 11.48  -2.880612 25.84061 0.1425936

Models with Both Categorical and Quantitative Explanatory Variables

fit2<-lm(income~race + education, data=Income)
summary(fit2)

Call:
lm(formula = income ~ race + education, data = Income)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.664  -9.622  -1.642   6.552  57.620 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -26.5379     8.5124  -3.118  0.00257 ** 
raceH         5.9407     5.6702   1.048  0.29809    
raceW        10.8744     4.4730   2.431  0.01741 *  
education     4.4317     0.6191   7.158 4.42e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.46 on 76 degrees of freedom
Multiple R-squared:  0.462, Adjusted R-squared:  0.4408 
F-statistic: 21.75 on 3 and 76 DF,  p-value: 2.853e-10

Interaction with Categorical and Quantitative Explanatory Variables

Income<-read.table("http://stat4ds.rwth-aachen.de/data/Income.dat", header=TRUE)
fit3<-lm(income ~ race+education+race:education, data=Income)
summary(fit3)

Call:
lm(formula = income ~ race + education + race:education, data = Income)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.064  -9.448  -1.453   6.167  56.936 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)       -6.536     14.980  -0.436   0.6639  
raceH            -10.069     26.527  -0.380   0.7053  
raceW            -19.333     18.293  -1.057   0.2940  
education          2.799      1.182   2.368   0.0205 *
raceH:education    1.290      2.193   0.588   0.5582  
raceW:education    2.411      1.418   1.700   0.0933 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.37 on 74 degrees of freedom
Multiple R-squared:  0.4825,    Adjusted R-squared:  0.4475 
F-statistic:  13.8 on 5 and 74 DF,  p-value: 1.618e-09
anova(fit3)# tests individual effects of explanatory variables
Analysis of Variance Table

Response: income
               Df  Sum Sq Mean Sq F value    Pr(>F)    
race            2  3352.5  1676.2  7.0993  0.001512 ** 
education       1 12245.2 12245.2 51.8616 4.131e-10 ***
race:education  2   691.8   345.9  1.4650  0.237690    
Residuals      74 17472.4   236.1                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A Linear Model Example: Salaries Data Set

income <- read.table("http://stat4ds.rwth-aachen.de/data/Salaries.dat", header=T)

model.all <- lm(salary~years, data=income) # for all cases, ignoring gender
summary(model.all)

Call:
lm(formula = salary ~ years, data = income)

Residuals:
    Min      1Q  Median      3Q     Max 
-168.02  -68.41  -24.68   82.95  176.79 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1323.160     63.962   20.69   <2e-16 ***
years         96.730      2.969   32.58   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 100.3 on 24 degrees of freedom
Multiple R-squared:  0.9779,    Adjusted R-squared:  0.977 
F-statistic:  1062 on 1 and 24 DF,  p-value: < 2.2e-16
model.f <- lm(salary~years, data=income, subset=gender==2)
summary(model.f)

Call:
lm(formula = salary ~ years, data = income, subset = gender == 
    2)

Residuals:
     Min       1Q   Median       3Q      Max 
-112.022  -45.487   -4.634   42.635  117.268 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1276.938     71.829   17.78 6.77e-09 ***
years         96.215      3.151   30.54 3.33e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 65.19 on 10 degrees of freedom
Multiple R-squared:  0.9894,    Adjusted R-squared:  0.9883 
F-statistic: 932.4 on 1 and 10 DF,  p-value: 3.325e-11
model.ancova <- lm(salary~years+gender, data=income)
summary(model.ancova)

Call:
lm(formula = salary ~ years + gender, data = income)

Residuals:
     Min       1Q   Median       3Q      Max 
-161.571  -55.325   -4.797   65.821  137.748 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1450.418     66.380  21.850  < 2e-16 ***
years         98.490      2.559  38.492  < 2e-16 ***
gender      -111.771     34.024  -3.285  0.00324 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 84.57 on 23 degrees of freedom
Multiple R-squared:  0.985, Adjusted R-squared:  0.9836 
F-statistic: 752.8 on 2 and 23 DF,  p-value: < 2.2e-16
layout(matrix(1:6,3,2)) # multiple plots in a 3x2 matrix layout
plot(model.ancova, which=1:6) # residual plots, shown in Figure A6.2

Exercises