Use R to Analyze Data ‘savings’

Savings rates in 50 countries

The savings data frame has 50 rows and 5 columns. The data is averaged over the period 1960-1970. This data frame contains the following columns:

library(faraway)
?savings

pairs(savings)

plot of chunk unnamed-chunk-1

attach(savings)
country=dimnames(savings)[[1]]
plot(dpi, sr)
text(dpi+0.5, sr+0.5, country, col=1:50)

plot of chunk unnamed-chunk-1

plot(ddpi, sr)
text(ddpi+0.5, sr+0.5, country, col=1:50)

plot of chunk unnamed-chunk-1

plot(pop75, pop15)
text(pop75+0.2, pop15+0.2, country, col=1:50, cex=0.75)

plot of chunk unnamed-chunk-1

Fit linear regression models

Older people tend to save more. So higher sr for counties with higher pop75? We fit a SLR with sr as the response and pop75 as the predictor.

summary(lm(sr ~ pop75,data=savings))
## 
## Call:
## lm(formula = sr ~ pop75, data = savings)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.266 -3.229  0.054  2.334 11.850 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.152      1.248    5.73  6.4e-07 ***
## pop75          1.099      0.475    2.31    0.025 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.29 on 48 degrees of freedom
## Multiple R-squared:   0.1,   Adjusted R-squared:  0.0814 
## F-statistic: 5.34 on 1 and 48 DF,  p-value: 0.0251

Next, let’s fit a MLR with all four predictors including pop75.

fullmodel=lm(sr~pop15+pop75+dpi+ddpi, data=savings) 
summary(fullmodel)
## 
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.242 -2.686 -0.249  2.428  9.751 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.566087   7.354516    3.88  0.00033 ***
## pop15       -0.461193   0.144642   -3.19  0.00260 ** 
## pop75       -1.691498   1.083599   -1.56  0.12553    
## dpi         -0.000337   0.000931   -0.36  0.71917    
## ddpi         0.409695   0.196197    2.09  0.04247 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.8 on 45 degrees of freedom
## Multiple R-squared:  0.338,  Adjusted R-squared:  0.28 
## F-statistic: 5.76 on 4 and 45 DF,  p-value: 0.00079

While pop75 is a significant predictor for sr in SLR with a positve coefficient, the coefficient estimate for pop75 in MLR is no longer significant and is even negative.

How should we interprete the coefficient estimate for pop75 in SLR and the one in MLR?

Does it make sense for the SLR to suggest that pop75 has a positive effect on sr, while MLR suggests the opposite? Such seemingly contradictory statements are quite comment when predictors are correlated.

cor(savings[,-1])
##          pop15    pop75     dpi     ddpi
## pop15  1.00000 -0.90848 -0.7562 -0.04783
## pop75 -0.90848  1.00000  0.7870  0.02532
## dpi   -0.75619  0.78700  1.0000 -0.12949
## ddpi  -0.04783  0.02532 -0.1295  1.00000

How to handle rank deficiency?

Sometimes, we may have redundant predictors, i.e., one predictor is equal to a linear combination of some other predictors. Then the design matrix X is not of full rank. The consequence is that the inverse of (X’X) does not exist, so we cannot compute the LS estimate beta-hat.

Suppose we create a new variable which measures the percentage of poeple between age 15 and 75, i.e., pop.mid = 1 - pop15 - pop75, then re-run the regression with 5 predictors.

newsavings = savings
newsavings[, "pop.mid"] = 1 - savings$pop15 - savings$pop75
newfullmodel = lm(sr ~ ., data=newsavings)
summary(newfullmodel)
## 
## Call:
## lm(formula = sr ~ ., data = newsavings)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.242 -2.686 -0.249  2.428  9.751 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.566087   7.354516    3.88  0.00033 ***
## pop15       -0.461193   0.144642   -3.19  0.00260 ** 
## pop75       -1.691498   1.083599   -1.56  0.12553    
## dpi         -0.000337   0.000931   -0.36  0.71917    
## ddpi         0.409695   0.196197    2.09  0.04247 *  
## pop.mid            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.8 on 45 degrees of freedom
## Multiple R-squared:  0.338,  Adjusted R-squared:  0.28 
## F-statistic: 5.76 on 4 and 45 DF,  p-value: 0.00079

Of course, when X has rank deficiency, we cannot discuss parameter estimation, since the LS estimates are not longer unique — there are many many equally good estimates of beta which give us the same RSS.

But if we just care about prediction, then rank deficiency doesn’t bring us any trouble. You’ll find the fitted values from the two models, with/without pop.mid, are the same.

cbind(fullmodel$fitted[1:5], newfullmodel$fitted[1:5])
##             [,1]   [,2]
## Australia 10.566 10.566
## Austria   11.454 11.454
## Belgium   10.951 10.951
## Bolivia    6.448  6.448
## Brazil     9.327  9.327