Let’s look at a few more methods to improve regression models in Econometrics, continuing up from our previous lesson for workshop 3. Here, the data used are CONSUME.xls and once again, WOMEN.xls.
source("scripts/r4abep-02.R")
CONSUME <- readXL("CONSUME.xls")
CONSUME
str(CONSUME)
'data.frame': 54 obs. of 5 variables:
$ year : num 1947 1948 1949 1950 1951 ...
$ consumption: num 976 998 1025 1091 1107 ...
$ income : num 1035 1090 1096 1193 1227 ...
$ wealth : num 5167 5281 5607 5760 6086 ...
$ interest : num -10.351 -4.72 1.044 0.407 -5.283 ...
CONSUME.lm <- lm(log(consumption)~log(income)+log(wealth)+interest,data = CONSUME)
summary(CONSUME.lm)
Call:
lm(formula = log(consumption) ~ log(income) + log(wealth) + interest,
data = CONSUME)
Residuals:
Min 1Q Median 3Q Max
-0.018441 -0.010001 0.000337 0.007039 0.032578
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4677110 0.0427781 -10.933 7.33e-15 ***
log(income) 0.8048729 0.0174979 45.998 < 2e-16 ***
log(wealth) 0.2012700 0.0175926 11.441 1.43e-15 ***
interest -0.0026891 0.0007619 -3.529 0.000905 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01193 on 50 degrees of freedom
Multiple R-squared: 0.9996, Adjusted R-squared: 0.9995
F-statistic: 3.783e+04 on 3 and 50 DF, p-value: < 2.2e-16
r4abep.plotlm(CONSUME.lm)
With a quick glance, the graphs show us that Normal Q-Q is fine, same as the Scale-Location, while sample “1” in Residuals vs Leverage is nearing the point of interest, and the Residuals vs Fitted graph still shows a pattern, an undesirable indication of heteroskedasticity.
r4abep.dw(CONSUME.lm)
Durbin-Watson test
data: model
DW = 1.2892, p-value = 0.001889
alternative hypothesis: true autocorrelation is not 0
CONSUME.lm2 <- lm(diff(log(consumption))~diff(log(income))+diff(log(wealth))+diff(interest),data=CONSUME)
summary(CONSUME.lm2)
Call:
lm(formula = diff(log(consumption)) ~ diff(log(income)) + diff(log(wealth)) +
diff(interest), data = CONSUME)
Residuals:
Min 1Q Median 3Q Max
-0.027777 -0.006015 -0.001279 0.008554 0.021555
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0070455 0.0033955 2.075 0.0433 *
diff(log(income)) 0.7148177 0.0816885 8.751 0.0000000000139 ***
diff(log(wealth)) 0.0782672 0.0381738 2.050 0.0457 *
diff(interest) 0.0007339 0.0008010 0.916 0.3641
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01078 on 49 degrees of freedom
Multiple R-squared: 0.6453, Adjusted R-squared: 0.6236
F-statistic: 29.72 on 3 and 49 DF, p-value: 0.00000000004302
r4abep.dw(CONSUME.lm2)
Durbin-Watson test
data: model
DW = 1.8968, p-value = 0.6914
alternative hypothesis: true autocorrelation is not 0
\[ \Delta log(consumption) = 0.007 + 0.71 \Delta log(income) + 0.08 \Delta log(wealth) + 0.0007 \Delta interest + \epsilon \]
CONSUME.lm3 <- lm(diff(log(consumption))~diff(log(income))+diff(log(wealth)),data=CONSUME)
summary(CONSUME.lm3)
Call:
lm(formula = diff(log(consumption)) ~ diff(log(income)) + diff(log(wealth)),
data = CONSUME)
Residuals:
Min 1Q Median 3Q Max
-0.024034 -0.006177 -0.002030 0.010204 0.021593
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.006894 0.003386 2.036 0.0471 *
diff(log(income)) 0.725677 0.080694 8.993 0.00000000000504 ***
diff(log(wealth)) 0.077381 0.038100 2.031 0.0476 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01077 on 50 degrees of freedom
Multiple R-squared: 0.6393, Adjusted R-squared: 0.6248
F-statistic: 44.3 on 2 and 50 DF, p-value: 0.000000000008509
“Income” stands out and remains as the most significant variable here.
r4abep.dw(CONSUME.lm3)
Durbin-Watson test
data: model
DW = 2.0049, p-value = 0.9892
alternative hypothesis: true autocorrelation is not 0
\[ \Delta log(consumption) = 0.007 + 0.73 \Delta log(income) + 0.08 \Delta log(wealth) + \epsilon \]
r4abep.orcutt(CONSUME.lm)
Call:
lm(formula = log(consumption) ~ log(income) + log(wealth) + interest,
data = CONSUME)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.39983081 0.06706911 -5.961 0.0000002683 ***
log(income) 0.84585539 0.02778070 30.448 < 2.2e-16 ***
log(wealth) 0.15913004 0.02676132 5.946 0.0000002832 ***
interest 0.00121406 0.00091498 1.327 0.1907
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0099 on 49 degrees of freedom
Multiple R-squared: 0.9979 , Adjusted R-squared: 0.9978
F-statistic: 7850.4 on 3 and 49 DF, p-value: < 1.063e-65
Durbin-Watson statistic
(original): 1.28922 , p-value: 9.444e-04
(transformed): 1.87472 , p-value: 2.448e-01
# of interaction: 19
rho: 0.6124432
r4abep.reset(WOMEN.lm)
RESET test
data: model
RESET = 222.42, df1 = 2, df2 = 11, p-value = 0.000000001271
The initial RESET Test result above shows that the model can be improved by non-linear specifications of independent variables.
r4abep.reset(WOMEN.lm2)
RESET test
data: model
RESET = 0.30859, df1 = 4, df2 = 8, p-value = 0.8645
linlin.lm <- lm(height~weight,data=WOMEN)
\[ height = 25.72 + 0.29 weight + \epsilon \]
linlog.lm <- lm(height~log(weight),data=WOMEN)
\[ height = -129.61 + 39.62 log(weight) + \epsilon \]
loglin.lm <- lm(log(height)~weight,data=WOMEN)
loglin.lm
Call:
lm(formula = log(height) ~ weight, data = WOMEN)
Coefficients:
(Intercept) weight
3.567960 0.004419
\[ log(height) = 3.57 + 0.0044 weight + \epsilon \]
loglog.lm <- lm(log(height)~log(weight),data=WOMEN)
loglog.lm
Call:
lm(formula = log(height) ~ log(weight), data = WOMEN)
Coefficients:
(Intercept) log(weight)
1.1740 0.6104
\[ log(height) = 1.17 + 0.61 log(weight) + \epsilon \]
linlin.lm
Call:
lm(formula = height ~ weight, data = WOMEN)
Coefficients:
(Intercept) weight
25.7235 0.2872
r4abep.margeff(linlin.lm)
Average marginal effects
lm(formula = height ~ weight, data = WOMEN)
quad.lm <- lm(height~weight+I(weight^2),data=WOMEN)
quad.lm
Call:
lm(formula = height ~ weight + I(weight^2), data = WOMEN)
Coefficients:
(Intercept) weight I(weight^2)
-11.746939 0.834341 -0.001973
r4abep.margeff(quad.lm)
Average marginal effects
lm(formula = height ~ weight + I(weight^2), data = WOMEN)
cube.lm <- lm(height~weight+I(weight^2)+I(weight^3),data=WOMEN)
cube.lm
Call:
lm(formula = height ~ weight + I(weight^2) + I(weight^3), data = WOMEN)
Coefficients:
(Intercept) weight I(weight^2) I(weight^3)
-4.812067969 0.682600397 -0.000874579 -0.000002633
r4abep.margeff(cube.lm)
Average marginal effects
lm(formula = height ~ weight + I(weight^2) + I(weight^3), data = WOMEN)
r4abep.influence(CONSUME.lm)
On the visualization above, samples “3” and “54” are outliers, “1” has high leverage, while all those 3 have the highest influence along with a few more other samples, as depicted by the circles.
r4abep.outlier(CONSUME.lm)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
r4abep.normality(CONSUME.lm)
[1] 3 54
Shapiro-Wilk normality test
data: model$residuals
W = 0.96726, p-value = 0.1457
The sample data CONSUME used in this paper has been filtered and/or improved by several methods. Those can also be applied to studies with extensive sample sets in order to come up with the best linear regression model and/or equation.