In labor economics, there exist many empirical studies trying to shed light on the issue of discrimination (for example, by gender or ethnicity). These works typically involve regressions with factors and interactions. Since the CPS1988 data contain the factor ethnicity, they lend themselves to illustrating such specifications.
Of course, our example is merely an illustration of working with factors and interactions, and we do not seriously address any discrimination issues. Technically, we are interested in the empirical relevance of an interaction between ethnicity and other variables in our regression model. Before doing so for the data at hand, we outline the most important specifications of interactions in R.
The operator : specifies an interaction effect that is, in the default contrast coding, essentially the product of a dummy variable and a further variable (possibly also a dummy). The operator * does the same but also includes the corresponding main effects. The same is done by /, but it uses a nested coding instead of the interaction coding. Finally, ^ can be used to include all interactions up to a certain order within a group of variables.
The following table provides a brief overview for numerical variables y, x and categorical variables a, b, c. Formulas listed in the same row are equivalent.

Table: Specification of interactions in formulas.

| Formula | Description |
|---|---|
| y ~ a + x | Model without interaction: identical slopes with respect to x but different intercepts with respect to a. |
| y ~ a * x, y ~ a + x + a:x | Model with interaction: the term a:x gives the difference in slopes compared with the reference category. |
| y ~ a / x, y ~ a + x %in% a | Model with interaction: produces the same fitted values as the model above but using a nested coefficient coding. An explicit slope estimate is computed for each category in a. |
| y ~ (a + b + c)^2, y ~ a * b * c - a:b:c | Model with all two-way interactions (excluding the three-way interaction). |
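The relationship between the interaction coding and the nested coding can be checked on a small example. The following is a minimal sketch on simulated data; the variable names, group labels, and coefficients are illustrative assumptions and have nothing to do with CPS1988.

```r
# Simulated data; all names and numbers here are illustrative assumptions.
set.seed(1)
n <- 200
a <- factor(sample(c("g1", "g2"), n, replace = TRUE))
x <- rnorm(n)
y <- 1 + 0.5 * x + ifelse(a == "g2", -0.3 + 0.4 * x, 0) + rnorm(n, sd = 0.2)

fit_int  <- lm(y ~ a * x)   # interaction coding: main effects plus a:x
fit_nest <- lm(y ~ a / x)   # nested coding: one slope per level of a

# Both parameterizations describe the same model, so fitted values agree.
all.equal(fitted(fit_int), fitted(fit_nest))

# Slope for g2 under interaction coding: reference slope plus the a:x term ...
slope_g2_int <- unname(coef(fit_int)["x"] + coef(fit_int)["ag2:x"])
# ... which the nested coding reports directly as its own coefficient.
slope_g2_nest <- unname(coef(fit_nest)["ag2:x"])
c(interaction = slope_g2_int, nested = slope_g2_nest)
```

The coefficient named ag2:x therefore means different things in the two fits: a difference in slopes under `a * x`, and the slope itself under `a / x`.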
Interactions
In the model without interactions, log(wage) ~ experience + I(experience^2) + education + ethnicity, the factor ethnicity affects only the intercept. A priori, it is not clear whether slope coefficients are also affected; i.e., whether Caucasians and African-Americans are paid differently conditional on some further regressors. For illustration, let us consider an interaction between ethnicity and education.
R> cps_int <- lm(log(wage) ~ experience + I(experience^2) + education * ethnicity, data = CPS1988)
R> coeftest(cps_int)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.3131e+00 1.9590e-02 220.1703 < 2e-16 ***
## experience 7.7520e-02 8.8028e-04 88.0625 < 2e-16 ***
## I(experience^2) -1.3179e-03 1.9006e-05 -69.3388 < 2e-16 ***
## education 8.6312e-02 1.3089e-03 65.9437 < 2e-16 ***
## ethnicityafam -1.2389e-01 5.9026e-02 -2.0989 0.03584 *
## education:ethnicityafam -9.6481e-03 4.6510e-03 -2.0744 0.03805 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the interaction term is statistically significant at the 5% level. However, with a sample comprising almost 30,000 individuals, this can hardly be taken as compelling evidence for inclusion of the term.
Above, only the table of coefficients and associated tests is computed, for compactness. As described above, the term education * ethnicity specifies the inclusion of three terms: ethnicity, education, and the interaction between the two (internally, the product of the dummy indicating ethnicity == "afam" and education). Specifically, education * ethnicity may be thought of as expanding to 1 + education + ethnicity + education:ethnicity; the coefficients are, in this order, the intercept for Caucasians, the slope for education for Caucasians, the difference in intercepts, and the difference in slopes. Hence, the interaction term is also available on its own, without the main effects ethnicity and education, namely as education:ethnicity. The following call is therefore equivalent to the preceding one, with every term written out:
R> cps_int <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity + education:ethnicity, data = CPS1988)
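To make the expansion concrete, one can inspect the regressor matrix that R builds from each formula. The sketch below uses a tiny made-up data frame; the variable names mimic those in the text, but the values are invented for illustration.

```r
# Illustrative data frame; variable names mimic the text, values are made up.
d <- data.frame(
  education = c(12, 16, 10, 14, 18, 12),
  ethnicity = factor(c("cauc", "afam", "cauc", "afam", "cauc", "afam"),
                     levels = c("cauc", "afam"))
)

X_star <- model.matrix(~ education * ethnicity, data = d)
X_long <- model.matrix(~ education + ethnicity + education:ethnicity, data = d)

colnames(X_star)
# The two specifications generate the same regressor matrix.
all.equal(X_star, X_long)

# The interaction column is the product of the afam dummy and education.
dummy_afam <- as.numeric(d$ethnicity == "afam")
all.equal(unname(X_star[, "education:ethnicityafam"]), dummy_afam * d$education)
```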
Separate regression for each level
As a further variation, it may be necessary to fit separate regressions for African-Americans and Caucasians. This could either be done by computing two separate "lm" objects using the subset argument to lm() (e.g., lm(formula, data, subset = ethnicity == "afam", ...)) or, more conveniently, by using a single linear-model object in the form
R> cps_sep <- lm(log(wage) ~ ethnicity / (experience + I(experience^2) + education) - 1, data = CPS1988)
This model specifies that the terms within parentheses are nested within ethnicity. Here, an intercept is not needed since it is best replaced by two separate intercepts for the two levels of ethnicity; the term -1 removes it. For reference, the model without interactions, cps_lm, is summarized first:
R> cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988)
R> summary(cps_lm)
##
## Call:
## lm(formula = log(wage) ~ experience + I(experience^2) + education +
## ethnicity, data = CPS1988)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9428 -0.3162 0.0580 0.3756 4.3830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.321e+00 1.917e-02 225.38 <2e-16 ***
## experience 7.747e-02 8.800e-04 88.03 <2e-16 ***
## I(experience^2) -1.316e-03 1.899e-05 -69.31 <2e-16 ***
## education 8.567e-02 1.272e-03 67.34 <2e-16 ***
## ethnicityafam -2.434e-01 1.292e-02 -18.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5839 on 28150 degrees of freedom
## Multiple R-squared: 0.3347, Adjusted R-squared: 0.3346
## F-statistic: 3541 on 4 and 28150 DF, p-value: < 2.2e-16
For cps_sep, we just give the estimated coefficients for the two groups defined by the levels of ethnicity, for compactness. The column labels are borrowed from the first four coefficients of cps_lm:
R> cps_sep_cf <- matrix(coef(cps_sep), nrow = 2)
R> rownames(cps_sep_cf) <- levels(CPS1988$ethnicity)
R> colnames(cps_sep_cf) <- names(coef(cps_lm))[1:4]
R> cps_sep_cf
## (Intercept) experience I(experience^2) education
## cauc 4.310196 0.07923367 -0.0013596710 0.08575089
## afam 4.159317 0.06189576 -0.0009414879 0.08654232
This shows that the effects of education are similar for both groups, but the remaining coefficients are somewhat smaller in absolute size for African-Americans. The two models can be compared using anova():
R> anova(cps_sep, cps_lm)
## Analysis of Variance Table
##
## Model 1: log(wage) ~ ethnicity/(experience + I(experience^2) + education) -
## 1
## Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28147 9581.8
## 2 28150 9598.6 -3 -16.814 16.464 1.099e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hence, the model where ethnicity interacts with every other regressor fits significantly better, at any reasonable level, than the model without any interaction term.
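The claim that the nested specification reproduces separate regressions for each group can be verified directly. The following is a minimal sketch on simulated data with a single regressor; the names grp, educ, and lwage and all numerical values are assumptions made for illustration, not the CPS1988 variables.

```r
# Simulated stand-in for a wage data set; names and coefficients are assumptions.
set.seed(42)
n <- 500
grp <- factor(sample(c("cauc", "afam"), n, replace = TRUE),
              levels = c("cauc", "afam"))
educ <- round(runif(n, 8, 18))
lwage <- ifelse(grp == "cauc", 4.3 + 0.09 * educ, 4.1 + 0.08 * educ) +
  rnorm(n, sd = 0.5)
d <- data.frame(lwage, educ, grp)

# One model with group-specific intercepts and slopes ...
fit_nested <- lm(lwage ~ grp / educ - 1, data = d)
# ... versus two regressions on the subsets.
fit_cauc <- lm(lwage ~ educ, data = d, subset = grp == "cauc")
fit_afam <- lm(lwage ~ educ, data = d, subset = grp == "afam")

# Arrange the nested coefficients as one row per group, as in the text.
cf_nested <- matrix(coef(fit_nested), nrow = 2,
                    dimnames = list(levels(d$grp), c("(Intercept)", "educ")))
cf_subset <- rbind(cauc = coef(fit_cauc), afam = coef(fit_afam))

cf_nested
all.equal(cf_nested, cf_subset)
```

The point estimates coincide, but the standard errors generally do not: the single nested model pools the residual variance across groups, whereas each subset regression estimates its own.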
Change of the reference category
In any regression containing an (unordered) factor, R by default uses the first level of the factor as the reference category (whose coefficient is fixed at zero). In CPS1988, “cauc” is the reference category for ethnicity, while “northeast” is the reference category for region.
Bierens and Ginther (2001) employ "south" as the reference category for region. For comparison with their article, we now change the contrast coding of this factor, so that "south" becomes the reference category. This can be achieved in various ways; e.g., by using contrasts() or by simply changing the order of the levels in the factor. As the former offers far more complexity than is needed here (but is required, for example, in statistical experimental design), we only present a solution using the latter. We set the reference category for region in the CPS1988 data frame using relevel() and subsequently fit a model in which this is included:
R> CPS1988$region <- relevel(CPS1988$region, ref = "south")
R> cps_region <- lm(log(wage) ~ ethnicity + education + experience + I(experience^2) + region, data = CPS1988)
R> coef(cps_region)
## (Intercept) ethnicityafam education experience
## 4.283606335 -0.225678877 0.084672493 0.077656452
## I(experience^2) regionnortheast regionmidwest regionwest
## -0.001322942 0.131920488 0.043789477 0.040326813
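Changing the reference category only reparameterizes the model; it does not change the fit. A small sketch on invented data illustrates this, assuming a factor with the same four level names as region and made-up group means.

```r
# Toy data; the level names follow the text, the numbers are invented.
set.seed(7)
region <- factor(rep(c("northeast", "midwest", "south", "west"), each = 25),
                 levels = c("northeast", "midwest", "south", "west"))
means <- c(northeast = 1.0, midwest = 0.8, south = 0.5, west = 0.9)
y <- unname(means[as.character(region)]) + rnorm(100, sd = 0.1)

fit_ne <- lm(y ~ region)                     # reference: first level, "northeast"
region_s <- relevel(region, ref = "south")   # "south" moved to the front
fit_s <- lm(y ~ region_s)                    # reference: "south"

levels(region_s)
# Same fit, different parameterization.
all.equal(fitted(fit_ne), fitted(fit_s))
# The new intercept equals the old intercept plus the old "south" effect.
all.equal(unname(coef(fit_s)["(Intercept)"]),
          unname(coef(fit_ne)["(Intercept)"] + coef(fit_ne)["regionsouth"]))
```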
Weighted least squares
Cross-section regressions are often plagued by heteroskedasticity. Here, we illustrate one of the remedies, weighted least squares (WLS), in an application to the journals data. The weights are passed to lm() via its weights argument:
R> jour_wls1 <- lm(log(journals$subs) ~ log(journals$citeprice), weights = 1/journals$citeprice^2)
R> jour_wls2 <- lm(log(journals$subs) ~ log(journals$citeprice), weights = 1/journals$citeprice)
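As a side check on what the weights argument does, WLS with weights w is the same as ordinary least squares after multiplying the response and every regressor (including the constant) by sqrt(w). The sketch below demonstrates this on simulated heteroskedastic data; the data-generating process is an assumption chosen only for illustration.

```r
# Simulated heteroskedastic data; the variance structure is an assumption.
set.seed(123)
n <- 150
x <- runif(n, 0.5, 5)
y <- 2 - 0.5 * log(x) + rnorm(n, sd = 0.3 * x)   # error sd grows with x

w <- 1 / x^2                                     # weights proportional to 1/variance
fit_wls <- lm(y ~ log(x), weights = w)

# Equivalent OLS fit on data scaled by sqrt(w); the intercept column is scaled too.
sw <- sqrt(w)
fit_ols_scaled <- lm(I(sw * y) ~ 0 + sw + I(sw * log(x)))

cbind(wls = coef(fit_wls), scaled_ols = coef(fit_ols_scaled))
```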
R> jour_lm <- lm(log(journals$subs) ~ log(journals$citeprice), data = journals)
R> plot(log(journals$subs) ~ log(journals$citeprice), data = journals)
R> abline(jour_lm)
R> abline(jour_wls1, lwd = 2, lty = 2)
R> abline(jour_wls2, lwd = 2, lty = 3)
R> legend("bottomleft", c("OLS", "WLS1", "WLS2"), lty = 1:3, lwd = 2, bty = "n")
Fig. Scatterplot of the journals data with least-squares (solid) and weighted least-squares (dashed and dotted) lines.
When the form of the variance function is not known in advance, the weights can instead be estimated from an auxiliary regression of the log squared OLS residuals on log(citeprice) (feasible generalized least squares, FGLS). The estimated exponent can then be iterated until it no longer changes:
R> auxreg <- lm(log(residuals(jour_lm)^2) ~ log(citeprice), data = journals)
R> jour_fgls1 <- lm(log(journals$subs) ~ log(journals$citeprice), weights = 1/exp(fitted(auxreg)), data = journals)
R> gamma2i <- coef(auxreg)[2]
R> gamma2 <- 0
R> while(abs((gamma2i - gamma2)/gamma2) > 1e-7) {
+ gamma2 <- gamma2i
+ fglsi <- lm(log(journals$subs) ~ log(journals$citeprice), data = journals, weights = 1/journals$citeprice^gamma2)
+ gamma2i <- coef(lm(log(residuals(fglsi)^2) ~ log(journals$citeprice), data = journals))[2]
+ }
R> jour_fgls2 <- lm(log(journals$subs) ~ log(journals$citeprice), data = journals, weights = 1/journals$citeprice^gamma2)
R> coef(jour_fgls2)
## (Intercept) log(journals$citeprice)
## 4.7731440 -0.5051952
Fig. Scatterplot of the journals data with OLS (solid) and iterated FGLS (dashed) lines.
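The iteration can be tried out on data where the true variance exponent is known. The following is a sketch on simulated data, assuming an error variance proportional to x^2; the iteration cap max_iter is a safety device added here and is not part of the loop in the text.

```r
# Simulated data with Var(error) proportional to x^2; all values are assumptions.
set.seed(2024)
n <- 500
x <- runif(n, 0.2, 5)
y <- 4.8 - 0.5 * log(x) + rnorm(n, sd = 0.4 * x)   # true variance exponent: 2

ols <- lm(y ~ log(x))
gamma2i <- coef(lm(log(residuals(ols)^2) ~ log(x)))[2]   # initial exponent estimate
gamma2 <- 0
iter <- 0
max_iter <- 100   # safety cap, an addition for this sketch

while (abs((gamma2i - gamma2) / gamma2) > 1e-7 && iter < max_iter) {
  gamma2 <- gamma2i                                  # accept the latest estimate
  fglsi <- lm(y ~ log(x), weights = 1 / x^gamma2)    # WLS step with current weights
  gamma2i <- coef(lm(log(residuals(fglsi)^2) ~ log(x)))[2]  # re-estimate exponent
  iter <- iter + 1
}

fgls <- lm(y ~ log(x), weights = 1 / x^gamma2)
c(iterations = iter, gamma2 = unname(gamma2))
coef(fgls)
```

Because gamma2 starts at zero, the first relative change is infinite, which guarantees that the loop body runs at least once.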
This topic has been taken from Chapter 3 (Linear Regression) of Applied Econometrics with R.