Karim Naguib (Boston University)
10/15/2013
What is needed, then, is a curve to fit the data: this can be done by using a quadratic function instead of a linear one \[ TestScore_i = \beta_0 + \beta_1 Income_i + \beta_2 Income_i^2 + u_i \]
This is called a quadratic regression model with the population regression function \[ E[TestScore_i|Income_i] = \beta_0 + \beta_1 Income_i + \beta_2 Income_i^2 \]
To carry out OLS estimation using a quadratic model, we simply consider it a multiple regression with two variables: \( Income_i \) and \( Income_i^2 \)
# coeftest() and vcovHC() are provided by the lmtest and sandwich packages
library(lmtest)
library(sandwich)
test.score.data$avginc.squared <- test.score.data$avginc^2
regress.results <- lm(testscr ~ avginc + avginc.squared, data = test.score.data)
coeftest(regress.results, vcov.=vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 607.30174 2.92422 207.68 <2e-16 ***
avginc 3.85099 0.27110 14.20 <2e-16 ***
avginc.squared -0.04231 0.00488 -8.67 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(regress.results)$adj.r.squared
[1] 0.554
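For comparison, a minimal sketch (assuming the same test.score.data) that also fits the simple linear specification and reports both adjusted \( R^2 \) values:
# Adjusted R^2 of the linear vs. the quadratic specification
linear.results <- lm(testscr ~ avginc, data = test.score.data)
summary(linear.results)$adj.r.squared   # linear fit
summary(regress.results)$adj.r.squared  # quadratic fit (from above)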
A general form of a regression model is
\[ Y_i = \underbrace{f(X_{1i}, X_{2i},\dots, X_{ki})}_{\text{regression function}} + u_i, i = 1,\dots,n \]
This regression function can also be defined as
\[ E[Y_i|X_{1i},X_{2i},\dots, X_{ki}] = f(X_{1i}, X_{2i},\dots, X_{ki}) \]
In the case of a linear population regression function we would have something like
\[ f(X_{1i}, X_{2i},\dots, X_{ki}) = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} \]
In the case of a nonlinear regression function it could look something like
\[ f(Income_i) = \beta_0 + \beta_1 Income_i + \beta_2 Income_i^2 \]
When we want to know the effect of a change \( \Delta X_1 \) in \( X_1 \) on \( Y \) in a linear population regression model, we calculate it as \[ \Delta Y = \beta_1 \Delta X_1 \]
In a nonlinear model this calculation is more complicated because the effect of a change in \( X_1 \) depends on the value of \( X_1 \) (and possibly of the other regressors) at which it is evaluated
For a nonlinear regression function we calculate the expected change in \( Y \), \( \Delta Y \), in response to a change \( \Delta X_1 \) in \( X_1 \) and holding all other variables fixed by \[ \Delta Y = f(X_1 + \Delta X_1, X_2, \dots, X_k) - f(X_1, X_2, \dots, X_k) \]
Since we don't observe the population regression function \( f \) we rely on the estimated function \( \hat{f} \)
Suppose we wish to calculate the predicted change in test scores in response to a $1,000 increase in district income. Since this predicted change depends on the initial income, we consider two cases: an increase from $10,000 to $11,000 and an increase from $40,000 to $41,000
For the first case \[ \Delta \hat{Y} = [\hat{\beta}_0 + \hat{\beta}_1 \times 11 + \hat{\beta}_2 \times 11^2] - [\hat{\beta}_0 + \hat{\beta}_1 \times 10 + \hat{\beta}_2 \times 10^2] = \hat{\beta}_1 + 21\hat{\beta}_2 \approx 2.96 \] and for the second case \( \Delta \hat{Y} = \hat{\beta}_1 + 81\hat{\beta}_2 \approx 0.42 \)
The standard error of \( \Delta \hat{Y} \) for the first case would be
\[ SE(\Delta \hat{Y}) = SE(\hat{\beta}_1 + 21 \hat{\beta}_2) = \frac{|\Delta \hat{Y}|}{\sqrt{F}} \]
Recall that to test the single restriction \( \beta_1 + 21\beta_2 = 0 \) we use
\[ F = t^2 = \left[\frac{\hat{\beta}_1 + 21\hat{\beta}_2}{SE(\hat{\beta}_1 + 21\hat{\beta}_2)}\right]^2 \]
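A minimal sketch of this computation in R, assuming the quadratic fit from above is still stored in regress.results (lht() is shorthand for linearHypothesis() from the car package):
# Robust F-test of the single restriction beta_1 + 21*beta_2 = 0
f.test <- lht(regress.results, 'avginc + 21*avginc.squared = 0', test = 'F', vcov = vcovHC(regress.results))
# Predicted change for an income increase from $10,000 to $11,000 and its standard error
b <- coef(regress.results)
delta.y.hat <- b['avginc'] * (11 - 10) + b['avginc.squared'] * (11^2 - 10^2)
abs(delta.y.hat) / sqrt(f.test$F[2])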
In linear specifications it was easy to interpret the meaning of a coefficient
\[ \beta_1 = \frac{\Delta Y}{\Delta X_1} \]
But in a nonlinear specification we cannot use the same interpretation. It is more useful to show the effect of changes in \( X_1 \) on \( Y \) with a graph or by calculating \( \Delta Y \) directly.
The polynomial regression model of degree \( r \) is
\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \cdots + \beta_r X_i^r + u_i \]
It is the general form that includes the quadratic model discussed earlier (\( r = 2 \))
In order to test the null hypothesis that the population regression function is linear, we test \( q = r - 1 \) restrictions:
\[ \begin{align*} H_0&: \beta_2 = 0, \beta_3 = 0,\dots, \beta_r = 0 \\ H_1&: \text{at least one }\beta_j \ne 0, j = 2,\dots,r \end{align*} \]
Consider estimating the cubic regression model (\( r = 3 \)) where test scores are regressed on district income
test.score.data$avginc.cubed <- test.score.data$avginc^3
regress.results <- lm(testscr ~ avginc + avginc.squared + avginc.cubed, data = test.score.data)
coeftest(regress.results, vcov.=vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.00e+02 5.46e+00 109.86 < 2e-16 ***
avginc 5.02e+00 7.87e-01 6.37 4.9e-10 ***
avginc.squared -9.58e-02 3.41e-02 -2.81 0.0051 **
avginc.cubed 6.85e-04 4.37e-04 1.57 0.1174
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lht(regress.results, c('avginc.squared = 0', 'avginc.cubed = 0'), test = 'F', vcov=vcovHC(regress.results))
Linear hypothesis test
Hypothesis:
avginc.squared = 0
avginc.cubed = 0
Model 1: restricted model
Model 2: testscr ~ avginc + avginc.squared + avginc.cubed
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 418
2 416 2 29.7 8.9e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The relationship between logarithms and percentages relies on the following approximation: when \( \frac{\Delta x}{x} \) is small,
\[ \ln(x + \Delta x) - \ln(x) \cong \frac{\Delta x}{x} \]
The percentage change in \( x \) is \( \frac{\Delta x}{x} \times 100 \)
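A quick numeric check of the approximation for a 1% change in \( x \):
# ln(x + Delta x) - ln(x) for x = 100 and Delta x = 1
log(101) - log(100)   # about 0.00995, close to Delta x / x = 0.01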
The regression model is
\[ Y_i = \beta_0 + \beta_1\ln(X_i) + u_i, i = 1,\dots,n \]
which is called the linear-log model
\[ \begin{align*} \Delta Y &= [\beta_0 + \beta_1 \ln(X + \Delta X)] - [\beta_0 + \beta_1 \ln(X)] \\ &= \beta_1[\ln(X + \Delta X) - \ln(X)] \cong \beta_1\frac{\Delta X}{X} \end{align*} \]
regress.results <- lm(testscr ~ log(avginc), data = test.score.data)
coeftest(regress.results, vcov.=vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 557.83 3.86 144.4 <2e-16 ***
log(avginc) 36.42 1.41 25.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What is the predicted difference in test scores of districts with average income $10,000 vs. $11,000?
\[ \Delta \hat{Y} = [557.8 + 36.42\ln(11)] - [557.8 + 36.42\ln(10)] = 3.47 \]
What is the predicted difference in test scores of districts with average income $40,000 vs. $41,000?
\[ \Delta \hat{Y} = [557.8 + 36.42\ln(41)] - [557.8 + 36.42\ln(40)] = 0.90 \]
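The same differences can be obtained from the fitted model with predict(); a minimal sketch, recalling that avginc is measured in thousands of dollars:
# Predicted test scores at average incomes of $10,000, $11,000, $40,000, and $41,000
preds <- predict(regress.results, newdata = data.frame(avginc = c(10, 11, 40, 41)))
preds[2] - preds[1]   # roughly 3.47
preds[4] - preds[3]   # roughly 0.90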
The regression model is
\[ \ln(Y_i) = \beta_0 + \beta_1 X_i + u_i, i = 1,\dots,n \]
which is called the log-linear model
\[ \begin{align*} \ln(Y + \Delta Y) - \ln(Y) &= [\beta_0 + \beta_1(X + \Delta X)] - [\beta_0 + \beta_1(X)] \\ &= \beta_1 \Delta X \end{align*} \]
Using the approximation \( \ln(Y + \Delta Y) - \ln(Y) \cong \frac{\Delta Y}{Y} \), we get \( \frac{\Delta Y}{Y} \cong \beta_1 \Delta X \)
Suppose we want to regress the logarithm of earnings on the age of college graduates from some 2009 CPS data
regress.results <- lm(log(ahe) ~ age, data = cps.92.08, subset = (bachelor == 1))
coeftest(regress.results, vcov.=vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.70582 0.06428 26.5 <2e-16 ***
age 0.03756 0.00218 17.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Earnings are predicted to increase by 3.76% for each additional year of age
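As a rough check of this interpretation, a minimal sketch comparing predicted earnings one year apart (the ages 30 and 31 are purely illustrative):
# Predicted log earnings at ages 30 and 31 differ by beta_1 = 0.0376,
# which corresponds to roughly a 3.8% difference in predicted earnings
preds <- predict(regress.results, newdata = data.frame(age = c(30, 31)))
exp(preds[2] - preds[1]) - 1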
The regression model is
\[ \ln(Y_i) = \beta_0 + \beta_1\ln(X_i) + u_i \]
which is called a log-log model
\[ \begin{align*} \ln(Y + \Delta Y) - \ln(Y) &= [\beta_0 + \beta_1\ln(X + \Delta X)] - [\beta_0 + \beta_1\ln(X)] \\ &= \beta_1[\ln(X + \Delta X) - \ln(X)] \end{align*} \] \[ \therefore \frac{\Delta Y}{Y} \cong \beta_1\frac{\Delta X}{X}\text{ or }\beta_1 = \frac{\Delta Y/Y}{\Delta X/X} = \frac{\text{percentage change in }Y}{\text{percentage change in }X} \]
regress.results <- lm(log(testscr) ~ log(avginc), data = test.score.data)
coeftest(regress.results, vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.33635 0.00596 1063.4 <2e-16 ***
log(avginc) 0.05542 0.00216 25.7 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This means that a 1% increase in income is estimated to cause a 0.0554% increase in test scores.
For comparison consider a log-linear model of the logarithm of test scores regressed on income
regress.results <- lm(log(testscr) ~ avginc, data = test.score.data)
coeftest(regress.results, vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.439362 0.002987 2155.4 <2e-16 ***
avginc 0.002844 0.000183 15.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now that we've seen different forms of polynomial and logarithmic specifications, how do we compare them?
It is possible that students who are still learning English respond differently to the STR than those who aren't. In other words, the STR could interact with the percentage of English learners in a district.
We will consider three types of possible interactions between independent variables
Consider a model with two binary variables, where the dependent variable \( Y_i \) is the log of earnings, and the independent variables are whether a worker has a college degree (\( D_{1i} \)), and a worker's gender (\( D_{2i} \)).
\[ Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + u_i \]
In order to model the possibility of an interaction between college degrees and gender, we use an interaction regression model \[ Y_i = \beta_0 + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3(D_{1i}\times D_{2i}) + u_i \]
The term \( D_{1i} \times D_{2i} \) is called an interaction term or an interacted regressor
\[ \begin{align*} E[Y_i|D_{1i} = 0, D_{2i} = d_2] &= \beta_0 + \beta_1\times 0 + \beta_2\times d_2 + \beta_3\times(0\times d_2) \\ &= \beta_0 + \beta_2 d_2 \end{align*} \]
\[ \begin{align*} E[Y_i|D_{1i} = 1, D_{2i} = d_2] &= \beta_0 + \beta_1\times 1 + \beta_2\times d_2 + \beta_3\times(1\times d_2) \\ &= \beta_0 + \beta_1 + \beta_2 d_2 + \beta_3 d_2 \end{align*} \]
The difference between the two is
\[ E[Y_i|D_{1i} = 1, D_{2i} = d_2] - E[Y_i|D_{1i} = 0, D_{2i} = d_2] = \beta_1 + \beta_3 d_2 \]
Let us add two dummy variables to capture whether the STR is high and whether there is a high percentage of English learners. Then we regress \( TestScore \) on them and their interaction term
test.score.data$hi.str <- ifelse(test.score.data$str >= 20, 1, 0)
test.score.data$hi.el <- ifelse(test.score.data$el.pct >= 10, 1, 0)
regress.results <- lm(testscr ~ hi.str + hi.el + hi.str : hi.el, data = test.score.data)
coeftest(regress.results, vcov.=vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 664.14 1.39 477.53 < 2e-16 ***
hi.str -1.91 1.94 -0.98 0.33
hi.el -18.16 2.36 -7.70 9.7e-14 ***
hi.str:hi.el -3.49 3.14 -1.11 0.27
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
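The role of the interaction coefficient is easier to see from the predicted test score in each of the four groups; a minimal sketch using the fit above:
# Predicted test scores for the four (hi.str, hi.el) combinations
groups <- expand.grid(hi.str = c(0, 1), hi.el = c(0, 1))
cbind(groups, predicted = predict(regress.results, newdata = groups))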
Consider regressing log earnings on an individual's work experience (\( X_i \)) and a binary variable indicating whether they earned a college degree (\( D_i \))
\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + u_i \]
Using the dummy variable \( D_i \) allows the regression's intercept to differ depending on whether the worker has a college degree
However, it does not allow the slope to differ by degree status. For that we add an interaction term \[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3(X_i \times D_i) + u_i \]
Consider regressing test scores on \( STR \) but allowing for a different slope (or effect) depending on whether the percentage of English learners is high or low
regress.results <- lm(testscr ~ str + hi.el + str : hi.el, data = test.score.data)
coeftest(regress.results, vcov.=vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 682.246 12.071 56.52 <2e-16 ***
str -0.968 0.599 -1.62 0.11
hi.el 5.639 19.889 0.28 0.78
str:hi.el -1.277 0.986 -1.30 0.20
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This means that we have two regression lines depending on whether there is a high or low percentage of English learners in the school district
Low percentage:
\[ 682.2 - 0.97 STR_i \]
High percentage:
\[ (682.2 + 5.6) - (0.97 + 1.28)STR_i = 687.8 - 2.25 STR_i \]
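These intercepts and slopes can be recovered directly from the estimated coefficients; a minimal sketch:
# Intercept and slope for districts with a low percentage of English learners
b <- coef(regress.results)
c(b['(Intercept)'], b['str'])
# Intercept and slope for districts with a high percentage of English learners
c(b['(Intercept)'] + b['hi.el'], b['str'] + b['str:hi.el'])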
To test whether the two regression lines are the same, we test the joint hypothesis that the coefficients on both \( HiEL_i \) and \( STR_i \times HiEL_i \) are zero
lht(regress.results, c('hi.el = 0', 'str:hi.el = 0'), test = 'F', vcov = vcovHC(regress.results))
Linear hypothesis test
Hypothesis:
hi.el = 0
str:hi.el = 0
Model 1: restricted model
Model 2: testscr ~ str + hi.el + str:hi.el
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 418
2 416 2 88.8 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The last case is an interaction between two continuous variables, for example regressing log earnings on two continuous regressors and their product. In general the model is \[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3(X_{1i}\times X_{2i}) + u_i \]
Consider regressing test scores on \( STR \), the percentage of English learners (both continuous variables), and their interaction
regress.results <- lm(testscr ~ str + el.pct + str : el.pct, data = test.score.data)
coeftest(regress.results, vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 686.33852 11.93785 57.49 <2e-16 ***
str -1.11702 0.59652 -1.87 0.062 .
el.pct -0.67291 0.38654 -1.74 0.082 .
str:el.pct 0.00116 0.01916 0.06 0.952
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
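In this specification the estimated effect of a change in the STR depends on the percentage of English learners, \( \Delta TestScore/\Delta STR = \hat{\beta}_1 + \hat{\beta}_3\,PctEL \). A minimal sketch evaluating this at a few values of el.pct:
# Estimated effect of STR on test scores at the quartiles of el.pct
b <- coef(regress.results)
el.levels <- quantile(test.score.data$el.pct, probs = c(0.25, 0.5, 0.75))
b['str'] + b['str:el.pct'] * el.levels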