This example summarizes several aspects that we have seen in the previous sessions. All of them are discussed using a simulated dataset: unlike in field work, we know the data generating process and can therefore easily evaluate how well a regression model works.
Let's consider the following, admittedly far-fetched, example in which grade point average (GPA) is determined by only a handful of variables. Note that some of these variables are not directly observable to the researcher.
Let's create a dataset given the following true model.
\[ gpa = 7.8 - 0.2 \cdot weedsmoking - 0.2 \cdot parents - 0.15 \cdot sport - 0.4 \cdot lazy + u \]
Note that weedsmoking is generated as follows:
\[ weedsmokingstar = -3.5 + 2.5 \cdot parents - 0.25 \cdot sport + 1.2 \cdot lazy - 0.01 \cdot distance + error \]
with weedsmoking = 0 if weedsmokingstar < 6, and weedsmoking = 1 otherwise.
Make the example reproducible. What do you need to do in advance of the simulation in order to do so?
Generate the variables described above. Choose reasonable values (e.g. no negative grams consumed). Create a sample with 10,000 observations.
set.seed(23)
# The Utopia example... omitted variable bias and so on...
sport <- 10 * rbeta(10000, 2, 6)
lazy <- rnorm(10000, 8, 1)
parents <- rbinom(10000, 1, 0.1)
# instrument
distance <- rnorm(10000, 100, 30)
u <- 0.2 * rnorm(10000)
# true model for weedsmoking:
error <- rnorm(10000)
weedsmokingstar <- -3.5 + 2.5 * parents - 0.25 * sport + 1.2 * lazy - 0.01 *
distance + error
# check whether there are any negative grams... hint: weedsmokingstar == 0
weedsmoking <- rep(0, 10000)
weedsmoking[weedsmokingstar > 6] <- 1
# true model
gpa <- 7.8 - 0.2 * weedsmoking - 0.2 * parents - 0.15 * sport - 0.4 * lazy +
u
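As a quick sanity check of the simulated data, you could inspect the latent variable and the resulting dummy (a minimal sketch; which summaries to look at is up to you):
# distribution of the latent variable; negative values simply end up as weedsmoking = 0
summary(weedsmokingstar)
# how many (non-)smokers does the sample contain?
table(weedsmoking)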
Compare the regression coefficients with those of the true model.
fit1 <- lm(gpa ~ weedsmoking + parents + sport + lazy)
summary(fit1)
##
## Call:
## lm(formula = gpa ~ weedsmoking + parents + sport + lazy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9065 -0.1324 -0.0013 0.1361 0.6848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.79618 0.01828 426.5 <2e-16 ***
## weedsmoking -0.20123 0.00615 -32.7 <2e-16 ***
## parents -0.20027 0.00753 -26.6 <2e-16 ***
## sport -0.14914 0.00141 -105.8 <2e-16 ***
## lazy -0.39947 0.00233 -171.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.201 on 9995 degrees of freedom
## Multiple R-squared: 0.861, Adjusted R-squared: 0.861
## F-statistic: 1.54e+04 on 4 and 9995 DF, p-value: <2e-16
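To make the comparison with the true model explicit, you can line up the OLS estimates against the true coefficients from the data generating process (the vector of true values is taken from the equation above):
# true coefficients vs. OLS estimates (same order as in the fitted model)
cbind(true = c(7.8, -0.2, -0.2, -0.15, -0.4), estimated = coef(fit1))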
Diagnostics: Normality of residuals
qqnorm(fit1$residuals)
qqline(fit1$residuals)
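Besides the QQ-plot, a histogram of the residuals with a normal density overlaid gives a quick visual impression (just one possible additional check):
# histogram of the residuals with a fitted normal density
hist(fit1$residuals, breaks = 50, freq = FALSE)
curve(dnorm(x, mean = 0, sd = sd(fit1$residuals)), add = TRUE)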
Assume the researcher could not observe laziness or simply forgot to include this important variable.
fit_biased <- lm(gpa ~ weedsmoking + parents + sport)
summary(fit_biased)
##
## Call:
## lm(formula = gpa ~ weedsmoking + parents + sport)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8221 -0.2693 -0.0086 0.2615 1.5578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.75116 0.00859 552.90 < 2e-16 ***
## weedsmoking -0.74010 0.01049 -70.55 < 2e-16 ***
## parents 0.09545 0.01455 6.56 5.6e-11 ***
## sport -0.17099 0.00279 -61.39 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.398 on 9996 degrees of freedom
## Multiple R-squared: 0.451, Adjusted R-squared: 0.451
## F-statistic: 2.74e+03 on 3 and 9996 DF, p-value: <2e-16
Omitting an important variable that is correlated with the included regressors leads to biased estimates: the coefficient on weedsmoking is now far from its true value of -0.2.
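The direction and magnitude of the bias follow from the omitted variable bias formula: the bias of the weedsmoking coefficient is approximately the true effect of lazy times the partial association between lazy and weedsmoking. A rough check along these lines (the auxiliary regression aux below is introduced here only to illustrate the formula):
# auxiliary regression of the omitted variable on the included regressors
aux <- lm(lazy ~ weedsmoking + parents + sport)
# approximate bias of the weedsmoking coefficient: beta_lazy * delta_weedsmoking
-0.4 * coef(aux)["weedsmoking"]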
R provides several ways to address the issue described above. Instrumental variable (IV) regression with ivreg() from the AER package and manual two-stage least squares (2SLS) are two popular approaches. The latter can also be implemented step by step without additional packages.
stage_1 <- lm(weedsmoking ~ parents + distance + sport)
summary(stage_1)
##
## Call:
## lm(formula = weedsmoking ~ parents + distance + sport)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8912 -0.2079 -0.1440 -0.0169 1.0415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.427036 0.014618 29.2 <2e-16 ***
## parents 0.564321 0.012574 44.9 <2e-16 ***
## distance -0.001563 0.000125 -12.5 <2e-16 ***
## sport -0.039639 0.002605 -15.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.377 on 9996 degrees of freedom
## Multiple R-squared: 0.193, Adjusted R-squared: 0.193
## F-statistic: 797 on 3 and 9996 DF, p-value: <2e-16
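The relevance of the instrument can be checked with a partial F-test on distance in the first stage; one possible way to do this with base R only is to compare the first stage with and without the instrument:
# partial F-test for the instrument: first stage without vs. with distance
anova(lm(weedsmoking ~ parents + sport), stage_1)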
est_weed <- fitted(stage_1)
stage_2 <- lm(gpa ~ est_weed + sport + parents)
summary(stage_2)
##
## Call:
## lm(formula = gpa ~ est_weed + sport + parents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9770 -0.3180 0.0109 0.3273 1.7374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.65145 0.02989 155.64 < 2e-16 ***
## est_weed -0.37334 0.10370 -3.60 0.00032 ***
## sport -0.15637 0.00533 -29.33 < 2e-16 ***
## parents -0.11144 0.06071 -1.84 0.06646 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.487 on 9996 degrees of freedom
## Multiple R-squared: 0.179, Adjusted R-squared: 0.179
## F-statistic: 727 on 3 and 9996 DF, p-value: <2e-16
require(AER)
summary(ivreg(gpa ~ parents + sport + weedsmoking | parents + distance + sport))
##
## Call:
## ivreg(formula = gpa ~ parents + sport + weedsmoking | parents +
## distance + sport)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.90173 -0.28837 -0.00764 0.27663 1.64707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.65145 0.02589 179.69 < 2e-16 ***
## parents -0.11144 0.05259 -2.12 0.034 *
## sport -0.15637 0.00462 -33.86 < 2e-16 ***
## weedsmoking -0.37334 0.08982 -4.16 3.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.422 on 9996 degrees of freedom
## Multiple R-Squared: 0.384, Adjusted R-squared: 0.384
## Wald test: 969 on 3 and 9996 DF, p-value: <2e-16
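Note that the manual two-stage approach and ivreg() produce identical point estimates, but the reported standard errors differ: the second-stage lm() uses residuals based on the fitted values rather than the observed weedsmoking, so its standard errors are not valid; the ivreg() standard errors are the ones to rely on. A possible side-by-side comparison, storing the ivreg fit first (iv_fit is just a name introduced here):
iv_fit <- ivreg(gpa ~ parents + sport + weedsmoking | parents + distance + sport)
# identical coefficients, different standard errors in the two summaries
cbind(manual_2SLS = coef(stage_2)[c("(Intercept)", "est_weed", "parents", "sport")],
      ivreg = coef(iv_fit)[c("(Intercept)", "weedsmoking", "parents", "sport")])
# additional IV diagnostics (weak instruments, Wu-Hausman; Sargan needs overidentification)
summary(iv_fit, diagnostics = TRUE)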
When reading the data, make use of the sep = "" argument, which specifies the delimiter. Some possible delimiters are ";", ",", and "\t".
http://personality-testing.info/_rawdata/
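As an illustration only (the file name below is a placeholder for whichever dataset you download from the page above), reading a tab-separated file could look like this:
# hypothetical file name; adjust the path and delimiter to the file you downloaded
dat <- read.table("data.csv", sep = "\t", header = TRUE)
str(dat)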