Simulation Example: Linear Regression

This example summarizes several aspects that we've seen in the previous sessions.

create random data based on different distributions
R's lm() function, ordinary least squares estimation and one of its limitations
R's formula syntax (Remember: y ~ x)

All aspects are discussed using a previously simulated dataset. As opposed to field work we do know the data generating process and thus can evaluate easily how well a regression model works.

The Effect of Smoking Weed on School Grades in Utopia

Let's consider the following admittedly far fetched example in which grade point average is only determined by a handful of variables. Note that some of these variables are not directly observable to the researcher.

weedsmoking: binary indicator for regular weed consumption
weedsmokingstar: grams consumed
parents: binary indicator for alarming parental background
lazy: laziness score
sport: sportiness score
distance: distance to Amsterdam from where student grows up (instrument)

The True Model

Let's create a dataset given the following true model.

\[ \begin{aligned} gpa = 7.8 - 0.2*weedsmoking - 0.2 * parents - 0.15 * sport - 0.4 * lazy +u \end{aligned} \]

Note that weedsmoking is generated as follows:

\[ \begin{aligned} weedsmokingstar = -3.5+2.5*parents-0.25*sport+1.2*lazy-0.01*distance+error \\ \end{aligned} \]

with weedsmoking = 0, if weedsmokingstar < 6, else weedsmoking = 1.

Task 1: Generate Random Data

Make the example reproducible. What do you need to do in advance of the simulation in order to do so?
Generate the variables described above. Choose reasonable values (i.e. no negative grams consumed). Create a sample with 10 000 observations.

set.seed(23)
# The Utopia example... omitted variable bias and so on...
sport <- 10 * rbeta(10000, 2, 6)
lazy <- rnorm(10000, 8, 1)
parents <- rbinom(10000, 1, 0.1)

# instrument
distance <- rnorm(10000, 100, 30)

u <- 0.2 * rnorm(10000)

# true model:
error <- rnorm(10000)
weedsmokingstar <- -3.5 + 2.5 * parents - 0.25 * sport + 1.2 * lazy - 0.01 * 
    distance + error
# check whether there a negative grams...  hint: weedsmokingstar == 0
weedsmoking <- rep(0, 10000)
weedsmoking[weedsmokingstar > 6] <- 1

# true model
gpa <- 7.8 - 0.2 * weedsmoking - 0.2 * parents - 0.15 * sport - 0.4 * lazy + 
    u

Regression with full knowledge of the DGP

Compare the regression coefficients with those of the true model.

fit1 <- lm(gpa ~ weedsmoking + parents + sport + lazy)
summary(fit1)

## 
## Call:
## lm(formula = gpa ~ weedsmoking + parents + sport + lazy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9065 -0.1324 -0.0013  0.1361  0.6848 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.79618    0.01828   426.5   <2e-16 ***
## weedsmoking -0.20123    0.00615   -32.7   <2e-16 ***
## parents     -0.20027    0.00753   -26.6   <2e-16 ***
## sport       -0.14914    0.00141  -105.8   <2e-16 ***
## lazy        -0.39947    0.00233  -171.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.201 on 9995 degrees of freedom
## Multiple R-squared:  0.861,  Adjusted R-squared:  0.861 
## F-statistic: 1.54e+04 on 4 and 9995 DF,  p-value: <2e-16

Diagnostics: Normality of residuals

qqnorm(fit1$residuals)
qqline(fit1$residuals)

plot of chunk unnamed-chunk-3

Omitted Variables: Biased Regression

Assume the researcher could not observer laziness or simply forgot to include this important variable.

fit_biased <- lm(gpa ~ weedsmoking + parents + sport)
summary(fit_biased)

## 
## Call:
## lm(formula = gpa ~ weedsmoking + parents + sport)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8221 -0.2693 -0.0086  0.2615  1.5578 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.75116    0.00859  552.90  < 2e-16 ***
## weedsmoking -0.74010    0.01049  -70.55  < 2e-16 ***
## parents      0.09545    0.01455    6.56  5.6e-11 ***
## sport       -0.17099    0.00279  -61.39  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.398 on 9996 degrees of freedom
## Multiple R-squared:  0.451,  Adjusted R-squared:  0.451 
## F-statistic: 2.74e+03 on 3 and 9996 DF,  p-value: <2e-16

Omitting important variables leads to biased results.

Adressing Ommitted Variable Bias

R provides several opportunities to address the issue describes above.

instrumental variable estimation (ivreg from the AER R package)
2 Step Least Squares (2SLS)

are two popular methods that can also be implemented using R. The latter can also be implemented step by step without additional packages.

stage_1 <- lm(weedsmoking ~ parents + distance + sport)
summary(stage_1)

## 
## Call:
## lm(formula = weedsmoking ~ parents + distance + sport)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8912 -0.2079 -0.1440 -0.0169  1.0415 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.427036   0.014618    29.2   <2e-16 ***
## parents      0.564321   0.012574    44.9   <2e-16 ***
## distance    -0.001563   0.000125   -12.5   <2e-16 ***
## sport       -0.039639   0.002605   -15.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.377 on 9996 degrees of freedom
## Multiple R-squared:  0.193,  Adjusted R-squared:  0.193 
## F-statistic:  797 on 3 and 9996 DF,  p-value: <2e-16


est_weed <- fitted(stage_1)

stage_2 <- lm(gpa ~ est_weed + sport + parents)
summary(stage_2)

## 
## Call:
## lm(formula = gpa ~ est_weed + sport + parents)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9770 -0.3180  0.0109  0.3273  1.7374 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.65145    0.02989  155.64  < 2e-16 ***
## est_weed    -0.37334    0.10370   -3.60  0.00032 ***
## sport       -0.15637    0.00533  -29.33  < 2e-16 ***
## parents     -0.11144    0.06071   -1.84  0.06646 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.487 on 9996 degrees of freedom
## Multiple R-squared:  0.179,  Adjusted R-squared:  0.179 
## F-statistic:  727 on 3 and 9996 DF,  p-value: <2e-16


require(AER)
summary(ivreg(gpa ~ parents + sport + weedsmoking | parents + distance + sport))

## 
## Call:
## ivreg(formula = gpa ~ parents + sport + weedsmoking | parents + 
##     distance + sport)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.90173 -0.28837 -0.00764  0.27663  1.64707 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.65145    0.02589  179.69  < 2e-16 ***
## parents     -0.11144    0.05259   -2.12    0.034 *  
## sport       -0.15637    0.00462  -33.86  < 2e-16 ***
## weedsmoking -0.37334    0.08982   -4.16  3.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.422 on 9996 degrees of freedom
## Multiple R-Squared: 0.384,   Adjusted R-squared: 0.384 
## Wald test:  969 on 3 and 9996 DF,  p-value: <2e-16

Task: Reading Data

Try to read some datasets from this website. Download it, unzip it and figure out how to properly read data into a data.frame. Use the functions:

read.table()
read.csv2()

and make use of the sep=“” option which denotes the delimiter. Some possible delimiters are: “;” , “,” , “\t”.

http://personality-testing.info/_rawdata/

Try out some descriptive statistics to get an idea of the dataset. Also visualization methods that we've seen class: boxplots, histograms, scatterplots, etc. Comment on what you find and create an Rmarkdown report using knitr.