Econ 300 Problem Set 2

This problem set is due Friday, October 20th at midnight. You may submit answers in any reasonable format, but knitting the rmarkdown (.rmd) file is probably the easiest method. You should show both output and code where appropriate, and make sure to answer ‘essay’ style questions as well. Problem sets will involve a number of questions unrelated to coding and are worth a lot of points, so if you can’t get things to run properly you should still attempt these questions and submit a partial assignment at least.

Simulation

For the following problem set we will make extensive use of simulation in order to better understand hypothesis testing and the likely outcomes behind multivariate regression. In order to simulate different scenarios we will need to make use of random number generation, and will also need to perform ‘experiments’ many times in order to study the overall behavior of such systems. Here we look at the base coding skills needed in order to do these for the rest of the problem set

Random Number Generation

Computers are deterministic: given a set of initial inputs and assuming no hardware failures, they will always produce the same output. Computers therefore rely on a workaround to generate random behavior called a pseudo-random number generator. The specifics are beyond the scope of the course, but it turns out that the properties of prime numbers can be used to simulate random behavior, but this is all subject to an initial “seed” value. We can set this seed in R with the set.seed function, and we can generate random numbers between 0 and 1 using the runif function. Consider the following code which generates 100 uniform random numbers, then calculates their mean

set.seed(1)
random_vector1 <- runif(100)
mean(random_vector1)

## [1] 0.5178471

We expect a value of 0.5, and we get a value reasonably close. What if we paste the lines of code above and run this again?

set.seed(1)
random_vector2 <- runif(100)
mean(random_vector2)

## [1] 0.5178471

Since we set the seed value to be the same, we get the exact same mean. In fact, we can confirm that all of these values are identical.

#The value inside of all() gives a vector of size 100 that are all either TRUE or FALSE
#all(.) returns TRUE if every one of these values is TRUE
all(random_vector1==random_vector2)

## [1] TRUE

The seed value is useful because it allows us to replicate our results if we rerun them. For our purposes though, we want to generate a second sample of 100 and see what the value is. We also expect this to be relatively close to 0.5:

random_vector2 <- runif(100)
mean(random_vector2)

## [1] 0.5174718

Through sheer coincidence we get an average value equal to our first one to three decimal places. One thing we might want to know is what the standard deviation of this random vector would be.

If we wanted to do this by hand, we could note that the pdf for the uniform distribution is \(f(x)=1, x\in (0,1)\), therefore

\[E[X]=\int_0^1xdx = [\frac{1}{2}x^2]_0^1=\frac{1}{2}\], and \[E[X^2]=\int_0^1x^2dx=[\frac{1}{3}x^3]_0^1=\frac{1}{3}\]. We then compute \[var(x)=E[X^2]-E[X]^2=\frac{1}{3}-(\frac{1}{2})^2=\frac{1}{12}\] so that \[\sigma=\sqrt\frac{1}{12}\approx 0.2887\]. Since we are taking the average of 100 such random variables, we then have \[\sigma_{\bar x} = var(\frac{x_1+x_2+...+x_100}{100})=\frac{\sigma}{\sqrt(100)}\approx .02887\].

We can avoid having to do calculus (or looking up the value on wikipedia) by simulating 100 draws from this distribution many times, say 10000, then taking the standard deviation in R. Copying and pasting this code 10,000 times is a bit wonky, so we make use of a loop to assist us. (Note: the functions sapply and lapply are the idiomatic ways of doing this in R, but loops are conceptually simpler)

Loops and Iteration

Here we use a for loop: we specify the range we want to use (here from 1:10000), and then within the body of the loop we assign our vector to a value. A common way to go about this is to initialize a vector outside of the loop and iteratively assign it:

simulated_means <- c() #initialize a vector that initially contains no values
for(i in 1:10000) { #Calculate something in the body for i=1, i=2, ..., i=10000
  random_vector <- runif(100) #temporary storage: this will get overridden many times
  simulated_means[i] <- mean(random_vector) #store our mean inside the ith value of the vector
}
length(simulated_means)

## [1] 10000

mean(simulated_means)

## [1] 0.4999219

sd(simulated_means)

## [1] 0.02899116

And here we can confirm that our mean and standard deviation are quite close to our actual values we calculated. Our mean is off by a factor of only 0.0003 and our standard deviation is off by 0.00014

Estimating A Difficult Math Problem

Consider the following math problem: what is the average number of times you must flip a coin in order to obtain 3 heads in a row? Intuitively, the probability of a single heads is \(\frac{1}{2}\), so the probability of 3 heads in a row is \(\frac{1}{8}\), therefore we expect the first heads to occur after 8 flips. While intuitive, this answer is completely wrong! In fact, it is very hard to convince people that it is wrong using math (look up ‘renewal theory’ if you’re interested), but we can simulate the results to show why. To simulate flipping a coin, we can generate a random number between 0 and 1 and round it to the nearest integer to obtain a 0/1 value, then we use some string manipulation to search for the first “111” that occurs

answer <- c()
for(i in 1:10000) {
  coins <- round(runif(100),0)
  coin_string <- paste(coins,collapse="")
  first_3_heads <- regexpr('111',coin_string)[1]#Look for the first instance of "111"
  answer[i] <- first_3_heads + 2 #Our initial answer is the location of the first heads, 
                                 #we add 2 to get the ending location
}
mean(answer)

## [1] 13.9872

sd(answer)

## [1] 11.83062

We see that the actual answer is approximately 14 rather than 8! Interestingly, the standard deviation is also approximately equal to the mean, which is characteristic of a type of distribution called the exponential distribution.

Question 1

{Using a simulation, calculate the probability that the sum of 20 six-sided dice will be between 55 and 85. I’ve included the base code below with a comment on what to use to complete}

success <- c()
for(i in 1:10000) {
  #generate in (0,6), round down to obtain {0,1,2,3,4,5}, then add 1.
  dice_roll <- as.integer(runif(20,min=0,max=6))+1 
  dice_sum <- sum(dice_roll)
  
  #Compare the sum of dice_roll to 55 and 85 using the < and > operators
  #along with the 'and' operator &
  #i.e. a generic calculation might be x > 0 & x < 1
  #to determine if x is between 0 and 1
  
  success[i] <- dice_sum > 55 & dice_sum < 85
}
mean(success)

## [1] 0.944

Hypothesis Testing

Setup

Consider the following simple setup: we’re trying to measure the effect of attendance on grade. Suppose that the actual relationship is given by \(grade_i = 70 + 10 attendance_i + \varepsilon_i\). We can simulate this by generating attendance and \(\varepsilon\) through random variables:

n <- 100
attendance <- runif(n)
e <- rnorm(n,sd=10)
grade <- 70 + 10 * attendance + e
dat <- data.frame(attendance,grade)
model <- lm(data=dat, grade ~ attendance)
summary(model)

## 
## Call:
## lm(formula = grade ~ attendance, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.994  -7.397  -1.114   6.127  23.024 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   68.932      1.948  35.383  < 2e-16 ***
## attendance    11.004      3.262   3.374  0.00106 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.21 on 98 degrees of freedom
## Multiple R-squared:  0.104,  Adjusted R-squared:  0.0949 
## F-statistic: 11.38 on 1 and 98 DF,  p-value: 0.001064

We can see that if we use only 100 observations that our estimate of \(\hat\beta_1\) isn’t very close to \(\beta_1\), and the standard error and is quite high. We can extract the coefficient, std error, and p value for use in a loop as above using the following code:

slope <- summary(model)$coefficients[2,'Estimate']
stderr <- summary(model)$coefficients[2,'Std. Error']
p <- summary(model)$coefficients[2,'Pr(>|t|)']
slope

## [1] 11.00359

stderr

## [1] 3.261763

## [1] 0.001064108

Question 2

{Using simulation, find the mean value of \(\hat\beta_1\) and the standard deviation of \(\hat\beta_1\).}

{Does the the standard deviation of your \(\hat\beta_1\) estimates equal the average of your standard errors?}

{Explain why you observe the results that you do.}

{Much of the boilerplate code has already been filled in below}

betas <- c()
stderrs <- c()
ps <- c()
n <- 100
for(i in 1:1000) {
  attendance <- runif(n)
  e <- rnorm(n,sd=10)
  grade <- 70 + 10 * attendance + e
  dat <- data.frame(attendance,grade)
  model <- lm(data=dat, grade ~ attendance)
  #Assign your 3 vectors here
  
  betas[i] <- summary(model)$coefficients[2,'Estimate']
  stderrs[i] <- summary(model)$coefficients[2,'Std. Error']
  ps[i] <- summary(model)$coefficients[2,'Pr(>|t|)']
  
}
mean(betas)

## [1] 9.813171

sd(betas)

## [1] 3.310649

mean(stderrs)

## [1] 3.495025

mean(ps)

## [1] 0.04343533

ps[10]

## [1] 0.01871079

sum(ps<.05)

## [1] 804

Yes my mean standard error equals my standard deviation.

Question 3

{You also calculated every p-value above.}

{What percent of the time do we reject the null hypothesis assuming a significance level of \(\alpha=.05\)?}

We would accept the null hypothesis about 80.4% of the time

{Does this value equal 5%?}

{Explain why you would or would not expect to see this result, and how it relates to type I and type II errors.}

Since our p-value is positive and less than .05, we accept the null hypothesis 80% of the time. 20% of the time we will have an error.

Question 4

{You have a vector containing 1000 estimates of \(\hat\beta_1\) that you computed above, and you know that the true value of \(\beta_1=10\).}

{Create a histogram of \(\hat\beta_1-\beta_1\) using the hist function (see problem set 1 if necessary).}
- {How does this histogram relate to bias of the estimate of \(\hat\beta_1\)?}

hist(betas-10)

Multivariate OLS Setup

Now we turn to multivariate OLS. For this I simulated highly correlated data, which I’ll motivate as the grades on 13 different assignments for 100,000 students. While this could be simulated directly in this Rmarkdown, the code is a bit too complicated for this class. Each assignment has been normalized so that rather than a raw score, the values represent the number of standard deviations from the mean for the student, e.g. a value of 0 means that the student’s score was average, while a value of 1 means they scored 1 standard deviation above the mean. We can read this data into R using the following code: if reading in via Rmarkdown make sure that the file is stored in the same location as the .Rmd file, or otherwise specify the path to the file:

dat <- read.csv("simulated.csv")

In total there are 13 columns in this data. We can also see that they’re highly correlated: here I display all correlations with the first column of data.

dim(dat)

## [1] 100000     13

cor(dat)[,1]

##            Q1            Q2            Q3            Q4            Q5 
##  1.0000000000 -0.5763057962  0.3838390878  0.4764261153  0.2804178521 
##            Q6            Q7            Q8            Q9           Q10 
## -0.1661273708 -0.1034786900 -0.0550259644  0.7438346756 -0.1829953373 
##           Q11           Q12           Q13 
##  0.0008593828 -0.0004809710  0.0001639467

Suppose that we wish to calculate a subscore for algebra and for quantitative reasoning. The algebra subscore is the sum of the first two assignments, while the quantitative subscore is the algebra subscore plus the 5th and 12th assignments. We can regress the quantitative score on the algebra score to see the association between the two:

dat$algebra <- dat$Q1 + dat$Q2
dat$quantitative <- dat$algebra + dat$Q5 + dat$Q12

# Q12 not omitted

summary(lm(data=dat, quantitative ~ algebra))

## 
## Call:
## lm(formula = quantitative ~ algebra, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1182 -0.8041  0.0027  0.8036  5.3543 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003815   0.003779   -1.01    0.313    
## algebra      1.339087   0.017378   77.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 99998 degrees of freedom
## Multiple R-squared:  0.05605,    Adjusted R-squared:  0.05604 
## F-statistic:  5938 on 1 and 99998 DF,  p-value: < 2.2e-16

Question 5

{Interpret \(\hat\beta_1\) in the bivariate regression above.}

Our Beta1 estimate is 1.339.
{We know from our calculation that \(\beta_1=1\), so explain why our estimate of \(\hat\beta_1\) differs from this value.}

Our estimate differs from the actual value, because the probability of getting the exact value is 0. There also could be something biasing the result in the dataset, or we do not have a large enough sample size.
- {Is this a result of random chance?}
To a certain degree, yes. The probability that we get any specific number is 0. But more probably there is another factor(s) that we should control for.
- {Explain how this is related to the correlations we calculated above.}
We are only taking into account the correlations of Q1, Q2, Q5, and Q12. Q12 is not an omitted variable, and it’s correlation is 0

Question 6

We can redo the above in a multivariate regression by controlling for other variables by adding a +, e.g. we can control for Q9 by using

summary(lm(data=dat, quantitative ~ algebra + Q9))

## 
## Call:
## lm(formula = quantitative ~ algebra + Q9, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1197 -0.8045  0.0021  0.8030  5.3598 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003815   0.003778  -1.010    0.313    
## algebra      1.257304   0.024112  52.143   <2e-16 ***
## Q9           0.178395   0.036467   4.892    1e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 99997 degrees of freedom
## Multiple R-squared:  0.05627,    Adjusted R-squared:  0.05626 
## F-statistic:  2981 on 2 and 99997 DF,  p-value: < 2.2e-16

{Interpret \(\hat\beta_1\), with particular emphasis on how it differs from question 5.}

Our beta1 estimate is closer to 1 than before, this is because we controlled for a variable that was biasing the results.

{Why is this estimate still biased?}

There are still other variables that are biasing the result that are not controlled for

Question 7

summary(lm(data=dat, quantitative ~ algebra + Q9 + Q3 + Q4 + Q5 + Q6 + Q7 + Q8 + Q10 + Q11 + Q12 + Q13))

## 
## Call:
## lm(formula = quantitative ~ algebra + Q9 + Q3 + Q4 + Q5 + Q6 + 
##     Q7 + Q8 + Q10 + Q11 + Q12 + Q13, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.100e-14 -3.000e-16 -1.000e-16  1.000e-16  6.422e-12 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  3.451e-18  7.603e-17  4.500e-02   0.9638    
## algebra      1.000e+00  1.944e-15  5.143e+14  < 2e-16 ***
## Q9           1.762e-14  4.195e-15  4.201e+00 2.66e-05 ***
## Q3           1.867e-13  1.826e-14  1.022e+01  < 2e-16 ***
## Q4          -5.449e-15  2.928e-15 -1.861e+00   0.0627 .  
## Q5           1.000e+00  5.529e-16  1.809e+15  < 2e-16 ***
## Q6          -1.139e-14  8.365e-15 -1.361e+00   0.1734    
## Q7           2.086e-16  4.759e-16  4.380e-01   0.6612    
## Q8           3.483e-16  3.799e-16  9.170e-01   0.3593    
## Q10          1.683e-15  2.263e-15  7.440e-01   0.4570    
## Q11          3.554e-17  7.584e-17  4.690e-01   0.6393    
## Q12          1.000e+00  7.586e-17  1.318e+16  < 2e-16 ***
## Q13         -2.259e-19  7.596e-17 -3.000e-03   0.9976    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.404e-14 on 99987 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.18e+31 on 12 and 99987 DF,  p-value: < 2.2e-16

{Modify the regression equation in question 6 so that you have a multivariate regression with an unbiased estimate of \(\hat\beta_1\) [hint: you can always explore using different controls to see what works and what doesn’t].}

Here, I controlled for every variable and my beta1 estimate was basically 1, meaning it is unbiased.

{Explain why you need to control for the variable(s) that you do in order to get an unbiased result.}

Since every variable either adds or subtracts correlation from beta1hat once you control for them, if you add all of them they cancel out and you are left with beta1hat = beta1

{What happens if you add every single variable to the regression equation; do you get an unbiased result?}

I believe so, since beta1hat = beta1 after accounting for all the variables.

FWL and Matching

The Frisch-Waugh-Lovell theorem gives an alternative method of performing multivariate regression. Here we obtain the residuals from regressing both our x and y variables on Q9, then regressing the residualized quantitative score on the residualized algebra score.

dat$residual_quant <- residuals(lm(data=dat, quantitative ~ Q9))
dat$residual_algebra <- residuals(lm(data=dat, algebra ~ Q9))
summary(lm(data=dat, residual_quant ~ residual_algebra))

## 
## Call:
## lm(formula = residual_quant ~ residual_algebra, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1197 -0.8045  0.0021  0.8030  5.3598 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.645e-17  3.778e-03    0.00        1    
## residual_algebra 1.257e+00  2.411e-02   52.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 99998 degrees of freedom
## Multiple R-squared:  0.02647,    Adjusted R-squared:  0.02646 
## F-statistic:  2719 on 1 and 99998 DF,  p-value: < 2.2e-16

Another alternative is to compare only individuals with the same Q9 score. Since Q9 is continuous no one in our data has the exact same score, but we can round our values to do this comparison. We can then regress within each group and average the results. First, we will tabulate our grouped data to avoid regressing on very small groups:

dat$Q9Group <- round(dat$Q9,1)
table(dat$Q9Group)

## 
##  -0.6  -0.5  -0.4  -0.3  -0.2  -0.1     0   0.1   0.2   0.3   0.4   0.5   0.6 
##     7    87   693  3359 10763 21389 27334 21502 10804  3331   653    73     5

Now, we can regress in a loop and confirm that the results match (though not perfectly due to rounding)

groups <- seq(from=-0.4,to=0.4,by=.1)
betas <- c()
ns <- c()
for(i in 1:length(groups)) {
  g <- groups[i]
  data_truncated <- dat[abs(dat$Q9Group-g)<.01,]
  ns[i] <- nrow(data_truncated)
  betas[i] <- summary(lm(data=data_truncated, quantitative ~ algebra))$coefficients[2,'Estimate']
}
mean(betas)

## [1] 1.250582

weighted.mean(betas,ns)

## [1] 1.264681

Question 8

{Calculate the correlation between Q9 and the two residualized variables you created.}

cor(dat$Q9,dat$residual_algebra)

## [1] 1.414053e-17

cor(dat$Q9,dat$residual_quant)

## [1] -7.994442e-20

cor(dat$Q9Group,dat$residual_algebra)

## [1] 0.0006694461

cor(dat$Q9Group,dat$residual_quant)

## [1] -0.0002442318

{Is Q9 an omitted variable in the regression on these residuals? Explain why this occurs.}

Since my answers were essentially 0, that means they had little to no correlation with these outcomes. Meaning they are not an omitted variable.

{Is Q9 an omitted variable on the within group approach? Explain why or why not.}

No, the correlation between these variables is essentially 0.
{Finally, explain conceptually why these two approaches are achieving the same result.}

Q9 has no correlation with the quantitative and algebra residuals.

Multivariate Limitations

Let’s suppose we have the same data as before, but now our outcome variable represents ability to think abstractly and is the sum of our algebra subscore and the fourth assignment on philosophy. Suppose that due to a data mishap, we no longer have assignments 1, 2 or 4 in our dataset (I won’t remove it in the code for your convenience, but pretend it no longer exists).

dat$abstract <- dat$algebra + dat$Q4

Question 9

{Without the variables Q1, Q2, or Q4 is it possible to obtain an unbiased estimate of our casual relationship between algebra subscore and abstract subscore?}

No, we would not have any variables to work with.

{Run both a bivariate OLS regressing algebra on abstract,and then a multivariate OLS controlling for everything we know is a confounding variable (i.e. use Q3, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12, and Q13).}
{Which regression gives a more accurate result? Explain why this is happening.}

The second regression is a more accurate result, since it is controlling for all other variables, meaning we get to see the true result of abstract on algebra.

summary(lm(data=dat,dat$algebra ~ dat$abstract))

## 
## Call:
## lm(formula = dat$algebra ~ dat$abstract, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52996 -0.08127  0.00016  0.08130  0.54664 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.701e-17  3.819e-04     0.0        1    
## dat$abstract 4.431e-01  9.359e-04   473.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1208 on 99998 degrees of freedom
## Multiple R-squared:  0.6915, Adjusted R-squared:  0.6915 
## F-statistic: 2.242e+05 on 1 and 99998 DF,  p-value: < 2.2e-16

summary(lm(data=dat,dat$algebra ~ dat$abstract + Q3 + Q5 + Q6 + Q7 + Q8 + Q9 + Q10 + Q11 + Q12))

## 
## Call:
## lm(formula = dat$algebra ~ dat$abstract + Q3 + Q5 + Q6 + Q7 + 
##     Q8 + Q9 + Q10 + Q11 + Q12, data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.200261 -0.032046 -0.000037  0.032167  0.204697 
## 
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  -2.808e-08  1.504e-04    0.000    1.000    
## dat$abstract  1.438e+00  3.589e-03  400.536   <2e-16 ***
## Q3           -1.293e+00  3.590e-02  -36.024   <2e-16 ***
## Q5           -2.141e-01  8.592e-04 -249.152   <2e-16 ***
## Q6            4.020e+00  1.059e-02  379.390   <2e-16 ***
## Q7           -2.412e-01  5.518e-04 -437.133   <2e-16 ***
## Q8           -1.944e-01  4.324e-04 -449.668   <2e-16 ***
## Q9           -1.725e+00  6.253e-03 -275.921   <2e-16 ***
## Q10          -3.337e-01  4.351e-03  -76.697   <2e-16 ***
## Q11           1.058e-04  1.500e-04    0.705    0.481    
## Q12          -9.744e-05  1.501e-04   -0.649    0.516    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04756 on 99989 degrees of freedom
## Multiple R-squared:  0.9522, Adjusted R-squared:  0.9521 
## F-statistic: 1.99e+05 on 10 and 99989 DF,  p-value: < 2.2e-16

Extra Credit: Causal Chains

Sometimes rather than variables having a direct causal effect, there is a causal chain along which effects happen. For instance, consider the case where a good teacher can directly improve students grades, but the teacher can also entice more people to attend class, and the increased attendance also improves grades. Then we have two causal pathways here: quality -> grade and quality -> attendance -> grade. Researchers often want to split out these effects in order to find the mechanisms behind which causal effects occur.

Using our data above, let’s simulate such an example whereby x directly causes y to increase by 5, but also causes z to increase by 1, and z also causes y to increase by 1.

dat$x1 <-  dat$Q1
dat$z1 <- dat$x1 + dat$Q11
dat$y1 <- 5*dat$x1 + dat$z + dat$Q12

Extra Credit 1

summary(lm(data=dat,dat$y1 ~ dat$x1))

## 
## Call:
## lm(formula = dat$y1 ~ dat$x1, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5397 -0.9585  0.0001  0.9522  6.2926 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.007062   0.004481  -1.576    0.115    
## dat$x1       6.001528   0.018038 332.712   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.417 on 99998 degrees of freedom
## Multiple R-squared:  0.5254, Adjusted R-squared:  0.5254 
## F-statistic: 1.107e+05 on 1 and 99998 DF,  p-value: < 2.2e-16

summary(lm(data=dat,dat$y1 ~ dat$x1 + dat$z1))

## 
## Call:
## lm(formula = dat$y1 ~ dat$x1 + dat$z1, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2892 -0.6719  0.0035  0.6761  4.2764 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003818   0.003169  -1.205    0.228    
## dat$x1       4.999006   0.013147 380.237   <2e-16 ***
## dat$z1       0.999056   0.003161 316.042   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.002 on 99997 degrees of freedom
## Multiple R-squared:  0.7626, Adjusted R-squared:  0.7626 
## F-statistic: 1.606e+05 on 2 and 99997 DF,  p-value: < 2.2e-16

{For the above setup, regress y1 on x1 in both a bivariate setup and in a multivariate setup controlling for z1.}
{Interpret the slope estimates. Do the results align with your intution of what should happen?}

Yes, in the first regression the estimated slope of x1 was 6. In the second, the x1 slope was essentially 5, and the z term also had a slope of 1, both of which match the actual results.

Now, consider a slight modification of the above setup where x2 -> z2 -> y2. The setup is identical, except y2 now adds Q11 rather than Q12 in its calculation (and Z1 also has Q11 added to it)

dat$x2 <-  dat$Q1
dat$z2 <- dat$x2 + dat$Q11
dat$y2 <- 5*dat$x2 + dat$z2 + dat$Q11

Extra Credit 2

{Rerun the regressions from extra credit 1, but using x2, y2, and z2 rather than x1, y1, and z1.}

summary(lm(data=dat,dat$y2 ~ dat$x2))

## 
## Call:
## lm(formula = dat$y2 ~ dat$x2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5588 -1.3556 -0.0066  1.3486  8.6975 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006495   0.006341  -1.024    0.306    
## dat$x2       6.006937   0.025527 235.320   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.005 on 99998 degrees of freedom
## Multiple R-squared:  0.3564, Adjusted R-squared:  0.3564 
## F-statistic: 5.538e+04 on 1 and 99998 DF,  p-value: < 2.2e-16

summary(lm(data=dat,dat$y2 ~ dat$x2 + dat$z2))

## 
## Call:
## lm(formula = dat$y2 ~ dat$x2 + dat$z2, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.660e-14 -4.000e-16 -2.000e-16  0.000e+00  1.278e-11 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -1.264e-16  1.412e-16 -8.950e-01    0.371    
## dat$x2       4.000e+00  5.858e-16  6.828e+15   <2e-16 ***
## dat$z2       2.000e+00  1.409e-16  1.420e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.466e-14 on 99997 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.566e+32 on 2 and 99997 DF,  p-value: < 2.2e-16

{In order for a variable to bias our result we need it to be jointly correlated with both our x and y variables, yet Q11 is not correlated with x2. Explain why our estimate of \(\hat\beta_1\) is biased despite \(cor(x2,Q11)=0\) and Q11 being our only omitted variable.}

It still biases our estimate because the correlation between z2 and Q11 is almost 1, the correlation between y2 and Q11 is .8, and the correlation between y2 and z2 is .9. Both z2 and y2 correlate positively with x2.

In other words, Q11 is correlated with y2 and z2 which is correlated with x2, biasing it.

{What if we control for Q11? Does this fix anything? Explain why this is happening and whether there is any way to split out the effects.}

If we add Q11 into the regression, nothing changes. This is because my independent variables are strongly correlated. We can’t split out the effects without taking out Q11 in one of the variables, which essentially defeats the point of the regression.

summary(lm(data=dat,dat$y2 ~ dat$x2 + dat$z2 + dat$Q11))

## 
## Call:
## lm(formula = dat$y2 ~ dat$x2 + dat$z2 + dat$Q11, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.660e-14 -4.000e-16 -2.000e-16  0.000e+00  1.278e-11 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -1.264e-16  1.412e-16 -8.950e-01    0.371    
## dat$x2       4.000e+00  5.858e-16  6.828e+15   <2e-16 ***
## dat$z2       2.000e+00  1.409e-16  1.420e+16   <2e-16 ***
## dat$Q11             NA         NA         NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.466e-14 on 99997 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.566e+32 on 2 and 99997 DF,  p-value: < 2.2e-16