This problem set is due Friday, October 20th at midnight. You may submit answers in any reasonable format, but knitting the rmarkdown (.rmd) file is probably the easiest method. You should show both output and code where appropriate, and make sure to answer 'essay' style questions as well. Problem sets will involve a number of questions unrelated to coding, and these are worth a lot of points, so if you can't get things to run properly you should still attempt those questions and submit at least a partial assignment.

Simulation

For the following problem set we will make extensive use of simulation in order to better understand hypothesis testing and the likely outcomes of multivariate regression. In order to simulate different scenarios we will need random number generation, and we will also need to perform 'experiments' many times in order to study the overall behavior of such systems. Here we cover the basic coding skills needed to do this throughout the rest of the problem set.

Random Number Generation

Computers are deterministic: given a set of initial inputs and assuming no hardware failures, they will always produce the same output. To generate random behavior they therefore rely on a workaround called a pseudo-random number generator. The specifics are beyond the scope of this course, but it turns out that the properties of prime numbers can be used to simulate random behavior, all subject to an initial "seed" value. We can set this seed in R with the set.seed function, and we can generate random numbers between 0 and 1 using the runif function. Consider the following code, which generates 100 uniform random numbers and then calculates their mean:

set.seed(1)
random_vector1 <- runif(100)
mean(random_vector1)
## [1] 0.5178471

We expect a value of 0.5, and we get a value reasonably close. What if we paste the lines of code above and run this again?

set.seed(1)
random_vector2 <- runif(100)
mean(random_vector2)
## [1] 0.5178471

Since we set the seed value to be the same, we get the exact same mean. In fact, we can confirm that all of these values are identical.

#The value inside of all() gives a vector of size 100 that are all either TRUE or FALSE
#all(.) returns TRUE if every one of these values is TRUE
all(random_vector1==random_vector2)
## [1] TRUE

The seed value is useful because it allows us to replicate our results if we rerun them. For our purposes though, we want to generate a second sample of 100 and see what the value is. We also expect this to be relatively close to 0.5:

random_vector2 <- runif(100)
mean(random_vector2)
## [1] 0.5174718

Through sheer coincidence we get an average value equal to our first one to three decimal places. One thing we might want to know is what the standard deviation of the mean of such a random vector would be.

If we wanted to do this by hand, we could note that the pdf for the uniform distribution is \(f(x)=1, x\in (0,1)\), therefore

\[E[X]=\int_0^1 x\,dx = \left[\frac{1}{2}x^2\right]_0^1=\frac{1}{2}\] and \[E[X^2]=\int_0^1 x^2\,dx=\left[\frac{1}{3}x^3\right]_0^1=\frac{1}{3}\]. We then compute \[\mathrm{var}(X)=E[X^2]-E[X]^2=\frac{1}{3}-\left(\frac{1}{2}\right)^2=\frac{1}{12}\] so that \[\sigma=\sqrt{\frac{1}{12}}\approx 0.2887\]. Since we are taking the average of 100 such random variables, we then have \[\sigma_{\bar x} = \sqrt{\mathrm{var}\!\left(\frac{x_1+x_2+\dots+x_{100}}{100}\right)}=\frac{\sigma}{\sqrt{100}}\approx 0.02887\].
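
(As a quick sanity check of these two numbers, not part of the derivation, we can evaluate them directly in R:)

sqrt(1/12)           #standard deviation of a single uniform draw, roughly 0.2887
sqrt(1/12)/sqrt(100) #standard deviation of the mean of 100 draws, roughly 0.0289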

We can avoid having to do calculus (or look the value up on Wikipedia) by simulating 100 draws from this distribution many times, say 10,000, and then taking the standard deviation of the resulting means in R. Copying and pasting the code 10,000 times would be tedious, so we make use of a loop. (Note: the functions sapply and lapply are the idiomatic ways of doing this in R, but loops are conceptually simpler; a sapply version is sketched after the loop below.)

Loops and Iteration

Here we use a for loop: we specify the range we want to use (here from 1:10000), and then within the body of the loop we assign our vector to a value. A common way to go about this is to initialize a vector outside of the loop and iteratively assign it:

simulated_means <- c() #initialize a vector that initially contains no values
for(i in 1:10000) { #Calculate something in the body for i=1, i=2, ..., i=10000
  random_vector <- runif(100) #temporary storage: this will get overridden many times
  simulated_means[i] <- mean(random_vector) #store our mean inside the ith value of the vector
}
length(simulated_means)
## [1] 10000
mean(simulated_means)
## [1] 0.4999219
sd(simulated_means)
## [1] 0.02899116

And here we can confirm that our mean and standard deviation are quite close to the values we calculated by hand: the mean differs by less than 0.0001 and the standard deviation by roughly 0.0001.
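
(As an aside, here is a rough sketch of the sapply version mentioned above; the name simulated_means_sapply is just illustrative. It should produce a vector equivalent to simulated_means.)

simulated_means_sapply <- sapply(1:10000, function(i) mean(runif(100)))
mean(simulated_means_sapply) #should again be close to 0.5
sd(simulated_means_sapply)   #should again be close to 0.0289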

Estimating A Difficult Math Problem

Consider the following math problem: what is the average number of times you must flip a coin in order to obtain 3 heads in a row? Intuitively, the probability of a single heads is \(\frac{1}{2}\), so the probability of 3 heads in a row is \(\frac{1}{8}\), and therefore we might expect the first run of 3 heads to be completed after about 8 flips. While intuitive, this answer is completely wrong! In fact, it is very hard to convince people that it is wrong using math (look up 'renewal theory' if you're interested), but we can simulate the results to show why. To simulate flipping a coin, we generate a random number between 0 and 1 and round it to the nearest integer to obtain a 0/1 value; then we use some string manipulation to search for the first "111" that occurs.

answer <- c()
for(i in 1:10000) {
  coins <- round(runif(100),0)
  coin_string <- paste(coins,collapse="")
  first_3_heads <- regexpr('111',coin_string)[1]#Look for the first instance of "111"
  answer[i] <- first_3_heads + 2 #regexpr gives the position where the run of three starts;
                                 #we add 2 to get the position of its final flip
}
mean(answer)
## [1] 13.9872
sd(answer)
## [1] 11.83062

We see that the actual answer is approximately 14 rather than 8 (the exact expected value is 14)! Interestingly, the standard deviation is also approximately equal to the mean, which is characteristic of a type of distribution called the exponential distribution.
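
(A rough way to see this, reusing the answer vector from the loop above: compare a few quantiles of the simulated waiting times to those of an exponential distribution with the same mean. The match is only approximate, since the true waiting-time distribution is discrete and never below 3.)

quantile(answer, c(.25, .5, .75))            #empirical quantiles of the waiting time
qexp(c(.25, .5, .75), rate = 1/mean(answer)) #quantiles of an exponential with the same mean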

Question 1

{Using a simulation, calculate the probability that the sum of 20 six-sided dice will be between 55 and 85. I've included the base code below with a comment on what to use to complete it.}

success <- c()
for(i in 1:10000) {
  #generate in (0,6), round down to obtain {0,1,2,3,4,5}, then add 1.
  dice_roll <- as.integer(runif(20,min=0,max=6))+1 
  dice_sum <- sum(dice_roll)
  
  #Compare the sum of dice_roll to 55 and 85 using the < and > operators
  #along with the 'and' operator &
  #i.e. a generic calculation might be x > 0 & x < 1
  #to determine if x is between 0 and 1
  
  success[i] <- dice_sum > 55 & dice_sum < 85
}
mean(success)
## [1] 0.944
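
(As a rough sanity check, not required by the question: a normal approximation to the sum of 20 dice gives a similar number. The sum has mean 20 × 3.5 = 70 and variance 20 × 35/12, and the code above requires a sum strictly between 55 and 85, i.e. between 56 and 84, so with a continuity correction:)

pnorm(84.5, mean = 70, sd = sqrt(20*35/12)) - pnorm(55.5, mean = 70, sd = sqrt(20*35/12))

This comes out to roughly 0.94, in line with the simulated 0.944.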

Hypothesis Testing

Setup

Consider the following simple setup: we’re trying to measure the effect of attendance on grade. Suppose that the actual relationship is given by \(grade_i = 70 + 10 attendance_i + \varepsilon_i\). We can simulate this by generating attendance and \(\varepsilon\) through random variables:

n <- 100
attendance <- runif(n)
e <- rnorm(n,sd=10)
grade <- 70 + 10 * attendance + e
dat <- data.frame(attendance,grade)
model <- lm(data=dat, grade ~ attendance)
summary(model)
## 
## Call:
## lm(formula = grade ~ attendance, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.994  -7.397  -1.114   6.127  23.024 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   68.932      1.948  35.383  < 2e-16 ***
## attendance    11.004      3.262   3.374  0.00106 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.21 on 98 degrees of freedom
## Multiple R-squared:  0.104,  Adjusted R-squared:  0.0949 
## F-statistic: 11.38 on 1 and 98 DF,  p-value: 0.001064

We can see that if we use only 100 observations, our estimate of \(\hat\beta_1\) isn't very close to \(\beta_1\), and the standard error is quite high. We can extract the coefficient, standard error, and p-value for use in a loop, as above, using the following code:

slope <- summary(model)$coefficients[2,'Estimate']
stderr <- summary(model)$coefficients[2,'Std. Error']
p <- summary(model)$coefficients[2,'Pr(>|t|)']
slope
## [1] 11.00359
stderr
## [1] 3.261763
p
## [1] 0.001064108
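
(Equivalently, a small aside: the entire coefficient table can be pulled out at once with coef() and then indexed the same way; coefs is just an illustrative name.)

coefs <- coef(summary(model)) #matrix with one row per coefficient
coefs['attendance', ]         #estimate, std. error, t value, and p value for attendance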

Question 2

{Using simulation, find the mean value of \(\hat\beta_1\) and the standard deviation of \(\hat\beta_1\).}

{Does the standard deviation of your \(\hat\beta_1\) estimates equal the average of your standard errors?}

{Explain why you observe the results that you do.}

{Much of the boilerplate code has already been filled in below}

betas <- c()
stderrs <- c()
ps <- c()
n <- 100
for(i in 1:1000) {
  attendance <- runif(n)
  e <- rnorm(n,sd=10)
  grade <- 70 + 10 * attendance + e
  dat <- data.frame(attendance,grade)
  model <- lm(data=dat, grade ~ attendance)
  #Assign your 3 vectors here
  
  betas[i] <- summary(model)$coefficients[2,'Estimate']
  stderrs[i] <- summary(model)$coefficients[2,'Std. Error']
  ps[i] <- summary(model)$coefficients[2,'Pr(>|t|)']
  
}
mean(betas)
## [1] 9.813171
sd(betas)
## [1] 3.310649
mean(stderrs)
## [1] 3.495025
mean(ps)
## [1] 0.04343533
ps[10]
## [1] 0.01871079
sum(ps<.05)
## [1] 804

Question 3

{You also calculated every p-value above.}

We would reject the null hypothesis (that \(\beta_1=0\)) about 80.4% of the time.

No

Since the p-value is below .05 in about 80% of the simulations, we reject the null hypothesis that often. In the remaining 20% of simulations we fail to reject even though the true \(\beta_1\) is 10, i.e., we make a Type II error.

Question 4

{You have a vector containing 1000 estimates of \(\hat\beta_1\) that you computed above, and you know that the true value of \(\beta_1=10\).}

  • {Create a histogram of \(\hat\beta_1-\beta_1\) using the hist function (see problem set 1 if necessary).}

    • {How does this histogram relate to bias of the estimate of \(\hat\beta_1\)?}
hist(betas-10)

Multivariate OLS Setup

Now we turn to multivariate OLS. For this I simulated highly correlated data, which I'll motivate as the grades on 13 different assignments for 100,000 students. While this could be simulated directly in this Rmarkdown, the code is a bit too complicated for this class. Each assignment has been normalized so that rather than a raw score, the values represent the number of standard deviations from the mean for that student, e.g. a value of 0 means the student's score was average, while a value of 1 means they scored 1 standard deviation above the mean. We can read this data into R using the following code; if knitting via Rmarkdown, make sure the file is stored in the same location as the .Rmd file, or otherwise specify the path to the file:

dat <- read.csv("simulated.csv")

In total there are 13 columns in this data. We can also see that many of them are highly correlated: here I display the correlation of each column with the first column of data.

dim(dat)
## [1] 100000     13
cor(dat)[,1]
##            Q1            Q2            Q3            Q4            Q5 
##  1.0000000000 -0.5763057962  0.3838390878  0.4764261153  0.2804178521 
##            Q6            Q7            Q8            Q9           Q10 
## -0.1661273708 -0.1034786900 -0.0550259644  0.7438346756 -0.1829953373 
##           Q11           Q12           Q13 
##  0.0008593828 -0.0004809710  0.0001639467
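
(Since each column is described as normalized, a quick sanity check, whose exact values will depend on the simulated file, is that the column means are near 0 and the standard deviations near 1:)

round(sapply(dat, mean), 3) #should all be close to 0
round(sapply(dat, sd), 3)   #should all be close to 1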

Suppose that we wish to calculate a subscore for algebra and for quantitative reasoning. The algebra subscore is the sum of the first two assignments, while the quantitative subscore is the algebra subscore plus the 5th and 12th assignments. We can regress the quantitative score on the algebra score to see the association between the two:

dat$algebra <- dat$Q1 + dat$Q2
dat$quantitative <- dat$algebra + dat$Q5 + dat$Q12

# Q12 not omitted

summary(lm(data=dat, quantitative ~ algebra))
## 
## Call:
## lm(formula = quantitative ~ algebra, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1182 -0.8041  0.0027  0.8036  5.3543 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003815   0.003779   -1.01    0.313    
## algebra      1.339087   0.017378   77.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 99998 degrees of freedom
## Multiple R-squared:  0.05605,    Adjusted R-squared:  0.05604 
## F-statistic:  5938 on 1 and 99998 DF,  p-value: < 2.2e-16
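
(This slope can also be anticipated with the omitted variable bias formula, a small check that is not required here: since quantitative = algebra + Q5 + Q12, the bivariate slope is exactly 1 + cov(algebra, Q5 + Q12)/var(algebra).)

#should reproduce the slope on algebra from the bivariate regression above
1 + cov(dat$algebra, dat$Q5 + dat$Q12) / var(dat$algebra)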

Question 5

Question 6

We can redo the above as a multivariate regression, controlling for other variables by adding them with a +; e.g., we can control for Q9 by using

summary(lm(data=dat, quantitative ~ algebra + Q9))
## 
## Call:
## lm(formula = quantitative ~ algebra + Q9, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1197 -0.8045  0.0021  0.8030  5.3598 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003815   0.003778  -1.010    0.313    
## algebra      1.257304   0.024112  52.143   <2e-16 ***
## Q9           0.178395   0.036467   4.892    1e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 99997 degrees of freedom
## Multiple R-squared:  0.05627,    Adjusted R-squared:  0.05626 
## F-statistic:  2981 on 2 and 99997 DF,  p-value: < 2.2e-16

Our \(\hat\beta_1\) estimate is closer to 1 than before; this is because we controlled for a variable (Q9) that was biasing the result.

There are still other variables biasing the result that we have not controlled for (in particular Q5 and Q12, which enter the quantitative score directly).

Question 7

summary(lm(data=dat, quantitative ~ algebra + Q9 + Q3 + Q4 + Q5 + Q6 + Q7 + Q8 + Q10 + Q11 + Q12 + Q13))
## 
## Call:
## lm(formula = quantitative ~ algebra + Q9 + Q3 + Q4 + Q5 + Q6 + 
##     Q7 + Q8 + Q10 + Q11 + Q12 + Q13, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.100e-14 -3.000e-16 -1.000e-16  1.000e-16  6.422e-12 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  3.451e-18  7.603e-17  4.500e-02   0.9638    
## algebra      1.000e+00  1.944e-15  5.143e+14  < 2e-16 ***
## Q9           1.762e-14  4.195e-15  4.201e+00 2.66e-05 ***
## Q3           1.867e-13  1.826e-14  1.022e+01  < 2e-16 ***
## Q4          -5.449e-15  2.928e-15 -1.861e+00   0.0627 .  
## Q5           1.000e+00  5.529e-16  1.809e+15  < 2e-16 ***
## Q6          -1.139e-14  8.365e-15 -1.361e+00   0.1734    
## Q7           2.086e-16  4.759e-16  4.380e-01   0.6612    
## Q8           3.483e-16  3.799e-16  9.170e-01   0.3593    
## Q10          1.683e-15  2.263e-15  7.440e-01   0.4570    
## Q11          3.554e-17  7.584e-17  4.690e-01   0.6393    
## Q12          1.000e+00  7.586e-17  1.318e+16  < 2e-16 ***
## Q13         -2.259e-19  7.596e-17 -3.000e-03   0.9976    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.404e-14 on 99987 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.18e+31 on 12 and 99987 DF,  p-value: < 2.2e-16

Here, I controlled for every variable and my \(\hat\beta_1\) estimate was exactly 1, meaning it is unbiased.

Once we control for Q5 and Q12 (the other components of the quantitative score), there is nothing left in the error term that is correlated with algebra; in fact, quantitative is an exact linear function of the regressors, which is why the R-squared is 1 and \(\hat\beta_1=\beta_1\) exactly.

I believe so, since \(\hat\beta_1=\beta_1\) once we account for all of the relevant variables.
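
(A quick check of this reasoning: controlling for just Q5 and Q12, the other two components of the quantitative score, should already be enough to recover a coefficient of exactly 1 on algebra.)

coef(lm(data=dat, quantitative ~ algebra + Q5 + Q12))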

FWL and Matching

The Frisch-Waugh-Lovell theorem gives an alternative method of performing multivariate regression. Here we obtain the residuals from regressing both our x and y variables on Q9, and then regress the residualized quantitative score on the residualized algebra score.

dat$residual_quant <- residuals(lm(data=dat, quantitative ~ Q9))
dat$residual_algebra <- residuals(lm(data=dat, algebra ~ Q9))
summary(lm(data=dat, residual_quant ~ residual_algebra))
## 
## Call:
## lm(formula = residual_quant ~ residual_algebra, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1197 -0.8045  0.0021  0.8030  5.3598 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.645e-17  3.778e-03    0.00        1    
## residual_algebra 1.257e+00  2.411e-02   52.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 99998 degrees of freedom
## Multiple R-squared:  0.02647,    Adjusted R-squared:  0.02646 
## F-statistic:  2719 on 1 and 99998 DF,  p-value: < 2.2e-16
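
(As a quick check, the FWL slope should match the coefficient on algebra from the multivariate regression earlier:)

coef(lm(data=dat, quantitative ~ algebra + Q9))['algebra']
coef(lm(data=dat, residual_quant ~ residual_algebra))['residual_algebra']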

Another alternative is to compare only individuals with the same Q9 score. Since Q9 is continuous, essentially no two people in our data have exactly the same score, but we can round the values to make this comparison. We can then regress within each group and average the results. First, we tabulate our grouped data to avoid regressing on very small groups:

dat$Q9Group <- round(dat$Q9,1)
table(dat$Q9Group)
## 
##  -0.6  -0.5  -0.4  -0.3  -0.2  -0.1     0   0.1   0.2   0.3   0.4   0.5   0.6 
##     7    87   693  3359 10763 21389 27334 21502 10804  3331   653    73     5

Now, we can regress in a loop and confirm that the results match (though not perfectly due to rounding)

groups <- seq(from=-0.4,to=0.4,by=.1)
betas <- c()
ns <- c()
for(i in 1:length(groups)) {
  g <- groups[i]
  data_truncated <- dat[abs(dat$Q9Group-g)<.01,]
  ns[i] <- nrow(data_truncated)
  betas[i] <- summary(lm(data=data_truncated, quantitative ~ algebra))$coefficients[2,'Estimate']
}
mean(betas)
## [1] 1.250582
weighted.mean(betas,ns)
## [1] 1.264681

Question 8

cor(dat$Q9,dat$residual_algebra)
## [1] 1.414053e-17
cor(dat$Q9,dat$residual_quant)
## [1] -7.994442e-20
cor(dat$Q9Group,dat$residual_algebra)
## [1] 0.0006694461
cor(dat$Q9Group,dat$residual_quant)
## [1] -0.0002442318

Since the correlations are essentially 0, Q9 has little to no correlation with the residualized variables, meaning it is no longer an omitted variable in the residualized regression.

Multivariate Limitations

Let's suppose we have the same data as before, but now our outcome variable represents the ability to think abstractly and is the sum of our algebra subscore and the fourth assignment, on philosophy. Suppose that due to a data mishap we no longer have assignments 1, 2, or 4 in our dataset (for your convenience I won't actually remove them in the code, but pretend they no longer exist).

dat$abstract <- dat$algebra + dat$Q4

Question 9

No, we would not have any variables to work with.

The second regression gives a more accurate result, since it controls for all of the other variables, meaning we get closer to the true relationship between abstract and algebra.

summary(lm(data=dat,dat$algebra ~ dat$abstract))
## 
## Call:
## lm(formula = dat$algebra ~ dat$abstract, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52996 -0.08127  0.00016  0.08130  0.54664 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.701e-17  3.819e-04     0.0        1    
## dat$abstract 4.431e-01  9.359e-04   473.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1208 on 99998 degrees of freedom
## Multiple R-squared:  0.6915, Adjusted R-squared:  0.6915 
## F-statistic: 2.242e+05 on 1 and 99998 DF,  p-value: < 2.2e-16
summary(lm(data=dat,dat$algebra ~ dat$abstract + Q3 + Q5 + Q6 + Q7 + Q8 + Q9 + Q10 + Q11 + Q12))
## 
## Call:
## lm(formula = dat$algebra ~ dat$abstract + Q3 + Q5 + Q6 + Q7 + 
##     Q8 + Q9 + Q10 + Q11 + Q12, data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.200261 -0.032046 -0.000037  0.032167  0.204697 
## 
## Coefficients:
##                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  -2.808e-08  1.504e-04    0.000    1.000    
## dat$abstract  1.438e+00  3.589e-03  400.536   <2e-16 ***
## Q3           -1.293e+00  3.590e-02  -36.024   <2e-16 ***
## Q5           -2.141e-01  8.592e-04 -249.152   <2e-16 ***
## Q6            4.020e+00  1.059e-02  379.390   <2e-16 ***
## Q7           -2.412e-01  5.518e-04 -437.133   <2e-16 ***
## Q8           -1.944e-01  4.324e-04 -449.668   <2e-16 ***
## Q9           -1.725e+00  6.253e-03 -275.921   <2e-16 ***
## Q10          -3.337e-01  4.351e-03  -76.697   <2e-16 ***
## Q11           1.058e-04  1.500e-04    0.705    0.481    
## Q12          -9.744e-05  1.501e-04   -0.649    0.516    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04756 on 99989 degrees of freedom
## Multiple R-squared:  0.9522, Adjusted R-squared:  0.9521 
## F-statistic: 1.99e+05 on 10 and 99989 DF,  p-value: < 2.2e-16

Extra Credit: Causal Chains

Sometimes rather than variables having a direct causal effect, there is a causal chain along which effects happen. For instance, consider the case where a good teacher can directly improve students' grades, but the teacher can also entice more people to attend class, and the increased attendance also improves grades. Then we have two causal pathways: quality -> grade and quality -> attendance -> grade. Researchers often want to split out these effects in order to find the mechanisms through which causal effects occur.

Using our data above, let's simulate such an example where x directly causes y to increase by 5, but also causes z to increase by 1, and z in turn causes y to increase by 1. The total effect of x on y is therefore 5 + 1 × 1 = 6: the direct effect plus the effect flowing through z.

dat$x1 <-  dat$Q1
dat$z1 <- dat$x1 + dat$Q11
dat$y1 <- 5*dat$x1 + dat$z1 + dat$Q12

Extra Credit 1

summary(lm(data=dat,dat$y1 ~ dat$x1))
## 
## Call:
## lm(formula = dat$y1 ~ dat$x1, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5397 -0.9585  0.0001  0.9522  6.2926 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.007062   0.004481  -1.576    0.115    
## dat$x1       6.001528   0.018038 332.712   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.417 on 99998 degrees of freedom
## Multiple R-squared:  0.5254, Adjusted R-squared:  0.5254 
## F-statistic: 1.107e+05 on 1 and 99998 DF,  p-value: < 2.2e-16
summary(lm(data=dat,dat$y1 ~ dat$x1 + dat$z1))
## 
## Call:
## lm(formula = dat$y1 ~ dat$x1 + dat$z1, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2892 -0.6719  0.0035  0.6761  4.2764 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003818   0.003169  -1.205    0.228    
## dat$x1       4.999006   0.013147 380.237   <2e-16 ***
## dat$z1       0.999056   0.003161 316.042   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.002 on 99997 degrees of freedom
## Multiple R-squared:  0.7626, Adjusted R-squared:  0.7626 
## F-statistic: 1.606e+05 on 2 and 99997 DF,  p-value: < 2.2e-16

Now, consider a slight modification of the above setup where x2 -> z2 -> y2. The setup is identical, except that y2 now adds Q11 rather than Q12 in its calculation, so the same Q11 term appears in both z2 and y2.

dat$x2 <-  dat$Q1
dat$z2 <- dat$x2 + dat$Q11
dat$y2 <- 5*dat$x2 + dat$z2 + dat$Q11

Extra Credit 2

summary(lm(data=dat,dat$y2 ~ dat$x2))
## 
## Call:
## lm(formula = dat$y2 ~ dat$x2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5588 -1.3556 -0.0066  1.3486  8.6975 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006495   0.006341  -1.024    0.306    
## dat$x2       6.006937   0.025527 235.320   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.005 on 99998 degrees of freedom
## Multiple R-squared:  0.3564, Adjusted R-squared:  0.3564 
## F-statistic: 5.538e+04 on 1 and 99998 DF,  p-value: < 2.2e-16
summary(lm(data=dat,dat$y2 ~ dat$x2 + dat$z2))
## 
## Call:
## lm(formula = dat$y2 ~ dat$x2 + dat$z2, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.660e-14 -4.000e-16 -2.000e-16  0.000e+00  1.278e-11 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -1.264e-16  1.412e-16 -8.950e-01    0.371    
## dat$x2       4.000e+00  5.858e-16  6.828e+15   <2e-16 ***
## dat$z2       2.000e+00  1.409e-16  1.420e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.466e-14 on 99997 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.566e+32 on 2 and 99997 DF,  p-value: < 2.2e-16

Controlling for z2 still biases our estimates because the correlation between z2 and Q11 is almost 1, the correlation between y2 and Q11 is about .8, and the correlation between y2 and z2 is about .9; both z2 and y2 also correlate positively with x2.

In other words, Q11 is correlated with both y2 and z2, and z2 is correlated with x2, which biases the estimates. In fact, since Q11 = z2 - x2 by construction, we can rewrite y2 = 5*x2 + z2 + Q11 = 4*x2 + 2*z2 exactly, which is why the regression above returns coefficients of exactly 4 and 2 with an R-squared of 1.

If we add Q11 to the regression, nothing changes: Q11 is perfectly collinear with x2 and z2, so R drops it (the NA coefficient below). We can't split out the direct and indirect effects without removing Q11 from one of the variables, which essentially defeats the point of the regression.

summary(lm(data=dat,dat$y2 ~ dat$x2 + dat$z2 + dat$Q11))
## 
## Call:
## lm(formula = dat$y2 ~ dat$x2 + dat$z2 + dat$Q11, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.660e-14 -4.000e-16 -2.000e-16  0.000e+00  1.278e-11 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -1.264e-16  1.412e-16 -8.950e-01    0.371    
## dat$x2       4.000e+00  5.858e-16  6.828e+15   <2e-16 ***
## dat$z2       2.000e+00  1.409e-16  1.420e+16   <2e-16 ***
## dat$Q11             NA         NA         NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.466e-14 on 99997 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.566e+32 on 2 and 99997 DF,  p-value: < 2.2e-16
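
(A quick confirmation of the perfect collinearity: z2 was constructed as x2 + Q11, so Q11 is exactly z2 - x2.)

all.equal(dat$Q11, dat$z2 - dat$x2) #should be TRUE, up to numerical tolerance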