This problem set is due Friday, October 20th at midnight. You may submit answers in any reasonable format, but knitting the rmarkdown (.rmd) file is probably the easiest method. You should show both output and code where appropriate, and make sure to answer ‘essay’ style questions as well. Problem sets will involve a number of questions unrelated to coding and are worth a lot of points, so if you can’t get things to run properly you should still attempt these questions and submit a partial assignment at least.
For the following problem set we will make extensive use of simulation in order to better understand hypothesis testing and the likely outcomes behind multivariate regression. In order to simulate different scenarios we will need to make use of random number generation, and will also need to perform ‘experiments’ many times in order to study the overall behavior of such systems. Here we look at the base coding skills needed in order to do these for the rest of the problem set
Computers are deterministic: given a set of initial inputs and assuming no hardware failures, they will always produce the same output. Computers therefore rely on a workaround to generate random behavior called a pseudo-random number generator. The specifics are beyond the scope of the course, but it turns out that the properties of prime numbers can be used to simulate random behavior, but this is all subject to an initial “seed” value. We can set this seed in R with the set.seed function, and we can generate random numbers between 0 and 1 using the runif function. Consider the following code which generates 100 uniform random numbers, then calculates their mean
set.seed(1)
random_vector1 <- runif(100)
mean(random_vector1)
## [1] 0.5178471
We expect a value of 0.5, and we get a value reasonably close. What if we paste the lines of code above and run this again?
set.seed(1)
random_vector2 <- runif(100)
mean(random_vector2)
## [1] 0.5178471
Since we set the seed value to be the same, we get the exact same mean. In fact, we can confirm that all of these values are identical.
#The value inside of all() gives a vector of size 100 that are all either TRUE or FALSE
#all(.) returns TRUE if every one of these values is TRUE
all(random_vector1==random_vector2)
## [1] TRUE
The seed value is useful because it allows us to replicate our results if we rerun them. For our purposes though, we want to generate a second sample of 100 and see what the value is. We also expect this to be relatively close to 0.5:
random_vector2 <- runif(100)
mean(random_vector2)
## [1] 0.5174718
Through sheer coincidence we get an average value equal to our first one to three decimal places. One thing we might want to know is what the standard deviation of this random vector would be.
If we wanted to do this by hand, we could note that the pdf for the uniform distribution is \(f(x)=1, x\in (0,1)\), therefore
\[E[X]=\int_0^1xdx = [\frac{1}{2}x^2]_0^1=\frac{1}{2}\], and \[E[X^2]=\int_0^1x^2dx=[\frac{1}{3}x^3]_0^1=\frac{1}{3}\]. We then compute \[var(x)=E[X^2]-E[X]^2=\frac{1}{3}-(\frac{1}{2})^2=\frac{1}{12}\] so that \[\sigma=\sqrt\frac{1}{12}\approx 0.2887\]. Since we are taking the average of 100 such random variables, we then have \[\sigma_{\bar x} = var(\frac{x_1+x_2+...+x_100}{100})=\frac{\sigma}{\sqrt(100)}\approx .02887\].
We can avoid having to do calculus (or looking up the value on wikipedia) by simulating 100 draws from this distribution many times, say 10000, then taking the standard deviation in R. Copying and pasting this code 10,000 times is a bit wonky, so we make use of a loop to assist us. (Note: the functions sapply and lapply are the idiomatic ways of doing this in R, but loops are conceptually simpler)
Here we use a for loop: we specify the range we want to use (here from 1:10000), and then within the body of the loop we assign our vector to a value. A common way to go about this is to initialize a vector outside of the loop and iteratively assign it:
simulated_means <- c() #initialize a vector that initially contains no values
for(i in 1:10000) { #Calculate something in the body for i=1, i=2, ..., i=10000
random_vector <- runif(100) #temporary storage: this will get overridden many times
simulated_means[i] <- mean(random_vector) #store our mean inside the ith value of the vector
}
length(simulated_means)
## [1] 10000
mean(simulated_means)
## [1] 0.4999219
sd(simulated_means)
## [1] 0.02899116
And here we can confirm that our mean and standard deviation are quite close to our actual values we calculated. Our mean is off by a factor of only 0.0003 and our standard deviation is off by 0.00014
Consider the following math problem: what is the average number of times you must flip a coin in order to obtain 3 heads in a row? Intuitively, the probability of a single heads is \(\frac{1}{2}\), so the probability of 3 heads in a row is \(\frac{1}{8}\), therefore we expect the first heads to occur after 8 flips. While intuitive, this answer is completely wrong! In fact, it is very hard to convince people that it is wrong using math (look up ‘renewal theory’ if you’re interested), but we can simulate the results to show why. To simulate flipping a coin, we can generate a random number between 0 and 1 and round it to the nearest integer to obtain a 0/1 value, then we use some string manipulation to search for the first “111” that occurs
answer <- c()
for(i in 1:10000) {
coins <- round(runif(100),0)
coin_string <- paste(coins,collapse="")
first_3_heads <- regexpr('111',coin_string)[1]#Look for the first instance of "111"
answer[i] <- first_3_heads + 2 #Our initial answer is the location of the first heads,
#we add 2 to get the ending location
}
mean(answer)
## [1] 13.9872
sd(answer)
## [1] 11.83062
We see that the actual answer is approximately 14 rather than 8! Interestingly, the standard deviation is also approximately equal to the mean, which is characteristic of a type of distribution called the exponential distribution.
{Using a simulation, calculate the probability that the sum of 20 six-sided dice will be between 55 and 85. I’ve included the base code below with a comment on what to use to complete}
success <- c()
for(i in 1:10000) {
#generate in (0,6), round down to obtain {0,1,2,3,4,5}, then add 1.
dice_roll <- as.integer(runif(20,min=0,max=6))+1
dice_sum <- sum(dice_roll)
#Compare the sum of dice_roll to 55 and 85 using the < and > operators
#along with the 'and' operator &
#i.e. a generic calculation might be x > 0 & x < 1
#to determine if x is between 0 and 1
success[i] <- dice_sum > 55 & dice_sum < 85
}
mean(success)
## [1] 0.944
Consider the following simple setup: we’re trying to measure the effect of attendance on grade. Suppose that the actual relationship is given by \(grade_i = 70 + 10 attendance_i + \varepsilon_i\). We can simulate this by generating attendance and \(\varepsilon\) through random variables:
n <- 100
attendance <- runif(n)
e <- rnorm(n,sd=10)
grade <- 70 + 10 * attendance + e
dat <- data.frame(attendance,grade)
model <- lm(data=dat, grade ~ attendance)
summary(model)
##
## Call:
## lm(formula = grade ~ attendance, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.994 -7.397 -1.114 6.127 23.024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.932 1.948 35.383 < 2e-16 ***
## attendance 11.004 3.262 3.374 0.00106 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.21 on 98 degrees of freedom
## Multiple R-squared: 0.104, Adjusted R-squared: 0.0949
## F-statistic: 11.38 on 1 and 98 DF, p-value: 0.001064
We can see that if we use only 100 observations that our estimate of \(\hat\beta_1\) isn’t very close to \(\beta_1\), and the standard error and is quite high. We can extract the coefficient, std error, and p value for use in a loop as above using the following code:
slope <- summary(model)$coefficients[2,'Estimate']
stderr <- summary(model)$coefficients[2,'Std. Error']
p <- summary(model)$coefficients[2,'Pr(>|t|)']
slope
## [1] 11.00359
stderr
## [1] 3.261763
p
## [1] 0.001064108
{Using simulation, find the mean value of \(\hat\beta_1\) and the standard deviation of \(\hat\beta_1\).}
{Does the the standard deviation of your \(\hat\beta_1\) estimates equal the average of your standard errors?}
{Explain why you observe the results that you do.}
{Much of the boilerplate code has already been filled in below}
betas <- c()
stderrs <- c()
ps <- c()
n <- 100
for(i in 1:1000) {
attendance <- runif(n)
e <- rnorm(n,sd=10)
grade <- 70 + 10 * attendance + e
dat <- data.frame(attendance,grade)
model <- lm(data=dat, grade ~ attendance)
#Assign your 3 vectors here
betas[i] <- summary(model)$coefficients[2,'Estimate']
stderrs[i] <- summary(model)$coefficients[2,'Std. Error']
ps[i] <- summary(model)$coefficients[2,'Pr(>|t|)']
}
mean(betas)
## [1] 9.813171
sd(betas)
## [1] 3.310649
mean(stderrs)
## [1] 3.495025
mean(ps)
## [1] 0.04343533
ps[10]
## [1] 0.01871079
sum(ps<.05)
## [1] 804
{You also calculated every p-value above.}
We would accept the null hypothesis about 80.4% of the time
No
Since our p-value is positive and less than .05, we accept the null hypothesis 80% of the time. 20% of the time we will have an error.
{You have a vector containing 1000 estimates of \(\hat\beta_1\) that you computed above, and you know that the true value of \(\beta_1=10\).}
{Create a histogram of \(\hat\beta_1-\beta_1\) using the hist function (see problem set 1 if necessary).}
hist(betas-10)
Now we turn to multivariate OLS. For this I simulated highly correlated data, which I’ll motivate as the grades on 13 different assignments for 100,000 students. While this could be simulated directly in this Rmarkdown, the code is a bit too complicated for this class. Each assignment has been normalized so that rather than a raw score, the values represent the number of standard deviations from the mean for the student, e.g. a value of 0 means that the student’s score was average, while a value of 1 means they scored 1 standard deviation above the mean. We can read this data into R using the following code: if reading in via Rmarkdown make sure that the file is stored in the same location as the .Rmd file, or otherwise specify the path to the file:
dat <- read.csv("simulated.csv")
In total there are 13 columns in this data. We can also see that they’re highly correlated: here I display all correlations with the first column of data.
dim(dat)
## [1] 100000 13
cor(dat)[,1]
## Q1 Q2 Q3 Q4 Q5
## 1.0000000000 -0.5763057962 0.3838390878 0.4764261153 0.2804178521
## Q6 Q7 Q8 Q9 Q10
## -0.1661273708 -0.1034786900 -0.0550259644 0.7438346756 -0.1829953373
## Q11 Q12 Q13
## 0.0008593828 -0.0004809710 0.0001639467
Suppose that we wish to calculate a subscore for algebra and for quantitative reasoning. The algebra subscore is the sum of the first two assignments, while the quantitative subscore is the algebra subscore plus the 5th and 12th assignments. We can regress the quantitative score on the algebra score to see the association between the two:
dat$algebra <- dat$Q1 + dat$Q2
dat$quantitative <- dat$algebra + dat$Q5 + dat$Q12
# Q12 not omitted
summary(lm(data=dat, quantitative ~ algebra))
##
## Call:
## lm(formula = quantitative ~ algebra, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1182 -0.8041 0.0027 0.8036 5.3543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003815 0.003779 -1.01 0.313
## algebra 1.339087 0.017378 77.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.195 on 99998 degrees of freedom
## Multiple R-squared: 0.05605, Adjusted R-squared: 0.05604
## F-statistic: 5938 on 1 and 99998 DF, p-value: < 2.2e-16
{Interpret \(\hat\beta_1\) in the bivariate regression above.}
Our Beta1 estimate is 1.339.
{We know from our calculation that \(\beta_1=1\), so explain why our estimate of \(\hat\beta_1\) differs from this value.}
Our estimate differs from the actual value, because the probability of getting the exact value is 0. There also could be something biasing the result in the dataset, or we do not have a large enough sample size.
To a certain degree, yes. The probability that we get any specific number is 0. But more probably there is another factor(s) that we should control for.
We are only taking into account the correlations of Q1, Q2, Q5, and Q12. Q12 is not an omitted variable, and it’s correlation is 0
We can redo the above in a multivariate regression by controlling for other variables by adding a +, e.g. we can control for Q9 by using
summary(lm(data=dat, quantitative ~ algebra + Q9))
##
## Call:
## lm(formula = quantitative ~ algebra + Q9, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1197 -0.8045 0.0021 0.8030 5.3598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003815 0.003778 -1.010 0.313
## algebra 1.257304 0.024112 52.143 <2e-16 ***
## Q9 0.178395 0.036467 4.892 1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.195 on 99997 degrees of freedom
## Multiple R-squared: 0.05627, Adjusted R-squared: 0.05626
## F-statistic: 2981 on 2 and 99997 DF, p-value: < 2.2e-16
Our beta1 estimate is closer to 1 than before, this is because we controlled for a variable that was biasing the results.
There are still other variables that are biasing the result that are not controlled for
summary(lm(data=dat, quantitative ~ algebra + Q9 + Q3 + Q4 + Q5 + Q6 + Q7 + Q8 + Q10 + Q11 + Q12 + Q13))
##
## Call:
## lm(formula = quantitative ~ algebra + Q9 + Q3 + Q4 + Q5 + Q6 +
## Q7 + Q8 + Q10 + Q11 + Q12 + Q13, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.100e-14 -3.000e-16 -1.000e-16 1.000e-16 6.422e-12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.451e-18 7.603e-17 4.500e-02 0.9638
## algebra 1.000e+00 1.944e-15 5.143e+14 < 2e-16 ***
## Q9 1.762e-14 4.195e-15 4.201e+00 2.66e-05 ***
## Q3 1.867e-13 1.826e-14 1.022e+01 < 2e-16 ***
## Q4 -5.449e-15 2.928e-15 -1.861e+00 0.0627 .
## Q5 1.000e+00 5.529e-16 1.809e+15 < 2e-16 ***
## Q6 -1.139e-14 8.365e-15 -1.361e+00 0.1734
## Q7 2.086e-16 4.759e-16 4.380e-01 0.6612
## Q8 3.483e-16 3.799e-16 9.170e-01 0.3593
## Q10 1.683e-15 2.263e-15 7.440e-01 0.4570
## Q11 3.554e-17 7.584e-17 4.690e-01 0.6393
## Q12 1.000e+00 7.586e-17 1.318e+16 < 2e-16 ***
## Q13 -2.259e-19 7.596e-17 -3.000e-03 0.9976
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.404e-14 on 99987 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.18e+31 on 12 and 99987 DF, p-value: < 2.2e-16
Here, I controlled for every variable and my beta1 estimate was basically 1, meaning it is unbiased.
Since every variable either adds or subtracts correlation from beta1hat once you control for them, if you add all of them they cancel out and you are left with beta1hat = beta1
I believe so, since beta1hat = beta1 after accounting for all the variables.
The Frisch-Waugh-Lovell theorem gives an alternative method of performing multivariate regression. Here we obtain the residuals from regressing both our x and y variables on Q9, then regressing the residualized quantitative score on the residualized algebra score.
dat$residual_quant <- residuals(lm(data=dat, quantitative ~ Q9))
dat$residual_algebra <- residuals(lm(data=dat, algebra ~ Q9))
summary(lm(data=dat, residual_quant ~ residual_algebra))
##
## Call:
## lm(formula = residual_quant ~ residual_algebra, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1197 -0.8045 0.0021 0.8030 5.3598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.645e-17 3.778e-03 0.00 1
## residual_algebra 1.257e+00 2.411e-02 52.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.195 on 99998 degrees of freedom
## Multiple R-squared: 0.02647, Adjusted R-squared: 0.02646
## F-statistic: 2719 on 1 and 99998 DF, p-value: < 2.2e-16
Another alternative is to compare only individuals with the same Q9 score. Since Q9 is continuous no one in our data has the exact same score, but we can round our values to do this comparison. We can then regress within each group and average the results. First, we will tabulate our grouped data to avoid regressing on very small groups:
dat$Q9Group <- round(dat$Q9,1)
table(dat$Q9Group)
##
## -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0.6
## 7 87 693 3359 10763 21389 27334 21502 10804 3331 653 73 5
Now, we can regress in a loop and confirm that the results match (though not perfectly due to rounding)
groups <- seq(from=-0.4,to=0.4,by=.1)
betas <- c()
ns <- c()
for(i in 1:length(groups)) {
g <- groups[i]
data_truncated <- dat[abs(dat$Q9Group-g)<.01,]
ns[i] <- nrow(data_truncated)
betas[i] <- summary(lm(data=data_truncated, quantitative ~ algebra))$coefficients[2,'Estimate']
}
mean(betas)
## [1] 1.250582
weighted.mean(betas,ns)
## [1] 1.264681
cor(dat$Q9,dat$residual_algebra)
## [1] 1.414053e-17
cor(dat$Q9,dat$residual_quant)
## [1] -7.994442e-20
cor(dat$Q9Group,dat$residual_algebra)
## [1] 0.0006694461
cor(dat$Q9Group,dat$residual_quant)
## [1] -0.0002442318
Since my answers were essentially 0, that means they had little to no correlation with these outcomes. Meaning they are not an omitted variable.
{Is Q9 an omitted variable on the within group approach? Explain why or why not.}
No, the correlation between these variables is essentially 0.
{Finally, explain conceptually why these two approaches are achieving the same result.}
Q9 has no correlation with the quantitative and algebra residuals.
Let’s suppose we have the same data as before, but now our outcome variable represents ability to think abstractly and is the sum of our algebra subscore and the fourth assignment on philosophy. Suppose that due to a data mishap, we no longer have assignments 1, 2 or 4 in our dataset (I won’t remove it in the code for your convenience, but pretend it no longer exists).
dat$abstract <- dat$algebra + dat$Q4
No, we would not have any variables to work with.
{Run both a bivariate OLS regressing algebra on abstract,and then a multivariate OLS controlling for everything we know is a confounding variable (i.e. use Q3, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12, and Q13).}
{Which regression gives a more accurate result? Explain why this is happening.}
The second regression is a more accurate result, since it is controlling for all other variables, meaning we get to see the true result of abstract on algebra.
summary(lm(data=dat,dat$algebra ~ dat$abstract))
##
## Call:
## lm(formula = dat$algebra ~ dat$abstract, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52996 -0.08127 0.00016 0.08130 0.54664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.701e-17 3.819e-04 0.0 1
## dat$abstract 4.431e-01 9.359e-04 473.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1208 on 99998 degrees of freedom
## Multiple R-squared: 0.6915, Adjusted R-squared: 0.6915
## F-statistic: 2.242e+05 on 1 and 99998 DF, p-value: < 2.2e-16
summary(lm(data=dat,dat$algebra ~ dat$abstract + Q3 + Q5 + Q6 + Q7 + Q8 + Q9 + Q10 + Q11 + Q12))
##
## Call:
## lm(formula = dat$algebra ~ dat$abstract + Q3 + Q5 + Q6 + Q7 +
## Q8 + Q9 + Q10 + Q11 + Q12, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.200261 -0.032046 -0.000037 0.032167 0.204697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.808e-08 1.504e-04 0.000 1.000
## dat$abstract 1.438e+00 3.589e-03 400.536 <2e-16 ***
## Q3 -1.293e+00 3.590e-02 -36.024 <2e-16 ***
## Q5 -2.141e-01 8.592e-04 -249.152 <2e-16 ***
## Q6 4.020e+00 1.059e-02 379.390 <2e-16 ***
## Q7 -2.412e-01 5.518e-04 -437.133 <2e-16 ***
## Q8 -1.944e-01 4.324e-04 -449.668 <2e-16 ***
## Q9 -1.725e+00 6.253e-03 -275.921 <2e-16 ***
## Q10 -3.337e-01 4.351e-03 -76.697 <2e-16 ***
## Q11 1.058e-04 1.500e-04 0.705 0.481
## Q12 -9.744e-05 1.501e-04 -0.649 0.516
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04756 on 99989 degrees of freedom
## Multiple R-squared: 0.9522, Adjusted R-squared: 0.9521
## F-statistic: 1.99e+05 on 10 and 99989 DF, p-value: < 2.2e-16
Sometimes rather than variables having a direct causal effect, there is a causal chain along which effects happen. For instance, consider the case where a good teacher can directly improve students grades, but the teacher can also entice more people to attend class, and the increased attendance also improves grades. Then we have two causal pathways here: quality -> grade and quality -> attendance -> grade. Researchers often want to split out these effects in order to find the mechanisms behind which causal effects occur.
Using our data above, let’s simulate such an example whereby x directly causes y to increase by 5, but also causes z to increase by 1, and z also causes y to increase by 1.
dat$x1 <- dat$Q1
dat$z1 <- dat$x1 + dat$Q11
dat$y1 <- 5*dat$x1 + dat$z + dat$Q12
summary(lm(data=dat,dat$y1 ~ dat$x1))
##
## Call:
## lm(formula = dat$y1 ~ dat$x1, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5397 -0.9585 0.0001 0.9522 6.2926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.007062 0.004481 -1.576 0.115
## dat$x1 6.001528 0.018038 332.712 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.417 on 99998 degrees of freedom
## Multiple R-squared: 0.5254, Adjusted R-squared: 0.5254
## F-statistic: 1.107e+05 on 1 and 99998 DF, p-value: < 2.2e-16
summary(lm(data=dat,dat$y1 ~ dat$x1 + dat$z1))
##
## Call:
## lm(formula = dat$y1 ~ dat$x1 + dat$z1, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2892 -0.6719 0.0035 0.6761 4.2764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.003818 0.003169 -1.205 0.228
## dat$x1 4.999006 0.013147 380.237 <2e-16 ***
## dat$z1 0.999056 0.003161 316.042 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.002 on 99997 degrees of freedom
## Multiple R-squared: 0.7626, Adjusted R-squared: 0.7626
## F-statistic: 1.606e+05 on 2 and 99997 DF, p-value: < 2.2e-16
{For the above setup, regress y1 on x1 in both a bivariate setup and in a multivariate setup controlling for z1.}
{Interpret the slope estimates. Do the results align with your intution of what should happen?}
Yes, in the first regression the estimated slope of x1 was 6. In the second, the x1 slope was essentially 5, and the z term also had a slope of 1, both of which match the actual results.
Now, consider a slight modification of the above setup where x2 -> z2 -> y2. The setup is identical, except y2 now adds Q11 rather than Q12 in its calculation (and Z1 also has Q11 added to it)
dat$x2 <- dat$Q1
dat$z2 <- dat$x2 + dat$Q11
dat$y2 <- 5*dat$x2 + dat$z2 + dat$Q11
summary(lm(data=dat,dat$y2 ~ dat$x2))
##
## Call:
## lm(formula = dat$y2 ~ dat$x2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5588 -1.3556 -0.0066 1.3486 8.6975
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.006495 0.006341 -1.024 0.306
## dat$x2 6.006937 0.025527 235.320 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.005 on 99998 degrees of freedom
## Multiple R-squared: 0.3564, Adjusted R-squared: 0.3564
## F-statistic: 5.538e+04 on 1 and 99998 DF, p-value: < 2.2e-16
summary(lm(data=dat,dat$y2 ~ dat$x2 + dat$z2))
##
## Call:
## lm(formula = dat$y2 ~ dat$x2 + dat$z2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.660e-14 -4.000e-16 -2.000e-16 0.000e+00 1.278e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.264e-16 1.412e-16 -8.950e-01 0.371
## dat$x2 4.000e+00 5.858e-16 6.828e+15 <2e-16 ***
## dat$z2 2.000e+00 1.409e-16 1.420e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.466e-14 on 99997 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.566e+32 on 2 and 99997 DF, p-value: < 2.2e-16
It still biases our estimate because the correlation between z2 and Q11 is almost 1, the correlation between y2 and Q11 is .8, and the correlation between y2 and z2 is .9. Both z2 and y2 correlate positively with x2.
In other words, Q11 is correlated with y2 and z2 which is correlated with x2, biasing it.
If we add Q11 into the regression, nothing changes. This is because my independent variables are strongly correlated. We can’t split out the effects without taking out Q11 in one of the variables, which essentially defeats the point of the regression.
summary(lm(data=dat,dat$y2 ~ dat$x2 + dat$z2 + dat$Q11))
##
## Call:
## lm(formula = dat$y2 ~ dat$x2 + dat$z2 + dat$Q11, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.660e-14 -4.000e-16 -2.000e-16 0.000e+00 1.278e-11
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.264e-16 1.412e-16 -8.950e-01 0.371
## dat$x2 4.000e+00 5.858e-16 6.828e+15 <2e-16 ***
## dat$z2 2.000e+00 1.409e-16 1.420e+16 <2e-16 ***
## dat$Q11 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.466e-14 on 99997 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.566e+32 on 2 and 99997 DF, p-value: < 2.2e-16