Problem 1

1.a)

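library(foreign)  # provides read.dta()
PS3DATA <- read.dta("/Users/YaseenAbdulridha/Downloads/paeco526W16_qr_ps3.dta")
# PS2DATA (the problem set 2 data) is assumed to be loaded already
PS2DATA$tobacco <- factor(PS2DATA$tobacco)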
IA_Model<-lm(dbirwt ~ tobacco, data = PS2DATA)
summary(IA_Model)
## 
## Call:
## lm(formula = dbirwt ~ tobacco, data = PS2DATA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3199.2  -307.2    32.8   349.4  3409.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3426.194      1.271 2694.90   <2e-16 ***
## tobacco1    -260.580      2.960  -88.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 574.1 on 249998 degrees of freedom
## Multiple R-squared:  0.03007,    Adjusted R-squared:  0.03007 
## F-statistic:  7751 on 1 and 249998 DF,  p-value: < 2.2e-16

We have always been told that correlation does not imply causation. For the coefficient on the smoking indicator to be interpreted as a causal effect, there can be no confounding variables that influence both the dependent variable and the independent variable. We must also observe an association and establish its direction. One problem here is that our independent variable is limited: it is a binary indicator, so we have no measure of how much these mothers smoked during pregnancy, which could bias our coefficient estimate considerably. The causal interpretation rests on the assumption that the conditional expectation of the error term Ui is zero, which holds only when the smoking indicator is independent of Ui. In other words, we would be assuming that there are no unobservables that affect both the baby’s birth weight and the decision of whether or not to smoke. In our regression model it would be reckless to make such a large jump, so we most likely end up with a biased coefficient in this setting.
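The confounding argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, the variable `confounder` and the "true" effect of -200 grams are hypothetical, and nothing here is estimated from the problem-set data.

```r
# Simulated illustration only; none of these numbers come from the problem-set data.
set.seed(1)
n <- 10000
# Hypothetical unobserved confounder (for example, overall health behaviour)
confounder <- rnorm(n)
# Mothers with a low confounder value are more likely to smoke
smoke <- as.numeric(confounder + rnorm(n) < -1)
# Assumed true causal effect of smoking: -200 grams
birth_weight <- 3400 - 200 * smoke + 150 * confounder + rnorm(n, sd = 500)

# Regression that omits the confounder versus one that controls for it
naive    <- unname(coef(lm(birth_weight ~ smoke))["smoke"])
adjusted <- unname(coef(lm(birth_weight ~ smoke + confounder))["smoke"])

# The naive estimate absorbs part of the confounder's effect and overstates
# the harm; the adjusted estimate should land near the assumed -200.
print(c(naive = naive, adjusted = adjusted))
```

In the real data the confounder is unobserved, so the second regression is not available to us, which is why the simple regression cannot be given a causal reading.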

1.b)

IB_Model <- lm(dbirwt ~ tobacco + DV)
summary(IB_Model)
## 
## Call:
## lm(formula = dbirwt ~ tobacco + DV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3549.6  -303.0    17.8   334.5  3431.5 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 2805.61513   25.35639  110.647  < 2e-16 ***
## tobacco     -229.41852    2.94382  -77.932  < 2e-16 ***
## DVdmage        8.75244    1.75738    4.980 6.35e-07 ***
## DVdmage2      -0.20353    0.03062   -6.647 3.00e-11 ***
## DVdmeduc       6.00593    0.70315    8.541  < 2e-16 ***
## DVmblack    -154.08295    8.70253  -17.706  < 2e-16 ***
## DVmotherr    -85.76344   14.07765   -6.092 1.12e-09 ***
## DVmhispan    -77.53224   10.00371   -7.750 9.20e-15 ***
## DVdmar       -42.66961    3.35566  -12.716  < 2e-16 ***
## DVforeignb   -15.85210    6.33783   -2.501  0.01238 *  
## DVdfage       -0.07689    0.26294   -0.292  0.76995    
## DVdfeduc       4.35438    0.63316    6.877 6.12e-12 ***
## DVfblack     -53.10469    8.48791   -6.257 3.94e-10 ***
## DVfotherr   -107.06115   13.88879   -7.708 1.28e-14 ***
## DVfhispan    -59.67901    9.20561   -6.483 9.01e-11 ***
## DValcohol    -25.19821   11.16907   -2.256  0.02407 *  
## DVdrink       -9.92581    1.90827   -5.201 1.98e-07 ***
## DVadequac2    67.95236    4.64499   14.629  < 2e-16 ***
## DVadequac3   153.80668    9.71492   15.832  < 2e-16 ***
## DVtripre0   -109.43095   13.81638   -7.920 2.38e-15 ***
## DVtripre2     27.16719    4.92541    5.516 3.48e-08 ***
## DVtripre3     80.08478   10.65329    7.517 5.61e-14 ***
## DVnprevist    37.02335    0.43226   85.651  < 2e-16 ***
## DVfirst      -82.01740    3.82024  -21.469  < 2e-16 ***
## DVdlivord     31.31721    1.42526   21.973  < 2e-16 ***
## DVdisllb      -0.25344    0.04609   -5.499 3.82e-08 ***
## DVpre4000    480.98922   10.50457   45.789  < 2e-16 ***
## DVplural    -982.86673    8.55440 -114.896  < 2e-16 ***
## DVdiabete     38.90363    8.00885    4.858 1.19e-06 ***
## DVanemia     -55.25044   11.11533   -4.971 6.68e-07 ***
## DVcardiac    -59.02826   15.42272   -3.827  0.00013 ***
## DVchyper    -209.61905   12.39990  -16.905  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 535.9 on 249968 degrees of freedom
## Multiple R-squared:  0.1549, Adjusted R-squared:  0.1548 
## F-statistic:  1478 on 31 and 249968 DF,  p-value: < 2.2e-16

We still face the problem that our independent variable is limited: we have no measure of how much these mothers smoked during pregnancy, which could bias our coefficient estimates considerably. A multivariate model supports claims about causality if the sample is unbiased, the measurement is accurate, the model includes controls for all major possible confounding effects, and the possibility of reverse causality can be ruled out. By adding more covariates and controlling for their effects, we are attempting to reduce the bias from confounders that affect both the baby’s birth weight and the decision of whether or not to smoke. In our expanded model the coefficient on smoking moves toward zero, from -260.58 to -229.42, which means our controls now explain part of the variation in birth weight that was previously attributed to smoking; before, we were overestimating the negative effect of smoking.

1.c)

1.d)

Deadkids and preterm are not valid exclusion restrictions, because I find it hard to believe that they affect smoking while leaving birth weight unchanged.

1.e)

I used etregress in Stata, which uses maximum likelihood. (This was the only part I could not do in R :( )

x<- data.frame("Coef"=-704.6137, "Robust Standard Errors" =15.37441,  " z " = -45.83, " P>|z| "= 0.000 ,  "[95% Conf. Interval] "= -734.747, -674.4804) 
row.names(x)<- ("tobacco")                    
x
##              Coef Robust.Standard.Errors   X.z. X.P..z..
## tobacco -704.6137               15.37441 -45.83        0
##         X.95..Conf..Interval.. X.674.4804
## tobacco               -734.747  -674.4804

rho = 0.4828716, which is not an extremely strong correlation between the errors of the selection and outcome equations.

This may be preferable to the instrumental variable approach, given that instruments are notoriously difficult to find. An advantage of Heckman’s approach is that it allows one to correct for selection bias. A disadvantage of Heckman’s two-step approach is that the estimated correlation coefficient may fall outside the interval [-1, 1]. Most importantly, the two-step approach does not produce asymptotically efficient estimators.

1.f)

It does not seem likely that our exclusion restrictions are valid instruments, so I would not place much faith in this model. It is hard to believe that preterm and deadkids directly affect smoking during pregnancy while having no effect of their own on birth weight.

1.g)

One instrument I believe could be quite useful is a binary indicator of whether or not the father (or husband/boyfriend) smokes. I don’t believe it should affect the baby’s birth weight directly, but it can very well affect whether, and how much, the mother smokes. It is relatively simple to come up with an example one thinks may work; however, its validity is always hard to prove. The monotonicity assumption requires that the instrument move everyone’s treatment in the same direction, so that there are no defiers. In my example instrument, this means that a mother who lives with a smoker (the child’s father or her boyfriend) must be at least as likely to smoke as she would be if he did not smoke (no treatment).
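How a binary instrument of this kind would be used can be sketched with the Wald estimator, which divides the instrument's effect on the outcome by its effect on the treatment. Everything below is simulated and hypothetical: `partner_smokes`, the effect sizes, and the assumption that the instrument has no direct effect on birth weight are all imposed by construction rather than taken from the data.

```r
# Simulated illustration only; variable names and effect sizes are hypothetical.
set.seed(2)
n <- 50000
unobserved <- rnorm(n)                  # confounder we cannot control for
partner_smokes <- rbinom(n, 1, 0.3)     # proposed binary instrument
# Monotonicity holds by construction: a smoking partner can only raise
# the latent propensity to smoke, never lower it
smoke <- as.numeric(0.8 * partner_smokes - 0.5 * unobserved + rnorm(n) > 1)
# Assumed true effect of -200 g; the instrument has no direct effect (exclusion)
birth_weight <- 3400 - 200 * smoke + 150 * unobserved + rnorm(n, sd = 500)

# OLS is contaminated by the unobserved confounder
ols <- unname(coef(lm(birth_weight ~ smoke))["smoke"])

# Wald estimator: reduced form divided by first stage
reduced_form <- mean(birth_weight[partner_smokes == 1]) -
  mean(birth_weight[partner_smokes == 0])
first_stage <- mean(smoke[partner_smokes == 1]) -
  mean(smoke[partner_smokes == 0])
wald <- reduced_form / first_stage

print(c(ols = ols, wald = wald, first_stage = first_stage))
```

The sketch only works because exclusion and monotonicity are built into the simulated data. With real data neither can be verified directly, which is the difficulty described above.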

Problem 2

2.a)

IIA_Model <- lm(dbirwt~tobacco,data = PS3DATA)
IIA_Model$coefficients
## (Intercept)     tobacco 
##   3423.3603   -266.0286
x <- confint(IIA_Model)
x[2,]
##     2.5 %    97.5 % 
## -275.3614 -256.6957
IIAA_Model <- lm(dbirwt~.,data = PS3DATA)
IIAA_Model$coefficients[18]
##   tobacco 
## -231.9765
x <-  confint(IIAA_Model)
x[18,1]
## [1] -241.2445
x[18,2]
## [1] -222.7084

2.b)

library(quantreg)  # provides rq()
IIB_Model <- rq(dbirwt~tobacco, tau=0.5, data = PS3DATA)
c(IIB_Model$coefficients[2]-1.96*coef(summary(IIB_Model))[2,2],IIB_Model$coefficients[2]+1.96*coef(summary(IIB_Model))[2,2])
##   tobacco   tobacco 
## -267.7852 -250.2148

This coefficient is interpreted as the difference in median birth weight between mothers who smoked and those who did not. That is different from the mean effects estimated before, and as before we cannot claim a causal effect. We cannot rule out other factors affecting the baby’s birth weight that we are not accounting for; we would need an instrument to address this as best we can. We are also not able to infer the unconditional median of dbirwt’s distribution from this result. The estimated difference in median birth weight between mothers who did and did not smoke is about 259 grams (95% confidence interval: -267.79 to -250.21).
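Because the interval in 2.b was built as the estimate plus or minus 1.96 standard errors, the point estimate and standard error can be recovered from the two bounds. The short check below uses only the bounds printed in the output above. Its midpoint is about -259 grams, which matches the tau = 0.5 coefficient reported in 3.a, while 17.57 grams is the width of the interval rather than the difference in medians.

```r
# Bounds copied from the 2.b output above
ci <- c(lower = -267.7852, upper = -250.2148)

point_estimate <- mean(ci)               # midpoint of a symmetric interval
ci_width       <- unname(diff(ci))       # upper bound minus lower bound
std_error      <- ci_width / (2 * 1.96)  # interval was estimate +/- 1.96 * SE

print(c(point_estimate = point_estimate,
        ci_width = ci_width,
        std_error = std_error))
```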

2.c)

IIC_Model <- rq(dbirwt~.,tau=0.5,data=PS3DATA)
c(IIC_Model$coefficients[18]-1.96*coef(summary(IIC_Model))[18,2],IIC_Model$coefficients[18]+1.96*coef(summary(IIC_Model))[18,2])
##   tobacco   tobacco 
## -237.9135 -217.5756

Our statements from above apply directly to this question. However, the interpretation of the coefficient is now the difference in conditional median birth weight between mothers who smoke and those who do not, after controlling for all our covariates. The tobacco coefficient is -227.74454, so the difference in median birth weight is about 228 grams (95% confidence interval: -237.91 to -217.58). Although this may seem a more accurate estimate, it does not imply a causal relationship. Our estimate is only about 4 grams away from the OLS estimate with controls in section a; however, we are still not controlling for a very large number of confounders.

2.d)

The method Stata uses to calculate standard errors for quantile regressions assumes that the residuals are independent and identically distributed, which implies that the errors are homoskedastic.

The Stata method is: vce(iid).

vce(robust) is what allows for heteroskedasticity in the errors, which is particularly useful when we want to look at effects at different quantiles. The obvious advantage of vce(iid) is computational; however, it requires us to check regression diagnostic plots for the distribution of our errors, which we hope are independent, identically distributed, and homoskedastic. That is most likely not our case because of the sheer number of covariates.

2.e)

E <- matrix(nrow = 32,ncol = 15)
SE <-  matrix(nrow = 32,ncol = 15)
for(i in 1:15)
  {
    S1 <- sample(1:nrow(PS3DATA), replace = TRUE)
    D <- PS3DATA[S1,]
    M1 <- rq(dbirwt~.,tau=0.5,data = D)
    fit_summary <- summary(M1)  # avoid masking the summary() function
    E[,i] <- M1$coefficients
    SE[,i] <-  fit_summary$coefficients[,2]
  }
E[18, 1]  # tobacco coefficient, first replication
## [1] -221.6579
SE[18, 1]  # tobacco standard error, first replication
## [1] 5.172882
mean(SE[18,])
## [1] 5.207662

Bootstrapping retains the assumption of independent errors while relaxing the assumption of identically distributed errors, so the resulting standard errors play the same role as the robust standard errors we use in linear regression. The asymptotic variance we are estimating with the bootstrap depends on the conditional error density at zero, f_u(0|x_i), which we do not observe.

Although these standard errors take values similar to the previous ones, they are not as reliable as those in part c. I only did 15 replications, but I am sure that after 10,000 they would be more useful.
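For comparison, the usual bootstrap standard error is the standard deviation of the coefficient across replications, rather than an average of the per-replication standard errors. The sketch below illustrates that calculation on simulated data. To stay in base R it uses the difference in group medians in place of rq(); the sample size, effect size, and number of replications are assumptions chosen for illustration.

```r
# Simulated illustration only; the data below are not the problem-set data.
set.seed(3)
n <- 2000
smoke <- rbinom(n, 1, 0.2)
birth_weight <- 3450 - 250 * smoke + rnorm(n, sd = 550)
sim <- data.frame(birth_weight, smoke)

# Statistic: difference in median birth weight, smokers minus non-smokers
median_gap <- function(d) {
  median(d$birth_weight[d$smoke == 1]) - median(d$birth_weight[d$smoke == 0])
}

B <- 500  # number of bootstrap replications
boot_est <- replicate(B, {
  idx <- sample(nrow(sim), replace = TRUE)  # resample rows with replacement
  median_gap(sim[idx, ])
})

# Bootstrap standard error: spread of the estimate across replications
boot_se <- sd(boot_est)
print(c(estimate = median_gap(sim), boot_se = boot_se))
```

Applied to the loop in 2.e, the analogous quantity would be sd(E[18, ]), the standard deviation of the 15 bootstrapped tobacco coefficients.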

2.f)

The t-value is -1.207, so I conclude that we cannot reject the null hypothesis. The test statistic does not fall below the critical value for significance, so we cannot conclude that the sum of the tobacco and alcohol coefficients is less than or equal to -300.
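The write-up does not show how this t-value was computed, so the following is only a sketch of one standard way to test a hypothesis about the sum of two coefficients. It assumes a linear model fitted with lm() on simulated data; the coefficient values, the sample size, and the one-sided alternative are assumptions for illustration, not the author's calculation.

```r
# Simulated illustration only; coefficients and data are hypothetical.
set.seed(4)
n <- 5000
tobacco <- rbinom(n, 1, 0.2)
alcohol <- rbinom(n, 1, 0.1)
dbirwt  <- 3400 - 230 * tobacco - 40 * alcohol + rnorm(n, sd = 550)
fit <- lm(dbirwt ~ tobacco + alcohol)

b <- coef(fit)
V <- vcov(fit)

# Estimated sum of the two coefficients and its standard error,
# using Var(a + b) = Var(a) + Var(b) + 2 * Cov(a, b)
combo    <- unname(b["tobacco"] + b["alcohol"])
combo_se <- sqrt(V["tobacco", "tobacco"] + V["alcohol", "alcohol"] +
                 2 * V["tobacco", "alcohol"])

# t-statistic for the sum against the hypothesised value -300
t_stat <- (combo - (-300)) / combo_se

# One-sided p-value for the alternative that the sum is below -300
p_value <- pt(t_stat, df = df.residual(fit))
print(c(t_stat = t_stat, p_value = p_value))
```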

Problem 3

3.a)

#a)
A_Model<- rq(dbirwt~tobacco,tau=0.1,data = PS3DATA)
AI_Model<- rq(dbirwt~tobacco,tau=0.25,data = PS3DATA)
AII_Model<- rq(dbirwt~tobacco,tau=0.5,data = PS3DATA)
AIII_Model<- rq(dbirwt~tobacco,tau=0.75,data = PS3DATA)
AIV_Model<- rq(dbirwt~tobacco,tau=0.9,data = PS3DATA)

#10th Quantile
A_Model$coefficient
## (Intercept)     tobacco 
##        2778        -283
#25th Quantile
AI_Model$coefficient
## (Intercept)     tobacco 
##        3119        -263
#50th Quantile
AII_Model$coefficient
## (Intercept)     tobacco 
##        3459        -259
#75th Quantile
AIII_Model$coefficient
## (Intercept)     tobacco 
##        3771        -256
#90th Quantile
AIV_Model$coefficient
## (Intercept)     tobacco 
##        4082        -255

As we can see, as we move up the quantiles the estimated effect of smoking on birth weight decreases in magnitude, from -283 grams at the 10th quantile to -255 grams at the 90th. We are unable to infer the unconditional distribution; however, we can see which quantiles are affected differently by smoking during pregnancy. The lowest quantile is affected the most by smoking.
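With a single binary regressor, the quantile regression coefficient at each tau corresponds to the gap between the two groups' tau-quantiles. The sketch below shows that reading on simulated data using base R's quantile() instead of rq(). The group means and spreads are assumptions chosen so that the gap is larger in the lower tail, and quantile() interpolates, so the numbers need not match rq() exactly.

```r
# Simulated illustration only; the data below are not the problem-set data.
set.seed(5)
n <- 50000
smoke <- rbinom(n, 1, 0.2)
# Smokers get a lower mean and a wider spread, so the gap between the
# groups is larger in the lower tail than in the upper tail
birth_weight <- ifelse(smoke == 1,
                       rnorm(n, mean = 3200, sd = 620),
                       rnorm(n, mean = 3460, sd = 550))

taus <- c(0.10, 0.25, 0.50, 0.75, 0.90)
q_smokers     <- quantile(birth_weight[smoke == 1], probs = taus)
q_non_smokers <- quantile(birth_weight[smoke == 0], probs = taus)

# Gap between the groups at each quantile (smokers minus non-smokers)
gap <- q_smokers - q_non_smokers
print(round(gap))
```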

3.b)

#b)
AI_Model <- rq(dbirwt~.,tau=0.1,data = PS3DATA)
AII_Model <- rq(dbirwt~.,tau=0.25,data = PS3DATA)
AIII_Model <- rq(dbirwt~.,tau=0.5,data = PS3DATA)
AIV_Model <- rq(dbirwt~.,tau=0.75,data = PS3DATA)
AV_Model <- rq(dbirwt~.,tau=0.9,data = PS3DATA)


#10th Quantile
AI_Model$coefficient[18]
##   tobacco 
## -253.7774
#25th Quantile
AII_Model$coefficient[18]
##   tobacco 
## -231.8153
#50th Quantile
AIII_Model$coefficient[18]
##   tobacco 
## -227.7445
#75th Quantile
AIV_Model$coefficient[18]
##   tobacco 
## -220.5429
#90th Quantile
AV_Model$coefficient[18]
##   tobacco 
## -228.9399

These estimates appear to suffer from less downward bias than the previous ones, which is the result of adding more explanatory variables. As before, the effect of the smoking indicator on birth weight generally shrinks as we move up the quantiles, although the 90th quantile estimate (-228.94) is slightly larger in magnitude than the 75th (-220.54). Our conclusions about the unconditional distribution have not changed: we still cannot infer anything about it. The OLS results can be interpreted as an average effect across all observations, and the OLS tobacco coefficient with controls was about 34 grams smaller in magnitude than without them (-231.98 versus -266.03). We can see that the effect of the binary tobacco indicator can vary across infant birth weights.

3.c)

quantile_test <- anova(AI_Model, AII_Model, AIII_Model, AIV_Model, AV_Model, joint = FALSE)

In this section the anova test compares the five models and shows that the estimated coefficients on the binary tobacco indicator are not equal across quantiles. This tells us that the effect of whether or not a mother smoked during pregnancy on infant birth weight differs across the estimated quantiles.

3.d)

Quantile_1 <- AI_Model$coefficients[29]
Quantile_2.5 <- AII_Model$coefficients[29]
Quantile_5 <- AIII_Model$coefficients[29]
Quantile_7.5 <- AIV_Model$coefficients[29]
Quantile_9.0 <- AV_Model$coefficients[29]
IIAA_Model <- lm(dbirwt~.,data = PS3DATA)


#Distribution of No Prenatal Care 
data.frame(IIAA_Model$coefficients[29],Quantile_1, Quantile_2.5, Quantile_5, Quantile_7.5, Quantile_9.0)
##         IIAA_Model.coefficients.29. Quantile_1 Quantile_2.5 Quantile_5
## tripre0                   -119.5509  -503.1754    -127.6602  -50.77035
##         Quantile_7.5 Quantile_9.0
## tripre0    -38.12517    -24.95823

As we can see from our estimates, the effect of a lack of prenatal care during pregnancy is not the same across the quantiles of the conditional birth weight distribution. The lack of prenatal care lowers infant birth weight throughout the distribution, and the effect is by far the largest for births in the lower tail (-503.18 grams at the 10th quantile versus -24.96 grams at the 90th). It is actually quite interesting to see how large the discrepancy between the quantiles is.

Our OLS estimate is -119.55 after controlling for all the covariates.

Problem 4

The paper by Angrist and Krueger (2001) goes into detail about the efficacy and proper implementation of instrumental variables. They offer a lot of insight into when instruments are useful as opposed to alternative methods, when they can best be employed, and what disadvantages they bring. They explain why finding an instrumental variable is so difficult, and how instruments can lend insight into characteristics we would otherwise not have observed. I actually quite enjoyed the question on this problem set in which we proposed our own instrumental variable, because it leaves so much room for creativity and for criticism, so I was interested to see how they respond technically to weak instruments and how to assess the robustness of an instrument. The obvious condition for IVs is that they must be highly correlated with the endogenous variable while affecting the outcome, such as the baby’s birth weight in our examples, only through that variable. I really appreciated the table they presented to help understand different applications of instrumental variables; it makes it easier to place them in context. I am actually quite intrigued by the many different possible uses of instrumental variables, and I would love to think of applications in capital markets. Possible future thesis? :)

Problem 5

The paper by Koenker and Hallock (2001) offers a great discussion of the efficacy of quantile regression and its advantages when used in place of ordinary least squares estimation. Econometricians might find quantile regression very useful when working with data containing outliers that may skew results. One thing I have heard time and time again in your class is that the median, like other quantiles, is a more robust measure than the mean in the presence of outliers. I gained a lot of knowledge about the purpose of quantile regression and about why econometricians might prefer it to OLS when they want to estimate the effects of covariates over a range of outcomes. Understanding how the effects of covariates differ over a range of outcomes can be very useful, as we saw in the previous questions, where the quantile regressions showed how the effect of the smoking-during-pregnancy indicator changes across infant birth weights.