This assignment will help you review and extend your understanding of multivariate OLS regression.
This problem set requires R and three data files: PS3Data.csv, HeightWage_MenWomenUS_HW.csv, and Cellphone_2012_homework.csv. The data files are available in Moodle.
The following table presents the results of regressions using the state.x77 data that we used in handouts and class exercises (you can see details of this dataset by typing ?state.x77 in R). I have created one new variable, South, that takes a value of 1 if the state is Virginia, Georgia, North Carolina, South Carolina, Alabama, Arkansas, Texas, or Mississippi and 0 if it is not any of those states.
Dependent variable: State-Level Murder Rate

| | (1) | (2) | (3) | (4) |
| --- | --- | --- | --- | --- |
| Income | -0.001 | 0.001 | 0.0001 | 0.001 |
| | (0.001) | (0.001) | (0.001) | (0.001) |
| HS.Grad | | -0.256*** | | -0.149* |
| | | (0.074) | | (0.079) |
| South | | | 5.569*** | 4.180*** |
| | | | (1.349) | (1.508) |
| Constant | 13.509*** | 17.858*** | 6.080 | 10.457** |
| | (3.778) | (3.628) | (3.734) | (4.319) |
| Observations | 50 | 50 | 50 | 50 |
| R2 | 0.053 | 0.247 | 0.305 | 0.354 |
| Adjusted R2 | 0.033 | 0.215 | 0.275 | 0.312 |
| Residual Std. Error | 3.630 (df = 48) | 3.272 (df = 47) | 3.143 (df = 47) | 3.061 (df = 46) |
| F Statistic | 2.683 (df = 1; 48) | 7.693*** (df = 2; 47) | 10.307*** (df = 2; 47) | 8.418*** (df = 3; 46) |

Note: *p<0.1; **p<0.05; ***p<0.01
Write out the regression equations for columns 1 and 4. Which of these is likelier to contain useful information for predicting the murder rate by state? What do the numbers in parentheses represent?
Compare columns 2, 3, and 4. What do the differences in the estimated coefficients for HS.Grad and South across these columns suggest about the presence or absence of omitted variable bias?
Substantively interpret the coefficient for South in column 4. How should you, as an analyst, understand this relationship in a purely statistical sense? How should you, as an analyst, interpret that statistical relationship in a causal sense?
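If you would like to check the table against the raw data, the following is a minimal sketch of how the four specifications could be fit in R. It assumes the South dummy is built from the state names exactly as listed in the problem text; the object names (states, south_states, m1 through m4) are illustrative choices, not part of the assignment.

```r
# Build a data frame from the built-in state.x77 matrix.
# data.frame() converts the column "HS Grad" to the syntactic name HS.Grad.
states <- data.frame(state.x77)

# South dummy, using the eight states named in the problem text.
south_states <- c("Virginia", "Georgia", "North Carolina", "South Carolina",
                  "Alabama", "Arkansas", "Texas", "Mississippi")
states$South <- as.integer(rownames(states) %in% south_states)

# The four specifications, in the same order as the table columns.
m1 <- lm(Murder ~ Income, data = states)
m2 <- lm(Murder ~ Income + HS.Grad, data = states)
m3 <- lm(Murder ~ Income + South, data = states)
m4 <- lm(Murder ~ Income + HS.Grad + South, data = states)

# Coefficient estimates, standard errors, t values, and p values for column 4.
print(round(coef(summary(m4)), 3))
```

Calling summary() on each model object gives the remaining quantities reported in the table, such as R2 and the F statistic.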
Using the file HeightWage_MenWomenUS_HW.csv, complete the following problems:
Estimate an OLS regression model with adult wages as the dependent variable and adult height, adolescent height, and a dummy variable for males as the independent variables. Does controlling for gender affect the results?
Generate a female dummy variable. Estimate a model with both a male dummy variable and a female dummy variable. What happens? Why?
Can we interpret these data causally? What sources of endogeneity might exist? Come up with the strongest possible candidate for endogeneity you can, justify your choice, and use the data given to estimate a model that controls for sources of endogeneity.
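The mechanics of generating a complementary dummy variable and including both dummies in one model can be sketched on simulated data. Everything below is a stand-in: the column names (wage, height, male, female) and the simulated magnitudes are assumptions for illustration and will not match the variable names or values in HeightWage_MenWomenUS_HW.csv.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; names and magnitudes are hypothetical.
sim <- data.frame(
  height = rnorm(n, mean = 67, sd = 4),
  male   = rbinom(n, size = 1, prob = 0.5)
)
sim$wage <- 10 + 0.2 * sim$height + 2 * sim$male + rnorm(n, sd = 3)

# The female dummy is the complement of the male dummy.
sim$female <- 1 - sim$male

# Fit a model with an intercept and both dummies.
fit_both <- lm(wage ~ height + male + female, data = sim)
print(coef(fit_both))

# Check how the two dummies relate to the intercept column (a column of ones).
print(all(sim$male + sim$female == 1))
```

Compare the coefficient that lm() reports for each dummy, and consider what the final check implies about the columns of the design matrix.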
Using the file Cellphone_2012_homework.csv, complete the following problems:
Regress the number of deaths on controls for the number of cell phone subscribers, the population of states, and the total miles driven in a state. Interpret the coefficient for the number of cell phone subscribers in terms of both substantive and statistical significance.
Take the equation you estimated in 3 (A) and add controls for the presence or absence of a text ban and a cell phone ban. Interpret what has changed between your earlier results and this one. How many lives does your result suggest a cell phone ban saves per year? How many lives does your result suggest a texting ban saves per year? Interpret these results with regard to the estimated standard error of each of these variables’ coefficients.
How many lives would be saved if California were to implement a cell phone ban / texting ban? How many lives would be saved if Wyoming did the same? What are the implications of your answers for proper model specification?
Estimate a model in which total miles is interacted with both the cell phone ban and the texting ban variables. What is the estimated effect of a cell phone ban for California? For Wyoming? What is the effect of a texting ban for California? For Wyoming?
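In an interaction model, the estimated effect of a ban is no longer a single number: it equals the ban's own coefficient plus the interaction coefficient multiplied by the state's total miles. The following is a minimal sketch of that calculation on simulated data. The column names (deaths, total_miles, cell_ban, text_ban), the helper ban_effect(), and all magnitudes are hypothetical, and the sketch omits the population and subscriber controls for brevity.

```r
set.seed(2)
n <- 50

# Simulated stand-in data; names and magnitudes are hypothetical.
sim <- data.frame(
  total_miles = runif(n, min = 5000, max = 300000),
  cell_ban    = rbinom(n, size = 1, prob = 0.3),
  text_ban    = rbinom(n, size = 1, prob = 0.6)
)
sim$deaths <- 50 + 0.01 * sim$total_miles -
  0.002 * sim$cell_ban * sim$total_miles + rnorm(n, sd = 100)

# total_miles * cell_ban expands to both main effects plus their interaction.
fit_int <- lm(deaths ~ total_miles * cell_ban + total_miles * text_ban,
              data = sim)

# Estimated effect of a ban at a given mileage:
# main-effect coefficient + interaction coefficient * miles.
ban_effect <- function(fit, ban, miles) {
  b <- coef(fit)
  unname(b[ban] + b[paste0("total_miles:", ban)] * miles)
}

# Effect for a low-mileage and a high-mileage hypothetical state.
print(ban_effect(fit_int, "cell_ban", miles = 10000))
print(ban_effect(fit_int, "cell_ban", miles = 300000))
```

With the real data, the miles argument would be replaced by each state's observed total miles, taken from the data file.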
Imagine you are an analyst working as an employee of the Multiple OLS Directorate (MOD), a contractor that consults for the U.S. federal government. Your client is interested in understanding why other countries spend money on their military. Your supervisor suggests that oil rents might help prop up military regimes.
In PS3Data.csv, you will have several variables available to you:
iso, cname: country ID variables
chga_demo, binary variable indicating whether a country is democratic or authoritarian
p_polity2, continuous variable ranging from most authoritarian (-10) to most democratic (10)
wdi_gdpc, per-capita GDP; wdi_gdp, overall GDP
wdi_gdpcgr, growth rate of per-capita GDP; wdi_gdpgr, growth rate of overall GDP
wdi_gini, Gini index
wdi_megdp, military expenditure as percent of GDP, and wdi_mege, military expenditure as percent of government expenditures
wdi_wip, share of women in parliament
epi_co2cap, carbon emissions per capita; epi_co2gdp, carbon emissions per unit of GDP (“per dollar”); and epi_epi, a measure of environmental protection (all from the Environmental Protection Index)
ht_colonial, colonial origin of a given country
t_demyrs, total number of years as a democracy
and, most important, oipc, oil income per capita
Create an analysis of why states choose to spend money on their militaries (megdp), including:
Your boss asks you to re-run the analysis above, using mege (percent of government expenditures spent on the military) instead. What differences do you see? Be specific. Which measurement do you believe is more theoretically relevant, and why?
You’ve left your job as a contractor for the MOD and are now working for a global environmental activist NGO. They would like you to provide them with an analysis of whether oil production is associated with worse environmental outcomes. Construct and justify three different models of environmental outcomes with no more than five different explanatory variables in total, one of which must be oipc and another of which must measure national wealth in some form. You should use two different measures of environmental quality from the three available to you.
Based on those models, provide two charts illustrating the relationship you have discovered as well as full results from your models. Summarize your analysis and justify your analytical choices. Provide a recommendation about what your analysis implies for the strategy your NGO should follow.