Purpose

This assignment will help you review and extend your understanding of multivariate OLS regression.

Requirements

This problem set will require the use of R, PS3Data.csv, HeightWage_MenWomenUS_HW.csv, and Cellphone_2012_homework.csv. The data files are available on Moodle.

Question 1

The following table presents the results of regressions using the state.x77 data that we used in handouts and class exercises (you can see details of this dataset by typing ?state.x77 in R). I have created one new variable, South, which takes a value of 1 if the state is Virginia, Georgia, North Carolina, South Carolina, Alabama, Arkansas, Texas, or Mississippi and 0 otherwise.

Dependent variable: State-Level Murder Rate

|                     | (1)                | (2)                   | (3)                    | (4)                   |
|---------------------|--------------------|-----------------------|------------------------|-----------------------|
| Income              | -0.001 (0.001)     | 0.001 (0.001)         | 0.0001 (0.001)         | 0.001 (0.001)         |
| HS.Grad             |                    | -0.256*** (0.074)     |                        | -0.149* (0.079)       |
| South               |                    |                       | 5.569*** (1.349)       | 4.180*** (1.508)      |
| Constant            | 13.509*** (3.778)  | 17.858*** (3.628)     | 6.080 (3.734)          | 10.457** (4.319)      |
| Observations        | 50                 | 50                    | 50                     | 50                    |
| R2                  | 0.053              | 0.247                 | 0.305                  | 0.354                 |
| Adjusted R2         | 0.033              | 0.215                 | 0.275                  | 0.312                 |
| Residual Std. Error | 3.630 (df = 48)    | 3.272 (df = 47)       | 3.143 (df = 47)        | 3.061 (df = 46)       |
| F Statistic         | 2.683 (df = 1; 48) | 7.693*** (df = 2; 47) | 10.307*** (df = 2; 47) | 8.418*** (df = 3; 46) |

Note: *p<0.1; **p<0.05; ***p<0.01

Question 1 A

Write out the regression equations for columns 1 and 4. Which of these is likelier to contain useful information for predicting the murder rate by state? What do the numbers in parentheses represent?

\[Murder = 13.509 - 0.001 Income\]

\[Murder = 10.457 + 0.001 Income - 0.149 HS.Grad + 4.180 South\]

The second equation is likelier to be useful because only it contains statistically significant predictors of the Murder rate. The numbers in parentheses are estimates of the standard error of each coefficient estimate.

Question 1 B

Compare columns 2, 3, and 4. What do the differences in the estimated coefficients for HS.Grad and South across these columns suggest about the presence or absence of omitted variable bias?

There is ample reason to suspect omitted variable bias in columns 2 and 3. Not only are both coefficient estimates significantly different from zero on their own, but their values also change substantially when we control for both variables (column 4) compared to including only one or the other.
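
For reference, here is a minimal sketch of how columns 2-4 can be reproduced so the coefficient shifts can be inspected directly (state.x77 ships with R; the South dummy follows the definition in the question; the object names are mine):

```r
# Rebuild the data used in the table: state.x77 plus the South dummy
states <- data.frame(state.x77)
states$South <- as.numeric(rownames(states) %in%
  c("Virginia","Georgia","North Carolina","South Carolina",
    "Alabama","Arkansas","Texas","Mississippi"))

m2 <- lm(Murder ~ Income + HS.Grad, data=states)          # column 2
m3 <- lm(Murder ~ Income + South, data=states)            # column 3
m4 <- lm(Murder ~ Income + HS.Grad + South, data=states)  # column 4

# Both coefficients move toward zero once the other control is added
coef(m2)["HS.Grad"]; coef(m4)["HS.Grad"]
coef(m3)["South"];   coef(m4)["South"]
```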

Question 1 C

Substantively interpret the coefficient for South in column 4. How should you, as an analyst, understand this relationship in a purely statistical sense? How should you, as an analyst, interpret that statistical relationship in a causal sense?

The coefficient for South suggests that a Southern state (as defined) will experience, on average, about 4.2 more murders per 100,000 people than a non-Southern state, holding income and the high-school graduation rate constant. Given that the average number of murders per 100k across states is only about 7.4, this is a substantively significant difference; and since the probability of observing a coefficient this large by chance, if the true relationship were zero, is much less than 1 in 20 (that is, p is much less than 0.05), we consider the finding statistically significant.
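
As a quick check of that benchmark (state.x77 ships with R, so anyone can verify it):

```r
# Average state-level murder rate in the state.x77 data
mean(state.x77[, "Murder"])   # about 7.4 murders per 100,000
```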

But is this causal? Well, in one sense, maybe – maybe there’s something about being Southern that is causing this. But we can’t say for sure: we can’t randomly assign states to have a Southern cultural heritage or not, and that means we can’t compare what would have happened if Iowa or California had been “Southern” with the reality we observe. That means that South is probably picking up a whole mess of endogenous factors, and even though it’s clearly important in a statistical, predictive sense, it’s hard – really, impossible – for us to say what is causing that.

Question 2

Using the file HeightWage_MenWomenUS_HW.csv, complete the following problems:

Question 2 A

Estimate an OLS regression model with adult wages as the dependent variable and adult height, adolescent height, and a dummy variable for males as the independent variables. Does controlling for gender affect the results?

q2data <- read.csv("HeightWage_MenWomenUS_HW.csv")
q2.lm1 <- lm(wage96~height81+height85+male,data=q2data)
q2.lm2 <- lm(wage96~height81+height85,data=q2data)
library(stargazer)
stargazer(q2.lm1,q2.lm2,type="html")
Dependent variable:
wage96
(1) (2)
height81 0.487** 0.457*
(0.248) (0.246)
height85 -0.048 -0.107
(0.250) (0.243)
male -1.000
(0.989)
Constant -14.722* -9.248
(7.916) (5.775)
Observations 6,594 6,594
R2 0.003 0.003
Adjusted R2 0.002 0.002
Residual Std. Error 27.893 (df = 6590) 27.893 (df = 6591)
F Statistic 6.256*** (df = 3; 6590) 8.873*** (df = 2; 6591)
Note: *p<0.1; **p<0.05; ***p<0.01

In the model with both height variables and a dummy variable for male, the adolescent height variable (height81) is statistically significant (barely, at p = 0.049). The other variables are not statistically significant. If we exclude the male variable (column 2 of the table above), the coefficient on adolescent height is similar to that in the model with the male dummy variable, but it is no longer significant at the 5 percent level. Including the male variable therefore exerts a modest effect on the results. If anything, this is surprising, because one would expect the male dummy variable to be highly correlated with height and at least somewhat correlated with wages, which, if true, could have induced fairly large omitted variable bias in the model without it.
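
One low-tech way to probe that intuition with the data we already have is to look at the pairwise correlations; this is just a diagnostic sketch, not something the assignment asks for:

```r
# How strongly is the male dummy correlated with the height measures and with wages?
round(cor(q2data[,c("male","height81","height85","wage96")],
          use="pairwise.complete.obs"), 2)
```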

Question 2 B

Generate a female dummy variable. Estimate a model with both a male dummy variable and a female dummy variable. What happens? Why?

q2data$female <- NA
q2data[q2data$male==1,]$female <- 0
q2data[q2data$male==0,]$female <- 1
q2b.lm <- lm(wage96~male+female, data=q2data)
summary(q2b.lm)
## 
## Call:
## lm(formula = wage96 ~ male + female, data = q2data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -14.89   -7.20   -3.36    2.01 1519.96 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.3761     0.4781  27.975   <2e-16 ***
## male          1.5186     0.6584   2.307   0.0211 *  
## female            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.36 on 6928 degrees of freedom
##   (5756 observations deleted due to missingness)
## Multiple R-squared:  0.0007673,  Adjusted R-squared:  0.0006231 
## F-statistic:  5.32 on 1 and 6928 DF,  p-value: 0.02111

Unsurprisingly, R can’t estimate this as specified – the two variables are perfectly collinear (female is just 1 - male), so R drops one of them and reports its coefficient as “not defined because of singularities,” because you, the user, have made a big mistake….
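
A quick sketch makes the collinearity explicit, using the variables constructed above:

```r
# female is an exact linear function of male, so the design matrix is rank-deficient
with(q2data, all(female == 1 - male, na.rm = TRUE))   # TRUE by construction
cor(q2data$male, q2data$female, use = "complete.obs") # exactly -1
```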

Question 2 C

Can we interpret these data causally? What sources of endogeneity might exist? Come up with the strongest possible candidate for endogeneity you can, justify your choice, and use the data given to estimate a model that controls for sources of endogeneity.

This is tricky! In one sense, I think it’s clear that we have some good reasons to think that there are other factors out there in the world that aren’t being represented in this model—that is, we’re at risk of being exposed to omitted variable bias. On the other hand, where to begin?

I suggested in class that you should be thinking about interaction terms here. I think that “gender” and “height” might be a good chance to try that out!

```r
q2.lm3 <- lm(wage96~height81+height85+male+male*height81+male*height85,data=q2data)

q2.lm4 <- lm(wage96~height81+height85,data=subset(q2data,male==1))
q2.lm5 <- lm(wage96~height81+height85,data=subset(q2data,male==0))
```

Dependent variable: wage96

|                     | q2.lm1 (1)              | q2.lm3 (2)              | q2.lm4 (Males) (3)       | q2.lm5 (Females) (4) |
|---------------------|-------------------------|-------------------------|--------------------------|----------------------|
| height81            | 0.487** (0.248)         | -0.006 (0.402)          | 0.783*** (0.206)         | -0.006 (0.513)       |
| height85            | -0.048 (0.250)          | 0.321 (0.391)           | -0.269 (0.213)           | 0.321 (0.498)        |
| male                | -1.000 (0.989)          | -14.089 (16.649)        |                          |                      |
| height81:male       |                         | 0.789 (0.511)           |                          |                      |
| height85:male       |                         | -0.590 (0.509)          |                          |                      |
| Constant            | -14.722* (7.916)        | -6.728 (12.119)         | -20.817*** (7.468)       | -6.728 (15.453)      |
| Observations        | 6,594                   | 6,594                   | 3,445                    | 3,149                |
| R2                  | 0.003                   | 0.003                   | 0.009                    | 0.001                |
| Adjusted R2         | 0.002                   | 0.002                   | 0.009                    | -0.00004             |
| Residual Std. Error | 27.893 (df = 6590)      | 27.892 (df = 6588)      | 18.248 (df = 3442)       | 35.565 (df = 3146)   |
| F Statistic         | 6.256*** (df = 3; 6590) | 4.267*** (df = 5; 6588) | 15.939*** (df = 2; 3442) | 0.944 (df = 2; 3146) |

Note: *p<0.1; **p<0.05; ***p<0.01

This produces three new equations: one that interacts male with the height variables, and two that run separate regressions for men and women. We’re trying to see whether the slopes differ by gender. The slope on height81 for women in q2.lm3, the interaction equation, is -0.006, identical to what we see in the women-only results above. The slope on height81 for men is the sum -0.006 + 0.789 = 0.783, which matches the slope in the men-only results above. The other coefficients in the interaction equation also line up with the separate regressions: for example, the intercept for women is -6.728, and the intercept for men is -6.728 - 14.089 = -20.817, exactly as we see in the men-only results.
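
Here is a short sketch of pulling those quantities directly from the fitted interaction model rather than doing the arithmetic by hand (the coefficient names follow the formula above):

```r
# Gender-specific coefficients implied by q2.lm3; up to rounding they match q2.lm4/q2.lm5
b <- coef(q2.lm3)
b["height81"]                       # women's slope on adolescent height
b["height81"] + b["height81:male"]  # men's slope on adolescent height
b["(Intercept)"] + b["male"]        # men's intercept
```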

The principle is that if we allow all slopes to differ by gender and include a separate intercept for each gender, the coefficients will be the same as those from running the model on each gender separately. The fact that the interaction terms themselves aren’t “significant” isn’t the point – the point is that including them lets us see whether the height–wage relationship differs for men and women, something the pooled model simply can’t show.

Question 3

Using the file Cellphone_2012_homework.csv, complete the following problems:

Question 3 A

Regress the number of deaths on controls for the number of cell phone subscribers, the population of states, and the total miles driven in a state. Interpret the coefficient for the number of cell phone subscribers in both substantive and significance terms.

q3data <- read.csv("Cellphone_2012_homework.csv")
q3.lm1 <- lm(numberofdeaths~cell_subscription+population+total_miles_driven,data=q3data)
summary(q3.lm1)
## 
## Call:
## lm(formula = numberofdeaths ~ cell_subscription + population + 
##     total_miles_driven, data = q3data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -555.98  -92.75  -12.18   60.67  788.37 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.346e+00  4.108e+01   0.106    0.916    
## cell_subscription   2.465e-03  7.671e-02   0.032    0.975    
## population         -7.422e-05  8.821e-05  -0.841    0.404    
## total_miles_driven  1.883e-02  3.018e-03   6.240 1.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 194.7 on 46 degrees of freedom
## Multiple R-squared:  0.9221, Adjusted R-squared:  0.917 
## F-statistic: 181.5 on 3 and 46 DF,  p-value: < 2.2e-16

The coefficient on cell phone subscribers indicates that every thousand additional cell phone subscribers is associated with a small (very small!) increase in the number of traffic deaths in a state. However, this relationship is nowhere near statistically significant (p = 0.975).

Question 3 B

Take the equation you regressed in 3 (A) and add controls for the presence or absence of a text ban and a cell phone ban. Interpret what has changed between your earlier results and this one. How many lives does your result suggest a cell phone ban saves per year? How many lives does your result suggest a texting ban saves per year? Interpret these results with regard to the estimation of the standard error of each of these variables’ coefficients.

q3.lm2 <- lm(numberofdeaths~cell_subscription+population+total_miles_driven+cell_ban+text_ban,data=q3data)
stargazer(q3.lm1,q3.lm2,column.labels=c("Q3 (A)","Q3 (B)"),type="html")
Dependent variable:
numberofdeaths
Q3 (A) Q3 (B)
(1) (2)
cell_subscription 0.002 0.027
(0.077) (0.069)
population -0.0001 -0.0001
(0.0001) (0.0001)
total_miles_driven 0.019*** 0.016***
(0.003) (0.003)
cell_ban -100.790
(73.329)
text_ban -165.234***
(56.950)
Constant 4.346 150.098***
(41.075) (54.705)
Observations 50 50
R2 0.922 0.941
Adjusted R2 0.917 0.934
Residual Std. Error 194.712 (df = 46) 173.454 (df = 44)
F Statistic 181.518*** (df = 3; 46) 140.035*** (df = 5; 44)
Note: *p<0.1; **p<0.05; ***p<0.01

The coefficient on cell_ban is about -101, meaning there were roughly 100 fewer deaths in states with cell phone bans, once we’ve controlled for the other factors. This estimate is not statistically significant: it is less than one and a half times its standard error (73.3), so it falls short even for a one-tailed test with α = 0.05. The coefficient on text_ban is about -165, meaning there were roughly 165 fewer deaths in states with text bans, once we’ve controlled for the other factors. That estimate is nearly three times its standard error (57.0), so it is statistically significant at any conventional level.
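
To make the standard-error point concrete, here is a quick sketch of the t-ratios implied by the table (each estimate divided by its standard error):

```r
# t-ratios for the ban coefficients in the model from Q3 (B)
s <- summary(q3.lm2)$coefficients
s["cell_ban", "Estimate"] / s["cell_ban", "Std. Error"]   # roughly -1.4
s["text_ban", "Estimate"] / s["text_ban", "Std. Error"]   # roughly -2.9
```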

Question 3 C

How many lives would be saved if California were to implement a cell phone ban / texting ban? How many lives would be saved if Wyoming did the same? What are the implications of your answers for proper model specification?

California would save 100 lives from a cell phone ban and 165 lives from a text ban. Wyoming would save exactly the same number of lives: 100 from a cell phone ban and 165 from a text ban. Clearly, this is odd, since California has more than 60 times as many people as Wyoming. A full analysis therefore needs to do something to account for this, perhaps by using per capita death rates or, as we do below, by looking for interactions.
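
A one-line sketch shows why the predictions are identical: without interactions, each ban "effect" is a single coefficient that does not depend on any state characteristic.

```r
# In q3.lm2 the predicted ban effects are just the coefficients, the same for every state
coef(q3.lm2)[c("cell_ban", "text_ban")]  # about -101 and -165, regardless of state size
```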

Question 3 D

Estimate a model in which total miles is interacted with both the cell phone ban and the prohibition of texting variables. What is the estimated effect of a cell phone ban for California? For Wyoming? What is the effect of a texting ban for California? For Wyoming?

The results below include interaction variables for total miles times cell ban and total miles times texting ban.

q3.lm3 <- lm(numberofdeaths~cell_subscription+population+total_miles_driven+cell_ban+text_ban+cell_ban*total_miles_driven+text_ban*total_miles_driven,data=q3data)
Dependent variable:
numberofdeaths
cell_subscription -0.022
(0.055)
population 0.00001
(0.0001)
total_miles_driven 0.014***
(0.002)
cell_ban -20.090
(72.031)
text_ban 21.832
(65.714)
total_miles_driven:cell_ban -0.001
(0.001)
total_miles_driven:text_ban -0.003***
(0.001)
Constant 8.845
(47.046)
Observations 50
R2 0.969
Adjusted R2 0.964
Residual Std. Error 127.731 (df = 42)
F Statistic 190.044*** (df = 7; 42)
Note: *p<0.1; **p<0.05; ***p<0.01

The effect of a cell phone ban is \(cellban + CellBanMiles × totalmiles\).

  • For California (where total_miles = 326,272) this means the estimated effect is \(-20 + (-0.0014622 × 326,272) = -497\).
  • For Wyoming (where total_miles = 9,270) this means the estimated effect is \(-20 + (-0.0014622 × 9,270) = -33.6\).

The effect of a text ban is \(textban + TextBanMiles × totalmiles\).

  • For California (where total_miles = 326,272) this means the estimated effect is \(21.8 + (-0.0029245 × 326,272) = -932\).

  • For Wyoming (where total_miles = 9,270) this means the estimated effect is \(21.8 + (-0.0029245 × 9,270) = -5.3\).

The effect of total miles depends on whether a state has a cell phone ban and/or a text ban. The marginal effect of increasing total miles driven by one unit is \(totalmiles + CellBanMiles × cellban + TextBanMiles × textban\), which is \(0.0141698 + (-0.0014622 × cellban) + (-0.0029245 × textban)\). This means that increasing total miles by 1 million (the total miles variable is measured in millions of miles) in a state with no cell ban and no text ban is associated with roughly 0.0142 additional deaths. Increasing total miles by 1 million miles in a state with a cell ban but no text ban is associated with \(0.0141698 + (-0.0014622) = 0.0127\) additional deaths.
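
Rather than doing this arithmetic by hand, we can pull the coefficients straight out of q3.lm3; this is just a convenience sketch (ban_effect is a helper name I’m introducing, and the mileage figures are the ones used above):

```r
# State-specific ban effects from the fitted interaction model q3.lm3
b3 <- coef(q3.lm3)
ban_effect <- function(miles) {
  c(cell_ban = unname(b3["cell_ban"] + b3["total_miles_driven:cell_ban"] * miles),
    text_ban = unname(b3["text_ban"] + b3["total_miles_driven:text_ban"] * miles))
}
ban_effect(326272)  # California: roughly -497 (cell ban) and -932 (text ban)
ban_effect(9270)    # Wyoming:    roughly -34 (cell ban) and -5 (text ban)
```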

Questions 4-6

library(foreign)
library(ggplot2)
data <- read.csv("PS3Data.csv")

Plots showing military expenditure against GDP and GDP per capita, with region as color

Let me work through the process of building one plot to show you how I would approach this problem. I start with the most basic exploratory plot:

q1plot1 <- ggplot(data,aes(x=wdi_gdp,
                           y=wdi_megdp,
                           color=factor(ht_region)))
q1plot1 + geom_point() + 
  ggtitle("Military Expenditure vs. GDP") + 
  scale_color_discrete(name="Region") +
  ylab("Military Expenditure Share of GDP") + 
  xlab("GDP Size")

Okay! I ran some R code, and it worked! A-plus. Excellent. Right?

Well, no. This is C work in the old, true sense of the term: meets expectations. If you stopped here, that’s okay for now, and it counts for credit, but you could have done so much more thinking. (I’m writing this before taking a close look at any of the solution sets, so I’m not talking about anyone in particular.)

Look again at the chart. Does this chart represent the best possible effort we could make at presenting these data? I think not. This output actually isn’t all that informative, because the variation in the size of national GDP is really huge. A big part of our job as researchers, data analysts, and data visualizers is to present more than just the simple, lazy solution.

Taking the log of a variable with a huge range compresses that range and can reveal relationships that are masked in the untransformed variable. So, let’s take the log of X (since Y is fairly nicely contained along one axis already):

q1plot1 <- ggplot(data,aes(x=log(wdi_gdp), # just use log() here
                           y=wdi_megdp,
                           color=factor(ht_region)))
q1plot1 + geom_point() + 
  ggtitle("Military Expenditure vs. GDP") + 
  scale_color_discrete(name="Region") +
  ylab("Military Expenditure Share of GDP") + 
  xlab("GDP Size (Log)")

A few points. One is purely cosmetic: I don’t like that the labels for the ht_region variable include all those leading integers (which is also why, when ggplot sorts the factor levels alphabetically, 10 ends up immediately after 1). So:

# note: the new labels follow the existing (alphabetically sorted) level order: 1, 10, 2, 3, ..., 9
levels(data$ht_region) <- c("E. Europe and FSU",
                                     "The Caribbean",
                                     "Latin America",
                                     "N. Africa and Mideast", 
                                     "Sub-Saharan Africa",
                                     "W. Europe and N. America",
                                     "East Asia", 
                                     "South-East Asia",
                                     "South Asia",
                                     "The Pacific")

One additional problem creeps in here: two pairs of regions we really want to keep distinct (Sub-Saharan Africa vs. MENA, W. Europe and N. America vs. East Asia) are almost identical visually (maybe you can tell these shades apart, but I can’t, and they’re certainly not as differentiated as they could be). That’s not ideal! Better to use not just color but also shapes:

q1plot1 <- ggplot(data,aes(x=log(wdi_gdp), 
                           y=wdi_megdp,
                           color=ht_region,
                           shape=ht_region))
q1plot1 + geom_point() + 
  ggtitle("Military Expenditure vs. GDP") + 
  ylab("Military Expenditure Share of GDP") + 
  xlab("GDP Size (Log)")
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 10. Consider specifying shapes manually if you must have them.
## Warning: Removed 44 rows containing missing values (geom_point).

(No, I absolutely did not expect you to do this, but the point is not just to recapitulate lectures and handouts but for you to begin thinking about how to go beyond regurgitation.)

Introducing shapes makes it easy to keep our key regions visually distinct, but suddenly we’ve dropped four regions. It turns out ggplot will only automatically generate six shapes, and refuses to plot anything else unless we override it:

q1plot1 <- ggplot(data,aes(x=log(wdi_gdp), 
                           y=wdi_megdp,
                           color=ht_region,
                           shape=ht_region))
q1plot1 + geom_point() + 
  ggtitle("Military Expenditure vs. GDP") + 
  scale_shape_manual(values=1:nlevels(data$ht_region), # overrides default number of shapes
                     name="Region") +
  ylab("Military Expenditure Share of GDP") + 
  xlab("GDP Size (Log)") +
  scale_color_discrete(name="Region") # when we specify color AND shape, 
                                      # we have to make sure that both scale_color_discrete and 
                                      # scale_shape_manual have the same "name", otherwise 
                                      # ggplot thinks there should be two legends

## Warning: Removed 22 rows containing missing values (geom_point).

This is an altogether more useful plot. It suggests that, for most countries, the share of GDP devoted to the military doesn’t vary all that much as the size of their economies grows. That wasn’t altogether clear from the initial plot. The plot also suggests that there might be something interesting (that is, at variance with the trend for other regions) going on in the Middle East: some of the MENA countries sit well above the trend line.

We can add a trend line to complete the graph. (You aren’t responsible for this! This is new information! For more on LOESS, see the Wikipedia article.)

q1plot1 <- ggplot(data,aes(x=log(wdi_gdp), # just use log() here
                           y=wdi_megdp,
                           color=ht_region,
                           shape=ht_region))
q1plot1 + 
  stat_smooth(method="loess",aes(color=NA,shape=NA)) + # See what happens if we don't use aes() 
  #and NA to override the color and shape functions
  geom_point() + #place geom_point after stat_smooth because that tells ggplot to place points 
  #over the trendline instead of the reverse
  ggtitle("Military Expenditure vs. GDP") + 
  scale_shape_manual(values=1:nlevels(data$ht_region), 
                     name="Region") +
  ylab("Military Expenditure Share of GDP") + 
  xlab("GDP Size (Log)") +
  scale_color_discrete(name="Region") 

And now we can take a look at the parallel chart here for GDP per capita:

q1plot2 <- ggplot(data,aes(x=log(wdi_gdpc), 
                           y=wdi_megdp,
                           color=ht_region,
                           shape=ht_region))
q1plot2 + 
  stat_smooth(method="loess",aes(color=NA,shape=NA)) + 
  geom_point() +
  ggtitle("Military Expenditure vs. GDP") + 
  scale_shape_manual(values=1:nlevels(data$ht_region), 
                     name="Region") +
  ylab("Military Expenditure Share of GDP") + 
  xlab("GDP Per Capita") +
  scale_color_discrete(name="Region") 

Here, I would also check the untransformed X variable; the code is the same except for substituting x=wdi_gdpc, as sketched below.
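
For completeness, here is that check, mirroring the plot code above (q1plot2b is just a new name I’m introducing for this variant):

```r
# Same plot as above, but with untransformed per-capita GDP on the x-axis
q1plot2b <- ggplot(data,aes(x=wdi_gdpc, 
                            y=wdi_megdp,
                            color=ht_region,
                            shape=ht_region))
q1plot2b + 
  stat_smooth(method="loess",aes(color=NA,shape=NA)) + 
  geom_point() +
  ggtitle("Military Expenditure vs. GDP Per Capita") + 
  scale_shape_manual(values=1:nlevels(data$ht_region), 
                     name="Region") +
  ylab("Military Expenditure Share of GDP") + 
  xlab("GDP Per Capita") +
  scale_color_discrete(name="Region") 
```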

Although the technical interpretation of the logged and untransformed charts differs, the substantive takeaway is similar, and I would be comfortable using either chart in a presentation (I’d probably incline toward the logged version, but the untransformed one is probably simpler for a nontechnical audience to interpret). Note that here, again, the MENA region has some countries that are pretty far off trend; when I’m working on this subject area, that suggests to me that there is probably something oil-rent related happening here.

Adding Oil Income

I would begin with the same plots as before, adding oipc:

# GDP PC and MEGDP
q1plot3 <- ggplot(data,aes(x=wdi_gdpc, 
                           y=wdi_megdp,
                           color=ht_region,
                           shape=ht_region,
                           size=oipc))
q1plot3 + 
  stat_smooth(method="loess",aes(color=NA,shape=NA)) + 
  geom_point() +
  ggtitle("Military Expenditure vs. GDP") + 
  scale_shape_manual(values=1:nlevels(data$ht_region), 
                     name="Region") +
  ylab("Military Expenditure Share of GDP") + 
  xlab("GDP Per Capita") +
  scale_color_discrete(name="Region") +
  scale_size_continuous("Oil Income Per Capita (USD)")

# GDP (Overall) and MEGDP
q1plot4 <- ggplot(data,aes(x=log(wdi_gdp), 
                           y=wdi_megdp,
                           color=ht_region,
                           shape=ht_region,
                           size=oipc))
q1plot4 + 
  stat_smooth(method="loess",aes(color=NA,shape=NA)) + 
  geom_point() +
  ggtitle("Military Expenditure vs. GDP") + 
  scale_shape_manual(values=1:nlevels(data$ht_region), 
                     name="Region") +
  ylab("Military Expenditure Share of GDP") + 
  xlab("Overall GDP (Log)") +
  scale_color_discrete(name="Region") +
  scale_size_continuous("Oil Income Per Capita (USD)")

Both charts suggest that there is a relationship between Oil Income Per Capita and Military Expenditure. Unsurprisingly, this is most visible in the GDP per capita plot, since GDP per capita and Oil Income per capita are the two measures most like each other (both are measured per capita).

Regime Type and Oil Income

The exercise here is similar; I’ll only go through one chart.

# GDP Per Capita and MEGDP, by regime type
q1plot5 <- ggplot(data,aes(x=log(wdi_gdpc), 
                           y=wdi_megdp,
                           color=factor(chga_demo),
                           size=oipc))
q1plot5 + 
  stat_smooth(method="loess",aes(color=NA)) + 
  geom_point() +
  ggtitle("Military Expenditure vs. GDP") + 
  ylab("Military Expenditure Share of GDP") + 
  xlab("GDP Per Capita (Log)") +
  scale_color_discrete(name="Regime Type") +
  scale_size_continuous(name="Oil Income Per Capita (USD)")

This is, I think, a nice, clean chart. It also nicely underscores that most of the high oil-income per capita states are dictatorships that spend a lot more money on their military than countries with similar GDP per capita.

One way to make this (or any other) chart more informative is to let the trend line vary by category, which we can do by no longer overriding the color aesthetic inside stat_smooth():

# GDP Per Capita and MEGDP, by regime type, with per-group trend lines
q1plot5 <- ggplot(data,aes(x=log(wdi_gdpc), 
                           y=wdi_megdp,
                           color=factor(chga_demo),
                           size=oipc))
q1plot5 + 
  stat_smooth(method="loess") + # The only difference is in this line
  geom_point() +
  ggtitle("Military Expenditure vs. GDP") + 
  ylab("Military Expenditure Share of GDP") + 
  xlab("GDP Per Capita (Log)") +
  scale_color_discrete(name="Regime Type") +
  scale_size_continuous(name="Oil Income Per Capita (USD)")

Perhaps we should begin examining whether oil-rich dictatorships are unusual because they’re oil-rich or because they are dictatorships. Maybe oil-rich dictatorships are unusual because countries that might otherwise democratize have instead been turned into run-of-the-mill dictatorships; maybe dictatorships that have a lot of oil money super-size the normal aspirations of dictatorships (moar money for the military!). Maybe (and that’s a genuine “maybe”!!) oil makes a country dictatorial, and having dictatorial neighbors makes you spend money on the military—and then regions with lots of oil would also have lots of dictatorships and big militaries. Or maybe oil just makes you rich, and as a rich country you do more of whatever your regime type suggests you would otherwise—there’s nothing special about oil money except the zeroes in the checks.

This chart doesn’t tell us. Only further research can. But it does suggest that we should be conscious of regime types.

Regressions

q4.lm1 <- lm(wdi_megdp~wdi_gdp, data=data)
q4.lm2 <- lm(wdi_megdp~wdi_gdpc,data=data)
q4.lm3 <- lm(wdi_megdp~oipc,data=data)
q4.lm4 <- lm(wdi_megdp~chga_demo,data=data)
Dependent variable:
wdi_megdp
(1) (2) (3) (4)
wdi_gdp 0.000
(0.000)
wdi_gdpc 0.00001
(0.00001)
oipc 0.0001***
(0.00003)
chga_demo1. Democracy -1.048***
(0.271)
Constant 2.148*** 2.083*** 2.044*** 2.815***
(0.148) (0.192) (0.138) (0.209)
Observations 141 141 144 144
R2 0.008 0.005 0.092 0.095
Adjusted R2 0.001 -0.002 0.086 0.089
Residual Std. Error 1.680 (df = 139) 1.682 (df = 139) 1.598 (df = 142) 1.595 (df = 142)
F Statistic 1.153 (df = 1; 139) 0.750 (df = 1; 139) 14.397*** (df = 1; 142) 14.962*** (df = 1; 142)
Note: *p<0.1; **p<0.05; ***p<0.01

As we’ve discussed in class, a coefficient that displays as a highly significant 0 (as with the estimate for oipc in Model 3) suggests rescaling the variable; here I rescale oipc to thousands of dollars (by creating a new variable that divides oipc by 1,000):

data$oipc_1k <- data$oipc/1000
q4.lm3.1k <- lm(wdi_megdp~oipc_1k,data=data)
Dependent variable:
wdi_megdp
(1) (2)
oipc 0.0001***
(0.00003)
oipc_1k 0.104***
(0.027)
Constant 2.044*** 2.044***
(0.138) (0.138)
Observations 144 144
R2 0.092 0.092
Adjusted R2 0.086 0.086
Residual Std. Error 1.598 (df = 142) 1.598 (df = 142)
F Statistic 14.397*** (df = 1; 142) 14.397*** (df = 1; 142)
Note: *p<0.1; **p<0.05; ***p<0.01

That’s a little easier to deal with! It suggests that for every thousand additional dollars in oil income per capita, a state’s military expenditure increases by about one-tenth of a percentage point of GDP. That might not seem like much, but for truly oil-rich states (with $20,000 or more in annual per-capita oil income), it implies an additional couple of percentage points of GDP devoted to the military—more than the entire share of GDP that many countries devote to defense! That would be a huge, and notable, effect if true.
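
A one-liner makes that back-of-the-envelope claim concrete (using the rescaled model estimated above):

```r
# Implied extra military share of GDP for a state with $20,000 in per-capita oil income
coef(q4.lm3.1k)["oipc_1k"] * 20   # about 0.104 * 20 ≈ 2.1 percentage points of GDP
```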

Multivariate Regressions

The question of how societies allocate resources between “guns” and “butter” (civilian goods) fascinates me, and I’m really curious if there is an effect in which oil income turns into rifles more quickly than taxation income. So let’s try some plausible hypotheses:

  • Dictatorships spend more on militaries than democracies.
  • Some neighborhoods are worse than others, and some regions are more threatening than others; region will matter as a control variable.
  • Wealthier countries spend less on their militaries than do others.
  • Societies with greater degrees of patriarchy will be more warlike.
  • Oil accrues to dictators, and the scale, secrecy, and source of oil rents make it easier to convert to military income.

These aren’t necessarily the best hypotheses in the world – this isn’t my particular area of expertise – but they do seem fairly plausible. So, let’s test them:

q4.lm5 <- lm(wdi_megdp~p_polity2+wdi_gdpc+ht_region,data=data)
q4.lm6 <- lm(wdi_megdp~p_polity2+wdi_gdpc+oipc+ht_region,data=data)
q4.lm7 <-  lm(wdi_megdp~p_polity2+wdi_gdpc+oipc+wdi_wip+ht_region,data=data)
Dependent variable:
wdi_megdp
q4.lm5 q4.lm6 q4.lm7
(1) (2) (3)
p_polity2 -0.056** -0.053* -0.053*
(0.027) (0.029) (0.029)
wdi_gdpc 0.00001 0.00001 0.00001
(0.00002) (0.00002) (0.00002)
oipc 0.00001 0.00001
(0.00004) (0.00004)
wdi_wip -0.016
(0.014)
ht_regionThe Caribbean -0.584 -0.613 -0.423
(1.531) (1.540) (1.555)
ht_regionLatin America -0.355 -0.374 -0.350
(0.520) (0.527) (0.529)
ht_regionN. Africa and Mideast 1.606*** 1.588*** 1.455***
(0.528) (0.534) (0.551)
ht_regionSub-Saharan Africa -0.298 -0.322 -0.310
(0.440) (0.450) (0.452)
ht_regionW. Europe and N. America -0.364 -0.307 -0.144
(0.570) (0.610) (0.629)
ht_regionEast Asia -0.435 -0.413 -0.499
(0.816) (0.823) (0.829)
ht_regionSouth-East Asia -0.113 -0.113 -0.069
(0.605) (0.607) (0.610)
ht_regionSouth Asia 0.556 0.524 0.530
(0.755) (0.767) (0.769)
ht_regionThe Pacific -1.168 -1.180 -1.837
(1.123) (1.128) (1.565)
Constant 2.287*** 2.310*** 2.583***
(0.408) (0.418) (0.489)
Observations 134 134 133
R2 0.293 0.293 0.301
Adjusted R2 0.229 0.223 0.224
Residual Std. Error 1.490 (df = 122) 1.496 (df = 121) 1.500 (df = 119)
F Statistic 4.594*** (df = 11; 122) 4.185*** (df = 12; 121) 3.939*** (df = 13; 119)
Note: *p<0.1; **p<0.05; ***p<0.01

In this (very simple) model, we do, in fact, estimate that the more democratic a country is, the less it spends on its military; we also find that a country’s neighborhood matters. There also appears to be a small negative association with women’s representation in parliament, but it is not statistically significant and is probably too small to matter much. (We could explore this further.)

But the strongest and most consistent statistically significant finding relates to region: specifically, to North Africa and the Middle East.

This raises some really important issues. The Middle East is, obviously, the site of a great deal of violent conflict; it’s also telling that the Middle East is, equally obviously, home to a huge amount of oil income. Did the world get unlucky? Did all the oil end up under territories governed by unstable or warlike countries? Or did the oil make those countries unstable or warlike?

We cannot tell the difference based on this evidence.

All of the injunctions about the problems of observational data compared to experimental or quasi-experimental data that we discussed earlier in the class still apply. However, we can conclude that the evidence in this analysis, given the assumptions built into the models I’ve chosen, does not fit neatly with the story of more oil, more soldiers.

The point of the data-analysis tools we’re using now is not that they settle debates for us. Data can’t answer any questions without further assumptions and explanations. But these tools do make plain the assumptions and the evidence we’re employing. It should be equally plain, then, that the most important part of research comes in the design of our analysis—how we choose the concepts we believe matter, how we structure our tests (experimental or observational), and how we communicate those results.

As some of you note, the question text pretends that you’re working for someone who really wants you to find a conclusion, no matter what the balance of the evidence really shows. That happens all the time. Sometimes it happens because your boss wants you to find evidence that fits her worldview; sometimes it happens because you were unlucky or not clever enough to think of more correct model specifications. But perhaps the most insidious threat comes from whether you are being honest with yourself in your analysis. You are always your own most important supervisor: if you’re fooling yourself, nobody else can help you. Don’t let your research goal dictate your conclusions.

Question 5

Unlike you, I won’t redo everything :) I do, however, hope that this was a good lesson in two parts:

Rerunning plots

# GDP (Overall) and MEGE
q5plot1 <- ggplot(data,aes(x=log(wdi_gdp), 
                           y=wdi_mege,
                           color=ht_region,
                           shape=ht_region,
                           size=oipc))
q5plot1 + 
  stat_smooth(method="loess",aes(color=NA,shape=NA)) + 
  geom_point() +
  ggtitle("Military Share of \n Government Expenditure vs. GDP") + # \n is a line break 
  scale_shape_manual(values=1:nlevels(data$ht_region), 
                     name="Region") +
  ylab("Military Expenditure Share of Government Spending") + 
  xlab("Overall GDP (Log)") +
  scale_color_discrete(name="Region") 

# GDP Per Capita and MEGE
q5plot2 <- ggplot(data,aes(x=log(wdi_gdpc), 
                           y=wdi_mege,
                           color=ht_region,
                           shape=ht_region,
                           size=oipc))
q5plot2 + 
  stat_smooth(method="loess",aes(color=NA,shape=NA)) + 
  geom_point() +
  ggtitle("Military Share of Government  \n Expenditure vs. GDP Per Capita") + # \n is a line break 
  scale_shape_manual(values=1:nlevels(data$ht_region), 
                     name="Region") +
  ylab("Military Expenditure Share of Government Spending") + 
  xlab("GDP Per Capita (Log)") +
  scale_color_discrete(name="Region")  +
  scale_size_continuous("Oil Income Per Capita (USD)")

# GDP Per Capita and MEGE, by regime type
q5plot3 <- ggplot(data,aes(x=log(wdi_gdpc), 
                           y=wdi_mege,
                           color=factor(chga_demo),
                           size=oipc))
q5plot3 + 
  stat_smooth(method="loess") + 
  geom_point() +
  ggtitle("Military Expenditure vs. GDP") + 
  ylab("Military  Share of Govt Expenditure") + 
  xlab("GDP Per Capita (Log)") +
  scale_color_discrete(name="Regime Type") +
  scale_size_continuous(name="Oil Income Per Capita (USD)")

q5.lm5 <- lm(wdi_mege~wdi_gdp, data=data)
q5.lm6 <- lm(wdi_mege~wdi_gdpc,data=data)
q5.lm7 <- lm(wdi_mege~oipc,data=data)
q5.lm8 <- lm(wdi_mege~chga_demo,data=data)
Dependent variable:
wdi_mege
q5.lm5 q5.lm6 q5.lm7 q5.lm8
(1) (2) (3) (4)
wdi_gdp 0.000**
(0.000)
wdi_gdpc -0.00005
(0.00004)
oipc 0.0002
(0.0001)
chga_demo1. Democracy -3.941***
(1.147)
Constant 7.606*** 8.731*** 7.773*** 10.727***
(0.575) (0.787) (0.560) (0.963)
Observations 104 104 105 105
R2 0.042 0.016 0.020 0.103
Adjusted R2 0.033 0.006 0.011 0.094
Residual Std. Error 5.540 (df = 102) 5.614 (df = 102) 5.601 (df = 103) 5.361 (df = 103)
F Statistic 4.462** (df = 1; 102) 1.665 (df = 1; 102) 2.154 (df = 1; 103) 11.807*** (df = 1; 103)
Note: *p<0.1; **p<0.05; ***p<0.01

Multivariate regressions

Just for fun, let’s see how this new measurement changes the story from the earlier set of regressions.

q5.lm1 <- lm(wdi_mege~p_polity2+wdi_gdpc+ht_region,data=data)
q5.lm2 <- lm(wdi_mege~p_polity2+wdi_gdpc+oipc+ht_region,data=data)
q5.lm3 <-  lm(wdi_mege~p_polity2+wdi_gdpc+oipc+wdi_wip+ht_region,data=data)
Dependent variable:
wdi_mege
q5.lm1 q5.lm2 q5.lm3
(1) (2) (3)
p_polity2 -0.360*** -0.438*** -0.458***
(0.109) (0.118) (0.115)
wdi_gdpc 0.0001 0.0002** 0.0002**
(0.0001) (0.0001) (0.0001)
oipc -0.0002 -0.0003*
(0.0001) (0.0001)
wdi_wip -0.144**
(0.059)
ht_regionLatin America 1.357 1.866 1.645
(1.847) (1.858) (1.810)
ht_regionN. Africa and Mideast 1.887 2.070 0.651
(1.906) (1.893) (1.932)
ht_regionSub-Saharan Africa 1.065 1.681 1.605
(1.652) (1.681) (1.636)
ht_regionW. Europe and N. America -2.302 -3.443* -2.011
(1.930) (2.040) (2.071)
ht_regionEast Asia 1.728 1.260 0.400
(2.649) (2.642) (2.594)
ht_regionSouth-East Asia 2.617 2.398 2.431
(2.145) (2.130) (2.073)
ht_regionSouth Asia 9.014*** 9.726*** 9.845***
(2.467) (2.484) (2.418)
ht_regionThe Pacific -3.563 -3.792
(5.072) (5.029)
Constant 8.434*** 8.013*** 10.587***
(1.416) (1.428) (1.746)
Observations 99 99 98
R2 0.338 0.357 0.398
Adjusted R2 0.263 0.276 0.321
Residual Std. Error 4.818 (df = 88) 4.775 (df = 87) 4.646 (df = 86)
F Statistic 4.494*** (df = 10; 88) 4.394*** (df = 11; 87) 5.168*** (df = 11; 86)
Note: *p<0.1; **p<0.05; ***p<0.01

Wow! So many more stars!! Truly, this must be the model we are looking for—the one in which democracies spend less on weapons, gender-equal societies prove more peaceful, and… Huh. Oil has a negative (though not substantively significant) relationship with military expenditure? And what happened to those stars next to N. Africa and Mideast? It’s like we are looking at an entirely different world.

And we are, because we changed our measurement.

Omitted variable bias is important. But being careful about what our theory really suggests and being as precise as possible about translating that concept into measurement is even more important.

There is no statistical test that can tell us if we’ve measured the right thing, but we can—and we must—think really, really hard about whether we expect oil rent (or regime type, or female parliamentarians, or overall wealth) to lead to greater bellicosity, and how. When our results change, we need to be able to explain why, and to have good reasons for that explanation.

By the way, these analyses, as extensive as they are, are still what we call toy examples. They don’t have all of the information we’d usually bring to bear on the question (for instance, wouldn’t you want to have information on the military threats a country faces? that seems like a major source of omitted variable bias). On the other hand, this sort of analysis (albeit more polished!) would have passed for cutting-edge work in about 1975. Right now, you’re recapitulating decades of methodological and theoretical debates; you’ll get to see which answers get more support as we become more sophisticated, which become more contested, and how the debates get sharper.

Question 6

I won’t provide a solution for this question, because in a deep sense there is no individual solution for me to provide. I’ll still provide some feedback on your individual cases, of course. What I’m looking for here, though, is not just your attention to the details of running code and constructing models. That counts: I do look to make sure that you’ve properly labeled everything, that you’ve presented your materials clearly and effectively, that you’ve run regressions correctly, and so on. Failure at these levels leads, to a greater or lesser degree, to failure overall.

What I’m most interested in is your ability to think through the relationships in light of what we’ve covered in this course. The readings from September about causal inference and the discussions in Bailey about variable selection and estimating beta coefficients consistently are not just throwaway lines. They are essential to this enterprise. A good analyst should be constantly thinking about the assumptions that go into her models: is this variable pre- or post-treatment? What selection effects might be confounding inference about the effects of treatment? Why include this measurement and not another one?

These and many other questions need answers, and you should not hesitate to refer back to the readings to bolster your case in favor of the choices you’ve made and against other possibilities. (Yes, even with quotations and page numbers.) You should also not hesitate to be intellectually honest and open: if you think that all the measures are bad, but the one you’ve chosen is the least-bad alternative, then report that, and discuss what the consequences of bad measurement would be. That kind of careful analysis lends credibility to your work. I know that in the clickbait age the temptation is to present every project as if it were The Best. Thing. Ever. and to scorn any weakness as evidence of complete incompetence, but that doesn’t reflect reality: it reflects the adolescent faddishness of the Internet. Our job is to be careful, to recognize the limits to knowledge, and to make as clear an inference as possible.