Instructions

Please prepare your answers as clearly as possible. Label every chart completely and clearly indicate which answer corresponds to which question. Submit your answers as a PDF file.

Question 1

Download ps2q1.csv and read it into memory to answer the following questions.

Question 1 (A)

Summarize all of the variables (give the mean, median, and range). Report each value.

## Solution Code

Okay, we can do this the hard way, and it is hard, or at least tedious!

library(foreign) ## Not needed for read.csv() itself; foreign supplies read.dta(), used in Question 6
data <- read.csv("ps2q1.csv",stringsAsFactors=FALSE) # stringsAsFactors isn't necessary but it will save some headaches until you know what "strings" and "factors" are

## Summarize: The Hard Way
mean(data$x1)
## [1] 9
median(data$x1)
## [1] 9
range(data$x1)
## [1]  4 14

One of the occasional points of having you do things the hard way is to make sure you have the “muscle memory” to do things quickly. The only way to develop that muscle memory is through repetition.

With that said, let’s learn a much easier way to handle this.

summary(data$x1)

The summary() command does everything we need it to do here. But there’s a more powerful alternative available through the pastecs library:

install.packages("pastecs")

Run the above code. This will prompt R to find the user-written package “pastecs”, which has a useful function we want to use. There are many packages like this written for R, all of which you can think of as containing bundles of functions that you will find useful. Note: Once you have downloaded/installed a package for the first time, you do not need to install it again; you only need to call library() to load it at the start of each R session.

library(pastecs)
## Loading required package: boot
options(scipen=100) ## Don't worry about what these options() commands are doing for now
options(digits=2) 
stat.desc(data,norm=FALSE) ## Again, don't worry about the options _right now_
##               case    x1    x2    x3    x4    y1    y2    y3    y4
## nbr.val      11.00 11.00 11.00 11.00 11.00 11.00 11.00 11.00 11.00
## nbr.null      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
## nbr.na        0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
## min           1.00  4.00  4.00  4.00  8.00  4.26  3.10  5.39  5.25
## max          11.00 14.00 14.00 14.00 19.00 10.84  9.26 12.74 12.50
## range        10.00 10.00 10.00 10.00 11.00  6.58  6.16  7.35  7.25
## sum          66.00 99.00 99.00 99.00 99.00 82.51 82.51 82.50 82.51
## median        6.00  9.00  9.00  9.00  8.00  7.58  8.14  7.11  7.04
## mean          6.00  9.00  9.00  9.00  9.00  7.50  7.50  7.50  7.50
## SE.mean       1.00  1.00  1.00  1.00  1.00  0.61  0.61  0.61  0.61
## CI.mean.0.95  2.23  2.23  2.23  2.23  2.23  1.36  1.36  1.36  1.36
## var          11.00 11.00 11.00 11.00 11.00  4.13  4.13  4.12  4.12
## std.dev       3.32  3.32  3.32  3.32  3.32  2.03  2.03  2.03  2.03
## coef.var      0.55  0.37  0.37  0.37  0.37  0.27  0.27  0.27  0.27

This is much easier to work with! We have one table showing us a bunch of statistics, including median, mean, and the min and max for the variables.
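If you would rather not install a package, a similar table can be sketched in base R with sapply(). The example below assumes R's built-in anscombe data frame as a stand-in for ps2q1.csv (the statistics above appear to match it, but that is an assumption), and summarize_var is a name introduced here purely for illustration.

```r
# Stand-in data: R's built-in anscombe data frame (assumed to resemble ps2q1.csv)
dat <- datasets::anscombe

# Hypothetical helper: the statistics Question 1(A) asks for, for one variable
summarize_var <- function(v) {
  c(mean   = mean(v),
    median = median(v),
    min    = min(v),
    max    = max(v))
}

# Apply the helper to every column; the result has one column per variable
summary_table <- sapply(dat, summarize_var)
print(round(summary_table, 2))
```

Because sapply() walks over every column, the same few lines work no matter how many variables the data frame has.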

Take a look at the UCLA help page on pastecs to learn more about this function—it will save you loads of time.

Now let’s get to what really matters: interpretation. Note that all of the x’s have a similar range, median, and mean (not exactly the same, but close). Similarly, the means for the y’s are identical, and the medians are close, although not identical. Still, all of these look pretty similar.

Question 1 (B)

Use the cor() command to correlate each variable pair (x1,y1). Report your correlations and interpret them as far as you can. See Bailey pp. 9-11 for more on how to interpret correlations.

cor(data$x1,data$y1)
## [1] 0.82
cor(data$x2,data$y2)
## [1] 0.82
cor(data$x3,data$y3)
## [1] 0.82
cor(data$x4,data$y4)
## [1] 0.82

Not only are these highly correlated, they are also correlated to the same degree (identical to two decimal places). We should expect to see a strong, positively-sloped relationship between our x’s and our y’s.
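Typing four nearly identical cor() calls invites copy-and-paste mistakes, so here is a hedged sketch of the same calculation in a loop. It again assumes the built-in anscombe data frame as a stand-in for ps2q1.csv, and pair_cors is an illustrative name.

```r
dat <- datasets::anscombe  # stand-in for ps2q1.csv (an assumption)

# Correlate x1 with y1, x2 with y2, and so on
pair_cors <- sapply(1:4, function(i) {
  cor(dat[[paste0("x", i)]], dat[[paste0("y", i)]])
})
names(pair_cors) <- paste0("x", 1:4, "_y", 1:4)

# Print extra digits to see whether the correlations agree beyond two decimals
print(round(pair_cors, 4))
```

Printing more digits is a quick way to check whether values that look identical in rounded output really are.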

Question 1 (C)

Plot each variable pair (x1,y1). What stands out to you? Tell me why this might be important or interesting.

par(mfrow=c(2,2))
plot(data$x1,data$y1, main="Y1 vs X1", xlab="X1",ylab="Y1") ## Yes, you need to title and label even these simple charts!
plot(data$x2,data$y2, main="Y2 vs X2", xlab="X2",ylab="Y2")
plot(data$x3,data$y3, main="Y3 vs X3", xlab="X3",ylab="Y3")
plot(data$x4,data$y4, main="Y4 vs X4", xlab="X4",ylab="Y4")

Hmm. That’s …. not quite what the degree of similarity in the descriptive statistics led us to suspect! These datasets are each wildly different from the others, even though they “seemed to be” the same with the descriptive tools. The moral is simple: You have to get to know your data. Visualization–even basic viz, like this–is essential in doing so. If you’re reliant on descriptive statistics, you can and will be misled.

Question 2

Consider the plots above. Based on Bailey Chapter 3, which of the two samples would allow us to more reliably produce unbiased estimates of \(\beta_1\)? Why?

## Notes

The answer is that, as Bailey writes, having a larger range of X’s gives us a more precise estimate of \(\beta_1\). Strictly speaking, a narrower range of X does not bias OLS; it increases the variance of \(\hat{\beta_1}\), so any single estimate is more likely to land far from the truth. In this case, “B” is based on a dataset that is a subset of A, just with the first and last 15 observations removed. Interestingly, when I compute the OLS regression line, I do in fact end up with different \(\beta_1\) estimates, even though I actually wrote the equation (and specified the error term) to be the same for both! These differences aren’t large (for \(\beta_{1}^A\) I get 1.0352 and for \(\beta_{1}^B\) the result is 1.032), but, again, these were generated from an identical process. The “B” estimate is the less reliable of the two because it rests on less variation in X. (The difference in \(se(\beta_{1}^B)\) is actually much greater.)
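One way to see what trimming the range of X does is to simulate it. The sketch below is not the code used to generate the plots; the true slope, the X values, the error standard deviation, and the name one_draw are all assumptions chosen for illustration. It refits the same model many times on a full sample and on a trimmed one, then compares the average and the spread of the slope estimates.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

true_beta1 <- 1                          # assumed true slope
x_full <- seq(0, 10, length.out = 50)    # assumed X values for sample "A"
keep   <- 16:35                          # drop the first and last 15 observations ("B")

# Draw new errors, fit OLS on both samples, and return the two slope estimates
one_draw <- function() {
  y <- 2 + true_beta1 * x_full + rnorm(length(x_full), sd = 2)
  c(A = unname(coef(lm(y ~ x_full))[2]),
    B = unname(coef(lm(y[keep] ~ x_full[keep]))[2]))
}

slopes <- t(replicate(1000, one_draw()))

print(colMeans(slopes))        # average estimate in each sample
print(apply(slopes, 2, sd))    # spread of the estimates in each sample
```

In a setup like this, the useful comparison is between the two lines of output: whether the averages sit near the assumed slope, and how much wider the spread is for the trimmed sample.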

Question 3

Consider the plots above. Based on Bailey Chapter 3, which of the two samples would allow us to more reliably produce unbiased estimates of \(\beta_1\)? Why?

We refer to the number of observations in a sample as our “N”. In general, we prefer larger N’s to small N’s. This problem illustrates why. As Bailey writes, having a larger N will (assuming those additional N’s are sampled at random) help us generate an estimate of the slope (\(\hat{\beta_1}\)) that will converge toward the true \(\beta_1\). (This property is called consistency. The Central Limit Theorem, which Bailey also mentions, makes a related point: in large samples the distribution of \(\hat{\beta_1}\) is approximately normal.)

Consequently, “A” should give us a more reliable (more precise) estimate of \(\beta_1\) than “B”. (It does, although I should mention that neither is particularly good: with only 50 data points in the “large-N” dataset of A, there’s tons of noise.)
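The effect of N can be sketched the same way. The simulation below is an illustration under assumed values (intercept 1, slope 1, standard-normal X and errors); slope_sd is a hypothetical helper, not something from Bailey or the problem set. It records how much the estimated slope varies across repeated samples of each size.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical helper: spread of the OLS slope across many samples of size n
slope_sd <- function(n, reps = 500) {
  estimates <- replicate(reps, {
    x <- rnorm(n)
    y <- 1 + 1 * x + rnorm(n)   # assumed model: intercept 1, slope 1, sd-1 errors
    unname(coef(lm(y ~ x))[2])
  })
  sd(estimates)
}

sample_sizes <- c(20, 50, 200, 1000)
spread <- sapply(sample_sizes, slope_sd)
names(spread) <- sample_sizes
print(round(spread, 3))
```

Reading the output from left to right shows how quickly the noise in \(\hat{\beta_1}\) falls as N grows.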

Question 4

For this question, use the state.x77 dataset preloaded in R. We first have to convert the matrix state.x77 to a dataframe:

state.df <- data.frame(state.x77)

Use the tools we have discussed and used so far to see what the names of the state.df dataframe are, get to know the mean, median, and so forth of each variable. Use ?state.x77 to see what each variable refers to.

Question 4(A)

Calculate the equation \(Income = \beta_0 + \beta_1 Illiteracy + \epsilon\). Report your estimates for \(\beta_0\) and \(\beta_1\).

Question 4(B)

Plot the variables and label them appropriately. Include the trend line you calculated in 4(A).

q4.lm <- lm(Income~Illiteracy,data=state.df)
summary(q4.lm)
## 
## Call:
## lm(formula = Income ~ Illiteracy, data = state.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -948.9 -376.2  -49.8  347.0 2024.6 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)     4951        172   28.74 <0.0000000000000002 ***
## Illiteracy      -441        131   -3.37              0.0015 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 558 on 48 degrees of freedom
## Multiple R-squared:  0.191,  Adjusted R-squared:  0.174 
## F-statistic: 11.3 on 1 and 48 DF,  p-value: 0.00151
plot(state.df$Income~state.df$Illiteracy,main="State-Level Income vs Illiteracy Rates",
     xlab="Illiteracy Rate (Percent)",ylab="Income (in Dollars) Per Capita",pch=16)
abline(q4.lm,col="red")
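For 4(A), you can also pull the estimates out of the fitted model instead of reading them off the printed summary. This is a small sketch using coef(); beta0 and beta1 are just illustrative variable names.

```r
state.df <- data.frame(state.x77)
q4.lm <- lm(Income ~ Illiteracy, data = state.df)

# coef() returns a named vector: "(Intercept)" is beta_0, "Illiteracy" is beta_1
estimates <- coef(q4.lm)
beta0 <- unname(estimates["(Intercept)"])
beta1 <- unname(estimates["Illiteracy"])

cat("Income =", round(beta0), "+ (", round(beta1), ") * Illiteracy\n")
```

Working with the stored values avoids retyping rounded numbers when you use the estimates later.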

Question 4(C)

Interpret your results substantively. What does the association imply about the relationship between Illiteracy and Income? Be as precise as you can.

A good answer here should specify that there is a negative relationship between Illiteracy and Income; a really good answer would go further by contextualizing that:

summary(state.df$Illiteracy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50    0.62    0.95    1.17    1.58    2.80
summary(state.df$Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3100    3990    4520    4440    4810    6320

(Note that the summary() command does a lot of work, fast.) So for every percentage-point increase in Illiteracy, predicted Income goes down by $441—about 10 percent of mean state income (that’s a lot).
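The contextualizing arithmetic can be done in R rather than by eye. The sketch below computes the slope as a share of mean income and, as one possible extra comparison, the predicted income at the first and third quartiles of Illiteracy; the choice of quartiles is mine, not part of the question.

```r
state.df <- data.frame(state.x77)
q4.lm <- lm(Income ~ Illiteracy, data = state.df)
beta1 <- unname(coef(q4.lm)["Illiteracy"])

# Slope as a share of mean state income
share_of_mean <- beta1 / mean(state.df$Income)

# Predicted income at the first and third quartiles of Illiteracy
quartiles <- quantile(state.df$Illiteracy, c(0.25, 0.75))
predicted <- predict(q4.lm, newdata = data.frame(Illiteracy = quartiles))

print(round(share_of_mean, 3))
print(round(predicted))
```

Comparing predictions at realistic values of X is often easier to explain than the raw coefficient.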

Question 4(D)

Interpret the results statistically. Is your estimate of \(\beta_1\) statistically significant? Provide an explanation of why or why not, based on Bailey.

As Bailey explains in much more detail, the key quantity here is \(\frac{\widehat{\beta_{Illiteracy}}}{\widehat{se(\beta_{Illiteracy})}}\). Oversimplifying greatly, we normally care whether the absolute value of this quantity is greater than two or less than two. If it is greater than two, that means that our estimate of \(\beta_{Illiteracy}\) is significantly different from zero, given that the errors in the estimate of \(\beta_{Illiteracy}\) will themselves follow a t-distribution. (If this doesn’t make sense to you, go back and re-read Bailey and then come see me in office hours.) That, in turn, lets us generate a p-value, where (by convention) a p-value of less than 0.05 signifies “significance”. (Same thing: if this doesn’t make sense, re-read Bailey and come see me–this is important stuff.)

The p-value for Illiteracy (0.0015) is well below the conventional 0.05 threshold, meaning we can be reasonably confident that \(\beta_{Illiteracy}\) is different from zero.
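To connect the ratio to the numbers R prints, here is a sketch that rebuilds the t value and the two-sided p-value by hand from the coefficient table. The variable names are illustrative.

```r
state.df <- data.frame(state.x77)
q4.lm <- lm(Income ~ Illiteracy, data = state.df)
coef_table <- summary(q4.lm)$coefficients

b  <- coef_table["Illiteracy", "Estimate"]
se <- coef_table["Illiteracy", "Std. Error"]

# t value: estimate divided by its standard error
t_stat <- b / se

# Two-sided p-value from the t distribution with the model's residual df
p_value <- 2 * pt(-abs(t_stat), df = df.residual(q4.lm))

print(c(t = t_stat, p = p_value))
```

The results should agree with the "t value" and "Pr(>|t|)" columns of summary(q4.lm), which is a useful check that you understand where those columns come from.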

Question 4(E)

How reliable do you think the estimate of the relationship between Illiteracy and Income is in a causal sense? That is, do you think this is a reliable guide to the causal effect of Illiteracy on Income? Justify your answer. Explain how we might produce a more reliable estimate.

Question 5

Use the state.df dataframe again. This time, repeat 4(A)-4(E) but for estimating the equation \(Murder = \beta_0 + \beta_1 Frost + \epsilon\).

q5.lm <- lm(Murder~Frost,data=state.df)
summary(q5.lm)
## 
## Call:
## lm(formula = Murder ~ Frost, data = state.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.851 -2.634 -0.283  2.298  7.319 
## 
## Coefficients:
##             Estimate Std. Error t value           Pr(>|t|)    
## (Intercept) 11.37569    1.00549   11.31 0.0000000000000038 ***
## Frost       -0.03827    0.00863   -4.43 0.0000540483988541 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.1 on 48 degrees of freedom
## Multiple R-squared:  0.29,   Adjusted R-squared:  0.276 
## F-statistic: 19.6 on 1 and 48 DF,  p-value: 0.000054
plot(state.df$Murder~state.df$Frost,main="State-Level Murder Rates vs Climate",
     ylab="Murder Rates per 100,000",xlab="Days of Frost in State Capital",pch=16)
abline(q5.lm,col="red")

The interpretation here is the same, except for the specific interpretation of the coefficients. However, neither should be interpreted causally, and this example shows why: cold climates don’t prevent (most) murders; something associated with climate is driving this relationship (or, less likely, it’s a pure coincidence).
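One way to probe the "something associated with climate" idea is to add a candidate confounder and watch what happens to the Frost coefficient. The sketch below uses Illiteracy purely as an illustrative control; that choice is an assumption for demonstration, and adding one control does not by itself make the estimate causal.

```r
state.df <- data.frame(state.x77)

bivariate  <- lm(Murder ~ Frost, data = state.df)
# Illiteracy is an illustrative control only, not a claim about the right model
controlled <- lm(Murder ~ Frost + Illiteracy, data = state.df)

frost_coefs <- c(bivariate  = unname(coef(bivariate)["Frost"]),
                 controlled = unname(coef(controlled)["Frost"]))
print(round(frost_coefs, 4))
```

If the coefficient on Frost moves substantially once another variable is included, that is a sign the bivariate slope was picking up more than climate.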

Question 6

For this question, use QOGSelectPS2.dta and the read.dta() command in the foreign package. This is a selection of data drawn from the Quality of Government project.

Read the data into R and create a plot of maternal mortality (ihme_mmr), which measures the rate of deaths of mothers per 100,000 live births, versus oil income per capita (oil_income_pc_1k), which gives the average oil and natural gas-derived income per person per country. Fully label the graph and interpret it.

library(foreign)
qogdata <- read.dta("QOGSelectPS2.dta")
summary(qogdata$ihme_mmr) ## Not mentioned in the assignment, but it's a good idea to take a look
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       4      17      46     197     281    1580
summary(qogdata$oil_income_pc_1k)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       5     831     132   25000
plot(qogdata$ihme_mmr~qogdata$oil_income_pc_1k, 
     pch=16,
     main="Maternal Mortality Rate vs. Oil Income Per Capita",
     xlab="Oil Income Per Capita",
     ylab="Maternal Mortality Rate")

The code, again, is relatively straightforward–one of the major goals I have for you at this stage of the course is to become so used to plotting data series that it becomes second nature in order for you to spend more time on the important work of theorizing and analyzing.

So let’s take a moment to think about why these two variables might be interesting. There are a number of reasons to suspect that, if there’s a full-bore, Collier-style resource curse, there might be a positive relationship between oil income per capita (OIPC) and maternal mortality (that is, more oil, more deaths of mothers). Maternal mortality is frequently used among other health indicators because maternal mortality results (in this century, at least) from extreme privation and lack of access to quality health-care institutions, sanitation, and nutrition. Given the difficulties of measuring “quality institutions” (although political scientists and economists have made some heroic efforts at doing so – “heroic”, by the way, means “untrustworthy” or literally “in-credible”), these are proxies for one kind of extreme, undesirable form of institutional breakdown or absence.

More to the point, social scientists are increasingly dissatisfied with spending all of our time thinking about political leaders and bureaucrats, and a major push in the past 25 years has been to spend more time thinking about the rights, lives, and experiences of ordinary people, including especially ordinary people whose rights, lives, and experiences were completely marginalized by traditional social-scientific and political attitudes. So we don’t just care about whether women giving birth live or die because it’s a useful proxy (although in this literature that is a major advantage … ), but because it matters to understanding, and maybe someday improving, the social processes that directly affect the lives of four billion people (or, including their children, 7.5 billion people).

So do we find a relationship here? Not really. In large part, I think, the plot suggests that there isn’t much of a relationship between maternal mortality and oil income per capita. Begin by looking only at countries at or below the median (50th percentile) of oil income per capita:

qogdata.50th <- subset(qogdata,oil_income_pc_1k <= quantile(oil_income_pc_1k,.50))

There are two functions here that I don’t expect you to know. One is the subset() command, which is useful but kind of a shortcut for the more “proper” R code, which would be qogdata.50th <- qogdata[qogdata$oil_income_pc_1k<=quantile(qogdata$oil_income_pc_1k,.50),] which is, I have to say, just a lot more typing and not immediately more readable. The other is quantile(), which returns the nth quantile in the distribution of the data, so quantile(oipc,.50) returns the median. We know what the median is, thanks to the summary(qogdata$oil_income_pc_1k) data above, so why not just use $4.877 directly in the subset() command instead? It’s a good idea to use functions to calculate quantities whenever possible to avoid errors caused by human inattention or clumsiness.
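To make the subset()/bracket comparison concrete, here is a sketch on a tiny made-up data frame. The values and the names toy, cutoff, with_subset, and with_brackets are all hypothetical; the real qogdata is not used.

```r
# Hypothetical toy data standing in for qogdata (values invented for illustration)
toy <- data.frame(country = c("A", "B", "C", "D", "E", "F"),
                  oil     = c(0, 0, 3, 8, 150, 2500))

# The median, computed rather than typed in by hand
cutoff <- quantile(toy$oil, 0.50)

# Two ways of keeping the rows at or below the cutoff
with_subset   <- subset(toy, oil <= cutoff)
with_brackets <- toy[toy$oil <= cutoff, ]

print(with_subset)
print(isTRUE(all.equal(with_subset, with_brackets)))
```

One difference worth knowing: when the condition is NA for some rows, subset() drops those rows, whereas bracket indexing returns rows filled with NA for them.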

plot(qogdata.50th$ihme_mmr~qogdata.50th$oil_income_pc_1k, 
     pch=16,
     main="Maternal Mortality Rate vs. Oil Income Per Capita (Below Median)",
     xlab="Oil Income Per Capita",
     ylab="Maternal Mortality Rate")

This chart, focusing on only those oil producers below the median of oil-income per capita states, suggests that there is essentially no variation with regard to oil income–indeed, the points for the states with 0 oil income per capita form a vertical line, perpendicular to the x-axis! To the extent that there appears to be a relationship, it is that almost every country with some oil income is on the very extreme near-zero end of the maternal mortality scale.

Let’s repeat the exercise for the top half:


qogdata.50plus <- subset(qogdata,oil_income_pc_1k > quantile(oil_income_pc_1k,.50))


plot(qogdata.50plus$ihme_mmr~qogdata.50plus$oil_income_pc_1k, 
     pch=16,
     main="Maternal Mortality Rate vs. Oil Income Per Capita (Above Median)",
     xlab="Oil Income Per Capita",
     ylab="Maternal Mortality Rate")

We see a very similar pattern here. Except for the countries that stand out at a couple of thousand dollars in oil income per head and about 500 maternal deaths on the y-axis, for countries that have close to the median oil income there is really almost a perfect null relationship between the two (go ahead and use cor() to work out what the bivariate correlation is). Those two outliers might be something interesting, however–let’s find out what they are:

qogdata[qogdata$ihme_mmr>=400 & qogdata$oil_income_pc_1k>=50,c(2,3,4,5,6)]
##             cname       chga_demo             ht_region ihme_mmr pwt_pop
## 96  Cote d'Ivoire 0. Dictatorship 4. Sub-Saharan Africa      944   20617
## 103      Cameroon 0. Dictatorship 4. Sub-Saharan Africa      705   18879
## 105    Mauritania 0. Dictatorship 4. Sub-Saharan Africa      712    3129
## 109          Chad 0. Dictatorship 4. Sub-Saharan Africa     1065   10329
## 126         Congo 0. Dictatorship 4. Sub-Saharan Africa      617    4013
## 134         Gabon 0. Dictatorship 4. Sub-Saharan Africa      494    1515

We see that all of these are marginal oil producers in sub-Saharan Africa, and that the two outliers are Gabon and the Republic of the Congo (listed as “Congo”; its population figure of about 4 million points to Congo-Brazzaville rather than the much larger Democratic Republic of the Congo). Could there be a reason to think that dictatorships or sub-Saharan African countries are worse at providing public goods like health care? Maybe so–and if Collier is really wedded to his belief that the resource curse for the “bottom billion” flows through corruption of democratic channels, I think that he would have some explaining to do here. In general, however, the most oil-rich autocracies seem to come off quite well, since Qatar, Kuwait, the UAE, and so forth all have very high oil income and very low maternal mortality. (But our discussion of multivariate regression should now have you in the habit of wondering whether there are some confounding factors here!)

Question 7

Create a barplot of Oil Income Per Capita by region of the world (ht_region). Fully label the chart and provide two or three sentences describing what you see. Second, plot oil income per capita versus a binary measure of whether a state is democratic (1) or autocratic (0), chga_demo. Again, fully label the graph and provide two or three sentences describing what you see.

region.oipc <- aggregate(qogdata$oil_income_pc_1k,
                         by=list(qogdata$ht_region),
                         FUN=sum,
                         na.rm=TRUE)
colnames(region.oipc) <- c("Region","OIPC")

levels(region.oipc$Region) <- c("E. Europe + Fmr USSR",
                               "Latin America",
                               "N. Africa and Mideast", 
                               "Sub-Saharan Africa",
                               "W. Europe and N. America",
                               "East Asia", 
                               "South-East Asia",
                               "South Asia",
                               "The Pacific",
                               "The Caribbean")
par(las=2,mai=c(1.5,2.5,.25,.5)) # make label text perpendicular to axis, 
# change the margin size in inches to give us 1.5 inches on the bottom, 2.5 inches on the left, 
#.25 on the top, and .5 on the right
barplot(region.oipc$OIPC,names=region.oipc$Region,horiz=TRUE,cex.names=.75)

Two things should stand out here. First, this is a nice-looking chart, but it could look nicer if it were ordered (which we could do with region.oipc <- region.oipc[order(region.oipc$OIPC),]). As we discussed the other day, taking care to think about what relationships should be presented matters. The second is that this graph doesn’t actually mean all that much by itself. We’ve summed up the oil income per capita of each country by region, but what exactly does that mean?

Arithmetically, we could add up the oil incomes per capita of Egypt (very low) and Qatar (very, very high), but since there are 80 million (or more) Egyptians and about 250,000 Qataris, that number is apt to be completely meaningless. Here, it’s relevant that, in the long example in the handout, we were looking at country-level aggregate GDPs, which we can add, since Egypt and Qatar both have one national GDP (by the way, Qatar’s GDP is about half the size of Egypt’s!). So a more relevant number will probably be the arithmetic average, or mean, of these per-capita figures, just as the example for you to work through in handout 3 suggested. (Even here, by the way, we would probably want to weight different countries’ contributions by the size of their population, but since you don’t have that data I wasn’t expecting you to do that.)
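The difference between summing, averaging, and population-weighting per-capita figures is easy to see on a toy example. Every number below is invented for illustration (it is not Egypt's or Qatar's actual data), and the column names are hypothetical.

```r
# Hypothetical figures, invented purely for illustration
toy <- data.frame(
  region = c("R1", "R1", "R2", "R2"),
  oil_pc = c(0.1, 20, 1, 3),           # oil income per capita
  pop_k  = c(80000, 300, 5000, 5000)   # population in thousands
)

region_sum  <- tapply(toy$oil_pc, toy$region, sum)
region_mean <- tapply(toy$oil_pc, toy$region, mean)

# Population-weighted mean: each country counts in proportion to its population
region_wmean <- sapply(split(toy, toy$region),
                       function(d) weighted.mean(d$oil_pc, d$pop_k))

print(round(cbind(sum = region_sum, mean = region_mean, weighted = region_wmean), 3))
```

In region R1, where one large country has little oil income and one tiny country has a lot, the three summaries tell very different stories; in R2, where the populations are equal, the plain and weighted means coincide.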

So let’s start again from the beginning:

region.oipc <- aggregate(qogdata$oil_income_pc_1k,
                         by=list(qogdata$ht_region),
                         FUN=mean, ## the only change from above!
                         na.rm=TRUE)
colnames(region.oipc) <- c("Region","OIPC")

levels(region.oipc$Region) <- c("E. Europe + Fmr USSR",
                               "Latin America",
                               "N. Africa and Mideast", 
                               "Sub-Saharan Africa",
                               "W. Europe and N. America",
                               "East Asia", 
                               "South-East Asia",
                               "South Asia",
                               "The Pacific",
                               "The Caribbean")
par(las=2,mai=c(1.5,2.5,.25,.5)) # make label text perpendicular to axis, 
# change the margin size in inches to give us 1.5 inches on the bottom, 2.5 inches on the left, 
#.25 on the top, and .5 on the right

region.oipc <- region.oipc[order(region.oipc$OIPC),]
barplot(region.oipc$OIPC,names=region.oipc$Region,
        horiz=TRUE,
        cex.names=.75,
        main="Average Oil Income P.C. By Country and Region")


The second half of the question should be simple once the first is licked:


region.democ <- aggregate(qogdata$oil_income_pc_1k,
                         by=list(qogdata$chga_demo),
                         FUN=mean, ## same summary function as above; the grouping variable (chga_demo) is what changed
                         na.rm=TRUE)
colnames(region.democ) <- c("Regime","OIPC")
levels(region.democ$Regime) <- c("Dictatorship","Democracy")
barplot(region.democ$OIPC,names=region.democ$Regime,
        main="Average Oil Income P.C. By Regime Type")

The interpretation for both charts is pretty straightforward. Some regions have got a lot of oil, others haven’t; similarly, dictatorships on average have a lot more oil income than democracies do. Is there something about the geology of dictatorships that makes it easier to find oil there? Doubtful–zooplankton and algae several hundred million years ago hardly knew whether they would decompose under future dictatorships. Perhaps it’s more likely that the sale of oil sustains dictatorial regimes–or even causes them? At the very least, that’s a hypothesis worth testing….

Question 8

The United Nations has developed a measure of gender inequality, the Gender Inequality Index (GII). Higher numbers of the GII express more unequal development between men and women and lower numbers represent lower gaps in the development of men and women. (For more, you can see the UN web site).

This measure is included in your dataset as undp_gii. Please (A) plot GII against oil income and (B) regress GII against oil income per capita. Substantively interpret the regression and plot: what do these numbers suggest about the relationship between oil income and gender equality? What explanations could affect our interpretation of that relationship? Is the relationship statistically significant? Does that affect our understanding of the causal effect of oil on gender disparities in development?

qogdata <- read.dta("QOGSelectPS2.dta")

## Regress (Why not do it first? You should be including it anyway)
qogdata.lm <- lm(undp_gii~oil_income_pc_1k,data=qogdata)
summary(qogdata.lm)
## 
## Call:
## lm(formula = undp_gii ~ oil_income_pc_1k, data = qogdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3442 -0.1772  0.0219  0.1528  0.3765 
## 
## Coefficients:
##                     Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)       0.39319009  0.01662871   23.65 <0.0000000000000002 ***
## oil_income_pc_1k -0.00000243  0.00000513   -0.47                0.64    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.19 on 139 degrees of freedom
## Multiple R-squared:  0.00162,    Adjusted R-squared:  -0.00557 
## F-statistic: 0.225 on 1 and 139 DF,  p-value: 0.636
plot(qogdata$oil_income_pc_1k,qogdata$undp_gii,
     main="Gender Inequality Index vs. Oil Income",
     ylab="UN GII",
     xlab="Oil Income Per Capita",
     pch=16)
abline(qogdata.lm, col="green") # just to show the color doesn't always have to be red

The regression suggests that there is a negative relationship between gender inequality and oil income: for every unit increase in Oil Income (that is, $1) a country receives per capita there is a corresponding drop of …. almost nothing (-0.00000243) in gender inequality. You might think: well, that’s that; the coefficient is very small, so this must be unimportant! But before you conclude that, you should see the scale that GII uses; use mean(qogdata$undp_gii) and range(qogdata$undp_gii):

mean(qogdata$undp_gii)
## [1] 0.39
range(qogdata$undp_gii)
## [1] 0.049 0.769
## for good measure
mean(qogdata$oil_income_pc_1k)
## [1] 831
range(qogdata$oil_income_pc_1k)
## [1]     0 25032

The coefficient for \(\beta_1\) may be very small, but so is the scale. Accordingly, any coefficient we estimate, no matter how important, is apt to be small. In this case, however, the coefficient is so small relative to the quantity we’re estimating that it’s substantively unimportant. Even moving from the minimum of oil_income_pc_1k to the maximum would have a tiny effect on estimated gender inequality:

25032*-0.00000243
## [1] -0.061
## by the way, a better R way to write this:
max(qogdata$oil_income_pc_1k) * qogdata.lm$coefficients[2]
## oil_income_pc_1k 
##           -0.061
## play around until you understand why these commands work

## to estimate change from 0 to average oil income
mean(qogdata$oil_income_pc_1k) * qogdata.lm$coefficients[2]
## oil_income_pc_1k 
##           -0.002

In other words, if we somehow took a country with no oil income and instead gave it the maximum income per capita in the sample, we would predict (based on this model) that its estimated gender inequality index would decrease by 6 points on a 100-point scale (that is, if we just multiplied the GII by 100 to make it more immediately understandable). That would matter for the women (and men!) living in that country, but the size of that estimated effect is so small relative to the cause that it’s truly secondary. For another comparison, look at the difference in moving from 0 to the average oil income per capita in the sample: a change of two tenths of one point on a 100-point scale.

The statistical significance test is easier. Our estimate of \(\beta_{OilIncome}\) is not statistically significant. Compute \(\frac{\hat{\beta_1}}{\widehat{se(\beta_1)}}\), which yields the value R reports as the t value (here, -0.47). In general, we are looking for t values greater than 2 in absolute value (Bailey discusses this in much greater detail, and you’re responsible for that). With a p-value of 0.64, a \(\hat{\beta_1}\) this far from 0 could easily arise by chance alone. We therefore fail to reject the null hypothesis.
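As a check on the arithmetic, the t value and p-value can be rebuilt from the numbers printed in the regression output above. This sketch uses the rounded estimate, standard error, and residual degrees of freedom copied from that output, so the results will only match the printed table approximately.

```r
# Values copied from the regression output above (rounded as printed)
estimate  <- -0.00000243
std_error <-  0.00000513
resid_df  <- 139

# t value: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = resid_df)

print(round(c(t = t_value, p = p_value), 2))
```

With the fitted model in memory you would normally read these from summary(qogdata.lm)$coefficients instead of retyping them.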