Submit your HTML output (not your .rmd file) to TurnitIn by Midnight on Tuesday, 9 Feb
Please make sure you have downloaded this file (pset2.rmd) to your computer and opened it in R Studio. By download, we do not mean you just clicked on it in your browser – we mean you have saved the actual file to a directory on your computer, and then opened it with R Studio. You should now be looking at the “raw” text of the .rmd file.
If you need to re-orient yourself, please review the introductory material that pset 1 began with describing how to include R code “chunks” into this .rmd file. Remember that when you “knit” the rmd file, only the code written into code “chunks” will be executed and have its results integrated into the output html file.
Please also remember that you will want to use the console to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your rmd. Remember that the code within your rmd file has to be self-contained and include all the steps – your rmd file will not “remember” what you did on your own in the console. When you click knit, it can only execute the code that was present in the rmd. Do not copy the results from your console into your RMD file. In addition, do not include large amounts of output in your writeup (i.e. don’t print full datasets to the screen).
Finally, it is best to work with small amounts of code at a time: get some code working, copy it into the rmd as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. This will save you huge headaches. While we were generous in grading the first pset for people whose rmd files would not knit, we will not be so generous this time.
First, we will load a dataset (derived from Fearon and Laitin, 2003). The following code should work for most people. This assumes you have used install.packages("RCurl") at some point before (which you did for pset2!).
library(RCurl)
## Loading required package: bitops
myCsv <- getURL("https://dl.dropboxusercontent.com/u/22563946/fl2csv.csv")
fl2 <- read.csv(textConnection(myCsv))
If that does not work for you, remove it from the RMD (so your file will knit), and instead please go to and download the file fl2.RData to a directory on your computer. You can then load it using load("fl2.RData") in a code chunk once you have set your working directory (i.e. how we loaded data in the first pset).
(1a) What is the name of the variable that was created as the dataset in the above step? What are the names of the variables stored in this dataset? How many observations are there? (Please do not print the whole dataset in your output!)
The variable created as the dataset is “fl2”. The variables stored in the dataset are “X”, “cname”, “year”, “warl”, “war”, “gdpenl”, “lpopl1”, “lmtnest”, “ncontig”, “Oil”, “nwstate”, “instab”, “polity2l”, “ethfrac”, “relfrac”, “war_prop”, and “numyears”. There are 156 observations.
head(fl2)
## X cname year warl war gdpenl lpopl1 lmtnest ncontig Oil nwstate
## 1 1 USA 1945 0 0 7.626 11.856296 3.214868 1 0 0
## 2 56 CANADA 1945 0 0 5.188 9.424968 2.797281 0 0 0
## 3 113 CUBA 1947 0 2 2.127 8.524963 1.694107 0 0 0
## 4 168 HAITI 1947 0 5 0.765 7.992945 2.797281 0 0 0
## 5 223 DOMINICA 1947 0 1 0.668 7.567863 2.856470 0 0 0
## 6 276 JAMAICA 1962 0 0 1.810 7.419980 1.335001 0 0 1
## instab polity2l ethfrac relfrac war_prop numyears
## 1 0 10 0.35695010 0.59599996 0.00000000 55
## 2 0 10 0.75499403 0.63119996 0.00000000 55
## 3 0 3 0.03572363 0.25500000 0.04347826 46
## 4 1 -1 0.01359123 0.33279997 0.09433962 53
## 5 0 -9 0.03698879 0.09500003 0.01886792 53
## 6 0 10 0.04576665 0.50380003 0.00000000 38
names(fl2)
## [1] "X" "cname" "year" "warl" "war" "gdpenl"
## [7] "lpopl1" "lmtnest" "ncontig" "Oil" "nwstate" "instab"
## [13] "polity2l" "ethfrac" "relfrac" "war_prop" "numyears"
dim(fl2)
## [1] 156 17
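One compact way to answer (1a) without printing the whole dataset is sketched below. It uses a small made-up data frame (toy, a hypothetical stand-in for fl2) so that the chunk runs on its own; the same calls should work on fl2.

```r
# Toy stand-in for fl2 (hypothetical values, for illustration only)
toy <- data.frame(
  cname = c("A", "B", "C"),
  war   = c(0, 2, 5),
  Oil   = c(0, 1, 0)
)

n_obs     <- nrow(toy)    # number of observations (rows)
n_vars    <- ncol(toy)    # number of variables (columns)
var_names <- names(toy)   # variable names

str(toy)                  # compact summary: one line per variable
```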
(1b) The variable gdpenl is GDP per capita, measured in thousands of dollars (using 1985 prices). Show the sample distribution of this variable. Specifically, do a density plot and a boxplot.
plot(density(fl2$gdpenl))
boxplot(fl2$gdpenl)
(1c) Remark on the shape of this distribution. Compute the median and mean.
Median:
median(fl2$gdpenl)
## [1] 1.091
Mean:
mean(fl2$gdpenl)
## [1] 2.46391
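Comparing the mean and the median is a quick numeric check on shape: when the mean sits well above the median, as with the two values reported above, the distribution has a long right tail. The sketch below illustrates this with simulated log-normal data; x is hypothetical and is not the fl2 data.

```r
set.seed(1)
# Simulated right-skewed data (hypothetical; stands in for a variable like gdpenl)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

x_mean   <- mean(x)
x_median <- median(x)

# A positive gap between mean and median suggests right skew
skew_gap <- x_mean - x_median
skew_gap > 0
```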
(1d) Repeat (1c), but this time show the distribution of log(gdpenl) using a density plot and a boxplot. Remark on the difference in shape when using the log.
plot(density(log(fl2$gdpenl)))
boxplot(log(fl2$gdpenl))
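Taking logs compresses large values, which tends to pull in a long right tail. A minimal sketch of how one might quantify that change, assuming simulated data (x) and a hypothetical helper skewness_hat that computes a simple moment-based sample skewness:

```r
set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)   # simulated, hypothetical data

# Hypothetical helper: moment-based sample skewness (about 0 for symmetric data)
skewness_hat <- function(v) {
  mean((v - mean(v))^3) / sd(v)^3
}

skew_raw <- skewness_hat(x)        # clearly positive: long right tail
skew_log <- skewness_hat(log(x))   # near zero: roughly symmetric
c(raw = skew_raw, log = skew_log)
```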
In the same dataset, the variable Oil describes whether each country in the dataset is an oil exporter (=1) or not (=0). The variable war describes how many years from 1945 to 1999 that country had a civil war.
We are interested in whether the number of years in civil war (war) is on average the same for oil exporters as oil non-exporters. Even though war is technically a count variable, we will treat it as an essentially continuous variable, and work with means.
# Split the data by oil-exporter status, keeping only the columns we need
ow.data.zero <- subset(fl2, subset = Oil == 0, select = c("Oil", "war"))
ow.data.one <- subset(fl2, subset = Oil == 1, select = c("Oil", "war"))
(2a) Let xbar_oil and xbar_noil be the mean of war for the oil exporters and oil non-exporters, respectively. Compute the difference in means and save it as a variable.
xbar_oil=mean(ow.data.one$war)
xbar_noil=mean(ow.data.zero$war)
mean.diff= xbar_oil-xbar_noil
mean.diff
## [1] 0.4758454
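As an alternative to building two subsets, the group means can be computed in one step with tapply. This is only a sketch on made-up data (toy is hypothetical, with column names chosen to mirror fl2), not the approach used above.

```r
# Toy data (hypothetical values; column names mirror fl2)
toy <- data.frame(
  Oil = c(0, 0, 0, 1, 1),
  war = c(0, 4, 2, 6, 0)
)

# Mean of war within each Oil group; the result is named by group ("0", "1")
group_means <- tapply(toy$war, toy$Oil, mean)

# Difference in means: exporters minus non-exporters
toy_diff <- unname(group_means["1"] - group_means["0"])
toy_diff
```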
(2b) We want to know if the mean of war for oil exporters and non oil exporters are the same or different. Write the null hypothesis and alternative hypothesis corresponding to this question.
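H0: the population difference in means is zero, \(\mu_{oil} - \mu_{noil} = 0\). H1: the population difference in means is not zero, \(\mu_{oil} - \mu_{noil} \neq 0\). (The sample quantity mean.diff is our estimate of this difference.)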
(2c) Construct and estimate the appropriate z-statistic. Please first construct the numerator, then the denominator, then the z-statistic itself.
Numerator: mean.diff = xbar_oil - xbar_noil. Denominator: SE.mean.diff = sqrt((var(ow.data.one$war)/18) + (var(ow.data.zero$war)/138)), where 18 and 138 are the numbers of oil exporters and non-exporters.
Z-statistic:
SE.mean.diff=sqrt((var(ow.data.one$war)/18)+(var(ow.data.zero$war)/138))
mean.diff/SE.mean.diff
## [1] 0.1700467
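The same calculation can be wrapped in a small function so that the group sizes are taken from the data rather than typed in by hand. This is a sketch under stated assumptions: z_stat is a hypothetical helper, and g1 and g0 are made-up vectors.

```r
# Hypothetical helper: z-statistic for a difference in two sample means
z_stat <- function(x1, x0) {
  num <- mean(x1) - mean(x0)                                # numerator
  den <- sqrt(var(x1) / length(x1) + var(x0) / length(x0))  # standard error
  num / den
}

# Toy vectors (hypothetical values, for illustration only)
g1 <- c(1, 3, 5, 7)
g0 <- c(0, 2, 4, 6, 8, 10)
z_toy <- z_stat(g1, g0)
z_toy
```

Applied to the data above, z_stat(ow.data.one$war, ow.data.zero$war) should give the same value as mean.diff/SE.mean.diff, assuming war has no missing values.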
(2d) Compare the z-statistic to the critical values necessary to determine if you can reject the null hypothesis at the 0.05 level and at the 0.10 level.
The z-statistic, 0.1700467, is well below the two-sided critical values of 1.96 (0.05 level) and 1.645 (0.10 level), so we cannot reject the null hypothesis at either level.
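The critical values do not need to be memorized; they can be recovered with qnorm. A minimal sketch, assuming a two-sided test and using the z-statistic reported above:

```r
# Two-sided critical values from the standard normal distribution
crit_05 <- qnorm(1 - 0.05 / 2)   # about 1.96
crit_10 <- qnorm(1 - 0.10 / 2)   # about 1.645

z <- 0.1700467                   # z-statistic reported above
reject_05 <- abs(z) > crit_05
reject_10 <- abs(z) > crit_10
c(reject_05 = reject_05, reject_10 = reject_10)
```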
(2e) Estimate the actual p-value. Provide a carefully-worded statement that correctly states what this p-value means.
pnorm(0.1700467)
## [1] 0.5675133
1-pnorm(0.1700467)
## [1] 0.4324867
2*(1-pnorm(0.1700467))
## [1] 0.8649734
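The two-sided p-value can also be written in one line that works whether the z-statistic is positive or negative. This is a sketch using the z value reported above:

```r
z <- 0.1700467                      # z-statistic reported above
p_two_sided <- 2 * pnorm(-abs(z))   # area in both tails beyond |z|
p_two_sided
```

Using -abs(z) avoids a common slip: with a negative z, 2*(1-pnorm(z)) would return a value greater than 1.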
(2f) Carefully state the conclusion you make about the hypotheses, including whether or not you reject the null hypothesis.
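Under the null hypothesis, there is an 86.5% chance of observing a z-statistic at least as extreme (in absolute value) as the one we obtained. Because this p-value is far larger than 0.05 or 0.10, we fail to reject the null hypothesis: the data provide no evidence that the mean number of civil war years differs between oil exporters and non-exporters.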
(2g) Under the null hypothesis, what is the distribution of the z-statistic?
Under the null hypothesis, the z-statistic follows (approximately) a standard normal distribution, \(N(0, 1)\). By the central limit theorem the difference in sample means is approximately normal with mean 0 under the null, and dividing by its standard error rescales it to have variance 1.
Bonus: (2h) Suppose that rather than computing the z-statistic, we think about the distribution of the difference in means (i.e. xbar_oil-xbar_noil under the null). What is the distribution of this difference in means under the null?
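Under the null hypothesis, the difference in means xbar_oil - xbar_noil is approximately normally distributed (by the central limit theorem), centered at 0, with standard deviation equal to the standard error of the difference (estimated above by SE.mean.diff).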
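The null distribution of the z-statistic can be illustrated by simulation: draw two groups from the same population many times and compute the z-statistic each time. The sketch below assumes normally distributed data and hypothetical group sizes (n1, n0); none of these values come from fl2.

```r
set.seed(1)
n1 <- 50; n0 <- 150   # hypothetical group sizes

# Simulate the z-statistic many times when both groups share the same mean
z_sim <- replicate(5000, {
  x1 <- rnorm(n1, mean = 2, sd = 3)
  x0 <- rnorm(n0, mean = 2, sd = 3)
  (mean(x1) - mean(x0)) / sqrt(var(x1) / n1 + var(x0) / n0)
})

c(mean = mean(z_sim), sd = sd(z_sim))   # should be close to 0 and 1
```

A histogram of z_sim should look close to the standard normal curve, consistent with a z-statistic that is approximately \(N(0, 1)\) under the null.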
Suppose you have a random variable \(X\) with expectation \(E[X]=u\), and variance given by \(s^2\).
(3a) What is \(SD(X)\)?
\(SD(X) = \sqrt{s^2} = s\) (for \(s \ge 0\))
(3b) What is \(Var(\frac{1}{a}X)\) for a constant \(a\)?
\(Var(\frac{1}{a}X)\) = \(\frac{1}{a^2}Var(X)\)
(3c) What is \(Var(\frac{1}{SD(X)} X)\)?
= \(\frac{1}{SD(X)^2}Var(X)\)
= \(\frac{1}{Var(X)}Var(X)\)
= 1
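The results in (3b) and (3c) can be checked numerically, since the sample variance obeys the same scaling rule. This is a sketch on simulated data; x and the constant a are arbitrary, hypothetical choices.

```r
set.seed(1)
x <- rnorm(1000, mean = 5, sd = 3)   # simulated data (hypothetical)
a <- 4                               # an arbitrary nonzero constant

# (3b): Var(X / a) equals Var(X) / a^2
var_scaled  <- var(x / a)
var_formula <- var(x) / a^2

# (3c): dividing by the standard deviation gives variance 1
var_std <- var(x / sd(x))
c(var_scaled, var_formula, var_std)
```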
Let’s go back to the question of civil wars. Suppose you hear on the news that a political scientist compared the number of civil war years for democracies and for autocracies. This investigator found that democracies had a significantly lower average number of years in war than autocracies, with a p-value of less than 0.01.
(4a) Explain in simple terms why this difference cannot be assumed to be causal. You may want to use either the language of “confounders” or of “comparability” as described in class. But however you do it, make it clear to any person of reasonable intelligence why this comparison should not be given a causal interpretation.
These things can be associated with one another and still have no causal relationship at all. There can be other factors that contribute to these countries going to war.
(4b) What would be the ideal research design that would allow you to say more? (It does not need to be a feasible option).
For the research design, I would exclude the countries that tend to go to war for economic or political reasons.
(4c) Why would your proposed research design allow you to make causal inferences? Answer in terms of the issues you raised in your response to 4a.
Confounding:
This difference cannot be assumed to be causal because countries that are under more political and economic distress may be more likely to go to war, and this can confound the comparison.
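The confounding argument can be made concrete with a small simulation. In the sketch below everything is hypothetical: wealth stands in for a confounder that makes a country both more likely to be a democracy and less prone to war, while democracy itself has no effect on war. The naive comparison still shows a large gap, whereas randomly assigning regime type (infeasible in practice) does not.

```r
set.seed(1)
n <- 10000

# Hypothetical confounder: affects both regime type and war
wealth    <- rnorm(n)
democracy <- rbinom(n, 1, plogis(2 * wealth))   # richer => more likely democratic
war       <- 5 - 2 * wealth + rnorm(n)          # democracy has NO effect on war

# Naive comparison: democracies appear to have far fewer war years
naive_diff <- mean(war[democracy == 1]) - mean(war[democracy == 0])

# Randomized assignment breaks the link between wealth and regime type
random_dem  <- rbinom(n, 1, 0.5)
random_diff <- mean(war[random_dem == 1]) - mean(war[random_dem == 0])

c(naive = naive_diff, randomized = random_diff)
```

Under these assumptions the naive difference is strongly negative even though the true effect is zero, while the randomized difference is close to zero because the two groups are comparable on wealth.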