Introduction: Reminders about R and Rmarkdown

Please make sure you have downloaded this file (pset2.Rmd) to your computer and opened it in RStudio. By download, we do not mean you just clicked on it in your browser – we mean you have saved the actual file to a directory on your computer, and then opened it with RStudio. You should now be looking at the “raw” text of the .Rmd file.

If you need to re-orient yourself, please review the introductory material at the start of pset 1, which describes how to include R code “chunks” in this .Rmd file. Remember that when you “knit” the .Rmd file, only the code written into code “chunks” will be executed and have its results integrated into the output HTML file. For example, the code chunk below provides a summary of the built-in dataset called cars. Take a look at the .Rmd code that produces it, then click “knit” and see how it shows up in the output HTML file:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Please also remember that you will want to use the console to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your rmd. Remember that the code within your rmd file has to be self-contained and include all the steps – your rmd file will not “remember” what you did on your own in the console. When you click knit, it can only execute the code that was present in the rmd. Do not copy the results from your console into your RMD file. In addition, do not include large amounts of output in your writeup (i.e. don’t print full datasets to the screen).

Include both the code to get your answer and your answer in words.

Finally, it is best to work with small amounts of code at a time: get some code working, copy it into the Rmd as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. This will save you huge headaches.

Make sure your final Rmd file knits correctly, and check as you work – don’t wait until the very end to try knitting your code.

Question 1.

Load the Fearon and Laitin data set (fl2).

load("C:/Users/pbourke/Downloads/fl2 (1).RData")

What are the names of the variables stored in this dataset? How many variables do you have? What is your sample size, given this data set? (PLEASE DO NOT PRINT THE WHOLE DATASET IN YOUR OUTPUT!)

summary(fl2)

The names of the variables are: cname, year, war1, war, gdpenl, lpopl1, lmtnest, ncontig, Oil, nwstate, instab, polity2l, ethfrac, relfrac, war_prop, numyears. That is 16 variables in total.
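One way to answer all three parts without printing the dataset is sketched below. Because the fl2 file is not reproduced here, the sketch uses R's built-in cars data frame as a stand-in (an assumption for illustration); the same calls should apply to fl2 once it is loaded.

```r
# Stand-in data: the built-in `cars` data frame (swap in fl2 after load()).
d <- cars

var_names <- names(d)  # names of the variables
n_vars    <- ncol(d)   # number of variables (columns)
n_obs     <- nrow(d)   # sample size (rows)

var_names
n_vars
n_obs
```

These three calls print only a short vector and two numbers, which keeps the knitted output small.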

The variable gdpenl is GDP per capita, measured in thousands of dollars (using 1985 price).

Show the sample distribution of this variable. Specifically, create a density plot, and a boxplot. Remember, plots need to be labelled.

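par(mfrow=c(2,2))
# Raw variable: density with mean (blue) and median (red), then boxplot
plot(density(fl2$gdpenl), main="GDP per Capita Density", xlab="GDP per capita (thousands of 1985 dollars)")
abline(v=mean(fl2$gdpenl), col="blue")
abline(v=median(fl2$gdpenl), col="red")
legend("topright", legend=c("mean", "median"), col=c("blue", "red"), lty=1)
boxplot(fl2$gdpenl, main="GDP per Capita Boxplot", ylab="GDP per capita (thousands of 1985 dollars)")

# Logged variable: same two plots, same colors for mean and median
plot(density(log(fl2$gdpenl)), main="Density of Log GDP per Capita", xlab="log(GDP per capita)")
abline(v=mean(log(fl2$gdpenl)), col="blue")
abline(v=median(log(fl2$gdpenl)), col="red")
boxplot(log(fl2$gdpenl), main="Log GDP per Capita Boxplot", ylab="log(GDP per capita)")
abline(h=mean(log(fl2$gdpenl)), lty=2, lwd=2, col="green")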

Remark on the shape of this distribution. Compute the median and mean and report their values in your code. Then add these values to your chart as lines. Comment on whether the mean and median are the same and explain why or why not.

The distribution is right skewed: there are outliers far out on the right side of the x-axis, which are countries with very high GDP per capita. The skewness pulls the mean and the median apart, because the mean is dragged upward by the extreme values while the median is not, so the mean is larger than the median.

Repeat (c) and (d), but this time show the distribution of log(gdpenl) using a density plot and a boxplot. Remark on the difference in shape when using the log of the variable. Are your mean and median closer together or farther apart? Why?

The mean and median lines are closer together in the log(gdpenl) distribution. Taking the log (the natural log, which is what log() computes in R) compresses large values much more than small ones, which pulls in the long right tail. The distribution becomes more symmetric, so the mean is no longer dragged far from the median.
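The effect of the log on the gap between mean and median can be checked on a tiny made-up vector. The values below are hypothetical and are not taken from fl2; they only mimic a right-skewed variable with one very large observation.

```r
# Hypothetical values: four modest observations and one very large one.
x <- c(1, 2, 3, 4, 100)

gap_raw <- mean(x) - median(x)            # mean is dragged up by the outlier
gap_log <- mean(log(x)) - median(log(x))  # log() is the natural log in R

gap_raw
gap_log
```

On the raw scale the mean sits far above the median; after logging, the gap shrinks to less than one unit.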

In the same dataset, the variable Oil describes whether each country in the dataset is an oil exporter (Oil=1) or not (Oil=0). The variable war describes how many years from 1945 to 1999 that country had a civil war. The variable ethfrac is a measure of how fractionalized ethnic groups are in a given country – specifically, it’s the probability that two people randomly drawn from a given country are from the same (0) or different (1) groups.

What is the mean value of war for oil exporters? What is the mean value of war for non oil exporters? What is the standard deviation for both groups? What does this difference in standard deviations suggest about how much variation there is in war for oil exporters versus non oil exporters?

mean(fl2$war[fl2$Oil==1])

## [1] 6.055556

mean(fl2$war[fl2$Oil==0])

## [1] 5.57971

sd(fl2$war[fl2$Oil==1])

## [1] 11.25884

sd(fl2$war[fl2$Oil==0])

## [1] 10.43003
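The four calls above can also be written as one grouped summary per statistic. The sketch below uses the built-in mtcars data as a stand-in (an assumption for illustration), with the 0/1 variable am playing the role of Oil and mpg playing the role of war.

```r
# Stand-in data: mtcars, with `am` (0/1) in place of Oil and `mpg` in place of war.
group_means <- tapply(mtcars$mpg, mtcars$am, mean)  # mean of mpg within each group
group_sds   <- tapply(mtcars$mpg, mtcars$am, sd)    # SD of mpg within each group

group_means
group_sds
```

tapply() returns one value per group, labelled by the group's value, so the two groups can be compared side by side.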

Describe the ethfrac variable: what is the minimum and maximum? What is the mean value? What is the standard deviation? Why does the variable range from 0 to 1?

summary(fl2$ethfrac)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0010  0.1438  0.3850  0.4083  0.6691  0.9250

sd(fl2$ethfrac)

## [1] 0.2798512

The ethfrac variable measures how fractionalized ethnic groups are in the country, or how much ethnic diversity there is. The minimum is 0.001, the maximum is 0.925, the mean is 0.4083 and the standard deviation is 0.2799. The variable ranges from 0 to 1 because it is a probability: the chance that two people randomly selected from that country are from different ethnic groups. The closer the number is to 1, the more ethnically diverse that country is.

Say you believe that increased ethnic fractionalization causes war. In this case, what is your independent and dependent variable? Make a scatterplot that shows the relationship between these two variables, including a regression line. Describe the relationship you find. What happens to the predicted number of wars as you move from no ethnic fractionalization to the highest possible value of ethnic fractionalization?

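# Independent variable (ethfrac) on the x-axis, dependent variable (war) on the y-axis
plot(fl2$ethfrac, fl2$war, xlab="Ethnic fractionalization", ylab="Years of civil war, 1945-1999", main="Civil War Years vs. Ethnic Fractionalization")
model1<-lm(war~ethfrac, data=fl2)
abline(model1)

The independent variable is ethnic fractionalization and the dependent variable is war, because the belief is that fractionalization causes war. There is not a strong relationship or correlation between the two variables: the points are widely scattered around the regression line. Countries across almost the whole range of ethnic fractionalization were not at war at all during the time span (0 years). However, there were also countries with very different levels of ethnic fractionalization that were at war for around 50 years during the time span. The predicted change in war as you move from no ethnic fractionalization (0) to the highest possible value (1) is the slope coefficient of model1.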
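As a sketch of the general pattern (hypothesized cause on the x-axis and on the right-hand side of the formula), the example below uses the built-in cars data as a stand-in, an assumption for illustration since fl2 is not reproduced here. It also shows one way to read off the predicted change between two values of the independent variable.

```r
# Stand-in data: in `cars`, speed is the independent variable (x-axis)
# and dist is the dependent variable (y-axis).
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed")
fit <- lm(dist ~ speed, data = cars)  # dependent ~ independent
abline(fit)

# Predicted change in y between the smallest and largest observed x.
x_range   <- range(cars$speed)
predicted <- predict(fit, newdata = data.frame(speed = x_range))
diff(predicted)
```

For a variable that runs from 0 to 1, the same calculation with x values 0 and 1 reduces to the slope coefficient itself.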

Question 2.

Suppose you have a random variable \(X\) with expectation \(E[X]=u\), and variance given by \(s^2\). You then draw multiple observations from the same distribution. That is, you draw \(X_1, X_2, \dots, X_n\), each a random variable with expectation \(u\) and variance \(s^2\).

When you average these random variables together, what is this called? How do you write it mathematically?

This is the sample mean. Mathematically you write it as an “x” with a line over it: \(\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i\).

What is SD(X)?

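SD(X) is the standard deviation of \(X\), which is found by taking the square root of the variance: \(SD(X) = \sqrt{s^2} = s\). The standard error of the mean is a different quantity: it is the standard deviation of \(\overline{X}\), not of \(X\) itself.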

What is \(E[\overline{X}]\)? Explain with math and words.

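\(\overline{X}\) (the “x” with the line over it) is itself a random variable, and “E” means the expectation. Because expectation is linear, \(E[\overline{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\cdot n u = u\). So the expectation of \(\overline{X}\) is simply \(E[X] = u\).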

What is \(Var(\overline{X})\)?

It is the variance of \(X\) divided by the sample size \(n\): \(Var(\overline{X}) = s^2/n\), assuming the draws are independent. This is significant because when you take the square root of the variance of \(\overline{X}\) you end up with the standard error of the mean, \(s/\sqrt{n}\).
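Both results for the sample mean can be checked by simulation. The sketch below assumes normally distributed draws with u = 5 and s = 2; these values and the choice of distribution are illustrative assumptions, not part of the problem set.

```r
set.seed(1)     # arbitrary seed so the run is reproducible
u <- 5          # assumed expectation of X
s <- 2          # assumed standard deviation of X
n <- 25         # observations per sample
reps <- 10000   # number of repeated samples

# Draw `reps` samples of size n and keep each sample's mean.
xbars <- replicate(reps, mean(rnorm(n, mean = u, sd = s)))

mean(xbars)  # should be close to u
var(xbars)   # should be close to s^2 / n
s^2 / n
```

The average of the simulated means lands near u, and their variance lands near s^2 / n, in line with the two formulas.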

Question 3.

In your own words, explain the difference between these terms and put them in a logical sequence: Estimate, Estimand, Estimator. Give one example of each.

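The estimand is the quantity whose true value you want to find, for example the mean ethnic fractionalization across all countries. To find it you use an estimator, which is a rule or tool applied to the data, for example the sample mean. The estimator gives you the estimate, which is just a number, for example the value 0.4083 computed for ethfrac above. The logical sequence is estimand, then estimator, then estimate.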
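The three terms map onto three different objects in code. The sketch below uses a made-up population of the numbers 1 to 100, so that the estimand is known exactly; every name in it is hypothetical.

```r
set.seed(42)                      # arbitrary seed for reproducibility

population <- 1:100               # hypothetical population
estimand   <- mean(population)    # the true quantity of interest

estimator <- function(x) mean(x)  # the rule: average the sampled values

draw     <- sample(population, size = 20)
estimate <- estimator(draw)       # the number the rule returns for this sample

estimand
estimate
```

In real data the estimand is unknown, and only the estimator and the estimate are available.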

If you repeatedly draw a sample and take a mean, how will the distribution of the mean change if the variance of the underlying population is small versus big? If your sample size is small versus big? Explain.

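When the variance of the underlying population is larger, the distribution of the sample mean is flatter, with values spread out along the x axis; when the variance is small, the curve is tall and spiked, with values remaining compact along the x axis. Sample size works in the opposite direction: the larger the sample size, the narrower the distribution of the sample mean and the closer the sample means cluster around the population mean, because the variance of the sample mean is the population variance divided by the sample size.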
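A short simulation can show both effects. The sketch below assumes normal populations centered at 0; the helper name spread_of_mean and all parameter values are hypothetical choices for illustration.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Spread (SD) of the sample mean over many repeated samples.
spread_of_mean <- function(n, pop_sd, reps = 2000) {
  sd(replicate(reps, mean(rnorm(n, mean = 0, sd = pop_sd))))
}

small_var <- spread_of_mean(n = 30, pop_sd = 1)
big_var   <- spread_of_mean(n = 30, pop_sd = 5)
small_n   <- spread_of_mean(n = 10, pop_sd = 1)
big_n     <- spread_of_mean(n = 1000, pop_sd = 1)

c(small_var = small_var, big_var = big_var)
c(small_n = small_n, big_n = big_n)
```

Holding the sample size fixed, the larger population variance gives the wider spread of sample means; holding the variance fixed, the larger sample size gives the narrower spread.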

PS 15: Problem Set 2 (Due 15 April 2016)

Prof. Stokes

April 8 2016
