Introduction: Reminders about R and Rmarkdown

Please make sure you have downloaded this file (pset2.Rmd) to your computer and opened it in RStudio. By download, we do not mean you just clicked on it in your browser. We mean you have saved the actual file to a directory on your computer, and then opened it with RStudio. You should now be looking at the “raw” text of the .Rmd file.

If you need to re-orient yourself, please review the introductory material at the start of pset 1, which describes how to include R code “chunks” in this .Rmd file. Remember that when you “knit” the .Rmd file, only the code written in code chunks will be executed and have its results integrated into the output HTML file. For example, the code chunk below provides a summary of the built-in dataset called cars. Take a look at the .Rmd code that produces it, then click “knit” and see how it shows up in the output HTML file:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Please also remember that you will want to use the console to “try out” code to get it working. Once it works, copy the code that worked (not the results) into a code chunk in your .Rmd file. The code in your .Rmd file has to be self-contained and include all the steps; the file will not “remember” what you did on your own in the console. When you click knit, it can only execute the code that is present in the .Rmd file. Do not copy the results from your console into your .Rmd file. In addition, do not include large amounts of output in your writeup (i.e., don’t print full datasets to the screen).

Include both the code to get your answer and your answer in words.

Finally, it is best to work with small amounts of code at a time: get some code working, copy it into the .Rmd file as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. This will save you huge headaches.

Make sure your final Rmd file knits correctly, and check as you work – don’t wait until the very end to try knitting your code.

Question 1.

Load the Fearon and Laitin data set (fl2).

#load("Fl2.Rdata")
load("fl2.RData")
summary(fl2)

##     cname                year           warl              war        
##  Length:156         Min.   :1945   Min.   :0.00000   Min.   : 0.000  
##  Class :character   1st Qu.:1947   1st Qu.:0.00000   1st Qu.: 0.000  
##  Mode  :character   Median :1954   Median :0.00000   Median : 0.000  
##                     Mean   :1958   Mean   :0.00641   Mean   : 5.635  
##                     3rd Qu.:1964   3rd Qu.:0.00000   3rd Qu.: 9.000  
##                     Max.   :1993   Max.   :1.00000   Max.   :52.000  
##      gdpenl            lpopl1          lmtnest          ncontig      
##  Min.   : 0.0510   Min.   : 5.403   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.6395   1st Qu.: 7.526   1st Qu.:0.6931   1st Qu.:0.0000  
##  Median : 1.0910   Median : 8.415   Median :2.3174   Median :0.0000  
##  Mean   : 2.4639   Mean   : 8.505   Mean   :2.0975   Mean   :0.1603  
##  3rd Qu.: 2.5940   3rd Qu.: 9.326   3rd Qu.:3.3150   3rd Qu.:0.0000  
##  Max.   :53.9010   Max.   :13.224   Max.   :4.5570   Max.   :1.0000  
##       Oil            nwstate           instab           polity2l       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :-10.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.: -7.0000  
##  Median :0.0000   Median :1.0000   Median :0.00000   Median : -1.0000  
##  Mean   :0.1154   Mean   :0.5192   Mean   :0.03205   Mean   : -0.1154  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:  7.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   : 10.0000  
##     ethfrac          relfrac          war_prop         numyears    
##  Min.   :0.0010   Min.   :0.0000   Min.   :0.0000   Min.   : 3.00  
##  1st Qu.:0.1438   1st Qu.:0.1861   1st Qu.:0.0000   1st Qu.:34.00  
##  Median :0.3850   Median :0.3750   Median :0.0000   Median :43.50  
##  Mean   :0.4083   Mean   :0.3807   Mean   :0.1393   Mean   :40.56  
##  3rd Qu.:0.6691   3rd Qu.:0.5800   3rd Qu.:0.2323   3rd Qu.:53.00  
##  Max.   :0.9250   Max.   :0.7828   Max.   :1.0000   Max.   :55.00

What are the names of the variables stored in this dataset? How many variables do you have? What is your sample size, given this data set? (PLEASE DO NOT PRINT THE WHOLE DATASET IN YOUR OUTPUT!)

names(fl2)

##  [1] "cname"    "year"     "warl"     "war"      "gdpenl"   "lpopl1"  
##  [7] "lmtnest"  "ncontig"  "Oil"      "nwstate"  "instab"   "polity2l"
## [13] "ethfrac"  "relfrac"  "war_prop" "numyears"

dim(fl2)

## [1] 156  16

The variable gdpenl is GDP per capita, measured in thousands of dollars (in 1985 prices).

Show the sample distribution of this variable. Specifically, create a density plot, and a boxplot. Remember, plots need to be labelled.

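# The x-axis shows GDP per capita; the y-axis counts observations in each bin.
hist(fl2$gdpenl, xlab = "GDP per capita (thousands of 1985 dollars)", ylab = "Frequency", main = "Histogram of GDP per Capita")

plot(density(fl2$gdpenl), main = "Sample Distribution of GDP per Capita", xlab = "GDP per capita (thousands of 1985 dollars)", ylab = "Density")

boxplot(fl2$gdpenl, main = "Boxplot of GDP per Capita", ylab = "GDP per capita (thousands of 1985 dollars)")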

Remark on the shape of this distribution. Compute the median and mean and report their values in your code. Then add these values to your chart as lines. Comment on whether the mean and median are the same and explain why or why not.

#The distribution is strongly right-skewed, with a number of high outliers. The mean (2.46) is therefore larger than the median (1.09): a small number of very high values pull the mean to the right, while the median is not sensitive to extreme values. In a symmetric distribution, such as the normal, the mean and median would be equal. Equal mean and median would not by itself show that a distribution is normal.
mean(fl2$gdpenl)

## [1] 2.46391

median(fl2$gdpenl)

## [1] 1.091

plot(density(fl2$gdpenl), main = "GDP per Capita")
abline(v=mean(fl2$gdpenl), lty=4, lwd=2, col="red")
abline(v=median(fl2$gdpenl), lty=4, lwd=2, col="blue")

boxplot(fl2$gdpenl, main = "GDP per Capita")
abline(h=mean(fl2$gdpenl), lty=4, lwd=2, col="red")
abline(h=median(fl2$gdpenl), lty=4, lwd=2, col="blue")

Repeat (c) and (d), but this time show the distribution of log(gdpenl) using a density plot and a boxplot. Remark on the difference in shape when using the log of the variable. Are your mean and median closer together or farther apart? Why?

loggdp1 <- log(fl2$gdpenl)

plot(density(loggdp1), main = "Log of GDP per Capita", xlab = "log(GDP per capita)", ylab = "Density")
abline(v=mean(loggdp1), lty=4, lwd=2, col="red")
abline(v=median(loggdp1), lty=4, lwd=2, col="blue")
# The legend line type must match the lty used in abline() above.
legend("topright", legend=c("mean", "median"), lty=c(4,4), lwd=2, col=c("red","blue"))

boxplot(loggdp1, main = "Log of GDP per Capita", ylab = "log(GDP per capita)")
abline(h=mean(loggdp1), lty=4, lwd=2, col="red")
abline(h=median(loggdp1), lty=4, lwd=2, col="blue")
legend("topright", legend=c("mean", "median"), lty=c(4,4), lwd=2, col=c("red","blue"))

mean(loggdp1)

## [1] 0.2382049

median(loggdp1)

## [1] 0.0870607

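#The raw variable is strongly right-skewed, but taking the log compresses the long right tail, so the distribution of log(gdpenl) is much closer to symmetric. The mean (0.238) and median (0.087) are now closer together, a gap of about 0.15 compared with 2.46 versus 1.09 on the raw scale, because the very high values that pulled the raw mean upward are no longer extreme on the log scale. The log replaces each value with its logarithm, so the units change, but the ordering of the observations is preserved.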
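The effect of the log on a right-skewed variable can be illustrated with simulated data. This is a minimal sketch, not the fl2 data: the lognormal draws, the seed, and the helper `scaled_gap` are assumptions made for illustration. Dividing the mean-median gap by the standard deviation puts the raw and logged versions on a comparable footing, since they are measured in different units.

```r
# Simulated stand-in for a right-skewed variable (NOT the fl2 data).
set.seed(42)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Hypothetical helper: gap between mean and median, in standard-deviation units.
scaled_gap <- function(v) (mean(v) - median(v)) / sd(v)

gap_raw <- scaled_gap(x)        # raw scale: mean pulled above the median
gap_log <- scaled_gap(log(x))   # log scale: roughly symmetric, gap near zero

round(c(raw = gap_raw, logged = gap_log), 3)
```

The same helper could be applied to `fl2$gdpenl` and `log(fl2$gdpenl)` once the data set is loaded.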

In the same dataset, the variable Oil describes whether each country in the dataset is an oil exporter (Oil=1) or not (Oil=0). The variable war describes how many years from 1945 to 1999 that country had a civil war. The variable ethfrac is a measure of how fractionalized ethnic groups are in a given country. Specifically, it is the probability that two people randomly drawn from a given country are from different groups, so values near 0 mean the two are almost always from the same group and values near 1 mean they are almost always from different groups.

What is the mean value of war for oil exporters? What is the mean value of war for non oil exporters? What is the standard deviation for both groups? What does this difference in standard deviations suggest about how much variation there is in war for oil exporters versus non oil exporters?

oilexporters <- subset(fl2, Oil==1)
mean(oilexporters$war)

## [1] 6.055556

nonoil <-subset(fl2, Oil == 0)
mean(nonoil$war)

## [1] 5.57971

sd(oilexporters$war)

## [1] 11.25884

sd(nonoil$war)

## [1] 10.43003

sd(oilexporters$war) - sd(nonoil$war)

## [1] 0.8288061

# Oil exporters average 6.06 years of civil war and non-exporters average 5.58, so exporters spend slightly more years at war on average. The standard deviations are 11.26 for exporters and 10.43 for non-exporters, a difference of about 0.83, which suggests that war varies slightly more among oil exporters, although the two groups are similarly spread out. The standard deviation is the square root of the variance and describes how far observations typically fall from their mean. In both groups it is larger than the mean, which is consistent with many observations having no war years and a few having many. The rule that about 68%, 95%, and 99.7% of observations fall within 1, 2, and 3 standard deviations of the mean applies to normal distributions, so it is only a rough guide for a skewed variable like war.
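Group means and standard deviations can also be computed in one call per statistic rather than one subset per group. The sketch below uses a small made-up data frame, where `group` plays the role of Oil and `y` the role of war; the values are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical toy data (NOT fl2): `group` stands in for Oil, `y` for war.
toy <- data.frame(
  group = c(0, 0, 0, 0, 1, 1, 1, 1),
  y     = c(0, 0, 2, 6, 0, 0, 4, 12)
)

# tapply() splits y by group and applies the function to each piece.
group_means <- tapply(toy$y, toy$group, mean)
group_sds   <- tapply(toy$y, toy$group, sd)

# Coefficient of variation: spread relative to each group's own mean.
group_cv <- group_sds / group_means

round(rbind(mean = group_means, sd = group_sds, cv = group_cv), 2)
```

With the real data, the analogous calls would be `tapply(fl2$war, fl2$Oil, mean)` and `tapply(fl2$war, fl2$Oil, sd)`. Comparing the coefficient of variation alongside the raw standard deviations helps when the group means differ.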

Describe the ethfrac variable: what is the minimum and maximum? What is the mean value? What is the standard deviation? Why does the variable range from 0 to 1?

min(fl2$ethfrac)

## [1] 0.001

max(fl2$ethfrac)

## [1] 0.9250348

sd(fl2$ethfrac)

## [1] 0.2798512

mean(fl2$ethfrac)

## [1] 0.4082564

#The variable ranges from 0 to 1 because it is a probability: the chance that two randomly drawn people from the country belong to different ethnic groups. A value at or close to 0 means the society is homogeneous and most people are ethnically similar. A value close to 1 means the society is very ethnically diverse, which is what "fractionalized" refers to. In this sample the observed values run from 0.001 to 0.925, with a mean of 0.408 and a standard deviation of 0.280.
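To see why such a measure stays between 0 and 1, here is a minimal sketch that assumes the index is computed as one minus the sum of squared group shares. That formula is a common construction for fractionalization indices, but the problem set does not state it, so treat it as an assumption. The function name `frac_index` and the share vectors are hypothetical.

```r
# Assumed formula: 1 - sum(share^2), the probability that two people drawn
# at random (with replacement) belong to different groups.
frac_index <- function(shares) {
  stopifnot(abs(sum(shares) - 1) < 1e-8)  # shares must sum to 1
  1 - sum(shares^2)
}

frac_homog   <- frac_index(c(1))            # one group only
frac_two     <- frac_index(c(0.5, 0.5))     # two equal groups
frac_diverse <- frac_index(rep(0.01, 100))  # one hundred equal groups

c(homogeneous = frac_homog, two_groups = frac_two, diverse = frac_diverse)
```

Under this assumed formula the index is 0 with a single group and approaches, but never reaches, 1 as the population splits into more and smaller groups.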

Say you believe that increased ethnic fractionalization causes war. In this case, what is your independent and dependent variable? Make a scatterplot that shows the relationship between these two variables, including a regression line. Describe the relationship you find. What happens to the predicted number of wars as you move from no ethnic fractionalization to the highest possible value of ethnic fractionalization?

#IV: ethfrac, how ethnically fractionalized the country is
#DV: war, how many years the country had a civil war
plot(fl2$ethfrac, fl2$war,
     xlab = "Ethnic Fractionalization",
     ylab = "Years of Civil War",
     main = "The Relationship Between Ethnic Fractionalization and War")
# Fit the regression of war on ethfrac and add the fitted line to the plot.
model1 <- lm(war~ethfrac, data=fl2)
abline(model1, col = "purple")

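#The regression line slopes upward, so war and ethnic fractionalization are positively related, but the correlation is weak. The predicted number of war years rises as ethnic fractionalization increases. Moving from no fractionalization (ethfrac = 0) to the highest possible value (ethfrac = 1) changes the predicted number of war years by the slope coefficient of model1, since that coefficient is the predicted change in war for a one-unit increase in ethfrac.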
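The link between the slope and the change in prediction across the 0 to 1 range can be checked directly. This is a sketch on simulated data, not fl2: the data frame `sim`, its coefficients, and the seed are all made up for illustration.

```r
# Simulated stand-in data (NOT fl2): x lies in [0, 1], y is a noisy outcome.
set.seed(1)
sim <- data.frame(x = runif(200))
sim$y <- 3 + 4 * sim$x + rnorm(200, sd = 5)

fit <- lm(y ~ x, data = sim)

# Predicted outcome at the two ends of the 0-1 scale.
pred <- predict(fit, newdata = data.frame(x = c(0, 1)))

# Because the scale is exactly one unit wide, the change in the
# prediction equals the fitted slope.
change <- unname(pred[2] - pred[1])
slope  <- unname(coef(fit)["x"])

c(change = change, slope = slope)
```

For the model in this answer, the corresponding calls would be `coef(model1)` and `predict(model1, newdata = data.frame(ethfrac = c(0, 1)))`.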

Question 2.


Suppose you have a random variable \(X\) with expectation \(E[X]=u\), and variance given by \(s^2\). You then draw multiple observations from the same distribution. That is, you draw \(X_1, X_2, \dots, X_n\), each a random variable with expectation \(u\) and variance \(s^2\).

When you average these random variables together, what is the result called? How do you write it mathematically?

This is called the sample mean. Mathematically:

\(\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \dots + X_n}{n}\)

What is SD(X)?

The standard deviation is the square root of the variance. Since \(Var(X) = s^2\):

\(SD(X) = \sqrt{Var(X)} = \sqrt{E[(X - u)^2]} = s\)

What is \(E[\overline{X}]\)? Explain with math and words.

Proof:

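\(E[\overline{X}] = E\left[\frac{X_1 + X_2 + \dots + X_n}{n}\right]\)
\(E[\overline{X}] = \frac{1}{n}\left(E[X_1] + E[X_2] + \dots + E[X_n]\right)\) (by linearity of expectation)
\(E[\overline{X}] = \frac{1}{n}\left(u + u + \dots + u\right)\) (each \(X_i\) has expectation \(u\))
\(E[\overline{X}] = \frac{1}{n}(nu) = u\)

In words: the expected value of the sample mean equals the expectation of the underlying distribution, so on average the sample mean hits the population mean.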

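What is \(Var(\overline{X})\)?

\(Var(\overline{X}) = s^2/n\), so as the sample size increases, the variance of the sample mean decreases.

Proof (assuming the draws are independent, so the variance of the sum is the sum of the variances):

\(Var(\overline{X}) = Var\left(\frac{X_1 + X_2 + \dots + X_n}{n}\right)\)
\(Var(\overline{X}) = Var\left(\frac{1}{n}X_1 + \frac{1}{n}X_2 + \dots + \frac{1}{n}X_n\right)\)
\(Var(\overline{X}) = \frac{1}{n^2}Var(X_1) + \frac{1}{n^2}Var(X_2) + \dots + \frac{1}{n^2}Var(X_n)\)
\(Var(\overline{X}) = \frac{1}{n^2}\left(s^2 + s^2 + \dots + s^2\right)\)
\(Var(\overline{X}) = \frac{1}{n^2}(ns^2) = \frac{s^2}{n}\)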
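Both results can be checked numerically. The sketch below is an illustration under assumed settings: it draws from an exponential distribution (chosen because it is skewed, to show the results do not rely on normality), and the sample size, number of repetitions, rate, and seed are arbitrary choices.

```r
set.seed(123)
n    <- 25      # observations per sample
reps <- 20000   # number of repeated samples
rate <- 0.5     # Exponential(rate): mean = 1/rate = 2, variance = 1/rate^2 = 4

# Draw `reps` independent samples of size n and keep each sample's mean.
xbars <- replicate(reps, mean(rexp(n, rate = rate)))

mean_of_xbars <- mean(xbars)  # theory: equals the population mean, 2
var_of_xbars  <- var(xbars)   # theory: population variance / n = 4 / 25 = 0.16

c(mean = mean_of_xbars, variance = var_of_xbars)
```

The simulated values should land close to, but not exactly on, the theoretical ones, because only a finite number of samples is drawn.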

Question 3.

In your own words, explain the difference between these terms and put them in a logical sequence: Estimate, Estimand, Estimator. Give one example of each.

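Estimand: the quantity about the population that we want to learn. Example: the population mean of GDP per capita.

Estimator: the rule or formula applied to a sample in order to learn about the estimand. Example: the sample mean, \(\overline{X} = (X_1 + X_2 + \dots + X_n)/n\). The sample median is another estimator.

Estimate: the specific number the estimator produces for one particular sample. Example: the value 2.46 returned by mean(fl2$gdpenl).

Logical sequence: first decide on the estimand, then choose an estimator for it, then apply the estimator to the data to get an estimate.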
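The three terms map onto three different objects in code. In this sketch the population is simulated, so the estimand is known by construction; the numbers, the seed, and the function name `estimator` are assumptions for illustration only.

```r
set.seed(7)

# Estimand: the true population mean. Known here only because we simulate.
estimand <- 10

# Estimator: a rule that turns any sample into a single number.
estimator <- function(sample_values) mean(sample_values)

# Estimate: the number the rule returns for one particular sample.
one_sample <- rnorm(50, mean = estimand, sd = 3)
estimate   <- estimator(one_sample)

estimate
```

Drawing a new sample and calling `estimator()` again would give a different estimate of the same estimand, which is why the estimator is treated as a random variable and the estimate as a fixed number.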

If you repeatedly draw a sample and take a mean, how will the distribution of the mean change if the variance of the underlying population is small versus big? If your sample size is small versus big? Explain.

If the variance of the underlying population is small, the sample means will be tightly clustered around the population mean, so the distribution of the sample mean is tall and narrow. If the population variance is large, the sample means vary more from sample to sample, so the distribution is shorter and wider. Sample size works in the opposite direction: because \(Var(\overline{X}) = s^2/n\), a small sample gives a wide distribution, and any one sample mean may fall far from the population mean, while a large sample gives a narrow distribution. In every case the distribution is centered on the population mean; only its spread changes. Collecting a larger sample is therefore the way to make the sample mean a more precise reflection of the population.
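The four combinations of small or large population spread and small or large sample size can be compared in a short simulation. This is a sketch under assumed settings: normal populations, the particular values of `n` and `sigma`, the number of repetitions, and the helper name `sd_of_sample_means` are all illustrative choices.

```r
set.seed(2017)

# Hypothetical helper: standard deviation of many simulated sample means.
sd_of_sample_means <- function(n, sigma, reps = 5000) {
  sd(replicate(reps, mean(rnorm(n, mean = 0, sd = sigma))))
}

# Every combination of sample size and population standard deviation.
grid <- expand.grid(n = c(10, 100), sigma = c(1, 5))

grid$simulated <- mapply(sd_of_sample_means, grid$n, grid$sigma)
grid$theory    <- grid$sigma / sqrt(grid$n)   # SD of the sample mean

grid
```

Each row's simulated spread should sit close to `sigma / sqrt(n)`: it grows with the population standard deviation and shrinks as the sample size grows.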

PS 15: Problem Set 2 (Due 20 October 2017 at 6pm)

Prof. Stokes

September 29 2017
