PS 15: Problem Set 2 (Due February 9th, 2018 at 6pm)

Submit your HTML output and .Rmd file to Gauchospace by the deadline.

Question 1.

Load the Fearon and Laitin data set (fl2).

setwd("/Users/mikeschaible/Desktop/PS-15")
getwd()

## [1] "/Users/MikeSchaible/Desktop/PS-15"

load("fl2.rdata")
summary(fl2)

##     cname                year           warl              war        
##  Length:156         Min.   :1945   Min.   :0.00000   Min.   : 0.000  
##  Class :character   1st Qu.:1947   1st Qu.:0.00000   1st Qu.: 0.000  
##  Mode  :character   Median :1954   Median :0.00000   Median : 0.000  
##                     Mean   :1958   Mean   :0.00641   Mean   : 5.635  
##                     3rd Qu.:1964   3rd Qu.:0.00000   3rd Qu.: 9.000  
##                     Max.   :1993   Max.   :1.00000   Max.   :52.000  
##      gdpenl            lpopl1          lmtnest          ncontig      
##  Min.   : 0.0510   Min.   : 5.403   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.6395   1st Qu.: 7.526   1st Qu.:0.6931   1st Qu.:0.0000  
##  Median : 1.0910   Median : 8.415   Median :2.3174   Median :0.0000  
##  Mean   : 2.4639   Mean   : 8.505   Mean   :2.0975   Mean   :0.1603  
##  3rd Qu.: 2.5940   3rd Qu.: 9.326   3rd Qu.:3.3150   3rd Qu.:0.0000  
##  Max.   :53.9010   Max.   :13.224   Max.   :4.5570   Max.   :1.0000  
##       Oil            nwstate           instab           polity2l       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :-10.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.: -7.0000  
##  Median :0.0000   Median :1.0000   Median :0.00000   Median : -1.0000  
##  Mean   :0.1154   Mean   :0.5192   Mean   :0.03205   Mean   : -0.1154  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:  7.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   : 10.0000  
##     ethfrac          relfrac          war_prop         numyears    
##  Min.   :0.0010   Min.   :0.0000   Min.   :0.0000   Min.   : 3.00  
##  1st Qu.:0.1438   1st Qu.:0.1861   1st Qu.:0.0000   1st Qu.:34.00  
##  Median :0.3850   Median :0.3750   Median :0.0000   Median :43.50  
##  Mean   :0.4083   Mean   :0.3807   Mean   :0.1393   Mean   :40.56  
##  3rd Qu.:0.6691   3rd Qu.:0.5800   3rd Qu.:0.2323   3rd Qu.:53.00  
##  Max.   :0.9250   Max.   :0.7828   Max.   :1.0000   Max.   :55.00

What are the names of the variables stored in this dataset? How many variables do you have? What is your sample size, given this data set? (PLEASE DO NOT PRINT THE WHOLE DATASET IN YOUR OUTPUT!)

colnames(fl2)

##  [1] "cname"    "year"     "warl"     "war"      "gdpenl"   "lpopl1"  
##  [7] "lmtnest"  "ncontig"  "Oil"      "nwstate"  "instab"   "polity2l"
## [13] "ethfrac"  "relfrac"  "war_prop" "numyears"

ncol(fl2)

## [1] 16

nrow(fl2)

## [1] 156
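As a hedged sketch of the same inspection steps, the following uses a small made-up data frame (`toy`, not the real fl2 data) to show how `dim()`, `names()`, `str()`, and `head()` report the variables and sample size without printing the whole dataset.

```r
# Hypothetical toy data frame standing in for fl2 (not the real data).
toy <- data.frame(
  cname = c("A", "B", "C"),
  year  = c(1945, 1960, 1993),
  war   = c(0, 9, 52),
  stringsAsFactors = FALSE
)

dims <- dim(toy)          # c(number of rows, number of columns)
n_obs <- dims[1]          # sample size
n_vars <- dims[2]         # number of variables
var_names <- names(toy)   # same result as colnames() for a data frame

str(toy)                  # compact overview: one line per variable
head(toy, 2)              # first rows only, instead of the whole dataset
```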

The variable gdpenl is GDP per capita, measured in thousands of dollars (using 1985 price).

Show the sample distribution of this variable. Specifically, create a density plot, and a boxplot. Remember, plots need to be labelled.

plot(density(fl2$gdpenl),main="Plot of Density Distribution of GDP Per Capita",ylab = "Density",xlab = "GDP Per Capita(1985 Price)")

boxplot(fl2$gdpenl,main="Boxplot of GDP Per Capita",ylab = "GDP Per Capita(1985 Price)")

Remark on the shape of this distribution. Compute the median and mean and report their values in your code. Then add these values to your chart as lines. Comment on whether the mean and median are the same and explain why or why not.

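The distribution is strongly right-skewed (positively skewed): most countries cluster at low values of GDP per capita, and a long tail extends toward a few very rich countries. In the boxplot this shows up as a box compressed near the bottom with high outliers above it. The mean (2.46) is larger than the median (1.09) because the mean is pulled upward by the extreme high values, while the median depends only on the middle of the ordered data and is resistant to outliers. The two would coincide only if the distribution were roughly symmetric.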

median(fl2$gdpenl)

## [1] 1.091

mean(fl2$gdpenl)

## [1] 2.46391

plot(density(fl2$gdpenl),main="Plot of Density Distribution of GDP Per Capita",ylab = "Density",xlab = "GDP Per Capita(1985 Price)")
abline(v=mean(fl2$gdpenl),col="blue")
abline(v=median(fl2$gdpenl),col="red")

boxplot(fl2$gdpenl,main="Boxplot of GDP Per Capita",ylab = "GDP Per Capita(1985 Price)")
abline(h=mean(fl2$gdpenl),col="blue")
abline(h=median(fl2$gdpenl),col="red")
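The effect of extreme values on the mean and median can be checked on a tiny example. This is only a sketch with made-up numbers (the vector `x` is hypothetical, not taken from fl2): one large value moves the mean a long way while the median stays put.

```r
# Hypothetical values with one large outlier (not from fl2).
x <- c(1, 1, 2, 2, 3, 50)

mean_x <- mean(x)       # pulled upward by the value 50
median_x <- median(x)   # middle of the ordered data

# Drop the outlier and recompute both summaries.
x_trim <- x[x < 50]
mean_trim <- mean(x_trim)
median_trim <- median(x_trim)

c(mean = mean_x, median = median_x)
c(mean = mean_trim, median = median_trim)
```

The median is 2 with or without the outlier, while the mean falls from about 9.83 to 1.8 once the outlier is removed.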

Repeat (c) and (d), but this time show the distribution of log(gdpenl) using a density plot and a boxplot. Remark on the difference in shape when using the log of the variable. Are your mean and median closer together or farther apart? Why?

On the log scale the distribution is much more symmetric and closer to bell-shaped. The mean (0.238) and median (0.087) are closer together because the log transformation compresses large values more than small ones, which pulls in the long right tail so that the extreme high values no longer drag the mean far above the median.

log(fl2$gdpenl)

##   [1]  2.031563452  1.646348305  0.754712580 -0.267879464 -0.403467131
##   [6]  0.593326814  1.792092723  0.607044502  0.137149799 -0.107585209
##  [11] -0.068278824 -0.244622593  0.222343194  0.068592773  0.291923057
##  [16]  1.890095353  0.539995978 -0.030459178 -0.576253471 -0.123298213
##  [21]  0.035367157  0.030529223  0.807368371  0.807814269  1.238953745
##  [26]  1.541373182  0.693647019  1.346773610  1.292258378  0.972292942
##  [31]  1.889642111  0.522358853 -0.057629107  1.229932907  0.846726276
##  [36] -0.401971175  1.075002446 -0.235722306 -0.211956343  1.367111548
##  [41]  0.955511408  0.820660534 -2.453407949 -1.917322693 -0.214431634
##  [46]  0.711478108 -1.335601203  1.246744922 -2.071473356  0.580538203
##  [51]  1.684174261  1.769684254  1.453953059  1.587805638  1.627867035
##  [56]  1.354803675  1.342342492  1.161274052  1.063675744  1.603017332
##  [61]  1.285368425  1.398469884 -0.136965879 -0.322963918 -0.625488483
##  [66]  0.045928980  0.095310201 -0.248461396 -0.631111780  0.113328690
##  [71] -0.560366104 -0.785262469 -0.507497837 -0.130108661 -0.112049511
##  [76] -1.002393394 -0.444725864 -0.567396025  0.581656824 -0.350976928
##  [81] -0.279713926  0.116003699 -0.715392805 -0.536143468 -0.474815220
##  [86] -1.194022463 -0.669430606 -0.761426005  0.098033781  0.415415430
##  [91] -1.537117234 -0.313341811  0.167207954 -0.094310651  0.078811196
##  [96] -0.991553238  0.672434135  1.048721551 -0.721546652 -0.549912974
## [101]  0.627541437  0.174793278  1.129464785 -0.208254968  0.242946160
## [106]  0.006975583 -0.053400762 -0.153151202  0.509825101  0.077886511
## [111]  0.737642459 -0.570929552 -0.349557500  0.387979778 -0.809681013
## [116]  0.763139543  0.767326599 -2.975929665  0.517602592  0.733809136
## [121]  3.987149049  2.833977786  3.689978818  1.772236773 -1.639897090
## [126]  1.324950753  0.821980094  1.055008644  1.035672458  1.434608210
## [131] -1.487220297 -0.200892920 -0.257476228  0.073250439 -0.679244218
## [136]  0.554459693 -0.527632787 -1.038458360 -0.216912993 -0.112049511
## [141] -1.599487547 -0.049190221 -0.711311158 -0.525939227 -0.941608577
## [146] -1.231001491 -0.454130295 -0.603306493  0.255417119  0.622724696
## [151] -0.778705088 -1.203972765  1.691939102  0.628075159  1.761815585
## [156]  0.952429781

median(log(fl2$gdpenl))

## [1] 0.0870607

mean(log(fl2$gdpenl))

## [1] 0.2382049

plot(density(log(fl2$gdpenl)),main="Plot of Density Distribution of Log GDP Per Capita",ylab = "Density",xlab = "Log GDP Per Capita(1985 Price)")
abline(v=mean(log(fl2$gdpenl)),col="blue")
abline(v=median(log(fl2$gdpenl)),col="red")

boxplot(log(fl2$gdpenl),main="Boxplot of Log GDP Per Capita",ylab = "Log GDP Per Capita(1985 Price)")
abline(h=mean(log(fl2$gdpenl)),col="blue")
abline(h=median(log(fl2$gdpenl)),col="red")
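Why taking logs brings the mean and median together can be illustrated with a deliberately extreme example. The values below are made up (a hypothetical vector whose entries each grow by a factor of 10), so this is a sketch of the mechanism rather than a statement about fl2.

```r
# Hypothetical right-skewed values: each is 10 times the previous one.
x <- c(0.1, 1, 10, 100, 1000)

raw_mean <- mean(x)         # dominated by the largest value
raw_median <- median(x)     # 10

log_mean <- mean(log(x))    # logs are evenly spaced, so the data are symmetric
log_median <- median(log(x))

c(raw_mean = raw_mean, raw_median = raw_median)
c(log_mean = log_mean, log_median = log_median)
```

On the raw scale the mean is far above the median; on the log scale the values are evenly spaced, so the mean and median coincide at log(10).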

In the same dataset, the variable Oil describes whether each country in the dataset is an oil exporter (Oil=1) or not (Oil=0). The variable war describes how many years from 1945 to 1999 that country had a civil war. The variable ethfrac is a measure of how fractionalized ethnic groups are in a given country: specifically, it’s the probability that two people randomly drawn from a given country are from the same (0) or different (1) groups.

What is the mean value of war for oil exporters? What is the mean value of war for non oil exporters? What is the standard deviation for both groups? What does this difference in standard deviations suggest about how much variation there is in war for oil exporters versus non oil exporters?

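Oil exporters averaged about 6.06 years of civil war versus about 5.58 for non oil exporters, a difference of less than half a year. The standard deviation is also slightly larger for oil exporters (11.26 versus 10.43), which suggests somewhat more variation in war among oil exporters. In both groups the standard deviation is roughly twice the mean, so the variation within each group is much larger than the difference between the groups.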

# Oil is numeric (0/1), so compare it to numbers rather than strings
NoOil<-fl2[which(fl2$Oil==0),]
Oil<-fl2[which(fl2$Oil==1),]

mean(NoOil$war)

## [1] 5.57971

mean(Oil$war)

## [1] 6.055556

sd(NoOil$war)

## [1] 10.43003

sd(Oil$war)

## [1] 11.25884
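An alternative to building two separate data frames is to compute the group summaries in one step with `tapply()`. This is a sketch on made-up vectors (`oil` and `war` below are hypothetical, not the fl2 columns), shown only to illustrate the approach.

```r
# Hypothetical data: three non-exporters (0) and three exporters (1).
oil <- c(0, 0, 0, 1, 1, 1)
war <- c(0, 2, 4, 0, 5, 10)

# Apply a summary function to war within each value of oil.
group_means <- tapply(war, oil, mean)
group_sds <- tapply(war, oil, sd)

group_means   # mean of war for oil == 0 and oil == 1
group_sds     # standard deviation of war for each group
```

With the real data the same pattern would be `tapply(fl2$war, fl2$Oil, mean)`, assuming fl2 is loaded as above.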

Describe the ethfrac variable: what is the minimum and maximum? What is the mean value? What is the standard deviation? Why does the variable range from 0 to 1?

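The variable ranges from 0 to 1 because it is a probability, and every probability lies between 0 (impossible) and 1 (certain); 0.5, for example, corresponds to a 50% chance. As the question states, “it’s the probability that two people randomly drawn from a given country are from the same (0) or different (1) groups.” A value near 0 means two randomly drawn people almost always belong to the same group, and a value near 1 means they almost always belong to different groups. In this sample ethfrac runs from a minimum of 0.001 to a maximum of 0.925, with a mean of 0.408 and a standard deviation of 0.280.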

min(fl2$ethfrac)

## [1] 0.001

max(fl2$ethfrac)

## [1] 0.9250348

mean(fl2$ethfrac)

## [1] 0.4082564

sd(fl2$ethfrac)

## [1] 0.2798512

Say you believe that increased ethnic fractionalization causes war. In this case, what is your independent and dependent variable? Make a scatterplot that shows the relationship between these two variables, including a regression line. Describe the relationship you find. What happens to the predicted number of wars as you move from no ethnic fractionalization to the highest possible value of ethnic fractionalization?

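The independent variable is ethnic fractionalization (ethfrac) and the dependent variable is war. There is a positive but weak relationship: the regression line slopes upward, so the predicted value of war increases as ethnic fractionalization increases, but the points are widely scattered around the line. Moving from no ethnic fractionalization (0) to the highest possible value (1) raises the predicted value of war by the slope of the regression line. The spread of war also appears larger at the highest levels of ethnic fractionalization.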

plot(fl2$ethfrac, fl2$war, ylab = "War", xlab = "Ethnic Fractionalization", main = "Ethnic Fractionalization and War")
abline(lm(fl2$war~fl2$ethfrac),col="red")
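To put a number on the change in predicted war between ethfrac = 0 and ethfrac = 1, one option is to read the slope from `coef()` or to call `predict()` at both values. The sketch below uses a made-up data frame (`toy`, with hypothetical values), so the numbers are illustrative only.

```r
# Hypothetical data, not the fl2 values.
toy <- data.frame(
  ethfrac = c(0, 0.25, 0.5, 0.75, 1),
  war     = c(1, 3, 2, 6, 5)
)

# Regress war on ethfrac using the data argument.
fit <- lm(war ~ ethfrac, data = toy)
slope <- unname(coef(fit)["ethfrac"])

# Predicted war at no fractionalization (0) and at the maximum (1).
pred <- predict(fit, newdata = data.frame(ethfrac = c(0, 1)))
change <- unname(pred[2] - pred[1])

slope    # change in predicted war per one-unit change in ethfrac
change   # equals the slope, because ethfrac moves by exactly one unit
```

Because the move from 0 to 1 is a one-unit change, the difference in predictions is the slope itself.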

Question 2.

Suppose you have a random variable \(X\) with expectation \(E[X]=u\), and variance given by \(s^2\). You then draw multiple observations from the same distribution. That is, you draw \(X_1, X_2, \ldots, X_n\), each a random variable with expectation \(u\) and variance \(s^2\).

When you average these random variables together, what is this called? How do you write it mathematically?

It is called the sample mean, written mathematically as \(\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i\).
What is SD(X)?

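It is the standard deviation of \(X\), the square root of the variance: \(SD(X) = \sqrt{Var(X)} = \sqrt{s^2} = s\). It measures how far values of \(X\) typically fall from the expectation \(u\).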
What is \(E[\overline{X}]\)? Explain with math and words.

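It is the expected value of the sample mean, and it equals the expectation of a single draw: \(E[\overline{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}(nu) = u\). In words, the sample mean is right on average. For example, if Obama's true approval is 65%, then over many repeated samples the sample means would average out to 65%, or 0.65, even though any single sample mean may fall above or below it.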
What is \(Var(\overline{X})\)?

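It is the variance of the sample mean. Assuming the draws are independent, \(Var(\overline{X}) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2}(ns^2) = \frac{s^2}{n}\), so the sample mean becomes less variable as the sample size \(n\) grows.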
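The expectation and variance of the sample mean can be checked by simulation. This is a rough sketch under assumed values: the draws are taken to be independent and normal with \(u = 5\) and \(s = 2\), and the sample size and number of repetitions are arbitrary choices.

```r
set.seed(1)      # fixed seed so the simulation is reproducible

u <- 5           # assumed expectation of each draw
s <- 2           # assumed standard deviation of each draw
n <- 25          # observations per sample
reps <- 20000    # number of repeated samples

# Draw `reps` samples of size n and keep the mean of each one.
xbar <- replicate(reps, mean(rnorm(n, mean = u, sd = s)))

sim_mean <- mean(xbar)   # should be close to u
sim_var <- var(xbar)     # should be close to s^2 / n
theory_var <- s^2 / n    # 0.16 under these assumed values

c(sim_mean = sim_mean, sim_var = sim_var, theory_var = theory_var)
```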

Question 3.

In your own words, explain the difference between these terms and put them in a logical sequence: Estimate, Estimand, Estimator. Give one example of each.

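The logical sequence is estimand, estimator, estimate. The estimand is the quantity you want to learn about, the estimator is the rule or method used to calculate a value from the data, and the estimate is the numerical value that the estimator produces for a particular sample. If you want to know the average number of waves per hour at Campus Point, that average is the estimand, the method you use to calculate it from your counts (for example, the sample mean of the hours you observed) is the estimator, and the number you get is the estimate.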
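The three terms map onto code fairly directly. The sketch below uses made-up wave counts (the vectors `all_hours` and `observed` are hypothetical) and assumes the sample mean is the chosen estimator.

```r
# Hypothetical population: waves per hour for every hour of interest.
all_hours <- c(2, 4, 6, 8, 10)

# Estimand: the true average, usually unknown in practice.
estimand <- mean(all_hours)

# Estimator: the rule applied to whatever sample is observed.
estimator <- function(sample_counts) mean(sample_counts)

# Estimate: the number the estimator returns for one particular sample.
observed <- c(4, 6, 10)
estimate <- estimator(observed)

c(estimand = estimand, estimate = estimate)
```

The estimate (about 6.67) differs from the estimand (6) because it is computed from only three of the five hours.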
If you repeatedly draw a sample and take a mean, how will the distribution of the mean change if the variance of the underlying population is small versus big? If your sample size is small versus big? Explain.

The spread of the distribution of sample means depends on both factors, since the variance of the sample mean is the population variance divided by the sample size. If the variance of the underlying population is small, the sample means cluster tightly around the population mean; if it is large, they are more spread out. Likewise, a small sample size gives sample means that vary a lot from sample to sample, while a large sample size gives sample means concentrated near the population mean (the law of large numbers). With a large sample size the distribution of the sample means also looks more normal (the central limit theorem).
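Both comparisons can be sketched with a small simulation. The helper `spread_of_means()` below is hypothetical (written for this illustration), and the normal population, sample sizes, and standard deviations are assumed values.

```r
set.seed(2)

# Hypothetical helper: standard deviation of `reps` sample means,
# each computed from n normal draws with population sd `s`.
spread_of_means <- function(n, s, reps = 5000) {
  sd(replicate(reps, mean(rnorm(n, mean = 0, sd = s))))
}

# Same sample size, small versus big population variance.
small_var <- spread_of_means(n = 30, s = 1)
big_var <- spread_of_means(n = 30, s = 5)

# Same population variance, small versus big sample size.
small_n <- spread_of_means(n = 10, s = 2)
big_n <- spread_of_means(n = 100, s = 2)

c(small_var = small_var, big_var = big_var)
c(small_n = small_n, big_n = big_n)
```

Each result should land near the theoretical value `s / sqrt(n)`, so the means are more spread out when the population variance is big or the sample size is small.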


Micahel Schaible

Feb 9th
