Submit your HTML output and .Rmd file to Gauchospace by the deadline.

Introduction: Reminders about R and Rmarkdown

Please make sure you have downloaded this file (pset2.Rmd) to your computer and opened it in RStudio. By download, we do not mean you just clicked on it in your browser – we mean you have saved the actual file to a directory on your computer (as pset2.Rmd, not as pset2.Rmd.txt!), and then opened it with RStudio. You should now be looking at the “raw” text of the .Rmd file.

If you need to re-orient yourself, please review the introductory material that pset 1 began with describing how to include R code “chunks” into this .rmd file. Remember that when you “knit” the Rmd file, only the code written into code “chunks” will be executed and have its results integrated into the output html file. For example, the code chunk below provides a summary of the built-in dataset called cars. Take a look at the .rmd code that produces it, then click “knit” and see how it shows up in the outputted html file:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Please also remember that you will want to use the console to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your rmd. Remember that the code within your rmd file has to be self-contained and include all the steps – your rmd file will not “remember” what you did on your own in the console. When you click knit, it can only execute the code that was present in the rmd. Do not copy the results from your console into your RMD file. In addition, do not include large amounts of output in your writeup (i.e. don’t print full datasets to the screen).

Include both the code to get your answer and your answer in words.

Finally, it is best to work with small amounts of code at a time: get some code working, copy it into the rmd as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. This will save you huge headaches.

Make sure your final Rmd file knits correctly, and check as you work – don’t wait until the very end to try knitting your code.


Question 1.

  (a) Load the Fearon and Laitin data set (fl3.RData).
getwd()
## [1] "/home/jovyan/Political Science 15 Data Sets/Problem set 2"
load("fl3.RData")
  (b) What are the names of the variables stored in this dataset? How many variables do you have? What is your sample size, given this data set? (PLEASE DO NOT PRINT THE WHOLE DATASET IN YOUR OUTPUT!)

NAMES: cname, year, pop1, lpopl1, warl, war, gdpenl, Imtnest, ncontig, Oil, nwstate, instab, polity2l, ethfrac, relfrac, war_prop, numyears

dim(fl3)
## [1] 156  17

Sample size: 156. Number of variables: 17.
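As a minimal sketch of how these counts can be obtained without printing the data, the code below uses the built-in `cars` data set as a stand-in for `fl3` (an assumption made only so the example is self-contained). The same calls should work on `fl3` once it is loaded.

```r
# Stand-in data: the built-in cars data set (an assumption; not fl3).
dat <- cars

var_names <- names(dat)   # variable names, without printing the data
n_vars    <- ncol(dat)    # number of variables (columns)
n_obs     <- nrow(dat)    # sample size (rows)

var_names
n_vars
n_obs
```

Using `names()` in a code chunk also avoids retyping the variable names by hand.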

  (c) The variable pop1 is population, measured in thousands of people. Show the sample distribution of this variable. Specifically, create a density plot, and a boxplot. Remember, plots need to be labelled.
plot(density(fl3$pop1), main = "Density Distribution of Population (in thousands)", xlab = "Population")

boxplot(fl3$pop1, main = "Boxplot of Population (in thousands)", ylab = "Population")

  (d) Remark on the shape of the distribution (density plot) in part (c). Compute the median and mean of pop1 and report their values in your code. Then add these values to your chart as lines. Comment on whether the mean and median are the same and explain why or why not.
summary(fl3$pop1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     222    1856    4517   17586   11232  553269

Distribution: strongly right-skewed. Mean: 17586. Median: 4517. The mean and median are not the same: the mean is almost four times the median. The mean is more affected by outliers than the median because it averages every value, so a few countries with very large populations pull it upward. The median is simply the middle value separating the higher half from the lower half, so it does not depend on how extreme the largest values are.

plot(density(fl3$pop1),main = "Density Distribution of Population (in thousands)", xlab = "Population")
abline(v = mean(fl3$pop1), col="red", lwd=3, lty=1)
abline(v = median(fl3$pop1), col="blue", lwd=3, lty=2)
legend("topright", legend = c("mean", "median"), lty = c(1,2), col = c("red","blue"))

boxplot(fl3$pop1, main = "Boxplot of Population (in thousands)", ylab = "Population")
abline(h = mean(fl3$pop1), col="red", lwd=3, lty=1)
abline(h = median(fl3$pop1), col="blue", lwd=3, lty=2)
legend("topright", legend = c("mean", "median"), lty = c(1,2), col = c("red","blue"))

  (e) Repeat (c) and (d), but this time show the distribution of log(pop1) using a density plot and a boxplot. Remark on the difference in shape when using the log of the variable. Are your mean and median closer together or farther apart? Why?
logpop1 = log(fl3$pop1)
mean(logpop1)
## [1] 8.505309
median(logpop1)
## [1] 8.415493
plot(density(logpop1), main = "Density Distribution of log(Population in thousands)", xlab = "log(Population)")
abline(v = mean(logpop1), col="red", lwd=3, lty=1)
abline(v = median(logpop1), col="blue", lwd=3, lty=2)
legend("topright", legend = c("mean", "median"), lty = c(1,2), col = c("red","blue"))

boxplot(logpop1, main = "Boxplot of log(Population in thousands)", ylab = "log(Population)")
abline(h = mean(logpop1), col="red", lwd=3, lty=1)
abline(h = median(logpop1), col="blue", lwd=3, lty=2)
legend("topright", legend = c("mean", "median"), lty = c(1,2), col = c("red","blue"))

Mean of log(pop1): 8.505309. Median of log(pop1): 8.415493. The distribution of log(pop1) is much more symmetric and looks closer to a normal distribution. The log function compresses large values far more than small ones, which pulls in the long right tail and reduces the influence of the very large countries on the mean. This is why the mean and median are much closer together in the log plots above than on the original scale.
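The effect of the log transformation on the gap between mean and median can be illustrated with simulated data. This is only a sketch: `pop_sim` is a hypothetical log-normal sample, not the `fl3` data, and the seed and parameters are arbitrary assumptions.

```r
set.seed(15)  # arbitrary seed so the sketch is reproducible

# Simulated right-skewed "populations" (log-normal); not the fl3 data.
pop_sim <- rlnorm(1000, meanlog = 8.5, sdlog = 1.5)

# Gap between mean and median on the original scale
gap_raw <- mean(pop_sim) - median(pop_sim)

# Gap between mean and median after taking logs
gap_log <- mean(log(pop_sim)) - median(log(pop_sim))

# Express each gap relative to the spread of its own variable,
# so the two scales can be compared directly
gap_raw / sd(pop_sim)
gap_log / sd(log(pop_sim))
```

Dividing each gap by the standard deviation of its own variable puts the two on a comparable footing, since the raw and logged values are measured in different units.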

In the same dataset, the variable Oil describes whether each country in the dataset is an oil exporter (Oil=1) or not (Oil=0). The variable war describes how many years from 1945 to 1999 that country had a civil war. The variable ethfrac is a measure of how fractionalized ethnic groups are in a given country – specifically, it’s the probability that two people randomly drawn from a given country are from the same (0) or different (1) groups.

  (f) What is the mean value of war for oil exporters? What is the mean value of war for non oil exporters? What is the standard deviation for both groups? What does this difference in standard deviations suggest about how much variation there is in war for oil exporters versus non oil exporters?
Oilexp <- fl3[which(fl3$Oil>=1), ]
NonOilexp <- fl3[which(fl3$Oil<1),]
mean(Oilexp$war)
## [1] 6.055556
mean(NonOilexp$war)
## [1] 5.57971
sd(Oilexp$war)
## [1] 11.25884
sd(NonOilexp$war)
## [1] 10.43003
sd(Oilexp$war)-sd(NonOilexp$war)
## [1] 0.8288061

Mean value of war for oil exporters: 6.055556. Mean value of war for non oil exporters: 5.57971. Standard deviation for oil exporters: 11.25884. Standard deviation for non oil exporters: 10.43003. Difference in standard deviations: 0.8288061. The standard deviations are close, so the amount of variation in war is similar in the two groups, with slightly more variation among oil exporters. In both groups the standard deviation is nearly twice the mean, which indicates that years of war vary widely from country to country.
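Group means and standard deviations can also be computed in one step with `tapply()`, without creating separate subsets. The sketch below uses a hypothetical six-row data frame named `toy` rather than `fl3`; only the column names `war` and `Oil` are borrowed from the data set.

```r
# Hypothetical toy data: six countries, years of war and an oil-exporter flag.
toy <- data.frame(
  war = c(0, 10, 2, 4, 6, 0),
  Oil = c(1, 1, 0, 0, 0, 0)
)

# Mean and standard deviation of war within each Oil group
group_means <- tapply(toy$war, toy$Oil, mean)
group_sds   <- tapply(toy$war, toy$Oil, sd)

group_means
group_sds
```

Each result is a named vector with one entry per value of `Oil`, so both groups can be read off side by side.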

  (g) Describe the ethfrac variable: what is the minimum and maximum? What is the mean value? What is the standard deviation?
max(fl3$ethfrac)
## [1] 0.9250348
min(fl3$ethfrac)
## [1] 0.001
sd(fl3$ethfrac)
## [1] 0.2798512
mean(fl3$ethfrac)
## [1] 0.4082564

minimum: 0.001 maximum: 0.9250348 mean: 0.4082564 standard deviation: 0.2798512

  (h) Say you believe that increased ethnic fractionalization causes war. In this case, what is your independent and dependent variable? Make a scatterplot that shows the relationship between these two variables, including a regression line. Describe the relationship you find. What happens to the predicted number of wars as you move from no ethnic fractionalization to the highest possible value of ethnic fractionalization?

Independent variable = ethnic fractionalization. Dependent variable = years of civil war.

model1 <- lm(war ~ ethfrac, data = fl3) 
plot(fl3$ethfrac, fl3$war,
     main = "Ethnic Fractionalization versus\nYears of Civil War (1945-1999)",
     xlab = "Ethnic Fractionalization", ylab = "Years of Civil War")
abline(model1, col = "red")

The relationship between ethnic fractionalization and years of civil war is weak and positive. The regression line slopes gently upward, so the predicted number of years of civil war increases modestly as we move from no ethnic fractionalization to the highest possible value.
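To put a number on the change in the prediction, the fitted model can be evaluated at the lowest and highest values of the independent variable. The following is a sketch on hypothetical toy data, not on `fl3`; the data frame `toy` and its values are assumptions chosen for illustration.

```r
# Hypothetical toy data standing in for ethfrac (x) and war (y).
toy <- data.frame(
  ethfrac = c(0, 0.2, 0.4, 0.6, 0.8, 1),
  war     = c(1, 3, 2, 5, 4, 6)
)

fit <- lm(war ~ ethfrac, data = toy)

# Predicted years of war at the lowest and highest fractionalization values
ends <- data.frame(ethfrac = c(min(toy$ethfrac), max(toy$ethfrac)))
pred <- predict(fit, newdata = ends)

# The change in the prediction equals slope * (max - min)
change <- unname(pred[2] - pred[1])
change
```

With the real data, the same approach would replace `toy` with `fl3`, and the slope from `coef(model1)` would give the predicted change per one-unit increase in ethfrac.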

Question 2.

Suppose you have a random variable \(X\) with expectation \(E[X]=u\), and variance given by \(s^2\). You then draw multiple observations from the same distribution. That is, you draw \(X_1, X_2, \ldots, X_n\), each a random variable with expectation \(u\) and variance \(s^2\).

  (a) When you average these random variables together, what is it called? How do you write it mathematically?

The average of these random variables is called the sample mean. The sample mean is used to estimate the population mean \(u\). \[\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\]

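  (b) What is the standard deviation of these random variables? How do you write it mathematically?

Each individual \(X_i\) has standard deviation \(s\). The standard deviation of their average \(\overline{X}\) is called the standard error: \[SE(\overline{X}) = \sqrt{Var(\overline{X})} = \frac{s}{\sqrt{n}}\]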

  (c) What is \(E[\overline{X}]\)? Explain with math and words.

\(E[\overline{X}]\) is the expectation of the sample mean: the average value that \(\overline{X}\) would take across many repeated samples. By linearity of expectation, \[E[\overline{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n u = u\] So the sample mean is an unbiased estimator of the population mean \(u\).

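  (d) What is \(Var(\overline{X})\)?

\(Var(\overline{X})\) is the variance of the sample mean, which describes how much \(\overline{X}\) varies from sample to sample. Assuming the draws are independent, \[Var(\overline{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{s^2}{n}\] That is, it equals the variance of \(X\) divided by the sample size. Because \(s^2\) is usually unknown, it is estimated from the sample: \[\widehat{Var}(\overline{X}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}(X_i-\overline{X})^2\]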
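These two results can be checked with a short simulation. The sketch below assumes a normal population with mean `u = 10` and standard deviation `s = 4`; those values, the sample size, and the seed are arbitrary choices for illustration.

```r
set.seed(2)  # arbitrary seed for reproducibility

u <- 10      # assumed population mean
s <- 4       # assumed population standard deviation
n <- 25      # sample size

# Draw 5000 samples of size n and record each sample mean
xbar <- replicate(5000, mean(rnorm(n, mean = u, sd = s)))

mean(xbar)   # should be close to u
var(xbar)    # should be close to s^2 / n = 0.64
```

The simulated values will not match the theoretical ones exactly, because only a finite number of samples is drawn.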

Question 3.

  (a) In your own words, explain the difference between these terms and put them in a logical sequence: Estimate, Estimand, Estimator. Give one example of each.
  1. The estimand is the object of interest, the quantity we want to learn about. For example, suppose we want to know the average number of asthma patients per high school in Los Angeles County. Our estimand would be the “average number of asthma patients per high school” across all high schools in the county.

  2. The estimator is the rule or procedure we apply to sample data to approximate the estimand. In our hypothetical, it would be too costly to count every asthma patient in every high school in Los Angeles County, so we would draw a sample of high schools instead. Our estimator is the sample mean: add up the number of asthma patients in each sampled school and divide by the number of schools sampled.

  3. The estimate is the number we get when we apply the estimator to a particular sample. In our hypothetical, if the sample mean came out to a value such as 12 asthma patients per high school, then 12 would be our estimate of the estimand. The logical sequence is therefore estimand (what we want to know), then estimator (how we will approximate it), then estimate (the number we obtain).
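The three terms can be lined up in a few lines of code. This is a sketch with an entirely hypothetical population of 500 high schools generated from a Poisson distribution; none of the numbers come from real data.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Hypothetical population: asthma patients in each of 500 high schools.
population <- rpois(500, lambda = 12)

estimand  <- mean(population)               # the true quantity of interest
estimator <- function(x) mean(x)            # the rule: take the sample mean
schools   <- sample(population, size = 40)  # the sample we actually observe
estimate  <- estimator(schools)             # the number the rule produces

estimand
estimate
```

In practice the estimand is unknown. It can be computed here only because the whole population was simulated.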

  (b) If you repeatedly draw a sample and take a mean, how will the distribution of the mean change if the variance of the underlying population is small versus big? If your sample size is small versus big? Explain.

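The distribution of the mean is centered on \(E[\overline{X}] = E[X]\) and has variance \(Var(\overline{X}) = Var(X)/n\). If the variance of the underlying population is small, the sample means cluster tightly around the population mean; if it is big, they are more spread out. Likewise, a big sample size shrinks the variance of the mean, while a small sample size inflates it. By the Law of Large Numbers, the sample mean converges to the population mean as the sample size grows. By the Central Limit Theorem (CLT), the distribution of the mean tends towards a normal distribution as the sample size grows, even when the population itself is not normal.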
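A small simulation can make the comparison concrete. In this sketch, `sd_of_means` is a hypothetical helper written for illustration. It draws repeated samples from a normal population and returns the standard deviation of the resulting sample means; the sample sizes, population standard deviations, and seed are arbitrary assumptions.

```r
set.seed(4)  # arbitrary seed for reproducibility

# Standard deviation of the sample mean across many repeated samples.
# sd_of_means is a hypothetical helper written for this sketch.
sd_of_means <- function(n, pop_sd, reps = 2000) {
  sd(replicate(reps, mean(rnorm(n, mean = 0, sd = pop_sd))))
}

small_n <- sd_of_means(n = 10,  pop_sd = 1)  # small sample, small variance
big_n   <- sd_of_means(n = 100, pop_sd = 1)  # big sample, small variance
big_var <- sd_of_means(n = 10,  pop_sd = 5)  # small sample, big variance

c(small_n = small_n, big_n = big_n, big_var = big_var)
```

The spread of the sample means should shrink when the sample size grows and widen when the population variance grows, in line with \(Var(X)/n\).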