Problem Set 1 (Due 31 Jan)

## Question 1.

(a) Load the Fearon and Laitin data set (fl3).

getwd()

## [1] "/Users/matthewross/Downloads"

load("fl3.RData")

(b) What are the names of the variables stored in this dataset? How many variables do you have? What is your sample size, given this data set? (PLEASE DO NOT PRINT THE WHOLE DATASET IN YOUR OUTPUT!)

Variable names: cname, year, pop1, lpop1, warl, war, gdpenl, lmtnest, ncontig, Oil, nwstate, instab, polity2l, ethfrac, relfrac, war_prop, numyears. There are 17 variables.

dim(fl3)

## [1] 156  17

The sample size is 156 given this data set.
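One way to answer (b) without printing the data is to ask R for the names and dimensions directly. The sketch below uses a small made-up data frame, `toy`, as a stand-in for fl3; its columns and values are assumptions for illustration, not taken from the data set.

```r
# Hypothetical stand-in for fl3; these values are invented for illustration.
toy <- data.frame(
  cname = c("A", "B", "C", "D"),
  pop1  = c(500, 2000, 4500, 90000),
  war   = c(0, 0, 5, 12)
)

var_names <- names(toy)  # variable names
n_vars    <- ncol(toy)   # number of variables (columns)
n_obs     <- nrow(toy)   # sample size (rows)

print(var_names)
cat("variables:", n_vars, " observations:", n_obs, "\n")
```

Applied to fl3 itself, the same three calls would report the variable names, the number of variables, and the sample size without printing any rows.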

The variable pop1 is population, measured in thousands of people.

(c) Show the sample distribution of this variable. Specifically, create a density plot, and a boxplot. Remember, plots need to be labelled.

plot(density(fl3$pop1),main = "Density Distribution of Population (in thousands)")

boxplot(fl3$pop1, main = "Boxplot of Population (in thousands)")

(d) Remark on the shape of this distribution. Compute the median and mean and report their values in your code. Then add these values to your chart as lines. Comment on whether the mean and median are the same and explain why or why not.

summary(fl3$pop1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     222    1856    4517   17586   11232  553269

The distribution is strongly right skewed. The mean is 17586. The median is 4517. The mean and median are not the same because they measure two distinct qualities. The mean is the average value of the data set, while the median is the point separating the higher half of the data set from the lower half. Therefore, the mean is affected far more by outliers than the median is. In pop1, a few very large countries pull the mean well above the median, which is what the gap between the two values shows.

plot(density(fl3$pop1),main = "Density Distribution of Population (in thousands)", xlab = "Population")
abline(v = mean(fl3$pop1), col="red", lwd=3, lty=1)
abline(v = median(fl3$pop1), col="blue", lwd=3, lty=2)
legend("topright", legend = c("mean", "median"), lty = c(1,2), col = c("red","blue"))

boxplot(fl3$pop1, main = "Boxplot of Population (in thousands)", ylab = "Population")
abline(h = mean(fl3$pop1), col="red", lwd=3, lty=1)
abline(h = median(fl3$pop1), col="blue", lwd=3, lty=2)
legend("topright", legend = c("mean", "median"), lty = c(1,2), col = c("red","blue"))
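The claim in (d) that outliers move the mean but not the median can be checked on a tiny example. The two vectors below are made up for illustration and differ only in their largest value.

```r
# Small made-up vectors: identical except for one extreme value.
x     <- c(1, 2, 3, 4, 5)
x_out <- c(1, 2, 3, 4, 500)

print(mean(x))        # 3
print(median(x))      # 3
print(mean(x_out))    # 102: pulled far to the right by the single outlier
print(median(x_out))  # 3: unchanged
```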

(e) Repeat (c) and (d), but this time show the distribution of log(pop1) using a density plot and a boxplot. Remark on the difference in shape when using the log of the variable. Are your mean and median closer together or farther apart? Why?

logpop1 = log(fl3$pop1)
mean(logpop1)

## [1] 8.505309

median(logpop1)

## [1] 8.415493

plot(density(logpop1),main = "Density Distribution of Log Population (in thousands)", xlab = "log(Population)")
abline(v = mean(logpop1), col="red", lwd=3, lty=1)
abline(v = median(logpop1), col="blue", lwd=3, lty=2)
legend("topright", legend = c("mean", "median"), lty = c(1,2), col = c("red","blue"))

boxplot(logpop1, main = "Boxplot of Log Population (in thousands)", ylab = "log(Population)")
abline(h = mean(logpop1), col="red", lwd=3, lty=1)
abline(h = median(logpop1), col="blue", lwd=3, lty=2)
legend("topright", legend = c("mean", "median"), lty = c(1,2), col = c("red","blue"))

The mean of log(pop1) is 8.505309. The median of log(pop1) is 8.415493. When using the log of the variable, the distribution comes much closer to resembling a normal distribution, because the log function compresses large values far more than small ones and so shortens the long right tail. This reduces the pull of the outliers on the mean, which is why the mean and the median are closer together for the logged variable.
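The effect of the log transformation on the gap between mean and median can be sketched with simulated data. The lognormal distribution and its parameters below are assumptions chosen only to produce a right-skewed variable; they are not fitted to pop1.

```r
set.seed(15)  # arbitrary seed, for reproducibility only

# Simulated right-skewed values (an illustrative assumption, not pop1).
x <- rlnorm(1000, meanlog = 8, sdlog = 1.5)

# Gap between mean and median on the raw scale and on the log scale.
gap_raw <- mean(x) - median(x)            # large: the right tail pulls the mean up
gap_log <- mean(log(x)) - median(log(x))  # small: log(x) is roughly symmetric

print(c(raw = gap_raw, logged = gap_log))
```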

In the same dataset, the variable Oil describes whether each country in the dataset is an oil exporter (Oil=1) or not (Oil=0). The variable war describes how many years from 1945 to 1999 that country had a civil war. The variable ethfrac is a measure of how fractionalized ethnic groups are in a given country; specifically, it’s the probability that two people randomly drawn from a given country are from the same (0) or different (1) groups.

(f) What is the mean value of war for oil exporters? What is the mean value of war for non oil exporters? What is the standard deviation for both groups? What does this difference in standard deviations suggest about how much variation there is in war for oil exporters versus non oil exporters?

Oilexp <- fl3[which(fl3$Oil>=1), ]
NonOilexp <- fl3[which(fl3$Oil<1),]
mean(Oilexp$war)

## [1] 6.055556

mean(NonOilexp$war)

## [1] 5.57971

sd(Oilexp$war)-sd(NonOilexp$war)

## [1] 0.8288061

The mean value of war for oil exporters is 6.055556. The mean value of war for non oil exporters is 5.57971. The standard deviation for oil exporters is 11.25884. The standard deviation for non oil exporters is 10.43003. The small difference (0.8288061) in standard deviations indicates that the amount of variation in war is similar in the two groups, with slightly more variation among oil exporters. In both groups the standard deviation is large relative to the mean, so years of civil war vary widely from country to country.
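The two standard deviations reported in (f) can also be computed in one pass by splitting war on Oil. This is a minimal sketch on a made-up data frame, `toy`, whose values are invented for illustration.

```r
# Hypothetical toy data; the values are invented for illustration.
toy <- data.frame(
  Oil = c(1, 1, 1, 0, 0, 0, 0),
  war = c(0, 10, 20, 0, 0, 5, 15)
)

# Split war by Oil status, then summarise each group.
war_by_oil <- split(toy$war, toy$Oil)
grp_mean   <- sapply(war_by_oil, mean)
grp_sd     <- sapply(war_by_oil, sd)

# One column per group ("0" = non exporters, "1" = exporters).
print(rbind(mean = grp_mean, sd = grp_sd))
```

Reporting each group's standard deviation side by side, rather than only their difference, makes it easier to compare the spread of the two groups.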

(g) Describe the ethfrac variable: what is the minimum and maximum? What is the mean value? What is the standard deviation? Why does the variable range from 0 to 1?

max(fl3$ethfrac)

## [1] 0.9250348

min(fl3$ethfrac)

## [1] 0.001

mean(fl3$ethfrac)

## [1] 0.4082564

sd(fl3$ethfrac)

## [1] 0.2798512

The maximum value of ethfrac is 0.9250348. The minimum value of ethfrac is 0.001. The mean value of ethfrac is 0.4082564. The standard deviation of ethfrac is 0.2798512. The variable ranges from 0 to 1 because it is a probability: the probability that two people randomly drawn from a given country are from different ethnic groups. A value of 0 represents no ethnic fractionalization (the two people are always from the same group) and a value of 1 represents total ethnic fractionalization (they are always from different groups).
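To see why such a probability stays between 0 and 1, here is a minimal sketch. It assumes the index is computed as one minus the sum of squared group population shares, which is the usual fractionalization formula; the problem set does not state how ethfrac was actually constructed, and `frac_index` is a hypothetical helper name.

```r
# Hypothetical helper: probability that two people drawn at random
# (with replacement) belong to different groups, given group shares.
# Assumes the standard formula 1 - sum(share^2).
frac_index <- function(shares) {
  stopifnot(isTRUE(all.equal(sum(shares), 1)))  # shares must sum to 1
  1 - sum(shares^2)
}

one_group   <- frac_index(1)             # everyone in a single group
four_groups <- frac_index(rep(0.25, 4))  # four equally sized groups

print(one_group)    # 0
print(four_groups)  # 0.75
```

Under this formula the index is 0 when there is a single group and approaches 1 as the population is divided among more and more small groups.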

(h) Say you believe that increased ethnic fractionalization causes war. In this case, what is your independent and dependent variable? Make a scatterplot that shows the relationship between these two variables, including a regression line. Describe the relationship you find. What happens to the predicted number of wars as you move from no ethnic fractionalization to the highest possible value of ethnic fractionalization?

The independent variable would be ethnic fractionalization and the dependent variable would be years of war.

model1 <- lm(war ~ ethfrac, data = fl3) 
plot(fl3$ethfrac, fl3$war, main="Ethnic Fractionalization versus 
     Years of Civil War (1945-1999)", xlab = "Ethnic Fractionalization", ylab = "Years of Civil War")
abline(model1, col = "red")

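There is a weak, positive relationship between ethnic fractionalization and years of civil war. This means that we observe a subtle increase in the predicted number of years of civil war as ethnic fractionalization increases. Moving from no ethnic fractionalization (0) to the highest possible value (1) raises the prediction by exactly the slope of the regression line.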
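The change in the prediction between ethfrac = 0 and ethfrac = 1 can be read off a fitted model with predict(). The sketch below fits the same kind of regression to a made-up data frame, `toy`; the numbers are invented for illustration and are not results from fl3.

```r
# Hypothetical toy data; the values are invented for illustration.
toy <- data.frame(
  ethfrac = c(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
  war     = c(1, 3, 2, 6, 5, 8)
)

fit <- lm(war ~ ethfrac, data = toy)

# Predicted years of war at the two ends of the 0-1 scale.
pred   <- predict(fit, newdata = data.frame(ethfrac = c(0, 1)))
change <- unname(diff(pred))          # prediction at 1 minus prediction at 0
slope  <- unname(coef(fit)["ethfrac"])

print(c(change = change, slope = slope))  # the two values coincide
```

Because the independent variable runs from 0 to 1, a one-unit change spans the whole scale, so the slope coefficient is itself the answer to the last part of (h).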

## Question 2.

Suppose you have a random variable \(X\) with expectation \(E[X]=u\), and variance given by \(s^2\). You then draw multiple observations from the same distribution. That is, you draw \(X_1, X_2, \ldots, X_n\), each a random variable with expectation \(u\) and variance \(s^2\).

(a) When you average these random variables together, what is it called? How do you write it mathematically?

The average of these random variables is called the sample mean. It is itself a random variable, and it is used as an estimator of the population mean \(u\). The equation is below.

\[\overline{X}=\frac{1}{n}\sum_{i=1}^{n} X_{i}\]

(b) What is the standard deviation of these random variables? How do you write it mathematically?

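Each individual draw \(X_i\) has standard deviation \(s\). The standard deviation of their average, \(\overline{X}\), is called the standard error. For independent draws it is

\[\sigma_{\overline{X}}=\frac{s}{\sqrt{n}}\]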

(c) What is \(E[\overline{X}]\)? Explain with math and words.

\(E[\overline{X}]\) is the expectation of the sample mean: the value that the sample mean would average out to across many repeated samples. By linearity of expectation,

\[E[\overline{X}]=E\left[\frac{1}{n}\sum_{i=1}^{n} X_{i}\right]=\frac{1}{n}\sum_{i=1}^{n} E[X_{i}]=\frac{1}{n}(n u)=u\]

In words, the sample mean is on average equal to \(u\), the population mean, so it is an unbiased estimator of \(u\).

(d) What is \(Var(\overline{X})\)?

\(Var(\overline{X})\) is the variance of the sample mean across repeated samples. For independent draws it equals the variance of \(X\) divided by the sample size: \(Var(\overline{X})=s^2/n\). Since \(s^2\) is usually unknown, it is estimated by \(\widehat{Var}(\overline{X})=\frac{1}{n(n-1)}\sum_{i=1}^{n}(X_{i}-\overline{X})^2\), which describes how spread out the distribution of sample means is.
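The results for the expectation and variance of the sample mean can be checked numerically. This is a minimal simulation sketch; the normal distribution and the values chosen for u, s, and n are assumptions made only for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

u    <- 5      # population mean (assumed for illustration)
s    <- 2      # population standard deviation (assumed)
n    <- 25     # size of each sample
reps <- 10000  # number of repeated samples

# Draw many samples of size n and keep the mean of each one.
xbars <- replicate(reps, mean(rnorm(n, mean = u, sd = s)))

sim_mean <- mean(xbars)  # should be close to u
sim_var  <- var(xbars)   # should be close to s^2 / n
print(c(sim_mean = sim_mean, u = u))
print(c(sim_var = sim_var, theory = s^2 / n))
```

With these settings the theoretical variance of the sample mean is 4/25 = 0.16, and its square root, 0.4, is the standard error.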

## Question 3.

(a) In your own words, explain the difference between these terms and put them in a logical sequence: Estimate, Estimand, Estimator. Give one example of each.

In logical order, the terms are: estimand, estimator, estimate.

The estimand is the quantity of interest that we want to learn about. For example, if we were studying separatists in a country experiencing civil war, our estimand could be the true average number of separatists per city.

The estimator is the rule or procedure we apply to sample data to approximate the estimand. As it would be infeasible to count every separatist in every city in the country, we could draw a random sample of cities and use the sample average number of separatists per city as our estimator.

The estimate is the specific number produced by applying the estimator to the data we actually collected. In this example, it is the value of the sample average computed from the sampled cities, which serves as our best guess of the estimand.
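The three terms can be kept apart by writing each one as a separate object in R. The sketch below uses a simulated population of city-level counts; the Poisson distribution, the number of cities, and the sample size are all assumptions made for illustration.

```r
set.seed(15)  # arbitrary seed, for reproducibility only

# Hypothetical population: simulated separatist counts for 200 cities.
population <- rpois(200, lambda = 12)

estimand  <- mean(population)     # the true quantity of interest
estimator <- function(x) mean(x)  # the rule applied to a sample

city_sample <- sample(population, size = 30)  # cities we actually observe
estimate    <- estimator(city_sample)         # the number the rule produces

print(c(estimand = estimand, estimate = estimate))
```

The estimand is known here only because the population was simulated; in practice it is unobserved, and the estimate is all we have.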

(b) If you repeatedly draw a sample and take a mean, how will the distribution of the mean change if the variance of the underlying population is small versus big? If your sample size is small versus big? Explain.

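By the Central Limit Theorem (CLT), the distribution of the sample mean tends towards a normal distribution as the sample size grows, whatever the shape of the underlying population. This distribution is centered on \(E[\overline{X}]\) with a variance of \(Var(X)/N\). If the variance of the underlying population is small, the sample means cluster tightly around the population mean; if it is big, they are more spread out. Likewise, a small sample size gives a wide distribution of means (and one that may be less close to normal), while a big sample size gives a narrow one, since the Law of Large Numbers implies that the sample mean converges to the population mean as \(N\) increases.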
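Both comparisons in (b) can be illustrated by simulation. The sketch below assumes a normal population and uses a hypothetical helper, `sd_of_means`, to measure how spread out the sample means are for a given sample size and population standard deviation.

```r
set.seed(15)  # arbitrary seed, for reproducibility only

# Hypothetical helper: standard deviation of the sample mean across
# repeated samples of size n from a normal population with sd s.
sd_of_means <- function(n, s, reps = 5000) {
  sd(replicate(reps, mean(rnorm(n, mean = 0, sd = s))))
}

small_var <- sd_of_means(n = 10,  s = 1)  # small population variance
big_var   <- sd_of_means(n = 10,  s = 5)  # big population variance
big_n     <- sd_of_means(n = 100, s = 1)  # big sample size

print(c(small_var = small_var, big_var = big_var, big_n = big_n))
```

The spread of the means grows with the population standard deviation and shrinks as the sample size increases, in line with a variance of \(Var(X)/N\).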

Matthew H. Ross, PS 15, UCSB

Winter, 2019