Introduction: Reminders about R and Rmarkdown

If you need to re-orient yourself, please review the introductory material that pset 1 began with describing how to include R code “chunks” into this .rmd file. Remember that when you “knit” the Rmd file, only the code written into code “chunks” will be executed and have its results integrated into the output html file. For example, the code chunk below provides a summary of the built-in dataset called cars. Take a look at the .rmd code that produces it, then click “knit” and see how it shows up in the outputted html file:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Please also remember that you will want to use the console to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your rmd. Remember that the code within your rmd file has to be self-contained and include all the steps – your rmd file will not “remember” what you did on your own in the console. When you click knit, it can only execute the code that was present in the rmd. Do not copy the results from your console into your RMD file. In addition, do not include large amounts of output in your writeup (i.e. don’t print full datasets to the screen).

Include both the code to get your answer and your answer in words.

Finally, it is best to work with small amounts of code at a time: get some code working, copy it into the rmd as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. This will save you huge headaches.

Make sure your final Rmd file knits correctly, and check as you work – don’t wait until the very end to try knitting your code.

As a reminder, you are allowed to use generative AI to help you with your code, but it may not be used in any way for assistance with interpretive or theoretical questions.

Question 1.

Load the Fearon and Laitin data set (fldata.RData). What name shows up for the dataset in your environment?

load("fldata.RData")

ANSWER: fl3

What are the names of the variables stored in this dataset? How many variables do you have? What is your sample size, given this data set? (PLEASE DO NOT PRINT THE WHOLE DATASET IN YOUR OUTPUT!)

ANSWER: The variables are cname, year, pop1, lpopl1, warl, war, gdpenl, lmtnest, ncontig, Oil, nwstate, instab, polity2l, ethfrac, relfrac, war_prop, and numyears. We have a total of 17 variables. nrow(fl3) = 156 = sample size.
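The counts above can be read off without printing the dataset. A minimal sketch, using the built-in `cars` data as a stand-in (an assumption, since `fldata.RData` is not bundled here); the same calls would apply to `fl3`:

```r
# Stand-in data: the built-in cars dataset (fl3 would be used the same way)
dat <- cars

var_names <- names(dat)   # variable (column) names
n_vars    <- ncol(dat)    # number of variables
n_obs     <- nrow(dat)    # number of observations (sample size)

print(var_names)
print(c(variables = n_vars, observations = n_obs))
```

These functions return only the names and dimensions, so the output stays short regardless of how large the dataset is.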

The variable pop1 is population, measured in thousands of people. Show the sample distribution of this variable. Specifically, create a density plot, and a boxplot. Remember, plots need to be labelled.

plot(density(fl3$pop1),
  main = "Density of Country Population",
  xlab = "Population (thousands)",
  ylab = "Density"
  )

boxplot(fl3$pop1,
  main = "Boxplot of Country Population",
  ylab = "Population (thousands)"
  )

Remark on the shape of the distribution (density plot) in part (c). Compute the median and mean of pop1 and report their values in your code. Then add these values to your chart as lines. Comment on whether the mean and median are the same and explain why or why not. If you were writing a paper using these data, would you report the mean or median as a measure of central tendency (choose only one)? Why did you make this choice based on your analysis?

plot(density(fl3$pop1),
  main = "Density of Country Population",
  xlab = "Population (thousands)"
  )
abline(v = mean(fl3$pop1),
  col = "lavender",
  lwd = 3,
  lty = 1
  )
abline(v = median(fl3$pop1),
  col = "skyblue",
  lwd = 3,
  lty = 2
  )
legend("topright",
  legend = c("mean", "median"),
  lty = c(1,2),
  col = c("lavender", "skyblue")
  )

boxplot(fl3$pop1,
  main = "Boxplot of Country Population",
  ylab = "Population (thousands)"
  )
abline(h = mean(fl3$pop1),
  col = "lavender",
  lwd = 3,
  lty = 1
  )
abline(h = median(fl3$pop1),
  col = "skyblue",
  lwd = 3,
  lty = 2
  )
legend("topright",
  legend = c("mean", "median"),
  lty = c(1,2),
  col = c("lavender", "skyblue")
  )

summary(fl3$pop1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     222    1856    4517   17586   11232  553269

ANSWER: The density plot shows that the distribution is strongly skewed to the right: most countries have relatively small populations, while a few very large countries form a long right tail. The mean is 17586 and the median is 4517. They differ because the mean averages every value of pop1 and is pulled upward by the few very large populations, while the median is simply the middle value of the data. I would report the median as the measure of central tendency because the mean is sensitive to outliers (i.e. values that are much larger or smaller than the rest), whereas the median marks the halfway point: half of the observations lie below it and half lie above it.
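The sensitivity of the mean to a single large value can be seen in a small sketch. The numbers below are made up for illustration and are not taken from fl3:

```r
# Hypothetical populations (in thousands): four small values and one very large one
pop <- c(500, 1500, 4000, 9000, 500000)

pop_mean   <- mean(pop)    # pulled upward by the single large value
pop_median <- median(pop)  # the middle value, unaffected by how large the maximum is
print(c(mean = pop_mean, median = pop_median))

# Replace the largest value with something even bigger: only the mean moves
pop_bigger <- replace(pop, which.max(pop), 5000000)
print(c(mean = mean(pop_bigger), median = median(pop_bigger)))
```

Making the largest value ten times bigger changes the mean but leaves the median where it was, which is the reason for preferring the median with skewed data.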

Repeat (c) and (d), but this time show the distribution of log(pop1) using a density plot and a boxplot. Remark on the difference in shape when using the log of the variable. Are your mean and median closer together or farther apart? Why or why not? Does this change which measure of central tendency you would choose to report in your hypothetical research paper? Why or why not?

ANSWER:

logpop1 = log(fl3$pop1)
mean(logpop1)

## [1] 8.505309

median(logpop1)

## [1] 8.415493

plot(density(logpop1),
  main = "Density of log(pop1)",
  xlab = "log(Population in thousands)"
  )
abline(v = mean(logpop1),
  col = "lavender",
  lwd = 3,
  lty = 1
  )
abline(v = median(logpop1),
  col = "skyblue",
  lwd = 3,
  lty = 2
  )
legend("topright",
  legend = c("mean", "median"),
  lty = c(1,2),
  col = c("lavender", "skyblue")
  )

boxplot(logpop1,
  main = "Boxplot of logpop1",
  ylab = "log(Population in thousands)"
  )
abline(h = mean(logpop1),
  col = "lavender",
  lwd = 3,
  lty = 1
  )
abline(h = median(logpop1),
  col = "skyblue",
  lwd = 3,
  lty = 2
  )
legend("topright",
  legend = c("mean", "median"),
  lty = c(1,2),
  col = c("lavender", "skyblue")
  )

When using the log of the variable, the distribution looks much closer to a normal distribution. The log function compresses large values far more than small ones, which pulls in the long right tail and reduces the influence of outliers on the mean. As a result, the mean (8.505) and median (8.415) are much closer together, relative to the spread of the data, than they were on the original scale. This time I would report the mean: with the outliers no longer dominating, the mean and median are similar, and their closeness suggests that the logged data are approximately symmetric.
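The effect of the log on the gap between mean and median can be sketched with simulated data. The lognormal shape and its parameters are assumptions chosen only to produce a right-skewed variable; this is not the fl3 data:

```r
set.seed(15)  # arbitrary seed so the sketch is reproducible

# Simulated right-skewed data; the lognormal shape is an assumption, not the fl3 data
x <- rlnorm(1000, meanlog = 8.5, sdlog = 1.5)

# Gap between mean and median, measured in standard deviation units on each scale
gap_raw <- abs(mean(x) - median(x)) / sd(x)
gap_log <- abs(mean(log(x)) - median(log(x))) / sd(log(x))

print(c(raw = gap_raw, log = gap_log))
```

Dividing by the standard deviation puts both gaps in comparable units, since the raw and logged values are on very different scales.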

In the same dataset, the variable instab describes whether each country in the dataset has experienced political regime instability or not. That is, whether it is a politically-unstable state (instab=1), or politically-stable state (instab=0). The variable war describes how many years from 1945 to 1999 that country had a civil war. The variable ethfrac is a measure of how fractionalized ethnic groups are in a given country – specifically, it’s the probability that two people randomly drawn from a given country are from the same (0) or different (1) ethnic groups.

What is the mean value of war for politically-unstable states? What is the mean value of war for countries that are politically-stable states? What is the standard deviation for both groups? What does this difference in standard deviations suggest about how much variation there is in war for politically-unstable versus stable states? Given these statistics, how confident are you in your ability to predict war experience for stable countries? for unstable countries? Explain your reasoning for each.

unstable <- fl3[which(fl3$instab>=1),]
stable <- fl3[which(fl3$instab<1),]

mean(unstable$war)

## [1] 12.4

mean(stable$war)

## [1] 5.410596

sd(unstable$war)

## [1] 6.348228

sd(stable$war)

## [1] 10.54025

ANSWER: The mean of war for unstable states is 12.4 years, while for stable states it is 5.410596 years. The standard deviation is 6.348228 for unstable states and 10.54025 for stable states. The larger standard deviation for stable states tells us that war experience varies more among stable states than among unstable ones. In terms of prediction, we are more confident predicting war experience for unstable states than for stable states, because the values for unstable states are clustered more tightly around their mean. For unstable states, the group mean is a reasonable guess for any one country. For stable states, the standard deviation is about twice the mean, so individual countries can fall far from the group mean and it is a poor guess for any one of them.
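The same group summaries can be computed in one step per statistic rather than by subsetting first. A minimal sketch on a made-up toy data frame (the values are hypothetical, and the column names simply mirror those in fl3):

```r
# Hypothetical toy data: 'instab' is a 0/1 indicator, 'war' a count of war years
toy <- data.frame(
  instab = c(0, 0, 0, 0, 1, 1, 1),
  war    = c(0, 0, 2, 30, 10, 12, 14)
)

group_mean <- tapply(toy$war, toy$instab, mean)    # mean of war within each group
group_sd   <- tapply(toy$war, toy$instab, sd)      # standard deviation within each group
group_n    <- tapply(toy$war, toy$instab, length)  # number of observations in each group

print(rbind(mean = group_mean, sd = group_sd, n = group_n))
```

Reporting the group sizes next to the means and standard deviations is a design choice worth considering: a summary based on only a handful of observations deserves less confidence than one based on many.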

Describe the ethfrac variable: what is the minimum and maximum? What is the mean value? What is the median value? What is the standard deviation? Which measure of central tendency would you choose to report in a hypothetical research paper? Explain your choice.

min(fl3$ethfrac)

## [1] 0.001

max(fl3$ethfrac)

## [1] 0.9250348

mean(fl3$ethfrac)

## [1] 0.4082564

median(fl3$ethfrac)

## [1] 0.3849883

sd(fl3$ethfrac)

## [1] 0.2798512

ANSWER: The minimum value of ethfrac is 0.001 and the maximum is 0.9250348. The mean is 0.4082564 and the median is 0.3849883. The standard deviation is 0.2798512. Because the mean and median are close to each other, the distribution appears to be roughly symmetric. I would therefore report the mean: with no strong skew or extreme outliers pulling it away from the median, it is a reasonable summary of the typical value.

Say you believe that increased ethnic fractionalization causes war. In this case, what is your independent and dependent variable? Make a scatterplot that shows the relationship between these two variables, including a regression line. Describe the relationship you find. What happens to the predicted number of wars as you move from no fractionalization to the highest possible value of fractionalization? Is there anything about the scatterplot itself that makes you question whether the regression line accurately portrays the relationship between ethnic fractionalization and war? Explain.

model <- lm(war ~ ethfrac, data = fl3)
plot(fl3$ethfrac, fl3$war,
  main = "Ethnic Fractionalization vs War (1945-1999)",
  xlab = "Ethnic Fractionalization",
  ylab = "Years of War"
  )
abline(model,
       col = "red",
       lwd = 3)

ANSWER: The independent variable is ethnic fractionalization (ethfrac) and the dependent variable is years of war (war). The regression line slopes upward, indicating a positive relationship: as ethnic fractionalization increases, the predicted number of years of war increases as well. Because ethfrac runs from 0 to 1, the slope of the line is the predicted change in years of war when moving from no fractionalization to the highest possible value. The scatterplot does give reason for doubt, however: quite a few countries experienced many years of war despite low fractionalization, so the points are widely scattered around the line. Overall the data support the idea, but only weakly.
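The predicted values at the two ends of the fractionalization scale can be read directly from a fitted model with `predict()`. A sketch on simulated stand-in data (the data-generating numbers below are arbitrary assumptions and say nothing about the actual fl3 estimates):

```r
set.seed(15)  # arbitrary seed for reproducibility

# Simulated stand-in data (NOT the fl3 values): same variable names for readability
sim <- data.frame(ethfrac = runif(100, 0, 1))
sim$war <- pmax(0, round(3 + 5 * sim$ethfrac + rnorm(100, sd = 8)))

fit <- lm(war ~ ethfrac, data = sim)

# Predicted years of war at the two ends of the fractionalization scale
ends   <- predict(fit, newdata = data.frame(ethfrac = c(0, 1)))
change <- unname(ends[2] - ends[1])  # equals the slope, since ethfrac moves by exactly 1

print(c(at_0 = unname(ends[1]), at_1 = unname(ends[2]), change = change))
```

With the real data, the same `predict()` call on the model fitted to fl3 would give the predicted years of war at ethfrac = 0 and ethfrac = 1.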

Question 2.

Suppose you have a random variable \(X\) with expectation \(\mathbb{E}[X]=u\), and variance given by \(s^2\). You then draw multiple observations from the same distribution. That is, you draw \(X_1, X_2, \ldots, X_n\), each a random variable with expectation \(u\) and variance \(s^2\).

When you average these random variables together, what is it called? How do you write it mathematically?

ANSWER: The average of these random variables is called the sample mean. It is written as: \(\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\)

What is the standard deviation of these random variables? How do you write it mathematically?

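ANSWER: Each of these random variables has variance \(s^2\), so each has standard deviation \(s = \sqrt{s^2}\). The standard deviation of their average \(\overline{X}\) is called the standard error, and it is written as: \(SE_{\overline{X}} = \frac{s}{\sqrt{n}}\)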

What is \(\mathbb{E}[\overline{X}]\)? Explain with math and words.

ANSWER: \(\mathbb{E}[\overline{X}]\) is the expectation of the sample mean: the value the sample mean would take on average if we drew many samples of size \(n\). Using the formula from part (a) and the linearity of expectation: \(\mathbb{E}[\overline{X}] = \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i] = \frac{1}{n} \cdot n u = u\). In words, the sample mean is centered on the population mean \(u\).

What is \(Var(\overline{X})\)? Explain with math and words.

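ANSWER: \(Var(\overline{X})\) is the variance of the sample mean: how much the sample mean varies from one sample to the next. Assuming the draws are independent, constants come out of a variance squared and the variance of a sum is the sum of the variances: \(Var(\overline{X}) = Var\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2} \cdot n s^2 = \frac{s^2}{n}\). In words, it is the variance of \(X\) divided by the sample size \(n\).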
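The standard results for independent draws, \(\mathbb{E}[\overline{X}] = u\) and \(Var(\overline{X}) = s^2/n\), can be checked by simulation. A minimal sketch, where the normal population and the values of `u`, `s`, and `n` are arbitrary assumptions:

```r
set.seed(15)  # arbitrary seed for reproducibility

u <- 3; s <- 2; n <- 25   # hypothetical population mean, SD, and sample size
reps <- 20000             # number of repeated samples

# Draw 'reps' samples of size n and keep each sample's mean
xbar <- replicate(reps, mean(rnorm(n, mean = u, sd = s)))

print(c(mean_of_xbar = mean(xbar), theory_mean = u))
print(c(var_of_xbar = var(xbar), theory_var = s^2 / n))
```

The simulated mean and variance of the sample means should land close to the theoretical values, with small differences due to simulation noise.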

Two researchers are studying political news consumption among voters. Using two random samples, they find the following:

Researcher A: \(\overline{X}\) = 3 hours, \(s^2\) = 4, \(n\) = 100
Researcher B: \(\overline{X}\) = 3 hours, \(s^2\) = 4, \(n\) = 500

Based on what you know about \(\mathbb{E}[\overline{X}]\) and \(Var(\overline{X})\), which researcher should be more confident in their estimate of \(\overline{X}\)? Explain your logic.

ANSWER: Both researchers report the same sample mean and sample variance, so the only difference is the sample size, and Researcher B should be more confident in their estimate. Since \(Var(\overline{X}) = s^2/n\), Researcher B’s variance (\(4/500\)) and standard error are smaller than Researcher A’s (\(4/100\)) because B used a sample size of 500 rather than 100. A smaller standard error means that B’s sample mean is expected to fall closer to the true mean, so B’s estimate is more precise.
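The comparison can be made concrete by plugging the reported values into the standard error formula \(s/\sqrt{n}\). A short sketch (the variable names are my own):

```r
s2  <- 4      # sample variance reported by both researchers
n_a <- 100    # Researcher A's sample size
n_b <- 500    # Researcher B's sample size

se_a <- sqrt(s2 / n_a)  # estimated standard error for A
se_b <- sqrt(s2 / n_b)  # estimated standard error for B

print(round(c(A = se_a, B = se_b), 3))
print(se_a / se_b)      # how many times larger A's standard error is
```

Because the standard error shrinks with the square root of the sample size, five times as many observations reduce it by a factor of \(\sqrt{5}\), not by a factor of five.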

Question 3.

In your own words, explain the difference between these terms and put them in a logical sequence: Estimate, Estimand, Estimator. Give one original example of each – in other words, your example cannot be from the textbook, lecture slides, AI, or something the instructor or TAs said.

ANSWER: The logical sequence is estimand, then estimator, then estimate. The estimand is the quantity we want to learn about. For example, if we wanted to know how many red Skittles come in a bag, using a container of Skittles bags, the estimand would be “the average number of red Skittles per bag.” The estimator is the rule or function we apply to sample data to approximate the estimand; here, the sample mean of the red Skittles counted in the bags we open. The estimate is the number we get from applying the estimator to a particular data set. In our case, after counting the Skittles of each color in a single box of Skittles bags, we find that the average number of red Skittles per bag is 24. Thus, our estimate is 24.

If you repeatedly draw a sample and take a mean, how will the distribution of the mean change if the variance of the underlying population is small versus big? How will the distribution of the mean change if your sample size is small versus big? Explain your logic for each.

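ANSWER: According to the Central Limit Theorem (CLT), the distribution of the sample mean tends towards a normal distribution as the sample size grows, and by the Law of Large Numbers the sample mean itself converges to \(\mathbb{E}[\overline{X}] = u\). The center of this distribution is therefore \(u\) in every case, and its spread is governed by \(Var(\overline{X}) = s^2/n\). If the variance of the underlying population is small, the sample means cluster tightly around \(u\); if it is big, they are more spread out. If the sample size is small, the distribution of the mean is wide and may not look normal yet; if it is big, the distribution is narrower and closer to normal.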
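How the spread of the sample mean responds to population variance and sample size can be sketched by repeated sampling. The helper `sd_of_means` is a hypothetical function written for this illustration, and the normal population and specific values are assumptions:

```r
set.seed(15)  # arbitrary seed for reproducibility

# SD of the sample mean across many repeated samples from a normal population
# (the normal population and the specific values below are illustrative assumptions)
sd_of_means <- function(n, pop_sd, reps = 5000) {
  sd(replicate(reps, mean(rnorm(n, mean = 0, sd = pop_sd))))
}

spread <- c(
  small_var_small_n = sd_of_means(n = 10,  pop_sd = 1),
  big_var_small_n   = sd_of_means(n = 10,  pop_sd = 5),
  small_var_big_n   = sd_of_means(n = 250, pop_sd = 1),
  big_var_big_n     = sd_of_means(n = 250, pop_sd = 5)
)
print(round(spread, 3))
```

Each entry should sit near `pop_sd / sqrt(n)`: the spread grows with the population standard deviation and shrinks as the sample size increases.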

Question 4.

Did you collaborate with anyone on this problem set? If so, list them here.

ANSWER: I went to Longjiao’s office hours where she helped me clarify certain points or questions. There were two other students with me there but I forgot to ask for their names. Longjiao helped all of us while we were there. I also used the internet to look up how to write equations with LaTeX notation.

Question 5.

Did you use generative AI on any part of this problem set? If so, identify which model you used and how you used it – be specific!

ANSWER: No

Problem Set 2

Alexei Kim, PS 15, UCSB