Please make sure that you first rename and save this file as “LASTNAME-pset2.Rmd” before you do any work. Once you are finished with your problem set and have knitted your file, please remember the following steps to properly upload your assignment to Canvas:
As a reminder, you need to submit both your .html and .Rmd files for full credit. Any other file extension, such as .htm or .mhtml, cannot be read and graded. Also, please remember that you need to submit both of your individual files, NOT the zip folder. If you submitted the zip folder, please follow the steps above to correctly submit your problem set.
If you need to re-orient yourself, please review the introductory material that pset 1 began with describing how to include R code “chunks” into this .rmd file. Remember that when you “knit” the Rmd file, only the code written into code “chunks” will be executed and have its results integrated into the output html file. For example, the code chunk below provides a summary of the built-in dataset called cars. Take a look at the .rmd code that produces it, then click “knit” and see how it shows up in the outputted html file:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
Please also remember that you will want to use the console to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your rmd. Remember that the code within your rmd file has to be self-contained and include all the steps – your rmd file will not “remember” what you did on your own in the console. When you click knit, it can only execute the code that was present in the rmd. Do not copy the results from your console into your RMD file. In addition, do not include large amounts of output in your writeup (i.e. don’t print full datasets to the screen).
Include both the code to get your answer and your answer in words.
Finally, it is best to work with small amounts of code at a time: get some code working, copy it into the rmd as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. This will save you huge headaches.
Make sure your final Rmd file knits correctly, and check as you work – don’t wait until the very end to try knitting your code.
As a reminder, you are allowed to use generative AI to help you with your code, but it may not be used in any way for assistance with interpretive or theoretical questions.
load("fldata.RData")
ANSWER: fl3
ANSWER: The variables are cname, year, pop1, lpopl1, warl, war, gdpenl, lmtnest, ncontig, Oil, nwstate, instab, polity2l, ethfrac, relfrac, war_prop, and numyears. We have a total of 17 variables. nrow(fl3) = 156 = sample size.
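The variable names and counts reported above can be read off with a few base R calls. This is a minimal sketch that uses the built-in cars data frame as a stand-in (an assumption, since fldata.RData is not reproduced here); with the real data, `df` would simply be `fl3`.

```r
# Stand-in data frame (assumption): use fl3 after load("fldata.RData")
df <- cars

var_names <- names(df)  # character vector of column names
n_vars    <- ncol(df)   # number of variables
n_obs     <- nrow(df)   # number of rows, i.e. the sample size

print(var_names)
cat("variables:", n_vars, " observations:", n_obs, "\n")
```

Reporting the counts from code rather than typing them by hand keeps the written answer in sync with the data if the file changes.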
pop1 is population, measured in thousands
of people. Show the sample distribution of this variable. Specifically,
create a density plot, and a boxplot. Remember, plots need to be
labelled.
# Density plot of population (pop1 is recorded in thousands of people)
plot(density(fl3$pop1),
     main = "Density of Country Population",
     xlab = "Population (thousands)",
     ylab = "Density")
# Boxplot of population; the vertical axis is population, not density
boxplot(fl3$pop1,
        main = "Boxplot of Country Population",
        ylab = "Population (thousands)")
pop1 and report their values
in your code. Then add these values to your chart as lines. Comment on
whether the mean and median are the same and explain why or why not. If
you were writing a paper using these data, would you report the mean or
median as a measure of central tendency (choose only one)? Why did you
make this choice based on your analysis?
plot(density(fl3$pop1),
     main = "Density of Country Population",
     xlab = "Population (thousands)")
# Solid line marks the mean
abline(v = mean(fl3$pop1),
       col = "lavender",
       lwd = 3,
       lty = 1)
# Dashed line marks the median
abline(v = median(fl3$pop1),
       col = "skyblue",
       lwd = 3,
       lty = 2)
legend("topright",
legend = c("mean", "median"),
lty = c(1,2),
col = c("lavender", "skyblue")
)
boxplot(fl3$pop1,
        main = "Boxplot of Country Population",
        ylab = "Population (thousands)")
# Solid line marks the mean
abline(h = mean(fl3$pop1),
       col = "lavender",
       lwd = 3,
       lty = 1)
# Dashed line marks the median
abline(h = median(fl3$pop1),
       col = "skyblue",
       lwd = 3,
       lty = 2)
legend("topright",
legend = c("mean", "median"),
lty = c(1,2),
col = c("lavender", "skyblue")
)
summary(fl3$pop1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 222 1856 4517 17586 11232 553269
ANSWER: The distribution is strongly skewed to the right. The mean is 17586 and the median is 4517. They differ because the mean averages every value of pop1, so a few very large countries pull it upward, while the median is simply the middle value of the sorted data. I would report the median as the measure of central tendency because the mean is sensitive to outliers (i.e., values much larger or smaller than the rest), whereas the median is the halfway point: half of the observations lie below it and half lie above it.
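The sensitivity of the mean to outliers can be checked directly. The sketch below uses simulated right-skewed data (a log-normal stand-in chosen for illustration, not the actual pop1 values) and adds one extreme observation to see how far each statistic moves.

```r
set.seed(1)

# Simulated right-skewed values (assumption: log-normal stand-in for pop1)
x <- rlnorm(156, meanlog = 8.5, sdlog = 1.5)

# Add one hypothetical extreme observation
x_out <- c(x, max(x) * 50)

# How far does each statistic move when the outlier is added?
mean_shift   <- mean(x_out) - mean(x)
median_shift <- median(x_out) - median(x)

cat("mean:", mean(x), " median:", median(x), "\n")
cat("shift in mean:", mean_shift, " shift in median:", median_shift, "\n")
```

In data like these the mean sits above the median, and a single extreme value moves the mean far more than the median, which is the reason for preferring the median on the original scale.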
ANSWER:
logpop1 = log(fl3$pop1)
mean(logpop1)
## [1] 8.505309
median(logpop1)
## [1] 8.415493
plot(density(logpop1),
     main = "Density Distribution of logpop1",
     xlab = "log(Population in thousands)")
abline(v = mean(logpop1),
col = "lavender",
lwd = 3,
lty = 1
)
abline(v = median(logpop1),
col = "skyblue",
lwd = 3,
lty = 2
)
legend("topright",
legend = c("mean", "median"),
lty = c(1,2),
col = c("lavender", "skyblue")
)
boxplot(logpop1,
        main = "Boxplot of logpop1",
        ylab = "log(Population in thousands)")
# Solid line marks the mean
abline(h = mean(logpop1),
       col = "lavender",
       lwd = 3,
       lty = 1)
# Dashed line marks the median
abline(h = median(logpop1),
       col = "skyblue",
       lwd = 3,
       lty = 2)
legend("topright",
legend = c("mean", "median"),
lty = c(1,2),
col = c("lavender", "skyblue")
)
When we take the log of the variable, the distribution looks much closer to a normal distribution, because the log compresses large values far more than small ones and so reduces the pull of outliers on the mean. As a result, the mean (8.505) and median (8.415) are much closer to each other than they were on the original scale. This time I would report the mean: outliers are far less influential after the transformation, and a mean and median this close together indicate that the distribution is roughly symmetric.
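The effect of the log transformation on the gap between mean and median can be illustrated with simulated data. This is a sketch under the assumption of log-normal values (chosen because their logs are exactly normal); it does not use the fl3 data.

```r
set.seed(2)

# Simulated right-skewed values (assumption: log-normal, not the real pop1)
x     <- rlnorm(1000, meanlog = 8.5, sdlog = 1.5)
log_x <- log(x)

# On the raw scale the mean is several times the median
ratio_raw <- mean(x) / median(x)

# On the log scale the mean and median nearly coincide
gap_log <- abs(mean(log_x) - median(log_x))

cat("raw mean / median:", ratio_raw, "\n")
cat("log-scale |mean - median|:", gap_log, "\n")
```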
In the same dataset, the variable instab describes
whether each country in the dataset has experienced political regime
instability or not. That is, whether it is a politically-unstable state
(instab=1), or politically-stable state (instab=0). The variable
war describes how many years from 1945 to 1999 that country
had a civil war. The variable ethfrac is a measure of how
fractionalized ethnic groups are in a given country – specifically, it’s
the probability that two people randomly drawn from a given country are
from the same (0) or different (1) ethnic groups.
unstable <- fl3[which(fl3$instab>=1),]
stable <- fl3[which(fl3$instab<1),]
mean(unstable$war)
## [1] 12.4
mean(stable$war)
## [1] 5.410596
sd(unstable$war)
## [1] 6.348228
sd(stable$war)
## [1] 10.54025
ANSWER: The mean years of war is 12.4 for unstable states and 5.410596 for stable states. The standard deviation is 6.348228 for unstable states and 10.54025 for stable states. The larger standard deviation for stable states tells us that war experience varies more among stable states than among unstable ones. In terms of prediction, we can be more confident predicting war experience for an unstable state than for a stable one, because the values for unstable states are clustered more tightly around their mean. A guess near 12.4 years will typically be closer to the truth for an unstable state, whereas stable states range more widely around their mean, so a guess near 5.4 years can miss by a lot.
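The same group comparison can be written more compactly with tapply, which also makes it easy to report how many countries fall in each group. The data frame below is a small hypothetical toy example (its values are invented for illustration and are not taken from fl3).

```r
# Hypothetical toy data: instab is the group flag, war is years of civil war
toy <- data.frame(
  instab = c(0, 0, 0, 0, 1, 1, 1),
  war    = c(0, 0, 2, 30, 10, 12, 15)
)

# Mean, standard deviation, and group size for each value of instab
group_mean <- tapply(toy$war, toy$instab, mean)
group_sd   <- tapply(toy$war, toy$instab, sd)
group_n    <- tapply(toy$war, toy$instab, length)

print(rbind(mean = group_mean, sd = group_sd, n = group_n))
```

Checking the group sizes alongside the means and standard deviations is worthwhile, because a summary computed from only a handful of observations is much less reliable than one computed from a large group.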
min(fl3$ethfrac)
## [1] 0.001
max(fl3$ethfrac)
## [1] 0.9250348
mean(fl3$ethfrac)
## [1] 0.4082564
median(fl3$ethfrac)
## [1] 0.3849883
sd(fl3$ethfrac)
## [1] 0.2798512
ANSWER: The minimum value of ethfrac is 0.001 and the maximum is 0.9250348. The mean is 0.4082564 and the median is 0.3849883. The standard deviation of ethfrac is 0.2798512. Because the mean and median are close to each other (they differ by about 0.02), the distribution is roughly symmetric, with only slight right skew. I would therefore report the mean: with little skew and no extreme outliers pulling it away from the median, it summarizes the center of the data well.
model <- lm(war ~ ethfrac, data = fl3)
plot(fl3$ethfrac, fl3$war,
     main = "Ethnic Fractionalization vs War (1945-1999)",
     xlab = "Ethnic Fractionalization",
     ylab = "Years of War")
abline(model,
col = "red",
lwd = 3)
ANSWER: The dependent variable is years of war and the independent variable is ethnic fractionalization. The fitted line shows a positive relationship: as ethnic fractionalization increases, the number of years of war tends to increase as well. The scatterplot also shows clear exceptions, since quite a few countries experienced many years of war despite low fractionalization, so overall the data support the idea only weakly.
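How weak or strong a linear relationship is can be quantified with the fitted slope, the correlation, and R-squared, rather than judged from the plot alone. The sketch below uses simulated data with hypothetical names (`ethfrac_sim`, `war_sim`); with the real data the same calls would apply to `model`, `fl3$ethfrac`, and `fl3$war`.

```r
set.seed(3)

# Simulated stand-in data (assumption: a weak positive relationship plus noise)
n           <- 156
ethfrac_sim <- runif(n, min = 0, max = 0.93)
war_sim     <- pmax(0, round(3 + 6 * ethfrac_sim + rnorm(n, sd = 10)))

fit <- lm(war_sim ~ ethfrac_sim)

slope     <- coef(fit)[["ethfrac_sim"]]  # change in war years per unit of ethfrac
r         <- cor(ethfrac_sim, war_sim)   # direction and strength of linear association
r_squared <- summary(fit)$r.squared      # share of variance explained by the line

cat("slope:", slope, " r:", r, " R-squared:", r_squared, "\n")
```

In a regression with a single predictor, R-squared equals the squared correlation, so a small correlation corresponds directly to a line that explains little of the variation in the outcome.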
Suppose you have a random variable \(X\) with expectation \(\mathbb{E}[X]=u\), and variance given by \(s^2\). You then draw multiple observations from the same distribution. That is, you draw \(X_1, X_2, \ldots, X_n\), each a random variable with expectation \(u\) and variance \(s^2\).
ANSWER: The average of these random variables is called the sample mean. It is written as: \(\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\)
ANSWER: The standard deviation of the sample mean is called the standard error. It is written as: \(SE_{\overline{X}} = \frac{s}{\sqrt{n}}\)
ANSWER: \(\mathbb{E}[\overline{X}]\) is the expectation of the sample mean, i.e. the value the sample mean would average out to over many repeated samples. Using the formula from part a and the linearity of expectation: \(\mathbb{E}[\overline{X}] = \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[X_i] = u\). The sample mean is therefore an unbiased estimator of the population mean.
ANSWER: \(Var(\overline{X})\) is the variance of the sample mean, which is the variance of \(X\) divided by the sample size: \(Var(\overline{X}) = \frac{Var(X)}{n} = \frac{s^2}{n}\), assuming the draws are independent. When \(s^2\) is unknown, it can be estimated from the data, which gives \(\widehat{Var}(\overline{X}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}(X_i-\overline{X})^2\).
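The variance of the sample mean can be worked out step by step from the definitions in the problem. The derivation below assumes the draws \(X_1, \ldots, X_n\) are independent; the prompt does not say so explicitly, but the second equality needs it.

```latex
\begin{aligned}
Var(\overline{X})
  &= Var\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i)
   && \text{(constants come out squared; independence)} \\
  &= \frac{1}{n^2}\cdot n s^2
   = \frac{s^2}{n},
   && \text{(each } Var(X_i) = s^2\text{)} \\
SE_{\overline{X}}
  &= \sqrt{Var(\overline{X})}
   = \frac{s}{\sqrt{n}}.
\end{aligned}
```

Because \(n\) appears in the denominator, the variance of the sample mean shrinks as the sample size grows, while its expectation stays at \(u\).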
Based on what you know about \(\mathbb{E}[\overline{X}]\) and \(Var(\overline{X})\), which researcher should be more confident in their estimate of \(\overline{X}\)? Explain your logic.
ANSWER: Because both researchers are drawing from the same distribution, Researcher B should be more confident in their estimate. Both sample means have the same expectation, \(u\), but \(Var(\overline{X}) = s^2/n\) shrinks as \(n\) grows, so Researcher B’s sample of 500 gives a smaller variance and standard error than a sample of 100. A smaller standard error means that Researcher B’s sample mean is more tightly concentrated around the true mean, so their estimate is more precise.
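The comparison between the two sample sizes can be checked by simulation. In this sketch the values of `u` and `s` and the normal distribution are hypothetical choices made only for illustration; the sample sizes 100 and 500 are the ones mentioned in the answer above.

```r
set.seed(4)

u    <- 10    # hypothetical population mean
s    <- 2     # hypothetical population standard deviation
reps <- 2000  # number of repeated samples per researcher

# Each researcher repeatedly draws a sample and records its mean
means_a <- replicate(reps, mean(rnorm(100, mean = u, sd = s)))
means_b <- replicate(reps, mean(rnorm(500, mean = u, sd = s)))

# Theoretical standard errors, s / sqrt(n)
se_a_theory <- s / sqrt(100)
se_b_theory <- s / sqrt(500)

cat("n = 100: simulated SE", sd(means_a), " theory", se_a_theory, "\n")
cat("n = 500: simulated SE", sd(means_b), " theory", se_b_theory, "\n")
```

Both sets of sample means center on the same value, but the means from samples of 500 are spread less widely, matching the \(s/\sqrt{n}\) formula.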
ANSWER: The estimand is the quantity we want to learn about. For example, if we wanted to determine the number of red Skittles per bag in a container of Skittles bags, we would set the estimand as “the average number of red Skittles per bag.” The estimator is the function of the sample data that we use to approximate the estimand; here, it is the sample average of red Skittles across the bags we open. The estimate is the number we get by applying the estimator to our data. In our case, after counting the Skittles of each color in a single box of Skittles bags, we find that the average number of red Skittles per bag is 24. Thus, our estimate is 24.
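The three terms map neatly onto three objects in code. The sketch below invents a hypothetical population of Skittles bags (Poisson counts centered on 24, chosen only to echo the example above) so that the estimand is known and can be compared with the estimate.

```r
set.seed(5)

# Hypothetical population: number of red Skittles in each of 10,000 bags
population <- rpois(10000, lambda = 24)

# Estimand: the true average per bag (known here only because we simulated it)
estimand <- mean(population)

# Estimator: the rule we apply to a sample, here the sample average
estimator <- function(sample_data) mean(sample_data)

# Estimate: the number the estimator returns for one particular sample
bags     <- sample(population, size = 30)
estimate <- estimator(bags)

cat("estimand:", estimand, " estimate:", estimate, "\n")
```

A different sample of bags would give a different estimate, while the estimand and the estimator stay the same.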
ANSWER: According to the Central Limit Theorem (CLT), the sampling distribution of the sample mean approaches a normal distribution as the sample size \(n\) grows, whatever the shape of the distribution of \(X\) (provided its variance is finite). According to the Law of Large Numbers, the sample mean itself converges to \(u\) as \(n\) grows. The approximating normal distribution is centered at \(\mathbb{E}[\overline{X}] = u\) and has variance \(Var(\overline{X}) = s^2/n\).
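The Central Limit Theorem can be illustrated by drawing many samples from a clearly non-normal distribution and checking that the sample means behave as a normal distribution would predict. The exponential distribution, the sample size, and the number of repetitions below are illustrative assumptions.

```r
set.seed(6)

n    <- 50    # observations per sample
reps <- 5000  # number of repeated samples

# Exponential(rate = 1) is strongly right-skewed, with mean 1 and sd 1
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# Standard error of the sample mean, s / sqrt(n), with s = 1
se <- 1 / sqrt(n)

# Under a normal approximation, about 95% of sample means
# should fall within 1.96 standard errors of the true mean
coverage <- mean(abs(sample_means - 1) < 1.96 * se)

cat("average of sample means:", mean(sample_means), "\n")
cat("share within 1.96 SE of the true mean:", coverage, "\n")
```

A histogram of `sample_means` (for example with `hist(sample_means)`) would show a roughly bell-shaped curve even though the individual draws are heavily skewed.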
Did you collaborate with anyone on this problem set? If so, list them here.
ANSWER: I went to Longjiao’s office hours, where she helped me clarify certain points and questions. There were two other students there with me, but I forgot to ask for their names. Longjiao helped all of us while we were there. I also used the internet to look up how to write equations in LaTeX notation.
Did you use generative AI on any part of this problem set? If so, identify which model you used and how you used it – be specific!
ANSWER: No