Submit your HTML output and .Rmd file to Gauchospace by the deadline.
setwd("/Users/mikeschaible/Desktop/PS-15")
getwd()
## [1] "/Users/MikeSchaible/Desktop/PS-15"
load("fl2.rdata")
summary(fl2)
## cname year warl war
## Length:156 Min. :1945 Min. :0.00000 Min. : 0.000
## Class :character 1st Qu.:1947 1st Qu.:0.00000 1st Qu.: 0.000
## Mode :character Median :1954 Median :0.00000 Median : 0.000
## Mean :1958 Mean :0.00641 Mean : 5.635
## 3rd Qu.:1964 3rd Qu.:0.00000 3rd Qu.: 9.000
## Max. :1993 Max. :1.00000 Max. :52.000
## gdpenl lpopl1 lmtnest ncontig
## Min. : 0.0510 Min. : 5.403 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.6395 1st Qu.: 7.526 1st Qu.:0.6931 1st Qu.:0.0000
## Median : 1.0910 Median : 8.415 Median :2.3174 Median :0.0000
## Mean : 2.4639 Mean : 8.505 Mean :2.0975 Mean :0.1603
## 3rd Qu.: 2.5940 3rd Qu.: 9.326 3rd Qu.:3.3150 3rd Qu.:0.0000
## Max. :53.9010 Max. :13.224 Max. :4.5570 Max. :1.0000
## Oil nwstate instab polity2l
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :-10.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: -7.0000
## Median :0.0000 Median :1.0000 Median :0.00000 Median : -1.0000
## Mean :0.1154 Mean :0.5192 Mean :0.03205 Mean : -0.1154
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.: 7.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. : 10.0000
## ethfrac relfrac war_prop numyears
## Min. :0.0010 Min. :0.0000 Min. :0.0000 Min. : 3.00
## 1st Qu.:0.1438 1st Qu.:0.1861 1st Qu.:0.0000 1st Qu.:34.00
## Median :0.3850 Median :0.3750 Median :0.0000 Median :43.50
## Mean :0.4083 Mean :0.3807 Mean :0.1393 Mean :40.56
## 3rd Qu.:0.6691 3rd Qu.:0.5800 3rd Qu.:0.2323 3rd Qu.:53.00
## Max. :0.9250 Max. :0.7828 Max. :1.0000 Max. :55.00
colnames(fl2)
## [1] "cname" "year" "warl" "war" "gdpenl" "lpopl1"
## [7] "lmtnest" "ncontig" "Oil" "nwstate" "instab" "polity2l"
## [13] "ethfrac" "relfrac" "war_prop" "numyears"
ncol(fl2)
## [1] 16
nrow(fl2)
## [1] 156
The variable gdpenl is GDP per capita, measured in thousands of dollars (in 1985 prices).
plot(density(fl2$gdpenl), main = "Density of GDP Per Capita", ylab = "Density", xlab = "GDP Per Capita (thousands of 1985 dollars)")
boxplot(fl2$gdpenl, main = "Boxplot of GDP Per Capita", ylab = "GDP Per Capita (thousands of 1985 dollars)")
Remark on the shape of this distribution. Compute the median and mean and report their values in your code. Then add these values to your chart as lines. Comment on whether the mean and median are the same and explain why or why not.
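The distribution is strongly right-skewed (positively skewed): most observations sit at low values of GDP per capita, while a long tail stretches up to the maximum of 53.9. In the boxplot this appears as a compressed box near the bottom with several outliers above it. The mean (2.46) is more than twice the median (1.09). The two differ because the mean is pulled upward by the few very large values, whereas the median depends only on the middle of the ordered data and is resistant to outliers. In a symmetric distribution the two would be equal.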
median(fl2$gdpenl)
## [1] 1.091
mean(fl2$gdpenl)
## [1] 2.46391
plot(density(fl2$gdpenl), main = "Density of GDP Per Capita", ylab = "Density", xlab = "GDP Per Capita (thousands of 1985 dollars)")
abline(v = mean(fl2$gdpenl), col = "blue")   # mean in blue
abline(v = median(fl2$gdpenl), col = "red")  # median in red
boxplot(fl2$gdpenl, main = "Boxplot of GDP Per Capita", ylab = "GDP Per Capita (thousands of 1985 dollars)")
abline(h = mean(fl2$gdpenl), col = "blue")   # mean in blue
abline(h = median(fl2$gdpenl), col = "red")  # median in red
Repeat (c) and (d), but this time show the distribution of log(gdpenl) using a density plot and a boxplot. Remark on the difference in shape when using the log of the variable. Are your mean and median closer together or farther apart? Why?
On the log scale the distribution is much more symmetric and closer to bell-shaped. Taking logs compresses large values far more than small ones, so the long right tail is pulled in and no longer drags the mean away from the center. As a result the mean (0.238) and median (0.087) are closer together: they differ by about 0.15 on the log scale, compared with about 1.37 on the original scale.
log(fl2$gdpenl)
## [1] 2.031563452 1.646348305 0.754712580 -0.267879464 -0.403467131
## [6] 0.593326814 1.792092723 0.607044502 0.137149799 -0.107585209
## [11] -0.068278824 -0.244622593 0.222343194 0.068592773 0.291923057
## [16] 1.890095353 0.539995978 -0.030459178 -0.576253471 -0.123298213
## [21] 0.035367157 0.030529223 0.807368371 0.807814269 1.238953745
## [26] 1.541373182 0.693647019 1.346773610 1.292258378 0.972292942
## [31] 1.889642111 0.522358853 -0.057629107 1.229932907 0.846726276
## [36] -0.401971175 1.075002446 -0.235722306 -0.211956343 1.367111548
## [41] 0.955511408 0.820660534 -2.453407949 -1.917322693 -0.214431634
## [46] 0.711478108 -1.335601203 1.246744922 -2.071473356 0.580538203
## [51] 1.684174261 1.769684254 1.453953059 1.587805638 1.627867035
## [56] 1.354803675 1.342342492 1.161274052 1.063675744 1.603017332
## [61] 1.285368425 1.398469884 -0.136965879 -0.322963918 -0.625488483
## [66] 0.045928980 0.095310201 -0.248461396 -0.631111780 0.113328690
## [71] -0.560366104 -0.785262469 -0.507497837 -0.130108661 -0.112049511
## [76] -1.002393394 -0.444725864 -0.567396025 0.581656824 -0.350976928
## [81] -0.279713926 0.116003699 -0.715392805 -0.536143468 -0.474815220
## [86] -1.194022463 -0.669430606 -0.761426005 0.098033781 0.415415430
## [91] -1.537117234 -0.313341811 0.167207954 -0.094310651 0.078811196
## [96] -0.991553238 0.672434135 1.048721551 -0.721546652 -0.549912974
## [101] 0.627541437 0.174793278 1.129464785 -0.208254968 0.242946160
## [106] 0.006975583 -0.053400762 -0.153151202 0.509825101 0.077886511
## [111] 0.737642459 -0.570929552 -0.349557500 0.387979778 -0.809681013
## [116] 0.763139543 0.767326599 -2.975929665 0.517602592 0.733809136
## [121] 3.987149049 2.833977786 3.689978818 1.772236773 -1.639897090
## [126] 1.324950753 0.821980094 1.055008644 1.035672458 1.434608210
## [131] -1.487220297 -0.200892920 -0.257476228 0.073250439 -0.679244218
## [136] 0.554459693 -0.527632787 -1.038458360 -0.216912993 -0.112049511
## [141] -1.599487547 -0.049190221 -0.711311158 -0.525939227 -0.941608577
## [146] -1.231001491 -0.454130295 -0.603306493 0.255417119 0.622724696
## [151] -0.778705088 -1.203972765 1.691939102 0.628075159 1.761815585
## [156] 0.952429781
median(log(fl2$gdpenl))
## [1] 0.0870607
mean(log(fl2$gdpenl))
## [1] 0.2382049
plot(density(log(fl2$gdpenl)), main = "Density of Log GDP Per Capita", ylab = "Density", xlab = "Log GDP Per Capita (thousands of 1985 dollars)")
abline(v = mean(log(fl2$gdpenl)), col = "blue")   # mean in blue
abline(v = median(log(fl2$gdpenl)), col = "red")  # median in red
boxplot(log(fl2$gdpenl), main = "Boxplot of Log GDP Per Capita", ylab = "Log GDP Per Capita (thousands of 1985 dollars)")
abline(h = mean(log(fl2$gdpenl)), col = "blue")   # mean in blue
abline(h = median(log(fl2$gdpenl)), col = "red")  # median in red
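The effect of the log transform on the gap between the mean and median can be checked on simulated data. The sketch below is only an illustration: it uses made-up lognormal draws, not the fl2 values, and the names `x`, `gap_raw`, and `gap_log` are hypothetical.

```r
# Simulated right-skewed data (hypothetical; not the fl2 values)
set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

gap_raw <- mean(x) - median(x)            # mean is pulled up by the right tail
gap_log <- mean(log(x)) - median(log(x))  # gap after the log transform

print(c(raw = gap_raw, log = gap_log))
```

On data like these, the raw gap is clearly positive while the gap on the log scale is close to zero, which is the same pattern seen with gdpenl.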
In the same dataset, the variable Oil describes whether each country in the dataset is an oil exporter (Oil=1) or not (Oil=0). The variable war describes how many years from 1945 to 1999 that country had a civil war. The variable `ethfrac` is a measure of how fractionalized ethnic groups are in a given country – specifically, it’s the probability that two people randomly drawn from a given country are from the same (0) or different (1) groups.
What is the mean value of war for oil exporters? What is the mean value of war for non oil exporters? What is the standard deviation for both groups? What does this difference in standard deviations suggest about how much variation there is in war for oil exporters versus non oil exporters?
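Oil exporters average 6.06 years of civil war, compared with 5.58 years for non oil exporters, a difference of less than half a year. The standard deviations are 11.26 for oil exporters and 10.43 for non oil exporters. Oil exporters therefore show slightly more variation in war, but the gap is small. In both groups the standard deviation is roughly twice the mean, which indicates that war is highly dispersed; overall, at least half of the countries have zero war years (the median of war is 0), while a smaller number have many.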
NoOil <- fl2[which(fl2$Oil == 0), ]  # non oil exporters
Oil <- fl2[which(fl2$Oil == 1), ]    # oil exporters
mean(NoOil$war)
## [1] 5.57971
mean(Oil$war)
## [1] 6.055556
sd(NoOil$war)
## [1] 10.43003
sd(Oil$war)
## [1] 11.25884
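The same group summaries can be computed without building separate data frames. This is a minimal sketch using `tapply()` on a toy data frame; the values in `toy` are invented for illustration and only the column names mirror fl2.

```r
# Toy data frame (hypothetical values, same column names as fl2)
toy <- data.frame(
  Oil = c(0, 0, 0, 1, 1, 1),
  war = c(0, 2, 4, 0, 10, 20)
)

# Mean and standard deviation of war within each Oil group
war_means <- tapply(toy$war, toy$Oil, mean)
war_sds   <- tapply(toy$war, toy$Oil, sd)

print(war_means)
print(war_sds)
```

With the real data, the equivalent calls would be `tapply(fl2$war, fl2$Oil, mean)` and `tapply(fl2$war, fl2$Oil, sd)`.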
Describe the ethfrac variable: what is the minimum and maximum? What is the mean value? What is the standard deviation? Why does the variable range from 0 to 1?
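ethfrac ranges from a minimum of 0.001 to a maximum of 0.925, with a mean of 0.408 and a standard deviation of 0.280. The variable is bounded by 0 and 1 because it is a probability: the probability that two people randomly drawn from a country belong to different ethnic groups. A value of 0 means the two people are certain to come from the same group (a completely homogeneous country), and a value of 1 means they are certain to come from different groups. A value of 0.5 corresponds to a 50% chance.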
min(fl2$ethfrac)
## [1] 0.001
max(fl2$ethfrac)
## [1] 0.9250348
mean(fl2$ethfrac)
## [1] 0.4082564
sd(fl2$ethfrac)
## [1] 0.2798512
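To see why a measure like this stays between 0 and 1, it helps to compute one by hand. The sketch below assumes the index is one minus the sum of squared group shares, which is a common way to build a fractionalization measure but is not stated in the assignment; the shares are hypothetical.

```r
# Hypothetical population shares of three ethnic groups (must sum to 1)
shares <- c(0.5, 0.3, 0.2)

# Probability two random draws (with replacement) come from the same group
p_same <- sum(shares^2)

# Fractionalization: probability the two draws come from different groups
frac <- 1 - p_same
print(frac)

# A single group gives 0; many equally sized groups push the value toward 1
frac_one  <- 1 - sum(1^2)
frac_many <- 1 - sum(rep(1 / 100, 100)^2)
print(c(one_group = frac_one, hundred_groups = frac_many))
```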
Say you believe that increased ethnic fractionalization causes war. In this case, what are your independent and dependent variables? Make a scatterplot that shows the relationship between these two variables, including a regression line. Describe the relationship you find. What happens to the predicted number of wars as you move from no ethnic fractionalization to the highest possible value of ethnic fractionalization?
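The independent variable is ethnic fractionalization (ethfrac) and the dependent variable is war. There is a positive but weak relationship between the two: the regression line slopes upward, so the predicted number of war years rises as ethnic fractionalization increases. Moving from no fractionalization (ethfrac = 0) to the highest possible value (ethfrac = 1) raises the prediction from the intercept of the line to the intercept plus the slope. The points are widely scattered around the line, and the spread in war appears larger at high levels of ethnic fractionalization.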
plot(fl2$ethfrac, fl2$war, ylab = "Years of Civil War", xlab = "Ethnic Fractionalization", main = "Ethnic Fractionalization and War")
abline(lm(war ~ ethfrac, data = fl2), col = "red")  # fitted regression line
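The predicted values at the two ends of the scale can be read directly from a fitted model with `predict()`. The sketch below uses simulated data as a stand-in for fl2, so the numbers it prints are not the assignment's results; `sim`, `fit`, and `pred` are hypothetical names.

```r
# Simulated data (hypothetical; stands in for fl2$ethfrac and fl2$war)
set.seed(2)
sim <- data.frame(ethfrac = runif(100))
sim$war <- 2 + 5 * sim$ethfrac + rnorm(100, sd = 3)

fit <- lm(war ~ ethfrac, data = sim)

# Predicted war at no fractionalization (0) and at the maximum possible (1)
pred <- predict(fit, newdata = data.frame(ethfrac = c(0, 1)))
print(pred)

# The change in the prediction equals the estimated slope
print(unname(pred[2] - pred[1]))
print(unname(coef(fit)["ethfrac"]))
```

With the real data, the same two lines would be `fit <- lm(war ~ ethfrac, data = fl2)` followed by the `predict()` call.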
Suppose you have a random variable \(X\) with expectation \(E[X]=u\), and variance given by \(s^2\). You then draw multiple observations from the same distribution. That is, you draw \(X_1, X_2, \ldots, X_n\), each a random variable with expectation \(u\) and variance \(s^2\).
When you average these random variables together, what is the result called? How do you write it mathematically?
It is called the sample mean, written mathematically as \(\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i\).
What is SD(X)?
It is the standard deviation of \(X\), the square root of the variance: \(SD(X) = \sqrt{Var(X)} = s\). It measures how far values of \(X\) typically fall from the expectation \(u\).
What is \(E[\overline{X}]\)? Explain with math and words.
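It is the expected value of the sample mean, and it equals the expectation of a single draw: \(E[\overline{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}(n u) = u\). In words, the sample mean is correct on average. For example, if Obama's true approval rating is 65%, any one sample may land above or below that value, but the sample means from many repeated samples would average out to 65%, or 0.65.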
What is \(Var(\overline{X})\)?
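It is the variance of the sample mean, which describes how much \(\overline{X}\) varies from one sample to the next. If the draws are independent, \(Var(\overline{X}) = \frac{s^2}{n}\), so the sample mean becomes less variable as the sample size \(n\) grows.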
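A short derivation shows where the variance of the sample mean comes from. It assumes the draws \(X_1, \ldots, X_n\) are independent, which the question implies but does not state outright; constants come out of a variance squared, and the variance of a sum of independent variables is the sum of their variances:

\[
Var(\overline{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2}\left(n s^2\right) = \frac{s^2}{n}, \qquad SD(\overline{X}) = \frac{s}{\sqrt{n}}.
\]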
In your own words, explain the difference between these terms and put them in a logical sequence: Estimate, Estimand, Estimator. Give one example of each.
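The logical sequence is estimand, estimator, estimate. The estimand is the unknown quantity you want to learn about. The estimator is the rule or procedure you apply to data to learn about it. The estimate is the specific numerical value the estimator produces from a particular sample. For example, if you want to know how many waves break at Campus Point every hour, the estimand is the true average number of waves per hour, the estimator is the method of counting waves during several sampled hours and averaging the counts, and the estimate is the number you get from that calculation.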
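The three terms map onto three separate objects in code. The sketch below is purely illustrative: the true rate of 40 waves per hour, the Poisson model, and the 25 sampled hours are all assumptions made up for the example.

```r
# Hypothetical setup: hourly wave counts follow a Poisson distribution
set.seed(3)
estimand <- 40  # true mean waves per hour (unknown in practice)

# Estimator: the rule, here "average the observed hourly counts"
estimator <- function(counts) mean(counts)

# Data from 25 observed hours, then the estimate the rule produces
sampled_hours <- rpois(25, lambda = estimand)
estimate <- estimator(sampled_hours)

print(estimate)
```

Only in a simulation is the estimand known, which is what makes it possible to compare the estimate against it.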
If you repeatedly draw a sample and take a mean, how will the distribution of the mean change if the variance of the underlying population is small versus big? If your sample size is small versus big? Explain.
The distribution of the sample mean is centered on the population mean in every case; what changes is its spread, since for independent draws \(Var(\overline{X}) = s^2/n\). If the variance of the underlying population is small, the sample means cluster tightly around the population mean; if it is big, the sample means are more spread out. If the sample size is small, the sample means vary a lot from one sample to the next; if it is big, they are tightly concentrated around the population mean, which is what the law of large numbers describes. Separately, the central limit theorem says that as the sample size grows, the distribution of the sample mean looks more and more like a normal distribution.
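A small simulation makes the comparison concrete. This is a sketch under assumed values: the population is taken to be normal with mean 0, and the sample sizes (10 and 100), population standard deviations (1 and 5), and helper name `sd_of_means` are all hypothetical choices.

```r
# Spread of the sample mean across repeated samples (simulated, hypothetical values)
set.seed(4)

# Draw `reps` samples of size n and return the SD of their means
sd_of_means <- function(n, pop_sd, reps = 2000) {
  means <- replicate(reps, mean(rnorm(n, mean = 0, sd = pop_sd)))
  sd(means)
}

results <- c(
  small_n_small_var = sd_of_means(n = 10,  pop_sd = 1),
  big_n_small_var   = sd_of_means(n = 100, pop_sd = 1),
  small_n_big_var   = sd_of_means(n = 10,  pop_sd = 5),
  big_n_big_var     = sd_of_means(n = 100, pop_sd = 5)
)
print(round(results, 3))
```

Each result should land near the population standard deviation divided by the square root of the sample size, so the spread shrinks with a larger sample and grows with a more variable population.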