Submit your HTML output and .Rmd file to Canvas by the deadline.
Please make sure you have downloaded this file (pset2.Rmd) to your computer and opened it in RStudio in the cloud platform. By download, we do not mean you just clicked on it in your browser; we mean you have saved the actual file to a directory on your computer (as pset2.Rmd, not as pset2.Rmd.txt!), and then opened it with RStudio in the cloud platform. You should now be looking at the “raw” text of the .Rmd file.
If you need to re-orient yourself, please review the introductory material at the start of pset 1, which describes how to include R code “chunks” in this .Rmd file. Remember that when you “knit” the .Rmd file, only the code written in code “chunks” will be executed and have its results integrated into the output HTML file. For example, the code chunk below provides a summary of the built-in dataset called cars. Take a look at the .Rmd code that produces it, then click “knit” and see how it shows up in the output HTML file:
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
Please also remember that you will want to use the console (the bottom-left panel of the screen) to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your .Rmd file. Remember that the code within your .Rmd file has to be self-contained and include all the steps; your .Rmd file will not “remember” what you did on your own in the console. When you click knit, it can only execute the code that is present in the .Rmd file. Do not copy the results from your console into your .Rmd file. In addition, do not include large amounts of output in your writeup (i.e., don’t print full datasets to the screen).
Include both the code to get your answer and your answer in words.
Finally, it is best to work with small amounts of code at a time: get some code working, copy it into the .Rmd file as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. This will save you huge headaches.
Make sure your final Rmd file knits correctly, and check as you work – don’t wait until the very end to try knitting your code.
100 Points Total
For this problem set, we will be working with a modified version of the dataset for the paper “Counting Calories: Democracy and Distribution in the Developing World” by Lisa Blaydes and Mark Andreas Kayser, published in the International Studies Quarterly in 2011 (Volume 55, Issue 4, December 2011, Pages 887–908).
1.1. Load the modified version of the Blaydes and Kayser dataset,
which is named bk2011.RData. What is the name of this
dataset once you load it into R?
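A minimal sketch of how `load()` reports object names, using a small stand-in file rather than bk2011.RData. The data frame `toy_data` and the temporary file are hypothetical and exist only for illustration. `load()` returns (invisibly) the names of the objects it restores, so storing its return value is one way to see what a .RData file contains:

```r
# Hypothetical stand-in: a tiny data frame saved to a temporary .RData file
toy_data <- data.frame(country = c("A", "B"), year = c(2000, 2001))
tmp_file <- tempfile(fileext = ".RData")
save(toy_data, file = tmp_file)

# Remove the object so we can see load() bring it back
rm(toy_data)

# load() restores the saved objects and returns their names as a character vector
loaded_names <- load(tmp_file)
print(loaded_names)
```

With the real file, the same pattern would be `load("bk2011.RData")`, run from the directory that holds the file; `ls()` also lists the objects currently in the workspace.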
1.2. What are the names of the variables stored in this dataset? (5 points)
variable.names(blaydes_kayser2011)
## [1] "country" "year" "calstotal" "calsanimal" "polity2"
## [6] "gdpcxr100"
1.3. How many variables do you have? (5 points)
ncol(blaydes_kayser2011)
## [1] 6
1.4. What is your sample size, given this data set? (PLEASE DO NOT PRINT THE WHOLE DATASET IN YOUR OUTPUT!) (5 points)
nrow(blaydes_kayser2011)
## [1] 3532
1.5. What is the unit of analysis? (Hint: it involves units and time) (5 points)
The unit of analysis is the country during the given year.
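One way to check a country-year unit of analysis is to confirm that no (country, year) pair appears more than once. The sketch below uses a hypothetical toy panel (`toy_panel`, with invented values), not the actual dataset:

```r
# Hypothetical country-year panel (values invented for illustration)
toy_panel <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(2000, 2001, 2000, 2001),
  calstotal = c(2100, 2150, 2900, 2950)
)

# anyDuplicated() returns 0 when no (country, year) combination repeats
n_repeats <- anyDuplicated(toy_panel[, c("country", "year")])
print(n_repeats)

# Number of yearly observations per country
print(table(toy_panel$country))
```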
1.6. The variable calstotal is the total number of
calories consumed in a country-year. Show the sample distribution of
this variable. Specifically, create a density plot, and a boxplot.
Remember, plots need to be labelled. (5 points)
# A horizontal boxplot has a single data axis, so only xlab is labelled
boxplot(blaydes_kayser2011$calsanimal,
        xlab = "Total calories consumed from animal products",
        main = "Average calories consumed from animal products", horizontal = TRUE)
# A density plot shows density, not frequency, on the vertical axis
plot(density(blaydes_kayser2011$calsanimal),
     xlab = "Total calories consumed from animal products",
     ylab = "Density",
     main = "Average calories consumed from animal products")
1.7. What would you call the shape of the distribution (density plot)
in 1.6. using terms from lecture? Compute the median and mean of
calsanimal and report their values in your code. Then add
these values to your chart as lines. Comment on whether the mean and
median are the same and explain why or why not. Hint: what is the
skewness of this distribution? (10 points)
mean(blaydes_kayser2011$calsanimal)
## [1] 305.5438
median(blaydes_kayser2011$calsanimal)
## [1] 256.665
plot(density(blaydes_kayser2011$calsanimal) ,
xlab = "Total calories consumed from animal sources",
main = "Average Calories From Animal Sources")
abline(v=mean(blaydes_kayser2011$calsanimal), col = "red")
abline(v=median(blaydes_kayser2011$calsanimal), col = "blue")
legend("topright", legend = c("mean","median"), lty = c(1,1), col = c("red","blue"))
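The distribution is right-skewed. The mean and the median are not the same: the mean (305.54) lies to the right of the median (256.67). This happens because the long right tail of high values pulls the mean upward, while the median depends only on the middle of the ordered data and is much less affected by extreme values.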
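A small sketch of why a right tail separates the mean from the median. The vector `x` is hypothetical and chosen only to make the effect easy to see:

```r
# Hypothetical right-skewed values: most are small, one is very large
x <- c(1, 2, 2, 3, 3, 4, 50)

mean_x <- mean(x)       # pulled upward by the single large value
median_x <- median(x)   # the middle value of the sorted data

print(mean_x)
print(median_x)
print(mean_x > median_x)
```

Here the median is the fourth sorted value (3), while the mean is 65/7, roughly 9.3, because the value 50 contributes heavily to the sum.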
1.8. Repeat 1.6 and 1.7., but this time show the distribution of the
natural log of calsanimal, which is a variable you will
have to generate manually. When plotting this, you should again use a
density plot and a boxplot. Remark on the difference in shape when using
the log of the variable. Are your mean and median closer together or
farther apart? Why? (10 points)
blaydes_kayser2011$log_calsanimal <- log(blaydes_kayser2011$calsanimal)
mean(blaydes_kayser2011$log_calsanimal)
median(blaydes_kayser2011$log_calsanimal)
boxplot(blaydes_kayser2011$log_calsanimal,
        horizontal = TRUE,
        xlab = "Log of calories from animal products",
        main = "Log of calories from animal products")
# Axis labels are passed as arguments inside plot()
plot(density(blaydes_kayser2011$log_calsanimal),
     main = "Log of calories from animal products",
     xlab = "Log of calories from animal products",
     ylab = "Density")
abline(v=mean(blaydes_kayser2011$log_calsanimal), lty=2, lwd=2, col = "red")
abline(v=median(blaydes_kayser2011$log_calsanimal), lty=4, lwd=2, col = "blue")
legend("topright", legend = c("mean", "median"), lty = c(2,4), lwd = 2, col = c("red", "blue"))
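To illustrate what a log transformation typically does to a right-skewed variable, here is a sketch on simulated data. The log-normal draws below are an assumption made for illustration and are not the calsanimal values. Because the raw and logged variables are on different scales, the gap between mean and median is divided by the standard deviation in each case:

```r
set.seed(42)

# Simulated right-skewed data (log-normal), standing in for a skewed variable
x <- rlnorm(1000, meanlog = 5, sdlog = 1)
log_x <- log(x)

# Gap between mean and median, measured in standard deviations
gap_raw <- (mean(x) - median(x)) / sd(x)
gap_log <- (mean(log_x) - median(log_x)) / sd(log_x)

print(c(raw = gap_raw, logged = gap_log))
```

For data like these, the logged variable is close to symmetric, so its mean and median nearly coincide, while the raw variable keeps a visible gap.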
In the same dataset, the variable polity2 describes the
political regime in a given country-year. Use the following code to
create a binary variable called democracy that takes the
value of 1 when the country has a Polity IV score larger than 6 and zero
otherwise. Researchers take this value as the threshold that separates
democracies from non-democracies (5 points)
The code below is using the ifelse function, which is a conditional statement. It takes three elements: a condition based on a variable, what to do when that condition applies, and what to do when it does not.
You can read the code in line 86 as follows: (1) Create a variable called “democracy” within the blaydes_kayser2011 dataset; (2) this variable is equal to 1 if blaydes_kayser2011$polity2>6 and it is equal to 0 otherwise. This will create a variable with two values: 1 when the Polity IV score is greater than 6 and 0 when it is equal to or smaller than 6.
Note: please remove the # from the code below so it can run
blaydes_kayser2011$democracy <- ifelse(blaydes_kayser2011$polity2>6,1,0)
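A quick sketch of how `ifelse` behaves on a few values. The vector `polity_scores` is hypothetical and is used only to show the three cases: above the threshold, at or below it, and missing:

```r
# Hypothetical Polity scores, including the threshold value 6 and a missing value
polity_scores <- c(-7, 0, 6, 7, 10, NA)

# 1 when the score is strictly greater than 6, 0 otherwise; NA stays NA
democracy <- ifelse(polity_scores > 6, 1, 0)
print(democracy)
```

A score of exactly 6 is coded 0 because the comparison is strict, and a missing score yields a missing value rather than 0.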
1.9. What is the mean value of the total number of calories consumed
(calstotal) for countries classified as democracies? What
is the mean value of total calories consumed (calstotal)
for countries classified as non-democracies? What is the standard
deviation for both groups? What does this difference in standard
deviations suggest about how much variation there is in the total number
of calories consumed for democracies versus autocracies? (5 points)
mean(blaydes_kayser2011$calstotal[blaydes_kayser2011$democracy == 1])
## [1] 2532.217
mean(blaydes_kayser2011$calstotal[blaydes_kayser2011$democracy == 0])
## [1] 2330.711
sd(blaydes_kayser2011$calstotal[blaydes_kayser2011$democracy == 1])
## [1] 409.5074
sd(blaydes_kayser2011$calstotal[blaydes_kayser2011$democracy == 0])
## [1] 408.2518
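**The standard deviations are nearly identical (409.51 for democracies versus 408.25 for non-democracies). This suggests that, although the group means differ, the amount of variation in total calories consumed is about the same in both groups.**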
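As an alternative to subsetting each group by hand, `tapply` computes a statistic for every group in one call. The sketch below uses a hypothetical toy data frame (`toy`, with invented values) rather than the actual dataset:

```r
# Hypothetical data: three non-democracies (0) and three democracies (1)
toy <- data.frame(
  calstotal = c(2000, 2200, 2400, 2500, 2700, 2900),
  democracy = c(0, 0, 0, 1, 1, 1)
)

# Apply mean() and sd() to calstotal separately within each democracy group
group_means <- tapply(toy$calstotal, toy$democracy, mean)
group_sds <- tapply(toy$calstotal, toy$democracy, sd)

print(group_means)
print(group_sds)
```

In this toy example the group means differ by 500 while the two standard deviations are equal, which mirrors the pattern of different centers with similar spread.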
1.10 Describe the gdpcxr100 variable: what is the minimum and maximum? What is the mean value? What is the standard deviation? (5 points)
summary(blaydes_kayser2011$gdpcxr100)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5652  3.1135  8.2259 14.7479 19.0982 99.9781
sd(blaydes_kayser2011$gdpcxr100)
## [1] 17.20072
1.11 The main claim in the paper is that there is a relationship between regime type and the total number of calories consumed. In this case, what are your independent and dependent variables? Make a scatterplot that shows the relationship between these two variables, including a regression line. Describe the relationship you find. What happens to the number of calories consumed as you move from autocracy to democracy? (10 points)
plot(blaydes_kayser2011$polity2, blaydes_kayser2011$calstotal,
     main = "Total calories consumed by Polity score",
     xlab = "Polity score (polity2)",
     ylab = "Total calories consumed (calstotal)")
# Fit a linear regression of calories on the Polity score and overlay the fitted line
model <- lm(calstotal ~ polity2, data = blaydes_kayser2011)
abline(model, col = "red")
Bonus question 1: in the plot above, why does the data on the x-axis look the way it does? Based on the code we used for this problem set, what could be a better way to see the differences between democracies and non-democracies in terms of the total calories consumed? (10 points)
Hint: use par(mfrow=c(1,2)) to plot two graphs one next to the other
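A minimal sketch of the `par(mfrow=c(1,2))` layout from the hint, drawn with simulated data. The data frame `toy` and its values are hypothetical; the point is only to show how two plots are placed side by side on a shared scale:

```r
set.seed(1)

# Hypothetical data: 50 non-democracies (0) and 50 democracies (1)
toy <- data.frame(
  democracy = rep(c(0, 1), each = 50),
  calstotal = c(rnorm(50, mean = 2300, sd = 400),
                rnorm(50, mean = 2500, sd = 400))
)

non_dem <- toy$calstotal[toy$democracy == 0]
dem <- toy$calstotal[toy$democracy == 1]

# Use the same vertical range in both panels so they are directly comparable
shared_range <- range(toy$calstotal)

# One row, two columns of plots; keep the old settings so they can be restored
old_par <- par(mfrow = c(1, 2))
boxplot(non_dem, ylim = shared_range, main = "Non-democracies",
        ylab = "Total calories consumed")
boxplot(dem, ylim = shared_range, main = "Democracies",
        ylab = "Total calories consumed")
par(old_par)
```

Restoring `old_par` at the end returns the plotting window to a single panel, so later plots in the same document are not affected.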
Bonus question 2: how would you summarize the main hypothesis of the paper? (5 points)
**The hypothesis of this paper states that
For the theoretical questions this week, we will use a feature of R Markdown files that allows us to write mathematical notation, called “LaTeX”. For Question 2, whenever you see an expression between $ symbols, it is math notation. Note that you won’t need to write your answers in LaTeX; plain text is fine.
(15 points)
Suppose you have a random variable \(X\) with expectation \(E[X]=u\) and variance \(s^2\). You then draw multiple observations from the same distribution. That is, you draw \(X_1, X_2, \ldots, X_n\), each a random variable with expectation \(u\) and variance \(s^2\).
The average of these random variables is called the sample mean
The standard deviation of the random variables is the square root of the variance
It is the estimated value of the population mean based on the sample mean
This shows the expectation of the variance for the sample mean. It shows the expected distributions of the plots
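As a worked sketch of the quantities discussed above, and assuming the draws \(X_1, \ldots, X_n\) are independent (an assumption not stated in the setup), the expectation and variance of the sample mean follow from linearity of expectation and from the fact that variances of independent variables add. The symbol \(\bar{X}\) for the sample mean is introduced here for illustration:

```latex
\[
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\cdot n\,u = u,
\]
\[
\operatorname{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
= \frac{1}{n^2}\cdot n\,s^2 = \frac{s^2}{n}.
\]
```

Under these assumptions the sample mean is centered on \(u\), and its standard deviation \(s/\sqrt{n}\) shrinks as \(n\) grows.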
(15 points)
An estimand is the quantity in the population that you want to learn about (for example, if you ask whether wearing hats contributes to male baldness, the estimand could be the difference in the rate of baldness between men who wear hats and men who do not). The estimator is the function of the sample data used to approximate the estimand (for example, taking the sample proportion to find how often a 7 is rolled when rolling two dice). The estimate is the specific number the estimator produces for a particular sample (for example, if a coin is flipped 10 times and 6 flips are heads, the estimate of the probability of heads is 0.6).
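The three terms can be made concrete with the dice example. This is a simulation sketch: the number of rolls and the seed are arbitrary choices made for illustration.

```r
set.seed(2024)
n_rolls <- 10000

# Estimand: the true probability that two fair dice sum to 7 (6 of 36 outcomes)
estimand <- 6 / 36

# Estimator: the sample proportion of rolls that sum to 7
estimator <- function(sums) mean(sums == 7)

# Simulate rolling two dice n_rolls times and add the faces
dice_sums <- sample(1:6, n_rolls, replace = TRUE) +
  sample(1:6, n_rolls, replace = TRUE)

# Estimate: the number the estimator returns for this particular sample
estimate <- estimator(dice_sums)
print(c(estimand = estimand, estimate = estimate))
```

A different seed would give a slightly different estimate, while the estimand and the estimator stay the same.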
With a larger sample size, the distribution will look approximately normal, forming a bell curve even if the variance is high. The main difference will be how spread out the curve is, with low variance creating a tighter curve and high variance creating a wider one. With a smaller sample size, the variance matters much more, and the distribution is more likely to look skewed to one side or the other. The higher the variance, the more pronounced this effect.
The law of large numbers states that the sample mean gets closer and closer to the expected value as more observations are drawn. The central limit theorem says that, as the sample size grows, the distribution of the sample mean looks more and more like a normal distribution, regardless of the shape of the underlying distribution (provided its variance is finite).
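Both results can be seen in a short simulation. The sketch below assumes an exponential distribution with mean 1 as the underlying (skewed) distribution; that choice, the sample sizes, and the seed are illustrative assumptions.

```r
set.seed(123)

# Law of large numbers: the running mean of exponential draws (true mean 1)
# settles near 1 as more observations accumulate
draws <- rexp(10000, rate = 1)
running_mean <- cumsum(draws) / seq_along(draws)
print(running_mean[c(10, 100, 1000, 10000)])

# Central limit theorem: take 2000 samples of size 50 and keep each sample mean
sample_means <- replicate(2000, mean(rexp(50, rate = 1)))
print(mean(sample_means))
print(sd(sample_means))   # theory suggests about 1 / sqrt(50)

# The histogram of sample means is roughly bell-shaped even though
# the exponential distribution itself is strongly right-skewed
hist(sample_means, main = "Sample means (n = 50)", xlab = "Sample mean")
```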