PS 15: Problem Set 3 (Due 22 April 2016, 6 PM)

Jack Michael Morgenson. Collaborated with Patrick Laurence Bourke and Paul Hans Mohr.

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document will be generated that includes both content and the output of any embedded R code chunks within the document.

Please also remember that you will want to use the console to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your Rmd file. Remember that the code within your Rmd file has to be self-contained and include all the steps: your Rmd file will not “remember” what you did on your own in the console. When you click Knit, it can only execute the code that is present in the Rmd file. Do not copy the results from your console into your Rmd file. In addition, do not include large amounts of output in your writeup (i.e., don’t print full datasets to the screen).

Include both the code to get your answer and your answer in words.

It is best to work with small amounts of code at a time: get some code working, copy it into the Rmd file as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. If you knit every time you write some new code, you’ll know where the error is (in the last thing you did!). This will save you huge headaches.

Although the questions break up each task for you into parts, remember that you might need to put a bunch of code together into a single chunk to make it work. For example, if you create a density plot in one part of a question, and want to add the mean value to it as a line in another part, you need these two commands to follow one another in the same chunk of code.
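As a minimal sketch of the point above, the chunk below draws a density plot and then adds the mean as a vertical line within the same chunk. The data are simulated and the names `x` and `dens` are placeholders for illustration, not anything from the problem set.

```r
# Hypothetical data for illustration only (not from fl2)
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)

# Step 1: draw the density plot
dens <- density(x)
plot(dens, main = "Density of x")

# Step 2: add the mean as a dashed vertical line.
# abline() draws on the most recent plot, so it has to run after plot()
# in the same chunk.
abline(v = mean(x), lty = 2)
```

If the `abline()` call were placed in a separate chunk, there would be no open plot for it to draw on when the file is knit.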

Some tips: Start early, work with friends in the class, use the discussion forum, come to class and section, go to office hours if you need to, read the textbook and other readings – do all these things and you’ll succeed! Good luck.

Question 1.

First, we will load the same dataset (derived from Fearon and Laitin, 2003) that you used in the last problem set.

Set your working directory and load the data.

# Set the working directory first, then load the data from it
setwd("C:/Users/Jack/Documents/PSET3")
load("fl2.RData")

We will deal mainly with two variables: polity2l and gdpenl. The polity2l variable is a whole number between -10 and 10, measuring where a country falls between full autocracy (-10) and full democracy (10). The variable gdpenl measures the GDP per capita of each country in 1960 in thousands of dollars.

Produce a scatter plot with polity2l on the horizontal axis and gdpenl on the vertical axis. Given how you have set up your plot, which is your independent variable and which is your dependent variable? Bonus: explain why the data looks how it does along the x-axis, where the data points are all lined up. Specifically, what kind of variable is ‘polity’?

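# Scatter plot: polity2l on the x-axis (independent), gdpenl on the y-axis (dependent)
plot(fl2$polity2l, fl2$gdpenl, xlab = "Autocracy/Democracy", ylab = "GDP per capita in thousands")
# Overlay the fitted line from regressing gdpenl on polity2l
abline(lm(gdpenl ~ polity2l, data = fl2))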

cov(fl2$gdpenl, fl2$polity2l)

## [1] -0.5289266

cor(fl2$gdpenl, fl2$polity2l)

## [1] -0.01359485

In the same code chunk as b, use the abline command to put the linear regression line on the plot. See the course slides for an example.
Estimate and report the covariance of polity2l with gdpenl, and their correlation. Write the meaning of what these results tell you, using the meaning for these two variables (stated above). Covariance: -0.5289. Correlation: -0.0136. The correlation is about -0.01, which is almost zero, so GDP per capita and how autocratic or democratic a state is have practically no linear relationship in these data. The covariance is also slightly negative, but its size depends on the units of the two variables, so the correlation is the easier number to interpret.
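Write down the model that we are fitting when we do a linear regression of gdpenl on polity2l, using \(\beta_0\) and \(\beta_1\) where necessary. What does the \(\beta_{0}\) mean? What does the \(\beta_1\) mean? You do not have to estimate the model yet so this is not in a code chunk. The model is \(gdpenl_i = \beta_0 + \beta_1 \, polity2l_i + \epsilon_i\). \(\beta_0\) is the intercept: the expected GDP per capita (in thousands of dollars) for a country with a polity2l score of 0. \(\beta_1\) is the slope: the expected change in GDP per capita associated with a one-point increase in polity2l. With the estimates reported in the regression output below, the fitted line is GDP/Capita = 2.46268 + (-0.01069)*(polity2l) + residuals.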
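Explain how we will estimate the best values of \(\beta_0\) and \(\beta_1\). In what sense is the line that we choose (by choosing \(\beta_0\) and \(\beta_1\)) the “best-fitting” line? We estimate \(\beta_0\) and \(\beta_1\) by ordinary least squares: we choose the intercept and slope that minimize the sum of squared residuals, that is, the squared vertical distances between each observed gdpenl value and the line. The line is “best-fitting” in the sense that no other line has a smaller sum of squared residuals. Code used: summary(lm(fl2$gdpenl ~ fl2$polity2l))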
Now use linear regression to regress gdpenl on polity2l using the lm function in R - make sure you save the model as an object. Show the result using the summary() command. Interpret the meaning of the coefficient estimates (both the intercept and the coefficients on polity2l). Bonus: Consider the p-values reported on the table in your interpretation, if you want to read ahead and figure out what these mean.

model2<-lm(gdpenl~polity2l, data=fl2)
summary(model2)

## 
## Call:
## lm(formula = gdpenl ~ polity2l, data = fl2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.508 -1.812 -1.395  0.215 51.353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.46268    0.44439   5.542 1.27e-07 ***
## polity2l    -0.01069    0.06339  -0.169    0.866    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.55 on 154 degrees of freedom
## Multiple R-squared:  0.0001848,  Adjusted R-squared:  -0.006307 
## F-statistic: 0.02847 on 1 and 154 DF,  p-value: 0.8662

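The intercept is 2.46: a country with a polity2l score of 0 is predicted to have a GDP per capita of about 2.46 thousand dollars. The coefficient on polity2l is -0.0107: a one-point increase in polity2l is associated with a GDP per capita that is lower by about 0.0107 thousand dollars (roughly $11). The p-value on this coefficient is 0.866, meaning that if the true slope were zero, an estimate at least this far from zero would occur about 87% of the time, so the slope is not statistically distinguishable from zero.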
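To make the idea of a least-squares fit concrete, here is a minimal sketch on simulated data. The variables `x` and `y`, the sample size, and the true coefficients are assumptions for illustration, not the fl2 data. It checks that the estimates from `lm` match the closed-form least-squares formulas, slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x), and that moving the slope away from that value raises the sum of squared residuals.

```r
# Simulated data for illustration only
set.seed(42)
x <- sample(-10:10, 100, replace = TRUE)   # stand-in for a polity-style score
y <- 2 + 0.3 * x + rnorm(100)              # assumed true intercept 2, slope 0.3

fit <- lm(y ~ x)

# Closed-form least-squares estimates
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

# Sum of squared residuals for any candidate line
ssr <- function(intercept, slope) sum((y - intercept - slope * x)^2)

coef(fit)                            # estimates from lm
c(b0, b1)                            # the same values from the formulas
ssr(b0, b1) < ssr(b0, b1 + 0.05)     # TRUE: nudging the slope raises the SSR
```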

Sometimes we need to transform a variable to make it more suitable for analysis by regression. For example, with income-related variables like gdpenl, we usually need to take their log first before using regression. Create a new variable that is equal to the log of gdpenl.

# Store the log as a column of fl2 so that fl2$logGDP and data = fl2 both find it
fl2$logGDP <- log(fl2$gdpenl)

Now remake a scatter plot like before but with polity2l on the horizontal axis and the log of gdpenl on the vertical axis. Add the regression line.

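# Plot the log of GDP per capita (y) against polity2l (x)
plot(fl2$polity2l, log(fl2$gdpenl), xlab = "Autocracy/Democracy", ylab = "Log of GDP per capita")
# The line must come from regressing the y variable on the x variable
abline(lm(log(gdpenl) ~ polity2l, data = fl2))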

It turns out that when you regress a logged dependent variable on an (unlogged) independent variable, we can roughly interpret the coefficient \(\beta\) as meaning “a one-unit shift in the independent variable corresponds to a 100\(\beta\) percent increase in the dependent variable.”

For example, a \(\beta\) of 0.01 from such a regression would imply that a one-unit change in the independent variable is associated with a \(1\%\) higher value of the dependent variable. (This is just an approximation, but for coefficient estimates near zero, it is okay.)
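A small sketch of how good this approximation is. When the log of the dependent variable rises by \(\beta\), the exact percent change is \(100(e^{\beta} - 1)\), while the rule of thumb uses \(100\beta\). The value 0.01 is the one from the text; 0.10 and 0.50 are extra values chosen here for illustration.

```r
# Approximate vs. exact percent change implied by a coefficient on a logged outcome
beta <- c(0.01, 0.10, 0.50)

approx_pct <- 100 * beta              # rule of thumb from the text
exact_pct  <- 100 * (exp(beta) - 1)   # exact change when log(y) rises by beta

data.frame(beta, approx_pct, exact_pct)
```

For \(\beta = 0.01\) the two agree closely (1% versus about 1.005%), but for \(\beta = 0.5\) the rule of thumb gives 50% while the exact figure is about 64.9%, which is why the approximation is only recommended for coefficients near zero.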

Using this knowledge, re-run your regression but now regress the (log of) gdpenl on polity2l. Use summary() to show the results. Interpret the new coefficient on polity2l. How has it changed compared to your earlier regression? Would you say your results are robust? Bonus: Interpret the new p-value.

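plot(fl2$polity2l, log(fl2$gdpenl))
abline(lm(log(gdpenl) ~ polity2l, data = fl2))

# Note: this call has polity2l as the dependent variable, the reverse of what
# the question asks for (logGDP ~ polity2l). The output below comes from this call.
summary(lm(polity2l~logGDP, data=fl2))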

## 
## Call:
## lm(formula = polity2l ~ logGDP, data = fl2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.069  -5.619  -0.476   6.426  12.127 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.6354     0.5476  -1.160    0.248    
## logGDP        2.1830     0.5097   4.283 3.24e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.669 on 154 degrees of freedom
## Multiple R-squared:  0.1064, Adjusted R-squared:  0.1006 
## F-statistic: 18.34 on 1 and 154 DF,  p-value: 3.237e-05

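The new slope is 2.1830 and the intercept is -0.6354. Note that the regression above was run with polity2l as the dependent variable and logGDP as the independent variable, which is the reverse of what the question asks; the requested call would be lm(logGDP ~ polity2l, data = fl2). Read as run, a one-unit increase in the log of GDP per capita is associated with a polity2l score that is about 2.18 points higher, and the p-value of 3.24e-05 means this slope is statistically distinguishable from zero. A regression slope is not a correlation and does not have to lie between -1 and 1, so a value of 2.18 is not a problem in itself. In a regression with a single independent variable, the t-statistic, p-value, and R-squared are the same whichever variable is on the left-hand side, so the positive and significant relationship carries over to the requested regression; only the slope and intercept would differ. Compared with the earlier regression of unlogged gdpenl (slope -0.0107, p = 0.866), the relationship changes sign and becomes significant, so the results are not robust to the log transformation.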
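The regression output above has polity2l on the left-hand side, so it helps to see what does and does not change when the two variables are swapped. The sketch below uses simulated data only; the names `polity` and `log_gdp`, the sample size, and the coefficients are assumptions for illustration and do not reproduce the fl2 results.

```r
# Simulated data for illustration only; names and coefficients are assumptions
set.seed(15)
n <- 156
polity  <- sample(-10:10, n, replace = TRUE)
log_gdp <- 0.5 + 0.05 * polity + rnorm(n)
sim <- data.frame(polity, log_gdp)

fit_asked    <- lm(log_gdp ~ polity, data = sim)   # log GDP is the outcome
fit_reversed <- lm(polity ~ log_gdp, data = sim)   # polity is the outcome

# The two slopes answer different questions and differ in size
slope_asked    <- unname(coef(fit_asked)["polity"])
slope_reversed <- unname(coef(fit_reversed)["log_gdp"])

# The t-statistic on the slope and the R-squared do not depend on direction
t_asked     <- summary(fit_asked)$coefficients["polity", "t value"]
t_reversed  <- summary(fit_reversed)$coefficients["log_gdp", "t value"]
r2_asked    <- summary(fit_asked)$r.squared
r2_reversed <- summary(fit_reversed)$r.squared

c(slope_asked = slope_asked, slope_reversed = slope_reversed)
c(t_asked = t_asked, t_reversed = t_reversed)
c(r2_asked = r2_asked, r2_reversed = r2_reversed)
```

The slopes differ because each is measured in units of its own outcome per unit of its own predictor, while the t-statistic and R-squared depend only on how strongly the two variables are linearly related.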

Regardless of what you actually got in the above analyses, suppose that we find a positive and statistically significant coefficient in these regressions. Does this warrant the conclusion that “being more democratic (having a higher polity score) increases a country’s GDP per capita?” Why or why not? Follow the instructions in class for how to address such causal questions, including pointing out potential confounders, non-comparability, and proposing the ideal research design.

I am still unable to draw a causal conclusion about regime type and GDP per capita, because these are observational (real-world) data rather than an experiment: countries were not randomly assigned their level of democracy, so more and less democratic countries may differ in other ways that also affect GDP. Possible confounders include education levels, religion, ethnicity, and the amount of taxes collected.
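A simulated sketch of how a confounder can produce a positive regression coefficient without any causal effect. Everything here is an assumption made for illustration: the variable names, the sample size, and the data-generating process, in which a hypothetical `education` variable raises both `democracy` and `gdp` while `democracy` has no effect on `gdp` at all.

```r
# Simulated illustration of confounding; every name and number is an assumption
set.seed(2016)
n <- 5000
education <- rnorm(n)                    # hypothetical confounder
democracy <- education + rnorm(n)        # education raises democracy
gdp       <- 2 * education + rnorm(n)    # education raises GDP; democracy has no effect

naive    <- lm(gdp ~ democracy)              # omits the confounder
adjusted <- lm(gdp ~ democracy + education)  # holds the confounder fixed

slope_naive    <- unname(coef(naive)["democracy"])
slope_adjusted <- unname(coef(adjusted)["democracy"])

c(naive = slope_naive, adjusted = slope_adjusted)
```

Under this setup the naive slope should come out clearly positive and the adjusted slope close to zero. With real data we rarely observe every confounder, which is why randomization is the ideal design.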

Question 2.

Read the assigned article for this section of the course, Wand et al., “The Butterfly Did It.” Focus on the abstract, introduction, figures, tables, and conclusion. You don’t need to read every word, but you do need to understand the argument and evidence. This is a very important article written in the discipline’s top journal (APSR), so it should give you a good example of how we use statistics in practice to answer questions about politics. Note: they use a slightly different regression model (estimator) than we are discussing in class, but it is a similar enough approach that you should be able to follow the overall argument.

In your own words, state the authors’ research question. Are they trying to answer a causal question?

The research question is whether the butterfly-style ballot caused some thousand or so voters to vote for Buchanan instead of Gore, affecting the result of the entire presidential election. Yes, they are trying to answer a causal question.

What is their independent variable? What is their dependent variable?

Independent variable: the format of the ballot (the butterfly ballot used on election day vs. the absentee ballot). Dependent variable: the voting outcome, that is, the share of votes cast for Buchanan.

Examine Figure 3. What could you call Palm Beach County (PBC)? Why? (Use a key term we’ve learned in the course).

Palm Beach County is an outlier (an anomaly): its result lies far from the pattern in the other counties. That is the reason for the controversy and why the design of the butterfly ballot is called into question.

What is the main finding from the paper? In other words, how do the authors answer their research question? What is the main evidence they use to make this claim? Put this in your own words.

The main finding is that the butterfly ballot did cause significant voting error, giving Buchanan more votes than were meant for him. The main evidence is that votes for Buchanan came from Democrats, who were likely going to vote for Gore, which suggests they made a mistake in their voting. Also, the proportion of votes for Buchanan on election day did not match the proportion among the absentee ballots.

Examine Table 4. In Palm Beach County, how much more likely was it for a person casting a ballot for the Democratic Senator candidate to also vote for Buchanan on (i) election day; versus, (ii) absentee? What explains this difference in probability according to the authors? Note: this is a simple calculation if you read and understand the table.

If you voted for the Democratic Senate candidate, you would be 6 times as likely to also vote for Buchanan on election day as on an absentee ballot (the calculation is 0.0102/0.0017 = 6). The authors would say that this is due to the difference in ballot format between election day and the absentee ballot.
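The ratio can be checked directly in R. The two proportions are the ones quoted in the answer above; the variable names are placeholders.

```r
# Proportions quoted in the answer above
p_election_day <- 0.0102
p_absentee     <- 0.0017

# How many times as likely on election day as on an absentee ballot
ratio <- p_election_day / p_absentee
ratio
```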

Question 3.

You mentioned to your friend that you’re learning some really neat stuff in PS 15. Now they’re curious. They want to know how they too can figure out relationships in the political world.

Your friend asks how correlation is different from covariance, and for a formula that can turn \(cov(x,y)\) into \(cor(x,y)\). Provide that formula, and explain how correlation relates to covariance. Also explain what the correlation means and the possible values it can take.

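Covariance is a measure of association between two variables, but its size depends on the units in which the variables are measured. Correlation rescales the covariance to a unit-free scale from -1 to 1 by dividing it by the product of the two standard deviations: cor(x,y) = cov(x,y)/(sd(x)*sd(y)). A correlation of 1 is a perfect positive linear relationship, -1 is a perfect negative linear relationship, and 0 means no linear relationship.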
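A minimal check of this formula on simulated data. The variables `x` and `y` are hypothetical and not taken from fl2. The sketch also shows why correlation is easier to interpret: changing the units of one variable rescales the covariance but leaves the correlation unchanged.

```r
# Simulated data for illustration only
set.seed(3)
x <- rnorm(50, mean = 0, sd = 7)
y <- 2 + 0.1 * x + rnorm(50)

# Turn the covariance into a correlation
cor_from_cov <- cov(x, y) / (sd(x) * sd(y))
all.equal(cor_from_cov, cor(x, y))    # TRUE

# Measure y in units 1000 times smaller (e.g. dollars instead of thousands)
cov_rescaled <- cov(x, 1000 * y)      # 1000 times the original covariance
cor_rescaled <- cor(x, 1000 * y)      # same correlation as before
```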

Your friend asks you to explain what a random variable is–in your own words, provide a definition. What are its key components?

A random variable is the quantity that we are measuring in a given experiment, whose value is determined by chance; for example, in this problem set, gdpenl and polity2l were treated as random variables. Its key components are a set of possible values and a probability of occurring attached to each value, which together make up its distribution.
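As a simple illustration of these components, here is a sketch using a fair six-sided die, an example chosen here and not part of the problem set. It lists the possible values, attaches a probability to each, and then draws from the distribution.

```r
# A fair six-sided die as a simple random variable (illustrative example)
values <- 1:6
probs  <- rep(1/6, 6)

sum(probs)                               # the probabilities sum to 1
expected_value <- sum(values * probs)    # 3.5

# Draw many realizations; their average should be close to the expected value
set.seed(7)
draws <- sample(values, size = 10000, replace = TRUE, prob = probs)
mean(draws)
```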

Your friend then says they have heard of linear regression before, but they don’t know how it works. Explain in simple language what the regression is doing to estimate a relationship between two variables. Come up with a specific political science example to help your friend understand.

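Linear regression estimates the linear relationship between two variables. The regression line is the line that fits the data best, in the sense that it minimizes the sum of the squared vertical distances between the points and the line. The slope tells you how much the dependent variable changes, on average, for a one-unit increase in the independent variable: a positive slope means a positive association, a negative slope means a negative association, and a slope of 0 means no linear association. The slope is not the same thing as the correlation and is not restricted to lie between -1 and 1. For example, you could regress a country’s poverty rate on its education level to see whether the two are associated. Of course, you still would not be able to say there is causation!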
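A regression slope and a correlation are related but are not the same quantity. The sketch below uses simulated data, with `x`, `y`, and the true slope of 3 all assumed for illustration, to show that the slope equals the correlation multiplied by sd(y)/sd(x), so the slope can be far outside the range from -1 to 1 while the correlation cannot.

```r
# Simulated data for illustration only
set.seed(11)
x <- rnorm(100)              # hypothetical predictor
y <- 3 * x + rnorm(100)      # assumed true slope of 3

slope <- unname(coef(lm(y ~ x))["x"])
r     <- cor(x, y)

slope                                  # around 3, well outside [-1, 1]
r                                      # always between -1 and 1
all.equal(slope, r * sd(y) / sd(x))    # TRUE
```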

Prof. Stokes

April 15, 2016
