This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document.

Please also remember that you will want to use the console to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your rmd. Remember that the code within your rmd file has to be self-contained and include all the steps – your rmd file will not “remember” what you did on your own in the console. When you click knit, it can only execute the code that was present in the rmd. Do not copy the results from your console into your RMD file. In addition, do not include large amounts of output in your writeup (i.e. don’t print full datasets to the screen).

Include both the code to get your answer and your answer in words.

It is best to work with small amounts of code at a time: get some code working, copy it into the rmd as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. If you knit every time you try to write some new code, you’ll know where the error is (in the last thing you did!). This will save you huge headaches.

Although the questions break up each task for you into parts, remember that you might need to put a bunch of code together into a single chunk to make it work. For example, if you create a density plot in one part of a question, and want to add the mean value to it as a line in another part, you need these two commands to follow one another in the same chunk of code.

Some tips: Start early, work with friends in the class, use the discussion forum, come to class and section, go to office hours if you need to, read the textbook and other readings – do all these things and you’ll succeed! Good luck.


Question 1. Democracy and GDP

First, we will load the same dataset (derived from Fearon and Laitin, 2003) that you used in the last problem set.

  1. Set your working directory and load the data.
setwd("/Users/alexsefayan/Desktop/PSthree")
getwd()
## [1] "/Users/alexsefayan/Desktop/PSThree"
load("fl2.RData")
summary(fl2)
##     cname                year           warl              war        
##  Length:156         Min.   :1945   Min.   :0.00000   Min.   : 0.000  
##  Class :character   1st Qu.:1947   1st Qu.:0.00000   1st Qu.: 0.000  
##  Mode  :character   Median :1954   Median :0.00000   Median : 0.000  
##                     Mean   :1958   Mean   :0.00641   Mean   : 5.635  
##                     3rd Qu.:1964   3rd Qu.:0.00000   3rd Qu.: 9.000  
##                     Max.   :1993   Max.   :1.00000   Max.   :52.000  
##      gdpenl            lpopl1          lmtnest          ncontig      
##  Min.   : 0.0510   Min.   : 5.403   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.6395   1st Qu.: 7.526   1st Qu.:0.6931   1st Qu.:0.0000  
##  Median : 1.0910   Median : 8.415   Median :2.3174   Median :0.0000  
##  Mean   : 2.4639   Mean   : 8.505   Mean   :2.0975   Mean   :0.1603  
##  3rd Qu.: 2.5940   3rd Qu.: 9.326   3rd Qu.:3.3150   3rd Qu.:0.0000  
##  Max.   :53.9010   Max.   :13.224   Max.   :4.5570   Max.   :1.0000  
##       Oil            nwstate           instab           polity2l       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :-10.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.: -7.0000  
##  Median :0.0000   Median :1.0000   Median :0.00000   Median : -1.0000  
##  Mean   :0.1154   Mean   :0.5192   Mean   :0.03205   Mean   : -0.1154  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:  7.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   : 10.0000  
##     ethfrac          relfrac          war_prop         numyears    
##  Min.   :0.0010   Min.   :0.0000   Min.   :0.0000   Min.   : 3.00  
##  1st Qu.:0.1438   1st Qu.:0.1861   1st Qu.:0.0000   1st Qu.:34.00  
##  Median :0.3850   Median :0.3750   Median :0.0000   Median :43.50  
##  Mean   :0.4083   Mean   :0.3807   Mean   :0.1393   Mean   :40.56  
##  3rd Qu.:0.6691   3rd Qu.:0.5800   3rd Qu.:0.2323   3rd Qu.:53.00  
##  Max.   :0.9250   Max.   :0.7828   Max.   :1.0000   Max.   :55.00

We will deal mainly with two variables: polity2l and gdpenl. The polity2l variable is a whole number between -10 and 10, measuring where a country falls between full autocracy (-10) and full democracy (10). The variable gdpenl measures the GDP per capita of each country in 1960 in thousands of dollars.

  1. Produce a scatter plot with polity2l on the horizontal axis and gdpenl on the vertical axis. Given how you have set up your plot, which is your independent variable and which is your dependent variable? Bonus: explain why the data looks how it does along the x-axis, where the data points are all lined up–specifically, what kind of variable is polity2l?
plot(fl2$polity2l, fl2$gdpenl, xlab = "Polity Rating", ylab = "GDP per Capita of Nation", main = "Relationship between levels of Democracy and GDP per Capita of a Nation" )
model1 <- lm(fl2$gdpenl ~ fl2$polity2l, data=fl2)
# lwd (not lw) is the graphical parameter that sets the line width
abline(model1, col = "pink", lwd = 4)

The independent variable for this scatter plot is the measure of how democratic or autocratic a nation is (polity2l), and the dependent variable is the GDP per capita of that nation (gdpenl). The data points line up in vertical columns because polity2l takes only whole-number values between -10 and 10, so many countries share exactly the same x value. polity2l is therefore a discrete (ordinal) variable rather than a continuous one.

  1. In the same code chunk as b, use the abline command to put the linear regression line on the plot. See the course slides for an example.

  2. Estimate and report the covariance of polity2l with gdpenl, and their correlation. Write the meaning of what these results tell you, using the meaning for these two variables (stated above).

cov(fl2$gdpenl, fl2$polity2l)
## [1] -0.5289266
cor(fl2$gdpenl, fl2$polity2l)
## [1] -0.01359485

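Covariance: The covariance of -0.53 is negative, which means that higher polity scores tend to go with slightly lower GDP per capita; its size is hard to interpret because it depends on the units of both variables. Correlation: The correlation is the covariance divided by the product of the standard deviations, cor(x,y) = cov(x,y) / (SD(x) SD(y)), and it measures the strength of the linear relationship between two variables on a scale from -1 to 1. At -0.014, there is a very weak negative correlation between polity rating and GDP per capita, which is essentially no linear relationship.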
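
The link between the two statistics can be checked numerically. The sketch below uses simulated values as a stand-in for the two variables (the data and the names `x` and `y` are assumptions for illustration, not the fl2 data) and confirms that dividing the covariance by the product of the standard deviations reproduces what cor() returns.

```r
# Hypothetical data standing in for a polity-style score (x) and an
# income-style outcome (y); these values are simulated, not taken from fl2.
set.seed(1)
x <- sample(-10:10, 50, replace = TRUE)
y <- 2 + 0.1 * x + rnorm(50)

cov_xy <- cov(x, y)

# Correlation is the covariance divided by the product of the standard deviations
cor_by_hand <- cov_xy / (sd(x) * sd(y))

# Should agree with R's built-in cor() up to floating-point error
all.equal(cor_by_hand, cor(x, y))
```

Because the standard deviations carry the units of x and y, the division removes those units, which is why the correlation is easier to compare across variables than the covariance.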

  1. Write down the model that we are fitting when we do a linear regression of gdpenl on polity2l, using \(\beta_0\) and \(\beta_1\) where necessary. What does the \(\beta_{0}\) mean? What does the \(\beta_1\) mean? You do not have to estimate the model yet so this is not in a code chunk.

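\(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), where \(Y_i\) is gdpenl and \(X_i\) is polity2l for country \(i\). In words: Yi = Intercept + Slope × Xi + Error Term. \(\beta_0\) is the expected GDP per capita when polity2l is zero. \(\beta_1\) is the expected change in GDP per capita associated with a one-unit increase in polity2l.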

  1. Explain how we will estimate the best values of \(\beta_0\) and \(\beta_1\). In what sense is the line that we choose (by choosing \(\beta_0\) and \(\beta_1\)) the “best-fitting” line?

The line of best fit is the line that minimizes the sum of the squared residuals, that is, the squared vertical distances between each data point and the line. We estimate \(\beta_0\) and \(\beta_1\) by choosing the values that make this sum as small as possible (ordinary least squares), so the line is “best-fitting” in the sense that no other line has a smaller total squared error. If all the points in the data set are on the line and the line is moving upwards (from left to right), the correlation is 1. If all the points are on the line and it is moving downward, the correlation is -1. If there is no linear relationship between the variables, the line is flat and the correlation is 0.
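
To make the least-squares idea concrete, here is a minimal sketch on simulated data (the data, and the helper name `ssr`, are assumptions for illustration). It computes the slope and intercept from the closed-form formulas for a single predictor, checks them against lm(), and shows that nudging the line away from those values increases the sum of squared residuals.

```r
# Simulated data, not the fl2 data
set.seed(2)
x <- sample(-10:10, 100, replace = TRUE)
y <- 1 + 0.5 * x + rnorm(100)

# Closed-form least-squares estimates for a single predictor
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

# Sum of squared residuals for any candidate line (hypothetical helper)
ssr <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))   # same estimates as lm()

# Moving the intercept or the slope away from the estimates makes the fit worse
ssr(b0, b1) < ssr(b0 + 0.1, b1)
ssr(b0, b1) < ssr(b0, b1 + 0.1)
```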

  1. Now use linear regression to regress gdpenl on polity2l using the lm function in R - make sure you save the model as an object. Show the result using the summary() command. Interpret the meaning of the coefficient estimates (both the intercept and the coefficients on polity2l). Bonus: Consider the p-values reported on the table in your interpretation, if you want to read ahead and figure out what these mean.
plot(fl2$polity2l, fl2$gdpenl, xlab = "Polity Rating", ylab = "GDP per Capita of Nation", main = "Relationship between levels of Democracy and GDP per Capita of a Nation" )

model1 <- lm(fl2$gdpenl ~ fl2$polity2l, data = fl2)
abline(model1, col = "hotpink", lwd = 2)

summary(model1)
## 
## Call:
## lm(formula = fl2$gdpenl ~ fl2$polity2l, data = fl2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.508 -1.812 -1.395  0.215 51.353 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.46268    0.44439   5.542 1.27e-07 ***
## fl2$polity2l -0.01069    0.06339  -0.169    0.866    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.55 on 154 degrees of freedom
## Multiple R-squared:  0.0001848,  Adjusted R-squared:  -0.006307 
## F-statistic: 0.02847 on 1 and 154 DF,  p-value: 0.8662

\(\beta_0\): Political scientists typically do not interpret the Y-intercept, but in this case it means that when a country's polity score is zero, its predicted GDP per capita is 2.46 (about 2,460 dollars, since gdpenl is measured in thousands of dollars).

\(\beta_1\): A one-unit increase in polity2l is associated with a decrease of about 0.011 in GDP per capita (roughly 11 dollars).

P-value: A p-value is the probability of obtaining an estimate at least as far from zero as the one observed if the true coefficient were zero. The usual rule of thumb is that a p-value of 0.05 or lower is statistically significant. The p-value of 1.27e-07 belongs to the intercept, which is therefore significantly different from zero. The p-value on polity2l is 0.866, far above 0.05, so the slope is not statistically significant: we cannot distinguish the relationship between polity score and GDP per capita from zero.
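
Each row of the coefficient table carries its own p-value, so it helps to pull the slope's p-value out by name rather than reading it off the first row. The sketch below does this on simulated data in which the outcome does not depend on the predictor (the data frame `dat` and its columns are assumptions for illustration, not the fl2 data).

```r
# Simulated data: y is generated with no dependence on x
set.seed(3)
dat <- data.frame(x = sample(-10:10, 80, replace = TRUE))
dat$y <- 2 + rnorm(80)

fit <- lm(y ~ x, data = dat)
coef_table <- coef(summary(fit))   # matrix with one row per coefficient

# Columns are "Estimate", "Std. Error", "t value", "Pr(>|t|)"
slope_p <- coef_table["x", "Pr(>|t|)"]
intercept_p <- coef_table["(Intercept)", "Pr(>|t|)"]

slope_p < 0.05       # test of whether the slope differs from zero
intercept_p < 0.05   # a separate test, about the intercept only
```

A small p-value on the intercept says only that the predicted outcome at x = 0 differs from zero; it says nothing about whether x and y are related.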

  1. Sometimes we need to transform a variable to make it more suitable for analysis by regression. For example, with income-related variables like gdpenl, we usually need to take their log first before using regression. Create a new variable that is equal to the log of gdpenl.
log <- log(fl2$gdpenl)
  1. Now remake a scatter plot like before but with polity2l on the horizontal axis and the log of gdpenl on the vertical axis. Add the regression line.
plot(fl2$polity2l, log, xlab = "Polity Rating", ylab = "Log of GDP per Capita of Nation", main = "Relationship between levels of Democracy and GDP per Capita of a Nation" )

model2 <- lm(log ~ fl2$polity2l, data = fl2)
abline(model2, col = "hotpink", lwd = 2)

It turns out that when you regress a logged dependent variable on an (unlogged) independent variable, we can roughly interpret the coefficient \(\beta\) as meaning “a one-unit shift in the independent variable corresponds to a 100\(\beta\) percent increase in the dependent variable.”

For example, a \(\beta\) of 0.01 from such a regression would imply that a one-unit change in the independent variable is associated with a \(1\%\) higher value of the dependent variable. (This is just an approximation, but for coefficient estimates near zero, it is okay.)
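
The quality of this approximation can be checked directly. Since the log of the dependent variable rises by \(\beta\), the dependent variable itself is multiplied by \(e^{\beta}\), so the exact percent change is \(100(e^{\beta} - 1)\). The sketch below compares the two for a few coefficient values; the function names `exact_pct` and `approx_pct` are made up for illustration.

```r
# Exact percent change in Y implied by a coefficient beta from a regression
# of log(Y) on X: log(Y) rises by beta, so Y is multiplied by exp(beta)
exact_pct <- function(beta) 100 * (exp(beta) - 1)

# The rule-of-thumb approximation
approx_pct <- function(beta) 100 * beta

betas <- c(0.01, 0.05, 0.5)
data.frame(beta = betas,
           approx = approx_pct(betas),
           exact = round(exact_pct(betas), 2))
```

For coefficients near zero the two columns are nearly identical; for larger coefficients the approximation understates the change.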

  1. Using this knowledge, re-run your regression but now regress the (log of) gdpenl on polity2l. Use summary() to show the results. Interpret the new coefficient on polity2l. How has it changed compared to your earlier regression? Would you say your results are robust? Bonus: Interpret the new p-value.
plot(fl2$polity2l, log, xlab = "Polity Rating", ylab = "Log of GDP per Capita of Nation", main = "Relationship between levels of Democracy and GDP per Capita of a Nation" )

model2 <- lm(log ~ fl2$polity2l, data = fl2)
abline(model2, col = "hotpink", lwd = 2)

summary(model2)
## 
## Call:
## lm(formula = log ~ fl2$polity2l, data = fl2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7810 -0.6423 -0.0246  0.6165  4.1333 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.24383    0.07981   3.055  0.00265 ** 
## fl2$polity2l  0.04875    0.01138   4.283 3.24e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9967 on 154 degrees of freedom
## Multiple R-squared:  0.1064, Adjusted R-squared:  0.1006 
## F-statistic: 18.34 on 1 and 154 DF,  p-value: 3.237e-05

\(\beta_0\): Political scientists typically do not interpret the Y-intercept, but in this case it means that when a country's polity score is zero, the predicted log of GDP per capita is 0.244 (a GDP per capita of about 1.28 thousand dollars).

\(\beta_1\): A one-unit increase in polity2l is associated with an increase of 0.049 in the log of GDP per capita, which is roughly a 4.9 percent higher GDP per capita.

P-value: The new p-value on polity2l is 3.24e-05 (the 0.00265 in the table belongs to the intercept). Since it is less than 0.05, the coefficient is now statistically significant, unlike in the first regression, where it was 0.866. The results are not robust: before taking the log of GDP, the slope was slightly negative and not significant, and after taking the log it is positive and significant. Because the sign and significance of the estimate depend on how GDP is measured, the finding is sensitive to the model specification.

  1. Regardless of what you actually got in the above analyses, suppose that we find a positive and statistically significant coefficient in these regressions. Does this warrant the conclusion that “being more democratic (having a higher polity score) increases a country’s GDP per capita?” Why or why not? Follow the instructions in class for how to address such causal questions, including pointing out potential confounders, non-comparability, and proposing the ideal research design.

No, finding a positive and statistically significant coefficient in a regression does not let us conclude that being more democratic increases a country’s GDP per capita. There are plenty of confounding variables that could make the coefficient positive or negative: more and less democratic countries differ in many other ways that also affect GDP, so they are not directly comparable. The ideal research design would be an experiment in which the level of democracy was randomly assigned, so that more and less democratic countries were alike on average in everything else. Collecting data on every nation in the world would describe the whole population, but it would not by itself remove confounding.

Question 2. Climate change

  1. Begin by loading a new dataset (tempdata.RData) into R. This dataset shows the average temperature of Los Angeles for the month of October from 1944 to 2015. What is the range of the average temperature in October? What year has the highest average temperature? What year has the lowest?
load("tempdata.RData")
range(tempdata$temp)
## [1] 61.6 74.0
max(tempdata$year)
## [1] 2015
max(tempdata$temp)
## [1] 74
min(tempdata$temp)
## [1] 61.6
  1. Make a scatterplot of temperature over time. Add a trend line. What does it tell us about how the climate is changing over time?
plot(tempdata$year, tempdata$temp, xlab = "Recorded Years of October", ylab = "Recorded Average Temperatures in Fahrenheit", main = "Average Recorded Temperature of October in Los Angeles")
model3 <- lm(tempdata$temp ~ tempdata$year, data = tempdata)
abline(model3, col = "red", lwd = 3)

The year with the highest average temperature in Fahrenheit is 2015. The year with the lowest average temperature in Fahrenheit is 1946.
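
Note that max(tempdata$year) returns the most recent year in the data, not the year with the highest temperature. One way to look up the years directly is which.max() and which.min(), sketched below on a small made-up data frame (`toy` and its values are hypothetical; only the column names `year` and `temp` follow the dataset).

```r
# Hypothetical stand-in for tempdata: one row per year with a temp column
# (values are made up for illustration)
toy <- data.frame(year = 2001:2006,
                  temp = c(65.2, 66.8, 64.1, 70.3, 67.5, 66.0))

# which.max()/which.min() return the row position of the extreme value,
# which can then be used to pull out the matching year
hottest_year <- toy$year[which.max(toy$temp)]
coldest_year <- toy$year[which.min(toy$temp)]

# max(toy$year) would simply return the last year, 2006, regardless of temp
c(hottest = hottest_year, coldest = coldest_year)
```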

  1. Next, subset the data into groups by decade. Start with the seventies, and then create subsets for the eighties, nineties, noughts, and 2010-2015. What is the mean temperature for each decade? How is this changing over time?
seventies <- tempdata[27:36, "temp"]
eighties <- tempdata[37:46, "temp"]
nineties <- tempdata[47:56, "temp"]
noughts <- tempdata[57:66, "temp"]
ughts <- tempdata[67:72, "temp"]

mean(seventies)
## [1] 66.42
mean(eighties)
## [1] 67.01
mean(nineties)
## [1] 67.24
mean(noughts)
## [1] 65.38
mean(ughts)
## [1] 68.13333

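The mean temperature rises from the seventies (66.42) through the nineties (67.24), dips in the noughts (65.38), and reaches its highest value in 2010-2015 (68.13). So the mean increases in every period except the noughts.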
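
Subsetting by row position works only as long as the rows stay in the same order. A sketch of an alternative that subsets by the year itself, and then summarizes every decade at once, is shown below; the data frame `toy` is simulated and the column name `decade` is an assumption, so the numbers will not match the results above.

```r
# Hypothetical stand-in for tempdata (simulated temperatures, 1944-2015)
set.seed(4)
toy <- data.frame(year = 1944:2015)
toy$temp <- round(rnorm(nrow(toy), mean = 66, sd = 2), 1)

# Subset by the value of year instead of by row position
seventies <- toy$temp[toy$year >= 1970 & toy$year <= 1979]

# Or label every year with its decade and summarize all decades at once
toy$decade <- floor(toy$year / 10) * 10
decade_means <- tapply(toy$temp, toy$decade, mean)
decade_sds <- tapply(toy$temp, toy$decade, sd)

round(cbind(mean = decade_means, sd = decade_sds), 2)
```

In this layout the first and last groups (1940 and 2010) cover only six years each, so their summaries rest on fewer observations than the full decades.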

  1. What is the standard deviation for each decade? How does the standard deviation change over time? What do the changes in standard deviation mean?
sd(seventies)
## [1] 1.309623
sd(eighties)
## [1] 1.610003
sd(nineties)
## [1] 1.594574
sd(noughts)
## [1] 1.842583
sd(ughts)
## [1] 3.583109

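The standard deviation rises from 1.31 in the seventies to 1.84 in the noughts and 3.58 in 2010-2015, with only a slight dip in the nineties. This means that October temperatures are becoming more variable over time, so more extreme values fall within the normal range. An average temperature of 74 degrees Fahrenheit in 2015 is within two standard deviations of the 2010-2015 mean (68.13, SD 3.58). By contrast, in the seventies (mean 66.42, SD 1.31), an October averaging 74 degrees would have been more than five standard deviations above the decade mean, which would be highly unusual.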

  1. Why is understanding changes in temperature variation, not just average temperature, important for understanding the consequences of climate change? What does a change in the variance of temperature mean for how society can plan?

Understanding changes in temperature variation is important for understanding the consequences of climate change because the average alone hides how extreme individual years can be. The decade means for October move by less than three degrees, while the standard deviation roughly doubles between the noughts and 2010-2015. The change in variance may be linked to human emissions, but these data alone cannot show that. A larger variance means that more severe swings in weather become normal, which makes it harder for society to plan because a wider range of conditions has to be prepared for. For example, in 2010-2015 an October averaging 65 degrees Fahrenheit and an October averaging 74 degrees could both be considered normal.


Question 3. The Butterfly did it

Read the assigned article for this section of the course, Wand et al., “The Butterfly Did It.” Focus on the: abstract, introduction, figures, tables and conclusion. You don’t need to read every word, but you do need to understand the argument and evidence. This is a very important article written in the discipline’s top journal (APSR), so it should give you a good example of how we use statistics in practice to answer questions about politics. Note: they use a slightly different regression model (estimator) than we are discussing in class, but it is a similar enough approach that you should be able to follow the overall argument.

  1. In your own words, state the authors’ research question. Are they trying to answer a causal question?

    The researchers ask whether the butterfly ballot used in Palm Beach County, Florida, in the 2000 presidential election caused more than 2,000 Democratic voters to vote by mistake for Reform candidate Pat Buchanan. This is a causal question, because it asks about the effect of the ballot format on votes.

  2. What is their independent variable? What is their dependent variable?

    IV: the type of ballot used during the presidential election. DV: the number of votes that went to Pat Buchanan instead of Al Gore.

  3. Examine Figure 3. What could you call Palm Beach County (PBC)? Why? (Use a key term we’ve learned in the course).

    Palm Beach County would be considered an outlier because it does not reflect the pattern in the majority of the data collected.

  4. What is the main finding from the paper? In other words, how do the authors answer their research question? What is the main evidence they use to make this claim? Put this in your own words.

    The main finding from the paper is that the butterfly ballot affected the vote count by leading voters to accidentally vote for Reform candidate Pat Buchanan instead of Democratic presidential candidate Al Gore. The main evidence is that Palm Beach County has historically voted primarily Democratic, with no signs of change in the political climate of that county. The authors also show that PBC was not a Reform vote outlier in 1996, a presidential election year in which the county did not use a butterfly ballot.

  5. Examine Table 4. In Palm Beach County, how much more likely was it for a person casting a ballot for the Democratic Senator candidate to also vote for Buchanan on (i) election day; versus, (ii) absentee? What explains this difference in probability according to the authors? Note: this is a simple calculation if you read and understand the table.

    A person who cast a ballot for the Democratic Senate candidate was more likely to also vote for Buchanan on election day than on an absentee ballot. According to the authors, the difference arises because election-day ballots used the butterfly format while absentee ballots did not. Deckard voters who support Buchanan should not be affected by the butterfly ballot, and for them the difference between election-day and absentee Buchanan vote proportions is small.


Question 4. Helping out your FRIENDS

You mentioned to your friend that you’re learning some really neat stuff in PS 15. Now they’re curious. They want to know how they too can figure out relationships in the political world.

  1. Your friend asks how correlation is different from covariance, and for a formula that can turn \(cov(x,y)\) into \(cor(x,y)\). Provide that formula, and explain how correlation relates to covariance. Also explain what the correlation means and the possible values it can take.

    \(cor(x,y) = \frac{cov(x,y)}{SD(x)\,SD(y)}\). Correlation is the standardized version of covariance: it is obtained by dividing the covariance by the product of the standard deviations of the IV and the DV, which removes the units of measurement.

    Correlation measures the strength and direction of the linear relationship between two variables on a scale from -1 to 1. A value of 1 means that all points fall perfectly on an upward-sloping line, and -1 means that all points fall perfectly on a downward-sloping line. A value of 0 means that there is no linear relationship between the IV and the DV.

  2. Your friend asks you to explain what a random variable is–in your own words, provide a definition. What are its key components?

    A random variable is a variable whose value is determined by a chance process, so we cannot know the value in advance but we can say how likely each value is. Its key components are the set of values it can take and the probability distribution that assigns a probability to those values.

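  3. Your friend then says they have heard of linear regression before, but they don’t know how it works. Explain in simple language what the regression is doing to estimate a relationship between two variables. Come up with a specific political science example to help your friend understand.

    Linear regression estimates the relationship between an independent variable and a dependent variable by finding the straight line that fits the data best, meaning the line with the smallest sum of squared vertical distances between the points and the line. The slope of that line tells us how much the dependent variable is expected to change when the independent variable increases by one unit; unlike a correlation, it is not restricted to values between -1 and 1. A good example to help visualize this is the relationship between GDP per capita and infant mortality rate: as GDP per capita increases, the infant mortality rate in a country tends to decrease, so the regression line slopes downward.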
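
To show a friend how the regression slope and the correlation relate, a minimal sketch on simulated data may help. The variables `gdp` and `infant_mortality` below are made-up numbers chosen only to mimic the GDP and infant mortality example, not real measurements. The correlation is unit-free and bounded, while the slope is expressed in units of the outcome per unit of the predictor; the two are linked through the standard deviations.

```r
# Hypothetical data: GDP per capita (thousands of dollars) and infant
# mortality (deaths per 1,000 births); simulated, not real measurements
set.seed(6)
gdp <- runif(60, min = 1, max = 40)
infant_mortality <- 60 - 1.2 * gdp + rnorm(60, sd = 5)

r <- cor(gdp, infant_mortality)      # unit-free, always between -1 and 1

fit <- lm(infant_mortality ~ gdp)
slope <- unname(coef(fit)["gdp"])    # change in Y per one-unit change in X

# The two are linked: slope = correlation * SD(y) / SD(x)
all.equal(slope, r * sd(infant_mortality) / sd(gdp))
```

Both numbers share the same sign, but only the slope answers the question "how much does the outcome change when the predictor goes up by one unit?"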