This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
Please also remember that you will want to use the console to “try out” code to get it working. Once you get it working, copy the code that worked (not the results) over into a code chunk in your rmd. Remember that the code within your rmd file has to be self-contained and include all the steps – your rmd file will not “remember” what you did on your own in the console. When you click knit, it can only execute the code that was present in the rmd. Do not copy the results from your console into your RMD file. In addition, do not include large amounts of output in your writeup (i.e. don’t print full datasets to the screen).
Include both the code to get your answer and your answer in words.
It is best to work with small amounts of code at a time: get some code working, copy it into the rmd as a code chunk, write your text answer (outside the code chunk) if needed, and check that the file will still knit properly. Do not proceed to answer more questions until you get the first bit working. If you knit every time you try to write some new code, you’ll know where the error is (in the last thing you did!). This will save you huge headaches.
Although the questions break up each task for you into parts, remember that you might need to put a bunch of code together into a single chunk to make it work. For example, if you create a density plot in one part of a question, and want to add the mean value to it as a line in another part, you need these two commands to follow one another in the same chunk of code.
Some tips: Start early, work with friends in the class, use the discussion forum, come to class and section, go to office hours if you need to, read the textbook and other readings – do all these things and you’ll succeed! Good luck.
First, we will load the same dataset (derived from Fearon and Laitin, 2003) that you used in the last problem set.
setwd("/Users/alexsefayan/Desktop/PSthree")
getwd()
## [1] "/Users/alexsefayan/Desktop/PSThree"
load("fl2.RData")
summary(fl2)
## cname year warl war
## Length:156 Min. :1945 Min. :0.00000 Min. : 0.000
## Class :character 1st Qu.:1947 1st Qu.:0.00000 1st Qu.: 0.000
## Mode :character Median :1954 Median :0.00000 Median : 0.000
## Mean :1958 Mean :0.00641 Mean : 5.635
## 3rd Qu.:1964 3rd Qu.:0.00000 3rd Qu.: 9.000
## Max. :1993 Max. :1.00000 Max. :52.000
## gdpenl lpopl1 lmtnest ncontig
## Min. : 0.0510 Min. : 5.403 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.6395 1st Qu.: 7.526 1st Qu.:0.6931 1st Qu.:0.0000
## Median : 1.0910 Median : 8.415 Median :2.3174 Median :0.0000
## Mean : 2.4639 Mean : 8.505 Mean :2.0975 Mean :0.1603
## 3rd Qu.: 2.5940 3rd Qu.: 9.326 3rd Qu.:3.3150 3rd Qu.:0.0000
## Max. :53.9010 Max. :13.224 Max. :4.5570 Max. :1.0000
## Oil nwstate instab polity2l
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :-10.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: -7.0000
## Median :0.0000 Median :1.0000 Median :0.00000 Median : -1.0000
## Mean :0.1154 Mean :0.5192 Mean :0.03205 Mean : -0.1154
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.: 7.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. : 10.0000
## ethfrac relfrac war_prop numyears
## Min. :0.0010 Min. :0.0000 Min. :0.0000 Min. : 3.00
## 1st Qu.:0.1438 1st Qu.:0.1861 1st Qu.:0.0000 1st Qu.:34.00
## Median :0.3850 Median :0.3750 Median :0.0000 Median :43.50
## Mean :0.4083 Mean :0.3807 Mean :0.1393 Mean :40.56
## 3rd Qu.:0.6691 3rd Qu.:0.5800 3rd Qu.:0.2323 3rd Qu.:53.00
## Max. :0.9250 Max. :0.7828 Max. :1.0000 Max. :55.00
We will deal mainly with two variables: polity2l and gdpenl. The polity2l variable is a whole number between -10 and 10, measuring where a country falls between full autocracy (-10) and full democracy (10). The variable gdpenl measures the GDP per capita of each country in 1960 in thousands of dollars.
polity2l on the horizontal axis and gdpenl on the vertical axis. Given how you have set up your plot, which is your independent variable and which is your dependent variable? Bonus: explain why the data looks how it does along the x-axis, where the data points are all lined up–specifically, what kind of variable is ‘polity’?
plot(fl2$polity2l, fl2$gdpenl, xlab = "Polity Rating", ylab = "GDP per Capita of Nation", main = "Relationship between levels of Democracy and GDP per Capita of a Nation")
model1 <- lm(fl2$gdpenl ~ fl2$polity2l, data = fl2)
abline(model1, col = "pink", lwd = 4)
The independent variable is the polity rating (how democratic or autocratic a nation is), plotted on the horizontal axis, and the dependent variable is that nation's GDP per capita, plotted on the vertical axis. The data points line up in vertical columns because polity2l takes only whole-number values between -10 and 10, so every country falls on one of 21 possible x positions. polity2l is therefore a discrete (ordinal) variable rather than a continuous one.
In the same code chunk as b, use the abline command to put the linear regression line on the plot. See the course slides for an example.
Estimate and report the covariance of polity2l with gdpenl, and their correlation. Write the meaning of what these results tell you, using the meaning for these two variables (stated above).
cov(fl2$gdpenl, fl2$polity2l)
## [1] -0.5289266
cor(fl2$gdpenl, fl2$polity2l)
## [1] -0.01359485
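Covariance and correlation are linked by \(cor(x,y) = cov(x,y) / (SD(x) \cdot SD(y))\): dividing the covariance by the product of the two standard deviations gives the correlation. The covariance of -0.53 is negative, so polity ratings and GDP per capita tend to move in opposite directions, but its size depends on the units of both variables and is hard to interpret on its own. The correlation measures the strength of the linear relationship on a scale from -1 to 1. At -0.014 it is very close to zero: there is a very weak negative, essentially nonexistent, linear relationship between Polity rating and GDP per capita.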
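The link between covariance and correlation can be checked numerically. The sketch below uses small made-up vectors (`polity` and `gdp` are hypothetical toy data, not the fl2 dataset) and compares the hand-computed correlation with the built-in `cor()`.

```r
# Hypothetical toy data (not the fl2 dataset): democracy scores and GDP per capita
polity <- c(-10, -7, -3, 0, 4, 8, 10)
gdp    <- c(1.2, 0.8, 2.5, 1.9, 3.1, 2.2, 4.0)

# Correlation computed by hand: covariance divided by the product of the SDs
cor_by_hand <- cov(polity, gdp) / (sd(polity) * sd(gdp))

# Built-in correlation for comparison
cor_builtin <- cor(polity, gdp)

print(cor_by_hand)
print(cor_builtin)
print(all.equal(cor_by_hand, cor_builtin))
```

Because the standard deviations carry the units of each variable, dividing by them leaves a unit-free number that always lies between -1 and 1.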
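gdpenl on polity2l, using \(\beta_0\) and \(\beta_1\) where necessary. What does the \(\beta_{0}\) mean? What does the \(\beta_1\) mean? You do not have to estimate the model yet so this is not in a code chunk.
\(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), that is, gdpenl = Intercept + Slope × polity2l + Error Term. \(\beta_0\) is the intercept: the expected GDP per capita for a country whose polity score is 0. \(\beta_1\) is the slope: the expected change in GDP per capita associated with a one-unit increase in the polity score.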
The line of best fit is the line that comes closest to the data in a scatterplot: it minimizes the sum of the squared vertical distances (residuals) between the points and the line. If all the points in the data set are on the line and the line is moving upwards (from left to right), that means there is a correlation of 1. If all the points are on the line of best fit and the line is moving downward, that means there is a correlation of -1. If the line of best fit is flat, so that knowing x tells us nothing about y, that means there is a correlation of 0, or no linear relationship.
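As a rough illustration of what "best fit" means, the sketch below computes the least-squares slope and intercept from the textbook formulas and compares them with `lm()`. The vectors `x` and `y` are hypothetical toy data, and `ssr()` is a helper defined here only for illustration.

```r
# Hypothetical toy data (not the fl2 dataset)
x <- c(-10, -7, -3, 0, 4, 8, 10)
y <- c(1.2, 0.8, 2.5, 1.9, 3.1, 2.2, 4.0)

# Least-squares slope and intercept from the textbook formulas
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same line estimated by lm()
fit <- lm(y ~ x)

print(c(intercept = intercept, slope = slope))
print(coef(fit))

# Illustrative helper: sum of squared residuals for a candidate line
ssr <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Nudging the slope away from the least-squares value makes the fit worse
print(ssr(intercept, slope) < ssr(intercept, slope + 0.05))
```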
gdpenl on polity2l using the lm function in R - make sure you save the model as an object. Show the result using the summary() command. Interpret the meaning of the coefficient estimates (both the intercept and the coefficients on polity2l). Bonus: Consider the p-values reported on the table in your interpretation, if you want to read ahead and figure out what these mean.
plot(fl2$polity2l, fl2$gdpenl, xlab = "Polity Rating", ylab = "GDP per Capita of Nation", main = "Relationship between levels of Democracy and GDP per Capita of a Nation")
model1 <- lm(fl2$gdpenl ~ fl2$polity2l, data = fl2)
abline(model1, col = "hot pink", lwd = 2)
summary(model1)
##
## Call:
## lm(formula = fl2$gdpenl ~ fl2$polity2l, data = fl2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.508 -1.812 -1.395 0.215 51.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.46268 0.44439 5.542 1.27e-07 ***
## fl2$polity2l -0.01069 0.06339 -0.169 0.866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.55 on 154 degrees of freedom
## Multiple R-squared: 0.0001848, Adjusted R-squared: -0.006307
## F-statistic: 0.02847 on 1 and 154 DF, p-value: 0.8662
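\(\beta_0\): Typically political scientists do not interpret the Y intercept, but in this case it has a meaningful reading: when the polity rating of a country is zero, the predicted GDP per capita is 2.46 thousand dollars (about 2,460 dollars).
\(\beta_1\): For each one-unit increase in the polity rating, the predicted GDP per capita of the nation decreases by about 0.011 thousand dollars (roughly 11 dollars).
P-value: A p-value is the probability of observing an estimate at least as extreme as the one we obtained if the true coefficient were zero. The usual rule of thumb is that a p-value of 0.05 or lower is statistically significant. The p-value of 1.27e-07 belongs to the intercept, which is significantly different from zero. The p-value on polity2l is 0.866, far above 0.05, so the slope is not statistically significant: we cannot distinguish the estimated relationship between polity rating and GDP per capita from zero.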
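Since the summary table reports one p-value per row, it helps to pull each one out by name. The following is a minimal sketch on hypothetical toy data (`x` and `y` are made up, not the fl2 variables) showing how to read the intercept and slope p-values separately.

```r
# Hypothetical toy data (not the fl2 dataset)
x <- c(-10, -7, -3, 0, 4, 8, 10)
y <- c(1.2, 0.8, 2.5, 1.9, 3.1, 2.2, 4.0)
fit <- lm(y ~ x)

# Coefficient table: one row per coefficient, p-values in the last column
coef_table <- summary(fit)$coefficients
print(coef_table)

# p-value for the intercept (row 1) versus the slope on x (row 2)
p_intercept <- coef_table["(Intercept)", "Pr(>|t|)"]
p_slope     <- coef_table["x", "Pr(>|t|)"]
print(c(intercept = p_intercept, slope = p_slope))

# The slope's p-value is the one that speaks to the x-y relationship
print(p_slope < 0.05)
```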
gdpenl, we usually need to take their log first before using regression. Create a new variable that is equal to the log of gdpenl.
log <- log(fl2$gdpenl)
polity2l on the horizontal axis and the log of gdpenl on the vertical axis. Add the regression line.
plot(fl2$polity2l, log, xlab = "Polity Rating", ylab = "Log of GDP per Capita of Nation", main = "Relationship between levels of Democracy and GDP per Capita of a Nation")
model2 <- lm(log ~ fl2$polity2l, data = fl2)
abline(model2, col = "hot pink", lwd = 2)
It turns out that when you regress a logged dependent variable on an (unlogged) independent variable, we can roughly interpret the coefficient \(\beta\) as meaning “a one-unit shift in the independent variable corresponds to a 100\(\beta\) percent increase in the dependent variable.”
For example, a \(\beta\) of 0.01 from such a regression would imply that a one-unit change in the independent variable is associated with a \(1\%\) higher value of the dependent variable. (This is just an approximation, but for coefficient estimates near zero, it is okay.)
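To see how good the approximation is, the sketch below compares the rule-of-thumb value \(100\beta\) with the exact percent change \(100(e^{\beta} - 1)\). It takes 0.04875, the polity2l coefficient from the logged regression in this document, as the example; `big_beta` is a hypothetical larger coefficient added only for contrast.

```r
# Slope on polity2l from the logged regression reported in this document
beta <- 0.04875

# Rule-of-thumb percent change for a one-unit increase in x
approx_pct <- 100 * beta

# Exact percent change implied by a log-linear model
exact_pct <- 100 * (exp(beta) - 1)

print(round(c(approx = approx_pct, exact = exact_pct), 3))

# Hypothetical larger coefficient: the approximation gets worse as beta grows
big_beta <- 0.5
print(round(c(approx = 100 * big_beta, exact = 100 * (exp(big_beta) - 1)), 1))
```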
gdpenl on polity2l. Use summary() to show the results. Interpret the new coefficient on polity2l. How has it changed compared to your earlier regression? Would you say your results are robust? Bonus: Interpret the new p-value.
plot(fl2$polity2l, log, xlab = "Polity Rating", ylab = "Log of GDP per Capita of Nation", main = "Relationship between levels of Democracy and GDP per Capita of a Nation")
model2 <- lm(log ~ fl2$polity2l, data = fl2)
abline(model2, col = "hot pink", lwd = 2)
summary(model2)
##
## Call:
## lm(formula = log ~ fl2$polity2l, data = fl2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7810 -0.6423 -0.0246 0.6165 4.1333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24383 0.07981 3.055 0.00265 **
## fl2$polity2l 0.04875 0.01138 4.283 3.24e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9967 on 154 degrees of freedom
## Multiple R-squared: 0.1064, Adjusted R-squared: 0.1006
## F-statistic: 18.34 on 1 and 154 DF, p-value: 3.237e-05
\(\beta_0\): Typically political scientists do not interpret the Y intercept, but in this case, when the polity rating of a country is zero, the predicted log of GDP per capita is 0.244 (about 1.28 thousand dollars).
\(\beta_1\): For each one-unit increase in the polity rating, the predicted log of GDP per capita increases by 0.049, which corresponds to roughly a 4.9 percent higher GDP per capita.
P-value: The new p-value on polity2l is 3.24e-05 (the 0.00265 in the table belongs to the intercept). Since it is less than 0.05, the coefficient on polity2l is now statistically significant. The results are not robust: in the first regression (before taking the log of GDP) the slope was slightly negative and not significant, while after logging GDP the slope is positive and significant. Because the sign and significance of the estimate change with the specification, the finding is not robust.
No, we cannot conclude that being more democratic increases a country’s GDP per capita just because we find a positive and statistically significant coefficient in a regression. There are plenty of confounding variables that could push the coefficient to be positive or negative, and the causation could also run in reverse, with richer countries becoming more democratic. Collecting more data, even on every nation in the world, would not solve this by itself: a regression describes an association, and a causal conclusion about GDP per capita and polity rating requires accounting for those confounders.
***
load("tempdata.RData")
range(tempdata$temp)
## [1] 61.6 74.0
max(tempdata$year)
## [1] 2015
max(tempdata$temp)
## [1] 74
min(tempdata$temp)
## [1] 61.6
plot(tempdata$year, tempdata$temp, xlab = "Recorded Years of October", ylab = "Recorded Average Temperatures in Fahrenheit", main = "Average Recorded Temperature of October in Los Angeles")
model3 <- lm(tempdata$temp ~ tempdata$year, data = tempdata)
abline(model3, col = "red", lwd = 3)
The year with the highest average October temperature is 2015, at 74 degrees Fahrenheit. The year with the lowest average October temperature is 1946, at 61.6 degrees Fahrenheit.
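Note that `max(tempdata$year)` returns the latest year in the data, which is not necessarily the hottest one. A minimal sketch of how the hottest and coldest years could be looked up directly, using a hypothetical toy data frame `toy` with the same column names as tempdata:

```r
# Hypothetical toy data with the same column names as tempdata
toy <- data.frame(year = 2011:2015,
                  temp = c(66.1, 64.8, 67.3, 69.0, 74.0))

# which.max()/which.min() give the row of the extreme temperature,
# which is then used to pick out the matching year
hottest_year <- toy$year[which.max(toy$temp)]
coldest_year <- toy$year[which.min(toy$temp)]

print(hottest_year)
print(coldest_year)
```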
seventies <- tempdata[27:36, "temp"]
eighties <- tempdata[37:46, "temp"]
nineties <- tempdata[47:56, "temp"]
noughts <- tempdata[57:66, "temp"]
ughts <- tempdata[67:72, "temp"]
mean(seventies)
## [1] 66.42
mean(eighties)
## [1] 67.01
mean(nineties)
## [1] 67.24
mean(noughts)
## [1] 65.38
mean(ughts)
## [1] 68.13333
The mean temperature rises from each decade to the next, with the exception of the noughts, where it falls to 65.38 before rising again to 68.13 in the final six years (ughts).
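Selecting decades by row number works only if the rows are in the expected order. As an alternative sketch, the decade can be derived from the year itself and the summaries computed per group; `toy` below is a hypothetical data frame with the same column names as tempdata.

```r
# Hypothetical toy data with the same column names as tempdata
toy <- data.frame(year = 1988:1993,
                  temp = c(66.0, 67.0, 68.0, 65.0, 66.5, 67.5))

# Label each row with its decade, e.g. 1988 -> 1980, 1991 -> 1990
toy$decade <- floor(toy$year / 10) * 10

# Mean and standard deviation of temperature within each decade
decade_means <- tapply(toy$temp, toy$decade, mean)
decade_sds   <- tapply(toy$temp, toy$decade, sd)

print(decade_means)
print(decade_sds)
```

Grouping by a computed decade label keeps working if rows are added or reordered, which fixed row ranges do not.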
sd(seventies)
## [1] 1.309623
sd(eighties)
## [1] 1.610003
sd(nineties)
## [1] 1.594574
sd(noughts)
## [1] 1.842583
sd(ughts)
## [1] 3.583109
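The increase in the standard deviation over time means that October temperatures have become more variable: average temperatures now spread further from the decade mean than they used to. With a mean of 68.13 and a standard deviation of 3.58 in the most recent years, an average temperature of 74 degrees Fahrenheit, as in 2015, lies within two standard deviations of the mean. If an October in one of the earlier decades had averaged 74 degrees, that would have been highly unusual, because it would lie more than three standard deviations above that decade's mean.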
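The "how unusual is 74 degrees" comparison can be made concrete with z-scores. The sketch below plugs in the decade means and standard deviations reported in this document and asks how many standard deviations above each decade's mean a 74-degree October would be.

```r
# Decade means and standard deviations reported in this document
decade_mean <- c(seventies = 66.42, eighties = 67.01, nineties = 67.24,
                 noughts = 65.38, ughts = 68.13333)
decade_sd   <- c(seventies = 1.309623, eighties = 1.610003, nineties = 1.594574,
                 noughts = 1.842583, ughts = 3.583109)

# How many standard deviations above each decade's mean is a 74-degree October?
z_scores <- (74 - decade_mean) / decade_sd
print(round(z_scores, 2))

# Decades in which 74 degrees would be more than three SDs above the mean
print(names(z_scores)[z_scores > 3])
```

The most recent group contains only six years, one of which is the 74-degree October itself, so its standard deviation should be read with some caution.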
Understanding changes in temperature variation is important for understanding the consequences of climate change because climate change can affect not only the average temperature but also how much temperatures swing from year to year. In this data the average October temperature rises in most decades, and the standard deviation rises as well. A larger variance means that more extreme temperatures become more common. For example, in recent years an October averaging 65 degrees Fahrenheit can be considered normal, but an October averaging 74 degrees can be considered normal too. These data alone cannot show that human emissions are the cause, but the pattern is consistent with a changing climate.
Read the assigned article for this section of the course, Wand et al., “The Butterfly Did It.” Focus on the: abstract, introduction, figures, tables and conclusion. You don’t need to read every word, but you do need to understand the argument and evidence. This is a very important article written in the discipline’s top journal (APSR), so it should give you a good example of how we use statistics in practice to answer questions about politics. Note: they use a slightly different regression model (estimator) than we are discussing in class, but it is a similar enough approach that you should be able to follow the overall argument.
In your own words, state the authors’ research question. Are they trying to answer a causal question?
The researchers ask whether the butterfly ballot used in Palm Beach County, Florida, in the 2000 presidential election caused more than 2,000 Democratic voters to vote by mistake for Reform candidate Pat Buchanan. Yes, this is a causal question: it asks about the effect of the ballot format on votes.
What is their independent variable? What is their dependent variable?
IV: The type of ballot used during the presidential election (butterfly ballot or not). DV: The number of votes that went to Pat Buchanan instead of Al Gore.
Examine Figure 3. What could you call Palm Beach County (PBC)? Why? (Use a key term we’ve learned in the course).
Palm Beach County would be considered an outlier because it does not follow the pattern of the majority of the data collected.
What is the main finding from the paper? In other words, how do the authors answer their research question? What is the main evidence they use to make this claim? Put this in your own words.
The main finding from the paper is that the butterfly ballot affected the number of votes and made voters accidentally vote for the Reform candidate Pat Buchanan instead of presidential candidate Al Gore. The main evidence they use is that in the past PBC has voted primarily Democratic, with no signs of change in the political climate of that county. They also show that PBC was not a Reform vote outlier in 1996, a presidential year in which the county did not use a butterfly ballot.
Examine Table 4. In Palm Beach County, how much more likely was it for a person casting a ballot for the Democratic Senator candidate to also vote for Buchanan on (i) election day; versus, (ii) absentee? What explains this difference in probability according to the authors? Note: this is a simple calculation if you read and understand the table.
A person who cast a ballot for the Democratic Senate candidate was more likely to also vote for Buchanan on election day than on an absentee ballot. According to the authors, the difference arises because the butterfly ballot was used on election day, while absentee ballots did not use the butterfly format. By contrast, Deckard voters who support Buchanan should not be affected by the butterfly ballot, and for them the difference between election-day and absentee Buchanan vote proportions is small.
You mentioned to your friend that you’re learning some really neat stuff in PS 15. Now they’re curious. They want to know how they too can figure out relationships in the political world.
Your friend asks how correlation is different from covariance, and for a formula that can turn \(cov(x,y)\) into \(cor(x,y)\). Provide that formula, and explain how correlation relates to covariance. Also explain what the correlation means and the possible values it can take.
\(cor(x,y) = \frac{cov(x,y)}{SD(x)\,SD(y)}\). Correlation is the standardized version of covariance. The way to get the correlation is to divide the covariance by the product of the standard deviations of the IV and the DV, which removes the units of both variables.
Correlation measures the strength and direction of the linear relationship between two variables on a scale of -1 to 1. A value of 1 means that all points fall perfectly on an upward-sloping line. A value of -1 means that all the points fall perfectly on a downward-sloping line. A value of 0 means that there is no linear relationship between the IV and DV.
Your friend asks you to explain what a random variable is–in your own words, provide a definition. What are its key components?
A random variable is a variable whose value is determined by a random process, so we cannot know its value in advance. Its key components are the set of values it can take and a probability distribution that says how likely each value (or range of values) is. For example, the roll of a die is a random variable that takes the values 1 through 6, each with probability 1/6.
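The two components of a random variable, its possible values and their probabilities, can be written down directly in R. The sketch below uses a fair six-sided die as an illustrative example (chosen here, not taken from the assignment) and the seed value is arbitrary.

```r
# A fair six-sided die as a random variable: possible values and their probabilities
values <- 1:6
probs  <- rep(1 / 6, 6)

# The probabilities of a distribution must sum to 1
print(sum(probs))

# Expected value: each value weighted by its probability
expected_value <- sum(values * probs)
print(expected_value)

# Draw 10 realizations of the random variable (seed fixed for reproducibility)
set.seed(15)
draws <- sample(values, size = 10, replace = TRUE, prob = probs)
print(draws)
```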
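Your friend then says they have heard of linear regression before, but they don’t know how it works. Explain in simple language what the regression is doing to estimate a relationship between two variables. Come up with a specific political science example to help your friend understand.
Linear regression describes the relationship between an independent variable and a dependent variable. It does this by finding the straight line that best fits the data, the one that minimizes the sum of squared distances between the observed points and the line. The line has an intercept (the predicted value of the DV when the IV is zero) and a slope (the predicted change in the DV for a one-unit increase in the IV). A good example to help visualize this is the relationship between GDP per capita and infant mortality rate: as GDP per capita increases, the infant mortality rate in a country decreases, so the regression line slopes downward.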