Introduction

Welcome to the first problem set. There is not a great deal of material here, but since this may be your first time using R and R Markdown, there are many potential pitfalls, so leave yourself plenty of time to complete it. The idea here is not that you can sit down and answer these questions straight away, but that you have a week to figure it out, and this is a key part of the learning process.

If you are looking at the HTML version of the problem set (pset1.html) that may have opened in your web browser, you are seeing the output produced by running the “script” or code in the file pset1.rmd, also available on the course website. Go ahead and open the file called pset1.rmd. If it does not open automatically within R Studio, you can open R Studio first and then use the File menu to open up pset1.rmd. Once you open pset1.rmd, you can continue reading the text easily in that file.

It will be easiest for you to open the .rmd file posted for each pset and write your solutions directly in it, learning from the code you see in the questions.

Before you start: The R Markdown Introduction

The text, output and graphics in this section are provided as an example whenever you create a new R markdown (.rmd) file in R Studio. It’s a good quick introduction so I replicate it here with minor modification. At this point, you may not understand all of the R code being used here, but the goal is to understand how the .rmd file works and how it relates to the .html file that gets outputted.

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. (Here we have set the code to produce an HTML output, which is what you need to upload for this class).

When you click the Knit button a document will be generated that includes both the content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot. In general, you will not include the echo = FALSE, because we will want to see your code.

Submission Instructions

Please submit your problem set via Gauchospace. Submit both the .rmd file, and the HTML file it creates. This assignment is due by 11:55 PM on Thursday January 24th. No late problem sets accepted. Please list any students you collaborated with.

Example Problem

Here is an example problem, with an example solution.

Question 0. In this question, we’ll provide the answer for you, as an example. You need to be looking at the .rmd file right now for this to make much sense.

Showing your code and the result, execute the code getwd(). Describe what this command does. You may want to execute the command directly in the console first to see what it does (ask your TA if you don’t know how to run a command; this is essential), but be sure to write it into your .rmd file so that it runs when you click knit.

Solution: Question 0

getwd()
## [1] "/home/jovyan/Political Science 15 Data Sets"

This command, when executed (either in the console or through the .rmd file once you click “knit”), tells the user what directory is set as the working directory. This is the directory where output will be saved, or where R will look first when searching for a file, for example a dataset. You need to always set your working directory first so R knows where to pull the data from.
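
As a minimal sketch of how the working directory behaves, the lines below record the current directory, switch to R’s temporary directory, and then switch back. The use of tempdir() is only an assumption for illustration; in practice you would point setwd() at the folder that holds your data.

```r
# Remember where we started so we can return afterwards.
old_dir <- getwd()

# tempdir() is a folder R creates for the session; it is used here only as a
# stand-in for "the folder that holds your data".
setwd(tempdir())
new_dir <- getwd()

# Relative file names are now resolved against new_dir.
print(new_dir)

# Switch back so the rest of the session is unaffected.
setwd(old_dir)
```

Storing the old directory before calling setwd() makes it easy to undo the change.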


Make sure to try executing your .rmd file now by clicking knit. Then take a look at the HTML that it created and see what you get.

Okay, your turn to answer the remaining questions!

Part 1. Short answer questions (1 point each; 5 points total)

Start by reading Chapter 1 of the textbook, Real Stats. You can also review your lecture notes.

Q1. A researcher observes that more educated people vote at a higher rate. He decides to publish a research article that says completing a bachelor’s degree causes people to participate in elections at a higher rate. Would you like to be a co-author on this paper? Why or why not? (100 words max)

A1. No, I would not want to co-author this paper. By saying that completing a bachelor’s degree causes people to vote at a higher rate, the researcher is claiming a causal relationship. The data appear to come from an observational study with no randomization, so there are probably confounders that affect both education and voting. These make the relationship endogenous: we cannot verify that the independent variable (education) is uncorrelated with the unobserved error term, so the observed correlation does not establish causation.

Q2. Explain what this sentence means: “Experiments create exogeneity via randomization.” (75 words max)

A2. Exogeneity means that the independent variable is uncorrelated with the error term, that is, with all the other factors that affect the dependent variable. In an experiment, the treatment is assigned at random, so treatment status cannot be systematically related to those other factors. Under exogeneity, a correlation between the independent and dependent variables can be interpreted as a causal effect.
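
To make the idea concrete, here is a small simulation sketch. Everything in it is hypothetical: the confounder called motivation, the sample size, and the rule by which people select into treatment are assumptions chosen only for illustration. It compares how strongly treatment is related to the confounder when people choose treatment themselves versus when a coin flip decides.

```r
set.seed(1)
n <- 10000

# A hypothetical confounder that also affects the outcome (e.g. "motivation").
motivation <- rnorm(n)

# Observational world: more motivated people are more likely to take the treatment.
treat_observed <- as.numeric(motivation + rnorm(n) > 0)

# Experimental world: a coin flip decides who is treated.
treat_random <- rbinom(n, size = 1, prob = 0.5)

# Correlation between treatment status and the confounder in each world.
cor_observed <- cor(treat_observed, motivation)
cor_random <- cor(treat_random, motivation)

print(c(observed = cor_observed, randomized = cor_random))
```

With self-selection the treatment is clearly correlated with the confounder, while under random assignment the correlation should be close to zero.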

Q3. What are some problems with experiments, particularly in a social science discipline such as political science? (100 words max)

A3. Several problems can arise when running experiments in the social sciences, ranging from ethical concerns to inconsistent findings. One ethical example is the Stanford project, in which students were basically mentally tortured for days. Experiments are also difficult to replicate because people have complex lives and different experiences, so results from one group may not hold for another. Because of these difficulties, it is often easier to perform observational studies.

Q4. You decide to run an experiment to see whether going to lectures helps students learn. You randomly assign half of the class to go to the lecture and section for the course, and to read the textbook. For the other half of the class, you just assign the students to read the textbook. At the end of the semester, you give the entire class a test. You find that the students in the first group did much better than those in the second group, who only read the textbook. (120 words max)

4a. What could you call each group?

A4a. The half assigned to attend lecture and section would be called the treatment group, while the half that only read the textbook would be called the control group.

4b. What is your independent variable and what is your dependent variable?

A4b. Dependent variable: test scores. Independent variable: lecture attendance.

4c. Given this set up, list some factors you are controlling for.

A4c. 1. By giving both groups the same textbook, you control the information provided. 2. With randomization, you control for who would be more inclined to study. 3. With randomization, you also control for prior experience with course materials. 4. Randomly assigning students to each group controls for prior grades and test scores.

4d. Can you say that attending lectures caused the students to do better on the test? Why or why not? Explain using the technical terms in the textbook.

A4d. Yes, with some caution. Because students were randomly assigned to the two groups, lecture attendance is exogenous: it is uncorrelated with the error term, so confounders cannot explain the difference in test scores. The experiment is therefore internally valid, and we can claim that, for this class, attending lecture and section caused students to do better on the test. How strong that claim is still depends on the size of the difference between the groups.
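
The comparison behind this answer can be sketched with simulated data. The class size, the baseline score of 70, and the 5-point effect below are made-up assumptions, not results from any real class. The sketch shows that, with random assignment, the treatment effect is estimated by the difference in group means, which equals the slope from regressing the score on a treatment indicator.

```r
set.seed(42)
n_students <- 200

# Randomly assign exactly half the class to the treatment (lecture) group.
lecture <- sample(rep(c(0, 1), each = n_students / 2))

# Hypothetical test scores: a baseline of 70, an assumed 5-point boost
# from attending lecture, plus random noise.
score <- 70 + 5 * lecture + rnorm(n_students, sd = 10)

# Difference in mean scores between the treatment and control groups.
diff_means <- mean(score[lecture == 1]) - mean(score[lecture == 0])

# The same number is the slope from regressing score on the treatment indicator.
fit <- lm(score ~ lecture)
slope <- unname(coef(fit)["lecture"])

print(c(diff_means = diff_means, slope = slope))
```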

4e. Can you say that this finding would also apply in courses with online lectures? Why or why not? Explain using the technical terms in the textbook.

A4e. No, not on the basis of this experiment alone. This is a question of external validity: too many other variables could come into play in an online course. For example, students could use their notes and books during the final test, which would give them an advantage, and you can never be truly sure whether students are actually grasping the recorded online lectures.

Q5. Imagine you are looking at the relationship between income and level of education. List some of the factors that could lead to endogeneity. (50 words max)

A5. 1. Inheritances. 2. Loans and other bills to pay off. 3. Connections and networking. 4. Having a family to support. 5. Job competition.


Part 2. Analysis (1/2 point each; 5 points total)

When James Carville was crafting a simple catchphrase to summarize then-presidential candidate Bill Clinton’s electoral message, he hung a sign up at campaign headquarters that read ‘The Economy, stupid.’ The phrase has since morphed into ‘It’s the economy, stupid’ and it still reflects the core message that the economy decides elections. When times are good, voters want more of the same; when times are bad, they want a fresh face. If we look at presidential elections from the last 70 years, do the data support this claim?

The Presidential Voteshare database from 1948–2012 offers a chance to evaluate this hypothesis.

Download the dataset, presvote.RData, which you’ll find on the course website. You may want to put it in your working directory to make it easy to find (use getwd() to see what your current working directory is; you can use the Session menu in R Studio or the setwd() command to change your working directory).

Here is a brief description of the variables:

As will often be the case when using R, you will need to use the $ operator to access these variables within the object. Specifically, once you have loaded presvote.RData, the result will be available in the data frame presdata. To get at the variable vote, for example, you would use presdata$vote. Remember, the end of each chapter in the textbook includes R code that can be helpful. We also posted R resources on Gauchospace.
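
As a quick illustration of the $ operator, the sketch below uses cars, the data set built into R that also appears in the introduction above, rather than presdata, so it runs without downloading anything.

```r
# Pull one column out of a data frame with $.
speeds <- cars$speed

# The result is an ordinary vector, so the usual functions work on it.
head(speeds)
mean(cars$speed)

# $ is shorthand for double-bracket extraction by name.
same_column <- identical(cars$speed, cars[["speed"]])
print(same_column)
```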

Q1. Load the data into R. The data are stored as an Rdata file, so you can use the load() function to load it.

load("presdata.RData")
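
A minimal sketch of how load() behaves, using a tiny made-up data frame saved to a temporary file. The object name example_data, its values, and the file name are hypothetical and chosen only for illustration. The point is that load() restores objects under the names they were saved with, which is why the object name can differ from the file name.

```r
# A small made-up data frame (the values are for illustration only).
example_data <- data.frame(year = c(2000, 2004), vote = c(50.3, 51.2))

# Save it to a temporary .RData file, then remove it from the workspace.
rdata_path <- file.path(tempdir(), "example.RData")
save(example_data, file = rdata_path)
rm(example_data)

# load() restores the object and (invisibly) returns the names it restored.
loaded_names <- load(rdata_path)
print(loaded_names)
print(example_data)
```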

Q2. Check the dimensions of the data (i.e. the number of rows and columns). How many observations are there? What are the dimensions of the data?

range(presdata$year)
## [1] 1948 2012
max(presdata$year) - min(presdata$year)
## [1] 64
dim(presdata)
## [1] 17  7

The dimensions of the data are 17 x 7 (17 rows and 7 columns). Each row is one observation, so there are 17 observations; 119 is the total number of cells (17 x 7), not the number of observations. The data set covers 1948 to 2012, a span of 64 years, but the years are not continuous.
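
The distinction between rows, columns, and cells can be checked on any data frame. The sketch below uses the built-in cars data set as a stand-in for presdata.

```r
# dim() returns rows first, then columns.
dims <- dim(cars)
print(dims)

# Each row is one observation; each column is one variable.
n_obs <- nrow(cars)
n_vars <- ncol(cars)

# Rows times columns gives the number of cells, not the number of observations.
n_cells <- n_obs * n_vars
print(c(observations = n_obs, variables = n_vars, cells = n_cells))
```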

Q3. What is the range of years covered in this data set?

A3. The data set covers the years 1948 to 2012, a range of 64 years.

Q4. Calculate the average change in real disposable income across all points in the sample. Do you think this is a large or a small average? What is the minimum and the maximum change in real disposable income?

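mean(presdata$rdi4)  # the average asked for in the question
var(presdata$rdi4)   # the variance, a measure of spread, not the average
## [1] 3.103718
max(presdata$rdi4)
## [1] 6.03529
min(presdata$rdi4)
## [1] -0.59695

The average change in real disposable income is what mean(presdata$rdi4) returns. The value 3.103718 printed above comes from var(), so it is the variance of the changes rather than their average. Because a least-squares line passes through the point of means, the regression output in Q8 implies an average of roughly (52.04586 - 45.9385) / 2.2906 ≈ 2.67%. That is a fairly large average, because a few percent of an individual’s total income can be a large amount of money. The maximum change in real disposable income is 6.03529% and the minimum is -0.59695%.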
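
The difference between an average and a variance is easy to see on a small made-up vector. The numbers below are hypothetical and chosen so the arithmetic can be checked by hand.

```r
# Five made-up values.
x <- c(1, 2, 3, 4, 5)

# mean() is the average: (1 + 2 + 3 + 4 + 5) / 5 = 3.
avg <- mean(x)

# var() is the sample variance: sum of squared deviations / (n - 1) = 10 / 4 = 2.5.
spread <- var(x)

# range() returns the minimum and maximum together.
rng <- range(x)

print(c(mean = avg, variance = spread, min = rng[1], max = rng[2]))
```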

Q5. Calculate the average vote share across all points in the sample. What does this tell you about the power of incumbency?

mean(presdata$vote)
## [1] 52.04586

The average vote share is 52.04586%. This tells us that, on average, the incumbent party wins slightly more than half of the popular vote, which usually determines who wins the election. So the incumbent party, which already holds the office, has a somewhat better chance of winning the election again.

Q6. Produce a simple scatterplot with change in income on the horizontal axis, and points showing the incumbent party’s vote share in each year.

plot(presdata$rdi4, presdata$vote, 
     main="% Change in Income v. % Votes for Incumbent", 
     xlab = "% Change in Income", 
     ylab = "% Vote for Incumbent President")

Q7. Add a line to that plot (so you see a jagged line going through all the points). You want it to look like a line graph.

newdata <- presdata[order(presdata$rdi4, presdata$vote),]
plot(newdata$rdi4, newdata$vote, 
     main="% Change in Income v. % Votes for Incumbent", 
     xlab = "% Change in Income", 
     ylab = "% Vote for Incumbent President")
lines(newdata$rdi4, newdata$vote, type="l")
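
Why the data are sorted before drawing the line: lines() connects points in the order they are given, so unsorted x values produce a zigzag that doubles back on itself. A small sketch with made-up numbers shows what order() does.

```r
# Made-up points, deliberately out of order along the x axis.
x <- c(3, 1, 2)
y <- c(30, 10, 20)

# order() returns the positions that would sort x from smallest to largest.
idx <- order(x)
print(idx)

# Apply the same positions to both vectors so each pair stays together.
x_sorted <- x[idx]
y_sorted <- y[idx]

# Now the connecting line runs left to right through the points.
plot(x_sorted, y_sorted)
lines(x_sorted, y_sorted)
```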

Q8. Make the plot again, but this time add a trend line (also known as a line of best fit or a regression line) using the abline() command.

model1 <- lm(vote ~ rdi4, data = presdata)
plot(presdata$rdi4, presdata$vote, 
     main="% Change in Income v. % Votes for Incumbent", 
     xlab = "% Change in Income", 
     ylab = "% Vote for Incumbent President")
abline(model1, col = "purple")

summary(model1)
## 
## Call:
## lm(formula = vote ~ rdi4, data = presdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6842 -3.7406 -0.2731  2.6357  7.5002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.9385     1.6919  27.152 3.62e-14 ***
## rdi4          2.2906     0.5342   4.288 0.000648 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.765 on 15 degrees of freedom
## Multiple R-squared:  0.5507, Adjusted R-squared:  0.5207 
## F-statistic: 18.38 on 1 and 15 DF,  p-value: 0.0006477

Q9. What does this line tell you about elections and the economy?

A9. The regression line shows a positive correlation between the change in income and the incumbent party’s vote share. The slope of about 2.29 means that each additional percentage point of income growth is associated with roughly 2.29 more points of vote share. Basically, the more money the public has, the better the chances that the incumbent party will be reelected.
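
To see what the fitted line implies, the sketch below plugs a few income changes into the intercept and slope reported in the summary output above. The income values chosen are arbitrary illustrations, and the predictions are only as good as the simple linear fit.

```r
# Coefficients copied from the summary(model1) output.
intercept <- 45.9385
slope <- 2.2906

# A few illustrative values for the change in real disposable income.
income_change <- c(0, 1, 2, 3)

# Predicted incumbent-party vote share at each value.
predicted_vote <- intercept + slope * income_change
print(data.frame(income_change, predicted_vote))

# Income change at which the line predicts exactly 50% of the vote.
break_even <- (50 - intercept) / slope
print(break_even)
```

Under this fit, the predicted vote share crosses 50% at an income change of a little under 2.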

Q10. What could you call this relationship?

A10. I would call this relationship a strong positive correlation.
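
One way to put a number on “strong” is the correlation coefficient. In a regression with a single predictor, the squared correlation equals the R-squared, so the R-squared of 0.5507 reported above implies a correlation of about 0.74 (positive, because the slope is positive). The sketch below does that arithmetic and then checks the identity on the built-in cars data, which is used here only as a stand-in.

```r
# R-squared copied from the summary(model1) output.
r_squared <- 0.5507

# With one predictor and a positive slope, the correlation is the positive root.
r <- sqrt(r_squared)
print(round(r, 3))

# Check the identity on a built-in data set.
fit_cars <- lm(dist ~ speed, data = cars)
r2_from_fit <- summary(fit_cars)$r.squared
r2_from_cor <- cor(cars$speed, cars$dist)^2
print(c(from_fit = r2_from_fit, from_cor = r2_from_cor))
```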