Introduction

Welcome to the first problem set. There is not a great deal of material here, but since this may be your first time using R and R Markdown, there are many potential pitfalls, so leave yourself plenty of time to complete it. The idea here is not that you can sit down and answer these questions straight away, but that you have almost two weeks to figure it out, and this is a key part of the learning process.

If you are looking at the HTML version of the problem set (pset1.html) that may have opened in your web browser, you are seeing the output produced by running the “script” or code in the file pset1.rmd, also available on the course website. Go ahead and open the file called pset1.rmd. If it does not open automatically within RStudio, you can open RStudio first and then use the File menu to open pset1.rmd. Once you open pset1.rmd, you can continue reading the text easily in that file.

It will be easiest for you to open the .rmd file posted for each pset and write your solutions directly in it, learning from the code you see in the questions.

Before you start: The R Markdown Introduction

The text, output and graphics in this section are provided as an example whenever you create a new R Markdown (.rmd) file in RStudio. It’s a good quick introduction, so I replicate it here with minor modifications. At this point, you may not understand all of the R code being used here, but the goal is to understand how the .rmd file works and how it relates to the .html file that it produces.

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. (Here we have set the code to produce HTML output, which is what you need to upload for this class.)

When you click the Knit button a document will be generated that includes both the content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot. In general, you will not include the echo = FALSE, because we will want to see your code.

Submission Instructions

Please submit your problem set via Gauchospace. Submit both the .rmd file, and the HTML file it creates. This assignment is due by 6 PM on Friday Oct 13. No late problem sets accepted. Please list any students you collaborated with.

Example Problem

Here is an example problem, with an example solution.

Question 0. In this question, we’ll provide the answer for you, as an example. You need to be looking at the .rmd file right now for this to make much sense.

Showing your code and the result, execute the code getwd(). Describe what this command does. You may want to execute the command directly in the console first (ask your TA if you don’t know how to run a command; this is essential) to see what it does, but be sure to write it into your .rmd file so that it runs when you click knit.

Solution: Question 0

setwd("/Users/alexsefayan/Desktop/PS15")
getwd()
## [1] "/Users/alexsefayan/Desktop/PS15"

This command, when executed (either in the console or through the .rmd file once you click “knit”), tells the user which directory is set as the working directory. This is the directory where output will be saved and where R will look first when searching for a file, such as a dataset. You should always set your working directory first so R knows where to pull the data from.
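As a minimal sketch of how the working directory affects file lookups, the snippet below switches to a scratch directory and shows that a relative file name is resolved against it. The use of tempdir() and the file name example.txt are assumptions made for illustration; in practice you would point setwd() at your own course folder.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# Switch to a scratch directory (tempdir() stands in for a course folder).
setwd(tempdir())
work_dir <- getwd()

# Relative file names are resolved against the working directory.
writeLines("hello", "example.txt")
found_relative <- file.exists("example.txt")
found_absolute <- file.exists(file.path(work_dir, "example.txt"))
print(c(relative = found_relative, absolute = found_absolute))

# Clean up and restore the original working directory.
invisible(file.remove("example.txt"))
setwd(old_dir)
```

Both lookups refer to the same file, which is why a dataset placed in the working directory can be loaded by its bare file name.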


Make sure to try executing your .rmd file now by clicking knit. Then take a look at the HTML that it created and see what you get.

Okay, your turn to answer the remaining questions!

Part 1. Short answer questions

Start by reading Chapter 1 of the textbook, Real Stats. You can also review your lecture notes.

Q1. A researcher observes that more educated people vote at a higher rate. He decides to publish a research article that says completing a bachelor’s degree leads people to participate in elections at a higher rate. Would you like to be a co-author on this paper? Why or why not? (100 words max)

I would not co-author this paper because many confounding factors may influence whether someone votes and participates in politics. These include race, wealth, and gender, each of which may be related to both education and voting.

Q2. Explain what this sentence means: “Experiments create exogeneity via randomization.” (75 words max)

Randomization ensures that the independent variable is exogenous by balancing confounding variables, both observed and unobserved, across the groups in an experiment. Because treatment is assigned by chance, the independent variable is, on average, uncorrelated with every other factor that affects the outcome.
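A small simulation can illustrate the idea. This is only a sketch under made-up assumptions: the variable motivation is a hypothetical unobserved trait, and the two assignment rules are invented for the comparison. It contrasts a treatment that people select into with one assigned by coin flip.

```r
set.seed(42)  # fixed seed so the simulation is reproducible
n <- 10000

# A trait the researcher cannot observe (hypothetical "motivation" score).
motivation <- rnorm(n)

# Self-selection: more motivated people are more likely to take the treatment.
treat_self <- as.integer(motivation + rnorm(n) > 0)

# Random assignment: a coin flip that ignores motivation entirely.
treat_random <- rbinom(n, size = 1, prob = 0.5)

# Correlation between treatment status and the unobserved trait.
cor_self <- cor(treat_self, motivation)
cor_random <- cor(treat_random, motivation)
print(round(c(self_selected = cor_self, randomized = cor_random), 3))
```

Under self-selection the treatment is clearly correlated with the unobserved trait, while under random assignment the correlation is close to zero, which is what exogeneity requires.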

Q3. What are some problems with experiments, particularly in a social science discipline such as political science? (100 words max)

There are many problems, including but not limited to ethical concerns, difficulties with randomization, and limited funding. Unlike natural scientists, political scientists often cannot form control and experimental groups when dealing with cases at the national or international level. For example, political scientists cannot examine the effects of capitalism in North Korea because North Koreans have never experienced capitalism.

Q4. You decide to run an experiment to see whether going to lectures helps students learn. You randomly assign half of the class to go to the lecture and section for the course, and to read the textbook. For the other half of the class, you just assign the students to read the textbook. At the end of the semester, you give the entire class a test. You find that the students in the first group did much better than those in the second group, who only read the textbook. (120 words max)

4a. What could you call each group?

The group that goes to lecture and section is the experimental (treatment) group, while the group that only reads the textbook is the control group.

4b. What is your independent variable and what is your dependent variable?

The IV: attending lecture and section. The DV: score on the end-of-semester test.

4c. Given this set up, list some factors you are controlling for.

Factors controlled for: students’ work ethic, inherent intelligence, ability, and prior knowledge (for example, if a student has taken a similar class before). These factors are controlled because the experiment is randomized. Randomization balances students who possess these traits across the experimental and control groups.

4d. Can you say that attending lectures caused the students to do better on the test? Why or why not? Explain using the technical terms in the textbook.

Yes. Because the experiment is randomized, the control group and the experimental group should look the same on average, so the only systematic difference between them is the treatment; any other differences are due to the luck of the draw. Randomization eliminates the lurking and confounding variables that are present in observational studies.
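The logic can be sketched with simulated data. Everything here is hypothetical: the class size, the ability variable, and the assumed 5-point treatment effect are invented for illustration. Note that "treatment" means the whole assigned bundle (lecture plus section), as in the question's setup.

```r
set.seed(1)  # fixed seed so the simulation is reproducible
n <- 20000
true_effect <- 5  # assumed effect of the treatment on the test score, in points

# An unobserved trait that also drives test scores.
ability <- rnorm(n)

# Shuffle an equal number of 0s and 1s so exactly half the class is treated.
treated <- sample(rep(c(0, 1), each = n / 2))

# Test score depends on ability, the treatment, and random noise.
score <- 60 + 10 * ability + true_effect * treated + rnorm(n, sd = 5)

# Difference in mean scores estimates the treatment effect.
estimate <- mean(score[treated == 1]) - mean(score[treated == 0])

# Balance check: average ability should be nearly equal across groups.
balance <- mean(ability[treated == 1]) - mean(ability[treated == 0])
print(round(c(estimate = estimate, ability_gap = balance), 3))
```

Because ability is balanced across the two groups, the simple difference in means lands close to the assumed effect even though ability is never measured or controlled for directly.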

4e. Can you say that this finding would also apply in courses with online lectures? Why or why not? Explain using the technical terms in the textbook.

If a sample of online students produces the same results, then the experiment is generalizable. If the experiment is generalizable, then the findings would apply to online lectures as well. As long as the online experiment is carried out in the same fashion as the lecture experiment, the results should be the same.

Q5. Imagine you are looking at the relationship between income and level of education. List some of the factors that could lead to endogeneity. (50 words max)

Some factors that can lead to endogeneity are family wealth and the community a person grows up in. People who come from money are more likely to pursue a higher level of education at private universities. Additionally, people who come from wealth probably live in affluent communities and attend private schools or schools with a lot of funding.
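The wealth example can be made concrete with a simulation. This is a sketch with invented numbers: family_wealth, the coefficients, and the sample size are all assumptions, chosen so that wealth raises both education and income.

```r
set.seed(7)  # fixed seed so the simulation is reproducible
n <- 5000

# Hypothetical data-generating process: wealth affects education AND income.
family_wealth <- rnorm(n)
education <- 12 + 2 * family_wealth + rnorm(n)
income <- 20 + 3 * education + 5 * family_wealth + rnorm(n, sd = 2)

# Regression that omits wealth: education picks up part of wealth's effect.
naive_slope <- coef(lm(income ~ education))[["education"]]

# Regression that includes wealth: the education slope is near the assumed 3.
controlled_slope <- coef(lm(income ~ education + family_wealth))[["education"]]

print(round(c(naive = naive_slope, controlled = controlled_slope), 2))
```

When wealth is left out, it sits in the error term and is correlated with education, so the estimated effect of education on income is overstated. That correlation between the independent variable and the error term is what endogeneity means.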


Part 2. Bias in Self-reported Turnout

Surveys are frequently used to measure political behavior such as voter turnout, but some researchers are concerned about the accuracy of self-reported turnout (that is, whether people tell the truth when you ask them if they voted). In particular, they worry about possible social desirability bias in post-election surveys: respondents who did not vote may say that they did because they feel that they should have voted. Is such a bias present in the American National Election Studies (ANES)?

The ANES is a national survey that has been conducted for every election since 1948. The ANES conducts face-to-face interviews with a nationally representative sample of adults.

Download the dataset, turnout.RData, which you’ll find on the course website. You may want to put it in your working directory to make it easy to find (use getwd() to see what your current working directory is; you can use the Session menu in RStudio or the setwd() command to change your working directory).

Here is a brief description of the variables:

As will often be the case when using R, you will need to use the $ operator to access these variables within the object. Specifically, once you have loaded turnout.RData, the result will be available in the variable data. To get at the variable ANES, for example, you would use data$ANES. Remember, the end of each chapter in the textbook includes R code that can be helpful. We also posted R resources on Gauchospace.
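The following sketch shows the load() and $ pattern on a toy stand-in rather than the real file. The data frame contents, the file name toy_turnout.RData, and the use of tempdir() are assumptions for illustration only; they are not the course data.

```r
# Toy stand-in for the course data; these values are made up for illustration.
data <- data.frame(
  year = c(1980, 1984, 1988),
  ANES = c(0.71, 0.74, 0.70)
)

# Save the object to a temporary .RData file, then remove it from memory.
rdata_path <- file.path(tempdir(), "toy_turnout.RData")
save(data, file = rdata_path)
rm(data)

# load() restores objects under their saved names and returns those names.
loaded_names <- load(rdata_path)
print(loaded_names)

# The $ operator pulls a single column out of the data frame as a vector.
anes_values <- data$ANES
print(mean(anes_values))
```

Note that load() does not need its result assigned to anything: the object reappears under the name it was saved with, which is why the problem set refers to it as data.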

Q1. Load the data into R. The data are stored as an RData file, so you can use the load() function to load it.

load("turnout.RData")
summary(data)
##       year           VEP              VAP             total       
##  Min.   :1980   Min.   :159635   Min.   :164445   Min.   : 64991  
##  1st Qu.:1986   1st Qu.:171192   1st Qu.:178930   1st Qu.: 73179  
##  Median :1993   Median :181140   Median :193018   Median : 89055  
##  Mean   :1993   Mean   :182640   Mean   :194226   Mean   : 89778  
##  3rd Qu.:2000   3rd Qu.:193353   3rd Qu.:209296   3rd Qu.:102370  
##  Max.   :2008   Max.   :213314   Max.   :230872   Max.   :131304  
##                                                                   
##       ANES            felons         noncit         overseas   
##  Min.   :0.4700   Min.   : 802   Min.   : 5756   Min.   :1803  
##  1st Qu.:0.5700   1st Qu.:1424   1st Qu.: 8592   1st Qu.:2236  
##  Median :0.7050   Median :2312   Median :11972   Median :2458  
##  Mean   :0.6579   Mean   :2177   Mean   :12229   Mean   :2746  
##  3rd Qu.:0.7375   3rd Qu.:3042   3rd Qu.:15910   3rd Qu.:2937  
##  Max.   :0.7800   Max.   :3168   Max.   :19392   Max.   :4972  
##                                                                
##     osvoters  
##  Min.   :263  
##  1st Qu.:263  
##  Median :263  
##  Mean   :263  
##  3rd Qu.:263  
##  Max.   :263  
##  NA's   :13

Q2. Check the dimensions of the data (i.e. the number of rows and columns). How many observations are there? What are the dimensions of the data? What is the range of years covered in this data set?

dim(data)
## [1] 14  9
ncol(data)
## [1] 9
nrow(data)
## [1] 14
range(data$year)
## [1] 1980 2008

Q3a. Calculate the turnout rate for each year based on the voting age population, or VAP. Note that for this data set, we must add the total number of eligible overseas voters to the pool of potential voters, since the VAP variable does not include these individuals in the count. So we want total votes cast (“total”), divided by (“VAP” plus “overseas”). (The overseas ballots cast are already counted in total so do not need to be added to “total”.)

data$turnout <- (data$total/(data$VAP+data$overseas))
summary(data$turnout)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3483  0.3657  0.4844  0.4546  0.5241  0.5567

Q4. Compute the difference between the VAP-based estimate of turnout you just got and the turnout based on people’s self-reported voting in the ANES (the ANES variable). How big is the difference on average? What are the minimum and the maximum of the difference?

data$diff <- abs(data$turnout-data$ANES)
min(data$diff)
## [1] 0.1106116
max(data$diff)
## [1] 0.261715
mean(data$diff)
## [1] 0.2032914

Q4b. Do you think this is a large or a small difference between the self-reported (ANES) turnout and the VAP-based turnout? Why?

There is a large difference between the self-reported ANES turnout and the VAP-based turnout. On average, the two estimates differ by about 20 percentage points.

Q5. Produce a simple scatterplot with election year on the horizontal axis, and points showing the difference between the VAP-based estimate of turnout and the ANES estimate in each year.

plot(data$year, data$diff,
     xlab = "Year",
     ylab = "Difference",
     main = "Difference between VAP and ANES turnout rates")

#xlab = "Election Year"
#ylab = "Difference between VAP and ANES turnout rates"

Q6. Add a line to that plot (so you see a jagged line going through all the points). You want it to look like a line graph.

plot(data$year, data$diff,
     xlab = "Year",
     ylab = "Difference",
     main = "Difference between VAP and ANES turnout rates",
     type = "b" )

Q7a. Make the plot again, but this time add a trend line (also known as a line of best fit or a regression line).

plot(data$year, data$diff,
     xlab = "Year",
     ylab = "Difference",
     main = "Difference between VAP and ANES turnout rates",
     type = "b")
# Fit diff as a linear function of year; columns are looked up in `data`
model1 <- lm(diff ~ year, data = data)
abline(model1, col = "red")

Q7b. What does this line tell you about how the self-reporting bias is changing over time?

The trend line slopes upward, so the gap between self-reported (ANES) turnout and VAP-based turnout grows over time. In other words, the self-reporting bias increases over time.

Q7c. What could you call this relationship?

Positive correlation
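The direction of a trend line can also be checked numerically rather than read off the plot. The sketch below uses a toy data frame with made-up values (toy, its years, and its diff column are assumptions, not the ANES data) to show how to extract the slope of the fitted line and the correlation.

```r
# Toy data: a gap that drifts upward with a small alternating wiggle.
toy <- data.frame(year = seq(1980, 2008, by = 4))
toy$diff <- 0.15 + 0.002 * (toy$year - 1980) + rep(c(0.01, -0.01), times = 4)

# Fit the same kind of linear trend used for the plot.
trend <- lm(diff ~ year, data = toy)

# The coefficient on year is the slope: change in the gap per calendar year.
slope <- coef(trend)[["year"]]

# The correlation coefficient has the same sign as the slope.
r <- cor(toy$year, toy$diff)
print(c(slope = slope, correlation = r))
```

A positive slope and a positive correlation describe the same pattern: the gap tends to be larger in later years. Running the same two lines on the problem set's own model and columns would give the corresponding values for the real data.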