Introduction

You will work with poll results from the U.S. 2016 presidential election aggregated from HuffPost Pollster, RealClearPolitics, polling firms, and news reports. The earliest poll has a start date of 2015-11-06, and the latest poll has a start date of 2016-11-06. The general election was held on November 8, 2016.

Data

Below is a preview of the data set. Consult the data dictionary for further details on the variables.

Questions

To get started, load polls.Rdata (available on Google Classroom). There is an object polls in polls.Rdata. Use polls to answer each question, unless it states otherwise.
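As a quick illustration of how `load()` behaves, the sketch below saves a toy object to a temporary file and restores it. The file name `toy_polls.Rdata` and the toy columns are assumptions made for this example only; they are not the course file.

```r
# Toy stand-in for the course file; the file name is hypothetical.
toy_path <- file.path(tempdir(), "toy_polls.Rdata")
polls <- data.frame(state = c("Michigan", "U.S."), samplesize = c(500, 1200))
save(polls, file = toy_path)
rm(polls)

# load() restores saved objects under their original names and
# invisibly returns those names as a character vector.
loaded_names <- load(toy_path)
loaded_names
str(polls)
```

Capturing the return value of `load()` is a convenient way to confirm which objects a file added to your environment.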

Answer 6 of the first 9 questions. In your Rmd file, clearly indicate which questions you are answering. Question 10 is extra credit.

Question 1

Who were the eight most common pollsters? Make a data frame that lists these pollsters in descending order in terms of the number of polls they conducted.
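A minimal sketch of the counting pattern on toy data, assuming dplyr is installed. The tibble `toy` and the column name `n_polls` are made up for illustration; the real answer should use polls.

```r
library(dplyr)

# Toy data: six polls from three made-up pollsters.
toy <- tibble(pollster = c("A", "B", "A", "C", "B", "A"))

top_two <- toy %>%
  count(pollster, sort = TRUE, name = "n_polls") %>%  # descending by count
  slice_head(n = 2)                                   # keep the first two rows

top_two
```

Note that `slice_head()` keeps exactly the requested number of rows, so ties at the cutoff are broken by row order.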

Question 2

For those pollsters that were given a grade, how many polls were conducted for each grade? Make a data frame that lists the grades and the number of polls for each grade. Arrange the grades from A+ to D.
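One way to control the sort order of letter grades is to convert them to a factor with explicit levels before counting rows. The sketch below uses toy data; the vector `grade_levels` is an assumed ordering for this example, and how missing grades are stored in polls (for instance as `NA`) should be checked rather than taken from here.

```r
library(dplyr)

toy <- tibble(grade = c("B", "A+", NA, "A-", "B", "D", "A+", "B"))

# Assumed ordering for the toy data only; check the values in your own data.
grade_levels <- c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D")

grade_counts <- toy %>%
  filter(!is.na(grade)) %>%                              # drop ungraded rows
  mutate(grade = factor(grade, levels = grade_levels)) %>%
  count(grade)                                           # rows follow level order

grade_counts
```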

Question 3

Recreate the plot below. The plot is based on data for the state of Michigan, pollster “Ipsos”, and the population of likely voters. Use the ending date of the poll as your x variable. Use the raw polling data. As a hint, you will need to reshape your data frame before you make the plot.

Plot details:

  • line size: 1.5
  • point size: 3
  • colors: “#3A89CB”, “#D65454”
  • figure height: 8, figure width: 9
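The reshaping hint can be sketched on toy data as follows, assuming dplyr, tidyr, and ggplot2 are installed. The numbers are invented, and the mapping of each color to a candidate is an assumption, since the handout lists the two colors without saying which is which.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

toy <- tibble(
  enddate = as.Date(c("2016-10-01", "2016-10-15", "2016-11-01")),
  rawpoll_clinton = c(45, 44, 46),
  rawpoll_trump = c(40, 42, 41)
)

# One row per poll and candidate, so color can map to a single column.
toy_long <- toy %>%
  pivot_longer(
    cols = starts_with("rawpoll_"),
    names_to = "candidate",
    names_prefix = "rawpoll_",
    values_to = "support"
  )

# linewidth requires ggplot2 >= 3.4.0; older versions use size for lines.
p <- ggplot(toy_long, aes(x = enddate, y = support, color = candidate)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 3) +
  scale_color_manual(values = c(clinton = "#3A89CB", trump = "#D65454"))
```

In R Markdown, the figure dimensions are typically set with the chunk options `fig.height` and `fig.width` rather than inside the ggplot2 code.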



Question 4

From polls, produce a data frame that contains the variables: rawpoll_trump, rawpoll_clinton, trump_edge, party_color. trump_edge is defined as rawpoll_trump - rawpoll_clinton. party_color should take the value “red” if trump_edge >= 0 and “blue” otherwise. Randomly display 8 rows of this data frame. An example in tabular form is given below.

rawpoll_trump  rawpoll_clinton  trump_edge  party_color
        41.00            46.00       -5.00  blue
        39.00            44.00       -5.00  blue
        25.00            53.00      -28.00  blue
        42.00            46.00       -4.00  blue
        42.00            41.00        1.00  red
        32.00            42.00      -10.00  blue
        35.94            25.17       10.77  red
        58.00            27.00       31.00  red
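The derived-column pattern described in this question can be sketched on toy data like this, assuming dplyr 1.0.0 or later for `slice_sample()`. The four toy rows are invented, and the seed is set only so the example is reproducible.

```r
library(dplyr)

toy <- tibble(
  rawpoll_trump = c(41, 42, 58, 40),
  rawpoll_clinton = c(46, 41, 27, 40)
)

edges <- toy %>%
  mutate(
    trump_edge = rawpoll_trump - rawpoll_clinton,
    # A tie (edge of 0) counts as "red" under the >= rule.
    party_color = if_else(trump_edge >= 0, "red", "blue")
  )

set.seed(1)  # only to make the toy sample reproducible
edges %>% slice_sample(n = 2)
```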

Question 5

Create a function named avg_polled(). This function should have one argument, state_of_interest, which accepts one or more state names. The function should return a data frame with the average number of people polled in each state passed to state_of_interest. Sort the average number of people polled in descending order. Below are two example calls of the function.

avg_polled(state_of_interest = "Michigan")
avg_polled(state_of_interest = c("Michigan", "Oklahoma", "New York"))
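A function that accepts either one state or a vector of states can rely on `%in%` for the filtering step. The sketch below is a toy version under stated assumptions: the name `toy_avg_polled()`, its extra `data` argument, and the column `avg_samplesize` are hypothetical and chosen for illustration, so it does not match the required signature.

```r
library(dplyr)

toy <- tibble(
  state = c("Michigan", "Michigan", "Ohio", "Texas", "Texas"),
  samplesize = c(400, 600, 1000, 800, 1000)
)

# Hypothetical helper; the data argument exists only for this toy example.
toy_avg_polled <- function(state_of_interest, data = toy) {
  data %>%
    filter(state %in% state_of_interest) %>%  # works for one or many states
    group_by(state) %>%
    summarise(avg_samplesize = mean(samplesize, na.rm = TRUE)) %>%
    arrange(desc(avg_samplesize))
}

toy_result <- toy_avg_polled(state_of_interest = c("Michigan", "Texas"))
toy_result
```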

Question 6

Create any plot of your choice that involves at least two variables. You may create new variables based on the data available. You may also include additional data; for example, the number of electoral votes available per state would be interesting. In 1-2 sentences, comment on any relationships or trends between variables.

Question 7

Reshape and subset polls to produce the data frame you see below. Variable support is based on the raw polling numbers.

Question 8

For the key battleground state Pennsylvania, did any poll with a start date in November 2016 have Trump with more support than Hillary? Hint: You can filter dates, just put the date in quotes.
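The hint about quoted dates can be illustrated on toy data, assuming the date column has class Date (R converts the quoted string to a Date during the comparison). The state, column names `rawpoll_a` and `rawpoll_b`, and values below are invented.

```r
library(dplyr)

toy <- tibble(
  state = c("Ohio", "Ohio", "Ohio"),
  startdate = as.Date(c("2016-10-30", "2016-11-02", "2016-11-05")),
  rawpoll_a = c(44, 45, 43),
  rawpoll_b = c(45, 44, 46)
)

# Quoted dates in ISO format (YYYY-MM-DD) compare directly against a Date.
november_b_leads <- toy %>%
  filter(
    startdate >= "2016-11-01",
    startdate <= "2016-11-30",
    rawpoll_b > rawpoll_a
  )

november_b_leads
```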

Question 9

What were Hillary’s top 4 raw polling numbers at any time during the polling cycle regardless of state? Create a data frame with the poll’s start date, state, and raw polling numbers. Do the same for Trump and then Johnson.
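A top-n selection can be sketched with `slice_max()` from dplyr 1.0.0 or later. The toy data and the column `rawpoll_a` are invented for this example.

```r
library(dplyr)

toy <- tibble(
  startdate = as.Date(
    c("2016-09-01", "2016-09-08", "2016-09-15", "2016-09-22")
  ),
  state = c("Ohio", "Texas", "Ohio", "Utah"),
  rawpoll_a = c(44, 51, 48, 39)
)

# Rows come back sorted from the largest value down.
top_two <- toy %>%
  select(startdate, state, rawpoll_a) %>%
  slice_max(rawpoll_a, n = 2)

top_two
```

By default `slice_max()` keeps ties, so it can return more than `n` rows when several polls share the cutoff value; set `with_ties = FALSE` if exactly `n` rows are wanted.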

Question 10: Extra Credit

Recreate the plot below. The plot displays the candidate’s edge based on the raw polling numbers of likely voters for each of the 50 states and the District of Columbia. For each state, the edge is computed from the poll with the most recent start date. Grade was not considered.

Plot details:

  • width: 0.75
  • colors: “#3A89CB”, “#D65454”
  • figure height: 8, figure width: 9
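Selecting the most recent poll within each group can be sketched as below on toy data. Several things here are assumptions: the column names `rawpoll_a` and `rawpoll_b`, the choice of a bar chart (inferred only from the "width" detail), the color assignment, and the use of `with_ties = FALSE`, which keeps a single row when several polls share the latest start date.

```r
library(dplyr)
library(ggplot2)

toy <- tibble(
  state = c("Ohio", "Ohio", "Texas", "Texas"),
  startdate = as.Date(
    c("2016-10-01", "2016-11-01", "2016-10-20", "2016-10-05")
  ),
  rawpoll_a = c(44, 46, 40, 41),
  rawpoll_b = c(45, 43, 49, 47)
)

latest <- toy %>%
  group_by(state) %>%
  slice_max(startdate, n = 1, with_ties = FALSE) %>%  # newest poll per state
  ungroup() %>%
  mutate(
    edge = rawpoll_a - rawpoll_b,
    leader = if_else(edge >= 0, "a", "b")
  )

# Assumed chart type: horizontal bars, one per state.
p <- ggplot(latest, aes(x = state, y = edge, fill = leader)) +
  geom_col(width = 0.75) +
  coord_flip() +
  scale_fill_manual(values = c(a = "#3A89CB", b = "#D65454"))
```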

Essential details

Deadline and submission

The deadline to submit Exam 1 is 11:59pm on Tuesday, February 26. Submit your work by uploading only your Rmd file through Google Classroom. Late work will not be accepted except under certain extraordinary circumstances.

Help

  • Post your questions in the #exam1 channel on Slack. These should only be general questions, where you feel the directions are not clear. Do not post any code.

  • Visit Scott or me in office hours or make an appointment. However, we will not guide you to a solution or verify that your code is correct.
    • Shawn’s office hours: Wednesdays 9:00 - 10:30am & Fridays 1:30 - 3:00pm, C409 Wells Hall
    • Scott’s office hours: Thursdays 11:00am - 12:00pm, C511 Wells Hall

Academic integrity

  • This is an individual assignment. This document, its questions, and your answers should only be viewed by you, the instructor, and the teaching assistant. If you fail to abide by these rules, you will earn a 0 and an Academic Dishonesty Report will be filed.

  • You may use any course material or other resources you find helpful online.

  • You must always cite any code you copy or use as inspiration. Copied code without citation is plagiarism and will result in a 0 for the assignment.

Grading

You must use R Markdown. Formatting is at your discretion but is graded. Use the in-class assignments and resources available online for inspiration. Another useful resource for R Markdown formatting is available at: https://holtzy.github.io/Pimp-my-rmd/

Topic                                      Points
Answer 6 of 9 questions                        66
R Markdown formatting                           7
Communication of results                        6
Knit                                            6
Code style                                      6
  - 80 characters per line
  - Format of tidyverse code
  - Comments used appropriately
  - Spaces around operations and commas
Efficiency                                      6
  - Using tidyverse code when possible
  - Avoiding loops
Named code chunks                               3
Total                                         100

A bonus of up to 5 points can be earned for correctly answering question 10. There is no partial credit for this question.

Data dictionary

  • state: state in which the poll was taken; U.S. denotes national polls
  • startdate: poll's start date
  • enddate: poll's end date
  • pollster: pollster conducting the poll
  • grade: grade assigned by FiveThirtyEight to the pollster; A+ is the best, D is the worst
  • samplesize: number of individuals sampled
  • population: type of population being polled
    • A = adults
    • RV = registered voters
    • V = voters
    • LV = likely voters
  • rawpoll_clinton: percentage for Hillary Clinton
  • rawpoll_trump: percentage for Donald Trump
  • rawpoll_johnson: percentage for Gary Johnson
  • rawpoll_mcmullin: percentage for Evan McMullin
  • adjpoll_clinton: FiveThirtyEight adjusted percentage for Hillary Clinton
  • adjpoll_trump: FiveThirtyEight adjusted percentage for Donald Trump
  • adjpoll_johnson: FiveThirtyEight adjusted percentage for Gary Johnson
  • adjpoll_mcmullin: FiveThirtyEight adjusted percentage for Evan McMullin