General instructions for labs:

Today’s lab is a review lab. It is not meant to be comprehensive; instead, the goal is to recap a few of the items we’ve covered this year. For the second exam, anything we’ve done thus far is fair game as far as visualization and interpretation. From the first unit, there will not be questions on the specific aspects of the grammar of graphics.

Preface

There’s a new version of ggplot2! This is wonderful news. You can install the new version of the package using the following code (uncomment it to run):

# install.packages("ggplot2")

The main highlights of the ggplot2 package are found at this link here. For our purposes, I want to focus on two new developments, both of which should be easy to implement.

First, labs() now accepts a subtitle and a caption, which makes for a more coordinated effort at labeling a plot. Additionally, notice the title is now left-aligned (instead of center-aligned).

library(ggplot2)  # loaded here so the preface examples run on their own
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE, method = "loess") +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )

Second, when making barplots, the categories now stack in the same order as the legend.

library(dplyr)  # provides %>%, group_by(), summarise(), and mutate()
avg_price <- diamonds %>% 
  group_by(cut, color) %>% 
  summarise(price = mean(price)) %>% 
  ungroup() %>% 
  mutate(price_rel = price - mean(price))

ggplot(avg_price) + 
  geom_col(aes(x = cut, y = price, fill = color))

Pretty dope!

Part I

Return to the polls data set, which stores information from polls taken prior to the 2016 election.

You can read the .csv directly into R. Please use the following code to load and filter the data set.

library(dplyr); library(ggplot2)
polls.16 <- read.csv("http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv")
polls.16 <- polls.16 %>% select(-createddate, -timestamp) %>%
  filter(type == "polls-plus") %>%
  mutate(pollster.abbrev = substr(pollster, 1, 8), 
         clin.perc = adjpoll_clinton/100)

Q1 Make the following plot, which shows the results from 6 top-rated polls on the same date. Refer to the code above for the subtitle and footnote (caption). Use the clin.perc variable above.

polls.US <- polls.16 %>% 
  filter(state == "U.S.", grade %in% c("A+", "A", "A-"), startdate == "11/3/2016") %>%
  mutate(me = 1/sqrt(samplesize))  # approximate 95% margin of error, as a proportion
limits <- aes(ymin = clin.perc - me, ymax = clin.perc + me)
ggplot(polls.US, aes(x = pollster.abbrev, y = clin.perc)) + 
  geom_bar(stat = "identity") + 
  geom_errorbar(limits, width=0.25) + 
  theme(legend.position = "none") +
  labs(
    x = "", y= "",
    title = "Percent of Clinton Support",
    subtitle = "National polls with an A+/A/A- grade, start date 11/3/16",
    caption = "Data via 538"
  )

Q2 Summarize the plot in 3-4 sentences, including reference to the meaning of the error bars. What does this suggest about the polls, and Clinton’s overall voting share on that date?

Main highlights: The best estimate of Clinton’s share is about 46 percent. The error bars show each poll’s approximate margin of error (1/sqrt(n), where n is the sample size). Because the error bars overlap, there are no meaningful differences between the polls.
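As a rough check on what the error bars mean, the sketch below compares the 1/sqrt(n) shortcut used for me with the usual 95 percent margin of error for a proportion near 0.5. The sample sizes and the two poll results are made up for illustration and are not taken from the polls data, and intervals_overlap is a hypothetical helper, not part of any package.

```r
# Hypothetical sample sizes (not from the polls data)
samplesize <- c(400, 1000, 2500)

# Shortcut used in the lab code: me = 1 / sqrt(n)
me <- 1 / sqrt(samplesize)

# Usual 95% margin of error for a proportion, at its widest (p = 0.5)
me_exact <- 1.96 * sqrt(0.5 * 0.5 / samplesize)

print(data.frame(samplesize, me = round(me, 3), me_exact = round(me_exact, 3)))

# Hypothetical helper: do the error bars of two polls overlap?
intervals_overlap <- function(p1, me1, p2, me2) {
  (p1 - me1) <= (p2 + me2) && (p2 - me2) <= (p1 + me1)
}

# Two made-up polls of 1000 people each, at 45% and 47% support
overlap <- intervals_overlap(0.45, 1 / sqrt(1000), 0.47, 1 / sqrt(1000))
print(overlap)
```

The shortcut comes out slightly wider than the exact value, so it is a mildly conservative choice for the error bars.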

Q3 On a scale of 1 (least accurate) to 10 (most), rate this poll, and justify your opinion. What could be done to improve its accuracy?

Probably about a 6. There’s a nice use of error to account for uncertainty, but there’s more to polling than performing well across the country. State-level performances are more important given the Electoral College.

Part II

A pollster decides to take a look at the adjusted poll support for Donald Trump relative to the raw numbers. Reminder: the adjusted numbers account for the demographics of those polled, relative to who is expected to turn out for the election.

Make the following scatter plot using the polls.16 data. No additional data manipulation is needed.

ggplot(polls.16, aes(rawpoll_trump, adjpoll_trump)) + 
  geom_point() +
  geom_smooth() + 
  facet_wrap(~state) + 
  geom_abline(intercept = 0, slope = 1) +
  labs(x = "Raw support for Donald Trump", y = "Adjusted support",
       title = "Trump support, all polls")

Q5 How do state-level trends compare to the solid line? And what does this suggest about the adjustment method for voting support?

Adjusted percentages tend to be higher than raw percentages, particularly for lower values of the raw percentages within each state. We can see this because the adjusted numbers fall above the solid reference line y = x. This suggests that the adjustment tends to increase lower raw percentages.
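One way to back up the visual impression with numbers is to compute how far each poll sits above or below the y = x line. The sketch below uses a small made-up data frame with two hypothetical states; only the column names are borrowed from polls.16. It uses base R so it runs without extra packages.

```r
# Made-up polls for two hypothetical states (column names match polls.16)
toy <- data.frame(
  state         = c("A", "A", "A", "B", "B", "B"),
  rawpoll_trump = c(30, 35, 40, 45, 50, 55),
  adjpoll_trump = c(33, 37, 41, 45.5, 50, 54.5),
  stringsAsFactors = FALSE
)

# Positive shift = the point sits above the y = x reference line
toy$shift <- toy$adjpoll_trump - toy$rawpoll_trump

# Average shift per state
by_state <- aggregate(shift ~ state, data = toy, FUN = mean)
print(by_state)

# Share of all polls that fall above the reference line
share_above <- mean(toy$shift > 0)
print(share_above)
```

With the real data, the same two steps could be applied to polls.16 directly; a positive average shift for a state corresponds to its points sitting above the solid line.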

Q6 Comment on the state-to-state differences in the links between adjusted and raw voting percentages (Note: this is different than comparing each state’s support for Trump, which is a much easier question). For example, consider how strong each state’s line-of-best-fit is. Notice anything?

There are some noticeable state-level differences. In some states (Massachusetts), the cloud of points suggests that varying adjustments were required, while in others (Kansas), the points suggest that the adjustment was fairly consistent.
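If you want a number to go with "how strong is each state’s line of best fit," one option is the correlation between raw and adjusted support within each state. This is only a sketch on made-up data: the two states ("Tight" and "Scattered") and all values are hypothetical, and correlation is one possible summary, not the only one.

```r
# Made-up polls: one state with a constant adjustment, one with a noisy one
toy <- data.frame(
  state         = rep(c("Tight", "Scattered"), each = 4),
  rawpoll_trump = c(40, 42, 44, 46, 30, 32, 34, 36),
  adjpoll_trump = c(42, 44, 46, 48, 35, 31, 38, 33),
  stringsAsFactors = FALSE
)

# Correlation between raw and adjusted support, computed within each state
r_by_state <- sapply(
  split(toy, toy$state),
  function(d) cor(d$rawpoll_trump, d$adjpoll_trump)
)
print(round(r_by_state, 2))
```

A correlation near 1 matches a tight band of points (a consistent adjustment), while a value near 0 matches a diffuse cloud.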

Part III

A pollster is interested in the support for Gary Johnson. Use the following code to get you started.

library(ggthemes); library(maps)
states <- map_data("state")
dim(states)
## [1] 15537     6
polls.16.ave <- polls.16 %>%
  mutate(region = tolower(state)) %>%
  group_by(region) %>% 
  summarise(johnson.vote.ave = mean(adjpoll_johnson, na.rm = TRUE), n.poll = n())
  
states.johnson <- inner_join(polls.16.ave, states, by = "region")

Now, make the following plot:

us.base <- ggplot(data = states.johnson, 
                  mapping = aes(x = long, y = lat, 
                                group = group, fill = johnson.vote.ave)) + 
  coord_fixed(1.3) + 
  geom_polygon(color = "black")

us.base + theme_map() + 
  scale_fill_continuous(name="Percent", 
            low = "lightgreen", high = "darkgreen",  
            breaks=c(0, 5, 10, 15), na.value = "grey50") + 
  labs(title = "Percent of Johnson support, all polls and times")

Q7 Comment on region-level differences in the support for Johnson in 1-2 sentences that the lay reader could understand.

Johnson’s support was highest in New Mexico and surrounding states, as high as 15 percent. His percentage was lower than 5 percent in much of the Northeast.

Q8 Which state was polled most often and least often? Does that affect your interpretation of the map?

Florida was polled most often. The number of polls matters here: if a state was polled only a few times, we should be less confident in the accuracy of the map for that state.
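The n.poll column created earlier is what answers this question. The sketch below shows one way to pull out the most and least polled rows, using a made-up summary table that has the same column names as polls.16.ave; the regions and counts are hypothetical.

```r
# Made-up summary table with the same columns as polls.16.ave
toy.ave <- data.frame(
  region           = c("state a", "state b", "state c", "state d"),
  johnson.vote.ave = c(6.1, 9.8, 4.2, 7.5),
  n.poll           = c(120, 15, 310, 48),
  stringsAsFactors = FALSE
)

# Sort from most polled to least polled
ranked <- toy.ave[order(toy.ave$n.poll, decreasing = TRUE), ]
print(ranked)

most  <- ranked$region[1]             # most polled region
least <- ranked$region[nrow(ranked)]  # least polled region
```

Applying the same ordering to polls.16.ave gives the real answer for both ends of the ranking.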