Summary

I decided to build a plot from the “brexit_polls” dataset in the dslabs package. I first looked at the data and settled on the question I wanted the plot to answer: how did British opinion about Brexit change between January and June 2016, at least according to the public opinion polls? I then prepared the data so the plot could accurately show the variables I thought mattered most for that question. Once the data was ready, I built a scatterplot that uses color to distinguish the responses and a loess curve to show the trend over time. I also labeled the axes, changed the decimals on the y axis to percentages, and added a title. I tried to add a plotly hover-over for more detail, but ggplotly would not cooperate with the loess curve, so the hover-over appears in a second graph without the curve. Details on my step-by-step process are below.

Process

Load dslabs (and tidyverse) to explore the datasets and decide which one to work with.

library(dslabs)
library(tidyverse)
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))
##  [1] "make-admissions.R"                   
##  [2] "make-brca.R"                         
##  [3] "make-brexit_polls.R"                 
##  [4] "make-death_prob.R"                   
##  [5] "make-divorce_margarine.R"            
##  [6] "make-gapminder-rdas.R"               
##  [7] "make-greenhouse_gases.R"             
##  [8] "make-historic_co2.R"                 
##  [9] "make-mnist_27.R"                     
## [10] "make-movielens.R"                    
## [11] "make-murders-rda.R"                  
## [12] "make-na_example-rda.R"               
## [13] "make-nyc_regents_scores.R"           
## [14] "make-olive.R"                        
## [15] "make-outlier_example.R"              
## [16] "make-polls_2008.R"                   
## [17] "make-polls_us_election_2016.R"       
## [18] "make-reported_heights-rda.R"         
## [19] "make-research_funding_rates.R"       
## [20] "make-stars.R"                        
## [21] "make-temp_carbon.R"                  
## [22] "make-tissue-gene-expression.R"       
## [23] "make-trump_tweets.R"                 
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"

Brexit polls look intriguing. Let’s look at them!

Brexit Polls

Load the Brexit polls data.

data("brexit_polls")
str(brexit_polls)
## 'data.frame':    127 obs. of  9 variables:
##  $ startdate : Date, format: "2016-06-23" "2016-06-22" ...
##  $ enddate   : Date, format: "2016-06-23" "2016-06-22" ...
##  $ pollster  : Factor w/ 16 levels "BMG Research",..: 15 10 15 5 6 2 2 14 13 15 ...
##  $ poll_type : Factor w/ 2 levels "Online","Telephone": 1 1 1 2 1 2 2 1 2 1 ...
##  $ samplesize: num  4772 4700 3766 1592 3011 ...
##  $ remain    : num  0.52 0.55 0.51 0.49 0.44 0.54 0.48 0.41 0.45 0.42 ...
##  $ leave     : num  0.48 0.45 0.49 0.46 0.45 0.46 0.42 0.43 0.44 0.44 ...
##  $ undecided : num  0 0 0 0.01 0.09 0 0.11 0.16 0.11 0.13 ...
##  $ spread    : num  0.04 0.1 0.02 0.03 -0.01 ...

Decide on a whim that Brexit polls is actually what I’m interested in looking at, and proceed with saving the data and loading the libraries necessary for a visualization.

# make sure I'm in the right working directory
setwd("~/Desktop/DATA 110")

# save the dataset to a folder using write_csv
write_csv(brexit_polls, "brexit_polls.csv", na="")

# load libraries for visualization
library(ggthemes)
library(RColorBrewer)

Take a closer look at the data using the head() command:

head(brexit_polls, 10) # Change the number of rows from the default six to ten, in order to see a little more about the dates
##     startdate    enddate           pollster poll_type samplesize remain leave
## 1  2016-06-23 2016-06-23             YouGov    Online       4772   0.52  0.48
## 2  2016-06-22 2016-06-22            Populus    Online       4700   0.55  0.45
## 3  2016-06-20 2016-06-22             YouGov    Online       3766   0.51  0.49
## 4  2016-06-20 2016-06-22         Ipsos MORI Telephone       1592   0.49  0.46
## 5  2016-06-20 2016-06-22            Opinium    Online       3011   0.44  0.45
## 6  2016-06-17 2016-06-22             ComRes Telephone       1032   0.54  0.46
## 7  2016-06-17 2016-06-22             ComRes Telephone       1032   0.48  0.42
## 8  2016-06-16 2016-06-22                TNS    Online       2320   0.41  0.43
## 9  2016-06-20 2016-06-20 Survation/IG Group Telephone       1003   0.45  0.44
## 10 2016-06-18 2016-06-19             YouGov    Online       1652   0.42  0.44
##    undecided spread
## 1       0.00   0.04
## 2       0.00   0.10
## 3       0.00   0.02
## 4       0.01   0.03
## 5       0.09  -0.01
## 6       0.00   0.08
## 7       0.11   0.06
## 8       0.16  -0.02
## 9       0.11   0.01
## 10      0.13  -0.02

It looks like I could approach visualizing this data from several different angles, including the start date, the end date, or the pollster. I’m leaning toward end date, because I always like to see change over time, but before I jump in, I want to know how many different dates I’m really dealing with (is it a meaningful enough date spread…?).

max(brexit_polls$enddate)    # pick the most recent enddate
## [1] "2016-06-23"
min(brexit_polls$enddate)    # pick the earliest enddate
## [1] "2016-01-10"
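To put a number on that spread, the two dates printed above can be subtracted directly. This is a minimal base R sketch; the variable names `first_end`, `last_end`, and `span_days` are my own, and the dates are copied from the output above rather than read from the dataset.

```r
# Earliest and latest poll end dates, copied from the output above
first_end <- as.Date("2016-01-10")
last_end  <- as.Date("2016-06-23")

# Length of the polling window in days
span_days <- as.numeric(difftime(last_end, first_end, units = "days"))
span_days
# 165

# Rough length in months, assuming an average month of about 30.44 days
round(span_days / 30.44, 1)
```

On the full dataset, `length(unique(brexit_polls$enddate))` would count how many distinct end dates fall inside that window.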

It appears we’re looking at a spread of nearly six months (January 10 to June 23), which should be good enough. I’ll create a visualization showing change over time using the poll end date.

Select the enddate, the opinion (remain, leave, or undecided), and the pollster and samplesize (because I might like to include that info in “hover over” data).

I figured out from reading your class notes that if I want to see the three responses as differently colored points on this scatterplot, I need to gather the remain, leave, and undecided columns into two new columns, which I’ve decided to call “opinion” (the response name) and “percentage” (its value). Then I have to mutate opinion into a factor with three levels (remain, leave, and undecided). After that, I should be able to ask ggplot to color the points by opinion.
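The reshaping step can be sketched on a tiny data frame before applying it to the full dataset. The example below is only an illustration: `toy` and `toy_long` are hypothetical names, the two rows are copied from the head() output above, and it uses `pivot_longer()` rather than the `gather()` call in my actual code (tidyr documents `gather()` as superseded by `pivot_longer()`, though `gather()` still works).

```r
library(dplyr)
library(tidyr)

# Two polls copied from the head() output above (toy data, not the full dataset)
toy <- tibble(
  enddate   = as.Date(c("2016-06-23", "2016-06-22")),
  remain    = c(0.52, 0.55),
  leave     = c(0.48, 0.45),
  undecided = c(0, 0)
)

# Stack the three response columns into one "opinion" column and one
# "percentage" column, then fix the order of the factor levels
toy_long <- toy %>%
  pivot_longer(c(remain, leave, undecided),
               names_to = "opinion", values_to = "percentage") %>%
  mutate(opinion = factor(opinion, levels = c("remain", "leave", "undecided")))

toy_long
```

Each poll now occupies three rows, one per response, which is the shape ggplot needs in order to map `opinion` to color.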

I want to see the individual points (geom_point), but I also want to see the overall trend. I do this with the “geom_smooth” function.

p <- brexit_polls %>%
        select(enddate, remain, leave, undecided, pollster, samplesize) %>% 
        gather(opinion, percentage, -pollster, -samplesize, -enddate) %>%
        mutate(opinion = factor(opinion, levels = c("remain", "leave", "undecided")))%>%
        ggplot(aes(x = enddate, y = percentage, color = opinion)) +
        geom_point(alpha = 0.4) +
        geom_smooth(method = "loess", span = 0.15) +
        theme_economist() + 
        theme(legend.title = element_blank())+
        labs(x = "poll end date", y = "percentage of voters", title = "Brexit Poll Results:  January to June 2016") +
        scale_y_continuous(labels = scales::percent)

p
## `geom_smooth()` using formula 'y ~ x'

Wow, this is pretty amazing. This graph says a lot, and what it really reveals is why so many Brits woke up surprised when the country voted “Leave” on Brexit. Judging by the plot, the race was very close, and at a couple of points Leave even appeared to be ahead. By the end, though, the undecideds seemed to have made up their minds (their share kept dropping) and, according to the polls, had mostly chosen to remain in the EU.

Trying with Plotly

I’ve added ggplotly so that we can hover over each point and see not only the date and percentage but also the sample size. ggplotly wasn’t playing nice with my loess smoother, so I’ve removed the smoother from this version. We still have the hover-over, though.

library(plotly)

p <- brexit_polls %>%
        select(enddate, remain, leave, undecided, pollster, samplesize) %>% 
        gather(opinion, percentage, -pollster, -samplesize, -enddate) %>%
        mutate(opinion = factor(opinion, levels = c("remain", "leave", "undecided")))%>%
        ggplot(aes(x = enddate, y = percentage, color = opinion, text = paste("sample size:", samplesize))) +   # paste() already inserts a space between its arguments
        geom_point(alpha = 0.4) +
        theme_economist() + 
        theme(legend.title = element_blank())+
        labs(x = "poll end date", y = "percentage of voters", title = "Brexit Poll Results:  January to June 2016") +
        scale_y_continuous(labels = scales::percent)

p <- ggplotly(p)
p
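One possible way to keep both the smoother and the hover text is sketched below. It rests on an assumption I have not verified against my original session: that the trouble came from putting `text` in the plot-wide `aes()`, which can make ggplot treat every distinct label as its own group and leave loess too few points per group. Moving `text` into `geom_point()` limits it to the points. The names `polls_long`, `p_smooth`, and `fig` are hypothetical, and ggplot2 is expected to warn that `text` is an unknown aesthetic, which is harmless here because only plotly uses it.

```r
library(dslabs)
library(tidyverse)
library(plotly)

# Same reshaping as above, written with pivot_longer() instead of gather()
polls_long <- brexit_polls %>%
  select(enddate, remain, leave, undecided, pollster, samplesize) %>%
  pivot_longer(c(remain, leave, undecided),
               names_to = "opinion", values_to = "percentage") %>%
  mutate(opinion = factor(opinion, levels = c("remain", "leave", "undecided")))

# The text aesthetic is set only on the points, so the smoother is still
# grouped by opinion alone
p_smooth <- ggplot(polls_long, aes(x = enddate, y = percentage, color = opinion)) +
  geom_point(aes(text = paste0("pollster: ", pollster,
                               "<br>sample size: ", samplesize)),
             alpha = 0.4) +
  geom_smooth(method = "loess", span = 0.15, formula = y ~ x) +
  scale_y_continuous(labels = scales::percent)

# Show the date, the percentage, and the custom text in the hover box
fig <- ggplotly(p_smooth, tooltip = c("x", "y", "text"))
fig
```

If the smoother still misbehaves, adding `aes(group = opinion)` to `geom_smooth()` is another option worth trying.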