I decided to create a plot based on the “brexit_polls” dataset from the dslabs package. To create the plot, I first looked at the data and decided what question I wanted the plot to answer: how did British opinion about Brexit change in the six months between January and June 2016 (at least according to the public opinion polls)? I then prepared the data so the plot could accurately show the variables I thought were most important for answering my question. After the data was prepared, I constructed a graph with color to differentiate the responses, and with a loess curve to show the trend over time. I also labeled the axes, changed the decimals on the y axis to percentages, and added a title. I tried to add a plotly hover-over for more data, but ggplotly would not cooperate with the loess curve, so I included the hover-over in a second graph, without the curve. Details on my step-by-step process are below.
Load dslabs (and tidyverse) to explore the datasets and decide which one to work with.
library(dslabs)
library(tidyverse)
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))
## [1] "make-admissions.R"
## [2] "make-brca.R"
## [3] "make-brexit_polls.R"
## [4] "make-death_prob.R"
## [5] "make-divorce_margarine.R"
## [6] "make-gapminder-rdas.R"
## [7] "make-greenhouse_gases.R"
## [8] "make-historic_co2.R"
## [9] "make-mnist_27.R"
## [10] "make-movielens.R"
## [11] "make-murders-rda.R"
## [12] "make-na_example-rda.R"
## [13] "make-nyc_regents_scores.R"
## [14] "make-olive.R"
## [15] "make-outlier_example.R"
## [16] "make-polls_2008.R"
## [17] "make-polls_us_election_2016.R"
## [18] "make-reported_heights-rda.R"
## [19] "make-research_funding_rates.R"
## [20] "make-stars.R"
## [21] "make-temp_carbon.R"
## [22] "make-tissue-gene-expression.R"
## [23] "make-trump_tweets.R"
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"
Brexit polls look intriguing. Let’s look at them!
Load the Brexit polls data.
data("brexit_polls")
str(brexit_polls)
## 'data.frame': 127 obs. of 9 variables:
## $ startdate : Date, format: "2016-06-23" "2016-06-22" ...
## $ enddate : Date, format: "2016-06-23" "2016-06-22" ...
## $ pollster : Factor w/ 16 levels "BMG Research",..: 15 10 15 5 6 2 2 14 13 15 ...
## $ poll_type : Factor w/ 2 levels "Online","Telephone": 1 1 1 2 1 2 2 1 2 1 ...
## $ samplesize: num 4772 4700 3766 1592 3011 ...
## $ remain : num 0.52 0.55 0.51 0.49 0.44 0.54 0.48 0.41 0.45 0.42 ...
## $ leave : num 0.48 0.45 0.49 0.46 0.45 0.46 0.42 0.43 0.44 0.44 ...
## $ undecided : num 0 0 0 0.01 0.09 0 0.11 0.16 0.11 0.13 ...
## $ spread : num 0.04 0.1 0.02 0.03 -0.01 ...
Decide on a whim that Brexit polls is actually what I’m interested in looking at, and proceed with saving the data and loading the libraries necessary for a visualization.
# make sure I'm in the right working directory
setwd("~/Desktop/DATA 110")
# save the dataset to a folder using write_csv
write_csv(brexit_polls, "brexit_polls.csv", na="")
# load libraries for visualization
library(ggthemes)
library(RColorBrewer)
Take a closer look at the data using the head() command:
head(brexit_polls, 10) # Change the number of rows from the default six to ten, in order to see a little more about the dates
## startdate enddate pollster poll_type samplesize remain leave
## 1 2016-06-23 2016-06-23 YouGov Online 4772 0.52 0.48
## 2 2016-06-22 2016-06-22 Populus Online 4700 0.55 0.45
## 3 2016-06-20 2016-06-22 YouGov Online 3766 0.51 0.49
## 4 2016-06-20 2016-06-22 Ipsos MORI Telephone 1592 0.49 0.46
## 5 2016-06-20 2016-06-22 Opinium Online 3011 0.44 0.45
## 6 2016-06-17 2016-06-22 ComRes Telephone 1032 0.54 0.46
## 7 2016-06-17 2016-06-22 ComRes Telephone 1032 0.48 0.42
## 8 2016-06-16 2016-06-22 TNS Online 2320 0.41 0.43
## 9 2016-06-20 2016-06-20 Survation/IG Group Telephone 1003 0.45 0.44
## 10 2016-06-18 2016-06-19 YouGov Online 1652 0.42 0.44
## undecided spread
## 1 0.00 0.04
## 2 0.00 0.10
## 3 0.00 0.02
## 4 0.01 0.03
## 5 0.09 -0.01
## 6 0.00 0.08
## 7 0.11 0.06
## 8 0.16 -0.02
## 9 0.11 0.01
## 10 0.13 -0.02
It looks like I could approach visualizing this data from several different angles, including the start date, the end date, or the pollster. I’m leaning toward end date, because I always like to see change over time, but before I jump in, I want to know how many different dates I’m really dealing with (is it a meaningful enough date spread…?).
max(brexit_polls$enddate) # pick the most recent enddate
## [1] "2016-06-23"
min(brexit_polls$enddate) # pick the earliest enddate
## [1] "2016-01-10"
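The minimum and maximum show the span, but not how many different dates fall inside it. A minimal sketch of that check, using only base R functions on the same `enddate` column (the names `n_dates` and `span_days` are mine, chosen for illustration):

```r
library(dslabs)
data("brexit_polls")

# number of distinct poll end dates in the dataset
n_dates <- length(unique(brexit_polls$enddate))

# length of the covered period in days, from the earliest to the latest end date
span_days <- as.numeric(diff(range(brexit_polls$enddate)), units = "days")

n_dates
span_days
```

If `n_dates` turned out to be very small compared with the number of polls, a plot over time would mostly stack points on a handful of dates.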
It appears we’re looking at a six-month spread, which should be good enough. I’ll create a visualization showing change over time using the poll end date.
Select the enddate, the opinion (remain, leave, or undecided), and the pollster and samplesize (because I might like to include that info in “hover over” data).
I figured out from reading your class notes that if I want to see the three responses as different-colored dots on this scatterplot, I need to gather the remain, leave, and undecided columns into two new columns, which I’ve decided to call “opinion” and “percentage”. Then I have to mutate opinion into a factor with three levels (remain, leave, and undecided). Then I should be able to ask ggplot to color the points by opinion.
I want to see the individual points (geom_point), but I also want to see the overall trend. I do this with the “geom_smooth” function.
# reshape to long format (one row per poll and opinion), then plot
p <- brexit_polls %>%
  select(enddate, remain, leave, undecided, pollster, samplesize) %>%
  gather(opinion, percentage, -pollster, -samplesize, -enddate) %>%
  mutate(opinion = factor(opinion, levels = c("remain", "leave", "undecided"))) %>%
  ggplot(aes(x = enddate, y = percentage, color = opinion)) +
  geom_point(alpha = 0.4) +                      # individual poll results
  geom_smooth(method = "loess", span = 0.15) +   # trend line for each opinion
  theme_economist() +
  theme(legend.title = element_blank()) +
  labs(x = "poll end date", y = "percentage of voters", title = "Brexit Poll Results: January to June 2016") +
  scale_y_continuous(labels = scales::percent)   # show 0.5 as 50%
p
## `geom_smooth()` using formula 'y ~ x'
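As a side note on the reshaping step: the tidyr documentation marks `gather()` as superseded by `pivot_longer()`. The following is a sketch of what I believe is the equivalent reshape, not something I used for the plot above; the name `polls_long` is my own. The rows come out in a different order than with `gather()`, which should not matter for plotting.

```r
library(dslabs)
library(dplyr)
library(tidyr)
data("brexit_polls")

polls_long <- brexit_polls %>%
  select(enddate, remain, leave, undecided, pollster, samplesize) %>%
  # stack the three response columns into an opinion/percentage pair
  pivot_longer(
    cols = c(remain, leave, undecided),
    names_to = "opinion",
    values_to = "percentage"
  ) %>%
  # fix the level order so colors and legend entries stay consistent
  mutate(opinion = factor(opinion, levels = c("remain", "leave", "undecided")))

str(polls_long)
```

Each poll contributes three rows to `polls_long`, one per opinion, which is the shape ggplot needs in order to color by opinion.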
Wow, this is pretty amazing. This is a good example of a graph that says a lot, and what it really reveals is why the Brits woke up so surprised when the country voted “Leave” on Brexit. Judging by the plot, it was a very close race, and at a couple of points it looked like the country was going to leave. In the final polls, though, the undecideds seemed to have made up their minds (their share kept dropping) and to have chosen mainly to remain in the EU.
I’ve added ggplotly so that we can hover over each point and see not only the date and percentage, but also the sample size. I discovered ggplotly wasn’t playing nice with my loess smoother, so I’ve removed the smoother from this version. We still have the hover-over, though.
library(plotly)
# same data preparation as before; the text aesthetic supplies the hover-over label
p <- brexit_polls %>%
  select(enddate, remain, leave, undecided, pollster, samplesize) %>%
  gather(opinion, percentage, -pollster, -samplesize, -enddate) %>%
  mutate(opinion = factor(opinion, levels = c("remain", "leave", "undecided"))) %>%
  ggplot(aes(x = enddate, y = percentage, color = opinion, text = paste("sample size:", samplesize))) +
  geom_point(alpha = 0.4) +
  theme_economist() +
  theme(legend.title = element_blank()) +
  labs(x = "poll end date", y = "percentage of voters", title = "Brexit Poll Results: January to June 2016") +
  scale_y_continuous(labels = scales::percent)
# convert the static ggplot into an interactive plotly widget
p <- ggplotly(p)
p
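I can't be sure this is what went wrong in my case, but one common reason for a smoother to break once a hover-over is added is that a `text` aesthetic in the main `aes()` is inherited by every layer. ggplot groups data by all discrete variables, so `geom_smooth()` may end up trying to fit a curve to each distinct label separately. The sketch below works under that assumption and maps `text` only inside `geom_point()`. The names `polls_long`, `p_both`, and `w` are mine, and I left out `theme_economist()` to keep the example short.

```r
library(dslabs)
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)
data("brexit_polls")

polls_long <- brexit_polls %>%
  select(enddate, remain, leave, undecided, samplesize) %>%
  pivot_longer(c(remain, leave, undecided),
               names_to = "opinion", values_to = "percentage") %>%
  mutate(opinion = factor(opinion, levels = c("remain", "leave", "undecided")))

p_both <- ggplot(polls_long, aes(x = enddate, y = percentage, color = opinion)) +
  # text is mapped in this layer only, so the smoother below does not inherit it;
  # ggplot2 warns about an unknown aesthetic here, but ggplotly still uses it
  geom_point(aes(text = paste("sample size:", samplesize)), alpha = 0.4) +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.15) +
  theme(legend.title = element_blank()) +
  labs(x = "poll end date", y = "percentage of voters",
       title = "Brexit Poll Results: January to June 2016") +
  scale_y_continuous(labels = scales::percent)

# convert to an interactive widget with points, trend curves, and hover-over
w <- ggplotly(p_both)
w
```

Another option along the same lines would be to keep `text` in the main `aes()` and add `group = opinion`, so the grouping is stated explicitly instead of being inferred.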