(Worth up to 20 points) Take any of the datasets included in “dslabs” and create a new multivariable graph. You may use any of the datasests including those for the four examples above, as long as you change something meaningful about the graph. You must include the following in your graph:
• Labels for x- and y-axes and a title • A theme for the graph (you must change the generic ggplot style) • Colors for a third variable, with a legend.
You can create a scatterplot or heatmap for continuous variables. You may not create a bar graph for categorical variables for this assignment. Feel free to incorporate items from Week 9 as well. In your RMD file / Rpubs document, be sure you describe in a paragraph what dataset you have used and document how you have created your graph. If you choose to use one of the datasets from the 4 above examples, be sure I can clearly understand what you have done differently.
Before deciding on a dataset, I looked at a few datasets and experimented with a few plots. The first one below is from the Gapminder dataset. I found the colorful “bubbles” appealing, especially when shown juxtaposed with parallel plots (facet wrap).
I thought it’d be interesting to look at life expectancy trends, in particular whether GDP per capita has improved life expectancy. Since the dataset starts from 1960 and goes up until 2018, there is enough data to examine any trends.
I used the code from the tutorial as a template for my plot and made some changes: expanded year range, included 3 plots for facet wrap instead of 2, adjusted x and y limits to fit the points and year labels into the plots, and other minor edits.
The end result shows what I expected - an increase in life expectancy with increasing GDP per capita over the years. The facet wrap display is good for showing this general trend. But I found it limiting in terms of providing more insight. There are many bubbles crowded together, so it’s hard to differentiate them, even though they are color-coded. Even with modifications to the axes, I could not relieve the crowding and still show 3 plots together.
My conclusion is that this plot was good for showing the general trend but would need a lot more editing to make it “data-useful.”
This dataset includes health and income outcomes for 184 countries from 1960 to 2016. It also includes two character vectors, OECD and OPEC, with the names of OECD and OPEC countries from 2016.
# install.packages("dslabs") # these are data science labs
library("dslabs")
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))
## [1] "make-admissions.R"
## [2] "make-brca.R"
## [3] "make-brexit_polls.R"
## [4] "make-death_prob.R"
## [5] "make-divorce_margarine.R"
## [6] "make-gapminder-rdas.R"
## [7] "make-greenhouse_gases.R"
## [8] "make-historic_co2.R"
## [9] "make-mnist_27.R"
## [10] "make-movielens.R"
## [11] "make-murders-rda.R"
## [12] "make-na_example-rda.R"
## [13] "make-nyc_regents_scores.R"
## [14] "make-olive.R"
## [15] "make-outlier_example.R"
## [16] "make-polls_2008.R"
## [17] "make-polls_us_election_2016.R"
## [18] "make-reported_heights-rda.R"
## [19] "make-research_funding_rates.R"
## [20] "make-stars.R"
## [21] "make-temp_carbon.R"
## [22] "make-tissue-gene-expression.R"
## [23] "make-trump_tweets.R"
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggthemes)
library(ggrepel)
library(RColorBrewer)
data("gapminder")
str(gapminder)
## 'data.frame': 10545 obs. of 9 variables:
## $ country : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
## $ infant_mortality: num 115.4 148.2 208 NA 59.9 ...
## $ life_expectancy : num 62.9 47.5 36 63 65.4 ...
## $ fertility : num 6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
## $ population : num 1636054 11124892 5270844 54681 20619075 ...
## $ gdp : num NA 1.38e+10 NA NA 1.08e+11 ...
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
## $ region : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
west <- c("Western Europe","Northern Europe","Southern Europe",
"Northern America","Australia and New Zealand")
plot1 <- gapminder %>%
mutate(new_region = case_when(
region %in% west ~ "The West",
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
TRUE ~ "Others"))
plot1 <- plot1 %>%
mutate(new_region = factor(new_region, levels = rev(c("Others", "Latin America", "East Asia","Sub-Saharan Africa", "The West"))))
plot1 %>%
filter(year%in%c(1960,1980, 2000) & !is.na(new_region) &
!is.na(life_expectancy) & !is.na(population) & !is.na(gdp)) %>%
mutate(GDPpercapita = gdp/population/100) %>%
ggplot( aes(GDPpercapita, y=life_expectancy, col = new_region, size = GDPpercapita)) +
geom_point(alpha = 0.8) +
guides(size=FALSE) +
theme(plot.title = element_text("Life Expectancy vs GDP Per Capita Around the World"), legend.title = element_blank()) +
coord_cartesian(ylim = c(40, 90)) +
coord_cartesian(xlim = c(0, 600)) +
xlab("GDP Per Capita (per 100 USD)") +
ylab("Life Expectancy (years)") +
geom_text(aes(x=100, y=85, label=year), cex=8, color="grey") +
facet_grid(. ~ year) +
theme(strip.background = element_blank(),
strip.text.x = element_blank(),
strip.text.y = element_blank(),
legend.position = "bottom")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
# F/U - Why isn't plot title showing up?
I thought the crowding could be alleviated by reducing the number of regions. So I re-created the plots for just two regions (West and Asia). This did not solve the problem, as the issue is that the West (red bubbles) by virtue of its largest GDP, overwhelms all the other bubbles.
data("gapminder")
west <- c("Western Europe","Northern Europe","Southern Europe",
"Northern America","Australia and New Zealand")
plot2 <- gapminder %>%
mutate(new_region2 = case_when(
region %in% west ~ "The West",
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia"
))
plot2 <- plot2 %>%
mutate(new_region2 = factor(new_region2, levels = rev(c("East Asia","The West"))))
plot2 %>%
filter(year%in%c(1960,1980, 2000) & !is.na(new_region2) &
!is.na(life_expectancy) & !is.na(population) & !is.na(gdp)) %>%
mutate(GDPpercapita = gdp/population/90) %>%
ggplot( aes(GDPpercapita, y=life_expectancy, col = new_region2, size = GDPpercapita)) +
geom_point(alpha = 0.8) +
guides(size=FALSE) +
theme(plot.title = element_blank(), legend.title = element_blank()) +
coord_cartesian(ylim = c(40, 90)) +
coord_cartesian(xlim = c(0, 600)) +
xlab("GDP Per Capita") +
ylab("Life Expectancy") +
geom_text(aes(x=100, y=85, label=year), cex=6, color="grey") +
facet_grid(. ~ year) +
theme(strip.background = element_blank(),
strip.text.x = element_blank(),
strip.text.y = element_blank(),
legend.position = "top")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
So I moved on to a different dataset to experiment some more. I looked at Stars, Olives, and Gender Inequality in Research Funding, but the data was boring to me. Too bad for Stars. I would’ve loved to do something related to astronomy, but seeing the small numbers turned me off (too many zeros!). Similarly, I wasn’t particularly interested in any of the diseases in the contagious diseases dataset.
My choice landed on Brexit poll data from 2016.
I liked the aesthetics for the FiveThirtyEight Trump vs Clinton chart in the tutorial and wanted to learn how to create that kind of graph. The Brexit data was a natural fit.
I created 3 charts for the Brexit data showing the same data but in different ways. I thought it would be interesting to compare the effectiveness of each type. Version A uses ggplot to display a scatterplot with loess smoother curves and the standard error (gray shaded area). This looks like the FiveThirtyEight plot.
Version B uses highcharter to combine two data types - a column chart for sample sizes, and 3 lines representing the voting preferences. This is supposed to look like the Food Stamps chart. But it is quite frightful.
Version C is a much improved (de-cluttered) version of B.
ggplot Theme: theme_light Line & scatter plot colors: specified by hex values
# VERSION A
data("brexit_polls")
library(ggplot2)
str(brexit_polls)
## 'data.frame': 127 obs. of 9 variables:
## $ startdate : Date, format: "2016-06-23" "2016-06-22" ...
## $ enddate : Date, format: "2016-06-23" "2016-06-22" ...
## $ pollster : Factor w/ 16 levels "BMG Research",..: 15 10 15 5 6 2 2 14 13 15 ...
## $ poll_type : Factor w/ 2 levels "Online","Telephone": 1 1 1 2 1 2 2 1 2 1 ...
## $ samplesize: num 4772 4700 3766 1592 3011 ...
## $ remain : num 0.52 0.55 0.51 0.49 0.44 0.54 0.48 0.41 0.45 0.42 ...
## $ leave : num 0.48 0.45 0.49 0.46 0.45 0.46 0.42 0.43 0.44 0.44 ...
## $ undecided : num 0 0 0 0.01 0.09 0 0.11 0.16 0.11 0.13 ...
## $ spread : num 0.04 0.1 0.02 0.03 -0.01 ...
scatter_brexit <- brexit_polls %>%
select(enddate, poll_type, remain, leave, undecided) %>%
gather(vote, percentage, -enddate, -poll_type) %>%
mutate(vote = factor(vote, levels = c("remain","leave","undecided")))%>%
group_by(poll_type) %>%
ungroup() %>%
ggplot(aes(enddate, percentage, color = vote)) +
geom_point(show.legend = FALSE, alpha=0.4) + # darked dots
geom_smooth(method = "loess", span = 0.2, alpha=0.2) + # lighten SE shading
scale_y_continuous(limits = c(0,1))+
ggtitle("Brexit Weekly Poll Results - Remain or Leave? (Jan-June 2016)")+
xlab("Month Poll Taken (2016)")+
ylab("Percentage of People In Favor")+
theme(legend.title=element_text("In Favor Of"))+ # does not work
# SET LINE COLORS
scale_color_manual(values=c("#E34424", "#36c516","#2D67BB"))+
# ADD THEME
theme_light()
scatter_brexit
library(highcharter)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
##
## Attaching package: 'highcharter'
## The following object is masked from 'package:dslabs':
##
## stars
cols <- c("red","black")
food_stamps<- read_csv("food_stamps.csv")
## Parsed with column specification:
## cols(
## year = col_double(),
## participants = col_double(),
## costs = col_double()
## )
highchart() %>%
hc_yAxis_multiples(
list(title = list(text = "Participants (millions)")),
list(title = list(text = "Costs ($ billions)"),
opposite = TRUE)
) %>%
hc_add_series(data = food_stamps$participants,
name = "Participants (millions)",
type = "column",
yAxis = 0) %>%
hc_add_series(data = food_stamps$costs,
name = "Costs ($ billions)",
type = "line",
yAxis = 1) %>%
hc_xAxis(categories = food_stamps$year,
tickInterval = 5) %>%
hc_colors(cols) %>%
hc_chart(style = list(fontFamily = "Georgia",
fontWeight = "bold"))
highcharter Theme: “Economist” Color palette: Brewer “Set1”
# VERSION B
data("brexit_polls")
library(highcharter)
# Set the color palette
cols <- brewer.pal(7, "Set1")
messy_brexit <- brexit_polls
highchart() %>%
hc_yAxis_multiples(
list(title = list(text = "Sample Size (number of UK voters")),
list(title = list(text = "Votes"),
opposite = TRUE)
) %>%
hc_add_series(data =messy_brexit$samplesize,
name = "Sample Size (number of UK voters)",
type = "column",
yAxis = 0) %>%
hc_add_series(data = messy_brexit$remain,
name = "Percentage Voting Remain",
type = "line",
yAxis = 1) %>%
hc_add_series(data = messy_brexit$leave,
name = "Percentage Voting Leave",
type = "line",
yAxis = 1) %>%
hc_add_series(data = messy_brexit$undecided,
name = "Percentage Voting Undecided",
type = "line",
yAxis = 1) %>%
hc_xAxis(categories = messy_brexit$month,
tickInterval = 6) %>%
hc_colors(cols) %>%
hc_chart(style = list(fontFamily = "Georgia",
fontWeight = "bold"))%>%
#ADD THEME
hc_add_theme(hc_theme_economist())
And YIKES! This is unintelligible. It looks like a stock market chart.
Version C is a reformed version of B. I aggregated the weeks data into months, in order to reduce the number of columns.
The Food Stamps chart had a clean trending line superimposed on the column chart because the cost changes were gradual over time. After seeing Version B, I realized that this type of graph was not quite appropriate for poll data, which can fluctuate up and down drastically. This is made worse by poll data that is gathered frequently (weekly in this dataset). Many time points with drastic up and down values makes a line graph very messy.
For version C, I aggregated the weekly data into months to reduce the amount of fluctuations. The resulting chart is now too simple (too few columns), which I think diminishes the visual impact of the Foods Stamps plot. On the other hand, the graph clearly conveys how the voting preferences trended over this time period. And the sample size info is a “nice-to-know” addition that is also easy to absorb.
But it is weird seeing the Remain and Leave lines “floating” above the columns. This is another reason why this chart does not work for this data.
Theme: “Flatdark” Line & Column Colors: set by Brewer palette “Set3” Title & Subtitle Colors: specified by hex values
# VERSION C - Step 1 - Clean up the data
# Weekly data needs to be aggregated into months (to simplify the column chart)
data("brexit_polls")
library(highcharter)
library(dplyr)
# Convert remain/leave/undecided percentages into counts
cleaned_brexit <- brexit_polls %>%
mutate(remain_total = remain*samplesize) %>%
mutate(leave_total = leave*samplesize) %>%
mutate(undecided_total = undecided*samplesize) %>%
# Add column for month
mutate(month = format(startdate, "%m")) %>%
group_by(month) %>%
# Sum up the weekly sample size total, remain total, leave total, undecided total for each month
summarize(samplesize = sum(samplesize),
remain_total = sum(remain_total),
leave_total = sum(leave_total),
undecided_total = sum(undecided_total)) %>%
# Calculate percentages for Remain, Leave, Undecided
mutate(remain_percent = remain_total/samplesize) %>%
mutate(leave_percent = leave_total/samplesize) %>%
mutate(undecided_percent = undecided_total/samplesize)
# VERSION C - Step 2 - Create the chart
# Set the color palette
cols <- brewer.pal(7, "Set3")
highchart() %>%
hc_yAxis_multiples(
list(title = list(text = "Sample Size (# of UK citizens polled)")),
list(title = list(text = "Percentage In Favor"),
opposite = TRUE)
) %>%
# MANUALLY CHANGE LINE COLORS
hc_add_series(data =cleaned_brexit$samplesize,
name = "Sample Size (number of UK voters)",
type = "column",
yAxis = 0) %>%
hc_add_series(data = cleaned_brexit$remain_percent,
name = "Percentage In Favor of Remaining",
type = "line",
yAxis = 1,
color = "#f0547a") %>%
hc_add_series(data = cleaned_brexit$leave_percent,
name = "Percentage In Favor of Leaving",
type = "line",
yAxis = 1,
color = "#eed911") %>%
hc_add_series(data = cleaned_brexit$undecided_percent,
name = "Percentage Undecided",
type = "line",
yAxis = 1,
color = "#ffffff") %>%
hc_xAxis(categories = cleaned_brexit$month,
tickInterval = 1) %>%
hc_colors(cols) %>%
hc_chart(style = list(fontFamily = "Georgia",
fontWeight = "bold")) %>%
# ADD TITLES - found code through Google search
hc_title(text = "<b>Brexit Poll Opinions (Jan-June 2016)</b>",
margin = 20, align = "left",
style = list(color = "#90ed7d", useHTML = TRUE)) %>%
hc_subtitle(text = "UK Citizens' Preference to Remain or Leave the EU",
align = "left", style = list(color = "#c7f6be",
fontWeight ="bold")) %>%
# ADD X-AXIS LABEL - found code through Google search
hc_xAxis(title = list(text = "Month Poll Taken (2016)"),
opposite = TRUE) %>%
# ADD THEME
hc_add_theme(hc_theme_flatdark())