Datasets

We’ll be using data from the blue_jays.rda, tech_stocks.rda, corruption.rda, and cdc.txt datasets, which are already in the /data subdirectory of our data_vis_labs project.

# Load package(s)
library(tidyverse)
library(ggplot2)
library(ggrepel)
library(lubridate)
library(dplyr)
library(directlabels)

# Load datasets
load(file = "data/blue_jays.rda")
load(file = "data/tech_stocks.rda")
load(file = "data/corruption.rda")

# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor")
  ))

Exercise 1

Using blue_jays.rda dataset, recreate the following graphic as precisely as possible.

yrng <- range(blue_jays$Head)
xrng <- range(blue_jays$Mass)
caption <- "Head length versus body mass for 123 blue jays"

## Data for geom_text
label_info <- blue_jays %>% slice(8, 26)

## Code for the given graphic
ggplot(data = blue_jays,
       mapping = aes(Mass, Head, color = KnownSex)) +
  geom_point(size = 2, alpha = 0.6, show.legend = FALSE) +
  geom_text(data = label_info, aes(label = KnownSex),
            nudge_x = 0.5, show.legend = FALSE) +
  xlab("Body mass (g)") +
  ylab("Head length (mm)") +
  annotate("text", x = xrng[1], y = yrng[2], label = caption,
           hjust = 0, vjust = 1, size = 4)

For this, I have created a scatterplot of head length by body mass from the blue_jays dataset. The points are colored by the variable KnownSex, with size 2 and a transparency (alpha) of 0.6. The annotate function places the caption “Head length versus body mass for 123 blue jays” in the top-left corner of the plot.
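The corner placement of the caption can be checked on a small example. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in for blue_jays, and the caption text and object names (x_range, y_range, anno) are made up for the example. It anchors the text at the smallest x and largest y, then reads back the coordinates the annotation layer received.

```r
library(ggplot2)

# Built-in mtcars stands in for blue_jays here (an assumption for illustration)
x_range <- range(mtcars$wt)
y_range <- range(mtcars$mpg)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  # x = smallest x and y = largest y put the anchor in the top-left corner;
  # hjust = 0 and vjust = 1 make the text extend right and down from there
  annotate("text", x = x_range[1], y = y_range[2],
           label = "Fuel economy versus weight", hjust = 0, vjust = 1)

# Layer 2 is the annotation; inspect where it was placed
anno <- layer_data(p, 2)
```

Because the anchor comes from range(), the caption should stay in the corner if the data change, which is the reason for computing xrng and yrng rather than hard-coding coordinates.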

Exercise 2

Using tech_stocks dataset, recreate the following graphics as precisely as possible.

Plot 1
## Location info for annotate
yrng <- range(tech_stocks$price_indexed)
xrng <- range(tech_stocks$date)
caption <- "Stock price over time for four major tech companies"

## Data for labels
label_info <- tech_stocks %>%
  arrange(desc(date)) %>%
  distinct(company, .keep_all = TRUE)
## Warning: Detecting old grouped_df format, replacing `vars` attribute by
## `groups`
## Code for given graphic
ggplot(data = tech_stocks,
       mapping = aes(date, price_indexed, color = company)) +
  geom_line(show.legend = FALSE) +
  geom_text(data = label_info, aes(label = company),
            nudge_x = 0.5, show.legend = FALSE, color = "black") +
  xlab("") +
  ylab("Stock price, indexed") +
  annotate("text", x = xrng[1], y = yrng[2], label = caption,
           hjust = 0, vjust = 1, size = 4, family = "serif")

In this line plot, I have drawn four lines, one per company, showing indexed stock price over time. The company names are added at the end of their corresponding lines with geom_text, using the most recent observation for each company stored in label_info.
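The label_info step in the code above keeps one row per company, namely the row with the latest date. A minimal sketch of that pattern follows, using a hypothetical toy table (toy_stocks, with made-up companies and prices) rather than the real tech_stocks data. The slice_max() version is shown only as an alternative way to get the same rows, not as the method used in this lab.

```r
library(dplyr)

# Hypothetical toy data standing in for tech_stocks (values are made up)
toy_stocks <- tibble(
  company = c("A", "A", "B", "B", "B"),
  date = as.Date(c("2020-01-01", "2020-06-01",
                   "2020-01-01", "2020-03-01", "2020-09-01")),
  price_indexed = c(100, 130, 100, 90, 150)
)

# Sort newest first, then keep the first row seen for each company
last_obs <- toy_stocks %>%
  arrange(desc(date)) %>%
  distinct(company, .keep_all = TRUE)

# Alternative: pick the row with the maximum date within each company
last_obs_alt <- toy_stocks %>%
  group_by(company) %>%
  slice_max(date, n = 1) %>%
  ungroup()
```

Both results hold one row per company, so passing either to geom_text() places a single label at the right-hand end of each line.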

Plot 2
## Location info for annotate
yrng <- range(tech_stocks$price_indexed)
xrng <- range(tech_stocks$date)
caption <- "Stock price over time for four major tech companies"

## Setting a seed
set.seed(9876)

## Data for labels
label_info <- tech_stocks %>%
  arrange(desc(date)) %>%
  distinct(company, .keep_all = TRUE)

## Code for given graphic
ggplot(data = tech_stocks,
       mapping = aes(date, price_indexed, color = company)) +
  geom_line(show.legend = FALSE) +
  xlab("") +
  ylab("Stock price, indexed") +
  annotate("text", x = xrng[1], y = yrng[2], label = caption,
           hjust = 0, vjust = 1, size = 4, family = "serif") +
  geom_text_repel(data = label_info, aes(label = company),
                  box.padding = 0.6, hjust = 1, min.segment.length = 0,
                  show.legend = FALSE, color = "black")

Above, I have created a line graph with four lines, one per company. I have used geom_text_repel to add the in-graph company labels, setting box.padding to push the labels away from the line ends and min.segment.length = 0 so that every label is connected to its line by a segment.

Exercise 3

Using corruption.rda dataset, recreate the following graphic as precisely as possible.

## Dataset for plotting points
corruption_plot <- corruption %>%
  filter(year == 2015) %>%
  na.omit()

## Setting a seed
set.seed(9876)

# Dataset for labels
country_highlight <- c(
  "United States", "Singapore", "Ghana", "Niger",
  "Chile", "Argentina", "Japan", "China", "Iraq")

## Build a plot
ggplot(corruption_plot, mapping = aes(cpi, hdi)) +
  geom_point(aes(color = region), alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ log(x), se = FALSE, color = "grey60") +
  geom_text_repel(data = subset(corruption_plot, country %in% country_highlight),
                  mapping = aes(label = country),
                  min.segment.length = 0, box.padding = 0.6, color = "black") +
  xlab("Corruption Perception Index, 2015 (100 = least corrupt)") +
  ylab("Human Development Index, 2015\n(1.0 = most developed)") +
  ggtitle("Corruption and human development (2015)")

Above, I have used the corruption_plot dataset to create a scatterplot of hdi against cpi, with points colored by region and a fitted trend line. To label only the countries listed in country_highlight, I have used subset() with the %in% operator.
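The row selection behind the labels can be sketched on its own. The example below uses a hypothetical toy table (toy_countries, with made-up cpi values) in place of corruption_plot, and shows the base R subset() form next to the equivalent dplyr filter() form.

```r
library(dplyr)

# Hypothetical toy data; the cpi values are made up for illustration
toy_countries <- tibble(
  country = c("Chile", "Ghana", "Norway", "Peru"),
  cpi = c(70, 47, 87, 36)
)
highlight <- c("Chile", "Ghana", "Japan")

# %in% returns TRUE for each country that appears anywhere in highlight
labelled_base <- subset(toy_countries, country %in% highlight)

# The same selection written with dplyr
labelled_dplyr <- toy_countries %>% filter(country %in% highlight)
```

Names in the highlight vector that are missing from the data (here "Japan") are simply ignored, so a country dropped earlier by na.omit() would not get a label and would not raise an error.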

Exercise 4

Using cdc dataset, recreate the described graphic as precisely as possible.

Using Bilbo Baggins’ responses below to the CDC BRFSS questions, add Bilbo’s data point as a transparent (0.5) solid red circle of size 4 to a scatterplot of weight by height with transparent (0.1) solid blue circles of size 2 as the plotting characters. In addition, label the point with his name in red. Rotate the label so it reads vertically from top to bottom and shift it up by 10 pounds. The plot should use appropriately formatted axis labels. Remember that the default shape is a solid circle.

  • genhlth - How would you rate your general health? fair
  • exerany - Have you exercised in the past month? 1=yes
  • hlthplan - Do you have some form of health coverage? 0=no
  • smoke100 - Have you smoked at least 100 cigarettes in your life time? 1=yes
  • height - height in inches: 46
  • weight - weight in pounds: 120
  • wtdesire - weight desired in pounds: 120
  • age - in years: 45
  • gender - m for males and f for females: m


Hint: Create a new dataset (maybe call it bilbo or bilbo_baggins) using either data.frame() (base R - example in book) or tibble() (tidyverse - see help documentation for the function). Make sure to use variable names that exactly match cdc’s variable names.

# Build dataset for Bilbo Baggins
# tidyverse way to build data: Base R is data.frame()
bilbo <- tibble(
  genhlth  = "fair",
  exerany  = 1,
  hlthplan = 0,
  smoke100 = 1,
  height   = 46,
  weight   = 120,
  wtdesire = 120,
  age      = 45,
  gender   = "m"
)

# Code for graph
ggplot(cdc, aes(height, weight)) +
  geom_point(size = 2, alpha = 0.1, shape = 16, color = "blue") +
  geom_point(data = bilbo, color = "red", alpha = 0.5, size = 4) +
  # Constants go outside aes() so "red" is used as a literal color;
  # hjust = 0 with angle = 90 starts the label at the nudged point and runs it upward
  geom_text(data = bilbo, label = "Bilbo Baggins",
            color = "red", angle = 90, hjust = 0, nudge_y = 10) +
  xlab("Height (in)") +
  ylab("Weight (lbs)")

Above, I have plotted weight against height for the cdc dataset and overlaid Bilbo’s data point from the bilbo dataset. The Bilbo Baggins label is shifted up by 10 pounds from the data point.
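One detail worth checking when styling a single highlighted point or label: in ggplot2, a constant such as "red" behaves differently inside and outside aes(). The sketch below is an illustration with a hypothetical one-row data frame (pt). It builds the same point both ways and reads back the color each layer ends up with.

```r
library(ggplot2)

# Hypothetical one-row data frame for illustration
pt <- data.frame(height = 46, weight = 120)

# Inside aes(): "red" is treated as a data value and passed through the color scale
mapped <- ggplot(pt, aes(height, weight)) +
  geom_point(aes(colour = "red"))

# Outside aes(): "red" is used directly as the color
fixed <- ggplot(pt, aes(height, weight)) +
  geom_point(colour = "red")

mapped_colour <- layer_data(mapped)$colour
fixed_colour <- layer_data(fixed)$colour
```

Here fixed_colour is the literal "red", while mapped_colour is whatever the default discrete color scale assigns to a single category. Mapping a constant inside aes() also adds a legend entry unless show.legend = FALSE is set.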