# Load package(s)

library(ggplot2)
library(tidyverse)
library(lubridate)
library(splines)

Datasets

We’ll be using data from the BA_degrees.rda and dow_jones_industrial.rda datasets, which are already in the /data subdirectory of our data_vis_labs project. Below is a description of the variables contained in each dataset.

BA_degrees.rda

field - field of study
year_str - academic year (e.g. 1970-71)
year - closing year of academic year
count - number of degrees conferred within a field for the year
perc - field’s percentage of degrees conferred for the year

dow_jones_industrial.rda

date - date
open - Dow Jones Industrial Average at open
high - Day’s high for the Dow Jones Industrial Average
low - Day’s low for the Dow Jones Industrial Average
close - Dow Jones Industrial Average at close
volume - number of trades for the day

We’ll also be using a subset of the BRFSS (Behavioral Risk Factor Surveillance System) survey collected annually by the Centers for Disease Control and Prevention (CDC). The data can be found in the provided cdc.txt file — place this file in your /data subdirectory. The dataset contains 20,000 complete observations/records of 9 variables/fields, described below.

genhlth - How would you rate your general health? (excellent, very good, good, fair, poor)
exerany - Have you exercised in the past month? (1 = yes, 0 = no)
hlthplan - Do you have some form of health coverage? (1 = yes, 0 = no)
smoke100 - Have you smoked at least 100 cigarettes in your lifetime? (1 = yes, 0 = no)
height - height in inches
weight - weight in pounds
wtdesire - weight desired in pounds
age - in years
gender - m for males and f for females

Exercises

Exercise 1

load(file = "data/BA_degrees.rda")

Here, I have loaded the BA_degrees.rda dataset, which is used for all of the plots in Exercise 1.

Plot 1

# Wrangling for plotting

ba_dat <- BA_degrees %>%
  # mean % per field
  group_by(field) %>% 
  mutate(mean_perc = mean(perc)) %>% 
  # Only fields with mean >= 5%
  filter(mean_perc >= 0.05) %>%
  # Organizing for plotting
  arrange(desc(mean_perc), year) %>% 
  ungroup() %>% 
  mutate(field = fct_inorder(field))

ggplot(ba_dat, aes(year,perc)) +
  geom_line() + 
  facet_wrap(~ field) +
  labs(x = "Year",
       y = "Proportion of degrees")

Here, for Plot 1, I have created a line plot of the proportion of degrees by year using geom_line() and used facet_wrap() to draw one panel per field of study. Because field was converted with fct_inorder() after sorting by mean_perc, the panels appear in descending order of mean proportion.
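As a rough illustration of the wrangling step, the sketch below runs the same pipeline on a small made-up table. The toy values and the names toy and toy_dat are hypothetical and are not taken from BA_degrees.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: three fields, two years each
toy <- tibble(
  field = rep(c("Art", "Business", "Math"), each = 2),
  year  = rep(c(2000, 2001), times = 3),
  perc  = c(0.02, 0.04, 0.20, 0.22, 0.06, 0.08)
)

toy_dat <- toy %>%
  group_by(field) %>%
  mutate(mean_perc = mean(perc)) %>%   # per-field mean, repeated on every row
  filter(mean_perc >= 0.05) %>%        # drops Art (mean 0.03)
  arrange(desc(mean_perc), year) %>%   # largest mean first
  ungroup() %>%
  mutate(field = fct_inorder(field))   # factor levels follow the row order

levels(toy_dat$field)
```

Because the factor levels follow the sorted row order, facet_wrap() would lay out the panels from the largest mean proportion to the smallest.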

Plot 2

ggplot(ba_dat, aes(year,perc)) +
  geom_line() +
  geom_area(color = "red",fill = "red",alpha = 0.5) +
  facet_wrap(~field) +
  labs(x = "Year",
       y = "Proportion of degrees")

Here, I have again plotted the proportion of degrees by year with geom_line() and faceted by field of study with facet_wrap(). geom_area() fills the region under each line in red with a transparency (alpha) of 0.5.

Plot 3

ggplot(ba_dat, aes(year,perc,colour=field)) +
  geom_line() + 
  labs(x = "Year",
       y = "Proportion of degrees")

Here, I have created a line plot of the proportion of degrees by year with geom_line() and distinguished the fields by line color using colour = field.

Exercise 2

Using the dow_jones_industrial dataset, recreate the following graphics as precisely as possible.

load(file = "data/dow_jones_industrial.rda")

Here, I have loaded the dow_jones_industrial.rda dataset from the /data subdirectory.

Plot 1

# Restrict data to useful range
djia_date_range <- dow_jones_industrial %>% 
  filter(date >= ymd("2008/12/31") & date <= ymd("2010/01/10"))

ggplot(djia_date_range, aes(date,open)) +
  geom_line(colour = "purple") +
  geom_smooth(colour = "green",fill = "red") +
  labs(x = "",y = "Dow Jones Industrial Average")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Here, I have created a purple line graph of the Dow Jones Industrial Average at open by date using geom_line(). geom_smooth() overlays a green loess curve on the line, with its confidence band filled in red.

Plot 2

ggplot(djia_date_range, aes(date,open)) +
  geom_line() +
  geom_smooth(colour = "blue", se = FALSE, span = 0.3)+
  labs(x="",y ="Dow Jones Industrial Average")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Here, I have plotted the Dow Jones Industrial Average at open by date with geom_line() and added a blue loess curve with geom_smooth(). The argument se = FALSE removes the confidence band, and span = 0.3 uses a smaller fraction of the data for each local fit than the default, which makes the curve wigglier.
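To see how span affects a loess fit outside of ggplot2, here is a minimal sketch on simulated data (not the Dow Jones series). It calls stats::loess() directly and compares the equivalent number of parameters, which is one way to gauge how flexible each fit is. The value 0.75 is the default span for loess smoothing.

```r
set.seed(1)

# Simulated noisy curve (hypothetical data)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

fit_narrow  <- loess(y ~ x, span = 0.3)   # each local fit uses 30% of the points
fit_default <- loess(y ~ x, span = 0.75)  # each local fit uses 75% of the points

# Equivalent number of parameters: larger means a more flexible, wigglier curve
c(span_0.3 = fit_narrow$enp, span_0.75 = fit_default$enp)
```

The narrower span should report the larger value, which matches the wigglier curve seen in the plot.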

Plot 3

ggplot(djia_date_range, aes(date,open)) +
  geom_line() +
  geom_smooth(colour = "blue",method = "lm", se = FALSE, formula = y~ns(x,6)) +
  labs(x = "",y = "Dow Jones Industrial Average")

Here, I have added a blue geom_smooth() curve fitted with method = "lm", which calls a linear model, and used se = FALSE to remove the confidence band. The formula y ~ ns(x, 6) fits a natural cubic spline with 6 degrees of freedom rather than a straight line.
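To illustrate what ns(x, 6) contributes to the formula, the sketch below builds the spline basis on simulated data and fits it with lm(). The data and the names basis and fit are made up for illustration.

```r
library(splines)

set.seed(42)

# Simulated data (hypothetical)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)

# ns() expands x into 6 natural cubic spline basis columns
basis <- ns(x, df = 6)
dim(basis)

# lm() then fits an ordinary linear model on those columns:
# one intercept plus one coefficient per basis column
fit <- lm(y ~ ns(x, 6))
length(coef(fit))
```

The model is still linear in its coefficients, which is why method = "lm" can draw a curved line.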

Exercise 3

Using the cdc dataset, recreate the following graphics as precisely as possible.

# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor")
  ))

Here, I have read in the cdc dataset and converted genhlth to a factor with five levels ordered from excellent to poor.

Plot 1

genhlth_count <- cdc %>% 
  count(genhlth)

Here, I have used count() to tabulate the number of observations at each level of genhlth and stored the result in genhlth_count.

ggplot(cdc, aes(genhlth)) +
  geom_bar()

Here, I have created a bar plot of the single variable genhlth with geom_bar(), which counts the observations in each category itself.

ggplot(genhlth_count, aes(genhlth, n)) +
  geom_col()

Here, I have used the precomputed genhlth_count to draw the same bar chart with geom_col(), mapping genhlth to x and the count n to y.
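The two approaches can be compared directly. The sketch below uses a tiny made-up vector of responses (not the cdc data) and layer_data() to pull the bar heights that ggplot2 computes for each version.

```r
library(ggplot2)
library(dplyr)

# Hypothetical responses
toy <- tibble(genhlth = c("good", "good", "fair", "poor", "good", "fair"))

# geom_bar() counts the rows itself
bar <- layer_data(ggplot(toy, aes(genhlth)) + geom_bar())

# geom_col() plots counts that were computed beforehand
toy_count <- count(toy, genhlth)
col <- layer_data(ggplot(toy_count, aes(genhlth, n)) + geom_col())

# Bar heights, ordered by position on the x-axis
bar_heights <- as.numeric(bar$y[order(as.numeric(bar$x))])
col_heights <- as.numeric(col$y[order(as.numeric(col$x))])
all.equal(bar_heights, col_heights)
```

geom_col() is the natural choice when the counts already exist in a summary table, and geom_bar() when working from the raw observations.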

Plot 2

ggplot(cdc, aes(genhlth, colour = as.factor(hlthplan), fill = as.factor(hlthplan))) +
  geom_bar(position = "dodge")

Here, I have created a bar chart of genhlth using geom_bar() and colored the bars by as.factor(hlthplan). Setting position = "dodge" places the bars for the two hlthplan values side by side within each genhlth category.

Plot 3

ggplot(cdc, aes(x = weight, group = genhlth, fill = genhlth)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~gender)

Here, I have created density plots of weight with geom_density(), using alpha = 0.2 for transparency. The curves are grouped and filled by genhlth, and facet_wrap(~gender) draws a separate panel for each gender.

Plot 4

ggplot(cdc, aes(x = weight, group = gender, fill = gender)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~genhlth) +
  xlim(50, 300)

## Warning: Removed 103 rows containing non-finite values (stat_density).

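Here, I have created density plots of weight filled by gender, with a separate panel for each level of genhlth. Restricting the x-axis with xlim(50, 300) drops the 103 observations outside that range, which produces the warning above.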
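The warning above reports rows removed by xlim(). As a rough illustration with made-up weights (not the cdc data), the sketch below compares xlim(), which discards out-of-range values before the density is estimated, with coord_cartesian(), which only zooms the view. coord_cartesian() is not used in this lab and appears here only for contrast.

```r
library(ggplot2)

# Hypothetical weights: two of the six fall outside 50-300
toy <- data.frame(weight = c(40, 120, 150, 180, 210, 350))

p <- ggplot(toy, aes(weight)) + geom_density()

# xlim(): out-of-range values are removed before stat_density runs
clipped <- suppressWarnings(layer_data(p + xlim(50, 300)))

# coord_cartesian(): every value feeds the density; only the view changes
zoomed <- layer_data(p + coord_cartesian(xlim = c(50, 300)))

# Number of observations used for each density estimate
c(xlim = clipped$n[1], coord_cartesian = zoomed$n[1])
```

Which behavior is appropriate depends on whether the extreme values should influence the shape of the curve.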

Plot 5

ggplot(cdc, aes(x = gender, y = height, group = gender, fill = gender)) +
  geom_boxplot(alpha = 0.4) +
  facet_grid(~genhlth)

Here, I have created box plots of height by gender, with fill color distinguishing the genders. facet_grid(~genhlth) arranges the panels for the genhlth levels side by side in a single row.

Plot 6

ggplot(cdc) +
  geom_point(aes(x = height, y = weight, color = gender), alpha = 0.2) +
  geom_smooth(aes(x = height, y = weight, color = gender),
              method = "lm", se = FALSE, fullrange = TRUE)

Here, I have overlaid geom_smooth() lines on a scatterplot created by geom_point(), with both layers sharing one panel of weight by height and colored by gender. With method = "lm" each line is a linear fit, and fullrange = TRUE extends each line across the full range of the plot rather than only the range of that gender's data.
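The effect of fullrange can be checked numerically. This is a minimal sketch on six made-up points (not the cdc data); the helper group1_max() is hypothetical and simply reports how far the fitted line for the first group extends along the x-axis.

```r
library(ggplot2)

# Hypothetical data: group "f" spans heights 60-64, group "m" spans 68-72
toy <- data.frame(
  height = c(60, 62, 64, 68, 70, 72),
  weight = c(120, 130, 135, 170, 180, 195),
  gender = rep(c("f", "m"), each = 3)
)

base <- ggplot(toy, aes(height, weight, colour = gender))

# Largest x value of the fitted line for the first group ("f")
group1_max <- function(fullrange) {
  p <- base + geom_smooth(method = "lm", formula = y ~ x,
                          se = FALSE, fullrange = fullrange)
  d <- layer_data(p)
  max(d$x[d$group == 1])
}

max_default <- group1_max(FALSE)  # line stops at the group's own data
max_full    <- group1_max(TRUE)   # line continues past the group's data
c(default = max_default, fullrange = max_full)
```

Extending the lines makes the two fits easier to compare, though any portion beyond a group's own data is an extrapolation.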

L03 ggplot II

Taehyung Kim

April 9, 2019
