# Load package(s)
library(ggplot2)
library(tidyverse)
library(lubridate)
library(splines)
We’ll be using data from the BA_degrees.rda and dow_jones_industrial.rda datasets, which are already in the /data subdirectory of our data_vis_labs project. Below is a description of the variables contained in each dataset.
BA_degrees.rda
field - field of study
year_str - academic year (e.g. 1970-71)
year - closing year of academic year
count - number of degrees conferred within a field for the year
perc - field’s percentage of degrees conferred for the year
dow_jones_industrial.rda
date - date
open - Dow Jones Industrial Average at open
high - Day’s high for the Dow Jones Industrial Average
low - Day’s low for the Dow Jones Industrial Average
close - Dow Jones Industrial Average at close
volume - number of trades for the day
We’ll also be using a subset of the BRFSS (Behavioral Risk Factor Surveillance System) survey collected annually by the Centers for Disease Control and Prevention (CDC). The data can be found in the provided cdc.txt file; place this file in your /data subdirectory. The dataset contains 20,000 complete observations/records of 9 variables/fields, described below.
genhlth - How would you rate your general health? (excellent, very good, good, fair, poor)
exerany - Have you exercised in the past month? (1 = yes, 0 = no)
hlthplan - Do you have some form of health coverage? (1 = yes, 0 = no)
smoke100 - Have you smoked at least 100 cigarettes in your lifetime? (1 = yes, 0 = no)
height - height in inches
weight - weight in pounds
wtdesire - weight desired in pounds
age - in years
gender - m for males and f for females
load(file = "data/BA_degrees.rda")
Here, I have loaded the BA_degrees.rda dataset, which is used for all of the plots in Exercise 1.
# Wrangling for plotting
ba_dat <- BA_degrees %>%
  # mean % per field
  group_by(field) %>%
  mutate(mean_perc = mean(perc)) %>%
  # Only fields with mean >= 5%
  filter(mean_perc >= 0.05) %>%
  # Organizing for plotting
  arrange(desc(mean_perc), year) %>%
  ungroup() %>%
  mutate(field = fct_inorder(field))
ggplot(ba_dat, aes(year, perc)) +
  geom_line() +
  facet_wrap(~ field) +
  labs(x = "Year",
       y = "Proportion of degrees")
Here, for plot 1, I have drawn the proportion of degrees against year with geom_line and used facet_wrap to give each field of study its own panel.
ggplot(ba_dat, aes(year, perc)) +
  geom_line() +
  geom_area(color = "red", fill = "red", alpha = 0.5) +
  facet_wrap(~ field) +
  labs(x = "Year",
       y = "Proportion of degrees")
Here, I have drawn the proportion of degrees against year with geom_line, used facet_wrap to give each field of study its own panel, and shaded the area under each line with geom_area, using a red outline and a red fill at 50% opacity (alpha = 0.5).
ggplot(ba_dat, aes(year, perc, colour = field)) +
  geom_line() +
  labs(x = "Year",
       y = "Proportion of degrees")
Here, I have drawn the proportion of degrees against year with geom_line in a single panel and mapped colour = field inside aes() so that each field of study gets its own line color.
Using the dow_jones_industrial dataset, recreate the following graphics as precisely as possible.
load(file = "data/dow_jones_industrial.rda")
Here, I have loaded the dow_jones_industrial.rda dataset from the data subdirectory.
# Restrict data to useful range
djia_date_range <- dow_jones_industrial %>%
  filter(date >= ymd("2008/12/31") & date <= ymd("2010/01/10"))
ggplot(djia_date_range, aes(date, open)) +
  geom_line(colour = "purple") +
  geom_smooth(colour = "green", fill = "red") +
  labs(x = "", y = "Dow Jones Industrial Average")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Here, I have drawn the Dow Jones Industrial Average at open against date as a purple line with geom_line. On top of it, geom_smooth adds a green loess smoother whose confidence band is filled red.
ggplot(djia_date_range, aes(date, open)) +
  geom_line() +
  geom_smooth(colour = "blue", se = FALSE, span = 0.3) +
  labs(x = "", y = "Dow Jones Industrial Average")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Here, I have drawn the Dow Jones Industrial Average against date as a line graph and overlaid a blue loess smoother. The argument se = FALSE suppresses the confidence band, and span = 0.3 makes the smoother wigglier than the default span of 0.75, because each local fit uses a smaller share of the data.
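To see the effect of span outside of ggplot2, the sketch below fits base R's loess() at two spans. The toy data, the object names, and the choice of 0.75 as the comparison span are illustrative assumptions and not part of the lab.

```r
# Toy data (hypothetical): a noisy sine wave, not the Dow Jones series
set.seed(42)
toy <- data.frame(x = seq(0, 10, length.out = 200))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

# Fit loess with a small span and with the default span
fit_wiggly <- loess(y ~ x, data = toy, span = 0.3)
fit_smooth <- loess(y ~ x, data = toy, span = 0.75)

# enp is the equivalent number of parameters of each fit;
# a larger value indicates a more flexible (wigglier) curve
c(span_0.3 = fit_wiggly$enp, span_0.75 = fit_smooth$enp)
```

The smaller span should report the larger enp, which matches the visual impression of a curve that follows short-term movements more closely.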
ggplot(djia_date_range, aes(date, open)) +
  geom_line() +
  geom_smooth(colour = "blue", method = "lm", se = FALSE, formula = y ~ ns(x, 6)) +
  labs(x = "", y = "Dow Jones Industrial Average")
Here, I have drawn the smoother in blue with method = "lm", so the curve is fit by a linear model, and se = FALSE, which suppresses the confidence band. The formula y ~ ns(x, 6) fits that linear model on a natural cubic spline basis with 6 degrees of freedom (ns comes from the splines package), which is why the result is a flexible curve rather than a straight line.
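As a rough illustration of what the smoother computes with method = "lm" and a spline formula, the same formula can be passed to lm() directly. This is only a sketch on made-up data; the toy data frame and the name spline_fit are assumptions.

```r
library(splines)  # provides ns()

# Toy data (hypothetical): a smooth curve plus noise
set.seed(1)
toy <- data.frame(x = seq(0, 10, length.out = 150))
toy$y <- cos(toy$x) + 0.1 * toy$x + rnorm(150, sd = 0.2)

# Same formula as in geom_smooth(), with x and y as column names
spline_fit <- lm(y ~ ns(x, 6), data = toy)

# ns(x, 6) expands x into 6 basis columns, so lm() estimates
# 6 spline coefficients plus an intercept
length(coef(spline_fit))

# Fitted values trace out the smooth curve
head(predict(spline_fit))
```

The model is still linear in its coefficients, which is why method = "lm" applies even though the fitted curve bends.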
Using the cdc dataset, recreate the following graphics as precisely as possible.
# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor")
  ))
Here, I have read in the cdc dataset and converted genhlth to a factor whose five levels are ordered from excellent to poor, so that plots show the categories in that order rather than alphabetically.
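The effect of setting levels explicitly can be checked on a small vector. The sketch below uses made-up responses, not the cdc data, and the object names are assumptions.

```r
# Toy responses (hypothetical), not the cdc data
responses <- c("good", "poor", "excellent", "very good", "fair", "good")

# Default: levels are sorted alphabetically
levels(factor(responses))

# Explicit levels: the order follows the survey scale
health_levels <- c("excellent", "very good", "good", "fair", "poor")
ordered_resp <- factor(responses, levels = health_levels)
levels(ordered_resp)
```

ggplot2 places discrete categories in level order, so the explicit levels determine the order of bars and facets.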
genhlth_count <- cdc %>%
  count(genhlth)
Here, I have used count() to tally the number of respondents in each genhlth level and stored the result in genhlth_count, where the counts are in the column n.
ggplot(cdc, aes(genhlth)) +
  geom_bar()
Here, I have created a bar plot of the single variable genhlth with geom_bar, which counts the rows in each category itself.
ggplot(genhlth_count, aes(genhlth, n)) +
  geom_col()
Here, I have used the precomputed counts in genhlth_count to draw the same bar chart with geom_col(), which takes the bar heights from the y variable (n) instead of counting rows.
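The pre-tabulation step that separates the two approaches can be sketched in base R. The toy factor below is hypothetical, and table() is used here as a stand-in that is similar in spirit to count(), not as the lab's method.

```r
# Toy factor (hypothetical), standing in for cdc$genhlth
genhlth <- factor(
  c("good", "poor", "good", "excellent", "good", "poor"),
  levels = c("excellent", "good", "poor")
)

# Pre-tabulate: one row per level, with the counts in column n
toy_count <- as.data.frame(table(genhlth = genhlth), responseName = "n")
toy_count

# The counts add up to the number of raw observations
sum(toy_count$n) == length(genhlth)
```

A table like toy_count is the kind of input geom_col() expects, while geom_bar() would start from the raw genhlth vector and do the counting internally.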
ggplot(cdc, aes(genhlth, colour = as.factor(hlthplan), fill = as.factor(hlthplan))) +
  geom_bar(position = "dodge")
Here, I have created a bar chart of genhlth with geom_bar() and mapped both the outline colour and the fill to as.factor(hlthplan). Setting position = "dodge" places the bars for the two hlthplan values side by side within each genhlth category instead of stacking them.
ggplot(cdc, aes(x = weight, group = genhlth, fill = genhlth)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~ gender)
Here, I have created density plots of weight with geom_density, one curve per genhlth level. Each curve is filled according to genhlth, and alpha = 0.2 keeps the fills transparent enough that overlapping curves stay visible. facet_wrap(~ gender) splits the plot into one panel per gender.
ggplot(cdc, aes(x = weight, group = gender, fill = gender)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ genhlth) +
  xlim(50, 300)
## Warning: Removed 103 rows containing non-finite values (stat_density).
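Here, I have created density plots of weight filled by gender, with one panel per genhlth level. The xlim(50, 300) call restricts the x-axis to 50 to 300 pounds; observations outside that range are dropped before the densities are estimated, which is what the warning above reports.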
ggplot(cdc, aes(x = gender, y = height, group = gender, fill = gender)) +
  geom_boxplot(alpha = 0.4) +
  facet_grid(~ genhlth)
Here, I have created box plots of height by gender, with the fill color distinguishing the genders. facet_grid(~ genhlth) lays the panels out in a single row, with one column per genhlth level.
ggplot(cdc) +
  geom_point(aes(x = height, y = weight, color = gender), alpha = 0.2) +
  geom_smooth(aes(x = height, y = weight, color = gender),
              method = "lm", se = FALSE, fullrange = TRUE)
Here, I have overlaid geom_smooth lines on a scatterplot of weight against height drawn with geom_point. Both layers share one panel and are colored by gender. With method = "lm", each gender gets its own straight-line fit, and the argument fullrange = TRUE extends each line across the full width of the plot rather than stopping at the range of that gender's data.
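What the per-gender linear smoother does can be approximated by hand: fit one lm() per group, then predict over the full height range rather than each group's own range. The sketch below uses simulated data, and all names in it are assumptions.

```r
# Toy data (hypothetical): two groups with partly overlapping heights
set.seed(1)
toy <- data.frame(
  height = c(runif(30, 60, 68), runif(30, 66, 76)),
  gender = rep(c("f", "m"), each = 30)
)
toy$weight <- 3 * toy$height - 50 + rnorm(60, sd = 10)

# One straight-line fit per gender
fits <- lapply(split(toy, toy$gender), function(d) lm(weight ~ height, data = d))

# Predict at the endpoints of the overall height range,
# not only within each group's own range
full_x <- data.frame(height = range(toy$height))
preds <- lapply(fits, predict, newdata = full_x)
preds
```

Predictions beyond a group's observed heights are extrapolations, so the extended ends of the lines should be read with more caution than the middle.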