This project is an example R vignette built around the visualization capabilities of the ggplot2 package. This R Markdown document uses a FiveThirtyEight data set on the age distribution of the U.S. Congress (https://fivethirtyeight.com/features/aging-congress-boomers/). The goal is to demonstrate how ggplot2, part of the tidyverse, can turn this data set into clear, informative plots.
library(tidyverse)
library(ggalt)
congress <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-demographics/data_aging_congress.csv")
glimpse(congress)
## Rows: 29,120
## Columns: 13
## $ congress <dbl> 82, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, …
## $ start_date <date> 1951-01-03, 1947-01-03, 1949-01-03, 1951-01-03, 1953-01…
## $ chamber <chr> "House", "House", "House", "House", "House", "House", "H…
## $ state_abbrev <chr> "ND", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "V…
## $ party_code <dbl> 200, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ bioname <chr> "AANDAHL, Fred George", "ABBITT, Watkins Moorman", "ABBI…
## $ bioguide_id <chr> "A000001", "A000002", "A000002", "A000002", "A000002", "…
## $ birthday <date> 1897-04-09, 1908-05-21, 1908-05-21, 1908-05-21, 1908-05…
## $ cmltv_cong <dbl> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4…
## $ cmltv_chamber <dbl> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4…
## $ age_days <dbl> 19626, 14106, 14837, 15567, 16298, 17028, 17759, 18489, …
## $ age_years <dbl> 53.73306, 38.62012, 40.62149, 42.62012, 44.62149, 46.620…
## $ generation <chr> "Lost", "Greatest", "Greatest", "Greatest", "Greatest", …
The grammar of graphics is the foundation of ggplot2. Every plot combines a data set, a coordinate system, and at least one geom. Depending on the geom, an x and/or y variable must be mapped; these mappings are specified in the aes function.
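As a minimal sketch of those pieces, the chunk below builds a histogram and writes out the data, the aesthetic mapping, the geom, and the coordinate system explicitly. It uses ggplot2's built-in mpg data set rather than the congress data, an assumption made only so the example runs without a download.

```r
library(ggplot2)

# Data + aesthetic mapping: hwy (highway miles per gallon) goes on the x axis
p <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  # Geom: a histogram only needs x; the bar heights are computed counts
  geom_histogram(bins = 20) +
  # Coordinate system: Cartesian is the default, written out here for clarity
  coord_cartesian()

p
```

Dropping coord_cartesian() gives the same plot, since Cartesian coordinates are what ggplot2 uses when no coordinate system is specified.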
The following arguments are required for a histogram (they can be given in ggplot() or in geom_histogram() itself):
mapping, the aesthetic mapping created with aes(), e.g. aes(x = age_years)
data, the data set to be used in the graph
The following arguments are optional, listed with their default values:
stat, default value is “bin”
position, default value is “stack”
binwidth, default value is NULL, in which case the width is derived from bins and the range of the data
bins, 30 bins are used when neither bins nor binwidth is given
na.rm, default value is FALSE
orientation, default value is NA
show.legend, default value is NA
inherit.aes, default value is TRUE
aes, the mapping can also carry additional aesthetics such as fill and color (and shape in other geoms); these are unmapped (NULL) by default
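One way to check these defaults yourself is to read them from the function definition with base R's formals(). This is only a sketch, and the values it reports may differ between ggplot2 versions.

```r
library(ggplot2)

# Pull the default argument values straight from the function definition
defaults <- formals(geom_histogram)

defaults$stat         # "bin"
defaults$position     # "stack"
defaults$na.rm        # FALSE
defaults$inherit.aes  # TRUE
```

The same approach works for any other geom, for example formals(geom_bar).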
#histogram of age
ggplot(congress, aes(x = age_years)) +
geom_histogram()
#histogram of age with bin number specified
ggplot(congress, aes(x = age_years)) +
geom_histogram(bins = 60)
#histogram of age with binwidth specified
ggplot(congress, aes(x = age_years)) +
geom_histogram(binwidth = 4)
#histogram of age with bins and colored by generation
ggplot(congress, aes(x = age_years)) +
geom_histogram(aes(fill = generation), bins = 20)
The chunks below explore different combinations of variable types in ggplot2. The combinations covered are as follows:
1 Variable:
Continuous = geom_histogram
Discrete = geom_bar
2 Variables:
Both Continuous = geom_point
1 Continuous, 1 Discrete = geom_boxplot
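The pairing of variable types and geoms listed above can be written as a small lookup function. The helper names is_continuous and suggest_geom are hypothetical, not part of ggplot2, and the sketch assumes that numbers and dates count as continuous while everything else counts as discrete.

```r
library(ggplot2)

# Hypothetical helper: treat numbers and dates as continuous
is_continuous <- function(v) {
  is.numeric(v) || inherits(v, c("Date", "POSIXct"))
}

# Hypothetical helper: suggest a geom name for one or two variables
suggest_geom <- function(x, y = NULL) {
  if (is.null(y)) {
    if (is_continuous(x)) "geom_histogram" else "geom_bar"
  } else if (is_continuous(x) && is_continuous(y)) {
    "geom_point"
  } else if (xor(is_continuous(x), is_continuous(y))) {
    "geom_boxplot"
  } else {
    NA_character_  # two discrete variables are not covered in this vignette
  }
}

suggest_geom(mpg$hwy)             # "geom_histogram"
suggest_geom(mpg$class)           # "geom_bar"
suggest_geom(mpg$displ, mpg$hwy)  # "geom_point"
suggest_geom(mpg$class, mpg$hwy)  # "geom_boxplot"
```

The function is only a memory aid for the list above; in practice the right geom also depends on the question being asked of the data.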
#histogram of members' birth dates
ggplot(congress, aes(x = birthday)) +
geom_histogram(bins = 60)
#age histogram colored by generation
ggplot(congress, aes(x = age_years)) +
geom_histogram(aes(fill = generation), bins = 60)
#bar chart of generation frequency
ggplot(congress, aes(x = generation)) +
geom_bar()
#bar chart of generation frequency colored by chamber
ggplot(congress, aes(x = generation)) +
geom_bar(aes(fill = chamber))
#scatter plot of start date and birthday
ggplot(congress, aes(x = start_date, y = birthday)) +
geom_point()
#scatter plot of start date and birthday with color for age to highlight age change over time
ggplot(congress, aes(x = start_date, y = birthday)) +
geom_point(aes(color = age_years))
#boxplots for each congress showing age IQR
ggplot(congress, aes(x = congress, y = age_years)) +
geom_boxplot(aes(group = congress))
#boxplots for each congress showing age IQR colored by chamber
ggplot(congress, aes(x = congress, y = age_years)) +
# group by congress and chamber so each chamber gets its own filled box
geom_boxplot(aes(group = interaction(congress, chamber), fill = chamber))
ggplot2 can also be used to visualize the correlation between variables. The geom_point function can be used to create a scatter plot, and the geom_smooth function can be used to add a line of best fit to the plot.
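A minimal sketch of that combination follows, again on the built-in mpg data set as a stand-in (an assumption, chosen because it has two numeric columns). It computes the Pearson correlation with base R's cor() and shows it in the title next to the fitted line.

```r
library(ggplot2)

# Pearson correlation between engine displacement and highway mileage
r <- cor(mpg$displ, mpg$hwy)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # method = "lm" fits a straight line; se = FALSE hides the confidence band
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = paste("displ vs hwy, r =", round(r, 2)),
    x = "Engine displacement (L)",
    y = "Highway MPG"
  )

p
```

The coefficient is negative here, which matches the downward slope of the fitted line.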
First we will create a variable that represents which party each member of Congress belongs to. We do this by joining the congress data frame with a lookup data frame that maps party codes to party names.
# Read the party lookup table from GitHub and join on party_code
party_codes <-
  read_csv(
    "https://raw.githubusercontent.com/pkowalchuk/SPRING2024TIDYVERSE/main/party-codes.csv"
  )
congress <- congress |>
  inner_join(party_codes, by = "party_code")
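Note that an inner join keeps only rows whose key appears in both tables. The sketch below shows this on two hypothetical miniature tables; the member names, the code 999, and the party labels are made up for illustration and are not taken from party-codes.csv.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
members <- tibble(
  bioname    = c("Member A", "Member B", "Member C"),
  party_code = c(100, 200, 999)  # 999 has no match in the lookup table
)
lookup <- tibble(
  party_code = c(100, 200),
  party_name = c("Democrat", "Republican")
)

# inner_join drops Member C because 999 is missing from the lookup table
joined <- inner_join(members, lookup, by = "party_code")
joined

# left_join keeps every member and fills the unmatched party_name with NA
kept_all <- left_join(members, lookup, by = "party_code")
kept_all
```

If the row count of congress changes after the join, comparing it with the left_join result is a quick way to see which party codes were missing from the lookup table.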
We can now use ggplot2 to explore the relationship between age and party_name.
The scatterplot is one of the most widely used plots in data analysis, and it is usually the first choice when you want to understand the nature of the relationship between two variables.
ggplot(congress, aes(x = age_years, y = party_name)) +
geom_point() +
labs(title = "Age vs Party Name", x = "Age", y = "Party Name")
Sometimes, to better highlight a region of the plot, we can add an encircling shape around the points with geom_encircle from the ggalt package.
ggplot(congress, aes(x = age_years, y = party_name)) +
geom_point() +
geom_encircle(
aes(x = age_years, y = party_name),
data = congress %>% filter(age_years > 80),
color = "red",
expand = 0.05,
size = 2
) +
labs(title = "Age vs Party Name + Encircling", x = "Age", y = "Party Name")
We can use a counts chart to overcome the problem of overlapping data points: geom_count sizes each point by the number of observations at that location.
ggplot(congress, aes(x = age_years, y = party_name)) +
geom_count(col = "tomato", show.legend = FALSE) +
labs(title = "Age vs Party Name", x = "Age", y = "Party Name")
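To see what geom_count computes, the sketch below runs it on a small hypothetical data frame with repeated (x, y) pairs and inspects the result with layer_data(). The toy data is an assumption made so the counts are easy to verify by eye.

```r
library(ggplot2)

# Hypothetical toy data with repeated (x, y) pairs
toy <- data.frame(
  x = c(1, 1, 1, 2, 2, 3),
  y = c("a", "a", "a", "a", "b", "b")
)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_count()

# layer_data() exposes the computed count n behind each plotted point
counts <- layer_data(p)
counts[, c("x", "y", "n")]
```

Six observations collapse into four points, and the point at x = 1, y = "a" is drawn largest because it stands for three observations.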
The next example loads a new data set, mpg, to demonstrate a jitter plot.
data("mpg")
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Jitter Plot", x = "City MPG", y = "Highway MPG")
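Jitter adds random noise, so the plot above looks slightly different on every run. A sketch of one way to control this is to pass position_jitter() to geom_point() with an explicit width, height, and seed; the values 0.3 and 42 are arbitrary choices for illustration.

```r
library(ggplot2)

# position_jitter() takes the noise width/height (in data units) and a seed
jit <- position_jitter(width = 0.3, height = 0.3, seed = 42)

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(position = jit, alpha = 0.5)

# With a fixed seed, building the plot twice gives the same point positions
same_positions <- identical(layer_data(p)$x, layer_data(p)$x)
same_positions
```

Keeping the jitter smaller than half the spacing between values (here the data are whole numbers) stops points from drifting into a neighbouring value.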
If you have a third variable that you want to represent in the plot, you can use a bubble chart, which maps that variable to the size of each point.
mpg_select <-
mpg[mpg$manufacturer %in% c("audi", "ford", "honda", "hyundai"),]
bubble <- ggplot(mpg_select, aes(displ, cty)) +
geom_point() +
labs(title = "Bubble Chart", x = "Engine Displacement (L)", y = "City MPG")
bubble + geom_point(aes(size = hwy, col = manufacturer)) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal()
Sometimes you want to show how values change over time. You can build an animated bubble chart by loading the gganimate package.
library(gganimate)
library(gapminder)
ggplot(gapminder,
aes(
gdpPercap,
lifeExp,
size = pop,
color = continent
)) +
geom_point(alpha = 0.7, show.legend = FALSE) +
scale_color_manual(values = c("red", "blue", "green", "yellow", "purple", "orange")) +
scale_size(range = c(2, 12)) +
scale_x_log10() +
labs(title = "Bubble Chart", x = "GDP per Capita", y = "Life Expectancy") +
theme_minimal() +
transition_time(year) +
ease_aes('linear')