Tidyverse-Create

Assignment

Create an example: using one or more TidyVerse packages and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.
You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.
After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.
You should complete your submission on the schedule stated in the course syllabus.

Introduction

As of tidyverse 1.3.0, the “library(tidyverse)” command loads eight packages with varying uses for data science: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. Tidyverse 2.0.0, the version attached below, also loads lubridate.
While I have used ggplot2 before, I will explore it further for this assignment because it is perhaps the package that the greatest number of data analyses rely on. In my business experience, almost every project ends with summarizing findings in a graphic or chart.
In addition to ggplot2, I will use the dplyr package from the tidyverse to prepare my data for visualization. dplyr pairs naturally with ggplot2 because it makes filtering, grouping, and summarizing data straightforward.

Imports

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Reading Dataset

Dataset: Age-related information (date of birth, age, generation) for each congressperson in each session of Congress from March 1919 (66th session) to January 2023 (118th session).
Taken from FiveThirtyEight’s GitHub, related to the April 3, 2023 article “Congress Today Is Older Than It’s Ever Been.”

congress <- read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-demographics/data_aging_congress.csv')

## Rows: 29120 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): chamber, state_abbrev, bioname, bioguide_id, generation
## dbl  (6): congress, party_code, cmltv_cong, cmltv_chamber, age_days, age_years
## date (2): start_date, birthday
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
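
The column-specification message above suggests two ways to quiet it. A minimal sketch of both follows, using a small inline CSV so it runs without a network connection. The two rows are made up for illustration and are not taken from the real file; `toy_csv`, `toy_quiet`, and `toy_typed` are hypothetical names.

```r
library(readr)

# Inline CSV wrapped in I() so read_csv() treats it as literal data, not a path.
# These rows are invented for illustration only.
toy_csv <- I("congress,chamber,age_years\n66,House,45.2\n66,Senate,58.9\n")

# Option 1: let readr guess the column types but suppress the message.
toy_quiet <- read_csv(toy_csv, show_col_types = FALSE)

# Option 2: declare the column types explicitly, which also suppresses the message.
toy_typed <- read_csv(
  toy_csv,
  col_types = cols(
    congress  = col_integer(),
    chamber   = col_character(),
    age_years = col_double()
  )
)
```

Declaring types explicitly is the more defensive choice when a column could be guessed differently from one download to the next.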

names(congress)

##  [1] "congress"      "start_date"    "chamber"       "state_abbrev" 
##  [5] "party_code"    "bioname"       "bioguide_id"   "birthday"     
##  [9] "cmltv_cong"    "cmltv_chamber" "age_days"      "age_years"    
## [13] "generation"

Each row in this dataset represents one congressperson in one session of Congress. The dataset has 13 columns, defined on FiveThirtyEight’s GitHub. For easier reference in case a classmate chooses to extend my analysis, I’ll define the columns below.

1. “congress” (The session of Congress a congressperson is in, ranging from 66 to 118. There’s a new session of Congress every two years.)
2. “start_date” (The start date of the session of Congress.)
3. “chamber” (Whether a congressperson is in the Senate or House of Representatives.)
4. “state_abbrev” (Which state a congressperson is from, abbreviated in two-letter postal form, e.g. “CA” for California.)
5. “party_code” (“100” for Democrats, “200” for Republicans, “328” for Independents.)
6. “bioname” (Name of congressperson.)
7. “bioguide_id” (Code used by the Biographical Directory of the US Congress to uniquely identify each member.)
8. “birthday” (Congressperson’s date of birth.)
9. “cmltv_cong” (How many terms of Congress a representative has served in either the Senate or House.)
10. “cmltv_chamber” (How many terms of Congress a representative has served in their current chamber.)
11. “age_days” (Congressperson’s age in days.)
12. “age_years” (Congressperson’s age in years, including decimal places.)
13. “generation” (Generation the congressperson belongs or belonged to, e.g. Greatest, Boomer, Silent, etc.)

Data Preparation

While there are multiple columns I won’t use in my preliminary “CREATE” analysis, I’ll keep all columns from the raw dataset in a dataframe in case a classmate chooses to “EXTEND” my analysis. The raw dataframe will be “congress”, while the dataframe I use will be “congress_ross”.
Below I use base R to create “congress_ross” as a subset of “congress” with only the columns I will use. This way, when I’m creating visualizations in ggplot2, I won’t have to keep narrowing the data down to the necessary columns. I also rename some columns to make them more intuitive.

congress_ross <- congress[c('congress','start_date','chamber','state_abbrev','party_code','birthday','age_years')]
names(congress_ross) <- c('congress_sess','cong_start_date','chamber','state_abbr','party_code','birthday','age_years')
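
The same subsetting and renaming can also be written in one step with dplyr. This is a sketch of an alternative, not a replacement for the base R approach above. The `congress_toy` data frame and its values are made up so the block runs on its own.

```r
library(dplyr)

# Toy stand-in for the raw `congress` data frame (values are invented).
congress_toy <- tibble(
  congress     = c(117, 118),
  start_date   = as.Date(c("2021-01-03", "2023-01-03")),
  state_abbrev = c("CA", "NY"),
  bioname      = c("DOE, Jane", "ROE, Richard")
)

# select() keeps only the named columns; `new_name = old_name` renames as it selects.
congress_small <- congress_toy %>%
  select(
    congress_sess   = congress,
    cong_start_date = start_date,
    state_abbr      = state_abbrev
  )

names(congress_small)
```

Keeping the old and new names side by side in one call makes it harder for the two lists to drift out of order.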

Data Transformation and Visualization with dplyr and ggplot2

dplyr is a highly versatile package for dataframe manipulation. The first capability I’ll demonstrate is how it can “mutate” existing information into new columns.

Using the “mutate” function, I can create a column based on another column as in mutate(new_col = old_col), which would copy old_col and generate new_col as the last column on the dataframe.
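
As a small illustration of that pattern, the sketch below runs mutate on a made-up one-column tibble. The names `toy`, `toy_copy`, and `toy_derived` are hypothetical.

```r
library(dplyr)

toy <- tibble(old_col = c(1, 2, 3))

# Copy a column: new_col is appended as the last column.
toy_copy <- toy %>% mutate(new_col = old_col)

# Derive a column from an expression instead of a plain copy.
toy_derived <- toy %>% mutate(doubled = old_col * 2)
```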

In my case, I’ll use mutate to create new columns by applying functions to existing columns. Using lubridate::year(ymd_field) I can extract the year from a date. Using case_when I can evaluate values in the party_code column and output a string in a new column.

- Creating cong_start_year (from cong_start_date) to show how the average age of congresspeople has changed between sessions.
- Transforming party_code to party_name for intuitiveness and easier charting.

congress_ross <- congress_ross %>%
  dplyr::mutate(cong_start_year = lubridate::year(cong_start_date), 
                party_name = case_when(
                  party_code == 100 ~ "Democrat",
                  party_code == 200 ~ "Republican",
                  party_code == 328 ~ "Independent"
    ))
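
One behavior of case_when worth knowing: any value that matches none of the conditions becomes NA. The sketch below, on made-up codes, adds a catch-all branch. The code 999 and the label "Other" are hypothetical and only illustrate the fallback; I have not checked whether the real data contains other party codes.

```r
library(dplyr)

# Made-up party codes; 999 is a hypothetical code not handled by the first three conditions.
codes <- tibble(party_code = c(100, 200, 328, 999))

labelled <- codes %>%
  mutate(
    party_name = case_when(
      party_code == 100 ~ "Democrat",
      party_code == 200 ~ "Republican",
      party_code == 328 ~ "Independent",
      TRUE              ~ "Other"   # catch-all; without it, 999 would become NA
    )
  )
```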

dplyr can additionally be used to filter and summarise dataframes, producing cuts of data that can then be charted with ggplot2. Below I use the filter function to build separate Republican and Democrat dataframes and to keep only sessions of Congress starting in or after 1960 (because Hawaii and Alaska became states in 1959). In addition, I use dplyr to group by chamber and congress start year, then calculate the average age of the members of Congress within each group.
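
To make the filter, group, and aggregate steps concrete, here is a minimal sketch on a handful of invented rows shaped like congress_ross. The ages are made up; `.groups = "drop"` is an optional argument that returns an ungrouped result.

```r
library(dplyr)

# Invented rows shaped like congress_ross.
toy <- tibble(
  cong_start_year = c(1959, 1961, 1961, 1961, 1963),
  chamber         = c("House", "House", "House", "Senate", "Senate"),
  party_name      = c("Republican", "Republican", "Republican", "Republican", "Democrat"),
  age_years       = c(50, 40, 60, 55, 70)
)

avg_age_toy <- toy %>%
  filter(cong_start_year >= 1960, party_name == "Republican") %>%  # keep matching rows only
  group_by(chamber, cong_start_year) %>%                           # one group per chamber and year
  summarise(avg_age = mean(age_years), .groups = "drop")           # one row per group, ungrouped
```

Here the 1959 row and the Democrat row are filtered out, leaving one House group and one Senate group for 1961.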

I take this filtered and summarised dataframe from dplyr and can then make a highly customized plot with ggplot2. ggplot2 works primarily by passing arguments into the ggplot function, then adding further functions after a plus (“+”) sign. Below, the arguments to ggplot() create a plot with cong_start_year on the x-axis and avg_age on the y-axis, with a separate line color per chamber. Then I add features via functions chained with +: geom_line() to draw the line graph and set the width of the lines; xlab() and ylab() to add axis labels; labs() to add a graph title and caption; guides() for the legend; and scale_y_continuous() for the y-axis scale.
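
The layering idea can be sketched on a tiny made-up summary table. This assumes ggplot2 3.4.0 or later, where the width of a line is set with `linewidth`; the data values and the names `toy`, `p`, and `p2` are hypothetical.

```r
library(ggplot2)

# Made-up summary data: one average age per chamber and year.
toy <- data.frame(
  cong_start_year = rep(c(1961, 1963, 1965), times = 2),
  chamber         = rep(c("House", "Senate"), each = 3),
  avg_age         = c(51, 52, 53, 57, 58, 59)
)

# Start with data and aesthetic mappings, then add one layer or setting per `+`.
p <- ggplot(toy, aes(x = cong_start_year, y = avg_age, color = chamber)) +
  geom_line(linewidth = 2) +                                  # draw the lines and set their width
  labs(x = "Year", y = "Avg Age", title = "Toy example") +    # axis labels and title
  scale_y_continuous(limits = c(40, 70))                      # fix the y-axis range

# The plot is an ordinary object, so further layers can be added later.
p2 <- p + theme_minimal()
```

Because each `+` returns a new plot object, a base plot can be built once and then varied.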

One can find the available parameters for each of these functions by typing ?functionname into the console.

As a final step, I place my two charts side by side so they can be compared more easily. The plot_grid function comes from the cowplot library, which is not part of the tidyverse.

# Average age of Republicans by chamber and session start year, from 1960 onward
repub_age <- congress_ross %>% 
  dplyr::filter(cong_start_year >= 1960, party_name == 'Republican') %>%
  dplyr::group_by(chamber, cong_start_year) %>%
  dplyr::summarise(avg_age = mean(age_years)) %>%
  ggplot(aes(x = cong_start_year, y = avg_age, color = chamber)) +
    geom_line(linewidth = 2) +  # linewidth replaces the deprecated size argument for lines
    xlab("Year Session of Congress Started") +
    ylab("Avg Age of Representatives") +
    labs(title = "Congressional Republicans: Avg Age", caption = "Source: FiveThirtyEight") + 
    guides(color = guide_legend(title = "Chamber")) +  # the legend belongs to the color aesthetic
    scale_y_continuous(limits = c(40, 70))  # y-axis range chosen to cover the average ages in Congress

## `summarise()` has grouped output by 'chamber'. You can override using the
## `.groups` argument.

# Average age of Democrats by chamber and session start year, from 1960 onward
dem_age <- congress_ross %>% 
  dplyr::filter(cong_start_year >= 1960, party_name == 'Democrat') %>%
  dplyr::group_by(chamber, cong_start_year) %>%
  dplyr::summarise(avg_age = mean(age_years)) %>%
  ggplot(aes(x = cong_start_year, y = avg_age, color = chamber)) +
    geom_line(linewidth = 2) +  # linewidth replaces the deprecated size argument for lines
    xlab("Year Session of Congress Started") +
    ylab("Avg Age of Representatives") +
    labs(title = "Congressional Democrats: Avg Age", caption = "Source: FiveThirtyEight") + 
    guides(color = guide_legend(title = "Chamber")) +  # the legend belongs to the color aesthetic
    scale_y_continuous(limits = c(40, 70))  # same y-axis range as the Republican chart

## `summarise()` has grouped output by 'chamber'. You can override using the
## `.groups` argument.

library(cowplot)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:lubridate':
## 
##     stamp

plot_grid(repub_age,dem_age)
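
For readers who want to stay within the tidyverse, faceting is one possible alternative to cowplot for side-by-side panels. The sketch below is an assumption about how that could look, not what the code above does. It uses made-up averages with a party_name column in a single data frame.

```r
library(ggplot2)

# Made-up averages for two parties, shaped like the summaries above.
toy <- data.frame(
  cong_start_year = rep(c(1961, 1963), times = 4),
  chamber         = rep(rep(c("House", "Senate"), each = 2), times = 2),
  party_name      = rep(c("Democrat", "Republican"), each = 4),
  avg_age         = c(52, 53, 58, 59, 51, 52, 57, 58)
)

# facet_wrap() splits one plot into a panel per party, sharing axes by default.
p_facets <- ggplot(toy, aes(x = cong_start_year, y = avg_age, color = chamber)) +
  geom_line() +
  facet_wrap(~ party_name)
```

Faceting needs both parties in one data frame, whereas plot_grid combines plots that were built separately.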

TIDYVERSE Extend Assignment

For the extend assignment, I will use my classmate Ross’s dataset to perform data analysis with dplyr and provide data visualizations with ggplot2.

To start, we can find the sessions with the highest average age for each political party.

# find the average age of Democrats for every congressional start year
average_dem_age <- congress_ross %>%
  filter(cong_start_year >= 1960, party_name == "Democrat") %>%
  group_by(cong_start_year) %>%
  summarise(avg_age = round(mean(age_years), 2))

# pull the 10 years with the highest average age (bare column name, no `$` needed)
top10_dem <- average_dem_age %>%
  slice_max(avg_age, n = 10)
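
A note on slice_max: it accepts a bare column name and keeps ties at the cutoff by default, so it can return more than n rows. The sketch below shows this on made-up averages. The values are invented and the names `avg_toy`, `top2_with_ties`, and `top2_exact` are hypothetical.

```r
library(dplyr)

# Made-up yearly averages with a tie at the cutoff.
avg_toy <- tibble(
  cong_start_year = c(2011, 2013, 2015, 2017),
  avg_age         = c(58, 60, 59, 59)
)

# Default behavior: both rows tied at 59 are kept, so three rows come back.
top2_with_ties <- avg_toy %>% slice_max(avg_age, n = 2)

# with_ties = FALSE returns exactly n rows.
top2_exact <- avg_toy %>% slice_max(avg_age, n = 2, with_ties = FALSE)
```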

#bar chart for democrat
ggplot(top10_dem, aes(x = reorder(cong_start_year, avg_age), y = avg_age)) +
  geom_bar(stat = "identity", show.legend = FALSE, fill = "lightblue", color = "white") + 
  geom_text(aes(label = avg_age), hjust = -0.05, vjust = 0.5, color = "black", size = 2.5) +
  xlab("Congress Start Year") +
  ylab("Average Age of Democrats") +
  ggtitle("Top 10 Years by Average Age: Democrats") +
  coord_flip() +
  theme_minimal()

As we can see, the highest average age for Democrats is 60.6, in 2017. Let’s see what the highest average age is for the Republican party.

# find the average age of Republicans for every congressional start year
average_rep_age <- congress_ross %>%
  filter(cong_start_year >= 1960, party_name == "Republican") %>%
  group_by(cong_start_year) %>%
  summarise(avg_age = round(mean(age_years), 2))

# pull the 10 years with the highest average age (bare column name, no `$` needed)
top10_rep <- average_rep_age %>%
  slice_max(avg_age, n = 10)

#bar chart for republican
ggplot(top10_rep, aes(x = reorder(cong_start_year, avg_age), y = avg_age)) +
  geom_bar(stat = "identity", show.legend = FALSE, fill = "pink", color = "white") + 
  geom_text(aes(label = avg_age), hjust = -0.05, vjust = 0.5, color = "black", size = 2.5) +
  xlab("Congress Start Year") +
  ylab("Average Age of Republicans") +
  ggtitle("Top 10 Years by Average Age: Republicans") +
  coord_flip() +
  theme_minimal()

In comparison, the Republicans’ highest average age is lower than the Democrats’. Their highest average age is 57.75, in 2019. Now let’s see who has served the longest across the House and Senate.

# extract a new data set
congress_new_set <- congress[c("congress","start_date","chamber", "state_abbrev","party_code","bioname", "birthday","cmltv_cong", "age_years", "generation")]

# label each member by party code, extract the start year, and sort in descending order of cumulative Congresses served
congress_new_set <- congress_new_set %>%
  mutate(cong_start_year = year(start_date), 
         party_name = case_when(
           party_code == 100 ~ "Democrat",
           party_code == 200 ~ "Republican",
           party_code == 328 ~ "Independent")) %>%
  arrange(desc(cmltv_cong))

# keep one row per politician; rows are sorted by cmltv_cong descending,
# so the row kept for each name holds that politician's maximum
oldest_10_politician <- congress_new_set %>%
  distinct(bioname, .keep_all = TRUE)

# extract the top 10 longest-serving politicians (ties at the cutoff are kept)
oldest_10_politician <- oldest_10_politician %>%
  slice_max(cmltv_cong, n = 10)
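
The de-duplication step relies on the earlier sort: distinct keeps the first row it sees for each name, so sorting by cmltv_cong in descending order first means the kept row is each member's maximum. A minimal sketch on made-up rows follows, with a grouped alternative that gives the same rows, possibly in a different order. All names and values here are invented.

```r
library(dplyr)

# Made-up rows: one row per member per Congress served.
terms <- tibble(
  bioname    = c("DOE, Jane", "DOE, Jane", "ROE, Richard"),
  cmltv_cong = c(1, 2, 5)
)

# Sort so each member's largest cmltv_cong comes first, then keep the first row per name.
latest <- terms %>%
  arrange(desc(cmltv_cong)) %>%
  distinct(bioname, .keep_all = TRUE)

# Alternative: take the per-member maximum explicitly, without relying on row order.
latest_alt <- terms %>%
  group_by(bioname) %>%
  slice_max(cmltv_cong, n = 1, with_ties = FALSE) %>%
  ungroup()
```

The grouped version is a little longer but does not silently break if the arrange step is removed or moved.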

# lollipop chart for the longest-serving politicians
ggplot(oldest_10_politician, aes(x = reorder(bioname, cmltv_cong), y = cmltv_cong)) +
  geom_segment(aes(x = reorder(bioname, cmltv_cong), xend = reorder(bioname, cmltv_cong), y = 0, yend = cmltv_cong), color = "gray", linewidth = 1.5) +
  geom_point(size = 8.5, shape = 21, fill = "lightgreen", color = "black") +
  geom_text(aes(label = cmltv_cong), color = "black", size = 2.5) +
  xlab("Politician Name") +
  ylab("Cumulative Congresses Served") +
  ggtitle("Top 10 Longest Serving Politicians") +
  coord_flip() +
  theme_minimal()

In conclusion, additional visualizations such as ranked bar charts and lollipop charts show perspectives on the dataset that the original line charts alone do not.