For this weekly assignment I will look at data on the ages of members of Congress over time. This is of particular relevance because Congress today is older than it has ever been.
https://fivethirtyeight.com/features/aging-congress-boomers/
data_aging_congress.csv contains information about the age of every member of the U.S. Senate and House from the 66th Congress (1919-1921) to the 118th Congress (2023-2025). Data is as of March 29, 2023, and is based on all voting members who served in either the Senate or House in each Congress. The data excludes delegates or resident commissioners from non-states. Any member who served in both chambers in the same Congress was assigned to the chamber in which they cast more votes. We began with the 66th Congress because it was the first Congress in which all senators had been directly elected, rather than elected by state legislatures, following the ratification of the 17th Amendment in 1913.
Our goal with this data set is to see whether there are any meaningful trends in the historical age of the US Congress. Time permitting, I will also look at which traits are correlated with older members of Congress.
The code below loads the tidyverse library and imports the raw CSV data, which is posted on FiveThirtyEight's GitHub page:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
urlfile="https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-demographics/data_aging_congress.csv"
CongressionalAge<-read_csv(url(urlfile))
## Rows: 29120 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): chamber, state_abbrev, bioname, bioguide_id, generation
## dbl (6): congress, party_code, cmltv_cong, cmltv_chamber, age_days, age_years
## date (2): start_date, birthday
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df <- tibble(CongressionalAge)
summary(df)
## congress start_date chamber state_abbrev
## Min. : 66.00 Min. :1919-03-04 Length:29120 Length:29120
## 1st Qu.: 79.00 1st Qu.:1945-01-03 Class :character Class :character
## Median : 92.00 Median :1971-01-03 Mode :character Mode :character
## Mean : 91.88 Mean :1970-10-18
## 3rd Qu.:105.00 3rd Qu.:1997-01-03
## Max. :118.00 Max. :2023-01-03
## party_code bioname bioguide_id birthday
## Min. :100.0 Length:29120 Length:29120 Min. :1835-06-10
## 1st Qu.:100.0 Class :character Class :character 1st Qu.:1891-12-21
## Median :100.0 Mode :character Mode :character Median :1918-11-22
## Mean :146.7 Mean :1917-01-24
## 3rd Qu.:200.0 3rd Qu.:1943-05-16
## Max. :537.0 Max. :1997-01-17
## cmltv_cong cmltv_chamber age_days age_years
## Min. : 1.000 Min. : 1.000 Min. : 8644 Min. :23.67
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.:16732 1st Qu.:45.81
## Median : 4.000 Median : 4.000 Median :19523 Median :53.45
## Mean : 5.414 Mean : 5.112 Mean :19626 Mean :53.73
## 3rd Qu.: 8.000 3rd Qu.: 7.000 3rd Qu.:22359 3rd Qu.:61.22
## Max. :30.000 Max. :30.000 Max. :35824 Max. :98.08
## generation
## Length:29120
## Class :character
## Mode :character
##
##
##
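As a quick sanity check on the two age columns: the values printed in this document are consistent with age_days being the number of days between birthday and start_date, and age_years being age_days divided by 365.25. This is inferred from the printed values rather than taken from the README, so treat both relationships as assumptions. A minimal sketch in base R, using two rows copied from the glimpse() output in this document:

```r
# Two rows copied from the glimpse() output in this document
check <- data.frame(
  start_date = as.Date(c("1951-01-03", "1947-01-03")),
  birthday   = as.Date(c("1897-04-09", "1908-05-21")),
  age_days   = c(19626, 14106),
  age_years  = c(53.73306, 38.62012)
)

# Assumption: age_days is the day count from birthday to start_date
check$days_recomputed <- as.numeric(check$start_date - check$birthday)

# Assumption: age_years uses a 365.25-day year
check$years_recomputed <- check$age_days / 365.25

print(check)
```

If the recomputed columns match the originals, either column can be used for the analysis without worrying about inconsistent age definitions.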
# Rename the column 'party_code' to 'party' and replace the numeric codes with the party names given in the README file.
# Codes other than 100, 200 and 328 become NA.
names(df)[names(df) == "party_code"] <- "party"
df$party <- ifelse(df$party == 100, "Democrat",
            ifelse(df$party == 200, "Republican",
            ifelse(df$party == 328, "Independent", NA)))
glimpse(df)
## Rows: 29,120
## Columns: 13
## $ congress <dbl> 82, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, …
## $ start_date <date> 1951-01-03, 1947-01-03, 1949-01-03, 1951-01-03, 1953-01…
## $ chamber <chr> "House", "House", "House", "House", "House", "House", "H…
## $ state_abbrev <chr> "ND", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "V…
## $ party <chr> "Republican", "Democrat", "Democrat", "Democrat", "Democ…
## $ bioname <chr> "AANDAHL, Fred George", "ABBITT, Watkins Moorman", "ABBI…
## $ bioguide_id <chr> "A000001", "A000002", "A000002", "A000002", "A000002", "…
## $ birthday <date> 1897-04-09, 1908-05-21, 1908-05-21, 1908-05-21, 1908-05…
## $ cmltv_cong <dbl> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4…
## $ cmltv_chamber <dbl> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4…
## $ age_days <dbl> 19626, 14106, 14837, 15567, 16298, 17028, 17759, 18489, …
## $ age_years <dbl> 53.73306, 38.62012, 40.62149, 42.62012, 44.62149, 46.620…
## $ generation <chr> "Lost", "Greatest", "Greatest", "Greatest", "Greatest", …
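The nested ifelse() above maps three codes and turns everything else into NA. One alternative that may be easier to extend is a named lookup vector, sketched below in base R. The three codes are the ones recoded in this document; the sample vector `codes` is hypothetical, and 537 is used only because summary() reports it as the maximum party code.

```r
# Code-to-label lookup; the three codes are the ones recoded in this document
party_labels <- c("100" = "Democrat", "200" = "Republican", "328" = "Independent")

# Hypothetical sample of raw party codes (537 is the maximum shown by summary())
codes <- c(100, 200, 328, 537, 100)

# Index the lookup by the code as a string; codes not in the lookup come back as NA
party <- unname(party_labels[as.character(codes)])

# Tabulate the labels, keeping NA visible so unmapped codes are not silently lost
table(party, useNA = "ifany")
```

Keeping the NA count visible makes it easy to see how many rows belong to parties that were not mapped.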
My goal here is to see whether we can establish a trend in the age of members of Congress over time. Using ggplot, I plot the age of every member against the Congress (the group of members who served together) in which they served.
ggplot(data = df, aes(x = congress, y = age_years)) +
  geom_jitter(width = 0.1, height = 0, size = 2, alpha = 0.6, color = "darkblue") +
  xlab("Congress") +
  ylab("Age") +
  theme_minimal()
The above plot is not the most practical, so I am going to try a box and whisker plot instead:
# Create a single graph with multiple box and whisker plots
ggplot(data = df, aes(x = factor(congress), y = age_years, group = congress)) +
  geom_boxplot(width = 0.5, fill = "lightblue", color = "darkblue", alpha = 0.6) +
  xlab("Congress") +
  ylab("Age") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 6))
# Group by congress and calculate mean and median age
sortedCongress <- df %>%
  group_by(congress) %>%
  summarize(avg_age = mean(age_years, na.rm = TRUE),
            median_age = median(age_years, na.rm = TRUE))
# Plotting the average and median age
ggplot(sortedCongress, aes(x = congress)) +
  geom_point(aes(y = avg_age, color = "Average Age")) +
  geom_point(aes(y = median_age, color = "Median Age")) +
  xlab("Congress") +
  ylab("Age") +
  scale_color_manual(values = c("blue", "red"), guide = guide_legend(title = NULL)) +
  theme_minimal()
# Create a line graph for average and median age
ggplot(sortedCongress, aes(x = congress)) +
  geom_line(aes(y = avg_age, color = "Average Age")) +
  geom_line(aes(y = median_age, color = "Median Age")) +
  xlab("Congress") +
  ylab("Age") +
  scale_color_manual(values = c("blue", "red"), guide = guide_legend(title = NULL)) +
  theme_minimal()
# Display the summary of sortedCongress
glimpse(sortedCongress)
## Rows: 53
## Columns: 3
## $ congress <dbl> 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,…
## $ avg_age <dbl> 51.73002, 52.60360, 52.56625, 53.22577, 53.95525, 54.61430,…
## $ median_age <dbl> 51.02533, 51.76044, 52.22450, 53.25530, 54.29158, 54.99521,…
The plots reveal several key pieces of information. The current Congress has the highest average age in the period covered by the data, that is, since senators began to be directly elected. This continues a trend that began around the 97th Congress, which has the lowest average age in the data set. The average age of members of Congress has risen steadily since then to its current value.
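Rather than reading the lowest and highest points off the chart, they can be looked up directly in the per-Congress summary. The helper below is a hypothetical sketch in base R (the name `age_extremes` is not part of the original analysis), and the toy input uses made-up values shaped like sortedCongress.

```r
# Hypothetical helper: return the Congress with the lowest and highest mean age
age_extremes <- function(summary_df) {
  lowest  <- summary_df[which.min(summary_df$avg_age), ]
  highest <- summary_df[which.max(summary_df$avg_age), ]
  data.frame(
    extreme  = c("lowest", "highest"),
    congress = c(lowest$congress, highest$congress),
    avg_age  = c(lowest$avg_age, highest$avg_age)
  )
}

# Toy input with made-up values, shaped like sortedCongress
toy <- data.frame(congress = c(1, 2, 3, 4), avg_age = c(52, 49, 55, 58))
res <- age_extremes(toy)
print(res)
```

On the real data this would be called as age_extremes(sortedCongress), which should confirm or correct the Congress numbers quoted in the text.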
With the code in this document we have looked at the trend in Congressional ages over time and found that the current Congress is in fact the oldest in the data set. Below I look at how individual states and generations have contributed to this trend. In the future, I would like to create separate trendlines for the two major parties to evaluate whether either has historically preferred older candidates.
I would want to split the data set by the party affiliation, state, and generation data available in the CSV.
I was able to get some meaningful results by state and generation; however, I would need to modify the code below further to break the data down effectively by party affiliation.
# Filter data for Democrat and Republican parties
democrat_df <- df %>%
  filter(party == "Democrat")
republican_df <- df %>%
  filter(party == "Republican")
# Summarize mean and median age per Congress for each party
DemocratsortedCongress <- democrat_df %>%
  group_by(congress) %>%
  summarize(avg_age = mean(age_years, na.rm = TRUE),
            median_age = median(age_years, na.rm = TRUE))
RepublicansortedCongress <- republican_df %>%
  group_by(congress) %>%
  summarize(avg_age = mean(age_years, na.rm = TRUE),
            median_age = median(age_years, na.rm = TRUE))
# Create a line graph for average and median age (Democrats only)
ggplot(DemocratsortedCongress, aes(x = congress)) +
  geom_line(aes(y = avg_age, color = "Average Age")) +
  geom_line(aes(y = median_age, color = "Median Age")) +
  xlab("Congress") +
  ylab("Age") +
  scale_color_manual(values = c("blue", "red"), guide = guide_legend(title = NULL)) +
  theme_minimal()
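One possible way to get the two party trendlines onto a single chart is to group by both congress and party in one summary, then map party to the line color. The sketch below assumes the party column has already been recoded to names as earlier in this document; `toy_df` is a made-up stand-in for df and `party_trend` is a hypothetical name.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for df with made-up ages; the real df has the same three columns
toy_df <- data.frame(
  congress  = c(1, 1, 1, 1, 2, 2, 2, 2),
  party     = c("Democrat", "Democrat", "Republican", "Republican",
                "Democrat", "Democrat", "Republican", NA),
  age_years = c(50, 54, 48, 52, 56, 58, 51, 60)
)

# One row per Congress and party, restricted to the two major parties
party_trend <- toy_df %>%
  filter(party %in% c("Democrat", "Republican")) %>%
  group_by(congress, party) %>%
  summarize(avg_age = mean(age_years, na.rm = TRUE),
            median_age = median(age_years, na.rm = TRUE),
            .groups = "drop")

# One line per party on a shared axis; calling print(p) draws the chart
p <- ggplot(party_trend, aes(x = congress, y = avg_age, color = party)) +
  geom_line() +
  scale_color_manual(values = c(Democrat = "blue", Republican = "red")) +
  xlab("Congress") +
  ylab("Average age") +
  theme_minimal()
```

Replacing toy_df with df should give the real comparison. A single grouped summary also avoids maintaining two nearly identical per-party data frames.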
Looking at the data by state, the table below ranks the states by the average age of the members of Congress elected from each state, from oldest to youngest.
## state_abbrev avg_age
## 46 VT 60.62154
## 11 HI 59.18275
## 1 AK 59.04962
## 50 WY 58.32722
## 27 NC 56.56834
## 28 ND 56.29585
## 45 VA 56.25488
## 49 WV 55.99294
## 29 NE 55.90023
## 8 DE 55.68293
## 44 UT 55.63519
## 33 NV 55.17901
## 12 IA 55.06993
## 4 AZ 54.87047
## 17 KY 54.78432
## 14 IL 54.77631
## 26 MT 54.77056
## 20 MD 54.49438
## 6 CO 54.40649
## 5 CA 54.40593
## 37 OR 54.39561
## 16 KS 54.35446
## 40 SC 54.31028
## 31 NJ 54.12764
## 19 MA 53.96023
## 38 PA 53.92870
## 10 GA 53.86750
## 13 ID 53.69074
## 43 TX 53.67426
## 9 FL 53.55584
## 39 RI 53.49647
## 2 AL 53.26870
## 32 NM 53.19441
## 24 MO 53.17416
## 35 OH 53.16196
## 30 NH 52.86027
## 34 NY 52.79494
## 21 ME 52.72298
## 42 TN 52.71196
## 22 MI 52.69646
## 25 MS 52.68117
## 3 AR 52.41892
## 47 WA 52.40123
## 7 CT 51.66845
## 48 WI 51.59190
## 41 SD 51.45681
## 18 LA 50.94785
## 23 MN 50.90118
## 15 IN 50.62130
## 36 OK 50.12730
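The code that produced the state table is not shown in this document. A table of that shape could be built along the following lines; this is a sketch that mirrors the aggregate() and order() pattern used for the generation summary, not necessarily the code that was actually run. `toy_df` holds made-up ages for three states.

```r
# Toy stand-in for df with made-up ages for three states
toy_df <- data.frame(
  state_abbrev = c("VT", "VT", "OK", "OK", "NY"),
  age_years    = c(62, 58, 49, 51, 53)
)

# Mean age per state
state_avg_age <- aggregate(age_years ~ state_abbrev, data = toy_df, FUN = mean)
colnames(state_avg_age) <- c("state_abbrev", "avg_age")

# Sort from oldest to youngest average
state_avg_age <- state_avg_age[order(-state_avg_age$avg_age), ]
print(state_avg_age)
```

Because aggregate() returns the states in alphabetical order, the row names that survive the sort are the alphabetical positions, which matches the leading numbers in the table above.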
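I also wanted to look at the data by generation. The general pattern seems to be that older generations have a higher average age. This is partly an artifact of the data window: the data set starts in 1919, so the earliest generations appear only near the end of their careers, while the youngest generations appear only at the start of theirs.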
# Calculate average age for each generation
generation_avg_age <- aggregate(age_years ~ generation, data = df, FUN = mean, na.rm = TRUE)
# Count the rows (one per member per Congress) for each generation
generation_count <- aggregate(age_years ~ generation, data = df, FUN = length)
# Merge the two aggregated data frames
generation_summary <- merge(generation_avg_age, generation_count, by = "generation")
# Rename the columns in the result
colnames(generation_summary) <- c("generation", "avg_age", "num_congressmen")
# Sort the data frame by average age in descending order
generation_summary <- generation_summary[order(-generation_summary$avg_age), ]
# Print the sorted data frame
generation_summary
## generation avg_age num_congressmen
## 4 Gilded 82.60151 15
## 9 Progressive 68.11945 485
## 8 Missionary 57.88836 4768
## 10 Silent 54.02800 5601
## 6 Lost 53.36611 4732
## 1 Boomers 53.11984 5108
## 5 Greatest 52.10814 7147
## 2 Gen X 44.84130 1130
## 7 Millennial 36.18125 133
## 3 Gen Z 25.96030 1
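Since the data has one row per member per Congress, the num_congressmen column above counts member-terms rather than distinct people. A possible variation that reports both, in a single dplyr pipeline, is sketched below. It assumes bioguide_id identifies a person uniquely, which the repeated ids in the glimpse() output suggest; `toy_df` and `generation_summary2` are hypothetical names with made-up values.

```r
library(dplyr)

# Toy stand-in for df; ids and ages are made up
toy_df <- data.frame(
  bioguide_id = c("X1", "X1", "X2", "Y1", "Y1", "Y1"),
  generation  = c("Silent", "Silent", "Silent", "Boomers", "Boomers", "Boomers"),
  age_years   = c(50, 52, 60, 40, 42, 44)
)

# Mean age, row count, and distinct-member count per generation in one pass
generation_summary2 <- toy_df %>%
  group_by(generation) %>%
  summarize(avg_age = mean(age_years, na.rm = TRUE),
            num_rows = n(),
            num_members = n_distinct(bioguide_id),
            .groups = "drop") %>%
  arrange(desc(avg_age))

print(generation_summary2)
```

Comparing num_rows with num_members shows how much long-serving members weigh on each generation's average.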
# Repeat the analysis using the cumulative number of Congresses served (cmltv_cong) instead of age.
# Calculate the average cumulative number of Congresses served for each generation
generation_cmltv_cong <- aggregate(cmltv_cong ~ generation, data = df, FUN = mean, na.rm = TRUE)
# Count the rows (one per member per Congress) for each generation
generation_count <- aggregate(cmltv_cong ~ generation, data = df, FUN = length)
# Merge the two aggregated data frames
generation_summary_cmltv_cong <- merge(generation_cmltv_cong, generation_count, by = "generation")
# Rename the columns in the result
colnames(generation_summary_cmltv_cong) <- c("generation", "cmltv_cong", "num_congressmen")
# Sort the data frame by average cumulative Congresses served in descending order
generation_summary_cmltv_cong <- generation_summary_cmltv_cong[order(-generation_summary_cmltv_cong$cmltv_cong), ]
# Print the sorted data frame
generation_summary_cmltv_cong
## generation cmltv_cong num_congressmen
## 4 Gilded 10.933333 15
## 9 Progressive 7.340206 485
## 10 Silent 6.105517 5601
## 5 Greatest 5.726459 7147
## 8 Missionary 5.379195 4768
## 6 Lost 5.114117 4732
## 1 Boomers 4.925020 5108
## 2 Gen X 3.154867 1130
## 7 Millennial 1.721805 133
## 3 Gen Z 1.000000 1