Introduction:

The article discusses the increasing age of the U.S. Congress, noting that it is older now than ever before, with a median age of 59 years. It highlights the challenges this aging body faces, particularly in understanding modern technology, and the impact of having older members on legislative priorities, which tend to focus more on issues concerning older Americans. The article also suggests that the age trend in Congress might plateau or decline as younger generations like Gen X and Millennials gradually replace the aging baby boomers.

The link to the article: https://fivethirtyeight.com/features/aging-congress-boomers/

Reading the whole .csv file. The file is loaded directly from a GitHub repository.

The data has 29,120 rows. The output shows only the first 10 rows.

file_path<-"https://raw.githubusercontent.com/Natacode819/Data-607-Assignment1/main/data_aging_congress.csv"
datacongress<-read.csv(file_path)
head(datacongress, 10)
##    congress start_date chamber state_abbrev party_code                 bioname
## 1        82 1951-01-03   House           ND        200    AANDAHL, Fred George
## 2        80 1947-01-03   House           VA        100 ABBITT, Watkins Moorman
## 3        81 1949-01-03   House           VA        100 ABBITT, Watkins Moorman
## 4        82 1951-01-03   House           VA        100 ABBITT, Watkins Moorman
## 5        83 1953-01-03   House           VA        100 ABBITT, Watkins Moorman
## 6        84 1955-01-03   House           VA        100 ABBITT, Watkins Moorman
## 7        85 1957-01-03   House           VA        100 ABBITT, Watkins Moorman
## 8        86 1959-01-03   House           VA        100 ABBITT, Watkins Moorman
## 9        87 1961-01-03   House           VA        100 ABBITT, Watkins Moorman
## 10       88 1963-01-03   House           VA        100 ABBITT, Watkins Moorman
##    bioguide_id   birthday cmltv_cong cmltv_chamber age_days age_years
## 1      A000001 1897-04-09          1             1    19626  53.73306
## 2      A000002 1908-05-21          1             1    14106  38.62012
## 3      A000002 1908-05-21          2             2    14837  40.62149
## 4      A000002 1908-05-21          3             3    15567  42.62012
## 5      A000002 1908-05-21          4             4    16298  44.62149
## 6      A000002 1908-05-21          5             5    17028  46.62012
## 7      A000002 1908-05-21          6             6    17759  48.62149
## 8      A000002 1908-05-21          7             7    18489  50.62012
## 9      A000002 1908-05-21          8             8    19220  52.62149
## 10     A000002 1908-05-21          9             9    19950  54.62012
##    generation
## 1        Lost
## 2    Greatest
## 3    Greatest
## 4    Greatest
## 5    Greatest
## 6    Greatest
## 7    Greatest
## 8    Greatest
## 9    Greatest
## 10   Greatest
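
The row count quoted above can be confirmed with nrow() or dim(). The sketch below is only an illustration: `sample_csv` is a made-up three-row stand-in for the real file, so the example runs without a network connection. With the real data, nrow(datacongress) should report 29,120.

```r
# Hypothetical three-row stand-in for the real CSV, so the example runs offline.
sample_csv <- "congress,start_date,chamber,age_years
82,1951-01-03,House,53.73306
80,1947-01-03,House,38.62012
97,1981-01-03,Senate,57.88912"

# read.csv() accepts inline text through its `text` argument.
sample_congress <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

n_rows <- nrow(sample_congress)  # number of records
n_cols <- ncol(sample_congress)  # number of variables
str(sample_congress)             # column types at a glance
```

str() is a compact alternative to head() when the goal is to see the column types rather than the values.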

Checking missing values

Statistical programs represent missing data in different ways; R uses NA. To check whether the data contains any missing values, I applied the is.na() function and counted the results per column with colSums(). The output shows that no column in the file has missing data.

na_counts<-colSums(is.na(datacongress))
print(na_counts)
##      congress    start_date       chamber  state_abbrev    party_code 
##             0             0             0             0             0 
##       bioname   bioguide_id      birthday    cmltv_cong cmltv_chamber 
##             0             0             0             0             0 
##      age_days     age_years    generation 
##             0             0             0

Another way to check for missing values is to apply anyNA() to every column with the sapply() function.

na_columns<-sapply(datacongress,anyNA)
print(na_columns)
##      congress    start_date       chamber  state_abbrev    party_code 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##       bioname   bioguide_id      birthday    cmltv_cong cmltv_chamber 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##      age_days     age_years    generation 
##         FALSE         FALSE         FALSE
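
Since the real file has no missing values, both checks above return only zeros and FALSE. The sketch below shows what the same calls report when values are missing. The `toy` data frame is invented for illustration and is not part of the Congress data.

```r
# Invented data frame with two missing values, for illustration only.
toy <- data.frame(
  chamber   = c("House", NA, "Senate"),
  age_years = c(53.7, 38.6, NA),
  congress  = c(82L, 80L, 97L),
  stringsAsFactors = FALSE
)

na_counts_toy <- colSums(is.na(toy))  # number of NA values per column
na_flags_toy  <- sapply(toy, anyNA)   # TRUE for columns containing any NA
total_na      <- sum(is.na(toy))      # number of NA values in the whole table

# complete.cases() marks the rows that have no NA in any column.
complete_rows <- toy[complete.cases(toy), ]
```

colSums(is.na()) says how many values are missing, while anyNA() only says whether any are, which is usually enough for a first check.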

Summary function:

It is often useful to get a basic summary (min, max, median, mean, …) of each variable in the data. For character columns, summary() reports only the length, class, and mode.

summary(datacongress)
##     congress       start_date          chamber          state_abbrev      
##  Min.   : 66.00   Length:29120       Length:29120       Length:29120      
##  1st Qu.: 79.00   Class :character   Class :character   Class :character  
##  Median : 92.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 91.88                                                           
##  3rd Qu.:105.00                                                           
##  Max.   :118.00                                                           
##    party_code      bioname          bioguide_id          birthday        
##  Min.   :100.0   Length:29120       Length:29120       Length:29120      
##  1st Qu.:100.0   Class :character   Class :character   Class :character  
##  Median :100.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :146.7                                                           
##  3rd Qu.:200.0                                                           
##  Max.   :537.0                                                           
##    cmltv_cong     cmltv_chamber       age_days       age_years    
##  Min.   : 1.000   Min.   : 1.000   Min.   : 8644   Min.   :23.67  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:16732   1st Qu.:45.81  
##  Median : 4.000   Median : 4.000   Median :19523   Median :53.45  
##  Mean   : 5.414   Mean   : 5.112   Mean   :19626   Mean   :53.73  
##  3rd Qu.: 8.000   3rd Qu.: 7.000   3rd Qu.:22359   3rd Qu.:61.22  
##  Max.   :30.000   Max.   :30.000   Max.   :35824   Max.   :98.08  
##   generation       
##  Length:29120      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
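
In the summary above, start_date, chamber, and the other text columns are read as character, so summary() has little to report for them. One option, sketched below on an invented `toy` data frame, is to convert dates with as.Date() and categories with factor() before summarising. Whether this conversion suits the rest of the analysis is an assumption.

```r
# Invented rows in the same layout as the real columns, for illustration only.
toy <- data.frame(
  start_date = c("1951-01-03", "1947-01-03", "1981-01-03"),
  chamber    = c("House", "House", "Senate"),
  stringsAsFactors = FALSE
)

# ISO-formatted strings (YYYY-MM-DD) are parsed by as.Date() without a format.
toy$start_date <- as.Date(toy$start_date)
# A factor lets summary() and table() count the records per category.
toy$chamber <- factor(toy$chamber)

summary(toy)                           # now shows a date range and chamber counts
date_range     <- range(toy$start_date)
chamber_counts <- table(toy$chamber)
```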

Checking the class of the data object

class(datacongress)
## [1] "data.frame"

List of all variables in the data

colnames(datacongress)
##  [1] "congress"      "start_date"    "chamber"       "state_abbrev" 
##  [5] "party_code"    "bioname"       "bioguide_id"   "birthday"     
##  [9] "cmltv_cong"    "cmltv_chamber" "age_days"      "age_years"    
## [13] "generation"

Subset of data:

All of rows 1 through 20

subset_data1<-datacongress[1:20, ]
subset_data1
##    congress start_date chamber state_abbrev party_code                 bioname
## 1        82 1951-01-03   House           ND        200    AANDAHL, Fred George
## 2        80 1947-01-03   House           VA        100 ABBITT, Watkins Moorman
## 3        81 1949-01-03   House           VA        100 ABBITT, Watkins Moorman
## 4        82 1951-01-03   House           VA        100 ABBITT, Watkins Moorman
## 5        83 1953-01-03   House           VA        100 ABBITT, Watkins Moorman
## 6        84 1955-01-03   House           VA        100 ABBITT, Watkins Moorman
## 7        85 1957-01-03   House           VA        100 ABBITT, Watkins Moorman
## 8        86 1959-01-03   House           VA        100 ABBITT, Watkins Moorman
## 9        87 1961-01-03   House           VA        100 ABBITT, Watkins Moorman
## 10       88 1963-01-03   House           VA        100 ABBITT, Watkins Moorman
## 11       89 1965-01-03   House           VA        100 ABBITT, Watkins Moorman
## 12       90 1967-01-03   House           VA        100 ABBITT, Watkins Moorman
## 13       91 1969-01-03   House           VA        100 ABBITT, Watkins Moorman
## 14       92 1971-01-03   House           VA        100 ABBITT, Watkins Moorman
## 15       93 1973-01-03   House           SD        200           ABDNOR, James
## 16       94 1975-01-03   House           SD        200           ABDNOR, James
## 17       95 1977-01-03   House           SD        200           ABDNOR, James
## 18       96 1979-01-03   House           SD        200           ABDNOR, James
## 19       97 1981-01-03  Senate           SD        200           ABDNOR, James
## 20       98 1983-01-03  Senate           SD        200           ABDNOR, James
##    bioguide_id   birthday cmltv_cong cmltv_chamber age_days age_years
## 1      A000001 1897-04-09          1             1    19626  53.73306
## 2      A000002 1908-05-21          1             1    14106  38.62012
## 3      A000002 1908-05-21          2             2    14837  40.62149
## 4      A000002 1908-05-21          3             3    15567  42.62012
## 5      A000002 1908-05-21          4             4    16298  44.62149
## 6      A000002 1908-05-21          5             5    17028  46.62012
## 7      A000002 1908-05-21          6             6    17759  48.62149
## 8      A000002 1908-05-21          7             7    18489  50.62012
## 9      A000002 1908-05-21          8             8    19220  52.62149
## 10     A000002 1908-05-21          9             9    19950  54.62012
## 11     A000002 1908-05-21         10            10    20681  56.62149
## 12     A000002 1908-05-21         11            11    21411  58.62012
## 13     A000002 1908-05-21         12            12    22142  60.62149
## 14     A000002 1908-05-21         13            13    22872  62.62012
## 15     A000009 1923-02-13          1             1    18222  49.88912
## 16     A000009 1923-02-13          2             2    18952  51.88775
## 17     A000009 1923-02-13          3             3    19683  53.88912
## 18     A000009 1923-02-13          4             4    20413  55.88775
## 19     A000009 1923-02-13          5             1    21144  57.88912
## 20     A000009 1923-02-13          6             2    21874  59.88775
##    generation
## 1        Lost
## 2    Greatest
## 3    Greatest
## 4    Greatest
## 5    Greatest
## 6    Greatest
## 7    Greatest
## 8    Greatest
## 9    Greatest
## 10   Greatest
## 11   Greatest
## 12   Greatest
## 13   Greatest
## 14   Greatest
## 15   Greatest
## 16   Greatest
## 17   Greatest
## 18   Greatest
## 19   Greatest
## 20   Greatest

Subset of data:

Only the columns “congress”, “start_date”, “chamber”, and “age_years” are selected. The output shows the first 20 rows.

subset_data2<-datacongress[, c("congress","start_date","chamber", "age_years"  ) ]
head(subset_data2,20)
##    congress start_date chamber age_years
## 1        82 1951-01-03   House  53.73306
## 2        80 1947-01-03   House  38.62012
## 3        81 1949-01-03   House  40.62149
## 4        82 1951-01-03   House  42.62012
## 5        83 1953-01-03   House  44.62149
## 6        84 1955-01-03   House  46.62012
## 7        85 1957-01-03   House  48.62149
## 8        86 1959-01-03   House  50.62012
## 9        87 1961-01-03   House  52.62149
## 10       88 1963-01-03   House  54.62012
## 11       89 1965-01-03   House  56.62149
## 12       90 1967-01-03   House  58.62012
## 13       91 1969-01-03   House  60.62149
## 14       92 1971-01-03   House  62.62012
## 15       93 1973-01-03   House  49.88912
## 16       94 1975-01-03   House  51.88775
## 17       95 1977-01-03   House  53.88912
## 18       96 1979-01-03   House  55.88775
## 19       97 1981-01-03  Senate  57.88912
## 20       98 1983-01-03  Senate  59.88775

Subset of data:

All records where age_years is greater than 65.

The output shows only the first 20 rows.

subset_data3<-datacongress[datacongress$age_years>65,]
head(subset_data3,20)
##     congress start_date chamber state_abbrev party_code
## 32       109 2005-01-03   House           HI        100
## 33       110 2007-01-03   House           HI        100
## 34       111 2009-01-03   House           HI        100
## 55        91 1969-01-03   House           MS        100
## 56        92 1971-01-03   House           MS        100
## 69        71 1929-03-04   House           NJ        200
## 70        72 1931-03-04   House           NJ        200
## 84       111 2009-01-03   House           NY        100
## 85       112 2011-01-03   House           NY        100
## 103       77 1941-01-03  Senate           CO        100
## 152       71 1929-03-04   House           IL        200
## 153       72 1931-03-04   House           IL        200
## 163       86 1959-01-03  Senate           VT        200
## 164       87 1961-01-03  Senate           VT        200
## 165       88 1963-01-03  Senate           VT        200
## 166       89 1965-01-03  Senate           VT        200
## 167       90 1967-01-03  Senate           VT        200
## 168       91 1969-01-03  Senate           VT        200
## 169       92 1971-01-03  Senate           VT        200
## 170       93 1973-01-03  Senate           VT        200
##                       bioname bioguide_id   birthday cmltv_cong cmltv_chamber
## 32          ABERCROMBIE, Neil     A000014 1938-06-26          9             9
## 33          ABERCROMBIE, Neil     A000014 1938-06-26         10            10
## 34          ABERCROMBIE, Neil     A000014 1938-06-26         11            11
## 55  ABERNETHY, Thomas Gerstle     A000016 1903-05-16         14            14
## 56  ABERNETHY, Thomas Gerstle     A000016 1903-05-16         15            15
## 69  ACKERMAN, Ernest Robinson     A000021 1863-06-17          6             6
## 70  ACKERMAN, Ernest Robinson     A000021 1863-06-17          7             7
## 84     ACKERMAN, Gary Leonard     A000022 1942-11-19         14            14
## 85     ACKERMAN, Gary Leonard     A000022 1942-11-19         15            15
## 103     ADAMS, Alva Blanchard     A000028 1875-10-29          6             6
## 152           ADKINS, Charles     A000057 1863-02-07          3             3
## 153           ADKINS, Charles     A000057 1863-02-07          4             4
## 163       AIKEN, George David     A000062 1892-08-20         10            10
## 164       AIKEN, George David     A000062 1892-08-20         11            11
## 165       AIKEN, George David     A000062 1892-08-20         12            12
## 166       AIKEN, George David     A000062 1892-08-20         13            13
## 167       AIKEN, George David     A000062 1892-08-20         14            14
## 168       AIKEN, George David     A000062 1892-08-20         15            15
## 169       AIKEN, George David     A000062 1892-08-20         16            16
## 170       AIKEN, George David     A000062 1892-08-20         17            17
##     age_days age_years generation
## 32     24298  66.52430     Silent
## 33     25028  68.52293     Silent
## 34     25759  70.52430     Silent
## 55     23974  65.63723   Greatest
## 56     24704  67.63587   Greatest
## 69     24001  65.71116 Missionary
## 70     24731  67.70979 Missionary
## 84     24152  66.12457     Silent
## 85     24882  68.12320     Silent
## 103    23807  65.18001 Missionary
## 152    24131  66.06708 Missionary
## 153    24861  68.06571 Missionary
## 163    24241  66.36824       Lost
## 164    24972  68.36961       Lost
## 165    25702  70.36824       Lost
## 166    26433  72.36961       Lost
## 167    27163  74.36824       Lost
## 168    27894  76.36961       Lost
## 169    28624  78.36824       Lost
## 170    29355  80.36961       Lost
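
The three subsets above use bracket indexing by row number, by column name, and by a logical condition. The sketch below, on an invented `toy` data frame, shows one caveat of the logical form: if the tested column contains NA, bracket indexing returns an extra all-NA row, while subset() and which() drop it. The real file has no missing values, so the results there are the same either way.

```r
# Invented records, including one missing age, for illustration only.
toy <- data.frame(
  bioname   = c("A", "B", "C", "D"),
  chamber   = c("House", "Senate", "Senate", "House"),
  age_years = c(66.5, 70.1, NA, 45.2),
  stringsAsFactors = FALSE
)

# Bracket indexing keeps a row of NAs where the condition evaluates to NA.
bracket_rows <- toy[toy$age_years > 65, ]

# subset() and which() both discard rows where the condition is NA.
subset_rows <- subset(toy, age_years > 65)
which_rows  <- toy[which(toy$age_years > 65), ]

# Conditions and column selection can be combined in one subset() call.
older_senators <- subset(toy, age_years > 65 & chamber == "Senate",
                         select = c(bioname, age_years))
```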

Grouping of records based on the “bioname” column

Since the same name is listed multiple times (once per Congress served), I grouped the records by name and kept only the record with the latest date in the “start_date” column for each name.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df_latest<-datacongress%>%group_by(bioname)%>%
  filter(start_date==max(start_date))%>%ungroup()
head(df_latest,20)
## # A tibble: 20 × 13
##    congress start_date chamber state_abbrev party_code bioname       bioguide_id
##       <int> <chr>      <chr>   <chr>             <int> <chr>         <chr>      
##  1       82 1951-01-03 House   ND                  200 AANDAHL, Fre… A000001    
##  2       92 1971-01-03 House   VA                  100 ABBITT, Watk… A000002    
##  3       99 1985-01-03 Senate  SD                  200 ABDNOR, James A000009    
##  4       83 1953-01-03 Senate  NE                  200 ABEL, Hazel … A000010    
##  5       88 1963-01-03 House   OH                  200 ABELE, Homer… A000011    
##  6      111 2009-01-03 House   HI                  100 ABERCROMBIE,… A000014    
##  7       73 1933-03-04 House   NC                  100 ABERNETHY, C… A000015    
##  8       92 1971-01-03 House   MS                  100 ABERNETHY, T… A000016    
##  9       95 1977-01-03 Senate  SD                  100 ABOUREZK, Ja… A000017    
## 10       94 1975-01-03 House   NY                  100 ABZUG, Bella… A000018    
## 11       72 1931-03-04 House   NJ                  200 ACKERMAN, Er… A000021    
## 12      112 2011-01-03 House   NY                  100 ACKERMAN, Ga… A000022    
## 13       91 1969-01-03 House   IN                  200 ADAIR, Edwin… A000024    
## 14       74 1935-01-03 House   IL                  100 ADAIR, Jacks… A000025    
## 15       77 1941-01-03 Senate  CO                  100 ADAMS, Alva … A000028    
## 16      102 1991-01-03 Senate  WA                  100 ADAMS, Brock… A000031    
## 17       79 1945-01-03 House   NH                  200 ADAMS, Sherm… A000046    
## 18       73 1933-03-04 House   DE                  100 ADAMS, Wilbu… A000050    
## 19       99 1985-01-03 House   NY                  100 ADDABBO, Jos… A000052    
## 20       87 1961-01-03 House   NJ                  100 ADDONIZIO, H… A000054    
## # ℹ 6 more variables: birthday <chr>, cmltv_cong <int>, cmltv_chamber <int>,
## #   age_days <int>, age_years <dbl>, generation <chr>
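
Grouping by bioname treats two different people with an identical name as one member. The data also has a bioguide_id column, and the sketch below groups on it instead, assuming that bioguide_id identifies each member uniquely. The `toy` rows and the IDs in them are invented to show the difference.

```r
library(dplyr)

# Invented rows: two different (hypothetical) members share the same name.
toy <- data.frame(
  bioname     = c("SMITH, John", "SMITH, John", "SMITH, John"),
  bioguide_id = c("X000001", "X000001", "X000002"),
  start_date  = c("1951-01-03", "1953-01-03", "1999-01-03"),
  stringsAsFactors = FALSE
)

# Grouping by name merges both people and keeps a single row.
latest_by_name <- toy %>%
  group_by(bioname) %>%
  filter(start_date == max(start_date)) %>%
  ungroup()

# Grouping by ID keeps the latest row for each person.
latest_by_id <- toy %>%
  group_by(bioguide_id) %>%
  filter(start_date == max(start_date)) %>%
  ungroup()
```

Comparing start_date as text works here because the dates are stored as YYYY-MM-DD, which sorts in chronological order.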

Filtering records

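For the graph, I kept only the records from the “House” chamber. Note that df_latest holds one row per member (the last Congress in which they served), so the medians below describe members in their final term.

I then applied the summarise() function to get the median age for each start date.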

df_house<-df_latest%>%filter(chamber=="House")
df_house_year<-df_house%>%group_by(start_date)%>%summarise(age=median(age_years))
df_house_result<-data.frame(df_house_year, type="House")
head(df_house_result,20)
##    start_date      age  type
## 1  1919-03-04 50.67488 House
## 2  1921-03-04 51.18138 House
## 3  1923-03-04 52.51745 House
## 4  1925-03-04 54.89528 House
## 5  1927-03-04 56.43806 House
## 6  1929-03-04 56.68720 House
## 7  1931-03-04 56.78850 House
## 8  1933-03-04 56.60507 House
## 9  1935-01-03 55.28268 House
## 10 1937-01-03 50.95414 House
## 11 1939-01-03 54.27789 House
## 12 1941-01-03 50.55715 House
## 13 1943-01-03 54.08077 House
## 14 1945-01-03 52.87337 House
## 15 1947-01-03 53.25941 House
## 16 1949-01-03 54.63655 House
## 17 1951-01-03 55.76044 House
## 18 1953-01-03 55.65503 House
## 19 1955-01-03 56.16975 House
## 20 1957-01-03 60.40246 House

Filtering records

Next, I kept only the records from the “Senate” chamber.

The summarise() function is applied again to get the median age for each start date.

df_Senate<-df_latest%>%filter(chamber=="Senate")
df_Senate_year<-df_Senate%>%group_by(start_date)%>%summarise(age=median(age_years))
df_Senate_result<-data.frame(df_Senate_year, type="Senate")
head(df_Senate_result,20)
##    start_date      age   type
## 1  1919-03-04 61.16359 Senate
## 2  1921-03-04 59.26626 Senate
## 3  1923-03-04 64.02464 Senate
## 4  1925-03-04 62.82546 Senate
## 5  1927-03-04 64.79671 Senate
## 6  1929-03-04 62.46954 Senate
## 7  1931-03-04 61.67693 Senate
## 8  1933-03-04 61.12936 Senate
## 9  1935-01-03 60.50650 Senate
## 10 1937-01-03 57.80835 Senate
## 11 1939-01-03 64.27105 Senate
## 12 1941-01-03 60.50787 Senate
## 13 1943-01-03 54.42300 Senate
## 14 1945-01-03 62.84736 Senate
## 15 1947-01-03 60.07666 Senate
## 16 1949-01-03 60.85421 Senate
## 17 1951-01-03 54.12457 Senate
## 18 1953-01-03 63.00616 Senate
## 19 1955-01-03 61.67146 Senate
## 20 1957-01-03 60.82683 Senate

Combined filtered records

The two filtered result sets are combined so that they can be shown together in one graph.

df_combined<-rbind(df_house_result, df_Senate_result)
head(df_combined,20)
##    start_date      age  type
## 1  1919-03-04 50.67488 House
## 2  1921-03-04 51.18138 House
## 3  1923-03-04 52.51745 House
## 4  1925-03-04 54.89528 House
## 5  1927-03-04 56.43806 House
## 6  1929-03-04 56.68720 House
## 7  1931-03-04 56.78850 House
## 8  1933-03-04 56.60507 House
## 9  1935-01-03 55.28268 House
## 10 1937-01-03 50.95414 House
## 11 1939-01-03 54.27789 House
## 12 1941-01-03 50.55715 House
## 13 1943-01-03 54.08077 House
## 14 1945-01-03 52.87337 House
## 15 1947-01-03 53.25941 House
## 16 1949-01-03 54.63655 House
## 17 1951-01-03 55.76044 House
## 18 1953-01-03 55.65503 House
## 19 1955-01-03 56.16975 House
## 20 1957-01-03 60.40246 House
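
The filter, summarise, and rbind() steps above can also be written as a single grouped summary. The sketch below is an alternative, not the method used in this report: it groups by both start_date and chamber, and it takes the full table as input so that every sitting member counts toward the median, rather than only the final-term rows kept in df_latest. The `toy` data frame is invented for illustration.

```r
library(dplyr)

# Invented member records for two Congresses, for illustration only.
toy <- data.frame(
  start_date = c(rep("1951-01-03", 4), rep("1953-01-03", 3)),
  chamber    = c("House", "House", "House", "Senate",
                 "House", "Senate", "Senate"),
  age_years  = c(40, 50, 60, 62, 45, 58, 66),
  stringsAsFactors = FALSE
)

# One row per start date and chamber, holding the median age of that group.
medians <- toy %>%
  group_by(start_date, chamber) %>%
  summarise(age = median(age_years), .groups = "drop")

medians
```

With the real data, the equivalent call would start from datacongress instead of `toy`, and the chamber column would take the place of the type column in the plot.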

Graph

A line graph was selected to show the median age of House and Senate members over the years.

library(ggplot2)
g<-ggplot(df_combined,aes(x=start_date, y=age, color=type, group=type))+
  geom_line(linewidth=1)+
  geom_point(size=3)+
  labs(title="Median age of the U.S. Senate and U.S. House by Congress, 1919 to 2023", x="Years", y="Age")
g
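
Because start_date is stored as character, ggplot2 treats the x axis as discrete and prints every start date as its own tick label. The sketch below shows one possible refinement, assuming ggplot2 3.4.0 or later: convert the dates with as.Date() and use scale_x_date() to label the axis by year. The `toy` values are invented and do not come from the Congress data.

```r
library(ggplot2)

# Invented medians for two Congresses, for illustration only.
toy <- data.frame(
  start_date = as.Date(c("1951-01-03", "1953-01-03",
                         "1951-01-03", "1953-01-03")),
  age        = c(50, 45, 62, 62),
  type       = c("House", "House", "Senate", "Senate")
)

p <- ggplot(toy, aes(x = start_date, y = age, color = type)) +
  geom_line(linewidth = 1) +            # linewidth replaces size for lines
  geom_point(size = 3) +
  scale_x_date(date_labels = "%Y") +    # continuous date axis labelled by year
  labs(title = "Median age by chamber (toy data)",
       x = "Start of Congress", y = "Median age (years)", color = "Chamber")

p
```

With a continuous date axis the explicit group aesthetic is no longer needed, since the color mapping already splits the lines by chamber.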

Conclusion:

The provided graph shows that “Congress today is older than it’s ever been”.