---
title: "Congressional Representation"
author: "John Ladd"
date: "25 October 2021"
output:
  html_document:
    code_folding: hide
---

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──

## ✓ broom        0.7.12     ✓ rsample      0.1.1 
## ✓ dials        0.1.0      ✓ tune         0.2.0 
## ✓ infer        1.0.0      ✓ workflows    0.2.6 
## ✓ modeldata    0.1.1      ✓ workflowsets 0.2.1 
## ✓ parsnip      0.2.1      ✓ yardstick    0.0.9 
## ✓ recipes      0.2.0

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(readr)
library(ggthemes)
library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:scales':
## 
##     alpha, rescale

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

congress <- read.csv("congress_data.csv")

Setup

As always, read this section carefully, and when you’re ready to turn in the lab, you can delete the setup text and most of the instructions in the other sections. The goal is to create a clean, readable report that makes sense to someone who hasn’t seen the instructions.

This is Part 2 of our US Congress Lab. You may need to refer back to the previous lab guide for more information and context. You’ll be working with data that you filtered and wrangled in the last lab. You might choose a single team member’s data and code to work from.

As a group, you will need to turn in one copy of the html version of your RMarkdown Report. Please use code folding and remember to answer each question completely and accurately. The best teams work on projects together (side by side, pair programming), help teach those who are moving more slowly (not just “tell” the answer), and show up to planned group meetings on time and prepared with work done and ready to contribute to discussing the answers.

This means that everyone does the work—everyone should be contributing to code and/or following along on their own copies, helping to make decisions and interpret results, and checking in with each other to make sure each person actually understands what is going on. It means collaborating and making sure you have a plan for meeting after Lab, during the week, and everyone being patient and helping out.

Introduction

As a team, look over the questions below, and decide how to write an introduction that captures some of the general points of this report. You may need to repeat a little of what you wrote last time, but it shouldn’t be copied over verbatim.

The United States Congress is split into two chambers: the House of Representatives and the Senate. The House currently has 435 voting members, while the Senate has 100 (two for each state). Congress held its first session in 1789. Since the mid-19th century, Democrats and Republicans have been the two dominant political parties. Congress has become increasingly polarized since the mid-20th century.

The objective of this lab is to discover patterns in the makeup of Congress over the course of its 116 sessions. The data consist of every member of Congress since 1789, along with their political party, the session(s) of Congress in which they served, the state and district they represented, their birth and death dates, their ideology score according to Nominate (negative is further left and positive is further right), their gender, and their racial identity.

Ethical Considerations

You’re working with the same datasets this week, so you may have some of the same ethical concerns as last time. As a team, read over each other’s Ethical Considerations sections from last week’s lab. Decide how to combine them into a new ethics statement for this week that captures all the concerns of the group.

Polarization in Congress has been present since the 20th century. While not always “liberal versus conservative,” that has been the dominant trend recently. By analyzing these data, we can help political analysts trace trends in polarization across the whole of Congress’s existence. We can also help sociologists and psychologists who want to analyze human behavior along lines of political affiliation. These data can also be used by schools to show students how polarization in Congress has increased over time.

Statistical Analysis and Interpretation

Last week you did the Data Explanation and Exploration section of this lab. This week, you’ll go straight to Statistical Analysis and Interpretation. Really, this is one giant lab.

Voting trends and Political parties

Let’s not forget about all the other data we have in this congress dataset. We can also look at voting trends through time. And perhaps political party tells part of the story too!

Do you think age is generally linked to the likelihood of voting in a more polarizing way (extremely conservatively or extremely liberally)? We can check this quickly using a scatterplot. Make one now, separated by chamber, with a trend line, using semi-transparent points and your preferred theme, adding extra polish, and interpret it in a few sentences. Think carefully about your measures. If you are interested in “extreme” voting generally (regardless of whether it is liberal or conservative), will you want to use the exact nominate_dim1 measure, or will you want to transform it first?
Your visual may suggest a pattern that you think is (or is not) apparent. But let’s check it using what we’ve learned about linear regression.

In class this week we’ve discussed two approaches: correlation and regression (linear models). Both of these approaches allow us to test whether there is a positive or negative relationship between an x and y variable. You can run a correlation using cor.test(data$yvar, data$xvar). To run a linear regression you can use tidymodels. When it comes to regression output it is often best to assign the model a name and then ask for a summary of that new object like this:

model1 <- linear_reg() %>%
  set_engine("lm") %>%
  fit(yvar~xvar, data=data)

summary(model1$fit)

Note that with the regression the "lm" is an obvious abbreviation for linear model, and unlike the correlation, the fit() code mirrors the equation of a line ($y=mx+b$, or in statistical terms $Y=a+bX$ or $Y = \beta_{0} + \beta_{1}X$ … all of these are the same equation, but you may have encountered it shown in different ways in previous classes).
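
As a small illustration of how the two approaches relate, here is a sketch on simulated data. The data frame `toy` and its values are made up for this example, and it calls base R's `lm()` directly (the function that the "lm" engine runs, as the `stats::lm` call in the model output shows).

```r
set.seed(42)

# Simulated data: yvar falls slightly as xvar rises, plus random noise.
toy <- data.frame(xvar = runif(200, 25, 85))
toy$yvar <- 0.5 - 0.002 * toy$xvar + rnorm(200, sd = 0.1)

# Correlation test: direction and strength of the linear association.
ct <- cor.test(toy$yvar, toy$xvar)

# Linear model: the formula yvar ~ xvar mirrors Y = a + bX.
fit <- lm(yvar ~ xvar, data = toy)

coef(fit)                 # a (intercept) and b (slope)
unname(ct$estimate)^2     # squared correlation ...
summary(fit)$r.squared    # ... matches R-squared when there is one predictor
```

With a single predictor, the slope and the correlation always share the same sign, and the squared correlation equals the model's R-squared, so the two approaches tell a consistent story.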

Since we know the House and Senate have some different dynamics at work, let’s run two linear regressions, one for each chamber, checking if there is a relationship between a tendency to vote in a more polarizing way and age. Think through the steps you’ll need to take in preparing the data and setting up each of your models. Remember to interpret your results completely and accurately (p-values, slope, intercept, R2, and linking the results back to the data).
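
One measurement choice worth thinking through is what "age" means. A possible approach, sketched here on made-up rows, is to use each member's age when a given Congress convened rather than their age at death. The column names `born`, `congress`, and `nominate_dim1` mirror the ones in this dataset, while `start_year` and `age_in_office` are hypothetical names introduced for the example. The sketch assumes the n-th Congress convened in 1787 + 2n (the 1st met in 1789).

```r
# Toy rows with invented values, for illustration only.
toy <- data.frame(
  congress      = c(1, 100, 116),
  born          = c(1750, 1930, 1970),
  nominate_dim1 = c(-0.45, 0.10, 0.62)
)

# Assumption: the n-th Congress convened in 1787 + 2n.
toy$start_year <- 1787 + 2 * toy$congress

# Age when that Congress convened.
toy$age_in_office <- toy$start_year - toy$born

# Distance from the ideological center, ignoring direction.
toy$nomAbsValue <- abs(toy$nominate_dim1)

toy
```

Measuring age this way gives a different value for each Congress a member served in, which is closer to the question of whether older sitting members vote in a more polarizing way.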

congress <- congress %>% group_by(nominate_dim1) %>% mutate(nomAbsValue = abs(nominate_dim1))
congress <- congress %>% mutate(currentYear = replace(died,is.na(died),2021))
congress <- congress %>% group_by(currentYear) %>% mutate(age = currentYear - born)
house <- congress %>% filter(chamber=="House")
senate <- congress %>%filter(chamber=="Senate")
houseLnReg <- linear_reg() %>% set_engine("lm") %>% fit(nomAbsValue~age, data = house)
summary(houseLnReg$fit)

## 
## Call:
## stats::lm(formula = nomAbsValue ~ age, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36523 -0.11016  0.00387  0.10471  0.67506 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.970e-01  4.472e-03   88.78   <2e-16 ***
## age         -8.096e-04  6.006e-05  -13.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.161 on 39224 degrees of freedom
##   (327 observations deleted due to missingness)
## Multiple R-squared:  0.004612,   Adjusted R-squared:  0.004586 
## F-statistic: 181.7 on 1 and 39224 DF,  p-value: < 2.2e-16

tidy(houseLnReg)

## # A tibble: 2 × 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.397    0.00447        88.8 0       
## 2 age         -0.000810 0.0000601     -13.5 2.51e-41

ggplot(data=house, aes(x=age,y=nomAbsValue)) + geom_point() + stat_smooth(method="lm") + theme_economist() + labs(x = "Age of Congressperson", y = "Distance from political center", title = "Political Polarization And Age In The House")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 327 rows containing non-finite values (stat_smooth).

## Warning: Removed 327 rows containing missing values (geom_point).

senateLnReg <- linear_reg() %>% set_engine("lm") %>% fit(nomAbsValue~age, data = senate)
summary(senateLnReg$fit)

## 
## Call:
## stats::lm(formula = nomAbsValue ~ age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3615 -0.1153 -0.0025  0.1046  0.6827 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.4271753  0.0111482  38.318   <2e-16 ***
## age         -0.0012917  0.0001464  -8.824   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1687 on 9632 degrees of freedom
##   (54 observations deleted due to missingness)
## Multiple R-squared:  0.008019,   Adjusted R-squared:  0.007916 
## F-statistic: 77.86 on 1 and 9632 DF,  p-value: < 2.2e-16

tidy(senateLnReg)

## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  0.427    0.0111       38.3  4.02e-299
## 2 age         -0.00129  0.000146     -8.82 1.30e- 18

ggplot(data=senate, aes(x=age,y=nomAbsValue)) + geom_point() + stat_smooth(method="lm") + theme_economist() + labs(x = "Age of Senator", y = "Distance from political center", title = "Political Polarization And Age In The Senate")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 54 rows containing non-finite values (stat_smooth).

## Warning: Removed 54 rows containing missing values (geom_point).

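For our House data, the slope of the line is negative but almost zero (-0.0008), so each additional year of age corresponds to a decrease of less than 0.001 in distance from the political center. The R2 is also very low (0.005), meaning that age accounts for less than 1 percent of the variation in political polarization. The p-value for the slope is below 0.001, so with more than 39,000 observations the relationship is statistically significant, but it is far too weak to be of practical importance. For the Senate data, the slope is slightly more negative (-0.0013) and the R2 is 0.008, so age still accounts for less than 1 percent of the variation; again the slope is statistically significant (p < 0.001) but very small. The intercepts (0.40 for the House and 0.43 for the Senate) are the predicted distances from center at age zero, which has no practical meaning. In neither model is there a strong relationship between political polarization and age.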
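
To read these quantities off a model programmatically rather than from the printed summary, something like the following sketch could be used. It runs on simulated data (the `toy` data frame is invented here) and uses base R's `lm()` and `summary()`.

```r
set.seed(1)

# Simulated data with a weak negative relationship, loosely shaped like the
# models above; the values are invented.
toy <- data.frame(age = round(runif(500, 30, 90)))
toy$nomAbsValue <- 0.4 - 0.001 * toy$age + rnorm(500, sd = 0.16)

fit <- lm(nomAbsValue ~ age, data = toy)
s <- summary(fit)

slope   <- s$coefficients["age", "Estimate"]   # change in outcome per year of age
p_value <- s$coefficients["age", "Pr(>|t|)"]   # test of slope = 0
r2      <- s$r.squared                         # share of variance explained

# Practical size of the effect: predicted change across a 40-year age gap.
slope * 40
```

Applying the same arithmetic to the House slope printed above (-8.096e-04) gives a predicted change of about -0.03 across a 40-year age gap, which helps separate statistical significance from practical importance.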

So… age might not be the best predictor of voting extremity, though we did see that average age showed a distinct trend through time. But what else might help predict how members of congress vote, or how those votes change through time? So far, we’ve divided the data by chamber (House and Senate), by demographic variables (id_cat and Woman) and age. But we haven’t yet talked about political parties. The data includes a variable for this, party_code. Unfortunately, if you table(housedata$party_code) you’ll see there are lots of different parties (you can see the party codes at https://voteview.com/articles/data_help_parties).
You may be surprised that most members of Congress historically belong to one of two parties. What are these two parties? Which one is consistently more liberal in terms of the nominate_dim1 measure? Lastly, does it look like one of the parties has polarized more – which one? (Don’t worry, we’ll come back to testing this in the next section.) You can start to answer these questions by using an exploratory graphic – let’s try a scatter plot to show voting trends through time, using the nominate_dim1 measure, setting points to be semi-transparent, and coloring them by political party. Try making the graph using an alternative to geom_point(), as shown below. Note that the code below will also change the party_code to a factor, instead of a numeric variable.

congress <- congress %>% group_by(congress) %>% mutate(year = 1789 + 2*congress)
ggplot(data = congress, aes(x = congress, y = nomAbsValue, color=factor(party_code))) + geom_jitter(alpha=1/10) + theme_economist() + labs(y="Distance from political center", x="Session of Congress", title = "Political Polarization by Party")

## Warning: Removed 231 rows containing missing values (geom_point).

ggplot(data = congress, aes(x = year, y = nomAbsValue, color=factor(party_code))) + geom_jitter() + theme_economist() + labs(y="Distance from political center", x="Year", title = "Political Polarization by Party")

## Warning: Removed 231 rows containing missing values (geom_point).

    geom_jitter(aes(color=factor(party_code)), alpha=1/10)

What does geom_jitter() do differently from geom_point()? What does alpha achieve? Why do you think party_code is better as a factor than a numeric variable here?

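geom_jitter() adds a small amount of random variation to the location of each point so that points with the same values do not sit exactly on top of one another. “party_code” should be a factor rather than a numeric variable because it’s categorical rather than continuous data, so each party gets its own distinct color instead of a position on a continuous color gradient. Setting alpha to 0.1 makes the points mostly transparent, so areas where many points overlap appear darker and are easier to see.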
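
The two ideas can be seen without drawing a plot. This sketch uses base R's `jitter()` as a stand-in for what geom_jitter() does to point positions (an analogy, not ggplot2's actual internals), with made-up values.

```r
set.seed(7)

# Six members in two sessions: without jitter they would stack on two x values.
congress_no <- rep(c(100, 101), each = 3)

# Add uniform random noise of at most +/- 0.2 to each value.
jittered <- jitter(congress_no, amount = 0.2)
round(jittered, 2)

# Party codes are labels, not quantities: 200 is not "twice" 100.
party_code <- c(100, 200, 100, 328)
levels(factor(party_code))
```

Each jittered value stays within 0.2 of its original position, so the overall pattern is preserved while overlapping points are pulled apart. Converting the codes to a factor leaves three distinct categories, which is what a discrete color scale needs.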

Since two parties dominate the data, let’s focus on trends in representation and voting just in those main parties. We probably won’t have enough data to make really robust conclusions or comparisons on those smaller parties anyway. There are several important dates to know here, regarding our question and purpose: The United States Civil War was fought from 1861 to 1865. The Emancipation Proclamation, which declared enslaved people in Confederate-held territory free, took effect on January 1, 1863 (though the 13th Amendment abolishing slavery didn’t pass Congress until January 31, 1865, and the news of emancipation finally reached Galveston, TX on June 19, 1865). The post-war period of Reconstruction (read about it on Wikipedia for domain knowledge and context) lasted until about 1876, after which the two major parties dominated politics. Let’s focus on this narrower period of time for the rest of the project, which will help us reduce the smaller parties, hone our figures, and build toward some statistical analyses.

Let’s start by making a new dataset that only includes congresses since 1876 (call it congress1876). Then use table() with your new dataset and the party_code measure to see how many party codes there are. Much better right?

congress1876 <- congress %>% group_by(year) %>% filter(year > 1875)

table(congress1876$party_code)

## 
##   100   112   114   117   200   213   326   328   329   331   340   347   354 
## 20267     3     9     2 17663     1    28    53    20    16    83     3    14 
##   355   356   370   380   402   522   523   537  1060 
##     2     2    48     7     1     7     1    39    11

congress1876 <- congress1876 %>%
            mutate(twoparty = ifelse(party_code != 100 & party_code != 200, NA_integer_,
                ifelse(party_code == 200, "Republican", "Democrat")))

table(congress1876$twoparty)

## 
##   Democrat Republican 
##      20267      17663

congress1876$twoparty <- as.factor(congress1876$twoparty)

Let’s set things up so that we can focus on the two major parties versus all others. Here we’ll want to create a few simple variables that will identify Democrats versus Republicans. Ultimately, since we will just be using those two groups we want to make a variable that will let us focus only on members of those two parties.

We are going to need to use a mutate() with some ifelse() to do this. Remember that mutate makes a new variable and ifelse can be used to set that variable’s value based on whether a logical statement is true or false. What makes ifelse really helpful is that it can be nested to make more complex statements. Specifically, an ifelse takes the form:

ifelse(logical_claim, output_if_true, output_if_false)

so if I used it with a mutate() like:

data <- mutate(data, newvar=ifelse(var <= 100,1,0))

I’d be adding a new variable, “newvar”, to my dataset, “data”, that is equal to 1 if the variable “var” is less than or equal to 100, or is equal to zero if “var” is greater than 100. Got it?
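
Nesting works the same way: the "false" branch of one ifelse() can itself be another ifelse(). Here is a minimal sketch on a made-up vector of party codes, using the 100 = Democrat and 200 = Republican coding from this report. It uses `NA_character_` for the missing value since the labels are text, which is a choice made for this example.

```r
# Invented party codes, including a third party (328) and a missing value.
toy <- data.frame(party_code = c(100, 200, 328, 200, NA))

toy$twoparty <- ifelse(
  toy$party_code != 100 & toy$party_code != 200,
  NA_character_,                                  # neither major party -> missing
  ifelse(toy$party_code == 200, "Republican", "Democrat")
)

# Count each label, showing missing values as well.
table(toy$twoparty, useNA = "ifany")
```

The outer test is evaluated first. Only rows that fail it (codes 100 or 200) reach the inner ifelse(), and a row whose party_code is already missing stays missing.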

Here’s some code for you to try out to identify the Democrat and Republican parties (check the website to make sure you’re assigning the numbers correctly!). You’ll need to fill in the blanks, and check your output to make sure it worked.

    congress1876 <- congress1876 %>%
            mutate(congress1876,
                twoparty = ifelse(party_code != 100 & party_code != 200, NA_integer_,
                ifelse(party_code == 200, _____, _____)))

Show the frequencies and your labels using: table(congress1876$twoparty). After you create it, you can make sure that your variable is encoded as a factor by using:

congress1876$twoparty <- as.factor(congress1876$twoparty)

Now, that mutate() was complex and I’d like to know that you understand what it did. Explain in words the nested ifelse logic in the command so that you understand what it produces. What did NA_integer_ do?

The top line of code indicates which dataset is being changed. The second line calls the mutate function, which creates the new column. The third line names the new column (twoparty) and starts the outer ifelse function, which checks whether party_code is something other than 100 or 200. If it is, NA_integer_ assigns a missing value (NA) to that row, so members of all other parties are left out of the new variable. Otherwise, the inner ifelse on the final line assigns the word Republican to the value 200 and Democrat to the value 100.

Whew! Now with the party code out of the way we can really look at two party polarization over time. Building off the code from #4 above (the one that used geom_jitter()), make a polished plot of voting trends through time that varies the color of the points by whether the member is a Democrat or Republican. For an added challenge and additional polish see if you can make the colors assigned to each group match the stereotypical red for Republican and blue for Democrat. Assuming you used color=columnname as an aesthetic in your geometry, this can be done by adding a layer to your ggplot code like:

demsreps <- congress %>% filter(party_code == 100 | party_code == 200)


ggplot(data = demsreps, aes(x = year, y = nomAbsValue, color=factor(party_code))) + geom_jitter() + theme_fivethirtyeight() + labs(y="Distance from political center", x="Year", title="Political Polarization by Party", color="Party affiliation") + scale_color_manual(labels=c("Democratic","Republican"),values=c("blue","red")) + scale_x_continuous(breaks=seq(1820, 2020, 20))

## Warning: Removed 166 rows containing missing values (geom_point).

where you can specify the actual color for each value your factor variable takes on. Take a moment to look up the help() for scale_color_manual or google it to see what the options are and examples of how to use it. You can find different shades of color here, if you don’t like the default red and blue: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf

If you made the new party variables correctly, this should be two values and you can use the obvious colors “blue” and “red”.

You can also fix the numeric labeling on the x axis using something like: scale_x_continuous(breaks=seq(1877, 2020, 10)), which places axis labels at a sequence running from 1877 to 2020 in steps of 10 years. Or you could experiment with setting the range to be different. You can also rotate the axis labels if they look too crowded, using theme() and setting the angle to 90 or to 45 (or something else):

        theme(axis.text.x=element_text(angle=90, hjust=1))
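
To see exactly which tick positions a breaks argument would produce, it can help to run the seq() call on its own first. A quick sketch in base R:

```r
# Tick positions from 1877 up to (at most) 2020 in steps of 10.
breaks <- seq(1877, 2020, 10)

breaks
length(breaks)   # how many labels the axis would carry
range(breaks)    # first and last tick; the last is 2017, not 2020
```

Because seq() never overshoots its upper bound, the final label lands on 2017. Starting the sequence at a round number such as 1880 gives labels on decade boundaries instead.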

Discuss the figure and address what trends you see in voting by each party. Do you think more extreme voting (polarization) is the exception or the norm for the major parties? Also, which party seems to have polarized more quickly in recent years?

Both parties now seem to be more polarized on average than they were in 1980. However, there were more Democratic outliers that were much more liberal than average Democrat in 1980 than there are as of 2020, yet there are actually more Republican outliers, people much more conservative than the average Republican, in 2020 than there were in 1980. The data appears to show, then, that the Republicans have polarized more rapidly than the Democrats have.

Polarization

How have voting trends changed through time for different groups?

Now that we have focused our dataset on the post-Reconstruction period and the two main political parties, let’s go back to our analysis of representation and voting trends in Congress. Let’s start by making an overall plot of the average voting trends through time based on congress member identity category. This will be similar to the process you’ve already practiced: You’ll need to summarize average voting by congress or year (or include both), chamber, and id_cat, and then make a line plot through time, making each of the id_cat groups a separate line and color. Make it polished! Use facet_wrap(~variable) to make side-by-side plots for the House and Senate.
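
A summary table has one row per group rather than one row per member. As a rough sketch of that step, the following uses base R's `aggregate()` on made-up rows; dplyr's group_by() followed by summarise() produces the same kind of table. The id_cat labels "A" and "B" are placeholders, not the dataset's actual categories.

```r
# Invented member-level rows.
toy <- data.frame(
  congress      = c(115, 115, 115, 116, 116, 116),
  chamber       = c("House", "House", "Senate", "House", "House", "Senate"),
  id_cat        = c("A", "A", "A", "A", "B", "A"),
  nominate_dim1 = c(-0.4, 0.2, 0.5, -0.3, -0.5, 0.1)
)

# One row per congress x chamber x id_cat combination, holding the group mean.
avg <- aggregate(nominate_dim1 ~ congress + chamber + id_cat,
                 data = toy, FUN = mean)
names(avg)[names(avg) == "nominate_dim1"] <- "mean_nominate"

avg
```

Plotting from a table like `avg` gives geom_line() exactly one point per group per session, which keeps each line clean.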

congress <- congress %>% group_by(congress,id_cat) %>% mutate(mean_nominate = mean(nominate_dim1, na.rm = TRUE))
legislature <- congress %>% filter(chamber == "House" | chamber == "Senate")


ggplot(data = legislature, aes(x = year, y = mean_nominate, color=factor(id_cat))) + geom_line() + theme_solarized_2(light=FALSE) + labs(y="Mean Nominate score", x="Year", title="Political Trends By Race/Ethnicity", color="Race/Ethnicity") + scale_color_manual(values=c("deeppink","orange","yellow","chartreuse","cyan","white")) + scale_x_continuous(breaks=seq(1790, 2020, 40)) + facet_wrap(~chamber) + theme(axis.text.x=element_text(angle=90, hjust=1))

Interpret the graph in a few sentences. What trends do you notice? One line should show a pretty drastic change in voting tendencies. You can read more about why this trend appears here and how it shifted party affiliation: Temporary Farewell and Return to Congress.

Since there were no Black members of Congress in office between 1901 and 1929, there’s a very sharp drop in Nominate score across that gap. Before 1901, almost all African-American members of Congress were Republicans, while after the return of African-Americans to Congress, they were mostly Democrats. In the House we see another downward trend line for Hispanics, although not as pronounced. White Congresspeople display a cyclic pattern of upswings and downswings in conservatism. Right now they appear to be on an upswing.

Since voting trends by identity category appear to mainly follow major party trends and membership, let’s try another variable. Using the same process let’s also make a summary table and plot of the average voting trends through time based on congress member gender. What trends do you see here?

legislature <- legislature %>% group_by(congress,Woman) %>% mutate(mean_nominate2= mean(nominate_dim1, na.rm = TRUE))
ggplot(data = legislature, aes(x = year, y = mean_nominate2, color=factor(Woman))) + geom_line() + theme_solarized_2(light=FALSE) + labs(y="Mean Nominate score", x="Year", title="Political Trends By Gender", color="Gender") + scale_color_manual(values=c("skyblue","magenta"),labels=c("Male","Female")) + scale_x_continuous(breaks=seq(1780, 2020, 40)) + facet_wrap(~chamber) + theme(axis.text.x=element_text(angle=90, hjust=1))

Just focusing on the voting trend of women, let’s run two linear regressions, one for each chamber, to see if the trend is significant through time. What are your conclusions? Interpret the results completely and accurately.
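
One way to set up such a trend test, sketched on simulated data, is to keep only the rows for women, average their scores within each Congress, and then regress that average on the Congress number so that time sits on the right-hand side of the formula. The column names `Woman`, `congress`, and `nominate_dim1` follow this dataset, while the values and the object names `women` and `per_congress` are invented for the example.

```r
set.seed(3)

# Simulated members: four per Congress, two women and two men.
toy <- data.frame(
  congress = rep(90:116, each = 4),
  Woman    = rep(c(1, 1, 0, 0), times = 27)
)
# Women's scores drift downward over time in this made-up data.
toy$nominate_dim1 <- ifelse(toy$Woman == 1, -0.01 * (toy$congress - 90), 0) +
  rnorm(nrow(toy), sd = 0.05)

# Keep women only, then average within each Congress.
women <- subset(toy, Woman == 1)
per_congress <- aggregate(nominate_dim1 ~ congress, data = women, FUN = mean)

# Outcome on the left, time on the right.
fit <- lm(nominate_dim1 ~ congress, data = per_congress)
coef(fit)[["congress"]]   # estimated change in mean score per Congress
```

In the real data this would be run once per chamber, and the slope would be read as the average change in women's mean Nominate score from one Congress to the next.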

The first woman elected to Congress, Jeannette Rankin, was classified as a conservative, which skewed the average while there were very few women in Congress. Once more women were elected to Congress, the average Nominate score dropped extremely steeply.

Like white congresspeople, male congresspeople seem to follow a cyclic trend and right now they are becoming increasingly conservative.

house <- legislature %>% filter(chamber=="House")
senate <- legislature %>%filter(chamber=="Senate")

SenateWomen <- linear_reg() %>% set_engine("lm") %>% fit(congress~mean_nominate2, data = senate)
summary(SenateWomen$fit)

## 
## Call:
## stats::lm(formula = congress ~ mean_nominate2, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.817 -21.983   4.744  21.165  56.209 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     67.4936     0.3079  219.19   <2e-16 ***
## mean_nominate2 -50.5993     3.2011  -15.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.22 on 9686 degrees of freedom
## Multiple R-squared:  0.02515,    Adjusted R-squared:  0.02505 
## F-statistic: 249.9 on 1 and 9686 DF,  p-value: < 2.2e-16

HouseWomen <- linear_reg() %>% set_engine("lm") %>% fit(congress~mean_nominate2, data = house)

The previous plot makes it clear that political party is probably part of the story. Let’s make one more summary table and plot to try to determine if the trend you saw is true of women members of congress generally, or if it’s more of an impact of political party membership. This time, make sure to include your “twoparty” variable as a grouping variable. When you make your plot this time, let’s make a 2x2 grid that shows plots separately for chamber and political party.

CongressWomen <- congress1876 %>% filter(Woman == 1)
table(CongressWomen$party_code)

## 
##  100  200 
## 1145  550

ggplot(congress1876, aes(x = nominate_dim1,fill=twoparty)) + geom_boxplot() + facet_grid(chamber~twoparty) + labs(title="Nominate Scores of Congress Since 1876",x="Nominate score dimension 1") + theme_fivethirtyeight() + scale_fill_manual(values=c("blue","red"),labels=c("Democrat","Republican"))

## Warning: Removed 148 rows containing non-finite values (stat_boxplot).

ggplot(CongressWomen, aes(x = nominate_dim1,fill=twoparty)) + geom_boxplot() + facet_grid(chamber~twoparty) + labs(title="Nominate Scores of Women In Congress Since 1876",x="Nominate score dimension 1") + theme_fivethirtyeight() + scale_fill_manual(values=c("blue","red"),labels=c("Democrat","Republican"))

## Warning: Removed 10 rows containing non-finite values (stat_boxplot).

We can do this using `facet_grid(chamber~twoparty)`. As usual, add your polish to the graphic, including labels, a caption, and a theme, and write a few sentences interpreting what you found. Were there any interesting or general patterns across the gridded plots?

The medians of both sets of data appear to be very close, but looking at the interquartile ranges (the boxes), women tend to be farther left if they are Democrats and slightly more moderate if they are Republicans. There was no boxplot generated for women from third parties, presumably because there isn’t enough data.

Clearly, regardless of party, women seem to differ from other members of congress in their average voting trends. But is it a significant difference? Are women voting more liberally or conservatively than other members? We can test that! Check if the average woman in congress votes differently from their congressional peers (you can lump all the data together for this one). Think about which statistical test would be best to answer your question, and interpret the results completely and accurately. Did the statistical results support what you saw in the graph?

I will use a two-sample (Welch) t-test to analyze these data, comparing the mean Nominate score of women in Congress with the mean for all members.

t.test(CongressWomen$nominate_dim1, congress$nominate_dim1)

## 
##  Welch Two Sample t-test
## 
## data:  CongressWomen$nominate_dim1 and congress$nominate_dim1
## t = -19.471, df = 1812.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1902435 -0.1554254
## sample estimates:
##    mean of x    mean of y 
## -0.167618398  0.005216026

Yes, the data show that there is a difference in the average Nominate score of Congresswomen when compared to men and women together, and that women vote more liberally.
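
An alternative framing, sketched here on simulated data, is to compare women directly with men so that the two groups do not overlap. The formula interface of `t.test()` splits the outcome by a two-level grouping variable. The `Woman` coding (1 = woman, 0 = man) follows this dataset, and the values are invented.

```r
set.seed(11)

# Simulated scores: women centered at -0.2, men at 0.
toy <- data.frame(Woman = rep(c(1, 0), each = 200))
toy$nominate_dim1 <- rnorm(400,
                           mean = ifelse(toy$Woman == 1, -0.2, 0),
                           sd = 0.3)

# Welch two-sample t-test: outcome ~ grouping variable.
res <- t.test(nominate_dim1 ~ Woman, data = toy)

res$method     # which test was run
res$estimate   # group means, group 0 (men) first, then group 1 (women)
res$p.value
```

Keeping the groups separate means each member contributes to only one of the two means, which makes the estimated difference easier to interpret.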

Searching for related variables to political polarization in voting

Let’s add one more variable to our congress1876 dataset. If values further from 0 are representations of “extreme” or polarized voting, what transformation could you use to put the two parties on the same scale so the nominate_dim1 values are comparable?

The absolute value, which is why I used the variable “nomAbsValue”: it measures each member’s distance from zero regardless of direction.

OK, for all the cookie dough… Let’s try two big regression models (one for the House and one for the Senate) to try to interpret if some of the demographic factors, party membership, or congress variables we looked at have a relationship to voting in a more polarizing way, in recent years. Based on the graphs you’ve seen so far, choose a more recent span of years to filter your data to, for example since WWII (~1944), or something else marking an event of your choosing.

senate$id_cat <- as.numeric(senate$id_cat)
senate1970 <- senate %>% filter(year > 1969)
SenateParties <- linear_reg() %>% set_engine("lm") %>% fit(year+id_cat+nominate_dim1+nominate_dim2~nomAbsValue, data = senate1970)
summary(SenateParties$fit)

## 
## Call:
## stats::lm(formula = year + id_cat + nominate_dim1 + nominate_dim2 ~ 
##     nomAbsValue, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.983 -12.131   0.341  12.183  32.247 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1993.7844     0.6888 2894.47   <2e-16 ***
## nomAbsValue   24.2597     1.8836   12.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.61 on 2654 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.05883,    Adjusted R-squared:  0.05847 
## F-statistic: 165.9 on 1 and 2654 DF,  p-value: < 2.2e-16

tidy(SenateParties)

## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   1994.      0.689    2894.  0       
## 2 nomAbsValue     24.3     1.88       12.9 7.30e-37

print(SenateParties)

## parsnip model object
## 
## 
## Call:
## stats::lm(formula = year + id_cat + nominate_dim1 + nominate_dim2 ~ 
##     nomAbsValue, data = data)
## 
## Coefficients:
## (Intercept)  nomAbsValue  
##     1993.78        24.26

house$id_cat <- as.numeric(house$id_cat)
house1970 <- house %>% filter(year > 1969)
HouseParties <- linear_reg() %>% set_engine("lm") %>% fit(year+id_cat+nominate_dim1+nominate_dim2~nomAbsValue, data = house1970)
summary(HouseParties$fit)

## 
## Call:
## stats::lm(formula = year + id_cat + nominate_dim1 + nominate_dim2 ~ 
##     nomAbsValue, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.13 -11.50   0.26  11.71  33.89 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1991.4872     0.3321 5997.31   <2e-16 ***
## nomAbsValue   27.9329     0.8369   33.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.3 on 11498 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.08832,    Adjusted R-squared:  0.08824 
## F-statistic:  1114 on 1 and 11498 DF,  p-value: < 2.2e-16

tidy(HouseParties)

## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   1991.      0.332    5997.  0        
## 2 nomAbsValue     27.9     0.837      33.4 3.33e-233

print(HouseParties)

## parsnip model object
## 
## 
## Call:
## stats::lm(formula = year + id_cat + nominate_dim1 + nominate_dim2 ~ 
##     nomAbsValue, data = data)
## 
## Coefficients:
## (Intercept)  nomAbsValue  
##     1991.49        27.93

Based on the plots and tests you’ve run so far, and your thoughts about the span of time that you picked, choose 3-5 variables that you want to put into a multiple regression model that examines potential correlates of voting trends. Keep in mind, you should always have some hypothesis, or reason, to put variables into a model before you run it. Suggestions might be to include congress or year, age, id_cat, Woman, party…

id_cat – Non-white congresspeople are likely to be more liberal than white congresspeople, and in turn, to vote in a less polarizing way.

year – polarization has increased over time.

nominate_dim1 – those with a lower first-dimension Nominate score (more liberal) generally vote in a less polarizing way.

nominate_dim2 – those with a lower second-dimension Nominate score (more progressive in social views) generally vote in a less polarizing way.
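Each hypothesis above treats polarization as the outcome, and in an R formula the outcome goes to the left of `~` with the predictors on the right. Anything joined with `+` on the left is added together arithmetically into one response. The following is a minimal sketch on simulated data, not the lab data: the data frame `toy` and all of its values are made up for illustration, and only the column names echo the lab.

```r
# Simulated stand-in for the lab data: all values are invented for illustration.
set.seed(1)
n <- 200
toy <- data.frame(
  year          = sample(1970:2020, n, replace = TRUE),
  id_cat        = sample(1:3, n, replace = TRUE),
  nominate_dim2 = runif(n, -1, 1)
)
# Polarization score built to drift upward with year, plus noise.
toy$nomAbsValue <- 0.2 + 0.005 * (toy$year - 1970) + rnorm(n, sd = 0.1)

# Outcome on the left, predictors on the right: one slope per predictor.
fit_outcome_left <- lm(nomAbsValue ~ year + id_cat + nominate_dim2, data = toy)

# Predictors on the left: R sums them into one response and fits a single slope.
fit_sum_left <- lm(year + id_cat + nominate_dim2 ~ nomAbsValue, data = toy)

# Compare which terms each model estimates.
names(coef(fit_outcome_left))
names(coef(fit_sum_left))
```

The first fit reports an intercept plus a separate slope for each predictor, which is the shape needed to interpret each variable on its own. The second reports only an intercept and one slope for `nomAbsValue`.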

Interpret the results completely and accurately. This means for each regression, pulling out the p-value, R2, intercept, and the slopes for each significant estimate, interpreting in terms of the data. You should also compare your findings between the 2 chambers. Did the results qualitatively differ? If so, how?

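Because the formula puts the sum of year, id_cat, nominate_dim1, and nominate_dim2 on the left-hand side, both models regress that combined value on the single predictor “nomAbsValue”. For the House, the y-intercept is 1991.49: when “nomAbsValue” is 0, the predicted combined value is 1991.49. The slope is 27.93, which means that each one-unit increase in “nomAbsValue” goes with an increase of 27.93 in the combined value. The R2 is 0.088, meaning about 9 percent of the variation in the response is explained by the regression line (which is not much). The p-value is virtually zero (below 2.2e-16), which means that, if the null hypothesis of no relationship between these variables were true, we would have almost no chance of seeing a slope this far from zero. The Senate results are qualitatively the same: an intercept of 1993.78, a slope of 24.26, an R2 of 0.059, and a p-value below 2.2e-16. The relationship is positive and significant in both chambers, but somewhat weaker in the Senate. Since year appears to be by far the largest term in the summed response, these fits mostly say that higher “nomAbsValue” scores show up in later years; they do not give a separate slope for each of the four variables.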
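The quantities discussed above (intercept, slope, R2, and p-values) can be read directly off a fitted model rather than copied from the printed summary. Here is a minimal sketch in base R, assuming a simple one-predictor fit on simulated data. The data frame `toy`, the column `response`, and every number in it are invented and do not reproduce the lab results.

```r
# Simulated data: values are invented and do not reproduce the lab output.
set.seed(2)
n <- 150
toy <- data.frame(nomAbsValue = runif(n, 0, 1))
toy$response <- 1990 + 25 * toy$nomAbsValue + rnorm(n, sd = 14)

fit <- lm(response ~ nomAbsValue, data = toy)
s <- summary(fit)

# Intercept and slope from the coefficient vector.
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["nomAbsValue"]]

# Share of variance in the response explained by the model.
r_squared <- s$r.squared

# p-value for the slope, read from the coefficient table.
slope_p <- s$coefficients["nomAbsValue", "Pr(>|t|)"]

# p-value for the overall F-test, computed from the stored F statistic.
f <- s$fstatistic
model_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

c(intercept = intercept, slope = slope, r_squared = r_squared,
  slope_p = slope_p, model_p = model_p)
```

With a single predictor the slope's p-value and the overall F-test p-value coincide, which is why the printed summaries show the same value in both places. Running the same extraction on each chamber's fit gives numbers that can be compared side by side.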

Conclusion

Write a short conclusion summarizing what you learned about this data from your work last week and this week. What conclusions might you draw about how Congress is polarized? What further research should be done?

Congress has grown increasingly polarized over time, and the Republican party has become polarized more rapidly than the Democratic party. Congress has diversified radically since its inception in 1789, when every Congressperson was a white male. There has been polarization at least along gender lines, with women being both more liberal on average and less polarized as voters (in either party). Other than during the brief Reconstruction period, racial minorities have always been more liberal than whites in Congress. However, we haven’t looked at polarization along racial lines yet, which I think is worth looking into.

Read these articles from Pew Research and Data Wrapper. How do they align with what you’ve investigated in this lab? Write a paragraph on what you learned and where you think political research in this area should look next. In other words, what are the next questions you have about what might be related to changing diversity in Congress and/or changing trends or polarization in voting?

https://www.pewresearch.org/fact-tank/2019/02/15/the-changing-face-of-congress/

https://blog.datawrapper.de/age-of-us-senators-charts/

As our data reflected, Congress is becoming increasingly diverse in terms of gender and race. The article also noted that Congress has somewhat diversified religiously, which was not a variable we looked into, but seems natural as other indicators of diversity have gone up; I think that is promising for exploratory analysis. Apparently the current Senate is the oldest ever, which runs counter to the data we had from 2019 showing that Senate ages reached their peak in the 80s, so that is worth looking into as well.

Changing Congress: Representation & Voting Trends II