---
title: "Congressional Representation"
author: "John Ladd"
date: "25 October 2021"
output:
  html_document:
    code_folding: hide
---
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✓ broom 0.7.12 ✓ rsample 0.1.1
## ✓ dials 0.1.0 ✓ tune 0.2.0
## ✓ infer 1.0.0 ✓ workflows 0.2.6
## ✓ modeldata 0.1.1 ✓ workflowsets 0.2.1
## ✓ parsnip 0.2.1 ✓ yardstick 0.0.9
## ✓ recipes 0.2.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(readr)
library(ggthemes)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
##
## alpha, rescale
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
congress <- read.csv("congress_data.csv")
As always, read this section carefully, and when you’re ready to turn in the lab, you can delete the setup text and most of the instructions in the other sections. The goal is to create a clean, readable report that makes sense to someone who hasn’t seen the instructions.
This is Part 2 of our US Congress Lab. You may need to refer back to the previous lab guide for more information and context. You’ll be working with data that you filtered and wrangled in the last lab. You might choose a single team member’s data and code to work from.
As a group, you will need to turn in one copy of the html version of your RMarkdown Report. Please use code folding and remember to answer each question completely and accurately. The best teams work on projects together (side by side, pair programming), help teach those who are moving more slowly (not just “tell” the answer), and show up to planned group meetings on time and prepared with work done and ready to contribute to discussing the answers.
This means that everyone does the work—everyone should be contributing to code and/or following along on their own copies, helping to make decisions and interpret results, and checking in with each other to make sure each person actually understands what is going on. It means collaborating and making sure you have a plan for meeting after Lab, during the week, and everyone being patient and helping out.
As a team, look over the questions below, and decide how to write an introduction that captures some of the general points of this report. You may need to repeat a little of what you wrote last time, but it shouldn’t be copied over verbatim.
The United States Congress is split into two chambers: the House of Representatives and the Senate. The House currently has 435 voting members, while the Senate has 100 (two for each state). Congress held its first session in 1789. Since the mid-19th century, Democrats and Republicans have been the two dominant political parties, and Congress has become increasingly polarized since the mid-20th century.
The objective of this lab is to discover patterns in the makeup of Congress over the course of its 116 sessions. The data cover every member of Congress since 1789 and include each member’s political party, the session(s) of Congress in which they served, the state and district they represented, their birth and death dates, their ideological score according to Nominate (negative is further left, positive is further right), their gender, and their racial identity.
You’re working with the same datasets this week, so you may have some of the same ethical concerns as last time. As a team, read over each other’s Ethical Considerations sections from last week’s lab. Decide how to combine them into a new ethics statement for this week that captures all the concerns of the group.
Polarization in Congress has been increasing since the mid-20th century. While the divide has not always been “liberal versus conservative,” that has been the dominant axis recently. By manipulating this data, we can help political analysts trace trends in polarization across the entirety of Congress’s existence. We can also help sociologists and psychologists who want to analyze human behavior along lines of political affiliation. These data can also be used by schools to show students how Congress has polarized over time.
Last week you did the Data Explanation and Exploration section of this lab. This week, you’ll go straight to Statistical Analysis and Interpretation. Really, this is one giant lab.
Let’s not forget about all the other data we have in this congress dataset. We can also look at voting trends through time. And perhaps political party tells part of the story too!
Do you think age is generally linked to a tendency to vote in a more polarizing way (extremely conservatively, or extremely liberally)? We can check this quickly using a scatterplot. Make one now, separated by chamber, with a trend line, using semi-transparent points and your preferred theme, adding extra polish, and interpret it in a few sentences. Think carefully about your measures. If you are interested in looking at “extreme” voting generally (without regard to whether it is liberal or conservative), will you want to use the exact nominate_dim1 measure, or will you want to transform it first?
Your visual may suggest a pattern that you think is (or is not) apparent. But let’s check it using what we’ve learned about linear regression.
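One possible transformation for measuring “extreme” voting regardless of direction is the absolute value. The short sketch below uses made-up scores (not values from the Congress data) to show how abs() turns a left/right score into a distance from the center.

```r
# Hypothetical first-dimension Nominate scores (made-up values for illustration)
scores <- c(-0.62, -0.15, 0.04, 0.48, 0.71)

# Distance from the political center, ignoring left/right direction
extremity <- abs(scores)

print(extremity)
```

A member at -0.62 and a member at 0.62 end up with the same extremity value, which is what we want if the question is about polarization rather than ideology.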
In class this week we’ve discussed two approaches: correlation and regression (linear models). Both of these approaches allow us to test whether there is a positive or negative relationship between an x and y variable. You can run a correlation using cor.test(data$yvar, data$xvar). To run a linear regression you can use tidymodels. When it comes to regression output it is often best to assign the model a name and then ask for a summary of that new object like this:
model1 <- linear_reg() %>%
  set_engine("lm") %>%
  fit(yvar~xvar, data=data)
summary(model1$fit)
Note that with the regression the "lm" is an obvious abbreviation for linear model, and unlike the correlation, the fit() code mirrors the equation of a line (\(y=mx+b\), or in statistical terms \(Y=a+bX\) or \(Y = β_{0} + β_{1}X\) … all of these are the same equation, but you may have encountered it shown in different ways in previous classes).
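To see how the two approaches relate, here is a minimal sketch using base R’s lm() directly (rather than the tidymodels wrapper) on made-up data; x and y are illustrative values, not variables from the Congress dataset. With a single predictor, the R2 from the regression should equal the squared correlation, and both tests should report the same p-value.

```r
# Made-up data: x and y are illustrative, not from the Congress dataset
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.9)

# Correlation test: strength and direction of the linear association
r_test <- cor.test(y, x)
r <- unname(r_test$estimate)

# Simple linear regression: Y = b0 + b1 * X
model <- lm(y ~ x)
b0 <- unname(coef(model)[1])  # intercept
b1 <- unname(coef(model)[2])  # slope
r_squared <- summary(model)$r.squared

# With one predictor, R-squared is the squared correlation, and the
# p-value for the slope matches the p-value of the correlation test
p_cor <- r_test$p.value
p_slope <- summary(model)$coefficients["x", "Pr(>|t|)"]

print(c(r_squared = r_squared, r_sq_from_cor = r^2))
print(c(p_cor = p_cor, p_slope = p_slope))
```

The regression adds what the correlation does not give you: an intercept and a slope in the units of the data, which is why the lab asks you to interpret both.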
Since we know the House and Senate have some different dynamics at work, let’s run two linear regressions, one for each chamber, checking if there is a relationship between a tendency to vote in a more polarizing way and age. Think through the steps you’ll need to take in preparing the data and setting up each of your models. Remember to interpret your results completely and accurately (p-values, slope, intercept, R2, and linking the results back to the data).
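# Distance from the political center: absolute value of the first-dimension Nominate score
congress <- congress %>% mutate(nomAbsValue = abs(nominate_dim1))
# Year of death, or 2021 for members with no recorded death
congress <- congress %>% mutate(currentYear = replace(died,is.na(died),2021))
# Note: this is age at death (or age in 2021), not age while serving in a given Congress
congress <- congress %>% mutate(age = currentYear - born)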
house <- congress %>% filter(chamber=="House")
senate <- congress %>%filter(chamber=="Senate")
houseLnReg <- linear_reg() %>% set_engine("lm") %>% fit(nomAbsValue~age, data = house)
summary(houseLnReg$fit)
##
## Call:
## stats::lm(formula = nomAbsValue ~ age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36523 -0.11016 0.00387 0.10471 0.67506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.970e-01 4.472e-03 88.78 <2e-16 ***
## age -8.096e-04 6.006e-05 -13.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.161 on 39224 degrees of freedom
## (327 observations deleted due to missingness)
## Multiple R-squared: 0.004612, Adjusted R-squared: 0.004586
## F-statistic: 181.7 on 1 and 39224 DF, p-value: < 2.2e-16
tidy(houseLnReg)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.397 0.00447 88.8 0
## 2 age -0.000810 0.0000601 -13.5 2.51e-41
ggplot(data=house, aes(x=age,y=nomAbsValue)) + geom_point() + stat_smooth(method="lm") + theme_economist() + labs(x = "Age of Congressperson", y = "Distance from political center", title = "Political Polarization And Age In The House")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 327 rows containing non-finite values (stat_smooth).
## Warning: Removed 327 rows containing missing values (geom_point).
senateLnReg <- linear_reg() %>% set_engine("lm") %>% fit(nomAbsValue~age, data = senate)
summary(senateLnReg$fit)
##
## Call:
## stats::lm(formula = nomAbsValue ~ age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3615 -0.1153 -0.0025 0.1046 0.6827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4271753 0.0111482 38.318 <2e-16 ***
## age -0.0012917 0.0001464 -8.824 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1687 on 9632 degrees of freedom
## (54 observations deleted due to missingness)
## Multiple R-squared: 0.008019, Adjusted R-squared: 0.007916
## F-statistic: 77.86 on 1 and 9632 DF, p-value: < 2.2e-16
tidy(senateLnReg)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.427 0.0111 38.3 4.02e-299
## 2 age -0.00129 0.000146 -8.82 1.30e- 18
ggplot(data=senate, aes(x=age,y=nomAbsValue)) + geom_point() + stat_smooth(method="lm") + theme_economist() + labs(x = "Age of Senator", y = "Distance from political center", title = "Political Polarization And Age In The Senate")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 54 rows containing non-finite values (stat_smooth).
## Warning: Removed 54 rows containing missing values (geom_point).
For our House data, the slope of the line is negative but very close to zero (about -0.0008): each additional year of age is associated with a decrease of roughly 0.0008 in distance from the political center, starting from an intercept of 0.397. The slope is statistically significant (p < 2e-16), but the R2 is very low (0.0046), meaning that less than half a percent of the variation in political polarization is accounted for by age. For the Senate data, the slope is slightly steeper (about -0.0013, with an intercept of 0.427) and is also statistically significant (p < 2e-16), but the R2 is only 0.008, so age accounts for under 1 percent of the variation. With samples this large, even a tiny effect reaches statistical significance. In neither model is age a practically meaningful predictor of political polarization.
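The combination of a very small p-value and a very small R2 can look contradictory, so here is a hedged sketch on simulated data. The sample size, slope, and noise level are assumptions chosen to roughly resemble the House model, not values taken from the real dataset.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

n <- 40000                           # assumed sample size, similar to the House data
age <- runif(n, min = 30, max = 90)  # simulated ages, not real data

# Assumed true relationship: a very small negative slope plus a lot of noise
extremity <- 0.40 - 0.0008 * age + rnorm(n, mean = 0, sd = 0.16)

fit <- lm(extremity ~ age)
fit_summary <- summary(fit)

slope_p <- fit_summary$coefficients["age", "Pr(>|t|)"]
r_squared <- fit_summary$r.squared

print(c(slope_p = slope_p, r_squared = r_squared))
```

In a simulation like this the slope is detected with high confidence because n is large, yet almost all of the variation in the outcome is still noise. Statistical significance speaks to whether an effect exists; R2 speaks to how much of the outcome it explains.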
So… age might not be the best predictor of voting extremity, though average age did show a distinct trend through time. But what else might help predict how members of Congress vote, or how those votes change through time? So far, we’ve divided the data by chamber (House and Senate), by demographic variables (id_cat and Woman), and by age. But we haven’t yet talked about political parties. The data includes a variable for this, party_code. Unfortunately, if you table(housedata$party_code) you’ll see there are lots of different parties (you can see the party codes at https://voteview.com/articles/data_help_parties).
You may be surprised that most members of Congress historically belong to one of two parties. What are these two parties? Which one is consistently more liberal in terms of the nominate_dim1 measure? Lastly, does it look like one of the parties has polarized more – which one? (Don’t worry we’ll come back to testing this in the next section.) You can start to answer these questions by using an exploratory graphic – let’s try a scatter plot to show voting trends through time, using the nominate_dim1 measure, setting points to be semi-transparent, and coloring them by political party. Try making the graph using an alternate to geom_point(), as shown below. Note that the code below will also change the party_code to a factor, instead of a numeric variable.
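# 1789 + 2 * congress is the year in which each Congress ended (the 1st Congress sat 1789-1791)
congress <- congress %>% group_by(congress) %>% mutate(year = 1789 + 2*congress)
# alpha is a fixed setting, so it belongs in geom_jitter() rather than inside aes()
ggplot(data = congress, aes(x = congress, y = nomAbsValue, color=factor(party_code))) + geom_jitter(alpha=1/10) + theme_economist() + labs(y="Distance from political center", x="Session of Congress", title = "Political Polarization by Party")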
## Warning: Removed 231 rows containing missing values (geom_point).
ggplot(data = congress, aes(x = year, y = nomAbsValue, color=factor(party_code))) + geom_jitter() + theme_economist() + labs(y="Distance from political center", x="Year", title = "Political Polarization by Party")
## Warning: Removed 231 rows containing missing values (geom_point).
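The conversion from Congress number to calendar year can be written out explicitly. This is a small sketch with hypothetical helper names (congress_start_year and congress_end_year are not part of the lab or any package); it relies only on the fact that the 1st Congress convened in 1789 and that a new Congress has convened every two years since.

```r
# Hypothetical helpers: the 1st Congress convened in 1789 and a new
# Congress has convened every two years since
congress_start_year <- function(n) 1789 + 2 * (n - 1)
congress_end_year <- function(n) congress_start_year(n) + 2

congress_start_year(c(1, 45, 116))  # 1789 1877 2019
congress_end_year(c(1, 45, 116))    # 1791 1879 2021
```

The expression 1789 + 2*congress used for the year variable in this report matches the end-year helper, so each point on the year axis sits at the close of its Congress rather than at its start.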
geom_jitter(aes(color=factor(party_code)), alpha=1/10)
What does geom_jitter() do differently from geom_point()? What does alpha achieve? Why do you think party_code is better as a factor than a numeric variable here?
geom_jitter() adds a small amount of random noise to the position of each point, so that points with identical or nearly identical values no longer sit directly on top of one another. Setting alpha to 0.1 makes each point mostly transparent, so areas where many points overlap appear darker and are easier to see. party_code should be a factor rather than a numeric variable because the codes are categorical labels rather than continuous quantities; as a factor, each party gets its own discrete color.
Since two parties dominate the data, let’s focus on trends in representation and voting just in those main parties. We probably won’t have enough data to make really robust conclusions or comparisons on those smaller parties anyway. There are several important dates to know here, regarding our question and purpose: The United States Civil War was fought 1861-1865. The Emancipation Proclamation, which declared enslaved people in Confederate-held territory free, took effect on January 1, 1863 (though the 13th Amendment abolishing slavery didn’t pass Congress until January 31, 1865, and the news finally reached Galveston, TX on June 19, 1865). The post-war period of Reconstruction (read about it on Wikipedia for domain knowledge and context) lasted until about 1876, after which the two major parties dominated politics. Let’s focus on this narrower period of time for the rest of the project, which will help us reduce the smaller parties, hone our figures, and build toward some statistical analyses.
Let’s start by making a new dataset that only includes congresses since 1876 (call it congress1876). Then use table() with your new dataset and the party_code measure to see how many party codes there are. Much better right?
congress1876 <- congress %>% group_by(year) %>% filter(year > 1875)
table(congress1876$party_code)
##
## 100 112 114 117 200 213 326 328 329 331 340 347 354
## 20267 3 9 2 17663 1 28 53 20 16 83 3 14
## 355 356 370 380 402 522 523 537 1060
## 2 2 48 7 1 7 1 39 11
congress1876 <- congress1876 %>%
  mutate(twoparty = ifelse(party_code != 100 & party_code != 200, NA_integer_,
                           ifelse(party_code == 200, "Republican", "Democrat")))
table(congress1876$twoparty)
##
## Democrat Republican
## 20267 17663
congress1876$twoparty <- as.factor(congress1876$twoparty)
We are going to need to use a mutate() with some ifelse() to do this. Remember that mutate() makes a new variable and ifelse() can be used to set that variable to different values based on whether a logical statement is true or false. What makes ifelse() really helpful is that it can be nested to make more complex statements. Specifically, an ifelse() takes the form:
ifelse(logical_claim, output_if_true, output_if_false)
so if I used it with a mutate() like:
data <- mutate(data, newvar=ifelse(var <= 100,1,0))
I’d be adding a new variable, “newvar”, to my dataset, “data”, that is equal to 1 if the variable “var” is less than or equal to 100, or is equal to zero if “var” is greater than 100. Got it?
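Nesting works the same way. The sketch below applies a nested ifelse() to a small made-up vector of party codes in base R, without mutate(), so the logic can be checked by eye. It uses NA_character_ for the missing value since the result is text; NA_integer_ also works here because ifelse() coerces the result to character.

```r
# Made-up party codes; 100 and 200 are the two major-party codes used in this lab
party_code <- c(100, 200, 328, 200, 100, NA)

# Outer ifelse: anything other than 100 or 200 becomes missing (NA)
# Inner ifelse: of what remains, 200 is "Republican", otherwise "Democrat"
twoparty <- ifelse(party_code != 100 & party_code != 200, NA_character_,
                   ifelse(party_code == 200, "Republican", "Democrat"))

# useNA = "ifany" makes table() report the missing values too
table(twoparty, useNA = "ifany")
```

Note that a party_code that is already NA stays NA, because comparing NA to a number gives NA and ifelse() passes that through.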
Here’s some code for you to try out to identify the Democrat and Republican parties (check the website to make sure you’re assigning the numbers correctly!). You’ll need to fill in the blanks, and check your output to make sure it worked.
congress1876 <- congress1876 %>%
  mutate(twoparty = ifelse(party_code != 100 & party_code != 200, NA_integer_,
                           ifelse(party_code == 200, _____, _____)))
Show the frequencies and your labels using: table(congress1876$twoparty). After you create it, you can make sure that your variable is encoded as a factor by using:
congress1876$twoparty <- as.factor(congress1876$twoparty)
That mutate() was complex and I’d like to know that you understand what it did. Explain in words the nested ifelse logic in the command so that you understand what it produces. What did NA_integer_ do?
The first line names the dataset being changed and pipes it into mutate(), which creates the new column, twoparty. The outer ifelse() checks whether party_code is something other than 100 or 200; if so, the new column is set to NA_integer_, a missing value, so members of all other parties are marked as missing rather than given a label. Otherwise the inner ifelse() runs: it assigns “Republican” when party_code is 200 and “Democrat” in the remaining case, when party_code is 100.
Using a jittered scatter plot (geom_jitter()), make a polished plot of voting trends through time that varies the color of the points by whether the member is a Democrat or Republican. For an added challenge and additional polish see if you can make the colors assigned to each group match the stereotypical red for Republican and blue for Democrat. Assuming you used color=columnname as an aesthetic in your geometry, this can be done by adding a layer to your ggplot code like:
demsreps <- congress %>% filter(party_code == 100 | party_code == 200)
ggplot(data = demsreps, aes(x = year, y = nomAbsValue, color=factor(party_code))) + geom_jitter() + theme_fivethirtyeight() + labs(y="Distance from political center", x="Year", title="Political Polarization by Party", color="Party affiliation") + scale_color_manual(labels=c("Democratic","Republican"),values=c("blue","red")) + scale_x_continuous(breaks=seq(1820, 2020, 20))
## Warning: Removed 166 rows containing missing values (geom_point).
where you can specify the actual colors for each value your factor variable takes on. Take a moment to look up the help() for scale_color_manual or google it to see what the options are and examples of how to use it. You can find different shades of color here, if you don’t like the default red and blue: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf
If you made the new party variables correctly, this should be two values and you can use the obvious colors “blue” and “red”.
You can also control the x-axis labels with scale_x_continuous(breaks=seq(1877, 2020, 10)), which places axis breaks (labeled tick marks) every 10 years from 1877 through 2017. Or you could experiment with a different range or spacing. You can also rotate the axis labels if they look too crowded, using theme() and setting the angle to 90 or to 45 (or something else): theme(axis.text.x=element_text(angle=90, hjust=1))
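It can help to look at what seq() actually returns before handing it to a scale. This base R sketch prints the break positions for the example above.

```r
# Break positions produced by the example call: start, upper bound, step
breaks <- seq(1877, 2020, by = 10)

print(breaks)
range(breaks)   # first and last break
length(breaks)  # how many tick labels the axis will carry
```

The sequence stops at the last value that does not exceed 2020, so the final break falls short of the upper bound. The breaks argument only decides where tick labels go; the visible range of the axis is governed separately by the limits argument of scale_x_continuous().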
Discuss the figure and address what trends you see in voting by each party. Do you think more extreme voting (polarization) is the exception or the norm for the major parties? Also, which party seems to have polarized more quickly in recent years?
Both parties now seem to be more polarized on average than they were in 1980. However, in 1980 there were more Democratic outliers, members much more liberal than the average Democrat, than there are as of 2020, while there are more Republican outliers, members much more conservative than the average Republican, in 2020 than there were in 1980. The data therefore appear to show that the Republicans have polarized more rapidly than the Democrats.
Use facet_wrap(~variable) to make side-by-side plots for the House and Senate.
congress <- congress %>% group_by(congress,id_cat) %>% mutate(mean_nominate = mean(nominate_dim1, na.rm = TRUE))
legislature <- congress %>% filter(chamber == "House" | chamber == "Senate")
ggplot(data = legislature, aes(x = year, y = mean_nominate, color=factor(id_cat))) + geom_line() + theme_solarized_2(light=FALSE) + labs(y="Mean Nominate score", x="Year", title="Political Trends By Race/Ethnicity", color="Race/Ethnicity") + scale_color_manual(values=c("deeppink","orange","yellow","chartreuse","cyan","white")) + scale_x_continuous(breaks=seq(1790, 2020, 40)) + facet_wrap(~chamber) + theme(axis.text.x=element_text(angle=90, hjust=1))
Interpret the graph in a few sentences. What trends do you notice? One line should show a pretty drastic change in voting tendencies. You can read more about why this trend appears here and how it shifted party affiliation: Temporary Farewell and Return to Congress.
No Black members served in Congress between 1901 and 1929, and the mean Nominate score for Black members drops very sharply across that gap. Before the gap, nearly all Black members of Congress were Republicans; after Black members returned to Congress, they were mostly Democrats. In the House we see another downward trend line for Hispanic members, although it is not as pronounced. White members display a cyclic pattern of upswings and downswings in conservatism. Right now they appear to be on an upswing.
legislature <- legislature %>% group_by(congress,Woman) %>% mutate(mean_nominate2= mean(nominate_dim1, na.rm = TRUE))
ggplot(data = legislature, aes(x = year, y = mean_nominate2, color=factor(Woman))) + geom_line() + theme_solarized_2(light=FALSE) + labs(y="Mean Nominate score", x="Year", title="Political Trends By Gender", color="Gender") + scale_color_manual(values=c("skyblue","magenta"),labels=c("Male","Female")) + scale_x_continuous(breaks=seq(1780, 2020, 40)) + facet_wrap(~chamber) + theme(axis.text.x=element_text(angle=90, hjust=1))
The first woman elected to Congress (Jeannette Rankin) was classified as a conservative, which skewed the early average. Once more women were elected to Congress, the average Nominate score dropped extremely steeply.
Like white congresspeople, male congresspeople seem to follow a cyclic trend and right now they are becoming increasingly conservative.
house <- legislature %>% filter(chamber=="House")
senate <- legislature %>%filter(chamber=="Senate")
SenateWomen <- linear_reg() %>% set_engine("lm") %>% fit(congress~mean_nominate2, data = senate)
summary(SenateWomen$fit)
##
## Call:
## stats::lm(formula = congress ~ mean_nominate2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.817 -21.983 4.744 21.165 56.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.4936 0.3079 219.19 <2e-16 ***
## mean_nominate2 -50.5993 3.2011 -15.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.22 on 9686 degrees of freedom
## Multiple R-squared: 0.02515, Adjusted R-squared: 0.02505
## F-statistic: 249.9 on 1 and 9686 DF, p-value: < 2.2e-16
HouseWomen <- linear_reg() %>% set_engine("lm") %>% fit(congress~mean_nominate2, data = house)
CongressWomen <- congress1876 %>% filter(Woman == 1)
table(CongressWomen$party_code)
##
## 100 200
## 1145 550
ggplot(congress1876, aes(x = nominate_dim1,fill=twoparty)) + geom_boxplot() + facet_grid(chamber~twoparty) + labs(title="Nominate Scores of Congress Since 1876",x="Nominate score dimension 1") + theme_fivethirtyeight() + scale_fill_manual(values=c("blue","red"),labels=c("Democrat","Republican"))
## Warning: Removed 148 rows containing non-finite values (stat_boxplot).
ggplot(CongressWomen, aes(x = nominate_dim1,fill=twoparty)) + geom_boxplot() + facet_grid(chamber~twoparty) + labs(title="Nominate Scores of Women In Congress Since 1876",x="Nominate score dimension 1") + theme_fivethirtyeight() + scale_fill_manual(values=c("blue","red"),labels=c("Democrat","Republican"))
## Warning: Removed 10 rows containing non-finite values (stat_boxplot).
We can do this using `facet_grid(chamber~twoparty)`. As usual, add your polish to the graphic, including labels, a caption, and a theme, and write a few sentences interpreting what you found. Were there any interesting or general patterns across the gridded plots?
The medians of both sets of data appear to be very close, but looking at the interquartile range (the box, which spans the middle 50 percent of scores), women tend to be farther left if they are Democrats and slightly more moderate if they are Republicans. There was no boxplot generated for women from third parties because the data since 1876 contain none: the party_code table for women shows only codes 100 and 200.
I will use a two-sample t-test (Welch’s) to analyze these data.
t.test(CongressWomen$nominate_dim1, congress$nominate_dim1)
##
## Welch Two Sample t-test
##
## data: CongressWomen$nominate_dim1 and congress$nominate_dim1
## t = -19.471, df = 1812.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1902435 -0.1554254
## sample estimates:
## mean of x mean of y
## -0.167618398 0.005216026
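Yes. The Welch two-sample t-test shows a statistically significant difference (t = -19.47, p < 2.2e-16) between the mean Nominate score of Congresswomen since 1876 (-0.168) and that of all members of Congress, men and women together (0.005). The 95 percent confidence interval for the difference, -0.190 to -0.155, does not include zero, so on average women have voted more liberally.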
Write a short conclusion summarizing what you learned about this data from your work last week and this week. What conclusions might you draw about how Congress is polarized? What further research should be done?
Congress has grown increasingly polarized over time, and the Republican party has become polarized more rapidly than the Democratic party. Congress has diversified radically since its inception in 1789 when every Congressperson was a white male. There has been polarization at least along gender lines, with women both being more liberal on average and being less polarized as voters (in either party). Other than during the brief Reconstruction period, racial minorities have always been more liberal than whites in Congress. However, we haven’t looked at polarization along racial lines yet, which I think is worth looking into.
https://www.pewresearch.org/fact-tank/2019/02/15/the-changing-face-of-congress/
https://blog.datawrapper.de/age-of-us-senators-charts/
As our data reflected, Congress is becoming increasingly diverse in terms of gender and racial dynamics. The article also noted that Congress has somewhat diversified religiously, which was not a variable we looked into, but seems natural as other indicators of diversity have gone up; I think that is promising for exploratory analysis. Apparently the current Senate is the oldest ever, counter to some data from 2019 we had that determined that Senate ages reached their peak in the 80s, so that is worth looking into as well.