---
title: "Changing Congress: Representation and Voting Trends I"
author: "Alice Wu"
date: "03/09/2022"
output:
  html_document:
    code_folding: hide
---

Setup

As always, read this section carefully, then delete it before you turn in your lab.

This is a two-part individual (Part 1; questions 1-20) and team (Part 2; questions 21-37) lab. Please submit this individual portion as a knitted HTML file using code folding, titled “lastname_congress.html”.

In this lab we will:

- Practice data wrangling skills such as creating new measures, collapsing, and filtering data for further analysis.
- Use conditional statements.
- Practice advanced data visualization techniques.
- Begin building a toolkit of introductory statistical methods.
- Use code folding to make a polished report that hides code by default.

All of these will be valuable in future projects and classes. Being able to master and then extend and adapt the techniques will be important moving forward.

You will hand in a knitted HTML file for assessment. Please include headers for each of the main sections as you code (using #) and make sure to answer all questions completely, in between code chunks of your Rmd Report.

Code folding

So far, I’ve been having you show your code in the reports we’ve been writing. But now that you’re nearly halfway into the semester, we’re ready to hide the code and make more professional reports. For this report, I’d like you to hide code that is for setup or more routine data wrangling, using the extra arguments you can add into the {r} part of your code chunk (hint: include, warning, message, etc.). You may need to visit the R Markdown cheat sheet to figure it out. You can find some of the common arguments here, under the header “Chunk Options”: https://rmarkdown.rstudio.com/lesson-3.html
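
If you find yourself repeating the same options on every chunk, one alternative is to set defaults once in a setup chunk. The sketch below assumes the knitr package (which R Markdown already uses when knitting); the particular defaults are only an illustration, and any individual chunk can still override them in its own {r} header.

```r
# Place this inside a setup chunk near the top of the .Rmd,
# e.g. one whose header is {r setup, include=FALSE} so the chunk itself is hidden.
library(knitr)

# Defaults applied to every later chunk unless that chunk overrides them.
opts_chunk$set(
  warning = FALSE,  # do not print warnings in the knitted report
  message = FALSE   # do not print messages such as package start-up notes
)
```

Setting defaults this way keeps the per-chunk headers short, so the only options you type by hand are the exceptions.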

For graphics, new commands, and statistical tests, I’d like you to instead use “code folding”. This is a technique that only works when we are rendering (knitting) the Rmarkdown code to html, because we need something capable of being dynamic. Code folding will hide your code by default, but provide a nifty little button that you can click on to instantly show, or re-hide the code, in your report. Cool!

At the top of your file, in the boilerplate, edit the output section to include code_folding: hide. Your entire header will look like this:

---
title: "Congressional Representation"
author: "John Ladd"
date: "25 October 2021"
output:
  html_document:
    code_folding: hide
---

Give it a try, and once you have a few R code chunks, try knitting to see if it worked, and what it looks like. The formatting in your header for the output section needs to look exactly as shown above, including the indentations. You can find more details here, including an example of how to include a Table of Contents (determined by your headers), if you want more information or to try experimenting: https://blog.rstudio.com/2016/03/21/rmarkdown-v0-9-5/.

Introduction

In class we discussed the increasing polarization of parties in Congress in terms of how Congresspeople vote; polarization is the movement of policy positions away from the center so that there is less overlap between the two parties’ stances on political issues. Polarization is an impediment to bipartisanship and dialogue between the two American political parties, since they are finding it harder and harder to agree on fundamentals. The CSV file contains the names and demographic information of current and former members of Congress, along with their scores on a metric designed to place them on the political left or right. Our guest speaker told us that the Republican Party in particular has moved away from the political center and farther right over the past several decades, as shown by changes in its members’ average political scores. I would like to learn whether or not that is the case. I’d also like to see how demographic information has changed over time.

Ethical Considerations

Stakeholders include political strategists and political analysts like Nate Silver of FiveThirtyEight, who make a living evaluating political trends using data. They also include the Congresspeople themselves: beyond the obvious reasons, increasing polarization in “safe” districts means that moderate candidates will often be defeated in primaries by more extreme candidates. The President is a stakeholder too, since they must convince an increasingly divided Congress to pass their desired legislation.

Data Explanation and Exploration

n.b. Most of your analysis this week is exploratory. Next week, you’ll work on the Statistical Analysis and Interpretation section of this lab.

Initial Setup - Data and packages

You won’t need to include any of this in your output but will need to do the following:

1. To begin, download the following file from the GitHub assignment into a directory that you can set up as your R Project for the lab.
   - congress_data.csv (congress)
   - View the online codebook here: https://voteview.com/articles/data_help_members
   - More about the measures and dataset: https://link.springer.com/article/10.1007/s11127-018-0546-0
2. Make sure you have the following packages installed and load them in using the library() function:
   - tidyverse; ggthemes; maps
3. read_csv() in the .csv file listed above and assign it a name of your choosing. Bear in mind that good names are short and informative. For clarity, the lab refers to the named data file using the label in parentheses above.

Exploring and setting up the Congressional data

You can delete the instructions once you’ve completed all the steps, so that you wind up with a clean, readable report.

As a first step glimpse() the congress datasets that you just brought into RStudio. (no need to include this output in your write-up).
What are the units of analysis (the rows in this data set)? Does the number of observations make sense (how many members are there in the House and the Senate)? Hint: you may want to look at summaries of specific variables or use table(data$colname) to get frequencies of specific measures to check your intuition.

The rows are of the individual Congresspeople. Initially it doesn’t make sense because there are only 535 Congresspeople currently, but it also includes members from previous sessions of Congress.

What data wrangling might you need to do before you are ready to analyze this dataset? This dataset is supposed to be about House member voting records, but the President for each congress is provided. Create a second data frame that only includes Presidents, and filter your main datafile to only include House and Senate members.
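
The split described above can be sketched with two filter() calls. This is a toy example, not the real data: the member names are placeholders, and it assumes the chamber column marks presidential rows with the value "President" (check table(data$chamber) in your own data before relying on that).

```r
library(dplyr)

# Toy stand-in for the congress data; the real file has many more columns.
congress <- tibble(
  bioname = c("Member A", "Member B", "Member C"),
  chamber = c("President", "House", "Senate")
)

# Rows for Presidents only.
presidents <- congress %>% filter(chamber == "President")

# Main analysis data: House and Senate members only.
legislature <- congress %>% filter(chamber %in% c("House", "Senate"))
```

Using %in% with both chamber names, rather than chamber != "President", also drops any unexpected values that might be hiding in the column.
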
As you probably noticed in the previous section we do not have an obvious time variable in the dataset. The closest thing we have is the congress number, but most people (except congress scholars) don’t think of time in this way. So we need to make a year variable, where the year associated with each congress is the first year that the congress met. Obviously we don’t want to type one in or recode the congress number since that would be cumbersome and tedious. Instead we’ll want to leverage our logic and skills to mutate() one. You’ll need some basic math to do this and remember that each congress lasts 2 years, the first congress convened in 1789, and the last congress in our dataset (the 116th) started in 2019 (note: it will not be complete until 3 January 2021). Make clever use of mutate() to generate a new year variable in the congress data and provide a summary of your new variable to confirm that it ranges from 1789 to 2019.
- Hint: Imagine that I had a dataset of current Denison students, “data”, which had a variable class which took on values of 1, 2, 3, or 4 for whether the student was a freshman, sophomore, junior, or senior. Then if I wanted to make a new variable for the graduation year (gradyr) of the student I could use:

Note that I had to think about what I wanted the new variable to look like in terms of how the class variable was coded. By starting at 2023, seniors would subtract 4 and yield 2019, while freshmen would just be equal to 2023-1, or 2022. For the task here to determine congress year, you’ll need to do something similar, but you’ll also need multiplication (*) as part of your equation. Basically, think about what the start year should be and then what you’ll want to add in terms of the congress variable.
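
The hint's code is not shown above, so here is a minimal sketch of what it could look like. It assumes a starting value of 2023, which is the value that reproduces the results stated in the hint (seniors graduate in 2019, freshmen in 2022); the student labels are made up.

```r
library(dplyr)

# Hypothetical data: class is 1 (freshman) through 4 (senior).
data <- tibble(student = c("A", "B", "C", "D"), class = c(1, 2, 3, 4))

# Start from a fixed year and subtract the class code.
data <- data %>% mutate(gradyr = 2023 - class)

data$gradyr
```

The congress year works the same way, except that each step in the congress variable is worth two years instead of one.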

Now, we also have a few categorical variables that we should really have R treat as factors. Using one of your summary functions, check how the variables for id_cat and Woman are coded. If they are not marked as factors, change them now using as.factor(). Remember, you’ll need to save this back to the main dataset, not as a separate variable, or it won’t be useful to you.
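
A minimal sketch of the check-then-convert pattern, using toy data. The category labels and the 0/1 coding below are placeholders, not the real coding of id_cat and Woman.

```r
library(dplyr)

# Toy data; the values are placeholders for illustration.
congress <- tibble(
  id_cat = c("Group A", "Group B", "Group A"),
  Woman  = c(0, 1, 0)
)

# Check how the variables are currently stored.
class(congress$id_cat)
class(congress$Woman)

# Overwrite the columns in place so the main dataset carries the change.
congress <- congress %>%
  mutate(id_cat = as.factor(id_cat), Woman = as.factor(Woman))

levels(congress$Woman)
```

Assigning the result back to congress is the step that is easy to forget; without it the conversion is printed and then thrown away.
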
Compare the total numbers of congressional representatives that fall into different demographic categories (id_cat), using a bar plot. So id_cat should be your x-axis and the total count should be your y-axis (it will count totals automatically, similar to how a histogram automatically totals the height of each bin for you). You can use geom_bar(), not the summary method we used for the substance use lab. You should make side-by-side plots for the House and Senate using facet_wrap().

ggplot(data = legislature, aes(x = id_cat, fill = id_cat)) +
  geom_bar(width = 0.4) +  # geom_bar() counts the rows in each category, so no y aesthetic is needed
  labs(y = "Count", x = "Race and Ethnicity", fill = "Racial Categories") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1, face = "bold")) +
  facet_wrap(~chamber)

Now look at a summary of the nominate_dim1 values and produce a polished visual (i.e., axis labels, title, appropriate binwidth, etc…) of the distribution of all the ideological positions (often interpreted as economic liberalism-conservatism; nominate_dim1) values for all rows in the dataset. You should make side-by-side plots for the house and senate using facet_wrap(). Remember that given the scale for nominate_dim1 scores (-1= very liberal and 1=very conservative) the zero point is a fair place to imagine a moderate. Briefly discuss what the results suggest for polarization in terms of this measure of legislator ideologies.
- While you are likely familiar with the ggplot code you need to execute the visual portion of this task, you can make long titles or captions break over multiple lines using \n to indicate where you want a line break. For example, the layer in your ggplot could look like:
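
The example layer is not shown above, so here is a small sketch of the idea. The scores are toy values standing in for nominate_dim1, and the title text is made up.

```r
library(ggplot2)

# Toy scores standing in for nominate_dim1.
scores <- data.frame(nominate_dim1 = c(-0.6, -0.4, -0.1, 0.2, 0.5, 0.7))

# "\n" inside the string starts a new line in the rendered title.
plot_title <- "Distribution of Ideological Positions\nin the House and Senate"

p <- ggplot(scores, aes(x = nominate_dim1)) +
  geom_histogram(binwidth = 0.25) +
  labs(title = plot_title)
```

Note the space after \n in prose is just for readability; inside the string, the text that follows \n begins the new line immediately.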

ggplot(data = legislature, aes(x = nominate_dim1, fill = "frequency")) +
  geom_histogram(bins = 50) +
  labs(title = "Political Leanings of Congresspeople", x = "Liberal-Conservative Axis Score", y = "Frequency") +
  facet_wrap(~chamber) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")

## Warning: Removed 213 rows containing non-finite values (stat_bin).

* If you want to center the title you can add a layer to your ggplot that looks like this:

theme(plot.title = element_text(hjust = 0.5))

* Feel free to tinker with this code and use `help()` or Google to see how it works. `theme()` contains a lot of power to alter the way your visuals look, and to give a professional polish to your reports.

In light of your previous figure, we still might want to know if the average Congress representative is significantly polarized. Run a statistical test you know how to use at this stage in the lab to assess whether the average member in the entire dataset is extreme. Will this be a one or a two sample t-test? For a one-sample t-test, you would only put in one group, and it is compared against a hypothetical mean of zero (mu=0), though in theory, you could set the comparison mean to a different value (mu=0.5). In a two-sample t-test, which you’ve already done a lot in class, you put in two groups, and compare their means to see if the difference is zero or not. Think about the way the data is distributed. We could leave it alone, or use abs() to take the absolute value if you think that makes more sense. There are several acceptable approaches to this problem.
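
To see how the raw and absolute-value approaches can differ, here is a sketch on simulated scores. The numbers are invented (two clusters on either side of zero) and are not the lab data; only the t.test() calls carry over.

```r
# Simulated scores standing in for nominate_dim1 (not real data):
# one cluster left of zero and one cluster right of zero.
set.seed(42)
scores <- c(rnorm(200, mean = -0.4, sd = 0.15),
            rnorm(200, mean =  0.4, sd = 0.15))

# One-sample t-test of the raw scores against a hypothetical mean of zero.
raw_test <- t.test(scores, mu = 0)

# Same test on the absolute values, which measures distance from the
# center regardless of direction.
abs_test <- t.test(abs(scores), mu = 0)

raw_test$estimate
abs_test$estimate
```

When the two clusters roughly cancel out, the raw mean sits near zero even though few individual scores are moderate, which is why the absolute value can be the more informative choice for a question about extremity.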

What are the two things you will be comparing here? Discuss the result in terms of statistical and substantive significance.

Analysis over time and across subsets of the data

In the previous section you looked at aggregate voting scores. Of course, lumping all the data together across all of US Congressional history is highly abstract and paints too broad a picture. At this point, we can’t yet address questions related to time or to subgroups of Congress. House and Senate members may or may not be polarized on average, but clearly any look at representation or political polarization today must consider the effects of time, parties, and other factors.

Let’s begin with investigating how representation has changed in congress through time. Contrary to the depiction in the hit musical “Hamilton” (seriously, go see it if you ever have a chance!), the United States founding fathers were all white men. In fact, it was quite some time before any non-white or women federal legislative officials were elected (nearly 100 years!). We’re going to use the data to create a descriptive history lesson of the changing face of congress through time, and attempt to relate it to key historical events.

Identity category and gender

Right now, we have data for all the members of congress from 1789 to present. But to really see how representation has changed, we don’t just want to count people of different identity categories (after all the total number of people in congress has also changed through time). Rather, we should compare proportional, or percentage, representation in each congress, and see how it has changed over time.

So to do that, you’ll need to use what you know to create a summary table that calculates percent of each demographic category (id_cat, Woman), separately by congress (or year) and chamber. This one is kind of complicated, so I’ll give the first one to you, but I’m counting on you reading it closely and making sure you understand it, so you can replicate it for the Woman category.
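
The original chunk for this step is not reproduced in this report, so what follows is a sketch of one way to build such a table, run on toy data. The name perc_id and the "Group B" label are made up; "White or Other" is one of the id_cat labels used later in this report.

```r
library(dplyr)

# Toy data: one congress, two chambers.
congress <- tibble(
  congress = 1,
  year     = 1789,
  chamber  = c("House", "House", "House", "House", "Senate", "Senate"),
  id_cat   = c("White or Other", "White or Other", "White or Other",
               "Group B", "White or Other", "Group B")
)

perc_id <- congress %>%
  # One group per identity category within each chamber of each congress.
  group_by(congress, year, chamber, id_cat) %>%
  # Count members; "drop_last" keeps the grouping by congress, year, chamber.
  summarize(n = n(), .groups = "drop_last") %>%
  # sum(n) is now the chamber total for that congress, so this is a share.
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()
```

The ordering matters: the count has to happen at the finest grouping first, and the percentage has to be computed after the innermost grouping level has been peeled off, otherwise every row would be 100%.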

The code above should give you the percent representation of each different identity category, separately by congress (year) and by chamber (house, senate). Explain in detail why the code works. What is happening in each line, that helps get from your main dataset, to the summary dataset output? What is each function doing, and why is the dplyr chain ordered this way?
Now, write some code that does the same thing - calculates percent representation separately for each chamber in each congress - but for the Woman variable.
OK, now you have some summary datasets, with variables for year, congress, chamber, and percent representation. It’s hard to see in the table, so let’s look at them visually! Make 2 line plots, showing the change in representation through time, using the summary datasets you made for identity category and for woman.

ggplot(percWhite, aes(x = year, y = percent, color = chamber)) +
  geom_line() +
  facet_wrap(~chamber) +
  labs(title = "How Racial Demographics Have Changed In Washington", y = "% identifying as 'White or Other'", color = "Chamber of government") +
  theme_dark()

ggplot(justmen, aes(x = year, y = percent, color = chamber)) +
  geom_line() +
  facet_wrap(~chamber) +
  labs(title = "How Gender Representation Has Changed In Washington", y = "% of members who are male", color = "Chamber of government") +
  theme_dark()

Each plot should:

- Show the data as separate lines through time, for identity category or woman category.
- Use a ggplot layer to make side-by-side plots comparing representation in the House and the Senate (there’s a handy ggplot function for this, remember?).
- Use some aesthetic, like color or linetype, to differentiate the lines and help the graph be interpretable.
- Add polish by specifying your own x and y axis labels, and a title or caption for your visual. Remember, if your title or caption is long, you can use \n to signify you want the text to continue on a new line.
- Use one of the new themes that we brought in with the ggthemes library. Once you’ve selected one you like, use this theme for all the visuals in your lab. To try different options, you can type theme_ and then hit “tab” to see the different options. You can copy thematic styles from FiveThirtyEight, the Wall Street Journal, the Economist, or data viz giants like Edward Tufte or Stephen Few, among other options.

Don’t forget to interpret what is shown! What do you notice about the trends through time? Do you have any hypotheses about what historic events might be related to changes in representation? State some ideas here.

We see a sudden influx of people of color in both houses of Congress around 1870; the white share then curves back up around 1880, when both houses return to being virtually all-white. This coincides with the Reconstruction period that occurred after the Civil War, in which Northerners sought to transform the South and secure the rights of freedmen. Unfortunately this period was short-lived, and after just a decade, Southerners succeeded in reclaiming their legislatures and removing Black access to nearly all the rights to which freed slaves were entitled under the Constitution, such as the right to vote. Non-white representation in Congress returns slowly through the early 20th century before rising steeply in the last half of the 20th century, potentially due to the civil rights movement and passage of the Voting Rights Act in the mid to late sixties.

Women took longer to enter Congress most likely because nationwide female enfranchisement arrived only about half a century after (nominal) African-American enfranchisement. We generally see a more stable line here, although in the Senate chart, we see that there were no women in the Senate in 1950. This was during the post-WWII period, when many women who worked outside the home during the war returned to domestic life.

One odd thing I noticed is that there is no change in the Presidents line during the Obama administration, and indeed, Obama is listed as “White or Other” in our dataset. While it is true that he is biracial and that his mother was white, he identifies as black rather than white, so I assume this was an error.

Let’s try those plots one more time, but add some historical context onto them. For each set of plots (id_cat and woman), add at least 2 vertical dashed lines that you think should represent a historic event that might be related to a change in representation for certain groups. You may need to brainstorm in your study group for ideas, or Google for some dates. For example, in 1991, Anita Hill’s testimony of alleged sexual harassment against Supreme Court nominee Clarence Thomas spurred many women to run for office and 1992 was given the name “Year of the Woman”. To add this context to my graph, I could add a dashed vertical line, using code like this:

        geom_vline(xintercept = 1991, linetype="dashed")

You could add 2 or more vertical lines by adding multiple years: c(1991, XXXX, XXXX), and you could look up different linetypes or experiment with making your line a different color. Or you could add multiple geom_vline() ggplot layers.
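
Here is a sketch of the vector form on a toy series. The percentages are invented, and 1999 is only a placeholder year to show a second line; 1991 is the example year from above.

```r
library(ggplot2)

# Toy series; the percentages are made up for illustration.
toy <- data.frame(year    = seq(1981, 2001, by = 2),
                  percent = seq(2, 12, length.out = 11))

# Years to mark: 1991 from the example above, 1999 as a placeholder.
event_years <- c(1991, 1999)

p <- ggplot(toy, aes(x = year, y = percent)) +
  geom_line() +
  # One geom_vline() layer draws a line at every value in the vector.
  geom_vline(xintercept = event_years, linetype = "dashed", color = "firebrick") +
  labs(caption = "Dashed lines mark the example years 1991 and 1999")
```

A single layer with a vector keeps all the lines in the same style; separate geom_vline() layers are the better choice when each event should get its own linetype or color.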

Show the two new versions of your side-by-side graphs, with the vertical lines and a caption using labs(caption=" ") to tell the reader what the lines represent.

ggplot(percWhite, aes(x = year, y = percent, color = chamber)) +
  geom_line() +
  facet_wrap(~chamber) +
  labs(title = "How Racial Demographics Have Changed In Washington", y = "% identifying as 'White or Other'", color = "Chamber of government") +
  theme_dark() +
  geom_vline(xintercept = c(1865, 1954, 1968), linetype = "dashed") +
  geom_vline(xintercept = 1877, linetype = "dotted")

ggplot(justmen, aes(x = year, y = percent, color = chamber)) +
  geom_line() +
  facet_wrap(~chamber) +
  labs(title = "How Gender Representation Has Changed In Washington", y = "% of members who are male", color = "Chamber of government") +
  theme_dark() +
  geom_vline(xintercept = 1920, linetype = "dashed") +
  geom_vline(xintercept = 1945, linetype = "dotted")

Using geographic information in a map

One more way we can think of representation is geographically. We haven’t done much mapping in class yet, so I’ll share some example code that can help you create the map, but you’ll need to calculate your summary dataset first. For this one, let’s make a map of the number of non-white members of Congress, by state. You’ll have to think about how to set your filter statement and what should go into your group_by statement. Remember, n_distinct() will count the number of distinct items in the column (in this case, members’ names) and n() will count the number of rows in each subgroup of the data.
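
The difference between n_distinct() and n() is easiest to see on a toy table. In this sketch the column names bioname and state_abbrev are assumptions (use glimpse() to confirm what your data calls them), the member names and "Group" labels are placeholders, and nonwhite_by_state is a made-up name.

```r
library(dplyr)

# Toy data: Member A appears twice because they served in two congresses.
legislature <- tibble(
  bioname      = c("Member A", "Member A", "Member B", "Member C", "Member D"),
  state_abbrev = c("CA", "CA", "CA", "OH", "OH"),
  id_cat       = c("Group B", "Group B", "Group C", "White or Other", "Group B")
)

nonwhite_by_state <- legislature %>%
  filter(id_cat != "White or Other") %>%          # keep non-white members only
  group_by(state_abbrev) %>%                      # one group per state
  summarize(num_by_state = n_distinct(bioname),   # each person counted once
            num_rows     = n())                   # one count per row (person-congress)
```

In the toy result California has two distinct members but three rows, which is why n_distinct() is the safer choice when a person can appear in many congresses.
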
Now, we can take this new summary dataframe, and add it to a map of the states that we can get from the maps package we added way back at the beginning of our code. Remember that? We can use this to see which states have elected representatives that include non-white persons, and how many.

We can get all the latitude and longitude coordinates to draw US state boundaries using the map_data() function, like this:

Use head() or str() to look at what is contained in your new states_map dataframe. We can then take the summary dataset that you made already and use a few nested functions to make our state abbreviations into full-length state names so that text matching can be used. Luckily, base R comes with an internal dataset of all the state abbreviations (state.abb) and the state names (state.name)! What do you think the tolower() function is doing in the code, and why would you want to use it?
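
The conversion code is not shown above; one way the nested functions could fit together is sketched below on a toy summary table. The names nonwhite_by_state, state_abbrev, and region are assumptions chosen for illustration.

```r
library(dplyr)

# Toy summary table with two states.
nonwhite_by_state <- tibble(state_abbrev = c("CA", "OH"),
                            num_by_state = c(2, 1))

nonwhite_by_state <- nonwhite_by_state %>%
  # match() finds each abbreviation's position in state.abb; that position
  # picks the full name out of state.name; tolower() lowercases the result.
  mutate(region = tolower(state.name[match(state_abbrev, state.abb)]))

nonwhite_by_state$region
```

Abbreviations that are not in state.abb (for example "DC") come back as NA, so it is worth checking for missing values after this step.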

Then, we can join together the two datasets, the one with all our summary data, and the one with the map boundaries. Left join will keep all the rows and columns from the first (on the left) data frame that we add, and match the state (region) names from the second data frame we list (on the right), along with its columns. Run the code below and then look at the new dataframe to see what it did.
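
The join code is not shown above, so here is a sketch using a tiny stand-in for the boundary data (in the lab, states_map would hold the real state outlines). The coordinates and counts are made up, and nonwhite_by_state is an assumed name for the summary table.

```r
library(dplyr)

# Tiny stand-in for the boundary data: a few points per state.
states_map <- tibble(
  long   = c(-120, -119, -83, -82, -100),
  lat    = c(37, 38, 40, 41, 31),
  region = c("california", "california", "ohio", "ohio", "texas")
)

# Toy summary table keyed by the same lowercase state names.
nonwhite_by_state <- tibble(region       = c("california", "ohio"),
                            num_by_state = c(2, 1))

# Every boundary row is kept; num_by_state is NA where the summary
# table has no matching region (texas in this toy example).
states_map <- left_join(states_map, nonwhite_by_state, by = "region")
```

Putting the map data on the left is what keeps every state drawable; states with no match simply get NA for the fill variable.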

Now, let’s make a map! scale_fill_viridis_c() is a fancy color palette that we can add to make the differences more obvious and easier to read for people with most forms of color blindness. Notice that to make a map, we have to make closed shapes, or polygons, for the state outlines. So we can use geom_polygon() to achieve this goal, and can color them by whatever variable we put into fill. Don’t forget to add your own caption and theme!

ggplot(states_map, aes(long, lat, group = group))+
  geom_polygon(aes(fill = num_by_state), col="white")+
  scale_fill_viridis_c(option = "C") + theme_linedraw() + labs(fill = "Number of Congresspeople", title = "Number of Representatives and Senators of Color Per State", x = "Longitude (°W)", y = "Latitude (°N)")

Remember, whenever you want to do a lot in a single chunk of code, it is best to start small, check that it works, and then build up to the complexity. This makes it easier to find bugs and get the result that you want efficiently!

Write a few sentences below your graph to interpret it. What geographic trends exist? Is there anything you see that is surprising? What new questions does this generate for you?

California naturally has the most non-white representatives because it is the most populous state, and the states with no representatives from minority races have low populations and therefore fewer representatives in general (with one or two exceptions, like Kentucky). I feel like this map will not be particularly useful unless we change it to the percentage of representatives that are non-white by state. I’d like to ask how different our results would be if we measured that instead.

Age

Let’s create one more measure for representation in Congress: age. It seems obvious that most Senate and House representatives are older, but has that always been the case? A Senator must be at least 30 years old to run for office, and a House Representative must be at least 25 years old. What two variables could you use to calculate the age that a member of congress was at the beginning of each congressional session? Use a handy dplyr layer to make that calculation now, call your new variable “age”, and add it to your congress dataframe.

# Age at the start of each session; assumes the year variable created earlier
legislature <- legislature %>% mutate(age = year - born)
median_ages <- legislature %>% filter(chamber == "House") %>% group_by(congress, chamber) %>%
  summarize(median = median(age, na.rm = TRUE), .groups = "drop")
median_ages2 <- legislature %>% filter(chamber == "Senate") %>% group_by(congress, chamber) %>%
  summarize(median = median(age, na.rm = TRUE), .groups = "drop")

Now, let’s calculate the median age of congress people for each chamber and session separately. Save this to a new summary dataframe, called median_ages.
Let’s plot this one too, as a line graph. You know what to do!

ggplot(data = median_ages, aes(x = congress, y = median)) +
  geom_line() +
  labs(x = "Session of Congress", y = "Median age of members", title = "Median Age of the House of Representatives Across Time") +
  facet_wrap(~chamber)

ggplot(data = median_ages2, aes(x = congress, y = median)) +
  geom_line() +
  labs(x = "Session of Congress", y = "Median age of members", title = "Median Age of the Senate Across Time") +
  facet_wrap(~chamber)

Write a few sentences interpreting the graph. Do you notice a trend? How do the two chambers differ? Any hypotheses why this might be?

We can see for both chambers that the age of the average member rose roughly to a peak at the 95th Congress (1977-1979), and since then has dropped sharply. Senators tend to be older than representatives. This is most likely because an individual senator is more influential than an individual representative; therefore experience matters more for voters when choosing a senator. Being a senator is seen as a more senior position.

Conclusion

Write a brief paragraph summing up what you learned in this lab. Did you answer the questions that you framed in the introduction? What were some of the surprising findings?

We learned about the composition of Congress against a variety of metrics and how to display these data in an aesthetically pleasing way. We saw that the distribution of political leanings in Congress follows a bimodal distribution, but my question of whether this session of Congress is more polarized than earlier sessions of Congress has not been answered. We did discover how the demographics of Congress have changed over time. It is surprising that our Congresspeople were the oldest back in the 70s, rather than earlier in the 20th or 19th centuries!