1. Initial setup - data and packages

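library(dplyr)     # filter(), mutate(), group_by(), left_join(), glimpse(), %>%
library(ggplot2)   # ggplot() and the geom_*() layers
library(ggthemes)  # theme_tufte()

View(House_all_members)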

2. Exploring and setting up the House data

glimpse(House_all_members) 
## Rows: 40,144
## Columns: 22
## $ congress                      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ chamber                       <chr> "President", "House", "House", "House", …
## $ icpsr                         <int> 99869, 4766, 8457, 9062, 9489, 9706, 967…
## $ state_icpsr                   <int> 99, 1, 1, 1, 1, 1, 11, 44, 44, 44, 52, 5…
## $ district_code                 <int> 0, 98, 98, 98, 98, 98, 1, 2, 1, 3, 6, 3,…
## $ state_abbrev                  <chr> "USA", "CT", "CT", "CT", "CT", "CT", "DE…
## $ party_code                    <int> 5000, 5000, 5000, 5000, 5000, 5000, 5000…
## $ occupancy                     <int> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ last_means                    <int> NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ bioname                       <chr> "WASHINGTON, George", "HUNTINGTON, Benja…
## $ bioguide_id                   <chr> NA, "H000995", "S000349", "S001047", "T0…
## $ born                          <int> NA, 1736, 1721, 1740, 1740, 1743, 1758, …
## $ died                          <int> NA, 1800, 1793, 1819, 1809, 1804, 1802, …
## $ nominate_dim1                 <dbl> NA, 0.639, 0.589, 0.531, 0.692, 0.738, 0…
## $ nominate_dim2                 <dbl> NA, 0.304, 0.307, 0.448, 0.246, 0.206, -…
## $ nominate_log_likelihood       <dbl> NA, -29.04670, -40.59580, -25.87361, -30…
## $ nominate_geo_mean_probability <dbl> NA, 0.708, 0.684, 0.724, 0.750, 0.825, 0…
## $ nominate_number_of_votes      <int> NA, 84, 107, 80, 106, 86, 94, 103, 98, 9…
## $ nominate_number_of_errors     <int> NA, 12, 18, 13, 11, 5, 18, 12, 9, 2, 11,…
## $ conditional                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ nokken_poole_dim1             <dbl> NA, 0.649, 0.614, 0.573, 0.749, 0.770, 0…
## $ nokken_poole_dim2             <dbl> NA, 0.229, 0.298, 0.529, 0.166, 0.146, -…
table(House_all_members$chamber)
## 
##     House President 
##     40017       127

The data set is organized by Congress number: each President row and the House members who served in the same Congress share one value in the congress column. Every row for the 1st Congress has congress = 1, every row for the 2nd Congress has congress = 2, and so on. The counts do not work out to 435 House members per Congress: 40,017 House rows against 127 President rows is well below that. The data therefore does not contain 435 members for every Congress, which is at least partly because the House had far fewer than 435 seats in its early Congresses.

Presidents <- filter(House_all_members, chamber == "President")
House_all_members <- filter(House_all_members, chamber == "House")

View(Presidents)
View(House_all_members)

summary(House_all_members$nominate_dim1)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -1.00000 -0.33100 -0.04700  0.00452  0.35100  0.99800      163
ggplot(House_all_members, aes(x = nominate_dim1)) +
  geom_histogram(binwidth = 0.1, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(
    x = "Ideological Position (nominate_dim1)", 
    y = "Count of House Members", 
    title = "Distribution of Ideological Positions in the House \n(-1 = Very Liberal, 1 = Very Conservative)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))
## Warning: Removed 163 rows containing non-finite outside the scale range
## (`stat_bin()`).

The histogram shows that roughly equal numbers of House members voted liberally (negative scores) and conservatively (positive scores). To the extent that the members cluster on either side of 0 rather than around it, this suggests polarization, because members of different parties are less likely to overlap.

t.test(House_all_members$nominate_dim1, mu = 0)
## 
##  One Sample t-test
## 
## data:  House_all_members$nominate_dim1
## t = 2.4069, df = 39853, p-value = 0.01609
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.0008396045 0.0082044062
## sample estimates:
##   mean of x 
## 0.004522005

This is a one-sample t-test comparing the mean of nominate_dim1 with 0, the ideological midpoint. The null hypothesis is that the mean is 0 and the alternative hypothesis is that the mean is anything other than 0. The result is statistically significant: the p-value is 0.016, which is less than 0.05, so we reject the null hypothesis. However, the mean, 0.0045, is very close to 0, so the result is not substantively significant. With almost 40,000 observations, even a tiny departure from 0 is detectable.
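To put a number on "not substantively significant", a standardized effect size can be recovered from the printed output alone. The sketch below uses only the t statistic and degrees of freedom reported above. The variable names are illustrative, and Cohen's d is one common effect-size measure, not something the original analysis computed.

```r
# Values copied from the t.test() output above
t_stat <- 2.4069
df     <- 39853
n      <- df + 1   # one-sample test: df = n - 1

# For a one-sample t-test, t = mean / (sd / sqrt(n)),
# so Cohen's d = mean / sd = t / sqrt(n)
cohens_d <- t_stat / sqrt(n)
print(round(cohens_d, 3))
```

A d of roughly 0.01 is far below the 0.2 that is conventionally called a small effect, which is consistent with reading the result as statistically but not substantively significant.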

3. Analysis over time and across subsets of the data

  1. We think that polarization has increased over time as the two political sides continue to distance themselves from the middle. Another possible reason is that more people are earning college degrees, and college graduates tend to lean left.

House_all_members <- House_all_members %>%
  mutate(year = 1789 + (congress - 1) * 2)

summary(House_all_members$year)
View(House_all_members)
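The year variable is the first year of each two-year Congress. As a quick check of the formula, the sketch below wraps the same arithmetic in a small helper (the name congress_to_year is illustrative, not part of the original code) and evaluates it for a few Congress numbers.

```r
# Same formula as in the mutate() call above, wrapped in a helper
congress_to_year <- function(congress) 1789 + (congress - 1) * 2

# First Congress, two Congresses from the 1870s-1880s, and the 117th
print(congress_to_year(c(1, 45, 49, 117)))
```

Under this formula the 45th Congress maps to 1877, the 49th to 1885, and the 117th to 2021.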

ggplot(House_all_members, aes(x = year, y = nominate_dim1)) +
  geom_point(alpha = 0.6, color = "blue", size = 0.5) + 
  labs(
    x = "Year", 
    y = "Ideological Position (nominate_dim1)", 
    title = "Ideological Polarization of House Representatives Over Time")
## Warning: Removed 163 rows containing missing values or values outside the scale range
## (`geom_point()`).

After about 1850 the scores start to polarize as the points split apart. Around 1925 the polarization starts to stabilize, until about 2000, when it begins to increase again. Overall, polarization has not been consistently prevalent: it has risen and fallen over the past 200 years.

table(House_all_members$party_code)
## 
##     1    13    22    26    29    37    44    46   100   108   112   114   117 
##   648  1557   217    77   946     2    28     3 18721     8    10     6     2 
##   200   203   206   208   213   300   310   326   328   329   331   340   347 
## 15630    32    46     8     1    17    80    28    32    39    22    62     3 
##   354   355   356   370   380   402   403   522   523   537   555   603  1060 
##     6     2     2    41     7     1     2     7     1    26   801     2     4 
##  1116  1275  1346  3333  3334  4000  4444  5000  6000  7000  7777  8000  8888 
##     1   293    65   100    19   117    12   131     2     7    56    15    72
ggplot(House_all_members, aes(x = year, y = nominate_dim1)) +
geom_jitter(aes(color=factor(party_code)), alpha=1/10)
## Warning: Removed 163 rows containing missing values or values outside the scale range
## (`geom_point()`).

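geom_jitter adds a small amount of random noise to each point's position, so overlapping points are easier to tell apart than with geom_point, where they are drawn exactly on top of each other. alpha sets the transparency of the points (here 1/10), so dense regions look darker and patterns are easier to see.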
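The idea behind jittering can be seen with base R's jitter() function. This is a sketch of the concept only: it is not the code ggplot2 runs internally, and the values and seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Five members who share the same x value would be drawn on top of each other
x <- rep(1885, 5)

# jitter() adds uniform random noise in [-amount, amount] to each value
x_jittered <- jitter(x, amount = 0.4)
print(x_jittered)
```

Each value moves by at most 0.4 in either direction, which is enough to separate the points without changing the overall picture.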

The two major parties are the Republican and Democratic parties: green represents the Republican Party and yellow represents the Democratic Party. The Democratic Party has lower nominate_dim1 scores and hence leans more liberal. The Republican Party has higher nominate_dim1 scores and hence leans more conservative.

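Filtering for years after 1876. Note that under the year formula above, congress >= 49 keeps 1885 onward; the Congress that began in 1877 is the 45th.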

house1876 <- filter(House_all_members, congress >= 49)

Seeing how many party codes are in the dataset

table(house1876$party_code)
## 
##   100   200   213   326   328   329   331   340   347   354   355   356   370 
## 15892 13696     1     2    20     6    10    62     3     6     2     2    41 
##   380   402   522   523   537  1060 
##     7     1     7     1    26     4

Creating a dummy variable

house1876 <- mutate(house1876,
                    twoparty = ifelse(party_code != 100 & party_code != 200, NA_integer_,
                                      ifelse(party_code == 200, 1, 0)))

Logic behind creating the dummy var: The code checks each row’s party_code:

If the party_code is not 100 (Democrat) and not 200 (Republican), it assigns NA_integer_ (missing value).

Otherwise, it checks:
- If the party_code is 200 (Republican), it assigns 1.
- Else (meaning it’s 100 = Democrat), it assigns 0.

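NA_integer_ explicitly creates a missing value of integer type. Because the other values are written as 1 and 0, which R stores as doubles, ifelse() returns a numeric (double) column here; writing 1L and 0L instead would keep the column an integer vector. Either way, the comparisons twoparty == 0 and twoparty == 1 used below behave the same.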
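The type that ifelse() returns depends on the values it is given, which is easy to check on a toy vector. The sketch below uses three hypothetical party codes and compares the expression from the mutate() call with a variant that uses the integer literals 1L and 0L.

```r
# Toy party codes (hypothetical): Democrat, Republican, a third party
party_code <- c(100L, 200L, 328L)

# Same expression as in the mutate() call above: 1 and 0 are doubles
twoparty_dbl <- ifelse(party_code != 100 & party_code != 200, NA_integer_,
                       ifelse(party_code == 200, 1, 0))

# Variant with integer literals 1L and 0L
twoparty_int <- ifelse(party_code != 100 & party_code != 200, NA_integer_,
                       ifelse(party_code == 200, 1L, 0L))

print(typeof(twoparty_dbl))
print(typeof(twoparty_int))
print(twoparty_int)
```

Both versions code Democrats as 0, Republicans as 1, and everyone else as NA; only the storage type differs.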

Creating Rep or Dem with party variables

house1876 <- mutate(house1876,
                    rep_or_dem = ifelse(twoparty == 0, "Democrat",
                                        ifelse(twoparty == 1, "Republican", NA)))

Creating factor variable

house1876$rep_or_dem <- as.factor(house1876$rep_or_dem)

Viewing Frequencies

table(house1876$rep_or_dem)
## 
##   Democrat Republican 
##      15892      13696

The values seem reasonable: there are somewhat more Democratic than Republican rows (each row is one member in one Congress), but the two counts are roughly even and likely accurate.

  1. Party polarization over time
ggplot(house1876, aes(x = year, y = nominate_dim1, color = rep_or_dem)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("Democrat" = "blue", "Republican" = "red")) +
  scale_x_continuous(breaks = seq(1877, 2017, 10)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Polarization in the U.S. House Since 1877",
       x = "Year",
       y = "Ideological Score (DW-NOMINATE)",
       color = "Party")
## Warning: Removed 102 rows containing missing values or values outside the scale range
## (`geom_point()`).

This plot shows polarization in the U.S. House since 1877. Each point is a member of Congress, colored by party. Democrats (blue) and Republicans (red) show increasingly distinct ideological clusters, especially from the 1980s onward. Historically, there was more overlap (especially between 1927 and 1987), but today the divide is clear. Republicans appear to have shifted more sharply to the right in recent years, suggesting that they may have polarized more quickly than Democrats.

4. Testing claims for polarization in the U.S. House since 1876

Let's begin by creating a summary measure to see how polarized Congress is over time.

house1876 <- house1876 %>%
  group_by(congress, rep_or_dem) %>%
  mutate(party_mean = mean(nominate_dim1, na.rm = TRUE))

This code groups the data by Congress session and party, and then calculates the mean ideological score (nominate_dim1) for each party within each Congress. The result is a new column, party_mean, where every row within a given party-congress group gets the same average value.
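The "every row in the group gets the same value" behavior can be seen on a toy data frame. The sketch below uses base R's ave() as a stand-in for group_by() plus mutate(), so that it runs without extra packages; the data values are hypothetical.

```r
# Toy data (hypothetical values): two Congresses, two parties
toy <- data.frame(
  congress      = c(49, 49, 49, 49, 50, 50, 50),
  rep_or_dem    = c("Democrat", "Democrat", "Republican", "Republican",
                    "Democrat", "Republican", "Republican"),
  nominate_dim1 = c(-0.4, -0.2, 0.3, 0.5, -0.3, NA, 0.6)
)

# ave() returns one value per row: the mean of that row's congress-party group,
# ignoring missing scores just like mean(..., na.rm = TRUE) in the original code
toy$party_mean <- ave(
  toy$nominate_dim1, toy$congress, toy$rep_or_dem,
  FUN = function(v) mean(v, na.rm = TRUE)
)
print(toy)
```

The data frame keeps all seven rows; rows in the same Congress and party share one party_mean, which is what makes the later line plot possible without first collapsing the data.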

house1876 <- house1876 %>%
  group_by(congress) %>%
  mutate(congress_mean = mean(nominate_dim1, na.rm = TRUE))

house1876 %>%
  filter(!is.na(rep_or_dem)) %>%
  ggplot(aes(x = year)) +
  geom_line(aes(y = party_mean, color = rep_or_dem)) +
  geom_line(aes(y = congress_mean), color = "black") +
  scale_color_manual(values = c("Democrat" = "blue", "Republican" = "red")) +
  theme_tufte() +
  scale_x_continuous(breaks = seq(1877, 2013, 10)) +
  labs(
    x = "Year",
    y = "Mean dimension 1 nominate score",
    title = "Mean-DW1 scores over time and by party",
    color = "Party",
    caption = "Note: black line is mean DW1 for all members in a Congress"
  )

The output shows that over time, the average DW-NOMINATE scores for Democrats and Republicans have moved farther apart, especially from the 1980s onward. Early in the timeline, the two parties’ mean scores are closer together, suggesting less ideological division. But in recent decades, the gap between the red and blue lines has grown, highlighting increased polarization.

A good measure of polarization could be the distance between the mean scores of the two major parties in each Congress. This measure shows how ideologically far apart the parties are. The greater the distance, the more polarized the Congress. This is also a good measure because it is easy to compute from the data.

  1. Creating Summary tables and merging them into one dataset
house1876_byYear <- house1876 %>%
  group_by(year) %>%
  summarise(congress_mean = mean(nominate_dim1, na.rm = TRUE))

house1876_dems <- house1876 %>%
  filter(rep_or_dem == "Democrat") %>%
  group_by(year) %>%
  summarise(dem_mean = mean(nominate_dim1, na.rm = TRUE))

house1876_reps <- house1876 %>%
  filter(rep_or_dem == "Republican") %>%
  group_by(year) %>%
  summarise(rep_mean = mean(nominate_dim1, na.rm = TRUE))

house1876cong <- left_join(house1876_byYear, house1876_dems, by = "year")
house1876cong <- left_join(house1876cong, house1876_reps, by = "year")

Creating the plot:

ggplot(house1876cong, aes(x = year)) +
  geom_line(aes(y = dem_mean, color = "Democrat")) +
  geom_line(aes(y = rep_mean, color = "Republican")) +
  geom_line(aes(y = congress_mean, color = "All members")) +
  scale_color_manual(values = c("Democrat" = "blue", 
                                "Republican" = "red", 
                                "All members" = "black")) +
  theme_minimal() +
  labs(
    title = "Polarization in the U.S. House: Mean DW-NOMINATE by Party",
    x = "Year",
    y = "Mean DW-NOMINATE Score",
    color = "Group"
  )

This new figure looks almost the same as the one from Step 4.1 because both plots show the average ideological scores by party and Congress over time. The key difference is that this version uses a dataset with only one row per year, which makes it easier to analyze statistically. Because each line is drawn from an explicit column (dem_mean, rep_mean, congress_mean) rather than from the grouped rep_or_dem variable, this version is cleaner and better suited to modeling. The two figures should look the same, and any differences would probably come from data preparation errors or NA handling.

# Create measure that shows the magnitude of polarization
house1876cong <- house1876cong %>%
  mutate(polarization_magnitude = abs(rep_mean - dem_mean))

# Create measure that shows the direction of polarization
house1876cong <- house1876cong %>%
  mutate(polarization_direction = abs(rep_mean) - abs(dem_mean))

polarization_magnitude: This tells us how far apart the parties are in a given year.
- High values: high polarization
- Low values: low polarization

This value is always positive (since it’s absolute value), ranging from 0 to ~2.

polarization_direction: This shows which party has moved further from the center.
- Positive values: Republicans are further from center
- Negative values: Democrats are further from center

Values range from about -1 to 1, and yes, this one can be negative. If it’s near 0, both parties are about equally distant from the center.
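A small worked example may help keep the two measures apart. The party means below are hypothetical values chosen for illustration, not numbers from the data.

```r
# Hypothetical party means for one Congress (not taken from the data)
rep_mean <- 0.5
dem_mean <- -0.3

# Gap between the two party means
polarization_magnitude <- abs(rep_mean - dem_mean)

# Which party sits further from 0: positive = Republicans, negative = Democrats
polarization_direction <- abs(rep_mean) - abs(dem_mean)

print(c(magnitude = polarization_magnitude, direction = polarization_direction))
```

Here the parties are 0.8 apart, and the direction is 0.2 because the Republican mean sits 0.2 further from the center than the Democratic mean does.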

t-test:

t.test(house1876cong$polarization_direction, mu = 0)
## 
##  One Sample t-test
## 
## data:  house1876cong$polarization_direction
## t = 7.1478, df = 68, p-value = 7.68e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.03129898 0.05554289
## sample estimates:
##  mean of x 
## 0.04342094

The t value is significant, and since the p-value is much less than 0.05, we reject the null hypothesis that the mean polarization direction is 0. This means the difference between party distances from the center is statistically significant.

Also, because the mean is positive (0.043) and the confidence interval is entirely above zero, we can conclude that Republicans have, on average, been further from the ideological center than Democrats, contributing more to overall polarization in the U.S. House since 1877.

  1. Yes, polarization has clearly increased over time. Based on the plots and DW-NOMINATE data, the ideological distance between Democrats and Republicans has grown steadily since the 1980s. Earlier in the period (1876–1950), the parties often overlapped, but in recent decades, their mean positions have diverged significantly.

This trend supports concerns about excessive polarization in recent years. The increasing gap suggests it’s a structural and long-term shift. In the plots, time and polarization appear strongly correlated; polarization does not look flat or random over time.

  1. Correlation test:
cor.test(house1876cong$year, house1876cong$polarization_magnitude)
## 
##  Pearson's product-moment correlation
## 
## data:  house1876cong$year and house1876cong$polarization_magnitude
## t = 0.32225, df = 67, p-value = 0.7483
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1991966  0.2734732
## sample estimates:
##        cor 
## 0.03933872

The correlation is very weak (r is very low) and not statistically significant (p > 0.05). This suggests that there is no meaningful linear relationship between time and polarization magnitude over the full 1876–present time period.

Despite the appearance of increasing polarization in plots, this implies that the trend is not strongly linear.

Linear regression test:

model1 <- lm(polarization_magnitude ~ year, data = house1876cong)
summary(model1)
## 
## Call:
## lm(formula = polarization_magnitude ~ year, data = house1876cong)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16489 -0.11795  0.01084  0.10985  0.18986 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4688938  0.6807719   0.689    0.493
## year        0.0001123  0.0003485   0.322    0.748
## 
## Residual standard error: 0.1153 on 67 degrees of freedom
## Multiple R-squared:  0.001548,   Adjusted R-squared:  -0.01335 
## F-statistic: 0.1038 on 1 and 67 DF,  p-value: 0.7483

The regression shows no statistically significant relationship between year and polarization magnitude. The p-value is very high (0.748) and the R-squared value is close to 0, indicating that year does not meaningfully explain changes in polarization magnitude.
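The regression and the correlation test above are two views of the same relationship: with a single predictor, R-squared equals the squared Pearson correlation. The sketch below checks this using only the correlation printed by cor.test(); the variable names are illustrative.

```r
# Correlation reported by cor.test() for the full period
r_full <- 0.03933872

# With one predictor, R-squared is the square of the Pearson correlation
r2_full <- r_full^2
print(signif(r2_full, 4))
```

The result agrees with the Multiple R-squared shown in the regression summary, which is why both tests also report the same p-value of 0.7483.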

Although polarization may appear to increase in more recent years visually, the results of our linear model confirm that there is no consistent linear trend over the full time period.

  1. Let's filter to recent years and test again:
recent_data <- house1876cong %>% filter(year >= 1980)

# Correlation test
cor.test(recent_data$year, recent_data$polarization_magnitude)
## 
##  Pearson's product-moment correlation
## 
## data:  recent_data$year and recent_data$polarization_magnitude
## t = 18.745, df = 19, p-value = 1.031e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9358019 0.9896030
## sample estimates:
##       cor 
## 0.9740133
# Linear model
model_recent <- lm(polarization_magnitude ~ year, data = recent_data)
summary(model_recent)
## 
## Call:
## lm(formula = polarization_magnitude ~ year, data = recent_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.033214 -0.017749 -0.003558  0.013703  0.040482 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.412e+01  7.943e-01  -17.78 2.67e-13 ***
## year         7.441e-03  3.969e-04   18.75 1.03e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02203 on 19 degrees of freedom
## Multiple R-squared:  0.9487, Adjusted R-squared:  0.946 
## F-statistic: 351.4 on 1 and 19 DF,  p-value: 1.031e-13

Correlation test results:

- The correlation coefficient (r = 0.974) indicates an extremely strong positive relationship.
- The p-value of 1.03e-13 is highly statistically significant.
- 95% CI = [0.936, 0.990], so we are confident that the true correlation is strongly positive.

Linear regression results:

- Coefficient for year = 0.00744, so polarization magnitude rises by about 0.0074 per year, a strong upward trend.
- p-value = 1.03e-13, highly significant.
- R-squared = 0.949, so the model explains 94.9% of the variation in polarization magnitude since 1980.

So based on these results polarization has sharply increased in recent decades, even if the overall trend from 1876 onward appears flat. It highlights how trends can depend heavily on the time window analyzed.
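How a flat overall correlation and a steep recent one can coexist is easiest to see with a made-up series. The sketch below uses a synthetic U-shaped curve (it is NOT the House data, and the constants are arbitrary) that is high early, low mid-century, and high again late.

```r
# Synthetic U-shaped series (NOT the House data): one value per Congress start year
year      <- seq(1885, 2021, by = 2)
magnitude <- 0.5 + 0.4 * ((year - 1953) / 68)^2

# Linear correlation over the whole period
r_all <- cor(year, magnitude)

# Linear correlation restricted to 1980 onward
recent   <- year >= 1980
r_recent <- cor(year[recent], magnitude[recent])

print(round(c(all_years = r_all, since_1980 = r_recent), 3))
```

Over the full window the early decline and the late rise cancel out in a linear measure, while the recent window contains only the rising part. A correlation near zero therefore means "no linear trend", not "no change".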

  1. Selective reporting is a key ethical issue. If researchers only present results from recent decades (as in Q6) without mentioning that earlier data shows no clear trend (Q5), it may mislead readers into thinking polarization has always been rising. Misrepresenting the data as such could influence public opinion or policy based on incomplete evidence.

  2. The 117th Congress being incomplete can affect the accuracy in visualizations or summary statistics. If only partial data is available for that session, the polarization scores could be underestimated or skewed, especially if not all votes or members are included. This could create a misleading drop or spike at the end of the timeline.

  1. Filtering the dataset to 1944 onward
recent_members <- house1876 %>%
  filter(year >= 1944, rep_or_dem == "Democrat" | rep_or_dem == "Republican")
  1. Creating a measure of ideological extremity with the abs value of DW-NOMINATE
recent_members <- recent_members %>%
  mutate(extremity = abs(nominate_dim1))
  1. Fitting the linear models
dems_model <- lm(extremity ~ year, data = filter(recent_members, rep_or_dem == "Democrat"))
reps_model <- lm(extremity ~ year, data = filter(recent_members, rep_or_dem == "Republican"))

summary(dems_model)
## 
## Call:
## lm(formula = extremity ~ year, data = filter(recent_members, 
##     rep_or_dem == "Democrat"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36079 -0.10521  0.01074  0.10437  0.70712 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.879e+00  1.417e-01  -20.32   <2e-16 ***
## year         1.618e-03  7.149e-05   22.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1526 on 9490 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.05122,    Adjusted R-squared:  0.05112 
## F-statistic: 512.3 on 1 and 9490 DF,  p-value: < 2.2e-16
summary(reps_model)
## 
## Call:
## lm(formula = extremity ~ year, data = filter(recent_members, 
##     rep_or_dem == "Republican"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39325 -0.10400 -0.01083  0.09400  0.71945 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.350e+00  1.472e-01  -43.13   <2e-16 ***
## year         3.379e-03  7.418e-05   45.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1517 on 7773 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.2107, Adjusted R-squared:  0.2106 
## F-statistic:  2075 on 1 and 7773 DF,  p-value: < 2.2e-16

The results show that both Democrats and Republicans have become more ideologically extreme over time (both slopes are positive and highly significant with p < 2e-16).

However, the Republican slope (0.00338) is more than double that of Democrats (0.00162), showing that Republicans have polarized more quickly since 1944.

The higher R-squared for Republicans (0.2107) than for Democrats (0.0512) means that year explains more of the variation in extremity among Republicans. In both models, most of the member-level variation in extremity is left unexplained by year alone.
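The slope comparison can be made concrete with a little arithmetic on the printed coefficients. The sketch below copies the two year slopes from the summary() outputs above; the variable names are illustrative.

```r
# Slopes for year, copied from the two summary() outputs above
dem_slope <- 1.618e-03
rep_slope <- 3.379e-03

# How many times steeper the Republican trend is
slope_ratio <- rep_slope / dem_slope

# Expected change in extremity over ten years for each party
change_per_decade <- c(Democrat = dem_slope, Republican = rep_slope) * 10

print(round(slope_ratio, 2))
print(round(change_per_decade, 3))
```

On the absolute DW-NOMINATE scale this is roughly 0.016 per decade for Democrats and 0.034 per decade for Republicans, a ratio of about 2.1.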