Assignment 6: Graphing Relationships

library(tidyverse)
library(skimr)
library(gt)
library(htmltools)
library(DT)
library(ggrepel)
library(sysfonts)
library(showtext)
library(ggtext)
library(tigris)

Question 1

Part A: Importing the Data

First, we use read_csv() to read in our data frame:

dta <- read_csv("state_prez_vote_08-24.csv")
dta
# A tibble: 51 × 12
   state  abbrev obamavote08 mccainvote08 obamavote12 romneyvote12 clintonvote16
   <chr>  <chr>        <dbl>        <dbl>       <dbl>        <dbl>         <dbl>
 1 Alaba… AL          813479      1266546      795696      1255925        729547
 2 Alaska AK          123594       193841      122640       164676        116454
 3 Arizo… AZ         1034707      1230111     1025232      1233654       1161167
 4 Arkan… AR          422310       638017      394409       647744        380494
 5 Calif… CA         8274473      5011781     7854285      4839958       8753792
 6 Color… CO         1288633      1073629     1322998      1185050       1338870
 7 Conne… CT          997772       629428      905083       634892        897572
 8 Delaw… DE          255459       152374      242584       165484        235603
 9 Distr… DC          245800        17367      267070        21381        282830
10 Flori… FL         4282074      4045624     4237756      4163447       4504975
# ℹ 41 more rows
# ℹ 5 more variables: trumpvote16 <dbl>, bidenvote20 <dbl>, trumpvote20 <dbl>,
#   trumpvote24 <dbl>, Harrisvote24 <dbl>

Now, we need to do several things with the data frame. Currently, each row represents a state, so we pivot longer so that each row represents a state, year, and candidate. From there, we clean up the resulting cand_year column, which contains both the candidate name and the year, using several string methods to split it into separate year and candidate columns.

df_1 <- dta %>%
  pivot_longer( # pivot the vote-count columns
    cols = ends_with("08") | ends_with("12") |
           ends_with("16") | ends_with("20") | ends_with("24"),
    names_to = "cand_year",
    values_to = "votes"
  ) |>
  mutate( # remove "vote" from cand_year
    cand_year = gsub("vote", "", cand_year)
  ) |>
  mutate( # extract year and candidate from cand_year
    year = as.numeric(paste0("20", str_sub(cand_year, -2))),
    candidate = str_sub(cand_year, 1, -3)
  )
df_1
# A tibble: 510 × 6
   state   abbrev cand_year   votes  year candidate
   <chr>   <chr>  <chr>       <dbl> <dbl> <chr>    
 1 Alabama AL     obama08    813479  2008 obama    
 2 Alabama AL     mccain08  1266546  2008 mccain   
 3 Alabama AL     obama12    795696  2012 obama    
 4 Alabama AL     romney12  1255925  2012 romney   
 5 Alabama AL     clinton16  729547  2016 clinton  
 6 Alabama AL     trump16   1318255  2016 trump    
 7 Alabama AL     biden20    849624  2020 biden    
 8 Alabama AL     trump20   1441170  2020 trump    
 9 Alabama AL     trump24   1462616  2024 trump    
10 Alabama AL     Harris24   772412  2024 Harris   
# ℹ 500 more rows

Next, we want to compute the proportion of Republican votes for each state and year. To do so, we group on state, abbrev, and year. To compute the Republican proportion of the vote, we divide the Republican votes (identified with a logical condition on the candidate) by the total votes in the state-year group:

df_2 <- df_1 |>
  group_by(state, abbrev, year) |> # group on state, abbreviation, and year
  summarize( # divide Republican votes by total votes
    republican_proportion = sum(votes[candidate %in% c("mccain", "romney", "trump")]) / sum(votes)
  )

Finally, we have arrived at our desired data frame.

df_2
# A tibble: 255 × 4
# Groups:   state, abbrev [51]
   state   abbrev  year republican_proportion
   <chr>   <chr>  <dbl>                 <dbl>
 1 Alabama AL      2008                 0.609
 2 Alabama AL      2012                 0.612
 3 Alabama AL      2016                 0.644
 4 Alabama AL      2020                 0.629
 5 Alabama AL      2024                 0.654
 6 Alaska  AK      2008                 0.611
 7 Alaska  AK      2012                 0.573
 8 Alaska  AK      2016                 0.584
 9 Alaska  AK      2020                 0.553
10 Alaska  AK      2024                 0.568
# ℹ 245 more rows

Part B: Creating a Clean Table of Summary Statistics for 2016, 2020, and 2024

To create a clean table of summary statistics, we first need to group our states by year to generate a yearly distribution. We can make use of skimr to do this:

display_0 <- df_2 |>
  group_by(year) |> # group all states for each year
  skim(republican_proportion) |> # summarize state distributions for each year
  focus(c(year, numeric.mean, numeric.p50, numeric.sd, numeric.hist)) |> # select only desired summaries
  filter(year >= 2016) # 2016 or later
display_0
Data summary
Name group_by(df_2, year)
Number of rows 255
Number of columns 4
_______________________
Column type frequency:
numeric 1
________________________
Group variables year

Variable type: numeric

skim_variable year mean p50 sd hist
republican_proportion 2016 0.52 0.52 0.13 ▁▁▅▇▃
republican_proportion 2020 0.50 0.50 0.12 ▁▁▆▇▅
republican_proportion 2024 0.52 0.52 0.12 ▁▁▅▇▆

Then, we can make a few additional modifications for display reasons:

display_1 <- display_0 |>
  yank("numeric") |> # isolate numeric summary
  select(-skim_variable) |> # remove metavariable
  rename(median = p50, distribution = hist, `std` = sd) |> # rename nicely
  rename_with(~ str_to_title(.x)) # titlecasing
display_1

Variable type: numeric

Year Mean Median Std Distribution
2016 0.52 0.52 0.13 ▁▁▅▇▃
2020 0.50 0.50 0.12 ▁▁▆▇▅
2024 0.52 0.52 0.12 ▁▁▅▇▆

With our desired dataframe tidied, we can now display the table:

display_1 |> 
  gt() |>
  tab_header(title = "Republican Vote Proportions", subtitle = "State-Level Distribution") |>
  fmt_number(columns = c("Mean", "Median", "Std"), decimals = 3) |>
  tab_footnote("Data from Professor Lawrence's Reputable CSV") |>
  tab_style( # right aligning numbers + sizing
    style = cell_text(align = "right", size = px(30), color = "#8B0000"),
    locations = cells_body(columns = c("Year", "Mean", "Median", "Std"))
  ) |>
  tab_style( # center distribution
    style = cell_text(align = "center", size = px(30), color = "#8B0000"),
    locations = cells_body(columns = "Distribution")
  ) |>
  tab_style( # col labels sizing
    style = cell_text(size = px(35), color = "white", weight = "bold", align = "center"),
    locations = cells_column_labels()
  ) |>
  tab_style( # center header
    style = cell_text(align = "center", size = px(40), color = "#8B0000"),
    locations = cells_title(groups = "title")
  ) |>
  tab_style( # center subtitle
    style = cell_text(align = "center", size = px(35), color = "#8B0000"),
    locations = cells_title(groups = "subtitle")) |>
  tab_style( # footnote sizing and coloring
    style = cell_text(size = px(12), color = "#8B0000"),
    locations = cells_footnotes()) |>
  tab_style( # lining
    style = cell_borders(
      sides = "top",
      color = "#D4AF37",
      weight = px(4),
      style = "solid"
    ),
    locations = list(cells_title(groups="title"), cells_column_labels())) |>
  tab_style( # lining
    style = cell_borders(
      sides = "bottom",
      color = "#D4AF37",
      weight = px(4),
      style = "solid"
    ),
    locations = list(cells_title(groups="subtitle"), cells_column_labels())) |>
  tab_style( # lining
    style = cell_borders(
      sides = "bottom",
      color = "#D4AF37",
      weight = px(4)
    ),
    locations = cells_body(
      rows = 3
    )
  ) |>
  opt_row_striping() |>
  tab_options( # background colors
    row.striping.background_color = "white",
    table.background.color = "white",         
    column_labels.background.color = "#8B0000",
    table.border.bottom.width = px(4),
    table.border.bottom.color = "white",
    table_body.hlines.style = "none",
    column_labels.vlines.style = "none"
  ) |>
  opt_table_font( # font
    font = list(
      google_font("Lora")
    ))
Republican Vote Proportions
State-Level Distribution
Year Mean Median Std Distribution
2016 0.520 0.519 0.128 ▁▁▅▇▃
2020 0.503 0.499 0.123 ▁▁▆▇▅
2024 0.525 0.516 0.121 ▁▁▅▇▆
Data from Professor Lawrence's Reputable CSV

In brief, I would say that in 2016, Donald Trump burst onto the political scene, capturing a majority of the state-level vote share on average; the relatively high standard deviation suggests that people were still hesitant about Trump as a politician, but many were willing to give the “new face” a chance. By contrast, 2020 saw a drop in both vote share and standard deviation, suggesting that Trump’s support had wavered (likely due to his controversial handling of COVID-19) as previously on-the-fence supporters aligned with the more centrist Biden. Finally, 2024 was easily Trump’s most decisive showing of the three: he achieved his highest mean vote share with the lowest variability, trends partially driven by the relative disorganization of the Democratic Party leading up to the election.

For the most part, it is hard to grasp clear trends from this table; the numbers seem sporadic, which is fitting given the polarizing nature of Trump’s political persona. While the president’s core base has remained consistently faithful throughout the years, I think this table subtly affords us an insight into his fringe supporters. In other words, the average American may have initially been undecided about Trump as a politician, but the steady reduction in variability implies that this uncertainty has diminished over the years. That is, it seems states have made up their minds about Trump.

Part C: Creating an Interactive Table of State Voting Proportions

To create our interactive table, we will want to re-pivot our data. While this is not strictly necessary, the task seems to imply that we want to have all of the data for a given state in one row. As such, we will pivot back to wide form.

df_3 <- df_2 |>
  mutate(republican_proportion = round(republican_proportion, 3)) |> # round props
  pivot_wider(names_from=year, values_from=republican_proportion) |> # each row is state
  rename(State = state, Abbreviation = abbrev) # cleaner names

Now, we can display our data:

browsable(
  tagList( # HTML styling
    tags$style(HTML("
      table.dataTable {
        font-family: 'Lora';
        font-size: 14px;
        color: #8B0000; 
        background-color: #fdfcf9; 
      }
      table.dataTable thead th {
      border-top: 4px solid #D4AF37;
      border-bottom: 4px solid #D4AF37 !important;
      background-color: #8B0000;
      color: #FFFFFF;
      }
      table.dataTable thead {
        background-color: #FFFFFF; 
        color: #8B0000;
      }
      table.dataTable tbody tr:hover {
        background-color: #FFF8E1; 
      }
      table.dataTable tbody td {
        border: 1px solid #FFFFFF;
      }
      table.dataTable tbody tr:last-child td {
        border-bottom: 4px solid #D4AF37;
      }
    ")),
datatable(df_3, # create interactive table
    caption = tags$caption(
      style = 'caption-side: top; text-align: center; font-size: 22px; font-weight: bold; color: #8B0000',
      '🐘 Republican Vote Proportions 🐘')), 
    tags$div(
      style = "
        display: flex; 
        justify-content: flex-end; 
        margin-top: 5px; 
        padding-right: 10px;
      ",
      tags$div(
        style = "font-size: 14px; color: #555555;",
        tags$em("Data from Professor Lawrence's Reputable CSV")
      )
    )))
[Interactive table output: 🐘 Republican Vote Proportions 🐘 — Data from Professor Lawrence's Reputable CSV]

Question 2

Part A/B: Generating and Describing a Vote Swing Scatter plot

We begin by creating the desired scatter plot. In this case, we will use df_3, which is the wide form of our data:

# load font
font_add_google("League Spartan", "League Spartan") 
showtext_auto() 

props_1 <- ggplot(df_3, aes(x = `2020`, y = `2024`, label = Abbreviation)) + # 2020 on x, 2024 on y, abbrev labels
  geom_point(color = "#4B1248") + # points
  geom_text_repel(size = 3, color = "#4B1248", max.overlaps = 55, family = "League Spartan") + # pt labels
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") + # 45 degree line
  labs( # labels
    title = "Trump Vote Proportion by State",
    caption = "Data from Professor Lawrence's Reputable CSV",
    subtitle = "2024 v. 2020"
  ) +
  theme_minimal() + # clear axis/backdrop
  theme( # font re-sizing/positioning
    axis.title = element_text(size = 17, hjust = 0.5),
    text = element_text(color = "#4B1248", family = "League Spartan"),
    plot.title = element_text(size=20, hjust = 0.5),
    plot.subtitle = element_text(size = 17, hjust = 0.5),
    plot.caption = element_text(size = 9, hjust = 1),
    panel.background = element_rect(fill = "#F8F5EF", color = "#A9C1D9"),
    plot.background = element_rect(fill = "white", color = NA),
    panel.grid = element_line(color="#A9C1D9", size = 0.9),
    axis.text = element_text(size = 15, color = "#4B1248"),
    plot.margin = margin(10, 10, 10, 10),
    axis.ticks = element_line(color = "#4B1248", size = 0.5),
    axis.ticks.length = unit(5, "pt")
  )
props_1

To analyze the visual, we first need to clarify what the 45 degree line means. Visually, the 45 degree line divides the points into three regions: above the line, on the line, and below the line. Since the line represents y = x, points on the line indicate that the voting proportions in the two elections were equal. If a point falls above the line, the Trump voting proportion in 2024 was greater than in 2020 for that state; points below the line indicate the opposite. The 45 degree line therefore lets us make judgments about relative changes for each state.

With this context laid out, it appears that virtually all states fall above the line. That is, Trump received a greater proportion of votes in 2024 than in 2020 in nearly every state. Many of these points sit in close proximity to the line, which means the difference between the elections was not extreme. Still, based on the summary data above, I would not have expected to see an increase in Trump support in virtually all states; the mean proportion only increased by 0.022 between the two elections.
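
As a quick numeric check on these visual impressions (a sketch, reusing the wide data frame df_3 from Part C), we can count how many states sit above the 45 degree line and compute the average change directly:

# Sanity check (sketch): how many states increased from 2020 to 2024, and by
# how much on average? df_3 is still grouped by state, so ungroup first.
df_3 |>
  ungroup() |>
  summarize(
    states_above_line = sum(`2024` > `2020`),  # 2024 proportion exceeds 2020
    mean_change       = mean(`2024` - `2020`)  # average shift across states
  )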

Part C/D: Labeling States and Outliers

Evidently, there are some serious challenges with labeling the states. In its current construction, many of the states are bunched together, so labeling all 51 points is very difficult. As seen in the graph, so many names sit in close proximity to one another that it is hard to even determine which names correspond to which points. While I used ggrepel to help space out the labels, many of the names are pushed so far from their points (because of collisions) that the guide lines add significant noise to the visual. Along the same lines, to avoid additional collisions, I was forced to make the text much smaller than I would like.

As suggested by the instructions, one possible alternative is to only label a few noteworthy points:

props_2 <- ggplot(df_3, aes(x = `2020`, y = `2024`, label = Abbreviation)) + # 2020 on x, 2024 on y, abbrev labels
  geom_point(color = "#4B1248") + # points
  geom_text_repel(data = subset(df_3, df_3$Abbreviation %in% c("DC", "FL", "NJ", "CA", "TN", "TX", "NY")), size = 3, color = "#4B1248", family = "League Spartan", nudge_y = 0.05, nudge_x = -0.02) + # first set of pt labels
  geom_text_repel(data = subset(df_3, df_3$Abbreviation %in% c("WA", "NE", "UT", "WY")), size = 3, color = "#4B1248", family = "League Spartan", nudge_y = -0.03, nudge_x = 0.03) + # second set of pt labels
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") + # 45 degree line
  labs( # labels
    title = "Trump Vote Proportion by State",
    caption = "Data from Professor Lawrence's Reputable CSV",
    subtitle = "2024 v. 2020"
  ) +
  theme_minimal() + # clear axis/backdrop
  theme( # font re-sizing/positioning
    axis.title = element_text(size = 17),
    text = element_text(color = "#4B1248", family = "League Spartan"),
    plot.title = element_text(size=20, hjust = 0.5),
    plot.subtitle = element_text(size = 17, hjust = 0.5),
    plot.caption = element_text(size = 9, hjust = 1),
    panel.background = element_rect(fill = "#F8F5EF", color = "#A9C1D9"),
    plot.background = element_rect(fill = "white", color = NA),
    panel.grid = element_line(color="#A9C1D9", size = 0.9),
    axis.text = element_text(size = 15, color = "#4B1248"),
    plot.margin = margin(10, 10, 10, 10),
    axis.ticks = element_line(color = "#4B1248", size = 0.5),
    axis.ticks.length = unit(5, "pt")
  )
props_2

I have chosen to label 11 points (10 states plus D.C.), which I believe to be the most unusual. First, perhaps the most unusual point is Washington, D.C.; while most points have Trump voting proportions between 0.3 and 0.8, the district sits below 0.1. In this way, DC is unusual due to its seemingly strong disapproval of Donald Trump. Similarly, I chose to label Wyoming because of its consistently high approval of Donald Trump. While Wyoming boasts the highest Trump voting proportions in both elections, its relative support is not nearly as extreme as the nation’s capital’s relative disapproval.

Additionally, I chose to label California, New Jersey, New York, Florida, Texas, and Tennessee. These points do not showcase extreme approval or disapproval of Donald Trump. Rather, they are unusual in the sense that these states saw significant swings in Trump voting proportions; this is made clear by their vertical distances from the line, which are the greatest of all points displayed.

Finally, I labeled Washington, Nebraska, and Utah. As mentioned earlier, vote proportions generally increased for Donald Trump. However, these three points are unusual in that their vote proportions only marginally increased; these points look like they almost lie on the line. Some of these states, such as Washington, are historically liberal, while others like Utah are historically conservative.

Part E: Removing D.C.

To remove D.C., we simply filter our dataset to exclude the entry, making use of the same ggplot() call from before:

props_3 <- ggplot(df_3 |> filter(Abbreviation != "DC"), aes(x = `2020`, y = `2024`, label = Abbreviation)) + # 2020 on x, 2024 on y, abbrev labels
  geom_point(color = "#4B1248") + # points
  geom_text_repel(data = subset(df_3, df_3$Abbreviation %in% c("FL", "NJ", "CA", "TN", "TX", "NY")), size = 3, color = "#4B1248", family = "League Spartan", nudge_y = 0.01, nudge_x = 0.003) + # first set of pt labels
  geom_text_repel(data = subset(df_3, df_3$Abbreviation %in% c("WA", "NE", "UT", "WY")), size = 3, color = "#4B1248", family = "League Spartan", nudge_y = -0.01, nudge_x = 0.01) + # second set of pt labels
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") + # 45 degree line
  labs( # labels
    title = "Trump Vote Proportion by State",
    caption = "Data from Professor Lawrence's Reputable CSV",
    subtitle = "2024 v. 2020"
  ) +
  theme_minimal() + # clear axis/backdrop
  theme( # font re-sizing/positioning
    axis.title = element_text(size = 17),
    text = element_text(color = "#4B1248", family = "League Spartan"),
    plot.title = element_text(size=20, hjust = 0.5),
    plot.subtitle = element_text(size = 17, hjust = 0.5),
    plot.caption = element_text(size = 9, hjust = 1),
    panel.background = element_rect(fill = "#F8F5EF", color = "#A9C1D9"),
    plot.background = element_rect(fill = "white", color = NA),
    panel.grid = element_line(color="#A9C1D9", size = 0.9),
    axis.text = element_text(size = 15, color = "#4B1248"),
    plot.margin = margin(10, 10, 10, 10),
    axis.ticks = element_line(color = "#4B1248", size = 0.5),
    axis.ticks.length = unit(5, "pt")
  )
props_3

Omitting DC greatly reduces the point congestion seen in the original plot. Because D.C. was situated so much lower than every other state, it forced the graph to cover significantly more area than it would have otherwise, compressing the remaining 50 points into a small region. When we remove DC, that region expands to fill the plot, which reveals a level of granularity we did not get before. For one, most points now appear much farther from the line than they did before. Also, with greater magnification, we can now see that all states lie above the 45 degree line. In this visual, all of these observations take far less effort to make. Thus, I would recommend omitting DC if we wanted to highlight general state voting trends.
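
An alternative worth noting (not what I did here) is to keep D.C. in the data and simply zoom the axes with coord_cartesian(); the limits below are illustrative assumptions, not tuned values.

# Sketch: crop the view instead of filtering the data. coord_cartesian() zooms
# without dropping D.C. from the underlying data frame (limits chosen arbitrarily).
props_2 + coord_cartesian(xlim = c(0.25, 0.75), ylim = c(0.25, 0.75))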

Question 3

Part A: Generating a Line Plot of Republican Vote Proportions

We begin by generating a line plot of Republican vote proportions without modifications:

line_plot_tot <- ggplot(df_2 |> filter(abbrev != "DC"), aes(x = year, y = republican_proportion, color = abbrev)) + # year on x, republican prop on y, abbrev color
  geom_line() + # line graphs
  theme_minimal() + # clear axis/backdrop
  labs( # labels
    title = "Republican Vote Proportions Across U.S. States",
    caption = "Data from Professor Lawrence's Reputable CSV",
    subtitle = "Presidential Elections",
    y = "",
    x = ""
  ) +
  scale_x_continuous(breaks = c(2008, 2012, 2016, 2020, 2024)) +
  theme( # font re-sizing/positioning
    axis.title = element_text(size = 17),
    text = element_text(color = "#4B1248", family = "League Spartan"),
    plot.title = element_text(size=20, hjust = 0.5),
    plot.subtitle = element_text(size = 17, hjust = 0.5),
    plot.caption = element_text(size = 9, hjust = 1),
    panel.background = element_rect(fill = "#F8F5EF", color = "#A9C1D9"),
    plot.background = element_rect(fill = "white", color = NA),
    panel.grid = element_line(color="#A9C1D9", size = 0.9),
    axis.text = element_text(size = 15, color = "#4B1248"),
    plot.margin = margin(10, 10, 10, 10),
    axis.ticks = element_line(color = "#4B1248", size = 0.5),
    axis.ticks.length = unit(5, "pt"),
    legend.position = "none"
  )
line_plot_tot

In its current construction, the line graph is not easily readable (many lines collide). While I tried to highlight each state with a different color instead of using gray, this did not really solve the issue on its own (since it is virtually impossible to distinguish 50 distinct colors). As such, I will make some aesthetic adjustments before analyzing further.

Part B/C/D: Highlighting States and Describing Patterns

To highlight different states, I will add two different layers of lines. The first layer plots all of the non-highlighted lines in a light gray hue—ghost lines. Thus, these lines will still be on the graph but will be de-emphasized. The next layer contains five states of interest, which I highlight in different colors. To do this, I first split my original data frame into two:

wanted <- c("WY", "UT", "NC", "ME", "MA") # States to highlight

line_g_df <- df_2 |> 
  filter(abbrev != "DC") |> # No DC
  mutate(flag = ifelse(abbrev %in% wanted, abbrev, "Unwanted")) # Flag desired states

gray_df <- line_g_df |> filter(flag == "Unwanted") # Filter out desired states
color_df <- line_g_df |> filter(flag != "Unwanted") # Filter out undesired states

highlight_colors <- setNames(c("#D97430", "#D4A017", "#3B7A57", "#A14A4A", "#6A7D94"), wanted)

Now, I can generate the desired visual:

line_plot_highlighted <- ggplot() +
  geom_line(data=gray_df, aes(x=year, y=republican_proportion, group=abbrev), color="gray", alpha=0.7) + # ghost lines
  geom_line(data=color_df, aes(x=year, y=republican_proportion, color=flag), linewidth=1.2) + # lines of interest
  scale_color_manual( # apply coloring
    values = highlight_colors,
    breaks = wanted 
  ) +
  theme_minimal() +
  scale_x_continuous(breaks = c(2008, 2012, 2016, 2020, 2024)) + # re-scale axis
  labs( # labels
    title = "Republican Vote Proportions Across U.S. States",
    subtitle = "Presidential Elections",
    caption = "Data from Professor Lawrence's Reputable CSV",
    y = "", # not needed
    x = "", 
    color = "State"
  ) +
  theme( # font re-sizing/positioning
    axis.title = element_text(size = 17),
    text = element_text(color = "#4B1248", family = "League Spartan"),
    plot.title = element_text(size=20, hjust = 0.5),
    plot.subtitle = element_text(size = 17, hjust = 0.5),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    plot.caption.position = "plot",
    plot.caption = element_text(size = 9, hjust = 1),
    panel.background = element_rect(fill = "#F8F5EF", color = "#A9C1D9"),
    plot.background = element_rect(fill = "white", color = NA),
    panel.grid = element_line(color= "#A9C1D9", size = 0.9),
    axis.text = element_text(size = 15, color = "#4B1248"),
    plot.margin = margin(10, 10, 10, 10),
    axis.ticks = element_line(color = "#4B1248", size = 0.5),
    axis.ticks.length = unit(5, "pt")
  )
line_plot_highlighted

Before discussing my highlighted states of interest, I first want to summarize the general trend of the data. Put simply, most states saw Republican voting proportions steadily increase from 2008 to 2016. After a 2016 peak, voting proportions declined in 2020 before rebounding in 2024, though they generally fell short of the 2016 peak. While this summary is oversimplified, most states appeared to follow the pattern to varying degrees. I also want to note that several states go against this pattern; two of the states I highlighted do not follow this trend (and this is precisely why I highlighted them).

Now, I want to discuss each of the states I highlighted. I took two different approaches to highlighting. Since my goal is to accurately represent the data, I wanted to make sure that several of the highlighted lines exhibit the general trend I described above. Thus, I chose to highlight Maine, North Carolina, and Wyoming. While these three states have different historical political affiliations, they all appear to follow the rise-fall-rise trend I described above. North Carolina is also interesting because, compared to most states, its fluctuations are very small: based on the graph, its proportions remained in the interval 0.5 to 0.53 across the five elections. Within this small range, though, it still exhibited that characteristic movement. The remaining two states I highlighted, Massachusetts and Utah, are odd compared to their counterparts. Utah was considered one of the most Republican states in the early 2000s. In 2012, its Republican proportion rose dramatically, likely because of Republican Mitt Romney’s religious identity as a Mormon. Over the following elections, though, Republican support steadily declined in Utah; the state is much closer to the middle now. Similarly, Massachusetts also saw a small uptick in Republican support in 2012, mainly because it is Mitt Romney’s home state. Unlike most states, though, its Republican support dwindled almost linearly over the next two elections before picking back up in 2024.

Above all else, my line coloring ironically seems to send the belated message that Mitt Romney was a very influential candidate, especially given that he lost. Beyond this, I think the highlighting does a solid job capturing the essence of the rise-fall-rise trend typical of many states during this period. Still, highlighting only five states means we lose significant nuance about the remaining data. It is worth noting that I may have overrepresented the outliers in my highlighting: only a handful of states strongly split from the central trend I described, yet 40% of my highlighted lines are outliers. In that way, my choice may be somewhat misleading, though it is difficult to be proportionate when highlighting only five lines.

Part E: Dividing by Census Region

To divide by census region, we need to first add the census labels to our dataframe. There are several ways to do this; I used AI to extract the census labels for me:

states_df <- data.frame( # Create dataframe 
  state = c(
    "Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont",
    "New Jersey", "New York", "Pennsylvania",
    "Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin",
    "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota",
    "Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "District of Columbia", "West Virginia",
    "Alabama", "Kentucky", "Mississippi", "Tennessee",
    "Arkansas", "Louisiana", "Oklahoma", "Texas",
    "Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming",
    "Alaska", "California", "Hawaii", "Oregon", "Washington"
  ),
  region = c(
    rep("Northeast", 6), rep("Northeast", 3),
    rep("Midwest", 5), rep("Midwest", 7),
    rep("South", 9), rep("South", 4), rep("South", 4),
    rep("West", 8), rep("West", 5)
  )
)
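
As an aside, a similar lookup could be assembled from base R's built-in state.name and state.region vectors instead of an AI-generated list. The sketch below assumes we patch two mismatches by hand: state.region labels the Midwest as "North Central" and has no District of Columbia entry.

# Alternative lookup (sketch) built from base R's state datasets.
states_alt <- tibble(
  state  = state.name,
  region = as.character(state.region)
) |>
  mutate(region = if_else(region == "North Central", "Midwest", region)) |> # match Census label
  add_row(state = "District of Columbia", region = "South")                 # add missing D.C.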

df_2 <- df_2 |> # Join census labels by state
  left_join(states_df, by = "state")
df_2
# A tibble: 255 × 5
# Groups:   state, abbrev [51]
   state   abbrev  year republican_proportion region
   <chr>   <chr>  <dbl>                 <dbl> <chr> 
 1 Alabama AL      2008                 0.609 South 
 2 Alabama AL      2012                 0.612 South 
 3 Alabama AL      2016                 0.644 South 
 4 Alabama AL      2020                 0.629 South 
 5 Alabama AL      2024                 0.654 South 
 6 Alaska  AK      2008                 0.611 West  
 7 Alaska  AK      2012                 0.573 West  
 8 Alaska  AK      2016                 0.584 West  
 9 Alaska  AK      2020                 0.553 West  
10 Alaska  AK      2024                 0.568 West  
# ℹ 245 more rows

With the region attribute added, we can now proceed with faceting:

line_plot_regions <- ggplot(df_2 |> filter(abbrev != "DC"), aes(x = year, y = republican_proportion, color = abbrev)) + # year on x, republican prop on y, color abbrev
  geom_line() + # line graphs
  theme_minimal() + # clear axis/backdrop
  labs( # labels
    title = "Republican Vote Proportions By Census Regions",
    subtitle = "Presidential Elections",
    caption = "Data from Professor Lawrence's Reputable CSV",
    y = "",
    x = ""
  ) +
  scale_x_continuous(breaks = c(2008, 2012, 2016, 2020, 2024)) +
  theme( # font re-sizing/positioning
    axis.title = element_text(size = 17),
    text = element_text(color = "#4B1248", family = "League Spartan"),
    plot.title = element_text(size=20, hjust = 0.5),
    plot.subtitle = element_text(size = 17, hjust = 0.5),
    plot.caption = element_text(size = 9, hjust = 1),
    panel.background = element_rect(fill = "#F8F5EF", color = "#A9C1D9"),
    plot.background = element_rect(fill = "white", color = NA),
    panel.grid = element_line(color="#A9C1D9", size = 0.9),
    axis.text = element_text(size = 10, color = "#4B1248"),
    plot.margin = margin(10, 10, 10, 10),
    axis.ticks = element_line(color = "#4B1248", size = 0.5),
    axis.ticks.length = unit(5, "pt"),
    legend.position = "none",
    panel.spacing.x = unit(1.5, "lines"),
    strip.text = element_text(size = 13)
  ) +
  facet_wrap(~region)
line_plot_regions

In its current set-up, the data is certainly more readable than in the first graphic. As with the previous part, though, I would prefer to look at a few representative states in each census region (as well as regional averages). By doing so, we can better understand the broader trends of the data without so much noise:

mean_df <- df_2 |>
  group_by(year, region) |> # Group on year, region
  summarize(mean_prop = mean(republican_proportion)) # compute mean proportions
wanted <- c("IN", "WI", "CT", "MA", "KY", "NC", "MT", "AZ", "Midwest") # States to highlight (plus average)

line_g_df <- df_2 |> 
  filter(abbrev != "DC") |> # No DC
  mutate(flag = ifelse(abbrev %in% wanted, abbrev, "Unwanted")) # Flag desired states

gray_df <- line_g_df |> filter(flag == "Unwanted") # Filter out desired states
color_df <- line_g_df |> filter(flag != "Unwanted") # Filter out undesired states

highlight_colors <- setNames(c("#D97430", "#A28C6A", "#9C6B84", "#3B7A57", "#A14A4A", "#3E7F8C", "#D4A017", "#4F5D70", "black")
, wanted)

highlight_colors
line_plot_regions_highlighted <- ggplot() +
  geom_line(data=gray_df, aes(x=year, y=republican_proportion, group=abbrev), color="gray", alpha=0.7) + # ghost lines
  geom_line(data=color_df, aes(x=year, y=republican_proportion, color=flag), linewidth=1.2) + # lines of interest
  geom_line(data=mean_df, aes(x=year, y=mean_prop, color=region), linewidth=0.7, linetype="dashed") + # mean lines (to get legend)
  geom_line(data=mean_df, aes(x=year, y=mean_prop, group=region), linewidth=0.7, linetype="dashed", color = "black") + # mean lines
  scale_color_manual( # apply coloring
    values = highlight_colors,
    breaks = wanted,
    labels = c("IN", "WI", "CT", "MA", "KY", "NC", "MT", "AZ", "AVG")
  ) +
  theme_minimal() +
  scale_x_continuous(breaks = c(2008, 2012, 2016, 2020, 2024)) + # re-scale axis
  labs( # labels
    title = "Republican Vote Proportions Across U.S. States",
    subtitle = "Presidential Elections",
    caption = "Data from Professor Lawrence's Reputable CSV",
    y = "", # not needed
    x = "", 
    color = "State"
  ) +
  theme( # font re-sizing/positioning
    axis.title = element_text(size = 17),
    text = element_text(color = "#4B1248", family = "League Spartan"),
    plot.title = element_text(size=20, hjust = 0.5),
    plot.subtitle = element_text(size = 17, hjust = 0.5),
    plot.caption.position = "plot",
    plot.caption = element_text(size = 9, hjust = 1),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = "#F8F5EF", color = "#A9C1D9"),
    plot.background = element_rect(fill = "white", color = NA),
    panel.grid = element_line(color= "#A9C1D9", size = 0.9),
    axis.text = element_text(size = 13, color = "#4B1248"),
    plot.margin = margin(10, 10, 10, 10),
    axis.ticks = element_line(color = "#4B1248", size = 0.5),
    axis.ticks.length = unit(5, "pt"),
    strip.text = element_text(size = 13),
    panel.spacing.x = unit(1.5, "lines")
  ) +
  facet_wrap(~region)
line_plot_regions_highlighted

When we facet by region (and highlight representative states), we see an interesting picture; I will break it down by region. As a general philosophy, I highlighted two states in each region (as well as an average line). One state follows the rise-fall-rise trend. That is, Republican vote proportions grew steadily from 2008 to 2016, fell in 2020, and rose again in 2024. The other line reflects a smaller sub-trend that I identified within the given region.

The Midwest was dominated by the rise-fall-rise trend; both Indiana and Wisconsin capture this perfectly, and I found it difficult to identify any state in the region that did not follow the pattern. Next, the Northeast shows a distinctive identity. While states like Connecticut follow the rise-fall-rise trend, the region stays true to its historically liberal roots; states like Massachusetts generally showed decreasing support for Republican candidates over the period. We also see that most Northeastern states sit in a narrow band of proportions that is very low compared to the other regions. Of the four regions, the Midwest and Northeast have the clearest identities; this is probably because these regions are smaller geographically and relatively homogeneous in cultural composition. Regarding the South, Kentucky follows the general trend exactly, while North Carolina shows almost no movement across the elections. Although the Southern states are widely dispersed, most did not vary much in their Republican voting proportions, which is reflected in relatively smooth lines; North Carolina is an extreme example of this. Lastly, we have the West, which, simply put, is all over the place. Again, we see states that follow the rise-fall-rise trend, but beyond that it is hard to generalize: some states, like Nevada, show little change, others like Utah show rapid downward trends, and some show steady increases.

In this case, I think that faceting was helpful in confirming that the rise-fall-rise trend was not confined to a single region. Given that we have so many states, faceting also allows us to space out the visual in a way that makes it easier to digest. It is also interesting to see that certain regions have clear tendencies, something that was obscured in our original visual.

Question 4

To complete question 4, we first need to produce our desired metric. We can do so by using the wide dataframe df_3, which has voting proportions for each year for each state. Note that I will be displaying percentages instead of proportions. The justification behind this change is that percentages are much easier for audiences to grasp than proportions; they also look “nicer” on visuals.

swing_df <- df_3 |> 
  mutate(swing_per = (`2024` - `2020`) * 100) |> # compute swing percentage
  select(c(State, Abbreviation, swing_per)) # take only states, abbreviations, and swing_per
swing_df
# A tibble: 51 × 3
# Groups:   State, Abbreviation [51]
   State                Abbreviation swing_per
   <chr>                <chr>            <dbl>
 1 Alabama              AL                2.50
 2 Alaska               AK                1.50
 3 Arizona              AZ                3.00
 4 Arkansas             AR                1.50
 5 California           CA                4.50
 6 Colorado             CO                1.30
 7 Connecticut          CT                2.80
 8 Delaware             DE                2.10
 9 District of Columbia DC                1.2 
10 Florida              FL                4.90
# ℹ 41 more rows
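
The map later in this question tags an overall average swing of 2.18%; I presume (an assumption on my part) that this is simply the unweighted mean of swing_per across the 51 rows, which could be recovered as follows:

# Presumed source of the "Average: 2.18%" tag (sketch): the unweighted mean
# state-level swing, rounded to two decimal places.
round(mean(swing_df$swing_per), 2)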

Then, we retrieve our US dataframe, which contains the geographical data:

us_df <- tigris::states(cb = TRUE, 
                     resolution = "20m", 
                     progress_bar = FALSE) |> # select states
  filter(NAME != "Puerto Rico") |> # exclude Puerto Rico
  tigris::shift_geometry() # shift/rescale Hawaii and Alaska

Then, we merge our two data frames on their common state column:

swing_geo <- us_df |> 
  left_join(swing_df, by=c("NAME" = "State")) # join on state

For clarity, I want to express swing percentages in quartiles, so I will use cut_number() to segment the percentages:

swing_geo <- swing_geo |>
  mutate(quartile = cut_number(swing_per, n = 4)) # split metric into quartiles

By default, the intervals created have labels that are clunky and ill-formatted, so I change them to simplify the visual. While this improves readability, it should be noted that we lose information about endpoint inclusion: without brackets and parentheses, we cannot tell whether a given group includes its endpoints. In the bigger picture, though, this is a minor loss:

# clean quartiles
swing_geo$quartile_clean <- swing_geo$quartile |>
  str_replace_all("\\(|\\]|\\[", "") |> # remove brackets
  str_replace_all(",", " - ") # replace commas with dash

Finally, we can display the data:

swing_classed <- ggplot(swing_geo) + 
  geom_sf(aes(fill = quartile_clean), color = "#F9F6EE", linewidth = 0.4) +  # use swing metric for fill
  scale_fill_manual(values = c("#F8B8B5", "#F28B82", "#D93025", "#7B1B1B")) + # custom segment colors
  labs( # labels
       title = "Change in Trump Voting Percentage", 
       subtitle = "2020 to 2024",
       fill = "Percent Swing\n(Quartiles)",
       caption = "Data from Professor Lawrence's Reputable CSV",
       tag = "Average: 2.18%") +
  theme_void() + # clear theme
  theme( # font re-sizing/positioning
    text = element_text(color = "#622A0F", family = "League Spartan"),
    plot.title.position = "plot",
    plot.title = element_text(size = 25, hjust = 0.5),
    plot.subtitle = element_text(size = 22, hjust = 0.5),
    legend.text = element_text(size = 11, hjust = 0.5),
    legend.title = element_text(size = 13, hjust=0.5),
    plot.caption.position = "plot",
    plot.tag.position = c(.855,.26),
    plot.tag = element_text(hjust = 0, size=10),
    plot.caption = element_text(vjust=2, hjust=1, size=12),
    plot.margin = margin(10, 10, 10, 20),
    panel.background = element_rect(fill = "lightblue", color = NA),
    plot.background = element_rect(fill = "lightblue", color = NA)
  )
swing_classed

By mapping the data, we add an additional layer of information: geography. While the scatter plot in question 2 afforded us the greatest level of precision for each state, comparisons were relatively contained. That is, I could say a lot about a single state, but it was virtually impossible to make deductions about specific groups of states or regions; Southern points, for instance, span the entire graph. Mapping the data does the grouping for us, which makes identifying regional trends much easier. One interesting observation is the strong swing in the states that border Mexico; California, Arizona, New Mexico, and Texas all sit comfortably above the average. For those familiar with U.S. politics, this trend is in some ways expected, as Trump centered his agenda on curbing illegal immigration across the southern border, which directly affects these states. Other regional swings that stand out include parts of the South (Mississippi, Alabama, Florida, South Carolina, Tennessee) as well as the New York Tri-State area. These shifts are less obvious and warrant additional exploration. Lastly, large pockets of the Midwest saw little to no change, perhaps implying that Trump’s platform did not shift enough to meaningfully increase or reduce his appeal to residents of these states.