Problem 1 [20 points]

Create a plotly graph of your choosing that represents at least two variables, one of which must be a categorical variable.

This plot can be a scatter plot, overlayed density plots (graphing variable is continuous, separate densities grouped by categorical variable), etc. choropleth maps could also be on the list…you have to admit they look kinda cool.

The graph must include:

  1. customized hover text that is informative to the graphing elements in the plot

  2. separate color to represent groups

  3. labeled axes and appropriate title

Include at least a 1-paragraph discussion about the graph. Discuss what is being plotted and what information is being displayed in the graph. Discuss any information that the reader may gain from hovering the cursor over graphing elements. Discuss any issues/challenges you had (if any) while making the plot, and you you dealt with or overcame them.

# Data sourced from 2022 Small Area Health Insurance Estimates
# https://www.census.gov/data/datasets/time-series/demo/sahie/estimates-acs.html

# install.packages("rjson")
library(rjson)
library(plotly)
library(tidyverse)


# Import dataset
sahie <- read.csv("SAHIE_2022_clean.csv")

# load geojson mapping for county level data
url <- 'https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json'
counties <- rjson::fromJSON(file=url)

# Remap Connecticut FIPS codes
# https://github.com/CT-Data-Collaborative/2022-tract-crosswalk
ct_fips <- data.frame(FIPS_old = c(9003, 9013, 9007, 9015, 9005, 9009, 9011, 9001),
                      FIPS_new = c(9110, 9110, 9130, 9150, 9160, 9170, 9180, 9190)
                      )

sahie <- left_join(sahie, ct_fips, by = join_by(FIPS == FIPS_new))

# This shows that the variable is largely contained within 80-100% with the center around 90%
# That scale will be used for coloring the counties below
summary(as.numeric(sahie$PCTLIIC))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   62.30   86.80   90.60   89.53   93.10   97.90       1
quantile(as.numeric(sahie$PCTLIIC), probs = seq(0.05, 1, 0.05), na.rm = TRUE)
##   5%  10%  15%  20%  25%  30%  35%  40%  45%  50%  55%  60%  65%  70%  75%  80% 
## 80.6 83.3 84.8 86.0 86.8 87.6 88.5 89.2 89.8 90.6 91.2 91.8 92.3 92.7 93.1 93.5 
##  85%  90%  95% 100% 
## 94.0 94.4 95.1 97.9
# Create Plot
sahie %>% 
  mutate(FIPS_clean = ifelse(is.na(FIPS_old) , FIPS, FIPS_old)) %>%  #remap Connecticut codes
  mutate(FIPS_pad = str_pad(FIPS_clean, 5, "left", 0)) %>% #left pad FIPS code to be pickup up by geojson mappings
  
  plot_ly(
    hoverinfo = "text",
          text = ~paste("State:", State_Name,
                        "<br> County:", County_Name,
                        "<br> Insured Rate:", PCTLIIC, "%",
                        "<br> Standard Error:", pctliic_moe, "%")
    ) %>%
  add_trace(
    type = 'choroplethmapbox',
    geojson = counties,
    z = ~as.numeric(PCTLIIC),
    locations = ~FIPS_pad,
    #set to be symmetric around median value
    zmin= 80,
    zmid = 90,
    zmax= 100,
    colorscale = "RdBu",
    reversescale = TRUE,
    colorbar = list(title = "Insured Rate"),
    marker=list(line=list(
      width=0),
      opacity=0.5)
    ) %>%
  layout(
    mapbox=list(
      style="carto-positron",
      zoom =3, #county level zoom
      center=list(lon= -95.71, lat=37.09)), #centers on continental US
     title = list(text = "Percentage of Americans under 65 with Health Insurance Coverage<br><sup>Source: 2022 Small Area Health Insurance Estimates (SAHIE) using the American Community Survey (ACS)</sup>")
    )

Discussion

This graph shows the percentage of people under age 65 with health insurance, per county in the United States for 2022. The data is sourced from the US Census Bureau: 2008 - 2022 Small Area Health Insurance Estimates (SAHIE) using the American Community Survey (ACS), and was refined in Excel before loading into R for easier data manipulation. The map shows that apparent discrepancies in insured rates appear largely by regional and state specific patterns. Notably, Louisiana appears out of place among southern states. Upon investigating, Louisiana expanded Medicaid access under the Affordable Care act, while surrounding states did not. The states that did not expand Medicaid appear to have the lowest insured levels. These states include: Alabama, Florida, Georgia, Kansas, Mississippi, South Carolina, Tennessee, Texas, Wisconsin, and Wyoming. However, that alone does not explain everything. Wisconsin has a highly insured population without Medicaid expansion. North Carolina, Missouri, Oklahoma, as well as other Western states did expand Medicaid access, yet still have lower insured populations. There are likely other, more granular drivers of these differences that would need to be investigated further.

This was a challenging map to put together. It required a lot of trial and error as well as searching for arguments to pass to these functions that would accomplish the picture I was trying to create. Notably, this 2022 information uses new FIPS codes for Connecticut that are not on the underlying geoJSON file that plotly uses to map to county level data. To solve this, I chose to crosswalk these FIPS codes from the new census tracts to the old ones so that they would appear on the map. This involved several files to identify existing FIPS mappings, reference a crosswalk file between names, and map these names to the new codes (see: 2020 Census Tracts Crosswalk section). The old values were then mapped back on to the dataset so that they could overwrite the newer values and thus appear on the map.

Other challenges included mapping the color scale and map display options. Originally the color scale was too homogeneous, so I had to figure out how to differentiate high and low values around a midpoint. The zmin/zmid/zmax values need to be symmetric around the midpoint, so investigating the percentile distribution of values allowed this to be tuned to reflect the spread of the data. Also, the marker line width was difficult to adjust. It was originally a thick enough border to obscure the color of the map fill. Finding the syntax to set this to “0” allowed the data to show through.


Problem 2 [20 points]

Create an animated plotly graph with a data set of your choosing. This can be, but does not have to be a scatter plot. Also, the animation does not have to take place over time. As mentioned in the notes, the frame can be set to a categorical variable. However, the categories the frames cycle through should be organized (if needs be) such that the progression through them shows some pattern.

This graph should include:

  1. Aside from the graphing variable, a separate categorical variable. For example, in our animated scatter plot we color grouped the points by continent.

  2. Appropriate axis labels and a title

  3. Augment the frame label to make it more visible. This can include changing the font size and color to make it stand out more, and/or moving the frame label to a new location in the plotting region. Note, if you do this, make sure it is till clearly visible and does not obstruct the view of your plot.

Include at least a 1-paragraph discussion about the plot. Discuss what you are plotting and what trends can be seen throughout the animation. Discuss any issues, if any, you ran into in making the plot and how you overcame them.

# data sourced from
# https://www.ers.usda.gov/data-products/feed-grains-database/feed-grains-yearbook-tables


library(plotly)
library(RColorBrewer)

grains <- read.csv("USDA_Feed_Grains_Yearbook_Data_Tables.csv")

grains %>% 
  filter(Commodity_Year >= 1929) %>% 
  mutate( Production_Bushels = Production_Bushels_Millions * 1000000) %>% 
  plot_ly(x = ~Harvested_Acres_Millions, y = ~Production_Bushels,
          hoverinfo = "text",
          text = ~paste("Grain:", Grain_type,
                        "<br> Harvested Acres:", format(Harvested_Acres_Millions*1000000, big.mark = ",", scientific = FALSE),
                        "<br> Bushels Produced:", format(Production_Bushels, big.mark = ",", scientific = FALSE),
                        "<br> Bushels Per Acre:", Bushels_Per_Acre)
          ) %>% 
  add_markers(frame = ~Commodity_Year,
              ids = ~Grain_type,
              size = ~Bushels_Per_Acre,
              marker = list(sizemode = "diameter",
                            sizeref = 2,
                            sizemin = 3
                            ),
              color = ~Grain_type,
              colors = brewer.pal(4,"Set1")
              ) %>% 
  layout(
      title = list(text = "Grain Yields by Harvested Acres (1929-2024)<br><sup>USDA Feed Grains Database</sup>"),
      xaxis = list(title = list(text = "Harvested Acres (Millions)")),
      yaxis = list(type = "log",
                   title = list(text = "Bushels Produced"))
        ) %>% 
  animation_slider(currentvalue = list(prefix = FALSE,
                                       font = list(color = "darkred", size = 35))
                   )

Discussion:

The plot shows the historical harvest yields of feed grains in the USA from 1929 through 2024 for Barley, Corn, Oats, and Sorghum. This data was sourced from: USDA Feed Grains Database (Table 1) and formatted in Excel for a cleaner upload. With the amount of acres harvested (in millions) on the x-axis, the amount of bushels produced on the y-axis, and the size of the point linked to the yield per acre, we observe several patterns:

  1. The amount of land dedicated to corn in 2024 is roughly the same now as it was nearly 100 years ago. It reached its minimum value in the mid-1960’s, and has rebounded through current day, all the while total production was increasing almost every year due to the increasing amount harvested per acre.
  2. Over time, the diversity of crop plantings have diminished. Corn has taken over as the most-farmed grain at the expense of the other feed grains.
  3. All crops yield more per acre over time regardless of total acres planted. This is likely due to changing farming practices over time.

The data behind this chart had older history; however, sorghum wasn’t added to the database until 1929, so I decided to start at that point. This allowed both Sorghum to be included as well as to not make the animation longer than necessary. Over time, corn’s dominance made the graph hard to visualize, so a log scale was used to better visualize all data points concurrently with the y-axis. Additionally, coding marker size to yield per acre necessitated playing with marker size options to emphasize how much yields have changed over time. I started with an “area” fill and moved to a “diameter” fill to better convey this information.