Part A: Graduate Degree Holders in Florida

For Part A, the percentage of the population in Florida counties who have a graduate degree is explored. According to the U.S. Census Bureau in 2021, 14.4% of the population had completed an advanced degree, such as a masters or doctoral (U.S. Bureau, 2022). How does each county in Florida compare to the national average? Are there counties which exceed or under perform this metric? And if so, what might be the factors which affect each county population’s graduate degree attainment?

To address the questions, the following R libraries are leveraged:

# load library
library(tidyverse)     # tidyverse 
library(tidycensus)    # acquire census data
library(classInt)      # for calculating class intervals
library(sf)            # handle simple features, vector spatial data
library(scales)        # plot formatting
library(plotly)        # interactive plots
library(ggiraph)       # interactive graphics
library(mapview)       # interactive maps
library(leaflet)       # leaflet maps
library(leafpop)       # create pop-ups for leaflet maps

Data Acquisition

The tidycensus library is used for acquiring data from the U.S. Census Bureau’s American Community Survey (ACS). The ACS collects a wide array of socioeconomic and demographic information. The ACS contains five and one year estimates (with the three year estimates being discontinued), however one year estimates are only available for areas with populations greater than 65,000.

In Part A, the percentage of each county’s population who has a graduate degree for 2023 is retrieved and visualized to answer the following questions:

  • 1. Which Florida county populations have the highest percentage of graduate degree holders?
  • 2. Which have the smallest percentage?
# get ACS data for percentage of population with graduate degree
fl_grad <- get_acs(
  # get county-level data
  geography = "county",
  # % pop with graduate degree
  variables = "DP02_0066P",
  # get 2023 data
  year = 2023,
  # get Florida data
  state = "FL",
  # 5-Year ACS estimates
  survey = "acs5",
  # get geometry
  geometry = TRUE
)

With the ACS data acquired, the structure of the dataframe is inspected.

# view data structure
glimpse(fl_grad)
## Rows: 67
## Columns: 6
## $ GEOID    <chr> "12123", "12017", "12071", "12101", "12033", "12037", "12065"…
## $ NAME     <chr> "Taylor County, Florida", "Citrus County, Florida", "Lee Coun…
## $ variable <chr> "DP02_0066P", "DP02_0066P", "DP02_0066P", "DP02_0066P", "DP02…
## $ estimate <dbl> 4.3, 7.7, 11.9, 9.9, 10.5, 11.3, 7.1, 4.1, 24.7, 7.6, 5.7, 5.…
## $ moe      <dbl> 1.4, 0.7, 0.4, 0.5, 0.6, 2.2, 1.6, 0.9, 0.9, 2.1, 1.5, 0.8, 0…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-83.83005 2..., MULTIPOLYGON (((…

The DP02_006P data set consists of 67 rows of data, which accounts for all Florida counties, and 6 columns consisting of:

  • GEOID: unique identifier/primary key for the county
  • NAME: name of the county
  • variable: variable code (percent of population with graduate degree)
  • estimate: the estimated percentage
  • moe: the margin of error (90% CI)
  • geometry: geometry properties for each county

The estimate cannot be considered without its margin of error. For any given estimate, the U.S Census Bureau is 90% confident the true value lies +/- the margin of error. For counties with smaller populations, the margin of error will be larger relative to the estimate, thus we must consider our error when plotting.

Next, the raw data is tidied and transformed to facilitate exploration and analysis.

Processing

# tidy and transform fl_grad
fl_grad <- fl_grad %>%
  # split NAME column into distinct county and state columns
  separate_wider_delim(cols=NAME, delim=",", names=c("county", "state")) %>%
  # remove " County" suffix
  mutate(county = str_remove(county, " County"),
         # get decimal/percentage values for estimates (makes plotting better with scales library)
         estimate = estimate/100,
         moe = moe/100) %>%
  # arrange by descending order
  arrange(-estimate)

Visualization

In this section, the ACS data is plotted to visualize how each county’s population compares to eachother, along with the associated error for the estimates.

 # create ggplot object for estimates with error to visualize uncertainty
fl_plot_error <- ggplot(fl_grad, aes(x=estimate, y=reorder(county,estimate))) +
  # error
  geom_errorbar(aes(xmin=estimate - moe, xmax=estimate+moe),
                width=0.5, linewidth=0.5) +
  geom_point(color="darkred", size=2) +
  # format x-axis labels as percentage
  scale_x_continuous(labels=label_percent()) +
  # set labels
  labs(title = "Percent of Population with Graduate Degree, 2023 ACS",
       subtitle = "Counties in Florida",
       caption = "Data acquired with R and tidycensus",
       x = "ACS Estimate",
       y = "") +
  # set theme
  theme_minimal(base_size=12)
# convert ggplot object to to interactive plot using ggplotly
ggplotly(fl_plot_error, 
         # set hover tooltip to be ACS estimate
         tooltip = "x", 
         # set plot height to fit all counties
         height = 800)

Results

1. Which counties in the selected state have the largest percentages of graduate degree holders?

Alachua, Leon, St. Johns, Sarasota, and Collier have the largest percentage of residents who posses graduate degrees in their respective counties. Notably, many of these counties either host a major university or are located near one. For example, University of Florida is located in Alachua, Florida State University is located in Leon County, and University of South Florida is located in Sarasota County.

2. Which counties in the selected state have the largest percentages of graduate degree holders?

Hendry, Glades, Dixie, Hamilton counties posses the smallest percentage of graduate degree holders in Florida.

Although the dot plot with the error bars is useful, the number of counties in Florida initially rendered the plot difficult to read due to the size. Fortunately, the ggplotly() function allows you to input a manual height value to accommodate every county.

For a visual representation of the data, the percentage estimates are plotted using the leaflet library below:

# create map of graduate degree holders in Florida by county
fl_grad <- fl_grad %>%
  # set geometry
  st_set_geometry('geometry') %>%
  # set coordinate reference system (otherwise will get warning on leaflet map)
  st_transform(crs = 4326) %>%
  # format estimate as percentage for map
  mutate(labelEstimate = percent(estimate, accuracy = 0.1)) # one decimal place

# create color palette  
palette <- colorNumeric(palette = "Greens",  # green
                    # set possible values
                    domain = fl_grad$estimate)

# create leaflet map of population with graduate degree by county
leaflet(fl_grad) %>%
  # add basemap
  addProviderTiles(providers$CartoDB.Positron) %>%
  # add fl_grad polygon layer
  addPolygons(
    # set fill color to green palette
    fillColor = ~palette(estimate),
    # set border color
    color = "black",
    # set line thickness
    weight = .75,   
    # set opacity slightly transperent so basemap labels are visible
    fillOpacity = .75,
    # set highlight options (when mouse hovers)
    highlightOptions = highlightOptions(
      # set outline weight
      weight = 4, 
      # set color
      color = "white", 
      # bring county to front (looks slightly better)
      bringToFront = TRUE
    ),
    # create tooltip label with County: Percentage
    label = ~paste0(county, ": ", labelEstimate)
  ) %>%
  # add legend
  addLegend(
    # set color palette
    pal = palette, 
    # set values to be estimate %
    values = ~estimate, 
    # set legend title
    title = "% Population with Graduate Degree", 
    # set legend position
    position = "bottomleft",
    # format legend values with %
    labFormat = labelFormat(suffix = "%")
  )

Part B: Poverty in Florida Counties

For Part B, the poverty rate, which is the percentage of the population whose income is under the federal poverty line, is explored. In this exercise, two U.S. Census ACS 5-year estimate data sets are leveraged to calculate the poverty rate and visualize which counties posses the highest rates.

  • Total population for whom poverty status is determined (B01003_001)
  • Population where income is below poverty level (B17001_002)

Data Acquisition

Once again, the tidycensus library is used to acquire ACS 5-year estimates for both county populations and poverty status for 2023.

# get population where poverty status is determined
fl_pop <- get_acs(
  # get county-level data
  geography = "county",
  # total population
  variables = "B17001_001",
  # get Florida
  state = "FL",
  # get 2023
  year = 2023,
  # ACS 5-yr estimate
  survey = "acs5",
  # get geometry
  geometry = TRUE
)
# get population whose income is below poverty level
fl_poverty <- get_acs(
  # get county-level data
  geography = "county",
  # number of people in poverty
  variables = "B17001_002", 
  # get Florida
  state = "FL",
  # get 2023
  year = 2023,
  # ACS 5-yr estimate
  survey = "acs5"
)

Processing

The percentage of the population below the poverty line is estimated by dividing the ACS poverty and population estimates in each county. Both estimates possesses a margin of error which will important to include in the calculated poverty rate because lower population counties have higher associated error for their estimates. To determine this uncertainty in the calculated poverty rate, the two errors are propagated using both margin of error variables. This is accomplished by the following formula:

# merge population and poverty data
pop_poverty <- merge(fl_pop, fl_poverty, 
                     by="GEOID",
                     suffixes = c(".pop", ".poverty")) %>%
  # remove duplicate NAME column after merge
  select(-NAME.poverty) %>%
  # rename name column
  rename(county = NAME.pop) %>%
  # remove " County, Florida" suffix
  mutate(county = str_remove(county, " County, Florida"),
         # calculate poverty percentage
         per.poverty = (estimate.poverty / estimate.pop),
         # calculate margin of error for poverty percentage
         # include document explaining error propagation for multiplication/division (ratios)
         moe.per = per.poverty * sqrt((moe.poverty / estimate.poverty)^2 + (moe.pop / estimate.pop)^2)) %>%
  # set geometry column for mapping
  st_set_geometry("geometry") %>%
  # arrange by percent poverty (ascending)
  arrange(per.poverty) %>%
  # glimpse
  glimpse()
## Rows: 67
## Columns: 11
## $ GEOID            <chr> "12129", "12109", "12115", "12113", "12119", "12019",…
## $ county           <chr> "Wakulla", "St. Johns", "Sarasota", "Santa Rosa", "Su…
## $ variable.pop     <chr> "B17001_001", "B17001_001", "B17001_001", "B17001_001…
## $ estimate.pop     <dbl> 31471, 288913, 440606, 188472, 129329, 221079, 469868…
## $ moe.pop          <dbl> 252, 379, 473, 212, 193, 337, 465, 147, 445, 338, 367…
## $ variable.poverty <chr> "B17001_002", "B17001_002", "B17001_002", "B17001_002…
## $ estimate.poverty <dbl> 1758, 19379, 36101, 15911, 11723, 20204, 43264, 8659,…
## $ moe.poverty      <dbl> 588, 2238, 2364, 1722, 1306, 2759, 2621, 1399, 1991, …
## $ per.poverty      <dbl> 0.05586095, 0.06707556, 0.08193488, 0.08442103, 0.090…
## $ moe.per          <dbl> 0.018689221, 0.007746776, 0.005366059, 0.009137129, 0…
## $ geometry         <MULTIPOLYGON [°]> MULTIPOLYGON (((-84.73656 3..., MULTIPOL…

With the data transformations and tidying complete, the dataframe now contains the following columns:

  • GEOID: unique identifier/primary key for the county
  • county: name of the county
  • variable.pop: variable code (population where poverty status determined)
  • estimate.pop: total population estimate
  • moe.pop: total population margin of error
  • variable.poverty: variable code (population whose income is below poverty line)
  • estimate.poverty: poverty population estimate
  • moe.poverty: poverty margin of error
  • per.poverty: calculated poverty rate
  • moe.per: margin of error for poverty rate (obtained via error propagation formula for Multiplication & Division)
  • geometry: geometry properties for each county

Visualization

Next, dot plot is created which illustrates the percentage of the population in each Florida county under the poverty line, with the error propagated. This time, the ggiraph library is used.

# construct dot plot
fl_plot <- ggplot(pop_poverty,
                  # 
                  aes(x=per.poverty,
                      y=reorder(county,per.poverty),
                      tooltip = per.poverty,
                      data_id = GEOID)) +
  # plot interactive percentage error
  geom_errorbar(aes(xmin=per.poverty - moe.per, 
                    xmax=per.poverty+moe.per),
                width=0.5, linewidth=0.5) +
  # plot interactive point percentage
  geom_point_interactive(color="darkblue", size=2) +
  # format x-axis labels as percentage
  scale_x_continuous(labels=label_percent()) +
  # set labels
  labs(title = "Percent of Population Below Poverty Level, 2023 ACS",
       subtitle = "Counties in Florida",
       caption = "Data acquired with R and tidycensus",
       x = "ACS Estimate",
       y = "") +
  # set theme
  theme_minimal(base_size=8)
# create interactive plot using giraf()
girafe(ggobj = fl_plot,
       # set height (inches)
       height_svg = 10) %>%
  # hover points
  girafe_options(opts_hover(css = "fill:cyan;"),
                 opts_zoom(min = 0.5, max = 4))

There is a lot of uncertainty in the poverty rate for some counties. Intuitively, it makes sense that counties with lower populations would possess higher margins of error for both their population and poverty estimates because sample sizes are smaller due to fewer households and these counties are sampled less frequently.

Consequently, the calculated uncertainty for the poverty rate will likewise carry over. This can be verified by calculating Pearson’s correlation coefficient between the population and poverty rate margin of error. The result of Pearson’s test is -0.55, indicating a strong negative correlation between the size of the population and margin of error. In other words, the larger the county’s population, the smaller the associated error.

# perform pearson's correlation analysis
correlation <- cor(pop_poverty$estimate.pop, pop_poverty$moe.per, method = "pearson")
# view pearson's R results
correlation
## [1] -0.5480425

Now the poverty rate is visualized to better understand the geographic distribution for poverty rates. The classInt library is used to calculate the break points in the poverty rates using Natural Breaks (Jenks) which classifies values based on natural groupings.

# set number of classes
classes <- 5
# calculate natural breaks for per.poverty
breaks <- classIntervals(pop_poverty$per.poverty, 
                         n = classes, 
                         style = "jenks")$brks # creates a vector of classes
#
# assign natural breaks classes
pop_poverty <- pop_poverty %>%
  # assign class to per.poverty values based on breaks intervals
  mutate(class =  cut(per.poverty, breaks = breaks, include.lowest = TRUE))

Next, a static map with the poverty rate classified by natural breaks is created to visualize the data.

# create map with poverty rate classified by natural breaks
ggplot(pop_poverty, aes(fill = class))+
  geom_sf() +
  # set theme to void
  theme_void()+
  # use rocket paletee
  scale_fill_viridis_d(option = "rocket") +  # ggplot uses "pretty breaks" by default
  labs(title = "Percent Population under Poverty Line",
       subtitle = " Counties of Florida",
       fill = "ACS estimate",
       caption = "2017-2023 ACS")

Finally, two interactive map are created using mapview and leaflet to facilitate data exploration.

# create mapview 
mapview(pop_poverty, zcol='per.poverty')
# create poverty rate map by county
pop_poverty <- pop_poverty %>%
  # set geometry
  st_set_geometry('geometry') %>%
  # set coordinate reference system (otherwise will get warning on leaflet map)
  st_transform(crs = 4326) %>%
  # format estimate as percentage for map
  mutate(labelEstimate = percent(per.poverty, accuracy = 0.1)) # one decimal place

# create color palette  
palettePoverty <- colorNumeric(palette = "Reds",  # reds
                    # set possible values
                    domain = pop_poverty$per.poverty)

# create leaflet map of population with graduate degree by county
leaflet(pop_poverty) %>%
  # add basemap (dark matter)
  addProviderTiles(providers$CartoDB.DarkMatter) %>%
  # add pop_poverty polygon layer
  addPolygons(
    # set fill color to red palette
    fillColor = ~palettePoverty(per.poverty),
    # set border color
    color = "black",
    # set line thickness
    weight = .75,   
    # set opacity slightly transperent so basemap labels are visible
    fillOpacity = .75,
    # set highlight options (when mouse hovers)
    highlightOptions = highlightOptions(
      # set outline weight
      weight = 4, 
      # set color
      color = "white", 
      # bring county to front (looks slightly better)
      bringToFront = TRUE
    ),
    # create tooltip label with County: Percentage
    label = ~paste0(county, ": ", labelEstimate)
  ) %>%
  # add legend
  addLegend(
    # set color palette
    pal = palettePoverty, 
    # set values to be estimate %
    values = ~per.poverty, 
    # set legend title
    title = "Povert Rate %", 
    # set legend position
    position = "bottomleft",
    # format legend values with %
    labFormat = labelFormat(suffix = "%")
  )

Results

The top five areas with the highest poverty rates are Gadsen, Hamilton, Hardee, Putnam and Desoto Counties. However as noted previously, many counties with high poverty rates also have low populations which result in significant associated error for the caclulations. When ranking the poverty rate by county, great care should be shown to include this important context.

#References

U.S. Census Bureau. (2022, February 24). Educational attainment in the United States: 2021. U.S. Department of Commerce. https://www.census.gov/newsroom/press-releases/2022/educational-attainment.html