For Part A, the percentage of the population in Florida counties who have a graduate degree is explored. According to the U.S. Census Bureau in 2021, 14.4% of the population had completed an advanced degree, such as a masters or doctoral (U.S. Bureau, 2022). How does each county in Florida compare to the national average? Are there counties which exceed or under perform this metric? And if so, what might be the factors which affect each county population’s graduate degree attainment?
To address the questions, the following R libraries are leveraged:
# load library
library(tidyverse) # tidyverse
library(tidycensus) # acquire census data
library(classInt) # for calculating class intervals
library(sf) # handle simple features, vector spatial data
library(scales) # plot formatting
library(plotly) # interactive plots
library(ggiraph) # interactive graphics
library(mapview) # interactive maps
library(leaflet) # leaflet maps
library(leafpop) # create pop-ups for leaflet maps
The tidycensus library is used for acquiring data from the U.S. Census Bureau’s American Community Survey (ACS). The ACS collects a wide array of socioeconomic and demographic information. The ACS contains five and one year estimates (with the three year estimates being discontinued), however one year estimates are only available for areas with populations greater than 65,000.
In Part A, the percentage of each county’s population who has a graduate degree for 2023 is retrieved and visualized to answer the following questions:
# get ACS data for percentage of population with graduate degree
fl_grad <- get_acs(
# get county-level data
geography = "county",
# % pop with graduate degree
variables = "DP02_0066P",
# get 2023 data
year = 2023,
# get Florida data
state = "FL",
# 5-Year ACS estimates
survey = "acs5",
# get geometry
geometry = TRUE
)
With the ACS data acquired, the structure of the dataframe is inspected.
# view data structure
glimpse(fl_grad)
## Rows: 67
## Columns: 6
## $ GEOID <chr> "12123", "12017", "12071", "12101", "12033", "12037", "12065"…
## $ NAME <chr> "Taylor County, Florida", "Citrus County, Florida", "Lee Coun…
## $ variable <chr> "DP02_0066P", "DP02_0066P", "DP02_0066P", "DP02_0066P", "DP02…
## $ estimate <dbl> 4.3, 7.7, 11.9, 9.9, 10.5, 11.3, 7.1, 4.1, 24.7, 7.6, 5.7, 5.…
## $ moe <dbl> 1.4, 0.7, 0.4, 0.5, 0.6, 2.2, 1.6, 0.9, 0.9, 2.1, 1.5, 0.8, 0…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-83.83005 2..., MULTIPOLYGON (((…
The DP02_006P data set consists of 67 rows of data, which accounts for all Florida counties, and 6 columns consisting of:
The estimate cannot be considered without its margin of error. For any given estimate, the U.S Census Bureau is 90% confident the true value lies +/- the margin of error. For counties with smaller populations, the margin of error will be larger relative to the estimate, thus we must consider our error when plotting.
Next, the raw data is tidied and transformed to facilitate exploration and analysis.
# tidy and transform fl_grad
fl_grad <- fl_grad %>%
# split NAME column into distinct county and state columns
separate_wider_delim(cols=NAME, delim=",", names=c("county", "state")) %>%
# remove " County" suffix
mutate(county = str_remove(county, " County"),
# get decimal/percentage values for estimates (makes plotting better with scales library)
estimate = estimate/100,
moe = moe/100) %>%
# arrange by descending order
arrange(-estimate)
In this section, the ACS data is plotted to visualize how each county’s population compares to eachother, along with the associated error for the estimates.
# create ggplot object for estimates with error to visualize uncertainty
fl_plot_error <- ggplot(fl_grad, aes(x=estimate, y=reorder(county,estimate))) +
# error
geom_errorbar(aes(xmin=estimate - moe, xmax=estimate+moe),
width=0.5, linewidth=0.5) +
geom_point(color="darkred", size=2) +
# format x-axis labels as percentage
scale_x_continuous(labels=label_percent()) +
# set labels
labs(title = "Percent of Population with Graduate Degree, 2023 ACS",
subtitle = "Counties in Florida",
caption = "Data acquired with R and tidycensus",
x = "ACS Estimate",
y = "") +
# set theme
theme_minimal(base_size=12)
# convert ggplot object to to interactive plot using ggplotly
ggplotly(fl_plot_error,
# set hover tooltip to be ACS estimate
tooltip = "x",
# set plot height to fit all counties
height = 800)
1. Which counties in the selected state have the largest percentages of graduate degree holders?
Alachua, Leon, St. Johns, Sarasota, and Collier have the largest percentage of residents who posses graduate degrees in their respective counties. Notably, many of these counties either host a major university or are located near one. For example, University of Florida is located in Alachua, Florida State University is located in Leon County, and University of South Florida is located in Sarasota County.
2. Which counties in the selected state have the largest percentages of graduate degree holders?
Hendry, Glades, Dixie, Hamilton counties posses the smallest percentage of graduate degree holders in Florida.
Although the dot plot with the error bars is useful, the number of counties in Florida initially rendered the plot difficult to read due to the size. Fortunately, the ggplotly() function allows you to input a manual height value to accommodate every county.
For a visual representation of the data, the percentage estimates are plotted using the leaflet library below:
# create map of graduate degree holders in Florida by county
fl_grad <- fl_grad %>%
# set geometry
st_set_geometry('geometry') %>%
# set coordinate reference system (otherwise will get warning on leaflet map)
st_transform(crs = 4326) %>%
# format estimate as percentage for map
mutate(labelEstimate = percent(estimate, accuracy = 0.1)) # one decimal place
# create color palette
palette <- colorNumeric(palette = "Greens", # green
# set possible values
domain = fl_grad$estimate)
# create leaflet map of population with graduate degree by county
leaflet(fl_grad) %>%
# add basemap
addProviderTiles(providers$CartoDB.Positron) %>%
# add fl_grad polygon layer
addPolygons(
# set fill color to green palette
fillColor = ~palette(estimate),
# set border color
color = "black",
# set line thickness
weight = .75,
# set opacity slightly transperent so basemap labels are visible
fillOpacity = .75,
# set highlight options (when mouse hovers)
highlightOptions = highlightOptions(
# set outline weight
weight = 4,
# set color
color = "white",
# bring county to front (looks slightly better)
bringToFront = TRUE
),
# create tooltip label with County: Percentage
label = ~paste0(county, ": ", labelEstimate)
) %>%
# add legend
addLegend(
# set color palette
pal = palette,
# set values to be estimate %
values = ~estimate,
# set legend title
title = "% Population with Graduate Degree",
# set legend position
position = "bottomleft",
# format legend values with %
labFormat = labelFormat(suffix = "%")
)
For Part B, the poverty rate, which is the percentage of the population whose income is under the federal poverty line, is explored. In this exercise, two U.S. Census ACS 5-year estimate data sets are leveraged to calculate the poverty rate and visualize which counties posses the highest rates.
Once again, the tidycensus library is used to acquire ACS 5-year estimates for both county populations and poverty status for 2023.
# get population where poverty status is determined
fl_pop <- get_acs(
# get county-level data
geography = "county",
# total population
variables = "B17001_001",
# get Florida
state = "FL",
# get 2023
year = 2023,
# ACS 5-yr estimate
survey = "acs5",
# get geometry
geometry = TRUE
)
# get population whose income is below poverty level
fl_poverty <- get_acs(
# get county-level data
geography = "county",
# number of people in poverty
variables = "B17001_002",
# get Florida
state = "FL",
# get 2023
year = 2023,
# ACS 5-yr estimate
survey = "acs5"
)
The percentage of the population below the poverty line is estimated by dividing the ACS poverty and population estimates in each county. Both estimates possesses a margin of error which will important to include in the calculated poverty rate because lower population counties have higher associated error for their estimates. To determine this uncertainty in the calculated poverty rate, the two errors are propagated using both margin of error variables. This is accomplished by the following formula:
# merge population and poverty data
pop_poverty <- merge(fl_pop, fl_poverty,
by="GEOID",
suffixes = c(".pop", ".poverty")) %>%
# remove duplicate NAME column after merge
select(-NAME.poverty) %>%
# rename name column
rename(county = NAME.pop) %>%
# remove " County, Florida" suffix
mutate(county = str_remove(county, " County, Florida"),
# calculate poverty percentage
per.poverty = (estimate.poverty / estimate.pop),
# calculate margin of error for poverty percentage
# include document explaining error propagation for multiplication/division (ratios)
moe.per = per.poverty * sqrt((moe.poverty / estimate.poverty)^2 + (moe.pop / estimate.pop)^2)) %>%
# set geometry column for mapping
st_set_geometry("geometry") %>%
# arrange by percent poverty (ascending)
arrange(per.poverty) %>%
# glimpse
glimpse()
## Rows: 67
## Columns: 11
## $ GEOID <chr> "12129", "12109", "12115", "12113", "12119", "12019",…
## $ county <chr> "Wakulla", "St. Johns", "Sarasota", "Santa Rosa", "Su…
## $ variable.pop <chr> "B17001_001", "B17001_001", "B17001_001", "B17001_001…
## $ estimate.pop <dbl> 31471, 288913, 440606, 188472, 129329, 221079, 469868…
## $ moe.pop <dbl> 252, 379, 473, 212, 193, 337, 465, 147, 445, 338, 367…
## $ variable.poverty <chr> "B17001_002", "B17001_002", "B17001_002", "B17001_002…
## $ estimate.poverty <dbl> 1758, 19379, 36101, 15911, 11723, 20204, 43264, 8659,…
## $ moe.poverty <dbl> 588, 2238, 2364, 1722, 1306, 2759, 2621, 1399, 1991, …
## $ per.poverty <dbl> 0.05586095, 0.06707556, 0.08193488, 0.08442103, 0.090…
## $ moe.per <dbl> 0.018689221, 0.007746776, 0.005366059, 0.009137129, 0…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-84.73656 3..., MULTIPOL…
With the data transformations and tidying complete, the dataframe now contains the following columns:
Next, dot plot is created which illustrates the percentage of the population in each Florida county under the poverty line, with the error propagated. This time, the ggiraph library is used.
# construct dot plot
fl_plot <- ggplot(pop_poverty,
#
aes(x=per.poverty,
y=reorder(county,per.poverty),
tooltip = per.poverty,
data_id = GEOID)) +
# plot interactive percentage error
geom_errorbar(aes(xmin=per.poverty - moe.per,
xmax=per.poverty+moe.per),
width=0.5, linewidth=0.5) +
# plot interactive point percentage
geom_point_interactive(color="darkblue", size=2) +
# format x-axis labels as percentage
scale_x_continuous(labels=label_percent()) +
# set labels
labs(title = "Percent of Population Below Poverty Level, 2023 ACS",
subtitle = "Counties in Florida",
caption = "Data acquired with R and tidycensus",
x = "ACS Estimate",
y = "") +
# set theme
theme_minimal(base_size=8)
# create interactive plot using giraf()
girafe(ggobj = fl_plot,
# set height (inches)
height_svg = 10) %>%
# hover points
girafe_options(opts_hover(css = "fill:cyan;"),
opts_zoom(min = 0.5, max = 4))
There is a lot of uncertainty in the poverty rate for some counties. Intuitively, it makes sense that counties with lower populations would possess higher margins of error for both their population and poverty estimates because sample sizes are smaller due to fewer households and these counties are sampled less frequently.
Consequently, the calculated uncertainty for the poverty rate will likewise carry over. This can be verified by calculating Pearson’s correlation coefficient between the population and poverty rate margin of error. The result of Pearson’s test is -0.55, indicating a strong negative correlation between the size of the population and margin of error. In other words, the larger the county’s population, the smaller the associated error.
# perform pearson's correlation analysis
correlation <- cor(pop_poverty$estimate.pop, pop_poverty$moe.per, method = "pearson")
# view pearson's R results
correlation
## [1] -0.5480425
Now the poverty rate is visualized to better understand the geographic distribution for poverty rates. The classInt library is used to calculate the break points in the poverty rates using Natural Breaks (Jenks) which classifies values based on natural groupings.
# set number of classes
classes <- 5
# calculate natural breaks for per.poverty
breaks <- classIntervals(pop_poverty$per.poverty,
n = classes,
style = "jenks")$brks # creates a vector of classes
#
# assign natural breaks classes
pop_poverty <- pop_poverty %>%
# assign class to per.poverty values based on breaks intervals
mutate(class = cut(per.poverty, breaks = breaks, include.lowest = TRUE))
Next, a static map with the poverty rate classified by natural breaks is created to visualize the data.
# create map with poverty rate classified by natural breaks
ggplot(pop_poverty, aes(fill = class))+
geom_sf() +
# set theme to void
theme_void()+
# use rocket paletee
scale_fill_viridis_d(option = "rocket") + # ggplot uses "pretty breaks" by default
labs(title = "Percent Population under Poverty Line",
subtitle = " Counties of Florida",
fill = "ACS estimate",
caption = "2017-2023 ACS")
Finally, two interactive map are created using mapview and leaflet to facilitate data exploration.
# create mapview
mapview(pop_poverty, zcol='per.poverty')
# create poverty rate map by county
pop_poverty <- pop_poverty %>%
# set geometry
st_set_geometry('geometry') %>%
# set coordinate reference system (otherwise will get warning on leaflet map)
st_transform(crs = 4326) %>%
# format estimate as percentage for map
mutate(labelEstimate = percent(per.poverty, accuracy = 0.1)) # one decimal place
# create color palette
palettePoverty <- colorNumeric(palette = "Reds", # reds
# set possible values
domain = pop_poverty$per.poverty)
# create leaflet map of population with graduate degree by county
leaflet(pop_poverty) %>%
# add basemap (dark matter)
addProviderTiles(providers$CartoDB.DarkMatter) %>%
# add pop_poverty polygon layer
addPolygons(
# set fill color to red palette
fillColor = ~palettePoverty(per.poverty),
# set border color
color = "black",
# set line thickness
weight = .75,
# set opacity slightly transperent so basemap labels are visible
fillOpacity = .75,
# set highlight options (when mouse hovers)
highlightOptions = highlightOptions(
# set outline weight
weight = 4,
# set color
color = "white",
# bring county to front (looks slightly better)
bringToFront = TRUE
),
# create tooltip label with County: Percentage
label = ~paste0(county, ": ", labelEstimate)
) %>%
# add legend
addLegend(
# set color palette
pal = palettePoverty,
# set values to be estimate %
values = ~per.poverty,
# set legend title
title = "Povert Rate %",
# set legend position
position = "bottomleft",
# format legend values with %
labFormat = labelFormat(suffix = "%")
)
The top five areas with the highest poverty rates are Gadsen, Hamilton, Hardee, Putnam and Desoto Counties. However as noted previously, many counties with high poverty rates also have low populations which result in significant associated error for the caclulations. When ranking the poverty rate by county, great care should be shown to include this important context.
#References
U.S. Census Bureau. (2022, February 24). Educational attainment in the United States: 2021. U.S. Department of Commerce. https://www.census.gov/newsroom/press-releases/2022/educational-attainment.html