Introduction, Data Preparation, and Analysis
This lab focused on using Census/American Community Survey (ACS) data to create informative maps and graphs. The analysis covers the state of Massachusetts. Part A looks at the percentage of people in each county with graduate degrees, using interactive and static graphs. Part B looks at median household income by census tract within Suffolk County, Massachusetts. This county contains the city of Boston and is where further analysis will be done in this class's term project. Interactive and static maps were created to analyze this variable within the given area.
Loading the appropriate packages is the first step in the analysis.
#Load Packages
library(tidycensus)
library(tidyverse)
library(scales)
library(plotly)
library(mapview)
library(sf)
Part A looks at the percentage of people with graduate degrees in each Massachusetts county. To access this data, the tidycensus package was used to query the Census API and load the results into R.
#Use get_acs() to fetch data on the percentage of the population that have a graduate degree
grad_deg <- get_acs(
geography = "county",
variables = "DP02_0066P",
state = "MA"
)
Now that the data has been loaded, it helps to get a sense of its range. The data was sorted from smallest to largest percent estimate, and the sorted table was used to identify the counties with the largest and smallest percentages of graduate degree holders.
#sort from min to max percentage
sort_grad <- arrange(grad_deg, estimate)
#See highest percentage values
tail(sort_grad)
## # A tibble: 6 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 25001 Barnstable County, Massachusetts DP02_0066P 22.4 0.7
## 2 25025 Suffolk County, Massachusetts DP02_0066P 23.5 0.6
## 3 25007 Dukes County, Massachusetts DP02_0066P 24.4 4.5
## 4 25015 Hampshire County, Massachusetts DP02_0066P 27.6 1.1
## 5 25021 Norfolk County, Massachusetts DP02_0066P 28.4 0.5
## 6 25017 Middlesex County, Massachusetts DP02_0066P 30.5 0.4
#See lowest percent values
head(sort_grad)
## # A tibble: 6 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 25005 Bristol County, Massachusetts DP02_0066P 11.4 0.4
## 2 25013 Hampden County, Massachusetts DP02_0066P 12.1 0.5
## 3 25023 Plymouth County, Massachusetts DP02_0066P 16.1 0.5
## 4 25027 Worcester County, Massachusetts DP02_0066P 16.8 0.4
## 5 25009 Essex County, Massachusetts DP02_0066P 17.9 0.4
## 6 25011 Franklin County, Massachusetts DP02_0066P 18.5 1
Middlesex (30.5%) and Norfolk (28.4%) counties have the highest percentages of graduate degree holders in Massachusetts, while Bristol (11.4%) and Hampden (12.1%) counties have the lowest.
To get a better idea of these values, as well as the amount of error around them, a dot plot with error bars can be made showing the percentage in each county.
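The comparison above can be sketched in a few lines of base R. This is a minimal illustration rather than part of the original workflow: the small `grad` data frame is a stand-in built by hand from the values printed in the `head()` and `tail()` output, and the names `grad`, `highest`, `lowest`, and `gap` are introduced here only for the example.

```r
# Estimates copied from the head() and tail() output above
# (percent of people with a graduate degree). Only four of the
# counties shown there are included in this stand-in table.
grad <- data.frame(
  county   = c("Bristol", "Hampden", "Middlesex", "Norfolk"),
  estimate = c(11.4, 12.1, 30.5, 28.4),
  moe      = c(0.4, 0.5, 0.4, 0.5)
)

# Rows with the highest and lowest estimate
highest <- grad[which.max(grad$estimate), ]
lowest  <- grad[which.min(grad$estimate), ]

# Gap between the two, in percentage points
gap <- highest$estimate - lowest$estimate
gap
```

Because the table was sorted before calling `head()` and `tail()`, the statewide minimum and maximum are both in the printed rows, so the gap computed here is the statewide spread.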
errorplot <- ggplot(sort_grad, aes(x = estimate,
y = reorder(NAME, estimate))) + #order by value
geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe), #create error bars
width = 0.5, linewidth = 0.5) +
geom_point(color = "blue", size = 2) +
scale_x_continuous(labels = label_percent(scale=1)) +
scale_y_discrete(labels = function(x) str_remove(x, " County, Massachusetts|, Massachusetts")) + #Remove " County" and the state name from labels
#Add titles and captions and axis labels
labs(title = "Percentage of People With Graduate Degrees, 2020-2024 ACS",
subtitle = "Counties in Massachusetts",
caption = str_wrap("Figure 1: Percentage of people with graduate degrees in each county in the State of Massachusetts. The highest and lowest percentages in the state differ by about 19 percentage points. Data acquired with R and tidycensus. Error bars represent the margin of error around each estimate."),
x = "ACS estimate",
y = "") +
theme_minimal(base_size = 12)
print(errorplot)
Looking at this graph, it is interesting to note that Dukes County and Nantucket County have much larger margins of error than the other counties. This is likely because they have the smallest populations, so their estimates are based on smaller survey samples.
This graph can also be made interactive using plotly, which makes the exact values easier to read by hovering over each point.
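One way to put a number on how unreliable an estimate is relative to its size is the coefficient of variation. The sketch below is an illustration, not part of the original lab: it uses the Dukes and Middlesex values copied from the `tail()` output, and it relies on the fact that ACS margins of error are published at the 90% confidence level (z = 1.645). The names `est`, `moe`, `se`, and `cv` are introduced here for the example.

```r
# Estimates and margins of error copied from the tail() output above
est <- c(Dukes = 24.4, Middlesex = 30.5)
moe <- c(Dukes = 4.5,  Middlesex = 0.4)

# ACS margins of error use a 90% confidence level,
# so the standard error is the MOE divided by 1.645
se <- moe / 1.645

# Coefficient of variation: standard error as a percent of the estimate
cv <- 100 * se / est
round(cv, 1)
```

Dukes County's coefficient of variation comes out more than ten times larger than Middlesex County's, which matches the much wider error bar seen in the plot.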
#Use the plotly package to convert the plot created earlier to an interactive chart
ggplotly(errorplot, tooltip = "x")
For Part B of the lab, the goal was to look at income data within the county where my term project takes place. For this reason, I wanted to look at median household income at the tract level in Suffolk County. To do this I needed to find the name of this variable in the ACS variable list, which I did using the load_variables() function.
var <- load_variables(year = 2024, dataset = "acs5")
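The table returned by `load_variables()` has `name`, `label`, and `concept` columns, so the variable can be found by searching the `concept` text. The sketch below shows the idea on a three-row stand-in table: `vars` is a hypothetical, hand-built substitute for the real table (which has thousands of rows), and its labels and concepts are abbreviated for illustration.

```r
# Hypothetical stand-in for the load_variables() result.
# Labels and concepts are abbreviated; the real table is far larger.
vars <- data.frame(
  name    = c("B01003_001", "B19013_001", "B25077_001"),
  label   = c("Estimate!!Total",
              "Estimate!!Median household income in the past 12 months",
              "Estimate!!Median value (dollars)"),
  concept = c("Total Population",
              "Median Household Income in the Past 12 Months",
              "Median Value (Dollars)"),
  stringsAsFactors = FALSE
)

# Case-insensitive search of the concept column
hits <- vars[grepl("median household income", vars$concept, ignore.case = TRUE), ]
hits$name
```

With the real table, the same `grepl()` filter could be applied to the `var` object created above; it would likely return several related variables, which then need to be narrowed down by reading their labels.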
With the variable name found (B19013_001), I was able to load median household income in the past 12 months for Suffolk County census tracts as spatial data for mapping.
income <- get_acs(
geography = "tract",
variables = "B19013_001",
state = "MA",
county = "Suffolk",
geometry = TRUE,
progress_bar = FALSE
)
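Before mapping, it can be worth checking for tracts with no published estimate, since the ACS can return NA for tracts with few or no households. The sketch below is only an illustration: `income_demo` is a hypothetical stand-in with made-up values and placeholder IDs, not the Suffolk County data.

```r
# Hypothetical stand-in for the income table.
# All values and IDs below are made up for illustration.
income_demo <- data.frame(
  GEOID    = c("tract_a", "tract_b", "tract_c"),
  estimate = c(85000, NA, 120000),
  moe      = c(9000, NA, 15000)
)

# Count tracts with no published estimate
n_missing <- sum(is.na(income_demo$estimate))

# Keep only tracts that can be symbolized on a map
income_mapped <- income_demo[!is.na(income_demo$estimate), ]
n_missing
```

The same row subsetting should work on the `income` object itself, because sf objects support data frame style indexing and keep their geometry column.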
This data can be displayed interactively as a choropleth map using the mapview() function from the mapview package.
#display your data interactively
mapview(income, zcol = "estimate")
However, graduated symbols are probably a better method for displaying this data, so a static graduated symbols map was created.
#Create centroid points for map
incntrd <- st_centroid(income)
#Create plot graduated symbols
ggplot() +
geom_sf(data = incntrd, aes(size = estimate),alpha = 0.7, color = "blue") +
geom_sf(data = income, fill = NA) +
labs(title = "Median Household Income, 2020-2024 ACS",
subtitle = "Suffolk County Tracts",
caption = str_wrap("Figure 2: Median household income by census tract within Suffolk County, Massachusetts. Many of the higher values are clustered in the center of the county. Data acquired with R and tidycensus.")
) +
theme_void()
Although the symbols overlap heavily in the center of the county, approximately where Boston is located, it can still be seen that some of the higher income values are located centrally and near the coast.