An article by Nathaniel Rakich of FiveThirtyEight described how urban and rural communities often vote along partisan lines, with rural communities regularly voting for Republicans and more urban communities voting for Democrats. He wrote that this pattern has held for years, citing the past two presidential elections and the 2018 midterm elections. FiveThirtyEight analyzed census population data alongside voting records to explore how strongly this pattern correlates across the United States. Their analysis may help predict the results of the 2020 presidential election.
The original article can be found here: https://fivethirtyeight.com/features/how-urban-or-rural-is-your-state-and-what-does-that-mean-for-the-2020-election/
When comparing rural and urban communities and their voting patterns, it is important to know how urban the area of study is. To calculate “urbanness,” that is, how urban or rural an area is, FiveThirtyEight created an index using census data from the 5-year American Community Survey completed in 2017. Their formula first calculated the average number of people living within a 5 mile radius of each census tract and then took the natural logarithm of that value. These index values were then tabulated by state so that each state’s “partisan lean” (from FiveThirtyEight’s prior analyses) could be compared with its index value. A correlation coefficient was calculated, and a new table was generated using only a state’s urban index as the determining factor for how that state would vote.
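The index described above reduces to a single transformation of the radius population. The following is a minimal sketch of that step, assuming a hypothetical helper named `urban_index` and made-up radius populations; neither comes from FiveThirtyEight's own code.

```r
# Hypothetical helper: the index is the natural log of the average
# population living within 5 miles of a census tract.
urban_index <- function(radius_population) {
  log(radius_population)  # log() in R is the natural logarithm by default
}

# Made-up radius populations for a rural, a suburban, and an urban tract
example_populations <- c(rural = 500, suburban = 50000, urban = 1500000)

example_index <- round(urban_index(example_populations), 2)
example_index
```

Because the scale is logarithmic, a tenfold increase in the radius population adds roughly 2.3 to the index, so differences of a few points separate very rural tracts from very urban ones.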
Given the data provided, it may be interesting to validate their formula by graphing the urban index against the averaged census population. The resulting graph should have the shape of the natural logarithm function. It may also be interesting to visually explore what the urban index looks like against the raw tract population, without averaging over a 5 mile radius.
library(readr)
library(ggplot2)
# Reading data from the original source in 538's Github repository
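# stringsAsFactors = TRUE keeps text columns as factors on R 4.0 and later, matching the output shown further down
UrbanIndexData <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/urbanization-index/urbanization-census-tract.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)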
head(UrbanIndexData)
Since not all of this data is needed, it is best to subset the data to only the columns necessary for this review.
UrbanIndexData <- subset(UrbanIndexData, select = c("state", "population", "adj_radiuspop_5", "urbanindex"))
UrbanIndexData[1:5, 1:4]
There also appear to be no NA or missing values in this data. However, it is best to confirm that with a quick check.
# Total the number of values that are N/A in the entire data frame
sum(is.na(UrbanIndexData))
## [1] 0
If rows with missing values needed to be excluded from this analysis, one might use the ‘na.omit’ function as shown in this example:
UrbanIndexData <- na.omit(UrbanIndexData)
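Since the real data has no missing values, the effect of ‘na.omit’ is not visible above. As a small illustration, here is a sketch on a made-up data frame (the name `toy` and its values are assumptions for demonstration only):

```r
# Toy data frame (made-up values) with one missing population
toy <- data.frame(
  state = c("A", "B", "C"),
  population = c(1200, NA, 3400)
)

# Count the missing cells before cleaning
missing_before <- sum(is.na(toy))

# na.omit() drops every row that contains at least one NA
toy_clean <- na.omit(toy)

missing_before
nrow(toy_clean)
```

Note that ‘na.omit’ removes the whole row, so a single missing value in any column discards that tract entirely.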
Some of the column names are a little tricky. For example, the name “adj_radiuspop_5” might be confusing if the methods were not known. These should be changed to avoid abbreviations and include meaningful names.
# Changing column names to avoid abbreviations and include meaningful names
colnames(UrbanIndexData) <- c("State", "Population_2017", "Neighborhood_Population", "UrbanIndex")
UrbanIndexData[1:5, 1:4]
All the data types were also read into the data frame correctly, so none of the data needs coercion for the purposes of this review. This is visible in the type abbreviations shown below the column names when the data frame is printed, or each variable can be tested like this:
class(UrbanIndexData$State)
## [1] "factor"
class(UrbanIndexData$Population_2017)
## [1] "integer"
class(UrbanIndexData$Neighborhood_Population)
## [1] "numeric"
class(UrbanIndexData$UrbanIndex)
## [1] "numeric"
Seeing the integer and numeric data types for the population columns and the urban index is all that is needed. The state names being stored as factors is helpful as a reference, but they will not be used here.
As a hypothetical, if the data types were not correctly listed in the data frame, they could be changed using the ‘as.numeric’ or ‘as.integer’ functions.
# Example function to coerce to a numeric or integer data type
# as.numeric(UrbanIndexData$UrbanIndex)
# as.integer(UrbanIndexData$Population_2017)
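One caveat with coercion is worth a short sketch: when a numeric column has been read in as a factor, calling ‘as.numeric’ directly returns the internal level codes rather than the original numbers. The values below are made up for illustration.

```r
# Made-up index values that were read in as a factor by mistake
index_as_factor <- factor(c("10.5", "8.2", "12.1"))

# Pitfall: as.numeric() on a factor returns the internal level codes
codes <- as.numeric(index_as_factor)

# Converting to character first recovers the original numbers
values <- as.numeric(as.character(index_as_factor))

codes
values
```

In either case the result has to be assigned back to the column (for example with `<-`) for the change to persist in the data frame.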
Now the data is clean and ready for analysis. To validate the natural logarithm in their formula, plot the averaged population on the x axis and the urban index on the y axis. The result should follow the curve of the natural logarithm function.
# ggplot can make better "publish worthy" graphs and charts than base R (in my opinion)
ggplot(UrbanIndexData, aes(x = Neighborhood_Population, y = UrbanIndex)) +
  geom_point(size = 1, shape = 20, color = "dark grey") +
  # Axis labels match the aesthetics: averaged population on x, index on y
  labs(x = "Average Population Within a 5 Mile Radius", y = "Urban Density Index",
       title = "Urban Index by Averaged Census Population",
       subtitle = "Estimated by 538 using the 2017 ACS Survey at a 5 Mile Radius",
       caption = "A demonstration of the natural logarithmic function using 538's urban index") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white"),
        plot.caption = element_text(hjust = 0.5))
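The same relationship can also be checked numerically rather than by eye. The sketch below assumes a hypothetical helper named `index_matches_log` and an arbitrarily chosen tolerance, and it runs on made-up data that uses the renamed columns from this review.

```r
# Hypothetical helper: TRUE when every stored index equals the natural log
# of the averaged population, within a small tolerance
index_matches_log <- function(data, tolerance = 1e-6) {
  gap <- abs(data$UrbanIndex - log(data$Neighborhood_Population))
  max(gap) < tolerance
}

# Toy data (made-up values) where the index is exactly the log
toy <- data.frame(Neighborhood_Population = c(800, 25000, 900000))
toy$UrbanIndex <- log(toy$Neighborhood_Population)
matches_exact <- index_matches_log(toy)

# Same toy data with one index value shifted, which should fail the check
toy_shifted <- toy
toy_shifted$UrbanIndex[2] <- toy_shifted$UrbanIndex[2] + 0.5
matches_shifted <- index_matches_log(toy_shifted)

c(exact = matches_exact, shifted = matches_shifted)
```

On the real data, `index_matches_log(UrbanIndexData)` would be the equivalent call; the tolerance may need loosening if the published values were rounded. Either way, the plot above remains the visual check.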
This graph succeeds in demonstrating the natural logarithmic function in their computation of the urban index from an averaged population.
The next step is to explore how the urban index looks against the census population as is, with no averaging of populations over any radius. It is simply the raw census population gathered from the ACS survey in 2017 compared to 538’s urban index for each census tract.
ggplot(UrbanIndexData, aes(x=UrbanIndex, y=Population_2017)) +
geom_point(size = 1, shape = 20, color = "dark grey") +
labs(x="Urban Density Index", y = "Population",
title = "Census Population by Index",
subtitle ="Estimated by 538 using the 2017 ACS Survey at a 5 Mile Radius") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
panel.background = element_rect(fill ="white"))
At first glance there is a gap around 8 on the urban density index. This might be explained by the formula used to compute the urban index and by plotting it against the raw population rather than the averaged population. Regardless, the urban index data contains relatively few values between 8 and 9, as shown below:
# Selecting from the Urban Index Data frame where the Urban Index is greater than 8 AND less than 9
UID_range8to9 <- subset(UrbanIndexData, UrbanIndex > 8 & UrbanIndex < 9)
# Counting the number of rows in the new data frame with only values between 8 and 9
nrow(UID_range8to9)
## [1] 1757
# Counting the number of rows in the original data frame for comparison
nrow(UrbanIndexData)
## [1] 73280
# Calculating the percentage of rows whose urban index falls between 8 and 9
UID_percent8to9 <- nrow(UID_range8to9) / nrow(UrbanIndexData) * 100
UID_percent8to9 <- round(UID_percent8to9, digits = 2)
UID_percent8to9
## [1] 2.4
From this brief look, urban index values between 8 and 9 are sparse. Assuming each row contains an urban index record, only 1757 of the 73280 total values fall in that range, meaning they make up only about 2.4% of the data frame. Still, it is curious why there appears to be a gap in this range.
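To judge whether 2.4% is actually low, the share of values in every one-unit range can be compared side by side. The sketch below uses ‘cut’ and ‘table’ on a made-up vector (`toy_index` is an assumption standing in for the real urban index column).

```r
# Made-up index values standing in for UrbanIndexData$UrbanIndex
toy_index <- c(6.4, 7.1, 7.8, 8.5, 9.2, 9.9, 10.3, 10.7, 11.5, 12.8)

# Assign each value to a one-unit-wide bin: [6,7), [7,8), and so on
bins <- cut(toy_index, breaks = 6:13, right = FALSE)

# Count the values per bin and convert the counts to percentages
bin_counts <- table(bins)
bin_share <- round(100 * prop.table(bin_counts), 1)

bin_share
```

On the real data the breaks could run from the floor of the smallest index to the ceiling of the largest. These bins include their lower edge, so the count for the 8 to 9 bin could differ slightly from the strict greater-than and less-than filter used above.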
Based on this index, if the election were determined solely by urbanness, Democrats would win the Electoral College 323 to 215 in electoral votes. Those who live in rural areas regularly vote for conservatives, while those in more urban areas regularly vote for more progressive candidates. Suburbs are also increasingly voting for Democrats over Republicans. FiveThirtyEight also measured how strongly “urbanness” and voting patterns are related. Their results showed a close connection, with a correlation coefficient of 0.55 in 2012 and 0.69 in 2016. Their expectation is that this trend will continue to grow and influence the 2020 presidential election.
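For readers unfamiliar with how such a coefficient is obtained, base R’s ‘cor’ function computes it directly. The sketch below uses entirely made-up state-level values, not FiveThirtyEight’s figures, and assumes a sign convention where a positive lean means more Democratic.

```r
# Made-up state-level values; these are not FiveThirtyEight's figures
toy_states <- data.frame(
  urban_index   = c(8.3, 9.1, 10.0, 10.9, 11.6, 12.4),
  partisan_lean = c(-25, -12, -8, 3, 5, 20)  # assumed: positive = more Democratic
)

# cor() returns the Pearson correlation coefficient by default
r <- cor(toy_states$urban_index, toy_states$partisan_lean)

round(r, 2)
```

A value near 1 indicates that more urban states lean more Democratic in this toy data; a value near 0 would indicate no linear relationship.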
To extend this work, I would add tables for the “top 5 most urban states” and the “top 5 least urban states.” I might also compute the ranges of their urban indexes and run another correlation test with FiveThirtyEight’s “partisan lean” data (which was not provided openly for this analysis) to better grasp how far apart these states are and how well partisan lean and urbanness correlate at both extremes of the political spectrum.
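As a starting point for those tables, tract-level values could be rolled up to the state level and sorted. The sketch below assumes a population-weighted mean, which may differ from how FiveThirtyEight aggregated its state index, and uses made-up tracts with the renamed columns from this review.

```r
# Made-up census tracts for three hypothetical states
toy_tracts <- data.frame(
  State = c("A", "A", "B", "B", "C", "C"),
  Population_2017 = c(1000, 3000, 2000, 2000, 500, 1500),
  UrbanIndex = c(8, 12, 10, 11, 7, 9)
)

# Population-weighted mean index per state (an assumed aggregation method)
state_index <- sapply(
  split(toy_tracts, toy_tracts$State),
  function(tracts) weighted.mean(tracts$UrbanIndex, tracts$Population_2017)
)

# Up to five most urban and five least urban states
most_urban <- head(sort(state_index, decreasing = TRUE), 5)
least_urban <- head(sort(state_index), 5)

most_urban
least_urban
```

With the real data, the same steps would apply to `UrbanIndexData`, and the resulting state values could then feed a correlation test once partisan lean figures are available.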