Analyzing and Visualizing the Atlas of Rural and Small Town America Dataset
The data for this project come from the Atlas of Rural and Small Town America, published by the U.S. Department of Agriculture’s Economic Research Service. The dataset provides statistics by broad categories for various socioeconomic factors, including demographic data from the American Community Survey (ACS), economic data from the Bureau of Labor Statistics, categorical variables (codes) for various county classifications, data on income, and data on veterans.
For this project, we are only going to look at County Classifications. Let’s first import the Excel workbook, read the specified sheet, and store it as a tibble.
I then convert it to a data frame for easier data wrangling and view the first few rows.
paged_table(head(RuralAtlasData23))
With just a quick glance, we can see a few interesting tidbits.
Forty-five columns is of course quite a bit to work with. We are going to select only twelve columns for this project: three identifiers (the Unique County ID, the State, and the County) and nine (9) relevant classification variables. The classification variables indicate whether the county is classified as Nonmetro (the county does not have an Urbanized Area or Urbanized Cluster in its jurisdiction), whether the county is classified as Micropolitan (population of at least 10,000 but less than 50,000), whether the county had low education in 2015, whether the county had low employment in 2015, whether the county experienced population loss in the past decade (2005 - 2015), whether the county is designated as a retirement destination due to a high percentage of residents over the age of 65, whether the county was in persistent poverty or persistent child poverty over the past three decades (1970 - 2000), and whether the county had high natural amenities.
RuralAtlasData23 <- select(RuralAtlasData23, "FIPStxt",
"State",
"County",
"Nonmetro2013",
"Micropolitan2013",
"Low_Education_2015_update",
"Low_Employment_2015_update",
"Population_loss_2015_update",
"Retirement_Destination_2015_Update",
"PersistentChildPoverty2004",
"PersistentPoverty2000",
"HiAmenity")
Next, let’s rename those long columns into something more digestible.
RuralAtlasData23 <- rename(RuralAtlasData23,
UniqueID = "FIPStxt",
Nonmetro = "Nonmetro2013",
Micropolitan = "Micropolitan2013",
Low_Education = "Low_Education_2015_update",
Low_Employment = "Low_Employment_2015_update",
Population_Loss = "Population_loss_2015_update",
Retirement_Destination = "Retirement_Destination_2015_Update",
Persistent_Child_Poverty = "PersistentChildPoverty2004",
Persistent_Poverty = "PersistentPoverty2000")
While the columns / variables are now easier to understand, the coded responses are not. We’ll need to recode those 0s and 1s to better reflect what they are identifying.
RuralAtlasData23 <- RuralAtlasData23 %>%
mutate(Nonmetro = recode(Nonmetro, '0' = "Urban", '1' = "Rural"),
Micropolitan = recode(Micropolitan, '0' = "Not Micropolitan", '1' = "Micropolitan"),
Low_Education = recode(Low_Education,'0' = "Mid-to-High Education", '1' = "Low Education"),
Low_Employment = recode(Low_Employment,'0' = "Mid-to-High Employment", '1' = "Low Employment"),
Population_Loss = recode(Population_Loss, '0' = "No Population Loss", '1' = "Population Loss"),
Retirement_Destination = recode(Retirement_Destination,'0' = "Not an RD", '1' = "RD"),
Persistent_Child_Poverty = recode(Persistent_Child_Poverty,'0' = "No Persistent Child Poverty", '1' = "Persistent Child Poverty"),
Persistent_Poverty = recode(Persistent_Poverty, '0' = "No Persistent Poverty", '1' = "Persistent Poverty"),
HiAmenity = recode(HiAmenity, '0' = "Not High Amenity", '1' = "High Amenity")
)
paged_table(head(RuralAtlasData23))
As the last step in this data wrangling process, let’s filter out all the states save Texas (my home state!). When exploring geographic units of analysis, it’s often better to home in on a smaller frame to find potentially richer information. While the dataset limits what we can learn, I think localizing the data will help us better answer some research questions.
RuralAtlasData23 <- RuralAtlasData23 %>%
filter(State == "TX")
paged_table(head(RuralAtlasData23))
Now that the data is cleaned and filtered, let’s consider some exploratory research questions.
Let’s select the relevant columns for this question, filter on only Rural counties that experienced population loss, and provide a count.
38 Rural Texas counties experienced population loss. There are a total of 172 Rural Texas counties (of 254 total). Doing some quick math will pull a percentage of those that experienced population loss.
(38 / 172) * 100
[1] 22.09302
About 22% of all Texas Rural counties experienced population loss. This did not meet the threshold set by the research question (25%), so population loss is not widespread by that standard. Note that the flag alone does not tell us whether the remaining 78% of Rural counties are growing or simply holding steady.
We could further explore a question of similar concern by comparing population loss across the Rural / Urban Continuum, and see what percentage of Texas Urban counties experienced population loss. Let’s examine that real quick.
(4 / 82) * 100
[1] 4.878049
There’s much less population loss for Texas Urban counties. Only 4.9% have experienced some form of population loss in the past decade (2005 - 2015). From this we can conclude that, while rural counties have not met the threshold of substantial population loss, they are about 4.5 times as likely to experience population loss as their urban counterparts (22.1% versus 4.9%).
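The two percentages and their ratio can be reproduced in a few lines. This is a minimal sketch that hard-codes the counts reported above; the vector names `loss`, `total`, and `rate` are illustrative and not part of the original script.

```r
# Counts taken from the filtered tables above
loss  <- c(Rural = 38, Urban = 4)     # counties flagged with population loss
total <- c(Rural = 172, Urban = 82)   # all Texas counties in each group

# Percent of each group that experienced population loss
rate <- loss / total * 100
round(rate, 1)

# How many times higher the Rural rate is than the Urban rate
rate_ratio <- rate[["Rural"]] / rate[["Urban"]]
round(rate_ratio, 2)
```

Dividing the two rates, rather than eyeballing them, gives a ratio of roughly 4.5.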
We visualize this percentage using a stacked bar chart, showing only those counties coded “Population Loss.” As we can see, the vast majority of counties that experienced population loss from 2005 - 2015 were classified as “Rural.”
# Keep only the counties flagged as having lost population.
# The column was recoded earlier, so the value to match is "Population Loss".
Question1vV <- RuralAtlasData23 %>%
select("UniqueID",
"County",
"Nonmetro",
"Population_Loss"
) %>%
filter(Population_Loss == "Population Loss")
# Map the columns themselves (unquoted) so ggplot uses the data,
# not the literal strings "Population_Loss" and "Nonmetro".
ggplot(Question1vV,
aes(Population_Loss,
fill = Nonmetro)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Paired") +
labs(y = "Percent",
x = "Experienced Population Loss",
title = "Percentage Population Loss, 2005 - 2015") +
theme_minimal()
Note: for the Final Draft, we need to change the Y Axis from decimal to percent and/or perhaps show the percentages in the column chart.
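One possible way to handle the Y axis is `scales::label_percent()`, which formats proportions as percentages. The sketch below is only an illustration: `toy_loss` is a hypothetical stand-in built from the counts reported earlier (38 Rural and 4 Urban counties with population loss), not the real data frame.

```r
library(ggplot2)

# Hypothetical stand-in for the filtered data: 38 Rural and 4 Urban counties
toy_loss <- data.frame(
  Population_Loss = "Population Loss",
  Nonmetro = rep(c("Rural", "Urban"), times = c(38, 4))
)

p <- ggplot(toy_loss, aes(Population_Loss, fill = Nonmetro)) +
  geom_bar(position = "fill") +
  # Show 0.25, 0.50, ... as 25%, 50%, ... on the Y axis
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Experienced Population Loss",
       title = "Percentage Population Loss, 2005 - 2015") +
  theme_minimal()

p
```

In the actual script, only the `scale_y_continuous()` line would need to be added to the existing plot.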
Now’s the time to explore some frequency tables. We don’t have any numeric variables, so we will solely be using frequency tables to determine the percentage of counties in Texas that are coded as X variable.
We’ll start with determining the percentage of counties that are classified as Retirement Destinations, stratified by Nonmetro status.
# A tibble: 4 x 4
Nonmetro Retirement_Destination n prop
<chr> <chr> <int> <dbl>
1 Rural Not an RD 153 0.602
2 Rural RD 19 0.0748
3 Urban Not an RD 54 0.213
4 Urban RD 28 0.110
Fewer than 50 counties (47 of 254) are classified as Retirement Destinations (RDs). From a brief glance, it appears that there are more Urban RDs than Rural ones. Interesting. We’ll come back to this with a cross tab, but first, let’s pull other variables into a new function for this research question.
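A frequency table shaped like the first one above (counts plus a share of all counties) can be built with `count()` and `mutate()` from dplyr. This is a sketch under assumptions: `texas_counties` is a hypothetical stand-in rebuilt from the printed counts, and in the real analysis the pipeline would start from `RuralAtlasData23`.

```r
library(dplyr)

# Hypothetical stand-in for the Texas data, rebuilt from the counts above
texas_counties <- tibble(
  Nonmetro = rep(c("Rural", "Rural", "Urban", "Urban"),
                 times = c(153, 19, 54, 28)),
  Retirement_Destination = rep(c("Not an RD", "RD", "Not an RD", "RD"),
                               times = c(153, 19, 54, 28))
)

rd_freq <- texas_counties %>%
  count(Nonmetro, Retirement_Destination) %>%  # one row per combination
  mutate(prop = n / sum(n))                    # share of all 254 counties

rd_freq
```

Because `prop` divides by the grand total, the four values sum to 1. They describe each cell’s share of all Texas counties, not the share within Rural or Urban counties.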
# A tibble: 14 x 6
Nonmetro Retirement_Desti~ Persistent_Pove~ Persistent_Child~ n
<chr> <chr> <chr> <chr> <int>
1 Rural Not an RD No Persistent P~ No Persistent Ch~ 74
2 Rural Not an RD No Persistent P~ Persistent Child~ 46
3 Rural Not an RD Persistent Pove~ No Persistent Ch~ 1
4 Rural Not an RD Persistent Pove~ Persistent Child~ 32
5 Rural RD No Persistent P~ No Persistent Ch~ 15
6 Rural RD No Persistent P~ Persistent Child~ 2
7 Rural RD Persistent Pove~ Persistent Child~ 2
8 Urban Not an RD No Persistent P~ No Persistent Ch~ 38
9 Urban Not an RD No Persistent P~ Persistent Child~ 7
10 Urban Not an RD Persistent Pove~ Persistent Child~ 9
11 Urban RD No Persistent P~ No Persistent Ch~ 25
12 Urban RD No Persistent P~ Persistent Child~ 1
13 Urban RD Persistent Pove~ No Persistent Ch~ 1
14 Urban RD Persistent Pove~ Persistent Child~ 1
# ... with 1 more variable: prop <dbl>
That’s a little unintuitive, but hopefully visualizing the data will help us understand the table better.
We can see some quick points of interest, though. First, there are four Rural RD counties that have some form of persistent poverty. That’s about 21% of the 19 Rural RDs. For Urban RDs, only three have some form of persistent poverty. That’s about 11% of the 28 Urban RDs, roughly half the Rural rate.
We’ll complete this analysis with a crosstabs and proportional crosstabs to help begin answering Research Question #2: Are Rural Texas counties more likely to experience persistent poverty compared with their Urban counterparts? What about persistent child poverty?
xtabs(~ Nonmetro + Persistent_Poverty, RuralAtlasData23)
Persistent_Poverty
Nonmetro No Persistent Poverty Persistent Poverty
Rural 137 35
Urban 71 11
And then the proportional crosstabs.
prop.table(xtabs(~ Nonmetro + Persistent_Poverty, RuralAtlasData23))*100
Persistent_Poverty
Nonmetro No Persistent Poverty Persistent Poverty
Rural 53.937008 13.779528
Urban 27.952756 4.330709
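Rural counties account for roughly three times as many Persistent Poverty counties as Urban counties (35 versus 11, or 13.8% versus 4.3% of all Texas counties). Because there are also about twice as many Rural counties, the fairer comparison is within each group: 20.3% of Rural counties (35 of 172) are in Persistent Poverty versus 13.4% of Urban counties (11 of 82), so a Rural county is about 1.5 times as likely to be classified this way. Taken together with Research Question #1, a Texas Rural county is more likely than an Urban county to experience both population loss and persistent poverty.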
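A note on reading the proportional crosstab: `prop.table()` without a `margin` argument divides every cell by the grand total, so each value is a share of all 254 counties. To compare rates within Rural and within Urban counties, row proportions (`margin = 1`) are one option. The sketch below rebuilds the table from the printed counts; `pp_counts` is an illustrative name, and with the real data the same call would wrap the `xtabs()` result directly.

```r
# Rebuild the crosstab from the counts printed above
pp_counts <- as.table(matrix(
  c(137, 35,
     71, 11),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Nonmetro = c("Rural", "Urban"),
    Persistent_Poverty = c("No Persistent Poverty", "Persistent Poverty")
  )
))

# margin = 1 makes each row sum to 100%
row_pct <- prop.table(pp_counts, margin = 1) * 100
round(row_pct, 1)

# Ratio of the Rural rate to the Urban rate
pp_ratio <- row_pct["Rural", "Persistent Poverty"] /
  row_pct["Urban", "Persistent Poverty"]
round(pp_ratio, 2)
```

Row proportions answer “what share of Rural counties are in persistent poverty?”, which is the comparison a “more likely” statement needs.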
We’ll repeat this process for Persistent Child Poverty.
xtabs(~ Nonmetro + Persistent_Child_Poverty, RuralAtlasData23)
Persistent_Child_Poverty
Nonmetro No Persistent Child Poverty Persistent Child Poverty
Rural 90 82
Urban 64 18
And then the proportional crosstabs.
prop.table(xtabs(~ Nonmetro + Persistent_Child_Poverty, RuralAtlasData23))*100
Persistent_Child_Poverty
Nonmetro No Persistent Child Poverty Persistent Child Poverty
Rural 35.433071 32.283465
Urban 25.196850 7.086614
Looks very similar to Persistent Poverty, save one striking difference: Persistent Child Poverty is more than twice as likely to affect Rural counties as Persistent Poverty. So it looks like children are more affected by deeply entrenched poverty in Rural counties than their teenage or adult counterparts.
However, let’s not jump to any conclusions just yet and integrate poverty with retirement destinations to see if there’s any overlap. That will be one step of this RQ’s visualization process.
Due to the categorical nature of this current dataset, we are going to use bar charts for our univariate and bivariate graphs. We’ll focus on Retirement Destinations for both initial plots.
Nothing really amazing here. Most counties are not retirement destinations, by more than 4:1 (207 to 47). Let’s add some color to this graph.
Now we’re getting somewhere! It looks like there are more Urban Retirement Destinations, both in count and in frequency. So older individuals are moving not to the countryside but to the city.
But this simple bar chart is pretty boring, still. We can look at proportions by editing the position from stack to fill and updating the colors / labels.
ggplot(RuralAtlasData23,
aes(Retirement_Destination,
fill = Nonmetro)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Paired") +
labs(y = "Percent",
x = "Retirement Destination",
title = "More Than Half of All Texas Retirement Destinations are in Urban Counties") +
theme_minimal()
That’s much better. And while a bar chart is still not that exciting of a data visualization, it tells us a little bit about the Retirement Destination column. Let’s add the two poverty variables to a facet grid and see how these four variables compare.
ggplot(RuralAtlasData23,
aes(Retirement_Destination,
fill = Nonmetro)) +
geom_bar(position = "fill") +
facet_grid(vars(Persistent_Poverty),
vars(Persistent_Child_Poverty)) +
scale_fill_brewer(palette = "Paired") +
labs(y = "Percent",
x = "Retirement Destination",
title = "Retirement Destinations By Nonmetro Status and Persistent Poverty") +
theme_minimal()
If we look at the bottom right panel, we see the trifecta, where a large percentage of RDs with both types of persistent poverty are rural. By contrast, in the bottom left panel (persistent poverty without persistent child poverty), the only RD is urban.
Returning to the research question, are Rural Texas counties that are retirement destinations more likely to experience persistent poverty compared with their Urban counterparts? What about persistent child poverty?
Per this chart, the answer looks like yes. Going back to the counts, 4 of 19 Rural RDs (about 21%) have one or both of the persistent poverty variables, compared with 3 of 28 Urban RDs (about 11%), so Rural RDs are roughly twice as likely as their urban counterparts.
Let’s analyze the final question.
Let’s break this up into five distinct analyses:
What percentage of Rural counties have High Amenities?
Of those that are both (a) Rural and (b) High Amenity, do they experience considerable (>25%) population loss?
Of those that are both (a) Rural and (b) High Amenity, do they experience considerable (>25%) persistent poverty?
Of those that are both (a) Rural and (b) High Amenity, do they experience considerable (>25%) low education?
Of those that are both (a) Rural and (b) High Amenity, do they experience considerable (>25%) low employment?
# A tibble: 4 x 4
Nonmetro HiAmenity n prop
<chr> <chr> <int> <dbl>
1 Rural High Amenity 78 0.307
2 Rural Not High Amenity 94 0.370
3 Urban High Amenity 41 0.161
4 Urban Not High Amenity 41 0.161
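So about 31% of all Texas counties are both Rural and High Amenity, which works out to 45% of Rural counties (78 of 172). Since rural counties tend to have more natural resources, I was not surprised at this high count, nearly double that of urban counties (41), although the share of Urban counties that are High Amenity is actually a bit higher (50%, 41 of 82). Let’s repeat this table, selecting only Rural and High Amenity, for the other four variables.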
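The `prop` column above is a share of all 254 counties. To get the share within Rural or within Urban counties, one approach is to compute the proportion after `group_by()`. This is a minimal sketch using the counts printed above; the names `amenity_counts`, `prop_all`, and `prop_within` are illustrative.

```r
library(dplyr)

# Counts copied from the table above
amenity_counts <- tibble(
  Nonmetro  = c("Rural", "Rural", "Urban", "Urban"),
  HiAmenity = c("High Amenity", "Not High Amenity",
                "High Amenity", "Not High Amenity"),
  n         = c(78, 94, 41, 41)
)

amenity_props <- amenity_counts %>%
  mutate(prop_all = n / sum(n)) %>%      # share of all counties (the `prop` column above)
  group_by(Nonmetro) %>%
  mutate(prop_within = n / sum(n)) %>%   # share within Rural or within Urban
  ungroup()

amenity_props
```

Here `prop_within` is about 0.45 for Rural High Amenity counties and 0.50 for Urban High Amenity counties.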
# A tibble: 2 x 5
Nonmetro HiAmenity Population_Loss n prop
<chr> <chr> <chr> <int> <dbl>
1 Rural High Amenity No Population Loss 62 0.795
2 Rural High Amenity Population Loss 16 0.205
I’ve included the Nonmetro and HiAmenity columns to see that the filters are working. I will not include those in the other three analyses.
As we can see, counties with high amenities did not experience substantial amounts of population loss. Perhaps the amenities, whether natural or man-made, are a reason to keep populations in their rural areas? Or perhaps it’s a feedback loop, if man-made. Infrastructure is developed because people are staying – for various other reasons. Let’s look to see if there’s a trend with the other variables.
# A tibble: 2 x 3
Persistent_Poverty n prop
<chr> <int> <dbl>
1 No Persistent Poverty 59 0.756
2 Persistent Poverty 19 0.244
Nothing too striking. If the county is classified as having high amenities, that can translate to high amounts of natural resources such as timber, oil, and natural gas. That helps drive local extraction economies, and while they’re more susceptible to boom and bust cycles, I think that labor floor allows counties to overcome entrenched, generational poverty. Just under one-fourth (24.4%) have persistent poverty, which falls short of the 25% threshold but sits slightly above the rate for Rural counties overall (35 of 172, about 20%). The trend continues across most other variables when compared against PP.
# A tibble: 2 x 3
Low_Education n prop
<chr> <int> <dbl>
1 Low Education 38 0.487
2 Mid-to-High Education 40 0.513
I had to review and make sure these outputs were re-coded correctly – and they are! Here’s the interesting data point we were looking for. Rural counties with high amenities can still experience low rates of education. It would be prudent to cross-reference the other variables onto these 38 counties to see if there is a pattern to this inequality. We’ll attempt that in our data visualization section.
# A tibble: 2 x 3
Low_Employment n prop
<chr> <int> <dbl>
1 Low Employment 28 0.359
2 Mid-to-High Employment 50 0.641
Not surprising, though the share is a little lower than for Low Education. This probably reflects the same dynamic raised with Persistent Poverty, i.e. abundant natural resources.
So we can say that rural Texas counties with high amenities do not experience considerable population loss (20.5%) or persistent poverty (24.4%) by the 25% threshold. Low education (48.7%) and low employment (35.9%) both exceed that threshold, with low education the more striking of the two.
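The four threshold checks can be lined up in one small table. This sketch hard-codes the counts from the four frequency tables above (out of 78 Rural, High Amenity counties). The 25% cutoff comes from the research questions, while the object names are illustrative.

```r
# Rural, High Amenity counties flagged for each indicator (from the tables above)
n_rural_high_amenity <- 78
flagged <- c(
  Population_Loss    = 16,
  Persistent_Poverty = 19,
  Low_Education      = 38,
  Low_Employment     = 28
)
threshold <- 0.25  # "considerable" as defined in the research questions

share <- flagged / n_rural_high_amenity

# One row per indicator: its share and whether it exceeds the threshold
threshold_check <- data.frame(
  share = round(share, 3),
  exceeds_threshold = share > threshold
)
threshold_check
```

Laying the shares side by side makes it easier to see which indicators clear the cutoff and by how much.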
Let’s build off of Research Question 2’s visualization section and provide a facet grid, starting with all variables to see if there’s an underlying relationship.
ggplot(RuralAtlasData23,
aes(HiAmenity,
fill = Nonmetro)) +
geom_bar(position = "fill") +
facet_grid(vars(Population_Loss),
vars(Persistent_Poverty)) +
scale_fill_brewer(palette = "Paired") +
labs(y = "Percent",
x = "High Amenity Locale",
title = "High Amenity Locales By Nonmetro Status, Population Loss, and Persistent Poverty") +
theme_minimal()
However, to keep the grid readable, I facet on only two variables (for a total of four variables on this plot). facet_grid() does accept more than one variable per dimension, but the number of panels multiplies quickly. Limiting the clutter makes sense for this visualization, though it does make the analysis a little harder.
This plot is a little more difficult to read compared to the RQ2 plot, but we can center in on a few key outputs.
Among Not High Amenity counties that experience both Persistent Poverty and Population Loss (bottom right), about 80% are Rural.
The figure is even more striking for Not High Amenity counties that experience population loss but not persistent poverty (bottom left), where nearly 90% are Rural.
The split is pretty even between rural and urban counties, with or without high amenities, for those not afflicted by persistent poverty or population loss. There are other factors underlying that stability.
I’m going to provide two more plots to help round out this final research question. The first will look at the final two variables, Low Education and Low Employment. The second will explore Low Education against other variables to see if there’s an underlying pattern for the insignificant rural, high amenity relationship.
ggplot(RuralAtlasData23,
aes(HiAmenity,
fill = Nonmetro)) +
geom_bar(position = "fill") +
facet_grid(vars(Low_Education),
vars(Low_Employment)) +
scale_fill_brewer(palette = "Paired") +
labs(y = "Percent",
x = "High Amenity Locale",
title = "High Amenity Locales By Nonmetro Status, Low Education, and Low Employment") +
theme_minimal()
Some major observations:
Urban counties (bottom right) excel in Mid-to-High Employment and Education, irrespective of amenity status.
Rural counties (left side) make up the majority of Low Employment counties. Cities are hubs for commerce, so that’s understandable.
Rural counties (top left) disproportionately experience Low Employment and Low Education compared to their urban counterparts.
Rounding out the final visualization, we’re looking to key in on counties that are (a) Rural, (b) High Amenity, and (c) Low Education. Any pattern we can find between this variable and a second one under these conditions is worthwhile, at least for this exercise. After running through all the variables, we end on Retirement Destination.
ggplot(RuralAtlasData23,
aes(HiAmenity,
fill = Nonmetro)) +
geom_bar(position = "fill") +
facet_grid(vars(Low_Education),
vars(Retirement_Destination)) +
scale_fill_brewer(palette = "Paired") +
labs(y = "Percent",
x = "High Amenity Locale",
title = "High Amenity Locales By Nonmetro Status, Low Education, and Retirement Destination") +
theme_minimal()
The only point I’d like to make from this graph is that counties that meet all three criteria: High Amenity, Retirement Destination, and Low Education, are all rural. There are no urban counties that meet these three criteria. However, there’s not much more we can say.
Here we begin to see the limitations of (a) the facet-grid and data visualization and (b) categorical variables in general. That being said, this provided an easily digestible plot to help make sense of some introductory questions for rural communities.
I enjoyed this process. While I do have some experience with R, it has been piecemeal; I’ve developed short scripts and reports as work projects required. Taking a more structured approach, asking research questions prior to diving into the data, and troubleshooting code all remind me of the reasons I fell in love with my first master’s program. This has been a welcome introduction into this MS in DACSS.
However, some issues arose the further I delved into this project.
The most obvious problem was my picking a dataset that only included categorical variables. I have plenty of experience joining datasets in SQL, but I had neither the desire nor the time to delve too deeply into the R syntax (it may be exactly the same for Left Joins for all I know). If I had included more continuous data, such as income and population estimates, I think I could have pulled some more interesting visualizations and insights. However, that would have prompted statistical analyses on my part, due to my curiosity, and again, time would not have permitted that.
If I revisit this project for a portfolio, I will be sure to incorporate a second dataset from this Atlas and build off of the work done here.
Second, I have found I need to work on understanding functions better and be more careful in the debugging process. That was one of the primary limiters during the middle editions of this project. I was able to edit some of these functions and visualizations, but many of them are still rough. I need to focus on iteration a little better, as a good chunk of this code is repeatable through the different stages of analysis.
Third, I think attention to detail is paramount to success in R scripting. I have spent numerous hours on this project debugging problems, learning about vectors and which functions don’t work with them, and so forth. The more I look back upon this project, the more I find this to be the key takeaway.
What’s left for the final project? I’ll probably clean up some of the text and formatting, while also adjusting the colors for the plots to make them color-blind friendly. I also need to provide the citations. Otherwise, most of it should be good to go!
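For the color-blind friendly palette, one option is the Okabe-Ito palette, which ships with base R (4.0 or later) through `palette.colors()`. This is a sketch, not a settled choice for the final project: `toy_rd` is a hypothetical stand-in rebuilt from the Retirement Destination counts reported earlier, and `nonmetro_colors` is an illustrative name.

```r
library(ggplot2)

# Okabe-Ito is designed to stay distinguishable under common forms of
# color vision deficiency; unname() stops ggplot from treating the
# palette's own color names as factor levels
okabe_ito <- unname(palette.colors(n = 3, palette = "Okabe-Ito"))

# Skip the first color (black) and assign one color per Nonmetro level
nonmetro_colors <- c(Rural = okabe_ito[2], Urban = okabe_ito[3])

# Hypothetical stand-in rebuilt from the Retirement Destination counts
toy_rd <- data.frame(
  Retirement_Destination = rep(c("Not an RD", "RD"), times = c(207, 47)),
  Nonmetro = c(rep(c("Rural", "Urban"), times = c(153, 54)),
               rep(c("Rural", "Urban"), times = c(19, 28)))
)

p_cb <- ggplot(toy_rd, aes(Retirement_Destination, fill = Nonmetro)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = nonmetro_colors) +  # replaces scale_fill_brewer()
  labs(y = "Percent", x = "Retirement Destination") +
  theme_minimal()

p_cb
```

Swapping `scale_fill_brewer(palette = "Paired")` for the `scale_fill_manual()` line is the only change the existing plots would need.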
So in conclusion, we’ve worked through the Atlas of Rural and Small Town America to better understand Rural Texas counties. We can draw some generalities from these analyses:
Rural Texas counties are more likely to experience population loss than their urban counterparts.
Rural retirement destinations are more likely to experience persistent poverty and persistent child poverty, compared to urban retirement destinations.
Rural counties with high amenities mostly show positive economic and social indicators, save low education, which affects around 50% of all rural, high amenity locales.
What does this mean for rural communities? Focus on the resources you have and build off of these natural landscapes to help reduce population loss, poverty, and unemployment.