Final Project

Introduction

Importing and Cleaning the Data

The data used for this project comes from the Atlas of Rural and Small Town America, published by the U.S. Department of Agriculture, Economic Research Service. The dataset provides statistics by broad categories for various socioeconomic factors, including demographic data from the American Community Survey (ACS), economic data from the Bureau of Labor Statistics, categorical variables (codes) for various county classifications, data on income, and data on veterans.

For this project, we are only going to look at County Classifications. Let’s first import the specified sheet of the Excel workbook; read_excel() returns it as a tibble.

library(readxl)
library(tidyverse)
library(dplyr)
library(rmarkdown)
library(khroma)
library(RColorBrewer)
RuralAtlasData23 <- read_excel("RuralAtlasData23.xlsx", 
    sheet = "County Classifications")

Because read_excel() already returns a tibble, no conversion is needed before wrangling. Let’s view the first few rows.

paged_table(head(RuralAtlasData23))

With just a quick glance, we can see a few interesting tidbits.

There are 3,215 rows.
There are 45 columns, or variables.
The first three columns are characters. They are broken down by:

FIPStxt, or the County’s Unique ID;
State, and;
County.

The remaining 42 are all double, or float.

We will recode some of these variables to characters in a later step.

Wrangling the Data

Forty-five columns is quite a bit to work with, so we are going to select only twelve (12) for this project: three identifiers and nine classification variables. These include the Unique County ID, the State, the County, whether the county is classified as Nonmetro (the county does not have an Urbanized Area or Urban Cluster in its jurisdiction), whether the county is classified as Micropolitan (population of at least 10,000 but less than 50,000), whether the county had low education in 2015, whether the county had low employment in 2015, whether the county experienced population loss in the past decade (2005 - 2015), whether the county is designated as a retirement destination due to a high percentage of those over the age of 65 residing in the county, whether the county experienced persistent poverty or persistent child poverty in the past three decades (1970 - 2000), and whether the county had high natural amenities.

RuralAtlasData23 <- select(RuralAtlasData23, "FIPStxt", 
                           "State", 
                           "County", 
                           "Nonmetro2013", 
                           "Micropolitan2013", 
                           "Low_Education_2015_update",
                           "Low_Employment_2015_update",
                           "Population_loss_2015_update",
                           "Retirement_Destination_2015_Update",
                           "PersistentChildPoverty2004",
                           "PersistentPoverty2000", 
                           "HiAmenity")

Next, let’s rename those long columns into something more digestible.

RuralAtlasData23 <- rename(RuralAtlasData23, 
                           UniqueID = "FIPStxt", 
                           Nonmetro = "Nonmetro2013", 
                           Micropolitan = "Micropolitan2013", 
                           Low_Education = "Low_Education_2015_update",
                           Low_Employment = "Low_Employment_2015_update",
                           Population_Loss = "Population_loss_2015_update", 
                           Retirement_Destination = "Retirement_Destination_2015_Update",
                           Persistent_Child_Poverty = "PersistentChildPoverty2004",
                           Persistent_Poverty = "PersistentPoverty2000")

While the columns / variables are now easier to understand, the coded responses are not. We’ll need to recode those 0s and 1s to better reflect what they are identifying.

RuralAtlasData23 <- RuralAtlasData23 %>%
  mutate(Nonmetro = recode(Nonmetro, '0' = "Urban", '1' = "Rural"),
         Micropolitan = recode(Micropolitan, '0' = "Not Micropolitan", '1' = "Micropolitan"),
         Low_Education = recode(Low_Education,'0' = "Mid-to-High Education", '1' = "Low Education"),
         Low_Employment = recode(Low_Employment,'0' = "Mid-to-High Employment", '1' = "Low Employment"),
         Population_Loss = recode(Population_Loss, '0' = "No Population Loss", '1' = "Population Loss"),
         Retirement_Destination = recode(Retirement_Destination,'0' = "Not an RD", '1' = "RD"),
         Persistent_Child_Poverty = recode(Persistent_Child_Poverty,'0' = "No Persistent Child Poverty", '1' = "Persistent Child Poverty"),
         Persistent_Poverty = recode(Persistent_Poverty, '0' = "No Persistent Poverty", '1' = "Persistent Poverty"),
         HiAmenity = recode(HiAmenity, '0' = "Not High Amenity", '1' = "High Amenity")
         )
paged_table(head(RuralAtlasData23))
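
The nine recode() calls above all share one shape, so the mapping could also be written once and reused. The sketch below is one possible alternative, not the approach used in this project: `label_flag()` is a hypothetical helper and `toy` is made-up data. It relies on base R's `factor()`, which also fixes the order in which the labels appear in tables and plots.

```r
# Made-up stand-in for a few rows of 0/1 classification flags
toy <- data.frame(
  Low_Education  = c(0, 1, 0),
  Low_Employment = c(1, 1, 0)
)

# Hypothetical helper: turn a 0/1 flag into a labelled factor
label_flag <- function(x, no, yes) {
  factor(x, levels = c(0, 1), labels = c(no, yes))
}

toy$Low_Education  <- label_flag(toy$Low_Education,
                                 "Mid-to-High Education", "Low Education")
toy$Low_Employment <- label_flag(toy$Low_Employment,
                                 "Mid-to-High Employment", "Low Employment")

print(toy)
```

Because the result is a factor rather than a character vector, comparisons such as `toy$Low_Education == "Low Education"` still work, and the "no" label always sorts before the "yes" label.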

As the last step in this data wrangling process, let’s filter out all the states save Texas (my home state!). When exploring geographic units of analysis, it’s often better to home in on a smaller frame to find potentially richer information. While information is limited by the dataset, I think localizing this data moving forward will help us better answer some research questions.

RuralAtlasData23 <- RuralAtlasData23 %>%
  filter(State == "TX")
paged_table(head(RuralAtlasData23))

Research Questions

Now that the data is cleaned and filtered, let’s consider some exploratory research questions.

Did a substantial share of Rural Texas counties (25% or more) experience population loss?
Are Rural Texas counties that are retirement destinations more likely to experience persistent poverty compared with their Urban counterparts? What about persistent child poverty?
Do Rural Texas counties with high amounts of natural amenities consistently experience both (a) population loss and (b) persistent poverty? How do low education and low employment factor into this?

Research Question 1: Rural Population Loss

Analyzing the Data

Let’s select the relevant columns for this question, filter on only Rural counties that experienced population loss, and provide a count.

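Question1 <- RuralAtlasData23 %>%
  select("UniqueID", 
         "County",
         "Nonmetro",
         "Population_Loss"
         ) %>%
  # Match the recoded label; the 0/1 values were relabeled earlier, so "Yes" never occurs
  filter(Nonmetro == "Rural", 
         Population_Loss == "Population Loss")
paged_table(head(Question1))
nrow(Question1)  # count of Rural counties with population loss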

38 Rural Texas counties experienced population loss. There are a total of 172 Rural Texas counties (of 254 total). Doing some quick math will pull a percentage of those that experienced population loss.

(38 / 172) * 100

[1] 22.09302

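About 22% of all Texas Rural counties experienced population loss. This does not meet the threshold set by the research question (25%), so by this classification roughly 78% of Texas Rural counties did not experience population loss. Note that the absence of a population-loss flag does not by itself show that a county is growing.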

We could further explore a question of similar concern by comparing population loss across the Rural / Urban Continuum, and see what percentage of Texas Urban counties experienced population loss. Let’s examine that real quick.

Question1vU <- RuralAtlasData23 %>%
  select("UniqueID", 
         "County",
         "Nonmetro",
         "Population_Loss"
         ) %>%
  # Match the recoded label rather than "Yes"
  filter(Nonmetro == "Urban", 
         Population_Loss == "Population Loss")
paged_table(head(Question1vU))

(4 / 82) * 100

[1] 4.878049

There’s much less population loss for Texas Urban counties. Only 4.9% experienced population loss in the past decade (2005 - 2015). From this we can conclude that, while rural counties have not met the threshold of substantial population loss, they are about 4.5x more likely to experience population loss than their urban counterparts.
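
Rather than typing the counts into the console by hand, the two rates and their ratio can be computed from the data. The sketch below is a minimal illustration, not the code used for this report: it rebuilds a stand-in data frame from the counts quoted above (172 Rural counties with 38 losses, 82 Urban counties with 4 losses), and the names `counties`, `loss_rates`, and `rural_vs_urban` are assumptions made for the example.

```r
library(dplyr)

# Stand-in data rebuilt from the counts quoted in the text
counties <- data.frame(
  Nonmetro = rep(c("Rural", "Urban"), times = c(172, 82)),
  Population_Loss = c(
    rep(c("Population Loss", "No Population Loss"), times = c(38, 134)),
    rep(c("Population Loss", "No Population Loss"), times = c(4, 78))
  )
)

# Count and percentage of counties with population loss, by Nonmetro status
loss_rates <- counties %>%
  group_by(Nonmetro) %>%
  summarise(
    n_counties = n(),
    n_loss = sum(Population_Loss == "Population Loss"),
    pct_loss = 100 * mean(Population_Loss == "Population Loss")
  )
print(loss_rates)

# Ratio of the Rural rate to the Urban rate
rural_vs_urban <- loss_rates$pct_loss[loss_rates$Nonmetro == "Rural"] /
  loss_rates$pct_loss[loss_rates$Nonmetro == "Urban"]
print(round(rural_vs_urban, 2))
```

With the real data, the same pipeline could start from RuralAtlasData23 in place of `counties`, which avoids copying counts by hand.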

Visualizing the Data

We visualize this percentage using a stacked bar chart, showing those counties coded “Population Loss.” As we can see, the vast majority of counties that experienced population loss from 2005 - 2015 were classified as “Rural.”

Question1vV <- RuralAtlasData23 %>%
  select("UniqueID", 
         "County",
         "Nonmetro",
         "Population_Loss"
         ) %>%
  # Match the recoded label rather than "Yes"
  filter(Population_Loss == "Population Loss")

# Map the columns themselves; quoted names inside aes() would be treated as constants
ggplot(Question1vV, 
       aes(Population_Loss,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  labs(y = "Percent",
       x = "Experienced Population Loss",
       title = "Percentage Population Loss, 2005 - 2015") +
  theme_minimal()

Very minimal design, and it shows that very few urban counties experienced population loss over this decade.
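
One refinement worth considering: with position = "fill" the y-axis runs from 0 to 1, even though the axis is labelled "Percent". The sketch below shows one way to print the ticks as percentages using `scales::percent` (the scales package is installed alongside ggplot2). The data frame `loss_counties` is a stand-in built from the 38 Rural and 4 Urban counts quoted earlier, not the real dataset.

```r
library(ggplot2)

# Stand-in data: the 38 Rural and 4 Urban counties with population loss
loss_counties <- data.frame(
  Population_Loss = "Population Loss",
  Nonmetro = rep(c("Rural", "Urban"), times = c(38, 4))
)

p <- ggplot(loss_counties, aes(Population_Loss, fill = Nonmetro)) +
  geom_bar(position = "fill") +
  # position = "fill" yields proportions from 0 to 1; format the ticks as percentages
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Set1") +
  labs(y = "Percent", x = "Experienced Population Loss") +
  theme_minimal()

print(p)
```

The same `scale_y_continuous()` line could be added to any of the filled bar charts in this report.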

Research Question 2: Retirement Destinations and Persistent Poverty

Analyzing the Data

Now’s the time to explore some frequency tables. We don’t have any numeric variables, so we will solely be using frequency tables to determine the percentage of counties in Texas that are coded as X variable.

We’ll start with determining the percentage of counties that are classified as Retirement Destinations, stratified by Nonmetro status.

RuralAtlasData23 %>%
  count(Nonmetro,
        Retirement_Destination) %>%
  mutate(prop=n/sum(n))

# A tibble: 4 x 4
  Nonmetro Retirement_Destination     n   prop
  <chr>    <chr>                  <int>  <dbl>
1 Rural    Not an RD                153 0.602 
2 Rural    RD                        19 0.0748
3 Urban    Not an RD                 54 0.213 
4 Urban    RD                        28 0.110

Fewer than 50 counties are classified as Retirement Destinations (RDs). From a brief glance, it appears that there are more Urban RDs than Rural ones. Interesting. We’ll come back to this with a cross tab, but first, let’s pull other variables into a new object for this research question.

RQ2 <- RuralAtlasData23 %>%
  count(Nonmetro,
        Retirement_Destination,
        Persistent_Poverty,
        Persistent_Child_Poverty) %>%
  mutate(prop=n/sum(n)) 
print(RQ2)

# A tibble: 14 x 6
   Nonmetro Retirement_Desti~ Persistent_Pove~ Persistent_Child~     n
   <chr>    <chr>             <chr>            <chr>             <int>
 1 Rural    Not an RD         No Persistent P~ No Persistent Ch~    74
 2 Rural    Not an RD         No Persistent P~ Persistent Child~    46
 3 Rural    Not an RD         Persistent Pove~ No Persistent Ch~     1
 4 Rural    Not an RD         Persistent Pove~ Persistent Child~    32
 5 Rural    RD                No Persistent P~ No Persistent Ch~    15
 6 Rural    RD                No Persistent P~ Persistent Child~     2
 7 Rural    RD                Persistent Pove~ Persistent Child~     2
 8 Urban    Not an RD         No Persistent P~ No Persistent Ch~    38
 9 Urban    Not an RD         No Persistent P~ Persistent Child~     7
10 Urban    Not an RD         Persistent Pove~ Persistent Child~     9
11 Urban    RD                No Persistent P~ No Persistent Ch~    25
12 Urban    RD                No Persistent P~ Persistent Child~     1
13 Urban    RD                Persistent Pove~ No Persistent Ch~     1
14 Urban    RD                Persistent Pove~ Persistent Child~     1
# ... with 1 more variable: prop <dbl>

That’s a little un-intuitive, but hopefully visualizing the data will help us understand the table better.

We can see some quick points of interest, though. First, there are four Rural RD counties that have some form of persistent poverty. That’s about 21% of all 19 Rural RDs. For Urban RDs, only three have some form of persistent poverty. That’s about 11% of all 28 Urban RDs, roughly half the Rural rate.
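
These shares can be checked from the counts in the table rather than by eye. The following is a minimal sketch under stated assumptions: `rd_counts` re-enters the seven RD rows of the table above by hand, and `any_poverty` is a hypothetical helper column flagging rows with either poverty classification.

```r
library(dplyr)

# Counts for Retirement Destination (RD) counties, copied from the table above
rd_counts <- data.frame(
  Nonmetro = c("Rural", "Rural", "Rural", "Urban", "Urban", "Urban", "Urban"),
  Persistent_Poverty = c("No Persistent Poverty", "No Persistent Poverty",
                         "Persistent Poverty", "No Persistent Poverty",
                         "No Persistent Poverty", "Persistent Poverty",
                         "Persistent Poverty"),
  Persistent_Child_Poverty = c("No Persistent Child Poverty", "Persistent Child Poverty",
                               "Persistent Child Poverty", "No Persistent Child Poverty",
                               "Persistent Child Poverty", "No Persistent Child Poverty",
                               "Persistent Child Poverty"),
  n = c(15, 2, 2, 25, 1, 1, 1)
)

rd_poverty <- rd_counts %>%
  # Flag rows with either form of persistent poverty
  mutate(any_poverty = Persistent_Poverty == "Persistent Poverty" |
           Persistent_Child_Poverty == "Persistent Child Poverty") %>%
  group_by(Nonmetro) %>%
  summarise(
    rd_total = sum(n),
    rd_any_poverty = sum(n[any_poverty]),
    pct_any_poverty = 100 * rd_any_poverty / rd_total
  )
print(rd_poverty)
```

With so few counties in each cell, a difference of one or two counties moves these percentages noticeably, so they are best read as rough indications.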

We’ll complete this analysis with a crosstab and a proportional crosstab to help begin answering Research Question #2: Are Rural Texas counties more likely to experience persistent poverty compared with their Urban counterparts? What about persistent child poverty?

xtabs(~ Nonmetro + Persistent_Poverty, RuralAtlasData23)

        Persistent_Poverty
Nonmetro No Persistent Poverty Persistent Poverty
   Rural                   137                 35
   Urban                    71                 11

And then the proportional crosstabs.

prop.table(xtabs(~ Nonmetro + Persistent_Poverty, RuralAtlasData23))*100

        Persistent_Poverty
Nonmetro No Persistent Poverty Persistent Poverty
   Rural             53.937008          13.779528
   Urban             27.952756           4.330709

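Note that these percentages are shares of all 254 Texas counties, not rates within each group. Rural counties account for about three times as many Persistent Poverty counties as Urban ones (35 versus 11), but there are also about twice as many Rural counties. Within each group, 20.3% of Rural counties (35 of 172) are classified as experiencing Persistent Poverty compared with 13.4% of Urban counties (11 of 82), so a Rural county is about 1.5x more likely to be in Persistent Poverty. Combined with the data from Research Question #1, a Texas Rural county is more likely to experience both population loss and persistent poverty than an Urban county, at rates between roughly 1.5 and 4.5 times as high.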
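
To compare Rural and Urban counties as groups, the crosstab can be converted to row-wise proportions with the `margin` argument of `prop.table()`. The sketch below re-enters the four counts from the crosstab above as a matrix (`poverty_tab` is an illustrative name); with the real data, the `xtabs()` result could be passed in directly.

```r
# Counts copied from the crosstab above
poverty_tab <- matrix(
  c(137, 35,
     71, 11),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Nonmetro = c("Rural", "Urban"),
    Persistent_Poverty = c("No Persistent Poverty", "Persistent Poverty")
  )
)

# margin = 1 makes each row sum to 100, giving the rate within Rural and within Urban
row_pct <- prop.table(poverty_tab, margin = 1) * 100
print(round(row_pct, 1))
```

Leaving `margin` unset, as in the chunk above, divides every cell by the grand total instead, which answers a different question: what share of all counties falls in each cell.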

We’ll repeat this process for Persistent Child Poverty.

xtabs(~ Nonmetro + Persistent_Child_Poverty, RuralAtlasData23)

        Persistent_Child_Poverty
Nonmetro No Persistent Child Poverty Persistent Child Poverty
   Rural                          90                       82
   Urban                          64                       18

And then the proportional crosstabs.

prop.table(xtabs(~ Nonmetro + Persistent_Child_Poverty, RuralAtlasData23))*100

        Persistent_Child_Poverty
Nonmetro No Persistent Child Poverty Persistent Child Poverty
   Rural                   35.433071                32.283465
   Urban                   25.196850                 7.086614

Looks very similar to Persistent Poverty, save one striking difference: in Rural counties, Persistent Child Poverty is more than twice as common as Persistent Poverty (82 counties versus 35). Within each group, 47.7% of Rural counties (82 of 172) have Persistent Child Poverty compared with 22.0% of Urban counties (18 of 82). So it looks like children are more affected by deeply entrenched poverty in Rural counties than adults are.

However, let’s not jump to any conclusions just yet and integrate poverty with retirement destinations to see if there’s any overlap. That will be one step of this RQ’s visualization process.

Visualizing the Data

Due to the categorical nature of this current dataset, we are going to use bar charts for our univariate and bivariate graphs. We’ll focus on Retirement Destinations for both initial plots.

ggplot(RuralAtlasData23, 
       aes(Retirement_Destination)) +
  geom_bar()

Nothing really amazing here. Most counties are not retirement destinations, by a bit more than 4:1 (207 versus 47). Let’s add some color to this graph.

ggplot(RuralAtlasData23, 
       aes(Retirement_Destination,
           fill = Nonmetro)) +
  geom_bar(position = "stack")

Now we’re getting somewhere! It looks like there are more Urban Retirement Destinations, both in count (28 versus 19) and as a share of their group. This suggests that older individuals are moving to urban counties more often than to the countryside.

But this simple bar chart is pretty boring, still. We can look at proportions by editing the position from stack to fill and updating the colors / labels.

ggplot(RuralAtlasData23, 
       aes(Retirement_Destination,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Retirement Destination",
       title = "More Than Half of All Texas Retirement Destinations are in Urban Counties") +
  theme_minimal()

That’s much better. And while a bar chart is still not that exciting of a data visualization, it tells us a little bit about the Retirement Destination column. Let’s add the two poverty variables to a facet grid and see how these four variables compare.

ggplot(RuralAtlasData23, 
       aes(Retirement_Destination,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  facet_grid(vars(Persistent_Poverty), 
             vars(Persistent_Child_Poverty)) +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Retirement Destination",
       title = "Retirement Destinations By Nonmetro Status and Persistent Poverty") +
  theme_minimal()

If we look at the bottom right grid, we see the trifecta: of the three RDs with both types of persistent poverty, two are rural. The bottom left grid (Persistent Poverty without Persistent Child Poverty) holds only two counties, and its single RD is urban.

Returning to the research question, are Rural Texas counties that are retirement destinations more likely to experience persistent poverty compared with their Urban counterparts? What about persistent child poverty?

Per this chart and the counts above, it looks like a yes, with the caveat that the counts are small. Rural RDs are about twice as likely to have one or both of the persistent poverty variables as their urban counterparts (4 of 19 versus 3 of 28).

Research Question 3: High Amenity Locales

Analyzing the Data

Let’s analyze the final question.

Do Rural Texas counties with high amounts of natural amenities consistently experience both (a) population loss and (b) persistent poverty? How do low education and low employment factor into this?

Let’s break this up into five distinct analyses:

What percentage of Rural counties have High Amenities?
Of those that are both (a) Rural and (b) High Amenity, do they experience considerable (>25%) population loss?
Of those that are both (a) Rural and (b) High Amenity, do they experience considerable (>25%) persistent poverty?
Of those that are both (a) Rural and (b) High Amenity, do they experience considerable (>25%) low education?
Of those that are both (a) Rural and (b) High Amenity, do they experience considerable (>25%) low employment?

RuralAtlasData23 %>%
  count(Nonmetro,
        HiAmenity) %>%
  mutate(prop=n/sum(n))

# A tibble: 4 x 4
  Nonmetro HiAmenity            n  prop
  <chr>    <chr>            <int> <dbl>
1 Rural    High Amenity        78 0.307
2 Rural    Not High Amenity    94 0.370
3 Urban    High Amenity        41 0.161
4 Urban    Not High Amenity    41 0.161

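So Rural, High Amenity counties make up about 31% of all Texas counties, nearly double the 16% that are Urban and High Amenity. Within each group the picture is more even: 45% of Rural counties (78 of 172) are High Amenity, compared with 50% of Urban counties (41 of 82). Given that rural counties tend to have more natural resources, I was not surprised at the high Rural count. Let’s repeat this table, selecting only Rural and High Amenity, for the other four variables.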
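
As an aside on the table above: the prop column divides by all 254 counties because `sum(n)` runs over every row. Grouping by Nonmetro before the `mutate()` gives the share within Rural and within Urban instead. This is a sketch using the four counts from the table; `amenity_counts` and `prop_within` are illustrative names.

```r
library(dplyr)

# Counts copied from the table above
amenity_counts <- data.frame(
  Nonmetro  = c("Rural", "Rural", "Urban", "Urban"),
  HiAmenity = c("High Amenity", "Not High Amenity", "High Amenity", "Not High Amenity"),
  n         = c(78, 94, 41, 41)
)

within_group <- amenity_counts %>%
  group_by(Nonmetro) %>%
  # sum(n) is now the total for each Nonmetro group, not for all counties
  mutate(prop_within = n / sum(n)) %>%
  ungroup()
print(within_group)
```

With the real data, the same `group_by(Nonmetro)` step could be placed between `count()` and `mutate()` in the chunk above.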

RuralAtlasData23 %>%
  filter(Nonmetro == "Rural",
         HiAmenity == "High Amenity") %>%
  count(Nonmetro,
        HiAmenity,
        Population_Loss) %>%
  mutate(prop=n/sum(n))

# A tibble: 2 x 5
  Nonmetro HiAmenity    Population_Loss        n  prop
  <chr>    <chr>        <chr>              <int> <dbl>
1 Rural    High Amenity No Population Loss    62 0.795
2 Rural    High Amenity Population Loss       16 0.205

I’ve included the Nonmetro and HiAmenity columns to see that the filters are working. I will not include those in the other three analyses.

As we can see, counties with high amenities did not experience substantial amounts of population loss. Perhaps the amenities, whether natural or man-made, are a reason to keep populations in their rural areas? Or perhaps it’s a feedback loop, if man-made. Infrastructure is developed because people are staying – for various other reasons. Let’s look to see if there’s a trend with the other variables.

RuralAtlasData23 %>%
  filter(Nonmetro == "Rural",
         HiAmenity == "High Amenity") %>%
  count(Persistent_Poverty) %>%
  mutate(prop=n/sum(n))

# A tibble: 2 x 3
  Persistent_Poverty        n  prop
  <chr>                 <int> <dbl>
1 No Persistent Poverty    59 0.756
2 Persistent Poverty       19 0.244

Nothing too striking. If the county is classified as having high amenities, that can translate to high amounts of natural resources, such as timber, oil, and natural gas. That helps drive local extraction economies, and while they’re more susceptible to boom and bust cycles, I think that labor floor allows counties to overcome entrenched, generational poverty. Just under one-fourth (24.4%) have persistent poverty, which falls narrowly below the 25% threshold but is slightly above the rate for all Rural Texas counties (35 of 172, or 20.3%). The trend continues across most other variables when compared against PP.

RuralAtlasData23 %>%
  filter(Nonmetro == "Rural",
         HiAmenity == "High Amenity") %>%
  count(Low_Education) %>%
  mutate(prop=n/sum(n))

# A tibble: 2 x 3
  Low_Education             n  prop
  <chr>                 <int> <dbl>
1 Low Education            38 0.487
2 Mid-to-High Education    40 0.513

I had to review and make sure these outputs were re-coded correctly – and they are! Here’s the interesting data point we were looking for. Rural counties with high amenities can still experience low rates of education. It would be prudent to cross-reference the other variables onto these 38 counties to see if there is a pattern to this inequality. We’ll attempt that in our data visualization section.

RuralAtlasData23 %>%
  filter(Nonmetro == "Rural",
         HiAmenity == "High Amenity") %>%
  count(Low_Employment) %>%
  mutate(prop=n/sum(n))

# A tibble: 2 x 3
  Low_Employment             n  prop
  <chr>                  <int> <dbl>
1 Low Employment            28 0.359
2 Mid-to-High Employment    50 0.641

Not surprising, though a little lower compared to Low Education. This is probably similar compared to the concerns raised with Persistent Poverty, i.e. high natural resources.

So we can say that rural Texas counties with high amenities do not experience considerable population loss (21%) or persistent poverty (24%), since both fall below the 25% threshold. Low education (49%) and low employment (36%) both exceed it, with low education the more pronounced of the two.
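
The four filtered tables above repeat the same steps, so the threshold check can also be done in one pass. The sketch below is an illustration under stated assumptions: only the per-indicator totals (16, 19, 38, and 28 of 78 counties) come from the tables above, while the way the flags line up row by row is invented, and `flag()` is a hypothetical helper.

```r
# Hypothetical helper: k TRUE values followed by (total - k) FALSE values
flag <- function(k, total = 78) rep(c(TRUE, FALSE), times = c(k, total - k))

# Stand-in for the 78 rural, High Amenity counties.
# Column totals come from the tables above; the row-level pairing is made up.
rural_high_amenity <- data.frame(
  Population_Loss    = flag(16),
  Persistent_Poverty = flag(19),
  Low_Education      = flag(38),
  Low_Employment     = flag(28)
)

# Share of counties flagged on each indicator, compared with the 25% threshold
shares <- sapply(rural_high_amenity, mean)
summary_tab <- data.frame(
  indicator = names(shares),
  share = round(unname(shares), 3),
  above_25_pct = unname(shares) > 0.25
)
print(summary_tab)
```

Looping over columns this way keeps the threshold in one place, so changing it from 25% to another value means editing a single number.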

Visualizing the Data

Let’s build off of Research Question 2’s visualization section and provide a facet grid, starting with all variables to see if there’s an underlying relationship.

ggplot(RuralAtlasData23, 
       aes(HiAmenity,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  facet_grid(vars(Population_Loss),
             vars(Persistent_Poverty)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(y = "Percent",
       x = "High Amenity Locale",
       title = "High Amenity Locales By Nonmetro Status, Population Loss, and Persistent Poverty") +
  theme_minimal()

However, facet_grid() arranges panels by one set of row variables and one set of column variables, so we plot two faceting variables here (for a total of four variables on this plot). More can be combined inside vars(), but that quickly adds clutter, so the limit makes sense even though it makes this analysis a little harder.

This plot is a little more difficult to read compared to the RQ2 plot, but we can center in on a few key outputs.

Among Not High Amenity counties that experience both Persistent Poverty and Population Loss (bottom right), about 80% are Rural.
The figure is even more striking for Not High Amenity counties that experience population loss but not persistent poverty (bottom left), where nearly 90% are Rural.
The split is pretty even between rural and urban counties, with or without high amenities, for those not afflicted by persistent poverty or population loss. There are other factors underlying that stability.

I’m going to provide two more plots to help round out this final research question. The first will look at the final two variables, Low Education and Low Employment. The second will explore Low Education against other variables to see if there’s an underlying pattern for the insignificant rural, high amenity relationship.

ggplot(RuralAtlasData23, 
       aes(HiAmenity,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  facet_grid(vars(Low_Education),
             vars(Low_Employment)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(y = "Percent",
       x = "High Amenity Locale",
       title = "High Amenity Locales By Nonmetro Status, Low Education, and Low Employment") +
  theme_minimal()

Some major observations:

Urban counties (bottom right) excel in Mid-to-High Employment and Education, irrespective of amenity status.
Rural counties (left side) make up the majority of Low Employment counties. Cities are hubs for commerce, so that’s understandable.
Rural counties (top left) disproportionately experience Low Employment and Low Education compared to their urban counterparts.

Rounding out the final visualization, we’re looking to key in on counties that are (a) Rural, (b) High Amenity, and (c) Low Education. Any pattern we can find between this variable and a second one under these conditions is worthwhile, at least for this exercise. After running through all the variables, we end on Retirement Destination.

ggplot(RuralAtlasData23, 
       aes(HiAmenity,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  facet_grid(vars(Low_Education),
             vars(Retirement_Destination)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(y = "Percent",
       x = "High Amenity Locale",
       title = "High Amenity Locales By Nonmetro Status, Low Education, and Retirement Destination") +
  theme_minimal()

The only point I’d like to make from this graph is that counties that meet all three criteria (High Amenity, Retirement Destination, and Low Education) are all rural. There are no urban counties that meet these three criteria. However, there’s not much more we can say.

Here we begin to see the limitations of (a) the facet-grid and data visualization and (b) categorical variables in general. That being said, this provided an easily digestible plot to help make sense of some introductory questions for rural communities.
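
If more than two faceting variables are ever needed, `facet_grid()` does accept several variables per dimension through `vars()`. The sketch below shows the mechanics only: `toy` holds randomly generated flags with no real meaning, and the resulting grid illustrates how quickly the panels multiply.

```r
library(ggplot2)

# Made-up county flags; the values are random and carry no real information
set.seed(1)
n <- 200
toy <- data.frame(
  HiAmenity = sample(c("High Amenity", "Not High Amenity"), n, replace = TRUE),
  Nonmetro = sample(c("Rural", "Urban"), n, replace = TRUE),
  Low_Education = sample(c("Low Education", "Mid-to-High Education"), n, replace = TRUE),
  Low_Employment = sample(c("Low Employment", "Mid-to-High Employment"), n, replace = TRUE),
  Retirement_Destination = sample(c("RD", "Not an RD"), n, replace = TRUE)
)

p <- ggplot(toy, aes(HiAmenity, fill = Nonmetro)) +
  geom_bar(position = "fill") +
  # Two variables define the rows, one defines the columns
  facet_grid(rows = vars(Low_Education, Low_Employment),
             cols = vars(Retirement_Destination)) +
  scale_fill_brewer(palette = "Dark2") +
  theme_minimal()

print(p)
```

Each added variable doubles the number of panels here, and with only 254 Texas counties many panels would hold just a handful of counties, which is a practical reason to stay with two faceting variables.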

Reflection

I enjoyed this process. While I do have some experience with R, it has been piecemeal; I’ve developed short scripts and reports as work projects required. Taking a more structured approach, asking research questions prior to diving into the data, and troubleshooting code all remind me of the reasons I fell in love with my first master’s program. This has been a welcome introduction into this MS in DACSS.

However, some issues arose the further I delved into this project.

The most obvious problem was me picking a dataset that only included categorical variables. I have plenty of experience joining datasets in SQL, but I did not have the desire nor the time to delve too deeply into the R syntax (it may be exactly the same for Left Joins for all I know). If I had included more continuous data, such as income and population estimates, I think I could have pulled some more interesting visualizations and insights. However, that would have prompted statistical analyses on my part, due to my curiosity, and again, time would not have permitted that.

If I revisit this project for a portfolio, I will be sure to incorporate a second dataset from this Atlas and build off of the work done here.
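
For what it is worth, the dplyr equivalent of a SQL LEFT JOIN is `left_join()`. The sketch below is only an illustration of the syntax: both data frames, the IDs, and the `median_income` values are invented, and the real Atlas sheets may use different column names.

```r
library(dplyr)

# Hypothetical stand-ins: classification flags plus a second sheet keyed on the same county ID
classifications <- data.frame(
  UniqueID = c("ID001", "ID002", "ID003"),
  Nonmetro = c("Rural", "Rural", "Urban")
)
income <- data.frame(
  UniqueID = c("ID001", "ID003"),
  median_income = c(50000, 60000)  # invented values
)

# Like SQL's LEFT JOIN: keep every row of `classifications`,
# and fill columns from `income` with NA where no match exists
joined <- left_join(classifications, income, by = "UniqueID")
print(joined)
```

The county with no match in `income` keeps its row and gets NA for `median_income`, mirroring the NULL a SQL left join would return.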

Second, I have found I need to work on understanding functions better and be more careful in the debugging process. That was one of the primary limiters during the middle editions of this project. I was able to edit some of these functions and visualizations, but many of them are still rough. I need to focus on iteration a little better, as a good chunk of this code is repeatable through the different stages of analysis.

Third, I think attention to detail is paramount to success in R scripting. I have spent numerous hours on this project debugging problems, which helped me understand vectors and which functions don’t work with them, and so on. The more I look back upon this project, the more I find this as the key takeaway.

Concluding Thoughts

So in conclusion, we’ve worked through the Atlas of Rural and Small Town America to better understand Rural Texas counties. We can draw some generalities from these analyses:

Rural Texas counties are more likely to experience population loss than their urban counterparts.
Rural retirement destinations are more likely to experience persistent poverty and persistent child poverty, compared to urban retirement destinations.
Rural counties with high amenities fall below the 25% threshold for population loss and persistent poverty, but not for low education, which affects around 50% of all rural, high amenity locales, or for low employment, which affects about 36%.

What does this mean for rural communities? Focus on the resources you have and build off of these natural landscapes to help reduce population loss, poverty, and unemployment.

Citations

Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly Media.

USDA Economic Research Service. (2022). Atlas of Rural and Small-Town America. Economic Research Service, Department of Agriculture. Retrieved from: https://data.nal.usda.gov/dataset/atlas-rural-and-small-town-america. Accessed on 07-01-2022.