Homework 5

Importing and Cleaning the Data

The data used for this project comes from the Atlas of Rural and Small Town America, published by the U.S. Department of Agriculture, Economic Research Service. The dataset provides county-level statistics in broad categories of socioeconomic factors, including demographic data from the American Community Survey (ACS), economic data from the Bureau of Labor Statistics, categorical variables (codes) for various county classifications, data on income, and data on veterans.

For this project, we are only going to look at County Classifications. Let’s first import that sheet of the Excel workbook as a tibble.


library(readxl)
library(tidyverse)
library(dplyr)
library(rmarkdown)
library(khroma)
RuralAtlasData23 <- read_excel("RuralAtlasData23.xlsx", 
    sheet = "County Classifications")

Since read_excel() already returns a tibble, no further conversion is needed for data wrangling, so I go straight to viewing the first few rows.


paged_table(head(RuralAtlasData23))

With just a quick glance, we can see a few interesting tidbits.

There are 3,215 rows.
There are 45 columns, or variables.
The first three columns are characters. They are broken down by:

FIPStxt, or the County’s Unique ID;
State, and;
County.

The remaining 42 are all double, or float.

We will recode some of these variables to characters in a later step.

Wrangling the Data

45 columns is of course quite a bit to work with. We are going to select only twelve (12) columns for this project: three identifiers and nine (9) classification variables. These are:

the Unique County ID;
the State;
the County;
if the county is classified as Nonmetro (the county does not have an Urbanized Area or Urbanized Cluster in its jurisdiction);
if the county is classified as a Micropolitan (population of at least 10,000 but less than 50,000);
if the county has low education in 2015;
if the county has low employment in 2015;
if the county experienced population loss in the past decade (2005 - 2015);
if the county is designated as a retirement destination due to a high percentage of those over the age of 65 residing in the county;
if the county is in persistent poverty in the past three decades (1970 - 2000);
if the county is in persistent child poverty over the same period, and;
if the county had high natural amenities.


RuralAtlasData23 <- select(RuralAtlasData23, "FIPStxt", 
                           "State", 
                           "County", 
                           "Nonmetro2013", 
                           "Micropolitan2013", 
                           "Low_Education_2015_update",
                           "Low_Employment_2015_update",
                           "Population_loss_2015_update",
                           "Retirement_Destination_2015_Update",
                           "PersistentChildPoverty2004",
                           "PersistentPoverty2000", 
                           "HiAmenity")

Next, let’s rename those long columns into something more digestible.


RuralAtlasData23 <- rename(RuralAtlasData23, 
                           UniqueID = "FIPStxt", 
                           Nonmetro = "Nonmetro2013", 
                           Micropolitan = "Micropolitan2013", 
                           Low_Education = "Low_Education_2015_update",
                           Low_Employment = "Low_Employment_2015_update",
                           Population_Loss = "Population_loss_2015_update", 
                           Retirement_Destination = "Retirement_Destination_2015_Update",
                           Persistent_Child_Poverty = "PersistentChildPoverty2004",
                           Persistent_Poverty = "PersistentPoverty2000")
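
As an aside, the select and rename steps above could be folded into one call, since dplyr's select() accepts new_name = old_name pairs. The sketch below is a minimal illustration on a made-up two-row data frame (atlas_raw, atlas_small, and Unused_Column are hypothetical names, and the values are invented), not the actual Atlas sheet.

```r
library(dplyr)

# Toy stand-in for the imported sheet; the values are made up for illustration.
atlas_raw <- data.frame(
  FIPStxt = c("48001", "48003"),
  State = c("TX", "TX"),
  County = c("Anderson", "Andrews"),
  Nonmetro2013 = c(1, 1),
  Population_loss_2015_update = c(0, 0),
  Unused_Column = c(9, 9)
)

# select() keeps only the listed columns and renames any given as new = old.
atlas_small <- atlas_raw %>%
  select(
    UniqueID = FIPStxt,
    State,
    County,
    Nonmetro = Nonmetro2013,
    Population_Loss = Population_loss_2015_update
  )

names(atlas_small)
```

Keeping the two steps separate, as this report does, is equally valid; the combined form just avoids listing each column name twice.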

While the columns / variables are now easier to understand, the coded responses are not. We’ll need to recode those 0s and 1s to better reflect what they are identifying.


RuralAtlasData23 <- RuralAtlasData23 %>%
  mutate(Nonmetro = recode(Nonmetro, '0' = "Urban", '1' = "Rural"),
         Micropolitan = recode(Micropolitan, '0' = "No", '1' = "Yes"),
         Low_Education = recode(Low_Education,'0' = "No", '1' = "Yes"),
         Low_Employment = recode(Low_Employment,'0' = "No", '1' = "Yes"),
         Population_Loss = recode(Population_Loss, '0' = "No", '1' = "Yes"),
         Retirement_Destination = recode(Retirement_Destination,'0' = "No", '1' = "Yes"),
         Persistent_Child_Poverty = recode(Persistent_Child_Poverty,'0' = "No", '1' = "Yes"),
         Persistent_Poverty = recode(Persistent_Poverty, '0' = "No", '1' = "Yes"),
         HiAmenity = recode(HiAmenity, '0' = "No", '1' = "Yes")
         )
paged_table(head(RuralAtlasData23))
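
Because eight of the nine columns share the same No / Yes coding, the repeated recode() calls could also be written once with across(). This is only a sketch on a small invented data frame (toy and toy_labeled are hypothetical names), assuming the columns are stored as 0 / 1 doubles as described earlier.

```r
library(dplyr)

# Invented 0/1 indicator columns standing in for the Atlas classifications.
toy <- data.frame(
  Nonmetro = c(0, 1, 1),
  Micropolitan = c(0, 0, 1),
  Low_Education = c(1, 0, 0)
)

toy_labeled <- toy %>%
  mutate(
    # Nonmetro gets its own labels.
    Nonmetro = recode(Nonmetro, `0` = "Urban", `1` = "Rural"),
    # Every other indicator shares the same No / Yes mapping.
    across(c(Micropolitan, Low_Education),
           ~ recode(.x, `0` = "No", `1` = "Yes"))
  )

toy_labeled
```

With the real data, the column list inside across() would name the eight No / Yes variables.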

As the last step in this data wrangling process, let’s filter out all the states save Texas (my home state!). When exploring geographic units of analysis, it’s often better to home in on a smaller area to find potentially richer information. While information is limited by the dataset, I think localizing this data moving forward will help us better answer some research questions.


RuralAtlasData23 <- RuralAtlasData23 %>%
  filter(State == "TX")
paged_table(head(RuralAtlasData23))

Research Questions

Now that the data is cleaned and filtered, let’s consider some exploratory research questions.

Did a substantial share of Rural Texas counties (25% or more) experience population loss?
Are Rural Texas counties that are retirement destinations more likely to experience persistent poverty compared with their Urban counterparts? What about persistent child poverty?
Do Rural Texas counties with high natural amenities consistently experience both (a) population loss and (b) persistent poverty? How do low education and low employment factor into this?

Research Question 1: Rural Population Loss

Analyzing the Data

Let’s select the relevant columns for this question, filter on only Rural counties that experienced population loss, and provide a count.


Question1 <- RuralAtlasData23 %>%
  select("UniqueID", 
         "County",
         "Nonmetro",
         "Population_Loss"
         ) %>%
  filter(Nonmetro == "Rural", 
         Population_Loss == "Yes")
paged_table(head(Question1))

38 Rural Texas counties experienced population loss. There are a total of 172 Rural Texas counties (of 254 total). Doing some quick math will pull a percentage of those that experienced population loss.


(38 / 172) * 100

[1] 22.09302

About 22% of all Texas Rural counties experienced population loss. This did not meet the threshold set by the research question (25%). It also means that roughly 78% of Texas Rural counties did not experience population loss as the Atlas classifies it, although that alone does not tell us those counties are growing.

We could further explore a question of similar concern by comparing population loss across the Rural / Urban Continuum, and see what percentage of Texas Urban counties experienced population loss. Let’s examine that real quick.


Question1vU <- RuralAtlasData23 %>%
  select("UniqueID", 
         "County",
         "Nonmetro",
         "Population_Loss"
         ) %>%
  filter(Nonmetro == "Urban", 
         Population_Loss == "Yes")
paged_table(head(Question1vU))


(4 / 82) * 100

[1] 4.878049

There’s much less population loss for Texas Urban counties. Only 4.9% have experienced some form of population loss in the past decade (2005 - 2015). From this we can conclude that, while rural counties have not met the threshold of substantial population loss, they are roughly 4.5 times as likely to experience population loss as their urban counterparts.
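
The two percentages above were typed in by hand from the row counts. A grouped summary can compute both rates and their ratio directly, which avoids transcription slips. The sketch below rebuilds a toy data frame from the counts reported in this section (38 of 172 rural and 4 of 82 urban counties); toy, loss_rates, and rural_vs_urban are hypothetical names, and with the real data the same pipeline would start from RuralAtlasData23.

```r
library(dplyr)

# Toy data rebuilt from the counts reported in the text.
toy <- data.frame(
  Nonmetro = rep(c("Rural", "Urban"), times = c(172, 82)),
  Population_Loss = c(rep(c("Yes", "No"), times = c(38, 134)),
                      rep(c("Yes", "No"), times = c(4, 78)))
)

# One row per group: county count, counties with loss, and the percentage.
loss_rates <- toy %>%
  group_by(Nonmetro) %>%
  summarise(
    counties = n(),
    lost_population = sum(Population_Loss == "Yes"),
    pct_loss = 100 * mean(Population_Loss == "Yes")
  )

loss_rates

# Ratio of the rural rate to the urban rate.
rural_vs_urban <- loss_rates$pct_loss[loss_rates$Nonmetro == "Rural"] /
  loss_rates$pct_loss[loss_rates$Nonmetro == "Urban"]
round(rural_vs_urban, 2)  # 4.53
```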

Visualizing the Data

We visualize this percentage using a stacked bar chart, showing those counties coded “Yes” as experiencing population loss. As we can see, the vast majority of counties that experienced population loss from 2005 - 2015 were classified as “Rural.”


Question1vV <- RuralAtlasData23 %>%
  select("UniqueID", 
         "County",
         "Nonmetro",
         "Population_Loss"
         ) %>%
  filter(Population_Loss == "Yes")

# Map the column itself (unquoted); a quoted string is treated as a constant label.
ggplot(Question1vV, 
       aes(Population_Loss,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Experienced Population Loss",
       title = "Percentage Population Loss, 2005 - 2015") +
  theme_minimal()

Note: for the Final Draft, we need to change the Y Axis from decimal to percent and/or perhaps show the percentages in the column chart.
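
One possible way to handle the y-axis note is scale_y_continuous() with a percent labeller from the scales package (which is installed alongside ggplot2). This is a sketch on toy data rebuilt from the counts in this section (38 rural and 4 urban counties with population loss); toy and p are hypothetical names.

```r
library(ggplot2)

# Toy data: only the counties coded "Yes" for population loss.
toy <- data.frame(
  Population_Loss = "Yes",
  Nonmetro = rep(c("Rural", "Urban"), times = c(38, 4))
)

p <- ggplot(toy, aes(Population_Loss, fill = Nonmetro)) +
  geom_bar(position = "fill") +
  # Show 0.25, 0.50, ... as 25%, 50%, ...
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Experienced Population Loss",
       title = "Percentage Population Loss, 2005 - 2015") +
  theme_minimal()

p
```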

Research Question 2: Retirement Destinations and Persistent Poverty

Analyzing the Data

Now’s the time to explore some frequency tables. We don’t have any numeric variables, so we will solely be using frequency tables to determine the percentage of counties in Texas that are coded as X variable.

We’ll start with determining the percentage of counties that are classified as Retirement Destinations, stratified by Nonmetro status.


RuralAtlasData23 %>%
  count(Nonmetro,
        Retirement_Destination) %>%
  mutate(prop=n/sum(n))

# A tibble: 4 x 4
  Nonmetro Retirement_Destination     n   prop
  <chr>    <chr>                  <int>  <dbl>
1 Rural    No                       153 0.602 
2 Rural    Yes                       19 0.0748
3 Urban    No                        54 0.213 
4 Urban    Yes                       28 0.110

Fewer than 50 counties (47 of 254) are classified as Retirement Destinations (RDs). From a brief glance, it appears that there are more Urban RDs (28) than Rural ones (19). Interesting. We’ll come back to this with a cross tab, but first, let’s pull other variables into a new table for this research question.
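
Note that the prop column above divides by all 254 counties. To stratify by Nonmetro status in the sense of "what share of Rural (or Urban) counties are RDs", the proportion can be computed within each group. The sketch below starts from the four counts printed above (rd_counts and rd_by_status are hypothetical names); with the real data, the group_by() step would simply follow the count() call.

```r
library(dplyr)

# The four cells of the frequency table printed above.
rd_counts <- data.frame(
  Nonmetro = c("Rural", "Rural", "Urban", "Urban"),
  Retirement_Destination = c("No", "Yes", "No", "Yes"),
  n = c(153, 19, 54, 28)
)

# Divide each cell by its own group's total instead of the grand total.
rd_by_status <- rd_counts %>%
  group_by(Nonmetro) %>%
  mutate(prop_within = n / sum(n)) %>%
  ungroup()

rd_by_status
```

On these counts, about 11% of Rural counties and about 34% of Urban counties are RDs.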


RQ2 <- RuralAtlasData23 %>%
  count(Nonmetro,
        Retirement_Destination,
        Persistent_Poverty,
        Persistent_Child_Poverty) %>%
  mutate(prop=n/sum(n)) 
print(RQ2)

# A tibble: 14 x 6
   Nonmetro Retirement_Desti~ Persistent_Pove~ Persistent_Child~     n
   <chr>    <chr>             <chr>            <chr>             <int>
 1 Rural    No                No               No                   74
 2 Rural    No                No               Yes                  46
 3 Rural    No                Yes              No                    1
 4 Rural    No                Yes              Yes                  32
 5 Rural    Yes               No               No                   15
 6 Rural    Yes               No               Yes                   2
 7 Rural    Yes               Yes              Yes                   2
 8 Urban    No                No               No                   38
 9 Urban    No                No               Yes                   7
10 Urban    No                Yes              Yes                   9
11 Urban    Yes               No               No                   25
12 Urban    Yes               No               Yes                   1
13 Urban    Yes               Yes              No                    1
14 Urban    Yes               Yes              Yes                   1
# ... with 1 more variable: prop <dbl>

That’s a little unintuitive, but hopefully visualizing the data will help us understand the table better.

We can see some quick points of interest, though. First, there are four Rural RD counties that have some form of persistent poverty. That’s about 21% of all 19 Rural RDs. For Urban RDs, only three have some form of persistent poverty. That’s about 11% of the 28 Urban RDs, roughly half the Rural rate.
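
These shares can be derived from the RQ2 table rather than counted by eye. The sketch below re-enters the 14 rows printed above as a plain data frame (rq2_counts, any_poverty, and rd_poverty are hypothetical names) and summarises the RDs that have at least one of the two persistent poverty flags.

```r
library(dplyr)

# The 14 rows of the RQ2 table printed above.
rq2_counts <- data.frame(
  Nonmetro = c(rep("Rural", 7), rep("Urban", 7)),
  Retirement_Destination = c("No", "No", "No", "No", "Yes", "Yes", "Yes",
                             "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  Persistent_Poverty = c("No", "No", "Yes", "Yes", "No", "No", "Yes",
                         "No", "No", "Yes", "No", "No", "Yes", "Yes"),
  Persistent_Child_Poverty = c("No", "Yes", "No", "Yes", "No", "Yes", "Yes",
                               "No", "Yes", "Yes", "No", "Yes", "No", "Yes"),
  n = c(74, 46, 1, 32, 15, 2, 2, 38, 7, 9, 25, 1, 1, 1)
)

rd_poverty <- rq2_counts %>%
  filter(Retirement_Destination == "Yes") %>%
  # Flag rows with either form of persistent poverty.
  mutate(any_poverty = Persistent_Poverty == "Yes" |
           Persistent_Child_Poverty == "Yes") %>%
  group_by(Nonmetro) %>%
  summarise(
    rd_counties = sum(n),
    with_any_poverty = sum(n[any_poverty]),
    pct_any_poverty = 100 * with_any_poverty / rd_counties
  )

rd_poverty
```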

We’ll complete this analysis with a crosstab and a proportional crosstab to help begin answering Research Question #2: Are Rural Texas counties more likely to experience persistent poverty compared with their Urban counterparts? What about persistent child poverty?


xtabs(~ Nonmetro + Persistent_Poverty, RuralAtlasData23)

        Persistent_Poverty
Nonmetro  No Yes
   Rural 137  35
   Urban  71  11

And then the proportional crosstabs.


prop.table(xtabs(~ Nonmetro + Persistent_Poverty, RuralAtlasData23))*100

        Persistent_Poverty
Nonmetro        No       Yes
   Rural 53.937008 13.779528
   Urban 27.952756  4.330709

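These cell percentages are shares of all 254 Texas counties, so they also reflect the fact that there are more than twice as many Rural counties as Urban ones. Comparing within each group, 20.3% of Rural counties (35 of 172) are classified with Persistent Poverty versus 13.4% of Urban counties (11 of 82), so Rural counties are about 1.5 times as likely to be classified this way. Combined with the data from Research Question #1, a Texas Rural county is more likely than an Urban county to experience both population loss (about 4.5 times as likely) and persistent poverty (about 1.5 times as likely).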
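
To compare rates between Rural and Urban counties, the crosstab can be normalised by row with the margin argument of prop.table(). The sketch below re-enters the Persistent Poverty counts from the crosstab above as a small table (pp and row_pct are hypothetical names); with the real data, the same call would wrap the xtabs() result.

```r
# Counts from the Persistent Poverty crosstab above (filled column by column).
pp <- as.table(matrix(
  c(137, 71, 35, 11),
  nrow = 2,
  dimnames = list(Nonmetro = c("Rural", "Urban"),
                  Persistent_Poverty = c("No", "Yes"))
))

# margin = 1 makes each row sum to 100%, giving within-group rates.
row_pct <- prop.table(pp, margin = 1) * 100
round(row_pct, 1)

# Ratio of the Rural rate to the Urban rate.
round(row_pct["Rural", "Yes"] / row_pct["Urban", "Yes"], 2)  # 1.52
```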

We’ll repeat this process for Persistent Child Poverty.


xtabs(~ Nonmetro + Persistent_Child_Poverty, RuralAtlasData23)

        Persistent_Child_Poverty
Nonmetro No Yes
   Rural 90  82
   Urban 64  18

And then the proportional crosstabs.


prop.table(xtabs(~ Nonmetro + Persistent_Child_Poverty, RuralAtlasData23))*100

        Persistent_Child_Poverty
Nonmetro        No       Yes
   Rural 35.433071 32.283465
   Urban 25.196850  7.086614

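Looks very similar to Persistent Poverty, save one striking difference: more than twice as many Rural counties are classified with Persistent Child Poverty as with Persistent Poverty (82 versus 35). Within each group, 47.7% of Rural counties (82 of 172) have Persistent Child Poverty, compared with 22.0% of Urban counties (18 of 82). So it looks like entrenched poverty among children is more widespread in Rural counties than entrenched poverty in the population as a whole.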

However, let’s not jump to any conclusions just yet and integrate poverty with retirement destinations to see if there’s any overlap. That will be one step of this RQ’s visualization process.

Visualizing the Data

Due to the categorical nature of this current dataset, we are going to use bar charts for our univariate and bivariate graphs. We’ll focus on Retirement Destinations for both initial plots.


ggplot(RuralAtlasData23, 
       aes(Retirement_Destination)) +
  geom_bar()

Nothing really amazing here. Most counties are not retirement destinations, by more than 4:1 (207 to 47). Let’s add some color to this graph.


ggplot(RuralAtlasData23, 
       aes(Retirement_Destination,
           fill = Nonmetro)) +
  geom_bar(position = "stack")

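Now we’re getting somewhere! It looks like there are more Urban Retirement Destinations, both in count (28 versus 19) and as a share of their group (about 34% of Urban counties versus about 11% of Rural ones). So it appears that older individuals are moving not to the countryside but to the city.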

But this simple bar chart is pretty boring, still. We can look at proportions by editing the position from stack to fill and updating the colors / labels.


ggplot(RuralAtlasData23, 
       aes(Retirement_Destination,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Retirement Destination",
       title = "More Than Half of All Texas Retirement Destinations are in Urban Counties") +
  theme_minimal()

That’s much better. And while a bar chart is still not that exciting of a data visualization, it tells us a little bit about the Retirement Destination column. Let’s add the two poverty variables to a facet grid and see how these four variables compare.


ggplot(RuralAtlasData23, 
       aes(Retirement_Destination,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  facet_grid(vars(Persistent_Poverty), 
             vars(Persistent_Child_Poverty)) +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Retirement Destination",
       title = "Retirement Destinations By Nonmetro Status and Persistent Poverty") +
  theme_minimal()

This is still a little hard to read due to the categorical variables all being Y/N. I’m not sure how to add Axis Labels on a facet_grid, so that’ll be something I’ll need to research for Homework #6.
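
One option for the facet labelling problem is the labeller argument of facet_grid(). ggplot2's label_both prints the variable name next to each value, so a strip reads "Persistent_Poverty: Yes" rather than just "Yes". The sketch below uses a small invented data frame (toy and p are hypothetical names), not the Texas data.

```r
library(ggplot2)

# Invented rows covering each poverty combination; not the real counties.
toy <- data.frame(
  Retirement_Destination = c("No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"),
  Nonmetro = c("Rural", "Rural", "Urban", "Urban", "Rural", "Urban", "Urban", "Rural"),
  Persistent_Poverty = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  Persistent_Child_Poverty = c("No", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes")
)

p <- ggplot(toy, aes(Retirement_Destination, fill = Nonmetro)) +
  geom_bar(position = "fill") +
  # label_both shows "variable: value" on every facet strip.
  facet_grid(rows = vars(Persistent_Poverty),
             cols = vars(Persistent_Child_Poverty),
             labeller = label_both) +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent", x = "Retirement Destination") +
  theme_minimal()

p
```

Renaming the columns to shorter, space-separated names before plotting would make the strips read more naturally.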

Regardless, if we look at the bottom right panel (both types of persistent poverty), we see the trifecta: two of the three RDs with both types of persistent poverty are rural. In the bottom left panel (persistent poverty without persistent child poverty), the only RD is urban.

Returning back to the research question, are Rural Texas counties that are retirement destinations more likely to experience persistent poverty compared with their Urban counterparts? What about persistent child poverty?

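Per this chart and the counts above, it looks like a tentative yes. About 21% of Rural RDs (4 of 19) have one or both of the persistent poverty variables, roughly twice the share of their urban counterparts (3 of 28, about 11%). The counts are small, though, so a single county shifts the comparison noticeably.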

Research Question 3: Natural Amenities, Population Loss, and Persistent Poverty

Analyzing the Data

This will be completed in Homework #6.

Visualizing the Data

This will be completed in Homework #6.

Concluding Thoughts

Wrapping up Homework #5, it appears I have a few minor items to address before Homework #6, including: 1) RQ1 needs its y-axis labels changed to percentages and its colors updated to a colorblind-friendly palette, 2) RQ2 needs facet_grid axis labels, y-axis labels changed to percentages, and updated colors, and 3) RQ3 needs to be completed in full.
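
For the color item, one possible approach is to set the two fill colors by hand from the Okabe-Ito palette, which was designed with color vision deficiency in mind. The sketch below hard-codes two of its hex values (orange and sky blue) and applies them with scale_fill_manual(); the toy data and the names toy, okabe_ito_pair, and p are hypothetical.

```r
library(ggplot2)

# Invented counts by retirement destination and Nonmetro status.
toy <- data.frame(
  Retirement_Destination = rep(c("No", "Yes"), times = c(6, 4)),
  Nonmetro = c("Rural", "Rural", "Rural", "Rural", "Urban", "Urban",
               "Rural", "Urban", "Urban", "Urban")
)

# Two Okabe-Ito colors, named by the factor levels they should map to.
okabe_ito_pair <- c(Rural = "#E69F00", Urban = "#56B4E9")

p <- ggplot(toy, aes(Retirement_Destination, fill = Nonmetro)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = okabe_ito_pair) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Percent", x = "Retirement Destination") +
  theme_minimal()

p
```

Naming the vector by level keeps the Rural and Urban colors fixed even if one group is filtered out of a later plot.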

I do not think I will have time to join this dataset with a separate tab (Income). I had wanted to do so because having only categorical variables limits analysis and visualization. However, this allows me to drill deeper into understanding categorical visualizations. I would still like to add geom_point and improve facet wrapping. I’m not sure how much time will allow for this with RQ3, but we shall see.

I also need to clean up and tighten the code/writing for this report. This section will be rewritten for Homework #6 and hopefully provide some answers to the concluding thoughts in Homework #5.