Homework 4

Importing and Cleaning the Data

For Homework 4, we will continue using the Atlas of Rural and Small-Town America. Most of the steps from Homework 3 (importing, writing to a data frame, cleaning, and wrangling) are hidden in this report. However, where I have selected, renamed, or recoded new variables, I include the code chunk.

RuralAtlasData23 <- select(RuralAtlasData23, "FIPStxt", 
                           "State", 
                           "County", 
                           "Nonmetro2013", 
                           "Micropolitan2013", 
                           "Low_Education_2015_update",
                           "Low_Employment_2015_update",
                           "Population_loss_2015_update",
                           "Retirement_Destination_2015_Update",
                           "PersistentChildPoverty2004",
                           "PersistentPoverty2000", 
                           "HiAmenity")

We’ve added Low Education, Low Employment, Retirement Destination, and Persistent Child Poverty. Let’s rename and recode these columns.

RuralAtlasData23 <- rename(RuralAtlasData23, 
                           UniqueID = "FIPStxt", 
                           Nonmetro = "Nonmetro2013", 
                           Micropolitan = "Micropolitan2013", 
                           Low_Education = "Low_Education_2015_update",
                           Low_Employment = "Low_Employment_2015_update",
                           Population_Loss = "Population_loss_2015_update", 
                           Retirement_Destination = "Retirement_Destination_2015_Update",
                           Persistent_Child_Poverty = "PersistentChildPoverty2004",
                           Persistent_Poverty = "PersistentPoverty2000")

RuralAtlasData23 <- RuralAtlasData23 %>%
  mutate(Nonmetro = recode(Nonmetro, '0' = "Urban", '1' = "Rural"),
         Micropolitan = recode(Micropolitan, '0' = "No", '1' = "Yes"),
         Low_Education = recode(Low_Education,'0' = "No", '1' = "Yes"),
         Low_Employment = recode(Low_Employment,'0' = "No", '1' = "Yes"),
         Population_Loss = recode(Population_Loss, '0' = "No", '1' = "Yes"),
         Retirement_Destination = recode(Retirement_Destination,'0' = "No", '1' = "Yes"),
         Persistent_Child_Poverty = recode(Persistent_Child_Poverty,'0' = "No", '1' = "Yes"),
         Persistent_Poverty = recode(Persistent_Poverty, '0' = "No", '1' = "Yes"),
         HiAmenity = recode(HiAmenity, '0' = "No", '1' = "Yes")
         )
head(RuralAtlasData23)

# A tibble: 6 x 12
  UniqueID State County  Nonmetro Micropolitan Low_Education
  <chr>    <chr> <chr>   <chr>    <chr>        <chr>        
1 01001    AL    Autauga Urban    No           No           
2 01003    AL    Baldwin Urban    No           No           
3 01005    AL    Barbour Rural    No           Yes          
4 01007    AL    Bibb    Urban    No           Yes          
5 01009    AL    Blount  Urban    No           Yes          
6 01011    AL    Bullock Rural    No           Yes          
# ... with 6 more variables: Low_Employment <chr>,
#   Population_Loss <chr>, Retirement_Destination <chr>,
#   Persistent_Child_Poverty <chr>, Persistent_Poverty <chr>,
#   HiAmenity <chr>

The last step is selecting only those rows/counties that are in Texas.

RuralAtlasData23 <- RuralAtlasData23 %>%
  filter(State == "TX")
print(RuralAtlasData23)

# A tibble: 254 x 12
   UniqueID State County    Nonmetro Micropolitan Low_Education
   <chr>    <chr> <chr>     <chr>    <chr>        <chr>        
 1 48001    TX    Anderson  Rural    Yes          Yes          
 2 48003    TX    Andrews   Rural    Yes          Yes          
 3 48005    TX    Angelina  Rural    Yes          No           
 4 48007    TX    Aransas   Urban    No           No           
 5 48009    TX    Archer    Urban    No           No           
 6 48011    TX    Armstrong Urban    No           No           
 7 48013    TX    Atascosa  Urban    No           Yes          
 8 48015    TX    Austin    Urban    No           No           
 9 48017    TX    Bailey    Rural    No           Yes          
10 48019    TX    Bandera   Urban    No           No           
# ... with 244 more rows, and 6 more variables: Low_Employment <chr>,
#   Population_Loss <chr>, Retirement_Destination <chr>,
#   Persistent_Child_Poverty <chr>, Persistent_Poverty <chr>,
#   HiAmenity <chr>

Frequency Tables

Now’s the time to explore some frequency tables. We don’t have any numeric variables, so we will rely solely on frequency tables to determine the percentage of Texas counties that fall into each category of a given variable. For the Final Project, I am looking at joining this data frame with another worksheet from the dataset to bring in numeric values, along with writing a for loop to pull these proportions into one data frame. We shall see if there is enough time to do so.

RuralAtlasData23 %>%
  select(Nonmetro) %>%
  table() %>%
  prop.table()*100

.
   Rural    Urban 
67.71654 32.28346

68% of counties in Texas are classified as Rural, while 32% are Urban. We can repeat this process for the remaining eight (8) columns.

RuralAtlasData23 %>%
  select(Micropolitan) %>%
  table() %>%
  prop.table()*100

.
      No      Yes 
81.88976 18.11024

Nothing tremendous here. I would need to cross Micropolitan with Rural/Urban to see the percentage of Rural counties that are micropolitan, meaning they are built around an urban core of at least 10K but fewer than 50K people.

RuralAtlasData23 %>%
  select(Low_Education) %>%
  table() %>%
  prop.table()*100

.
      No      Yes 
62.99213 37.00787

But this is worrisome. About 37% of all Texas counties are classified as having Low Education. The variable documentation I have does not define how the Atlas codes a county as Low Education, but I can make an educated guess that a county is flagged when the share of adults without a given credential crosses some threshold, whether that credential is a high school diploma or a bachelor’s degree.

RuralAtlasData23 %>%
  select(Low_Employment) %>%
  table() %>%
  prop.table()*100

.
      No      Yes 
72.04724 27.95276

Another worrisome statistic. About 28% of all Texas counties are classified as having Low Employment. As with Low Education, the classification almost certainly depends on a county crossing a specific threshold.

RuralAtlasData23 %>%
  select(Population_Loss) %>%
  table() %>%
  prop.table()*100

.
      No      Yes 
83.46457 16.53543

Population loss – particularly rural vs urban – is a research question we reviewed in Homework #3. We shall not delve deep into that question here.

RuralAtlasData23 %>%
  select(Retirement_Destination) %>%
  table() %>%
  prop.table()*100

.
      No      Yes 
81.49606 18.50394

We will examine this variable in the Visualizing the Data section. There is not much to say about it here.

RuralAtlasData23 %>%
  select(Persistent_Child_Poverty) %>%
  table() %>%
  prop.table()*100

.
      No      Yes 
60.62992 39.37008

Persistent Child Poverty (39% of counties) is more than double Persistent Poverty (18% of counties; see the next table). PP and PCP are defined as a county having 20% or more of its population (or of its children, for PCP) below the poverty line for 20+ years. This shows that nearly two in five Texas counties are classified as having deeply entrenched child poverty.

RuralAtlasData23 %>%
  select(Persistent_Poverty) %>%
  table() %>%
  prop.table()*100

.
      No      Yes 
81.88976 18.11024

See discussion above. Not much more to say.

RuralAtlasData23 %>%
  select(HiAmenity) %>%
  table() %>%
  prop.table()*100

.
      No      Yes 
53.14961 46.85039

What’s interesting here is that 47% of all Texas counties are classified as having high amenities. I would definitely want to know how this is classified and would love to dig deeper, using some visualizations against other variables to see if there’s a pattern anywhere.
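
The nine tables above all repeat the same pipeline, so the for loop mentioned at the start of this section could collect the percentages into one data frame. Below is a minimal sketch of that idea in base R, not the final implementation. The data frame toy and its values are made up for illustration; with the real data, the loop would run over the nine recoded columns of RuralAtlasData23.

```r
# Toy stand-in for the Texas data; these values are made up for illustration.
toy <- data.frame(
  Nonmetro     = c("Rural", "Rural", "Urban", "Rural"),
  Micropolitan = c("Yes", "No", "No", "No"),
  HiAmenity    = c("No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Build one small data frame per column, then stack them.
results <- list()
for (col in names(toy)) {
  pct <- prop.table(table(toy[[col]])) * 100
  results[[col]] <- data.frame(
    Variable = col,
    Level    = names(pct),
    Percent  = as.numeric(pct),
    stringsAsFactors = FALSE
  )
}
freq_summary <- do.call(rbind, results)
rownames(freq_summary) <- NULL
print(freq_summary)
```

Stacking the results in long form (one row per variable and level) keeps the output easy to filter or plot later.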

We’ll end this section with a crosstab and a proportional crosstab to help answer Research Question #2 from Homework #3: Are Rural Texas counties more likely to experience persistent poverty compared with their Urban counterparts?

xtabs(~ Nonmetro + Persistent_Poverty, RuralAtlasData23)

        Persistent_Poverty
Nonmetro  No Yes
   Rural 137  35
   Urban  71  11

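And then the proportional crosstab. Without a margin argument, prop.table() divides each cell by the total of all 254 counties, so the four cells together sum to 100%.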

prop.table(xtabs(~ Nonmetro + Persistent_Poverty, RuralAtlasData23))*100

        Persistent_Poverty
Nonmetro        No       Yes
   Rural 53.937008 13.779528
   Urban 27.952756  4.330709

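At first glance, it looks like Rural counties are 3x more likely to experience Persistent Poverty than their Urban counterparts (13.8% vs. 4.3%). However, these percentages are shares of all 254 counties, and there are about twice as many Rural counties (172) as Urban ones (82). Within each group, 35 of 172 Rural counties (20.3%) are in Persistent Poverty compared with 11 of 82 Urban counties (13.4%), so a Rural county is about 1.5x as likely to be classified this way. The population loss comparison from Research Question #1 should be read with the same caution: rates of three to four times hold only if they were computed within each group rather than as shares of all counties.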
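
To compare rates within each group, prop.table() takes a margin argument: margin = 1 divides each cell by its row total rather than by the grand total. Here is a small sketch that rebuilds the crosstab from the counts printed above and computes the row percentages. The object names poverty_counts, row_pct, and rate_ratio are my own.

```r
# Counts copied from the xtabs() output above.
poverty_counts <- matrix(
  c(137, 35,
     71, 11),
  nrow = 2, byrow = TRUE,
  dimnames = list(Nonmetro = c("Rural", "Urban"),
                  Persistent_Poverty = c("No", "Yes"))
)

# margin = 1: each row sums to 100, giving the rate within Rural and within Urban.
row_pct <- prop.table(poverty_counts, margin = 1) * 100
print(round(row_pct, 1))

# Ratio of the Rural persistent poverty rate to the Urban rate.
rate_ratio <- row_pct["Rural", "Yes"] / row_pct["Urban", "Yes"]
print(round(rate_ratio, 2))
```

On the real data, the same table should come from adding margin = 1 to the prop.table() call on the xtabs() result.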

While there are plenty of other proportional crosstabs that would make for interesting research questions, we’re going to stop here for now and explore two data visualizations.

Visualizing the Data

Due to the categorical nature of this current dataset, we are going to use bar charts for our univariate and bivariate graphs. We’ll focus on Retirement Destinations for both plots.

ggplot(RuralAtlasData23, 
       aes(Retirement_Destination)) +
  geom_bar()

Nothing really amazing here. Most counties are not retirement destinations, by a ratio of a little more than 4:1. Let’s add some color to this graph.

ggplot(RuralAtlasData23, 
       aes(Retirement_Destination,
           fill = Nonmetro)) +
  geom_bar(position = "stack")

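Now we’re getting somewhere! It looks like there are more Urban Retirement Destinations than Rural ones. Since only 32% of Texas counties are Urban, the share of Urban counties that are Retirement Destinations is higher still. This suggests that older individuals are moving less to the countryside and more to metro areas.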

But this simple bar chart is pretty boring, still. We can look at proportions by editing the position from stack to fill and updating the colors / labels.

ggplot(RuralAtlasData23, 
       aes(Retirement_Destination,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Retirement Destination",
       title = "More Than Half of All Texas Retirement Destinations are in Urban Counties") +
  theme_minimal()

That’s much better. And while a bar chart is still not the most exciting data visualization, it tells us a little bit about the Retirement Destination column. For the Final Project, I hope to add more categorical data visualizations, such as geom_tile and geom_count.
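
As a rough preview of those two geoms, here is a sketch on made-up rows (the toy data frame and its values are hypothetical, not Atlas data). geom_count() sizes a point by the number of rows at each combination, while geom_tile() needs the counts computed first.

```r
library(ggplot2)

# Hypothetical rows for illustration only (not the Atlas values).
toy <- data.frame(
  Nonmetro  = c("Rural", "Rural", "Urban", "Urban", "Rural", "Urban"),
  HiAmenity = c("Yes", "No", "Yes", "Yes", "No", "Yes")
)

# geom_count(): point size encodes how many rows share each combination.
p_count <- ggplot(toy, aes(Nonmetro, HiAmenity)) +
  geom_count()

# geom_tile(): tabulate the combinations first, then map the count to fill.
combos <- as.data.frame(table(toy), responseName = "n")
p_tile <- ggplot(combos, aes(Nonmetro, HiAmenity, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), colour = "white")
```

Printing p_count or p_tile draws the plot.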

There are of course limitations to this bar chart. We could improve it by showing percentage labels within the bars, detailing what share of the counties in each Retirement Destination category are Rural or Urban. The Y axis could also be changed from decimals to percentages.
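
One possible way to make both changes is sketched below, assuming the counts are tabulated before plotting so the labels can be computed by hand. The toy rows and the names counts, share, and label are hypothetical. scales::percent formats the axis, and position_fill(vjust = 0.5) centers each label in its segment.

```r
library(ggplot2)

# Hypothetical county rows for illustration only (not the Atlas values).
toy <- data.frame(
  Retirement_Destination = c("No", "No", "No", "Yes", "Yes", "No", "Yes", "No"),
  Nonmetro               = c("Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban")
)

# Count counties per combination, then compute each segment's share of its bar.
counts <- as.data.frame(table(toy), responseName = "n")
counts$share <- counts$n / ave(counts$n, counts$Retirement_Destination, FUN = sum)
counts$label <- paste0(round(counts$share * 100), "%")

p <- ggplot(counts, aes(Retirement_Destination, n, fill = Nonmetro)) +
  geom_col(position = "fill") +
  geom_text(aes(label = label), position = position_fill(vjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Retirement Destination", y = "Percent of counties")
```

Because the counts are precomputed, geom_col() replaces geom_bar() here; the bars themselves look the same.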

For readability’s sake, we should choose a colorblind-friendly palette. Different hues of the same color are easy on the eyes, but that doesn’t help if a colorblind person can’t easily distinguish between the blues! That’s not difficult to fix, but we are aiming for a simple visualization here. Other readability changes could include paring down the grid lines, changing the alpha value, and amending the legend.
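
A sketch of what those readability changes might look like is below. The two hex codes are the orange and sky blue from the Okabe-Ito colorblind-safe palette; the toy rows, the cb_colors name, and the legend title are my own choices for illustration.

```r
library(ggplot2)

# Hypothetical county rows for illustration only (not the Atlas values).
toy <- data.frame(
  Retirement_Destination = c("No", "No", "Yes", "Yes", "No", "Yes"),
  Nonmetro               = c("Rural", "Urban", "Urban", "Rural", "Rural", "Urban")
)

# Orange and sky blue from the Okabe-Ito palette, keyed by Nonmetro level.
cb_colors <- c(Rural = "#E69F00", Urban = "#56B4E9")

p <- ggplot(toy, aes(Retirement_Destination, fill = Nonmetro)) +
  geom_bar(position = "fill", alpha = 0.9) +        # slightly transparent bars
  scale_fill_manual(values = cb_colors,
                    name = "County type") +         # clearer legend title
  theme_minimal() +
  theme(panel.grid.minor   = element_blank(),       # fewer grid lines
        panel.grid.major.x = element_blank())
```

Naming the color vector by level ties each color to Rural or Urban explicitly, so the mapping does not change if the level order does.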

I think this bivariate plot opens the door to more unanswered questions. How do other variables, such as persistent poverty and high amenities, relate to retirement destinations? I will need to explore categorical data visualizations that incorporate multiple variables to see what we can produce. Purely categorical data limits what these plots can show, so it may be fruitful to add numeric variables to the analysis.