Analyzing and Visualizing the Atlas of Rural and Small Town America Dataset
The data for this project comes from the Atlas of Rural and Small Town America, published by the U.S. Department of Agriculture's Economic Research Service. The dataset provides statistics in broad categories of socioeconomic factors, including demographic data from the American Community Survey (ACS), economic data from the Bureau of Labor Statistics, categorical variables (codes) for various county classifications, data on income, and data on veterans.
For this project, we are only going to look at County Classifications. Let’s first import the Excel workbook, read the relevant sheet, and convert it into a tibble.
I then convert it to a data frame for easier data wrangling and view the first few rows.
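The import step could look something like the sketch below. The file name and sheet name are assumptions on my part (the real ones are not shown in this report), and `read_classifications()` is a hypothetical helper, so adjust both to match the downloaded workbook.

```r
# Hypothetical helper: the default file name and sheet name are assumptions,
# not values taken from the original report.
read_classifications <- function(path = "RuralAtlasData23.xlsx",
                                 sheet = "County Classifications") {
  # Fail early with a clear message if the workbook is not where we expect it
  if (!file.exists(path)) {
    stop("Workbook not found: ", path)
  }
  # readxl::read_excel() returns a tibble ...
  atlas_tbl <- readxl::read_excel(path, sheet = sheet)
  # ... which as.data.frame() converts to a plain data frame
  as.data.frame(atlas_tbl)
}

# Usage, once the workbook is in the working directory:
# RuralAtlasData23 <- read_classifications()
```

Wrapping the import in a small function keeps the path and sheet in one place if the workbook is updated later.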
paged_table(head(RuralAtlasData23))
With just a quick glance, we can see a few interesting tidbits.
Forty-five columns is of course quite a bit to work with, so we are going to select only twelve: three identifiers (the unique county ID, the state, and the county name) and nine (9) classification variables. The nine classifications indicate whether the county is Nonmetro (the county does not have an Urbanized Area or Urbanized Cluster in its jurisdiction), whether it is Micropolitan (population of at least 10,000 but less than 50,000), whether it had low education in 2015, whether it had low employment in 2015, whether it experienced population loss in the past decade (2005 - 2015), whether it is designated as a retirement destination due to a high percentage of those over the age of 65 residing in the county, whether it is in persistent poverty, whether it is in persistent child poverty (both over the past three decades, 1970 - 2000), and whether it has high natural amenities.
RuralAtlasData23 <- select(RuralAtlasData23,
                           "FIPStxt",
                           "State",
                           "County",
                           "Nonmetro2013",
                           "Micropolitan2013",
                           "Low_Education_2015_update",
                           "Low_Employment_2015_update",
                           "Population_loss_2015_update",
                           "Retirement_Destination_2015_Update",
                           "PersistentChildPoverty2004",
                           "PersistentPoverty2000",
                           "HiAmenity")
Next, let’s rename those long columns into something more digestible.
# rename() takes new_name = old_name pairs; HiAmenity keeps its original name
RuralAtlasData23 <- rename(RuralAtlasData23,
                           UniqueID = "FIPStxt",
                           Nonmetro = "Nonmetro2013",
                           Micropolitan = "Micropolitan2013",
                           Low_Education = "Low_Education_2015_update",
                           Low_Employment = "Low_Employment_2015_update",
                           Population_Loss = "Population_loss_2015_update",
                           Retirement_Destination = "Retirement_Destination_2015_Update",
                           Persistent_Child_Poverty = "PersistentChildPoverty2004",
                           Persistent_Poverty = "PersistentPoverty2000")
While the columns / variables are now easier to understand, the coded responses are not. We’ll need to recode those 0s and 1s to better reflect what they are identifying.
# Replace the 0/1 codes with readable labels
RuralAtlasData23 <- RuralAtlasData23 %>%
  mutate(Nonmetro = recode(Nonmetro, '0' = "Urban", '1' = "Rural"),
         Micropolitan = recode(Micropolitan, '0' = "No", '1' = "Yes"),
         Low_Education = recode(Low_Education, '0' = "No", '1' = "Yes"),
         Low_Employment = recode(Low_Employment, '0' = "No", '1' = "Yes"),
         Population_Loss = recode(Population_Loss, '0' = "No", '1' = "Yes"),
         Retirement_Destination = recode(Retirement_Destination, '0' = "No", '1' = "Yes"),
         Persistent_Child_Poverty = recode(Persistent_Child_Poverty, '0' = "No", '1' = "Yes"),
         Persistent_Poverty = recode(Persistent_Poverty, '0' = "No", '1' = "Yes"),
         HiAmenity = recode(HiAmenity, '0' = "No", '1' = "Yes")
  )
paged_table(head(RuralAtlasData23))
As the last step in this data wrangling process, let’s filter out all the states save Texas (my home state!). When exploring geographic units of analysis, it’s often better to home in on a smaller frame to find potentially richer information. While the dataset limits what we can learn, I think localizing the data will help us better answer some research questions.
RuralAtlasData23 <- RuralAtlasData23 %>%
  filter(State == "TX")
paged_table(head(RuralAtlasData23))
Now that the data is cleaned and filtered, let’s consider some exploratory research questions.
Let’s select the relevant columns for this question, filter on only Rural counties that experienced population loss, and provide a count.
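A minimal sketch of that filter-and-count step is below. Because the real data frame is not reproduced here, `toy_counties` is a stand-in I rebuilt from the counts reported in this section (38 and 134 Rural counties, 4 and 78 Urban counties); with the real data, the same pipeline would start from `RuralAtlasData23`.

```r
library(dplyr)

# Stand-in data rebuilt from the counts reported in this section;
# the real data frame has one row per Texas county.
cell_sizes <- c(38, 134, 4, 78)
toy_counties <- data.frame(
  Nonmetro        = rep(c("Rural", "Rural", "Urban", "Urban"), times = cell_sizes),
  Population_Loss = rep(c("Yes", "No", "Yes", "No"), times = cell_sizes)
)

# Keep only Rural counties that lost population, then count the rows
rural_loss <- toy_counties %>%
  select(Nonmetro, Population_Loss) %>%
  filter(Nonmetro == "Rural", Population_Loss == "Yes") %>%
  count()

rural_loss
```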
38 Rural Texas counties experienced population loss. There are a total of 172 Rural Texas counties (of 254 total). Doing some quick math will pull a percentage of those that experienced population loss.
(38 / 172) * 100
[1] 22.09302
About 22% of all Texas Rural counties experienced population loss. This falls short of the threshold set by the research question (25%), so we can conclude that most Texas Rural counties (roughly 78%) did not lose population over the period.
We could further explore a question of similar concern by comparing population loss across the Rural / Urban Continuum, and see what percentage of Texas Urban counties experienced population loss. Let’s examine that real quick.
(4 / 82) * 100
[1] 4.878049
There’s much less population loss for Texas Urban counties. Only 4.9% have experienced some form of population loss in the past decade (2005 - 2015). From this we can conclude that, while Rural counties have not met the threshold of substantial population loss, they are about 4.5x as likely to experience population loss as their Urban counterparts (22.1% versus 4.9%).
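The two percentages and the ratio between them can be computed in one pass from the 2x2 table of counts. This is only a sketch using the counts quoted in this section, with row-wise proportions (`margin = 1`) so that each group is compared against its own total.

```r
# Counts quoted in this section: rows are Nonmetro status, columns are Population_Loss
loss_counts <- matrix(
  c(38, 134,
     4,  78),
  nrow = 2, byrow = TRUE,
  dimnames = list(Nonmetro = c("Rural", "Urban"),
                  Population_Loss = c("Yes", "No"))
)

# margin = 1 divides each row by its own total (172 Rural, 82 Urban)
loss_rate <- prop.table(loss_counts, margin = 1)[, "Yes"] * 100
round(loss_rate, 1)

# How many times more likely is population loss in a Rural county?
rate_ratio <- loss_rate[["Rural"]] / loss_rate[["Urban"]]
round(rate_ratio, 1)
```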
We visualize this percentage using a stacked bar chart, showing those counties coded “Yes” as experiencing population loss. As we can see, the vast majority of counties that experienced population loss from 2005 - 2015 (38 of 42, about 90%) were classified as “Rural.”
# Keep only the counties that lost population
Question1vV <- RuralAtlasData23 %>%
  select("UniqueID",
         "County",
         "Nonmetro",
         "Population_Loss"
  ) %>%
  filter(Population_Loss == "Yes")

# Population_Loss is unquoted so that x maps to the column, not a constant string
ggplot(Question1vV,
       aes(Population_Loss,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Experienced Population Loss",
       title = "Percentage Population Loss, 2005 - 2015") +
  theme_minimal()
Note: for the Final Draft, we need to change the Y Axis from decimal to percent and/or perhaps show the percentages in the column chart.
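One possible way to handle the percent axis is `scale_y_continuous()` with a percent label function from the scales package. The sketch below assumes scales is installed and uses a stand-in data frame (`toy_loss`) built from the 38 Rural and 4 Urban counties that lost population, not the real data.

```r
library(ggplot2)

# Stand-in for Question1vV: 38 Rural and 4 Urban counties coded "Yes"
toy_loss <- data.frame(
  Population_Loss = "Yes",
  Nonmetro = rep(c("Rural", "Urban"), times = c(38, 4))
)

p_loss <- ggplot(toy_loss, aes(Population_Loss, fill = Nonmetro)) +
  geom_bar(position = "fill") +
  # Show 0.25, 0.50, ... as 25%, 50%, ...
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent of counties",
       x = "Experienced Population Loss",
       title = "Percentage Population Loss, 2005 - 2015") +
  theme_minimal()

p_loss
```

If the palette also needs to change, `scale_fill_viridis_d()` is one colorblind-friendly option that ships with ggplot2.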
Now’s the time to explore some frequency tables. We don’t have any numeric variables, so we will rely solely on frequency tables to determine the percentage of Texas counties coded “Yes” for a given variable.
We’ll start with determining the percentage of counties that are classified as Retirement Destinations, stratified by Nonmetro status.
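A frequency table like this can be built with `count()` followed by `mutate()`. The sketch below is an assumption about how it was produced, run on a stand-in data frame (`toy_rd`) rebuilt from the four cell counts in the table in this section.

```r
library(dplyr)

# Stand-in data rebuilt from the four cell counts in the frequency table
cell_sizes <- c(153, 19, 54, 28)
toy_rd <- data.frame(
  Nonmetro               = rep(c("Rural", "Rural", "Urban", "Urban"), times = cell_sizes),
  Retirement_Destination = rep(c("No", "Yes", "No", "Yes"), times = cell_sizes)
)

rd_freq <- toy_rd %>%
  count(Nonmetro, Retirement_Destination) %>%  # one row per combination
  mutate(prop = n / sum(n))                    # share of all 254 counties

rd_freq
```

Note that `prop` here is each cell's share of all counties, not its share within the Rural or Urban group.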
# A tibble: 4 x 4
  Nonmetro Retirement_Destination     n   prop
  <chr>    <chr>                  <int>  <dbl>
1 Rural    No                       153 0.602
2 Rural    Yes                       19 0.0748
3 Urban    No                        54 0.213
4 Urban    Yes                       28 0.110
Fewer than 50 counties (47 of 254) are classified as Retirement Destinations (RDs). At a glance, there are more Urban RDs (28) than Rural ones (19). Interesting. We’ll come back to this with a crosstab, but first, let’s pull the two poverty variables into a new frequency table for this research question.
# A tibble: 14 x 6
   Nonmetro Retirement_Desti~ Persistent_Pove~ Persistent_Child~     n
   <chr>    <chr>             <chr>            <chr>             <int>
 1 Rural    No                No               No                   74
 2 Rural    No                No               Yes                  46
 3 Rural    No                Yes              No                    1
 4 Rural    No                Yes              Yes                  32
 5 Rural    Yes               No               No                   15
 6 Rural    Yes               No               Yes                   2
 7 Rural    Yes               Yes              Yes                   2
 8 Urban    No                No               No                   38
 9 Urban    No                No               Yes                   7
10 Urban    No                Yes              Yes                   9
11 Urban    Yes               No               No                   25
12 Urban    Yes               No               Yes                   1
13 Urban    Yes               Yes              No                    1
14 Urban    Yes               Yes              Yes                   1
# ... with 1 more variable: prop <dbl>
That’s a little un-intuitive, but hopefully visualizing the data will help us understand the table better.
We can see some quick points of interest, though. First, there are four Rural RD counties that have some form of persistent poverty. That’s about 21% of all 19 Rural RDs. For Urban RDs, only three of 28 have some form of persistent poverty. That’s about 11%, roughly half the share seen among Rural RDs.
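To double-check those shares, we can tally the retirement-destination rows of the table by hand. This is just arithmetic on the counts shown above (rows 5 to 7 for Rural RDs, rows 11 to 14 for Urban RDs), not a rerun of the original analysis.

```r
# RD counties with at least one persistent poverty flag
rd_any_poverty <- c(Rural = 2 + 2,           # rows 6 and 7
                    Urban = 1 + 1 + 1)       # rows 12, 13 and 14

# All RD counties in each group
rd_total <- c(Rural = 15 + 2 + 2,            # rows 5 to 7
              Urban = 25 + 1 + 1 + 1)        # rows 11 to 14

rd_poverty_share <- rd_any_poverty / rd_total * 100
round(rd_poverty_share, 1)
```

With only a handful of counties in each cell, these shares can swing a lot if a single county is reclassified.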
We’ll complete this analysis with crosstabs and proportional crosstabs to begin answering Research Question #2: Are Rural Texas counties more likely to experience persistent poverty compared with their Urban counterparts? What about persistent child poverty?
xtabs(~ Nonmetro + Persistent_Poverty, RuralAtlasData23)
        Persistent_Poverty
Nonmetro  No Yes
   Rural 137  35
   Urban  71  11
And then the proportional crosstabs.
prop.table(xtabs(~ Nonmetro + Persistent_Poverty, RuralAtlasData23))*100
        Persistent_Poverty
Nonmetro        No       Yes
   Rural 53.937008 13.779528
   Urban 27.952756  4.330709
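These percentages are shares of all 254 Texas counties, so the Rural figures are larger partly because there are more than twice as many Rural counties as Urban ones. Comparing within each group instead, 35 of 172 Rural counties (about 20%) are classified as being in Persistent Poverty versus 11 of 82 Urban counties (about 13%), so Rural counties are roughly 1.5x as likely to be classified this way. Taken together with Research Question #1, a Texas Rural county is more likely than an Urban one to experience both population loss (about 4.5x) and persistent poverty (about 1.5x).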
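Since `prop.table()` without a `margin` divides every cell by the grand total, a within-group comparison needs `margin = 1`. Here is a small sketch using the counts from the crosstab above; with the real data, the same argument could be added to the `prop.table(xtabs(...))` call.

```r
# Counts copied from the Persistent_Poverty crosstab
poverty_counts <- matrix(
  c(137, 35,
     71, 11),
  nrow = 2, byrow = TRUE,
  dimnames = list(Nonmetro = c("Rural", "Urban"),
                  Persistent_Poverty = c("No", "Yes"))
)

# margin = 1 makes each row sum to 100%, so Rural and Urban are comparable
poverty_row_pct <- prop.table(poverty_counts, margin = 1) * 100
round(poverty_row_pct, 1)

# Ratio of the Rural rate to the Urban rate
poverty_ratio <- poverty_row_pct["Rural", "Yes"] / poverty_row_pct["Urban", "Yes"]
round(poverty_ratio, 2)
```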
We’ll repeat this process for Persistent Child Poverty.
xtabs(~ Nonmetro + Persistent_Child_Poverty, RuralAtlasData23)
        Persistent_Child_Poverty
Nonmetro No Yes
   Rural 90  82
   Urban 64  18
And then the proportional crosstabs.
prop.table(xtabs(~ Nonmetro + Persistent_Child_Poverty, RuralAtlasData23))*100
        Persistent_Child_Poverty
Nonmetro        No       Yes
   Rural 35.433071 32.283465
   Urban 25.196850  7.086614
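Looks very similar to Persistent Poverty, save one striking difference: more than twice as many Rural counties are classified under Persistent Child Poverty (82) as under Persistent Poverty (35). Within each group, about 48% of Rural counties (82 of 172) are in Persistent Child Poverty versus about 22% of Urban counties (18 of 82). So it looks like deeply entrenched poverty in Rural counties weighs more heavily on children than on the population as a whole, although these are county-level classifications rather than counts of people.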
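The comparisons between the two poverty measures can be made explicit with a tiny helper. `yes_rate()` is a hypothetical function of my own, and the counts are copied from the two crosstabs above.

```r
# Counts copied from the two crosstabs (rows: Nonmetro, columns: No / Yes)
make_tab <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(Nonmetro = c("Rural", "Urban"), Status = c("No", "Yes")))
}
poverty       <- make_tab(c(137, 35, 71, 11))
child_poverty <- make_tab(c(90, 82, 64, 18))

# Hypothetical helper: share of "Yes" counties within each Nonmetro group
yes_rate <- function(tab) tab[, "Yes"] / rowSums(tab)

child_rate <- yes_rate(child_poverty)
round(child_rate * 100, 1)

# Rural vs. Urban ratio for persistent child poverty
child_ratio <- child_rate[["Rural"]] / child_rate[["Urban"]]

# Rural counties flagged for child poverty relative to those flagged for poverty
rural_child_vs_all <- child_poverty["Rural", "Yes"] / poverty["Rural", "Yes"]

round(c(child_ratio = child_ratio, rural_child_vs_all = rural_child_vs_all), 2)
```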
However, let’s not jump to any conclusions just yet and integrate poverty with retirement destinations to see if there’s any overlap. That will be one step of this RQ’s visualization process.
Due to the categorical nature of this current dataset, we are going to use bar charts for our univariate and bivariate graphs. We’ll focus on Retirement Destinations for both initial plots.
Nothing really amazing here. Most counties are not retirement destinations, by more than 4:1 (207 to 47). Let’s add some color to this graph.
Now we’re getting somewhere! It looks like there are more Urban Retirement Destinations, both in count (28 versus 19) and as a share of their group. This suggests that older individuals may be moving not to the countryside but to the city, though county classifications alone cannot confirm who is moving where.
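The two plots described above might be produced along these lines. This is a sketch on stand-in data (`toy_rd`, rebuilt from the frequency table counts), and the labels are my own choices rather than the originals.

```r
library(ggplot2)

# Stand-in data rebuilt from the Retirement_Destination frequency table
cell_sizes <- c(153, 19, 54, 28)
toy_rd <- data.frame(
  Nonmetro               = rep(c("Rural", "Rural", "Urban", "Urban"), times = cell_sizes),
  Retirement_Destination = rep(c("No", "Yes", "No", "Yes"), times = cell_sizes)
)

# Univariate: one bar per Retirement_Destination value
p_uni <- ggplot(toy_rd, aes(Retirement_Destination)) +
  geom_bar() +
  labs(x = "Retirement Destination", y = "Number of counties") +
  theme_minimal()

# Bivariate: the same bars, stacked and colored by Nonmetro status
p_bi <- ggplot(toy_rd, aes(Retirement_Destination, fill = Nonmetro)) +
  geom_bar() +
  labs(x = "Retirement Destination", y = "Number of counties") +
  theme_minimal()

p_uni
p_bi
```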
But this simple bar chart is pretty boring, still. We can look at proportions by editing the position from stack to fill and updating the colors / labels.
ggplot(RuralAtlasData23,
       aes(Retirement_Destination,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Retirement Destination",
       title = "More Than Half of All Texas Retirement Destinations are in Urban Counties") +
  theme_minimal()
That’s much better. And while a bar chart is still not that exciting of a data visualization, it tells us a little bit about the Retirement Destination column. Let’s add the two poverty variables to a facet grid and see how these four variables compare.
# Rows of the grid are Persistent_Poverty; columns are Persistent_Child_Poverty
ggplot(RuralAtlasData23,
       aes(Retirement_Destination,
           fill = Nonmetro)) +
  geom_bar(position = "fill") +
  facet_grid(vars(Persistent_Poverty),
             vars(Persistent_Child_Poverty)) +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent",
       x = "Retirement Destination",
       title = "Retirement Destinations By Nonmetro Status and Persistent Poverty") +
  theme_minimal()
This is still a little hard to read due to the categorical variables all being Y/N. I’m not sure how to add Axis Labels on a facet_grid, so that’ll be something I’ll need to research for Homework #6.
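One option worth trying is the `labeller` argument of `facet_grid()`: `label_both` prints the variable name next to each value, so the strips read "Persistent_Poverty: Yes" instead of a bare "Yes". The sketch below uses a stand-in data frame (`cells`) holding the 14 cell counts from the frequency table, with `weight = n` so each row counts as that many counties; on the real data, the `weight` aesthetic would not be needed.

```r
library(ggplot2)

# Stand-in data: the 14 cells of the frequency table and their counts
cells <- data.frame(
  Nonmetro = rep(c("Rural", "Urban"), each = 7),
  Retirement_Destination   = c("No", "No", "No", "No", "Yes", "Yes", "Yes",
                               "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  Persistent_Poverty       = c("No", "No", "Yes", "Yes", "No", "No", "Yes",
                               "No", "No", "Yes", "No", "No", "Yes", "Yes"),
  Persistent_Child_Poverty = c("No", "Yes", "No", "Yes", "No", "Yes", "Yes",
                               "No", "Yes", "Yes", "No", "Yes", "No", "Yes"),
  n = c(74, 46, 1, 32, 15, 2, 2, 38, 7, 9, 25, 1, 1, 1)
)

p_facet <- ggplot(cells, aes(Retirement_Destination, weight = n, fill = Nonmetro)) +
  geom_bar(position = "fill") +
  # label_both adds the variable name to every facet strip
  facet_grid(rows = vars(Persistent_Poverty),
             cols = vars(Persistent_Child_Poverty),
             labeller = label_both) +
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_brewer(palette = "Paired") +
  labs(y = "Percent of counties", x = "Retirement Destination") +
  theme_minimal()

p_facet
```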
Regardless, if we look at the bottom-right panel (both types of persistent poverty), we see the trifecta: two of the three RDs in that panel are Rural. In the bottom-left panel (persistent poverty without persistent child poverty), the only RD is Urban, but a single county tells us little.
Returning to the research question: are Rural Texas counties that are retirement destinations more likely to experience persistent poverty compared with their Urban counterparts? What about persistent child poverty?
Per this chart and the frequency table, it looks like a tentative yes. Rural RDs are about twice as likely as their Urban counterparts to have one or both of the persistent poverty variables (4 of 19, about 21%, versus 3 of 28, about 11%), though the counts are small.
This will be completed at Homework #6.
Wrapping up Homework #5, it appears I have a few minor items to address before Homework #6: 1) RQ1 needs its axis labels changed to percent and its colors updated to a colorblind-friendly palette, 2) RQ2 needs updated facet_grid labels, percent axis labels, and updated colors, and 3) RQ3 needs to be completed in full.
I do not think I will have time to join this dataset with a separate sheet (Income). I had wanted to do so because the categorical-only variables limit the analysis and visualization. However, this constraint lets me drill deeper into categorical visualizations. I would still like to add geom_point and improve the faceting. I’m not sure how much time RQ3 will leave for this, but we shall see.
I also need to clean up and tighten the code/writing for this report. This section will be rewritten for Homework #6 and hopefully provide some answers to the concluding thoughts in Homework #5.