Project 1

Author

Daniel Johnson

Intro

This dataset comes from CORGIS and is essentially a collection of census data combined into one large dataset. It is all data from the census that is conducted every 10 years. It covers every county in every state (plus DC) and includes a wide range of variables for each county: the percentage of the population in each ethnic group, household and per capita income, a couple of education-related variables, and about two dozen more. We only care about two of them, however. The first is “Education.Bachelor’s Degree or Higher”, the percentage of the population 25 years and older with a bachelor’s degree or higher (associate’s degrees are not counted). The second is per capita income, which is the total income of all people in the county divided by the county’s population. I intend to explore the relationship between these two variables and graph it in the sections below:

Intro Code

library(tidyverse)
data <- read_csv("county_demographics.csv")

Load the needed library and read the dataset into R.
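Before plotting, it can help to confirm that the two columns of interest loaded as expected. The sketch below uses a small made-up data frame named toy (a hypothetical stand-in, not the real file) that borrows the column names from the dataset; with the real data, the same checks could be run on data instead.

```r
# Toy stand-in for the loaded data; the two rows are made up for illustration.
toy <- data.frame(
  State = c("MD", "VA"),
  `Education.Bachelor's Degree or Higher` = c(30.5, 22.1),
  `Income.Per Capita Income` = c(35000, 28000),
  check.names = FALSE  # keep the spaces and apostrophe in the column names
)

# The two columns this analysis relies on.
needed <- c("Education.Bachelor's Degree or Higher", "Income.Per Capita Income")

# Confirm both columns are present and numeric before plotting or modeling.
has_cols   <- all(needed %in% names(toy))
is_numeric <- all(vapply(toy[needed], is.numeric, logical(1)))
c(has_cols = has_cols, is_numeric = is_numeric)
```

Because the column names contain spaces and an apostrophe, they need backticks when used as bare names inside aes() or a model formula, and plain quotes when used as strings.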

Statistical analysis

First I made an exploratory plot to decide whether the relationship between the percentage of the population with a college degree and a county’s per capita income was worth exploring. The plot is a simple scatterplot of all counties in the dataset, with the percentage with a college degree on the x-axis and county per capita income on the y-axis, plus a linear model line thrown in for fun.

ggplot(data, aes(x = `Education.Bachelor's Degree or Higher`, y = `Income.Per Capita Income`)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x)

This chart seems to indicate a positive correlation, which is promising. The next step is a linear regression to see to what extent the two variables are related.
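A quick way to put a number on "some kind of correlation" is the Pearson correlation coefficient. The sketch below runs on a made-up five-row data frame named toy (hypothetical values, chosen only so the code runs on its own); on the real data, the same call would use the columns of data.

```r
# Toy stand-in for the county data; these five rows are made up for illustration.
toy <- data.frame(
  `Education.Bachelor's Degree or Higher` = c(10, 20, 30, 40, 50),
  `Income.Per Capita Income` = c(21000, 27500, 32000, 38500, 44000),
  check.names = FALSE
)

# Pearson correlation between the two columns (ranges from -1 to 1).
r <- cor(toy[["Education.Bachelor's Degree or Higher"]],
         toy[["Income.Per Capita Income"]])
r
```

A value near 1 means the points fall close to a rising straight line; the toy rows were picked to be nearly linear, so the real counties should be expected to give a lower value.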

Here I created a linear model using the percent with a college degree and per capita income for the whole country:

linear <- lm(`Income.Per Capita Income` ~ `Education.Bachelor's Degree or Higher`, data = data)
summary(linear)


Call:
lm(formula = `Income.Per Capita Income` ~ `Education.Bachelor's Degree or Higher`, 
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-22910.8  -2206.4    -54.1   2299.8  26436.9 

Coefficients:
                                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                             15737.713    184.234   85.42   <2e-16
`Education.Bachelor's Degree or Higher`   561.459      7.685   73.06   <2e-16
                                           
(Intercept)                             ***
`Education.Bachelor's Degree or Higher` ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4120 on 3137 degrees of freedom
Multiple R-squared:  0.6299,    Adjusted R-squared:  0.6297 
F-statistic:  5338 on 1 and 3137 DF,  p-value: < 2.2e-16

Based on the linear model, the equation for our linear estimate is y = 15,737.713 + 561.459(x). This means that for every 1 percentage point increase in people who have a college degree, the county’s per capita income increases by about $561, and if no one has a degree the per capita income should be a little less than $16,000.
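To make the fitted line concrete, here is a small sketch that plugs a few education percentages into it. The intercept and slope are copied from the summary output above; the function name predict_income is just a label chosen for this example.

```r
# Coefficients copied from the summary(linear) output above.
intercept <- 15737.713
slope     <- 561.459

# Predicted per capita income (in dollars) for a given percentage with a degree.
predict_income <- function(pct_degree) {
  intercept + slope * pct_degree
}

# Predictions for counties where 0%, 10%, and 30% of adults hold a degree.
predict_income(c(0, 10, 30))
```

For a county where 30% of adults hold a degree, the line predicts a per capita income of about $32,581.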

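We can also see that the p-value is extremely small (less than 2.2 × 10^-16), which is very strong evidence that a county’s education level and its per capita income are related. The R-squared of 0.63 means that education level explains about 63% of the variation in per capita income between counties. A regression like this shows an association, though, and cannot by itself prove that education causes higher income.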
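The numbers in the summary output can be cross-checked with a few lines of R. This sketch only uses values copied from that output; the variable names are chosen for the example. It shows the correlation implied by R-squared and why the p-value is printed as a bound instead of an exact number.

```r
# Values copied from the summary(linear) output above.
r_squared <- 0.6299
t_slope   <- 73.06
df_resid  <- 3137

# With a single predictor, the size of the correlation is the square root of R-squared.
r <- sqrt(r_squared)

# Two-sided p-value for the slope, computed on the log scale to avoid underflow.
log_p <- log(2) + pt(-abs(t_slope), df = df_resid, log.p = TRUE)

# R's coefficient table reports p-values below machine epsilon only as a bound.
eps <- .Machine$double.eps

round(r, 2)        # correlation implied by the model
log_p < log(eps)   # TRUE when the p-value is below the printing floor
```

The "<2e-16" in the table is therefore a floor on what R prints, not the p-value itself, which is far smaller.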

Final Visualization

For my final visualization I wanted to expand on my exploratory plot from earlier, but make it fancier and more locally relevant. The first step is to narrow down how many counties appear in the visualization, since, as you can see above, there is a massive blob in the middle where you can’t differentiate any of the counties. I therefore decided to show only Maryland and every state (plus DC) that borders us, to see how our neighbors are doing compared to ourselves. This means I had to use the filter function to keep only those states in a new dataframe.

mid_atlantic <- data |>
  filter(State %in% c("VA", "WV", "DC", "MD", "DE", "PA"))

I decided to keep the scatterplot, but I gave each state a different color from a ColorBrewer palette and a different shape, to help show the area of the graph that each state occupies.

ggplot(mid_atlantic, aes(x = `Education.Bachelor's Degree or Higher`, y = `Income.Per Capita Income`, color = State, shape = State)) +
  geom_point(size = 2.5) +
  # geom_smooth(method = 'lm', formula = y ~ x) +  # disabled: draws one line per state
  scale_shape_manual(values = 15:20) +  # one solid shape per state
  scale_x_continuous(limits = c(0, NA)) +  # start both axes at 0
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Percentage of Population with College Degree",
       y = "Per Capita Income ($)",
       color = "State",
       shape = "State",
       title = "Percentage of Population with College Degree vs Per Capita Income in Mid-Atlantic*",
       caption = "Source: US Census Bureau\n*All data at county level") +
  scale_color_brewer(palette = "Set2") +
  theme_bw()

Final Essay

For what I wanted to do with this dataset, it ended up being extremely simple to clean: just filter for the wanted states. I think the final visualization is interesting compared to the exploratory one because pretty much all of the “above the trend line” outliers are gone; these are the counties where the per capita income is much higher than expected given the education level of the county.

One of the interesting things you can see in the final plot is how each state forms its own cluster along a different part of where the trend line would be. West Virginia is the lowest, followed by Pennsylvania, then Virginia, and finally Maryland with what looks like the highest average. Virginia has a massive spread compared to the other states, with some of the lowest income and education level counties and the 2 highest in the region. You can also pick out the few really urban counties in each state, with several DC metro area counties in both MD and VA towards the top of the graph, along with some counties in metro Philadelphia and a couple in metro Baltimore.

The final notable thing is all of the counties in Virginia that fall below our trend line. Upon further digging, these are caused by 2 separate factors. The first is the unique way the state handles counties: its cities are separate from the surrounding counties and are counted as counties of their own, whereas everywhere else cities are part of counties, apart from a few large cities that are their own counties (like Baltimore City). In Virginia this is the case even for small cities that you probably haven’t heard of. This means that the educated people in a rural or semi-rural area are concentrated in a small county at the center of the region, but the region is still pretty rural, so the income doesn’t match the education level.

The other factor is that several of these smaller towns and cities in Virginia are home to universities big and small, such as Charlottesville with the University of Virginia and Blacksburg with Virginia Tech. These areas have highly educated populations but more rural economies and income levels.

The only thing I didn’t get to work that I would’ve liked to add to my final visualization is a linear model line. When I tried, it made one for each state, but I just wanted one for the whole graph.
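One possible way to get a single trend line, sketched here on made-up data: ggplot2 fits a separate smoother for each group, and mapping color or shape to State in the top-level aes() creates one group per state. Moving those two mappings into geom_point() leaves the smoother with no grouping, so it fits one line through all points. The toy data frame and its column names education and income are hypothetical stand-ins for mid_atlantic and its two columns.

```r
library(ggplot2)

# Toy stand-in for mid_atlantic; the values are made up for illustration only.
toy <- data.frame(
  State = rep(c("MD", "VA", "WV"), each = 4),
  education = c(30, 35, 40, 45, 15, 25, 50, 60, 10, 12, 18, 20),
  income = c(33000, 36000, 40000, 43000,
             24000, 30000, 45000, 52000,
             20000, 22000, 25000, 27000)
)

p <- ggplot(toy, aes(x = education, y = income)) +
  # color and shape are mapped only in this layer, so they do not split the smoother
  geom_point(aes(color = State, shape = State), size = 2.5) +
  # no grouping aesthetic reaches this layer, so one line is fit to all points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  scale_color_brewer(palette = "Set2") +
  theme_bw()

p
```

The same idea should carry over to the final plot by keeping only x and y in ggplot() and putting color = State and shape = State inside geom_point().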

Source(s):

https://stackoverflow.com/questions/13701347/force-the-origin-to-start-at-0 used for scaling axis of final plot

https://ggplot2.tidyverse.org/reference/geom_point.html used for point shapes