Project 3

Author

Senay Leul Kahsay

Source: Proxycurl [https://nubela.co/blog/demographic-data-segmentation-definition-examples-uses/]

County Demographics

Introduction

For the final project, I decided to look at demographics. Specifically, demographics of all the counties in the United States. I used the “county_demographics” data set that I found from the United States Census Bureau through the CORGIS repository. The data set consists of all the 3,139 counties in the country as rows and 43 columns as information on age distributions, education levels, employment, ethnicity, household information, and other miscellaneous statistics from 2015 to 2019.

I chose this data set because it was an interesting one and had many different possibilities for analysis. The main thing I want to statistically explore is if whether higher education always leads to higher income. I will be using various variables but the main ones I will use for statistical analysis are going to be “per_capita_income”, and another quantitative variable that shows percentage of the people 25 and older that graduated with a bachelor’s degree or higher. I believe the data was collected from surveys and questionnaires.

library(tidyverse)
library(ggfortify)
library(plotly)
setwd("C:/Users/senay/OneDrive/Desktop/Scoo/Spring 2025/DATA 110/Datasets")
countydemo <- read_csv("county_demographics.csv")

Data Analysis

Exploratory Data Analysis

#looking at the first few lines
head(countydemo)

# A tibble: 6 × 43
  County           State `Age.Percent 65 and Older` `Age.Percent Under 18 Years`
  <chr>            <chr>                      <dbl>                        <dbl>
1 Abbeville County SC                          22.4                         19.8
2 Acadia Parish    LA                          15.8                         25.8
3 Accomack County  VA                          24.6                         20.7
4 Ada County       ID                          14.9                         23.2
5 Adair County     IA                          23                           21.8
6 Adair County     KY                          19.2                         19.9
# ℹ 39 more variables: `Age.Percent Under 5 Years` <dbl>,
#   `Education.Bachelor's Degree or Higher` <dbl>,
#   `Education.High School or Higher` <dbl>,
#   `Employment.Nonemployer Establishments` <dbl>,
#   `Ethnicities.American Indian and Alaska Native Alone` <dbl>,
#   `Ethnicities.Asian Alone` <dbl>, `Ethnicities.Black Alone` <dbl>,
#   `Ethnicities.Hispanic or Latino` <dbl>, …

Cleaning the data

#making the letters lowercase
names(countydemo) <- tolower(names(countydemo))

#substituting spaces with underscores
names(countydemo) <- gsub(" ", "_", names(countydemo))

#substituting periods with underscores
names(countydemo) <- gsub("[.]", "_", names(countydemo))

#removing apostrophes
names(countydemo) <- gsub("[']", "", names(countydemo))

#removing hyphens 
names(countydemo) <- gsub("[-]", "_", names(countydemo))

#the data set indicates na values as -1 so i will change it by using code from online 
countydemo[countydemo == -1] <- NA

#shortening the column names by removing the first words which are just the categories of the columns 
names(countydemo) <- gsub("age_", "", names(countydemo))
names(countydemo) <- gsub("education_", "", names(countydemo))
names(countydemo) <- gsub("ethnicities_", "", names(countydemo))
names(countydemo) <- gsub("housing_", "", names(countydemo))
names(countydemo) <- gsub("income_", "", names(countydemo))
names(countydemo) <- gsub("miscellaneous_", "", names(countydemo))
names(countydemo) <- gsub("population_", "", names(countydemo))
names(countydemo) <- gsub("sales_", "", names(countydemo))
names(countydemo) <- gsub("employment_", "", names(countydemo))

Statistical Analysis

#I begin by looking at the correlation between the variables 
cor(countydemo$bachelors_degree_or_higher, countydemo$per_capita_income)

[1] 0.7936316

There is a strong correlation(0.79) between higher income and higher education.

#making a linger regression model to try to predict per capita income based the percentage of people 25 and over with a bachelor's or higher degree
model1 <- lm(per_capita_income ~ bachelors_degree_or_higher , data = countydemo)
summary(model1)


Call:
lm(formula = per_capita_income ~ bachelors_degree_or_higher, 
    data = countydemo)

Residuals:
     Min       1Q   Median       3Q      Max 
-22910.8  -2206.4    -54.1   2299.8  26436.9 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                15737.713    184.234   85.42   <2e-16 ***
bachelors_degree_or_higher   561.459      7.685   73.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4120 on 3137 degrees of freedom
Multiple R-squared:  0.6299,    Adjusted R-squared:  0.6297 
F-statistic:  5338 on 1 and 3137 DF,  p-value: < 2.2e-16

autoplot(model1, 1:4, nrow=2, ncol = 2)

The equation for this model would be: per capita income = 15737.713 + 561.459 (percentage of people with a bachelor’s or higher degree). The y-intercept of the equation is 15738 which is the predicted per capita income when percentage of people with a bachelor’s or higher degree is 0. The slope of 561 shows that for each additional percentage increase of graduates, per capita income increases on average by 561 dollars.
p-value is extremely small, showing that the model is statistically significant.
The R-squared is 0.63 which shows that the 63% of the variation in per capita income can be explained by percentage of people with a bachelor’s or higher degree.
Based on the p-value and r-squared value, we can conclude that this is a good model. However, the diagnostic plots reveals that it is not perfect and that is has some flaws. Overall this is an okay model.

Jane Nam, a writer for bestcolleges.com, affirms that college graduates tend to make more money than those who haven’t earned a degree. She adds, “In fact, more education leads to higher salaries on average” (“Does Education Make a Difference in Salary?”). It is noteworthy, however, that this is not the case for every major or field. Overall, higher education is usually a very statistically accurate predictor for income, and the higher the educational level is, the more income.

Visualization

First Visualization

#making a subset data set with a new column of ethnicity by pivoting longer 
df2 <- countydemo |> 
  pivot_longer(cols = 9:16, 
               names_to = "ethinicities",
               values_to = "percentage"
)

p1 <- df2 |>
  ggplot(aes(x = percentage, y = homeownership_rate, color=ethinicities)) + 
  geom_point(size = 3) + 
  scale_color_brewer(palette = "Set1") + 
  theme_minimal() + 
  labs(title = "Homeownership Rates and Percentage of Different Ethincities", 
       caption = "Source: United Census Bureau", 
       x = "Percentage", 
       y = "Homeownership Rate")
p1

options(scipen = 9999) #changing scientific notation of y axis

This visualization shows the Home ownership rates of different ethnicities as well as their population percentage for various counties across the US.

Second Visualization

#creating a subset data set of the counties in the DMV area(the data set counts DC as a county)
dmvdemo <- countydemo |> 
  filter(state %in% c("MD","DC", "VA"))

p2 <- dmvdemo |> 
  ggplot(aes(x=county, y = median_houseold_income, fill = state)) +
geom_bar(stat = "identity", position = "dodge")+ #making a bar chart
  scale_fill_brewer(palette = "Set1")+ #changing deafult color package 
  theme_classic() +
  theme(axis.text.x=element_blank()) + #removing the x axis texts because there are mnay counties 
  labs(title = "Median Household Income for The DMV counties from 2015 to 2019",
       x = "County",
       y = "Median Household Income", 
       caption = "Source: United States Census Bureau")
p2

#i incorporated interactivity to make it possible to gain information about specific counties 
ggplotly(p2)

My second visualization is a bar chart of the median household income of the DMV counties. At first, we can only see that the county with the highest median household income belonged to Virginia. However, I added interactivity to identify each county. The county with the highest median household income is Loudon County.

Third Visualization

Link: https://public.tableau.com/app/profile/senay.leul.kahsay/viz/FinalProject-ThirdVisualization/Sheet1?publish=yes

For my third and final visualization, I used Tableau to make a heat map of the US that shows percentage of people 65 and above. I changed the default color and added the source as well as the title. Something that stood out to me was that the east side of the country had a more balanced percentage of people aged 65 and above, always somewhere around 20%. However, there is much more variety on the west side, some counties reaching 40% and others below 10%.

Summary

I cleaned the data set thoroughly and did some statistical analysis to show how higher education can predict higher income. I found my linear to be overall decent with a very small p-value and a high R-squared value. However, the diagnostic plots revealed that the model was not perfect. Then, I made my first visualization that shows Home ownership rates of different ethnicities as well as their population percentage for various counties across the US. After that, I made my second visualization, an interactive bar chart that shows the Median Household Income for The DMV counties from 2015 to 2019. It revealed that Loudoun County from Virginia had the highest median household income with 142,300 dollars. Finally, I made my last visualization in Tableau, a heat map of the US that shows percentage of people 65 and above. Something that stood out to me from this visualization is how the east side of the country has almost the same percentage of elderly, while the west side had more variety with some counties and states reaching 40%.

References

Code to replace -1 values with NA
Nam, Jane. “Does Education Make a Difference in Salary?: BestColleges.” BestColleges.Com, Best Colleges, 21 Feb. 2024, www.bestcolleges.com/research/does-education-make-difference-in-salary/.