##.Project 2 : Poverty and Median Household Income in the United States
Introduction
####. My project focuses on analyzing poverty estimates, poverty percentages, and median household income across different states in the U.S.. The data set I chose came from the U.S. Census Bureau’s Small Area Income and Poverty Estimates (SAIPE) Program, released in December 2023. The variables included are: Poverty Estimate and Poverty Percent for All Ages, Age 0-17, Age 5-17 in Families, Age 0-4 - Median Household Income, and U.S regions, I specifically chose this because of my own personal experiences in Maryland and seeing the wide range of disparity among Median Income and poverty growing up. I wanted to extend this interest across the U.S and learn about our various regions.
####. The data has been cleaned and prepared for analysis, including handling missing values, outliers, and ensuring data consistency across variables. I had to remove extraneous white spaces that caused NA values and also commas that caused variable changes in the data set () The most tedious part of this project was preparing it and making sure it was ready for usage. I struggled with conventional naming for the data columns but the data itself was fine.
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# Set working directorysetwd("/Users/jasonlaucel/Data 110 Folder")# Read in the datasetdata <-read.csv("Household22 1.csv")# Define clean column namesclean_col_names <-c("Geographic_Information", "All_Ages", "Poverty_Estimate_All_Ages","Poverty_Percent_All_Ages", "Poverty_Estimate_Age_0_17", "Poverty_Percent_Age_0_17","Median_Household_Income", "Age_0_4", "Poverty_Estimate_Age_5_17_in_Families","Poverty_Percent_Age_5_17_in_Families")# Assign the cleaned column names to the datasetcolnames(data) <- clean_col_names# Remove the first row of titlesdata <- data[-1, ]# Convert numeric columns to numeric data typenumeric_cols <-c("All_Ages", "Poverty_Estimate_All_Ages", "Poverty_Percent_All_Ages","Poverty_Estimate_Age_0_17", "Poverty_Percent_Age_0_17","Median_Household_Income", "Age_0_4","Poverty_Estimate_Age_5_17_in_Families", "Poverty_Percent_Age_5_17_in_Families")data[numeric_cols] <-lapply(data[numeric_cols], as.numeric)
# Check summary statistics of numeric columnssummary(data)
Geographic_Information All_Ages Poverty_Estimate_All_Ages
Length:52 Min. : 63555 Min. : 7.40
Class :character 1st Qu.: 209735 1st Qu.:10.57
Mode :character Median : 567764 Median :11.95
Mean : 1575062 Mean :12.37
3rd Qu.: 896311 3rd Qu.:13.32
Max. :40951625 Max. :19.20
Poverty_Percent_All_Ages Poverty_Estimate_Age_0_17 Poverty_Percent_Age_0_17
Min. : 12569 Min. : 7.70 Min. : 8462
1st Qu.: 61354 1st Qu.:12.40 1st Qu.: 42485
Median : 156490 Median :15.15 Median : 110958
Mean : 445498 Mean :15.59 Mean : 314432
3rd Qu.: 253074 3rd Qu.:17.80 3rd Qu.: 180136
Max. :11582950 Max. :26.40 Max. :8176725
Median_Household_Income Age_0_4 Poverty_Estimate_Age_5_17_in_Families
Min. : 6.60 Min. :52788 Min. : 3454
1st Qu.:11.68 1st Qu.:66886 1st Qu.: 16748
Median :14.05 Median :72314 Median : 42608
Mean :14.76 Mean :74497 Mean : 120812
3rd Qu.:16.82 3rd Qu.:83217 3rd Qu.: 70350
Max. :26.20 Max. :99897 Max. :3141107
Poverty_Percent_Age_5_17_in_Families
Min. : 9.10
1st Qu.:13.60
Median :16.05
Mean :16.76
3rd Qu.:19.07
Max. :26.50
# Scatter plot for Poverty Percent All Ages by Stateggplot(data, aes(x =reorder(Geographic_Information, Poverty_Percent_All_Ages), y = Poverty_Percent_All_Ages)) +geom_point() +labs(x ="State",y ="Poverty Estimate All Ages",title ="Poverty Estimate All Ages by State",caption ="Data Source: U.S. Census Bureau, (SAIPE) Program" ) +theme_minimal() +theme(axis.text.x =element_text(angle =90, hjust =1, vjust =1, size =6), # Adjust x-axis labelsaxis.text.y =element_text(size =8), # Adjust y-axis tick labels font sizeaxis.title.y =element_text(size =10) # Adjust y-axis title font size ) +scale_y_continuous(labels = scales::comma_format())
# Y-axis labels as numbers
# Perform linear regression analysismodel <-lm(Poverty_Percent_All_Ages ~ Median_Household_Income, data = data)# Display summary of the regression modelsummary(model)
Call:
lm(formula = Poverty_Percent_All_Ages ~ Median_Household_Income,
data = data)
Residuals:
Min 1Q Median 3Q Max
-593510 -358501 -263106 -151512 11116490
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27624 844507 0.033 0.974
Median_Household_Income 28312 55180 0.513 0.610
Residual standard error: 1611000 on 50 degrees of freedom
Multiple R-squared: 0.005238, Adjusted R-squared: -0.01466
F-statistic: 0.2633 on 1 and 50 DF, p-value: 0.6101
# Define regions based on Geographic locationsnortheast <-c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania")midwest <-c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin", "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota")south <-c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "Washington DC", "West Virginia", "Alabama", "Kentucky", "Mississippi", "Tennessee", "Arkansas", "Louisiana", "Oklahoma", "Texas")west <-c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming", "Alaska", "California", "Oregon", "Washington")# Add a new categorical column for region based on state names in North Americadata <- data %>%mutate(Region =case_when( Geographic_Information %in% northeast ~"Northeast", Geographic_Information %in% midwest ~"Midwest", Geographic_Information %in% south ~"South", Geographic_Information %in% west ~"West",TRUE~"Other"# If the state doesn't match any region, categorize as "Other", ( This was just a safe measure) ) ) %>%filter( Region !="Other"# Exclude states or regions outside of North America )# Verify the Region columnhead(data)
# Bar plot for median household income by regionplot <-ggplot(data, aes(x =reorder(Geographic_Information, Median_Household_Income), y = Median_Household_Income, fill = Geographic_Information)) +geom_bar(stat ="identity") +scale_y_continuous(labels = scales::comma_format()) +# Format y-axis labels as regular numberslabs(x ="States",y ="Median Household Income",title ="Median Household Income by U.S States",caption ="Data Source: U.S. Census Bureau, (SAIPE) Program" ) +theme_minimal() +theme(legend.position ="none", # Remove legendplot.title =element_text(size =8), # Adjust title font sizeaxis.text.x =element_text(angle =90, hjust =1, size =6) # Rotate x-axis labels for better readability and reduce font size )# Use plotly for interactivity plotly::ggplotly(plot) %>% plotly::layout(yaxis =list(tickformat =",.0f"# Format y-axis labels ) )
# Summarize data by regionregion_summary <- data %>%group_by(Region) %>%summarise(Median_Household_Income =median(Median_Household_Income, na.rm =TRUE),Poverty_Percent_All_Ages =median(Poverty_Percent_All_Ages, na.rm =TRUE))# Scatter plot for median household income vs poverty percent for all agesplot <-ggplot(region_summary, aes(x = Poverty_Percent_All_Ages, y = Median_Household_Income, color = Region)) +geom_point() +labs(x ="Median Household Income",y ="Poverty Percent for All Ages",title ="Median H.H Income vs Poverty Percent for All Ages in U.S Regions",caption ="Data Source: U.S. Census Bureau, (SAIPE) Program" ) +theme_minimal() +theme(legend.position ="bottom", # Display legend at the bottomplot.title =element_text(size =8), # Adjust title font sizelegend.title =element_blank() # Remove legend title )# Use plotly for interactivity and format y-axis labelsplotly::ggplotly(plot) %>% plotly::layout(yaxis =list(tickformat =",.0f"# Format y-axis labels as regular numbers without decimals ) )
The topic of the data is Median Household Income and also Poverty percentages in the U.S by region. The original dataset didn’t include categorization by state but that’s something I added later. The variables I used were categorical and quantitative, the percentage, the median income, and also regions. The data came from the U.S Census Bureau (SAIPE) program. When I was researching this topic I learned the reason for researching Income disparities is to fund specific under funded school districts.Based on these estimate calculations from 2022 the fiscal year 2024 will be used as a time to allocate appropriate funds where needed. According to the Census Bureau, the calculations of median income for counties is $60,833.Poverty rates averaged out to 13.6 %. I would say my visualization roughly reflects these estimations. Besides understanding data I also used another website to understand the context of poverty, how it’s been addressed, how it’s developing, and the established parameters of poverty across the various states/regions of the U.S.####. Census Bureau Releases Small Area Income and Poverty Estimates for States, Counties and School Districts. (2023, December 14). Census Bureau. Retrieved April 13, 2024, from https://www.census.gov/newsroom/press-releases/2023/saipe.htmlPoverty Guidelines | ASPE. (n.d.). ASPE. Retrieved April 13, 2024, from https://aspe.hhs.gov/topics/poverty-economic-mobility/poverty-guidelines