Project 2 new- Jason Laucel

##.Project 2 : Poverty and Median Household Income in the United States

Introduction

####. My project focuses on analyzing poverty estimates, poverty percentages, and median household income across different states in the U.S.. The data set I chose came from the U.S. Census Bureau’s Small Area Income and Poverty Estimates (SAIPE) Program, released in December 2023. The variables included are: Poverty Estimate and Poverty Percent for All Ages, Age 0-17, Age 5-17 in Families, Age 0-4 - Median Household Income, and U.S regions, I specifically chose this because of my own personal experiences in Maryland and seeing the wide range of disparity among Median Income and poverty growing up. I wanted to extend this interest across the U.S and learn about our various regions.

####. The data has been cleaned and prepared for analysis, including handling missing values, outliers, and ensuring data consistency across variables. I had to remove extraneous white spaces that caused NA values and also commas that caused variable changes in the data set () The most tedious part of this project was preparing it and making sure it was ready for usage. I struggled with conventional naming for the data columns but the data itself was fine.

# Call appropriate libraries
library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

# Set working directory
setwd("/Users/jasonlaucel/Data 110 Folder")

# Read in the dataset
data <- read.csv("Household22 1.csv")

# Define clean column names
clean_col_names <- c(
  "Geographic_Information", "All_Ages", "Poverty_Estimate_All_Ages",
  "Poverty_Percent_All_Ages", "Poverty_Estimate_Age_0_17", "Poverty_Percent_Age_0_17",
  "Median_Household_Income", "Age_0_4", "Poverty_Estimate_Age_5_17_in_Families",
  "Poverty_Percent_Age_5_17_in_Families"
)

# Assign the cleaned column names to the dataset
colnames(data) <- clean_col_names

# Remove the first row of titles
data <- data[-1, ]

# Convert numeric columns to numeric data type
numeric_cols <- c("All_Ages", "Poverty_Estimate_All_Ages", "Poverty_Percent_All_Ages",
                   "Poverty_Estimate_Age_0_17", "Poverty_Percent_Age_0_17",
                   "Median_Household_Income", "Age_0_4",
                   "Poverty_Estimate_Age_5_17_in_Families", "Poverty_Percent_Age_5_17_in_Families")

data[numeric_cols] <- lapply(data[numeric_cols], as.numeric)

# Explore the cleaned data
head(data)

  Geographic_Information All_Ages Poverty_Estimate_All_Ages
2          United States 40951625                      12.6
3                Alabama   798469                      16.2
4                 Alaska    77305                      10.8
5                Arizona   904567                      12.5
6               Arkansas   481650                      16.3
7             California  4673718                      12.2
  Poverty_Percent_All_Ages Poverty_Estimate_Age_0_17 Poverty_Percent_Age_0_17
2                 11582950                      16.3                  8176725
3                   237861                      21.8                   165972
4                    23113                      13.4                    15001
5                   250329                      16.0                   174580
6                   144789                      21.2                    99080
7                  1275904                      15.3                   926365
  Median_Household_Income Age_0_4 Poverty_Estimate_Age_5_17_in_Families
2                    15.5   74755                               3141107
3                    20.6   59703                                 68982
4                    11.9   88072                                  7304
5                    14.9   74355                                 69229
6                    19.6   55505                                 42721
7                    14.8   91517                                318877
  Poverty_Percent_Age_5_17_in_Families
2                                 17.4
3                                 24.4
4                                 16.1
5                                 18.0
6                                 24.7
7                                 15.4

# Check summary statistics of numeric columns
summary(data)

 Geographic_Information    All_Ages        Poverty_Estimate_All_Ages
 Length:52              Min.   :   63555   Min.   : 7.40            
 Class :character       1st Qu.:  209735   1st Qu.:10.57            
 Mode  :character       Median :  567764   Median :11.95            
                        Mean   : 1575062   Mean   :12.37            
                        3rd Qu.:  896311   3rd Qu.:13.32            
                        Max.   :40951625   Max.   :19.20            
 Poverty_Percent_All_Ages Poverty_Estimate_Age_0_17 Poverty_Percent_Age_0_17
 Min.   :   12569         Min.   : 7.70             Min.   :   8462         
 1st Qu.:   61354         1st Qu.:12.40             1st Qu.:  42485         
 Median :  156490         Median :15.15             Median : 110958         
 Mean   :  445498         Mean   :15.59             Mean   : 314432         
 3rd Qu.:  253074         3rd Qu.:17.80             3rd Qu.: 180136         
 Max.   :11582950         Max.   :26.40             Max.   :8176725         
 Median_Household_Income    Age_0_4      Poverty_Estimate_Age_5_17_in_Families
 Min.   : 6.60           Min.   :52788   Min.   :   3454                      
 1st Qu.:11.68           1st Qu.:66886   1st Qu.:  16748                      
 Median :14.05           Median :72314   Median :  42608                      
 Mean   :14.76           Mean   :74497   Mean   : 120812                      
 3rd Qu.:16.82           3rd Qu.:83217   3rd Qu.:  70350                      
 Max.   :26.20           Max.   :99897   Max.   :3141107                      
 Poverty_Percent_Age_5_17_in_Families
 Min.   : 9.10                       
 1st Qu.:13.60                       
 Median :16.05                       
 Mean   :16.76                       
 3rd Qu.:19.07                       
 Max.   :26.50

# Scatter plot for Poverty Percent All Ages by State
ggplot(data, aes(x = reorder(Geographic_Information, Poverty_Percent_All_Ages), y = Poverty_Percent_All_Ages)) +
  geom_point() +
  labs(
    x = "State",
    y = "Poverty Estimate All Ages",
    title = "Poverty Estimate All Ages by State",
    caption = "Data Source: U.S. Census Bureau, (SAIPE) Program"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1, size = 6),  # Adjust x-axis labels
    axis.text.y = element_text(size = 8),  # Adjust y-axis tick labels font size
    axis.title.y = element_text(size = 10)  # Adjust y-axis title font size
  ) +
  scale_y_continuous(labels = scales::comma_format())

# Y-axis labels as numbers

# Perform linear regression analysis
model <- lm(Poverty_Percent_All_Ages ~ Median_Household_Income, data = data)

# Display summary of the regression model
summary(model)


Call:
lm(formula = Poverty_Percent_All_Ages ~ Median_Household_Income, 
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
 -593510  -358501  -263106  -151512 11116490 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                27624     844507   0.033    0.974
Median_Household_Income    28312      55180   0.513    0.610

Residual standard error: 1611000 on 50 degrees of freedom
Multiple R-squared:  0.005238,  Adjusted R-squared:  -0.01466 
F-statistic: 0.2633 on 1 and 50 DF,  p-value: 0.6101

# Define regions based on Geographic locations

northeast <- c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania")
midwest <- c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin", "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota")
south <- c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "Washington DC", "West Virginia", "Alabama", "Kentucky", "Mississippi", "Tennessee", "Arkansas", "Louisiana", "Oklahoma", "Texas")
west <- c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming", "Alaska", "California", "Oregon", "Washington")

# Add a new categorical column for region based on state names in North America
data <- data %>%
  mutate(
    Region = case_when(
      Geographic_Information %in% northeast ~ "Northeast",
      Geographic_Information %in% midwest ~ "Midwest",
      Geographic_Information %in% south ~ "South",
      Geographic_Information %in% west ~ "West",
      TRUE ~ "Other"  # If the state doesn't match any region, categorize as "Other", ( This was just a safe measure)
    )
  ) %>%
  filter(
    Region != "Other"  # Exclude states or regions outside of North America
  )

# Verify the Region column
head(data)

  Geographic_Information All_Ages Poverty_Estimate_All_Ages
1                Alabama   798469                      16.2
2                 Alaska    77305                      10.8
3                Arizona   904567                      12.5
4               Arkansas   481650                      16.3
5             California  4673718                      12.2
6               Colorado   542333                       9.5
  Poverty_Percent_All_Ages Poverty_Estimate_Age_0_17 Poverty_Percent_Age_0_17
1                   237861                      21.8                   165972
2                    23113                      13.4                    15001
3                   250329                      16.0                   174580
4                   144789                      21.2                    99080
5                  1275904                      15.3                   926365
6                   133767                      11.1                    95548
  Median_Household_Income Age_0_4 Poverty_Estimate_Age_5_17_in_Families
1                    20.6   59703                                 68982
2                    11.9   88072                                  7304
3                    14.9   74355                                 69229
4                    19.6   55505                                 42721
5                    14.8   91517                                318877
6                    10.7   89096                                 34104
  Poverty_Percent_Age_5_17_in_Families Region
1                                 24.4  South
2                                 16.1   West
3                                 18.0   West
4                                 24.7  South
5                                 15.4   West
6                                 11.2   West

# Bar plot for median household income by region
plot <- ggplot(data, aes(x = reorder(Geographic_Information, Median_Household_Income), y = Median_Household_Income, fill = Geographic_Information)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::comma_format()) +  # Format y-axis labels as regular numbers
  labs(
    x = "States",
    y = "Median Household Income",
    title = "Median Household Income by U.S States",
    caption = "Data Source: U.S. Census Bureau, (SAIPE) Program"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",  # Remove legend
    plot.title = element_text(size = 8),  # Adjust title font size
    axis.text.x = element_text(angle = 90, hjust = 1, size = 6)  # Rotate x-axis labels for better readability and reduce font size
  )

# Use plotly for interactivity 
plotly::ggplotly(plot) %>%
  plotly::layout(
    yaxis = list(
      tickformat = ",.0f"  # Format y-axis labels 
    )
  )

# Summarize data by region
region_summary <- data %>%
  group_by(Region) %>%
  summarise(Median_Household_Income = median(Median_Household_Income, na.rm = TRUE),
            Poverty_Percent_All_Ages = median(Poverty_Percent_All_Ages, na.rm = TRUE))

# Scatter plot for median household income vs poverty percent for all ages
plot <- ggplot(region_summary, aes(x = Poverty_Percent_All_Ages, y = Median_Household_Income, color = Region)) +
  geom_point() +
  labs(
    x = "Median Household Income",
    y = "Poverty Percent for All Ages",
    title = "Median H.H Income vs Poverty Percent for All Ages in U.S  Regions",
    caption = "Data Source: U.S. Census Bureau, (SAIPE) Program"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",  # Display legend at the bottom
    plot.title = element_text(size = 8),  # Adjust title font size
    legend.title = element_blank()  # Remove legend title
  )

# Use plotly for interactivity and format y-axis labels
plotly::ggplotly(plot) %>%
  plotly::layout(
    yaxis = list(
      tickformat = ",.0f"  # Format y-axis labels as regular numbers without decimals
    )
  )

The topic of the data is Median Household Income and also Poverty percentages in the U.S by region. The original dataset didn’t include categorization by state but that’s something I added later. The variables I used were categorical and quantitative, the percentage, the median income, and also regions. The data came from the U.S Census Bureau (SAIPE) program. When I was researching this topic I learned the reason for researching Income disparities is to fund specific under funded school districts.Based on these estimate calculations from 2022 the fiscal year 2024 will be used as a time to allocate appropriate funds where needed. According to the Census Bureau, the calculations of median income for counties is $60,833.Poverty rates averaged out to 13.6 %. I would say my visualization roughly reflects these estimations. Besides understanding data I also used another website to understand the context of poverty, how it’s been addressed, how it’s developing, and the established parameters of poverty across the various states/regions of the U.S.####. Census Bureau Releases Small Area Income and Poverty Estimates for States, Counties and School Districts. (2023, December 14). Census Bureau. Retrieved April 13, 2024, from https://www.census.gov/newsroom/press-releases/2023/saipe.htmlPoverty Guidelines | ASPE. (n.d.). ASPE. Retrieved April 13, 2024, from https://aspe.hhs.gov/topics/poverty-economic-mobility/poverty-guidelines