1- What is the distribution of median household income across tracts in Harris County?
2- What is the relationship between the race (percentage of White and Black populations) and median household income?
3- What is the relationship between between higher education levels and poverty rates in Harris County?
4- Can we predict the population growth of Harris County for the next five years (2025-2030) based on historical data?
#Q1
#research questions
# 1- What is the distribution of median household income across tracts in Harris County?
# 2- What is the relationship between the race (percentage of White and Black populations) and median household income?
# 3- What is the relationship between between higher education levels and poverty rates in Harris County?
# 4- Can we predict the population growth of Harris County for the next five years (2025-2030) based on historical data?
#Q2
#Use Census API to get the census tract-level data with at least 4 variables
library(tidycensus)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tmap)
## Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
## remotes::install_github('r-tmap/tmap')
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
census_api_key("0d539976d5203a96fa55bbf4421110d4b3db3648", overwrite = TRUE)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
variables <- load_variables(2020, "acs5", cache = TRUE)
var <- c(median_income = "B19013_001",
total_pop = "B01003_001",
white_pop = "B02001_002",
black_pop = "B02001_003",
edu_bachelors = "B15003_022",
edu_graduate = "B15003_023",
poverty = "B17021_002")
st <-"TX"
ct <-"Harris County"
harriss <- get_acs(geography = "tract", variables = var, count=ct,
state = st,output="wide", year = 2020, geometry = TRUE)
## Getting data from the 2016-2020 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## | | | 0% | | | 1% | |= | 1% | |= | 2% | |== | 3% | |== | 4% | |=== | 4% | |=== | 5% | |===== | 7% | |======= | 9% | |======= | 10% | |======== | 11% | |========= | 13% | |========== | 14% | |========== | 15% | |=========== | 15% | |=========== | 16% | |============== | 19% | |=============== | 21% | |=============== | 22% | |================ | 23% | |================ | 24% | |================= | 24% | |================= | 25% | |================== | 25% | |================== | 26% | |==================== | 28% | |==================== | 29% | |===================== | 29% | |===================== | 30% | |====================== | 31% | |======================== | 34% | |======================== | 35% | |========================= | 35% | |========================= | 36% | |========================== | 37% | |========================== | 38% | |=========================== | 38% | |=========================== | 39% | |============================ | 39% | |============================ | 40% | |============================ | 41% | |============================= | 41% | |=============================== | 44% | |=============================== | 45% | |================================ | 46% | |================================== | 49% | |==================================== | 52% | |===================================== | 53% | |====================================== | 54% | |====================================== | 55% | |======================================= | 56% | |======================================== | 57% | |======================================== | 58% | |========================================= | 58% | |========================================= | 59% | |========================================== | 59% | |========================================== | 60% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 62% | |============================================ | 63% | |============================================ | 64% | |============================================= | 64% | |============================================= | 65% | |============================================== | 65% | |============================================== | 66% | |=============================================== | 66% | |=============================================== | 67% | |================================================ | 68% | |================================================ | 69% | |================================================== | 71% | |================================================== | 72% | |=================================================== | 72% | |=================================================== | 73% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 76% | |====================================================== | 77% | |====================================================== | 78% | |======================================================= | 78% | |======================================================= | 79% | |======================================================== | 80% | |========================================================= | 81% | |========================================================== | 82% | |========================================================== | 83% | |=========================================================== | 84% | |============================================================ | 86% | |============================================================= | 86% | |============================================================= | 87% | |============================================================= | 88% | |============================================================== | 88% | |============================================================== | 89% | |=============================================================== | 90% | |=============================================================== | 91% | |================================================================ | 91% | |================================================================ | 92% | |================================================================= | 92% | |================================================================= | 93% | |================================================================== | 94% | |================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 97% | |======================================================================| 100%
#The percentage of White , Black , higher_educated and Poverty rate
harriss$pct_white <- (harriss$white_popE / harriss$total_popE) * 100
harriss$pct_black <- (harriss$black_popE / harriss$total_popE) * 100
harriss$total_educated <- harriss$edu_bachelorsE + harriss$edu_graduateE
harriss$pct_higher_edu <- (harriss$total_educated / harriss$total_popE) * 100
harriss$poverty_rate <- (harriss$povertyE / harriss$total_popE) * 100
#Q3
#Calculate mean, median, min, and max values
#for median household income
mean_income <- mean(harriss$median_incomeE, na.rm = TRUE)
median_income <- median(harriss$median_incomeE, na.rm = TRUE)
min_income <- min(harriss$median_incomeE, na.rm = TRUE)
max_income <- max(harriss$median_incomeE, na.rm = TRUE)
#for higher education attainment
mean_edu <- mean(harriss$pct_higher_edu, na.rm = TRUE)
median_edu <- median(harriss$pct_higher_edu, na.rm = TRUE)
min_edu <- min(harriss$pct_higher_edu, na.rm = TRUE)
max_edu <- max(harriss$pct_higher_edu, na.rm = TRUE)
#for poverty rate
mean_poverty <- mean(harriss$poverty_rate, na.rm = TRUE)
median_poverty <- median(harriss$poverty_rate, na.rm = TRUE)
min_poverty <- min(harriss$poverty_rate, na.rm = TRUE)
max_poverty <- max(harriss$poverty_rate, na.rm = TRUE)
mean_income
## [1] 69228.68
median_income
## [1] 58272.5
min_income
## [1] 13293
max_income
## [1] 250001
#Q4
#Make at least three types of figures and summarize your findings
#Histogram of median household income distribution
ggplot(harriss, aes(x = median_incomeE)) +
geom_histogram(binwidth = 5000, fill = "blue", color = "black") +
labs(
title = "Median Household Income distribution",
x = "Median Household Income",
y = "Count"
)
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
#Boxplot of median household income
ggplot(harriss, aes(y = median_incomeE)) +
geom_boxplot(fill = "blue") +
labs(
title = "Boxplot of Median Household Income",
y = "Median Household Income"
)
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
#Scatter plot of Median Household Income vs Percentage of White Population
ggplot(harriss, aes(x = pct_white, y = median_incomeE)) +
geom_point(color = "blue") +
labs(
title = "Median Household Income vs Percentage of White Population",
x = "Percentage of White Population",
y = "Median Household Income "
)
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
#Scatter plot of higher education attainment vs poverty rate
ggplot(harriss, aes(x = pct_higher_edu, y = poverty_rate)) +
geom_point(color = "blue") +
labs(
title = "Poverty Rate vs Higher Education Attainment",
x = "Higher Education Attainment ",
y = "Poverty Rate"
)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Summary of Findings:
#The histogram and boxplot show that most households in Harris County have a lower to middle income, while only a few have high incomes which means a significant economic disparity in the area.
#the scatter plot(Median Household Income vs Percentage of White Population) shows that as the percentage of the White population increases,the median household income tends to be higher.
#The scatter plot (higher education attainment vs poverty rate) shows that areas with more residents who have higher education tend to have lower poverty rates.
#Q5
#Make at least a PDF chart and CDF chart for any variable
#Create a dominant_race
harriss$dominant_race <- "Other"
harriss$dominant_race[harriss$pct_white > harriss$pct_black] <- "White"
harriss$dominant_race[harriss$pct_black > harriss$pct_white] <- "Black"
# PDF of median income by dominant race
ggplot(harriss, aes(x = median_incomeE, color = dominant_race)) +
geom_density() +
labs(title = "PDF of median income by dominant race ")
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_density()`).
#CDF of poverty rate by education level
harriss$edu_category <- "Low Education"
harriss$edu_category[harriss$pct_higher_edu >= median_edu] <- "High Education"
ggplot(harriss, aes(x = poverty_rate, color = edu_category)) +
stat_ecdf(geom = "step") +
labs(
title = "CDF of Poverty Rate by Education Level",
x = "Poverty Rate ",
y = "Cumulative Probability",
)
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_ecdf()`).
#Q6
# Make a prediction of population OR GDP OR other variable of your study area for the next five years (2025-2030)
#https://usafacts.org/data/topics/people-society/population-and-demographics/our-changing-population/state/texas/county/harris-county/
##population prediction
x <- c(2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021)
y <- c(4262549, 4352419, 4454951, 4556559, 4622836, 4655798, 4676913, 4709243, 4734505, 4735287)
plot(x, y, pch = 19, xlab = "Year", ylab = "Population")
poly.lm1 <- lm(y ~ poly(x, 1))
poly.lm2 <- lm(y ~ poly(x, 2))
poly.lm3 <- lm(y ~ poly(x, 3))
new.x <- c(2025, 2026, 2027, 2028, 2029, 2030)
new.df <- data.frame(x = new.x)
new.y1 <- predict(poly.lm1, newdata = new.df)
new.y2 <- predict(poly.lm2, newdata = new.df)
new.y3 <- predict(poly.lm3, newdata = new.df)
x_axis <- seq(2012, 2022, length = 10)
lines(x_axis, predict(poly.lm1, data.frame(x = x_axis)), col = 'green')
lines(x_axis, predict(poly.lm2, data.frame(x = x_axis)), col = 'red')
lines(x_axis, predict(poly.lm3, data.frame(x = x_axis)), col = 'purple')
new.y1
## 1 2 3 4 5 6
## 5018865 5070954 5123043 5175132 5227222 5279311
new.y2
## 1 2 3 4 5 6
## 4578069 4506185 4420525 4321091 4207882 4080898
new.y3
## 1 2 3 4 5 6
## 4718734 4712528 4708923 4709644 4716414 4730958
#Q7
#Relationship Between Median Income and Racial Composition
income_race_model <- lm(median_incomeE ~ pct_white + pct_black, data = harriss)
#Relationship Between Poverty Rate and Higher Education Attainment
poverty_edu_model <- lm(poverty_rate ~ pct_higher_edu, data = harriss)
ggplot(harriss, aes(x = pct_white, y = median_incomeE)) +
geom_point() +
geom_smooth(method = "lm", col = "blue") +
labs(title = "Percentage of White Population vs. Median Income with Regression Line ")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(harriss, aes(x = pct_black, y = median_incomeE)) +
geom_point() +
geom_smooth(method = "lm", col = "blue") +
labs(title = "Percentage of Black Population vs. Median Income with Regression Line")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(harriss, aes(x = pct_higher_edu, y = poverty_rate)) +
geom_point() +
geom_smooth(method = "lm", col = "blue") +
labs(title = "Higher Education Percentage vs Poverty Rate with Regression Line")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
#Q8
p <- ggplot(harriss, aes(x = median_incomeE)) +
geom_histogram(binwidth = 5000, fill = "blue", color = "black") +
labs(
title = "Median Household Income Distribution",
x = "Median Household Income",
y = "Count"
)
ggplotly(p)
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
#Q9
#In this analysis of In Harris County, I examined some characteristics of Harris County, Texas, focusing on median household income, racial composition, education levels, and poverty rates at the census tract level using Census API.
#summary of findigs:
#The findings shows that most census tracts have median household incomes below the county average of approximately $69,228.68 with a few areas showing high incomes up to $250,001.
#Also it shows a relationship between racial composition and income: Scatter plot, PDF plot CDF plot
#census tracts with a higher percentage of White residents tend to have higher median incomes, while areas with a higher percentage of Black residents generally have lower incomes. This relationship is confirmed by the OLS regression model in Q7. Furthermore, areas with more residents who have attained higher education tend to have lower poverty rates which was also confirmed by regression model in Q7.
#Lastly, by using historical population data, the population projections for 2025-2030 shows a steady growth, with the population expected to reach approximately 5,279,311 by 2030. An interactive scatter plot was also created for further exploration.
In this analysis of In Harris County, I examined some characteristics of Harris County, Texas, focusing on median household income, racial composition, education levels, and poverty rates at the census tract level using Census API.
summary of findigs:
median household income distribution The findings shows that most census tracts have median household incomes below the county average of approximately $69,228.68 with a few areas showing high incomes up to $250,001.
Also it shows a relationship between racial composition and income: Scatter plot, PDF plot CDF plot census tracts with a higher percentage of White residents tend to have higher median incomes, while areas with a higher percentage of Black residents generally have lower incomes. This relationship is confirmed by the OLS regression model in Q7. Furthermore, areas with more residents who have attained higher education tend to have lower poverty rates which was also confirmed by regression model in Q7.
Lastly, by using historical population data, the population projections for 2025-2030 shows a steady growth, with the population expected to reach approximately 5,279,311 by 2030. An interactive scatter plot was also created for further exploration.