Libraries
This project investigates the relationship between socioeconomic factors and the housing market in the United States, focusing on three metropolitan areas: New York City, Miami, and Los Angeles. Utilizing datasets on median housing prices, homeownership rates, education, jobs, and household income across multiple years and quarters, the study aims to uncover trends and disparities in housing affordability and accessibility. The analysis incorporates visualizations and highlights the potential impact of regional economic conditions and policy decisions. By understanding these patterns, this research seeks to inform strategies for improving housing equity and affordability, especially in urban areas with diverse socioeconomic dynamics. The findings contribute to a broader discourse on the intersection of economic policy, urban development, and social equity in the U.S. housing market.
First, the raw housing data is loaded directly from a CSV file hosted in a GitHub repository.
load USA dataset
cpi_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/cpi.csv"
electricity_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/electricity_price_city_usa.csv"
m2_money_supply_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/m2_money_supply.csv"
median_house_prices_sold_usa_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/median_house_prices_sold_usa.csv"
employment_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/total_non_farm_employment.csv"
median_household_income_usa_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/median_household_income_usa.csv"
cpi_df <- read_csv(cpi_url)
## Rows: 418 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): CPIAUCSL
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
electricity_df <- read_csv(electricity_url)
## Rows: 35 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
m2_money_supply_df <- read_csv(m2_money_supply_url)
## Rows: 418 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): M2SL
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
median_house_prices_sold_usa_df <- read_csv(median_house_prices_sold_usa_url)
## Rows: 139 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): MSPUS
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
employment_df <- read_csv(employment_url)
## Rows: 419 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): PAYEMS
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
median_household_income_usa_df <- read.csv(median_household_income_usa_url)
load nyc dataset
nyc_median_income_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/new%20york%20tristate/nyc_median_household_income.csv"
nyc_tristate_electricity_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/new%20york%20tristate/nyc_tristate_electricity.csv"
nyc_median_income_df <- read_csv(nyc_median_income_url)
## Rows: 34 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): MEHOINUSNYA646N
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nyc_tristate_electricity_df <- read_csv(nyc_tristate_electricity_url)
## Rows: 35 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
load miami dataset
miami_electricity_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/miami/miami_electricity.csv"
miami_median_income_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/miami/miami_median_household_income.csv"
miami_electricity_df <- read_csv(miami_electricity_url)
## Rows: 35 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
miami_median_income_df <- read_csv(miami_median_income_url)
## Rows: 40 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): MEHOINUSFLA646N
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
load califonia dataset
san_francisco_electricity_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/california/san_francisco_electricity.csv"
san_francisco_median_income_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/california/san_francisco_median_household_income.csv"
san_francisco_electricity_df <- read_csv(san_francisco_electricity_url )
## Rows: 35 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
san_francisco_median_income_df <- read_csv(san_francisco_median_income_url)
## Rows: 34 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): MHICA06075A052NCEN
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Electrcity Df Transformation
transform_electricity_data <- function(df) {
transformed_df <- df %>%
pivot_longer(
cols = -Year,
names_to = "Month",
values_to = "Electricity Cost per Kilowatt-Hour"
) %>%
mutate(
Date = as.Date(paste(Year, Month, "01", sep = "-"), format = "%Y-%b-%d")
) %>%
select(Date, `Electricity Cost per Kilowatt-Hour`)
return(transformed_df)
}
san_francisco_electricity_df <- transform_electricity_data(san_francisco_electricity_df)
electricity_df <- transform_electricity_data(electricity_df)
nyc_tristate_electricity_df <- transform_electricity_data(nyc_tristate_electricity_df)
miami_electricity_df<- transform_electricity_data(miami_electricity_df)
rename columns
rename_column <- function(df, old_col_name, new_col_name) {
renamed_df <- df %>%
rename(!!new_col_name := !!sym(old_col_name))
return(renamed_df)
}
san_francisco_median_income_df <- rename_column(san_francisco_median_income_df,"MHICA06075A052NCEN", "San Francisco Median Income")
nyc_median_income_df <- rename_column(nyc_median_income_df,"MEHOINUSNYA646N", "NYC Median Income")
miami_median_income_df <- rename_column(miami_median_income_df,"MEHOINUSFLA646N", "Miami Median Income")
median_household_income_usa_df <- rename_column(median_household_income_usa_df,"MEHOINUSA646N", "USA Median Income")
cpi_df <- rename_column(cpi_df,"CPIAUCSL", "CPI Index")
median_house_prices_sold_usa_df <- rename_column(median_house_prices_sold_usa_df,"MSPUS", "Median Housing Price Sold USA")
employment_df <- rename_column(employment_df,"PAYEMS", "Population Employed USA Thousand Per Person")
san_francisco_electricity_df <- rename_column(san_francisco_electricity_df,"Electricity Cost per Kilowatt-Hour", "San Francisco Electricity Cost per Kilowatt-Hour")
electricity_df <- rename_column(electricity_df,"Electricity Cost per Kilowatt-Hour", "USA Electricity Cost per Kilowatt-Hour")
nyc_tristate_electricity_df <- rename_column(nyc_tristate_electricity_df,"Electricity Cost per Kilowatt-Hour", "NYC Electricity Cost per Kilowatt-Hour")
miami_electricity_df<- rename_column(miami_electricity_df,"Electricity Cost per Kilowatt-Hour", "Miami Electricity Cost per Kilowatt-Hour")
Merge electricty data into a one df
merged_electricity_df <- san_francisco_electricity_df %>%
full_join(electricity_df, by = "Date") %>%
full_join(nyc_tristate_electricity_df, by = "Date") %>%
full_join(miami_electricity_df, by = "Date")
head(merged_electricity_df)
## # A tibble: 6 × 5
## Date San Francisco Electricity Cost per Kilowat…¹ USA Electricity Cost…²
## <date> <dbl> <dbl>
## 1 1990-01-01 0.109 0.081
## 2 1990-02-01 0.109 0.081
## 3 1990-03-01 0.109 0.081
## 4 1990-04-01 0.109 0.082
## 5 1990-05-01 0.111 0.082
## 6 1990-06-01 0.111 0.088
## # ℹ abbreviated names: ¹`San Francisco Electricity Cost per Kilowatt-Hour`,
## # ²`USA Electricity Cost per Kilowatt-Hour`
## # ℹ 2 more variables: `NYC Electricity Cost per Kilowatt-Hour` <dbl>,
## # `Miami Electricity Cost per Kilowatt-Hour` <dbl>
ggplot(data = merged_electricity_df, aes(x = Date)) +
geom_line(aes(y = `San Francisco Electricity Cost per Kilowatt-Hour`, color = "San Francisco"), size = 1) +
geom_line(aes(y = `USA Electricity Cost per Kilowatt-Hour`, color = "USA"), size = 1) +
geom_line(aes(y = `NYC Electricity Cost per Kilowatt-Hour`, color = "NYC"), size = 1) +
geom_line(aes(y = `Miami Electricity Cost per Kilowatt-Hour`, color = "Miami"), size = 1) +
labs(
title = "Electricity Costs Over Time",
x = "Date",
y = "Electricity Cost (per Kilowatt-Hour)",
color = "City/Region"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
heatmap_data <- merged_electricity_df %>%
pivot_longer(
cols = -Date,
names_to = "Region",
values_to = "Electricity_Cost"
)
# Create the heatmap
ggplot(data = heatmap_data, aes(x = Date, y = Region, fill = Electricity_Cost)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red", name = "Electricity Cost") +
labs(
title = "Electricity Costs Over Time by Region",
x = "Year",
y = "Region"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ratio_df <- merged_electricity_df %>%
mutate(
`San Francisco Ratio` = `San Francisco Electricity Cost per Kilowatt-Hour` / `USA Electricity Cost per Kilowatt-Hour`,
`NYC Ratio` = `NYC Electricity Cost per Kilowatt-Hour` / `USA Electricity Cost per Kilowatt-Hour`,
`Miami Ratio` = `Miami Electricity Cost per Kilowatt-Hour` / `USA Electricity Cost per Kilowatt-Hour`
) %>%
select(Date, `San Francisco Ratio`, `NYC Ratio`, `Miami Ratio`) %>%
pivot_longer(
cols = -Date,
names_to = "Region",
values_to = "Ratio"
)
# Create the ratio plot
ggplot(data = ratio_df, aes(x = Date, y = Ratio, color = Region)) +
geom_line(size = 1) +
labs(
title = "Electricity Price Ratios Relative to USA Average",
x = "Date",
y = "Price Ratio",
color = "Region"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_line()`).
scatter_data <- merged_electricity_df %>%
pivot_longer(
cols = -Date,
names_to = "Region",
values_to = "Electricity_Cost"
)
ggplot(data = scatter_data, aes(x = Date, y = Electricity_Cost, color = Region)) +
geom_point(size = 2, alpha = 0.7) +
labs(
title = "Scatter Plot of Electricity Costs Over Time",
x = "Date",
y = "Electricity Cost (per Kilowatt-Hour)",
color = "Region"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_point()`).
san_francisco_median_income_df <- san_francisco_median_income_df %>%
mutate(DATE = as.Date(DATE))
nyc_median_income_df <- nyc_median_income_df %>%
mutate(DATE = as.Date(DATE))
miami_median_income_df <- miami_median_income_df %>%
mutate(DATE = as.Date(DATE))
median_household_income_usa_df <- median_household_income_usa_df %>%
mutate(DATE = as.Date(DATE))
merged_income_df <- san_francisco_median_income_df %>%
full_join(nyc_median_income_df, by = "DATE") %>%
full_join(miami_median_income_df, by = "DATE") %>%
full_join(median_household_income_usa_df, by = "DATE")
merged_income_df <- merged_income_df %>%
select(-starts_with("Date.")) %>%
slice(-1)
merged_income_df <- merged_income_df %>%
mutate(across(
c(`San Francisco Median Income`, `NYC Median Income`, `Miami Median Income`, `USA Median Income`),
~ as.numeric(.)
))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! NAs introduced by coercion
head(merged_income_df)
## # A tibble: 6 × 5
## DATE San Francisco Median In…¹ `NYC Median Income` `Miami Median Income`
## <date> <dbl> <dbl> <dbl>
## 1 1990-01-01 NA 31590 26690
## 2 1991-01-01 NA 31790 27250
## 3 1992-01-01 NA 31050 27350
## 4 1993-01-01 34623 31700 28550
## 5 1994-01-01 NA 31900 29290
## 6 1995-01-01 37854 33030 29750
## # ℹ abbreviated name: ¹`San Francisco Median Income`
## # ℹ 1 more variable: `USA Median Income` <dbl>
ggplot(data = merged_income_df, aes(x = DATE)) +
geom_line(aes(y = `San Francisco Median Income`, color = "San Francisco"), size = 1) +
geom_line(aes(y = `NYC Median Income`, color = "NYC"), size = 1) +
geom_line(aes(y = `Miami Median Income`, color = "Miami"), size = 1) +
geom_line(aes(y = `USA Median Income`, color = "USA"), size = 1) +
labs(
title = "Median Income Trends Over Time",
x = "Year",
y = "Median Income",
color = "Region"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_continuous(labels = scales::comma)
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
long_income_df <- merged_income_df %>%
pivot_longer(
cols = -DATE,
names_to = "Region",
values_to = "Median_Income"
)
ggplot(long_income_df, aes(x = Region, y = Median_Income, fill = Region)) +
geom_boxplot() +
labs(
title = "Distribution of Median Incomes by Region",
x = "Region",
y = "Median Income"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_continuous(labels = scales::comma)
## Warning: Removed 21 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
cpi
ggplot(cpi_df, aes(x = as.Date(DATE), y = `CPI Index`)) +
geom_line(color = "red", size = 1) +
labs(
title = "CPI Index Over Time",
x = "Date",
y = "CPI Index"
) +
theme_minimal()
colnames(cpi_df)
## [1] "DATE" "CPI Index"
# Calculate the growth rate for CPI
cpi_plot_df <- cpi_df %>%
arrange(as.Date(DATE)) %>%
mutate(Growth_Rate = (`CPI Index` - lag(`CPI Index`)) / lag(`CPI Index`) * 100) # Percentage growth
# Create the growth rate line plot
ggplot(cpi_plot_df, aes(x = as.Date(DATE), y = Growth_Rate)) +
geom_line(color = "red", size = 1) +
labs(
title = "CPI Growth Rate Over Time",
x = "Date",
y = "Growth Rate (%)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
m2 money supply
ggplot(m2_money_supply_df, aes(x = as.Date(DATE), y = M2SL)) +
geom_line(color = "green", size = 1) +
labs(
title = "M2 Money Supply Over Time",
x = "Date",
y = "M2SL (Money Supply)"
) +
theme_minimal()
m2_money_supply_plot_df <- m2_money_supply_df %>%
mutate(Growth_Rate = (M2SL - lag(M2SL)) / lag(M2SL) * 100)
ggplot(m2_money_supply_plot_df, aes(x = as.Date(DATE), y = Growth_Rate)) +
geom_line(color = "green", size = 1) +
labs(title = "Growth Rate of M2 Money Supply", x = "Date", y = "Growth Rate (%)") +
theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
median house hold prices sold
ggplot(median_house_prices_sold_usa_df, aes(x = as.Date(DATE), y = `Median Housing Price Sold USA`)) +
geom_line(color = "purple", size = 1) +
labs(
title = "Median Housing Prices Sold in the USA Over Time",
x = "Date",
y = "Median Housing Price"
) +
theme_minimal()+
scale_y_continuous(labels = scales::comma)
ggplot(employment_df, aes(x = as.Date(DATE), y = `Population Employed USA Thousand Per Person`)) +
geom_line(color = "orange", size = 1) +
labs(
title = "USA Population Employed (Thousands) Over Time",
x = "Date",
y = "Population Employed (Thousands)"
) +
theme_minimal()
employment_plot_df <- employment_df %>%
mutate(Year = format(as.Date(DATE), "%Y")) %>%
group_by(Year) %>%
summarize(Annual_Change = last(`Population Employed USA Thousand Per Person`) -
first(`Population Employed USA Thousand Per Person`))
ggplot(employment_plot_df, aes(x = Year, y = Annual_Change)) +
geom_bar(stat = "identity", fill = "orange") +
labs(title = "Annual Changes in Employment", x = "Year", y = "Change in Employment (Thousands)") +
theme_minimal()
merged_electricity_df <- merged_electricity_df %>% rename(DATE = Date)
nyc_data <- merged_income_df %>%
select(DATE, `NYC Median Income`) %>%
rename(Median_Income = `NYC Median Income`) %>%
left_join(
merged_electricity_df %>% select(DATE, `NYC Electricity Cost per Kilowatt-Hour`),
by = "DATE"
) %>%
left_join(
cpi_df %>% rename(CPI_Index = `CPI Index`),
by = "DATE"
) %>%
left_join(
m2_money_supply_df,
by = "DATE"
) %>%
left_join(
employment_df %>% rename(Employment_Thousands = `Population Employed USA Thousand Per Person`),
by = "DATE"
)
nyc_data_clean <- nyc_data %>% na.omit()
regression_model <- lm(
Median_Income ~ `NYC Electricity Cost per Kilowatt-Hour` + CPI_Index + M2SL + Employment_Thousands,
data = nyc_data_clean
)
summary(regression_model)
##
## Call:
## lm(formula = Median_Income ~ `NYC Electricity Cost per Kilowatt-Hour` +
## CPI_Index + M2SL + Employment_Thousands, data = nyc_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3752.3 -937.2 117.2 1053.6 4971.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.994e+04 7.030e+03 -2.836 0.00824
## `NYC Electricity Cost per Kilowatt-Hour` -1.245e+04 2.808e+04 -0.443 0.66078
## CPI_Index -3.917e+00 5.431e+01 -0.072 0.94300
## M2SL 1.821e+00 2.177e-01 8.367 3.20e-09
## Employment_Thousands 4.267e-01 9.225e-02 4.626 7.17e-05
##
## (Intercept) **
## `NYC Electricity Cost per Kilowatt-Hour`
## CPI_Index
## M2SL ***
## Employment_Thousands ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1853 on 29 degrees of freedom
## Multiple R-squared: 0.9849, Adjusted R-squared: 0.9828
## F-statistic: 473.5 on 4 and 29 DF, p-value: < 2.2e-16
NYC Electricity Cost per Kilowatt-Hour,
CPI Index, M2 Money Supply, and
Employment_Thousands).Each coefficient represents the estimated impact of a one-unit increase in the predictor on NYC Median Income, holding other variables constant:
Residuals represent the differences between observed and predicted values: - The residuals are relatively small, with a standard error of 1853, indicating that the model predicts NYC Median Income fairly accurately.
M2 Money Supply and Employment_Thousands
have statistically significant positive relationships with NYC Median
Income.
NYC Electricity Cost per Kilowatt-Hour and
CPI Index are not statistically significant, meaning
there’s no strong evidence that these directly influence NYC Median
Income in this model.M2 Money Supply and Employment_Thousands,
as evidenced by their significance levels and positive
coefficients.ggplot(nyc_data_clean, aes(x = M2SL, y = Median_Income)) +
geom_point(color = "blue", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between M2 Money Supply and NYC Median Income",
x = "M2 Money Supply (Billion USD)",
y = "NYC Median Income"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(nyc_data_clean, aes(x = Employment_Thousands, y = Median_Income)) +
geom_point(color = "blue", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between Employment Levels and NYC Median Income",
x = "Employment Levels (Thousands)",
y = "NYC Median Income"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(nyc_data_clean, aes(x = M2SL, y = Median_Income, color = Employment_Thousands)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", color = "black", se = FALSE) +
scale_color_gradient(low = "lightblue", high = "darkblue", name = "Employment Levels") +
labs(
title = "Combined Effect of M2 Money Supply and Employment on NYC Median Income",
x = "M2 Money Supply (Billion USD)",
y = "NYC Median Income"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Merge and clean the Miami-specific data
miami_data <- merged_income_df %>%
select(DATE, `Miami Median Income`) %>%
rename(Median_Income = `Miami Median Income`) %>%
left_join(
merged_electricity_df %>% select(DATE, `Miami Electricity Cost per Kilowatt-Hour`),
by = "DATE"
) %>%
left_join(
cpi_df %>% rename(CPI_Index = `CPI Index`),
by = "DATE"
) %>%
left_join(
m2_money_supply_df,
by = "DATE"
) %>%
left_join(
employment_df %>% rename(Employment_Thousands = `Population Employed USA Thousand Per Person`),
by = "DATE"
)
miami_data_clean <- miami_data %>% na.omit()
regression_model_miami <- lm(
Median_Income ~ `Miami Electricity Cost per Kilowatt-Hour` + CPI_Index + M2SL + Employment_Thousands,
data = miami_data_clean
)
summary(regression_model_miami)
##
## Call:
## lm(formula = Median_Income ~ `Miami Electricity Cost per Kilowatt-Hour` +
## CPI_Index + M2SL + Employment_Thousands, data = miami_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3363.8 -813.3 -44.1 792.6 3708.2
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -2.383e+04 4.960e+03 -4.805
## `Miami Electricity Cost per Kilowatt-Hour` 7.909e+04 3.113e+04 2.541
## CPI_Index 2.385e+01 3.468e+01 0.688
## M2SL 9.229e-01 1.712e-01 5.391
## Employment_Thousands 3.488e-01 6.120e-02 5.699
## Pr(>|t|)
## (Intercept) 4.36e-05 ***
## `Miami Electricity Cost per Kilowatt-Hour` 0.0167 *
## CPI_Index 0.4971
## M2SL 8.56e-06 ***
## Employment_Thousands 3.65e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1437 on 29 degrees of freedom
## Multiple R-squared: 0.9862, Adjusted R-squared: 0.9842
## F-statistic: 516.3 on 4 and 29 DF, p-value: < 2.2e-16
Each coefficient represents the estimated impact of a one-unit increase in the predictor on Miami Median Income, holding other variables constant:
M2 Money Supply and Employment_Thousands
cannot realistically be zero.ggplot(miami_data_clean, aes(x = M2SL, y = Median_Income)) +
geom_point(color = "blue", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between M2 Money Supply and Miami Median Income",
x = "M2 Money Supply (Billion USD)",
y = "Miami Median Income"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(miami_data_clean, aes(x = Employment_Thousands, y = Median_Income)) +
geom_point(color = "green", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between Employment Levels and Miami Median Income",
x = "Employment Levels (Thousands)",
y = "Miami Median Income"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Merge and clean the San Francisco-specific data
sf_data <- merged_income_df %>%
select(DATE, `San Francisco Median Income`) %>%
rename(Median_Income = `San Francisco Median Income`) %>%
left_join(
merged_electricity_df %>% select(DATE, `San Francisco Electricity Cost per Kilowatt-Hour`),
by = "DATE"
) %>%
left_join(
cpi_df %>% rename(CPI_Index = `CPI Index`),
by = "DATE"
) %>%
left_join(
m2_money_supply_df,
by = "DATE"
) %>%
left_join(
employment_df %>% rename(Employment_Thousands = `Population Employed USA Thousand Per Person`),
by = "DATE"
)
sf_data_clean <- sf_data %>% na.omit()
regression_model_sf <- lm(
Median_Income ~ `San Francisco Electricity Cost per Kilowatt-Hour` + CPI_Index + M2SL + Employment_Thousands,
data = sf_data_clean
)
summary(regression_model_sf)
##
## Call:
## lm(formula = Median_Income ~ `San Francisco Electricity Cost per Kilowatt-Hour` +
## CPI_Index + M2SL + Employment_Thousands, data = sf_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7705 -2972 6 1876 8376
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -7.163e+04 2.244e+04
## `San Francisco Electricity Cost per Kilowatt-Hour` -1.637e+05 7.696e+04
## CPI_Index 2.172e+02 1.603e+02
## M2SL 4.249e+00 5.784e-01
## Employment_Thousands 6.837e-01 2.729e-01
## t value Pr(>|t|)
## (Intercept) -3.192 0.00406 **
## `San Francisco Electricity Cost per Kilowatt-Hour` -2.127 0.04441 *
## CPI_Index 1.354 0.18877
## M2SL 7.346 1.79e-07 ***
## Employment_Thousands 2.505 0.01977 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4324 on 23 degrees of freedom
## Multiple R-squared: 0.9803, Adjusted R-squared: 0.9768
## F-statistic: 285.5 on 4 and 23 DF, p-value: < 2.2e-16
Each coefficient represents the estimated impact of a one-unit increase in the predictor on San Francisco Median Income, holding other variables constant:
M2 Money Supply and Employment_Thousands
cannot realistically be zero.ggplot(sf_data_clean, aes(x = M2SL, y = Median_Income)) +
geom_point(color = "blue", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between M2 Money Supply and San Francisco Median Income",
x = "M2 Money Supply (Billion USD)",
y = "San Francisco Median Income"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(sf_data_clean, aes(x = Employment_Thousands, y = Median_Income)) +
geom_point(color = "green", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between Employment Levels and San Francisco Median Income",
x = "Employment Levels (Thousands)",
y = "San Francisco Median Income"
) +
theme_minimal()+
scale_y_continuous(labels = scales::comma)
## `geom_smooth()` using formula = 'y ~ x'
# Merge and clean the USA-specific data
usa_data <- merged_income_df %>%
select(DATE, `USA Median Income`) %>%
rename(Median_Income = `USA Median Income`) %>%
left_join(
merged_electricity_df %>% select(DATE, `USA Electricity Cost per Kilowatt-Hour`),
by = "DATE"
) %>%
left_join(
cpi_df %>% rename(CPI_Index = `CPI Index`),
by = "DATE"
) %>%
left_join(
m2_money_supply_df,
by = "DATE"
) %>%
left_join(
employment_df %>% rename(Employment_Thousands = `Population Employed USA Thousand Per Person`),
by = "DATE"
)
usa_data_clean <- usa_data %>% na.omit()
regression_model_usa <- lm(
Median_Income ~ `USA Electricity Cost per Kilowatt-Hour` + CPI_Index + M2SL + Employment_Thousands,
data = usa_data_clean
)
summary(regression_model_usa)
##
## Call:
## lm(formula = Median_Income ~ `USA Electricity Cost per Kilowatt-Hour` +
## CPI_Index + M2SL + Employment_Thousands, data = usa_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2036.24 -688.94 -8.26 503.91 2892.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.306e+04 4.595e+03 -5.019 2.41e-05
## `USA Electricity Cost per Kilowatt-Hour` 1.341e+04 3.637e+04 0.369 0.715
## CPI_Index 2.160e+01 3.835e+01 0.563 0.578
## M2SL 1.383e+00 1.257e-01 11.009 7.14e-12
## Employment_Thousands 4.108e-01 5.924e-02 6.934 1.28e-07
##
## (Intercept) ***
## `USA Electricity Cost per Kilowatt-Hour`
## CPI_Index
## M2SL ***
## Employment_Thousands ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1054 on 29 degrees of freedom
## Multiple R-squared: 0.9946, Adjusted R-squared: 0.9939
## F-statistic: 1345 on 4 and 29 DF, p-value: < 2.2e-16
Each coefficient represents the estimated impact of a one-unit increase in the predictor on USA Median Income, holding other variables constant:
M2 Money Supply and Employment_Thousands
cannot realistically be zero.ggplot(usa_data_clean, aes(x = M2SL, y = Median_Income)) +
geom_point(color = "blue", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between M2 Money Supply and USA Median Income",
x = "M2 Money Supply (Billion USD)",
y = "USA Median Income"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(usa_data_clean, aes(x = Employment_Thousands, y = Median_Income)) +
geom_point(color = "green", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between Employment Levels and USA Median Income",
x = "Employment Levels (Thousands)",
y = "USA Median Income"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
### USA DATA ANAYSIS FOR HOUSING
# Merge and clean USA Median Housing Price data
usa_housing_data <- median_house_prices_sold_usa_df %>%
left_join(
cpi_df %>% rename(CPI_Index = `CPI Index`),
by = "DATE"
) %>%
left_join(
m2_money_supply_df,
by = "DATE"
) %>%
left_join(
employment_df %>% rename(Employment_Thousands = `Population Employed USA Thousand Per Person`),
by = "DATE"
)
usa_housing_data_clean <- usa_housing_data %>% na.omit()
regression_model_housing <- lm(
`Median Housing Price Sold USA` ~ CPI_Index + M2SL + Employment_Thousands,
data = usa_housing_data_clean
)
summary(regression_model_housing)
##
## Call:
## lm(formula = `Median Housing Price Sold USA` ~ CPI_Index + M2SL +
## Employment_Thousands, data = usa_housing_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24206 -9153 -377 5656 33289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.605e+05 1.968e+04 -8.157 2.11e-13 ***
## CPI_Index 2.579e+02 1.245e+02 2.071 0.0402 *
## M2SL 9.835e+00 7.485e-01 13.140 < 2e-16 ***
## Employment_Thousands 1.907e+00 2.515e-01 7.582 4.92e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12780 on 135 degrees of freedom
## Multiple R-squared: 0.9806, Adjusted R-squared: 0.9801
## F-statistic: 2272 on 3 and 135 DF, p-value: < 2.2e-16
Each coefficient represents the estimated impact of a one-unit increase in the predictor on USA Median Housing Prices, holding other variables constant:
M2 Money Supply and Employment_Thousands
cannot realistically be zero.ggplot(usa_housing_data_clean, aes(x = CPI_Index, y = `Median Housing Price Sold USA`)) +
geom_point(color = "purple", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between CPI Index and USA Median Housing Prices",
x = "CPI Index",
y = "USA Median Housing Prices"
) +
theme_minimal()+
scale_y_continuous(labels = scales::comma)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(usa_housing_data_clean, aes(x = M2SL, y = `Median Housing Price Sold USA`)) +
geom_point(color = "blue", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between M2 Money Supply and USA Median Housing Prices",
x = "M2 Money Supply (Billion USD)",
y = "USA Median Housing Prices"
) +
theme_minimal()+
scale_y_continuous(labels = scales::comma)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(usa_housing_data_clean, aes(x = Employment_Thousands, y = `Median Housing Price Sold USA`)) +
geom_point(color = "green", alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(
title = "Relationship Between Employment Levels and USA Median Housing Prices",
x = "Employment Levels (Thousands)",
y = "USA Median Housing Prices"
) +
theme_minimal()+
scale_y_continuous(labels = scales::comma)
## `geom_smooth()` using formula = 'y ~ x'
# load data
rawhousingprices <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/Media%20Housing%20Price.csv"
housingprices <- read.csv(rawhousingprices)
head(housingprices)
## RegionID SizeRank RegionName RegionType StateName X2018.03.31
## 1 102001 0 United States country 263267
## 2 394913 1 New York, NY msa NY 503000
## 3 753899 2 Los Angeles, CA msa CA 721333
## 4 394463 3 Chicago, IL msa IL 284600
## 5 394514 4 Dallas, TX msa TX 322997
## 6 394692 5 Houston, TX msa TX 294467
## X2018.04.30 X2018.05.31 X2018.06.30 X2018.07.31 X2018.08.31 X2018.09.30
## 1 271267 276633 280000 280300 279633 279600
## 2 513000 521300 526300 528300 526600 526267
## 3 735000 743333 750000 749667 742967 736267
## 4 294600 300600 302267 301967 297933 296117
## 5 328497 331797 332800 329502 324202 321168
## 6 297933 299633 299933 299633 297800 296165
## X2018.10.31 X2018.11.30 X2018.12.31 X2019.01.31 X2019.02.28 X2019.03.31
## 1 279300 278933 277300 275967 276633 281300
## 2 527600 529000 527667 524667 524667 528000
## 3 728267 724633 719667 711300 711600 716267
## 4 292783 288483 283333 280000 284300 291267
## 5 318966 317633 315999 313333 313667 316967
## 6 293132 291632 289933 288300 289965 292665
## X2019.04.30 X2019.05.31 X2019.06.30 X2019.07.31 X2019.08.31 X2019.09.30
## 1 287967 294567 297900 298900 297000 294833
## 2 537667 544333 549000 547667 544333 544333
## 3 732633 742667 752933 756567 759233 758967
## 4 301267 306300 309467 306465 302798 298998
## 5 323633 328467 330133 328467 324966 321633
## 6 297632 299300 299933 298966 295633 292316
## X2019.10.31 X2019.11.30 X2019.12.31 X2020.01.31 X2020.02.29 X2020.03.31
## 1 292133 290467 287933 284933 285233 289600
## 2 545667 549000 549300 549300 549300 552333
## 3 758667 762333 765667 767667 771298 779965
## 4 295633 290967 284933 281633 284967 291633
## 5 319933 319600 317967 314667 313633 315267
## 6 289316 287650 286000 283333 284333 287666
## X2020.04.30 X2020.05.31 X2020.06.30 X2020.07.31 X2020.08.31 X2020.09.30
## 1 295633 299000 303967 309300 314300 315300
## 2 555967 559300 561300 566333 574967 584967
## 3 787965 794333 802333 816000 826000 829633
## 4 297967 299600 301300 304633 307667 309300
## 5 318600 321633 325633 330600 333233 334267
## 6 293166 297133 301132 303966 307299 309967
## X2020.10.31 X2020.11.30 X2020.12.31 X2021.01.31 X2021.02.28 X2021.03.31
## 1 315300 313633 309967 304967 307967 314667
## 2 592967 597967 599600 596600 594967 594667
## 3 831300 838294 844661 849327 849296 849629
## 4 306300 303267 298267 296567 299900 304933
## 5 332633 331600 328600 328559 330293 335926
## 6 313275 314308 314308 313666 315999 317666
## X2021.04.30 X2021.05.31 X2021.06.30 X2021.07.31 X2021.08.31 X2021.09.30
## 1 326333 333333 339407 341073 340740 339667
## 2 597667 598667 595633 589267 579600 574633
## 3 856296 864667 874667 874667 866333 856333
## 4 309966 311666 313333 313000 307967 302933
## 5 342633 350967 359667 367829 371159 373159
## 6 323000 327667 332667 334667 333333 333333
## X2021.10.31 X2021.11.30 X2021.12.31 X2022.01.31 X2022.02.28 X2022.03.31
## 1 339633 338300 333600 330300 335267 346633
## 2 574333 576333 576333 576667 584550 592550
## 3 850000 850000 850000 856633 871633 888263
## 4 299263 295930 289263 284933 287967 297667
## 5 373330 376633 378300 381633 388000 394667
## 6 333233 333233 331233 328000 328000 333650
## X2022.04.30 X2022.05.31 X2022.06.30 X2022.07.31 X2022.08.31 X2022.09.30
## 1 362967 376300 387967 393300 393300 391300
## 2 603550 610667 617667 613330 606330 599663
## 3 904963 916596 924967 920633 910333 901667
## 4 309300 317967 323300 324967 322967 319967
## 5 407333 419330 432663 438297 434967 428300
## 6 343650 356317 364167 367500 364500 358000
## X2022.10.31 X2022.11.30 X2022.12.31 X2023.01.31 X2023.02.28 X2023.03.31
## 1 386633 381667 375733 371900 371900 374833
## 2 606333 612999 619333 619333 624667 634967
## 3 897667 894333 887667 882633 885967 892967
## 4 314967 308600 302933 299300 302167 308833
## 5 419000 410667 403667 398000 396667 400150
## 6 352997 349663 347122 343792 342425 344933
## X2023.04.30 X2023.05.31 X2023.06.30 X2023.07.31 X2023.08.31 X2023.09.30
## 1 383107 391107 397607 399167 397833 396333
## 2 647967 662633 672333 679000 676000 679333
## 3 916000 942667 975333 992300 998967 999633
## 4 319133 329500 336500 339867 339967 339633
## 5 411813 426480 438330 441667 440000 435000
## 6 351567 358267 364633 367667 366000 362967
## X2023.10.31 X2023.11.30 X2023.12.31 X2024.01.31 X2024.02.29 X2024.03.31
## 1 393300 389967 384633 381000 381000 386000
## 2 684667 693000 693300 689967 691300 694333
## 3 999567 999233 995567 991667 992000 995633
## 4 336300 331333 324967 321633 323300 329967
## 5 430000 426356 422356 417356 414333 416633
## 6 358300 354967 351667 349983 349983 352983
## X2024.04.30 X2024.05.31 X2024.06.30 X2024.07.31 X2024.08.31 X2024.09.30
## 1 392967 400600 405900 405933 403267 399933
## 2 699333 706333 713000 716000 709330 709630
## 3 1015967 1045967 1079000 1095333 1090650 1077650
## 4 339967 349933 356600 358267 354933 351633
## 5 424633 432933 437967 438300 434667 430865
## 6 358000 363667 365667 366000 365333 363633
## X2024.10.31
## 1 399600
## 2 714963
## 3 1061650
## 4 346633
## 5 427532
## 6 359967
Since our focus is on three specific cities, we filter the data to include only New York, Miami, and Los Angeles. We then clean the dataset by renaming the variables and restructuring it from a wide format to a longer format.
filtered_housingprices <- housingprices %>%
filter(RegionName %in% c("New York, NY", "Miami, FL", "Los Angeles, CA")) %>%
pivot_longer(
cols = starts_with("X20"),
names_to = "Month",
values_to = "Value"
) %>%
mutate(Month = as.Date(Month, format = "X%Y.%m.%d")) %>%
mutate(Year = format(Month, "%Y")) %>%
group_by(Year, RegionName) %>%
mutate(Yearly_Mean = mean(Value, na.rm = TRUE)) %>%
select("SizeRank", "RegionName", "Year", "Yearly_Mean") %>%
distinct()
head(filtered_housingprices)
## # A tibble: 6 × 4
## # Groups: Year, RegionName [6]
## SizeRank RegionName Year Yearly_Mean
## <int> <chr> <chr> <dbl>
## 1 1 New York, NY 2018 522903.
## 2 1 New York, NY 2019 540720.
## 3 1 New York, NY 2020 570358.
## 4 1 New York, NY 2021 587392.
## 5 1 New York, NY 2022 603637.
## 6 1 New York, NY 2023 663933.
Since 2018, housing prices have risen significantly across all three cities, with the most dramatic increase observed in Los Angeles.
ggplot(filtered_housingprices, aes(x = Year, y = Yearly_Mean, color = RegionName, group = RegionName)) +
geom_line(size = 1) +
geom_text(aes(label = round(Yearly_Mean, 2)), vjust = -0.5, size = 3, check_overlap = TRUE) +
labs(
title = "Yearly Mean Housing Prices for New York, Miami, and Los Angeles",
x = "Year",
y = "Yearly Mean Housing Price"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_continuous(labels = scales::comma)
The rate of change in housing prices shows that Miami experienced the most significant fluctuation, with a nearly 28% increase. It’s also the only city to see a negative change in 2024, at -4%. On the other hand, both New York and Los Angeles saw more stable price changes, with no year surpassing a 10%.
housing_rate_of_change <- filtered_housingprices %>%
group_by(RegionName) %>%
arrange(Year) %>%
mutate(
Rate_of_Change = (Yearly_Mean - lag(Yearly_Mean)) / lag(Yearly_Mean) * 100
) %>%
filter(!is.na(Rate_of_Change))
ggplot(housing_rate_of_change, aes(x = Year, y = Rate_of_Change, fill = RegionName)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(Rate_of_Change, 2)),
position = position_dodge(width = 0.8),
vjust = -0.5, size = 3) +
labs(
title = "Year-over-Year Rate of Change in Housing Prices",
subtitle = "For New York, Los Angeles, and Miami",
x = "Year",
y = "Rate of Change (%)",
fill = "City"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Three expensive cities — Los Angeles is the most expensive city to buy a home in. While Miami has less expensive homes than LA or New York, it does skew to the right, meaning there are more higher valued homes than not.
anova_housingprices <- aov(Yearly_Mean ~ RegionName, data = filtered_housingprices)
summary(anova_housingprices)
## Df Sum Sq Mean Sq F value Pr(>F)
## RegionName 2 6.329e+11 3.165e+11 40.21 2.29e-07 ***
## Residuals 18 1.417e+11 7.870e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(filtered_housingprices, aes(x = RegionName, y = Yearly_Mean, fill = RegionName)) +
geom_boxplot() +
labs(title = "ANOVA Comparison of Housing Prices by Region")+
scale_y_continuous(labels = scales::comma)
Again, we load the CSV file from GitHub.
rawincome <- read.csv("https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/Income.csv")
head(rawincome)
## Metropolitan.Statistical.Areas X2021.Dollars X2022.Dollars X2023.Dollars
## 1 Abilene, TX 51,995 53,514 56,034
## 2 Akron, OH 59,583 59,609 62,904
## 3 Albany, GA 46,969 46,754 48,546
## 4 Albany-Lebanon, OR 52,230 54,123 57,016
## 5 Albany-Schenectady-Troy, NY 66,929 67,884 71,972
## 6 Albuquerque, NM 52,372 54,687 57,278
## US.Rank.2023
## 1 239
## 2 130
## 3 344
## 4 219
## 5 57
## 6 216
Then, we filter for the appropriate cities while reformatting the column names and structure.
filtered_income <- rawincome %>%
filter(Metropolitan.Statistical.Areas %in% c("Los Angeles-Long Beach-Anaheim, CA", "Miami-Fort Lauderdale-Pompano Beach, FL", "New York-Newark-Jersey City, NY-NJ-PA") ) %>%
rename_with(~ gsub("^X", "", .x), starts_with("X")) %>%
pivot_longer(
cols = starts_with("20"),
names_to = "Income by Year",
values_to = "Average Income"
) %>%
mutate(`Income by Year` = str_remove(`Income by Year`, "\\.Dollars"))
head(filtered_income)
## # A tibble: 6 × 4
## Metropolitan.Statistical.Areas US.Rank.2023 `Income by Year` `Average Income`
## <chr> <int> <chr> <chr>
## 1 Los Angeles-Long Beach-Anaheim… 24 2021 75,332
## 2 Los Angeles-Long Beach-Anaheim… 24 2022 76,760
## 3 Los Angeles-Long Beach-Anaheim… 24 2023 80,898
## 4 Miami-Fort Lauderdale-Pompano … 19 2021 73,220
## 5 Miami-Fort Lauderdale-Pompano … 19 2022 78,610
## 6 Miami-Fort Lauderdale-Pompano … 19 2023 84,302
In recent years, the average income in the New York area has seen notable growth, making it the highest among the three regions. Miami surpassed Los Angeles’ 2021 income in both 2022 and 2023. Meanwhile, Los Angeles has experienced the slowest growth in comparison.
ggplot(filtered_income, aes(x = `Income by Year`, y = `Average Income`,
group = Metropolitan.Statistical.Areas,
color = Metropolitan.Statistical.Areas)) +
geom_line(size = 1.2) +
geom_point(size = 3) +
labs(
title = "Average Income by Year in Metropolitan Areas",
x = "Year",
y = "Average Income (Dollars)",
color = "Metropolitan Areas"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
For both 2022 and 2023, Miami had the greatest change of income growth, followed by New York and Los Angeles.
filtered_income_num <- filtered_income %>%
mutate(`Average Income` = as.numeric(gsub(",", "", gsub("\\$", "", `Average Income`))))
filtered_income_change <- filtered_income_num %>%
arrange(Metropolitan.Statistical.Areas, `Income by Year`) %>%
group_by(Metropolitan.Statistical.Areas) %>%
mutate(rate_of_change = (`Average Income` - lag(`Average Income`)) / lag(`Average Income`) * 100)
ggplot(filtered_income_change, aes(x = `Income by Year`, y = rate_of_change,
fill = Metropolitan.Statistical.Areas,
label = round(rate_of_change, 2))) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(rate_of_change, 2)), position = position_dodge(width = 0.8), vjust = -0.5) +
labs(
title = "Rate of Change in Average Income by Year in Metropolitan Areas",
x = "Year",
y = "Rate of Change (%)",
fill = "Metropolitan Areas"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).
A heatmap visual — where the darkest and highest income (above 90k) is found in the NY area as of 2023.
filtered_income$`Average Income` <- as.numeric(gsub(",", "", filtered_income$`Average Income`))
ggplot(filtered_income, aes(x = `Income by Year`, y = Metropolitan.Statistical.Areas, fill = `Average Income`)) +
geom_tile() +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(
title = "Heatmap of Average Income by Year and Metropolitan Area",
x = "Year",
y = "Metropolitan Area",
fill = "Average Income"
) +
theme_minimal()
For our homeownership rates data, we loaded the csv and removed the first few lines.
rawhomeownership <- read_csv("https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/Homeownership%20Rates%20-%20Sheet1.csv")
## New names:
## Rows: 79 Columns: 25
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Metropolitan Statistical Area dbl (24): First Quarter 2023, Margin of
## Error1...3, Second Quarter 2023,...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `Margin of Error1` -> `Margin of Error1...3`
## • `Margin of Error1` -> `Margin of Error1...5`
## • `Margin of Error1` -> `Margin of Error1...7`
## • `Margin of Error1` -> `Margin of Error1...9`
## • `Margin of Error1` -> `Margin of Error1...11`
## • `Margin of Error1` -> `Margin of Error1...13`
## • `Margin of Error1` -> `Margin of Error1...15`
## • `Margin of Error1` -> `Margin of Error1...17`
## • `Margin of Error1` -> `Margin of Error1...19`
## • `Margin of Error1` -> `Margin of Error1...21`
## • `Margin of Error1` -> `Margin of Error1...23`
## • `Margin of Error1` -> `Margin of Error1...25`
rawhomeownership$`Metropolitan Statistical Area` <- gsub("[,\\.]", "", rawhomeownership$`Metropolitan Statistical Area`)
rawhomeownership <- rawhomeownership[-c(1, 2, 3), ]
rawhomeownership
## # A tibble: 76 × 25
## `Metropolitan Statistical Area` `First Quarter 2023` `Margin of Error1...3`
## <chr> <dbl> <dbl>
## 1 "Akron OH " 64.5 9.4
## 2 "Albany-Schenectady-Troy NY " 79.6 8.1
## 3 "Albuquerque NM" 67.2 4.5
## 4 "Allentown-Bethlehem-Easton PA… 69.7 9.7
## 5 "Atlanta-Sandy Springs-Roswell… 66.6 3.6
## 6 "Austin-Round Rock TX" 63.4 6
## 7 "Baltimore-Columbia-Towson MD2" 72.9 5.2
## 8 "Baton Rouge LA………………………………………" 64.8 6.2
## 9 "Birmingham-Hoover AL" 67 6.1
## 10 "Boston-Cambridge-Newton MA-NH… 61.1 3.5
## # ℹ 66 more rows
## # ℹ 22 more variables: `Second Quarter 2023` <dbl>,
## # `Margin of Error1...5` <dbl>, `Third Quarter 2023` <dbl>,
## # `Margin of Error1...7` <dbl>, `Fourth Quarter 2023` <dbl>,
## # `Margin of Error1...9` <dbl>, `First Quarter 2022` <dbl>,
## # `Margin of Error1...11` <dbl>, `Second Quarter 2022` <dbl>,
## # `Margin of Error1...13` <dbl>, `Third Quarter 2022` <dbl>, …
To clean our data, we had to restructure the columns and rearrange by Quarter and Year. Otherwise, our x-axis would present in alphabetical order, and therefore not in time order.
rawhomeownership <- rawhomeownership %>%
clean_names()
rawhomeownership
## # A tibble: 76 × 25
## metropolitan_statistical_area first_quarter_2023 margin_of_error1_3
## <chr> <dbl> <dbl>
## 1 "Akron OH " 64.5 9.4
## 2 "Albany-Schenectady-Troy NY " 79.6 8.1
## 3 "Albuquerque NM" 67.2 4.5
## 4 "Allentown-Bethlehem-Easton PA-NJ" 69.7 9.7
## 5 "Atlanta-Sandy Springs-Roswell GA1" 66.6 3.6
## 6 "Austin-Round Rock TX" 63.4 6
## 7 "Baltimore-Columbia-Towson MD2" 72.9 5.2
## 8 "Baton Rouge LA………………………………………" 64.8 6.2
## 9 "Birmingham-Hoover AL" 67 6.1
## 10 "Boston-Cambridge-Newton MA-NH3" 61.1 3.5
## # ℹ 66 more rows
## # ℹ 22 more variables: second_quarter_2023 <dbl>, margin_of_error1_5 <dbl>,
## # third_quarter_2023 <dbl>, margin_of_error1_7 <dbl>,
## # fourth_quarter_2023 <dbl>, margin_of_error1_9 <dbl>,
## # first_quarter_2022 <dbl>, margin_of_error1_11 <dbl>,
## # second_quarter_2022 <dbl>, margin_of_error1_13 <dbl>,
## # third_quarter_2022 <dbl>, margin_of_error1_15 <dbl>, …
homeownership <- rawhomeownership %>%
filter(`metropolitan_statistical_area` %in% c("Los Angeles-Long Beach-Anaheim CA14",
"Miami-Fort Lauderdale-West Palm Beach FL16",
"New York-Newark-Jersey City NY-NJ-PA19")) %>%
select(`metropolitan_statistical_area`, `first_quarter_2021`, `second_quarter_2021`, `third_quarter_2021`, `fourth_quarter_2021`,
`first_quarter_2022`, `second_quarter_2022`, `third_quarter_2022`, `fourth_quarter_2022`,
`first_quarter_2023`, `second_quarter_2023`, `third_quarter_2023`, `fourth_quarter_2023`) %>%
pivot_longer(
cols = -`metropolitan_statistical_area`,
names_to = "year_quarter",
values_to = "homeownership_rate"
) %>%
mutate(
quarter = case_when(
grepl("first", year_quarter) ~ "Q1",
grepl("second", year_quarter) ~ "Q2",
grepl("third", year_quarter) ~ "Q3",
grepl("fourth", year_quarter) ~ "Q4"
),
year = sub(".*(\\d{4}).*", "\\1", year_quarter),
label = paste(quarter, year),
label = factor(label, levels = c("Q1 2021", "Q2 2021", "Q3 2021", "Q4 2021",
"Q1 2022", "Q2 2022", "Q3 2022", "Q4 2022",
"Q1 2023", "Q2 2023", "Q3 2023", "Q4 2023"))
)
homeownership
## # A tibble: 36 × 6
## metropolitan_statistica…¹ year_quarter homeownership_rate quarter year label
## <chr> <chr> <dbl> <chr> <chr> <fct>
## 1 Los Angeles-Long Beach-A… first_quart… 61.2 Q1 2021 Q1 2…
## 2 Los Angeles-Long Beach-A… second_quar… 65.7 Q2 2021 Q2 2…
## 3 Los Angeles-Long Beach-A… third_quart… 63.9 Q3 2021 Q3 2…
## 4 Los Angeles-Long Beach-A… fourth_quar… 67.7 Q4 2021 Q4 2…
## 5 Los Angeles-Long Beach-A… first_quart… 63.4 Q1 2022 Q1 2…
## 6 Los Angeles-Long Beach-A… second_quar… 64 Q2 2022 Q2 2…
## 7 Los Angeles-Long Beach-A… third_quart… 62.4 Q3 2022 Q3 2…
## 8 Los Angeles-Long Beach-A… fourth_quar… 67.8 Q4 2022 Q4 2…
## 9 Los Angeles-Long Beach-A… first_quart… 46.1 Q1 2023 Q1 2…
## 10 Los Angeles-Long Beach-A… second_quar… 46.8 Q2 2023 Q2 2…
## # ℹ 26 more rows
## # ℹ abbreviated name: ¹metropolitan_statistical_area
All areas show a downward trend in homeownership despite an upward trend in average income.
ggplot(homeownership, aes(x = label, y = `homeownership_rate`, color = `metropolitan_statistical_area`, group = `metropolitan_statistical_area`)) +
geom_line(size = 1) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", size = 0.5) + # Separate linear models for each region
labs(
title = "Homeownership Rates by Year and Metropolitan Area",
x = "Year and Quarter",
y = "Homeownership Rate (%)",
color = "Metropolitan Statistical Area"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'
In fact, Los Angeles had the greatest drop in homeownership rates as of Q1 of 2023 at -32%. If we recall, Los Angeles had the most expensive housing prices, and the slowest average income growth. New York follows at -25%, despite have the highest average income of the areas.
homeownership_change <- homeownership %>%
group_by(`metropolitan_statistical_area`) %>%
arrange(`metropolitan_statistical_area`, `label`) %>%
mutate(rate_of_change = (`homeownership_rate` - lag(`homeownership_rate`)) / lag(`homeownership_rate`) * 100)
ggplot(homeownership_change, aes(x = label, y = rate_of_change, fill = `metropolitan_statistical_area`)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(rate_of_change, 2)),
position = position_dodge(width = 0.8),
vjust = -1, size = 2) +
labs(
title = "Rate of Change in Homeownership Rates by Year and Metropolitan Area",
x = "Year and Quarter",
y = "Rate of Change (%)",
fill = "Metropolitan Statistical Area"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).
Here we load the data for non farmer jobs in the US since the 1990s.
raw_nonfarmer <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/total_non_farm_employment.csv"
nonfarmerjobs <- read_csv(raw_nonfarmer)
## Rows: 419 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): PAYEMS
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nonfarmerjobs
## # A tibble: 419 × 2
## DATE PAYEMS
## <date> <dbl>
## 1 1990-01-01 109189
## 2 1990-02-01 109432
## 3 1990-03-01 109637
## 4 1990-04-01 109671
## 5 1990-05-01 109834
## 6 1990-06-01 109857
## 7 1990-07-01 109822
## 8 1990-08-01 109610
## 9 1990-09-01 109520
## 10 1990-10-01 109374
## # ℹ 409 more rows
The number of non-farm jobs has steadily increased since 1990, with a significant drop in 2020 due to the pandemic. However, the job market has since rebounded, surpassing pre-pandemic levels. Non-farm jobs make up nearly 80% of the workforce contributing to the country’s GDP. Despite this substantial contribution, homeownership rates have been declining, raising questions about the broader economic and social implications.
ggplot(nonfarmerjobs, aes(x = DATE, y = PAYEMS)) +
geom_line(color = "skyblue", size = 1) +
labs(
title = "Non Farmer Jobs Over the Years",
x = "Date",
y = "Non Farmer Job Count"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The data on housing prices, average income, and homeownership rates across New York, Miami, and Los Angeles reveal notable disparities in the growth and stability of these metrics. Despite New York having the highest average income growth, it experienced a sharp decline in homeownership rates, mirroring a similar trend in Los Angeles around the same period (Q4 2022). Los Angeles, in particular, has shown the most significant challenges, with the slowest growth in income and a sharp decline in homeownership, making it the most vulnerable area in this study. In contrast, Miami demonstrated more stability, with a substantial increase in income and a lower rate of change in homeownership.
It’s important for policymakers and stakeholders in California to focus on initiatives that address the affordability crisis and improve access to homeownership. Potential solutions could include expanding affordable housing projects, offering homebuyer assistance programs, and addressing the root causes of income inequality. For New York, focusing on supporting sustainable growth in both income and homeownership could help mitigate the risks of economic disparity. Miami, while showing promising trends, could further benefit from targeted policies that ensure the continued growth of both its housing market and income levels.