Libraries

Abstract

This project investigates the relationship between socioeconomic factors and the housing market in the United States, focusing on three metropolitan areas: New York City, Miami, and Los Angeles. Utilizing datasets on median housing prices, homeownership rates, education, jobs, and household income across multiple years and quarters, the study aims to uncover trends and disparities in housing affordability and accessibility. The analysis incorporates visualizations and highlights the potential impact of regional economic conditions and policy decisions. By understanding these patterns, this research seeks to inform strategies for improving housing equity and affordability, especially in urban areas with diverse socioeconomic dynamics. The findings contribute to a broader discourse on the intersection of economic policy, urban development, and social equity in the U.S. housing market.

Housing Prices

First, the raw housing data is loaded directly from a CSV file hosted in a GitHub repository.

load USA dataset

cpi_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/cpi.csv"
electricity_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/electricity_price_city_usa.csv"
m2_money_supply_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/m2_money_supply.csv"
median_house_prices_sold_usa_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/median_house_prices_sold_usa.csv"
employment_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/total_non_farm_employment.csv"
median_household_income_usa_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/median_household_income_usa.csv"

cpi_df <- read_csv(cpi_url)

## Rows: 418 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): CPIAUCSL
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

electricity_df <- read_csv(electricity_url)

## Rows: 35 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

m2_money_supply_df <- read_csv(m2_money_supply_url)

## Rows: 418 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): M2SL
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

median_house_prices_sold_usa_df <- read_csv(median_house_prices_sold_usa_url)

## Rows: 139 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): MSPUS
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

employment_df <- read_csv(employment_url)

## Rows: 419 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): PAYEMS
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

median_household_income_usa_df <- read.csv(median_household_income_usa_url)

load nyc dataset

nyc_median_income_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/new%20york%20tristate/nyc_median_household_income.csv"
nyc_tristate_electricity_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/new%20york%20tristate/nyc_tristate_electricity.csv"

nyc_median_income_df <- read_csv(nyc_median_income_url)

## Rows: 34 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): MEHOINUSNYA646N
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nyc_tristate_electricity_df <- read_csv(nyc_tristate_electricity_url)

## Rows: 35 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

load miami dataset

miami_electricity_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/miami/miami_electricity.csv"
miami_median_income_url <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/miami/miami_median_household_income.csv"

miami_electricity_df <- read_csv(miami_electricity_url)

## Rows: 35 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

miami_median_income_df <- read_csv(miami_median_income_url)

## Rows: 40 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): MEHOINUSFLA646N
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

load califonia dataset

san_francisco_electricity_url  <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/california/san_francisco_electricity.csv"
san_francisco_median_income_url  <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/california/san_francisco_median_household_income.csv"
san_francisco_electricity_df <- read_csv(san_francisco_electricity_url )

## Rows: 35 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

san_francisco_median_income_df <- read_csv(san_francisco_median_income_url)

## Rows: 34 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): MHICA06075A052NCEN
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Transformation

Electrcity Df Transformation

transform_electricity_data <- function(df) {
  transformed_df <- df %>%
    pivot_longer(
      cols = -Year, 
      names_to = "Month", 
      values_to = "Electricity Cost per Kilowatt-Hour" 
    ) %>%
    mutate(
      Date = as.Date(paste(Year, Month, "01", sep = "-"), format = "%Y-%b-%d")
    ) %>%
    select(Date, `Electricity Cost per Kilowatt-Hour`)
  
  return(transformed_df)
}

san_francisco_electricity_df <- transform_electricity_data(san_francisco_electricity_df)
electricity_df <- transform_electricity_data(electricity_df)
nyc_tristate_electricity_df <- transform_electricity_data(nyc_tristate_electricity_df)
miami_electricity_df<- transform_electricity_data(miami_electricity_df)

rename columns

rename_column <- function(df, old_col_name, new_col_name) {
  renamed_df <- df %>%
    rename(!!new_col_name := !!sym(old_col_name))
  
  return(renamed_df)
}

san_francisco_median_income_df <- rename_column(san_francisco_median_income_df,"MHICA06075A052NCEN", "San Francisco Median Income")
nyc_median_income_df <- rename_column(nyc_median_income_df,"MEHOINUSNYA646N", "NYC Median Income")
miami_median_income_df <- rename_column(miami_median_income_df,"MEHOINUSFLA646N", "Miami Median Income")
median_household_income_usa_df <- rename_column(median_household_income_usa_df,"MEHOINUSA646N", "USA Median Income")

cpi_df <- rename_column(cpi_df,"CPIAUCSL", "CPI Index")
median_house_prices_sold_usa_df <- rename_column(median_house_prices_sold_usa_df,"MSPUS", "Median Housing Price Sold USA")
employment_df <- rename_column(employment_df,"PAYEMS", "Population Employed USA Thousand Per Person")

san_francisco_electricity_df <- rename_column(san_francisco_electricity_df,"Electricity Cost per Kilowatt-Hour", "San Francisco Electricity Cost per Kilowatt-Hour")
electricity_df <- rename_column(electricity_df,"Electricity Cost per Kilowatt-Hour", "USA Electricity Cost per Kilowatt-Hour")
nyc_tristate_electricity_df <- rename_column(nyc_tristate_electricity_df,"Electricity Cost per Kilowatt-Hour", "NYC Electricity Cost per Kilowatt-Hour")
miami_electricity_df<- rename_column(miami_electricity_df,"Electricity Cost per Kilowatt-Hour", "Miami Electricity Cost per Kilowatt-Hour")

Merge electricty data into a one df

merged_electricity_df <- san_francisco_electricity_df %>%
  full_join(electricity_df, by = "Date") %>%
  full_join(nyc_tristate_electricity_df, by = "Date") %>%
  full_join(miami_electricity_df, by = "Date")

head(merged_electricity_df)

## # A tibble: 6 × 5
##   Date       San Francisco Electricity Cost per Kilowat…¹ USA Electricity Cost…²
##   <date>                                            <dbl>                  <dbl>
## 1 1990-01-01                                        0.109                  0.081
## 2 1990-02-01                                        0.109                  0.081
## 3 1990-03-01                                        0.109                  0.081
## 4 1990-04-01                                        0.109                  0.082
## 5 1990-05-01                                        0.111                  0.082
## 6 1990-06-01                                        0.111                  0.088
## # ℹ abbreviated names: ¹`San Francisco Electricity Cost per Kilowatt-Hour`,
## #   ²`USA Electricity Cost per Kilowatt-Hour`
## # ℹ 2 more variables: `NYC Electricity Cost per Kilowatt-Hour` <dbl>,
## #   `Miami Electricity Cost per Kilowatt-Hour` <dbl>

ggplot(data = merged_electricity_df, aes(x = Date)) +
  geom_line(aes(y = `San Francisco Electricity Cost per Kilowatt-Hour`, color = "San Francisco"), size = 1) +
  geom_line(aes(y = `USA Electricity Cost per Kilowatt-Hour`, color = "USA"), size = 1) +
  geom_line(aes(y = `NYC Electricity Cost per Kilowatt-Hour`, color = "NYC"), size = 1) +
  geom_line(aes(y = `Miami Electricity Cost per Kilowatt-Hour`, color = "Miami"), size = 1) +
  labs(
    title = "Electricity Costs Over Time",
    x = "Date",
    y = "Electricity Cost (per Kilowatt-Hour)",
    color = "City/Region"
  ) +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

heatmap_data <- merged_electricity_df %>%
  pivot_longer(
    cols = -Date, 
    names_to = "Region", 
    values_to = "Electricity_Cost" 
  )

# Create the heatmap
ggplot(data = heatmap_data, aes(x = Date, y = Region, fill = Electricity_Cost)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red", name = "Electricity Cost") +
  labs(
    title = "Electricity Costs Over Time by Region",
    x = "Year",
    y = "Region"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ratio_df <- merged_electricity_df %>%
  mutate(
    `San Francisco Ratio` = `San Francisco Electricity Cost per Kilowatt-Hour` / `USA Electricity Cost per Kilowatt-Hour`,
    `NYC Ratio` = `NYC Electricity Cost per Kilowatt-Hour` / `USA Electricity Cost per Kilowatt-Hour`,
    `Miami Ratio` = `Miami Electricity Cost per Kilowatt-Hour` / `USA Electricity Cost per Kilowatt-Hour`
  ) %>%
  select(Date, `San Francisco Ratio`, `NYC Ratio`, `Miami Ratio`) %>%
  pivot_longer(
    cols = -Date, 
    names_to = "Region",
    values_to = "Ratio" 
  )

# Create the ratio plot
ggplot(data = ratio_df, aes(x = Date, y = Ratio, color = Region)) +
  geom_line(size = 1) +
  labs(
    title = "Electricity Price Ratios Relative to USA Average",
    x = "Date",
    y = "Price Ratio",
    color = "Region"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_line()`).

scatter_data <- merged_electricity_df %>%
  pivot_longer(
    cols = -Date, 
    names_to = "Region", 
    values_to = "Electricity_Cost" 
  )

ggplot(data = scatter_data, aes(x = Date, y = Electricity_Cost, color = Region)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(
    title = "Scatter Plot of Electricity Costs Over Time",
    x = "Date",
    y = "Electricity Cost (per Kilowatt-Hour)",
    color = "Region"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_point()`).

san_francisco_median_income_df <- san_francisco_median_income_df %>%
  mutate(DATE = as.Date(DATE))

nyc_median_income_df <- nyc_median_income_df %>%
  mutate(DATE = as.Date(DATE))

miami_median_income_df <- miami_median_income_df %>%
  mutate(DATE = as.Date(DATE))

median_household_income_usa_df <- median_household_income_usa_df %>%
  mutate(DATE = as.Date(DATE))

merged_income_df <- san_francisco_median_income_df %>%
  full_join(nyc_median_income_df, by = "DATE") %>%
  full_join(miami_median_income_df, by = "DATE") %>%
  full_join(median_household_income_usa_df, by = "DATE")

merged_income_df <- merged_income_df %>%
  select(-starts_with("Date.")) %>% 
  slice(-1)                       

merged_income_df <- merged_income_df %>%
  mutate(across(
    c(`San Francisco Median Income`, `NYC Median Income`, `Miami Median Income`, `USA Median Income`),
    ~ as.numeric(.)
  ))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! NAs introduced by coercion

head(merged_income_df)

## # A tibble: 6 × 5
##   DATE       San Francisco Median In…¹ `NYC Median Income` `Miami Median Income`
##   <date>                         <dbl>               <dbl>                 <dbl>
## 1 1990-01-01                        NA               31590                 26690
## 2 1991-01-01                        NA               31790                 27250
## 3 1992-01-01                        NA               31050                 27350
## 4 1993-01-01                     34623               31700                 28550
## 5 1994-01-01                        NA               31900                 29290
## 6 1995-01-01                     37854               33030                 29750
## # ℹ abbreviated name: ¹`San Francisco Median Income`
## # ℹ 1 more variable: `USA Median Income` <dbl>

ggplot(data = merged_income_df, aes(x = DATE)) +
  geom_line(aes(y = `San Francisco Median Income`, color = "San Francisco"), size = 1) +
  geom_line(aes(y = `NYC Median Income`, color = "NYC"), size = 1) +
  geom_line(aes(y = `Miami Median Income`, color = "Miami"), size = 1) +
  geom_line(aes(y = `USA Median Income`, color = "USA"), size = 1) +
  labs(
    title = "Median Income Trends Over Time",
    x = "Year",
    y = "Median Income",
    color = "Region"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(labels = scales::comma)

## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

long_income_df <- merged_income_df %>%
  pivot_longer(
    cols = -DATE,
    names_to = "Region",
    values_to = "Median_Income"
  )

ggplot(long_income_df, aes(x = Region, y = Median_Income, fill = Region)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Median Incomes by Region",
    x = "Region",
    y = "Median Income"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(labels = scales::comma)

## Warning: Removed 21 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

cpi

ggplot(cpi_df, aes(x = as.Date(DATE), y = `CPI Index`)) +
  geom_line(color = "red", size = 1) +
  labs(
    title = "CPI Index Over Time",
    x = "Date",
    y = "CPI Index"
  ) +
  theme_minimal()

colnames(cpi_df)

## [1] "DATE"      "CPI Index"

# Calculate the growth rate for CPI
cpi_plot_df <- cpi_df %>%
  arrange(as.Date(DATE)) %>% 
  mutate(Growth_Rate = (`CPI Index` - lag(`CPI Index`)) / lag(`CPI Index`) * 100) # Percentage growth

# Create the growth rate line plot
ggplot(cpi_plot_df, aes(x = as.Date(DATE), y = Growth_Rate)) +
  geom_line(color = "red", size = 1) +
  labs(
    title = "CPI Growth Rate Over Time",
    x = "Date",
    y = "Growth Rate (%)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

m2 money supply

ggplot(m2_money_supply_df, aes(x = as.Date(DATE), y = M2SL)) +
  geom_line(color = "green", size = 1) +
  labs(
    title = "M2 Money Supply Over Time",
    x = "Date",
    y = "M2SL (Money Supply)"
  ) +
  theme_minimal()

m2_money_supply_plot_df <- m2_money_supply_df %>%
  mutate(Growth_Rate = (M2SL - lag(M2SL)) / lag(M2SL) * 100)
ggplot(m2_money_supply_plot_df, aes(x = as.Date(DATE), y = Growth_Rate)) +
  geom_line(color = "green", size = 1) +
  labs(title = "Growth Rate of M2 Money Supply", x = "Date", y = "Growth Rate (%)") +
  theme_minimal()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

median house hold prices sold

ggplot(median_house_prices_sold_usa_df, aes(x = as.Date(DATE), y = `Median Housing Price Sold USA`)) +
  geom_line(color = "purple", size = 1) +
  labs(
    title = "Median Housing Prices Sold in the USA Over Time",
    x = "Date",
    y = "Median Housing Price"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

ggplot(employment_df, aes(x = as.Date(DATE), y = `Population Employed USA Thousand Per Person`)) +
  geom_line(color = "orange", size = 1) +
  labs(
    title = "USA Population Employed (Thousands) Over Time",
    x = "Date",
    y = "Population Employed (Thousands)"
  ) +
  theme_minimal()

employment_plot_df <- employment_df %>%
  mutate(Year = format(as.Date(DATE), "%Y")) %>%
  group_by(Year) %>%
  summarize(Annual_Change = last(`Population Employed USA Thousand Per Person`) -
                              first(`Population Employed USA Thousand Per Person`))
ggplot(employment_plot_df, aes(x = Year, y = Annual_Change)) +
  geom_bar(stat = "identity", fill = "orange") +
  labs(title = "Annual Changes in Employment", x = "Year", y = "Change in Employment (Thousands)") +
  theme_minimal()

NYC Analysis on income

merged_electricity_df <- merged_electricity_df %>% rename(DATE = Date)

nyc_data <- merged_income_df %>%
  select(DATE, `NYC Median Income`) %>%
  rename(Median_Income = `NYC Median Income`) %>%
  left_join(
    merged_electricity_df %>% select(DATE, `NYC Electricity Cost per Kilowatt-Hour`),
    by = "DATE"
  ) %>%
  left_join(
    cpi_df %>% rename(CPI_Index = `CPI Index`),
    by = "DATE"
  ) %>%
  left_join(
    m2_money_supply_df,
    by = "DATE"
  ) %>%
  left_join(
    employment_df %>% rename(Employment_Thousands = `Population Employed USA Thousand Per Person`),
    by = "DATE"
  )

nyc_data_clean <- nyc_data %>% na.omit()

regression_model <- lm(
  Median_Income ~ `NYC Electricity Cost per Kilowatt-Hour` + CPI_Index + M2SL + Employment_Thousands,
  data = nyc_data_clean
)

summary(regression_model)

## 
## Call:
## lm(formula = Median_Income ~ `NYC Electricity Cost per Kilowatt-Hour` + 
##     CPI_Index + M2SL + Employment_Thousands, data = nyc_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3752.3  -937.2   117.2  1053.6  4971.3 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              -1.994e+04  7.030e+03  -2.836  0.00824
## `NYC Electricity Cost per Kilowatt-Hour` -1.245e+04  2.808e+04  -0.443  0.66078
## CPI_Index                                -3.917e+00  5.431e+01  -0.072  0.94300
## M2SL                                      1.821e+00  2.177e-01   8.367 3.20e-09
## Employment_Thousands                      4.267e-01  9.225e-02   4.626 7.17e-05
##                                             
## (Intercept)                              ** 
## `NYC Electricity Cost per Kilowatt-Hour`    
## CPI_Index                                   
## M2SL                                     ***
## Employment_Thousands                     ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1853 on 29 degrees of freedom
## Multiple R-squared:  0.9849, Adjusted R-squared:  0.9828 
## F-statistic: 473.5 on 4 and 29 DF,  p-value: < 2.2e-16

Regression Analysis: NYC Median Income

1. Overall Model Fit

Multiple R-squared (0.9849):
- This indicates that 98.49% of the variation in NYC Median Income is explained by the predictors (NYC Electricity Cost per Kilowatt-Hour, CPI Index, M2 Money Supply, and Employment_Thousands).
- A very high R-squared suggests the model fits the data well.
Adjusted R-squared (0.9828):
- This value adjusts for the number of predictors and confirms that the model explains a significant portion of the variation, even after accounting for the complexity of the model.
F-statistic (473.5, p < 2.2e-16):
- The overall model is highly significant, indicating that at least one predictor has a meaningful relationship with NYC Median Income.

2. Coefficients

Each coefficient represents the estimated impact of a one-unit increase in the predictor on NYC Median Income, holding other variables constant:

Intercept (-19,940):
- When all predictors are zero, the model predicts a baseline NYC Median Income of -$19,940.
- This value is not interpretable in a practical sense because predictors like M2 Money Supply and employment can’t be zero.
NYC Electricity Cost per Kilowatt-Hour (-12,450):
- Estimate: A $1 increase in NYC electricity cost is associated with a $12,450 decrease in NYC Median Income.
- p-value (0.661): Not statistically significant (p > 0.05), meaning there’s no strong evidence that electricity costs impact median income in NYC.
CPI Index (-3.92):
- Estimate: A one-unit increase in the CPI Index is associated with a $3.92 decrease in NYC Median Income.
- p-value (0.943): Not statistically significant (p > 0.05). CPI does not have a measurable direct impact on NYC Median Income in this model.
M2 Money Supply (1.82):
- Estimate: A $1 billion increase in M2 money supply is associated with a $1.82 increase in NYC Median Income.
- p-value (3.20e-09): Highly statistically significant (p < 0.001). There’s a strong positive correlation between M2 and NYC Median Income.
Employment Levels (0.427):
- Estimate: An increase of 1,000 employed persons in the USA is associated with a $0.427 increase in NYC Median Income.
- p-value (7.17e-05): Highly statistically significant (p < 0.001). Employment levels are strongly correlated with NYC Median Income.

3. Residuals

Residuals represent the differences between observed and predicted values: - The residuals are relatively small, with a standard error of 1853, indicating that the model predicts NYC Median Income fairly accurately.

4. Key Takeaways

Significant Predictors:

M2 Money Supply and Employment_Thousands have statistically significant positive relationships with NYC Median Income.
- This suggests that broader economic growth (as captured by M2 money supply) and higher employment levels significantly drive median income in NYC.

Insignificant Predictors:

NYC Electricity Cost per Kilowatt-Hour and CPI Index are not statistically significant, meaning there’s no strong evidence that these directly influence NYC Median Income in this model.

5. Correlation

Yes, there is a strong correlation between NYC Median Income and both M2 Money Supply and Employment_Thousands, as evidenced by their significance levels and positive coefficients.
However, no significant correlation is found with electricity costs or CPI Index.

ggplot(nyc_data_clean, aes(x = M2SL, y = Median_Income)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between M2 Money Supply and NYC Median Income",
    x = "M2 Money Supply (Billion USD)",
    y = "NYC Median Income"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

ggplot(nyc_data_clean, aes(x = Employment_Thousands, y = Median_Income)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between Employment Levels and NYC Median Income",
    x = "Employment Levels (Thousands)",
    y = "NYC Median Income"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

ggplot(nyc_data_clean, aes(x = M2SL, y = Median_Income, color = Employment_Thousands)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  scale_color_gradient(low = "lightblue", high = "darkblue", name = "Employment Levels") +
  labs(
    title = "Combined Effect of M2 Money Supply and Employment on NYC Median Income",
    x = "M2 Money Supply (Billion USD)",
    y = "NYC Median Income"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

MIAMI DATA ANALYSIS

# Merge and clean the Miami-specific data
miami_data <- merged_income_df %>%
  select(DATE, `Miami Median Income`) %>%
  rename(Median_Income = `Miami Median Income`) %>%
  left_join(
    merged_electricity_df %>% select(DATE, `Miami Electricity Cost per Kilowatt-Hour`),
    by = "DATE"
  ) %>%
  left_join(
    cpi_df %>% rename(CPI_Index = `CPI Index`),
    by = "DATE"
  ) %>%
  left_join(
    m2_money_supply_df,
    by = "DATE"
  ) %>%
  left_join(
    employment_df %>% rename(Employment_Thousands = `Population Employed USA Thousand Per Person`),
    by = "DATE"
  )

miami_data_clean <- miami_data %>% na.omit()

regression_model_miami <- lm(
  Median_Income ~ `Miami Electricity Cost per Kilowatt-Hour` + CPI_Index + M2SL + Employment_Thousands,
  data = miami_data_clean
)

summary(regression_model_miami)

## 
## Call:
## lm(formula = Median_Income ~ `Miami Electricity Cost per Kilowatt-Hour` + 
##     CPI_Index + M2SL + Employment_Thousands, data = miami_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3363.8  -813.3   -44.1   792.6  3708.2 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                -2.383e+04  4.960e+03  -4.805
## `Miami Electricity Cost per Kilowatt-Hour`  7.909e+04  3.113e+04   2.541
## CPI_Index                                   2.385e+01  3.468e+01   0.688
## M2SL                                        9.229e-01  1.712e-01   5.391
## Employment_Thousands                        3.488e-01  6.120e-02   5.699
##                                            Pr(>|t|)    
## (Intercept)                                4.36e-05 ***
## `Miami Electricity Cost per Kilowatt-Hour`   0.0167 *  
## CPI_Index                                    0.4971    
## M2SL                                       8.56e-06 ***
## Employment_Thousands                       3.65e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1437 on 29 degrees of freedom
## Multiple R-squared:  0.9862, Adjusted R-squared:  0.9842 
## F-statistic: 516.3 on 4 and 29 DF,  p-value: < 2.2e-16

Regression Analysis: Miami Median Income

1. Overall Model Fit

Multiple R-squared (0.9862):
- The model explains 98.62% of the variation in Miami Median Income, indicating a very strong fit.
Adjusted R-squared (0.9842):
- After accounting for the number of predictors, the model still explains a substantial portion of the variation in Miami Median Income.
F-statistic (516.3, p < 2.2e-16):
- The overall model is highly significant, meaning at least one predictor has a strong relationship with Miami Median Income.

2. Coefficients

Each coefficient represents the estimated impact of a one-unit increase in the predictor on Miami Median Income, holding other variables constant:

Intercept (-23,830):
- When all predictors are zero, the model predicts a baseline Miami Median Income of -$23,830.
- This value is not meaningful since predictors like M2 Money Supply and Employment_Thousands cannot realistically be zero.
Miami Electricity Cost per Kilowatt-Hour (79,090):
- Estimate: A $1 increase in electricity costs is associated with a $79,090 increase in Miami Median Income.
- p-value (0.0167): Statistically significant (p < 0.05), suggesting that electricity costs have a meaningful positive relationship with Miami Median Income.
CPI Index (23.85):
- Estimate: A one-unit increase in the CPI Index is associated with a $23.85 increase in Miami Median Income.
- p-value (0.4971): Not statistically significant (p > 0.05), meaning CPI does not have a measurable impact on Miami Median Income in this model.
M2 Money Supply (0.9229):
- Estimate: A $1 billion increase in M2 money supply is associated with a $0.9229 increase in Miami Median Income.
- p-value (8.56e-06): Highly statistically significant (p < 0.001). There is a strong positive relationship between M2 money supply and Miami Median Income.
Employment Levels (0.3488):
- Estimate: An increase of 1,000 employed persons in the USA is associated with a $0.3488 increase in Miami Median Income.
- p-value (3.65e-06): Highly statistically significant (p < 0.001). Employment levels are strongly correlated with Miami Median Income.

3. Residuals

Residual Standard Error (1437):
- Indicates the typical difference between observed and predicted values for Miami Median Income.
- The residuals are smaller compared to the NYC model, suggesting more precise predictions for Miami.

4. Key Takeaways

Significant Predictors:

Miami Electricity Costs: Unlike NYC, electricity costs are significant for Miami and positively correlated with median income. This may reflect higher energy expenses in the region influencing income.
M2 Money Supply and Employment Levels: These remain highly significant, as in NYC, indicating broader economic factors (monetary supply and job availability) are key drivers of income.

Insignificant Predictors:

CPI Index: Similar to NYC, CPI does not show a significant relationship with Miami Median Income in this model.

5. Comparison with NYC

Electricity Costs: Significant for Miami but not for NYC, likely reflecting regional differences in cost structures and energy consumption.
CPI: Insignificant in both cities.
M2 Money Supply and Employment Levels: Consistently significant and positively correlated with median income in both cities.

ggplot(miami_data_clean, aes(x = M2SL, y = Median_Income)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between M2 Money Supply and Miami Median Income",
    x = "M2 Money Supply (Billion USD)",
    y = "Miami Median Income"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

ggplot(miami_data_clean, aes(x = Employment_Thousands, y = Median_Income)) +
  geom_point(color = "green", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between Employment Levels and Miami Median Income",
    x = "Employment Levels (Thousands)",
    y = "Miami Median Income"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

San Francisco DATA ANALYSIS

# Merge and clean the San Francisco-specific data
sf_data <- merged_income_df %>%
  select(DATE, `San Francisco Median Income`) %>%
  rename(Median_Income = `San Francisco Median Income`) %>%
  left_join(
    merged_electricity_df %>% select(DATE, `San Francisco Electricity Cost per Kilowatt-Hour`),
    by = "DATE"
  ) %>%
  left_join(
    cpi_df %>% rename(CPI_Index = `CPI Index`),
    by = "DATE"
  ) %>%
  left_join(
    m2_money_supply_df,
    by = "DATE"
  ) %>%
  left_join(
    employment_df %>% rename(Employment_Thousands = `Population Employed USA Thousand Per Person`),
    by = "DATE"
  )

sf_data_clean <- sf_data %>% na.omit()

regression_model_sf <- lm(
  Median_Income ~ `San Francisco Electricity Cost per Kilowatt-Hour` + CPI_Index + M2SL + Employment_Thousands,
  data = sf_data_clean
)

summary(regression_model_sf)

## 
## Call:
## lm(formula = Median_Income ~ `San Francisco Electricity Cost per Kilowatt-Hour` + 
##     CPI_Index + M2SL + Employment_Thousands, data = sf_data_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7705  -2972      6   1876   8376 
## 
## Coefficients:
##                                                      Estimate Std. Error
## (Intercept)                                        -7.163e+04  2.244e+04
## `San Francisco Electricity Cost per Kilowatt-Hour` -1.637e+05  7.696e+04
## CPI_Index                                           2.172e+02  1.603e+02
## M2SL                                                4.249e+00  5.784e-01
## Employment_Thousands                                6.837e-01  2.729e-01
##                                                    t value Pr(>|t|)    
## (Intercept)                                         -3.192  0.00406 ** 
## `San Francisco Electricity Cost per Kilowatt-Hour`  -2.127  0.04441 *  
## CPI_Index                                            1.354  0.18877    
## M2SL                                                 7.346 1.79e-07 ***
## Employment_Thousands                                 2.505  0.01977 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4324 on 23 degrees of freedom
## Multiple R-squared:  0.9803, Adjusted R-squared:  0.9768 
## F-statistic: 285.5 on 4 and 23 DF,  p-value: < 2.2e-16

Regression Analysis: San Francisco Median Income

1. Overall Model Fit

Multiple R-squared (0.9803):
- The model explains 98.03% of the variation in San Francisco Median Income, indicating a very strong fit.
Adjusted R-squared (0.9768):
- After accounting for the number of predictors, the model still explains a substantial portion of the variation in San Francisco Median Income.
F-statistic (285.5, p < 2.2e-16):
- The overall model is highly significant, meaning at least one predictor has a strong relationship with San Francisco Median Income.

2. Coefficients

Each coefficient represents the estimated impact of a one-unit increase in the predictor on San Francisco Median Income, holding other variables constant:

Intercept (-71,630):
- When all predictors are zero, the model predicts a baseline San Francisco Median Income of -$71,630.
- This value is not meaningful since predictors like M2 Money Supply and Employment_Thousands cannot realistically be zero.
San Francisco Electricity Cost per Kilowatt-Hour (-163,700):
- Estimate: A $1 increase in electricity costs is associated with a $163,700 decrease in San Francisco Median Income.
- p-value (0.04441): Statistically significant (p < 0.05), suggesting electricity costs have a meaningful negative relationship with San Francisco Median Income. This may reflect a cost burden reducing disposable income.
CPI Index (217.2):
- Estimate: A one-unit increase in the CPI Index is associated with a $217.2 increase in San Francisco Median Income.
- p-value (0.18877): Not statistically significant (p > 0.05), meaning CPI does not have a measurable impact on San Francisco Median Income in this model.
M2 Money Supply (4.249):
- Estimate: A $1 billion increase in M2 money supply is associated with a $4.249 increase in San Francisco Median Income.
- p-value (1.79e-07): Highly statistically significant (p < 0.001). There is a strong positive relationship between M2 money supply and San Francisco Median Income.
Employment Levels (0.6837):
- Estimate: An increase of 1,000 employed persons in the USA is associated with a $0.6837 increase in San Francisco Median Income.
- p-value (0.01977): Statistically significant (p < 0.05). Employment levels are positively correlated with San Francisco Median Income.

3. Residuals

Residual Standard Error (4324):
- Indicates the typical difference between observed and predicted values for San Francisco Median Income.
- Residuals are larger than for Miami and NYC, suggesting slightly less precise predictions for San Francisco.

4. Key Takeaways

Significant Predictors:

San Francisco Electricity Costs:
- Unlike Miami, electricity costs are significant and negatively correlated with income in San Francisco.
- This may reflect the impact of higher utility costs reducing disposable income or affecting living standards.
M2 Money Supply:
- Remains highly significant and positively correlated, as seen in NYC and Miami.
Employment Levels:
- Positively correlated and significant, highlighting the role of national employment trends in driving income.

Insignificant Predictors:

CPI Index:
- As with NYC and Miami, CPI is not a significant predictor for San Francisco Median Income.

5. Comparison with NYC and Miami

Electricity Costs:
- Positive for Miami, negative for San Francisco, and insignificant for NYC. This highlights regional differences in how electricity costs impact median income.
M2 Money Supply and Employment Levels:
- Consistently significant and positively correlated with median income across all three cities.
CPI:
- Insignificant in all three cities.

ggplot(sf_data_clean, aes(x = M2SL, y = Median_Income)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between M2 Money Supply and San Francisco Median Income",
    x = "M2 Money Supply (Billion USD)",
    y = "San Francisco Median Income"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

ggplot(sf_data_clean, aes(x = Employment_Thousands, y = Median_Income)) +
  geom_point(color = "green", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between Employment Levels and San Francisco Median Income",
    x = "Employment Levels (Thousands)",
    y = "San Francisco Median Income"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

## `geom_smooth()` using formula = 'y ~ x'

USA DATA ANALYSIS

# Merge and clean the USA-specific data
usa_data <- merged_income_df %>%
  select(DATE, `USA Median Income`) %>%
  rename(Median_Income = `USA Median Income`) %>%
  left_join(
    merged_electricity_df %>% select(DATE, `USA Electricity Cost per Kilowatt-Hour`),
    by = "DATE"
  ) %>%
  left_join(
    cpi_df %>% rename(CPI_Index = `CPI Index`),
    by = "DATE"
  ) %>%
  left_join(
    m2_money_supply_df,
    by = "DATE"
  ) %>%
  left_join(
    employment_df %>% rename(Employment_Thousands = `Population Employed USA Thousand Per Person`),
    by = "DATE"
  )

usa_data_clean <- usa_data %>% na.omit()


regression_model_usa <- lm(
  Median_Income ~ `USA Electricity Cost per Kilowatt-Hour` + CPI_Index + M2SL + Employment_Thousands,
  data = usa_data_clean
)

summary(regression_model_usa)

## 
## Call:
## lm(formula = Median_Income ~ `USA Electricity Cost per Kilowatt-Hour` + 
##     CPI_Index + M2SL + Employment_Thousands, data = usa_data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2036.24  -688.94    -8.26   503.91  2892.63 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              -2.306e+04  4.595e+03  -5.019 2.41e-05
## `USA Electricity Cost per Kilowatt-Hour`  1.341e+04  3.637e+04   0.369    0.715
## CPI_Index                                 2.160e+01  3.835e+01   0.563    0.578
## M2SL                                      1.383e+00  1.257e-01  11.009 7.14e-12
## Employment_Thousands                      4.108e-01  5.924e-02   6.934 1.28e-07
##                                             
## (Intercept)                              ***
## `USA Electricity Cost per Kilowatt-Hour`    
## CPI_Index                                   
## M2SL                                     ***
## Employment_Thousands                     ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1054 on 29 degrees of freedom
## Multiple R-squared:  0.9946, Adjusted R-squared:  0.9939 
## F-statistic:  1345 on 4 and 29 DF,  p-value: < 2.2e-16

Regression Analysis: USA Median Income

1. Overall Model Fit

Multiple R-squared (0.9946):
- The model explains 99.46% of the variation in USA Median Income, indicating an exceptionally strong fit.
Adjusted R-squared (0.9939):
- After accounting for the number of predictors, the model still explains a substantial portion of the variation in USA Median Income.
F-statistic (1345, p < 2.2e-16):
- The overall model is highly significant, meaning at least one predictor has a strong relationship with USA Median Income.

2. Coefficients

Each coefficient represents the estimated impact of a one-unit increase in the predictor on USA Median Income, holding other variables constant:

Intercept (-23,060):
- When all predictors are zero, the model predicts a baseline USA Median Income of -$23,060.
- This value is not meaningful as predictors like M2 Money Supply and Employment_Thousands cannot realistically be zero.
USA Electricity Cost per Kilowatt-Hour (13,410):
- Estimate: A $1 increase in electricity costs is associated with a $13,410 increase in USA Median Income.
- p-value (0.715): Not statistically significant (p > 0.05), suggesting that electricity costs do not have a measurable impact on USA Median Income in this model.
CPI Index (21.60):
- Estimate: A one-unit increase in the CPI Index is associated with a $21.60 increase in USA Median Income.
- p-value (0.578): Not statistically significant (p > 0.05). CPI does not have a meaningful relationship with USA Median Income in this model.
M2 Money Supply (1.383):
- Estimate: A $1 billion increase in M2 money supply is associated with a $1.383 increase in USA Median Income.
- p-value (7.14e-12): Highly statistically significant (p < 0.001). There is a strong positive relationship between M2 money supply and USA Median Income.
Employment Levels (0.4108):
- Estimate: An increase of 1,000 employed persons in the USA is associated with a $0.4108 increase in USA Median Income.
- p-value (1.28e-07): Highly statistically significant (p < 0.001). Employment levels are strongly correlated with USA Median Income.

3. Residuals

Residual Standard Error (1054):
- Indicates the typical difference between observed and predicted values for USA Median Income.
- Residuals are smaller compared to the city-specific models, suggesting more precise predictions.

4. Key Takeaways

Significant Predictors:

M2 Money Supply:
- Highly significant and positively correlated with USA Median Income, as in the city-level analyses.
Employment Levels:
- Also highly significant and positively correlated, highlighting the importance of job availability and labor market trends.

Insignificant Predictors:

USA Electricity Costs:
- Not significant, suggesting electricity costs do not directly drive USA Median Income.
CPI Index:
- As with NYC, Miami, and SF, CPI is not a significant predictor in this model.

ggplot(usa_data_clean, aes(x = M2SL, y = Median_Income)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between M2 Money Supply and USA Median Income",
    x = "M2 Money Supply (Billion USD)",
    y = "USA Median Income"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

ggplot(usa_data_clean, aes(x = Employment_Thousands, y = Median_Income)) +
  geom_point(color = "green", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between Employment Levels and USA Median Income",
    x = "Employment Levels (Thousands)",
    y = "USA Median Income"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

### USA DATA ANAYSIS FOR HOUSING

# Merge and clean USA Median Housing Price data
usa_housing_data <- median_house_prices_sold_usa_df %>%
  left_join(
    cpi_df %>% rename(CPI_Index = `CPI Index`),
    by = "DATE"
  ) %>%
  left_join(
    m2_money_supply_df,
    by = "DATE"
  ) %>%
  left_join(
    employment_df %>% rename(Employment_Thousands = `Population Employed USA Thousand Per Person`),
    by = "DATE"
  )

usa_housing_data_clean <- usa_housing_data %>% na.omit()

regression_model_housing <- lm(
  `Median Housing Price Sold USA` ~ CPI_Index + M2SL + Employment_Thousands,
  data = usa_housing_data_clean
)

summary(regression_model_housing)

## 
## Call:
## lm(formula = `Median Housing Price Sold USA` ~ CPI_Index + M2SL + 
##     Employment_Thousands, data = usa_housing_data_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24206  -9153   -377   5656  33289 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.605e+05  1.968e+04  -8.157 2.11e-13 ***
## CPI_Index             2.579e+02  1.245e+02   2.071   0.0402 *  
## M2SL                  9.835e+00  7.485e-01  13.140  < 2e-16 ***
## Employment_Thousands  1.907e+00  2.515e-01   7.582 4.92e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12780 on 135 degrees of freedom
## Multiple R-squared:  0.9806, Adjusted R-squared:  0.9801 
## F-statistic:  2272 on 3 and 135 DF,  p-value: < 2.2e-16

Regression Analysis: USA Median Housing Prices

1. Overall Model Fit

Multiple R-squared (0.9806):
- The model explains 98.06% of the variation in USA Median Housing Prices, indicating a very strong fit.
Adjusted R-squared (0.9801):
- After accounting for the number of predictors, the model still explains a substantial portion of the variation.
F-statistic (2272, p < 2.2e-16):
- The overall model is highly significant, meaning at least one predictor has a strong relationship with USA Median Housing Prices.

2. Coefficients

Each coefficient represents the estimated impact of a one-unit increase in the predictor on USA Median Housing Prices, holding other variables constant:

Intercept (-160,500):
- When all predictors are zero, the model predicts a baseline USA Median Housing Price of -$160,500.
- This value is not meaningful, as predictors like M2 Money Supply and Employment_Thousands cannot realistically be zero.
CPI Index (257.9):
- Estimate: A one-unit increase in the CPI Index is associated with a $257.9 increase in USA Median Housing Prices.
- p-value (0.0402): Statistically significant (p < 0.05). Inflation (as captured by CPI) has a measurable positive relationship with housing prices.
M2 Money Supply (9.835):
- Estimate: A $1 billion increase in M2 money supply is associated with a $9,835 increase in USA Median Housing Prices.
- p-value (< 2e-16): Highly statistically significant (p < 0.001). There is a very strong positive relationship between M2 money supply and USA Median Housing Prices.
Employment Levels (1.907):
- Estimate: An increase of 1,000 employed persons in the USA is associated with a $1.907 increase in USA Median Housing Prices.
- p-value (4.92e-12): Highly statistically significant (p < 0.001). Employment levels are strongly correlated with USA Median Housing Prices.

3. Residuals

Residual Standard Error (12,780):
- Indicates the typical difference between observed and predicted values for USA Median Housing Prices.
- Residuals are relatively small given the large scale of housing prices.

4. Key Takeaways

Significant Predictors:

CPI Index:
- Statistically significant and positively correlated with housing prices. Inflation drives up housing costs, as expected.
M2 Money Supply:
- Highly significant and positively correlated. An increase in money supply often leads to greater spending capacity and higher housing prices.
Employment Levels:
- Highly significant and positively correlated. More employment leads to greater demand for housing, driving up prices.

ggplot(usa_housing_data_clean, aes(x = CPI_Index, y = `Median Housing Price Sold USA`)) +
  geom_point(color = "purple", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between CPI Index and USA Median Housing Prices",
    x = "CPI Index",
    y = "USA Median Housing Prices"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

## `geom_smooth()` using formula = 'y ~ x'

ggplot(usa_housing_data_clean, aes(x = M2SL, y = `Median Housing Price Sold USA`)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between M2 Money Supply and USA Median Housing Prices",
    x = "M2 Money Supply (Billion USD)",
    y = "USA Median Housing Prices"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

## `geom_smooth()` using formula = 'y ~ x'

ggplot(usa_housing_data_clean, aes(x = Employment_Thousands, y = `Median Housing Price Sold USA`)) +
  geom_point(color = "green", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Relationship Between Employment Levels and USA Median Housing Prices",
    x = "Employment Levels (Thousands)",
    y = "USA Median Housing Prices"
  ) +
  theme_minimal()+
  scale_y_continuous(labels = scales::comma)

## `geom_smooth()` using formula = 'y ~ x'

# load data
rawhousingprices <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/Media%20Housing%20Price.csv"
housingprices <- read.csv(rawhousingprices)
head(housingprices)

##   RegionID SizeRank      RegionName RegionType StateName X2018.03.31
## 1   102001        0   United States    country                263267
## 2   394913        1    New York, NY        msa        NY      503000
## 3   753899        2 Los Angeles, CA        msa        CA      721333
## 4   394463        3     Chicago, IL        msa        IL      284600
## 5   394514        4      Dallas, TX        msa        TX      322997
## 6   394692        5     Houston, TX        msa        TX      294467
##   X2018.04.30 X2018.05.31 X2018.06.30 X2018.07.31 X2018.08.31 X2018.09.30
## 1      271267      276633      280000      280300      279633      279600
## 2      513000      521300      526300      528300      526600      526267
## 3      735000      743333      750000      749667      742967      736267
## 4      294600      300600      302267      301967      297933      296117
## 5      328497      331797      332800      329502      324202      321168
## 6      297933      299633      299933      299633      297800      296165
##   X2018.10.31 X2018.11.30 X2018.12.31 X2019.01.31 X2019.02.28 X2019.03.31
## 1      279300      278933      277300      275967      276633      281300
## 2      527600      529000      527667      524667      524667      528000
## 3      728267      724633      719667      711300      711600      716267
## 4      292783      288483      283333      280000      284300      291267
## 5      318966      317633      315999      313333      313667      316967
## 6      293132      291632      289933      288300      289965      292665
##   X2019.04.30 X2019.05.31 X2019.06.30 X2019.07.31 X2019.08.31 X2019.09.30
## 1      287967      294567      297900      298900      297000      294833
## 2      537667      544333      549000      547667      544333      544333
## 3      732633      742667      752933      756567      759233      758967
## 4      301267      306300      309467      306465      302798      298998
## 5      323633      328467      330133      328467      324966      321633
## 6      297632      299300      299933      298966      295633      292316
##   X2019.10.31 X2019.11.30 X2019.12.31 X2020.01.31 X2020.02.29 X2020.03.31
## 1      292133      290467      287933      284933      285233      289600
## 2      545667      549000      549300      549300      549300      552333
## 3      758667      762333      765667      767667      771298      779965
## 4      295633      290967      284933      281633      284967      291633
## 5      319933      319600      317967      314667      313633      315267
## 6      289316      287650      286000      283333      284333      287666
##   X2020.04.30 X2020.05.31 X2020.06.30 X2020.07.31 X2020.08.31 X2020.09.30
## 1      295633      299000      303967      309300      314300      315300
## 2      555967      559300      561300      566333      574967      584967
## 3      787965      794333      802333      816000      826000      829633
## 4      297967      299600      301300      304633      307667      309300
## 5      318600      321633      325633      330600      333233      334267
## 6      293166      297133      301132      303966      307299      309967
##   X2020.10.31 X2020.11.30 X2020.12.31 X2021.01.31 X2021.02.28 X2021.03.31
## 1      315300      313633      309967      304967      307967      314667
## 2      592967      597967      599600      596600      594967      594667
## 3      831300      838294      844661      849327      849296      849629
## 4      306300      303267      298267      296567      299900      304933
## 5      332633      331600      328600      328559      330293      335926
## 6      313275      314308      314308      313666      315999      317666
##   X2021.04.30 X2021.05.31 X2021.06.30 X2021.07.31 X2021.08.31 X2021.09.30
## 1      326333      333333      339407      341073      340740      339667
## 2      597667      598667      595633      589267      579600      574633
## 3      856296      864667      874667      874667      866333      856333
## 4      309966      311666      313333      313000      307967      302933
## 5      342633      350967      359667      367829      371159      373159
## 6      323000      327667      332667      334667      333333      333333
##   X2021.10.31 X2021.11.30 X2021.12.31 X2022.01.31 X2022.02.28 X2022.03.31
## 1      339633      338300      333600      330300      335267      346633
## 2      574333      576333      576333      576667      584550      592550
## 3      850000      850000      850000      856633      871633      888263
## 4      299263      295930      289263      284933      287967      297667
## 5      373330      376633      378300      381633      388000      394667
## 6      333233      333233      331233      328000      328000      333650
##   X2022.04.30 X2022.05.31 X2022.06.30 X2022.07.31 X2022.08.31 X2022.09.30
## 1      362967      376300      387967      393300      393300      391300
## 2      603550      610667      617667      613330      606330      599663
## 3      904963      916596      924967      920633      910333      901667
## 4      309300      317967      323300      324967      322967      319967
## 5      407333      419330      432663      438297      434967      428300
## 6      343650      356317      364167      367500      364500      358000
##   X2022.10.31 X2022.11.30 X2022.12.31 X2023.01.31 X2023.02.28 X2023.03.31
## 1      386633      381667      375733      371900      371900      374833
## 2      606333      612999      619333      619333      624667      634967
## 3      897667      894333      887667      882633      885967      892967
## 4      314967      308600      302933      299300      302167      308833
## 5      419000      410667      403667      398000      396667      400150
## 6      352997      349663      347122      343792      342425      344933
##   X2023.04.30 X2023.05.31 X2023.06.30 X2023.07.31 X2023.08.31 X2023.09.30
## 1      383107      391107      397607      399167      397833      396333
## 2      647967      662633      672333      679000      676000      679333
## 3      916000      942667      975333      992300      998967      999633
## 4      319133      329500      336500      339867      339967      339633
## 5      411813      426480      438330      441667      440000      435000
## 6      351567      358267      364633      367667      366000      362967
##   X2023.10.31 X2023.11.30 X2023.12.31 X2024.01.31 X2024.02.29 X2024.03.31
## 1      393300      389967      384633      381000      381000      386000
## 2      684667      693000      693300      689967      691300      694333
## 3      999567      999233      995567      991667      992000      995633
## 4      336300      331333      324967      321633      323300      329967
## 5      430000      426356      422356      417356      414333      416633
## 6      358300      354967      351667      349983      349983      352983
##   X2024.04.30 X2024.05.31 X2024.06.30 X2024.07.31 X2024.08.31 X2024.09.30
## 1      392967      400600      405900      405933      403267      399933
## 2      699333      706333      713000      716000      709330      709630
## 3     1015967     1045967     1079000     1095333     1090650     1077650
## 4      339967      349933      356600      358267      354933      351633
## 5      424633      432933      437967      438300      434667      430865
## 6      358000      363667      365667      366000      365333      363633
##   X2024.10.31
## 1      399600
## 2      714963
## 3     1061650
## 4      346633
## 5      427532
## 6      359967

Since our focus is on three specific cities, we filter the data to include only New York, Miami, and Los Angeles. We then clean the dataset by renaming the variables and restructuring it from a wide format to a longer format.

filtered_housingprices <- housingprices %>%
  filter(RegionName %in% c("New York, NY", "Miami, FL", "Los Angeles, CA")) %>%
  pivot_longer(
    cols = starts_with("X20"),  
    names_to = "Month",     
    values_to = "Value"     
  ) %>%
  mutate(Month = as.Date(Month, format = "X%Y.%m.%d")) %>%
  mutate(Year = format(Month, "%Y")) %>%
  group_by(Year, RegionName) %>%  
  mutate(Yearly_Mean = mean(Value, na.rm = TRUE)) %>%  
  select("SizeRank", "RegionName", "Year", "Yearly_Mean") %>%
  distinct()
head(filtered_housingprices)

## # A tibble: 6 × 4
## # Groups:   Year, RegionName [6]
##   SizeRank RegionName   Year  Yearly_Mean
##      <int> <chr>        <chr>       <dbl>
## 1        1 New York, NY 2018      522903.
## 2        1 New York, NY 2019      540720.
## 3        1 New York, NY 2020      570358.
## 4        1 New York, NY 2021      587392.
## 5        1 New York, NY 2022      603637.
## 6        1 New York, NY 2023      663933.

Since 2018, housing prices have risen significantly across all three cities, with the most dramatic increase observed in Los Angeles.

ggplot(filtered_housingprices, aes(x = Year, y = Yearly_Mean, color = RegionName, group = RegionName)) +
  geom_line(size = 1) + 
  geom_text(aes(label = round(Yearly_Mean, 2)), vjust = -0.5, size = 3, check_overlap = TRUE) + 
  labs(
    title = "Yearly Mean Housing Prices for New York, Miami, and Los Angeles",
    x = "Year",
    y = "Yearly Mean Housing Price"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_y_continuous(labels = scales::comma)

The rate of change in housing prices shows that Miami experienced the most significant fluctuation, with a nearly 28% increase. It’s also the only city to see a negative change in 2024, at -4%. On the other hand, both New York and Los Angeles saw more stable price changes, with no year surpassing a 10%.

housing_rate_of_change <- filtered_housingprices %>%
  group_by(RegionName) %>%
  arrange(Year) %>%
  mutate(
    Rate_of_Change = (Yearly_Mean - lag(Yearly_Mean)) / lag(Yearly_Mean) * 100
  ) %>%
  filter(!is.na(Rate_of_Change))

ggplot(housing_rate_of_change, aes(x = Year, y = Rate_of_Change, fill = RegionName)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = round(Rate_of_Change, 2)),  
            position = position_dodge(width = 0.8),    
            vjust = -0.5, size = 3) +                
  labs(
    title = "Year-over-Year Rate of Change in Housing Prices",
    subtitle = "For New York, Los Angeles, and Miami",
    x = "Year",
    y = "Rate of Change (%)",
    fill = "City"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Three expensive cities — Los Angeles is the most expensive city to buy a home in. While Miami has less expensive homes than LA or New York, it does skew to the right, meaning there are more higher valued homes than not.

anova_housingprices <- aov(Yearly_Mean ~ RegionName, data = filtered_housingprices)
summary(anova_housingprices)

##             Df    Sum Sq   Mean Sq F value   Pr(>F)    
## RegionName   2 6.329e+11 3.165e+11   40.21 2.29e-07 ***
## Residuals   18 1.417e+11 7.870e+09                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggplot(filtered_housingprices, aes(x = RegionName, y = Yearly_Mean, fill = RegionName)) +
  geom_boxplot() +
  labs(title = "ANOVA Comparison of Housing Prices by Region")+
  scale_y_continuous(labels = scales::comma)

Median Income

Again, we load the CSV file from GitHub.

rawincome <- read.csv("https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/Income.csv")
head(rawincome)

##   Metropolitan.Statistical.Areas X2021.Dollars X2022.Dollars X2023.Dollars
## 1                    Abilene, TX        51,995        53,514        56,034
## 2                      Akron, OH        59,583        59,609        62,904
## 3                     Albany, GA        46,969        46,754        48,546
## 4             Albany-Lebanon, OR        52,230        54,123        57,016
## 5    Albany-Schenectady-Troy, NY        66,929        67,884        71,972
## 6                Albuquerque, NM        52,372        54,687        57,278
##   US.Rank.2023
## 1          239
## 2          130
## 3          344
## 4          219
## 5           57
## 6          216

Then, we filter for the appropriate cities while reformatting the column names and structure.

filtered_income <- rawincome %>% 
  filter(Metropolitan.Statistical.Areas %in% c("Los Angeles-Long Beach-Anaheim, CA", "Miami-Fort Lauderdale-Pompano Beach, FL", "New York-Newark-Jersey City, NY-NJ-PA") ) %>% 
  rename_with(~ gsub("^X", "", .x), starts_with("X")) %>% 
  pivot_longer(
    cols = starts_with("20"),  
    names_to = "Income by Year",         
    values_to = "Average Income"         
  ) %>% 
  mutate(`Income by Year` = str_remove(`Income by Year`, "\\.Dollars"))
head(filtered_income)

## # A tibble: 6 × 4
##   Metropolitan.Statistical.Areas  US.Rank.2023 `Income by Year` `Average Income`
##   <chr>                                  <int> <chr>            <chr>           
## 1 Los Angeles-Long Beach-Anaheim…           24 2021             75,332          
## 2 Los Angeles-Long Beach-Anaheim…           24 2022             76,760          
## 3 Los Angeles-Long Beach-Anaheim…           24 2023             80,898          
## 4 Miami-Fort Lauderdale-Pompano …           19 2021             73,220          
## 5 Miami-Fort Lauderdale-Pompano …           19 2022             78,610          
## 6 Miami-Fort Lauderdale-Pompano …           19 2023             84,302

In recent years, the average income in the New York area has seen notable growth, making it the highest among the three regions. Miami surpassed Los Angeles’ 2021 income in both 2022 and 2023. Meanwhile, Los Angeles has experienced the slowest growth in comparison.

ggplot(filtered_income, aes(x = `Income by Year`, y = `Average Income`, 
                            group = Metropolitan.Statistical.Areas, 
                            color = Metropolitan.Statistical.Areas)) +
  geom_line(size = 1.2) + 
  geom_point(size = 3) +   
  labs(
    title = "Average Income by Year in Metropolitan Areas",
    x = "Year",
    y = "Average Income (Dollars)",
    color = "Metropolitan Areas"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

For both 2022 and 2023, Miami had the greatest change of income growth, followed by New York and Los Angeles.

filtered_income_num <- filtered_income %>%
  mutate(`Average Income` = as.numeric(gsub(",", "", gsub("\\$", "", `Average Income`))))

filtered_income_change <- filtered_income_num %>%
  arrange(Metropolitan.Statistical.Areas, `Income by Year`) %>%
  group_by(Metropolitan.Statistical.Areas) %>%
  mutate(rate_of_change = (`Average Income` - lag(`Average Income`)) / lag(`Average Income`) * 100)

ggplot(filtered_income_change, aes(x = `Income by Year`, y = rate_of_change, 
                            fill = Metropolitan.Statistical.Areas, 
                            label = round(rate_of_change, 2))) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = round(rate_of_change, 2)), position = position_dodge(width = 0.8), vjust = -0.5) + 
  labs(
    title = "Rate of Change in Average Income by Year in Metropolitan Areas",
    x = "Year",
    y = "Rate of Change (%)",
    fill = "Metropolitan Areas"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).

A heatmap visual — where the darkest and highest income (above 90k) is found in the NY area as of 2023.

filtered_income$`Average Income` <- as.numeric(gsub(",", "", filtered_income$`Average Income`))

ggplot(filtered_income, aes(x = `Income by Year`, y = Metropolitan.Statistical.Areas, fill = `Average Income`)) +
  geom_tile() +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(
    title = "Heatmap of Average Income by Year and Metropolitan Area",
    x = "Year",
    y = "Metropolitan Area",
    fill = "Average Income"
  ) +
  theme_minimal()

Homeownership Rates

For our homeownership rates data, we loaded the csv and removed the first few lines.

rawhomeownership <- read_csv("https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/Homeownership%20Rates%20-%20Sheet1.csv")

## New names:
## Rows: 79 Columns: 25
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Metropolitan Statistical Area dbl (24): First Quarter 2023, Margin of
## Error1...3, Second Quarter 2023,...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `Margin of Error1` -> `Margin of Error1...3`
## • `Margin of Error1` -> `Margin of Error1...5`
## • `Margin of Error1` -> `Margin of Error1...7`
## • `Margin of Error1` -> `Margin of Error1...9`
## • `Margin of Error1` -> `Margin of Error1...11`
## • `Margin of Error1` -> `Margin of Error1...13`
## • `Margin of Error1` -> `Margin of Error1...15`
## • `Margin of Error1` -> `Margin of Error1...17`
## • `Margin of Error1` -> `Margin of Error1...19`
## • `Margin of Error1` -> `Margin of Error1...21`
## • `Margin of Error1` -> `Margin of Error1...23`
## • `Margin of Error1` -> `Margin of Error1...25`

rawhomeownership$`Metropolitan Statistical Area` <- gsub("[,\\.]", "", rawhomeownership$`Metropolitan Statistical Area`)
rawhomeownership <- rawhomeownership[-c(1, 2, 3), ]
rawhomeownership

## # A tibble: 76 × 25
##    `Metropolitan Statistical Area` `First  Quarter  2023` `Margin of Error1...3`
##    <chr>                                            <dbl>                  <dbl>
##  1 "Akron OH "                                       64.5                    9.4
##  2 "Albany-Schenectady-Troy NY "                     79.6                    8.1
##  3 "Albuquerque NM"                                  67.2                    4.5
##  4 "Allentown-Bethlehem-Easton PA…                   69.7                    9.7
##  5 "Atlanta-Sandy Springs-Roswell…                   66.6                    3.6
##  6 "Austin-Round Rock TX"                            63.4                    6  
##  7 "Baltimore-Columbia-Towson MD2"                   72.9                    5.2
##  8 "Baton Rouge LA………………………………………"                   64.8                    6.2
##  9 "Birmingham-Hoover AL"                            67                      6.1
## 10 "Boston-Cambridge-Newton MA-NH…                   61.1                    3.5
## # ℹ 66 more rows
## # ℹ 22 more variables: `Second  Quarter  2023` <dbl>,
## #   `Margin of Error1...5` <dbl>, `Third  Quarter  2023` <dbl>,
## #   `Margin of Error1...7` <dbl>, `Fourth  Quarter  2023` <dbl>,
## #   `Margin of Error1...9` <dbl>, `First  Quarter  2022` <dbl>,
## #   `Margin of Error1...11` <dbl>, `Second  Quarter  2022` <dbl>,
## #   `Margin of Error1...13` <dbl>, `Third  Quarter  2022` <dbl>, …

To clean our data, we had to restructure the columns and rearrange by Quarter and Year. Otherwise, our x-axis would present in alphabetical order, and therefore not in time order.

rawhomeownership <- rawhomeownership %>%
  clean_names()
rawhomeownership

## # A tibble: 76 × 25
##    metropolitan_statistical_area       first_quarter_2023 margin_of_error1_3
##    <chr>                                            <dbl>              <dbl>
##  1 "Akron OH "                                       64.5                9.4
##  2 "Albany-Schenectady-Troy NY "                     79.6                8.1
##  3 "Albuquerque NM"                                  67.2                4.5
##  4 "Allentown-Bethlehem-Easton PA-NJ"                69.7                9.7
##  5 "Atlanta-Sandy Springs-Roswell GA1"               66.6                3.6
##  6 "Austin-Round Rock TX"                            63.4                6  
##  7 "Baltimore-Columbia-Towson MD2"                   72.9                5.2
##  8 "Baton Rouge LA………………………………………"                   64.8                6.2
##  9 "Birmingham-Hoover AL"                            67                  6.1
## 10 "Boston-Cambridge-Newton MA-NH3"                  61.1                3.5
## # ℹ 66 more rows
## # ℹ 22 more variables: second_quarter_2023 <dbl>, margin_of_error1_5 <dbl>,
## #   third_quarter_2023 <dbl>, margin_of_error1_7 <dbl>,
## #   fourth_quarter_2023 <dbl>, margin_of_error1_9 <dbl>,
## #   first_quarter_2022 <dbl>, margin_of_error1_11 <dbl>,
## #   second_quarter_2022 <dbl>, margin_of_error1_13 <dbl>,
## #   third_quarter_2022 <dbl>, margin_of_error1_15 <dbl>, …

homeownership <- rawhomeownership %>%
  filter(`metropolitan_statistical_area` %in% c("Los Angeles-Long Beach-Anaheim CA14", 
                                                "Miami-Fort Lauderdale-West Palm Beach FL16", 
                                                "New York-Newark-Jersey City NY-NJ-PA19")) %>% 
  select(`metropolitan_statistical_area`, `first_quarter_2021`, `second_quarter_2021`, `third_quarter_2021`, `fourth_quarter_2021`, 
         `first_quarter_2022`, `second_quarter_2022`, `third_quarter_2022`, `fourth_quarter_2022`, 
         `first_quarter_2023`, `second_quarter_2023`, `third_quarter_2023`, `fourth_quarter_2023`) %>% 
   pivot_longer(
    cols = -`metropolitan_statistical_area`,
    names_to = "year_quarter",            
    values_to = "homeownership_rate"  
  ) %>%
  mutate(
    quarter = case_when(
      grepl("first", year_quarter) ~ "Q1",
      grepl("second", year_quarter) ~ "Q2",
      grepl("third", year_quarter) ~ "Q3",
      grepl("fourth", year_quarter) ~ "Q4"
    ),
    year = sub(".*(\\d{4}).*", "\\1", year_quarter),
    label = paste(quarter, year),         
    label = factor(label, levels = c("Q1 2021", "Q2 2021", "Q3 2021", "Q4 2021", 
                                     "Q1 2022", "Q2 2022", "Q3 2022", "Q4 2022", 
                                     "Q1 2023", "Q2 2023", "Q3 2023", "Q4 2023"))
  )
homeownership

## # A tibble: 36 × 6
##    metropolitan_statistica…¹ year_quarter homeownership_rate quarter year  label
##    <chr>                     <chr>                     <dbl> <chr>   <chr> <fct>
##  1 Los Angeles-Long Beach-A… first_quart…               61.2 Q1      2021  Q1 2…
##  2 Los Angeles-Long Beach-A… second_quar…               65.7 Q2      2021  Q2 2…
##  3 Los Angeles-Long Beach-A… third_quart…               63.9 Q3      2021  Q3 2…
##  4 Los Angeles-Long Beach-A… fourth_quar…               67.7 Q4      2021  Q4 2…
##  5 Los Angeles-Long Beach-A… first_quart…               63.4 Q1      2022  Q1 2…
##  6 Los Angeles-Long Beach-A… second_quar…               64   Q2      2022  Q2 2…
##  7 Los Angeles-Long Beach-A… third_quart…               62.4 Q3      2022  Q3 2…
##  8 Los Angeles-Long Beach-A… fourth_quar…               67.8 Q4      2022  Q4 2…
##  9 Los Angeles-Long Beach-A… first_quart…               46.1 Q1      2023  Q1 2…
## 10 Los Angeles-Long Beach-A… second_quar…               46.8 Q2      2023  Q2 2…
## # ℹ 26 more rows
## # ℹ abbreviated name: ¹metropolitan_statistical_area

All areas show a downward trend in homeownership despite an upward trend in average income.

ggplot(homeownership, aes(x = label, y = `homeownership_rate`, color = `metropolitan_statistical_area`, group = `metropolitan_statistical_area`)) + 
  geom_line(size = 1) + 
  geom_point(size = 3) + 
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", size = 0.5) +  # Separate linear models for each region
  labs(
    title = "Homeownership Rates by Year and Metropolitan Area",
    x = "Year and Quarter",
    y = "Homeownership Rate (%)",
    color = "Metropolitan Statistical Area"
  ) + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## `geom_smooth()` using formula = 'y ~ x'

In fact, Los Angeles had the greatest drop in homeownership rates as of Q1 of 2023 at -32%. If we recall, Los Angeles had the most expensive housing prices, and the slowest average income growth. New York follows at -25%, despite have the highest average income of the areas.

homeownership_change <- homeownership %>%
  group_by(`metropolitan_statistical_area`) %>%
  arrange(`metropolitan_statistical_area`, `label`) %>%
  mutate(rate_of_change = (`homeownership_rate` - lag(`homeownership_rate`)) / lag(`homeownership_rate`) * 100)

ggplot(homeownership_change, aes(x = label, y = rate_of_change, fill = `metropolitan_statistical_area`)) + 
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = round(rate_of_change, 2)), 
            position = position_dodge(width = 0.8), 
            vjust = -1, size = 2) + 
  labs(
    title = "Rate of Change in Homeownership Rates by Year and Metropolitan Area",
    x = "Year and Quarter",
    y = "Rate of Change (%)",
    fill = "Metropolitan Statistical Area"
  ) + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_bar()`).

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_text()`).

Here we load the data for non farmer jobs in the US since the 1990s.

raw_nonfarmer <- "https://raw.githubusercontent.com/crystaliquezada/finalproject_data607/refs/heads/main/data/usa/total_non_farm_employment.csv"
nonfarmerjobs <- read_csv(raw_nonfarmer)

## Rows: 419 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): PAYEMS
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nonfarmerjobs

## # A tibble: 419 × 2
##    DATE       PAYEMS
##    <date>      <dbl>
##  1 1990-01-01 109189
##  2 1990-02-01 109432
##  3 1990-03-01 109637
##  4 1990-04-01 109671
##  5 1990-05-01 109834
##  6 1990-06-01 109857
##  7 1990-07-01 109822
##  8 1990-08-01 109610
##  9 1990-09-01 109520
## 10 1990-10-01 109374
## # ℹ 409 more rows

The number of non-farm jobs has steadily increased since 1990, with a significant drop in 2020 due to the pandemic. However, the job market has since rebounded, surpassing pre-pandemic levels. Non-farm jobs make up nearly 80% of the workforce contributing to the country’s GDP. Despite this substantial contribution, homeownership rates have been declining, raising questions about the broader economic and social implications.

ggplot(nonfarmerjobs, aes(x = DATE, y = PAYEMS)) +
  geom_line(color = "skyblue", size = 1) +
  labs(
    title = "Non Farmer Jobs Over the Years",
    x = "Date",
    y = "Non Farmer Job Count"
  ) +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusion

The data on housing prices, average income, and homeownership rates across New York, Miami, and Los Angeles reveal notable disparities in the growth and stability of these metrics. Despite New York having the highest average income growth, it experienced a sharp decline in homeownership rates, mirroring a similar trend in Los Angeles around the same period (Q4 2022). Los Angeles, in particular, has shown the most significant challenges, with the slowest growth in income and a sharp decline in homeownership, making it the most vulnerable area in this study. In contrast, Miami demonstrated more stability, with a substantial increase in income and a lower rate of change in homeownership.

It’s important for policymakers and stakeholders in California to focus on initiatives that address the affordability crisis and improve access to homeownership. Potential solutions could include expanding affordable housing projects, offering homebuyer assistance programs, and addressing the root causes of income inequality. For New York, focusing on supporting sustainable growth in both income and homeownership could help mitigate the risks of economic disparity. Miami, while showing promising trends, could further benefit from targeted policies that ensure the continued growth of both its housing market and income levels.

Final_Project_Data607

Ahmed Hassan, Crystal Quezada

2024-12-06

Abstract

Housing Prices

Data Transformation

NYC Analysis on income

Regression Analysis: NYC Median Income

1. Overall Model Fit

2. Coefficients

3. Residuals

4. Key Takeaways

Significant Predictors:

Insignificant Predictors:

5. Correlation

MIAMI DATA ANALYSIS

Regression Analysis: Miami Median Income

1. Overall Model Fit

2. Coefficients

3. Residuals

4. Key Takeaways

Significant Predictors:

Insignificant Predictors:

5. Comparison with NYC

San Francisco DATA ANALYSIS

Regression Analysis: San Francisco Median Income

1. Overall Model Fit

2. Coefficients

3. Residuals

4. Key Takeaways

Significant Predictors:

Insignificant Predictors:

5. Comparison with NYC and Miami

USA DATA ANALYSIS

Regression Analysis: USA Median Income

1. Overall Model Fit

2. Coefficients

3. Residuals

4. Key Takeaways

Significant Predictors:

Insignificant Predictors:

Regression Analysis: USA Median Housing Prices

1. Overall Model Fit

2. Coefficients

3. Residuals

4. Key Takeaways

Significant Predictors:

Median Income

Homeownership Rates

Conclusion