library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pastecs)
## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
# 1 Load Current Data Set

setwd("C:/Users/cramo/OneDrive/Desktop/My Class Stuff/Monday Class")

ahs.household.data <- read_csv("household.csv")
## Rows: 55669 Columns: 1180
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (635): CONTROL, JACPRIMARY, JACSECNDRY, JADEQUACY, JAIRRATE, JBATHEXCLU,...
## dbl (545): TOTROOMS, PERPOVLVL, OUTAGEFRQ, RENT, DINING, LAUNDY, RATINGHS, R...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 2A Re-Assemble DV
ahs.household.data <- ahs.household.data %>% mutate(tot.cost.burden = ((TOTHCAMT * 12) + MAINTAMT) / HINCP)
clean.research.data <- ahs.household.data %>% filter(is.finite(tot.cost.burden), tot.cost.burden > 0)
clean.research.data <- clean.research.data %>% mutate(log.cost = log(tot.cost.burden))
# 2B Assemble and Select IV
## Build Housing Age 
clean.research.data <- clean.research.data %>% mutate(housing.age = 2023 - YRBUILT)
## Other IVs selected: market value (MARKETVAL), percent of poverty level (PERPOVLVL), and home condition rating (RATINGHS)
# Create Linear Model
affordable.housing.model <- lm(log.cost ~ housing.age + MARKETVAL + PERPOVLVL + RATINGHS, data = clean.research.data)
# Summary of Model
summary(affordable.housing.model)
## 
## Call:
## lm(formula = log.cost ~ housing.age + MARKETVAL + PERPOVLVL + 
##     RATINGHS, data = clean.research.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6973 -0.5380  0.0999  0.5774  9.4104 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.379e+00  1.064e-02  129.61   <2e-16 ***
## housing.age -2.628e-03  1.451e-04  -18.12   <2e-16 ***
## MARKETVAL    2.664e-07  6.907e-09   38.57   <2e-16 ***
## PERPOVLVL   -5.236e-03  2.447e-05 -213.97   <2e-16 ***
## RATINGHS    -1.113e-01  8.306e-04 -133.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9246 on 54630 degrees of freedom
## Multiple R-squared:  0.6962, Adjusted R-squared:  0.6962 
## F-statistic: 3.13e+04 on 4 and 54630 DF,  p-value: < 2.2e-16

5 Interpret the Results

The R-squared for this linear model is 0.6962, meaning the independent variables together explain about 70% of the variance in the dependent variable, which is the log of total housing cost burden. Housing age, market value, percent of poverty level, and home condition therefore account for a large share of the variation in cost burden. All of my variables have a p-value below 2e-16, well under both the conventional 0.05 threshold and the stricter 0.001 threshold, so none of them are insignificant. With about 54,600 observations, though, even very small effects can reach significance, so the size of each coefficient matters as much as its p-value when judging how meaningful each relationship is.
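The figures quoted above can also be pulled from the fitted model programmatically instead of being read off the printed summary. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, since the AHS file is not bundled here); the same calls should work on `affordable.housing.model`.

```r
# Stand-in model on a built-in dataset; swap in affordable.housing.model
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit.summary <- summary(fit)

# R-squared and adjusted R-squared are stored on the summary object
r.squared <- fit.summary$r.squared
adj.r.squared <- fit.summary$adj.r.squared

# The "Pr(>|t|)" column of the coefficient table holds the two-sided p-values
p.values <- fit.summary$coefficients[, "Pr(>|t|)"]

round(c(r.squared = r.squared, adj.r.squared = adj.r.squared), 4)
p.values < 0.001   # TRUE where a term clears the stricter threshold
```

Note that the `<2e-16` shown in the printed summary is a display floor, so the stored p-values may be smaller than what is printed.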

6 Interpret Variable Estimates

The coefficient for housing age is -0.0026. Because the dependent variable is the log of cost burden, this means that each additional year of housing age is associated with a total housing cost burden about 0.26% lower, holding all other variables constant. This result is somewhat unexpected, as older homes are typically associated with higher maintenance costs. One possible explanation is that older homes are more likely to be fully paid off or to carry lower mortgage payments, which reduces overall housing costs. As a result, even if maintenance costs are higher, total housing cost burden may still be lower. In future analysis, I would like to separate homeowners by mortgage status to better test whether this relationship is driven by lifecycle effects or differences in financing.
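One way the mortgage-status follow-up could be set up is with an interaction term, which lets the housing-age slope differ between households with and without a mortgage. This is only a sketch on simulated data: `has.mortgage` is a hypothetical 0/1 indicator and the simulated coefficients are made up for illustration, not taken from the AHS file.

```r
set.seed(1)
n <- 500

# Simulated stand-in data; has.mortgage is a hypothetical indicator
toy <- data.frame(
  housing.age  = runif(n, 0, 100),
  has.mortgage = rbinom(n, 1, 0.6)
)
toy$log.cost <- 0.5 + 0.8 * toy$has.mortgage -
  0.002 * toy$housing.age + rnorm(n, sd = 0.5)

# housing.age * has.mortgage expands to both main effects plus their interaction
interaction.model <- lm(log.cost ~ housing.age * has.mortgage, data = toy)
coef(summary(interaction.model))
```

In this setup the `housing.age` row is the slope for households without a mortgage, and the `housing.age:has.mortgage` row is how much that slope changes for households with one. If the housing-age effect shrinks toward zero once mortgage status is in the model, that would point toward financing rather than the age of the home itself.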

The coefficient for market value is 2.664e-07, meaning that each one-dollar increase in home value is associated with a very small increase in log cost burden. Because home values vary by tens or hundreds of thousands of dollars, the effect is easier to read at a larger scale: a $100,000 increase in market value corresponds to a cost burden about 2.7% higher, holding the other variables constant. This indicates that more expensive homes are associated with higher cost burden overall.

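The coefficient for percent of poverty level is -0.0052, meaning that each one-point increase in percent of poverty level is associated with a total housing cost burden about 0.5% lower, holding the other variables constant. This suggests that as households move further above the poverty line, they experience lower housing cost burden, which aligns with expectations since higher-income households are better able to absorb housing-related costs. Part of this relationship is likely built in, because household income (HINCP) is the denominator of the cost burden measure and percent of poverty level also rises with income.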

The coefficient for housing condition (RATINGHS) is -0.1113, meaning that each one-point improvement in the housing rating is associated with a total housing cost burden about 10.5% lower, holding the other variables constant. Per unit, this is the largest coefficient in the model, although the predictors are measured on very different scales, so the raw coefficients are not directly comparable. The result suggests that homes in worse condition are associated with higher costs, likely due to increased repair and maintenance needs.
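Because the model is fit to `log.cost`, each coefficient is a change in log cost burden, and `100 * (exp(b * step) - 1)` converts it into an approximate percent change in cost burden for a given step in the predictor. The sketch below applies that to the estimates from the summary output; the step sizes (10 years, $100,000, 10 points, 1 rating point) are illustrative assumptions, not values from the assignment.

```r
# Estimates copied from the summary() output above
estimates <- c(
  housing.age = -2.628e-03,
  MARKETVAL   =  2.664e-07,
  PERPOVLVL   = -5.236e-03,
  RATINGHS    = -1.113e-01
)

# Illustrative step sizes (assumptions chosen for readability)
steps <- c(
  housing.age = 10,    # ten years older
  MARKETVAL   = 1e5,   # $100,000 more in market value
  PERPOVLVL   = 10,    # ten points higher percent of poverty level
  RATINGHS    = 1      # one rating point higher
)

# Percent change in cost burden implied by each step on the log scale
pct.change <- 100 * (exp(estimates * steps) - 1)
round(pct.change, 2)
```

With the fitted model in memory, `coef(affordable.housing.model)[-1]` could replace the hand-copied estimates.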

# 7) Does the model you create meet or violate the assumption of linearity?

plot(affordable.housing.model, which = 1)

## The red line is relatively flat, indicating that the relationship is mostly linear. The points are generally centered on the red line, suggesting that the model describes the data reasonably well. There is some deviation past about the second interval, where the points spread out more and follow the line less closely; that wider spread may say more about unequal variance than about nonlinearity. Overall, this suggests that the linearity assumption is largely met, although not perfectly.
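The visual read of the residuals-vs-fitted plot can be backed up with a numeric check. One common approach, sketched here on simulated data with made-up variable names, is to add a squared term for a predictor and compare the two nested models: if the squared term improves the fit, a straight line was not enough. This is an illustration of the idea, not the check the assignment requires.

```r
set.seed(42)
n <- 300

# Simulated data where the true relationship is deliberately curved
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + 0.08 * x^2 + rnorm(n)

linear.fit <- lm(y ~ x)
curved.fit <- lm(y ~ x + I(x^2))

# Residuals against the predictor with a lowess smooth:
# a flat trend near zero supports linearity, a bend does not
plot(x, resid(linear.fit), xlab = "x", ylab = "Residual")
lines(lowess(x, resid(linear.fit)), col = "red")

# Nested-model F test: a small p-value suggests the squared term is needed
linearity.test <- anova(linear.fit, curved.fit)
linearity.test
```

For the housing model, the same comparison could be run one predictor at a time, for example with `I(housing.age^2)` added to the formula.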