library(tidyverse)
## Warning: package 'readr' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
## Warning: package 'readxl' was built under R version 4.5.3
library(dplyr)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

Data Cleaning

#Vet Data boilerplate, data cleaning
# create a file path
file_path <-"C:/Users/Administrator/Desktop/Graduate School/Applied Quant Methods/My Class Stuff/Data Project/Veteran Homelessness/2023 Homeless Veterans.xlsx"

# Read sheet 2
sheet_2<- read_excel(file_path, sheet="2023") 

# Remove Sheet 2 "2023"  Totals row
vet_data <- sheet_2 |> filter (State != "Total")

#change variable names
vet_data <- vet_data |> rename(
                   CoC_Count = "Number of CoCs", 
                   ES_Count = "Sheltered ES Homeless Veterans", 
                   TH_Count = "Sheltered TH Homeless Veterans", 
                   SH_Count = "Sheltered SH Homeless Veterans", 
                   Sheltered = "Sheltered Total Homeless Veterans", 
                   Unsheltered = "Unsheltered Homeless Veterans", 
                   Homeless_Vets = "Homeless Veterans")

#Create column for unsheltered rate
vet_data <- vet_data |> mutate(Unsheltered_Rate = Unsheltered/`Homeless_Vets`)

#Create column for Transitional Housing rate across Total homeless veteran population
vet_data <- vet_data |> mutate(TH_Rate = TH_Count/Homeless_Vets)

Create a linear model “lm()” from the variables, with a continuous dependent variable as the outcome

vet_model_1<-lm(Unsheltered_Rate~CoC_Count+TH_Rate, data=vet_data)
summary(vet_model_1)
## 
## Call:
## lm(formula = Unsheltered_Rate ~ CoC_Count + TH_Rate, data = vet_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28062 -0.07670 -0.00255  0.09440  0.45402 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.544722   0.047362  11.501 8.86e-16 ***
## CoC_Count    0.001253   0.002655   0.472    0.639    
## TH_Rate     -0.866280   0.126031  -6.874 8.62e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1559 on 51 degrees of freedom
## Multiple R-squared:  0.4821, Adjusted R-squared:  0.4618 
## F-statistic: 23.74 on 2 and 51 DF,  p-value: 5.171e-08

Check the following assumptions:

Does your model meet those assumptions? You don’t have to be perfectly right, just make a good case.

If your model violates an assumption, which one?

1. Linearity (plot and raintest)

plot

plot(vet_model_1, which=1)

The fitted line is mostly straight. Just by visual inspection it does appear to be linear. So it passess the assumption of linearity.

Rainbow test

raintest(vet_model_1)
## 
##  Rainbow test
## 
## data:  vet_model_1
## Rain = 1.4501, df1 = 27, df2 = 24, p-value = 0.1804

According to the Rainbow Test, the high p-value means that the data is linear.

2. Independence of Errors (durbin-watson)

Durbin- Watson test

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
durbinWatsonTest(vet_model_1)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02949372      1.963657   0.952
##  Alternative hypothesis: rho != 0

The p-value is very high, it is 0.926, much higher than 0.05 which means that the errors ARE independent. The distance between the errors are not highly correlated. This passes the assumption of an independence of errors.

3. Homoscedasticity (plot, bptest)

plot

plot(vet_model_1,which=3)

The red line is very curvy and steadily increasing which means this likely violates the assumption of Homoscedasticity.The residual spread gets larger as the fitted values increases, the variance of the residuals are not evenly spread across the fitted values. We can say that this model appears to be heteroscedastic by visual inspection.

Breusch-Pagan Test

bptest(vet_model_1)
## 
##  studentized Breusch-Pagan test
## 
## data:  vet_model_1
## BP = 13.78, df = 2, p-value = 0.001018

The p-value is significant, it is much lower than 0.05. So we can reject homoscedasticity and assume the model is heteroscedastic. This model fails the assumption of homoscedasticity and will require mitigation.

4. Normality of residuals (QQ plot, shapiro test)

QQ plot

plot(vet_model_1,which=2)

The residuals are more evenly distributed in the center of the dotted like, aka the bell curve, and fit very well along the dotted line. On the tails of the line we see some residuals have significant deviations from the dotted line and don’t really touch it. But again that’s somewhat expected that the tails would have outliers. Overall, by visual inspection, this is appears to be normally distributed.

Shapiro Test

shapiro.test(vet_model_1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  vet_model_1$residuals
## W = 0.97332, p-value = 0.2687

The p-value is high suggesting that the residuals are consistent with a normal distribution. The model passes the assumption of normality of residuals.

5. No multicolinarity (VIF, cor)

VIF

kitchen_sink<-lm(Unsheltered_Rate ~ CoC_Count + ES_Count + SH_Count + TH_Count, data = vet_data)
  
vif(kitchen_sink)
## CoC_Count  ES_Count  SH_Count  TH_Count 
##  3.352792  7.803907  5.012926  9.682637

The closest variable to being strongly correlated is the TH_Count because it is almost at a 10. However other measures of this assumption say that anything over 5 is strongly correlated. So potentially, ES_Count and SH_Count could be correlated. The assumption of no multicollinearity is likely violated.

cor

kitchen_sink_vars<- vet_data |> dplyr::select(CoC_Count, ES_Count, SH_Count, TH_Count)

cor(kitchen_sink_vars)
##           CoC_Count  ES_Count  SH_Count  TH_Count
## CoC_Count 1.0000000 0.8205100 0.7754552 0.8150272
## ES_Count  0.8205100 1.0000000 0.8510735 0.9259319
## SH_Count  0.7754552 0.8510735 1.0000000 0.8889992
## TH_Count  0.8150272 0.9259319 0.8889992 1.0000000

ES_Count is highly correlated with TH_Count at 0.926. SH_Count is highly correlated with TH_Count at 0.889. ES_Count is highly correlated with SH_Count at 0.851. These are extremely high correlations which makes sense since together these three variables are the components for the sheltered veteran population. So the model fails the assumption of no multicollinearity.

Assumption check

  1. Linearity- Pass
  2. Independence of Errors- Pass
  3. Homoscedasticity- Fail
  4. Normality of Residuals- Pass
  5. No multicollinearity- Fail

Two assumptions are violated by the model: The Independence of Errors assumption and the No Multicollinearity assumption.

Mitigation

Mitigation for failure of homoscedasticity assumption

vet_model_1_log <-lm(log(Unsheltered_Rate)~log(CoC_Count+TH_Rate), data=vet_data)
bptest(vet_model_1_log)
## 
##  studentized Breusch-Pagan test
## 
## data:  vet_model_1_log
## BP = 0.38705, df = 1, p-value = 0.5339

The log model has a high p-value which means it is now homoscedastic. In fact, let’s look at the plot! The line is much straighter for the log model than the original model. Essentially we homogenized the model because the log model is non-significant.

plot(vet_model_1_log,which=3)

plot(vet_model_1,which=3)

Mitigation for failure of no multicollinearity assumption

We saw earlier that the TH_Count had a high value for correlation in the vif() test.

vif(kitchen_sink)
## CoC_Count  ES_Count  SH_Count  TH_Count 
##  3.352792  7.803907  5.012926  9.682637
cor(kitchen_sink_vars)
##           CoC_Count  ES_Count  SH_Count  TH_Count
## CoC_Count 1.0000000 0.8205100 0.7754552 0.8150272
## ES_Count  0.8205100 1.0000000 0.8510735 0.9259319
## SH_Count  0.7754552 0.8510735 1.0000000 0.8889992
## TH_Count  0.8150272 0.9259319 0.8889992 1.0000000

To mitigate, let’s create a new model that takes out TH_Count.

kitchen_sink_altered <-lm(Unsheltered_Rate~CoC_Count+ES_Count+SH_Count, data=vet_data)
vif(kitchen_sink_altered)
## CoC_Count  ES_Count  SH_Count 
##  3.276785  4.738763  3.884056

As you can see, the vif() for ES_Count and SH_Count went down by a good amount. Within this model, all variables are below a vif() value of 5 meaning this model now meets the assumption of no multicollinearity because those big correlations aren’t there.