library(tidyverse)
## Warning: package 'readr' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
## Warning: package 'readxl' was built under R version 4.5.3
library(dplyr)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
Data Cleaning
#Vet Data boilerplate, data cleaning
# create a file path
file_path <-"C:/Users/Administrator/Desktop/Graduate School/Applied Quant Methods/My Class Stuff/Data Project/Veteran Homelessness/2023 Homeless Veterans.xlsx"
# Read sheet 2
sheet_2<- read_excel(file_path, sheet="2023")
# Remove Sheet 2 "2023" Totals row
vet_data <- sheet_2 |> filter (State != "Total")
#change variable names
vet_data <- vet_data |> rename(
CoC_Count = "Number of CoCs",
ES_Count = "Sheltered ES Homeless Veterans",
TH_Count = "Sheltered TH Homeless Veterans",
SH_Count = "Sheltered SH Homeless Veterans",
Sheltered = "Sheltered Total Homeless Veterans",
Unsheltered = "Unsheltered Homeless Veterans",
Homeless_Vets = "Homeless Veterans")
#Create column for unsheltered rate
vet_data <- vet_data |> mutate(Unsheltered_Rate = Unsheltered/`Homeless_Vets`)
#Create column for Transitional Housing rate across Total homeless veteran population
vet_data <- vet_data |> mutate(TH_Rate = TH_Count/Homeless_Vets)
vet_model_1<-lm(Unsheltered_Rate~CoC_Count+TH_Rate, data=vet_data)
summary(vet_model_1)
##
## Call:
## lm(formula = Unsheltered_Rate ~ CoC_Count + TH_Rate, data = vet_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28062 -0.07670 -0.00255 0.09440 0.45402
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.544722 0.047362 11.501 8.86e-16 ***
## CoC_Count 0.001253 0.002655 0.472 0.639
## TH_Rate -0.866280 0.126031 -6.874 8.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1559 on 51 degrees of freedom
## Multiple R-squared: 0.4821, Adjusted R-squared: 0.4618
## F-statistic: 23.74 on 2 and 51 DF, p-value: 5.171e-08
Does your model meet those assumptions? You don’t have to be perfectly right, just make a good case.
If your model violates an assumption, which one?
plot(vet_model_1, which=1)
The fitted line is mostly straight. Just by visual inspection it does
appear to be linear. So it passess the assumption of linearity.
raintest(vet_model_1)
##
## Rainbow test
##
## data: vet_model_1
## Rain = 1.4501, df1 = 27, df2 = 24, p-value = 0.1804
According to the Rainbow Test, the high p-value means that the data is linear.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
durbinWatsonTest(vet_model_1)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02949372 1.963657 0.952
## Alternative hypothesis: rho != 0
The p-value is very high, it is 0.926, much higher than 0.05 which means that the errors ARE independent. The distance between the errors are not highly correlated. This passes the assumption of an independence of errors.
plot(vet_model_1,which=3)
The red line is very curvy and steadily increasing which means this
likely violates the assumption of Homoscedasticity.The residual spread
gets larger as the fitted values increases, the variance of the
residuals are not evenly spread across the fitted values. We can say
that this model appears to be heteroscedastic by visual inspection.
bptest(vet_model_1)
##
## studentized Breusch-Pagan test
##
## data: vet_model_1
## BP = 13.78, df = 2, p-value = 0.001018
The p-value is significant, it is much lower than 0.05. So we can reject homoscedasticity and assume the model is heteroscedastic. This model fails the assumption of homoscedasticity and will require mitigation.
plot(vet_model_1,which=2)
The residuals are more evenly distributed in the center of the dotted
like, aka the bell curve, and fit very well along the dotted line. On
the tails of the line we see some residuals have significant deviations
from the dotted line and don’t really touch it. But again that’s
somewhat expected that the tails would have outliers. Overall, by visual
inspection, this is appears to be normally distributed.
shapiro.test(vet_model_1$residuals)
##
## Shapiro-Wilk normality test
##
## data: vet_model_1$residuals
## W = 0.97332, p-value = 0.2687
The p-value is high suggesting that the residuals are consistent with a normal distribution. The model passes the assumption of normality of residuals.
kitchen_sink<-lm(Unsheltered_Rate ~ CoC_Count + ES_Count + SH_Count + TH_Count, data = vet_data)
vif(kitchen_sink)
## CoC_Count ES_Count SH_Count TH_Count
## 3.352792 7.803907 5.012926 9.682637
The closest variable to being strongly correlated is the TH_Count because it is almost at a 10. However other measures of this assumption say that anything over 5 is strongly correlated. So potentially, ES_Count and SH_Count could be correlated. The assumption of no multicollinearity is likely violated.
kitchen_sink_vars<- vet_data |> dplyr::select(CoC_Count, ES_Count, SH_Count, TH_Count)
cor(kitchen_sink_vars)
## CoC_Count ES_Count SH_Count TH_Count
## CoC_Count 1.0000000 0.8205100 0.7754552 0.8150272
## ES_Count 0.8205100 1.0000000 0.8510735 0.9259319
## SH_Count 0.7754552 0.8510735 1.0000000 0.8889992
## TH_Count 0.8150272 0.9259319 0.8889992 1.0000000
ES_Count is highly correlated with TH_Count at 0.926. SH_Count is highly correlated with TH_Count at 0.889. ES_Count is highly correlated with SH_Count at 0.851. These are extremely high correlations which makes sense since together these three variables are the components for the sheltered veteran population. So the model fails the assumption of no multicollinearity.
Two assumptions are violated by the model: The Independence of Errors assumption and the No Multicollinearity assumption.
vet_model_1_log <-lm(log(Unsheltered_Rate)~log(CoC_Count+TH_Rate), data=vet_data)
bptest(vet_model_1_log)
##
## studentized Breusch-Pagan test
##
## data: vet_model_1_log
## BP = 0.38705, df = 1, p-value = 0.5339
The log model has a high p-value which means it is now homoscedastic. In fact, let’s look at the plot! The line is much straighter for the log model than the original model. Essentially we homogenized the model because the log model is non-significant.
plot(vet_model_1_log,which=3)
plot(vet_model_1,which=3)
We saw earlier that the TH_Count had a high value for correlation in the vif() test.
vif(kitchen_sink)
## CoC_Count ES_Count SH_Count TH_Count
## 3.352792 7.803907 5.012926 9.682637
cor(kitchen_sink_vars)
## CoC_Count ES_Count SH_Count TH_Count
## CoC_Count 1.0000000 0.8205100 0.7754552 0.8150272
## ES_Count 0.8205100 1.0000000 0.8510735 0.9259319
## SH_Count 0.7754552 0.8510735 1.0000000 0.8889992
## TH_Count 0.8150272 0.9259319 0.8889992 1.0000000
To mitigate, let’s create a new model that takes out TH_Count.
kitchen_sink_altered <-lm(Unsheltered_Rate~CoC_Count+ES_Count+SH_Count, data=vet_data)
vif(kitchen_sink_altered)
## CoC_Count ES_Count SH_Count
## 3.276785 4.738763 3.884056
As you can see, the vif() for ES_Count and SH_Count went down by a good amount. Within this model, all variables are below a vif() value of 5 meaning this model now meets the assumption of no multicollinearity because those big correlations aren’t there.