library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stats)

Clean_Data <- read_excel("Clean_Data.xlsx")
View(Clean_Data)

Continuing with the work of Homework 6, I will be using the following variables for my linear model:

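Dependent Variable
- Overall Homeless

Independent Variables
- Total Amount Awarded
- Total Amounts of Beds in 2022
- Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022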

The model produced an adjusted R-squared of ~90% and a very low p-value, and each independent variable was statistically significant. However, the linearity test showed that the model did not achieve the assumption of linearity.

At the end of Homework 6, I recommended plotting histograms of all four variables to see whether any of them are skewed.

hist(Clean_Data$`Overall Homeless`)

hist(Clean_Data$`Total Amount Awarded`)

hist(Clean_Data$`Total Amounts of Beds in 2022`)

hist(Clean_Data$`Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022`)

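As the above histograms highlight, each of the variables is heavily skewed, with most values bunched at the low (left) end and a long tail toward large values. This is a right (positive) skew, which is the situation a log transformation is meant to address.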
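
The direction of skew read off a histogram can be double-checked numerically. The following is a minimal sketch in base R, assuming a hypothetical helper named `skewness` and simulated data (not the homework data); a positive value means a long right tail and a negative value a long left tail.

```r
# Sample skewness: positive = long right tail, negative = long left tail.
# `skewness` is a hypothetical helper written for this illustration.
skewness <- function(x) {
  x <- x[is.finite(x)]                 # drop NA / Inf before computing moments
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)                            # simulated data, NOT the homework data
sim_counts <- rlnorm(1000, meanlog = 5, sdlog = 1)

skew_raw <- skewness(sim_counts)       # skewness before logging
skew_log <- skewness(log(sim_counts))  # skewness after logging
```

The same helper could be applied to a column such as Clean_Data$`Overall Homeless` before and after logging to see how much the transformation reduces the skew.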

I will now create a new data set containing only the variables being used and log-transform the entire data set.

With this new data set, I will create a new linear model with the logged variables and run a summary on it to see if any new results exist.

library(dplyr)

Clean_Data2 <- Clean_Data %>% select(`Overall Homeless`, `Total Amount Awarded`, `Total Amounts of Beds in 2022`, `Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022`)

CD2_Logged <- log(Clean_Data2)

The logged variable “Total Amount of Persons Exiting from PH” had some negative infinity (-Inf) values, which appear because log(0) is -Inf. I will be changing these values to NA, since lm() drops rows containing NA by default.
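
As a small illustration of why the -Inf values appear and what happens to them in the model, here is a sketch in base R using made-up toy values (the names `x`, `logged`, `toy`, and `fit` are assumptions for this example only).

```r
x <- c(0, 1, 10, NA)               # made-up values for illustration
logged <- log(x)                   # log(0) is -Inf, log(NA) stays NA

logged[is.infinite(logged)] <- NA  # turn -Inf (or Inf) into NA

# lm() uses na.action = na.omit by default, so rows with NA are dropped.
toy <- data.frame(y = c(1, 2, 3, 4, 5), x = c(NA, 1, 2, 3, 4))
fit <- lm(y ~ x, data = toy)
n_used <- nobs(fit)                # number of rows actually used in the fit
```

In a model summary, the dropped rows are reported in a line of the form "observations deleted due to missingness".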

# Replace -Inf (produced by log(0)) with NA so lm() drops those rows
CD2_Logged[is.na(CD2_Logged) | CD2_Logged == -Inf] <- NA
HomelessModel <- lm(`Overall Homeless` ~ `Total Amount Awarded` + `Total Amounts of Beds in 2022` + `Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022`, data = CD2_Logged)

summary(HomelessModel)
## 
## Call:
## lm(formula = `Overall Homeless` ~ `Total Amount Awarded` + `Total Amounts of Beds in 2022` + 
##     `Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022`, 
##     data = CD2_Logged)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.50876 -0.33236 -0.01659  0.28777  1.93887 
## 
## Coefficients:
##                                                                                                                       Estimate
## (Intercept)                                                                                                            0.50203
## `Total Amount Awarded`                                                                                                -0.10027
## `Total Amounts of Beds in 2022`                                                                                        1.20302
## `Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022` -0.18036
##                                                                                                                       Std. Error
## (Intercept)                                                                                                              0.40204
## `Total Amount Awarded`                                                                                                   0.04193
## `Total Amounts of Beds in 2022`                                                                                          0.05223
## `Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022`    0.03654
##                                                                                                                       t value
## (Intercept)                                                                                                             1.249
## `Total Amount Awarded`                                                                                                 -2.391
## `Total Amounts of Beds in 2022`                                                                                        23.033
## `Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022`  -4.935
##                                                                                                                       Pr(>|t|)
## (Intercept)                                                                                                             0.2126
## `Total Amount Awarded`                                                                                                  0.0173
## `Total Amounts of Beds in 2022`                                                                                        < 2e-16
## `Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022` 1.22e-06
##                                                                                                                          
## (Intercept)                                                                                                              
## `Total Amount Awarded`                                                                                                *  
## `Total Amounts of Beds in 2022`                                                                                       ***
## `Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022` ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5045 on 366 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.8007, Adjusted R-squared:  0.799 
## F-statistic:   490 on 3 and 366 DF,  p-value: < 2.2e-16
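
Because both the dependent and independent variables are logged, each slope in the summary above can be read as an elasticity: the approximate percent change in Overall Homeless associated with a one percent change in that predictor, holding the others fixed. The sketch below, with a hypothetical helper `pct_change_y`, converts a slope into the implied percent change for a larger step. It describes an association in this model, not a causal effect.

```r
# Percent change in y implied by a log-log slope `beta`
# when x increases by `pct_x` percent (hypothetical helper).
pct_change_y <- function(beta, pct_x) {
  ((1 + pct_x / 100)^beta - 1) * 100
}

beta_beds <- 1.20302               # estimate copied from the summary above
pct_change_y(beta_beds, 10)        # implied change for 10% more beds
```

For the beds coefficient, a 10% increase in beds corresponds to roughly a 12% higher Overall Homeless count in this model.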

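With the new model and its logged values, the adjusted R-squared is ~80% (down from ~90% in the Homework 6 model), the p-value is still very low, and each independent variable is still statistically significant at the 0.05 level.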

I will now check the model for the following assumptions:
- Linearity
- Independence of Errors
- Homoscedasticity
- Normality of Residuals
- Multicollinearity

Linearity using Plot and Rainbow Test

library(ggplot2)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
plot(HomelessModel, which = 1)

raintest(HomelessModel)
## 
##  Rainbow test
## 
## data:  HomelessModel
## Rain = 1.7511, df1 = 185, df2 = 181, p-value = 8.647e-05

Based on the residuals vs. fitted plot, this model does not appear linear, as the plot does not show a clean linear relationship between Overall Homeless and the other variables. Additionally, the Rainbow test p-value (8.647e-05) is well below 0.05, so we reject the null hypothesis of linearity and treat the model as non-linear.
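
One way to probe the kind of curvature that a linearity test flags is to add a squared term and compare the two nested models. This is only a sketch on simulated data (not the homework data), and the variable names are assumptions for illustration.

```r
set.seed(2)                                  # simulated data, NOT the homework data
x <- runif(200, 1, 10)
y <- 1 + 0.5 * x + 0.3 * x^2 + rnorm(200)    # true relationship is curved

fit_linear <- lm(y ~ x)
fit_quad   <- lm(y ~ x + I(x^2))             # add a squared term

# F-test of the nested models: a small p-value favours the curved model.
cmp <- anova(fit_linear, fit_quad)
p_quad <- cmp$`Pr(>F)`[2]
```

If the squared term is significant, the straight-line specification is missing curvature, which is one possible reason for a failed linearity test.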

Independence of Errors using Durbin-Watson

library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
durbinWatsonTest(HomelessModel)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.3284217      1.335374       0
##  Alternative hypothesis: rho != 0

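With the p-value reported as 0, we reject the null hypothesis of no autocorrelation, so the errors (residuals) are not independent. The D-W Statistic of 1.34 is below 2 and, together with the lag-1 autocorrelation of 0.33, points to positive autocorrelation. It is moderate rather than extreme, so the model is not completely useless.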
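
To make the link between the D-W statistic and the autocorrelation concrete, the statistic can be computed by hand. The sketch below uses simulated residuals with an assumed lag-1 autocorrelation of 0.5 (not the homework residuals) and shows that the statistic is approximately 2 * (1 - r), where r is the lag-1 autocorrelation.

```r
set.seed(3)                        # simulated residuals, NOT the homework data
n   <- 500
rho <- 0.5                         # assumed lag-1 autocorrelation
e   <- numeric(n)
e[1] <- rnorm(1)
for (t in 2:n) e[t] <- rho * e[t - 1] + rnorm(1)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals.
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

approx_dw <- 2 * (1 - r1)          # dw is approximately 2 * (1 - r1)
```

The values reported above are consistent with this relationship: 2 * (1 - 0.3284) is about 1.34, close to the reported D-W Statistic of 1.335. Note that the statistic depends on the order of the rows in the data.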

Homoscedasticity using Plot and Breusch-Pagan Test

library(lmtest)

plot(HomelessModel, which = 3)

bptest(HomelessModel)
## 
##  studentized Breusch-Pagan test
## 
## data:  HomelessModel
## BP = 8.2612, df = 3, p-value = 0.04091

While the plot shows the points to be fairly evenly distributed along the line (which one could take as evidence that homoscedasticity is met), the Breusch-Pagan test gives a p-value of 0.041, just below 0.05, so we reject the null hypothesis of homoscedasticity.
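
The studentized Breusch-Pagan statistic can be reproduced by hand, which helps show what the test measures: how well the predictors explain the squared residuals. The sketch below uses a hypothetical helper `bp_by_hand` and simulated data (not the homework data); it regresses the squared residuals on the predictors and uses n times the R-squared of that auxiliary regression.

```r
# Studentized (Koenker) Breusch-Pagan statistic computed by hand:
# regress squared residuals on the model's predictors, then use n * R^2.
bp_by_hand <- function(model) {
  e2  <- residuals(model)^2
  X   <- model.matrix(model)                 # includes the intercept column
  aux <- lm.fit(X, e2)                       # auxiliary regression
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df   <- ncol(X) - 1
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(4)                                  # simulated data, NOT the homework data
x <- runif(300, 1, 10)
y <- 1 + 2 * x + rnorm(300, sd = x)          # error spread grows with x
bp <- bp_by_hand(lm(y ~ x))
```

A small p-value here means the spread of the residuals changes with the predictors, which is what the test reports as heteroscedasticity.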

Normality of Residuals using QQ Plot and Shapiro Test

plot(HomelessModel, which = 2)

shapiro.test(HomelessModel$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  HomelessModel$residuals
## W = 0.99271, p-value = 0.06855

Based on the Q-Q Plot and the Shapiro-Wilk p-value (0.06855) being above 0.05, we fail to reject the null hypothesis of normality, so the residuals of this model can be considered normal.
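
The decision rule for the Shapiro-Wilk test is easy to get backwards, so here it is as a short sketch. The null hypothesis is that the data are normal, so a p-value above the cutoff means normality is not rejected. The helper `looks_normal` is hypothetical and the sample is simulated.

```r
# Returns TRUE when normality is NOT rejected at the given alpha.
looks_normal <- function(x, alpha = 0.05) {
  shapiro.test(x)$p.value > alpha
}

set.seed(5)                        # simulated data, NOT the homework data
skewed <- rexp(200)                # clearly non-normal sample
skewed_ok <- looks_normal(skewed)  # normality is rejected for data this skewed

reported_ok <- 0.06855 > 0.05      # p-value reported above vs. the 0.05 cutoff
```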

Checking for No Multicollinearity Using VIF and cor

vif(HomelessModel)
##                                                                                                `Total Amount Awarded` 
##                                                                                                              4.878565 
##                                                                                       `Total Amounts of Beds in 2022` 
##                                                                                                              5.009693 
## `Total Persons Exited PH to permanent destinations or Remained in PH for 6+ months (measure excludes PH-RRH) in 2022` 
##                                                                                                              4.806856

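Based on this VIF test, all three values fall between 4.8 and 5.0. That is well below the commonly used cutoff of 10, so this model passes the assumption of No Multicollinearity. The values do sit right at the stricter rule-of-thumb cutoff of 5, however, which suggests the independent variables are moderately correlated with one another.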
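
To show what a VIF measures, it can be computed by hand as 1 / (1 - R-squared), where the R-squared comes from regressing one predictor on all the others. The sketch below uses a hypothetical helper `vif_by_hand` and simulated predictors (not the homework data).

```r
# VIF for each predictor: 1 / (1 - R^2), where R^2 comes from
# regressing that predictor on all the other predictors.
vif_by_hand <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(6)                        # simulated predictors, NOT the homework data
a <- rnorm(200)
b <- a + rnorm(200, sd = 0.5)      # b is built to correlate with a
d <- rnorm(200)
vifs <- vif_by_hand(data.frame(a = a, b = b, d = d))
```

A VIF of 5 corresponds to an R-squared of 0.80 in that auxiliary regression, meaning 80% of the variation in one predictor is explained by the others. The toy column names here are simple; the homework columns contain spaces and would likely need renaming before using a helper like this.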

Mitigations

This model passed the assumptions of normality of residuals and no multicollinearity.

However, it did not pass linearity, independence of errors, or homoscedasticity.

Linearity - The values were already logged.

Non-Independent

summary(Robust_HomelessModel)

Homoscedasticity - The dependent variable is already logged.
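
When the dependent variable is already logged and heteroscedasticity remains, one further option is to keep the same coefficients but report heteroscedasticity-consistent (robust) standard errors. The following is a minimal sketch of the HC0 ("White") estimator in base R on simulated data. The helper `hc0_se` is hypothetical, and this is not necessarily how Robust_HomelessModel was built.

```r
# Heteroscedasticity-consistent (HC0, "White") standard errors by hand:
# (X'X)^-1  X' diag(e^2) X  (X'X)^-1
hc0_se <- function(model) {
  X     <- model.matrix(model)
  e     <- residuals(model)
  bread <- solve(crossprod(X))
  meat  <- crossprod(X * e)        # equals X' diag(e^2) X
  sqrt(diag(bread %*% meat %*% bread))
}

set.seed(7)                        # simulated data, NOT the homework data
x <- runif(300, 1, 10)
y <- 1 + 2 * x + rnorm(300, sd = x)
fit <- lm(y ~ x)

se_table <- cbind(classical = sqrt(diag(vcov(fit))), robust_HC0 = hc0_se(fit))
se_table
```

In practice, vcovHC() from the sandwich package combined with coeftest() from lmtest is the more usual route, and it offers small-sample variants such as HC3.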