Load your chosen dataset into Rmarkdown
#load libraries
library(tidyverse)
## Warning: package 'readr' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
## Warning: package 'readxl' was built under R version 4.5.3
library(dplyr)
#Vet Data boilerplate, data cleaning
# create a file path
file_path <-"C:/Users/Administrator/Desktop/Graduate School/Applied Quant Methods/My Class Stuff/Data Project/Veteran Homelessness/2023 Homeless Veterans.xlsx"
# Read sheet 2 and remove totals row
sheet_2<- read_excel(file_path, sheet="2023")
# Remove Sheet 2 "2023" Totals row
vet_data_sheet_2 <- sheet_2 |> filter (State != "Total")
# Read Sheet 1
change_sheet <-read_excel(file_path, sheet="Change")
# Remove Sheet 1 "Change" totals row
change_column <- change_sheet |> filter (State != "Total")
# Merge change column into 2023 data
vet_data_long_variable_names <- vet_data_sheet_2 |> left_join (change_column, by = "State")
#change variable names
vet_data <- rename(vet_data_long_variable_names,
CoC_Count = "Number of CoCs",
ES_Count = "Sheltered ES Homeless Veterans",
TH_Count = "Sheltered TH Homeless Veterans",
SH_Count = "Sheltered SH Homeless Veterans",
Sheltered = "Sheltered Total Homeless Veterans",
Unsheltered = "Unsheltered Homeless Veterans",
Homeless_Vets = "Homeless Veterans",
Homeless_Rate_Change = "Change in Veteran Homelessness, 2022-2023")
#Create column for unsheltered rate
vet_data <- vet_data |> mutate(Unsheltered_Rate = Unsheltered/`Homeless_Vets`)
view(vet_data)
Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable
Dependent Variable: Total number of homeless veterans (Homeless_Vets)
Independent Variable 1: Number of Emergency Shelter sheltered veterans (ES_Count)
Independent Variable 2: Number of Transitional Housing sheltered veterans (TH_Count)
Independent Variable 3: Number of Safe Haven sheltered veterans (SH_Count)
Independent Variable 4: Number of Continuums of Care (CoC_Count)
Create a linear model using the “lm()” command, save it to some object
# Linear model of Homeless Veterans on Emergency Shelter, Continuum of Care, Transitional Housing, and Safe Haven counts
Homeless_kitchen_sink_model <- lm(Homeless_Vets ~ ES_Count + CoC_Count + TH_Count + SH_Count, data = vet_data)
Call a “summary()” on your new model
summary(Homeless_kitchen_sink_model)
##
## Call:
## lm(formula = Homeless_Vets ~ ES_Count + CoC_Count + TH_Count +
##     SH_Count, data = vet_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -635.48  -96.19   43.23  105.89  561.43
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.0897    63.4340  -0.269  0.78875
## ES_Count      1.2873     0.5140   2.504  0.01564 *
## CoC_Count   -25.8525     9.2285  -2.801  0.00727 **
## TH_Count      1.4422     0.5657   2.549  0.01397 *
## SH_Count     16.8539     1.4509  11.617 1.11e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 296.1 on 49 degrees of freedom
## Multiple R-squared: 0.9626, Adjusted R-squared: 0.9596
## F-statistic: 315.5 on 4 and 49 DF, p-value: < 2.2e-16
Interpret the model’s r-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?
The model yields a multiple R-squared of 0.9626 (adjusted R-squared: 0.9596), meaning it explains about 96% of the variation in the number of homeless veterans across the states in the data. Based on the individual p-values, all four independent variables are significant at the 0.05 level: the Emergency Shelter count (p = 0.016), the Transitional Housing count (p = 0.014), the Safe Haven count (p < 0.001), and the Continuums of Care count (p = 0.007). There are no insignificant independent variables in the model; only the intercept (p = 0.789) is not significant. The overall model p-value (< 2.2e-16) is essentially zero, so a fit this strong would be very unlikely if none of the independent variables were related to the dependent variable.
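The figures quoted above can also be pulled out of the summary object directly instead of being read off the printout. The sketch below is only an illustration: it fits a stand-in model on R's built-in `mtcars` data (an assumption, not this report's data), and the object names `demo_model` and `demo_summary` are hypothetical. The same calls should apply to `Homeless_kitchen_sink_model`.

```r
# Stand-in model on the built-in mtcars data (assumption for illustration only;
# swap in Homeless_kitchen_sink_model to inspect the model from this report).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)
demo_summary <- summary(demo_model)

# Multiple and adjusted R-squared are stored as separate elements.
r_squared <- demo_summary$r.squared
adj_r_squared <- demo_summary$adj.r.squared

# Coefficient table: the "Pr(>|t|)" column holds the individual p-values.
coef_table <- coef(demo_summary)
p_values <- coef_table[, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]

# Overall F-test p-value, recomputed from the stored F statistic
# and its numerator/denominator degrees of freedom.
f_stat <- demo_summary$fstatistic
overall_p <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                       lower.tail = FALSE))

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4))
print(significant)
print(overall_p)
```

Reporting both R-squared values side by side makes it clear which one is being quoted; the adjusted value is always the smaller of the two when the model has at least one predictor and the fit is not perfect.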
Choose some significant independent variables. Interpret its Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable? * If you have no significant variables, then just pick one and pretend it’s significant.
Emergency Shelter counts: For each additional veteran sheltered at an Emergency Shelter, the model predicts about 1.3 more homeless veterans in a state, holding the other variables constant.
Safe Haven counts: For each additional veteran sheltered at a Safe Haven shelter, the model predicts about 17 more homeless veterans in a state, holding the other variables constant.
Transitional Housing counts: For each additional veteran sheltered at a Transitional Housing shelter, the model predicts about 1.4 more homeless veterans in a state, holding the other variables constant.
Continuums of Care counts: For each additional Continuum of Care in a state, the model predicts about 26 fewer homeless veterans, holding the other variables constant.
All four of these independent variables have a statistically significant relationship with the total number of homeless veterans across the U.S. states.
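One way to check the "for each additional unit" reading of a coefficient is to predict twice, changing one predictor by exactly 1 while keeping the others fixed. The difference between the two predictions should equal that predictor's estimate. This is a minimal sketch on the built-in `mtcars` data (a stand-in, not this report's data); `demo_model`, `baseline`, and `one_more` are hypothetical names.

```r
# Stand-in model on built-in data (assumption for illustration only).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical observations that differ by exactly 1 unit of hp.
baseline <- data.frame(wt = 3, hp = 120)
one_more <- data.frame(wt = 3, hp = 121)

# The change in the prediction equals the hp coefficient.
predicted_change <- unname(predict(demo_model, one_more) -
                           predict(demo_model, baseline))
hp_coefficient <- unname(coef(demo_model)["hp"])
print(c(predicted_change = predicted_change, hp_coefficient = hp_coefficient))

# 95% confidence intervals show the uncertainty around each estimate.
intervals <- confint(demo_model, level = 0.95)
print(intervals)
```

Running `confint()` on `Homeless_kitchen_sink_model` would give a range for each estimate, which is often more informative than a single rounded value.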
Does the model you create meet or violate the assumption of linearity? Show your work with “plot(x,which=1)”
plot(Homeless_kitchen_sink_model, which=1)
The smoothed line in the Residuals vs Fitted plot is not straight at all. This suggests that the relationship between the variables is nonlinear, so this model does NOT meet the linearity assumption.
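The Residuals vs Fitted plot can also be built by hand, which makes it easier to see what the smoothed line represents. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, not this report's data) and draws a `lowess()` smoother through the residuals; it should give roughly the same picture as `plot(x, which=1)`, though the smoother settings may differ slightly. All object names are hypothetical.

```r
# Stand-in model on built-in data (assumption for illustration only).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# Fitted values and residuals are the two axes of the diagnostic plot.
fitted_values <- fitted(demo_model)
residual_values <- resid(demo_model)

# A lowess smoother traces the average residual across the fitted values.
smooth_line <- lowess(fitted_values, residual_values)

plot(fitted_values, residual_values,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted (manual)")
abline(h = 0, lty = 2)           # reference line: residuals average to zero
lines(smooth_line, col = "red")  # smoothed trend in the residuals
```

If the linearity assumption holds, the red line should stay close to the dashed zero line across the whole range of fitted values; a clear bend or slope is the pattern described above.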