R Markdown - A tall white fountain played

Once more, we will examine the same dependent variable (the food-insecure population) and the independent variables that contribute to this condition (unemployment, poverty rate, and home ownership). The reason for using population is that a census tract report can then state, for example, that 111.2 persons are experiencing food insecurity. A clearer sense of how many people are food insecure (FI) gives a more concrete picture of the problem.

options(scipen = 100)
library(readxl)
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
CAFB_SetUp <- read.csv("Capital_Area_Food_Bank_Hunger_Estimates.csv")
CAFB_Clean <- CAFB_SetUp[c("F15_FI_POP", "UNEMPLOYME", "POVERTY_RA", "HOME_OWN")]
colSums(is.na(CAFB_Clean))
## F15_FI_POP UNEMPLOYME POVERTY_RA   HOME_OWN 
##          0          0          0          0
CAFB_Condition_Report <- lm(F15_FI_POP ~ UNEMPLOYME+POVERTY_RA+HOME_OWN, data = CAFB_Clean)
summary(CAFB_Condition_Report)
## 
## Call:
## lm(formula = F15_FI_POP ~ UNEMPLOYME + POVERTY_RA + HOME_OWN, 
##     data = CAFB_Clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -866.33 -133.04  -39.22  117.72 1031.11 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   482.37      27.53  17.521 <0.0000000000000002 ***
## UNEMPLOYME   2965.30     160.93  18.426 <0.0000000000000002 ***
## POVERTY_RA    118.01     122.32   0.965               0.335    
## HOME_OWN     -466.63      33.08 -14.107 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 212.8 on 1035 degrees of freedom
## Multiple R-squared:  0.5703, Adjusted R-squared:  0.569 
## F-statistic: 457.8 on 3 and 1035 DF,  p-value: < 0.00000000000000022
##3.A
plot(CAFB_Condition_Report, which = 1)

raintest(CAFB_Condition_Report)
## 
##  Rainbow test
## 
## data:  CAFB_Condition_Report
## Rain = 0.75768, df1 = 520, df2 = 515, p-value = 0.9992
##3.B
durbinWatsonTest(CAFB_Condition_Report)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.04797603      1.900028   0.082
##  Alternative hypothesis: rho != 0
##3.C
plot(CAFB_Condition_Report, which = 3)

bptest(CAFB_Condition_Report)
## 
##  studentized Breusch-Pagan test
## 
## data:  CAFB_Condition_Report
## BP = 136.96, df = 3, p-value < 0.00000000000000022
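
To make the studentized Breusch-Pagan statistic above less of a black box, here is a minimal base R sketch of how it can be computed by hand: regress the squared residuals on the predictors and multiply the auxiliary R-squared by the sample size. The data are simulated and the names `x`, `y`, `fit`, and `aux` are hypothetical; this is not the CAFB data, and to my understanding it mirrors what `bptest()` reports by default rather than reproducing its source.

```r
# Simulated data (hypothetical, not the CAFB file): error spread grows with x.
set.seed(123)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 1 + 2 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor(s).
sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ x)

# Studentized (Koenker) statistic: n times the auxiliary R-squared,
# compared against a chi-squared distribution with one df per predictor.
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP =", round(bp_stat, 2), " df =", bp_df,
    " p-value =", signif(p_value, 3), "\n")
```

A small p-value here means the squared residuals are predictable from the regressors, which is the same reading applied to the CAFB model.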
##3.D
plot(CAFB_Condition_Report, which = 2)

shapiro.test(CAFB_Condition_Report$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  CAFB_Condition_Report$residuals
## W = 0.95644, p-value < 0.00000000000000022
##3.E
vif(CAFB_Condition_Report)
## UNEMPLOYME POVERTY_RA   HOME_OWN 
##   1.915107   2.746652   1.647234
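
As a rough illustration of what the `vif()` values above measure, the sketch below computes variance inflation factors by hand in base R: each predictor is regressed on the others, and the VIF is 1 / (1 - R-squared) of that auxiliary regression. The data frame `sim` and its columns `x1`, `x2`, `x3` are simulated and hypothetical; they stand in for the three CAFB predictors only as an example.

```r
# Simulated predictors (hypothetical): x2 is built from x1, x3 is unrelated.
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n))
sim$x2 <- sim$x1 + rnorm(n, sd = 0.5)
sim$x3 <- rnorm(n)

# For each predictor, regress it on the remaining predictors and
# convert the auxiliary R-squared into a variance inflation factor.
manual_vif <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  aux <- lm(reformulate(others, response = v), data = sim)
  1 / (1 - summary(aux)$r.squared)
})

print(round(manual_vif, 2))
```

In this simulation the correlated pair `x1` and `x2` should show clearly inflated values, while `x3` should sit close to 1, which is the pattern to look for when reading the CAFB output.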

##4 I believe the model supports my initial assumptions: variables such as unemployment, poverty rate, and even home ownership can affect the level of food insecurity in a given census tract. We do have to consider other factors that the data set does not take into account, such as transportation and distance to a food assistance pantry. Still, this model gives insight into how these factors contribute to the struggle these people face.

In the first plot (residuals vs. fitted), the data points sit close to one another with no obvious curve, with some outliers further along the line, which suggests linearity. The rainbow test returns a p-value of 0.9992, so we fail to reject the null hypothesis that the relationship between the variables is linear.

When we ran the Durbin-Watson test, the p-value registered at 0.082, above the 0.05 threshold (this p-value is bootstrapped, so it varies slightly between runs). We therefore fail to reject the null hypothesis of no autocorrelation, and the errors can be treated as independent.

That said, when we look at homoscedasticity, the scale-location plot looks like an unorganized mess. The points are spread widely and unevenly, which indicates a heteroscedastic model. The Breusch-Pagan test confirms this, with a p-value below 2.2x10^-16.

The normal Q-Q plot shows a far more uniform line than the previous model. Most of the points line up on the dotted line, except at the beginning and the end. Despite this, the residuals cannot be considered normally distributed: the Shapiro-Wilk test returned a p-value below 2.2x10^-16.

With three independent variables, there is the question of whether they are correlated with one another. In the VIF test, all three values stay below 3, so multicollinearity is not a concern. After running all these tests, I believe this data set helps confirm that these factors play a major role in the food insecurity plaguing the DC community, although the poverty rate coefficient is not statistically significant in this model (p = 0.335).

On a separate note, when we add pounds of food needed per tract to the model, it produces an interesting problem: the results are a near-perfect fit. We are examining the Capital Area Food Bank's ability to address food insecurity, so if we add a variable for the food needed in an area, it makes sense that it fits almost perfectly with the percentage of need in a census tract.
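
The near-perfect fit can be reproduced with a small simulation. The sketch below assumes, purely for illustration, that pounds of food needed is computed as roughly a fixed multiple of the food-insecure population; the source does not confirm how that column is derived. All names (`unemp`, `fi_pop`, `lbs_needed`) and the multiplier of 250 are hypothetical.

```r
# Simulated tracts (hypothetical values, not the CAFB file).
set.seed(42)
n <- 300
unemp <- runif(n, min = 0, max = 0.3)
fi_pop <- 400 + 3000 * unemp + rnorm(n, sd = 200)

# Assumption: the "pounds needed" figure is essentially a rescaled
# copy of the response, plus a little noise.
lbs_needed <- 250 * fi_pop + rnorm(n, sd = 50)

# Compare R-squared without and with the derived predictor.
r2_without <- summary(lm(fi_pop ~ unemp))$r.squared
r2_with <- summary(lm(fi_pop ~ unemp + lbs_needed))$r.squared

cat("R-squared without lbs_needed:", round(r2_without, 3), "\n")
cat("R-squared with lbs_needed:   ", round(r2_with, 5), "\n")
```

If a predictor is derived from the response in this way, the high R-squared says little about what drives food insecurity, which is a reason to leave such a variable out of the model.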

##5 Of these tests, two showed that the model does not meet the required assumptions: the homoscedasticity test and the normality of residuals test. With that in mind, I believe we can address this by applying a log transformation to the data.

##CAFB_LOG <- CAFB_Clean
##CAFB_LOG[is.na(CAFB_LOG) | CAFB_LOG == "Inf"] <- NA
##CAFB_Alter <- lm(log(F15_FI_POP)~log(UNEMPLOYME)+POVERTY_RA+HOME_OWN, data = CAFB_LOG, na.action = na.omit)
CAFB_Alter <- lm(sqrt(F15_FI_POP)~sqrt(UNEMPLOYME)+POVERTY_RA+HOME_OWN, data = CAFB_Clean, na.action = na.omit)
plot(CAFB_Alter, which = 3)

bptest(CAFB_Alter)
## 
##  studentized Breusch-Pagan test
## 
## data:  CAFB_Alter
## BP = 12.855, df = 3, p-value = 0.004961

##6 With that in mind, I attempted a log transformation to correct the homoscedasticity and normality of residuals problems. Since the dependent variable (population) contains zeros, for which the logarithm is undefined, we had to use a square root transformation instead. This produced a more homoscedastic scale-location plot than before, and the studentized Breusch-Pagan p-value rose to 4.96x10^-3. However, that is still below the 0.05 threshold, so heteroscedasticity remains. Given this result, we cannot confirm that a linear model is appropriate for this data.
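
The problem with zeros in the response can be seen in a few lines of base R. This is a minimal sketch on made-up numbers (the vectors `fi_pop` and `x` are hypothetical, not CAFB values). It also shows `log1p()`, which computes log(1 + y), as one possible alternative to the square root; that option is a suggestion and was not part of the analysis above.

```r
# Toy response containing a zero (hypothetical values).
fi_pop <- c(0, 4, 25, 100, 400)
x <- c(0.01, 0.02, 0.05, 0.10, 0.20)

# log(0) is -Inf, so the log-transformed response is not finite everywhere.
print(log(fi_pop))
print(is.finite(log(fi_pop)))

# lm() refuses a response containing -Inf; capture the error message.
fit_log <- tryCatch(
  lm(log(fi_pop) ~ x),
  error = function(e) conditionMessage(e)
)
print(fit_log)

# The square root and log1p() transforms are both defined at zero.
fit_sqrt <- lm(sqrt(fi_pop) ~ x)
fit_log1p <- lm(log1p(fi_pop) ~ x)
print(coef(fit_sqrt))
print(coef(fit_log1p))
```

Whichever transform is chosen, the diagnostics (`bptest()`, `shapiro.test()`) would need to be rerun on the refitted model before drawing conclusions.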