library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(pastecs)
## Warning: package 'pastecs' was built under R version 4.4.3
library(ggplot2)
library(lubridate)
options(scipen = 999)#removes scientific notation
library(readxl)
veteran <- read_excel("VA_National_2001-2022_Appendix_508 (1).xlsx", sheet = "Veteran", range = "A2:N24")
colnames(veteran)[9]<-"male.crude.per.100k"
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vets<-lm(`Female Crude Rate per 100,000`~`Year of Death`+male.crude.per.100k+`Population Estimate`,data=veteran)
Focusing on female veteran suicide rate per 100k and looking by year at the male veteran suicide rate. also the size of the state per population.
summary(vets)
##
## Call:
## lm(formula = `Female Crude Rate per 100,000` ~ `Year of Death` +
## male.crude.per.100k + `Population Estimate`, data = veteran)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0971 -0.6698 -0.1397 0.6884 2.6304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3276.579388639 1356.159827445 -2.416 0.0265 *
## `Year of Death` 1.577318821 0.657313802 2.400 0.0274 *
## male.crude.per.100k 0.514787773 0.461130679 1.116 0.2789
## `Population Estimate` 0.000004527 0.000001873 2.417 0.0265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.135 on 18 degrees of freedom
## Multiple R-squared: 0.8413, Adjusted R-squared: 0.8149
## F-statistic: 31.81 on 3 and 18 DF, p-value: 0.0000002083
Combination of three independant variables explains 81.49% of the variance in female veteran crude suicide rate per 100k. the overall model is significant with a P value of 0.0000002083.
there is a signficant effect of year of death with a P value 0.0274.The coefficient of 1.58 suggests that each additional year results in an average increase in female veteran suicides of 1.58 per 100k.
there is also a signficant effect of population estimate with a P value of 0.0265. The coefficient of 0.000004527 means that for every one additonal person the number of female veteran suicides per 100k increases by this much.Another way to look at this is that this means for every additional million people in a state, the female veteran suicide rate increases by 4.5 per 100k.
there is no effect of male veteran crude suicide rate.
plot(vets, which=1)
the plot shows some concerning curvature that might imply that the assumption of linearity is violated. I would want to check the other assumptions to see if I can trust the results of this regression model.
plot(vets, which =1)
raintest(`Female Crude Rate per 100,000`~`Year of Death`+male.crude.per.100k+`Population Estimate`,data=veteran)
##
## Rainbow test
##
## data: `Female Crude Rate per 100,000` ~ `Year of Death` + male.crude.per.100k + `Population Estimate`
## Rain = 2.8241, df1 = 11, df2 = 7, p-value = 0.08921
the data is not significantly non-linear. the p value is greater than .05 therefore it doesnt show to be against the linear assumption based on the rainbow test.
dwtest(vets)
##
## Durbin-Watson test
##
## data: vets
## DW = 2.109, p-value = 0.3772
## alternative hypothesis: true autocorrelation is greater than 0
there is independence of errors that rue autocorrelation is less than 0.P value greater than .05 therefore we don’t reject the null hypothesis.
bptest(vets)
##
## studentized Breusch-Pagan test
##
## data: vets
## BP = 9.9454, df = 3, p-value = 0.01904
p
plot(vets, which = 3)
plot(vets, which=2)
vif(vets)
## `Year of Death` male.crude.per.100k `Population Estimate`
## 297.24584 55.25476 277.10133
vif(vets)
## `Year of Death` male.crude.per.100k `Population Estimate`
## 297.24584 55.25476 277.10133
there is extreme correlation
dplyr::select(veteran, `Female Crude Rate per 100,000`,
`Year of Death`, male.crude.per.100k, `Population Estimate`)|>
cor()
## Female Crude Rate per 100,000 Year of Death
## Female Crude Rate per 100,000 1.0000000 0.8856209
## Year of Death 0.8856209 1.0000000
## male.crude.per.100k 0.8874562 0.9907292
## Population Estimate -0.8709804 -0.9981582
## male.crude.per.100k Population Estimate
## Female Crude Rate per 100,000 0.8874562 -0.8709804
## Year of Death 0.9907292 -0.9981582
## male.crude.per.100k 1.0000000 -0.9900518
## Population Estimate -0.9900518 1.0000000
Correlations are high.It meets many of the assumptions. The key assumption it violates is the data is independent and identically distributed. The variables in the model are not independent they are highly correlated. They move in the same or opposite directions. The female rate is highly correlated with the time of death meaning suicide rate are increasing over time. They are also highly correlated with the male rate.In regard to mitigation, I would have to construct a different model.