1 Motivation
2 Desired Outcome
3 Current Knowledge on the Subject
4 Data Sources
5 Tidying my data
6 Analysis
7 Conclusion
8 Recommendations
References

1 Motivation

A few days ago, CDC released mortality data for 2017. Life expectancy has dropped for three years in a row. One of the main causes is an increase in suicide rates. Suicide keeps increasing every year for the past 2 decades and it is hard to pinpoint the exact reasons. One of the main paradoxes is a big difference in suicide rates across the country. The suicide rate for 2017 varies from 6.6 in DC to 28.9 in Montana, which is almost 4.5 times greater. Please note that the rate is age adjusted, so age distribution should not play the role. Why would a person in Montana be 4.5 times more likely to commit suicide than one in DC? If we could answer the following question, there is a greater possibility to reduce suicide rates and prevent such tragedies across our country. To illustrate that suicide is not remote and could only happen to someone unstable and weak, I would like to reference a tragedy that took place while I was working on the project: Top US Navy Admiral on active duty committed suicide(Martin 2018) only 58 years old.

2 Desired Outcome

At the end of this project, I expect to come with a list of variables responsible for varying suicide rates across of the states. As mentioned above, I believe that understanding factors behind the variance will help us address the ever-worsening problem of suicide across the country. To come up with answers, I will try to apply both supervised and unsupervised methods, but a primary final goal is to come with a supervised model. Final measure to evaluate our model will be Adjusted R squire.

To cluster states would also be interesting since we can see whether geography plays a role.

3 Current Knowledge on the Subject

We know that gender, age, mental health, substance abuse, viability of guns, and ethnicity play a major role in suicide rates. Other possible factors include but are not limited to altitude, population density, rural status, and specific occupations.

4 Data Sources

I researched multiple sources and none of them have any raw, detailed data because suicide is a sensitive topic for families; therefore, the government protects the families’ privacy and does not provide individual data for each case; even grouped data is often masked. This is one of the limitations I had to deal with. It was disappointing not to have all the details, but I understand that in the real world, data is not always easily and readily available.

After reviewing multiple sources of potential data, I have chosen two that seem to give me the most insight on the topic:

CDC WONDER, Detailed Mortality. The Mortality Database maintained by CDC(CDC 2018).
SAMHSA 2016-2017 State Prevalence Estimates. State level data on drug use, mental illness, and suicide attempts(SAMHSA 2018).

I have also pulled an additional Wikipedia source for comparison study:

Wikipedia. Table: Suicides per 100,000 people in 2016 (age standardized). WHO(Wikipedia 2018).

library(readr)

library(stringr)

library(dplyr)

library(tidyr) 

library(car) 

library(ggplot2)

library(usmap)

library(readxl)

library(ggthemes) 

library(XML)

library(RCurl)

library(ggpubr)

library(knitr)

library(kableExtra)

library(factoextra)

library(NbClust)

5 Tidying my data

5.1 CDC Wonder, Underlying Cause of Death, 1999-2017

This is probably as good as it gets when it comes to suicide statistics. It does not have individual records and results with values under 11 are masked (fewer than 11 deaths). However, I am not aware of better alternatives to pull suicide data. From what I have seen, CDC is the main source of data for suicide reporting.

Please note that CDC Wonder does have API; however, it does not allow to pull state level data from API.

I will pull multiple data requests from the system. I have tried to create my own master data file in one pull; however, it did not work for the following reasons:

If data from 2017 is used, it would have a lot of masked results. Expanding the time frame to other years, an issue of data not being up-to data, or current, rises.
Not enough fields can be pulled out by Wonder Query, which were needed for the project. The query only allows you to pull 6 categorical variables, while the prospected goal was all 8 of them.

5.1.1 2017 Suicide Data by State

The first query was Suicide Data by state for year 2017. Data was not limited in any way except for the year (2017) and cause of death (suicide).

state2017<-read.table("C:/Final Project/Underlying Cause of Death, 2017 Suicide State All Race All Ages.txt", header =TRUE, sep = "\t")

state2017$state<-state2017$State

state2017$AARate<-state2017$Age.Adjusted.Rate

arrange(state2017, AARate)

state2017$group1<-"US States"

par(ps = 10, cex = 1, cex.main = 1)

boxplot(state2017$AARate)

title(main="Plot 1. Boxplot. 2017 Suicide Age Adjusted Rate per 100,000 people by State")

ggplot(state2017, aes(x=group1, y=AARate))+geom_violin()+ggtitle("Plot 2. Violin Plot. 2017 Suicide Age Adjusted Rate per 100,000 by State")+theme(plot.title = element_text(size=12))

plot_usmap(data = state2017, values = "AARate", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+ 
  labs(title = "Plot 3. 2017 Suicide Age Adjusted Rate per 100,000 by State")

Montana has the highest rate, DC has the lowest. Montana seems to be an outlier.

5.1.2 2017 Suicide Data by State, Gender, Age 15+ Only (To get closer to SAMHA’s age of 18+)

The second query used was 2017 Suicide Data comparing state and gender. Data was limited to one year (2017), cause of death (suicide), and age (15+). Please note that we encountered unreliable data according to CDC. The data is not big enough to be considered reliable for females in DC and Delaware. For simplicity we will use the actual rate for these states. The number we use here is not for our final analysis, so we can get away with some “cheating”. If we really feel that including these 2 states will significantly skew our data (I do not feel that way), then we have 2 solutions to the problem: one is to drop these 2 states from our analysis, or pull a few more years of data, instead of using just 2017.

state2017GA<-read.table("C:/Final Project/Underlying Cause of Death, 2017Suicide State Gender 15+.txt", header =TRUE, sep = "\t")

state2017GA$state<-state2017GA$State

state2017GA$AARate<-as.numeric(as.character(state2017GA$Age.Adjusted.Rate))

arrange(state2017GA, AARate)

state2017GA$group1<-"US States"

state2017M<-filter(state2017GA, Gender=='Male')

arrange(state2017M, AARate)

par(ps = 10, cex = 1, cex.main = 1)

boxplot(state2017M$AARate)

title(main="Plot 4. Boxplot. 2017 Suicide Age Adjusted Rate per 100,000 by State - Male, Age 15+")

ggplot(state2017M, aes(x=group1, y=AARate))+geom_violin()+ggtitle("Plot 5. Violin plot. 2017 Suicide Age Adjusted Rate per 100,000 by State - Male, Age 15+")+theme(plot.title = element_text(size=11))

plot_usmap(data = state2017M, values = "AARate", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+ 
  labs(title = "Plot 6. 2017 Suicide Age Adjusted Rate per 100,000 by State - Male, Age 15+")

state2017F<-filter(state2017GA, Gender=='Female')

state2017F$AARate1<-(state2017F$Age.Adjusted.Rate.Lower.95..Confidence.Interval+state2017F$Age.Adjusted.Rate.Upper.95..Confidence.Interval)/2

arrange(state2017F, AARate1)

par(ps = 10, cex = 1, cex.main = 1)

boxplot(state2017F$AARate1)

title(main="Plot 7. Boxplot. 2017 Suicide Age Adjusted Rate per 100,000 by State - Female, Age 15+")

ggplot(state2017F, aes(x=group1, y=AARate1))+geom_violin()+ggtitle("Plot 8. Violin plot. 2017 Suicide Age Adjusted Rate per 100,000 by State - Female, Age 15+")+theme(plot.title = element_text(size=10))

plot_usmap(data = state2017F, values = "AARate1", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+ 
  labs(title = "Plot 9. 2017 Suicide Age Adjusted Rate per 100,000 by State - Female, Age 15+")

For male, Montana has the highest rate, DC has the lowest again as well. Montana, Wyoming, and Alaska seem to be outliers.

For females, Delaware has the lowest rate, but data is unreliable. If we only use reliable data, then NY could be proud: tt has the lowest rate based on reliable data. Montana, on the other hand, has the highest rate.

5.1.3 2015-2017 Suicide Data by State, Gender, Ethnicity, Metro Status Age 15+ Only (To get closer to SAMHA’s age of 18+)

The third query used was 2015-2017 Suicide Data by state, gender, ethnicity, and metro residence. Data was limited to 3 years (2015-2017), cause of death (suicide), and age (15+). I have pulled 3 years of data to smooth it and to avoid unreliability label. I have still got some. We got a lot of detailed information from the data since it includes multiple aspects like gender, residency status, and ethnicity; therefore, it brings us one step closer to answering our question.

state15_17<-read.table("C:/Final Project/Underlying Cause of Death, 2015-2017 Suicide State Ethnicity Gender Metro 15+.txt", header =TRUE, sep = "\t")

state15_17$state<-state15_17$State

state15_17$pop<-as.numeric(as.character(state15_17$Population))

state15_17$relFlag<-if (state15_17$Age.Adjusted.Rate=='Unreliable') {'U'} else {'R'}

state15_17$AARate<-(as.numeric(as.character(state15_17$Age.Adjusted.Rate.Lower.95..Confidence.Interval))+as.numeric(as.character(state15_17$Age.Adjusted.Rate.Upper.95..Confidence.Interval)))/2

arrange(state15_17, AARate)

state15_17$group1<-"US States"

state15_17C<-filter(state15_17, state=='Colorado')%>%select(state,Gender,X2013.Urbanization,AARate,Race,Hispanic.Origin,pop,relFlag)

state15_17M<-filter(state15_17, state=='Minnesota')%>%select(state,Gender,X2013.Urbanization,AARate,Race,Hispanic.Origin,pop,relFlag)

state15_17WM<-filter(state15_17, Race=='White'&Gender=='Male'&Hispanic.Origin=='Not Hispanic or Latino')

state15_17WM$prod<-state15_17WM$AARate*state15_17WM$pop

state15_17WM1<-select(state15_17WM,prod,pop,state)

state15_17WM1<-aggregate(.~state, data=state15_17WM1, FUN=sum)

state15_17WM1$AARate<-state15_17WM1$prod/state15_17WM1$pop

state15_17WF<-filter(state15_17, Race=='White'&Gender=='Female'&Hispanic.Origin=='Not Hispanic or Latino')

state15_17WF$prod<-state15_17WF$AARate*state15_17WF$pop

state15_17WF1<-select(state15_17WF,prod,pop,state)

state15_17WF1<-aggregate(.~state, data=state15_17WF1, FUN=sum)

state15_17WF1$AARate<-state15_17WF1$prod/state15_17WF1$pop

state15_17BM<-filter(state15_17, Race=='Black or African American'&Gender=='Male'&Hispanic.Origin=='Not Hispanic or Latino')

state15_17BM$prod<-state15_17BM$AARate*state15_17BM$pop

state15_17BM1<-select(state15_17BM,prod,pop,state)

state15_17BM1<-aggregate(.~state, data=state15_17BM1, FUN=sum)

state15_17BM1$AARate<-state15_17BM1$prod/state15_17BM1$pop

state15_17BF<-filter(state15_17, Race=='Black or African American'&Gender=='Female'&Hispanic.Origin=='Not Hispanic or Latino')

state15_17BF$prod<-state15_17BF$AARate*state15_17BF$pop

state15_17BF1<-select(state15_17BF,prod,pop,state)

state15_17BF1<-aggregate(.~state, data=state15_17BF1, FUN=sum)

state15_17BF1$AARate<-state15_17BF1$prod/state15_17BF1$pop

state15_17HM<-filter(state15_17, Gender=='Male'&Hispanic.Origin=='Hispanic or Latino')

state15_17HM$prod<-state15_17HM$AARate*state15_17HM$pop

state15_17HM1<-select(state15_17HM,prod,pop,state)

state15_17HM1<-aggregate(.~state, data=state15_17HM1, FUN=sum)

state15_17HM1$AARate<-state15_17HM1$prod/state15_17HM1$pop

state15_17HF<-filter(state15_17, Race=='Black or African American'&Gender=='Female'&Hispanic.Origin=='Not Hispanic or Latino')

state15_17HF$prod<-state15_17HF$AARate*state15_17HF$pop

state15_17HF1<-select(state15_17HF,prod,pop,state)

state15_17HF1<-aggregate(.~state, data=state15_17HF1, FUN=sum)

state15_17HF1$AARate<-state15_17HF1$prod/state15_17HF1$pop

state15_17%>%select(state,Race,Hispanic.Origin,Gender,X2013.Urbanization,Population, AARate)%>%filter(state=='Montana'|state=='District of Columbia')

5.1.4 2015-2017 Mortality Data by State, Gender, Ethnicity, Metro Status Age 15+ Only (To get closer to SAMHA’s age of 18+) - Drug Induced Indicator

The third query I have pulled was 2015-2017 Mortality Data by state, gender, ethnicity, metro residence, and drug induced indicator. Data was limited to 3 years (2015-2017) and age (15+). I have pulled 3 years of data to smooth it and to avoid unreliability label.

state15_17DI<-read.table("C:/Final Project/Underlying Cause of Death,  2015-2017 Drug Induced.txt", header =TRUE, sep = "\t")

state15_17DI$state<-state15_17DI$State

state15_17DI

state15_17DI$pop<-as.numeric(as.character(state15_17DI$Population))

state15_17DI$relFlag<-if (state15_17DI$Age.Adjusted.Rate=='Unreliable') {'U'} else {'R'}

state15_17DI$AARate<-(as.numeric(as.character(state15_17DI$Age.Adjusted.Rate.Lower.95..Confidence.Interval))+as.numeric(as.character(state15_17DI$Age.Adjusted.Rate.Upper.95..Confidence.Interval)))/2

arrange(state15_17DI, state)

state15_17DI$group1<-"US States"

state15_17WMDI<-filter(state15_17DI, Race=='White'&Gender=='Male'&Hispanic.Origin=='Not Hispanic or Latino')

state15_17WMDI$prod<-state15_17WMDI$AARate*state15_17WMDI$pop

state15_17WMDI1<-select(state15_17WMDI,prod,pop,state,Drug.Alcohol.Induced)

state15_17WMDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17WMDI1, FUN=sum)

state15_17WMDI1$AARate1<-state15_17WMDI1$prod/state15_17WMDI1$pop

state15_17WMDI2<-select(state15_17WMDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)

colnames(state15_17WMDI2)<-c('state','Alcohol','Other','Drug')

state15_17WMM<-merge(state15_17WM1,state15_17WMDI2,by='state')

state15_17WMC<-select(state15_17WMM,AARate, Alcohol, Other, Drug)

state15_17WMM

cor(state15_17WMC)

state15_17WFDI<-filter(state15_17DI, Race=='White'&Gender=='Female'&Hispanic.Origin=='Not Hispanic or Latino')

state15_17WFDI$prod<-state15_17WFDI$AARate*state15_17WFDI$pop

state15_17WFDI1<-select(state15_17WFDI,prod,pop,state,Drug.Alcohol.Induced)

state15_17WFDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17WFDI1, FUN=sum)

state15_17WFDI1$AARate1<-state15_17WFDI1$prod/state15_17WFDI1$pop

state15_17WFDI2<-select(state15_17WFDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)

colnames(state15_17WFDI2)<-c('state','Alcohol','Other','Drug')

state15_17WFM<-merge(state15_17WF1,state15_17WFDI2,by='state')

state15_17WFC<-select(state15_17WFM,AARate, Alcohol, Other, Drug)

state15_17WFM

cor(state15_17WFC)

state15_17BMDI<-filter(state15_17DI, Race=='Black or African American'&Gender=='Male'&Hispanic.Origin=='Not Hispanic or Latino')

state15_17BMDI
state15_17BMDI$prod<-state15_17BMDI$AARate*state15_17BMDI$pop

state15_17BMDI1<-select(state15_17BMDI,prod,pop,state,Drug.Alcohol.Induced)

state15_17BMDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17BMDI1, FUN=sum)

state15_17BMDI1$AARate1<-state15_17BMDI1$prod/state15_17BMDI1$pop

state15_17BMDI2<-select(state15_17BMDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)

colnames(state15_17BMDI2)<-c('state','Alcohol','Other','Drug')

state15_17BMDI2<-filter(state15_17BMDI2,!is.na(Alcohol)&!is.na(Drug))

state15_17BMM<-merge(state15_17BM1,state15_17BMDI2,by='state')

state15_17BMC<-select(state15_17BMM,AARate, Alcohol, Other, Drug)

state15_17BMM

cor(state15_17BMC)

state15_17BFDI<-filter(state15_17DI, Race=='Black or African American'&Gender=='Female'&Hispanic.Origin=='Not Hispanic or Latino')

state15_17BFDI$prod<-state15_17BFDI$AARate*state15_17BFDI$pop

state15_17BFDI1<-select(state15_17BFDI,prod,pop,state,Drug.Alcohol.Induced)

state15_17BFDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17BFDI1, FUN=sum)

state15_17BFDI1$AARate1<-state15_17BFDI1$prod/state15_17BFDI1$pop

state15_17BFDI2<-select(state15_17BFDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)

colnames(state15_17BFDI2)<-c('state','Alcohol','Other','Drug')

state15_17BMDI2<-filter(state15_17BMDI2,!is.na(Alcohol)&!is.na(Drug))

state15_17BFM<-merge(state15_17BF1,state15_17BFDI2,by='state')

state15_17BFC<-select(state15_17BFM,AARate, Alcohol, Other, Drug)

state15_17BFM

cor(state15_17BFC)

state15_17HMDI<-filter(state15_17DI, Gender=='Male'&Hispanic.Origin=='Hispanic or Latino')

state15_17HMDI

state15_17HMDI$prod<-state15_17HMDI$AARate*state15_17HMDI$pop

state15_17HMDI1<-select(state15_17HMDI,prod,pop,state,Drug.Alcohol.Induced)

state15_17HMDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17HMDI1, FUN=sum)

state15_17HMDI1$AARate1<-state15_17HMDI1$prod/state15_17HMDI1$pop

state15_17HMDI2<-select(state15_17HMDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)

colnames(state15_17HMDI2)<-c('state','Alcohol','Other','Drug')

state15_17HMDI2<-filter(state15_17HMDI2,!is.na(Alcohol)&!is.na(Drug))

state15_17HMM<-merge(state15_17HM1,state15_17HMDI2,by='state')

state15_17HMC<-select(state15_17HMM,AARate, Alcohol, Other, Drug)

state15_17HMM

cor(state15_17HMC)

state15_17HFDI<-filter(state15_17DI, Gender=='Female'&Hispanic.Origin=='Hispanic or Latino')

state15_17HFDI$prod<-state15_17HFDI$AARate*state15_17HFDI$pop

state15_17HFDI1<-select(state15_17HFDI,prod,pop,state,Drug.Alcohol.Induced)

state15_17HFDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17HFDI1, FUN=sum)

state15_17HFDI1$AARate1<-state15_17HFDI1$prod/state15_17HFDI1$pop

state15_17HFDI2<-select(state15_17HFDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)

colnames(state15_17HFDI2)<-c('state','Alcohol','Other','Drug')

state15_17HMDI2<-filter(state15_17HMDI2,!is.na(Alcohol)&!is.na(Drug))

state15_17HFM<-merge(state15_17HF1,state15_17HFDI2,by='state')

state15_17HFC<-select(state15_17HFM,AARate, Alcohol, Other, Drug)

state15_17HFM

cor(state15_17HFC)

state15_17DI%>%select(state,Race,Hispanic.Origin,Gender, Drug.Alcohol.Induced, AARate)%>%filter(state=='Montana'|state=='District of Columbia')

For White Non-Hispanic Male, Alcohol induced causes correlate with r=0.57.

For White Non-Hispanic Female Alcohol induced causes correlate with r=0.65.

For Black Non-Hispanic Male Alcohol induced causes correlate with r=0.43.

For Black Non-Hispanic Female Alcohol induced causes correlate with r=0.51.

For Hispanic Male Alcohol induced causes correlate with r=0.59.

For Hispanic Female Alcohol induced causes correlate with r=NA.

5.1.5 2008-2017 Suicide Data For Colorado and Minnesota, Gender, Ethnicity, Metro Status, and ICD10 groups Age 15+ Only (To get closer to SAMHA’s age of 18+)

The fourth query was 2008-2017 Suicide Data for Colorado and Minnesota by gender, ethnicity, metro residence, and ICD10 groups. Data was limited to 10 years (2008-2017) and age (15+). Please note that all disease was pulled since suicide was not the main aspect of the given data. I have pulled 10 years to get the most reliable data.

state08_17<-read.table("C:/Final Project/Underlying Cause of Death, 2008-2017 Col Min by ICD.txt", header =TRUE, sep = "\t")

state08_17$state<-state08_17$State

state08_17$pop<-as.numeric(as.factor(state08_17$Population))

state08_17$relFlag<-if (state08_17$Age.Adjusted.Rate=='Unreliable') {'U'} else {'R'}

state08_17$AARate<-(as.numeric(as.character(state08_17$Age.Adjusted.Rate.Lower.95..Confidence.Interval))+as.numeric(as.character(state08_17$Age.Adjusted.Rate.Upper.95..Confidence.Interval)))/2

arrange(state08_17, AARate)

state08_17$group1<-"US States"

state08_17<-select(state08_17,state,Gender,X2013.Urbanization,AARate,Race,pop,relFlag,ICD.10.113.Cause.List)

state08_17M<-filter(state08_17,state=='Minnesota')

state08_17C<-filter(state08_17,state=='Colorado')

5.1.6 2008-2017 Mortality Data For All States, Gender, Ethnicity, and ICD10 groups Age 15+ Only

The fifth query was 2008-2017 Mortality Data for all sates by gender, ethnicity, and ICD10 groups. Data was limited to 10 years (2008-2017), ethnicity, and Metro areas. Please note that all disease was pulled since suicide was not only of mention. To get most reliable results data from 10 years ago was used.

state08_17a<-read.table("C:/Final Project/Underlying Cause of Death, 2008-2017 Metro Non Hispanic 15+.txt", header =TRUE, sep = "\t")

#state08_17a<-filter(state08_17a,Crude.Rate!='Unreliable')



state08_17a$state<-state08_17a$State

state08_17a$pop<-as.numeric(as.factor(state08_17a$Population))
state08_17a$AARate<-(as.numeric(as.character(state08_17a$Age.Adjusted.Rate.Lower.95..Confidence.Interval))+as.numeric(as.character(state08_17a$Age.Adjusted.Rate.Upper.95..Confidence.Interval)))/2

state08_17a$group1<-"US States"

state08_17a<-select(state08_17a,state,Gender,AARate,Race,pop,ICD.10.113.Cause.List)

state08_17b<-spread(state08_17a,ICD.10.113.Cause.List,AARate)

state08_17b

state08_17bMB<-filter(state08_17b,Gender=='Male',Race=='Black or African American')

state08_17bMB

state08_17bMBC<-state08_17bMB[-c(1:5)]

state08_17bMBC<-state08_17bMBC[, -which(colMeans(is.na(state08_17bMBC)) > 0.25)]

state08_17bMBC

colnames(state08_17bMBC)[13] <- "Suicide"

state08_17bMBC<-filter(state08_17bMBC,!is.na(Suicide))

cor(as.matrix(state08_17bMBC[,13]), as.matrix(state08_17bMBC[,-13]), use = "pairwise.complete.obs")

state08_17bMW<-filter(state08_17b,Gender=='Male',Race=='White')

state08_17bMW

state08_17bMWC<-state08_17bMW[-c(1:5)]

state08_17bMWC<-state08_17bMWC[, -which(colMeans(is.na(state08_17bMWC)) > 0.1)]

state08_17bMWC

colnames(state08_17bMWC)[19] <- "Suicide"

state08_17bMBC<-filter(state08_17bMWC,!is.na(Suicide))

cor(as.matrix(state08_17bMWC[,19]), as.matrix(state08_17bMWC[,-11]), use = "pairwise.complete.obs")

5.1.7 2017 Suicide Data For DC and Montana by State, Ethnicity

The 6th query was 2017 Suicide Data for DC and Montana by ethnicity and ICD10 groups. Data was limited to 2017 and DC and Montana only.

state17_DC_Mo<-read.table("C:/Final Project/Underlying Cause of Death, 2017 Monata vs DC.txt", header =TRUE, sep = "\t")

5.2 SAMHA State Prevalance of Drug and Alcohol Use and Mental Issues.

This data does not provide suicide rates; however, it does provide statistics on alcohol, drug, and mental health by state. It should help us understand which states have issues with substance abuse and mental health. The data is provided in Excel format. I have loaded the file on my GitHub account(M.Groysman 2018). The file has multiple tabs. We do not need all of them, so we will only load the ones we need:

Drug use, any drug
Drug use, not marijuana
Drug use, heroin
Alcohol use
Alcohol use, binge
Drug Abuse disorder
Pain Reliever Abuse disorder
Alcohol Abuse disorder
Substance Abuse disorder
Serious Mental Disorder
Any Mental Disorder
Serious Thoughts of Suicide, the most important table for us. Please note that this is not a suicide rate. However, we will work with model that:

Suicide thought->Suicide Planning->Suicide Attempt->Suicide(in case of successful attempt)

5.2.1 Data Load

drug_use <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 1")

drug_use <- drug_use[6:62, ]

colnames(drug_use) <- c("Index", "state", "Y12P", "Y12P_LE", "Y12P_HE", "Y12_17", 
    "Y12_17_LE", "Y12_17_HE", "Y18_25", "Y18_25_LE", "Y18_25_HE", "Y26P", "Y26P_LE", 
    "Y26P_HE", "Y18P", "Y18P_LE", "Y18P_HE")

drug_uset <- transform(drug_use, Y12P = as.numeric(Y12P), Y18P = as.numeric(Y18P), 
    Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))

drug_uset %>% arrange(desc(Y12P))

drug_use1 <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 6")

drug_use1<-drug_use1[6:62,]

colnames(drug_use1)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")

drug_uset1<-transform(drug_use1,Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))

drug_uset1%>%arrange(desc(Y12P))

drug_useh <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 9")

drug_useh<-drug_useh[6:62,]

colnames(drug_useh)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")

drug_useth<-transform(drug_useh, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))

drug_useth%>%arrange(desc(Y12P))

alc_use <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 13")

alc_use<-alc_use[6:62,]

colnames(alc_use)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")

alc_uset<-transform(alc_use,  Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))

alc_uset%>%arrange(desc(Y12P))

alc_useb <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 14")

alc_useb<-alc_useb[6:62,]

colnames(alc_useb)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")

alc_usebt<-transform(alc_useb, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))

alc_usebt%>%arrange(desc(Y12P))

drug_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 20")

drug_d<-drug_d[6:62,]

colnames(drug_d)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")

drug_dt<-transform(drug_d, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))

drug_dt%>%arrange(desc(Y12P))

pain_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 21")

pain_d<-pain_d[6:62,]

colnames(pain_d)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")

pain_dt<-transform(pain_d, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))

pain_dt%>%arrange(desc(Y12P))

alc_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 22")

alc_d<-alc_d[6:62,]

colnames(alc_d)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")

alc_dt<-transform(alc_d, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25))

alc_dt%>%arrange(desc(Y12P))

sa_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 23")

sa_d<-sa_d[6:62,]

colnames(sa_d)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")

sa_dt<-transform(sa_d, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25))

sa_dt%>%arrange(desc(Y12P))

mh_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 27")

mh_d<-mh_d[6:62,]

colnames(mh_d)<-c("Index","state","Y18P","Y18P_LE","Y18P_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE")

mh_dt<-transform(mh_d, Y18P = as.numeric(Y18P))

mh_dt%>%arrange(desc(Y18P))

amh_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 28")

amh_d<-amh_d[6:62,]

colnames(amh_d)<-c("Index","state","Y18P","Y18P_LE","Y18P_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE")

amh_dt<-transform(amh_d, Y18P = as.numeric(Y18P))

amh_dt%>%arrange(desc(Y18P))

5.2.2 Serious thoughts of suicide, rate among 18+ cohort

suicide_t <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 30")

suicide_t<-suicide_t[6:62,]

colnames(suicide_t)<-c("Index","state","Y18P","Y18P_LE","Y18P_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE")

suicide_t%>%filter(!is.na(Index))%>%filter(!state %in% c('Total U.S.','Midwest','West','South','Northeast'))

suicide_tt<-transform(suicide_t, Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25),Y26P = as.numeric(Y26P))

suicide_tt1<-filter(suicide_tt,!is.na(Y18P)&substr(suicide_tt$state,1,3)!='Tot')%>%arrange(desc(Y18P))

arrange(suicide_tt1,desc(Y18P))

suicide_tt1$group1<-"US States"

par(ps = 10, cex = 1, cex.main = 1)

boxplot(suicide_tt1$Y18P*100)

title(main="Plot 10. Boxplot. Serious thoughts of suicide, % rate among 18+ cohort")

ggplot(suicide_tt1, aes(x=group1, y=Y18P*100))+geom_violin()+ggtitle("Plot 11. Violin plot. Serious thoughts of suicide, % rate among 18+ cohort")+theme(plot.title = element_text(size=13))

plot_usmap(data = suicide_tt1, values = "Y18P", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+ 
  labs(title = "Plot 12. Serious thoughts of suicide, % rate among 18+ cohort")

5.2.2.1 Discussion of the table:

Utah has the highest % of population with thoughts of suicide, NJ had the lowest. Utah seems to be an outline. Interestingly, Utah has high but not highest rate of suicide among US states. Alaska and Montana have higher rates.

5.2.2.2 Serious thoughts of suicide, rate among younger cohort, 18-25

Among young people, Alaska is slightly ahead of Utah in suicide thoughts. Apparently, Utah has teen suicide crisis(Riess 2016).

suicide_tt1%>%arrange(desc(Y18_25))

par(ps = 10, cex = 1, cex.main = 1)

boxplot(suicide_tt1$Y18_25*100)

title(main="Plot 13. Boxplot. Serious thoughts of suicide, % rate among 18-25 year old cohort")

ggplot(suicide_tt1, aes(x=group1, y=Y18_25*100)) +geom_violin()+ggtitle("Plot 14. Serious thoughts of suicide, % rate among 18-25 year old cohort")+theme(plot.title = element_text(size=10))

plot_usmap(data = suicide_tt1, values = "Y18_25", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+ 
  labs(title = "Plot 15. Serious thoughts of suicide, % rate among 18-25 year old cohort")

5.2.3 Serious thoughts of suicide, rate among older cohort, 26+

Both Alaska and Utah seem to be outliers. Among older people (26+), Idaho, state adjutant to Utah, has the highest number with suicide thoughts, while NJ again has the lowest number.

suicide_tt1%>%arrange(desc(Y26P))

par(ps = 10, cex = 1, cex.main = 1)

boxplot(suicide_tt1$Y26P*100)

title(main="Plot 16. Serious thoughts of suicide, rate among 25+ year old cohort")

ggplot(suicide_tt1, aes(x=group1, y=Y26P*100))+geom_violin()+ggtitle("Plot 17. Serious thoughts of suicide, % rate among 25+ year old cohort")+theme(plot.title = element_text(size=10))

plot_usmap(data = suicide_tt1, values = "Y26P", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+ 
  labs(title = "Plot 18. Serious thoughts of suicide, % rate among 25+ year old cohort")

5.2.4 SAMHA Master Table

Let’s put a SAMHA master table to include all separate pieces together

Drug use, any drug
Drug use, not marijuana
Drug use, heroin
Alcohol use
Alcohol use, binge
Drug Abuse disorder
Pain Reliever Abuse disorder
Alcohol Abuse disorder
Substance Abuse disorder
Serious Mental Disorder
Any Mental Disorder
Serious Thoughts of Suicide

drug_usetA<-select(drug_uset,"state","Y18P")

colnames(drug_usetA)<-c("state","DrugAny")

drug_uset1A<-select(drug_uset1,"state","Y18P")

colnames(drug_uset1A)<-c("state","DrugNoM")

drug_useth1<-select(drug_useth,"state","Y18P")

colnames(drug_useth1)<-c("state","Heroin")

alc_uset1<-select(alc_uset,"state","Y18P")

colnames(alc_uset1)<-c("state","AlcUse")

alc_usebt1<-select(alc_usebt,"state","Y18P")

colnames(alc_usebt1)<-c("state","Binge")

drug_dt1<-select(drug_dt,"state","Y18P")

colnames(drug_dt1)<-c("state","DrugDis")

pain_dt1<-select(pain_dt,"state","Y18P")

colnames(pain_dt1)<-c("state","Painkillers")

alc_dt1<-select(alc_dt,"state","Y18P")

colnames(alc_dt1)<-c("state","AlcDis")

sa_dt1<-select(sa_dt,"state","Y18P")

colnames(sa_dt1)<-c("state","SubAbuse")

suicide_tt2<-select(suicide_tt1,"state","Y18P")

colnames(suicide_tt2)<-c("state","ThoughtSuic")

mh_dt1<-select(mh_dt,"state","Y18P")

colnames(mh_dt1)<-c("state","SerMH")


SAMHA_master<-Reduce(function(x,y) merge(x = x, y = y, by = "state"), list(suicide_tt2, mh_dt1, pain_dt1, drug_uset1A, drug_usetA, alc_usebt1,drug_dt1, alc_dt1, alc_uset1, drug_useth1, sa_dt1))

SAMHA_master<-SAMHA_master%>%filter(!state %in% c('Total U.S.','Midwest','West','South','Northeast'))

5.3 2016 Suicide Rate by Country

wikiURL<-"https://en.wikipedia.org/wiki/List_of_countries_by_suicide_rate"

wikiData <- getURL(wikiURL)

countries <- readHTMLTable(wikiData)

countries1<-countries[[3]]

countries2<-countries1[-1,]

colnames(countries2)<-c("Rank","Country", "AgeAdjRateT","a","b","c","d","e")

countries2$temp<-as.character(countries2$AgeAdjRateT)

countries2$AgeAdjRate<-as.numeric(countries2$temp)

countries2<-countries2%>%select("Rank","Country","AgeAdjRate")

countries2$Country<-gsub("\\(more info\\)", "", as.character(countries2$Country))

head(countries2)%>%
  kable() %>%
  kable_styling()

	Rank	Country	AgeAdjRate
2	1	Guyana	30.2
3	2	Lesotho	28.9
4	3	Russia	26.5
5	4	Lithuania	25.7
6	5	Suriname	23.2
7	6	Cote d’Ivoire	23.0

6 Analysis

Correlation classification(Gerstman Unknown):

0 < |r| < .3 weak correlation .3 < |r| < .7 moderate correlation |r| > 0.7 strong correlation

6.1 CDC Wonder Data.

6.1.1 2017 State Suicide Rate.

Let’s check for normality. The distribution of our data does somewhat resemble to normal. Montana creates a long tail.

ggdensity(state2017$AARate, 
          main = "Plot 19. Density plot of 2017 Age Adj Suicide Rate per 100,000 by State",
          xlab = "States")

6.1.1.1 US States vs World Countries: Suicide Rate

Next let’s merge US State data into Wiki Country data to see how US States would rank if they were independent countries.

stateWiki<-select(state2017,State,AARate)

colnames(stateWiki)<-c("Country","AgeAdjRate")

stateWiki$USflag<-"Y"

countries3<-select(countries2,Country, AgeAdjRate)

countries3$USflag<-"N"

countryState <- rbind(countries3, stateWiki)

countryState<-arrange(countryState,desc(AgeAdjRate))

countryState$rank <- NA

countryState$rank<- 1:nrow(countryState)

countryState<-select(countryState,rank, Country, USflag, AgeAdjRate)

countryState %>%
  kable() %>%
  kable_styling()

rank	Country	USflag	AgeAdjRate
1	Guyana	N	30.200
2	Montana	Y	28.943
3	Lesotho	N	28.900
4	Alaska	Y	27.014
5	Wyoming	Y	26.940
6	Russia	N	26.500
7	Lithuania	N	25.700
8	New Mexico	Y	23.300
9	Suriname	N	23.200
10	Idaho	Y	23.168
11	Cote d’Ivoire	N	23.000
12	Kazakhstan	N	22.800
13	Utah	Y	22.691
14	South Dakota	Y	22.517
15	Equatorial Guinea	N	22.000
16	Belarus	N	21.400
17	West Virginia	Y	21.064
18	Arkansas	Y	20.800
19	Colorado	Y	20.339
20	Nevada	Y	20.285
21	South Korea	N	20.200
22	North Dakota	Y	20.083
23	Ugandab	N	20.000
24	Cameroon	N	19.500
25	Kansas	Y	19.131
26	Oklahoma	Y	19.129
27	Zimbabwe	N	19.100
28	Oregon	Y	18.998
29	New Hampshire	Y	18.898
30	Maine	Y	18.874
31	Ukraine	N	18.500
32	Missouri	Y	18.487
33	Vermont	Y	18.310
34	Arizona	Y	18.158
35	Nigeria	N	17.300
36	Latvia	N	17.200
37	Kentucky	Y	16.920
38	Washington	Y	16.909
39	Tennessee	Y	16.783
40	Eswatini	N	16.700
41	Alabama	Y	16.628
42	Togo	N	16.600
43	India	N	16.500
44	Uruguay	N	16.500
45	Indiana	Y	16.330
46	South Carolina	Y	16.297
47	Sierra Leone	N	16.100
48	Benin	N	15.700
49	Belgium	N	15.700
50	Chad	N	15.500
51	Wisconsin	Y	15.385
52	Louisiana	Y	15.225
53	Kiribati	N	15.200
54	Hawaii	Y	15.178
55	Cabo Verde	N	15.100
56	Iowa	Y	15.040
57	Pennsylvania	Y	15.023
58	Mississippi	Y	15.018
59	Burundi	N	15.000
60	Ohio	Y	14.807
61	Burkina Faso	N	14.800
62	Nebraska	Y	14.679
63	Estonia	N	14.400
64	Japan	N	14.300
65	North Carolina	Y	14.284
66	Sri Lanka	N	14.200
67	Michigan	Y	14.104
68	Florida	Y	13.965
69	Minnesota	Y	13.846
70	Finland	N	13.800
71	Eritrea	N	13.800
72	United States	N	13.700
73	Hungary	N	13.600
74	Georgia	Y	13.594
75	El Salvador	N	13.500
76	Liberia	N	13.400
77	Moldova	N	13.400
78	Poland	N	13.400
79	Texas	Y	13.353
80	Virginia	Y	13.350
81	Iceland	N	13.300
82	Slovenia	N	13.300
83	Mongolia	N	13.300
84	Bolivia	N	12.900
85	Thailand	N	12.900
86	Trinidad and Tobago	N	12.900
87	South Africa	N	12.800
88	Haiti	N	12.200
89	France	N	12.100
90	Nicaragua	N	11.900
91	Senegal	N	11.800
92	Rhode Island	Y	11.784
93	Sweden	N	11.700
94	Australia	N	11.700
95	Delaware	Y	11.629
96	Bhutan	N	11.600
97	New Zealand	N	11.600
98	Central African Republic	N	11.600
99	Botswana	N	11.500
100	Croatia	N	11.500
101	Namibia	N	11.500
102	Austria	N	11.400
103	Ethiopia	N	11.400
104	Switzerland	N	11.300
105	Zambia	N	11.300
106	Micronesia	N	11.300
107	Illinois	Y	11.212
108	Comoros	N	11.100
109	Rwanda	N	11.000
110	Serbia	N	10.900
111	Ireland	N	10.900
112	North Koreab	N	10.600
113	Connecticut	Y	10.512
114	Guinea	N	10.500
115	Czech Republic	N	10.500
116	Dominican Republic	N	10.500
117	California	Y	10.456
118	Canada	N	10.400
119	Luxembourg	N	10.400
120	Norway	N	10.100
121	Cuba	N	10.100
122	Slovakia	N	10.100
123	Gambia	N	10.000
124	Maryland	Y	9.847
125	Yemenb	N	9.800
126	Democratic Republic of the Congo	N	9.700
127	Chile	N	9.700
128	Nepal	N	9.600
129	Netherlands	N	9.600
130	Tanzaniab	N	9.600
131	Gabon	N	9.600
132	Sudan	N	9.500
133	Massachusetts	Y	9.456
134	Paraguay	N	9.300
135	Laos	N	9.300
136	Congo	N	9.300
137	Denmark	N	9.200
138	Germany	N	9.100
139	Kyrgyzstan	N	9.100
140	Argentina	N	9.100
141	Niger	N	9.000
142	Mali	N	8.900
143	Angola	N	8.900
144	Ghanab	N	8.700
145	Portugal	N	8.600
146	Djibouti	N	8.500
147	Mozambique	N	8.400
148	New Jersey	Y	8.339
149	Somalia	N	8.300
150	Seychelles	N	8.300
151	New York	Y	8.120
152	Myanmar	N	8.100
153	China	N	8.000
154	Romania	N	8.000
155	Singaporeb	N	7.900
156	Montenegro	N	7.900
157	Bulgaria	N	7.900
158	Malawi	N	7.800
159	United Kingdom	N	7.600
160	Mauritania	N	7.500
161	Costa Rica	N	7.500
162	Guinea-Bissau	N	7.400
163	Uzbekistan	N	7.400
164	Mauritius	N	7.300
165	Saint Lucia	N	7.300
166	Ecuador	N	7.200
167	Turkmenistan	N	7.200
168	Turkey	N	7.200
169	Papua New Guineab	N	7.000
170	Vietnam	N	7.000
171	Colombia	N	7.000
172	Madagascar	N	6.900
173	Georgia	N	6.700
174	District of Columbia	Y	6.621
175	Malta	N	6.500
176	Timor-Leste	N	6.400
177	Bosnia and Herzegovina	N	6.400
178	Afghanistan	N	6.400
179	Malaysiab	N	6.200
180	Republic of Macedonia	N	6.200
181	Bangladeshb	N	6.100
182	South Sudanb	N	6.100
183	Spain	N	6.100
184	Brazil	N	6.100
185	Cambodia	N	5.900
186	Solomon Islands	N	5.900
187	Belize	N	5.900
188	Qatar	N	5.800
189	Bahrain	N	5.700
190	Armenia	N	5.700
191	Albania	N	5.600
192	Kenyab	N	5.600
193	Italy	N	5.500
194	Fiji	N	5.500
195	Libya	N	5.500
196	Vanuatu	N	5.400
197	Samoa	N	5.400
198	Mexico	N	5.200
199	Peru	N	5.100
200	Bruneib	N	4.500
201	Cyprusb	N	4.500
202	Egypt	N	4.400
203	Panama	N	4.400
204	Iraq	N	4.100
205	Iran	N	4.000
206	Tonga	N	4.000
207	Greece	N	3.800
208	Venezuela	N	3.800
209	Jordanb	N	3.700
210	Philippines	N	3.700
211	Indonesia	N	3.700
212	Omanb	N	3.500
213	Honduras	N	3.400
214	Saudi Arabiab	N	3.400
215	Algeria	N	3.300
216	Tajikistan	N	3.300
217	Tunisia	N	3.200
218	Lebanonb	N	3.200
219	Morocco	N	3.100
220	Pakistanb	N	3.100
221	Sao Tome and Principe	N	3.100
222	Guatemala	N	2.900
223	Maldives	N	2.700
224	United Arab Emiratesb	N	2.700
225	Azerbaijan	N	2.600
226	Syriab	N	2.400
227	Saint Vincent and the Grenadines	N	2.400
228	Kuwait	N	2.200
229	Jamaica	N	2.000
230	Grenada	N	1.700
231	Bahamasb	N	1.600
232	Antigua and Barbuda	N	0.500
233	Barbados	N	0.400

It is shocking, but 5 US states made in top 10. Montana has the 2nd highers suicide rate. 3 states are ahead of Russia, and Russia has very high suicide rate.

Let’s create a violin plot of the distribution.

p <- ggplot(countryState, aes(factor(USflag), AgeAdjRate))

p + geom_violin()+ggtitle("Plot 20. Age Adj Suicide Rate per 100,000: World Countries vs US States")+theme(plot.title = element_text(size=10))

The plot provides devasting results. US is doing very poorly. Please note that in last 2 decades while US suicide rate was steadily rising, the rest of the world was moving in opposite direction. Rates of suicide across of Asia and Europe have been dropping.

6.1.1.2 Suicide Rates vs Serious Thoughts of Suicide by State

Let’s merge CDC and SAMHA data to see if there is correlation between suicide rates and serious thoughts of suicide. It should help us to validate our assumption, stated earlier that:

Suicide thought->Suicide Planning->Suicide Attempt->Suicide(in case of successful attempt)

suic_r_t<-merge(state2017, suicide_tt1, by = "state")

cor(suic_r_t$Y18P, suic_r_t$AARate)

## [1] 0.5509664

#suic_r_t

Correlation is moderate. R is only 0.55. Not as high as I expected.

scatterplot(suic_r_t$Y18P*100~suic_r_t$AARate, ylab="Thoughts of suicide, % of population", xlab="Age Adj Suicide Rate per 100,000", 
            main="Plot 21. Thoughts of Suicide(%) vs Age Adj Suicide Rate per 100,000")

ggplot(suic_r_t, aes(Y18P*100,AARate, color = AARate)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 22. Thoughts of Suicide(%) vs Suicide Rate per 100,000", x="Thoughts of Suicide, %", y="Suicide Rate, per 100,000")

Interesting question is why so many people seriously think about suicide while relativly few commit suicide, while in Montana not that many people seriously think about suicide while so many actually commit it? Is it the way data is collected or is it something else. Later we will see that gun ownership could play a role.

6.1.1.3 Suicide Rates vs SAMHA Survey.

Let’s check correlation between suicide rate and other SAMHA’s variables:

Vs Drug Use, Any - no correlation, r=0.00. Very Interesting. It seems variables are perfectly independent.

Vs Heroin Use - practically no correlation, r=-0.04

Vs Alcohol Use and Binge Drinking - week negative correlation(!), r=-0.22 and r=-0.2. Basically what we are seeing that higher alcohol consumption weakly predicts lower suicide rate. Not what I have expected.

Vs Painkillers Abuse - finally we got some decent correlation. r=0.42.

Vs Serious Mental Illness - BINGO! r=0.57. Hard to believe but serious mental illness is a slightly better predictor than serious thoughts of suicide.

state2017C<-select(state2017,state, AARate)

suic_cor<-merge(state2017C, SAMHA_master, by = "state")

suic_cor1<-suic_cor

suic_cor1$state<-NULL

res <- cor(suic_cor1)

round(res, 2)%>%
  kable() %>%
  kable_styling()

	AARate	ThoughtSuic	SerMH	Painkillers	DrugNoM	DrugAny	Binge	DrugDis	AlcDis	AlcUse	Heroin	SubAbuse
AARate	1.00	0.55	0.57	0.42	-0.16	0.00	-0.20	0.05	0.10	-0.22	-0.04	0.04
ThoughtSuic	0.55	1.00	0.80	0.46	0.24	0.31	-0.07	0.31	0.26	-0.02	0.16	0.29
SerMH	0.57	0.80	1.00	0.78	0.18	0.18	-0.21	0.26	0.13	-0.25	0.24	0.18
Painkillers	0.42	0.46	0.78	1.00	-0.01	-0.05	-0.29	0.13	-0.13	-0.35	0.24	-0.03
DrugNoM	-0.16	0.24	0.18	-0.01	1.00	0.83	0.36	0.80	0.57	0.38	0.14	0.68
DrugAny	0.00	0.31	0.18	-0.05	0.83	1.00	0.32	0.81	0.57	0.48	0.22	0.70
Binge	-0.20	-0.07	-0.21	-0.29	0.36	0.32	1.00	0.23	0.77	0.88	0.00	0.70
DrugDis	0.05	0.31	0.26	0.13	0.80	0.81	0.23	1.00	0.52	0.27	0.35	0.69
AlcDis	0.10	0.26	0.13	-0.13	0.57	0.57	0.77	0.52	1.00	0.70	0.01	0.95
AlcUse	-0.22	-0.02	-0.25	-0.35	0.38	0.48	0.88	0.27	0.70	1.00	0.13	0.67
Heroin	-0.04	0.16	0.24	0.24	0.14	0.22	0.00	0.35	0.01	0.13	1.00	0.13
SubAbuse	0.04	0.29	0.18	-0.03	0.68	0.70	0.70	0.69	0.95	0.67	0.13	1.00

6.1.1.4 Suicide Rates vs Serious Mental Illness Rates by State.

Let’s create scatterplots of the 2 variables. We see the same outliers again: Montana, Alaska, and New Mexico, all have high suicide rates, but their rates of mental illness are not all that high.

scatterplot(suic_cor$SerMH*100~suic_cor$AARate, xlab="Serious Mental Illness(%)", ylab="Age Adj Suicide Rate per 100,000", 
            main="Plot 22A. Serious Mental Illness(%) vs Age Adj Suicide Rate, per 100,000")

ggplot(suic_cor, aes(SerMH*100,AARate, color = AARate)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 22B. Serious Mental Illness(%) vs Age Adj Suicide Rate, per 100,000", x="Serious Mental Illness(%)", y="Age Adj Suicide Rate per 100,000")

6.1.2 2017 State Suicide Rate by Gender, Age 15+ Only

We have a chance to compare suicide distribution based on gender.

state2017GA$AARate1<-(state2017GA$Age.Adjusted.Rate.Lower.95..Confidence.Interval+state2017GA$Age.Adjusted.Rate.Upper.95..Confidence.Interval)/2

state2017GA1<-filter(state2017GA,Gender!='')

p <- ggplot(state2017GA1, aes(factor(Gender), AARate1))

p + geom_violin()+ggtitle("Plot 23. Violin Plot. Age Adj Suicide Rate per 100,000, 2017 US States: Female vs Males")+theme(plot.title = element_text(size=10))

Interesting results. Female distribution has much more narrow range, while male one spreads over wider range.

Let’s see how male and female suicide rates correlate.

state2017GA2<-state2017GA1%>%select(state,Gender,AARate1)%>%spread(Gender,AARate1)

cor(state2017GA2$Male, state2017GA2$Female)

## [1] 0.8955711

Correlation is almost perfect! R is 0.9. I find the result very important since it appears that the ratio of male to female suicide varies little from state to state. I have seen a lot of inconsistent results and this is the first consistent number. Delaware appears to be the only outlier. Very low female rate and not very low male. Please remember that CDC classified Delaware female rate as unreliable.

scatterplot(state2017GA2$Male~state2017GA2$Female, xlab="Male suicide rate, Age Adj per 100,000", ylab="Female suicide rate, Age Adj , per 100,00", main="Plot 24. Male vs Female Suicide Rate, per 100,000")

ggplot(state2017GA2, aes(Male,Female, color = Female)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 25. Male vs Female Suicide Rate, per 100,000", x="Male Suicide Rate, per 100,000", y="Female Suicide Rate, per 100,000")

Now let’s run Male/Female data vs SAMHA tables to see if we get some meaningful results. state2017F

6.1.2.1 Male Suicide Rates vs SAMHA Survey

Please remember that SAMHA data is not broken by gender. So, we are forced to compare male population to total.

Let’s check correlation between male suicide rate and other SAMHA’s variables:

Interestingly, that r for male suicide rate vs thoughts of suicide has dropped, but not by a lot. r=0.53. Modest correlation.

Vs Drug Use, Any - no correlation, r=-0.003

Vs Heroin Use - practically no correlation, r=-0.02

Vs Alcohol Use, Binge Drinking - week negative correlation(!), r=-0.24 and r=-0.22. Basically what we are seeing that higher alcohol consumption weakly predicts lower suicide rate. Not what I have expected.

Vs Painkillers Abuse - finally we got some decent correlation. r=0.46.

Vs Serious Mental Illness - bingo! r=0.59. Hard to believe but serious mental illness is a better predictor than serious thoughts of suicide.

state2017MC<-select(state2017M,state, AARate)

suic_r_t_m<-merge(state2017MC, SAMHA_master, by = "state")

suic_r_t_m$state<-NULL

res <- cor(suic_r_t_m)

round(res, 2)%>%
  kable() %>%
  kable_styling()

	AARate	ThoughtSuic	SerMH	Painkillers	DrugNoM	DrugAny	Binge	DrugDis	AlcDis	AlcUse	Heroin	SubAbuse
AARate	1.00	0.53	0.59	0.46	-0.19	-0.03	-0.22	0.02	0.07	-0.24	-0.02	0.02
ThoughtSuic	0.53	1.00	0.80	0.46	0.24	0.31	-0.07	0.31	0.26	-0.02	0.16	0.29
SerMH	0.59	0.80	1.00	0.78	0.18	0.18	-0.21	0.26	0.13	-0.25	0.24	0.18
Painkillers	0.46	0.46	0.78	1.00	-0.01	-0.05	-0.29	0.13	-0.13	-0.35	0.24	-0.03
DrugNoM	-0.19	0.24	0.18	-0.01	1.00	0.83	0.36	0.80	0.57	0.38	0.14	0.68
DrugAny	-0.03	0.31	0.18	-0.05	0.83	1.00	0.32	0.81	0.57	0.48	0.22	0.70
Binge	-0.22	-0.07	-0.21	-0.29	0.36	0.32	1.00	0.23	0.77	0.88	0.00	0.70
DrugDis	0.02	0.31	0.26	0.13	0.80	0.81	0.23	1.00	0.52	0.27	0.35	0.69
AlcDis	0.07	0.26	0.13	-0.13	0.57	0.57	0.77	0.52	1.00	0.70	0.01	0.95
AlcUse	-0.24	-0.02	-0.25	-0.35	0.38	0.48	0.88	0.27	0.70	1.00	0.13	0.67
Heroin	-0.02	0.16	0.24	0.24	0.14	0.22	0.00	0.35	0.01	0.13	1.00	0.13
SubAbuse	0.02	0.29	0.18	-0.03	0.68	0.70	0.70	0.69	0.95	0.67	0.13	1.00

6.1.2.2 Female Suicide Rates vs SAMHA Survey

Please remember that SAMHA data is not broken by gender. So, we are forced to compare female population to total.

Let’s check correlation between female suicide rate and other SAMHA’s variables:

Interestingly, that r for female suicide rate vs thoughts of suicide is the same as for both gender. r=0.57. Modest correlation.

Vs Drug Use, Any - practically no correlation, r=0.11

Vs Heroin Use - practically no correlation, r=-0.04

Vs Alcohol Use and Binge Drinking - week negative correlation, r=-0.19 and r=-0.18.

Vs Painkillers Abuse - finally we got some modest correlation. r=0.32.

Vs Serious Mental Illness - modest correlation, r=0.52. For females, thoughts of suicide is a slightly better predictor than serious mental illness.

state2017FC<-select(state2017F,state, AARate1)

suic_corF<-merge(state2017FC, SAMHA_master, by = "state")

suic_corF$state<-NULL

res <- cor(suic_corF)

round(res, 2)%>%
  kable() %>%
  kable_styling()

	AARate1	ThoughtSuic	SerMH	Painkillers	DrugNoM	DrugAny	Binge	DrugDis	AlcDis	AlcUse	Heroin	SubAbuse
AARate1	1.00	0.57	0.52	0.32	-0.04	0.11	-0.18	0.13	0.14	-0.19	-0.04	0.08
ThoughtSuic	0.57	1.00	0.80	0.46	0.24	0.31	-0.07	0.31	0.26	-0.02	0.16	0.29
SerMH	0.52	0.80	1.00	0.78	0.18	0.18	-0.21	0.26	0.13	-0.25	0.24	0.18
Painkillers	0.32	0.46	0.78	1.00	-0.01	-0.05	-0.29	0.13	-0.13	-0.35	0.24	-0.03
DrugNoM	-0.04	0.24	0.18	-0.01	1.00	0.83	0.36	0.80	0.57	0.38	0.14	0.68
DrugAny	0.11	0.31	0.18	-0.05	0.83	1.00	0.32	0.81	0.57	0.48	0.22	0.70
Binge	-0.18	-0.07	-0.21	-0.29	0.36	0.32	1.00	0.23	0.77	0.88	0.00	0.70
DrugDis	0.13	0.31	0.26	0.13	0.80	0.81	0.23	1.00	0.52	0.27	0.35	0.69
AlcDis	0.14	0.26	0.13	-0.13	0.57	0.57	0.77	0.52	1.00	0.70	0.01	0.95
AlcUse	-0.19	-0.02	-0.25	-0.35	0.38	0.48	0.88	0.27	0.70	1.00	0.13	0.67
Heroin	-0.04	0.16	0.24	0.24	0.14	0.22	0.00	0.35	0.01	0.13	1.00	0.13
SubAbuse	0.08	0.29	0.18	-0.03	0.68	0.70	0.70	0.69	0.95	0.67	0.13	1.00

6.1.3 Study: Colorado vs Minnesota. Where does the difference come from?

Our main research is to understand why such a gap is there. Our collected results should give us final ideas on why we have such a big discrepancy in suicide rates. I have mentioned DC and Montana in the introduction. These 2 states lay on opposite ends of suicide rates; however, comparing these two states is as hard as it gets. Everything is different about them. One is in the Metro area, and another is rural state. Climate is so different, as well as ethnic distribution. Another problem, both states have low population rates. We are facing a typical “apple vs orange” problem. My goal is to find close populations to compare and show if population size plays a significant role in suicide rates. After looking at different factors I decided to compare Colorado and Minnesota. They have comparable population sizes, similar ethic distribution making them both well to do states; however, their suicide rates are significantly different. Minnesota is on the lower end with 13.8 and Colorado is on the higher end with 20.3. Colorado is 47% higher. It does not make any sense to me. Interestingly, later in the project I try to cluster states by SAMHA survey and Minnesota does come out as a state with not a lot of drug, alcohol, and mental health issues, while Colorado falls in mix bag category with some good and some bad indicators. Looking at the details, we can see that Colorado has much higher drug usage than Minnesota. Heroin use is the same for both states (low). Colorado also has higher rate of painkiller abuse, alcohol abuse, suicide thoughts, and mental illness. So, maybe we do have reasons. Let’s look at the data.

col_min<-merge(state15_17C, state15_17M,by=c("Gender","X2013.Urbanization","Race","Hispanic.Origin"))

col_min$ratio<-col_min$AARate.x/col_min$AARate.y

arrange(col_min,desc(ratio))

We can see that 3 populations got over twice as high rate in Colorado vs Minnesota. All their rates are considered reliable by CDC.

We will focus on these 3 groups:

Non-Hispanic Afro-American males, Large Central Metro
Non-Hispanic White females, Small Metro
Non-Hispanic White males, Small Metro

col_min1<-merge(state08_17C, state08_17M,by=c("Gender","X2013.Urbanization","Race","ICD.10.113.Cause.List"))

col_min1$ratio<-col_min1$AARate.x/col_min1$AARate.y

col_min_C1<-filter(col_min1,Gender=='Male',Race=='Black or African American',X2013.Urbanization=='Large Central Metro')

head(arrange(col_min_C1,desc(ratio)))

For our first group we can see that Transport accidents and Alcoholic liver disease are very high for Colorado. Let’s see if these diseases would correlate for other states.

col_min_C2<-filter(col_min1,Gender=='Male',Race=='White',X2013.Urbanization=='Small Metro')

head(arrange(col_min_C2,desc(ratio)))

For our second group we can see that Assault, HIV, and Accidental poisoning are very high for Colorado. Let’s see if these diseases would correlate for other states.

col_min_C3<-filter(col_min1,Gender=='Female',Race=='White',X2013.Urbanization=='Small Metro')

arrange(col_min_C3,desc(ratio))

For our third group we can see that Assault, Viral hepatitis, and Accidental poisoning are very high for Colorado. Let’s see if these diseases would correlate for other states.

6.2 SAMHA State Prevalaince of Drug and Alcohol Use and Mental Issues Data.

6.2.1 Younger vs older population with thoughts of suicide

Correlation between younger and older population with thoughts of suicide by state approaching so called strong correlation (r=0.66).

cor(suicide_tt1$Y18_25, suicide_tt1$Y26P)

## [1] 0.6594445

scatterplot(suicide_tt1$Y18_25*100~suicide_tt1$Y26P,xlab="Age 18-25, %", ylab="Age 26+, %", main="Plot 26. Serious Thoughts of Suicide, rate(%): 18-25 years old vs 26+ years old cohort")

ggplot(suicide_tt1, aes(Y18_25*100,Y26P*100, color = Y18_25)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ ggtitle("Plot 27. Serious thoughts of suicide, rate(%), 18-25 years old vs 26+ years old cohort")+labs(x = "18-25 year old cohort, %", y = "26+ year old cohort, %")

6.2.2 Thoughts of suicide vs Any Mental Illness

Correlation between population with thoughts of suicide and one with any mental illness by state is strong (r=0.74).

suic_amh<-merge(suicide_tt1, amh_dt, by = "state")

cor(suic_amh$Y18P.x, suic_amh$Y18P.y)

## [1] 0.7351503

suic_amh$temp<-suic_amh$Y18P.x*100

suic_amh$temp1<-suic_amh$Y18P.y*100

scatterplot(suic_amh$temp~suic_amh$temp1,xlab="Thoughts of suicide, %", ylab="Any Mental Illness, %", main="Plot 27A. Thoughts of Suicide vs Any Mental Illness, %")

ggplot(suic_amh, aes(temp,temp1, color = Y18P.x)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 27B. Thoughts of Suicide vs Any Mental Illness, %")+labs(x = "18+ year old cohort - Thoughts of suicide, %", y = "18+ year old cohort - Any Mental Illness, %")

6.2.3 Thoughts of suicide vs Serious mental illness

Correlation between population with thoughts of suicide and one with serious mental illness by state is strong (r=0.8).

suic_mh<-merge(suicide_tt1, mh_dt, by = "state")

cor(suic_mh$Y18P.x, suic_mh$Y18P.y)

## [1] 0.7982923

scatterplot(suic_mh$Y18P.x~suic_mh$Y18P.y,xlab="Thoughts of suicide, %", ylab="Serious Mental Illness, %", 
   main="Plot 28. Thoughts of Suicide vs Serious Mental Illness, %")

ggplot(suic_mh, aes(Y18P.x,Y18P.y, color = Y18P.x)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 29. Thoughts of Suicide vs Serious Mental Illness", x="Thoughts of Suicide", y="Serious Mental Illness")

6.2.4 Thoughts of suicide vs Substance Abuse

Correlation between population with thoughts of suicide and one with substance abuse disorder is weak/moderate (r=0.3). Interestingly, Utah has low SA problem, but high level of thoughts of suicide, while DC has drug problem but low level of suicide thoughts. Let’s take out these 2 outliers and see if correlation will change. Indeed, correlation improved to r=0.47. Let’s see if correlation is stronger for younger cohort. Actually, it is not (r=0.46). Not what I have expected. I would like to mention that throughout my analysis Utah and DC will be somewhat outliers. Specifically, Utah, a state with low violence, drug and alcohol use, while the rate of suicide is one of the highest in country.

suic_sa<-merge(suicide_tt1, sa_dt, by = "state")

cor(suic_sa$Y18P.x, suic_sa$Y18P.y)

## [1] 0.2976138

scatterplot(suic_sa$Y18P.x~suic_sa$Y18P.y,xlab="Thoughts of suicide", ylab="Substance Abuse Illness", 
   main="Plot 30. Thoughts of Suicide vs SA Illness")

ggplot(suic_sa, aes(Y18P.x,Y18P.y, color = Y18P.x)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 31. Thoughts of Suicide vs SA Illness", y="Substance Abuse Illness", x="Thougts of suicide")

suic_sa_woo<-filter(suic_sa, !state %in% c('Utah','District of Columbia'))

cor(suic_sa_woo$Y18P.x, suic_sa_woo$Y18P.y)

## [1] 0.4704185

cor(suic_sa_woo$Y18_25.x, suic_sa_woo$Y18_25.y)

## [1] 0.4625329

6.2.5 Thoughts of suicide vs Alcohol Abuse

Weak correlation, between alcohol and thoughts of suicide (r=0.27). Strange. Without DC and Utah, correlation goes up to 0.43. Correlation is stronger for younger cohort - 0.53. For younger people, Alaska is an outline. Alaska has moderate alcohol issues and a lot of thoughts of suicide.

suic_alc<-merge(suicide_tt1, alc_dt, by = "state")

cor(suic_alc$Y18P.x, suic_alc$Y18P.y)

## [1] 0.2730799

scatterplot(suic_alc$Y18P.x~suic_alc$Y18P.y,xlab="Thoughts of suicide", ylab="Alcohol Disorder", 
   main="Plot 32. Thoughts of Suicide vs Alcohol Disorder")

ggplot(suic_alc, aes(Y18P.x,Y18P.y,color=Y18P.x)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 33. Thoughts of Suicide vs Alcohol Disorder", y="Alcohol Illness", x="Thougts of suicide", title="Scatterplot")

suic_alc_woo<-filter(suic_alc, !state %in% c('Utah','District of Columbia'))

cor(suic_alc_woo$Y18P.x, suic_alc_woo$Y18P.y)

## [1] 0.4435853

cor(suic_alc_woo$Y18_25.x, suic_alc_woo$Y18_25.y)

## [1] 0.5315146

ggplot(suic_alc_woo, aes(Y18_25.x,Y18_25.y, color=Y18_25.x)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 34. Thoughts of Suicide vs Alcohol Disorder - 18-25 years old only", y="Alcohol Illness", x="Thougts of suicide", title="Scatterplot")

6.2.6 Thoughts of suicide vs Painkillers Abuse Disorder

Interestingly, this time neither DC nor Utah are outliers. DC has low painkiller problem and low level of suicide thoughts, while Utah got problems with both. R is moderate 0.46.

suic_pain<-merge(suicide_tt1, pain_dt, by = "state")

cor(suic_pain$Y18P.x, suic_pain$Y18P.y)

## [1] 0.45627

scatterplot(suic_pain$Y18P.x~suic_pain$Y18P.y,xlab="Thoughts of suicide", ylab="Painkillers Abuse Disorder", 
   main="Plot 35. Thoughts of Suicide vs Painkillers Abuse Disorder, 18+")

ggplot(suic_pain, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 36. Thoughts of Suicide vs Painkillers Abuse Disorder - 18+ years old", y="Painkillers Abuse Disorder", x="Thougts of suicide", title="Scatterplot")

ggplot(suic_pain, aes(Y18_25.x,Y18_25.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 37. Thoughts of Suicide vs Painkillers Abuse Disorder - 18-25 years old only", y="Painkillers Abuse Disorder", x="Thougts of suicide", title="Scatterplot")

cor(suic_pain$Y18_25.x, suic_pain$Y18_25.y)

## [1] 0.1496557

ggplot(suic_pain, aes(Y26P.x,Y26P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 38. Thoughts of Suicide vs Painkillers Abuse Disorder - 26+ year old only", y="Painkillers Abuse Disorder", x="Thougts of suicide", title="Scatterplot")

cor(suic_pain$Y26P.x, suic_pain$Y26P.y)

## [1] 0.4922438

6.2.7 Thoughts of suicide vs Drug Disorder

Similar picture as many other variables, Utah, DC, and Idaho are outliers. DC has low level of suicide thoughts and high level of drug use, while Utah and Idaho are in opposite situation (Utah does have painkiller abuse problem.)

suic_drug<-merge(suicide_tt1, drug_dt, by = "state")

cor(suic_drug$Y18P.x, suic_drug$Y18P.y)

## [1] 0.314418

scatterplot(suic_drug$Y18P.x~suic_drug$Y18P.y,xlab="Thoughts of suicide", ylab="Drug Disorder", 
   main="Plot 39. Thoughts of Suicide vs Drug Disorder")

ggplot(suic_drug, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ 
  labs(subtitle="Plot 40. Thoughts of Suicide vs Drug Disorder - 18+ year old", y="Drug Abuse Disorder", x="Thougts of suicide", title="Scatterplot")

6.2.8 Thoughts of suicide vs Alcohol Use, Binge Drinking

The first variable that has no correlation at all. Interesting. So, it could be safe to go binge drinking after the project is complete!

suic_binge<-merge(suicide_tt1, alc_usebt, by = "state")

cor(suic_binge$Y18P.x, suic_binge$Y18P.y)

## [1] -0.06411454

scatterplot(suic_binge$Y18P.x~suic_binge$Y18P.y,xlab="Thoughts of suicide", ylab="Binge drinking", 
   main="Plot 41. Thoughts of Suicide vs Binge drinking")

ggplot(suic_binge, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
  labs(subtitle="Plot 42. Thoughts of Suicide vs Binge Drinking - 18+ year old", y="Binge Drinking", x="Thougts of suicide", title="Scatterplot")

6.2.9 Thoughts of suicide vs Alcohol Consumption

No correlation again. Interesting.

suic_alcd<-merge(suicide_tt1, alc_uset, by = "state")

cor(suic_alcd$Y18P.x, suic_alcd$Y18P.y)

## [1] -0.02256482

scatterplot(suic_alcd$Y18P.x~suic_alcd$Y18P.y,xlab="Thoughts of suicide", ylab="Alcohol drinking", 
   main="Plot 43. Thoughts of Suicide vs Alcohol drinking")

ggplot(suic_alcd, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
  labs(subtitle="Plot 44. Thoughts of Suicide vs Alcohol Drinking - 18+ year old", y="Alcohol Drinking", x="Thougts of suicide", title="Scatterplot")

6.2.10 Thoughts of suicide vs Heroin Use

Weak correlation.

suic_drugh<-merge(suicide_tt1, drug_useth, by = "state")

cor(suic_drugh$Y18P.x, suic_drugh$Y18P.y)

## [1] 0.1511048

cor(suic_drugh$Y18_25.x, suic_drugh$Y18_25.y)

## [1] 0.235639

scatterplot(suic_drugh$Y18P.x~suic_drugh$Y18P.y,xlab="Thoughts of suicide", ylab="Heroin use", 
   main="Plot 45. Thoughts of Suicide vs Heroin use")

ggplot(suic_drugh, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
  labs(subtitle="Plot 46. Thoughts of Suicide vs Heroin Use - 18+ year old", y="Heroin Use", x="Thougts of suicide", title="Scatterplot")

6.2.11 Thoughts of suicide vs Drug Use, Any

Weak correlation. R=0.24.

suic_drugu1<-merge(suicide_tt1, drug_uset1, by = "state")

cor(suic_drugu1$Y18P.x, suic_drugu1$Y18P.y)

## [1] 0.2422904

cor(suic_drugu1$Y18_25.x, suic_drugu1$Y18_25.y)

## [1] 0.1737713

scatterplot(suic_drugu1$Y18P.x~suic_drugu1$Y18P.y,xlab="Thoughts of suicide", ylab="Drug use, any", 
   main="Plot 47. Thoughts of Suicide vs Drug use, any")

ggplot(suic_drugu1, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
  labs(subtitle="Plot 48. Thoughts of Suicide vs Drug Use, any - 18+ year old", y="Drug Use, any", x="Thougts of suicide", title="Scatterplot")

6.2.12 Thoughts of suicide vs Drug Use, not Marijuana

The same outliers: DC, Utah, and Idaho. With outliers correlation is 0.31. Without outliers, it goes up to 0.53.

suic_drugu<-merge(suicide_tt1, drug_uset, by = "state")

cor(suic_drugu$Y18P.x, suic_drugu$Y18P.y)

## [1] 0.3126032

cor(suic_drugu$Y18_25.x, suic_drugu$Y18_25.y)

## [1] 0.2029604

suic_drugu_woo<-filter(suic_drugu, !state %in% c('Utah','Idaho', 'District of Columbia'))

cor(suic_drugu_woo$Y18P.x, suic_drugu_woo$Y18P.y)

## [1] 0.5261027

scatterplot(suic_drugu$Y18P.x~suic_drugu$Y18P.y,xlab="Thoughts of suicide", ylab="Drug use, not marijuana", 
   main="Plot 49. Thoughts of Suicide vs Drug use, not marijuana")

ggplot(suic_drugu, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
  labs(subtitle="Plot 50. Thoughts of Suicide vs Drug Use, not marijuana - 18+ year old", y="Drug Use, not marijuana", x="Thougts of suicide", title="Scatterplot")

6.2.13 Linear regression - “Thoughts of suicide”.

Now that we gained insight into mental health and substance abuse drivers behind thoughts of suicide, we can try to build basic Linear Regression. Please note that our main goal is not to model thoughts of suicide; however, it might help us to understand better what drives people to thoughts of suicide and what states are effected the worse.

The model calculates rate of suicide thoughts based on serious mental illness, painkillers abuse rates, and use of any drug.

The adjusted R-squired is 0.72. It is not that great.

SAMHA_masterLR<-SAMHA_master

SAMHA_masterLR$Painkillers2<-SAMHA_masterLR$Painkillers**2

SAMHA_masterLR$Painkillers3<-SAMHA_masterLR$Painkillers**3

SAMHA_masterLR$DrugAny2<-SAMHA_masterLR$DrugAny**2

SAMHA_masterLR$DrugAny3<-SAMHA_masterLR$DrugAny**3

model <- lm( ThoughtSuic~ SerMH+Painkillers3+DrugAny+DrugAny2+DrugAny3, data = SAMHA_masterLR)

model

## 
## Call:
## lm(formula = ThoughtSuic ~ SerMH + Painkillers3 + DrugAny + DrugAny2 + 
##     DrugAny3, data = SAMHA_masterLR)
## 
## Coefficients:
##  (Intercept)         SerMH  Painkillers3       DrugAny      DrugAny2  
##    4.900e-02     9.754e-01    -1.639e+04    -1.032e+00     7.421e+00  
##     DrugAny3  
##   -1.667e+01

summary(model)

## 
## Call:
## lm(formula = ThoughtSuic ~ SerMH + Painkillers3 + DrugAny + DrugAny2 + 
##     DrugAny3, data = SAMHA_masterLR)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0065928 -0.0016519 -0.0000169  0.0021532  0.0059593 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.900e-02  2.203e-02   2.224   0.0312 *  
## SerMH         9.754e-01  1.231e-01   7.921  4.5e-10 ***
## Painkillers3 -1.639e+04  5.682e+03  -2.884   0.0060 ** 
## DrugAny      -1.032e+00  4.888e-01  -2.110   0.0404 *  
## DrugAny2      7.421e+00  3.567e+00   2.081   0.0432 *  
## DrugAny3     -1.667e+01  8.306e+00  -2.007   0.0507 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002902 on 45 degrees of freedom
## Multiple R-squared:  0.7441, Adjusted R-squared:  0.7157 
## F-statistic: 26.18 on 5 and 45 DF,  p-value: 2.734e-12

6.2.13.1 Residual Plot.

myres = resid(model)

plot(SAMHA_masterLR$ThoughtSuic, myres, ylab="Residuals", xlab="Thoughts of Suicide", main="Thoughts of Suicide") 

abline(0, 0)

6.2.14 SAMHA Survey - Clustering.

I will use 4 clusters, which is perfect for me, so I can kind of imitate 4 regions of USA. Let’s if clusters correspond to 4 regions.

SAMHA_masterC<-SAMHA_master

row.names(SAMHA_masterC)<-SAMHA_masterC[,1]

SAMHA_masterC <- scale(SAMHA_masterC[-1])

wss <- (nrow(SAMHA_masterC)-1)*sum(apply(SAMHA_masterC,2,var))

for (i in 2:15) wss[i] <- sum(kmeans(SAMHA_masterC, 
    centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
  ylab="Within groups sum of squares")

fviz_nbclust(SAMHA_masterC, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")

# Elbow method
fviz_nbclust(SAMHA_masterC, kmeans, method = "wss") +
    geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")

fit <- kmeans(SAMHA_masterC, 4) 

aggregate(SAMHA_masterC,by=list(fit$cluster),FUN=mean)

##   Group.1 ThoughtSuic       SerMH  Painkillers    DrugNoM    DrugAny
## 1       1   0.6007879  1.08414059  1.158150114 -0.2595276 -0.5611673
## 2       2  -0.2228678 -0.08459446  0.229540261 -0.3202326 -0.3565953
## 3       3  -0.8861016 -1.19118857 -1.211278204 -0.4452622 -0.3620547
## 4       4   0.9523062  0.61667459  0.003339082  1.4147873  1.6737100
##        Binge    DrugDis     AlcDis     AlcUse     Heroin   SubAbuse
## 1 -1.1437122 -0.2697160 -0.9377813 -1.4026162  0.1236364 -0.8403990
## 2  0.3807154 -0.1843569  0.2039108  0.2981129  0.1228629  0.1670663
## 3 -0.2742555 -0.6022033 -0.5160975 -0.0887141 -0.4732574 -0.6271312
## 4  0.8149568  1.3844227  1.2416685  0.9813414  0.2704450  1.3549502

mydata <- data.frame(SAMHA_masterC, fit$cluster)

distance <- get_dist(SAMHA_masterC)

fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

k2 <- kmeans(SAMHA_masterC, centers = 4, nstart = 25)

str(k2)

## List of 9
##  $ cluster     : Named int [1:51] 4 1 3 4 3 1 1 1 1 3 ...
##   ..- attr(*, "names")= chr [1:51] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ centers     : num [1:4, 1:11] 0.664 -0.178 -0.886 0.489 0.383 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "1" "2" "3" "4"
##   .. ..$ : chr [1:11] "ThoughtSuic" "SerMH" "Painkillers" "DrugNoM" ...
##  $ totss       : num 550
##  $ withinss    : num [1:4] 96.6 59.1 46.2 53.4
##  $ tot.withinss: num 255
##  $ betweenss   : num 295
##  $ size        : int [1:4] 13 14 13 11
##  $ iter        : int 4
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

k2

## K-means clustering with 4 clusters of sizes 13, 14, 13, 11
## 
## Cluster means:
##   ThoughtSuic       SerMH Painkillers    DrugNoM    DrugAny      Binge
## 1   0.6636423  0.38307248  0.01892408  1.1949376  1.3499637  0.6169255
## 2  -0.1779872 -0.06162524  0.26850653 -0.4827015 -0.4555759  0.5522002
## 3  -0.8861016 -1.19118857 -1.21127820 -0.4452622 -0.3620547 -0.2742555
## 4   0.4894356  1.03347842  1.06741020 -0.2716326 -0.5877050 -1.1077739
##      DrugDis     AlcDis     AlcUse     Heroin   SubAbuse
## 1  1.3773553  0.9877644  0.7975831  0.6819409  1.1710608
## 2 -0.5105397  0.2221695  0.4091577 -0.3076858  0.1150226
## 3 -0.6022033 -0.5160975 -0.0887141 -0.4732574 -0.6271312
## 4 -0.2663110 -0.8401857 -1.3585005  0.1449742 -0.7892183
## 
## Clustering vector:
##              Alabama               Alaska              Arizona 
##                    4                    1                    3 
##             Arkansas           California             Colorado 
##                    4                    3                    1 
##          Connecticut             Delaware District of Columbia 
##                    1                    1                    1 
##              Florida              Georgia               Hawaii 
##                    3                    3                    3 
##                Idaho             Illinois              Indiana 
##                    4                    3                    4 
##                 Iowa               Kansas             Kentucky 
##                    2                    2                    4 
##            Louisiana                Maine             Maryland 
##                    2                    2                    3 
##        Massachusetts             Michigan            Minnesota 
##                    1                    2                    3 
##          Mississippi             Missouri              Montana 
##                    4                    2                    1 
##             Nebraska               Nevada        New Hampshire 
##                    2                    1                    1 
##           New Jersey           New Mexico             New York 
##                    3                    3                    3 
##       North Carolina         North Dakota                 Ohio 
##                    4                    2                    2 
##             Oklahoma               Oregon         Pennsylvania 
##                    4                    1                    2 
##         Rhode Island       South Carolina         South Dakota 
##                    1                    2                    2 
##            Tennessee                Texas                 Utah 
##                    4                    3                    4 
##              Vermont             Virginia           Washington 
##                    1                    3                    1 
##        West Virginia            Wisconsin              Wyoming 
##                    4                    2                    2 
## 
## Within cluster sum of squares by cluster:
## [1] 96.55474 59.11845 46.17610 53.43947
##  (between_SS / total_SS =  53.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

fviz_cluster(k2, data = SAMHA_masterC)

So, we have 4 clusters.

I will name these clusters:

“Mix Bag”. Good in some areas and bad in another. DC, Alaska, and Montana are a good example. If you remember DC has the lowest suicide rate in the country, but it has drug/alcohol problems. Alaska, on the other hand, has one of the highest suicide rates, but not much of alcohol or drug abuse. Montana has the highest suicide rate. Colorado is among “Mix Bag”.
“Golden Middle”. These states are in middle. They are not doing bad, but not great either. Many are in Central USA - Iowa, Kansas, Michigan, and North Dakota
“The Best”. States that are doing better than average. Many are from East Coast - NJ, NY, NC, FL, GA, VA, and MD. NJ and NY have one of the lowest suicide rates in country. Minnesota is one “The Best”.
“Bad”. These states got major problems in many areas. Many of them are from Central USA - Utah, Idaho, Oklahoma; and South - Alabama, Mississippi, West Virginia, and so on. Both Utah and Idaho have very high suicide rates.

As we can see, SAMHA clustering is not very useful indicator of suicide rate. High and low suicide rate states are mixed and not well separated.

6.3 DC vs Montana: Suicide Rates

Montana has Native American popualtion with a huge suicide rate.
DC is Large Central Metro Area, while Montana is a mix of Non Metro and Small Metro, which will account for some discrepency. NonMetro/Small Metro vs Large Metro accounts for 25% difference, but not 4 times we see.

If we compare 2 states by type of suicide based on the method. We can see that majority of suicide comited in Montana is done by using firearms, while non-firearm rates for are pretty close. If we compare White non-fireamr rate is 6.9 vs 8.0 - practically indentical. So, it is very likely that avilability of firearms playing a role in driving suicide rates up(Matthew Miller and Hemenway 2002).

So to summarize, we can say that 3 top drivers for much higher suicide rate in Montana:

Hugely high suicide rate among Native American population in Montana
Rural, isolated envorement in Montana
High gun ownership and corresponding high firearm suicide rate in Montana

state15_17<-select(state15_17,state,Race,Hispanic.Origin,Gender,X2013.Urbanization,Population, AARate)%>%filter(state=='Montana'|state=='District of Columbia')

state15_17%>%
  kable() %>%
  kable_styling()

state	Race	Hispanic.Origin	Gender	X2013.Urbanization	Population	AARate
District of Columbia	Black or African American	Not Hispanic or Latino	Female	Large Central Metro	441333	2.8460
District of Columbia	Black or African American	Not Hispanic or Latino	Male	Large Central Metro	349404	13.1180
District of Columbia	White	Not Hispanic or Latino	Male	Large Central Metro	339134	15.3460
Montana	White	Not Hispanic or Latino	Female	Small Metro	411771	12.6575
Montana	American Indian or Alaska Native	Not Hispanic or Latino	Male	Small Metro	14865	79.4025
Montana	White	Not Hispanic or Latino	Male	Small Metro	403880	53.5045
Montana	White	Not Hispanic or Latino	Female	Micropolitan (Nonmetro)	364414	16.8390
Montana	White	Not Hispanic or Latino	Male	Micropolitan (Nonmetro)	369459	48.6820
Montana	American Indian or Alaska Native	Not Hispanic or Latino	Female	NonCore (Nonmetro)	52546	27.9455
Montana	White	Not Hispanic or Latino	Female	NonCore (Nonmetro)	363771	12.6020
Montana	American Indian or Alaska Native	Not Hispanic or Latino	Male	NonCore (Nonmetro)	50796	64.3225
Montana	White	Not Hispanic or Latino	Male	NonCore (Nonmetro)	374454	53.3775

state17_DC_Mo<-select(state17_DC_Mo,State,Race,ICD.10.113.Cause.List,Age.Adjusted.Rate)

state17_DC_Mo %>%
  kable() %>%
  kable_styling()

State	Race	ICD.10.113.Cause.List	Age.Adjusted.Rate
District of Columbia	Black or African American	Intentional self-harm (suicide) by other and unspecified means and their sequelae (*U03,X60-X71,X75-X84,Y87.0)	Unreliable
District of Columbia	White	Intentional self-harm (suicide) by other and unspecified means and their sequelae (*U03,X60-X71,X75-X84,Y87.0)	6.9
Montana	American Indian or Alaska Native	Intentional self-harm (suicide) by discharge of firearms (X72-X74)	Unreliable
Montana	American Indian or Alaska Native	Intentional self-harm (suicide) by other and unspecified means and their sequelae (*U03,X60-X71,X75-X84,Y87.0)	27.5
Montana	White	Intentional self-harm (suicide) by discharge of firearms (X72-X74)	19.6
Montana	White	Intentional self-harm (suicide) by other and unspecified means and their sequelae (*U03,X60-X71,X75-X84,Y87.0)	8.0

7 Conclusion

I feel that I have tried, but I have not gotten very far.Let me summarize what I was able to identify:

More quality data needs to be collected about suicide. Currently suicide related data is spread and often incomplete. We need more centralized research. I have scanned papers related to statistics of suicide and volume is not all that big. More needs to be done.
Suicide is a complex subject. It is multifaceted. It has roots in social, medical, and biological areas. To view it from only one point would not be enough.
Looking at how USA states fare against World countries is depressing. If USA states were independent countries, they would take 5 out of 10 top positions. In addition, suicide rates fall across the world, while here in US they grow every year.
Contributing factors responsible for a huge variance in suicide levels across the US states are: ethnicity, residential composition (Large Metro vs Rural areas), availability of guns, mental illness rate, alcohol abuse rate, and drug abuse. Alcohol induced deaths correlate from moderately with suicide.
Male and female suicide rates by state strongly correlate, so it is possible that drivers behind suicides for both genders are the same. However, male suicide rates have much more variance to them vs females.
Prevalence of serious thoughts of suicide moderately correlate with suicide. Serious mental illness has moderate correlation with suicide rate as well. Next step would be to try to figure out why rates of mental illness/thoughts of suicide fluctuate that much from state to state. Painkiller Abuse Disorder Prevalence correlates with suicides. Considering that problems with painkiller abuse continue to spread it might negatively effect suicide rate in future.
Attempt to create Linear Model of Serious Thoughts of Suicide was not very successful. R squire was only 0.72. The variables used in a model were Serious Mental Illness, Painkiller Abuse Disorder, and Drug Use, Any Drug. There must be additional variables which need to be discovered.
Clustering states based on SAMHSA survey produced some interesting results. It allowed to cluster states into 4 different groups, however these groups would not correspond to state suicide rates well.
Comparing high suicide state (Colorado) and low suicide state (Minnesota) indicates alcohol, drugs, and violence as possible culprits.
DC vs Montana. 3 drivers - very high suicide rate among Montana Native American population, rural/small metro environment in Montana, and high gun ownership and correspondingly high gun suicide rate in Montana.

8 Recommendations

Create centralized public database for suicide data.
Create methods of improved data collection related to suicide.
Try to understand how other countries were able to lower their suicide rates. For example, Japan had a lot of success.
Do more research into connection between mental illness, alcohol, drugs, violence, gun ownership, and suicide.
To see if success of low suicide states could be applied to high suicide states.
To do more research into how people go from thinking to actually commuting suicide.

References

CDC. 2018. About Underlying Cause of Death, 1999-2017. https://wonder.cdc.gov/ucd-icd10.html.

Gerstman, Bud. Unknown. 14: Correlation. http://www.sjsu.edu/faculty/gerstman/StatPrimer/correlation.pdf.

M.Groysman. 2018. Data607_FinalProject_DataFIles. https://github.com/mgroysman/Data607_FinalProject_DataFIles.

Martin, David. 2018. U.S. Navy Admiral Scott Stearney Found Dead in Apparent Suicide. https://www.cbsnews.com/news/scott-stearney-us-navy-admiral-found-dead-apparent-suicide-2018-12-01/.

Matthew Miller, Deborah Azrael, and David Hemenway. 2002. Household Firearm Ownership and Suicide Rates in the United States. https://www.jstor.org/stable/3703934?seq=1/analyze.

Riess, Jana. 2016. Study Shows Link Between Teen Suicide and Mormon Populations. https://religionnews.com/2016/03/12/study-shows-link-teen-suicide-mormon-populations/.

SAMHSA. 2018. 2016-2017 Nsduh State Prevalence Estimates. https://www.samhsa.gov/data/report/2016-2017-nsduh-state-prevalence-estimates.

Wikipedia. 2018. List of Countries by Suicide Rate. https://en.wikipedia.org/wiki/List_of_countries_by_suicide_rate.

Final Project: Suicide Rate of US States: Why Are They So Different?

Mikhail Groysman

December 3, 2018