A few days ago, CDC released mortality data for 2017. Life expectancy has dropped for three years in a row. One of the main causes is an increase in suicide rates. Suicide keeps increasing every year for the past 2 decades and it is hard to pinpoint the exact reasons. One of the main paradoxes is a big difference in suicide rates across the country. The suicide rate for 2017 varies from 6.6 in DC to 28.9 in Montana, which is almost 4.5 times greater. Please note that the rate is age adjusted, so age distribution should not play the role. Why would a person in Montana be 4.5 times more likely to commit suicide than one in DC? If we could answer the following question, there is a greater possibility to reduce suicide rates and prevent such tragedies across our country. To illustrate that suicide is not remote and could only happen to someone unstable and weak, I would like to reference a tragedy that took place while I was working on the project: Top US Navy Admiral on active duty committed suicide(Martin 2018) only 58 years old.
At the end of this project, I expect to come with a list of variables responsible for varying suicide rates across of the states. As mentioned above, I believe that understanding factors behind the variance will help us address the ever-worsening problem of suicide across the country. To come up with answers, I will try to apply both supervised and unsupervised methods, but a primary final goal is to come with a supervised model. Final measure to evaluate our model will be Adjusted R squire.
To cluster states would also be interesting since we can see whether geography plays a role.
We know that gender, age, mental health, substance abuse, viability of guns, and ethnicity play a major role in suicide rates. Other possible factors include but are not limited to altitude, population density, rural status, and specific occupations.
I researched multiple sources and none of them have any raw, detailed data because suicide is a sensitive topic for families; therefore, the government protects the families’ privacy and does not provide individual data for each case; even grouped data is often masked. This is one of the limitations I had to deal with. It was disappointing not to have all the details, but I understand that in the real world, data is not always easily and readily available.
After reviewing multiple sources of potential data, I have chosen two that seem to give me the most insight on the topic:
I have also pulled an additional Wikipedia source for comparison study:
library(readr)
library(stringr)
library(dplyr)
library(tidyr)
library(car)
library(ggplot2)
library(usmap)
library(readxl)
library(ggthemes)
library(XML)
library(RCurl)
library(ggpubr)
library(knitr)
library(kableExtra)
library(factoextra)
library(NbClust)
This is probably as good as it gets when it comes to suicide statistics. It does not have individual records and results with values under 11 are masked (fewer than 11 deaths). However, I am not aware of better alternatives to pull suicide data. From what I have seen, CDC is the main source of data for suicide reporting.
Please note that CDC Wonder does have API; however, it does not allow to pull state level data from API.
I will pull multiple data requests from the system. I have tried to create my own master data file in one pull; however, it did not work for the following reasons:
If data from 2017 is used, it would have a lot of masked results. Expanding the time frame to other years, an issue of data not being up-to data, or current, rises.
Not enough fields can be pulled out by Wonder Query, which were needed for the project. The query only allows you to pull 6 categorical variables, while the prospected goal was all 8 of them.
The first query was Suicide Data by state for year 2017. Data was not limited in any way except for the year (2017) and cause of death (suicide).
state2017<-read.table("C:/Final Project/Underlying Cause of Death, 2017 Suicide State All Race All Ages.txt", header =TRUE, sep = "\t")
state2017$state<-state2017$State
state2017$AARate<-state2017$Age.Adjusted.Rate
arrange(state2017, AARate)
state2017$group1<-"US States"
par(ps = 10, cex = 1, cex.main = 1)
boxplot(state2017$AARate)
title(main="Plot 1. Boxplot. 2017 Suicide Age Adjusted Rate per 100,000 people by State")
ggplot(state2017, aes(x=group1, y=AARate))+geom_violin()+ggtitle("Plot 2. Violin Plot. 2017 Suicide Age Adjusted Rate per 100,000 by State")+theme(plot.title = element_text(size=12))
plot_usmap(data = state2017, values = "AARate", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+
labs(title = "Plot 3. 2017 Suicide Age Adjusted Rate per 100,000 by State")
Montana has the highest rate, DC has the lowest. Montana seems to be an outlier.
The second query used was 2017 Suicide Data comparing state and gender. Data was limited to one year (2017), cause of death (suicide), and age (15+). Please note that we encountered unreliable data according to CDC. The data is not big enough to be considered reliable for females in DC and Delaware. For simplicity we will use the actual rate for these states. The number we use here is not for our final analysis, so we can get away with some “cheating”. If we really feel that including these 2 states will significantly skew our data (I do not feel that way), then we have 2 solutions to the problem: one is to drop these 2 states from our analysis, or pull a few more years of data, instead of using just 2017.
state2017GA<-read.table("C:/Final Project/Underlying Cause of Death, 2017Suicide State Gender 15+.txt", header =TRUE, sep = "\t")
state2017GA$state<-state2017GA$State
state2017GA$AARate<-as.numeric(as.character(state2017GA$Age.Adjusted.Rate))
arrange(state2017GA, AARate)
state2017GA$group1<-"US States"
state2017M<-filter(state2017GA, Gender=='Male')
arrange(state2017M, AARate)
par(ps = 10, cex = 1, cex.main = 1)
boxplot(state2017M$AARate)
title(main="Plot 4. Boxplot. 2017 Suicide Age Adjusted Rate per 100,000 by State - Male, Age 15+")
ggplot(state2017M, aes(x=group1, y=AARate))+geom_violin()+ggtitle("Plot 5. Violin plot. 2017 Suicide Age Adjusted Rate per 100,000 by State - Male, Age 15+")+theme(plot.title = element_text(size=11))
plot_usmap(data = state2017M, values = "AARate", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+
labs(title = "Plot 6. 2017 Suicide Age Adjusted Rate per 100,000 by State - Male, Age 15+")
state2017F<-filter(state2017GA, Gender=='Female')
state2017F$AARate1<-(state2017F$Age.Adjusted.Rate.Lower.95..Confidence.Interval+state2017F$Age.Adjusted.Rate.Upper.95..Confidence.Interval)/2
arrange(state2017F, AARate1)
par(ps = 10, cex = 1, cex.main = 1)
boxplot(state2017F$AARate1)
title(main="Plot 7. Boxplot. 2017 Suicide Age Adjusted Rate per 100,000 by State - Female, Age 15+")
ggplot(state2017F, aes(x=group1, y=AARate1))+geom_violin()+ggtitle("Plot 8. Violin plot. 2017 Suicide Age Adjusted Rate per 100,000 by State - Female, Age 15+")+theme(plot.title = element_text(size=10))
plot_usmap(data = state2017F, values = "AARate1", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+
labs(title = "Plot 9. 2017 Suicide Age Adjusted Rate per 100,000 by State - Female, Age 15+")
For male, Montana has the highest rate, DC has the lowest again as well. Montana, Wyoming, and Alaska seem to be outliers.
For females, Delaware has the lowest rate, but data is unreliable. If we only use reliable data, then NY could be proud: tt has the lowest rate based on reliable data. Montana, on the other hand, has the highest rate.
The third query used was 2015-2017 Suicide Data by state, gender, ethnicity, and metro residence. Data was limited to 3 years (2015-2017), cause of death (suicide), and age (15+). I have pulled 3 years of data to smooth it and to avoid unreliability label. I have still got some. We got a lot of detailed information from the data since it includes multiple aspects like gender, residency status, and ethnicity; therefore, it brings us one step closer to answering our question.
state15_17<-read.table("C:/Final Project/Underlying Cause of Death, 2015-2017 Suicide State Ethnicity Gender Metro 15+.txt", header =TRUE, sep = "\t")
state15_17$state<-state15_17$State
state15_17$pop<-as.numeric(as.character(state15_17$Population))
state15_17$relFlag<-if (state15_17$Age.Adjusted.Rate=='Unreliable') {'U'} else {'R'}
state15_17$AARate<-(as.numeric(as.character(state15_17$Age.Adjusted.Rate.Lower.95..Confidence.Interval))+as.numeric(as.character(state15_17$Age.Adjusted.Rate.Upper.95..Confidence.Interval)))/2
arrange(state15_17, AARate)
state15_17$group1<-"US States"
state15_17C<-filter(state15_17, state=='Colorado')%>%select(state,Gender,X2013.Urbanization,AARate,Race,Hispanic.Origin,pop,relFlag)
state15_17M<-filter(state15_17, state=='Minnesota')%>%select(state,Gender,X2013.Urbanization,AARate,Race,Hispanic.Origin,pop,relFlag)
state15_17WM<-filter(state15_17, Race=='White'&Gender=='Male'&Hispanic.Origin=='Not Hispanic or Latino')
state15_17WM$prod<-state15_17WM$AARate*state15_17WM$pop
state15_17WM1<-select(state15_17WM,prod,pop,state)
state15_17WM1<-aggregate(.~state, data=state15_17WM1, FUN=sum)
state15_17WM1$AARate<-state15_17WM1$prod/state15_17WM1$pop
state15_17WF<-filter(state15_17, Race=='White'&Gender=='Female'&Hispanic.Origin=='Not Hispanic or Latino')
state15_17WF$prod<-state15_17WF$AARate*state15_17WF$pop
state15_17WF1<-select(state15_17WF,prod,pop,state)
state15_17WF1<-aggregate(.~state, data=state15_17WF1, FUN=sum)
state15_17WF1$AARate<-state15_17WF1$prod/state15_17WF1$pop
state15_17BM<-filter(state15_17, Race=='Black or African American'&Gender=='Male'&Hispanic.Origin=='Not Hispanic or Latino')
state15_17BM$prod<-state15_17BM$AARate*state15_17BM$pop
state15_17BM1<-select(state15_17BM,prod,pop,state)
state15_17BM1<-aggregate(.~state, data=state15_17BM1, FUN=sum)
state15_17BM1$AARate<-state15_17BM1$prod/state15_17BM1$pop
state15_17BF<-filter(state15_17, Race=='Black or African American'&Gender=='Female'&Hispanic.Origin=='Not Hispanic or Latino')
state15_17BF$prod<-state15_17BF$AARate*state15_17BF$pop
state15_17BF1<-select(state15_17BF,prod,pop,state)
state15_17BF1<-aggregate(.~state, data=state15_17BF1, FUN=sum)
state15_17BF1$AARate<-state15_17BF1$prod/state15_17BF1$pop
state15_17HM<-filter(state15_17, Gender=='Male'&Hispanic.Origin=='Hispanic or Latino')
state15_17HM$prod<-state15_17HM$AARate*state15_17HM$pop
state15_17HM1<-select(state15_17HM,prod,pop,state)
state15_17HM1<-aggregate(.~state, data=state15_17HM1, FUN=sum)
state15_17HM1$AARate<-state15_17HM1$prod/state15_17HM1$pop
state15_17HF<-filter(state15_17, Race=='Black or African American'&Gender=='Female'&Hispanic.Origin=='Not Hispanic or Latino')
state15_17HF$prod<-state15_17HF$AARate*state15_17HF$pop
state15_17HF1<-select(state15_17HF,prod,pop,state)
state15_17HF1<-aggregate(.~state, data=state15_17HF1, FUN=sum)
state15_17HF1$AARate<-state15_17HF1$prod/state15_17HF1$pop
state15_17%>%select(state,Race,Hispanic.Origin,Gender,X2013.Urbanization,Population, AARate)%>%filter(state=='Montana'|state=='District of Columbia')
The third query I have pulled was 2015-2017 Mortality Data by state, gender, ethnicity, metro residence, and drug induced indicator. Data was limited to 3 years (2015-2017) and age (15+). I have pulled 3 years of data to smooth it and to avoid unreliability label.
state15_17DI<-read.table("C:/Final Project/Underlying Cause of Death, 2015-2017 Drug Induced.txt", header =TRUE, sep = "\t")
state15_17DI$state<-state15_17DI$State
state15_17DI
state15_17DI$pop<-as.numeric(as.character(state15_17DI$Population))
state15_17DI$relFlag<-if (state15_17DI$Age.Adjusted.Rate=='Unreliable') {'U'} else {'R'}
state15_17DI$AARate<-(as.numeric(as.character(state15_17DI$Age.Adjusted.Rate.Lower.95..Confidence.Interval))+as.numeric(as.character(state15_17DI$Age.Adjusted.Rate.Upper.95..Confidence.Interval)))/2
arrange(state15_17DI, state)
state15_17DI$group1<-"US States"
state15_17WMDI<-filter(state15_17DI, Race=='White'&Gender=='Male'&Hispanic.Origin=='Not Hispanic or Latino')
state15_17WMDI$prod<-state15_17WMDI$AARate*state15_17WMDI$pop
state15_17WMDI1<-select(state15_17WMDI,prod,pop,state,Drug.Alcohol.Induced)
state15_17WMDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17WMDI1, FUN=sum)
state15_17WMDI1$AARate1<-state15_17WMDI1$prod/state15_17WMDI1$pop
state15_17WMDI2<-select(state15_17WMDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)
colnames(state15_17WMDI2)<-c('state','Alcohol','Other','Drug')
state15_17WMM<-merge(state15_17WM1,state15_17WMDI2,by='state')
state15_17WMC<-select(state15_17WMM,AARate, Alcohol, Other, Drug)
state15_17WMM
cor(state15_17WMC)
state15_17WFDI<-filter(state15_17DI, Race=='White'&Gender=='Female'&Hispanic.Origin=='Not Hispanic or Latino')
state15_17WFDI$prod<-state15_17WFDI$AARate*state15_17WFDI$pop
state15_17WFDI1<-select(state15_17WFDI,prod,pop,state,Drug.Alcohol.Induced)
state15_17WFDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17WFDI1, FUN=sum)
state15_17WFDI1$AARate1<-state15_17WFDI1$prod/state15_17WFDI1$pop
state15_17WFDI2<-select(state15_17WFDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)
colnames(state15_17WFDI2)<-c('state','Alcohol','Other','Drug')
state15_17WFM<-merge(state15_17WF1,state15_17WFDI2,by='state')
state15_17WFC<-select(state15_17WFM,AARate, Alcohol, Other, Drug)
state15_17WFM
cor(state15_17WFC)
state15_17BMDI<-filter(state15_17DI, Race=='Black or African American'&Gender=='Male'&Hispanic.Origin=='Not Hispanic or Latino')
state15_17BMDI
state15_17BMDI$prod<-state15_17BMDI$AARate*state15_17BMDI$pop
state15_17BMDI1<-select(state15_17BMDI,prod,pop,state,Drug.Alcohol.Induced)
state15_17BMDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17BMDI1, FUN=sum)
state15_17BMDI1$AARate1<-state15_17BMDI1$prod/state15_17BMDI1$pop
state15_17BMDI2<-select(state15_17BMDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)
colnames(state15_17BMDI2)<-c('state','Alcohol','Other','Drug')
state15_17BMDI2<-filter(state15_17BMDI2,!is.na(Alcohol)&!is.na(Drug))
state15_17BMM<-merge(state15_17BM1,state15_17BMDI2,by='state')
state15_17BMC<-select(state15_17BMM,AARate, Alcohol, Other, Drug)
state15_17BMM
cor(state15_17BMC)
state15_17BFDI<-filter(state15_17DI, Race=='Black or African American'&Gender=='Female'&Hispanic.Origin=='Not Hispanic or Latino')
state15_17BFDI$prod<-state15_17BFDI$AARate*state15_17BFDI$pop
state15_17BFDI1<-select(state15_17BFDI,prod,pop,state,Drug.Alcohol.Induced)
state15_17BFDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17BFDI1, FUN=sum)
state15_17BFDI1$AARate1<-state15_17BFDI1$prod/state15_17BFDI1$pop
state15_17BFDI2<-select(state15_17BFDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)
colnames(state15_17BFDI2)<-c('state','Alcohol','Other','Drug')
state15_17BMDI2<-filter(state15_17BMDI2,!is.na(Alcohol)&!is.na(Drug))
state15_17BFM<-merge(state15_17BF1,state15_17BFDI2,by='state')
state15_17BFC<-select(state15_17BFM,AARate, Alcohol, Other, Drug)
state15_17BFM
cor(state15_17BFC)
state15_17HMDI<-filter(state15_17DI, Gender=='Male'&Hispanic.Origin=='Hispanic or Latino')
state15_17HMDI
state15_17HMDI$prod<-state15_17HMDI$AARate*state15_17HMDI$pop
state15_17HMDI1<-select(state15_17HMDI,prod,pop,state,Drug.Alcohol.Induced)
state15_17HMDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17HMDI1, FUN=sum)
state15_17HMDI1$AARate1<-state15_17HMDI1$prod/state15_17HMDI1$pop
state15_17HMDI2<-select(state15_17HMDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)
colnames(state15_17HMDI2)<-c('state','Alcohol','Other','Drug')
state15_17HMDI2<-filter(state15_17HMDI2,!is.na(Alcohol)&!is.na(Drug))
state15_17HMM<-merge(state15_17HM1,state15_17HMDI2,by='state')
state15_17HMC<-select(state15_17HMM,AARate, Alcohol, Other, Drug)
state15_17HMM
cor(state15_17HMC)
state15_17HFDI<-filter(state15_17DI, Gender=='Female'&Hispanic.Origin=='Hispanic or Latino')
state15_17HFDI$prod<-state15_17HFDI$AARate*state15_17HFDI$pop
state15_17HFDI1<-select(state15_17HFDI,prod,pop,state,Drug.Alcohol.Induced)
state15_17HFDI1<-aggregate(.~state+Drug.Alcohol.Induced, data=state15_17HFDI1, FUN=sum)
state15_17HFDI1$AARate1<-state15_17HFDI1$prod/state15_17HFDI1$pop
state15_17HFDI2<-select(state15_17HFDI1,state,Drug.Alcohol.Induced,AARate1)%>%spread(Drug.Alcohol.Induced,AARate1)
colnames(state15_17HFDI2)<-c('state','Alcohol','Other','Drug')
state15_17HMDI2<-filter(state15_17HMDI2,!is.na(Alcohol)&!is.na(Drug))
state15_17HFM<-merge(state15_17HF1,state15_17HFDI2,by='state')
state15_17HFC<-select(state15_17HFM,AARate, Alcohol, Other, Drug)
state15_17HFM
cor(state15_17HFC)
state15_17DI%>%select(state,Race,Hispanic.Origin,Gender, Drug.Alcohol.Induced, AARate)%>%filter(state=='Montana'|state=='District of Columbia')
For White Non-Hispanic Male, Alcohol induced causes correlate with r=0.57.
For White Non-Hispanic Female Alcohol induced causes correlate with r=0.65.
For Black Non-Hispanic Male Alcohol induced causes correlate with r=0.43.
For Black Non-Hispanic Female Alcohol induced causes correlate with r=0.51.
For Hispanic Male Alcohol induced causes correlate with r=0.59.
For Hispanic Female Alcohol induced causes correlate with r=NA.
The fourth query was 2008-2017 Suicide Data for Colorado and Minnesota by gender, ethnicity, metro residence, and ICD10 groups. Data was limited to 10 years (2008-2017) and age (15+). Please note that all disease was pulled since suicide was not the main aspect of the given data. I have pulled 10 years to get the most reliable data.
state08_17<-read.table("C:/Final Project/Underlying Cause of Death, 2008-2017 Col Min by ICD.txt", header =TRUE, sep = "\t")
state08_17$state<-state08_17$State
state08_17$pop<-as.numeric(as.factor(state08_17$Population))
state08_17$relFlag<-if (state08_17$Age.Adjusted.Rate=='Unreliable') {'U'} else {'R'}
state08_17$AARate<-(as.numeric(as.character(state08_17$Age.Adjusted.Rate.Lower.95..Confidence.Interval))+as.numeric(as.character(state08_17$Age.Adjusted.Rate.Upper.95..Confidence.Interval)))/2
arrange(state08_17, AARate)
state08_17$group1<-"US States"
state08_17<-select(state08_17,state,Gender,X2013.Urbanization,AARate,Race,pop,relFlag,ICD.10.113.Cause.List)
state08_17M<-filter(state08_17,state=='Minnesota')
state08_17C<-filter(state08_17,state=='Colorado')
The fifth query was 2008-2017 Mortality Data for all sates by gender, ethnicity, and ICD10 groups. Data was limited to 10 years (2008-2017), ethnicity, and Metro areas. Please note that all disease was pulled since suicide was not only of mention. To get most reliable results data from 10 years ago was used.
state08_17a<-read.table("C:/Final Project/Underlying Cause of Death, 2008-2017 Metro Non Hispanic 15+.txt", header =TRUE, sep = "\t")
#state08_17a<-filter(state08_17a,Crude.Rate!='Unreliable')
state08_17a$state<-state08_17a$State
state08_17a$pop<-as.numeric(as.factor(state08_17a$Population))
state08_17a$AARate<-(as.numeric(as.character(state08_17a$Age.Adjusted.Rate.Lower.95..Confidence.Interval))+as.numeric(as.character(state08_17a$Age.Adjusted.Rate.Upper.95..Confidence.Interval)))/2
state08_17a$group1<-"US States"
state08_17a<-select(state08_17a,state,Gender,AARate,Race,pop,ICD.10.113.Cause.List)
state08_17b<-spread(state08_17a,ICD.10.113.Cause.List,AARate)
state08_17b
state08_17bMB<-filter(state08_17b,Gender=='Male',Race=='Black or African American')
state08_17bMB
state08_17bMBC<-state08_17bMB[-c(1:5)]
state08_17bMBC<-state08_17bMBC[, -which(colMeans(is.na(state08_17bMBC)) > 0.25)]
state08_17bMBC
colnames(state08_17bMBC)[13] <- "Suicide"
state08_17bMBC<-filter(state08_17bMBC,!is.na(Suicide))
cor(as.matrix(state08_17bMBC[,13]), as.matrix(state08_17bMBC[,-13]), use = "pairwise.complete.obs")
state08_17bMW<-filter(state08_17b,Gender=='Male',Race=='White')
state08_17bMW
state08_17bMWC<-state08_17bMW[-c(1:5)]
state08_17bMWC<-state08_17bMWC[, -which(colMeans(is.na(state08_17bMWC)) > 0.1)]
state08_17bMWC
colnames(state08_17bMWC)[19] <- "Suicide"
state08_17bMBC<-filter(state08_17bMWC,!is.na(Suicide))
cor(as.matrix(state08_17bMWC[,19]), as.matrix(state08_17bMWC[,-11]), use = "pairwise.complete.obs")
The 6th query was 2017 Suicide Data for DC and Montana by ethnicity and ICD10 groups. Data was limited to 2017 and DC and Montana only.
state17_DC_Mo<-read.table("C:/Final Project/Underlying Cause of Death, 2017 Monata vs DC.txt", header =TRUE, sep = "\t")
This data does not provide suicide rates; however, it does provide statistics on alcohol, drug, and mental health by state. It should help us understand which states have issues with substance abuse and mental health. The data is provided in Excel format. I have loaded the file on my GitHub account(M.Groysman 2018). The file has multiple tabs. We do not need all of them, so we will only load the ones we need:
Suicide thought->Suicide Planning->Suicide Attempt->Suicide(in case of successful attempt)
drug_use <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 1")
drug_use <- drug_use[6:62, ]
colnames(drug_use) <- c("Index", "state", "Y12P", "Y12P_LE", "Y12P_HE", "Y12_17",
"Y12_17_LE", "Y12_17_HE", "Y18_25", "Y18_25_LE", "Y18_25_HE", "Y26P", "Y26P_LE",
"Y26P_HE", "Y18P", "Y18P_LE", "Y18P_HE")
drug_uset <- transform(drug_use, Y12P = as.numeric(Y12P), Y18P = as.numeric(Y18P),
Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))
drug_uset %>% arrange(desc(Y12P))
drug_use1 <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 6")
drug_use1<-drug_use1[6:62,]
colnames(drug_use1)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")
drug_uset1<-transform(drug_use1,Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))
drug_uset1%>%arrange(desc(Y12P))
drug_useh <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 9")
drug_useh<-drug_useh[6:62,]
colnames(drug_useh)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")
drug_useth<-transform(drug_useh, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))
drug_useth%>%arrange(desc(Y12P))
alc_use <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 13")
alc_use<-alc_use[6:62,]
colnames(alc_use)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")
alc_uset<-transform(alc_use, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))
alc_uset%>%arrange(desc(Y12P))
alc_useb <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 14")
alc_useb<-alc_useb[6:62,]
colnames(alc_useb)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")
alc_usebt<-transform(alc_useb, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))
alc_usebt%>%arrange(desc(Y12P))
drug_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 20")
drug_d<-drug_d[6:62,]
colnames(drug_d)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")
drug_dt<-transform(drug_d, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))
drug_dt%>%arrange(desc(Y12P))
pain_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 21")
pain_d<-pain_d[6:62,]
colnames(pain_d)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")
pain_dt<-transform(pain_d, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25), Y26P = as.numeric(Y26P))
pain_dt%>%arrange(desc(Y12P))
alc_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 22")
alc_d<-alc_d[6:62,]
colnames(alc_d)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")
alc_dt<-transform(alc_d, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25))
alc_dt%>%arrange(desc(Y12P))
sa_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 23")
sa_d<-sa_d[6:62,]
colnames(sa_d)<-c("Index","state","Y12P","Y12P_LE","Y12P_HE","Y12_17","Y12_17_LE","Y12_17_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE","Y18P","Y18P_LE","Y18P_HE")
sa_dt<-transform(sa_d, Y12P = as.numeric(Y12P),Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25))
sa_dt%>%arrange(desc(Y12P))
mh_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 27")
mh_d<-mh_d[6:62,]
colnames(mh_d)<-c("Index","state","Y18P","Y18P_LE","Y18P_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE")
mh_dt<-transform(mh_d, Y18P = as.numeric(Y18P))
mh_dt%>%arrange(desc(Y18P))
amh_d <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 28")
amh_d<-amh_d[6:62,]
colnames(amh_d)<-c("Index","state","Y18P","Y18P_LE","Y18P_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE")
amh_dt<-transform(amh_d, Y18P = as.numeric(Y18P))
amh_dt%>%arrange(desc(Y18P))
suicide_t <- read_excel("C:/Final Project/NSDUHsaeExcelTabs2017.xlsx", sheet = "Table 30")
suicide_t<-suicide_t[6:62,]
colnames(suicide_t)<-c("Index","state","Y18P","Y18P_LE","Y18P_HE","Y18_25","Y18_25_LE","Y18_25_HE","Y26P","Y26P_LE","Y26P_HE")
suicide_t%>%filter(!is.na(Index))%>%filter(!state %in% c('Total U.S.','Midwest','West','South','Northeast'))
suicide_tt<-transform(suicide_t, Y18P = as.numeric(Y18P),Y18_25 = as.numeric(Y18_25),Y26P = as.numeric(Y26P))
suicide_tt1<-filter(suicide_tt,!is.na(Y18P)&substr(suicide_tt$state,1,3)!='Tot')%>%arrange(desc(Y18P))
arrange(suicide_tt1,desc(Y18P))
suicide_tt1$group1<-"US States"
par(ps = 10, cex = 1, cex.main = 1)
boxplot(suicide_tt1$Y18P*100)
title(main="Plot 10. Boxplot. Serious thoughts of suicide, % rate among 18+ cohort")
ggplot(suicide_tt1, aes(x=group1, y=Y18P*100))+geom_violin()+ggtitle("Plot 11. Violin plot. Serious thoughts of suicide, % rate among 18+ cohort")+theme(plot.title = element_text(size=13))
plot_usmap(data = suicide_tt1, values = "Y18P", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+
labs(title = "Plot 12. Serious thoughts of suicide, % rate among 18+ cohort")
Utah has the highest % of population with thoughts of suicide, NJ had the lowest. Utah seems to be an outline. Interestingly, Utah has high but not highest rate of suicide among US states. Alaska and Montana have higher rates.
Among young people, Alaska is slightly ahead of Utah in suicide thoughts. Apparently, Utah has teen suicide crisis(Riess 2016).
suicide_tt1%>%arrange(desc(Y18_25))
par(ps = 10, cex = 1, cex.main = 1)
boxplot(suicide_tt1$Y18_25*100)
title(main="Plot 13. Boxplot. Serious thoughts of suicide, % rate among 18-25 year old cohort")
ggplot(suicide_tt1, aes(x=group1, y=Y18_25*100)) +geom_violin()+ggtitle("Plot 14. Serious thoughts of suicide, % rate among 18-25 year old cohort")+theme(plot.title = element_text(size=10))
plot_usmap(data = suicide_tt1, values = "Y18_25", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+
labs(title = "Plot 15. Serious thoughts of suicide, % rate among 18-25 year old cohort")
Both Alaska and Utah seem to be outliers. Among older people (26+), Idaho, state adjutant to Utah, has the highest number with suicide thoughts, while NJ again has the lowest number.
suicide_tt1%>%arrange(desc(Y26P))
par(ps = 10, cex = 1, cex.main = 1)
boxplot(suicide_tt1$Y26P*100)
title(main="Plot 16. Serious thoughts of suicide, rate among 25+ year old cohort")
ggplot(suicide_tt1, aes(x=group1, y=Y26P*100))+geom_violin()+ggtitle("Plot 17. Serious thoughts of suicide, % rate among 25+ year old cohort")+theme(plot.title = element_text(size=10))
plot_usmap(data = suicide_tt1, values = "Y26P", theme = theme_map(), labels = TRUE, label_color = "white", lines = "red")+
labs(title = "Plot 18. Serious thoughts of suicide, % rate among 25+ year old cohort")
Let’s put a SAMHA master table to include all separate pieces together
drug_usetA<-select(drug_uset,"state","Y18P")
colnames(drug_usetA)<-c("state","DrugAny")
drug_uset1A<-select(drug_uset1,"state","Y18P")
colnames(drug_uset1A)<-c("state","DrugNoM")
drug_useth1<-select(drug_useth,"state","Y18P")
colnames(drug_useth1)<-c("state","Heroin")
alc_uset1<-select(alc_uset,"state","Y18P")
colnames(alc_uset1)<-c("state","AlcUse")
alc_usebt1<-select(alc_usebt,"state","Y18P")
colnames(alc_usebt1)<-c("state","Binge")
drug_dt1<-select(drug_dt,"state","Y18P")
colnames(drug_dt1)<-c("state","DrugDis")
pain_dt1<-select(pain_dt,"state","Y18P")
colnames(pain_dt1)<-c("state","Painkillers")
alc_dt1<-select(alc_dt,"state","Y18P")
colnames(alc_dt1)<-c("state","AlcDis")
sa_dt1<-select(sa_dt,"state","Y18P")
colnames(sa_dt1)<-c("state","SubAbuse")
suicide_tt2<-select(suicide_tt1,"state","Y18P")
colnames(suicide_tt2)<-c("state","ThoughtSuic")
mh_dt1<-select(mh_dt,"state","Y18P")
colnames(mh_dt1)<-c("state","SerMH")
SAMHA_master<-Reduce(function(x,y) merge(x = x, y = y, by = "state"), list(suicide_tt2, mh_dt1, pain_dt1, drug_uset1A, drug_usetA, alc_usebt1,drug_dt1, alc_dt1, alc_uset1, drug_useth1, sa_dt1))
SAMHA_master<-SAMHA_master%>%filter(!state %in% c('Total U.S.','Midwest','West','South','Northeast'))
wikiURL<-"https://en.wikipedia.org/wiki/List_of_countries_by_suicide_rate"
wikiData <- getURL(wikiURL)
countries <- readHTMLTable(wikiData)
countries1<-countries[[3]]
countries2<-countries1[-1,]
colnames(countries2)<-c("Rank","Country", "AgeAdjRateT","a","b","c","d","e")
countries2$temp<-as.character(countries2$AgeAdjRateT)
countries2$AgeAdjRate<-as.numeric(countries2$temp)
countries2<-countries2%>%select("Rank","Country","AgeAdjRate")
countries2$Country<-gsub("\\(more info\\)", "", as.character(countries2$Country))
head(countries2)%>%
kable() %>%
kable_styling()
| Rank | Country | AgeAdjRate | |
|---|---|---|---|
| 2 | 1 | Guyana | 30.2 |
| 3 | 2 | Lesotho | 28.9 |
| 4 | 3 | Russia | 26.5 |
| 5 | 4 | Lithuania | 25.7 |
| 6 | 5 | Suriname | 23.2 |
| 7 | 6 | Cote d’Ivoire | 23.0 |
Correlation classification(Gerstman Unknown):
0 < |r| < .3 weak correlation .3 < |r| < .7 moderate correlation |r| > 0.7 strong correlation
Let’s check for normality. The distribution of our data does somewhat resemble to normal. Montana creates a long tail.
ggdensity(state2017$AARate,
main = "Plot 19. Density plot of 2017 Age Adj Suicide Rate per 100,000 by State",
xlab = "States")
Next let’s merge US State data into Wiki Country data to see how US States would rank if they were independent countries.
stateWiki<-select(state2017,State,AARate)
colnames(stateWiki)<-c("Country","AgeAdjRate")
stateWiki$USflag<-"Y"
countries3<-select(countries2,Country, AgeAdjRate)
countries3$USflag<-"N"
countryState <- rbind(countries3, stateWiki)
countryState<-arrange(countryState,desc(AgeAdjRate))
countryState$rank <- NA
countryState$rank<- 1:nrow(countryState)
countryState<-select(countryState,rank, Country, USflag, AgeAdjRate)
countryState %>%
kable() %>%
kable_styling()
| rank | Country | USflag | AgeAdjRate |
|---|---|---|---|
| 1 | Guyana | N | 30.200 |
| 2 | Montana | Y | 28.943 |
| 3 | Lesotho | N | 28.900 |
| 4 | Alaska | Y | 27.014 |
| 5 | Wyoming | Y | 26.940 |
| 6 | Russia | N | 26.500 |
| 7 | Lithuania | N | 25.700 |
| 8 | New Mexico | Y | 23.300 |
| 9 | Suriname | N | 23.200 |
| 10 | Idaho | Y | 23.168 |
| 11 | Cote d’Ivoire | N | 23.000 |
| 12 | Kazakhstan | N | 22.800 |
| 13 | Utah | Y | 22.691 |
| 14 | South Dakota | Y | 22.517 |
| 15 | Equatorial Guinea | N | 22.000 |
| 16 | Belarus | N | 21.400 |
| 17 | West Virginia | Y | 21.064 |
| 18 | Arkansas | Y | 20.800 |
| 19 | Colorado | Y | 20.339 |
| 20 | Nevada | Y | 20.285 |
| 21 | South Korea | N | 20.200 |
| 22 | North Dakota | Y | 20.083 |
| 23 | Ugandab | N | 20.000 |
| 24 | Cameroon | N | 19.500 |
| 25 | Kansas | Y | 19.131 |
| 26 | Oklahoma | Y | 19.129 |
| 27 | Zimbabwe | N | 19.100 |
| 28 | Oregon | Y | 18.998 |
| 29 | New Hampshire | Y | 18.898 |
| 30 | Maine | Y | 18.874 |
| 31 | Ukraine | N | 18.500 |
| 32 | Missouri | Y | 18.487 |
| 33 | Vermont | Y | 18.310 |
| 34 | Arizona | Y | 18.158 |
| 35 | Nigeria | N | 17.300 |
| 36 | Latvia | N | 17.200 |
| 37 | Kentucky | Y | 16.920 |
| 38 | Washington | Y | 16.909 |
| 39 | Tennessee | Y | 16.783 |
| 40 | Eswatini | N | 16.700 |
| 41 | Alabama | Y | 16.628 |
| 42 | Togo | N | 16.600 |
| 43 | India | N | 16.500 |
| 44 | Uruguay | N | 16.500 |
| 45 | Indiana | Y | 16.330 |
| 46 | South Carolina | Y | 16.297 |
| 47 | Sierra Leone | N | 16.100 |
| 48 | Benin | N | 15.700 |
| 49 | Belgium | N | 15.700 |
| 50 | Chad | N | 15.500 |
| 51 | Wisconsin | Y | 15.385 |
| 52 | Louisiana | Y | 15.225 |
| 53 | Kiribati | N | 15.200 |
| 54 | Hawaii | Y | 15.178 |
| 55 | Cabo Verde | N | 15.100 |
| 56 | Iowa | Y | 15.040 |
| 57 | Pennsylvania | Y | 15.023 |
| 58 | Mississippi | Y | 15.018 |
| 59 | Burundi | N | 15.000 |
| 60 | Ohio | Y | 14.807 |
| 61 | Burkina Faso | N | 14.800 |
| 62 | Nebraska | Y | 14.679 |
| 63 | Estonia | N | 14.400 |
| 64 | Japan | N | 14.300 |
| 65 | North Carolina | Y | 14.284 |
| 66 | Sri Lanka | N | 14.200 |
| 67 | Michigan | Y | 14.104 |
| 68 | Florida | Y | 13.965 |
| 69 | Minnesota | Y | 13.846 |
| 70 | Finland | N | 13.800 |
| 71 | Eritrea | N | 13.800 |
| 72 | United States | N | 13.700 |
| 73 | Hungary | N | 13.600 |
| 74 | Georgia | Y | 13.594 |
| 75 | El Salvador | N | 13.500 |
| 76 | Liberia | N | 13.400 |
| 77 | Moldova | N | 13.400 |
| 78 | Poland | N | 13.400 |
| 79 | Texas | Y | 13.353 |
| 80 | Virginia | Y | 13.350 |
| 81 | Iceland | N | 13.300 |
| 82 | Slovenia | N | 13.300 |
| 83 | Mongolia | N | 13.300 |
| 84 | Bolivia | N | 12.900 |
| 85 | Thailand | N | 12.900 |
| 86 | Trinidad and Tobago | N | 12.900 |
| 87 | South Africa | N | 12.800 |
| 88 | Haiti | N | 12.200 |
| 89 | France | N | 12.100 |
| 90 | Nicaragua | N | 11.900 |
| 91 | Senegal | N | 11.800 |
| 92 | Rhode Island | Y | 11.784 |
| 93 | Sweden | N | 11.700 |
| 94 | Australia | N | 11.700 |
| 95 | Delaware | Y | 11.629 |
| 96 | Bhutan | N | 11.600 |
| 97 | New Zealand | N | 11.600 |
| 98 | Central African Republic | N | 11.600 |
| 99 | Botswana | N | 11.500 |
| 100 | Croatia | N | 11.500 |
| 101 | Namibia | N | 11.500 |
| 102 | Austria | N | 11.400 |
| 103 | Ethiopia | N | 11.400 |
| 104 | Switzerland | N | 11.300 |
| 105 | Zambia | N | 11.300 |
| 106 | Micronesia | N | 11.300 |
| 107 | Illinois | Y | 11.212 |
| 108 | Comoros | N | 11.100 |
| 109 | Rwanda | N | 11.000 |
| 110 | Serbia | N | 10.900 |
| 111 | Ireland | N | 10.900 |
| 112 | North Koreab | N | 10.600 |
| 113 | Connecticut | Y | 10.512 |
| 114 | Guinea | N | 10.500 |
| 115 | Czech Republic | N | 10.500 |
| 116 | Dominican Republic | N | 10.500 |
| 117 | California | Y | 10.456 |
| 118 | Canada | N | 10.400 |
| 119 | Luxembourg | N | 10.400 |
| 120 | Norway | N | 10.100 |
| 121 | Cuba | N | 10.100 |
| 122 | Slovakia | N | 10.100 |
| 123 | Gambia | N | 10.000 |
| 124 | Maryland | Y | 9.847 |
| 125 | Yemenb | N | 9.800 |
| 126 | Democratic Republic of the Congo | N | 9.700 |
| 127 | Chile | N | 9.700 |
| 128 | Nepal | N | 9.600 |
| 129 | Netherlands | N | 9.600 |
| 130 | Tanzaniab | N | 9.600 |
| 131 | Gabon | N | 9.600 |
| 132 | Sudan | N | 9.500 |
| 133 | Massachusetts | Y | 9.456 |
| 134 | Paraguay | N | 9.300 |
| 135 | Laos | N | 9.300 |
| 136 | Congo | N | 9.300 |
| 137 | Denmark | N | 9.200 |
| 138 | Germany | N | 9.100 |
| 139 | Kyrgyzstan | N | 9.100 |
| 140 | Argentina | N | 9.100 |
| 141 | Niger | N | 9.000 |
| 142 | Mali | N | 8.900 |
| 143 | Angola | N | 8.900 |
| 144 | Ghanab | N | 8.700 |
| 145 | Portugal | N | 8.600 |
| 146 | Djibouti | N | 8.500 |
| 147 | Mozambique | N | 8.400 |
| 148 | New Jersey | Y | 8.339 |
| 149 | Somalia | N | 8.300 |
| 150 | Seychelles | N | 8.300 |
| 151 | New York | Y | 8.120 |
| 152 | Myanmar | N | 8.100 |
| 153 | China | N | 8.000 |
| 154 | Romania | N | 8.000 |
| 155 | Singaporeb | N | 7.900 |
| 156 | Montenegro | N | 7.900 |
| 157 | Bulgaria | N | 7.900 |
| 158 | Malawi | N | 7.800 |
| 159 | United Kingdom | N | 7.600 |
| 160 | Mauritania | N | 7.500 |
| 161 | Costa Rica | N | 7.500 |
| 162 | Guinea-Bissau | N | 7.400 |
| 163 | Uzbekistan | N | 7.400 |
| 164 | Mauritius | N | 7.300 |
| 165 | Saint Lucia | N | 7.300 |
| 166 | Ecuador | N | 7.200 |
| 167 | Turkmenistan | N | 7.200 |
| 168 | Turkey | N | 7.200 |
| 169 | Papua New Guineab | N | 7.000 |
| 170 | Vietnam | N | 7.000 |
| 171 | Colombia | N | 7.000 |
| 172 | Madagascar | N | 6.900 |
| 173 | Georgia | N | 6.700 |
| 174 | District of Columbia | Y | 6.621 |
| 175 | Malta | N | 6.500 |
| 176 | Timor-Leste | N | 6.400 |
| 177 | Bosnia and Herzegovina | N | 6.400 |
| 178 | Afghanistan | N | 6.400 |
| 179 | Malaysiab | N | 6.200 |
| 180 | Republic of Macedonia | N | 6.200 |
| 181 | Bangladeshb | N | 6.100 |
| 182 | South Sudanb | N | 6.100 |
| 183 | Spain | N | 6.100 |
| 184 | Brazil | N | 6.100 |
| 185 | Cambodia | N | 5.900 |
| 186 | Solomon Islands | N | 5.900 |
| 187 | Belize | N | 5.900 |
| 188 | Qatar | N | 5.800 |
| 189 | Bahrain | N | 5.700 |
| 190 | Armenia | N | 5.700 |
| 191 | Albania | N | 5.600 |
| 192 | Kenyab | N | 5.600 |
| 193 | Italy | N | 5.500 |
| 194 | Fiji | N | 5.500 |
| 195 | Libya | N | 5.500 |
| 196 | Vanuatu | N | 5.400 |
| 197 | Samoa | N | 5.400 |
| 198 | Mexico | N | 5.200 |
| 199 | Peru | N | 5.100 |
| 200 | Bruneib | N | 4.500 |
| 201 | Cyprusb | N | 4.500 |
| 202 | Egypt | N | 4.400 |
| 203 | Panama | N | 4.400 |
| 204 | Iraq | N | 4.100 |
| 205 | Iran | N | 4.000 |
| 206 | Tonga | N | 4.000 |
| 207 | Greece | N | 3.800 |
| 208 | Venezuela | N | 3.800 |
| 209 | Jordanb | N | 3.700 |
| 210 | Philippines | N | 3.700 |
| 211 | Indonesia | N | 3.700 |
| 212 | Omanb | N | 3.500 |
| 213 | Honduras | N | 3.400 |
| 214 | Saudi Arabiab | N | 3.400 |
| 215 | Algeria | N | 3.300 |
| 216 | Tajikistan | N | 3.300 |
| 217 | Tunisia | N | 3.200 |
| 218 | Lebanonb | N | 3.200 |
| 219 | Morocco | N | 3.100 |
| 220 | Pakistanb | N | 3.100 |
| 221 | Sao Tome and Principe | N | 3.100 |
| 222 | Guatemala | N | 2.900 |
| 223 | Maldives | N | 2.700 |
| 224 | United Arab Emiratesb | N | 2.700 |
| 225 | Azerbaijan | N | 2.600 |
| 226 | Syriab | N | 2.400 |
| 227 | Saint Vincent and the Grenadines | N | 2.400 |
| 228 | Kuwait | N | 2.200 |
| 229 | Jamaica | N | 2.000 |
| 230 | Grenada | N | 1.700 |
| 231 | Bahamasb | N | 1.600 |
| 232 | Antigua and Barbuda | N | 0.500 |
| 233 | Barbados | N | 0.400 |
It is shocking, but 5 US states made in top 10. Montana has the 2nd highers suicide rate. 3 states are ahead of Russia, and Russia has very high suicide rate.
Let’s create a violin plot of the distribution.
p <- ggplot(countryState, aes(factor(USflag), AgeAdjRate))
p + geom_violin()+ggtitle("Plot 20. Age Adj Suicide Rate per 100,000: World Countries vs US States")+theme(plot.title = element_text(size=10))
The plot provides devasting results. US is doing very poorly. Please note that in last 2 decades while US suicide rate was steadily rising, the rest of the world was moving in opposite direction. Rates of suicide across of Asia and Europe have been dropping.
Let’s merge CDC and SAMHA data to see if there is correlation between suicide rates and serious thoughts of suicide. It should help us to validate our assumption, stated earlier that:
Suicide thought->Suicide Planning->Suicide Attempt->Suicide(in case of successful attempt)
suic_r_t<-merge(state2017, suicide_tt1, by = "state")
cor(suic_r_t$Y18P, suic_r_t$AARate)
## [1] 0.5509664
#suic_r_t
Correlation is moderate. R is only 0.55. Not as high as I expected.
scatterplot(suic_r_t$Y18P*100~suic_r_t$AARate, ylab="Thoughts of suicide, % of population", xlab="Age Adj Suicide Rate per 100,000",
main="Plot 21. Thoughts of Suicide(%) vs Age Adj Suicide Rate per 100,000")
ggplot(suic_r_t, aes(Y18P*100,AARate, color = AARate)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 22. Thoughts of Suicide(%) vs Suicide Rate per 100,000", x="Thoughts of Suicide, %", y="Suicide Rate, per 100,000")
Interesting question is why so many people seriously think about suicide while relativly few commit suicide, while in Montana not that many people seriously think about suicide while so many actually commit it? Is it the way data is collected or is it something else. Later we will see that gun ownership could play a role.
Let’s check correlation between suicide rate and other SAMHA’s variables:
Vs Drug Use, Any - no correlation, r=0.00. Very Interesting. It seems variables are perfectly independent.
Vs Heroin Use - practically no correlation, r=-0.04
Vs Alcohol Use and Binge Drinking - week negative correlation(!), r=-0.22 and r=-0.2. Basically what we are seeing that higher alcohol consumption weakly predicts lower suicide rate. Not what I have expected.
Vs Painkillers Abuse - finally we got some decent correlation. r=0.42.
Vs Serious Mental Illness - BINGO! r=0.57. Hard to believe but serious mental illness is a slightly better predictor than serious thoughts of suicide.
state2017C<-select(state2017,state, AARate)
suic_cor<-merge(state2017C, SAMHA_master, by = "state")
suic_cor1<-suic_cor
suic_cor1$state<-NULL
res <- cor(suic_cor1)
round(res, 2)%>%
kable() %>%
kable_styling()
| AARate | ThoughtSuic | SerMH | Painkillers | DrugNoM | DrugAny | Binge | DrugDis | AlcDis | AlcUse | Heroin | SubAbuse | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AARate | 1.00 | 0.55 | 0.57 | 0.42 | -0.16 | 0.00 | -0.20 | 0.05 | 0.10 | -0.22 | -0.04 | 0.04 |
| ThoughtSuic | 0.55 | 1.00 | 0.80 | 0.46 | 0.24 | 0.31 | -0.07 | 0.31 | 0.26 | -0.02 | 0.16 | 0.29 |
| SerMH | 0.57 | 0.80 | 1.00 | 0.78 | 0.18 | 0.18 | -0.21 | 0.26 | 0.13 | -0.25 | 0.24 | 0.18 |
| Painkillers | 0.42 | 0.46 | 0.78 | 1.00 | -0.01 | -0.05 | -0.29 | 0.13 | -0.13 | -0.35 | 0.24 | -0.03 |
| DrugNoM | -0.16 | 0.24 | 0.18 | -0.01 | 1.00 | 0.83 | 0.36 | 0.80 | 0.57 | 0.38 | 0.14 | 0.68 |
| DrugAny | 0.00 | 0.31 | 0.18 | -0.05 | 0.83 | 1.00 | 0.32 | 0.81 | 0.57 | 0.48 | 0.22 | 0.70 |
| Binge | -0.20 | -0.07 | -0.21 | -0.29 | 0.36 | 0.32 | 1.00 | 0.23 | 0.77 | 0.88 | 0.00 | 0.70 |
| DrugDis | 0.05 | 0.31 | 0.26 | 0.13 | 0.80 | 0.81 | 0.23 | 1.00 | 0.52 | 0.27 | 0.35 | 0.69 |
| AlcDis | 0.10 | 0.26 | 0.13 | -0.13 | 0.57 | 0.57 | 0.77 | 0.52 | 1.00 | 0.70 | 0.01 | 0.95 |
| AlcUse | -0.22 | -0.02 | -0.25 | -0.35 | 0.38 | 0.48 | 0.88 | 0.27 | 0.70 | 1.00 | 0.13 | 0.67 |
| Heroin | -0.04 | 0.16 | 0.24 | 0.24 | 0.14 | 0.22 | 0.00 | 0.35 | 0.01 | 0.13 | 1.00 | 0.13 |
| SubAbuse | 0.04 | 0.29 | 0.18 | -0.03 | 0.68 | 0.70 | 0.70 | 0.69 | 0.95 | 0.67 | 0.13 | 1.00 |
Let’s create scatterplots of the 2 variables. We see the same outliers again: Montana, Alaska, and New Mexico, all have high suicide rates, but their rates of mental illness are not all that high.
scatterplot(suic_cor$SerMH*100~suic_cor$AARate, xlab="Serious Mental Illness(%)", ylab="Age Adj Suicide Rate per 100,000",
main="Plot 22A. Serious Mental Illness(%) vs Age Adj Suicide Rate, per 100,000")
ggplot(suic_cor, aes(SerMH*100,AARate, color = AARate)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 22B. Serious Mental Illness(%) vs Age Adj Suicide Rate, per 100,000", x="Serious Mental Illness(%)", y="Age Adj Suicide Rate per 100,000")
We have a chance to compare suicide distribution based on gender.
state2017GA$AARate1<-(state2017GA$Age.Adjusted.Rate.Lower.95..Confidence.Interval+state2017GA$Age.Adjusted.Rate.Upper.95..Confidence.Interval)/2
state2017GA1<-filter(state2017GA,Gender!='')
p <- ggplot(state2017GA1, aes(factor(Gender), AARate1))
p + geom_violin()+ggtitle("Plot 23. Violin Plot. Age Adj Suicide Rate per 100,000, 2017 US States: Female vs Males")+theme(plot.title = element_text(size=10))
Interesting results. Female distribution has much more narrow range, while male one spreads over wider range.
Let’s see how male and female suicide rates correlate.
state2017GA2<-state2017GA1%>%select(state,Gender,AARate1)%>%spread(Gender,AARate1)
cor(state2017GA2$Male, state2017GA2$Female)
## [1] 0.8955711
Correlation is almost perfect! R is 0.9. I find the result very important since it appears that the ratio of male to female suicide varies little from state to state. I have seen a lot of inconsistent results and this is the first consistent number. Delaware appears to be the only outlier. Very low female rate and not very low male. Please remember that CDC classified Delaware female rate as unreliable.
scatterplot(state2017GA2$Male~state2017GA2$Female, xlab="Male suicide rate, Age Adj per 100,000", ylab="Female suicide rate, Age Adj , per 100,00", main="Plot 24. Male vs Female Suicide Rate, per 100,000")
ggplot(state2017GA2, aes(Male,Female, color = Female)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 25. Male vs Female Suicide Rate, per 100,000", x="Male Suicide Rate, per 100,000", y="Female Suicide Rate, per 100,000")
Now let’s run Male/Female data vs SAMHA tables to see if we get some meaningful results. state2017F
Please remember that SAMHA data is not broken by gender. So, we are forced to compare male population to total.
Let’s check correlation between male suicide rate and other SAMHA’s variables:
Interestingly, that r for male suicide rate vs thoughts of suicide has dropped, but not by a lot. r=0.53. Modest correlation.
Vs Drug Use, Any - no correlation, r=-0.003
Vs Heroin Use - practically no correlation, r=-0.02
Vs Alcohol Use, Binge Drinking - week negative correlation(!), r=-0.24 and r=-0.22. Basically what we are seeing that higher alcohol consumption weakly predicts lower suicide rate. Not what I have expected.
Vs Painkillers Abuse - finally we got some decent correlation. r=0.46.
Vs Serious Mental Illness - bingo! r=0.59. Hard to believe but serious mental illness is a better predictor than serious thoughts of suicide.
state2017MC<-select(state2017M,state, AARate)
suic_r_t_m<-merge(state2017MC, SAMHA_master, by = "state")
suic_r_t_m$state<-NULL
res <- cor(suic_r_t_m)
round(res, 2)%>%
kable() %>%
kable_styling()
| AARate | ThoughtSuic | SerMH | Painkillers | DrugNoM | DrugAny | Binge | DrugDis | AlcDis | AlcUse | Heroin | SubAbuse | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AARate | 1.00 | 0.53 | 0.59 | 0.46 | -0.19 | -0.03 | -0.22 | 0.02 | 0.07 | -0.24 | -0.02 | 0.02 |
| ThoughtSuic | 0.53 | 1.00 | 0.80 | 0.46 | 0.24 | 0.31 | -0.07 | 0.31 | 0.26 | -0.02 | 0.16 | 0.29 |
| SerMH | 0.59 | 0.80 | 1.00 | 0.78 | 0.18 | 0.18 | -0.21 | 0.26 | 0.13 | -0.25 | 0.24 | 0.18 |
| Painkillers | 0.46 | 0.46 | 0.78 | 1.00 | -0.01 | -0.05 | -0.29 | 0.13 | -0.13 | -0.35 | 0.24 | -0.03 |
| DrugNoM | -0.19 | 0.24 | 0.18 | -0.01 | 1.00 | 0.83 | 0.36 | 0.80 | 0.57 | 0.38 | 0.14 | 0.68 |
| DrugAny | -0.03 | 0.31 | 0.18 | -0.05 | 0.83 | 1.00 | 0.32 | 0.81 | 0.57 | 0.48 | 0.22 | 0.70 |
| Binge | -0.22 | -0.07 | -0.21 | -0.29 | 0.36 | 0.32 | 1.00 | 0.23 | 0.77 | 0.88 | 0.00 | 0.70 |
| DrugDis | 0.02 | 0.31 | 0.26 | 0.13 | 0.80 | 0.81 | 0.23 | 1.00 | 0.52 | 0.27 | 0.35 | 0.69 |
| AlcDis | 0.07 | 0.26 | 0.13 | -0.13 | 0.57 | 0.57 | 0.77 | 0.52 | 1.00 | 0.70 | 0.01 | 0.95 |
| AlcUse | -0.24 | -0.02 | -0.25 | -0.35 | 0.38 | 0.48 | 0.88 | 0.27 | 0.70 | 1.00 | 0.13 | 0.67 |
| Heroin | -0.02 | 0.16 | 0.24 | 0.24 | 0.14 | 0.22 | 0.00 | 0.35 | 0.01 | 0.13 | 1.00 | 0.13 |
| SubAbuse | 0.02 | 0.29 | 0.18 | -0.03 | 0.68 | 0.70 | 0.70 | 0.69 | 0.95 | 0.67 | 0.13 | 1.00 |
Please remember that SAMHA data is not broken by gender. So, we are forced to compare female population to total.
Let’s check correlation between female suicide rate and other SAMHA’s variables:
Interestingly, that r for female suicide rate vs thoughts of suicide is the same as for both gender. r=0.57. Modest correlation.
Vs Drug Use, Any - practically no correlation, r=0.11
Vs Heroin Use - practically no correlation, r=-0.04
Vs Alcohol Use and Binge Drinking - week negative correlation, r=-0.19 and r=-0.18.
Vs Painkillers Abuse - finally we got some modest correlation. r=0.32.
Vs Serious Mental Illness - modest correlation, r=0.52. For females, thoughts of suicide is a slightly better predictor than serious mental illness.
state2017FC<-select(state2017F,state, AARate1)
suic_corF<-merge(state2017FC, SAMHA_master, by = "state")
suic_corF$state<-NULL
res <- cor(suic_corF)
round(res, 2)%>%
kable() %>%
kable_styling()
| AARate1 | ThoughtSuic | SerMH | Painkillers | DrugNoM | DrugAny | Binge | DrugDis | AlcDis | AlcUse | Heroin | SubAbuse | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AARate1 | 1.00 | 0.57 | 0.52 | 0.32 | -0.04 | 0.11 | -0.18 | 0.13 | 0.14 | -0.19 | -0.04 | 0.08 |
| ThoughtSuic | 0.57 | 1.00 | 0.80 | 0.46 | 0.24 | 0.31 | -0.07 | 0.31 | 0.26 | -0.02 | 0.16 | 0.29 |
| SerMH | 0.52 | 0.80 | 1.00 | 0.78 | 0.18 | 0.18 | -0.21 | 0.26 | 0.13 | -0.25 | 0.24 | 0.18 |
| Painkillers | 0.32 | 0.46 | 0.78 | 1.00 | -0.01 | -0.05 | -0.29 | 0.13 | -0.13 | -0.35 | 0.24 | -0.03 |
| DrugNoM | -0.04 | 0.24 | 0.18 | -0.01 | 1.00 | 0.83 | 0.36 | 0.80 | 0.57 | 0.38 | 0.14 | 0.68 |
| DrugAny | 0.11 | 0.31 | 0.18 | -0.05 | 0.83 | 1.00 | 0.32 | 0.81 | 0.57 | 0.48 | 0.22 | 0.70 |
| Binge | -0.18 | -0.07 | -0.21 | -0.29 | 0.36 | 0.32 | 1.00 | 0.23 | 0.77 | 0.88 | 0.00 | 0.70 |
| DrugDis | 0.13 | 0.31 | 0.26 | 0.13 | 0.80 | 0.81 | 0.23 | 1.00 | 0.52 | 0.27 | 0.35 | 0.69 |
| AlcDis | 0.14 | 0.26 | 0.13 | -0.13 | 0.57 | 0.57 | 0.77 | 0.52 | 1.00 | 0.70 | 0.01 | 0.95 |
| AlcUse | -0.19 | -0.02 | -0.25 | -0.35 | 0.38 | 0.48 | 0.88 | 0.27 | 0.70 | 1.00 | 0.13 | 0.67 |
| Heroin | -0.04 | 0.16 | 0.24 | 0.24 | 0.14 | 0.22 | 0.00 | 0.35 | 0.01 | 0.13 | 1.00 | 0.13 |
| SubAbuse | 0.08 | 0.29 | 0.18 | -0.03 | 0.68 | 0.70 | 0.70 | 0.69 | 0.95 | 0.67 | 0.13 | 1.00 |
Our main research is to understand why such a gap is there. Our collected results should give us final ideas on why we have such a big discrepancy in suicide rates. I have mentioned DC and Montana in the introduction. These 2 states lay on opposite ends of suicide rates; however, comparing these two states is as hard as it gets. Everything is different about them. One is in the Metro area, and another is rural state. Climate is so different, as well as ethnic distribution. Another problem, both states have low population rates. We are facing a typical “apple vs orange” problem. My goal is to find close populations to compare and show if population size plays a significant role in suicide rates. After looking at different factors I decided to compare Colorado and Minnesota. They have comparable population sizes, similar ethic distribution making them both well to do states; however, their suicide rates are significantly different. Minnesota is on the lower end with 13.8 and Colorado is on the higher end with 20.3. Colorado is 47% higher. It does not make any sense to me. Interestingly, later in the project I try to cluster states by SAMHA survey and Minnesota does come out as a state with not a lot of drug, alcohol, and mental health issues, while Colorado falls in mix bag category with some good and some bad indicators. Looking at the details, we can see that Colorado has much higher drug usage than Minnesota. Heroin use is the same for both states (low). Colorado also has higher rate of painkiller abuse, alcohol abuse, suicide thoughts, and mental illness. So, maybe we do have reasons. Let’s look at the data.
col_min<-merge(state15_17C, state15_17M,by=c("Gender","X2013.Urbanization","Race","Hispanic.Origin"))
col_min$ratio<-col_min$AARate.x/col_min$AARate.y
arrange(col_min,desc(ratio))
We can see that 3 populations got over twice as high rate in Colorado vs Minnesota. All their rates are considered reliable by CDC.
We will focus on these 3 groups:
col_min1<-merge(state08_17C, state08_17M,by=c("Gender","X2013.Urbanization","Race","ICD.10.113.Cause.List"))
col_min1$ratio<-col_min1$AARate.x/col_min1$AARate.y
col_min_C1<-filter(col_min1,Gender=='Male',Race=='Black or African American',X2013.Urbanization=='Large Central Metro')
head(arrange(col_min_C1,desc(ratio)))
For our first group we can see that Transport accidents and Alcoholic liver disease are very high for Colorado. Let’s see if these diseases would correlate for other states.
col_min_C2<-filter(col_min1,Gender=='Male',Race=='White',X2013.Urbanization=='Small Metro')
head(arrange(col_min_C2,desc(ratio)))
For our second group we can see that Assault, HIV, and Accidental poisoning are very high for Colorado. Let’s see if these diseases would correlate for other states.
col_min_C3<-filter(col_min1,Gender=='Female',Race=='White',X2013.Urbanization=='Small Metro')
arrange(col_min_C3,desc(ratio))
For our third group we can see that Assault, Viral hepatitis, and Accidental poisoning are very high for Colorado. Let’s see if these diseases would correlate for other states.
Correlation between younger and older population with thoughts of suicide by state approaching so called strong correlation (r=0.66).
cor(suicide_tt1$Y18_25, suicide_tt1$Y26P)
## [1] 0.6594445
scatterplot(suicide_tt1$Y18_25*100~suicide_tt1$Y26P,xlab="Age 18-25, %", ylab="Age 26+, %", main="Plot 26. Serious Thoughts of Suicide, rate(%): 18-25 years old vs 26+ years old cohort")
ggplot(suicide_tt1, aes(Y18_25*100,Y26P*100, color = Y18_25)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+ ggtitle("Plot 27. Serious thoughts of suicide, rate(%), 18-25 years old vs 26+ years old cohort")+labs(x = "18-25 year old cohort, %", y = "26+ year old cohort, %")
Correlation between population with thoughts of suicide and one with any mental illness by state is strong (r=0.74).
suic_amh<-merge(suicide_tt1, amh_dt, by = "state")
cor(suic_amh$Y18P.x, suic_amh$Y18P.y)
## [1] 0.7351503
suic_amh$temp<-suic_amh$Y18P.x*100
suic_amh$temp1<-suic_amh$Y18P.y*100
scatterplot(suic_amh$temp~suic_amh$temp1,xlab="Thoughts of suicide, %", ylab="Any Mental Illness, %", main="Plot 27A. Thoughts of Suicide vs Any Mental Illness, %")
ggplot(suic_amh, aes(temp,temp1, color = Y18P.x)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 27B. Thoughts of Suicide vs Any Mental Illness, %")+labs(x = "18+ year old cohort - Thoughts of suicide, %", y = "18+ year old cohort - Any Mental Illness, %")
Correlation between population with thoughts of suicide and one with serious mental illness by state is strong (r=0.8).
suic_mh<-merge(suicide_tt1, mh_dt, by = "state")
cor(suic_mh$Y18P.x, suic_mh$Y18P.y)
## [1] 0.7982923
scatterplot(suic_mh$Y18P.x~suic_mh$Y18P.y,xlab="Thoughts of suicide, %", ylab="Serious Mental Illness, %",
main="Plot 28. Thoughts of Suicide vs Serious Mental Illness, %")
ggplot(suic_mh, aes(Y18P.x,Y18P.y, color = Y18P.x)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 29. Thoughts of Suicide vs Serious Mental Illness", x="Thoughts of Suicide", y="Serious Mental Illness")
Correlation between population with thoughts of suicide and one with substance abuse disorder is weak/moderate (r=0.3). Interestingly, Utah has low SA problem, but high level of thoughts of suicide, while DC has drug problem but low level of suicide thoughts. Let’s take out these 2 outliers and see if correlation will change. Indeed, correlation improved to r=0.47. Let’s see if correlation is stronger for younger cohort. Actually, it is not (r=0.46). Not what I have expected. I would like to mention that throughout my analysis Utah and DC will be somewhat outliers. Specifically, Utah, a state with low violence, drug and alcohol use, while the rate of suicide is one of the highest in country.
suic_sa<-merge(suicide_tt1, sa_dt, by = "state")
cor(suic_sa$Y18P.x, suic_sa$Y18P.y)
## [1] 0.2976138
scatterplot(suic_sa$Y18P.x~suic_sa$Y18P.y,xlab="Thoughts of suicide", ylab="Substance Abuse Illness",
main="Plot 30. Thoughts of Suicide vs SA Illness")
ggplot(suic_sa, aes(Y18P.x,Y18P.y, color = Y18P.x)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 31. Thoughts of Suicide vs SA Illness", y="Substance Abuse Illness", x="Thougts of suicide")
suic_sa_woo<-filter(suic_sa, !state %in% c('Utah','District of Columbia'))
cor(suic_sa_woo$Y18P.x, suic_sa_woo$Y18P.y)
## [1] 0.4704185
cor(suic_sa_woo$Y18_25.x, suic_sa_woo$Y18_25.y)
## [1] 0.4625329
Weak correlation, between alcohol and thoughts of suicide (r=0.27). Strange. Without DC and Utah, correlation goes up to 0.43. Correlation is stronger for younger cohort - 0.53. For younger people, Alaska is an outline. Alaska has moderate alcohol issues and a lot of thoughts of suicide.
suic_alc<-merge(suicide_tt1, alc_dt, by = "state")
cor(suic_alc$Y18P.x, suic_alc$Y18P.y)
## [1] 0.2730799
scatterplot(suic_alc$Y18P.x~suic_alc$Y18P.y,xlab="Thoughts of suicide", ylab="Alcohol Disorder",
main="Plot 32. Thoughts of Suicide vs Alcohol Disorder")
ggplot(suic_alc, aes(Y18P.x,Y18P.y,color=Y18P.x)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 33. Thoughts of Suicide vs Alcohol Disorder", y="Alcohol Illness", x="Thougts of suicide", title="Scatterplot")
suic_alc_woo<-filter(suic_alc, !state %in% c('Utah','District of Columbia'))
cor(suic_alc_woo$Y18P.x, suic_alc_woo$Y18P.y)
## [1] 0.4435853
cor(suic_alc_woo$Y18_25.x, suic_alc_woo$Y18_25.y)
## [1] 0.5315146
ggplot(suic_alc_woo, aes(Y18_25.x,Y18_25.y, color=Y18_25.x)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 34. Thoughts of Suicide vs Alcohol Disorder - 18-25 years old only", y="Alcohol Illness", x="Thougts of suicide", title="Scatterplot")
Interestingly, this time neither DC nor Utah are outliers. DC has low painkiller problem and low level of suicide thoughts, while Utah got problems with both. R is moderate 0.46.
suic_pain<-merge(suicide_tt1, pain_dt, by = "state")
cor(suic_pain$Y18P.x, suic_pain$Y18P.y)
## [1] 0.45627
scatterplot(suic_pain$Y18P.x~suic_pain$Y18P.y,xlab="Thoughts of suicide", ylab="Painkillers Abuse Disorder",
main="Plot 35. Thoughts of Suicide vs Painkillers Abuse Disorder, 18+")
ggplot(suic_pain, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 36. Thoughts of Suicide vs Painkillers Abuse Disorder - 18+ years old", y="Painkillers Abuse Disorder", x="Thougts of suicide", title="Scatterplot")
ggplot(suic_pain, aes(Y18_25.x,Y18_25.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 37. Thoughts of Suicide vs Painkillers Abuse Disorder - 18-25 years old only", y="Painkillers Abuse Disorder", x="Thougts of suicide", title="Scatterplot")
cor(suic_pain$Y18_25.x, suic_pain$Y18_25.y)
## [1] 0.1496557
ggplot(suic_pain, aes(Y26P.x,Y26P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 38. Thoughts of Suicide vs Painkillers Abuse Disorder - 26+ year old only", y="Painkillers Abuse Disorder", x="Thougts of suicide", title="Scatterplot")
cor(suic_pain$Y26P.x, suic_pain$Y26P.y)
## [1] 0.4922438
Similar picture as many other variables, Utah, DC, and Idaho are outliers. DC has low level of suicide thoughts and high level of drug use, while Utah and Idaho are in opposite situation (Utah does have painkiller abuse problem.)
suic_drug<-merge(suicide_tt1, drug_dt, by = "state")
cor(suic_drug$Y18P.x, suic_drug$Y18P.y)
## [1] 0.314418
scatterplot(suic_drug$Y18P.x~suic_drug$Y18P.y,xlab="Thoughts of suicide", ylab="Drug Disorder",
main="Plot 39. Thoughts of Suicide vs Drug Disorder")
ggplot(suic_drug, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 40. Thoughts of Suicide vs Drug Disorder - 18+ year old", y="Drug Abuse Disorder", x="Thougts of suicide", title="Scatterplot")
The first variable that has no correlation at all. Interesting. So, it could be safe to go binge drinking after the project is complete!
suic_binge<-merge(suicide_tt1, alc_usebt, by = "state")
cor(suic_binge$Y18P.x, suic_binge$Y18P.y)
## [1] -0.06411454
scatterplot(suic_binge$Y18P.x~suic_binge$Y18P.y,xlab="Thoughts of suicide", ylab="Binge drinking",
main="Plot 41. Thoughts of Suicide vs Binge drinking")
ggplot(suic_binge, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 42. Thoughts of Suicide vs Binge Drinking - 18+ year old", y="Binge Drinking", x="Thougts of suicide", title="Scatterplot")
No correlation again. Interesting.
suic_alcd<-merge(suicide_tt1, alc_uset, by = "state")
cor(suic_alcd$Y18P.x, suic_alcd$Y18P.y)
## [1] -0.02256482
scatterplot(suic_alcd$Y18P.x~suic_alcd$Y18P.y,xlab="Thoughts of suicide", ylab="Alcohol drinking",
main="Plot 43. Thoughts of Suicide vs Alcohol drinking")
ggplot(suic_alcd, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 44. Thoughts of Suicide vs Alcohol Drinking - 18+ year old", y="Alcohol Drinking", x="Thougts of suicide", title="Scatterplot")
Weak correlation.
suic_drugh<-merge(suicide_tt1, drug_useth, by = "state")
cor(suic_drugh$Y18P.x, suic_drugh$Y18P.y)
## [1] 0.1511048
cor(suic_drugh$Y18_25.x, suic_drugh$Y18_25.y)
## [1] 0.235639
scatterplot(suic_drugh$Y18P.x~suic_drugh$Y18P.y,xlab="Thoughts of suicide", ylab="Heroin use",
main="Plot 45. Thoughts of Suicide vs Heroin use")
ggplot(suic_drugh, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 46. Thoughts of Suicide vs Heroin Use - 18+ year old", y="Heroin Use", x="Thougts of suicide", title="Scatterplot")
Weak correlation. R=0.24.
suic_drugu1<-merge(suicide_tt1, drug_uset1, by = "state")
cor(suic_drugu1$Y18P.x, suic_drugu1$Y18P.y)
## [1] 0.2422904
cor(suic_drugu1$Y18_25.x, suic_drugu1$Y18_25.y)
## [1] 0.1737713
scatterplot(suic_drugu1$Y18P.x~suic_drugu1$Y18P.y,xlab="Thoughts of suicide", ylab="Drug use, any",
main="Plot 47. Thoughts of Suicide vs Drug use, any")
ggplot(suic_drugu1, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 48. Thoughts of Suicide vs Drug Use, any - 18+ year old", y="Drug Use, any", x="Thougts of suicide", title="Scatterplot")
The same outliers: DC, Utah, and Idaho. With outliers correlation is 0.31. Without outliers, it goes up to 0.53.
suic_drugu<-merge(suicide_tt1, drug_uset, by = "state")
cor(suic_drugu$Y18P.x, suic_drugu$Y18P.y)
## [1] 0.3126032
cor(suic_drugu$Y18_25.x, suic_drugu$Y18_25.y)
## [1] 0.2029604
suic_drugu_woo<-filter(suic_drugu, !state %in% c('Utah','Idaho', 'District of Columbia'))
cor(suic_drugu_woo$Y18P.x, suic_drugu_woo$Y18P.y)
## [1] 0.5261027
scatterplot(suic_drugu$Y18P.x~suic_drugu$Y18P.y,xlab="Thoughts of suicide", ylab="Drug use, not marijuana",
main="Plot 49. Thoughts of Suicide vs Drug use, not marijuana")
ggplot(suic_drugu, aes(Y18P.x,Y18P.y)) + geom_point() + geom_text(aes(label=state), check_overlap = TRUE)+
labs(subtitle="Plot 50. Thoughts of Suicide vs Drug Use, not marijuana - 18+ year old", y="Drug Use, not marijuana", x="Thougts of suicide", title="Scatterplot")
Now that we gained insight into mental health and substance abuse drivers behind thoughts of suicide, we can try to build basic Linear Regression. Please note that our main goal is not to model thoughts of suicide; however, it might help us to understand better what drives people to thoughts of suicide and what states are effected the worse.
The model calculates rate of suicide thoughts based on serious mental illness, painkillers abuse rates, and use of any drug.
The adjusted R-squired is 0.72. It is not that great.
SAMHA_masterLR<-SAMHA_master
SAMHA_masterLR$Painkillers2<-SAMHA_masterLR$Painkillers**2
SAMHA_masterLR$Painkillers3<-SAMHA_masterLR$Painkillers**3
SAMHA_masterLR$DrugAny2<-SAMHA_masterLR$DrugAny**2
SAMHA_masterLR$DrugAny3<-SAMHA_masterLR$DrugAny**3
model <- lm( ThoughtSuic~ SerMH+Painkillers3+DrugAny+DrugAny2+DrugAny3, data = SAMHA_masterLR)
model
##
## Call:
## lm(formula = ThoughtSuic ~ SerMH + Painkillers3 + DrugAny + DrugAny2 +
## DrugAny3, data = SAMHA_masterLR)
##
## Coefficients:
## (Intercept) SerMH Painkillers3 DrugAny DrugAny2
## 4.900e-02 9.754e-01 -1.639e+04 -1.032e+00 7.421e+00
## DrugAny3
## -1.667e+01
summary(model)
##
## Call:
## lm(formula = ThoughtSuic ~ SerMH + Painkillers3 + DrugAny + DrugAny2 +
## DrugAny3, data = SAMHA_masterLR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0065928 -0.0016519 -0.0000169 0.0021532 0.0059593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.900e-02 2.203e-02 2.224 0.0312 *
## SerMH 9.754e-01 1.231e-01 7.921 4.5e-10 ***
## Painkillers3 -1.639e+04 5.682e+03 -2.884 0.0060 **
## DrugAny -1.032e+00 4.888e-01 -2.110 0.0404 *
## DrugAny2 7.421e+00 3.567e+00 2.081 0.0432 *
## DrugAny3 -1.667e+01 8.306e+00 -2.007 0.0507 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002902 on 45 degrees of freedom
## Multiple R-squared: 0.7441, Adjusted R-squared: 0.7157
## F-statistic: 26.18 on 5 and 45 DF, p-value: 2.734e-12
myres = resid(model)
plot(SAMHA_masterLR$ThoughtSuic, myres, ylab="Residuals", xlab="Thoughts of Suicide", main="Thoughts of Suicide")
abline(0, 0)
I will use 4 clusters, which is perfect for me, so I can kind of imitate 4 regions of USA. Let’s if clusters correspond to 4 regions.
SAMHA_masterC<-SAMHA_master
row.names(SAMHA_masterC)<-SAMHA_masterC[,1]
SAMHA_masterC <- scale(SAMHA_masterC[-1])
wss <- (nrow(SAMHA_masterC)-1)*sum(apply(SAMHA_masterC,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(SAMHA_masterC,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
fviz_nbclust(SAMHA_masterC, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
# Elbow method
fviz_nbclust(SAMHA_masterC, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")
fit <- kmeans(SAMHA_masterC, 4)
aggregate(SAMHA_masterC,by=list(fit$cluster),FUN=mean)
## Group.1 ThoughtSuic SerMH Painkillers DrugNoM DrugAny
## 1 1 0.6007879 1.08414059 1.158150114 -0.2595276 -0.5611673
## 2 2 -0.2228678 -0.08459446 0.229540261 -0.3202326 -0.3565953
## 3 3 -0.8861016 -1.19118857 -1.211278204 -0.4452622 -0.3620547
## 4 4 0.9523062 0.61667459 0.003339082 1.4147873 1.6737100
## Binge DrugDis AlcDis AlcUse Heroin SubAbuse
## 1 -1.1437122 -0.2697160 -0.9377813 -1.4026162 0.1236364 -0.8403990
## 2 0.3807154 -0.1843569 0.2039108 0.2981129 0.1228629 0.1670663
## 3 -0.2742555 -0.6022033 -0.5160975 -0.0887141 -0.4732574 -0.6271312
## 4 0.8149568 1.3844227 1.2416685 0.9813414 0.2704450 1.3549502
mydata <- data.frame(SAMHA_masterC, fit$cluster)
distance <- get_dist(SAMHA_masterC)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
k2 <- kmeans(SAMHA_masterC, centers = 4, nstart = 25)
str(k2)
## List of 9
## $ cluster : Named int [1:51] 4 1 3 4 3 1 1 1 1 3 ...
## ..- attr(*, "names")= chr [1:51] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ centers : num [1:4, 1:11] 0.664 -0.178 -0.886 0.489 0.383 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "1" "2" "3" "4"
## .. ..$ : chr [1:11] "ThoughtSuic" "SerMH" "Painkillers" "DrugNoM" ...
## $ totss : num 550
## $ withinss : num [1:4] 96.6 59.1 46.2 53.4
## $ tot.withinss: num 255
## $ betweenss : num 295
## $ size : int [1:4] 13 14 13 11
## $ iter : int 4
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
k2
## K-means clustering with 4 clusters of sizes 13, 14, 13, 11
##
## Cluster means:
## ThoughtSuic SerMH Painkillers DrugNoM DrugAny Binge
## 1 0.6636423 0.38307248 0.01892408 1.1949376 1.3499637 0.6169255
## 2 -0.1779872 -0.06162524 0.26850653 -0.4827015 -0.4555759 0.5522002
## 3 -0.8861016 -1.19118857 -1.21127820 -0.4452622 -0.3620547 -0.2742555
## 4 0.4894356 1.03347842 1.06741020 -0.2716326 -0.5877050 -1.1077739
## DrugDis AlcDis AlcUse Heroin SubAbuse
## 1 1.3773553 0.9877644 0.7975831 0.6819409 1.1710608
## 2 -0.5105397 0.2221695 0.4091577 -0.3076858 0.1150226
## 3 -0.6022033 -0.5160975 -0.0887141 -0.4732574 -0.6271312
## 4 -0.2663110 -0.8401857 -1.3585005 0.1449742 -0.7892183
##
## Clustering vector:
## Alabama Alaska Arizona
## 4 1 3
## Arkansas California Colorado
## 4 3 1
## Connecticut Delaware District of Columbia
## 1 1 1
## Florida Georgia Hawaii
## 3 3 3
## Idaho Illinois Indiana
## 4 3 4
## Iowa Kansas Kentucky
## 2 2 4
## Louisiana Maine Maryland
## 2 2 3
## Massachusetts Michigan Minnesota
## 1 2 3
## Mississippi Missouri Montana
## 4 2 1
## Nebraska Nevada New Hampshire
## 2 1 1
## New Jersey New Mexico New York
## 3 3 3
## North Carolina North Dakota Ohio
## 4 2 2
## Oklahoma Oregon Pennsylvania
## 4 1 2
## Rhode Island South Carolina South Dakota
## 1 2 2
## Tennessee Texas Utah
## 4 3 4
## Vermont Virginia Washington
## 1 3 1
## West Virginia Wisconsin Wyoming
## 4 2 2
##
## Within cluster sum of squares by cluster:
## [1] 96.55474 59.11845 46.17610 53.43947
## (between_SS / total_SS = 53.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
fviz_cluster(k2, data = SAMHA_masterC)
So, we have 4 clusters.
I will name these clusters:
As we can see, SAMHA clustering is not very useful indicator of suicide rate. High and low suicide rate states are mixed and not well separated.
Montana has Native American popualtion with a huge suicide rate.
DC is Large Central Metro Area, while Montana is a mix of Non Metro and Small Metro, which will account for some discrepency. NonMetro/Small Metro vs Large Metro accounts for 25% difference, but not 4 times we see.
If we compare 2 states by type of suicide based on the method. We can see that majority of suicide comited in Montana is done by using firearms, while non-firearm rates for are pretty close. If we compare White non-fireamr rate is 6.9 vs 8.0 - practically indentical. So, it is very likely that avilability of firearms playing a role in driving suicide rates up(Matthew Miller and Hemenway 2002).
So to summarize, we can say that 3 top drivers for much higher suicide rate in Montana:
state15_17<-select(state15_17,state,Race,Hispanic.Origin,Gender,X2013.Urbanization,Population, AARate)%>%filter(state=='Montana'|state=='District of Columbia')
state15_17%>%
kable() %>%
kable_styling()
| state | Race | Hispanic.Origin | Gender | X2013.Urbanization | Population | AARate |
|---|---|---|---|---|---|---|
| District of Columbia | Black or African American | Not Hispanic or Latino | Female | Large Central Metro | 441333 | 2.8460 |
| District of Columbia | Black or African American | Not Hispanic or Latino | Male | Large Central Metro | 349404 | 13.1180 |
| District of Columbia | White | Not Hispanic or Latino | Male | Large Central Metro | 339134 | 15.3460 |
| Montana | White | Not Hispanic or Latino | Female | Small Metro | 411771 | 12.6575 |
| Montana | American Indian or Alaska Native | Not Hispanic or Latino | Male | Small Metro | 14865 | 79.4025 |
| Montana | White | Not Hispanic or Latino | Male | Small Metro | 403880 | 53.5045 |
| Montana | White | Not Hispanic or Latino | Female | Micropolitan (Nonmetro) | 364414 | 16.8390 |
| Montana | White | Not Hispanic or Latino | Male | Micropolitan (Nonmetro) | 369459 | 48.6820 |
| Montana | American Indian or Alaska Native | Not Hispanic or Latino | Female | NonCore (Nonmetro) | 52546 | 27.9455 |
| Montana | White | Not Hispanic or Latino | Female | NonCore (Nonmetro) | 363771 | 12.6020 |
| Montana | American Indian or Alaska Native | Not Hispanic or Latino | Male | NonCore (Nonmetro) | 50796 | 64.3225 |
| Montana | White | Not Hispanic or Latino | Male | NonCore (Nonmetro) | 374454 | 53.3775 |
state17_DC_Mo<-select(state17_DC_Mo,State,Race,ICD.10.113.Cause.List,Age.Adjusted.Rate)
state17_DC_Mo %>%
kable() %>%
kable_styling()
| State | Race | ICD.10.113.Cause.List | Age.Adjusted.Rate |
|---|---|---|---|
| District of Columbia | Black or African American | Intentional self-harm (suicide) by other and unspecified means and their sequelae (*U03,X60-X71,X75-X84,Y87.0) | Unreliable |
| District of Columbia | White | Intentional self-harm (suicide) by other and unspecified means and their sequelae (*U03,X60-X71,X75-X84,Y87.0) | 6.9 |
| Montana | American Indian or Alaska Native | Intentional self-harm (suicide) by discharge of firearms (X72-X74) | Unreliable |
| Montana | American Indian or Alaska Native | Intentional self-harm (suicide) by other and unspecified means and their sequelae (*U03,X60-X71,X75-X84,Y87.0) | 27.5 |
| Montana | White | Intentional self-harm (suicide) by discharge of firearms (X72-X74) | 19.6 |
| Montana | White | Intentional self-harm (suicide) by other and unspecified means and their sequelae (*U03,X60-X71,X75-X84,Y87.0) | 8.0 |
I feel that I have tried, but I have not gotten very far.Let me summarize what I was able to identify:
More quality data needs to be collected about suicide. Currently suicide related data is spread and often incomplete. We need more centralized research. I have scanned papers related to statistics of suicide and volume is not all that big. More needs to be done.
Suicide is a complex subject. It is multifaceted. It has roots in social, medical, and biological areas. To view it from only one point would not be enough.
Looking at how USA states fare against World countries is depressing. If USA states were independent countries, they would take 5 out of 10 top positions. In addition, suicide rates fall across the world, while here in US they grow every year.
Contributing factors responsible for a huge variance in suicide levels across the US states are: ethnicity, residential composition (Large Metro vs Rural areas), availability of guns, mental illness rate, alcohol abuse rate, and drug abuse. Alcohol induced deaths correlate from moderately with suicide.
Male and female suicide rates by state strongly correlate, so it is possible that drivers behind suicides for both genders are the same. However, male suicide rates have much more variance to them vs females.
Prevalence of serious thoughts of suicide moderately correlate with suicide. Serious mental illness has moderate correlation with suicide rate as well. Next step would be to try to figure out why rates of mental illness/thoughts of suicide fluctuate that much from state to state. Painkiller Abuse Disorder Prevalence correlates with suicides. Considering that problems with painkiller abuse continue to spread it might negatively effect suicide rate in future.
Attempt to create Linear Model of Serious Thoughts of Suicide was not very successful. R squire was only 0.72. The variables used in a model were Serious Mental Illness, Painkiller Abuse Disorder, and Drug Use, Any Drug. There must be additional variables which need to be discovered.
Clustering states based on SAMHSA survey produced some interesting results. It allowed to cluster states into 4 different groups, however these groups would not correspond to state suicide rates well.
Comparing high suicide state (Colorado) and low suicide state (Minnesota) indicates alcohol, drugs, and violence as possible culprits.
DC vs Montana. 3 drivers - very high suicide rate among Montana Native American population, rural/small metro environment in Montana, and high gun ownership and correspondingly high gun suicide rate in Montana.
Create centralized public database for suicide data.
Create methods of improved data collection related to suicide.
Try to understand how other countries were able to lower their suicide rates. For example, Japan had a lot of success.
Do more research into connection between mental illness, alcohol, drugs, violence, gun ownership, and suicide.
To see if success of low suicide states could be applied to high suicide states.
To do more research into how people go from thinking to actually commuting suicide.
CDC. 2018. About Underlying Cause of Death, 1999-2017. https://wonder.cdc.gov/ucd-icd10.html.
Gerstman, Bud. Unknown. 14: Correlation. http://www.sjsu.edu/faculty/gerstman/StatPrimer/correlation.pdf.
M.Groysman. 2018. Data607_FinalProject_DataFIles. https://github.com/mgroysman/Data607_FinalProject_DataFIles.
Martin, David. 2018. U.S. Navy Admiral Scott Stearney Found Dead in Apparent Suicide. https://www.cbsnews.com/news/scott-stearney-us-navy-admiral-found-dead-apparent-suicide-2018-12-01/.
Matthew Miller, Deborah Azrael, and David Hemenway. 2002. Household Firearm Ownership and Suicide Rates in the United States. https://www.jstor.org/stable/3703934?seq=1/analyze.
Riess, Jana. 2016. Study Shows Link Between Teen Suicide and Mormon Populations. https://religionnews.com/2016/03/12/study-shows-link-teen-suicide-mormon-populations/.
SAMHSA. 2018. 2016-2017 Nsduh State Prevalence Estimates. https://www.samhsa.gov/data/report/2016-2017-nsduh-state-prevalence-estimates.
Wikipedia. 2018. List of Countries by Suicide Rate. https://en.wikipedia.org/wiki/List_of_countries_by_suicide_rate.