For my final project, I will be exploring the “Suicide Attempts in Shandong, China” dataset available on Kaggle and collected in a study by Sun et al. (Sun et al.). According to the research article, “linked data from two public health surveillance systems were analysed,” though there is no explanation of how the data for those data sets were collected. I will be exploring several aspects of this data, including the relationship between mortality rates and residence over time; the relationship between sex, methods, and outcomes; and the relationship between age a suicide methods, all over the course of the three-year study (2009 through 2011). The data set contains the age (numeric), sex (categorical), education level (categorical), and occupation (categorical) of the individuals, as well as year (discrete), month (discrete), and method (categorical) of their suicide attempt, whether or not they were hospitalized (categorical) or died (categorical), and whether or not they lived in an urban environment (categorical). I decided to explore this data set because suicide is a major global leading cause of death, ranking as the fourth highest cause of death for people between the ages of 15 and 29 years old (World Health Organization).
In order to clean this data set, I will first exclude extra index columns, add a continuous variable for year (using year and month), change the column headers to lowercase and remove spaces in them, replace the “s” in “hospitalised” with a “z,” change the “died” column to “outcome” and recode the yesses and nos accordingly, add a “died” and “survived” column and codes yesses as 1s and nos as 0s, and reorder the columns of the data set to my liking. After this preliminary cleaning, I will do further cleaning before each visualization to create the data set needed for that visualization.
Coding
Loading libraries and the data set
# Loading in the necessary librarieslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)# Loading in the data setssetwd("/Users/emiliodifilippantonio/Desktop/DATA 110/DATA 110 Working Directory")suice <-read_csv("suicide_china.csv")
New names:
Rows: 2571 Columns: 12
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(7): Hospitalised, Died, Urban, Sex, Education, Occupation, method dbl (5):
...1, Person_ID, Year, Month, Age
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Cleaning the data set
First, we’ll clean the data set by getting rid of extra columns, adding new columns, and changing existing columns.
# The first two columns are just index numbers, so I'll remove themsuice2 <-select(suice, 3:12)# Creating a numeric column for yearsuice2 <-mutate(suice2, year_plus = Year + (Month -1)/12)# Renaming the columns to remove spaces and capital lettersnames(suice2) <-tolower(names(suice2))names(suice2) <-gsub(" ", "_", names(suice2))# Renaming hospitalized to have a z instead of an ssuice3 <-rename(suice2, c("hospitalized"="hospitalised"))# Renaming died to outcome and changing the factorssuice3 <-rename(suice3, c("outcome"="died"))suice3 <-mutate(suice3, outcome =case_when( outcome =="yes"~"died", outcome =="no"~"survived"))# Creating a column for died using numeric valuessuice3 <-mutate(suice3, died =case_when( outcome =="died"~1, outcome =="survived"~0))# Creating a column for survived using numeric valuessuice3 <-mutate(suice3, survived =case_when( outcome =="died"~0, outcome =="survived"~1))# Reordering the columnssuice3 <- suice3[, c("hospitalized", "outcome", "died", "survived", "urban", "year_plus", "year", "month", "sex", "age", "education", "occupation", "method")]
Now that we have a basic cleaned data set, let’s start on our first visualization.
First Visualization
We will be exploring the relationship between success rates of suicides both in urban and in rural environments and time. I will group the data set by year, outcome, and residence, remove missing data points (as they will only interfere with the visualization and won’t be useful in this data set in any way), add four missing rows (all months in which either nobody died or survived urban suicide attempts), and arrange the data so the four new rows go where they belong.
# Creating the data set for my first visualizationsuice_urb3 <- suice3 |>group_by(year_plus, urban, outcome) |> dplyr::summarize(total =n())
`summarise()` has grouped output by 'year_plus', 'urban'. You can override
using the `.groups` argument.
# Removing unknowns for urban and outcome (this data set will only be used for the first visualization and having objects with an unknown for urban or outcome will interfere with the visualization)suice_urb4 <-filter(suice_urb3, outcome !="unknown"& urban !="unknown")# Adding missing variablessuice_urb4[nrow(suice_urb4) +1,] <-list(2009.917, "yes", "died", 0)suice_urb4[nrow(suice_urb4) +1,] <-list(2010.917, "yes", "survived", 0)suice_urb4[nrow(suice_urb4) +1,] <-list(2011.167, "yes", "died", 0)suice_urb4[nrow(suice_urb4) +1,] <-list(2011.917, "yes", "survived", 0)# Reordering the data setsuice_urb4 <-arrange(suice_urb4, year_plus, urban, outcome)
I will the then change the “urban” column header to “residence,” capitalize the variables in the “residence” and “outcome” columns, create an “Outcome_by_Residence” column, and calculate the proportion of suicide attempts resulting in that outcome out of all suicide attempts in that month in that residence.
# Renaming urban to residencesuice_urb5 <-rename(suice_urb4, c("residence"="urban"))# changing the yes' and no to urban and rural, respectivelysuice_urb5$residence <-c("Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban")# Capitalizing died and survivedsuice_urb5$outcome <-c("Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived")# Adding the last columnsuice_urb5 <-mutate(suice_urb5, Outcome_by_Residence =case_when( residence =="Rural"& outcome =="Died"~"Rural Died", residence =="Rural"& outcome =="Survived"~"Rural Survived", residence =="Urban"& outcome =="Died"~"Urban Died", residence =="Urban"& outcome =="Survived"~"Urban Survived"))suice_urb5 <-mutate(suice_urb5, Proportion_by_Residence =case_when( year_plus ==2009.000& residence =="Rural"~ total /46,round(year_plus, 2) ==2009.08& residence =="Rural"~ total /46,round(year_plus, 2) ==2009.17& residence =="Rural"~ total /55, year_plus ==2009.250& residence =="Rural"~ total /48,round(year_plus, 2) ==2009.33& residence =="Rural"~ total /65,round(year_plus, 2) ==2009.42& residence =="Rural"~ total /79, year_plus ==2009.500& residence =="Rural"~ total /58,round(year_plus, 2) ==2009.58& residence =="Rural"~ total /52,round(year_plus, 2) ==2009.67& residence =="Rural"~ total /71, year_plus ==2009.750& residence =="Rural"~ total /49,round(year_plus, 2) ==2009.83& residence =="Rural"~ total /36,round(year_plus, 2) ==2009.92& residence =="Rural"~ total /46, year_plus ==2010.000& residence =="Rural"~ total /68,round(year_plus, 2) ==2010.08& residence =="Rural"~ total /73,round(year_plus, 2) ==2010.17& residence =="Rural"~ total /64, year_plus ==2010.250& residence =="Rural"~ total /71,round(year_plus, 2) ==2010.33& residence =="Rural"~ total /71,round(year_plus, 2) ==2010.42& residence =="Rural"~ total /89, year_plus ==2010.500& residence =="Rural"~ total /75,round(year_plus, 2) ==2010.58& residence =="Rural"~ total /79,round(year_plus, 2) ==2010.67& residence =="Rural"~ total /75, year_plus ==2010.750& residence =="Rural"~ total /60,round(year_plus, 2) ==2010.83& residence =="Rural"~ total /47,round(year_plus, 2) ==2010.92& residence =="Rural"~ total /33, year_plus ==2011.000& residence =="Rural"~ total /55,round(year_plus, 2) ==2011.08& residence =="Rural"~ total /62,round(year_plus, 2) ==2011.17& residence =="Rural"~ total /49, year_plus ==2011.250& residence =="Rural"~ total /55,round(year_plus, 2) ==2011.33& residence =="Rural"~ total /86,round(year_plus, 2) ==2011.42& residence =="Rural"~ total /79, year_plus ==2011.500& residence =="Rural"~ total /73,round(year_plus, 2) ==2011.58& residence =="Rural"~ total /67,round(year_plus, 2) ==2011.67& residence =="Rural"~ total /64, year_plus ==2011.750& residence =="Rural"~ total /75,round(year_plus, 2) ==2011.83& residence =="Rural"~ total /45,round(year_plus, 2) ==2011.92& residence =="Rural"~ total /47, year_plus ==2009.000& residence =="Urban"~ total /6,round(year_plus, 2) ==2009.08& residence =="Urban"~ total /8,round(year_plus, 2) ==2009.17& residence =="Urban"~ total /3, year_plus ==2009.250& residence =="Urban"~ total /8,round(year_plus, 2) ==2009.33& residence =="Urban"~ total /12,round(year_plus, 2) ==2009.42& residence =="Urban"~ total /6, year_plus ==2009.500& residence =="Urban"~ total /9,round(year_plus, 2) ==2009.58& residence =="Urban"~ total /8,round(year_plus, 2) ==2009.67& residence =="Urban"~ total /9, year_plus ==2009.750& residence =="Urban"~ total /7,round(year_plus, 2) ==2009.83& residence =="Urban"~ total /4,round(year_plus, 2) ==2009.92& residence =="Urban"~ total /2, year_plus ==2010.000& residence =="Urban"~ total /13,round(year_plus, 2) ==2010.08& residence =="Urban"~ total /9,round(year_plus, 2) ==2010.17& residence =="Urban"~ total /11, year_plus ==2010.250& residence =="Urban"~ total /7,round(year_plus, 2) ==2010.33& residence =="Urban"~ total /12,round(year_plus, 2) ==2010.42& residence =="Urban"~ total /9, year_plus ==2010.500& residence =="Urban"~ total /13,round(year_plus, 2) ==2010.58& residence =="Urban"~ total /9,round(year_plus, 2) ==2010.67& residence =="Urban"~ total /8, year_plus ==2010.750& residence =="Urban"~ total /10,round(year_plus, 2) ==2010.83& residence =="Urban"~ total /7,round(year_plus, 2) ==2010.92& residence =="Urban"~ total /5, year_plus ==2011.000& residence =="Urban"~ total /8,round(year_plus, 2) ==2011.08& residence =="Urban"~ total /5,round(year_plus, 2) ==2011.17& residence =="Urban"~ total /2, year_plus ==2011.250& residence =="Urban"~ total /10,round(year_plus, 2) ==2011.33& residence =="Urban"~ total /7,round(year_plus, 2) ==2011.42& residence =="Urban"~ total /9, year_plus ==2011.500& residence =="Urban"~ total /6,round(year_plus, 2) ==2011.58& residence =="Urban"~ total /9,round(year_plus, 2) ==2011.67& residence =="Urban"~ total /7, year_plus ==2011.750& residence =="Urban"~ total /6,round(year_plus, 2) ==2011.83& residence =="Urban"~ total /11,round(year_plus, 2) ==2011.92& residence =="Urban"~ total /2,))
Now that we have the proper data set, let’s do some analysis. First, will make a model to predict the success rate of urban suicides based on the year as a decimal.
# Filtering exclusively for successful urban suicides for each monthsuice_urb_stat5 <-filter(suice_urb5, Outcome_by_Residence =="Urban Died")# Creating and analyzing the linear modelsummary(lm(Proportion_by_Residence ~ year_plus, data = suice_urb_stat5))
Call:
lm(formula = Proportion_by_Residence ~ year_plus, data = suice_urb_stat5)
Residuals:
Min 1Q Median 3Q Max
-0.44364 -0.14690 -0.03273 0.07007 0.60468
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -272.51521 91.20230 -2.988 0.00518 **
year_plus 0.13578 0.04536 2.993 0.00511 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2356 on 34 degrees of freedom
Multiple R-squared: 0.2085, Adjusted R-squared: 0.1853
F-statistic: 8.959 on 1 and 34 DF, p-value: 0.005115
With a p-value of 0.005115, we can conclude at a 5% significance level that there is a correlation between the year and the succes rate of urban suicides. Our model to predict the succes rate of urban suicides based on the year is the following:
Predicted success rate of urban suicides = -272.51521 + 0.13578(year as a decimal)
This means that every year, the predicted succes rate of urban suicides in Shandong, China goes up by 13.578 percentage points. Our adjusted r-squared value of 0.1853, however, lets us know that our model can only explain 18.53% of the variation in the data, and is therefore not a good predictive model.
Let’s look at some diagnostic plots.
ggplot2::autoplot(lm(Proportion_by_Residence ~ year_plus, data = suice_urb_stat5))
According to these plots, data points 3, 12, 24, and 31 seem to be outliers that could be harming our model, though overall, it doesn’t seem like our model failed because it was linear. We likely need more explanatory variables to create a better model.
Let’s take a look at successful rural suicide attempts.
# Filtering exclusively for successful rural suicides for each monthsuice_urb_stat6 <-filter(suice_urb5, Outcome_by_Residence =="Rural Died")# Creating and analyzing the linear modelsummary(lm(Proportion_by_Residence ~ year_plus, data = suice_urb_stat6))
Call:
lm(formula = Proportion_by_Residence ~ year_plus, data = suice_urb_stat6)
Residuals:
Min 1Q Median 3Q Max
-0.183076 -0.073779 0.002881 0.065371 0.288588
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -35.78528 38.93394 -0.919 0.365
year_plus 0.01806 0.01937 0.933 0.358
Residual standard error: 0.1006 on 34 degrees of freedom
Multiple R-squared: 0.02494, Adjusted R-squared: -0.00374
F-statistic: 0.8696 on 1 and 34 DF, p-value: 0.3576
With a p=value of 0.3576, we can conclude at a 5% significance level that there is no correlation between the year and the rate of successful rural suicide attempts in Shandong, China. If, however, there was a correlation, the model would be as follows:
Predicted success rate of rural suicides = -35.78528 + 0.01806(year as a decimal)
Our adjusted r-squared value of -0.00374 further confirms that our model is bad, as it means that -0.374% (?!?!) of the variation in the data is explained by our model.
Let’s look at the diagnostic plots.
autoplot(lm(Proportion_by_Residence ~ year_plus, data = suice_urb_stat6))
According to the diagnostic plots, data points 3, 9, 24, and 32 may be outliers that are affecting our model, though once again, the linearity of the model doesn’t seem to be the limiting factor of its accuracy.
Let’s visualize the data.
# Exporting the data set to use in Tableauwrite.csv(suice_urb5,"/Users/emiliodifilippantonio/Desktop/DATA 110/DATA 110 Working Directory/suice_urb5.csv", row.names =FALSE)
As we can see in the visualization, it does seem that urban suicide success rates are increasing over time, though not-at-all in a linear manner. The success rate of rural suicides doesn’t seem to be increasing or decreasing on average over time. I have noticed two things, though: firstly, that urban suicide success rates tend to be lower than rural suicide success rate, likely because people in the city are closer to hospitals; and secondly, that suicide success rates spike around the new year, possibly because people mey be more depressed at the beginning of a new year and therefore use more lethal methods.
Second visualization
Let’s look at the relationship between sex, methods, and outcomes in suicide attempts in Shandong, China from 2009 through 2011.
# Creating the data set for my second plotsuice_meth3 <- suice3 |>group_by(method, sex, outcome) |> dplyr::summarize(total =n())
`summarise()` has grouped output by 'method', 'sex'. You can override using the
`.groups` argument.
# Removing an extra rowsuice_meth3 <- suice_meth3[-19,]
# Duplicating the data set to wrangle itsuice_meth4 <- suice_meth3# Creating a variable by which to color the datasuice_meth4 <-mutate(suice_meth4, Outcome_by_Sex =case_when( sex =="male"& outcome =="died"~"Male Died", sex =="female"& outcome =="died"~"Female Died", sex =="male"& outcome =="survived"~"Male Survived", sex =="female"& outcome =="survived"~"Female Survived"))# Adding two missing rows (no male or female survived a suicide attempt by drowning, so it didn't appear in the suice_meth3 data set)suice_meth4[nrow(suice_meth4) +1,] <-list("Drowning", "female", "survived", 0, "Female Survived")suice_meth4[nrow(suice_meth4) +1,] <-list("Drowning", "male", "survived", 0, "Male Survived")# Reordering the data setsuice_meth4 <-arrange(suice_meth4, method, sex, outcome)
# Exporting the data set to use in Tableauwrite.csv(suice_meth5,"/Users/emiliodifilippantonio/Desktop/DATA 110/DATA 110 Working Directory/suice_meth5.csv", row.names =FALSE)
As we can see here, pesticide is by far the most common method of attempted suicide, distantly followed in second by hanging. We can also see that pesticide seem to be preferred by females, while males prefer hanging. This may be because hanging is viewed as a more violent death. Poison is also much less effective than other methods. Drowning is the deadliest method, with a 100% mortality rate, followed by hanging, which has over a 97% mortality rate. Pesticide is very ineffective, with over 50% of attempts being unsuccessful.
Third visualization
# Duplicating the data set to further wrangle itsuice_age3 <- suice3# Capitalizing the methodssuice_age3 <-mutate(suice_age3, method =case_when( method =="Cutting"~"Cutting", method =="Drowning"~"Drowning", method =="Hanging"~"Hanging", method =="Jumping"~"Jumping", method =="Other poison"~"Other Poison", method =="Pesticide"~"Pesticide", method =="Poison unspec"~"Unspecified Poison", method =="unspecified"~"Unspecified Method", method =="Others"~"Unspecified Method"))suice_age3 <-select(suice_age3, age, method)
# Exporting the data set to use in Tableauwrite.csv(suice_age3,"/Users/emiliodifilippantonio/Desktop/DATA 110/DATA 110 Working Directory/suice_age3.csv", row.names =FALSE)
For the most part, there doesn’t seem to be much strong correlation between method type and age. However, it does seem that cutting is preferred by younger people and hanging os preferred by older people. This surprised me, as I thought that poison would be preferred by older people. It is by far the easiest method. Poison and drowning were preferred by a wide age range, while jumping was slightly more restricted to those in the middle.
Conclusion
We learned several interesting things through the exploration of this data set, and hopefully this information can be used to save lives. Suicide is a preventable cause of death, and society should do everything in its power to prevent it.
References
Sun, Jiandong et al. (2014). Data from: Incidence and fatality of serious suicide attempts in a predominantly rural population in Shandong, China: a public health surveillance study [Dataset]. Dryad. https://doi.org/10.5061/dryad.r0v35
World Health Organization. “Suicide.” Suicide, 28 Aug. 2023, www.who.int/news-room/fact-sheets/detail/suicide.