DATA 110 Final Project 2

Author

Emilio Difilippantonio

Final Project

Shandong, China

Source: Britannica

For my final project, I will be exploring the “Suicide Attempts in Shandong, China” dataset available on Kaggle and collected in a study by Sun et al. (Sun et al.). According to the research article, “linked data from two public health surveillance systems were analysed,” though there is no explanation of how the data for those data sets were collected. I will be exploring several aspects of this data, including the relationship between mortality rates and residence over time; the relationship between sex, methods, and outcomes; and the relationship between age a suicide methods, all over the course of the three-year study (2009 through 2011). The data set contains the age (numeric), sex (categorical), education level (categorical), and occupation (categorical) of the individuals, as well as year (discrete), month (discrete), and method (categorical) of their suicide attempt, whether or not they were hospitalized (categorical) or died (categorical), and whether or not they lived in an urban environment (categorical). I decided to explore this data set because suicide is a major global leading cause of death, ranking as the fourth highest cause of death for people between the ages of 15 and 29 years old (World Health Organization).

In order to clean this data set, I will first exclude extra index columns, add a continuous variable for year (using year and month), change the column headers to lowercase and remove spaces in them, replace the “s” in “hospitalised” with a “z,” change the “died” column to “outcome” and recode the yesses and nos accordingly, add a “died” and “survived” column and codes yesses as 1s and nos as 0s, and reorder the columns of the data set to my liking. After this preliminary cleaning, I will do further cleaning before each visualization to create the data set needed for that visualization.

Coding

Loading libraries and the data set

# Loading in the necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)

# Loading in the data sets
setwd("/Users/emiliodifilippantonio/Desktop/DATA 110/DATA 110 Working Directory")
suice <- read_csv("suicide_china.csv")
New names:
Rows: 2571 Columns: 12
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(7): Hospitalised, Died, Urban, Sex, Education, Occupation, method dbl (5):
...1, Person_ID, Year, Month, Age
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

Cleaning the data set

First, we’ll clean the data set by getting rid of extra columns, adding new columns, and changing existing columns.

# The first two columns are just index numbers, so I'll remove them
suice2 <- select(suice, 3:12)

# Creating a numeric column for year
suice2 <- mutate(suice2,  year_plus =
                   Year + (Month - 1)/12)

# Renaming the columns to remove spaces and capital letters
names(suice2) <- tolower(names(suice2))
names(suice2) <- gsub(" ", "_", names(suice2))

# Renaming hospitalized to have a z instead of an s
suice3 <- rename(suice2, c("hospitalized" = "hospitalised"))

# Renaming died to outcome and changing the factors
suice3 <- rename(suice3, c("outcome" = "died"))
suice3 <- mutate(suice3, outcome = case_when(
  outcome == "yes" ~ "died",
  outcome == "no" ~ "survived"
))

# Creating a column for died using numeric values
suice3 <- mutate(suice3, died = case_when(
  outcome == "died" ~ 1,
  outcome == "survived" ~ 0
))

# Creating a column for survived using numeric values
suice3 <- mutate(suice3, survived = case_when(
  outcome == "died" ~ 0,
  outcome == "survived" ~ 1
))

# Reordering the columns
suice3 <- suice3[, c("hospitalized", "outcome", "died", "survived", "urban", "year_plus", "year", "month", "sex", "age", "education", "occupation", "method")]

Now that we have a basic cleaned data set, let’s start on our first visualization.

First Visualization

We will be exploring the relationship between success rates of suicides both in urban and in rural environments and time. I will group the data set by year, outcome, and residence, remove missing data points (as they will only interfere with the visualization and won’t be useful in this data set in any way), add four missing rows (all months in which either nobody died or survived urban suicide attempts), and arrange the data so the four new rows go where they belong.

# Creating the data set for my first visualization
suice_urb3 <- suice3 |> group_by(year_plus, urban, outcome) |> dplyr::summarize(total = n())
`summarise()` has grouped output by 'year_plus', 'urban'. You can override
using the `.groups` argument.
# Removing unknowns for urban and outcome (this data set will only be used for the first visualization and having objects with an unknown for urban or outcome will interfere with the visualization)
suice_urb4 <- filter(suice_urb3, outcome != "unknown" & urban != "unknown")

# Adding missing variables
suice_urb4[nrow(suice_urb4) + 1,] <- list(2009.917, "yes", "died", 0)
suice_urb4[nrow(suice_urb4) + 1,] <- list(2010.917, "yes", "survived", 0)
suice_urb4[nrow(suice_urb4) + 1,] <- list(2011.167, "yes", "died", 0)
suice_urb4[nrow(suice_urb4) + 1,] <- list(2011.917, "yes", "survived", 0)

# Reordering the data set
suice_urb4 <- arrange(suice_urb4, year_plus, urban, outcome)

I will the then change the “urban” column header to “residence,” capitalize the variables in the “residence” and “outcome” columns, create an “Outcome_by_Residence” column, and calculate the proportion of suicide attempts resulting in that outcome out of all suicide attempts in that month in that residence.

# Renaming urban to residence
suice_urb5 <- rename(suice_urb4, c("residence" = "urban"))

# changing the yes' and no to urban and rural, respectively
suice_urb5$residence <- c("Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban", "Rural", "Rural", "Urban", "Urban")

# Capitalizing died and survived
suice_urb5$outcome <- c("Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived", "Died", "Survived")

# Adding the last column
suice_urb5 <- mutate(suice_urb5, Outcome_by_Residence = case_when(
  residence == "Rural" & outcome == "Died" ~ "Rural Died",
  residence == "Rural" & outcome == "Survived" ~ "Rural Survived",
  residence == "Urban" & outcome == "Died" ~ "Urban Died",
  residence == "Urban" & outcome == "Survived" ~ "Urban Survived"
))

suice_urb5 <- mutate(suice_urb5, Proportion_by_Residence = case_when(
  year_plus == 2009.000 & residence == "Rural" ~ total / 46,
  round(year_plus, 2) == 2009.08 & residence == "Rural" ~ total / 46,
  round(year_plus, 2) == 2009.17 & residence == "Rural" ~ total / 55,
  year_plus == 2009.250 & residence == "Rural" ~ total / 48,
  round(year_plus, 2) == 2009.33 & residence == "Rural" ~ total / 65,
  round(year_plus, 2) == 2009.42 & residence == "Rural" ~ total / 79,
  year_plus == 2009.500 & residence == "Rural" ~ total / 58,
  round(year_plus, 2) == 2009.58 & residence == "Rural" ~ total / 52,
  round(year_plus, 2) == 2009.67 & residence == "Rural" ~ total / 71,
  year_plus == 2009.750 & residence == "Rural" ~ total / 49,
  round(year_plus, 2) == 2009.83 & residence == "Rural" ~ total / 36,
  round(year_plus, 2) == 2009.92 & residence == "Rural" ~ total / 46,
  year_plus == 2010.000 & residence == "Rural" ~ total / 68,
  round(year_plus, 2) == 2010.08 & residence == "Rural" ~ total / 73,
  round(year_plus, 2) == 2010.17 & residence == "Rural" ~ total / 64,
  year_plus == 2010.250 & residence == "Rural" ~ total / 71,
  round(year_plus, 2) == 2010.33 & residence == "Rural" ~ total / 71,
  round(year_plus, 2) == 2010.42 & residence == "Rural" ~ total / 89,
  year_plus == 2010.500 & residence == "Rural" ~ total / 75,
  round(year_plus, 2) == 2010.58 & residence == "Rural" ~ total / 79,
  round(year_plus, 2) == 2010.67 & residence == "Rural" ~ total / 75,
  year_plus == 2010.750 & residence == "Rural" ~ total / 60,
  round(year_plus, 2) == 2010.83 & residence == "Rural" ~ total / 47,
  round(year_plus, 2) == 2010.92 & residence == "Rural" ~ total / 33,
  year_plus == 2011.000 & residence == "Rural" ~ total / 55,
  round(year_plus, 2) == 2011.08 & residence == "Rural" ~ total / 62,
  round(year_plus, 2) == 2011.17 & residence == "Rural" ~ total / 49,
  year_plus == 2011.250 & residence == "Rural" ~ total / 55,
  round(year_plus, 2) == 2011.33 & residence == "Rural" ~ total / 86,
  round(year_plus, 2) == 2011.42 & residence == "Rural" ~ total / 79,
  year_plus == 2011.500 & residence == "Rural" ~ total / 73,
  round(year_plus, 2) == 2011.58 & residence == "Rural" ~ total / 67,
  round(year_plus, 2) == 2011.67 & residence == "Rural" ~ total / 64,
  year_plus == 2011.750 & residence == "Rural" ~ total / 75,
  round(year_plus, 2) == 2011.83 & residence == "Rural" ~ total / 45,
  round(year_plus, 2) == 2011.92 & residence == "Rural" ~ total / 47,
  year_plus == 2009.000 & residence == "Urban" ~ total / 6,
  round(year_plus, 2) == 2009.08 & residence == "Urban" ~ total / 8,
  round(year_plus, 2) == 2009.17 & residence == "Urban" ~ total / 3,
  year_plus == 2009.250 & residence == "Urban" ~ total / 8,
  round(year_plus, 2) == 2009.33 & residence == "Urban" ~ total / 12,
  round(year_plus, 2) == 2009.42 & residence == "Urban" ~ total / 6,
  year_plus == 2009.500 & residence == "Urban" ~ total / 9,
  round(year_plus, 2) == 2009.58 & residence == "Urban" ~ total / 8,
  round(year_plus, 2) == 2009.67 & residence == "Urban" ~ total / 9,
  year_plus == 2009.750 & residence == "Urban" ~ total / 7,
  round(year_plus, 2) == 2009.83 & residence == "Urban" ~ total / 4,
  round(year_plus, 2) == 2009.92 & residence == "Urban" ~ total / 2,
  year_plus == 2010.000 & residence == "Urban" ~ total / 13,
  round(year_plus, 2) == 2010.08 & residence == "Urban" ~ total / 9,
  round(year_plus, 2) == 2010.17 & residence == "Urban" ~ total / 11,
  year_plus == 2010.250 & residence == "Urban" ~ total / 7,
  round(year_plus, 2) == 2010.33 & residence == "Urban" ~ total / 12,
  round(year_plus, 2) == 2010.42 & residence == "Urban" ~ total / 9,
  year_plus == 2010.500 & residence == "Urban" ~ total / 13,
  round(year_plus, 2) == 2010.58 & residence == "Urban" ~ total / 9,
  round(year_plus, 2) == 2010.67 & residence == "Urban" ~ total / 8,
  year_plus == 2010.750 & residence == "Urban" ~ total / 10,
  round(year_plus, 2) == 2010.83 & residence == "Urban" ~ total / 7,
  round(year_plus, 2) == 2010.92 & residence == "Urban" ~ total / 5,
  year_plus == 2011.000 & residence == "Urban" ~ total / 8,
  round(year_plus, 2) == 2011.08 & residence == "Urban" ~ total / 5,
  round(year_plus, 2) == 2011.17 & residence == "Urban" ~ total / 2,
  year_plus == 2011.250 & residence == "Urban" ~ total / 10,
  round(year_plus, 2) == 2011.33 & residence == "Urban" ~ total / 7,
  round(year_plus, 2) == 2011.42 & residence == "Urban" ~ total / 9,
  year_plus == 2011.500 & residence == "Urban" ~ total / 6,
  round(year_plus, 2) == 2011.58 & residence == "Urban" ~ total / 9,
  round(year_plus, 2) == 2011.67 & residence == "Urban" ~ total / 7,
  year_plus == 2011.750 & residence == "Urban" ~ total / 6,
  round(year_plus, 2) == 2011.83 & residence == "Urban" ~ total / 11,
  round(year_plus, 2) == 2011.92 & residence == "Urban" ~ total / 2,
))

Now that we have the proper data set, let’s do some analysis. First, will make a model to predict the success rate of urban suicides based on the year as a decimal.

# Filtering exclusively for successful urban suicides for each month
suice_urb_stat5 <- filter(suice_urb5, Outcome_by_Residence == "Urban Died")

# Creating and analyzing the linear model
summary(lm(Proportion_by_Residence ~ year_plus, data = suice_urb_stat5))

Call:
lm(formula = Proportion_by_Residence ~ year_plus, data = suice_urb_stat5)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.44364 -0.14690 -0.03273  0.07007  0.60468 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -272.51521   91.20230  -2.988  0.00518 **
year_plus      0.13578    0.04536   2.993  0.00511 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2356 on 34 degrees of freedom
Multiple R-squared:  0.2085,    Adjusted R-squared:  0.1853 
F-statistic: 8.959 on 1 and 34 DF,  p-value: 0.005115

With a p-value of 0.005115, we can conclude at a 5% significance level that there is a correlation between the year and the succes rate of urban suicides. Our model to predict the succes rate of urban suicides based on the year is the following:

Predicted success rate of urban suicides = -272.51521 + 0.13578(year as a decimal)

This means that every year, the predicted succes rate of urban suicides in Shandong, China goes up by 13.578 percentage points. Our adjusted r-squared value of 0.1853, however, lets us know that our model can only explain 18.53% of the variation in the data, and is therefore not a good predictive model.

Let’s look at some diagnostic plots.

ggplot2::autoplot(lm(Proportion_by_Residence ~ year_plus, data = suice_urb_stat5))

According to these plots, data points 3, 12, 24, and 31 seem to be outliers that could be harming our model, though overall, it doesn’t seem like our model failed because it was linear. We likely need more explanatory variables to create a better model.

Let’s take a look at successful rural suicide attempts.

# Filtering exclusively for successful rural suicides for each month
suice_urb_stat6 <- filter(suice_urb5, Outcome_by_Residence == "Rural Died")

# Creating and analyzing the linear model
summary(lm(Proportion_by_Residence ~ year_plus, data = suice_urb_stat6))

Call:
lm(formula = Proportion_by_Residence ~ year_plus, data = suice_urb_stat6)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.183076 -0.073779  0.002881  0.065371  0.288588 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -35.78528   38.93394  -0.919    0.365
year_plus     0.01806    0.01937   0.933    0.358

Residual standard error: 0.1006 on 34 degrees of freedom
Multiple R-squared:  0.02494,   Adjusted R-squared:  -0.00374 
F-statistic: 0.8696 on 1 and 34 DF,  p-value: 0.3576

With a p=value of 0.3576, we can conclude at a 5% significance level that there is no correlation between the year and the rate of successful rural suicide attempts in Shandong, China. If, however, there was a correlation, the model would be as follows:

Predicted success rate of rural suicides = -35.78528 + 0.01806(year as a decimal)

Our adjusted r-squared value of -0.00374 further confirms that our model is bad, as it means that -0.374% (?!?!) of the variation in the data is explained by our model.

Let’s look at the diagnostic plots.

autoplot(lm(Proportion_by_Residence ~ year_plus, data = suice_urb_stat6))

According to the diagnostic plots, data points 3, 9, 24, and 32 may be outliers that are affecting our model, though once again, the linearity of the model doesn’t seem to be the limiting factor of its accuracy.

Let’s visualize the data.

# Exporting the data set to use in Tableau
write.csv(suice_urb5,"/Users/emiliodifilippantonio/Desktop/DATA 110/DATA 110 Working Directory/suice_urb5.csv", row.names = FALSE)

Second visualization

Let’s look at the relationship between sex, methods, and outcomes in suicide attempts in Shandong, China from 2009 through 2011.

# Creating the data set for my second plot
suice_meth3 <- suice3 |> group_by(method, sex, outcome) |> dplyr::summarize(total = n())
`summarise()` has grouped output by 'method', 'sex'. You can override using the
`.groups` argument.
# Removing an extra row
suice_meth3 <- suice_meth3[-19,]
# Duplicating the data set to wrangle it
suice_meth4 <- suice_meth3

# Creating a variable by which to color the data
suice_meth4 <- mutate(suice_meth4, Outcome_by_Sex = case_when(
  sex == "male" & outcome == "died" ~ "Male Died",
  sex == "female" & outcome == "died" ~ "Female Died",
  sex == "male" & outcome == "survived" ~ "Male Survived",
  sex == "female" & outcome == "survived" ~ "Female Survived"
))

# Adding two missing rows (no male or female survived a suicide attempt by drowning, so it didn't appear in the suice_meth3 data set)
suice_meth4[nrow(suice_meth4) + 1,] <- list("Drowning", "female", "survived", 0, "Female Survived")
suice_meth4[nrow(suice_meth4) + 1,] <- list("Drowning", "male", "survived", 0, "Male Survived")

# Reordering the data set
suice_meth4 <- arrange(suice_meth4, method, sex, outcome)
# Duplicating the data set to further wrangle it
suice_meth5 <- suice_meth4

# Capitalizing the methods
suice_meth5$method <- c("Cutting", "Cutting", "Cutting", "Cutting", "Drowning", "Drowning", "Drowning", "Drowning", "Hanging", "Hanging", "Hanging", "Hanging", "Jumping", "Jumping", "Jumping", "Jumping", "Other Poison", "Other Poison", "Other Poison", "Other Poison", "Pesticide", "Pesticide", "Pesticide", "Pesticide", "Unspecified Poison", "Unspecified Poison", "Unspecified Poison", "Unspecified Poison", "Unspecified Method", "Unspecified Method", "Unspecified Method", "Unspecified Method")

# Capitalizing the sexes
suice_meth5$sex <- c("Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female", "Male", "Male")
# Exporting the data set to use in Tableau
write.csv(suice_meth5,"/Users/emiliodifilippantonio/Desktop/DATA 110/DATA 110 Working Directory/suice_meth5.csv", row.names = FALSE)

Third visualization

# Duplicating the data set to further wrangle it
suice_age3 <- suice3

# Capitalizing the methods
suice_age3 <- mutate(suice_age3, method = case_when(
  method == "Cutting" ~ "Cutting",
  method == "Drowning" ~ "Drowning",
  method == "Hanging" ~ "Hanging",
  method == "Jumping" ~ "Jumping",
  method == "Other poison" ~ "Other Poison",
  method == "Pesticide" ~ "Pesticide",
  method == "Poison unspec" ~ "Unspecified Poison",
  method == "unspecified" ~ "Unspecified Method",
  method == "Others" ~ "Unspecified Method"
))

suice_age3 <- select(suice_age3, age, method)
# Exporting the data set to use in Tableau
write.csv(suice_age3,"/Users/emiliodifilippantonio/Desktop/DATA 110/DATA 110 Working Directory/suice_age3.csv", row.names = FALSE)

Conclusion

We learned several interesting things through the exploration of this data set, and hopefully this information can be used to save lives. Suicide is a preventable cause of death, and society should do everything in its power to prevent it.

References

Sun, Jiandong et al. (2014). Data from: Incidence and fatality of serious suicide attempts in a predominantly rural population in Shandong, China: a public health surveillance study [Dataset]. Dryad. https://doi.org/10.5061/dryad.r0v35

World Health Organization. “Suicide.” Suicide, 28 Aug. 2023, www.who.int/news-room/fact-sheets/detail/suicide.