Description of Data

For this homework, I am using the Kaggle dataset Suicide Rates Overview 1985 to 2016.

I am interested in whether gender or age influences suicide rates. Here I am using the United States data from 2015, which is the latest available data.

Importing Dataset

Here we are importing the suicide data set.

pacman::p_load(Zelig,pander,texreg,lmtest,visreg,tidyverse,shiny,readr,knitr)
master <- read_csv("Desktop/master.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   sex = col_character(),
##   age = col_character(),
##   suicides_no = col_double(),
##   population = col_double(),
##   `suicides/100k pop` = col_double(),
##   `country-year` = col_character(),
##   `HDI for year` = col_double(),
##   `gdp_for_year ($)` = col_number(),
##   `gdp_per_capita ($)` = col_double(),
##   generation = col_character()
## )
suicide<-sjlabelled::remove_all_labels(master) 
#This removed haven labels, allowing me to recode my variables once removed
head(suicide)
##   country year    sex         age suicides_no population suicides.100k.pop
## 1 Albania 1987   male 15-24 years          21     312900              6.71
## 2 Albania 1987   male 35-54 years          16     308000              5.19
## 3 Albania 1987 female 15-24 years          14     289700              4.83
## 4 Albania 1987   male   75+ years           1      21800              4.59
## 5 Albania 1987   male 25-34 years           9     274300              3.28
## 6 Albania 1987 female   75+ years           1      35600              2.81
##   country.year HDI.for.year gdp_for_year.... gdp_per_capita....
## 1  Albania1987           NA       2156624900                796
## 2  Albania1987           NA       2156624900                796
## 3  Albania1987           NA       2156624900                796
## 4  Albania1987           NA       2156624900                796
## 5  Albania1987           NA       2156624900                796
## 6  Albania1987           NA       2156624900                796
##        generation
## 1    Generation X
## 2          Silent
## 3    Generation X
## 4 G.I. Generation
## 5         Boomers
## 6 G.I. Generation
suicide1=select(suicide,'country', 'year', 'sex', 'suicides_no', 'generation','HDI.for.year')
head(suicide1)
##   country year    sex suicides_no      generation HDI.for.year
## 1 Albania 1987   male          21    Generation X           NA
## 2 Albania 1987   male          16          Silent           NA
## 3 Albania 1987 female          14    Generation X           NA
## 4 Albania 1987   male           1 G.I. Generation           NA
## 5 Albania 1987   male           9         Boomers           NA
## 6 Albania 1987 female           1 G.I. Generation           NA

Select United States

S1=filter(suicide1, country=="United States")

head(S1)
##         country year    sex suicides_no      generation HDI.for.year
## 1 United States 1985   male        2177 G.I. Generation        0.841
## 2 United States 1985   male        5302 G.I. Generation        0.841
## 3 United States 1985   male        5134         Boomers        0.841
## 4 United States 1985   male        6053          Silent        0.841
## 5 United States 1985   male        4267    Generation X        0.841
## 6 United States 1985 female        2105          Silent        0.841
dim(S1)
## [1] 372   6
length(unique(S1$country))
## [1] 1

Select year 2015

S2=filter(S1, year==2015)

head(S2)
##         country year    sex suicides_no   generation HDI.for.year
## 1 United States 2015   male        3171       Silent           NA
## 2 United States 2015   male        9068      Boomers           NA
## 3 United States 2015   male       11634 Generation X           NA
## 4 United States 2015   male        5503   Millenials           NA
## 5 United States 2015   male        4359   Millenials           NA
## 6 United States 2015 female        4053 Generation X           NA

Recode age and gender groups.

Variable Index:

age:

1=Generation Z

2=Millenials

3=Generation X

4=Boomers

5=Silent

6=G.I. Generation

Gender:

0 = Male

1 = Female

suicide2<-S2%>%
 mutate(age=
          recode(generation, 'Generation Z'=1,'Millenials'=2, 'Generation X'=3,'Boomers'=4,'Silent'=5,'G.I. Generation'=6),
         gender=recode(sex, 'male'=0, 'female'=1)
        )%>%
select(age, gender,suicides_no,HDI.for.year)
head(suicide2)
##   age gender suicides_no HDI.for.year
## 1   5      0        3171           NA
## 2   4      0        9068           NA
## 3   3      0       11634           NA
## 4   2      0        5503           NA
## 5   2      0        4359           NA
## 6   3      1        4053           NA
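
The same recoding can be sketched in base R with a named lookup vector, which keeps the label-to-code mapping in one place. This is an illustrative alternative, not the code used above: the name `generation_codes` is an assumption, the mapping copies the `recode` call, and the six input values are the `generation` and `sex` entries shown in `head(S2)`.

```r
# Lookup table from generation label to numeric code (same mapping as the recode call)
generation_codes <- c("Generation Z" = 1, "Millenials" = 2, "Generation X" = 3,
                      "Boomers" = 4, "Silent" = 5, "G.I. Generation" = 6)

# The first six rows of S2, as printed by head(S2)
generation <- c("Silent", "Boomers", "Generation X",
                "Millenials", "Millenials", "Generation X")
sex <- c("male", "male", "male", "male", "male", "female")

# Index the lookup vector by label; unname() drops the labels from the result
age <- unname(generation_codes[generation])
# 0 = male, 1 = female
gender <- ifelse(sex == "female", 1, 0)

print(data.frame(age, gender))
```

The resulting codes can be compared with the `age` and `gender` columns in `head(suicide2)`.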

Calculating the mean number of suicides for each age group in 2015.

Age group 3 (Generation X) has the largest average number of suicides.

suicide2 %>%
group_by(age) %>% 
  summarize(mean_suicides_no= mean(suicides_no)) %>%
  kable()
| age | mean_suicides_no |
|----:|-----------------:|
|   1 |            206.5 |
|   2 |           3109.5 |
|   3 |           7843.5 |
|   4 |           5970.0 |
|   5 |           1855.5 |

Calculating the mean number of suicides by age group and gender.

From the table below, we can see that males account for more suicides than females in every age group. Among males, age group 3 (Generation X) has the largest average number of suicides.

 suicide2 %>%
group_by(age, gender)%>% 
  summarize(mean_suicides_no= mean(suicides_no)) %>% 
  kable()
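| age | gender | mean_suicides_no |
|----:|-------:|-----------------:|
|   1 |      0 |              255 |
|   1 |      1 |              158 |
|   2 |      0 |             4931 |
|   2 |      1 |             1288 |
|   3 |      0 |            11634 |
|   3 |      1 |             4053 |
|   4 |      0 |             9068 |
|   4 |      1 |             2872 |
|   5 |      0 |             3171 |
|   5 |      1 |              540 |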

In this table, the columns are gender (0 = male, 1 = female) and the rows are age groups. Each cell is the average number of suicides in that group.

 suicide2 %>%
group_by(age, gender) %>% 
  summarize(mean_suicides_no= mean(suicides_no)) %>% 
  spread(gender, mean_suicides_no) %>%
kable()
| age |     0 |    1 |
|----:|------:|-----:|
|   1 |   255 |  158 |
|   2 |  4931 | 1288 |
|   3 | 11634 | 4053 |
|   4 |  9068 | 2872 |
|   5 |  3171 |  540 |
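
The same wide layout can be sketched in base R with `xtabs`, starting from the ten cell means in the age-by-gender table. The data frame name `cell_means` and the male-to-female ratio are illustrative additions, not part of the original analysis.

```r
# Cell means copied from the age-by-gender table (0 = male, 1 = female)
cell_means <- data.frame(
  age    = rep(1:5, each = 2),
  gender = rep(c(0, 1), times = 5),
  mean_suicides_no = c(255, 158, 4931, 1288, 11634, 4053, 9068, 2872, 3171, 540)
)

# With one value per cell, xtabs() simply lays the values out as age x gender
wide <- xtabs(mean_suicides_no ~ age + gender, data = cell_means)
print(wide)

# Ratio of male to female average counts within each age group
ratio <- wide[, "0"] / wide[, "1"]
print(round(ratio, 2))
```

A ratio above 1 in every row means the male average exceeds the female average in every age group.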

Three regression models

model1 <- lm(suicides_no ~ age, data = suicide2)
model2 <- lm(suicides_no ~ age + gender, data = suicide2)
model3 <- lm(suicides_no ~ age*gender, data = suicide2)

Model 1

summary(model1)
## 
## Call:
## lm(formula = suicides_no ~ age, data = suicide2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4489  -2096  -1628   1480   7848 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1921.9     2468.4   0.779    0.454
## age            621.4      787.2   0.789    0.448
## 
## Residual standard error: 3664 on 10 degrees of freedom
## Multiple R-squared:  0.05865,    Adjusted R-squared:  -0.03548 
## F-statistic: 0.6231 on 1 and 10 DF,  p-value: 0.4482

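Model 1: This is a linear regression, so the coefficients are in units of suicide counts, not log odds. The intercept (1921.9) is the predicted number of suicides at age code 0, which lies outside the observed codes of 1 to 5. Each one-step increase in age group is associated with 621.4 more suicides on average, but this relationship is not statistically significant (p = 0.448).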
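
As a quick check on how to read the coefficient table, the t value and p-value for `age` can be recomputed from the printed estimate and standard error. This is a minimal sketch using the rounded numbers from `summary(model1)`; the variable names are illustrative.

```r
# Estimate and standard error for age, as printed in summary(model1)
estimate  <- 621.4
std_error <- 787.2
df_resid  <- 10   # 12 observations minus 2 estimated coefficients

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error
# Two-sided p-value from the t distribution with the residual degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

cat("t =", round(t_value, 3), " p =", round(p_value, 3), "\n")
```

A p-value above 0.05 means the age slope cannot be distinguished from zero at the usual 5% level.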

Model 2: Adding gender to see whether it is a significant predictor of suicide numbers.

summary(model2)
## 
## Call:
## lm(formula = suicides_no ~ age + gender, data = suicide2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4270.8 -1217.7   106.0   897.8  5865.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   3904.5     2279.9   1.713   0.1209  
## age            621.4      668.3   0.930   0.3767  
## gender       -3965.2     1795.9  -2.208   0.0546 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3111 on 9 degrees of freedom
## Multiple R-squared:  0.3894, Adjusted R-squared:  0.2537 
## F-statistic:  2.87 on 2 and 9 DF,  p-value: 0.1086

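Holding age group constant, the female groups have on average 3965.2 fewer suicides than the male groups. This coefficient is a difference in counts, not in log odds, and it is only marginally significant (p = 0.0546, significant at the 0.1 level but not at 0.05). The age coefficient is unchanged at 621.4 and remains non-significant (p = 0.3767).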

Model 3: Interaction between age and gender

summary(model3)
## 
## Call:
## lm(formula = suicides_no ~ age * gender, data = suicide2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4634.6 -1234.0  -199.5  1218.8  5804.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2865.8     3090.4   0.927    0.381
## age            988.0      985.5   1.002    0.345
## gender       -1887.7     4370.5  -0.432    0.677
## age:gender    -733.2     1393.7  -0.526    0.613
## 
## Residual standard error: 3244 on 8 degrees of freedom
## Multiple R-squared:  0.4098, Adjusted R-squared:  0.1885 
## F-statistic: 1.852 on 3 and 8 DF,  p-value: 0.2161

Finally, Model 3 adds an interaction between age and gender. The age coefficient (988.0) is now the increase in suicides per age group for males (gender = 0). The gender coefficient (-1887.7) is the female-minus-male difference at age code 0, and the interaction (-733.2) means the age slope for females is 733.2 lower than for males, that is 988.0 - 733.2 = 254.8. None of these coefficients is statistically significant (all p > 0.3).
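
To make the interaction concrete, the printed Model 3 coefficients can be turned into predicted counts for every age and gender combination. This is a sketch built only from the rounded coefficients in `summary(model3)`; the names `b` and `grid` are illustrative.

```r
# Model 3 coefficients as printed in summary(model3)
b <- c(intercept = 2865.8, age = 988.0, gender = -1887.7, age_gender = -733.2)

# Every combination of age code (1 to 5) and gender (0 = male, 1 = female)
grid <- expand.grid(age = 1:5, gender = c(0, 1))

# Linear predictor: intercept + age slope + gender shift + interaction
grid$predicted <- b[["intercept"]] + b[["age"]] * grid$age +
  b[["gender"]] * grid$gender + b[["age_gender"]] * grid$age * grid$gender

# Implied age slope within each gender
slope_male   <- b[["age"]]
slope_female <- b[["age"]] + b[["age_gender"]]

print(grid)
cat("male slope:", slope_male, " female slope:", slope_female, "\n")
```

Both implied slopes are positive, but the female slope is about a quarter of the male slope, which is what the negative interaction coefficient expresses.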

htmlreg(list(model1,model2,model3))
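Statistical models

|             | Model 1   | Model 2   | Model 3   |
|-------------|-----------|-----------|-----------|
| (Intercept) | 1921.89   | 3904.48   | 2865.75   |
|             | (2468.41) | (2279.88) | (3090.39) |
| age         | 621.36    | 621.36    | 987.97    |
|             | (787.17)  | (668.28)  | (985.51)  |
| gender      |           | -3965.17  | -1887.72  |
|             |           | (1795.94) | (4370.47) |
| age:gender  |           |           | -733.22   |
|             |           |           | (1393.73) |
| R2          | 0.06      | 0.39      | 0.41      |
| Adj. R2     | -0.04     | 0.25      | 0.19      |
| Num. obs.   | 12        | 12        | 12        |
| RMSE        | 3664.06   | 3110.66   | 3243.72   |

*** p < 0.001, ** p < 0.01, * p < 0.05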

From the results, age alone explains 6% of the total variation in suicide numbers (Model 1). Age and gender together explain 39% (Model 2), and adding the age-by-gender interaction raises this to 41% (Model 3).

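Looking at these results, Model 3 has the largest R^2, but R^2 never decreases when a term is added. Adjusted R^2 falls from 0.25 in Model 2 to 0.19 in Model 3, and the interaction term in Model 3 is not statistically significant (p = 0.613), so the interaction adds little beyond Model 2.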
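
The model comparison can be checked by hand from the printed R^2 values. The sketch below recomputes adjusted R^2 with the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1) and runs a partial F test for the interaction term. The helper name `adj_r2` is an assumption, and the inputs are the rounded values from the three summaries.

```r
# Adjusted R-squared for n observations and p predictors (intercept not counted)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n  <- 12
r2 <- c(model1 = 0.05865, model2 = 0.3894, model3 = 0.4098)  # from the summaries
p  <- c(model1 = 1, model2 = 2, model3 = 3)

adj <- adj_r2(r2, n, p)
print(round(adj, 4))

# Partial F test for adding the interaction to Model 2:
# 1 extra term in the numerator, 8 residual degrees of freedom in Model 3
f_stat  <- (r2[["model3"]] - r2[["model2"]]) / ((1 - r2[["model3"]]) / 8)
p_value <- pf(f_stat, df1 = 1, df2 = 8, lower.tail = FALSE)
cat("F =", round(f_stat, 3), " p =", round(p_value, 3), "\n")
```

With a single added term, the partial F test is equivalent to the t test on `age:gender` in `summary(model3)`, so its p-value should agree with the one printed there up to rounding.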

Female and Male Models

suicide2F <- suicide2 %>% filter(gender == 1)
suicide2M <- suicide2 %>% filter(gender == 0)
modelF <- lm(suicides_no ~ age, data = suicide2F) 
modelM <- lm(suicides_no ~ age, data = suicide2M)
texreg(list(modelF, modelM, model3), caption = "", custom.model.names = c("Female", "Male", "Both"), digits = 3)
## 
## \begin{table}
## \begin{center}
## \begin{tabular}{l c c c }
## \hline
##  & Female & Male & Both \\
## \hline
## (Intercept) & $978.031$    & $2865.754$   & $2865.754$   \\
##             & $(1530.208)$ & $(4093.829)$ & $(3090.386)$ \\
## age         & $254.754$    & $987.969$    & $987.969$    \\
##             & $(487.978)$  & $(1305.507)$ & $(985.513)$  \\
## gender      &              &              & $-1887.723$  \\
##             &              &              & $(4370.466)$ \\
## age:gender  &              &              & $-733.215$   \\
##             &              &              & $(1393.726)$ \\
## \hline
## R$^2$       & 0.064        & 0.125        & 0.410        \\
## Adj. R$^2$  & -0.170       & -0.093       & 0.188        \\
## Num. obs.   & 6            & 6            & 12           \\
## RMSE        & 1606.132     & 4296.951     & 3243.721     \\
## \hline
## \multicolumn{4}{l}{\scriptsize{$^{***}p<0.001$, $^{**}p<0.01$, $^*p<0.05$}}
## \end{tabular}
## \caption{}
## \label{table:coefficients}
## \end{center}
## \end{table}
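htmlreg(list(modelF,modelM), custom.model.names = c("Female", "Male"))
Statistical models

|             | Female    | Male      |
|-------------|-----------|-----------|
| (Intercept) | 978.03    | 2865.75   |
|             | (1530.21) | (4093.83) |
| age         | 254.75    | 987.97    |
|             | (487.98)  | (1305.51) |
| R2          | 0.06      | 0.13      |
| Adj. R2     | -0.17     | -0.09     |
| Num. obs.   | 6         | 6         |
| RMSE        | 1606.13   | 4296.95   |

*** p < 0.001, ** p < 0.01, * p < 0.05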

From the results, age explains 6% of the total variation in suicide numbers among females and 13% among males.

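Looking at these results, the male model has the larger R^2 (0.13 versus 0.06), but both adjusted R^2 values are negative. Neither model contains an interaction term, and in both the age coefficient is smaller than its standard error, so age is not a statistically significant predictor for either gender.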
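
The two single-gender fits are closely tied to Model 3. Because Model 3 lets both the intercept and the age slope differ by gender, its coefficients should reproduce the separate female and male lines. A minimal check, using only the coefficients printed in the texreg table (the variable names are illustrative):

```r
# Model 3 ("Both") coefficients from the texreg table
both <- c(intercept = 2865.754, age = 987.969,
          gender = -1887.723, age_gender = -733.215)

# Coefficients from the separate fits
female <- c(intercept = 978.031, age = 254.754)
male   <- c(intercept = 2865.754, age = 987.969)

# Male line: gender = 0, so only the intercept and age terms remain
implied_male <- c(intercept = both[["intercept"]], age = both[["age"]])
# Female line: gender = 1 adds the gender shift and the interaction
implied_female <- c(intercept = both[["intercept"]] + both[["gender"]],
                    age = both[["age"]] + both[["age_gender"]])

print(rbind(implied_male, male, implied_female, female))
```

The point estimates agree, but the standard errors do not: Model 3 pools the residual variance across both genders, which is why its standard error for age (985.513) differs from the male-only value (1305.507).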

Age and Gender Interaction Plot

 library(interactions)
interact_plot(model3, pred = gender, modx = age)

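Looking at this graphic, we can see that females committed far fewer suicides than males. The predicted counts are higher for older age groups, especially among males, so the model predicts the most suicides for older males in the United States in 2015. Neither the age slope nor the interaction is statistically significant, however, and the group means above peak at Generation X (age group 3) rather than in the oldest group.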

Based on the results above, a future study could ask why older males committed more suicides. It may also reveal more about mental health issues among older people.