Analytics Project

Results

We start by generating a new column, taking logarithm of the GDP per capita.

chooseCRANmirror(graphics = TRUE, ind=1)
knitr::opts_chunk$set(echo = TRUE)
setwd("~/Desktop/analytics data")
master <- read_excel("master.xlsx")
master$log_GDPpc=log(master$`gdp_per_capita ($)`)

\[\\\] Then we look at a scatter plot of suicide rate per 100k people on log GDP per capita

chooseCRANmirror(graphics = TRUE, ind=1)
knitr::opts_chunk$set(echo = TRUE)
install.packages("ggplot2")

## Installing package into '/Users/ashley/Library/R/3.6/library'
## (as 'lib' is unspecified)

## 
## The downloaded binary packages are in
##  /var/folders/tg/ndf9pfq972n0nl0yrwftf9rh0000gn/T//Rtmp0pPsaS/downloaded_packages

library(ggplot2)
ggplot(master, aes(x=log_GDPpc, y=`suicides/100k pop`))+geom_point(shape=20,size=2,color='light blue')+geom_smooth(method=lm,color=' blue')+theme_minimal()+ggtitle("Suicide rate vs GDP")

## `geom_smooth()` using formula 'y ~ x'

The plot suggests a negative relationship i.e. higher GDP growth is associated with higher suicide rates. We can confirm this with a regression:

r1=lm(`suicides/100k pop`~log_GDPpc,master)
summary(r1)

## 
## Call:
## lm(formula = `suicides/100k pop` ~ log_GDPpc, data = master)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.729 -4.484 -2.458  5.486 15.464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   28.540      7.862   3.630 0.000331 ***
## log_GDPpc     -2.054      0.771  -2.664 0.008135 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.107 on 310 degrees of freedom
## Multiple R-squared:  0.02237,    Adjusted R-squared:  0.01922 
## F-statistic: 7.095 on 1 and 310 DF,  p-value: 0.008135

The regression result would suggest that every 1 percentage point increase of the \(log~GDP~per~capita\) leads to 0.02054 less suicides per 100k people in the UK. As the p-value is 0.008135, we can reject the null hypothesis \(\beta_1=0\) and say that the log_GDPpc coefficient is significantly different from 0 at 5%.

This basic result might be subject to a number of confounding factors that might either bias the results upwards or downwards, For instance, people's education level could be an important driver of both the suicide rates and the GDP per capita growth. A better educated population may work more efficiently and result in higher GDP growth. People with higher education level may have better psychological knowledge, know more about the ways to get work life balance, resulting in a lower suicide rate. This would cause a downward bias in the coefficient estimate, implying that the effect of GDP growth on suicide rate is overestimated as it captures the effect of having higher education level, too. Alternatively, it might be the case that higher unemployment rate reduces the growth of GDP per capita and increases the suicide rate as people are more likely to feel depressed without a proper job. This would result in a downward bias in the coefficient estimate, implying that the causal effect of GDP growth on suicide rate is overestimated as it captures the effect of having lower unemployment rate, too. Hence a good idea is to introduce the potential confounding factors, i.e. unemployment rate, Gini index, sex, average household size, and secondary education enrolment as control variables. We first merge all the relevant datasets into our original data and then run a multivariate regression. \[\\\]

# Merge datasets
setwd("~/Desktop/analytics data")
unemployment <- read_excel("unemployment.xlsx")
gini <- read_excel("gini.xlsx")
hh <- read_excel("hh.xlsx")
se <- read_excel("se.xlsx")
oil <- read_excel("oil.xlsx")
inflation=read.csv("inflation.csv")
# check whether the "year" column in all datasets are of the same type 
class(master$year)

## [1] "numeric"

class(unemployment$year)

## [1] "character"

class(gini$year)

## [1] "character"

class(hh$year)

## [1] "character"

class(se$year)

## [1] "character"

class(oil$year)

## [1] "character"

class(inflation$year)

## [1] "integer"

# convert all "year" columns to numeric type
unemployment <- unemployment %>% mutate(year = as.numeric(year))
gini <- gini %>% mutate(year = as.numeric(year))
hh <- hh %>% mutate(year = as.numeric(year))
se <- se %>% mutate(year = as.numeric(year))
oil <- oil %>% mutate(year = as.numeric(year))
inflation <- inflation %>% mutate(year = as.numeric(year))
# Now merge all datasets into "master"
master=master %>% inner_join(unemployment,by="year")
master=master %>% inner_join(gini,by="year")
master=master %>% inner_join(hh,by="year")
master=master %>% inner_join(se,by="year")
master=master %>% inner_join(oil,by="year")
master=master %>% inner_join(inflation,by="year")
master=master%>%mutate(sex=factor(sex,label=c("male","female")))
r2=lm(`suicides/100k pop`~log_GDPpc+sex+unemployment+Gini_index+Average_household_size+Secondary_School_enrolment_rate,master)
summary(r2)

## 
## Call:
## lm(formula = `suicides/100k pop` ~ log_GDPpc + sex + unemployment + 
##     Gini_index + Average_household_size + Secondary_School_enrolment_rate, 
##     data = master)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3483  -0.9902   0.5394   1.6700   9.2838 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -9.58045  104.60263  -0.092    0.927    
## log_GDPpc                        -0.85477    2.60418  -0.328    0.743    
## sexfemale                         7.79822    0.62419  12.493   <2e-16 ***
## unemployment                     -0.10819    0.33453  -0.323    0.747    
## Gini_index                       -0.04497    0.37068  -0.121    0.904    
## Average_household_size           10.90709   30.59480   0.357    0.722    
## Secondary_School_enrolment_rate  -0.02075    0.18254  -0.114    0.910    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.187 on 173 degrees of freedom
##   (132 observations deleted due to missingness)
## Multiple R-squared:  0.4769, Adjusted R-squared:  0.4588 
## F-statistic: 26.29 on 6 and 173 DF,  p-value: < 2.2e-16

As we include a wider set of control variables, we see two things in particular: 1) previously significant negative relationship between \(log~GDP~per~capita\) and suicide rate has now become insignificant. 2) none of the other control variables except sex is significant. We want to test that if GDP growth is really not important in explaining suicide rate and if there is really no significant relationship between suicide rate and the other control variables. We suspect that there is multicollinearity issues between the explanatory varible and in order to test that we first inspect the correlation between our control variables, then do the F-test and compute VIF fot every x-variable. \[\\\]

# check types of the columns since correlation can only be implemented on numeric columns
str(master)

## tibble [312 × 21] (S3: tbl_df/tbl/data.frame)
##  $ country                        : chr [1:312] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
##  $ year                           : num [1:312] 1985 1985 1985 1985 1985 ...
##  $ sex                            : Factor w/ 2 levels "male","female": 2 2 2 2 1 1 2 1 1 1 ...
##  $ age                            : chr [1:312] "75+ years" "55-74 years" "35-54 years" "25-34 years" ...
##  $ suicides_no                    : num [1:312] 264 915 1208 620 678 ...
##  $ population                     : num [1:312] 1202838 5170113 6899879 3969689 6002096 ...
##  $ suicides/100k pop              : num [1:312] 21.9 17.7 17.5 15.6 11.3 ...
##  $ country-year                   : chr [1:312] "United Kingdom1985" "United Kingdom1985" "United Kingdom1985" "United Kingdom1985" ...
##  $ HDI for year                   : num [1:312] 0.753 0.753 0.753 0.753 0.753 0.753 0.753 0.753 0.753 0.753 ...
##  $ gdp_for_year ($)               : num [1:312] 4.89e+11 4.89e+11 4.89e+11 4.89e+11 4.89e+11 ...
##  $ gdp_per_capita ($)             : num [1:312] 9231 9231 9231 9231 9231 ...
##  $ generation                     : chr [1:312] "G.I. Generation" "G.I. Generation" "Silent" "Boomers" ...
##  $ log_GDPpc                      : num [1:312] 9.13 9.13 9.13 9.13 9.13 ...
##  $ unemployment                   : num [1:312] 11.4 11.4 11.4 11.4 11.4 11.4 11.4 11.4 11.4 11.4 ...
##  $ Gini_index                     : num [1:312] 29.6 29.6 29.6 29.6 29.6 29.6 29.6 29.6 29.6 29.6 ...
##  $ Average_household_size         : num [1:312] NA NA NA NA NA NA NA NA NA NA ...
##  $ Secondary_School_enrolment_rate: num [1:312] 83.6 83.6 83.6 83.6 83.6 ...
##  $ oil_price                      : num [1:312] 27.5 27.5 27.5 27.5 27.5 27.5 27.5 27.5 27.5 27.5 ...
##  $ inflation_rate                 : num [1:312] 6.07 6.07 6.07 6.07 6.07 ...
##  $ X                              : logi [1:312] NA NA NA NA NA NA ...
##  $ X.1                            : logi [1:312] NA NA NA NA NA NA ...

cor(master %>% select(log_GDPpc,unemployment,Gini_index,Average_household_size,Secondary_School_enrolment_rate))

##                                  log_GDPpc unemployment Gini_index
## log_GDPpc                        1.0000000   -0.8193791  0.8653588
## unemployment                    -0.8193791    1.0000000 -0.7648058
## Gini_index                       0.8653588   -0.7648058  1.0000000
## Average_household_size                  NA           NA         NA
## Secondary_School_enrolment_rate  0.7337892   -0.5782068  0.4842689
##                                 Average_household_size
## log_GDPpc                                           NA
## unemployment                                        NA
## Gini_index                                          NA
## Average_household_size                               1
## Secondary_School_enrolment_rate                     NA
##                                 Secondary_School_enrolment_rate
## log_GDPpc                                             0.7337892
## unemployment                                         -0.5782068
## Gini_index                                            0.4842689
## Average_household_size                                       NA
## Secondary_School_enrolment_rate                       1.0000000

linearHypothesis(r2,c("log_GDPpc=0","sexfemale=0","unemployment=0","Gini_index=0","Average_household_size=0","Secondary_School_enrolment_rate=0"))

## Linear hypothesis test
## 
## Hypothesis:
## log_GDPpc = 0
## sexfemale = 0
## unemployment = 0
## Gini_index = 0
## Average_household_size = 0
## Secondary_School_enrolment_rate = 0
## 
## Model 1: restricted model
## Model 2: `suicides/100k pop` ~ log_GDPpc + sex + unemployment + Gini_index + 
##     Average_household_size + Secondary_School_enrolment_rate
## 
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    179 5798.6                                  
## 2    173 3033.1  6    2765.5 26.289 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Calculate for VIF
r2 %>% vif()

##                       log_GDPpc                             sex 
##                        3.735700                        1.000000 
##                    unemployment                      Gini_index 
##                        1.381500                        2.838073 
##          Average_household_size Secondary_School_enrolment_rate 
##                        3.826946                        1.791867

We see that most of the control variables are highly correlated so there might be a multicollinearity issue. Our joint hypothesis test calculates that p-value is very small, meaning that we can reject the null and the coefficients are significantly different from zero. However, the VIFs are less than 5 but greater than 1, which informs us that multicollinearity may be an issue but not so problematic in this case.

There are clearly further concerns to explore. For instance, including additional control variables does not always give a better estimate of causal relationship. There could be a causal channel that goes from GDP growth via unemployment to suicide rate: i.e. higher GDP growth increases the demand for labour, reduces unemployment, and therefore results in lower suicide rate. By including unemployment we would shut down this part of the causal effect of GDP growth.

To further address the endogeneity problem, we use inflation rate and brent crude oil price as two instrumental variables for \(log~GDP~per~capita\) and \(Gini~index\) and do the following regressions:

# Fixing endogeneity with instrumental variables
iv=ivreg(`suicides/100k pop`~ log_GDPpc+Gini_index+sex+unemployment+Average_household_size+Secondary_School_enrolment_rate | inflation_rate+oil_price+sex+unemployment+Average_household_size+Secondary_School_enrolment_rate,data=master)
summary(iv)

## 
## Call:
## ivreg(formula = `suicides/100k pop` ~ log_GDPpc + Gini_index + 
##     sex + unemployment + Average_household_size + Secondary_School_enrolment_rate | 
##     inflation_rate + oil_price + sex + unemployment + Average_household_size + 
##         Secondary_School_enrolment_rate, data = master)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3709  -1.0246   0.5993   1.6956   9.3571 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -31.02378  398.50198  -0.078    0.938    
## log_GDPpc                        -0.97116    5.35803  -0.181    0.856    
## Gini_index                        0.11534    3.35336   0.034    0.973    
## sexfemale                         7.79822    0.62454  12.486   <2e-16 ***
## unemployment                     -0.07172    0.85050  -0.084    0.933    
## Average_household_size           15.88603   92.53431   0.172    0.864    
## Secondary_School_enrolment_rate   0.02855    0.99476   0.029    0.977    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.19 on 173 degrees of freedom
## Multiple R-Squared: 0.4763,  Adjusted R-squared: 0.4582 
## Wald test: 26.24 on 6 and 173 DF,  p-value: < 2.2e-16

# Checking first stage for the first endogenous variable
first=lm(Gini_index~oil_price+inflation_rate,data=master)
summary(first)

## 
## Call:
## lm(formula = Gini_index ~ oil_price + inflation_rate, data = master)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4375 -0.9892  0.1335  0.9589  2.5507 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    33.307490   0.226029 147.359  < 2e-16 ***
## oil_price       0.047127   0.003778  12.474  < 2e-16 ***
## inflation_rate -0.257925   0.047086  -5.478 8.94e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.504 on 309 degrees of freedom
## Multiple R-squared:  0.4128, Adjusted R-squared:  0.409 
## F-statistic: 108.6 on 2 and 309 DF,  p-value: < 2.2e-16

linearHypothesis(first,c("oil_price=0"))

## Linear hypothesis test
## 
## Hypothesis:
## oil_price = 0
## 
## Model 1: restricted model
## Model 2: Gini_index ~ oil_price + inflation_rate
## 
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    310 1051.58                                  
## 2    309  699.41  1    352.18 155.59 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Checking first stage for the second endogenous variable
first2=lm(log_GDPpc~oil_price+inflation_rate, data=master)
summary(first2)

## 
## Call:
## lm(formula = log_GDPpc ~ oil_price + inflation_rate, data = master)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67023 -0.11184  0.06906  0.13430  0.40419 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    10.1280262  0.0373532  271.14   <2e-16 ***
## oil_price       0.0122607  0.0006244   19.64   <2e-16 ***
## inflation_rate -0.1094714  0.0077813  -14.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2486 on 309 degrees of freedom
## Multiple R-squared:  0.6956, Adjusted R-squared:  0.6936 
## F-statistic:   353 on 2 and 309 DF,  p-value: < 2.2e-16

linearHypothesis(first2,c("inflation_rate=0"))

## Linear hypothesis test
## 
## Hypothesis:
## inflation_rate = 0
## 
## Model 1: restricted model
## Model 2: log_GDPpc ~ oil_price + inflation_rate
## 
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    310 31.336                                  
## 2    309 19.101  1    12.235 197.92 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Specification tests
summary(iv,diagnostics = TRUE)

## 
## Call:
## ivreg(formula = `suicides/100k pop` ~ log_GDPpc + Gini_index + 
##     sex + unemployment + Average_household_size + Secondary_School_enrolment_rate | 
##     inflation_rate + oil_price + sex + unemployment + Average_household_size + 
##         Secondary_School_enrolment_rate, data = master)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3709  -1.0246   0.5993   1.6956   9.3571 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -31.02378  398.50198  -0.078    0.938    
## log_GDPpc                        -0.97116    5.35803  -0.181    0.856    
## Gini_index                        0.11534    3.35336   0.034    0.973    
## sexfemale                         7.79822    0.62454  12.486   <2e-16 ***
## unemployment                     -0.07172    0.85050  -0.084    0.933    
## Average_household_size           15.88603   92.53431   0.172    0.864    
## Secondary_School_enrolment_rate   0.02855    0.99476   0.029    0.977    
## 
## Diagnostic tests:
##                               df1 df2 statistic p-value    
## Weak instruments (log_GDPpc)    2 173   146.213  <2e-16 ***
## Weak instruments (Gini_index)   2 173     2.906  0.0574 .  
## Wu-Hausman                      2 171     0.002  0.9977    
## Sargan                          0  NA        NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.19 on 173 degrees of freedom
## Multiple R-Squared: 0.4763,  Adjusted R-squared: 0.4582 
## Wald test: 26.24 on 6 and 173 DF,  p-value: < 2.2e-16

Our IV regression comes out with an estimated coefficient=-0.97 for \(log~GDP~per~capita\). Note that the effect of GDP growth becomes actually stronger when using the instrument. This suggests that it addresses an endogeneity issue arising from a positive correlation between un-observed heterogeneity and the endogenous variable; e.g. it could be the case that suicide rate becomes higher when brent inflation rate becomes higher since higher inflation signals better performance of the economy and therefore higher GDP growth.

Weak instruments test rejects the null, meaning that at least one instrument is strong, validating the use of our instruments as they also pass the F-test at first stage.

However our (Wu-)Hausman test for endogeneity comes out that p-value is not significant. We cannot reject the null that the variable of concern is uncorrelated with the error term, indicating that log_GDPpc and Gini_index may not be marginally endogenous. So there may not be a significant isuue of endogeneity in our original multivariate regression.

Further looking at our dataset, we recognized that age is divided into groups. We think that it might be worth comparing the causal effects of GDP growth on suicide rate in different age groups. Therefore we separate age into six groups and run multivariate regressions, respectively. \[\\\]

#【It would be better if we use loop?】

# separate age into different age groups and compare the differences in casaul effects
library(dplyr)
master1=master%>%mutate(age=factor(age,label=c("75+ years","55-74 years","35-54 years","25-34 years","15-24 years","5-14 years")))
# "75+ years"
master1$age=as.character(master1$age)
year1=master1%>%filter(age=="75+ years")
r3=lm(`suicides/100k pop`~log_GDPpc+sex+unemployment+Gini_index+Average_household_size+Secondary_School_enrolment_rate,year1)
summary(r3)

## 
## Call:
## lm(formula = `suicides/100k pop` ~ log_GDPpc + sex + unemployment + 
##     Gini_index + Average_household_size + Secondary_School_enrolment_rate, 
##     data = year1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0012 -0.4932 -0.1408  0.4898  1.2717 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     27.55832   45.20080   0.610   0.5480    
## log_GDPpc                       -2.40929    1.12532  -2.141   0.0431 *  
## sexfemale                        6.52733    0.26972  24.200   <2e-16 ***
## unemployment                    -0.07111    0.14456  -0.492   0.6274    
## Gini_index                      -0.10862    0.16018  -0.678   0.5045    
## Average_household_size           4.69973   13.22060   0.355   0.7255    
## Secondary_School_enrolment_rate -0.06768    0.07888  -0.858   0.3997    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7387 on 23 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.9638, Adjusted R-squared:  0.9544 
## F-statistic: 102.1 on 6 and 23 DF,  p-value: 2.084e-15

# "55-74 years"
year2=master1%>%filter(age=="55-74 years")
r4=lm(`suicides/100k pop`~log_GDPpc+sex+unemployment+Gini_index+Average_household_size+Secondary_School_enrolment_rate,year2)
summary(r4)

## 
## Call:
## lm(formula = `suicides/100k pop` ~ log_GDPpc + sex + unemployment + 
##     Gini_index + Average_household_size + Secondary_School_enrolment_rate, 
##     data = year2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.03843 -0.74893 -0.08662  0.82567  3.13903 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      22.4022    77.9391   0.287   0.7764    
## log_GDPpc                        -2.9806     1.9404  -1.536   0.1382    
## sexfemale                        12.3900     0.4651  26.641   <2e-16 ***
## unemployment                     -0.4562     0.2493  -1.830   0.0802 .  
## Gini_index                       -0.1691     0.2762  -0.612   0.5463    
## Average_household_size           15.5822    22.7961   0.684   0.5011    
## Secondary_School_enrolment_rate  -0.1556     0.1360  -1.144   0.2644    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.274 on 23 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.9694, Adjusted R-squared:  0.9614 
## F-statistic: 121.5 on 6 and 23 DF,  p-value: 3.063e-16

# "35-54 years"
year3=master1%>%filter(age=="35-54 years")
r5=lm(`suicides/100k pop`~log_GDPpc+sex+unemployment+Gini_index+Average_household_size+Secondary_School_enrolment_rate,year3)
summary(r5)

## 
## Call:
## lm(formula = `suicides/100k pop` ~ log_GDPpc + sex + unemployment + 
##     Gini_index + Average_household_size + Secondary_School_enrolment_rate, 
##     data = year3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02225 -0.45863 -0.01513  0.30943  1.32638 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      25.698345  35.017811   0.734    0.470    
## log_GDPpc                         0.448191   0.871801   0.514    0.612    
## sexfemale                        11.764667   0.208959  56.301   <2e-16 ***
## unemployment                      0.108176   0.111991   0.966    0.344    
## Gini_index                       -0.003095   0.124093  -0.025    0.980    
## Average_household_size          -12.999469  10.242219  -1.269    0.217    
## Secondary_School_enrolment_rate   0.043819   0.061108   0.717    0.481    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5723 on 23 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.9928, Adjusted R-squared:  0.9909 
## F-statistic: 529.9 on 6 and 23 DF,  p-value: < 2.2e-16

# "25-34 years"
year4=master1%>%filter(age=="25-34 years")
r6=lm(`suicides/100k pop`~log_GDPpc+sex+unemployment+Gini_index+Average_household_size+Secondary_School_enrolment_rate,year4)
summary(r6)

## 
## Call:
## lm(formula = `suicides/100k pop` ~ log_GDPpc + sex + unemployment + 
##     Gini_index + Average_household_size + Secondary_School_enrolment_rate, 
##     data = year4)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.064595 -0.045739 -0.006506  0.031459  0.089651 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     -0.224570   3.028473  -0.074    0.942
## log_GDPpc                        0.039840   0.075397   0.528    0.602
## sexfemale                        0.024000   0.018072   1.328    0.197
## unemployment                    -0.006884   0.009685  -0.711    0.484
## Gini_index                       0.007187   0.010732   0.670    0.510
## Average_household_size           0.134207   0.885786   0.152    0.881
## Secondary_School_enrolment_rate -0.006282   0.005285  -1.189    0.247
## 
## Residual standard error: 0.04949 on 23 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.3244, Adjusted R-squared:  0.1482 
## F-statistic: 1.841 on 6 and 23 DF,  p-value: 0.1352

# "15-24 years"
year5=master1%>%filter(age=="15-24 years")
r7=lm(`suicides/100k pop`~log_GDPpc+sex+unemployment+Gini_index+Average_household_size+Secondary_School_enrolment_rate,year5)
summary(r7)

## 
## Call:
## lm(formula = `suicides/100k pop` ~ log_GDPpc + sex + unemployment + 
##     Gini_index + Average_household_size + Secondary_School_enrolment_rate, 
##     data = year5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77209 -0.22960 -0.00992  0.23904  0.82983 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -10.604168  25.744942  -0.412    0.684    
## log_GDPpc                         0.016060   0.640944   0.025    0.980    
## sexfemale                         6.978000   0.153626  45.422   <2e-16 ***
## unemployment                      0.055232   0.082335   0.671    0.509    
## Gini_index                        0.059306   0.091233   0.650    0.522    
## Average_household_size            4.779799   7.530035   0.635    0.532    
## Secondary_School_enrolment_rate   0.004333   0.044926   0.096    0.924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4207 on 23 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.989,  Adjusted R-squared:  0.9861 
## F-statistic: 344.2 on 6 and 23 DF,  p-value: < 2.2e-16

# "5-14 years"
year6=master1%>%filter(age=="5-14 years")
r8=lm(`suicides/100k pop`~log_GDPpc+sex+unemployment+Gini_index+Average_household_size+Secondary_School_enrolment_rate,year6)
summary(r8)

## 
## Call:
## lm(formula = `suicides/100k pop` ~ log_GDPpc + sex + unemployment + 
##     Gini_index + Average_household_size + Secondary_School_enrolment_rate, 
##     data = year6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82475 -0.49633 -0.00738  0.45892  2.26588 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -122.31284   59.65597  -2.050  0.05190 .  
## log_GDPpc                         -0.24281    1.48519  -0.163  0.87156    
## sexfemale                          9.10533    0.35598  25.578  < 2e-16 ***
## unemployment                      -0.27834    0.19079  -1.459  0.15811    
## Gini_index                        -0.05549    0.21140  -0.262  0.79529    
## Average_household_size            53.24608   17.44853   3.052  0.00566 ** 
## Secondary_School_enrolment_rate    0.05690    0.10410   0.547  0.58995    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9749 on 23 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9593 
## F-statistic:   115 on 6 and 23 DF,  p-value: 5.598e-16

ggplot(master1,aes(x=log_GDPpc, y=`suicides/100k pop`, color=age))+ 
       geom_point() +
       theme_minimal()+
       xlab("log_GDPpc") + 
       ylab("suicides/100k pop") + 
       geom_smooth(method = "lm", se = FALSE)+
       ggtitle("Suicide Rate and GDP by age groups") +
       guides(color=guide_legend(title="Age groups"))

## `geom_smooth()` using formula 'y ~ x'

Comparing regression output for different age groups, we see that elderly people experience stronger, more significant and negative relationship between GDP per capita growth and suicide rate. This negative relationship becomes much weaker for lower age groups. This may be explained by the fact that older people are more likely being dependent on state pension. A higher GDP growth economy may imply better performance of the government and therefore more generous provision of pension and national healthcare. Consequently, older people tend to get more content with their life quality at higher growth economy and hence less likely to suicide. Youngsters are more likely to have regular income so their suicide rate may be less affected by changes in GDP per capita. However, only sex variable keep being significant in all age group.

\[\\\]

# Take control of time with a timeline
master=master%>%mutate(t=1:n())
lm(`suicides/100k pop`~log_GDPpc+t+sex+unemployment+Gini_index+Average_household_size+Secondary_School_enrolment_rate,master)%>%summary()

## 
## Call:
## lm(formula = `suicides/100k pop` ~ log_GDPpc + t + sex + unemployment + 
##     Gini_index + Average_household_size + Secondary_School_enrolment_rate, 
##     data = master)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6382  -1.2555   0.1935   2.1127   9.1299 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      424.0372   147.7428   2.870 0.004619 ** 
## log_GDPpc                         12.6412     4.1990   3.011 0.003001 ** 
## t                                 -0.1384     0.0346  -3.999 9.44e-05 ***
## sexfemale                          7.2202     0.6160  11.722  < 2e-16 ***
## unemployment                       2.4520     0.7162   3.424 0.000772 ***
## Gini_index                         0.1356     0.3584   0.378 0.705694    
## Average_household_size          -232.6520    67.6109  -3.441 0.000727 ***
## Secondary_School_enrolment_rate    0.1044     0.1779   0.587 0.557922    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.017 on 172 degrees of freedom
##   (132 observations deleted due to missingness)
## Multiple R-squared:  0.5214, Adjusted R-squared:  0.5019 
## F-statistic: 26.77 on 7 and 172 DF,  p-value: < 2.2e-16

# Testing autoregression
install.packages("urca")

## Installing package into '/Users/ashley/Library/R/3.6/library'
## (as 'lib' is unspecified)

## 
## The downloaded binary packages are in
##  /var/folders/tg/ndf9pfq972n0nl0yrwftf9rh0000gn/T//Rtmp0pPsaS/downloaded_packages

library(urca)
ur.df(master$log_GDPpc,type = "none",lags = 1)%>%summary()

## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression none 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.203932 -0.004974 -0.004753 -0.004565  0.207565 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## z.lag.1     0.0004682  0.0001810   2.586   0.0102 *
## z.diff.lag -0.0217253  0.0569938  -0.381   0.7033  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03215 on 308 degrees of freedom
## Multiple R-squared:  0.02126,    Adjusted R-squared:  0.0149 
## F-statistic: 3.345 on 2 and 308 DF,  p-value: 0.03656
## 
## 
## Value of test-statistic is: 2.5864 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau1 -2.58 -1.95 -1.62

ur.df(master$`suicides/100k pop`,type = "none",lags = 1)%>%summary()

## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression none 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7112 -0.9534  0.0740  0.8879 23.9185 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## z.lag.1    -0.17665    0.03269  -5.403 1.31e-07 ***
## z.diff.lag  0.02649    0.05650   0.469    0.639    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.395 on 308 degrees of freedom
## Multiple R-squared:  0.08961,    Adjusted R-squared:  0.0837 
## F-statistic: 15.16 on 2 and 308 DF,  p-value: 5.263e-07
## 
## 
## Value of test-statistic is: -5.4032 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau1 -2.58 -1.95 -1.62

# Getting rid of unit roots
ur.df(diff(master$log_GDPpc,1),type = "none",lags = 1)%>%summary()

## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression none 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1989  0.0000  0.0000  0.0000  0.2119 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## z.lag.1    -1.000e+00  8.071e-02  -12.39   <2e-16 ***
## z.diff.lag -4.887e-33  5.707e-02    0.00        1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03255 on 307 degrees of freedom
## Multiple R-squared:    0.5,  Adjusted R-squared:  0.4967 
## F-statistic: 153.5 on 2 and 307 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -12.3895 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau1 -2.58 -1.95 -1.62

ur.df(diff(master$`suicides/100k pop`,1),type = "none",lags = 1)%>%summary()

## 
## ############################################### 
## # Augmented Dickey-Fuller Test Unit Root Test # 
## ############################################### 
## 
## Test regression none 
## 
## 
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.233 -2.285 -1.113 -0.372 23.660 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## z.lag.1    -1.17241    0.08249 -14.213   <2e-16 ***
## z.diff.lag  0.10846    0.05670   1.913   0.0567 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.621 on 307 degrees of freedom
## Multiple R-squared:  0.5344, Adjusted R-squared:  0.5314 
## F-statistic: 176.2 on 2 and 307 DF,  p-value: < 2.2e-16
## 
## 
## Value of test-statistic is: -14.2127 
## 
## Critical values for test statistics: 
##       1pct  5pct 10pct
## tau1 -2.58 -1.95 -1.62

# Revisit log_GDP vs suicide rate
library(dplyr)
library(tidyverse)

## ─ Attaching packages ──────────────────────── tidyverse 1.3.0 ─

## ✓ tibble  3.0.3     ✓ stringr 1.4.0
## ✓ purrr   0.3.4     ✓ forcats 0.5.0

## ─ Conflicts ───────────────────────── tidyverse_conflicts() ─
## x gridExtra::combine() masks gdata::combine(), dplyr::combine()
## x dplyr::filter()      masks stats::filter()
## x gdata::first()       masks dplyr::first()
## x purrr::flatten()     masks rtweet::flatten()
## x purrr::keep()        masks gdata::keep()
## x dplyr::lag()         masks stats::lag()
## x gdata::last()        masks dplyr::last()
## x car::recode()        masks dplyr::recode()
## x purrr::some()        masks car::some()

master=master%>%arrange(year)%>%mutate(Dlog_GDPpc=log_GDPpc-dplyr::lag(log_GDPpc),Dsuicides_100kpop=`suicides/100k pop`-dplyr::lag(`suicides/100k pop`),DDlog_GDPpc=Dlog_GDPpc-dplyr::lag(Dlog_GDPpc))
lm(Dsuicides_100kpop~Dlog_GDPpc+t,master) %>% summary()

## 
## Call:
## lm(formula = Dsuicides_100kpop ~ Dlog_GDPpc + t, data = master)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.122 -1.758 -0.488  0.245 36.440 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.141734   0.556220  -2.053   0.0409 *  
## Dlog_GDPpc  93.694301   8.564123  10.940   <2e-16 ***
## t            0.003925   0.003050   1.287   0.1991    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.802 on 308 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2799, Adjusted R-squared:  0.2752 
## F-statistic: 59.85 on 2 and 308 DF,  p-value: < 2.2e-16

# Generate a new dataset to illustrate how suicide rate and GDPpc interact and change in time.
suicide_rate <- aggregate(`suicides/100k pop`~year, data = master, sum, na.rm=TRUE)
master_gdppc<-master[,c("year","log_GDPpc")]
suicide_rate<-left_join(suicide_rate,master_gdppc,by="year")
suicide_rate<-unique(suicide_rate)
plot(suicide_rate, col = "darkblue")

With concerns that time itself is a confounding factor, we take control of time by including timeline as a control variable. The results show that for every 1 percentage increase in GDP per capita, suicide rate increased by 0.13 per 100k people but this value is only weakly significant. We suspect that there is correlation between two adjacent terms in the time sequence and implement a Dickey-Fuller Test. Setting null hypothesis for gamma=0, we get t-stat for log_GDPpc=2.5864, much greater than the critical value at 5%=-1.95, therefore we cannot reject the null and conclude that the a unit root is present, i.e. data is a random walk. To address this issue, we take difference, check differenced series is not unit rooted and run the regression again. Our revised regression(without unit root) shows that for every 1 percentage increase in GDP per capita, suicide rate increased by 0.67 per 100k people. By controlling the timeline, the relationship between GDP per capita and suicide rate becomes positive. It is probably because over time people are more capable of finding work life balance and more content with life quality so they are less likely to suicide. Therefore, if we keep the time variable constant, the actual relationship between GDP per capita and suicide rate is positive. Higher the production level, heavier the work pressure, hence higher the suicide rate.

Analytics Project - UK part

Michelle

2020年11月10日

Background

Introduction

Methods and data

Results

Conclusion