#Load data and prepare dataset

##  [1] "ï..country"         "year"               "sex"               
##  [4] "age"                "suicides_no"        "population"        
##  [7] "suicides.100k.pop"  "country.year"       "HDI.for.year"      
## [10] "gdp_for_year...."   "gdp_per_capita...." "generation"
## 'data.frame':    936 obs. of  12 variables:
##  $ Country    : Factor w/ 101 levels "Albania","Antigua and Barbuda",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ year       : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ Gender     : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 2 2 2 2 ...
##  $ Age        : Factor w/ 6 levels "15-24 years",..: 1 2 3 4 5 6 1 2 3 4 ...
##  $ Suicideno  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Population : int  8537 7578 15273 8296 6085 1686 8208 6928 13325 8335 ...
##  $ Suicides   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CountryYear: Factor w/ 2321 levels "Albania1987",..: 48 48 48 48 48 48 48 48 48 48 ...
##  $ HDI        : num  0.783 0.783 0.783 0.783 0.783 0.783 0.783 0.783 0.783 0.783 ...
##  $ GDP        : Factor w/ 2321 levels "1,002,219,052,968",..: 74 74 74 74 74 74 74 74 74 74 ...
##  $ GDPpercap  : int  14093 14093 14093 14093 14093 14093 14093 14093 14093 14093 ...
##  $ Generation : Factor w/ 6 levels "Boomers","G.I. Generation",..: 5 5 3 4 1 6 5 5 3 4 ...
##                 Country         year         Gender             Age     
##  Antigua and Barbuda: 12   Min.   :2014   female:468   15-24 years:156  
##  Argentina          : 12   1st Qu.:2014   male  :468   25-34 years:156  
##  Armenia            : 12   Median :2014                35-54 years:156  
##  Australia          : 12   Mean   :2014                5-14 years :156  
##  Austria            : 12   3rd Qu.:2014                55-74 years:156  
##  Bahrain            : 12   Max.   :2014                75+ years  :156  
##  (Other)            :864                                                
##    Suicideno         Population          Suicides      
##  Min.   :    0.0   Min.   :     960   Min.   :  0.000  
##  1st Qu.:    4.0   1st Qu.:  172142   1st Qu.:  1.268  
##  Median :   29.0   Median :  525932   Median :  5.565  
##  Mean   :  238.2   Mean   : 2042796   Mean   : 11.011  
##  3rd Qu.:  126.0   3rd Qu.: 1677010   3rd Qu.: 14.178  
##  Max.   :11455.0   Max.   :41858354   Max.   :124.450  
##                                                        
##                   CountryYear       HDI              GDP      
##  Antigua and Barbuda2014: 12   Min.   :0.6270   Min.   :  74  
##  Argentina2014          : 12   1st Qu.:0.7500   1st Qu.: 580  
##  Armenia2014            : 12   Median :0.8180   Median :1198  
##  Australia2014          : 12   Mean   :0.8085   Mean   :1144  
##  Austria2014            : 12   3rd Qu.:0.8830   3rd Qu.:1720  
##  Bahrain2014            : 12   Max.   :0.9440   Max.   :2279  
##  (Other)                :864   NA's   :36                     
##    GDPpercap                Generation    Genclass        
##  Min.   :  1465   Boomers        :156   Length:936        
##  1st Qu.:  8849   G.I. Generation:  0   Class :character  
##  Median : 15950   Generation X   :156   Mode  :character  
##  Mean   : 27420   Generation Z   :156                     
##  3rd Qu.: 41869   Millenials     :312                     
##  Max.   :126352   Silent         :156                     
##                                                           
##    HDIlevel                                Region   
##  Length:936         East Asia & Pacific       : 48  
##  Class :character   Europe & Central Asia     :504  
##  Mode  :character   Latin America & Caribbean :228  
##                     Middle East & North Africa: 84  
##                     North America             : 12  
##                     Sub-Saharan Africa        : 60  
##                                                     
##               IncomeGroup 
##  High income        :528  
##  Low income         : 24  
##  Lower middle income: 84  
##  Not classified     : 12  
##  Upper middle income:288  
##                           
## 

Part 1 - Introduction

I have two research questions for this assignment.

The first is whether there is a significant difference in suicide rates between countries with High HDI scores and countries with Low HDI scores. My interest came from reading the following article from the Journal of Epidemiology and Global Health: https://www.sciencedirect.com/science/article/pii/S2210600616300430

My expectation was that countries with higher poverty levels and a less adequate social security system would have higher suicide rates. As life expectancy and other human development scores increase in a country, lower suicide rates might be expected to prevail. However, the cited research tested the relationship and found that suicide rates increase with HDI scores. The article mentions that its data is limited to high HDI countries, whereas my dataset, which is a compilation of several World Bank indicators, covers both High and Low HDI countries. My research is, however, limited to data from 2014 (the sample), and there may be significant variation between years. The sample also covers only 78 countries.

The second research question is whether Millennials tend to have a higher suicide rate than the other generations. My interest came from the following article: https://www.businessinsider.com/perfectionism-causing-more-early-deaths-and-suicides-among-millennials-2018-9

The article suggests that Millennials have a higher tendency to commit suicide due to greater pressure to be perfect. I would like to find out whether Millennials actually have a higher suicide rate in the dataset than the other generations.

There are other factors in the dataset, like GDP, which may affect the level of suicides in countries. I would also like to test whether some of the other factors present in the dataset (like Gender) can improve my model for predicting suicide rates in a particular year.

Part 2 - Data

The data was collected from Kaggle (https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016). The dataset covers 1985 to 2016 and contains socioeconomic data and suicide rates by year and country. The data is compiled from a World Bank dataset. The dataset has also been uploaded without any edits at “https://raw.githubusercontent.com/zahirf/Data606/master/master.csv”.

There are 27,820 cases in the original dataset, covering data from 1985 to 2016. I have, however, limited my research to 2014 data only. The study is observational. My variables of interest are suicides per 100k population, which is numerical; the HDI score, which is numerical; and Generation, which is a categorical variable with 6 factor levels, namely Generation X, Generation Z, Silent, G.I. Generation (which has no observations in 2014), Boomers, and Millennials. My dependent variable is suicides per 100k, and the independent variables are HDI score and Generation.

I have used another dataset from the World Bank to get the region and the income status of each country. I believe this might shed some light on whether the trends hold up in each region and income classification. https://databank.worldbank.org/data/download/site-content/CLASS.xls

Part 3 - Exploratory data analysis

Data Preparation

Some data preparation was required to keep only the 2014 data, and I renamed the columns to make them easier to understand. After these manipulations, the new dataset data2014 has 936 cases of 12 variables. I have changed some of the variables to numeric so the analysis can be done. I added 2 columns, HDIlevel and Genclass, to separate the data into High HDI and Not High HDI, and Millennials and Non Millennials, to run the regressions.
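The preparation steps described above can be sketched as follows. This is a minimal illustration on an invented toy data frame, not the code used for the report: the shortened column list, the toy values, and the `hdi_cutoff` of 0.8 are all assumptions (the report does not state the cutoff it used for High HDI). The sketch also converts the comma-formatted GDP factor through `as.character()`, because calling `as.numeric()` directly on a factor returns its level codes, which may be what produced the GDP range of 74 to 2279 in the summary above.

```r
# Toy stand-in for the Kaggle file; all values are invented for illustration.
# When reading the real CSV, read.csv(url, fileEncoding = "UTF-8-BOM") usually
# avoids a first column name like "ï..country" caused by a byte-order mark.
raw <- data.frame(
  country           = c("A", "A", "B", "B", "C", "C"),
  year              = c(2013, 2014, 2014, 2014, 2014, 2014),
  suicides.100k.pop = c(3.1, 4.2, 10.5, 0.8, 6.3, 7.7),
  HDI.for.year      = c(0.70, 0.71, 0.88, 0.88, NA, NA),
  gdp_for_year      = factor(c("1,200", "1,300", "45,000", "45,000", "9,500", "9,500")),
  generation        = c("Silent", "Millenials", "Boomers", "Millenials",
                        "Generation X", "Millenials"),
  stringsAsFactors  = FALSE
)

# Keep only the 2014 rows and rename the columns
data2014 <- raw[raw$year == 2014, ]
names(data2014) <- c("Country", "year", "Suicides", "HDI", "GDP", "Generation")

# Convert GDP safely: factor -> character -> strip commas -> numeric
data2014$GDP <- as.numeric(gsub(",", "", as.character(data2014$GDP)))

# Two-level grouping columns; rows with a missing HDI stay NA
hdi_cutoff <- 0.8  # hypothetical threshold, not taken from the report
data2014$HDIlevel <- ifelse(data2014$HDI >= hdi_cutoff, "High", "Not High")

# "Millenials" follows the spelling of the factor level in the dataset
data2014$Genclass <- ifelse(data2014$Generation == "Millenials",
                            "Millenials", "Not Millenials")

str(data2014)
```

Rows with a missing HDI keep `NA` in `HDIlevel`, so they drop out of any comparison of High and Not High groups, in the same way that 36 observations are dropped in the regressions later in the report.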

Variable: HDI

To check the normality of HDI (the quantitative independent variable), I have plotted a histogram. The data looks nearly normal with a slight skew; the summary below shows the mean (0.8085) just under the median (0.8180), which points to a mild negative rather than positive skew. The sample size is more than 30. I treat the observations as independent, since HDI scores vary with each country's policies, although the 12 rows for each country share the same HDI value.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.6270  0.7500  0.8180  0.8085  0.8830  0.9440      36

Let us now look at the scatterplot of HDI against suicide rates. There seems to be a positive linear association between the variables. The correlation, however, is very low at 0.145, so only about 2.1% (0.145 squared) of the variation in suicide rates is explained by this variable. The chart is dominated by High income and Upper middle income countries.

## 
##  Pearson's product-moment correlation
## 
## data:  data2014$HDI and data2014$Suicides
## t = 4.4057, df = 898, p-value = 1.181e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08087883 0.20882119
## sample estimates:
##       cor 
## 0.1454581
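As a quick check on how the correlation relates to the variance explained, the figures in the output above can be recombined by hand. The sketch below only reuses the reported correlation and degrees of freedom; the variable names are illustrative.

```r
# Values copied from the cor.test output above
r  <- 0.1454581
df <- 898

# Share of variance explained by a single predictor is r squared
r_squared <- r^2

# The t statistic for testing r = 0 can be rebuilt from r and df
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

round(r_squared, 4)
round(t_stat, 3)
```

The squared correlation comes to roughly 0.021, which is the same quantity later reported as Multiple R-squared in the regression of Suicides on HDI.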

Variable : Generation

We can clearly see that Millennials do not have a higher suicide rate than the other generations. Silent, the people born between the mid 1920s and the mid 1940s, have the highest mean suicide rate. I have run the boxplot again after classifying the dataset into Millennials and Non Millennials. The medians of the two groups are now at almost the same level, but the data for Non Millennials has a large number of high outliers.

## 
##         Boomers G.I. Generation    Generation X    Generation Z 
##             156               0             156             156 
##      Millenials          Silent 
##             312             156

Other variables that might be of interest:

I also want to explore the relation of the dependent variable with age and gender. Males tend to have a higher suicide rate than females. Also, the 75+ age group has the highest rate of suicide. This is contrary to the popular belief that females and younger people are more prone to suicide.

I ran a correlation test after assigning a dummy value of 1 to male and 0 to female and found the correlation between the suicide rate and being male to be 0.43.

## 
##  Pearson's product-moment correlation
## 
## data:  data2014$dummy and data2014$Suicides
## t = 14.655, df = 934, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3787922 0.4830703
## sample estimates:
##       cor 
## 0.4323759

Let us also look at the Gender variable by Region to see if males are more prone to suicide in every region. We see that males have significantly higher suicide rates than females in almost every region.

The last variable I will test is the income classification of the country (according to the World Bank). We see that suicides are more common in High income and Upper middle income countries.

The exploratory data analysis suggests that factors other than HDI and Generation might be more useful in predicting suicides. However, since the original research questions were based on HDI and Millennials, I would like to test the model using those 2 variables first and then try to improve the model.

Part 4 - Inference

Significance of the HDI variable

Null and Alternate Hypothesis

H0: There is no difference in the mean rates of suicide between High HDI countries and Not High HDI countries

Ha: There is a difference between the mean rates of suicide between High HDI countries and Not High HDI countries.

Running the regression

## 
## Call:
## lm(formula = data2014$Suicides ~ data2014$HDI, na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.825  -8.664  -4.884   3.514  82.225 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -10.269      4.772  -2.152   0.0317 *  
## data2014$HDI   25.873      5.873   4.406 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.35 on 898 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.02116,    Adjusted R-squared:  0.02007 
## F-statistic: 19.41 on 1 and 898 DF,  p-value: 1.181e-05
## Warning: Removed 36 rows containing non-finite values (stat_smooth).
## Warning: Removed 36 rows containing missing values (geom_point).

The p-value is very low, so we may reject the null hypothesis. This shows a significant relationship. The R-squared, however, is very low at 2.1%.

Equation for regression:

Suicides = -10.269 + 25.873 * HDI
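To get a feel for what this equation predicts, it can be evaluated at the minimum, mean, and maximum HDI values from the summary shown earlier. The helper name `predict_suicides` is illustrative only.

```r
# Fitted line from the regression output above
predict_suicides <- function(hdi) -10.269 + 25.873 * hdi

# Minimum, mean, and maximum HDI in the 2014 sample
hdi_values <- c(min = 0.627, mean = 0.8085, max = 0.944)

round(predict_suicides(hdi_values), 2)
```

Across the whole observed HDI range, the predicted rate moves from roughly 6 to roughly 14 suicides per 100k, which is small compared with the residual standard error of 14.35.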

Testing the conditions for regression:

We can see from the plots that the fitted line in the linearity and constant variability checks is almost straight, so I consider those conditions met. The normality check shows a significant positive skew, but since the sample size is more than 30, we may ignore this violation for now.

Hypothesis test and Confidence intervals:

Theoretical method:

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_High = 468, mean_High = 12.5249, sd_High = 15.2226
## n_Not High = 432, mean_Not High = 8.6191, sd_Not High = 13.3807
## Observed difference between means (High-Not High) = 3.9058
## 
## H0: mu_High - mu_Not High = 0 
## HA: mu_High - mu_Not High != 0 
## Standard error = 0.954 
## Test statistic: Z =  4.095 
## p-value =  0

There is an observed difference in means of 3.91, and the p-value is reported as 0, meaning it is smaller than the displayed precision. The difference between the two group means is therefore significant.

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_High = 468, mean_High = 12.5249, sd_High = 15.2226
## n_Not High = 432, mean_Not High = 8.6191, sd_Not High = 13.3807

## Observed difference between means (High-Not High) = 3.9058
## 
## Standard error = 0.9537 
## 95 % Confidence interval = ( 2.0365 , 5.7751 )

The 95% Confidence Interval is (2.0365, 5.7751) and does not span 0. There is evidence of a difference between the mean suicide rates between the two populations.
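The standard error, test statistic, and confidence interval in the theoretical output can be rebuilt from the summary statistics alone. The sketch below assumes the unpooled (Welch-style) standard error and a normal critical value, which is consistent with the Z statistic shown in the output; the variable names are illustrative.

```r
# Summary statistics copied from the output above
n_high <- 468; mean_high <- 12.5249; sd_high <- 15.2226
n_not  <- 432; mean_not  <- 8.6191;  sd_not  <- 13.3807

diff_means <- mean_high - mean_not

# Unpooled standard error of the difference between two means
se <- sqrt(sd_high^2 / n_high + sd_not^2 / n_not)

# Z statistic and two-sided p-value
z <- diff_means / se
p_value <- 2 * pnorm(-abs(z))

# 95% confidence interval using the normal critical value
ci <- diff_means + c(-1, 1) * qnorm(0.975) * se

round(c(se = se, z = z), 4)
signif(p_value, 2)
round(ci, 4)
```

The p-value is tiny but not exactly zero, which is why the output prints it as 0 after rounding.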

Simulation method:

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_High = 468, mean_High = 12.5249, sd_High = 15.2226
## n_Not High = 432, mean_Not High = 8.6191, sd_Not High = 13.3807
## Observed difference between means (High-Not High) = 3.9058
## 
## H0: mu_High - mu_Not High = 0 
## HA: mu_High - mu_Not High != 0

## p-value =  0

The simulation method also reports a p-value of 0, meaning that it is too small to show at the displayed precision. The variable is significant.

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_High = 468, mean_High = 12.5249, sd_High = 15.2226
## n_Not High = 432, mean_Not High = 8.6191, sd_Not High = 13.3807
## Observed difference between means (High-Not High) = 3.9058

## 95 % Bootstrap interval = ( 2.0581 , 5.7193 )

The simulated confidence interval is (2.0581, 5.7193). This is slightly different from the result of the theoretical method. The interval does not span 0, so the HDI variable is significant.
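For readers unfamiliar with the simulation approach, a percentile bootstrap interval for a difference in means can be sketched in base R as below. The two samples here are synthetic draws made up for illustration (an exponential shape was assumed only to mimic the right skew), so the resulting interval will not match the one reported above, and this is not necessarily how the function used in the report computes its interval.

```r
set.seed(1)

# Synthetic right-skewed samples standing in for the two HDI groups
high     <- rexp(468, rate = 1 / 12.5)
not_high <- rexp(432, rate = 1 / 8.6)

observed <- mean(high) - mean(not_high)

# Resample each group with replacement and recompute the difference
boot_diff <- replicate(2000, {
  mean(sample(high, replace = TRUE)) - mean(sample(not_high, replace = TRUE))
})

# Percentile interval: middle 95% of the bootstrap differences
ci <- quantile(boot_diff, c(0.025, 0.975))

observed
ci
```

Because the resampling is random, the interval changes slightly from run to run unless a seed is fixed, so small differences between a knitted output and figures quoted in the text are to be expected.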

Significance of the Generation variable

Null and Alternate Hypothesis

The ideal way to test this variable would be to compare Millennial suicides as a percentage of the Millennial population with suicides as a percentage of the population of the other generations. However, since we do not have the population data classified into generations, we will stick to comparing the mean suicide rate of Millennials with that of Non Millennials.

H0: There is no difference between the mean suicide rate of Millennials and the mean suicide rate of Non Millennials.

Ha: There is a difference between the mean suicide rate of Millennials and the mean suicide rate of Non Millennials.

Running the regression

## 
## Call:
## lm(formula = data2014$Suicides ~ data2014$Genclass)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.917 -10.012  -4.735   2.947 112.533 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       9.1995     0.8558  10.749  < 2e-16 ***
## data2014$GenclassNot Millenials   2.7179     1.0482   2.593  0.00966 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.12 on 934 degrees of freedom
## Multiple R-squared:  0.007147,   Adjusted R-squared:  0.006084 
## F-statistic: 6.724 on 1 and 934 DF,  p-value: 0.009663

We can clearly see that there is a significant difference between the suicide rates of Millennials and Non Millennials; however, it is not as we expected. Non Millennials have a higher suicide rate than Millennials. The p-value is very low (0.0097), so we reject the null hypothesis. The variable Generation is significant. However, the R-squared is very low at 0.7%, and the residuals seem to have a significant positive skew.

Equation for linear regression:

Suicides = 9.1995 + 2.7179 * NotMillenials
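With a single two-level predictor, the intercept is the mean of the reference group and the slope is the difference between the two group means. In the report's numbers, 9.1995 + 2.7179 = 11.9174, which is the mean of the Non Millennial group. The sketch below shows the same identity on a small invented dataset; the values are made up for illustration.

```r
# Invented toy data: three "Millenials" rows and four "Not Millenials" rows
suicides <- c(2, 5, 8, 9, 12, 15, 20)
genclass <- factor(c(rep("Millenials", 3), rep("Not Millenials", 4)))

fit <- lm(suicides ~ genclass)
group_means <- tapply(suicides, genclass, mean)

coef(fit)       # intercept and slope
group_means     # mean of each group
```

The reference level is chosen alphabetically, so "Millenials" becomes the baseline and the coefficient is labelled with the other level, as in the output above.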

Testing the regression conditions:

There is a much larger number of high residuals on the Non Millennial side. The data looks near normal with a positive skew. I do not believe all conditions are met for this variable.

Hypothesis test and Confidence Intervals

Theoretical: This method is not recommended as there seems to be some violation in the data in meeting some of the regression conditions as outlined above.

Theoretical vs simulated

Hypothesis test

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Millenials = 312, mean_Millenials = 9.1995, sd_Millenials = 9.55
## n_Not Millenials = 624, mean_Not Millenials = 11.9174, sd_Not Millenials = 17.2357
## Observed difference between means (Millenials-Not Millenials) = -2.7179
## 
## H0: mu_Millenials - mu_Not Millenials = 0 
## HA: mu_Millenials - mu_Not Millenials != 0 
## Standard error = 0.877 
## Test statistic: Z =  -3.101 
## p-value =  0.002

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Millenials = 312, mean_Millenials = 9.1995, sd_Millenials = 9.55
## n_Not Millenials = 624, mean_Not Millenials = 11.9174, sd_Not Millenials = 17.2357
## Observed difference between means (Millenials-Not Millenials) = -2.7179
## 
## H0: mu_Millenials - mu_Not Millenials = 0 
## HA: mu_Millenials - mu_Not Millenials != 0

## p-value =  0.0078

In both tests, the p-value (0.002 theoretical, 0.0078 simulated) is well below 0.05, so we reject the null hypothesis. Therefore Generation is a significant variable that is associated with suicide rates.

Confidence Interval

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Millenials = 312, mean_Millenials = 9.1995, sd_Millenials = 9.55
## n_Not Millenials = 624, mean_Not Millenials = 11.9174, sd_Not Millenials = 17.2357

## Observed difference between means (Millenials-Not Millenials) = -2.7179
## 
## Standard error = 0.8766 
## 95 % Confidence interval = ( -4.436 , -0.9999 )
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Millenials = 312, mean_Millenials = 9.1995, sd_Millenials = 9.55
## n_Not Millenials = 624, mean_Not Millenials = 11.9174, sd_Not Millenials = 17.2357
## Observed difference between means (Millenials-Not Millenials) = -2.7179

## 95 % Bootstrap interval = ( -4.4454 , -1.0418 )

The simulated confidence interval does not span 0, so we reject the null hypothesis. There is a difference in suicide rates between Millennials and Non Millennials.

Multiple Regression with both variables

## 
## Call:
## lm(formula = Suicides ~ HDI + Genclass, data = data2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.647  -8.881  -4.821   3.865  81.404 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -11.912      4.807  -2.478   0.0134 *  
## HDI                      25.873      5.856   4.418 1.12e-05 ***
## GenclassNot Millenials    2.465      1.012   2.437   0.0150 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.31 on 897 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.0276, Adjusted R-squared:  0.02543 
## F-statistic: 12.73 on 2 and 897 DF,  p-value: 3.543e-06

## Analysis of Variance Table
## 
## Response: Suicides
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## HDI         1   3995  3994.9 19.5174 1.119e-05 ***
## Genclass    1   1215  1215.5  5.9384   0.01501 *  
## Residuals 897 183602   204.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple regression equation:

Suicides = -11.912 + 25.873 * HDI + 2.465 * GenclassNotMillenials

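Slope HDI: All else remaining constant, a one unit change in the HDI of a country is associated with about 26 more suicides per 100k of the population in that country. Since HDI ranges only from 0 to 1, a more practical reading is that a 0.1 increase in HDI is associated with about 2.6 more suicides per 100k.

Slope Genclass: All else remaining constant, the Non Millennial group has a suicide rate that is 2.465 per 100k higher than the Millennial group. The effect is additive, not a multiple.

Intercept: A Millennial group in a country with an HDI score of 0 would have a suicide rate of -11.9 per 100k of the population. This does not make sense practically, as we cannot have negative suicides, and an HDI of 0 lies far outside the observed range of 0.627 to 0.944.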
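The meaning of each coefficient can be illustrated by plugging values into the fitted equation. The helper `predict_rate` below is an illustrative name, and the HDI values of 0.8 and 0.9 are arbitrary points inside the observed range.

```r
# Fitted multiple regression equation from the output above;
# not_millenial is 1 for the Non Millennial group and 0 for Millennials
predict_rate <- function(hdi, not_millenial) {
  -11.912 + 25.873 * hdi + 2.465 * not_millenial
}

# Effect of belonging to the Non Millennial group at a fixed HDI
gen_effect <- predict_rate(0.8, 1) - predict_rate(0.8, 0)

# Effect of a 0.1 increase in HDI for the same group
hdi_effect <- predict_rate(0.9, 0) - predict_rate(0.8, 0)

c(gen_effect = gen_effect, hdi_effect = hdi_effect)
```

Both effects are differences in suicides per 100k, so they add to the predicted rate rather than multiplying it.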

Although the p-value is low for the regression, the adjusted R-squared is only 2.5%. Also, the residuals are not evenly distributed, with a minimum of -14 and a maximum of 81. The residual sum of squares shows that about 97.2% of the variability of the response variable is unexplained by these two explanatory variables.

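We will test the model again by running a regression on simulated data generated from these coefficients. The simulation replaces the skewed, heteroscedastic residuals of the real data with well-behaved random noise, so it shows how the model would behave if its assumptions held; it does not remove the heteroscedasticity in the actual dataset.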

Multiple Regression Simulation:

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7903 -2.0119 -0.0096  1.9882  9.6458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.6768     0.9096  -12.84   <2e-16 ***
## x1           25.7943     1.1111   23.22   <2e-16 ***
## x2            2.4405     0.1895   12.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.996 on 997 degrees of freedom
## Multiple R-squared:  0.4099, Adjusted R-squared:  0.4087 
## F-statistic: 346.2 on 2 and 997 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## x1          1 4728.7  4728.7  526.69 < 2.2e-16 ***
## x2          1 1488.3  1488.3  165.77 < 2.2e-16 ***
## Residuals 997 8951.2     9.0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

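The p-value is still very low, and fitting the model to a random dataset, with 1000 random numbers for HDI and 1000 random values of 0 or 1 (each with a probability of 50%) for Generation, gives an adjusted R-squared of 40.9%. The residuals are clearly more evenly distributed. Note, however, that the residual standard error in the simulation is about 3.0, against 14.31 in the real data, so the higher R-squared reflects the smaller noise in the simulated data rather than a better fit to the observed suicide rates.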
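To see how strongly the adjusted R-squared of a simulated regression depends on the noise level chosen for the simulation, the same coefficients can be simulated twice with different error standard deviations. Everything in this sketch is an assumption made for illustration: the uniform draw for HDI over the observed range, the normal errors, and the two noise levels (3, close to the simulated residual standard error, and 14.31, the residual standard error of the real model). It is not the simulation code used in the report.

```r
set.seed(42)
n <- 1000

# Assumed predictors: HDI uniform over the observed range, Generation 0/1
x1 <- runif(n, 0.627, 0.944)
x2 <- rbinom(n, 1, 0.5)

# Deterministic part built from the fitted coefficients
signal <- -11.912 + 25.873 * x1 + 2.465 * x2

# Adjusted R-squared of a regression on data simulated with a given noise SD
r2_for_noise <- function(sigma) {
  y <- signal + rnorm(n, mean = 0, sd = sigma)
  summary(lm(y ~ x1 + x2))$adj.r.squared
}

r2_small_noise <- r2_for_noise(3)      # noise similar to the simulation
r2_real_noise  <- r2_for_noise(14.31)  # noise similar to the real residuals

round(c(small_noise = r2_small_noise, real_noise = r2_real_noise), 3)
```

With the same coefficients, the fit looks far stronger when the added noise is small, which suggests that the R-squared of such a simulation says more about the chosen noise than about how well the model describes the real data.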

I would say this model can be used to predict suicides per 100k to some extent, all else constant. However, I do not believe this is the best model for predicting suicides, as the exploratory analysis showed that other factors like Gender have significant associations with the response variable. I will therefore test those variables in the next part.

How about the other variables?

Using all 4 variables, we have an adjusted R-squared of 23%. However, just 2 variables, Gender and HDI, give us an adjusted R-squared of 21.2% with a p-value close to 0. The unexplained variability is about 79%. I believe this to be the best model to predict suicides from the given variables in the dataset. I consider an R-squared of about 21% reasonable for predicting something as unpredictable as suicide.

## 
## Call:
## lm(formula = Suicides ~ Gender + HDI, data = data2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.887  -5.686  -1.628   3.173  75.864 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.6305     4.2999  -3.868 0.000118 ***
## Gendermale   12.7234     0.8575  14.838  < 2e-16 ***
## HDI          25.8731     5.2651   4.914 1.06e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.86 on 897 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.2141, Adjusted R-squared:  0.2123 
## F-statistic: 122.2 on 2 and 897 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: Suicides
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Gender      1  36424   36424 220.173 < 2.2e-16 ***
## HDI         1   3995    3995  24.148  1.06e-06 ***
## Residuals 897 148393     165                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
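The R-squared figures for this model can be recovered directly from the sums of squares in the analysis of variance table above. The sketch below only rearranges the reported numbers; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA table above
ss_gender <- 36424
ss_hdi    <- 3995
ss_resid  <- 148393
ss_total  <- ss_gender + ss_hdi + ss_resid

n <- 900  # 936 rows minus 36 with missing HDI
p <- 2    # number of predictors

r2     <- (ss_gender + ss_hdi) / ss_total
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Share of the total variability attributed to each term
gender_share <- ss_gender / ss_total
hdi_share    <- ss_hdi / ss_total

round(c(r2 = r2, adj_r2 = adj_r2,
        gender_share = gender_share, hdi_share = hdi_share), 4)
```

Gender accounts for roughly 19% of the total variability and HDI for roughly 2%, which together give the reported Multiple R-squared of about 21%.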

Running a simulation on the best regression model

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5676 -1.8855  0.0808  1.9978 10.0039 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.1742     0.9355  -16.22   <2e-16 ***
## x1           24.2419     1.1436   21.20   <2e-16 ***
## x2           12.6274     0.1913   66.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.022 on 997 degrees of freedom
## Multiple R-squared:  0.8252, Adjusted R-squared:  0.8249 
## F-statistic:  2354 on 2 and 997 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## x1          1   3187    3187  348.99 < 2.2e-16 ***
## x2          1  39803   39803 4358.26 < 2.2e-16 ***
## Residuals 997   9105       9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

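We do find that the above model, fitted to a randomly generated dataset, still has a p-value close to 0, and the adjusted R-squared has gone up to 82.5%. This figure reflects the low noise in the simulated data (a residual standard error of about 3.0, against 12.86 in the real data), so it is not a measure of fit to the observed data. On the real data, the Gender and HDI model (adjusted R-squared of 21.2%) is still a clearly better predictor of suicide rates than my original one (2.5%).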

Part 5 - Conclusion

The initial model of predicting suicides per 100k population using HDI scores and Generation class did not turn out to be a very good one, even though each variable proved to be significant when tested separately. Gender is the variable in the given dataset that explains by far the largest share of the variability in suicide rates (about 19% in the analysis of variance above), although most of the variability remains unexplained.

Human beings are very unpredictable, and I believe the reasons behind suicide have less to do with socioeconomic demographics and more to do with personal circumstances. However, it would be interesting to test the relationship between suicides and the components of HDI to find out if there is a relationship there. Research may also be done with personal attributes like marital status, rank at work, number of children, education level, etc. to find the demographic that is more prone to suicide. Once the vulnerable population is identified, programs may be rolled out to offer more mental health support to the targeted population in an attempt to cut down on suicides.

References

For the research idea

HDI and Suicide rates https://www.sciencedirect.com/science/article/pii/S2210600616300430

Millennials more prone to suicide https://www.businessinsider.com/perfectionism-causing-more-early-deaths-and-suicides-among-millennials-2018-9

For the data

https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

https://databank.worldbank.org/data/download/site-content/CLASS.xls