VIEW & PROBE DATA

'data.frame':   27820 obs. of  12 variables:
 $ country           : Factor w/ 101 levels "Albania","Antigua and Barbuda",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year              : int  1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
 $ sex               : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 1 1 2 1 ...
 $ age               : Factor w/ 6 levels "15-24 years",..: 1 3 1 6 2 6 3 2 5 4 ...
 $ suicides_no       : int  21 16 14 1 9 1 6 4 1 0 ...
 $ population        : int  312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
 $ suicides.100k.pop : num  6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
 $ country.year      : Factor w/ 2321 levels "Albania1987",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ HDI.for.year      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ gdp_for_year....  : Factor w/ 2321 levels "1,002,219,052,968",..: 727 727 727 727 727 727 727 727 727 727 ...
 $ gdp_per_capita....: int  796 796 796 796 796 796 796 796 796 796 ...
 $ generation        : Factor w/ 6 levels "Boomers","G.I. Generation",..: 3 6 3 2 1 2 6 1 2 3 ...
           country               year                sex 
              0.00               0.00               0.00 
               age        suicides_no         population 
              0.00               0.00               0.00 
 suicides.100k.pop       country.year       HDI.for.year 
              0.00               0.00              69.94 
  gdp_for_year.... gdp_per_capita....         generation 
              0.00               0.00               0.00 
[1] 27820    12
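
The three outputs above look like a structure dump, a per-column percentage of missing values, and the table dimensions. The code that produced them is not shown, so the following is only a minimal sketch on a hypothetical toy data frame (`toy` and its values are illustrative, not the real data):

```r
# Toy stand-in for the raw data; values are illustrative only.
toy <- data.frame(
  country      = factor(c("Albania", "Albania", "Austria", "Austria")),
  year         = c(1987L, 1987L, 1995L, 1995L),
  suicides_no  = c(21L, 16L, 14L, 1L),
  HDI.for.year = c(NA, NA, NA, 0.8)
)

str(toy)                                             # column types and factor levels
pct_missing <- round(colMeans(is.na(toy)) * 100, 2)  # percent NA per column
print(pct_missing)
print(dim(toy))                                      # rows, columns
```

`colMeans(is.na(x)) * 100` gives the share of missing cells per column, which is the kind of figure reported for HDI.for.year above.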

REMOVE IRRELEVANT FACTOR VARIABLES AND ADD continent

'data.frame':   27820 obs. of  11 variables:
 $ country           : Factor w/ 101 levels "Albania","Antigua and Barbuda",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year              : int  1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
 $ sex               : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 1 1 2 1 ...
 $ age               : Factor w/ 6 levels "15-24 years",..: 1 3 1 6 2 6 3 2 5 4 ...
 $ suicides_no       : int  21 16 14 1 9 1 6 4 1 0 ...
 $ population        : int  312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
 $ suicides.100k.pop : num  6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
 $ HDI.for.year      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ gdp_per_capita....: int  796 796 796 796 796 796 796 796 796 796 ...
 $ generation        : Factor w/ 6 levels "Boomers","G.I. Generation",..: 3 6 3 2 1 2 6 1 2 3 ...
 $ continent         : Factor w/ 5 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
  country year    sex         age suicides_no population suicides.100k.pop
1 Albania 1987   male 15-24 years          21     312900              6.71
2 Albania 1987   male 35-54 years          16     308000              5.19
3 Albania 1987 female 15-24 years          14     289700              4.83
4 Albania 1987   male   75+ years           1      21800              4.59
5 Albania 1987   male 25-34 years           9     274300              3.28
6 Albania 1987 female   75+ years           1      35600              2.81
  HDI.for.year gdp_per_capita....      generation continent
1           NA                796    Generation X    Europe
2           NA                796          Silent    Europe
3           NA                796    Generation X    Europe
4           NA                796 G.I. Generation    Europe
5           NA                796         Boomers    Europe
6           NA                796 G.I. Generation    Europe
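
Comparing the two structure dumps, country.year and gdp_for_year.... are gone and a continent factor has appeared. How continent was derived is not shown; the sketch below assumes a hand-written lookup vector (`continent_of` is hypothetical) and toy values:

```r
# Toy data with the two factor columns that get dropped.
toy <- data.frame(
  country            = factor(c("Albania", "Brazil")),
  year               = c(1987L, 1990L),
  country.year       = factor(c("Albania1987", "Brazil1990")),
  "gdp_for_year...." = factor(c("1,000", "2,000"))
)

# Hypothetical country-to-continent lookup (assumption, not from the source).
continent_of <- c(Albania = "Europe", Brazil = "Americas")
toy$continent <- factor(unname(continent_of[as.character(toy$country)]))

# Drop the columns that duplicate information held elsewhere.
drop_cols <- c("country.year", "gdp_for_year....")
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy))
```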

REMOVE country VARIABLE FROM MASTER DATASET

 [1] "year"               "sex"                "age"               
 [4] "suicides_no"        "population"         "suicides.100k.pop" 
 [7] "HDI.for.year"       "gdp_per_capita...." "generation"        
[10] "continent"         
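
One common way to drop a single column and confirm the result, shown here on a hypothetical toy data frame:

```r
# Toy data; values are illustrative only.
toy <- data.frame(
  country = factor(c("Albania", "Albania")),
  year    = c(1987L, 1987L),
  sex     = factor(c("male", "female"))
)

toy$country <- NULL   # assigning NULL removes the column in place
print(names(toy))     # confirm what is left
```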

LOOK AT NAs; CHECK IF SUBSETTING WAS ACCURATE

              year                sex                age 
                 0                  0                  0 
       suicides_no         population  suicides.100k.pop 
                 0                  0                  0 
      HDI.for.year gdp_per_capita....         generation 
                 0                  0                  0 
         continent 
                 0 
              year                sex                age 
                 0                  0                  0 
       suicides_no         population  suicides.100k.pop 
                 0                  0                  0 
      HDI.for.year gdp_per_capita....         generation 
               100                  0                  0 
         continent 
                 0 
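
The two tables above read as percent missing per column for each subset: 0 everywhere in the complete rows, 100 for HDI.for.year in the rows lacking it. A minimal sketch of such a split and check, assuming the subsets are defined by `is.na(HDI.for.year)` (the name `sub` appears in the model call further down; `miss` is a hypothetical name):

```r
# Toy data; values are illustrative only.
toy <- data.frame(
  year         = c(1987L, 1987L, 1995L, 1995L),
  suicides_no  = c(21L, 16L, 13L, 9L),
  HDI.for.year = c(NA, NA, 0.619, 0.619)
)

sub  <- toy[!is.na(toy$HDI.for.year), ]  # rows with an observed HDI
miss <- toy[is.na(toy$HDI.for.year), ]   # rows where HDI is NA

# Percent missing per column in each subset.
print(round(colMeans(is.na(sub)) * 100))
print(round(colMeans(is.na(miss)) * 100))
```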

TRAIN MODEL ON ‘COMPLETE’ SUB DATASET & PREDICT HUMAN DEVELOPMENT INDEX


Call:
lm(formula = HDI.for.year ~ ., data = sub)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.187340 -0.021486  0.004924  0.031165  0.144485 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -4.231e+00  2.922e-01 -14.477  < 2e-16 ***
year                       2.450e-03  1.472e-04  16.641  < 2e-16 ***
sexmale                   -3.063e-03  1.202e-03  -2.548  0.01085 *  
age25-34 years            -1.965e-03  2.018e-03  -0.974  0.33002    
age35-54 years            -9.841e-03  3.274e-03  -3.006  0.00266 ** 
age5-14 years              5.801e-03  2.744e-03   2.114  0.03453 *  
age55-74 years            -1.175e-02  5.270e-03  -2.229  0.02583 *  
age75+ years              -1.162e-02  6.447e-03  -1.803  0.07148 .  
suicides_no                8.342e-06  1.190e-06   7.010 2.57e-12 ***
population                 1.558e-09  1.991e-10   7.823 5.81e-15 ***
suicides.100k.pop          9.751e-05  3.952e-05   2.468  0.01362 *  
gdp_per_capita....         2.380e-06  2.727e-08  87.295  < 2e-16 ***
generationG.I. Generation  7.902e-03  4.832e-03   1.636  0.10198    
generationGeneration X    -3.750e-03  2.911e-03  -1.288  0.19771    
generationGeneration Z    -1.291e-02  6.507e-03  -1.984  0.04731 *  
 [ reached getOption("max.print") -- omitted 6 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.04864 on 8343 degrees of freedom
Multiple R-squared:  0.7293,    Adjusted R-squared:  0.7286 
F-statistic:  1124 on 20 and 8343 DF,  p-value: < 2.2e-16
[1] 0.04857793
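
The single number printed after the summary is consistent with a root mean squared error on the training rows: the residual standard error divides the residual sum of squares by the degrees of freedom, RMSE divides it by the row count, and 0.04864 × sqrt(8343 / 8364) ≈ 0.04858. The sketch below shows that relationship on simulated toy data (all values and the seed are assumptions):

```r
# Simulated stand-in for the complete subset; not the real data.
set.seed(42)
n   <- 30
gdp <- runif(n, 500, 50000)
sub <- data.frame(
  year                 = sample(1990:2015, n, replace = TRUE),
  "gdp_per_capita...." = gdp,
  HDI.for.year         = 0.55 + 6e-06 * gdp + rnorm(n, sd = 0.03)
)

fit <- lm(HDI.for.year ~ ., data = sub)  # same formula shape as the call above
print(summary(fit))

rse  <- summary(fit)$sigma               # sqrt(RSS / residual df)
rmse <- sqrt(mean(residuals(fit)^2))     # sqrt(RSS / n)
print(rmse)
```

Because n is larger than the residual degrees of freedom, the RMSE is always slightly below the residual standard error.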

USE MODEL TO PREDICT HDI on ‘MISSING’ SUB DATASET

 [1] "year"               "sex"                "age"               
 [4] "suicides_no"        "population"         "suicides.100k.pop" 
 [7] "HDI.for.year"       "gdp_per_capita...." "generation"        
[10] "continent"          "pred"              
 [1] "year"               "sex"                "age"               
 [4] "suicides_no"        "population"         "suicides.100k.pop" 
 [7] "HDI.for.year"       "gdp_per_capita...." "generation"        
[10] "continent"          "HDI.pred"          
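
The two name listings suggest that fitted values were stored as pred on the complete subset and as HDI.pred on the missing subset. A minimal sketch of that step, assuming `predict()` with `newdata` was used (toy values; `miss` is a hypothetical name):

```r
# Toy data; values are illustrative only.
toy <- data.frame(
  year                 = 1990:1997,
  "gdp_per_capita...." = c(800, 950, 1000, 1180, 1200, 1350, 1400, 1600),
  HDI.for.year         = c(0.60, 0.61, NA, 0.63, NA, 0.66, 0.66, NA)
)
sub  <- toy[!is.na(toy$HDI.for.year), ]
miss <- toy[is.na(toy$HDI.for.year), ]

fit <- lm(HDI.for.year ~ ., data = sub)

sub$pred      <- predict(fit)                  # fitted values on training rows
miss$HDI.pred <- predict(fit, newdata = miss)  # predictions where HDI is NA

print(names(sub))
print(names(miss))
```

`predict()` only needs the predictor columns in `newdata`, so the NA response in `miss` does not get in the way.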

REARRANGE THE ORDER OF COLUMNS

 [1] "year"               "sex"                "age"               
 [4] "suicides_no"        "population"         "suicides.100k.pop" 
 [7] "HDI.pred"           "gdp_per_capita...." "generation"        
[10] "continent"          "HDI.for.year"      
 [1] "year"               "sex"                "age"               
 [4] "suicides_no"        "population"         "suicides.100k.pop" 
 [7] "HDI.for.year"       "gdp_per_capita...." "generation"        
[10] "continent"          "pred"              
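
In the first listing HDI.pred has taken the seventh position and HDI.for.year has moved to the end, which amounts to swapping the two columns. One way to do that by name rather than by hard-coded index, sketched on a hypothetical toy data frame:

```r
# Toy version of the missing subset; values are illustrative only.
miss <- data.frame(
  year         = c(1987L, 1988L),
  HDI.for.year = c(NA_real_, NA_real_),
  continent    = factor(c("Europe", "Europe")),
  HDI.pred     = c(0.60, 0.61)
)

i_old <- match("HDI.for.year", names(miss))
i_new <- match("HDI.pred", names(miss))

ord <- seq_along(miss)   # current column order
ord[i_old] <- i_new      # predicted column takes the old slot
ord[i_new] <- i_old      # old column moves to where the prediction was
miss <- miss[, ord]
print(names(miss))
```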

REMOVE THE OLD HDI COLUMN CONTAINING NA VALUES

RENAME NEW PREDICTED HDI COL TO OLD COL NAME

REMOVE PREDICTED COL IN COMPLETE SUB DATASET
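
The three steps above produce no output. A minimal sketch of what they could look like, assuming the column names seen in the earlier listings (toy values; `miss` is a hypothetical name):

```r
# Toy subsets in the state reached after reordering; values are illustrative.
miss <- data.frame(
  year         = c(1987L, 1988L),
  HDI.pred     = c(0.60, 0.61),
  continent    = factor(c("Europe", "Europe")),
  HDI.for.year = c(NA_real_, NA_real_)
)
sub <- data.frame(
  year         = c(1995L, 1996L),
  HDI.for.year = c(0.619, 0.625),
  continent    = factor(c("Europe", "Europe")),
  pred         = c(0.62, 0.63)
)

miss$HDI.for.year <- NULL                                  # drop the all-NA column
names(miss)[names(miss) == "HDI.pred"] <- "HDI.for.year"   # prediction takes its name
sub$pred <- NULL                                           # fitted values no longer needed

print(identical(names(sub), names(miss)))
```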

CHECK names() OF BOTH DATASETS BEFORE MERGING

 [1] "year"               "sex"                "age"               
 [4] "suicides_no"        "population"         "suicides.100k.pop" 
 [7] "HDI.for.year"       "gdp_per_capita...." "generation"        
[10] "continent"         
 [1] "year"               "sex"                "age"               
 [4] "suicides_no"        "population"         "suicides.100k.pop" 
 [7] "HDI.for.year"       "gdp_per_capita...." "generation"        
[10] "continent"         

MERGE THE DATA

[1] 27820    10
   year    sex         age suicides_no population suicides.100k.pop
73 1995   male 25-34 years          13     232900              5.58
74 1995   male 55-74 years           9     178000              5.06
75 1995 female   75+ years           2      40800              4.90
76 1995 female 15-24 years          13     283500              4.59
77 1995   male 15-24 years          11     241200              4.56
78 1995   male   75+ years           1      25100              3.98
   HDI.for.year gdp_per_capita....      generation continent
73        0.619                835    Generation X    Europe
74        0.619                835          Silent    Europe
75        0.619                835 G.I. Generation    Europe
76        0.619                835    Generation X    Europe
77        0.619                835    Generation X    Europe
78        0.619                835 G.I. Generation    Europe
'data.frame':   27820 obs. of  10 variables:
 $ year              : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
 $ sex               : Factor w/ 2 levels "female","male": 2 2 1 1 2 2 2 1 1 2 ...
 $ age               : Factor w/ 6 levels "15-24 years",..: 2 5 6 1 1 6 3 2 3 4 ...
 $ suicides_no       : int  13 9 2 13 11 1 14 7 8 6 ...
 $ population        : int  232900 178000 40800 283500 241200 25100 375900 264000 356400 376500 ...
 $ suicides.100k.pop : num  5.58 5.06 4.9 4.59 4.56 3.98 3.72 2.65 2.24 1.59 ...
 $ HDI.for.year      : num  0.619 0.619 0.619 0.619 0.619 0.619 0.619 0.619 0.619 0.619 ...
 $ gdp_per_capita....: int  835 835 835 835 835 835 835 835 835 835 ...
 $ generation        : Factor w/ 6 levels "Boomers","G.I. Generation",..: 3 6 2 3 3 2 1 3 1 5 ...
 $ continent         : Factor w/ 5 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
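
With identical column names in both subsets, stacking them row-wise restores the original row count (27820) with no NA left in HDI.for.year. The sketch below assumes `rbind()` was used with the complete rows first, which would fit the head() output starting at row 73; `final` is the name used in the later model call, `miss` is hypothetical, and the values are toy data:

```r
# Toy data with HDI already filled in for every row; values are illustrative.
toy <- data.frame(
  year         = c(1987L, 1987L, 1995L, 1995L),
  sex          = factor(c("male", "female", "male", "female")),
  HDI.for.year = c(0.60, 0.60, 0.619, 0.619)
)
sub  <- toy[3:4, ]   # stands in for the originally complete rows
miss <- toy[1:2, ]   # stands in for the rows with imputed HDI

final <- rbind(sub, miss)   # columns are matched by name
print(dim(final))
print(head(final))
str(final)
```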

EXPORT DATA TO .csv
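
The export call and file name are not shown. A minimal sketch, assuming `write.csv()` and a hypothetical path under the session's temporary directory:

```r
# Toy stand-in for the merged data; values are illustrative only.
final <- data.frame(
  year         = c(1995L, 1995L),
  sex          = c("male", "female"),
  HDI.for.year = c(0.619, 0.619)
)

out_path <- file.path(tempdir(), "final_toy.csv")  # hypothetical file name
write.csv(final, out_path, row.names = FALSE)      # skip the row-label column

back <- read.csv(out_path)                         # read back as a sanity check
print(dim(back))
```

`row.names = FALSE` avoids writing the row labels as an unnamed first column.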

PREDICT SUICIDES


Call:
lm(formula = suicides_no ~ ., data = final)

Residuals:
    Min      1Q  Median      3Q     Max 
-3035.9  -164.0    10.6   157.0 17446.1 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                2.127e+04  1.894e+03  11.231  < 2e-16 ***
year                      -1.237e+01  9.698e-01 -12.756  < 2e-16 ***
sexmale                    9.221e+01  8.521e+00  10.821  < 2e-16 ***
age25-34 years             3.880e+01  1.612e+01   2.407  0.01609 *  
age35-54 years             1.303e+02  2.519e+01   5.175 2.30e-07 ***
age5-14 years             -8.879e+01  1.684e+01  -5.274 1.35e-07 ***
age55-74 years             1.180e+02  3.769e+01   3.131  0.00174 ** 
age75+ years              -1.581e+00  4.389e+01  -0.036  0.97127    
population                 1.271e-04  1.081e-06 117.530  < 2e-16 ***
suicides.100k.pop          1.198e+01  2.518e-01  47.566  < 2e-16 ***
HDI.for.year               4.792e+03  1.409e+02  34.025  < 2e-16 ***
gdp_per_capita....        -1.142e-02  4.085e-04 -27.954  < 2e-16 ***
generationG.I. Generation -6.277e+01  3.306e+01  -1.899  0.05760 .  
generationGeneration X    -1.430e+01  1.925e+01  -0.743  0.45769    
generationGeneration Z     7.279e+01  4.233e+01   1.720  0.08553 .  
 [ reached getOption("max.print") -- omitted 6 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 638.7 on 27799 degrees of freedom
Multiple R-squared:  0.4991,    Adjusted R-squared:  0.4987 
F-statistic:  1385 on 20 and 27799 DF,  p-value: < 2.2e-16
[1] 638.4137
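
As with the HDI model, the trailing number is consistent with a training RMSE (638.7 × sqrt(27799 / 27820) ≈ 638.4, allowing for rounding of the printed standard error). A minimal sketch on toy data, with a small hypothetical helper `rmse()`:

```r
# Toy stand-in for the merged data; values are illustrative only.
final <- data.frame(
  sex         = factor(rep(c("female", "male"), times = 4)),
  population  = c(100000, 120000, 200000, 210000, 300000, 330000, 400000, 410000),
  suicides_no = c(3, 14, 7, 25, 10, 40, 13, 52)
)

fit <- lm(suicides_no ~ ., data = final)  # same formula shape as the call above

# Hypothetical helper: root mean squared error of a fitted model's residuals.
rmse <- function(model) sqrt(mean(residuals(model)^2))

print(coef(fit))
print(rmse(fit))
```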