Predict Cancer Mortality Rates for US Counties

0.1 Summary

The goal of this project/challenge is to predict cancer mortality rates for US counties.

The website hosting the data is located at https://data.world/nrippner/ols-regression-challenge. These data were aggregated from a number of sources including the American Community Survey (https://www.census.gov), https://www.clinicaltrials.gov, and https://www.cancer.gov.

Using only OLS to build the model, the final model chosen uses 6 predictors, out of 36 total, with an adjusted R-squared of 0.9624.

0.2 Gathering and Cleaning the Data

0.2.1 The data

The data was loaded from data.world and summarized to gain some quick insight.

library(dplyr)
library(data.table)
library(h2o)
library(caret)
set.seed(323)
ols_project <- fread("cancer_reg.csv")
Hmisc::describe(ols_project)

## ols_project 
## 
##  34  Variables      3047  Observations
## ---------------------------------------------------------------------------
## avgAnnCount 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      929        1    606.3    856.7       23       37 
##      .25      .50      .75      .90      .95 
##       76      171      518     1963     1972 
## 
## lowest :     6     7     8     9    10, highest: 13294 14477 15470 24965 38150
## ---------------------------------------------------------------------------
## avgDeathsPerYear 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      608        1      186    262.2      9.0     14.0 
##      .25      .50      .75      .90      .95 
##     28.0     61.0    149.0    378.4    749.4 
## 
## lowest :     3     4     5     6     7, highest:  4895  5108  5780  9445 14010
## ---------------------------------------------------------------------------
## TARGET_deathRate 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     1053        1    178.7    30.67    134.1    145.4 
##      .25      .50      .75      .90      .95 
##    161.2    178.1    195.2    213.3    224.4 
## 
## lowest :  59.7  66.3  80.8  87.6  93.8, highest: 277.6 280.8 292.5 293.9 362.8
## ---------------------------------------------------------------------------
## incidenceRate 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     1506        1    448.3    57.33    355.1    380.8 
##      .25      .50      .75      .90      .95 
##    420.3    453.5    480.9    507.3    525.0 
## 
## lowest :  201.3  211.1  214.8  221.5  234.0, highest:  639.7  651.3  718.9 1014.2 1206.9
## ---------------------------------------------------------------------------
## medIncome 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     2920        1    47063    12717    31838    34212 
##      .25      .50      .75      .90      .95 
##    38883    45207    52492    61323    69964 
## 
## lowest :  22640  23047  24035  24265  24707, highest: 107250 108477 110507 122641 125635
## ---------------------------------------------------------------------------
## popEst2015 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     2999        1   102637   153938     3666     5796 
##      .25      .50      .75      .90      .95 
##    11684    26643    68671   207791   436220 
## 
## lowest :      827      829     1130     1310     1330
## highest:  3299521  4167947  4538028  5238216 10170292
## ---------------------------------------------------------------------------
## povertyPercent 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      333        1    16.88    7.026     8.30     9.80 
##      .25      .50      .75      .90      .95 
##    12.15    15.90    20.40    25.30    28.70 
## 
## lowest :  3.2  3.7  3.9  4.0  4.2, highest: 44.0 45.1 46.9 47.0 47.4
## ---------------------------------------------------------------------------
## studyPerCap 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     1117    0.745    155.4    272.2     0.00     0.00 
##      .25      .50      .75      .90      .95 
##     0.00     0.00    83.65   412.81   747.58 
## 
## lowest :    0.000000    1.481842    2.887570    5.030485    5.033625
## highest: 4938.271605 6810.442679 8585.924288 9439.200444 9762.308998
## ---------------------------------------------------------------------------
## binnedInc 
##        n  missing distinct 
##     3047        0       10 
## 
## (34218.1, 37413.8] (304, 0.100), (37413.8, 40362.7] (304, 0.100),
## (40362.7, 42724.4] (304, 0.100), (42724.4, 45201] (305, 0.100), (45201,
## 48021.6] (306, 0.100), (48021.6, 51046.4] (305, 0.100), (51046.4, 54545.6]
## (305, 0.100), (54545.6, 61494.5] (306, 0.100), (61494.5, 125635] (302,
## 0.099), [22640, 34218.1] (306, 0.100)
## ---------------------------------------------------------------------------
## MedianAge 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      325        1    45.27    14.48    31.93    34.50 
##      .25      .50      .75      .90      .95 
##    37.70    41.00    44.00    47.70    50.17 
## 
## lowest :  22.3  23.2  23.3  23.5  23.9, highest: 536.4 546.0 579.6 619.2 624.0
## ---------------------------------------------------------------------------
## MedianAgeMale 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      298        1    39.57    5.792    31.00    33.20 
##      .25      .50      .75      .90      .95 
##    36.35    39.60    42.50    46.10    48.60 
## 
## lowest : 22.4 22.8 23.0 23.7 23.8, highest: 56.5 58.5 58.6 60.2 64.7
## ---------------------------------------------------------------------------
## MedianAgeFemale 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      296        1    42.15    5.881     32.9     35.4 
##      .25      .50      .75      .90      .95 
##     39.1     42.4     45.3     48.6     50.6 
## 
## lowest : 22.3 22.8 23.6 23.9 24.5, highest: 58.0 58.2 58.7 59.6 65.7
## ---------------------------------------------------------------------------
## Geography 
##        n  missing distinct 
##     3047        0     3047 
## 
## lowest : Abbeville County, South Carolina  Acadia Parish, Louisiana          Accomack County, Virginia         Ada County, Idaho                 Adair County, Iowa               
## highest: Yukon-Koyukuk Census Area, Alaska Yuma County, Arizona              Yuma County, Colorado             Zapata County, Texas              Zavala County, Texas             
## ---------------------------------------------------------------------------
## AvgHouseholdSize 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      199        1     2.48   0.3509     2.14     2.24 
##      .25      .50      .75      .90      .95 
##     2.37     2.50     2.63     2.82     2.98 
## 
## lowest : 0.0221 0.0222 0.0225 0.0230 0.0236, highest: 3.7800 3.8400 3.8600 3.9300 3.9700
## ---------------------------------------------------------------------------
## PercentMarried 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      362        1    51.77    7.612    39.30    43.06 
##      .25      .50      .75      .90      .95 
##    47.75    52.40    56.40    59.94    61.90 
## 
## lowest : 23.1 25.1 25.3 26.2 26.3, highest: 68.0 69.1 69.2 72.3 72.5
## ---------------------------------------------------------------------------
## PctNoHS18_24 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      405        1    18.22    8.802     6.90     9.00 
##      .25      .50      .75      .90      .95 
##    12.80    17.10    22.70    28.60    32.87 
## 
## lowest :  0.0  0.5  0.8  1.2  1.4, highest: 59.0 59.1 59.7 62.7 64.1
## ---------------------------------------------------------------------------
## PctHS18_24 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      469        1       35     10.1    20.40    23.66 
##      .25      .50      .75      .90      .95 
##    29.20    34.70    40.70    46.10    50.60 
## 
## lowest :  0.0  7.1  8.0  8.6 10.0, highest: 65.5 65.7 66.2 72.1 72.5
## ---------------------------------------------------------------------------
## PctSomeCol18_24 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      762     2285      343        1    40.98    12.24    24.52    27.81 
##      .25      .50      .75      .90      .95 
##    34.00    40.40    46.40    56.18    60.87 
## 
## lowest :  7.1  9.6 10.1 11.2 11.4, highest: 73.5 75.2 76.2 78.3 79.0
## ---------------------------------------------------------------------------
## PctBachDeg18_24 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      219        1    6.158    4.681      0.5      1.4 
##      .25      .50      .75      .90      .95 
##      3.1      5.4      8.2     11.7     14.3 
## 
## lowest :  0.0  0.1  0.2  0.3  0.4, highest: 33.3 37.5 40.3 43.4 51.8
## ---------------------------------------------------------------------------
## PctHS25_Over 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      361        1     34.8    7.906    22.23    25.60 
##      .25      .50      .75      .90      .95 
##    30.40    35.30    39.65    43.40    45.40 
## 
## lowest :  7.5  8.3 10.8 11.5 11.8, highest: 51.7 52.1 52.7 53.6 54.8
## ---------------------------------------------------------------------------
## PctBachDeg25_Over 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      281        1    13.28    5.855     6.40     7.40 
##      .25      .50      .75      .90      .95 
##     9.40    12.30    16.10    20.40    23.47 
## 
## lowest :  2.5  2.7  3.2  3.4  3.9, highest: 35.8 37.8 39.7 40.4 42.2
## ---------------------------------------------------------------------------
## PctEmployed16_Over 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2895      152      409        1    54.15    9.362    39.97    43.50 
##      .25      .50      .75      .90      .95 
##    48.60    54.50    60.30    64.40    66.50 
## 
## lowest : 17.6 19.5 22.1 23.9 24.0, highest: 74.3 74.4 75.9 76.5 80.1
## ---------------------------------------------------------------------------
## PctUnemployed16_Over 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      195        1    7.852    3.765      2.8      3.8 
##      .25      .50      .75      .90      .95 
##      5.5      7.6      9.7     12.2     13.8 
## 
## lowest :  0.4  0.5  0.7  0.8  0.9, highest: 25.3 25.4 26.8 27.0 29.4
## ---------------------------------------------------------------------------
## PctPrivateCoverage 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      498        1    64.35    12.02    46.40    50.30 
##      .25      .50      .75      .90      .95 
##    57.20    65.10    72.10    77.60    80.57 
## 
## lowest : 22.3 23.4 25.0 27.2 27.8, highest: 88.0 88.8 88.9 89.6 92.3
## ---------------------------------------------------------------------------
## PctPrivateCoverageAlone 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2438      609      459        1    48.45    11.48    32.48    35.30 
##      .25      .50      .75      .90      .95 
##    41.00    48.70    55.60    61.60    65.02 
## 
## lowest : 15.7 16.8 19.6 20.7 22.2, highest: 76.5 76.6 77.1 78.2 78.9
## ---------------------------------------------------------------------------
## PctEmpPrivCoverage 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      450        1     41.2    10.73    26.00    28.90 
##      .25      .50      .75      .90      .95 
##    34.50    41.10    47.70    53.74    56.80 
## 
## lowest : 13.5 14.3 15.0 16.3 16.8, highest: 68.8 68.9 69.2 70.2 70.7
## ---------------------------------------------------------------------------
## PctPublicCoverage 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      395        1    36.25    8.864    23.00    26.10 
##      .25      .50      .75      .90      .95 
##    30.90    36.30    41.55    46.20    49.00 
## 
## lowest : 11.2 11.8 13.5 13.8 14.0, highest: 58.5 59.3 62.2 62.7 65.1
## ---------------------------------------------------------------------------
## PctPublicCoverageAlone 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0      319        1    19.24    6.849    10.00    11.80 
##      .25      .50      .75      .90      .95 
##    14.85    18.80    23.10    27.00    29.90 
## 
## lowest :  2.6  2.7  4.2  4.6  5.0, highest: 40.6 41.4 41.9 43.3 46.6
## ---------------------------------------------------------------------------
## PctWhite 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     3044        1    83.65    16.43    47.96    60.42 
##      .25      .50      .75      .90      .95 
##    77.30    90.06    95.45    97.20    97.77 
## 
## lowest :  10.19916  11.00876  12.01620  12.27367  13.62272
## highest:  99.61528  99.64629  99.69304  99.84590 100.00000
## ---------------------------------------------------------------------------
## PctBlack 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     2972        1    9.108    12.83   0.1007   0.2485 
##      .25      .50      .75      .90      .95 
##   0.6207   2.2476  10.5097  30.3715  42.1703 
## 
## lowest :  0.000000000  0.009738995  0.012794268  0.013453518  0.013691128
## highest: 80.659997700 81.281846340 82.559130620 84.866023580 85.947798580
## ---------------------------------------------------------------------------
## PctAsian 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     2852        1    1.254     1.63   0.0000   0.0578 
##      .25      .50      .75      .90      .95 
##   0.2542   0.5498   1.2210   2.8123   4.4719 
## 
## lowest :  0.000000000  0.006028454  0.007154611  0.007464915  0.007917030
## highest: 33.760904510 33.829509620 35.640183090 37.156931740 42.619424540
## ---------------------------------------------------------------------------
## PctOtherRace 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     2903        1    1.984    2.604  0.01408  0.08920 
##      .25      .50      .75      .90      .95 
##  0.29517  0.82619  2.17796  4.84393  7.85840 
## 
## lowest :  0.000000000  0.005558026  0.007952286  0.009243853  0.010645093
## highest: 36.828519600 37.610993660 37.859022640 38.743746530 41.930251420
## ---------------------------------------------------------------------------
## PctMarriedHouseholds 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     3043        1    51.24    7.176    39.73    43.11 
##      .25      .50      .75      .90      .95 
##    47.76    51.67    55.40    58.67    60.99 
## 
## lowest : 22.99249 23.88563 23.91565 24.02463 24.42908
## highest: 70.78125 71.12701 71.40010 71.70306 78.07540
## ---------------------------------------------------------------------------
## BirthRate 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3047        0     3019        1     5.64    2.071    2.875    3.563 
##      .25      .50      .75      .90      .95 
##    4.521    5.381    6.494    7.984    9.185 
## 
## lowest :  0.0000000  0.2840909  0.3636364  0.4132231  0.4767580
## highest: 17.2605791 17.4010456 17.8770950 18.5567010 21.3261649
## ---------------------------------------------------------------------------

# data cleaning
ols_project$binnedInc <- as.factor(ols_project$binnedInc)
# split "County, State" on the comma and the following space so State has no leading blank
ols_project <- tidyr::separate(ols_project, "Geography", into = c("County/City", "State"), sep = ", ")

The data has missing values in three columns (PctSomeCol18_24 with 2285 missing, PctPrivateCoverageAlone with 609, and PctEmployed16_Over with 152) and is a mixture of numeric and categorical variables. The variable binnedInc was converted to a factor. Additionally, the Geography column was split into two variables, “County/City” and “State”. This split exposes the state as a separate grouping variable, which adds information that the combined Geography string (unique for every row) cannot provide.
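
The two checks described above can be sketched in base R. This is a minimal illustration on a made-up data frame: `toy` and its values are hypothetical and are not the challenge data.

```r
# Toy stand-in for the county table; all values are invented for illustration.
toy <- data.frame(
  Geography = c("Ada County, Idaho", "Yuma County, Arizona", "Zapata County, Texas"),
  PctSomeCol18_24 = c(NA, 40.4, NA),
  PctEmployed16_Over = c(54.5, NA, 60.3),
  stringsAsFactors = FALSE
)

# Count the missing values in every column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Split "County, State" on the comma and any spaces that follow it.
parts <- strsplit(toy$Geography, ",\\s*")
toy$County <- vapply(parts, function(p) p[1], character(1))
toy$State  <- vapply(parts, function(p) p[2], character(1))
print(toy[, c("County", "State")])
```

Splitting on the comma plus trailing whitespace avoids state names that start with a blank, which would otherwise become distinct factor levels.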

0.2.2 Impute missing values and check for outliers

The H2O framework was integrated into the pipeline because it makes dimensionality reduction with auto-encoders and generalized low rank models (GLRM) straightforward. A GLRM was fitted to impute the missing values in the original dataset.
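
To illustrate the idea behind low-rank imputation without an H2O cluster, here is a minimal base R sketch. It is a simplified stand-in (an iterated truncated SVD), not H2O's GLRM solver; the function name `impute_low_rank`, the rank `k`, and the simulated matrix are all assumptions made for illustration.

```r
# Simplified stand-in for low-rank imputation (not H2O's GLRM solver):
# repeatedly fit a rank-k SVD and overwrite only the missing cells.
impute_low_rank <- function(X, k = 2, n_iter = 50) {
  miss <- is.na(X)
  # Start from column means so the SVD has a complete matrix to work on.
  col_means <- colMeans(X, na.rm = TRUE)
  filled <- X
  filled[miss] <- col_means[col(X)[miss]]
  for (i in seq_len(n_iter)) {
    s <- svd(filled, nu = k, nv = k)
    approx <- s$u %*% diag(s$d[1:k], nrow = k) %*% t(s$v)
    filled[miss] <- approx[miss]   # observed cells are never changed
  }
  filled
}

set.seed(323)
# Hypothetical rank-2 data with a few cells knocked out.
A <- matrix(rnorm(40), 20, 2) %*% matrix(rnorm(10), 2, 5)
X <- A
X[sample(length(X), 8)] <- NA
X_imp <- impute_low_rank(X, k = 2)
print(max(abs(X_imp[!is.na(X)] - X[!is.na(X)])))  # 0: observed values untouched
```

In this sketch only the cells that were missing are replaced, so every observed value is carried through unchanged.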

h2o.init(nthreads = -1,min_mem_size = "4g")

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\usnf0\AppData\Local\Temp\RtmpS2OGcZ/h2o_usnf0_started_from_r.out
##     C:\Users\usnf0\AppData\Local\Temp\RtmpS2OGcZ/h2o_usnf0_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 seconds 214 milliseconds 
##     H2O cluster version:        3.10.5.3 
##     H2O cluster version age:    1 month and 22 days  
##     H2O cluster name:           H2O_started_from_R_usnf0_hxx275 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.83 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.4.0 (2017-04-21)

ols_project<- as.h2o(ols_project)

ols.glrm<-h2o.glrm(training_frame = ols_project,k=10,init = "SVD",svd_method = "GramSVD",max_iterations =3000,min_step_size = 1e-6 )

## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [County/City, State].

ols.pred<-predict(ols.glrm,ols_project)

# Train the autoencoder on every reconstructed column except the target
feat.names <- setdiff(names(ols.pred), "reconstr_TARGET_deathRate")
ols.dl <- h2o.deeplearning(x = feat.names, training_frame = ols.pred,
                           autoencoder = TRUE, reproducible = TRUE, seed = 323,
                           hidden = c(10, 2, 10), epochs = 100)

# Per-row reconstruction error; rows above 0.09 are treated as possible outliers
ols.anom <- h2o.anomaly(ols.dl, ols.pred)
ols.anom <- ols.anom %>% as.data.frame() %>% mutate(row_number = seq_len(n())) %>% filter(Reconstruction.MSE > .09)

ols.pred<-h2o.cbind(ols.pred,ols_project[,c("County/City","State")])
ols_project<- ols.pred%>% as.data.table()
ols_project$State<-as.factor(ols_project$State)
ols_project$County.City<-as.factor(ols_project$County.City)

ols.anom  # possible outliers

##   Reconstruction.MSE row_number
## 1         0.11856255        210
## 2         0.12976341        260
## 3         0.09876091        982
## 4         0.19768667       1000
## 5         0.09781145       1065
## 6         0.11074758       1490
## 7         0.11984581       2374
## 8         0.09628350       2546

ols_project[ols.anom$row_number]

##    reconstr_avgAnnCount reconstr_avgDeathsPerYear
## 1:            861.97635                 283.11141
## 2:             54.97603                  16.45885
## 3:           3958.98642                1229.23766
## 4:          38149.99456               14010.13114
## 5:            983.97205                 259.12079
## 6:            213.98816                  61.46226
## 7:          24965.01245                9444.45521
## 8:             42.97959                  17.63177
##    reconstr_TARGET_deathRate reconstr_incidenceRate reconstr_medIncome
## 1:                  136.9045               363.2937          122640.97
## 2:                  132.3758               446.1932          125634.96
## 3:                  128.4056               380.3929          110506.96
## 4:                  147.8705               405.5552           55685.91
## 5:                  129.3135               422.5906          107249.98
## 6:                  335.7610              1200.2902           40207.01
## 7:                  181.5888               470.2980           55057.97
## 8:                  189.1039               437.0334           26083.01
##    reconstr_popEst2015 reconstr_povertyPercent reconstr_studyPerCap
## 1:          375628.935              -1.8344455         4.498867e+02
## 2:           13891.999              -4.2823767         7.302797e-04
## 3:         1142233.800               2.3059846         1.531857e+02
## 4:        10170290.215              35.0795891         2.559074e+02
## 5:          322386.945              -0.5238955         8.436920e+02
## 6:           15233.999              50.2546436         6.245411e-02
## 7:         5238215.077              14.4018756         3.710958e+02
## 8:            9149.997              25.4945990        -6.721804e-02
##    reconstr_binnedInc reconstr_MedianAge reconstr_MedianAgeMale
## 1:  (61494.5, 125635]           35.32910               33.77080
## 2:  (61494.5, 125635]           37.49270               35.98991
## 3:  (61494.5, 125635]           37.65994               31.19040
## 4: (54545.6, 61494.5]           36.03170               27.11996
## 5:  (61494.5, 125635]           37.66537               37.00402
## 6:   [22640, 34218.1]           38.83978               52.39339
## 7:  (61494.5, 125635]           35.83674               41.46691
## 8:   [22640, 34218.1]           33.72340               21.71874
##    reconstr_MedianAgeFemale reconstr_AvgHouseholdSize
## 1:                 35.07343                  3.555909
## 2:                 37.51028                  3.374537
## 3:                 32.43226                  3.460838
## 4:                 29.10121                  4.239767
## 5:                 38.36979                  3.230929
## 6:                 58.27128                  3.618531
## 7:                 46.16567                  2.287342
## 8:                 24.76374                  2.027053
##    reconstr_PercentMarried reconstr_PctNoHS18_24 reconstr_PctHS18_24
## 1:                62.89290             14.702136            30.96427
## 2:                62.86190             11.523299            28.45050
## 3:                58.80342             14.425014            27.83671
## 4:                41.30476             12.481278            20.98855
## 5:                64.56818             12.500823            29.61922
## 6:                51.29031             22.556830            43.80337
## 7:                29.11824              3.959409            19.81418
## 8:                20.85507             15.855093            25.55816
##    reconstr_PctSomeCol18_24 reconstr_PctBachDeg18_24 reconstr_PctHS25_Over
## 1:                 76.92795                16.990502             14.888100
## 2:                 82.08931                18.913195             15.471586
## 3:                 72.86827                16.149781             13.275179
## 4:                 48.46589                20.228439              5.193957
## 5:                 74.29304                15.832267             19.563536
## 6:                 61.48743                 8.522816             55.793071
## 7:                 27.25771                22.468997             15.384794
## 8:                 22.99170                 2.863033             25.037987
##    reconstr_PctBachDeg25_Over reconstr_PctEmployed16_Over
## 1:                  35.896343                    86.88097
## 2:                  37.752830                    88.18877
## 3:                  33.735608                    81.89610
## 4:                  32.719979                    73.80639
## 5:                  32.751776                    84.89267
## 6:                  16.728413                    63.44473
## 7:                  27.584912                    51.07448
## 8:                   7.005274                    28.71234
##    reconstr_PctUnemployed16_Over reconstr_PctPrivateCoverage
## 1:                      3.406779                   101.95141
## 2:                      3.165542                   109.45285
## 3:                      4.098229                    92.65050
## 4:                     15.769650                    43.22236
## 5:                      2.530254                   101.34843
## 6:                     22.745586                    86.54729
## 7:                     15.241694                    56.73909
## 8:                     12.662282                    34.00705
##    reconstr_PctPrivateCoverageAlone reconstr_PctEmpPrivCoverage
## 1:                        73.626191                    79.22048
## 2:                        74.631148                    83.86160
## 3:                        62.494232                    71.76735
## 4:                        48.460404                    41.65464
## 5:                        75.505901                    75.10127
## 6:                        38.920177                    55.17852
## 7:                         2.250338                    45.49659
## 8:                        31.190335                    24.67683
##    reconstr_PctPublicCoverage reconstr_PctPublicCoverageAlone
## 1:                   4.747879                        0.213755
## 2:                   4.732290                       -1.603801
## 3:                   7.546320                        3.566150
## 4:                  40.685264                       39.193979
## 5:                  10.501823                        1.843507
## 6:                  71.199872                       45.192925
## 7:                  46.221133                       29.384747
## 8:                  31.680834                       22.821130
##    reconstr_PctWhite reconstr_PctBlack reconstr_PctAsian
## 1:         66.595136          18.00069          9.373414
## 2:         71.331732          19.21688          8.273722
## 3:         61.466006          17.02473         10.680120
## 4:         56.177699           1.27434         30.080789
## 5:         86.149859           5.30846          7.269670
## 6:         71.649568          80.47621          1.435697
## 7:         60.147868          13.57254         14.449921
## 8:          9.643894          53.94795          1.979483
##    reconstr_PctOtherRace reconstr_PctMarriedHouseholds reconstr_BirthRate
## 1:            7.16604789                      69.58531           6.882339
## 2:            5.18234989                      68.68973           6.437283
## 3:            8.88764356                      64.71846           6.771524
## 4:           29.27091522                      43.65194           5.694446
## 5:            5.40744399                      68.54347           6.738437
## 6:           -0.01587447                      51.51721           7.328961
## 7:           10.07990854                      24.27156           1.882297
## 8:            1.64398517                      23.20373           3.800228
##           County.City        State
## 1:     Loudoun County     Virginia
## 2:  Falls Church city     Virginia
## 3:     Fairfax County     Virginia
## 4: Los Angeles County   California
## 5:     Douglas County     Colorado
## 6:       Union County      Florida
## 7:        Cook County     Illinois
## 8:   Claiborne County  Mississippi

ols_project[reconstr_MedianAge>100,reconstr_MedianAge]#outliers

##  [1] 457.7734 468.9756 545.7721 623.8188 508.4562 619.1590 497.6171
##  [8] 412.1777 480.8220 424.1665 534.8357 406.2605 579.9061 502.2296
## [15] 496.1947 525.0669 519.0263 535.7403 522.3904 469.9527 430.5584
## [22] 413.8742 500.6821 429.3218 500.3413 497.1198 348.5221 511.1347
## [29] 497.4341 508.4157

h2o.shutdown(prompt = FALSE)
rm(ols.glrm, ols.pred, ols.dl)
gc()
detach("package:h2o", unload = TRUE)

## [1] TRUE

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2133614 114.0    3205452 171.2  3205452 171.2
## Vcells 3425435  26.2    5721718  43.7  5719947  43.7

## [1] "A shutdown has been triggered. "

## Warning in value[[3L]](cond): 
## ----------------------------------------------------------------------
## 
## Could not shut down the H2O Java Process!
## Please shutdown H2O manually by navigating to `http://localhost:54321/Shutdown`
## 
## Windows requires the shutdown of h2o before re-installing -or- updating the h2o package.
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------

The data was pulled back from the H2O cluster, and factors were created from “County.City” and “State”. The anomaly method flagged 8 possible outliers (rows with a reconstruction MSE above 0.09). Judging whether these 8 are true outliers requires domain knowledge. However, the MedianAge variable clearly contains erroneous values: the summary table shows values up to 624, and 30 rows have a reconstructed MedianAge above 100, which is impossible for a median age in years.
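
A plausibility check of this kind can be sketched in a few lines of base R. The data frame below is invented, the cut-off of 100 mirrors the filter used above, and the repair step (averaging the male and female medians) is only one possible assumption, not the method used in this project.

```r
# Toy values; the large ones mimic the implausible MedianAge entries.
toy <- data.frame(
  MedianAge       = c(41.0, 536.4, 37.7, 624.0, 44.0),
  MedianAgeMale   = c(39.6, 43.1, 36.4, 50.2, 42.5),
  MedianAgeFemale = c(42.4, 46.0, 39.1, 53.5, 45.3)
)

max_plausible_age <- 100                 # cut-off for a median age in years
bad <- toy$MedianAge > max_plausible_age
print(which(bad))                        # rows that fail the check

# One possible repair (an assumption, not the author's method):
# replace flagged values with the mean of the male and female medians.
toy$MedianAge[bad] <- rowMeans(toy[bad, c("MedianAgeMale", "MedianAgeFemale")])
print(toy$MedianAge)
```

Setting the flagged values to `NA` and leaving them to the imputation step would be an equally reasonable alternative.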

0.2.3 Checking for Normality

plot(Hmisc::describe(ols_project))

## $Categorical

## 
## $Continuous

data.table(Column=names(ols_project[,-c(9,34:35)]),P=apply(ols_project[,-c(9,34:35)],2,function(x)shapiro.test(x)$p.value))

##                               Column            P
##  1:             reconstr_avgAnnCount 4.170786e-73
##  2:        reconstr_avgDeathsPerYear 2.647982e-75
##  3:        reconstr_TARGET_deathRate 2.108897e-11
##  4:           reconstr_incidenceRate 1.822361e-36
##  5:               reconstr_medIncome 5.074380e-38
##  6:              reconstr_popEst2015 1.162513e-76
##  7:          reconstr_povertyPercent 6.133276e-10
##  8:             reconstr_studyPerCap 2.363104e-75
##  9:               reconstr_MedianAge 1.205533e-79
## 10:           reconstr_MedianAgeMale 3.347651e-26
## 11:         reconstr_MedianAgeFemale 4.627598e-21
## 12:        reconstr_AvgHouseholdSize 7.365672e-25
## 13:          reconstr_PercentMarried 1.452269e-41
## 14:            reconstr_PctNoHS18_24 4.473207e-07
## 15:              reconstr_PctHS18_24 1.123750e-05
## 16:         reconstr_PctSomeCol18_24 7.070605e-19
## 17:         reconstr_PctBachDeg18_24 3.397906e-31
## 18:            reconstr_PctHS25_Over 3.775964e-18
## 19:       reconstr_PctBachDeg25_Over 3.538379e-32
## 20:      reconstr_PctEmployed16_Over 1.871671e-11
## 21:    reconstr_PctUnemployed16_Over 2.817221e-17
## 22:      reconstr_PctPrivateCoverage 5.999665e-11
## 23: reconstr_PctPrivateCoverageAlone 9.509105e-42
## 24:      reconstr_PctEmpPrivCoverage 5.724100e-24
## 25:       reconstr_PctPublicCoverage 4.656196e-16
## 26:  reconstr_PctPublicCoverageAlone 1.209840e-07
## 27:                reconstr_PctWhite 4.257334e-47
## 28:                reconstr_PctBlack 1.504331e-36
## 29:                reconstr_PctAsian 6.815750e-56
## 30:            reconstr_PctOtherRace 2.464945e-65
## 31:    reconstr_PctMarriedHouseholds 1.143897e-35
## 32:               reconstr_BirthRate 3.137107e-21
##                               Column            P

Two checks were done to look briefly at the distributions of the variables and identify any that might benefit from a transformation. The plot shows that a few of the distributions have long tails and may require transformation, depending on the modeling approach. Additionally, the Shapiro-Wilk test rejected the null hypothesis of normality for every variable. This is not a problem in itself and is addressed in the next section.
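To illustrate how a long tail drives the Shapiro-Wilk result, the sketch below applies `shapiro.test()` to a simulated log-normal variable before and after a log transform. The data are simulated and are not taken from the cancer dataset; the sample size and distribution parameters are arbitrary assumptions.

```r
set.seed(323)

# Simulated long-tailed variable (log-normal); not drawn from the project data.
x <- rlnorm(500, meanlog = 5, sdlog = 1)

# Shapiro-Wilk p-values on the raw scale and after a log transform.
p_raw <- shapiro.test(x)$p.value
p_log <- shapiro.test(log(x))$p.value

data.frame(scale = c("raw", "log"), p_value = c(p_raw, p_log))
```

With several thousand observations the test can reject normality even for mild departures, so a plot of the distribution is usually worth reading alongside the p-value.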

0.3 Feature Selection

# Crtl<- trainControl(method = "repeatedcv",number = 10,repeats = 3,allowParallel = T)
# features<-train(reconstr_TARGET_deathRate~.,data=ols_project,method = "xgbTree",trControl=Crtl)
# getTrainPerf(features)
imp_feat<-c("reconstr_PctPublicCoverageAlone",
"reconstr_povertyPercent",
"reconstr_AvgHouseholdSize",
"reconstr_PctUnemployed16_Over",
"reconstr_PctBlack",
"reconstr_PctHS18_24")

GGally::ggpairs(ols_project[,imp_feat,with=F])

imp_feat[7]<-"reconstr_TARGET_deathRate"

summary(ols_project[,imp_feat,with=F])

##  reconstr_PctPublicCoverageAlone reconstr_povertyPercent
##  Min.   :-1.604                  Min.   :-4.491         
##  1st Qu.:15.895                  1st Qu.:13.039         
##  Median :18.979                  Median :16.427         
##  Mean   :19.070                  Mean   :16.636         
##  3rd Qu.:22.351                  3rd Qu.:20.129         
##  Max.   :45.193                  Max.   :50.255         
##  reconstr_AvgHouseholdSize reconstr_PctUnemployed16_Over
##  Min.   :1.518             Min.   :-0.4861              
##  1st Qu.:2.337             1st Qu.: 6.2024              
##  Median :2.460             Median : 7.6521              
##  Mean   :2.464             Mean   : 7.7747              
##  3rd Qu.:2.579             3rd Qu.: 9.1386              
##  Max.   :4.240             Max.   :22.7456              
##  reconstr_PctBlack  reconstr_PctHS18_24 reconstr_TARGET_deathRate
##  Min.   :-23.2224   Min.   :13.63       Min.   : 70.38           
##  1st Qu.:  0.8389   1st Qu.:32.19       1st Qu.:161.75           
##  Median :  6.4338   Median :34.86       Median :178.06           
##  Mean   :  8.7382   Mean   :34.97       Mean   :178.91           
##  3rd Qu.: 13.6192   3rd Qu.:37.68       3rd Qu.:194.70           
##  Max.   : 80.4762   Max.   :51.26       Max.   :335.76

Because the variables had outliers and some would otherwise have required transformations, feature selection was done with an algorithmic selector that is robust to both. Possible choices include random forests and boosted trees; XGBoost was used here to identify the important variables. The selected variables appear approximately normally distributed, so no transformations were applied. However, some of them are highly correlated.
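One way to quantify correlation among selected predictors is the variance inflation factor (VIF). The sketch below computes it by hand in base R on simulated data; the variables `x1`, `x2`, `x3` and the helper vector `vif_manual` are hypothetical and stand in for the selected features. This is not the check used in the report, only an illustration of the idea.

```r
set.seed(323)
n <- 300

# Simulated predictors: x2 is constructed to be strongly correlated with x1.
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# Pairwise correlations among the candidate predictors.
round(cor(predictors), 2)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
vif_manual
```

A VIF near 1 indicates little linear overlap with the other predictors; larger values mean the coefficient for that predictor is estimated less precisely.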

0.4 Model Selection

set.seed(323)

intrain<-createDataPartition(ols_project$reconstr_TARGET_deathRate,p=.70,list=F)
training<-ols_project[intrain]
testing<-ols_project[-intrain]

models<-list()
for(i in seq_along(imp_feat[-7])){
  models[[i]]<- lm(reconstr_TARGET_deathRate~.,data=training[,imp_feat[c(1:i,7)],with=F])
}
lapply(models,summary)

## [[1]]
## 
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[, 
##     imp_feat[c(1:i, 7)], with = F])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -125.048   -5.902    0.416    5.981   56.223 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     90.04480    0.92034   97.84   <2e-16 ***
## reconstr_PctPublicCoverageAlone  4.66585    0.04684   99.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.69 on 2133 degrees of freedom
## Multiple R-squared:  0.8231, Adjusted R-squared:  0.823 
## F-statistic:  9924 on 1 and 2133 DF,  p-value: < 2.2e-16
## 
## 
## [[2]]
## 
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[, 
##     imp_feat[c(1:i, 7)], with = F])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -119.445   -5.857    0.177    5.833   54.842 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      96.7587     1.1294  85.671   <2e-16 ***
## reconstr_PctPublicCoverageAlone   2.7993     0.1950  14.352   <2e-16 ***
## reconstr_povertyPercent           1.7344     0.1762   9.846   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.45 on 2132 degrees of freedom
## Multiple R-squared:  0.8308, Adjusted R-squared:  0.8306 
## F-statistic:  5234 on 2 and 2132 DF,  p-value: < 2.2e-16
## 
## 
## [[3]]
## 
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[, 
##     imp_feat[c(1:i, 7)], with = F])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -174.636   -2.611    0.652    3.400   29.122 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      17.9976     1.9200   9.374   <2e-16 ***
## reconstr_PctPublicCoverageAlone   2.4357     0.1396  17.446   <2e-16 ***
## reconstr_povertyPercent           1.8658     0.1259  14.817   <2e-16 ***
## reconstr_AvgHouseholdSize        33.8680     0.7491  45.209   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.471 on 2131 degrees of freedom
## Multiple R-squared:  0.9136, Adjusted R-squared:  0.9135 
## F-statistic:  7514 on 3 and 2131 DF,  p-value: < 2.2e-16
## 
## 
## [[4]]
## 
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[, 
##     imp_feat[c(1:i, 7)], with = F])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -173.374   -2.606    0.800    3.472   25.121 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      19.6867     1.9779   9.953  < 2e-16 ***
## reconstr_PctPublicCoverageAlone   2.4864     0.1400  17.754  < 2e-16 ***
## reconstr_povertyPercent           1.5545     0.1551  10.022  < 2e-16 ***
## reconstr_AvgHouseholdSize        32.7312     0.8179  40.021  < 2e-16 ***
## reconstr_PctUnemployed16_Over     0.6850     0.2003   3.420 0.000638 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.452 on 2130 degrees of freedom
## Multiple R-squared:  0.9141, Adjusted R-squared:  0.9139 
## F-statistic:  5667 on 4 and 2130 DF,  p-value: < 2.2e-16
## 
## 
## [[5]]
## 
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[, 
##     imp_feat[c(1:i, 7)], with = F])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -135.333   -2.848    1.265    3.278   24.931 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      19.8208     1.6474   12.03   <2e-16 ***
## reconstr_PctPublicCoverageAlone  12.0653     0.3333   36.20   <2e-16 ***
## reconstr_povertyPercent          -3.2424     0.2028  -15.99   <2e-16 ***
## reconstr_AvgHouseholdSize        36.2343     0.6907   52.46   <2e-16 ***
## reconstr_PctUnemployed16_Over   -15.9807     0.5682  -28.13   <2e-16 ***
## reconstr_PctBlack                 2.0589     0.0671   30.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.207 on 2129 degrees of freedom
## Multiple R-squared:  0.9404, Adjusted R-squared:  0.9403 
## F-statistic:  6723 on 5 and 2129 DF,  p-value: < 2.2e-16
## 
## 
## [[6]]
## 
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[, 
##     imp_feat[c(1:i, 7)], with = F])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.721  -2.514   0.683   2.716  32.148 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      19.59237    1.31002   14.96   <2e-16 ***
## reconstr_PctPublicCoverageAlone   7.86404    0.29066   27.06   <2e-16 ***
## reconstr_povertyPercent          -2.03437    0.16489  -12.34   <2e-16 ***
## reconstr_AvgHouseholdSize        17.01313    0.77453   21.97   <2e-16 ***
## reconstr_PctUnemployed16_Over   -11.34322    0.47065  -24.10   <2e-16 ***
## reconstr_PctBlack                 1.91697    0.05351   35.82   <2e-16 ***
## reconstr_PctHS18_24               2.08064    0.05912   35.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.936 on 2128 degrees of freedom
## Multiple R-squared:  0.9624, Adjusted R-squared:  0.9622 
## F-statistic:  9066 on 6 and 2128 DF,  p-value: < 2.2e-16

do.call(anova,models)

## Analysis of Variance Table
## 
## Model 1: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone
## Model 2: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone + 
##     reconstr_povertyPercent
## Model 3: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone + 
##     reconstr_povertyPercent + reconstr_AvgHouseholdSize
## Model 4: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone + 
##     reconstr_povertyPercent + reconstr_AvgHouseholdSize + reconstr_PctUnemployed16_Over
## Model 5: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone + 
##     reconstr_povertyPercent + reconstr_AvgHouseholdSize + reconstr_PctUnemployed16_Over + 
##     reconstr_PctBlack
## Model 6: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone + 
##     reconstr_povertyPercent + reconstr_AvgHouseholdSize + reconstr_PctUnemployed16_Over + 
##     reconstr_PctBlack + reconstr_PctHS18_24
##   Res.Df    RSS Df Sum of Sq        F    Pr(>F)    
## 1   2133 243619                                    
## 2   2132 233024  1     10595  434.885 < 2.2e-16 ***
## 3   2131 118944  1    114079 4682.564 < 2.2e-16 ***
## 4   2130 118295  1       650   26.663 2.647e-07 ***
## 5   2129  82024  1     36271 1488.812 < 2.2e-16 ***
## 6   2128  51844  1     30180 1238.780 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow=c(2,2))
plot(models[[6]]) # diagnostic plots; note the high-leverage point

ols_project[709]

##    reconstr_avgAnnCount reconstr_avgDeathsPerYear
## 1:             427.9675                  183.0179
##    reconstr_TARGET_deathRate reconstr_incidenceRate reconstr_medIncome
## 1:                  192.9291               470.1869           36739.99
##    reconstr_popEst2015 reconstr_povertyPercent reconstr_studyPerCap
## 1:            68579.99                19.89551          -0.01039127
##    reconstr_binnedInc reconstr_MedianAge reconstr_MedianAgeMale
## 1:   (42724.4, 45201]           43.68159               43.10673
##    reconstr_MedianAgeFemale reconstr_AvgHouseholdSize
## 1:                 45.86514                  2.409053
##    reconstr_PercentMarried reconstr_PctNoHS18_24 reconstr_PctHS18_24
## 1:                54.22054              19.39945             38.0621
##    reconstr_PctSomeCol18_24 reconstr_PctBachDeg18_24 reconstr_PctHS25_Over
## 1:                 35.28683                 4.529534              40.86022
##    reconstr_PctBachDeg25_Over reconstr_PctEmployed16_Over
## 1:                   9.892463                    52.68388
##    reconstr_PctUnemployed16_Over reconstr_PctPrivateCoverage
## 1:                      8.354626                    62.23812
##    reconstr_PctPrivateCoverageAlone reconstr_PctEmpPrivCoverage
## 1:                         44.42119                    37.55365
##    reconstr_PctPublicCoverage reconstr_PctPublicCoverageAlone
## 1:                   42.84199                         22.6819
##    reconstr_PctWhite reconstr_PctBlack reconstr_PctAsian
## 1:          96.46664          3.325524        -0.2832896
##    reconstr_PctOtherRace reconstr_PctMarriedHouseholds reconstr_BirthRate
## 1:              1.060268                      52.33497           5.685039
##      County.City      State
## 1: Greene County  Tennessee

par(mfrow=c(1,1))

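The previous section identified PctPublicCoverageAlone as the most important feature; on its own it explains about 82% of the variability in the training data (R-squared 0.8231). A check was therefore done to test whether adding the other variables improves the model. Both the summaries and the ANOVA confirm that all six predictors are significant and that each addition improves the fit, with adjusted R-squared rising to 0.9622 for the six-predictor model. The coefficients of povertyPercent and PctUnemployed16_Over change sign once PctBlack is added (model 5), which is consistent with the high correlation among predictors noted earlier, so individual coefficients should be interpreted with care. Another nested model was made (not shown here) in which the highly correlated variables were removed; that model did slightly worse.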
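The nested-model comparison can be sketched on a small built-in dataset. The example below uses `mtcars`, which ships with R, in place of the training set; the predictor ordering in `predictors` is a hypothetical importance ranking chosen only for illustration. It uses `reformulate()` to build each formula from variable names instead of subsetting columns.

```r
# mtcars stands in for the training data here.
predictors <- c("wt", "hp", "qsec")  # hypothetical importance ordering
response <- "mpg"

# Fit models using the first 1, 2, ... predictors in order.
models <- lapply(seq_along(predictors), function(i) {
  lm(reformulate(predictors[1:i], response = response), data = mtcars)
})

# Sequential F-tests: each row compares a model with the one before it.
comparison <- do.call(anova, models)
comparison

# R-squared and adjusted R-squared for each nested model.
r2 <- sapply(models, function(m) summary(m)$r.squared)
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
data.frame(n_predictors = seq_along(predictors), r2, adj_r2)
```

Plain R-squared never decreases as predictors are added, which is why the adjusted value and the F-tests are the more useful columns when deciding whether an extra variable earns its place.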

In the diagnostic plots, the trend line of the residual plot is not flat and comes close to showing a pattern. The Q-Q plot gives no cause for concern except in the tails. However, the data point at row 709 exhibits substantial leverage and may be an outlier; a domain expert will need to investigate it.
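Leverage can also be inspected numerically instead of reading it off the plot. The sketch below uses `hatvalues()` and `cooks.distance()` from base R on a model fitted to `mtcars`, which stands in for the project data. The cutoffs `2 * p / n` and `4 / n` are common rules of thumb assumed here for illustration, not thresholds taken from the report.

```r
# Small stand-in model; mtcars ships with R.
fit <- lm(mpg ~ wt + hp, data = mtcars)

lev <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)  # influence of each observation on the fit

p <- length(coef(fit))       # number of estimated coefficients
n <- nrow(mtcars)

# Rule-of-thumb cutoffs (assumptions, not hard limits).
high_lev <- lev > 2 * p / n
high_cook <- cook > 4 / n

diagnostics <- data.frame(leverage = lev, cooks_d = cook, high_lev, high_cook)

# Observations worth a closer look.
diagnostics[high_lev | high_cook, ]
```

A point flagged this way is a candidate for review, not automatically an error; whether it should be kept is a domain question.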

0.5 Model Prediction

results<-predict(models[[6]],testing)
data.frame(results,actual=testing$reconstr_TARGET_deathRate) %>% ggplot(aes(x=results,y=actual)) + geom_point()+stat_smooth(method="lm",show.legend = T)

RMSE(pred = results,obs = testing$reconstr_TARGET_deathRate)

## [1] 4.454168

The test set was scored with the model, and comparing the predicted values with the actual values gives acceptable results, with an RMSE of 4.454.
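For reference, RMSE is the square root of the mean squared difference between predictions and observations. The sketch below computes it by hand in base R; the function name `rmse` and the three-value vectors are made up for illustration and are not project results.

```r
# Root mean squared error: sqrt of the average squared prediction error.
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Illustrative values only; errors are -2, 1, -4, so the mean square is 7.
pred <- c(170, 180, 190)
obs <- c(172, 179, 194)

rmse_value <- rmse(pred, obs)
rmse_value  # equals sqrt(7)
```

Because RMSE is in the same units as the target, it can be read directly against the spread of the death rate itself.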

0.6 Concluding Remarks

The analysis in this report produced an OLS model with six predictors that predicts county-level cancer mortality rates with a test-set RMSE of 4.454, with the aim of giving the SMEs in this field a data-driven alternative to relying on their collective agreement alone.