The goal of this project (a data.world challenge) is to predict cancer mortality rates for U.S. counties.
The data are hosted at https://data.world/nrippner/ols-regression-challenge. They were aggregated from a number of sources, including the American Community Survey (https://www.census.gov), https://www.clinicaltrials.gov, and https://www.cancer.gov.
Using only OLS to build the model, the final model chosen uses 6 predictors, selected from the 34 variables in the raw data, and has an adjusted R-squared of 0.9622 (multiple R-squared 0.9624) on the training set.
The data were loaded from data.world and a summary analysis was done to gain some quick insight.
library(dplyr);library(data.table);library(h2o);library(caret)
set.seed(323)
ols_project <- fread("cancer_reg.csv")
Hmisc::describe(ols_project)
## ols_project
##
## 34 Variables 3047 Observations
## ---------------------------------------------------------------------------
## avgAnnCount
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 929 1 606.3 856.7 23 37
## .25 .50 .75 .90 .95
## 76 171 518 1963 1972
##
## lowest : 6 7 8 9 10, highest: 13294 14477 15470 24965 38150
## ---------------------------------------------------------------------------
## avgDeathsPerYear
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 608 1 186 262.2 9.0 14.0
## .25 .50 .75 .90 .95
## 28.0 61.0 149.0 378.4 749.4
##
## lowest : 3 4 5 6 7, highest: 4895 5108 5780 9445 14010
## ---------------------------------------------------------------------------
## TARGET_deathRate
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 1053 1 178.7 30.67 134.1 145.4
## .25 .50 .75 .90 .95
## 161.2 178.1 195.2 213.3 224.4
##
## lowest : 59.7 66.3 80.8 87.6 93.8, highest: 277.6 280.8 292.5 293.9 362.8
## ---------------------------------------------------------------------------
## incidenceRate
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 1506 1 448.3 57.33 355.1 380.8
## .25 .50 .75 .90 .95
## 420.3 453.5 480.9 507.3 525.0
##
## lowest : 201.3 211.1 214.8 221.5 234.0, highest: 639.7 651.3 718.9 1014.2 1206.9
## ---------------------------------------------------------------------------
## medIncome
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 2920 1 47063 12717 31838 34212
## .25 .50 .75 .90 .95
## 38883 45207 52492 61323 69964
##
## lowest : 22640 23047 24035 24265 24707, highest: 107250 108477 110507 122641 125635
## ---------------------------------------------------------------------------
## popEst2015
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 2999 1 102637 153938 3666 5796
## .25 .50 .75 .90 .95
## 11684 26643 68671 207791 436220
##
## lowest : 827 829 1130 1310 1330
## highest: 3299521 4167947 4538028 5238216 10170292
## ---------------------------------------------------------------------------
## povertyPercent
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 333 1 16.88 7.026 8.30 9.80
## .25 .50 .75 .90 .95
## 12.15 15.90 20.40 25.30 28.70
##
## lowest : 3.2 3.7 3.9 4.0 4.2, highest: 44.0 45.1 46.9 47.0 47.4
## ---------------------------------------------------------------------------
## studyPerCap
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 1117 0.745 155.4 272.2 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 0.00 83.65 412.81 747.58
##
## lowest : 0.000000 1.481842 2.887570 5.030485 5.033625
## highest: 4938.271605 6810.442679 8585.924288 9439.200444 9762.308998
## ---------------------------------------------------------------------------
## binnedInc
## n missing distinct
## 3047 0 10
##
## (34218.1, 37413.8] (304, 0.100), (37413.8, 40362.7] (304, 0.100),
## (40362.7, 42724.4] (304, 0.100), (42724.4, 45201] (305, 0.100), (45201,
## 48021.6] (306, 0.100), (48021.6, 51046.4] (305, 0.100), (51046.4, 54545.6]
## (305, 0.100), (54545.6, 61494.5] (306, 0.100), (61494.5, 125635] (302,
## 0.099), [22640, 34218.1] (306, 0.100)
## ---------------------------------------------------------------------------
## MedianAge
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 325 1 45.27 14.48 31.93 34.50
## .25 .50 .75 .90 .95
## 37.70 41.00 44.00 47.70 50.17
##
## lowest : 22.3 23.2 23.3 23.5 23.9, highest: 536.4 546.0 579.6 619.2 624.0
## ---------------------------------------------------------------------------
## MedianAgeMale
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 298 1 39.57 5.792 31.00 33.20
## .25 .50 .75 .90 .95
## 36.35 39.60 42.50 46.10 48.60
##
## lowest : 22.4 22.8 23.0 23.7 23.8, highest: 56.5 58.5 58.6 60.2 64.7
## ---------------------------------------------------------------------------
## MedianAgeFemale
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 296 1 42.15 5.881 32.9 35.4
## .25 .50 .75 .90 .95
## 39.1 42.4 45.3 48.6 50.6
##
## lowest : 22.3 22.8 23.6 23.9 24.5, highest: 58.0 58.2 58.7 59.6 65.7
## ---------------------------------------------------------------------------
## Geography
## n missing distinct
## 3047 0 3047
##
## lowest : Abbeville County, South Carolina Acadia Parish, Louisiana Accomack County, Virginia Ada County, Idaho Adair County, Iowa
## highest: Yukon-Koyukuk Census Area, Alaska Yuma County, Arizona Yuma County, Colorado Zapata County, Texas Zavala County, Texas
## ---------------------------------------------------------------------------
## AvgHouseholdSize
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 199 1 2.48 0.3509 2.14 2.24
## .25 .50 .75 .90 .95
## 2.37 2.50 2.63 2.82 2.98
##
## lowest : 0.0221 0.0222 0.0225 0.0230 0.0236, highest: 3.7800 3.8400 3.8600 3.9300 3.9700
## ---------------------------------------------------------------------------
## PercentMarried
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 362 1 51.77 7.612 39.30 43.06
## .25 .50 .75 .90 .95
## 47.75 52.40 56.40 59.94 61.90
##
## lowest : 23.1 25.1 25.3 26.2 26.3, highest: 68.0 69.1 69.2 72.3 72.5
## ---------------------------------------------------------------------------
## PctNoHS18_24
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 405 1 18.22 8.802 6.90 9.00
## .25 .50 .75 .90 .95
## 12.80 17.10 22.70 28.60 32.87
##
## lowest : 0.0 0.5 0.8 1.2 1.4, highest: 59.0 59.1 59.7 62.7 64.1
## ---------------------------------------------------------------------------
## PctHS18_24
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 469 1 35 10.1 20.40 23.66
## .25 .50 .75 .90 .95
## 29.20 34.70 40.70 46.10 50.60
##
## lowest : 0.0 7.1 8.0 8.6 10.0, highest: 65.5 65.7 66.2 72.1 72.5
## ---------------------------------------------------------------------------
## PctSomeCol18_24
## n missing distinct Info Mean Gmd .05 .10
## 762 2285 343 1 40.98 12.24 24.52 27.81
## .25 .50 .75 .90 .95
## 34.00 40.40 46.40 56.18 60.87
##
## lowest : 7.1 9.6 10.1 11.2 11.4, highest: 73.5 75.2 76.2 78.3 79.0
## ---------------------------------------------------------------------------
## PctBachDeg18_24
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 219 1 6.158 4.681 0.5 1.4
## .25 .50 .75 .90 .95
## 3.1 5.4 8.2 11.7 14.3
##
## lowest : 0.0 0.1 0.2 0.3 0.4, highest: 33.3 37.5 40.3 43.4 51.8
## ---------------------------------------------------------------------------
## PctHS25_Over
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 361 1 34.8 7.906 22.23 25.60
## .25 .50 .75 .90 .95
## 30.40 35.30 39.65 43.40 45.40
##
## lowest : 7.5 8.3 10.8 11.5 11.8, highest: 51.7 52.1 52.7 53.6 54.8
## ---------------------------------------------------------------------------
## PctBachDeg25_Over
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 281 1 13.28 5.855 6.40 7.40
## .25 .50 .75 .90 .95
## 9.40 12.30 16.10 20.40 23.47
##
## lowest : 2.5 2.7 3.2 3.4 3.9, highest: 35.8 37.8 39.7 40.4 42.2
## ---------------------------------------------------------------------------
## PctEmployed16_Over
## n missing distinct Info Mean Gmd .05 .10
## 2895 152 409 1 54.15 9.362 39.97 43.50
## .25 .50 .75 .90 .95
## 48.60 54.50 60.30 64.40 66.50
##
## lowest : 17.6 19.5 22.1 23.9 24.0, highest: 74.3 74.4 75.9 76.5 80.1
## ---------------------------------------------------------------------------
## PctUnemployed16_Over
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 195 1 7.852 3.765 2.8 3.8
## .25 .50 .75 .90 .95
## 5.5 7.6 9.7 12.2 13.8
##
## lowest : 0.4 0.5 0.7 0.8 0.9, highest: 25.3 25.4 26.8 27.0 29.4
## ---------------------------------------------------------------------------
## PctPrivateCoverage
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 498 1 64.35 12.02 46.40 50.30
## .25 .50 .75 .90 .95
## 57.20 65.10 72.10 77.60 80.57
##
## lowest : 22.3 23.4 25.0 27.2 27.8, highest: 88.0 88.8 88.9 89.6 92.3
## ---------------------------------------------------------------------------
## PctPrivateCoverageAlone
## n missing distinct Info Mean Gmd .05 .10
## 2438 609 459 1 48.45 11.48 32.48 35.30
## .25 .50 .75 .90 .95
## 41.00 48.70 55.60 61.60 65.02
##
## lowest : 15.7 16.8 19.6 20.7 22.2, highest: 76.5 76.6 77.1 78.2 78.9
## ---------------------------------------------------------------------------
## PctEmpPrivCoverage
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 450 1 41.2 10.73 26.00 28.90
## .25 .50 .75 .90 .95
## 34.50 41.10 47.70 53.74 56.80
##
## lowest : 13.5 14.3 15.0 16.3 16.8, highest: 68.8 68.9 69.2 70.2 70.7
## ---------------------------------------------------------------------------
## PctPublicCoverage
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 395 1 36.25 8.864 23.00 26.10
## .25 .50 .75 .90 .95
## 30.90 36.30 41.55 46.20 49.00
##
## lowest : 11.2 11.8 13.5 13.8 14.0, highest: 58.5 59.3 62.2 62.7 65.1
## ---------------------------------------------------------------------------
## PctPublicCoverageAlone
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 319 1 19.24 6.849 10.00 11.80
## .25 .50 .75 .90 .95
## 14.85 18.80 23.10 27.00 29.90
##
## lowest : 2.6 2.7 4.2 4.6 5.0, highest: 40.6 41.4 41.9 43.3 46.6
## ---------------------------------------------------------------------------
## PctWhite
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 3044 1 83.65 16.43 47.96 60.42
## .25 .50 .75 .90 .95
## 77.30 90.06 95.45 97.20 97.77
##
## lowest : 10.19916 11.00876 12.01620 12.27367 13.62272
## highest: 99.61528 99.64629 99.69304 99.84590 100.00000
## ---------------------------------------------------------------------------
## PctBlack
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 2972 1 9.108 12.83 0.1007 0.2485
## .25 .50 .75 .90 .95
## 0.6207 2.2476 10.5097 30.3715 42.1703
##
## lowest : 0.000000000 0.009738995 0.012794268 0.013453518 0.013691128
## highest: 80.659997700 81.281846340 82.559130620 84.866023580 85.947798580
## ---------------------------------------------------------------------------
## PctAsian
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 2852 1 1.254 1.63 0.0000 0.0578
## .25 .50 .75 .90 .95
## 0.2542 0.5498 1.2210 2.8123 4.4719
##
## lowest : 0.000000000 0.006028454 0.007154611 0.007464915 0.007917030
## highest: 33.760904510 33.829509620 35.640183090 37.156931740 42.619424540
## ---------------------------------------------------------------------------
## PctOtherRace
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 2903 1 1.984 2.604 0.01408 0.08920
## .25 .50 .75 .90 .95
## 0.29517 0.82619 2.17796 4.84393 7.85840
##
## lowest : 0.000000000 0.005558026 0.007952286 0.009243853 0.010645093
## highest: 36.828519600 37.610993660 37.859022640 38.743746530 41.930251420
## ---------------------------------------------------------------------------
## PctMarriedHouseholds
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 3043 1 51.24 7.176 39.73 43.11
## .25 .50 .75 .90 .95
## 47.76 51.67 55.40 58.67 60.99
##
## lowest : 22.99249 23.88563 23.91565 24.02463 24.42908
## highest: 70.78125 71.12701 71.40010 71.70306 78.07540
## ---------------------------------------------------------------------------
## BirthRate
## n missing distinct Info Mean Gmd .05 .10
## 3047 0 3019 1 5.64 2.071 2.875 3.563
## .25 .50 .75 .90 .95
## 4.521 5.381 6.494 7.984 9.185
##
## lowest : 0.0000000 0.2840909 0.3636364 0.4132231 0.4767580
## highest: 17.2605791 17.4010456 17.8770950 18.5567010 21.3261649
## ---------------------------------------------------------------------------
# data cleaning
ols_project$binnedInc<-as.factor(ols_project$binnedInc)
ols_project<- tidyr::separate(ols_project,"Geography",into = c("County/City","State"),sep = ",")
The data have missing values (in PctSomeCol18_24, PctEmployed16_Over, and PctPrivateCoverageAlone) and are a mixture of numeric and categorical variables. The variable binnedInc was converted to a factor. Additionally, the Geography column can be split into two variables, “County/City” and “State”. This was done so that the county and the state are each available as a separate variable.
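One detail worth checking when splitting on a bare comma is the space that may remain in front of the state name. The following is a minimal base R sketch, assuming strings shaped like the Geography values in the summary above. The `split_geography` helper is hypothetical and is not part of the original pipeline.

```r
# Hypothetical helper: split "County, State" strings and trim whitespace.
split_geography <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  data.frame(
    county = trimws(vapply(parts, `[`, character(1), 1)),
    state  = trimws(vapply(parts, `[`, character(1), 2)),
    stringsAsFactors = FALSE
  )
}

geo <- split_geography(c("Abbeville County, South Carolina",
                         "Ada County, Idaho"))
print(geo)
```

Trimming before converting to a factor keeps " Idaho" and "Idaho" from becoming two different levels.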
The H2O framework was integrated into the pipeline because it makes dimensionality reduction with autoencoders and generalized low rank models (GLRM) easy. A GLRM was fitted to impute the missing values in the original data set.
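The idea behind low-rank imputation can be illustrated without H2O. The sketch below is a toy stand-in, not the GLRM used in this report: it fills the gaps with column means, takes a rank-1 SVD reconstruction, and then copies the reconstructed values into the missing cells only. The data and all variable names are made up for illustration.

```r
# Toy low-rank imputation with base R (an assumption-laden stand-in for GLRM).
set.seed(323)
n <- 50
z <- rnorm(n)
X <- cbind(a = z,
           b = 2 * z + rnorm(n, sd = 0.1),
           c = -z + rnorm(n, sd = 0.1))
X_missing <- X
X_missing[c(3, 17, 40), "b"] <- NA          # knock out a few entries

# 1. Start from column-mean imputation.
na_idx <- which(is.na(X_missing), arr.ind = TRUE)
filled <- X_missing
filled[na_idx] <- colMeans(X_missing, na.rm = TRUE)[na_idx[, "col"]]

# 2. Rank-k reconstruction of the centred matrix via SVD.
k <- 1
centre <- colMeans(filled)
s <- svd(sweep(filled, 2, centre))
recon <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k) %*%
  t(s$v[, 1:k, drop = FALSE])
recon <- sweep(recon, 2, centre, "+")

# 3. Keep every observed value; fill only the missing cells.
imputed <- X_missing
imputed[na_idx] <- recon[na_idx]
```

The third step is the design choice to note: observed entries stay exactly as recorded, and only the missing cells take reconstructed values.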
h2o.init(nthreads = -1,min_mem_size = "4g")
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\usnf0\AppData\Local\Temp\RtmpS2OGcZ/h2o_usnf0_started_from_r.out
## C:\Users\usnf0\AppData\Local\Temp\RtmpS2OGcZ/h2o_usnf0_started_from_r.err
##
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 214 milliseconds
## H2O cluster version: 3.10.5.3
## H2O cluster version age: 1 month and 22 days
## H2O cluster name: H2O_started_from_R_usnf0_hxx275
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.83 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.4.0 (2017-04-21)
ols_project<- as.h2o(ols_project)
ols.glrm<-h2o.glrm(training_frame = ols_project,k=10,init = "SVD",svd_method = "GramSVD",max_iterations =3000,min_step_size = 1e-6 )
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [County/City, State].
ols.pred<-predict(ols.glrm,ols_project)
feat.names<- setdiff(names(ols.pred),"reconstr_TARGET_deathRate")
ols.dl<-h2o.deeplearning(x =feat.names, training_frame = ols.pred, autoencoder = T,reproducible = T,seed = 323,hidden = c(10,2,10),epochs = 100)
ols.anom<-h2o.anomaly(ols.dl,ols.pred)
ols.anom<-ols.anom %>% as.data.frame() %>% mutate(row_number=1:3047)%>% filter(Reconstruction.MSE>.09)
## Warning: package 'bindrcpp' was built under R version 3.4.1
ols.pred<-h2o.cbind(ols.pred,ols_project[,c("County/City","State")])
ols_project<- ols.pred%>% as.data.table()
ols_project$State<-as.factor(ols_project$State)
ols_project$County.City<-as.factor(ols_project$County.City)
ols.anom# possible outliers
## Reconstruction.MSE row_number
## 1 0.11856255 210
## 2 0.12976341 260
## 3 0.09876091 982
## 4 0.19768667 1000
## 5 0.09781145 1065
## 6 0.11074758 1490
## 7 0.11984581 2374
## 8 0.09628350 2546
ols_project[ols.anom$row_number]
## reconstr_avgAnnCount reconstr_avgDeathsPerYear
## 1: 861.97635 283.11141
## 2: 54.97603 16.45885
## 3: 3958.98642 1229.23766
## 4: 38149.99456 14010.13114
## 5: 983.97205 259.12079
## 6: 213.98816 61.46226
## 7: 24965.01245 9444.45521
## 8: 42.97959 17.63177
## reconstr_TARGET_deathRate reconstr_incidenceRate reconstr_medIncome
## 1: 136.9045 363.2937 122640.97
## 2: 132.3758 446.1932 125634.96
## 3: 128.4056 380.3929 110506.96
## 4: 147.8705 405.5552 55685.91
## 5: 129.3135 422.5906 107249.98
## 6: 335.7610 1200.2902 40207.01
## 7: 181.5888 470.2980 55057.97
## 8: 189.1039 437.0334 26083.01
## reconstr_popEst2015 reconstr_povertyPercent reconstr_studyPerCap
## 1: 375628.935 -1.8344455 4.498867e+02
## 2: 13891.999 -4.2823767 7.302797e-04
## 3: 1142233.800 2.3059846 1.531857e+02
## 4: 10170290.215 35.0795891 2.559074e+02
## 5: 322386.945 -0.5238955 8.436920e+02
## 6: 15233.999 50.2546436 6.245411e-02
## 7: 5238215.077 14.4018756 3.710958e+02
## 8: 9149.997 25.4945990 -6.721804e-02
## reconstr_binnedInc reconstr_MedianAge reconstr_MedianAgeMale
## 1: (61494.5, 125635] 35.32910 33.77080
## 2: (61494.5, 125635] 37.49270 35.98991
## 3: (61494.5, 125635] 37.65994 31.19040
## 4: (54545.6, 61494.5] 36.03170 27.11996
## 5: (61494.5, 125635] 37.66537 37.00402
## 6: [22640, 34218.1] 38.83978 52.39339
## 7: (61494.5, 125635] 35.83674 41.46691
## 8: [22640, 34218.1] 33.72340 21.71874
## reconstr_MedianAgeFemale reconstr_AvgHouseholdSize
## 1: 35.07343 3.555909
## 2: 37.51028 3.374537
## 3: 32.43226 3.460838
## 4: 29.10121 4.239767
## 5: 38.36979 3.230929
## 6: 58.27128 3.618531
## 7: 46.16567 2.287342
## 8: 24.76374 2.027053
## reconstr_PercentMarried reconstr_PctNoHS18_24 reconstr_PctHS18_24
## 1: 62.89290 14.702136 30.96427
## 2: 62.86190 11.523299 28.45050
## 3: 58.80342 14.425014 27.83671
## 4: 41.30476 12.481278 20.98855
## 5: 64.56818 12.500823 29.61922
## 6: 51.29031 22.556830 43.80337
## 7: 29.11824 3.959409 19.81418
## 8: 20.85507 15.855093 25.55816
## reconstr_PctSomeCol18_24 reconstr_PctBachDeg18_24 reconstr_PctHS25_Over
## 1: 76.92795 16.990502 14.888100
## 2: 82.08931 18.913195 15.471586
## 3: 72.86827 16.149781 13.275179
## 4: 48.46589 20.228439 5.193957
## 5: 74.29304 15.832267 19.563536
## 6: 61.48743 8.522816 55.793071
## 7: 27.25771 22.468997 15.384794
## 8: 22.99170 2.863033 25.037987
## reconstr_PctBachDeg25_Over reconstr_PctEmployed16_Over
## 1: 35.896343 86.88097
## 2: 37.752830 88.18877
## 3: 33.735608 81.89610
## 4: 32.719979 73.80639
## 5: 32.751776 84.89267
## 6: 16.728413 63.44473
## 7: 27.584912 51.07448
## 8: 7.005274 28.71234
## reconstr_PctUnemployed16_Over reconstr_PctPrivateCoverage
## 1: 3.406779 101.95141
## 2: 3.165542 109.45285
## 3: 4.098229 92.65050
## 4: 15.769650 43.22236
## 5: 2.530254 101.34843
## 6: 22.745586 86.54729
## 7: 15.241694 56.73909
## 8: 12.662282 34.00705
## reconstr_PctPrivateCoverageAlone reconstr_PctEmpPrivCoverage
## 1: 73.626191 79.22048
## 2: 74.631148 83.86160
## 3: 62.494232 71.76735
## 4: 48.460404 41.65464
## 5: 75.505901 75.10127
## 6: 38.920177 55.17852
## 7: 2.250338 45.49659
## 8: 31.190335 24.67683
## reconstr_PctPublicCoverage reconstr_PctPublicCoverageAlone
## 1: 4.747879 0.213755
## 2: 4.732290 -1.603801
## 3: 7.546320 3.566150
## 4: 40.685264 39.193979
## 5: 10.501823 1.843507
## 6: 71.199872 45.192925
## 7: 46.221133 29.384747
## 8: 31.680834 22.821130
## reconstr_PctWhite reconstr_PctBlack reconstr_PctAsian
## 1: 66.595136 18.00069 9.373414
## 2: 71.331732 19.21688 8.273722
## 3: 61.466006 17.02473 10.680120
## 4: 56.177699 1.27434 30.080789
## 5: 86.149859 5.30846 7.269670
## 6: 71.649568 80.47621 1.435697
## 7: 60.147868 13.57254 14.449921
## 8: 9.643894 53.94795 1.979483
## reconstr_PctOtherRace reconstr_PctMarriedHouseholds reconstr_BirthRate
## 1: 7.16604789 69.58531 6.882339
## 2: 5.18234989 68.68973 6.437283
## 3: 8.88764356 64.71846 6.771524
## 4: 29.27091522 43.65194 5.694446
## 5: 5.40744399 68.54347 6.738437
## 6: -0.01587447 51.51721 7.328961
## 7: 10.07990854 24.27156 1.882297
## 8: 1.64398517 23.20373 3.800228
## County.City State
## 1: Loudoun County Virginia
## 2: Falls Church city Virginia
## 3: Fairfax County Virginia
## 4: Los Angeles County California
## 5: Douglas County Colorado
## 6: Union County Florida
## 7: Cook County Illinois
## 8: Claiborne County Mississippi
ols_project[reconstr_MedianAge>100,reconstr_MedianAge]#outliers
## [1] 457.7734 468.9756 545.7721 623.8188 508.4562 619.1590 497.6171
## [8] 412.1777 480.8220 424.1665 534.8357 406.2605 579.9061 502.2296
## [15] 496.1947 525.0669 519.0263 535.7403 522.3904 469.9527 430.5584
## [22] 413.8742 500.6821 429.3218 500.3413 497.1198 348.5221 511.1347
## [29] 497.4341 508.4157
h2o.shutdown(F); rm(ols.glrm,ols.pred,ols.dl); gc();detach("package:h2o", unload=TRUE)
## [1] TRUE
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2133614 114.0 3205452 171.2 3205452 171.2
## Vcells 3425435 26.2 5721718 43.7 5719947 43.7
## [1] "A shutdown has been triggered. "
## Warning in value[[3L]](cond):
## ----------------------------------------------------------------------
##
## Could not shut down the H2O Java Process!
## Please shutdown H2O manually by navigating to `http://localhost:54321/Shutdown`
##
## Windows requires the shutdown of h2o before re-installing -or- updating the h2o package.
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
The data were pulled back from the H2O cluster as the GLRM reconstruction (hence the reconstr_ prefix on every column), and factors were created from “County.City” and “State”. The autoencoder anomaly scores flagged 8 possible outliers, namely the rows with a reconstruction MSE above 0.09. Judging whether these 8 are true outliers requires domain knowledge. However, the summary table shown earlier makes it obvious that the MedianAge variable contains outliers: 30 rows have a reconstructed value above 100, and people do not live to 400, 500, or 600 years old.
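A fixed cutoff such as 100 works for age, but a rule based on the data itself can be applied to any numeric column. The sketch below uses Tukey's interquartile-range fences, which is a different technique from the autoencoder approach used above. The `flag_tukey` helper is hypothetical, and the sample vector is made up from values of the same magnitude as those in the MedianAge summary.

```r
# Hypothetical helper: TRUE where a value lies outside Tukey's fences.
flag_tukey <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Illustrative values only: ten plausible ages and two implausible ones.
median_age <- c(31.9, 34.5, 37.7, 39.2, 41.0, 42.3,
                44.0, 45.1, 47.7, 50.2, 536.4, 624.0)
flagged <- flag_tukey(median_age)
median_age[flagged]
```

Flagged values still need a domain decision: they can be corrected, set to missing and imputed, or dropped.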
plot(Hmisc::describe(ols_project)) #
## $Categorical
##
## $Continuous
data.table(Column=names(ols_project[,-c(9,34:35)]),P=apply(ols_project[,-c(9,34:35)],2,function(x)shapiro.test(x)$p.value))
## Column P
## 1: reconstr_avgAnnCount 4.170786e-73
## 2: reconstr_avgDeathsPerYear 2.647982e-75
## 3: reconstr_TARGET_deathRate 2.108897e-11
## 4: reconstr_incidenceRate 1.822361e-36
## 5: reconstr_medIncome 5.074380e-38
## 6: reconstr_popEst2015 1.162513e-76
## 7: reconstr_povertyPercent 6.133276e-10
## 8: reconstr_studyPerCap 2.363104e-75
## 9: reconstr_MedianAge 1.205533e-79
## 10: reconstr_MedianAgeMale 3.347651e-26
## 11: reconstr_MedianAgeFemale 4.627598e-21
## 12: reconstr_AvgHouseholdSize 7.365672e-25
## 13: reconstr_PercentMarried 1.452269e-41
## 14: reconstr_PctNoHS18_24 4.473207e-07
## 15: reconstr_PctHS18_24 1.123750e-05
## 16: reconstr_PctSomeCol18_24 7.070605e-19
## 17: reconstr_PctBachDeg18_24 3.397906e-31
## 18: reconstr_PctHS25_Over 3.775964e-18
## 19: reconstr_PctBachDeg25_Over 3.538379e-32
## 20: reconstr_PctEmployed16_Over 1.871671e-11
## 21: reconstr_PctUnemployed16_Over 2.817221e-17
## 22: reconstr_PctPrivateCoverage 5.999665e-11
## 23: reconstr_PctPrivateCoverageAlone 9.509105e-42
## 24: reconstr_PctEmpPrivCoverage 5.724100e-24
## 25: reconstr_PctPublicCoverage 4.656196e-16
## 26: reconstr_PctPublicCoverageAlone 1.209840e-07
## 27: reconstr_PctWhite 4.257334e-47
## 28: reconstr_PctBlack 1.504331e-36
## 29: reconstr_PctAsian 6.815750e-56
## 30: reconstr_PctOtherRace 2.464945e-65
## 31: reconstr_PctMarriedHouseholds 1.143897e-35
## 32: reconstr_BirthRate 3.137107e-21
## Column P
Two checks were done to look briefly at the distributions and identify variables that might benefit from a transformation. The plot shows that a few of the distributions have long tails and may require transformation, depending on the modeling approach. Additionally, the Shapiro-Wilk test rejected the null hypothesis of normality for every variable. This is not a problem in itself, because OLS assumes normally distributed errors rather than normally distributed predictors, and it is addressed shortly.
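For a long-tailed variable, a log transformation is one common option. The sketch below is only an illustration on synthetic data, not a step taken in this report: it measures sample skewness before and after `log1p`, which is used here because several columns in this data set contain zeros. The `skewness` helper is written out by hand and is an assumption of this example.

```r
set.seed(323)
# Synthetic right-skewed sample standing in for a long-tailed column.
x <- rlnorm(1000, meanlog = 5, sdlog = 1)

# Sample skewness: third central moment over the 1.5 power of the variance.
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew_raw <- skewness(x)
skew_log <- skewness(log1p(x))   # log1p(0) is 0, so zeros are allowed
cat("skewness before:", round(skew_raw, 2),
    " after log1p:", round(skew_log, 2), "\n")
```

A skewness near zero after the transformation suggests the tail has been pulled in; whether that helps the model is best judged from the residual diagnostics.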
# Crtl<- trainControl(method = "repeatedcv",number = 10,repeats = 3,allowParallel = T)
# features<-train(reconstr_TARGET_deathRate~.,data=ols_project,method = "xgbTree",trControl=Crtl)
# getTrainPerf(features)
imp_feat<-c("reconstr_PctPublicCoverageAlone",
"reconstr_povertyPercent",
"reconstr_AvgHouseholdSize",
"reconstr_PctUnemployed16_Over",
"reconstr_PctBlack",
"reconstr_PctHS18_24")
GGally::ggpairs(ols_project[,imp_feat,with=F])
imp_feat[7]<-"reconstr_TARGET_deathRate"
summary(ols_project[,imp_feat,with=F])
## reconstr_PctPublicCoverageAlone reconstr_povertyPercent
## Min. :-1.604 Min. :-4.491
## 1st Qu.:15.895 1st Qu.:13.039
## Median :18.979 Median :16.427
## Mean :19.070 Mean :16.636
## 3rd Qu.:22.351 3rd Qu.:20.129
## Max. :45.193 Max. :50.255
## reconstr_AvgHouseholdSize reconstr_PctUnemployed16_Over
## Min. :1.518 Min. :-0.4861
## 1st Qu.:2.337 1st Qu.: 6.2024
## Median :2.460 Median : 7.6521
## Mean :2.464 Mean : 7.7747
## 3rd Qu.:2.579 3rd Qu.: 9.1386
## Max. :4.240 Max. :22.7456
## reconstr_PctBlack reconstr_PctHS18_24 reconstr_TARGET_deathRate
## Min. :-23.2224 Min. :13.63 Min. : 70.38
## 1st Qu.: 0.8389 1st Qu.:32.19 1st Qu.:161.75
## Median : 6.4338 Median :34.86 Median :178.06
## Mean : 8.7382 Mean :34.97 Mean :178.91
## 3rd Qu.: 13.6192 3rd Qu.:37.68 3rd Qu.:194.70
## Max. : 80.4762 Max. :51.26 Max. :335.76
Because the variables had outliers and some would need transformation, feature selection was done with a robust, tree-based selector. Random forests and boosted trees are both possible choices; XGBoost (caret's xgbTree method, shown commented out above) was used to identify the important variables. The selected variables appear to be approximately normally distributed, so no transformations were applied. However, some of them are highly correlated.
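One way to put a number on the correlation among predictors is the variance inflation factor (VIF), which for each predictor equals 1 / (1 - R-squared) from regressing it on the others. The sketch below computes it in base R on synthetic data. The `vif` helper and the column names are hypothetical, and this check is not part of the original analysis.

```r
# Hypothetical helper: VIF for every column of a numeric data frame.
vif <- function(df) {
  sapply(names(df), function(v) {
    f <- reformulate(setdiff(names(df), v), response = v)
    1 / (1 - summary(lm(f, data = df))$r.squared)
  })
}

set.seed(323)
n <- 500
poverty    <- rnorm(n, mean = 17, sd = 6)
public_cov <- 0.9 * poverty + rnorm(n, mean = 4, sd = 2)  # tied to poverty
hh_size    <- rnorm(n, mean = 2.5, sd = 0.25)             # independent
vifs <- vif(data.frame(poverty, public_cov, hh_size))
round(vifs, 2)
```

Values close to 1 indicate little overlap with the other predictors, while large values mean the coefficient for that predictor is estimated with inflated variance.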
set.seed(323)
intrain<-createDataPartition(ols_project$reconstr_TARGET_deathRate,p=.70,list=F)
training<-ols_project[intrain]
testing<-ols_project[-intrain]
models<-list()
for(i in seq_along(imp_feat[-7])){
models[[i]]<- lm(reconstr_TARGET_deathRate~.,data=training[,imp_feat[c(1:i,7)],with=F])
}
lapply(models,summary)
## [[1]]
##
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[,
## imp_feat[c(1:i, 7)], with = F])
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.048 -5.902 0.416 5.981 56.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.04480 0.92034 97.84 <2e-16 ***
## reconstr_PctPublicCoverageAlone 4.66585 0.04684 99.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 2133 degrees of freedom
## Multiple R-squared: 0.8231, Adjusted R-squared: 0.823
## F-statistic: 9924 on 1 and 2133 DF, p-value: < 2.2e-16
##
##
## [[2]]
##
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[,
## imp_feat[c(1:i, 7)], with = F])
##
## Residuals:
## Min 1Q Median 3Q Max
## -119.445 -5.857 0.177 5.833 54.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.7587 1.1294 85.671 <2e-16 ***
## reconstr_PctPublicCoverageAlone 2.7993 0.1950 14.352 <2e-16 ***
## reconstr_povertyPercent 1.7344 0.1762 9.846 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.45 on 2132 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8306
## F-statistic: 5234 on 2 and 2132 DF, p-value: < 2.2e-16
##
##
## [[3]]
##
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[,
## imp_feat[c(1:i, 7)], with = F])
##
## Residuals:
## Min 1Q Median 3Q Max
## -174.636 -2.611 0.652 3.400 29.122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.9976 1.9200 9.374 <2e-16 ***
## reconstr_PctPublicCoverageAlone 2.4357 0.1396 17.446 <2e-16 ***
## reconstr_povertyPercent 1.8658 0.1259 14.817 <2e-16 ***
## reconstr_AvgHouseholdSize 33.8680 0.7491 45.209 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.471 on 2131 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.9135
## F-statistic: 7514 on 3 and 2131 DF, p-value: < 2.2e-16
##
##
## [[4]]
##
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[,
## imp_feat[c(1:i, 7)], with = F])
##
## Residuals:
## Min 1Q Median 3Q Max
## -173.374 -2.606 0.800 3.472 25.121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.6867 1.9779 9.953 < 2e-16 ***
## reconstr_PctPublicCoverageAlone 2.4864 0.1400 17.754 < 2e-16 ***
## reconstr_povertyPercent 1.5545 0.1551 10.022 < 2e-16 ***
## reconstr_AvgHouseholdSize 32.7312 0.8179 40.021 < 2e-16 ***
## reconstr_PctUnemployed16_Over 0.6850 0.2003 3.420 0.000638 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.452 on 2130 degrees of freedom
## Multiple R-squared: 0.9141, Adjusted R-squared: 0.9139
## F-statistic: 5667 on 4 and 2130 DF, p-value: < 2.2e-16
##
##
## [[5]]
##
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[,
## imp_feat[c(1:i, 7)], with = F])
##
## Residuals:
## Min 1Q Median 3Q Max
## -135.333 -2.848 1.265 3.278 24.931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.8208 1.6474 12.03 <2e-16 ***
## reconstr_PctPublicCoverageAlone 12.0653 0.3333 36.20 <2e-16 ***
## reconstr_povertyPercent -3.2424 0.2028 -15.99 <2e-16 ***
## reconstr_AvgHouseholdSize 36.2343 0.6907 52.46 <2e-16 ***
## reconstr_PctUnemployed16_Over -15.9807 0.5682 -28.13 <2e-16 ***
## reconstr_PctBlack 2.0589 0.0671 30.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.207 on 2129 degrees of freedom
## Multiple R-squared: 0.9404, Adjusted R-squared: 0.9403
## F-statistic: 6723 on 5 and 2129 DF, p-value: < 2.2e-16
##
##
## [[6]]
##
## Call:
## lm(formula = reconstr_TARGET_deathRate ~ ., data = training[,
## imp_feat[c(1:i, 7)], with = F])
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.721 -2.514 0.683 2.716 32.148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.59237 1.31002 14.96 <2e-16 ***
## reconstr_PctPublicCoverageAlone 7.86404 0.29066 27.06 <2e-16 ***
## reconstr_povertyPercent -2.03437 0.16489 -12.34 <2e-16 ***
## reconstr_AvgHouseholdSize 17.01313 0.77453 21.97 <2e-16 ***
## reconstr_PctUnemployed16_Over -11.34322 0.47065 -24.10 <2e-16 ***
## reconstr_PctBlack 1.91697 0.05351 35.82 <2e-16 ***
## reconstr_PctHS18_24 2.08064 0.05912 35.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.936 on 2128 degrees of freedom
## Multiple R-squared: 0.9624, Adjusted R-squared: 0.9622
## F-statistic: 9066 on 6 and 2128 DF, p-value: < 2.2e-16
do.call(anova,models)
## Analysis of Variance Table
##
## Model 1: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone
## Model 2: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone +
## reconstr_povertyPercent
## Model 3: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone +
## reconstr_povertyPercent + reconstr_AvgHouseholdSize
## Model 4: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone +
## reconstr_povertyPercent + reconstr_AvgHouseholdSize + reconstr_PctUnemployed16_Over
## Model 5: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone +
## reconstr_povertyPercent + reconstr_AvgHouseholdSize + reconstr_PctUnemployed16_Over +
## reconstr_PctBlack
## Model 6: reconstr_TARGET_deathRate ~ reconstr_PctPublicCoverageAlone +
## reconstr_povertyPercent + reconstr_AvgHouseholdSize + reconstr_PctUnemployed16_Over +
## reconstr_PctBlack + reconstr_PctHS18_24
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2133 243619
## 2 2132 233024 1 10595 434.885 < 2.2e-16 ***
## 3 2131 118944 1 114079 4682.564 < 2.2e-16 ***
## 4 2130 118295 1 650 26.663 2.647e-07 ***
## 5 2129 82024 1 36271 1488.812 < 2.2e-16 ***
## 6 2128 51844 1 30180 1238.780 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(models[[6]]) # diagnostic plots; note the high-leverage point
ols_project[709]
## reconstr_avgAnnCount reconstr_avgDeathsPerYear
## 1: 427.9675 183.0179
## reconstr_TARGET_deathRate reconstr_incidenceRate reconstr_medIncome
## 1: 192.9291 470.1869 36739.99
## reconstr_popEst2015 reconstr_povertyPercent reconstr_studyPerCap
## 1: 68579.99 19.89551 -0.01039127
## reconstr_binnedInc reconstr_MedianAge reconstr_MedianAgeMale
## 1: (42724.4, 45201] 43.68159 43.10673
## reconstr_MedianAgeFemale reconstr_AvgHouseholdSize
## 1: 45.86514 2.409053
## reconstr_PercentMarried reconstr_PctNoHS18_24 reconstr_PctHS18_24
## 1: 54.22054 19.39945 38.0621
## reconstr_PctSomeCol18_24 reconstr_PctBachDeg18_24 reconstr_PctHS25_Over
## 1: 35.28683 4.529534 40.86022
## reconstr_PctBachDeg25_Over reconstr_PctEmployed16_Over
## 1: 9.892463 52.68388
## reconstr_PctUnemployed16_Over reconstr_PctPrivateCoverage
## 1: 8.354626 62.23812
## reconstr_PctPrivateCoverageAlone reconstr_PctEmpPrivCoverage
## 1: 44.42119 37.55365
## reconstr_PctPublicCoverage reconstr_PctPublicCoverageAlone
## 1: 42.84199 22.6819
## reconstr_PctWhite reconstr_PctBlack reconstr_PctAsian
## 1: 96.46664 3.325524 -0.2832896
## reconstr_PctOtherRace reconstr_PctMarriedHouseholds reconstr_BirthRate
## 1: 1.060268 52.33497 5.685039
## County.City State
## 1: Greene County Tennessee
par(mfrow=c(1,1))
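In the previous section, PctPublicCoverageAlone was the most important feature, and on its own it explains about 82% of the variability in the training data (R-squared 0.8231 for the first model). A check was therefore done to test whether adding the other variables improves the model. Both the summaries and the ANOVA confirm that all six predictors are significant and that each addition reduces the residual sum of squares, from 243,619 with one predictor to 51,844 with six. The coefficients of povertyPercent and PctUnemployed16_Over change sign once PctBlack enters the model, which is consistent with the high correlation among the predictors. Another nested model (not shown here), in which the highly correlated variables were removed, did slightly worse.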
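The F statistics in the ANOVA table can be reproduced from its own RSS and Res.Df columns: each drop in RSS is divided by the residual mean square of the largest model. The sketch below does this arithmetic in base R using the rounded values printed above, so the results match the table only up to rounding. The variable names are chosen for this example.

```r
# RSS and residual degrees of freedom copied from the ANOVA table above.
rss    <- c(243619, 233024, 118944, 118295, 82024, 51844)
res_df <- c(2133, 2132, 2131, 2130, 2129, 2128)

sum_sq   <- -diff(rss)       # reduction in RSS at each step
df_added <- -diff(res_df)    # one extra predictor per step

# Denominator: residual mean square of the largest model (model 6).
ms_resid <- rss[6] / res_df[6]

f_stat  <- (sum_sq / df_added) / ms_resid
p_value <- pf(f_stat, df_added, res_df[6], lower.tail = FALSE)
round(data.frame(sum_sq, f_stat, p_value), 3)
```

The smallest improvement comes from adding PctUnemployed16_Over (model 4), which is also the step with the smallest F statistic in the table.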
In the diagnostic plots, the trend line of the residual plot is not flat and is on the cusp of showing a pattern. Apart from the tails, the Q-Q plot gives no cause for concern. However, the data point labeled 709 has a large amount of leverage and could be an outlier; a domain expert will need to investigate it. If that label was read off the plots, it refers to a row of training, the table the model was fitted on, so the point should be looked up there rather than in the full data set.
results<-predict(models[[6]],testing)
data.frame(results,actual=testing$reconstr_TARGET_deathRate) %>% ggplot(aes(x=results,y=actual)) + geom_point()+stat_smooth(method="lm",show.legend = T)
RMSE(pred = results,obs = testing$reconstr_TARGET_deathRate)
## [1] 4.454168
The test set was scored with the model, and the predicted values agree well with the observed values, with an RMSE of 4.454.
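For reference, RMSE is the square root of the mean squared difference between predicted and observed values. The sketch below writes that definition out in base R. The `rmse` helper and the four example values are made up for illustration and are not taken from the test set.

```r
# Root mean squared error, written out from its definition.
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Made-up predicted and observed death rates.
pred <- c(180.2, 165.4, 199.8, 172.1)
obs  <- c(178.0, 170.0, 195.0, 174.5)
rmse(pred, obs)
```

Because RMSE is in the units of the target, it can be compared directly with the spread of the death rate itself.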
The analysis in this report presents an OLS model that predicts the held-out data accurately, with the aim of giving subject-matter experts (SMEs) in this field a data-driven way to estimate cancer mortality rates rather than relying on collective judgment alone.