This project is about predicting the death rate per 100,000 people from more than 30 predictors.
To find out more about this data set and the description of its variables, please visit this page.
For any advice or recommendations, please feel free to contact me: vet.m.mohamed@gmail.com
Let's start by loading our libraries.
library(tidyverse)
library(knitr)
library(caret)
library(car)
library(psych)
library(mice)
library(progress)
library(DMwR)
library(readr)
library(MASS)
library(pedometrics)
Then import the data.
data<-read.csv("./data sets/cancer.csv")
head(data)%>%kable("markdown")
| avganncount | avgdeathsperyear | target_deathrate | incidencerate | medincome | popest2015 | povertypercent | studypercap | binnedinc | medianage | medianagemale | medianagefemale | geography | percentmarried | pctnohs18_24 | pcths18_24 | pctsomecol18_24 | pctbachdeg18_24 | pcths25_over | pctbachdeg25_over | pctemployed16_over | pctunemployed16_over | pctprivatecoverage | pctprivatecoveragealone | pctempprivcoverage | pctpubliccoverage | pctpubliccoveragealone | pctwhite | pctblack | pctasian | pctotherrace | pctmarriedhouseholds | birthrate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1397 | 469 | 164.9 | 489.8 | 61898 | 260131 | 11.2 | 499.74820 | (61494.5, 125635] | 39.3 | 36.9 | 41.7 | Kitsap County, Washington | 52.5 | 11.5 | 39.5 | 42.1 | 6.9 | 23.2 | 19.6 | 51.9 | 8.0 | 75.1 | NA | 41.6 | 32.9 | 14.0 | 81.78053 | 2.5947283 | 4.8218571 | 1.8434785 | 52.85608 | 6.118831 |
| 173 | 70 | 161.3 | 411.6 | 48127 | 43269 | 18.6 | 23.11123 | (48021.6, 51046.4] | 33.0 | 32.2 | 33.7 | Kittitas County, Washington | 44.5 | 6.1 | 22.4 | 64.0 | 7.5 | 26.0 | 22.7 | 55.9 | 7.8 | 70.2 | 53.8 | 43.6 | 31.1 | 15.3 | 89.22851 | 0.9691025 | 2.2462326 | 3.7413515 | 45.37250 | 4.333096 |
| 102 | 50 | 174.7 | 349.7 | 49348 | 21026 | 14.6 | 47.56016 | (48021.6, 51046.4] | 45.0 | 44.0 | 45.8 | Klickitat County, Washington | 54.2 | 24.0 | 36.6 | NA | 9.5 | 29.0 | 16.0 | 45.9 | 7.0 | 63.7 | 43.5 | 34.9 | 42.1 | 21.1 | 90.92219 | 0.7396734 | 0.4658982 | 2.7473583 | 54.44487 | 3.729488 |
| 427 | 202 | 194.8 | 430.4 | 44243 | 75882 | 17.1 | 342.63725 | (42724.4, 45201] | 42.8 | 42.2 | 43.4 | Lewis County, Washington | 52.7 | 20.2 | 41.2 | 36.1 | 2.5 | 31.6 | 9.3 | 48.3 | 12.1 | 58.4 | 40.3 | 35.0 | 45.3 | 25.0 | 91.74469 | 0.7826260 | 1.1613587 | 1.3626432 | 51.02151 | 4.603841 |
| 57 | 26 | 144.4 | 350.1 | 49955 | 10321 | 12.5 | 0.00000 | (48021.6, 51046.4] | 48.3 | 47.8 | 48.9 | Lincoln County, Washington | 57.8 | 14.9 | 43.0 | 40.0 | 2.0 | 33.4 | 15.0 | 48.2 | 4.8 | 61.6 | 43.9 | 35.1 | 44.0 | 22.7 | 94.10402 | 0.2701920 | 0.6658304 | 0.4921355 | 54.02746 | 6.796657 |
| 428 | 152 | 176.0 | 505.4 | 52313 | 61023 | 15.6 | 180.25990 | (51046.4, 54545.6] | 45.4 | 43.5 | 48.0 | Mason County, Washington | 50.4 | 29.9 | 35.1 | NA | 4.5 | 30.4 | 11.9 | 44.1 | 12.9 | 60.0 | 38.8 | 32.6 | 43.2 | 20.2 | 84.88263 | 1.6532052 | 1.5380566 | 3.3146354 | 51.22036 | 4.964476 |
It is important to look at the structure of the data.
str(data)
## 'data.frame': 3047 obs. of 33 variables:
## $ avganncount : num 1397 173 102 427 57 ...
## $ avgdeathsperyear : int 469 70 50 202 26 152 97 71 36 1380 ...
## $ target_deathrate : num 165 161 175 195 144 ...
## $ incidencerate : num 490 412 350 430 350 ...
## $ medincome : int 61898 48127 49348 44243 49955 52313 37782 40189 42579 60397 ...
## $ popest2015 : int 260131 43269 21026 75882 10321 61023 41516 20848 13088 843954 ...
## $ povertypercent : num 11.2 18.6 14.6 17.1 12.5 15.6 23.2 17.8 22.3 13.1 ...
## $ studypercap : num 499.7 23.1 47.6 342.6 0 ...
## $ binnedinc : Factor w/ 10 levels "(34218.1, 37413.8]",..: 9 6 6 4 6 7 2 2 3 8 ...
## $ medianage : num 39.3 33 45 42.8 48.3 45.4 42.6 51.7 49.3 35.8 ...
## $ medianagemale : num 36.9 32.2 44 42.2 47.8 43.5 42.2 50.8 48.4 34.7 ...
## $ medianagefemale : num 41.7 33.7 45.8 43.4 48.9 48 43.5 52.5 49.8 37 ...
## $ geography : Factor w/ 3047 levels "Abbeville County, South Carolina",..: 1459 1460 1464 1589 1618 1766 2051 2112 2143 2185 ...
## $ percentmarried : num 52.5 44.5 54.2 52.7 57.8 50.4 54.1 52.7 55.9 50 ...
## $ pctnohs18_24 : num 11.5 6.1 24 20.2 14.9 29.9 26.1 27.3 34.7 15.6 ...
## $ pcths18_24 : num 39.5 22.4 36.6 41.2 43 35.1 41.4 33.9 39.4 36.3 ...
## $ pctsomecol18_24 : num 42.1 64 NA 36.1 40 NA NA 36.5 NA NA ...
## $ pctbachdeg18_24 : num 6.9 7.5 9.5 2.5 2 4.5 5.8 2.2 1.4 7.1 ...
## $ pcths25_over : num 23.2 26 29 31.6 33.4 30.4 29.8 31.6 32.2 28.8 ...
## $ pctbachdeg25_over : num 19.6 22.7 16 9.3 15 11.9 11.9 11.3 12 16.2 ...
## $ pctemployed16_over : num 51.9 55.9 45.9 48.3 48.2 44.1 51.8 40.9 39.5 56.6 ...
## $ pctunemployed16_over : num 8 7.8 7 12.1 4.8 12.9 8.9 8.9 10.3 9.2 ...
## $ pctprivatecoverage : num 75.1 70.2 63.7 58.4 61.6 60 49.5 55.8 55.5 69.9 ...
## $ pctprivatecoveragealone: num NA 53.8 43.5 40.3 43.9 38.8 35 33.1 37.8 NA ...
## $ pctempprivcoverage : num 41.6 43.6 34.9 35 35.1 32.6 28.3 25.9 29.9 44.4 ...
## $ pctpubliccoverage : num 32.9 31.1 42.1 45.3 44 43.2 46.4 50.9 48.1 31.4 ...
## $ pctpubliccoveragealone : num 14 15.3 21.1 25 22.7 20.2 28.7 24.1 26.6 16.5 ...
## $ pctwhite : num 81.8 89.2 90.9 91.7 94.1 ...
## $ pctblack : num 2.595 0.969 0.74 0.783 0.27 ...
## $ pctasian : num 4.822 2.246 0.466 1.161 0.666 ...
## $ pctotherrace : num 1.843 3.741 2.747 1.363 0.492 ...
## $ pctmarriedhouseholds : num 52.9 45.4 54.4 51 54 ...
## $ birthrate : num 6.12 4.33 3.73 4.6 6.8 ...
Some of the variables are factors and the others are numeric, but there is no need for modification.
Let's inspect the accuracy of the data.
summary(data)
## avganncount avgdeathsperyear target_deathrate incidencerate
## Min. : 6.0 Min. : 3 Min. : 59.7 Min. : 201.3
## 1st Qu.: 76.0 1st Qu.: 28 1st Qu.:161.2 1st Qu.: 420.3
## Median : 171.0 Median : 61 Median :178.1 Median : 453.5
## Mean : 606.3 Mean : 186 Mean :178.7 Mean : 448.3
## 3rd Qu.: 518.0 3rd Qu.: 149 3rd Qu.:195.2 3rd Qu.: 480.9
## Max. :38150.0 Max. :14010 Max. :362.8 Max. :1206.9
##
## medincome popest2015 povertypercent studypercap
## Min. : 22640 Min. : 827 Min. : 3.20 Min. : 0.00
## 1st Qu.: 38883 1st Qu.: 11684 1st Qu.:12.15 1st Qu.: 0.00
## Median : 45207 Median : 26643 Median :15.90 Median : 0.00
## Mean : 47063 Mean : 102637 Mean :16.88 Mean : 155.40
## 3rd Qu.: 52492 3rd Qu.: 68671 3rd Qu.:20.40 3rd Qu.: 83.65
## Max. :125635 Max. :10170292 Max. :47.40 Max. :9762.31
##
## binnedinc medianage medianagemale
## (45201, 48021.6] : 306 Min. : 22.30 Min. :22.40
## (54545.6, 61494.5]: 306 1st Qu.: 37.70 1st Qu.:36.35
## [22640, 34218.1] : 306 Median : 41.00 Median :39.60
## (42724.4, 45201] : 305 Mean : 45.27 Mean :39.57
## (48021.6, 51046.4]: 305 3rd Qu.: 44.00 3rd Qu.:42.50
## (51046.4, 54545.6]: 305 Max. :624.00 Max. :64.70
## (Other) :1214
## medianagefemale geography percentmarried
## Min. :22.30 Abbeville County, South Carolina: 1 Min. :23.10
## 1st Qu.:39.10 Acadia Parish, Louisiana : 1 1st Qu.:47.75
## Median :42.40 Accomack County, Virginia : 1 Median :52.40
## Mean :42.15 Ada County, Idaho : 1 Mean :51.77
## 3rd Qu.:45.30 Adair County, Iowa : 1 3rd Qu.:56.40
## Max. :65.70 Adair County, Kentucky : 1 Max. :72.50
## (Other) :3041
## pctnohs18_24 pcths18_24 pctsomecol18_24 pctbachdeg18_24
## Min. : 0.00 Min. : 0.0 Min. : 7.10 Min. : 0.000
## 1st Qu.:12.80 1st Qu.:29.2 1st Qu.:34.00 1st Qu.: 3.100
## Median :17.10 Median :34.7 Median :40.40 Median : 5.400
## Mean :18.22 Mean :35.0 Mean :40.98 Mean : 6.158
## 3rd Qu.:22.70 3rd Qu.:40.7 3rd Qu.:46.40 3rd Qu.: 8.200
## Max. :64.10 Max. :72.5 Max. :79.00 Max. :51.800
## NA's :2285
## pcths25_over pctbachdeg25_over pctemployed16_over pctunemployed16_over
## Min. : 7.50 Min. : 2.50 Min. :17.60 Min. : 0.400
## 1st Qu.:30.40 1st Qu.: 9.40 1st Qu.:48.60 1st Qu.: 5.500
## Median :35.30 Median :12.30 Median :54.50 Median : 7.600
## Mean :34.80 Mean :13.28 Mean :54.15 Mean : 7.852
## 3rd Qu.:39.65 3rd Qu.:16.10 3rd Qu.:60.30 3rd Qu.: 9.700
## Max. :54.80 Max. :42.20 Max. :80.10 Max. :29.400
## NA's :152
## pctprivatecoverage pctprivatecoveragealone pctempprivcoverage
## Min. :22.30 Min. :15.70 Min. :13.5
## 1st Qu.:57.20 1st Qu.:41.00 1st Qu.:34.5
## Median :65.10 Median :48.70 Median :41.1
## Mean :64.35 Mean :48.45 Mean :41.2
## 3rd Qu.:72.10 3rd Qu.:55.60 3rd Qu.:47.7
## Max. :92.30 Max. :78.90 Max. :70.7
## NA's :609
## pctpubliccoverage pctpubliccoveragealone pctwhite
## Min. :11.20 Min. : 2.60 Min. : 10.20
## 1st Qu.:30.90 1st Qu.:14.85 1st Qu.: 77.30
## Median :36.30 Median :18.80 Median : 90.06
## Mean :36.25 Mean :19.24 Mean : 83.65
## 3rd Qu.:41.55 3rd Qu.:23.10 3rd Qu.: 95.45
## Max. :65.10 Max. :46.60 Max. :100.00
##
## pctblack pctasian pctotherrace
## Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.6207 1st Qu.: 0.2542 1st Qu.: 0.2952
## Median : 2.2476 Median : 0.5498 Median : 0.8262
## Mean : 9.1080 Mean : 1.2540 Mean : 1.9835
## 3rd Qu.:10.5097 3rd Qu.: 1.2210 3rd Qu.: 2.1780
## Max. :85.9478 Max. :42.6194 Max. :41.9303
##
## pctmarriedhouseholds birthrate
## Min. :22.99 Min. : 0.000
## 1st Qu.:47.76 1st Qu.: 4.521
## Median :51.67 Median : 5.381
## Mean :51.24 Mean : 5.640
## 3rd Qu.:55.40 3rd Qu.: 6.494
## Max. :78.08 Max. :21.326
##
The summary reveals some issues with outliers (for example, the maximum of medianage is 624) and with missing data.
Another thing: I think we have to drop the county name and keep only the name of the state.
table(data$geography)%>%data.frame()%>%head()%>%kable("markdown") # each county appears exactly once, so there are as many levels as rows
| Var1 | Freq |
|---|---|
| Abbeville County, South Carolina | 1 |
| Acadia Parish, Louisiana | 1 |
| Accomack County, Virginia | 1 |
| Ada County, Idaho | 1 |
| Adair County, Iowa | 1 |
| Adair County, Kentucky | 1 |
#this will be problematic in our analysis
#So I will keep the state name and remove the county name
data$geography<-str_remove_all(string = data$geography,
pattern = "[:alpha:]{1,}(\\s)|[:alpha:]{1,}(\\,)|(\\s)")
head(data,10)%>%kable("markdown")
| avganncount | avgdeathsperyear | target_deathrate | incidencerate | medincome | popest2015 | povertypercent | studypercap | binnedinc | medianage | medianagemale | medianagefemale | geography | percentmarried | pctnohs18_24 | pcths18_24 | pctsomecol18_24 | pctbachdeg18_24 | pcths25_over | pctbachdeg25_over | pctemployed16_over | pctunemployed16_over | pctprivatecoverage | pctprivatecoveragealone | pctempprivcoverage | pctpubliccoverage | pctpubliccoveragealone | pctwhite | pctblack | pctasian | pctotherrace | pctmarriedhouseholds | birthrate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1397 | 469 | 164.9 | 489.8 | 61898 | 260131 | 11.2 | 499.74820 | (61494.5, 125635] | 39.3 | 36.9 | 41.7 | Washington | 52.5 | 11.5 | 39.5 | 42.1 | 6.9 | 23.2 | 19.6 | 51.9 | 8.0 | 75.1 | NA | 41.6 | 32.9 | 14.0 | 81.78053 | 2.5947283 | 4.8218571 | 1.8434785 | 52.85608 | 6.118831 |
| 173 | 70 | 161.3 | 411.6 | 48127 | 43269 | 18.6 | 23.11123 | (48021.6, 51046.4] | 33.0 | 32.2 | 33.7 | Washington | 44.5 | 6.1 | 22.4 | 64.0 | 7.5 | 26.0 | 22.7 | 55.9 | 7.8 | 70.2 | 53.8 | 43.6 | 31.1 | 15.3 | 89.22851 | 0.9691025 | 2.2462326 | 3.7413515 | 45.37250 | 4.333096 |
| 102 | 50 | 174.7 | 349.7 | 49348 | 21026 | 14.6 | 47.56016 | (48021.6, 51046.4] | 45.0 | 44.0 | 45.8 | Washington | 54.2 | 24.0 | 36.6 | NA | 9.5 | 29.0 | 16.0 | 45.9 | 7.0 | 63.7 | 43.5 | 34.9 | 42.1 | 21.1 | 90.92219 | 0.7396734 | 0.4658982 | 2.7473583 | 54.44487 | 3.729488 |
| 427 | 202 | 194.8 | 430.4 | 44243 | 75882 | 17.1 | 342.63725 | (42724.4, 45201] | 42.8 | 42.2 | 43.4 | Washington | 52.7 | 20.2 | 41.2 | 36.1 | 2.5 | 31.6 | 9.3 | 48.3 | 12.1 | 58.4 | 40.3 | 35.0 | 45.3 | 25.0 | 91.74469 | 0.7826260 | 1.1613587 | 1.3626432 | 51.02151 | 4.603841 |
| 57 | 26 | 144.4 | 350.1 | 49955 | 10321 | 12.5 | 0.00000 | (48021.6, 51046.4] | 48.3 | 47.8 | 48.9 | Washington | 57.8 | 14.9 | 43.0 | 40.0 | 2.0 | 33.4 | 15.0 | 48.2 | 4.8 | 61.6 | 43.9 | 35.1 | 44.0 | 22.7 | 94.10402 | 0.2701920 | 0.6658304 | 0.4921355 | 54.02746 | 6.796657 |
| 428 | 152 | 176.0 | 505.4 | 52313 | 61023 | 15.6 | 180.25990 | (51046.4, 54545.6] | 45.4 | 43.5 | 48.0 | Washington | 50.4 | 29.9 | 35.1 | NA | 4.5 | 30.4 | 11.9 | 44.1 | 12.9 | 60.0 | 38.8 | 32.6 | 43.2 | 20.2 | 84.88263 | 1.6532052 | 1.5380566 | 3.3146354 | 51.22036 | 4.964476 |
| 250 | 97 | 175.9 | 461.8 | 37782 | 41516 | 23.2 | 0.00000 | (37413.8, 40362.7] | 42.6 | 42.2 | 43.5 | Washington | 54.1 | 26.1 | 41.4 | NA | 5.8 | 29.8 | 11.9 | 51.8 | 8.9 | 49.5 | 35.0 | 28.3 | 46.4 | 28.7 | 75.10645 | 0.6169554 | 0.8661570 | 8.3567212 | 51.01390 | 4.204317 |
| 146 | 71 | 183.6 | 404.0 | 40189 | 20848 | 17.8 | 0.00000 | (37413.8, 40362.7] | 51.7 | 50.8 | 52.5 | Washington | 52.7 | 27.3 | 33.9 | 36.5 | 2.2 | 31.6 | 11.3 | 40.9 | 8.9 | 55.8 | 33.1 | 25.9 | 50.9 | 24.1 | 89.40664 | 0.3051586 | 1.8890773 | 2.2862679 | 48.96703 | 5.889179 |
| 88 | 36 | 190.5 | 459.4 | 42579 | 13088 | 22.3 | 0.00000 | (40362.7, 42724.4] | 49.3 | 48.4 | 49.8 | Washington | 55.9 | 34.7 | 39.4 | NA | 1.4 | 32.2 | 12.0 | 39.5 | 10.3 | 55.5 | 37.8 | 29.9 | 48.1 | 26.6 | 91.78748 | 0.1850709 | 0.2082048 | 0.6169031 | 53.44700 | 5.587583 |
| 4025 | 1380 | 177.8 | 510.9 | 60397 | 843954 | 13.1 | 427.74843 | (54545.6, 61494.5] | 35.8 | 34.7 | 37.0 | Washington | 50.0 | 15.6 | 36.3 | NA | 7.1 | 28.8 | 16.2 | 56.6 | 9.2 | 69.9 | NA | 44.4 | 31.4 | 16.5 | 74.72967 | 6.7108542 | 6.0414720 | 2.6991844 | 50.06357 | 5.533430 |
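Note that the pattern above keeps only the last word of each label, so multi-word states are shortened (South Carolina becomes Carolina, New York becomes York) and some county fragments survive, which is where levels such as St.Alabama in the model output further down come from. A minimal sketch of an alternative, assuming every label has the form "County, State" and using made-up example labels, keeps everything after the last comma:

```r
# Toy labels in the same "County, State" shape as the geography column (illustrative values).
geo <- c("Kitsap County, Washington",
         "Abbeville County, South Carolina",
         "St. Mary's County, Maryland")

# ".*" is greedy, so this removes everything up to and including the LAST comma,
# plus any whitespace that follows it.
state <- sub("^.*,[[:space:]]*", "", geo)
state
```

Applying this to the real column would change the factor levels, so the model outputs shown later would no longer match exactly.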
Good. Now I think we have to check which variables have more than 5% missing values.
miss<-apply(data,2,function(x){
round((sum(is.na(x))/length(x))*100,2)
})
miss
## avganncount avgdeathsperyear target_deathrate
## 0.00 0.00 0.00
## incidencerate medincome popest2015
## 0.00 0.00 0.00
## povertypercent studypercap binnedinc
## 0.00 0.00 0.00
## medianage medianagemale medianagefemale
## 0.00 0.00 0.00
## geography percentmarried pctnohs18_24
## 0.00 0.00 0.00
## pcths18_24 pctsomecol18_24 pctbachdeg18_24
## 0.00 74.99 0.00
## pcths25_over pctbachdeg25_over pctemployed16_over
## 0.00 0.00 4.99
## pctunemployed16_over pctprivatecoverage pctprivatecoveragealone
## 0.00 0.00 19.99
## pctempprivcoverage pctpubliccoverage pctpubliccoveragealone
## 0.00 0.00 0.00
## pctwhite pctblack pctasian
## 0.00 0.00 0.00
## pctotherrace pctmarriedhouseholds birthrate
## 0.00 0.00 0.00
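The same percentages can be obtained without apply(), which first converts the whole data frame to a character matrix. A small sketch on a toy data frame (the column names a, b and c are made up for illustration):

```r
# Toy data frame with known amounts of missingness.
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(NA, NA, 3, 4),
                 c = 1:4)

# is.na() gives a logical matrix; column means of TRUE/FALSE are the missing fractions.
miss_pct <- round(colMeans(is.na(df)) * 100, 2)
miss_pct

# Names of the columns above the 5% threshold.
names(miss_pct)[miss_pct > 5]
```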
Here we have 3 variables with missing values. In my opinion, the most problematic one is pctsomecol18_24, which is about 75% missing, so I will exclude it.
data<-data%>%dplyr::select(-which(miss>70))
OK, let's impute the rest using the KNN method.
data<-cbind(knnImputation(data = data[,-c(9,13)]),binnedinc=data$binnedinc,
geography=data$geography)%>%data.frame()
The next step: does my data have outliers? I think the Mahalanobis distance will answer this question.
First, exclude the categorical variables.
num<-data%>%dplyr::select(-geography,-binnedinc)
Then get the Mahalanobis distance of each case.
mah<-mahalanobis(x = num,center = colMeans(num),cov = cov(num))
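Now calculate the cutoff point, the 0.99 quantile of a chi-squared distribution with as many degrees of freedom as there are numeric variables.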
cutoff<-qchisq(p = .99,df = ncol(num))
Let's open the surprise box :P
summary(mah>cutoff)
## Mode FALSE TRUE
## logical 2691 356
Now we have 356 cases that are considered multivariate outliers.
Let's save them for later use.
outidx<-as.numeric(mah>cutoff)
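To see what the cutoff is doing, here is a self-contained sketch on simulated data (not the cancer data) with one planted outlier. The squared Mahalanobis distance is compared against a chi-squared quantile, exactly as in the steps above; the seed and the planted values are arbitrary choices for the illustration.

```r
set.seed(1)
x <- matrix(rnorm(200 * 3), ncol = 3)   # 200 cases, 3 roughly independent variables
x[1, ] <- c(8, -8, 8)                   # plant one obvious multivariate outlier in row 1

# Squared Mahalanobis distance of every row from the column means.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under multivariate normality d2 is approximately chi-squared with ncol(x) df.
cutoff <- qchisq(0.99, df = ncol(x))
flag <- d2 > cutoff
which(flag)                             # row 1 is among the flagged cases
```

With a 0.99 quantile, about 1% of perfectly well-behaved cases are expected to be flagged by chance, so the count of flagged cases is an upper bound on the real problems.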
Checking for multicollinearity is crucial, so let's look for correlations above .9.
corr<-cor(num)%>%matrix(nrow = ncol(num),ncol = ncol(num))
addit<-apply(corr,2,function(x){
ifelse(x>=abs(.9)&x<1,paste(round(x,2),"additive",sep = " ")," ")
})
colnames(addit)<-rownames(addit)<-names(num)
corvar<-names(num)[apply(addit,2,function(x){
str_detect(x,"additive")
})%>%apply(MARGIN = 2,any)%>%which()]
corvar
## [1] "avganncount" "avgdeathsperyear"
## [3] "popest2015" "medianagemale"
## [5] "medianagefemale" "pctprivatecoverage"
## [7] "pctprivatecoveragealone" "pctempprivcoverage"
addit[corvar,corvar]%>%kable("markdown")
| | avganncount | avgdeathsperyear | popest2015 | medianagemale | medianagefemale | pctprivatecoverage | pctprivatecoveragealone | pctempprivcoverage |
|---|---|---|---|---|---|---|---|---|
| avganncount | | 0.94 additive | 0.93 additive | | | | | |
| avgdeathsperyear | 0.94 additive | | 0.98 additive | | | | | |
| popest2015 | 0.93 additive | 0.98 additive | | | | | | |
| medianagemale | | | | | 0.93 additive | | | |
| medianagefemale | | | | 0.93 additive | | | | |
| pctprivatecoverage | | | | | | | 0.93 additive | |
| pctprivatecoveragealone | | | | | | 0.93 additive | | 0.92 additive |
| pctempprivcoverage | | | | | | | 0.92 additive | |
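One caveat: the condition `x>=abs(.9)` compares the signed correlation with .9, so a strong negative correlation (say -.95) would not be flagged. A minimal sketch that takes the absolute value of the correlation instead, on simulated variables (a, b, c and d are made-up names):

```r
set.seed(2)
a <- rnorm(100)
toy <- data.frame(a = a,
                  b =  a + rnorm(100, sd = 0.1),   # strongly positively related to a
                  c = -a + rnorm(100, sd = 0.1),   # strongly negatively related to a
                  d = rnorm(100))                  # unrelated to the others

r <- cor(toy)

# Upper triangle only: every pair is listed once and the diagonal is skipped.
high <- abs(r) >= 0.9 & upper.tri(r)
flagged <- data.frame(var1 = rownames(r)[row(r)[high]],
                      var2 = colnames(r)[col(r)[high]],
                      r    = round(r[high], 2))
flagged
```

Listing the pairs directly also avoids the string matching on "additive".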
Here the number of reported cancer cases (avganncount), the average reported mortality (avgdeathsperyear) and the population size (popest2015) are highly correlated. The population size and the average number of reported deaths make little sense as predictors of the target death rate, so we will exclude them.
data<-data%>%dplyr::select(-popest2015,-avgdeathsperyear)
For the median ages of males and females, I will combine them into their mean, (x1+x2)/2.
data<-data%>%mutate(medianagemf=(medianagemale+medianagefemale)/2)%>%dplyr::select(-medianagefemale,-medianagemale)
For private coverage, private coverage alone and employer-provided private coverage, I will combine them into one linear function, .3*(x1+x2+x3).
data<-data%>%mutate(medcov=.3*(pctprivatecoverage+pctprivatecoveragealone+pctempprivcoverage))%>%dplyr::select(-pctprivatecoverage,-pctprivatecoveragealone,-pctempprivcoverage)
Run the correlation check again.
num<-data%>%dplyr::select(-geography,-binnedinc)
corr2<-cor(num)
addit2<-apply(corr2,2,function(x){
ifelse(x>=abs(.9)&x<1,paste(round(x,2),"additive",sep = " ")," ")
})
colnames(addit2)<-rownames(addit2)<-names(num)
corvar2<-names(num)[apply(addit2,2,function(x){
str_detect(x,"additive")
})%>%apply(MARGIN = 2,any)%>%which()]
corvar2
## character(0)
Great, let's dive further into our analysis and fit a linear regression with all variables.
fitall<-data%>%with(lm(target_deathrate~.,data=data))
summary(fitall)
##
## Call:
## lm(formula = target_deathrate ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.540 -9.519 -0.299 9.934 126.239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.595e+02 1.759e+01 9.068 < 2e-16 ***
## avganncount -1.826e-04 3.040e-04 -0.601 0.548117
## incidencerate 1.843e-01 7.449e-03 24.745 < 2e-16 ***
## medincome 7.827e-05 1.113e-04 0.703 0.482111
## povertypercent -1.400e-01 1.785e-01 -0.784 0.432859
## studypercap 4.214e-04 6.444e-04 0.654 0.513265
## medianage 1.418e-03 7.383e-03 0.192 0.847700
## percentmarried 7.419e-01 1.690e-01 4.388 1.18e-05 ***
## pctnohs18_24 -2.640e-02 5.461e-02 -0.483 0.628851
## pcths18_24 1.700e-01 4.850e-02 3.505 0.000463 ***
## pctbachdeg18_24 -2.272e-01 1.051e-01 -2.162 0.030666 *
## pcths25_over 1.949e-01 1.077e-01 1.811 0.070309 .
## pctbachdeg25_over -1.052e+00 1.578e-01 -6.666 3.13e-11 ***
## pctemployed16_over -3.599e-01 1.144e-01 -3.147 0.001667 **
## pctunemployed16_over 3.332e-01 1.739e-01 1.916 0.055417 .
## pctpubliccoverage -1.566e-01 2.237e-01 -0.700 0.483933
## pctpubliccoveragealone 7.008e-01 2.275e-01 3.080 0.002089 **
## pctwhite -1.939e-01 6.606e-02 -2.935 0.003362 **
## pctblack -1.771e-01 7.016e-02 -2.524 0.011655 *
## pctasian -6.877e-02 2.176e-01 -0.316 0.751941
## pctotherrace -6.446e-01 1.298e-01 -4.965 7.27e-07 ***
## pctmarriedhouseholds -1.050e+00 1.537e-01 -6.834 1.00e-11 ***
## birthrate -5.975e-01 1.853e-01 -3.224 0.001278 **
## binnedinc(37413.8, 40362.7] -1.933e+00 1.543e+00 -1.253 0.210454
## binnedinc(40362.7, 42724.4] -3.627e+00 1.661e+00 -2.184 0.029040 *
## binnedinc(42724.4, 45201] -2.628e+00 1.819e+00 -1.445 0.148612
## binnedinc(45201, 48021.6] -4.350e+00 1.993e+00 -2.183 0.029128 *
## binnedinc(48021.6, 51046.4] -5.388e+00 2.222e+00 -2.425 0.015384 *
## binnedinc(51046.4, 54545.6] -5.557e+00 2.426e+00 -2.291 0.022044 *
## binnedinc(54545.6, 61494.5] -4.792e+00 2.781e+00 -1.723 0.084907 .
## binnedinc(61494.5, 125635] -4.245e+00 3.923e+00 -1.082 0.279228
## binnedinc[22640, 34218.1] 3.727e+00 1.664e+00 2.239 0.025211 *
## geographyAlaska 1.380e+01 6.086e+00 2.267 0.023478 *
## geographyAnne'Maryland 1.825e+01 1.835e+01 0.994 0.320143
## geographyArizona -2.111e+01 5.538e+00 -3.813 0.000140 ***
## geographyArkansas 1.068e+01 3.271e+00 3.265 0.001106 **
## geographyCalifornia -1.328e+01 3.920e+00 -3.388 0.000714 ***
## geographyCarolina -4.146e+00 2.798e+00 -1.482 0.138435
## geographyColorado -1.747e+01 3.655e+00 -4.780 1.84e-06 ***
## geographyColumbia -3.861e-01 1.849e+01 -0.021 0.983342
## geographyConnecticut -1.709e+01 7.083e+00 -2.412 0.015906 *
## geographyDakota -6.656e+00 3.339e+00 -1.993 0.046333 *
## geographyDelaware -6.440e+00 1.081e+01 -0.596 0.551486
## geographyDo<U+0623>±Mexico -1.526e+01 1.838e+01 -0.830 0.406499
## geographyFlorida 1.581e+00 3.451e+00 0.458 0.646892
## geographyGeorge'Maryland 7.864e+00 1.839e+01 0.428 0.669025
## geographyGeorgia -5.993e+00 2.821e+00 -2.125 0.033694 *
## geographyHampshire -3.150e+00 6.399e+00 -0.492 0.622511
## geographyHawaii -3.261e+01 1.161e+01 -2.808 0.005016 **
## geographyIdaho -1.728e+01 3.853e+00 -4.485 7.56e-06 ***
## geographyIllinois -2.637e+00 3.193e+00 -0.826 0.408927
## geographyIndiana 6.930e+00 3.242e+00 2.138 0.032623 *
## geographyIowa -1.003e+01 3.320e+00 -3.023 0.002528 **
## geographyIsland -4.879e+00 8.644e+00 -0.564 0.572495
## geographyJersey -5.336e+00 4.961e+00 -1.076 0.282143
## geographyKansas -3.642e+00 3.237e+00 -1.125 0.260620
## geographyKentucky 1.054e+01 3.078e+00 3.425 0.000623 ***
## geographyLouisiana -2.597e-01 3.466e+00 -0.075 0.940270
## geographyMaine 1.684e+00 5.347e+00 0.315 0.752850
## geographyMaryland -4.183e-01 4.820e+00 -0.087 0.930845
## geographyMassachusetts -8.482e+00 5.824e+00 -1.456 0.145390
## geographyMatanuska-Alaska -1.336e+00 1.836e+01 -0.073 0.942007
## geographyMexico -1.620e+01 4.511e+00 -3.592 0.000334 ***
## geographyMiami-Florida -2.826e+01 1.864e+01 -1.516 0.129719
## geographyMichigan -1.375e+00 3.288e+00 -0.418 0.675838
## geographyMinnesota -1.152e+01 3.439e+00 -3.350 0.000818 ***
## geographyMississippi 5.178e+00 3.121e+00 1.659 0.097133 .
## geographyMissouri 8.107e+00 3.084e+00 2.629 0.008611 **
## geographyMontana -1.220e+01 3.839e+00 -3.178 0.001497 **
## geographyNebraska -6.338e+00 3.400e+00 -1.864 0.062452 .
## geographyNevada -4.349e+00 5.241e+00 -0.830 0.406767
## geographyO'Iowa 2.940e+00 1.835e+01 0.160 0.872707
## geographyOhio 3.695e+00 3.283e+00 1.125 0.260487
## geographyOklahoma 1.114e+01 3.432e+00 3.245 0.001188 **
## geographyOregon -8.800e+00 4.025e+00 -2.186 0.028884 *
## geographyPennsylvania -9.400e+00 3.558e+00 -2.642 0.008294 **
## geographySt.Alabama 3.216e+00 1.828e+01 0.176 0.860313
## geographySt.Arkansas 1.018e+01 1.828e+01 0.557 0.577577
## geographySt.Florida -9.275e+00 1.307e+01 -0.710 0.477830
## geographySt.Illinois 3.925e+00 1.827e+01 0.215 0.829941
## geographySt.Indiana 2.219e+00 1.828e+01 0.121 0.903360
## geographySt.Louisiana 7.004e+00 6.545e+00 1.070 0.284639
## geographySt.Mary'Maryland 1.694e+01 1.835e+01 0.923 0.356217
## geographySt.Michigan 2.632e-01 1.306e+01 0.020 0.983925
## geographySt.Minnesota 1.218e+01 1.833e+01 0.665 0.506392
## geographySt.Missouri 5.396e+00 8.461e+00 0.638 0.523725
## geographySt.Wisconsin 2.814e+01 1.835e+01 1.533 0.125360
## geographySt.York -1.157e+01 1.830e+01 -0.632 0.527146
## geographySte.Missouri 2.103e+01 1.834e+01 1.146 0.251708
## geographyTennessee 6.524e+00 3.115e+00 2.094 0.036305 *
## geographyTexas 2.702e+00 2.918e+00 0.926 0.354528
## geographyUtah -2.194e+01 4.441e+00 -4.940 8.25e-07 ***
## geographyValdez-Alaska 2.426e+01 1.844e+01 1.315 0.188446
## geographyVermont -1.301e+00 5.830e+00 -0.223 0.823396
## geographyVirginia 4.526e+00 2.796e+00 1.619 0.105647
## geographyWashington -7.238e+00 4.025e+00 -1.798 0.072235 .
## geographyWisconsin -4.402e+00 3.470e+00 -1.268 0.204730
## geographyWyoming -4.566e+00 4.687e+00 -0.974 0.330105
## geographyYork -1.387e+01 3.639e+00 -3.811 0.000141 ***
## geographyYukon-Alaska -1.833e+01 1.879e+01 -0.975 0.329504
## medianagemf -4.697e-01 1.573e-01 -2.986 0.002847 **
## medcov 2.000e-01 1.395e-01 1.433 0.151861
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.06 on 2945 degrees of freedom
## Multiple R-squared: 0.5905, Adjusted R-squared: 0.5764
## F-statistic: 42.04 on 101 and 2945 DF, p-value: < 2.2e-16
Now we can run regression diagnostics.
First, we look at the leverage points.
lev<-hatvalues(model = fitall)
Next, calculate the cutoff point, (2k+2)/N, where k is the number of predictors and N is the number of cases.
k<-ncol(data)-1
hatcut<-((2*k)+2)/nrow(data)
Let's see how many points exceed the cutoff.
(lev>hatcut)%>%summary()
## Mode FALSE TRUE
## logical 1096 1951
We have 1951 points that exceed the cutoff.
Let's store the result in a variable.
levout<-as.numeric(lev>hatcut)
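A factor predictor contributes one coefficient per level beyond the first, so counting the columns of the data frame can give a smaller k than the number of coefficients the model actually estimates (the full model above has 101), which makes the cutoff stricter and may explain why so many points exceed it. A hedged sketch on the built-in mtcars data that counts coefficients instead (the model formula is only an illustration):

```r
# Small model with one factor predictor, fitted on a built-in data set.
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

h <- hatvalues(fit)
p <- length(coef(fit))      # estimated coefficients, including intercept and dummy columns
n <- nrow(mtcars)

# Same rule as (2k + 2)/N with k = p - 1.
cut_lev <- 2 * p / n
sum(h > cut_lev)            # number of high-leverage cases

# Sanity check: leverages always sum to the number of estimated coefficients.
sum(h)
```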
Now we will compute Cook's distance, which combines leverage and residual size into a measure of influence.
cook<-cooks.distance(fitall)
Set the cutoff point for Cook's distance, 4/(N-k-1).
cookcut<-(4/(nrow(data)-(ncol(data)-1)-1))
(cook>cookcut)%>%summary()
## Mode FALSE TRUE NA's
## logical 2889 143 15
Here we have 143 cases that are considered influential, plus 15 cases for which Cook's distance could not be computed and was returned as NA.
cookout<-as.numeric(cook>cookcut)
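For reference, the same rule on a small built-in example. This is only a sketch with an illustrative model on mtcars; here k counts the slopes actually estimated by the model.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

cd <- cooks.distance(fit)          # one value per case
n  <- nrow(mtcars)
k  <- length(coef(fit)) - 1        # number of estimated slopes

cut_cook <- 4 / (n - k - 1)        # rule-of-thumb cutoff
influential <- which(cd > cut_cook)
names(influential)                 # row names of the flagged cases
```

A likely reason for NA values in the full model, stated as an assumption rather than something verified on this data, is a factor level that occurs in a single row: that row has leverage 1 and its distance cannot be computed.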
Now we need to see which cases are flagged as outliers, as high-leverage points and as influential points.
totalout<-outidx+levout+cookout
table(totalout)
## totalout
## 0 1 2 3
## 1086 1538 330 78
I think that removing the points flagged by at least two of the three checks is enough; cases with an NA flag are treated as flagged too.
idx<-totalout>=2
idx<-sapply(idx,function(x){
ifelse(is.na(x),TRUE,x)
})
table(idx)
## idx
## FALSE TRUE
## 2624 423
data<-data[-idx,]
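One thing worth checking here: idx is a logical vector, and unary minus turns it into 0 and -1, so `data[-idx,]` drops only the first row instead of the flagged ones. That is consistent with the training set later still holding about 70% of roughly 3,046 rows. A toy sketch of the difference, using a made-up five-row data frame:

```r
df  <- data.frame(id = 1:5)
idx <- c(FALSE, TRUE, FALSE, TRUE, FALSE)   # rows 2 and 4 are "flagged"

-idx                        # coerced to numeric: 0 -1 0 -1 0
df[-idx, "id"]              # zeros are ignored and -1 drops row 1 only
df[!idx, "id"]              # logical negation keeps the unflagged rows
```

Switching to `!idx` in the analysis would remove the 423 flagged rows and change every result that follows.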
Here I think we have to take a look at the variance inflation factor (VIF) of the variables.
viftest<-vif(fitall)
viftest<-data.frame(viftest)
summary(viftest$GVIF>10)
## Mode FALSE TRUE
## logical 17 9
summary(viftest$GVIF..1..2.Df..>3.5)
## Mode FALSE TRUE
## logical 21 5
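Because the model contains factors, vif() reports a generalized VIF per term together with GVIF^(1/(2*Df)), which is the column that data.frame() renames to GVIF..1..2.Df.. above. For purely numeric predictors the plain VIF is 1/(1 - R^2) from regressing each predictor on the others. A sketch of that definition in base R on mtcars (the choice of predictors is arbitrary):

```r
preds <- mtcars[, c("wt", "hp", "disp", "qsec")]

# For each predictor: regress it on the remaining ones and convert R^2 to a VIF.
vif_manual <- sapply(names(preds), function(v) {
  f  <- reformulate(setdiff(names(preds), v), response = v)
  r2 <- summary(lm(f, data = preds))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

The same numbers are the diagonal of the inverse correlation matrix, `diag(solve(cor(preds)))`, which is a quick cross-check.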
I think we face a serious multicollinearity problem here, but I will delay any action on it until the end of the analysis.
Now let's check normality, linearity and homoscedasticity.
stdresid<-rstudent(fitall)
stdfit<-fitted(fitall)%>%scale
plot(fitall,2)
plot(stdfit,stdresid)
abline(v = 0,h = 0)
The residual plots carry good news regarding normality and homoscedasticity. Congrats!
I think I am now ready to run the models and conduct the real analysis, but we must not forget that we have a serious collinearity problem. So I will fit several models and compare them. These models are:
1- a model containing all variables
2- a step-wise model addressing the multicollinear variables
3- a step-wise model based on AIC
4- a regularized model (elastic net)
The models are then compared on RMSE and R^2.
First, let's convert the categorical variables to dummy variables.
dummy<-dummyVars(target_deathrate~.,data=data)
data<-cbind(target_deathrate=data$target_deathrate,predict(dummy,data))%>%data.frame()
Then partition the data.
set.seed(234)
trainidx<-createDataPartition(data$target_deathrate,p = .7,list = F)
traindata<-data[trainidx,]
testdata<-data[-trainidx,]
Now let's run all the models in sequence, using 10-fold cross-validation on the training data.
fitcontrol<-trainControl(method = "cv",number = 10)
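What 10-fold cross-validation does can be written out by hand. This is a rough base R sketch of the idea on mtcars with an illustrative model, not a reproduction of what train() does internally:

```r
set.seed(234)
k <- 10

# Assign every row to one of k folds at random, with fold sizes as equal as possible.
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: fit on the other k - 1 folds, then compute RMSE on the held-out fold.
rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

mean(rmse)   # cross-validated RMSE estimate
```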
The all-variables model:
allfit<-train(target_deathrate ~.,data = traindata,method="lm",trControl=fitcontrol)
summary(allfit)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -106.085 -9.712 -0.048 10.078 129.226
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.430e+02 2.729e+01 5.239 1.78e-07 ***
## avganncount -3.536e-04 4.402e-04 -0.803 0.421936
## incidencerate 1.812e-01 8.925e-03 20.301 < 2e-16 ***
## medincome 1.608e-04 1.487e-04 1.081 0.279711
## povertypercent -1.230e-01 2.179e-01 -0.564 0.572498
## studypercap 1.005e-03 8.785e-04 1.144 0.252624
## medianage 9.067e-03 9.151e-03 0.991 0.321892
## percentmarried 7.734e-01 2.066e-01 3.743 0.000187 ***
## pctnohs18_24 1.095e-02 6.674e-02 0.164 0.869723
## pcths18_24 2.071e-01 5.939e-02 3.487 0.000498 ***
## pctbachdeg18_24 -1.677e-01 1.289e-01 -1.302 0.193158
## pcths25_over 2.010e-01 1.315e-01 1.528 0.126583
## pctbachdeg25_over -1.177e+00 1.943e-01 -6.057 1.65e-09 ***
## pctemployed16_over -2.887e-01 1.395e-01 -2.069 0.038632 *
## pctunemployed16_over 1.803e-01 2.155e-01 0.837 0.402943
## pctpubliccoverage -1.399e-01 2.770e-01 -0.505 0.613709
## pctpubliccoveragealone 6.977e-01 2.785e-01 2.505 0.012313 *
## pctwhite -1.903e-01 8.064e-02 -2.360 0.018372 *
## pctblack -1.612e-01 8.608e-02 -1.873 0.061200 .
## pctasian 8.381e-02 2.642e-01 0.317 0.751084
## pctotherrace -7.525e-01 1.689e-01 -4.455 8.86e-06 ***
## pctmarriedhouseholds -1.187e+00 1.878e-01 -6.319 3.22e-10 ***
## birthrate -7.171e-01 2.293e-01 -3.128 0.001786 **
## binnedinc..34218.1..37413.8. -2.816e+00 2.074e+00 -1.358 0.174677
## binnedinc..37413.8..40362.7. -5.794e+00 2.341e+00 -2.475 0.013388 *
## binnedinc..40362.7..42724.4. -7.516e+00 2.590e+00 -2.902 0.003750 **
## binnedinc..42724.4..45201. -7.803e+00 2.846e+00 -2.742 0.006155 **
## binnedinc..45201..48021.6. -9.677e+00 3.147e+00 -3.076 0.002129 **
## binnedinc..48021.6..51046.4. -9.778e+00 3.485e+00 -2.806 0.005064 **
## binnedinc..51046.4..54545.6. -9.860e+00 3.751e+00 -2.628 0.008649 **
## binnedinc..54545.6..61494.5. -9.719e+00 4.247e+00 -2.288 0.022214 *
## binnedinc..61494.5..125635. -9.825e+00 5.738e+00 -1.712 0.087010 .
## binnedinc..22640..34218.1. NA NA NA NA
## geography.Alabama 1.954e+01 1.955e+01 1.000 0.317635
## geography.Alaska 3.258e+01 1.962e+01 1.660 0.097013 .
## geography.Anne.Maryland NA NA NA NA
## geography.Arizona -2.570e+00 2.007e+01 -0.128 0.898129
## geography.Arkansas 3.008e+01 1.954e+01 1.540 0.123777
## geography.California 7.453e+00 1.962e+01 0.380 0.704067
## geography.Carolina 1.494e+01 1.939e+01 0.771 0.440975
## geography.Colorado 2.293e+00 1.952e+01 0.117 0.906482
## geography.Columbia 1.580e+01 2.706e+01 0.584 0.559432
## geography.Connecticut 2.784e+00 2.110e+01 0.132 0.895057
## geography.Dakota 1.372e+01 1.929e+01 0.711 0.476961
## geography.Delaware 1.262e+01 2.216e+01 0.569 0.569308
## geographyDo.U.0623..Mexico 6.097e+00 2.696e+01 0.226 0.821116
## geography.Florida 2.290e+01 1.952e+01 1.173 0.240825
## geography.George.Maryland 2.574e+01 2.689e+01 0.957 0.338519
## geography.Georgia 1.212e+01 1.938e+01 0.625 0.531793
## geography.Hampshire 1.617e+01 2.039e+01 0.793 0.427918
## geography.Hawaii -1.059e+01 2.415e+01 -0.438 0.661104
## geography.Idaho 6.236e+00 1.956e+01 0.319 0.749954
## geography.Illinois 1.758e+01 1.949e+01 0.902 0.367343
## geography.Indiana 2.660e+01 1.943e+01 1.369 0.171163
## geography.Iowa 8.353e+00 1.949e+01 0.428 0.668351
## geography.Island 1.453e+01 2.108e+01 0.689 0.490931
## geography.Jersey 1.562e+01 1.987e+01 0.786 0.432035
## geography.Kansas 1.824e+01 1.942e+01 0.939 0.347811
## geography.Kentucky 3.289e+01 1.948e+01 1.688 0.091505 .
## geography.Louisiana 1.700e+01 1.955e+01 0.869 0.384807
## geography.Maine 1.947e+01 2.023e+01 0.962 0.335973
## geography.Maryland 1.752e+01 2.008e+01 0.873 0.383012
## geography.Massachusetts 9.667e+00 2.025e+01 0.477 0.633131
## geography.Matanuska.Alaska 1.821e+01 2.663e+01 0.684 0.494190
## geography.Mexico 7.029e+00 1.973e+01 0.356 0.721645
## geography.Miami.Florida -5.787e+00 2.715e+01 -0.213 0.831257
## geography.Michigan 1.937e+01 1.947e+01 0.995 0.319983
## geography.Minnesota 7.862e+00 1.954e+01 0.402 0.687416
## geography.Mississippi 2.201e+01 1.954e+01 1.126 0.260149
## geography.Missouri 2.853e+01 1.939e+01 1.471 0.141491
## geography.Montana 5.439e+00 1.939e+01 0.281 0.779113
## geography.Nebraska 1.258e+01 1.944e+01 0.647 0.517744
## geography.Nevada 1.921e+01 1.980e+01 0.970 0.332015
## geography.O.Iowa 2.252e+01 2.686e+01 0.839 0.401765
## geography.Ohio 2.472e+01 1.947e+01 1.269 0.204419
## geography.Oklahoma 2.859e+01 1.921e+01 1.488 0.136871
## geography.Oregon 1.343e+01 1.963e+01 0.684 0.493844
## geography.Pennsylvania 1.139e+01 1.951e+01 0.584 0.559221
## geography.St.Alabama 2.227e+01 2.677e+01 0.832 0.405586
## geography.St.Arkansas NA NA NA NA
## geography.St.Florida 7.929e+00 2.677e+01 0.296 0.767163
## geography.St.Illinois 2.310e+01 2.682e+01 0.861 0.389094
## geography.St.Indiana NA NA NA NA
## geography.St.Louisiana 2.751e+01 2.078e+01 1.324 0.185791
## geography.St.Mary.Maryland NA NA NA NA
## geography.St.Michigan 2.011e+01 2.335e+01 0.861 0.389305
## geography.St.Minnesota NA NA NA NA
## geography.St.Missouri 2.634e+01 2.137e+01 1.233 0.217853
## geography.St.Wisconsin 4.664e+01 2.684e+01 1.738 0.082431 .
## geography.St.York 9.830e+00 2.679e+01 0.367 0.713727
## geography.Ste.Missouri 3.995e+01 2.683e+01 1.489 0.136648
## geography.Tennessee 2.534e+01 1.942e+01 1.305 0.191987
## geography.Texas 2.180e+01 1.934e+01 1.127 0.259792
## geography.Utah -1.934e-01 1.977e+01 -0.010 0.992195
## geography.Valdez.Alaska 4.423e+01 2.647e+01 1.671 0.094916 .
## geography.Vermont 1.628e+01 2.032e+01 0.801 0.423092
## geography.Virginia 2.273e+01 1.936e+01 1.174 0.240515
## geography.Washington 1.220e+01 1.965e+01 0.621 0.534733
## geography.Wisconsin 1.580e+01 1.948e+01 0.811 0.417536
## geography.Wyoming 1.855e+01 1.977e+01 0.938 0.348258
## geography.York 5.149e+00 1.959e+01 0.263 0.792662
## geography.Yukon.Alaska NA NA NA NA
## medianagemf -4.849e-01 1.890e-01 -2.565 0.010377 *
## medcov 2.076e-01 1.704e-01 1.218 0.223330
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.45 on 2037 degrees of freedom
## Multiple R-squared: 0.5858, Adjusted R-squared: 0.5663
## F-statistic: 30.01 on 96 and 2037 DF, p-value: < 2.2e-16
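OK, the state factor (geography) misbehaves: its dummy variables are rarely significant, and several of them get NA coefficients
The NA coefficients mean the design matrix is rank-deficient: every coefficient can be estimated only when the rank of our data matrix is at least equal to the number of parameters we want to fit. One way to break this condition is having collinear covariates, which exist in our data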
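To make the rank argument concrete, here is a minimal sketch on toy data (not the cancer data). The names x1, x2, x3 and y are made up for illustration, and x3 is built on purpose as an exact linear combination of the other two, so lm() cannot estimate it and reports NA.

```r
# Toy data: x3 is an exact linear combination of x1 and x2 (hypothetical names).
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2                      # perfectly collinear column
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)

X <- model.matrix(fit)
ncol(X)        # number of parameters requested (intercept + 3 slopes)
qr(X)$rank     # rank of the design matrix, smaller than ncol(X)
coef(fit)      # the coefficient of x3 comes back as NA
```

On a fitted lm object, alias(fit) should list which columns are linear combinations of others, which can help to find the offending dummies.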
Now let's try stepwise regression
stepfit<-train(target_deathrate ~.,data = traindata,method="lmStepAIC",trControl=fitcontrol)
Let's see the summary and the coefficients
summary(stepfit)
##
## Call:
## lm(formula = .outcome ~ incidencerate + percentmarried + pcths18_24 +
## pcths25_over + pctbachdeg25_over + pctemployed16_over + pctpubliccoveragealone +
## pctwhite + pctblack + pctotherrace + pctmarriedhouseholds +
## birthrate + binnedinc..37413.8..40362.7. + binnedinc..40362.7..42724.4. +
## binnedinc..42724.4..45201. + binnedinc..45201..48021.6. +
## binnedinc..48021.6..51046.4. + binnedinc..51046.4..54545.6. +
## binnedinc..54545.6..61494.5. + geography.Alabama + geography.Alaska +
## geography.Arkansas + geography.Carolina + geography.Dakota +
## geography.Florida + geography.Georgia + geography.Hampshire +
## geography.Illinois + geography.Indiana + geography.Jersey +
## geography.Kansas + geography.Kentucky + geography.Louisiana +
## geography.Maine + geography.Maryland + geography.Michigan +
## geography.Mississippi + geography.Missouri + geography.Nebraska +
## geography.Nevada + geography.Ohio + geography.Oklahoma +
## geography.Oregon + geography.Pennsylvania + geography.St.Louisiana +
## geography.St.Missouri + geography.St.Wisconsin + geography.Ste.Missouri +
## geography.Tennessee + geography.Texas + geography.Valdez.Alaska +
## geography.Vermont + geography.Virginia + geography.Washington +
## geography.Wisconsin + geography.Wyoming + medianagemf + medcov,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.36 -9.90 -0.32 10.21 127.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 143.23047 13.66199 10.484 < 2e-16 ***
## incidencerate 0.18560 0.00843 22.017 < 2e-16 ***
## percentmarried 0.74211 0.19298 3.846 0.000124 ***
## pcths18_24 0.21462 0.05389 3.983 7.04e-05 ***
## pcths25_over 0.19695 0.12299 1.601 0.109448
## pctbachdeg25_over -1.26662 0.17688 -7.161 1.11e-12 ***
## pctemployed16_over -0.25453 0.10904 -2.334 0.019672 *
## pctpubliccoveragealone 0.61463 0.17141 3.586 0.000344 ***
## pctwhite -0.18170 0.06631 -2.740 0.006198 **
## pctblack -0.13647 0.07520 -1.815 0.069705 .
## pctotherrace -0.74841 0.15452 -4.844 1.37e-06 ***
## pctmarriedhouseholds -1.14913 0.16695 -6.883 7.72e-12 ***
## birthrate -0.70078 0.22564 -3.106 0.001924 **
## binnedinc..37413.8..40362.7. -3.25235 1.52572 -2.132 0.033151 *
## binnedinc..40362.7..42724.4. -4.08568 1.53370 -2.664 0.007783 **
## binnedinc..42724.4..45201. -3.98153 1.51825 -2.622 0.008794 **
## binnedinc..45201..48021.6. -5.16374 1.58126 -3.266 0.001110 **
## binnedinc..48021.6..51046.4. -4.42297 1.62502 -2.722 0.006547 **
## binnedinc..51046.4..54545.6. -4.18749 1.57863 -2.653 0.008048 **
## binnedinc..54545.6..61494.5. -3.05262 1.59153 -1.918 0.055242 .
## geography.Alabama 13.24125 3.36364 3.937 8.54e-05 ***
## geography.Alaska 27.94431 5.89467 4.741 2.28e-06 ***
## geography.Arkansas 24.06341 3.01349 7.985 2.30e-15 ***
## geography.Carolina 9.21724 2.37017 3.889 0.000104 ***
## geography.Dakota 7.02700 2.49441 2.817 0.004892 **
## geography.Florida 16.81214 3.25335 5.168 2.60e-07 ***
## geography.Georgia 6.38499 2.42380 2.634 0.008494 **
## geography.Hampshire 9.44290 6.64378 1.421 0.155375
## geography.Illinois 11.28871 2.44081 4.625 3.98e-06 ***
## geography.Indiana 20.66953 2.59944 7.952 3.00e-15 ***
## geography.Jersey 9.57581 4.71283 2.032 0.042295 *
## geography.Kansas 11.19632 2.46390 4.544 5.83e-06 ***
## geography.Kentucky 26.63049 2.50302 10.639 < 2e-16 ***
## geography.Louisiana 10.61774 3.43501 3.091 0.002021 **
## geography.Maine 12.88223 5.97827 2.155 0.031289 *
## geography.Maryland 11.30764 5.26228 2.149 0.031765 *
## geography.Michigan 13.68354 2.71221 5.045 4.93e-07 ***
## geography.Mississippi 15.74400 3.12473 5.039 5.10e-07 ***
## geography.Missouri 22.61036 2.50518 9.025 < 2e-16 ***
## geography.Nebraska 6.14029 2.73332 2.246 0.024779 *
## geography.Nevada 13.79806 5.13014 2.690 0.007211 **
## geography.Ohio 18.71122 2.64973 7.062 2.24e-12 ***
## geography.Oklahoma 22.56084 2.86365 7.878 5.31e-15 ***
## geography.Oregon 7.81109 3.87969 2.013 0.044209 *
## geography.Pennsylvania 4.62095 2.96702 1.557 0.119519
## geography.St.Louisiana 21.22583 7.81188 2.717 0.006640 **
## geography.St.Missouri 20.49557 9.30461 2.203 0.027723 *
## geography.St.Wisconsin 41.38447 18.51145 2.236 0.025483 *
## geography.Ste.Missouri 35.26877 18.52788 1.904 0.057107 .
## geography.Tennessee 19.06761 2.49509 7.642 3.24e-14 ***
## geography.Texas 16.03179 2.01293 7.964 2.71e-15 ***
## geography.Valdez.Alaska 40.64590 18.56773 2.189 0.028703 *
## geography.Vermont 9.17605 5.82503 1.575 0.115344
## geography.Virginia 16.60272 2.09694 7.918 3.91e-15 ***
## geography.Washington 6.41982 3.78538 1.696 0.090045 .
## geography.Wisconsin 9.82176 2.97262 3.304 0.000969 ***
## geography.Wyoming 12.83993 4.66086 2.755 0.005923 **
## medianagemf -0.49429 0.13133 -3.764 0.000172 ***
## medcov 0.22289 0.13775 1.618 0.105797
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.39 on 2075 degrees of freedom
## Multiple R-squared: 0.5808, Adjusted R-squared: 0.569
## F-statistic: 49.56 on 58 and 2075 DF, p-value: < 2.2e-16
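The stepwise search above runs through caret, which to my knowledge relies on MASS::stepAIC for the lmStepAIC method. As a rough sketch of the same idea without caret, the base R function step() can be used; the built-in mtcars data stands in for traindata here, which is purely an assumption for illustration.

```r
# Built-in mtcars data stands in for traindata (assumption for illustration).
full <- lm(mpg ~ ., data = mtcars)

# Stepwise search by AIC, starting from the full model.
# trace = 0 silences the step-by-step log.
reduced <- step(full, direction = "both", trace = 0)

formula(reduced)                              # predictors that survived
c(full = AIC(full), reduced = AIC(reduced))   # lower AIC is better
```

The selection only compares AIC values, so a dropped predictor is not proven useless and a kept one is not guaranteed to be significant, as the summary above also shows.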
Now the final model is the penalized model
Penalized models come in three types:
1- lasso model 2- ridge model 3- elastic net model
Here I will use the elastic net since we have a lot of collinear variables: some coefficients need to be exactly zero and others only need to be shrunk toward zero
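To show how the elastic net mixes the two penalties, here is a small hand-written sketch of the penalty term in the parameterization used by glmnet, where alpha = 1 gives the lasso and alpha = 0 gives ridge. The function elastic_penalty and the coefficient vector beta are made up for this illustration and are not part of any package.

```r
# Elastic net penalty in the glmnet parameterization:
#   lambda * ( (1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)) )
elastic_penalty <- function(beta, alpha, lambda) {
  ridge_part <- (1 - alpha) / 2 * sum(beta^2)
  lasso_part <- alpha * sum(abs(beta))
  lambda * (ridge_part + lasso_part)
}

beta <- c(2, -1, 0.5)   # made-up coefficients, intercept excluded

elastic_penalty(beta, alpha = 1,   lambda = 1)  # pure lasso: sum(abs(beta))
elastic_penalty(beta, alpha = 0,   lambda = 1)  # pure ridge: sum(beta^2) / 2
elastic_penalty(beta, alpha = 0.5, lambda = 1)  # half lasso, half ridge
```

A small alpha therefore behaves mostly like ridge (shrinking correlated coefficients together) while still allowing some coefficients to drop out completely.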
Let's run the elastic net model
elasticfit<-train(target_deathrate~.,data = traindata,trControl=fitcontrol,tuneLength=10,method="glmnet")
elasticfit$bestTune # best alpha and lambda values
## alpha lambda
## 7 0.1 0.9404475
coef(elasticfit$finalModel,elasticfit$bestTune$lambda)
## 104 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.345691e+02
## avganncount -1.895949e-04
## incidencerate 1.754294e-01
## medincome .
## povertypercent .
## studypercap 7.542099e-04
## medianage 5.913347e-03
## percentmarried 1.885861e-01
## pctnohs18_24 .
## pcths18_24 2.164261e-01
## pctbachdeg18_24 -1.353659e-01
## pcths25_over 2.457405e-01
## pctbachdeg25_over -1.000302e+00
## pctemployed16_over -7.355091e-02
## pctunemployed16_over 2.498321e-01
## pctpubliccoverage .
## pctpubliccoveragealone 4.244808e-01
## pctwhite -4.779651e-02
## pctblack -5.773841e-03
## pctasian 9.954008e-02
## pctotherrace -6.145021e-01
## pctmarriedhouseholds -6.034101e-01
## birthrate -5.503060e-01
## binnedinc..34218.1..37413.8. 3.433645e+00
## binnedinc..37413.8..40362.7. 8.489800e-01
## binnedinc..40362.7..42724.4. .
## binnedinc..42724.4..45201. .
## binnedinc..45201..48021.6. -9.928822e-01
## binnedinc..48021.6..51046.4. -1.149099e+00
## binnedinc..51046.4..54545.6. -1.139128e+00
## binnedinc..54545.6..61494.5. -1.229785e-01
## binnedinc..61494.5..125635. 8.941259e-01
## binnedinc..22640..34218.1. 5.146713e+00
## geography.Alabama 1.154804e+00
## geography.Alaska 1.733174e+01
## geography.Anne.Maryland .
## geography.Arizona -1.688474e+01
## geography.Arkansas 1.151172e+01
## geography.California -8.631527e+00
## geography.Carolina -3.052875e+00
## geography.Colorado -1.508951e+01
## geography.Columbia .
## geography.Connecticut -1.193743e+01
## geography.Dakota -2.477776e+00
## geography.Delaware -2.736964e+00
## geographyDo.U.0623..Mexico -9.535875e+00
## geography.Florida 2.311799e+00
## geography.George.Maryland 3.215843e+00
## geography.Georgia -5.919504e+00
## geography.Hampshire -7.721718e-01
## geography.Hawaii -1.751243e+01
## geography.Idaho -1.135165e+01
## geography.Illinois .
## geography.Indiana 8.376133e+00
## geography.Iowa -8.373090e+00
## geography.Island -1.147855e+00
## geography.Jersey -5.950566e-01
## geography.Kansas 1.049727e-01
## geography.Kentucky 1.570057e+01
## geography.Louisiana -7.274319e-01
## geography.Maine .
## geography.Maryland .
## geography.Massachusetts -6.086410e+00
## geography.Matanuska.Alaska .
## geography.Mexico -9.957967e+00
## geography.Miami.Florida -2.500847e+01
## geography.Michigan 1.091320e+00
## geography.Minnesota -8.718165e+00
## geography.Mississippi 3.173603e+00
## geography.Missouri 9.609113e+00
## geography.Montana -1.135981e+01
## geography.Nebraska -4.245383e+00
## geography.Nevada 1.507387e+00
## geography.O.Iowa 1.532439e+00
## geography.Ohio 6.598234e+00
## geography.Oklahoma 1.183841e+01
## geography.Oregon -2.747470e+00
## geography.Pennsylvania -6.169547e+00
## geography.St.Alabama 6.006366e-01
## geography.St.Arkansas .
## geography.St.Florida -7.477797e+00
## geography.St.Illinois 1.382744e+00
## geography.St.Indiana .
## geography.St.Louisiana 7.405574e+00
## geography.St.Mary.Maryland .
## geography.St.Michigan 4.364745e-02
## geography.St.Minnesota .
## geography.St.Missouri 6.730196e+00
## geography.St.Wisconsin 2.434074e+01
## geography.St.York -3.102247e+00
## geography.Ste.Missouri 1.793544e+01
## geography.Tennessee 6.971448e+00
## geography.Texas 2.159052e+00
## geography.Utah -1.746181e+01
## geography.Valdez.Alaska 2.170245e+01
## geography.Vermont -1.148873e+00
## geography.Virginia 4.644171e+00
## geography.Washington -3.895516e+00
## geography.Wisconsin -6.244107e-01
## geography.Wyoming .
## geography.York -1.102549e+01
## geography.Yukon.Alaska -4.383698e+00
## medianagemf -3.506919e-01
## medcov .
Finally, we need to predict on the test data to extract RMSE and R2
allfitpred<-predict(allfit,testdata)
allfitmeasures<-data.frame(RMSE=RMSE(pred = allfitpred,obs = testdata$target_deathrate),R2=R2(pred = allfitpred,obs = testdata$target_deathrate))
stepfitpred<-predict(stepfit,testdata)
stepfitmeasures<-data.frame(RMSE=RMSE(pred = stepfitpred,obs = testdata$target_deathrate),R2=R2(pred = stepfitpred,obs = testdata$target_deathrate))
elasticfitpred<-predict(elasticfit,testdata)
elasticfitmeasures<-data.frame(RMSE=RMSE(pred = elasticfitpred,obs = testdata$target_deathrate),R2=R2(pred = elasticfitpred,obs = testdata$target_deathrate))
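For reference, the two measures can be computed by hand. This is a minimal sketch with made-up numbers (obs and pred are hypothetical, not taken from testdata); it assumes R2 is the squared correlation between observed and predicted values, which as far as I know is the default of caret's R2().

```r
# Made-up observed and predicted values, only to show the arithmetic.
obs  <- c(160, 175, 190, 150, 180)
pred <- c(158, 180, 185, 155, 176)

rmse <- sqrt(mean((obs - pred)^2))   # root mean squared error
r2   <- cor(obs, pred)^2             # squared correlation

c(RMSE = rmse, R2 = r2)
```

RMSE is in the units of the target (deaths per 100000), so lower is better, while R2 is unitless and higher is better.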
putting all measures together
allmeasures<-rbind(allfitmeasures,stepfitmeasures,elasticfitmeasures)
rownames(allmeasures)<-c("Model with all variables","stepwise model","elastic penalized model")
allmeasures%>%kable("markdown")
| | RMSE | R2 |
|---|---|---|
| Model with all variables | 17.58857 | 0.5793748 |
| stepwise model | 17.61035 | 0.5782697 |
| elastic penalized model | 17.43842 | 0.5866667 |
At last we see that the penalized model is the most accurate of the three on the test data (lowest RMSE, 17.44, and highest R2, 0.587), although the margin is small. In this model we shrank some coefficients to near zero and set others exactly to zero (a combination of lasso and ridge regression). The best alpha value is 0.1 and the best lambda value is about 0.94, as reported by elasticfit$bestTune above
This was an effort to achieve the best prediction of the death rate using more than 30 variables
Regards