This project predicts the cancer death rate per 100,000 people from more than 30 predictors.

To find out more about this data and the description of the variables, please visit this page.

For any advice or recommendations, please feel free to contact me: vet.m.mohamed@gmail.com

Let's start by loading our libraries.

library(tidyverse)
library(knitr)
library(caret)
library(car)
library(psych)
library(mice)
library(progress)
library(DMwR)
library(readr)
library(MASS)
library(pedometrics)

Then import the data.

data<-read.csv("./data sets/cancer.csv")

head(data)%>%kable("markdown")
| avganncount | avgdeathsperyear | target_deathrate | incidencerate | medincome | popest2015 | povertypercent | studypercap | binnedinc | medianage | medianagemale | medianagefemale | geography | percentmarried | pctnohs18_24 | pcths18_24 | pctsomecol18_24 | pctbachdeg18_24 | pcths25_over | pctbachdeg25_over | pctemployed16_over | pctunemployed16_over | pctprivatecoverage | pctprivatecoveragealone | pctempprivcoverage | pctpubliccoverage | pctpubliccoveragealone | pctwhite | pctblack | pctasian | pctotherrace | pctmarriedhouseholds | birthrate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1397 | 469 | 164.9 | 489.8 | 61898 | 260131 | 11.2 | 499.74820 | (61494.5, 125635] | 39.3 | 36.9 | 41.7 | Kitsap County, Washington | 52.5 | 11.5 | 39.5 | 42.1 | 6.9 | 23.2 | 19.6 | 51.9 | 8.0 | 75.1 | NA | 41.6 | 32.9 | 14.0 | 81.78053 | 2.5947283 | 4.8218571 | 1.8434785 | 52.85608 | 6.118831 |
| 173 | 70 | 161.3 | 411.6 | 48127 | 43269 | 18.6 | 23.11123 | (48021.6, 51046.4] | 33.0 | 32.2 | 33.7 | Kittitas County, Washington | 44.5 | 6.1 | 22.4 | 64.0 | 7.5 | 26.0 | 22.7 | 55.9 | 7.8 | 70.2 | 53.8 | 43.6 | 31.1 | 15.3 | 89.22851 | 0.9691025 | 2.2462326 | 3.7413515 | 45.37250 | 4.333096 |
| 102 | 50 | 174.7 | 349.7 | 49348 | 21026 | 14.6 | 47.56016 | (48021.6, 51046.4] | 45.0 | 44.0 | 45.8 | Klickitat County, Washington | 54.2 | 24.0 | 36.6 | NA | 9.5 | 29.0 | 16.0 | 45.9 | 7.0 | 63.7 | 43.5 | 34.9 | 42.1 | 21.1 | 90.92219 | 0.7396734 | 0.4658982 | 2.7473583 | 54.44487 | 3.729488 |
| 427 | 202 | 194.8 | 430.4 | 44243 | 75882 | 17.1 | 342.63725 | (42724.4, 45201] | 42.8 | 42.2 | 43.4 | Lewis County, Washington | 52.7 | 20.2 | 41.2 | 36.1 | 2.5 | 31.6 | 9.3 | 48.3 | 12.1 | 58.4 | 40.3 | 35.0 | 45.3 | 25.0 | 91.74469 | 0.7826260 | 1.1613587 | 1.3626432 | 51.02151 | 4.603841 |
| 57 | 26 | 144.4 | 350.1 | 49955 | 10321 | 12.5 | 0.00000 | (48021.6, 51046.4] | 48.3 | 47.8 | 48.9 | Lincoln County, Washington | 57.8 | 14.9 | 43.0 | 40.0 | 2.0 | 33.4 | 15.0 | 48.2 | 4.8 | 61.6 | 43.9 | 35.1 | 44.0 | 22.7 | 94.10402 | 0.2701920 | 0.6658304 | 0.4921355 | 54.02746 | 6.796657 |
| 428 | 152 | 176.0 | 505.4 | 52313 | 61023 | 15.6 | 180.25990 | (51046.4, 54545.6] | 45.4 | 43.5 | 48.0 | Mason County, Washington | 50.4 | 29.9 | 35.1 | NA | 4.5 | 30.4 | 11.9 | 44.1 | 12.9 | 60.0 | 38.8 | 32.6 | 43.2 | 20.2 | 84.88263 | 1.6532052 | 1.5380566 | 3.3146354 | 51.22036 | 4.964476 |

It is important to look at the structure of the data.

str(data)
## 'data.frame':    3047 obs. of  33 variables:
##  $ avganncount            : num  1397 173 102 427 57 ...
##  $ avgdeathsperyear       : int  469 70 50 202 26 152 97 71 36 1380 ...
##  $ target_deathrate       : num  165 161 175 195 144 ...
##  $ incidencerate          : num  490 412 350 430 350 ...
##  $ medincome              : int  61898 48127 49348 44243 49955 52313 37782 40189 42579 60397 ...
##  $ popest2015             : int  260131 43269 21026 75882 10321 61023 41516 20848 13088 843954 ...
##  $ povertypercent         : num  11.2 18.6 14.6 17.1 12.5 15.6 23.2 17.8 22.3 13.1 ...
##  $ studypercap            : num  499.7 23.1 47.6 342.6 0 ...
##  $ binnedinc              : Factor w/ 10 levels "(34218.1, 37413.8]",..: 9 6 6 4 6 7 2 2 3 8 ...
##  $ medianage              : num  39.3 33 45 42.8 48.3 45.4 42.6 51.7 49.3 35.8 ...
##  $ medianagemale          : num  36.9 32.2 44 42.2 47.8 43.5 42.2 50.8 48.4 34.7 ...
##  $ medianagefemale        : num  41.7 33.7 45.8 43.4 48.9 48 43.5 52.5 49.8 37 ...
##  $ geography              : Factor w/ 3047 levels "Abbeville County, South Carolina",..: 1459 1460 1464 1589 1618 1766 2051 2112 2143 2185 ...
##  $ percentmarried         : num  52.5 44.5 54.2 52.7 57.8 50.4 54.1 52.7 55.9 50 ...
##  $ pctnohs18_24           : num  11.5 6.1 24 20.2 14.9 29.9 26.1 27.3 34.7 15.6 ...
##  $ pcths18_24             : num  39.5 22.4 36.6 41.2 43 35.1 41.4 33.9 39.4 36.3 ...
##  $ pctsomecol18_24        : num  42.1 64 NA 36.1 40 NA NA 36.5 NA NA ...
##  $ pctbachdeg18_24        : num  6.9 7.5 9.5 2.5 2 4.5 5.8 2.2 1.4 7.1 ...
##  $ pcths25_over           : num  23.2 26 29 31.6 33.4 30.4 29.8 31.6 32.2 28.8 ...
##  $ pctbachdeg25_over      : num  19.6 22.7 16 9.3 15 11.9 11.9 11.3 12 16.2 ...
##  $ pctemployed16_over     : num  51.9 55.9 45.9 48.3 48.2 44.1 51.8 40.9 39.5 56.6 ...
##  $ pctunemployed16_over   : num  8 7.8 7 12.1 4.8 12.9 8.9 8.9 10.3 9.2 ...
##  $ pctprivatecoverage     : num  75.1 70.2 63.7 58.4 61.6 60 49.5 55.8 55.5 69.9 ...
##  $ pctprivatecoveragealone: num  NA 53.8 43.5 40.3 43.9 38.8 35 33.1 37.8 NA ...
##  $ pctempprivcoverage     : num  41.6 43.6 34.9 35 35.1 32.6 28.3 25.9 29.9 44.4 ...
##  $ pctpubliccoverage      : num  32.9 31.1 42.1 45.3 44 43.2 46.4 50.9 48.1 31.4 ...
##  $ pctpubliccoveragealone : num  14 15.3 21.1 25 22.7 20.2 28.7 24.1 26.6 16.5 ...
##  $ pctwhite               : num  81.8 89.2 90.9 91.7 94.1 ...
##  $ pctblack               : num  2.595 0.969 0.74 0.783 0.27 ...
##  $ pctasian               : num  4.822 2.246 0.466 1.161 0.666 ...
##  $ pctotherrace           : num  1.843 3.741 2.747 1.363 0.492 ...
##  $ pctmarriedhouseholds   : num  52.9 45.4 54.4 51 54 ...
##  $ birthrate              : num  6.12 4.33 3.73 4.6 6.8 ...

Some of the variables are factors and the others are numeric, so there is no need for type conversion.

Let's inspect the summary statistics of each variable.

summary(data)
##   avganncount      avgdeathsperyear target_deathrate incidencerate   
##  Min.   :    6.0   Min.   :    3    Min.   : 59.7    Min.   : 201.3  
##  1st Qu.:   76.0   1st Qu.:   28    1st Qu.:161.2    1st Qu.: 420.3  
##  Median :  171.0   Median :   61    Median :178.1    Median : 453.5  
##  Mean   :  606.3   Mean   :  186    Mean   :178.7    Mean   : 448.3  
##  3rd Qu.:  518.0   3rd Qu.:  149    3rd Qu.:195.2    3rd Qu.: 480.9  
##  Max.   :38150.0   Max.   :14010    Max.   :362.8    Max.   :1206.9  
##                                                                      
##    medincome        popest2015       povertypercent   studypercap     
##  Min.   : 22640   Min.   :     827   Min.   : 3.20   Min.   :   0.00  
##  1st Qu.: 38883   1st Qu.:   11684   1st Qu.:12.15   1st Qu.:   0.00  
##  Median : 45207   Median :   26643   Median :15.90   Median :   0.00  
##  Mean   : 47063   Mean   :  102637   Mean   :16.88   Mean   : 155.40  
##  3rd Qu.: 52492   3rd Qu.:   68671   3rd Qu.:20.40   3rd Qu.:  83.65  
##  Max.   :125635   Max.   :10170292   Max.   :47.40   Max.   :9762.31  
##                                                                       
##               binnedinc      medianage      medianagemale  
##  (45201, 48021.6]  : 306   Min.   : 22.30   Min.   :22.40  
##  (54545.6, 61494.5]: 306   1st Qu.: 37.70   1st Qu.:36.35  
##  [22640, 34218.1]  : 306   Median : 41.00   Median :39.60  
##  (42724.4, 45201]  : 305   Mean   : 45.27   Mean   :39.57  
##  (48021.6, 51046.4]: 305   3rd Qu.: 44.00   3rd Qu.:42.50  
##  (51046.4, 54545.6]: 305   Max.   :624.00   Max.   :64.70  
##  (Other)           :1214                                   
##  medianagefemale                            geography    percentmarried 
##  Min.   :22.30   Abbeville County, South Carolina:   1   Min.   :23.10  
##  1st Qu.:39.10   Acadia Parish, Louisiana        :   1   1st Qu.:47.75  
##  Median :42.40   Accomack County, Virginia       :   1   Median :52.40  
##  Mean   :42.15   Ada County, Idaho               :   1   Mean   :51.77  
##  3rd Qu.:45.30   Adair County, Iowa              :   1   3rd Qu.:56.40  
##  Max.   :65.70   Adair County, Kentucky          :   1   Max.   :72.50  
##                  (Other)                         :3041                  
##   pctnohs18_24     pcths18_24   pctsomecol18_24 pctbachdeg18_24 
##  Min.   : 0.00   Min.   : 0.0   Min.   : 7.10   Min.   : 0.000  
##  1st Qu.:12.80   1st Qu.:29.2   1st Qu.:34.00   1st Qu.: 3.100  
##  Median :17.10   Median :34.7   Median :40.40   Median : 5.400  
##  Mean   :18.22   Mean   :35.0   Mean   :40.98   Mean   : 6.158  
##  3rd Qu.:22.70   3rd Qu.:40.7   3rd Qu.:46.40   3rd Qu.: 8.200  
##  Max.   :64.10   Max.   :72.5   Max.   :79.00   Max.   :51.800  
##                                 NA's   :2285                    
##   pcths25_over   pctbachdeg25_over pctemployed16_over pctunemployed16_over
##  Min.   : 7.50   Min.   : 2.50     Min.   :17.60      Min.   : 0.400      
##  1st Qu.:30.40   1st Qu.: 9.40     1st Qu.:48.60      1st Qu.: 5.500      
##  Median :35.30   Median :12.30     Median :54.50      Median : 7.600      
##  Mean   :34.80   Mean   :13.28     Mean   :54.15      Mean   : 7.852      
##  3rd Qu.:39.65   3rd Qu.:16.10     3rd Qu.:60.30      3rd Qu.: 9.700      
##  Max.   :54.80   Max.   :42.20     Max.   :80.10      Max.   :29.400      
##                                    NA's   :152                            
##  pctprivatecoverage pctprivatecoveragealone pctempprivcoverage
##  Min.   :22.30      Min.   :15.70           Min.   :13.5      
##  1st Qu.:57.20      1st Qu.:41.00           1st Qu.:34.5      
##  Median :65.10      Median :48.70           Median :41.1      
##  Mean   :64.35      Mean   :48.45           Mean   :41.2      
##  3rd Qu.:72.10      3rd Qu.:55.60           3rd Qu.:47.7      
##  Max.   :92.30      Max.   :78.90           Max.   :70.7      
##                     NA's   :609                               
##  pctpubliccoverage pctpubliccoveragealone    pctwhite     
##  Min.   :11.20     Min.   : 2.60          Min.   : 10.20  
##  1st Qu.:30.90     1st Qu.:14.85          1st Qu.: 77.30  
##  Median :36.30     Median :18.80          Median : 90.06  
##  Mean   :36.25     Mean   :19.24          Mean   : 83.65  
##  3rd Qu.:41.55     3rd Qu.:23.10          3rd Qu.: 95.45  
##  Max.   :65.10     Max.   :46.60          Max.   :100.00  
##                                                           
##     pctblack          pctasian        pctotherrace    
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.6207   1st Qu.: 0.2542   1st Qu.: 0.2952  
##  Median : 2.2476   Median : 0.5498   Median : 0.8262  
##  Mean   : 9.1080   Mean   : 1.2540   Mean   : 1.9835  
##  3rd Qu.:10.5097   3rd Qu.: 1.2210   3rd Qu.: 2.1780  
##  Max.   :85.9478   Max.   :42.6194   Max.   :41.9303  
##                                                       
##  pctmarriedhouseholds   birthrate     
##  Min.   :22.99        Min.   : 0.000  
##  1st Qu.:47.76        1st Qu.: 4.521  
##  Median :51.67        Median : 5.381  
##  Mean   :51.24        Mean   : 5.640  
##  3rd Qu.:55.40        3rd Qu.: 6.494  
##  Max.   :78.08        Max.   :21.326  
## 

The summary reveals some issues with outliers and missing data: for example, medianage has a maximum of 624, and pctsomecol18_24 has 2285 missing values.

Another thing: I think we have to drop the county name and keep only the name of the state.

table(data$geography)%>%data.frame()%>%head()%>%kable("markdown") # here we see that the number of counties equals the number of rows

| Var1 | Freq |
|---|---|
| Abbeville County, South Carolina | 1 |
| Acadia Parish, Louisiana | 1 |
| Accomack County, Virginia | 1 |
| Ada County, Idaho | 1 |
| Adair County, Iowa | 1 |
| Adair County, Kentucky | 1 |

# this will be problematic in our analysis
# so I will keep the state name and remove the county name

data$geography<-str_remove_all(string = data$geography,
                               pattern = "[:alpha:]{1,}(\\s)|[:alpha:]{1,}(\\,)|(\\s)")
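The pattern above handles single-word county and state names, but multi-word or punctuated names can leave residue behind; the regression output further down shows levels such as `Anne'Maryland`, `St.Alabama`, `Carolina` and `York`. A minimal alternative sketch, assuming every value has the form "county, state": drop everything up to the last comma. The helper name `state_of` is hypothetical, and the second and third sample strings are illustrative rather than taken from the data.

```r
# Hypothetical helper: keep only the text after the last comma.
state_of <- function(x) {
  # ".*," is greedy, so it consumes everything up to the final comma;
  # "\\s*" then drops the spaces that follow it.
  sub("^.*,\\s*", "", as.character(x))
}

# Illustrative inputs in the "county, state" form used by the geography column.
geo <- c("Kitsap County, Washington",
         "St. Clair County, Alabama",
         "Rio Arriba County, New Mexico")

states <- state_of(geo)
print(states)
```

Because only the part after the comma is kept, two-word states stay intact and apostrophes or periods in the county name no longer matter.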


head(data,10)%>%kable("markdown")
| avganncount | avgdeathsperyear | target_deathrate | incidencerate | medincome | popest2015 | povertypercent | studypercap | binnedinc | medianage | medianagemale | medianagefemale | geography | percentmarried | pctnohs18_24 | pcths18_24 | pctsomecol18_24 | pctbachdeg18_24 | pcths25_over | pctbachdeg25_over | pctemployed16_over | pctunemployed16_over | pctprivatecoverage | pctprivatecoveragealone | pctempprivcoverage | pctpubliccoverage | pctpubliccoveragealone | pctwhite | pctblack | pctasian | pctotherrace | pctmarriedhouseholds | birthrate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1397 | 469 | 164.9 | 489.8 | 61898 | 260131 | 11.2 | 499.74820 | (61494.5, 125635] | 39.3 | 36.9 | 41.7 | Washington | 52.5 | 11.5 | 39.5 | 42.1 | 6.9 | 23.2 | 19.6 | 51.9 | 8.0 | 75.1 | NA | 41.6 | 32.9 | 14.0 | 81.78053 | 2.5947283 | 4.8218571 | 1.8434785 | 52.85608 | 6.118831 |
| 173 | 70 | 161.3 | 411.6 | 48127 | 43269 | 18.6 | 23.11123 | (48021.6, 51046.4] | 33.0 | 32.2 | 33.7 | Washington | 44.5 | 6.1 | 22.4 | 64.0 | 7.5 | 26.0 | 22.7 | 55.9 | 7.8 | 70.2 | 53.8 | 43.6 | 31.1 | 15.3 | 89.22851 | 0.9691025 | 2.2462326 | 3.7413515 | 45.37250 | 4.333096 |
| 102 | 50 | 174.7 | 349.7 | 49348 | 21026 | 14.6 | 47.56016 | (48021.6, 51046.4] | 45.0 | 44.0 | 45.8 | Washington | 54.2 | 24.0 | 36.6 | NA | 9.5 | 29.0 | 16.0 | 45.9 | 7.0 | 63.7 | 43.5 | 34.9 | 42.1 | 21.1 | 90.92219 | 0.7396734 | 0.4658982 | 2.7473583 | 54.44487 | 3.729488 |
| 427 | 202 | 194.8 | 430.4 | 44243 | 75882 | 17.1 | 342.63725 | (42724.4, 45201] | 42.8 | 42.2 | 43.4 | Washington | 52.7 | 20.2 | 41.2 | 36.1 | 2.5 | 31.6 | 9.3 | 48.3 | 12.1 | 58.4 | 40.3 | 35.0 | 45.3 | 25.0 | 91.74469 | 0.7826260 | 1.1613587 | 1.3626432 | 51.02151 | 4.603841 |
| 57 | 26 | 144.4 | 350.1 | 49955 | 10321 | 12.5 | 0.00000 | (48021.6, 51046.4] | 48.3 | 47.8 | 48.9 | Washington | 57.8 | 14.9 | 43.0 | 40.0 | 2.0 | 33.4 | 15.0 | 48.2 | 4.8 | 61.6 | 43.9 | 35.1 | 44.0 | 22.7 | 94.10402 | 0.2701920 | 0.6658304 | 0.4921355 | 54.02746 | 6.796657 |
| 428 | 152 | 176.0 | 505.4 | 52313 | 61023 | 15.6 | 180.25990 | (51046.4, 54545.6] | 45.4 | 43.5 | 48.0 | Washington | 50.4 | 29.9 | 35.1 | NA | 4.5 | 30.4 | 11.9 | 44.1 | 12.9 | 60.0 | 38.8 | 32.6 | 43.2 | 20.2 | 84.88263 | 1.6532052 | 1.5380566 | 3.3146354 | 51.22036 | 4.964476 |
| 250 | 97 | 175.9 | 461.8 | 37782 | 41516 | 23.2 | 0.00000 | (37413.8, 40362.7] | 42.6 | 42.2 | 43.5 | Washington | 54.1 | 26.1 | 41.4 | NA | 5.8 | 29.8 | 11.9 | 51.8 | 8.9 | 49.5 | 35.0 | 28.3 | 46.4 | 28.7 | 75.10645 | 0.6169554 | 0.8661570 | 8.3567212 | 51.01390 | 4.204317 |
| 146 | 71 | 183.6 | 404.0 | 40189 | 20848 | 17.8 | 0.00000 | (37413.8, 40362.7] | 51.7 | 50.8 | 52.5 | Washington | 52.7 | 27.3 | 33.9 | 36.5 | 2.2 | 31.6 | 11.3 | 40.9 | 8.9 | 55.8 | 33.1 | 25.9 | 50.9 | 24.1 | 89.40664 | 0.3051586 | 1.8890773 | 2.2862679 | 48.96703 | 5.889179 |
| 88 | 36 | 190.5 | 459.4 | 42579 | 13088 | 22.3 | 0.00000 | (40362.7, 42724.4] | 49.3 | 48.4 | 49.8 | Washington | 55.9 | 34.7 | 39.4 | NA | 1.4 | 32.2 | 12.0 | 39.5 | 10.3 | 55.5 | 37.8 | 29.9 | 48.1 | 26.6 | 91.78748 | 0.1850709 | 0.2082048 | 0.6169031 | 53.44700 | 5.587583 |
| 4025 | 1380 | 177.8 | 510.9 | 60397 | 843954 | 13.1 | 427.74843 | (54545.6, 61494.5] | 35.8 | 34.7 | 37.0 | Washington | 50.0 | 15.6 | 36.3 | NA | 7.1 | 28.8 | 16.2 | 56.6 | 9.2 | 69.9 | NA | 44.4 | 31.4 | 16.5 | 74.72967 | 6.7108542 | 6.0414720 | 2.6991844 | 50.06357 | 5.533430 |

Good. Now I think we have to check which variables have more than 5% missing values.

miss<-apply(data,2,function(x){
        round((sum(is.na(x))/length(x))*100,2)
})

miss
##             avganncount        avgdeathsperyear        target_deathrate 
##                    0.00                    0.00                    0.00 
##           incidencerate               medincome              popest2015 
##                    0.00                    0.00                    0.00 
##          povertypercent             studypercap               binnedinc 
##                    0.00                    0.00                    0.00 
##               medianage           medianagemale         medianagefemale 
##                    0.00                    0.00                    0.00 
##               geography          percentmarried            pctnohs18_24 
##                    0.00                    0.00                    0.00 
##              pcths18_24         pctsomecol18_24         pctbachdeg18_24 
##                    0.00                   74.99                    0.00 
##            pcths25_over       pctbachdeg25_over      pctemployed16_over 
##                    0.00                    0.00                    4.99 
##    pctunemployed16_over      pctprivatecoverage pctprivatecoveragealone 
##                    0.00                    0.00                   19.99 
##      pctempprivcoverage       pctpubliccoverage  pctpubliccoveragealone 
##                    0.00                    0.00                    0.00 
##                pctwhite                pctblack                pctasian 
##                    0.00                    0.00                    0.00 
##            pctotherrace    pctmarriedhouseholds               birthrate 
##                    0.00                    0.00                    0.00
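The same per-column percentages can be computed without `apply()`, which first coerces the whole data frame to a character matrix. A minimal sketch on a made-up data frame (`toy` and the 30% threshold are illustrative assumptions, not part of the analysis):

```r
# Toy data frame standing in for the real data (values are made up).
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c("x", "y", NA, NA),
                  c = 1:4)

# is.na() returns a logical matrix; its column means are the share of NAs.
miss_pct <- round(colMeans(is.na(toy)) * 100, 2)
print(miss_pct)

# Names of the columns above a chosen threshold (here 30%).
too_sparse <- names(miss_pct)[miss_pct > 30]
print(too_sparse)
```

Selecting columns by name rather than by position makes the later `select()` step easier to read and less sensitive to column order.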

Here we have 3 variables with missing values. In my opinion, the most problematic one is pctsomecol18_24, with about 75% missing, so I will exclude it.

data<-data%>%dplyr::select(-which(miss>70))

OK, let's impute the rest using the KNN method.

data<-cbind(knnImputation(data = data[,-c(9,13)]),binnedinc=data$binnedinc,
            geography=data$geography)%>%data.frame()

The next step: does my data have outliers? I think the Mahalanobis distance will answer this question.

First, exclude the categorical variables.

num<-data%>%dplyr::select(-geography,-binnedinc)

Get the Mahalanobis distance of each case.

mah<-mahalanobis(x = num,center = colMeans(num),cov = cov(num))

Now calculate the cutoff point from the chi-squared distribution, with degrees of freedom equal to the number of variables.

cutoff<-qchisq(p = .99,df = ncol(num))

Let's open the surprise box :P

summary(mah>cutoff)
##    Mode   FALSE    TRUE 
## logical    2691     356
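To make the logic of this check concrete, here is a minimal sketch on simulated data with one planted extreme row. The data, the seed and the planted point are assumptions made for illustration; only `mahalanobis()` and `qchisq()` mirror the calls used above.

```r
set.seed(1)
# Toy data: 200 rows of 3 roughly normal variables, plus one planted extreme row.
x <- matrix(rnorm(600), ncol = 3)
x <- rbind(x, c(8, -8, 8))

# Squared Mahalanobis distance of every row from the column means.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under multivariate normality d2 is approximately chi-squared with
# df = number of variables, so a high quantile serves as the cutoff.
cutoff <- qchisq(p = 0.99, df = ncol(x))

flagged <- which(d2 > cutoff)
print(flagged)
```

With a 0.99 quantile, about 1% of well-behaved rows are expected to be flagged by chance, so the count of flagged rows is best read against that baseline.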

Now we have 356 cases that are considered multivariate outliers.

Let's save them for later use.

outidx<-as.numeric(mah>cutoff)

Testing additivity (multicollinearity) is crucial, so let's look for correlations above .9.

corr<-cor(num)%>%matrix(nrow = ncol(num),ncol = ncol(num))

# flag off-diagonal correlations whose absolute value is at least .9
addit<-apply(corr,2,function(x){
        ifelse(abs(x)>=.9&x<1,paste(round(x,2),"additive",sep = " ")," ")
})

colnames(addit)<-rownames(addit)<-names(num)
# index names(num), because addit was built from num rather than from data
corvar<-names(num)[apply(addit,2,function(x){
        str_detect(x,"additive")
})%>%apply(MARGIN = 2,any)%>%which()]

corvar
## [1] "avganncount"             "avgdeathsperyear"       
## [3] "popest2015"              "medianagemale"          
## [5] "medianagefemale"         "pctprivatecoverage"     
## [7] "pctprivatecoveragealone" "pctempprivcoverage"
addit[corvar,corvar]%>%kable("markdown")
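| | avganncount | avgdeathsperyear | popest2015 | medianagemale | medianagefemale | pctprivatecoverage | pctprivatecoveragealone | pctempprivcoverage |
|---|---|---|---|---|---|---|---|---|
| avganncount | | 0.94 additive | 0.93 additive | | | | | |
| avgdeathsperyear | 0.94 additive | | 0.98 additive | | | | | |
| popest2015 | 0.93 additive | 0.98 additive | | | | | | |
| medianagemale | | | | | 0.93 additive | | | |
| medianagefemale | | | | 0.93 additive | | | | |
| pctprivatecoverage | | | | | | | 0.93 additive | |
| pctprivatecoveragealone | | | | | | 0.93 additive | | 0.92 additive |
| pctempprivcoverage | | | | | | | 0.92 additive | |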
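A more compact way to list highly correlated pairs is to index the correlation matrix directly. This is a minimal sketch using the built-in `mtcars` data as a stand-in; the 0.85 threshold is chosen only so that the toy data returns a few pairs.

```r
# mtcars ships with R and is used here only as stand-in data.
cm <- cor(mtcars)

# Keep each pair once (upper triangle) and test the absolute value,
# so strong negative correlations are reported as well.
hits <- which(abs(cm) >= 0.85 & upper.tri(cm), arr.ind = TRUE)

pairs <- data.frame(var1 = rownames(cm)[hits[, "row"]],
                    var2 = colnames(cm)[hits[, "col"]],
                    r    = round(cm[hits], 2))
print(pairs)
```

Each pair appears once instead of twice, and the sign of the correlation is kept in the `r` column.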

Here the number of reported cancer cases (avganncount), the average number of reported deaths (avgdeathsperyear) and the population size (popest2015) are highly correlated. Population size and average reported deaths make little sense as predictors of the target death rate, so we will exclude them.

data<-data%>%dplyr::select(-popest2015,-avgdeathsperyear)

For the median ages of males and females, I will combine them into their mean, (x1+x2)/2.

data<-data%>%mutate(medianagemf=(medianagemale+medianagefemale)/2)%>%dplyr::select(-medianagefemale,-medianagemale)

For private coverage, private coverage alone and employer-provided private coverage, I will combine them into one linear function, .3*(x1+x2+x3), which is close to their mean.

data<-data%>%mutate(medcov=.3*(pctprivatecoverage+pctprivatecoveragealone+pctempprivcoverage))%>%dplyr::select(-pctprivatecoverage,-pctprivatecoveragealone,-pctempprivcoverage)

Run the correlation check again.

num<-data%>%dplyr::select(-geography,-binnedinc)

corr2<-cor(num)

# flag off-diagonal correlations whose absolute value is at least .9
addit2<-apply(corr2,2,function(x){
        ifelse(abs(x)>=.9&x<1,paste(round(x,2),"additive",sep = " ")," ")
})

colnames(addit2)<-rownames(addit2)<-names(num)
# index names(num): its column order now differs from that of data
corvar2<-names(num)[apply(addit2,2,function(x){
        str_detect(x,"additive")
})%>%apply(MARGIN = 2,any)%>%which()]

corvar2
## character(0)

Great, let's dive further into our analysis and fit a linear regression with all variables.

fitall<-lm(target_deathrate~.,data=data)

summary(fitall)
## 
## Call:
## lm(formula = target_deathrate ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -111.540   -9.519   -0.299    9.934  126.239 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.595e+02  1.759e+01   9.068  < 2e-16 ***
## avganncount                 -1.826e-04  3.040e-04  -0.601 0.548117    
## incidencerate                1.843e-01  7.449e-03  24.745  < 2e-16 ***
## medincome                    7.827e-05  1.113e-04   0.703 0.482111    
## povertypercent              -1.400e-01  1.785e-01  -0.784 0.432859    
## studypercap                  4.214e-04  6.444e-04   0.654 0.513265    
## medianage                    1.418e-03  7.383e-03   0.192 0.847700    
## percentmarried               7.419e-01  1.690e-01   4.388 1.18e-05 ***
## pctnohs18_24                -2.640e-02  5.461e-02  -0.483 0.628851    
## pcths18_24                   1.700e-01  4.850e-02   3.505 0.000463 ***
## pctbachdeg18_24             -2.272e-01  1.051e-01  -2.162 0.030666 *  
## pcths25_over                 1.949e-01  1.077e-01   1.811 0.070309 .  
## pctbachdeg25_over           -1.052e+00  1.578e-01  -6.666 3.13e-11 ***
## pctemployed16_over          -3.599e-01  1.144e-01  -3.147 0.001667 ** 
## pctunemployed16_over         3.332e-01  1.739e-01   1.916 0.055417 .  
## pctpubliccoverage           -1.566e-01  2.237e-01  -0.700 0.483933    
## pctpubliccoveragealone       7.008e-01  2.275e-01   3.080 0.002089 ** 
## pctwhite                    -1.939e-01  6.606e-02  -2.935 0.003362 ** 
## pctblack                    -1.771e-01  7.016e-02  -2.524 0.011655 *  
## pctasian                    -6.877e-02  2.176e-01  -0.316 0.751941    
## pctotherrace                -6.446e-01  1.298e-01  -4.965 7.27e-07 ***
## pctmarriedhouseholds        -1.050e+00  1.537e-01  -6.834 1.00e-11 ***
## birthrate                   -5.975e-01  1.853e-01  -3.224 0.001278 ** 
## binnedinc(37413.8, 40362.7] -1.933e+00  1.543e+00  -1.253 0.210454    
## binnedinc(40362.7, 42724.4] -3.627e+00  1.661e+00  -2.184 0.029040 *  
## binnedinc(42724.4, 45201]   -2.628e+00  1.819e+00  -1.445 0.148612    
## binnedinc(45201, 48021.6]   -4.350e+00  1.993e+00  -2.183 0.029128 *  
## binnedinc(48021.6, 51046.4] -5.388e+00  2.222e+00  -2.425 0.015384 *  
## binnedinc(51046.4, 54545.6] -5.557e+00  2.426e+00  -2.291 0.022044 *  
## binnedinc(54545.6, 61494.5] -4.792e+00  2.781e+00  -1.723 0.084907 .  
## binnedinc(61494.5, 125635]  -4.245e+00  3.923e+00  -1.082 0.279228    
## binnedinc[22640, 34218.1]    3.727e+00  1.664e+00   2.239 0.025211 *  
## geographyAlaska              1.380e+01  6.086e+00   2.267 0.023478 *  
## geographyAnne'Maryland       1.825e+01  1.835e+01   0.994 0.320143    
## geographyArizona            -2.111e+01  5.538e+00  -3.813 0.000140 ***
## geographyArkansas            1.068e+01  3.271e+00   3.265 0.001106 ** 
## geographyCalifornia         -1.328e+01  3.920e+00  -3.388 0.000714 ***
## geographyCarolina           -4.146e+00  2.798e+00  -1.482 0.138435    
## geographyColorado           -1.747e+01  3.655e+00  -4.780 1.84e-06 ***
## geographyColumbia           -3.861e-01  1.849e+01  -0.021 0.983342    
## geographyConnecticut        -1.709e+01  7.083e+00  -2.412 0.015906 *  
## geographyDakota             -6.656e+00  3.339e+00  -1.993 0.046333 *  
## geographyDelaware           -6.440e+00  1.081e+01  -0.596 0.551486    
## geographyDo<U+0623>±Mexico  -1.526e+01  1.838e+01  -0.830 0.406499    
## geographyFlorida             1.581e+00  3.451e+00   0.458 0.646892    
## geographyGeorge'Maryland     7.864e+00  1.839e+01   0.428 0.669025    
## geographyGeorgia            -5.993e+00  2.821e+00  -2.125 0.033694 *  
## geographyHampshire          -3.150e+00  6.399e+00  -0.492 0.622511    
## geographyHawaii             -3.261e+01  1.161e+01  -2.808 0.005016 ** 
## geographyIdaho              -1.728e+01  3.853e+00  -4.485 7.56e-06 ***
## geographyIllinois           -2.637e+00  3.193e+00  -0.826 0.408927    
## geographyIndiana             6.930e+00  3.242e+00   2.138 0.032623 *  
## geographyIowa               -1.003e+01  3.320e+00  -3.023 0.002528 ** 
## geographyIsland             -4.879e+00  8.644e+00  -0.564 0.572495    
## geographyJersey             -5.336e+00  4.961e+00  -1.076 0.282143    
## geographyKansas             -3.642e+00  3.237e+00  -1.125 0.260620    
## geographyKentucky            1.054e+01  3.078e+00   3.425 0.000623 ***
## geographyLouisiana          -2.597e-01  3.466e+00  -0.075 0.940270    
## geographyMaine               1.684e+00  5.347e+00   0.315 0.752850    
## geographyMaryland           -4.183e-01  4.820e+00  -0.087 0.930845    
## geographyMassachusetts      -8.482e+00  5.824e+00  -1.456 0.145390    
## geographyMatanuska-Alaska   -1.336e+00  1.836e+01  -0.073 0.942007    
## geographyMexico             -1.620e+01  4.511e+00  -3.592 0.000334 ***
## geographyMiami-Florida      -2.826e+01  1.864e+01  -1.516 0.129719    
## geographyMichigan           -1.375e+00  3.288e+00  -0.418 0.675838    
## geographyMinnesota          -1.152e+01  3.439e+00  -3.350 0.000818 ***
## geographyMississippi         5.178e+00  3.121e+00   1.659 0.097133 .  
## geographyMissouri            8.107e+00  3.084e+00   2.629 0.008611 ** 
## geographyMontana            -1.220e+01  3.839e+00  -3.178 0.001497 ** 
## geographyNebraska           -6.338e+00  3.400e+00  -1.864 0.062452 .  
## geographyNevada             -4.349e+00  5.241e+00  -0.830 0.406767    
## geographyO'Iowa              2.940e+00  1.835e+01   0.160 0.872707    
## geographyOhio                3.695e+00  3.283e+00   1.125 0.260487    
## geographyOklahoma            1.114e+01  3.432e+00   3.245 0.001188 ** 
## geographyOregon             -8.800e+00  4.025e+00  -2.186 0.028884 *  
## geographyPennsylvania       -9.400e+00  3.558e+00  -2.642 0.008294 ** 
## geographySt.Alabama          3.216e+00  1.828e+01   0.176 0.860313    
## geographySt.Arkansas         1.018e+01  1.828e+01   0.557 0.577577    
## geographySt.Florida         -9.275e+00  1.307e+01  -0.710 0.477830    
## geographySt.Illinois         3.925e+00  1.827e+01   0.215 0.829941    
## geographySt.Indiana          2.219e+00  1.828e+01   0.121 0.903360    
## geographySt.Louisiana        7.004e+00  6.545e+00   1.070 0.284639    
## geographySt.Mary'Maryland    1.694e+01  1.835e+01   0.923 0.356217    
## geographySt.Michigan         2.632e-01  1.306e+01   0.020 0.983925    
## geographySt.Minnesota        1.218e+01  1.833e+01   0.665 0.506392    
## geographySt.Missouri         5.396e+00  8.461e+00   0.638 0.523725    
## geographySt.Wisconsin        2.814e+01  1.835e+01   1.533 0.125360    
## geographySt.York            -1.157e+01  1.830e+01  -0.632 0.527146    
## geographySte.Missouri        2.103e+01  1.834e+01   1.146 0.251708    
## geographyTennessee           6.524e+00  3.115e+00   2.094 0.036305 *  
## geographyTexas               2.702e+00  2.918e+00   0.926 0.354528    
## geographyUtah               -2.194e+01  4.441e+00  -4.940 8.25e-07 ***
## geographyValdez-Alaska       2.426e+01  1.844e+01   1.315 0.188446    
## geographyVermont            -1.301e+00  5.830e+00  -0.223 0.823396    
## geographyVirginia            4.526e+00  2.796e+00   1.619 0.105647    
## geographyWashington         -7.238e+00  4.025e+00  -1.798 0.072235 .  
## geographyWisconsin          -4.402e+00  3.470e+00  -1.268 0.204730    
## geographyWyoming            -4.566e+00  4.687e+00  -0.974 0.330105    
## geographyYork               -1.387e+01  3.639e+00  -3.811 0.000141 ***
## geographyYukon-Alaska       -1.833e+01  1.879e+01  -0.975 0.329504    
## medianagemf                 -4.697e-01  1.573e-01  -2.986 0.002847 ** 
## medcov                       2.000e-01  1.395e-01   1.433 0.151861    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.06 on 2945 degrees of freedom
## Multiple R-squared:  0.5905, Adjusted R-squared:  0.5764 
## F-statistic: 42.04 on 101 and 2945 DF,  p-value: < 2.2e-16

Now we can run regression diagnostics.

First, we can look at the leverage points.

lev<-hatvalues(model = fitall)

Next, calculate the cutoff point, (2*k+2)/N, where k is the number of predictors and N is the number of cases.

k<-ncol(data)-1
hatcut<-((2*k)+2)/nrow(data)
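One thing to keep in mind with this rule of thumb: when the model contains factors, the number of estimated coefficients is larger than the number of data columns (the summary of `fitall` above reports 101 model degrees of freedom). The sketch below, on the built-in `mtcars` data, compares the two ways of counting k. The model and the variable names are stand-ins chosen for illustration, not the author's setup.

```r
# Stand-in model: a factor predictor expands into several dummy columns.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

n <- nrow(mtcars)
k_cols  <- ncol(mtcars[, c("wt", "cyl")])   # 2 data columns
k_model <- length(coef(fit)) - 1            # 3 estimated slopes (wt + 2 dummies)

# Same rule of thumb, (2k + 2) / N, under the two ways of counting k.
cut_cols  <- (2 * k_cols  + 2) / n
cut_model <- (2 * k_model + 2) / n

lev <- hatvalues(fit)
print(c(by_columns = sum(lev > cut_cols), by_model = sum(lev > cut_model)))
```

Counting coefficients gives a higher cutoff, so it can only flag the same number of points or fewer than counting columns does.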

Let's see how many exceed the cutoff.

(lev>hatcut)%>%summary()
##    Mode   FALSE    TRUE 
## logical    1096    1951

We have 1951 points that exceed the cutoff, which is well over half of the data.

Let's store the result in a variable.

levout<-as.numeric(lev>hatcut)

Now we will check Cook's D, which combines leverage and discrepancy (residual size) into a single measure of influence.

cook<-cooks.distance(fitall)

Set the cutoff point for Cook's distance, 4/(N-k-1).

cookcut<-(4/(nrow(data)-(ncol(data)-1)-1))
(cook>cookcut)%>%summary()
##    Mode   FALSE    TRUE    NA's 
## logical    2889     143      15
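A plausible source of the NA values, offered here as an assumption rather than something verified on this data, is a factor level that occurs in a single row: that row is fitted perfectly, its hat value is 1, and Cook's distance divides by (1 - hat). A minimal sketch on made-up data:

```r
# Toy data: level "c" of the factor occurs exactly once.
toy <- data.frame(y = c(1.0, 2.1, 2.9, 4.2, 10.0),
                  g = factor(c("a", "a", "b", "b", "c")))

fit <- lm(y ~ g, data = toy)

# The single "c" row is fitted perfectly, so its hat value is 1 ...
h <- hatvalues(fit)
# ... and Cook's distance, which divides by (1 - h), is not a finite number.
cd <- cooks.distance(fit)

print(round(h, 3))
print(cd)
```

If this is what happens here, the affected rows would be the counties whose cleaned geography label is unique.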

Here we have 143 cases that are considered influential, plus 15 cases for which Cook's distance could not be computed and came back as NA.

cookout<-as.numeric(cook>cookcut)

Now we need to find the cases that are flagged by more than one check (outlier, leverage, influence).

totalout<-outidx+levout+cookout

table(totalout)
## totalout
##    0    1    2    3 
## 1086 1538  330   78

I think that removing the points flagged by at least two checks is enough. Cases with an NA count are removed as well.

idx<-totalout>=2

idx<-sapply(idx,function(x){
        ifelse(is.na(x),TRUE,x)
})

table(idx)
## idx
## FALSE  TRUE 
##  2624   423
# idx is logical, so negate it with ! to drop the flagged rows
data<-data[!idx,]
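When rows are dropped with a logical flag vector, the choice of operator matters. A small sketch on toy data (the names `toy` and `drop_me` are made up for illustration) showing how `-` and `!` behave:

```r
toy <- data.frame(id = 1:5)
drop_me <- c(FALSE, TRUE, FALSE, TRUE, FALSE)

# -drop_me coerces the logicals to 0 / -1, so only row 1 is removed.
wrong <- toy[-drop_me, , drop = FALSE]

# !drop_me keeps exactly the rows that are not flagged.
right <- toy[!drop_me, , drop = FALSE]

print(wrong$id)
print(right$id)
```

Checking `nrow()` before and after the removal is a cheap way to confirm that the expected number of rows was dropped.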

Here I think we have to look at the variance inflation factor (VIF) of the variables.

viftest<-vif(fitall)

viftest<-data.frame(viftest)

summary(viftest$GVIF>10)
##    Mode   FALSE    TRUE 
## logical      17       9
summary(viftest$GVIF..1..2.Df..>3.5)
##    Mode   FALSE    TRUE 
## logical      21       5
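For intuition about what these numbers measure, the plain VIF of a numeric predictor is 1/(1 - R^2) from regressing that predictor on all the others. The sketch below computes it by hand on `mtcars` as stand-in data; it covers numeric predictors only and does not reproduce the generalized VIF that `vif()` reports when factor terms are present.

```r
# Stand-in predictors from mtcars (no factors, so plain VIF applies).
X <- mtcars[, c("disp", "hp", "wt", "drat")]

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: the diagonal of the inverse correlation matrix.
vif_inverse <- diag(solve(cor(X)))

print(round(vif_manual, 2))
```

A VIF of 10 corresponds to R^2 = 0.9, meaning 90% of that predictor's variance is explained by the other predictors.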

I think we face a serious multicollinearity problem here, but I will postpone any action on it until the end of our analysis.

Now, let's test normality, linearity and homoscedasticity!!

stdresid<-rstudent(fitall)
stdfit<-fitted(fitall)%>%scale

plot(fitall,2)

plot(stdfit,stdresid)
abline(v = 0,h = 0)

The residual plots carry good news regarding normality and homoscedasticity. Congrats!!

Here I think I am ready to fit my models and conduct the real analysis, but we must not forget that we have a serious collinearity problem. So I will fit several models and compare them by RMSE and R^2:

1. a model containing all variables
2. a step-wise model addressing the multicollinear variables
3. a step-wise model based on AIC
4. a regularized model (elastic net)
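The two comparison metrics can be written out in a few lines. This is a minimal sketch with hypothetical helper names (`rmse`, `rsq`) and made-up numbers; R^2 is taken here as the squared correlation between observed and predicted values, which is one common definition and an assumption about what is reported later.

```r
# Hypothetical helpers for comparing models on held-out data.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rsq  <- function(obs, pred) cor(obs, pred)^2   # squared correlation

# Made-up observed and predicted values.
obs  <- c(10, 12, 15, 20)
pred <- c(11, 12, 14, 19)

print(rmse(obs, pred))
print(rsq(obs, pred))
```

Lower RMSE and higher R^2 on the test set indicate the better model; RMSE is in the units of the outcome (deaths per 100,000).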

First, let's convert the categorical variables to dummy variables.

dummy<-dummyVars(target_deathrate~.,data=data)

data<-cbind(target_deathrate=data$target_deathrate,predict(dummy,data))%>%data.frame()
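Creating one dummy column per factor level, together with an intercept, makes the design matrix rank-deficient, which is why the model summary below reports coefficients "not defined because of singularities". The sketch below illustrates the effect with base R's `model.matrix()` on a toy factor. As far as I know `dummyVars()` has a `fullRank = TRUE` argument that drops one level per factor, but that is not used or tested here.

```r
toy <- data.frame(g = factor(c("a", "b", "c", "a", "b", "c")))

# One column per level plus an intercept: the dummy columns sum to the
# intercept column, so the design matrix is rank-deficient.
one_hot <- cbind(intercept = 1, model.matrix(~ g - 1, data = toy))

# Treatment coding drops the first level and keeps full column rank.
full_rank <- model.matrix(~ g, data = toy)

print(c(cols = ncol(one_hot),   rank = qr(one_hot)$rank))
print(c(cols = ncol(full_rank), rank = qr(full_rank)$rank))
```

For plain `lm()` the redundant columns only produce NA coefficients, but dropping them up front keeps the coefficient table easier to read.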

Then partition our data into training (70%) and test (30%) sets.

set.seed(234)
trainidx<-createDataPartition(data$target_deathrate,p = .7,list = F)
traindata<-data[trainidx,]
testdata<-data[-trainidx,]

Now let's run all the models in turn, using 10-fold cross-validation on the training data.

fitcontrol<-trainControl(method = "cv",number = 10)
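To show what 10-fold cross-validation does under the hood, here is a hand-rolled sketch in base R. It uses `mtcars` and a small made-up model as stand-ins, and it is not a reproduction of caret's internals.

```r
set.seed(234)
k <- 10
n <- nrow(mtcars)                       # mtcars is stand-in data only

# Randomly assign each row to one of k folds of near-equal size.
fold <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- sapply(seq_len(k), function(i) {
  # Fit on the other k - 1 folds, then predict the held-out fold.
  fit  <- lm(mpg ~ wt + hp, data = mtcars[fold != i, ])
  pred <- predict(fit, newdata = mtcars[fold == i, ])
  sqrt(mean((mtcars$mpg[fold == i] - pred)^2))
})

# The cross-validated estimate is the average over the held-out folds.
cv_rmse <- mean(fold_rmse)
print(cv_rmse)
```

Every row is used for validation exactly once, so the averaged RMSE is a less optimistic estimate than the training error.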

The all-variables model:

allfit<-train(target_deathrate ~.,data = traindata,method="lm",trControl=fitcontrol)

summary(allfit)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -106.085   -9.712   -0.048   10.078  129.226 
## 
## Coefficients: (7 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   1.430e+02  2.729e+01   5.239 1.78e-07 ***
## avganncount                  -3.536e-04  4.402e-04  -0.803 0.421936    
## incidencerate                 1.812e-01  8.925e-03  20.301  < 2e-16 ***
## medincome                     1.608e-04  1.487e-04   1.081 0.279711    
## povertypercent               -1.230e-01  2.179e-01  -0.564 0.572498    
## studypercap                   1.005e-03  8.785e-04   1.144 0.252624    
## medianage                     9.067e-03  9.151e-03   0.991 0.321892    
## percentmarried                7.734e-01  2.066e-01   3.743 0.000187 ***
## pctnohs18_24                  1.095e-02  6.674e-02   0.164 0.869723    
## pcths18_24                    2.071e-01  5.939e-02   3.487 0.000498 ***
## pctbachdeg18_24              -1.677e-01  1.289e-01  -1.302 0.193158    
## pcths25_over                  2.010e-01  1.315e-01   1.528 0.126583    
## pctbachdeg25_over            -1.177e+00  1.943e-01  -6.057 1.65e-09 ***
## pctemployed16_over           -2.887e-01  1.395e-01  -2.069 0.038632 *  
## pctunemployed16_over          1.803e-01  2.155e-01   0.837 0.402943    
## pctpubliccoverage            -1.399e-01  2.770e-01  -0.505 0.613709    
## pctpubliccoveragealone        6.977e-01  2.785e-01   2.505 0.012313 *  
## pctwhite                     -1.903e-01  8.064e-02  -2.360 0.018372 *  
## pctblack                     -1.612e-01  8.608e-02  -1.873 0.061200 .  
## pctasian                      8.381e-02  2.642e-01   0.317 0.751084    
## pctotherrace                 -7.525e-01  1.689e-01  -4.455 8.86e-06 ***
## pctmarriedhouseholds         -1.187e+00  1.878e-01  -6.319 3.22e-10 ***
## birthrate                    -7.171e-01  2.293e-01  -3.128 0.001786 ** 
## binnedinc..34218.1..37413.8. -2.816e+00  2.074e+00  -1.358 0.174677    
## binnedinc..37413.8..40362.7. -5.794e+00  2.341e+00  -2.475 0.013388 *  
## binnedinc..40362.7..42724.4. -7.516e+00  2.590e+00  -2.902 0.003750 ** 
## binnedinc..42724.4..45201.   -7.803e+00  2.846e+00  -2.742 0.006155 ** 
## binnedinc..45201..48021.6.   -9.677e+00  3.147e+00  -3.076 0.002129 ** 
## binnedinc..48021.6..51046.4. -9.778e+00  3.485e+00  -2.806 0.005064 ** 
## binnedinc..51046.4..54545.6. -9.860e+00  3.751e+00  -2.628 0.008649 ** 
## binnedinc..54545.6..61494.5. -9.719e+00  4.247e+00  -2.288 0.022214 *  
## binnedinc..61494.5..125635.  -9.825e+00  5.738e+00  -1.712 0.087010 .  
## binnedinc..22640..34218.1.           NA         NA      NA       NA    
## geography.Alabama             1.954e+01  1.955e+01   1.000 0.317635    
## geography.Alaska              3.258e+01  1.962e+01   1.660 0.097013 .  
## geography.Anne.Maryland              NA         NA      NA       NA    
## geography.Arizona            -2.570e+00  2.007e+01  -0.128 0.898129    
## geography.Arkansas            3.008e+01  1.954e+01   1.540 0.123777    
## geography.California          7.453e+00  1.962e+01   0.380 0.704067    
## geography.Carolina            1.494e+01  1.939e+01   0.771 0.440975    
## geography.Colorado            2.293e+00  1.952e+01   0.117 0.906482    
## geography.Columbia            1.580e+01  2.706e+01   0.584 0.559432    
## geography.Connecticut         2.784e+00  2.110e+01   0.132 0.895057    
## geography.Dakota              1.372e+01  1.929e+01   0.711 0.476961    
## geography.Delaware            1.262e+01  2.216e+01   0.569 0.569308    
## geographyDo.U.0623..Mexico    6.097e+00  2.696e+01   0.226 0.821116    
## geography.Florida             2.290e+01  1.952e+01   1.173 0.240825    
## geography.George.Maryland     2.574e+01  2.689e+01   0.957 0.338519    
## geography.Georgia             1.212e+01  1.938e+01   0.625 0.531793    
## geography.Hampshire           1.617e+01  2.039e+01   0.793 0.427918    
## geography.Hawaii             -1.059e+01  2.415e+01  -0.438 0.661104    
## geography.Idaho               6.236e+00  1.956e+01   0.319 0.749954    
## geography.Illinois            1.758e+01  1.949e+01   0.902 0.367343    
## geography.Indiana             2.660e+01  1.943e+01   1.369 0.171163    
## geography.Iowa                8.353e+00  1.949e+01   0.428 0.668351    
## geography.Island              1.453e+01  2.108e+01   0.689 0.490931    
## geography.Jersey              1.562e+01  1.987e+01   0.786 0.432035    
## geography.Kansas              1.824e+01  1.942e+01   0.939 0.347811    
## geography.Kentucky            3.289e+01  1.948e+01   1.688 0.091505 .  
## geography.Louisiana           1.700e+01  1.955e+01   0.869 0.384807    
## geography.Maine               1.947e+01  2.023e+01   0.962 0.335973    
## geography.Maryland            1.752e+01  2.008e+01   0.873 0.383012    
## geography.Massachusetts       9.667e+00  2.025e+01   0.477 0.633131    
## geography.Matanuska.Alaska    1.821e+01  2.663e+01   0.684 0.494190    
## geography.Mexico              7.029e+00  1.973e+01   0.356 0.721645    
## geography.Miami.Florida      -5.787e+00  2.715e+01  -0.213 0.831257    
## geography.Michigan            1.937e+01  1.947e+01   0.995 0.319983    
## geography.Minnesota           7.862e+00  1.954e+01   0.402 0.687416    
## geography.Mississippi         2.201e+01  1.954e+01   1.126 0.260149    
## geography.Missouri            2.853e+01  1.939e+01   1.471 0.141491    
## geography.Montana             5.439e+00  1.939e+01   0.281 0.779113    
## geography.Nebraska            1.258e+01  1.944e+01   0.647 0.517744    
## geography.Nevada              1.921e+01  1.980e+01   0.970 0.332015    
## geography.O.Iowa              2.252e+01  2.686e+01   0.839 0.401765    
## geography.Ohio                2.472e+01  1.947e+01   1.269 0.204419    
## geography.Oklahoma            2.859e+01  1.921e+01   1.488 0.136871    
## geography.Oregon              1.343e+01  1.963e+01   0.684 0.493844    
## geography.Pennsylvania        1.139e+01  1.951e+01   0.584 0.559221    
## geography.St.Alabama          2.227e+01  2.677e+01   0.832 0.405586    
## geography.St.Arkansas                NA         NA      NA       NA    
## geography.St.Florida          7.929e+00  2.677e+01   0.296 0.767163    
## geography.St.Illinois         2.310e+01  2.682e+01   0.861 0.389094    
## geography.St.Indiana                 NA         NA      NA       NA    
## geography.St.Louisiana        2.751e+01  2.078e+01   1.324 0.185791    
## geography.St.Mary.Maryland           NA         NA      NA       NA    
## geography.St.Michigan         2.011e+01  2.335e+01   0.861 0.389305    
## geography.St.Minnesota               NA         NA      NA       NA    
## geography.St.Missouri         2.634e+01  2.137e+01   1.233 0.217853    
## geography.St.Wisconsin        4.664e+01  2.684e+01   1.738 0.082431 .  
## geography.St.York             9.830e+00  2.679e+01   0.367 0.713727    
## geography.Ste.Missouri        3.995e+01  2.683e+01   1.489 0.136648    
## geography.Tennessee           2.534e+01  1.942e+01   1.305 0.191987    
## geography.Texas               2.180e+01  1.934e+01   1.127 0.259792    
## geography.Utah               -1.934e-01  1.977e+01  -0.010 0.992195    
## geography.Valdez.Alaska       4.423e+01  2.647e+01   1.671 0.094916 .  
## geography.Vermont             1.628e+01  2.032e+01   0.801 0.423092    
## geography.Virginia            2.273e+01  1.936e+01   1.174 0.240515    
## geography.Washington          1.220e+01  1.965e+01   0.621 0.534733    
## geography.Wisconsin           1.580e+01  1.948e+01   0.811 0.417536    
## geography.Wyoming             1.855e+01  1.977e+01   0.938 0.348258    
## geography.York                5.149e+00  1.959e+01   0.263 0.792662    
## geography.Yukon.Alaska               NA         NA      NA       NA    
## medianagemf                  -4.849e-01  1.890e-01  -2.565 0.010377 *  
## medcov                        2.076e-01  1.704e-01   1.218 0.223330    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.45 on 2037 degrees of freedom
## Multiple R-squared:  0.5858, Adjusted R-squared:  0.5663 
## F-statistic: 30.01 on 96 and 2037 DF,  p-value: < 2.2e-16

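OK, the state factor (geography) misbehaves: its levels are rarely significant.

The summary also reports 7 coefficients as not defined because of singularities. Every coefficient can be estimated only if the rank of our data matrix is at least equal to the number of parameters we want to fit. One way to violate this is to have collinear covariates, which exist in our data.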
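
A small simulated example shows how an exactly collinear covariate produces an undefined coefficient. The data frame `toy` and its columns are made up for illustration and are unrelated to the cancer data.

```r
# Simulated example: x3 is an exact linear combination of x1 and x2.
set.seed(234)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$x3 <- toy$x1 + toy$x2
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = toy)
coef(fit)            # the x3 coefficient is NA

X <- model.matrix(fit)
ncol(X)              # 4 parameters requested (intercept + 3 slopes)
qr(X)$rank           # but the design matrix only has rank 3
```

Comparing `ncol(X)` with `qr(X)$rank` is a quick check for this problem before reading the coefficient table.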

Now let's try stepwise regression.

stepfit<-train(target_deathrate ~.,data = traindata,method="lmStepAIC",trControl=fitcontrol)
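
To illustrate the idea behind stepwise selection by AIC, the sketch below runs base R's `step()` on simulated data. It is a stand-in under stated assumptions: caret's "lmStepAIC" method relies on `MASS::stepAIC`, while `step()` from the stats package is used here only to keep the example dependency-free, and the data frame `toy` is invented.

```r
# Backward stepwise selection by AIC on simulated data, using stats::step().
set.seed(234)
n <- 200
toy <- data.frame(matrix(rnorm(n * 5), ncol = 5))
names(toy) <- paste0("x", 1:5)
toy$y <- 1 + 2 * toy$x1 - 3 * toy$x2 + rnorm(n)   # only x1 and x2 matter

full_fit <- lm(y ~ ., data = toy)
step_fit <- step(full_fit, direction = "backward", trace = 0)

formula(step_fit)    # terms that survived the selection
c(full = extractAIC(full_fit)[2], step = extractAIC(step_fit)[2])
```

The search drops a term only when doing so lowers the AIC, so the selected model never has a higher AIC than the full one.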

Let's look at the summary and the coefficients.

summary(stepfit)
## 
## Call:
## lm(formula = .outcome ~ incidencerate + percentmarried + pcths18_24 + 
##     pcths25_over + pctbachdeg25_over + pctemployed16_over + pctpubliccoveragealone + 
##     pctwhite + pctblack + pctotherrace + pctmarriedhouseholds + 
##     birthrate + binnedinc..37413.8..40362.7. + binnedinc..40362.7..42724.4. + 
##     binnedinc..42724.4..45201. + binnedinc..45201..48021.6. + 
##     binnedinc..48021.6..51046.4. + binnedinc..51046.4..54545.6. + 
##     binnedinc..54545.6..61494.5. + geography.Alabama + geography.Alaska + 
##     geography.Arkansas + geography.Carolina + geography.Dakota + 
##     geography.Florida + geography.Georgia + geography.Hampshire + 
##     geography.Illinois + geography.Indiana + geography.Jersey + 
##     geography.Kansas + geography.Kentucky + geography.Louisiana + 
##     geography.Maine + geography.Maryland + geography.Michigan + 
##     geography.Mississippi + geography.Missouri + geography.Nebraska + 
##     geography.Nevada + geography.Ohio + geography.Oklahoma + 
##     geography.Oregon + geography.Pennsylvania + geography.St.Louisiana + 
##     geography.St.Missouri + geography.St.Wisconsin + geography.Ste.Missouri + 
##     geography.Tennessee + geography.Texas + geography.Valdez.Alaska + 
##     geography.Vermont + geography.Virginia + geography.Washington + 
##     geography.Wisconsin + geography.Wyoming + medianagemf + medcov, 
##     data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107.36   -9.90   -0.32   10.21  127.41 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  143.23047   13.66199  10.484  < 2e-16 ***
## incidencerate                  0.18560    0.00843  22.017  < 2e-16 ***
## percentmarried                 0.74211    0.19298   3.846 0.000124 ***
## pcths18_24                     0.21462    0.05389   3.983 7.04e-05 ***
## pcths25_over                   0.19695    0.12299   1.601 0.109448    
## pctbachdeg25_over             -1.26662    0.17688  -7.161 1.11e-12 ***
## pctemployed16_over            -0.25453    0.10904  -2.334 0.019672 *  
## pctpubliccoveragealone         0.61463    0.17141   3.586 0.000344 ***
## pctwhite                      -0.18170    0.06631  -2.740 0.006198 ** 
## pctblack                      -0.13647    0.07520  -1.815 0.069705 .  
## pctotherrace                  -0.74841    0.15452  -4.844 1.37e-06 ***
## pctmarriedhouseholds          -1.14913    0.16695  -6.883 7.72e-12 ***
## birthrate                     -0.70078    0.22564  -3.106 0.001924 ** 
## binnedinc..37413.8..40362.7.  -3.25235    1.52572  -2.132 0.033151 *  
## binnedinc..40362.7..42724.4.  -4.08568    1.53370  -2.664 0.007783 ** 
## binnedinc..42724.4..45201.    -3.98153    1.51825  -2.622 0.008794 ** 
## binnedinc..45201..48021.6.    -5.16374    1.58126  -3.266 0.001110 ** 
## binnedinc..48021.6..51046.4.  -4.42297    1.62502  -2.722 0.006547 ** 
## binnedinc..51046.4..54545.6.  -4.18749    1.57863  -2.653 0.008048 ** 
## binnedinc..54545.6..61494.5.  -3.05262    1.59153  -1.918 0.055242 .  
## geography.Alabama             13.24125    3.36364   3.937 8.54e-05 ***
## geography.Alaska              27.94431    5.89467   4.741 2.28e-06 ***
## geography.Arkansas            24.06341    3.01349   7.985 2.30e-15 ***
## geography.Carolina             9.21724    2.37017   3.889 0.000104 ***
## geography.Dakota               7.02700    2.49441   2.817 0.004892 ** 
## geography.Florida             16.81214    3.25335   5.168 2.60e-07 ***
## geography.Georgia              6.38499    2.42380   2.634 0.008494 ** 
## geography.Hampshire            9.44290    6.64378   1.421 0.155375    
## geography.Illinois            11.28871    2.44081   4.625 3.98e-06 ***
## geography.Indiana             20.66953    2.59944   7.952 3.00e-15 ***
## geography.Jersey               9.57581    4.71283   2.032 0.042295 *  
## geography.Kansas              11.19632    2.46390   4.544 5.83e-06 ***
## geography.Kentucky            26.63049    2.50302  10.639  < 2e-16 ***
## geography.Louisiana           10.61774    3.43501   3.091 0.002021 ** 
## geography.Maine               12.88223    5.97827   2.155 0.031289 *  
## geography.Maryland            11.30764    5.26228   2.149 0.031765 *  
## geography.Michigan            13.68354    2.71221   5.045 4.93e-07 ***
## geography.Mississippi         15.74400    3.12473   5.039 5.10e-07 ***
## geography.Missouri            22.61036    2.50518   9.025  < 2e-16 ***
## geography.Nebraska             6.14029    2.73332   2.246 0.024779 *  
## geography.Nevada              13.79806    5.13014   2.690 0.007211 ** 
## geography.Ohio                18.71122    2.64973   7.062 2.24e-12 ***
## geography.Oklahoma            22.56084    2.86365   7.878 5.31e-15 ***
## geography.Oregon               7.81109    3.87969   2.013 0.044209 *  
## geography.Pennsylvania         4.62095    2.96702   1.557 0.119519    
## geography.St.Louisiana        21.22583    7.81188   2.717 0.006640 ** 
## geography.St.Missouri         20.49557    9.30461   2.203 0.027723 *  
## geography.St.Wisconsin        41.38447   18.51145   2.236 0.025483 *  
## geography.Ste.Missouri        35.26877   18.52788   1.904 0.057107 .  
## geography.Tennessee           19.06761    2.49509   7.642 3.24e-14 ***
## geography.Texas               16.03179    2.01293   7.964 2.71e-15 ***
## geography.Valdez.Alaska       40.64590   18.56773   2.189 0.028703 *  
## geography.Vermont              9.17605    5.82503   1.575 0.115344    
## geography.Virginia            16.60272    2.09694   7.918 3.91e-15 ***
## geography.Washington           6.41982    3.78538   1.696 0.090045 .  
## geography.Wisconsin            9.82176    2.97262   3.304 0.000969 ***
## geography.Wyoming             12.83993    4.66086   2.755 0.005923 ** 
## medianagemf                   -0.49429    0.13133  -3.764 0.000172 ***
## medcov                         0.22289    0.13775   1.618 0.105797    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.39 on 2075 degrees of freedom
## Multiple R-squared:  0.5808, Adjusted R-squared:  0.569 
## F-statistic: 49.56 on 58 and 2075 DF,  p-value: < 2.2e-16

The final model is a penalized model.

Penalized models come in three types:

1- lasso model 2- ridge model 3- elastic net model

Here I will use the elastic net model, since we have a lot of collinear variables: some coefficients need to be set exactly to zero (as the lasso does), while others only need to be shrunk towards zero (as ridge does).
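
The difference between the three penalties can be sketched with a few lines of R. The function below evaluates the elastic net penalty in the form documented for glmnet, where alpha = 0 gives ridge and alpha = 1 gives the lasso. The helper `enet_penalty` and the toy coefficients are hypothetical and exist only for illustration.

```r
# Elastic net penalty in the glmnet parameterisation:
#   lambda * ( (1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)) )
# `enet_penalty` is a hypothetical helper written for illustration.
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -2, 0)                        # toy coefficients (intercept excluded)

enet_penalty(beta, lambda = 1, alpha = 0)    # pure ridge: 2.125
enet_penalty(beta, lambda = 1, alpha = 1)    # pure lasso: 2.5
enet_penalty(beta, lambda = 1, alpha = 0.1)  # mostly ridge, a little lasso
```

The absolute-value part is what allows coefficients to become exactly zero, while the squared part shrinks groups of correlated coefficients together.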

Let's run the elastic net model.

elasticfit<-train(target_deathrate~.,data = traindata,trControl=fitcontrol,tuneLength=10,method="glmnet")

elasticfit$bestTune # best alpha and lambda values 
##   alpha    lambda
## 7   0.1 0.9404475
coef(elasticfit$finalModel,elasticfit$bestTune$lambda)
## 104 x 1 sparse Matrix of class "dgCMatrix"
##                                          1
## (Intercept)                   1.345691e+02
## avganncount                  -1.895949e-04
## incidencerate                 1.754294e-01
## medincome                     .           
## povertypercent                .           
## studypercap                   7.542099e-04
## medianage                     5.913347e-03
## percentmarried                1.885861e-01
## pctnohs18_24                  .           
## pcths18_24                    2.164261e-01
## pctbachdeg18_24              -1.353659e-01
## pcths25_over                  2.457405e-01
## pctbachdeg25_over            -1.000302e+00
## pctemployed16_over           -7.355091e-02
## pctunemployed16_over          2.498321e-01
## pctpubliccoverage             .           
## pctpubliccoveragealone        4.244808e-01
## pctwhite                     -4.779651e-02
## pctblack                     -5.773841e-03
## pctasian                      9.954008e-02
## pctotherrace                 -6.145021e-01
## pctmarriedhouseholds         -6.034101e-01
## birthrate                    -5.503060e-01
## binnedinc..34218.1..37413.8.  3.433645e+00
## binnedinc..37413.8..40362.7.  8.489800e-01
## binnedinc..40362.7..42724.4.  .           
## binnedinc..42724.4..45201.    .           
## binnedinc..45201..48021.6.   -9.928822e-01
## binnedinc..48021.6..51046.4. -1.149099e+00
## binnedinc..51046.4..54545.6. -1.139128e+00
## binnedinc..54545.6..61494.5. -1.229785e-01
## binnedinc..61494.5..125635.   8.941259e-01
## binnedinc..22640..34218.1.    5.146713e+00
## geography.Alabama             1.154804e+00
## geography.Alaska              1.733174e+01
## geography.Anne.Maryland       .           
## geography.Arizona            -1.688474e+01
## geography.Arkansas            1.151172e+01
## geography.California         -8.631527e+00
## geography.Carolina           -3.052875e+00
## geography.Colorado           -1.508951e+01
## geography.Columbia            .           
## geography.Connecticut        -1.193743e+01
## geography.Dakota             -2.477776e+00
## geography.Delaware           -2.736964e+00
## geographyDo.U.0623..Mexico   -9.535875e+00
## geography.Florida             2.311799e+00
## geography.George.Maryland     3.215843e+00
## geography.Georgia            -5.919504e+00
## geography.Hampshire          -7.721718e-01
## geography.Hawaii             -1.751243e+01
## geography.Idaho              -1.135165e+01
## geography.Illinois            .           
## geography.Indiana             8.376133e+00
## geography.Iowa               -8.373090e+00
## geography.Island             -1.147855e+00
## geography.Jersey             -5.950566e-01
## geography.Kansas              1.049727e-01
## geography.Kentucky            1.570057e+01
## geography.Louisiana          -7.274319e-01
## geography.Maine               .           
## geography.Maryland            .           
## geography.Massachusetts      -6.086410e+00
## geography.Matanuska.Alaska    .           
## geography.Mexico             -9.957967e+00
## geography.Miami.Florida      -2.500847e+01
## geography.Michigan            1.091320e+00
## geography.Minnesota          -8.718165e+00
## geography.Mississippi         3.173603e+00
## geography.Missouri            9.609113e+00
## geography.Montana            -1.135981e+01
## geography.Nebraska           -4.245383e+00
## geography.Nevada              1.507387e+00
## geography.O.Iowa              1.532439e+00
## geography.Ohio                6.598234e+00
## geography.Oklahoma            1.183841e+01
## geography.Oregon             -2.747470e+00
## geography.Pennsylvania       -6.169547e+00
## geography.St.Alabama          6.006366e-01
## geography.St.Arkansas         .           
## geography.St.Florida         -7.477797e+00
## geography.St.Illinois         1.382744e+00
## geography.St.Indiana          .           
## geography.St.Louisiana        7.405574e+00
## geography.St.Mary.Maryland    .           
## geography.St.Michigan         4.364745e-02
## geography.St.Minnesota        .           
## geography.St.Missouri         6.730196e+00
## geography.St.Wisconsin        2.434074e+01
## geography.St.York            -3.102247e+00
## geography.Ste.Missouri        1.793544e+01
## geography.Tennessee           6.971448e+00
## geography.Texas               2.159052e+00
## geography.Utah               -1.746181e+01
## geography.Valdez.Alaska       2.170245e+01
## geography.Vermont            -1.148873e+00
## geography.Virginia            4.644171e+00
## geography.Washington         -3.895516e+00
## geography.Wisconsin          -6.244107e-01
## geography.Wyoming             .           
## geography.York               -1.102549e+01
## geography.Yukon.Alaska       -4.383698e+00
## medianagemf                  -3.506919e-01
## medcov                        .

Finally, we predict on the test data to extract the RMSE and R2 of each model.

allfitpred<-predict(allfit,testdata)
allfitmeasures<-data.frame(RMSE=RMSE(pred = allfitpred,obs = testdata$target_deathrate),R2=R2(pred = allfitpred,obs = testdata$target_deathrate))

stepfitpred<-predict(stepfit,testdata)
stepfitmeasures<-data.frame(RMSE=RMSE(pred = stepfitpred,obs = testdata$target_deathrate),R2=R2(pred = stepfitpred,obs = testdata$target_deathrate))

elasticfitpred<-predict(elasticfit,testdata)
elasticfitmeasures<-data.frame(RMSE=RMSE(pred = elasticfitpred,obs = testdata$target_deathrate),R2=R2(pred = elasticfitpred,obs = testdata$target_deathrate))
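
For reference, here is what the two metrics compute, written out in base R on made-up numbers. The vectors `obs` and `pred` are invented, and the R2 line assumes caret's default, which is the squared correlation between observed and predicted values.

```r
# What the two test-set metrics compute, on made-up numbers.
obs  <- c(160, 175, 190, 150, 200)
pred <- c(165, 170, 185, 160, 195)

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
r2   <- cor(obs, pred)^2            # squared correlation between obs and pred

c(RMSE = rmse, R2 = r2)
```

A lower RMSE means smaller typical prediction errors (in deaths per 100,000), while an R2 closer to 1 means the predictions track the observed values more closely.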

Putting all the measures together:

allmeasures<-rbind(allfitmeasures,stepfitmeasures,elasticfitmeasures)

rownames(allmeasures)<-c("Model with all variables","stepwise model","elastic penalized model")

allmeasures%>%kable("markdown")
|                         |     RMSE|        R2|
|:------------------------|--------:|---------:|
|Model with all variables | 17.58857| 0.5793748|
|stepwise model           | 17.61035| 0.5782697|
|elastic penalized model  | 17.43842| 0.5866667|

In the end, the penalized model is the most accurate of the three on the test data, although the margin is small (RMSE 17.44 versus 17.59 and 17.61). In this model the coefficients of some variables were shrunk towards zero and others were set exactly to zero (a combination of lasso and ridge regression). The best alpha value is 0.1 and the best lambda value is about 0.94, as reported by bestTune above.

This was an effort to achieve the best prediction of the death rate using more than 30 variables.

Regards