Introduction

It is quite straight forward to identify the response variable in a given dataset but it is often harder to find out the key predictor or predictors that are highly associated with the response among a set of variables. This vignette is to give you some ideas on how to identify the main predictor or predictors when building linear regression model.

1. Importing raw dataset

I am going to use a raw data file named “BodyFat.xls” available at http://www2.stetson.edu/~jrasp/data.htm for demonstration. This file contains data on percent bodyfat, Density and various measurements of body size taken on 252 male participants. Our objective is to find a good fit regression on Density. Note: 3 variables will be dropped, including IDNO, BODYFAT and ADIPOSITY because they are either not useful or are descriptive variables computing from other variables in raw dataset.

library(readxl)     # This function is from install.packages("tidyverse").
setwd("/Users/tubui/Desktop/Data source for STDS") # After download and save the raw data, copy and paste the link to our rawdata here.
rawfile <- read_excel("BodyFat.xls")
rawfile$IDNO <- rawfile$BODYFAT <- rawfile$ADIPOSITY <- NULL # Drop 3 unuseful variables.
summary(rawfile)
##     DENSITY           AGE            WEIGHT          HEIGHT     
##  Min.   :0.995   Min.   :22.00   Min.   :118.5   Min.   :29.50  
##  1st Qu.:1.041   1st Qu.:35.75   1st Qu.:159.0   1st Qu.:68.25  
##  Median :1.055   Median :43.00   Median :176.5   Median :70.00  
##  Mean   :1.056   Mean   :44.88   Mean   :178.9   Mean   :70.15  
##  3rd Qu.:1.070   3rd Qu.:54.00   3rd Qu.:197.0   3rd Qu.:72.25  
##  Max.   :1.109   Max.   :81.00   Max.   :363.1   Max.   :77.75  
##       NECK           CHEST           ABDOMEN            HIP       
##  Min.   :31.10   Min.   : 79.30   Min.   : 69.40   Min.   : 85.0  
##  1st Qu.:36.40   1st Qu.: 94.35   1st Qu.: 84.58   1st Qu.: 95.5  
##  Median :38.00   Median : 99.65   Median : 90.95   Median : 99.3  
##  Mean   :37.99   Mean   :100.82   Mean   : 92.56   Mean   : 99.9  
##  3rd Qu.:39.42   3rd Qu.:105.38   3rd Qu.: 99.33   3rd Qu.:103.5  
##  Max.   :51.20   Max.   :136.20   Max.   :148.10   Max.   :147.7  
##      THIGH            KNEE           ANKLE          BICEPS     
##  Min.   :47.20   Min.   :33.00   Min.   :19.1   Min.   :24.80  
##  1st Qu.:56.00   1st Qu.:36.98   1st Qu.:22.0   1st Qu.:30.20  
##  Median :59.00   Median :38.50   Median :22.8   Median :32.05  
##  Mean   :59.41   Mean   :38.59   Mean   :23.1   Mean   :32.27  
##  3rd Qu.:62.35   3rd Qu.:39.92   3rd Qu.:24.0   3rd Qu.:34.33  
##  Max.   :87.30   Max.   :49.10   Max.   :33.9   Max.   :45.00  
##     FOREARM          WRIST      
##  Min.   :21.00   Min.   :15.80  
##  1st Qu.:27.30   1st Qu.:17.60  
##  Median :28.70   Median :18.30  
##  Mean   :28.66   Mean   :18.23  
##  3rd Qu.:30.00   3rd Qu.:18.80  
##  Max.   :34.90   Max.   :21.40

As shown, our raw data contains all numeric values and is already clean (i.e No NA’s found), hence, no cleaning up is required. We can go straight to the next step which is using Boruta function to find out what predictors are more important than the others.

2. Identify important predictors by using Boruta algorithm

There is a number of algorithms can be used to assess the importance of explanatory variables, such as Random Forest Method, Relative Importance and Step-wise Regression. I choose Boruta as this one tells us explicitly which variables are important, tentative and unimportant in the result. Given the words limit, I will not be explaining how Boruta works in here but you can refer to the 2nd link in Reference reading for more details.

library(Boruta) # This function is from install.packages("Boruta")
## Loading required package: ranger
boruta.fat <- Boruta(DENSITY~., data = rawfile, doTrace = 2)
##  1. run of importance source...
##  2. run of importance source...
##  3. run of importance source...
##  4. run of importance source...
##  5. run of importance source...
##  6. run of importance source...
##  7. run of importance source...
##  8. run of importance source...
##  9. run of importance source...
##  10. run of importance source...
##  11. run of importance source...
## After 11 iterations, +1.2 secs:
##  confirmed 9 attributes: ABDOMEN, AGE, BICEPS, CHEST, HEIGHT and 4 more;
##  still have 4 attributes left.
##  12. run of importance source...
##  13. run of importance source...
##  14. run of importance source...
##  15. run of importance source...
## After 15 iterations, +1.5 secs:
##  confirmed 1 attribute: NECK;
##  still have 3 attributes left.
##  16. run of importance source...
##  17. run of importance source...
##  18. run of importance source...
##  19. run of importance source...
##  20. run of importance source...
##  21. run of importance source...
## After 21 iterations, +1.9 secs:
##  confirmed 1 attribute: WRIST;
##  still have 2 attributes left.
##  22. run of importance source...
##  23. run of importance source...
##  24. run of importance source...
##  25. run of importance source...
##  26. run of importance source...
##  27. run of importance source...
##  28. run of importance source...
##  29. run of importance source...
##  30. run of importance source...
##  31. run of importance source...
##  32. run of importance source...
##  33. run of importance source...
##  34. run of importance source...
##  35. run of importance source...
##  36. run of importance source...
##  37. run of importance source...
##  38. run of importance source...
##  39. run of importance source...
##  40. run of importance source...
##  41. run of importance source...
##  42. run of importance source...
##  43. run of importance source...
##  44. run of importance source...
##  45. run of importance source...
##  46. run of importance source...
##  47. run of importance source...
##  48. run of importance source...
##  49. run of importance source...
##  50. run of importance source...
##  51. run of importance source...
##  52. run of importance source...
##  53. run of importance source...
##  54. run of importance source...
##  55. run of importance source...
##  56. run of importance source...
##  57. run of importance source...
##  58. run of importance source...
##  59. run of importance source...
##  60. run of importance source...
##  61. run of importance source...
##  62. run of importance source...
##  63. run of importance source...
##  64. run of importance source...
##  65. run of importance source...
##  66. run of importance source...
##  67. run of importance source...
##  68. run of importance source...
##  69. run of importance source...
##  70. run of importance source...
##  71. run of importance source...
##  72. run of importance source...
##  73. run of importance source...
##  74. run of importance source...
##  75. run of importance source...
##  76. run of importance source...
##  77. run of importance source...
##  78. run of importance source...
##  79. run of importance source...
##  80. run of importance source...
##  81. run of importance source...
##  82. run of importance source...
##  83. run of importance source...
##  84. run of importance source...
##  85. run of importance source...
##  86. run of importance source...
##  87. run of importance source...
##  88. run of importance source...
##  89. run of importance source...
##  90. run of importance source...
##  91. run of importance source...
##  92. run of importance source...
##  93. run of importance source...
##  94. run of importance source...
##  95. run of importance source...
##  96. run of importance source...
##  97. run of importance source...
##  98. run of importance source...
##  99. run of importance source...
print(boruta.fat)
## Boruta performed 99 iterations in 7.684312 secs.
##  11 attributes confirmed important: ABDOMEN, AGE, BICEPS, CHEST,
## HEIGHT and 6 more;
##  No attributes deemed unimportant.
##  2 tentative attributes left: ANKLE, FOREARM;
boruta_signif <- names(boruta.fat$finalDecision[boruta.fat$finalDecision %in% c("Confirmed", "Tentative")]) 
print(boruta_signif)
##  [1] "AGE"     "WEIGHT"  "HEIGHT"  "NECK"    "CHEST"   "ABDOMEN" "HIP"    
##  [8] "THIGH"   "KNEE"    "ANKLE"   "BICEPS"  "FOREARM" "WRIST"

Well, 11 out of 13 variables are confirmed important. Lets plot the graph to see how important these predictors are.

plot(boruta.fat, cex.axis=.7, las=2, xlab="", main="Variable Importance")

The colours are interpreted as followed: Red - Confirmed unimportant (not shown here because Boruta did not find any variables unimportant in our dataset), Yellow - Tentative and Green - Confirmed important. As shown, ABDOMEN size is highly important to our response variable but this does not mean this predictor will be kept in our model because Boruta does not take into account Colinearity.

3. Checking for Multicolinearity

Now, let’s check VIF for our linear regression with these 11 predictors.

model_1 <- lm(DENSITY ~.-FOREARM -ANKLE, data=rawfile)
library(car) # This function is from install.packages("car").
vif(model_1)
##       AGE    WEIGHT    HEIGHT      NECK     CHEST   ABDOMEN       HIP 
##  2.167651 32.380603  1.667529  4.219014  9.321749 11.635693 14.625285 
##     THIGH      KNEE    BICEPS     WRIST 
##  7.776325  4.448404  3.297249  3.082781

The result shows that we have issue with collinearity because there are so many predictors having VIF > 10 and the suggested solution is to drop the one with highest VIF then perform lm again until all remaining predictors are below 5.

model_2 <- lm(DENSITY ~.-FOREARM -ANKLE -WEIGHT, data=rawfile)
vif(model_2)
##       AGE    HEIGHT      NECK     CHEST   ABDOMEN       HIP     THIGH 
##  2.047027  1.328557  3.883547  7.718243 11.219029 10.317159  7.752294 
##      KNEE    BICEPS     WRIST 
##  4.057212  3.186432  2.959279
model_3 <- lm(DENSITY ~.-FOREARM -ANKLE -WEIGHT -ABDOMEN, data=rawfile)
vif(model_3)
##      AGE   HEIGHT     NECK    CHEST      HIP    THIGH     KNEE   BICEPS 
## 1.620845 1.327967 3.852821 4.796836 8.324935 7.574063 4.051698 3.150443 
##    WRIST 
## 2.886206
model_4 <- lm(DENSITY ~.-FOREARM -ANKLE -WEIGHT -ABDOMEN -HIP, data=rawfile)
vif(model_4)
##      AGE   HEIGHT     NECK    CHEST    THIGH     KNEE   BICEPS    WRIST 
## 1.618411 1.327576 3.852804 3.869455 5.264019 3.813440 3.136138 2.876557
model_5 <- lm(DENSITY ~.-FOREARM -ANKLE -WEIGHT -ABDOMEN -HIP -THIGH, data=rawfile)
vif(model_5)
##      AGE   HEIGHT     NECK    CHEST     KNEE   BICEPS    WRIST 
## 1.253569 1.239098 3.732681 3.662574 2.658872 2.848060 2.865505
summary(model_5)
## 
## Call:
## lm(formula = DENSITY ~ . - FOREARM - ANKLE - WEIGHT - ABDOMEN - 
##     HIP - THIGH, data = rawfile)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.030848 -0.008921  0.000109  0.008888  0.051665 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.137e+00  1.982e-02  57.367  < 2e-16 ***
## AGE         -3.128e-04  7.177e-05  -4.359 1.93e-05 ***
## HEIGHT       6.693e-04  2.455e-04   2.726  0.00687 ** 
## NECK         2.885e-04  6.420e-04   0.449  0.65351    
## CHEST       -1.478e-03  1.834e-04  -8.061 3.38e-14 ***
## KNEE        -1.460e-03  5.461e-04  -2.673  0.00803 ** 
## BICEPS      -8.037e-04  4.512e-04  -1.781  0.07614 .  
## WRIST        5.819e-03  1.465e-03   3.973 9.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0128 on 244 degrees of freedom
## Multiple R-squared:  0.5604, Adjusted R-squared:  0.5478 
## F-statistic: 44.44 on 7 and 244 DF,  p-value: < 2.2e-16
plot(model_5)

The model is now fine with colinearity and is able to explain 55% of the variance. By checking with the 4 graphs, our assumptions of linear relationship and normal distribution are satisfied. However, on the last graph, it seems to be a problem with an influential point that requires futher investigation.

4. Reference Readings

http://r-statistics.co/Variable-Selection-and-Importance-With-R.html

https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/

http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis