It is quite straight forward to identify the response variable in a given dataset but it is often harder to find out the key predictor or predictors that are highly associated with the response among a set of variables. This vignette is to give you some ideas on how to identify the main predictor or predictors when building linear regression model.
I am going to use a raw data file named “BodyFat.xls” available at http://www2.stetson.edu/~jrasp/data.htm for demonstration. This file contains data on percent bodyfat, Density and various measurements of body size taken on 252 male participants. Our objective is to find a good fit regression on Density. Note: 3 variables will be dropped, including IDNO, BODYFAT and ADIPOSITY because they are either not useful or are descriptive variables computing from other variables in raw dataset.
library(readxl) # This function is from install.packages("tidyverse").
setwd("/Users/tubui/Desktop/Data source for STDS") # After download and save the raw data, copy and paste the link to our rawdata here.
rawfile <- read_excel("BodyFat.xls")
rawfile$IDNO <- rawfile$BODYFAT <- rawfile$ADIPOSITY <- NULL # Drop 3 unuseful variables.
summary(rawfile)
## DENSITY AGE WEIGHT HEIGHT
## Min. :0.995 Min. :22.00 Min. :118.5 Min. :29.50
## 1st Qu.:1.041 1st Qu.:35.75 1st Qu.:159.0 1st Qu.:68.25
## Median :1.055 Median :43.00 Median :176.5 Median :70.00
## Mean :1.056 Mean :44.88 Mean :178.9 Mean :70.15
## 3rd Qu.:1.070 3rd Qu.:54.00 3rd Qu.:197.0 3rd Qu.:72.25
## Max. :1.109 Max. :81.00 Max. :363.1 Max. :77.75
## NECK CHEST ABDOMEN HIP
## Min. :31.10 Min. : 79.30 Min. : 69.40 Min. : 85.0
## 1st Qu.:36.40 1st Qu.: 94.35 1st Qu.: 84.58 1st Qu.: 95.5
## Median :38.00 Median : 99.65 Median : 90.95 Median : 99.3
## Mean :37.99 Mean :100.82 Mean : 92.56 Mean : 99.9
## 3rd Qu.:39.42 3rd Qu.:105.38 3rd Qu.: 99.33 3rd Qu.:103.5
## Max. :51.20 Max. :136.20 Max. :148.10 Max. :147.7
## THIGH KNEE ANKLE BICEPS
## Min. :47.20 Min. :33.00 Min. :19.1 Min. :24.80
## 1st Qu.:56.00 1st Qu.:36.98 1st Qu.:22.0 1st Qu.:30.20
## Median :59.00 Median :38.50 Median :22.8 Median :32.05
## Mean :59.41 Mean :38.59 Mean :23.1 Mean :32.27
## 3rd Qu.:62.35 3rd Qu.:39.92 3rd Qu.:24.0 3rd Qu.:34.33
## Max. :87.30 Max. :49.10 Max. :33.9 Max. :45.00
## FOREARM WRIST
## Min. :21.00 Min. :15.80
## 1st Qu.:27.30 1st Qu.:17.60
## Median :28.70 Median :18.30
## Mean :28.66 Mean :18.23
## 3rd Qu.:30.00 3rd Qu.:18.80
## Max. :34.90 Max. :21.40
As shown, our raw data contains all numeric values and is already clean (i.e No NA’s found), hence, no cleaning up is required. We can go straight to the next step which is using Boruta function to find out what predictors are more important than the others.
There is a number of algorithms can be used to assess the importance of explanatory variables, such as Random Forest Method, Relative Importance and Step-wise Regression. I choose Boruta as this one tells us explicitly which variables are important, tentative and unimportant in the result. Given the words limit, I will not be explaining how Boruta works in here but you can refer to the 2nd link in Reference reading for more details.
library(Boruta) # This function is from install.packages("Boruta")
## Loading required package: ranger
boruta.fat <- Boruta(DENSITY~., data = rawfile, doTrace = 2)
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## 11. run of importance source...
## After 11 iterations, +1.2 secs:
## confirmed 9 attributes: ABDOMEN, AGE, BICEPS, CHEST, HEIGHT and 4 more;
## still have 4 attributes left.
## 12. run of importance source...
## 13. run of importance source...
## 14. run of importance source...
## 15. run of importance source...
## After 15 iterations, +1.5 secs:
## confirmed 1 attribute: NECK;
## still have 3 attributes left.
## 16. run of importance source...
## 17. run of importance source...
## 18. run of importance source...
## 19. run of importance source...
## 20. run of importance source...
## 21. run of importance source...
## After 21 iterations, +1.9 secs:
## confirmed 1 attribute: WRIST;
## still have 2 attributes left.
## 22. run of importance source...
## 23. run of importance source...
## 24. run of importance source...
## 25. run of importance source...
## 26. run of importance source...
## 27. run of importance source...
## 28. run of importance source...
## 29. run of importance source...
## 30. run of importance source...
## 31. run of importance source...
## 32. run of importance source...
## 33. run of importance source...
## 34. run of importance source...
## 35. run of importance source...
## 36. run of importance source...
## 37. run of importance source...
## 38. run of importance source...
## 39. run of importance source...
## 40. run of importance source...
## 41. run of importance source...
## 42. run of importance source...
## 43. run of importance source...
## 44. run of importance source...
## 45. run of importance source...
## 46. run of importance source...
## 47. run of importance source...
## 48. run of importance source...
## 49. run of importance source...
## 50. run of importance source...
## 51. run of importance source...
## 52. run of importance source...
## 53. run of importance source...
## 54. run of importance source...
## 55. run of importance source...
## 56. run of importance source...
## 57. run of importance source...
## 58. run of importance source...
## 59. run of importance source...
## 60. run of importance source...
## 61. run of importance source...
## 62. run of importance source...
## 63. run of importance source...
## 64. run of importance source...
## 65. run of importance source...
## 66. run of importance source...
## 67. run of importance source...
## 68. run of importance source...
## 69. run of importance source...
## 70. run of importance source...
## 71. run of importance source...
## 72. run of importance source...
## 73. run of importance source...
## 74. run of importance source...
## 75. run of importance source...
## 76. run of importance source...
## 77. run of importance source...
## 78. run of importance source...
## 79. run of importance source...
## 80. run of importance source...
## 81. run of importance source...
## 82. run of importance source...
## 83. run of importance source...
## 84. run of importance source...
## 85. run of importance source...
## 86. run of importance source...
## 87. run of importance source...
## 88. run of importance source...
## 89. run of importance source...
## 90. run of importance source...
## 91. run of importance source...
## 92. run of importance source...
## 93. run of importance source...
## 94. run of importance source...
## 95. run of importance source...
## 96. run of importance source...
## 97. run of importance source...
## 98. run of importance source...
## 99. run of importance source...
print(boruta.fat)
## Boruta performed 99 iterations in 7.684312 secs.
## 11 attributes confirmed important: ABDOMEN, AGE, BICEPS, CHEST,
## HEIGHT and 6 more;
## No attributes deemed unimportant.
## 2 tentative attributes left: ANKLE, FOREARM;
boruta_signif <- names(boruta.fat$finalDecision[boruta.fat$finalDecision %in% c("Confirmed", "Tentative")])
print(boruta_signif)
## [1] "AGE" "WEIGHT" "HEIGHT" "NECK" "CHEST" "ABDOMEN" "HIP"
## [8] "THIGH" "KNEE" "ANKLE" "BICEPS" "FOREARM" "WRIST"
Well, 11 out of 13 variables are confirmed important. Lets plot the graph to see how important these predictors are.
plot(boruta.fat, cex.axis=.7, las=2, xlab="", main="Variable Importance")
The colours are interpreted as followed: Red - Confirmed unimportant (not shown here because Boruta did not find any variables unimportant in our dataset), Yellow - Tentative and Green - Confirmed important. As shown, ABDOMEN size is highly important to our response variable but this does not mean this predictor will be kept in our model because Boruta does not take into account Colinearity.
Now, let’s check VIF for our linear regression with these 11 predictors.
model_1 <- lm(DENSITY ~.-FOREARM -ANKLE, data=rawfile)
library(car) # This function is from install.packages("car").
vif(model_1)
## AGE WEIGHT HEIGHT NECK CHEST ABDOMEN HIP
## 2.167651 32.380603 1.667529 4.219014 9.321749 11.635693 14.625285
## THIGH KNEE BICEPS WRIST
## 7.776325 4.448404 3.297249 3.082781
The result shows that we have issue with collinearity because there are so many predictors having VIF > 10 and the suggested solution is to drop the one with highest VIF then perform lm again until all remaining predictors are below 5.
model_2 <- lm(DENSITY ~.-FOREARM -ANKLE -WEIGHT, data=rawfile)
vif(model_2)
## AGE HEIGHT NECK CHEST ABDOMEN HIP THIGH
## 2.047027 1.328557 3.883547 7.718243 11.219029 10.317159 7.752294
## KNEE BICEPS WRIST
## 4.057212 3.186432 2.959279
model_3 <- lm(DENSITY ~.-FOREARM -ANKLE -WEIGHT -ABDOMEN, data=rawfile)
vif(model_3)
## AGE HEIGHT NECK CHEST HIP THIGH KNEE BICEPS
## 1.620845 1.327967 3.852821 4.796836 8.324935 7.574063 4.051698 3.150443
## WRIST
## 2.886206
model_4 <- lm(DENSITY ~.-FOREARM -ANKLE -WEIGHT -ABDOMEN -HIP, data=rawfile)
vif(model_4)
## AGE HEIGHT NECK CHEST THIGH KNEE BICEPS WRIST
## 1.618411 1.327576 3.852804 3.869455 5.264019 3.813440 3.136138 2.876557
model_5 <- lm(DENSITY ~.-FOREARM -ANKLE -WEIGHT -ABDOMEN -HIP -THIGH, data=rawfile)
vif(model_5)
## AGE HEIGHT NECK CHEST KNEE BICEPS WRIST
## 1.253569 1.239098 3.732681 3.662574 2.658872 2.848060 2.865505
summary(model_5)
##
## Call:
## lm(formula = DENSITY ~ . - FOREARM - ANKLE - WEIGHT - ABDOMEN -
## HIP - THIGH, data = rawfile)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.030848 -0.008921 0.000109 0.008888 0.051665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.137e+00 1.982e-02 57.367 < 2e-16 ***
## AGE -3.128e-04 7.177e-05 -4.359 1.93e-05 ***
## HEIGHT 6.693e-04 2.455e-04 2.726 0.00687 **
## NECK 2.885e-04 6.420e-04 0.449 0.65351
## CHEST -1.478e-03 1.834e-04 -8.061 3.38e-14 ***
## KNEE -1.460e-03 5.461e-04 -2.673 0.00803 **
## BICEPS -8.037e-04 4.512e-04 -1.781 0.07614 .
## WRIST 5.819e-03 1.465e-03 3.973 9.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0128 on 244 degrees of freedom
## Multiple R-squared: 0.5604, Adjusted R-squared: 0.5478
## F-statistic: 44.44 on 7 and 244 DF, p-value: < 2.2e-16
plot(model_5)
The model is now fine with colinearity and is able to explain 55% of the variance. By checking with the 4 graphs, our assumptions of linear relationship and normal distribution are satisfied. However, on the last graph, it seems to be a problem with an influential point that requires futher investigation.