This dataset lists estimates of the percentage of body fat, determined by underwater weighing, together with various body circumference measurements for 252 men. The data were generously supplied by Dr. A. Garth Fisher, who gave permission to distribute them freely and to use them for non-commercial purposes. They were used to produce the predictive equations for lean body weight given in the abstract "Generalized body composition prediction equation for men using simple measurement techniques", K.W. Penrose, A.G. Nelson, A.G. Fisher, FACSM, Human Performance Research Center, Brigham Young University, Provo, Utah 84602, as listed in Medicine and Science in Sports and Exercise, vol. 17, no. 2, April 1985, p. 189 (http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_BMI_Regression#References).
Some experts tout BMI (body mass index) as the most accurate and simple way to determine the effect of weight on your health. In fact, most recent medical research uses BMI as an indicator of someone’s health status and disease risk. There is some debate about where on the BMI scale the thresholds for ‘underweight’, ‘overweight’ and ‘obese’ should be set. However, the following criteria are commonly used (http://healthiack.com/body-fat-percentage-calculator):
BMI < 18.5 : underweight,
18.5 ≤ BMI < 25 : optimal weight,
25 ≤ BMI < 30 : overweight,
BMI ≥ 30 : obese.
Meanwhile, in September 2000, the American Journal of Clinical Nutrition published a study showing that body-fat percentage may be a better measure of your risk of weight-related diseases than BMI (http://www.webmd.com/diet/body-fat-measurement). The percentage of body fat for an individual can be estimated by Siri’s formula (1956) once body density has been determined. The American Council on Exercise provides the following ranges for men’s body-fat percentage:
essential fat : 2-5%
athletes : 6-13%
fitness : 14-17%
Normal : 18-24%
obese : more than 24%
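As a rough illustration of how Siri's (1956) formula turns body density into a body-fat percentage, the sketch below applies the commonly quoted form of the equation, 495 / D - 450. The function name `siri_fat` and the example densities are hypothetical and are not part of the original analysis.

```r
# Siri (1956): percent body fat from body density D (g/cm^3).
# Commonly quoted form of the equation; siri_fat is a hypothetical helper name.
siri_fat <- function(density) {
  495 / density - 450
}

# Hypothetical densities spanning roughly the range seen in this dataset
example_density <- c(1.00, 1.05, 1.10)
example_fat <- round(siri_fat(example_density), 2)
example_fat
```

Under this formula, a density of 1.00 corresponds to 45% body fat and a density of 1.10 to 0%, so higher density always means lower estimated fat.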
Which one is a better and more accurate measure of body fat? This question remains to be studied further. Instead, the main goal of this analysis is to find out the relationships between ‘Body Fat Percentage’ and other predictors as well as ‘Body Mass Index’ and others.
The body density dataset includes the following 15 variables listed from left to right:
Density : Density determined from underwater weighing
Fat : Percent body fat from Siri’s (1956) equation
Age : Age (years)
Weight : Weight (kg)
Height : Height (cm)
Neck : Neck circumference (cm)
Chest: Chest circumference (cm)
Abdomen : Abdomen circumference (cm)
Hip : Hip circumference (cm)
Thigh : Thigh circumference (cm)
Knee : Knee circumference (cm)
Ankle : Ankle circumference (cm)
Biceps : Biceps (extended) circumference (cm)
Forearm : Forearm circumference (cm)
Wrist : Wrist circumference (cm)
In addition to the above variables, a new variable, MassIndex, will be created as the ratio of Weight (kg) to the square of Height (m) as the measure of Body Mass Index. The model selection process and the regressions for both response variables, Fat and MassIndex, will be compared to see which measure explains this data better.
library(car)
library(stargazer)
library(Zelig)
bmi <- read.table("K:/QC/Soc/SOC712/Homework/BMI.txt",header=TRUE)
summary(bmi)
## Density Fat Age Height
## Min. :0.995 Min. : 0.00 Min. :22.00 Min. : 74.93
## 1st Qu.:1.041 1st Qu.:12.47 1st Qu.:35.75 1st Qu.:173.35
## Median :1.055 Median :19.20 Median :43.00 Median :177.80
## Mean :1.056 Mean :19.15 Mean :44.88 Mean :178.18
## 3rd Qu.:1.070 3rd Qu.:25.30 3rd Qu.:54.00 3rd Qu.:183.51
## Max. :1.109 Max. :47.50 Max. :81.00 Max. :197.49
## Weight Neck Chest Abdomen
## Min. : 53.75 Min. :31.10 Min. : 79.30 Min. : 69.40
## 1st Qu.: 72.12 1st Qu.:36.40 1st Qu.: 94.35 1st Qu.: 84.58
## Median : 80.06 Median :38.00 Median : 99.65 Median : 90.95
## Mean : 81.16 Mean :37.99 Mean :100.82 Mean : 92.56
## 3rd Qu.: 89.36 3rd Qu.:39.42 3rd Qu.:105.38 3rd Qu.: 99.33
## Max. :164.72 Max. :51.20 Max. :136.20 Max. :148.10
## Hip Thigh Knee Ankle
## Min. : 85.0 Min. :47.20 Min. :33.00 Min. :19.1
## 1st Qu.: 95.5 1st Qu.:56.00 1st Qu.:36.98 1st Qu.:22.0
## Median : 99.3 Median :59.00 Median :38.50 Median :22.8
## Mean : 99.9 Mean :59.41 Mean :38.59 Mean :23.1
## 3rd Qu.:103.5 3rd Qu.:62.35 3rd Qu.:39.92 3rd Qu.:24.0
## Max. :147.7 Max. :87.30 Max. :49.10 Max. :33.9
## Biceps Forearm Wrist
## Min. :24.80 Min. :21.00 Min. :15.80
## 1st Qu.:30.20 1st Qu.:27.30 1st Qu.:17.60
## Median :32.05 Median :28.70 Median :18.30
## Mean :32.27 Mean :28.66 Mean :18.23
## 3rd Qu.:34.33 3rd Qu.:30.00 3rd Qu.:18.80
## Max. :45.00 Max. :34.90 Max. :21.40
The above table shows the Body Density Dataset having 15 variables and 252 observations.
The next step is to create the new variable MassIndex using the mutate function in the dplyr package. The resulting data frame, bmi2, has 16 variables.
library(dplyr)
bmi2 <- mutate(bmi, MassIndex = Weight/(Height/100)^2)
names(bmi2)
## [1] "Density" "Fat" "Age" "Height" "Weight"
## [6] "Neck" "Chest" "Abdomen" "Hip" "Thigh"
## [11] "Knee" "Ankle" "Biceps" "Forearm" "Wrist"
## [16] "MassIndex"
To compare the distributions of the 2 response variables, the following criteria are used to categorize the types of weight for Body Mass Index.
Underweight Less than 18.5
Recommended 18.5 to 24.9
Overweight 25.0 to 29.9
Obese 30 or greater
attach(bmi2)
bmi2$WeightGroup[MassIndex < 18.5] <- 1
bmi2$WeightGroup[MassIndex >= 18.5 & MassIndex <25] <- 2
bmi2$WeightGroup[MassIndex >= 25 & MassIndex < 30] <- 3
bmi2$WeightGroup[MassIndex >= 30 ] <-4
bmi2$WeightGroup[is.na(MassIndex)] <- NA
detach(bmi2)
WeightType <- c("Underweight", "Normalweight", "Overweight", "Obese")
bmi2$WeightGroup <-factor(bmi2$WeightGroup, labels = WeightType)
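The same binning can be written more compactly with `cut()`. The sketch below is an alternative to the attach/detach approach, not the method used above; the BMI values are hypothetical and chosen to land on each category boundary.

```r
# Hypothetical BMI values, including boundary cases and a missing value
mass_index <- c(17.9, 18.5, 24.9, 25.0, 29.9, 30.0, NA)
weight_type <- c("Underweight", "Normalweight", "Overweight", "Obese")

# right = FALSE makes each interval closed on the left: [18.5, 25), [25, 30), ...
# which matches the >= / < comparisons used in the code above.
weight_group <- cut(mass_index,
                    breaks = c(-Inf, 18.5, 25, 30, Inf),
                    labels = weight_type,
                    right = FALSE)

table(weight_group, useNA = "ifany")
```

Missing BMI values stay `NA` automatically, so no separate `is.na()` assignment is needed.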
Now the variable WeightGroup has 4 categories for the types of weight based on the Body Mass Index.
Then the following criteria are used to categorize the types of weights for Body Fat Percent.
essential fat : 2-5%
athletes : 6-13%
fitness : 14-17%
Normal : 18-24%
obese : more than 24%
attach(bmi2)
bmi2$WeightGroup2[Fat < 6] <- 1
bmi2$WeightGroup2[Fat >= 6 & Fat < 14] <- 2
bmi2$WeightGroup2[Fat >= 14 & Fat < 18] <- 3
bmi2$WeightGroup2[Fat >= 18 & Fat < 25 ] <- 4
bmi2$WeightGroup2[Fat >= 25] <- 5
bmi2$WeightGroup2[is.na(Fat)] <- NA
detach(bmi2)
WeightType2 <- c("EssentialFat", "Athletes", "Fitness", "Average", "Obese")
bmi2$WeightGroup2 <-factor(bmi2$WeightGroup2, labels = WeightType2)
Now the variable WeightGroup2 has 5 categories for the types of weight based on the Body Fat Percent.
The following results show the number of male participants categorized as underweight, normal weight, overweight, and obese according to their Body Mass Index value, and the number categorized as essential fat, athletes, fitness, average, and obese according to their Body Fat Percent.
summary(bmi2$WeightGroup)
## Underweight Normalweight Overweight Obese
## 1 124 102 25
summary(bmi2$WeightGroup2)
## EssentialFat Athletes Fitness Average Obese
## 12 63 37 74 66
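To see how the two classifications agree for the same individuals, rather than comparing only their marginal counts, the two factors can be cross-tabulated. The sketch below uses a small hypothetical data frame named `toy` in place of `bmi2`.

```r
# Toy data standing in for bmi2; all values are hypothetical
toy <- data.frame(
  MassIndex = c(22.0, 23.5, 26.0, 27.5, 31.0, 24.0),
  Fat       = c(12.0, 26.0, 20.0, 28.0, 33.0,  5.0)
)

toy$WeightGroup <- cut(toy$MassIndex,
                       breaks = c(-Inf, 18.5, 25, 30, Inf),
                       labels = c("Underweight", "Normalweight",
                                  "Overweight", "Obese"),
                       right = FALSE)

toy$WeightGroup2 <- cut(toy$Fat,
                        breaks = c(-Inf, 6, 14, 18, 25, Inf),
                        labels = c("EssentialFat", "Athletes", "Fitness",
                                   "Average", "Obese"),
                        right = FALSE)

# Rows: BMI category, columns: body-fat category
cross_tab <- table(BMI = toy$WeightGroup, Fat = toy$WeightGroup2)
cross_tab
```

With the real data, the equivalent call would be `table(bmi2$WeightGroup, bmi2$WeightGroup2)`; off-diagonal cells show people who are, for example, normal weight by BMI but obese by body fat.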
As the above comparison shows, Body Mass Index and Body Fat Percent have quite different distributions, especially in the extreme categories. Body Fat Percent places many more men in the lowest and highest categories (12 in EssentialFat and 66 in Obese) than Body Mass Index does (1 in Underweight and 25 in Obese).
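To select significant predictors for the response variable ‘Fat’, a linear regression on all other variables is fitted.
The table below shows that only ‘Density’ has a significant linear relationship with ‘Fat’. This is expected, because Fat is computed from Density by Siri’s (1956) equation.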
summary(bmi.mod1 <- lm(Fat ~ Density + Age + Height + Weight + Neck + Chest + Abdomen + Hip + Thigh + Knee + Ankle + Biceps + Forearm + Wrist, data = bmi2))
##
## Call:
## lm(formula = Fat ~ Density + Age + Height + Weight + Neck + Chest +
## Abdomen + Hip + Thigh + Knee + Ankle + Biceps + Forearm +
## Wrist, data = bmi2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4357 -0.3724 -0.1275 0.2156 15.1474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.500e+02 1.071e+01 42.005 <2e-16 ***
## Density -4.112e+02 8.258e+00 -49.796 <2e-16 ***
## Age 1.259e-02 9.626e-03 1.308 0.192
## Height -3.142e-03 1.120e-02 -0.281 0.779
## Weight 2.217e-02 3.520e-02 0.630 0.529
## Neck -2.846e-02 6.938e-02 -0.410 0.682
## Chest 2.678e-02 2.936e-02 0.912 0.363
## Abdomen 1.857e-02 3.175e-02 0.585 0.559
## Hip 1.917e-02 4.343e-02 0.441 0.659
## Thigh -1.676e-02 4.303e-02 -0.389 0.697
## Knee -4.639e-03 7.162e-02 -0.065 0.948
## Ankle -8.568e-02 6.576e-02 -1.303 0.194
## Biceps -5.505e-02 5.087e-02 -1.082 0.280
## Forearm 3.386e-02 5.953e-02 0.569 0.570
## Wrist 7.345e-03 1.617e-01 0.045 0.964
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.274 on 237 degrees of freedom
## Multiple R-squared: 0.9781, Adjusted R-squared: 0.9768
## F-statistic: 756.3 on 14 and 237 DF, p-value: < 2.2e-16
Again, a linear regression is fitted to select significant predictors for the response variable ‘MassIndex’. As a result, the predictors Height, Weight, Neck, Chest, Abdomen, Thigh, and Knee are chosen for their significance at the 5% level.
summary(bmi.mod2 <- lm(MassIndex ~ Density + Age + Height + Weight + Neck + Chest + Abdomen + Hip + Thigh + Knee + Ankle + Biceps + Forearm + Wrist, data = bmi2))
##
## Call:
## lm(formula = MassIndex ~ Density + Age + Height + Weight + Neck +
## Chest + Abdomen + Hip + Thigh + Knee + Ankle + Biceps + Forearm +
## Wrist, data = bmi2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.3523 -2.0685 -0.3381 1.6979 28.2505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 229.605482 30.330244 7.570 8.26e-13 ***
## Density -6.343269 23.380600 -0.271 0.786393
## Age -0.007158 0.027253 -0.263 0.793040
## Height -1.029726 0.031703 -32.480 < 2e-16 ***
## Weight 1.154493 0.099651 11.585 < 2e-16 ***
## Neck -0.515661 0.196416 -2.625 0.009219 **
## Chest -0.344475 0.083132 -4.144 4.76e-05 ***
## Abdomen -0.330356 0.089894 -3.675 0.000294 ***
## Hip -0.146461 0.122946 -1.191 0.234743
## Thigh -0.325452 0.121819 -2.672 0.008072 **
## Knee 0.654171 0.202773 3.226 0.001432 **
## Ankle -0.246928 0.186169 -1.326 0.185996
## Biceps -0.272801 0.144028 -1.894 0.059431 .
## Forearm 0.100265 0.168550 0.595 0.552497
## Wrist -0.088536 0.457726 -0.193 0.846792
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.608 on 237 degrees of freedom
## Multiple R-squared: 0.8655, Adjusted R-squared: 0.8575
## F-statistic: 108.9 on 14 and 237 DF, p-value: < 2.2e-16
In linear regression models, an optimized model can be obtained by selecting active predictors with the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). Here the step function with backward elimination is used to select the variables of the optimized subset model by AIC for the given set of data.
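To make the two criteria concrete, the sketch below computes AIC and BIC by hand from the log-likelihood and compares two nested models. It uses R's built-in `mtcars` data rather than the body fat data, and `aic_manual` is a hypothetical helper name.

```r
# Two nested linear models on the built-in mtcars data (illustration only)
fit_full  <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_small <- lm(mpg ~ wt + hp, data = mtcars)

# AIC = 2k - 2 logLik and BIC = log(n) k - 2 logLik, where k counts the
# regression coefficients plus the error variance.
aic_manual <- function(fit) {
  ll <- logLik(fit)
  k  <- attr(ll, "df")
  n  <- nobs(fit)
  c(AIC = 2 * k - 2 * as.numeric(ll),
    BIC = log(n) * k - 2 * as.numeric(ll))
}

criteria <- rbind(full = aic_manual(fit_full), small = aic_manual(fit_small))
criteria
```

The model with the smaller value is preferred under each criterion. BIC penalizes each extra parameter by log(n) instead of 2, so it tends to select smaller models.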
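# lower = ~ Density keeps Density in every candidate model during backward elimination
(bmi.backward1 <- step(bmi.mod1, scope = list(lower = ~ Density), direction = "backward", trace = 0))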
##
## Call:
## lm(formula = Fat ~ Density + Age + Chest, data = bmi2)
##
## Coefficients:
## (Intercept) Density Age Chest
## 452.23095 -415.90774 0.01289 0.05319
As a result, Density, Age, and Chest are selected in the optimized model for the response variable Fat.
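The `scope` argument of `step()` expects a named list, with `lower` given as a one-sided formula for the terms that must always stay in the model. The sketch below illustrates this on the built-in `mtcars` data; the choice of `wt` as the protected term is arbitrary.

```r
# Full model on built-in data (illustration only, not the body fat data)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward elimination by AIC; wt can never be dropped because it is in `lower`
reduced <- step(full,
                scope = list(lower = ~ wt),
                direction = "backward",
                trace = 0)

kept_terms <- attr(terms(reduced), "term.labels")
kept_terms
```

If the list element is not named `lower`, `step()` does not treat it as a lower bound, and the protected term can be eliminated like any other.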
Then the fitted values for bmi.mod1 with all the regressors and the fitted values for subset models selected by backward step function are compared in the below graph.
plot(fitted(bmi.mod1) ~ fitted(bmi.backward1))
abline(0,1)
cor(fitted(bmi.mod1), fitted(bmi.backward1))
## [1] 0.9997243
The change in the fitted values is relatively small, and the two sets of fitted values have a correlation of about 0.9997. So we conclude that the subset model and the full model provide essentially the same information about the value of the response given the predictors (Fox and Weisberg 2011).
Similarly, the step function with backward elimination is used to select the variables of the optimized subset model for the response variable MassIndex.
# No lower bound is set here, so Density is allowed to drop out of the model
(bmi.backward2 <- step(bmi.mod2, direction = "backward", trace = 0))
##
## Call:
## lm(formula = MassIndex ~ Height + Weight + Neck + Chest + Abdomen +
## Thigh + Knee + Biceps, data = bmi2)
##
## Coefficients:
## (Intercept) Height Weight Neck Chest
## 207.7371 -1.0187 1.0607 -0.4758 -0.3171
## Abdomen Thigh Knee Biceps
## -0.3352 -0.3480 0.5859 -0.2238
Then the fit of the full and subset models is compared by looking at the corresponding fitted values.
plot(fitted(bmi.mod2) ~ fitted(bmi.backward2))
abline(0,1)
cor(fitted(bmi.mod2), fitted(bmi.backward2))
## [1] 0.9986755
Again, the subset model and the full model provide essentially the same information about the value of the response given the predictors.
Based on the predictors that were significant in the full models above, 2 different subsets of the data are selected. The first subset, ‘subbmi1’, contains Fat (as the response variable) and Density, with all other variables removed. The second subset, ‘subbmi2’, contains MassIndex (as the response variable) and Height, Weight, Neck, Chest, Abdomen, Thigh, and Knee as predictors.
subbmi1 <- select(bmi2, Fat, Density)
names(subbmi1)
## [1] "Fat" "Density"
subbmi2 <- select(bmi2, MassIndex, Height, Weight,Neck, Chest, Abdomen, Thigh, Knee)
names(subbmi2)
## [1] "MassIndex" "Height" "Weight" "Neck" "Chest" "Abdomen"
## [7] "Thigh" "Knee"
Then the regression results for the 2 subsets and response variables are compared using the stargazer package (Hlavac 2014). The table shows that the Fat ~ Density regression model has a higher R-squared value than the MassIndex regression model. Its multiple R-squared value is 0.976, which implies that approximately 97.6% of the variability of the dependent variable is explained by the fitted regression line. The MassIndex regression has an R-squared of 86.2%: the weighted combination of the 7 predictor variables explains approximately 86.2% of the variance of the dependent variable.
The Density coefficient of -434.360 implies that Fat is expected to decrease by about 434 percentage points per unit (1 gm/cm3) increase in density. Since Density ranges only from 0.995 to 1.109 in this data, it is easier to read this as a decrease of about 4.3 percentage points per 0.01 gm/cm3. The other response variable, MassIndex, has a negative linear relationship with Height, Neck, Chest, Abdomen, and Thigh but a positive linear relationship with Weight and Knee.
library(stargazer)
bmi.m1 <- lm(Fat ~ Density, data =subbmi1)
bmi.m2 <- lm(MassIndex ~ Height+Weight+Neck+Chest+ Abdomen+Thigh+Knee, data =subbmi2)
stargazer(bmi.m1, bmi.m2, title="Comparison of 2 Regression outputs",type="html", single.row=TRUE)
Dependent variable: | Fat | MassIndex
 | (1) | (2)
Density | -434.360*** (4.334) |
Height | | -1.017*** (0.030)
Weight | | 1.040*** (0.075)
Neck | | -0.537*** (0.172)
Chest | | -0.337*** (0.080)
Abdomen | | -0.317*** (0.060)
Thigh | | -0.391*** (0.098)
Knee | | 0.597*** (0.189)
Constant | 477.650*** (4.576) | 206.708*** (10.790)
Observations | 252 | 252
R^{2} | 0.976 | 0.862
Adjusted R^{2} | 0.976 | 0.858
Residual Std. Error | 1.307 (df = 250) | 3.606 (df = 244)
F Statistic | 10,044.030*** (df = 1; 250) | 217.073*** (df = 7; 244)
Note: | *p<0.1; **p<0.05; ***p<0.01
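Because body density varies over a very narrow range, the Density coefficient in the table above is easier to interpret after rescaling. The short calculation below uses the reported slope and the minimum and maximum Density from the earlier summary; the variable names are hypothetical.

```r
# Slope of Fat on Density, taken from the regression table above
slope_density <- -434.360

# Observed range of Density (min and max from the data summary)
density_range <- c(0.995, 1.109)

# Expected change in Fat (percentage points) per 0.01 g/cm^3 of density
change_per_hundredth <- slope_density * 0.01

# Expected change in Fat across the whole observed density range
change_over_range <- slope_density * diff(density_range)

c(per_0.01 = change_per_hundredth, over_range = change_over_range)
```

The fitted line therefore implies a drop of roughly 4.3 percentage points of body fat for every additional 0.01 units of density, and a difference of roughly 50 percentage points between the least and most dense men in the data.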
Regression outliers are y values that are unusual conditional on the values of the predictors. The quantile-quantile plot and the outlierTest function for the regression model (Fat ~ Density) show that observations 96 and 48 are outliers.
qqPlot(bmi.m1, id.n=4)
## 48 182 76 96
## 1 250 251 252
outlierTest(bmi.m1)
## rstudent unadjusted p-value Bonferonni p
## 96 24.505612 2.6448e-68 6.6650e-66
## 48 -7.457113 1.4662e-12 3.6949e-10
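The idea behind this kind of outlier test can be sketched in base R: compute the studentized residuals, turn each into a two-sided t-test p-value, and apply a Bonferroni correction for the number of observations. The example below uses simulated data with one injected outlier and is only meant to mirror the logic, not to reproduce `outlierTest` exactly.

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[10] <- y[10] + 15          # inject one gross outlier at observation 10

fit <- lm(y ~ x)

# Externally studentized residuals follow a t distribution
# with (residual df - 1) degrees of freedom under the model.
t_i     <- rstudent(fit)
df_t    <- df.residual(fit) - 1
p_unadj <- 2 * pt(abs(t_i), df = df_t, lower.tail = FALSE)

# Bonferroni correction: multiply by the number of observations tested
p_bonf <- pmin(1, p_unadj * n)

worst <- unname(which.max(abs(t_i)))
c(obs = worst, rstudent = unname(t_i[worst]), bonferroni_p = unname(p_bonf[worst]))
```

An observation is flagged when its Bonferroni-adjusted p-value is below the chosen significance level, which is why only the most extreme residuals appear in the output above.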
The quantile-quantile plot and the outlierTest function for the regression model (MassIndex ~ Height+Weight+Neck+Chest+ Abdomen+Thigh+Knee) show that observations 42 and 39 are outliers.
qqPlot(bmi.m2, id.n=4)
## 39 106 74 42
## 1 2 3 252
outlierTest(bmi.m2)
## rstudent unadjusted p-value Bonferonni p
## 42 212.253315 7.1043e-278 1.7903e-275
## 39 -5.977953 8.0100e-09 2.0185e-06
Observations with high leverage are relatively far from the center of the regressor space, and have potentially greater influence on the least-squares regression coefficients. The function influenceIndexPlot in the car package shows hat values, the most common measure of leverage, and Cook’s distances, a common measure of influence (Fox and Weisberg 2011). As a result, observations 96, 48, 182, 36 and 216 are flagged for the response variable Fat. Among them, 96, 182, and 216 are flagged in both graphs.
influenceIndexPlot(bmi.m1, vars=c("Cook", "hat"), id.n=4)
The following plot shows that observations 42 and 39 have high leverage for the response variable MassIndex.
influenceIndexPlot(bmi.m2, vars=c("Cook", "hat"), id.n=4)
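The quantities plotted above can also be inspected numerically with base R. The sketch below uses simulated data with one point placed far from the center of the predictor, and applies two common rules of thumb (hat value above 2p/n, Cook's distance above 4/n). These cutoffs are conventions assumed here for illustration and are not taken from the original analysis.

```r
set.seed(2)
n <- 40
x <- c(rnorm(n - 1), 8)      # the last point lies far from the center of x
y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)

h <- hatvalues(fit)          # leverage of each observation
d <- cooks.distance(fit)     # influence of each observation on the fit
p <- length(coef(fit))       # number of estimated coefficients

# Rule-of-thumb cutoffs (assumptions, not part of the original report)
high_leverage  <- unname(which(h > 2 * p / n))
high_influence <- unname(which(d > 4 / n))

list(high_leverage = high_leverage, high_influence = high_influence)
```

A point can have high leverage without being influential if it lies close to the fitted line, which is why it helps to look at both measures before removing anything.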
According to the above diagnostics, the outliers and high-leverage observations are removed. 4 observations are removed from the regression model for the response variable Fat and 2 observations are removed from the regression model for the response variable MassIndex.
bmi.m3 <- update(bmi.m1, subset = -c(96, 48, 182, 216))
bmi.m4 <- update(bmi.m2, subset = -c(42, 39))
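Negative indices in `subset` refer to row positions in the data, so it is worth confirming that the refitted model really uses fewer observations. A minimal check of this pattern on the built-in `mtcars` data (not the body fat data) might look like this:

```r
# Fit on all rows, then refit without rows 1 and 2 (by position)
fit_all  <- lm(mpg ~ wt, data = mtcars)
fit_drop <- update(fit_all, subset = -c(1, 2))

# The number of observations should fall by exactly the number of rows dropped
n_used <- c(all = nobs(fit_all), dropped = nobs(fit_drop))
n_used
```

The same `nobs()` check applied to the refitted body fat models should give the 248 and 250 observations expected after removing 4 and 2 rows from 252.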
The linear regression analyses are performed again for the 2 subsets without the influential values. The results are compared in the table below.
library(stargazer)
stargazer(bmi.m3, bmi.m4, title="Comparison of 2 Regression outputs without unusual data",type="html", single.row=TRUE)
Dependent variable: | Fat | MassIndex
 | (1) | (2)
Density | -443.388*** (1.161) |
Height | | -0.277*** (0.004)
Weight | | 0.302*** (0.007)
Neck | | -0.008 (0.013)
Chest | | 0.009 (0.006)
Abdomen | | 0.013*** (0.005)
Thigh | | 0.006 (0.007)
Knee | | -0.027* (0.014)
Constant | 487.120*** (1.225) | 49.141*** (1.158)
Observations | 248 | 250
R^{2} | 0.998 | 0.994
Adjusted R^{2} | 0.998 | 0.994
Residual Std. Error | 0.333 (df = 246) | 0.258 (df = 242)
F Statistic | 145,970.800*** (df = 1; 246) | 5,955.936*** (df = 7; 242)
Note: | *p<0.1; **p<0.05; ***p<0.01
After the outliers and high-leverage observations are removed, both regression models fit the data markedly better. In particular, the R-squared value of the MassIndex regression rises from 0.862 to 0.994, and Neck, Chest, and Thigh are no longer significant at the 5% level. This change can be explained by the effect of the outliers and high-leverage values in the dataset.
Both Body Mass Index (BMI) and Body Fat Percent are measures of body fat. For the Body Density data, 2 different response variables are used to identify the relationship between each response variable and its own significant predictors, and to determine which body fat measure represents this dataset better. The result of this analysis shows that Body Fat Percent (Fat) has a significant negative linear relationship only with the variable Density, since Body Fat Percent is computed from Body Density by Siri’s (1956) equation
$$\text{Fat} = \frac{495}{D} - 450,$$
which is nearly linear in D over the narrow range of densities observed, and Body Density is
$$D = \frac{1}{A/a + B/b},$$
where
D = Body Density (gm/cm3)
A = proportion of lean body tissue
B = proportion of fat tissue (A+B=1)
a = density of lean body tissue (gm/cm3)
b = density of fat tissue (gm/cm3)
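For readers who want the link between these definitions and the Fat variable, the derivation below sketches how the proportion of fat B follows from the density D. It assumes that the volumes of lean and fat tissue add, so that 1/D = A/a + B/b, and it plugs in the tissue densities a = 1.10 and b = 0.90 usually attributed to Siri (1956); those two numbers are not stated in this report and are an assumption here.

```latex
% Mixture of lean tissue (proportion A, density a) and fat (proportion B, density b)
\frac{1}{D} = \frac{A}{a} + \frac{B}{b}, \qquad A = 1 - B
\;\Longrightarrow\;
\frac{1}{D} = \frac{1}{a} + B\,\frac{a - b}{ab}
\;\Longrightarrow\;
B = \frac{1}{D}\cdot\frac{ab}{a - b} - \frac{b}{a - b}.

% Assumed values (not given in this report): a = 1.10, b = 0.90 gm/cm^3
\frac{ab}{a - b} = \frac{0.99}{0.20} = 4.95, \qquad
\frac{b}{a - b} = \frac{0.90}{0.20} = 4.50,
\qquad\text{so}\qquad
100\,B = \frac{495}{D} - 450 .
```

This is why Fat is almost perfectly explained by Density alone: apart from rounding and a few recording errors, it is a deterministic function of D.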
Meanwhile, Body Mass Index (MassIndex) has strong linear relationships with multiple predictors, since Body Mass Index is calculated as $\text{MassIndex} = \text{Weight (kg)} / \text{Height (m)}^2$, and Height and Weight are also correlated with other measurements such as Abdomen, Thigh, etc.
Accurate measurement of body fat is inconvenient or costly and it is desirable to have easy methods of estimating body fat that are cost-effective and convenient. However, most of the body measures except for the Body Fat Percent and the Body Density are relatively easier to obtain for data collection. Body Mass Index is also calculated simply from weight and height of an individual.
In conclusion, Body Mass Index is a more convenient and less costly measure of body fat, since it is strongly correlated with most body measures except Body Fat Percent and Density. One plausible problem with the Body Mass Index (‘MassIndex’) regression is the possibility of multicollinearity among the predictor variables, since all body measures are highly correlated.
However, which is the better or more accurate measure of human body fat remains to be answered by further medical research.
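The multicollinearity concern raised above could be checked with variance inflation factors (VIFs). The sketch below computes them by hand in base R as 1 / (1 - R²), where R² comes from regressing each predictor on the others. It uses the built-in `mtcars` data for illustration, and `vif_manual` is a hypothetical helper name.

```r
# VIF for each predictor: regress it on the remaining predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Illustration on built-in data; for the body fat data the predictors would be
# Height, Weight, Neck, Chest, Abdomen, Thigh, and Knee.
vifs <- vif_manual(mtcars, c("wt", "hp", "disp", "drat"))
round(vifs, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others, and large values indicate that its coefficient is estimated imprecisely because of overlap with the other predictors. The car package already loaded in this analysis also provides a `vif()` function for fitted models.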
Fox, John, and Harvey Sanford Weisberg. 2011. An R Companion to Applied Regression. 2nd ed. Thousand Oaks, CA: Sage Publications.
Hlavac, Marek. 2014. stargazer: LaTeX Code and ASCII Text for Well-Formatted Regression and Summary Statistics Tables. http://CRAN.R-project.org/package=stargazer.