Predicting Siri’s Percent Body Fat Without Underwater Density Measurements

Abstract

The Siri equation calculates percent body fat from a measure of body density. The most widely used method of obtaining body density is hydrostatic (underwater) weighing, which compares the subject’s dry weight to the subject’s weight when submerged in a tank of water. Because hydrostatic weighing is expensive and inaccessible to most people, an alternative method of estimating Siri’s percent body fat is sought. Multiple regression is applied to a dataset of 252 men to model Siri’s percent body fat (derived from hydrostatic weighing) as a function of age, weight, height, and ten body circumference measurements taken with a measuring tape. The final model contains four predictors: weight (pounds), abdomen circumference (cm), forearm circumference (cm), and wrist circumference (cm).

I. Introduction

A. Study Design

A study of percent body fat is considered. The health, diet, and insurance industries all have an interest in the weight of the American public. Methods of measuring fitness abound, from BMI calculations to skinfold tests to old-fashioned weight and height tables. Another measure, body density, is the ratio of the body’s mass to its volume and is best obtained by weighing the subject while submerged in a tank of water. The Siri equation, \(Percent Body Fat = (495/Body Density) - 450\), uses body density to calculate percent body fat. According to the American Council on Exercise, normal adult men typically have between 18% and 22% body fat. An accurate measure of percent body fat is beneficial, but it is both expensive and difficult to obtain. To find a quick and reliable estimate of Siri’s percent body fat, this study uses data collected by Dr. A. Garth Fisher and published by Roger W. Johnson of Carleton College in the Journal of Statistics Education (1996). Body density was measured using hydrostatic weighing and converted to percent body fat with the Siri equation; percent body fat was then regressed on simple measurements acquired from a scale and tape measure. The variables are: age, in years; weight, in pounds; height, in inches; neck circumference, in cm; chest circumference, in cm; abdomen circumference, in cm, at the umbilicus and level with the iliac crest; hip circumference, in cm; thigh circumference, in cm; knee circumference, in cm; ankle circumference, in cm; extended biceps circumference, in cm; forearm circumference, in cm; and wrist circumference, in cm.
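The Siri equation quoted above can be written as a one-line function. The following is a minimal Python sketch for illustration only; the function name and the example density are assumptions, and body density is assumed to be expressed in g/cm^3.

```python
def siri_percent_body_fat(body_density: float) -> float:
    """Siri equation: percent body fat = 495 / density - 450.

    body_density is assumed to be in g/cm^3.
    """
    if body_density <= 0:
        raise ValueError("body density must be positive")
    return 495.0 / body_density - 450.0


# Hypothetical density, chosen only to illustrate the calculation.
example_density = 1.05
example_fat = siri_percent_body_fat(example_density)
print(f"density {example_density} g/cm^3 -> {example_fat:.1f}% body fat")
```

Because density appears in the denominator, small measurement errors in density translate into noticeable changes in the computed percent body fat.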

B. Aims

The purpose of the study is to investigate the relationship between Siri percent body fat and several potential predictor variables: age, weight, height, and 10 simple body measurements, and then to formulate a model that accurately predicts Siri percent body fat from a subset of these potential predictors.

C. Statistical Model

A multiple linear regression model is considered. All predictor variables \(X_{ik}\) are possible covariates, and none are predetermined. Let

\[{Y_i} = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \beta_4X_{i4} + \beta_5X_{i5} + \beta_6X_{i6} + \beta_7X_{i7} + \beta_8X_{i8} + \beta_9X_{i9} + \beta_{10}X_{i10} + \beta_{11}X_{i11} + \beta_{12}X_{i12} + \beta_{13}X_{i13} + \epsilon_i\]

\({Y_i}\) = a continuous measure of percent body fat of the \(i^{th}\) randomly selected man, computed using Siri’s equation, \(495/Density - 450\).

\(X_{i1}\) = known age in years of the \(i^{th}\) man.

\(X_{i2}\) = known weight in pounds of the \(i^{th}\) man.

\(X_{i3}\) = known height in inches of the \(i^{th}\) man.

\(X_{i4}\) = known neck circumference in centimeters of the \(i^{th}\) man.

\(X_{i5}\) = known chest circumference in centimeters of the \(i^{th}\) man.

\(X_{i6}\) = known abdomen circumference in centimeters at the umbilicus and level with the iliac crest of the \(i^{th}\) man.

\(X_{i7}\) = known hip circumference in centimeters of the \(i^{th}\) man.

\(X_{i8}\) = known thigh circumference in centimeters of the \(i^{th}\) man.

\(X_{i9}\) = known knee circumference in centimeters of the \(i^{th}\) man.

\(X_{i10}\) = known ankle circumference in centimeters of the \(i^{th}\) man.

\(X_{i11}\) = known extended biceps circumference in centimeters of the \(i^{th}\) man.

\(X_{i12}\) = known forearm circumference in centimeters of the \(i^{th}\) man.

\(X_{i13}\) = known wrist circumference distal to the styloid processes in centimeters of the \(i^{th}\) man.

where \(\epsilon_i \overset{iid}{\sim} N(0,\sigma^2)\), and \(\beta_0, \beta_1, \ldots, \beta_{13}\) and \(\sigma^2\) are unknown parameters of interest.

II. Methods

A. Preliminary Analysis

1. Data Analysis:

One data error was found among the proposed predictor variables. One 200 pound man (case 42) is recorded as less than 3 feet tall. Solving the adiposity index formula (\(weight/height^2\)) for height, using the weight recorded for case 42, indicates that the height should presumably be 69.5 inches instead of 29.5 inches. This value was corrected in the dataset.
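The height correction can be reproduced by solving the adiposity index formula for height. The Python sketch below is illustrative only: the function name is hypothetical, the index is assumed to be in kg/m^2, and the inputs (205 lb and an index of 29.9) are assumed values for case 42 that are not quoted in this report.

```python
import math

LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound
IN_TO_M = 0.0254       # exact definition of the inch


def height_from_adiposity(weight_lb: float, index_kg_m2: float) -> float:
    """Solve index = weight / height^2 for height, returned in inches."""
    weight_kg = weight_lb * LB_TO_KG
    height_m = math.sqrt(weight_kg / index_kg_m2)
    return height_m / IN_TO_M


# Assumed inputs for case 42 (not stated in the report text).
implied_height_in = height_from_adiposity(weight_lb=205.0, index_kg_m2=29.9)
print(f"implied height: {implied_height_in:.1f} in")
```

Under these assumed inputs the implied height falls between 69 and 70 inches, which is consistent with a recorded 29.5 being a mistyped 69.5.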

2. Diagnostics of Predictor Variables:

A building dataset consisting of cases 1-150 was used, and the thirteen potential predictor variables were assessed individually using jittered dotplots and boxplots. Only two of the predictor variables, abdomen and ankle, show mild right skewness, both with two large outliers. In the model selection process, natural logarithm transformations are applied to ankle and abdomen to assess whether the transformed variables improve the model fit. (The jittered dotplots for the two transformed variables show some mild improvement in skewness.) The remaining predictor variables are fairly symmetric with a few outlying points each. One noteworthy observation is that a large outlier exists in every predictor variable except forearm. This observation is investigated in the diagnostics section.

3. Bivariate Associations:

Bivariate associations are assessed for all 13 predictor variables using scatterplot and correlation matrices. Weight, X2, shows strong linear association (r of about 0.69 or higher) with every body measurement except ankle (r = 0.56), and only weak to moderate association with age and height. Most of the 10 body measurement variables show moderate to strong linear correlation with each other, so multicollinearity must be addressed as candidate models are evaluated. Also, one potentially influential outlier, case 39, affects the linearity of the scatterplots with X2, weight, and is noticeable in many of the other plots. Case 39 is assessed later.

##               siri         age     weight     height       neck      chest
## siri     1.0000000  0.30975731  0.5375537 -0.1098717 0.40877819 0.65396540
## age      0.3097573  1.00000000 -0.0515724 -0.2170123 0.08282452 0.09532291
## weight   0.5375537 -0.05157240  1.0000000  0.4608855 0.81007567 0.88626960
## height  -0.1098717 -0.21701228  0.4608855  1.0000000 0.29906804 0.20528516
## neck     0.4087782  0.08282452  0.8100757  0.2990680 1.00000000 0.76131928
## chest    0.6539654  0.09532291  0.8862696  0.2052852 0.76131928 1.00000000
## abdom    0.7810216  0.20074497  0.8758668  0.1653808 0.71891577 0.89900837
## hip      0.5583165 -0.09060300  0.9342726  0.3098580 0.70539684 0.82683861
## thigh    0.5269136 -0.22609128  0.8569298  0.2328093 0.68196635 0.73928665
## knee     0.4809214 -0.03705629  0.8443064  0.4131382 0.64428918 0.71545861
## ankle    0.2118469 -0.09285554  0.5601565  0.3409611 0.41464323 0.43380646
## biceps   0.4276395 -0.06123391  0.7578906  0.2511332 0.67328148 0.69719694
## forearm  0.3566446 -0.10637163  0.6893100  0.3357938 0.66277962 0.65849702
## wrist    0.2551109  0.20186895  0.7033990  0.3945886 0.71844281 0.63461966
##             abdom        hip      thigh        knee       ankle
## siri    0.7810216  0.5583165  0.5269136  0.48092137  0.21184691
## age     0.2007450 -0.0906030 -0.2260913 -0.03705629 -0.09285554
## weight  0.8758668  0.9342726  0.8569298  0.84430641  0.56015652
## height  0.1653808  0.3098580  0.2328093  0.41313817  0.34096113
## neck    0.7189158  0.7053968  0.6819664  0.64428918  0.41464323
## chest   0.8990084  0.8268386  0.7392867  0.71545861  0.43380646
## abdom   1.0000000  0.8667376  0.7704181  0.74772040  0.39932848
## hip     0.8667376  1.0000000  0.8913073  0.82036740  0.50154403
## thigh   0.7704181  0.8913073  1.0000000  0.80695122  0.45759547
## knee    0.7477204  0.8203674  0.8069512  1.00000000  0.54771087
## ankle   0.3993285  0.5015440  0.4575955  0.54771087  1.00000000
## biceps  0.6437049  0.7005927  0.7363782  0.64493355  0.40524310
## forearm 0.5383186  0.5812026  0.6319985  0.59471319  0.37954151
## wrist   0.5709633  0.5756396  0.5024366  0.60580890  0.50251421
##              biceps    forearm     wrist
## siri     0.42763954  0.3566446 0.2551109
## age     -0.06123391 -0.1063716 0.2018690
## weight   0.75789056  0.6893100 0.7033990
## height   0.25113318  0.3357938 0.3945886
## neck     0.67328148  0.6627796 0.7184428
## chest    0.69719694  0.6584970 0.6346197
## abdom    0.64370493  0.5383186 0.5709633
## hip      0.70059271  0.5812026 0.5756396
## thigh    0.73637821  0.6319985 0.5024366
## knee     0.64493355  0.5947132 0.6058089
## ankle    0.40524310  0.3795415 0.5025142
## biceps   1.00000000  0.7279839 0.5787090
## forearm  0.72798387  1.0000000 0.6768877
## wrist    0.57870896  0.6768877 1.0000000
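Each entry of the correlation matrix above is a sample Pearson correlation. The Python sketch below shows how a single entry is computed; the function name is hypothetical and the measurements are made up for illustration, not taken from the dataset.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squares about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up measurements for illustration only.
weight_lb = [150.0, 165.0, 180.0, 200.0, 220.0]
abdomen_cm = [82.0, 88.0, 90.0, 99.0, 104.0]
r = pearson_r(weight_lb, abdomen_cm)
print(f"r = {r:.3f}")
```

Pairs of predictors with correlations this high carry largely overlapping information, which is why multicollinearity is checked again once candidate models are fit.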

4. Screening of Covariates and Verification of Assumptions:

Due to the high number of potential predictor variables, automatic variable selection criteria, including Mallows’ \(C_p\), BIC, and adjusted \(R^2\), are used to narrow down potential models. Mallows’ \(C_p\) favors a model with 6 predictor variables, whose \(C_p\) value of 7.189791 is the closest to its number of parameters (p=7); the eight-predictor model has a marginally smaller \(C_p\) of 7.163157 but requires two more terms. The six predictor variables in the \(C_p\) choice model are X1, X2, X6, X10, X12, and X13 (age, weight, abdomen, ankle, forearm and wrist). The BIC choice model (smallest value, -171.9092) contains four predictor variables: X2, X6, X12, X13 (weight, abdomen, forearm, wrist). The adjusted \(R^2\) is actually largest for the model with eight covariates; however, the difference in adjusted \(R^2\) between the BIC choice model with four predictors and the adjusted \(R^2\) choice model with eight is only 0.015481. Since this difference is so small, the model with 4 predictors is preferred as more parsimonious. The \(C_p\) and BIC choice models are built and evaluated with and without transformations of abdomen and ankle, using cases 1-150 as the model building set.

## Subset selection object
## Call: regsubsets.formula(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + 
##     X9 + X10 + X11 + X12 + X13, data1, subset = 1:150)
## 13 Variables  (and intercept)
##     Forced in Forced out
## X1      FALSE      FALSE
## X2      FALSE      FALSE
## X3      FALSE      FALSE
## X4      FALSE      FALSE
## X5      FALSE      FALSE
## X6      FALSE      FALSE
## X7      FALSE      FALSE
## X8      FALSE      FALSE
## X9      FALSE      FALSE
## X10     FALSE      FALSE
## X11     FALSE      FALSE
## X12     FALSE      FALSE
## X13     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          X1  X2  X3  X4  X5  X6  X7  X8  X9  X10 X11 X12 X13
## 1  ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " " "
## 2  ( 1 ) " " "*" " " " " " " "*" " " " " " " " " " " " " " "
## 3  ( 1 ) " " "*" " " " " " " "*" " " " " " " " " " " "*" " "
## 4  ( 1 ) " " "*" " " " " " " "*" " " " " " " " " " " "*" "*"
## 5  ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*" "*"
## 6  ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" " " "*" "*"
## 7  ( 1 ) "*" "*" " " "*" " " "*" " " " " " " "*" " " "*" "*"
## 8  ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " "*" " " "*" "*"

## [1] 73.805921 23.847459 19.875330 11.601843  8.720351  7.189791  7.232474
## [8]  7.163157

## [1] 0.6073595 0.6981336 0.7068802 0.7235899 0.7306325 0.7352760 0.7370558
## [8] 0.7390709

## [1] -131.2180 -166.6608 -167.0845 -171.9092 -171.8079 -170.4509 -167.5048
## [8] -164.7082
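The three vectors printed above are read here as Mallows’ \(C_p\), adjusted \(R^2\), and BIC for the best model of each size from 1 to 8 predictors. The Python sketch below applies the usual decision rules to those printed values; the variable names are assumptions, and the "closest to p" rule for \(C_p\) is one common convention rather than the only one.

```python
# Criterion values copied from the subset-selection output;
# position k - 1 holds the best model with k predictors.
cp = [73.805921, 23.847459, 19.875330, 11.601843,
      8.720351, 7.189791, 7.232474, 7.163157]
adj_r2 = [0.6073595, 0.6981336, 0.7068802, 0.7235899,
          0.7306325, 0.7352760, 0.7370558, 0.7390709]
bic = [-131.2180, -166.6608, -167.0845, -171.9092,
       -171.8079, -170.4509, -167.5048, -164.7082]

sizes = range(1, len(cp) + 1)

# BIC: smaller (more negative) is better.
best_bic_size = min(sizes, key=lambda k: bic[k - 1])
# Adjusted R^2: larger is better.
best_adj_r2_size = max(sizes, key=lambda k: adj_r2[k - 1])
# Mallows' C_p: prefer C_p close to p = k + 1 (predictors plus intercept).
best_cp_size = min(sizes, key=lambda k: abs(cp[k - 1] - (k + 1)))

# Adjusted R^2 given up by choosing the BIC model over the largest model.
adj_r2_gain = adj_r2[best_adj_r2_size - 1] - adj_r2[best_bic_size - 1]

print(best_bic_size, best_adj_r2_size, best_cp_size, round(adj_r2_gain, 6))
```

The criteria disagree on model size, so the candidates they nominate are compared on fit, multicollinearity, and validation performance rather than on a single criterion.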

B. Statistical Analysis:

1. Best Model:

Mallows’ \(C_p\) choice model, with predictors age, weight, abdomen, ankle, forearm and wrist, is built and compared to four other models: the \(C_p\) choice model with ln(abdom) and ln(ankle), BIC’s choice, BIC’s choice with ln(abdom), and a parsimonious model with three predictor variables. The mean squared error, coefficients of determination, parameter estimates and partial residual plots were compared across the five models. Additionally, the variance inflation factors were examined for the existence of strong multicollinearity within each model. (See Appendix for analysis of alternative models.)

Although the \(C_p\) choice model with variables X1, X2, log(X6), log(X10), X12, and X13 was originally selected because it has the lowest MSE (16.73), the smallest residual standard error (4.091 percent body fat), and the highest coefficient of determination (0.7509), it fails the validation checks. (See Validation of the Model.) BIC’s choice, which aligns much better with the validation set, is the selected model. Its covariates are weight, abdomen, forearm, and wrist, with no logarithm transformations. The MSE is 17.8, with a residual standard error of 4.221% body fat and an \(R^2\) of 0.731. All variance inflation factors are less than 10, with the highest value at 6.700226. Partial residual plots for each of the covariates show linear patterns with no curvature and constant variance, indicating that all four covariates are appropriate for the model. The strongest linear patterns are found in the partial residual plots for weight and abdomen. In both plots most of the data is concentrated at lower values of the covariate, with one or two large outliers.

Selected Model:

\[\hat{Y} = -38.81221 - 0.18408X_{2} + 1.04291X_{6} + 0.98792X_{12} - 1.83990X_{13}\] with \(\hat\sigma^2\) = MSE = 17.8.
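To show how the selected model turns measurements into a prediction, the Python sketch below evaluates the fitted equation with the coefficients from the building-set fit. The function and dictionary names are hypothetical; the example inputs are the case 39 measurements reported in the residual diagnostics, which are extreme values and serve only as an illustration.

```python
# Coefficients copied from the building-set fit of the selected model.
COEFFICIENTS = {
    "intercept": -38.81221,
    "weight_lb": -0.18408,
    "abdomen_cm": 1.04291,
    "forearm_cm": 0.98792,
    "wrist_cm": -1.83990,
}


def predict_siri(weight_lb, abdomen_cm, forearm_cm, wrist_cm):
    """Predicted Siri percent body fat from the four-predictor model."""
    return (COEFFICIENTS["intercept"]
            + COEFFICIENTS["weight_lb"] * weight_lb
            + COEFFICIENTS["abdomen_cm"] * abdomen_cm
            + COEFFICIENTS["forearm_cm"] * forearm_cm
            + COEFFICIENTS["wrist_cm"] * wrist_cm)


# Measurements of case 39 as listed in the residual diagnostics.
case_39_prediction = predict_siri(
    weight_lb=363.15, abdomen_cm=148.1, forearm_cm=29.0, wrist_cm=21.4
)
print(f"predicted Siri percent body fat: {case_39_prediction:.2f}")
```

The negative coefficients on weight and wrist are partial effects: they describe the change in predicted body fat with abdomen and forearm circumference held fixed, not the marginal association of each variable with body fat.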

## 
## Call:
## lm(formula = Y ~ X2 + X6 + X12 + X13, data = data1, subset = 1:150)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0715 -2.8861 -0.2272  3.3524  8.7253 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -38.81221    9.44681  -4.108 6.63e-05 ***
## X2           -0.18408    0.03141  -5.860 2.99e-08 ***
## X6            1.04291    0.07018  14.861  < 2e-16 ***
## X12           0.98792    0.29463   3.353  0.00102 ** 
## X13          -1.83990    0.58695  -3.135  0.00208 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.221 on 145 degrees of freedom
## Multiple R-squared:  0.731,  Adjusted R-squared:  0.7236 
## F-statistic: 98.51 on 4 and 145 DF,  p-value: < 2.2e-16
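The adjusted \(R^2\) in the summary above follows from the multiple \(R^2\), the number of cases, and the number of estimated coefficients. The Python sketch below reproduces it from the printed values; the function name is hypothetical, and the inputs are rounded as printed, so agreement is only to about four decimal places.

```python
def adjusted_r2(r2: float, n_obs: int, n_params: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p).

    n_params counts every estimated coefficient, including the intercept.
    """
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_params)


# Building set: 150 cases, 4 predictors plus an intercept.
building_adj_r2 = adjusted_r2(r2=0.731, n_obs=150, n_params=5)
print(f"adjusted R^2 (building set): {building_adj_r2:.4f}")
```

The penalty factor (n - 1)/(n - p) grows with each added predictor, which is why adjusted \(R^2\) can be used to compare models of different sizes.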

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## X2          1 2775.8  2775.8 155.7672 < 2.2e-16 ***
## X6          1 3969.5  3969.5 222.7488 < 2.2e-16 ***
## X12         1  101.8   101.8   5.7121  0.018134 *  
## X13         1  175.1   175.1   9.8261  0.002083 ** 
## Residuals 145 2583.9    17.8                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##              siri        age     abdom    forearm     wrist
## siri    1.0000000  0.3097573 0.7810216  0.3566446 0.2551109
## age     0.3097573  1.0000000 0.2007450 -0.1063716 0.2018690
## abdom   0.7810216  0.2007450 1.0000000  0.5383186 0.5709633
## forearm 0.3566446 -0.1063716 0.5383186  1.0000000 0.6768877
## wrist   0.2551109  0.2018690 0.5709633  0.6768877 1.0000000

##       X2       X6      X12      X13 
## 6.700226 4.470871 2.264077 2.309599
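A variance inflation factor is defined as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors in the model. The Python sketch below inverts that definition to show what the printed factors imply; the function and variable names are hypothetical.

```python
def vif_from_r2(r2_j: float) -> float:
    """VIF for predictor j, given R^2 of predictor j on the other predictors."""
    return 1.0 / (1.0 - r2_j)


def r2_from_vif(vif: float) -> float:
    """Invert the VIF definition to recover the implied R^2."""
    return 1.0 - 1.0 / vif


# Variance inflation factors copied from the output for the selected model.
vifs = {"X2": 6.700226, "X6": 4.470871, "X12": 2.264077, "X13": 2.309599}
implied_r2 = {name: r2_from_vif(value) for name, value in vifs.items()}

for name, value in implied_r2.items():
    print(f"{name}: implied R^2 on the other predictors = {value:.3f}")
```

The common cutoff of 10 corresponds to an \(R_j^2\) of 0.90, so a factor below 10 still allows substantial overlap among predictors.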

B. Residual Diagnostics

1. Checking for Influential Points

Since outlying points can often impact the model, Cook’s distance is extracted from the model and plotted using a half-normal plot to identify potentially influential cases in the building dataset. Case 39 is again identified as a potentially influential observation. The model is fit with and without case 39, and p-values are checked to ensure that all covariates are still significant. Additionally, MSE and \(R^2\) are compared for both models. Although case 39 is a large outlier, the model is robust to its inclusion, since none of the beta coefficients change drastically when case 39 is removed and all t-tests for the coefficients remain significant. The residual standard error is also roughly equal for both fits (4.221% body fat with case 39 and 4.207% body fat without case 39).

Plots of both the residuals and the studentized residuals versus \(\hat{Y}\) indicate the model is a good fit to the data. There is no systematic pattern to the residuals, and variance is constant for all observations except for the large outliers on the right side of the graphs. None of the studentized residuals appear to be more than two standard deviations from zero, indicating there are no outlying responses that would unduly influence the model. (The large observation in weight, case 39, has already been assessed.)

##    weight abdom forearm wrist
## 39 363.15 148.1      29  21.4

## 
## Call:
## lm(formula = Y[-c(39)] ~ X2[-c(39)] + X6[-c(39)] + X12[-c(39)] + 
##     X13[-c(39)], data = data, subset = 1:150)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2199 -2.9567 -0.1317  3.2233  8.6940 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -36.22568    9.74813  -3.716 0.000288 ***
## X2[-c(39)]   -0.16157    0.03827  -4.222 4.26e-05 ***
## X6[-c(39)]    1.02851    0.07126  14.433  < 2e-16 ***
## X12[-c(39)]   0.76935    0.36051   2.134 0.034521 *  
## X13[-c(39)]  -1.78320    0.58730  -3.036 0.002840 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.207 on 145 degrees of freedom
## Multiple R-squared:  0.7279, Adjusted R-squared:  0.7204 
## F-statistic: 96.98 on 4 and 145 DF,  p-value: < 2.2e-16

## (Intercept)          X2          X6         X12         X13 
## -38.8122109  -0.1840820   1.0429139   0.9879229  -1.8399043

## (Intercept)  X2[-c(39)]  X6[-c(39)] X12[-c(39)] X13[-c(39)] 
## -36.2256802  -0.1615724   1.0285115   0.7693474  -1.7831964

## [1] 4.221411

## [1] 4.206523

2. Checking for Normality

The normal probability plot of the residuals shows a fairly strong linear pattern with heavy tails, that is, slight departures from normality for the highest and lowest cases. This may reduce the accuracy of predicting Siri percent body fat for extreme values of the predictor variables; overall, however, the condition of normality of the error terms is satisfied.

3. Checking of Independence

The error terms appear to be independent, since the sequence plot shows no apparent pattern in the residuals. Also, a regression of each residual on the residual of the previous case shows no association between consecutive residuals, with a p-value of 0.556. (A slope of zero is plausible for this model.) The assumption of independence of residuals is not violated.

## 
## Call:
## lm(formula = residuals(m.4)[-1] ~ -1 + residuals(m.4)[-150])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1289 -2.8358 -0.1919  3.4959  8.8739 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## residuals(m.4)[-150]  0.04833    0.08181   0.591    0.556
## 
## Residual standard error: 4.154 on 148 degrees of freedom
## Multiple R-squared:  0.002353,   Adjusted R-squared:  -0.004388 
## F-statistic: 0.349 on 1 and 148 DF,  p-value: 0.5556
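The independence check above regresses each residual on the previous one with no intercept. The Python sketch below computes that slope directly and reproduces the printed t value from the estimate and its standard error; the function name is hypothetical and the toy residual sequences are made up for illustration.

```python
def lag1_slope_through_origin(residuals):
    """Least-squares slope of e[t] on e[t-1] with no intercept."""
    previous = residuals[:-1]
    current = residuals[1:]
    numerator = sum(p * c for p, c in zip(previous, current))
    denominator = sum(p * p for p in previous)
    return numerator / denominator


# Made-up sequences: one with perfect negative lag-1 dependence,
# one with strong positive dependence.
alternating_slope = lag1_slope_through_origin([1.0, -1.0, 1.0, -1.0, 1.0])
trending_slope = lag1_slope_through_origin([1.0, 2.0, 4.0, 8.0])

# t statistic from the estimate and standard error printed in the output.
t_value = 0.04833 / 0.08181
print(alternating_slope, trending_slope, round(t_value, 3))
```

A slope near zero, as in the output for the selected model, gives no evidence of first-order serial dependence; it does not rule out other forms of dependence.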

C. Validation of the Model

Cases 151-252 are used as a validation dataset for the model. The selected model, with predictors weight, abdomen, forearm and wrist, is refit to the validation set, and the resulting estimates and model diagnostics are compared with those from the building set. The validation dataset yields broadly similar coefficients, except for a much smaller \(\beta_3\) (forearm):

| Coefficient | Building Set | Validation Set |
|---|---|---|
| \(\beta_0\) (intercept) | -38.81221 | -33.41258 |
| \(\beta_1\) (weight, \(X_2\)) | -0.18408 | -0.08243 |
| \(\beta_2\) (abdomen, \(X_6\)) | 1.04291 | 0.94887 |
| \(\beta_3\) (forearm, \(X_{12}\)) | 0.98792 | 0.17697 |
| \(\beta_4\) (wrist, \(X_{13}\)) | -1.83990 | -1.43883 |

Note that \(\beta_3\), the forearm coefficient, is the only coefficient that is not significant when the model is fit to the validation dataset. This raises a concern about the reliability of predictions based on this model; however, all other models tested were less reliable, with several coefficients testing as insignificant in the validation set. (See Appendix.)

The MSEs for the two datasets are similar (17.8 Building vs 18.7 Validation), as are the coefficients of determination (0.731 Building vs 0.7727 Validation). Since the fit of the model is not seriously changed using the validation data, this model is retained.

The mean squared prediction error (MSPR) was 48.32978. Since the MSPR is much larger than the MSE, there is evidence that the MSE understates the variability of predictions for new cases, which may be due to the presence of multicollinearity in the predictor variables. Predictions may be more variable than is indicated by the residual standard error of the building set (\(\sqrt{MSE}= 4.221\)), with variability closer to \(\sqrt{MSPR}=6.952\). This may cause some percent body fat estimates to be over- or underestimated by as much as 7 percentage points.
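The MSPR is the average squared difference between the validation responses and the predictions made by the model fit to the building set. The Python sketch below shows the calculation on made-up numbers and then converts the reported MSE and MSPR to the percent body fat scale; the function name and the toy data are assumptions.

```python
import math


def mspr(actual, predicted):
    """Mean squared prediction error over a validation set."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return sum(squared_errors) / len(actual)


# Made-up validation responses and predictions, for illustration only.
toy_mspr = mspr([20.0, 25.0, 30.0], [22.0, 24.0, 27.0])

# Values taken from the report: building-set SSE and degrees of freedom,
# and the reported MSPR.
building_mse = 2583.9 / 145
reported_mspr = 48.32978
root_mse = math.sqrt(building_mse)
root_mspr = math.sqrt(reported_mspr)
print(f"sqrt(MSE) = {root_mse:.3f}, sqrt(MSPR) = {root_mspr:.3f}")
```

Unlike the MSE, the MSPR divides by the number of validation cases with no degrees-of-freedom correction, because no parameters are estimated from the validation data.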

## 
## Call:
## lm(formula = Y ~ X2 + X6 + X12 + X13, data = data, subset = 151:252)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1755 -2.5981 -0.4022  3.2291 10.5689 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -33.41258   11.09788  -3.011  0.00332 ** 
## X2           -0.08243    0.03916  -2.105  0.03785 *  
## X6            0.94887    0.09013  10.528  < 2e-16 ***
## X12           0.17697    0.22699   0.780  0.43750    
## X13          -1.43883    0.67707  -2.125  0.03612 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.319 on 97 degrees of freedom
## Multiple R-squared:  0.7727, Adjusted R-squared:  0.7633 
## F-statistic: 82.44 on 4 and 97 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: Y
##           Df Sum Sq Mean Sq  F value  Pr(>F)    
## X2         1 3955.7  3955.7 212.0302 < 2e-16 ***
## X6         1 2106.1  2106.1 112.8902 < 2e-16 ***
## X12        1    5.8     5.8   0.3086 0.57985    
## X13        1   84.3    84.3   4.5161 0.03612 *  
## Residuals 97 1809.7    18.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] 48.32978

D. Prediction Intervals

The prediction-interval plot shows the prediction intervals for the validation dataset plotted against the actual Siri percent body fat values, together with a line relating the predicted Siri values to the actual Siri values. It would appear from the graph that actual values are contained in the prediction interval for the majority of the dataset; however, the intervals are fairly wide. For example, the model predicts a Siri percent body fat of 26.36122% for case 5, with a margin of error of 8.44%. Although the actual percent body fat of case 5 is 28.7%, which falls inside the interval, the wide prediction interval shows the existence of large predictive variability in the model. The accuracy of this model is questionable.

## Warning: 'newdata' had 102 rows but variables found have 252 rows

##        Y      fit      lwr      upr 
## 28.70000 26.36122 17.92332 34.79912

## [1] 28.7
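The margin of error quoted for case 5 is half the width of its prediction interval. The Python sketch below recovers it from the printed interval and checks whether the actual value is covered; the variable names are assumptions.

```python
# Values copied from the prediction output for case 5.
actual = 28.70000
fit = 26.36122
lower = 17.92332
upper = 34.79912

# The interval is symmetric about the fitted value, so the margin of error
# is half the interval width.
margin = (upper - lower) / 2
covered = lower <= actual <= upper
prediction_error = actual - fit

print(f"margin = {margin:.2f}, covered = {covered}, "
      f"error = {prediction_error:.2f}")
```

An interval roughly 17 percentage points wide spans most of the plausible range of body fat for adult men, so coverage alone says little about how useful an individual prediction is.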

III. Summary of Findings

There is a moderately strong linear relationship between Siri percent body fat and the predictor variables: weight (lbs), abdomen circumference (cm), forearm circumference (cm), and wrist circumference (cm). The fitted model is given by

\[\hat{Y} = -38.81221 - 0.18408X_{2} + 1.04291X_{6} + 0.98792X_{12} - 1.83990X_{13}\]

with \(\hat\sigma^2\) = MSE = 17.8 and \(R^2=0.731\). One outlying observation was identified but is included in the model after verifying the results are robust to its inclusion. Residual analysis reveals the model conditions for normality and independence are satisfied. Validation using cases 151-252 indicates the coefficient for forearm circumference is not significant when the model is refit to the validation set, but the model is retained because its fit is not seriously altered (\(R^2=0.7727\) and MSE = 18.7). The mean squared error may underestimate the variability in the predictions made by the model, possibly due to the effects of multicollinearity among predictor variables (MSPR = 48.3 exceeds MSE = 17.8). The model accounts for 73.1% of the variation in Siri’s percent body fat and has predictive power for men with a typical estimation error between 4.221% and 6.95% body fat (based on \(\sqrt{MSE}\)=4.221% and \(\sqrt{MSPR}\)=6.95%).

IV. General Discussion

High variability makes this model unsatisfactory as a predictor. The MSPR indicates a typical prediction error of almost 7% body fat, which may reflect multicollinearity among the covariates. There is a big difference between a man who has 20% body fat and one who has 27% body fat. Even a 4% error, based on the MSE, is a big difference and renders this model ineffective as a diagnostic tool for weight control and other health purposes. Other models exist that incorporate ratios of body measurements and might be more effective. A modeling method that better handles multicollinearity is needed, along with a stronger model fit (higher \(R^2\)), for the model to be helpful and effective as an actual diagnostic tool.

V. References

Johnson, R.W., “Fitting Percentage of Body Fat to Simple Body Measurements”, Journal of Statistics Education, 1996, v.4, n.1.

Wood, R. J. (2008). Complete Guide to Fitness Testing. [online] Available at: http://www.topendsports.com/testing/. [Accessed 9 December 2017].

Kutner, M.H., Nachtsheim, C.J., Neter, J., Li, W. (2005), Applied Linear Statistical Models, fifth edition, McGraw-Hill Irwin, New York, NY.

VI. Appendix: Statistical Analysis and Model Selection

1. \(C_p\) Model

In Section II.A., the preliminary procedure included analyzing the dataset for inconsistencies, diagnosing the predictor variables, assessing multicollinearity, and using automated model selection criteria to identify potential models for analysis. Mallows’ \(C_p\) favors a model with 6 predictor variables, whose \(C_p\) value of 7.189791 is the closest to its number of parameters (p=7). The six predictor variables in the \(C_p\) choice model are X1, X2, X6, X10, X12, and X13 (age, weight, abdomen, ankle, forearm and wrist). Although this model has the best Mallows’ \(C_p\) value, the coefficient for ankle is not significant at the 0.05 level of significance. This model was evaluated first, but one variance inflation factor exceeded 10 (10.838874, for weight), and the model was rejected based on high multicollinearity of covariates.

\[\hat{Y} = -37.39656 + 0.08436X_{1} - 0.15810X_{2} + 0.95909X_{6} + 0.43516X_{10} + 1.19789X_{12} - 2.83629X_{13}\]

## 
## Call:
## lm(formula = Y ~ X1 + X2 + (X6) + (X10) + X12 + X13, data = data1, 
##     subset = 1:150)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3757 -2.9911  0.0941  3.4037  8.1467 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -37.39656   10.43232  -3.585 0.000462 ***
## X1            0.08436    0.03641   2.317 0.021913 *  
## X2           -0.15810    0.03910  -4.044 8.57e-05 ***
## X6            0.95909    0.08538  11.234  < 2e-16 ***
## X10           0.43516    0.23175   1.878 0.062454 .  
## X12           1.19789    0.29746   4.027 9.13e-05 ***
## X13          -2.83629    0.67958  -4.174 5.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.131 on 143 degrees of freedom
## Multiple R-squared:  0.7459, Adjusted R-squared:  0.7353 
## F-statistic: 69.98 on 6 and 143 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: Y
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## X1          1  921.70  921.70  54.0053 1.407e-11 ***
## X2          1 2951.10 2951.10 172.9138 < 2.2e-16 ***
## X6          1 2873.74 2873.74 168.3808 < 2.2e-16 ***
## X10         1   15.07   15.07   0.8833   0.34890    
## X12         1  106.63  106.63   6.2480   0.01356 *  
## X13         1  297.29  297.29  17.4191 5.182e-05 ***
## Residuals 143 2440.57   17.07                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##              siri         age     weight     abdom       ankle    forearm
## siri    1.0000000  0.30975731  0.5375537 0.7810216  0.21184691  0.3566446
## age     0.3097573  1.00000000 -0.0515724 0.2007450 -0.09285554 -0.1063716
## weight  0.5375537 -0.05157240  1.0000000 0.8758668  0.56015652  0.6893100
## abdom   0.7810216  0.20074497  0.8758668 1.0000000  0.39932848  0.5383186
## ankle   0.2118469 -0.09285554  0.5601565 0.3993285  1.00000000  0.3795415
## forearm 0.3566446 -0.10637163  0.6893100 0.5383186  0.37954151  1.0000000
## wrist   0.2551109  0.20186895  0.7033990 0.5709633  0.50251421  0.6768877
##             wrist
## siri    0.2551109
## age     0.2018690
## weight  0.7033990
## abdom   0.5709633
## ankle   0.5025142
## forearm 0.6768877
## wrist   1.0000000

##        X1        X2        X6       X10       X12       X13 
##  1.830901 10.838874  6.909212  1.607782  2.409589  3.232682

2. \(C_p\) Model with Logarithm Transformations

The partial residual plots from \(C_p\) model 1 show that most of the data in the plots for X2, X6 and X10 is grouped on the left, at lower values of the predictor variables. The large outlier clearly influences the plots, but a log transformation of these variables may improve the model. Natural logs were taken of those three predictor variables, and the summary output was checked for improvements to the model. The best model (evaluation statistics: highest \(R^2\), lowest MSE, lowest VIF, and most linear partial residual plots) resulted when the natural logarithm was taken of only X10 (ankle) and X6 (abdomen). In fact, this model had better evaluation statistics than models 1, 3, 4 or 5. All conditions were satisfied: no influential points, normality of residuals, and independence of error terms. This model was nevertheless rejected: when it was fit to the validation set, three coefficients that had been significant at the 0.05 level in the building set (X1, X2 and X12) were no longer significant (p > 0.25 for each).

\[\hat{Y} = -353.07922 + 0.08552X_{1} - 0.11442X_{2} + 84.96128\ln X_{6} + 10.32964\ln X_{10} + 0.79615X_{12} - 2.74954X_{13}\]
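
Because abdomen and ankle enter this model through natural logarithms, their coefficients are read on a multiplicative scale rather than per centimetre. As a worked illustration (holding the other predictors fixed and using the building-set estimate for abdomen), a 1% increase in abdomen circumference changes the predicted percent body fat by roughly

\[\Delta\hat{Y} = 84.96128\,\ln(1.01) \approx 84.96128 \times 0.00995 \approx 0.85\]

percentage points. By contrast, in the untransformed model 1 each additional centimetre of abdomen adds a fixed 0.95909 points, whatever the starting circumference.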

## 
## Call:
## lm(formula = Y ~ X1 + X2 + log(X6) + log(X10) + X12 + X13, data = data1, 
##     subset = 1:150)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4103 -2.8777  0.0344  3.0018  8.1109 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -353.07922   37.10881  -9.515  < 2e-16 ***
## X1             0.08552    0.03601   2.375  0.01889 *  
## X2            -0.11442    0.03516  -3.254  0.00142 ** 
## log(X6)       84.96128    7.40813  11.469  < 2e-16 ***
## log(X10)      10.32964    5.97241   1.730  0.08587 .  
## X12            0.79615    0.29532   2.696  0.00786 ** 
## X13           -2.74954    0.67854  -4.052  8.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.091 on 143 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7405 
## F-statistic: 71.85 on 6 and 143 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: Y
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## X1          1  921.70  921.70  55.0829 9.461e-12 ***
## X2          1 2951.10 2951.10 176.3642 < 2.2e-16 ***
## log(X6)     1 3037.78 3037.78 181.5442 < 2.2e-16 ***
## log(X10)    1   12.21   12.21   0.7295    0.3945    
## X12         1   15.75   15.75   0.9410    0.3337    
## X13         1  274.75  274.75  16.4197 8.295e-05 ***
## Residuals 143 2392.82   16.73                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##              siri         age     weight     abdom       ankle    forearm
## siri    1.0000000  0.30975731  0.5375537 0.7810216  0.21184691  0.3566446
## age     0.3097573  1.00000000 -0.0515724 0.2007450 -0.09285554 -0.1063716
## weight  0.5375537 -0.05157240  1.0000000 0.8758668  0.56015652  0.6893100
## abdom   0.7810216  0.20074497  0.8758668 1.0000000  0.39932848  0.5383186
## ankle   0.2118469 -0.09285554  0.5601565 0.3993285  1.00000000  0.3795415
## forearm 0.3566446 -0.10637163  0.6893100 0.5383186  0.37954151  1.0000000
## wrist   0.2551109  0.20186895  0.7033990 0.5709633  0.50251421  0.6768877
##             wrist
## siri    0.2551109
## age     0.2018690
## weight  0.7033990
## abdom   0.5709633
## ankle   0.5025142
## forearm 0.6768877
## wrist   1.0000000

##       X1       X2  log(X6) log(X10)      X12      X13 
## 1.827471 8.941582 5.818205 1.715083 2.422356 3.287172

## 
## Call:
## lm(formula = Y ~ X1 + X2 + log(X6) + log(X10) + X12 + X13, data = datav, 
##     subset = 151:252)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1626  -2.7006  -0.7253   2.9225   9.5571 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.082e+02  5.471e+01  -5.633 1.80e-07 ***
## X1           5.591e-03  4.572e-02   0.122   0.9029    
## X2          -5.450e-02  4.851e-02  -1.123   0.2641    
## log(X6)      8.650e+01  1.050e+01   8.237 9.53e-13 ***
## log(X10)    -9.960e+00  1.181e+01  -0.844   0.4010    
## X12          1.641e-01  2.316e-01   0.709   0.4803    
## X13         -1.553e+00  8.063e-01  -1.926   0.0571 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.376 on 95 degrees of freedom
## Multiple R-squared:  0.7715, Adjusted R-squared:  0.757 
## F-statistic: 53.44 on 6 and 95 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: Y
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## X1         1  561.5   561.5  29.3176 4.610e-07 ***
## X2         1 3883.1  3883.1 202.7341 < 2.2e-16 ***
## log(X6)    1 1564.0  1564.0  81.6532 1.897e-14 ***
## log(X10)   1   56.9    56.9   2.9684   0.08816 .  
## X12        1    5.5     5.5   0.2849   0.59473    
## X13        1   71.1    71.1   3.7102   0.05707 .  
## Residuals 95 1819.6    19.2                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] 45.70758

## Analysis of Variance Table
## 
## Response: Y
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## X1          1  921.70  921.70  55.0829 9.461e-12 ***
## X2          1 2951.10 2951.10 176.3642 < 2.2e-16 ***
## log(X6)     1 3037.78 3037.78 181.5442 < 2.2e-16 ***
## log(X10)    1   12.21   12.21   0.7295    0.3945    
## X12         1   15.75   15.75   0.9410    0.3337    
## X13         1  274.75  274.75  16.4197 8.295e-05 ***
## Residuals 143 2392.82   16.73                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3. BIC Model Choice

The BIC choice is the selected model. Its covariates are weight, abdomen, forearm, and wrist, with no logarithm transformations. The MSE is 17.8, the residual standard error is 4.221 percentage points of body fat, and \(R^2\) is 0.731. These evaluation statistics are slightly worse than those of model 2 (MSE 16.73, \(R^2\) 0.7509), but the model has fewer covariates, making it more parsimonious. All variance inflation factors were less than 10, with the highest at 6.700226. Partial residual plots for each covariate showed linear patterns with no curvature and constant variance, indicating that all four covariates are appropriate for the model. All conditions were satisfied: no influential points, normality of residuals, and independence of error terms. The validation model yielded coefficients that closely aligned with those of the original model, except for \(\beta_3\) (forearm), which is not significant in the validation dataset. Even with this discrepancy, model 3 provides the best fit of the models identified and checked. Note that the strongest linear patterns are found in the partial residual plots for weight and abdomen. In both plots most of the data is concentrated at lower values of the covariate, with one or two large outliers, so model 4 is evaluated to see whether transforming abdomen improves the BIC model.

\[\hat{Y} = -38.81221 - 0.18408X_{2} + 1.04291X_{6} + 0.98792X_{12} - 1.83990X_{13}\]
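
As a rough illustration of how this fitted equation would be applied, the sketch below evaluates it in R for one set of scale and tape measurements. The helper name `predict_siri_bic` and the example measurements are hypothetical, chosen only for illustration; the coefficients are the building-set estimates reported in the summary output for this model.

```r
# Coefficients of the BIC model from the building-set fit.
bic_coef <- c(intercept = -38.81221,
              weight    = -0.18408,  # X2, pounds
              abdomen   =  1.04291,  # X6, cm
              forearm   =  0.98792,  # X12, cm
              wrist     = -1.83990)  # X13, cm

# Hypothetical helper: predicted Siri percent body fat for given measurements.
predict_siri_bic <- function(weight, abdomen, forearm, wrist) {
  unname(bic_coef["intercept"] +
           bic_coef["weight"]  * weight +
           bic_coef["abdomen"] * abdomen +
           bic_coef["forearm"] * forearm +
           bic_coef["wrist"]   * wrist)
}

# Made-up subject: 180 lb, 90 cm abdomen, 28 cm forearm, 18 cm wrist.
pred <- predict_siri_bic(weight = 180, abdomen = 90, forearm = 28, wrist = 18)
round(pred, 2)  # 16.46
```

A point prediction like this carries the model's residual standard error of about 4.2 percentage points, so it is best read as an estimate with a fairly wide margin rather than an exact value.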

## 
## Call:
## lm(formula = Y ~ X2 + X6 + X12 + X13, data = data1, subset = 1:150)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0715 -2.8861 -0.2272  3.3524  8.7253 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -38.81221    9.44681  -4.108 6.63e-05 ***
## X2           -0.18408    0.03141  -5.860 2.99e-08 ***
## X6            1.04291    0.07018  14.861  < 2e-16 ***
## X12           0.98792    0.29463   3.353  0.00102 ** 
## X13          -1.83990    0.58695  -3.135  0.00208 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.221 on 145 degrees of freedom
## Multiple R-squared:  0.731,  Adjusted R-squared:  0.7236 
## F-statistic: 98.51 on 4 and 145 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## X2          1 2775.8  2775.8 155.7672 < 2.2e-16 ***
## X6          1 3969.5  3969.5 222.7488 < 2.2e-16 ***
## X12         1  101.8   101.8   5.7121  0.018134 *  
## X13         1  175.1   175.1   9.8261  0.002083 ** 
## Residuals 145 2583.9    17.8                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
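
The evaluation statistics quoted for this model can be recovered from the ANOVA table alone. The short R sketch below does so using the sums of squares printed above; the variable names are illustrative.

```r
# Sequential sums of squares from the ANOVA table of the BIC model.
ss_terms <- c(X2 = 2775.8, X6 = 3969.5, X12 = 101.8, X13 = 175.1)
sse      <- 2583.9   # residual sum of squares
df_resid <- 145      # residual degrees of freedom

ssr <- sum(ss_terms)      # regression sum of squares
sst <- ssr + sse          # total sum of squares
mse <- sse / df_resid     # mean squared error
rse <- sqrt(mse)          # residual standard error
r2  <- ssr / sst          # multiple R-squared

# Adjusted R-squared uses n - 1 = df_resid + number of predictors.
adj_r2 <- 1 - mse / (sst / (df_resid + length(ss_terms)))

round(c(MSE = mse, RSE = rse, R2 = r2, adjR2 = adj_r2), 4)
```

Up to rounding, these agree with the MSE of 17.8, residual standard error of 4.221, \(R^2\) of 0.731 and adjusted \(R^2\) of 0.7236 reported in the summary output.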

##              siri        age     abdom    forearm     wrist
## siri    1.0000000  0.3097573 0.7810216  0.3566446 0.2551109
## age     0.3097573  1.0000000 0.2007450 -0.1063716 0.2018690
## abdom   0.7810216  0.2007450 1.0000000  0.5383186 0.5709633
## forearm 0.3566446 -0.1063716 0.5383186  1.0000000 0.6768877
## wrist   0.2551109  0.2018690 0.5709633  0.6768877 1.0000000

##       X2       X6      X12      X13 
## 6.700226 4.470871 2.264077 2.309599
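
For reference, the variance inflation factor of predictor \(k\) is \(1/(1 - R_k^2)\), where \(R_k^2\) comes from regressing that predictor on the others; a VIF of 6.700226 for X2 therefore corresponds to an \(R_k^2\) of about 0.85. The sketch below computes VIFs two equivalent ways in base R. It uses simulated predictors, not the study data, and all variable names are illustrative assumptions.

```r
set.seed(42)
n <- 150

# Simulated, correlated predictors standing in for tape measurements.
size    <- rnorm(n)
weight  <- 180 + 25 * size + rnorm(n, sd = 10)
abdomen <-  92 +  9 * size + rnorm(n, sd = 5)
wrist   <-  18 + 0.7 * size + rnorm(n, sd = 0.6)
X <- cbind(weight, abdomen, wrist)

# VIF from the definition: regress each predictor on the others.
vif_by_regression <- sapply(seq_len(ncol(X)), function(k) {
  r2_k <- summary(lm(X[, k] ~ X[, -k]))$r.squared
  1 / (1 - r2_k)
})
names(vif_by_regression) <- colnames(X)

# Equivalent shortcut: diagonal of the inverse correlation matrix.
vif_by_inverse <- diag(solve(cor(X)))

round(rbind(vif_by_regression, vif_by_inverse), 3)
```

The two rows should match, which is a convenient check when a packaged VIF function is not at hand.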

4. BIC Model with Logarithm Transformation

The BIC model with ln(abdomen) was checked to see whether the logarithm transformation improved the fit of model 3. The coefficient for forearm (X12) was not significant in the building model at the 0.05 level (p = 0.051), so model 4 was rejected.

\[\hat{Y} = -359.48456 - 0.13859X_{2} + 92.73149\ln X_{6} + 0.56747X_{12} - 1.74040X_{13}\]

## 
## Call:
## lm(formula = Y ~ X2 + log(X6) + X12 + X13, data = data1, subset = 1:150)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5474 -3.0105 -0.3153  3.1488  8.2617 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -359.48456   26.44256 -13.595  < 2e-16 ***
## X2            -0.13859    0.02856  -4.853  3.1e-06 ***
## log(X6)       92.73149    6.12328  15.144  < 2e-16 ***
## X12            0.56747    0.28840   1.968  0.05102 .  
## X13           -1.74040    0.58072  -2.997  0.00321 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.173 on 145 degrees of freedom
## Multiple R-squared:  0.7371, Adjusted R-squared:  0.7299 
## F-statistic: 101.6 on 4 and 145 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## X2          1 2775.8  2775.8 159.3807 < 2.2e-16 ***
## log(X6)     1 4133.3  4133.3 237.3261 < 2.2e-16 ***
## X12         1   15.2    15.2   0.8705  0.352361    
## X13         1  156.4   156.4   8.9817  0.003209 ** 
## Residuals 145 2525.4    17.4                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##       X2  log(X6)      X12      X13 
## 5.665554 3.819082 2.219577 2.313258

##              siri        age     abdom    forearm     wrist
## siri    1.0000000  0.3097573 0.7810216  0.3566446 0.2551109
## age     0.3097573  1.0000000 0.2007450 -0.1063716 0.2018690
## abdom   0.7810216  0.2007450 1.0000000  0.5383186 0.5709633
## forearm 0.3566446 -0.1063716 0.5383186  1.0000000 0.6768877
## wrist   0.2551109  0.2018690 0.5709633  0.6768877 1.0000000

5. Model with Weight, Abdomen, and Wrist

Since the BIC model, chosen by automatic model selection, shows \(\beta_3\) (forearm) to be insignificant in the validation model, a more parsimonious model with only three predictor variables was checked: weight, abdomen circumference and wrist circumference. This model meets all diagnostic conditions, has low variance inflation factors, and is significant for all covariates upon cross-validation. However, it is less accurate than the BIC model overall, with higher MSE (19.3 vs 17.8) and lower \(R^2\) (0.7256 vs 0.731). The MSPR is slightly lower for the three-variable model (46.92458 vs 48.32978), but not by much. Since both the validation and building sets have higher MSE and lower \(R^2\) than the BIC model, this model is rejected.
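
The MSPR comparison can be sketched as follows, assuming the usual definition: the mean of squared differences between validation-set responses and the predictions of the building-set model. The data here are simulated rather than the study data, and the column names and true coefficients are made up for illustration.

```r
set.seed(2017)
n <- 252

# Simulated stand-in data; not the body fat dataset.
sim <- data.frame(abdomen = rnorm(n, mean = 92, sd = 10),
                  wrist   = rnorm(n, mean = 18, sd = 1))
sim$siri <- -30 + 0.6 * sim$abdomen - 0.5 * sim$wrist + rnorm(n, sd = 4)

build    <- sim[1:150, ]    # building set
validate <- sim[151:252, ]  # validation set

fit <- lm(siri ~ abdomen + wrist, data = build)

# MSE of the building fit: SSE divided by residual degrees of freedom.
mse <- sum(residuals(fit)^2) / df.residual(fit)

# MSPR: mean squared prediction error on the held-out rows.
pred <- predict(fit, newdata = validate)
mspr <- mean((validate$siri - pred)^2)

c(MSE = mse, MSPR = mspr)
```

An MSPR reasonably close to the building-set MSE suggests that the MSE is a fair indication of the model's predictive ability.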

## 
## Call:
## lm(formula = Y ~ X2 + X6 + X13, data = data1, subset = 1 - 150)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0167  -3.2322  -0.2315   3.2524   9.7818 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -27.38034    6.84803  -3.998 8.43e-05 ***
## X2           -0.11355    0.02368  -4.796 2.80e-06 ***
## X6            0.97194    0.05628  17.269  < 2e-16 ***
## X13          -1.26690    0.43707  -2.899  0.00409 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.395 on 247 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.7223 
## F-statistic: 217.7 on 3 and 247 DF,  p-value: < 2.2e-16
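
One detail of the call above is worth checking when reproducing this fit. In R, `subset = 1 - 150` is arithmetic rather than a range: it evaluates to `-149`, which drops row 149 and keeps every other row. That is consistent with the 247 residual degrees of freedom reported here (251 observations, 4 coefficients). A fit restricted to rows 1 to 150 would use `1:150`. The sketch below shows the difference on simulated data; the data frame `toy` is hypothetical.

```r
set.seed(1)

# Hypothetical toy data with 252 rows, standing in for the study data.
toy <- data.frame(x = rnorm(252))
toy$y <- 2 * toy$x + rnorm(252)

1 - 150  # arithmetic: evaluates to -149

fit_arith <- lm(y ~ x, data = toy, subset = 1 - 150)  # drops row 149 only
fit_range <- lm(y ~ x, data = toy, subset = 1:150)    # keeps rows 1 to 150

c(arith = nobs(fit_arith), range = nobs(fit_range))   # 251 and 150
```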

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## X2          1 6460.5  6460.5 334.4843 < 2.2e-16 ***
## X6          1 5992.8  5992.8 310.2666 < 2.2e-16 ***
## X13         1  162.3   162.3   8.4019  0.004085 ** 
## Residuals 247 4770.8    19.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##       X2       X6      X13 
## 6.256308 4.744260 2.146121

##            siri       age     abdom     wrist
## siri  1.0000000 0.2914584 0.8134323 0.3465749
## age   0.2914584 1.0000000 0.2304094 0.2135306
## abdom 0.8134323 0.2304094 1.0000000 0.6198324
## wrist 0.3465749 0.2135306 0.6198324 1.0000000

##    siri age abdom wrist
## 39 35.2  46 148.1  21.4

## 
## Call:
## lm(formula = Y ~ X2 + X6 + X13, subset = -39)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7726 -3.1691 -0.5099  3.3201  9.8752 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -27.27455    6.68387  -4.081 6.07e-05 ***
## X2           -0.09234    0.02410  -3.832 0.000161 ***
## X6            0.96151    0.05517  17.427  < 2e-16 ***
## X13          -1.42537    0.43085  -3.308 0.001079 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.305 on 247 degrees of freedom
## Multiple R-squared:  0.7357, Adjusted R-squared:  0.7325 
## F-statistic: 229.2 on 3 and 247 DF,  p-value: < 2.2e-16

## (Intercept)          X2          X6         X13 
## -27.3803440  -0.1135505   0.9719414  -1.2668970

##  (Intercept)           X2           X6          X13 
## -27.27454648  -0.09233657   0.96151335  -1.42536801

## [1] 4.394874

## [1] 4.305255

## 
## Call:
## lm(formula = Y ~ X2 + X6 + X13, data = data1, subset = 151 - 
##     252)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0208  -3.2365  -0.2176   3.2610   9.8018 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -27.99692    6.82807  -4.100 5.60e-05 ***
## X2           -0.11525    0.02372  -4.858 2.11e-06 ***
## X6            0.97669    0.05630  17.348  < 2e-16 ***
## X13          -1.24182    0.43684  -2.843  0.00485 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.399 on 247 degrees of freedom
## Multiple R-squared:  0.728,  Adjusted R-squared:  0.7247 
## F-statistic: 220.4 on 3 and 247 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## X2          1 6590.7  6590.7 340.535 < 2.2e-16 ***
## X6          1 6047.3  6047.3 312.460 < 2.2e-16 ***
## X13         1  156.4   156.4   8.081  0.004847 ** 
## Residuals 247 4780.4    19.4                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##       X2       X6      X13 
## 6.293537 4.778814 2.155720

## [1] 46.92458

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq  F value    Pr(>F)    
## X2          1 6460.5  6460.5 334.4843 < 2.2e-16 ***
## X6          1 5992.8  5992.8 310.2666 < 2.2e-16 ***
## X13         1  162.3   162.3   8.4019  0.004085 ** 
## Residuals 247 4770.8    19.3                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

| \(\beta_k\) | Building Set | Validation Set |
|---|---|---|
| \(\beta_0\) (intercept) | -38.81221 | -33.41258 |
| \(\beta_1\) (weight, X2) | -0.18408 | -0.08243 |
| \(\beta_2\) (abdomen, X6) | 1.04291 | 0.94887 |
| \(\beta_3\) (forearm, X12) | 0.98792 | 0.17697 |
| \(\beta_4\) (wrist, X13) | -1.83990 | -1.43883 |
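
To make the comparison in this table easier to scan, the sketch below computes the percentage change of each validation-set coefficient relative to its building-set value. The term labels are an assumption based on the building-set column matching the BIC model's coefficients in order.

```r
# Coefficients copied from the table above; term labels are assumed.
coef_tab <- data.frame(
  term       = c("intercept", "weight", "abdomen", "forearm", "wrist"),
  building   = c(-38.81221, -0.18408, 1.04291, 0.98792, -1.83990),
  validation = c(-33.41258, -0.08243, 0.94887, 0.17697, -1.43883)
)

# Change in the validation estimate as a percentage of the building estimate.
coef_tab$pct_change <- round(
  100 * (coef_tab$validation - coef_tab$building) / abs(coef_tab$building), 1)

coef_tab
```

The largest relative shift is in the forearm coefficient, which is the term noted earlier as not significant in the validation set.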

Tami Elsey

December 5, 2017
