Introduction:
The data in my project consists of 250 observations of men: their age, body fat percentage, and various body part measurements, recorded for the BYU Human Performance Research Center. There are over a dozen variables, most of which are measurements of specific body parts; density, body fat percentage, age, and weight are not tied to any one body part. The problem I am setting out to solve is to build the best model for predicting body fat percentage from the other variables.
Below is a preview of the dataset.
library(readr)
library(mosaic)
BodyFat <- read_csv("BodyFat - Sheet1.csv")
head(BodyFat)
Split Data: For the LASSO regression this project builds toward, I split the data 200-50: I run the analysis on the 200 training observations, then apply the resulting model to the held-out 50 test observations to see how well it generalizes.
## 80% of the sample size
smp_size <- floor(0.80 * nrow(BodyFat))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(BodyFat)), size = smp_size)
train <- BodyFat[train_ind, ]
test <- BodyFat[-train_ind, ]
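A quick check (not part of the original output) confirms the intended 200/50 split, assuming the file really has 250 rows:
# Sizes of the training and test sets
nrow(train) # expected: 200
nrow(test)  # expected: 50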
Below is the list of variables, followed by a short description, a histogram, and summary statistics for each.
The Variables: All variables are quantitative.
Density, Pct.BF (body fat percentage), Age, Weight, Height, Neck, Chest, Abdomen, Waist, Hip, Thigh, Knee, Ankle, Bicep, Forearm, Wrist
Density: Density of the body
histogram(~Density, data = train)
favstats(~Density, data = train)
Pct.BF: Percentage of the body that is fat
histogram(~Pct.BF, data = train)
favstats(~Pct.BF, data = train)
Age: Age in years
histogram(~Age, data = train)
favstats(~Age, data = train)
Weight: Weight in pounds
histogram(~Weight, data = train)
favstats(~Weight, data = train)
Height: Height in inches
histogram(~Height, data = train)
favstats(~Height, data = train)
Neck: Circumference of neck in cm
histogram(~Neck, data = train)
favstats(~Neck, data = train)
Chest: Circumference of chest in cm
histogram(~Chest, data = train)
favstats(~Chest, data = train)
Abdomen: Circumference of abdomen in cm
histogram(~Abdomen, data = train)
favstats(~Abdomen, data = train)
Waist: Circumference of waist in inches
histogram(~Waist, data = train)
favstats(~Waist, data = train)
Hip: Circumference of hip in cm
histogram(~Hip, data = train)
favstats(~Hip, data = train)
Thigh: Circumference of thigh in cm
histogram(~Thigh, data = train)
favstats(~Thigh, data = train)
Knee: Circumference of knee in cm
histogram(~Knee, data = train)
favstats(~Knee, data = train)
Ankle: Circumference of ankle in cm
histogram(~Ankle, data = train)
favstats(~Ankle, data = train)
Bicep: Circumference of bicep in cm
histogram(~Bicep, data = train)
favstats(~Bicep, data = train)
Forearm: Circumference of forearm in cm
histogram(~Forearm, data = train)
favstats(~Forearm, data = train)
Wrist: Circumference of wrist in cm
histogram(~Wrist, data = train)
favstats(~Wrist, data = train)
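The per-variable calls above all follow the same pattern, so they could also be generated in a loop; a minimal sketch using the same mosaic functions:
# Print a histogram and favstats for every (quantitative) column,
# equivalent to the one-variable-at-a-time calls above.
for (v in names(train)) {
  print(histogram(~ train[[v]], xlab = v))
  print(favstats(train[[v]]))
}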
A simple way to analyze the data is a multiple linear regression model, with Pct.BF as the response variable and every other variable as a predictor.
library(car)
model1b <- lm(Pct.BF ~ . - Waist, data = train)
summary(model1b)
##
## Call:
## lm(formula = Pct.BF ~ . - Waist, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2198 -0.4715 -0.1500 0.3065 14.6196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.453e+02 1.401e+01 31.780 <2e-16 ***
## Density -4.075e+02 1.025e+01 -39.763 <2e-16 ***
## Age 2.007e-02 1.167e-02 1.720 0.0872 .
## Weight 1.818e-02 2.458e-02 0.740 0.4603
## Height -8.042e-03 6.880e-02 -0.117 0.9071
## Neck -2.961e-02 8.548e-02 -0.346 0.7295
## Chest 3.266e-02 3.908e-02 0.836 0.4045
## Abdomen 7.414e-03 3.968e-02 0.187 0.8520
## Hip 2.850e-02 5.233e-02 0.545 0.5867
## Thigh 3.736e-03 5.458e-02 0.068 0.9455
## Knee -2.396e-02 8.923e-02 -0.269 0.7886
## Ankle -9.822e-02 9.084e-02 -1.081 0.2810
## Bicep -8.164e-02 6.342e-02 -1.287 0.1996
## Forearm 1.642e-02 7.730e-02 0.212 0.8320
## Wrist -1.122e-02 1.958e-01 -0.057 0.9544
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.386 on 185 degrees of freedom
## Multiple R-squared: 0.9748, Adjusted R-squared: 0.9729
## F-statistic: 511.2 on 14 and 185 DF, p-value: < 2.2e-16
vif(model1b)
## Density Age Weight Height Neck Chest Abdomen Hip
## 3.952374 2.364334 46.325525 3.542433 4.061391 10.788148 17.764789 12.106861
## Thigh Knee Ankle Bicep Forearm Wrist
## 7.327146 4.423883 2.085117 3.526008 2.567187 3.395943
Waist has been removed as a predictor because it breaks the regression: Waist is recorded in inches while Abdomen is in cm, so Waist appears to be just Abdomen converted to inches, and lm() reports a perfectly collinear variable like this as aliased (an NA coefficient).
Our VIF results show that many of the variables are highly correlated. By the common rule of thumb that a VIF above 10 signals serious multicollinearity, Weight (46.3), Abdomen (17.8), Hip (12.1), and Chest (10.8) are all problematic.
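To pin down exactly why Waist breaks the model, base R's alias() reports exact linear dependencies among predictors; a sketch (if Waist is indeed Abdomen converted to inches, the dependency shows up under $Complete):
# Refit with Waist included and list perfectly collinear terms
full_model <- lm(Pct.BF ~ ., data = train)
alias(full_model)$Complete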
Methodology: LASSO regression stands for Least Absolute Shrinkage and Selection Operator. It is designed for datasets with a large number of multicollinear predictors. The technique is very similar to ordinary multiple linear regression: it takes multiple predictors and produces a regression model with a coefficient for each. It also relies on the same core assumptions as multiple linear regression: linearity, homoskedasticity, and normality of the errors. It does not, however, require the predictors to be independent of one another; handling correlated predictors is the whole point of using LASSO here.
A LASSO regression reduces the damage from multicollinearity by introducing a small amount of bias in exchange for a large reduction in variance. It shrinks the coefficients of variables it deems unnecessary all the way to 0, leaving the regression with only the variables that have the largest effect on the response.
With all of these body measurements as predictors, it makes sense that some are multicollinear: a man with a higher body fat percentage is likely to have larger measurements on many body parts, and measurements of neighboring body parts tend to rise and fall together.
In an ordinary multiple linear regression, the coefficients are derived by the least squares method, which minimizes the residual sum of squares (RSS). LASSO minimizes the RSS plus a penalty on the total absolute size of the coefficients, as shown below.
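In symbols, with n observations and p predictors, LASSO finds the coefficients

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

When lambda = 0 this is ordinary least squares; as lambda grows, the penalty forces more coefficients exactly to 0. The glmnet argument alpha = 1 used below selects this L1 penalty.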
Results and conclusions:
y <- train$Pct.BF
x <- data.matrix(train[, c('Density', 'Age', 'Weight', 'Height', 'Neck', 'Chest', 'Abdomen', 'Hip', 'Thigh', 'Knee', 'Ankle', 'Bicep', 'Forearm', 'Wrist')])
library(glmnet)
crossmod <- cv.glmnet(x, y, alpha = 1)
best_lambda <- crossmod$lambda.min
best_lambda
## [1] 0.104411
plot(crossmod)
This is the cross-validation plot and the value of the tuning parameter lambda used to fit the LASSO model. cv.glmnet() chooses lambda by cross-validation (10-fold by default): my best value, 0.104411, is the lambda that minimizes the cross-validated mean squared error. The value is small, meaning only a light amount of shrinkage is needed here.
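Besides lambda.min, cv.glmnet() stores a more conservative choice, lambda.1se: the largest lambda whose cross-validated error is within one standard error of the minimum. Comparing the two, and plotting the full coefficient path, is a common sanity check (a short sketch using the objects already defined):
# lambda.min minimizes CV error; lambda.1se trades a bit of error
# for a sparser, more heavily shrunk model.
crossmod$lambda.min
crossmod$lambda.1se
# Coefficient paths: each line is one variable's coefficient
# shrinking toward 0 as lambda grows.
plot(glmnet(x, y, alpha = 1), xvar = "lambda", label = TRUE)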
best_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)
coef(best_model)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 4.364121e+02
## Density -4.018356e+02
## Age 7.181078e-03
## Weight .
## Height .
## Neck .
## Chest 2.626213e-02
## Abdomen 4.245969e-02
## Hip .
## Thigh .
## Knee .
## Ankle .
## Bicep .
## Forearm .
## Wrist .
best_model
##
## Call: glmnet(x = x, y = y, alpha = 1, lambda = best_lambda)
##
## Df %Dev Lambda
## 1 4 97.4 0.1044
Our model is:
Pct.BF = 436.4121 - 401.8356(Density) + 0.0071811(Age) + 0.0262621(Chest) + 0.0424597(Abdomen)
The %Dev value of 97.4 is the percentage of deviance explained, which for a Gaussian model like this one is the R^2 on the training data. The LASSO regression therefore fits the training data very well.
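The same quantity is available programmatically on the fitted object as dev.ratio:
# Fraction of null deviance explained; for a Gaussian model this
# equals the training R^2, matching the %Dev column above.
best_model$dev.ratio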
testx <- data.matrix(test[, c('Density', 'Age', 'Weight', 'Height', 'Neck', 'Chest', 'Abdomen', 'Hip', 'Thigh', 'Knee', 'Ankle', 'Bicep', 'Forearm', 'Wrist')])
testy <- test$Pct.BF
y_predicted <- predict(crossmod, s = best_lambda, newx = testx)
sst <- sum((testy - mean(testy))^2)
sse <- sum((y_predicted - testy)^2)
rsq <- 1 - sse/sst
rsq
## [1] 0.9915553
The number here, 0.9915553, is the R^2 of the LASSO model on the testing data: data the model had never seen while being fit. Notably, the test R^2 is even slightly higher than the training R^2, so the model trained on the training data represents the held-out data at least as well as the data it was trained on. That is good evidence against overfitting, though with only 50 test observations some of the difference may simply be chance.
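As a complementary check (not part of the original analysis), the test RMSE puts the prediction error on the scale of Pct.BF itself:
# Root mean squared error of the LASSO predictions on the test set,
# in percentage points of body fat.
rmse <- sqrt(mean((y_predicted - testy)^2))
rmse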
Discussion and critique: I learned that Density, Age, Chest, and Abdomen are the strongest indicators of body fat percentage: Density with a large negative coefficient (denser bodies carry less fat) and the other three with small positive coefficients. I think my analysis is strong for the data I was given. The model is trained and tested only on men, so I cannot say whether it applies to women. Additionally, the dataset is small, so I am not sure how well the model would hold up on a larger sample or on a sample from another country. I found LASSO regression very interesting and useful for this problem, and I think it was the right statistical model to apply.