Regression analysis

About the data

Estimated percentage of body fat determined by underwater weighing and body circumference measurements for 252 men.

case - identifier
brozek - bodyfat percentage using Brozek’s equation
siri - bodyfat percentage using Siri’s equation
density - gm/cm3
age - age (years)
weight - weight (kg)
height - height (inches)
neck - neck circumference (cm)
chest - chest circumference (cm)
abdomen - abdomen circumference (cm)
hip - hip circumference (cm)
thigh - thigh circumference (cm)
knee - knee circumference (cm)
ankle - ankle circumference (cm)
biceps - extended biceps circumference (cm)
forearm - forearm circumference (cm)
wrist - wrist circumference (cm)

Plotting circumference of body parts vs percentage of bodyfat

The data ‘bodyfat’ possesses two variables for measuring body fat - brozek and siri. Both of them measure body fat but use different equations - brozek uses Brozek’s equation and siri Siri’s equation. However, there is no point in using both of them. Hence, in this task I chose to use Brozek’s variable.

abdomen_graph <- ggplot(data = bodyfat, aes(abdomen, brozek)) + geom_point(color = "darkblue") +
  labs(y = "% of bodyfat", x = "Circumference in cm") 
chest_graph <- ggplot(data = bodyfat, aes(chest, brozek)) + geom_point(color = "darkblue") +
  labs(y = "% of bodyfat", x = "Circumference in cm")
hip_graph <- ggplot(data = bodyfat, aes(hip, brozek)) + geom_point(color = "darkblue") +
  labs(y = "% of bodyfat", x = "Circumference in cm")
ankle_graph <- ggplot(data = bodyfat, aes(ankle, brozek)) + geom_point(color = "darkblue") + 
  labs(y = "% of bodyfat", x = "Circumference in cm")
biceps_graph <- ggplot(data = bodyfat, aes(biceps, brozek)) + geom_point(color = "darkblue") +
  labs(y = "% of bodyfat", x = "Circumference in cm")
thigh_graph <- ggplot(data = bodyfat, aes(thigh, brozek)) + geom_point(color = "darkblue") +
  labs(y = "% of bodyfat", x = "Circumference in cm")

ggarrange(abdomen_graph, chest_graph, hip_graph, ankle_graph, biceps_graph, thigh_graph,
          labels = c("Abdomen", "Chest", "Hip", "Ankle", "Biceps", "Thigh"))

Simple scatter plots of relationship between a circumference of some body parts and percentage of body fat are created. We can see a relationship in all cases - there is no one plot where the arrangement of points is purely random. In most cases we can clearly see an almost linear positive relationship.

Linear Regression Model

Firstly, let’s see a relationship between all circumferences and percentage of body fat. To do that, a corrplot with a correlation coefficient is shown. Here we can see that variable brozek (% of body fat) have a positive correlation with all other variables. The strongest one (0.81) is with abdomen. It looks like the circumference of a belly can tell most about person’s body fat percentage. The smallest correlation coefficient (0.27) is with an ankle. Hence, at the beginning I will try to find a regression model only using abdomen. Since it is the strongest predictor, the results later should be very similar.

 # We are interested only in these variables
bodyfat_1 <- select(bodyfat, brozek, neck:wrist)

corrplot(cor(bodyfat_1), method = "number", type = "upper", diag =FALSE)

Now, using abdoment as a predictor we use lm() function to create a linear regression model.

lm(brozek ~ abdomen, data = bodyfat)

## 
## Call:
## lm(formula = brozek ~ abdomen, data = bodyfat)
## 
## Coefficients:
## (Intercept)      abdomen  
##    -35.1966       0.5849

abdomen_graph + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

A liner regression line has an equation:

\[ \hat{y} = -35.1966 + 0.5849 * x \]

where y is % of body fat and x circumference of an abdomen.

Now we can create a model to get more data about this linear regression. Using summary(model) we can see again the regression line. Also thanks to R-squared statistics we can see that this model allows us to predict 66.21 % of data - only using abdomen as a predictor. Another way to show regression line - using equation given before and putting it on the plot.

model_abdomen <- lm(brozek ~ abdomen, data = bodyfat)

summary(model_abdomen)

## 
## Call:
## lm(formula = brozek ~ abdomen, data = bodyfat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6257  -3.4672   0.0111   3.1415  11.9754 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35.19661    2.46229  -14.29   <2e-16 ***
## abdomen       0.58489    0.02643   22.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.514 on 250 degrees of freedom
## Multiple R-squared:  0.6621, Adjusted R-squared:  0.6608 
## F-statistic: 489.9 on 1 and 250 DF,  p-value: < 2.2e-16

equation1 = function(x){coef(model_abdomen)[2]*x+coef(model_abdomen)[1]}

ggplot(bodyfat, aes(y = brozek, x = abdomen)) + geom_point() +
        stat_function(fun = equation1, geom = "line", color = "red")

Model Diagnostic

After getting a regression line we have to check if the model is valid. There are four major assumptions the model must meet: linearity, normality of distribution of dependent variables, homoscedasticity - constant variancea and independence - meaning that dependent variable (body fat) does not depend on its previous values. Lastly, we will check the existence of influential observations - outliers or different observations that strongly impact the model.

par(mfrow = c(2, 2))
plot(model_abdomen)

Using the method plot(model) we see four plots that are showing how well the assumptions are satisfied.

1. Residuals vs Fitted

Plot of residuals versus fitted values. Allows us to check two assumptions - linearity and homoscedasticity. Here, we can see that the model is not perfect - there exist one strongly influencing observation (marked as 39) that ruins the line. However, apart from that one point both assumptions seems to be met - the red line is almost horizontal and the points are distributed randomly - there is no sign of heteroscedasticity.

2. Normal Q-Q

Plot that shows normality assumption. The dotted line represents the ‘ideal’ normal observations. Our data, marked as dots are lying around that line, also showing a nice normal distribution. Only three points are marked as having large residuals - and again these are the same as in the first plot.

3. Scale - Location

Another plot to check homoscedasticity. We can see that the value 39 is again strongly influencing the result. However, apart from that point the red line is almost linear.

4. Residuals vs Leverage

Using Cook’s distance we can see that our dataset has points that are influential observations. The value 39 falls below 1 Cook’s distance. Another two marked values, 41 and 216 have big residuals, but fall below Cook’s distance and are not marked as influential.

Fixing the model

Since the model has ifluential observation, we need to rebuild that. Here, we just delete the influential observations using the Cook’s distance. We choose to delete all observations that are outliers or influential observations.

# Removing Outliers
cooksd <- cooks.distance(model_abdomen)
sample_size <- nrow(bodyfat)

influential <- as.numeric(names(cooksd)[(cooksd > (4/sample_size))])
bodyfat2 <- bodyfat[-influential, ]
model_abdomen2 <- lm(brozek ~ abdomen, data = bodyfat2)

summary(model_abdomen2)

## 
## Call:
## lm(formula = brozek ~ abdomen, data = bodyfat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1489  -3.2921   0.1026   2.8451  11.8098 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -37.64991    2.63201  -14.30   <2e-16 ***
## abdomen       0.61278    0.02842   21.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.276 on 243 degrees of freedom
## Multiple R-squared:  0.6567, Adjusted R-squared:  0.6553 
## F-statistic: 464.8 on 1 and 243 DF,  p-value: < 2.2e-16

par(mfrow = c(2, 2))
plot(model_abdomen2)

Now, we can see that all the plots meet the given assumptions. The deletion of 8 observations allows us to create a better regression model that will work for predicting the criteria. The linear regression line has slightly chaged - now it has an equation:

\[ \hat{y} = -37.64991 + 0.61278 * x \]

Also, now our model can predict 65.67% of results using abdoment’s circumference.

Using multivariable regression model

Since we changed our base dataset - to bodyfat2 with deleted 8 influential observations, we will use this set to determine the percentage of body fat.

model1 <- lm(brozek ~ neck + abdomen + thigh + hip + chest + knee + ankle + biceps + forearm + wrist, data  = bodyfat2)

summary(model1)

## 
## Call:
## lm(formula = brozek ~ neck + abdomen + thigh + hip + chest + 
##     knee + ankle + biceps + forearm + wrist, data = bodyfat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0812 -2.8892 -0.1928  2.7251 10.5139 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.09371    6.46746   0.324  0.74643    
## neck        -0.42504    0.21145  -2.010  0.04557 *  
## abdomen      0.94133    0.07070  13.315  < 2e-16 ***
## thigh        0.12638    0.12487   1.012  0.31253    
## hip         -0.32250    0.12514  -2.577  0.01058 *  
## chest       -0.13886    0.08589  -1.617  0.10727    
## knee        -0.09006    0.23324  -0.386  0.69976    
## ankle        0.05649    0.20011   0.282  0.77795    
## biceps       0.15244    0.16016   0.952  0.34219    
## forearm      0.17874    0.19050   0.938  0.34908    
## wrist       -1.27244    0.47754  -2.665  0.00825 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.972 on 234 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7026 
## F-statistic: 58.64 on 10 and 234 DF,  p-value: < 2.2e-16

par(mfrow = c(2, 2))
plot(model1)

Using all measurements: neck, abdomen, thigh, hip, chest, knee, ankle, biceps, forearm and wrist allows us to predict 71.48% of results. The liner function has the form:

\[ \hat{y} = -0.43 * n + 0.94 * a + 0.13 * t - 0.32 * h - 0.14 * c - 0.09 * k + 0.06 * an + 0.15 * b + 0.18 * f - 1.27 * w + 2.09 \] where n - neck, a - abdomen, t - thigh, h - hip, c - chest, k - knee, an - ankle, b - biceps, f - forearm, w - wrist.

F-Statistics and p-value - p-value is < 2.2e-16 => significat, meaning that at least one predicator is related to the outcome variable. We can assume it to be abdomen, as the analysis using only this variable predicted 66% of brozek value. This assumsion is supported by the t value for abdomen - it is equal to 13.315 and it is the highest value. Estimate value - for holding all other variables constant, we can expect an 0.94 unit increase as we increase abdomen by 1.

At the end, we can see once again that using measurements is quite good way to estimate body fat - it gives us 71% accuracy. At the same time, using only abdomen size, we get around 66%.

Regression analysis - Bodyfat