Which body dimensions are the best indicators of gender?
This project explores the relationship between gender and several body dimensions (body girth measurements and skeletal diameter measurements). Logistic regression is used because the outcome is binary and the predictors are quantitative. The dataset was built in 2003 and originally published in the Journal of Statistics Education. It contains 507 cases, 247 of which are male and 260 of which are female. Only physically active individuals were included in this dataset, as obesity, pregnancy, and incapacitating conditions tend to affect body dimensions unpredictably. All measurements of length are in centimeters.
Although the dataset contains 21 body measurements plus four general variables (age, weight, height, and gender), this project uses only the following:
| Variable | Type | Description |
| --- | --- | --- |
| wri_di | Quantitative | Wrist diameter, measured as sum of two wrists |
| kne_di | Quantitative | Knee diameter, measured as sum of two knees |
| ank_di | Quantitative | Ankle diameter, measured as sum of two ankles |
| wai_gi | Quantitative | Waist girth, measured at the narrowest part of torso below the rib cage as average of contracted and relaxed position |
| hip_gi | Quantitative | Hip girth, measured at level of bitrochanteric diameter |
| sex | Categorical | 1 if respondent is male, 0 if female |
“Body measurements of 507 physically active individuals.” OpenIntro. Originally published by Heinz, G. et al. in “Exploring Relationships in Body Dimensions,” Journal of Statistics Education 11(2). Retrieved from https://www.openintro.org/data/index.php?data=bdims on December 5, 2025.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
bdims <- read_csv("D:/DATA 101/Datasets/bdims.csv")
colSums(is.na(bdims)) # no nulls!
## bia_di bii_di bit_di che_de che_di elb_di wri_di kne_di ank_di sho_gi che_gi
## 0 0 0 0 0 0 0 0 0 0 0
## wai_gi nav_gi hip_gi thi_gi bic_gi for_gi kne_gi cal_gi ank_gi wri_gi age
## 0 0 0 0 0 0 0 0 0 0 0
## wgt hgt sex
## 0 0 0
head(bdims, 10)
## # A tibble: 10 × 25
## bia_di bii_di bit_di che_de che_di elb_di wri_di kne_di ank_di sho_gi che_gi
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 42.9 26 31.5 17.7 28 13.1 10.4 18.8 14.1 106. 89.5
## 2 43.7 28.5 33.5 16.9 30.8 14 11.8 20.6 15.1 110. 97
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115. 97.5
## 4 44.3 29.9 34 18.4 28.2 13.9 11.2 20.9 15 104. 97
## 5 42.5 29.9 34 21.5 29.4 15.2 11.6 20.7 14.9 108. 97.5
## 6 43.3 27 31.5 19.6 31.3 14 11.5 18.8 13.9 120. 99.9
## 7 43.5 30 34 21.9 31.7 16.1 12.5 20.8 15.6 124. 107.
## 8 44.4 29.8 33.2 21.8 28.8 15.1 11.9 21 14.6 120. 102.
## 9 43.5 26.5 32.1 15.5 27.5 14.1 11.2 18.9 13.2 111 91
## 10 42 28 34 22.5 28 15.6 12 21.1 15 120. 93.5
## # ℹ 14 more variables: wai_gi <dbl>, nav_gi <dbl>, hip_gi <dbl>, thi_gi <dbl>,
## # bic_gi <dbl>, for_gi <dbl>, kne_gi <dbl>, cal_gi <dbl>, ank_gi <dbl>,
## # wri_gi <dbl>, age <dbl>, wgt <dbl>, hgt <dbl>, sex <dbl>
data05 <- bdims |>
  select(wri_di, kne_di, ank_di, wai_gi, hip_gi, sex) |>
  rename(
    "wrist_diameter" = "wri_di",
    "knee_diameter" = "kne_di",
    "ankle_diameter" = "ank_di",
    "waist_girth" = "wai_gi",
    "hip_girth" = "hip_gi",
    "gender" = "sex"
  )
lgrm <- glm(gender ~ ., data=data05, family="binomial")
summary(lgrm)
##
## Call:
## glm(formula = gender ~ ., family = "binomial", data = data05)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -29.66706 6.62986 -4.475 7.65e-06 ***
## wrist_diameter 2.77070 0.62390 4.441 8.96e-06 ***
## knee_diameter -0.26524 0.33065 -0.802 0.422
## ankle_diameter 2.22222 0.48184 4.612 3.99e-06 ***
## waist_girth 0.48475 0.06864 7.062 1.64e-12 ***
## hip_girth -0.64495 0.09009 -7.159 8.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 702.518 on 506 degrees of freedom
## Residual deviance: 95.767 on 501 degrees of freedom
## AIC: 107.77
##
## Number of Fisher Scoring iterations: 8
This model indicates that wrist diameter, ankle diameter, waist girth, and hip girth are all statistically significant predictors of gender (\(p<0.001\) for each, well below any common significance level). The signs of the coefficients show that, holding the other measurements constant, larger wrist, ankle, and waist measurements are associated with being male, while larger hip girth is associated with being female (a negative coefficient pushes the prediction toward the 0 case instead of the 1 case). Knee diameter is not a significant predictor once the other four measurements are in the model (\(p=0.422\)).
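Written out with the coefficient estimates from the summary above (rounded to three decimals), the fitted model has the following form. Here \(\hat{p}\) is notation introduced for this sketch and denotes the estimated probability that gender equals 1 (male):

\[
\log\frac{\hat{p}}{1-\hat{p}} = -29.667 + 2.771\,\text{wrist\_diameter} - 0.265\,\text{knee\_diameter} + 2.222\,\text{ankle\_diameter} + 0.485\,\text{waist\_girth} - 0.645\,\text{hip\_girth}
\]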
Significant variables:
For each 1 cm increase in wrist diameter, the log-odds of being male increase by 2.77, holding the other measurements constant. The moderate standard error, high z-value, and significant p-value combine to make wrist diameter a strong indicator for identifying men.
For each 1 cm increase in ankle diameter, the log-odds of being male increase by 2.22. The moderate standard error, high z-value, and significant p-value combine to make ankle diameter a strong indicator for identifying men.
For each 1 cm increase in waist girth, the log-odds of being male increase by 0.48. The standard error is very low, which matters because the coefficient itself is small, and indicates a precise estimate. The z-value is very high (7.06), which aligns with the tiny \(p=1.64\times10^{-12}\).
For each 1 cm increase in hip girth, the log-odds of being male decrease by 0.645 (equivalently, the log-odds of being female increase by 0.645). The standard error is small, the z-value is large in magnitude (-7.16), and the p-value is very small (\(p=8.12\times10^{-13}\)), which together indicate high significance and precision.
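Log-odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios. The sketch below copies the estimates from the summary output into a vector named `coefs` (a name chosen here for illustration) so that it runs without the dataset; with the fitted model available, `exp(coef(lgrm))` should give the same quantities.

```r
# Coefficient estimates copied from the summary(lgrm) output above.
# `coefs` is an illustrative name, not an object from the original analysis.
coefs <- c(
  wrist_diameter = 2.77070,
  knee_diameter  = -0.26524,
  ankle_diameter = 2.22222,
  waist_girth    = 0.48475,
  hip_girth      = -0.64495
)

# Exponentiating a log-odds coefficient gives the multiplicative change
# in the odds of being male for each 1 cm increase in that measurement.
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

Read this way, each extra centimeter of waist girth multiplies the odds of being male by roughly 1.62, while each extra centimeter of hip girth multiplies them by roughly 0.52, in both cases holding the other measurements constant.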
predicted.probs <- lgrm$fitted.values
predicted.classes <- ifelse(predicted.probs > 0.8, 1, 0)
confusion <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(data05$gender, levels = c(0, 1))
)
confusion
## Actual
## Predicted 0 1
## 0 254 23
## 1 6 224
The model correctly classified 254 women as women and 224 men as men. It misclassified 23 men as women and 6 women as men. I used a probability threshold of 0.8 not because this is a crucial test but because I wanted to increase precision, accepting lower sensitivity in exchange.
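The trade-off behind the threshold choice can be illustrated on a small example. The sketch below uses made-up probabilities and labels (`toy_probs` and `toy_actual` are invented for illustration and are not drawn from bdims), and `confusion_at` is a hypothetical helper that repeats the thresholding step used above.

```r
# Hypothetical helper: build a confusion table at a given threshold.
confusion_at <- function(probs, actual, threshold) {
  predicted <- ifelse(probs > threshold, 1, 0)
  table(
    Predicted = factor(predicted, levels = c(0, 1)),
    Actual = factor(actual, levels = c(0, 1))
  )
}

# Made-up fitted probabilities and true classes, for illustration only.
toy_probs  <- c(0.05, 0.30, 0.55, 0.60, 0.75, 0.85, 0.90, 0.95)
toy_actual <- c(0, 0, 0, 1, 0, 1, 1, 1)

at_50 <- confusion_at(toy_probs, toy_actual, 0.5)
at_80 <- confusion_at(toy_probs, toy_actual, 0.8)

# Precision = TP / (TP + FP); sensitivity = TP / (TP + FN).
precision_50   <- at_50["1", "1"] / sum(at_50["1", ])
precision_80   <- at_80["1", "1"] / sum(at_80["1", ])
sensitivity_50 <- at_50["1", "1"] / sum(at_50[, "1"])
sensitivity_80 <- at_80["1", "1"] / sum(at_80[, "1"])
```

On this toy data, raising the threshold from 0.5 to 0.8 removes both false positives, so precision rises, but one true positive is lost, so sensitivity falls.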
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc1 <- roc(response = data05$gender, predictor=lgrm$fitted.values,levels=c("0","1"),direction="<")
auc1 <- auc(roc1)
auc1
## Area under the curve: 0.9947
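# legacy.axes = TRUE puts 1 - specificity (the false positive rate) on the x-axis, matching the label
plot.roc(roc1, print.auc=TRUE, legacy.axes=TRUE, xlab="False Positive Rate", ylab="True Positive Rate")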
TP <- 224
TN <- 254
FP <- 6
FN <- 23
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)
cat("Accuracy:", round(accuracy, 3), "\nSensitivity:", round(sensitivity, 3), "\nSpecificity:", round(specificity, 3), "\nPrecision:", round(precision, 3))
## Accuracy: 0.943
## Sensitivity: 0.907
## Specificity: 0.977
## Precision: 0.974
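The four counts above were typed in by hand. As an alternative, they can be read out of the confusion table by name, which avoids transcription errors if the threshold changes. The sketch below rebuilds the reported counts as a matrix called `confusion_counts` (an illustrative name) so that it runs on its own; the same indexing should work on the `confusion` table created earlier.

```r
# Reported confusion counts, rebuilt as a named matrix for illustration.
# matrix() fills column by column: Actual = 0 first, then Actual = 1.
confusion_counts <- matrix(
  c(254, 6, 23, 224),
  nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)

# Index as [Predicted, Actual]; class 1 (male) is the positive class.
TP <- confusion_counts["1", "1"]
TN <- confusion_counts["0", "0"]
FP <- confusion_counts["1", "0"]
FN <- confusion_counts["0", "1"]

metrics <- c(
  accuracy    = (TP + TN) / sum(confusion_counts),
  sensitivity = TP / (TP + FN),
  specificity = TN / (TN + FP),
  precision   = TP / (TP + FP)
)
round(metrics, 3)
```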
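The AUC value of 0.995 indicates that this model separates men from women almost perfectly across all thresholds. At the 0.8 threshold, the other performance metrics are also excellent. The positive predictive value (precision) is very high at 97.4%, which means that a prediction of male is almost always correct. The true positive rate (sensitivity) is somewhat lower at 90.7%, which reflects the strict threshold: 23 of the 247 men were classified as women. Specificity is 97.7%, and 94.3% accuracy means this is a good model overall. All of these metrics were computed on the same data used to fit the model, so they may be somewhat optimistic.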
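One way to interpret the AUC is as the probability that a randomly chosen man receives a higher fitted probability than a randomly chosen woman. The sketch below computes that pairwise quantity in base R on made-up scores (`toy_scores` and `toy_actual` are invented for illustration); it is not how pROC computes the value internally, only an equivalent definition.

```r
# Made-up scores and true classes, for illustration only.
toy_scores <- c(0.05, 0.30, 0.55, 0.60, 0.75, 0.85, 0.90, 0.95)
toy_actual <- c(0, 0, 0, 1, 0, 1, 1, 1)

pos <- toy_scores[toy_actual == 1]  # scores of the positive class
neg <- toy_scores[toy_actual == 0]  # scores of the negative class

# Compare every positive score with every negative score.
wins <- outer(pos, neg, ">")
ties <- outer(pos, neg, "==")

# Share of pairs ranked correctly, counting ties as half.
toy_auc <- mean(wins + 0.5 * ties)
toy_auc
```

Here 15 of the 16 positive-negative pairs are ordered correctly, giving 0.9375. An AUC of 0.995 on the real data means nearly every such pair is ordered correctly.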
The logistic regression model found that increased wrist diameter, ankle diameter, and waist girth are strong indicators that a person is male, while increased hip girth indicates that a person is female. Knee diameter was not a statistically significant predictor in this model. While there are certainly more conclusive and efficient ways to ascertain gender, thresholds determined from this form of analysis could be used as part of training a machine learning model to predict gender based on an image of a person and/or certain body measurements.
The model performed very well, with high performance metrics throughout. It is limited by the five predictors that it was given. It is possible or even likely that other body measurements are even stronger indicators of gender, and future research could pursue this avenue. Another direction for future research would be to include obese individuals, who were excluded from this dataset, and identify which body measurements best indicate gender regardless of, or taking into account, body type.