Your name: Olivia Staud
Before submitting to eCampus, publish your results to RPubs.
RPubs URL: https://rpubs.com/ostaud/1307878
The body measurements dataset is a dataset of body measurements and weight. The t_measurements dataset contains the following variables:
A second datafile called height / weight also includes:
To get started, join the datasets and convert all variables to pounds and inches. You probably also want to convert the gender variable into a 1/0 flag.
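A minimal sketch of this preparation step in base R. The two small data frames, their column names (other than subject_id), and the assumption that the raw files are in centimeters and kilograms are hypothetical stand-ins, not the actual assignment data.

```r
# Hypothetical stand-ins for the two data files (units assumed: cm and kg).
measurements <- data.frame(
  subject_id = c(1, 2, 3),
  gender     = c("male", "female", "female"),
  hip        = c(101.6, 96.5, 104.1)       # assumed centimeters
)
height_weight <- data.frame(
  subject_id = c(1, 2, 3),
  height     = c(177.8, 162.6, 170.2),     # assumed centimeters
  weight_kg  = c(81.6, 59.0, 68.0)         # assumed kilograms
)

# Join the two tables on the shared subject identifier.
combined <- merge(measurements, height_weight, by = "subject_id")

# Convert units: 1 inch = 2.54 cm, 1 kg = 2.20462 lbs.
combined$hip           <- combined$hip / 2.54
combined$height_inches <- combined$height / 2.54
combined$weight_lbs    <- combined$weight_kg * 2.20462

# Encode gender as a 1/0 flag (1 = male, 0 = female).
combined$gender_binary <- as.integer(combined$gender == "male")

# A flag that takes only one value is collinear with the intercept and
# produces an NA coefficient in lm(), so confirm that both 0 and 1 occur.
print(table(combined$gender_binary))
print(combined[, c("subject_id", "height_inches", "weight_lbs", "gender_binary")])
```

Checking the flag with table() right after creating it is a cheap way to catch an encoding that never matches the labels actually stored in the gender column.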
Create a graph of each significant variable in the dataset. Create a correlation matrix of the variables.
Are there any variables that are so highly correlated that they may throw off our model? If so, remove them from your dataset before continuing.
Answer:
After analyzing the correlation matrix, I identified several highly correlated variable pairs (r > 0.9). My code removed arm-length, chest, height, leg-length, and waist because each was highly correlated with at least one other measurement.
Dropping these variables reduces multicollinearity in the regression models. However, since I removed both chest and waist, I had to adjust my models in later questions to use remaining variables such as hip and thigh instead.
The body measurements are approximately normally distributed, with some expected gender-based differences, particularly in measurements like hip and shoulder breadth.
##                  var1               var2 correlation
## 1              height         arm-length   0.9131505
## 2          leg-length         arm-length   0.9286548
## 3       height_inches         arm-length   0.9058549
## 4               waist              chest   0.9266615
## 5          weight_lbs              chest   0.9115184
## 6          arm-length             height   0.9131505
## 7          leg-length             height   0.9095887
## 8  shoulder-to-crotch             height   0.9011011
## 9       height_inches             height   0.9860084
## 10         arm-length         leg-length   0.9286548
## 11             height         leg-length   0.9095887
## 12      height_inches         leg-length   0.9021208
## 13             height shoulder-to-crotch   0.9011011
## 14              chest              waist   0.9266615
## 15         weight_lbs              waist   0.9069925
## 16         arm-length      height_inches   0.9058549
## 17             height      height_inches   0.9860084
## 18         leg-length      height_inches   0.9021208
## 19              chest         weight_lbs   0.9115184
## 20              waist         weight_lbs   0.9069925
## [1] "Removed variables: arm-length, arm-length, arm-length, chest, chest, height, height, height, leg-length, waist"
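The pair list reports every pair twice (once in each order), and the removal message repeats variable names. A minimal sketch of one way to list each pair once and de-duplicate the names to drop is shown here. It uses a small synthetic data frame with made-up column names, and the rule of dropping the second variable of each pair is an assumption for illustration, not necessarily the rule the original code applied.

```r
set.seed(1)
n    <- 200
size <- rnorm(n)   # shared "overall size" factor driving three columns

# Synthetic data: three strongly related columns and one unrelated column.
df <- data.frame(
  height = size + rnorm(n, sd = 0.2),
  arm    = size + rnorm(n, sd = 0.2),
  leg    = size + rnorm(n, sd = 0.2),
  wrist  = rnorm(n)
)

cor_mat <- cor(df)

# Blank out the diagonal and lower triangle so each pair appears only once.
cor_mat[!upper.tri(cor_mat)] <- NA

# Locate the remaining entries above the threshold (NA entries are skipped).
idx <- which(abs(cor_mat) > 0.9, arr.ind = TRUE)
high_pairs <- data.frame(
  var1        = rownames(cor_mat)[idx[, "row"]],
  var2        = colnames(cor_mat)[idx[, "col"]],
  correlation = cor_mat[idx]
)
print(high_pairs)

# Assumed rule: drop the second variable of each pair, listing each name once.
to_remove   <- unique(high_pairs$var2)
df_filtered <- df[, !(names(df) %in% to_remove), drop = FALSE]
print(paste("Removed variables:", paste(to_remove, collapse = ", ")))
```

Working from the upper triangle halves the pair list, and unique() keeps the removal message from repeating a name once per pair it appears in.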
There are a lot of similar dimensions. Use PCA to see if we can reduce the number of body measurements. Explain the overall results, as well as each variable’s impact. In particular, what is the difference between PCA1 and PCA2?
Answer:
The PCA analysis reveals two main components:
PC1 (about 70% of variance): Represents overall body size - all variables have positive loadings. People with high PC1 scores have larger measurements across all dimensions.
PC2 (about 16% of variance): Represents body proportions independent of size - contrasts upper body (chest, shoulders) with lower body measurements (hip, thigh).
Together, the first two components account for about 86% of the total variance.
The key difference: PC1 captures size variation (bigger vs. smaller), while PC2 captures shape variation (apple vs. pear body types). PC1 explains why so many measurements are correlated - they all reflect overall size.
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.6449 1.2477 0.66445 0.59055 0.44653 0.39732 0.33372
## Proportion of Variance 0.6996 0.1557 0.04415 0.03487 0.01994 0.01579 0.01114
## Cumulative Proportion  0.6996 0.8552 0.89937 0.93425 0.95419 0.96997 0.98111
##                            PC8     PC9    PC10
## Standard deviation     0.29024 0.24138 0.21545
## Proportion of Variance 0.00842 0.00583 0.00464
## Cumulative Proportion  0.98953 0.99536 1.00000
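The proportion-of-variance rows follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total. A short sketch that recomputes them from the values printed in the summary (the variable names are mine, chosen for illustration):

```r
# Standard deviations copied from the PCA summary output.
sdev <- c(2.6449, 1.2477, 0.66445, 0.59055, 0.44653,
          0.39732, 0.33372, 0.29024, 0.24138, 0.21545)

variance <- sdev^2                      # variance captured by each component
prop_var <- variance / sum(variance)    # share of the total variance
cum_prop <- cumsum(prop_var)            # running total across components

print(round(data.frame(PC = seq_along(sdev), prop_var, cum_prop), 4))
print(sum(variance))
```

The squared standard deviations sum to roughly 10, which is what one would expect if ten variables were each scaled to unit variance before the PCA; that scaling is an inference from the numbers, not something the output states.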
Create two linear regression models that predict weight.
Start by using as many variables as possible. Then, use a simpler model with as few variables as possible.
Compare the results and explain which you would want to use in a situation where gathering data is very expensive.
Use RMSE to compare your results, as well as adjusted R squared.
Answer: I compared a complex model (all variables remaining after removing the highly correlated ones) with a simple model (height_inches, hip, thigh, and gender_binary). Note that gender_binary has an NA coefficient in both fits because of a singularity, so the simple model effectively relies on three predictors.
Complex Model:
Uses all non-redundant body measurements. Higher adjusted R-squared (0.9547) and lower RMSE (7.71 lbs). More complex to interpret.
Simple Model:
Uses only a few easily measured variables. Lower adjusted R-squared (0.9034) and higher RMSE (11.28 lbs). More practical for data collection.
For situations where data collection is expensive, the simple model is preferable: it gives up about 3.6 lbs of RMSE but still explains roughly 90% of the variance in weight while requiring far fewer measurements.
##
## Call:
## lm(formula = weight_lbs ~ ., data = combined_data_filtered %>%
## select(-subject_id))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -79.248  -4.361  -0.141   3.741  52.164
##
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value             Pr(>|t|)
## (Intercept)          -343.1731     4.6506 -73.791 < 0.0000000000000002 ***
## gendermale             -0.1843     0.8663  -0.213               0.8315
## ankle                  -0.7831     0.5807  -1.349               0.1776
## bicep                   2.1273     0.3563   5.970     0.00000000279564 ***
## calf                    2.8467     0.2967   9.594 < 0.0000000000000002 ***
## forearm                 1.3923     0.6294   2.212               0.0271 *
## hip                     4.6624     0.1264  36.885 < 0.0000000000000002 ***
## `shoulder-breadth`      5.2923     0.4708  11.242 < 0.0000000000000002 ***
## `shoulder-to-crotch`    1.8050     0.3368   5.360     0.00000009282612 ***
## thigh                   1.1986     0.2328   5.148     0.00000028875568 ***
## wrist                   5.5819     0.7781   7.173     0.00000000000103 ***
## height_inches           0.9451     0.1137   8.311 < 0.0000000000000002 ***
## gender_binary               NA         NA      NA                   NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.736 on 2006 degrees of freedom
## Multiple R-squared: 0.9549, Adjusted R-squared: 0.9547
## F-statistic: 3865 on 11 and 2006 DF, p-value: < 0.00000000000000022
##
## Call:
## lm(formula = weight_lbs ~ height_inches + hip + thigh + gender_binary,
## data = combined_data_filtered)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -81.112  -6.687   0.084   6.784  53.744
##
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value            Pr(>|t|)
## (Intercept)   -386.53005    4.85652  -79.59 <0.0000000000000002 ***
## height_inches    3.34440    0.06976   47.94 <0.0000000000000002 ***
## hip              6.26487    0.15598   40.17 <0.0000000000000002 ***
## thigh            3.49733    0.26996   12.96 <0.0000000000000002 ***
## gender_binary         NA         NA      NA                  NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.29 on 2014 degrees of freedom
## Multiple R-squared: 0.9036, Adjusted R-squared: 0.9034
## F-statistic: 6291 on 3 and 2014 DF, p-value: < 0.00000000000000022
##     Model      RMSE Adj_R_squared
## 1 Complex  7.712479     0.9546960
## 2  Simple 11.282383     0.9034345
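A minimal sketch of how an RMSE and adjusted R-squared comparison like this can be computed in base R. It uses the built-in mtcars data as a stand-in for the body measurement data, and the helper name rmse and the chosen predictors are illustrative assumptions rather than the original code.

```r
# In-sample RMSE: square root of the mean squared residual.
rmse <- function(model) sqrt(mean(residuals(model)^2))

# Stand-in models on mtcars: every predictor vs. a two-predictor subset.
complex_fit <- lm(mpg ~ ., data = mtcars)
simple_fit  <- lm(mpg ~ wt + hp, data = mtcars)

comparison <- data.frame(
  Model         = c("Complex", "Simple"),
  RMSE          = c(rmse(complex_fit), rmse(simple_fit)),
  Adj_R_squared = c(summary(complex_fit)$adj.r.squared,
                    summary(simple_fit)$adj.r.squared)
)
print(comparison)

# RMSE divides by n while the residual standard error divides by the
# residual degrees of freedom, so one can be recovered from the other.
n           <- nrow(mtcars)
rse_to_rmse <- sigma(simple_fit) * sqrt(df.residual(simple_fit) / n)
print(c(direct = rmse(simple_fit), from_rse = rse_to_rmse))
```

The same relationship ties together the numbers reported here: 7.736 × sqrt(2006 / 2018) is about 7.71, matching the RMSE listed for the complex model. Because in-sample RMSE can only improve as predictors are added, adjusted R-squared is the fairer of the two measures when the models differ in size.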
Use a different model to predict waist. Show a visualization of your results, and explain what they tell us about the data. Should we use a linear regression model or this alternative approach? Use RMSE to explain your results.
Answer: I compared Random Forest and linear regression for predicting hip size (rather than waist, which was removed earlier as a highly correlated variable).
The two models performed almost identically, with linear regression having the slightly lower RMSE (1.030 vs. 1.064 for Random Forest), so the Random Forest did not improve prediction accuracy here. The variable importance plot shows which measurements are most predictive of hip size, with weight_lbs and thigh likely being the most important predictors.
Non-linear models can outperform linear regression when complex relationships exist between variables, but these results suggest the relationships among these body measurements are close to linear, so the simpler and more interpretable linear regression is the better choice.
## [1] "Available columns:"
## [1] "subject_id" "gender" "ankle"
## [4] "bicep" "calf" "forearm"
## [7] "hip" "shoulder-breadth" "shoulder-to-crotch"
## [10] "thigh" "wrist" "height_inches"
## [13] "weight_lbs" "gender_binary"
## [1] "Using hip as the target variable"
## [1] "First few values of y_train:"
## [1] 40.15051 41.84373 46.39321 39.63407 43.94734 41.60952
##
## Call:
## randomForest(x = X_train, y = y_train, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 1.408076
## % Var explained: 88.93
##
## Call:
## lm(formula = lm_formula, data = lm_train_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.3336 -0.6593 -0.0321  0.6034  7.5376
##
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value             Pr(>|t|)
## (Intercept)          29.322073   1.220496  24.025 < 0.0000000000000002 ***
## gendermale           -1.588932   0.136490 -11.641 < 0.0000000000000002 ***
## ankle                 0.197168   0.095713   2.060               0.0396 *
## bicep                 0.062477   0.058895   1.061               0.2889
## calf                 -0.347405   0.049051  -7.083 0.000000000002233445 ***
## forearm              -0.465748   0.102818  -4.530 0.000006405507563628 ***
## `shoulder-breadth`   -0.103183   0.079274  -1.302               0.1933
## `shoulder-to-crotch`  0.450123   0.054264   8.295 0.000000000000000252 ***
## thigh                 0.502082   0.036017  13.940 < 0.0000000000000002 ***
## wrist                 0.073014   0.131337   0.556               0.5784
## height_inches        -0.241406   0.018031 -13.388 < 0.0000000000000002 ***
## weight_lbs            0.086541   0.002771  31.230 < 0.0000000000000002 ***
## gender_binary               NA         NA      NA                   NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.068 on 1400 degrees of freedom
## Multiple R-squared: 0.9111, Adjusted R-squared: 0.9104
## F-statistic: 1304 on 11 and 1400 DF, p-value: < 0.00000000000000022
##               Model     RMSE
## 1     Random Forest 1.064334
## 2 Linear Regression 1.029883
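A hedged sketch of a hold-out comparison of this kind. It uses the built-in mtcars data as a stand-in, a 70/30 split, and an arbitrary set of three predictors; all of these are assumptions for illustration. The Random Forest part runs only if the randomForest package is installed.

```r
set.seed(42)

# Assumed 70/30 train/test split on stand-in data.
n         <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# Hold-out RMSE: prediction error on rows the model never saw.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

lm_fit  <- lm(mpg ~ wt + hp + disp, data = train)
results <- data.frame(
  Model = "Linear Regression",
  RMSE  = rmse(test$mpg, predict(lm_fit, newdata = test))
)

# Fit the Random Forest only when the package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf_fit <- randomForest::randomForest(mpg ~ wt + hp + disp,
                                       data = train, ntree = 500)
  results <- rbind(results, data.frame(
    Model = "Random Forest",
    RMSE  = rmse(test$mpg, predict(rf_fit, newdata = test))
  ))
}
print(results)
```

Both models are scored on the same held-out rows, so the RMSE values are directly comparable. The 1,412 training rows implied by the linear regression summary (1,400 residual degrees of freedom plus 12 estimated coefficients) are consistent with a 70/30 split of 2,018 subjects, though the split actually used is not shown in the output.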