ACCT 426 / BUDA 451

Your name: Olivia Staud

Before submitting to eCampus, publish your results to RPubs.

RPubs URL: https://rpubs.com/ostaud/1307878

Overview

The body measurements dataset contains body measurements and weight for each subject. The t_measurements dataset contains the following variables:

subject_id
ankle
arm-length
bicep
calf
chest
forearm
height
hip
leg-length
shoulder-breadth
shoulder-to-crotch
thigh
waist
wrist

A second datafile called height / weight also includes:

subject_id
gender
height_cm
weight_kg

To get started, join the datasets and change all variables to pounds and inches. You probably also want to adjust the gender variable into a flag 1/0 value.
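The join and unit conversion described above can be sketched in base R as follows. The toy rows, the data frame names `measurements` and `height_weight`, and the assumption that the body measurements are recorded in centimeters are all hypothetical stand-ins, not the assignment data. The conversion factors are 2.54 cm per inch and 2.20462 lbs per kg.

```r
# Hypothetical toy rows standing in for the two data files
measurements <- data.frame(
  subject_id = c("a1", "a2", "a3"),
  hip        = c(101.2, 96.5, 110.0),   # assumed to be in centimeters
  thigh      = c(55.0, 52.3, 60.1)      # assumed to be in centimeters
)
height_weight <- data.frame(
  subject_id = c("a1", "a2", "a3"),
  gender     = c("male", "female", "male"),
  height_cm  = c(180, 165, 172),
  weight_kg  = c(82, 60, 75)
)

# Join on the shared key
combined <- merge(measurements, height_weight, by = "subject_id")

# Convert centimeters to inches and kilograms to pounds
cm_per_inch <- 2.54
lbs_per_kg  <- 2.20462
for (col in c("hip", "thigh")) {
  combined[[col]] <- combined[[col]] / cm_per_inch
}
combined$height_inches <- combined$height_cm / cm_per_inch
combined$weight_lbs    <- combined$weight_kg * lbs_per_kg

# Recode gender as a 1/0 flag (1 = male) and confirm both values occur
combined$gender_binary <- as.integer(combined$gender == "male")
stopifnot(length(unique(combined$gender_binary)) == 2)

# Drop the original metric columns so height and weight appear only once
combined$height_cm <- NULL
combined$weight_kg <- NULL
combined
```

Checking that the flag takes both values is worthwhile: a flag that ends up constant carries no information and cannot be estimated in a regression.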

Q1: Summarize variables

Create a graph of each significant variable in the dataset. Create a correlation matrix of the variables.

Are there any variables that are so highly correlated that they may throw off our model? If so, remove them from your dataset before continuing.

Answer:

After analyzing the correlation matrix, I identified several highly correlated variable pairs (r > 0.9). My code removed: arm-length, chest, height, leg-length, and waist because they were highly correlated with other measurements.

Removing these variables reduces multicollinearity in the regression models. However, since I removed both chest and waist, I had to adjust my models in later questions to use the remaining variables, such as hip and thigh, instead.

The body measurements are approximately normally distributed, with some expected gender-based differences, particularly in measurements like hip and shoulder breadth.

##                  var1               var2 correlation
## 1              height         arm-length   0.9131505
## 2          leg-length         arm-length   0.9286548
## 3       height_inches         arm-length   0.9058549
## 4               waist              chest   0.9266615
## 5          weight_lbs              chest   0.9115184
## 6          arm-length             height   0.9131505
## 7          leg-length             height   0.9095887
## 8  shoulder-to-crotch             height   0.9011011
## 9       height_inches             height   0.9860084
## 10         arm-length         leg-length   0.9286548
## 11             height         leg-length   0.9095887
## 12      height_inches         leg-length   0.9021208
## 13             height shoulder-to-crotch   0.9011011
## 14              chest              waist   0.9266615
## 15         weight_lbs              waist   0.9069925
## 16         arm-length      height_inches   0.9058549
## 17             height      height_inches   0.9860084
## 18         leg-length      height_inches   0.9021208
## 19              chest         weight_lbs   0.9115184
## 20              waist         weight_lbs   0.9069925

## [1] "Removed variables: arm-length, arm-length, arm-length, chest, chest, height, height, height, leg-length, waist"
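Each correlated pair appears twice in the table above (once in each direction), which is likely why names repeat in the "Removed variables" line. A minimal sketch that scans only the upper triangle of the correlation matrix and de-duplicates the drop list is shown below. The toy data are hypothetical, and two design choices here are assumptions rather than the original code: the response (`weight_lbs`) is excluded from the scan, and the second variable of each flagged pair is the one dropped.

```r
# Hypothetical toy data: height and leg_length share a common "size" factor
set.seed(1)
n <- 200
size <- rnorm(n)
toy <- data.frame(
  height     = 68 + 3 * size + rnorm(n, sd = 0.5),
  leg_length = 32 + 2 * size + rnorm(n, sd = 0.4),
  wrist      = 6.5 + rnorm(n, sd = 0.4),
  weight_lbs = 170 + 20 * size + rnorm(n, sd = 10)
)

# Correlations among predictors only; the response is not a redundancy candidate
predictors <- setdiff(names(toy), "weight_lbs")
cm <- cor(toy[, predictors])

# Blank out the lower triangle and diagonal so each pair is seen once
cm[lower.tri(cm, diag = TRUE)] <- NA
high <- which(abs(cm) > 0.9, arr.ind = TRUE)

flagged <- data.frame(
  var1        = rownames(cm)[high[, "row"]],
  var2        = colnames(cm)[high[, "col"]],
  correlation = cm[high]
)

# Drop one variable per flagged pair, listing each name only once
to_drop <- unique(flagged$var2)
reduced <- toy[, setdiff(names(toy), to_drop), drop = FALSE]
names(reduced)
```

Dropping only one member of each pair keeps as much information as possible; for example, chest and waist are correlated with each other, so removing one of them would already break that pair.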

Q2: PCA

There are a lot of similar dimensions. Use PCA to see if we can reduce the number of body measurements. Explain the overall results, as well as each variable’s impact. In particular, what is the difference between PC1 and PC2?

Answer:

The PCA analysis reveals two main components:

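PC1 (about 70% of variance): Represents overall body size - all variables have positive loadings. People with high PC1 scores have larger measurements across all dimensions.

PC2 (about 16% of variance): Represents body proportions independent of size - contrasts upper body measurements (such as shoulder breadth) with lower body measurements (hip, thigh).

Together, the first two components capture 85.5% of the total variance, so the ten remaining measurements can be summarized in two dimensions with limited loss of information.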

The key difference: PC1 captures size variation (bigger vs. smaller), while PC2 captures shape variation (apple vs. pear body types). PC1 explains why so many measurements are correlated - they all reflect overall size.

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.6449 1.2477 0.66445 0.59055 0.44653 0.39732 0.33372
## Proportion of Variance 0.6996 0.1557 0.04415 0.03487 0.01994 0.01579 0.01114
## Cumulative Proportion  0.6996 0.8552 0.89937 0.93425 0.95419 0.96997 0.98111
##                            PC8     PC9    PC10
## Standard deviation     0.29024 0.24138 0.21545
## Proportion of Variance 0.00842 0.00583 0.00464
## Cumulative Proportion  0.98953 0.99536 1.00000
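The proportions in the importance table follow from the standard deviations: each component's share is its variance (standard deviation squared) divided by the total variance. The sketch below redoes that arithmetic with the printed standard deviations, then repeats it on simulated data with `prcomp()`. The toy variables and their values are hypothetical and are not the assignment data.

```r
# Standard deviations copied from the importance table above
sdev <- c(2.6449, 1.2477, 0.66445, 0.59055, 0.44653,
          0.39732, 0.33372, 0.29024, 0.24138, 0.21545)

# Share of variance per component and the running total
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)
round(prop_var[1:2], 3)
round(cum_var[2], 3)

# Same calculation on hypothetical toy data using prcomp()
set.seed(1)
size <- rnorm(200)
toy <- data.frame(
  hip      = 40 + 3 * size + rnorm(200),
  thigh    = 22 + 2 * size + rnorm(200),
  shoulder = 15 + 1 * size + rnorm(200)
)
pca <- prcomp(toy, center = TRUE, scale. = TRUE)
toy_prop <- pca$sdev^2 / sum(pca$sdev^2)

# Loadings: how strongly each variable contributes to PC1 and PC2
pca$rotation[, 1:2]
```

Because the variables are standardized before the PCA, the total variance equals the number of variables, which is why the squared standard deviations in the table sum to roughly 10.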

Q3: Linear Model

Create two linear regression models that predict weight.

Start by using as many variables as possible. Then, use a simpler model with as few variables as possible.

Compare the results and explain which you would want to use in a situation where gathering data is very expensive.

Use RMSE to compare your results, as well as adjusted R squared.

Answer: I compared a complex model (using all remaining variables after removing highly correlated ones) with a simple model (using height_inches, hip, thigh, and gender_binary):

Complex Model:

Uses all non-redundant body measurements plus gender (11 predictors)
Higher adjusted R-squared (0.955) and lower RMSE (7.71 lbs)
More complex to interpret

Simple Model:

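Uses three easily measured variables (height_inches, hip, and thigh); gender_binary was specified but dropped from the fit because of a singularity
Lower adjusted R-squared (0.903) and higher RMSE (11.28 lbs)
More practical for data collection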

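For situations where data collection is expensive, the simple model is preferable: it gives up about 3.6 lbs of RMSE and about 5 points of adjusted R-squared, but still explains roughly 90% of the variance in weight while requiring significantly fewer measurements.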
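The model summaries in this section report a coefficient as "not defined because of singularities" and show `NA` for gender_binary. `lm()` does this when a predictor is an exact linear combination of other terms in the model. The sketch below reproduces both situations on hypothetical simulated data: a flag that duplicates a factor already in the model, and a flag that is constant. Which of the two applies to the assignment data is not something the output alone can confirm.

```r
# Hypothetical simulated data
set.seed(7)
n <- 100
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
gender_binary <- as.integer(gender == "male")
x <- rnorm(n)
y <- 2 * x + 5 * gender_binary + rnorm(n)

# Case 1: the flag duplicates the factor, so one of the two is aliased
fit_duplicate <- lm(y ~ gender + x + gender_binary)
coef(fit_duplicate)

# Case 2: a constant flag is collinear with the intercept
constant_flag <- rep(1L, n)
fit_constant <- lm(y ~ x + constant_flag)
coef(fit_constant)

# alias() lists the exact linear dependencies lm() found
alias(fit_duplicate)
```

Including only one encoding of gender in the formula, and checking that it takes both values, avoids the `NA` row.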

## 
## Call:
## lm(formula = weight_lbs ~ ., data = combined_data_filtered %>% 
##     select(-subject_id))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.248  -4.361  -0.141   3.741  52.164 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          -343.1731     4.6506 -73.791 < 0.0000000000000002 ***
## gendermale             -0.1843     0.8663  -0.213               0.8315    
## ankle                  -0.7831     0.5807  -1.349               0.1776    
## bicep                   2.1273     0.3563   5.970     0.00000000279564 ***
## calf                    2.8467     0.2967   9.594 < 0.0000000000000002 ***
## forearm                 1.3923     0.6294   2.212               0.0271 *  
## hip                     4.6624     0.1264  36.885 < 0.0000000000000002 ***
## `shoulder-breadth`      5.2923     0.4708  11.242 < 0.0000000000000002 ***
## `shoulder-to-crotch`    1.8050     0.3368   5.360     0.00000009282612 ***
## thigh                   1.1986     0.2328   5.148     0.00000028875568 ***
## wrist                   5.5819     0.7781   7.173     0.00000000000103 ***
## height_inches           0.9451     0.1137   8.311 < 0.0000000000000002 ***
## gender_binary               NA         NA      NA                   NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.736 on 2006 degrees of freedom
## Multiple R-squared:  0.9549, Adjusted R-squared:  0.9547 
## F-statistic:  3865 on 11 and 2006 DF,  p-value: < 0.00000000000000022

## 
## Call:
## lm(formula = weight_lbs ~ height_inches + hip + thigh + gender_binary, 
##     data = combined_data_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -81.112  -6.687   0.084   6.784  53.744 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   -386.53005    4.85652  -79.59 <0.0000000000000002 ***
## height_inches    3.34440    0.06976   47.94 <0.0000000000000002 ***
## hip              6.26487    0.15598   40.17 <0.0000000000000002 ***
## thigh            3.49733    0.26996   12.96 <0.0000000000000002 ***
## gender_binary         NA         NA      NA                  NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.29 on 2014 degrees of freedom
## Multiple R-squared:  0.9036, Adjusted R-squared:  0.9034 
## F-statistic:  6291 on 3 and 2014 DF,  p-value: < 0.00000000000000022

##     Model      RMSE Adj_R_squared
## 1 Complex  7.712479     0.9546960
## 2  Simple 11.282383     0.9034345
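RMSE is the square root of the mean squared residual, while the "Residual standard error" printed by `summary()` divides by the residual degrees of freedom instead of the sample size. The two are related by RMSE = RSE * sqrt(df / n); for the complex model, 7.736 * sqrt(2006 / 2018) is about 7.71, which agrees with the table. A minimal sketch of building such a comparison table, using hypothetical simulated data and coefficients rather than the assignment data:

```r
# Hypothetical simulated body data
set.seed(42)
n <- 300
height_inches <- rnorm(n, 67, 3)
hip   <- rnorm(n, 40, 3)
thigh <- 2 + 0.5 * hip + rnorm(n)
wrist <- rnorm(n, 6.5, 0.4)
weight_lbs <- -350 + 3 * height_inches + 6 * hip + 3 * thigh +
  5 * wrist + rnorm(n, sd = 8)
toy <- data.frame(weight_lbs, height_inches, hip, thigh, wrist)

# In-sample RMSE of a fitted model
rmse_of <- function(fit) sqrt(mean(residuals(fit)^2))

complex_fit <- lm(weight_lbs ~ ., data = toy)
simple_fit  <- lm(weight_lbs ~ height_inches + hip, data = toy)

# RMSE recovered from the residual standard error
rse_to_rmse <- summary(complex_fit)$sigma * sqrt(df.residual(complex_fit) / n)

comparison <- data.frame(
  Model         = c("Complex", "Simple"),
  RMSE          = c(rmse_of(complex_fit), rmse_of(simple_fit)),
  Adj_R_squared = c(summary(complex_fit)$adj.r.squared,
                    summary(simple_fit)$adj.r.squared)
)
comparison
```

In-sample RMSE can never increase when predictors are added to a nested model, which is why adjusted R-squared, with its penalty for extra terms, is a useful second measure.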

Q4: Other Model

Use a different model to predict waist. Show a visualization of your results, and explain what they tell us about the data. Should we use a linear regression model or this alternative approach? Use RMSE to explain your results.

Answer: I compared Random Forest and linear regression for predicting hip size (rather than waist, which was removed):

The two models performed almost identically, with linear regression slightly ahead: its RMSE was 1.03 inches versus 1.06 inches for Random Forest. The Random Forest explained 88.93% of the variance in hip size, and its variable importance plot shows which measurements are most predictive, with weight_lbs and thigh likely being the most important predictors.

Because the non-linear model did not improve on linear regression, the relationships between hip size and the other measurements appear to be largely linear. For this task, linear regression is the better choice: it is at least as accurate, simpler to fit, and easier to interpret.

## [1] "Available columns:"

##  [1] "subject_id"         "gender"             "ankle"             
##  [4] "bicep"              "calf"               "forearm"           
##  [7] "hip"                "shoulder-breadth"   "shoulder-to-crotch"
## [10] "thigh"              "wrist"              "height_inches"     
## [13] "weight_lbs"         "gender_binary"

## [1] "Using hip as the target variable"

## [1] "First few values of y_train:"

## [1] 40.15051 41.84373 46.39321 39.63407 43.94734 41.60952

## 
## Call:
##  randomForest(x = X_train, y = y_train, ntree = 500, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 1.408076
##                     % Var explained: 88.93

## 
## Call:
## lm(formula = lm_formula, data = lm_train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3336 -0.6593 -0.0321  0.6034  7.5376 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          29.322073   1.220496  24.025 < 0.0000000000000002 ***
## gendermale           -1.588932   0.136490 -11.641 < 0.0000000000000002 ***
## ankle                 0.197168   0.095713   2.060               0.0396 *  
## bicep                 0.062477   0.058895   1.061               0.2889    
## calf                 -0.347405   0.049051  -7.083 0.000000000002233445 ***
## forearm              -0.465748   0.102818  -4.530 0.000006405507563628 ***
## `shoulder-breadth`   -0.103183   0.079274  -1.302               0.1933    
## `shoulder-to-crotch`  0.450123   0.054264   8.295 0.000000000000000252 ***
## thigh                 0.502082   0.036017  13.940 < 0.0000000000000002 ***
## wrist                 0.073014   0.131337   0.556               0.5784    
## height_inches        -0.241406   0.018031 -13.388 < 0.0000000000000002 ***
## weight_lbs            0.086541   0.002771  31.230 < 0.0000000000000002 ***
## gender_binary               NA         NA      NA                   NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.068 on 1400 degrees of freedom
## Multiple R-squared:  0.9111, Adjusted R-squared:  0.9104 
## F-statistic:  1304 on 11 and 1400 DF,  p-value: < 0.00000000000000022

##               Model     RMSE
## 1     Random Forest 1.064334
## 2 Linear Regression 1.029883
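A comparison like the one in the table above can be sketched as follows: split the data, fit both models on the training rows, and compute RMSE on the rows that were held out. The simulated data, the 70/30 split, and the variable relationships are hypothetical assumptions, not the original analysis. The Random Forest step runs only if the randomForest package is installed.

```r
# Hypothetical simulated data with a mostly linear relationship
set.seed(123)
n <- 400
weight_lbs    <- rnorm(n, 170, 30)
thigh         <- 12 + 0.06 * weight_lbs + rnorm(n)
height_inches <- rnorm(n, 67, 3)
hip <- 25 + 0.09 * weight_lbs + 0.3 * thigh - 0.1 * height_inches + rnorm(n)
toy <- data.frame(hip, weight_lbs, thigh, height_inches)

# Assumed 70/30 train/test split
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Linear regression, evaluated on held-out rows
lm_fit  <- lm(hip ~ ., data = train)
lm_rmse <- rmse(test$hip, predict(lm_fit, newdata = test))

# Random Forest, evaluated on the same held-out rows
rf_rmse <- NA_real_
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf_fit <- randomForest::randomForest(hip ~ ., data = train,
                                       ntree = 500, importance = TRUE)
  rf_rmse <- rmse(test$hip, predict(rf_fit, newdata = test))
  print(randomForest::importance(rf_fit))  # variable importance scores
}

results <- data.frame(
  Model = c("Random Forest", "Linear Regression"),
  RMSE  = c(rf_rmse, lm_rmse)
)
results
```

Evaluating both models on the same held-out rows keeps the comparison fair, since a Random Forest can fit its own training data very closely.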