1 Data Description

The data in this note were obtained from [UFL] https://users.stat.ufl.edu/~winner/datasets.html. A modified version of the data set was uploaded to the course web page at https://raw.githubusercontent.com/jaidenneff/sta321/main/stature_hand_foot%20(1).csv. The data set contains the following variables.

  • idGen
  • gender(X1): male or female
  • height(X2): height in mm
  • footLen(X3): foot length in mm
  • handLen(Y): hand length in mm

2 Practical Question

The primary goal is to identify the association between hand length (Y) and the variables closely related to it: gender, height, and foot length.

3 Exploratory Data Analysis

To start, we load the data into R. We then create a labeled box plot of hand length by gender to see the effect that gender has on hand length; the plot shows a clear difference between males and females. We also include a scatter plot of hand length against height, which shows a strong linear relationship in this data set. I also chose to create a scatter plot of hand length against foot length. This relationship is also linear, although it is not as strong as the relationship with height.
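
The two summaries behind these plots, group means by gender and a correlation between two measurements, can be sketched in a few lines. The analysis in this note was run in R; the Python below is only an illustration, and the helper names `pearson` and `mean_by_group` as well as the four placeholder rows are assumptions, not values from the stature_hand_foot data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mean_by_group(values, groups):
    """Mean of `values` within each level of `groups`."""
    totals = {}
    for v, g in zip(values, groups):
        s, c = totals.get(g, (0.0, 0))
        totals[g] = (s + v, c + 1)
    return {g: s / c for g, (s, c) in totals.items()}

# Placeholder rows for illustration only (NOT taken from the real data set).
gender = ["Male", "Male", "Female", "Female"]
height = [10.0, 12.0, 9.0, 11.0]
handLen = [2.0, 3.0, 1.0, 2.0]

print(mean_by_group(handLen, gender))  # numeric counterpart of the box plot
print(pearson(height, handLen))        # numeric counterpart of the scatter plot
```

With the real data, the same two calls would quantify what the box plot and scatter plots show visually.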

4 Full model

Statistics of Regression Coefficients
               Estimate     Std. Error   t value      Pr(>|t|)
(Intercept)     1.7838792   16.6341692    0.1072419   0.9147404
idGen           0.0201191    0.0234713    0.8571777   0.3927141
genderFemale   -1.0803455    1.7963585   -0.6014086   0.5484757
height          0.1093853    0.0133900    8.1691775   0.0000000
footLen         0.0559712    0.0608566    0.9197221   0.3591947

This model is fit to the data with no transformations. Only height is statistically significant; idGen, gender, and foot length are not significant predictors of hand length (all three p-values are above 0.35). I chose to keep foot length and gender in my model because I think they are important and may be useful predictors in the final model. In my exploratory analysis in Section 3, the scatter plot of foot length and hand length shows a clear linear relationship, even though it isn’t extremely strong. The box plot of gender and hand length shows a clear difference in hand length between males and females, so I chose to keep gender as a predictor in my analysis. The variable idGen is not used in the transformed models in Section 6.
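
As a quick check on the table, each t value is the estimate divided by its standard error. The Python sketch below (the analysis itself was run in R) recomputes them from the printed numbers. The `|t| > 2` cutoff is a common rule of thumb used here as an assumption, not the exact test behind the reported p-values, and the helper names are hypothetical.

```python
# Estimates and standard errors copied from the full-model table.
full_model = {
    "(Intercept)":  (1.7838792, 16.6341692),
    "idGen":        (0.0201191, 0.0234713),
    "genderFemale": (-1.0803455, 1.7963585),
    "height":       (0.1093853, 0.0133900),
    "footLen":      (0.0559712, 0.0608566),
}

def t_value(estimate, std_error):
    """t statistic for testing whether a coefficient is zero."""
    return estimate / std_error

def flagged(coefs, cutoff=2.0):
    """Terms whose |t| exceeds a rule-of-thumb cutoff (an assumption, not an exact test)."""
    return [name for name, (est, se) in coefs.items()
            if abs(t_value(est, se)) > cutoff]

for name, (est, se) in full_model.items():
    print(f"{name:<13} t = {t_value(est, se):8.4f}")

print(flagged(full_model))
```

Small differences from the printed t values can arise because the table rounds the estimates and standard errors.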

5 Residual plots

Residual plots of the full model

After observing the residual plots, it is evident that the residuals of the hand length model are approximately normal and evenly scattered around zero. The plots also give no indication against the constant variance assumption.

6 Transformations

Here are two transformations of the response in the original full model.

6.1 Square root transformation

Square-root-transformed model
               Estimate     Std. Error   t value      Pr(>|t|)
(Intercept)     7.1362183   0.5890258    12.1152899   0.0000000
height          0.0038642   0.0004741     8.1506511   0.0000000
footLen         0.0020755   0.0021552     0.9630288   0.3370727
genderFemale   -0.0399784   0.0636355    -0.6282414   0.5307949

6.2 Log transformation

Log-transformed model
               Estimate     Std. Error   t value     Pr(>|t|)
(Intercept)     4.3042058   0.0837839    51.372728   0.0000000
height          0.0005444   0.0000674     8.073456   0.0000000
footLen         0.0003162   0.0003066     1.031457   0.3039757
genderFemale   -0.0057154   0.0090516    -0.631420   0.5287206

Although these models show foot length and gender to be insignificant, I still hold to my original view that they are important predictors of hand length, as observed in the scatter plot and box plot in Section 3, “Exploratory Data Analysis”. The two transformations lead to very similar conclusions: the t values and p-values are nearly the same, and only the scale of the estimates differs. The intercepts and coefficient estimates are, however, quite different from those of the original full model, because the response is now measured on the square-root or log scale.
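
One way to see that the two transformed models agree despite their different coefficients is to back-transform their predictions to the original hand length scale. The Python sketch below (the analysis itself was run in R) uses the coefficients from the two tables. The subject's height of 1700 and foot length of 250 are hypothetical inputs, chosen on the scale that the fitted coefficients appear to imply, and the function names are assumptions.

```python
from math import exp

# Coefficients copied from the tables in Sections 6.1 and 6.2.
SQRT_MODEL = {"(Intercept)": 7.1362183, "height": 0.0038642,
              "footLen": 0.0020755, "genderFemale": -0.0399784}
LOG_MODEL = {"(Intercept)": 4.3042058, "height": 0.0005444,
             "footLen": 0.0003162, "genderFemale": -0.0057154}

def linear_predictor(coefs, height, foot_len, female):
    """Fitted value on the transformed scale."""
    return (coefs["(Intercept)"]
            + coefs["height"] * height
            + coefs["footLen"] * foot_len
            + coefs["genderFemale"] * (1 if female else 0))

def predict_from_sqrt(height, foot_len, female):
    """Naive back-transformation: square the fitted sqrt(handLen)."""
    return linear_predictor(SQRT_MODEL, height, foot_len, female) ** 2

def predict_from_log(height, foot_len, female):
    """Naive back-transformation: exponentiate the fitted log(handLen)."""
    return exp(linear_predictor(LOG_MODEL, height, foot_len, female))

# Hypothetical subject (not a row from the data set).
for female in (False, True):
    print(female,
          round(predict_from_sqrt(1700, 250, female), 1),
          round(predict_from_log(1700, 250, female), 1))
```

For this hypothetical subject the two back-transformed predictions differ by well under one unit, which matches the observation that the transformations behave almost identically.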

7 Regression

The regression diagnostics show that the residuals of all three models are approximately normally distributed. The models all look very similar, so this is not very helpful for deciding which transformation to use in the final model.

8 Box-Cox Transformations

To check whether a transformation of the response is needed (for example, to correct non-constant variance), we perform the Box-Cox procedure to search for a transformation of the response variable. We try several Box-Cox transformations.

I could confidently choose a power of λ = 1 for the transformation, since it is well within the 95% confidence interval and sits near the middle of the curve. Since my data are already approximately normally distributed, the curves are roughly centered over the range from 0 to 1.
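
For reference, the Box-Cox family ties the candidate models together: λ = 1 leaves the response essentially unchanged, λ = 0.5 corresponds to the square root, and λ = 0 corresponds to the log. The Python sketch below shows the standard transformation only (the analysis itself was run in R). The function name `box_cox` is an assumption, and the sketch does not reproduce the likelihood search that produces the confidence interval.

```python
from math import log

def box_cox(y, lam):
    """Box-Cox transform of a positive value y with power lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive response")
    if lam == 0:
        return log(y)              # limiting case as lam -> 0
    return (y ** lam - 1) / lam

# lam = 1 is a shift of y, lam = 0.5 is a rescaled square root, lam = 0 is the log.
for lam in (1, 0.5, 0):
    print(lam, box_cox(4.0, lam))
```

Because λ = 1 only shifts the response by a constant, choosing it amounts to fitting the untransformed model.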

9 Goodness-of-fit Measures

Goodness-of-fit Measures of Candidate Models
                       SSE            R.sq        R.adj       Cp   AIC          SBC          PRESS
full.model             6420.4589478   0.7657848   0.7595391   5      587.1921     602.4092   7160.4399877
sqrt.length.log.dist      8.1141691   0.7645480   0.7598701   4     -449.2211    -437.0474      8.8925687
log.length                0.1641709   0.7637833   0.7590903   4    -1053.7922   -1041.6185      0.1798398

9.1 Goodness of Fit explanation

When observing the goodness-of-fit measures, we can see that R squared and adjusted R squared are nearly identical across the three models, at about 0.76 for both, which shows that all of the candidate models fit the data well. Mallows’ Cp is 5 for the full model and 4 for the square-root and log models, matching the number of estimated coefficients in each, so the two transformed models are the more parsimonious. Finally, the SSE is very large for the full model and lowest for the log transformation, at 0.164. AIC and SBC also decrease from the original model to the log transformation, and PRESS is lowest for the log transformation as well. Taken at face value, these measures all point to the log transformation, which is why I chose it for my final model. One caveat: SSE, AIC, SBC, and PRESS are computed on the scale of each model’s response, so part of the decrease reflects the smaller scale of the square root and log of hand length rather than a better fit.
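
The R.adj, AIC, and SBC columns can be reproduced from SSE and R.sq using standard formulas. The Python sketch below does so (the analysis itself was run in R). The sample size n = 155 is not stated in this note; it is an assumption inferred from the ratio of (1 - R.adj) to (1 - R.sq) in the table, and the function names are hypothetical.

```python
from math import log

N = 155  # assumption: inferred from the R.sq / R.adj columns, not stated in the note

def adjusted_r2(r2, n, p):
    """Adjusted R squared for a model with p estimated coefficients."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

def aic(sse, n, p):
    """AIC in the form n*log(SSE/n) + 2p."""
    return n * log(sse / n) + 2 * p

def sbc(sse, n, p):
    """SBC (BIC) in the form n*log(SSE/n) + p*log(n)."""
    return n * log(sse / n) + p * log(n)

# (SSE, R.sq, p) copied from the goodness-of-fit table.
models = {
    "full.model":           (6420.4589478, 0.7657848, 5),
    "sqrt.length.log.dist": (8.1141691,    0.7645480, 4),
    "log.length":           (0.1641709,    0.7637833, 4),
}

for name, (sse, r2, p) in models.items():
    print(f"{name:<21} R.adj={adjusted_r2(r2, N, p):.7f} "
          f"AIC={aic(sse, N, p):.4f} SBC={sbc(sse, N, p):.4f}")
```

Since AIC and SBC depend on SSE only through log(SSE/n), rescaling the response shifts both by a constant, which is why these columns are not directly comparable across models with different response scales.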

10 Final Model

The inferential statistics of the final working model are summarized in the following table.

Inferential Statistics of Final Model
               Estimate     Std. Error   t value     Pr(>|t|)
(Intercept)     4.3042058   0.0837839    51.372728   0.0000000
height          0.0005444   0.0000674     8.073456   0.0000000
footLen         0.0003162   0.0003066     1.031457   0.3039757
genderFemale   -0.0057154   0.0090516    -0.631420   0.5287206

This is the model that would be used to predict hand length from height, foot length, and gender under the log transformation. The justification for choosing the log transformation is given in Section 9.1, so the final model for predicting hand length is:

\[ \log(handLen) = 4.3042058 + 0.0005444\times height + 0.0003162\times footLen - 0.0057154\times genderFemale \]

The coefficient of genderFemale shows how gender enters the prediction. The box plot of gender in Section 3, “Exploratory Data Analysis”, makes it clear that female subjects have smaller hands on average than males. This is reflected in the model by the negative coefficient for genderFemale.

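Holding height and foot length fixed, the gender coefficient gives the relative difference in predicted hand length between females and males:

\[ \log(handLen_{female}) - \log(handLen_{male}) = -0.0057154 \to \frac{handLen_{female} - handLen_{male}}{handLen_{male}} = e^{-0.0057154} - 1 \approx -0.0057 = -0.57\% \]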
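
The percentage interpretation of a coefficient in a log-response model can be checked numerically. The Python sketch below (the analysis itself was run in R) converts the final-model coefficients into percent changes in predicted hand length. The helper name `percent_change` and the 10-unit height increment are assumptions chosen for illustration.

```python
from math import exp

def percent_change(coef, delta=1.0):
    """Percent change in the predicted response when a predictor in a
    log-response model increases by `delta`, other predictors held fixed."""
    return 100 * (exp(coef * delta) - 1)

# Coefficients copied from the final-model table.
print(round(percent_change(-0.0057154), 2))      # genderFemale: female vs. male
print(round(percent_change(0.0005444, 10), 2))   # height: a 10-unit increase
```

For coefficients this close to zero, the exact value 100 × (exp(b) - 1) is almost the same as the shortcut 100 × b, which is why -0.0057154 reads directly as about -0.57%.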