The data in this note come from [UFL] https://users.stat.ufl.edu/~winner/datasets.html. The modified data set was uploaded to the course web page at https://raw.githubusercontent.com/jaidenneff/sta321/main/stature_hand_foot%20(1).csv.
The primary question is to identify the association between hand length (y) and the variables that are closely related to it.
To start, we load the data into R. We then create a labeled box plot of hand length by gender to see the effect that gender has on hand length; as you can see below, there is a clear difference between males and females. We also include a scatter plot of hand length against height, which shows a strong linear relationship in this data set. I also chose to create another scatter plot of foot length against hand length. This relationship is linear as well, even though it is not as strong as the one with height.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1.7838792 | 16.6341692 | 0.1072419 | 0.9147404 |
| idGen | 0.0201191 | 0.0234713 | 0.8571777 | 0.3927141 |
| genderFemale | -1.0803455 | 1.7963585 | -0.6014086 | 0.5484757 |
| height | 0.1093853 | 0.0133900 | 8.1691775 | 0.0000000 |
| footLen | 0.0559712 | 0.0608566 | 0.9197221 | 0.3591947 |
This is the full model fit to the data with no transformations. You can see that foot length and gender are not statistically significant as predictors of hand length. I chose to keep foot length and gender in my model because I think they are important and can still contribute to prediction in the final model. In my exploratory analysis in section 3, the scatter plot of foot length and hand length shows a clear linear relationship, even though it isn't extremely strong. The box plot of gender and hand length shows a clear difference in hand length between males and females, so for this reason I chose to keep gender as a predictor in my analysis.
Residual plots of the full model
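After observing the residual plots, it is evident that the residuals for hand length are approximately normal and evenly spread. We can also assume constant variance from these graphs.
Here are two transformations of the original full model. The first takes the square root of hand length as the response and the second takes its log: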
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 7.1362183 | 0.5890258 | 12.1152899 | 0.0000000 |
| height | 0.0038642 | 0.0004741 | 8.1506511 | 0.0000000 |
| footLen | 0.0020755 | 0.0021552 | 0.9630288 | 0.3370727 |
| genderFemale | -0.0399784 | 0.0636355 | -0.6282414 | 0.5307949 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 4.3042058 | 0.0837839 | 51.372728 | 0.0000000 |
| height | 0.0005444 | 0.0000674 | 8.073456 | 0.0000000 |
| footLen | 0.0003162 | 0.0003066 | 1.031457 | 0.3039757 |
| genderFemale | -0.0057154 | 0.0090516 | -0.631420 | 0.5287206 |
Although these models show foot length and gender to be insignificant, I still stick to my original thought that they are important predictors of hand length, as observed in the scatter plot and box plot in section 3, "exploratory analysis". The two transformed models tell the same story: the signs and t values of the coefficients are nearly identical, and they differ mainly in the intercept and in the size of the estimates. The intercepts and estimates are also quite different from those of the original model that we named the full model, because each model measures the response on a different scale.
The residual plots show that the residuals of all three models are approximately normally distributed. They all look pretty similar, so this is not very helpful for determining which transformation to use in the final model.
To guard against non-constant variance, we perform the Box-Cox procedure to search for a transformation of the response variable. We tried the Box-Cox procedure several times with different transformations.
I could confidently choose 1 as the power for the transformation: it is well within the 95% confidence interval and is at the very middle of the curve. Since my data is already pretty close to normally distributed, the curves are centered over the range from 0 to 1.
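To make the Box-Cox search concrete, here is a minimal sketch of the idea. The report's own analysis was run in R, so this Python version is only an illustration: it uses an intercept-only model instead of the full regression, and the `toy` values are hypothetical numbers, not the course data. The function names `boxcox` and `profile_loglik` are my own.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform of one positive value y for power lam."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def profile_loglik(ys, lam):
    """Profile log-likelihood of lam for an intercept-only normal model."""
    n = len(ys)
    z = [boxcox(y, lam) for y in ys]
    mean_z = sum(z) / n
    sigma2 = sum((v - mean_z) ** 2 for v in z) / n  # MLE of the variance
    jacobian = (lam - 1.0) * sum(math.log(y) for y in ys)
    return -0.5 * n * math.log(sigma2) + jacobian

# Hypothetical hand lengths, for illustration only (not the course data).
toy = [172.0, 181.5, 190.2, 195.8, 201.3, 186.4, 178.9, 208.1]

grid = [k / 10 for k in range(-20, 21)]  # candidate powers from -2 to 2
loglik = {lam: profile_loglik(toy, lam) for lam in grid}
best = max(loglik, key=loglik.get)

# Approximate 95% interval: every power whose log-likelihood is within
# 1.92 (half the 0.95 chi-square quantile with 1 df) of the maximum.
inside = [lam for lam in grid if loglik[lam] >= loglik[best] - 1.92]
print(best, min(inside), max(inside))
```

A power of 1 means no transformation and a power of 0 means the log, so checking whether 0 or 1 falls in the printed interval is how the curve is read.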
| | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
| full.model | 6420.4589478 | 0.7657848 | 0.7595391 | 5 | 587.1921 | 602.4092 | 7160.4399877 |
| sqrt.length.log.dist | 8.1141691 | 0.7645480 | 0.7598701 | 4 | -449.2211 | -437.0474 | 8.8925687 |
| log.length | 0.1641709 | 0.7637833 | 0.7590903 | 4 | -1053.7922 | -1041.6185 | 0.1798398 |
Looking at the goodness-of-fit measures, R squared and adjusted R squared are nearly identical across the three models, at about 0.76 for both, so all of them fit the data well. Mallows' Cp is 5 for the full model and 4 for the square-root and log models; these values equal the number of coefficients in each model (the full model also includes idGen), so the two transformed models are the more parsimonious ones. The SSE is very large for the full model and smallest for the log transformation at 0.164, and AIC, SBC, and PRESS all decrease from the original model to the log transformation. Keep in mind that SSE, AIC, SBC, and PRESS are computed on the scale of each transformed response, so part of that drop reflects the change of scale rather than a better fit. Taking all of these measures together, I chose the log transformation for my final model for predicting hand length.
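As a check on how the table entries relate to each other, the sketch below recomputes adjusted R squared, AIC, and SBC from the SSE and R squared columns. Two things here are assumptions and not stated in the report: the sample size of 155 (it is the value that reproduces the table) and the SSE-based forms of AIC and SBC. The report's analysis was done in R; Python is used here only for the arithmetic.

```python
import math

N_OBS = 155  # assumption: inferred from the table values, not stated in the report

def adj_r2(r2, n, p):
    """Adjusted R squared for a model with p coefficients (intercept included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

def aic(sse, n, p):
    """SSE-based AIC: n * log(SSE / n) + 2p."""
    return n * math.log(sse / n) + 2 * p

def sbc(sse, n, p):
    """SSE-based SBC (BIC): n * log(SSE / n) + p * log(n)."""
    return n * math.log(sse / n) + p * math.log(n)

# (SSE, R.sq, number of coefficients) copied from the table above.
models = {
    "full.model": (6420.4589478, 0.7657848, 5),
    "sqrt.length.log.dist": (8.1141691, 0.7645480, 4),
    "log.length": (0.1641709, 0.7637833, 4),
}

for name, (sse, r2, p) in models.items():
    print(name,
          round(adj_r2(r2, N_OBS, p), 7),
          round(aic(sse, N_OBS, p), 4),
          round(sbc(sse, N_OBS, p), 4))
```

Because SSE enters AIC and SBC through log(SSE / n), shrinking the scale of the response (for example by taking logs) lowers both numbers regardless of how well the model fits.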
The inferential statistics of the final working model are summarized in the following table.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 4.3042058 | 0.0837839 | 51.372728 | 0.0000000 |
| height | 0.0005444 | 0.0000674 | 8.073456 | 0.0000000 |
| footLen | 0.0003162 | 0.0003066 | 1.031457 | 0.3039757 |
| genderFemale | -0.0057154 | 0.0090516 | -0.631420 | 0.5287206 |
Here is the model that would be used to predict hand length, with the log transformation, if you were given height, foot length, and gender. I gave a justification for why I chose the log transformation in the "Goodness of Fit explanation" section, so here is the final model for the prediction of hand length.
\[ \log(handLen) = 4.3042058 + 0.0005444\times height + 0.0003162\times footLen - 0.0057154\times genderFemale \]
Here I show how the genderFemale variable enters the prediction. Looking at the box plot of gender in section 3, "exploratory analysis", it is clear that female subjects have smaller hands on average than males. This carries over to the model, where female is a negative predictor of hand length.
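Holding height and foot length fixed, the difference between a female subject (F) and a male subject (M) is:
\[ \log(handLen_{F}) - \log(handLen_{M}) = -0.0057154 \;\to\; \frac{handLen_{F} - handLen_{M}}{handLen_{M}} = e^{-0.0057154} - 1 \approx -0.0057 = -0.57\% \]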
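The percent interpretation of the gender coefficient can be checked with a couple of lines of arithmetic. This is only an illustrative sketch in Python (the report itself was produced in R); the variable names are my own.

```python
import math

beta_female = -0.0057154  # genderFemale coefficient from the final model

ratio = math.exp(beta_female)       # female / male hand length, other predictors fixed
pct_change = 100.0 * (ratio - 1.0)  # exact percent difference
approx_pct = 100.0 * beta_female    # small-coefficient approximation

print(round(ratio, 4), round(pct_change, 2), round(approx_pct, 2))
```

For a coefficient this close to zero, the exact value and the approximation agree to two decimal places, which is why the coefficient can be read directly as about a 0.57% decrease.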
Here we bootstrap the log-transformed model by resampling the cases with replacement to create simulated samples. This approximates the sampling distribution of each coefficient without relying on the normality assumption.
These histograms of the bootstrap estimates of the regression coefficients represent the sampling distributions of the corresponding estimates. They all have a similar shape and look approximately normal.
| | Estimate | Std. Error | t value | Pr(>\|t\|) | btc.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 4.3042 | 0.0838 | 51.3727 | 0.0000 | [ 4.06 , 4.5359 ] |
| height | 0.0005 | 0.0001 | 8.0735 | 0.0000 | [ 4e-04 , 7e-04 ] |
| footLen | 0.0003 | 0.0003 | 1.0315 | 0.3040 | [ -7e-04 , 0.0012 ] |
| genderFemale | -0.0057 | 0.0091 | -0.6314 | 0.5287 | [ -0.0285 , 0.0176 ] |
The regression coefficients shown are the same as in the log-transformed model on its own; the bootstrap adds a 95% confidence interval for each one, which shows the uncertainty in the estimate. Because the residuals were already close to normally distributed, these intervals agree with the t-test results: the intervals for foot length and gender contain 0.
The bootstrap histograms from the residual bootstrap of the final model are very similar to the ones above for the regression coefficients. They are approximately normal and have an even spread.
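The case bootstrap behind the btc.ci.95 column can be sketched as follows. This is a minimal illustration, not the report's code: it uses one predictor instead of three, 1000 replicates, a percentile interval, and synthetic toy data generated inside the script. All names (`ols_slope_intercept`, `percentile_ci`, the seed) are my own choices.

```python
import random

def ols_slope_intercept(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def percentile_ci(values, level=0.95):
    """Percentile interval taken from the sorted bootstrap estimates."""
    s = sorted(values)
    lo_idx = int((1 - level) / 2 * (len(s) - 1))
    hi_idx = int((1 + level) / 2 * (len(s) - 1))
    return s[lo_idx], s[hi_idx]

rng = random.Random(321)

# Synthetic toy data (not the course data): log hand length vs. height.
xs = [1500.0 + 10.0 * i for i in range(40)]
ys = [4.3 + 0.0005 * x + rng.gauss(0.0, 0.03) for x in xs]

slope_hat, intercept_hat = ols_slope_intercept(xs, ys)

boot_slopes = []
for _ in range(1000):
    # Resample whole cases (x, y pairs) with replacement, then refit.
    idx = [rng.randrange(len(xs)) for _ in xs]
    bx = [xs[i] for i in idx]
    by = [ys[i] for i in idx]
    boot_slopes.append(ols_slope_intercept(bx, by)[0])

ci_lo, ci_hi = percentile_ci(boot_slopes)
print(slope_hat, ci_lo, ci_hi)
```

Resampling whole rows keeps each subject's predictors and response together, so this version does not assume the model form or constant variance is exactly right.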
| | Estimate | Std. Error | t value | Pr(>\|t\|) | btr.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 4.3042 | 0.0838 | 51.3727 | 0.0000 | [ 4.1258 , 4.4649 ] |
| height | 0.0005 | 0.0001 | 8.0735 | 0.0000 | [ 4e-04 , 7e-04 ] |
| footLen | 0.0003 | 0.0003 | 1.0315 | 0.3040 | [ -3e-04 , 9e-04 ] |
| genderFemale | -0.0057 | 0.0091 | -0.6314 | 0.5287 | [ -0.0219 , 0.0122 ] |
This table shows the residual bootstrap confidence intervals for the regression coefficients. The estimates are the same as in the log-transformed model, and the intervals lead to the same conclusions as the t tests, which is consistent with approximately normal errors. The confidence intervals are also pretty similar to the previous ones, but you can see that they are narrower.
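For comparison, the residual bootstrap behind the btr.ci.95 column can be sketched like this. Again this is only an illustration under my own assumptions: one predictor, 1000 replicates, a percentile interval, and synthetic toy data; `fit_line` and the seed are hypothetical names and values, not the report's code.

```python
import random

def fit_line(xs, ys):
    """Least-squares (intercept, slope) for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(2024)

# Synthetic toy data (not the course data): log hand length vs. height.
xs = [1500.0 + 10.0 * i for i in range(40)]
ys = [4.3 + 0.0005 * x + rng.gauss(0.0, 0.03) for x in xs]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

boot_slopes = []
for _ in range(1000):
    # Keep the predictors fixed; build a new response by adding
    # residuals drawn with replacement to the fitted values.
    y_star = [f + rng.choice(resid) for f in fitted]
    boot_slopes.append(fit_line(xs, y_star)[1])

boot_slopes.sort()
ci_lo, ci_hi = boot_slopes[24], boot_slopes[974]  # middle 95% of 1000 estimates
print(b1, ci_lo, ci_hi)
```

Because the predictors stay fixed and only the residuals are shuffled, this version leans on the fitted model being correct, which is one reason its intervals can come out narrower than the case bootstrap.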
| | Estimate | Std. Error | Pr(>\|t\|) | btc.ci.95 | btr.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 4.3042 | 0.0838 | 0.0000 | [ 4.06 , 4.5359 ] | [ 4.1258 , 4.4649 ] |
| height | 0.0005 | 0.0001 | 0.0000 | [ 4e-04 , 7e-04 ] | [ 4e-04 , 7e-04 ] |
| footLen | 0.0003 | 0.0003 | 0.3040 | [ -7e-04 , 0.0012 ] | [ -3e-04 , 9e-04 ] |
| genderFemale | -0.0057 | 0.0091 | 0.5287 | [ -0.0285 , 0.0176 ] | [ -0.0219 , 0.0122 ] |
| | btc.wd | btr.wd |
|---|---|---|
| (Intercept) | 0.4759042 | 0.3390777 |
| height | 0.0003811 | 0.0002637 |
| footLen | 0.0018775 | 0.0012451 |
| genderFemale | 0.0461742 | 0.0341724 |
The confidence intervals from the case bootstrap (btc) and the residual bootstrap (btr) are similar and both very narrow, but the residual bootstrap is better for prediction because its intervals are narrower for every coefficient. We have found that bootstrapping isn't really necessary for this data set, because the results come out so close to those of the final log-transformed model; it just helps to narrow down the intervals.
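The width comparison can be recomputed directly from the interval bounds. The sketch below uses the bounds as printed in the tables (rounded to four decimals), so the widths match the btc.wd and btr.wd columns only up to rounding. It is written in Python purely for the arithmetic; the dictionary names are my own.

```python
# 95% interval bounds as printed in the tables above (rounded values).
btc = {  # case bootstrap
    "(Intercept)": (4.06, 4.5359),
    "height": (4e-04, 7e-04),
    "footLen": (-7e-04, 0.0012),
    "genderFemale": (-0.0285, 0.0176),
}
btr = {  # residual bootstrap
    "(Intercept)": (4.1258, 4.4649),
    "height": (4e-04, 7e-04),
    "footLen": (-3e-04, 9e-04),
    "genderFemale": (-0.0219, 0.0122),
}

# Width = upper bound minus lower bound, stored as (case, residual).
widths = {
    name: (btc[name][1] - btc[name][0], btr[name][1] - btr[name][0])
    for name in btc
}

# True when the residual bootstrap interval is never wider.
residual_narrower = all(r <= c for c, r in widths.values())

for name, (c, r) in widths.items():
    print(name, round(c, 4), round(r, 4))
print(residual_narrower)
```

With the rounded bounds the two height intervals look identical; the unrounded widths in the table show the residual bootstrap is slightly narrower there as well.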
In this statistical report we looked into variables that are closely related to hand length and how they can be used in predictive analysis. The data that we used was approximately normally distributed, but the log transformation gave us the best and most useful model, and the residual bootstrap gave us the narrowest confidence intervals. The model we found is below, and we chose to keep all of the factors in the model. In bootstrapping the cases and the residuals, we discovered that bootstrapping was unnecessary because our data was approximately normally distributed; however, it did give us valid confidence intervals to work from for the estimates.
\[ \log(handLen) =4.3042 + 0.0005\times height +0.0003\times footLen -0.0057\times genderFemale \]
We ran into some issues with gender and foot length not being significant in our first assessment, but after further investigation and data exploration we saw that each has a visible relationship with hand length, even if it wasn't strong enough to be significant in the fitted model.
I would recommend using the model shown above to predict hand length for males and females from their height, gender, and foot length.
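To show how the final model would be applied, here is a minimal prediction sketch in Python (the report's own work was done in R). The coefficients are copied from the final model table; the function name `predict_hand_len` and the example subject are hypothetical, and the inputs are assumed to be in the same units as the data set. Exponentiating the fitted log value is the simplest back-transformation and is an assumption of this sketch, not something the report specifies.

```python
import math

# Coefficients of the final log model, copied from the table above.
COEF = {
    "intercept": 4.3042058,
    "height": 0.0005444,
    "footLen": 0.0003162,
    "genderFemale": -0.0057154,
}

def predict_hand_len(height, foot_len, female):
    """Predicted hand length from the log model, back-transformed with exp."""
    log_hat = (COEF["intercept"]
               + COEF["height"] * height
               + COEF["footLen"] * foot_len
               + COEF["genderFemale"] * (1 if female else 0))
    return math.exp(log_hat)

# Hypothetical subject: the values below are made up for illustration and
# are assumed to use the data set's own units.
male_pred = predict_hand_len(1700.0, 250.0, female=False)
female_pred = predict_hand_len(1700.0, 250.0, female=True)

print(round(male_pred, 1), round(female_pred, 1))
```

For two subjects with the same height and foot length, the female prediction is always the male prediction multiplied by exp(-0.0057154), which is the 0.57% difference discussed earlier.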