1 Data Description

The data in this note were obtained from [UFL] https://users.stat.ufl.edu/~winner/datasets.html. The modified data set was uploaded to the course web page at https://raw.githubusercontent.com/jaidenneff/sta321/main/stature_hand_foot%20(1).csv. The data set contains the following variables.

  • idGen
  • gender(X1): male or female
  • height(X2): height in cm
  • footLen(X3): foot length in cm
  • handLen(Y): hand length in cm

1.1 Practical Question

The primary question is to identify the association between hand length (Y) and the variables that are closely related to it.

2 Exploratory Data Analysis

To start, we load the data into R. We then create a labeled box plot of hand length by gender to see the effect that gender has on hand length. As the plot shows, there is a clear difference between males and females. We also include a scatter plot of hand length against height, which shows a strong linear relationship in this data set. I also chose to create a second scatter plot of hand length against foot length. This relationship is also linear, even though it is not as strong as the one with height.
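The plots described above can be sketched in base R as follows. This is a minimal sketch, not the report's own code: it runs on a small simulated stand-in data frame (made-up values, labeled as such in the comments) so that it is self-contained, and the commented `read.csv()` line shows where the course file would be read instead.

```r
# Simulated stand-in data (hypothetical values, NOT the course data set).
# To use the real file instead, something like this should work:
# dat <- read.csv("https://raw.githubusercontent.com/jaidenneff/sta321/main/stature_hand_foot%20(1).csv")
set.seed(321)
n <- 150
gender  <- factor(rep(c("Male", "Female"), each = n / 2), levels = c("Male", "Female"))
height  <- rnorm(n, mean = ifelse(gender == "Male", 1750, 1620), sd = 60)
footLen <- 0.15 * height + rnorm(n, sd = 8)
handLen <- 0.11 * height + rnorm(n, sd = 6)
dat <- data.frame(idGen = seq_len(n), gender, height, footLen, handLen)

# Box plot of hand length by gender
boxplot(handLen ~ gender, data = dat,
        main = "Hand length by gender", xlab = "Gender", ylab = "Hand length")

# Scatter plots of hand length against the two numeric predictors
plot(handLen ~ height, data = dat, main = "Hand length vs. height")
plot(handLen ~ footLen, data = dat, main = "Hand length vs. foot length")

# Correlations as a numerical companion to the scatter plots
r_height <- cor(dat$handLen, dat$height)
r_foot   <- cor(dat$handLen, dat$footLen)
round(c(height = r_height, footLen = r_foot), 3)
```

Passing `data = dat` to each plotting call keeps the variables inside the data frame, so `attach()` is not needed.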

3 Full model

Statistics of Regression Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7838792 16.6341692 0.1072419 0.9147404
idGen 0.0201191 0.0234713 0.8571777 0.3927141
genderFemale -1.0803455 1.7963585 -0.6014086 0.5484757
height 0.1093853 0.0133900 8.1691775 0.0000000
footLen 0.0559712 0.0608566 0.9197221 0.3591947

This model is fit to the data with no transformations. Only height is statistically significant. idGen, foot length, and gender are not statistically significant predictors of hand length once height is in the model (all three p-values are above 0.35). I chose to keep foot length and gender in my model because I think they are important and may be useful predictors in the final model. In my exploratory analysis in Section 2, the scatter plot of foot length and hand length shows a clear linear relationship even though it isn't extremely strong. The box plot of gender and hand length shows a clear difference in hand length between males and females, so for this reason I chose to keep gender as a predictor in my analysis.
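A coefficient table of this form can be produced with `lm()` and `summary()`. The sketch below assumes the formula `handLen ~ idGen + gender + height + footLen` (inferred from the row names in the table above) and uses simulated stand-in data, so its numbers will not match the report.

```r
# Simulated stand-in data (hypothetical values, NOT the course data set)
set.seed(321)
n <- 150
gender  <- factor(rep(c("Male", "Female"), each = n / 2), levels = c("Male", "Female"))
height  <- rnorm(n, mean = ifelse(gender == "Male", 1750, 1620), sd = 60)
footLen <- 0.15 * height + rnorm(n, sd = 8)
handLen <- 0.11 * height + rnorm(n, sd = 6)
dat <- data.frame(idGen = seq_len(n), gender, height, footLen, handLen)

# Full model: every available variable, no transformation of the response.
# "Male" is the reference level, so the dummy coefficient is named genderFemale.
full.model <- lm(handLen ~ idGen + gender + height + footLen, data = dat)

# Matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coef.tab <- summary(full.model)$coefficients
round(coef.tab, 4)
```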

3.1 Residual plots

Residual plots of the full model

After observing the residual plots, it is evident that the residuals of the hand length model are approximately normal and evenly spread around zero. We can also assume constant variance from these graphs.
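Residual diagnostics of this kind are usually drawn with `plot()` on the fitted `lm` object. The sketch below is one way to do it on simulated stand-in data; the Shapiro-Wilk test is an optional numerical check added here for illustration and is not something the report itself uses.

```r
# Simulated stand-in data (hypothetical values, NOT the course data set)
set.seed(321)
n <- 150
gender  <- factor(rep(c("Male", "Female"), each = n / 2), levels = c("Male", "Female"))
height  <- rnorm(n, mean = ifelse(gender == "Male", 1750, 1620), sd = 60)
footLen <- 0.15 * height + rnorm(n, sd = 8)
handLen <- 0.11 * height + rnorm(n, sd = 6)
dat <- data.frame(idGen = seq_len(n), gender, height, footLen, handLen)

full.model <- lm(handLen ~ idGen + gender + height + footLen, data = dat)

# Residuals vs. fitted (constant variance) and normal Q-Q plot (normality)
par(mfrow = c(1, 2))
plot(full.model, which = 1:2)
par(mfrow = c(1, 1))

# Optional numerical check of normality of the residuals
sw <- shapiro.test(resid(full.model))
sw$p.value
```

A roughly horizontal band in the first panel supports constant variance, and points close to the reference line in the Q-Q plot support normality of the residuals.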

4 Transformations

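Here are two transformations of the response in the original full model. Both transformed models use height, foot length, and gender as predictors and leave out idGen.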

4.1 Square root transformation

square-root-transformed model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.1362183 0.5890258 12.1152899 0.0000000
height 0.0038642 0.0004741 8.1506511 0.0000000
footLen 0.0020755 0.0021552 0.9630288 0.3370727
genderFemale -0.0399784 0.0636355 -0.6282414 0.5307949

4.2 Log transformation

log-transformed model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.3042058 0.0837839 51.372728 0.0000000
height 0.0005444 0.0000674 8.073456 0.0000000
footLen 0.0003162 0.0003066 1.031457 0.3039757
genderFemale -0.0057154 0.0090516 -0.631420 0.5287206

Although these models show foot length and gender to be insignificant, I still stick to my original view that they are important predictors of hand length, as observed in the scatter plot and box plot in Section 2, "Exploratory Data Analysis". The two transformed models lead to the same conclusions about which predictors are significant. Their intercepts and estimates differ from each other, and from the original full model, because the response is on a different scale in each model.
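The two transformed fits can be sketched by applying the transformation to the response inside the model formula. The model names `sqrt.model` and `log.model` are placeholders chosen for this sketch, and the data are simulated stand-ins.

```r
# Simulated stand-in data (hypothetical values, NOT the course data set)
set.seed(321)
n <- 150
gender  <- factor(rep(c("Male", "Female"), each = n / 2), levels = c("Male", "Female"))
height  <- rnorm(n, mean = ifelse(gender == "Male", 1750, 1620), sd = 60)
footLen <- 0.15 * height + rnorm(n, sd = 8)
handLen <- 0.11 * height + rnorm(n, sd = 6)
dat <- data.frame(idGen = seq_len(n), gender, height, footLen, handLen)

# Same three predictors, response transformed inside the formula
sqrt.model <- lm(sqrt(handLen) ~ height + footLen + gender, data = dat)
log.model  <- lm(log(handLen)  ~ height + footLen + gender, data = dat)

# Estimates side by side: the scale of the response changes their size
coef.compare <- cbind(sqrt = coef(sqrt.model), log = coef(log.model))
signif(coef.compare, 4)

# p-values side by side: the significance pattern is what stays comparable
p.compare <- cbind(sqrt = summary(sqrt.model)$coefficients[, 4],
                   log  = summary(log.model)$coefficients[, 4])
round(p.compare, 4)
```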

5 Regression

The residual diagnostics show that the residuals of all three models are approximately normally distributed. They all look pretty similar, so this is not very helpful for determining which transformation to use in the final model.

5.1 Box-Cox Transformations

Because the variance may not be constant, we use the Box-Cox procedure to search for a transformation of the response variable. We tried several Box-Cox transformations.

The value lambda = 1 is well within the 95% confidence interval and sits at the very middle of the curve, so I could confidently choose lambda = 1, which corresponds to leaving the response untransformed. Since my data are already close to normally distributed, the curves are centered over the range between 0 and 1.
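A Box-Cox profile and its 95% interval for lambda can be obtained with `boxcox()` from the MASS package. The sketch below assumes a model with the three predictors and an untransformed response, uses simulated stand-in data, and reads the interval off the profile log-likelihood with the usual chi-square cutoff; the lambda grid is an arbitrary choice.

```r
library(MASS)  # provides boxcox()

# Simulated stand-in data (hypothetical values, NOT the course data set)
set.seed(321)
n <- 150
gender  <- factor(rep(c("Male", "Female"), each = n / 2), levels = c("Male", "Female"))
height  <- rnorm(n, mean = ifelse(gender == "Male", 1750, 1620), sd = 60)
footLen <- 0.15 * height + rnorm(n, sd = 8)
handLen <- 0.11 * height + rnorm(n, sd = 6)
dat <- data.frame(idGen = seq_len(n), gender, height, footLen, handLen)

fit <- lm(handLen ~ height + footLen + gender, data = dat)

# Profile log-likelihood over a grid of lambda values (no plot drawn here;
# set plotit = TRUE to see the curve with its 95% interval)
bc <- boxcox(fit, lambda = seq(-2, 3, by = 0.05), plotit = FALSE)

# Lambda with the highest profile log-likelihood
lambda.hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])

c(lambda.hat = lambda.hat, lower = ci[1], upper = ci[2])
```

If the interval contains 1, the data give no strong reason to transform the response. If it contains 0, the log transformation is also compatible with the data.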

5.2 Goodness-of-fit Measures

Goodness-of-fit Measures of Candidate Models
Model                 SSE           R.sq       R.adj      Cp  AIC         SBC         PRESS
full.model            6420.4589478  0.7657848  0.7595391  5   587.1921    602.4092    7160.4399877
sqrt.length.log.dist  8.1141691     0.7645480  0.7598701  4   -449.2211   -437.0474   8.8925687
log.length            0.1641709     0.7637833  0.7590903  4   -1053.7922  -1041.6185  0.1798398

5.2.1 Goodness of Fit explanation

Looking at the goodness-of-fit measures, R squared and adjusted R squared are nearly identical across the three models, at about 0.76 each, so all three fit the data about equally well. Mallows' Cp is 5 for the full model and 4 for the square-root and log models, which matches the number of coefficients in each model (the full model also includes idGen). The SSE is very large for the full model and smallest for the log transformation at 0.164, and AIC, SBC, and PRESS also decrease from the original model to the log transformation. These four measures depend on the scale of the response, so part of that drop reflects the smaller scale of the square root and log of hand length rather than a better fit. With R squared essentially tied, I chose the log transformation for my final model because it has the lowest SSE, AIC, SBC, and PRESS.
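Most columns of the goodness-of-fit table can be computed with a small helper like the hypothetical `gof()` below, shown on simulated stand-in data. The AIC and SBC lines use n log(SSE/n) + 2p and n log(SSE/n) + p log(n); with n = 155 and p = 5 these reproduce the full model's tabled values 587.1921 and 602.4092, although n = 155 is inferred from those numbers and is not stated in the text. Cp is left out of the sketch.

```r
# Simulated stand-in data (hypothetical values, NOT the course data set)
set.seed(321)
n <- 150
gender  <- factor(rep(c("Male", "Female"), each = n / 2), levels = c("Male", "Female"))
height  <- rnorm(n, mean = ifelse(gender == "Male", 1750, 1620), sd = 60)
footLen <- 0.15 * height + rnorm(n, sd = 8)
handLen <- 0.11 * height + rnorm(n, sd = 6)
dat <- data.frame(idGen = seq_len(n), gender, height, footLen, handLen)

full.model <- lm(handLen ~ idGen + gender + height + footLen, data = dat)
sqrt.model <- lm(sqrt(handLen) ~ height + footLen + gender, data = dat)
log.model  <- lm(log(handLen)  ~ height + footLen + gender, data = dat)

# Hypothetical helper: goodness-of-fit measures for one fitted lm object
gof <- function(fit) {
  e   <- resid(fit)
  n   <- length(e)
  p   <- length(coef(fit))
  sse <- sum(e^2)
  c(SSE   = sse,
    R.sq  = summary(fit)$r.squared,
    R.adj = summary(fit)$adj.r.squared,
    AIC   = n * log(sse / n) + 2 * p,
    SBC   = n * log(sse / n) + p * log(n),
    PRESS = sum((e / (1 - hatvalues(fit)))^2))  # leave-one-out prediction SSE
}

gof.tab <- rbind(full.model = gof(full.model),
                 sqrt.model = gof(sqrt.model),
                 log.model  = gof(log.model))
round(gof.tab, 4)
```

Only R.sq and R.adj are unit-free. SSE, AIC, SBC, and PRESS shrink whenever the response is put on a smaller scale, which is worth keeping in mind when the rows use different transformations of the response.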

6 Final Model

The inferential statistics of the final working model are summarized in the following table.

Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.3042058 0.0837839 51.372728 0.0000000
height 0.0005444 0.0000674 8.073456 0.0000000
footLen 0.0003162 0.0003066 1.031457 0.3039757
genderFemale -0.0057154 0.0090516 -0.631420 0.5287206

Here is the model that would be used to predict hand length, on the log scale, from a given height, foot length, and gender. My justification for choosing the log transformation is given in the "Goodness of Fit explanation" section, so here is the final model for the prediction of hand length.

\[ \log(handLen) = 4.3042058 + 0.0005444\times height + 0.0003162\times footLen - 0.0057154\times genderFemale \]

The coefficient of genderFemale shows how gender enters the prediction. The box plot of gender in Section 2, "Exploratory Data Analysis", makes it clear that female subjects have smaller hands on average than males. In the model this shows up as a negative coefficient for genderFemale. Holding height and foot length fixed,

\[ \log(handLen_{F}) - \log(handLen_{M}) = -0.0057154 \to \frac{handLen_{F}-handLen_{M}}{handLen_{M}} \approx -0.0057154 = -0.57\% \]

so the predicted hand length of a female is about 0.57% shorter than that of a male with the same height and foot length.
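As a check on the percentage interpretation of the gender coefficient: in a model for log(handLen), the exact relative change associated with a dummy coefficient is its exponential minus one. The subscripts F and M below are notation introduced here for a female and a male with the same height and foot length.

\[ \frac{handLen_{F}-handLen_{M}}{handLen_{M}} = e^{-0.0057154} - 1 \approx -0.0056991 \approx -0.57\% \]

The exact value and the coefficient itself agree to two decimal places in percent, so reading the coefficient directly as a percentage change is a safe approximation for an effect this small.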

7 Bootstrapping

7.1 Bootstrapping Coefficients

Here we bootstrap the log-transformed model. We create resampled data sets, refit the model to each one, and use the resulting estimates to approximate the sampling distributions of the coefficients without relying on the normality assumption.

7.1.1 Bootstrap Coefficient Histogram

These histograms of the bootstrap estimates of the regression coefficients represent the sampling distributions of the corresponding estimates. As the histograms show, they all have a similar, approximately normal shape.
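A bootstrap of the coefficients can be sketched as below, assuming this section resamples whole observations (a case bootstrap, which the column name `btc.ci.95` in the tables suggests). The number of replicates `B`, the seed, and the percentile interval are choices made for this sketch, and the data are simulated stand-ins.

```r
# Simulated stand-in data (hypothetical values, NOT the course data set)
set.seed(321)
n <- 150
gender  <- factor(rep(c("Male", "Female"), each = n / 2), levels = c("Male", "Female"))
height  <- rnorm(n, mean = ifelse(gender == "Male", 1750, 1620), sd = 60)
footLen <- 0.15 * height + rnorm(n, sd = 8)
handLen <- 0.11 * height + rnorm(n, sd = 6)
dat <- data.frame(idGen = seq_len(n), gender, height, footLen, handLen)

log.model <- lm(log(handLen) ~ height + footLen + gender, data = dat)

B <- 1000  # number of bootstrap replicates (assumed value)
boot.coef <- matrix(NA_real_, nrow = B, ncol = length(coef(log.model)),
                    dimnames = list(NULL, names(coef(log.model))))

for (b in seq_len(B)) {
  idx <- sample(nrow(dat), replace = TRUE)  # resample rows with replacement
  boot.coef[b, ] <- coef(lm(log(handLen) ~ height + footLen + gender,
                            data = dat[idx, ]))
}

# Histograms approximate the sampling distribution of each coefficient
par(mfrow = c(2, 2))
for (j in colnames(boot.coef)) hist(boot.coef[, j], main = j, xlab = "Bootstrap estimate")
par(mfrow = c(1, 1))

# 95% percentile bootstrap confidence intervals
btc.ci.95 <- t(apply(boot.coef, 2, quantile, probs = c(0.025, 0.975)))
signif(btc.ci.95, 4)
```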

7.2 Bootstrapping Residuals

Regression Coefficient Matrix
Estimate Std. Error t value Pr(>|t|) btc.ci.95
(Intercept) 4.3042 0.0838 51.3727 0.0000 [ 4.06 , 4.5359 ]
height 0.0005 0.0001 8.0735 0.0000 [ 4e-04 , 7e-04 ]
footLen 0.0003 0.0003 1.0315 0.3040 [ -7e-04 , 0.0012 ]
genderFemale -0.0057 0.0091 -0.6314 0.5287 [ -0.0285 , 0.0176 ]

The regression coefficients shown are the same as those of the log-transformed model. What the bootstrap adds is a 95% confidence interval for each coefficient, which shows the uncertainty in the estimate. The intervals for foot length and gender include 0, which agrees with their large p-values.

7.2.1 Bootstrapping Residuals Histogram

These residual bootstrap histograms for the final model are very similar to the ones above for the bootstrapped coefficients. They show that the bootstrap estimates are approximately normally distributed with an even spread.

Regression Coefficient Matrix with 95% Residual Bootstrap CI
Estimate Std. Error t value Pr(>|t|) btr.ci.95
(Intercept) 4.3042 0.0838 51.3727 0.0000 [ 4.1258 , 4.4649 ]
height 0.0005 0.0001 8.0735 0.0000 [ 4e-04 , 7e-04 ]
footLen 0.0003 0.0003 1.0315 0.3040 [ -3e-04 , 9e-04 ]
genderFemale -0.0057 0.0091 -0.6314 0.5287 [ -0.0219 , 0.0122 ]

This table shows the regression coefficients together with their residual bootstrap confidence intervals. The estimates are those of the log-transformed model, as in the previous coefficient table. The confidence intervals are also pretty similar to the previous ones, but each interval is somewhat narrower. This agreement suggests that the normality assumption is reasonable for this model.
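A residual bootstrap keeps the predictors fixed and resamples only the residuals of the fitted model. The sketch below shows one common way to do that on simulated stand-in data; `B`, the seed, the helper column `log.y`, and the percentile interval are assumptions of the sketch.

```r
# Simulated stand-in data (hypothetical values, NOT the course data set)
set.seed(321)
n <- 150
gender  <- factor(rep(c("Male", "Female"), each = n / 2), levels = c("Male", "Female"))
height  <- rnorm(n, mean = ifelse(gender == "Male", 1750, 1620), sd = 60)
footLen <- 0.15 * height + rnorm(n, sd = 8)
handLen <- 0.11 * height + rnorm(n, sd = 6)
dat <- data.frame(idGen = seq_len(n), gender, height, footLen, handLen)

log.model <- lm(log(handLen) ~ height + footLen + gender, data = dat)
fit.val <- fitted(log.model)  # fitted values on the log scale
res     <- resid(log.model)   # residuals on the log scale

B <- 1000  # number of bootstrap replicates (assumed value)
boot.res.coef <- matrix(NA_real_, nrow = B, ncol = length(coef(log.model)),
                        dimnames = list(NULL, names(coef(log.model))))

dat.star <- dat
for (b in seq_len(B)) {
  # New response: fitted value plus a residual drawn with replacement
  dat.star$log.y <- fit.val + sample(res, replace = TRUE)
  boot.res.coef[b, ] <- coef(lm(log.y ~ height + footLen + gender, data = dat.star))
}

# 95% percentile intervals from the residual bootstrap
btr.ci.95 <- t(apply(boot.res.coef, 2, quantile, probs = c(0.025, 0.975)))
signif(btr.ci.95, 4)
```

Because the predictor values never change between replicates, this scheme leans on the fitted model being correctly specified with constant error variance, which is one reason its intervals can come out narrower than those from resampling whole observations.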

7.3 Confidence Interval Width

Final Combined Inferential Statistics: p-values and Bootstrap CIs
Estimate Std. Error Pr(>|t|) btc.ci.95 btr.ci.95
(Intercept) 4.3042 0.0838 0.0000 [ 4.06 , 4.5359 ] [ 4.1258 , 4.4649 ]
height 0.0005 0.0001 0.0000 [ 4e-04 , 7e-04 ] [ 4e-04 , 7e-04 ]
footLen 0.0003 0.0003 0.3040 [ -7e-04 , 0.0012 ] [ -3e-04 , 9e-04 ]
genderFemale -0.0057 0.0091 0.5287 [ -0.0285 , 0.0176 ] [ -0.0219 , 0.0122 ]
Width of the two bootstrap confidence intervals
              btc.wd     btr.wd
(Intercept)   0.4759042  0.3390777
height        0.0003811  0.0002637
footLen       0.0018775  0.0012451
genderFemale  0.0461742  0.0341724

The confidence intervals from the bootstrap of the coefficients and from the residual bootstrap are similar and both very narrow, but the residual bootstrap intervals are narrower for every coefficient, so they give the more precise estimates. Still, bootstrapping isn't really necessary for this data set, because its results are so close to those of the final log-transformed model. It mainly helps to narrow down the range of plausible values.
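The widths are simply upper limit minus lower limit. As a rough check, the sketch below recomputes them from the rounded interval limits printed in the combined table, so the results agree with the reported widths only up to that rounding. The column names `btc.lo`, `btc.hi`, `btr.lo`, `btr.hi`, and `ratio` are made up for this sketch.

```r
# Interval limits copied from the combined table (already rounded there)
ci <- data.frame(
  term   = c("(Intercept)", "height", "footLen", "genderFemale"),
  btc.lo = c(4.06,   4e-04, -7e-04, -0.0285),
  btc.hi = c(4.5359, 7e-04, 0.0012,  0.0176),
  btr.lo = c(4.1258, 4e-04, -3e-04, -0.0219),
  btr.hi = c(4.4649, 7e-04,  9e-04,  0.0122)
)

# Width = upper limit - lower limit
ci$btc.wd <- ci$btc.hi - ci$btc.lo
ci$btr.wd <- ci$btr.hi - ci$btr.lo

# Ratio below 1 means the second interval is the narrower one
ci$ratio <- ci$btr.wd / ci$btc.wd

print(ci[, c("term", "btc.wd", "btr.wd", "ratio")])
```

For height, the limits are rounded to one significant digit in the table, so the recomputed widths tie even though the unrounded widths reported above differ.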

8 Summary

8.1 Findings

In this statistical report we looked into variables that are closely related to hand length and how they can be used for prediction. The residuals of the original model were already approximately normal, but I judged the log transformation to give the best and most useful model, and the residual bootstrap gave the narrowest confidence intervals. The model we found is below, and we chose to keep height, foot length, and gender in it. Bootstrapping the coefficients and the residuals turned out to be unnecessary because the normality assumption already held reasonably well, but it did give us valid confidence intervals for the estimates.

\[ \log(handLen) =4.3042 + 0.0005\times height +0.0003\times footLen -0.0057\times genderFemale \]

8.2 Drawbacks

Gender and foot length were not statistically significant in our first assessment, and they remained insignificant in the transformed models. Further data exploration showed that both are related to hand length, even though the relationship is not strong enough to be significant once height is in the model.

8.3 Recommendations for application

I would recommend using the model shown above to predict hand length for males and females from their height, gender, and foot length.