Can a person’s hand length predict their foot length?

We randomly collected 128 observations of adults’ hand and foot lengths to investigate whether a person’s foot length can be predicted from their hand length.

Let’s start by stating the hypotheses as follows:
H0: A person’s hand length does not predict their foot length (there is no linear relationship).
HA: A person’s hand length predicts their foot length (there is a linear relationship).

# Required libraries
library(readr)
library(magrittr)
library(dplyr)

Data Retrieval

# Read the data (the file starts with a UTF-8 byte order mark, so declare it)
Hand_length <- read.csv('Hand_length.csv', fileEncoding = 'UTF-8-BOM')

# Check the columns of the original data
colnames(Hand_length)
## [1] "Timestamp"     "Email.address" "Gender"        "hand_length"  
## [5] "foot_length"
# Select relevant columns for analysis
Hand_length <- Hand_length %>% select("Gender", "hand_length", "foot_length")
Hand_length %>% head
##   Gender hand_length foot_length
## 1 Female        18.0        22.8
## 2   Male        18.6        23.0
## 3 Female        18.0        23.7
## 4 Female        16.5        24.0
## 5 Female        15.0        25.0
## 6 Female        16.0        25.0
# Plot the data to check the relationship
plot(foot_length~hand_length, data = Hand_length, main = "Plot of Hand and Foot length")

A few people recorded their measurements in centimeters, and a few entered values ten times too large for millimeters, which makes the data inconsistent. For this analysis, we will use millimeters. Let’s fix these values.

Correct Data Entry Errors

### Data Cleaning 1: 
# if the hand/foot length is less than 100, multiply it by 10.
Hand_length$hand_length <- ifelse(Hand_length$hand_length < 100,
                                  Hand_length$hand_length * 10,
                                  Hand_length$hand_length)

Hand_length$foot_length <- ifelse(Hand_length$foot_length < 100,
                                  Hand_length$foot_length * 10,
                                  Hand_length$foot_length)

### Data Cleaning 2:
# if the hand/foot length is greater than 1000, divide it by 10.
Hand_length$hand_length <- ifelse(Hand_length$hand_length > 1000,
                                  Hand_length$hand_length / 10,
                                  Hand_length$hand_length)

Hand_length$foot_length <- ifelse(Hand_length$foot_length > 1000,
                                  Hand_length$foot_length / 10,
                                  Hand_length$foot_length)

# Recheck our plot
plot(foot_length~hand_length, data = Hand_length, main = "Plot of Hand and Foot length", xlab = "Hand Length (mm)", ylab="Foot Length (mm)")
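
The two cleaning rules above can also be wrapped in one small helper so that the same logic is applied to both columns. This is only a sketch: the name `to_mm` is hypothetical, and the reading of the two cut-offs (values below 100 are centimeters, values above 1000 are ten times too large) is an assumption taken from the cleaning code, not something confirmed by the respondents.

```r
# Hypothetical helper: bring a length recorded in mixed units onto millimeters.
# The cut-offs (100 and 1000) are the ones used in the cleaning steps above.
to_mm <- function(x) {
  x <- ifelse(x < 100, x * 10, x)    # assumed centimeters -> millimeters
  x <- ifelse(x > 1000, x / 10, x)   # assumed ten times too large -> millimeters
  x
}

# Made-up values: 18.6 (cm), 186 (already mm), 1860 (ten times too large)
to_mm(c(18.6, 186, 1860))
```

All three made-up values end up at 186 mm. With the real data this would be used as `Hand_length$hand_length <- to_mm(Hand_length$hand_length)`, and likewise for `foot_length`.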

Looks like we have outliers. Let’s check the summary statistics and calculate the z-scores to identify them.

Any z-score greater than 3 or less than -2 is considered an outlier.
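
As a rough illustration of this rule, the sketch below flags values by z-score on a small made-up vector. The helper name `flag_outliers` and the toy values are assumptions for illustration only; they are not part of the survey data.

```r
# Hypothetical helper: flag values whose z-score is above `upper` or below `lower`.
flag_outliers <- function(x, lower = -2, upper = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  z > upper | z < lower
}

# Made-up hand lengths in mm; the last value mimics a data entry error
toy_hand <- c(rep(c(180, 190, 200), 6), 190, 800)

# Position of the flagged value(s)
which(flag_outliers(toy_hand))
```

Only the last value (position 20) is flagged here. Note that an extreme value also inflates the mean and standard deviation used to compute the z-scores, so this rule can miss outliers in very small samples.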

# Summary Statistic of hand and foot length
Hand_length %>% summarise_at(c("hand_length", "foot_length"),
                             list(mean=mean, sd=sd))
##   hand_length_mean foot_length_mean hand_length_sd foot_length_sd
## 1         190.6992          255.625       73.79657       52.01635
# Calculate the hand_length and foot_length z-scores
hand_z_scores <- (Hand_length$hand_length-mean(Hand_length$hand_length))/sd(Hand_length$hand_length)

foot_z_scores <- (Hand_length$foot_length-mean(Hand_length$foot_length))/sd(Hand_length$foot_length)

# Subset data
hand_foot_subset <- subset(Hand_length, hand_z_scores > 3 | hand_z_scores < -2 |
                                  foot_z_scores > 3 | foot_z_scores < -2)
hand_foot_subset 
##     Gender hand_length foot_length
## 9     Male         210         108
## 76  Female         800         275
## 121 Female         170         750
## 124   Male         720         240

We found the four outliers listed above. Rows 76 and 124 are hand length outliers, and rows 9 and 121 are foot length outliers. Since we could not confirm the correct values, let’s remove them from our data.

Remove Outliers

### Data Cleaning 3 ###

# Set hand lengths flagged as outliers (z-score > 3 or < -2) to NA
Hand_length$hand_length <- ifelse(hand_z_scores > 3 | hand_z_scores < -2,
                                  NA,
                                  Hand_length$hand_length)
# Set foot lengths flagged as outliers (z-score > 3 or < -2) to NA
Hand_length$foot_length <- ifelse(foot_z_scores > 3 | foot_z_scores < -2,
                                  NA,
                                  Hand_length$foot_length)


# Recheck our plot
plot(foot_length~hand_length, data = Hand_length, main = "Plot of Hand and Foot length", xlab = "Hand Length (mm)", ylab="Foot Length (mm)")

The data has been cleaned. The next step is to fit a linear model to the data and assess whether the model is statistically significant.

# Fit a linear model
model1 <- lm(foot_length~hand_length, data=Hand_length)
model1 %>% summary()
## 
## Call:
## lm(formula = foot_length ~ hand_length, data = Hand_length)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.101 -11.380  -1.741  10.054  67.178 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  84.3539    21.8402   3.862 0.000181 ***
## hand_length   0.9279     0.1199   7.740 3.23e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.42 on 122 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.3293, Adjusted R-squared:  0.3238 
## F-statistic: 59.91 on 1 and 122 DF,  p-value: 3.232e-12

The linear model was statistically significant with the test result of F(1,122) = 59.91, p < 0.001.
Hand length explained about 32.9% of the variability in foot length.
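
For a regression with a single predictor, R-squared is simply the square of the Pearson correlation between the two variables. The sketch below checks this on simulated data; the data frame `toy` and its parameters are made up for illustration and are not the survey data.

```r
set.seed(1)  # simulated data, not the survey data
toy <- data.frame(hand_length = rnorm(50, mean = 185, sd = 12))
toy$foot_length <- 85 + 0.9 * toy$hand_length + rnorm(50, sd = 20)

toy_model <- lm(foot_length ~ hand_length, data = toy)

# R-squared reported by the model
r_squared <- summary(toy_model)$r.squared

# Pearson correlation between predictor and response
r <- cor(toy$hand_length, toy$foot_length)

# The two agree up to floating point error
all.equal(r_squared, r^2)
```

On the cleaned survey data, the same comparison would need `use = "complete.obs"` in `cor()`, because the outliers were replaced with NA.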

Let’s test the main assumptions of linear regression and check whether they hold.

# Produce the diagnostic plots for our model
plot(model1)

Linearity:

The scatter plot suggested a linear relationship. There were no non-linear trends in the Residuals vs Fitted plot, so other non-linear relationships were ruled out.

Normality of Residuals:

The Normal Q-Q plot didn’t show any obvious departures from normality. The residuals fall close to the line.
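
The visual check can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test (`shapiro.test()` from base R) to the residuals of a model fitted on simulated data. This test is not part of the original analysis, and the data frame `toy` is made up for illustration.

```r
set.seed(3)  # simulated data, not the survey data
toy <- data.frame(hand_length = rnorm(60, mean = 185, sd = 12))
toy$foot_length <- 85 + 0.9 * toy$hand_length + rnorm(60, sd = 20)

toy_model <- lm(foot_length ~ hand_length, data = toy)

# Shapiro-Wilk test on the residuals; H0: the residuals are normally distributed
sw <- shapiro.test(residuals(toy_model))
sw$statistic  # W statistic, values close to 1 are consistent with normality
sw$p.value    # a small p-value would suggest a departure from normality
```

With the real analysis the call would be `shapiro.test(residuals(model1))`.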

Homoscedasticity:

Homoscedasticity appears to be violated according to the Scale-Location plot. The variance of the residuals appeared to decrease across predicted values.
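
One way to back up this visual impression is a studentized Breusch-Pagan style check, sketched below in base R: regress the squared residuals on the predictor and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. This check is not part of the original analysis, and the simulated data frame `toy` (built so that the noise grows with hand length) is an assumption for illustration.

```r
set.seed(2)  # simulated data, not the survey data
toy <- data.frame(hand_length = runif(100, min = 160, max = 210))
# Noise standard deviation grows with hand length (heteroscedastic by construction)
toy$foot_length <- 85 + 0.9 * toy$hand_length +
  rnorm(100, sd = (toy$hand_length - 150) / 3)

toy_model <- lm(foot_length ~ hand_length, data = toy)

# Auxiliary regression of squared residuals on the predictor
toy$sq_resid <- residuals(toy_model)^2
aux_model <- lm(sq_resid ~ hand_length, data = toy)

# Test statistic: n * R-squared, compared with chi-squared on 1 degree of freedom
bp_stat <- nrow(toy) * summary(aux_model)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```

A small p-value would indicate that the residual variance changes with hand length. Applied to the real model, `toy_model` and `toy` would be replaced by `model1` and the rows of `Hand_length` used in the fit.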

Influential Cases:

There appeared to be no influential cases. No points fall outside the Cook’s distance bands in the Residuals vs Leverage plot.

Let’s interpret and test the statistical significance of the regression intercept and slope.

# Test the statistical significance of the regression intercept and slope
model1 %>% summary()
## 
## Call:
## lm(formula = foot_length ~ hand_length, data = Hand_length)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.101 -11.380  -1.741  10.054  67.178 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  84.3539    21.8402   3.862 0.000181 ***
## hand_length   0.9279     0.1199   7.740 3.23e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.42 on 122 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.3293, Adjusted R-squared:  0.3238 
## F-statistic: 59.91 on 1 and 122 DF,  p-value: 3.232e-12
model1 %>% confint()
##                  2.5 %     97.5 %
## (Intercept) 41.1190207 127.588745
## hand_length  0.6905936   1.165256
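
The confidence interval for the slope can be reproduced by hand from the estimate and standard error in the summary output: estimate plus or minus the critical t value times the standard error. The sketch below uses the rounded values printed above, so it matches `confint()` only to about three decimal places.

```r
# Values copied from the summary() output above
estimate  <- 0.9279   # slope for hand_length
std_error <- 0.1199   # its standard error
df_resid  <- 122      # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Lower and upper limits of the interval
ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 2)
```

Rounded to two decimals this gives 0.69 and 1.17, in line with the `confint()` output.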

Interpretations

The estimated mean foot length when hand length = 0 was 84.35mm.
The intercept of the regression was statistically significant:
a = 84.35, p < 0.001, 95% CI (41.12, 127.59)
For every one unit (mm) increase in hand length, the mean foot length was estimated to increase by 0.93mm.
The slope of the regression for hand length was statistically significant:
b = 0.93, p < 0.001, 95% CI (0.69, 1.17)
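
To make the intercept and slope concrete, the sketch below plugs a hand length into the fitted line using the coefficients from the summary output. The function name `predict_foot` and the example hand length of 190 mm are chosen for illustration only.

```r
# Coefficients copied from the summary() output above
intercept <- 84.3539
slope     <- 0.9279

# Hypothetical helper: estimated mean foot length (mm) for a given hand length (mm)
predict_foot <- function(hand_mm) {
  intercept + slope * hand_mm
}

predict_foot(190)
```

For a hand length of 190 mm, the estimated mean foot length is about 260.7 mm. With the fitted model object, the equivalent call would be `predict(model1, newdata = data.frame(hand_length = 190))`.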

Conclusion

Overall, the test shows that there is a statistically significant positive linear relationship between hand length and foot length. Therefore, we reject the null hypothesis.

A person’s hand length was estimated to explain about 32.9% of the variability in foot length.