Can a person’s hand length predict their foot length?
We randomly collected 128 observations of adults’ hand and foot lengths to investigate whether a person’s foot length can be predicted from their hand length.
Let’s start by stating the hypotheses as follows:
H0: There is no linear relationship between a person’s hand length and their foot length.
HA: There is a linear relationship between a person’s hand length and their foot length.
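Because the analysis below fits a simple linear regression, these hypotheses can be restated in terms of the slope. As a sketch, using a for the intercept and b for the slope (the notation used in the results further down) and ε for the error term (a symbol introduced here for illustration):

```latex
\text{foot\_length} = a + b \cdot \text{hand\_length} + \varepsilon,
\qquad H_0 : b = 0,
\qquad H_A : b \neq 0
```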
# Required libraries
library(readr)
library(magrittr)
library(dplyr)
# Read the data
Hand_length <- read.csv('Hand_length.csv')
# Check the columns of the original data
colnames(Hand_length)
## [1] "ï..Timestamp" "Email.address" "Gender" "hand_length"
## [5] "foot_length"
# Select relevant columns for analysis
Hand_length <- Hand_length %>% select("Gender", "hand_length", "foot_length")
Hand_length %>% head
##   Gender hand_length foot_length
## 1 Female        18.0        22.8
## 2   Male        18.6        23.0
## 3 Female        18.0        23.7
## 4 Female        16.5        24.0
## 5 Female        15.0        25.0
## 6 Female        16.0        25.0
# Plot the data to see the relationship between hand and foot length
plot(foot_length~hand_length, data = Hand_length, main = "Plot of Hand and Foot length")
A few people recorded their measurements in centimeters and a few in tenths of a millimeter, which makes the data inconsistent. For this analysis, we will use millimeters. Let’s fix these values.
### Data Cleaning 1:
# If the hand/foot length is less than 100, multiply it by 10.
Hand_length$hand_length <- ifelse(Hand_length$hand_length < 100,
                                  Hand_length$hand_length * 10,
                                  Hand_length$hand_length)
Hand_length$foot_length <- ifelse(Hand_length$foot_length < 100,
                                  Hand_length$foot_length * 10,
                                  Hand_length$foot_length)
### Data Cleaning 2:
# if the hand/foot length is greater than 1000, divide it by 10.
Hand_length$hand_length <- ifelse(Hand_length$hand_length > 1000,
                                  Hand_length$hand_length / 10,
                                  Hand_length$hand_length)
Hand_length$foot_length <- ifelse(Hand_length$foot_length > 1000,
                                  Hand_length$foot_length / 10,
                                  Hand_length$foot_length)
# Recheck our plot
plot(foot_length~hand_length, data = Hand_length, main = "Plot of Hand and Foot length", xlab = "Hand Length (mm)", ylab="Foot Length (mm)")
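The two unit-cleaning rules above could also be wrapped in one small helper so that both columns are treated identically. This is only a sketch: `to_mm` is a hypothetical name that does not appear in the original analysis, and the thresholds of 100 and 1000 are the same heuristics used in the cleaning steps, not a general-purpose unit detector.

```r
# Hypothetical helper (not part of the original analysis): convert mixed-unit
# lengths to millimeters using the same two thresholds as the cleaning steps.
to_mm <- function(x) {
  x <- ifelse(x < 100, x * 10, x)    # values below 100 are assumed to be in cm
  x <- ifelse(x > 1000, x / 10, x)   # values above 1000 are assumed to be 10x too large
  x
}

# Made-up example values: one in cm, one already in mm, one 10x too large.
converted <- to_mm(c(18.0, 186, 1850))
converted
```

With dplyr 1.0 or later, something like `Hand_length %>% mutate(across(c(hand_length, foot_length), to_mm))` should apply the helper to both columns in one step.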
It looks like we still have outliers. Let’s check the summary statistics and calculate z-scores to identify them.
In this analysis, any z-score greater than 3 or less than -2 is treated as an outlier.
# Summary statistics of hand and foot length
Hand_length %>% summarise_at(c("hand_length", "foot_length"),
                             list(mean = mean, sd = sd))
##   hand_length_mean foot_length_mean hand_length_sd foot_length_sd
## 1         190.6992          255.625       73.79657       52.01635
# Calculate the hand and foot length z-scores
hand_z_scores <- (Hand_length$hand_length - mean(Hand_length$hand_length)) / sd(Hand_length$hand_length)
foot_z_scores <- (Hand_length$foot_length - mean(Hand_length$foot_length)) / sd(Hand_length$foot_length)
# Subset the rows where either z-score is greater than 3 or less than -2
hand_foot_subset <- subset(Hand_length, hand_z_scores > 3 | hand_z_scores < -2 |
                             foot_z_scores > 3 | foot_z_scores < -2)
hand_foot_subset
##     Gender hand_length foot_length
## 9     Male         210         108
## 76  Female         800         275
## 121 Female         170         750
## 124   Male         720         240
The rows listed above are outliers: rows 76 and 124 have outlying hand lengths, and rows 9 and 121 have outlying foot lengths. Since we received no confirmation of what the correct values should be, let’s exclude these values from our data.
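The z-score rule can be illustrated on a small, made-up vector. The sketch below assumes the same cutoffs as the text (greater than 3 or less than -2); `flag_outliers` and `toy_hand` are hypothetical names introduced for this example only.

```r
# Hypothetical helper: TRUE for values whose z-score is below `lower`
# or above `upper`. The mean and sd are computed from the data itself.
flag_outliers <- function(x, lower = -2, upper = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  z < lower | z > upper
}

# Made-up data: 31 plausible hand lengths (mm) plus one extreme value.
toy_hand <- c(seq(170, 200, by = 1), 800)
flagged <- flag_outliers(toy_hand)

# Show the flagged values.
toy_hand[flagged]
```

Note that extreme values inflate both the mean and the standard deviation used in the z-score, so in very small samples even a gross error may not reach a cutoff of 3.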
### Data Cleaning 3:
# Set hand lengths flagged as outliers by the z-score rule to NA
Hand_length$hand_length <- ifelse(hand_z_scores > 3 | hand_z_scores < -2,
                                  NA,
                                  Hand_length$hand_length)
# Set foot lengths flagged as outliers by the z-score rule to NA
Hand_length$foot_length <- ifelse(foot_z_scores > 3 | foot_z_scores < -2,
                                  NA,
                                  Hand_length$foot_length)
# Recheck our plot
plot(foot_length~hand_length, data = Hand_length, main = "Plot of Hand and Foot length", xlab = "Hand Length (mm)", ylab="Foot Length (mm)")
The data has been cleaned. The next step is to fit a linear model to the data and assess whether the model is statistically significant.
# Fit the linear model
model1 <- lm(foot_length~hand_length, data=Hand_length)
model1 %>% summary()
##
## Call:
## lm(formula = foot_length ~ hand_length, data = Hand_length)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -52.101 -11.380  -1.741  10.054  67.178
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  84.3539    21.8402   3.862 0.000181 ***
## hand_length   0.9279     0.1199   7.740 3.23e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.42 on 122 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.3293, Adjusted R-squared: 0.3238
## F-statistic: 59.91 on 1 and 122 DF, p-value: 3.232e-12
The linear model was statistically significant, F(1, 122) = 59.91, p < 0.001.
Hand length explained about 32.9% of the variability in foot length.
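As a quick arithmetic check, the F statistic in the summary output above can be recovered from the other reported numbers. The sketch below copies the rounded values from the printed summary, so small rounding differences are expected; the variable names are chosen here for illustration.

```r
# Values copied (rounded) from the model summary above.
r_squared <- 0.3293
df_resid  <- 122
t_slope   <- 7.740

# With a single predictor, F = R^2 / (1 - R^2) * residual df, and F = t^2.
f_from_r2 <- r_squared / (1 - r_squared) * df_resid
f_from_t  <- t_slope^2

# Upper-tail p-value of the reported F statistic on 1 and 122 degrees of freedom.
p_value <- pf(59.91, df1 = 1, df2 = df_resid, lower.tail = FALSE)

c(f_from_r2 = f_from_r2, f_from_t = f_from_t, p_value = p_value)
```

Both routes should land close to the reported F of 59.91, and the p-value should be far below 0.001.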
Let’s check the main assumptions of linear regression and see whether they hold.
# Produce the diagnostic plots for our model
plot(model1)
The scatter plot suggested a linear relationship, and there were no non-linear trends in the Residuals vs Fitted plot.
The Normal Q-Q plot didn’t show any obvious departures from normality. The residuals fall close to the line.
Homoscedasticity looks violated according to the Scale-Location plot. The variance of the residuals appeared to decrease across the fitted values.
There appeared to be no influential cases. No point falls outside the Cook’s distance bands in the Residuals vs Leverage plot.
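The visual checks above can be complemented with numeric ones. The following is a minimal sketch on simulated data (not the survey data), using only base R. The Breusch-Pagan statistic is computed by hand in its studentized form, which is a technique swapped in here and not something the original analysis used.

```r
set.seed(1)  # simulated data, for illustration only
n <- 124
hand <- rnorm(n, mean = 185, sd = 15)
foot <- 84 + 0.93 * hand + rnorm(n, sd = 20)
fit <- lm(foot ~ hand)

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normality).
shapiro_p <- shapiro.test(residuals(fit))$p.value

# Constant variance: studentized Breusch-Pagan statistic, computed as
# n * R^2 from regressing the squared residuals on the predictor.
aux <- lm(residuals(fit)^2 ~ hand)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Influence: the largest Cook's distance; values near or above 1 deserve a closer look.
max_cook <- max(cooks.distance(fit))

c(shapiro_p = shapiro_p, bp_p = bp_p, max_cook = max_cook)
```

To apply the same checks to the real analysis, `fit` would be replaced by `model1` and the auxiliary regression would use the rows and predictor that `model1` actually used.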
Let’s interpret and test the statistical significance of the regression intercept and slope.
# Test the statistical significance of the regression intercept and slope
model1 %>% confint()
##                  2.5 %     97.5 %
## (Intercept) 41.1190207 127.588745
## hand_length  0.6905936   1.165256
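The intervals printed by `confint()` can be reproduced by hand as estimate ± t critical value × standard error. The sketch below copies the rounded estimates and standard errors from the coefficient table, so the results agree with the output above only up to rounding; the variable names are chosen here for illustration.

```r
# Estimates and standard errors copied (rounded) from the coefficient table.
est <- c(intercept = 84.3539, slope = 0.9279)
se  <- c(intercept = 21.8402, slope = 0.1199)
df_resid <- 122

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```

Neither interval contains 0, which matches the small p-values reported for the intercept and the slope.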
The estimated mean foot length when hand length = 0 was 84.35 mm. A hand length of 0 lies far outside the observed data, so the intercept has no practical interpretation on its own.
The intercept of the regression was statistically significant:
a = 84.35, p < 0.001, 95% CI (41.12, 127.59)
For every 1 mm increase in hand length, the mean foot length was estimated to increase by 0.93 mm.
The slope of the regression for hand length was statistically significant:
b = 0.93, p < 0.001, 95% CI (0.69, 1.17)
Overall, the test shows a statistically significant positive linear relationship between hand length and foot length. Therefore, we reject the null hypothesis.
A person’s hand length was estimated to explain about 32.9% of the variability in foot length.
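To connect the result back to the opening question, the fitted line can be used to predict a mean foot length from a hand length. This is a sketch with the rounded coefficients hard-coded from the output above; `predict_foot` is a hypothetical helper name, and the hand lengths are made-up inputs.

```r
# Hypothetical helper: predicted mean foot length (mm) from hand length (mm),
# using the rounded intercept and slope from the fitted model.
predict_foot <- function(hand_mm, a = 84.3539, b = 0.9279) {
  a + b * hand_mm
}

# Made-up hand lengths in millimeters.
predicted <- predict_foot(c(170, 180, 190))
predicted
```

With the model object available, `predict(model1, newdata = data.frame(hand_length = 180), interval = "prediction")` would also return a prediction interval. Given a residual standard error of about 20 mm, predictions for an individual person are likely to be quite imprecise.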