I pulled a data set that records each person’s height, sex, and a variety of measurements of different appendages. A sample of the 55 entries is shown below.
library(tidyverse)  # provides read_tsv(), the dplyr verbs, and ggplot2 used below

raw_data <- read_tsv('C:\\Users\\Brian\\Desktop\\GradClasses\\Fall18\\605\\week12\\Physical.txt')
head(raw_data)
## # A tibble: 6 x 11
## Obs Sex Height LeftArm RtArm LeftFoot RtFoot LeftHand RtHand HeadCirc
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Male 69 25.5 25.5 27 26.5 9.5 9 58.5
## 2 2 Male 79 28 25 29 27.5 9 9 54
## 3 3 Male 75 27 27.5 31 32 3.75 3.75 62.5
## 4 4 Male 69 25 25.5 25.5 25.5 10 8 58.5
## 5 5 Male 65 25 25 23.5 23 9.5 9.4 57
## 6 6 Male 79 30.5 30.5 28 28 8.5 8.5 58.5
## # ... with 1 more variable: nose <dbl>
tail(raw_data)
## # A tibble: 6 x 11
## Obs Sex Height LeftArm RtArm LeftFoot RtFoot LeftHand RtHand HeadCirc
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 50 Fema~ 65 22.5 23 22.9 23.2 7.5 7.4 57
## 2 51 Fema~ 61 22.2 22.2 22 22.2 6.9 6.9 55.2
## 3 52 Fema~ 63 23 23 24 23 7.5 8 53
## 4 53 Fema~ 71 25.5 28 25.5 28 7.5 7.5 54.5
## 5 54 Fema~ 68 24.5 24.5 24.5 25.5 7 6.5 57.8
## 6 55 Fema~ 62 26 26 23 23 6 6 56
## # ... with 1 more variable: nose <dbl>
For the purposes of the assignment, I’m limiting myself to the categorical variable Sex (coded Male/Female) and the quantitative variable LeftArm to predict the response variable Height.
raw_data <- raw_data %>%
select(Height, Sex, LeftArm)
Examining the distribution of LeftArm shows a roughly normal shape. It is difficult to draw firm conclusions given the limited number of observations.
ggplot(raw_data) +
geom_histogram(aes(LeftArm), bins=15)
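To put a few numbers behind the histogram, a quick group-wise summary (my own sanity check on the sample size and spread, not part of the assignment) might look like this:

# sample size, mean, and spread of LeftArm within each sex
raw_data %>%
  group_by(Sex) %>%
  summarise(n = n(),
            mean_arm = mean(LeftArm),
            sd_arm = sd(LeftArm))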
The plot below explores the relationship among the three variables, and there are some positive signs. Sex appears to be an important factor in determining Height, and there appears to be a positive correlation between LeftArm length and Height. It is difficult to tell whether the relationship is linear or could benefit from a quadratic term.
ggplot(raw_data, aes(LeftArm, Height, group=Sex, color=Sex)) +
geom_point() +
geom_smooth(method='lm')
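To quantify the visual impression of a positive relationship, the correlation between LeftArm and Height can also be computed, both overall and within each sex; a quick sketch (output not shown):

# Pearson correlation overall and within each sex
cor(raw_data$LeftArm, raw_data$Height)

raw_data %>%
  group_by(Sex) %>%
  summarise(r = cor(LeftArm, Height))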
I performed the regression indicated by the assignment. The result is that while the model has predictive value overall (the F test is highly significant), no individual predictor is statistically significant. This was a disappointing result given the initial exploration.
lm.1 <- lm(Height ~ Sex * LeftArm + I(LeftArm^2), data = raw_data)
summary(lm.1)
##
## Call:
## lm(formula = Height ~ Sex * LeftArm + I(LeftArm^2), data = raw_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4778 -1.5873 -0.4778 1.5762 6.5611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.05370 53.75821 1.526 0.133
## SexMale 22.65142 14.00245 1.618 0.112
## LeftArm -2.56361 4.32127 -0.593 0.556
## I(LeftArm^2) 0.07704 0.08689 0.887 0.380
## SexMale:LeftArm -0.74583 0.55029 -1.355 0.181
##
## Residual standard error: 2.551 on 50 degrees of freedom
## Multiple R-squared: 0.6916, Adjusted R-squared: 0.6669
## F-statistic: 28.03 on 4 and 50 DF, p-value: 3.103e-12
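One way to dig into the lack of individually significant terms is to look at the coefficient confidence intervals and at how strongly the polynomial terms are correlated; the check below is my own addition, not part of the assignment:

# 95% confidence intervals for each coefficient; all of them span zero,
# consistent with the non-significant t tests above
confint(lm.1)

# LeftArm and LeftArm^2 are almost perfectly correlated, which inflates
# the standard errors of the individual coefficients
cor(raw_data$LeftArm, raw_data$LeftArm^2)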
Exploring the diagnostics indicates that this is a valid model. The residuals show no apparent pattern, and the points in the Q-Q plot mostly fall along the line. Furthermore, there are no obvious outliers or high-leverage points.
plot(lm.1)
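To back the no-outlier claim with numbers rather than just the plots, Cook’s distance and the hat values can be checked directly; the 4/n and twice-the-average-leverage cutoffs below are common rules of thumb I chose, not values prescribed by the assignment:

# observations with Cook's distance above the 4/n rule of thumb
n <- nrow(raw_data)
which(cooks.distance(lm.1) > 4 / n)

# observations with leverage more than twice the average hat value
which(hatvalues(lm.1) > 2 * mean(hatvalues(lm.1)))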
After further examination, I created a second regression that includes only the Sex term and the quadratic LeftArm term. In this case all of the coefficients are statistically significant, and the \(R^2\) is approximately the same as in the previous regression. The diagnostics indicate that this is also a valid regression.
lm.2 <- lm(Height ~ Sex + I(LeftArm^2), data = raw_data)
summary(lm.2)
##
## Call:
## lm(formula = Height ~ Sex + I(LeftArm^2), data = raw_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0364 -1.8186 -0.3415 1.8319 6.1447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.027177 2.206095 24.490 < 2e-16 ***
## SexMale 3.740484 0.837597 4.466 4.32e-05 ***
## I(LeftArm^2) 0.019244 0.003628 5.305 2.35e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.551 on 52 degrees of freedom
## Multiple R-squared: 0.6792, Adjusted R-squared: 0.6668
## F-statistic: 55.04 on 2 and 52 DF, p-value: 1.454e-13
plot(lm.2)
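Before settling on one of the two fits, it is worth noting that lm.2 is nested inside lm.1 (it drops the LeftArm main effect and the interaction), so a partial F test and AIC can compare them directly; a quick sketch, output not shown:

# partial F test of the dropped terms (LeftArm main effect and interaction)
anova(lm.2, lm.1)

# AIC as a second opinion; lower is better
AIC(lm.1, lm.2)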
Given this information, I believe the second regression is the superior model. Since both have significant overall p-values and an \(R^2\) near \(70\%\), I tend toward the simpler model with the statistically significant terms.
\[\widehat{Height} = 54.03 + 3.74\times Male + 0.019\times LeftArm^2\]
Finally, for fun, I measured my wife’s and my left arm lengths to see how well the regression performs.
our.data <- tibble(Sex=c('Male', 'Female'), LeftArm=c(30, 26.5), Height=c(68, 64))
predict(lm.2, our.data)
## 1 2
## 75.08761 67.54154
It predicted ~75 inches for me, and I’m 68; for my wife it predicted ~67.5 inches, while she is 64. So the regression is not that accurate for us, although we do both have long arms.
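For context on those misses, the prediction intervals for individual people are fairly wide; a quick check (my own addition, output not shown) would be:

# 95% prediction intervals for the two new observations
predict(lm.2, our.data, interval = "prediction")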