I pulled a data set that measures a person’s height, their gender and a variety of measurements of different appendages. A sample of the 55 entries is below.

raw_data <- read_tsv('C:\\Users\\Brian\\Desktop\\GradClasses\\Fall18\\605\\week12\\Physical.txt')
head(raw_data)
## # A tibble: 6 x 11
##     Obs Sex   Height LeftArm RtArm LeftFoot RtFoot LeftHand RtHand HeadCirc
##   <int> <chr>  <dbl>   <dbl> <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
## 1     1 Male      69    25.5  25.5     27     26.5     9.5    9        58.5
## 2     2 Male      79    28    25       29     27.5     9      9        54  
## 3     3 Male      75    27    27.5     31     32       3.75   3.75     62.5
## 4     4 Male      69    25    25.5     25.5   25.5    10      8        58.5
## 5     5 Male      65    25    25       23.5   23       9.5    9.4      57  
## 6     6 Male      79    30.5  30.5     28     28       8.5    8.5      58.5
## # ... with 1 more variable: nose <dbl>
tail(raw_data)
## # A tibble: 6 x 11
##     Obs Sex   Height LeftArm RtArm LeftFoot RtFoot LeftHand RtHand HeadCirc
##   <int> <chr>  <dbl>   <dbl> <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
## 1    50 Fema~     65    22.5  23       22.9   23.2      7.5    7.4     57  
## 2    51 Fema~     61    22.2  22.2     22     22.2      6.9    6.9     55.2
## 3    52 Fema~     63    23    23       24     23        7.5    8       53  
## 4    53 Fema~     71    25.5  28       25.5   28        7.5    7.5     54.5
## 5    54 Fema~     68    24.5  24.5     24.5   25.5      7      6.5     57.8
## 6    55 Fema~     62    26    26       23     23        6      6       56  
## # ... with 1 more variable: nose <dbl>

For the purposes of the assignment, I’m limiting myself to only the categorical variable of Sex (coded Male/Female) and the quantitative variable of LeftArm in order to predict the response variable of Height.

raw_data <- raw_data %>%
  select(Height, Sex, LeftArm)

Examining the distribution of LeftArm shows a very rough normal distribution. It is difficult to draw any conclusions from the data due to the limited number of observations.

ggplot(raw_data) +
  geom_histogram(aes(LeftArm), bins=15)

The below plot explores the relationship between the three variables. There are some positive signs in this plot. Sex appears to be a significant factor in determining height and there appears to be some positive correlation between LeftArm length and Height. It is difficult to tell if this is a linear relationship or could benefit from a quadradic term.

ggplot(raw_data, aes(LeftArm, Height, group=Sex, color=Sex)) +
  geom_point() +
  geom_smooth(method='lm')

I performed the regression as indicated by the assignment and the result is that while the data has some predictive value overall, no single predictor is statistically significant. This was a disappointing result given the initial exploration.

lm <- lm(Height ~ Sex * LeftArm + I(LeftArm ** 2), raw_data)
summary(lm)
## 
## Call:
## lm(formula = Height ~ Sex * LeftArm + I(LeftArm^2), data = raw_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4778 -1.5873 -0.4778  1.5762  6.5611 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     82.05370   53.75821   1.526    0.133
## SexMale         22.65142   14.00245   1.618    0.112
## LeftArm         -2.56361    4.32127  -0.593    0.556
## I(LeftArm^2)     0.07704    0.08689   0.887    0.380
## SexMale:LeftArm -0.74583    0.55029  -1.355    0.181
## 
## Residual standard error: 2.551 on 50 degrees of freedom
## Multiple R-squared:  0.6916, Adjusted R-squared:  0.6669 
## F-statistic: 28.03 on 4 and 50 DF,  p-value: 3.103e-12

Exploring the diagnostics indicates that this is a valid model. The residuals show no apparent pattern and the qqplot mostly fits along the line. Furthermore, there are no outliers or high leverage points.

plot(lm)

After further examination, I created a second regression that only included the Sex term and the quadradic LeftArm term. In this case, it turns out that all three predictors are significant and an \(R^1\) that is approximately the same as the previous regression. The diagnostics indicate that this is also a valid regression.

lm.2 <- lm(Height ~ Sex + I(LeftArm**2), raw_data)
summary(lm.2)
## 
## Call:
## lm(formula = Height ~ Sex + I(LeftArm^2), data = raw_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0364 -1.8186 -0.3415  1.8319  6.1447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.027177   2.206095  24.490  < 2e-16 ***
## SexMale       3.740484   0.837597   4.466 4.32e-05 ***
## I(LeftArm^2)  0.019244   0.003628   5.305 2.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.551 on 52 degrees of freedom
## Multiple R-squared:  0.6792, Adjusted R-squared:  0.6668 
## F-statistic: 55.04 on 2 and 52 DF,  p-value: 1.454e-13
plot(lm.2)

Given the information, I believe the second regression is the superior model. Considering both have significant p-values and \(R^2\) near \(70\%\) I tend towards the simplier model with the statistically significant terms.

\[\widehat{height}=3.74\times Male + 0.019\times LeftArm^2 + 54.02\]

Finally, for fun I decided to measure my wife and my left arm length to see how well the regression performed.

our.data <- tibble(Sex=c('Male', 'Female'), LeftArm=c(30, 26.5), Height=c(68, 64))
predict(lm.2, our.data)
##        1        2 
## 75.08761 67.54154

It predicted ~75 inches for me and I’m 68. For my wife it predicted ~67.5 inches while she is 64. So the regression is not that accurate, although we do both have long arms.