Model 1

library(HistData)  # provides the GaltonFamilies data set
lmmodel <- lm(childHeight ~ midparentHeight, data = GaltonFamilies)
summary(lmmodel)
## 
## Call:
## lm(formula = childHeight ~ midparentHeight, data = GaltonFamilies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9570 -2.6989 -0.2155  2.7961 11.6848 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     22.63624    4.26511   5.307 1.39e-07 ***
## midparentHeight  0.63736    0.06161  10.345  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.392 on 932 degrees of freedom
## Multiple R-squared:  0.103,  Adjusted R-squared:  0.102 
## F-statistic:   107 on 1 and 932 DF,  p-value: < 2.2e-16

Interpretation

This model examines the relationship between child height (dependent variable) and midparent height (independent variable) using data from the GaltonFamilies dataset.

Residuals:

  • Min: -8.9570
  • 1Q (1st Quartile): -2.6989
  • Median: -0.2155
  • 3Q (3rd Quartile): 2.7961
  • Max: 11.6848

These statistics summarize the distribution of the residuals (the differences between the observed and predicted values of child height). The middle 50% of residuals lie between -2.7 and 2.8 inches, so about half of the predictions are within roughly 2.8 inches of the observed height. The largest misses are the minimum (-8.96) and maximum (11.68).
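
The residual summary can be reproduced directly from the fitted model. The following is a minimal sketch, assuming the HistData package is installed; the names `res`, `five_num` and `inside_iqr` are illustrative only.

```r
library(HistData)

lmmodel <- lm(childHeight ~ midparentHeight, data = GaltonFamilies)
res <- residuals(lmmodel)

# Five-number summary, corresponding to the "Residuals:" block of summary(lmmodel)
five_num <- quantile(res)
print(round(five_num, 4))

# Share of residuals lying between the 1st and 3rd quartiles (about one half)
inside_iqr <- mean(res >= five_num[["25%"]] & res <= five_num[["75%"]])
print(inside_iqr)
```

The quartiles bound the middle half of the residuals by construction, which is why they describe the typical error and not the range that contains most errors.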

Coefficients:

  • Intercept (Estimate = 22.63624): This is the expected child height when the midparent height is zero. A midparent height of zero is far outside the observed data, so the intercept has no practical interpretation on its own; it only anchors the regression line.

  • midparentHeight (Estimate = 0.63736): For every 1 unit increase in midparent height (in inches), the child’s height is expected to increase by 0.637 inches, on average. This positive coefficient indicates a direct relationship between the two variables.
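
To make the two coefficients concrete, a prediction can be computed by hand. This is a small sketch using the estimates printed above; the helper `predict_child` and the example midparent heights of 69 and 70 inches are hypothetical choices for illustration.

```r
# Coefficients copied from the summary output above
intercept <- 22.63624
slope     <- 0.63736

# Hypothetical helper: predicted child height (inches) for a midparent height
predict_child <- function(midparent) intercept + slope * midparent

pred_69 <- predict_child(69)
pred_70 <- predict_child(70)
print(pred_69)            # about 66.61 inches
print(pred_70 - pred_69)  # the slope: 0.63736 inches per extra inch
```

On a fitted model object, `predict(lmmodel, newdata = data.frame(midparentHeight = 69))` should give the same kind of result without copying coefficients.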

Statistical Significance:

  • Pr(>|t|):
    • The p-value for the intercept is very small (1.39e-07), indicating that it’s significantly different from zero, although this test is of little practical interest because a midparent height of zero lies far outside the data.
    • The p-value for the slope (midparent height) is also extremely small (< 2e-16), which means there’s very strong evidence that midparent height is a significant predictor of child height.
    Both coefficients are marked *** (p-value < 0.001), meaning they are highly significant.

Model Fit:

  • Residual standard error = 3.392: This estimates the standard deviation of the residuals, that is, the typical distance between the observed values and the regression line. Here, predictions are typically off by about 3.39 inches.

  • Multiple R-squared = 0.103: This indicates that about 10.3% of the variation in child height can be explained by midparent height. This is a relatively low value, meaning midparent height alone doesn’t explain much of the variation in child height.

  • Adjusted R-squared = 0.102: Adjusted R-squared is similar but takes into account the number of predictors in the model. Since there is only one predictor, it’s nearly the same as the R-squared.

  • F-statistic = 107, p-value: < 2.2e-16: This indicates that the model as a whole is statistically significant, meaning that midparent height is a significant predictor of child height.
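
Two of the fit statistics can be checked by hand from the printed output. The sketch below uses the rounded values shown above, so small rounding differences are expected; the variable names are illustrative.

```r
# Values copied from the summary output above
t_slope   <- 10.345
r_squared <- 0.103
n         <- 934   # 932 residual degrees of freedom + 2 estimated coefficients
p         <- 1     # number of predictors

# With one predictor, the overall F-statistic is the squared slope t-value
f_from_t <- t_slope^2
print(f_from_t)                 # about 107

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 3))  # 0.102
```

This is why the F-test and the slope's t-test lead to the same conclusion in a one-predictor model.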

Summary:

  • The model finds a statistically significant positive relationship between midparent height and child height.
  • The estimated slope is below one: each additional inch of midparent height is associated with about 0.64 inches of additional child height, on average.
  • The R-squared is relatively low, suggesting there are other factors beyond midparent height that explain child height.

Model 2

lmmodel2 <- lm(childHeight ~ mother + father, data = GaltonFamilies)
summary(lmmodel2)
## 
## Call:
## lm(formula = childHeight ~ mother + father, data = GaltonFamilies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.117 -2.741 -0.218  2.766 11.694 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.64328    4.26213   5.313 1.35e-07 ***
## mother       0.29051    0.04852   5.987 3.05e-09 ***
## father       0.36828    0.04489   8.204 7.66e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.389 on 931 degrees of freedom
## Multiple R-squared:  0.1052, Adjusted R-squared:  0.1033 
## F-statistic: 54.74 on 2 and 931 DF,  p-value: < 2.2e-16

Interpretation

This model examines the relationship between child height (dependent variable) and both mother’s height and father’s height (independent variables).

Residuals:

  • Min: -9.117
  • 1Q (1st Quartile): -2.741
  • Median: -0.218
  • 3Q (3rd Quartile): 2.766
  • Max: 11.694

These statistics summarize the distribution of residuals (the differences between the observed and predicted child heights). The residuals show a similar pattern to the first model: the middle 50% lie between about -2.7 and 2.8 inches, with extreme values reaching nearly -9.1 and +11.7.

Coefficients:

  • Intercept (Estimate = 22.64328): This is the expected child height when both mother’s and father’s heights are zero, which isn’t meaningful in this context but helps define the regression line. It represents the base value from which the heights of the parents influence the prediction.

  • Mother (Estimate = 0.29051): For every 1 unit (inch) increase in the mother’s height, the child’s height is expected to increase by about 0.29 inches, on average, holding the father’s height constant. This is a positive relationship, with a point estimate somewhat smaller than the father’s.

  • Father (Estimate = 0.36828): For every 1 unit (inch) increase in the father’s height, the child’s height is expected to increase by about 0.37 inches, on average, holding the mother’s height constant. The estimate is slightly larger than the mother’s, but the gap (about 0.08) is small relative to the standard errors (about 0.05 each), so this output alone does not establish that the father’s height matters more.
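
The two parental coefficients can be compared with what the midparent model implies. The sketch below assumes the definition given in the HistData documentation, midparentHeight = (father + 1.08 * mother) / 2; under that assumption the midparent model is a restricted version of this one. Variable names are illustrative.

```r
# Slope from the midparent-height model (first summary output)
slope_midparent <- 0.63736

# Assumed definition: midparentHeight = (father + 1.08 * mother) / 2
implied_father <- slope_midparent / 2
implied_mother <- slope_midparent * 1.08 / 2

# Estimates from the model with separate parental heights
est_father <- 0.36828
est_mother <- 0.29051

print(c(implied_father = implied_father, est_father = est_father))
print(c(implied_mother = implied_mother, est_mother = est_mother))
```

Each freely estimated coefficient sits only a little more than one standard error away from the value implied by the midparent model, which is consistent with the two models fitting almost equally well.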

Statistical Significance:

  • Pr(>|t|):
    • The p-values for the intercept (1.35e-07), mother’s height (3.05e-09), and father’s height (7.66e-16) are all extremely small, meaning that each coefficient is significantly different from zero.
    Both the mother’s and father’s heights are highly significant predictors of the child’s height, as indicated by the p-values and the significance levels (***).

Model Fit:

  • Residual standard error = 3.389: The observed child heights typically deviate from the model’s predicted values by about 3.39 inches, essentially the same as in the previous model.

  • Multiple R-squared = 0.1052: This indicates that about 10.52% of the variation in child height can be explained by the heights of the mother and father. This is only marginally higher than the model that used midparent height alone (R-squared of 10.3%); a gain this small is negligible in practice.

  • Adjusted R-squared = 0.1033: Adjusted R-squared is slightly lower than the R-squared and accounts for the number of predictors. It still suggests that the model has a low explanatory power.

  • F-statistic = 54.74, p-value: < 2.2e-16: The overall F-statistic is highly significant, meaning that the model predicts child height better than an intercept-only model. It does not show that the fit is good: the R-squared remains low. The combined effect of mother and father’s heights significantly predicts child height.

Summary:

  • Both the mother’s height and father’s height are statistically significant predictors of child height.
  • The father’s estimated coefficient is slightly larger than the mother’s, but the difference is small relative to the standard errors; both contribute positively.
  • The model fit is slightly better than the previous model with only midparent height, but still, only about 10.5% of the variance in child height is explained, suggesting other factors influence child height.
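
Whether the small gain in R-squared is more than noise can be gauged with an approximate F-test. This sketch assumes the midparent model is nested in this one (midparent height being a fixed combination of the two parental heights, per the HistData documentation) and uses the rounded R-squared values printed above, so the result is only indicative.

```r
# R-squared values copied from the two summary outputs (rounded as printed)
r2_midparent <- 0.103    # childHeight ~ midparentHeight
r2_separate  <- 0.1052   # childHeight ~ mother + father
df_resid     <- 931      # residual degrees of freedom of the larger model

# Approximate F-statistic for the one extra parameter in the larger model
f_approx <- (r2_separate - r2_midparent) / ((1 - r2_separate) / df_resid)
p_approx <- pf(f_approx, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(round(f_approx, 2))
print(round(p_approx, 2))
```

Because 0.103 is printed to only three digits, the statistic is imprecise; calling `anova()` on the two fitted model objects would give the exact test.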

Model 3 - factor variable as predictor

lmmodel3 <- lm(childHeight ~ gender, data = GaltonFamilies)
summary(lmmodel3)
## 
## Call:
## lm(formula = childHeight ~ gender, data = GaltonFamilies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.234 -1.604 -0.104  1.766  9.766 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  64.1040     0.1173  546.32   <2e-16 ***
## gendermale    5.1301     0.1635   31.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.497 on 932 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.5132 
## F-statistic: 984.4 on 1 and 932 DF,  p-value: < 2.2e-16

This model examines the relationship between child height (dependent variable) and gender (independent variable), where gender is a factor with two levels, “female” and “male.” The coefficient name gendermale in the output shows that “female” is the reference level.

Residuals:

  • Min: -9.234
  • 1Q (1st Quartile): -1.604
  • Median: -0.104
  • 3Q (3rd Quartile): 1.766
  • Max: 9.766

The residuals summarize the differences between the observed and predicted child heights. The spread of the residuals is tighter than in the previous models: the middle 50% lie between -1.6 and 1.77 inches. This suggests the model fits the data better than the previous ones.

Coefficients:

  • Intercept (Estimate = 64.1040): This is the predicted height for females, the reference level (R orders factor levels alphabetically by default, so “female” comes before “male”). With gender as the only predictor, it equals the average height of female children in the sample, about 64.1 inches.

  • gendermale (Estimate = 5.1301): The coefficient for “male” indicates that male children are, on average, about 5.13 inches taller than female children. This significant positive coefficient shows that gender has a strong effect on child height, with males being taller on average.
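
With a single two-level factor as the predictor, the regression simply reproduces the two group means. A short sketch, assuming the HistData package is installed; `group_means` and `mean_gap` are illustrative names.

```r
library(HistData)

# The first factor level is the reference level used by lm()
print(levels(GaltonFamilies$gender))

# Mean child height within each gender
group_means <- tapply(GaltonFamilies$childHeight, GaltonFamilies$gender, mean)
print(round(group_means, 4))

# Difference in means; should be close to the gendermale estimate (5.1301)
mean_gap <- unname(group_means["male"] - group_means["female"])
print(round(mean_gap, 4))
```

The female mean corresponds to the intercept, and the male mean to the intercept plus the gendermale coefficient.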

Statistical Significance:

  • Pr(>|t|):
    • The p-value for both the intercept and the gender coefficient is extremely small (<2e-16), indicating that both are highly significant.
    The model indicates a very strong, statistically significant difference in height between male and female children.

Model Fit:

  • Residual standard error = 2.497: This represents the typical deviation of observed child heights from the predicted values, and at 2.497 inches, this model is more accurate than the previous models with higher residual errors (~3.39).

  • Multiple R-squared = 0.5137: This suggests that 51.37% of the variation in child height can be explained by gender. This is a substantial improvement compared to the previous models (which explained around 10% of the variation).

  • Adjusted R-squared = 0.5132: Adjusted R-squared is very close to the R-squared because the model has only one predictor and more than 900 observations, so the penalty for model size is negligible.

  • F-statistic = 984.4, p-value: < 2.2e-16: The F-statistic is very high, indicating that the model as a whole is highly significant. Gender is a strong predictor of child height in this dataset.

Summary:

  • Gender is a highly significant predictor of child height, with male children being on average 5.13 inches taller than female children.
  • The model explains over 51% of the variation in child height, which is a substantial improvement over the models with parental height.
  • The residuals are smaller, indicating a better fit to the data.

Model 4a - midparent height and gender

lmmodel4a <- lm(childHeight ~ midparentHeight + gender, data = GaltonFamilies)  # main effects only, no interaction
summary(lmmodel4a)
## 
## Call:
## lm(formula = childHeight ~ midparentHeight + gender, data = GaltonFamilies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5317 -1.4600  0.0979  1.4566  9.1110 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     16.51410    2.73392    6.04 2.22e-09 ***
## midparentHeight  0.68702    0.03944   17.42  < 2e-16 ***
## gendermale       5.21511    0.14216   36.69  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.17 on 931 degrees of freedom
## Multiple R-squared:  0.6332, Adjusted R-squared:  0.6324 
## F-statistic: 803.6 on 2 and 931 DF,  p-value: < 2.2e-16

Model 4a interpretation

This model examines the relationship between child height (dependent variable) and two predictors: midparent height (in HistData, the average of the father’s height and 1.08 times the mother’s height) and gender.

Residuals:

  • Min: -9.5317
  • 1Q (1st Quartile): -1.4600
  • Median: 0.0979
  • 3Q (3rd Quartile): 1.4566
  • Max: 9.1110

The residuals indicate that most of the errors between predicted and actual child heights are fairly small, with 50% of the residuals falling between -1.46 and 1.46 inches. The spread of residuals is much tighter compared to the previous models, suggesting a better fit.

Coefficients:

  • Intercept (Estimate = 16.51410): This is the predicted child height when midparent height is zero and the child is female (the reference level of gender, as the coefficient name gendermale shows). While zero midparent height isn’t realistic, this intercept helps define the regression line.

  • midparentHeight (Estimate = 0.68702): For every 1 inch increase in midparent height, the child’s height is expected to increase by about 0.69 inches, on average, for children of the same gender. This coefficient indicates a positive relationship between midparent height and child height, similar to the earlier model that used only midparent height.

  • gendermale (Estimate = 5.21511): At the same midparent height, male children are predicted to be about 5.22 inches taller than female children. This is a very strong predictor: the estimate is almost 37 standard errors from zero (t = 36.69).
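
Because this model has no interaction, it describes two parallel lines, one per gender. The sketch below evaluates them by hand from the printed estimates; the helper `predict_4a` and the midparent heights 69 and 72 are hypothetical choices for illustration.

```r
# Coefficients copied from the summary output above
b0     <- 16.51410   # intercept (female, midparent height 0)
b_mid  <- 0.68702    # slope for midparent height
b_male <- 5.21511    # shift for male children

# Hypothetical helper: predicted child height for a midparent height and gender
predict_4a <- function(midparent, male) b0 + b_mid * midparent + b_male * male

girl_69 <- predict_4a(69, male = 0)
boy_69  <- predict_4a(69, male = 1)
print(c(girl = girl_69, boy = boy_69))

# The gap is the same at every midparent height (parallel lines)
gap_72 <- predict_4a(72, male = 1) - predict_4a(72, male = 0)
print(gap_72)
```

The constant gap is what "holding midparent height fixed" means for the gendermale coefficient in an additive model.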

Statistical Significance:

  • Pr(>|t|):
    • The intercept, midparent height, and gender coefficients are all highly statistically significant, with p-values < 2e-16 for midparent height and gender, and 2.22e-09 for the intercept.
    • Both midparent height and gender are strong, significant predictors of child height.

Model Fit:

  • Residual standard error = 2.17: This is the typical deviation between the observed and predicted child heights, indicating a smaller error than in the previous models (which had residual errors around 2.5 to 3.4 inches). The lower error indicates that this model fits the data better.

  • Multiple R-squared = 0.6332: This means that 63.32% of the variation in child height can be explained by midparent height and gender together. This is a substantial improvement over models that used only one of these predictors.

  • Adjusted R-squared = 0.6324: The adjusted R-squared is very close to the R-squared, which is expected with only two predictors and more than 900 observations; the penalty for the second predictor is negligible.

  • F-statistic = 803.6, p-value: < 2.2e-16: The F-statistic is extremely high, indicating that the model as a whole is statistically significant, i.e. at least one of the two predictors is related to child height. The individual t-tests above show that both are.

Summary:

  • Midparent height and gender are both significant predictors of child height.
  • Males tend to be about 5.22 inches taller than females, on average.
  • For each 1 inch increase in midparent height, a child’s height increases by about 0.69 inches.
  • This model explains over 63% of the variation in child height, which is a substantial improvement over models that used only midparent height or gender alone.

Model 4: Interaction term

lmmodel4 <- lm(childHeight ~ midparentHeight + gender + midparentHeight*gender, data = GaltonFamilies)  # midparentHeight*gender alone already expands to both main effects plus their interaction
summary(lmmodel4)
## 
## Call:
## lm(formula = childHeight ~ midparentHeight + gender + midparentHeight * 
##     gender, data = GaltonFamilies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5431 -1.4568  0.0769  1.4795  9.0860 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                18.33348    3.86636   4.742 2.45e-06 ***
## midparentHeight             0.66075    0.05580  11.842  < 2e-16 ***
## gendermale                  1.57998    5.46264   0.289    0.772    
## midparentHeight:gendermale  0.05252    0.07890   0.666    0.506    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.171 on 930 degrees of freedom
## Multiple R-squared:  0.6334, Adjusted R-squared:  0.6322 
## F-statistic: 535.6 on 3 and 930 DF,  p-value: < 2.2e-16

This model investigates the relationship between child height (dependent variable) and the predictors midparent height (in HistData, the average of the father’s height and 1.08 times the mother’s height) and gender, including the interaction between midparent height and gender. The interaction term allows us to examine whether the effect of midparent height on child height differs by gender.

Residuals:

  • Min: -9.5431
  • 1Q (1st Quartile): -1.4568
  • Median: 0.0769
  • 3Q (3rd Quartile): 1.4795
  • Max: 9.0860

The residuals indicate that the model fits the data reasonably well, with most errors being small (the median residual is close to zero). However, the minimum and maximum residuals show some variability, suggesting occasional larger prediction errors.

Coefficients:

  • Intercept (Estimate = 18.33348): This is the predicted height for a female child (reference group) when midparent height is zero. Again, while this value is not practically meaningful, it provides a baseline for the model.

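  • midparentHeight (Estimate = 0.66075): Because of the interaction term, this is the slope for female children (the reference group): for every 1-inch increase in midparent height, a girl’s expected height increases by approximately 0.66 inches. The slope for male children is 0.66075 + 0.05252, or about 0.71. This coefficient is statistically significant (p < 2e-16).

  • gendermale (Estimate = 1.57998): In a model with an interaction, this coefficient is the predicted male-female difference when midparent height is zero, an extrapolation far outside the data. It is not the average difference between male and female children. Its large standard error (5.46) and p-value (0.772) reflect that extrapolation, not a weak gender effect. At a realistic midparent height the predicted gap is 1.57998 + 0.05252 × midparent height; at 69 inches, for example, that is about 5.2 inches, in line with Model 4a.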

  • midparentHeight:gendermale (Estimate = 0.05252): This interaction term indicates that the effect of midparent height on child height is slightly greater for males, with an additional increase of 0.052 inches for each additional inch in midparent height. However, this effect is also not statistically significant (p = 0.506), meaning that the interaction does not provide strong evidence that the relationship between midparent height and child height differs by gender.
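
In an interaction model the slope and the gender gap both depend on the other variable, which is easiest to see numerically. The sketch below uses the printed estimates; the helper `gender_gap` and the example midparent heights are hypothetical choices for illustration.

```r
# Coefficients copied from the summary output above
b_mid         <- 0.66075   # slope for female children (reference group)
b_male        <- 1.57998   # male-female gap at midparent height 0
b_interaction <- 0.05252   # extra slope for male children

# Slope of midparent height for each gender
slope_female <- b_mid
slope_male   <- b_mid + b_interaction
print(c(female = slope_female, male = slope_male))

# Hypothetical helper: predicted male-female gap at a given midparent height
gender_gap <- function(midparent) b_male + b_interaction * midparent
print(gender_gap(c(0, 64, 69, 74)))
```

The gap at a midparent height of zero is the gendermale coefficient itself, while at heights in the range of real parents the gap is around 5 inches.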

Statistical Significance:

  • Pr(>|t|):
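    • The intercept and midparent height coefficients are highly significant, while the gendermale and interaction coefficients are not. This does not mean gender is unimportant: gendermale now measures the gap at a midparent height of zero, and the residual standard error and R-squared are essentially unchanged from Model 4a, where gender was highly significant. The non-significant interaction only indicates that the slope for midparent height does not detectably differ between male and female children.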

Model Fit:

  • Residual standard error = 2.171: This indicates a similar level of accuracy in predicting child height compared to the previous model (which had a residual standard error of about 2.17 inches).

  • Multiple R-squared = 0.6334: The model explains 63.34% of the variability in child height, similar to the previous model that included only midparent height and gender without the interaction term.

  • Adjusted R-squared = 0.6322: This is marginally lower than in Model 4a (0.6324), so the interaction term does not improve the fit once the extra parameter is taken into account.

  • F-statistic = 535.6, p-value: < 2.2e-16: The F-statistic is high, indicating that the model as a whole is statistically significant. The overall F-test does not assess individual terms; the t-test on the interaction (p = 0.506) shows that adding it to Model 4a gives no significant improvement.
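
The additive and interaction models can also be compared formally with a partial F-test. A minimal sketch, assuming the HistData package is installed; it refits both models so that it runs on its own, writing the interaction formula in the shorter equivalent form `midparentHeight * gender`.

```r
library(HistData)

# Additive model and the model that adds the interaction
lmmodel4a <- lm(childHeight ~ midparentHeight + gender, data = GaltonFamilies)
lmmodel4  <- lm(childHeight ~ midparentHeight * gender, data = GaltonFamilies)

# Partial F-test: does the interaction improve on the additive model?
comparison <- anova(lmmodel4a, lmmodel4)
print(comparison)

# With one added term, this p-value equals that of the interaction t-test
p_interaction <- comparison[["Pr(>F)"]][2]
print(round(p_interaction, 3))
```

A large p-value here supports keeping the simpler additive model.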

Summary:

  • Midparent height remains a significant predictor of child height, with a clear positive relationship.
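  • The non-significant gendermale coefficient should not be read as “no gender difference”: with the interaction in the model it describes the gap at a midparent height of zero. At realistic midparent heights the predicted male-female gap is still about 5.2 inches, and the interaction term does not significantly change the relationship between midparent height and child height.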
  • Overall, this model does not show that the effect of midparent height on child height significantly differs between genders, as indicated by the non-significant interaction term.
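
One common way to make the gender coefficient interpretable in an interaction model is to center the continuous predictor. The sketch below is an illustration, not part of the original analysis; it assumes the HistData package is installed, and `galton`, `mid_c` and `centered` are hypothetical names.

```r
library(HistData)

# Center midparent height at its sample mean
galton <- GaltonFamilies
galton$mid_c <- galton$midparentHeight - mean(galton$midparentHeight)

centered <- lm(childHeight ~ mid_c * gender, data = galton)
print(round(coef(summary(centered)), 4))

# gendermale is now the predicted gap at the average midparent height
gap_at_mean <- unname(coef(centered)["gendermale"])
print(round(gap_at_mean, 2))
```

Centering changes only what the intercept and gendermale refer to; the interaction estimate, fitted values and R-squared stay the same.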