6 Comparing models g.sm versus g.big. (13 points)

## install.packages("HistData", repos = "http://cran.us.r-project.org", dependencies = TRUE)  ## (After the first compile, we may comment out this line.)
library("HistData")
data(GaltonFamilies)
Galton2 <- data.frame(GaltonFamilies)
names(Galton2)
## [1] "family"          "father"          "mother"          "midparentHeight"
## [5] "children"        "childNum"        "gender"          "childHeight"

Note: I will use “we” or “us” hereinafter to avoid a first-person singular narrative, which in my opinion is not a very convincing way to communicate a data analysis. In other words, the use of “we” or “us” does not indicate that another individual or entity assisted with the responses to the six Mid-Term Exam questions.

6.1. Fit the following model for childHeight, g.big: \(childHeight = \beta_0 + \beta_1\,gender + \beta_2\,father + \beta_3\,mother + \beta_4\,children\). What are the estimated coefficients, standard errors of estimate, t-values of estimate, and p-values of estimate?

g.big <- lm(childHeight ~  gender + father + mother + children, Galton2)
summary(g.big)$coefficients
##                Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 17.43102973 2.77406863  6.283561  5.079644e-10
## gendermale   5.19852104 0.14197155 36.616640 2.093140e-182
## father       0.38521395 0.02898001 13.292403  4.892948e-37
## mother       0.31618779 0.03097755 10.206997  2.950867e-23
## children    -0.04573175 0.02630931 -1.738235  8.250078e-02

Response to Question No. 6.1: As shown above, we use the R function lm() to fit the multiple linear regression model “g.big”, which regresses childHeight on gender, father, mother, and children. The coefficient table reports, for each term, the estimated coefficient, its standard error, the t-value, and the p-value.
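Because gender is a factor, lm() encodes it with a dummy variable, which is why the table shows a row labelled “gendermale” (female is the reference level). A minimal sketch for inspecting that encoding, assuming the g.big model and Galton2 data frame from above:

# Inspect how the factor gender enters the design matrix of g.big.
head(model.matrix(g.big))           # column "gendermale" is 1 for male, 0 for female
contrasts(factor(Galton2$gender))   # treatment contrasts with "female" as the reference level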

6.2. Fit the following model for childHeight, g.sm: \(childHeight = \beta_0 + \beta_1\,gender + \beta_2\,father\).

What are the estimated coefficients, standard errors of estimate, t-values of estimate, and p-values of estimate?

g.sm <- lm(childHeight ~  gender + father, Galton2)
summary(g.sm)$coefficients
##               Estimate Std. Error  t value      Pr(>|t|)
## (Intercept) 35.6790984 2.09313335 17.04578  6.533613e-57
## gendermale   5.1804523 0.14947528 34.65759 1.054827e-169
## father       0.4104067 0.03018159 13.59792  1.518043e-38

Response to Question No. 6.2: As shown above, for each coefficient we see the estimate, the standard error of the estimate (the estimated standard deviation of the coefficient estimate), the t-value (the estimate divided by its standard error, used to test whether the true coefficient is zero), and the p-value (the probability, under that null hypothesis, of observing a t-statistic at least as extreme as the one computed from the sampling distribution).
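These columns are connected: the t-value is the estimate divided by its standard error, and the two-sided p-value comes from the t distribution with the model’s residual degrees of freedom. A minimal sketch verifying this for g.sm, assuming the fitted model from above:

# Recompute the t-values and two-sided p-values of g.sm from the estimates
# and standard errors, and compare them with the summary() output.
cf <- summary(g.sm)$coefficients
t_manual <- cf[, "Estimate"] / cf[, "Std. Error"]
p_manual <- 2 * pt(abs(t_manual), df = df.residual(g.sm), lower.tail = FALSE)
cbind(t_manual, t_summary = cf[, "t value"], p_manual, p_summary = cf[, "Pr(>|t|)"])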

6.3. For the test of the model g.sm vs the model g.big, in terms of the beta coefficients, what are the null and alternative hypotheses for this statistical test?

Response to Question No. 6.3: As shown above, we have two models: the big model g.big (\(\Omega\)) and the small model g.sm (\(\omega\)). The null hypothesis is \(H_0: \beta_3 = \beta_4 = 0\) (the coefficients of mother and children are zero, so the small model \(\omega\) is adequate), and the alternative hypothesis is \(H_1:\) at least one of \(\beta_3\) and \(\beta_4\) is not zero (the big model \(\Omega\) is needed). By the principle of Occam’s Razor, we prefer the smaller model \(\omega\) unless the data lead us to reject \(H_0\).
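The test of \(\omega\) against \(\Omega\) uses the F-statistic for nested linear models (Faraway, Chapter 3):

\[
F = \frac{(RSS_\omega - RSS_\Omega)/(df_\omega - df_\Omega)}{RSS_\Omega / df_\Omega},
\]

which is referred to an \(F_{df_\omega - df_\Omega,\; df_\Omega}\) distribution; large values of \(F\) are evidence against \(H_0\) and in favour of the big model \(\Omega\).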

6.4. Compute the Analysis of Variance Table for this test based on the data.

anova(g.sm, g.big)
## Analysis of Variance Table
## 
## Model 1: childHeight ~ gender + father
## Model 2: childHeight ~ gender + father + mother + children
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    931 4849.7                                  
## 2    929 4343.7  2    505.99 54.108 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Response to Question No. 6.4: An analysis of variance table is “a convenient way of arranging arithmetic” (Fisher) and is shown above by executing the R function anova() with the two models, the small model (g.sm) and the big model (g.big).
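As a check, the F-statistic and p-value in the table can be recomputed directly from the residual sums of squares and residual degrees of freedom of the two fitted models. A minimal sketch, assuming g.sm and g.big from above:

# Recompute the F-statistic for g.sm versus g.big by hand.
rss_sm  <- deviance(g.sm);  df_sm  <- df.residual(g.sm)   # small model (omega)
rss_big <- deviance(g.big); df_big <- df.residual(g.big)  # big model (Omega)
F_stat <- ((rss_sm - rss_big) / (df_sm - df_big)) / (rss_big / df_big)
p_val  <- pf(F_stat, df_sm - df_big, df_big, lower.tail = FALSE)
c(F = F_stat, p.value = p_val)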

6.5. Using α=0.001, based on the p-value, what is the decision rule and conclusion of the hypotheses test of g.sm versus g.big?

Response to Question No. 6.5: As described in our lecture and in Faraway (Chapter 3), the decision rule is to reject \(H_0\) (the small model g.sm) when the p-value is less than \(\alpha = 0.001\). Since the p-value (< 2.2e-16) is far below 0.001, we reject \(H_0\) and conclude that mother and/or children help predict childHeight after controlling for gender and father.
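Equivalently, the decision rule can be stated with the critical value of the F distribution at \(\alpha = 0.001\); the observed F of about 54.1 from the anova table far exceeds it. A minimal sketch, assuming the degrees of freedom (2, 929) from the table above:

# Critical value of the F distribution at alpha = 0.001 with (2, 929) degrees of freedom;
# reject H0 when the observed F-statistic exceeds this value.
qf(0.999, df1 = 2, df2 = 929)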

6.6. Compute the fit plot for the g.big with the following specifications:

45 degree line in red

title “Fit Plot”

y-axis label “Child Height”

x-axis label “Fitted Values”

library(ggplot2)
plot1 <- ggplot(g.big) +
  aes(x=.fitted, y=childHeight) +
  geom_point()
plot1 <- plot1 + geom_abline(intercept = 0, slope = 1, color = "red")
plot1 <- plot1 + labs(title = "Fit Plot")
(plot1 <- plot1 + labs(x="Fitted Values", y = "Child Height"))

Response to Question No. 6.6: As shown above, the fit plot displays childHeight against the fitted values of g.big, with a red 45-degree reference line; points near this line correspond to children whose fitted values are close to their actual heights.
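For reference, the same fit plot can be produced in base R graphics; a minimal sketch, assuming g.big and Galton2 from above:

# Base R version of the fit plot: actual child heights versus fitted values,
# with a red 45-degree reference line.
plot(fitted(g.big), Galton2$childHeight,
     main = "Fit Plot", xlab = "Fitted Values", ylab = "Child Height")
abline(a = 0, b = 1, col = "red")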

6.7. What is the Pearson correlation between the g.big fitted-values and actual-values?

cor(g.big$fitted, Galton2$childHeight) 
## [1] 0.797865

Response to Question No. 6.7: The Pearson correlation is a statistic that measures the linear association between two continuous variables; it conveys both the magnitude and the direction of the association. As indicated above, the Pearson correlation between the g.big fitted values and the actual childHeight values is 0.7979.
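The same value can be recovered from the definition of the Pearson correlation as the covariance of the two variables scaled by their standard deviations. A minimal sketch, assuming g.big and Galton2 from above:

# Pearson correlation from its definition: cov(x, y) / (sd(x) * sd(y)).
f <- fitted(g.big)
y <- Galton2$childHeight
cov(f, y) / (sd(f) * sd(y))   # should match cor(g.big$fitted, Galton2$childHeight)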

6.8. Is the correlation strong, moderate or weak?

Response to Question No. 6.8: There is a strong correlation (approximately 0.80) between the fitted and actual values for the big model.

6.9. What does this indicate for the model?

Response to Question No. 6.9: A strong Pearson correlation (approximately 0.80) indicates that the fitted values of the big model track the actual child heights closely, suggesting that the model is useful (subject to additional hypothesis testing and diagnostics).

6.10. What is the R2 for the g.big model?

summary(g.big)$r.sq 
## [1] 0.6365886

Response to Question No. 6.10: As shown above, the R2 for the g.big model is 0.6366.

6.11. What does R2 mean for the fitted model?

Response to Question No. 6.11: The R2 for the g.big model, reported above in response to Question No. 6.10, means that roughly 64% of the variation in childHeight is explained by the fitted model. The remaining 36% is due to random error or to predictors not included in g.big.
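This interpretation follows from the definition \(R^2 = 1 - RSS/TSS\), the proportion of the total variation in childHeight explained by the fitted model. A minimal sketch checking the reported value, assuming g.big and Galton2 from above:

# R-squared from its definition: 1 - residual sum of squares / total sum of squares.
rss <- deviance(g.big)
tss <- sum((Galton2$childHeight - mean(Galton2$childHeight))^2)
1 - rss / tss   # should match summary(g.big)$r.sq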

6.12. Theoretically, what is the relation between the R2 and the Pearson correlation (actual y-values vs fitted-values)?

Response to Question No. 6.12: The correlation coefficient \(r\) between two variables takes values between −1 and 1 and describes the direction and strength of their linear association. For a linear model fitted with an intercept, \(R^2\) is exactly the square of the Pearson correlation between the fitted values and the actual y-values.
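A short justification: for a model fitted with an intercept, the mean of the fitted values equals \(\bar{y}\) and the residuals are orthogonal to the fitted values, so \(\sum_i (y_i-\bar{y})(\hat{y}_i-\bar{y}) = \sum_i (\hat{y}_i-\bar{y})^2\) and therefore

\[
\operatorname{cor}(y,\hat{y})^2
= \frac{\left[\sum_i (y_i-\bar{y})(\hat{y}_i-\bar{y})\right]^2}{\sum_i (y_i-\bar{y})^2 \sum_i (\hat{y}_i-\bar{y})^2}
= \frac{\sum_i (\hat{y}_i-\bar{y})^2}{\sum_i (y_i-\bar{y})^2}
= R^2 .
\]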

6.13. Show that this relation holds for the computed values of the g.big?

r <- cor(g.big$fitted, Galton2$childHeight) 
r_sq <- r^2
R_sq <- summary(g.big)$r.sq
data.frame(r, r_sq, R_sq)
##          r      r_sq      R_sq
## 1 0.797865 0.6365886 0.6365886

Response to Question No. 6.13: As shown above, the relation holds for the computed values of the g.big model.