3 Interaction Model (8 points)

## install.packages("HistData", repos = "http://cran.us.r-project.org", dependencies = TRUE)  # Run once; after the first compile this line may stay commented out.
library("HistData")
data(GaltonFamilies)
Galton2 <- data.frame(GaltonFamilies)
names(Galton2)
## [1] "family"          "father"          "mother"          "midparentHeight"
## [5] "children"        "childNum"        "gender"          "childHeight"

Note: I will use “we” or “us” hereinafter to avoid first-person singular narrative, which in my opinion is not a very convincing means of communicating data analysis. In other words, the use of “we” or “us” does not indicate that another individual or entity assisted with the six Mid-Term Exam responses.

3.1. Fit the interaction model g1: childHeight ~ gender + midparentHeight + gender:midparentHeight.

g1 <- lm(childHeight~gender+midparentHeight+gender:midparentHeight, data = Galton2)

Response to Question No. 3.1: As shown above, we use the R function lm to fit the interaction model “g1.” The function lm fits linear models and can be used to carry out regression, single-stratum analysis of variance, and analysis of covariance.
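The formula for g1 spells out both main effects and the interaction term. As a minimal sketch, the code below checks on simulated stand-in data (not the Galton data; the data frame `sim` and its values are invented for illustration) that the shorthand `gender * midparentHeight` expands to the same model.

```r
# Simulated stand-in data; column names mirror Galton2, values are made up.
set.seed(1)
n <- 200
sim <- data.frame(
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  midparentHeight = rnorm(n, mean = 69, sd = 1.8)
)
sim$childHeight <- 20 + 0.65 * sim$midparentHeight +
  5 * (sim$gender == "male") + rnorm(n, sd = 2)

# Explicit form, written the same way as g1
fit_long <- lm(childHeight ~ gender + midparentHeight + gender:midparentHeight,
               data = sim)

# Shorthand: a * b expands to a + b + a:b
fit_short <- lm(childHeight ~ gender * midparentHeight, data = sim)

# Both fits should report identical coefficients
all.equal(coef(fit_long), coef(fit_short))
```

Either spelling should work for g1; the explicit form simply makes the interaction term visible in the call.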

3.2. Using title “Residual Plot”, obtain the scatterplot of the residuals vs fitted values. Don’t print it, save it in p1.

library(ggplot2)
p1 <- ggplot(g1) + 
  aes(x=.fitted, y=.resid) + 
  geom_point() +
  labs(title = "Residual Plot")

Response to Question No. 3.2: As shown above, no plot is displayed; instead, we have saved the plot object to memory as “p1.” We first initialize a ggplot object from the model g1 and then use the R function aes() to map the fitted values and residuals, since we expect all layers to use the same data and the same set of aesthetics throughout this problem set.

3.3. Using title “Residual Plot”, obtain the scatterplot of the residuals vs midparentHeight. Don’t print it, save it in p2.

p2 <- ggplot(g1) + 
  aes(x=midparentHeight, y=.resid) + 
  geom_point() +
  labs(title = "Residual Plot")

Response to Question No. 3.3: We follow the same procedure in RStudio, this time with “midparentHeight” on the x-axis, to create a residual plot and save it to memory as “p2” (nothing to see here, “move on”).

3.4. Using title “Boxplot”, obtain the boxplot of the residuals vs gender. Don’t print it, save it in p3.

p3 <- ggplot(g1) + 
  aes(x=gender, y=.resid) + 
  geom_boxplot() +
  labs(title = "Boxplot")

Response to Question No. 3.4: We follow the same procedure in RStudio to create a box-and-whiskers plot and save it to memory as “p3” (again, nothing to see here, “move on”).

3.5. Using title “QQ-plot”, obtain the QQ-plot, with a red qq-line, of the residuals. Don’t print it, save it in p4.

p4 <- ggplot(g1) + 
  aes(sample=.resid) + 
  geom_qq() +
  geom_qq_line(color="red") +
  labs(title = "QQ-plot")

Response to Question No. 3.5: We follow the same procedure in RStudio to create a Q–Q (quantile-quantile) plot and save it to memory as “p4” (again, nothing to see here, “move on”). In statistics we use Q–Q plots to assess whether residuals are normally distributed (e.g., if the residuals follow close to a straight line on this plot, it is a good indication they are approximately normally distributed).
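To make concrete what the Q–Q plot compares, the sketch below rebuilds its ingredients in base R. It uses simulated values as a stand-in for the residuals of g1 (the vector `res` is invented for illustration), and the reference line is drawn through the first and third quartiles, which is the convention used by qqline() and the default for geom_qq_line().

```r
set.seed(42)
res <- rnorm(100)  # simulated stand-in for residuals(g1)

n <- length(res)
theoretical <- qnorm(ppoints(n))  # standard normal quantiles at plotting positions
observed <- sort(res)             # ordered sample values (the y-axis of the plot)

# Reference line through the first and third quartiles
probs <- c(0.25, 0.75)
slope <- unname(diff(quantile(res, probs)) / diff(qnorm(probs)))
intercept <- unname(quantile(res, probs[1])) - slope * qnorm(probs[1])

# A correlation near 1 between the two sets of quantiles means the points
# lie close to a straight line
qq_cor <- cor(theoretical, observed)
qq_cor
```

Points that hug the line give a correlation close to 1; heavy tails or skewness show up as systematic bending away from the line at the ends.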

3.6. Using a 2x2 grid, plot all four plots using gridExtra.

gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)

Response to Question No. 3.6: As shown above, using the function grid.arrange from the R package gridExtra, p1 through p4 are arranged as multiple grobs (graphical objects) on a single page in a 2 x 2 grid to allow for a convenient visualization of the data.

3.7. What patterns do the above plots reveal if any?

Response to Question No. 3.7: In the top-left plot, “Residual” versus “Fitted” shows very few outliers and two discernible clusters (i.e., the points are massed in two groups, presumably the fitted values for females and for males). The top-right plot compares “Residual” to “midparentHeight” and shows no apparent pattern. In the lower-left plot, “Residual” versus “Gender,” the first thing we note is that the male box is wider, which suggests that the error variance may differ between males and females. There are not many outliers. Finally, the Q–Q (quantile-quantile) plot indicates that the residuals are approximately normally distributed: they follow the straight (red) line closely, with only a few minor departures.
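The visual impression of a wider box for one group can be backed up with simple numbers. The sketch below compares the standard deviation and interquartile range of residuals by group; it uses simulated residuals with deliberately unequal spread (the values are invented for illustration and are not the residuals of g1).

```r
set.seed(7)
gender <- factor(rep(c("female", "male"), each = 300))

# Simulated residuals: the "male" group is given a larger spread on purpose
resid_sim <- c(rnorm(300, sd = 2.0), rnorm(300, sd = 2.5))

# Spread of the residuals within each group
sd_by_gender <- tapply(resid_sim, gender, sd)
iqr_by_gender <- tapply(resid_sim, gender, IQR)  # IQR is the height of each box

sd_by_gender
iqr_by_gender
```

For the actual model, the same two tapply() calls could be applied to residuals(g1) and Galton2$gender, assuming no rows were dropped when fitting.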

3.8. Obtain the coefficients, standard error of estimate, t-value of estimate, and p-value of estimate for model g1.

summary(g1)$coef
##                               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)                18.33347881 3.86635867  4.7417946 2.450098e-06
## gendermale                  1.57998384 5.46263785  0.2892346 7.724663e-01
## midparentHeight             0.66075039 0.05579597 11.8422605 3.103394e-30
## gendermale:midparentHeight  0.05252411 0.07890326  0.6656773 5.057825e-01

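Response to Question No. 3.8: As shown above, the output lists, for each coefficient, the estimate, its “standard error” (i.e., the estimated standard deviation of the coefficient's sampling distribution), the “t value” (i.e., the estimate divided by its standard error, used to test whether the coefficient equals zero), and the “p-value” (i.e., the two-sided probability, computed from the t distribution with the residual degrees of freedom, of a t value at least this extreme if the true coefficient were zero). Here the intercept and midparentHeight are significant at conventional levels, while gendermale (p ≈ 0.77) and the gendermale:midparentHeight interaction (p ≈ 0.51) are not.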
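The columns of the coefficient table are linked by simple arithmetic. The sketch below recomputes the t values and p-values from the estimates and standard errors printed above, and derives the slope for each gender. The residual degrees of freedom are an assumption: 934 observations minus 4 coefficients, which should be checked against df.residual(g1).

```r
# Estimates and standard errors copied from the summary(g1)$coef output
coefs <- data.frame(
  term = c("(Intercept)", "gendermale", "midparentHeight",
           "gendermale:midparentHeight"),
  estimate = c(18.33347881, 1.57998384, 0.66075039, 0.05252411),
  std_error = c(3.86635867, 5.46263785, 0.05579597, 0.07890326)
)

# Assumption: 934 rows and 4 estimated coefficients
df_resid <- 934 - 4

# t value = estimate / standard error
coefs$t_value <- coefs$estimate / coefs$std_error

# Two-sided p-value from the t distribution
coefs$p_value <- 2 * pt(-abs(coefs$t_value), df = df_resid)

# Slope of childHeight on midparentHeight for each gender
slope_female <- coefs$estimate[3]
slope_male <- coefs$estimate[3] + coefs$estimate[4]

coefs
```

Under this reading, the interaction coefficient is the difference between the male and female slopes, so its large p-value says the data do not clearly distinguish the two slopes.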