## install.packages("HistData", repos = "http://cran.us.r-project.org", dependencies = TRUE) (After the first compile, we may comment out this line.)
library("HistData")
data(GaltonFamilies)
Galton2 <- data.frame(GaltonFamilies)
names(Galton2)
## [1] "family" "father" "mother" "midparentHeight"
## [5] "children" "childNum" "gender" "childHeight"
Note: I will use “we” or “us” hereinafter to avoid first-person singular narrative, which in my opinion is not a very convincing means of communicating data analysis. In other words, the use of “we” or “us” does not indicate that another individual or entity assisted with the six Mid-Term Exam responses.
g.big <- lm(childHeight ~ gender + father + mother + children, Galton2)
summary(g.big)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.43102973 2.77406863 6.283561 5.079644e-10
## gendermale 5.19852104 0.14197155 36.616640 2.093140e-182
## father 0.38521395 0.02898001 13.292403 4.892948e-37
## mother 0.31618779 0.03097755 10.206997 2.950867e-23
## children -0.04573175 0.02630931 -1.738235 8.250078e-02
Response to Question No. 6.1: As shown above, we use the R function lm to fit the linear model “g.big”, which regresses childHeight on gender, father, mother, and children. The model is additive: it contains no interaction terms.
What are the estimated coefficients, standard errors of estimate, t-values of estimate, and p-values of estimate?
g.sm <- lm(childHeight ~ gender + father, Galton2)
summary(g.sm)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.6790984 2.09313335 17.04578 6.533613e-57
## gendermale 5.1804523 0.14947528 34.65759 1.054827e-169
## father 0.4104067 0.03018159 13.59792 1.518043e-38
Response to Question No. 6.2: As shown above, the coefficient table for g.sm reports, for each term, the estimate, its standard error (i.e., the estimated standard deviation of the coefficient estimate), the “t value” (i.e., the estimate divided by its standard error, which tests the null hypothesis that the coefficient is zero), and the “p-value” (i.e., the two-sided tail probability of that t value under the t distribution with the residual degrees of freedom). The p-value is not a critical value; it is compared against a chosen significance level.
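As a minimal sketch of how the t value and p-value follow from the estimate and its standard error, the snippet below recomputes them for the father row of the g.sm table. The numbers are copied from the output above; the residual degrees of freedom (931) are taken from the Res.Df column that anova() reports for g.sm later in this section. The variable names are ours and purely illustrative.

```r
# Values copied from the g.sm coefficient table (row "father").
estimate  <- 0.4104067
std_error <- 0.03018159

# Residual degrees of freedom for g.sm (Res.Df reported by anova()).
resid_df <- 931

# t value: the estimate measured in units of its standard error.
t_value <- estimate / std_error

# Two-sided p-value: probability of a t value at least this extreme
# under the t distribution with resid_df degrees of freedom.
p_value <- 2 * pt(abs(t_value), df = resid_df, lower.tail = FALSE)

print(round(t_value, 5))
print(p_value)
```

The recomputed t value should agree with the “t value” column up to rounding of the copied inputs.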
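Response to Question No. 6.3: As shown above, we have two models: one big model (g.big), denoted \(\Omega\), and one small model (g.sm), denoted \(\omega\). The null hypothesis (H0) is that \(\omega\) is adequate, i.e., \(\beta_3 = \beta_4 = 0\), where \(\beta_3\) and \(\beta_4\) are the coefficients of mother and children. The alternative hypothesis (H1) is \(\Omega\), i.e., at least one of \(\beta_3\) and \(\beta_4\) is not 0. By the principle of Occam’s Razor we prefer \(\omega\) unless the data give evidence against it.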
anova(g.sm, g.big)
## Analysis of Variance Table
##
## Model 1: childHeight ~ gender + father
## Model 2: childHeight ~ gender + father + mother + children
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 931 4849.7
## 2 929 4343.7 2 505.99 54.108 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response to Question No. 6.4: An analysis of variance (anova) table is a “convenient way of arranging arithmetic” (Fisher) and is shown above, produced by executing the R function anova() with the two models: one big model (g.big) and one small model (g.sm).
Response to Question No. 6.5: As described in our lecture and in Faraway (Chapter 3), we apply the Decision Rule: because the p-value (< 2.2e-16) is less than 0.001, we reject H0, and with it the small model (g.sm), and conclude that at least one of mother and children helps predict child height, controlling for gender and father.
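To make the decision rule concrete, the F statistic in the anova table can be recomputed by hand from the residual sums of squares and residual degrees of freedom. This is a sketch using the rounded values printed in the table above, so the result matches the reported F only up to rounding; the variable names are illustrative.

```r
# Residual sum of squares and residual df, copied from the anova() table.
rss_small <- 4849.7  # Model 1: childHeight ~ gender + father
df_small  <- 931
rss_big   <- 4343.7  # Model 2: adds mother + children
df_big    <- 929

# F statistic: drop in RSS per added parameter, relative to the
# residual variance estimate of the big model.
f_stat <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)

# Upper-tail probability under the F distribution with (2, 929) df.
p_value <- pf(f_stat, df1 = df_small - df_big, df2 = df_big,
              lower.tail = FALSE)

print(f_stat)
print(p_value < 0.001)
```

A large F statistic with a p-value below the chosen threshold is what leads us to reject the small model.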
45 degree line in red
title “Fit Plot”
y-axis label “Child Height”
x-axis label “Fitted Values”
library(ggplot2)
plot1 <- ggplot(g.big) +
  aes(x = .fitted, y = childHeight) +
  geom_point()
plot1 <- plot1 + geom_abline(intercept = 0, slope = 1, color = "red")
plot1 <- plot1 + labs(title = "Fit Plot")
(plot1 <- plot1 + labs(x="Fitted Values", y = "Child Height"))
Response to Question No. 6.6: As shown above, the fit plot shows a 45-degree line in red; the points scatter around this line, indicating that the observed Child Height tracks the Fitted Values.
cor(g.big$fitted, Galton2$childHeight)
## [1] 0.797865
Response to Question No. 6.7: The Pearson Correlation is a statistic that measures the linear association between two continuous variables. It gives information about the magnitude of the association as well as its direction. As indicated above, the Pearson Correlation between the fitted values and the observed child heights is 0.7979.
Response to Question No. 6.8: There is a strong correlation (about 0.80) between the fitted values and the observed child heights for the big model.
Response to Question No. 6.9: A strong Pearson Correlation (about 0.80) suggests that the big model is useful (along with additional hypothesis testing).
summary(g.big)$r.sq
## [1] 0.6365886
Response to Question No. 6.10: As shown above, the R2 for the g.big model is 0.6366.
Response to Question No. 6.11: The R2 for the g.big model, as indicated above in response to Question No. 6.10, means that the model explains roughly 64% of the variation in childHeight. The remaining 36% is likely due to error or other predictors which we did not include in the g.big model.
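The “proportion of variation explained” reading can be checked directly from the definition R2 = 1 − RSS/TSS. The sketch below uses R’s built-in cars data set as a stand-in so that it runs without HistData; this is our assumption for illustration only, and the same steps would apply to g.big and Galton2$childHeight.

```r
# Stand-in model on a built-in data set (not the Galton data).
fit <- lm(dist ~ speed, data = cars)
y   <- cars$dist

# Residual sum of squares: variation left unexplained by the model.
rss <- sum(residuals(fit)^2)

# Total sum of squares: variation of the response around its mean.
tss <- sum((y - mean(y))^2)

# R-squared as the share of total variation the model accounts for.
r_sq_manual  <- 1 - rss / tss
r_sq_summary <- summary(fit)$r.squared

print(all.equal(r_sq_manual, r_sq_summary))
```

The hand-computed value and the one reported by summary() should agree up to floating-point error.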
Response to Question No. 6.12: The correlation coefficient ( R ) between two variables (e.g., x and y) takes values between −1 and 1 and describes how strongly, and in which direction, x and y are linearly related. For a linear model with an intercept, R2 is the square of the Pearson correlation between the fitted values and the actual y-values.
r <- cor(g.big$fitted, Galton2$childHeight)
r_sq <- r^2
R_sq <- summary(g.big)$r.sq
data.frame(r, r_sq, R_sq)
## r r_sq R_sq
## 1 0.797865 0.6365886 0.6365886
Response to Question No. 6.13: As shown above, the relation holds for the computed values of the g.big model.