Biostatistics 213: Homework 4
## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE,
echo = T, fig.width = 5, fig.height = 5)
options(width = 116, scipen = 5, digits = 5)
setwd("~/statistics/bio213/")
library(gdata)
lbw <- read.xls("~/statistics/bio213/lbw.xls")
## Same data available online: http://www.umass.edu/statdata/statdata/data/index.html
## Cases 10 and 39 needs fix to make them identical to Dr. Orav's dataset
## lbw2 <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls")
## lbw2[c(10,39),"BWT"] <- c(2655,3035)
model.bwt.by.lwt <- lm(formula = bwt ~ lwt, data = lbw)
model.bwt.by.age <- lm(formula = bwt ~ age, data = lbw)
anova1.model.bwt.by.lwt <- anova(model.bwt.by.lwt)
anova1.model.bwt.by.age <- anova(model.bwt.by.age)
library(car)
anova3.model.bwt.by.lwt <- Anova(model.bwt.by.lwt, type = 3)
anova3.model.bwt.by.age <- Anova(model.bwt.by.age, type = 3)
1. Run the regression analysis relating both maternal weight and age simultaneously to infant birth weight (same as last week). This week, be sure to request the partial correlations and standardized beta coefficients.
Summary results
## Raw coefficients with correlation
model.bwt.by.lwt.age <- lm(formula = bwt ~ lwt + age, data = lbw)
## Standardized coefficients by running regression on a scaled dataset
std.model.bwt.by.lwt.age <- lm(formula = bwt ~ lwt + age, data = data.frame(scale(lbw)))
## Create squared partial correlation type 2 with ppcor package
library(ppcor)
partial.cor.table <- pcor(lbw[,c("bwt","lwt","age")])$estimate
sq.partial.cor.table <- partial.cor.table ^ 2
## Show results
list("Regression summary" = summary(model.bwt.by.lwt.age),
"Standardized estimate" = summary(std.model.bwt.by.lwt.age)$coef[, 1, drop = F],
"Squared partial Corr Type II" = (sq.partial.cor.table)[2:3, 1, drop = F]
)
$`Regression summary`
Call:
lm(formula = bwt ~ lwt + age, data = lbw)
Residuals:
Min 1Q Median 3Q Max
-2233 -500 32 520 1899
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2213.40 299.21 7.40 4.6e-12 ***
lwt 4.19 1.74 2.41 0.017 *
age 8.04 10.06 0.80 0.425
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 719 on 186 degrees of freedom
Multiple R-squared: 0.0381, Adjusted R-squared: 0.0277
F-statistic: 3.68 on 2 and 186 DF, p-value: 0.0271
$`Standardized estimate`
Estimate
(Intercept) 2.2290e-16
lwt 1.7590e-01
age 5.8469e-02
$`Squared partial Corr Type II`
bwt
lwt 0.0301829
age 0.0034269
After including both variables, only lwt was significant. By comparing the standardized estimates, it is evident that the lwt variable has 3 times greater effect on the outcome. Also the squared partial correlation (type 2), which represents the what the variable adds to the model, is greater for lwt. Thus, the mother's weight is significantly associated with increase in the baby's birth weight after adjusting for maternal age.
Coefficients and p-value
models.list <- list("lwt model" = model.bwt.by.lwt,
"age model" = model.bwt.by.age,
"full model" = model.bwt.by.lwt.age
)
lapply(models.list, function(x) summary(x)$coef)
$`lwt model`
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2367.7666 228.4141 10.3661 3.5008e-20
lwt 4.4448 1.7129 2.5949 1.0213e-02
$`age model`
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2656.535 238.810 11.1240 2.1851e-22
age 12.403 10.021 1.2377 2.1736e-01
$`full model`
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2213.4031 299.213 7.39742 4.6187e-12
lwt 4.1937 1.743 2.40598 1.7108e-02
age 8.0450 10.059 0.79975 4.2488e-01
The coefficient for each of these variables decreased from the univariable counterpart, thus less significant.
The \( R^2 \)
lapply(models.list, function(x) summary(x)$r.squared)
$`lwt model`
[1] 0.034756
$`age model`
[1] 0.008126
$`full model`
[1] 0.038064
The \( R^2 \) always increase by including more variables.
The adjusted \( R^2 \)
lapply(models.list, function(x) summary(x)$adj.r.squared)
$`lwt model`
[1] 0.029594
$`age model`
[1] 0.0028219
$`full model`
[1] 0.02772
The adjusted \( R^2 \) takes into account the number of variables included. This is the best for the lwt-only model, as the age variable adds little to the prediction power of the full model.
The F-statistic
lapply(models.list, function(x) summary(x)$fstatistic["value"])
$`lwt model`
value
6.7334
$`age model`
value
1.532
$`full model`
value
3.68
The F-statistics are different for all models, and is largest for the lwt-only model.
2. Run the analysis of variance relating both maternal weight and age simultaneously to infant birth weight. Interpret the findings and relate the results to last week’s univariate outputs and to the regression model above.
Types 1 and 3 square sums for univariable and full models
## Anova Type 1
anova1.model.bwt.by.lwt.age <- anova(model.bwt.by.lwt.age)
## Anova Type 3
anova3.model.bwt.by.lwt.age <- Anova(model.bwt.by.lwt.age, type = 3)
## Show results
res.list <- list("type 1 lwt model" = anova1.model.bwt.by.lwt,
"type 1 age model" = anova1.model.bwt.by.age,
"type 1 full model" = anova1.model.bwt.by.lwt.age,
"type 3 lwt model" = anova3.model.bwt.by.lwt,
"type 3 age model" = anova3.model.bwt.by.age,
"type 3 full model" = anova3.model.bwt.by.lwt.age
)
res.list <- lapply(res.list,
function(df) {
class(df) <- "data.frame"
df
})
res.list
$`type 1 lwt model`
Df Sum Sq Mean Sq F value Pr(>F)
lwt 1 3473052 3473052 6.7334 0.010213
Residuals 187 96454213 515798 NA NA
$`type 1 age model`
Df Sum Sq Mean Sq F value Pr(>F)
age 1 812010 812010 1.532 0.21736
Residuals 187 99115254 530028 NA NA
$`type 1 full model`
Df Sum Sq Mean Sq F value Pr(>F)
lwt 1 3473052 3473052 6.7204 0.010289
age 1 330542 330542 0.6396 0.424876
Residuals 186 96123671 516794 NA NA
$`type 3 lwt model`
Sum Sq Df F value Pr(>F)
(Intercept) 55425738 1 107.4563 3.5008e-20
lwt 3473052 1 6.7334 1.0213e-02
Residuals 96454213 187 NA NA
$`type 3 age model`
Sum Sq Df F value Pr(>F)
(Intercept) 65587961 1 123.744 2.1851e-22
age 812010 1 1.532 2.1736e-01
Residuals 99115254 187 NA NA
$`type 3 full model`
Sum Sq Df F value Pr(>F)
(Intercept) 28279892 1 54.7218 4.6187e-12
lwt 2991583 1 5.7887 1.7108e-02
age 330542 1 0.6396 4.2488e-01
Residuals 96123671 186 NA NA
Type 1 square sums (sequential SS) in the multivariable model, only the first variable (lwt) gets the same square sum as the univariable model.
Type 3 square sums in the multivariable models represent the unambiguous variability that is explained by the given variable after adjustment for all variable included. Thus, the square sum is different from the univariable counterpart for each variable.
crude.lwt <- c(
beta = model.bwt.by.lwt$coef["lwt"],
std.error = summary(model.bwt.by.lwt)$coef["lwt","Std. Error"],
p.val = summary(model.bwt.by.lwt)$coef["lwt","Pr(>|t|)"],
SS = res.list[[4]]["lwt","Sum Sq"],
R.squared = summary(model.bwt.by.lwt)$r.sq,
adj.R.squared = summary(model.bwt.by.lwt)$adj.r.sq
)
adj.lwt <- c(
beta = model.bwt.by.lwt.age$coef["lwt"],
std.error = summary(model.bwt.by.lwt.age)$coef["lwt","Std. Error"],
p.val = summary(model.bwt.by.lwt.age)$coef["lwt","Pr(>|t|)"],
SS = res.list[[6]]["lwt","Sum Sq"],
R.squared = sq.partial.cor.table["lwt","bwt"],
adj.R.squared = summary(model.bwt.by.lwt.age)$adj.r.sq
)
crude.age <- c(
beta = model.bwt.by.age$coef["age"],
std.error = summary(model.bwt.by.age)$coef["age","Std. Error"],
p.val = summary(model.bwt.by.age)$coef["age","Pr(>|t|)"],
SS = res.list[[5]]["age","Sum Sq"],
R.squared = summary(model.bwt.by.age)$r.sq,
adj.R.squared = summary(model.bwt.by.age)$adj.r.sq
)
adj.age <- c(
beta = model.bwt.by.lwt.age$coef["age"],
std.error = summary(model.bwt.by.lwt.age)$coef["age","Std. Error"],
p.val = summary(model.bwt.by.lwt.age)$coef["age","Pr(>|t|)"],
SS = res.list[[6]]["age","Sum Sq"],
R.squared = sq.partial.cor.table["age","bwt"],
adj.R.squared = summary(model.bwt.by.lwt.age)$adj.r.sq
)
data.frame(crude.lwt = crude.lwt,
adj.lwt = adj.lwt,
crude.age = crude.age,
adj.age = adj.age)
crude.lwt adj.lwt crude.age adj.age
beta.lwt 4.444757 4.193734 12.4032153 8.0449763
std.error 1.712901 1.743047 10.0208139 10.0593573
p.val 0.010213 0.017108 0.2173623 0.4248755
SS 3473051.503210 2991583.205950 812010.1574259 330541.8601662
R.squared 0.034756 0.030183 0.0081260 0.0034269
adj.R.squared 0.029594 0.027720 0.0028219 0.0277202
Comparison of the betas, standard errors of betas, p-values, square sum (marginal or type III), \( R^2 \) (partial \( R^2 \) for adjusted model), and adjusted \( R^2 \).
The betas decreased in absolute magnitude after adjustment and standard errors slightly increased for both. The p-values as a result, became less significant.
The square sums, meaning the variability unambiguously explained by the variable of choice is smaller after adjustment, especially for age.
The partial \( R^2 \) in the adjusted model is 3.0% for lwt and 0.34% for age, meaning addition of lwt to the age-only model explains 3% of the remaining unexplained variability whereas addition of age to the lwt-only model expalins 0.34% of the remaining unexplained variability.
Among the adjusted \( R^2 \) values, the one for the lwt-only model is the largest, meaning by model comparison it is the most efficient after taking into account the number of variables included.
## Plot 3D scatter plot
library(rgl)
with(lbw, {
plot3d(x = age,
y = lwt,
z = bwt,
type = "s", # s_phere
col = "red",
size = 2
)
})
predict.bwt.by.age.lwt <-
function(age,lwt) predict(model.bwt.by.lwt.age, newdata = data.frame(age, lwt), type = "response")
bwt.predicted.mat <-
with(lbw,
outer(seq(min(age), max(age), 10),
seq(min(lwt), max(lwt), 10),
predict.bwt.by.age.lwt)
)
## Add plane
surface3d(x = with(lbw, seq(min(age), max(age), 10)),
y = with(lbw, seq(min(lwt), max(lwt), 10)),
z = bwt.predicted.mat,
col = 3, alpha = 0.2)