Biostatistics 213: Homework 4

## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE, 
    echo = T, fig.width = 5, fig.height = 5)
options(width = 116, scipen = 5, digits = 5)

setwd("~/statistics/bio213/")
library(gdata)
lbw <- read.xls("~/statistics/bio213/lbw.xls")

## Same data available online: http://www.umass.edu/statdata/statdata/data/index.html
## Cases 10 and 39 needs fix to make them identical to Dr. Orav's dataset
## lbw2 <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls")
## lbw2[c(10,39),"BWT"] <- c(2655,3035)
model.bwt.by.lwt <- lm(formula = bwt ~ lwt, data = lbw)
model.bwt.by.age <- lm(formula = bwt ~ age, data = lbw)

anova1.model.bwt.by.lwt <- anova(model.bwt.by.lwt)
anova1.model.bwt.by.age <- anova(model.bwt.by.age)

library(car)
anova3.model.bwt.by.lwt <- Anova(model.bwt.by.lwt, type = 3)
anova3.model.bwt.by.age <- Anova(model.bwt.by.age, type = 3)

1. Run the regression analysis relating both maternal weight and age simultaneously to infant birth weight (same as last week). This week, be sure to request the partial correlations and standardized beta coefficients.
Summary results

## Raw coefficients with correlation
model.bwt.by.lwt.age <- lm(formula = bwt ~ lwt + age, data = lbw)

## Standardized coefficients by running regression on a scaled dataset
std.model.bwt.by.lwt.age <- lm(formula = bwt ~ lwt + age, data = data.frame(scale(lbw)))

## Create squared partial correlation type 2 with ppcor package
library(ppcor)
partial.cor.table <- pcor(lbw[,c("bwt","lwt","age")])$estimate
sq.partial.cor.table <- partial.cor.table ^ 2

## Show results
list("Regression summary"           = summary(model.bwt.by.lwt.age),
     "Standardized estimate"        = summary(std.model.bwt.by.lwt.age)$coef[, 1, drop = F],
     "Squared partial Corr Type II" = (sq.partial.cor.table)[2:3, 1, drop = F]
     )
$`Regression summary`

Call:
lm(formula = bwt ~ lwt + age, data = lbw)

Residuals:
   Min     1Q Median     3Q    Max 
 -2233   -500     32    520   1899 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2213.40     299.21    7.40  4.6e-12 ***
lwt             4.19       1.74    2.41    0.017 *  
age             8.04      10.06    0.80    0.425    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 719 on 186 degrees of freedom
Multiple R-squared: 0.0381, Adjusted R-squared: 0.0277 
F-statistic: 3.68 on 2 and 186 DF,  p-value: 0.0271 


$`Standardized estimate`
              Estimate
(Intercept) 2.2290e-16
lwt         1.7590e-01
age         5.8469e-02

$`Squared partial Corr Type II`
          bwt
lwt 0.0301829
age 0.0034269

After including both variables, only lwt was significant. By comparing the standardized estimates, it is evident that the lwt variable has 3 times greater effect on the outcome. Also the squared partial correlation (type 2), which represents the what the variable adds to the model, is greater for lwt. Thus, the mother's weight is significantly associated with increase in the baby's birth weight after adjusting for maternal age.

Coefficients and p-value

models.list <- list("lwt model"  = model.bwt.by.lwt,
                    "age model"  = model.bwt.by.age,
                    "full model" = model.bwt.by.lwt.age
                    )

lapply(models.list, function(x) summary(x)$coef)
$`lwt model`
             Estimate Std. Error t value   Pr(>|t|)
(Intercept) 2367.7666   228.4141 10.3661 3.5008e-20
lwt            4.4448     1.7129  2.5949 1.0213e-02

$`age model`
            Estimate Std. Error t value   Pr(>|t|)
(Intercept) 2656.535    238.810 11.1240 2.1851e-22
age           12.403     10.021  1.2377 2.1736e-01

$`full model`
             Estimate Std. Error t value   Pr(>|t|)
(Intercept) 2213.4031    299.213 7.39742 4.6187e-12
lwt            4.1937      1.743 2.40598 1.7108e-02
age            8.0450     10.059 0.79975 4.2488e-01

The coefficient for each of these variables decreased from the univariable counterpart, thus less significant.

The \( R^2 \)

lapply(models.list, function(x) summary(x)$r.squared)
$`lwt model`
[1] 0.034756

$`age model`
[1] 0.008126

$`full model`
[1] 0.038064

The \( R^2 \) always increase by including more variables.

The adjusted \( R^2 \)

lapply(models.list, function(x) summary(x)$adj.r.squared)
$`lwt model`
[1] 0.029594

$`age model`
[1] 0.0028219

$`full model`
[1] 0.02772

The adjusted \( R^2 \) takes into account the number of variables included. This is the best for the lwt-only model, as the age variable adds little to the prediction power of the full model.

The F-statistic

lapply(models.list, function(x) summary(x)$fstatistic["value"])
$`lwt model`
 value 
6.7334 

$`age model`
value 
1.532 

$`full model`
value 
 3.68 

The F-statistics are different for all models, and is largest for the lwt-only model.

2. Run the analysis of variance relating both maternal weight and age simultaneously to infant birth weight. Interpret the findings and relate the results to last week’s univariate outputs and to the regression model above.
Types 1 and 3 square sums for univariable and full models

## Anova Type 1
anova1.model.bwt.by.lwt.age <- anova(model.bwt.by.lwt.age)
## Anova Type 3
anova3.model.bwt.by.lwt.age <- Anova(model.bwt.by.lwt.age, type = 3)

## Show results
res.list <- list("type 1 lwt model"  = anova1.model.bwt.by.lwt,
                 "type 1 age model"  = anova1.model.bwt.by.age,
                 "type 1 full model" = anova1.model.bwt.by.lwt.age,
                 "type 3 lwt model"  = anova3.model.bwt.by.lwt,
                 "type 3 age model"  = anova3.model.bwt.by.age,
                 "type 3 full model" = anova3.model.bwt.by.lwt.age
                 )
res.list <- lapply(res.list,
                   function(df) {
                       class(df) <- "data.frame"
                       df
                   })
res.list
$`type 1 lwt model`
           Df   Sum Sq Mean Sq F value   Pr(>F)
lwt         1  3473052 3473052  6.7334 0.010213
Residuals 187 96454213  515798      NA       NA

$`type 1 age model`
           Df   Sum Sq Mean Sq F value  Pr(>F)
age         1   812010  812010   1.532 0.21736
Residuals 187 99115254  530028      NA      NA

$`type 1 full model`
           Df   Sum Sq Mean Sq F value   Pr(>F)
lwt         1  3473052 3473052  6.7204 0.010289
age         1   330542  330542  0.6396 0.424876
Residuals 186 96123671  516794      NA       NA

$`type 3 lwt model`
              Sum Sq  Df  F value     Pr(>F)
(Intercept) 55425738   1 107.4563 3.5008e-20
lwt          3473052   1   6.7334 1.0213e-02
Residuals   96454213 187       NA         NA

$`type 3 age model`
              Sum Sq  Df F value     Pr(>F)
(Intercept) 65587961   1 123.744 2.1851e-22
age           812010   1   1.532 2.1736e-01
Residuals   99115254 187      NA         NA

$`type 3 full model`
              Sum Sq  Df F value     Pr(>F)
(Intercept) 28279892   1 54.7218 4.6187e-12
lwt          2991583   1  5.7887 1.7108e-02
age           330542   1  0.6396 4.2488e-01
Residuals   96123671 186      NA         NA

Type 1 square sums (sequential SS) in the multivariable model, only the first variable (lwt) gets the same square sum as the univariable model.

Type 3 square sums in the multivariable models represent the unambiguous variability that is explained by the given variable after adjustment for all variable included. Thus, the square sum is different from the univariable counterpart for each variable.

Summary table

crude.lwt <- c(
               beta          = model.bwt.by.lwt$coef["lwt"],
               std.error     = summary(model.bwt.by.lwt)$coef["lwt","Std. Error"],
               p.val         = summary(model.bwt.by.lwt)$coef["lwt","Pr(>|t|)"],
               SS            = res.list[[4]]["lwt","Sum Sq"],
               R.squared     = summary(model.bwt.by.lwt)$r.sq,
               adj.R.squared = summary(model.bwt.by.lwt)$adj.r.sq
               )

adj.lwt <- c(
               beta          = model.bwt.by.lwt.age$coef["lwt"],
               std.error     = summary(model.bwt.by.lwt.age)$coef["lwt","Std. Error"],
               p.val         = summary(model.bwt.by.lwt.age)$coef["lwt","Pr(>|t|)"],
               SS            = res.list[[6]]["lwt","Sum Sq"],
               R.squared     = sq.partial.cor.table["lwt","bwt"],
               adj.R.squared = summary(model.bwt.by.lwt.age)$adj.r.sq
               )

crude.age <- c(
               beta          = model.bwt.by.age$coef["age"],
               std.error     = summary(model.bwt.by.age)$coef["age","Std. Error"],
               p.val         = summary(model.bwt.by.age)$coef["age","Pr(>|t|)"],
               SS            = res.list[[5]]["age","Sum Sq"],
               R.squared     = summary(model.bwt.by.age)$r.sq,
               adj.R.squared = summary(model.bwt.by.age)$adj.r.sq
               )

adj.age <- c(
               beta          = model.bwt.by.lwt.age$coef["age"],
               std.error     = summary(model.bwt.by.lwt.age)$coef["age","Std. Error"],
               p.val         = summary(model.bwt.by.lwt.age)$coef["age","Pr(>|t|)"],
               SS            = res.list[[6]]["age","Sum Sq"],
               R.squared     = sq.partial.cor.table["age","bwt"],
               adj.R.squared = summary(model.bwt.by.lwt.age)$adj.r.sq
               )

data.frame(crude.lwt = crude.lwt,
           adj.lwt   = adj.lwt,
           crude.age = crude.age,
           adj.age   = adj.age)
                   crude.lwt        adj.lwt      crude.age        adj.age
beta.lwt            4.444757       4.193734     12.4032153      8.0449763
std.error           1.712901       1.743047     10.0208139     10.0593573
p.val               0.010213       0.017108      0.2173623      0.4248755
SS            3473051.503210 2991583.205950 812010.1574259 330541.8601662
R.squared           0.034756       0.030183      0.0081260      0.0034269
adj.R.squared       0.029594       0.027720      0.0028219      0.0277202

Comparison of the betas, standard errors of betas, p-values, square sum (marginal or type III), \( R^2 \) (partial \( R^2 \) for adjusted model), and adjusted \( R^2 \).

The betas decreased in absolute magnitude after adjustment and standard errors slightly increased for both. The p-values as a result, became less significant.

The square sums, meaning the variability unambiguously explained by the variable of choice is smaller after adjustment, especially for age.

The partial \( R^2 \) in the adjusted model is 3.0% for lwt and 0.34% for age, meaning addition of lwt to the age-only model explains 3% of the remaining unexplained variability whereas addition of age to the lwt-only model expalins 0.34% of the remaining unexplained variability.

Among the adjusted \( R^2 \) values, the one for the lwt-only model is the largest, meaning by model comparison it is the most efficient after taking into account the number of variables included.

## Plot 3D scatter plot
library(rgl)
with(lbw, {
    plot3d(x = age,
           y = lwt,
           z = bwt,
           type = "s",  # s_phere
           col = "red",
           size = 2
           )
})

predict.bwt.by.age.lwt <-
    function(age,lwt) predict(model.bwt.by.lwt.age, newdata = data.frame(age, lwt), type = "response")

bwt.predicted.mat <-
    with(lbw,
         outer(seq(min(age), max(age), 10),
               seq(min(lwt), max(lwt), 10),
               predict.bwt.by.age.lwt)
         )

## Add plane
surface3d(x = with(lbw, seq(min(age), max(age), 10)),
          y = with(lbw, seq(min(lwt), max(lwt), 10)),
          z = bwt.predicted.mat,
          col = 3, alpha = 0.2)