Biostatistics 213: Homework 2

## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE, 
    echo = T, fig.width = 5, fig.height = 5)
options(width = 116, scipen = 10)

setwd("~/statistics/bio213/")
library(gdata)
lbw <- read.xls("~/statistics/bio213/lbw.xls")

## Same data available online: http://www.umass.edu/statdata/statdata/data/index.html
## Cases 10 and 39 needs fix to make them identical to Dr. Orav's dataset
## lbw2 <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls")
## lbw2[c(10,39),"BWT"] <- c(2655,3035)

SeeNormality <-
function(data.frame, na.rm = TRUE) {
    require(e1071)
    result <- lapply(data.frame,
                     function(y) {
                         data.frame(n           = length(y),
                                    Not.NA      = sum(!is.na(y)),
                                    Mean        = mean(y, na.rm = na.rm),
                                    SD          = sd(y, na.rm = na.rm),
                                    Skewness    = skewness(y, type = 2, na.rm = na.rm),
                                    Kurtosis    = kurtosis(y, type = 2, na.rm = na.rm),
                                    Min         = min(y, na.rm = na.rm),
                                    Q25         = quantile(y, probs = 0.25, na.rm = na.rm),
                                    Median      = median(y, na.rm = na.rm),
                                    Q75         = quantile(y, probs = 0.75, na.rm = na.rm),
                                    Max         = max(y, na.rm = na.rm)
                                    )
                     })
    do.call(rbind, result)
}

library(ggplot2)

1. For mother's age, identify all the connections between the correlation coefficient and the slope from a linear regression (with birthweight as the dependent variable). Do the same for mother’s weight.'

Between the correlation coefficient and the slope, there is a relationship expressed as:
\[ r = \frac {S_X} {S_Y} \hat{\beta}_1 \]

where \( {S_X} \) = standard deviation of predictor, and \( {S_Y} \) = standard deviation of outcome

The p-values for Pearson correlation coefficient and slope are both testing for presence of association, thus they are identical.

by age
Scatter plot

## Graph
ggplot(lbw, aes(y = bwt, x = age)) +
    geom_point(alpha = 1/2) + theme_bw()

plot of chunk unnamed-chunk-3

Pearson correlation

## Pearson correlation
res.cor.test <- with(lbw, cor.test(bwt, age))
cor.coef.from.slope <- res.cor.test$estimate
names(cor.coef.from.slope) <- NULL
res.cor.test

    Pearson's product-moment correlation

data:  bwt and age 
t = 1.238, df = 187, p-value = 0.2174
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 -0.05327  0.22992 
sample estimates:
    cor 
0.09014 

Thus, correlation coefficient r = 0.0901, and p-value is 0.2174.

Linear model

## Linear model
fit.bwt.age <- lm(bwt ~ age, lbw)
summary.fit.bwt.age <- summary(fit.bwt.age)
print(summary.fit.bwt.age, digits = 5)

Call:
lm(formula = bwt ~ age, data = lbw)

Residuals:
      Min        1Q    Median        3Q       Max 
-2294.825  -517.809    10.385   530.594  1775.321 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2656.535    238.810 11.1240   <2e-16 ***
age           12.403     10.021  1.2377   0.2174    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 728 on 187 degrees of freedom
Multiple R-squared: 0.008126,   Adjusted R-squared: 0.0028219 
F-statistic:  1.532 on 1 and 187 DF,  p-value: 0.21736 

Therefore, the p-values for the Pearson correlation coefficient (0.2174) and slope of the age variable (0.2174) are the same.

Also, SD(age) / SD(bwt) * slope of age = 5.298678 / 729.0597 * 12.40322 = 0.0901 is the same as Pearson correlation coefficient r = 0.0901.

The Multiple R-squared = 0.0081 in the regression result is equal to the Pearson correlation coefficient squard = 0.0081.

by lwt
Scatter plot

## Graph
ggplot(lbw, aes(y = bwt, x = lwt)) +
    geom_point(alpha = 1/2) + theme_bw()

plot of chunk unnamed-chunk-6

Pearson correlation

## Pearson correlation
res.cor.test <- with(lbw, cor.test(bwt, lwt))
cor.coef.from.slope <- res.cor.test$estimate
names(cor.coef.from.slope) <- NULL
res.cor.test

    Pearson's product-moment correlation

data:  bwt and lwt 
t = 2.595, df = 187, p-value = 0.01021
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.04489 0.32063 
sample estimates:
   cor 
0.1864 

Thus, correlation coefficient r = 0.1864, and p-value is 0.0102.

Linear model

## Linear model
fit.bwt.lwt <- lm(bwt ~ lwt, lbw)
summary.fit.bwt.lwt <- summary(fit.bwt.lwt)
print(summary.fit.bwt.lwt, digits = 4)

Call:
lm(formula = bwt ~ lwt, data = lbw)

Residuals:
     Min       1Q   Median       3Q      Max 
-2192.14  -503.69    -4.03   508.42  2075.53 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2367.767    228.414  10.366   <2e-16 ***
lwt            4.445      1.713   2.595   0.0102 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 718.2 on 187 degrees of freedom
Multiple R-squared: 0.03476,    Adjusted R-squared: 0.02959 
F-statistic: 6.733 on 1 and 187 DF,  p-value: 0.01021 

Therefore, the p-values for the Pearson correlation coefficient (0.0102) and slope of the lwt variable (0.0102) are the same.

Also, SD(lwt) / SD(bwt) * slope of lwt = 30.57938 / 729.0597 * 4.444757 = 0.1864 is the same as Pearson correlation coefficient r = 0.1864.

The Multiple R-squared = 0.0348 in the regression result is equal to the Pearson correlation coefficient squard = 0.0348.


2. Use regression to examine the relationship between mother's weight and infant birthweight. Get the confidence bounds for the slope and intercept, and show graphically the confidence bounds for the regression mean and for predicted observations.'

The lower and upper confidence limits for the 95% confidence interval are:

confint(fit.bwt.age)
               2.5 %  97.5 %
(Intercept) 2185.426 3127.64
age           -7.365   32.17

Scatterplot with the regression line (middle), the confidence bands (narrower), and the predicted bands (wider).

## Fitted line and uncertainties (Introductory Statistics with R, 2nd ed page 117-120)
newdata <- data.frame(age = seq(from = 14, to = 45, by = 0.1))

## Narrow bands: confidence bands (uncertainty about the line (predicted mean) itself)
## Curved as the line is better determined closer to the middle
conf.limits <- predict(object = fit.bwt.age, newdata = newdata, interval = "confidence")
conf.limits <- cbind(newdata, conf.limits)

## Wide bands: prediction bands (uncertainty about future observations)
## Most future data points should fall between the wide limits
pred.limits <- predict(object = fit.bwt.age, newdata = newdata, interval = "prediction")
pred.limits <- cbind(newdata, pred.limits)

## Actual plot by ggplot2
ggplot(lbw, aes(y = bwt, x = age)) + theme_bw() +
    geom_point(alpha = 1/2) +
    geom_smooth(method = "lm") +
    ## geom_line(data = conf.limits, aes(x = age, y = lwr), linetype = 2) +
    ## geom_line(data = conf.limits, aes(x = age, y = upr), linetype = 2) +
    geom_line(data = pred.limits, aes(x = age, y = lwr), linetype = 3) +
    geom_line(data = pred.limits, aes(x = age, y = upr), linetype = 3)

plot of chunk unnamed-chunk-10