Biostatistics 213: Homework 2
## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE,
echo = T, fig.width = 5, fig.height = 5)
options(width = 116, scipen = 10)
setwd("~/statistics/bio213/")
library(gdata)
lbw <- read.xls("~/statistics/bio213/lbw.xls")
## Same data available online: http://www.umass.edu/statdata/statdata/data/index.html
## Cases 10 and 39 needs fix to make them identical to Dr. Orav's dataset
## lbw2 <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls")
## lbw2[c(10,39),"BWT"] <- c(2655,3035)
SeeNormality <-
function(data.frame, na.rm = TRUE) {
require(e1071)
result <- lapply(data.frame,
function(y) {
data.frame(n = length(y),
Not.NA = sum(!is.na(y)),
Mean = mean(y, na.rm = na.rm),
SD = sd(y, na.rm = na.rm),
Skewness = skewness(y, type = 2, na.rm = na.rm),
Kurtosis = kurtosis(y, type = 2, na.rm = na.rm),
Min = min(y, na.rm = na.rm),
Q25 = quantile(y, probs = 0.25, na.rm = na.rm),
Median = median(y, na.rm = na.rm),
Q75 = quantile(y, probs = 0.75, na.rm = na.rm),
Max = max(y, na.rm = na.rm)
)
})
do.call(rbind, result)
}
library(ggplot2)
1. For mother's age, identify all the connections between the correlation coefficient and the slope from a linear regression (with birthweight as the dependent variable). Do the same for mother’s weight.'
Between the correlation coefficient and the slope, there is a relationship expressed as:
\[ r = \frac {S_X} {S_Y} \hat{\beta}_1 \]
where \( {S_X} \) = standard deviation of predictor, and \( {S_Y} \) = standard deviation of outcome
The p-values for Pearson correlation coefficient and slope are both testing for presence of association, thus they are identical.
by age
Scatter plot
## Graph
ggplot(lbw, aes(y = bwt, x = age)) +
geom_point(alpha = 1/2) + theme_bw()
Pearson correlation
## Pearson correlation
res.cor.test <- with(lbw, cor.test(bwt, age))
cor.coef.from.slope <- res.cor.test$estimate
names(cor.coef.from.slope) <- NULL
res.cor.test
Pearson's product-moment correlation
data: bwt and age
t = 1.238, df = 187, p-value = 0.2174
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.05327 0.22992
sample estimates:
cor
0.09014
Thus, correlation coefficient r = 0.0901, and p-value is 0.2174.
Linear model
## Linear model
fit.bwt.age <- lm(bwt ~ age, lbw)
summary.fit.bwt.age <- summary(fit.bwt.age)
print(summary.fit.bwt.age, digits = 5)
Call:
lm(formula = bwt ~ age, data = lbw)
Residuals:
Min 1Q Median 3Q Max
-2294.825 -517.809 10.385 530.594 1775.321
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2656.535 238.810 11.1240 <2e-16 ***
age 12.403 10.021 1.2377 0.2174
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 728 on 187 degrees of freedom
Multiple R-squared: 0.008126, Adjusted R-squared: 0.0028219
F-statistic: 1.532 on 1 and 187 DF, p-value: 0.21736
Therefore, the p-values for the Pearson correlation coefficient (0.2174) and slope of the age variable (0.2174) are the same.
Also, SD(age) / SD(bwt) * slope of age = 5.298678 / 729.0597 * 12.40322 = 0.0901 is the same as Pearson correlation coefficient r = 0.0901.
The Multiple R-squared = 0.0081 in the regression result is equal to the Pearson correlation coefficient squard = 0.0081.
by lwt
Scatter plot
## Graph
ggplot(lbw, aes(y = bwt, x = lwt)) +
geom_point(alpha = 1/2) + theme_bw()
Pearson correlation
## Pearson correlation
res.cor.test <- with(lbw, cor.test(bwt, lwt))
cor.coef.from.slope <- res.cor.test$estimate
names(cor.coef.from.slope) <- NULL
res.cor.test
Pearson's product-moment correlation
data: bwt and lwt
t = 2.595, df = 187, p-value = 0.01021
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.04489 0.32063
sample estimates:
cor
0.1864
Thus, correlation coefficient r = 0.1864, and p-value is 0.0102.
Linear model
## Linear model
fit.bwt.lwt <- lm(bwt ~ lwt, lbw)
summary.fit.bwt.lwt <- summary(fit.bwt.lwt)
print(summary.fit.bwt.lwt, digits = 4)
Call:
lm(formula = bwt ~ lwt, data = lbw)
Residuals:
Min 1Q Median 3Q Max
-2192.14 -503.69 -4.03 508.42 2075.53
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2367.767 228.414 10.366 <2e-16 ***
lwt 4.445 1.713 2.595 0.0102 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 718.2 on 187 degrees of freedom
Multiple R-squared: 0.03476, Adjusted R-squared: 0.02959
F-statistic: 6.733 on 1 and 187 DF, p-value: 0.01021
Therefore, the p-values for the Pearson correlation coefficient (0.0102) and slope of the lwt variable (0.0102) are the same.
Also, SD(lwt) / SD(bwt) * slope of lwt = 30.57938 / 729.0597 * 4.444757 = 0.1864 is the same as Pearson correlation coefficient r = 0.1864.
The Multiple R-squared = 0.0348 in the regression result is equal to the Pearson correlation coefficient squard = 0.0348.
2. Use regression to examine the relationship between mother's weight and infant birthweight. Get the confidence bounds for the slope and intercept, and show graphically the confidence bounds for the regression mean and for predicted observations.'
The lower and upper confidence limits for the 95% confidence interval are:
confint(fit.bwt.age)
2.5 % 97.5 %
(Intercept) 2185.426 3127.64
age -7.365 32.17
Scatterplot with the regression line (middle), the confidence bands (narrower), and the predicted bands (wider).
## Fitted line and uncertainties (Introductory Statistics with R, 2nd ed page 117-120)
newdata <- data.frame(age = seq(from = 14, to = 45, by = 0.1))
## Narrow bands: confidence bands (uncertainty about the line (predicted mean) itself)
## Curved as the line is better determined closer to the middle
conf.limits <- predict(object = fit.bwt.age, newdata = newdata, interval = "confidence")
conf.limits <- cbind(newdata, conf.limits)
## Wide bands: prediction bands (uncertainty about future observations)
## Most future data points should fall between the wide limits
pred.limits <- predict(object = fit.bwt.age, newdata = newdata, interval = "prediction")
pred.limits <- cbind(newdata, pred.limits)
## Actual plot by ggplot2
ggplot(lbw, aes(y = bwt, x = age)) + theme_bw() +
geom_point(alpha = 1/2) +
geom_smooth(method = "lm") +
## geom_line(data = conf.limits, aes(x = age, y = lwr), linetype = 2) +
## geom_line(data = conf.limits, aes(x = age, y = upr), linetype = 2) +
geom_line(data = pred.limits, aes(x = age, y = lwr), linetype = 3) +
geom_line(data = pred.limits, aes(x = age, y = upr), linetype = 3)