Biostatistics 213: Homework 1

## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE, 
    echo = TRUE, fig.width = 5, fig.height = 5)
options(width = 116, scipen = 10)

setwd("~/statistics/bio213/")
library(gdata)
lbw <- read.xls("~/statistics/bio213/lbw.xls")

## Same data available online: http://www.umass.edu/statdata/statdata/data/index.html
## Cases 10 and 39 needs fix to make them identical to Dr. Orav's dataset
## lbw2 <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls")
## lbw2[c(10,39),"BWT"] <- c(2655,3035)

SeeNormality <-
function(data.frame, na.rm = TRUE) {
    require(e1071)
    result <- lapply(data.frame,
                     function(y) {
                         data.frame(Number      = length(y),
                                    Not.NA      = sum(!is.na(y)),
                                    Mean        = mean(y, na.rm = na.rm),
                                    SD          = sd(y, na.rm = na.rm),
                                    Skewness    = skewness(y, type = 2, na.rm = na.rm),
                                    Kurtosis    = kurtosis(y, type = 2, na.rm = na.rm),
                                    Min         = min(y, na.rm = na.rm),
                                    Q25         = quantile(y, probs = 0.25, na.rm = na.rm),
                                    Median      = median(y, na.rm = na.rm),
                                    Q75         = quantile(y, probs = 0.75, na.rm = na.rm),
                                    Max         = max(y, na.rm = na.rm)
                                    )
                     })
    do.call(rbind, result)
}

normality.bwt.by <- function(group) {
    ddply(lbw,
          group,
          summarise,

          Number      = length(bwt),
          Not.NA      = sum(!is.na(bwt)),
          Mean        = mean(bwt),
          SD          = sd(bwt),
          Skewness    = skewness(bwt, type = 2),
          Kurtosis    = kurtosis(bwt, type = 2),
          Min         = min(bwt),
          Q25         = quantile(bwt, probs = 0.25),
          Median      = median(bwt),
          Q75         = quantile(bwt, probs = 0.75),
          Max         = max(bwt)
          )}

library(ggplot2)
library(lattice)

Assignment: Using the data set on birth weights, use the continuous measure of birth weight, BWT, (not the dichotomized one) and:

1. Find all the univariate predictors of birth weight. Do not use regression. Use “simple” tests, like t-tests, Wilcoxon tests, ANOVA, correlations, etc.

The outcome variable is the birth weight, a continuous variable. The skewness is within the -1 to 1 range, and kurtosis is negative, meaning relatively normal distribution. Therefore, to see association with categorical variables, t-test and ANOVA can be used depending on the number of category levels.

SeeNormality(lbw["bwt"])
    Number Not.NA Mean    SD Skewness Kurtosis Min  Q25 Median  Q75  Max
bwt    189    189 2945 729.1  -0.2106 -0.08188 709 2414   2977 3475 4990
densityplot(lbw$bwt)

plot of chunk unnamed-chunk-3

ggplot(lbw) + geom_density(aes(bwt))

plot of chunk unnamed-chunk-3

age: Age of mother (continuous)
The maternal age variable is also relatively normally distributed. The skewness is within the -1 to 1 range, and kurtosis is less than 1. Thus, between two normally distributed continuous variables, Pearson correlation is the measure of association.

SeeNormality(lbw["age"])
    Number Not.NA  Mean    SD Skewness Kurtosis Min Q25 Median Q75 Max
age    189    189 23.24 5.299   0.7222   0.6162  14  19     23  26  45
densityplot(lbw$age)

plot of chunk unnamed-chunk-4

ggplot(lbw) + geom_density(aes(age))

plot of chunk unnamed-chunk-4

The absolute value of the Pearson correlation coefficient is small and is not significant.

res.age <- with(lbw, cor.test(age, bwt, method = "pearson"))
list(coef = res.age$estimate, p.value = res.age$p.value)
$coef
    cor 
0.09014 

$p.value
[1] 0.2174

lwt: Weight of mother at last menstrual cycle (continuous)
The maternal weight variable is non-normal as the kurtosis is high. Thus, Spearman correlation is more appropriate.

SeeNormality(lbw["lwt"])
    Number Not.NA  Mean    SD Skewness Kurtosis Min Q25 Median Q75 Max
lwt    189    189 129.8 30.58    1.402    2.404  80 110    121 140 250
densityplot(lbw$lwt)

plot of chunk unnamed-chunk-6

ggplot(lbw) + geom_density(aes(lwt))

plot of chunk unnamed-chunk-6

The absolute value of the Spearman correlation coefficient is relatively small, but there is a significant positive correlation. Therefore, with increasing maternal weight, the baby's birth weight is also increased.
'

res.lwt <- with(lbw, cor.test(lwt, bwt, method = "spearman"))
list(coef = res.lwt$estimate, p.value = res.lwt$p.value)
$coef
   rho 
0.2486 

$p.value
[1] 0.0005609

race: Race. (3-level categorical: 1 = White, 2 = Black, 3 = Other)
The mean birth weight is highest among White babies, followed by other race babies, and Black babies.

library(doBy)
library(plyr)
summaryBy(data = lbw, bwt ~ race, FUN = mean)
  race bwt.mean
1    1     3104
2    2     2720
3    3     2804
## ddply(lbw, "race", summarise,
##       n = length(bwt),
##       mean = mean(bwt),
##       sd = sd(bwt),
##       skewness = skewness(bwt, type = 2),
##       kurtosis = kurtosis(bwt, type = 2))
normality.bwt.by("race")
  race Number Not.NA Mean    SD Skewness Kurtosis  Min  Q25 Median  Q75  Max
1    1     96     96 3104 727.8  -0.1432 -0.24046 1021 2585   3076 3651 4990
2    2     26     26 2720 638.7  -0.3502  0.24348 1135 2370   2849 3057 3860
3    3     67     67 2804 721.3  -0.4364 -0.09258  709 2313   2835 3274 4054
ggplot(lbw, aes(x = bwt, color = factor(race))) +
        geom_density() + facet_grid(race~.)

plot of chunk unnamed-chunk-8

plot(data = lbw, bwt ~ factor(race), xlab = "race")

plot of chunk unnamed-chunk-8

res.oneway <- oneway.test(data = lbw, bwt ~ factor(race), var.equal = TRUE)

ANOVA is significant with a p-value of 0.0078. Pairwise comparison resulted in a significant adjusted p-values in comparisons between White and Black, and between White and Others. Thus, White race was associated with a higher birth weight than the other two groups.

with(lbw, pairwise.t.test(bwt, race, p.adj = "bonf"))

    Pairwise comparisons using t tests with pooled SD 

data:  bwt and race 

  1     2    
2 0.048 -    
3 0.027 1.000

P value adjustment method: bonferroni 

smoke: Smoking status during pregnancy (binary)
The mean birth weight is higher for babies of non-smoking mothers.

res.summaryBy <- summaryBy(data = lbw, bwt ~ smoke, FUN = mean)
## ddply(lbw, "smoke", summarise, n = length(bwt), mean = mean(bwt))
normality.bwt.by("smoke")
  smoke Number Not.NA Mean    SD Skewness Kurtosis  Min  Q25 Median  Q75  Max
1     0    115    115 3055 752.4  -0.2911  -0.2269 1021 2509   3100 3622 4990
2     1     74     74 2774 660.3  -0.2963   0.4172  709 2370   2776 3246 4238
plot(data = lbw, bwt ~ factor(smoke), xlab = "smoke")

plot of chunk unnamed-chunk-10

res.var <- var.test(data = lbw, bwt ~ smoke)
res.t <- t.test(data = lbw, bwt ~ smoke, var.equal = TRUE)

The variance was not significantly different in the F-test (p-value = 0.2299). Thus, t-test with equal variance assumption was used, and resulted in a significant p-value of 0.0092. Therefore, maternal smoking during pregnancy significantly decreases birth weight of the babies, and the mean difference is 281.443 with a 95% confidence interval of [70.4038, 492.4822].

ptl: Number of premature labor (integar)
As the number of mothers with history of premature labor is small,the variable should be dichotomized: no premature labor history vs any history of premature labor ≥ 1.

table(lbw$ptl)

  0   1   2   3 
159  24   5   1 

lbw$ptl.dichotomized <- as.numeric(lbw$ptl > 0)

res.summaryBy <- summaryBy(data = lbw, bwt ~ ptl.dichotomized, FUN = mean)
## ddply(lbw, "ptl.dichotomized", summarise, n = length(bwt), mean = mean(bwt))
normality.bwt.by("ptl.dichotomized")
  ptl.dichotomized Number Not.NA Mean    SD Skewness Kurtosis  Min  Q25 Median  Q75  Max
1                0    159    159 3014 704.4  -0.1975  -0.1029 1021 2495   3062 3579 4990
2                1     30     30 2579 760.6  -0.0691   0.1523  709 2140   2417 3270 4174
plot(data = lbw, bwt ~ factor(ptl.dichotomized), xlab = "ptl.dichotomized")

plot of chunk unnamed-chunk-11

res.var <- var.test(data = lbw, bwt ~ ptl.dichotomized)
res.t <- t.test(data = lbw, bwt ~ ptl.dichotomized, var.equal = TRUE)

The variance was not significantly different in the F-test (p-value = 0.541). Thus, t-test with equal variance assumption was used, and resulted in a significant p-value of 0.0026. Therefore, prior history of premature labor significantly decreases birth weight of the babies, and the mean difference is 434.2981 with a 95% confidence interval of [154.1666, 714.4296].

ht: History of hypertension (binary)

res.summaryBy <- summaryBy(data = lbw, bwt ~ ht, FUN = mean)
## ddply(lbw, "ht", summarise, n = length(bwt), mean = mean(bwt))
normality.bwt.by("ht")
  ht Number Not.NA Mean    SD Skewness Kurtosis  Min  Q25 Median  Q75  Max
1  0    177    177 2972 709.3 -0.16433 -0.01924  709 2424   2992 3475 4990
2  1     12     12 2537 917.3 -0.02596 -1.15816 1135 1775   2495 3232 3790
plot(data = lbw, bwt ~ factor(ht), xlab = "ht")

plot of chunk unnamed-chunk-12

res.var <- var.test(data = lbw, bwt ~ ht)
res.t <- t.test(data = lbw, bwt ~ ht, var.equal = TRUE)

The variance was not significantly different in the F-test (p-value = 0.1659). Thus, t-test with equal variance assumption was used, and resulted in a significant p-value of 0.0448. Therefore, prior history of hypertension significantly decreases birth weight of the babies, and the mean difference is 435.6737 with a 95% confidence interval of [10.1173, 861.2301].

ui: Presence of uterine irritability (binary)

res.summaryBy <- summaryBy(data = lbw, bwt ~ ui, FUN = mean)
## ddply(lbw, "ui", summarise, n = length(bwt), mean = mean(bwt))
normality.bwt.by("ui")
  ui Number Not.NA Mean    SD Skewness Kurtosis  Min  Q25 Median  Q75  Max
1  0    161    161 3031 693.7  -0.1230  -0.3027 1135 2466   3062 3586 4990
2  1     28     28 2450 743.0  -0.3079   0.1814  709 2077   2452 2927 3912
plot(data = lbw, bwt ~ factor(ui), xlab = "ui")

plot of chunk unnamed-chunk-13

res.var <- var.test(data = lbw, bwt ~ ui)
res.t <- t.test(data = lbw, bwt ~ ui, var.equal = TRUE)

The variance was not significantly different in the F-test (p-value = 0.5876). Thus, t-test with equal variance assumption was used, and resulted in a significant p-value of 0.0001. Therefore, presence of uterine irritation significantly decreases birth weight of the babies, and the mean difference is 580.3043 with a 95% confidence interval of [297.1442, 863.4645].

ftv: Number of physician visits during first trimester (integar)
As the number of mothers who visited physicians more than 3 times are small, dichotomizing the variable is preferred. As a logical comparison is between no visit vs any visit, it is dichotomized at ≥ 1. CHANGED TO: Comparison was done between good visit number (1,2) vs bad numbers (0,3+).

table(lbw$ftv)

  0   1   2   3   4   6 
100  47  30   7   4   1 

## lbw$ftv.dichotomized <- as.numeric(lbw$ftv > 0) # No vs any
lbw$ftv.dichotomized <- as.numeric(lbw$ftv %in% c(1,2)) # good (1,2) vs bad

res.summaryBy <- summaryBy(data = lbw, bwt ~ ftv.dichotomized, FUN = mean)
## ddply(lbw, "ftv.dichotomized", summarise, n = length(bwt), mean = mean(bwt))
normality.bwt.by("ftv.dichotomized")
  ftv.dichotomized Number Not.NA Mean    SD Skewness Kurtosis  Min  Q25 Median  Q75  Max
1                0    112    112 2858 704.4  -0.3552 -0.21544  709 2374   2892 3434 4238
2                1     77     77 3070 750.5  -0.1149 -0.02575 1021 2594   3035 3651 4990
plot(data = lbw, bwt ~ factor(ftv.dichotomized), xlab = "ftv.dichotomized")

plot of chunk unnamed-chunk-14

res.var <- var.test(data = lbw, bwt ~ ftv.dichotomized)
res.t <- t.test(data = lbw, bwt ~ ftv.dichotomized, var.equal = TRUE)

The variance was not significantly different in the F-test (p-value = 0.5375). Thus, t-test with equal variance assumption was used, and resulted in a significant p-value of 0.0493. Therefore, presence of healthy number of visits (1,2) is significantly associated with heavier birth weight of the babies, and the mean difference is -211.9091 with a 95% confidence interval of [-423.1931, -0.6251].

ANOVA approach (do not do both)

lbw$ftv.3group <-
    cut(lbw$ftv, breaks = c(-Inf,0,2,Inf), labels = c("0","1-2","3+"))
normality.bwt.by("ftv.3group")
  ftv.3group Number Not.NA Mean    SD Skewness Kurtosis  Min  Q25 Median  Q75  Max
1          0    100    100 2865 725.0  -0.4076 -0.25903  709 2346   2948 3462 4238
2        1-2     77     77 3070 750.5  -0.1149 -0.02575 1021 2594   3035 3651 4990
3         3+     12     12 2802 520.8   0.7690 -0.17866 2126 2441   2666 3136 3860
plot(data = lbw, bwt ~ factor(ftv.3group), xlab = "ftv.3group")

plot of chunk unnamed-chunk-15

res.oneway <- oneway.test(data = lbw, bwt ~ ftv.3group, var.equal = TRUE)

By grouping into 3 groups and performing ANOVA, it is not significant.

2. Use scatter plots and correlation coefficients to look at the relationship between mother's age versus birth weight and, separately, mother's weight versus birth weight. (Note: You may have done these already for #1 above.)

Maternal age and birth weight

ggplot(data = lbw, aes(x = age, y = bwt)) + geom_point(alpha = 1/3) + theme_bw()

plot of chunk unnamed-chunk-16

As stated in question 1, Pearson correlation was used. The absolute value of the coefficient was very low at 0.0901. There is no recognizable non-linear pattern in the scatter plot. Thus, there is no relationship.

Maternal weight and birth weight

ggplot(data = lbw, aes(x = lwt, y = bwt)) + geom_point(alpha = 1/3) + theme_bw()

plot of chunk unnamed-chunk-17

As stated in question 1, Spearman correlation was used. The absolute value of the coefficient was low at 0.2486. There is no recognizable non-linear pattern in the scatter plot. The birth weights tend to be higher in babies born to heavier mothers. Thus there is a weak positive correlation.