Biostatistics 213: Homework 1
## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE,
echo = TRUE, fig.width = 5, fig.height = 5)
options(width = 116, scipen = 10)
setwd("~/statistics/bio213/")
library(gdata)
lbw <- read.xls("~/statistics/bio213/lbw.xls")
## Same data available online: http://www.umass.edu/statdata/statdata/data/index.html
## Cases 10 and 39 needs fix to make them identical to Dr. Orav's dataset
## lbw2 <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls")
## lbw2[c(10,39),"BWT"] <- c(2655,3035)
SeeNormality <-
function(data.frame, na.rm = TRUE) {
require(e1071)
result <- lapply(data.frame,
function(y) {
data.frame(Number = length(y),
Not.NA = sum(!is.na(y)),
Mean = mean(y, na.rm = na.rm),
SD = sd(y, na.rm = na.rm),
Skewness = skewness(y, type = 2, na.rm = na.rm),
Kurtosis = kurtosis(y, type = 2, na.rm = na.rm),
Min = min(y, na.rm = na.rm),
Q25 = quantile(y, probs = 0.25, na.rm = na.rm),
Median = median(y, na.rm = na.rm),
Q75 = quantile(y, probs = 0.75, na.rm = na.rm),
Max = max(y, na.rm = na.rm)
)
})
do.call(rbind, result)
}
normality.bwt.by <- function(group) {
ddply(lbw,
group,
summarise,
Number = length(bwt),
Not.NA = sum(!is.na(bwt)),
Mean = mean(bwt),
SD = sd(bwt),
Skewness = skewness(bwt, type = 2),
Kurtosis = kurtosis(bwt, type = 2),
Min = min(bwt),
Q25 = quantile(bwt, probs = 0.25),
Median = median(bwt),
Q75 = quantile(bwt, probs = 0.75),
Max = max(bwt)
)}
library(ggplot2)
library(lattice)
Assignment: Using the data set on birth weights, use the continuous measure of birth weight, BWT, (not the dichotomized one) and:
1. Find all the univariate predictors of birth weight. Do not use regression. Use “simple” tests, like t-tests, Wilcoxon tests, ANOVA, correlations, etc.
The outcome variable is the birth weight, a continuous variable. The skewness is within the -1 to 1 range, and kurtosis is negative, meaning relatively normal distribution. Therefore, to see association with categorical variables, t-test and ANOVA can be used depending on the number of category levels.
SeeNormality(lbw["bwt"])
Number Not.NA Mean SD Skewness Kurtosis Min Q25 Median Q75 Max
bwt 189 189 2945 729.1 -0.2106 -0.08188 709 2414 2977 3475 4990
densityplot(lbw$bwt)
ggplot(lbw) + geom_density(aes(bwt))
age: Age of mother (continuous)
The maternal age variable is also relatively normally distributed. The skewness is within the -1 to 1 range, and kurtosis is less than 1. Thus, between two normally distributed continuous variables, Pearson correlation is the measure of association.
SeeNormality(lbw["age"])
Number Not.NA Mean SD Skewness Kurtosis Min Q25 Median Q75 Max
age 189 189 23.24 5.299 0.7222 0.6162 14 19 23 26 45
densityplot(lbw$age)
ggplot(lbw) + geom_density(aes(age))
The absolute value of the Pearson correlation coefficient is small and is not significant.
res.age <- with(lbw, cor.test(age, bwt, method = "pearson"))
list(coef = res.age$estimate, p.value = res.age$p.value)
$coef
cor
0.09014
$p.value
[1] 0.2174
lwt: Weight of mother at last menstrual cycle (continuous)
The maternal weight variable is non-normal as the kurtosis is high. Thus, Spearman correlation is more appropriate.
SeeNormality(lbw["lwt"])
Number Not.NA Mean SD Skewness Kurtosis Min Q25 Median Q75 Max
lwt 189 189 129.8 30.58 1.402 2.404 80 110 121 140 250
densityplot(lbw$lwt)
ggplot(lbw) + geom_density(aes(lwt))
The absolute value of the Spearman correlation coefficient is relatively small, but there is a significant positive correlation. Therefore, with increasing maternal weight, the baby's birth weight is also increased.
'
res.lwt <- with(lbw, cor.test(lwt, bwt, method = "spearman"))
list(coef = res.lwt$estimate, p.value = res.lwt$p.value)
$coef
rho
0.2486
$p.value
[1] 0.0005609
race: Race. (3-level categorical: 1 = White, 2 = Black, 3 = Other)
The mean birth weight is highest among White babies, followed by other race babies, and Black babies.
library(doBy)
library(plyr)
summaryBy(data = lbw, bwt ~ race, FUN = mean)
race bwt.mean
1 1 3104
2 2 2720
3 3 2804
## ddply(lbw, "race", summarise,
## n = length(bwt),
## mean = mean(bwt),
## sd = sd(bwt),
## skewness = skewness(bwt, type = 2),
## kurtosis = kurtosis(bwt, type = 2))
normality.bwt.by("race")
race Number Not.NA Mean SD Skewness Kurtosis Min Q25 Median Q75 Max
1 1 96 96 3104 727.8 -0.1432 -0.24046 1021 2585 3076 3651 4990
2 2 26 26 2720 638.7 -0.3502 0.24348 1135 2370 2849 3057 3860
3 3 67 67 2804 721.3 -0.4364 -0.09258 709 2313 2835 3274 4054
ggplot(lbw, aes(x = bwt, color = factor(race))) +
geom_density() + facet_grid(race~.)
plot(data = lbw, bwt ~ factor(race), xlab = "race")
res.oneway <- oneway.test(data = lbw, bwt ~ factor(race), var.equal = TRUE)
ANOVA is significant with a p-value of 0.0078. Pairwise comparison resulted in a significant adjusted p-values in comparisons between White and Black, and between White and Others. Thus, White race was associated with a higher birth weight than the other two groups.
with(lbw, pairwise.t.test(bwt, race, p.adj = "bonf"))
Pairwise comparisons using t tests with pooled SD
data: bwt and race
1 2
2 0.048 -
3 0.027 1.000
P value adjustment method: bonferroni
smoke: Smoking status during pregnancy (binary)
The mean birth weight is higher for babies of non-smoking mothers.
res.summaryBy <- summaryBy(data = lbw, bwt ~ smoke, FUN = mean)
## ddply(lbw, "smoke", summarise, n = length(bwt), mean = mean(bwt))
normality.bwt.by("smoke")
smoke Number Not.NA Mean SD Skewness Kurtosis Min Q25 Median Q75 Max
1 0 115 115 3055 752.4 -0.2911 -0.2269 1021 2509 3100 3622 4990
2 1 74 74 2774 660.3 -0.2963 0.4172 709 2370 2776 3246 4238
plot(data = lbw, bwt ~ factor(smoke), xlab = "smoke")
res.var <- var.test(data = lbw, bwt ~ smoke)
res.t <- t.test(data = lbw, bwt ~ smoke, var.equal = TRUE)
The variance was not significantly different in the F-test (p-value = 0.2299). Thus, t-test with equal variance assumption was used, and resulted in a significant p-value of 0.0092. Therefore, maternal smoking during pregnancy significantly decreases birth weight of the babies, and the mean difference is 281.443 with a 95% confidence interval of [70.4038, 492.4822].
ptl: Number of premature labor (integar)
As the number of mothers with history of premature labor is small,the variable should be dichotomized: no premature labor history vs any history of premature labor ≥ 1.
table(lbw$ptl)
0 1 2 3
159 24 5 1
lbw$ptl.dichotomized <- as.numeric(lbw$ptl > 0)
res.summaryBy <- summaryBy(data = lbw, bwt ~ ptl.dichotomized, FUN = mean)
## ddply(lbw, "ptl.dichotomized", summarise, n = length(bwt), mean = mean(bwt))
normality.bwt.by("ptl.dichotomized")
ptl.dichotomized Number Not.NA Mean SD Skewness Kurtosis Min Q25 Median Q75 Max
1 0 159 159 3014 704.4 -0.1975 -0.1029 1021 2495 3062 3579 4990
2 1 30 30 2579 760.6 -0.0691 0.1523 709 2140 2417 3270 4174
plot(data = lbw, bwt ~ factor(ptl.dichotomized), xlab = "ptl.dichotomized")
res.var <- var.test(data = lbw, bwt ~ ptl.dichotomized)
res.t <- t.test(data = lbw, bwt ~ ptl.dichotomized, var.equal = TRUE)
The variance was not significantly different in the F-test (p-value = 0.541). Thus, t-test with equal variance assumption was used, and resulted in a significant p-value of 0.0026. Therefore, prior history of premature labor significantly decreases birth weight of the babies, and the mean difference is 434.2981 with a 95% confidence interval of [154.1666, 714.4296].
ht: History of hypertension (binary)
res.summaryBy <- summaryBy(data = lbw, bwt ~ ht, FUN = mean)
## ddply(lbw, "ht", summarise, n = length(bwt), mean = mean(bwt))
normality.bwt.by("ht")
ht Number Not.NA Mean SD Skewness Kurtosis Min Q25 Median Q75 Max
1 0 177 177 2972 709.3 -0.16433 -0.01924 709 2424 2992 3475 4990
2 1 12 12 2537 917.3 -0.02596 -1.15816 1135 1775 2495 3232 3790
plot(data = lbw, bwt ~ factor(ht), xlab = "ht")
res.var <- var.test(data = lbw, bwt ~ ht)
res.t <- t.test(data = lbw, bwt ~ ht, var.equal = TRUE)
The variance was not significantly different in the F-test (p-value = 0.1659). Thus, t-test with equal variance assumption was used, and resulted in a significant p-value of 0.0448. Therefore, prior history of hypertension significantly decreases birth weight of the babies, and the mean difference is 435.6737 with a 95% confidence interval of [10.1173, 861.2301].
ui: Presence of uterine irritability (binary)
res.summaryBy <- summaryBy(data = lbw, bwt ~ ui, FUN = mean)
## ddply(lbw, "ui", summarise, n = length(bwt), mean = mean(bwt))
normality.bwt.by("ui")
ui Number Not.NA Mean SD Skewness Kurtosis Min Q25 Median Q75 Max
1 0 161 161 3031 693.7 -0.1230 -0.3027 1135 2466 3062 3586 4990
2 1 28 28 2450 743.0 -0.3079 0.1814 709 2077 2452 2927 3912
plot(data = lbw, bwt ~ factor(ui), xlab = "ui")
res.var <- var.test(data = lbw, bwt ~ ui)
res.t <- t.test(data = lbw, bwt ~ ui, var.equal = TRUE)
The variance was not significantly different in the F-test (p-value = 0.5876). Thus, t-test with equal variance assumption was used, and resulted in a significant p-value of 0.0001. Therefore, presence of uterine irritation significantly decreases birth weight of the babies, and the mean difference is 580.3043 with a 95% confidence interval of [297.1442, 863.4645].
ftv: Number of physician visits during first trimester (integar)
As the number of mothers who visited physicians more than 3 times are small, dichotomizing the variable is preferred. As a logical comparison is between no visit vs any visit, it is dichotomized at ≥ 1. CHANGED TO: Comparison was done between good visit number (1,2) vs bad numbers (0,3+).
table(lbw$ftv)
0 1 2 3 4 6
100 47 30 7 4 1
## lbw$ftv.dichotomized <- as.numeric(lbw$ftv > 0) # No vs any
lbw$ftv.dichotomized <- as.numeric(lbw$ftv %in% c(1,2)) # good (1,2) vs bad
res.summaryBy <- summaryBy(data = lbw, bwt ~ ftv.dichotomized, FUN = mean)
## ddply(lbw, "ftv.dichotomized", summarise, n = length(bwt), mean = mean(bwt))
normality.bwt.by("ftv.dichotomized")
ftv.dichotomized Number Not.NA Mean SD Skewness Kurtosis Min Q25 Median Q75 Max
1 0 112 112 2858 704.4 -0.3552 -0.21544 709 2374 2892 3434 4238
2 1 77 77 3070 750.5 -0.1149 -0.02575 1021 2594 3035 3651 4990
plot(data = lbw, bwt ~ factor(ftv.dichotomized), xlab = "ftv.dichotomized")
res.var <- var.test(data = lbw, bwt ~ ftv.dichotomized)
res.t <- t.test(data = lbw, bwt ~ ftv.dichotomized, var.equal = TRUE)
The variance was not significantly different in the F-test (p-value = 0.5375). Thus, t-test with equal variance assumption was used, and resulted in a significant p-value of 0.0493. Therefore, presence of healthy number of visits (1,2) is significantly associated with heavier birth weight of the babies, and the mean difference is -211.9091 with a 95% confidence interval of [-423.1931, -0.6251].
ANOVA approach (do not do both)
lbw$ftv.3group <-
cut(lbw$ftv, breaks = c(-Inf,0,2,Inf), labels = c("0","1-2","3+"))
normality.bwt.by("ftv.3group")
ftv.3group Number Not.NA Mean SD Skewness Kurtosis Min Q25 Median Q75 Max
1 0 100 100 2865 725.0 -0.4076 -0.25903 709 2346 2948 3462 4238
2 1-2 77 77 3070 750.5 -0.1149 -0.02575 1021 2594 3035 3651 4990
3 3+ 12 12 2802 520.8 0.7690 -0.17866 2126 2441 2666 3136 3860
plot(data = lbw, bwt ~ factor(ftv.3group), xlab = "ftv.3group")
res.oneway <- oneway.test(data = lbw, bwt ~ ftv.3group, var.equal = TRUE)
By grouping into 3 groups and performing ANOVA, it is not significant.
2. Use scatter plots and correlation coefficients to look at the relationship between mother's age versus birth weight and, separately, mother's weight versus birth weight. (Note: You may have done these already for #1 above.)
Maternal age and birth weight
ggplot(data = lbw, aes(x = age, y = bwt)) + geom_point(alpha = 1/3) + theme_bw()
As stated in question 1, Pearson correlation was used. The absolute value of the coefficient was very low at 0.0901. There is no recognizable non-linear pattern in the scatter plot. Thus, there is no relationship.
Maternal weight and birth weight
ggplot(data = lbw, aes(x = lwt, y = bwt)) + geom_point(alpha = 1/3) + theme_bw()
As stated in question 1, Spearman correlation was used. The absolute value of the coefficient was low at 0.2486. There is no recognizable non-linear pattern in the scatter plot. The birth weights tend to be higher in babies born to heavier mothers. Thus there is a weak positive correlation.