Analysis of Public Health Data

This manual was created by Catherine D’Este and updated by Tambri Housen and Alice Richardson. The manual was originally created for use with Stata data analysis software, the conversion to R was conducted by Nidhi Menon and Arminder Deol. The manual was created for the Australian Field Epidemiology Training Program - Masters of Philosophy (Applied Epidemiology), Australian National University. CRICOS Provider No. 00120C

Solutions:

Framework for analysis:

A. Write a statistical analysis plan
B. Exploratory data analysis
C. Modelling
- c.1 specify the model
- c.2 estimate / generate the model
- c.3 model checking - residual and model diagnostics
- c.4 inference and interpretation

A. Statistical Analysis Plan

Descriptive statistics will be presented as mean and standard deviations for continuous variables. Exploratory data analysis will involve obtaining correlation coefficients for pairs of continuous explanatory variables (Pearson correlation if variables are approximately normally distributed or the non-parametric Spearman correlation coefficient if either of the variables are not normally distributed), ttests for the relationship between the continuous outcome and binary explanatory variables (or a non-parametric equivalent such as the Mann Whitney test if the outcome is not normally distributed) and ANOVA for the relationship between the continuous outcome and categorical variables (or the non-parametric Kruskal-Wallis test).

To adjust for potential confounders, in addition to blood lead level in 1972, age, sex, area of residence and whether or not the child lived within 1 mile of the smelter for the first two years of their life will then be included in a multiple linear regression model. Because blood lead level in 1973 was obtained at a later time to the other variables, this measure was not considered appropriate for inclusion in the model. It is also likely to be highly correlated with blood lead level in 1972. Finger wrist tapping score was also not considered as an explanatory variable as this is a neurological measure, and clinically / biologically not meaningful to include in this model.

Regression diagnostics were undertaken for the final model to assess the appropriateness of the assumptions and the fit of the model. Point estimates, 95% confidence intervals and significance tests are reported for coefficient estimates. A significance level of 5% was used for all analyses.

B. Exploratory data analysis

Firstly we should examine every variable to check for outliers and missing values, to determine the numbers of observations across categories for categorical variables and to assess the shape of the distribution for continuous variables.

There are 124 observations in the dataset, with 3 missing values for lead72, and only 99 observations have data for the finger-wrist tapping score (children aged under 5 do not have this assessment due to difficulties of administering the test in this age group).

Age is not normally distributed, but that is not a problem for these analyses; the other continuous variables are approximately normally distributed, although blood lead level in 1973 is a bit skewed to the right. We will need to check the distribution of the residuals after we have undertaken the regression.

# Create a new project and save the 'blood lead.csv' file in the project folder on your computer. 
bl <- read.csv("blood lead.csv")
attach(bl) # by doing this, you no longer need to use the $ operator when specifying variables - e.g. "bl$area" can now be written as simply "area"

### 
t1 <- table(area); print(t1)

## area
##  1  2 
## 57 67

prop.table(t1)

## area
##         1         2 
## 0.4596774 0.5403226

t2 <- table(sex); print(t2)

## sex
##  1  2 
## 76 48

prop.table(t2)

## sex
##         1         2 
## 0.6129032 0.3870968

t3 <- table(fst2yrs); print(t3)

## fst2yrs
##  1  2 
## 35 89

prop.table(t3)

## fst2yrs
##         1         2 
## 0.2822581 0.7177419

# Draw a histogram of age - simple method:
hist(age, prob=TRUE, breaks=10) # showing density instead of frequencies and split into 10 age bins rather than defaults

lines(density(age)) #to plot density curve

# Add a Normal Curve instead - assign the histogram to 'h'

h<-hist(age, xlab="Age in Years", breaks = 10, main="Histogram with Normal Curve") # prob = T for density

xfit<-seq(min(age),max(age),length=40) 
yfit<-dnorm(xfit,mean=mean(age),sd=sd(age)) 
yfit <- yfit*diff(h$mids[1:2])*length(age) 
lines(xfit, yfit, col="blue", lwd=2)

# Omit any missing values
x1 <- na.omit(lead72)
h<-hist(x1, xlab="Blood lead level in 1972", breaks = 10, main="Histogram with Normal Curve")# for density

xfit<-seq(min(x1),max(x1)) 
yfit<-dnorm(xfit,mean=mean(x1),sd=sd(x1)) 
yfit <- yfit*diff(h$mids[1:2])*length(x1) 
lines(xfit, yfit, col="blue", lwd=2)

x1 <- na.omit(lead73) # Omits missing values 
h<-hist(x1, xlab="Blood lead level in 1973", breaks = 10, main="Histogram with Normal Curve")# for density

xfit<-seq(min(x1),max(x1)) 
yfit<-dnorm(xfit,mean=mean(x1),sd=sd(x1)) 
yfit <- yfit*diff(h$mids[1:2])*length(x1) 
lines(xfit, yfit, col="blue", lwd=2)

x1 <- na.omit(fwttest)
h<-hist(x1, xlab="fwttest", breaks = 10, main="Histogram with Normal Curve")# for density

xfit<-seq(min(x1),max(x1)) 
yfit<-dnorm(xfit,mean=mean(x1),sd=sd(x1)) 
yfit <- yfit*diff(h$mids[1:2])*length(x1) 
lines(xfit, yfit, col="blue", lwd=2)

x1 <- na.omit(iq)
h<-hist(x1, xlab="IQ", breaks = 10, main="Histogram with Normal Curve")# for density

xfit<-seq(min(x1),max(x1)) 
yfit<-dnorm(xfit,mean=mean(x1),sd=sd(x1)) 
yfit <- yfit*diff(h$mids[1:2])*length(x1) 
lines(xfit, yfit, col="blue", lwd=2)

Exploratory analysis involves investigating the relationship between each explanatory variable of interest and the outcome (iq). It does not make sense to consider blood lead levels in 72 and 73 in a model of IQ, as these variables are expected to be highly correlated (the correlation coefficient or 0.68 confirms this), so we are only using lead72. It also does not make sense to include finger wrist tapping score as an explanatory variable when investigating the relationship between IQ and blood lead level, so this variables has not been considered in the analyses. There is also likely to be a relationship between area of residence (whether the child lives within 1 mile of the smelter) and whether or not the child lived within 1 mile of the smelter for the first two years of their life; we would need to consider how to deal with this (considering only of these variables, examining potential interaction / effect modification); but for this example we have include both variables in the model.

#First install the Hmisc package in R before this step
library(Hmisc)
rcorr(as.matrix(bl), type = "pearson")

##            id  area   sex fst2yrs   age lead72 lead73 fwttest    iq
## id       1.00  0.18 -0.01    0.08 -0.59   0.40   0.22   -0.17 -0.15
## area     0.18  1.00 -0.06    0.39 -0.19   0.04  -0.12    0.09  0.12
## sex     -0.01 -0.06  1.00    0.06  0.18  -0.07  -0.04    0.07 -0.01
## fst2yrs  0.08  0.39  0.06    1.00 -0.10  -0.20  -0.24    0.04 -0.04
## age     -0.59 -0.19  0.18   -0.10  1.00  -0.22  -0.15    0.64  0.02
## lead72   0.40  0.04 -0.07   -0.20 -0.22   1.00   0.68   -0.27 -0.11
## lead73   0.22 -0.12 -0.04   -0.24 -0.15   0.68   1.00   -0.32 -0.11
## fwttest -0.17  0.09  0.07    0.04  0.64  -0.27  -0.32    1.00  0.15
## iq      -0.15  0.12 -0.01   -0.04  0.02  -0.11  -0.11    0.15  1.00
## 
## n
##          id area sex fst2yrs age lead72 lead73 fwttest  iq
## id      124  124 124     124 124    121    124      99 124
## area    124  124 124     124 124    121    124      99 124
## sex     124  124 124     124 124    121    124      99 124
## fst2yrs 124  124 124     124 124    121    124      99 124
## age     124  124 124     124 124    121    124      99 124
## lead72  121  121 121     121 121    121    121      96 121
## lead73  124  124 124     124 124    121    124      99 124
## fwttest  99   99  99      99  99     96     99      99  99
## iq      124  124 124     124 124    121    124      99 124
## 
## P
##         id     area   sex    fst2yrs age    lead72 lead73 fwttest iq    
## id             0.0434 0.9246 0.3542  0.0000 0.0000 0.0132 0.0973  0.1045
## area    0.0434        0.4780 0.0000  0.0305 0.6476 0.1954 0.3581  0.1793
## sex     0.9246 0.4780        0.5298  0.0418 0.4670 0.6838 0.4794  0.8800
## fst2yrs 0.3542 0.0000 0.5298         0.2882 0.0249 0.0070 0.7221  0.6576
## age     0.0000 0.0305 0.0418 0.2882         0.0165 0.0980 0.0000  0.8519
## lead72  0.0000 0.6476 0.4670 0.0249  0.0165        0.0000 0.0078  0.2139
## lead73  0.0132 0.1954 0.6838 0.0070  0.0980 0.0000        0.0014  0.2343
## fwttest 0.0973 0.3581 0.4794 0.7221  0.0000 0.0078 0.0014         0.1260
## iq      0.1045 0.1793 0.8800 0.6576  0.8519 0.2139 0.2343 0.1260

# draw scatterplots if IQ vs lead71 and IQ vs age, and label the axes
plot(lead72, iq, xlab = "blood lead level in 1972", ylab = "IQ score", col = "dark blue", pch = 19, main = "Scatter plot")

plot(age, iq, xlab = "Age in years", ylab = "IQ score", col = "dark blue", pch = 19, main = "Scatter plot")

#ttest
t.test(iq~sex, var.equal = T)

## 
##  Two Sample t-test
## 
## data:  iq by sex
## t = 0.15134, df = 122, p-value = 0.88
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.874543  5.681560
## sample estimates:
## mean in group 1 mean in group 2 
##        91.23684        90.83333

t.test(iq~area, var.equal = T)

## 
##  Two Sample t-test
## 
## data:  iq by area
## t = -1.3505, df = 122, p-value = 0.1793
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.614465  1.627295
## sample estimates:
## mean in group 1 mean in group 2 
##        89.19298        92.68657

# MLR
area <- as.factor(area)
mlr <- lm(iq ~ lead72 + age + sex+ area + fst2yrs)
summary(mlr)

## 
## Call:
## lm(formula = iq ~ lead72 + age + sex + area + fst2yrs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.250  -7.682   0.273   7.297  43.453 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 100.62301    8.89041  11.318   <2e-16 ***
## lead72       -0.16372    0.10385  -1.577   0.1177    
## age          -0.02074    0.38907  -0.053   0.9576    
## sex           0.32337    2.73123   0.118   0.9060    
## area2         5.99455    2.92422   2.050   0.0426 *  
## fst2yrs      -4.49853    3.28436  -1.370   0.1735    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.36 on 115 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.0509, Adjusted R-squared:  0.009639 
## F-statistic: 1.234 on 5 and 115 DF,  p-value: 0.298

#diagnostics
plot(mlr)

library(MASS)
sresid <- studres(mlr)
hist(sresid, prob = T, main = "Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40) 
yfit<-dnorm(xfit) 
lines(xfit, yfit, col = "dark blue", lwd = 2)

plot(predict(mlr), sresid, xlab = "Linear Predictions", ylab = "Residuals", col = "dark blue", pch = 19)
abline(h = 0, col = "red", lwd = 2)

The scatter plots and correlation coefficients demonstrate a slight decrease in IQ with increasing blood lead level, but no relationship between IQ and age (this seems logical).

The t.test( ) function produces a variety of t-tests. Unlike most statistical packages, the default assumes unequal variance and applies the Welsh df modification. You can use the var.equal = TRUE option to specify equal variances and a pooled variance estimate. You can use the alternative="less" or alternative="greater" option to specify a one tailed test.

The results of the t tests do not indicate a relationship between IQ and sex, area of residence and whether the child lived within 1 mile of the smelter for the first two years (output not shown).

C. Multiple regression model:

c.1 Specify the model

We will generate a multiple linear regression model, with IQ as the outcome (dependent variable) and blood lead level in 1972, age, sex, area of residence and whether or not the child lived within 1 mile of the smelter for the first two years of their life as explanatory (independent) variables.

c.2 Estimate / generate the model

area<-as.factor(area)
mlr<-lm(iq~lead72+age+sex+area+fst2yrs)
summary(mlr)

## 
## Call:
## lm(formula = iq ~ lead72 + age + sex + area + fst2yrs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.250  -7.682   0.273   7.297  43.453 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 100.62301    8.89041  11.318   <2e-16 ***
## lead72       -0.16372    0.10385  -1.577   0.1177    
## age          -0.02074    0.38907  -0.053   0.9576    
## sex           0.32337    2.73123   0.118   0.9060    
## area2         5.99455    2.92422   2.050   0.0426 *  
## fst2yrs      -4.49853    3.28436  -1.370   0.1735    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.36 on 115 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.0509, Adjusted R-squared:  0.009639 
## F-statistic: 1.234 on 5 and 115 DF,  p-value: 0.298

# To obtain the 95% confidence intervals: 
confint(mlr)

##                   2.5 %       97.5 %
## (Intercept)  83.0128270 118.23319568
## lead72       -0.3694167   0.04198475
## age          -0.7914055   0.74993517
## sex          -5.0866730   5.73340595
## area2         0.2022314  11.78686313
## fst2yrs     -11.0042249   2.00715558

c.3 Model checking – residual and model diagnostics

plot(mlr) provides diagnostic plots such as residual v/s fitted plot, QQ plots and fitted v/s standardized residual plot.

plot(mlr)

To plot density curve for residuals, install the MASS package and enter the below code:

# Install the MASS package before this step
library(MASS)
sresid <- studres(mlr)


hist(sresid, prob = T, main = "Distribution of Studentized Residuals")
xfit<- seq(min(sresid),max(sresid),length=40) 
yfit<-dnorm(xfit) 
lines(xfit, yfit, col = "dark blue", lwd = 2)

plot(predict(mlr), sresid, xlab = "Linear Predictions", ylab = "Residuals", col = "dark blue", pch = 19)
abline(h = 0, col = "red", lwd = 2)

The histograms show that the errors (residuals) are approximately normally distributed. The plot of errors versus fitted values does indicate that there may be some increase in residuals with increasing fitted values (a mild fan- or funnel shape). This may indicate heterogeneity of errors.

The adjusted R squared values is 0.0096. The model explains about 1% of the variability in IQ – this is not a good model!!!!!

c.4 Inference and interpretation

IQ does not seem to be associated with blood lead level. While IQ does decrease slightly with increasing blood lead level, this is not statistically significant. The 95% CI for the coefficient estimate includes 0 and the P value for the test that the coefficient is 0 is < 0.05 (t = -1.58, p = 0.12).

The only variable which appears to be statistically significantly associated with IQ is area. Adjusting for blood lead level, age, sex and whether or not the child lived within 1 mile of the smelter for the first two years of their life, children who live more than a mile from the smelter have on average 6 units higher IQ (95% CI 0.2, 11.8). We are 95% sure that the true (population) IQ lies between 0.2 and 11.8 units higher for children living more than one mile from the lead smelter. Since this interval does not include 0, we conclude that IQ is statistically significantly higher for children living more than one mile from the smelter, relative to those living within one mile of the smelter. The value of the t test statistics for the null hypothesis that the coefficient for area is 0, is 2.05, with a p value of 0.043. We conclude that IQ is statistically significantly associated with area, consistent with our interpretation of the confidence interval.

However given that the model only explains 1% of the variance in IQ, and that the graph of residuals versus predicted values indicates some heterogeneity of variance, these results need to be interpreted with caution. In particular we might want to explore other variables which could be potentially associated with IQ.

Analysis of Public Health Data

Continuous Measures

Tasks:

Solutions: