This manual was created by Catherine D’Este and updated by Tambri Housen and Alice Richardson. The manual was originally created for use with Stata data analysis software, the conversion to R was conducted by Nidhi Menon and Arminder Deol. The manual was created for the Australian Field Epidemiology Training Program - Masters of Philosophy (Applied Epidemiology), Australian National University. CRICOS Provider No. 00120C
Framework for analysis:
A. Statistical Analysis Plan
Descriptive statistics will be presented as mean and standard deviations for continuous variables. Exploratory data analysis will involve obtaining correlation coefficients for pairs of continuous explanatory variables (Pearson correlation if variables are approximately normally distributed or the non-parametric Spearman correlation coefficient if either of the variables are not normally distributed), ttests for the relationship between the continuous outcome and binary explanatory variables (or a non-parametric equivalent such as the Mann Whitney test if the outcome is not normally distributed) and ANOVA for the relationship between the continuous outcome and categorical variables (or the non-parametric Kruskal-Wallis test).
To adjust for potential confounders, in addition to blood lead level in 1972, age, sex, area of residence and whether or not the child lived within 1 mile of the smelter for the first two years of their life will then be included in a multiple linear regression model. Because blood lead level in 1973 was obtained at a later time to the other variables, this measure was not considered appropriate for inclusion in the model. It is also likely to be highly correlated with blood lead level in 1972. Finger wrist tapping score was also not considered as an explanatory variable as this is a neurological measure, and clinically / biologically not meaningful to include in this model.
Regression diagnostics were undertaken for the final model to assess the appropriateness of the assumptions and the fit of the model. Point estimates, 95% confidence intervals and significance tests are reported for coefficient estimates. A significance level of 5% was used for all analyses.
B. Exploratory data analysis
Firstly we should examine every variable to check for outliers and missing values, to determine the numbers of observations across categories for categorical variables and to assess the shape of the distribution for continuous variables.
There are 124 observations in the dataset, with 3 missing values for lead72, and only 99 observations have data for the finger-wrist tapping score (children aged under 5 do not have this assessment due to difficulties of administering the test in this age group).
Age is not normally distributed, but that is not a problem for these analyses; the other continuous variables are approximately normally distributed, although blood lead level in 1973 is a bit skewed to the right. We will need to check the distribution of the residuals after we have undertaken the regression.
# Create a new project and save the 'blood lead.csv' file in the project folder on your computer.
bl <- read.csv("blood lead.csv")
attach(bl) # by doing this, you no longer need to use the $ operator when specifying variables - e.g. "bl$area" can now be written as simply "area"
###
t1 <- table(area); print(t1)
## area
## 1 2
## 57 67
prop.table(t1)
## area
## 1 2
## 0.4596774 0.5403226
t2 <- table(sex); print(t2)
## sex
## 1 2
## 76 48
prop.table(t2)
## sex
## 1 2
## 0.6129032 0.3870968
t3 <- table(fst2yrs); print(t3)
## fst2yrs
## 1 2
## 35 89
prop.table(t3)
## fst2yrs
## 1 2
## 0.2822581 0.7177419
# Draw a histogram of age - simple method:
hist(age, prob=TRUE, breaks=10) # showing density instead of frequencies and split into 10 age bins rather than defaults
lines(density(age)) #to plot density curve
# Add a Normal Curve instead - assign the histogram to 'h'
h<-hist(age, xlab="Age in Years", breaks = 10, main="Histogram with Normal Curve") # prob = T for density
xfit<-seq(min(age),max(age),length=40)
yfit<-dnorm(xfit,mean=mean(age),sd=sd(age))
yfit <- yfit*diff(h$mids[1:2])*length(age)
lines(xfit, yfit, col="blue", lwd=2)
# Omit any missing values
x1 <- na.omit(lead72)
h<-hist(x1, xlab="Blood lead level in 1972", breaks = 10, main="Histogram with Normal Curve")# for density
xfit<-seq(min(x1),max(x1))
yfit<-dnorm(xfit,mean=mean(x1),sd=sd(x1))
yfit <- yfit*diff(h$mids[1:2])*length(x1)
lines(xfit, yfit, col="blue", lwd=2)
x1 <- na.omit(lead73) # Omits missing values
h<-hist(x1, xlab="Blood lead level in 1973", breaks = 10, main="Histogram with Normal Curve")# for density
xfit<-seq(min(x1),max(x1))
yfit<-dnorm(xfit,mean=mean(x1),sd=sd(x1))
yfit <- yfit*diff(h$mids[1:2])*length(x1)
lines(xfit, yfit, col="blue", lwd=2)
x1 <- na.omit(fwttest)
h<-hist(x1, xlab="fwttest", breaks = 10, main="Histogram with Normal Curve")# for density
xfit<-seq(min(x1),max(x1))
yfit<-dnorm(xfit,mean=mean(x1),sd=sd(x1))
yfit <- yfit*diff(h$mids[1:2])*length(x1)
lines(xfit, yfit, col="blue", lwd=2)
x1 <- na.omit(iq)
h<-hist(x1, xlab="IQ", breaks = 10, main="Histogram with Normal Curve")# for density
xfit<-seq(min(x1),max(x1))
yfit<-dnorm(xfit,mean=mean(x1),sd=sd(x1))
yfit <- yfit*diff(h$mids[1:2])*length(x1)
lines(xfit, yfit, col="blue", lwd=2)
Exploratory analysis involves investigating the relationship between each explanatory variable of interest and the outcome (iq). It does not make sense to consider blood lead levels in 72 and 73 in a model of IQ, as these variables are expected to be highly correlated (the correlation coefficient or 0.68 confirms this), so we are only using lead72. It also does not make sense to include finger wrist tapping score as an explanatory variable when investigating the relationship between IQ and blood lead level, so this variables has not been considered in the analyses. There is also likely to be a relationship between area of residence (whether the child lives within 1 mile of the smelter) and whether or not the child lived within 1 mile of the smelter for the first two years of their life; we would need to consider how to deal with this (considering only of these variables, examining potential interaction / effect modification); but for this example we have include both variables in the model.
#First install the Hmisc package in R before this step
library(Hmisc)
rcorr(as.matrix(bl), type = "pearson")
## id area sex fst2yrs age lead72 lead73 fwttest iq
## id 1.00 0.18 -0.01 0.08 -0.59 0.40 0.22 -0.17 -0.15
## area 0.18 1.00 -0.06 0.39 -0.19 0.04 -0.12 0.09 0.12
## sex -0.01 -0.06 1.00 0.06 0.18 -0.07 -0.04 0.07 -0.01
## fst2yrs 0.08 0.39 0.06 1.00 -0.10 -0.20 -0.24 0.04 -0.04
## age -0.59 -0.19 0.18 -0.10 1.00 -0.22 -0.15 0.64 0.02
## lead72 0.40 0.04 -0.07 -0.20 -0.22 1.00 0.68 -0.27 -0.11
## lead73 0.22 -0.12 -0.04 -0.24 -0.15 0.68 1.00 -0.32 -0.11
## fwttest -0.17 0.09 0.07 0.04 0.64 -0.27 -0.32 1.00 0.15
## iq -0.15 0.12 -0.01 -0.04 0.02 -0.11 -0.11 0.15 1.00
##
## n
## id area sex fst2yrs age lead72 lead73 fwttest iq
## id 124 124 124 124 124 121 124 99 124
## area 124 124 124 124 124 121 124 99 124
## sex 124 124 124 124 124 121 124 99 124
## fst2yrs 124 124 124 124 124 121 124 99 124
## age 124 124 124 124 124 121 124 99 124
## lead72 121 121 121 121 121 121 121 96 121
## lead73 124 124 124 124 124 121 124 99 124
## fwttest 99 99 99 99 99 96 99 99 99
## iq 124 124 124 124 124 121 124 99 124
##
## P
## id area sex fst2yrs age lead72 lead73 fwttest iq
## id 0.0434 0.9246 0.3542 0.0000 0.0000 0.0132 0.0973 0.1045
## area 0.0434 0.4780 0.0000 0.0305 0.6476 0.1954 0.3581 0.1793
## sex 0.9246 0.4780 0.5298 0.0418 0.4670 0.6838 0.4794 0.8800
## fst2yrs 0.3542 0.0000 0.5298 0.2882 0.0249 0.0070 0.7221 0.6576
## age 0.0000 0.0305 0.0418 0.2882 0.0165 0.0980 0.0000 0.8519
## lead72 0.0000 0.6476 0.4670 0.0249 0.0165 0.0000 0.0078 0.2139
## lead73 0.0132 0.1954 0.6838 0.0070 0.0980 0.0000 0.0014 0.2343
## fwttest 0.0973 0.3581 0.4794 0.7221 0.0000 0.0078 0.0014 0.1260
## iq 0.1045 0.1793 0.8800 0.6576 0.8519 0.2139 0.2343 0.1260
# draw scatterplots if IQ vs lead71 and IQ vs age, and label the axes
plot(lead72, iq, xlab = "blood lead level in 1972", ylab = "IQ score", col = "dark blue", pch = 19, main = "Scatter plot")
plot(age, iq, xlab = "Age in years", ylab = "IQ score", col = "dark blue", pch = 19, main = "Scatter plot")
#ttest
t.test(iq~sex, var.equal = T)
##
## Two Sample t-test
##
## data: iq by sex
## t = 0.15134, df = 122, p-value = 0.88
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.874543 5.681560
## sample estimates:
## mean in group 1 mean in group 2
## 91.23684 90.83333
t.test(iq~area, var.equal = T)
##
## Two Sample t-test
##
## data: iq by area
## t = -1.3505, df = 122, p-value = 0.1793
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.614465 1.627295
## sample estimates:
## mean in group 1 mean in group 2
## 89.19298 92.68657
# MLR
area <- as.factor(area)
mlr <- lm(iq ~ lead72 + age + sex+ area + fst2yrs)
summary(mlr)
##
## Call:
## lm(formula = iq ~ lead72 + age + sex + area + fst2yrs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.250 -7.682 0.273 7.297 43.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.62301 8.89041 11.318 <2e-16 ***
## lead72 -0.16372 0.10385 -1.577 0.1177
## age -0.02074 0.38907 -0.053 0.9576
## sex 0.32337 2.73123 0.118 0.9060
## area2 5.99455 2.92422 2.050 0.0426 *
## fst2yrs -4.49853 3.28436 -1.370 0.1735
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.36 on 115 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.0509, Adjusted R-squared: 0.009639
## F-statistic: 1.234 on 5 and 115 DF, p-value: 0.298
#diagnostics
plot(mlr)
library(MASS)
sresid <- studres(mlr)
hist(sresid, prob = T, main = "Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit, col = "dark blue", lwd = 2)
plot(predict(mlr), sresid, xlab = "Linear Predictions", ylab = "Residuals", col = "dark blue", pch = 19)
abline(h = 0, col = "red", lwd = 2)
The scatter plots and correlation coefficients demonstrate a slight decrease in IQ with increasing blood lead level, but no relationship between IQ and age (this seems logical).
The t.test( ) function produces a variety of t-tests. Unlike most statistical packages, the default assumes unequal variance and applies the Welsh df modification. You can use the var.equal = TRUE option to specify equal variances and a pooled variance estimate. You can use the alternative="less" or alternative="greater" option to specify a one tailed test.
The results of the t tests do not indicate a relationship between IQ and sex, area of residence and whether the child lived within 1 mile of the smelter for the first two years (output not shown).
C. Multiple regression model:
c.1 Specify the model
We will generate a multiple linear regression model, with IQ as the outcome (dependent variable) and blood lead level in 1972, age, sex, area of residence and whether or not the child lived within 1 mile of the smelter for the first two years of their life as explanatory (independent) variables.
c.2 Estimate / generate the model
area<-as.factor(area)
mlr<-lm(iq~lead72+age+sex+area+fst2yrs)
summary(mlr)
##
## Call:
## lm(formula = iq ~ lead72 + age + sex + area + fst2yrs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.250 -7.682 0.273 7.297 43.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.62301 8.89041 11.318 <2e-16 ***
## lead72 -0.16372 0.10385 -1.577 0.1177
## age -0.02074 0.38907 -0.053 0.9576
## sex 0.32337 2.73123 0.118 0.9060
## area2 5.99455 2.92422 2.050 0.0426 *
## fst2yrs -4.49853 3.28436 -1.370 0.1735
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.36 on 115 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.0509, Adjusted R-squared: 0.009639
## F-statistic: 1.234 on 5 and 115 DF, p-value: 0.298
# To obtain the 95% confidence intervals:
confint(mlr)
## 2.5 % 97.5 %
## (Intercept) 83.0128270 118.23319568
## lead72 -0.3694167 0.04198475
## age -0.7914055 0.74993517
## sex -5.0866730 5.73340595
## area2 0.2022314 11.78686313
## fst2yrs -11.0042249 2.00715558
c.3 Model checking – residual and model diagnostics
plot(mlr) provides diagnostic plots such as residual v/s fitted plot, QQ plots and fitted v/s standardized residual plot.
plot(mlr)
To plot density curve for residuals, install the MASS package and enter the below code:
# Install the MASS package before this step
library(MASS)
sresid <- studres(mlr)
hist(sresid, prob = T, main = "Distribution of Studentized Residuals")
xfit<- seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit, col = "dark blue", lwd = 2)
plot(predict(mlr), sresid, xlab = "Linear Predictions", ylab = "Residuals", col = "dark blue", pch = 19)
abline(h = 0, col = "red", lwd = 2)
The histograms show that the errors (residuals) are approximately normally distributed. The plot of errors versus fitted values does indicate that there may be some increase in residuals with increasing fitted values (a mild fan- or funnel shape). This may indicate heterogeneity of errors.
The adjusted R squared values is 0.0096. The model explains about 1% of the variability in IQ – this is not a good model!!!!!
c.4 Inference and interpretation
IQ does not seem to be associated with blood lead level. While IQ does decrease slightly with increasing blood lead level, this is not statistically significant. The 95% CI for the coefficient estimate includes 0 and the P value for the test that the coefficient is 0 is < 0.05 (t = -1.58, p = 0.12).
The only variable which appears to be statistically significantly associated with IQ is area. Adjusting for blood lead level, age, sex and whether or not the child lived within 1 mile of the smelter for the first two years of their life, children who live more than a mile from the smelter have on average 6 units higher IQ (95% CI 0.2, 11.8). We are 95% sure that the true (population) IQ lies between 0.2 and 11.8 units higher for children living more than one mile from the lead smelter. Since this interval does not include 0, we conclude that IQ is statistically significantly higher for children living more than one mile from the smelter, relative to those living within one mile of the smelter. The value of the t test statistics for the null hypothesis that the coefficient for area is 0, is 2.05, with a p value of 0.043. We conclude that IQ is statistically significantly associated with area, consistent with our interpretation of the confidence interval.
However given that the model only explains 1% of the variance in IQ, and that the graph of residuals versus predicted values indicates some heterogeneity of variance, these results need to be interpreted with caution. In particular we might want to explore other variables which could be potentially associated with IQ.