This session, we will continue to explore the cot-death dataset and learn more about doing practical regression analysis.
Remember, here we are using regression to explore the statistical nature of the association between numeric or categorical exposures and numeric outcomes. As well as describing that association, regression allows us to adjust for possible confounding.
The file is available at the following link.
This is a case-control study that considers the association between various risk factors and cot death. Our aim is to stop the epidemic, and possibly learn some epidemiology along the way!
Today we will be thinking about using regression to adjust for confounding.
Sometimes, when we are considering the association between an exposure and a disease, a third variable (a confounder) may distort the relationship between them and give us inaccurate results. This is a problem for all observational study designs, since only trials deal with it, through randomisation.
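To make the idea of a distorting third variable concrete, here is a minimal sketch in Python. It is not part of the iNZight workflow, and every number in it is invented for illustration: the names `exposure`, `confounder` and `outcome` are hypothetical and do not come from the cot-death dataset. It compares the slope from a regression on the exposure alone with the slope from a regression that also includes the confounder.

```python
# Synthetic illustration only: every number below is invented for the sketch
# and none of it comes from the cot-death dataset.

def sum_cross(a, b):
    """Sum of cross-products of a and b about their means."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))


def crude_slope(x, y):
    """Least-squares slope of y on x alone."""
    return sum_cross(x, y) / sum_cross(x, x)


def adjusted_slope(x, z, y):
    """Least-squares slope for x in the two-predictor model y ~ x + z."""
    sxx, szz, sxz = sum_cross(x, x), sum_cross(z, z), sum_cross(x, z)
    sxy, szy = sum_cross(x, y), sum_cross(z, y)
    return (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)


# Hypothetical 0/1 confounder that raises both the exposure and the outcome.
confounder = [0] * 5 + [1] * 5
exposure = [i + 2 * z for z in (0, 1) for i in range(5)]
# By construction the true effect of the exposure is 0.5 per unit,
# and the confounder adds 3 to the outcome.
outcome = [0.5 * x + 3 * z for x, z in zip(exposure, confounder)]

print("crude slope:   ", crude_slope(exposure, outcome))
print("adjusted slope:", adjusted_slope(exposure, confounder, outcome))
```

In this made-up example the crude slope is about 1.0, twice the effect built into the data, while the adjusted slope recovers 0.5. Real data are noisier, so the shift after adjustment is rarely this clean.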
Today, we will think about factors associated with preterm birth or gestation.
Premature infants are those that have a gestation of less than 37 weeks. These infants generally have worse outcomes than infants that are carried to term (37 weeks or more).
Which risk factors could we consider?
Let’s think about cigarette smoking. Does this addiction lead to premature birth?
Select Gestation and Mother_smoke
Have a look at the plot. Is there a convincing difference in the distribution?
Is the distribution symmetric? In what direction is the skew?
Do we need to transform the variable?
This is an advanced topic. As it turns out, a transformation could be justified, but we’ll ignore this and continue to analyse the data under an assumption of normality. If you are dealing with a skewed variable, it is best to leave the ‘equal variance’ option unchecked when doing t-tests. Often this doesn’t make a huge difference, but leaving it unchecked doesn’t hurt.
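A t-test that does not assume equal variances is usually called Welch's t-test. The sketch below shows the arithmetic behind it using only the Python standard library. The two lists of gestation values are hypothetical, and the function name `welch_t` is our own. It returns the t statistic and the Welch-Satterthwaite degrees of freedom. It does not compute a p-value, which would need a t-distribution table or a statistics package.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error of each group mean, using the sample variance.
    se2_a = statistics.variance(group_a) / n_a
    se2_b = statistics.variance(group_b) / n_b
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical gestation values in weeks (not taken from the dataset).
non_smokers = [38, 39, 40, 41, 42]
smokers = [37, 38, 39, 40, 41]

t_stat, dof = welch_t(non_smokers, smokers)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

When the two groups have similar variances and sizes, as in this toy example, Welch's test and the equal-variance test give nearly the same answer, which is why the choice often makes little difference.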
Click the inference information.
What is the mean gestation in mothers who smoke and in those who do not?
Is maternal smoking likely to be an important factor in determining the gestation of an infant?
What is the difference between clinical significance and statistical significance?
Clinical significance is a slippery concept. One definition from psychology is when “once troubled and disordered clients are now, after treatment, not distinguishable from a meaningful and representative nondisturbed reference group”. This is useful for continuous variables. To assess the importance from a clinical point of view requires a definition of what is considered ‘normal’. Also, how far from normal is abnormal? Is it one standard deviation or two from normal?
If we consider the overall group here, the standard deviation is 2.1 weeks. The mean difference between the two groups is less than half a week, which is much smaller than the standard deviation, so it is not likely to be clinically meaningful, even though it is unlikely to be due to chance.
For categorical data, another way of assessing this is the absolute risk reduction, or risk difference. The reciprocal of the risk difference is known as the number needed to treat, which describes how many people need to be treated to prevent one case. You will see the risk difference information presented in the epidemiology output in iNZight and iNZight lite.
Remember that the number needed to treat statistic entails a causal assumption, so this needs to be considered every time you use this statistic.
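The arithmetic behind the risk difference and the number needed to treat is simple enough to sketch in a few lines of Python. The counts used here are hypothetical and the function names are our own; iNZight reports the risk difference for you.

```python
def risk_difference(events_a, total_a, events_b, total_b):
    """Risk in group A minus risk in group B."""
    return events_a / total_a - events_b / total_b


def number_needed_to_treat(risk_diff):
    """Reciprocal of the absolute risk difference."""
    if risk_diff == 0:
        raise ValueError("risk difference is zero, so the NNT is undefined")
    return 1 / abs(risk_diff)


# Hypothetical counts: 30 of 100 untreated people develop the outcome,
# compared with 20 of 100 treated people.
rd = risk_difference(30, 100, 20, 100)
print(f"risk difference = {rd:.2f}")
print(f"number needed to treat = {number_needed_to_treat(rd):.0f}")
```

With these made-up counts the risk difference is 0.10, so about 10 people would need to be treated to prevent one case, provided the causal assumption holds.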
Ok, now we want to investigate whether the evidence indicates that maternal age has any influence on the gestation of the infant. We notice that, if we just select Gestation under the Visualise tab, the distribution of Gestation is rather skewed. How can we make the distribution more symmetric? Well, there is such a thing as a Box-Cox transformation.
This involves using sophisticated software, but we can use an online version here. All you need to do is select and copy the data from Excel (this will be all the gestation values, with no missing values) into the Data box of the webpage. Click Compute at the bottom of the webpage.
Here, the optimal value for lambda is 10. Hence we need to raise Gestation to the power of 10.
We can do this in iNZight by going to Manipulate variables, then Create Variables.
We select Gestation, click on the “^” operator and click “1”, then “0”. We can call our new variable something like Gestation_power_10.
View the histogram and confirm that the variable is now symmetric.
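For readers curious about what a Box-Cox calculator does behind the scenes, the following is a rough sketch in standard-library Python. It evaluates the Box-Cox profile log-likelihood over a grid of candidate lambda values and keeps the best one. The function names, the grid and the toy data are all assumptions made for illustration; this is not the code behind the web page, and it is not run on the gestation data.

```python
import math
import statistics


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def profile_log_likelihood(values, lam):
    """Box-Cox profile log-likelihood, up to an additive constant."""
    n = len(values)
    # Maximum-likelihood (population) variance of the transformed data.
    variance = statistics.pvariance(box_cox(values, lam))
    return -0.5 * n * math.log(variance) + (lam - 1) * sum(math.log(v) for v in values)


def best_lambda(values, candidates):
    """Candidate lambda with the highest profile log-likelihood."""
    return max(candidates, key=lambda lam: profile_log_likelihood(values, lam))


# Toy right-skewed data whose logarithms are symmetric about zero,
# so a log transform (lambda = 0) should be preferred.
toy = [math.exp(z / 10) for z in range(-10, 11)]
grid = [k / 2 for k in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
print("best lambda on the grid:", best_lambda(toy, grid))
```

Once a lambda has been chosen, creating the new variable is just a power: for example, `[g ** 10 for g in gestation]` would mirror Gestation_power_10, where `gestation` is a hypothetical list of the gestation values. The Box-Cox form subtracts 1 and divides by lambda, which only shifts and rescales the values, so the shape of the distribution is the same as for the plain power.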
Select Gestation_power_10 and Mother_age. Compare this plot with Gestation and Mother_age.
Use Add To Plot to add a linear regression line and smoother.
Interpret the nature of the association between Mother age and Gestation.
Do a regression (using Advanced –> Model Fitting) to estimate the slope of the relationship between the transformation of Gestation and Mother_age.
What does the beta coefficient or slope in the regression model mean?
Remember the model is now
\[ \text{Gestation}^{10} = \beta_0 + \text{Mother_age} * \beta_1 \]
That means the relationship between Gestation and the line is governed by the equation:
\[ \text{Gestation} = (\beta_0 + \text{Mother_age} * \beta_1)^\frac{1}{10} \]
That means if \(\beta_0\) = 8.508e+15 and the Mother_age coefficient \(\beta_1\) = 4.796e+13, then a one-unit increase in Mother_age (from zero to one year) increases Gestation by:
\[\begin{align} \delta \text{Gestation} &= (8.508* 10^{15} + 1*4.796 * 10^{13})^\frac{1}{10} - (8.508* 10^{15}+ 0*4.796 * 10^{13})^\frac{1}{10} \\ \delta \text{Gestation} &= (8.508* 10^{15} + 4.796 * 10^{13})^\frac{1}{10} - (8.508* 10^{15})^\frac{1}{10}\\ \delta \text{Gestation} &= 0.022 \end{align}\]
This means that each additional year of age results in a 0.022-week increase in gestation, which translates to about 3.7 hours per year.
This is the change in gestation for a one-unit change in maternal age (years) between the ages of zero and one year.
This is unrealistic, as no mothers are between zero and one year old. The mean age of the women is 27 years (Select variables –> Mother_age –> Summary). Let’s look at a one-unit change at this level.
\[\begin{align} \delta \text{Gestation} &= (8.508* 10^{15} + 28*4.796 * 10^{13})^\frac{1}{10} - (8.508* 10^{15}+ 27*4.796 * 10^{13})^\frac{1}{10} \\ \delta \text{Gestation} &= 0.019 \\ \end{align}\]
This is very close to the estimate that we made at the unrealistically low figure and shows that the relationship is approximately linear.
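The back-transformation above can be checked with a short Python sketch. The coefficients are the ones quoted in the text; the function names are our own. It predicts gestation from the fitted line on the power-10 scale, takes the tenth root, and then differences the predictions one year apart at any chosen maternal age.

```python
BETA_0 = 8.508e15  # intercept quoted in the text (power-10 scale)
BETA_1 = 4.796e13  # Mother_age slope quoted in the text (power-10 scale)
POWER = 10         # Gestation was raised to this power before fitting
HOURS_PER_WEEK = 7 * 24


def predicted_gestation(mother_age):
    """Back-transform the fitted line to gestation in weeks."""
    return (BETA_0 + BETA_1 * mother_age) ** (1 / POWER)


def change_per_year(mother_age):
    """Change in predicted gestation (weeks) for one extra year of maternal age."""
    return predicted_gestation(mother_age + 1) - predicted_gestation(mother_age)


for age in (0, 27):
    weeks = change_per_year(age)
    print(f"age {age}: {weeks:.3f} weeks ({weeks * HOURS_PER_WEEK:.1f} hours) per year")
```

Trying a few other ages in the loop is a quick way to see how slowly the per-year change drifts across the realistic range of maternal ages.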
A figure is given below, which shows the nature of the regression relationship.
Note that using the untransformed variable as the outcome leads to an estimated average increase of 0.03 weeks, which translates to about five hours of gestation per year of maternal age. So, the transformation has made some difference to the estimate.
Now try adjusting for smoking status. What happens to the association between Gestation and age?
Birth weight is a little simpler, since it does not need transforming. It has a symmetric distribution. Estimate the nature of the association between birth weight and ethnicity. Adjust for cigarette smoking and sex of the child. Explain the results to a lay person.
Estimate the nature of the association between maternal age and birth weight. Adjust for cigarette smoking. What effect does this have on the association between the first two variables? Is there evidence for confounding?