This session, we will explore the relationship between testosterone (the male sex hormone) and body mass index (BMI) in a national US health survey (NHANES).
This will allow us to explore the need for data transformation and the interpretation of regression coefficients that involve a transformation. This may be a little bit more mathematically challenging. Right skewed data occur very commonly in epidemiology, so being confident in how to deal with this situation is particularly important.
The NHANES data is available by clicking File –> Dataset Examples –> Future-Learn –> nhanes_2000 –> Select Set
iNZight is available here.
Today, we will aim to detect data that requires:
Warning, a bit of algebra is required!
We will look at a situation where we have a right skewed continuous variable, such as the concentration of testosterone, the male sex hormone. It is well known that testosterone has an inverse association with obesity, which is likely to be related to fat tissue converting testosterone to oestradiol (a female sex hormone). Whether testosterone replacement is useful to reverse obesity is now the subject of randomised controlled trials.
We will skip the data checking steps here, but suffice to say that they are necessary with any new dataset that you start to analyse.
In the Visualise tab, select Testosterone.
It is well recognised that right skewed data can often be transformed to something resembling normality using a \(\text{log}_{e}\) or \(ln\) transformation. In fact, a logarithm of any base can be used, but base \(e\) is the common choice, since it is relatively straightforward to interpret.
This is because \(e^{x} \approx 1 + x\) for small values of \(x\). For example, if the \(\beta\) coefficient of a model with a \(\text{log}_e\) transformed outcome is 0.05, then \(e^{0.05} \approx 1.05\). This means that if the exposure (\(x\)) increases by one unit, the average or expected increase in the outcome (\(y\)) is approximately 5%.
A plot is shown below of \(x\) vs. \(e^x - 1\). Note that for small values of \(x\) (\(-0.2 \leq x \leq 0.2\)), the value of \(e^x - 1\) is almost identical to \(x\).
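The same comparison can be tabulated numerically. The short Python sketch below is only an illustration (the function name `compare_with_expm1` is made up for this example); it prints \(x\), \(e^x - 1\) and the gap between them over the range discussed above.

```python
import math

def compare_with_expm1(xs):
    """Return (x, e^x - 1, difference) for each x in xs."""
    rows = []
    for x in xs:
        exact = math.expm1(x)  # e^x - 1, computed accurately for small x
        rows.append((x, exact, exact - x))
    return rows

for x, exact, diff in compare_with_expm1([-0.2, -0.1, -0.05, 0.05, 0.1, 0.2]):
    print(f"x = {x:+.2f}   e^x - 1 = {exact:+.4f}   difference = {diff:+.4f}")
```

The gap is about 0.001 at \(x = 0.05\) and grows to roughly 0.02 at \(x = \pm 0.2\), which is why the shortcut is only used for small coefficients.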
Create a new variable called log.Testosterone by selecting Manipulate variables –> Numeric variables. Select Transform variables, then Testosterone under Select Columns and LOG(e) under Select Transformation.
Observe the distribution of the new variable.
Here, the data is more symmetric once it is log transformed, although it still has a strange bimodal (two peaks) distribution.
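To see the effect of a \(\text{log}_e\) transformation outside iNZight, here is a minimal Python sketch. It does not use the NHANES data: the values are synthetic log-normal draws with a fixed seed, chosen only because they are right skewed, and the `skewness` helper is written for this example. Because the synthetic data has a single peak, it will not reproduce the bimodal shape seen above.

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness: the mean cubed z-score (close to 0 for symmetric data)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Synthetic right skewed data (NOT NHANES): log-normal draws with a fixed seed.
rng = random.Random(1)
raw = [rng.lognormvariate(1.5, 0.8) for _ in range(5000)]

# The log_e transformation, equivalent to LOG(e) in the iNZight menu.
logged = [math.log(v) for v in raw]

print(f"skewness of raw values:   {skewness(raw):.2f}")
print(f"skewness of log_e values: {skewness(logged):.2f}")
```

A skewness well above zero indicates a long right tail; after the transformation the value should sit close to zero.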
Consider the relationship between log.Testosterone (outcome - vertical axis) and BMI (exposure - horizontal axis).
Add a smoother and linear regression line using Add to plot.
What do you notice? In what direction is the association?
Do you think it may be worth restricting our analysis in some way?
Let’s filter to only include men over the age of 20 years, to avoid the issue of gender differences and the role of puberty.
These are both potential confounders, where restriction makes more sense than stratification for example.
Use Dataset –> Filter dataset to restrict the population to men only and those aged 20 years or more.
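For readers who want to see the restriction written out, the sketch below applies the same two conditions in Python. The records and the value labels (`"male"`, `"female"`) are hypothetical and made up for illustration; only the column names `Gender`, `Age` and `BMI` come from the text.

```python
# Hypothetical records: column names follow the text, the values are made up.
records = [
    {"Gender": "male", "Age": 34, "BMI": 27.1},
    {"Gender": "female", "Age": 45, "BMI": 24.3},
    {"Gender": "male", "Age": 16, "BMI": 21.0},
    {"Gender": "male", "Age": 20, "BMI": 30.2},
]

def restrict_to_adult_men(rows, min_age=20):
    """Keep only rows for men aged min_age years or more."""
    return [r for r in rows if r["Gender"] == "male" and r["Age"] >= min_age]

adult_men = restrict_to_adult_men(records)
print(len(adult_men), "of", len(records), "records kept")
```

Restriction removes the variation in sex and pubertal status altogether, so neither can confound the association in the remaining data.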
Consider the relationship between log.Testosterone and BMI.
How would you describe the strength of association? (hint: the slope, i.e. the \(\beta\) coefficient, is important here)
Is the association statistically significant? (hint: consider the \(p\)-value)
Adjust for the potential confounder Age. Does this change the relationship at all?
I get a \(p\)-value of <2e-16, which is highly significant (<0.001).
Take a look at the Inference tab and interpret the meaning of the slope and \(y\)-intercept.
The output I’ve obtained is a slope of -0.035. This means that for every one unit increase in BMI, there is a 3.4% decline in testosterone concentration (\(1 - e^{-0.035} \approx 0.034\)).
The interpretation of the slope term is complicated by the log-transformation that we’ve made.
In order to interpret the \(\beta\) coefficient, we need to exponentiate it. The result is described as an exponentiated slope or \(\beta\) coefficient and has a similar interpretation to a risk ratio: it describes the proportional increase (or decrease) in the average value of testosterone, given a one unit change in body mass index. For example, an exponentiated \(\beta\) coefficient of 1.2 means a 20% increase in testosterone concentration for a one unit increase in body mass index. Conversely, an exponentiated \(\beta\) coefficient of 0.8 would mean that testosterone concentration declines by 20% (1 - 0.8 = 0.2 or 20%) for every unit increase in body mass index.
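The conversion from a \(\beta\) coefficient on the \(\text{log}_e\) scale to a percentage change can be sketched in a few lines of Python. The function name `percent_change_per_unit` is made up for this illustration; the value -0.035 is the slope quoted in this session.

```python
import math

def percent_change_per_unit(beta):
    """Percentage change in the outcome for a one unit increase in the
    exposure, given a slope beta fitted to a log_e transformed outcome."""
    return math.expm1(beta) * 100  # (e^beta - 1) * 100

# Slope from the log.Testosterone ~ BMI model discussed in the text.
print(f"beta = -0.035 -> {percent_change_per_unit(-0.035):+.1f}% per unit of BMI")

# Exponentiated coefficients of 1.2 and 0.8 correspond to +20% and -20%.
for ratio in (1.2, 0.8):
    print(f"exp(beta) = {ratio} -> {percent_change_per_unit(math.log(ratio)):+.0f}%")
```

Note that the exact figure for -0.035 is a 3.4% decline, slightly smaller than the 3.5% suggested by reading the coefficient directly.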
By taking \(\frac{ln(2)}{\beta_1}\), we can estimate the change in BMI required to, on average, double the value of the outcome (testosterone concentration).
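Here, for the slope of -0.035, \(\frac{ln(2)}{-0.035} = -19.8 \text{ kg}/\text{m}^2\). The negative sign means that a decrease in BMI of \(19.8 \text{ kg}/\text{m}^2\) would double testosterone; equivalently, an increase of \(19.8 \text{ kg}/\text{m}^2\) would halve it. For a man of average height (1.69 m), the corresponding change in weight would be \(19.8 \times 1.69^2 \approx 57 \text{ kg}\)!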
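The reasoning behind this formula is short: the model is \(ln(y) = \beta_0 + \beta_1 x\), so changing \(x\) by \(\Delta\) multiplies \(y\) by \(e^{\beta_1 \Delta}\), and setting that factor to 2 gives \(\Delta = \frac{ln(2)}{\beta_1}\). The Python sketch below repeats the arithmetic; the function names are made up for this illustration, and the slope and height are the values quoted above.

```python
import math

def bmi_change_for_fold(beta, fold=2.0):
    """Change in BMI that multiplies the outcome by `fold`,
    for a slope beta on the log_e scale: ln(fold) / beta."""
    return math.log(fold) / beta

def weight_change_kg(bmi_change, height_m):
    """Convert a change in BMI (kg/m^2) to a change in weight (kg)."""
    return bmi_change * height_m ** 2

beta = -0.035                                 # slope quoted in the text
to_double = bmi_change_for_fold(beta, 2.0)    # negative: BMI must fall
to_halve = bmi_change_for_fold(beta, 0.5)     # positive: BMI must rise

print(f"BMI change to double testosterone: {to_double:+.1f} kg/m^2")
print(f"BMI change to halve testosterone:  {to_halve:+.1f} kg/m^2")
print(f"Weight change at 1.69 m: {weight_change_kg(to_halve, 1.69):.1f} kg")
```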
What is the meaning of the P-value associated with the \(\beta\) coefficient for the slope?
Now adjust for Age (potential confounder) and note the change to the slope term (\(\beta\) coefficient) for BMI.
Go to Dataset –> Restore data.
Then restrict the data to subjects over 20 years of age.
Consider the relationship between systolic blood pressure (BPSysAve - outcome) and Age (exposure).
Interpret the scatterplot by adding a linear regression line and moving average. Interpret the model. Adjust for Gender as a potential confounder. How does this change the interpretation?
For every one year increase in Age, there is an average increase of 0.424 mmHg in systolic blood pressure.
The association between systolic blood pressure and age is highly statistically significant (\(P\) < 0.001). To a lay audience, I would state “it is unlikely to be explained by chance”.
With the addition of Gender, the \(\beta\) coefficient or slope for Age increases very slightly to 0.426 mmHg per year. So, after adjusting for gender, the average increase in systolic blood pressure is 0.426 mmHg for every one year increase in age. This is less than a 10% change in the coefficient, so gender is not confounding the relationship between age and systolic blood pressure.
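The 10% comparison can be checked with a quick calculation. The sketch below is illustrative only: the function name is made up, the two coefficients are the ones reported above, and the 10% threshold is the rule of thumb used in this session.

```python
def percent_change_in_coefficient(crude, adjusted):
    """Relative change (%) in a slope after adding a potential confounder."""
    return abs(adjusted - crude) / abs(crude) * 100

crude_slope = 0.424      # Age only (mmHg per year)
adjusted_slope = 0.426   # Age adjusted for Gender (mmHg per year)

change = percent_change_in_coefficient(crude_slope, adjusted_slope)
print(f"Change in the Age coefficient: {change:.2f}%")
print("Meaningful confounding by the 10% rule of thumb?", change >= 10)
```

Here the coefficient changes by about half of one percent, far below the 10% threshold.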
Note, since systolic blood pressure is approximately symmetrically distributed, it does not require a \(\text{log}_e\) transformation.