Further exploration of regression
This session, we will explore the relationship between serum testosterone concentration (testosterone being the male sex hormone) and body mass index (BMI) in a national US health survey (NHANES).
This will allow us to explore the need for data transformation and interpretation of regression variables that involve transformation. This may be a little bit more mathematically challenging. Right skewed data occur very commonly in epidemiology, so being confident in how to deal with this situation is particularly important.
The NHANES data is available by clicking File --> Dataset Examples --> Future-Learn --> nhanes_2000 --> Select Set
iNZight is available here.
Aims
Today, we will aim to:
- analyse data that requires transformation, and
- interpret models that have log-transformed outcomes.
Warning, a bit of algebra is required! 🤓
Testosterone concentration and body mass
We will look at a situation where we have a right-skewed continuous variable, such as testosterone, the male sex hormone. It is well known that testosterone has an inverse association with obesity, which is likely to be related to fat tissue converting testosterone to oestradiol (a female sex hormone). Whether testosterone replacement is useful to reverse obesity is now the subject of a randomised controlled trial.
We will skip the data checking stuff, but suffice to say that this is necessary with any new dataset that you start to analyse.
In the Visualise tab, select Testosterone.
- What do you notice about the distribution of the data? The data is
- In which direction is the data skewed?
It is well recognised that right-skewed data can often be transformed to something resembling normality using a \(\text{log}_{e}\) or \(\ln\) transformation. In fact, a log transformation of any base can be used, but base \(e\) is the most common choice, since it is relatively straightforward to interpret.
This is because \(e^{x} \approx 1 + x\) for small values of \(x\). That means, if a \(\beta\) coefficient of a model with a \(\text{log}_e\) transformed outcome is for example, 0.05, that can be interpreted as \(e^{0.05} \approx 1.05\). This means, if the exposure or \(x\) increases by one unit, the average or expected increase in \(y\) or the outcome is 5%.
A plot is shown below of \(x\) vs. \(e^x - 1\). Note that for small values of \(x\) (\(-0.2 \leq x \leq 0.2\)), the value of \(e^x - 1\) is almost identical to \(x\).
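The approximation can also be checked numerically. The short Python sketch below is not part of the iNZight workflow; the function name `approx_error` is made up for illustration. It prints \(e^x - 1\) next to \(x\) so you can see how the gap grows as \(x\) moves away from zero.

```python
import math

def approx_error(x: float) -> float:
    """Absolute gap between the exact value e^x - 1 and the shortcut x."""
    return abs(math.expm1(x) - x)

# Small slopes (|x| <= 0.2) give a small gap; larger ones do not.
for x in [-0.5, -0.2, -0.05, 0.05, 0.2, 0.5]:
    exact = math.expm1(x)  # e^x - 1, computed accurately for small x
    print(f"x = {x:+.2f}   e^x - 1 = {exact:+.4f}   gap = {approx_error(x):.4f}")
```

For slopes outside roughly \(\pm 0.2\), it is safer to exponentiate the coefficient than to read it directly as a percentage.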
Let's do a few questions to check our understanding.
The main reason for transforming data is to
Usually, a log base \(e\) transformation is used for data.
When the outcome variable is \(\text{log}_e\) transformed, the interpretation of the slope or \(\beta\) coefficient is complicated a little. Rather than being the average change in \(y\) for a one unit change in \(x\), it is approximately the proportional (percentage) change in \(y\) for a one unit change in \(x\).
Transforming continuous data
Create a new variable called log.Testosterone by selecting Manipulate variables --> Numeric variables. Select Transform variables and Testosterone under Select Columns and LOG(e) under Select Transformation.
Observe the distribution of the new variable by using the Visualise tab.
Here, the data is more symmetric once it is log transformed, although the distribution is still unusual, with two peaks (bimodal).
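To see why a log transformation pulls in a long right tail, here is a minimal Python sketch. The values in `raw` are made up for illustration and are not NHANES data, and `skewness` is a simple moment-based helper written for this example.

```python
import math

def skewness(values):
    """Moment-based sample skewness: positive means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed measurements (NOT taken from nhanes_2000).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
logged = [math.log(v) for v in raw]  # natural log, as with LOG(e) in iNZight

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

The skewness of the logged values is much closer to zero, which is what "more symmetric" means in practice.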
Consider the relationship between log.Testosterone (outcome - vertical axis; first variable) and BMI (exposure - horizontal axis; second variable).
Add a smoother and linear regression line using Add to plot.
Add a third variable Gender.
What do you think the two clouds of points in males represent?
What do you notice? In what direction is the association for males?
Do you think it may be worth restricting our analysis in some way to simplify the analysis?
- by age? Yes, testosterone is likely to be lower in children who have not yet reached puberty.
- by gender? Yes, testosterone is likely to be lower in women than in men!
Let's filter to only include men over the age of 20 years, to avoid the issue of gender differences and puberty.
These are both potential confounders, where restriction makes more sense than stratification.
Use Dataset --> Filter Dataset to restrict the population to men only and those aged 20 years or more.
Age restriction
To apply the Age restriction, use Dataset --> Filter Dataset --> Select Filter to apply --> numeric condition.
First, select the variable Age --> Select a condition --> >= --> Provide a numeric value to test for --> input 20 --> PERFORM OPERATION.
Gender restriction
To select males only, use Dataset --> Filter Dataset --> Select Filter to apply --> levels of categorical variable --> Select the variable Gender --> Select levels to include --> male --> PERFORM OPERATION.
This will restrict the dataset to subjects 20 years or more who are male.
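For readers who like to see the logic written out, the two filters combine into a single condition. The Python sketch below uses a handful of made-up records (not the real nhanes_2000 data); the column names Age and Gender follow the dataset, while `restrict` is a hypothetical helper.

```python
# Hypothetical records for illustration only; the real data are in nhanes_2000.
subjects = [
    {"Age": 12, "Gender": "male"},
    {"Age": 34, "Gender": "female"},
    {"Age": 20, "Gender": "male"},
    {"Age": 57, "Gender": "male"},
]

def restrict(rows):
    """Keep only males aged 20 years or more (Age >= 20 AND Gender == male)."""
    return [r for r in rows if r["Age"] >= 20 and r["Gender"] == "male"]

restricted = restrict(subjects)
print(len(restricted), "of", len(subjects), "subjects remain")
```

Note that the age condition is "greater than or equal to", so a 20-year-old is kept.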
Time to analyse
The distribution of log.Testosterone is now
The data set should now be restricted to 708 subjects.
Now, again, consider the relationship between log.Testosterone and BMI in the plot.
Don't forget to add a regression line and smoother.
What is the slope of the line?
Under the Inference tab, the \(\beta\) coefficient or slope is given by the
How would you describe the strength of association between BMI and testosterone? (hint: the \(\beta\) coefficient is important here)
Is the association statistically significant? (hint: consider the \(P\)-value) I get <2e-16, which is highly significant (<0.001).
Take a look at the Inference tab and interpret the meaning of the slope and \(y\)-intercept.
The output I've obtained is a value of -0.035 for the slope.
This means that for every one unit increase in BMI, there is a decline in testosterone concentration of about 3.4% (\(1 - e^{-0.035} \approx 0.034\)).
Note that the \(\beta\) coefficient gives approximately the percentage increase or decrease associated with a one unit change in the exposure variable (here: BMI).
Warning
In order to interpret the \(\beta\) coefficient, we need to exponentiate it. This is described as an exponentiated slope or \(\beta\) coefficient and has a similar interpretation to a risk ratio, describing the proportional increase (or decrease) in the average value of testosterone, given a one unit change in BMI. For example, an exponentiated \(\beta\) coefficient of 1.2 means a 20% increase in the testosterone concentration for a one unit change in body mass index. Conversely, an exponentiated \(\beta\) coefficient of 0.8 would mean that testosterone concentration would decline by 20% (1 - 0.8 = 0.2 or 20%) for every unit increase in BMI.
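The exponentiation step is easy to check by hand or with a few lines of code. This Python sketch is only an illustration (the function name `percent_change` is made up); it converts a slope from a model with a \(\text{log}_e\) transformed outcome into a percentage change per unit of exposure.

```python
import math

def percent_change(beta: float) -> float:
    """Percentage change in the outcome per one unit increase in exposure,
    for a model whose outcome was natural-log transformed."""
    return (math.exp(beta) - 1) * 100

# Slope of -0.035 from the log.Testosterone ~ BMI model discussed in the text.
print(f"{percent_change(-0.035):.1f}% per unit of BMI")

# An exponentiated coefficient of 1.2 corresponds to a 20% increase.
print(f"{percent_change(math.log(1.2)):.1f}%")
```

A negative result is a percentage decline, so -0.035 corresponds to a decline of roughly 3.4% per unit of BMI.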
The interpretation of the slope term is complicated by the log-transformation that we've made.
By taking \(\frac{\ln(2)}{\beta_1}\), we can estimate the change in BMI required to, on average, double the value of the outcome (testosterone concentration); the same change with the opposite sign halves it.
Here, for the slope of -0.035, \(\frac{\ln(2)}{-0.035} = -19.8 \text{ kg}/\text{m}^2\). The negative sign means that a BMI 19.8 kg/m\(^2\) lower corresponds to a doubling of testosterone, while a BMI 19.8 kg/m\(^2\) higher corresponds to a halving. For a man of average height (1.69 m), the corresponding change in weight would be \(19.8 \times 1.69^2 \approx 56.6 \text{ kg}\)!
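The same arithmetic can be sketched in Python. The helper name `bmi_change_for_ratio` is made up for illustration; the slope (-0.035) and the height (1.69 m) are the values used in the text.

```python
import math

def bmi_change_for_ratio(beta: float, ratio: float) -> float:
    """Change in BMI associated with multiplying the outcome by `ratio`,
    given the slope `beta` of a model with a log-transformed outcome."""
    return math.log(ratio) / beta

beta = -0.035    # slope from the text
height_m = 1.69  # height used in the text

doubling = bmi_change_for_ratio(beta, 2.0)  # negative: BMI must fall
halving = bmi_change_for_ratio(beta, 0.5)   # positive: BMI must rise

# BMI = weight / height^2, so a BMI change converts to kg via height^2.
weight_change_kg = halving * height_m ** 2

print(f"BMI change to double testosterone: {doubling:.1f} kg/m^2")
print(f"BMI change to halve testosterone:  {halving:.1f} kg/m^2")
print(f"Weight change at {height_m} m:        {weight_change_kg:.1f} kg")
```

Passing the ratio explicitly makes the sign convention visible: a ratio above 1 with a negative slope gives a negative BMI change, and a ratio below 1 gives a positive one.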
What is the best meaning of the \(P\)-value associated with the \(\beta\) coefficient for the slope?
Which model is best?
A very simple way of comparing model fit between our model with a log-transformed outcome and a naive model (with no transformation) is to compare the \(R^2\) for each model.
Remember that \(R^2\) is simply 1 minus the ratio of the sum of squared model residuals [vertical distance from each observed data point to the corresponding point on the linear model (prediction)] to the sum of squared residuals from the simple mean [vertical distance from each observed data point to the horizontal yellow line; see below]. The more the model reduces its residuals compared with those from a simple mean, the greater the \(R^2\).
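As a minimal sketch of that definition, the Python function below computes \(R^2\) from observed values and model predictions. The numbers are toy values chosen to show the two extremes, not output from iNZight, and `r_squared` is a name made up for this example.

```python
def r_squared(observed, predicted):
    """1 minus (sum of squared model residuals) / (sum of squared
    residuals from the simple mean of the observed values)."""
    mean_y = sum(observed) / len(observed)
    ss_model = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_mean = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_model / ss_mean

# Toy values for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
perfect_fit = [2.0, 4.0, 6.0, 8.0]  # predictions match every observation
mean_only = [5.0, 5.0, 5.0, 5.0]    # model no better than the simple mean

print(r_squared(observed, perfect_fit))  # model removes all residual variation
print(r_squared(observed, mean_only))    # model removes none
```

A model that predicts every point exactly has no residuals, while a model that only ever predicts the mean explains none of the variation.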
To get the \(R^2\), we need to estimate the model using Advanced --> Model fitting.
Select Testosterone as the Y variable and BMI as the Variables of interest, then FIT MODEL.
The \(R^2\) for the naive model is about 0.18, indicating that about 18% of the variation in testosterone is explained by the model.
If we now select log.Testosterone as the Y variable and have BMI as the independent variable as before, we see that the \(R^2\) has increased.
The new \(R^2\) is now which is a % improvement over the naive model.
The rule of thumb is that the higher the \(R^2\), the better the model fit!
Adjusting for confounding
Potential confounder Age
A colleague suggests the relationship you have discovered between BMI and Testosterone may be explained by age, which is a possible shared cause of the two, since people generally get heavier as they age and testosterone goes down with age.
It is possible that Age is a shared common cause or confounder of the relationship between BMI and log.Testosterone. With a linear regression model, we can easily adjust for Age and see whether it turns our negative association to null or not significant.
Select Advanced --> Model fitting. Then select log.Testosterone as the Y variable and BMI as the Variables of interest, then FIT MODEL.
Now adjust for Age (potential confounder) and note the change to the slope term (\(\beta\) coefficient) for BMI. The crude \(\beta\) coefficient for BMI is -0.035
and the adjusted is
.
This means that Age is not a confounder, since the difference between the crude and adjusted \(\beta\) coefficient is less than 10%.
The adjusted \(\beta\) coefficient for
Age
is
which means that for each year a male ages, on average,
Testosterone
concentration reduces by
%.
The \(P\)-values for adjusted \(\beta\) coefficients are statistically significant.
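The crude-versus-adjusted comparison can be written as a short calculation. In the Python sketch below, the crude slope (-0.035) comes from the text, but the adjusted slope is a made-up placeholder: substitute the value from your own iNZight output. Expressing the change relative to the crude estimate is one common convention and is an assumption here, as are the names `percent_change_in_slope` and `hypothetical_adjusted`.

```python
def percent_change_in_slope(crude: float, adjusted: float) -> float:
    """Absolute change in the slope after adjustment, as a percentage
    of the crude slope (one common convention; an assumption here)."""
    return abs((adjusted - crude) / crude) * 100

crude = -0.035                  # crude BMI slope from the text
hypothetical_adjusted = -0.034  # PLACEHOLDER, not the real adjusted slope

change = percent_change_in_slope(crude, hypothetical_adjusted)
print(f"Slope changed by {change:.1f}% after adjustment")
print("Below the 10% rule of thumb" if change < 10 else "At or above 10%")
```

With the placeholder value, the slope changes by well under 10%, so the rule of thumb would not flag confounding; the conclusion for the real data depends on the adjusted slope you obtain.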
Potential confounder Education
Adjust for the potential confounder Education by adding this as a third variable in the Visualise tab. What do you notice?
This is an example of effect modification, since the relationship between BMI and log.Testosterone is modified by the third variable (Education).
Homework
Go to Dataset --> Restore data.
Then restrict the data to subjects over 20 years of age.
Check distribution
Consider the distribution of the outcome variable BPSysAve.
Overall it is . That means that the variable need a log transformation before modelling or other statistical testing.
Interpret scatter plot and linear model
Consider the relationship between systolic blood pressure (BPSysAve - outcome) and Age (exposure).
Interpret the scatter plot by adding a linear regression line and moving average. Which is the best description of the scatter plot?
Slope interpretation
As an individual ages by one year, on average their systolic blood pressure increases by mmHg. Over ten years that would be mmHg.
Interpret the model. Adjust for Gender as a potential confounder. How does this change the interpretation?
The adjusted \(\beta\) coefficient for Age
after adjusting for gender is
.
This is
than a 10% change, so gender
confound the
relationship between Age
and systolic blood pressure.
Incidentally, the \(\beta\)
coefficient for Gender
means males have a systolic blood
pressure that is on average
mmHg
than females, after adjusting for Age
.