Photo by Simone Pellegrini on Unsplash

Further exploration of regression

This session, we will explore the relationship between serum concentration of testosterone (the male sex hormone) and body mass index (BMI) in a national US health survey (NHANES).

This will allow us to explore the need for data transformation and the interpretation of regression coefficients when a variable has been transformed. This may be a little more mathematically challenging. Right-skewed data occur very commonly in epidemiology, so being confident in how to deal with this situation is particularly important.

The NHANES data is available by clicking File --> Dataset Examples --> Future-Learn --> nhanes_2000 --> Select Set

iNZight is available here.

Aims

Today, we will aim to:

  • analyse data that require transformation and
  • interpret models that have log-transformed outcomes.

Warning, a bit of algebra is required! 🤓

Testosterone concentration and body mass

We will look at a situation where we have a right-skewed continuous variable, such as testosterone, the male sex hormone. It is well known that testosterone has an inverse association with obesity, which is likely to be related to fat tissue converting testosterone to oestradiol (a female sex hormone). Whether testosterone replacement is useful to reverse obesity is now the subject of a randomised controlled trial.

We will skip the data checking stuff, but suffice to say that this is necessary with any new dataset that you start to analyse.

In the Visualise tab, select Testosterone.

  • What do you notice about the distribution of the data? The data is
  • In which direction is the data skewed?

It is well recognised that right-skewed data can often be transformed to something resembling normality using a \(\text{log}_{e}\) or \(ln\) transformation. In fact, a log transformation with any base can be used, but base \(e\) is the most common, since it is relatively straightforward to interpret.

This is because \(e^{x} \approx 1 + x\) for small values of \(x\). For example, if the \(\beta\) coefficient of a model with a \(\text{log}_e\) transformed outcome is 0.05, then \(e^{0.05} \approx 1.05\). This means that if the exposure (\(x\)) increases by one unit, the expected value of the outcome (\(y\)) increases by approximately 5%.

A plot is shown below of \(x\) vs. \(e^x - 1\). Note that for small values of \(x\) (\(-0.2 \leq x \leq 0.2\)), the value of \(e^x - 1\) is almost identical to \(x\).
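As a quick numerical check of this approximation, here is a minimal Python sketch (not part of the iNZight workflow; the function name `approximation_table` is just an illustrative choice) that compares \(x\) with \(e^x - 1\) over the range mentioned above.

```python
import math


def approximation_table(xs):
    """Return (x, e^x - 1, absolute difference from x) for each x."""
    return [(x, math.exp(x) - 1, abs(math.exp(x) - 1 - x)) for x in xs]


# Values of x inside the range -0.2 <= x <= 0.2 discussed in the text
rows = approximation_table([-0.2, -0.1, -0.05, 0.05, 0.1, 0.2])

for x, exact, diff in rows:
    print(f"x = {x:+.2f}   e^x - 1 = {exact:+.4f}   |difference| = {diff:.4f}")
```

The difference grows as \(x\) moves away from zero, which is why the shortcut of reading \(\beta\) directly as a proportion is only reasonable for small coefficients.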

Let's do a few questions to check our understanding.

The main reason for transforming data is to

Usually, a log base \(e\) transformation is used for data.

When the outcome variable is \(\text{log}_e\) transformed, the interpretation of the slope or \(\beta\) coefficient is complicated a little. Rather than being the average change in \(y\) for a one unit change in \(x\), it describes the proportional (percentage) change in \(y\) for a one unit change in \(x\).
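A short derivation may help to see why. Assuming a simple model with a single exposure, where \(\beta_0\) is the intercept and \(\beta_1\) is the slope, and writing \(y(x)\) for the value predicted on the original scale (notation introduced here only for illustration):

\[
\ln(y) = \beta_0 + \beta_1 x
\quad\Rightarrow\quad
y(x) = e^{\beta_0 + \beta_1 x}
\]

\[
\frac{y(x + 1)}{y(x)} = \frac{e^{\beta_0 + \beta_1 (x + 1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}
\]

So a one unit increase in \(x\) multiplies the predicted outcome by \(e^{\beta_1}\), whatever the starting value of \(x\).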

Transforming continuous data

Create a new variable called log.Testosterone by selecting Manipulate variables --> Numeric variables. Select Transform variables and Testosterone under Select Columns and LOG(e) under Select Transformation.

Observe the distribution of the new variable by using the Visualise tab.

Here, the data is more symmetric once it is log transformed, although it still has a strange bimodal (two-peaked) distribution.
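To illustrate the effect of a log transformation outside iNZight, here is a small Python sketch. It uses simulated log-normal values, not the NHANES testosterone data, and the helper `skewness` is an illustrative function written for this example.

```python
import math
import random
import statistics


def skewness(values):
    """Skewness: mean cubed deviation divided by the cubed (population) SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Simulated right-skewed values (assumption: log-normal, NOT real NHANES data)
raw = [rng.lognormvariate(1.0, 0.8) for _ in range(5000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness before log transform: {skew_raw:.2f}")
print(f"skewness after log transform:  {skew_log:.2f}")
```

A skewness near zero indicates a roughly symmetric distribution. Real data, such as the two-peaked distribution seen here, will not behave as neatly as this simulation.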

Consider the relationship between log.Testosterone (outcome - vertical axis; first variable) and BMI (exposure - horizontal axis; second variable).

Add a smoother and linear regression line using Add to plot.

Add a third variable Gender.

What do you think the two clouds of points in males represent?

What do you notice? In what direction is the association for males?

Do you think it may be worth restricting our analysis in some way to simplify the analysis?

  • by age? Yes, testosterone is likely to be lower in children who have not yet reached puberty.
  • by gender? Yes, testosterone is likely to be lower in women than in men!

Let's filter to only include men over the age of 20 years, to avoid the issue of gender differences and puberty.

These are both potential confounders, where restriction makes more sense than stratification.

Use Dataset --> Filter Dataset to restrict the population to men only and those aged 20 years or more.

Age restriction

To apply the Age restriction, use Dataset --> Filter Dataset -->
Select Filter to apply --> numeric condition.

First, select the variable Age --> Select a condition --> >=
Provide a numeric value to test for --> input 20 -->
PERFORM OPERATION.

Gender restriction

To select males only, use Dataset --> Filter Dataset -->
Select Filter to apply --> levels of categorical variable -->

Select the variable Gender --> Select levels to include -->
male --> PERFORM OPERATION.

This will restrict the dataset to subjects 20 years or more who are male.

Time to analyse

The distribution of log.Testosterone is now

The data set should now be restricted to 708 subjects.

Now, again, consider the relationship between log.Testosterone and BMI in the plot.

Don't forget to add a regression line and smoother.

What is the slope of the line?

Under the Inference tab, the \(\beta\) coefficient or slope is given by the

  • How would you describe the strength of association between BMI and testosterone? (hint: the \(\beta\) coefficient is important here)

  • Is the association statistically significant? (hint: consider the \(P\)-value) I get <2e-16, which is highly significant (<0.001).

Take a look at the Inference tab and interpret the meaning of the slope and \(y\)-intercept.

The output I've obtained is a value of -0.035. This means that for every one unit increase in BMI, there is a decline in testosterone concentration of about 3.4% (\(1 - e^{-0.035} \approx 0.034\)).
Note that the \(\beta\) coefficient gives approximately the percentage increase or decrease associated with a one unit change in the exposure variable (here: BMI).

Warning
In order to interpret the \(\beta\) coefficient, we need to exponentiate it. This is described as an exponentiated slope or \(\beta\) coefficient and has a similar interpretation as a risk ratio, describing the proportional increase (or decrease) in the average value of testosterone, given a one unit change in BMI. For example, an exponentiated \(\beta\) coefficient of 1.2 means a 20% increase in the testosterone concentration for a one unit change in body mass index. Conversely, an exponentiated \(\beta\) coefficient of 0.8 would mean that testosterone concentration would decline by 20% (1 - 0.8 = 0.2 or 20%) for every unit increase in BMI.
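The conversion from a \(\beta\) coefficient on the log scale to a percentage change can be sketched in a few lines of Python. The function name `percent_change` is an illustrative choice, and the slope of -0.035 is the value quoted in this session.

```python
import math


def percent_change(beta):
    """Percentage change in the outcome for a one unit increase in exposure,
    given a slope (beta) from a model with a log_e transformed outcome."""
    return (math.exp(beta) - 1) * 100


slope_bmi = -0.035  # slope for BMI quoted in the text

print(f"exponentiated slope: {math.exp(slope_bmi):.3f}")
print(f"percentage change per unit of BMI: {percent_change(slope_bmi):.1f}%")

# An exponentiated coefficient of 1.2 corresponds to a 20% increase,
# and 0.8 to a 20% decrease, as in the examples above.
print(f"{percent_change(math.log(1.2)):.0f}%")
print(f"{percent_change(math.log(0.8)):.0f}%")
```

A negative result is a percentage decrease and a positive result is a percentage increase.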

The interpretation of the slope term is complicated by the log-transformation that we've made.

By taking \(\frac{ln(2)}{\beta_1}\), we can estimate the change in BMI required to, on average, double or halve the value of the outcome (testosterone concentration).

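Here, for the slope of -0.035, the corresponding value is
\(\frac{ln(2)}{-0.035} = -19.8 \text{ kg}/\text{m}^2\). The negative sign reflects the inverse association: a decrease in BMI of 19.8 kg/m² would, on average, double testosterone, and an increase of 19.8 kg/m² would halve it. For a man of average height (1.69 m), the corresponding change in weight would be \(19.8 \times 1.69^2 \approx 56.6 \text{ kg}\)!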
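The same arithmetic can be checked with a short Python sketch. The function name `exposure_change_for_ratio` is hypothetical, and the slope and height are the values quoted above.

```python
import math


def exposure_change_for_ratio(beta, ratio):
    """Change in exposure needed to multiply the outcome by `ratio`,
    for a model with a log_e transformed outcome and slope `beta`."""
    return math.log(ratio) / beta


slope_bmi = -0.035  # slope for BMI quoted in the text
height_m = 1.69     # average height used in the text

bmi_to_halve = exposure_change_for_ratio(slope_bmi, 0.5)   # positive: BMI must rise
bmi_to_double = exposure_change_for_ratio(slope_bmi, 2.0)  # negative: BMI must fall

# BMI = weight / height^2, so a change in BMI maps to a change in weight
weight_change_kg = bmi_to_halve * height_m ** 2

print(f"BMI change to halve testosterone:  {bmi_to_halve:+.1f} kg/m^2")
print(f"BMI change to double testosterone: {bmi_to_double:+.1f} kg/m^2")
print(f"equivalent weight change at {height_m} m: {weight_change_kg:.1f} kg")
```

Using \(\ln(0.5)\) for halving and \(\ln(2)\) for doubling makes the sign of the result show the direction in which BMI has to move.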

What is the best meaning of the \(P\)-value associated with the \(\beta\) coefficient for the slope?

Which model is best?

A very simple way of comparing model fit between our model with a log-transformed outcome and a naive model (with no transformation) is to compare the \(R^2\) for each model.

Remember the \(R^2\) is simply 1 minus the ratio of the sum of squared model residuals [vertical distance from the observed data point to the corresponding point on the linear model (prediction)] to the sum of squared simple mean residuals [vertical distance from observed data point to horizontal yellow line; see below]. You can see that the more the model reduces its residuals, compared with those from a simple mean, the greater the \(R^2\).

Visual representation of R squared

Source
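The definition of \(R^2\) given above can be written out as a short Python sketch. The data are made-up numbers for illustration only (not NHANES values), and `fit_line` and `r_squared` are illustrative helper names, not iNZight functions.

```python
def fit_line(x, y):
    """Ordinary least squares for a single exposure: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """1 - (sum of squared model residuals / sum of squared mean residuals)."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_model = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_mean = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_model / ss_mean


# Hypothetical toy data, chosen only to keep the arithmetic easy to follow
x_demo = [1, 2, 3, 4]
y_demo = [1, 3, 2, 4]

print(f"R squared: {r_squared(x_demo, y_demo):.2f}")
```

For these toy values the model residuals sum to 1.8 squared units against 5 for the simple mean, giving an \(R^2\) of about 0.64.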

To get the \(R^2\), we need to estimate the model using Advanced --> Model fitting.

Select Testosterone as the Y variable and BMI as the Variables of interest, then FIT MODEL.

The \(R^2\) for the naive model is about 0.18, indicating about 18% of variation is explained by the model.

If we now select log.Testosterone as the Y variable and have BMI as the independent variable as before, we see that the \(R^2\) has increased.

The new \(R^2\) is now which is a % improvement over the naive model.

The rule of thumb is that the higher the \(R^2\), the better the model fit!

Adjusting for confounding

Potential confounder Age

A colleague suggests the relationship you have discovered between BMI and Testosterone may be explained by age, which is a possible shared cause of the two, since people generally get heavier when they age and testosterone goes down with age.

It is possible that Age is a shared common cause or confounder of the relationship between BMI and log.Testosterone. With a linear regression model, we can easily adjust for Age and see whether our negative association becomes null or no longer significant.

Select Advanced --> Model fitting. Then select log.Testosterone as the Y variable and BMI as the Variables of interest, then FIT MODEL.

Now adjust for Age (potential confounder) and note the change to the slope term (\(\beta\) coefficient) for BMI. The crude \(\beta\) coefficient for BMI is -0.035 and the adjusted is . This means that Age is not a confounder, since the difference between the crude and adjusted \(\beta\) coefficient is less than 10%. The adjusted \(\beta\) coefficient for Age is which means that for each year a male ages, on average, Testosterone concentration reduces by %.

The \(P\)-values for adjusted \(\beta\) coefficients are statistically significant.
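The comparison of crude and adjusted coefficients can be sketched in Python as below. This assumes the common 10% change-in-estimate rule of thumb with the crude coefficient as the denominator; both choices are assumptions, and the adjusted values used here are hypothetical placeholders, not the output from iNZight.

```python
def percent_change_in_beta(crude, adjusted):
    """Absolute change between crude and adjusted coefficients,
    as a percentage of the crude coefficient (assumed denominator)."""
    return abs(adjusted - crude) / abs(crude) * 100


def suggests_confounding(crude, adjusted, threshold=10.0):
    """True if the coefficient changes by at least `threshold` percent.
    The 10% default is a rule of thumb, not a formal test."""
    return percent_change_in_beta(crude, adjusted) >= threshold


crude_bmi = -0.035              # crude slope for BMI quoted in the text
hypothetical_adjusted = -0.034  # HYPOTHETICAL value, for illustration only

change = percent_change_in_beta(crude_bmi, hypothetical_adjusted)
print(f"change in coefficient: {change:.1f}%")
print(f"suggests confounding: {suggests_confounding(crude_bmi, hypothetical_adjusted)}")
```

Replace the hypothetical adjusted value with the coefficient from your own adjusted model before drawing any conclusion.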

Potential confounder Education

Adjust for the potential confounder Education, by adding this as a third variable in the Visualise tab. What do you notice?

This is an example of effect modification, since the relationship between BMI and log.Testosterone is modified by the third variable (Education).

Homework

Go to Dataset --> Restore data

Then restrict the data to subjects over 20 years of age.

Check distribution

Consider the distribution of the outcome variable BPSysAve.

Overall it is . That means that the variable need a log transformation before modelling or other statistical testing.

Interpret scatter plot and linear model

Consider the relationship between systolic blood pressure (BPSysAve - outcome) and Age (exposure). Interpret the scatter plot by adding a linear regression line and moving average. Which is the best description of the scatter plot?

Slope interpretation

As an individual ages by one year, on average their systolic blood pressure increases by mmHg. Over ten years that would be mmHg.

Interpret the model. Adjust for Gender as a potential confounder. How does this change the interpretation?

The adjusted \(\beta\) coefficient for Age after adjusting for gender is . This is than a 10% change, so gender confound the relationship between Age and systolic blood pressure.

Incidentally, the \(\beta\) coefficient for Gender means males have a systolic blood pressure that is on average mmHg than females, after adjusting for Age.