5. Tutorial for POPLHLTH 216: Quantitative methods for health

Simon Thornley

22 September, 2023


Photo by Simone Pellegrini on Unsplash

Further exploration of regression.

This session, we will explore the relationship between testosterone (the male sex hormone) and body mass index (BMI) in a national US health survey (NHANES).

This will allow us to explore the need for data transformation and how to interpret regression coefficients when a variable has been transformed. This may be a little more mathematically challenging. Right skewed data occur very commonly in epidemiology, so being confident in how to deal with this situation is particularly important.

The NHANES data is available by clicking File –> Dataset Examples –> Future-Learn –> nhanes_2000 –> Select Set

iNZight is available here.

Aims

Today, we will aim to:

  • detect data that require transformation and
  • interpret models that have log transformed outcomes.

Warning, a bit of algebra is required!

Testosterone concentration and body mass

We will look at a situation where we have a right skewed continuous variable, such as the concentration of testosterone, the male sex hormone. It is well known that testosterone has an inverse association with obesity, which is likely to be related to fat tissue converting testosterone to oestradiol (a female sex hormone). Whether testosterone replacement is useful to reverse obesity is now the subject of randomised controlled trials.

We will skip the data checking stuff, but suffice to say that this is necessary with any new dataset that you start to analyse.

In the Visualise tab, select Testosterone.

  • What do you notice about the distribution of the data? (symmetric or asymmetric?)
  • In which direction is the data skewed? (left or right?)

It is well recognised that right skewed data can often be transformed to something resembling normality using a \(\text{log}_{e}\) or \(ln\) transformation. In fact, a logarithm of any base can be used, but base \(e\) is the most common choice, since it is relatively straightforward to interpret.

This is because \(e^{x} \approx 1 + x\) for small values of \(x\). That means, if a \(\beta\) coefficient of a model with a \(\text{log}_e\) transformed outcome is, for example, 0.05, it can be interpreted using \(e^{0.05} \approx 1.05\). This means, if the exposure or \(x\) increases by one unit, the average or expected increase in \(y\) or the outcome is approximately 5%.

A plot is shown below of \(x\) vs. \(e^x - 1\). Note that for small values of \(x\) (\(-0.2 \leq x \leq 0.2\)), the value of \(e^x - 1\) is almost identical to \(x\).
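The approximation can also be checked numerically. The short Python sketch below is an illustration only, not part of the iNZight workflow, and the helper name `approx_error` is made up for this example. It tabulates \(x\) against \(e^x - 1\) for a few values inside and outside the range quoted above.

```python
import math


def approx_error(x: float) -> float:
    """Absolute difference between e^x - 1 and the approximation x."""
    # math.expm1(x) computes e^x - 1 accurately for small x.
    return abs(math.expm1(x) - x)


# Values inside (-0.2 to 0.2) and outside (1.0) the "small x" range.
for x in (-0.2, -0.05, 0.05, 0.2, 1.0):
    print(f"x = {x:+.2f}   e^x - 1 = {math.expm1(x):+.4f}   error = {approx_error(x):.4f}")
```

For \(x = 0.05\) the two values differ by roughly 0.001, and at \(x = 0.2\) by about 0.02. By \(x = 1\) the gap is about 0.7, so the shortcut should only be used for small coefficients.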

Create a new variable called log.Testosterone by selecting Manipulate variables –> Numeric variables. Select Transform variables and Testosterone under Select Columns and LOG(e) under Select Transformation.

Observe the distribution of the new variable.

  • Is it asymmetric or symmetric compared to the original variable?
  • Why might this be important for statistical analysis and regression analysis?

Here, the data is more symmetric once the data is log transformed, although it still has a strange bimodal (two peaks) distribution.
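To see why a log transformation pulls in a long right tail, here is a minimal Python sketch using simulated data. The values are drawn from a lognormal distribution chosen purely for illustration. They are not the NHANES testosterone measurements, so the bimodal shape seen in the real data will not appear. The `skewness` helper is a hypothetical name for a simple moment-based skewness statistic (0 means symmetric, positive means right skewed).

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based skewness: mean of the cubed standardised deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


random.seed(1)  # fixed seed so the sketch is reproducible

# Simulated right skewed values (an assumption, NOT the NHANES data).
raw = [random.lognormvariate(1.0, 0.8) for _ in range(5000)]
logged = [math.log(v) for v in raw]

print(f"skewness before log transform: {skewness(raw):.2f}")
print(f"skewness after log transform:  {skewness(logged):.2f}")
```

The skewness of the raw values is clearly positive, whereas after the transformation it is close to zero, which is the numerical counterpart of the histogram becoming more symmetric.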

Consider the relationship between log.Testosterone (outcome - vertical axis) and BMI (exposure - horizontal axis).

Add a smoother and linear regression line using Add to plot.

What do you notice? In what direction is the association?

Do you think it may be worth restricting our analysis in some way?

  • by age?
  • by gender?

Let’s filter to only include men over the age of 20 years, to avoid the issue of gender differences and the role of puberty.

Gender and age are both potential confounders, and here restriction makes more sense than, for example, stratification.

Use Dataset –> Filter dataset to restrict the population to men only and those aged 20 years or more.

Consider the relationship between log.Testosterone and BMI.

  • How would you describe the strength of association? (hint: the slope, or \(\beta\) coefficient, is important here)

  • Is the association statistically significant? (hint: consider the \(p\)-value)

  • Adjust for the potential confounder age, does this change the relationship at all?

For the \(p\)-value, I get <2e-16, which is highly statistically significant (\(p\) < 0.001).

Take a look at the Inference tab and interpret the meaning of the slope and \(y\)-intercept.

The slope I’ve obtained is -0.035. This means that for every one unit increase in BMI, there is, on average, a 3.4% decline in testosterone concentration (\(1 - e^{-0.035} \approx 0.034\)).

The interpretation of the slope term is complicated by the log-transformation that we’ve made.

In order to interpret the \(\beta\) coefficient, we need to exponentiate it. This is described as an exponentiated slope or \(\beta\) coefficient and has a similar interpretation to a risk ratio, describing the proportional increase (or decrease) in the average value of testosterone, given a one unit increase in body mass index. For example, an exponentiated \(\beta\) coefficient of 1.2 means a 20% increase in the testosterone concentration for a one unit increase in body mass index. Conversely, an exponentiated \(\beta\) coefficient of 0.8 would mean that testosterone concentration would decline by 20% (1 - 0.8 = 0.2 or 20%) for every unit increase in body mass index.
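As a small worked example, the conversion from a coefficient on the \(\text{log}_e\) scale to a percentage change can be written as a one-line function. This is a sketch under the assumption that the outcome was \(\text{log}_e\) transformed and the exposure was not. The function name `percent_change` and its `delta_x` argument are illustrative choices, not iNZight features.

```python
import math


def percent_change(beta: float, delta_x: float = 1.0) -> float:
    """Percent change in the outcome for a delta_x increase in the exposure,
    given a slope `beta` from a model with a log_e transformed outcome."""
    # 100 * (e^(beta * delta_x) - 1); expm1 keeps precision for small values.
    return 100.0 * math.expm1(beta * delta_x)


print(round(percent_change(-0.035), 1))         # slope from this tutorial: -3.4
print(round(percent_change(math.log(1.2)), 1))  # exponentiated slope of 1.2: 20.0
print(round(percent_change(math.log(0.8)), 1))  # exponentiated slope of 0.8: -20.0
```

The optional `delta_x` argument handles changes larger than one unit. Percentage changes compound rather than add, so a five unit increase in BMI with a slope of -0.035 gives about a 16% decline, not 5 × 3.4% = 17%.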

By taking \(\frac{ln(2)}{\beta_1}\), we can estimate the change in BMI required to, on average, double the value of the outcome (testosterone concentration).

Here, for the slope of -0.035, the change in BMI required to double testosterone is \(\frac{ln(2)}{-0.035} = -19.8 \text{ kg}/\text{m}^2\), that is, a decrease of 19.8 kg/m\(^2\). Equivalently, an increase of 19.8 kg/m\(^2\) would halve it. For a man of average height (1.69 m), the corresponding change in weight would be \(19.8 \times 1.69^2 \approx 57 \text{ kg}\)!
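The arithmetic can be checked with a few lines of Python. This is a sketch using the rounded slope of -0.035 and the height of 1.69 m quoted above. The function name `bmi_change_for_fold` is hypothetical. Passing a fold of 2 gives the BMI change that doubles the outcome, and a fold of 0.5 gives the change that halves it.

```python
import math


def bmi_change_for_fold(beta: float, fold: float) -> float:
    """Change in exposure needed to multiply the outcome by `fold`,
    for a model with a log_e transformed outcome and slope `beta`."""
    return math.log(fold) / beta


beta_bmi = -0.035   # rounded slope from this tutorial
height_m = 1.69     # height used in the text

to_double = bmi_change_for_fold(beta_bmi, 2.0)  # negative: BMI must fall
to_halve = bmi_change_for_fold(beta_bmi, 0.5)   # positive: BMI must rise

# BMI = weight / height^2, so a BMI change converts to weight via height^2.
weight_change_kg = to_halve * height_m ** 2

print(f"BMI change to double testosterone: {to_double:.1f} kg/m^2")
print(f"BMI change to halve testosterone:  {to_halve:.1f} kg/m^2")
print(f"Weight change at {height_m} m:       {weight_change_kg:.1f} kg")
```

The two BMI changes have the same size (about 19.8 kg/m\(^2\)) but opposite signs, because the slope is negative.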

What is the meaning of the P-value associated with the \(\beta\) coefficient for the slope?

Now adjust for Age (a potential confounder) and note the change to the slope term (\(\beta\) coefficient) for BMI.

Homework

Go to Dataset –> Restore data

Then restrict the data to subjects over 20 years of age.

Consider the relationship between systolic blood pressure (BPSysAve - outcome) and Age (exposure). Interpret the scatterplot by adding a linear regression line and moving average. Interpret the model. Adjust for Gender as a potential confounder. How does this change the interpretation?

Answers

  • For every one year increase in Age, there is an average increase of 0.424 mmHg in systolic blood pressure.

  • The association between systolic blood pressure and age is highly statistically significant (\(P\) < 0.001). To a lay audience, I would state “it is unlikely to be explained by chance”.

  • With the addition of Gender, the \(\beta\) coefficient or slope for Age increases modestly to 0.426 mmHg. So, after adjusting for gender, the average increase in systolic blood pressure is 0.426 mmHg for every one year increase in age. This is less than a 10% change, so gender is not confounding the relationship between age and systolic blood pressure.

  • Note, since systolic blood pressure is approximately symmetrically distributed, it does not require a \(\text{log}_e\) transformation.
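The 10% check used in the answers above can be written out explicitly. The sketch below, with the hypothetical helper name `coefficient_change_percent`, computes the relative change between the crude and adjusted slopes for Age, using the two values quoted in the answers.

```python
def coefficient_change_percent(crude: float, adjusted: float) -> float:
    """Relative change (in percent) from the crude to the adjusted slope."""
    return 100.0 * (adjusted - crude) / crude


crude_slope = 0.424     # mmHg per year, Age only
adjusted_slope = 0.426  # mmHg per year, Age adjusted for Gender

change = coefficient_change_percent(crude_slope, adjusted_slope)
print(f"change in slope after adjustment: {change:.2f}%")

# Rule of thumb used in this tutorial: a change of 10% or more suggests confounding.
print("confounding suggested" if abs(change) >= 10 else "little evidence of confounding")
```

Here the slope changes by about 0.5%, far below the 10% threshold.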