5. Tutorial for POPLHLTH 216: Quantitative methods for Health Research

Simon Thornley

29 September, 2022

Further exploration of regression.

This session, we will explore the relationship between testosterone (the male sex hormone) and body mass index (BMI) in a national US health survey (NHANES).

This will allow us to explore the need for data transformation and the interpretation of regression coefficients when a variable has been transformed. This may be a little more mathematically challenging. Right-skewed data occur very commonly in epidemiology, so being confident in dealing with this situation is particularly important.

The NHANES data is available here.

iNZight is available here.

Aims

Today, we will aim to:

detect data that requires transformation and
interpret models that have log-transformed outcomes.

Warning, a bit of algebra is required!

Testosterone concentration and body mass

We will look at a situation where we have a right-skewed variable: testosterone concentration. It is well known that testosterone has an inverse association with obesity, which is likely to be related to fat tissue converting testosterone to oestradiol (a female sex hormone). Whether testosterone replacement is useful is now the subject of randomised controlled trials.

Upload the NHANES data to iNZight.

We will skip the data checking here, but this step is necessary with any new dataset that you start to analyse.

In the Visualise tab, select Testosterone.

What do you notice about the distribution of the data?
In which direction is the data skewed?

It is well recognised that right-skewed data can often be transformed to something resembling a normal distribution using a natural log (\(\log_e\) or \(\ln\)) transformation.

Create a new variable called log.Testosterone by selecting Manipulate variables –> Numeric variables. Select Transform variables and Testosterone under Select Columns and LOG(e) under Select Transformation.

Observe the distribution of the new variable.

Is it asymmetric or symmetric compared to the original variable?
Why might this be important for statistical analysis and regression analysis?

Here, the data is more symmetric once it is log transformed, although it still has a strange bimodal distribution.
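For anyone who wants to see the effect of a log transformation outside iNZight, here is a minimal Python sketch. It uses synthetic log-normal data, not the NHANES testosterone values, and the `skewness` helper and the simulation parameters are assumptions chosen purely for illustration. A skewness well above zero indicates a long right tail; a value near zero indicates a roughly symmetric distribution.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


# Synthetic right-skewed data (log-normal). These are NOT the NHANES values.
rng = random.Random(216)
raw = [rng.lognormvariate(1.5, 0.8) for _ in range(5000)]

# Natural log transformation, the same operation as LOG(e) in iNZight.
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before log transformation: {skew_raw:.2f}")
print(f"skewness after log transformation:  {skew_log:.2f}")
```

Real data will rarely behave this neatly. As noted above, the log-transformed testosterone values remain bimodal, which a single skewness number would not reveal, so always look at the plot as well.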

Consider the relationship between log.Testosterone and BMI.

Add a smoother and linear regression line using Add to plot.

What do you notice? In what direction is the association?

Do you think it may be worth restricting our analysis in some way?

by age?
by gender?

These are both potential confounders, and here restriction makes more sense than, for example, stratification.

Use Dataset –> Filter dataset to restrict the population to men only and those aged 20 years or more.

Consider the relationship between log.Testosterone and BMI.

How would you describe the strength of association?
Is the association statistically significant?
Adjust for the potential confounder Age. Does this change the relationship at all?

Take a look at the Inference tab and interpret the meaning of the slope and y-intercept.
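To make the slope and y-intercept concrete, the following sketch fits a straight line to a log-transformed outcome by ordinary least squares. It is only an illustration under stated assumptions: the data are simulated rather than taken from NHANES, `fit_line` is a hypothetical helper written for this example, and iNZight's own output will also include standard errors and P-values that are not computed here.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Simulated data only: the outcome falls by about 2% per unit of the predictor.
rng = random.Random(2022)
true_slope = math.log(0.98)
bmi = [rng.uniform(18, 40) for _ in range(2000)]
log_outcome = [6.5 + true_slope * b + rng.gauss(0, 0.4) for b in bmi]

intercept, slope = fit_line(bmi, log_outcome)
print(f"y-intercept (log scale): {intercept:.3f}")
print(f"slope (log scale):       {slope:.4f}")
```

Both numbers are on the log scale: the y-intercept is the predicted log outcome when the predictor is zero, and the slope is the change in the log outcome for each one-unit increase in the predictor.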

The interpretation of the slope term is complicated by the log-transformation that we’ve made.

In order to interpret the \(\beta\) coefficient, we need to exponentiate it. This is described as an exponentiated slope or \(\beta\) coefficient and has a similar interpretation to a risk ratio, describing the proportional increase (or decrease) in the average value of testosterone, given a one-unit increase in body mass index. For example, an exponentiated \(\beta\) coefficient of 1.2 means a 20% increase in testosterone concentration for a one-unit increase in body mass index. Conversely, an exponentiated \(\beta\) coefficient of 0.8 would mean that testosterone concentration declines by 20% (1 - 0.8 = 0.2 or 20%) for every one-unit increase in body mass index.
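The algebra behind this interpretation can be sketched as follows. The notation is an assumption made for this illustration: \(T(x)\) is the typical testosterone concentration at a body mass index of \(x\), \(\beta_0\) is the y-intercept and \(\beta_1\) is the slope from the model fitted on the log scale.

\[
\begin{aligned}
\ln T(x) &= \beta_0 + \beta_1 x \\
\ln T(x+1) - \ln T(x) &= \beta_1 \\
\ln\!\left(\frac{T(x+1)}{T(x)}\right) &= \beta_1 \\
\frac{T(x+1)}{T(x)} &= e^{\beta_1}
\end{aligned}
\]

So each one-unit increase in body mass index multiplies the typical testosterone concentration by \(e^{\beta_1}\), whatever the starting value of \(x\). Strictly speaking, because the model is fitted to the logged values, the "average" being compared is a geometric mean rather than an arithmetic mean.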

By taking \(\frac{\ln(2)}{\beta_1}\), where \(\beta_1\) is the slope on the log scale (before exponentiating), we can estimate the change in BMI required to double the value of the outcome (testosterone concentration).
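Both calculations are easy to check numerically. The sketch below is a small illustration only: the function names are made up for this example, and the slope is a hypothetical value chosen so that the exponentiated coefficient is 0.8. It is not an estimate from the NHANES data.

```python
import math


def percent_change_per_unit(beta):
    """Percentage change in the outcome for a one-unit increase in the predictor."""
    return (math.exp(beta) - 1) * 100


def units_to_double(beta):
    """Change in the predictor associated with a doubling of the outcome: ln(2) / beta."""
    return math.log(2) / beta


# Hypothetical slope on the log scale, chosen so that exp(beta_1) is 0.8.
beta_1 = math.log(0.8)

print(f"exponentiated slope: {math.exp(beta_1):.2f}")
print(f"percentage change per BMI unit: {percent_change_per_unit(beta_1):.1f}%")
print(f"BMI change associated with doubling: {units_to_double(beta_1):.2f}")
```

When the slope is negative, as here, `units_to_double` returns a negative number: the outcome doubles for a decrease in BMI of that size, not an increase.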

What is the meaning of the P-value associated with the \(\beta\) coefficient for the slope?

Now adjust for Age (a potential confounder) and note the change to the slope term (\(\beta\) coefficient) for BMI once Age is in the model.

Homework

Consider the relationship between systolic blood pressure (BPSysAve) and Age. Interpret the scatterplot by adding a linear regression line and moving average. Interpret the model. Adjust for Gender as a potential confounder. How does this change the interpretation?