5th October 2023

Let me introduce myself

Brad Wakefield

I’m a Statistical Consultant in the Statistical Consulting Centre, NIASRA.

The Statistical Consulting Centre

  • Aim - The service aims to improve the statistical content of research carried out by members of the University. Researchers from all disciplines may use the Centre. Priority is currently given to staff members and postgraduate students undertaking research for Doctor of Philosophy or Masters’ degrees.

Introducing Statistical Models

Statistical Models

Statistical models are mathematical representations or formal descriptions of real-world phenomena or data, constructed using statistical methods and techniques.

\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \qquad \epsilon_i \sim N(0,\sigma^2) \]

Statistical Modelling

Statistical modelling is a process of creating mathematical or computational representations of real-world phenomena or data using statistical techniques.

What makes a model statistical?

Statistical models are…

  • Data Driven - informed by empirical observations.

  • Probabilistic (stochastic) - informed by random variation.

  • Mathematical - underpinned by mathematical relationships (equations).

  • Assumption based - require assumed knowledge about the world.

  • Diverse - there are a variety of ways of approaching the same problem.

The Box Aphorism

“All models are wrong, but some are useful.” (George Box)

Methods we use to model

Common modelling methods we use include:

  • Regression-based methods
    • Linear Regression
    • Logistic Regression
    • General Linear Models
    • Non-linear Regression
  • Mixed Effect Models
  • Time series models
  • Survival Analysis
  • Machine Learning Models

Methods we use to model (cont)

More common modelling methods we use include:

  • Structural Equation Models (SEMs)
  • Bayesian Models
  • Network Analysis
  • Text Analysis (Natural Language Processing models)

… plus many more.

Although we are not going to look at all these methods, there are principles that apply to them all.

Purpose of a Statistical Model

When applying statistical models, broadly speaking there are two central motivations we can have:

Explanation

To investigate relationships, associations, and trends between variables, draw conclusions, and make statistical inferences.

Prediction

To predict and forecast outcomes based on existing or historical data and to estimate the level of uncertainty in these predictions.

What do we Model?

Dependent variables

Also referred to as:

  • Outcome variables - variables that relate to a particular outcome or measure the success or failure of a study.
  • Response variables - variables that respond to a particular change in an independent variable.
  • Endogenous variables - variables that are determined by other variables in a model.
  • Target or output variables - terms used particularly in machine learning to describe variables that are being modelled.

What do we condition on?

Independent variables

Also referred to as:

  • Predictor variables - variables that are used to predict changes in a dependent variable.
  • Explanatory variables - variables that are used to explain changes in a dependent variable.
  • Covariates - variables that are controlled for and measured to account for their effect on the dependent variable. Also sometimes refers explicitly to continuous independent variables.

What do we condition on? (continued)

Independent variables

Still also referred to as:

  • Factors - categorical variables that describe the effect of different groups on the dependent variable.
  • Exogenous variables - variables that are independent and not affected by other variables in the model.
  • Inputs - term often used in machine learning to describe variables that will be used to predict the target or output variables.

Other terms to worry about

Independent variables may also be classed as either exposure or treatment variables to describe the variables that are being studied for their effect on the dependent variable.

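Confounding variables that are being controlled or accounted for are often referred to as control variables. Variables that change the size or direction of another variable's effect on the dependent variable are referred to as moderators or effect modifiers.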

A note about these terms…

These terms are attributed to data by the analyst.

All of these terms (dependent, independent, confounding, etc.) are assigned by us based on our research question.

For a different question, the same variable may take on a different role.

The first step (besides data visualisation) when approaching any modelling problem is to identify these roles based on your modelling objectives.

Example - Modelling Cholesterol

Aim: To explore the association of age and gender with cholesterol.

What role does each variable play in this situation?

Multivariate or Multivariable?

In the previous example we considered multiple variables (cholesterol, age, and gender).

Q. Based on the roles we assigned, would you call this a multivariate or multivariable model, or do these just mean the same thing?

Answer: Multivariable (not multivariate)

Multivariate modelling refers to the modelling of multiple dependent variables.

Multivariable modelling refers to the modelling of a single dependent variable with multiple independent variables.

Regression

Regression Models

Regression models feature unknown parameters, called coefficients, that quantify the magnitude and direction of the relationship between the dependent variable and the independent variables.

Throwback to High School

Remember \(y = mx+b\) ?

What if we wrote it like…

\[ y = \beta_0 + \beta_1x \]

Adding Error…

And what if we had error attached to each of our \(y\) terms…

\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i \]

Where \(\epsilon_i\) denotes the error term for the \(i\)th observation.

We get a line of best fit!

Linear Regression

In practice, the intercept \(\beta_0\) and slope \(\beta_1\) are unknown.

These are our unknown parameters (coefficients), and estimating them from data is a regression problem.

As we are modelling a line, we call this linear regression.

More specifically, as we are regressing on only one independent variable, we call this simple linear regression.

Regression analysis refers to the statistical process by which these coefficients are estimated based on an observed sample of the data and the process of checking assumptions of the model.

Fitting a Linear Regression

We determine estimates for the coefficients \(\beta_0\) and \(\beta_1\) by minimising the residual (error) sum of squares.

Each residual is the difference between an observed value \(y_i\) and the prediction \(\hat{y}_i\) that our model makes at \(x_i\), using the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) of the intercept and slope parameters.

\[e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1x_i . \]

We put a hat \(\hat{}\) on top of a term to represent an estimate of that term.

Least Squares Regression

We call the values \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimise the residual sum of squares (\(e_1^2 + e_2^2 + \cdots + e_n^2\)) the least squares estimates.
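For simple linear regression the least squares estimates have a well-known closed form: the slope is the sum of cross-products of the centred variables divided by the sum of squares of the centred predictor, and the intercept follows from the means. The sketch below checks this against `lm()`. It uses simulated data, not the seminar's cholesterol data; the variable names and the "true" intercept, slope, and noise level are assumptions chosen only for illustration.

```r
# Simulated stand-in data: NOT the seminar's cholesterol data set.
set.seed(1)
n    <- 200
age  <- runif(n, 20, 80)                        # hypothetical predictor
chol <- 4.5 + 0.03 * age + rnorm(n, sd = 1.3)   # hypothetical true line plus noise

# Closed-form least squares estimates for simple linear regression
b1 <- sum((age - mean(age)) * (chol - mean(chol))) / sum((age - mean(age))^2)
b0 <- mean(chol) - b1 * mean(age)

# The same estimates obtained from lm()
fit <- lm(chol ~ age)

# Residual sum of squares that these estimates minimise
rss <- sum(residuals(fit)^2)

print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

The two printed pairs of numbers should agree, since `lm()` solves the same minimisation problem.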

The full scary model…

The linear regression model equation is given by,

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon.\]

  • \(y\) denotes the dependent variable.
  • \(x_j\) denotes the \(j\)th independent variable.
  • \(\beta_0\) denotes the constant (intercept) coefficient.
  • \(\beta_j\) denotes the \(j\)th slope/effect coefficient corresponding to the \(j\)th independent variable.
  • \(p\) denotes the total number of independent variables.
  • \(\epsilon\) denotes the error term, a random variable with mean zero and constant variance that is independent across observations (and assumed to be normally distributed).

Let's talk about \(\epsilon\)

We assume the error terms \(\epsilon\) are independent and identically distributed, following a normal distribution with mean zero and constant variance.

What does that mean for our analysis?

The key assumptions required in linear regression are,

  1. The relationship between dependent and independent variables is linear.
  2. Constant variance in the residuals (homoscedasticity).
  3. Normality of the residuals.
    • Normality is not required of the dependent or independent variables themselves, ONLY of the residuals.
  4. Independence of the residuals (no autocorrelation). Separately, the independent variables should not be strongly correlated with one another (no multicollinearity).

What does an analysis output look like?

Well, consider our cholesterol data again.

If we were to fit a linear regression model in R, the output looks like this…

R Output - Linear Regression

## 
## Call:
## lm(formula = chol ~ age, data = cholesterol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2121 -0.8496 -0.1429  0.7267  7.8135 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.754221   0.468348  10.151  < 2e-16 ***
## age         0.030110   0.008471   3.554 0.000441 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.319 on 295 degrees of freedom
## Multiple R-squared:  0.04106,    Adjusted R-squared:  0.03781 
## F-statistic: 12.63 on 1 and 295 DF,  p-value: 0.0004413

Linear Regression Interpretation

Our estimates:

\[ \hat{\beta}_0 = 4.75, \quad \hat{\beta}_1 = 0.03 \]

The estimated line of best fit:

\[ \hat{y} = 4.75 + 0.03x \]

The estimated variance of our residuals

\[ \hat\sigma^2_\epsilon = 1.319^2 \approx 1.740 \]

This is called our mean square error.
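As a check on where this number comes from, using only figures in the R output: the residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom. With one predictor the degrees of freedom are \(n - 2 = 295\), which implies \(n = 297\) observations.

\[ \hat\sigma^2_\epsilon = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2 = \frac{e_1^2 + e_2^2 + \cdots + e_n^2}{295} \approx 1.74 \]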

Linear Regression Interpretation (cont)

We also conduct a hypothesis test to see if each estimated coefficient is significantly different from 0.

By convention, a coefficient is called statistically significant if its \(p\)-value is \(\leq 0.05\).

As the \(p\)-value associated with age was 0.0004, we say age had a significant effect on cholesterol, or is a significant predictor of cholesterol.
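To see how the test statistic in the output is formed, each t value is the estimate divided by its standard error. For age, using the figures from the R output:

\[ t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} = \frac{0.030110}{0.008471} \approx 3.55 \]

With a single predictor the F statistic is the square of this t value (\(3.554^2 \approx 12.63\)), which is why the \(p\)-value for age and the overall \(p\)-value agree.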

Checking Assumptions

To check assumptions we need to obtain diagnostic plots.

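  • Residuals vs Fitted plot - checks linearity (no pattern or curve in the residuals).

  • Normal Q-Q plot - checks normality of the residuals (points close to the line).

  • Scale-Location plot - checks constant variance (even spread across fitted values).

  • Residuals vs Leverage - identifies influential observations.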
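In R these four plots can be drawn directly from a fitted `lm` object. The sketch below uses simulated data as a stand-in (the variable names and true values are assumptions, not the seminar's data); the `which` argument of `plot()` selects the four plots listed above.

```r
# Simulated stand-in data: NOT the seminar's cholesterol data set.
set.seed(2)
n    <- 200
age  <- runif(n, 20, 80)
chol <- 4.5 + 0.03 * age + rnorm(n, sd = 1.3)
fit  <- lm(chol ~ age)

# Draw the four standard diagnostic plots in a 2 x 2 grid:
# 1 = Residuals vs Fitted, 2 = Normal Q-Q,
# 3 = Scale-Location,      5 = Residuals vs Leverage
old_par <- par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))
par(old_par)   # restore the previous plotting layout
```

Because the simulated errors here are normal with constant variance, these plots should look unremarkable; real data will often look messier.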

Residual vs Fitted Plot

Normal Q-Q plot

Scale-Location Plot

Residual vs Leverage Plot

Assumption checking our example

Checking Model Quality

There are a variety of model quality measures:

  • \(R^2\) value - measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model. It ranges from 0 to 1, where higher values indicate a better fit.

  • F statistic - tests whether the model explains significantly more variation than a model with no predictors.

  • Mean square error - the smaller the less error (but influenced by the scale of the data).

  • AIC and BIC - measure goodness of fit while penalising model complexity.
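A minimal sketch of how these measures can be pulled out of a fitted model in R is given below. The data are simulated (an assumption for illustration, not the seminar's data), and the \(R^2\) value is also recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares.

```r
# Simulated stand-in data: NOT the seminar's cholesterol data set.
set.seed(3)
n    <- 200
age  <- runif(n, 20, 80)
chol <- 4.5 + 0.03 * age + rnorm(n, sd = 1.3)
fit  <- lm(chol ~ age)
s    <- summary(fit)

r2   <- s$r.squared             # proportion of variance explained
fval <- s$fstatistic["value"]   # F statistic versus the intercept-only model
mse  <- s$sigma^2               # residual mean square (mean square error)
aic  <- AIC(fit)                # lower is better when comparing models
bic  <- BIC(fit)                # as AIC, with a heavier complexity penalty

# R-squared by hand: 1 - RSS / TSS
r2_by_hand <- 1 - sum(residuals(fit)^2) / sum((chol - mean(chol))^2)

print(c(R2 = r2, F = unname(fval), MSE = mse, AIC = aic, BIC = bic))
```

AIC and BIC are only meaningful relative to other models fitted to the same data, so a single value on its own says little.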

R Output - (again)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.754221   0.468348  10.151  < 2e-16 ***
age         0.030110   0.008471   3.554 0.000441 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.319 on 295 degrees of freedom
Multiple R-squared:  0.04106,   Adjusted R-squared:  0.03781 
F-statistic: 12.63 on 1 and 295 DF,  p-value: 0.0004413

Q. Do you think this model would be very useful for prediction?

Categorical Predictors

What happens when we have a predictor that is categorical (like gender)?

Linear regression requires the use of numeric predictors.

To express a categorical variable numerically, we use dummy (binary) coding: \(l-1\) binary variables for the \(l\) different levels of the categorical variable.

\[ X_1 = \Bigg\{ \begin{array}{ll} 1, & \text{if Gender = Male}\\ 0, & \text{if Gender = Female} \end{array} \]

Categorical Predictors (cont)

If we had a variable with 3 levels (like remoteness area) we would have two variables that look like this…

\[ X_2 = \Bigg\{ \begin{array}{ll} 1, & \text{if Area = Regional}\\ 0, & \text{otherwise} \end{array}, \qquad X_3 = \Bigg\{ \begin{array}{ll} 1, & \text{if Area = Remote}\\ 0, & \text{otherwise} \end{array} \]

Note there is always a reference (baseline) category: here it is City, for which both \(X_2\) and \(X_3\) are 0.
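R builds these dummy variables automatically when a factor appears in a model formula. The sketch below uses a small made-up factor (the values are hypothetical) and `model.matrix()` to display the numeric columns that would be passed to the regression.

```r
# Hypothetical three-level factor; "City" is listed first in `levels`,
# so it becomes the reference (baseline) category.
area <- factor(c("City", "Regional", "Remote", "City", "Remote"),
               levels = c("City", "Regional", "Remote"))

# model.matrix() shows the numeric columns R builds for a formula:
# an intercept plus one 0/1 column for each non-reference level.
X <- model.matrix(~ area)
print(X)
```

To choose a different baseline, the factor levels can be reordered, for example with `relevel(area, ref = "Remote")`.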

Fitting with Categorical Predictors

If we were to add the Gender variable to our previous model, we end up with…

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.241091   0.485458  10.796  < 2e-16 ***
age          0.027626   0.008377   3.298  0.00109 **
genderMale  -0.519234   0.161812  -3.209  0.00148 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.299 on 294 degrees of freedom
Multiple R-squared:  0.07351,   Adjusted R-squared:  0.06721 
F-statistic: 11.66 on 2 and 294 DF,  p-value: 1.335e-05

Visualising this Fit

Both lines have the same slope with respect to age.

The Male line has been shifted down by 0.52 (the genderMale coefficient is -0.52).
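Written out with rounded estimates from the output, and taking Female as the reference category (inferred from the coefficient name genderMale), the two fitted lines are:

\[ \text{Female: } \hat{y} = 5.24 + 0.028 \times \text{age} \]

\[ \text{Male: } \hat{y} = (5.24 - 0.52) + 0.028 \times \text{age} = 4.72 + 0.028 \times \text{age} \]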

But what if we want to have a different slope for each case?

Remember the first plot…

Each line has its own intercept and slope.

We have modelled an interaction between age and gender.

Understanding Interaction

An interaction term represents how the effect of one independent variable on the dependent variable depends on or changes with the value of another independent variable.
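In R's formula syntax an interaction can be requested with `*`, which expands to both main effects plus their product term. The sketch below fits such a model to simulated data; the variable names mirror the example, but the data and the "true" slopes are assumptions made only for illustration.

```r
# Simulated stand-in data: NOT the seminar's cholesterol data set.
set.seed(4)
n      <- 300
age    <- runif(n, 20, 80)
gender <- factor(sample(c("Female", "Male"), n, replace = TRUE))
# Hypothetical truth: the age slope is 0.05 for females and 0.02 for males
chol   <- 4.2 + 0.05 * age + ifelse(gender == "Male", 1.0 - 0.03 * age, 0) +
  rnorm(n, sd = 1.3)

# age * gender expands to age + gender + age:gender
fit_int <- lm(chol ~ age * gender)
print(coef(fit_int))

# Slope with respect to age for each group, recovered from the coefficients
slope_female <- coef(fit_int)[["age"]]
slope_male   <- coef(fit_int)[["age"]] + coef(fit_int)[["age:genderMale"]]
print(c(female = slope_female, male = slope_male))
```

The male slope computed this way matches what a separate regression on the males alone would give, which is one way to see that the interaction model allows each group its own line.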

How does this look in our output?

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.17577    0.79752   5.236 3.14e-07 ***
age             0.04673    0.01411   3.313  0.00104 ** 
genderMale      1.10236    0.97828   1.127  0.26074    
age:genderMale -0.02942    0.01750  -1.681  0.09391 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.295 on 293 degrees of freedom
Multiple R-squared:  0.08236,   Adjusted R-squared:  0.07296 
F-statistic: 8.766 on 3 and 293 DF,  p-value: 1.385e-05


What does the -0.029 represent?

It's the difference in slope between the two lines!
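Using the estimates in the output, and again taking Female as the reference category, the two fitted lines work out as:

\[ \text{Female: } \hat{y} = 4.176 + 0.0467 \times \text{age} \]

\[ \text{Male: } \hat{y} = (4.176 + 1.102) + (0.0467 - 0.0294) \times \text{age} = 5.278 + 0.0173 \times \text{age} \]

So genderMale (1.102) is the difference in intercepts and age:genderMale (-0.029) is the difference in slopes.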

Generalised Linear Regression

Generalised Linear Regression Models

Generalised Linear Models (GLMs) extend the linear regression framework.

Dependent variable data types that can be modelled with GLMs include:

  • Binary outcome data
  • Count data
  • Categorical and ordinal data
  • Continuous data
    • Can handle different distributions like Gamma.

Binary Logistic Regression

Binary logistic regression is used to model the odds of a binary outcome.

For example:

\[ Y=\Bigg\{\begin{array}{cl} 1, & \text{if an individual has heart disease,} \\ 0, & \text{if an individual does not have heart disease.}\end{array} \]

The odds of an event occurring refers to the probability of the event occurring divided by the probability of the event not occurring.

\[ \text{Odds} = \frac{p}{1-p} \]

Odds Ratios

Just like in linear regression, when we fit a binary logistic regression we obtain estimated coefficients \(\hat\beta_i\).

However, we usually use these values to obtain the odds ratios \(e^{\hat\beta_i}\).

Odds ratios quantify the effect of a predictor on increasing (or decreasing) the odds of the outcome.
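The reason exponentiated coefficients are odds ratios comes from the form of the model, which is linear on the log-odds scale. A sketch for a single predictor \(x\), using the same \(\beta\) notation as the linear regression slides:

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x \]

\[ \frac{\text{Odds at } x+1}{\text{Odds at } x} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1} \]

An odds ratio above 1 indicates increasing odds, below 1 decreasing odds, and exactly 1 no association.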

For example: Suppose I wanted to compare the odds of men and women having chronic heart disease in my data.

I determine that men have 3.6 times the odds of suffering from chronic heart disease compared with women.

Example - Chronic Heart Disease

Let's consider the same cholesterol data, but note that some of the participants have chronic heart disease:

Modelling CHD on Gender

Fitting a binary logistic regression in R we obtain:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.0438     0.2326  -4.488 7.18e-06 ***
genderMale    1.2737     0.2725   4.674 2.95e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So our male odds ratio is \(e^{1.2737} = 3.57\).

We interpret this as males having significantly higher odds (OR = 3.57, p < 0.001) of chronic heart disease in our cohort.
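The coefficients in the output are on the log-odds scale, so they can be converted to odds and to probabilities by hand. The sketch below copies the two estimates from the output and does the arithmetic in R; the variable names are illustrative only.

```r
# Coefficients copied from the output above
b0 <- -1.0438   # intercept: log-odds of CHD for females (the reference group)
b1 <-  1.2737   # genderMale: difference in log-odds, males versus females

odds_female <- exp(b0)
odds_male   <- exp(b0 + b1)
odds_ratio  <- odds_male / odds_female   # identical to exp(b1)

# plogis() is the inverse logit, converting log-odds to probabilities
p_female <- plogis(b0)
p_male   <- plogis(b0 + b1)

print(round(c(OR = odds_ratio, p_female = p_female, p_male = p_male), 3))
```

This gives estimated probabilities of roughly 0.26 for females and 0.56 for males. Note that the ratio of these probabilities is about 2.1, not 3.57: an odds ratio is not the same thing as a ratio of probabilities.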

Modelling CHD on Cholesterol

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.92999    0.57634  -1.614    0.107
chol         0.12103    0.08813   1.373    0.170

So we have our cholesterol odds ratio as \(e^{0.12103} = 1.13\).

We interpret this OR as saying that for every 1 mmol/L increase in a patient’s cholesterol level, we estimate their odds of having chronic heart disease are multiplied by 1.13, but this increase is not significant (p = 0.170).

Odds ratios combine multiplicatively, not additively. Hence if patient A had a cholesterol level of 4.5 mmol/L and patient B had a cholesterol level of 7.5 mmol/L (a difference of 3 mmol/L), then we would estimate patient B’s odds of having chronic heart disease to be \(e^{3 \times 0.12103} = 1.13^3 \approx 1.44\) times those of patient A.

Modelling CHD on Cholesterol, Age, and Gender

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.89977    1.10738  -5.328 9.95e-08 ***
age          0.06331    0.01527   4.146 3.39e-05 ***
chol         0.17803    0.09846   1.808   0.0706 .  
genderMale   1.62473    0.30451   5.336 9.52e-08 ***

Odds Ratios:

## (Intercept)         age        chol  genderMale 
## 0.002740087 1.065354308 1.194858024 5.077061469
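A minimal sketch of how a table like this can be produced in R is shown below. The data are simulated, with hypothetical "true" coefficients loosely in the spirit of the fitted model above; the variable names match the example but nothing here uses the seminar's actual data.

```r
# Simulated stand-in data: NOT the seminar's cholesterol data set.
set.seed(5)
n      <- 500
age    <- runif(n, 30, 80)
chol   <- rnorm(n, mean = 6, sd = 1.3)
gender <- factor(sample(c("Female", "Male"), n, replace = TRUE))

# Hypothetical true log-odds of chronic heart disease
log_odds <- -6 + 0.06 * age + 0.2 * chol + 1.6 * (gender == "Male")
chd      <- rbinom(n, size = 1, prob = plogis(log_odds))

# Binary logistic regression: a GLM with a binomial family (logit link by default)
fit_chd <- glm(chd ~ age + chol + gender, family = binomial)

odds_ratios <- exp(coef(fit_chd))                       # odds ratios
p_values    <- coef(summary(fit_chd))[, "Pr(>|z|)"]     # Wald test p-values
print(round(cbind(OR = odds_ratios, p = p_values), 4))
```

Each odds ratio here is adjusted for the other predictors in the model, which is why the gender odds ratio in the output above differs from the one obtained when gender was the only predictor.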

Q. Which variables were significant?

Mixed Models

Clustered Data

Clustered data is a term applied to data that can be grouped into a set of distinct homogeneous subgroups known as clusters. Observations within a cluster are more similar to each other than to observations outside the cluster.

Examples of the types of clusters we may see in clustered data include:

  • repeated observations from the same subject,
  • observations within the same hierarchical grouping (e.g. classes of students within a group of schools),
  • subjects within the same household / familial grouping,
  • organisms of the same species.

Mixed Effect Modelling

Linear mixed modelling is a term we use to describe mixed models that apply a linear model to clustered data. In standard linear regression modelling, a linear model is applied to independent data to describe the relationship between a dependent variable and a set of fixed predictor variables. In linear mixed modelling, we introduce a random term to account for the effect of clustering in clustered data.

Linear mixed modelling can be applied even when there are uneven numbers of measurements in each cluster and when there is missing data present.
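As a sketch of what the random term looks like, the two most common forms are written out below for observation \(i\) in cluster \(j\). The notation (\(u_{0j}\), \(u_{1j}\), \(\sigma^2_{u}\)) is an assumption here, chosen to extend the linear regression equation used earlier.

A random intercept model gives each cluster its own baseline level:

\[ y_{ij} = (\beta_0 + u_{0j}) + \beta_1 x_{ij} + \epsilon_{ij}, \qquad u_{0j} \sim N(0, \sigma^2_{u_0}), \quad \epsilon_{ij} \sim N(0, \sigma^2) \]

A random slopes model additionally lets the effect of \(x\) vary between clusters:

\[ y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, x_{ij} + \epsilon_{ij} \]

The coefficients \(\beta_0\) and \(\beta_1\) are the fixed effects shared by all clusters, and the \(u\) terms are the random effects.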

Random Intercepts

Random Slopes Model

Other Models

Time Series Models

Time series models apply to data which have a temporal (time-based) association.

Time Series Models (cont)

Time series models are appropriate when

  • values at one time point depend on or are influenced by previous time points,
  • there are long-term trends in time,
  • seasonality is present,
  • unpredictable fluctuations are present.

Common time series analyses include ARIMA modelling, GARCH modelling, and segmented linear regression.
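As a small illustration of the first point (values depending on previous time points), the sketch below simulates a first-order autoregressive series and fits an ARIMA model to it with base R. The series length and the coefficient of 0.7 are hypothetical values chosen only for the example.

```r
set.seed(6)
# Simulate 500 observations from an AR(1) process with coefficient 0.7
# (hypothetical values, for illustration only)
x <- arima.sim(model = list(ar = 0.7), n = 500)

# Fit an ARIMA(1, 0, 0) model, i.e. an AR(1) with a mean term
fit_ar <- arima(x, order = c(1, 0, 0))
print(fit_ar)

# Forecast the next 5 time points; fc$se holds the forecast standard errors
fc <- predict(fit_ar, n.ahead = 5)
print(fc$pred)
```

The estimated `ar1` coefficient should land close to the simulated value, and the forecasts drift back towards the series mean as the horizon grows.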

Survival Analysis

Survival models describe time-to-event data.

ECOG performance scores, as rated by the physician: 0 = asymptomatic, 1 = symptomatic but completely ambulatory, 2 = in bed <50% of the day, 3 = in bed >50% of the day but not bedbound, 4 = bedbound.

Decision Trees

Decision trees recursively partition the feature space into regions, each of which is assigned a simple prediction.

Neural Networks

There are lots more methods out there… but we will leave it there.

The Statistical Consulting Centre

  • How we can help - Currently the Statistical Consulting Centre provides each post-graduate student with a free initial consultation. Up to ten hours per calendar year of consulting time is provided without charge if research funding is not available. When students require more consulting time, or receive external funding, a service charge may be necessary.

To book an appointment with me…

  1. First discuss booking a consultation with one of your supervisors.

    • One of your supervisors must attend your first consultation.
  2. Go to the Statistical Consulting Centre website and select Make an Appointment.

  3. Fill out the form with your details and those of your chosen supervisor.

  4. We will then send you a link to book.

More News

We are also running a short course on data visualisation.

  • How to create graphs and figures for publication | 16-17th October 9:30am-12:30pm | Running Online

Advertised in Universe $110 (or $100) and on our website https://www.uow.edu.au/niasra/

Chat with your supervisor if you’re interested…

  • I’m running another seminar, Thinking Statistically: Principles of Statistics and Analysis, next week on Thursday at 9:30am in 67.102.

The Data Science and Statistics CoP

Need more info….