5th October 2023
Statistical models are mathematical representations or formal descriptions of real-world phenomena or data, constructed using statistical methods and techniques.
\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \qquad \epsilon_i \sim N(0,\sigma^2) \]
Statistical modelling is a process of creating mathematical or computational representations of real-world phenomena or data using statistical techniques.
Data-driven - informed by empirical observations.
Probabilistic (stochastic) - incorporate random variation.
Mathematical - underpinned by mathematical relationships (equations).
Assumption-based - rely on assumed knowledge about the world.
Diverse - there are a variety of ways of approaching the same problem.
Common modelling methods we use include linear regression, generalised linear models (GLMs), linear mixed models, time series models, survival models, and decision trees.
Although we are not going to look at all these methods, there are principles that apply to them all.
When applying statistical models, broadly speaking there are two central motivations we can have:
To investigate relationships, associations, and trends in data, draw conclusions, and make statistical inferences.
To predict and forecast outcomes based on existing or historical data and to estimate the level of uncertainty in these predictions.
The dependent variable is also referred to as the response or outcome variable.
The independent variables are also referred to as predictor, explanatory, or covariate variables.
Independent variables may also be classed as either exposure or treatment variables to describe the variables that are being studied for their effect on the dependent variable.
Confounding variables that are being controlled or accounted for are often referred to as moderator or control variables, or effect modifiers.
All of these terms (dependent, independent, confounding, etc.) are prescribed by us based on our research question.
For different questions, there will be different terms.
Aim: To explore the association of age and gender with cholesterol.
In the previous example we considered multiple variables (cholesterol, age, and gender).
Multivariate modelling refers to the modelling of multiple dependent variables.
Multivariable modelling refers to the modelling of a single dependent variable with multiple independent variables.
Regression models feature unknown parameters, called coefficients, that quantify the magnitude and direction of the relationship between the dependent variable and the independent variables.
Remember \(y = mx+b\) ?
What if we wrote it like…
\[ y = \beta_0 + \beta_1x \]
And what if we had error attached to each of our \(y\) terms…
\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i \]
Where \(\epsilon_i\) denotes the error term for each of our \(i\)th values.
In practice, the intercept \(\beta_0\) and slope \(\beta_1\) are unknown.
Therefore these are our unknown parameters (coefficients), and hence this is a regression problem.
More specifically, as we are regressing on only one independent variable, we call this simple linear regression.
Regression analysis refers to the statistical process by which these coefficients are estimated based on an observed sample of the data and the process of checking assumptions of the model.
We determine estimates for the coefficients \(\beta_0\) and \(\beta_1\) by minimising the residual (error) sum of squares.
We can think of the residuals of our model as the difference between the observed \(y\)-value for a given value of \(x\) and the prediction \(\hat{y}\) obtained from the estimated values \(\hat{\beta}_0\) and \(\hat{\beta}_1\) of our intercept and slope parameters.
\[e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1x_i . \]
We put a hat \(\hat{}\) on top of a term to represent an estimate of that term.
We call the values \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimise the residual sum of squares \((e_1^2 + e_2^2 + \cdots + e_n^2)\) the least squares estimates.
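As a minimal sketch of how these estimates can be computed directly, here are the closed-form least squares solutions in R (the x and y vectors are hypothetical data):

# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0, b1)  # agrees with coef(lm(y ~ x))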
The linear regression model equation is given by,
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon.\]
We assume the error terms \(\epsilon\) are independent and identically distributed, following a normal distribution with constant variance.
The key assumptions required in linear regression are linearity of the relationship, independence of the errors, normality of the errors, and constant (homoscedastic) error variance.
We'll consider our cholesterol data again.
If we were to fit a linear regression model in R, the output looks like this…
##
## Call:
## lm(formula = chol ~ age, data = cholesterol)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2121 -0.8496 -0.1429  0.7267  7.8135
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.754221   0.468348  10.151  < 2e-16 ***
## age         0.030110   0.008471   3.554 0.000441 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.319 on 295 degrees of freedom
## Multiple R-squared:  0.04106,  Adjusted R-squared:  0.03781
## F-statistic: 12.63 on 1 and 295 DF,  p-value: 0.0004413
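This output comes from a call like the following (the data frame and variable names are taken from the Call line of the output itself):

model <- lm(chol ~ age, data = cholesterol)
summary(model)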
Our estimates:
\[ \hat{\beta_0} = 4.75, \quad\hat{\beta_1}=0.03 \]
The estimated line of best fit:
\[ \hat{y} = 4.75 + 0.03x \]
The estimated variance of our residuals:
\[ \hat\sigma^2_\epsilon = 1.319^2 = 1.74 \]
This is called our mean square error.
We also conduct a hypothesis test to see if our estimated coefficients are significantly different from 0.
Note, a coefficient is significant if its \(p\)-value is \(\leq 0.05\) (in general).
As the \(p\)-value associated with age was 0.0004, we say age has a significant effect on cholesterol, or that age is a significant predictor of cholesterol.
To check assumptions we need to obtain diagnostic plots.
Residual vs Fitted plot
Normal Q-Q plot
Scale-Location plot
Residuals vs Leverage
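In R, these four plots can be obtained directly from the fitted model object; a minimal sketch, assuming the model object fitted earlier:

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(model)           # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage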
There are a variety of model quality measures:
\(R^2\) value - measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model. It ranges from 0 to 1, where higher values indicate a better fit.
F statistic - tests whether the model explains significantly more variation than a model with no predictors.
Mean square error - smaller values indicate less error (but it is influenced by the scale of the data).
AIC and BIC - measure goodness of fit while penalising model complexity (see the sketch after this list).
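The \(R^2\) and F statistic appear in the summary output below; AIC and BIC can be extracted separately. A minimal sketch, assuming the fitted model object from earlier:

AIC(model)  # Akaike information criterion
BIC(model)  # Bayesian information criterion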
Recall our model summary output:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.754221 0.468348 10.151 < 2e-16 ***
age 0.030110 0.008471 3.554 0.000441 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.319 on 295 degrees of freedom
Multiple R-squared: 0.04106, Adjusted R-squared: 0.03781
F-statistic: 12.63 on 1 and 295 DF, p-value: 0.0004413
What happens when we have a predictor that is categorical (like gender)?
Linear regression requires the use of numeric predictors.
To express categorical variables as numeric predictors, we implement a dummy coding of \(l-1\) binary variables for the \(l\) different levels of the categorical variable. \[ X_1 = \Bigg\{ \begin{array}{ll} 1, & \text{if Gender = Male}\\ 0, & \text{if Gender = Female} \end{array} \]
If we had a variable with 3 levels (like remoteness area) we would have two variables that look like this… \[ X_2 = \Bigg\{ \begin{array}{ll} 1, & \text{if Area = Regional}\\ 0, & \text{if Area = City} \end{array}, \] \[ X_3 = \Bigg\{ \begin{array}{ll} 1, & \text{if Area = Remote}\\ 0, & \text{if Area = City} \end{array} \]
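In R this dummy coding happens automatically when a predictor is stored as a factor; a minimal sketch using model.matrix with hypothetical data:

# Hypothetical factor with three levels; R builds l-1 dummy columns,
# using the first level (City) as the reference.
area <- factor(c("City", "Regional", "Remote", "City"))
model.matrix(~ area)  # intercept plus dummy columns areaRegional and areaRemote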
If we were to add the Gender variable to our previous model, we end up with…
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.241091 0.485458 10.796 < 2e-16 ***
age 0.027626 0.008377 3.298 0.00109 **
genderMale -0.519234 0.161812 -3.209 0.00148 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.299 on 294 degrees of freedom
Multiple R-squared: 0.07351, Adjusted R-squared: 0.06721
F-statistic: 11.66 on 2 and 294 DF, p-value: 1.335e-05
Both lines have the same slope with respect to age.
The Male line has been shifted down by 0.52 units.
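From the output above, the two fitted lines are:
\[ \text{Female: } \hat{y} = 5.24 + 0.028x, \qquad \text{Male: } \hat{y} = (5.24 - 0.52) + 0.028x = 4.72 + 0.028x \]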
But what if we want to have a different slope for each case?
To allow this, we can model an interaction between age and gender.
An interaction term represents how the effect of one independent variable on the dependent variable depends on or changes with the value of another independent variable.
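In R, an interaction is specified with * (or :) in the model formula; a sketch consistent with the output below:

model_int <- lm(chol ~ age * gender, data = cholesterol)  # main effects plus age:gender
summary(model_int)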
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.17577 0.79752 5.236 3.14e-07 ***
age 0.04673 0.01411 3.313 0.00104 **
genderMale 1.10236 0.97828 1.127 0.26074
age:genderMale -0.02942 0.01750 -1.681 0.09391 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.295 on 293 degrees of freedom
Multiple R-squared: 0.08236, Adjusted R-squared: 0.07296
F-statistic: 8.766 on 3 and 293 DF, p-value: 1.385e-05
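From this output, the estimated age slope now differs by gender:
\[ \text{Female slope: } 0.0467, \qquad \text{Male slope: } 0.0467 - 0.0294 = 0.0173 \]
(though note the interaction term itself is not significant at the 0.05 level, p = 0.094).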
Generalised Linear Models (GLMs) extend the linear regression framework.
Dependent variable data types that can be modelled with GLMs include binary outcomes, counts, proportions, and other non-normally distributed responses.
Binary logistic regression is used to model the odds of a binary outcome.
For example:
\[ Y=\Bigg\{\begin{array}{cl} 1, & \text{if an individual has heart disease,} \\ 0, & \text{if an individual does not have heart disease.}\end{array} \]
The odds of an event occurring refers to the probability of the event occurring divided by the probability of the event not occurring.
\[ Odds = \frac{p}{1-p} \]
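For example, if an event occurs with probability \(p = 0.8\), the odds are \(0.8/0.2 = 4\); the event is four times as likely to occur as not.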
Just like in linear regression, when we fit a binary logistic regression we obtain estimated coefficients \(\hat\beta_i\).
However, we usually use these values to obtain the odds ratios \(e^{\hat\beta_i}\).
Odds ratios quantify the effect of a predictor on increasing (or decreasing) the odds of the outcome.
For example: Suppose I wanted to compare the differences between the odds of men and women having chronic heart disease in my data.
I determine that men are 3.6 times more likely to suffer from chronic heart disease than women.
Let's consider the same cholesterol data, but note that some of the participants have chronic heart disease:
Fitting a binary logistic regression in R we obtain:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0438 0.2326 -4.488 7.18e-06 ***
genderMale 1.2737 0.2725 4.674 2.95e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
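These estimates come from a call like the following sketch; I am assuming the binary heart disease indicator is named chd:

# 'chd' is an assumed name for the heart disease indicator (0/1)
fit <- glm(chd ~ gender, family = binomial, data = cholesterol)
summary(fit)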
So our male odds ratio is \(e^{1.2737} = 3.57\).
We interpret this as: males in our cohort are significantly more likely (OR = 3.57, p < 0.001) to have chronic heart disease.
Now fitting cholesterol as the predictor instead:

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.92999 0.57634 -1.614 0.107
chol 0.12103 0.08813 1.373 0.170
So we have our cholesterol odds ratio as \(e^{0.12103} = 1.13\).
We interpret this OR as saying that for every mmol/L increase in a patient's cholesterol level, we estimate their odds of having chronic heart disease increase by a factor of 1.13, but this increase is not significant (p = 0.170).
Hence if patient A had a cholesterol level of 4.5 mmol/L and patient B had a cholesterol level of 7.5 mmol/L, we would estimate patient B to have \(1.13^3 \approx 1.44\) times greater odds of having chronic heart disease than patient A (the odds ratio multiplies across each one-unit increase).
Fitting age, cholesterol, and gender together:

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.89977 1.10738 -5.328 9.95e-08 ***
age 0.06331 0.01527 4.146 3.39e-05 ***
chol 0.17803 0.09846 1.808 0.0706 .
genderMale 1.62473 0.30451 5.336 9.52e-08 ***
Odds Ratios:
## (Intercept)         age        chol  genderMale
## 0.002740087 1.065354308 1.194858024 5.077061469
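These odds ratios are simply the exponentiated coefficients; a sketch, again assuming the outcome variable is named chd:

fit_full <- glm(chd ~ age + chol + gender, family = binomial, data = cholesterol)
exp(coef(fit_full))  # odds ratios shown above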
Clustered data is a term applied to data that can be grouped into a set of distinct homogeneous subgroups known as clusters. Within each cluster, the observations are more similar than to observations outside the cluster.
Examples of the types of clusters we may see in clustered data include:
Linear mixed modelling is a term we use to describe mixed models that apply a linear model to clustered data. In standard linear regression modelling, a linear model is applied to independent data to describe the relationship between a dependent variable and a set of fixed predictor variables. In linear mixed modelling, we introduce a random term to account for the effect of clustering in clustered data.
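A minimal sketch using the lme4 package, assuming our cholesterol measurements were clustered within a hypothetical clinic variable:

library(lme4)
# Random intercept (1 | clinic) accounts for correlation within each clinic
mixed_fit <- lmer(chol ~ age + gender + (1 | clinic), data = cholesterol)
summary(mixed_fit)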
Time series models apply to data which have a temporal (time-based) association.
Time series models are appropriate when observations are collected sequentially over time and nearby observations are correlated (for example, data exhibiting trend or seasonality).
Common time series analyses include ARIMA modelling, GARCH modelling, and segmented linear regression.
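As a sketch, an ARIMA model can be fitted with base R's arima() function (the series here is hypothetical):

y <- ts(cumsum(rnorm(100)))             # hypothetical time series
fit_ts <- arima(y, order = c(1, 1, 1))  # ARIMA(1,1,1)
predict(fit_ts, n.ahead = 12)           # forecast 12 steps ahead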
Survival models analyse time-to-event data, i.e. the time until an event of interest occurs.
ECOG performance score as rated by the physician: 0 = asymptomatic, 1 = symptomatic but completely ambulatory, 2 = in bed <50% of the day, 3 = in bed >50% of the day but not bedbound, 4 = bedbound.
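A minimal survival sketch with the survival package, whose built-in lung data includes the ECOG score described above:

library(survival)
# Kaplan-Meier survival curves by sex in the lung data
km <- survfit(Surv(time, status) ~ sex, data = lung)
plot(km)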
Decision trees recursively partition the feature space into simple regions, which makes predictions easy to compute and interpret.
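A sketch using the rpart package, reusing the assumed chd outcome from earlier:

library(rpart)
# Classification tree predicting heart disease from age, cholesterol, and gender
tree_fit <- rpart(chd ~ age + chol + gender, data = cholesterol, method = "class")
plot(tree_fit); text(tree_fit)  # draw the tree and label the splits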
Discuss with one of your supervisors first about booking a consultation.
Go on to the Statistical Consulting Centre website and select Make an Appointment.
Fill out the form with your details and your chosen supervisor's details.
We will then send you a link to book.
We are also running a short course on data visualisation.
It is advertised in Universe and on our website https://www.uow.edu.au/niasra/ and costs $110 (or $100).
Chat with your supervisor if you’re interested…
If you have any questions feel free to email me…
bradleyw@uow.edu.au
or check out the SCC website…
https://www.uow.edu.au/niasra/our-research/statistical-consulting-centre/
also have a look at the NIASRA website…
https://www.uow.edu.au/niasra/