12th October 2023

Let me introduce myself

Brad Wakefield

I’m a Statistical Consultant at the Statistical Consulting Centre, NIASRA.

The Statistical Consulting Centre

  • Aim - The service aims to improve the statistical content of research carried out by members of the University. Researchers from all disciplines may use the Centre. Priority is currently given to staff members and postgraduate students undertaking research for Doctor of Philosophy or Masters’ degrees.

Fundamentals

What are Statistics?

Statistics are functions of observable random variables.

  • A function is a relation or equation that takes in an input and generates an output.

  • Random variables are functions that generate random output (sample data).

  • Statistics are functions that take observed random output (sample data) as input and produce a summary value as output.
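As a minimal sketch of these definitions in Python (the normal distribution and its parameters are assumptions made purely for illustration):

    import numpy as np

    rng = np.random.default_rng(42)

    # A random variable: a function that generates random output (sample data).
    sample = rng.normal(loc=10, scale=2, size=30)

    # A statistic: a function that takes the observed random output as input
    # and produces an output, here a single summary number.
    def sample_mean(x):
        return sum(x) / len(x)

    print(sample_mean(sample))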

Why do we use Statistics?

“Statistics is a body of methods for making wise decisions in the face of uncertainty” - Wallis & Roberts (1962)

“Statistics is the art and science of gathering, analyzing, and making inferences from data” - Mosteller et al. (1961)

“Statistics is the discipline concerned with the study of variability, with the study of uncertainty and with the study of decision-making in the face of uncertainty…. It is united through a common body of knowledge and a common heritage. A distinguishing feature of the statistics profession, and the methodology it develops, is the focus on a set of cautious principles for drawing scientific conclusions from data.” - Lindsay et al. (2004)

What kind of Statistics do we use?

In applied statistics, we usually consider two types of statistics.

  • Descriptive statistics - summarise particular aspects of the sample data.

  • Inferential statistics - make inferences about an unknown population.

To apply inferential statistics we often need to have a random sample.

Random Sample

A random sample is a set of independent, identically distributed observations.

  • Independent means each observation was selected independently of the others.

  • Identically distributed means they all came from the same population.

The way data is collected can have dramatic consequences on the ability to make inference.

How to Estimate the Population?

Given a random sample, we can apply probability theory.

Inferential statistics have known probability distributions that relate to the population parameters.

Population vs Sample

The term mean can apply to two different quantities.

Population Mean

  • Based on the entire population or theoretical distribution.

  • Usually notated as \(\mu\).

  • Is the expected value of the random variable.

Sample Mean

  • A calculable statistic based on sample data.

  • Usually notated as \(\bar{x}\) or \(\hat{\mu}\).

  • Is the average value in the sample.

We use the sample mean as an estimate of the population mean.

In notation we often place a hat on top of a population parameter \(\alpha\) to symbolise an estimate \(\hat{\alpha}\).

Study Design

The Research Question

  • Who / what am I studying?

  • How will I measure the phenomena?

  • What are my limitations?

  • Can my research question be tested with the data I’m collecting?

  • Will my research be meaningful?

Avoid ambiguity, be very clear as to what you are investigating.

Quantitative vs Qualitative

Quantitative analysis - is the process of using numerical and statistical methods to analyse and interpret data, in order to make objective conclusions or predictions based on the patterns, relationships, or trends found in the data.

Qualitative analysis - is the process of examining data such as words, images, or observations in order to identify patterns, themes, or meanings that can provide insights into human behaviour, attitudes, or experiences.

Mixed Methods Analysis

Mixed methods analysis is an approach to research that combines both quantitative and qualitative analysis methods in order to provide a comprehensive understanding of a particular phenomenon or topic.

Suppose a researcher produces a survey to obtain information on how a population may feel or respond to a certain issue. They also run various interviews and focus groups to get a sense of the deeper themes and motivations people have about those issues. This would be called a mixed methods study as it requires both quantitative analysis (of the survey) and qualitative analysis (of the interviews and focus groups).

Understand Your Study Type

  • Experimental studies - Involves controlling variables and performing an intervention.

  • Observational studies - Obtain data about the phenomena without intervention.

  • Reviews - Investigate and collate the findings of existing studies.

What Makes Data Representative?

Sampling design refers to how observations are selected or measured from the population.

  • Probability samples - chosen from a population using probability theory.

  • Non-probability samples - chosen from a population by convenience or some other non-random mechanism.

A good sampling design leads to representative data.

To make reliable inference, your sampling frame should match your target population as closely as possible.

Probability Sampling Methods

  • Simple Random Sampling - Each observation from the sampling frame is chosen with equal chance.
  • Systematic Sampling - Each observation is chosen in regular intervals from a sampling frame.
  • Stratified Sampling - The sampling frame is divided into subgroups, known as strata, of observations which share a similar characteristic. Sampling is then performed on each stratum so that a representation of each group is obtained.
  • Clustered Sampling - Subgroups of the population are used as a sampling unit, rather than individual observations. The entire subgroup is surveyed.
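As a rough sketch, here is how the first three designs might look in Python for a hypothetical numbered sampling frame (the frame, the strata labels and the sample sizes are all made up):

    import numpy as np

    rng = np.random.default_rng(1)
    frame = np.arange(1000)   # hypothetical frame of 1000 numbered units
    strata = frame % 4        # hypothetical stratum label for each unit

    # Simple random sampling: every unit has an equal chance of selection.
    srs = rng.choice(frame, size=100, replace=False)

    # Systematic sampling: every k-th unit after a random start.
    k = len(frame) // 100
    systematic = frame[rng.integers(k)::k]

    # Stratified sampling: sample within each stratum so that
    # every subgroup is represented.
    stratified = np.concatenate([
        rng.choice(frame[strata == s], size=25, replace=False)
        for s in np.unique(strata)
    ])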

Non-probability Sampling Methods

  • Quota Sampling - A population is divided into subgroups and a pre-decided number of people from each subgroup is selected. Selection of individuals in each subgroup is chosen freely by the data collector.
  • Purposive (Judgemental) Sampling - Observations are selected based on the views and judgement of the researcher. The researcher selects observations as they believe they will have a representative opinion of the population.
  • Snowball Sampling - Respondents recommend other respondents to be included in the study. Often used for rare populations.

Non-probability Sampling Methods

Continued…

  • Convenience Sampling - In convenience sampling the researcher selects data which is easily accessible to them. Self-selection sampling is a form of convenience sampling where people volunteer to participate.

More data does not necessarily mean more representative!!

What Makes Data Representative?

Asking more statisticians won’t make me any cooler.

Does Sample Size Really Matter?

Yes - sample size does still matter.

Increasing sample size reduces sampling error.

Sampling error is the difference between the population parameter and the sample estimate that arises purely because the estimate is based on a sample rather than the whole population.

Sampling error does not account for mistakes in the sampling design or collection leading to unrepresentative data. This is called non-sampling error.
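A small simulation makes this concrete. In the sketch below (the population mean of 50 and standard deviation of 10 are assumed purely for the simulation), the typical sampling error of the sample mean shrinks as the sample size grows:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 50, 10   # "true" values, known only because we simulate

    for n in (10, 100, 1000):
        means = [rng.normal(mu, sigma, size=n).mean() for _ in range(2000)]
        # Spread of the sample means around mu: the typical sampling error.
        print(n, round(float(np.std(means)), 3))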

Non-Sampling Error

Non-sampling error can only be controlled through good study design and implementation.

In reality, no study is perfect and we have constraints.

Although we may not be able to control for non-sampling error, we can investigate it.

  • Check the demographics of your cohort - is bias present?

  • Check literature to see whether problems have been identified.

  • Control for bias where possible.

What makes a good study?

Here are some important things to consider to make quality inference.

  • If testing the effect of an intervention, a control group is needed.

  • Remove / control confounding factors

  • Use sufficient sample size.

  • Randomisation is key to ensure generalisability.

  • Have a good sampling design.

  • Replication and reproducibility.

What Size is Sufficient?

THINK BEFORE FITTING

  • What are the different sub-groups I’m modelling?

  • Have I got enough observations in each sub-group to get a general idea?

  • Should I simplify my model?

Example: Suppose I want to model income based on age, education level, gender, country of birth …

There are calculations and rules of thumb for each analysis which can help decide.

What is Over-Fitting?

Over-fitting is the result of fitting an analysis too “closely” to an insufficient number of data points, leading to non-generalisable results.

Over-fit models have more parameters to estimate than there are data available to estimate them.
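As an illustration in numpy (the data and polynomial degrees are arbitrary), fitting a degree-7 polynomial to 8 points spends 8 parameters on 8 observations, so the curve reproduces the noise exactly and generalises poorly, while a 2-parameter line does not:

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0, 1, 8)                 # only 8 data points
    y = 2 * x + rng.normal(0, 0.2, size=8)   # truly linear relationship

    overfit = np.polyfit(x, y, deg=7)   # 8 parameters: fits the noise exactly
    simple = np.polyfit(x, y, deg=1)    # 2 parameters: generalises far better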

Types of Data

Variables

Variable - a quantity or characteristic that can take on different values or categories. Variables are used to represent and measure different attributes or features of a population or sample.

In machine learning, variables may be referred to as features or attributes.

Variable Types:

  • Quantitative (Numeric) - Variables which have a numeric meaning.

  • Qualitative (Categorical) - Variables which describe a categorisation of an attribute.

Quantitative Variables

  • Variables which have a numeric meaning.

    • Discrete - Numeric variable in countable units (maps to a subset of \(\mathbb{N}\)).

    • Continuous - Numeric variable given on any interval (maps to a subset of \(\mathbb{R}\)).

    Additional classification:

    • Interval Data - Differences are meaningful; zero is not.

    • Ratio Data - Differences and ratios are meaningful. Zero represents absence of property.

Qualitative Variables

  • Variables which describe a categorisation of an attribute.

    • Nominal - Categorical with no ordering structure.

    • Ordinal - Categorical with an ordering structure.

Confusingly, quantitative and qualitative data are both used in quantitative analysis.

Qualitative Data (cont)

We often have qualitative data within our data sets.

That could be…

  Age   Marital Status     Ethnicity   Education
  18    1. Never Married   1. White    1. < HS Grad
  24    1. Never Married   1. White    4. College Grad
  45    2. Married         1. White    3. Some College
  43    2. Married         3. Asian    4. College Grad
  50    4. Divorced        1. White    2. HS Grad

Qualitative Data (cont 2)

We often have qualitative data within our data sets.

Or it could be…

  Name   Age   Gender   Symptoms
  John   52    Male     Patient experienced shortness of breath, pain in arm
  Joe    47    Male     Pain was experienced by the patient in arm
  Jane   32    Female   Patient exhibited a fever on arrival
  Jack   19    Male     Coughing, sore throat, runny nose, fever

Analysing Qualitative Data

To analyse this data we need to use appropriate statistics.

  Education            Frequency
  1. < HS Grad         268
  2. HS Grad           971
  3. Some College      650
  4. College Grad      685
  5. Advanced Degree   426

Workers with a college or advanced degree make up 37% of the workers in the data set.
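In Python, for example, pandas produces such frequency tables and proportions directly (the tiny series below is made up for illustration):

    import pandas as pd

    education = pd.Series([
        "2. HS Grad", "4. College Grad", "3. Some College",
        "4. College Grad", "1. < HS Grad",
    ])

    print(education.value_counts())                # frequency table
    print(education.value_counts(normalize=True))  # proportions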

Analysing Qualitative Data (cont)

Or we need to compute some variables first…

  Name   Age   Gender   Symptoms                                                Pain in Arm?   Has Fever?
  John   52    Male     Patient experienced shortness of breath, pain in arm    Yes            No
  Joe    47    Male     Pain was experienced by the patient in arm              Yes            No
  Jane   32    Female   Patient exhibited a fever on arrival                    No             Yes
  Jack   19    Male     Coughing, sore throat, runny nose, fever                No             Yes
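One rough way to derive such indicator variables is keyword matching, sketched below with pandas (the keywords are assumptions; note how the rephrased second record defeats naive matching, which is why automated coding should be checked by hand):

    import pandas as pd

    symptoms = pd.Series([
        "Patient experienced shortness of breath, pain in arm",
        "Pain was experienced by the patient in arm",
        "Patient exhibited a fever on arrival",
    ])

    # Naive keyword coding: the second record is missed because its
    # wording differs, so results need manual verification.
    pain_in_arm = symptoms.str.contains("pain in arm", case=False)
    has_fever = symptoms.str.contains("fever", case=False)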

Analysing Qualitative Data (DOs)

DO

  • Use frequency and contingency tables.

  • Calculate modes and proportions.

  • Compare differences in other variables across its groups.

  • Use logistic regression (when a dependent variable).

  • Use bar plots to visualise them, or boxplots when comparing groups on a quantitative variable.

  • Treat it as a qualitative variable in analysis.

Analysing Qualitative Data (DON’TS)

DO NOT

  • Compute mean, median, variance, standard deviation.

  • Calculate correlations (unless ordinal).

  • Fit a linear regression (when a dependent variable).

  • Visualise with a line graph or scatterplot (unless ordinal).

  • Treat it as a quantitative variable in analysis.

Statistical Testing

What is Hypothesis Testing?

Hypothesis testing is a statistical analysis used to determine whether there is sufficient statistical evidence to reject a specific statement (null hypothesis) about a population parameter or distribution.

We do not prove things.

At the conclusion of a hypothesis test we either,

  • Conclude that we have sufficient evidence to reject a null hypothesis.

  • Conclude that we do not have sufficient evidence to reject a null hypothesis.

    • Does not mean the null hypothesis is true.

The Logic of a Hypothesis Test

We begin every hypothesis test with two hypotheses.

e.g. a null \(H_0: \mu =10\) vs an alternative \(H_1:\mu \neq 10\).

We assume the null hypothesis is true.

We compute a test statistic e.g. \(T = 2.65\).

Given the null hypothesis and the assumptions of the test, we know the probability distribution of the test statistic. (e.g. \(T \sim t_{9}\))
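For instance, a one-sample t-test of \(H_0: \mu = 10\) can be run with scipy (the simulated sample is an assumption made purely for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    x = rng.normal(11, 2, size=10)   # hypothetical sample of n = 10

    # Under H0 (and the test's assumptions) the statistic follows a
    # t distribution with n - 1 = 9 degrees of freedom.
    result = stats.ttest_1samp(x, popmean=10)
    print(result.statistic, result.pvalue)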

The Logic of a Hypothesis Test

Given a significance level (\(\alpha\)), we determine the “extremes” of our test statistic distribution.

The most extreme values, whose combined area adds up to \(\alpha\), make up the rejection region.

The Logic of a Hypothesis Test

If our observed test statistic value falls within the rejection region…

  • Our observed value is extreme and not consistent with what is expected (were our null hypothesis true).

  • Evidence suggests our null hypothesis is not true.

What are \(p\)-values?

\(p\)-values are often used with statistical software to perform hypothesis testing.

\(p\)-values are the probability, under the null hypothesis, of achieving test statistic values at least as extreme as the result actually observed.

The total area in orange is the \(p\)-value.

What are \(p\)-values?

To determine whether to reject a null hypothesis,

If \(p\)-value \(\leq \alpha\) reject the null hypothesis.

If \(p\)-value \(> \alpha\) do not reject the null hypothesis.
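Connecting the two ideas, the two-sided \(p\)-value for the earlier statistic \(T = 2.65\) with \(T \sim t_9\) can be computed directly (a sketch, not tied to any particular data set):

    from scipy import stats

    t_obs, df = 2.65, 9

    # Probability, under H0, of a statistic at least as extreme as t_obs.
    p_value = 2 * stats.t.sf(abs(t_obs), df)
    print(round(p_value, 3))   # about 0.026: below alpha = 0.05, so reject H0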

“Significance” is Significant

In statistics, significant means “significant with respect to some hypothesis test and significance level”.

Do not use the term significant unless you have performed a hypothesis test.

Statistically significant results account for the size of the difference relative to the variation in the data.

Results with large differences in means can be described as “considerable” but not significant.

“Significance” is Significant

Example 1

Consider the difference between 100 observations of the two variables \(X\sim N(200,1000^2)\) and \(Y\sim N(300,1000^2)\).

With sample means \(\bar{x} = 63 \text{ and } \bar{y}= 205\).

Performing a Student’s t-test we conclude there is not a significant difference in means.

“Significance” is Significant

Example 2

Consider the difference between 100 observations of the two variables \(X\sim N(2.1,1^2)\) and \(Y\sim N(2.5,1^2)\).

With sample means \(\bar{x} = 1.96 \text{ and } \bar{y}= 2.40\).

Performing a Student’s t-test we conclude there is a significant difference in means.
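Both examples are easy to replicate (the seed, and hence the exact \(p\)-values, are arbitrary; the qualitative conclusion is the point):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    # Example 1: a difference of 100 in means is swamped by a spread of 1000.
    x1, y1 = rng.normal(200, 1000, 100), rng.normal(300, 1000, 100)
    # Example 2: a difference of 0.4 stands out against a spread of 1.
    x2, y2 = rng.normal(2.1, 1, 100), rng.normal(2.5, 1, 100)

    print(stats.ttest_ind(x1, y1).pvalue)   # typically well above 0.05
    print(stats.ttest_ind(x2, y2).pvalue)   # typically below 0.05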

Confidence Interval Plots

To get a better sense of the differences between these two examples we can look at the 95% confidence interval plot.

You have to take into account variation in the data!

What is a Confidence Interval?

A 95% confidence interval for a parameter is an interval, computed from sample statistics, that would contain the true parameter value 95% of the time if we repeatedly computed it from different random samples of the population.

  • The true value of a parameter is not random in frequentist statistics.

  • Computing probabilities of the true value being any value(s) does not make sense.

  • Confidence intervals give a sense of the uncertainty in the estimate within the scale of the estimate.

  • Some statisticians and journals prefer CIs over \(p\)-values.
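As a sketch, a 95% confidence interval for a population mean can be built from the sample mean, its standard error, and a t critical value (the simulated data are illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    x = rng.normal(10, 2, size=30)

    n = len(x)
    se = x.std(ddof=1) / np.sqrt(n)         # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value

    ci = (x.mean() - t_crit * se, x.mean() + t_crit * se)
    print(ci)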

Degrees of Freedom

The degrees of freedom of a statistical analysis are the maximum number of logically independent values, which are values that have the freedom to vary, in the data sample.

Degrees of freedom can be thought of as the number of independent pieces of information used to obtain an estimate.

Consider the sample variance:

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2 \]

This estimate has \(n-1\) degrees of freedom.
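This is why numerical libraries expose a “delta degrees of freedom” option; in numpy, ddof=1 gives the \(n-1\) divisor above (the data values are arbitrary):

    import numpy as np

    x = np.array([4.0, 7.0, 9.0, 10.0])

    # Once x-bar is fixed, only n - 1 deviations are free to vary,
    # so the sample variance divides by n - 1 rather than n.
    s2 = x.var(ddof=1)
    manual = ((x - x.mean()) ** 2).sum() / (len(x) - 1)
    print(np.isclose(s2, manual))   # True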

Parametric Assumptions

A parametric assumption refers to an assumption about how our distribution is “parameterised.”

For many common statistical analyses, parametric assumptions are often used.

We use parametric assumptions to simplify problems, have more certainty in estimates, and be more conclusive about differences.

ALWAYS CHECK THE SPECIFIC ASSUMPTIONS OF YOUR ANALYSIS

The Normal Assumption

Some analyses require normally distributed data.

  • e.g. t-tests, ANOVA, Pearson correlations, etc.

Some analyses require aspects (like residuals) to be normally distributed, but not the actual data.

  • e.g. linear regression, ARIMA modelling, etc.

Some analyses do not have a parametric assumption (non-parametric) … but they do have others.

  • e.g. Wilcoxon test, Mann-Whitney test, Kruskal-Wallis test, etc.

Checking Assumptions

Models often have more assumptions than just normality to consider.

Other assumptions include:

  • Constant variances, homogeneity of variance, homoscedasticity,…

  • Independence between observations …

  • Sufficient sample sizes (for asymptotic methods) …

  • Linearity and no multicollinearity …

All relevant assumptions should be checked.

The \(p\)-Hacking Problem

Many researchers feel the need to have statistically significant results.

Non-statistically significant results are still a result (but should not be over-interpreted).

We should select statistical models to test based on what is best for the research objective.

Changing your analysis so you get a significant result is not just bad stats, it’s unethical.

The False Positive Bias

When we incorrectly reject a null hypothesis (incorrectly claim a significant difference), we have a type I error - a false positive.

If we were to use a significance level of \(\alpha=0.05\), and repeat a hypothesis test on 100 different samples where the null hypothesis is actually true, we would expect to make a type I error 5% of the time (1 in 20).

It follows that the more tests you do … the more chance you obtain a false positive result.

Hypothesis testing should be planned sparingly. If multiple comparisons are used, a Bonferroni correction should be applied.
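A sketch of the Bonferroni rule (the \(p\)-values below are made up): with \(m\) tests, compare each \(p\)-value against \(\alpha/m\).

    # Hypothetical p-values from m = 5 planned comparisons.
    alpha, m = 0.05, 5
    p_values = [0.003, 0.012, 0.04, 0.2, 0.6]

    # Bonferroni correction: test each p-value against alpha / m = 0.01.
    significant = [p <= alpha / m for p in p_values]
    print(significant)   # [True, False, False, False, False]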

Common Mistakes

Correlation vs Causation

Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.

A linear correlation between two variables suggests that as one increases, the other increases (if positively correlated) or decreases (if negatively correlated).

Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events.

A correlation between variable X and Y does not mean X causes Y (or vice versa).
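A quick simulation shows how a hidden common cause produces correlation without causation (the confounder setup is assumed purely for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    z = rng.normal(size=200)        # hidden confounder
    x = z + rng.normal(size=200)    # X depends on Z
    y = z + rng.normal(size=200)    # Y depends on Z, not on X

    # X and Y are clearly correlated (about 0.5 in theory)
    # even though neither causes the other.
    print(np.corrcoef(x, y)[0, 1])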

Spurious Correlations

Correlation vs Causation

In order to make causal inference we usually need to conduct an experiment or a controlled study.

A controlled study involves controlling external sources of variation and performing a specific intervention.

Inference about correlation can be made from uncontrolled studies.

Implementing a good study or experimental design ensures you can make quality inference.

What is Circular Analysis?

Circular analysis is any analysis that retrospectively selects features of the data to characterise variables of interest, resulting in a distortion of the statistical analysis.

Circular analysis is sometimes called double dipping.

Researchers are at risk of performing circular analysis when they manipulate data into groups after already performing analysis.

Consider the following example…

What is Circular Analysis?

Suppose I did a study on the effectiveness of a specific medication in treating memory loss.

I compare the pre-treatment memory scores of 100 subjects to their post-treatment memory scores.

I found that 50 people saw improvements in their memory scores and 50 people did not.

Overall there was not a significant improvement in memory score across the 100 participants.

I group the data into two groups: 1) those who had an improvement and 2) those who did not.

I get a significant difference in memory scores in group 1.
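The distortion is easy to reproduce by simulation (a sketch assuming no true effect at all):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)

    # Pure noise: the "treatment" changes nothing on average.
    change = rng.normal(0, 1, size=100)
    print(stats.ttest_1samp(change, 0).pvalue)      # not significant

    # Circular analysis: define the "improvers" group after seeing the data.
    improvers = change[change > 0]
    print(stats.ttest_1samp(improvers, 0).pvalue)   # spuriously "significant"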

Inflating Sample Sizes

The experimental unit is the smallest observation that can be randomly assigned.

The observational unit is the smallest level in which a measurement is taken.

In longitudinal studies, the experimental unit is each subject / participant; the observational unit is each recorded measurement within each subject.

The number of experimental units should be used when considering degrees of freedom (minus the number of parameters estimated).

Using the number of observational units artificially inflates the sample size.
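One simple guard, sketched in pandas below (the tiny data frame is hypothetical), is to aggregate to one value per experimental unit before a simple analysis:

    import pandas as pd

    # Longitudinal data: three measurements per subject.
    df = pd.DataFrame({
        "subject": ["A", "A", "A", "B", "B", "B"],
        "score":   [5.1, 5.3, 5.0, 6.2, 6.0, 6.4],
    })

    per_subject = df.groupby("subject")["score"].mean()
    print(len(df), len(per_subject))   # 6 measurements, but n = 2 subjects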

Inflating Sample Sizes

Recording more observations from the same experimental unit does not reduce the number of experimental units required.

Asking 10 people 100 questions is not the same as asking 100 people 10 questions.

I could go on longer, but it’s best to leave it there…

We’re here to help…

How can we help?

Currently the Statistical Consulting Centre provides each post-graduate student with a free initial consultation. Up to ten hours per calendar year of consulting time is provided without charge if research funding is not available. When students require more consulting time, or receive external funding, a service charge may be necessary.

What do we help with?

  • The planning of experiments,

  • Designing questionnaires,

  • Data collection,

  • Data entry and management,

  • Statistical analyses,

  • The presentation of results.

To book an appointment with me…

  1. First discuss booking a consultation with one of your supervisors.

    • One of your supervisors must attend your first consultation.
  2. Go on to the Statistical Consulting Centre website and select

    Make an Appointment.

  3. Fill out the form with your details and your chosen supervisor’s details.

  4. We will then send you a link to book.

More News

We are also running a short course on data visualisation.

  • How to create graphs and figures for publication | 16-17th October 9:30am-12:30pm | Running Online

    Advertised in Universe and on our website https://www.uow.edu.au/niasra/ for $110 (or $100).

    Chat with your supervisor if you’re interested…

  • NIASRA will be hosting a talk next week on Thursday 2pm for Global Climate Change Week.

    “Antarctic biodiversity modelling with uncertainty quantification” - check NIASRA website for details

The Data Science and Statistics CoP

Need more info…