12th October 2023
A function is a relation or equation that takes in an input and generates an output.
Random variables are functions that generate random output (sample data).
Statistics are functions that take the observed random output (sample data) as input and produce a summary output.
“Statistics is a body of methods for making wise decisions in the face of uncertainty” - Wallis & Roberts (1962)
“Statistics is the art and science of gathering, analyzing, and making inferences from data” - Mosteller et al. (1961)
Statistics is the discipline concerned with the study of variability, with the study of uncertainty and with the study of decision-making in the face of uncertainty….It is united through a common body of knowledge and a common heritage. A distinguishing feature of the statistics profession, and the methodology it develops, is the focus on a set of cautious principles for drawing scientific conclusions from data. - Lindsay et al. (2004)
In applied statistics, we usually consider two types of statistics.
Descriptive statistics - summarise particular aspects of the sample data.
Inferential statistics - make inference on an unknown population.
To apply inferential statistics we often need to have a random sample.
A random sample is a set of independent, identically distributed observations.
Independent means they were selected independently of other observations.
Identically distributed means they all came from the same population.
Given a random sample, we can apply probability theory …
Inferential statistics have known probability distributions that relate to the population parameters.
The term mean can apply to two different quantities.
| Population Mean | Sample Mean |
|---|---|
| \(\mu\): a fixed, usually unknown, parameter of the population | \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\): a statistic computed from the sample |
We use the sample mean as an estimate of the population mean.
In notation we often place a hat on top of a population parameter \(\alpha\) to symbolise an estimate \(\hat{\alpha}\).
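A minimal sketch of this idea in Python (the population values \(\mu = 170\), \(\sigma = 10\) are made up for illustration):

```python
# Minimal sketch: draw an iid sample and use the sample mean x-bar as the
# estimate mu-hat of the population mean mu. Values here are made up.
import numpy as np

rng = np.random.default_rng(seed=1)
mu, sigma = 170, 10                         # assumed "true" population parameters
sample = rng.normal(mu, sigma, size=50)     # 50 iid observations

mu_hat = sample.mean()                      # sample mean as the estimate of mu
print(f"mu = {mu}, estimate mu-hat = {mu_hat:.2f}")
```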
Who / what am I studying?
How will I measure the phenomena?
What are my limitations?
Can my research question be tested with the data I’m collecting?
Will my research be meaningful?
Quantitative analysis - is the process of using numerical and statistical methods to analyse and interpret data, in order to make objective conclusions or predictions based on the patterns, relationships, or trends found in the data.
Qualitative analysis - is the process of examining data such as words, images, or observations in order to identify patterns, themes, or meanings that can provide insights into human behavior, attitudes, or experiences.
Mixed methods analysis is an approach to research that combines both quantitative and qualitative analysis methods in order to provide a comprehensive understanding of a particular phenomenon or topic.
Suppose a researcher produces a survey to obtain information on how a population may feel or respond to a certain issue. They also run various interviews and focus groups to get a sense of the deeper themes and motivations people have about those issues. This would be called a mixed methods study as it requires both quantitative analysis (of the survey) and qualitative analysis (of the interviews and focus groups).
Experimental studies - Involves controlling variables and performing an intervention.
Observational studies - Obtain data about the phenomena without intervention.
Reviews - Investigate and collate the findings of existing studies.
Sampling design refers to how observations are selected or measured from the population.
Probability samples - chosen from a population using probability theory.
Non-probability samples - chosen from a population by convenience or some other non-random mechanism.
A good sampling design leads to representative data.
Asking more statisticians won’t make me any cooler.
Sampling error is the difference between the population parameter and the sample estimate that arises because the estimate is computed from a sample rather than the whole population.
Sampling error does not account for mistakes in the sampling design or collection leading to unrepresentative data. This is called non-sampling error.
In reality, no study is perfect and we have constraints.
Check the demographics of your cohort - is bias present?
Check literature to see whether problems have been identified.
Control for bias where possible.
Here are some important things to consider to make quality inference.
If testing the effect of an intervention, a control group is needed.
Remove / control confounding factors
Use sufficient sample size.
Randomisation is key to ensure generalisability.
Have a good sampling design.
Replication and reproducibility.
What are the different sub-groups I’m modeling?
Have I got enough observations in each sub-group to get a general idea?
Should I simplify my model?
Example: Suppose I want to model income based on age, education level, gender, country of birth …
Over-fitting is the process of performing analysis too “closely” to an insufficient number of data points, leading to non-generalisable results.
Over-fitted models have more parameters to estimate than the available data can support.
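A minimal sketch of the idea with made-up data: a degree-9 polynomial has as many parameters as the 10 data points, so it fits the sample almost exactly but typically predicts new data from the same process worse than a simple straight line.

```python
# Minimal sketch of over-fitting (made-up data): compare out-of-sample error
# of a simple linear fit against an over-parameterised polynomial fit.
import numpy as np

rng = np.random.default_rng(seed=2)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(0, 0.2, size=10)      # true relationship is linear

simple = np.polyfit(x, y, deg=1)             # 2 parameters for 10 points
overfit = np.polyfit(x, y, deg=9)            # 10 parameters for 10 points

x_new = np.linspace(0.05, 0.95, 10)          # new data from the same process
y_new = 2 * x_new + rng.normal(0, 0.2, size=10)

for name, coef in [("simple", simple), ("over-fit", overfit)]:
    mse = np.mean((np.polyval(coef, x_new) - y_new) ** 2)
    print(f"{name}: out-of-sample MSE = {mse:.3f}")
```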
Variable - a quantity or characteristic that can take on different values or categories. Variables are used to represent and measure different attributes or features of a population or sample.
In machine learning, variables may be referred to as features or attributes.
Quantitative (Numeric) - Variables which have a numeric meaning.
Qualitative (Categorical) - Variables which describe a categorisation of an attribute.
Variables which have a numeric meaning.
Discrete - Numeric variable in countable units
(maps to \(\subseteq \mathbb{N}\)).
Continuous - Numeric variable given on any interval
(maps to \(\subseteq \mathbb{R}\)).
Additional classification:
Interval Data - Differences are meaningful; zero is not.
Ratio Data - Differences and ratios are meaningful. Zero represents absence of property.
Variables which describe a categorisation of an attribute.
Nominal - Categorical with no ordering structure.
Ordinal - Categorical with an ordering structure.
Confusingly, quantitative and qualitative data are both used in quantitative analysis.
We often have qualitative data within our data sets.
That could be…
| | Age | Marital Status | Ethnicity | Education |
|---|---|---|---|---|
| 231655 | 18 | 1. Never Married | 1. White | 1. < HS Grad |
| 86582 | 24 | 1. Never Married | 1. White | 4. College Grad |
| 161300 | 45 | 2. Married | 1. White | 3. Some College |
| 155159 | 43 | 2. Married | 3. Asian | 4. College Grad |
| 11443 | 50 | 4. Divorced | 1. White | 2. HS Grad |
Or it could be…
| Name | Age | Gender | Symptoms |
|---|---|---|---|
| John | 52 | Male | Patient experienced shortness of breath, pain in arm |
| Joe | 47 | Male | Pain was experienced by the patient in arm |
| Jane | 32 | Female | Patient exhibited a fever on arrival |
| Jack | 19 | Male | Coughing, sore throat, runny nose, fever |
To analyse this data we need to use appropriate statistics.
| Education | Frequency |
|---|---|
| 1. < HS Grad | 268 |
| 2. HS Grad | 971 |
| 3. Some College | 650 |
| 4. College Grad | 685 |
| 5. Advanced Degree | 426 |
Workers with a college or advanced degree make up 37% of the workers in the data set.
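A minimal sketch of that calculation in Python / pandas (the counts are taken from the table above; with raw data you would apply value_counts() to the education column):

```python
# Minimal sketch: turn the frequency table above into proportions and check
# the share of workers with a college or advanced degree.
import pandas as pd

counts = pd.Series({"1. < HS Grad": 268, "2. HS Grad": 971,
                    "3. Some College": 650, "4. College Grad": 685,
                    "5. Advanced Degree": 426})

proportions = counts / counts.sum()
print(proportions.round(3))
print(proportions[["4. College Grad", "5. Advanced Degree"]].sum())   # ~0.37
```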
Or we need to compute some variables first…
| Name | Age | Gender | Symptoms | Pain in Arm? | Has Fever? |
|---|---|---|---|---|---|
| John | 52 | Male | Patient experienced shortness of breath, pain in arm | Yes | No |
| Joe | 47 | Male | Pain was experienced by the patient in arm | Yes | No |
| Jane | 32 | Female | Patient exhibited a fever on arrival | No | Yes |
| Jack | 19 | Male | Coughing, sore throat, runny nose, fever | No | Yes |
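A minimal sketch of how such indicator variables might be computed from the free-text symptoms; the simple keyword rule below is an assumption for illustration, and a real study might require careful manual coding instead.

```python
# Minimal sketch: derive "Pain in Arm?" and "Has Fever?" indicators from the
# Symptoms text using a naive keyword rule (illustrative only).
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Joe", "Jane", "Jack"],
    "Symptoms": [
        "Patient experienced shortness of breath, pain in arm",
        "Pain was experienced by the patient in arm",
        "Patient exhibited a fever on arrival",
        "Coughing, sore throat, runny nose, fever",
    ],
})

text = df["Symptoms"].str.lower()
df["Pain in Arm?"] = text.str.contains("pain") & text.str.contains("arm")
df["Has Fever?"] = text.str.contains("fever")
print(df[["Name", "Pain in Arm?", "Has Fever?"]])
```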
DO
Use frequency and contingency tables.
Calculate modes and proportions.
Compare differences in other variables across groups.
Use logistic regression (when a dependent variable).
Use bar plots to visualise, or boxplots when comparing groups across a quantitative variable.
Treat it as a qualitative variable in analysis.
DO NOT
Compute mean, median, variance, standard deviation.
Calculate correlations (unless ordinal).
Fit a linear regression (when a dependent variable).
Visualise with a line graph or scatterplot (unless ordinal).
Treat it as a quantitative variable in analysis.
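A minimal sketch of some of the “DO” items in Python / pandas, with made-up data (the column names are illustrative):

```python
# Minimal sketch: frequency table, proportions, and a contingency table for a
# qualitative variable (made-up data).
import pandas as pd

df = pd.DataFrame({
    "education": ["HS Grad", "HS Grad", "College Grad", "Some College",
                  "College Grad", "HS Grad"],
    "region": ["Urban", "Rural", "Urban", "Urban", "Rural", "Rural"],
})

print(df["education"].value_counts())                # frequency table
print(df["education"].value_counts(normalize=True))  # proportions
print(pd.crosstab(df["education"], df["region"]))    # contingency table
```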
Hypothesis testing is a statistical analysis used to determine whether there is sufficient statistical evidence to reject a specific statement (null hypothesis) about a population parameter or distribution.
We do not prove things.
At the conclusion of a hypothesis test we either,
Conclude that we have sufficient evidence to reject a null hypothesis.
Conclude that we do not have sufficient evidence to reject a null hypothesis.
We begin every hypothesis test with two hypotheses.
e.g. a null \(H_0: \mu =10\) vs an alternate \(H_1:\mu \neq 10\).
We assume the null hypothesis is true.
We compute a test statistic e.g. \(T = 2.65\).
Given the null hypothesis and the assumptions of the test, we know the probability distribution of the test statistic. (e.g. \(T \sim t_{9}\))
Given a significance level (\(\alpha\)), we determine the “extremes” of our test statistic distribution.
The most extreme values, whose combined probability adds up to \(\alpha\), form the rejection region.
If our observed test statistic value falls within the rejection region…
Our observed value is extreme and not consistent with what is expected (were our null hypothesis true).
Evidence suggests our null hypothesis is not true.
\(p\)-values are often used with statistical software to perform hypothesis testing.
\(p\)-values are the probability, under the null hypothesis, of achieving test statistic values at least as extreme as the result actually observed.
The total area shaded in orange is the \(p\)-value.
To determine whether to reject a null hypothesis,
If \(p\)-value \(\leq \alpha\) reject the null hypothesis.
If \(p\)-value \(> \alpha\) do not reject the null hypothesis.
Figure: consider the \(p\)-value for the observed value shown.
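A minimal sketch of the recipe above in Python, with made-up data: a one-sample t-test of \(H_0: \mu = 10\) vs \(H_1: \mu \neq 10\) with \(n = 10\), so \(T \sim t_9\) under the null.

```python
# Minimal sketch: one-sample t-test of H0: mu = 10 at alpha = 0.05 (made-up data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
x = rng.normal(loc=11, scale=1.5, size=10)     # n = 10 observations

t_stat, p_value = stats.ttest_1samp(x, popmean=10)
alpha = 0.05

print(f"T = {t_stat:.2f}, p-value = {p_value:.4f}")
if p_value <= alpha:
    print("Reject H0: the observed value falls in the rejection region.")
else:
    print("Do not reject H0.")
```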
In statistics, significant means “significant with respect to some hypothesis test and significance level”.
Do not use the term significant unless you have performed a hypothesis test.
Statistically significant results account for the size of the difference relative to the variation in the data.
Results with large differences in mean can be described as “considerable” but not significant.
Example 1
Consider the difference between 100 observations of the two variables \(X\sim N(200,1000^2)\) and \(Y\sim N(300,1000^2)\).
With sample means \(\bar{x} = 63 \text{ and } \bar{y}= 205\).
Performing a Student's t-test we conclude there is not a significant difference in means.
Example 2
Consider the difference between 100 observations of the two variables \(X\sim N(2.1,1^2)\) and \(Y\sim N(2.5,1^2)\).
With sample means \(\bar{x} = 1.96 \text{ and } \bar{y}= 2.40\).
Performing a Student's t-test we conclude there is a significant difference in means.
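A minimal sketch that simulates data in the spirit of Examples 1 and 2 (the exact sample means and \(p\)-values will differ from those quoted above and vary with the seed):

```python
# Minimal sketch: a large raw difference can be non-significant when variation
# is huge (Example 1), while a small difference can be significant when
# variation is small (Example 2). Simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)

x1, y1 = rng.normal(200, 1000, 100), rng.normal(300, 1000, 100)   # Example 1
x2, y2 = rng.normal(2.1, 1, 100), rng.normal(2.5, 1, 100)         # Example 2

for label, x, y in [("Example 1", x1, y1), ("Example 2", x2, y2)]:
    t, p = stats.ttest_ind(x, y)
    print(f"{label}: difference in means = {y.mean() - x.mean():.2f}, p = {p:.3f}")
```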
To get a better sense of the differences between these two examples we can look at the 95% confidence interval plot.
You have to take into account variation in the data!
A 95% confidence interval for a parameter is an interval, computed from sample statistics, that would contain the true parameter value 95% of the time if we repeatedly computed it for different random samples from the population.
The true value of a parameter is not random in frequentist statistics.
Computing probabilities of the true value being any value(s) does not make sense.
Confidence intervals give a sense of the uncertainty in the estimate within the scale of the estimate.
Some statisticians and journals prefer CIs over \(p\)-values.
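A minimal sketch of computing a 95% confidence interval for a mean using the \(t\) distribution (made-up data):

```python
# Minimal sketch: 95% confidence interval for a population mean (made-up data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
x = rng.normal(2.5, 1, size=100)

n = len(x)
se = x.std(ddof=1) / np.sqrt(n)           # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)     # 97.5th percentile of t_{n-1}

lower, upper = x.mean() - t_crit * se, x.mean() + t_crit * se
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```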
The degrees of freedom of a statistical analysis are the maximum number of logically independent values (values that are free to vary) in the data sample.
Degrees of freedom can be thought of as the number of independent pieces of information used to obtain an estimate.
Consider the sample variance:
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2 \]
This estimate has \(n-1\) degrees of freedom.
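As a minimal check in Python, the \(n - 1\) divisor in the formula above corresponds to ddof=1 in NumPy (one degree of freedom is used up by estimating \(\bar{x}\)):

```python
# Minimal sketch: the sample variance formula agrees with np.var(..., ddof=1).
import numpy as np

x = np.array([4.0, 7.0, 6.0, 5.0, 8.0])
n = len(x)

manual = np.sum((x - x.mean()) ** 2) / (n - 1)
print(manual, np.var(x, ddof=1))   # both give 2.5
```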
A parametric assumption refers to an assumption about how our distribution is “parameterised.”
For many common statistical analyses, parametric assumptions are often used.
We use parametric assumptions to simplify problems, have more certainty in estimates, and be more conclusive about differences.
Some analyses require normal distributed data.
Some analyses require aspects (like residuals) to be normally distributed, but not the actual data.
Some analyses do not have a parametric assumption (non-parametric) … but they do have others.
There are often more model assumptions to consider than just normality.
Other assumptions include:
Constant variances, homogeneity of variance, homoscedasticity,…
Independence between observations …
Sufficient sample sizes (for asymptotic methods) …
Linearity and no multicollinearity …
All relevant assumptions should be checked.
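A minimal sketch of checking two common assumptions in Python, with made-up data (the particular tests here are one option among several; graphical checks such as Q-Q plots are often preferred):

```python
# Minimal sketch: Shapiro-Wilk test for normality of residuals and Levene's
# test for homogeneity of variance between two groups (made-up data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=6)
group_a = rng.normal(10, 2, size=40)
group_b = rng.normal(12, 2, size=40)

# Check normality of the residuals (deviations from each group mean), not the raw data.
residuals = np.concatenate([group_a - group_a.mean(), group_b - group_b.mean()])
print("Shapiro-Wilk:", stats.shapiro(residuals))

# Check homogeneity of variance between the groups.
print("Levene:", stats.levene(group_a, group_b))
```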
Many researchers feel the need to have statistically significant results.
Non-statistically significant results are still a result (but should not be over interpreted).
We should select statistical models to test based on what is best for the research objective.
When we incorrectly reject a null hypothesis (incorrectly claim a significant difference), we have a type I error - a false positive.
If we were to use a significance level of \(\alpha=0.05\), and repeat a hypothesis test on 100 different samples where the null hypothesis is actually true, we would expect to make a type I error 5% of the time (1 in 20).
It follows that the more tests you do … the more chance you obtain a false positive result.
Hypothesis tests should be planned sparingly. If multiple comparisons are made, a Bonferroni correction should be applied.
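A minimal sketch of a Bonferroni correction with made-up \(p\)-values: with \(m\) tests, each \(p\)-value is compared against \(\alpha / m\).

```python
# Minimal sketch: Bonferroni correction for m = 4 tests (made-up p-values).
p_values = [0.004, 0.020, 0.030, 0.600]
alpha = 0.05
m = len(p_values)

for p in p_values:
    decision = "reject" if p <= alpha / m else "do not reject"
    print(f"p = {p:.3f}: {decision} H0 (threshold alpha/m = {alpha / m:.4f})")
```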
Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables.
A linear correlation between two variables suggests that as one increases, the other increases (if positively correlated) or decreases (if negatively correlated).
Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events.
A correlation between variable X and Y does not mean X causes Y (or vice versa).
In order to make causal inference we usually need to conduct an experiment or a controlled study.
A controlled study involves controlling external sources of variation and performing a specific intervention.
Inference around correlation can be made on uncontrolled studies.
Implementing a good study or experimental design ensures you can make quality inference.
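A minimal sketch of correlation without causation, using simulated data: X and Y are both driven by an unobserved common cause Z (a confounder), so they are strongly correlated even though neither causes the other.

```python
# Minimal sketch: X and Y share a common cause Z, producing a strong
# correlation with no causal link between X and Y. Simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
z = rng.normal(size=500)                  # unobserved common cause (confounder)
x = 2 * z + rng.normal(size=500)
y = -1.5 * z + rng.normal(size=500)

r, p = stats.pearsonr(x, y)
print(f"Correlation between X and Y: r = {r:.2f} (p = {p:.3g})")
```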
Circular analysis is any analysis that retrospectively selects features of the data to characterise variables of interest, resulting in a distortion of the statistical analysis.
Circular analysis is sometimes called double dipping.
Researchers are at risk of performing circular analysis when they manipulate data into groups after already performing analysis.
Consider the following example…
Suppose I did a study on the effectiveness of a specific medication in treating memory loss.
I compare pre-treatment memory scores of 100 subjects to their post-treatment memory scores.
I found that 50 people saw improvements in their memory scores and 50 people did not.
Overall there was not a significant improvement in memory score across the 100 participants.
I group the data into two groups: 1. those who had an improvement and 2. those who did not.
I get a significant difference in memory scores in group 1. But the groups were defined by that very improvement, so the analysis is circular.
The experimental unit is the smallest unit that can be randomly assigned.
The observational unit is the smallest level at which a measurement is taken.
In longitudinal studies, the experimental unit is each subject / participant, and the observational unit is each recorded measurement within each subject.
The number of experimental units should be used when considering degrees of freedom (minus the number of parameters estimated).
Using the number of observational units artificially inflates the sample size.
Asking 10 people 100 questions is not the same as asking 100 people 10 questions.
Currently the Statistical Consulting Centre provides each post-graduate student with a free initial consultation. Up to ten hours per calendar year of consulting time is provided without charge if research funding is not available. When students require more consulting time, or receive external funding, a service charge may be necessary.
Discuss with one of your supervisors first about booking a consultation.
Go on to the Statistical Consulting Centre website and select "Make an Appointment".
Fill out the form with your and your chosen supervisor's details.
We will then send you a link to book.
We are also running a short course on data visualisation.
How to create graphs and figures for publication | 16-17th October 9:30am-12:30pm | Running Online
Advertised in Universe $110 (or $100) and on our website https://www.uow.edu.au/niasra/
Chat with your supervisor if you’re interested…
NIASRA will be hosting a talk next week on Thursday 2pm for Global Climate Change Week.
“Antarctic biodiversity modelling with uncertainty quantification” - check NIASRA website for details
If you have any questions feel free to email me…
bradleyw@uow.edu.au
or check out the SCC website…
https://www.uow.edu.au/niasra/our-research/statistical-consulting-centre/
also have a look at the NIASRA website…