Introductory Biostatistics Training for Health Research Professionals

Day 1

Get to know each other
Set up schedule
Software to use
Overview of biostatistics topics

Day 2

Topic 1. Introduction to Biostatistics and Its Importance

Biostatistics is the application of statistical methods to biological, medical, or health-related fields. It helps health practitioners in:

Analyzing health data
Interpreting research results
Making evidence-based clinical and public health decisions

Understanding biostatistics is essential for:

Public health professionals
Clinical practitioners
Researchers in health sciences

Topic 2. Types of Data and Measurement Scales

Data in biostatistics can be classified into different types based on the measurement scale:

Nominal: Categories without order
- blood type (A, B, AB, and O)
- gender (Male, Female).
Ordinal: Categories with a natural order
- stages of cancer (0, I, II, III, IV)
- severity of symptoms (mild, moderate, severe, extreme).
Interval: Numeric values with meaningful differences but no absolute zero, so ratios are not meaningful (e.g., temperature in Celsius).
Ratio: Numeric values with meaningful differences and an absolute zero, so ratios are meaningful (e.g., height, weight, age).

Each type of data requires different statistical techniques for analysis.
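
As a minimal sketch of how the four measurement scales can be represented in R, the example below uses a small data frame of patient records. The data frame, its column names, and all values are hypothetical and invented purely for illustration.

```r
# Hypothetical patient records (invented values, for illustration only)
patients <- data.frame(
  blood_type = c("A", "O", "B", "AB", "O"),
  severity   = c("mild", "severe", "moderate", "mild", "extreme"),
  temp_c     = c(36.8, 38.2, 37.5, 36.9, 39.1),
  weight_kg  = c(70.5, 82.0, 64.3, 90.1, 58.7)
)

# Nominal: an unordered factor (categories without order)
patients$blood_type <- factor(patients$blood_type,
                              levels = c("A", "B", "AB", "O"))

# Ordinal: an ordered factor (categories with a natural order)
patients$severity <- factor(patients$severity,
                            levels = c("mild", "moderate", "severe", "extreme"),
                            ordered = TRUE)

# Nominal data: frequency counts are the natural summary
table(patients$blood_type)

# Ordinal data: order comparisons are valid
at_least_moderate <- patients$severity >= "moderate"
sum(at_least_moderate)

# Interval data: differences are meaningful
diff(range(patients$temp_c))

# Ratio data: ratios are meaningful because zero means "none"
max(patients$weight_kg) / min(patients$weight_kg)

# Inspect how R stores each column
str(patients)
```

Storing a variable with the right type matters because many R functions choose their behavior (for example, which summary or plot to produce) based on it.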

Topic 3. Descriptive Statistics

Descriptive statistics summarize data and provide insights about its central tendency and variability. The main tools include:

Measures of central tendency:
- Mean
- Median
- Mode
Measures of dispersion:
- Range
- Variance
- Standard deviation
- Interquartile range (IQR)

Descriptive statistics are useful in health research to summarize demographic data, disease prevalence, or treatment outcomes.
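
The measures listed above can be computed with a few base R functions. The sketch below uses the "mpg" column of R's built-in "mtcars" data frame as a stand-in for a continuous health variable; the variable names on the left-hand side are illustrative choices.

```r
# Use a continuous variable from the built-in mtcars data frame
mpg <- mtcars$mpg

# Measures of central tendency
mean_mpg   <- mean(mpg)
median_mpg <- median(mpg)

# Base R has no built-in function for the statistical mode,
# so take the most frequent value(s) from a frequency table
freq     <- table(mpg)
mode_mpg <- as.numeric(names(freq)[freq == max(freq)])

# Measures of dispersion
range_mpg <- range(mpg)   # smallest and largest value
var_mpg   <- var(mpg)     # sample variance
sd_mpg    <- sd(mpg)      # sample standard deviation
iqr_mpg   <- IQR(mpg)     # 75th percentile minus 25th percentile

# A quick overview: minimum, quartiles, mean, and maximum
summary(mpg)
```

Note that a variable can have more than one mode, which is why "mode_mpg" may contain several values.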

Topic 4. Probability and Probability Distributions

Probability theory is fundamental to understanding the uncertainty inherent in health data. Common probability distributions used in biostatistics include:

Normal Distribution: Symmetrical and bell-shaped, commonly used in modeling continuous variables like blood pressure or cholesterol levels. Examples:
- Male height
- IQ
Multinomial Distribution: Describes counts of observations falling into a fixed set of categories, each with its own probability. Examples:
- Education level (High school or lower, Some college, Bachelor’s Degree, Master’s or higher)
- Income level (Low, Medium, High)

Understanding probability distributions helps in hypothesis testing and decision-making in healthcare.
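
To make the two distributions concrete, here is a short sketch using base R. The blood pressure mean and standard deviation and the income-level proportions are assumed values chosen only for illustration, not real population figures.

```r
# --- Normal distribution ---
# Assumed (illustrative) systolic blood pressure: mean 120, SD 15
mu    <- 120
sigma <- 15

# Probability that a value exceeds 140
p_above_140 <- 1 - pnorm(140, mean = mu, sd = sigma)

# Probability of falling within one SD of the mean (about 0.68)
p_within_1sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)

# Value below which 95% of observations fall
q95 <- qnorm(0.95, mean = mu, sd = sigma)

# --- Multinomial distribution ---
# Assumed (illustrative) proportions for three income levels
probs <- c(low = 0.3, medium = 0.5, high = 0.2)

# Simulate the category counts for one sample of 100 people
set.seed(1)
counts <- rmultinom(n = 1, size = 100, prob = probs)
counts

# Probability of observing exactly 30 low, 50 medium, and 20 high
p_exact <- dmultinom(c(30, 50, 20), prob = probs)
```

In R, the prefix of a distribution function indicates its role: "p" gives cumulative probabilities, "q" gives quantiles, "d" gives densities or probabilities, and "r" draws random values.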

Topic 5. Sampling Methods

Sampling is critical to ensuring that a study’s sample is representative of the target population. Common sampling methods include:

Simple Random Sampling: Every member of the population has an equal chance of being selected.
Stratified Sampling: The population is divided into subgroups (strata), and a random sample is taken from each subgroup.
Cluster Sampling: Entire clusters or groups are selected randomly, and all members of the selected groups are included.
Systematic Sampling: Every \(n\)th individual is selected after a random start.

Choosing an appropriate sampling method reduces sampling bias and supports the generalizability of study results.
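
The four methods above can be sketched in base R as follows. The sampling frame (200 patients in 10 clinics), its column names, and the sample sizes are all hypothetical and chosen only to keep the example small.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical sampling frame: 200 patients spread over 10 clinics
population <- data.frame(
  id     = 1:200,
  clinic = rep(paste0("clinic_", 1:10), each = 20),
  sex    = rep(c("F", "M"), times = 100)
)

# Simple random sampling: every patient has an equal chance
srs <- population[sample(nrow(population), 20), ]

# Stratified sampling: draw 10 patients at random from each sex stratum
rows_by_sex <- split(seq_len(nrow(population)), population$sex)
strat_rows  <- unlist(lapply(rows_by_sex, function(rows) sample(rows, 10)),
                      use.names = FALSE)
stratified  <- population[strat_rows, ]

# Cluster sampling: pick 2 clinics at random and keep all their patients
chosen_clinics <- sample(unique(population$clinic), 2)
cluster        <- population[population$clinic %in% chosen_clinics, ]

# Systematic sampling: random start, then every 10th patient
step       <- 10                    # the sampling interval (n in the text)
start      <- sample(step, 1)       # random start between 1 and step
systematic <- population[seq(from = start, to = nrow(population), by = step), ]
```

Calling "set.seed()" before sampling makes the selection reproducible, which is useful when a study protocol needs to document exactly how participants were chosen.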

Topic 6. RStudio for Data Visualization

We will use RStudio online: https://posit.cloud/.

Data visualization is an important tool in biostatistics for summarizing and presenting results in a clear and accessible manner. Effective visualizations enhance data interpretation and communication of research findings.

Common types of visualizations in biostatistics include:

Barplots: To visualize the distribution of categorical or discrete data.
Histograms: To visualize the distribution of continuous data.
Boxplots: To compare distributions across different groups.
Scatterplots: To visualize relationships between two continuous variables.
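
Each of these plot types has a base R function. The following minimal sketch uses R's built-in "mtcars" data frame; the titles and axis labels are illustrative choices, and the boxplot is grouped by "cyl" to show a comparison across groups.

```r
# Bar plot: counts of a categorical or discrete variable
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts,
        main = "Cars by number of cylinders",
        xlab = "Cylinders", ylab = "Count")

# Histogram: distribution of a continuous variable
hist(mtcars$mpg,
     main = "Distribution of miles per gallon",
     xlab = "Miles per gallon")

# Boxplot: compare the distribution of mpg across cylinder groups
boxplot(mpg ~ cyl, data = mtcars,
        main = "Miles per gallon by cylinders",
        xlab = "Cylinders", ylab = "Miles per gallon")

# Scatterplot: relationship between two continuous variables
plot(mtcars$wt, mtcars$mpg,
     main = "Miles per gallon vs weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```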

You can also use the ggplot2 package to do each of the above visualizations.

We are going to use the “mtcars” data to create:

A bar plot for the variable “cyl”,
A histogram for the variable “mpg”,
A boxplot for the variable “mpg”, and
A scatterplot for the variable “mpg” vs the variable “wt”.

We will then use the ggplot2 package to recreate each of the above.

First of all, you need to

Get registered with https://posit.cloud/.
Now, log in.
Click “New Project” and then “New RStudio Project”. Change the “Untitled Project” to something meaningful.
Let’s create an R Markdown document by clicking “File->New File->R Markdown”. Click “OK” and you will see a default R Markdown document. Click the “Knit” button to generate an output document, which can be an HTML file, a PDF file, or a Microsoft Word file. The output document has a title, an author name, and the date it was created. It also has a few sections.

Now, use ChatGPT to create an R Markdown with the following prompt:

I have a data frame called “mtcars” with some columns such as “mpg”, “cyl”, “wt”. Create an R Markdown to do all of the following: A bar plot for the variable “cyl”, A histogram for the variable “mpg”, A boxplot for the variable “mpg”, A scatterplot for the variable “mpg” vs the variable “wt”, and Use the ggplot2 package to do each of the above.

Use the generated R code to modify the default R Markdown.
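
For comparison, a minimal ggplot2 sketch of the four plots might look like the code below. The code that ChatGPT generates will likely differ in details. This sketch assumes the ggplot2 package is installed (run install.packages("ggplot2") once if it is not); the object names and labels are illustrative, and the boxplot is grouped by "cyl" as one possible design.

```r
library(ggplot2)

# Bar plot for "cyl" (converted to a factor so each value gets its own bar)
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Count")

# Histogram for "mpg"
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(x = "Miles per gallon", y = "Count")

# Boxplot for "mpg", grouped by number of cylinders
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "Cylinders", y = "Miles per gallon")

# Scatterplot for "mpg" vs "wt"
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Printing a ggplot object draws it
print(p_bar)
print(p_hist)
print(p_box)
print(p_scatter)
```

In an R Markdown document, each plot can go in its own code chunk so that it appears under its own section in the knitted output.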

A practical issue: how do you create data visualizations with “.csv” data stored on your own computer? Because Posit Cloud runs in the browser, first upload the file to your project using the “Upload” button in the Files pane. Then import the data into RStudio by running the following R code in a code chunk such as the setup code chunk.

# Replace "the file path" with the location of your .csv file
my_data <- read.csv("the file path")

The code will create a data frame called “my_data”, which can be used by all code chunks that follow.
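
The import step can be tried end to end with the sketch below. Because no real file is available here, it first writes the built-in "mtcars" data to a temporary ".csv" file as a stand-in for a file on your computer; the file name "example_data.csv" is hypothetical.

```r
# Stand-in for your own file: save mtcars as a .csv in a temporary folder
csv_path <- file.path(tempdir(), "example_data.csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Import the .csv file into a data frame
my_data <- read.csv(csv_path)

# Check that the import worked: column types and the first few rows
str(my_data)
head(my_data)

# The imported data frame can now be used for plotting
hist(my_data$mpg,
     main = "Distribution of miles per gallon",
     xlab = "Miles per gallon")
```

Checking the result with "str()" right after importing is a quick way to catch columns that were read with the wrong type, such as numbers stored as text.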

Day 3

Topic 7. Inferential Statistics

Inferential statistics are used to make inferences about a population based on sample data. This includes:

Hypothesis Testing: A method of statistical inference used to test the validity of a hypothesis about a population parameter. The null hypothesis (\(H_0\)) is tested against an alternative hypothesis (\(H_1\) or \(H_a\)).
Confidence Intervals (CI): A range of values that is likely to contain the true population parameter, providing a measure of the precision of an estimate.
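p-values: The probability, assuming the null hypothesis is true, of observing data at least as extreme as the data actually observed. A small p-value indicates that the observed data are inconsistent with the null hypothesis.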
Type I and Type II Errors: Type I (false positive) occurs when a true null hypothesis is incorrectly rejected, while Type II (false negative) occurs when a false null hypothesis is not rejected.
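
The ideas above can be illustrated with a two-sample t-test in base R. This sketch uses the built-in "mtcars" data as a stand-in for health data and tests the null hypothesis that mean "mpg" is the same for automatic (am = 0) and manual (am = 1) cars. The choice of test and the significance level of 0.05 are assumptions made for this example.

```r
# H0: the mean mpg is equal in the two transmission groups
# H1: the mean mpg differs between the two groups
test_result <- t.test(mpg ~ am, data = mtcars)  # Welch two-sample t-test
test_result

# p-value: probability of data at least this extreme if H0 were true
p_value <- test_result$p.value

# 95% confidence interval for the difference in group means
ci <- test_result$conf.int

# Decision at an assumed significance level (the Type I error rate we accept)
alpha     <- 0.05
reject_h0 <- p_value < alpha
reject_h0
```

If the confidence interval for the difference in means excludes zero, the test rejects the null hypothesis at the corresponding significance level, so the two outputs tell a consistent story.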

Topic 8. Correlation and Regression Analysis

Correlation measures the strength and direction of a linear relationship between two variables (e.g., height and weight).
Pearson’s correlation coefficient is commonly used for continuous data.
Spearman’s rank correlation is used for ordinal data or for relationships that are monotonic but not linear.
Regression Analysis assesses the relationship between a dependent variable and one or more independent variables.
Simple Linear Regression: A method for modeling the linear relationship between a continuous dependent variable and a single independent variable.
Multiple Linear Regression: Extends simple linear regression by including multiple independent variables.

Regression is widely used to model health outcomes based on various risk factors.
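
A minimal sketch of these methods in base R follows, again using the built-in "mtcars" data as a stand-in for health data. Treating "mpg" as the dependent variable and "wt" and "hp" as independent variables is an illustrative choice.

```r
# Pearson's correlation (the default method) between weight and mpg
r_pearson <- cor(mtcars$wt, mtcars$mpg)

# Spearman's rank correlation for the same pair of variables
r_spearman <- cor(mtcars$wt, mtcars$mpg, method = "spearman")

# Test whether the Pearson correlation differs from zero
cor.test(mtcars$wt, mtcars$mpg)

# Simple linear regression: one independent variable
simple_fit <- lm(mpg ~ wt, data = mtcars)
summary(simple_fit)   # coefficients, standard errors, p-values, R-squared

# Multiple linear regression: several independent variables
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(multi_fit)       # estimated intercept and slopes
confint(multi_fit)    # 95% confidence intervals for the coefficients
```

The slope for "wt" in the simple model estimates the average change in "mpg" for each one-unit increase in weight; in the multiple model it is the change with "hp" held constant.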

Day 4

Topic 9. Chi-Square Test and Contingency Tables

The Chi-Square Test is used for categorical data to test for relationships between variables. It is commonly used in:

Testing whether a distribution of observed frequencies matches an expected distribution.
Testing the association between two categorical variables (e.g., gender and disease status).

Contingency tables display the relationship between categorical variables and are the basis for performing a chi-square test.
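
Both uses of the chi-square test can be sketched in base R. The example uses the built-in "mtcars" data as a stand-in for health data: the categorical variables "am" (transmission) and "vs" (engine shape) play the role of, say, gender and disease status. The equal expected proportions in the second test are an assumption made for illustration.

```r
# Contingency table of two categorical variables
tab <- table(transmission = mtcars$am, engine = mtcars$vs)
tab

# Test of association between the two variables
# (for 2x2 tables, chisq.test applies Yates' continuity correction by default)
assoc_test <- chisq.test(tab)
assoc_test

# Expected counts under independence; the approximation is usually
# considered reliable when these are at least 5
assoc_test$expected

# Goodness-of-fit test: do observed counts match an expected distribution?
# Here the assumed expected distribution is equal proportions per category
cyl_counts <- table(mtcars$cyl)
gof_test   <- chisq.test(cyl_counts)
gof_test
```

When expected counts are small, R warns that the chi-square approximation may be incorrect; Fisher's exact test ("fisher.test()") is a common alternative in that situation.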

Topic 10. Survival Analysis and Kaplan-Meier Curves

Survival Analysis is used to analyze time-to-event data, such as the time until a patient’s death, disease recurrence, or recovery.

Kaplan-Meier Curves: A graphical representation of survival probabilities over time.
Log-Rank Test: Used to compare survival curves between different groups.

Survival analysis is commonly applied in clinical trials and epidemiological studies to assess treatment efficacy or disease prognosis.
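
A minimal sketch of a Kaplan-Meier analysis and log-rank test in R is shown below. It assumes the "survival" package, which ships with standard R installations, and its bundled "lung" data set (survival times of lung cancer patients); neither is part of this course's earlier material. In "lung", "time" is the follow-up time in days, "status" marks whether death was observed, and "sex" is coded 1 for male and 2 for female.

```r
library(survival)

# Kaplan-Meier estimate of survival, separately for each sex
km_fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Estimated survival probabilities at 6 months and 1 year
summary(km_fit, times = c(180, 365))

# Kaplan-Meier curves: one line per group
plot(km_fit, col = c("blue", "red"),
     xlab = "Days", ylab = "Survival probability",
     main = "Kaplan-Meier curves by sex")
legend("topright", legend = c("Male", "Female"),
       col = c("blue", "red"), lty = 1)

# Log-rank test: do the survival curves differ between the groups?
lr_test <- survdiff(Surv(time, status) ~ sex, data = lung)
lr_test
```

"Surv()" combines the follow-up time and the event indicator into a single outcome, which is how the analysis accounts for patients whose event was not observed (censored observations).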