1. Introduction

In biostatistics, when we analyze data (e.g., patient ages, blood pressure readings, recovery times), the first step is often descriptive statistics. We want to summarize the data with a single number that represents the “center” or “typical value” of the distribution.

These numbers are called Measures of Central Tendency.

The three most common measures are:

  1. Mean (Arithmetic Average)
  2. Median (The Middle Value)
  3. Mode (The Most Frequent Value)


2. The Arithmetic Mean

The mean is the most widely used measure of central tendency. It is the sum of all observations divided by the number of observations.

Equations

Sample Mean (\(\bar{x}\)): Used when dealing with a subset of the population (e.g., a study of 50 patients).

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Population Mean (\(\mu\)): Used when dealing with the entire population (rare in practice).

\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]

Biostatistics Example: Heart Rate

Suppose we have the resting heart rates (beats per minute) of 5 patients: \[ Data: \{72, 68, 80, 75, 70\} \]

Calculation: \[ \bar{x} = \frac{72 + 68 + 80 + 75 + 70}{5} = \frac{365}{5} = 73 \text{ bpm} \]

R Implementation

# Define the vector of heart rates
heart_rates <- c(72, 68, 80, 75, 70)

# Calculate mean
avg_hr <- mean(heart_rates)

print(paste("The mean heart rate is:", avg_hr))
## [1] "The mean heart rate is: 73"
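
As a cross-check, the sample mean formula can also be evaluated term by term. The sketch below is illustrative only; the name manual_mean is introduced here and does not appear elsewhere in these notes.

```r
# Resting heart rates (bpm) from the example above
heart_rates <- c(72, 68, 80, 75, 70)

# Sample mean straight from the formula: sum of all x_i divided by n
n <- length(heart_rates)
manual_mean <- sum(heart_rates) / n   # manual_mean is an illustrative name

print(manual_mean)                                 # 73
print(all.equal(manual_mean, mean(heart_rates)))   # TRUE
```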

Pros & Cons:

  • Pro: Uses every data point.
  • Con: Highly sensitive to outliers (extreme values).
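
To see the sensitivity to outliers in numbers, here is a small sketch. It assumes a hypothetical variation of the heart rate data in which the last reading is 180 bpm instead of 70; that value is invented for illustration and is not part of the original dataset.

```r
# Original heart rates (bpm)
heart_rates <- c(72, 68, 80, 75, 70)

# Hypothetical variation: last reading replaced by an extreme value (assumption)
heart_rates_outlier <- c(72, 68, 80, 75, 180)

mean(heart_rates)           # 73
mean(heart_rates_outlier)   # 95
```

A single extreme reading moves the mean by 22 bpm, even though four of the five values are unchanged.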


3. The Median

The median is the middle value when the data are ordered from smallest to largest. It splits the dataset in half: at least 50% of the observations are at or below it, and at least 50% are at or above it.

Algorithm

  1. Sort the data.
  2. If \(n\) is odd, the median is the middle number.
  3. If \(n\) is even, the median is the average of the two middle numbers.

Equations (Position)

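The position of the median in the sorted data is: \[ Position = \frac{n+1}{2} \]

If \(n\) is odd, this is a whole number and points directly at the middle value. If \(n\) is even, it falls halfway between two positions (e.g., 3.5 when \(n=6\)), so the median is the average of the values at positions \(n/2\) and \(n/2+1\).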
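
The algorithm and the position rule can be sketched as a small function. This is a minimal illustration, assuming a non-empty numeric vector without missing values; manual_median is a hypothetical name, and in practice the built-in median() should be preferred.

```r
# Illustrative median that follows the three-step algorithm (hypothetical helper)
manual_median <- function(x) {
  sorted <- sort(x)            # Step 1: sort the data
  n <- length(sorted)
  if (n %% 2 == 1) {
    sorted[(n + 1) / 2]        # Step 2: odd n, take the middle value
  } else {
    # Step 3: even n, average the values at positions n/2 and n/2 + 1
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

manual_median(c(2, 5, 1, 9, 4, 3))      # even n: 3.5
manual_median(c(72, 68, 80, 75, 70))    # odd n: 72
```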

Biostatistics Example: Hospital Stay Duration

Consider the length of stay (in days) for 6 patients: \[ Data: \{2, 5, 1, 9, 4, 3\} \]

Step 1: Sort \(\rightarrow \{1, 2, 3, 4, 5, 9\}\)
Step 2: Find Middle. Since \(n=6\) (even), we take the 3rd and 4th values. \[ Median = \frac{3 + 4}{2} = 3.5 \text{ days} \]

R Implementation

# Define the vector of stays
stays <- c(2, 5, 1, 9, 4, 3)

# Calculate median
med_stay <- median(stays)

print(paste("The median length of stay is:", med_stay))
## [1] "The median length of stay is: 3.5"

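Robustness Example: If the last patient stayed 100 days instead of 9:

  • The mean would jump from 4 days to about 19.2 days (\(115/6\)).
  • The median would remain exactly the same (3.5 days).

Because a single extreme value barely affects it, the median is called a robust measure.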
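
The robustness comparison can be checked directly in R. The sketch below uses the length-of-stay data from this section; stays_extreme is an illustrative name for the modified dataset.

```r
# Original lengths of stay (days)
stays <- c(2, 5, 1, 9, 4, 3)

# Same data, but the 9-day stay is replaced by a 100-day stay
stays_extreme <- c(2, 5, 1, 100, 4, 3)

mean(stays)              # 4
median(stays)            # 3.5

mean(stays_extreme)      # about 19.17
median(stays_extreme)    # 3.5
```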


4. The Mode

The mode is the value that appears most frequently in the dataset.

Characteristics

  • Unimodal: One mode.
  • Bimodal: Two modes (two distinct peaks).
  • Multimodal: More than two modes.
  • No Mode: All values are unique.

Biostatistics Example: Parity

Number of previous births for 10 women: \[ Data: \{0, 1, 2, 1, 0, 3, 1, 4, 1, 2\} \] The number 1 appears 4 times. The mode is 1.

R Implementation

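Note: R does not have a built-in function for the statistical mode (the base function mode() reports an object's storage type, not its most frequent value). We create a simple one. If several values tie for the highest frequency, this version returns only the one that appears first in the data.

# Custom function to calculate mode
get_mode <- function(v) {
   uniqv <- unique(v)                     # distinct values, in order of first appearance
   counts <- tabulate(match(v, uniqv))    # how often each distinct value occurs
   uniqv[which.max(counts)]               # first value with the highest count
}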

parity <- c(0, 1, 2, 1, 0, 3, 1, 4, 1, 2)
mode_parity <- get_mode(parity)

print(paste("The mode of parity is:", mode_parity))
## [1] "The mode of parity is: 1"
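
For bimodal or multimodal data, it can be useful to report every mode rather than a single one. The following is a minimal sketch under the assumption that the input is a numeric vector; get_all_modes is a hypothetical helper name, not a base R function.

```r
# Illustrative helper that returns all values sharing the highest frequency
get_all_modes <- function(v) {
  counts <- table(v)                              # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]   # keep every value with the top count
  as.numeric(modes)                               # table names are character; convert back
}

parity <- c(0, 1, 2, 1, 0, 3, 1, 4, 1, 2)
get_all_modes(parity)              # 1 (unimodal)
get_all_modes(c(1, 1, 2, 2, 3))    # 1 2 (bimodal)
```

Note that when all values are unique, this sketch returns every value, which corresponds to the "No Mode" case above.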

5. Visualizing Central Tendency

Let’s look at a generated dataset representing Hemoglobin levels (g/dL) in a population. We will generate slightly skewed data to see the difference between Mean and Median.

library(ggplot2)

set.seed(123)
# Generate data: 100 normal points + 10 high outliers
data_hemo <- c(rnorm(100, mean=13, sd=1), runif(10, min=16, max=20))

df <- data.frame(Hemoglobin = data_hemo)

# Calculate stats
mean_val <- mean(df$Hemoglobin)
med_val <- median(df$Hemoglobin)

# Plot (linewidth replaces the deprecated size argument; requires ggplot2 >= 3.4.0)
ggplot(df, aes(x=Hemoglobin)) +
  geom_histogram(binwidth=0.5, fill="lightblue", color="black", alpha=0.7) +
  geom_vline(xintercept=mean_val, color="red", linetype="dashed", linewidth=1) +
  geom_vline(xintercept=med_val, color="blue", linetype="solid", linewidth=1) +
  annotate("text", x=mean_val+1, y=20, label=paste("Mean (Red):", round(mean_val, 2)), color="red") +
  annotate("text", x=med_val-1, y=20, label=paste("Median (Blue):", round(med_val, 2)), color="blue") +
  theme_minimal() +
  labs(title = "Distribution of Hemoglobin Levels",
       subtitle = "Notice how the outliers pull the Mean (Red) to the right",
       y = "Frequency")

Key Takeaway:

  • If Data is Symmetric: Mean \(\approx\) Median.
  • If Data is Skewed Right (Positive): Mean > Median.
  • If Data is Skewed Left (Negative): Mean < Median.
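
These three rules of thumb can be explored with simulated data. The sketch below is only an illustration: the distributions, sample size, seed, and the helper name summarise_center are all assumptions chosen for the demonstration, not part of the hemoglobin example.

```r
set.seed(42)   # arbitrary seed so the simulation is reproducible

# Simulated samples (assumed distributions, for illustration only)
symmetric    <- rnorm(10000, mean = 13, sd = 1)   # symmetric
right_skewed <- rexp(10000, rate = 1)             # long right tail
left_skewed  <- 10 - rexp(10000, rate = 1)        # mirrored: long left tail

# Hypothetical helper returning both measures side by side
summarise_center <- function(x) c(mean = mean(x), median = median(x))

summarise_center(symmetric)      # mean and median nearly equal
summarise_center(right_skewed)   # mean above median
summarise_center(left_skewed)    # mean below median
```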


6. Module Quiz

Test your understanding of the concepts above.

Question 1: Conceptual

Which measure of central tendency is most appropriate for a dataset describing “Survival Time” where most patients survive 1-2 years, but a few lucky ones survive 20 years?

  a) Mean
  b) Median
  c) Mode

Question 2: Calculation

Calculate the mean and median of this Glucose level dataset: \[ \{80, 90, 90, 100, 200\} \]

Question 3: R Coding Practice

Write an R script below to calculate the mean of a vector containing the numbers 1 through 100.


7. Quiz Answers

Answer 1: (b) Median. The “lucky survivors” are outliers that skew the data to the right, so the Mean would be misleadingly high.

Answer 2:

  • Sum = \(80+90+90+100+200 = 560\)
  • n = 5
  • Mean = \(560 / 5 = 112\)
  • Ordered Data: \(80, 90, 90, 100, 200\). Middle is 90.
  • Median = \(90\).
  • Note: The outlier (200) pulled the mean to 112, while the median stayed at 90.
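
The hand calculation for Answer 2 can be confirmed in R with a short check; glucose is simply an illustrative variable name for the quiz data.

```r
# Glucose levels from Question 2
glucose <- c(80, 90, 90, 100, 200)

mean(glucose)     # 112
median(glucose)   # 90
```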

Answer 3:

# Solution
numbers <- 1:100
mean(numbers)
## [1] 50.5


How to use this file:

  1. Install R and RStudio: Ensure you have these installed.
  2. Install Packages: You may need to run install.packages(c("rmarkdown", "ggplot2")) in your R console once.
  3. Create the File: Open RStudio, go to File > New File > R Markdown..., delete the default content, and paste the code block above.
  4. Knit: Click the Knit button (icon looks like a ball of yarn) in the toolbar to generate the lecture notes.