In biostatistics, when we analyze data (e.g., patient ages, blood pressure readings, recovery times), the first step is often descriptive statistics. We want to summarize the data with a single number that represents the “center” or “typical value” of the distribution.
These numbers are called Measures of Central Tendency.
The three most common measures are:
1. Mean (Arithmetic Average)
2. Median (The Middle Value)
3. Mode (The Most Frequent Value)
The mean is the most widely used measure of central tendency. It is the sum of all observations divided by the number of observations.
Sample Mean (\(\bar{x}\)): Used when dealing with a subset of the population (e.g., a study of 50 patients).
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Population Mean (\(\mu\)): Used when dealing with the entire population (rare in practice).
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]
Suppose we have the resting heart rates (beats per minute) of 5 patients: \[ \text{Data: } \{72, 68, 80, 75, 70\} \]
Calculation: \[ \bar{x} = \frac{72 + 68 + 80 + 75 + 70}{5} = \frac{365}{5} = 73 \text{ bpm} \]
# Define the vector of heart rates
heart_rates <- c(72, 68, 80, 75, 70)
# Calculate mean
avg_hr <- mean(heart_rates)
print(paste("The mean heart rate is:", avg_hr))
## [1] "The mean heart rate is: 73"
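As a quick check on the formula, the same result can be reproduced by applying the definition directly rather than calling mean(). This is only a minimal sketch; the variable name manual_mean is introduced here for illustration.

```r
# Resting heart rates from the worked example
heart_rates <- c(72, 68, 80, 75, 70)

# Apply the definition: sum of the observations divided by their count
manual_mean <- sum(heart_rates) / length(heart_rates)

# Compare with the built-in function (all.equal tolerates rounding error)
isTRUE(all.equal(manual_mean, mean(heart_rates)))
```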
Pros & Cons:
* Pro: Uses every data point.
* Con: Highly sensitive to outliers (extreme values).
The median is the middle value when the data is ordered from smallest to largest. It splits the dataset exactly in half (50% above, 50% below).
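The position of the median in the sorted data is: \[ \text{Position} = \frac{n+1}{2} \] When \(n\) is odd, this is a whole number and the median is the value at that position. When \(n\) is even, the position falls halfway between two values, and the median is the average of those two values.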
Consider the length of stay (in days) for 6 patients: \[ \text{Data: } \{2, 5, 1, 9, 4, 3\} \]
Step 1: Sort \(\rightarrow \{1, 2, 3, 4, 5, 9\}\)
Step 2: Find Middle. Since \(n=6\) (even), we take the 3rd and 4th values. \[ \text{Median} = \frac{3 + 4}{2} = 3.5 \text{ days} \]
# Define the vector of stays
stays <- c(2, 5, 1, 9, 4, 3)
# Calculate median
med_stay <- median(stays)
print(paste("The median length of stay is:", med_stay))
## [1] "The median length of stay is: 3.5"
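The position rule can also be applied by hand, which may help show what median() is doing. The sketch below is not how R implements median() internally; manual_median is a hypothetical helper, and it assumes a non-empty numeric vector with no missing values.

```r
# Hypothetical helper: median via the (n + 1) / 2 position rule
manual_median <- function(x) {
  sorted <- sort(x)
  n <- length(sorted)
  pos <- (n + 1) / 2
  # For odd n, floor(pos) and ceiling(pos) are the same middle position;
  # for even n, they are the two middle positions, whose values are averaged
  mean(sorted[c(floor(pos), ceiling(pos))])
}

stays <- c(2, 5, 1, 9, 4, 3)
manual_median(stays)             # even n: averages the 3rd and 4th sorted values
manual_median(c(2, 5, 1, 9, 4))  # odd n: picks the 3rd sorted value
```

For the length-of-stay data this agrees with median(stays).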
Robustness Example: If the patient with the longest stay had stayed 100 days instead of 9:
* The mean would jump from 4 days to about 19.2 days.
* The median would remain exactly the same (3.5 days).
This makes the median robust to outliers.
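The robustness comparison can be checked directly in R. This is a small sketch; the vector name stays_outlier is an assumption introduced for the modified data.

```r
# Original lengths of stay, and the same data with 9 replaced by 100
stays <- c(2, 5, 1, 9, 4, 3)
stays_outlier <- c(2, 5, 1, 100, 4, 3)

# The mean reacts strongly to the single extreme value
mean(stays)
mean(stays_outlier)

# The median depends only on the middle of the sorted data
median(stays)
median(stays_outlier)
```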
The mode is the value that appears most frequently in the dataset.
Number of previous births for 10 women: \[ \text{Data: } \{0, 1, 2, 1, 0, 3, 1, 4, 1, 2\} \] The number 1 appears 4 times. The mode is 1.
Note: R does not have a built-in function for the statistical mode (the base function mode() returns an object's storage mode instead). We create a simple one.
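# Custom function to calculate mode
get_mode <- function(v) {
  uniqv <- unique(v)
  # Count how often each unique value occurs and return the most frequent one;
  # if several values tie, the first one encountered is returned
  uniqv[which.max(tabulate(match(v, uniqv)))]
}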
parity <- c(0, 1, 2, 1, 0, 3, 1, 4, 1, 2)
mode_parity <- get_mode(parity)
print(paste("The mode of parity is:", mode_parity))
## [1] "The mode of parity is: 1"
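A dataset can have more than one mode, and get_mode reports only the first of any tied values. One possible way to return all of them is sketched below. The helper get_all_modes is hypothetical, and it relies on table(), which stores the values as character names, so the conversion back to numbers is best suited to integer-like data such as counts.

```r
# Hypothetical helper: return every value tied for the highest frequency
get_all_modes <- function(v) {
  counts <- table(v)
  modes <- names(counts)[counts == max(counts)]
  # table() keeps the values as character names; convert back for numeric input
  if (is.numeric(v)) as.numeric(modes) else modes
}

get_all_modes(c(0, 1, 2, 1, 0, 3, 1, 4, 1, 2))  # one mode
get_all_modes(c(1, 1, 2, 2, 3))                 # two tied modes (bimodal)
```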
Let’s look at a generated dataset representing Hemoglobin levels (g/dL) in a population. We will generate slightly skewed data to see the difference between Mean and Median.
library(ggplot2)  # needed for ggplot() and the geoms below
set.seed(123)
# Generate data: 100 normal points + 10 high outliers
data_hemo <- c(rnorm(100, mean = 13, sd = 1), runif(10, min = 16, max = 20))
df <- data.frame(Hemoglobin = data_hemo)
# Calculate stats
mean_val <- mean(df$Hemoglobin)
med_val <- median(df$Hemoglobin)
# Plot (linewidth replaces the size aesthetic for lines, deprecated in ggplot2 3.4.0)
ggplot(df, aes(x = Hemoglobin)) +
  geom_histogram(binwidth = 0.5, fill = "lightblue", color = "black", alpha = 0.7) +
  geom_vline(xintercept = mean_val, color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = med_val, color = "blue", linetype = "solid", linewidth = 1) +
  annotate("text", x = mean_val + 1, y = 20, label = paste("Mean (Red):", round(mean_val, 2)), color = "red") +
  annotate("text", x = med_val - 1, y = 20, label = paste("Median (Blue):", round(med_val, 2)), color = "blue") +
  theme_minimal() +
  labs(title = "Distribution of Hemoglobin Levels",
       subtitle = "Notice how the outliers pull the Mean (Red) to the right",
       y = "Frequency")
Key Takeaway:
* If Data is Symmetric: Mean \(\approx\) Median.
* If Data is Skewed Right (Positive): usually Mean > Median.
* If Data is Skewed Left (Negative): usually Mean < Median.
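These three cases can be illustrated with simulated data. The sketch below is only one way to do it: the normal and exponential distributions, the sample size, and the seed are all assumptions chosen for illustration, and the exact numbers will differ with other choices.

```r
set.seed(42)  # arbitrary seed, used only for reproducibility

symmetric  <- rnorm(10000, mean = 13, sd = 1)  # symmetric around 13
right_skew <- rexp(10000, rate = 1)            # long right tail
left_skew  <- -rexp(10000, rate = 1)           # mirror image: long left tail

# Mean minus median: near zero when symmetric, positive when the mean
# lies above the median, negative when it lies below
sapply(
  list(symmetric = symmetric, right_skew = right_skew, left_skew = left_skew),
  function(x) mean(x) - median(x)
)
```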
Test your understanding of the concepts above.
Question 1: Which measure of central tendency is most appropriate for a dataset describing “Survival Time” where most patients survive 1-2 years, but a few lucky ones survive 20 years?
a) Mean
b) Median
c) Mode
Question 2: Calculate the mean and median of this Glucose level dataset: \[ \{80, 90, 90, 100, 200\} \]
Question 3: Write an R script below to calculate the mean of a vector containing the numbers 1 through 100.
Answer 1: (b) Median. The “lucky survivors” are outliers that skew the data to the right, so the mean would be misleadingly high.
Answer 2:
* Sum = \(80+90+90+100+200 = 560\)
* n = 5
* Mean = \(560 / 5 = 112\)
* Ordered Data: \(80, 90, 90, 100, 200\). Middle is 90.
* Median = \(90\).
* Note: The outlier (200) pulled the mean to 112, while the median stayed at 90.
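The hand calculation can be confirmed in R. This is a short sketch; the vector name glucose is an assumption.

```r
# Glucose levels from the question
glucose <- c(80, 90, 90, 100, 200)

mean(glucose)    # 112
median(glucose)  # 90
```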
Answer 3:
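The script for this answer is not shown with its output, so the following is one possible solution that fits the question. The variable name nums is an assumption.

```r
# Build the vector 1, 2, ..., 100 and take its mean
nums <- 1:100
mean(nums)
```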
## [1] 50.5
1. Run install.packages(c("rmarkdown", "ggplot2")) in your R console once.
2. In RStudio, go to File > New File > R Markdown..., delete the default content, and paste the code block above.