In statistics, a measure of central tendency is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or center of its distribution.
It is often referred to as a “summary statistic.” The three most common measures are:

1. The Mean
2. The Median
3. The Mode
The mean (or average) is the most popular and well-known measure of central tendency. It is the sum of all values divided by the number of values.
For a sample of size \(n\), the sample mean \(\bar{x}\) is calculated as:
\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]
Where:

* \(\sum\): The summation symbol
* \(x_i\): Each individual value in the data set
* \(n\): The total number of observations
Scenario: A teacher wants to find the average score of 5 students in a mini-quiz. The scores are: 85, 90, 70, 75, and 95.
# Define the data
scores <- c(85, 90, 70, 75, 95)
# Calculate mean
avg_score <- mean(scores)
print(paste("The average score is:", avg_score))
## [1] "The average score is: 83"
Note: The mean is highly sensitive to outliers (extreme values).
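To make the outlier sensitivity concrete, here is a minimal sketch that first writes the mean formula out by hand and then adds one extreme score to the quiz data. The extra score of 0 is a hypothetical value chosen purely for illustration; it is not part of the original scenario.

```r
# Quiz scores from the scenario above
scores <- c(85, 90, 70, 75, 95)

# The mean formula written out: sum of all values divided by the number of values
manual_mean <- sum(scores) / length(scores)

# Check that the hand-written formula agrees with the built-in mean()
isTRUE(all.equal(manual_mean, mean(scores)))

# Hypothetical outlier: one extra student who scored 0 (illustrative assumption)
scores_with_outlier <- c(scores, 0)
mean_with_outlier <- mean(scores_with_outlier)

# Compare the two averages side by side
print(c(original = manual_mean, with_outlier = mean_with_outlier))
```

A single extreme value moves the average from 83 to roughly 69, even though five of the six scores are 70 or higher.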
The median is the middle value in a data set when the values are arranged in ascending or descending order.
To find the median:

1. Arrange the data in order.
2. If \(n\) is odd, the median is the value at position \(\frac{n+1}{2}\).
3. If \(n\) is even, the median is the average of the values at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\).
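The steps above can be sketched as a small function. `manual_median` is a hypothetical helper written for illustration (in practice the built-in `median()` does this job), and it assumes a non-empty numeric vector with no missing values.

```r
# Hypothetical helper (not part of base R) that follows the steps literally
manual_median <- function(x) {
  x_sorted <- sort(x)   # Step 1: arrange the data in ascending order
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    # Step 2: odd n, take the value at position (n + 1) / 2
    x_sorted[(n + 1) / 2]
  } else {
    # Step 3: even n, average the values at positions n / 2 and n / 2 + 1
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
  }
}

odd_example  <- c(7, 1, 5)      # sorted: 1, 5, 7
even_example <- c(7, 1, 5, 3)   # sorted: 1, 3, 5, 7

manual_median(odd_example)    # 5 (the middle value)
manual_median(even_example)   # 4 (the average of 3 and 5)
```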
Scenario: Monthly household incomes in a small neighborhood: $2000, $2500, $3000, $3500, and $1,000,000 (an outlier).
The mean is pulled far above every typical income by the single $1,000,000 value, while the median gives a better sense of what a “typical” neighbor earns.
incomes <- c(2000, 2500, 3000, 3500, 1000000)
# Calculate Median
med_income <- median(incomes)
avg_income <- mean(incomes)
print(paste("Median Income:", med_income))
## [1] "Median Income: 3000"
print(paste("Mean Income:", round(avg_income, 2)))
## [1] "Mean Income: 202200"
The mode is the value that appears most frequently in a data set. A data set can have one mode (unimodal), two modes (bimodal), or more than two modes (multimodal).
Scenario: A shoe store tracks the sizes of sneakers sold in one hour: 8, 9, 9, 10, 11, 9, 8. The mode is 9 because it occurs most frequently.
Note: R does not have a built-in function for the statistical mode (the base function `mode()` returns an object's storage type instead), so we create a simple custom function or use a frequency table.
# Sample data
shoe_sizes <- c(8, 9, 9, 10, 11, 9, 8)
# Calculate mode using a frequency table
mode_val <- names(sort(table(shoe_sizes), decreasing = TRUE))[1]
print(paste("The most popular shoe size is:", mode_val))
## [1] "The most popular shoe size is: 9"
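The frequency-table approach returns only the first value when several values tie for the highest count, and it returns that value as text. As a sketch of the custom-function route mentioned in the note, the hypothetical helper `find_modes` below returns every tied value; the name and the tie-handling behavior are illustrative choices, not a standard R API.

```r
# Hypothetical helper: returns every value tied for the highest frequency
find_modes <- function(x) {
  freq <- table(x)                          # count occurrences of each value
  modes <- names(freq)[freq == max(freq)]   # keep all values with the top count
  # table() stores the values as character names, so convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

shoe_sizes <- c(8, 9, 9, 10, 11, 9, 8)
find_modes(shoe_sizes)       # 9

# Made-up data with a tie, to show the bimodal case
bimodal_sizes <- c(8, 8, 9, 9, 10)
find_modes(bimodal_sizes)    # 8 9
```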
When should you use which measure?
| Measure | Best Used For… | Sensitivity to Outliers |
|---|---|---|
| Mean | Continuous data with a symmetrical distribution | High (Very sensitive) |
| Median | Skewed data or data with outliers (e.g., Salaries) | Low (Robust) |
| Mode | Categorical/Nominal data (e.g., Favorite color) | Low |
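As a quick check of the first two rows of the table, the sketch below compares the mean and median on two small data sets. Both vectors are made-up values for illustration: one is symmetric, and the other replaces the largest value with an extreme one.

```r
# Made-up data for illustration only
symmetric <- c(10, 20, 30, 40, 50)    # evenly spread around 30
skewed    <- c(10, 20, 30, 40, 500)   # same data with one extreme value

# Mean and median side by side for each data set
comparison <- data.frame(
  data_set = c("symmetric", "skewed"),
  mean     = c(mean(symmetric), mean(skewed)),
  median   = c(median(symmetric), median(skewed))
)
print(comparison)
```

For the symmetric data the two measures agree at 30. For the skewed data the median stays at 30 while the mean rises to 120, which is why the median is the safer summary when outliers are present.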
The code below generates a skewed distribution and shows where the Mean and Median fall.
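library(ggplot2)
# Fix the random seed so the plot is reproducible (the seed value is arbitrary)
set.seed(123)
# Create a right-skewed dataset: Beta(2, 8) draws scaled to the range 0-100
data <- data.frame(val = rbeta(1000, 2, 8) * 100)
# Calculate stats
mu <- mean(data$val)
med <- median(data$val)
# Plot the histogram with vertical lines at the mean and median
# (linewidth replaces the deprecated size argument for lines in ggplot2 >= 3.4.0)
ggplot(data, aes(x = val)) +
  geom_histogram(fill = "skyblue", color = "white", bins = 30) +
  geom_vline(aes(xintercept = mu, color = "Mean"), linewidth = 1) +
  geom_vline(aes(xintercept = med, color = "Median"), linewidth = 1) +
  labs(title = "Mean vs Median in Right-Skewed Data",
       x = "Value", y = "Frequency") +
  scale_color_manual(name = "Measures", values = c(Mean = "red", Median = "blue")) +
  theme_minimal()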
rnorm(20) and calculate all three measures of central tendency.

End of Module I