In statistics, a measure of central tendency is a single value that describes a whole data set by identifying the middle or center of its distribution.
Measures of central tendency are one kind of “summary statistic.” The three most common measures are:

1. The Mean
2. The Median
3. The Mode
The mean (or average) is the sum of all values divided by the total number of values.
For a sample of size \(n\), the mean \(\bar{x}\) is calculated as:
\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}\]
Where:

* \(\sum\): The summation symbol.
* \(x_i\): Each individual value in the data set.
* \(n\): The total number of observations.
Academic Performance: If a student scores 80, 85, 90, 75, and 100 in five exams, the mean represents their overall average performance.
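As a quick check by hand, substituting these five scores into the formula gives:

\[\bar{x} = \frac{80 + 85 + 90 + 75 + 100}{5} = \frac{430}{5} = 86\]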
```r
# Exam scores
scores <- c(80, 85, 90, 75, 100)

# Calculate the mean
mean_score <- mean(scores)
print(paste("The mean score is:", mean_score))
## [1] "The mean score is: 86"
```
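The median is the middle value in a data set when the values are arranged in ascending or descending order. It splits the data into two equal halves: with an odd number of values it is the single middle value, and with an even number of values it is the mean of the two middle values.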
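In the same notation as the mean formula, one common way to write the median is sketched below. Here \(x_{(k)}\) denotes the \(k\)-th smallest value after sorting; this symbol is introduced only for illustration and does not appear elsewhere in this text.

\[\text{Median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even} \end{cases}\]

For instance, with \(n = 6\) sorted values the median is the average of the 3rd and 4th values.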
Real Estate: Median house prices are often used instead of the mean because a few multi-million dollar mansions (outliers) would artificially inflate the average, making the “typical” house seem more expensive than it actually is.
```r
# Salaries in a small startup (in thousands)
salaries <- c(45, 50, 52, 55, 60, 150) # 150 is an outlier

# Mean vs. median
cat("Mean Salary:", mean(salaries), "\n")
## Mean Salary: 68.66667
cat("Median Salary:", median(salaries), "\n")
## Median Salary: 53.5
```
Note: The outlier (150) pulls the mean up to about 68.7, a figure higher than every salary except the outlier itself, while the median (53.5) remains representative of the staff.
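To make the size of this effect concrete, the following sketch recomputes both measures after dropping the outlier. The names `salaries_no_outlier`, `mean_shift`, and `median_shift` are introduced here purely for illustration.

```r
# Same salary data as above (in thousands)
salaries <- c(45, 50, 52, 55, 60, 150)

# Drop the outlier (illustrative variable name, not from the original example)
salaries_no_outlier <- salaries[salaries != 150]

# How far does each measure move when the outlier is removed?
mean_shift <- mean(salaries) - mean(salaries_no_outlier)
median_shift <- median(salaries) - median(salaries_no_outlier)

cat("Mean shift:", mean_shift, "\n")
cat("Median shift:", median_shift, "\n")
```

Removing a single value moves the mean by roughly 16.3 (from about 68.7 to 52.4), while the median moves by only 1.5 (from 53.5 to 52).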
The mode is the value that appears most frequently in a data set. A data set can have one mode (unimodal), two modes (bimodal), or multiple modes (multimodal).
There is no specific algebraic formula for the mode; it is identified by counting frequencies: \[\text{Mode} = \text{value with the highest frequency}\]
Inventory Management: A shoe store manager looks at the mode of shoe sizes sold to determine which size to restock most heavily. If size 9 is sold more than any other, the mode is 9.
R does not have a built-in function for the statistical mode (the base function `mode()` returns an object's storage mode instead), so we create a simple one:
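```r
# Function to calculate the mode
# If several values are tied, it returns the one that appears first in v
get_mode <- function(v) {
  uniqv <- unique(v)                             # distinct values, in order of first appearance
  uniqv[which.max(tabulate(match(v, uniqv)))]    # value with the highest count
}

# Data: shoe sizes sold
shoe_sizes <- c(7, 8, 8, 9, 9, 9, 10, 11)

# Calculate the mode
shoe_mode <- get_mode(shoe_sizes)
print(paste("The mode of shoe sizes is:", shoe_mode))
## [1] "The mode of shoe sizes is: 9"
```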
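Because `get_mode` relies on `which.max`, it reports only one value even when a data set is bimodal or multimodal. One possible variant, sketched here under the hypothetical name `get_modes`, uses base R's `table()` to return every value tied for the highest frequency. It is an illustration rather than a standard function.

```r
# Hypothetical helper (not part of the original code): return every mode
get_modes <- function(v) {
  counts <- table(v)                             # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]  # keep all values tied for the top count
  # table() stores values as character labels; convert back for numeric input
  if (is.numeric(v)) as.numeric(modes) else modes
}

# Bimodal example: sizes 8 and 9 are each sold twice
get_modes(c(7, 8, 8, 9, 9, 10))

# Categorical data works too
get_modes(c("red", "blue", "red", "green"))
```

The first call returns both 8 and 9, and the second returns "red". Note that `table()` ignores `NA` values by default, so missing values are never reported as a mode in this sketch.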
| Measure | Best Used For… | Sensitivity to Outliers |
|---|---|---|
| Mean | Symmetric data, continuous variables. | High (Very sensitive) |
| Median | Skewed data, ordinal data. | Low (Robust) |
| Mode | Categorical data (e.g., favorite color). | Low |
Let’s see how these measures look on a distribution.
```r
set.seed(123)
data <- rgamma(100, shape = 2, scale = 2) # Right-skewed data

hist(data, col = "lightblue", main = "Mean vs Median on Skewed Data", xlab = "Value")
abline(v = mean(data), col = "red", lwd = 2, lty = 1)    # solid red line: mean
abline(v = median(data), col = "blue", lwd = 2, lty = 2) # dashed blue line: median
legend("topright", legend = c("Mean", "Median"), col = c("red", "blue"), lty = 1:2, lwd = 2)
```
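The picture can be backed up with numbers. The sketch below prints the sample mean and median for the same simulated draw and compares them with the theoretical values for a gamma distribution with shape 2 and scale 2. The variable names `skewed`, `theoretical_mean`, and `theoretical_median` are illustrative, and the sample values depend on the random draw.

```r
set.seed(123)
skewed <- rgamma(100, shape = 2, scale = 2)  # same draw as in the plot, under an illustrative name

# Sample estimates
cat("Sample mean:  ", mean(skewed), "\n")
cat("Sample median:", median(skewed), "\n")

# Theoretical values for a gamma distribution with shape = 2, scale = 2
theoretical_mean <- 2 * 2                                # mean of a gamma is shape * scale
theoretical_median <- qgamma(0.5, shape = 2, scale = 2)  # 50th percentile
cat("Theoretical mean:  ", theoretical_mean, "\n")
cat("Theoretical median:", theoretical_median, "\n")
```

For a right-skewed distribution such as this one, the theoretical median falls below the theoretical mean, which is the pattern the two vertical lines in the histogram are meant to show.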
Understanding when to use each measure is crucial for accurate data storytelling and decision-making.