In statistics, a measure of central tendency is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or center of its distribution.
Think of it as a “typical” value for the data. The three most common measures are:

1. Mean
2. Median
3. Mode
The mean is the most common measure of central tendency. It is the “average” we calculate by summing all observations and dividing by the total count.
For a sample of size \(n\), the mean (\(\bar{x}\)) is: \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]
For a population of size \(N\), the mean (\(\mu\)) is: \[\mu = \frac{\sum_{i=1}^{N} x_i}{N}\]
Context: A teacher wants to find the average score of 5 students in a mini-quiz. Scores: 85, 90, 70, 75, 95.
Calculation: \[\bar{x} = \frac{85 + 90 + 70 + 75 + 95}{5} = \frac{415}{5} = 83\]
scores <- c(85, 90, 70, 75, 95)
mean_score <- mean(scores)
print(paste("The mean score is:", mean_score))
## [1] "The mean score is: 83"
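To tie the formula back to the code, the same result can be reproduced by hand from its two ingredients, the sum of the observations and their count. This is only a minimal sketch; the names `total`, `n`, and `manual_mean` are introduced here for illustration.

```r
scores <- c(85, 90, 70, 75, 95)

# Numerator of the formula: the sum of all observations
total <- sum(scores)
# Denominator: the number of observations, n
n <- length(scores)

# Apply the definition directly and compare with the built-in mean()
manual_mean <- total / n
print(manual_mean)
print(isTRUE(all.equal(manual_mean, mean(scores))))
```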
The median is the middle value in a data set when the values are arranged in ascending or descending order. It splits the data into two equal halves.
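With an even number of observations there is no single middle value, so the two central values are averaged, and this is what R's `median()` returns. A small sketch with made-up numbers (the vector `even_values` is illustrative and not part of the income example):

```r
even_values <- c(50, 45, 55, 52)      # four values, deliberately unsorted
sorted_values <- sort(even_values)    # 45 50 52 55

# Pick the two central positions: n/2 and n/2 + 1
n <- length(sorted_values)
middle_pair <- sorted_values[c(n / 2, n / 2 + 1)]

# Averaging the middle pair by hand agrees with median()
manual_median <- mean(middle_pair)
print(manual_median)          # 51
print(median(even_values))    # 51
```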
Context: Household incomes in a small neighborhood. Incomes (in thousands): $45, $50, $52, $55, $250.
Note that the $250k income is an outlier.

* Ordered data: 45, 50, 52, 55, 250
* Median: 52 (the middle value)

Unlike the mean (90.4, which is higher than four of the five incomes), the median is barely affected by the outlier ($250k), making it a better measure of a typical value for skewed data.
incomes <- c(45, 50, 52, 55, 250)
median_income <- median(incomes)
print(paste("The median income is:", median_income))
## [1] "The median income is: 52"
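One way to see how differently the two measures react is to compute both with and without the outlier. The following is a rough sketch on the same incomes; the name `incomes_no_outlier` is an assumption made for the example.

```r
incomes <- c(45, 50, 52, 55, 250)

# Drop the single extreme value to see how each measure reacts
incomes_no_outlier <- incomes[incomes != 250]

mean(incomes)                # 90.4
mean(incomes_no_outlier)     # 50.5
median(incomes)              # 52
median(incomes_no_outlier)   # 51
```

Removing one observation moves the mean by almost 40 (thousand), while the median shifts by only 1.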
The mode is the value that appears most frequently in a data set. A set can have one mode (unimodal), two modes (bimodal), or many modes (multimodal).
Context: A shoe store tracks the sizes sold in one hour to determine what to restock. Sizes sold: 7, 8, 8, 9, 10, 10, 10, 11.
Mode: 10 (It appears 3 times).
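Note: Base R does not have a built-in function for the statistical mode (its `mode()` function returns an object's storage type instead), so we create a simple function.
get_mode <- function(v) {
  uniqv <- unique(v)  # distinct values, in order of first appearance
  # Count how often each distinct value occurs and return the most frequent;
  # on ties, which.max() picks the first one encountered
  uniqv[which.max(tabulate(match(v, uniqv)))]
}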
shoe_sizes <- c(7, 8, 8, 9, 10, 10, 10, 11)
mode_size <- get_mode(shoe_sizes)
print(paste("The mode shoe size is:", mode_size))
## [1] "The mode shoe size is: 10"
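The `get_mode` function above returns a single value, so for bimodal or multimodal data it silently reports only the first mode it encounters. A possible variant that returns every mode is sketched below; `get_all_modes` is a hypothetical helper, not a base R function, and the conversion back to numbers is only intended for simple values such as shoe sizes.

```r
get_all_modes <- function(v) {
  counts <- table(v)  # frequency of each distinct value
  # Keep every value whose count equals the highest count
  modes <- names(counts)[as.vector(counts) == max(counts)]
  # table() stores the values as character names; convert back for numeric input
  if (is.numeric(v)) as.numeric(modes) else modes
}

get_all_modes(c(7, 8, 8, 9, 10, 10, 10, 11))   # 10
get_all_modes(c(7, 8, 8, 9, 10, 10))           # 8 10 (bimodal)
```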
| Measure | Best Used For… | Sensitivity to Outliers |
|---|---|---|
| Mean | Symmetric data, Normal distributions | High (Highly affected) |
| Median | Skewed data (e.g., Income, House Prices) | Low (Robust) |
| Mode | Categorical data (e.g., Favorite color) | Low |
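For categorical data the mean and median are not defined, but the mode still is. A brief sketch with invented survey answers (the vector `favorite_colors` is made up for illustration), using base R's `table()` to count each category:

```r
favorite_colors <- c("blue", "red", "blue", "green", "blue", "red")

# Count how many times each category appears
color_counts <- table(favorite_colors)
print(color_counts)

# The mode is the category with the highest count
modal_color <- names(color_counts)[which.max(color_counts)]
print(modal_color)   # "blue"
```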
library(ggplot2)
# Create a right-skewed dataset: 100 roughly normal values plus three large outliers
set.seed(42)  # fix the random draw so the plot is reproducible (seed value is arbitrary)
data <- data.frame(val = c(rnorm(100, 50, 10), 150, 160, 170))
ggplot(data, aes(x = val)) +
  geom_histogram(fill = "skyblue", color = "white", bins = 30) +
  # linewidth replaces size for line geoms in ggplot2 3.4.0 and later
  geom_vline(aes(xintercept = mean(val), color = "Mean"), linewidth = 1) +
  geom_vline(aes(xintercept = median(val), color = "Median"), linewidth = 1) +
  scale_color_manual(name = "Statistics", values = c(Mean = "red", Median = "blue")) +
  labs(title = "Mean vs. Median in Skewed Data",
       x = "Value", y = "Frequency") +
  theme_minimal()
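The gap between the two vertical lines in the plot can also be checked numerically. The sketch below regenerates data of the same shape; the seed `123` and the names `skewed` and `symmetric_part` are assumptions made for this example, so the exact numbers will not match the plotted sample.

```r
set.seed(123)  # arbitrary seed, used only to make this sketch reproducible

# Same recipe as the plot: 100 roughly normal values plus three large outliers
skewed <- c(rnorm(100, mean = 50, sd = 10), 150, 160, 170)
symmetric_part <- skewed[1:100]

# Without the outliers, mean and median sit close together
print(c(mean = mean(symmetric_part), median = median(symmetric_part)))

# With the outliers, the mean is pulled toward the long right tail
print(c(mean = mean(skewed), median = median(skewed)))
```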