In statistics, a measure of central tendency is a single value that describes the center of a data distribution. It identifies the location where data points cluster. The three most common measures are the mean, the median, and the mode.
This module explores the mathematical formulas behind these measures, real-life applications, and how to calculate them using R.
The mean is the sum of all observations divided by the number of observations. It is the most common measure of central tendency but is sensitive to outliers.
For a sample size \(n\) with values \(x_1, x_2, ..., x_n\):
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Imagine a teacher wants to calculate the “average” performance of a student over 5 exams. Scores: 85, 90, 78, 92, 88.
\[ \text{Mean} = \frac{85 + 90 + 78 + 92 + 88}{5} = \frac{433}{5} = 86.6 \]
# Create a vector of exam scores
exam_scores <- c(85, 90, 78, 92, 88)
# Calculate the mean using the built-in mean() function
average_score <- mean(exam_scores)
print(paste("The mean exam score is:", average_score))
## [1] "The mean exam score is: 86.6"
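To connect the formula to the built-in function, the sketch below computes the mean by hand as the sum divided by the count and compares it with `mean()`. The names `n` and `manual_mean` are chosen here for illustration.

```r
# Exam scores from the example above
exam_scores <- c(85, 90, 78, 92, 88)

# Apply the formula directly: sum of observations divided by their count
n <- length(exam_scores)
manual_mean <- sum(exam_scores) / n
print(manual_mean)
## [1] 86.6

# The hand-computed value should agree with the built-in mean()
print(all.equal(manual_mean, mean(exam_scores)))
## [1] TRUE
```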
If the student scores a 0 on a sixth exam, the mean drops from 86.6 to about 72.2, even though the other five scores are high.
# Adding an outlier
scores_with_outlier <- c(85, 90, 78, 92, 88, 0)
mean(scores_with_outlier)
## [1] 72.16667
The median is the middle value when the data set is ordered from least to greatest. It is robust against outliers (it is not pulled by extreme values).
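As a quick check of this robustness claim, the sketch below reuses the exam scores from the mean example and measures how far each statistic moves when a single 0 is added. The variable names `mean_shift` and `median_shift` are illustrative.

```r
# Exam scores from the mean example, without and with a single outlier (0)
scores <- c(85, 90, 78, 92, 88)
scores_with_outlier <- c(scores, 0)

# How far does each measure move when the outlier is added?
mean_shift <- mean(scores) - mean(scores_with_outlier)
median_shift <- median(scores) - median(scores_with_outlier)

print(round(mean_shift, 2))
## [1] 14.43
print(median_shift)
## [1] 1.5
```

The mean falls by more than 14 points, while the median moves by only 1.5.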
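First, sort the data \(\{x_1, x_2, ..., x_n\}\) in ascending order, writing the sorted values as \(x_{(1)} \le x_{(2)} \le ... \le x_{(n)}\). If \(n\) is odd, the median is the middle value; if \(n\) is even, it is the average of the two middle values:
\[ \text{Median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases} \]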
House prices in a neighborhood are often skewed. Prices (in $1000s): 350, 400, 380, 420, 2500 (Mansion).
If we use the mean, the mansion makes it look like the “average” house costs $810k. The median gives a better representation of a typical house.
Ordered: 350, 380, 400, 420, 2500. Median = 400.
# Create a vector of house prices
house_prices <- c(350, 400, 380, 420, 2500)
# Calculate the median using the built-in median() function
median_price <- median(house_prices)
mean_price <- mean(house_prices)
print(paste("The median price is:", median_price))
## [1] "The median price is: 400"
print(paste("The mean price is:", mean_price, "(Skewed by the mansion)"))
## [1] "The mean price is: 810 (Skewed by the mansion)"
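The built-in `median()` hides the sort-and-pick step. A minimal sketch of that step is shown below, assuming the usual rule: take the middle value when the number of observations is odd, and average the two middle values when it is even. The function name `manual_median` is hypothetical, and the four-house vector simply drops the mansion to illustrate the even case.

```r
# Hand-rolled median: sort, then pick the middle (odd n)
# or average the two middle values (even n)
manual_median <- function(x) {
  sorted <- sort(x)
  n <- length(sorted)
  if (n %% 2 == 1) {
    sorted[(n + 1) / 2]
  } else {
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

# Odd number of observations: the house prices from above
house_prices <- c(350, 400, 380, 420, 2500)
print(manual_median(house_prices))
## [1] 400

# Even number of observations: the same prices without the mansion
print(manual_median(c(350, 400, 380, 420)))
## [1] 390
```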
The mode is the value that appears most frequently in a data set. A dataset can be unimodal (one mode), bimodal (two modes), or multimodal.
\[ \text{Mode} = x_k \text{ such that freq}(x_k) \ge \text{freq}(x_i) \text{ for all } i \]
A shoe shop manager needs to restock. Knowing the “average” shoe size (e.g., 8.43) is useless. They need to know which size sells the most (e.g., Size 9).
Sales: 7, 8, 9, 9, 9, 10, 10, 11. Mode: 9.
R does not have a built-in statistical function for the mode; the base mode() function reports an object's storage mode instead. We must write a custom function or use a frequency table.
# Create a vector of shoe sizes sold
shoe_sizes <- c(7, 8, 9, 9, 9, 10, 10, 11)
# Method 1: Using a frequency table to visualize
freq_table <- table(shoe_sizes)
print(freq_table)
## shoe_sizes
##  7  8  9 10 11
##  1  1  3  2  1
# Method 2: Custom Function to find the Mode
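get_mode <- function(v) {
  uniqv <- unique(v)  # distinct values, in order of first appearance
  # Count each distinct value; which.max() returns the first maximum,
  # so only one mode is reported when several values are tied
  uniqv[which.max(tabulate(match(v, uniqv)))]
}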
calculated_mode <- get_mode(shoe_sizes)
print(paste("The most popular shoe size (Mode) is:", calculated_mode))
## [1] "The most popular shoe size (Mode) is: 9"
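Because a dataset can be bimodal or multimodal, a function that returns a single value can hide ties. The sketch below is one possible variant that returns every value sharing the highest frequency. The name `get_all_modes` and the `tied_sizes` vector are hypothetical and used only for illustration; the function assumes numeric input.

```r
# Return every value whose frequency equals the maximum frequency
get_all_modes <- function(v) {
  counts <- table(v)
  is_top <- as.vector(counts) == max(counts)
  # table() stores the values as names, so convert them back to numbers
  as.numeric(names(counts)[is_top])
}

# Unimodal case: the shoe sales from above
shoe_sizes <- c(7, 8, 9, 9, 9, 10, 10, 11)
print(get_all_modes(shoe_sizes))
## [1] 9

# Bimodal case: made-up sales where sizes 9 and 10 are tied
tied_sizes <- c(7, 8, 9, 9, 10, 10)
print(get_all_modes(tied_sizes))
## [1]  9 10
```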
| Measure | Best Used For… | Sensitivity to Outliers |
|---|---|---|
| Mean | Symmetric data, numeric data without extremes | High |
| Median | Skewed data (e.g., income, home prices) | Low (Robust) |
| Mode | Categorical data (nominal) or discrete data (e.g., sizes) | Low |
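The last row of the table can be illustrated with nominal data, where a mean or median has no meaning but the mode is still well defined. The sketch below uses made-up survey responses; `favorite_colors` and the other names are hypothetical.

```r
# Hypothetical survey responses (made-up data for illustration only)
favorite_colors <- c("blue", "red", "blue", "green", "blue", "red")

# Count each category, then pick the label with the highest count
color_counts <- table(favorite_colors)
most_common <- names(color_counts)[which.max(color_counts)]
print(most_common)
## [1] "blue"
```

Calling `mean()` on the same character vector returns `NA` with a warning, which is why the mode is the natural choice here.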
Let’s visualize a skewed distribution to see where the Mean and Median land.
# Generate skewed data (Chi-square distribution)
set.seed(123)
data <- rchisq(1000, df=4)
# Plot histogram
hist(data, col="lightblue", main="Right Skewed Distribution",
     xlab="Value", breaks=30)
# Add vertical lines for Mean and Median
abline(v = mean(data), col = "red", lwd = 2, lty = 1) # Mean
abline(v = median(data), col = "blue", lwd = 2, lty = 2) # Median
legend("topright", legend=c("Mean", "Median"),
       col=c("red", "blue"), lty=c(1, 2), lwd=2)
Observation: In a right-skewed distribution, the Mean (Red) is pulled to the right by the tail, while the Median (Blue) stays closer to the peak.
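The same observation can be checked numerically without the plot. The sketch below regenerates the sample with the same seed and compares the two statistics; `skewed_data` is an illustrative name. For a chi-square distribution with 4 degrees of freedom the theoretical mean is 4 and the median is roughly 3.36, so the sample mean is expected to exceed the sample median.

```r
# Regenerate the right-skewed sample with the same seed as the plot
set.seed(123)
skewed_data <- rchisq(1000, df = 4)

# In a right-skewed sample the mean is expected to exceed the median
print(mean(skewed_data))
print(median(skewed_data))
print(mean(skewed_data) > median(skewed_data))
```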
```
File -> New File -> R Markdown.