Measures of central tendency are summary statistics that represent the center point or “typical” value of a dataset. They condense a large set of data into a single representative value to make the data easier to understand.
The three primary measures are:

1. Mean (The Average)
2. Median (The Middle Value)
3. Mode (The Most Frequent Value)
The mean is the most common measure of central tendency. It is calculated by summing all observations and dividing by the total number of observations.
For a sample of size \(n\): \[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]
Where:

- \(\bar{x}\) = Sample Mean
- \(\sum x_i\) = Sum of all values
- \(n\) = Number of values
# Exam scores of 10 students
scores <- c(85, 90, 78, 92, 88, 76, 89, 95, 82, 85)
# Calculate mean using R's built-in function
avg_score <- mean(scores)
print(paste("The average score is:", avg_score))
## [1] "The average score is: 86"
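To connect the formula to the built-in function, here is a minimal sketch that computes the same mean directly from its definition. The variable names `total`, `n`, and `manual_mean` are illustrative and not part of the original example.

```r
# Exam scores from the example above
scores <- c(85, 90, 78, 92, 88, 76, 89, 95, 82, 85)

# Numerator of the formula: the sum of all observations
total <- sum(scores)

# Denominator of the formula: the number of observations
n <- length(scores)

# Sample mean computed directly from the definition
manual_mean <- total / n

print(manual_mean)
## [1] 86
```

The result matches `mean(scores)`, which performs the same sum-and-divide calculation internally.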
Education: Teachers use the mean to gauge the overall performance of a class. A low mean score might indicate that the teaching method was ineffective or that the exam was too difficult.
The median is the middle value of a dataset when it is ordered from smallest to largest. It is highly “robust,” meaning it is largely unaffected by extreme outliers: changing the largest or smallest value does not move the middle of the ordered data.
To find the median, first sort the data:

1. If \(n\) is odd: The median is the value at position \(\frac{n+1}{2}\).
2. If \(n\) is even: The median is the average of the values at positions \(\frac{n}{2}\) and \(\frac{n}{2} + 1\).
# Salaries of 5 employees (in thousands)
salaries <- c(45, 50, 55, 60, 1000) # 1000 is an outlier (CEO)
# Calculate mean vs median
print(paste("Mean Salary:", mean(salaries)))
## [1] "Mean Salary: 242"
print(paste("Median Salary:", median(salaries)))
## [1] "Median Salary: 55"
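The odd/even positional rule described earlier can be sketched as a small function. `manual_median` is a hypothetical helper written only for illustration. It assumes a numeric vector with no missing values. In practice the built-in `median()` is the better choice.

```r
# Hypothetical helper implementing the positional rule for the median
manual_median <- function(x) {
  x <- sort(x)        # step 1: order the data from smallest to largest
  n <- length(x)
  if (n %% 2 == 1) {
    # odd n: the single middle value at position (n + 1) / 2
    x[(n + 1) / 2]
  } else {
    # even n: average of the values at positions n / 2 and n / 2 + 1
    mean(x[c(n / 2, n / 2 + 1)])
  }
}

manual_median(c(45, 50, 55, 60, 1000))  # odd n: 55
manual_median(c(45, 50, 55, 60))        # even n: 52.5
```

With the CEO's salary removed, \(n\) becomes even and the median is the average of 50 and 55.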
Real Estate: Median house prices are used instead of the mean because a few multi-million dollar mansions (outliers) would unfairly inflate the “average” price, making it look like most houses are more expensive than they actually are.
The mode is the value that appears most frequently in a dataset. A dataset can be unimodal (one mode), bimodal (two modes), or multimodal.
Note: Base R does not have a standard statistical `mode()` function (the `mode()` function in R actually tells you the data type). We often create a custom function or use a package.
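# Return the most frequent value (the first one found if there is a tie)
get_mode <- function(v) {
  uniqv <- unique(v)
  # Count occurrences of each unique value and pick the largest count
  uniqv[which.max(tabulate(match(v, uniqv)))]
}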
# Survey of favorite colors
colors <- c("Blue", "Red", "Blue", "Green", "Blue", "Red")
print(paste("The most popular color is:", get_mode(colors)))
## [1] "The most popular color is: Blue"
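Because `which.max()` returns only the first maximum, `get_mode()` reports a single value even when the data is bimodal or multimodal. A minimal sketch of a variant that returns every tied value is shown here. `get_all_modes` and the `sizes` vector are hypothetical names and made-up data used only for illustration.

```r
# Hypothetical helper: return every value tied for the highest frequency
get_all_modes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))  # frequency of each unique value
  uniqv[counts == max(counts)]         # keep all values with the top count
}

# Made-up bimodal data: 8 and 9 each appear twice
sizes <- c(7, 8, 8, 9, 9, 10)
get_all_modes(sizes)
## [1] 8 9
```

For unimodal data such as the color survey, this variant returns the same single value as `get_mode()`.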
Inventory Management: A shoe store manager needs the mode of shoe sizes sold to know which size to stock the most. The “average” shoe size (e.g., 8.34) is useless for ordering physical inventory.
| Measure | Best Used For | Sensitive to Outliers? |
|---|---|---|
| Mean | Continuous, roughly symmetric data with no extreme outliers (e.g., Normal distribution) | Yes (Very sensitive) |
| Median | Skewed data or data with extreme outliers | No |
| Mode | Categorical data or discrete data | No |
When data is skewed, these measures separate. As a rule of thumb for unimodal data (it can fail for some distributions):

- Right Skewed (Positive): Mode < Median < Mean
- Left Skewed (Negative): Mean < Median < Mode
- Symmetric: Mean \(\approx\) Median \(\approx\) Mode
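The ordering of the three measures under skew can be checked on a small example. The data below is made up for illustration, and `simple_mode` and `summarise_centre` are hypothetical helpers. This is a sketch of the pattern on one dataset, not a proof that it always holds.

```r
# Made-up dataset with a long right tail
right_skewed <- c(1, 2, 2, 2, 3, 3, 4, 5, 9, 15)

# Mirror the values to obtain a long left tail
left_skewed <- 16 - right_skewed

# Hypothetical helper: most frequent value of a numeric vector
simple_mode <- function(x) as.numeric(names(which.max(table(x))))

# Hypothetical helper: collect the three measures in one named vector
summarise_centre <- function(x) {
  c(mode = simple_mode(x), median = median(x), mean = mean(x))
}

summarise_centre(right_skewed)  # mode 2, median 3, mean 4.6
summarise_centre(left_skewed)   # mode 14, median 13, mean 11.4
```

In the right-skewed data the few large values pull the mean above the median, and mirroring the data reverses the order.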
Selecting the correct measure of central tendency depends entirely on the type of data and the distribution. While the mean is mathematically powerful, the median and mode often provide a more “honest” look at real-world data containing outliers or categories.