In statistics, a measure of central tendency is a single value that describes a data set by identifying its central position. For this reason, measures of central tendency are sometimes called measures of central location.
The three main measures are the mean, the median, and the mode.
The mean is the most common measure of central tendency. It is the sum of all values divided by the number of observations.
Suppose a dataset \(x\) contains \(n\) values: \(x_1, x_2, \ldots, x_n\).
The Sample Mean (\(\bar{x}\)): \[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
The Population Mean (\(\mu\)), where \(N\) is the number of values in the whole population and \(X_i\) is the \(i\)-th of them: \[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} \]
Scenario: A student receives the following quiz scores: 85, 90, 75, 92, 88.
\[ \bar{x} = \frac{85 + 90 + 75 + 92 + 88}{5} \] \[ \bar{x} = \frac{430}{5} = 86 \]
```r
# Create a vector of quiz scores
quiz_scores <- c(85, 90, 75, 92, 88)

# Calculate the mean
mean_score <- mean(quiz_scores)
print(paste("The mean score is:", mean_score))
## [1] "The mean score is: 86"
```
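To connect the formula to the code, here is a minimal sketch that computes the sample mean directly as the sum divided by the number of observations. The variable names `total`, `n`, and `manual_mean` are chosen here for illustration and are not part of the original example.

```r
# Same quiz scores as in the worked example
quiz_scores <- c(85, 90, 75, 92, 88)

# Numerator of the formula: the sum of all observations
total <- sum(quiz_scores)

# Denominator: the number of observations, n
n <- length(quiz_scores)

# Sample mean computed directly from the formula
manual_mean <- total / n
print(manual_mean)

# Check that the result agrees with the built-in mean()
print(isTRUE(all.equal(manual_mean, mean(quiz_scores))))
```

The manual result should match `mean()`, which performs the same calculation for a numeric vector.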
The median is the middle value of a data set that has been arranged in order of magnitude (sorted). The median is less affected by outliers (extremely high or low values) than the mean.
First, sort the data (\(x_{(1)}, x_{(2)}, ..., x_{(n)}\)).
If \(n\) is odd: The median is the value at position: \[ \frac{n+1}{2} \]
If \(n\) is even: The median is the average of the two middle values at positions: \[ \frac{n}{2} \quad \text{and} \quad \frac{n}{2} + 1 \]
Scenario (Odd number): Family ages: 10, 50, 12, 45, 15.

1. Sort: 10, 12, 15, 45, 50.
2. Middle value is 15.

Scenario (Even number): Test scores: 10, 20, 30, 40.

1. Middle two are 20 and 30.
2. Average: \((20+30)/2 = 25\).
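The positional rule can be sketched as a small function. This is an illustration only: `median_by_position` is a hypothetical helper name, and the sketch assumes a non-empty numeric vector with no missing values. In practice the built-in `median()` is the usual choice.

```r
# Hypothetical helper: median via the positional rule
median_by_position <- function(x) {
  sorted <- sort(x)  # arrange the values in order of magnitude
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Odd n: the single middle value at position (n + 1) / 2
    sorted[(n + 1) / 2]
  } else {
    # Even n: average of the values at positions n / 2 and n / 2 + 1
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

family_ages <- c(10, 50, 12, 45, 15)  # odd number of values
test_scores <- c(10, 20, 30, 40)      # even number of values

print(median_by_position(family_ages))
print(median_by_position(test_scores))
```

For the two scenarios this gives 15 and 25, the same values found by hand.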
```r
# Create a vector of home prices (in thousands)
# Note the outlier (2500)
home_prices <- c(300, 350, 280, 320, 2500, 290)

# Calculate Mean and Median to see the difference
mean_price <- mean(home_prices)
median_price <- median(home_prices)

print(paste("Mean Price:", round(mean_price, 2)))
## [1] "Mean Price: 673.33"
print(paste("Median Price:", median_price))
## [1] "Median Price: 310"
```
Note: The single outlier (2500) pulls the mean up to 673.33, above every other price in the data, so the median (310) represents the “typical” house better than the mean.
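One way to see how sensitive each measure is to the outlier is to recompute both after removing it. The sketch below is illustrative. The names `prices_no_outlier`, `mean_shift`, and `median_shift` are introduced here and do not appear in the original example.

```r
# Home prices (in thousands), including the outlier 2500
home_prices <- c(300, 350, 280, 320, 2500, 290)

# Drop the single outlier to see how each measure reacts
prices_no_outlier <- home_prices[home_prices != 2500]

# How far does each measure move when the outlier is included?
mean_shift <- mean(home_prices) - mean(prices_no_outlier)
median_shift <- median(home_prices) - median(prices_no_outlier)

print(round(mean_shift, 2))
print(median_shift)
```

Without the outlier the mean is 308 and the median is 300, so including the outlier moves the mean by about 365 but the median by only 10.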
The mode is the value that appears most frequently in a data set. A data set may have one mode (unimodal), two modes (bimodal), more than two (multimodal), or no mode at all when every value occurs equally often.
There is no algebraic formula for the mode in raw data; it is determined by counting frequencies.
Scenario: Shoe sizes sold today: 8, 9, 8, 10, 7, 8, 11.

* 7: 1 time
* 8: 3 times
* 9: 1 time
* 10: 1 time
* 11: 1 time

The mode is 8.
R does not have a built-in function for the statistical mode (the mode() function in R returns an object's storage mode, such as "numeric" or "character"). We usually create a custom function or use a table.
```r
# Create a vector of shoe sizes
# (the scenario's seven sales plus two more: a 9 and an 8)
shoe_sizes <- c(8, 9, 8, 10, 7, 8, 11, 9, 8)

# Using table to find frequencies
freq_table <- table(shoe_sizes)
print(freq_table)
## shoe_sizes
##  7  8  9 10 11
##  1  4  2  1  1

# Sorting to find the max
sorted_freq <- sort(freq_table, decreasing = TRUE)
print(paste("The most common shoe size (Mode) is:", names(sorted_freq)[1]))
## [1] "The most common shoe size (Mode) is: 8"
```
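Taking only the first entry of a sorted frequency table reports a single value even when two values tie for the highest count. A minimal sketch that returns every tied value is shown below. `all_modes` is a hypothetical helper name, and the second vector is made-up data used only to show a bimodal case.

```r
# Hypothetical helper: return every value tied for the highest frequency
all_modes <- function(v) {
  freq <- table(v)
  # names(freq) are character strings, so numeric data comes back as text
  names(freq)[as.vector(freq) == max(freq)]
}

one_mode  <- c(8, 9, 8, 10, 7, 8, 11)  # 8 appears three times
two_modes <- c(1, 1, 2, 2, 3)          # 1 and 2 both appear twice

print(all_modes(one_mode))
print(all_modes(two_modes))
```

Note that when every value occurs exactly once, this sketch returns all of them. Whether to report that case as "no mode" is a convention the caller would need to decide.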
| Measure | Definition | Best Used When… |
|---|---|---|
| Mean | Arithmetic average | Data is roughly symmetric (e.g., approximately normal) and has no extreme outliers. |
| Median | Middle value of the sorted data | Data is skewed or has outliers (e.g., Income, Property Prices). |
| Mode | Most Frequent | Data is categorical (Nominal) or when finding the “most popular” item. |
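To illustrate the last row of the table, here is a short sketch of finding the mode of nominal data, where a mean or median is not meaningful. The survey data and the names `colours`, `counts`, and `top_colour` are invented for this example.

```r
# Hypothetical nominal data: favourite colours from a small survey
colours <- c("red", "blue", "green", "blue", "red", "blue")

# Count how often each category occurs
counts <- table(colours)
print(counts)

# The category with the highest count is the mode
# (which.max returns the first maximum if there is a tie)
top_colour <- names(counts)[which.max(counts)]
print(top_colour)
```

Here "blue" occurs three times, more than any other colour, so it is the mode.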
Consider the following dataset representing the number of hours 10 students studied for an exam:
\[ Data = \{2, 5, 3, 2, 10, 4, 2, 5, 1, 6\} \]
Calculate the Mean, Median, and Mode using R.
```r
hours <- c(2, 5, 3, 2, 10, 4, 2, 5, 1, 6)

# Calculations
calc_mean <- mean(hours)
calc_median <- median(hours)

# Custom function for Mode
# (returns the first most frequent value if there is a tie)
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
calc_mode <- get_mode(hours)

# Output
data.frame(
  Metric = c("Mean", "Median", "Mode"),
  Value = c(calc_mean, calc_median, calc_mode)
)
##   Metric Value
## 1   Mean   4.0
## 2 Median   3.5
## 3   Mode   2.0
```