In statistics, a measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location.
The three main measures are:

1. Mean (The Average)
2. Median (The Middle)
3. Mode (The Most Frequent)
The mean is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data.
For a sample of size \(n\) with observed values \(x_1, x_2, \ldots, x_n\), the sample mean (denoted \(\bar{x}\)) is calculated as:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Where:

* \(\sum\) indicates the summation.
* \(x_i\) represents each individual value.
* \(n\) represents the total number of observations.
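The formula maps directly onto two base R functions: `sum()` for the numerator and `length()` for \(n\). A minimal sketch, using a made-up vector `x` that is not part of the original example:

```r
# Hypothetical values, chosen only to illustrate the formula
x <- c(2, 4, 9)

# Numerator: the sum of all x_i; denominator: the number of observations n
manual_mean <- sum(x) / length(x)

# The built-in mean() should agree with the manual calculation
builtin_mean <- mean(x)
print(all.equal(manual_mean, builtin_mean))
```

Writing the calculation out by hand is mainly useful as a check; in practice `mean()` is the idiomatic choice.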
Imagine a teacher wants to find the “average” performance of a class on a difficult math quiz.
Data: 78, 85, 92, 60, 88
Calculation: \[ \frac{78 + 85 + 92 + 60 + 88}{5} = \frac{403}{5} = 80.6 \]
# Define the dataset
quiz_scores <- c(78, 85, 92, 60, 88)
# Calculate the mean
mean_score <- mean(quiz_scores)
print(paste("The average test score is:", mean_score))
## [1] "The average test score is: 80.6"
Warning: The mean is heavily influenced by outliers (extreme values). If one student scored a 0, the average would drop significantly, potentially misrepresenting the class performance.
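To see the size of that effect, the sketch below adds a sixth student who scored 0 to the quiz data and recomputes the mean. Treating the 0 as an extra student (rather than a replacement for an existing score) is an assumption, and the variable names are illustrative.

```r
# Original quiz scores from the example above
quiz_scores <- c(78, 85, 92, 60, 88)

# Assumed scenario: one additional student scores 0
scores_with_zero <- c(quiz_scores, 0)

# Compare the mean before and after adding the outlier
mean_before <- mean(quiz_scores)       # 403 / 5
mean_after  <- mean(scores_with_zero)  # 403 / 6

print(round(c(before = mean_before, after = mean_after), 2))
```

Under this assumption the mean falls from 80.6 to roughly 67.2, even though five of the six students scored 60 or higher.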
The median is the middle value of a data set that has been arranged in order of magnitude; when the number of observations is even, it is the average of the two middle values. The median is less affected by outliers and skewed data than the mean.
In a neighborhood, house prices vary wildly. Most houses cost around $200k, but there is one mansion worth $5 million.
Data (sorted): $200k, $205k, $210k, $220k, $5,000k
Median: $210k (the middle of the five sorted values).
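A short sketch of this example in R, with prices expressed in thousands of dollars. `median()` sorts the values internally, so the input order does not matter. The variable names are illustrative.

```r
# House prices in thousands of dollars (from the example above)
house_prices <- c(200, 210, 205, 220, 5000)

# The median is the middle value once the data are sorted
median_price <- median(house_prices)

# The mean is pulled upward by the single mansion
mean_price <- mean(house_prices)

print(paste("Median price (in $k):", median_price))
print(paste("Mean price (in $k):", mean_price))
```

With these prices the mean comes out at $1,167k, more than five times the median, and describes none of the houses in the neighborhood well.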
The mode is the most frequent value in a data set. On a histogram, it corresponds to the highest bar. The mode is the only measure of central tendency that can be used with categorical (nominal) data.
There is no complex algebraic formula for the mode; it is determined by frequency counts. \[ \text{Mode} = \text{the value } x \text{ for which } \text{Frequency}(x) \text{ is maximized} \]
A shoe store manager needs to restock inventory. Knowing the “average” shoe size (e.g., size 9.3) is useless because you cannot buy a size 9.3 shoe. The manager needs to know the mode: the size that sells the most.
Data: 7, 8, 8, 9, 9, 9, 10, 10, 11
Mode: 9 (It appears 3 times).
R does not have a built-in function for the statistical mode (the base `mode()` function returns an object's storage mode instead), so we usually create a frequency table.
# Shoe sizes sold today
shoe_sizes <- c(7, 8, 8, 9, 9, 9, 10, 10, 11)
# Create a frequency table
freq_table <- table(shoe_sizes)
# Find the size with the max frequency
mode_size <- names(freq_table)[which.max(freq_table)]
print(freq_table)
## shoe_sizes
## 7 8 9 10 11
## 1 2 3 2 1
print(paste("The modal shoe size is:", mode_size))
## [1] "The modal shoe size is: 9"
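`which.max()` returns only the first maximum, so the frequency-table approach reports a single value even when several values tie for the highest frequency. A minimal sketch of a helper that returns every mode follows; `all_modes` is a hypothetical name, not a base R function, and the eye-color data are made up for illustration.

```r
# Hypothetical helper: return every value that attains the maximum frequency
all_modes <- function(x) {
  freq <- table(x)
  # names() gives the values as character strings
  names(freq)[freq == max(freq)]
}

# Unimodal numeric data (shoe sizes from the example above)
shoe_sizes <- c(7, 8, 8, 9, 9, 9, 10, 10, 11)
print(all_modes(shoe_sizes))

# Made-up categorical data with a tie between two categories
eye_colors <- c("brown", "blue", "brown", "green", "blue")
print(all_modes(eye_colors))
```

Because the result is taken from the table's names, it is always a character vector; wrap it in `as.numeric()` if numeric modes are needed.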
| Measure | Best Used For | Pros | Cons |
|---|---|---|---|
| Mean | Continuous data, Symmetrical distributions | Uses every value in the data | Sensitive to outliers |
| Median | Skewed data (Income, Home prices) | Robust against outliers | Ignores the precise value of outliers |
| Mode | Categorical data (Eye color, Brand preference) | Works with non-numeric data | There may be no mode or multiple modes |
Let’s look at a skewed distribution (Salary Data) to see how the measures diverge.
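As a rough illustration, the sketch below computes all three measures for a small right-skewed data set. The salary figures are made up for this example (they are not from a real source), and the variable names are illustrative.

```r
# Made-up annual salaries in thousands of dollars; one person earns far more
salaries <- c(40, 42, 45, 45, 48, 50, 55, 250)

salary_mean <- mean(salaries)
salary_median <- median(salaries)

# Mode via a frequency table, as in the shoe-size example
salary_freq <- table(salaries)
salary_mode <- as.numeric(names(salary_freq)[which.max(salary_freq)])

print(c(mean = salary_mean, median = salary_median, mode = salary_mode))
```

With these values the mean (71.875) sits well above the median (46.5) and the mode (45), because the single large salary pulls it toward the long right tail.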
```
File -> New File -> R Markdown.