In statistics, a measure of central tendency is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or center of its distribution.
It is often referred to as a “summary statistic.” The three most common measures are the Mean, Median, and Mode.
The mean (or average) is the most popular and well-known measure of central tendency. It is the sum of all values divided by the total number of values.
For a sample of size \(n\), the sample mean \(\bar{x}\) is calculated as:
\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}\]
Where:

- \(\sum\): Sigma notation (summation)
- \(x_i\): Each individual value in the data set
- \(n\): The total number of observations
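As a quick check of the formula, the sketch below computes the mean by hand (sum divided by count) and compares it with R's built-in `mean()`. The vector `x` is hypothetical data chosen only for illustration.

```r
# Hypothetical data, not taken from the notes
x <- c(4, 8, 15, 16, 23, 42)

n <- length(x)              # total number of observations
manual_mean <- sum(x) / n   # sum of all values divided by n

print(manual_mean)                       # 18
print(all.equal(manual_mean, mean(x)))   # TRUE: matches the built-in mean()
```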
Scenario: A small tech startup has 5 employees with the following monthly salaries (in USD): $3,000, $3,500, $3,200, $4,000, and $10,000 (CEO).
salaries <- c(3000, 3500, 3200, 4000, 10000)
mean_salary <- mean(salaries)
print(paste("The mean salary is:", mean_salary))
## [1] "The mean salary is: 4740"
Pros & Cons:

- Pro: Uses every value in the dataset.
- Con: Highly sensitive to outliers (extreme values). In the example above, the $10,000 salary pulls the mean higher than what most employees actually earn.
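To see how strongly one extreme value moves the mean, the following sketch recomputes the startup example with and without the CEO's salary. The variable names `mean_all` and `mean_without_ceo` are introduced here for illustration only.

```r
salaries <- c(3000, 3500, 3200, 4000, 10000)

# Mean of all five salaries, including the CEO
mean_all <- mean(salaries)

# Mean after dropping the largest value (the CEO's salary)
mean_without_ceo <- mean(salaries[salaries != max(salaries)])

print(mean_all)                      # 4740
print(mean_without_ceo)              # 3425
print(mean_all - mean_without_ceo)   # 1315: the shift caused by a single value
```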
The median is the middle value in a data set that has been arranged in ascending or descending order.
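To find the median:

1. Arrange the data from smallest to largest.
2. Find the position of the median using: \[\text{Position} = \frac{n + 1}{2}\]
3. If \(n\) is odd, the median is the value at that position. If \(n\) is even, the position falls halfway between two values, and the median is the mean of those two middle values.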
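The procedure can be sketched by hand in R. This is a minimal illustration, not a replacement for the built-in `median()`; the function name `manual_median` and the sample vectors are hypothetical.

```r
# Hand-rolled median following the position rule (n + 1) / 2
manual_median <- function(x) {
  sorted <- sort(x)        # arrange from smallest to largest
  n <- length(sorted)
  pos <- (n + 1) / 2       # position of the median
  # For odd n, floor(pos) and ceiling(pos) are the same index.
  # For even n, they pick the two middle values, which are averaged.
  mean(sorted[c(floor(pos), ceiling(pos))])
}

print(manual_median(c(7, 1, 5)))      # odd n:  sorted 1 5 7, position 2, median 5
print(manual_median(c(7, 1, 5, 3)))   # even n: sorted 1 3 5 7, position 2.5, median 4
```

For both inputs the result agrees with `median()`.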
Scenario: A real estate agent lists the prices of 6 houses sold in a neighborhood: $250k, $270k, $260k, $280k, $240k, and $1M (a luxury mansion).
house_prices <- c(240, 250, 260, 270, 280, 1000) # in thousands
med_price <- median(house_prices)
print(paste("The median house price is:", med_price))
## [1] "The median house price is: 265"
Why use Median? In the house price example, the mean would be about $383.3k, which is higher than five of the six sale prices. The median ($265k) is a better representation of “typical” housing prices because it is resistant to outliers.
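One way to see this resistance is to compute both measures with and without the mansion. The sketch below reuses the house price data; the name `typical` is introduced here only for illustration.

```r
house_prices <- c(240, 250, 260, 270, 280, 1000)  # in thousands of USD

# Same data with the luxury mansion removed
typical <- house_prices[house_prices < 1000]

# With the mansion: mean is about 383.33, median is 265
print(c(mean = mean(house_prices), median = median(house_prices)))

# Without the mansion: mean is 260, median is 260
print(c(mean = mean(typical), median = median(typical)))
```

Removing one sale moves the mean by more than 120 (thousand USD), while the median moves by only 5.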
The mode is the value that appears most frequently in a data set. A distribution can be unimodal (one mode), bimodal (two modes), or multimodal.
Scenario: A shoe store records the sizes of sneakers sold in one hour: 8, 9, 9, 10, 11, 9, 8, 12.
R does not have a built-in function for the statistical mode, so we create one:
get_mode <- function(v) {
  uniqv <- unique(v)                           # distinct values in v
  uniqv[which.max(tabulate(match(v, uniqv)))]  # first value with the highest count
}
shoe_sizes <- c(8, 9, 9, 10, 11, 9, 8, 12)
result_mode <- get_mode(shoe_sizes)
print(paste("The mode of shoe sizes is:", result_mode))
## [1] "The mode of shoe sizes is: 9"
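Because `which.max()` returns only the first maximum, `get_mode()` reports a single value even when the data are bimodal or multimodal. A possible variant that returns every mode is sketched below; `get_all_modes` is a hypothetical helper, not part of base R, and the second vector is made-up data.

```r
# Hypothetical helper: return all values tied for the highest frequency
get_all_modes <- function(v) {
  counts <- table(v)                                       # frequency of each distinct value
  modes <- names(counts)[as.vector(counts) == max(counts)] # keep every value with the top count
  # table() stores values as character names; convert back for numeric input
  if (is.numeric(v)) as.numeric(modes) else modes
}

print(get_all_modes(c(8, 9, 9, 10, 11, 9, 8, 12)))  # unimodal: 9
print(get_all_modes(c(1, 1, 2, 2, 3)))              # bimodal:  1 2
```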
When to use Mode?

- Used primarily for categorical data (e.g., favorite color, most popular car brand).
- Useful for inventory management (knowing which item to restock most).
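For categorical data, a frequency table is often the simplest route to the mode. The sketch below uses invented survey responses (`favorite_color` is hypothetical data) and base R's `table()`.

```r
# Hypothetical survey responses, invented for illustration
favorite_color <- c("blue", "red", "blue", "green", "blue", "red")

color_counts <- table(favorite_color)   # counts per category
print(color_counts)

# Name of the category with the highest count (first one if there is a tie)
most_common <- names(color_counts)[which.max(color_counts)]
print(most_common)   # "blue"
```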
| Measure | Best for… | Sensitive to Outliers? |
|---|---|---|
| Mean | Symmetric data, Normal distribution | Yes |
| Median | Skewed data (Income, House prices) | No |
| Mode | Nominal / Categorical data | No |
In a perfectly symmetrical, unimodal distribution, Mean = Median = Mode.
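A small symmetric data set makes this easy to verify. The vector below is hypothetical and constructed so that its values mirror around a single peak at 3.

```r
# Hypothetical symmetric data with a single peak at 3
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

x_mean <- mean(x)                                   # 27 / 9
x_median <- median(x)                               # 5th of 9 sorted values
x_mode <- as.numeric(names(which.max(table(x))))    # most frequent value

print(c(mean = x_mean, median = x_median, mode = x_mode))  # all three equal 3
```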
85, 90, 78, 92, 100, 20. Which measure better represents the class performance?
iPhone, Samsung, iPhone, Google, Samsung, iPhone.
End of Module I Notes