In statistics, a measure of central tendency is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or center of its distribution.
It is often referred to as the “typical” value of the dataset. The three most common measures are:

1. Mean
2. Median
3. Mode
The mean is the most commonly used measure of central tendency. It is the “average” that we are all familiar with.
For a sample of size \(n\), the sample mean (\(\bar{x}\)) is calculated as:
\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]
Where:

- \(\sum\): Sigma notation (summation)
- \(x_i\): Each individual value in the dataset
- \(n\): The total number of observations
Scenario: A teacher wants to find the average score of 5 students in a mini-quiz. Scores: 85, 90, 70, 75, 95.
Calculation: \[\bar{x} = \frac{85 + 90 + 70 + 75 + 95}{5} = \frac{415}{5} = 83\]
```r
scores <- c(85, 90, 70, 75, 95)
mean_score <- mean(scores)
print(paste("The Mean Score is:", mean_score))
## [1] "The Mean Score is: 83"
```
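The formula can also be applied by hand, which makes the link between the summation notation and the built-in `mean()` explicit. This is a minimal sketch that reuses the quiz scores from the scenario; the names `n` and `manual_mean` are illustrative only.

```r
# Quiz scores from the scenario
scores <- c(85, 90, 70, 75, 95)

# n: the total number of observations
n <- length(scores)

# Sum of all x_i divided by n, mirroring the sample mean formula
manual_mean <- sum(scores) / n
print(manual_mean)
## [1] 83

# The hand calculation should agree with the built-in function
print(isTRUE(all.equal(manual_mean, mean(scores))))
## [1] TRUE
```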
Pros: Uses every value in the dataset. Cons: Highly sensitive to outliers (extreme values).
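To see the sensitivity to outliers in practice, the sketch below adds one extreme value to the quiz scores. The sixth score of 5 is a hypothetical value invented for illustration; it is not part of the original scenario.

```r
# Quiz scores from the scenario
scores <- c(85, 90, 70, 75, 95)

# Hypothetical: a sixth student scores 5 (an extreme low value)
scores_with_outlier <- c(scores, 5)

print(mean(scores))
## [1] 83
print(mean(scores_with_outlier))
## [1] 70
```

A single extreme score lowers the mean by 13 points, even though five of the six students scored 70 or higher.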
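The median is the middle value of a dataset when the observations are arranged in order (ascending or descending). When \(n\) is odd, it is the single middle observation; when \(n\) is even, it is the average of the two middle observations.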
Scenario: Monthly salaries of 6 employees in a small startup (in USD): $2000, $2500, $2200, $2800, $3000, $10000.
Ordered Data: 2000, 2200, 2500, 2800, 3000, 10000. Since \(n=6\) (even), we average the 3rd and 4th terms: \[\text{Median} = \frac{2500 + 2800}{2} = 2650\]
Note: The mean would be 3750, which is pulled upward by the single $10,000 salary and is higher than five of the six salaries. The median (2650) is a better representation of a “typical” salary here.
```r
salaries <- c(2000, 2500, 2200, 2800, 3000, 10000)
median_salary <- median(salaries)
print(paste("The Median Salary is:", median_salary))
## [1] "The Median Salary is: 2650"
```
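The ordering-and-middle-value procedure can be written out step by step. The function below is a minimal sketch and not a replacement for `median()`; `manual_median` is a hypothetical helper name, and it assumes a numeric vector with no missing values.

```r
# Hypothetical helper: computes the median by sorting and picking the middle
manual_median <- function(x) {
  sorted <- sort(x)
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Odd n: the single middle value
    sorted[(n + 1) / 2]
  } else {
    # Even n: average of the two middle values
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

# Even case: the six salaries from the scenario
salaries <- c(2000, 2500, 2200, 2800, 3000, 10000)
print(manual_median(salaries))
## [1] 2650

# Odd case: the five quiz scores from the mean example
print(manual_median(c(85, 90, 70, 75, 95)))
## [1] 85
```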
Pros: Robust to outliers. Cons: Does not use all values in the calculation.
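Robustness can be checked by comparing both measures with and without the extreme salary. In the sketch below, `salaries_no_outlier` is a hypothetical variant in which the $10,000 salary is replaced by an assumed $3,200; that value is invented for illustration.

```r
# Salaries from the scenario, including the $10,000 outlier
salaries <- c(2000, 2500, 2200, 2800, 3000, 10000)

# Hypothetical variant: the outlier replaced by an assumed $3,200
salaries_no_outlier <- c(2000, 2500, 2200, 2800, 3000, 3200)

print(mean(salaries))
## [1] 3750
print(round(mean(salaries_no_outlier), 2))
## [1] 2616.67

print(median(salaries))
## [1] 2650
print(median(salaries_no_outlier))
## [1] 2650
```

The mean shifts by more than 1,100 when the extreme value changes, while the median stays at 2650 because the two middle observations are unaffected.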
The mode is the value that appears most frequently in a dataset.
The mode has no arithmetic formula; it is identified by counting how often each value occurs: \[\text{Mode} = \text{Value with highest frequency}\]
Scenario: A shoe store records the sizes sold in one hour: 7, 8, 8, 9, 10, 8, 7, 11, 8.
Observation: The number 8 appears 4 times, which is more than any other number. Mode: 8.
R does not have a built-in function for the statistical mode (the `mode()` function in R returns the storage mode, i.e. the internal type, of an object). We can find it using a frequency table:
```r
shoe_sizes <- c(7, 8, 8, 9, 10, 8, 7, 11, 8)
# Create a frequency table
freq_table <- table(shoe_sizes)
# Find the value with the max frequency
# (which.max() returns only the first maximum, so ties are not reported;
# names() returns the value as a character string)
mode_val <- names(freq_table)[which.max(freq_table)]
print(paste("The Mode Shoe Size is:", mode_val))
## [1] "The Mode Shoe Size is: 8"
```
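Because `which.max()` reports only the first maximum, the table approach silently drops ties. The sketch below is one possible way to return every value that shares the highest frequency; `find_modes` is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: returns every value tied for the highest frequency
find_modes <- function(x) {
  freq_table <- table(x)
  modes <- names(freq_table)[freq_table == max(freq_table)]
  # table() stores the values as character names; convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

# Unimodal: the shoe sizes from the scenario
print(find_modes(c(7, 8, 8, 9, 10, 8, 7, 11, 8)))
## [1] 8

# Bimodal: 1 and 2 both appear twice (made-up data)
print(find_modes(c(1, 1, 2, 2, 3)))
## [1] 1 2
```

When every value occurs exactly once, this helper returns all of them; whether to report that case as "no mode" is left to the caller.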
Pros: Can be used for categorical data (e.g., “What is the most popular car color?”). Cons: Not guaranteed to be unique or even to exist; a dataset can have no mode, one mode (unimodal), or multiple modes (bimodal/multimodal).
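The same frequency-table approach works on categorical data, where a mean or median is not defined. A minimal sketch follows; the `car_colors` vector is made-up data for illustration.

```r
# Hypothetical data: colors of cars observed in a parking lot
car_colors <- c("white", "black", "white", "red", "blue", "white", "black")

# Count how often each color occurs
color_counts <- table(car_colors)

# Pick the color with the highest count
most_popular <- names(color_counts)[which.max(color_counts)]
print(paste("The most popular color is:", most_popular))
## [1] "The most popular color is: white"
```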
| Measure | Best for… | Sensitive to Outliers? |
|---|---|---|
| Mean | Symmetric data / Normal distribution | Yes |
| Median | Skewed data (like income or house prices) | No |
| Mode | Categorical data (nominal level) | No |
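In a skewed distribution, the mean is pulled toward the tail, so it typically lies above the median in a right-skewed distribution and below it in a left-skewed one.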
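The pull toward the tail can be illustrated with a small example. Both vectors below are hypothetical data chosen for illustration: `right_skewed` has one large value forming a right tail, and `left_skewed` is its mirror image.

```r
# Hypothetical right-skewed data: mostly small values, one long right tail
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)

# Mirror image of the same data, which moves the tail to the left
left_skewed <- (min(right_skewed) + max(right_skewed)) - right_skewed

# Right skew: the mean sits above the median
print(mean(right_skewed))
## [1] 4.75
print(median(right_skewed))
## [1] 3

# Left skew: the mean sits below the median
print(mean(left_skewed))
## [1] 16.25
print(median(left_skewed))
## [1] 18
```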
End of Module 1