1. Introduction

In statistics, a measure of central tendency is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or center of its distribution.

It is often referred to as the “typical” value of the dataset. The three most common measures are:

  1. Mean
  2. Median
  3. Mode


2. The Arithmetic Mean

The mean is the most commonly used measure of central tendency. It is the “average” that we are all familiar with.

Mathematical Formula

For a sample of size \(n\), the sample mean (\(\bar{x}\)) is calculated as:

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

Where:

  - \(\sum\): Sigma notation (summation)
  - \(x_i\): Each individual value in the dataset
  - \(n\): The total number of observations
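As a quick check of the formula, the sample mean can be computed directly as the sum of the values divided by their count. This is only a minimal sketch: the names `x`, `n`, and `manual_mean` are illustrative, and the values are the quiz scores from this section's worked example.

```r
# Quiz scores from the worked example in this section
x <- c(85, 90, 70, 75, 95)

n <- length(x)               # number of observations
manual_mean <- sum(x) / n    # sum of all x_i divided by n

# The manual result should agree with R's built-in mean()
all.equal(manual_mean, mean(x))
```

Writing the calculation out this way mirrors the formula term by term; in practice, `mean()` is the idiomatic choice.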

Real-Life Example

Scenario: A teacher wants to find the average score of 5 students in a mini-quiz. Scores: 85, 90, 70, 75, 95.

Calculation: \[\bar{x} = \frac{85 + 90 + 70 + 75 + 95}{5} = \frac{415}{5} = 83\]

R Implementation

scores <- c(85, 90, 70, 75, 95)
mean_score <- mean(scores)
print(paste("The Mean Score is:", mean_score))
## [1] "The Mean Score is: 83"

Pros: Uses every value in the dataset.

Cons: Highly sensitive to outliers (extreme values).


3. The Median

The median is the middle value of a dataset when the observations are arranged in order (ascending or descending).

Mathematical Formula

  1. Sort the data from smallest to largest.
  2. If \(n\) is odd, the median is the value at position: \[Median = \left( \frac{n + 1}{2} \right)^{th} \text{term}\]
  3. If \(n\) is even, the median is the average of the two middle terms: \[Median = \frac{(\frac{n}{2})^{th} \text{term} + (\frac{n}{2} + 1)^{th} \text{term}}{2}\]
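The three steps above can be sketched as a small R function. This assumes a numeric vector without missing values; `manual_median` is an illustrative helper name, not a base R function, and the second test vector is made up for the even case.

```r
# Median computed by following the sort-and-pick rule
# (manual_median is an illustrative helper, not a base R function)
manual_median <- function(x) {
  sorted <- sort(x)          # step 1: order from smallest to largest
  n <- length(sorted)
  if (n %% 2 == 1) {
    # odd n: take the ((n + 1) / 2)-th term
    sorted[(n + 1) / 2]
  } else {
    # even n: average the (n / 2)-th and (n / 2 + 1)-th terms
    (sorted[n / 2] + sorted[n / 2 + 1]) / 2
  }
}

manual_median(c(85, 90, 70, 75, 95))   # odd n
manual_median(c(4, 1, 3, 2))           # even n (made-up values)
```

Both calls should match what the built-in `median()` returns for the same input.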

Real-Life Example

Scenario: Monthly salaries of 6 employees in a small startup (in USD): $2000, $2500, $2200, $2800, $3000, $10000.

Ordered Data: 2000, 2200, 2500, 2800, 3000, 10000. Since \(n=6\) (even), we average the 3rd and 4th terms: \[Median = \frac{2500 + 2800}{2} = 2650\]

Note: The mean would be 3750 (22500 / 6), which is pulled upward by the single $10,000 salary; the other five employees all earn less than that. The median (2650) is a better representation of a “typical” salary here.

R Implementation

salaries <- c(2000, 2500, 2200, 2800, 3000, 10000)
median_salary <- median(salaries)
print(paste("The Median Salary is:", median_salary))
## [1] "The Median Salary is: 2650"

Pros: Robust to outliers.

Cons: Depends only on the middle one or two values after sorting, so it ignores the magnitudes of all the other values.
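To illustrate the robustness claim, the sketch below makes the largest salary ten times bigger and compares both measures before and after. The value 100000 is hypothetical and chosen only for illustration; `salaries_extreme` is an assumed name.

```r
salaries <- c(2000, 2500, 2200, 2800, 3000, 10000)

# Hypothetical variant: the top salary is ten times larger
salaries_extreme <- replace(salaries, salaries == 10000, 100000)

mean(salaries)            # 3750
mean(salaries_extreme)    # 18750
median(salaries)          # 2650
median(salaries_extreme)  # 2650
```

The mean moves with the extreme value, while the median stays put because the two middle terms are unchanged.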


4. The Mode

The mode is the value that appears most frequently in a dataset.

Mathematical Formula

The mode is simply identified by frequency counting: \[Mode = \text{Value with highest frequency}\]

Real-Life Example

Scenario: A shoe store records the sizes sold in one hour: 7, 8, 8, 9, 10, 8, 7, 11, 8.

Observation: The number 8 appears 4 times, which is more than any other number. Mode: 8.

R Implementation

R does not have a built-in function for the statistical mode; the mode() function in R returns an object's storage mode (for example, "numeric"), not its most frequent value. We can find the statistical mode using a frequency table:

shoe_sizes <- c(7, 8, 8, 9, 10, 8, 7, 11, 8)

# Create a frequency table
freq_table <- table(shoe_sizes)
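# Find the value with the max frequency. which.max() returns only the first
# maximum, so ties are not reported; names() gives the value as a character string.
mode_val <- names(freq_table)[which.max(freq_table)]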

print(paste("The Mode Shoe Size is:", mode_val))
## [1] "The Mode Shoe Size is: 8"

Pros: Can be used for categorical data (e.g., “What is the most popular car color?”).

Cons: Not always unique or informative. A dataset can have no mode, one mode (unimodal), or multiple modes (bimodal/multimodal).
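Because a dataset can have more than one mode, a helper that reports every tied value may be more useful than one that reports only the first. The following is a minimal sketch: `all_modes` is an illustrative name, not a base R function, and the color vector is made-up data.

```r
# all_modes() is an illustrative helper, not a base R function.
# It returns every value tied for the highest frequency, as character strings.
all_modes <- function(x) {
  freq <- table(x)
  names(freq)[freq == max(freq)]
}

all_modes(c(7, 8, 8, 9, 10, 8, 7, 11, 8))             # one mode
all_modes(c("red", "blue", "red", "blue", "white"))   # two modes (made-up colors)
```

Note that when every value occurs exactly once, this sketch returns all of them; whether to treat that case as "no mode" is a choice left to the caller.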


5. Summary: Which one to use?

| Measure | Best for | Sensitive to outliers? |
|---------|----------|------------------------|
| Mean | Symmetric data / Normal distribution | Yes |
| Median | Skewed data (like income or house prices) | No |
| Mode | Categorical data (nominal level) | No |

Visualizing the Relationship

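In a skewed distribution, the mean is pulled toward the tail more strongly than the median. In a right-skewed distribution (a long tail of large values), the mean typically lies above the median; in a left-skewed distribution, it typically lies below it.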
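The direction of the pull can be checked numerically by looking at the sign of the mean minus the median. The sketch below reuses the salary data from the median section as a right-skewed sample; the left-skewed sample is a made-up mirror image of it (each value subtracted from 12000), and both variable names are assumptions.

```r
# Right-skewed: the salary data from the median example (long tail to the right)
right_skewed <- c(2000, 2200, 2500, 2800, 3000, 10000)

# Left-skewed: a made-up mirror image of the same data (long tail to the left)
left_skewed <- 12000 - right_skewed

mean(right_skewed) - median(right_skewed)  # positive: mean sits above the median
mean(left_skewed) - median(left_skewed)    # negative: mean sits below the median
```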


End of Module 1